213 Commits (4733631572ebf04c797bbbb3b03f7b7f97219b70)

Author SHA1 Message Date
Prathamesh Musale 4733631572 feat(k8s): namespace ownership check to prevent silent cross-deployment override
Two deployments whose stack_name derives the same namespace (e.g. two
deployments of the test stack, or any spec without an explicit
`namespace:` override) silently patch each other's Deployment,
ConfigMaps, Services, and PVCs when they share a cluster — last
`deployment start` wins. No error today; operator sees only "Updated
Deployment ... (rolling update)" and can't tell what happened.

Stamp the namespace with a `laconic.com/deployment-dir` annotation on
first creation. On subsequent `deployment start`:

- Annotation missing (legacy / user-created namespace): adopt by
  stamping, so the NEXT conflicting deployment fails loudly.
- Annotation matches this deployment's dir: proceed.
- Annotation points to a different deployment dir: raise
  DeployerException with both dirs and the exact `namespace:` spec
  override to fix it.

Low migration risk: the woodburn pattern (multiple stacks, each with
its own stack_name-derived namespace) continues to work — those
namespaces don't collide by construction. Only same-stack+same-cluster
deployments are affected, which never worked correctly.
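
The three annotation cases above can be sketched as a pure decision function. This is a minimal illustration, not the code from this commit: `check_namespace_ownership` and the local `DeployerException` class are hypothetical stand-ins, and the namespace annotations are modeled as a plain dict rather than fetched from the cluster.

```python
from typing import Dict, Optional

ANNOTATION_KEY = "laconic.com/deployment-dir"


class DeployerException(Exception):
    """Stand-in for the project's exception type (assumption)."""


def check_namespace_ownership(
    annotations: Optional[Dict[str, str]],
    deployment_dir: str,
    namespace: str,
) -> Dict[str, str]:
    """Return the annotations the namespace should carry after this start.

    Raises DeployerException when a different deployment dir owns it.
    """
    current = dict(annotations or {})
    owner = current.get(ANNOTATION_KEY)
    if owner is None:
        # Legacy or user-created namespace: adopt it by stamping.
        current[ANNOTATION_KEY] = deployment_dir
        return current
    if owner == deployment_dir:
        # Already ours: proceed unchanged.
        return current
    raise DeployerException(
        f"Namespace '{namespace}' is owned by deployment dir '{owner}', "
        f"not '{deployment_dir}'. Set a distinct `namespace:` in the spec."
    )
```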

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:30:53 +00:00
Prathamesh Musale cb84388d00 feat(k8s): auto-ConfigMap for file-level host-path compose volumes
File-level host-path compose volumes (e.g. `../config/foo.sh:/opt/foo.sh`)
were synthesized into a kind extraMount + hostPath PV chain with a
sanitized containerPath (`/mnt/host-path-<sanitized>`). The sanitized
name is derived from the compose volume source and is identical across
deployments of the same stack, so two deployments sharing a cluster
collided at the containerPath: kind only honors the first deployment's
bind, and subsequent deployments' pods silently read the first's content.
The same code path was also broken on real k8s, which has no way to
populate `/mnt/host-path-*` on worker nodes.

File-level compose binds are conceptually k8s ConfigMaps. The snowball
stack already uses the ConfigMap-backed named-volume pattern by hand.
Make that automatic at the k8s object-generation layer, without
touching deployment-dir compose or spec files.

Behavior at deploy create (validation only, no file mutation):
- :rw on a host-path bind        -> DeployerException (use a named
                                     volume for writable data)
- Directory with subdirectories  -> DeployerException (embed in image,
                                     split into configmaps, or use
                                     initContainer)
- Directory or file > ~700 KiB   -> DeployerException (ConfigMap budget)
- File, or flat small directory  -> accepted, handled at deploy start
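
A rough sketch of the create-time rules listed above, assuming the bind source is already resolved to a local path. `validate_host_path_mount` is a hypothetical name (the commit's helper is `_validate_host_path_mounts()`, whose signature is not shown here), and the size check simply sums raw file sizes against the 700 KiB figure.

```python
from pathlib import Path

# Approximate ConfigMap budget named in the commit message.
CONFIGMAP_BUDGET_BYTES = 700 * 1024


class DeployerException(Exception):
    """Stand-in for the project's exception type (assumption)."""


def validate_host_path_mount(source: Path, mode: str = "ro") -> None:
    """Reject host-path binds that cannot be represented as a ConfigMap."""
    if mode == "rw":
        raise DeployerException(
            f"{source}: writable host-path bind; use a named volume"
        )
    if source.is_dir():
        entries = list(source.iterdir())
        if any(entry.is_dir() for entry in entries):
            raise DeployerException(
                f"{source}: directory has subdirectories; embed in the image, "
                "split into configmaps, or use an initContainer"
            )
        total = sum(entry.stat().st_size for entry in entries)
    else:
        total = source.stat().st_size
    if total > CONFIGMAP_BUDGET_BYTES:
        raise DeployerException(
            f"{source}: {total} bytes exceeds the ConfigMap budget"
        )
```

A single file or a flat, small directory passes silently, which matches the "accepted, handled at deploy start" row.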

Behavior at deploy start:
- cluster_info.get_configmaps() additionally walks pod + job compose
  volumes and emits a V1ConfigMap per host-path bind (deduped by
  sanitized name across all pods/services). Content read from
  {deployment_dir}/config/<pod>/<file> (already populated by
  _copy_extra_config_dirs).
- volumes_for_pod_files emits V1ConfigMapVolumeSource instead of
  V1HostPathVolumeSource for host-path binds.
- volume_mounts_for_service stats the source and sets V1VolumeMount
  sub_path to the filename when source is a regular file — single-key
  ConfigMaps land as files, whole-dir ConfigMaps land as directories.
- _generate_kind_mounts no longer emits `/mnt/host-path-*` extraMounts
  for these binds (the ConfigMap path bypasses kind node FS entirely).

Deployment dir layout is unchanged. Compose files, spec.yml, and
{deployment_dir}/config/<pod>/ remain exactly as today — trivially
diffable against stack source, no synthetic volume names. ConfigMaps
are visible only in k8s (kubectl get cm -n <ns>).

The existing `/mnt/host-path-*` skip in check_mounts_compatible is
retained as a transition tolerance for deployments created before
this change.

Updates:
- deployment_create: _validate_host_path_mounts() called per pod/job
  in the create loops; 700 KiB ConfigMap budget (accounts for base64
  + metadata overhead)
- helpers: _generate_kind_mounts skips host-path entries;
  volumes_for_pod_files emits ConfigMap-backed V1Volume;
  volume_mounts_for_service takes optional deployment_dir and
  auto-sets sub_path for single-file sources
- cluster_info: new _host_path_bind_configmaps() walked from
  get_configmaps(); volume_mounts_for_service call passes
  deployment_dir from spec.file_path
- docs: document the behavior and the rejected shapes in
  deployment_patterns.md
- tests: k8s-deploy asserts the host-path ConfigMaps exist,
  compose/spec unchanged, and no `/mnt/host-path-*` extraMounts

Refs: so-b86

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:13:43 +00:00
Prathamesh Musale 1d019f9c4b fix(k8s): exclude per-deployment file-level host-path binds from mount check
Compose volumes like './config/x.sh' are emitted per-deployment with
containerPath '/mnt/host-path-<sanitized>' and source paths scoped to
each deployment's own directory. Two deployments of the same stack will
always clash at those containerPaths regardless of kind-mount-root —
this is a pre-existing SO aliasing behavior for file-level binds,
orthogonal to umbrella compatibility.

Let the mount-compatibility check skip '/mnt/host-path-*' entries so
the positive case (shared umbrella across deployments) doesn't false-
positive. The check still covers the /mnt umbrella itself and named-
volume data mounts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 11:29:57 +00:00
Prathamesh Musale f1250a3da1 fix(k8s): tailor mount-mismatch error to cluster's umbrella state
The original error always prescribed cluster recreate. When the running
cluster already has an umbrella at /mnt, that's misleading — the right
fix is to align the new deployment to the existing umbrella (set
kind-mount-root to the cluster's umbrella source and move host paths
under it). Recreate is only correct when no umbrella exists.

Branch the error message on whether the cluster has a /mnt bind. With
umbrella: show its host source and tell the user to set kind-mount-root
to that value. Without: keep the recreate guidance.
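
The branch described above might look roughly like this. It is a sketch under assumptions: `mount_mismatch_guidance` is a hypothetical name, the message wording is illustrative, and the live mounts are modeled as the `Source`/`Destination` entries that `docker inspect` reports under `Mounts`.

```python
from typing import Dict, List


def mount_mismatch_guidance(live_mounts: List[Dict[str, str]]) -> str:
    """Pick remediation text based on whether the cluster has a /mnt bind."""
    umbrella = next(
        (m["Source"] for m in live_mounts if m.get("Destination") == "/mnt"),
        None,
    )
    if umbrella is not None:
        # Umbrella exists: align the new deployment to it.
        return (
            f"Cluster already has an umbrella mount at /mnt from '{umbrella}'. "
            f"Set kind-mount-root to '{umbrella}' and move host paths under it."
        )
    # No umbrella: the only fix is to recreate the cluster.
    return (
        "Cluster has no /mnt umbrella mount. Recreate the cluster with "
        "kind-mount-root set."
    )
```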

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 10:37:49 +00:00
Prathamesh Musale 1e274610d6 fix(k8s): run mount compatibility check on skip-cluster-management path too
The mount-compatibility check lived inside create_cluster(), which only
runs under --perform-cluster-management. Under the (default)
--skip-cluster-management path the check was skipped — a deployment
joining an existing cluster with an incompatible kind-config would
proceed and silently fall through to the node's overlay FS, which is
exactly the failure mode the check was designed to catch.

Rename _check_mounts_compatible → check_mounts_compatible (now public)
and call it from both paths in _setup_cluster().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 10:08:31 +00:00
Prathamesh Musale 6ccbb4713b fix(k8s): graceful error when cluster missing under --skip-cluster-management
--skip-cluster-management is the default, so `deployment start` without
an existing cluster lands straight in connect_api() which raises a
cryptic kubernetes.config.ConfigException about a missing kube context.

Preflight in _setup_cluster() on the skip-cluster-management kind path:
- If no kind cluster is running, raise DeployerException pointing at
  --perform-cluster-management.
- If a different kind cluster is running, raise DeployerException
  showing both names and the two ways to reconcile (edit deployment.yml
  or recreate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 10:05:03 +00:00
Prathamesh Musale 782c71ae36 feat(k8s): enforce kind extraMount compatibility on cluster reuse
Kind applies extraMounts only at cluster creation. When a deployment joins
an existing shared cluster, any extraMount its kind-config declares that
isn't already active on the running control-plane is silently ignored —
PVs backed by those mounts fall through to the node's overlay filesystem
and lose data on cluster destroy.

Validate this up front in create_cluster():
- On cluster reuse, compare the new deployment's extraMounts against the
  live bind mounts on the control-plane container (via docker inspect).
  Fail with a DeployerException listing every mismatched mount and
  pointing at docs/deployment_patterns.md.
- On first-time cluster creation without a /mnt umbrella mount
  (kind-mount-root unset), print a warning that future stacks may
  require a full recreate to add new host-path mounts.
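
The reuse-time comparison can be illustrated as follows. This is a sketch, not the commit's implementation: `find_mismatched_mounts` is a hypothetical name, the desired mounts use the `hostPath`/`containerPath` keys of a kind config, and the live mounts are assumed to be the already-parsed `Mounts` list from `docker inspect` on the control-plane container.

```python
from typing import Dict, List, Optional, Tuple

# (containerPath, wanted hostPath, live source or None if not mounted)
Mismatch = Tuple[str, str, Optional[str]]


def find_mismatched_mounts(
    extra_mounts: List[Dict[str, str]],
    live_mounts: List[Dict[str, str]],
) -> List[Mismatch]:
    """List every declared extraMount that the running node does not honor."""
    live_by_dest = {m["Destination"]: m["Source"] for m in live_mounts}
    mismatches: List[Mismatch] = []
    for mount in extra_mounts:
        dest = mount["containerPath"]
        wanted = mount["hostPath"]
        actual = live_by_dest.get(dest)
        if actual != wanted:
            mismatches.append((dest, wanted, actual))
    return mismatches
```

An empty result means the new deployment can join the cluster; anything else would be reported in one error listing every mismatch.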

Document the umbrella-mount convention (kind-mount-root) and the
migration path for existing clusters in docs/deployment_patterns.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:30:12 +00:00
prathamesh0 7f4b058066 so-o2o: kubectl-level Caddy cert backup/restore (#746)
Replaces the etcd-surgery persistence approach with a CronJob that dumps `manager=caddy` Secrets to `{kind-mount-root}/caddy-cert-backup/` every 5 min, and a restore step that applies the file before Caddy starts on a fresh cluster. Closes so-o2o.

Deletes `_clean_etcd_keeping_certs` and the etcd+PKI extraMounts. No new spec keys - activates when `kind-mount-root` is set.
2026-04-17 15:36:40 +05:30
prathamesh0 1334900407 so-o2o: detect etcd image dynamically + diagnose whitelist cleanup bugs (#745)
Replaces the hardcoded `gcr.io/etcd-development/etcd:v3.5.9` in `_clean_etcd_keeping_certs` with a dynamic ref captured from the running Kind node via `crictl`, persisted to `{backup_dir}/etcd-image.txt` and reused on subsequent cleanup runs. Self-adapts to Kind upgrades, no version table to maintain.

Testing on Kind v0.32 / etcd 3.6 surfaced two additional bugs in the whitelist cleanup that this PR does **not** fix (see so-o2o comments):
(a) the restore step pipes raw protobuf values through bash `echo`, corrupting binary bytes;
(b) the whitelist omits cluster-admin RBAC, SAs, and bootstrap tokens needed by kubeadm's pre-addon health check.

Merging this narrow fix + diagnosis trail; follow-up branch will replace the etcd-surgery approach with a kubectl-level Caddy secret backup/restore.
2026-04-17 13:48:30 +05:30
prathamesh0 cf8b7533fe so-ad7: build per-pod Service for maintenance container (#744)
- Maintenance-page swap during `restart` was broken: the Ingress got patched to point at `{app_name}-{pod_name}-service` for the maintenance pod, but that Service was never created. Caddy had no valid backend, so users saw "site cannot be reached" instead of the maintenance page
- Root cause: `get_services()` only builds per-pod Services for pods referenced by `http-proxy` routes; the maintenance pod has no http-proxy route by design
- Fix: `get_services()` now also includes the container named by `maintenance-service:` in the container-ports map, so its per-pod `Service` gets built and sits idle until the swap window
- Also files `so-b9a` (P4) noting the latent fragility in the resolver/builder contract
2026-04-16 15:07:25 +05:30
prathamesh0 fc5dc80058 so-l2l: in-place stop/restart via label-scoped cleanup (#743)
- `down()` scopes cleanup to a single stack via `app.kubernetes.io/stack` and keeps the namespace `Active` by default
- New `stop/down --delete-namespace` flag for opt-in full teardown
- `down()` is synchronous - waits until resources are actually gone before returning. Callers can drop their own wait loops
- `up()` skip-if-exists for Jobs completes the create-or-replace coverage (other kinds already had it)
- Orphan PVs from a prior `stop --delete-namespace` get cleaned on the next `stop --delete-volumes`
- Every k8s resource SO creates now carries `app.kubernetes.io/stack` via a new `ClusterInfo._stack_labels()` helper
- Closes so-l2l, so-076.2. Also includes pebble audit: closes so-c71, so-b2b, so-k1k; files so-328
2026-04-16 12:10:04 +05:30
prathamesh0 f40913d187 Fix Kind port mappings and configmap source path resolution (#742)
- Only map host ports for services with network_mode: host (80/443 for Caddy always mapped). Previously all compose service ports were mapped unconditionally, causing conflicts with local services like postgres and redis
- Use spec configmap values as source paths instead of ignoring them. Fixes configmaps with user-defined paths (e.g. `stack-orchestrator/compose/maintenance`) and home-relative paths (e.g. `~/.credentials/local-certs/s3`)
- Read configmap files from deployment dir (`configmaps/{name}/`) when building k8s ConfigMap objects, not from the spec's source path which doesn't exist in the deployment dir
- File pebbles: `so-c71` (resolved), `so-078`: self-sufficient deployments (hooks should be copied to deployment dir)
2026-04-14 17:33:47 +05:30
prathamesh0 17b614cb4d Fix configmap source path resolution for user-defined spec paths (#741)
2026-04-14 11:30:27 +05:30
prathamesh0 0bf1ea70d5 Add ip mode to external-services for static IP endpoints (#740)
ExternalName services only support DNS names (CNAME records), not
raw IP addresses. Add an ip mode that creates a headless Service +
Endpoints with a static IP, enabling routing to host-network
services like Kind gateway IPs or bare-metal endpoints.

Spec format:
  external-services:
    my-service:
      ip: 172.18.0.1
      port: 8899
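
For illustration, the pair of objects that ip mode produces could be built like this. The manifests are plain dicts using standard Kubernetes fields (`clusterIP: None` for a headless Service, `subsets` on Endpoints); the function name `build_ip_external_service` and the choice to return dicts instead of client model objects are assumptions made for the sketch.

```python
from typing import Dict, Tuple


def build_ip_external_service(
    name: str, ip: str, port: int, namespace: str
) -> Tuple[Dict, Dict]:
    """Return (Service, Endpoints) manifests routing `name` to a static IP."""
    metadata = {"name": name, "namespace": namespace}
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": dict(metadata),
        # Headless and selector-less: k8s will not manage Endpoints for it.
        "spec": {
            "clusterIP": "None",
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # Endpoints are associated with the Service by sharing its name.
        "metadata": dict(metadata),
        "subsets": [{"addresses": [{"ip": ip}], "ports": [{"port": port}]}],
    }
    return service, endpoints
```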

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 17:53:23 +05:30
A. F. Dudley 3da23683f6 fix: black formatting, line length, pyright type narrowing
- Apply black reformatting to deployer.py, cluster_info.py, deploy_k8s.py
- Shorten docstrings exceeding 88 char line limit
- Add assert for pyright Optional type narrowing on tls list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:22:25 +00:00
A. F. Dudley 63325f68a7 fix: deduplicate container ports by (port, protocol)
Compose files with both "8001" (TCP) and "8001/udp" produce separate
V1ContainerPort entries that k8s rejects as duplicates. Deduplicate
after parsing by (container_port, protocol) key.

This was blocking biscayne's agave deployment — the spec has both
TCP 8001 (ip_echo) and UDP 8001 (gossip), which generated two UDP
8001 entries.
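
A minimal sketch of deduplication keyed on (port, protocol). It assumes the simplest compose port form, `PORT` or `PORT/PROTO`, with no host part; `parse_port` and `dedupe_ports` are hypothetical names, and the real code works on V1ContainerPort objects rather than tuples.

```python
from typing import Iterable, List, Tuple


def parse_port(entry: str) -> Tuple[int, str]:
    """Split '8001' or '8001/udp' into (port, PROTOCOL); default is TCP."""
    port, _, proto = entry.partition("/")
    return int(port), (proto or "tcp").upper()


def dedupe_ports(entries: Iterable[str]) -> List[Tuple[int, str]]:
    """Keep the first occurrence of each (port, protocol), preserving order."""
    seen = set()
    result: List[Tuple[int, str]] = []
    for entry in entries:
        key = parse_port(entry)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

TCP 8001 and UDP 8001 survive as two distinct entries, while a repeated `8001/udp` collapses to one.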

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:18:10 +00:00
A. F. Dudley 87761c7041 fix: imagePullPolicy for kind, job images, duplicate registry call, test namespace
- deploy_k8s.py: default imagePullPolicy to IfNotPresent for kind
  (local images loaded via kind load, not pulled from registry)
- cluster_info.py: add job images to image_set so they're loaded into kind
- deploy_k8s.py: remove duplicate create_registry_secret call (merge artifact)
- deploy_k8s.py: fix indentation in run_job job_pull_policy (replace_all damage)
- tests/k8s-deploy: update namespace from laconic-{id} to laconic-{stack_name}
  to match the new stack-derived namespace scheme from wd-a7b

All 15 k8s deploy e2e tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 23:34:51 +00:00
A. F. Dudley 5bf96112d3 fix lint errors from merge: duplicate def, shadowed import, empty f-string
- deployment_create.py: remove duplicate create_registry_secret signature
- deploy_k8s.py: rename loop var 'config' to 'svc_config' (shadowed import)
- deploy_k8s.py: remove f-prefix from string without placeholders
- deploy_k8s.py: suppress pre-existing C901 on _create_volume_data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 19:17:45 +00:00
A. F. Dudley 549ac8c01d Merge fix/kind-mount-propagation: all local branches unified
Merges 6 local branches into main:
- enya: HostToContainer mount propagation for kind-mount-root
- fix/k8s-port-mappings-v5: port protocol parsing, namespace fix
- peirce: idempotent deploy (create-or-replace), update-envs rename
- prince: etcd cleanup whitelist
- wd-a7b: timestamp cluster IDs, stack-derived namespaces, jobs,
  multi-cert ingress, user secrets, _build_containers refactor
- fix/kind-mount-propagation: deployment prepare command, pebbles

Conflicts resolved keeping main's evolved multi-pod architecture
(get_deployments, per-pod Services, CA cert injection) while
incorporating branch additions (HostToContainer propagation,
user secrets, jobs support).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:26:05 +00:00
A. F. Dudley d50bd2b6d2 Merge wd-a7b: cluster-id/namespace naming, jobs, multi-cert, secrets
Combines timestamp-based cluster IDs, namespace derived from stack name,
_build_containers refactor, jobs support, multi-ingress certificates,
user-declared secrets, and label-based resource cleanup with the existing
idempotent deploy, mount propagation, and port mapping fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:22:07 +00:00
A. F. Dudley 2307696a66 Merge fix/k8s-port-mappings-v5 into fix/kind-mount-propagation
Resolve conflicts keeping HostToContainer propagation on mount root
entry and per-container resource layering from the propagation branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 17:06:25 +00:00
A. F. Dudley c64820ad5c Merge branch 'enya-ac868cc4-kind-mount-propagation-fix' into fix/kind-mount-propagation
2026-04-01 17:05:06 +00:00
A. F. Dudley 3e3f349151 Merge remote-tracking branch 'cerc-io.github.com/main'
# Conflicts:
#	stack_orchestrator/deploy/deployment_create.py
#	stack_orchestrator/deploy/k8s/deploy_k8s.py
2026-04-01 14:47:46 +00:00
Prathamesh Musale 33d3474d7d Fix registry secret created in wrong namespace (#998)
`create_registry_secret()` was hardcoded to use the "default" namespace,
but pods are deployed to the spec's configured namespace. The secret
must be in the same namespace as the pods for `imagePullSecrets` to work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/998
Co-authored-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Co-committed-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
2026-03-26 08:36:39 +00:00
Snake Game Developer 90e32ffd60 Support image-overrides in spec for testing
Spec can override container images:
  image-overrides:
    dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag

Merged with CLI overrides (CLI wins). Enables testing with
GHCR-pushed test tags without modifying compose files.
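
The precedence rule can be sketched as a dict merge. This assumes CLI overrides arrive as `NAME=IMAGE` strings (the form used by `deployment restart --image`); `merge_image_overrides` is a hypothetical helper, not a function from the codebase.

```python
from typing import Dict, Iterable, Optional


def merge_image_overrides(
    spec_overrides: Optional[Dict[str, str]],
    cli_overrides: Optional[Iterable[str]],
) -> Dict[str, str]:
    """Merge spec-level and CLI-level image overrides; CLI entries win."""
    merged = dict(spec_overrides or {})
    for item in cli_overrides or []:
        name, sep, image = item.partition("=")
        if not sep or not name or not image:
            raise ValueError(f"expected NAME=IMAGE, got {item!r}")
        merged[name] = image
    return merged
```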

Also reverts the image-pull-policy spec key (not needed —
the fix is to use proper GHCR tags, not IfNotPresent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:02:23 +00:00
Snake Game Developer 1052a1d4e7 Support image-pull-policy in spec (default: Always)
Testing specs can set image-pull-policy: IfNotPresent so kind-loaded
local images are used instead of pulling from the registry. Production
specs omit the key and get the default Always behavior.

Root cause: with Always, k8s pulled the GHCR kubo image (with baked
R2 endpoint) instead of the locally-built image (with https://s3:443),
causing kubo to connect to R2 directly and get Unauthorized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 20:17:06 +00:00
Snake Game Developer f93541f7db Fix CA cert mounting: subPath for Go, expanduser for configmaps
- CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509
  picks them up (directory mount replaces the entire dir)
- get_configmaps() now expands ~ in paths via os.path.expanduser()
- Both changes discovered during testing with mkcert + MinIO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 19:27:14 +00:00
Snake Game Developer 713a81c245 Add external-services and ca-certificates spec keys
New spec.yml features for routing external service dependencies:

external-services:
  s3:
    host: example.com  # ExternalName Service (production)
    port: 443
  s3:
    selector: {app: mock}  # headless Service + Endpoints (testing)
    namespace: mock-ns
    port: 443

ca-certificates:
  - ~/.local/share/mkcert/rootCA.pem  # testing only

laconic-so creates the appropriate k8s Service type per mode:
- host mode: ExternalName (DNS CNAME to external provider)
- selector mode: headless Service + Endpoints with pod IPs
  discovered from the target namespace at deploy time

ca-certificates mounts CA files into all containers at
/etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun.

Also includes the previously committed PV Released state fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:25:47 +00:00
Snake Game Developer 98ff221a21 Fix PV rebinding after deployment stop/start cycle
deployment stop deletes the namespace (and PVCs) but preserves PVs
by default. On the next deployment start, PVs are in Released state
with a stale claimRef pointing at the deleted PVC. New PVCs cannot
bind to Released PVs, so pods get stuck in Pending.

Clear the claimRef on any Released PV during _create_volume_data()
so the PV returns to Available and can accept new PVC bindings.
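
An in-memory sketch of that step, with PVs modeled as dicts shaped like their manifests. `clear_stale_claim_refs` is a hypothetical name; against a real cluster the equivalent would be a patch that sets `spec.claimRef` to null on each Released PV.

```python
from typing import Dict, List


def clear_stale_claim_refs(pvs: List[Dict]) -> List[str]:
    """Null out claimRef on Released PVs and return the names touched."""
    cleared: List[str] = []
    for pv in pvs:
        released = pv.get("status", {}).get("phase") == "Released"
        if released and pv.get("spec", {}).get("claimRef") is not None:
            # Dropping the stale reference lets a new PVC bind to this PV.
            pv["spec"]["claimRef"] = None
            cleared.append(pv["metadata"]["name"])
    return cleared
```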

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 07:47:23 +00:00
A. F. Dudley 8d03083d0d feat: add kind-mount-root for unified Kind extraMount
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.

Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.
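
The path mapping can be sketched with `pathlib`. `resolve_under_mount_root` is a hypothetical name; it returns the in-node path for volumes under the root and `None` for volumes outside it, since the per-volume extraMount fallback is not modeled here.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_under_mount_root(
    host_path: str, mount_root: Optional[str]
) -> Optional[str]:
    """Map a host path to /mnt/{relative_path} when it lies under the root.

    Returns None when kind-mount-root is unset or the path is outside it,
    meaning the volume still needs its own extraMount.
    """
    if mount_root is None:
        return None
    try:
        relative = PurePosixPath(host_path).relative_to(PurePosixPath(mount_root))
    except ValueError:
        # Not under the root.
        return None
    return str(PurePosixPath("/mnt") / relative)
```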

Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 21:28:40 +00:00
A. F. Dudley 9109cfb7a1 feat: add token-file option for image-pull-secret registry auth
Adds token-file key to image-pull-secret spec config. Reads the
registry token from a file on disk instead of requiring an environment
variable. File path supports ~ expansion. Falls back to token-env
if token-file is not set or file doesn't exist.

This lets operators store the GHCR token in ~/.credentials/ alongside
other secrets, removing the need for ansible to pass REGISTRY_TOKEN
as an env var.
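
The lookup order could be sketched as below. `resolve_registry_token` is a hypothetical helper; stripping trailing whitespace from the file contents is an assumption, as is passing the environment in as a mapping for testability.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_registry_token(
    config: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer token-file when it exists on disk; otherwise use token-env."""
    env = os.environ if env is None else env
    token_file = config.get("token-file")
    if token_file:
        path = Path(token_file).expanduser()  # supports ~ in the spec
        if path.is_file():
            # Assumption: a trailing newline in the file is not part of the token.
            return path.read_text().strip()
    token_env = config.get("token-env")
    if token_env:
        return env.get(token_env)
    return None
```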

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 19:30:44 +00:00
A. F. Dudley 61afeb255c fix: keep cwd at repo root through entire restart, revert try/except
The stack path in spec.yml is relative — both create_operation and
up_operation need cwd at the repo root for stack_is_external() to
resolve it. Move os.chdir(prev_cwd) to after up_operation completes
instead of between the two operations.

Reverts the SystemExit catch in call_stack_deploy_start — the root
cause was cwd, not the hook.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:54:46 +00:00
A. F. Dudley 32f6e57b70 fix: ConfigMap volumes don't force Recreate strategy + resilient hooks
Two fixes for multi-deployment:

1. _pod_has_pvcs now excludes ConfigMap volumes from PVC detection.
   Pods with only ConfigMap volumes (like maintenance) correctly get
   RollingUpdate strategy instead of Recreate.

2. call_stack_deploy_start catches SystemExit when stack path doesn't
   resolve from cwd (common during restart). Most stacks don't have
   deploy hooks, so this is non-fatal.
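
The first fix amounts to choosing the update strategy from PVC-backed volumes only. A sketch, assuming pod volumes are available as manifest-shaped dicts; the real `_pod_has_pvcs` may inspect compose and spec data instead, and `strategy_for` is a hypothetical name.

```python
from typing import Dict, List


def pod_has_pvcs(volumes: List[Dict]) -> bool:
    """True only when some volume is backed by a PersistentVolumeClaim."""
    return any("persistentVolumeClaim" in volume for volume in volumes)


def strategy_for(volumes: List[Dict]) -> str:
    """Pods holding PVCs get Recreate; everything else can roll."""
    return "Recreate" if pod_has_pvcs(volumes) else "RollingUpdate"
```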

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:51:58 +00:00
A. F. Dudley 6923e1c23b refactor: extract methods from K8sDeployer.up to fix C901 complexity
Split up() into _setup_cluster(), _create_ingress(), _create_nodeports().
Reduces cyclomatic complexity below the flake8 threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:20:50 +00:00
A. F. Dudley 0ac886bf95 fix: chdir to repo root before create_operation in restart
The spec's "stack:" value is a relative path that must resolve from
the repo root. stack_is_external() checks Path(stack).exists() from
cwd, which fails when cwd isn't the repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:06:38 +00:00
A. F. Dudley 2484abfcce fix: use git rev-parse for repo root in restart command
The repo_root calculation assumed stack paths are always 4 levels deep
(stack_orchestrator/data/stacks/name). External stacks with different
nesting (e.g. stack-orchestrator/stacks/name = 3 levels) got the wrong
root, causing --spec-file resolution to fail.

Use git rev-parse --show-toplevel instead.

Fixes: so-k1k

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:03:24 +00:00
A. F. Dudley 967936e524 Multi-deployment: one k8s Deployment per pod in stack.yml
Each pod entry in stack.yml now creates its own k8s Deployment with
independent lifecycle and update strategy. Pods with PVCs get Recreate,
pods without get RollingUpdate. This enables maintenance services that
survive main pod restarts.

- cluster_info: get_deployments() builds per-pod Deployments, Services
- cluster_info: Ingress routes to correct per-pod Service
- deploy_k8s: _create_deployment() iterates all Deployments/Services
- deployment: restart swaps Ingress to maintenance service during Recreate
- spec: add maintenance-service key

Single-pod stacks are backward compatible (same resource names).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 01:40:45 +00:00
A. F. Dudley 6ace024cd3 fix: use replace instead of patch for k8s resource updates
Strategic merge patch preserves fields not present in the patch body.
This means removed volumes, ports, and env vars persist in the running
Deployment after a restart. Replace sends the complete spec built from
the current compose files — removed fields are actually deleted.

Affects Deployment, Service, Ingress, and NodePort updates. Service
replace preserves clusterIP (immutable field) by reading it from the
existing resource before replacing.
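
The clusterIP carry-over can be sketched on plain dicts. `prepare_service_replace` is a hypothetical helper; it only shows how the replace body is assembled, not the API call that sends it.

```python
import copy
from typing import Dict


def prepare_service_replace(existing: Dict, desired: Dict) -> Dict:
    """Build a full replacement body that keeps the immutable clusterIP."""
    body = copy.deepcopy(desired)
    cluster_ip = existing.get("spec", {}).get("clusterIP")
    if cluster_ip is not None:
        # clusterIP cannot change on update, so copy it from the live object.
        body.setdefault("spec", {})["clusterIP"] = cluster_ip
    return body
```

Because the body is built from `desired` alone, a port that was dropped from the compose file is absent from the replacement and is therefore removed from the live Service.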

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 03:44:57 +00:00
A. F. Dudley ea610bb8d6 Merge branch 'cv-c3c-image-flag-for-restart'
# Conflicts:
#	stack_orchestrator/deploy/k8s/deploy_k8s.py
2026-03-18 23:04:55 +00:00
A. F. Dudley 4b1fc27a1e cv-c3c: add --image flag to deployment restart command
Allows callers to override container images during restart, e.g.:
  laconic-so deployment restart --image backend=ghcr.io/org/app:sha123

The override is applied to the k8s Deployment spec before
create-or-patch. Docker/compose deployers accept the parameter
but ignore it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 22:42:56 +00:00
A. F. Dudley 25e5ff09d9 so-m3m: add credentials-files spec key for on-disk credential injection
_write_config_file() now reads each file listed under the credentials-files
top-level spec key and appends its contents to config.env after config vars.
Paths support ~ expansion. Missing files fail hard with sys.exit(1).

Also adds get_credentials_files() to Spec class following the same pattern
as get_image_registry_config().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 21:55:28 +00:00
A. F. Dudley 0e4ecc3602 refactor: rename registry-credentials to image-pull-secret in spec
The spec key `registry-credentials` was ambiguous — could mean container
registry auth or Laconic registry config. Rename to `image-pull-secret`
which matches the k8s secret name it creates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 21:38:31 +00:00
A. F. Dudley dc15c0f4a5 feat: auto-generate readiness probes from http-proxy routes
Containers referenced in spec.yml http-proxy routes now get TCP
readiness probes on the proxied port. This tells k8s when a container
is actually ready to serve traffic.

Without readiness probes, k8s considers pods ready immediately after
start, which means:
- Rolling updates cut over before the app is listening
- Broken containers look "ready" and receive traffic (502s)
- kubectl rollout undo has nothing to roll back to

The probes use TCP socket checks (not HTTP) to work with any protocol.
Initial delay 5s, check every 10s, fail after 3 consecutive failures.

Closes so-l2l part C.
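
The probe settings described above map onto the Kubernetes Probe fields `tcpSocket`, `initialDelaySeconds`, `periodSeconds`, and `failureThreshold`. A minimal sketch using plain dicts follows; the helper names and the `proxied_ports` mapping (container name to port, as it might be derived from http-proxy routes) are hypothetical.

```python
def tcp_readiness_probe(port):
    """Build a TCP readiness probe with the timings from the commit message."""
    return {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 5,   # wait 5s before the first check
        "periodSeconds": 10,        # check every 10s
        "failureThreshold": 3,      # not ready after 3 consecutive failures
    }


def add_readiness_probes(containers, proxied_ports):
    """Attach probes only to containers referenced by an http-proxy route."""
    result = []
    for c in containers:
        port = proxied_ports.get(c["name"])
        if port is None:
            result.append(dict(c))  # unproxied containers are left as-is
        else:
            result.append({**c, "readinessProbe": tcp_readiness_probe(port)})
    return result


containers = [{"name": "web"}, {"name": "worker"}]
probed = add_readiness_probes(containers, {"web": 8080})
```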

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 19:43:09 +00:00
A. F. Dudley 2d11ca7bb0 feat: update-in-place deployments with rolling updates
Replace the destroy-and-recreate deployment model with in-place updates.

deploy_k8s.py: All resource creation (Deployment, Service, Ingress,
NodePort, ConfigMap) now uses create-or-update semantics. If a resource
already exists (409 Conflict), it patches instead of failing. For
Deployments, this triggers a k8s rolling update — old pods serve traffic
until new pods pass readiness checks.

deployment.py: restart() no longer calls down(). It just calls up()
which patches existing resources. No namespace deletion, no downtime
gap, no race conditions. k8s handles the rollout.

This gives:
- Zero-downtime deploys (old pods serve during rollout)
- Automatic protection from bad releases (if new pods fail readiness, the rollout stalls and old pods keep serving)
- Manual rollback via kubectl rollout undo

Closes so-l2l (parts A and B).
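
The create-or-update pattern can be illustrated with a small self-contained sketch. `ApiError`, `FakeCluster`, and `create_or_patch` are hypothetical stand-ins written for this example; the real code talks to the Kubernetes API, where the Python client's `ApiException` similarly exposes a `status` attribute.

```python
class ApiError(Exception):
    """Stand-in for an API error that carries an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def create_or_patch(create, patch, body):
    """Try to create; on 409 Conflict, patch the existing resource instead."""
    try:
        create(body)
        return "created"
    except ApiError as e:
        if e.status != 409:
            raise  # any other failure is a real error
        patch(body["metadata"]["name"], body)
        return "patched"


class FakeCluster:
    """In-memory stand-in for one resource type in one namespace."""

    def __init__(self):
        self.objects = {}

    def create(self, body):
        name = body["metadata"]["name"]
        if name in self.objects:
            raise ApiError(409)
        self.objects[name] = body

    def patch(self, name, body):
        self.objects[name] = {**self.objects[name], **body}


cluster = FakeCluster()
v1 = {"metadata": {"name": "app"}, "image": "app:1"}
v2 = {"metadata": {"name": "app"}, "image": "app:2"}
first = create_or_patch(cluster.create, cluster.patch, v1)
second = create_or_patch(cluster.create, cluster.patch, v2)
```

Trying the create first keeps the common first-deploy path to a single API call, and treating only 409 as "already exists" avoids masking unrelated failures.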

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 19:40:20 +00:00
A. F. Dudley ba39c991f1 fix: create imagePullSecret in deployment namespace, not default
create_registry_secret() hardcoded namespace="default" but deployments
now run in dedicated laconic-* namespaces. The secret was invisible
to pods in the deployment namespace, causing 401 on GHCR pulls.

Accept namespace as parameter, passed from deploy_k8s.py which knows
the correct namespace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 19:08:52 +00:00
A. F. Dudley 0b3e5559d0 fix: wait for namespace termination in down() before returning
Reverts the label-based deletion approach — resources created by older
laconic-so lack labels, so label queries return empty results. Namespace
deletion is the only reliable cleanup.

Adds _wait_for_namespace_gone() so down() blocks until the namespace
is fully terminated. This prevents the race condition where up() tries
to create resources in a still-terminating namespace (403 Forbidden).
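
A blocking wait of this kind is typically a poll-until-gone loop. The sketch below is an assumption about the general shape, not the actual `_wait_for_namespace_gone()`: the `namespace_exists` callable, the timeout, and the polling interval are all hypothetical.

```python
import time


def wait_for_namespace_gone(namespace_exists, timeout=120.0, interval=2.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Block until namespace_exists() returns False or the timeout elapses."""
    deadline = clock() + timeout
    while namespace_exists():
        if clock() >= deadline:
            raise TimeoutError("namespace still terminating after timeout")
        sleep(interval)


# Demonstration: the namespace is reported present twice, then gone.
states = iter([True, True, False])
polls = []  # records each sleep interval instead of actually sleeping
wait_for_namespace_gone(lambda: next(states), sleep=polls.append)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; a caller would pass a function that queries the cluster and returns False once the namespace lookup reports not found.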

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 18:49:38 +00:00
A. F. Dudley ae2cea3410 fix: never delete namespace on deployment down
down() deleted the entire namespace when it wasn't explicitly set in
the spec. This causes a race condition on restart: up() tries to create
resources in a namespace that's still terminating, getting 403 Forbidden.

Always use _delete_resources_by_label() instead. The namespace is cheap
to keep and required for immediate up() after down(). This also matches
the shared-namespace behavior, making down() consistent regardless of
namespace configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 18:47:05 +00:00
A. F. Dudley e298e7444f fix: add auto-generated header to config.env
config.env is regenerated from spec.yml on every deploy create and
restart, silently overwriting manual edits. Add a header comment
explaining this so operators know to edit spec.yml instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 18:24:27 +00:00
A. F. Dudley e5a8ec5f06 fix: rename registry secret to image-pull-secret
The secret name `{app}-registry` is ambiguous — it could be a container
registry credential or a Laconic registry config. Rename to
`{app}-image-pull-secret` which clearly describes its purpose as a
Kubernetes imagePullSecret for private container registries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:33:11 +00:00
A. F. Dudley 0bbb51067c fix: set imagePullPolicy=Always for kind deployments
Kind deployments used imagePullPolicy=None (defaults to IfNotPresent),
which means the kind node caches images by tag and never re-pulls from
the local registry. After a container rebuild + registry push, the pod
keeps using the stale cached image.

Set Always for all deployment types so k8s re-pulls on every pod
restart. With a local registry this adds negligible overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 17:44:35 +00:00