stack-orchestrator

Commit Graph

Author	SHA1	Message	Date
prathamesh0	cf0e230b66	bug-fix: fix image-overrides usage to load locally build images into kind cluster (#751) - Cluster setup was only considering images from containers list in `stack.yml` for kind-loading into the cluster; i.e. images from `image_overrides` in spec were not being loaded - This also resulted in laconic-so to attempt kind-loading images not present locally sometimes - Fix: union `image_overrides` values (user-specified local images) with the ones from container-list, filtered to only ones that are actually present on the docker host	2026-05-05 10:08:08 +05:30
prathamesh0	7c65d39bb2	Make deployments self-sufficient and add E2E restart test (#750) - `deploy create` now copies each pod's `commands.py` into `<deployment>/hooks/`. `call_stack_deploy_start` loads from there, so `deployment start` / `restart` no longer need the live stack source on disk to run the `start()` hook - Only the `start()` hook is affected. `init`, `setup`, and `create` still load from the live source — they only run at `deploy create` time, when the source is guaranteed to be present - Multi-repo stacks produce `hooks/commands_0.py`, `hooks/commands_1.py`, …; `call_stack_deploy_start` loads them all in sorted order - Adds `tests/k8s-deploy/run-restart-test.sh` covering the full single-repo restart cycle (v1 -> mutate working tree -> `restart` re-copies and re-executes v2) and the multi-repo file-naming + multi-hook invocation. Wired into the existing K8s Deploy Test workflow	2026-04-28 17:28:02 +05:30
prathamesh0	4977e3ff43	k8s: manage Caddy ingress image via spec (so-p3p) (#749) Closes so-p3p: - New spec key `caddy-ingress-image`: on fresh install, deploys Caddy with this image; on subsequent `deployment start`, patches the running Caddy Deployment if the image differs. Defaults to the manifest's hardcoded image when absent - When the spec key is absent, SO does not touch a running Caddy — avoids silently reverting an image set out-of-band (ansible playbook, another deployment's spec) - `strategy: Recreate` on the Caddy Deployment manifest (required — hostPort 80/443 deadlocks rolling updates) - Reconcile runs under both `--perform-cluster-management` and the default `--skip-cluster-management` (it's a k8s-API patch, not a cluster-lifecycle op) - Image template by container name rather than string match, so the spec override wins regardless of what the shipped manifest hardcodes - Cluster-scoped caveat documented: `caddy-system` is shared across deployments, so the last `deployment start` that sets the key wins for everyone	2026-04-21 14:40:39 +05:30
prathamesh0	421b83c430	k8s: shared-cluster safety checks and deployment-id decoupling (#748) - Kind extraMount compatibility: fail fast at `deployment start` when a new deployment's mounts don't match the running cluster; warn when the first cluster is created without a `kind-mount-root` umbrella; replace the cryptic `ConfigException` with readable errors when the cluster is missing - Auto-ConfigMap for file-level host-path compose volumes (so-7fc): `../config/foo.sh:/opt/foo.sh`-style binds become per-namespace ConfigMaps at deploy start instead of aliasing via the kind extraMount chain. `deploy create` rejects `:rw`, subdirs, and over-budget sources. Deployment-dir layout unchanged - Namespace ownership: stamp the namespace with `laconic.com/deployment-dir` on create; fail loudly if another deployment tries to land in the same namespace. Pre-existing namespaces adopt ownership on next start - deployment-id / cluster-id decoupling: split the two roles (kube context vs resource-name prefix) into separate `deployment.yml` fields. Backward-compat fallback keeps existing resource names stable - Close stale pebbles `so-n1n` and `so-ad7`	2026-04-21 12:17:28 +05:30
prathamesh0	7f4b058066	so-o2o: kubectl-level Caddy cert backup/restore (#746) Replaces the etcd-surgery persistence approach with a CronJob that dumps `manager=caddy` Secrets to `{kind-mount-root}/caddy-cert-backup/` every 5 min, and a restore step that applies the file before Caddy starts on a fresh cluster. Closes so-o2o. Deletes `_clean_etcd_keeping_certs` and the etcd+PKI extraMounts. No new spec keys - activates when `kind-mount-root` is set.	2026-04-17 15:36:40 +05:30
prathamesh0	1334900407	so-o2o: detect etcd image dynamically + diagnose whitelist cleanup bugs (#745) Replaces the hardcoded `gcr.io/etcd-development/etcd:v3.5.9` in `_clean_etcd_keeping_certs` with a dynamic ref captured from the running Kind node via `crictl`, persisted to `{backup_dir}/etcd-image.txt` and reused on subsequent cleanup runs. Self-adapts to Kind upgrades, no version table to maintain. Testing on Kind v0.32 / etcd 3.6 surfaced two additional bugs in the whitelist cleanup that this PR does not fix (see so-o2o comments): (a) the restore step pipes raw protobuf values through bash `echo`, corrupting binary bytes; (b) the whitelist omits cluster-admin RBAC, SAs, and bootstrap tokens needed by kubeadm's pre-addon health check. Merging this narrow fix + diagnosis trail; follow-up branch will replace the etcd-surgery approach with a kubectl-level Caddy secret backup/restore.	2026-04-17 13:48:30 +05:30
prathamesh0	cf8b7533fe	so-ad7: build per-pod Service for maintenance container (#744) - Maintenance-page swap during `restart` was broken: Ingress got patched to point at `{app_name}-{pod_name}-service` for the maintenance pod, but that Service was never created. Caddy had no valid backend, users saw "site cannot be reached" instead of the maintenance page - Root cause: `get_services()` only builds per-pod Services for pods referenced by `http-proxy` routes; the maintenance pod has no http-proxy route by design - Fix: `get_services()` now also includes the container named by `maintenance-service:` in the container-ports map, so its per-pod `Service` gets built and sits idle until the swap window - Also files `so-b9a` (P4) noting the latent fragility in the resolver/builder contract	2026-04-16 15:07:25 +05:30
prathamesh0	fc5dc80058	so-l2l: in-place stop/restart via label-scoped cleanup (#743) - `down()` scopes cleanup to a single stack via `app.kubernetes.io/stack` and keeps the namespace `Active` by default - New `stop/down --delete-namespace` flag for opt-in full teardown - `down()` is synchronous - waits until resources are actually gone before returning. Callers can drop their own wait loops - `up()` skip-if-exists for Jobs completes the create-or-replace coverage (other kinds already had it) - Orphan PVs from a prior `stop --delete-namespace` get cleaned on the next `stop --delete-volumes` - Every k8s resource SO creates now carries `app.kubernetes.io/stack` via a new `ClusterInfo._stack_labels()` helper - Closes so-l2l, so-076.2. Also includes pebble audit: closes so-c71, so-b2b, so-k1k; files so-328	2026-04-16 12:10:04 +05:30
prathamesh0	f40913d187	Fix Kind port mappings and configmap source path resolution (#742) - Only map host ports for services with network_mode: host (80/443 for Caddy always mapped). Previously all compose service ports were mapped unconditionally, causing conflicts with local services like postgres and redis - Use spec configmap values as source paths instead of ignoring them. Fixes configmaps with user-defined paths (e.g. `stack-orchestrator/compose/maintenance`) and home-relative paths (e.g. `~/.credentials/local-certs/s3`) - Read configmap files from deployment dir (`configmaps/{name}/`) when building k8s ConfigMap objects, not from the spec's source path which doesn't exist in the deployment dir - File pebbles: `so-c71` (resolved), `so-078`: self-sufficient deployments (hooks should be copied to deployment dir)	2026-04-14 17:33:47 +05:30
prathamesh0	17b614cb4d	Fix configmap source path resolution for user-defined spec paths (#741)	2026-04-14 11:30:27 +05:30
prathamesh0	0bf1ea70d5	Add ip mode to external-services for static IP endpoints (#740) ExternalName services only support DNS names (CNAME records), not raw IP addresses. Add an ip mode that creates a headless Service + Endpoints with a static IP, enabling routing to host-network services like Kind gateway IPs or bare-metal endpoints. Spec format: external-services: my-service: ip: 172.18.0.1 port: 8899 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 17:53:23 +05:30
A. F. Dudley	3da23683f6	fix: black formatting, line length, pyright type narrowing - Apply black reformatting to deployer.py, cluster_info.py, deploy_k8s.py - Shorten docstrings exceeding 88 char line limit - Add assert for pyright Optional type narrowing on tls list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 05:22:25 +00:00
A. F. Dudley	63325f68a7	fix: deduplicate container ports by (port, protocol) Compose files with both "8001" (TCP) and "8001/udp" produce separate V1ContainerPort entries that k8s rejects as duplicates. Deduplicate after parsing by (container_port, protocol) key. This was blocking biscayne's agave deployment — the spec has both TCP 8001 (ip_echo) and UDP 8001 (gossip), which generated two UDP 8001 entries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 05:18:10 +00:00
A. F. Dudley	87761c7041	fix: imagePullPolicy for kind, job images, duplicate registry call, test namespace - deploy_k8s.py: default imagePullPolicy to IfNotPresent for kind (local images loaded via kind load, not pulled from registry) - cluster_info.py: add job images to image_set so they're loaded into kind - deploy_k8s.py: remove duplicate create_registry_secret call (merge artifact) - deploy_k8s.py: fix indentation in run_job job_pull_policy (replace_all damage) - tests/k8s-deploy: update namespace from laconic-{id} to laconic-{stack_name} to match the new stack-derived namespace scheme from wd-a7b All 15 k8s deploy e2e tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 23:34:51 +00:00
A. F. Dudley	5bf96112d3	fix lint errors from merge: duplicate def, shadowed import, empty f-string - deployment_create.py: remove duplicate create_registry_secret signature - deploy_k8s.py: rename loop var 'config' to 'svc_config' (shadowed import) - deploy_k8s.py: remove f-prefix from string without placeholders - deploy_k8s.py: suppress pre-existing C901 on _create_volume_data Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 19:17:45 +00:00
A. F. Dudley	549ac8c01d	Merge fix/kind-mount-propagation: all local branches unified Merges 6 local branches into main: - enya: HostToContainer mount propagation for kind-mount-root - fix/k8s-port-mappings-v5: port protocol parsing, namespace fix - peirce: idempotent deploy (create-or-replace), update-envs rename - prince: etcd cleanup whitelist - wd-a7b: timestamp cluster IDs, stack-derived namespaces, jobs, multi-cert ingress, user secrets, _build_containers refactor - fix/kind-mount-propagation: deployment prepare command, pebbles Conflicts resolved keeping main's evolved multi-pod architecture (get_deployments, per-pod Services, CA cert injection) while incorporating branch additions (HostToContainer propagation, user secrets, jobs support). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:26:05 +00:00
A. F. Dudley	d50bd2b6d2	Merge wd-a7b: cluster-id/namespace naming, jobs, multi-cert, secrets Combines timestamp-based cluster IDs, namespace derived from stack name, _build_containers refactor, jobs support, multi-ingress certificates, user-declared secrets, and label-based resource cleanup with the existing idempotent deploy, mount propagation, and port mapping fixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:22:07 +00:00
A. F. Dudley	2307696a66	Merge fix/k8s-port-mappings-v5 into fix/kind-mount-propagation Resolve conflicts keeping HostToContainer propagation on mount root entry and per-container resource layering from the propagation branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 17:06:25 +00:00
A. F. Dudley	c64820ad5c	Merge branch 'enya-ac868cc4-kind-mount-propagation-fix' into fix/kind-mount-propagation	2026-04-01 17:05:06 +00:00
A. F. Dudley	3e3f349151	Merge remote-tracking branch 'cerc-io.github.com/main' # Conflicts: # stack_orchestrator/deploy/deployment_create.py # stack_orchestrator/deploy/k8s/deploy_k8s.py	2026-04-01 14:47:46 +00:00
Prathamesh Musale	33d3474d7d	Fix registry secret created in wrong namespace (#998) `create_registry_secret()` was hardcoded to use the "default" namespace, but pods are deployed to the spec's configured namespace. The secret must be in the same namespace as the pods for `imagePullSecrets` to work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/998 Co-authored-by: Prathamesh Musale <prathamesh.musale0@gmail.com> Co-committed-by: Prathamesh Musale <prathamesh.musale0@gmail.com>	2026-03-26 08:36:39 +00:00
Snake Game Developer	90e32ffd60	Support image-overrides in spec for testing Spec can override container images: image-overrides: dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag Merged with CLI overrides (CLI wins). Enables testing with GHCR-pushed test tags without modifying compose files. Also reverts the image-pull-policy spec key (not needed — the fix is to use proper GHCR tags, not IfNotPresent). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 01:02:23 +00:00
Snake Game Developer	1052a1d4e7	Support image-pull-policy in spec (default: Always) Testing specs can set image-pull-policy: IfNotPresent so kind-loaded local images are used instead of pulling from the registry. Production specs omit the key and get the default Always behavior. Root cause: with Always, k8s pulled the GHCR kubo image (with baked R2 endpoint) instead of the locally-built image (with https://s3:443), causing kubo to connect to R2 directly and get Unauthorized. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 20:17:06 +00:00
Snake Game Developer	f93541f7db	Fix CA cert mounting: subPath for Go, expanduser for configmaps - CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509 picks them up (directory mount replaces the entire dir) - get_configmaps() now expands ~ in paths via os.path.expanduser() - Both changes discovered during testing with mkcert + MinIO Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 19:27:14 +00:00
Snake Game Developer	713a81c245	Add external-services and ca-certificates spec keys New spec.yml features for routing external service dependencies: external-services: s3: host: example.com # ExternalName Service (production) port: 443 s3: selector: {app: mock} # headless Service + Endpoints (testing) namespace: mock-ns port: 443 ca-certificates: - ~/.local/share/mkcert/rootCA.pem # testing only laconic-so creates the appropriate k8s Service type per mode: - host mode: ExternalName (DNS CNAME to external provider) - selector mode: headless Service + Endpoints with pod IPs discovered from the target namespace at deploy time ca-certificates mounts CA files into all containers at /etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun. Also includes the previously committed PV Released state fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 15:25:47 +00:00
Snake Game Developer	98ff221a21	Fix PV rebinding after deployment stop/start cycle deployment stop deletes the namespace (and PVCs) but preserves PVs by default. On the next deployment start, PVs are in Released state with a stale claimRef pointing at the deleted PVC. New PVCs cannot bind to Released PVs, so pods get stuck in Pending. Clear the claimRef on any Released PV during _create_volume_data() so the PV returns to Available and can accept new PVC bindings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 07:47:23 +00:00
A. F. Dudley	8d03083d0d	feat: add kind-mount-root for unified Kind extraMount When kind-mount-root is set in spec.yml, emit a single extraMount mapping the root to /mnt instead of per-volume mounts. This allows adding new volumes without recreating the Kind cluster. Volumes whose host path is under the root skip individual extraMounts and their PV paths resolve to /mnt/{relative_path}. Volumes outside the root keep individual extraMounts as before. Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix (commits `b6d6ad81`, `929bdab8`) and adapted for current main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 21:28:40 +00:00
A. F. Dudley	9109cfb7a1	feat: add token-file option for image-pull-secret registry auth Adds token-file key to image-pull-secret spec config. Reads the registry token from a file on disk instead of requiring an environment variable. File path supports ~ expansion. Falls back to token-env if token-file is not set or file doesn't exist. This lets operators store the GHCR token in ~/.credentials/ alongside other secrets, removing the need for ansible to pass REGISTRY_TOKEN as an env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 19:30:44 +00:00
A. F. Dudley	61afeb255c	fix: keep cwd at repo root through entire restart, revert try/except The stack path in spec.yml is relative — both create_operation and up_operation need cwd at the repo root for stack_is_external() to resolve it. Move os.chdir(prev_cwd) to after up_operation completes instead of between the two operations. Reverts the SystemExit catch in call_stack_deploy_start — the root cause was cwd, not the hook. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:54:46 +00:00
A. F. Dudley	32f6e57b70	fix: ConfigMap volumes don't force Recreate strategy + resilient hooks Two fixes for multi-deployment: 1. _pod_has_pvcs now excludes ConfigMap volumes from PVC detection. Pods with only ConfigMap volumes (like maintenance) correctly get RollingUpdate strategy instead of Recreate. 2. call_stack_deploy_start catches SystemExit when stack path doesn't resolve from cwd (common during restart). Most stacks don't have deploy hooks, so this is non-fatal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:51:58 +00:00
A. F. Dudley	6923e1c23b	refactor: extract methods from K8sDeployer.up to fix C901 complexity Split up() into _setup_cluster(), _create_ingress(), _create_nodeports(). Reduces cyclomatic complexity below the flake8 threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:20:50 +00:00
A. F. Dudley	0ac886bf95	fix: chdir to repo root before create_operation in restart The spec's "stack:" value is a relative path that must resolve from the repo root. stack_is_external() checks Path(stack).exists() from cwd, which fails when cwd isn't the repo root. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:06:38 +00:00
A. F. Dudley	2484abfcce	fix: use git rev-parse for repo root in restart command The repo_root calculation assumed stack paths are always 4 levels deep (stack_orchestrator/data/stacks/name). External stacks with different nesting (e.g. stack-orchestrator/stacks/name = 3 levels) got the wrong root, causing --spec-file resolution to fail. Use git rev-parse --show-toplevel instead. Fixes: so-k1k Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:03:24 +00:00
A. F. Dudley	967936e524	Multi-deployment: one k8s Deployment per pod in stack.yml Each pod entry in stack.yml now creates its own k8s Deployment with independent lifecycle and update strategy. Pods with PVCs get Recreate, pods without get RollingUpdate. This enables maintenance services that survive main pod restarts. - cluster_info: get_deployments() builds per-pod Deployments, Services - cluster_info: Ingress routes to correct per-pod Service - deploy_k8s: _create_deployment() iterates all Deployments/Services - deployment: restart swaps Ingress to maintenance service during Recreate - spec: add maintenance-service key Single-pod stacks are backward compatible (same resource names). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 01:40:45 +00:00
A. F. Dudley	6ace024cd3	fix: use replace instead of patch for k8s resource updates Strategic merge patch preserves fields not present in the patch body. This means removed volumes, ports, and env vars persist in the running Deployment after a restart. Replace sends the complete spec built from the current compose files — removed fields are actually deleted. Affects Deployment, Service, Ingress, and NodePort updates. Service replace preserves clusterIP (immutable field) by reading it from the existing resource before replacing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 03:44:57 +00:00
A. F. Dudley	ea610bb8d6	Merge branch 'cv-c3c-image-flag-for-restart' # Conflicts: # stack_orchestrator/deploy/k8s/deploy_k8s.py	2026-03-18 23:04:55 +00:00
A. F. Dudley	4b1fc27a1e	cv-c3c: add --image flag to deployment restart command Allows callers to override container images during restart, e.g.: laconic-so deployment restart --image backend=ghcr.io/org/app:sha123 The override is applied to the k8s Deployment spec before create-or-patch. Docker/compose deployers accept the parameter but ignore it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 22:42:56 +00:00
A. F. Dudley	25e5ff09d9	so-m3m: add credentials-files spec key for on-disk credential injection _write_config_file() now reads each file listed under the credentials-files top-level spec key and appends its contents to config.env after config vars. Paths support ~ expansion. Missing files fail hard with sys.exit(1). Also adds get_credentials_files() to Spec class following the same pattern as get_image_registry_config(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 21:55:28 +00:00
A. F. Dudley	0e4ecc3602	refactor: rename registry-credentials to image-pull-secret in spec The spec key `registry-credentials` was ambiguous — could mean container registry auth or Laconic registry config. Rename to `image-pull-secret` which matches the k8s secret name it creates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 21:38:31 +00:00
A. F. Dudley	dc15c0f4a5	feat: auto-generate readiness probes from http-proxy routes Containers referenced in spec.yml http-proxy routes now get TCP readiness probes on the proxied port. This tells k8s when a container is actually ready to serve traffic. Without readiness probes, k8s considers pods ready immediately after start, which means: - Rolling updates cut over before the app is listening - Broken containers look "ready" and receive traffic (502s) - kubectl rollout undo has nothing to roll back to The probes use TCP socket checks (not HTTP) to work with any protocol. Initial delay 5s, check every 10s, fail after 3 consecutive failures. Closes so-l2l part C. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 19:43:09 +00:00
A. F. Dudley	2d11ca7bb0	feat: update-in-place deployments with rolling updates Replace the destroy-and-recreate deployment model with in-place updates. deploy_k8s.py: All resource creation (Deployment, Service, Ingress, NodePort, ConfigMap) now uses create-or-update semantics. If a resource already exists (409 Conflict), it patches instead of failing. For Deployments, this triggers a k8s rolling update — old pods serve traffic until new pods pass readiness checks. deployment.py: restart() no longer calls down(). It just calls up() which patches existing resources. No namespace deletion, no downtime gap, no race conditions. k8s handles the rollout. This gives: - Zero-downtime deploys (old pods serve during rollout) - Automatic rollback (if new pods fail readiness, rollout stalls) - Manual rollback via kubectl rollout undo Closes so-l2l (parts A and B). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 19:40:20 +00:00
A. F. Dudley	ba39c991f1	fix: create imagePullSecret in deployment namespace, not default create_registry_secret() hardcoded namespace="default" but deployments now run in dedicated laconic-* namespaces. The secret was invisible to pods in the deployment namespace, causing 401 on GHCR pulls. Accept namespace as parameter, passed from deploy_k8s.py which knows the correct namespace. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 19:08:52 +00:00
A. F. Dudley	0b3e5559d0	fix: wait for namespace termination in down() before returning Reverts the label-based deletion approach — resources created by older laconic-so lack labels, so label queries return empty results. Namespace deletion is the only reliable cleanup. Adds _wait_for_namespace_gone() so down() blocks until the namespace is fully terminated. This prevents the race condition where up() tries to create resources in a still-terminating namespace (403 Forbidden). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 18:49:38 +00:00
A. F. Dudley	ae2cea3410	fix: never delete namespace on deployment down down() deleted the entire namespace when it wasn't explicitly set in the spec. This causes a race condition on restart: up() tries to create resources in a namespace that's still terminating, getting 403 Forbidden. Always use _delete_resources_by_label() instead. The namespace is cheap to keep and required for immediate up() after down(). This also matches the shared-namespace behavior, making down() consistent regardless of namespace configuration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 18:47:05 +00:00
A. F. Dudley	e298e7444f	fix: add auto-generated header to config.env config.env is regenerated from spec.yml on every deploy create and restart, silently overwriting manual edits. Add a header comment explaining this so operators know to edit spec.yml instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 18:24:27 +00:00
A. F. Dudley	e5a8ec5f06	fix: rename registry secret to image-pull-secret The secret name `{app}-registry` is ambiguous — it could be a container registry credential or a Laconic registry config. Rename to `{app}-image-pull-secret` which clearly describes its purpose as a Kubernetes imagePullSecret for private container registries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 15:33:11 +00:00
A. F. Dudley	0bbb51067c	fix: set imagePullPolicy=Always for kind deployments Kind deployments used imagePullPolicy=None (defaults to IfNotPresent), which means the kind node caches images by tag and never re-pulls from the local registry. After a container rebuild + registry push, the pod keeps using the stale cached image. Set Always for all deployment types so k8s re-pulls on every pod restart. With a local registry this adds negligible overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 17:44:35 +00:00
A. F. Dudley	72aabe7d9a	fix: deploy create --update now syncs config.env from spec The --update path excluded config.env from the safe_copy_tree, which meant new config vars added to spec.yml were never written to config.env. The XXX comment already flagged this as broken. Remove config.env from exclude_patterns so --update regenerates it from spec.yml like the non-update path does. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 08:20:45 +00:00
A. F. Dudley	36c37d2bde	wd-a7b: Fix cluster-id and namespace naming - Replace token_hex cluster IDs with sortable timestamp-based IDs (laconic-{base62_timestamp}{random_suffix}) via new ids.py module - Check for existing Kind cluster before generating a new cluster-id - Derive k8s namespace from stack name instead of compose_project_name (e.g. laconic-dumpster instead of laconic-<random>) - Plumb namespace through to secret generation instead of hardcoding 'default' Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 08:01:11 +00:00
afd	8a7491d3e0	Support multiple http-proxy entries in a single deployment Previously get_ingress() only used the first http-proxy entry, silently ignoring additional hostnames. Now iterates over all entries, creating an Ingress rule and TLS config per hostname. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 06:16:28 +00:00
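
The port deduplication described in commit 63325f68a7 can be illustrated with a minimal sketch. It is not the code from `cluster_info.py`: `ContainerPort` is a hypothetical stand-in for the Kubernetes client's `V1ContainerPort`, and `parse_port` and `dedupe_ports` are names invented for this example. It only shows the idea of keying on `(container_port, protocol)` so that TCP 8001 and UDP 8001 both survive while exact repeats are dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerPort:
    """Hypothetical stand-in for kubernetes.client.V1ContainerPort."""
    container_port: int
    protocol: str = "TCP"


def parse_port(entry: str) -> ContainerPort:
    """Parse a compose-style port string such as "8001" or "8001/udp"."""
    port, _, proto = entry.partition("/")
    # Keep only the container side of a "host:container" mapping.
    port = port.rsplit(":", 1)[-1]
    return ContainerPort(int(port), (proto or "tcp").upper())


def dedupe_ports(ports):
    """Drop repeats of the same (container_port, protocol), keeping first-seen order."""
    seen = set()
    unique = []
    for p in ports:
        key = (p.container_port, p.protocol)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


entries = ["8001", "8001/udp", "8001/udp", "8899"]
ports = dedupe_ports([parse_port(e) for e in entries])
```

Here the repeated `8001/udp` entry collapses to one, while `8001` (TCP) and `8001/udp` remain distinct because their protocols differ.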


210 Commits (cf0e230b66c09d9c6aaa8115908a185d0b58b2a5)
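
The image selection described in commit cf0e230b66 (union the `image_overrides` values with the container-list images, then keep only those present on the Docker host) can be sketched as below. This is an assumption-laden illustration, not the project's code: the function name `images_to_load`, its parameters, and the sample image names are all hypothetical, and the set of local images is passed in rather than queried from Docker.

```python
def images_to_load(container_images, image_overrides, local_images):
    """Return the images worth loading into the kind cluster.

    container_images: image names derived from the stack's container list.
    image_overrides:  mapping of container name -> overriding image (from the spec).
    local_images:     image names known to exist on the Docker host.
    All three are hypothetical inputs for this sketch.
    """
    # Union the container-list images with the user-specified overrides.
    candidates = set(container_images) | set(image_overrides.values())
    # Keep only images that are actually present locally.
    return sorted(candidates & set(local_images))


container_images = ["cerc/webapp:local", "cerc/db:local"]
image_overrides = {"webapp": "example/webapp:test-tag"}
local_images = {"cerc/webapp:local", "example/webapp:test-tag"}

selected = images_to_load(container_images, image_overrides, local_images)
```

In this example `cerc/db:local` is skipped because it is not present locally, while the override image is included even though it never appears in the container list.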