Commit Graph

1264 Commits (1ea000e3c83907c74292df445755eff4a1f107e4)

Author SHA1 Message Date
Prathamesh Musale 1ea000e3c8 chore(pebbles): close so-l2l and so-076.2
Both resolved by the so-l2l Part A+B refactor on this branch:
restart no longer calls down(); down() scopes cleanup by
app.kubernetes.io/stack label and keeps the namespace Active by
default. --delete-namespace flag added for opt-in full teardown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 06:24:50 +00:00
Prathamesh Musale 774b39836e so-l2l: refactor down() for clarity
down() is now a five-phase recipe with each phase a single line:
namespaced cleanup, cluster-scoped PV cleanup, synchronous wait,
optional namespace delete, optional cluster destroy. Each helper
does one thing.

- Extract _stack_label_selector() and _namespace_exists() so down()
  reads declaratively.
- Rename _delete_labeled_resources -> _delete_namespaced_labeled_resources
  to match what it actually does (namespaced phase only).
- Extract _list_delete_namespaced() helper for the Services and
  Endpoints list+delete pattern (k8s client lacks delete_collection
  for those kinds).
- _wait_for_labeled_gone (renamed from _wait_for_labeled_deletions)
  builds listers in clean append-style; DRY the poll/timeout
  iterations via a local remaining() closure.

No behavior changes — same semantics, ~50 fewer lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 06:10:56 +00:00
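The five-phase down() recipe described in the commit above can be sketched as follows. This is a minimal illustration, not the actual stack-orchestrator code: the helper names follow the commit message, but here they are injected as callables so the phase ordering can be exercised without a cluster.

```python
def make_down(helpers):
    """Build a down() whose five phases are the injected helper callables.

    `helpers` maps phase names (taken from the commit message) to zero-arg
    callables; real implementations would talk to the k8s apiserver.
    """
    def down(delete_volumes=False, delete_namespace=False, manage_cluster=False):
        # Phase 1: namespaced cleanup, only if the namespace still exists
        if helpers["namespace_exists"]():
            helpers["delete_namespaced_labeled_resources"]()
        # Phase 2: cluster-scoped PV cleanup, gated on --delete-volumes
        if delete_volumes:
            helpers["delete_labeled_pvs"]()
        # Phase 3: synchronous wait until deletions actually settle
        helpers["wait_for_labeled_gone"]()
        # Phase 4: optional namespace delete (--delete-namespace)
        if delete_namespace:
            helpers["delete_namespace"]()
        # Phase 5: optional cluster destroy (--perform-cluster-management)
        if manage_cluster:
            helpers["destroy_cluster"]()
    return down
```

Injecting the helpers keeps each phase a single line in down() and makes the ordering trivially testable.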
Prathamesh Musale 2f99e6f7c9 so-l2l: clean orphan PVs when namespace is already gone
down() used to early-return when read_namespace returned 404,
skipping all cleanup. That left cluster-scoped PVs orphaned
after a prior 'stop --delete-namespace' (namespace cascades
delete PVCs, but PVs with Retain reclaim policy survive).

Split _delete_labeled_resources into namespaced and cluster-scoped
phases via a namespace_present flag. When the namespace is missing,
jump straight to _delete_labeled_pvs (for --delete-volumes) and the
cluster-scoped half of the wait.
_wait_for_labeled_deletions now builds its lister set based on
whether the namespace still exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 06:08:32 +00:00
Prathamesh Musale 3d83c6ad27 so-l2l: make down() synchronous via _wait_for_labeled_deletions
delete_collection returns before the apiserver actually removes
objects — finalizers on PVs, PVCs, and pod graceful shutdown all
propagate async. Add _wait_for_labeled_deletions that polls the
same label selector across every kind we triggered a delete for,
with a 120s timeout. down() now returns only once the cluster has
actually settled, so callers (tests, ansible, cryovial) don't
need their own wait loops.

Update the k8s-deploy test's assert_no_labeled_resources to rely
on that synchronous contract — no polling in the test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 05:35:43 +00:00
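The poll-until-gone behavior above can be sketched as a small loop. This is an illustrative version, not the real _wait_for_labeled_deletions: the listers, clock, and sleep are injected so the timeout logic can be tested without an apiserver.

```python
import time

def wait_for_labeled_gone(listers, timeout=120, interval=2,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until every lister reports no objects left for the stack label.

    `listers` is a list of zero-arg callables, one per resource kind we
    triggered a delete for, each returning the objects still matching the
    label selector. Raises TimeoutError after `timeout` seconds.
    """
    deadline = clock() + timeout

    def remaining():
        # Flatten whatever every kind's lister still sees
        return [obj for lister in listers for obj in lister()]

    left = remaining()
    while left:
        if clock() >= deadline:
            raise TimeoutError(f"still present after {timeout}s: {left}")
        sleep(interval)
        left = remaining()
```

Returning only once remaining() is empty is what gives callers the synchronous contract the commit describes.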
Prathamesh Musale 98ad60ca03 test: verify label-based stop and --delete-namespace behavior
Update run-deploy-test.sh for the new down() semantics:
- After stop --delete-volumes (without --delete-namespace) assert
  the namespace stays Active and no stack-labeled Deployments,
  Services, ConfigMaps, Secrets, or PVCs remain.
- Drop the 120s wait loop for namespace termination — not needed
  since stop no longer terminates the namespace.
- Exercise the new --delete-namespace flag at end-of-test teardown
  and assert the namespace is actually gone.

Rename delete_cluster_exit -> cleanup_and_exit and have it do a
full teardown (volumes + namespace) so failed CI runs don't leak
state between jobs. Add assert_ns_phase and
assert_no_labeled_resources helpers for the new assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 04:31:46 +00:00
Prathamesh Musale cf2269ebdc so-l2l: add --delete-namespace flag to stop/down for full teardown
Plumb --delete-namespace through the CLI (stop, down), the
down_operation, the Deployer abstract method, and both k8s /
compose implementations. When set, k8s down() calls the existing
_delete_namespace() + _wait_for_namespace_gone() after label-
based resource deletion, restoring the old behavior for the
teardown case. Compose mode ignores the flag.

Default remains False, so normal stop/restart still keep the
namespace Active (Part B behavior).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 04:21:40 +00:00
Prathamesh Musale 258045190c so-l2l Part A complete: skip Job create on 409
Deployments, Services, ConfigMaps, Secrets, Ingresses, and
Endpoints already use create-or-replace in up(). Jobs were the
only remaining gap — they now skip-if-exists since Jobs are
one-shot and re-running on restart is usually unwanted.

This completes the in-place restart story alongside Part B:
restart (which already avoids down()) and stop+start (now that
down() keeps the namespace) both run up() idempotently against
a live namespace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 13:18:26 +00:00
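The skip-on-409 behavior for Jobs can be sketched as below. The real code uses the kubernetes client's ApiException; here a stand-in error class carrying an HTTP status keeps the sketch dependency-free, and the create call is injected.

```python
class ApiError(Exception):
    """Stand-in for the k8s client's ApiException (carries an HTTP status)."""
    def __init__(self, status):
        super().__init__(status)
        self.status = status

def create_job_if_absent(create, job_name, log=print):
    """Create a one-shot Job, skipping quietly when it already exists.

    409 Conflict means a previous up() created it; since Jobs are one-shot
    and re-running on restart is usually unwanted, we skip rather than
    replace (unlike the create-or-replace kinds).
    """
    try:
        create(job_name)
        return "created"
    except ApiError as e:
        if e.status == 409:
            log(f"job {job_name} already exists, skipping")
            return "skipped"
        raise  # anything other than "already exists" is a real failure
```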
Prathamesh Musale c7d2aaa0d0 so-l2l Part B: down() deletes by stack label, keeps namespace
Stop no longer calls _delete_namespace() on every down. Instead,
deletion is scoped by app.kubernetes.io/stack=<stack-name> so
multiple stacks sharing a namespace are torn down independently,
and no namespace termination race blocks a following up().

Prerequisite: every V1ObjectMeta created by cluster_info.py and
deploy_k8s.py now carries the stack label via a new
ClusterInfo._stack_labels() helper (Namespace, Ingress, Service,
Deployment pod template, ConfigMap, Secret, PVC, PV, Endpoints,
Job, CA certs secret, external-service Services).

down() order: Ingresses -> Deployments -> Jobs -> Services ->
ConfigMaps/Secrets/Endpoints -> lingering Pods, then PVCs/PVs
only when --delete-volumes is passed. Kind cluster destruction
still gated by --perform-cluster-management.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 13:17:13 +00:00
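The label scoping described above hinges on two tiny pieces: the label dict stamped onto every V1ObjectMeta and the selector string used by the delete calls. A minimal illustration (the helper names mirror the commit message; the bodies are assumptions):

```python
STACK_LABEL = "app.kubernetes.io/stack"

def stack_labels(stack_name):
    """Labels applied to every created object, so deletion can be scoped
    per stack (illustrative version of ClusterInfo._stack_labels())."""
    return {STACK_LABEL: stack_name}

def stack_label_selector(stack_name):
    """Selector string passed to list/delete_collection calls so only
    this stack's resources are touched, even in a shared namespace."""
    return f"{STACK_LABEL}={stack_name}"
```

Because every object carries the label, two stacks sharing a namespace tear down independently: each down() only ever matches its own selector.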
Prathamesh Musale 8a586b7dfc chore(pebbles): file so-328 for restart propagation gaps
deployment restart does not remove files/resources when their
sources are deleted from the stack repo, and new stack.yml
structural entries do not auto-surface in existing deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:11:59 +00:00
Prathamesh Musale 8497dde92b chore(pebbles): audit open bugs and update statuses
Close so-c71 (extraPortMappings fix merged in e909357a) and so-b2b
(image-pull-secret + REGISTRY_TOKEN wired up and documented). Close
so-k1k (restart now uses git rev-parse for repo_root). Add progress
comments on so-076.2 (skip-cluster-management default flipped) and
so-l2l (readiness probes landed; namespace-delete on down still open).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:31:32 +00:00
prathamesh0 f40913d187 Fix Kind port mappings and configmap source path resolution (#742)
- Only map host ports for services with network_mode: host (80/443 for Caddy always mapped). Previously all compose service ports were mapped unconditionally, causing conflicts with local services like postgres and redis
- Use spec configmap values as source paths instead of ignoring them. Fixes configmaps with user-defined paths (e.g. `stack-orchestrator/compose/maintenance`) and home-relative paths (e.g. `~/.credentials/local-certs/s3`)
- Read configmap files from deployment dir (`configmaps/{name}/`) when building k8s ConfigMap objects, not from the spec's source path which doesn't exist in the deployment dir
- File pebbles: `so-c71` (resolved), `so-078`: self-sufficient deployments (hooks should be copied to deployment dir)
2026-04-14 17:33:47 +05:30
prathamesh0 17b614cb4d Fix configmap source path resolution for user-defined spec paths (#741)
2026-04-14 11:30:27 +05:30
prathamesh0 0bf1ea70d5 Add ip mode to external-services for static IP endpoints (#740)
ExternalName services only support DNS names (CNAME records), not
raw IP addresses. Add an ip mode that creates a headless Service +
Endpoints with a static IP, enabling routing to host-network
services like Kind gateway IPs or bare-metal endpoints.

Spec format:
  external-services:
    my-service:
      ip: 172.18.0.1
      port: 8899

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 17:53:23 +05:30
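The headless Service + Endpoints pair that ip mode creates can be sketched as plain manifests. The real code builds kubernetes client model objects; plain dicts keep this sketch dependency-free, and the field layout follows the core/v1 API.

```python
def ip_mode_manifests(name, namespace, ip, port):
    """Sketch of the two objects ip mode creates for a static IP endpoint:
    a headless, selector-less Service plus a manually-managed Endpoints
    object pointing at the IP."""
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Headless and no selector: the control plane won't manage
            # Endpoints, so DNS resolves straight to our static address.
            "clusterIP": "None",
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # Endpoints are matched to the Service by name, so they must agree.
        "metadata": {"name": name, "namespace": namespace},
        "subsets": [{
            "addresses": [{"ip": ip}],
            "ports": [{"port": port}],
        }],
    }
    return service, endpoints
```

With the spec example from the commit, `ip_mode_manifests("my-service", ns, "172.18.0.1", 8899)` yields objects that route in-cluster traffic to the host-network address, which ExternalName (CNAME-only) cannot do.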
prathamesh0 185ebf17f9 Fix failing k8s and external-stack CI test scripts (#739)
- Add `--perform-cluster-management` to container-registry, k8s-deployment-control, and database test scripts (`--skip-cluster-management` is now the default)
- Fix `wait_for_log_output()` in all k8s tests - "No logs available" is non-empty, so the check was passing prematurely
- Use HTTPS for container-registry catalog check (Caddy redirects HTTP->HTTPS)
- Fix external-stack sync test: sed pattern used `=` but spec is YAML (`: `), so the substitution never matched
- Workaround hyphenated env var name (`test-variable-1`) from upstream test-external-stack repo - docker compose v2 rejects hyphens
- Quote `echo $log_output` vars to prevent glob expansion in error output
- Use stack name (instead of cluster-id) derived namespace in k8s-deployment-control test
2026-04-02 15:00:57 +05:30
A. F. Dudley eb881ac179 merge upstream: resolve test-k8s-deploy.yml conflict, add workflow_call
Keep upstream's schedule/path triggers and install scripts, add
workflow_dispatch and workflow_call so publish.yml can gate on it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:30:14 +00:00
prathamesh0 7f11766b05 Migrate canonical source from Gitea to GitHub (#738)
- Update all self-references from `git.vdb.to/cerc-io/stack-orchestrator` to
  `github.com/cerc-io/stack-orchestrator` (setup.py, pyproject.toml, README,
  docs, install scripts, cloud-init scripts, stack READMEs)
- Fix release download URL pattern (`releases/download/latest` -> `releases/latest/download`)
- Port 5 Gitea-only CI workflows to GitHub Actions (k8s-deploy, k8s-deployment-control, container-registry, database, external-stack)
- Pin `shiv==1.0.8` in all workflows for reproducible builds
- Restrict smoke/deploy/webapp test push triggers to `main` only
- Remove `.gitea/` directory - Gitea repo to be archived
2026-04-02 10:58:14 +05:30
A. F. Dudley a76cae5c70 Merge branch 'main' of github.com:cerc-io/stack-orchestrator 2026-04-02 05:26:10 +00:00
A. F. Dudley 3da23683f6 fix: black formatting, line length, pyright type narrowing
- Apply black reformatting to deployer.py, cluster_info.py, deploy_k8s.py
- Shorten docstrings exceeding 88 char line limit
- Add assert for pyright Optional type narrowing on tls list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:22:25 +00:00
A. F. Dudley 63325f68a7 fix: deduplicate container ports by (port, protocol)
Compose files with both "8001" (TCP) and "8001/udp" produce separate
V1ContainerPort entries that k8s rejects as duplicates. Deduplicate
after parsing by (container_port, protocol) key.

This was blocking biscayne's agave deployment — the spec has both
TCP 8001 (ip_echo) and UDP 8001 (gossip), which generated two UDP
8001 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:18:10 +00:00
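The dedup-by-(port, protocol) fix above can be illustrated with a small parser. This is a sketch under the assumption that compose port entries look like "8001" or "8001/udp"; the function names are illustrative, not the repo's.

```python
def parse_port(entry):
    """Parse a compose container port like "8001" or "8001/udp" into a
    (port, protocol) pair; protocol defaults to TCP as in compose."""
    port, _, proto = str(entry).partition("/")
    return int(port), (proto or "tcp").upper()

def dedupe_ports(entries):
    """Deduplicate by (container_port, protocol) so repeated entries don't
    produce duplicate V1ContainerPort objects, which the apiserver
    rejects. First occurrence wins; insertion order is preserved."""
    seen = {}
    for entry in entries:
        seen.setdefault(parse_port(entry), entry)
    return list(seen)
```

Note that TCP 8001 and UDP 8001 survive as two distinct entries; only true duplicates of the same (port, protocol) pair collapse.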
A. F. Dudley ae1eae5b9b gate releases on k8s e2e test, remove per-push trigger
- test-k8s-deploy.yml: trigger on workflow_call and workflow_dispatch
  only (not every push/PR)
- publish.yml: add needs: e2e job that calls test-k8s-deploy.yml —
  release is blocked until the k8s e2e suite passes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 04:31:38 +00:00
A. F. Dudley c95eeeffb8 add k8s deploy e2e test to CI and pre-push hook
- .github/workflows/test-k8s-deploy.yml: new workflow that installs
  kind+kubectl and runs tests/k8s-deploy/run-deploy-test.sh on every
  push and PR. Same script used locally and in release validation.
- .pre-commit-config.yaml: add local pre-push hook that runs the k8s
  e2e test (~3 min) before pushing to remote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 04:30:37 +00:00
Prathamesh Musale 2afc7ad1ce Update refs in root readme; test publish workflow
2026-04-02 09:33:53 +05:30
A. F. Dudley 87761c7041 fix: imagePullPolicy for kind, job images, duplicate registry call, test namespace
- deploy_k8s.py: default imagePullPolicy to IfNotPresent for kind
  (local images loaded via kind load, not pulled from registry)
- cluster_info.py: add job images to image_set so they're loaded into kind
- deploy_k8s.py: remove duplicate create_registry_secret call (merge artifact)
- deploy_k8s.py: fix indentation in run_job job_pull_policy (replace_all damage)
- tests/k8s-deploy: update namespace from laconic-{id} to laconic-{stack_name}
  to match the new stack-derived namespace scheme from wd-a7b

All 15 k8s deploy e2e tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 23:34:51 +00:00
A. F. Dudley 66da312f67 fix: base36 IDs for kind-compatible cluster names, test --perform-cluster-management
- ids.py: use base36 (lowercase+digits) instead of base62 — kind
  cluster names must match ^[a-z0-9.-]+$
- k8s deploy test: pass --perform-cluster-management on first start
  since 'start' defaults to --skip-cluster-management

Found by running tests/k8s-deploy/run-deploy-test.sh locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 21:50:53 +00:00
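The base36 switch above is easy to sketch: kind cluster names must match ^[a-z0-9.-]+$, so base62's uppercase letters are out. A minimal encoder (illustrative, not the repo's ids.py):

```python
import string

# 0-9 then a-z: every digit satisfies kind's ^[a-z0-9.-]+$ constraint
BASE36_ALPHABET = string.digits + string.ascii_lowercase

def to_base36(n):
    """Encode a non-negative integer in base36 (lowercase + digits)."""
    if n < 0:
        raise ValueError("base36 encoding requires a non-negative integer")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(BASE36_ALPHABET[rem])
    return "".join(reversed(digits))
```

Python's int() can decode it back with base 36, which makes round-trip checks straightforward.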
A. F. Dudley 5bf96112d3 fix lint errors from merge: duplicate def, shadowed import, empty f-string
- deployment_create.py: remove duplicate create_registry_secret signature
- deploy_k8s.py: rename loop var 'config' to 'svc_config' (shadowed import)
- deploy_k8s.py: remove f-prefix from string without placeholders
- deploy_k8s.py: suppress pre-existing C901 on _create_volume_data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 19:17:45 +00:00
A. F. Dudley 549ac8c01d Merge fix/kind-mount-propagation: all local branches unified
Merges 6 local branches into main:
- enya: HostToContainer mount propagation for kind-mount-root
- fix/k8s-port-mappings-v5: port protocol parsing, namespace fix
- peirce: idempotent deploy (create-or-replace), update-envs rename
- prince: etcd cleanup whitelist
- wd-a7b: timestamp cluster IDs, stack-derived namespaces, jobs,
  multi-cert ingress, user secrets, _build_containers refactor
- fix/kind-mount-propagation: deployment prepare command, pebbles

Conflicts resolved keeping main's evolved multi-pod architecture
(get_deployments, per-pod Services, CA cert injection) while
incorporating branch additions (HostToContainer propagation,
user secrets, jobs support).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:26:05 +00:00
A. F. Dudley d50bd2b6d2 Merge wd-a7b: cluster-id/namespace naming, jobs, multi-cert, secrets
Combines timestamp-based cluster IDs, namespace derived from stack name,
_build_containers refactor, jobs support, multi-ingress certificates,
user-declared secrets, and label-based resource cleanup with the existing
idempotent deploy, mount propagation, and port mapping fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:22:07 +00:00
A. F. Dudley 2307696a66 Merge fix/k8s-port-mappings-v5 into fix/kind-mount-propagation
Resolve conflicts keeping HostToContainer propagation on mount root
entry and per-container resource layering from the propagation branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 17:06:25 +00:00
A. F. Dudley c64820ad5c Merge branch 'enya-ac868cc4-kind-mount-propagation-fix' into fix/kind-mount-propagation 2026-04-01 17:05:06 +00:00
A. F. Dudley 3e3f349151 Merge remote-tracking branch 'cerc-io.github.com/main'
# Conflicts:
#	stack_orchestrator/deploy/deployment_create.py
#	stack_orchestrator/deploy/k8s/deploy_k8s.py
2026-04-01 14:47:46 +00:00
Prathamesh Musale 33d3474d7d Fix registry secret created in wrong namespace (#998)
`create_registry_secret()` was hardcoded to use the "default" namespace,
but pods are deployed to the spec's configured namespace. The secret
must be in the same namespace as the pods for `imagePullSecrets` to work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/998
Co-authored-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Co-committed-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
2026-03-26 08:36:39 +00:00
Snake Game Developer 90e32ffd60 Support image-overrides in spec for testing
Spec can override container images:
  image-overrides:
    dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag

Merged with CLI overrides (CLI wins). Enables testing with
GHCR-pushed test tags without modifying compose files.

Also reverts the image-pull-policy spec key (not needed —
the fix is to use proper GHCR tags, not IfNotPresent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:02:23 +00:00
Snake Game Developer 1052a1d4e7 Support image-pull-policy in spec (default: Always)
Testing specs can set image-pull-policy: IfNotPresent so kind-loaded
local images are used instead of pulling from the registry. Production
specs omit the key and get the default Always behavior.

Root cause: with Always, k8s pulled the GHCR kubo image (with baked
R2 endpoint) instead of the locally-built image (with https://s3:443),
causing kubo to connect to R2 directly and get Unauthorized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 20:17:06 +00:00
Snake Game Developer f93541f7db Fix CA cert mounting: subPath for Go, expanduser for configmaps
- CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509
  picks them up (directory mount replaces the entire dir)
- get_configmaps() now expands ~ in paths via os.path.expanduser()
- Both changes discovered during testing with mkcert + MinIO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 19:27:14 +00:00
Snake Game Developer 713a81c245 Add external-services and ca-certificates spec keys
New spec.yml features for routing external service dependencies:

host mode (production):

  external-services:
    s3:
      host: example.com  # ExternalName Service
      port: 443

selector mode (testing):

  external-services:
    s3:
      selector: {app: mock}  # headless Service + Endpoints
      namespace: mock-ns
      port: 443

ca-certificates:
  - ~/.local/share/mkcert/rootCA.pem  # testing only

laconic-so creates the appropriate k8s Service type per mode:
- host mode: ExternalName (DNS CNAME to external provider)
- selector mode: headless Service + Endpoints with pod IPs
  discovered from the target namespace at deploy time

ca-certificates mounts CA files into all containers at
/etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun.

Also includes the previously committed PV Released state fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:25:47 +00:00
Snake Game Developer 98ff221a21 Fix PV rebinding after deployment stop/start cycle
deployment stop deletes the namespace (and PVCs) but preserves PVs
by default. On the next deployment start, PVs are in Released state
with a stale claimRef pointing at the deleted PVC. New PVCs cannot
bind to Released PVs, so pods get stuck in Pending.

Clear the claimRef on any Released PV during _create_volume_data()
so the PV returns to Available and can accept new PVC bindings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 07:47:23 +00:00
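The claimRef-clearing fix above boils down to a small decision: only PVs in phase Released need their stale claimRef removed. A dict-based sketch of the patch decision (the real code inspects V1PersistentVolume objects; names here are illustrative):

```python
def release_patch_for(pv):
    """Return the patch body that frees a Released PV for rebinding,
    or None when the PV should be left alone.

    `pv` is a plain-dict stand-in for the client's V1PersistentVolume.
    """
    if pv.get("status", {}).get("phase") != "Released":
        return None  # Bound/Available PVs are healthy; don't touch them
    # Clearing claimRef drops the stale pointer at the deleted PVC, so
    # the PV returns to Available and a new PVC can bind to it.
    return {"spec": {"claimRef": None}}
```

Applied during _create_volume_data(), this is what unsticks pods that would otherwise wait forever on a Pending PVC.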
A. F. Dudley 7141dc7637 file so-p3p: laconic-so should manage Caddy ingress image lifecycle
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 00:30:46 +00:00
A. F. Dudley 2555df06b5 fix: use patched Caddy ingress image with ACME storage fix
Switch from caddy/ingress:latest to ghcr.io/laconicnetwork/caddy-ingress:latest
which has the List()/Stat() fix for secret_store. This fixes multi-domain
ACME provisioning deadlock where the second domain's cert request fails
because List() returns mangled keys and Stat() returns wrong IsTerminal.

Source: LaconicNetwork/ingress@109d69a (fix/acme-account-reuse branch)

Fixes: so-o2o (partially — etcd backup investigation still needed)
Closes: ds-v22v (Caddy sequential provisioning no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:31:39 +00:00
A. F. Dudley 24cf22fea5 File pebbles: mount propagation merge + etcd cert backup broken
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:01:20 +00:00
A. F. Dudley 8d03083d0d feat: add kind-mount-root for unified Kind extraMount
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.

Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.

Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 21:28:40 +00:00
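The path rule above (volumes under the root resolve to /mnt/{relative_path}; volumes outside keep their own extraMounts) can be sketched with pathlib. Function and parameter names are assumptions for illustration:

```python
from pathlib import PurePosixPath

def pv_node_path(host_path, mount_root, node_root="/mnt"):
    """Resolve a volume's in-node PV path under kind-mount-root.

    If host_path lies under mount_root, return {node_root}/{relative},
    meaning the volume rides the single root extraMount. Otherwise
    return None, meaning the volume keeps its individual extraMount.
    """
    host = PurePosixPath(host_path)
    root = PurePosixPath(mount_root)
    try:
        rel = host.relative_to(root)
    except ValueError:
        return None  # outside the root: fall back to a per-volume mount
    return str(PurePosixPath(node_root) / rel)
```

Because new subdirectories under the root appear inside the existing /mnt mount, volumes can be added without recreating the Kind cluster, which is the point of the feature.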
A. F. Dudley 9109cfb7a1 feat: add token-file option for image-pull-secret registry auth
Adds token-file key to image-pull-secret spec config. Reads the
registry token from a file on disk instead of requiring an environment
variable. File path supports ~ expansion. Falls back to token-env
if token-file is not set or file doesn't exist.

This lets operators store the GHCR token in ~/.credentials/ alongside
other secrets, removing the need for ansible to pass REGISTRY_TOKEN
as an env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 19:30:44 +00:00
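The token-file-with-fallback resolution above can be sketched as one function. The parameter names mirror the spec keys (token-file, token-env); the exact file-handling details are an assumption, not the repo's implementation.

```python
import os

def read_registry_token(token_file=None, token_env=None, environ=os.environ):
    """Resolve the registry auth token.

    Prefer token-file (with ~ expansion) when it is set and the file
    exists; otherwise fall back to the token-env environment variable.
    Returns None when neither source yields a token.
    """
    if token_file:
        path = os.path.expanduser(token_file)
        if os.path.isfile(path):
            with open(path) as f:
                return f.read().strip()  # tolerate a trailing newline
    if token_env:
        return environ.get(token_env)
    return None
```

Injecting `environ` keeps the fallback testable; in production the default os.environ applies, matching the old REGISTRY_TOKEN behavior.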
A. F. Dudley 61afeb255c fix: keep cwd at repo root through entire restart, revert try/except
The stack path in spec.yml is relative — both create_operation and
up_operation need cwd at the repo root for stack_is_external() to
resolve it. Move os.chdir(prev_cwd) to after up_operation completes
instead of between the two operations.

Reverts the SystemExit catch in call_stack_deploy_start — the root
cause was cwd, not the hook.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:54:46 +00:00
A. F. Dudley 32f6e57b70 fix: ConfigMap volumes don't force Recreate strategy + resilient hooks
Two fixes for multi-deployment:

1. _pod_has_pvcs now excludes ConfigMap volumes from PVC detection.
   Pods with only ConfigMap volumes (like maintenance) correctly get
   RollingUpdate strategy instead of Recreate.

2. call_stack_deploy_start catches SystemExit when stack path doesn't
   resolve from cwd (common during restart). Most stacks don't have
   deploy hooks, so this is non-fatal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:51:58 +00:00
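Fix 1 above reduces to a strategy picker that ignores ConfigMap-backed volumes when deciding whether a pod "has PVCs". A dict-based sketch (the real _pod_has_pvcs inspects client volume objects; these names are illustrative):

```python
def update_strategy(volumes):
    """Pick a Deployment update strategy from the pod's volume specs.

    Only PVC-backed volumes force Recreate (two pods can't share a
    ReadWriteOnce claim during a rolling update). ConfigMap volumes are
    read-only projections and must not count — that was the bug: pods
    with only ConfigMap volumes were wrongly pinned to Recreate.
    """
    has_pvc = any("persistentVolumeClaim" in v for v in volumes)
    return "Recreate" if has_pvc else "RollingUpdate"
```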
A. F. Dudley 6923e1c23b refactor: extract methods from K8sDeployer.up to fix C901 complexity
Split up() into _setup_cluster(), _create_ingress(), _create_nodeports().
Reduces cyclomatic complexity below the flake8 threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:20:50 +00:00
A. F. Dudley 5b8303f8f9 fix: resolve stack path from repo root + update deploy test
- chdir to git repo root before create_operation so relative stack
  paths in spec.yml resolve correctly via stack_is_external()
- Update deploy test: config.env is now regenerated from spec on
  --update (matching 72aabe7d behavior), verify backup exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:14:47 +00:00
A. F. Dudley 0ac886bf95 fix: chdir to repo root before create_operation in restart
The spec's "stack:" value is a relative path that must resolve from
the repo root. stack_is_external() checks Path(stack).exists() from
cwd, which fails when cwd isn't the repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:06:38 +00:00
A. F. Dudley 2484abfcce fix: use git rev-parse for repo root in restart command
The repo_root calculation assumed stack paths are always 4 levels deep
(stack_orchestrator/data/stacks/name). External stacks with different
nesting (e.g. stack-orchestrator/stacks/name = 3 levels) got the wrong
root, causing --spec-file resolution to fail.

Use git rev-parse --show-toplevel instead.

Fixes: so-k1k

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:03:24 +00:00
A. F. Dudley 967936e524 Multi-deployment: one k8s Deployment per pod in stack.yml
Each pod entry in stack.yml now creates its own k8s Deployment with
independent lifecycle and update strategy. Pods with PVCs get Recreate,
pods without get RollingUpdate. This enables maintenance services that
survive main pod restarts.

- cluster_info: get_deployments() builds per-pod Deployments, Services
- cluster_info: Ingress routes to correct per-pod Service
- deploy_k8s: _create_deployment() iterates all Deployments/Services
- deployment: restart swaps Ingress to maintenance service during Recreate
- spec: add maintenance-service key

Single-pod stacks are backward compatible (same resource names).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 01:40:45 +00:00
A. F. Dudley 6ace024cd3 fix: use replace instead of patch for k8s resource updates
Strategic merge patch preserves fields not present in the patch body.
This means removed volumes, ports, and env vars persist in the running
Deployment after a restart. Replace sends the complete spec built from
the current compose files — removed fields are actually deleted.

Affects Deployment, Service, Ingress, and NodePort updates. Service
replace preserves clusterIP (immutable field) by reading it from the
existing resource before replacing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 03:44:57 +00:00
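The clusterIP preservation step above is the subtle part of switching from patch to replace: clusterIP is immutable, so a replace that omits it (or sends a different one) is rejected. A dict-based sketch of the pre-replace merge (the real code works on V1Service objects; names here are illustrative):

```python
def prepare_service_replace(desired, existing):
    """Copy the immutable clusterIP from the live Service into the
    desired spec before calling replace.

    `desired` is the complete spec rebuilt from the compose files;
    `existing` is the Service currently in the cluster. Returns a new
    dict; the inputs are not mutated.
    """
    merged = {**desired, "spec": {**desired.get("spec", {})}}
    cluster_ip = existing.get("spec", {}).get("clusterIP")
    if cluster_ip:
        merged["spec"]["clusterIP"] = cluster_ip
    return merged
```

Everything else in the desired spec wins, which is exactly why replace was chosen: fields removed from the compose files disappear from the live object instead of lingering as they do under a strategic merge patch.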
A. F. Dudley ea610bb8d6 Merge branch 'cv-c3c-image-flag-for-restart'
# Conflicts:
#	stack_orchestrator/deploy/k8s/deploy_k8s.py
2026-03-18 23:04:55 +00:00