ExternalName services only support DNS names (CNAME records), not
raw IP addresses. Add an ip mode that creates a headless Service +
Endpoints with a static IP, enabling routing to host-network
services like Kind gateway IPs or bare-metal endpoints.
Spec format:
  external-services:
    my-service:
      ip: 172.18.0.1
      port: 8899
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
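A minimal sketch of the ip mode described above, using plain dicts rather than client objects. `build_ip_service` is a hypothetical helper for illustration, not the function in laconic-so; the manifest fields are standard Kubernetes Service and Endpoints fields.

```python
from typing import Any, Dict, Tuple


def build_ip_service(
    name: str, ip: str, port: int
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (Service, Endpoints) manifests routing `name` to a static IP.

    Hypothetical helper: names and structure are assumptions for illustration.
    """
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # Headless and selector-less: Kubernetes does not manage the
            # endpoints itself, so the Endpoints object below is used as-is.
            "clusterIP": "None",
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # The Endpoints object must share the Service's name.
        "metadata": {"name": name},
        "subsets": [{"addresses": [{"ip": ip}], "ports": [{"port": port}]}],
    }
    return service, endpoints
```

With the spec example above, `build_ip_service("my-service", "172.18.0.1", 8899)` yields a Service with no selector and an Endpoints object carrying the raw IP, which an ExternalName Service cannot express.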
- Add `--perform-cluster-management` to container-registry, k8s-deployment-control, and database test scripts (`--skip-cluster-management` is now the default)
- Fix `wait_for_log_output()` in all k8s tests - "No logs available" is non-empty, so the check was passing prematurely
- Use HTTPS for container-registry catalog check (Caddy redirects HTTP->HTTPS)
- Fix external-stack sync test: sed pattern used `=` but spec is YAML (`: `), so the substitution never matched
- Work around the hyphenated env var name (`test-variable-1`) from the upstream test-external-stack repo - docker compose v2 rejects hyphens
- Quote `echo $log_output` vars to prevent glob expansion in error output
- Use the namespace derived from the stack name (instead of the cluster-id) in the k8s-deployment-control test
Keep upstream's schedule/path triggers and install scripts, add
workflow_dispatch and workflow_call so publish.yml can gate on it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update all self-references from `git.vdb.to/cerc-io/stack-orchestrator` to
`github.com/cerc-io/stack-orchestrator` (setup.py, pyproject.toml, README,
docs, install scripts, cloud-init scripts, stack READMEs)
- Fix release download URL pattern (`releases/download/latest` -> `releases/latest/download`)
- Port 5 Gitea-only CI workflows to GitHub Actions (k8s-deploy, k8s-deployment-control, container-registry, database, external-stack)
- Pin `shiv==1.0.8` in all workflows for reproducible builds
- Restrict smoke/deploy/webapp test push triggers to `main` only
- Remove `.gitea/` directory - Gitea repo to be archived
- Apply black reformatting to deployer.py, cluster_info.py, deploy_k8s.py
- Shorten docstrings exceeding 88 char line limit
- Add assert for pyright Optional type narrowing on tls list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compose files with both "8001" (TCP) and "8001/udp" produce separate
V1ContainerPort entries that k8s rejects as duplicates. Deduplicate
after parsing by (container_port, protocol) key.
This was blocking biscayne's agave deployment — the spec has both
TCP 8001 (ip_echo) and UDP 8001 (gossip), which generated two UDP
8001 entries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
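The deduplication described above can be sketched as follows. This is an illustration under assumptions: `ContainerPort`, `parse_port` and `dedupe_ports` are hypothetical stand-ins, and the real code works on V1ContainerPort objects rather than this tuple.

```python
from typing import List, NamedTuple


class ContainerPort(NamedTuple):
    """Hypothetical stand-in for V1ContainerPort (only the key fields)."""
    container_port: int
    protocol: str


def parse_port(entry: str) -> ContainerPort:
    """Parse a compose-style port such as "8001" or "8001/udp"."""
    port, _, proto = entry.partition("/")
    # For a "host:container" mapping, keep the container side.
    container = port.split(":")[-1]
    return ContainerPort(int(container), (proto or "tcp").upper())


def dedupe_ports(ports: List[ContainerPort]) -> List[ContainerPort]:
    """Drop repeats of the same (container_port, protocol), keeping order."""
    seen = set()
    unique = []
    for p in ports:
        key = (p.container_port, p.protocol)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```

TCP 8001 and UDP 8001 have different keys and both survive; only exact repeats of the same port and protocol are dropped.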
- test-k8s-deploy.yml: trigger on workflow_call and workflow_dispatch
only (not every push/PR)
- publish.yml: add needs: e2e job that calls test-k8s-deploy.yml —
release is blocked until the k8s e2e suite passes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .github/workflows/test-k8s-deploy.yml: new workflow that installs
kind+kubectl and runs tests/k8s-deploy/run-deploy-test.sh on every
push and PR. Same script used locally and in release validation.
- .pre-commit-config.yaml: add local pre-push hook that runs the k8s
e2e test (~3 min) before pushing to remote.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- deploy_k8s.py: default imagePullPolicy to IfNotPresent for kind
(local images loaded via kind load, not pulled from registry)
- cluster_info.py: add job images to image_set so they're loaded into kind
- deploy_k8s.py: remove duplicate create_registry_secret call (merge artifact)
- deploy_k8s.py: fix indentation in run_job job_pull_policy (replace_all damage)
- tests/k8s-deploy: update namespace from laconic-{id} to laconic-{stack_name}
to match the new stack-derived namespace scheme from wd-a7b
All 15 k8s deploy e2e tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ids.py: use base36 (lowercase+digits) instead of base62 — kind
cluster names must match ^[a-z0-9.-]+$
- k8s deploy test: pass --perform-cluster-management on first start
since 'start' defaults to --skip-cluster-management
Found by running tests/k8s-deploy/run-deploy-test.sh locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
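For reference, a base36 encoder of the kind the ids.py change implies. This is a sketch, not the code in ids.py; `to_base36` is a hypothetical name, and the check uses the pattern quoted above.

```python
import re
import string

# Digits plus lowercase letters: 36 symbols, all valid in kind cluster names.
ALPHABET = string.digits + string.ascii_lowercase
KIND_NAME_RE = re.compile(r"^[a-z0-9.-]+$")


def to_base36(n: int) -> str:
    """Encode a non-negative integer using only [0-9a-z]."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

Base62 adds uppercase letters, which the kind name pattern rejects; base36 output always matches it, at the cost of slightly longer IDs.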
Combines timestamp-based cluster IDs, namespace derived from stack name,
_build_containers refactor, jobs support, multi-ingress certificates,
user-declared secrets, and label-based resource cleanup with the existing
idempotent deploy, mount propagation, and port mapping fixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts keeping HostToContainer propagation on mount root
entry and per-container resource layering from the propagation branch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`create_registry_secret()` was hardcoded to use the "default" namespace,
but pods are deployed to the spec's configured namespace. The secret
must be in the same namespace as the pods for `imagePullSecrets` to work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/998
Co-authored-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Co-committed-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Spec can override container images:
  image-overrides:
    dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag
Merged with CLI overrides (CLI wins). Enables testing with
GHCR-pushed test tags without modifying compose files.
Also reverts the image-pull-policy spec key (not needed —
the fix is to use proper GHCR tags, not IfNotPresent).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
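The merge rule ("CLI wins") amounts to a dict update in the right order. A small sketch; `merge_image_overrides` and the image names in the usage note are hypothetical.

```python
from typing import Dict, Optional


def merge_image_overrides(
    spec_overrides: Optional[Dict[str, str]],
    cli_overrides: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Merge container -> image overrides; CLI entries take precedence."""
    merged = dict(spec_overrides or {})
    # Applied last, so a CLI override replaces a spec override for the
    # same container while leaving other spec overrides in place.
    merged.update(cli_overrides or {})
    return merged
```

For example, a spec entry `{"backend": "registry.example/app:test-tag"}` combined with a CLI entry `{"backend": "registry.example/app:sha123"}` resolves to the CLI image.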
Testing specs can set image-pull-policy: IfNotPresent so kind-loaded
local images are used instead of pulling from the registry. Production
specs omit the key and get the default Always behavior.
Root cause: with Always, k8s pulled the GHCR kubo image (with baked
R2 endpoint) instead of the locally-built image (with https://s3:443),
causing kubo to connect to R2 directly and get Unauthorized.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509
picks them up (directory mount replaces the entire dir)
- get_configmaps() now expands ~ in paths via os.path.expanduser()
- Both changes discovered during testing with mkcert + MinIO
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New spec.yml features for routing external service dependencies:
  external-services:
    s3:
      host: example.com  # ExternalName Service (production)
      port: 443
    s3:
      selector: {app: mock}  # headless Service + Endpoints (testing)
      namespace: mock-ns
      port: 443
  ca-certificates:
    - ~/.local/share/mkcert/rootCA.pem  # testing only
laconic-so creates the appropriate k8s Service type per mode:
- host mode: ExternalName (DNS CNAME to external provider)
- selector mode: headless Service + Endpoints with pod IPs
discovered from the target namespace at deploy time
ca-certificates mounts CA files into all containers at
/etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun.
Also includes the previously committed PV Released state fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deployment stop deletes the namespace (and PVCs) but preserves PVs
by default. On the next deployment start, PVs are in Released state
with a stale claimRef pointing at the deleted PVC. New PVCs cannot
bind to Released PVs, so pods get stuck in Pending.
Clear the claimRef on any Released PV during _create_volume_data()
so the PV returns to Available and can accept new PVC bindings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
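A sketch of the selection logic, operating on PV objects represented as plain dicts. `claim_ref_patches` is a hypothetical helper; the actual change lives in _create_volume_data() and talks to the cluster API.

```python
from typing import Any, Dict, List


def claim_ref_patches(pvs: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map PV name -> patch body for every Released PV with a stale claimRef.

    Setting claimRef to null in a patch removes the field, after which the
    PV returns to Available and can bind to a new PVC.
    """
    patches = {}
    for pv in pvs:
        released = pv.get("status", {}).get("phase") == "Released"
        if released and pv.get("spec", {}).get("claimRef"):
            patches[pv["metadata"]["name"]] = {"spec": {"claimRef": None}}
    return patches
```

Bound and Available PVs are left alone, so the step is safe to run on every deployment start.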
Switch from caddy/ingress:latest to ghcr.io/laconicnetwork/caddy-ingress:latest
which has the List()/Stat() fix for secret_store. This fixes multi-domain
ACME provisioning deadlock where the second domain's cert request fails
because List() returns mangled keys and Stat() returns wrong IsTerminal.
Source: LaconicNetwork/ingress@109d69a (fix/acme-account-reuse branch)
Fixes: so-o2o (partially — etcd backup investigation still needed)
Closes: ds-v22v (Caddy sequential provisioning no longer needed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.
Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.
Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
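The path rule can be sketched like this. `pv_path_in_node` is a hypothetical helper, and the sketch assumes absolute, already-normalized POSIX paths.

```python
from pathlib import PurePosixPath
from typing import Optional


def pv_path_in_node(host_path: str, mount_root: Optional[str]) -> Optional[str]:
    """Return the in-node path under /mnt for a volume under the mount root.

    Returns None when no root is set or the volume lies outside it, which
    in this sketch signals that an individual extraMount is still needed.
    """
    if not mount_root:
        return None
    try:
        relative = PurePosixPath(host_path).relative_to(PurePosixPath(mount_root))
    except ValueError:
        # host_path is not located under mount_root.
        return None
    return str(PurePosixPath("/mnt") / relative)
```

Because every volume under the root maps through the same single extraMount, adding a volume only changes PV paths and does not touch the Kind node configuration.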
Adds token-file key to image-pull-secret spec config. Reads the
registry token from a file on disk instead of requiring an environment
variable. The file path supports ~ expansion. Falls back to token-env
if token-file is not set or the file doesn't exist.
This lets operators store the GHCR token in ~/.credentials/ alongside
other secrets, removing the need for ansible to pass REGISTRY_TOKEN
as an env var.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
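A sketch of the lookup order described above. `resolve_registry_token` is a hypothetical name, and the config is assumed to be a flat mapping holding the `token-file` and `token-env` keys.

```python
import os
from typing import Mapping, Optional


def resolve_registry_token(
    config: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer token-file (with ~ expansion); fall back to token-env."""
    token_file = config.get("token-file")
    if token_file:
        path = os.path.expanduser(token_file)
        if os.path.isfile(path):
            with open(path) as f:
                # Strip the trailing newline most editors add.
                return f.read().strip()
    token_env = config.get("token-env")
    if token_env:
        return env.get(token_env)
    return None
```

Passing `env` explicitly keeps the function testable without mutating the process environment.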
The stack path in spec.yml is relative — both create_operation and
up_operation need cwd at the repo root for stack_is_external() to
resolve it. Move os.chdir(prev_cwd) to after up_operation completes
instead of between the two operations.
Reverts the SystemExit catch in call_stack_deploy_start — the root
cause was cwd, not the hook.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes for multi-deployment:
1. _pod_has_pvcs now excludes ConfigMap volumes from PVC detection.
Pods with only ConfigMap volumes (like maintenance) correctly get
RollingUpdate strategy instead of Recreate.
2. call_stack_deploy_start catches SystemExit when stack path doesn't
resolve from cwd (common during restart). Most stacks don't have
deploy hooks, so this is non-fatal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
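The first fix can be illustrated with a small sketch. The function names and the representation of volumes as plain name lists are assumptions; the real _pod_has_pvcs inspects parsed compose and spec data.

```python
from typing import Iterable, Set


def pod_has_pvcs(volume_names: Iterable[str], configmap_names: Set[str]) -> bool:
    """True if any volume is backed by a PVC rather than a ConfigMap."""
    return any(name not in configmap_names for name in volume_names)


def update_strategy(volume_names: Iterable[str], configmap_names: Set[str]) -> str:
    """Recreate for pods holding PVCs, RollingUpdate otherwise."""
    if pod_has_pvcs(volume_names, configmap_names):
        return "Recreate"
    return "RollingUpdate"
```

A pod that mounts only ConfigMap volumes, such as a maintenance page, therefore gets RollingUpdate.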
Split up() into _setup_cluster(), _create_ingress(), _create_nodeports().
Reduces cyclomatic complexity below the flake8 threshold.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- chdir to git repo root before create_operation so relative stack
paths in spec.yml resolve correctly via stack_is_external()
- Update deploy test: config.env is now regenerated from spec on
--update (matching 72aabe7d behavior), verify backup exists
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spec's "stack:" value is a relative path that must resolve from
the repo root. stack_is_external() checks Path(stack).exists() from
cwd, which fails when cwd isn't the repo root.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The repo_root calculation assumed stack paths are always 4 levels deep
(stack_orchestrator/data/stacks/name). External stacks with different
nesting (e.g. stack-orchestrator/stacks/name = 3 levels) got the wrong
root, causing --spec-file resolution to fail.
Use git rev-parse --show-toplevel instead.
Fixes: so-k1k
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each pod entry in stack.yml now creates its own k8s Deployment with
independent lifecycle and update strategy. Pods with PVCs get Recreate,
pods without get RollingUpdate. This enables maintenance services that
survive main pod restarts.
- cluster_info: get_deployments() builds per-pod Deployments, Services
- cluster_info: Ingress routes to correct per-pod Service
- deploy_k8s: _create_deployment() iterates all Deployments/Services
- deployment: restart swaps Ingress to maintenance service during Recreate
- spec: add maintenance-service key
Single-pod stacks are backward compatible (same resource names).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strategic merge patch preserves fields not present in the patch body.
This means removed volumes, ports, and env vars persist in the running
Deployment after a restart. Replace sends the complete spec built from
the current compose files — removed fields are actually deleted.
Affects Deployment, Service, Ingress, and NodePort updates. Service
replace preserves clusterIP (immutable field) by reading it from the
existing resource before replacing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
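A sketch of the clusterIP handling on Service replace, using plain dicts. `prepare_service_replace` is a hypothetical helper; a real replace call may need other server-managed metadata carried over as well.

```python
import copy
from typing import Any, Dict


def prepare_service_replace(
    existing: Dict[str, Any], desired: Dict[str, Any]
) -> Dict[str, Any]:
    """Build a replace body from `desired`, keeping the immutable clusterIP."""
    body = copy.deepcopy(desired)
    cluster_ip = existing.get("spec", {}).get("clusterIP")
    if cluster_ip:
        # clusterIP cannot change on an existing Service, so the replace
        # body must repeat the value the API server assigned.
        body.setdefault("spec", {})["clusterIP"] = cluster_ip
    return body
```

Everything else comes from the freshly built spec, so a port removed from the compose file is absent from the replace body and disappears from the live Service.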
Allows callers to override container images during restart, e.g.:
  laconic-so deployment restart --image backend=ghcr.io/org/app:sha123
The override is applied to the k8s Deployment spec before
create-or-patch. Docker/compose deployers accept the parameter
but ignore it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_write_config_file() now reads each file listed under the credentials-files
top-level spec key and appends its contents to config.env after config vars.
Paths support ~ expansion. Missing files fail hard with sys.exit(1).
Also adds get_credentials_files() to Spec class following the same pattern
as get_image_registry_config().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
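A sketch of the append step, assuming config.env has already been written with the config vars. `append_credentials` is a hypothetical stand-in for the relevant part of _write_config_file().

```python
import os
import sys
from typing import List


def append_credentials(config_env_path: str, credentials_files: List[str]) -> None:
    """Append each credentials file to config.env; exit(1) if one is missing."""
    chunks = []
    for entry in credentials_files:
        path = os.path.expanduser(entry)
        if not os.path.isfile(path):
            print(f"credentials file not found: {path}", file=sys.stderr)
            sys.exit(1)
        with open(path) as f:
            content = f.read()
        # Make sure each file ends with a newline so entries don't run together.
        chunks.append(content if content.endswith("\n") else content + "\n")
    # Write only after every file was read, so a missing file leaves
    # config.env untouched.
    with open(config_env_path, "a") as out:
        out.writelines(chunks)
```

Reading everything before writing is a design choice of this sketch; the commit message only states that missing files fail hard.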
The spec key `registry-credentials` was ambiguous — could mean container
registry auth or Laconic registry config. Rename to `image-pull-secret`
which matches the k8s secret name it creates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Containers referenced in spec.yml http-proxy routes now get TCP
readiness probes on the proxied port. This tells k8s when a container
is actually ready to serve traffic.
Without readiness probes, k8s considers pods ready immediately after
start, which means:
- Rolling updates cut over before the app is listening
- Broken containers look "ready" and receive traffic (502s)
- kubectl rollout undo has nothing to roll back to
The probes use TCP socket checks (not HTTP) to work with any protocol.
Initial delay 5s, check every 10s, fail after 3 consecutive failures.
Closes so-l2l part C.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
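The probe settings above correspond to the following manifest fragment, shown here as a dict builder. `tcp_readiness_probe` is a hypothetical helper; the field names are the standard Kubernetes probe fields.

```python
from typing import Any, Dict


def tcp_readiness_probe(port: int) -> Dict[str, Any]:
    """Readiness probe that succeeds once `port` accepts TCP connections."""
    return {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 5,   # wait 5s after container start
        "periodSeconds": 10,        # then check every 10s
        "failureThreshold": 3,      # mark unready after 3 straight failures
    }
```

The result would be attached as `readinessProbe` on the container that serves the proxied port.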
Replace the destroy-and-recreate deployment model with in-place updates.
deploy_k8s.py: All resource creation (Deployment, Service, Ingress,
NodePort, ConfigMap) now uses create-or-update semantics. If a resource
already exists (409 Conflict), it patches instead of failing. For
Deployments, this triggers a k8s rolling update — old pods serve traffic
until new pods pass readiness checks.
deployment.py: restart() no longer calls down(). It just calls up()
which patches existing resources. No namespace deletion, no downtime
gap, no race conditions. k8s handles the rollout.
This gives:
- Zero-downtime deploys (old pods serve during rollout)
- Safe failure (if new pods fail readiness, the rollout stalls and old pods keep serving)
- Manual rollback via kubectl rollout undo
Closes so-l2l (parts A and B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
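The create-or-update pattern, sketched against an in-memory stand-in so it runs without a cluster. `ApiError`, `InMemoryApi` and `create_or_patch` are all hypothetical; with the official Python client the conflict surfaces as an ApiException whose `status` is 409.

```python
from typing import Any, Dict


class ApiError(Exception):
    """Hypothetical stand-in for the client's API exception."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


class InMemoryApi:
    """Toy API: create fails with 409 if the object already exists."""

    def __init__(self) -> None:
        self.objects: Dict[str, Dict[str, Any]] = {}

    def create(self, name: str, body: Dict[str, Any]) -> None:
        if name in self.objects:
            raise ApiError(409)
        self.objects[name] = dict(body)

    def patch(self, name: str, body: Dict[str, Any]) -> None:
        self.objects[name].update(body)


def create_or_patch(api: InMemoryApi, name: str, body: Dict[str, Any]) -> str:
    """Try to create; on 409 Conflict, patch the existing object instead."""
    try:
        api.create(name, body)
        return "created"
    except ApiError as err:
        if err.status != 409:
            raise  # anything other than "already exists" is a real failure
    api.patch(name, body)
    return "patched"
```

Calling it twice for the same name creates on the first call and patches on the second, which is what lets restart() skip down() entirely.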
create_registry_secret() hardcoded namespace="default" but deployments
now run in dedicated laconic-* namespaces. The secret was invisible
to pods in the deployment namespace, causing 401 on GHCR pulls.
Accept namespace as parameter, passed from deploy_k8s.py which knows
the correct namespace.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverts the label-based deletion approach — resources created by older
laconic-so lack labels, so label queries return empty results. Namespace
deletion is the only reliable cleanup.
Adds _wait_for_namespace_gone() so down() blocks until the namespace
is fully terminated. This prevents the race condition where up() tries
to create resources in a still-terminating namespace (403 Forbidden).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
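A polling loop of the kind _wait_for_namespace_gone() implies, written against an injected check so it runs standalone. `wait_until_gone`, its parameters and its default timeout are assumptions; the real function queries the cluster.

```python
import time
from typing import Callable


def wait_until_gone(
    exists: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `exists` until it returns False; give up after `timeout` seconds.

    Returns True once the resource is gone, False on timeout.
    """
    deadline = clock() + timeout
    while exists():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

In down(), `exists` would wrap a namespace read that reports False once the API returns 404, so up() never runs against a namespace that is still terminating.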
down() deleted the entire namespace when it wasn't explicitly set in
the spec. This causes a race condition on restart: up() tries to create
resources in a namespace that's still terminating, getting 403 Forbidden.
Always use _delete_resources_by_label() instead. The namespace is cheap
to keep and required for immediate up() after down(). This also matches
the shared-namespace behavior, making down() consistent regardless of
namespace configuration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
config.env is regenerated from spec.yml on every deploy create and
restart, silently overwriting manual edits. Add a header comment
explaining this so operators know to edit spec.yml instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The secret name `{app}-registry` is ambiguous — it could be a container
registry credential or a Laconic registry config. Rename to
`{app}-image-pull-secret` which clearly describes its purpose as a
Kubernetes imagePullSecret for private container registries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kind deployments used imagePullPolicy=None (which Kubernetes defaults to
IfNotPresent for any tag other than :latest), so the kind node caches
images by tag and never re-pulls from
the local registry. After a container rebuild + registry push, the pod
keeps using the stale cached image.
Set Always for all deployment types so k8s re-pulls on every pod
restart. With a local registry this adds negligible overhead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>