`host-metrics` is a native stack -- spec.yml and `laconic-so --stack`
both take the bare stack name, not a path. Replace the `docker ps -qf`
filter with `laconic-so deployment --dir ... logs` so the verify
recipe works regardless of the laconic deployment-hash prefix on the
container name.
Add telegraf-entrypoint.sh to render telegraf.conf from the template
(replacing @@HOST_TAG_BLOCK@@ and @@ZFS_BLOCK@@ markers via awk) and
exec telegraf. Add test-telegraf-entrypoint.sh with 8 offline tests
(10 assertions) covering marker substitution and required-env validation.
Fix run() stderr redirect from >/dev/null 2>&1 to >/dev/null so that
entrypoint error output reaches the T6-T8 assertion captures.
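The marker substitution the entrypoint performs with awk can be sketched in Python (function and marker-handling names here are illustrative, not the shipped script):

```python
def render_template(template: str, blocks: dict) -> str:
    """Replace lines consisting of an @@MARKER@@ with a multi-line block."""
    out = []
    for line in template.splitlines():
        key = line.strip()
        if key.startswith("@@") and key.endswith("@@") and key in blocks:
            out.append(blocks[key])  # substitute the rendered block
        else:
            out.append(line)         # pass everything else through
    return "\n".join(out)
```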
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Operator-reported: editing source files mounted into a service via
bind volumes (alert rules, dashboards, scripts, templates, telegraf
config) and running 'laconic-so deployment ... restart' did not
take effect. Operator had to fall back to 'stop && start' to pick
up changes.
Root cause: 'restart' calls up_operation, which translates to
'docker compose up -d'. Compose's up only recreates a container
when the *service definition* itself (image, env, ports, volume
declarations) changes. Bind-mount target file content is not part
of that hash, so the running container kept its old in-memory
state (e.g. Grafana's pre-edit provisioning).
Add force_recreate kwarg through the deployer interface and have
restart pass force_recreate=True. compose path threads through to
python_on_whales' compose.up(force_recreate=...). k8s path accepts
the kwarg but is a no-op for now (rolling update on
unchanged-spec needs a separate fix that stamps the
kubectl.kubernetes.io/restartedAt annotation on managed
Deployments; tracked in a follow-up).
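The kwarg threading can be sketched as follows (class and method names are illustrative; the real compose path forwards to python_on_whales' compose.up):

```python
class ComposeDeployer:
    def up(self, detach=True, force_recreate=False):
        # real path: docker.compose.up(detach=detach, force_recreate=force_recreate)
        return {"detach": detach, "force_recreate": force_recreate}

class K8sDeployer:
    def up(self, detach=True, force_recreate=False):
        # kwarg accepted but a no-op for now (needs restartedAt annotation stamping)
        return {"detach": detach}

def restart(deployer):
    # restart forces recreation so edits to bind-mounted files take effect
    return deployer.up(force_recreate=True)
```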
- Cluster setup only considered images from the containers list in `stack.yml` for kind-loading into the cluster; images from `image_overrides` in the spec were not being loaded
- This also sometimes caused laconic-so to attempt kind-loading images not present locally
- Fix: union the `image_overrides` values (user-specified local images) with those from the container list, filtered to only the images actually present on the docker host
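A minimal sketch of the union-and-filter step (function and argument names are illustrative):

```python
def images_to_kind_load(container_images, image_overrides, local_images):
    # union the stack's container images with user-specified overrides,
    # then keep only images actually present on the docker host
    candidates = set(container_images) | set(image_overrides.values())
    return sorted(img for img in candidates if img in set(local_images))
```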
- `deploy create` now copies each pod's `commands.py` into `<deployment>/hooks/`. `call_stack_deploy_start` loads from there, so `deployment start` / `restart` no longer need the live stack source on disk to run the `start()` hook
- Only the `start()` hook is affected. `init`, `setup`, and `create` still load from the live source — they only run at `deploy create` time, when the source is guaranteed to be present
- Multi-repo stacks produce `hooks/commands_0.py`, `hooks/commands_1.py`, …; `call_stack_deploy_start` loads them all in sorted order
- Adds `tests/k8s-deploy/run-restart-test.sh` covering the full single-repo restart cycle (v1 -> mutate working tree -> `restart` re-copies and re-executes v2) and the multi-repo file-naming + multi-hook invocation. Wired into the existing **K8s Deploy Test** workflow
Closes so-p3p:
- New spec key `caddy-ingress-image`: on fresh install, deploys Caddy with this image; on subsequent `deployment start`, patches the running Caddy Deployment if the image differs. Defaults to the manifest's hardcoded image when absent
- When the spec key is absent, SO does **not** touch a running Caddy — avoids silently reverting an image set out-of-band (ansible playbook, another deployment's spec)
- `strategy: Recreate` on the Caddy Deployment manifest (required — hostPort 80/443 deadlocks rolling updates)
- Reconcile runs under both `--perform-cluster-management` and the default `--skip-cluster-management` (it's a k8s-API patch, not a cluster-lifecycle op)
- Image template by container name rather than string match, so the spec override wins regardless of what the shipped manifest hardcodes
- Cluster-scoped caveat documented: `caddy-system` is shared across deployments, so the last `deployment start` that sets the key wins for everyone
- **Kind extraMount compatibility**: fail fast at `deployment start` when a new deployment's mounts don't match the running cluster; warn when the first cluster is created without a `kind-mount-root` umbrella; replace the cryptic `ConfigException` with readable errors when the cluster is missing
- **Auto-ConfigMap for file-level host-path compose volumes** (so-7fc): `../config/foo.sh:/opt/foo.sh`-style binds become per-namespace ConfigMaps at deploy start instead of aliasing via the kind extraMount chain. `deploy create` rejects `:rw`, subdirs, and over-budget sources. Deployment-dir layout unchanged
- **Namespace ownership**: stamp the namespace with `laconic.com/deployment-dir` on create; fail loudly if another deployment tries to land in the same namespace. Pre-existing namespaces adopt ownership on next start
- **deployment-id / cluster-id decoupling**: split the two roles (kube context vs resource-name prefix) into separate `deployment.yml` fields. Backward-compat fallback keeps existing resource names stable
- Close stale pebbles `so-n1n` and `so-ad7`
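The patch-if-differs decision above can be sketched as (a simplification with illustrative names, not the shipped code):

```python
def caddy_image_action(spec, running_image, manifest_default):
    """Return the image to deploy/patch, or None to leave Caddy untouched."""
    override = spec.get("caddy-ingress-image")
    if running_image is None:
        # fresh install: use the override, else the manifest's hardcoded image
        return override or manifest_default
    if override is None:
        # key absent: never touch a running Caddy (may be set out-of-band)
        return None
    # patch only when the running image actually differs
    return override if override != running_image else None
```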
Implementation shipped in PR #746. Woodburn migration (one-shot
host-kubectl export to seed the backup file) completed manually.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the etcd-surgery persistence approach with a CronJob that dumps `manager=caddy` Secrets to `{kind-mount-root}/caddy-cert-backup/` every 5 min, and a restore step that applies the file before Caddy starts on a fresh cluster. Closes so-o2o.
Deletes `_clean_etcd_keeping_certs` and the etcd+PKI extraMounts. No new spec keys - activates when `kind-mount-root` is set.
Replaces the hardcoded `gcr.io/etcd-development/etcd:v3.5.9` in `_clean_etcd_keeping_certs` with a dynamic ref captured from the running Kind node via `crictl`, persisted to `{backup_dir}/etcd-image.txt` and reused on subsequent cleanup runs. Self-adapts to Kind upgrades, no version table to maintain.
Testing on Kind v0.32 / etcd 3.6 surfaced two additional bugs in the whitelist cleanup that this PR does **not** fix (see so-o2o comments):
(a) the restore step pipes raw protobuf values through bash `echo`, corrupting binary bytes;
(b) the whitelist omits cluster-admin RBAC, SAs, and bootstrap tokens needed by kubeadm's pre-addon health check.
Merging this narrow fix + diagnosis trail; follow-up branch will replace the etcd-surgery approach with a kubectl-level Caddy secret backup/restore.
- Maintenance-page swap during `restart` was broken: Ingress got patched to point at `{app_name}-{pod_name}-service` for the maintenance pod, but that Service was never created. Caddy had no valid backend, users saw "site cannot be reached" instead of the maintenance page
- Root cause: `get_services()` only builds per-pod Services for pods referenced by `http-proxy` routes; the maintenance pod has no http-proxy route by design
- Fix: `get_services()` now also includes the container named by `maintenance-service:` in the container-ports map, so its per-pod `Service` gets built and sits idle until the swap window
- Also files `so-b9a` (P4) noting the latent fragility in the resolver/builder contract
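The corrected service-selection logic can be sketched as (a simplification; names are illustrative):

```python
def pods_needing_service(http_proxy_routes, maintenance_service, pods):
    # per-pod Services used to be built only for pods behind http-proxy
    # routes; the maintenance pod has no route by design, so include it
    # explicitly so its Service exists before the swap window
    wanted = {route["pod"] for route in http_proxy_routes}
    if maintenance_service:
        wanted.add(maintenance_service)
    return [p for p in pods if p in wanted]
```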
- `down()` scopes cleanup to a single stack via `app.kubernetes.io/stack` and keeps the namespace `Active` by default
- New `stop/down --delete-namespace` flag for opt-in full teardown
- `down()` is synchronous - waits until resources are actually gone before returning. Callers can drop their own wait loops
- `up()` skip-if-exists for Jobs completes the create-or-replace coverage (other kinds already had it)
- Orphan PVs from a prior `stop --delete-namespace` get cleaned on the next `stop --delete-volumes`
- Every k8s resource SO creates now carries `app.kubernetes.io/stack` via a new `ClusterInfo._stack_labels()` helper
- Closes so-l2l, so-076.2. Also includes pebble audit: closes so-c71, so-b2b, so-k1k; files so-328
- Only map host ports for services with network_mode: host (80/443 for Caddy always mapped). Previously all compose service ports were mapped unconditionally, causing conflicts with local services like postgres and redis
- Use spec configmap values as source paths instead of ignoring them. Fixes configmaps with user-defined paths (e.g. `stack-orchestrator/compose/maintenance`) and home-relative paths (e.g. `~/.credentials/local-certs/s3`)
- Read configmap files from deployment dir (`configmaps/{name}/`) when building k8s ConfigMap objects, not from the spec's source path which doesn't exist in the deployment dir
- File pebbles: `so-c71` (resolved), `so-078`: self-sufficient deployments (hooks should be copied to deployment dir)
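The host-port selection rule from the first bullet can be sketched as (illustrative names; the real code walks the parsed compose services):

```python
def ports_to_map(services, caddy_ports=(80, 443)):
    # only services running with network_mode: host get their ports
    # mapped on the host; Caddy's 80/443 are always mapped
    ports = set(caddy_ports)
    for svc in services.values():
        if svc.get("network_mode") == "host":
            ports.update(svc.get("ports", []))
    return sorted(ports)
```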
ExternalName services only support DNS names (CNAME records), not
raw IP addresses. Add an ip mode that creates a headless Service +
Endpoints with a static IP, enabling routing to host-network
services like Kind gateway IPs or bare-metal endpoints.
Spec format:

  external-services:
    my-service:
      ip: 172.18.0.1
      port: 8899
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
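The mode dispatch can be sketched as (a simplification; the mode labels here are descriptive, not real identifiers):

```python
def external_service_mode(entry):
    """Pick the k8s object shape for one external-services entry."""
    if "ip" in entry:
        return "headless-service+static-endpoints"  # new ip mode
    if "host" in entry:
        return "externalname"                       # DNS CNAME only
    if "selector" in entry:
        return "headless-service+selector"
    raise ValueError("external-services entry needs ip, host, or selector")
```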
- Add `--perform-cluster-management` to container-registry, k8s-deployment-control, and database test scripts (`--skip-cluster-management` is now the default)
- Fix `wait_for_log_output()` in all k8s tests - "No logs available" is non-empty, so the check was passing prematurely
- Use HTTPS for container-registry catalog check (Caddy redirects HTTP->HTTPS)
- Fix external-stack sync test: sed pattern used `=` but spec is YAML (`: `), so the substitution never matched
- Work around the hyphenated env var name (`test-variable-1`) from the upstream test-external-stack repo - docker compose v2 rejects hyphens
- Quote `echo $log_output` vars to prevent glob expansion in error output
- Use stack name (instead of cluster-id) derived namespace in k8s-deployment-control test
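The corrected `wait_for_log_output()` check can be sketched as (illustrative name; the real tests are shell scripts):

```python
def logs_ready(output: str) -> bool:
    # a bare non-empty check passes prematurely: the placeholder
    # "No logs available" text is itself non-empty
    return bool(output.strip()) and "No logs available" not in output
```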
Keep upstream's schedule/path triggers and install scripts, add
workflow_dispatch and workflow_call so publish.yml can gate on it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update all self-references from `git.vdb.to/cerc-io/stack-orchestrator` to
`github.com/cerc-io/stack-orchestrator` (setup.py, pyproject.toml, README,
docs, install scripts, cloud-init scripts, stack READMEs)
- Fix release download URL pattern (`releases/download/latest` -> `releases/latest/download`)
- Port 5 Gitea-only CI workflows to GitHub Actions (k8s-deploy, k8s-deployment-control, container-registry, database, external-stack)
- Pin `shiv==1.0.8` in all workflows for reproducible builds
- Restrict smoke/deploy/webapp test push triggers to `main` only
- Remove `.gitea/` directory - Gitea repo to be archived
- Apply black reformatting to deployer.py, cluster_info.py, deploy_k8s.py
- Shorten docstrings exceeding 88 char line limit
- Add assert for pyright Optional type narrowing on tls list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compose files with both "8001" (TCP) and "8001/udp" produce separate
V1ContainerPort entries that k8s rejects as duplicates. Deduplicate
after parsing by (container_port, protocol) key.
This was blocking biscayne's agave deployment — the spec has both
TCP 8001 (ip_echo) and UDP 8001 (gossip), which generated two UDP
8001 entries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
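The dedup described above can be sketched as (illustrative names; the real code dedups V1ContainerPort objects):

```python
def dedupe_container_ports(ports):
    # ports: (container_port, protocol) tuples parsed from compose entries;
    # k8s rejects duplicate (port, protocol) pairs on one container
    seen, out = set(), []
    for port, proto in ports:
        key = (port, proto.upper())
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```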
- test-k8s-deploy.yml: trigger on workflow_call and workflow_dispatch
only (not every push/PR)
- publish.yml: add needs: e2e job that calls test-k8s-deploy.yml —
release is blocked until the k8s e2e suite passes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .github/workflows/test-k8s-deploy.yml: new workflow that installs
kind+kubectl and runs tests/k8s-deploy/run-deploy-test.sh on every
push and PR. Same script used locally and in release validation.
- .pre-commit-config.yaml: add local pre-push hook that runs the k8s
e2e test (~3 min) before pushing to remote.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- deploy_k8s.py: default imagePullPolicy to IfNotPresent for kind
(local images loaded via kind load, not pulled from registry)
- cluster_info.py: add job images to image_set so they're loaded into kind
- deploy_k8s.py: remove duplicate create_registry_secret call (merge artifact)
- deploy_k8s.py: fix indentation in run_job job_pull_policy (replace_all damage)
- tests/k8s-deploy: update namespace from laconic-{id} to laconic-{stack_name}
to match the new stack-derived namespace scheme from wd-a7b
All 15 k8s deploy e2e tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ids.py: use base36 (lowercase+digits) instead of base62 — kind
cluster names must match ^[a-z0-9.-]+$
- k8s deploy test: pass --perform-cluster-management on first start
since 'start' defaults to --skip-cluster-management
Found by running tests/k8s-deploy/run-deploy-test.sh locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines timestamp-based cluster IDs, namespace derived from stack name,
_build_containers refactor, jobs support, multi-ingress certificates,
user-declared secrets, and label-based resource cleanup with the existing
idempotent deploy, mount propagation, and port mapping fixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts keeping HostToContainer propagation on mount root
entry and per-container resource layering from the propagation branch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`create_registry_secret()` was hardcoded to use the "default" namespace,
but pods are deployed to the spec's configured namespace. The secret
must be in the same namespace as the pods for `imagePullSecrets` to work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/998
Co-authored-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Co-committed-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Spec can override container images:

  image-overrides:
    dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag
Merged with CLI overrides (CLI wins). Enables testing with
GHCR-pushed test tags without modifying compose files.
Also reverts the image-pull-policy spec key (not needed —
the fix is to use proper GHCR tags, not IfNotPresent).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Testing specs can set image-pull-policy: IfNotPresent so kind-loaded
local images are used instead of pulling from the registry. Production
specs omit the key and get the default Always behavior.
Root cause: with Always, k8s pulled the GHCR kubo image (with baked
R2 endpoint) instead of the locally-built image (with https://s3:443),
causing kubo to connect to R2 directly and get Unauthorized.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509
picks them up (directory mount replaces the entire dir)
- get_configmaps() now expands ~ in paths via os.path.expanduser()
- Both changes discovered during testing with mkcert + MinIO
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New spec.yml features for routing external service dependencies:

  external-services:
    s3:
      host: example.com        # ExternalName Service (production)
      port: 443
    # or, selector mode:
    s3:
      selector: {app: mock}    # headless Service + Endpoints (testing)
      namespace: mock-ns
      port: 443
  ca-certificates:
    - ~/.local/share/mkcert/rootCA.pem  # testing only
laconic-so creates the appropriate k8s Service type per mode:
- host mode: ExternalName (DNS CNAME to external provider)
- selector mode: headless Service + Endpoints with pod IPs
discovered from the target namespace at deploy time
ca-certificates mounts CA files into all containers at
/etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun.
Also includes the previously committed PV Released state fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deployment stop deletes the namespace (and PVCs) but preserves PVs
by default. On the next deployment start, PVs are in Released state
with a stale claimRef pointing at the deleted PVC. New PVCs cannot
bind to Released PVs, so pods get stuck in Pending.
Clear the claimRef on any Released PV during _create_volume_data()
so the PV returns to Available and can accept new PVC bindings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
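The claimRef clearing can be sketched on plain dicts (a simplification: in a real cluster the controller moves the PV to Available after the patch; here PVs are plain dicts, not API objects):

```python
def clear_released_claim_refs(pvs):
    # a Released PV keeps a stale claimRef to the deleted PVC and can
    # never bind a new claim; clearing the ref lets it return to Available
    for pv in pvs:
        if pv.get("phase") == "Released":
            pv["claimRef"] = None
    return pvs
```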
Switch from caddy/ingress:latest to ghcr.io/laconicnetwork/caddy-ingress:latest
which has the List()/Stat() fix for secret_store. This fixes multi-domain
ACME provisioning deadlock where the second domain's cert request fails
because List() returns mangled keys and Stat() returns wrong IsTerminal.
Source: LaconicNetwork/ingress@109d69a (fix/acme-account-reuse branch)
Fixes: so-o2o (partially — etcd backup investigation still needed)
Closes: ds-v22v (Caddy sequential provisioning no longer needed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.
Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.
Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
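The path resolution rule can be sketched as (illustrative function name; /mnt is the fixed mount target from the change above):

```python
from pathlib import PurePosixPath

def resolve_pv_path(volume_host_path, mount_root):
    """Return (needs_extra_mount, path_as_seen_by_the_node) for one volume."""
    volume = PurePosixPath(volume_host_path)
    root = PurePosixPath(mount_root)
    try:
        rel = volume.relative_to(root)
    except ValueError:
        # outside the root: keep an individual extraMount as before
        return True, str(volume)
    # under the root: covered by the single root extraMount, lives in /mnt
    return False, str(PurePosixPath("/mnt") / rel)
```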
Adds token-file key to image-pull-secret spec config. Reads the
registry token from a file on disk instead of requiring an environment
variable. File path supports ~ expansion. Falls back to token-env
if token-file is not set or file doesn't exist.
This lets operators store the GHCR token in ~/.credentials/ alongside
other secrets, removing the need for ansible to pass REGISTRY_TOKEN
as an env var.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The stack path in spec.yml is relative — both create_operation and
up_operation need cwd at the repo root for stack_is_external() to
resolve it. Move os.chdir(prev_cwd) to after up_operation completes
instead of between the two operations.
Reverts the SystemExit catch in call_stack_deploy_start — the root
cause was cwd, not the hook.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes for multi-deployment:
1. _pod_has_pvcs now excludes ConfigMap volumes from PVC detection.
Pods with only ConfigMap volumes (like maintenance) correctly get
RollingUpdate strategy instead of Recreate.
2. call_stack_deploy_start catches SystemExit when stack path doesn't
resolve from cwd (common during restart). Most stacks don't have
deploy hooks, so this is non-fatal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>