Commit Graph

1274 Commits (f1b206016480fc198ab03db947f9b9bc57a9b28d)

Author SHA1 Message Date
Prathamesh Musale f1b2060164 host-metrics: bind-mount /run/udev for diskio udev lookups
inputs.diskio (via gopsutil) reads /run/udev/data/b<major>:<minor> to
get per-device tags. Without the mount it falls back to the legacy
/dev/.udev/db/block:<name> path which doesn't exist on systemd hosts,
producing one "stat /dev/.udev/db/block:..." warning per device per
collection cycle.
2026-05-11 12:20:28 +00:00
Prathamesh Musale b7b6bcf731 host-metrics: bind-mount /dev for inputs.diskio
`inputs.diskio` enumerates devices from /proc/diskstats and then opens
/dev/<name> for udev/uevent lookups. The container's /dev only has
docker's minimal set, so telegraf logs an "error reading /dev/<name>"
warning per device per collection cycle. Mount the host's /dev
read-only so device lookups succeed.
2026-05-11 12:07:37 +00:00
Prathamesh Musale 6726d68100 host-metrics: correct deploy create command shape
`deploy create` requires `--stack` (deploy.py:70) and lives in the
`deploy` group, not `deployment`. `laconic-so deployment create ...`
does not exist as a subcommand.
2026-05-11 12:04:12 +00:00
Prathamesh Musale a62be4def8 host-metrics: use native-stack name and laconic-so deployment logs
`host-metrics` is a native stack -- spec.yml and `laconic-so --stack`
both take the bare stack name, not a path. Replace the `docker ps -qf`
filter with `laconic-so deployment --dir ... logs` so the verify
recipe works regardless of the laconic deployment-hash prefix on the
container name.
2026-05-11 11:58:07 +00:00
Prathamesh Musale 4eaca0ecb0 host-metrics: operator README 2026-05-11 10:51:43 +00:00
Prathamesh Musale f898d65983 host-metrics: entrypoint + offline tests
Add telegraf-entrypoint.sh to render telegraf.conf from the template
(replacing @@HOST_TAG_BLOCK@@ and @@ZFS_BLOCK@@ markers via awk) and
exec telegraf. Add test-telegraf-entrypoint.sh with 8 offline tests
(10 assertions) covering marker substitution and required-env validation.
Fix run() stderr redirect from >/dev/null 2>&1 to >/dev/null so that
entrypoint error output reaches the T6-T8 assertion captures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 10:50:13 +00:00
Prathamesh Musale f5adcfc77e host-metrics: remove marker mentions from template comments
The entrypoint substitutes @@HOST_TAG_BLOCK@@ and @@ZFS_BLOCK@@ globally;
having them inside leading comments would corrupt the rendered file.
2026-05-11 10:45:49 +00:00
Prathamesh Musale c79da27585 host-metrics: telegraf config template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 10:44:58 +00:00
Prathamesh Musale f13f347f3a host-metrics: declare compose file version for consistency 2026-05-11 10:43:45 +00:00
Prathamesh Musale 874c61820d host-metrics: stack.yml + compose skeleton
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 10:40:57 +00:00
prathamesh0 2ff7e5eb77
deploy: restart now force-recreates compose containers (#752)
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Operator-reported: editing source files mounted into a service via
bind volumes (alert rules, dashboards, scripts, templates, telegraf
config) and running 'laconic-so deployment ... restart' did not
take effect. Operator had to fall back to 'stop && start' to pick
up changes.

Root cause: 'restart' calls up_operation, which translates to
'docker compose up -d'. Compose's up only recreates a container
when the *service definition* itself (image, env, ports, volume
declarations) changes. Bind-mount target file content is not part
of that hash, so the running container kept its old in-memory
state (e.g. Grafana's pre-edit provisioning).

Add force_recreate kwarg through the deployer interface and have
restart pass force_recreate=True. compose path threads through to
python_on_whales' compose.up(force_recreate=...). k8s path accepts
the kwarg but is a no-op for now (rolling update on
unchanged-spec needs a separate fix that stamps the
kubectl.kubernetes.io/restartedAt annotation on managed
Deployments; tracked in a follow-up).
2026-05-06 15:26:30 +05:30
prathamesh0 cf0e230b66
bug-fix: fix image-overrides usage to load locally build images into kind cluster (#751)
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
- Cluster setup was only considering images from containers list in `stack.yml` for kind-loading into the cluster; i.e. images from `image_overrides` in spec were not being loaded
- This also resulted in laconic-so to attempt kind-loading images not present locally sometimes
- Fix: union `image_overrides` values (user-specified local images) with the ones from container-list, filtered to only ones that are actually present on the docker host
2026-05-05 10:08:08 +05:30
prathamesh0 7c65d39bb2
Make deployments self-sufficient and add E2E restart test (#750)
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s Details
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
- `deploy create` now copies each pod's `commands.py` into `<deployment>/hooks/`. `call_stack_deploy_start` loads from there, so `deployment start` / `restart` no longer need the live stack source on disk to run the `start()` hook
- Only the `start()` hook is affected. `init`, `setup`, and `create` still load from the live source — they only run at `deploy create` time, when the source is guaranteed to be present
- Multi-repo stacks produce `hooks/commands_0.py`, `hooks/commands_1.py`, …; `call_stack_deploy_start` loads them all in sorted order
- Adds `tests/k8s-deploy/run-restart-test.sh` covering the full single-repo restart cycle (v1 -> mutate working tree -> `restart` re-copies and re-executes v2) and the multi-repo file-naming + multi-hook invocation. Wired into the existing **K8s Deploy Test** workflow
2026-04-28 17:28:02 +05:30
prathamesh0 4977e3ff43
k8s: manage Caddy ingress image via spec (so-p3p) (#749)
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
K8s Deployment Control Test / Run deployment control suite on kind/k8s (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s Details
Closes so-p3p:
- New spec key `caddy-ingress-image`: on fresh install, deploys Caddy with this image; on subsequent `deployment start`, patches the running Caddy Deployment if the image differs. Defaults to the manifest's hardcoded image when absent
- When the spec key is absent, SO does **not** touch a running Caddy — avoids silently reverting an image set out-of-band (ansible playbook, another deployment's spec)
- `strategy: Recreate` on the Caddy Deployment manifest (required — hostPort 80/443 deadlocks rolling updates)
- Reconcile runs under both `--perform-cluster-management` and the default `--skip-cluster-management` (it's a k8s-API patch, not a cluster-lifecycle op)
- Image template by container name rather than string match, so the spec override wins regardless of what the shipped manifest hardcodes
- Cluster-scoped caveat documented: `caddy-system` is shared across deployments, so the last `deployment start` that sets the key wins for everyone
2026-04-21 14:40:39 +05:30
prathamesh0 421b83c430
k8s: shared-cluster safety checks and deployment-id decoupling (#748)
- **Kind extraMount compatibility**: fail fast at `deployment start` when a new deployment's mounts don't match the running cluster; warn when the first cluster is created without a `kind-mount-root` umbrella; replace the cryptic `ConfigException` with readable errors when the cluster is missing
- **Auto-ConfigMap for file-level host-path compose volumes** (so-7fc): `../config/foo.sh:/opt/foo.sh`-style binds become per-namespace ConfigMaps at deploy start instead of aliasing via the kind extraMount chain. `deploy create` rejects `:rw`, subdirs, and over-budget sources. Deployment-dir layout unchanged
- **Namespace ownership**: stamp the namespace with `laconic.com/deployment-dir` on create; fail loudly if another deployment tries to land in the same namespace. Pre-existing namespaces adopt ownership on next start
- **deployment-id / cluster-id decoupling**: split the two roles (kube context vs resource-name prefix) into separate `deployment.yml` fields. Backward-compat fallback keeps existing resource names stable
- Close stale pebbles `so-n1n` and `so-ad7`
2026-04-21 12:17:28 +05:30
prathamesh0 eb4704b563
chore(pebbles): close so-o2o (#747)
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Publish / Build and publish (push) Has been skipped Details
Implementation shipped in PR #746. Woodburn migration (one-shot
host-kubectl export to seed the backup file) completed manually.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 17:40:12 +05:30
prathamesh0 7f4b058066
so-o2o: kubectl-level Caddy cert backup/restore (#746)
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Publish / Build and publish (push) Has been skipped Details
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Replaces the etcd-surgery persistence approach with a CronJob that dumps `manager=caddy` Secrets to `{kind-mount-root}/caddy-cert-backup/` every 5 min, and a restore step that applies the file before Caddy starts on a fresh cluster. Closes so-o2o.

Deletes `_clean_etcd_keeping_certs` and the etcd+PKI extraMounts. No new spec keys - activates when `kind-mount-root` is set.
2026-04-17 15:36:40 +05:30
prathamesh0 1334900407
so-o2o: detect etcd image dynamically + diagnose whitelist cleanup bugs (#745)
Replaces the hardcoded `gcr.io/etcd-development/etcd:v3.5.9` in `_clean_etcd_keeping_certs` with a dynamic ref captured from the running Kind node via `crictl`, persisted to `{backup_dir}/etcd-image.txt` and reused on subsequent cleanup runs. Self-adapts to Kind upgrades, no version table to maintain.

Testing on Kind v0.32 / etcd 3.6 surfaced two additional bugs in the whitelist cleanup that this PR does **not** fix (see so-o2o comments):
(a) the restore step pipes raw protobuf values through bash `echo`, corrupting binary bytes;
(b) the whitelist omits cluster-admin RBAC, SAs, and bootstrap tokens needed by kubeadm's pre-addon health check.

Merging this narrow fix + diagnosis trail; follow-up branch will replace the etcd-surgery approach with a kubectl-level Caddy secret backup/restore.
2026-04-17 13:48:30 +05:30
prathamesh0 cf8b7533fe
so-ad7: build per-pod Service for maintenance container (#744)
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Publish / Gate: k8s deploy e2e (push) Failing after 3s Details
Publish / Build and publish (push) Has been skipped Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
- Maintenance-page swap during `restart` was broken: Ingress got patched to point at `{app_name}-{pod_name}-service` for the maintenance pod, but that Service was never created. Caddy had no valid backend, users saw "site cannot be reached" instead of the maintenance page
- Root cause: `get_services()` only builds per-pod Services for pods referenced by `http-proxy` routes; the maintenance pod has no http-proxy route by design
- Fix: `get_services()` now also includes the container named by `maintenance-service:` in the container-ports map, so its per-pod `Service` gets built and sits idle until the swap window
- Also files `so-b9a` (P4) noting the latent fragility in the resolver/builder contract
2026-04-16 15:07:25 +05:30
prathamesh0 fc5dc80058
so-l2l: in-place stop/restart via label-scoped cleanup (#743)
- `down()` scopes cleanup to a single stack via `app.kubernetes.io/stack` and keeps the namespace `Active` by default
- New `stop/down --delete-namespace` flag for opt-in full teardown
- `down()` is synchronous - waits until resources are actually gone before returning. Callers can drop their own wait loops
- `up()` skip-if-exists for Jobs completes the create-or-replace coverage (other kinds already had it)
- Orphan PVs from a prior `stop --delete-namespace` get cleaned on the next `stop --delete-volumes`
- Every k8s resource SO creates now carries `app.kubernetes.io/stack` via a new `ClusterInfo._stack_labels()` helper
- Closes so-l2l, so-076.2. Also includes pebble audit: closes so-c71, so-b2b, so-k1k; files so-328
2026-04-16 12:10:04 +05:30
prathamesh0 f40913d187
Fix Kind port mappings and configmap source path resolution (#742)
Publish / Gate: k8s deploy e2e (push) Failing after 2s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
- Only map host ports for services with network_mode: host (80/443 for Caddy always mapped). Previously all compose service ports were mapped unconditionally, causing conflicts with local services like postgres and redis
- Use spec configmap values as source paths instead of ignoring them. Fixes configmaps with user-defined paths (e.g. `stack-orchestrator/compose/maintenance`) and home-relative paths (e.g. `~/.credentials/local-certs/s3`)
- Read configmap files from deployment dir (`configmaps/{name}/`) when building k8s ConfigMap objects, not from the spec's source path which doesn't exist in the deployment dir
- File pebbles: `so-c71` (resolved), `so-078`: self-sufficient deployments (hooks should be copied to deployment dir)
2026-04-14 17:33:47 +05:30
prathamesh0 17b614cb4d
Fix configmap source path resolution for user-defined spec paths (#741)
Lint Checks / Run linter (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Publish / Gate: k8s deploy e2e (push) Failing after 2s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
2026-04-14 11:30:27 +05:30
prathamesh0 0bf1ea70d5
Add ip mode to external-services for static IP endpoints (#740)
Publish / Gate: k8s deploy e2e (push) Failing after 2s Details
Container Registry Test / Run container registry hosting test on kind/k8s (push) Failing after 0s Details
Publish / Build and publish (push) Has been skipped Details
Database Test / Run database hosting test on kind/k8s (push) Failing after 0s Details
External Stack Test / Run external stack test suite (push) Failing after 0s Details
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s Details
K8s Deployment Control Test / Run deployment control suite on kind/k8s (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
ExternalName services only support DNS names (CNAME records), not
raw IP addresses. Add an ip mode that creates a headless Service +
Endpoints with a static IP, enabling routing to host-network
services like Kind gateway IPs or bare-metal endpoints.

Spec format:
  external-services:
    my-service:
      ip: 172.18.0.1
      port: 8899

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 17:53:23 +05:30
prathamesh0 185ebf17f9
Fix failing k8s and external-stack CI test scripts (#739)
- Add `--perform-cluster-management` to container-registry, k8s-deployment-control, and database test scripts (`--skip-cluster-management` is now the default)
- Fix `wait_for_log_output()` in all k8s tests - "No logs available" is non-empty, so the check was passing prematurely
- Use HTTPS for container-registry catalog check (Caddy redirects HTTP->HTTPS)
- Fix external-stack sync test: sed pattern used `=` but spec is YAML (`: `), so the substitution never matched
- Workaround hyphenated env var name (`test-variable-1`) from upstream test-external-stack repo - docker compose v2 rejects hyphens
- Quote `echo $log_output` vars to prevent glob expansion in error output
- Use stack name (instead of cluster-id) derived namespace in k8s-deployment-control test
2026-04-02 15:00:57 +05:30
A. F. Dudley eb881ac179 merge upstream: resolve test-k8s-deploy.yml conflict, add workflow_call
Keep upstream's schedule/path triggers and install scripts, add
workflow_dispatch and workflow_call so publish.yml can gate on it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:30:14 +00:00
prathamesh0 7f11766b05
Migrate canonical source from Gitea to GitHub (#738)
- Update all self-references from `git.vdb.to/cerc-io/stack-orchestrator` to
  `github.com/cerc-io/stack-orchestrator` (setup.py, pyproject.toml, README,
  docs, install scripts, cloud-init scripts, stack READMEs)
- Fix release download URL pattern (`releases/download/latest` -> `releases/latest/download`)
- Port 5 Gitea-only CI workflows to GitHub Actions (k8s-deploy, k8s-deployment-control, container-registry, database, external-stack)
- Pin `shiv==1.0.8` in all workflows for reproducible builds
- Restrict smoke/deploy/webapp test push triggers to `main` only
- Remove `.gitea/` directory - Gitea repo to be archived
2026-04-02 10:58:14 +05:30
A. F. Dudley a76cae5c70 Merge branch 'main' of github.com:cerc-io/stack-orchestrator 2026-04-02 05:26:10 +00:00
A. F. Dudley 3da23683f6 fix: black formatting, line length, pyright type narrowing
- Apply black reformatting to deployer.py, cluster_info.py, deploy_k8s.py
- Shorten docstrings exceeding 88 char line limit
- Add assert for pyright Optional type narrowing on tls list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:22:25 +00:00
A. F. Dudley 63325f68a7 fix: deduplicate container ports by (port, protocol)
Compose files with both "8001" (TCP) and "8001/udp" produce separate
V1ContainerPort entries that k8s rejects as duplicates. Deduplicate
after parsing by (container_port, protocol) key.

This was blocking biscayne's agave deployment — the spec has both
TCP 8001 (ip_echo) and UDP 8001 (gossip), which generated two UDP
8001 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 05:18:10 +00:00
A. F. Dudley ae1eae5b9b gate releases on k8s e2e test, remove per-push trigger
- test-k8s-deploy.yml: trigger on workflow_call and workflow_dispatch
  only (not every push/PR)
- publish.yml: add needs: e2e job that calls test-k8s-deploy.yml —
  release is blocked until the k8s e2e suite passes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 04:31:38 +00:00
A. F. Dudley c95eeeffb8 add k8s deploy e2e test to CI and pre-push hook
- .github/workflows/test-k8s-deploy.yml: new workflow that installs
  kind+kubectl and runs tests/k8s-deploy/run-deploy-test.sh on every
  push and PR. Same script used locally and in release validation.
- .pre-commit-config.yaml: add local pre-push hook that runs the k8s
  e2e test (~3 min) before pushing to remote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 04:30:37 +00:00
Prathamesh Musale 2afc7ad1ce Update refs in root readme; test publish workflow
Lint Checks / Run linter (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Publish / Build and publish (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
2026-04-02 09:33:53 +05:30
A. F. Dudley 87761c7041 fix: imagePullPolicy for kind, job images, duplicate registry call, test namespace
- deploy_k8s.py: default imagePullPolicy to IfNotPresent for kind
  (local images loaded via kind load, not pulled from registry)
- cluster_info.py: add job images to image_set so they're loaded into kind
- deploy_k8s.py: remove duplicate create_registry_secret call (merge artifact)
- deploy_k8s.py: fix indentation in run_job job_pull_policy (replace_all damage)
- tests/k8s-deploy: update namespace from laconic-{id} to laconic-{stack_name}
  to match the new stack-derived namespace scheme from wd-a7b

All 15 k8s deploy e2e tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 23:34:51 +00:00
A. F. Dudley 66da312f67 fix: base36 IDs for kind-compatible cluster names, test --perform-cluster-management
- ids.py: use base36 (lowercase+digits) instead of base62 — kind
  cluster names must match ^[a-z0-9.-]+$
- k8s deploy test: pass --perform-cluster-management on first start
  since 'start' defaults to --skip-cluster-management

Found by running tests/k8s-deploy/run-deploy-test.sh locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 21:50:53 +00:00
A. F. Dudley 5bf96112d3 fix lint errors from merge: duplicate def, shadowed import, empty f-string
- deployment_create.py: remove duplicate create_registry_secret signature
- deploy_k8s.py: rename loop var 'config' to 'svc_config' (shadowed import)
- deploy_k8s.py: remove f-prefix from string without placeholders
- deploy_k8s.py: suppress pre-existing C901 on _create_volume_data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 19:17:45 +00:00
A. F. Dudley 549ac8c01d Merge fix/kind-mount-propagation: all local branches unified
Merges 6 local branches into main:
- enya: HostToContainer mount propagation for kind-mount-root
- fix/k8s-port-mappings-v5: port protocol parsing, namespace fix
- peirce: idempotent deploy (create-or-replace), update-envs rename
- prince: etcd cleanup whitelist
- wd-a7b: timestamp cluster IDs, stack-derived namespaces, jobs,
  multi-cert ingress, user secrets, _build_containers refactor
- fix/kind-mount-propagation: deployment prepare command, pebbles

Conflicts resolved keeping main's evolved multi-pod architecture
(get_deployments, per-pod Services, CA cert injection) while
incorporating branch additions (HostToContainer propagation,
user secrets, jobs support).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:26:05 +00:00
A. F. Dudley d50bd2b6d2 Merge wd-a7b: cluster-id/namespace naming, jobs, multi-cert, secrets
Combines timestamp-based cluster IDs, namespace derived from stack name,
_build_containers refactor, jobs support, multi-ingress certificates,
user-declared secrets, and label-based resource cleanup with the existing
idempotent deploy, mount propagation, and port mapping fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:22:07 +00:00
A. F. Dudley 2307696a66 Merge fix/k8s-port-mappings-v5 into fix/kind-mount-propagation
Resolve conflicts keeping HostToContainer propagation on mount root
entry and per-container resource layering from the propagation branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 17:06:25 +00:00
A. F. Dudley c64820ad5c Merge branch 'enya-ac868cc4-kind-mount-propagation-fix' into fix/kind-mount-propagation 2026-04-01 17:05:06 +00:00
A. F. Dudley 3e3f349151 Merge remote-tracking branch 'cerc-io.github.com/main'
# Conflicts:
#	stack_orchestrator/deploy/deployment_create.py
#	stack_orchestrator/deploy/k8s/deploy_k8s.py
2026-04-01 14:47:46 +00:00
Prathamesh Musale 33d3474d7d Fix registry secret created in wrong namespace (#998)
Publish / Build and publish (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
`create_registry_secret()` was hardcoded to use the "default" namespace,
but pods are deployed to the spec's configured namespace. The secret
must be in the same namespace as the pods for `imagePullSecrets` to work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/998
Co-authored-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
Co-committed-by: Prathamesh Musale <prathamesh.musale0@gmail.com>
2026-03-26 08:36:39 +00:00
Snake Game Developer 90e32ffd60 Support image-overrides in spec for testing
Spec can override container images:
  image-overrides:
    dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag

Merged with CLI overrides (CLI wins). Enables testing with
GHCR-pushed test tags without modifying compose files.

Also reverts the image-pull-policy spec key (not needed —
the fix is to use proper GHCR tags, not IfNotPresent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:02:23 +00:00
Snake Game Developer 1052a1d4e7 Support image-pull-policy in spec (default: Always)
Testing specs can set image-pull-policy: IfNotPresent so kind-loaded
local images are used instead of pulling from the registry. Production
specs omit the key and get the default Always behavior.

Root cause: with Always, k8s pulled the GHCR kubo image (with baked
R2 endpoint) instead of the locally-built image (with https://s3:443),
causing kubo to connect to R2 directly and get Unauthorized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 20:17:06 +00:00
Snake Game Developer f93541f7db Fix CA cert mounting: subPath for Go, expanduser for configmaps
- CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509
  picks them up (directory mount replaces the entire dir)
- get_configmaps() now expands ~ in paths via os.path.expanduser()
- Both changes discovered during testing with mkcert + MinIO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 19:27:14 +00:00
Snake Game Developer 713a81c245 Add external-services and ca-certificates spec keys
New spec.yml features for routing external service dependencies:

external-services:
  s3:
    host: example.com  # ExternalName Service (production)
    port: 443
  s3:
    selector: {app: mock}  # headless Service + Endpoints (testing)
    namespace: mock-ns
    port: 443

ca-certificates:
  - ~/.local/share/mkcert/rootCA.pem  # testing only

laconic-so creates the appropriate k8s Service type per mode:
- host mode: ExternalName (DNS CNAME to external provider)
- selector mode: headless Service + Endpoints with pod IPs
  discovered from the target namespace at deploy time

ca-certificates mounts CA files into all containers at
/etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun.

Also includes the previously committed PV Released state fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:25:47 +00:00
Snake Game Developer 98ff221a21 Fix PV rebinding after deployment stop/start cycle
deployment stop deletes the namespace (and PVCs) but preserves PVs
by default. On the next deployment start, PVs are in Released state
with a stale claimRef pointing at the deleted PVC. New PVCs cannot
bind to Released PVs, so pods get stuck in Pending.

Clear the claimRef on any Released PV during _create_volume_data()
so the PV returns to Available and can accept new PVC bindings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 07:47:23 +00:00
A. F. Dudley 7141dc7637 file so-p3p: laconic-so should manage Caddy ingress image lifecycle
Lint Checks / Run linter (push) Failing after 0s Details
Publish / Build and publish (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 00:30:46 +00:00
A. F. Dudley 2555df06b5 fix: use patched Caddy ingress image with ACME storage fix
Switch from caddy/ingress:latest to ghcr.io/laconicnetwork/caddy-ingress:latest
which has the List()/Stat() fix for secret_store. This fixes multi-domain
ACME provisioning deadlock where the second domain's cert request fails
because List() returns mangled keys and Stat() returns wrong IsTerminal.

Source: LaconicNetwork/ingress@109d69a (fix/acme-account-reuse branch)

Fixes: so-o2o (partially — etcd backup investigation still needed)
Closes: ds-v22v (Caddy sequential provisioning no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:31:39 +00:00
A. F. Dudley 24cf22fea5 File pebbles: mount propagation merge + etcd cert backup broken
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:01:20 +00:00
A. F. Dudley 8d03083d0d feat: add kind-mount-root for unified Kind extraMount
Publish / Build and publish (push) Failing after 0s Details
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.

Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.

Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 21:28:40 +00:00