chore(pebbles): close so-7fc — auto-ConfigMap for host-path compose volumes
Implementation on this branch at commit cb84388d.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull/748/head
parent cb84388d00
commit 44a3ed63b4
@@ -46,3 +46,5 @@
{"type":"comment","timestamp":"2026-04-17T08:13:32.753112339Z","issue_id":"so-o2o","payload":{"body":"Tested the version-detection fix (commit 832ab66d) locally. Fix works for its scope but surfaces two more bugs downstream. Current approach is broken at the architectural level, not just one-bug-fixable.\n\nWhat 832ab66d does: captures etcd image ref from crictl after cluster create, writes to {backup_dir}/etcd-image.txt, reads it on subsequent cleanup runs. Self-adapts to Kind upgrades. No more hardcoded v3.5.9. Confirmed locally: etcd-image.txt is written after first create, cleanup on second start uses it, member.backup-YYYYMMDD-HHMMSS dir is produced (proves cleanup ran end-to-end).\n\nWhat still fails after version fix: kubeadm init on cluster recreate. apiserver comes up but returns:\n- 403 Forbidden: User \"kubernetes-admin\" cannot get path /livez\n- 500: Body was not decodable ... json: cannot unmarshal array into Go value of type struct\n- eventually times out waiting for apiserver /livez\n\nTwo new bugs behind those:\n\n(a) Restore step corrupts binary values. In _clean_etcd_keeping_certs the restore loop is:\n key=$(echo $encoded | base64 -d | jq -r .key | base64 -d)\n val=$(echo $encoded | base64 -d | jq -r .value | base64 -d)\n echo \"$val\" | /backup/etcdctl put \"$key\"\nk8s stores objects as protobuf. Piping raw protobuf through bash variable expansion + echo mangles non-printable bytes, truncates at null bytes, and appends a trailing newline. Explains the \"cannot unmarshal\" from apiserver — the kubernetes Service/Endpoints objects in /registry are corrupted on re-put.\n\n(b) Whitelist is too narrow. We keep only /registry/secrets/caddy-system and the /registry/services entries for kubernetes. Everything else is deleted — including /registry/clusterrolebindings (cluster-admin is gone), /registry/serviceaccounts, /registry/secrets/kube-system (bootstrap tokens), RBAC roles, apiserver's auth config. Explains the 403 for kubernetes-admin — cluster-admin binding doesn't exist yet and kubeadm's pre-addon health check can't authorize.\n\nFixing (a) would mean rewriting the restore step to not use shell piping — either use a proper etcdctl-based Go tool, or write directly to the on-disk snapshot format. Fixing (b) means exhaustively whitelisting everything kubeadm/apiserver bootstrapping needs — a moving target across k8s versions. Both together are a significant undertaking for the actual requirement (\"keep 4 Caddy secrets across cluster recreate\").\n\nDecision: merge 832ab66d for the narrow version-detection fix + diagnosis trail, then implement the kubectl-level backup/restore on a separate branch. The etcd approach is not salvageable at reasonable cost."}}
{"type":"comment","timestamp":"2026-04-17T11:04:26.542659482Z","issue_id":"so-o2o","payload":{"body":"Shipped in PR #746. Etcd-persistence approach replaced with a kubectl-level Caddy Secret backup/restore gated on kind-mount-root.\n\nSummary of what landed:\n- components/ingress/caddy-cert-backup.yaml: SA/Role/RoleBinding + CronJob (alpine/kubectl:1.35.3) firing every 5min, writes {kind-mount-root}/caddy-cert-backup/caddy-secrets.yaml via atomic tmp+rename.\n- install_ingress_for_kind splits into 3 phases: pre-Deployment manifests → _restore_caddy_certs (kubectl apply from backup file) → Caddy Deployment → _install_caddy_cert_backup. Caddy pod can't exist until phase 3, so certs are always in place before secret_store startup.\n- Deleted _clean_etcd_keeping_certs, _get_etcd_host_path_from_kind_config, _capture_etcd_image, _read_etcd_image_ref, _etcd_image_ref_path and the etcd+PKI block in _generate_kind_mounts.\n- No new spec keys.\n\nTest coverage in tests/k8s-deploy/run-deploy-test.sh: install assertion after first --perform-cluster-management start, plus full E2E (seed fake manager=caddy Secret → trigger CronJob → verify backup file → stop/start --perform-cluster-management for cluster recreate → assert secret restored with matching decoded value).\n\nWoodburn migration: one-shot host-kubectl export to seed {kind-mount-root}/caddy-cert-backup/caddy-secrets.yaml was done manually on the running cluster (the in-cluster CronJob couldn't reach the host because the /srv/kind → /mnt extraMount was staged in kind-config.yml but never applied to the running cluster — it was added after cluster creation). File is in place for the eventual cluster recreate."}}
{"type":"close","timestamp":"2026-04-17T11:04:26.999711375Z","issue_id":"so-o2o","payload":{}}
{"type":"create","timestamp":"2026-04-20T13:14:26.312724048Z","issue_id":"so-7fc","payload":{"description":"## Problem\n\nFile-level host-path compose volumes (e.g. `../config/foo.sh:/opt/foo.sh`) were synthesized into a kind extraMount + k8s hostPath PV chain with a sanitized containerPath (`/mnt/host-path-\u003csanitized\u003e`).\n\n- On kind: two deployments of the same stack sharing a cluster collide at that containerPath — kind only honors the first deployment's bind, so subsequent deployments' pods silently read the first's file. No error, no warning.\n- On real k8s: the same code emits `hostPath: /mnt/host-path-*` but nothing populates that path on worker nodes — effectively broken.\n\nFile-level host-path binds are conceptually k8s ConfigMaps. The `snowballtools-base-backend` stack already uses the ConfigMap-backed named-volume pattern manually; this issue is to make that automatic for all stacks.\n\n## Resolution\n\nImplemented on branch `feat/so-b86-auto-configmap-host-path` (commit `cb84388d`), stacked on top of `feat/kind-mount-invariant-check`.\n\n**No deployment-dir file rewriting.** Compose files, spec.yml, and `{deployment_dir}/config/\u003cpod\u003e/` are untouched — trivially diffable against stack source, no synthetic volume names. ConfigMaps are materialized at deploy start and visible only in k8s (`kubectl get cm -n \u003cns\u003e`).\n\n### Deploy create — validation only\n\n| Source shape | Behavior |\n|---|---|\n| Single file | Accepted |\n| Flat directory, no subdirs, ≤ ~700 KiB | Accepted |\n| Directory with subdirs | `DeployerException` — guidance: embed in image / split configmaps / initContainer |\n| File or directory \u003e ~700 KiB | `DeployerException` — ConfigMap budget (accounts for base64 + metadata) |\n| `:rw` on any host-path bind | `DeployerException` — use a named volume for writable data |\n\n### Deploy start — k8s object generation\n\n- `cluster_info.get_configmaps()` walks pod + job compose volumes and emits a `V1ConfigMap` per host-path bind (deduped by sanitized name), content read from `{deployment_dir}/config/\u003cpod\u003e/\u003cfile\u003e`.\n- `volumes_for_pod_files` emits `V1ConfigMapVolumeSource` instead of `V1HostPathVolumeSource` for host-path binds.\n- `volume_mounts_for_service` stats the source and sets `V1VolumeMount.sub_path` to the filename when source is a regular file.\n- `_generate_kind_mounts` no longer emits `/mnt/host-path-*` extraMounts — ConfigMap path bypasses the kind node FS entirely.\n\n### Transition\n\nThe `/mnt/host-path-*` skip in `check_mounts_compatible` is retained as a transition tolerance for deployments created before this change. Test coverage in `tests/k8s-deploy/run-deploy-test.sh` asserts host-path ConfigMaps exist in the namespace, compose/spec in deployment dir unchanged, and no `/mnt/host-path-*` entries in kind-config.yml.","priority":"2","title":"File-level host-path compose volumes alias across deployments sharing a kind cluster","type":"bug"}}
{"type":"status_update","timestamp":"2026-04-20T13:14:26.833816262Z","issue_id":"so-7fc","payload":{"status":"closed"}}