chore(pebbles): add resolved direction to so-o2o
Switch from etcd-level cleanup to kubectl-level Caddy secret backup/restore. Operates at the right layer (k8s, not etcd), selective by design, version-portable, operator-readable, and lets us delete ~150 LoC of docker-in-docker shell-in-Python. Branch and implementation pending decision to proceed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 3df3f83f6e
commit a64883642a
@@ -42,3 +42,4 @@
{"type":"create","timestamp":"2026-04-16T07:26:56.820142001Z","issue_id":"so-ad7","payload":{"description":"_restart_with_maintenance in deployment.py patches Ingress backends to point at the maintenance Service, but that Service is never created. get_services() in cluster_info.py only builds per-pod ClusterIP Services for pods referenced by http-proxy routes (cluster_info.py:991-992 'if not ports_set: continue'). The maintenance pod has no http-proxy route by design, so no Service is built for it.\n\nResult: during a restart with maintenance-service configured, the Ingress points to a non-existent Service. Caddy has no valid backend, connection fails, users see 'site cannot be reached' instead of the maintenance page. Cryovial logs correctly report the swap happened.\n\n_resolve_service_name_for_container (cluster_info.py:183) and get_services() (cluster_info.py:945) operate on inconsistent premises — the resolver assumes every pod has a {app_name}-{pod_name}-service; the builder only creates one for http-proxy-referenced pods.\n\nFix: create_services() should also build a Service for the container named by spec's maintenance-service: key.","priority":"3","title":"Maintenance swap routes Ingress to non-existent Service","type":"bug"}}
{"type":"create","timestamp":"2026-04-16T08:21:00.832961223Z","issue_id":"so-b9a","payload":{"description":"_resolve_service_name_for_container (cluster_info.py:183) mechanically returns {app_name}-{pod_name}-service for any container, with no awareness of whether get_services() actually built that Service. get_services() only builds Services for pods referenced by http-proxy or maintenance-service.\n\nCurrent callers happen to be safe: get_ingress() only passes http-proxy containers, _restart_with_maintenance passes the maintenance container (covered by so-ad7). But any future caller that passes a container outside {http-proxy ∪ maintenance-service} gets a ghost Service name and silent failure.\n\nFix direction (when a third caller emerges): either teach the resolver to return None / raise when the Service wasn't built, or make get_services() build a per-pod Service unconditionally for every pod with compose ports, aligning structure with the resolver's assumption.","priority":"4","title":"Service-name resolver and builder operate on inconsistent premises","type":"bug"}}
{"type":"comment","timestamp":"2026-04-16T13:36:20.150833128Z","issue_id":"so-o2o","payload":{"body":"Reproduced and partially diagnosed locally. Original 'backup not persisting' framing turns out to be inaccurate — the host bind-mount works fine and the cleanup function runs end-to-end. The actual bug is downstream of those.\n\nWhat we confirmed:\n- The etcd extraMount at \u003cdeployment_dir\u003e/data/cluster-backups/\u003cid\u003e/etcd is honored. After 'kind delete', the host-side data persists (16MB db file, snap files intact, owned by root mode 0700).\n- _clean_etcd_keeping_certs (helpers.py:120-279) actually runs to completion. Evidence: timestamped 'member.backup-YYYYMMDD-HHMMSS' dirs accumulate (created at line 257-260, the last step before the swap-in).\n\nWhat actually breaks:\n- After cleanup + 'kind create cluster', kubeadm init fails. kube-apiserver never opens :6443 ('connection refused' loop until kubeadm gives up). kubelet itself is healthy.\n- Hypothesis (high confidence, not yet proven by inspecting an etcd container log): version skew. Cleanup uses gcr.io/etcd-development/etcd:v3.5.9 (helpers.py:148) which produces v3.5-format on-disk data. Kind v0.32 ships kindest/node:v1.35.1 with etcd v3.6.x, which can't read v3.5-format data and crashes — apiserver can't reach it.\n- Diagnostic that nails the version skew: moving the persisted etcd dir aside ('mv etcd etcd.away') and re-running 'start --perform-cluster-management' succeeds cleanly. With persisted-etcd present, fails. So the cleanup output is what breaks the new cluster.\n\nWhy prod hasn't hit this: woodburn runs kind v0.20.0 (kindest/node:v1.27.x with etcd v3.5.x) — compatible with the v3.5.9 cleanup image. Bug is dormant there until kind is bumped.\n\nWhat we do NOT know:\n- Whether Caddy certs would actually survive a successful recreate. Cluster never came up after cleanup, so we couldn't inspect /registry/secrets/caddy-system in the new etcd. The cleanup function's whitelist preserves them in theory, but end-to-end preservation is unverified.\n\nWhat's also broken regardless of root cause:\n- _clean_etcd_keeping_certs gates ALL its diagnostic prints on opts.o.debug (lines 141, 145, 274, 278) and returns False silently on failure. With a normal (non-debug) run, the operator gets zero indication that cleanup attempted, succeeded, or failed. Silent failure was 90% of why this took so long to diagnose.\n\nFix direction:\n1. Source etcdctl/etcdutl from the same kindest/node image kind is using, so on-disk format always matches what the cluster will boot with. Self-adapts to kind upgrades.\n2. Make failure messages unconditional prints, not gated on debug.\n3. After (1), re-test cert preservation end-to-end and update findings."}}
{"type":"comment","timestamp":"2026-04-16T14:34:14.74327248Z","issue_id":"so-o2o","payload":{"body":"Resolved direction: replace the etcd-level cleanup mechanism with a kubectl-level Caddy secret backup/restore in SO.\n\nWhy the layer change: the etcd approach (current code, or even a cleaner snapshot save/restore) shares one fundamental problem — restoring an entire etcd snapshot resurrects ALL prior cluster state, conflicts with the new cluster's bootstrap, and forces us into a maintained whitelist (kindnet/coredns/etc.). It also couples SO to etcd binary versions. Working at the k8s/kubectl layer instead gives us a selective, version-portable, operator-readable backup that exactly matches what we want: a small set of Caddy Secrets.\n\nKey insight from the use case: Caddy's secret_store reuses any valid cert it finds in its store on startup. ACME challenges only fire for domains it has no cert for. So if we restore the right Secrets BEFORE Caddy comes up on a new cluster, no Let's Encrypt traffic happens, no rate limit risk, no TLS gap, and new domains added later still provision normally.\n\nProposed spec API:\n network:\n caddy-cert-backup:\n path: ~/.credentials/caddy-certs/ # operator-owned host dir; presence enables\n\nLifecycle:\n1. destroy_cluster, before 'kind delete', if path configured:\n kubectl get secrets -n caddy-system -l manager=caddy -o yaml \u003e \u003cpath\u003e/caddy-secrets.yaml\n (loud failure if cluster gone or kubectl errors)\n2. install_ingress_for_kind, after manifest applied:\n kubectl apply -f \u003cpath\u003e/caddy-secrets.yaml if it exists\n\nWhat this lets us delete:\n- _clean_etcd_keeping_certs (~150 LoC of docker-in-docker shell-in-Python)\n- _get_etcd_host_path_from_kind_config\n- etcd extraMount setup in _generate_kind_mounts\n- Hardcoded gcr.io/etcd-development/etcd:v3.5.9 image\n- The whitelist that needs maintenance per kind version\n- All silent-failure paths\n\nWhat woodburn can simplify after the SO fix:\n- inject_caddy_certs.yml and the kubectl-backup half of cluster_recreate.yml become redundant\n- extract-caddy-certs.sh (auger) stays as a disaster-recovery tool for cases where the in-SO backup didn't run\n\nPros vs. alternatives:\n- Operator-readable PEM YAML, can be git-committed / sops-encrypted / mirrored to S3\n- Zero version coupling\n- Selective by design (only what we want)\n- Lifecycle hooks are atomic with cluster lifecycle, no remembering to run a separate backup playbook\n- ~30 LoC in SO, no docker-in-docker\n\nBranch and implementation pending operator decision to proceed."}}