From 3df3f83f6ebe1790e16dfa6f22881610cc1ddd0a Mon Sep 17 00:00:00 2001
From: Prathamesh Musale <prathamesh.musale0@gmail.com>
Date: Thu, 16 Apr 2026 13:36:36 +0000
Subject: [PATCH] chore(pebbles): update so-o2o with actual diagnosis

Original framing ('backup not persisting') was inaccurate. The
bind-mount works and _clean_etcd_keeping_certs runs to completion.
The likely bug is downstream (high confidence, not yet confirmed
from etcd logs): cleanup uses the etcd v3.5.9 image, which writes
v3.5-format on-disk data that the etcd v3.6.x shipped in newer
kindest/node images cannot read. apiserver fails to start on
cluster recreate. The bug is dormant on prod (kind v0.20); local
(kind v0.32) reproduces it.

Also note the silent-failure pattern (diagnostics gated on debug)
that masked this for so long.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 .pebbles/events.jsonl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.pebbles/events.jsonl b/.pebbles/events.jsonl
index c474992d..fd5029db 100644
--- a/.pebbles/events.jsonl
+++ b/.pebbles/events.jsonl
@@ -41,3 +41,4 @@
 {"type":"close","timestamp":"2026-04-16T06:24:42.153940477Z","issue_id":"so-076.2","payload":{}}
 {"type":"create","timestamp":"2026-04-16T07:26:56.820142001Z","issue_id":"so-ad7","payload":{"description":"_restart_with_maintenance in deployment.py patches Ingress backends to point at the maintenance Service, but that Service is never created. get_services() in cluster_info.py only builds per-pod ClusterIP Services for pods referenced by http-proxy routes (cluster_info.py:991-992 'if not ports_set: continue'). The maintenance pod has no http-proxy route by design, so no Service is built for it.\n\nResult: during a restart with maintenance-service configured, the Ingress points to a non-existent Service. Caddy has no valid backend, connection fails, users see 'site cannot be reached' instead of the maintenance page. Cryovial logs correctly report the swap happened.\n\n_resolve_service_name_for_container (cluster_info.py:183) and get_services() (cluster_info.py:945) operate on inconsistent premises — the resolver assumes every pod has a {app_name}-{pod_name}-service; the builder only creates one for http-proxy-referenced pods.\n\nFix: create_services() should also build a Service for the container named by spec's maintenance-service: key.","priority":"3","title":"Maintenance swap routes Ingress to non-existent Service","type":"bug"}}
 {"type":"create","timestamp":"2026-04-16T08:21:00.832961223Z","issue_id":"so-b9a","payload":{"description":"_resolve_service_name_for_container (cluster_info.py:183) mechanically returns {app_name}-{pod_name}-service for any container, with no awareness of whether get_services() actually built that Service. get_services() only builds Services for pods referenced by http-proxy or maintenance-service.\n\nCurrent callers happen to be safe: get_ingress() only passes http-proxy containers, _restart_with_maintenance passes the maintenance container (covered by so-ad7). But any future caller that passes a container outside {http-proxy ∪ maintenance-service} gets a ghost Service name and silent failure.\n\nFix direction (when a third caller emerges): either teach the resolver to return None / raise when the Service wasn't built, or make get_services() build a per-pod Service unconditionally for every pod with compose ports, aligning structure with the resolver's assumption.","priority":"4","title":"Service-name resolver and builder operate on inconsistent premises","type":"bug"}}
+{"type":"comment","timestamp":"2026-04-16T13:36:20.150833128Z","issue_id":"so-o2o","payload":{"body":"Reproduced and partially diagnosed locally. Original 'backup not persisting' framing turns out to be inaccurate — the host bind-mount works fine and the cleanup function runs end-to-end. The actual bug is downstream of those.\n\nWhat we confirmed:\n- The etcd extraMount at \u003cdeployment_dir\u003e/data/cluster-backups/\u003cid\u003e/etcd is honored. After 'kind delete', the host-side data persists (16MB db file, snap files intact, owned by root mode 0700).\n- _clean_etcd_keeping_certs (helpers.py:120-279) actually runs to completion. Evidence: timestamped 'member.backup-YYYYMMDD-HHMMSS' dirs accumulate (created at line 257-260, the last step before the swap-in).\n\nWhat actually breaks:\n- After cleanup + 'kind create cluster', kubeadm init fails. kube-apiserver never opens :6443 ('connection refused' loop until kubeadm gives up). kubelet itself is healthy.\n- Hypothesis (high confidence, not yet proven by inspecting an etcd container log): version skew. Cleanup uses gcr.io/etcd-development/etcd:v3.5.9 (helpers.py:148) which produces v3.5-format on-disk data. Kind v0.32 ships kindest/node:v1.35.1 with etcd v3.6.x, which can't read v3.5-format data and crashes — apiserver can't reach it.\n- Diagnostic that nails the version skew: moving the persisted etcd dir aside ('mv etcd etcd.away') and re-running 'start --perform-cluster-management' succeeds cleanly. With persisted-etcd present, fails. So the cleanup output is what breaks the new cluster.\n\nWhy prod hasn't hit this: woodburn runs kind v0.20.0 (kindest/node:v1.27.x with etcd v3.5.x) — compatible with the v3.5.9 cleanup image. Bug is dormant there until kind is bumped.\n\nWhat we do NOT know:\n- Whether Caddy certs would actually survive a successful recreate. Cluster never came up after cleanup, so we couldn't inspect /registry/secrets/caddy-system in the new etcd. The cleanup function's whitelist preserves them in theory, but end-to-end preservation is unverified.\n\nWhat's also broken regardless of root cause:\n- _clean_etcd_keeping_certs gates ALL its diagnostic prints on opts.o.debug (lines 141, 145, 274, 278) and returns False silently on failure. With a normal (non-debug) run, the operator gets zero indication that cleanup attempted, succeeded, or failed. Silent failure was 90% of why this took so long to diagnose.\n\nFix direction:\n1. Source etcdctl/etcdutl from the same kindest/node image kind is using, so on-disk format always matches what the cluster will boot with. Self-adapts to kind upgrades.\n2. Make failure messages unconditional prints, not gated on debug.\n3. After (1), re-test cert preservation end-to-end and update findings."}}