so-ad7: build per-pod Service for maintenance container (#744)
Webapp Test / Run webapp test suite (push) Failing after 0s
Publish / Gate: k8s deploy e2e (push) Failing after 3s
Publish / Build and publish (push) Has been skipped
Deploy Test / Run deploy test suite (push) Failing after 0s
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s
Lint Checks / Run linter (push) Failing after 0s
Smoke Test / Run basic test suite (push) Failing after 0s
- Maintenance-page swap during `restart` was broken: the Ingress got patched to point at `{app_name}-{pod_name}-service` for the maintenance pod (see the naming sketch below), but that Service was never created. Caddy had no valid backend, so users saw "site cannot be reached" instead of the maintenance page
- Root cause: `get_services()` only builds per-pod Services for pods referenced by `http-proxy` routes; the maintenance pod has no http-proxy route by design
- Fix: `get_services()` now also adds the container named by `maintenance-service:` to the container-ports map, so its per-pod `Service` gets built and sits idle until the swap window
- Also files `so-b9a` (P4) noting the latent fragility in the resolver/builder contract
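For orientation, the swap relies on the per-pod Service naming convention referenced above. A minimal sketch of that convention (function and argument names here are illustrative, not the project's exact helpers):

```python
def per_pod_service_name(app_name: str, pod_name: str) -> str:
    # The Ingress backend gets patched to this name during the maintenance
    # window; the swap only works if get_services() actually built a Service
    # with this name, which it previously never did for the maintenance pod.
    return f"{app_name}-{pod_name}-service"

# e.g. per_pod_service_name("shop", "maintenance") == "shop-maintenance-service"
# (app and pod names above are made up for illustration)
```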
pull/747/head
v1.1.0-cf8b753-202604160940
parent fc5dc80058
commit cf8b7533fe
@@ -39,3 +39,5 @@
 {"type":"close","timestamp":"2026-04-16T06:24:39.175431401Z","issue_id":"so-l2l","payload":{}}
 {"type":"comment","timestamp":"2026-04-16T06:24:41.70556861Z","issue_id":"so-076.2","payload":{"body":"Fixed on chore/pebble-status-audit. stop now uses label-scoped cleanup (app.kubernetes.io/stack=\u003cstack\u003e) and keeps the namespace Active by default. The Kind cluster is not destroyed unless --perform-cluster-management is passed. Full namespace teardown is opt-in via the new --delete-namespace flag. Multiple stacks sharing a namespace/cluster are now cleaned up independently, not blown away en masse."}}
 {"type":"close","timestamp":"2026-04-16T06:24:42.153940477Z","issue_id":"so-076.2","payload":{}}
+{"type":"create","timestamp":"2026-04-16T07:26:56.820142001Z","issue_id":"so-ad7","payload":{"description":"_restart_with_maintenance in deployment.py patches Ingress backends to point at the maintenance Service, but that Service is never created. get_services() in cluster_info.py only builds per-pod ClusterIP Services for pods referenced by http-proxy routes (cluster_info.py:991-992 'if not ports_set: continue'). The maintenance pod has no http-proxy route by design, so no Service is built for it.\n\nResult: during a restart with maintenance-service configured, the Ingress points to a non-existent Service. Caddy has no valid backend, connection fails, users see 'site cannot be reached' instead of the maintenance page. Cryovial logs correctly report the swap happened.\n\n_resolve_service_name_for_container (cluster_info.py:183) and get_services() (cluster_info.py:945) operate on inconsistent premises — the resolver assumes every pod has a {app_name}-{pod_name}-service; the builder only creates one for http-proxy-referenced pods.\n\nFix: create_services() should also build a Service for the container named by spec's maintenance-service: key.","priority":"3","title":"Maintenance swap routes Ingress to non-existent Service","type":"bug"}}
+{"type":"create","timestamp":"2026-04-16T08:21:00.832961223Z","issue_id":"so-b9a","payload":{"description":"_resolve_service_name_for_container (cluster_info.py:183) mechanically returns {app_name}-{pod_name}-service for any container, with no awareness of whether get_services() actually built that Service. get_services() only builds Services for pods referenced by http-proxy or maintenance-service.\n\nCurrent callers happen to be safe: get_ingress() only passes http-proxy containers, _restart_with_maintenance passes the maintenance container (covered by so-ad7). But any future caller that passes a container outside {http-proxy ∪ maintenance-service} gets a ghost Service name and silent failure.\n\nFix direction (when a third caller emerges): either teach the resolver to return None / raise when the Service wasn't built, or make get_services() build a per-pod Service unconditionally for every pod with compose ports, aligning structure with the resolver's assumption.","priority":"4","title":"Service-name resolver and builder operate on inconsistent premises","type":"bug"}}
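The so-b9a ticket above names two possible directions for hardening the resolver/builder contract. As one hedged illustration of the first direction (a defensive resolver), under assumed names and signatures rather than the project's actual code:

```python
from typing import Optional

def resolve_service_name_if_built(
    app_name: str,
    pod_name: str,
    built_service_names: set[str],
) -> Optional[str]:
    """Return the per-pod Service name only if the builder actually created it.

    Callers outside the http-proxy / maintenance-service set then get None
    instead of a ghost Service name, turning a silent routing failure into
    an explicit condition they must handle.
    """
    name = f"{app_name}-{pod_name}-service"
    return name if name in built_service_names else None
```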
@@ -954,13 +954,18 @@ class ClusterInfo:
             svc = self.get_service()
             return [svc] if svc else []
 
-        # Multi-pod: one service per pod, only for pods that have
-        # ports referenced by http-proxy routes
-        http_proxy_list = self.spec.get_http_proxy()
-        if not http_proxy_list:
+        # Multi-pod: one service per pod, only for pods whose containers
+        # are referenced by http-proxy routes or by maintenance-service.
+        http_proxy_list = self.spec.get_http_proxy() or []
+        maintenance_svc = self.spec.get_maintenance_service()
+        if not http_proxy_list and not maintenance_svc:
             return []
 
-        # Build map: container_name -> port from http-proxy routes
+        # Build map: container_name -> set of ports. Sources:
+        # - http-proxy routes (normal traffic routing)
+        # - maintenance-service (so _restart_with_maintenance can swap
+        #   Ingress backends to a real Service during the maintenance
+        #   window; the maintenance pod has no http-proxy route by design)
         container_ports: dict = {}
         for http_proxy in http_proxy_list:
            for route in http_proxy.get("routes", []):
@@ -971,6 +976,11 @@ class ClusterInfo:
                 if container not in container_ports:
                     container_ports[container] = set()
                 container_ports[container].add(port)
+        if maintenance_svc and ":" in maintenance_svc:
+            maint_container, maint_port_str = maintenance_svc.split(":", 1)
+            container_ports.setdefault(maint_container, set()).add(
+                int(maint_port_str)
+            )
 
         # Build map: pod_file -> set of service names in that pod
         pod_services_map: dict = {}
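For readers who want to see what the newly seeded map entry ultimately buys: a rough sketch of the kind of per-pod ClusterIP Service the builder can now emit for the maintenance container (a plain-dict approximation; the manifest shape and label scheme are assumptions, not the project's actual builder output):

```python
def maintenance_service_manifest(app_name: str, container: str, port: int) -> dict:
    # Approximate shape of the per-pod Service that now exists before the
    # swap window, so the patched Ingress backend resolves to a real endpoint.
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{app_name}-{container}-service"},
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": f"{app_name}-{container}"},  # assumed label scheme
            "ports": [{"port": port, "targetPort": port}],
        },
    }
```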