so-ad7: build per-pod Service for maintenance container (#744)
Webapp Test / Run webapp test suite (push) Failing after 0s
Publish / Gate: k8s deploy e2e (push) Failing after 3s
Publish / Build and publish (push) Has been skipped
Deploy Test / Run deploy test suite (push) Failing after 0s
K8s Deploy Test / Run deploy test suite on kind/k8s (push) Failing after 0s
Lint Checks / Run linter (push) Failing after 0s
Smoke Test / Run basic test suite (push) Failing after 0s
- Maintenance-page swap during `restart` was broken: the Ingress got patched to point at `{app_name}-{pod_name}-service` for the maintenance pod (see the naming sketch below), but that Service was never created. Caddy had no valid backend, so users saw "site cannot be reached" instead of the maintenance page
- Root cause: `get_services()` only builds per-pod Services for pods referenced by `http-proxy` routes; the maintenance pod has no http-proxy route by design
- Fix: `get_services()` now also adds the container named by `maintenance-service:` to the container-ports map, so its per-pod `Service` gets built and sits idle until the swap window
- Also files `so-b9a` (P4) noting the latent fragility in the resolver/builder contract
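For orientation, the swap relies on the per-pod Service naming convention referenced above. A minimal sketch of that convention (function and argument names here are illustrative, not the project's exact helpers):

```python
def per_pod_service_name(app_name: str, pod_name: str) -> str:
    # The Ingress backend gets patched to this name during the maintenance
    # window; the swap only works if get_services() actually built a Service
    # with this name, which it previously never did for the maintenance pod.
    return f"{app_name}-{pod_name}-service"

# e.g. per_pod_service_name("shop", "maintenance") == "shop-maintenance-service"
# (app and pod names above are made up for illustration)
```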
pull/747/head
v1.1.0-cf8b753-202604160940
parent fc5dc80058
commit cf8b7533fe
@@ -39,3 +39,5 @@
 {"type":"close","timestamp":"2026-04-16T06:24:39.175431401Z","issue_id":"so-l2l","payload":{}}
 {"type":"comment","timestamp":"2026-04-16T06:24:41.70556861Z","issue_id":"so-076.2","payload":{"body":"Fixed on chore/pebble-status-audit. stop now uses label-scoped cleanup (app.kubernetes.io/stack=\u003cstack\u003e) and keeps the namespace Active by default. The Kind cluster is not destroyed unless --perform-cluster-management is passed. Full namespace teardown is opt-in via the new --delete-namespace flag. Multiple stacks sharing a namespace/cluster are now cleaned up independently, not blown away en masse."}}
 {"type":"close","timestamp":"2026-04-16T06:24:42.153940477Z","issue_id":"so-076.2","payload":{}}
+{"type":"create","timestamp":"2026-04-16T07:26:56.820142001Z","issue_id":"so-ad7","payload":{"description":"_restart_with_maintenance in deployment.py patches Ingress backends to point at the maintenance Service, but that Service is never created. get_services() in cluster_info.py only builds per-pod ClusterIP Services for pods referenced by http-proxy routes (cluster_info.py:991-992 'if not ports_set: continue'). The maintenance pod has no http-proxy route by design, so no Service is built for it.\n\nResult: during a restart with maintenance-service configured, the Ingress points to a non-existent Service. Caddy has no valid backend, connection fails, users see 'site cannot be reached' instead of the maintenance page. Cryovial logs correctly report the swap happened.\n\n_resolve_service_name_for_container (cluster_info.py:183) and get_services() (cluster_info.py:945) operate on inconsistent premises — the resolver assumes every pod has a {app_name}-{pod_name}-service; the builder only creates one for http-proxy-referenced pods.\n\nFix: create_services() should also build a Service for the container named by spec's maintenance-service: key.","priority":"3","title":"Maintenance swap routes Ingress to non-existent Service","type":"bug"}}
+{"type":"create","timestamp":"2026-04-16T08:21:00.832961223Z","issue_id":"so-b9a","payload":{"description":"_resolve_service_name_for_container (cluster_info.py:183) mechanically returns {app_name}-{pod_name}-service for any container, with no awareness of whether get_services() actually built that Service. get_services() only builds Services for pods referenced by http-proxy or maintenance-service.\n\nCurrent callers happen to be safe: get_ingress() only passes http-proxy containers, _restart_with_maintenance passes the maintenance container (covered by so-ad7). But any future caller that passes a container outside {http-proxy ∪ maintenance-service} gets a ghost Service name and silent failure.\n\nFix direction (when a third caller emerges): either teach the resolver to return None / raise when the Service wasn't built, or make get_services() build a per-pod Service unconditionally for every pod with compose ports, aligning structure with the resolver's assumption.","priority":"4","title":"Service-name resolver and builder operate on inconsistent premises","type":"bug"}}
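The so-b9a ticket above names two possible directions for hardening the resolver/builder contract. As one hedged illustration of the first direction (a defensive resolver), under assumed names and signatures rather than the project's actual code:

```python
from typing import Optional

def resolve_service_name_if_built(
    app_name: str,
    pod_name: str,
    built_service_names: set[str],
) -> Optional[str]:
    """Return the per-pod Service name only if the builder actually created it.

    Callers outside the http-proxy / maintenance-service set then get None
    instead of a ghost Service name, turning a silent routing failure into
    an explicit condition they must handle.
    """
    name = f"{app_name}-{pod_name}-service"
    return name if name in built_service_names else None
```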
@@ -954,13 +954,18 @@ class ClusterInfo:
             svc = self.get_service()
             return [svc] if svc else []
 
-        # Multi-pod: one service per pod, only for pods that have
-        # ports referenced by http-proxy routes
-        http_proxy_list = self.spec.get_http_proxy()
-        if not http_proxy_list:
+        # Multi-pod: one service per pod, only for pods whose containers
+        # are referenced by http-proxy routes or by maintenance-service.
+        http_proxy_list = self.spec.get_http_proxy() or []
+        maintenance_svc = self.spec.get_maintenance_service()
+        if not http_proxy_list and not maintenance_svc:
             return []
 
-        # Build map: container_name -> port from http-proxy routes
+        # Build map: container_name -> set of ports. Sources:
+        # - http-proxy routes (normal traffic routing)
+        # - maintenance-service (so _restart_with_maintenance can swap
+        #   Ingress backends to a real Service during the maintenance
+        #   window; the maintenance pod has no http-proxy route by design)
         container_ports: dict = {}
         for http_proxy in http_proxy_list:
            for route in http_proxy.get("routes", []):
@@ -971,6 +976,11 @@ class ClusterInfo:
                 if container not in container_ports:
                     container_ports[container] = set()
                 container_ports[container].add(port)
+        if maintenance_svc and ":" in maintenance_svc:
+            maint_container, maint_port_str = maintenance_svc.split(":", 1)
+            container_ports.setdefault(maint_container, set()).add(
+                int(maint_port_str)
+            )
 
         # Build map: pod_file -> set of service names in that pod
         pod_services_map: dict = {}
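For readers who want to see what the newly seeded map entry ultimately buys: a rough sketch of the kind of per-pod ClusterIP Service the builder can now emit for the maintenance container (a plain-dict approximation; the manifest shape and label scheme are assumptions, not the project's actual builder output):

```python
def maintenance_service_manifest(app_name: str, container: str, port: int) -> dict:
    # Approximate shape of the per-pod Service that now exists before the
    # swap window, so the patched Ingress backend resolves to a real endpoint.
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{app_name}-{container}-service"},
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": f"{app_name}-{container}"},  # assumed label scheme
            "ports": [{"port": port, "targetPort": port}],
        },
    }
```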