Bug: laconic-so crashes on re-deploy when caddy ingress already exists
Summary
laconic-so deployment start crashes with FailToCreateError when the kind cluster already has caddy ingress resources installed. The deployer uses create_from_yaml(), which fails on AlreadyExists conflicts instead of applying idempotently. This prevents the application deployment from ever being reached — the crash happens before any app manifests are applied.
Component
stack_orchestrator/deploy/k8s/deploy_k8s.py:366 — up() method
stack_orchestrator/deploy/k8s/helpers.py:369 — install_ingress_for_kind()
Reproduction
1. kind delete cluster --name laconic-70ce4c4b47e23b85
2. laconic-so deployment --dir /srv/deployments/agave start — creates the cluster, loads images, and installs the caddy ingress, but times out or is interrupted before app deployment completes
3. laconic-so deployment --dir /srv/deployments/agave start — crashes immediately after image loading
Symptoms
- Traceback ending in:
  kubernetes.utils.create_from_yaml.FailToCreateError:
  Error from server (Conflict): namespaces "caddy-system" already exists
  Error from server (Conflict): serviceaccounts "caddy-ingress-controller" already exists
  Error from server (Conflict): clusterroles.rbac.authorization.k8s.io "caddy-ingress-controller" already exists
  ...
- Namespace laconic-laconic-70ce4c4b47e23b85 exists but is empty — no pods, no deployments, no events
- Cluster is healthy, images are loaded, but no app manifests are applied
Root Cause
install_ingress_for_kind() calls kubernetes.utils.create_from_yaml() which uses POST (create) semantics. If the resources already exist (from a previous partial run), every resource returns 409 Conflict and create_from_yaml raises FailToCreateError, aborting the entire up() method before the app deployment step.
The first laconic-so start after a fresh kind delete works, proceeding in this order:
- Image loading into the kind node takes 5-10 minutes (images are ~10GB+)
- Caddy ingress is installed successfully
- App deployment begins
But if that first run is interrupted (timeout, Ctrl-C, ansible timeout), the second run finds caddy already installed and crashes.
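The difference between create and apply semantics is the whole bug. A minimal, self-contained sketch (a plain dict standing in for cluster state; all names here are illustrative, not laconic-so code):

```python
# Sketch of create vs. apply semantics. AlreadyExistsError stands in for
# the 409 Conflict that create_from_yaml() wraps in FailToCreateError.

class AlreadyExistsError(Exception):
    """Stand-in for a 409 Conflict from the API server."""

def create(cluster: dict, name: str, spec: dict) -> None:
    # POST (create) semantics: refuses to touch an existing resource.
    if name in cluster:
        raise AlreadyExistsError(f'resource "{name}" already exists')
    cluster[name] = spec

def apply(cluster: dict, name: str, spec: dict) -> None:
    # Apply semantics: create-or-update, safe to run any number of times.
    cluster[name] = spec

state = {}
create(state, "caddy-system", {"kind": "Namespace"})   # first run: succeeds
apply(state, "caddy-system", {"kind": "Namespace"})    # re-run via apply: fine
try:
    create(state, "caddy-system", {"kind": "Namespace"})  # re-run via create
except AlreadyExistsError as e:
    print(e)  # prints: resource "caddy-system" already exists
```

This is why the second run fails on every resource: each manifest in the caddy bundle hits the `create` path, not the `apply` path.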
Fix Options
- Use server-side apply instead of create_from_yaml() — kubectl apply is idempotent
- Check if the ingress exists before installing — skip install_ingress_for_kind() if the caddy-system namespace exists
- Catch AlreadyExists and continue — treat 409 as success for infrastructure resources
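The third option could look roughly like the sketch below. This is not the actual laconic-so code: the function name is made up, and FailToCreateError here is a simplified stand-in for kubernetes.utils.FailToCreateError (the real one wraps ApiException objects whose status must be inspected).

```python
# Hedged sketch of "treat 409 as success" for infrastructure resources.

class FailToCreateError(Exception):
    """Simplified stand-in for kubernetes.utils.FailToCreateError."""
    def __init__(self, failures):
        self.failures = failures  # list of (http_status, reason) pairs
        super().__init__("create failed")

def install_ingress_tolerating_conflicts(create_fn):
    """Run create_fn; swallow the error only if every wrapped failure
    is a 409 Conflict, i.e. the resources already exist."""
    try:
        create_fn()
    except FailToCreateError as e:
        if all(status == 409 for status, _reason in e.failures):
            print("caddy ingress already installed, continuing")
        else:
            raise  # genuine failures still abort the deploy

def fake_create():
    raise FailToCreateError([(409, "AlreadyExists"), (409, "AlreadyExists")])

install_ingress_tolerating_conflicts(fake_create)  # continues instead of crashing
```

Filtering on 409 specifically matters: a blanket `except FailToCreateError: pass` would also hide real failures such as RBAC denials.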
Workaround
Delete the caddy ingress resources before re-running:
kubectl delete namespace caddy-system
kubectl delete clusterrole caddy-ingress-controller
kubectl delete clusterrolebinding caddy-ingress-controller
kubectl delete ingressclass caddy
laconic-so deployment --dir /srv/deployments/agave start
Or nuke the entire cluster and start fresh:
kind delete cluster --name laconic-70ce4c4b47e23b85
laconic-so deployment --dir /srv/deployments/agave start
Interaction with ansible timeout
The biscayne-redeploy.yml playbook sets a 600s timeout on the laconic-so deployment start task. Image loading alone can exceed this on a fresh cluster (images must be re-loaded into the new kind node). When ansible kills the process at 600s, the caddy ingress is already installed but the app is not — putting the cluster into the broken state described above. Subsequent playbook runs hit this bug on every attempt.
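Independently of the idempotency fix, the playbook could stop killing long image loads. A possible shape, using ansible's standard async/poll mechanism — the task name, module, and timeout value are assumptions for illustration, not the actual biscayne-redeploy.yml contents:

```yaml
# Illustrative fragment only; names and values are assumed.
- name: Start laconic-so deployment
  ansible.builtin.command: laconic-so deployment --dir /srv/deployments/agave start
  async: 1800   # allow up to 30 min for image loading into the kind node
  poll: 30      # check progress every 30s instead of hard-killing at 600s
```

This only narrows the window for the partial-install state; an interrupted run (Ctrl-C, host reboot) can still leave caddy installed without the app, so the create_from_yaml() fix is still needed.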
Impact
- Blocks all re-deploys on biscayne without manual cleanup
- The playbook cannot recover automatically — every retry hits the same conflict
- Discovered 2026-03-05 during full wipe redeploy of biscayne validator