Bug: laconic-so etcd cleanup wipes core kubernetes service
Summary
`_clean_etcd_keeping_certs()` in laconic-stack-orchestrator 1.1.0 deletes the `kubernetes` Service from etcd, breaking cluster networking on restart.
Component
stack_orchestrator/deploy/k8s/helpers.py — _clean_etcd_keeping_certs()
Reproduction
- Deploy with `laconic-so` to a k8s-kind target with persisted etcd (hostPath mount in `kind-config.yml`)
- `laconic-so deployment --dir <dir> stop` (destroys the cluster)
- `laconic-so deployment --dir <dir> start` (recreates the cluster with the cleaned etcd)
Symptoms
- `kindnet` pods enter CrashLoopBackOff with: `panic: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined`
- `kubectl get svc kubernetes -n default` returns `NotFound`
- coredns, caddy, and local-path-provisioner stuck in Pending (no CNI without kindnet)
- No pods can be scheduled
Root Cause
`_clean_etcd_keeping_certs()` uses a whitelist that preserves only `/registry/secrets/caddy-system` keys. Every other etcd key is deleted, including `/registry/services/specs/default/kubernetes`, the core `kubernetes` ClusterIP Service that kube-apiserver auto-creates.
When the kind cluster starts against the cleaned etcd, kube-apiserver sees existing etcd data and does not re-create the `kubernetes` Service. kindnet depends on the `KUBERNETES_SERVICE_HOST` environment variable, which the kubelet injects from this Service; without it, kindnet panics.
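The whitelist decision described above can be sketched as follows. This is a hypothetical reconstruction for illustration; the names and the exact matching logic inside `_clean_etcd_keeping_certs()` are assumptions, not the actual implementation:

```python
# Sketch of the cleanup's keep/delete decision (hypothetical reconstruction).
# Per the root cause analysis, only the caddy-system secrets prefix survives.
KEEP_PREFIXES = ["/registry/secrets/caddy-system"]

def should_delete(key: str) -> bool:
    """Return True if the cleanup would delete this etcd key."""
    return not any(key.startswith(prefix) for prefix in KEEP_PREFIXES)

# The core Service auto-created by kube-apiserver is not whitelisted,
# so it is wiped along with everything else:
print(should_delete("/registry/services/specs/default/kubernetes"))  # True
print(should_delete("/registry/secrets/caddy-system/caddy-tls"))     # False
```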
Fix Options
- Expand the whitelist to include `/registry/services/specs/default/kubernetes` and other core cluster resources
- Fully wipe etcd instead of selective cleanup, letting the cluster bootstrap fresh (simpler, but loses the Caddy TLS certs)
- Don't persist etcd at all; ephemeral etcd means clean state on every restart (recommended for kind deployments)
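The first option can be sketched as a broader prefix whitelist. The extra registry paths below (the endpoints object in particular) are illustrative assumptions about which core keys matter, not a verified minimal set:

```python
# Sketch of fix option 1: broaden the preserved prefixes so core cluster
# state survives the cleanup. Entries beyond the two keys named in this
# report are assumptions for illustration.
KEEP_PREFIXES = [
    "/registry/secrets/caddy-system",                   # Caddy TLS certs (current behavior)
    "/registry/services/specs/default/kubernetes",      # core ClusterIP Service
    "/registry/services/endpoints/default/kubernetes",  # its endpoints object (assumed path)
]

def should_keep(key: str) -> bool:
    """Return True if the cleanup should preserve this etcd key."""
    return any(key.startswith(prefix) for prefix in KEEP_PREFIXES)
```

Even with this predicate, any core resource missing from the list is silently lost on the next restart, which is why the ephemeral-etcd option is the safer default for kind.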
Workaround
Fully delete the kind cluster before starting:

```
kind delete cluster --name <cluster-name>
laconic-so deployment --dir <dir> start
```
This forces a fresh etcd bootstrap. Downside: all other services deployed to the cluster (DaemonSets, other namespaces) are destroyed.
Impact
- Affects any k8s-kind deployment with persisted etcd
- Cluster is unrecoverable without full destroy+recreate
- All non-laconic-so-managed workloads in the cluster are lost