Commit Graph

1247 Commits (dd7aeb329a852ecac8748e28d48e11a7e225e8f8)

Author SHA1 Message Date
A. F. Dudley 496c7982cb feat: end-to-end relay test scripts
Three Python scripts send real packets from the kind node through the
full relay path (biscayne → tunnel → mia-sw01 → was-sw01 → internet)
and verify responses come back via the inbound path. No indirect
counter-checking — a response proves both directions work.

- relay-test-udp.py: DNS query with sport 8001
- relay-test-tcp-sport.py: HTTP request with sport 8001
- relay-test-tcp-dport.py: TCP connect to entrypoint dport 8001 (ip_echo)
- test-ashburn-relay.sh: orchestrates from ansible controller via nsenter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 00:43:06 +00:00
A. F. Dudley 8eac9cc87f docs: document DoubleZero agent managed config on both switches
Inventories what the DZ agent controls (tunnels, ACLs, VRFs, BGP,
route-maps, loopbacks) so we don't accidentally modify objects that
the agent will silently overwrite. Includes a "safe to modify" section
listing our own relay infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:45:36 +00:00
A. F. Dudley b82d66eeff fix: VRF isolation for mia-sw01 relay, TCP dport mangle for ip_echo
mia-sw01: Replace PBR-based outbound routing with VRF isolation.
TCAM profile tunnel-interface-acl doesn't support PBR or traffic-policy
on tunnel interfaces. Tunnel100 now lives in VRF "relay" whose default
route sends decapsulated traffic to was-sw01 via backbone, avoiding
BCP38 drops on the ISP uplink for src 137.239.194.65.

biscayne: Add TCP dport mangle rule for ip_echo (port 8001). Without it,
outbound ip_echo probes use biscayne's real IP instead of the Ashburn
relay IP, causing entrypoints to probe the wrong address. Also fix
loopback IP idempotency (handle "already assigned" error).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:31:18 +00:00
A. F. Dudley a02534fc11 chore: add containerlab topologies for relay testing
Ashburn relay and shred relay lab configs for local end-to-end
testing with cEOS. No secrets — only public IPs and test scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:30:03 +00:00
A. F. Dudley 9cbc115295 fix: inventory layering — playbooks use hosts:all, cross-inventory uses explicit hosts
Normal playbooks should never hardcode hostnames — that's an inventory
concern. Changed all playbooks to hosts:all. The one exception is
ashburn-relay-check.yml which legitimately spans both inventories
(switches + biscayne) and uses explicit hostnames.

Also adds:
- ashburn-relay-check.yml: full-path relay diagnostics (switches + host)
- biscayne-start.yml: start kind container and scale validator to 1
- ashburn-relay-setup.sh.j2: boot persistence script for relay state
- Direct device mounts replacing rbind (ZFS shared propagation fix)
- systemd service replacing broken if-up.d/netfilter-persistent
- PV mount path corrections (/mnt/validator-* not /mnt/solana/*)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:28:21 +00:00
A. F. Dudley 7f205732f2 fix(k8s): expand etcd cleanup whitelist to preserve core cluster services
_clean_etcd_keeping_certs() only preserved /registry/secrets/caddy-system,
deleting everything else including the kubernetes ClusterIP service in the
default namespace. When kind recreated the cluster with the cleaned etcd,
kube-apiserver saw existing data and skipped bootstrapping the service.
kindnet panicked on KUBERNETES_SERVICE_HOST missing, blocking all pod
networking.

Expand the whitelist to also preserve:
- /registry/services/specs/default/kubernetes
- /registry/services/endpoints/default/kubernetes

Loop over multiple prefixes instead of a single etcdctl get --prefix call.

See docs/bug-laconic-so-etcd-cleanup.md in biscayne-agave-runbook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 17:56:13 +00:00
A. F. Dudley 14c0f63775 feat: layer 4 invariants, mount checks, and deployment layer docs
- Rename biscayne-boot.yml → biscayne-prepare-agave.yml (layer 4)
- Document deployment layers and layer 4 invariants in playbook header
- Add zvol, ramdisk, rbind fstab management with stale entry cleanup
- Add kind node XFS verification (reads cluster-id from deployment)
- Add mount checks to health-check.yml (host mounts, kind visibility, propagation)
- Fix health-check discovery tasks with tags: [always] and non-fatal pod lookup
- Fix biscayne-redeploy.yml shell tasks missing executable: /bin/bash
- Add ansible_python_interpreter to inventory
- Update CLAUDE.md with deployment layers table and mount propagation notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 13:08:04 +00:00
A. F. Dudley a11d40f2f3 fix(k8s): add HostToContainer mount propagation to kind extraMounts
Without propagation, rbind submounts on the host (e.g., XFS zvol at
/srv/kind/solana) are invisible inside the kind node — it sees the
underlying filesystem (ZFS) instead. This causes agave's io_uring to
deadlock on ZFS transaction commits (D-state in dsl_dir_tempreserve_space).

HostToContainer propagation ensures host submounts propagate into the
kind node, so /mnt/solana correctly resolves to the XFS zvol.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 13:07:12 +00:00
A. F. Dudley b40883ef65 fix: separate switch inventory to prevent accidental targeting
Move switches.yml to inventory-switches/ so ansible.cfg's
`inventory = inventory/` only loads biscayne. Switch playbooks
must pass `-i inventory-switches/` explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:56:48 +00:00
A. F. Dudley 4f452db6fe fix: ansible-lint production profile compliance for all playbooks
- FQCN for all modules (ansible.builtin.*)
- changed_when/failed_when on all command/shell tasks
- set -o pipefail on all shell tasks
- Add KUBECONFIG environment to health-check.yml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:52:40 +00:00
A. F. Dudley eae4c3cdff feat(k8s): per-service resource layering in deployer
Resolve container resources using layered priority:
1. spec.yml per-container override (resources.containers.<name>)
2. Compose file deploy.resources block
3. spec.yml global resources
4. DEFAULT_CONTAINER_RESOURCES fallback

This prevents monitoring sidecars from inheriting the validator's
resource requests (e.g., 256G memory). Each service gets appropriate
resources from its compose definition unless explicitly overridden.

Note: existing deployments with a global resources block in spec.yml
can remove it once compose files declare per-service defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:26:10 +00:00
A. F. Dudley d36a71f13d fix: redeploy playbook handles SSH agent, git pull, config regen, stale PVs
- ansible.cfg: enable SSH agent forwarding for git operations
- biscayne-redeploy.yml: add git pull, deploy create --update, and
  clear stale PV claimRefs after namespace deletion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:58:29 +00:00
A. F. Dudley 8a8b882e32 bug: deploy create doesn't auto-generate volume mappings for new pods
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:56:28 +00:00
A. F. Dudley 9f6e1b5da7 fix: remove auto-revert timer, use checkpoint + write memory instead
Config is committed to running-config immediately (no 5-min timer).
Safety net is the checkpoint (rollback) and the fact that startup-config
is only written with -e commit=true. A reboot reverts uncommitted changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:49:25 +00:00
A. F. Dudley 742e84e3b0 feat: dedicated GRE tunnel (Tunnel100) bypassing DZ-managed Tunnel500
Root cause: the doublezero-agent on mia-sw01 manages Tunnel500's ACL
(SEC-USER-500-IN) and drops outbound gossip with src 137.239.194.65.
The agent overwrites any custom ACL entries.

Fix: create a separate GRE tunnel (Tunnel100) using mia-sw01's free
LAN IP (209.42.167.137) as tunnel source. This tunnel goes over the
ISP uplink, completely independent of the DZ overlay:
- mia-sw01: Tunnel100 src 209.42.167.137, dst 186.233.184.235
- biscayne: gre-ashburn src 186.233.184.235, dst 209.42.167.137
- Link addresses: 169.254.100.0/31

Playbook changes:
- ashburn-relay-mia-sw01: Tunnel100 + Loopback101 + SEC-VALIDATOR-100-IN
- ashburn-relay-biscayne: gre-ashburn tunnel + updated policy routing
- New template: ashburn-routing-ifup.sh.j2 for boot persistence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:47:58 +00:00
A. F. Dudley 0b52fc99d7 fix: ashburn relay playbooks and document DZ tunnel ACL root cause
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
  Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
  egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
  cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
  Loopback101

Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the
DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping
outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. Fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.

Also adds all repo docs, scripts, inventory, and remaining playbooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:44:25 +00:00
A. F. Dudley 6841d5e3c3 feat: ashburn validator relay playbooks
Three playbooks for routing all validator traffic through 137.239.194.65:

- was-sw01: Loopback101 + PBR redirect on Et1/1 (already applied/committed)
  Will be simplified to a static route in next iteration.

- mia-sw01: ACL permit for src 137.239.194.65 on Tunnel500 + default route
  in vrf1 via egress-vrf default to was-sw01 backbone. No PBR needed —
  per-tunnel ACLs already scope what enters vrf1.

- biscayne: DNAT inbound (137.239.194.65 → kind node), SNAT + policy
  routing outbound (validator sport 8001,9000-9025 → doublezero0 GRE).
  Inbound already applied.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 21:08:48 +00:00
A. F. Dudley dd29257dd8 chore: snapshot mia-sw01 and was-sw01 running configs
Captured via ansible `show running-config` before applying
mia-sw01 outbound validator redirect changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 20:45:32 +00:00
AFDudley 4a1b5d86fd Merge pull request 'fix(k8s): translate service names to localhost for sidecar containers' (#989) from fix-sidecar-localhost into main
Webapp Test / Run webapp test suite (push) Failing after 0s Details
Smoke Test / Run basic test suite (push) Failing after 0s Details
Lint Checks / Run linter (push) Failing after 0s Details
Publish / Build and publish (push) Failing after 0s Details
Deploy Test / Run deploy test suite (push) Failing after 0s Details
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/989
2026-02-03 23:13:27 +00:00
A. F. Dudley 019225ca18 fix(k8s): translate service names to localhost for sidecar containers
In docker-compose, services can reference each other by name (e.g., 'db:5432').
In Kubernetes, when multiple containers are in the same pod (sidecars), they
share the same network namespace and must use 'localhost' instead.

This fix adds translate_sidecar_service_names() which replaces docker-compose
service name references with 'localhost' in environment variable values for
containers that share the same pod.

Fixes issue where multi-container pods fail because one container tries to
connect to a sibling using the compose service name instead of localhost.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 18:10:32 -05:00
AFDudley 0296da6f64 Merge pull request 'feat(k8s): namespace-per-deployment for resource isolation and cleanup' (#988) from feat-namespace-per-deployment into main
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/988
2026-02-03 23:09:16 +00:00
A. F. Dudley d913926144 feat(k8s): namespace-per-deployment for resource isolation and cleanup
Each deployment now gets its own Kubernetes namespace (laconic-{deployment_id}).
This provides:
- Resource isolation between deployments on the same cluster
- Simplified cleanup: deleting the namespace cascades to all namespaced resources
- No orphaned resources possible when deployment IDs change

Changes:
- Set k8s_namespace based on deployment name in __init__
- Add _ensure_namespace() to create namespace before deploying resources
- Add _delete_namespace() for cleanup
- Simplify down() to just delete PVs (cluster-scoped) and the namespace
- Fix hardcoded "default" namespace in logs function

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 18:04:52 -05:00
AFDudley b41e0cb2f5 Merge pull request 'fix(k8s): query resources by label in down() for proper cleanup' (#987) from fix-down-cleanup-by-label into main
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/987
2026-02-03 22:57:52 +00:00
A. F. Dudley 47d3d10ead fix(k8s): query resources by label in down() for proper cleanup
Previously, down() generated resource names from the deployment config
and deleted those specific names. This failed to clean up orphaned
resources when deployment IDs changed (e.g., after force_redeploy).

Changes:
- Add 'app' label to all resources: Ingress, Service, NodePort, ConfigMap, PV
- Refactor down() to query K8s by label selector instead of generating names
- This ensures all resources for a deployment are cleaned up, even if
  the deployment config has changed or been deleted

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:55:14 -05:00
AFDudley 21d47908cc Merge pull request 'feat(k8s): ACME email fix, etcd persistence, volume paths' (#986) from fix-caddy-acme-email-rbac into main
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/986
2026-02-03 22:31:47 +00:00
A. F. Dudley f70e87b848 Add etcd + PKI extraMounts for offline data recovery
Mount /var/lib/etcd and /etc/kubernetes/pki to host filesystem
so cluster state is preserved for offline recovery. Each deployment
gets its own backup directory keyed by deployment ID.

Directory structure:
  data/cluster-backups/{deployment_id}/etcd/
  data/cluster-backups/{deployment_id}/pki/

This enables extracting secrets from etcd backups using etcdctl
with the preserved PKI certificates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:19:52 -05:00
A. F. Dudley 5bc6c978ac feat(k8s): support acme-email config for Caddy ingress
Adds support for configuring ACME email for Let's Encrypt certificates
in kind deployments. The email can be specified in the spec under
network.acme-email and will be used to configure the Caddy ingress
controller ConfigMap.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:19:52 -05:00
A. F. Dudley ee59918082 Allow relative volume paths for k8s-kind deployments
For k8s-kind, relative paths (e.g., ./data/rpc-config) are resolved to
$DEPLOYMENT_DIR/path by _make_absolute_host_path() during kind config
generation. This provides Docker Host persistence that survives cluster
restarts.

Previously, validation threw an exception before paths could be resolved,
making it impossible to use relative paths for persistent storage.

Changes:
- deployment_create.py: Skip relative path check for k8s-kind
- cluster_info.py: Allow relative paths to reach PV generation
- docs/deployment_patterns.md: Document volume persistence patterns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:17:44 -05:00
A. F. Dudley 581ceaea94 docs: Add cluster and volume management section
Document that:
- Volumes persist across cluster deletion by design
- Only use --delete-volumes when explicitly requested
- Multiple deployments share one kind cluster
- Use --skip-cluster-management to stop single deployment

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley 7cecf2caa6 Fix Caddy ACME email race condition by templating YAML
Previously, install_ingress_for_kind() applied the YAML (which starts
the Caddy pod with email: ""), then patched the ConfigMap afterward.
The pod had already read the empty email and Caddy doesn't hot-reload.

Now template the email into the YAML before applying, so the pod starts
with the correct email from the beginning.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley cb6fdb77a6 Rename image-registry to registry-credentials to avoid collision
The existing 'image-registry' key is used for pushing images to a remote
registry (URL string). Rename the new auth config to 'registry-credentials'
to avoid collision.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley 73ba13aaa5 Add private registry authentication support
Add ability to configure private container registry credentials in spec.yml
for deployments using images from registries like GHCR.

- Add get_image_registry_config() to spec.py for parsing image-registry config
- Add create_registry_secret() to create K8s docker-registry secrets
- Update cluster_info.py to use dynamic {deployment}-registry secret names
- Update deploy_k8s.py to create registry secret before deployment
- Document feature in deployment_patterns.md

The token-env pattern keeps credentials out of git - the spec references an
environment variable name, and the actual token is passed at runtime.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley d82b3fb881 Only load locally-built images into kind, auto-detect ingress
- Check stack.yml containers: field to determine which images are local builds
- Only load local images via kind load; let k8s pull registry images directly
- Add is_ingress_running() to skip ingress installation if already running
- Fixes deployment failures when public registry images aren't in local Docker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley 3bc7832d8c Fix deployment name extraction from path
When stack: field in spec.yml contains a path (e.g., stack_orchestrator/data/stacks/name),
extract just the final name component for K8s secret naming. K8s resource names must
be valid RFC 1123 subdomains and cannot contain slashes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley a75138093b Add setup-repositories to key files list
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley 1128c95969 Split documentation: README for users, CLAUDE.md for agents
README.md: deployment types, external stacks, commands, spec.yml reference
CLAUDE.md: implementation details, code locations, codebase navigation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:26 -05:00
A. F. Dudley d292e7c48d Add k8s-kind architecture documentation to CLAUDE.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:16:25 -05:00
A. F. Dudley b057969ddd Clarify create_cluster docstring: one cluster per host by design
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley ca090d2cd5 Add $generate:type:length$ token support for K8s secrets
- Add GENERATE_TOKEN_PATTERN to detect $generate:hex:N$ and $generate:base64:N$ tokens
- Add _generate_and_store_secrets() to create K8s Secrets from spec.yml config
- Modify _write_config_file() to separate secrets from regular config
- Add env_from with secretRef to container spec in cluster_info.py
- Secrets are injected directly into containers via K8s native mechanism

This enables declarative secret generation in spec.yml:
  config:
    SESSION_SECRET: $generate:hex:32$
    DB_PASSWORD: $generate:hex:16$

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 2d3721efa4 Add cluster reuse for multi-stack k8s-kind deployments
When deploying a second stack to k8s-kind, automatically reuse an existing
kind cluster instead of trying to create a new one (which would fail due
to port 80/443 conflicts).

Changes:
- helpers.py: create_cluster() now checks for existing cluster first
- deploy_k8s.py: up() captures returned cluster name and updates self

This enables deploying multiple stacks (e.g., gorbagana-rpc + trashscan-explorer)
to the same kind cluster.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 4408725b08 Fix repo root path calculation (4 parents from stack path) 2026-02-03 17:15:19 -05:00
A. F. Dudley 22d64f1e97 Add --spec-file option to restart and auto-detect GitOps spec
- Add --spec-file option to specify spec location in repo
- Auto-detect deployment/spec.yml in repo as GitOps location
- Fall back to deployment dir if no repo spec found

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 14258500bc Fix restart command for GitOps deployments
- Remove init_operation() from restart - don't regenerate spec from
  commands.py defaults, use existing git-tracked spec.yml instead
- Add docs/deployment_patterns.md documenting GitOps workflow
- Add pre-commit rule to CLAUDE.md
- Fix line length issues in helpers.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 3fbd854b8c Use docker for etcd existence check (root-owned dir)
The etcd directory is root-owned, so shell test -f fails.
Use docker with volume mount to check file existence.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley e2d3c44321 Keep timestamped backup of etcd forever
Create member.backup-YYYYMMDD-HHMMSS before cleaning.
Each cluster recreation creates a new backup, preserving history.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 720e01fc75 Preserve original etcd backup until restore is verified
Move original to .bak, move new into place, then delete bak.
If anything fails before the swap, original remains intact.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 5b06cffe17 Use whitelist approach for etcd cleanup
Instead of trying to delete specific stale resources (blacklist),
keep only the valuable data (caddy TLS certs) and delete everything
else. This is more robust as we don't need to maintain a list of
all possible stale resources.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 8948f5bfec Fix etcd cleanup to use docker for root-owned files
Use docker containers with volume mounts to handle all file
operations on root-owned etcd directories, avoiding the need
for sudo on the host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 675ee87544 Clear stale CNI resources from persisted etcd before cluster creation
When etcd is persisted (for certificate backup) and a cluster is
recreated, kind tries to install CNI (kindnet) fresh but the
persisted etcd already has those resources, causing 'AlreadyExists'
errors and cluster creation failure.

This fix:
- Detects etcd mount path from kind config
- Before cluster creation, clears stale CNI resources (kindnet, coredns)
- Preserves certificate and other important data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00
A. F. Dudley 8d3191e4fd Fix Caddy ingress ACME email and RBAC issues
- Add acme_email_key constant for spec.yml parsing
- Add get_acme_email() method to Spec class
- Modify install_ingress_for_kind() to patch ConfigMap with email
- Pass acme-email from spec to ingress installation
- Add 'delete' verb to leases RBAC for certificate lock cleanup

The acme-email field in spec.yml was previously ignored, causing
Let's Encrypt to fail with "unable to parse email address".
The missing delete permission on leases caused lock cleanup failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:15:19 -05:00