Track stack-orchestrator work items with pebbles (append-only event log).
Epic so-076: Stack composition — deploy multiple stacks into one kind cluster
with independent lifecycle management per sub-stack.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K8s PUT (replace) operations require metadata.resourceVersion for
optimistic concurrency control. Services additionally have immutable
spec.clusterIP that must be preserved from the existing object.
On 409 conflict, all _ensure_* methods now read the existing resource
first and copy resourceVersion (and clusterIP for Services) into the
body before calling replace.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All K8s resource creation in deploy_k8s.py now uses try-create, catch
ApiException(409), then replace — matching the pattern already used for
secrets in deployment_create.py. This allows `deployment start` to be
safely re-run without 409 Conflict errors.
Resources made idempotent:
- Deployment (create_namespaced_deployment → replace on 409)
- Service (create_namespaced_service → replace on 409)
- Ingress (create_namespaced_ingress → replace on 409)
- NodePort services (same as Service)
- ConfigMap (create_namespaced_config_map → replace on 409)
- PV/PVC: bare `except: pass` replaced with explicit ApiException
catch for 404
Extracted _ensure_deployment(), _ensure_service(), _ensure_ingress(),
and _ensure_config_map() helpers to keep cyclomatic complexity in check.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Destroying the kind cluster on stop/start is almost never the intent.
The cluster holds PVs, ConfigMaps, and networking state that are
expensive to recreate. Default to preserving the cluster; pass
--perform-cluster-management explicitly when a full teardown is needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The update command only patches environment variables and adds a
restart annotation. It does not update ports, volumes, configmaps,
or any other deployment spec. The old name was misleading — it
implied a full spec update, causing operators to expect changes
that never took effect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_clean_etcd_keeping_certs() only preserved /registry/secrets/caddy-system,
deleting everything else including the kubernetes ClusterIP service in the
default namespace. When kind recreated the cluster with the cleaned etcd,
kube-apiserver saw existing data and skipped bootstrapping the service.
kindnet panicked on KUBERNETES_SERVICE_HOST missing, blocking all pod
networking.
Expand the whitelist to also preserve:
- /registry/services/specs/default/kubernetes
- /registry/services/endpoints/default/kubernetes
Loop over multiple prefixes instead of a single etcdctl get --prefix call.
See docs/bug-laconic-so-etcd-cleanup.md in biscayne-agave-runbook.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without propagation, rbind submounts on the host (e.g., XFS zvol at
/srv/kind/solana) are invisible inside the kind node — it sees the
underlying filesystem (ZFS) instead. This causes agave's io_uring to
deadlock on ZFS transaction commits (D-state in dsl_dir_tempreserve_space).
HostToContainer propagation ensures host submounts propagate into the
kind node, so /mnt/solana correctly resolves to the XFS zvol.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve container resources using layered priority:
1. spec.yml per-container override (resources.containers.<name>)
2. Compose file deploy.resources block
3. spec.yml global resources
4. DEFAULT_CONTAINER_RESOURCES fallback
This prevents monitoring sidecars from inheriting the validator's
resource requests (e.g., 256G memory). Each service gets appropriate
resources from its compose definition unless explicitly overridden.
Note: existing deployments with a global resources block in spec.yml
can remove it once compose files declare per-service defaults.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In docker-compose, services can reference each other by name (e.g., 'db:5432').
In Kubernetes, when multiple containers are in the same pod (sidecars), they
share the same network namespace and must use 'localhost' instead.
This fix adds translate_sidecar_service_names() which replaces docker-compose
service name references with 'localhost' in environment variable values for
containers that share the same pod.
Fixes issue where multi-container pods fail because one container tries to
connect to a sibling using the compose service name instead of localhost.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Each deployment now gets its own Kubernetes namespace (laconic-{deployment_id}).
This provides:
- Resource isolation between deployments on the same cluster
- Simplified cleanup: deleting the namespace cascades to all namespaced resources
- No orphaned resources possible when deployment IDs change
Changes:
- Set k8s_namespace based on deployment name in __init__
- Add _ensure_namespace() to create namespace before deploying resources
- Add _delete_namespace() for cleanup
- Simplify down() to just delete PVs (cluster-scoped) and the namespace
- Fix hardcoded "default" namespace in logs function
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, down() generated resource names from the deployment config
and deleted those specific names. This failed to clean up orphaned
resources when deployment IDs changed (e.g., after force_redeploy).
Changes:
- Add 'app' label to all resources: Ingress, Service, NodePort, ConfigMap, PV
- Refactor down() to query K8s by label selector instead of generating names
- This ensures all resources for a deployment are cleaned up, even if
the deployment config has changed or been deleted
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mount /var/lib/etcd and /etc/kubernetes/pki to host filesystem
so cluster state is preserved for offline recovery. Each deployment
gets its own backup directory keyed by deployment ID.
Directory structure:
data/cluster-backups/{deployment_id}/etcd/
data/cluster-backups/{deployment_id}/pki/
This enables extracting secrets from etcd backups using etcdctl
with the preserved PKI certificates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds support for configuring ACME email for Let's Encrypt certificates
in kind deployments. The email can be specified in the spec under
network.acme-email and will be used to configure the Caddy ingress
controller ConfigMap.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
For k8s-kind, relative paths (e.g., ./data/rpc-config) are resolved to
$DEPLOYMENT_DIR/path by _make_absolute_host_path() during kind config
generation. This provides Docker Host persistence that survives cluster
restarts.
Previously, validation threw an exception before paths could be resolved,
making it impossible to use relative paths for persistent storage.
Changes:
- deployment_create.py: Skip relative path check for k8s-kind
- cluster_info.py: Allow relative paths to reach PV generation
- docs/deployment_patterns.md: Document volume persistence patterns
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document that:
- Volumes persist across cluster deletion by design
- Only use --delete-volumes when explicitly requested
- Multiple deployments share one kind cluster
- Use --skip-cluster-management to stop single deployment
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, install_ingress_for_kind() applied the YAML (which starts
the Caddy pod with email: ""), then patched the ConfigMap afterward.
The pod had already read the empty email and Caddy doesn't hot-reload.
Now template the email into the YAML before applying, so the pod starts
with the correct email from the beginning.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The existing 'image-registry' key is used for pushing images to a remote
registry (URL string). Rename the new auth config to 'registry-credentials'
to avoid collision.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add ability to configure private container registry credentials in spec.yml
for deployments using images from registries like GHCR.
- Add get_image_registry_config() to spec.py for parsing image-registry config
- Add create_registry_secret() to create K8s docker-registry secrets
- Update cluster_info.py to use dynamic {deployment}-registry secret names
- Update deploy_k8s.py to create registry secret before deployment
- Document feature in deployment_patterns.md
The token-env pattern keeps credentials out of git - the spec references an
environment variable name, and the actual token is passed at runtime.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Check stack.yml containers: field to determine which images are local builds
- Only load local images via kind load; let k8s pull registry images directly
- Add is_ingress_running() to skip ingress installation if already running
- Fixes deployment failures when public registry images aren't in local Docker
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When stack: field in spec.yml contains a path (e.g., stack_orchestrator/data/stacks/name),
extract just the final name component for K8s secret naming. K8s resource names must
be valid RFC 1123 subdomains and cannot contain slashes.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GENERATE_TOKEN_PATTERN to detect $generate:hex:N$ and $generate:base64:N$ tokens
- Add _generate_and_store_secrets() to create K8s Secrets from spec.yml config
- Modify _write_config_file() to separate secrets from regular config
- Add env_from with secretRef to container spec in cluster_info.py
- Secrets are injected directly into containers via K8s native mechanism
This enables declarative secret generation in spec.yml:
config:
SESSION_SECRET: $generate:hex:32$
DB_PASSWORD: $generate:hex:16$
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When deploying a second stack to k8s-kind, automatically reuse an existing
kind cluster instead of trying to create a new one (which would fail due
to port 80/443 conflicts).
Changes:
- helpers.py: create_cluster() now checks for existing cluster first
- deploy_k8s.py: up() captures returned cluster name and updates self
This enables deploying multiple stacks (e.g., gorbagana-rpc + trashscan-explorer)
to the same kind cluster.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --spec-file option to specify spec location in repo
- Auto-detect deployment/spec.yml in repo as GitOps location
- Fall back to deployment dir if no repo spec found
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The etcd directory is root-owned, so shell test -f fails.
Use docker with volume mount to check file existence.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Create member.backup-YYYYMMDD-HHMMSS before cleaning.
Each cluster recreation creates a new backup, preserving history.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move original to .bak, move new into place, then delete bak.
If anything fails before the swap, original remains intact.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of trying to delete specific stale resources (blacklist),
keep only the valuable data (caddy TLS certs) and delete everything
else. This is more robust as we don't need to maintain a list of
all possible stale resources.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use docker containers with volume mounts to handle all file
operations on root-owned etcd directories, avoiding the need
for sudo on the host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When etcd is persisted (for certificate backup) and a cluster is
recreated, kind tries to install CNI (kindnet) fresh but the
persisted etcd already has those resources, causing 'AlreadyExists'
errors and cluster creation failure.
This fix:
- Detects etcd mount path from kind config
- Before cluster creation, clears stale CNI resources (kindnet, coredns)
- Preserves certificate and other important data
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add acme_email_key constant for spec.yml parsing
- Add get_acme_email() method to Spec class
- Modify install_ingress_for_kind() to patch ConfigMap with email
- Pass acme-email from spec to ingress installation
- Add 'delete' verb to leases RBAC for certificate lock cleanup
The acme-email field in spec.yml was previously ignored, causing
Let's Encrypt to fail with "unable to parse email address".
The missing delete permission on leases caused lock cleanup failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add `laconic-so deployment restart` command that:
- Pulls latest code from stack git repository
- Regenerates spec.yml from stack's commands.py
- Verifies DNS if hostname changed (with --force to skip)
- Syncs deployment directory preserving cluster ID and data
- Stops and restarts deployment with --skip-cluster-management
Also stores stack-source path in deployment.yml during create
for automatic stack location on restart.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document that external stack pattern should be used when creating new
stacks for any reason, with directory structure and usage examples.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, volumes defined in a stack's commands.py init() function
were being overwritten by volumes discovered from compose files.
This prevented stacks from adding infrastructure volumes like caddy-data
that aren't defined in the compose files.
Now volumes are merged, with init() volumes taking precedence over
compose-discovered defaults.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for Docker Compose host path mounts (like ../config/file:/path)
in k8s deployments. Previously these were silently skipped, causing k8s
deployments to fail when compose files used host path mounts.
Changes:
- Add helper functions for host path detection and name sanitization
- Generate kind extraMounts for host path mounts
- Create hostPath volumes in pod specs for host path mounts
- Create volumeMounts with sanitized names for host path mounts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
To allow updating an existing deployment
- Check the deployment dir exists when updating
- Write to temp dir, then safely copy tree
- Don't overwrite data dir or config.env
Resolve conflicts:
- deployment_context.py: Keep single modify_yaml method from main
- fixturenet-optimism/commands.py: Use modify_yaml helper from main
- deployment_create.py: Keep helm-chart, network-dir, initial-peers options
- deploy_webapp.py: Update create_operation call signature
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>