The mount-compatibility check lived inside create_cluster(), which only
runs under --perform-cluster-management. Under the (default)
--skip-cluster-management path the check was skipped — a deployment
joining an existing cluster with an incompatible kind-config would
proceed and silently fall through to the node's overlay FS, which is
exactly the failure mode the check was designed to catch.
Rename _check_mounts_compatible → check_mounts_compatible (now public)
and call it from both paths in _setup_cluster().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kind applies extraMounts only at cluster creation. When a deployment joins
an existing shared cluster, any extraMount its kind-config declares that
isn't already active on the running control-plane is silently ignored —
PVs backed by those mounts fall through to the node's overlay filesystem
and lose data on cluster destroy.
Validate this up front in create_cluster():
- On cluster reuse, compare the new deployment's extraMounts against the
  live bind mounts on the control-plane container (via docker inspect).
  Fail with a DeployerException listing every mismatched mount and
  pointing at docs/deployment_patterns.md.
- On first-time cluster creation without a /mnt umbrella mount
  (kind-mount-root unset), print a warning that future stacks may
  require a full recreate to add new host-path mounts.
Document the umbrella-mount convention (kind-mount-root) and the
migration path for existing clusters in docs/deployment_patterns.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
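The reuse-time comparison described above can be sketched roughly as follows. This is an illustration only: it assumes the requested extraMounts and the live bind mounts have already been reduced to (hostPath, containerPath) pairs, and `find_mismatched_mounts` is a hypothetical name, not the project's actual helper.

```python
from typing import Iterable, List, Tuple

# Hypothetical shape: (host_path, container_path)
Mount = Tuple[str, str]


def _normalize(path: str) -> str:
    """Drop trailing slashes so '/srv/kind/' and '/srv/kind' compare equal."""
    return path.rstrip("/") or "/"


def find_mismatched_mounts(
    requested: Iterable[Mount], live: Iterable[Mount]
) -> List[Mount]:
    """Return every requested mount that is not active on the running node."""
    live_set = {(_normalize(h), _normalize(c)) for h, c in live}
    return [
        (h, c)
        for h, c in requested
        if (_normalize(h), _normalize(c)) not in live_set
    ]
```

A caller would raise its error only when the returned list is non-empty, so that all mismatches are reported at once.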
Replaces the etcd-surgery persistence approach with a CronJob that dumps `manager=caddy` Secrets to `{kind-mount-root}/caddy-cert-backup/` every 5 min, and a restore step that applies the file before Caddy starts on a fresh cluster. Closes so-o2o.
Deletes `_clean_etcd_keeping_certs` and the etcd+PKI extraMounts. No new spec keys - activates when `kind-mount-root` is set.
Replaces the hardcoded `gcr.io/etcd-development/etcd:v3.5.9` in `_clean_etcd_keeping_certs` with a dynamic ref captured from the running Kind node via `crictl`, persisted to `{backup_dir}/etcd-image.txt` and reused on subsequent cleanup runs. Self-adapts to Kind upgrades, no version table to maintain.
Testing on Kind v0.32 / etcd 3.6 surfaced two additional bugs in the whitelist cleanup that this PR does **not** fix (see so-o2o comments):
(a) the restore step pipes raw protobuf values through bash `echo`, corrupting binary bytes;
(b) the whitelist omits cluster-admin RBAC, SAs, and bootstrap tokens needed by kubeadm's pre-addon health check.
Merging this narrow fix + diagnosis trail; follow-up branch will replace the etcd-surgery approach with a kubectl-level Caddy secret backup/restore.
- Only map host ports for services with network_mode: host (80/443 for Caddy always mapped). Previously all compose service ports were mapped unconditionally, causing conflicts with local services like postgres and redis
- Use spec configmap values as source paths instead of ignoring them. Fixes configmaps with user-defined paths (e.g. `stack-orchestrator/compose/maintenance`) and home-relative paths (e.g. `~/.credentials/local-certs/s3`)
- Read configmap files from deployment dir (`configmaps/{name}/`) when building k8s ConfigMap objects, not from the spec's source path which doesn't exist in the deployment dir
- File pebbles: `so-c71` (resolved), `so-078`: self-sufficient deployments (hooks should be copied to deployment dir)
Combines timestamp-based cluster IDs, namespace derived from stack name,
_build_containers refactor, jobs support, multi-ingress certificates,
user-declared secrets, and label-based resource cleanup with the existing
idempotent deploy, mount propagation, and port mapping fixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts keeping HostToContainer propagation on mount root
entry and per-container resource layering from the propagation branch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.
Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.
Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
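As a rough sketch of the path resolution described above: a volume under the root maps to /mnt/{relative_path}, and anything else keeps its own extraMount. The function name `resolve_node_path` and its return convention are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_node_path(host_path: str, mount_root: Optional[str]) -> Optional[str]:
    """Return the in-node path under /mnt for a volume covered by the root
    mount, or None when the volume needs its own individual extraMount."""
    if not mount_root:
        return None
    try:
        # relative_to compares whole path components, so /srv/kindred is
        # not treated as being under /srv/kind.
        relative = PurePosixPath(host_path).relative_to(PurePosixPath(mount_root))
    except ValueError:
        return None
    return str(PurePosixPath("/mnt") / relative)
```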
_write_config_file() now reads each file listed under the credentials-files
top-level spec key and appends its contents to config.env after config vars.
Paths support ~ expansion. Missing files fail hard with sys.exit(1).
Also adds get_credentials_files() to Spec class following the same pattern
as get_image_registry_config().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
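A minimal sketch of the append step, assuming config.env has already been written with the config vars. The helper name `append_credentials_files` and the error message are hypothetical; only the ~ expansion and the hard exit on a missing file come from the description above.

```python
import sys
from pathlib import Path
from typing import Iterable


def append_credentials_files(config_env: Path, credentials_files: Iterable[str]) -> None:
    """Append each credentials file's contents to config.env, in order."""
    for entry in credentials_files:
        source = Path(entry).expanduser()  # supports ~/... paths
        if not source.is_file():
            print(f"Error: credentials file not found: {source}", file=sys.stderr)
            sys.exit(1)
        content = source.read_text()
        with config_env.open("a") as out:
            # Keep the env file line-oriented even if the source file
            # lacks a trailing newline.
            out.write(content if content.endswith("\n") else content + "\n")
```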
- Replace token_hex cluster IDs with sortable timestamp-based IDs
  (laconic-{base62_timestamp}{random_suffix}) via new ids.py module
- Check for existing Kind cluster before generating a new cluster-id
- Derive k8s namespace from stack name instead of compose_project_name
  (e.g. laconic-dumpster instead of laconic-<random>)
- Plumb namespace through to secret generation instead of hardcoding
  'default'
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
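One way such an ID could be built is sketched below. The alphabet, the suffix length, and the function names are assumptions; the actual ids.py may differ.

```python
import secrets
import string
import time
from typing import Optional

# Digits, then uppercase, then lowercase: this matches ASCII order, so
# equal-length encodings sort the same way as the numbers they encode.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62(n: int) -> str:
    """Encode a non-negative integer in base 62."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def new_cluster_id(now: Optional[float] = None, suffix_len: int = 4) -> str:
    """Build laconic-{base62_timestamp}{random_suffix} (hypothetical layout)."""
    ts = int(time.time() if now is None else now)
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
    return f"laconic-{base62(ts)}{suffix}"
```

Because the timestamp comes first, IDs created in different seconds sort chronologically; the random suffix only separates IDs created within the same second.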
_clean_etcd_keeping_certs() only preserved /registry/secrets/caddy-system,
deleting everything else including the kubernetes ClusterIP service in the
default namespace. When kind recreated the cluster with the cleaned etcd,
kube-apiserver saw existing data and skipped bootstrapping the service.
kindnet panicked on KUBERNETES_SERVICE_HOST missing, blocking all pod
networking.
Expand the whitelist to also preserve:
- /registry/services/specs/default/kubernetes
- /registry/services/endpoints/default/kubernetes
Loop over multiple prefixes instead of a single etcdctl get --prefix call.
See docs/bug-laconic-so-etcd-cleanup.md in biscayne-agave-runbook.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
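The whitelist logic can be pictured as a prefix filter over etcd keys. The sketch below works on an in-memory list of keys rather than calling etcdctl, and the names `KEEP_PREFIXES` and `keys_to_keep` are hypothetical.

```python
from typing import Iterable, List, Sequence

# Prefixes named in the commit message above.
KEEP_PREFIXES = (
    "/registry/secrets/caddy-system",
    "/registry/services/specs/default/kubernetes",
    "/registry/services/endpoints/default/kubernetes",
)


def keys_to_keep(
    all_keys: Iterable[str], prefixes: Sequence[str] = KEEP_PREFIXES
) -> List[str]:
    """Return the keys that match any whitelisted prefix; the rest get deleted."""
    return [key for key in all_keys if any(key.startswith(p) for p in prefixes)]
```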
Without propagation, rbind submounts on the host (e.g., XFS zvol at
/srv/kind/solana) are invisible inside the kind node — it sees the
underlying filesystem (ZFS) instead. This causes agave's io_uring to
deadlock on ZFS transaction commits (D-state in dsl_dir_tempreserve_space).
HostToContainer propagation ensures host submounts propagate into the
kind node, so /mnt/solana correctly resolves to the XFS zvol.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The kind-mount-root extraMount entry used kind's default propagation
(None), so new bind mounts under the root on the host (e.g. zvols
mounted under /srv/kind) were not visible inside the kind node until
restart. Setting propagation to HostToContainer makes host-side mount
changes propagate into the kind node automatically.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the kind cluster.
Volumes whose host path is under the root are skipped for individual
extraMounts and their PV paths resolve to /mnt/{relative_path}.
Volumes outside the root keep individual extraMounts as before.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pods_in_deployment() and containers_in_pod() were hardcoded to search
the "default" namespace, but deployments are created in a per-deployment
namespace (laconic-{name}). This caused logs() to report "Pods not
running" even when pods were healthy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Kind's extraPortMappings only included ports 80/443 for Caddy. Compose
service ports (RPC, gossip, UDP) were never forwarded, making them
unreachable from the host. Also adds hostNetwork/dnsPolicy to the k8s
pod spec when any compose service uses network_mode: host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In docker-compose, services can reference each other by name (e.g., 'db:5432').
In Kubernetes, when multiple containers are in the same pod (sidecars), they
share the same network namespace and must use 'localhost' instead.
This fix adds translate_sidecar_service_names() which replaces docker-compose
service name references with 'localhost' in environment variable values for
containers that share the same pod.
Fixes issue where multi-container pods fail because one container tries to
connect to a sibling using the compose service name instead of localhost.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
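A simplified sketch of the translation follows. The function name comes from the commit message, but the signature and the matching rules are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable


def translate_sidecar_service_names(
    env: Dict[str, str], sibling_services: Iterable[str]
) -> Dict[str, str]:
    """Replace references to same-pod compose services with 'localhost'."""
    # Longest names first so 'db-replica' is tried before 'db'.
    names = sorted(set(sibling_services), key=len, reverse=True)
    if not names:
        return dict(env)
    # Match a service name only when it is not part of a longer hostname
    # or identifier (no word character, dot, or hyphen on either side).
    pattern = re.compile(
        r"(?<![\w.-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w.-])"
    )
    return {key: pattern.sub("localhost", value) for key, value in env.items()}
```

This matching is deliberately naive: a value such as `postgres://db:5432/db` would have both occurrences of `db` rewritten, including the database name, so a real implementation may need to be stricter about where a hostname can appear.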
Mount /var/lib/etcd and /etc/kubernetes/pki to host filesystem
so cluster state is preserved for offline recovery. Each deployment
gets its own backup directory keyed by deployment ID.
Directory structure:
data/cluster-backups/{deployment_id}/etcd/
data/cluster-backups/{deployment_id}/pki/
This enables extracting secrets from etcd backups using etcdctl
with the preserved PKI certificates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds support for configuring ACME email for Let's Encrypt certificates
in kind deployments. The email can be specified in the spec under
network.acme-email and will be used to configure the Caddy ingress
controller ConfigMap.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, install_ingress_for_kind() applied the YAML (which starts
the Caddy pod with email: ""), then patched the ConfigMap afterward.
The pod had already read the empty email and Caddy doesn't hot-reload.
Now template the email into the YAML before applying, so the pod starts
with the correct email from the beginning.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Check stack.yml containers: field to determine which images are local builds
- Only load local images via kind load; let k8s pull registry images directly
- Add is_ingress_running() to skip ingress installation if already running
- Fixes deployment failures when public registry images aren't in local Docker
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When deploying a second stack to k8s-kind, automatically reuse an existing
kind cluster instead of trying to create a new one (which would fail due
to port 80/443 conflicts).
Changes:
- helpers.py: create_cluster() now checks for existing cluster first
- deploy_k8s.py: up() captures returned cluster name and updates self
This enables deploying multiple stacks (e.g., gorbagana-rpc + trashscan-explorer)
to the same kind cluster.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The etcd directory is root-owned, so shell test -f fails.
Use docker with volume mount to check file existence.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Create member.backup-YYYYMMDD-HHMMSS before cleaning.
Each cluster recreation creates a new backup, preserving history.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move original to .bak, move new into place, then delete bak.
If anything fails before the swap, original remains intact.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
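The swap sequence can be sketched with plain filesystem calls. The real code operates on root-owned etcd directories, which this sketch does not attempt to handle, and `swap_into_place` is a hypothetical name.

```python
import shutil
from pathlib import Path


def swap_into_place(new_dir: Path, target: Path) -> None:
    """Replace target with new_dir, keeping target as .bak until the swap succeeds."""
    backup = target.with_name(target.name + ".bak")
    if backup.exists():
        shutil.rmtree(backup)  # clear a stale backup from an earlier run
    target.rename(backup)  # 1. move original to .bak
    try:
        new_dir.rename(target)  # 2. move new into place
    except OSError:
        backup.rename(target)  # roll back so the original stays intact
        raise
    shutil.rmtree(backup)  # 3. delete the backup only after success
```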
Instead of trying to delete specific stale resources (blacklist),
keep only the valuable data (caddy TLS certs) and delete everything
else. This is more robust as we don't need to maintain a list of
all possible stale resources.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use docker containers with volume mounts to handle all file
operations on root-owned etcd directories, avoiding the need
for sudo on the host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When etcd is persisted (for certificate backup) and a cluster is
recreated, kind tries to install CNI (kindnet) fresh but the
persisted etcd already has those resources, causing 'AlreadyExists'
errors and cluster creation failure.
This fix:
- Detects etcd mount path from kind config
- Before cluster creation, clears stale CNI resources (kindnet, coredns)
- Preserves certificate and other important data
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add acme_email_key constant for spec.yml parsing
- Add get_acme_email() method to Spec class
- Modify install_ingress_for_kind() to patch ConfigMap with email
- Pass acme-email from spec to ingress installation
- Add 'delete' verb to leases RBAC for certificate lock cleanup
The acme-email field in spec.yml was previously ignored, causing
Let's Encrypt to fail with "unable to parse email address".
The missing delete permission on leases caused lock cleanup failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for Docker Compose host path mounts (like ../config/file:/path)
in k8s deployments. Previously these were silently skipped, causing k8s
deployments to fail when compose files used host path mounts.
Changes:
- Add helper functions for host path detection and name sanitization
- Generate kind extraMounts for host path mounts
- Create hostPath volumes in pod specs for host path mounts
- Create volumeMounts with sanitized names for host path mounts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Caddy provides automatic HTTPS with Let's Encrypt, but needs port 443
mapped from the kind container to the host. Previously only port 80 was
mapped.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The helm-charts-with-caddy branch had the Caddy manifest file but was still
using nginx in the code. This change:
- Switch install_ingress_for_kind() to use ingress-caddy-kind-deploy.yaml
- Update wait_for_ingress_in_kind() to watch caddy-system namespace
- Use correct label selector for Caddy ingress controller pods
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The base_runtime_spec for containerd requires a complete OCI spec,
not just the rlimits section. The minimal spec was causing runc to
fail with "open /proc/self/fd: no such file or directory" because
essential mounts and namespaces were missing.
This commit uses kind's default cri-base.json as the base and adds
the rlimits configuration on top. The spec includes all necessary
mounts, namespaces, capabilities, and kind-specific hooks.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous approach of mounting cri-base.json into kind nodes failed
because we didn't tell containerd to use it via containerdConfigPatches.
RuntimeClass allows different stacks to have different rlimit profiles,
which is essential since kind only supports one cluster per host and
multiple stacks share the same cluster.
Changes:
- Add containerdConfigPatches to kind-config.yml to define runtime handlers
- Create RuntimeClass resources after cluster creation
- Add runtimeClassName to pod specs based on stack's security settings
- Rename cri-base.json to high-memlock-spec.json for clarity
- Add get_runtime_class() method to Spec that auto-derives from
  unlimited-memlock setting
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add pyrightconfig.json for pyright 1.1.408 TOML parsing workaround
- Add NoReturn annotations to fatal() functions for proper type narrowing
- Add None checks and assertions after require=True get_record() calls
- Fix AttrDict class with __getattr__ for dynamic attribute access
- Add type annotations and casts for Kubernetes client objects
- Store compose config as DockerDeployer instance attributes
- Filter None values from dotenv and environment mappings
- Use hasattr/getattr patterns for optional container attributes
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add spec.yml option `security.unlimited-memlock` that configures
RLIMIT_MEMLOCK to unlimited for Kind cluster pods. This is needed
for workloads like Solana validators that require large amounts of
locked memory for memory-mapped files during snapshot decompression.
When enabled, generates a cri-base.json file with rlimits and mounts
it into the Kind node to override the default containerd runtime spec.
Also includes flake8 line-length fixes for affected files.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Caddy ingress controller manifest for kind deployments
- Add k8s cluster list command for kind cluster management
- Add k8s_command import and registration in deploy.py
- Fix network section merge to preserve http-proxy settings
- Increase default container resources (4 CPUs, 8GB memory)
- Add UDP protocol support for K8s port definitions
- Add command/entrypoint support for K8s deployments
- Implement docker-compose variable expansion for K8s
- Set ConfigMap defaultMode to 0755 for executable scripts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
In kind, a bind-mounted host directory is first mounted into the kind container at /mnt, then into the pod at the desired location.
We accidentally carried this behavior over to full-blown k8s and were creating volumes at /mnt there as well. This changes the behavior for both kind and regular k8s so that bind mounts are only allowed if a fully-qualified path is specified. If no path is specified at all, a default storageClass is assumed to be present, and the volume is managed by a provisioner.
E.g., for kind, the default provisioner is: https://github.com/rancher/local-path-provisioner
```
stack: test
deploy-to: k8s-kind
config:
  test-variable-1: test-value-1
network:
  ports:
    test:
      - '80'
volumes:
  # this will be bind-mounted to a host-path
  test-data-bind: /srv/data
  # this will be managed by the k8s node
  test-data-auto:
configmaps:
  test-config: ./configmap/test-config
```
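The rule for the `volumes` section can be sketched as a small classifier. The function name and return labels are hypothetical, and rejecting relative paths with an exception is an assumption about how "only allowed if a fully-qualified path is specified" is enforced.

```python
from typing import Optional


def volume_kind(path: Optional[str]) -> str:
    """Classify a spec volume entry by its configured path."""
    if path is None or path.strip() == "":
        # No path at all: rely on the cluster's default storageClass.
        return "provisioner"
    if path.startswith("/"):
        # Fully-qualified path: bind-mount from the host.
        return "bind-mount"
    # Assumption: relative paths are rejected rather than silently remapped.
    raise ValueError(f"volume path must be absolute or empty: {path!r}")
```

With the spec above, `test-data-bind` would be classified as a bind mount and `test-data-auto` would be left to the provisioner.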
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/741
Co-authored-by: Thomas E Lackey <telackey@bozemanpass.com>
Co-committed-by: Thomas E Lackey <telackey@bozemanpass.com>