stack-orchestrator

Commit Graph

Author	SHA1	Message	Date
A. F. Dudley	14c0f63775	feat: layer 4 invariants, mount checks, and deployment layer docs - Rename biscayne-boot.yml → biscayne-prepare-agave.yml (layer 4) - Document deployment layers and layer 4 invariants in playbook header - Add zvol, ramdisk, rbind fstab management with stale entry cleanup - Add kind node XFS verification (reads cluster-id from deployment) - Add mount checks to health-check.yml (host mounts, kind visibility, propagation) - Fix health-check discovery tasks with tags: [always] and non-fatal pod lookup - Fix biscayne-redeploy.yml shell tasks missing executable: /bin/bash - Add ansible_python_interpreter to inventory - Update CLAUDE.md with deployment layers table and mount propagation notes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 13:08:04 +00:00
A. F. Dudley	a11d40f2f3	fix(k8s): add HostToContainer mount propagation to kind extraMounts Without propagation, rbind submounts on the host (e.g., XFS zvol at /srv/kind/solana) are invisible inside the kind node — it sees the underlying filesystem (ZFS) instead. This causes agave's io_uring to deadlock on ZFS transaction commits (D-state in dsl_dir_tempreserve_space). HostToContainer propagation ensures host submounts propagate into the kind node, so /mnt/solana correctly resolves to the XFS zvol. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 13:07:12 +00:00
A. F. Dudley	b40883ef65	fix: separate switch inventory to prevent accidental targeting Move switches.yml to inventory-switches/ so ansible.cfg's `inventory = inventory/` only loads biscayne. Switch playbooks must pass `-i inventory-switches/` explicitly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 10:56:48 +00:00
A. F. Dudley	4f452db6fe	fix: ansible-lint production profile compliance for all playbooks - FQCN for all modules (ansible.builtin.*) - changed_when/failed_when on all command/shell tasks - set -o pipefail on all shell tasks - Add KUBECONFIG environment to health-check.yml Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 10:52:40 +00:00
A. F. Dudley	eae4c3cdff	feat(k8s): per-service resource layering in deployer Resolve container resources using layered priority: 1. spec.yml per-container override (resources.containers.<name>) 2. Compose file deploy.resources block 3. spec.yml global resources 4. DEFAULT_CONTAINER_RESOURCES fallback This prevents monitoring sidecars from inheriting the validator's resource requests (e.g., 256G memory). Each service gets appropriate resources from its compose definition unless explicitly overridden. Note: existing deployments with a global resources block in spec.yml can remove it once compose files declare per-service defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 10:26:10 +00:00
A. F. Dudley	d36a71f13d	fix: redeploy playbook handles SSH agent, git pull, config regen, stale PVs - ansible.cfg: enable SSH agent forwarding for git operations - biscayne-redeploy.yml: add git pull, deploy create --update, and clear stale PV claimRefs after namespace deletion Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 09:58:29 +00:00
A. F. Dudley	8a8b882e32	bug: deploy create doesn't auto-generate volume mappings for new pods Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 09:56:28 +00:00
A. F. Dudley	9f6e1b5da7	fix: remove auto-revert timer, use checkpoint + write memory instead Config is committed to running-config immediately (no 5-min timer). Safety net is the checkpoint (rollback) and the fact that startup-config is only written with -e commit=true. A reboot reverts uncommitted changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 01:49:25 +00:00
A. F. Dudley	742e84e3b0	feat: dedicated GRE tunnel (Tunnel100) bypassing DZ-managed Tunnel500 Root cause: the doublezero-agent on mia-sw01 manages Tunnel500's ACL (SEC-USER-500-IN) and drops outbound gossip with src 137.239.194.65. The agent overwrites any custom ACL entries. Fix: create a separate GRE tunnel (Tunnel100) using mia-sw01's free LAN IP (209.42.167.137) as tunnel source. This tunnel goes over the ISP uplink, completely independent of the DZ overlay: - mia-sw01: Tunnel100 src 209.42.167.137, dst 186.233.184.235 - biscayne: gre-ashburn src 186.233.184.235, dst 209.42.167.137 - Link addresses: 169.254.100.0/31 Playbook changes: - ashburn-relay-mia-sw01: Tunnel100 + Loopback101 + SEC-VALIDATOR-100-IN - ashburn-relay-biscayne: gre-ashburn tunnel + updated policy routing - New template: ashburn-routing-ifup.sh.j2 for boot persistence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 01:47:58 +00:00
A. F. Dudley	0b52fc99d7	fix: ashburn relay playbooks and document DZ tunnel ACL root cause Playbook fixes from testing: - ashburn-relay-biscayne: insert DNAT rules at position 1 before Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+) - ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via egress-vrf vrf1 (nexthop only, no interface — EOS silently drops cross-VRF routes that specify a tunnel interface) - ashburn-relay-was-sw01: replace PBR with static route, remove Loopback101 Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping outbound gossip with src 137.239.194.65. The DZ agent controls Tunnel500's lifecycle. Fix requires a separate GRE tunnel using mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure. Also adds all repo docs, scripts, inventory, and remaining playbooks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 01:44:25 +00:00
A. F. Dudley	6841d5e3c3	feat: ashburn validator relay playbooks Three playbooks for routing all validator traffic through 137.239.194.65: - was-sw01: Loopback101 + PBR redirect on Et1/1 (already applied/committed) Will be simplified to a static route in next iteration. - mia-sw01: ACL permit for src 137.239.194.65 on Tunnel500 + default route in vrf1 via egress-vrf default to was-sw01 backbone. No PBR needed — per-tunnel ACLs already scope what enters vrf1. - biscayne: DNAT inbound (137.239.194.65 → kind node), SNAT + policy routing outbound (validator sport 8001,9000-9025 → doublezero0 GRE). Inbound already applied. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 21:08:48 +00:00
A. F. Dudley	dd29257dd8	chore: snapshot mia-sw01 and was-sw01 running configs Captured via ansible `show running-config` before applying mia-sw01 outbound validator redirect changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 20:45:32 +00:00
AFDudley	4a1b5d86fd	Merge pull request 'fix(k8s): translate service names to localhost for sidecar containers' (#989) from fix-sidecar-localhost into main Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/989	2026-02-03 23:13:27 +00:00
A. F. Dudley	019225ca18	fix(k8s): translate service names to localhost for sidecar containers In docker-compose, services can reference each other by name (e.g., 'db:5432'). In Kubernetes, when multiple containers are in the same pod (sidecars), they share the same network namespace and must use 'localhost' instead. This fix adds translate_sidecar_service_names() which replaces docker-compose service name references with 'localhost' in environment variable values for containers that share the same pod. Fixes issue where multi-container pods fail because one container tries to connect to a sibling using the compose service name instead of localhost. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 18:10:32 -05:00
AFDudley	0296da6f64	Merge pull request 'feat(k8s): namespace-per-deployment for resource isolation and cleanup' (#988) from feat-namespace-per-deployment into main Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/988	2026-02-03 23:09:16 +00:00
A. F. Dudley	d913926144	feat(k8s): namespace-per-deployment for resource isolation and cleanup Each deployment now gets its own Kubernetes namespace (laconic-{deployment_id}). This provides: - Resource isolation between deployments on the same cluster - Simplified cleanup: deleting the namespace cascades to all namespaced resources - No orphaned resources possible when deployment IDs change Changes: - Set k8s_namespace based on deployment name in __init__ - Add _ensure_namespace() to create namespace before deploying resources - Add _delete_namespace() for cleanup - Simplify down() to just delete PVs (cluster-scoped) and the namespace - Fix hardcoded "default" namespace in logs function Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 18:04:52 -05:00
AFDudley	b41e0cb2f5	Merge pull request 'fix(k8s): query resources by label in down() for proper cleanup' (#987) from fix-down-cleanup-by-label into main Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/987	2026-02-03 22:57:52 +00:00
A. F. Dudley	47d3d10ead	fix(k8s): query resources by label in down() for proper cleanup Previously, down() generated resource names from the deployment config and deleted those specific names. This failed to clean up orphaned resources when deployment IDs changed (e.g., after force_redeploy). Changes: - Add 'app' label to all resources: Ingress, Service, NodePort, ConfigMap, PV - Refactor down() to query K8s by label selector instead of generating names - This ensures all resources for a deployment are cleaned up, even if the deployment config has changed or been deleted Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:55:14 -05:00
AFDudley	21d47908cc	Merge pull request 'feat(k8s): ACME email fix, etcd persistence, volume paths' (#986) from fix-caddy-acme-email-rbac into main Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/986	2026-02-03 22:31:47 +00:00
A. F. Dudley	f70e87b848	Add etcd + PKI extraMounts for offline data recovery Mount /var/lib/etcd and /etc/kubernetes/pki to host filesystem so cluster state is preserved for offline recovery. Each deployment gets its own backup directory keyed by deployment ID. Directory structure: data/cluster-backups/{deployment_id}/etcd/ data/cluster-backups/{deployment_id}/pki/ This enables extracting secrets from etcd backups using etcdctl with the preserved PKI certificates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:19:52 -05:00
A. F. Dudley	5bc6c978ac	feat(k8s): support acme-email config for Caddy ingress Adds support for configuring ACME email for Let's Encrypt certificates in kind deployments. The email can be specified in the spec under network.acme-email and will be used to configure the Caddy ingress controller ConfigMap. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:19:52 -05:00
A. F. Dudley	ee59918082	Allow relative volume paths for k8s-kind deployments For k8s-kind, relative paths (e.g., ./data/rpc-config) are resolved to $DEPLOYMENT_DIR/path by _make_absolute_host_path() during kind config generation. This provides Docker Host persistence that survives cluster restarts. Previously, validation threw an exception before paths could be resolved, making it impossible to use relative paths for persistent storage. Changes: - deployment_create.py: Skip relative path check for k8s-kind - cluster_info.py: Allow relative paths to reach PV generation - docs/deployment_patterns.md: Document volume persistence patterns Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:17:44 -05:00
A. F. Dudley	581ceaea94	docs: Add cluster and volume management section Document that: - Volumes persist across cluster deletion by design - Only use --delete-volumes when explicitly requested - Multiple deployments share one kind cluster - Use --skip-cluster-management to stop single deployment Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	7cecf2caa6	Fix Caddy ACME email race condition by templating YAML Previously, install_ingress_for_kind() applied the YAML (which starts the Caddy pod with email: ""), then patched the ConfigMap afterward. The pod had already read the empty email and Caddy doesn't hot-reload. Now template the email into the YAML before applying, so the pod starts with the correct email from the beginning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	cb6fdb77a6	Rename image-registry to registry-credentials to avoid collision The existing 'image-registry' key is used for pushing images to a remote registry (URL string). Rename the new auth config to 'registry-credentials' to avoid collision. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	73ba13aaa5	Add private registry authentication support Add ability to configure private container registry credentials in spec.yml for deployments using images from registries like GHCR. - Add get_image_registry_config() to spec.py for parsing image-registry config - Add create_registry_secret() to create K8s docker-registry secrets - Update cluster_info.py to use dynamic {deployment}-registry secret names - Update deploy_k8s.py to create registry secret before deployment - Document feature in deployment_patterns.md The token-env pattern keeps credentials out of git - the spec references an environment variable name, and the actual token is passed at runtime. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	d82b3fb881	Only load locally-built images into kind, auto-detect ingress - Check stack.yml containers: field to determine which images are local builds - Only load local images via kind load; let k8s pull registry images directly - Add is_ingress_running() to skip ingress installation if already running - Fixes deployment failures when public registry images aren't in local Docker Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	3bc7832d8c	Fix deployment name extraction from path When stack: field in spec.yml contains a path (e.g., stack_orchestrator/data/stacks/name), extract just the final name component for K8s secret naming. K8s resource names must be valid RFC 1123 subdomains and cannot contain slashes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	a75138093b	Add setup-repositories to key files list Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	1128c95969	Split documentation: README for users, CLAUDE.md for agents README.md: deployment types, external stacks, commands, spec.yml reference CLAUDE.md: implementation details, code locations, codebase navigation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:26 -05:00
A. F. Dudley	d292e7c48d	Add k8s-kind architecture documentation to CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:16:25 -05:00
A. F. Dudley	b057969ddd	Clarify create_cluster docstring: one cluster per host by design Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	ca090d2cd5	Add $generate:type:length$ token support for K8s secrets - Add GENERATE_TOKEN_PATTERN to detect $generate:hex:N$ and $generate:base64:N$ tokens - Add _generate_and_store_secrets() to create K8s Secrets from spec.yml config - Modify _write_config_file() to separate secrets from regular config - Add env_from with secretRef to container spec in cluster_info.py - Secrets are injected directly into containers via K8s native mechanism This enables declarative secret generation in spec.yml: config: SESSION_SECRET: $generate:hex:32$ DB_PASSWORD: $generate:hex:16$ Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	2d3721efa4	Add cluster reuse for multi-stack k8s-kind deployments When deploying a second stack to k8s-kind, automatically reuse an existing kind cluster instead of trying to create a new one (which would fail due to port 80/443 conflicts). Changes: - helpers.py: create_cluster() now checks for existing cluster first - deploy_k8s.py: up() captures returned cluster name and updates self This enables deploying multiple stacks (e.g., gorbagana-rpc + trashscan-explorer) to the same kind cluster. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	4408725b08	Fix repo root path calculation (4 parents from stack path)	2026-02-03 17:15:19 -05:00
A. F. Dudley	22d64f1e97	Add --spec-file option to restart and auto-detect GitOps spec - Add --spec-file option to specify spec location in repo - Auto-detect deployment/spec.yml in repo as GitOps location - Fall back to deployment dir if no repo spec found Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	14258500bc	Fix restart command for GitOps deployments - Remove init_operation() from restart - don't regenerate spec from commands.py defaults, use existing git-tracked spec.yml instead - Add docs/deployment_patterns.md documenting GitOps workflow - Add pre-commit rule to CLAUDE.md - Fix line length issues in helpers.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	3fbd854b8c	Use docker for etcd existence check (root-owned dir) The etcd directory is root-owned, so shell test -f fails. Use docker with volume mount to check file existence. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	e2d3c44321	Keep timestamped backup of etcd forever Create member.backup-YYYYMMDD-HHMMSS before cleaning. Each cluster recreation creates a new backup, preserving history. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	720e01fc75	Preserve original etcd backup until restore is verified Move original to .bak, move new into place, then delete bak. If anything fails before the swap, original remains intact. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	5b06cffe17	Use whitelist approach for etcd cleanup Instead of trying to delete specific stale resources (blacklist), keep only the valuable data (caddy TLS certs) and delete everything else. This is more robust as we don't need to maintain a list of all possible stale resources. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	8948f5bfec	Fix etcd cleanup to use docker for root-owned files Use docker containers with volume mounts to handle all file operations on root-owned etcd directories, avoiding the need for sudo on the host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	675ee87544	Clear stale CNI resources from persisted etcd before cluster creation When etcd is persisted (for certificate backup) and a cluster is recreated, kind tries to install CNI (kindnet) fresh but the persisted etcd already has those resources, causing 'AlreadyExists' errors and cluster creation failure. This fix: - Detects etcd mount path from kind config - Before cluster creation, clears stale CNI resources (kindnet, coredns) - Preserves certificate and other important data Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	8d3191e4fd	Fix Caddy ingress ACME email and RBAC issues - Add acme_email_key constant for spec.yml parsing - Add get_acme_email() method to Spec class - Modify install_ingress_for_kind() to patch ConfigMap with email - Pass acme-email from spec to ingress installation - Add 'delete' verb to leases RBAC for certificate lock cleanup The acme-email field in spec.yml was previously ignored, causing Let's Encrypt to fail with "unable to parse email address". The missing delete permission on leases caused lock cleanup failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	c197406cc7	feat(deploy): add deployment restart command Add `laconic-so deployment restart` command that: - Pulls latest code from stack git repository - Regenerates spec.yml from stack's commands.py - Verifies DNS if hostname changed (with --force to skip) - Syncs deployment directory preserving cluster ID and data - Stops and restarts deployment with --skip-cluster-management Also stores stack-source path in deployment.yml during create for automatic stack location on restart. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
A. F. Dudley	4713107546	docs(CLAUDE.md): add external stacks preferred guideline Document that external stack pattern should be used when creating new stacks for any reason, with directory structure and usage examples. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:15:19 -05:00
AFDudley	88dccdfb7c	Merge pull request 'fix(deploy): merge volumes from stack init() instead of overwriting' (#985) from fix-init-volumes-merge into main Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/985	2026-01-31 23:39:38 +00:00
A. F. Dudley	76c0c17c3b	fix(deploy): merge volumes from stack init() instead of overwriting Previously, volumes defined in a stack's commands.py init() function were being overwritten by volumes discovered from compose files. This prevented stacks from adding infrastructure volumes like caddy-data that aren't defined in the compose files. Now volumes are merged, with init() volumes taking precedence over compose-discovered defaults. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 18:23:20 -05:00
AFDudley	6a2bbae250	Merge pull request 'Add `--update` flag to `deploy create`' (#984) from roysc/deployment-create-sync into main Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/984 Reviewed-by: AFDudley <afdudley@noreply.git.vdb.to>	2026-01-31 22:46:40 +00:00
A. F. Dudley	458b548dcf	fix(k8s): add hostPath support for compose host path mounts Add support for Docker Compose host path mounts (like ../config/file:/path) in k8s deployments. Previously these were silently skipped, causing k8s deployments to fail when compose files used host path mounts. Changes: - Add helper functions for host path detection and name sanitization - Generate kind extraMounts for host path mounts - Create hostPath volumes in pod specs for host path mounts - Create volumeMounts with sanitized names for host path mounts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 19:25:28 -05:00
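
The layered resource priority described in commit eae4c3cdff could look roughly like the sketch below. This is an illustration only, not the deployer's actual code: the function name `resolve_container_resources`, its arguments, the dict shapes, and the fallback values in `DEFAULT_CONTAINER_RESOURCES` are all assumptions. Only the four-step priority order comes from the commit message.

```python
from typing import Any

# Hypothetical fallback values; the real DEFAULT_CONTAINER_RESOURCES
# is not shown in the commit log.
DEFAULT_CONTAINER_RESOURCES: dict[str, Any] = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
}


def resolve_container_resources(
    container_name: str,
    spec: dict[str, Any],
    compose_service: dict[str, Any],
) -> dict[str, Any]:
    """Return the first resource definition found, highest priority first.

    `spec` stands in for a parsed spec.yml and `compose_service` for one
    parsed service entry from a compose file (both assumed to be dicts).
    """
    spec_resources = spec.get("resources") or {}

    # 1. spec.yml per-container override: resources.containers.<name>
    per_container = (spec_resources.get("containers") or {}).get(container_name)
    if per_container:
        return per_container

    # 2. Compose file deploy.resources block for this service
    compose_resources = (compose_service.get("deploy") or {}).get("resources")
    if compose_resources:
        return compose_resources

    # 3. spec.yml global resources (every key except "containers")
    global_resources = {
        key: value for key, value in spec_resources.items() if key != "containers"
    }
    if global_resources:
        return global_resources

    # 4. Fallback defaults
    return DEFAULT_CONTAINER_RESOURCES
```

Under these assumptions, a monitoring sidecar whose compose service declares its own `deploy.resources` block resolves at step 2 and never reaches the global spec.yml block at step 3, which is the behaviour the commit message describes for avoiding the validator's 256G memory request.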

1241 Commits (03a5b5e39efe5505ccf5ca5bf24b5b394f48a85d)