- biscayne-migrate-storage.yml: stop docker to release bind mounts
  before destroying the zvol; skip copying data (it would be stale; a
  fresh snapshot is needed anyway); handle a partially-migrated state;
  restart docker at the end (ordering sketched below)
- biscayne-upgrade-zfs.yml: use the add-apt-repository CLI (the Ansible
  module times out), fix the libzfs package name (libzfs4linux, not
  libzfs5linux), and tolerate apt update warnings from the stale
  influxdata GPG key
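
A hedged sketch of the ordering the migrate-storage play enforces
(dataset name is hypothetical; the data itself is not copied):

    systemctl stop docker          # release bind mounts on the old storage
    zfs destroy -r tank/solana     # safe: contents are stale anyway; the
                                   # container fetches a fresh snapshot later
    # ... create and mount the replacement storage ...
    systemctl start docker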
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of hardcoding the laconic cluster ID, namespace, deployment
name, and pod label, read cluster-id from deployment.yml on biscayne
and derive everything from it.
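
A hedged sketch of the derivation (the file path and naming scheme here
are assumptions, not the playbook's actual ones):

    cluster_id=$(awk -F': *' '/^cluster-id:/ {print $2}' \
        /srv/deployments/solana/deployment.yml)
    namespace="ns-${cluster_id}"
    deployment="${cluster_id}-deployment"
    pod_label="app=${cluster_id}"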
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The recovery playbook now exits after scaling to 1. The container
entrypoint handles snapshot download (60+ min) and validator startup
autonomously. Removed all polling/verification steps that would
otherwise time out waiting for it.
Added scripts/check-status.py for monitoring download progress,
validator slot, gap to mainnet, catch-up rate, and ramdisk usage.
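
For instance, the gap to mainnet reduces to two getSlot RPC calls
(the local RPC port is an assumption):

    head=$(curl -s http://api.mainnet-beta.solana.com -X POST \
        -H 'content-type: application/json' \
        -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r .result)
    mine=$(curl -s http://localhost:8899 -X POST \
        -H 'content-type: application/json' \
        -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r .result)
    echo "gap to mainnet: $((head - mine)) slots"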
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container's entrypoint.py already handles snapshot freshness checks,
cleanup, download (with rolling incremental convergence), and validator
startup. Remove the host-side download and let the container do the work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without a bound, the loop runs forever if sources never serve an
incremental close enough to head (e.g. when the full snapshot's base
slot is too old). After 30 minutes, proceed with the best incremental
available, or none.
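
The bound is a plain deadline around the re-probe loop; a sketch with
hypothetical variable names (the real logic is in snapshot_download.py):

    deadline=$(( $(date +%s) + 1800 ))           # 30 minutes
    while (( gap > convergence_slots )); do
        (( $(date +%s) >= deadline )) && break   # best-so-far, or none
        # ... re-probe sources, fetch best incremental, recompute gap ...
        sleep 30
    done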
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After the full snapshot downloads, continuously re-probe all fast sources
for newer incrementals until the best available is within convergence_slots
(default 500) of head. Each iteration finds the highest-slot incremental
matching our full snapshot's base slot, downloads it (replacing any previous),
and checks the gap to mainnet head (one iteration is sketched after
the change list below).
- Extract probe_incremental() from inline re-probe code
- Add convergence_slots param to download_best_snapshot() (default 500)
- Add --convergence-slots CLI arg
- Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py
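
One probe iteration, sketched in shell (the real probe_incremental() is
Python in snapshot_download.py; this assumes the common Solana
convention that RPC nodes redirect /incremental-snapshot.tar.bz2 to
incremental-snapshot-<base>-<slot>-<hash>.tar.zst):

    probe_incremental() {    # prints "<slot> <location>" or nothing
        local src=$1 loc
        loc=$(curl -sI "$src/incremental-snapshot.tar.bz2" |
              awk -F': ' 'tolower($1)=="location" {print $2}' | tr -d '\r')
        [[ $loc =~ incremental-snapshot-([0-9]+)-([0-9]+)- ]] || return 1
        # only incrementals built on our full snapshot's base slot count
        [[ ${BASH_REMATCH[1]} == "$full_base_slot" ]] || return 1
        echo "${BASH_REMATCH[2]} $loc"   # callers resolve $loc against $src
    }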
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.
- biscayne-redeploy.yml: remove snapshot download section, simplify to
teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
hostNetwork translation code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scripts/agave-container/ is a git subtree of agave-stack's container-build
directory. This replaces the fragile cross-repo symlink with a proper
subtree.
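
The usual subtree workflow, with repo URL and branch names as
assumptions:

    # in agave-stack: split the directory into a synthetic branch, push it
    git subtree split --prefix container-build -b container-build-split
    git push origin container-build-split
    # in this repo: graft it in
    git subtree add --prefix scripts/agave-container \
        git@github.com:example/agave-stack.git container-build-split --squash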
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
  rules for both the host IP (186.233.184.235) and the relay loopback
  (137.239.194.65). Idempotent, persists via netfilter-persistent (rule
  shape sketched after this list).
- scripts/snapshot-download.py: replaced standalone copy with symlink to
agave-stack source of truth, eliminating duplication.
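
The DNAT rule shape the playbook manages, as a hedged sketch (the kind
node address 172.18.0.2 and the single port shown are illustrative; the
playbook is authoritative):

    iptables -t nat -I PREROUTING 1 -d 137.239.194.65 -p udp --dport 8001 \
        -j DNAT --to-destination 172.18.0.2
    iptables -t nat -I PREROUTING 1 -d 186.233.184.235 -p udp --dport 8001 \
        -j DNAT --to-destination 172.18.0.2
    netfilter-persistent save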
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remounting the tmpfs is instant (the kernel just frees the pages),
while rm -rf on 400GB+ of accounts files traverses every inode. The
recover playbook keeps rm -rf because the kind node's bind mount
prevents the umount while the container is running.
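
The wipe-by-remount, assuming the accounts mount point:

    umount /srv/kind/solana/accounts   # kernel frees all tmpfs pages at once
    mount /srv/kind/solana/accounts    # re-mount per fstab: fresh, empty tmpfs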
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity left over from a confusion during the migration — there was
no actual tmpfs bug with io_uring. tmpfs is simpler (no format-on-boot),
resizable on the fly, and what every other Solana operator uses.
Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
  use a tmpfs fstab entry with size=1024G (the 600G /dev/ram0 was too
  small); entry sketched below
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references
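
The fstab entry shape (mount point is an assumption), plus the
on-the-fly resize tmpfs allows:

    # /etc/fstab
    tmpfs  /srv/kind/solana/accounts  tmpfs  size=1024G,noatime  0  0

    # grow later without unmounting:
    mount -o remount,size=1200G /srv/kind/solana/accounts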
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert snapshot_dir to /srv/solana/snapshots — aria2c runs on the host
where this is the direct zvol mount (always available), unlike
/srv/kind/solana/snapshots which depends on the bind mount
- Add a laconic_so_branch variable (default: main) and use it in both
  git reset commands so the branch can be overridden via -e (example
  after this list)
- Move "Verify ramdisk visible inside kind node" from preflight to after
"Wait for deployment to exist" — the kind container may not exist
during preflight after teardown
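
Override example (playbook path assumed):

    ansible-playbook playbooks/biscayne-redeploy.yml \
        -e laconic_so_branch=fix/kind-mount-propagation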
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add a laconic_so_repo variable (/home/rix/stack-orchestrator) and a
  git pull task before the deployment starts — the editable install
  must be current, or stale code causes deploy failures
- Downgrade unified mount root check from fatal assertion to debug
warning — the mount style depends on which laconic-so version is
deployed, and individual PV mounts (/mnt/validator-*) work fine
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix snapshot_dir: /srv/solana/snapshots → /srv/kind/solana/snapshots
(kind node reads from the bind mount, not the zvol mount directly)
- Fix kind-internal paths: /mnt/solana/... → /mnt/validator-... to match
actual PV hostPath layout (individual mounts, not unified)
- Add a 'scale-up' tag to the "Scale validator to 1" task for partial
  recovery (--tags snapshot,scale-up,verify resumes without re-running
  deploy; see the example after this list)
- Make 'Start deployment' idempotent: failed_when: false + a follow-up
  check so an existing deployment doesn't fail the play
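
Resuming a partial recovery then looks like (playbook name assumed):

    ansible-playbook playbooks/biscayne-recover.yml \
        --tags snapshot,scale-up,verify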
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Docker's FORWARD chain policy DROP blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER
chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT
rules in the DOCKER-USER chain, which runs before all other Docker
chains.
Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
  tag) and rollback cleanup; rule shape sketched after this list
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test
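
The ACCEPT rule shape (ports from the root cause above; the kind node
address is an assumption):

    iptables -I DOCKER-USER -d 172.18.0.2 -p udp --dport 8001      -j ACCEPT
    iptables -I DOCKER-USER -d 172.18.0.2 -p tcp --dport 8001      -j ACCEPT
    iptables -I DOCKER-USER -d 172.18.0.2 -p udp --dport 9000:9025 -j ACCEPT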
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three Python scripts send real packets from the kind node through the
full relay path (biscayne → tunnel → mia-sw01 → was-sw01 → internet)
and verify responses come back via the inbound path. No indirect
counter-checking — a response proves both directions work.
- relay-test-udp.py: DNS query with sport 8001 (equivalent shell shape
  after this list)
- relay-test-tcp-sport.py: HTTP request with sport 8001
- relay-test-tcp-dport.py: TCP connect to entrypoint dport 8001 (ip_echo)
- test-ashburn-relay.sh: orchestrates from ansible controller via nsenter
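
The same check from a shell, for comparison; dig's -b flag pins the
source port (kind container name assumed):

    pid=$(docker inspect -f '{{.State.Pid}}' kind-control-plane)
    nsenter --target "$pid" --net -- \
        dig @1.1.1.1 example.com -b 0.0.0.0#8001 +short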
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Inventories what the DZ agent controls (tunnels, ACLs, VRFs, BGP,
route-maps, loopbacks) so we don't accidentally modify objects that
the agent will silently overwrite. Includes a "safe to modify" section
listing our own relay infrastructure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mia-sw01: Replace PBR-based outbound routing with VRF isolation.
TCAM profile tunnel-interface-acl doesn't support PBR or traffic-policy
on tunnel interfaces. Tunnel100 now lives in VRF "relay" whose default
route sends decapsulated traffic to was-sw01 via backbone, avoiding
BCP38 drops on the ISP uplink for src 137.239.194.65.
biscayne: Add TCP dport mangle rule for ip_echo (port 8001). Without it,
outbound ip_echo probes use biscayne's real IP instead of the Ashburn
relay IP, causing entrypoints to probe the wrong address. Also fix
loopback IP idempotency (handle "already assigned" error).
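
A hedged sketch of that mangle + policy-routing pair (the mark value
and table number are assumptions):

    # mark outbound ip_echo probes so policy routing sends them via the
    # relay with the Ashburn source address
    iptables -t mangle -A OUTPUT -p tcp --dport 8001 -j MARK --set-mark 0x65
    ip rule add fwmark 0x65 table 100   # table 100 defaults via gre-ashburn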
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ashburn relay and shred relay lab configs for local end-to-end
testing with cEOS. No secrets — only public IPs and test scripts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Normal playbooks should never hardcode hostnames — that's an inventory
concern. Changed all playbooks to hosts: all. The one exception is
ashburn-relay-check.yml, which legitimately spans both inventories
(switches + biscayne) and uses explicit hostnames.
Also adds:
- ashburn-relay-check.yml: full-path relay diagnostics (switches + host)
- biscayne-start.yml: start kind container and scale validator to 1
- ashburn-relay-setup.sh.j2: boot persistence script for relay state
- Direct device mounts replacing rbind (ZFS shared propagation fix)
- systemd service replacing broken if-up.d/netfilter-persistent
- PV mount path corrections (/mnt/validator-* not /mnt/solana/*)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move switches.yml to inventory-switches/ so ansible.cfg's
`inventory = inventory/` only loads biscayne. Switch playbooks
must pass `-i inventory-switches/` explicitly.
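
Usage then becomes:

    # default inventory (inventory/), biscayne only:
    ansible-playbook playbooks/biscayne-start.yml
    # switch playbooks opt in explicitly:
    ansible-playbook -i inventory-switches/ playbooks/ashburn-relay-check.yml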
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- FQCN for all modules (ansible.builtin.*)
- changed_when/failed_when on all command/shell tasks
- set -o pipefail on all shell tasks
- Add KUBECONFIG environment to health-check.yml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ansible.cfg: enable SSH agent forwarding for git operations
- biscayne-redeploy.yml: add git pull, deploy create --update, and
clear stale PV claimRefs after namespace deletion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Config is committed to running-config immediately (no 5-minute timer).
The safety net is the checkpoint (rollback) and the fact that
startup-config is only written with -e commit=true. A reboot reverts
uncommitted changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: the doublezero-agent on mia-sw01 manages Tunnel500's ACL
(SEC-USER-500-IN) and drops outbound gossip with src 137.239.194.65.
The agent overwrites any custom ACL entries.
Fix: create a separate GRE tunnel (Tunnel100) using mia-sw01's free
LAN IP (209.42.167.137) as tunnel source. This tunnel goes over the
ISP uplink, completely independent of the DZ overlay:
- mia-sw01: Tunnel100 src 209.42.167.137, dst 186.233.184.235
- biscayne: gre-ashburn src 186.233.184.235, dst 209.42.167.137
- Link addresses: 169.254.100.0/31
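
The biscayne side in plain iproute2 terms (which /31 address each end
takes is an assumption):

    ip tunnel add gre-ashburn mode gre \
        local 186.233.184.235 remote 209.42.167.137
    ip addr add 169.254.100.1/31 dev gre-ashburn   # peer: 169.254.100.0
    ip link set gre-ashburn up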
Playbook changes:
- ashburn-relay-mia-sw01: Tunnel100 + Loopback101 + SEC-VALIDATOR-100-IN
- ashburn-relay-biscayne: gre-ashburn tunnel + updated policy routing
- New template: ashburn-routing-ifup.sh.j2 for boot persistence
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
Loopback101
Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the
DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping
outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. Fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.
Also adds all repo docs, scripts, inventory, and remaining playbooks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three playbooks for routing all validator traffic through 137.239.194.65:
- was-sw01: Loopback101 + PBR redirect on Et1/1 (already applied and
  committed); will be simplified to a static route in the next iteration.
- mia-sw01: ACL permit for src 137.239.194.65 on Tunnel500 + default route
in vrf1 via egress-vrf default to was-sw01 backbone. No PBR needed —
per-tunnel ACLs already scope what enters vrf1.
- biscayne: DNAT inbound (137.239.194.65 → kind node), SNAT + policy
routing outbound (validator sport 8001,9000-9025 → doublezero0 GRE).
Inbound already applied.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Captured via ansible `show running-config` before applying the
mia-sw01 outbound validator redirect changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>