stack-orchestrator

Commit Graph

Author	SHA1	Message	Date
A. F. Dudley	08380ec070	fix: Dockerfile includes ip_echo_preflight.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 06:08:22 +00:00
A. F. Dudley	61b7f6a236	feat: ip_echo preflight tool + relay post-mortem and checklist ip_echo_preflight.py: reimplements Solana ip_echo client protocol in Python. Verifies UDP port reachability before snapshot download, called from entrypoint.py. Prevents wasting hours on a snapshot only to crash-loop on port reachability. docs/postmortem-ashburn-relay-outbound.md: root cause analysis of the firewalld nftables FORWARD chain blocking outbound relay traffic. docs/ashburn-relay-checklist.md: 7-layer verification checklist for relay path debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 05:54:23 +00:00
A. F. Dudley	3bf87a2e9b	feat: snapshot leapfrog — auto-recovery when validator falls behind Entrypoint changes: - Always require full + incremental before starting (retry until found) - Check incremental freshness against convergence threshold (500 slots) - Gap monitor thread: if validator falls >5000 slots behind for 3 consecutive checks, graceful stop + restart with fresh incremental - cmd_serve is now a loop: download → run → monitor → leapfrog → repeat - --no-snapshot-fetch moved to common args (both RPC and validator modes) - --maximum-full-snapshots-to-retain default 1 (validator deletes downloaded full after generating its own) - SNAPSHOT_MAX_AGE_SLOTS default 100000 (one full snapshot generation) snapshot_download.py refactoring: - Extract _discover_and_benchmark() and _rolling_incremental_download() as shared helpers - Restore download_incremental_for_slot() using shared helpers (downloads only an incremental for an existing full snapshot) - download_best_snapshot() uses shared helpers, downloads full then incremental as separate operations The leapfrog cycle: validator generates full snapshots at standard 100k block height intervals (same slots as the rest of the network). When the gap monitor triggers, the entrypoint loops back to maybe_download_snapshot which finds the validator's local full, downloads a fresh network incremental (generated every ~40s, converges within the ~11hr full generation window), and restarts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 05:53:56 +00:00
A. F. Dudley	cd36bfe5ee	fix: check-status.py smooth in-place redraw, remove comment bars - Overwrite lines in place instead of clear+redraw (no flicker) - Pad lines to terminal width to clear stale characters - Blank leftover rows when output shrinks between frames - Hide cursor during watch mode - Remove section comment bars - Replace unicode checkmarks with +/x Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 01:00:36 +00:00
A. F. Dudley	b88af2be70	feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build - entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit via admin RPC (agave-validator exit --force) before falling back to signals - snapshot_download.py: fix break-on-failure bug in incremental download loop (continue + re-probe instead of giving up) - biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts PPA to fix io_uring deadlock at kernel module level - biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix) - biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before scaling to 0, updated docs for admin RPC shutdown - biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become), add --tags build-container support, add set -e to shell blocks - biscayne-recover.yml: updated for graceful shutdown awareness - check-status.py: add --pane flag for tmux, clean redraw in watch mode - CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 07:58:37 +00:00
A. F. Dudley	173b807451	fix: check-status.py discovers cluster-id from deployment.yml Instead of hardcoding the laconic cluster ID, namespace, deployment name, and pod label, read cluster-id from deployment.yml on biscayne and derive everything from it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 06:48:19 +00:00
A. F. Dudley	ed6f6bfd59	fix: check-status.py pod label selector matches actual k8s labels The pod label is app=laconic-70ce4c4b47e23b85, not app=laconic-70ce4c4b47e23b85-deployment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 06:46:17 +00:00
A. F. Dudley	09728a719c	fix: recovery playbook is fire-and-forget, add check-status.py The recovery playbook now exits after scaling to 1. The container entrypoint handles snapshot download (60+ min) and validator startup autonomously. Removed all polling/verification steps that would time out waiting. Added scripts/check-status.py for monitoring download progress, validator slot, gap to mainnet, catch-up rate, and ramdisk usage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 06:39:25 +00:00
A. F. Dudley	601f520a45	fix: add 30-min wall-clock timeout to incremental convergence loop Without a bound, the loop runs forever if sources never serve an incremental close enough to head (e.g. full snapshot base slot is too old). After 30 minutes, proceed with the best incremental available or none. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 06:11:19 +00:00
A. F. Dudley	bfde58431e	feat: rolling incremental snapshot download loop After the full snapshot downloads, continuously re-probe all fast sources for newer incrementals until the best available is within convergence_slots (default 500) of head. Each iteration finds the highest-slot incremental matching our full snapshot's base slot, downloads it (replacing any previous), and checks the gap to mainnet head. - Extract probe_incremental() from inline re-probe code - Add convergence_slots param to download_best_snapshot() (default 500) - Add --convergence-slots CLI arg - Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 05:33:47 +00:00
A. F. Dudley	bd38c1b791	fix: remove Ansible snapshot download, add sync-tools playbook The container entrypoint (entrypoint.py) handles snapshot download internally via aria2c. Ansible no longer needs to scale-to-0, download, scale-to-1 — it just deploys and lets the container manage startup. - biscayne-redeploy.yml: remove snapshot download section, simplify to teardown → wipe → deploy → verify - biscayne-sync-tools.yml: new playbook to sync laconic-so and agave-stack repos on biscayne, with separate branch controls - snapshot_download.py: re-probe for fresh incremental after full snapshot download completes (old incremental is stale by then) - Switch laconic_so_branch to fix/kind-mount-propagation (has hostNetwork translation code) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 05:14:43 +00:00
A. F. Dudley	25952b4fa7	Merge commit 'f4b3a46109a8da00fdd68d8999160ddc45dcc88a' as 'scripts/agave-container'	2026-03-08 19:13:38 +00:00
A. F. Dudley	ba015bf3b1	chore: remove snapshot-download.py symlink (replacing with subtree) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 19:13:34 +00:00
A. F. Dudley	078872d78d	feat: add iptables playbook, symlink snapshot-download.py to agave-stack - playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER rules for both host IP (186.233.184.235) and relay loopback (137.239.194.65). Idempotent, persists via netfilter-persistent. - scripts/snapshot-download.py: replaced standalone copy with symlink to agave-stack source of truth, eliminating duplication. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 19:11:24 +00:00
A. F. Dudley	b2342bc539	fix: switch ramdisk from /dev/ram0 to tmpfs, refactor snapshot-download.py The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary complexity from a migration confusion — there was no actual tmpfs bug with io_uring. tmpfs is simpler (no format-on-boot), resizable on the fly, and what every other Solana operator uses. Changes: - prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service, use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small) - recover: remove ramdisk_device var (no longer needed) - redeploy: wipe accounts by rm -rf instead of umount+mkfs - snapshot-download.py: extract download_best_snapshot() public API for use by the new container entrypoint.py (in agave-stack) - CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths - health-check: fix ramdisk path references Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 18:43:41 +00:00
A. F. Dudley	05f9acf8a0	fix: DOCKER-USER rules for inbound relay, add UDP test playbooks Root cause: Docker FORWARD chain policy DROP blocked all DNAT'd relay traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT rules in DOCKER-USER chain which runs before all Docker chains. Changes: - ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound tag) and rollback cleanup - ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot - relay-inbound-udp-test.yml: controlled e2e test — listener in kind netns, sender from kelce, assert arrival - relay-link-test.yml: link-by-link tcpdump captures at each hop - relay-test-udp-listen.py, relay-test-udp-send.py: test helpers - relay-test-ip-echo.py: full ip_echo protocol test - inventory/kelce.yml, inventory/panic.yml: test host inventories - test-ashburn-relay.sh: add ip_echo UDP reachability test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 02:43:31 +00:00
A. F. Dudley	496c7982cb	feat: end-to-end relay test scripts Three Python scripts send real packets from the kind node through the full relay path (biscayne → tunnel → mia-sw01 → was-sw01 → internet) and verify responses come back via the inbound path. No indirect counter-checking — a response proves both directions work. - relay-test-udp.py: DNS query with sport 8001 - relay-test-tcp-sport.py: HTTP request with sport 8001 - relay-test-tcp-dport.py: TCP connect to entrypoint dport 8001 (ip_echo) - test-ashburn-relay.sh: orchestrates from ansible controller via nsenter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 00:43:06 +00:00
A. F. Dudley	0b52fc99d7	fix: ashburn relay playbooks and document DZ tunnel ACL root cause Playbook fixes from testing: - ashburn-relay-biscayne: insert DNAT rules at position 1 before Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+) - ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via egress-vrf vrf1 (nexthop only, no interface — EOS silently drops cross-VRF routes that specify a tunnel interface) - ashburn-relay-was-sw01: replace PBR with static route, remove Loopback101 Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping outbound gossip with src 137.239.194.65. The DZ agent controls Tunnel500's lifecycle. Fix requires a separate GRE tunnel using mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure. Also adds all repo docs, scripts, inventory, and remaining playbooks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 01:44:25 +00:00
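
Commit 3bf87a2e9b describes a gap monitor that triggers a leapfrog when the validator falls more than 5000 slots behind for 3 consecutive checks. The sketch below shows one way to express that rule. It is an illustration, not the code in entrypoint.py: the class name GapMonitor, the method observe(), and the field names are hypothetical, and only the two thresholds come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class GapMonitor:
    """Counts consecutive checks in which the validator lags too far behind.

    Hypothetical helper for illustration; the names are not taken from the repo.
    """

    max_gap_slots: int = 5000   # threshold quoted in commit 3bf87a2e9b
    required_strikes: int = 3   # consecutive checks quoted in commit 3bf87a2e9b
    strikes: int = 0

    def observe(self, local_slot: int, network_slot: int) -> bool:
        """Record one check; return True when a leapfrog should be triggered."""
        gap = network_slot - local_slot
        if gap > self.max_gap_slots:
            self.strikes += 1
        else:
            # Any healthy check resets the streak, so only an
            # uninterrupted run of bad checks triggers a restart.
            self.strikes = 0
        return self.strikes >= self.required_strikes
```

Resetting the counter on every healthy check is an assumption drawn from the word "consecutive"; a brief stall followed by recovery would not cause a restart under this reading.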

18 Commits (08380ec070428be49450e8227aa5960845537de8)
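
Commits bfde58431e and 601f520a45 together describe a rolling incremental loop: re-probe sources until the best incremental is within convergence_slots (default 500) of head, and give up after a 30-minute wall-clock bound, proceeding with the best incremental available or none. The sketch below is a minimal reading of that behaviour, not the code in snapshot_download.py. The function name converge_incremental, the probe and head_slot callables, the poll interval, and the injectable clock and sleep parameters are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def converge_incremental(
    probe: Callable[[], Optional[int]],
    head_slot: Callable[[], int],
    convergence_slots: int = 500,       # default quoted in commit bfde58431e
    timeout_s: float = 30 * 60,         # 30-minute bound from commit 601f520a45
    poll_s: float = 10.0,               # assumed poll interval, not from the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return the slot of the best incremental seen, or None if none was found.

    `probe` is a hypothetical callable returning the highest incremental slot
    currently offered by any source (or None); `head_slot` returns the
    current network head.
    """
    deadline = clock() + timeout_s
    best: Optional[int] = None
    while True:
        slot = probe()
        if slot is not None and (best is None or slot > best):
            best = slot  # a real implementation would download it here
        if best is not None and head_slot() - best <= convergence_slots:
            return best  # close enough to head
        if clock() >= deadline:
            return best  # timed out: proceed with the best available, or None
        sleep(poll_s)
```

Using a monotonic clock keeps the bound immune to system time adjustments, and passing clock and sleep as parameters lets the loop be exercised in tests without waiting.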