28 Commits (6464492009a83564b4f9cb1703f509f38c2f49dd)

Author SHA1 Message Date
A. F. Dudley 9009fb0363 fix: build.sh must be executable for laconic-so build-containers
Also fix --include filter: container name uses slash (laconicnetwork/agave)
not dash (laconicnetwork-agave). The old filter silently skipped the build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:25:54 +00:00
A. F. Dudley ceea8f0572 fix: restart playbook preserves SSH agent and clears stale PV claimRefs
Two fixes for biscayne-restart.yml:

1. ansible_become_flags: "-E" on the restart task preserves SSH_AUTH_SOCK
   through sudo so laconic-so can git pull the stack repo.

2. After restart, clear claimRef on any Released PVs. laconic-so restart
   deletes the namespace (cascading to PVCs) then recreates, but the PVs
   retain stale claimRefs that prevent new PVCs from binding.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:37:45 +00:00
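
Illustrative note on ceea8f0572 (fix 2): a minimal sketch, not the playbook's actual tasks, of how Released PVs with a stale claimRef could be selected and patched. It assumes the input is the parsed output of `kubectl get pv -o json`; the function names are hypothetical, and the commands are only built here, not executed.

```python
import json


def released_pv_names(pv_list: dict) -> list[str]:
    """Return names of PVs that are Released and still carry a claimRef."""
    names = []
    for pv in pv_list.get("items", []):
        phase = pv.get("status", {}).get("phase")
        claim_ref = pv.get("spec", {}).get("claimRef")
        if phase == "Released" and claim_ref:
            names.append(pv["metadata"]["name"])
    return names


def clear_claimref_commands(pv_list: dict) -> list[list[str]]:
    """Build one `kubectl patch` argv per stale PV (hypothetical helper)."""
    # A JSON merge patch with a null value removes the claimRef field.
    patch = json.dumps({"spec": {"claimRef": None}})
    return [
        ["kubectl", "patch", "pv", name, "--type", "merge", "-p", patch]
        for name in released_pv_names(pv_list)
    ]
```

Once claimRef is removed, a Released PV returns to Available, so a newly created PVC can bind to it again.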
A. F. Dudley e143bb45c7 feat: add biscayne-restart.yml for graceful restart without cluster teardown
Uses laconic-so deployment restart (GitOps) to pick up new container
images and config. Gracefully stops the validator first (scale to 0,
wait for pod termination, verify no agave processes). Preserves the
kind cluster, all data volumes, and cluster state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:21:46 +00:00
A. F. Dudley 68edcc60c7 fix: migrate ashburn relay playbook to firewalld + iptables coexistence
Firewalld zones/policies for forwarding (Docker bridge → gre-ashburn),
iptables for Docker-specific rules (DNAT, DOCKER-USER, mangle, SNAT).
Both coexist at different netfilter priorities.

See docs/postmortem-ashburn-relay-outbound.md for root cause analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:54:08 +00:00
A. F. Dudley e597968708 fix: recovery playbook fixes grafana PV ownership before scale-up
laconic-so creates PV hostPath dirs as root. Grafana runs as UID 472
and crashes on startup because it can't write to /var/lib/grafana.
Fix ownership inside the kind node before scaling the deployment up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 00:57:36 +00:00
A. F. Dudley ddbcd1a97c fix: migration playbook stops docker first, skips stale data copy
- biscayne-migrate-storage.yml: stop docker to release bind mounts
  before destroying zvol, no data copy (stale, fresh snapshot needed),
  handle partially-migrated state, restart docker at end
- biscayne-upgrade-zfs.yml: use add-apt-repository CLI (module times
  out), fix libzfs package name (libzfs4linux not 5), allow apt update
  warnings from stale influxdata GPG key

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 00:48:37 +00:00
A. F. Dudley b88af2be70 feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build
- entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit
  via admin RPC (agave-validator exit --force) before falling back to signals
- snapshot_download.py: fix break-on-failure bug in incremental download loop
  (continue + re-probe instead of giving up)
- biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts
  PPA to fix io_uring deadlock at kernel module level
- biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS
  dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix)
- biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before
  scaling to 0, updated docs for admin RPC shutdown
- biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become),
  add --tags build-container support, add set -e to shell blocks
- biscayne-recover.yml: updated for graceful shutdown awareness
- check-status.py: add --pane flag for tmux, clean redraw in watch mode
- CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 07:58:37 +00:00
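
Illustrative note on b88af2be70 (entrypoint.py): a minimal sketch of the shutdown order the message describes, assuming a supervisor that holds the validator as a child process. It is not the real entrypoint.py; `stop_gracefully`, `exit_cmd`, and `grace_seconds` are hypothetical names. The exit command is tried first, and signals are used only if the child is still alive afterwards.

```python
import signal
import subprocess
import threading

# Set by the SIGTERM handler; the supervisor's main loop would wait on it.
shutdown_requested = threading.Event()


def install_sigterm_trap() -> None:
    """Trap SIGTERM so the supervisor can run its own shutdown sequence."""
    signal.signal(signal.SIGTERM, lambda signum, frame: shutdown_requested.set())


def stop_gracefully(child: subprocess.Popen, exit_cmd: list[str],
                    grace_seconds: float) -> str:
    """Ask the child to exit via exit_cmd; fall back to SIGTERM, then SIGKILL."""
    try:
        # Step 1: request a clean exit (for example through an admin RPC CLI).
        subprocess.run(exit_cmd, check=True, timeout=grace_seconds)
        child.wait(timeout=grace_seconds)
        return "graceful"
    except (subprocess.SubprocessError, OSError):
        pass  # command failed, timed out, or the child did not exit in time

    # Step 2: fall back to SIGTERM.
    child.terminate()
    try:
        child.wait(timeout=grace_seconds)
        return "sigterm"
    except subprocess.TimeoutExpired:
        # Step 3: last resort.
        child.kill()
        child.wait()
        return "sigkill"
```

In this sketch `exit_cmd` would carry the `agave-validator exit --force` invocation named in the message; any further arguments it needs are not stated in the log and are left out here. The grace period has to fit inside the pod's terminationGracePeriodSeconds, which is why biscayne-stop.yml raises that value to 300.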
A. F. Dudley 09728a719c fix: recovery playbook is fire-and-forget, add check-status.py
The recovery playbook now exits after scaling to 1. The container
entrypoint handles snapshot download (60+ min) and validator startup
autonomously. Removed all polling/verification steps that would
time out waiting.

Added scripts/check-status.py for monitoring download progress,
validator slot, gap to mainnet, catch-up rate, and ramdisk usage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:39:25 +00:00
A. F. Dudley 3dc345ea7d fix: recovery playbook delegates snapshot download to container entrypoint
The container's entrypoint.py already handles snapshot freshness checks,
cleanup, download (with rolling incremental convergence), and validator
startup. Remove the host-side download and let the container do the work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:28:01 +00:00
A. F. Dudley f842aba56a fix: sync-tools playbook uses agent forwarding, not socket hunting
- Add become: false to git tasks so SSH_AUTH_SOCK survives (sudo drops it)
- Fetch explicit branch names instead of bare `git fetch origin`
- Remove the fragile `Find SSH agent socket` workaround

Requires ForwardAgent yes in SSH config (added to ~/.ssh/config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:20:16 +00:00
A. F. Dudley bd38c1b791 fix: remove Ansible snapshot download, add sync-tools playbook
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.

- biscayne-redeploy.yml: remove snapshot download section, simplify to
  teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
  agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
  snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
  hostNetwork translation code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:14:43 +00:00
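
Illustrative note on bd38c1b791 (snapshot_download.py): a minimal sketch of the re-probe idea, assuming the caller supplies two callables. `probe` returns the newest advertised incremental snapshot (or None) and `download` returns True on success; both names and `max_attempts` are hypothetical, and this is not the script's real code.

```python
from typing import Callable, Optional


def fetch_incremental(probe: Callable[[], Optional[str]],
                      download: Callable[[str], bool],
                      max_attempts: int = 5) -> Optional[str]:
    """Probe for the newest incremental and download it, re-probing on failure."""
    for _ in range(max_attempts):
        candidate = probe()
        if candidate is None:
            continue  # nothing advertised yet; probe again
        if download(candidate):
            return candidate
        # Download failed: the candidate may already be stale, so loop
        # back and probe for a fresher one instead of giving up.
    return None
```

Probing again inside the loop is what keeps a long full-snapshot download from pinning the run to an incremental that has since aged out. A real loop would presumably also sleep between attempts.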
A. F. Dudley 3574e387cc fix: update playbooks to use subtree path for snapshot_download.py
scripts/agave-container/ is a git subtree of agave-stack's container-build
directory. Replaces fragile cross-repo symlink with proper subtree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:13:53 +00:00
A. F. Dudley 078872d78d feat: add iptables playbook, symlink snapshot-download.py to agave-stack
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
  rules for both host IP (186.233.184.235) and relay loopback (137.239.194.65).
  Idempotent, persists via netfilter-persistent.
- scripts/snapshot-download.py: replaced standalone copy with symlink to
  agave-stack source of truth, eliminating duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:11:24 +00:00
A. F. Dudley ec12e6079b fix: redeploy wipe uses umount+remount instead of rm -rf
Remounting tmpfs is instant (kernel frees pages), while rm -rf on 400GB+
of accounts files traverses every inode. Recover playbook keeps rm -rf
because the kind node's bind mount prevents umount while the container
is running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:45:44 +00:00
A. F. Dudley b2342bc539 fix: switch ramdisk from /dev/ram0 to tmpfs, refactor snapshot-download.py
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity from a migration confusion — there was no actual tmpfs bug
with io_uring. tmpfs is simpler (no format-on-boot), resizable on the
fly, and what every other Solana operator uses.

Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
  use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small)
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
  use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:43:41 +00:00
A. F. Dudley 63735a9830 fix: revert snapshot_dir, add laconic_so_branch, move kind ramdisk check
- Revert snapshot_dir to /srv/solana/snapshots — aria2c runs on the host
  where this is the direct zvol mount (always available), unlike
  /srv/kind/solana/snapshots which depends on the bind mount
- Add laconic_so_branch variable (default: main) and use it in both
  git reset commands so the branch can be overridden via -e
- Move "Verify ramdisk visible inside kind node" from preflight to after
  "Wait for deployment to exist" — the kind container may not exist
  during preflight after teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:42:11 +00:00
A. F. Dudley fe935037f7 fix: add laconic-so update step, downgrade unified mount check to warning
- Add laconic_so_repo variable (/home/rix/stack-orchestrator) and a
  git pull task before deployment start — the editable install must be
  current or stale code causes deploy failures
- Downgrade unified mount root check from fatal assertion to debug
  warning — the mount style depends on which laconic-so version is
  deployed, and individual PV mounts (/mnt/validator-*) work fine

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:32:20 +00:00
A. F. Dudley ad68d505ae fix: redeploy playbook paths, tags, and idempotency
- Fix snapshot_dir: /srv/solana/snapshots → /srv/kind/solana/snapshots
  (kind node reads from the bind mount, not the zvol mount directly)
- Fix kind-internal paths: /mnt/solana/... → /mnt/validator-... to match
  actual PV hostPath layout (individual mounts, not unified)
- Add 'scale-up' tag to "Scale validator to 1" task for partial recovery
  (--tags snapshot,scale-up,verify resumes without re-running deploy)
- Make 'Start deployment' idempotent: failed_when: false + follow-up
  check so existing deployment doesn't fail the play

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:14:05 +00:00
A. F. Dudley 05f9acf8a0 fix: DOCKER-USER rules for inbound relay, add UDP test playbooks
Root cause: Docker FORWARD chain policy DROP blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER
chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT
rules in DOCKER-USER chain which runs before all Docker chains.

Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
  tag) and rollback cleanup
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
  netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 02:43:31 +00:00
A. F. Dudley b82d66eeff fix: VRF isolation for mia-sw01 relay, TCP dport mangle for ip_echo
mia-sw01: Replace PBR-based outbound routing with VRF isolation.
TCAM profile tunnel-interface-acl doesn't support PBR or traffic-policy
on tunnel interfaces. Tunnel100 now lives in VRF "relay" whose default
route sends decapsulated traffic to was-sw01 via backbone, avoiding
BCP38 drops on the ISP uplink for src 137.239.194.65.

biscayne: Add TCP dport mangle rule for ip_echo (port 8001). Without it,
outbound ip_echo probes use biscayne's real IP instead of the Ashburn
relay IP, causing entrypoints to probe the wrong address. Also fix
loopback IP idempotency (handle "already assigned" error).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:31:18 +00:00
A. F. Dudley 9cbc115295 fix: inventory layering — playbooks use hosts:all, cross-inventory uses explicit hosts
Normal playbooks should never hardcode hostnames — that's an inventory
concern. Changed all playbooks to hosts:all. The one exception is
ashburn-relay-check.yml which legitimately spans both inventories
(switches + biscayne) and uses explicit hostnames.

Also adds:
- ashburn-relay-check.yml: full-path relay diagnostics (switches + host)
- biscayne-start.yml: start kind container and scale validator to 1
- ashburn-relay-setup.sh.j2: boot persistence script for relay state
- Direct device mounts replacing rbind (ZFS shared propagation fix)
- systemd service replacing broken if-up.d/netfilter-persistent
- PV mount path corrections (/mnt/validator-* not /mnt/solana/*)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:28:21 +00:00
A. F. Dudley 14c0f63775 feat: layer 4 invariants, mount checks, and deployment layer docs
- Rename biscayne-boot.yml → biscayne-prepare-agave.yml (layer 4)
- Document deployment layers and layer 4 invariants in playbook header
- Add zvol, ramdisk, rbind fstab management with stale entry cleanup
- Add kind node XFS verification (reads cluster-id from deployment)
- Add mount checks to health-check.yml (host mounts, kind visibility, propagation)
- Fix health-check discovery tasks with tags: [always] and non-fatal pod lookup
- Fix biscayne-redeploy.yml shell tasks missing executable: /bin/bash
- Add ansible_python_interpreter to inventory
- Update CLAUDE.md with deployment layers table and mount propagation notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 13:08:04 +00:00
A. F. Dudley 4f452db6fe fix: ansible-lint production profile compliance for all playbooks
- FQCN for all modules (ansible.builtin.*)
- changed_when/failed_when on all command/shell tasks
- set -o pipefail on all shell tasks
- Add KUBECONFIG environment to health-check.yml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:52:40 +00:00
A. F. Dudley d36a71f13d fix: redeploy playbook handles SSH agent, git pull, config regen, stale PVs
- ansible.cfg: enable SSH agent forwarding for git operations
- biscayne-redeploy.yml: add git pull, deploy create --update, and
  clear stale PV claimRefs after namespace deletion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:58:29 +00:00
A. F. Dudley 9f6e1b5da7 fix: remove auto-revert timer, use checkpoint + write memory instead
Config is committed to running-config immediately (no 5-min timer).
Safety net is the checkpoint (rollback) and the fact that startup-config
is only written with -e commit=true. A reboot reverts uncommitted changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:49:25 +00:00
A. F. Dudley 742e84e3b0 feat: dedicated GRE tunnel (Tunnel100) bypassing DZ-managed Tunnel500
Root cause: the doublezero-agent on mia-sw01 manages Tunnel500's ACL
(SEC-USER-500-IN) and drops outbound gossip with src 137.239.194.65.
The agent overwrites any custom ACL entries.

Fix: create a separate GRE tunnel (Tunnel100) using mia-sw01's free
LAN IP (209.42.167.137) as tunnel source. This tunnel goes over the
ISP uplink, completely independent of the DZ overlay:
- mia-sw01: Tunnel100 src 209.42.167.137, dst 186.233.184.235
- biscayne: gre-ashburn src 186.233.184.235, dst 209.42.167.137
- Link addresses: 169.254.100.0/31

Playbook changes:
- ashburn-relay-mia-sw01: Tunnel100 + Loopback101 + SEC-VALIDATOR-100-IN
- ashburn-relay-biscayne: gre-ashburn tunnel + updated policy routing
- New template: ashburn-routing-ifup.sh.j2 for boot persistence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:47:58 +00:00
A. F. Dudley 0b52fc99d7 fix: ashburn relay playbooks and document DZ tunnel ACL root cause
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
  Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
  egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
  cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
  Loopback101

Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the
DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping
outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. Fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.

Also adds all repo docs, scripts, inventory, and remaining playbooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:44:25 +00:00
A. F. Dudley 6841d5e3c3 feat: ashburn validator relay playbooks
Three playbooks for routing all validator traffic through 137.239.194.65:

- was-sw01: Loopback101 + PBR redirect on Et1/1 (already applied/committed)
  Will be simplified to a static route in next iteration.

- mia-sw01: ACL permit for src 137.239.194.65 on Tunnel500 + default route
  in vrf1 via egress-vrf default to was-sw01 backbone. No PBR needed —
  per-tunnel ACLs already scope what enters vrf1.

- biscayne: DNAT inbound (137.239.194.65 → kind node), SNAT + policy
  routing outbound (validator sport 8001,9000-9025 → doublezero0 GRE).
  Inbound already applied.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 21:08:48 +00:00