Commit Graph

18 Commits (bfde58431eef3d61b99a101f6d8de25112d201e7)

Author SHA1 Message Date
A. F. Dudley bd38c1b791 fix: remove Ansible snapshot download, add sync-tools playbook
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.

- biscayne-redeploy.yml: remove snapshot download section, simplify to
  teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
  agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
  snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
  hostNetwork translation code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:14:43 +00:00
A. F. Dudley 3574e387cc fix: update playbooks to use subtree path for snapshot_download.py
scripts/agave-container/ is a git subtree of agave-stack's container-build
directory. Replaces fragile cross-repo symlink with proper subtree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:13:53 +00:00
A. F. Dudley 078872d78d feat: add iptables playbook, symlink snapshot-download.py to agave-stack
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
  rules for both host IP (186.233.184.235) and relay loopback (137.239.194.65).
  Idempotent, persists via netfilter-persistent.
- scripts/snapshot-download.py: replaced standalone copy with symlink to
  agave-stack source of truth, eliminating duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:11:24 +00:00
A. F. Dudley ec12e6079b fix: redeploy wipe uses umount+remount instead of rm -rf
Remounting tmpfs is instant (kernel frees pages), while rm -rf on 400GB+
of accounts files traverses every inode. Recover playbook keeps rm -rf
because the kind node's bind mount prevents umount while the container
is running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:45:44 +00:00
A. F. Dudley b2342bc539 fix: switch ramdisk from /dev/ram0 to tmpfs, refactor snapshot-download.py
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity from a migration confusion — there was no actual tmpfs bug
with io_uring. tmpfs is simpler (no format-on-boot), resizable on the
fly, and what every other Solana operator uses.

Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
  use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small)
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
  use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:43:41 +00:00
A. F. Dudley 63735a9830 fix: revert snapshot_dir, add laconic_so_branch, move kind ramdisk check
- Revert snapshot_dir to /srv/solana/snapshots — aria2c runs on the host
  where this is the direct zvol mount (always available), unlike
  /srv/kind/solana/snapshots which depends on the bind mount
- Add laconic_so_branch variable (default: main) and use it in both
  git reset commands so the branch can be overridden via -e
- Move "Verify ramdisk visible inside kind node" from preflight to after
  "Wait for deployment to exist" — the kind container may not exist
  during preflight after teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:42:11 +00:00
A. F. Dudley fe935037f7 fix: add laconic-so update step, downgrade unified mount check to warning
- Add laconic_so_repo variable (/home/rix/stack-orchestrator) and a
  git pull task before deployment start — the editable install must be
  current or stale code causes deploy failures
- Downgrade unified mount root check from fatal assertion to debug
  warning — the mount style depends on which laconic-so version is
  deployed, and individual PV mounts (/mnt/validator-*) work fine

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:32:20 +00:00
A. F. Dudley ad68d505ae fix: redeploy playbook paths, tags, and idempotency
- Fix snapshot_dir: /srv/solana/snapshots → /srv/kind/solana/snapshots
  (kind node reads from the bind mount, not the zvol mount directly)
- Fix kind-internal paths: /mnt/solana/... → /mnt/validator-... to match
  actual PV hostPath layout (individual mounts, not unified)
- Add 'scale-up' tag to "Scale validator to 1" task for partial recovery
  (--tags snapshot,scale-up,verify resumes without re-running deploy)
- Make 'Start deployment' idempotent: failed_when: false + follow-up
  check so existing deployment doesn't fail the play

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:14:05 +00:00
A. F. Dudley 05f9acf8a0 fix: DOCKER-USER rules for inbound relay, add UDP test playbooks
Root cause: Docker FORWARD chain policy DROP blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER
chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT
rules in DOCKER-USER chain which runs before all Docker chains.

Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
  tag) and rollback cleanup
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
  netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 02:43:31 +00:00
A. F. Dudley b82d66eeff fix: VRF isolation for mia-sw01 relay, TCP dport mangle for ip_echo
mia-sw01: Replace PBR-based outbound routing with VRF isolation.
TCAM profile tunnel-interface-acl doesn't support PBR or traffic-policy
on tunnel interfaces. Tunnel100 now lives in VRF "relay" whose default
route sends decapsulated traffic to was-sw01 via backbone, avoiding
BCP38 drops on the ISP uplink for src 137.239.194.65.

biscayne: Add TCP dport mangle rule for ip_echo (port 8001). Without it,
outbound ip_echo probes use biscayne's real IP instead of the Ashburn
relay IP, causing entrypoints to probe the wrong address. Also fix
loopback IP idempotency (handle "already assigned" error).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:31:18 +00:00
A. F. Dudley 9cbc115295 fix: inventory layering — playbooks use hosts:all, cross-inventory uses explicit hosts
Normal playbooks should never hardcode hostnames — that's an inventory
concern. Changed all playbooks to hosts:all. The one exception is
ashburn-relay-check.yml which legitimately spans both inventories
(switches + biscayne) and uses explicit hostnames.

Also adds:
- ashburn-relay-check.yml: full-path relay diagnostics (switches + host)
- biscayne-start.yml: start kind container and scale validator to 1
- ashburn-relay-setup.sh.j2: boot persistence script for relay state
- Direct device mounts replacing rbind (ZFS shared propagation fix)
- systemd service replacing broken if-up.d/netfilter-persistent
- PV mount path corrections (/mnt/validator-* not /mnt/solana/*)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:28:21 +00:00
A. F. Dudley 14c0f63775 feat: layer 4 invariants, mount checks, and deployment layer docs
- Rename biscayne-boot.yml → biscayne-prepare-agave.yml (layer 4)
- Document deployment layers and layer 4 invariants in playbook header
- Add zvol, ramdisk, rbind fstab management with stale entry cleanup
- Add kind node XFS verification (reads cluster-id from deployment)
- Add mount checks to health-check.yml (host mounts, kind visibility, propagation)
- Fix health-check discovery tasks with tags: [always] and non-fatal pod lookup
- Fix biscayne-redeploy.yml shell tasks missing executable: /bin/bash
- Add ansible_python_interpreter to inventory
- Update CLAUDE.md with deployment layers table and mount propagation notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 13:08:04 +00:00
A. F. Dudley 4f452db6fe fix: ansible-lint production profile compliance for all playbooks
- FQCN for all modules (ansible.builtin.*)
- changed_when/failed_when on all command/shell tasks
- set -o pipefail on all shell tasks
- Add KUBECONFIG environment to health-check.yml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:52:40 +00:00
A. F. Dudley d36a71f13d fix: redeploy playbook handles SSH agent, git pull, config regen, stale PVs
- ansible.cfg: enable SSH agent forwarding for git operations
- biscayne-redeploy.yml: add git pull, deploy create --update, and
  clear stale PV claimRefs after namespace deletion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:58:29 +00:00
A. F. Dudley 9f6e1b5da7 fix: remove auto-revert timer, use checkpoint + write memory instead
Config is committed to running-config immediately (no 5-min timer).
Safety net is the checkpoint (rollback) and the fact that startup-config
is only written with -e commit=true. A reboot reverts uncommitted changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:49:25 +00:00
A. F. Dudley 742e84e3b0 feat: dedicated GRE tunnel (Tunnel100) bypassing DZ-managed Tunnel500
Root cause: the doublezero-agent on mia-sw01 manages Tunnel500's ACL
(SEC-USER-500-IN) and drops outbound gossip with src 137.239.194.65.
The agent overwrites any custom ACL entries.

Fix: create a separate GRE tunnel (Tunnel100) using mia-sw01's free
LAN IP (209.42.167.137) as tunnel source. This tunnel goes over the
ISP uplink, completely independent of the DZ overlay:
- mia-sw01: Tunnel100 src 209.42.167.137, dst 186.233.184.235
- biscayne: gre-ashburn src 186.233.184.235, dst 209.42.167.137
- Link addresses: 169.254.100.0/31

Playbook changes:
- ashburn-relay-mia-sw01: Tunnel100 + Loopback101 + SEC-VALIDATOR-100-IN
- ashburn-relay-biscayne: gre-ashburn tunnel + updated policy routing
- New template: ashburn-routing-ifup.sh.j2 for boot persistence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:47:58 +00:00
A. F. Dudley 0b52fc99d7 fix: ashburn relay playbooks and document DZ tunnel ACL root cause
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
  Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
  egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
  cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
  Loopback101

Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the
DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping
outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. Fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.

Also adds all repo docs, scripts, inventory, and remaining playbooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:44:25 +00:00
A. F. Dudley 6841d5e3c3 feat: ashburn validator relay playbooks
Three playbooks for routing all validator traffic through 137.239.194.65:

- was-sw01: Loopback101 + PBR redirect on Et1/1 (already applied/committed)
  Will be simplified to a static route in next iteration.

- mia-sw01: ACL permit for src 137.239.194.65 on Tunnel500 + default route
  in vrf1 via egress-vrf default to was-sw01 backbone. No PBR needed —
  per-tunnel ACLs already scope what enters vrf1.

- biscayne: DNAT inbound (137.239.194.65 → kind node), SNAT + policy
  routing outbound (validator sport 8001,9000-9025 → doublezero0 GRE).
  Inbound already applied.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 21:08:48 +00:00