10 Commits (601f520a457cb83fc39c56082afbefac382a03bb)

Author SHA1 Message Date
A. F. Dudley 601f520a45 fix: add 30-min wall-clock timeout to incremental convergence loop
Without a bound, the loop runs forever if no source ever serves an
incremental close enough to head (e.g. because the full snapshot's base
slot is too old). After 30 minutes, proceed with the best incremental
available, or with none if no incremental was found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:11:19 +00:00
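
The bounded loop described in commit 601f520a45 could look roughly like the sketch below. It is not the code from snapshot_download.py: `converge_with_timeout`, `probe_and_download`, `get_head_slot`, and `poll_interval_s` are hypothetical names, and treating "within convergence_slots" as `<=` is an assumption. The clock and sleep functions are injectable so the timeout can be exercised without waiting 30 minutes.

```python
import time
from typing import Callable, Optional


def converge_with_timeout(
    probe_and_download: Callable[[], Optional[int]],  # hypothetical: returns slot of newest incremental, or None
    get_head_slot: Callable[[], int],                 # hypothetical: returns the current cluster head slot
    convergence_slots: int = 500,
    timeout_s: float = 30 * 60,                       # 30-minute wall-clock bound from the commit message
    poll_interval_s: float = 10.0,                    # assumed pause between probes
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return the slot of the best incremental obtained, or None if there was none."""
    deadline = clock() + timeout_s
    best_slot: Optional[int] = None
    while True:
        slot = probe_and_download()
        if slot is not None and (best_slot is None or slot > best_slot):
            best_slot = slot
        # Converged: the best incremental is close enough to head.
        if best_slot is not None and get_head_slot() - best_slot <= convergence_slots:
            return best_slot
        # Timed out: proceed with whatever we have (possibly nothing).
        if clock() >= deadline:
            return best_slot
        sleep(poll_interval_s)
```

Using a monotonic clock rather than `time.time()` keeps the bound unaffected by system clock adjustments during a long download.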
A. F. Dudley bfde58431e feat: rolling incremental snapshot download loop
After the full snapshot downloads, continuously re-probe all fast sources
for newer incrementals until the best available is within convergence_slots
(default 500) of head. Each iteration finds the highest-slot incremental
matching our full snapshot's base slot, downloads it (replacing any previous),
and checks the gap to mainnet head.

- Extract probe_incremental() from inline re-probe code
- Add convergence_slots param to download_best_snapshot() (default 500)
- Add --convergence-slots CLI arg
- Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:33:47 +00:00
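
One iteration of the loop in commit bfde58431e selects the highest-slot incremental whose base slot matches the full snapshot, then compares it to head. A minimal sketch of that selection step follows. The `Incremental` record, `pick_best_incremental`, and `has_converged` are illustrative names rather than the actual API of snapshot_download.py, and the `<=` comparison is an assumption about what "within" means.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Incremental:
    """Hypothetical record for one incremental snapshot advertised by a source."""
    source: str     # host or URL serving the snapshot
    base_slot: int  # slot of the full snapshot this incremental builds on
    slot: int       # slot the incremental advances to


def pick_best_incremental(
    candidates: Iterable[Incremental], full_base_slot: int
) -> Optional[Incremental]:
    """Highest-slot incremental that matches our full snapshot's base slot."""
    matching = [c for c in candidates if c.base_slot == full_base_slot]
    return max(matching, key=lambda c: c.slot, default=None)


def has_converged(
    best: Optional[Incremental], head_slot: int, convergence_slots: int = 500
) -> bool:
    """True when the chosen incremental is within convergence_slots of head."""
    return best is not None and head_slot - best.slot <= convergence_slots
```

Incrementals built on a different base slot are filtered out before ranking, since they cannot be applied on top of the full snapshot that was already downloaded.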
A. F. Dudley bd38c1b791 fix: remove Ansible snapshot download, add sync-tools playbook
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.

- biscayne-redeploy.yml: remove snapshot download section, simplify to
  teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
  agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
  snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
  hostNetwork translation code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:14:43 +00:00
A. F. Dudley 25952b4fa7 Merge commit 'f4b3a46109a8da00fdd68d8999160ddc45dcc88a' as 'scripts/agave-container'
2026-03-08 19:13:38 +00:00
A. F. Dudley ba015bf3b1 chore: remove snapshot-download.py symlink (replacing with subtree)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:13:34 +00:00
A. F. Dudley 078872d78d feat: add iptables playbook, symlink snapshot-download.py to agave-stack
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
  rules for both host IP (186.233.184.235) and relay loopback (137.239.194.65).
  Idempotent, persists via netfilter-persistent.
- scripts/snapshot-download.py: replaced standalone copy with symlink to
  agave-stack source of truth, eliminating duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:11:24 +00:00
A. F. Dudley b2342bc539 fix: switch ramdisk from /dev/ram0 to tmpfs, refactor snapshot-download.py
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity that grew out of confusion during a migration; there was no
actual tmpfs bug with io_uring. tmpfs is simpler (no format-on-boot),
resizable on the fly, and what every other Solana operator uses.

Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
  use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small)
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
  use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:43:41 +00:00
A. F. Dudley 05f9acf8a0 fix: DOCKER-USER rules for inbound relay, add UDP test playbooks
Root cause: the Docker FORWARD chain's DROP policy blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER
chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT
rules to the DOCKER-USER chain, which runs before all Docker chains.

Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
  tag) and rollback cleanup
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
  netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 02:43:31 +00:00
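
For illustration, the DOCKER-USER ACCEPT rules from commit 05f9acf8a0 could be generated as iptables argument lists like this. This is a sketch only: the actual playbook's match criteria are not shown in the log, so matching on destination address (`-d`) and the `dest_ip` parameter are assumptions, and `docker_user_accept_rules` is a hypothetical helper. The protocols and ports come from the commit message.

```python
from typing import List

# Protocol/port pairs named in the commit message; "9000:9025" is
# iptables' syntax for a port range.
RELAY_PORTS = [("udp", "8001"), ("tcp", "8001"), ("udp", "9000:9025")]


def docker_user_accept_rules(dest_ip: str) -> List[List[str]]:
    """Build one `iptables -I DOCKER-USER ... -j ACCEPT` command per port spec.

    dest_ip is assumed to be the kind node's address (hypothetical parameter).
    The commands are returned, not executed.
    """
    return [
        ["iptables", "-I", "DOCKER-USER",
         "-p", proto, "-d", dest_ip, "--dport", ports,
         "-j", "ACCEPT"]
        for proto, ports in RELAY_PORTS
    ]
```

Returning argument lists instead of shell strings lets a caller pass them directly to `subprocess.run` without shell quoting concerns.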
A. F. Dudley 496c7982cb feat: end-to-end relay test scripts
Three Python scripts send real packets from the kind node through the
full relay path (biscayne → tunnel → mia-sw01 → was-sw01 → internet)
and verify responses come back via the inbound path. No indirect
counter-checking — a response proves both directions work.

- relay-test-udp.py: DNS query with sport 8001
- relay-test-tcp-sport.py: HTTP request with sport 8001
- relay-test-tcp-dport.py: TCP connect to entrypoint dport 8001 (ip_echo)
- test-ashburn-relay.sh: orchestrates from ansible controller via nsenter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 00:43:06 +00:00
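
The idea behind the "DNS query with sport 8001" test in commit 496c7982cb can be sketched as follows: bind a UDP socket to source port 8001, send a minimal DNS query, and treat a reply with a matching transaction ID as proof that both directions work. This is not the contents of relay-test-udp.py; `build_dns_query`, `udp_roundtrip`, and the resolver and name arguments are hypothetical.

```python
import socket
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (recursion desired)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def udp_roundtrip(resolver: str, name: str, sport: int = 8001,
                  timeout_s: float = 5.0) -> bool:
    """Send a DNS query from a fixed source port; True if a matching reply arrives."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", sport))  # force the source port under test
        sock.settimeout(timeout_s)
        sock.sendto(query, (resolver, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False
    # A reply echoes the transaction ID from the first two bytes of the query.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

A call such as `udp_roundtrip("<resolver-ip>", "example.com")` would need to run inside the network namespace whose path is being tested, with port 8001 not already bound by another process.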
A. F. Dudley 0b52fc99d7 fix: ashburn relay playbooks and document DZ tunnel ACL root cause
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
  Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
  egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
  cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
  Loopback101

Bug doc (bug-ashburn-tunnel-port-filtering.md): the root cause is that the
DoubleZero agent on mia-sw01 overwrites the SEC-USER-500-IN ACL, which then
drops outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. The fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.

Also adds all repo docs, scripts, inventory, and remaining playbooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 01:44:25 +00:00