- biscayne-migrate-storage.yml: stop docker to release bind mounts
  before destroying the zvol; skip copying data (it would be stale; a
  fresh snapshot is needed anyway); handle a partially-migrated state;
  restart docker at the end (ordering sketched below)
- biscayne-upgrade-zfs.yml: use the add-apt-repository CLI (the Ansible
  module times out), fix the libzfs package name (libzfs4linux, not
  libzfs5linux), and tolerate apt update warnings from the stale
  influxdata GPG key
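
A hedged sketch of the ordering the migrate-storage play enforces
(dataset name is hypothetical; the data itself is not copied):

    systemctl stop docker          # release bind mounts on the old storage
    zfs destroy -r tank/solana     # safe: contents are stale anyway; the
                                   # container fetches a fresh snapshot later
    # ... create and mount the replacement storage ...
    systemctl start docker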
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of hardcoding the laconic cluster ID, namespace, deployment
name, and pod label, read cluster-id from deployment.yml on biscayne
and derive everything from it.
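
A hedged sketch of the derivation (the file path and naming scheme here
are assumptions, not the playbook's actual ones):

    cluster_id=$(awk -F': *' '/^cluster-id:/ {print $2}' \
        /srv/deployments/solana/deployment.yml)
    namespace="ns-${cluster_id}"
    deployment="${cluster_id}-deployment"
    pod_label="app=${cluster_id}"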
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The recovery playbook now exits after scaling to 1. The container
entrypoint handles snapshot download (60+ min) and validator startup
autonomously. Removed all polling/verification steps that would
otherwise time out waiting for it.
Added scripts/check-status.py for monitoring download progress,
validator slot, gap to mainnet, catch-up rate, and ramdisk usage.
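
For instance, the gap to mainnet reduces to two getSlot RPC calls
(the local RPC port is an assumption):

    head=$(curl -s http://api.mainnet-beta.solana.com -X POST \
        -H 'content-type: application/json' \
        -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r .result)
    mine=$(curl -s http://localhost:8899 -X POST \
        -H 'content-type: application/json' \
        -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r .result)
    echo "gap to mainnet: $((head - mine)) slots"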
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container's entrypoint.py already handles snapshot freshness checks,
cleanup, download (with rolling incremental convergence), and validator
startup. Remove the host-side download and let the container do the work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without a bound, the loop runs forever if sources never serve an
incremental close enough to head (e.g. when the full snapshot's base
slot is too old). After 30 minutes, proceed with the best incremental
available, or none.
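
The bound is a plain deadline around the re-probe loop; a sketch with
hypothetical variable names (the real logic is in snapshot_download.py):

    deadline=$(( $(date +%s) + 1800 ))           # 30 minutes
    while (( gap > convergence_slots )); do
        (( $(date +%s) >= deadline )) && break   # best-so-far, or none
        # ... re-probe sources, fetch best incremental, recompute gap ...
        sleep 30
    done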
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After the full snapshot downloads, continuously re-probe all fast sources
for newer incrementals until the best available is within convergence_slots
(default 500) of head. Each iteration finds the highest-slot incremental
matching our full snapshot's base slot, downloads it (replacing any previous),
and checks the gap to mainnet head (one iteration is sketched after
the change list below).
- Extract probe_incremental() from inline re-probe code
- Add convergence_slots param to download_best_snapshot() (default 500)
- Add --convergence-slots CLI arg
- Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py
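
One probe iteration, sketched in shell (the real probe_incremental() is
Python in snapshot_download.py; this assumes the common Solana
convention that RPC nodes redirect /incremental-snapshot.tar.bz2 to
incremental-snapshot-<base>-<slot>-<hash>.tar.zst):

    probe_incremental() {    # prints "<slot> <location>" or nothing
        local src=$1 loc
        loc=$(curl -sI "$src/incremental-snapshot.tar.bz2" |
              awk -F': ' 'tolower($1)=="location" {print $2}' | tr -d '\r')
        [[ $loc =~ incremental-snapshot-([0-9]+)-([0-9]+)- ]] || return 1
        # only incrementals built on our full snapshot's base slot count
        [[ ${BASH_REMATCH[1]} == "$full_base_slot" ]] || return 1
        echo "${BASH_REMATCH[2]} $loc"   # callers resolve $loc against $src
    }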
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.
- biscayne-redeploy.yml: remove snapshot download section, simplify to
teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
hostNetwork translation code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scripts/agave-container/ is a git subtree of agave-stack's container-build
directory. This replaces the fragile cross-repo symlink with a proper
subtree.
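
The usual subtree workflow, with repo URL and branch names as
assumptions:

    # in agave-stack: split the directory into a synthetic branch, push it
    git subtree split --prefix container-build -b container-build-split
    git push origin container-build-split
    # in this repo: graft it in
    git subtree add --prefix scripts/agave-container \
        git@github.com:example/agave-stack.git container-build-split --squash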
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
  rules for both the host IP (186.233.184.235) and the relay loopback
  (137.239.194.65). Idempotent, persists via netfilter-persistent (rule
  shape sketched after this list).
- scripts/snapshot-download.py: replaced standalone copy with symlink to
agave-stack source of truth, eliminating duplication.
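
The DNAT rule shape the playbook manages, as a hedged sketch (the kind
node address 172.18.0.2 and the single port shown are illustrative; the
playbook is authoritative):

    iptables -t nat -I PREROUTING 1 -d 137.239.194.65 -p udp --dport 8001 \
        -j DNAT --to-destination 172.18.0.2
    iptables -t nat -I PREROUTING 1 -d 186.233.184.235 -p udp --dport 8001 \
        -j DNAT --to-destination 172.18.0.2
    netfilter-persistent save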
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remounting the tmpfs is instant (the kernel just frees the pages),
while rm -rf on 400GB+ of accounts files traverses every inode. The
recover playbook keeps rm -rf because the kind node's bind mount
prevents the umount while the container is running.
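
The wipe-by-remount, assuming the accounts mount point:

    umount /srv/kind/solana/accounts   # kernel frees all tmpfs pages at once
    mount /srv/kind/solana/accounts    # re-mount per fstab: fresh, empty tmpfs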
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity left over from a confusion during the migration — there was
no actual tmpfs bug with io_uring. tmpfs is simpler (no format-on-boot),
resizable on the fly, and what every other Solana operator uses.
Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
  use a tmpfs fstab entry with size=1024G (the 600G /dev/ram0 was too
  small); entry sketched below
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references
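
The fstab entry shape (mount point is an assumption), plus the
on-the-fly resize tmpfs allows:

    # /etc/fstab
    tmpfs  /srv/kind/solana/accounts  tmpfs  size=1024G,noatime  0  0

    # grow later without unmounting:
    mount -o remount,size=1200G /srv/kind/solana/accounts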
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert snapshot_dir to /srv/solana/snapshots — aria2c runs on the host
where this is the direct zvol mount (always available), unlike
/srv/kind/solana/snapshots which depends on the bind mount
- Add a laconic_so_branch variable (default: main) and use it in both
  git reset commands so the branch can be overridden via -e (example
  after this list)
- Move "Verify ramdisk visible inside kind node" from preflight to after
"Wait for deployment to exist" — the kind container may not exist
during preflight after teardown
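
Override example (playbook path assumed):

    ansible-playbook playbooks/biscayne-redeploy.yml \
        -e laconic_so_branch=fix/kind-mount-propagation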
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add a laconic_so_repo variable (/home/rix/stack-orchestrator) and a
  git pull task before the deployment starts — the editable install
  must be current, or stale code causes deploy failures
- Downgrade unified mount root check from fatal assertion to debug
warning — the mount style depends on which laconic-so version is
deployed, and individual PV mounts (/mnt/validator-*) work fine
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix snapshot_dir: /srv/solana/snapshots → /srv/kind/solana/snapshots
(kind node reads from the bind mount, not the zvol mount directly)
- Fix kind-internal paths: /mnt/solana/... → /mnt/validator-... to match
actual PV hostPath layout (individual mounts, not unified)
- Add a 'scale-up' tag to the "Scale validator to 1" task for partial
  recovery (--tags snapshot,scale-up,verify resumes without re-running
  deploy; see the example after this list)
- Make 'Start deployment' idempotent: failed_when: false + a follow-up
  check so an existing deployment doesn't fail the play
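
Resuming a partial recovery then looks like (playbook name assumed):

    ansible-playbook playbooks/biscayne-recover.yml \
        --tags snapshot,scale-up,verify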
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Docker's FORWARD chain policy DROP blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER
chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT
rules in the DOCKER-USER chain, which runs before all other Docker
chains.
Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
  tag) and rollback cleanup; rule shape sketched after this list
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test
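
The ACCEPT rule shape (ports from the root cause above; the kind node
address is an assumption):

    iptables -I DOCKER-USER -d 172.18.0.2 -p udp --dport 8001      -j ACCEPT
    iptables -I DOCKER-USER -d 172.18.0.2 -p tcp --dport 8001      -j ACCEPT
    iptables -I DOCKER-USER -d 172.18.0.2 -p udp --dport 9000:9025 -j ACCEPT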
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three Python scripts send real packets from the kind node through the
full relay path (biscayne → tunnel → mia-sw01 → was-sw01 → internet)
and verify responses come back via the inbound path. No indirect
counter-checking — a response proves both directions work.
- relay-test-udp.py: DNS query with sport 8001 (equivalent shell shape
  after this list)
- relay-test-tcp-sport.py: HTTP request with sport 8001
- relay-test-tcp-dport.py: TCP connect to entrypoint dport 8001 (ip_echo)
- test-ashburn-relay.sh: orchestrates from ansible controller via nsenter
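
The same check from a shell, for comparison; dig's -b flag pins the
source port (kind container name assumed):

    pid=$(docker inspect -f '{{.State.Pid}}' kind-control-plane)
    nsenter --target "$pid" --net -- \
        dig @1.1.1.1 example.com -b 0.0.0.0#8001 +short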
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Inventories what the DZ agent controls (tunnels, ACLs, VRFs, BGP,
route-maps, loopbacks) so we don't accidentally modify objects that
the agent will silently overwrite. Includes a "safe to modify" section
listing our own relay infrastructure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mia-sw01: Replace PBR-based outbound routing with VRF isolation.
TCAM profile tunnel-interface-acl doesn't support PBR or traffic-policy
on tunnel interfaces. Tunnel100 now lives in VRF "relay" whose default
route sends decapsulated traffic to was-sw01 via backbone, avoiding
BCP38 drops on the ISP uplink for src 137.239.194.65.
biscayne: Add TCP dport mangle rule for ip_echo (port 8001). Without it,
outbound ip_echo probes use biscayne's real IP instead of the Ashburn
relay IP, causing entrypoints to probe the wrong address. Also fix
loopback IP idempotency (handle "already assigned" error).
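
A hedged sketch of that mangle + policy-routing pair (the mark value
and table number are assumptions):

    # mark outbound ip_echo probes so policy routing sends them via the
    # relay with the Ashburn source address
    iptables -t mangle -A OUTPUT -p tcp --dport 8001 -j MARK --set-mark 0x65
    ip rule add fwmark 0x65 table 100   # table 100 defaults via gre-ashburn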
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ashburn relay and shred relay lab configs for local end-to-end
testing with cEOS. No secrets — only public IPs and test scripts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Normal playbooks should never hardcode hostnames — that's an inventory
concern. Changed all playbooks to hosts: all. The one exception is
ashburn-relay-check.yml, which legitimately spans both inventories
(switches + biscayne) and uses explicit hostnames.
Also adds:
- ashburn-relay-check.yml: full-path relay diagnostics (switches + host)
- biscayne-start.yml: start kind container and scale validator to 1
- ashburn-relay-setup.sh.j2: boot persistence script for relay state
- Direct device mounts replacing rbind (ZFS shared propagation fix)
- systemd service replacing broken if-up.d/netfilter-persistent
- PV mount path corrections (/mnt/validator-* not /mnt/solana/*)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move switches.yml to inventory-switches/ so ansible.cfg's
`inventory = inventory/` only loads biscayne. Switch playbooks
must pass `-i inventory-switches/` explicitly.
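
Usage then becomes:

    # default inventory (inventory/), biscayne only:
    ansible-playbook playbooks/biscayne-start.yml
    # switch playbooks opt in explicitly:
    ansible-playbook -i inventory-switches/ playbooks/ashburn-relay-check.yml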
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- FQCN for all modules (ansible.builtin.*)
- changed_when/failed_when on all command/shell tasks
- set -o pipefail on all shell tasks
- Add KUBECONFIG environment to health-check.yml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ansible.cfg: enable SSH agent forwarding for git operations
- biscayne-redeploy.yml: add git pull, deploy create --update, and
clear stale PV claimRefs after namespace deletion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Config is committed to running-config immediately (no 5-minute timer).
The safety net is the checkpoint (rollback) and the fact that
startup-config is only written with -e commit=true. A reboot reverts
uncommitted changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: the doublezero-agent on mia-sw01 manages Tunnel500's ACL
(SEC-USER-500-IN) and drops outbound gossip with src 137.239.194.65.
The agent overwrites any custom ACL entries.
Fix: create a separate GRE tunnel (Tunnel100) using mia-sw01's free
LAN IP (209.42.167.137) as tunnel source. This tunnel goes over the
ISP uplink, completely independent of the DZ overlay:
- mia-sw01: Tunnel100 src 209.42.167.137, dst 186.233.184.235
- biscayne: gre-ashburn src 186.233.184.235, dst 209.42.167.137
- Link addresses: 169.254.100.0/31
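
The biscayne side in plain iproute2 terms (which /31 address each end
takes is an assumption):

    ip tunnel add gre-ashburn mode gre \
        local 186.233.184.235 remote 209.42.167.137
    ip addr add 169.254.100.1/31 dev gre-ashburn   # peer: 169.254.100.0
    ip link set gre-ashburn up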
Playbook changes:
- ashburn-relay-mia-sw01: Tunnel100 + Loopback101 + SEC-VALIDATOR-100-IN
- ashburn-relay-biscayne: gre-ashburn tunnel + updated policy routing
- New template: ashburn-routing-ifup.sh.j2 for boot persistence
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
Loopback101
Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the
DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping
outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. Fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.
Also adds all repo docs, scripts, inventory, and remaining playbooks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three playbooks for routing all validator traffic through 137.239.194.65:
- was-sw01: Loopback101 + PBR redirect on Et1/1 (already applied and
  committed); will be simplified to a static route in the next iteration.
- mia-sw01: ACL permit for src 137.239.194.65 on Tunnel500 + default route
in vrf1 via egress-vrf default to was-sw01 backbone. No PBR needed —
per-tunnel ACLs already scope what enters vrf1.
- biscayne: DNAT inbound (137.239.194.65 → kind node), SNAT + policy
routing outbound (validator sport 8001,9000-9025 → doublezero0 GRE).
Inbound already applied.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Captured via ansible `show running-config` before applying the
mia-sw01 outbound validator redirect changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>