The container's entrypoint.py already handles snapshot freshness checks,
cleanup, download (with rolling incremental convergence), and validator
startup. Remove the host-side download and let the container do the work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scripts/agave-container/ is a git subtree of agave-stack's container-build
directory. Replaces fragile cross-repo symlink with proper subtree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remounting tmpfs is instant (kernel frees pages), while rm -rf on 400GB+
of accounts files traverses every inode. Recover playbook keeps rm -rf
because the kind node's bind mount prevents umount while the container
is running.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity from a migration confusion — there was no actual tmpfs bug
with io_uring. tmpfs is simpler (no format-on-boot), resizable on the
fly, and what every other Solana operator uses.
Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small)
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- FQCN for all modules (ansible.builtin.*)
- changed_when/failed_when on all command/shell tasks
- set -o pipefail on all shell tasks
- Add KUBECONFIG environment to health-check.yml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playbook fixes from testing:
- ashburn-relay-biscayne: insert DNAT rules at position 1 before
Docker's ADDRTYPE LOCAL rule (was being swallowed at position 3+)
- ashburn-relay-mia-sw01: add inbound route for 137.239.194.65 via
egress-vrf vrf1 (nexthop only, no interface — EOS silently drops
cross-VRF routes that specify a tunnel interface)
- ashburn-relay-was-sw01: replace PBR with static route, remove
Loopback101
Bug doc (bug-ashburn-tunnel-port-filtering.md): root cause is the
DoubleZero agent on mia-sw01 overwrites SEC-USER-500-IN ACL, dropping
outbound gossip with src 137.239.194.65. The DZ agent controls
Tunnel500's lifecycle. Fix requires a separate GRE tunnel using
mia-sw01's free LAN IP (209.42.167.137) to bypass DZ infrastructure.
Also adds all repo docs, scripts, inventory, and remaining playbooks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>