5 Commits (b88af2be70cc9ae0d4a009c2136adfc1254adc70)

Author SHA1 Message Date
A. F. Dudley b88af2be70 feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build
- entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit
  via admin RPC (agave-validator exit --force) before falling back to signals
- snapshot_download.py: fix break-on-failure bug in incremental download loop
  (continue + re-probe instead of giving up)
- biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts
  PPA to fix io_uring deadlock at kernel module level
- biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS
  dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix)
- biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before
  scaling to 0, updated docs for admin RPC shutdown
- biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become),
  add --tags build-container support, add set -e to shell blocks
- biscayne-recover.yml: updated for graceful shutdown awareness
- check-status.py: add --pane flag for tmux, clean redraw in watch mode
- CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 07:58:37 +00:00
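
The shutdown order described in b88af2be70 (trap SIGTERM, request a cooperative exit, then fall back to signals) might look roughly like the sketch below. This is not the code from entrypoint.py: `trap_sigterm`, `graceful_stop`, `request_exit` and both timeout parameters are hypothetical names chosen for illustration, and the 300-second default only mirrors the terminationGracePeriodSeconds value mentioned in the commit. In the commit, the cooperative step is the admin RPC call (`agave-validator exit --force`); here it is abstracted as a callable that returns whether the request was accepted.

```python
import signal
import subprocess
import threading
from typing import Callable


def trap_sigterm() -> threading.Event:
    """Install a SIGTERM handler that only sets a flag.

    Must be called from the main thread. The main loop is expected to
    watch the returned event and start the shutdown when it is set.
    """
    stop_requested = threading.Event()
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_requested.set())
    return stop_requested


def graceful_stop(
    child: subprocess.Popen,
    request_exit: Callable[[], bool],
    grace_seconds: float = 300.0,
    signal_wait_seconds: float = 10.0,
) -> str:
    """Stop `child`, preferring a cooperative exit over signals.

    Returns which step ended the process: "rpc", "sigterm" or "sigkill".
    """
    # Step 1: ask the child to exit on its own (hypothetical admin-RPC hook).
    if request_exit():
        try:
            child.wait(timeout=grace_seconds)
            return "rpc"
        except subprocess.TimeoutExpired:
            pass  # request accepted, but the child did not exit in time

    # Step 2: fall back to SIGTERM and give the child a short window.
    child.send_signal(signal.SIGTERM)
    try:
        child.wait(timeout=signal_wait_seconds)
        return "sigterm"
    except subprocess.TimeoutExpired:
        pass

    # Step 3: last resort, the child cannot ignore SIGKILL.
    child.kill()
    child.wait()
    return "sigkill"
```

Keeping the handler down to setting a flag means the slow work (the exit request and the waits) runs in the main loop rather than inside the signal handler.
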
A. F. Dudley 601f520a45 fix: add 30-min wall-clock timeout to incremental convergence loop
Without a bound, the loop runs forever if sources never serve an
incremental close enough to head (e.g. full snapshot base slot is
too old). After 30 minutes, proceed with the best incremental
available, or with none if no incremental was obtained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:11:19 +00:00
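
The time bound added in 601f520a45 can be illustrated with a small deadline-bounded loop. This is a sketch under assumptions, not the code from snapshot_download.py: `converge_with_deadline`, `probe_once`, `poll_seconds`, `clock` and `sleep` are hypothetical names, and the commit does not say which clock it uses. The sketch uses `time.monotonic` so that system clock adjustments cannot stretch or shrink the 30-minute bound.

```python
import time
from typing import Callable, Optional, Tuple


def converge_with_deadline(
    probe_once: Callable[[], Tuple[Optional[int], bool]],
    timeout_seconds: float = 30 * 60,
    poll_seconds: float = 10.0,  # assumed pause between probes
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Repeat `probe_once` until it reports convergence or time runs out.

    `probe_once` returns (slot of the incremental it obtained, or None;
    whether that incremental is close enough to head).
    Returns the highest incremental slot seen, or None if none was obtained.
    """
    deadline = clock() + timeout_seconds
    best: Optional[int] = None
    while True:
        slot, converged = probe_once()
        if slot is not None and (best is None or slot > best):
            best = slot
        # Stop on success, or when the deadline has passed: the caller then
        # proceeds with the best incremental found so far, possibly none.
        if converged or clock() >= deadline:
            return best
        sleep(poll_seconds)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time.
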
A. F. Dudley bfde58431e feat: rolling incremental snapshot download loop
After the full snapshot downloads, continuously re-probe all fast sources
for newer incrementals until the best available is within convergence_slots
(default 500) of head. Each iteration finds the highest-slot incremental
matching our full snapshot's base slot, downloads it (replacing any previous),
and checks the gap to mainnet head.

- Extract probe_incremental() from inline re-probe code
- Add convergence_slots param to download_best_snapshot() (default 500)
- Add --convergence-slots CLI arg
- Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:33:47 +00:00
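
The per-iteration selection and convergence check described in bfde58431e can be sketched as two small pure functions. The names `Incremental`, `pick_incremental` and `is_converged` are hypothetical and are not taken from snapshot_download.py; only `convergence_slots` and its default of 500 come from the commit. Treating a gap of exactly `convergence_slots` as converged is an assumption, since the commit only says "within".

```python
from typing import Iterable, NamedTuple, Optional


class Incremental(NamedTuple):
    base_slot: int  # slot of the full snapshot this incremental builds on
    slot: int       # slot the incremental reaches
    source: str     # where it was offered (hypothetical field)


def pick_incremental(
    candidates: Iterable[Incremental], full_base_slot: int
) -> Optional[Incremental]:
    """Return the highest-slot incremental built on our full snapshot, if any."""
    matching = [c for c in candidates if c.base_slot == full_base_slot]
    return max(matching, key=lambda c: c.slot, default=None)


def is_converged(
    incremental_slot: int, head_slot: int, convergence_slots: int = 500
) -> bool:
    """True when the incremental is within `convergence_slots` of head."""
    return head_slot - incremental_slot <= convergence_slots
```

An incremental with a higher slot but a different base slot is ignored, because it cannot be applied on top of the full snapshot that was already downloaded.
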
A. F. Dudley bd38c1b791 fix: remove Ansible snapshot download, add sync-tools playbook
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.

- biscayne-redeploy.yml: remove snapshot download section, simplify to
  teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
  agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
  snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
  hostNetwork translation code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:14:43 +00:00
A. F. Dudley 25952b4fa7 Merge commit 'f4b3a46109a8da00fdd68d8999160ddc45dcc88a' as 'scripts/agave-container'
2026-03-08 19:13:38 +00:00