feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build

- entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit
  via admin RPC (agave-validator exit --force) before falling back to signals
- snapshot_download.py: fix break-on-failure bug in incremental download loop
  (continue + re-probe instead of giving up)
- biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts
  PPA to fix io_uring deadlock at kernel module level
- biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS
  dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix)
- biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before
  scaling to 0, updated docs for admin RPC shutdown
- biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become),
  add --tags build-container support, add set -e to shell blocks
- biscayne-recover.yml: updated for graceful shutdown awareness
- check-status.py: add --pane flag for tmux, clean redraw in watch mode
- CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix/kind-mount-propagation
A. F. Dudley 2026-03-09 07:58:37 +00:00
parent 173b807451
commit b88af2be70
9 changed files with 661 additions and 54 deletions


@ -10,16 +10,16 @@ below it are correct. Playbooks belong to exactly one layer.
| 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
| 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
| 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` → `/mnt` in kind node | `biscayne-redeploy.yml` (deploy tags) |
| 4. Prepare agave | Host storage for agave: ZFS dataset, ramdisk | `biscayne-prepare-agave.yml` |
| 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |

**Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):

- `/srv/kind/solana` is a ZFS dataset (`biscayne/DATA/srv/kind/solana`), child of the `/srv/kind` dataset
- `/srv/kind/solana/ramdisk` is tmpfs (1TB) — accounts must be in RAM
- `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`

These invariants are checked at runtime and persisted to fstab/systemd so they
survive reboot.

**Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml`
(layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).
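The layer 4 invariants lend themselves to a quick scripted spot-check. A minimal sketch — the `invariant_violations` helper is hypothetical, not part of the playbooks; `biscayne-prepare-agave.yml` remains the authoritative check:

```python
# Hypothetical spot-check for the layer 4 invariants. Observed fstypes
# would come from e.g. `findmnt -n -o FSTYPE <path>` on the host.
EXPECTED_FSTYPES = {
    "/srv/kind/solana": "zfs",            # ZFS dataset, not XFS-on-zvol
    "/srv/kind/solana/ramdisk": "tmpfs",  # accounts must be in RAM
}

def invariant_violations(observed: dict[str, str]) -> list[str]:
    """Compare an observed path -> fstype mapping against the expected
    invariants and return human-readable violations (empty list = OK)."""
    problems = []
    for path, want in EXPECTED_FSTYPES.items():
        got = observed.get(path, "missing")
        if got != want:
            problems.append(f"{path}: expected {want}, got {got}")
    return problems
```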
@ -30,11 +30,8 @@ not base system concerns.
The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
The kind node is a Docker container. **Never restart or kill the kind node container
while the validator is running.** Use `agave-validator exit --force` via the admin
RPC socket for graceful shutdown, or scale the deployment to 0 and wait.

Correct shutdown sequence:
@ -61,15 +58,16 @@ The accounts directory must be in RAM for performance. tmpfs is used instead of
`/dev/ram0` — simpler (no format-on-boot service needed), resizable on the fly
with `mount -o remount,size=<new>`, and what most Solana operators use.

**Boot ordering**: `/srv/kind/solana` is a ZFS dataset mounted automatically by
`zfs-mount.service`. The tmpfs ramdisk fstab entry uses
`x-systemd.requires=zfs-mount.service` to ensure the dataset is mounted first.
**No manual intervention after reboot.**

**Mount propagation**: The kind node bind-mounts `/srv/kind` → `/mnt` at container
start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts
(commit `a11d40f2` in stack-orchestrator), so host submounts propagate into the
kind node automatically. A kind restart is required to pick up the new config
after updating laconic-so.
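The boot-ordering rule reduces to a single fstab entry. A sketch of how that line is assembled — the `ramdisk_fstab_line` helper is illustrative only; the mount point, size, and options follow the values used elsewhere in this commit:

```python
def ramdisk_fstab_line(mount: str = "/srv/kind/solana/ramdisk",
                       size: str = "1024G") -> str:
    """Build the tmpfs fstab entry for the accounts ramdisk.
    x-systemd.requires=zfs-mount.service defers the tmpfs mount until
    the ZFS dataset underneath it has been mounted."""
    opts = ",".join([
        "nodev", "nosuid", "noexec", "nodiratime",
        f"size={size}", "nofail",
        "x-systemd.requires=zfs-mount.service",
    ])
    return f"tmpfs {mount} tmpfs {opts} 0 0"
```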
### KUBECONFIG
@ -92,21 +90,20 @@ Then export it:
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```
### io_uring/ZFS Deadlock — Historical Note

Agave uses io_uring for async I/O. Killing agave ungracefully while it has
outstanding I/O against ZFS can produce unkillable D-state kernel threads
(`io_wq_put_and_exit` blocked on ZFS transactions), deadlocking the container.

**Prevention**: Use graceful shutdown (`agave-validator exit --force` via admin
RPC, or scale to 0 and wait). The `biscayne-stop.yml` playbook enforces this.
With graceful shutdown, io_uring contexts are closed cleanly and ZFS storage
is safe to use directly (no zvol/XFS workaround needed).

**ZFS fix**: The underlying io_uring bug is fixed in ZFS 2.2.8+ (PR #17298).
Biscayne currently runs ZFS 2.2.2. Upgrading ZFS will eliminate the deadlock
risk entirely, even for ungraceful shutdowns.
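The prevention rule can be sketched as a small helper. The injectable `run` parameter is purely for illustration and testing; entrypoint.py in this commit is the real implementation:

```python
import subprocess

def request_graceful_exit(ledger_dir: str, run=subprocess.run) -> bool:
    """Ask the validator to exit via the admin RPC Unix socket.
    Returns True if the exit request was accepted; on False the caller
    escalates to SIGTERM then SIGKILL (as entrypoint.py does)."""
    try:
        result = run(
            ["agave-validator", "exit", "--force", "--ledger", ledger_dir],
            capture_output=True, text=True, timeout=30,
        )
        return result.returncode == 0
    except Exception:
        # Binary missing, socket gone, or the 30s timeout hit —
        # fall back to signal-based shutdown.
        return False
```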
### laconic-so Architecture
@ -133,11 +130,11 @@ kind node via a single bind mount.
- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/kind/solana/snapshots` (ZFS dataset, visible to kind at `/mnt/validator-snapshots`)
- Ledger dir: `/srv/kind/solana/ledger` (ZFS dataset, visible to kind at `/mnt/validator-ledger`)
- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (tmpfs ramdisk, visible to kind at `/mnt/validator-accounts`)
- Log dir: `/srv/kind/solana/log` (ZFS dataset, visible to kind at `/mnt/validator-log`)
- **WARNING**: `/srv/solana` is a different ZFS dataset directory. All data paths use `/srv/kind/solana`.
- Host bind mount root: `/srv/kind` -> kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)


@ -0,0 +1,286 @@
---
# One-time migration: zvol/XFS → ZFS dataset for /srv/kind/solana
#
# Background:
# Biscayne used a ZFS zvol formatted as XFS to work around io_uring/ZFS
# deadlocks. The root cause is now handled by graceful shutdown via admin
# RPC (agave-validator exit --force), so the zvol/XFS layer is unnecessary.
#
# What this does:
# 1. Asserts the validator is scaled to 0 (does NOT scale it — that's
# the operator's job via biscayne-stop.yml)
# 2. Creates a child ZFS dataset biscayne/DATA/srv/kind/solana
# 3. Copies data from the zvol to the new dataset (rsync)
# 4. Updates fstab (removes zvol line, fixes tmpfs dependency)
# 5. Destroys the zvol after verification
#
# Prerequisites:
# - Validator MUST be stopped (scale 0, no agave processes)
# - Run biscayne-stop.yml first
#
# Usage:
# ansible-playbook -i inventory/ playbooks/biscayne-migrate-storage.yml
#
# After migration, run biscayne-prepare-agave.yml to update its checks,
# then biscayne-start.yml to bring the validator back up.
#
- name: Migrate storage from zvol/XFS to ZFS dataset
hosts: all
gather_facts: false
become: true
environment:
KUBECONFIG: /home/rix/.kube/config
vars:
kind_cluster: laconic-70ce4c4b47e23b85
k8s_namespace: "laconic-{{ kind_cluster }}"
deployment_name: "{{ kind_cluster }}-deployment"
zvol_device: /dev/zvol/biscayne/DATA/volumes/solana
zvol_dataset: biscayne/DATA/volumes/solana
new_dataset: biscayne/DATA/srv/kind/solana
kind_solana_dir: /srv/kind/solana
ramdisk_mount: /srv/kind/solana/ramdisk
ramdisk_size: 1024G
# Temporary mount for zvol during data copy
zvol_tmp_mount: /mnt/zvol-migration-tmp
tasks:
# ---- preconditions --------------------------------------------------------
- name: Check deployment replica count
ansible.builtin.command: >
kubectl get deployment {{ deployment_name }}
-n {{ k8s_namespace }}
-o jsonpath='{.spec.replicas}'
register: current_replicas
failed_when: false
changed_when: false
- name: Fail if validator is running
ansible.builtin.fail:
msg: >-
Validator must be scaled to 0 before migration.
Current replicas: {{ current_replicas.stdout | default('unknown') }}.
Run biscayne-stop.yml first.
when: current_replicas.stdout | default('0') | int > 0
- name: Verify no agave processes in kind node
ansible.builtin.command: >
docker exec {{ kind_cluster }}-control-plane
pgrep -c agave-validator
register: agave_procs
failed_when: false
changed_when: false
- name: Fail if agave still running
ansible.builtin.fail:
msg: >-
agave-validator process still running inside kind node.
Cannot migrate while validator is active.
when: agave_procs.rc == 0
# ---- check current state --------------------------------------------------
- name: Check if zvol device exists
ansible.builtin.stat:
path: "{{ zvol_device }}"
register: zvol_exists
- name: Check if ZFS dataset already exists
ansible.builtin.command: zfs list -H -o name {{ new_dataset }}
register: dataset_exists
failed_when: false
changed_when: false
- name: Check current mount type at {{ kind_solana_dir }}
ansible.builtin.shell:
cmd: set -o pipefail && findmnt -n -o FSTYPE {{ kind_solana_dir }}
executable: /bin/bash
register: current_fstype
failed_when: false
changed_when: false
- name: Report current state
ansible.builtin.debug:
msg:
zvol_exists: "{{ zvol_exists.stat.exists | default(false) }}"
dataset_exists: "{{ dataset_exists.rc == 0 }}"
current_fstype: "{{ current_fstype.stdout | default('none') }}"
# ---- skip if already migrated ---------------------------------------------
- name: End play if already on ZFS dataset
ansible.builtin.meta: end_play
when:
- dataset_exists.rc == 0
- current_fstype.stdout | default('') == 'zfs'
- not (zvol_exists.stat.exists | default(false))
# ---- step 1: unmount ramdisk and zvol ------------------------------------
- name: Unmount ramdisk
ansible.posix.mount:
path: "{{ ramdisk_mount }}"
state: unmounted
- name: Unmount zvol from {{ kind_solana_dir }}
ansible.posix.mount:
path: "{{ kind_solana_dir }}"
state: unmounted
when: current_fstype.stdout | default('') == 'xfs'
# ---- step 2: create ZFS dataset -----------------------------------------
- name: Create ZFS dataset {{ new_dataset }}
ansible.builtin.command: >
zfs create -o mountpoint={{ kind_solana_dir }} {{ new_dataset }}
changed_when: true
when: dataset_exists.rc != 0
- name: Mount ZFS dataset if it already existed
ansible.builtin.command: zfs mount {{ new_dataset }}
changed_when: true
failed_when: false
when: dataset_exists.rc == 0
- name: Verify ZFS dataset is mounted
ansible.builtin.shell:
cmd: set -o pipefail && findmnt -n -o FSTYPE {{ kind_solana_dir }} | grep -q zfs
executable: /bin/bash
changed_when: false
# ---- step 3: copy data from zvol ----------------------------------------
- name: Create temporary mount point for zvol
ansible.builtin.file:
path: "{{ zvol_tmp_mount }}"
state: directory
mode: "0755"
when: zvol_exists.stat.exists | default(false)
- name: Mount zvol at temporary location
ansible.posix.mount:
path: "{{ zvol_tmp_mount }}"
src: "{{ zvol_device }}"
fstype: xfs
state: mounted
when: zvol_exists.stat.exists | default(false)
- name: Copy data from zvol to ZFS dataset # noqa: command-instead-of-module
ansible.builtin.command: >
rsync -a --info=progress2
--exclude='ramdisk/'
{{ zvol_tmp_mount }}/
{{ kind_solana_dir }}/
changed_when: true
when: zvol_exists.stat.exists | default(false)
# ---- step 4: verify data integrity --------------------------------------
- name: Check key directories exist on new dataset
ansible.builtin.stat:
path: "{{ kind_solana_dir }}/{{ item }}"
register: dir_checks
loop:
- ledger
- snapshots
- log
- name: Report directory verification
ansible.builtin.debug:
msg: "{{ item.item }}: {{ 'exists' if item.stat.exists else 'MISSING' }}"
loop: "{{ dir_checks.results }}"
loop_control:
label: "{{ item.item }}"
# ---- step 5: update fstab ------------------------------------------------
- name: Remove zvol fstab entry
ansible.builtin.lineinfile:
path: /etc/fstab
regexp: '^\S+zvol\S+\s+{{ kind_solana_dir }}\s'
state: absent
register: fstab_zvol_removed
# Also match any XFS entry for kind_solana_dir (non-zvol form)
- name: Remove any XFS fstab entry for {{ kind_solana_dir }}
ansible.builtin.lineinfile:
path: /etc/fstab
regexp: '^\S+\s+{{ kind_solana_dir }}\s+xfs'
state: absent
# ZFS datasets are mounted by zfs-mount.service automatically.
# The tmpfs ramdisk depends on the solana dir existing, which ZFS
# guarantees via zfs-mount.service. Update the systemd dependency.
- name: Update tmpfs ramdisk fstab entry
ansible.builtin.lineinfile:
path: /etc/fstab
regexp: '^\S+\s+{{ ramdisk_mount }}\s'
line: "tmpfs {{ ramdisk_mount }} tmpfs nodev,nosuid,noexec,nodiratime,size={{ ramdisk_size }},nofail,x-systemd.requires=zfs-mount.service 0 0"
- name: Reload systemd # noqa: no-handler
ansible.builtin.systemd:
daemon_reload: true
when: fstab_zvol_removed.changed
# ---- step 6: mount ramdisk -----------------------------------------------
- name: Mount tmpfs ramdisk
ansible.posix.mount:
path: "{{ ramdisk_mount }}"
src: tmpfs
fstype: tmpfs
opts: "nodev,nosuid,noexec,nodiratime,size={{ ramdisk_size }}"
state: mounted
- name: Ensure accounts directory
ansible.builtin.file:
path: "{{ ramdisk_mount }}/accounts"
state: directory
owner: solana
group: solana
mode: "0755"
# ---- step 7: clean up zvol -----------------------------------------------
- name: Unmount zvol from temporary location
ansible.posix.mount:
path: "{{ zvol_tmp_mount }}"
state: unmounted
when: zvol_exists.stat.exists | default(false)
- name: Remove temporary mount point
ansible.builtin.file:
path: "{{ zvol_tmp_mount }}"
state: absent
- name: Destroy zvol {{ zvol_dataset }}
ansible.builtin.command: zfs destroy {{ zvol_dataset }}
changed_when: true
when: zvol_exists.stat.exists | default(false)
# ---- step 8: ensure shared propagation for docker ------------------------
- name: Ensure shared propagation on kind mounts # noqa: command-instead-of-module
ansible.builtin.command:
cmd: mount --make-shared {{ item }}
loop:
- "{{ kind_solana_dir }}"
- "{{ ramdisk_mount }}"
changed_when: false
# ---- verification ---------------------------------------------------------
- name: Verify solana dir is ZFS
ansible.builtin.shell:
cmd: set -o pipefail && df -T {{ kind_solana_dir }} | grep -q zfs
executable: /bin/bash
changed_when: false
- name: Verify ramdisk is tmpfs
ansible.builtin.shell:
cmd: set -o pipefail && df -T {{ ramdisk_mount }} | grep -q tmpfs
executable: /bin/bash
changed_when: false
- name: Verify zvol is destroyed
ansible.builtin.command: zfs list -H -o name {{ zvol_dataset }}
register: zvol_gone
failed_when: zvol_gone.rc == 0
changed_when: false
- name: Migration complete
ansible.builtin.debug:
msg: >-
Storage migration complete.
{{ kind_solana_dir }} is now a ZFS dataset ({{ new_dataset }}).
Ramdisk at {{ ramdisk_mount }} (tmpfs, {{ ramdisk_size }}).
zvol {{ zvol_dataset }} destroyed.
Next: update biscayne-prepare-agave.yml, then start the validator.


@ -10,7 +10,8 @@
# 2. Wait for pods to terminate (io_uring safety check)
# 3. Wipe accounts ramdisk
# 4. Clean old snapshots
# 5. Ensure terminationGracePeriodSeconds is 300 (for graceful shutdown)
# 6. Scale to 1 — container entrypoint downloads snapshot + starts validator
#
# The playbook exits after step 6. The container handles snapshot download
# (60+ min) and validator startup autonomously. Monitor with:
@ -95,7 +96,18 @@
become: true
changed_when: true
# ---- step 5: ensure terminationGracePeriodSeconds -------------------------
# laconic-so doesn't support this declaratively. Patch the deployment so
# k8s gives the entrypoint 300s to perform graceful shutdown via admin RPC.
- name: Ensure terminationGracePeriodSeconds is 300
ansible.builtin.command: >
kubectl patch deployment {{ deployment_name }}
-n {{ k8s_namespace }}
-p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
register: patch_result
changed_when: "'no change' not in patch_result.stdout"
# ---- step 6: scale to 1 — entrypoint handles snapshot download ------------
# The container's entrypoint.py checks snapshot freshness, cleans stale
# snapshots, downloads fresh ones (with rolling incremental convergence),
# then starts the validator. No host-side download needed.


@ -5,11 +5,12 @@
# This MUST be done before any kind node restart, host reboot,
# or docker operations.
#
# The container entrypoint (PID 1) traps SIGTERM and runs
# ``agave-validator exit --force --ledger /data/ledger`` which tells
# the validator to flush I/O and exit cleanly via the admin RPC Unix
# socket. This avoids the io_uring/ZFS deadlock that occurs when the
# process is killed. terminationGracePeriodSeconds must be set to 300
# on the k8s deployment to allow time for the flush.
#
# Usage:
# # Stop the validator
@ -42,6 +43,17 @@
failed_when: false
changed_when: false
# Ensure k8s gives the entrypoint enough time for graceful shutdown
# via admin RPC before sending SIGKILL.
- name: Ensure terminationGracePeriodSeconds is 300
ansible.builtin.command: >
kubectl patch deployment {{ deployment_name }}
-n {{ k8s_namespace }}
-p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
register: patch_result
changed_when: "'no change' not in patch_result.stdout"
when: current_replicas.stdout | default('0') | int > 0
- name: Scale deployment to 0
ansible.builtin.command: >
kubectl scale deployment {{ deployment_name }}


@ -15,6 +15,10 @@
# ansible-playbook -i inventory/biscayne.yml playbooks/biscayne-sync-tools.yml \
# -e laconic_so_branch=fix/kind-mount-propagation
#
# # Sync and rebuild the agave container image
# ansible-playbook -i inventory/biscayne.yml playbooks/biscayne-sync-tools.yml \
# --tags build-container
#
- name: Sync laconic-so and agave-stack
hosts: all
gather_facts: false
@ -30,49 +34,55 @@
stack_branch: main
tasks:
# Git operations run as the connecting user (no become) so that
# SSH agent forwarding works. sudo drops SSH_AUTH_SOCK.
- name: Update laconic-so (editable install)
ansible.builtin.shell: |
set -e
cd {{ laconic_so_repo }}
git fetch origin {{ laconic_so_branch }}
git reset --hard origin/{{ laconic_so_branch }}
vars:
ansible_become: false
register: laconic_so_update
changed_when: true
tags: [sync, build-container]
- name: Show laconic-so version
become: false
ansible.builtin.shell:
cmd: set -o pipefail && cd {{ laconic_so_repo }} && git log --oneline -1
executable: /bin/bash
register: laconic_so_version
changed_when: false
tags: [sync, build-container]
- name: Report laconic-so
ansible.builtin.debug:
msg: "laconic-so: {{ laconic_so_version.stdout }}"
tags: [sync, build-container]
- name: Pull agave-stack repo
ansible.builtin.shell: |
set -e
cd {{ stack_repo }}
git fetch origin {{ stack_branch }}
git reset --hard origin/{{ stack_branch }}
vars:
ansible_become: false
register: stack_update
changed_when: true
tags: [sync, build-container]
- name: Show agave-stack version
become: false
ansible.builtin.shell:
cmd: set -o pipefail && cd {{ stack_repo }} && git log --oneline -1
executable: /bin/bash
register: stack_version
changed_when: false
tags: [sync, build-container]
- name: Report agave-stack
ansible.builtin.debug:
msg: "agave-stack: {{ stack_version.stdout }}"
tags: [sync, build-container]
- name: Regenerate deployment config from updated stack
ansible.builtin.command: >
@ -84,6 +94,7 @@
--update
register: regen_result
changed_when: true
tags: [sync, build-container]
- name: Report sync complete
ansible.builtin.debug:
@ -91,3 +102,27 @@
Sync complete. laconic-so and agave-stack updated to
origin/{{ laconic_so_branch }}. Deployment config regenerated.
Restart or redeploy required to apply changes.
tags: [sync, build-container]
# ---- optional: rebuild container image --------------------------------------
# Only runs when explicitly requested with --tags build-container.
# Safe to run while the validator is running — just builds a new image.
# The running pod keeps the old image until restarted.
- name: Build agave container image
ansible.builtin.command: >
{{ laconic_so }}
--stack {{ stack_path }}
build-containers
--include laconicnetwork-agave
tags:
- build-container
- never
register: build_result
changed_when: true
- name: Report build complete
ansible.builtin.debug:
msg: "Container image built. Will be used on next pod restart."
tags:
- build-container
- never


@ -0,0 +1,158 @@
---
# Upgrade ZFS from 2.2.2 to 2.2.9 via arter97's zfs-lts PPA
#
# Fixes the io_uring deadlock (OpenZFS PR #17298) at the kernel module level.
# After this upgrade, the zvol/XFS workaround is unnecessary and can be removed
# with biscayne-migrate-storage.yml.
#
# PPA: ppa:arter97/zfs-lts (Juhyung Park, OpenZFS contributor)
# Builds from source on Launchpad — transparent, auditable.
#
# WARNING: This playbook triggers a reboot at the end. If the io_uring zombie
# is present, the reboot WILL HANG. The operator must hard power cycle the
# machine (IPMI/iDRAC/physical). The playbook does not wait for the reboot —
# run the verify tag separately after the machine comes back.
#
# Usage:
# # Full upgrade (adds PPA, upgrades, reboots)
# ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml
#
# # Verify after reboot
# ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml \
# --tags verify
#
# # Dry run — show what would be upgraded
# ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml \
# --tags dry-run
#
- name: Upgrade ZFS via arter97/zfs-lts PPA
hosts: all
gather_facts: true
become: true
vars:
zfs_min_version: "2.2.8"
ppa_name: "ppa:arter97/zfs-lts"
zfs_packages:
- zfsutils-linux
- zfs-dkms
- libzfs5linux
tasks:
# ---- pre-flight checks ----------------------------------------------------
- name: Get current ZFS version
ansible.builtin.command: modinfo -F version zfs
register: zfs_current_version
changed_when: false
tags: [always]
- name: Report current ZFS version
ansible.builtin.debug:
msg: "Current ZFS: {{ zfs_current_version.stdout }}"
tags: [always]
- name: Skip if already upgraded
ansible.builtin.meta: end_play
when: zfs_current_version.stdout is version(zfs_min_version, '>=')
tags: [always]
# ---- dry run ---------------------------------------------------------------
- name: Show available ZFS packages from PPA (dry run)
ansible.builtin.shell:
cmd: >
set -o pipefail &&
apt-cache policy zfsutils-linux zfs-dkms 2>/dev/null |
grep -A2 'zfsutils-linux\|zfs-dkms'
executable: /bin/bash
changed_when: false
failed_when: false
tags:
- dry-run
- never
# ---- add PPA ---------------------------------------------------------------
- name: Add arter97/zfs-lts PPA
ansible.builtin.apt_repository:
repo: "{{ ppa_name }}"
state: present
update_cache: true
tags: [upgrade]
# ---- upgrade ZFS packages --------------------------------------------------
- name: Upgrade ZFS packages
ansible.builtin.apt:
name: "{{ zfs_packages }}"
state: latest # noqa: package-latest
update_cache: true
register: zfs_upgrade
tags: [upgrade]
- name: Show upgrade result
ansible.builtin.debug:
msg: "{{ zfs_upgrade.stdout_lines | default(['no output']) }}"
tags: [upgrade]
# ---- reboot ----------------------------------------------------------------
- name: Report pre-reboot status
ansible.builtin.debug:
msg: >-
ZFS packages upgraded. Rebooting now.
If the io_uring zombie is present, this reboot WILL HANG.
Hard power cycle the machine, then run this playbook with
--tags verify to confirm the upgrade.
tags: [upgrade]
- name: Reboot to load new ZFS modules
ansible.builtin.reboot:
msg: "ZFS upgrade — rebooting to load new kernel modules"
reboot_timeout: 600
tags: [upgrade]
# This will timeout if io_uring zombie blocks shutdown.
# Operator must hard power cycle. That's expected.
# ---- post-reboot verification -----------------------------------------------
- name: Get ZFS version after reboot
ansible.builtin.command: modinfo -F version zfs
register: zfs_new_version
changed_when: false
tags:
- verify
- never
- name: Verify ZFS version meets minimum
ansible.builtin.assert:
that:
- zfs_new_version.stdout is version(zfs_min_version, '>=')
fail_msg: >-
ZFS version {{ zfs_new_version.stdout }} is below minimum
{{ zfs_min_version }}. Upgrade may have failed.
success_msg: "ZFS {{ zfs_new_version.stdout }} — io_uring fix confirmed."
tags:
- verify
- never
- name: Verify ZFS pools are healthy
ansible.builtin.command: zpool status -x
register: zpool_status
changed_when: false
failed_when: "'all pools are healthy' not in zpool_status.stdout"
tags:
- verify
- never
- name: Verify ZFS datasets are mounted
ansible.builtin.command: zfs mount
register: zfs_mounts
changed_when: false
tags:
- verify
- never
- name: Report verification
ansible.builtin.debug:
msg:
zfs_version: "{{ zfs_new_version.stdout }}"
pools: "{{ zpool_status.stdout }}"
mounts: "{{ zfs_mounts.stdout_lines }}"
tags:
- verify
- never


@ -2,12 +2,17 @@
"""Agave validator entrypoint — snapshot management, arg construction, liveness probe.

Two subcommands:
  entrypoint.py serve (default)  snapshot freshness check + run agave-validator
  entrypoint.py probe            liveness probe (slot lag check, exits 0/1)

Replaces the bash entrypoint.sh / start-rpc.sh / start-validator.sh with a single
Python module. Test mode still dispatches to start-test.sh.

Python stays as PID 1 and traps SIGTERM. On SIGTERM, it runs
``agave-validator exit --force --ledger /data/ledger`` which connects to the
admin RPC Unix socket and tells the validator to flush I/O and exit cleanly.
This avoids the io_uring/ZFS deadlock that occurs when the process is killed.

All configuration comes from environment variables — same vars as the original
bash scripts. See compose files for defaults.
"""
@ -18,8 +23,10 @@ import json
import logging import logging
import os import os
import re import re
import signal
import subprocess import subprocess
import sys import sys
import threading
import time import time
import urllib.error import urllib.error
import urllib.request import urllib.request
@@ -365,11 +372,77 @@ def append_extra_args(args: list[str]) -> list[str]:
     return args
 
 
+# -- Graceful shutdown --------------------------------------------------------
+
+# Timeout for graceful exit via admin RPC. Leave 30s margin for k8s
+# terminationGracePeriodSeconds (300s).
+GRACEFUL_EXIT_TIMEOUT = 270
+
+
+def graceful_exit(child: subprocess.Popen[bytes]) -> None:
+    """Request graceful shutdown via the admin RPC Unix socket.
+
+    Runs ``agave-validator exit --force --ledger /data/ledger`` which connects
+    to the admin RPC socket at ``/data/ledger/admin.rpc`` and sets the
+    validator's exit flag. The validator flushes all I/O and exits cleanly,
+    avoiding the io_uring/ZFS deadlock.
+
+    If the admin RPC exit fails or the child doesn't exit within the timeout,
+    falls back to SIGTERM then SIGKILL.
+    """
+    log.info("SIGTERM received — requesting graceful exit via admin RPC")
+    try:
+        result = subprocess.run(
+            ["agave-validator", "exit", "--force", "--ledger", LEDGER_DIR],
+            capture_output=True, text=True, timeout=30,
+        )
+        if result.returncode == 0:
+            log.info("Admin RPC exit requested successfully")
+        else:
+            log.warning(
+                "Admin RPC exit returned %d: %s",
+                result.returncode, result.stderr.strip(),
+            )
+    except subprocess.TimeoutExpired:
+        log.warning("Admin RPC exit command timed out after 30s")
+    except FileNotFoundError:
+        log.warning("agave-validator binary not found for exit command")
+
+    # Wait for child to exit
+    try:
+        child.wait(timeout=GRACEFUL_EXIT_TIMEOUT)
+        log.info("Validator exited cleanly with code %d", child.returncode)
+        return
+    except subprocess.TimeoutExpired:
+        log.warning(
+            "Validator did not exit within %ds — sending SIGTERM",
+            GRACEFUL_EXIT_TIMEOUT,
+        )
+
+    # Fallback: SIGTERM
+    child.terminate()
+    try:
+        child.wait(timeout=15)
+        log.info("Validator exited after SIGTERM with code %d", child.returncode)
+        return
+    except subprocess.TimeoutExpired:
+        log.warning("Validator did not exit after SIGTERM — sending SIGKILL")
+
+    # Last resort: SIGKILL
+    child.kill()
+    child.wait()
+    log.info("Validator killed with SIGKILL, code %d", child.returncode)
+
+
 # -- Serve subcommand ---------------------------------------------------------
 
 
 def cmd_serve() -> None:
-    """Main serve flow: snapshot check, setup, exec agave-validator."""
+    """Main serve flow: snapshot check, setup, run agave-validator as child.
+
+    Python stays as PID 1 and traps SIGTERM to perform graceful shutdown
+    via the admin RPC Unix socket.
+    """
     mode = env("AGAVE_MODE", "test")
     log.info("AGAVE_MODE=%s", mode)
@@ -407,7 +480,21 @@ def cmd_serve() -> None:
     Path("/tmp/entrypoint-start").write_text(str(time.time()))
 
     log.info("Starting agave-validator with %d arguments", len(args))
-    os.execvp("agave-validator", ["agave-validator"] + args)
+    child = subprocess.Popen(["agave-validator"] + args)
+
+    # Forward SIGUSR1 to child (log rotation)
+    signal.signal(signal.SIGUSR1, lambda _sig, _frame: child.send_signal(signal.SIGUSR1))
+
+    # Trap SIGTERM — run graceful_exit in a thread so the signal handler returns
+    # immediately and child.wait() in the main thread can observe the exit.
+    def _on_sigterm(_sig: int, _frame: object) -> None:
+        threading.Thread(target=graceful_exit, args=(child,), daemon=True).start()
+
+    signal.signal(signal.SIGTERM, _on_sigterm)
+
+    # Wait for child — if it exits on its own (crash, normal exit), propagate code
+    child.wait()
+    sys.exit(child.returncode)
 
 
 # -- Probe subcommand ---------------------------------------------------------
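The escalation ladder in the hunk above (polite exit request, then SIGTERM, then SIGKILL) can be sketched in isolation. Here `stop_child` and `request_exit` are illustrative names, not the real entrypoint API, and the polite request is a no-op stand-in for the `agave-validator exit` admin-RPC call; a `sleep` child stands in for the validator:

```python
import signal
import subprocess


def stop_child(child: subprocess.Popen, request_exit, grace: float) -> int:
    """Escalate: polite request -> SIGTERM -> SIGKILL; return the exit code."""
    request_exit()                      # stand-in for the admin-RPC exit call
    try:
        return child.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass                            # child ignored the polite request
    child.terminate()                   # fallback: SIGTERM
    try:
        return child.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass
    child.kill()                        # last resort: SIGKILL
    return child.wait()


child = subprocess.Popen(["sleep", "60"])
code = stop_child(child, request_exit=lambda: None, grace=0.5)
print(code == -signal.SIGTERM)  # → True: sleep died on the SIGTERM fallback
```

Running the real `graceful_exit` in a daemon thread, as the diff does, keeps the signal handler non-blocking so the main thread's `child.wait()` still observes the exit.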

View File: snapshot_download.py

@@ -655,8 +655,9 @@ def download_best_snapshot(
         log.info("Downloading incremental %s (%d mirrors, slot %d, gap %d slots)",
                  inc_fn, len(inc_mirrors), inc_slot, gap)
         if not download_aria2c(inc_mirrors, output_dir, inc_fn, connections):
-            log.error("Failed to download incremental %s", inc_fn)
-            break
+            log.warning("Failed to download incremental %s — re-probing in 10s", inc_fn)
+            time.sleep(10)
+            continue
 
         prev_inc_filename = inc_fn
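The bug fixed above: a single failed incremental download used to `break` out of the loop and abandon the sync, where it should back off and `continue` to the next probe. A simplified sketch of the fixed control flow, with `fetch_with_reprobe` and `attempt_download` as hypothetical stand-ins for the real loop and `download_aria2c`:

```python
import time


def fetch_with_reprobe(attempt_download, probes, delay=0.0):
    """Walk probed snapshots; on a failed download, back off and keep going."""
    for fn in probes:
        if not attempt_download(fn):
            time.sleep(delay)   # back off, then `continue` to the next probe
            continue            # (the old code did `break` here and gave up)
        return fn               # success: the incremental we managed to fetch
    return None                 # every probe failed


calls = []
def flaky(fn):
    calls.append(fn)
    return fn == "inc-3"        # simulate: only the third probe succeeds


result = fetch_with_reprobe(flaky, ["inc-1", "inc-2", "inc-3"])
print(result)  # → inc-3
```

With `break`, the loop would have stopped after `inc-1` and returned nothing despite a working mirror being one probe away.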

View File: check-status.py

@@ -18,6 +18,7 @@ from __future__ import annotations
 
 import argparse
 import json
+import os
 import subprocess
 import sys
 import time
@@ -206,9 +207,11 @@ def display(iteration: int = 0) -> None:
     snapshots = check_snapshots()
     ramdisk = check_ramdisk()
 
-    print(f"\n{'=' * 60}")
-    print(f"  Biscayne Agave Status — {ts}")
-    print(f"{'=' * 60}")
+    # Clear screen and home cursor for clean redraw in watch mode
+    if iteration > 0:
+        print("\033[2J\033[H", end="")
+
+    print(f"\n  Biscayne Agave Status — {ts}\n")
 
     # Pod
     print(f"\n  Pod: {pod['phase']}")
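The redraw trick above uses two ANSI control sequences: `ESC[2J` erases the display and `ESC[H` homes the cursor, so each watch-mode iteration overwrites the previous status block instead of scrolling. A tiny sketch of how such a frame is assembled (`frame` is a hypothetical helper, not part of check-status.py):

```python
CLEAR = "\033[2J"   # ANSI: erase entire display
HOME = "\033[H"     # ANSI: move cursor to row 1, column 1


def frame(lines: list[str]) -> str:
    """Build one watch-mode frame: clear + home, then the status lines."""
    return CLEAR + HOME + "\n".join(lines)


out = frame(["Biscayne Agave Status", "Pod: Running"])
print(out.startswith("\x1b[2J\x1b[H"))  # → True
```

Skipping the clear on the first iteration (as the diff does with `if iteration > 0`) keeps a one-shot run from wiping the operator's scrollback.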
@@ -275,14 +278,30 @@ def display(iteration: int = 0) -> None:
 # -- Main ---------------------------------------------------------------------
 
 
+def spawn_tmux_pane(interval: int) -> None:
+    """Launch this script with --watch in a new tmux pane."""
+    script = os.path.abspath(__file__)
+    cmd = f"python3 {script} --watch -i {interval}"
+    subprocess.run(
+        ["tmux", "split-window", "-h", "-d", cmd],
+        check=True,
+    )
+
+
 def main() -> int:
     p = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
     p.add_argument("--watch", action="store_true", help="Repeat every interval")
+    p.add_argument("--pane", action="store_true",
+                   help="Launch --watch in a new tmux pane")
     p.add_argument("-i", "--interval", type=int, default=30,
                    help="Watch interval in seconds (default: 30)")
     args = p.parse_args()
 
+    if args.pane:
+        spawn_tmux_pane(args.interval)
+        return 0
+
     discover()
 
     try: