feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build
- entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit via admin RPC (agave-validator exit --force) before falling back to signals
- snapshot_download.py: fix break-on-failure bug in incremental download loop (continue + re-probe instead of giving up)
- biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts PPA to fix io_uring deadlock at kernel module level
- biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix)
- biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before scaling to 0, updated docs for admin RPC shutdown
- biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become), add --tags build-container support, add set -e to shell blocks
- biscayne-recover.yml: updated for graceful shutdown awareness
- check-status.py: add --pane flag for tmux, clean redraw in watch mode
- CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 173b807451
commit b88af2be70

CLAUDE.md | 63
@@ -10,16 +10,16 @@ below it are correct. Playbooks belong to exactly one layer.

 | 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
 | 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
 | 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` → `/mnt` in kind node | `biscayne-redeploy.yml` (deploy tags) |
-| 4. Prepare agave | Host storage for agave: zvol, ramdisk, rbind into `/srv/kind/solana` | `biscayne-prepare-agave.yml` |
+| 4. Prepare agave | Host storage for agave: ZFS dataset, ramdisk | `biscayne-prepare-agave.yml` |
 | 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |
 **Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):

-- `/srv/kind/solana` is XFS on a zvol — agave uses io_uring which deadlocks on ZFS. `/srv/solana` is NOT the zvol (it's a ZFS dataset directory); never use it for data paths
+- `/srv/kind/solana` is a ZFS dataset (`biscayne/DATA/srv/kind/solana`), child of the `/srv/kind` dataset
 - `/srv/kind/solana/ramdisk` is tmpfs (1TB) — accounts must be in RAM
+- `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`

 These invariants are checked at runtime and persisted to fstab/systemd so they
-survive reboot. They are agave's requirements reaching into the boot sequence,
-not base system concerns.
+survive reboot.

 **Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml`
 (layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).
@@ -30,11 +30,8 @@ not base system concerns.

 The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
 The kind node is a Docker container. **Never restart or kill the kind node container
-while the validator is running.** Agave uses `io_uring` for async I/O, and on ZFS,
-killing the process can produce unkillable kernel threads (D-state in
-`io_wq_put_and_exit` blocked on ZFS transaction commits). This deadlocks the
-container's PID namespace, making `docker stop`, `docker restart`, `docker exec`,
-and even `reboot` hang.
+while the validator is running.** Use `agave-validator exit --force` via the admin
+RPC socket for graceful shutdown, or scale the deployment to 0 and wait.

 Correct shutdown sequence:
@@ -61,15 +58,16 @@ The accounts directory must be in RAM for performance. tmpfs is used instead of
 `/dev/ram0` — simpler (no format-on-boot service needed), resizable on the fly
 with `mount -o remount,size=<new>`, and what most Solana operators use.

-**Boot ordering**: fstab entry mounts tmpfs at `/srv/kind/solana/ramdisk` with
-`x-systemd.requires=srv-kind-solana.mount`. tmpfs mounts natively via fstab —
-no systemd format service needed. **No manual intervention after reboot.**
+**Boot ordering**: `/srv/kind/solana` is a ZFS dataset mounted automatically by
+`zfs-mount.service`. The tmpfs ramdisk fstab entry uses
+`x-systemd.requires=zfs-mount.service` to ensure the dataset is mounted first.
+**No manual intervention after reboot.**

 **Mount propagation**: The kind node bind-mounts `/srv/kind` → `/mnt` at container
 start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts
-(commit `a11d40f2` in stack-orchestrator), so host submounts (like the rbind at
-`/srv/kind/solana`) propagate into the kind node automatically. A kind restart
-is required to pick up the new config after updating laconic-so.
+(commit `a11d40f2` in stack-orchestrator), so host submounts propagate into the
+kind node automatically. A kind restart is required to pick up the new config
+after updating laconic-so.

 ### KUBECONFIG
@@ -92,21 +90,20 @@ Then export it:
 export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
 ```

-### io_uring/ZFS Deadlock — Root Cause
+### io_uring/ZFS Deadlock — Historical Note

-When agave-validator is killed while performing I/O against ZFS-backed paths (not
-the ramdisk), io_uring worker threads get stuck in D-state:
-```
-io_wq_put_and_exit → dsl_dir_tempreserve_space (ZFS module)
-```
-These threads are unkillable (SIGKILL has no effect on D-state processes). They
-prevent the container's PID namespace from being reaped (`zap_pid_ns_processes`
-waits forever), which breaks `docker stop`, `docker restart`, `docker exec`, and
-even `reboot`. The only fix is a hard power cycle.
+Agave uses io_uring for async I/O. Killing agave ungracefully while it has
+outstanding I/O against ZFS can produce unkillable D-state kernel threads
+(`io_wq_put_and_exit` blocked on ZFS transactions), deadlocking the container.

-**Prevention**: Always scale the deployment to 0 and wait for the pod to terminate
-before any destructive operation (namespace delete, kind restart, host reboot).
-The `biscayne-stop.yml` playbook enforces this.
+**Prevention**: Use graceful shutdown (`agave-validator exit --force` via admin
+RPC, or scale to 0 and wait). The `biscayne-stop.yml` playbook enforces this.
+With graceful shutdown, io_uring contexts are closed cleanly and ZFS storage
+is safe to use directly (no zvol/XFS workaround needed).
+
+**ZFS fix**: The underlying io_uring bug is fixed in ZFS 2.2.8+ (PR #17298).
+Biscayne currently runs ZFS 2.2.2. Upgrading ZFS will eliminate the deadlock
+risk entirely, even for ungraceful shutdowns.
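The stuck workers the note describes are visible from userspace before any reboot is attempted: they show up as D-state threads named with the kernel's `iou-wrk` prefix for io_uring workers. A minimal detection sketch (a hypothetical helper, not part of this commit):

```python
import os


def find_dstate_iou_workers() -> list[tuple[int, str]]:
    """Return (pid, comm) pairs for io_uring workers stuck in D-state."""
    hits: list[tuple[int, str]] = []
    if not os.path.isdir("/proc"):
        return hits  # not Linux; nothing to scan
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # process exited while scanning
        # The comm field is parenthesised and may contain spaces,
        # so parse around the last closing paren.
        lparen, rparen = raw.index("("), raw.rindex(")")
        comm = raw[lparen + 1 : rparen]
        state = raw[rparen + 2 :].split()[0]
        if state == "D" and comm.startswith("iou-wrk"):
            hits.append((int(pid), comm))
    return hits


if __name__ == "__main__":
    for pid, comm in find_dstate_iou_workers():
        print(f"unkillable io_uring worker: pid={pid} comm={comm}")
```

A non-empty result before a planned reboot is the signal to expect a hang and have out-of-band power control ready.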

### laconic-so Architecture

@@ -133,11 +130,11 @@ kind node via a single bind mount.
 - Deployment: `laconic-70ce4c4b47e23b85-deployment`
 - Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
 - Deployment dir: `/srv/deployments/agave`
-- Snapshot dir: `/srv/kind/solana/snapshots` (on zvol, visible to kind at `/mnt/validator-snapshots`)
-- Ledger dir: `/srv/kind/solana/ledger` (on zvol, visible to kind at `/mnt/validator-ledger`)
-- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (on ramdisk `/dev/ram0`, visible to kind at `/mnt/validator-accounts`)
-- Log dir: `/srv/kind/solana/log` (on zvol, visible to kind at `/mnt/validator-log`)
-- **WARNING**: `/srv/solana` is a ZFS dataset directory, NOT the zvol. Never use it for data paths.
+- Snapshot dir: `/srv/kind/solana/snapshots` (ZFS dataset, visible to kind at `/mnt/validator-snapshots`)
+- Ledger dir: `/srv/kind/solana/ledger` (ZFS dataset, visible to kind at `/mnt/validator-ledger`)
+- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (tmpfs ramdisk, visible to kind at `/mnt/validator-accounts`)
+- Log dir: `/srv/kind/solana/log` (ZFS dataset, visible to kind at `/mnt/validator-log`)
+- **WARNING**: `/srv/solana` is a different ZFS dataset directory. All data paths use `/srv/kind/solana`.
 - Host bind mount root: `/srv/kind` -> kind node `/mnt`
 - laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)
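The layout above can be spot-checked from the host with `findmnt`. A hedged sketch (hypothetical helper names; the paths and expected filesystem types are the ones documented above):

```python
import subprocess


def fstype(path: str) -> str:
    """Return the filesystem type of the mount containing path."""
    out = subprocess.run(
        ["findmnt", "-n", "-o", "FSTYPE", "--target", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def check_storage_invariants() -> list[str]:
    """Return a list of human-readable problems; empty means healthy."""
    problems: list[str] = []
    if fstype("/srv/kind/solana") != "zfs":
        problems.append("/srv/kind/solana is not a ZFS dataset")
    if fstype("/srv/kind/solana/ramdisk") != "tmpfs":
        problems.append("/srv/kind/solana/ramdisk is not tmpfs")
    return problems
```

This mirrors the `findmnt`/`df -T` checks the playbooks already perform, in a form usable from check-status-style tooling.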

@@ -0,0 +1,286 @@
---
# One-time migration: zvol/XFS → ZFS dataset for /srv/kind/solana
#
# Background:
#   Biscayne used a ZFS zvol formatted as XFS to work around io_uring/ZFS
#   deadlocks. The root cause is now handled by graceful shutdown via admin
#   RPC (agave-validator exit --force), so the zvol/XFS layer is unnecessary.
#
# What this does:
#   1. Asserts the validator is scaled to 0 (does NOT scale it — that's
#      the operator's job via biscayne-stop.yml)
#   2. Creates a child ZFS dataset biscayne/DATA/srv/kind/solana
#   3. Copies data from the zvol to the new dataset (rsync)
#   4. Updates fstab (removes zvol line, fixes tmpfs dependency)
#   5. Destroys the zvol after verification
#
# Prerequisites:
#   - Validator MUST be stopped (scale 0, no agave processes)
#   - Run biscayne-stop.yml first
#
# Usage:
#   ansible-playbook -i inventory/ playbooks/biscayne-migrate-storage.yml
#
# After migration, run biscayne-prepare-agave.yml to update its checks,
# then biscayne-start.yml to bring the validator back up.
#
- name: Migrate storage from zvol/XFS to ZFS dataset
  hosts: all
  gather_facts: false
  become: true
  environment:
    KUBECONFIG: /home/rix/.kube/config
  vars:
    kind_cluster: laconic-70ce4c4b47e23b85
    k8s_namespace: "laconic-{{ kind_cluster }}"
    deployment_name: "{{ kind_cluster }}-deployment"
    zvol_device: /dev/zvol/biscayne/DATA/volumes/solana
    zvol_dataset: biscayne/DATA/volumes/solana
    new_dataset: biscayne/DATA/srv/kind/solana
    kind_solana_dir: /srv/kind/solana
    ramdisk_mount: /srv/kind/solana/ramdisk
    ramdisk_size: 1024G
    # Temporary mount for zvol during data copy
    zvol_tmp_mount: /mnt/zvol-migration-tmp

  tasks:
    # ---- preconditions --------------------------------------------------------
    - name: Check deployment replica count
      ansible.builtin.command: >
        kubectl get deployment {{ deployment_name }}
        -n {{ k8s_namespace }}
        -o jsonpath='{.spec.replicas}'
      register: current_replicas
      failed_when: false
      changed_when: false

    - name: Fail if validator is running
      ansible.builtin.fail:
        msg: >-
          Validator must be scaled to 0 before migration.
          Current replicas: {{ current_replicas.stdout | default('unknown') }}.
          Run biscayne-stop.yml first.
      when: current_replicas.stdout | default('0') | int > 0

    - name: Verify no agave processes in kind node
      ansible.builtin.command: >
        docker exec {{ kind_cluster }}-control-plane
        pgrep -c agave-validator
      register: agave_procs
      failed_when: false
      changed_when: false

    - name: Fail if agave still running
      ansible.builtin.fail:
        msg: >-
          agave-validator process still running inside kind node.
          Cannot migrate while validator is active.
      when: agave_procs.rc == 0

    # ---- check current state --------------------------------------------------
    - name: Check if zvol device exists
      ansible.builtin.stat:
        path: "{{ zvol_device }}"
      register: zvol_exists

    - name: Check if ZFS dataset already exists
      ansible.builtin.command: zfs list -H -o name {{ new_dataset }}
      register: dataset_exists
      failed_when: false
      changed_when: false

    - name: Check current mount type at {{ kind_solana_dir }}
      ansible.builtin.shell:
        cmd: set -o pipefail && findmnt -n -o FSTYPE {{ kind_solana_dir }}
        executable: /bin/bash
      register: current_fstype
      failed_when: false
      changed_when: false

    - name: Report current state
      ansible.builtin.debug:
        msg:
          zvol_exists: "{{ zvol_exists.stat.exists | default(false) }}"
          dataset_exists: "{{ dataset_exists.rc == 0 }}"
          current_fstype: "{{ current_fstype.stdout | default('none') }}"

    # ---- skip if already migrated ---------------------------------------------
    - name: End play if already on ZFS dataset
      ansible.builtin.meta: end_play
      when:
        - dataset_exists.rc == 0
        - current_fstype.stdout | default('') == 'zfs'
        - not (zvol_exists.stat.exists | default(false))

    # ---- step 1: unmount ramdisk and zvol ------------------------------------
    - name: Unmount ramdisk
      ansible.posix.mount:
        path: "{{ ramdisk_mount }}"
        state: unmounted

    - name: Unmount zvol from {{ kind_solana_dir }}
      ansible.posix.mount:
        path: "{{ kind_solana_dir }}"
        state: unmounted
      when: current_fstype.stdout | default('') == 'xfs'

    # ---- step 2: create ZFS dataset -----------------------------------------
    - name: Create ZFS dataset {{ new_dataset }}
      ansible.builtin.command: >
        zfs create -o mountpoint={{ kind_solana_dir }} {{ new_dataset }}
      changed_when: true
      when: dataset_exists.rc != 0

    - name: Mount ZFS dataset if it already existed
      ansible.builtin.command: zfs mount {{ new_dataset }}
      changed_when: true
      failed_when: false
      when: dataset_exists.rc == 0

    - name: Verify ZFS dataset is mounted
      ansible.builtin.shell:
        cmd: set -o pipefail && findmnt -n -o FSTYPE {{ kind_solana_dir }} | grep -q zfs
        executable: /bin/bash
      changed_when: false

    # ---- step 3: copy data from zvol ----------------------------------------
    - name: Create temporary mount point for zvol
      ansible.builtin.file:
        path: "{{ zvol_tmp_mount }}"
        state: directory
        mode: "0755"
      when: zvol_exists.stat.exists | default(false)

    - name: Mount zvol at temporary location
      ansible.posix.mount:
        path: "{{ zvol_tmp_mount }}"
        src: "{{ zvol_device }}"
        fstype: xfs
        state: mounted
      when: zvol_exists.stat.exists | default(false)

    - name: Copy data from zvol to ZFS dataset  # noqa: command-instead-of-module
      ansible.builtin.command: >
        rsync -a --info=progress2
        --exclude='ramdisk/'
        {{ zvol_tmp_mount }}/
        {{ kind_solana_dir }}/
      changed_when: true
      when: zvol_exists.stat.exists | default(false)

    # ---- step 4: verify data integrity --------------------------------------
    - name: Check key directories exist on new dataset
      ansible.builtin.stat:
        path: "{{ kind_solana_dir }}/{{ item }}"
      register: dir_checks
      loop:
        - ledger
        - snapshots
        - log

    - name: Report directory verification
      ansible.builtin.debug:
        msg: "{{ item.item }}: {{ 'exists' if item.stat.exists else 'MISSING' }}"
      loop: "{{ dir_checks.results }}"
      loop_control:
        label: "{{ item.item }}"

    # ---- step 5: update fstab ------------------------------------------------
    - name: Remove zvol fstab entry
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\S+zvol\S+\s+{{ kind_solana_dir }}\s'
        state: absent
      register: fstab_zvol_removed

    # Also match any XFS entry for kind_solana_dir (non-zvol form)
    - name: Remove any XFS fstab entry for {{ kind_solana_dir }}
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\S+\s+{{ kind_solana_dir }}\s+xfs'
        state: absent

    # ZFS datasets are mounted by zfs-mount.service automatically.
    # The tmpfs ramdisk depends on the solana dir existing, which ZFS
    # guarantees via zfs-mount.service. Update the systemd dependency.
    - name: Update tmpfs ramdisk fstab entry
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\S+\s+{{ ramdisk_mount }}\s'
        line: "tmpfs {{ ramdisk_mount }} tmpfs nodev,nosuid,noexec,nodiratime,size={{ ramdisk_size }},nofail,x-systemd.requires=zfs-mount.service 0 0"

    - name: Reload systemd  # noqa: no-handler
      ansible.builtin.systemd:
        daemon_reload: true
      when: fstab_zvol_removed.changed

    # ---- step 6: mount ramdisk -----------------------------------------------
    - name: Mount tmpfs ramdisk
      ansible.posix.mount:
        path: "{{ ramdisk_mount }}"
        src: tmpfs
        fstype: tmpfs
        opts: "nodev,nosuid,noexec,nodiratime,size={{ ramdisk_size }}"
        state: mounted

    - name: Ensure accounts directory
      ansible.builtin.file:
        path: "{{ ramdisk_mount }}/accounts"
        state: directory
        owner: solana
        group: solana
        mode: "0755"

    # ---- step 7: clean up zvol -----------------------------------------------
    - name: Unmount zvol from temporary location
      ansible.posix.mount:
        path: "{{ zvol_tmp_mount }}"
        state: unmounted
      when: zvol_exists.stat.exists | default(false)

    - name: Remove temporary mount point
      ansible.builtin.file:
        path: "{{ zvol_tmp_mount }}"
        state: absent

    - name: Destroy zvol {{ zvol_dataset }}
      ansible.builtin.command: zfs destroy {{ zvol_dataset }}
      changed_when: true
      when: zvol_exists.stat.exists | default(false)

    # ---- step 8: ensure shared propagation for docker ------------------------
    - name: Ensure shared propagation on kind mounts  # noqa: command-instead-of-module
      ansible.builtin.command:
        cmd: mount --make-shared {{ item }}
      loop:
        - "{{ kind_solana_dir }}"
        - "{{ ramdisk_mount }}"
      changed_when: false

    # ---- verification ---------------------------------------------------------
    - name: Verify solana dir is ZFS
      ansible.builtin.shell:
        cmd: set -o pipefail && df -T {{ kind_solana_dir }} | grep -q zfs
        executable: /bin/bash
      changed_when: false

    - name: Verify ramdisk is tmpfs
      ansible.builtin.shell:
        cmd: set -o pipefail && df -T {{ ramdisk_mount }} | grep -q tmpfs
        executable: /bin/bash
      changed_when: false

    - name: Verify zvol is destroyed
      ansible.builtin.command: zfs list -H -o name {{ zvol_dataset }}
      register: zvol_gone
      failed_when: zvol_gone.rc == 0
      changed_when: false

    - name: Migration complete
      ansible.builtin.debug:
        msg: >-
          Storage migration complete.
          {{ kind_solana_dir }} is now a ZFS dataset ({{ new_dataset }}).
          Ramdisk at {{ ramdisk_mount }} (tmpfs, {{ ramdisk_size }}).
          zvol {{ zvol_dataset }} destroyed.
          Next: update biscayne-prepare-agave.yml, then start the validator.

@@ -10,7 +10,8 @@
 # 2. Wait for pods to terminate (io_uring safety check)
 # 3. Wipe accounts ramdisk
 # 4. Clean old snapshots
-# 5. Scale to 1 — container entrypoint downloads snapshot + starts validator
+# 5. Ensure terminationGracePeriodSeconds is 300 (for graceful shutdown)
+# 6. Scale to 1 — container entrypoint downloads snapshot + starts validator
 #
 # The playbook exits after step 5. The container handles snapshot download
 # (60+ min) and validator startup autonomously. Monitor with:
@@ -95,7 +96,18 @@
       become: true
       changed_when: true

-    # ---- step 5: scale to 1 — entrypoint handles snapshot download ------------
+    # ---- step 5: ensure terminationGracePeriodSeconds -------------------------
+    # laconic-so doesn't support this declaratively. Patch the deployment so
+    # k8s gives the entrypoint 300s to perform graceful shutdown via admin RPC.
+    - name: Ensure terminationGracePeriodSeconds is 300
+      ansible.builtin.command: >
+        kubectl patch deployment {{ deployment_name }}
+        -n {{ k8s_namespace }}
+        -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
+      register: patch_result
+      changed_when: "'no change' not in patch_result.stdout"
+
+    # ---- step 6: scale to 1 — entrypoint handles snapshot download ------------
     # The container's entrypoint.py checks snapshot freshness, cleans stale
     # snapshots, downloads fresh ones (with rolling incremental convergence),
     # then starts the validator. No host-side download needed.

@@ -5,11 +5,12 @@
 # This MUST be done before any kind node restart, host reboot,
 # or docker operations.
 #
-# The agave validator uses io_uring for async I/O. On ZFS, killing
-# the process ungracefully (SIGKILL, docker kill, etc.) can produce
-# unkillable kernel threads stuck in io_wq_put_and_exit, deadlocking
-# the container's PID namespace. A graceful SIGTERM via k8s scale-down
-# allows agave to flush and close its io_uring contexts cleanly.
+# The container entrypoint (PID 1) traps SIGTERM and runs
+# ``agave-validator exit --force --ledger /data/ledger`` which tells
+# the validator to flush I/O and exit cleanly via the admin RPC Unix
+# socket. This avoids the io_uring/ZFS deadlock that occurs when the
+# process is killed. terminationGracePeriodSeconds must be set to 300
+# on the k8s deployment to allow time for the flush.
 #
 # Usage:
 #   # Stop the validator
@@ -42,6 +43,17 @@
       failed_when: false
       changed_when: false

+    # Ensure k8s gives the entrypoint enough time for graceful shutdown
+    # via admin RPC before sending SIGKILL.
+    - name: Ensure terminationGracePeriodSeconds is 300
+      ansible.builtin.command: >
+        kubectl patch deployment {{ deployment_name }}
+        -n {{ k8s_namespace }}
+        -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
+      register: patch_result
+      changed_when: "'no change' not in patch_result.stdout"
+      when: current_replicas.stdout | default('0') | int > 0
+
     - name: Scale deployment to 0
       ansible.builtin.command: >
         kubectl scale deployment {{ deployment_name }}

@@ -15,6 +15,10 @@
 #   ansible-playbook -i inventory/biscayne.yml playbooks/biscayne-sync-tools.yml \
 #     -e laconic_so_branch=fix/kind-mount-propagation
 #
+#   # Sync and rebuild the agave container image
+#   ansible-playbook -i inventory/biscayne.yml playbooks/biscayne-sync-tools.yml \
+#     --tags build-container
+#
 - name: Sync laconic-so and agave-stack
   hosts: all
   gather_facts: false
@@ -30,49 +34,55 @@
    stack_branch: main

  tasks:
    # Git operations run as the connecting user (no become) so that
    # SSH agent forwarding works. sudo drops SSH_AUTH_SOCK.
    - name: Update laconic-so (editable install)
      become: false
      ansible.builtin.shell: |
        set -e
        cd {{ laconic_so_repo }}
        git fetch origin {{ laconic_so_branch }}
        git reset --hard origin/{{ laconic_so_branch }}
      vars:
        ansible_become: false
      register: laconic_so_update
      changed_when: true
      tags: [sync, build-container]

    - name: Show laconic-so version
      become: false
      ansible.builtin.shell:
        cmd: set -o pipefail && cd {{ laconic_so_repo }} && git log --oneline -1
        executable: /bin/bash
      register: laconic_so_version
      changed_when: false
      tags: [sync, build-container]

    - name: Report laconic-so
      ansible.builtin.debug:
        msg: "laconic-so: {{ laconic_so_version.stdout }}"
      tags: [sync, build-container]

    - name: Pull agave-stack repo
      become: false
      ansible.builtin.shell: |
        set -e
        cd {{ stack_repo }}
        git fetch origin {{ stack_branch }}
        git reset --hard origin/{{ stack_branch }}
      vars:
        ansible_become: false
      register: stack_update
      changed_when: true
      tags: [sync, build-container]

    - name: Show agave-stack version
      become: false
      ansible.builtin.shell:
        cmd: set -o pipefail && cd {{ stack_repo }} && git log --oneline -1
        executable: /bin/bash
      register: stack_version
      changed_when: false
      tags: [sync, build-container]

    - name: Report agave-stack
      ansible.builtin.debug:
        msg: "agave-stack: {{ stack_version.stdout }}"
      tags: [sync, build-container]

    - name: Regenerate deployment config from updated stack
      ansible.builtin.command: >
@@ -84,6 +94,7 @@
        --update
      register: regen_result
      changed_when: true
      tags: [sync, build-container]

    - name: Report sync complete
      ansible.builtin.debug:

@@ -91,3 +102,27 @@
          Sync complete. laconic-so and agave-stack updated to
          origin/{{ laconic_so_branch }}. Deployment config regenerated.
          Restart or redeploy required to apply changes.
      tags: [sync, build-container]

    # ---- optional: rebuild container image --------------------------------------
    # Only runs when explicitly requested with --tags build-container.
    # Safe to run while the validator is running — just builds a new image.
    # The running pod keeps the old image until restarted.
    - name: Build agave container image
      ansible.builtin.command: >
        {{ laconic_so }}
        --stack {{ stack_path }}
        build-containers
        --include laconicnetwork-agave
      tags:
        - build-container
        - never
      register: build_result
      changed_when: true

    - name: Report build complete
      ansible.builtin.debug:
        msg: "Container image built. Will be used on next pod restart."
      tags:
        - build-container
        - never
|
|
|||
|
|
@ -0,0 +1,158 @@
|
|||
---
|
||||
# Upgrade ZFS from 2.2.2 to 2.2.9 via arter97's zfs-lts PPA
|
||||
#
|
||||
# Fixes the io_uring deadlock (OpenZFS PR #17298) at the kernel module level.
|
||||
# After this upgrade, the zvol/XFS workaround is unnecessary and can be removed
|
||||
# with biscayne-migrate-storage.yml.
|
||||
#
|
||||
# PPA: ppa:arter97/zfs-lts (Juhyung Park, OpenZFS contributor)
|
||||
# Builds from source on Launchpad — transparent, auditable.
|
||||
#
|
||||
# WARNING: This playbook triggers a reboot at the end. If the io_uring zombie
|
||||
# is present, the reboot WILL HANG. The operator must hard power cycle the
|
||||
# machine (IPMI/iDRAC/physical). The playbook does not wait for the reboot —
|
||||
# run the verify tag separately after the machine comes back.
|
||||
#
|
||||
# Usage:
|
||||
# # Full upgrade (adds PPA, upgrades, reboots)
|
||||
# ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml
|
||||
#
|
||||
# # Verify after reboot
|
||||
# ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml \
|
||||
# --tags verify
|
||||
#
|
||||
# # Dry run — show what would be upgraded
|
||||
# ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml \
|
||||
# --tags dry-run
|
||||
#
|
||||
- name: Upgrade ZFS via arter97/zfs-lts PPA
|
||||
hosts: all
|
||||
gather_facts: true
|
||||
become: true
|
||||
vars:
|
||||
zfs_min_version: "2.2.8"
|
||||
ppa_name: "ppa:arter97/zfs-lts"
|
||||
zfs_packages:
|
||||
- zfsutils-linux
|
||||
- zfs-dkms
|
||||
- libzfs5linux
|
||||
|
||||
tasks:
|
||||
# ---- pre-flight checks ----------------------------------------------------
|
||||
- name: Get current ZFS version
|
||||
ansible.builtin.command: modinfo -F version zfs
|
||||
register: zfs_current_version
|
||||
changed_when: false
|
||||
tags: [always]
|
||||
|
||||
- name: Report current ZFS version
|
||||
ansible.builtin.debug:
|
||||
msg: "Current ZFS: {{ zfs_current_version.stdout }}"
|
||||
tags: [always]
|
||||
|
||||
- name: Skip if already upgraded
|
||||
ansible.builtin.meta: end_play
|
||||
when: zfs_current_version.stdout is version(zfs_min_version, '>=')
|
||||
tags: [always]
|
||||
|
||||
# ---- dry run ---------------------------------------------------------------
|
||||
- name: Show available ZFS packages from PPA (dry run)
|
||||
ansible.builtin.shell:
|
||||
cmd: >
|
||||
set -o pipefail &&
|
||||
apt-cache policy zfsutils-linux zfs-dkms 2>/dev/null |
|
||||
grep -A2 'zfsutils-linux\|zfs-dkms'
|
||||
executable: /bin/bash
|
||||
changed_when: false
|
||||
failed_when: false
|
||||
tags:
|
||||
- dry-run
|
||||
- never
|
||||
|
||||
# ---- add PPA ---------------------------------------------------------------
|
||||
- name: Add arter97/zfs-lts PPA
|
||||
ansible.builtin.apt_repository:
|
||||
repo: "{{ ppa_name }}"
|
||||
state: present
|
||||
update_cache: true
|
||||
tags: [upgrade]
|
||||
|
||||
# ---- upgrade ZFS packages --------------------------------------------------
|
||||
- name: Upgrade ZFS packages
|
||||
ansible.builtin.apt:
|
||||
name: "{{ zfs_packages }}"
|
||||
state: latest # noqa: package-latest
|
||||
update_cache: true
|
||||
      register: zfs_upgrade
      tags: [upgrade]

    - name: Show upgrade result
      ansible.builtin.debug:
        msg: "{{ zfs_upgrade.stdout_lines | default(['no output']) }}"
      tags: [upgrade]

    # ---- reboot ----------------------------------------------------------------

    - name: Report pre-reboot status
      ansible.builtin.debug:
        msg: >-
          ZFS packages upgraded. Rebooting now.
          If the io_uring zombie is present, this reboot WILL HANG.
          Hard power cycle the machine, then run this playbook with
          --tags verify to confirm the upgrade.
      tags: [upgrade]

    - name: Reboot to load new ZFS modules
      ansible.builtin.reboot:
        msg: "ZFS upgrade — rebooting to load new kernel modules"
        reboot_timeout: 600
      tags: [upgrade]
      # This will timeout if io_uring zombie blocks shutdown.
      # Operator must hard power cycle. That's expected.

    # ---- post-reboot verification -----------------------------------------------

    - name: Get ZFS version after reboot
      ansible.builtin.command: modinfo -F version zfs
      register: zfs_new_version
      changed_when: false
      tags:
        - verify
        - never

    - name: Verify ZFS version meets minimum
      ansible.builtin.assert:
        that:
          - zfs_new_version.stdout is version(zfs_min_version, '>=')
        fail_msg: >-
          ZFS version {{ zfs_new_version.stdout }} is below minimum
          {{ zfs_min_version }}. Upgrade may have failed.
        success_msg: "ZFS {{ zfs_new_version.stdout }} — io_uring fix confirmed."
      tags:
        - verify
        - never

    - name: Verify ZFS pools are healthy
      ansible.builtin.command: zpool status -x
      register: zpool_status
      changed_when: false
      failed_when: "'all pools are healthy' not in zpool_status.stdout"
      tags:
        - verify
        - never

    - name: Verify ZFS datasets are mounted
      ansible.builtin.command: zfs mount
      register: zfs_mounts
      changed_when: false
      tags:
        - verify
        - never

    - name: Report verification
      ansible.builtin.debug:
        msg:
          zfs_version: "{{ zfs_new_version.stdout }}"
          pools: "{{ zpool_status.stdout }}"
          mounts: "{{ zfs_mounts.stdout_lines }}"
      tags:
        - verify
        - never
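The verify tasks above gate on `modinfo -F version zfs` being at least `zfs_min_version`. A minimal Python sketch of the same comparison, assuming a minimum of 2.2.9 (taken from the commit message, not from the playbook's vars), and that `modinfo` output may carry a package suffix like `2.2.9-1`:

```python
import re


def version_tuple(v: str) -> tuple[int, ...]:
    """Parse '2.2.9-1ubuntu1'-style strings into comparable numeric fields."""
    return tuple(int(p) for p in re.findall(r"\d+", v.split("-")[0]))


def zfs_meets_minimum(current: str, minimum: str = "2.2.9") -> bool:
    """True when the loaded ZFS module is at or above the io_uring-fix release."""
    return version_tuple(current) >= version_tuple(minimum)


# e.g. `modinfo -F version zfs` output on the upgraded host vs. the old one
print(zfs_meets_minimum("2.2.9-1"))         # upgraded module
print(zfs_meets_minimum("2.2.2-0ubuntu9"))  # deadlock-prone module
```

Tuple comparison handles multi-digit components correctly (2.2.10 > 2.2.9), which naive string comparison would get wrong.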
@@ -2,12 +2,17 @@
 """Agave validator entrypoint — snapshot management, arg construction, liveness probe.
 
 Two subcommands:
-  entrypoint.py serve (default) — snapshot freshness check + exec agave-validator
+  entrypoint.py serve (default) — snapshot freshness check + run agave-validator
   entrypoint.py probe — liveness probe (slot lag check, exits 0/1)
 
 Replaces the bash entrypoint.sh / start-rpc.sh / start-validator.sh with a single
 Python module. Test mode still dispatches to start-test.sh.
 
+Python stays as PID 1 and traps SIGTERM. On SIGTERM, it runs
+``agave-validator exit --force --ledger /data/ledger`` which connects to the
+admin RPC Unix socket and tells the validator to flush I/O and exit cleanly.
+This avoids the io_uring/ZFS deadlock that occurs when the process is killed.
+
 All configuration comes from environment variables — same vars as the original
 bash scripts. See compose files for defaults.
 """
@@ -18,8 +23,10 @@ import json
 import logging
 import os
 import re
+import signal
 import subprocess
 import sys
+import threading
 import time
 import urllib.error
 import urllib.request
@@ -365,11 +372,77 @@ def append_extra_args(args: list[str]) -> list[str]:
     return args
 
 
+# -- Graceful shutdown --------------------------------------------------------
+
+# Timeout for graceful exit via admin RPC. Leave 30s margin for k8s
+# terminationGracePeriodSeconds (300s).
+GRACEFUL_EXIT_TIMEOUT = 270
+
+
+def graceful_exit(child: subprocess.Popen[bytes]) -> None:
+    """Request graceful shutdown via the admin RPC Unix socket.
+
+    Runs ``agave-validator exit --force --ledger /data/ledger`` which connects
+    to the admin RPC socket at ``/data/ledger/admin.rpc`` and sets the
+    validator's exit flag. The validator flushes all I/O and exits cleanly,
+    avoiding the io_uring/ZFS deadlock.
+
+    If the admin RPC exit fails or the child doesn't exit within the timeout,
+    falls back to SIGTERM then SIGKILL.
+    """
+    log.info("SIGTERM received — requesting graceful exit via admin RPC")
+    try:
+        result = subprocess.run(
+            ["agave-validator", "exit", "--force", "--ledger", LEDGER_DIR],
+            capture_output=True, text=True, timeout=30,
+        )
+        if result.returncode == 0:
+            log.info("Admin RPC exit requested successfully")
+        else:
+            log.warning(
+                "Admin RPC exit returned %d: %s",
+                result.returncode, result.stderr.strip(),
+            )
+    except subprocess.TimeoutExpired:
+        log.warning("Admin RPC exit command timed out after 30s")
+    except FileNotFoundError:
+        log.warning("agave-validator binary not found for exit command")
+
+    # Wait for child to exit
+    try:
+        child.wait(timeout=GRACEFUL_EXIT_TIMEOUT)
+        log.info("Validator exited cleanly with code %d", child.returncode)
+        return
+    except subprocess.TimeoutExpired:
+        log.warning(
+            "Validator did not exit within %ds — sending SIGTERM",
+            GRACEFUL_EXIT_TIMEOUT,
+        )
+
+    # Fallback: SIGTERM
+    child.terminate()
+    try:
+        child.wait(timeout=15)
+        log.info("Validator exited after SIGTERM with code %d", child.returncode)
+        return
+    except subprocess.TimeoutExpired:
+        log.warning("Validator did not exit after SIGTERM — sending SIGKILL")
+
+    # Last resort: SIGKILL
+    child.kill()
+    child.wait()
+    log.info("Validator killed with SIGKILL, code %d", child.returncode)
+
+
 # -- Serve subcommand ---------------------------------------------------------
 
 
 def cmd_serve() -> None:
-    """Main serve flow: snapshot check, setup, exec agave-validator."""
+    """Main serve flow: snapshot check, setup, run agave-validator as child.
+
+    Python stays as PID 1 and traps SIGTERM to perform graceful shutdown
+    via the admin RPC Unix socket.
+    """
     mode = env("AGAVE_MODE", "test")
     log.info("AGAVE_MODE=%s", mode)
 
@@ -407,7 +480,21 @@ def cmd_serve() -> None:
     Path("/tmp/entrypoint-start").write_text(str(time.time()))
 
     log.info("Starting agave-validator with %d arguments", len(args))
-    os.execvp("agave-validator", ["agave-validator"] + args)
+    child = subprocess.Popen(["agave-validator"] + args)
+
+    # Forward SIGUSR1 to child (log rotation)
+    signal.signal(signal.SIGUSR1, lambda _sig, _frame: child.send_signal(signal.SIGUSR1))
+
+    # Trap SIGTERM — run graceful_exit in a thread so the signal handler returns
+    # immediately and child.wait() in the main thread can observe the exit.
+    def _on_sigterm(_sig: int, _frame: object) -> None:
+        threading.Thread(target=graceful_exit, args=(child,), daemon=True).start()
+
+    signal.signal(signal.SIGTERM, _on_sigterm)
+
+    # Wait for child — if it exits on its own (crash, normal exit), propagate code
+    child.wait()
+    sys.exit(child.returncode)
 
 
 # -- Probe subcommand ---------------------------------------------------------
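The supervisor pattern in the entrypoint diff (parent stays PID 1, asks the child to exit, then escalates) can be exercised stand-alone. This sketch substitutes a sleeping Python child for agave-validator and short timeouts for the real 270 s budget; the "request exit" step, which in the real code is the admin RPC call, is elided:

```python
import subprocess
import sys


def shutdown(child: subprocess.Popen, graceful_timeout: float = 1.0) -> int:
    """Escalating shutdown: wait for a clean exit, then terminate(), then kill()."""
    # Step 1 (elided): a real supervisor would request exit via admin RPC here.
    # Step 2: give the child a bounded window to exit on its own.
    try:
        return child.wait(timeout=graceful_timeout)
    except subprocess.TimeoutExpired:
        pass
    # Step 3: SIGTERM, then SIGKILL as last resort — always reap the child.
    child.terminate()
    try:
        return child.wait(timeout=1.0)
    except subprocess.TimeoutExpired:
        child.kill()
        return child.wait()


# Stand-in child that ignores nothing and just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
code = shutdown(child)
print("child reaped, returncode:", code)
```

The key design point mirrors the diff: every path ends in `wait()`, so PID 1 always reaps the child and can propagate a meaningful exit code.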
@@ -655,8 +655,9 @@ def download_best_snapshot(
         log.info("Downloading incremental %s (%d mirrors, slot %d, gap %d slots)",
                  inc_fn, len(inc_mirrors), inc_slot, gap)
         if not download_aria2c(inc_mirrors, output_dir, inc_fn, connections):
-            log.error("Failed to download incremental %s", inc_fn)
-            break
+            log.warning("Failed to download incremental %s — re-probing in 10s", inc_fn)
+            time.sleep(10)
+            continue
 
         prev_inc_filename = inc_fn
 
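The snapshot_download.py fix replaces `break` (give up on the first failed incremental) with `continue` (re-probe on the next iteration). The control-flow difference can be shown in isolation; `flaky_download` is a hypothetical stand-in for `download_aria2c` that fails twice before succeeding, and the 10 s sleep is omitted:

```python
attempts = {"n": 0}


def flaky_download(filename: str) -> bool:
    """Stand-in downloader: fails twice, then succeeds."""
    attempts["n"] += 1
    return attempts["n"] >= 3


downloaded = None
for _ in range(10):  # bounded retries; a bare `break` would stop after one failure
    if not flaky_download("incremental-snapshot.tar.zst"):
        continue  # re-probe on the next iteration (real code sleeps 10s first)
    downloaded = "incremental-snapshot.tar.zst"
    break

print(downloaded, "after", attempts["n"], "attempts")
```

With the old `break`, transient mirror failures permanently abandoned the incremental; with `continue`, the loop re-probes and recovers.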
@@ -18,6 +18,7 @@ from __future__ import annotations
 
 import argparse
 import json
 import os
+import subprocess
 import sys
 import time
@@ -206,9 +207,11 @@ def display(iteration: int = 0) -> None:
     snapshots = check_snapshots()
     ramdisk = check_ramdisk()
 
-    print(f"\n{'=' * 60}")
-    print(f" Biscayne Agave Status — {ts}")
-    print(f"{'=' * 60}")
+    # Clear screen and home cursor for clean redraw in watch mode
+    if iteration > 0:
+        print("\033[2J\033[H", end="")
+
+    print(f"\n Biscayne Agave Status — {ts}\n")
 
     # Pod
     print(f"\n Pod: {pod['phase']}")
@@ -275,14 +278,30 @@
 # -- Main ---------------------------------------------------------------------
 
 
+def spawn_tmux_pane(interval: int) -> None:
+    """Launch this script with --watch in a new tmux pane."""
+    script = os.path.abspath(__file__)
+    cmd = f"python3 {script} --watch -i {interval}"
+    subprocess.run(
+        ["tmux", "split-window", "-h", "-d", cmd],
+        check=True,
+    )
+
+
 def main() -> int:
     p = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
     p.add_argument("--watch", action="store_true", help="Repeat every interval")
+    p.add_argument("--pane", action="store_true",
+                   help="Launch --watch in a new tmux pane")
     p.add_argument("-i", "--interval", type=int, default=30,
                    help="Watch interval in seconds (default: 30)")
     args = p.parse_args()
 
+    if args.pane:
+        spawn_tmux_pane(args.interval)
+        return 0
+
     discover()
 
     try:
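The watch-mode redraw added to check-status.py relies on two ANSI escapes: `ESC[2J` clears the screen and `ESC[H` homes the cursor, so each refresh repaints in place instead of scrolling. Skipping the clear on the first iteration preserves whatever shell output preceded the run. A small sketch of that logic (the `render` helper is illustrative, not part of the script):

```python
def render(iteration: int, status_line: str) -> str:
    """Build one frame: prepend clear+home escapes only on redraws."""
    prefix = "\033[2J\033[H" if iteration > 0 else ""
    return f"{prefix}\n Biscayne Agave Status — {status_line}\n"


print(repr(render(0, "pod Running")))  # first draw: no escape sequence
print(repr(render(3, "pod Running")))  # redraw: clears and homes first
```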