feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build
- entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit via admin RPC (agave-validator exit --force) before falling back to signals
- snapshot_download.py: fix break-on-failure bug in incremental download loop (continue + re-probe instead of giving up)
- biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts PPA to fix io_uring deadlock at kernel module level
- biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix)
- biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before scaling to 0, updated docs for admin RPC shutdown
- biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become), add --tags build-container support, add set -e to shell blocks
- biscayne-recover.yml: updated for graceful shutdown awareness
- check-status.py: add --pane flag for tmux, clean redraw in watch mode
- CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 173b807451
commit b88af2be70

CLAUDE.md
@@ -10,16 +10,16 @@ below it are correct. Playbooks belong to exactly one layer.

| 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
| 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
| 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` → `/mnt` in kind node | `biscayne-redeploy.yml` (deploy tags) |
| 4. Prepare agave | Host storage for agave: ZFS dataset, ramdisk | `biscayne-prepare-agave.yml` |
| 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |
**Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):

- `/srv/kind/solana` is a ZFS dataset (`biscayne/DATA/srv/kind/solana`), child of the `/srv/kind` dataset
- `/srv/kind/solana/ramdisk` is tmpfs (1TB) — accounts must be in RAM
- `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`
These invariants are checked at runtime and persisted to fstab/systemd so they
survive reboot.

**Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml`
(layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).
@@ -30,11 +30,8 @@ not base system concerns.
The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
The kind node is a Docker container. **Never restart or kill the kind node container
while the validator is running.** Use `agave-validator exit --force` via the admin
RPC socket for graceful shutdown, or scale the deployment to 0 and wait.

Correct shutdown sequence:
@@ -61,15 +58,16 @@ The accounts directory must be in RAM for performance. tmpfs is used instead of
`/dev/ram0` — simpler (no format-on-boot service needed), resizable on the fly
with `mount -o remount,size=<new>`, and what most Solana operators use.

**Boot ordering**: `/srv/kind/solana` is a ZFS dataset mounted automatically by
`zfs-mount.service`. The tmpfs ramdisk fstab entry uses
`x-systemd.requires=zfs-mount.service` to ensure the dataset is mounted first.
**No manual intervention after reboot.**
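Concretely, the fstab entry written for the ramdisk looks like this (values taken from the migration playbook's defaults: mount point `/srv/kind/solana/ramdisk`, size 1024G):

```
tmpfs /srv/kind/solana/ramdisk tmpfs nodev,nosuid,noexec,nodiratime,size=1024G,nofail,x-systemd.requires=zfs-mount.service 0 0
```

The `x-systemd.requires` option makes systemd order the tmpfs mount after `zfs-mount.service`, so the tmpfs never mounts over an empty directory before the dataset appears.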
**Mount propagation**: The kind node bind-mounts `/srv/kind` → `/mnt` at container
start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts
(commit `a11d40f2` in stack-orchestrator), so host submounts propagate into the
kind node automatically. A kind restart is required to pick up the new config
after updating laconic-so.
### KUBECONFIG

@@ -92,21 +90,20 @@ Then export it:

export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```
### io_uring/ZFS Deadlock — Historical Note

Agave uses io_uring for async I/O. Killing agave ungracefully while it has
outstanding I/O against ZFS can produce unkillable D-state kernel threads
(`io_wq_put_and_exit` blocked on ZFS transactions), deadlocking the container.

**Prevention**: Use graceful shutdown (`agave-validator exit --force` via admin
RPC, or scale to 0 and wait). The `biscayne-stop.yml` playbook enforces this.
With graceful shutdown, io_uring contexts are closed cleanly and ZFS storage
is safe to use directly (no zvol/XFS workaround needed).

**ZFS fix**: The underlying io_uring bug is fixed in ZFS 2.2.8+ (PR #17298).
Biscayne currently runs ZFS 2.2.2. Upgrading ZFS will eliminate the deadlock
risk entirely, even for ungraceful shutdowns.
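If an ungraceful kill does happen anyway, the stuck io_uring workers show up as D-state (uninterruptible sleep) kernel threads. A quick check with standard `ps` (works on any Linux host; this command is a generic diagnostic, not something from the playbooks):

```shell
# List tasks in uninterruptible sleep (D state). On a healthy host this
# prints nothing or only short-lived entries; a stuck io_uring worker
# stays here and ignores SIGKILL until its ZFS transaction completes.
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ { print $2, $3 }'
```

An empty result after a shutdown means the PID namespace can be reaped and `docker stop`/`reboot` will work normally.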
### laconic-so Architecture

@@ -133,11 +130,11 @@ kind node via a single bind mount.

- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/kind/solana/snapshots` (ZFS dataset, visible to kind at `/mnt/validator-snapshots`)
- Ledger dir: `/srv/kind/solana/ledger` (ZFS dataset, visible to kind at `/mnt/validator-ledger`)
- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (tmpfs ramdisk, visible to kind at `/mnt/validator-accounts`)
- Log dir: `/srv/kind/solana/log` (ZFS dataset, visible to kind at `/mnt/validator-log`)
- **WARNING**: `/srv/solana` is a different ZFS dataset directory. All data paths use `/srv/kind/solana`.
- Host bind mount root: `/srv/kind` -> kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)
biscayne-migrate-storage.yml
@@ -0,0 +1,286 @@
---
# One-time migration: zvol/XFS → ZFS dataset for /srv/kind/solana
#
# Background:
# Biscayne used a ZFS zvol formatted as XFS to work around io_uring/ZFS
# deadlocks. The root cause is now handled by graceful shutdown via admin
# RPC (agave-validator exit --force), so the zvol/XFS layer is unnecessary.
#
# What this does:
#   1. Asserts the validator is scaled to 0 (does NOT scale it — that's
#      the operator's job via biscayne-stop.yml)
#   2. Creates a child ZFS dataset biscayne/DATA/srv/kind/solana
#   3. Copies data from the zvol to the new dataset (rsync)
#   4. Updates fstab (removes zvol line, fixes tmpfs dependency)
#   5. Destroys the zvol after verification
#
# Prerequisites:
#   - Validator MUST be stopped (scale 0, no agave processes)
#   - Run biscayne-stop.yml first
#
# Usage:
#   ansible-playbook -i inventory/ playbooks/biscayne-migrate-storage.yml
#
# After migration, run biscayne-prepare-agave.yml to update its checks,
# then biscayne-start.yml to bring the validator back up.
#
- name: Migrate storage from zvol/XFS to ZFS dataset
  hosts: all
  gather_facts: false
  become: true
  environment:
    KUBECONFIG: /home/rix/.kube/config
  vars:
    kind_cluster: laconic-70ce4c4b47e23b85
    k8s_namespace: "laconic-{{ kind_cluster }}"
    deployment_name: "{{ kind_cluster }}-deployment"
    zvol_device: /dev/zvol/biscayne/DATA/volumes/solana
    zvol_dataset: biscayne/DATA/volumes/solana
    new_dataset: biscayne/DATA/srv/kind/solana
    kind_solana_dir: /srv/kind/solana
    ramdisk_mount: /srv/kind/solana/ramdisk
    ramdisk_size: 1024G
    # Temporary mount for zvol during data copy
    zvol_tmp_mount: /mnt/zvol-migration-tmp

  tasks:
    # ---- preconditions --------------------------------------------------------
    - name: Check deployment replica count
      ansible.builtin.command: >
        kubectl get deployment {{ deployment_name }}
        -n {{ k8s_namespace }}
        -o jsonpath='{.spec.replicas}'
      register: current_replicas
      failed_when: false
      changed_when: false

    - name: Fail if validator is running
      ansible.builtin.fail:
        msg: >-
          Validator must be scaled to 0 before migration.
          Current replicas: {{ current_replicas.stdout | default('unknown') }}.
          Run biscayne-stop.yml first.
      when: current_replicas.stdout | default('0') | int > 0

    - name: Verify no agave processes in kind node
      ansible.builtin.command: >
        docker exec {{ kind_cluster }}-control-plane
        pgrep -c agave-validator
      register: agave_procs
      failed_when: false
      changed_when: false

    - name: Fail if agave still running
      ansible.builtin.fail:
        msg: >-
          agave-validator process still running inside kind node.
          Cannot migrate while validator is active.
      when: agave_procs.rc == 0

    # ---- check current state --------------------------------------------------
    - name: Check if zvol device exists
      ansible.builtin.stat:
        path: "{{ zvol_device }}"
      register: zvol_exists

    - name: Check if ZFS dataset already exists
      ansible.builtin.command: zfs list -H -o name {{ new_dataset }}
      register: dataset_exists
      failed_when: false
      changed_when: false

    - name: Check current mount type at {{ kind_solana_dir }}
      ansible.builtin.shell:
        cmd: set -o pipefail && findmnt -n -o FSTYPE {{ kind_solana_dir }}
        executable: /bin/bash
      register: current_fstype
      failed_when: false
      changed_when: false

    - name: Report current state
      ansible.builtin.debug:
        msg:
          zvol_exists: "{{ zvol_exists.stat.exists | default(false) }}"
          dataset_exists: "{{ dataset_exists.rc == 0 }}"
          current_fstype: "{{ current_fstype.stdout | default('none') }}"

    # ---- skip if already migrated ---------------------------------------------
    - name: End play if already on ZFS dataset
      ansible.builtin.meta: end_play
      when:
        - dataset_exists.rc == 0
        - current_fstype.stdout | default('') == 'zfs'
        - not (zvol_exists.stat.exists | default(false))

    # ---- step 1: unmount ramdisk and zvol ------------------------------------
    - name: Unmount ramdisk
      ansible.posix.mount:
        path: "{{ ramdisk_mount }}"
        state: unmounted

    - name: Unmount zvol from {{ kind_solana_dir }}
      ansible.posix.mount:
        path: "{{ kind_solana_dir }}"
        state: unmounted
      when: current_fstype.stdout | default('') == 'xfs'

    # ---- step 2: create ZFS dataset -----------------------------------------
    - name: Create ZFS dataset {{ new_dataset }}
      ansible.builtin.command: >
        zfs create -o mountpoint={{ kind_solana_dir }} {{ new_dataset }}
      changed_when: true
      when: dataset_exists.rc != 0

    - name: Mount ZFS dataset if it already existed
      ansible.builtin.command: zfs mount {{ new_dataset }}
      changed_when: true
      failed_when: false
      when: dataset_exists.rc == 0

    - name: Verify ZFS dataset is mounted
      ansible.builtin.shell:
        cmd: set -o pipefail && findmnt -n -o FSTYPE {{ kind_solana_dir }} | grep -q zfs
        executable: /bin/bash
      changed_when: false

    # ---- step 3: copy data from zvol ----------------------------------------
    - name: Create temporary mount point for zvol
      ansible.builtin.file:
        path: "{{ zvol_tmp_mount }}"
        state: directory
        mode: "0755"
      when: zvol_exists.stat.exists | default(false)

    - name: Mount zvol at temporary location
      ansible.posix.mount:
        path: "{{ zvol_tmp_mount }}"
        src: "{{ zvol_device }}"
        fstype: xfs
        state: mounted
      when: zvol_exists.stat.exists | default(false)

    - name: Copy data from zvol to ZFS dataset  # noqa: command-instead-of-module
      ansible.builtin.command: >
        rsync -a --info=progress2
        --exclude='ramdisk/'
        {{ zvol_tmp_mount }}/
        {{ kind_solana_dir }}/
      changed_when: true
      when: zvol_exists.stat.exists | default(false)

    # ---- step 4: verify data integrity --------------------------------------
    - name: Check key directories exist on new dataset
      ansible.builtin.stat:
        path: "{{ kind_solana_dir }}/{{ item }}"
      register: dir_checks
      loop:
        - ledger
        - snapshots
        - log

    - name: Report directory verification
      ansible.builtin.debug:
        msg: "{{ item.item }}: {{ 'exists' if item.stat.exists else 'MISSING' }}"
      loop: "{{ dir_checks.results }}"
      loop_control:
        label: "{{ item.item }}"

    # ---- step 5: update fstab ------------------------------------------------
    - name: Remove zvol fstab entry
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\S+zvol\S+\s+{{ kind_solana_dir }}\s'
        state: absent
      register: fstab_zvol_removed

    # Also match any XFS entry for kind_solana_dir (non-zvol form)
    - name: Remove any XFS fstab entry for {{ kind_solana_dir }}
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\S+\s+{{ kind_solana_dir }}\s+xfs'
        state: absent

    # ZFS datasets are mounted by zfs-mount.service automatically.
    # The tmpfs ramdisk depends on the solana dir existing, which ZFS
    # guarantees via zfs-mount.service. Update the systemd dependency.
    - name: Update tmpfs ramdisk fstab entry
      ansible.builtin.lineinfile:
        path: /etc/fstab
        regexp: '^\S+\s+{{ ramdisk_mount }}\s'
        line: "tmpfs {{ ramdisk_mount }} tmpfs nodev,nosuid,noexec,nodiratime,size={{ ramdisk_size }},nofail,x-systemd.requires=zfs-mount.service 0 0"

    - name: Reload systemd  # noqa: no-handler
      ansible.builtin.systemd:
        daemon_reload: true
      when: fstab_zvol_removed.changed

    # ---- step 6: mount ramdisk -----------------------------------------------
    - name: Mount tmpfs ramdisk
      ansible.posix.mount:
        path: "{{ ramdisk_mount }}"
        src: tmpfs
        fstype: tmpfs
        opts: "nodev,nosuid,noexec,nodiratime,size={{ ramdisk_size }}"
        state: mounted

    - name: Ensure accounts directory
      ansible.builtin.file:
        path: "{{ ramdisk_mount }}/accounts"
        state: directory
        owner: solana
        group: solana
        mode: "0755"

    # ---- step 7: clean up zvol -----------------------------------------------
    - name: Unmount zvol from temporary location
      ansible.posix.mount:
        path: "{{ zvol_tmp_mount }}"
        state: unmounted
      when: zvol_exists.stat.exists | default(false)

    - name: Remove temporary mount point
      ansible.builtin.file:
        path: "{{ zvol_tmp_mount }}"
        state: absent

    - name: Destroy zvol {{ zvol_dataset }}
      ansible.builtin.command: zfs destroy {{ zvol_dataset }}
      changed_when: true
      when: zvol_exists.stat.exists | default(false)

    # ---- step 8: ensure shared propagation for docker ------------------------
    - name: Ensure shared propagation on kind mounts  # noqa: command-instead-of-module
      ansible.builtin.command:
        cmd: mount --make-shared {{ item }}
      loop:
        - "{{ kind_solana_dir }}"
        - "{{ ramdisk_mount }}"
      changed_when: false

    # ---- verification ---------------------------------------------------------
    - name: Verify solana dir is ZFS
      ansible.builtin.shell:
        cmd: set -o pipefail && df -T {{ kind_solana_dir }} | grep -q zfs
        executable: /bin/bash
      changed_when: false

    - name: Verify ramdisk is tmpfs
      ansible.builtin.shell:
        cmd: set -o pipefail && df -T {{ ramdisk_mount }} | grep -q tmpfs
        executable: /bin/bash
      changed_when: false

    - name: Verify zvol is destroyed
      ansible.builtin.command: zfs list -H -o name {{ zvol_dataset }}
      register: zvol_gone
      failed_when: zvol_gone.rc == 0
      changed_when: false

    - name: Migration complete
      ansible.builtin.debug:
        msg: >-
          Storage migration complete.
          {{ kind_solana_dir }} is now a ZFS dataset ({{ new_dataset }}).
          Ramdisk at {{ ramdisk_mount }} (tmpfs, {{ ramdisk_size }}).
          zvol {{ zvol_dataset }} destroyed.
          Next: update biscayne-prepare-agave.yml, then start the validator.
@@ -10,7 +10,8 @@
# 2. Wait for pods to terminate (io_uring safety check)
# 3. Wipe accounts ramdisk
# 4. Clean old snapshots
# 5. Ensure terminationGracePeriodSeconds is 300 (for graceful shutdown)
# 6. Scale to 1 — container entrypoint downloads snapshot + starts validator
#
# The playbook exits after step 6. The container handles snapshot download
# (60+ min) and validator startup autonomously. Monitor with:
@@ -95,7 +96,18 @@
      become: true
      changed_when: true

    # ---- step 5: ensure terminationGracePeriodSeconds -------------------------
    # laconic-so doesn't support this declaratively. Patch the deployment so
    # k8s gives the entrypoint 300s to perform graceful shutdown via admin RPC.
    - name: Ensure terminationGracePeriodSeconds is 300
      ansible.builtin.command: >
        kubectl patch deployment {{ deployment_name }}
        -n {{ k8s_namespace }}
        -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
      register: patch_result
      changed_when: "'no change' not in patch_result.stdout"

    # ---- step 6: scale to 1 — entrypoint handles snapshot download ------------
    # The container's entrypoint.py checks snapshot freshness, cleans stale
    # snapshots, downloads fresh ones (with rolling incremental convergence),
    # then starts the validator. No host-side download needed.
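The patch payload is a strategic-merge patch passed as a single quoted JSON argument, which is easy to mangle inside YAML. A quick check that the exact string used above parses and sets the intended field:

```python
import json

# The exact payload passed to `kubectl patch` in the playbooks
patch = '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
doc = json.loads(patch)
print(doc["spec"]["template"]["spec"]["terminationGracePeriodSeconds"])  # prints 300
```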
@@ -5,11 +5,12 @@
# This MUST be done before any kind node restart, host reboot,
# or docker operations.
#
# The container entrypoint (PID 1) traps SIGTERM and runs
# ``agave-validator exit --force --ledger /data/ledger`` which tells
# the validator to flush I/O and exit cleanly via the admin RPC Unix
# socket. This avoids the io_uring/ZFS deadlock that occurs when the
# process is killed. terminationGracePeriodSeconds must be set to 300
# on the k8s deployment to allow time for the flush.
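The trap-then-fallback shape of the entrypoint can be sketched as follows. This is an illustrative sketch, not the actual entrypoint.py: `sleep` stands in for the validator child and `false` stands in for a failing admin-RPC exit command, so the demo exercises the fallback path.

```python
import os
import signal
import subprocess
import time

def make_handler(child, exit_cmd):
    """Build a SIGTERM handler: try the graceful exit command first
    (standing in for `agave-validator exit --force`), and fall back
    to forwarding SIGTERM to the child if that command fails."""
    def handler(signum, frame):
        if subprocess.call(exit_cmd) != 0:
            child.terminate()  # fallback: signal the child directly
    return handler

# Demo: stand-in child and a failing exit command, forcing the fallback.
child = subprocess.Popen(["sleep", "30"])
signal.signal(signal.SIGTERM, make_handler(child, ["false"]))
os.kill(os.getpid(), signal.SIGTERM)   # what k8s sends at scale-to-0
time.sleep(0.2)
print("child exited:", child.poll() is not None)  # prints: child exited: True
```

Because Python stays PID 1 in the container, the SIGTERM from k8s reaches this handler rather than the validator directly.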
#
# Usage:
#   # Stop the validator
@@ -42,6 +43,17 @@
      failed_when: false
      changed_when: false

    # Ensure k8s gives the entrypoint enough time for graceful shutdown
    # via admin RPC before sending SIGKILL.
    - name: Ensure terminationGracePeriodSeconds is 300
      ansible.builtin.command: >
        kubectl patch deployment {{ deployment_name }}
        -n {{ k8s_namespace }}
        -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'
      register: patch_result
      changed_when: "'no change' not in patch_result.stdout"
      when: current_replicas.stdout | default('0') | int > 0

    - name: Scale deployment to 0
      ansible.builtin.command: >
        kubectl scale deployment {{ deployment_name }}
@@ -15,6 +15,10 @@
#   ansible-playbook -i inventory/biscayne.yml playbooks/biscayne-sync-tools.yml \
#     -e laconic_so_branch=fix/kind-mount-propagation
#
#   # Sync and rebuild the agave container image
#   ansible-playbook -i inventory/biscayne.yml playbooks/biscayne-sync-tools.yml \
#     --tags build-container
#
- name: Sync laconic-so and agave-stack
  hosts: all
  gather_facts: false
@@ -30,49 +34,55 @@
    stack_branch: main

  tasks:
    - name: Update laconic-so (editable install)
      ansible.builtin.shell: |
        set -e
        cd {{ laconic_so_repo }}
        git fetch origin {{ laconic_so_branch }}
        git reset --hard origin/{{ laconic_so_branch }}
      vars:
        ansible_become: false
      register: laconic_so_update
      changed_when: true
      tags: [sync, build-container]

    - name: Show laconic-so version
      ansible.builtin.shell:
        cmd: set -o pipefail && cd {{ laconic_so_repo }} && git log --oneline -1
        executable: /bin/bash
      register: laconic_so_version
      changed_when: false
      tags: [sync, build-container]

    - name: Report laconic-so
      ansible.builtin.debug:
        msg: "laconic-so: {{ laconic_so_version.stdout }}"
      tags: [sync, build-container]

    - name: Pull agave-stack repo
      ansible.builtin.shell: |
        set -e
        cd {{ stack_repo }}
        git fetch origin {{ stack_branch }}
        git reset --hard origin/{{ stack_branch }}
      vars:
        ansible_become: false
      register: stack_update
      changed_when: true
      tags: [sync, build-container]

    - name: Show agave-stack version
      ansible.builtin.shell:
        cmd: set -o pipefail && cd {{ stack_repo }} && git log --oneline -1
        executable: /bin/bash
      register: stack_version
      changed_when: false
      tags: [sync, build-container]

    - name: Report agave-stack
      ansible.builtin.debug:
        msg: "agave-stack: {{ stack_version.stdout }}"
      tags: [sync, build-container]

    - name: Regenerate deployment config from updated stack
      ansible.builtin.command: >
@@ -84,6 +94,7 @@
        --update
      register: regen_result
      changed_when: true
      tags: [sync, build-container]

    - name: Report sync complete
      ansible.builtin.debug:
@@ -91,3 +102,27 @@
          Sync complete. laconic-so and agave-stack updated to
          origin/{{ laconic_so_branch }}. Deployment config regenerated.
          Restart or redeploy required to apply changes.
      tags: [sync, build-container]

    # ---- optional: rebuild container image --------------------------------------
    # Only runs when explicitly requested with --tags build-container.
    # Safe to run while the validator is running — just builds a new image.
    # The running pod keeps the old image until restarted.
    - name: Build agave container image
      ansible.builtin.command: >
        {{ laconic_so }}
        --stack {{ stack_path }}
        build-containers
        --include laconicnetwork-agave
      tags:
        - build-container
        - never
      register: build_result
      changed_when: true

    - name: Report build complete
      ansible.builtin.debug:
        msg: "Container image built. Will be used on next pod restart."
      tags:
        - build-container
        - never
playbooks/biscayne-upgrade-zfs.yml (new file)
@@ -0,0 +1,158 @@
---
# Upgrade ZFS from 2.2.2 to 2.2.9 via arter97's zfs-lts PPA
#
# Fixes the io_uring deadlock (OpenZFS PR #17298) at the kernel module level.
# After this upgrade, the zvol/XFS workaround is unnecessary and can be removed
# with biscayne-migrate-storage.yml.
#
# PPA: ppa:arter97/zfs-lts (Juhyung Park, OpenZFS contributor)
# Builds from source on Launchpad — transparent, auditable.
#
# WARNING: This playbook triggers a reboot at the end. If the io_uring zombie
# is present, the reboot WILL HANG. The operator must hard power cycle the
# machine (IPMI/iDRAC/physical). The playbook does not wait for the reboot —
# run the verify tag separately after the machine comes back.
#
# Usage:
#   # Full upgrade (adds PPA, upgrades, reboots)
#   ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml
#
#   # Verify after reboot
#   ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml \
#     --tags verify
#
#   # Dry run — show what would be upgraded
#   ansible-playbook -i inventory/ playbooks/biscayne-upgrade-zfs.yml \
#     --tags dry-run
#
- name: Upgrade ZFS via arter97/zfs-lts PPA
  hosts: all
  gather_facts: true
  become: true
  vars:
    zfs_min_version: "2.2.8"
    ppa_name: "ppa:arter97/zfs-lts"
    zfs_packages:
      - zfsutils-linux
      - zfs-dkms
      - libzfs5linux

  tasks:
    # ---- pre-flight checks ----------------------------------------------------
    - name: Get current ZFS version
      ansible.builtin.command: modinfo -F version zfs
      register: zfs_current_version
      changed_when: false
      tags: [always]

    - name: Report current ZFS version
      ansible.builtin.debug:
        msg: "Current ZFS: {{ zfs_current_version.stdout }}"
      tags: [always]

    - name: Skip if already upgraded
      ansible.builtin.meta: end_play
      when: zfs_current_version.stdout is version(zfs_min_version, '>=')
      tags: [always]

    # ---- dry run ---------------------------------------------------------------
    - name: Show available ZFS packages from PPA (dry run)
      ansible.builtin.shell:
        cmd: >
          set -o pipefail &&
          apt-cache policy zfsutils-linux zfs-dkms 2>/dev/null |
          grep -A2 'zfsutils-linux\|zfs-dkms'
        executable: /bin/bash
      changed_when: false
      failed_when: false
      tags:
        - dry-run
        - never

    # ---- add PPA ---------------------------------------------------------------
    - name: Add arter97/zfs-lts PPA
      ansible.builtin.apt_repository:
        repo: "{{ ppa_name }}"
        state: present
        update_cache: true
      tags: [upgrade]

    # ---- upgrade ZFS packages --------------------------------------------------
    - name: Upgrade ZFS packages
      ansible.builtin.apt:
        name: "{{ zfs_packages }}"
        state: latest  # noqa: package-latest
        update_cache: true
      register: zfs_upgrade
      tags: [upgrade]

    - name: Show upgrade result
      ansible.builtin.debug:
        msg: "{{ zfs_upgrade.stdout_lines | default(['no output']) }}"
      tags: [upgrade]

    # ---- reboot ----------------------------------------------------------------
    - name: Report pre-reboot status
      ansible.builtin.debug:
        msg: >-
          ZFS packages upgraded. Rebooting now.
          If the io_uring zombie is present, this reboot WILL HANG.
          Hard power cycle the machine, then run this playbook with
          --tags verify to confirm the upgrade.
      tags: [upgrade]

    - name: Reboot to load new ZFS modules
      ansible.builtin.reboot:
        msg: "ZFS upgrade — rebooting to load new kernel modules"
        reboot_timeout: 600
      tags: [upgrade]
      # This will timeout if io_uring zombie blocks shutdown.
      # Operator must hard power cycle. That's expected.

    # ---- post-reboot verification -----------------------------------------------
    - name: Get ZFS version after reboot
      ansible.builtin.command: modinfo -F version zfs
      register: zfs_new_version
      changed_when: false
      tags:
        - verify
        - never

    - name: Verify ZFS version meets minimum
      ansible.builtin.assert:
        that:
          - zfs_new_version.stdout is version(zfs_min_version, '>=')
        fail_msg: >-
          ZFS version {{ zfs_new_version.stdout }} is below minimum
          {{ zfs_min_version }}. Upgrade may have failed.
        success_msg: "ZFS {{ zfs_new_version.stdout }} — io_uring fix confirmed."
      tags:
        - verify
        - never

    - name: Verify ZFS pools are healthy
      ansible.builtin.command: zpool status -x
      register: zpool_status
      changed_when: false
      failed_when: "'all pools are healthy' not in zpool_status.stdout"
      tags:
        - verify
        - never

    - name: Verify ZFS datasets are mounted
      ansible.builtin.command: zfs mount
      register: zfs_mounts
      changed_when: false
      tags:
        - verify
        - never

    - name: Report verification
      ansible.builtin.debug:
        msg:
          zfs_version: "{{ zfs_new_version.stdout }}"
          pools: "{{ zpool_status.stdout }}"
          mounts: "{{ zfs_mounts.stdout_lines }}"
      tags:
        - verify
        - never
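The pre-flight check ends the play early when the running module already satisfies `zfs_min_version`, using Jinja2's `version()` test. The same gate can be sketched in plain Python — the `at_least` helper name is illustrative, not part of the playbook, and this tuple compare only covers simple dotted versions, not the full semantics of Ansible's `version()` test:

```python
def at_least(current: str, minimum: str) -> bool:
    """Dotted-version gate, e.g. the playbook's "2.2.9" >= "2.2.8" check."""
    # Split "2.2.9" into (2, 2, 9) so comparison is numeric, not lexicographic
    # ("2.2.10" would wrongly sort before "2.2.9" as strings).
    def parse(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return parse(current) >= parse(minimum)

print(at_least("2.2.9", "2.2.8"))   # upgraded module: skip the play
print(at_least("2.2.2", "2.2.8"))   # stock 24.04 module: proceed with upgrade
```

Running it prints `True` then `False`, matching the playbook's skip/proceed decision for the two versions named in its header.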
entrypoint.py
@@ -2,12 +2,17 @@
 """Agave validator entrypoint — snapshot management, arg construction, liveness probe.

 Two subcommands:
-    entrypoint.py serve (default) — snapshot freshness check + exec agave-validator
+    entrypoint.py serve (default) — snapshot freshness check + run agave-validator
     entrypoint.py probe — liveness probe (slot lag check, exits 0/1)

 Replaces the bash entrypoint.sh / start-rpc.sh / start-validator.sh with a single
 Python module. Test mode still dispatches to start-test.sh.

+Python stays as PID 1 and traps SIGTERM. On SIGTERM, it runs
+``agave-validator exit --force --ledger /data/ledger`` which connects to the
+admin RPC Unix socket and tells the validator to flush I/O and exit cleanly.
+This avoids the io_uring/ZFS deadlock that occurs when the process is killed.
+
 All configuration comes from environment variables — same vars as the original
 bash scripts. See compose files for defaults.
 """
@@ -18,8 +23,10 @@ import json
 import logging
 import os
 import re
+import signal
 import subprocess
 import sys
+import threading
 import time
 import urllib.error
 import urllib.request
@@ -365,11 +372,77 @@ def append_extra_args(args: list[str]) -> list[str]:
     return args


+# -- Graceful shutdown --------------------------------------------------------
+
+# Timeout for graceful exit via admin RPC. Leave 30s margin for k8s
+# terminationGracePeriodSeconds (300s).
+GRACEFUL_EXIT_TIMEOUT = 270
+
+
+def graceful_exit(child: subprocess.Popen[bytes]) -> None:
+    """Request graceful shutdown via the admin RPC Unix socket.
+
+    Runs ``agave-validator exit --force --ledger /data/ledger`` which connects
+    to the admin RPC socket at ``/data/ledger/admin.rpc`` and sets the
+    validator's exit flag. The validator flushes all I/O and exits cleanly,
+    avoiding the io_uring/ZFS deadlock.
+
+    If the admin RPC exit fails or the child doesn't exit within the timeout,
+    falls back to SIGTERM then SIGKILL.
+    """
+    log.info("SIGTERM received — requesting graceful exit via admin RPC")
+    try:
+        result = subprocess.run(
+            ["agave-validator", "exit", "--force", "--ledger", LEDGER_DIR],
+            capture_output=True, text=True, timeout=30,
+        )
+        if result.returncode == 0:
+            log.info("Admin RPC exit requested successfully")
+        else:
+            log.warning(
+                "Admin RPC exit returned %d: %s",
+                result.returncode, result.stderr.strip(),
+            )
+    except subprocess.TimeoutExpired:
+        log.warning("Admin RPC exit command timed out after 30s")
+    except FileNotFoundError:
+        log.warning("agave-validator binary not found for exit command")
+
+    # Wait for child to exit
+    try:
+        child.wait(timeout=GRACEFUL_EXIT_TIMEOUT)
+        log.info("Validator exited cleanly with code %d", child.returncode)
+        return
+    except subprocess.TimeoutExpired:
+        log.warning(
+            "Validator did not exit within %ds — sending SIGTERM",
+            GRACEFUL_EXIT_TIMEOUT,
+        )
+
+    # Fallback: SIGTERM
+    child.terminate()
+    try:
+        child.wait(timeout=15)
+        log.info("Validator exited after SIGTERM with code %d", child.returncode)
+        return
+    except subprocess.TimeoutExpired:
+        log.warning("Validator did not exit after SIGTERM — sending SIGKILL")
+
+    # Last resort: SIGKILL
+    child.kill()
+    child.wait()
+    log.info("Validator killed with SIGKILL, code %d", child.returncode)
+
+
 # -- Serve subcommand ---------------------------------------------------------


 def cmd_serve() -> None:
-    """Main serve flow: snapshot check, setup, exec agave-validator."""
+    """Main serve flow: snapshot check, setup, run agave-validator as child.
+
+    Python stays as PID 1 and traps SIGTERM to perform graceful shutdown
+    via the admin RPC Unix socket.
+    """
     mode = env("AGAVE_MODE", "test")
     log.info("AGAVE_MODE=%s", mode)

@@ -407,7 +480,21 @@ def cmd_serve() -> None:
     Path("/tmp/entrypoint-start").write_text(str(time.time()))

     log.info("Starting agave-validator with %d arguments", len(args))
-    os.execvp("agave-validator", ["agave-validator"] + args)
+    child = subprocess.Popen(["agave-validator"] + args)
+
+    # Forward SIGUSR1 to child (log rotation)
+    signal.signal(signal.SIGUSR1, lambda _sig, _frame: child.send_signal(signal.SIGUSR1))
+
+    # Trap SIGTERM — run graceful_exit in a thread so the signal handler returns
+    # immediately and child.wait() in the main thread can observe the exit.
+    def _on_sigterm(_sig: int, _frame: object) -> None:
+        threading.Thread(target=graceful_exit, args=(child,), daemon=True).start()
+
+    signal.signal(signal.SIGTERM, _on_sigterm)
+
+    # Wait for child — if it exits on its own (crash, normal exit), propagate code
+    child.wait()
+    sys.exit(child.returncode)


 # -- Probe subcommand ---------------------------------------------------------
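The shutdown ladder this diff introduces — wait for a clean exit, then SIGTERM, then SIGKILL, with the SIGTERM handler handing off to a worker thread so the main thread's `wait()` observes the exit — can be exercised in miniature. This is a hedged sketch, not the entrypoint's actual code: `escalate` is an illustrative name, `sleep 60` stands in for the validator, the grace period is shortened to half a second, and the admin RPC step is omitted:

```python
import signal
import subprocess
import threading

def escalate(child: subprocess.Popen, grace: float) -> None:
    """Give the child `grace` seconds to exit on its own, then SIGTERM, then SIGKILL."""
    try:
        child.wait(timeout=grace)
        return                     # clean exit within the grace period
    except subprocess.TimeoutExpired:
        child.terminate()          # fallback: SIGTERM
    try:
        child.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        child.kill()               # last resort: SIGKILL

# Supervisor stays in charge (PID-1 style): spawn the child, trap SIGTERM, and
# run the escalation off the main thread so the handler itself returns fast.
child = subprocess.Popen(["sleep", "60"])
signal.signal(
    signal.SIGTERM,
    lambda *_: threading.Thread(target=escalate, args=(child, 0.5), daemon=True).start(),
)
signal.raise_signal(signal.SIGTERM)  # simulate kubelet delivering SIGTERM
rc = child.wait()                    # main thread observes the child's exit code
```

Since `sleep` never exits within the grace period, the escalation reaches `terminate()` and the main thread sees a negative return code (`-SIGTERM` on Linux), mirroring the fallback path in `graceful_exit`.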
snapshot_download.py
@@ -655,8 +655,9 @@ def download_best_snapshot(
         log.info("Downloading incremental %s (%d mirrors, slot %d, gap %d slots)",
                  inc_fn, len(inc_mirrors), inc_slot, gap)
         if not download_aria2c(inc_mirrors, output_dir, inc_fn, connections):
-            log.error("Failed to download incremental %s", inc_fn)
-            break
+            log.warning("Failed to download incremental %s — re-probing in 10s", inc_fn)
+            time.sleep(10)
+            continue

         prev_inc_filename = inc_fn
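The fix changes the loop's failure shape: instead of `break` (give up on the first failed incremental), it pauses and `continue`s so the next iteration re-probes the mirrors. That retry shape generalizes to any probe/download loop; here is a minimal sketch with hypothetical names (`download_all`, `flaky` — not the script's actual functions), using a downloader that fails on first contact with each file:

```python
import time

def download_all(probe, download, backoff=0.0):
    """Fetch everything `probe()` offers; on a failed download, pause,
    re-probe, and retry (continue) rather than giving up (break)."""
    fetched = []
    queue = [s for s in probe() if s not in fetched]
    while queue:
        name = queue[0]
        if not download(name):
            time.sleep(backoff)                           # brief pause before re-probing
            queue = [s for s in probe() if s not in fetched]  # refresh candidates
            continue                                      # do NOT break
        fetched.append(name)
        queue = queue[1:]
    return fetched

# A flaky downloader: every file fails the first attempt, succeeds after.
seen = set()
def flaky(name: str) -> bool:
    if name in seen:
        return True
    seen.add(name)
    return False

result = download_all(lambda: ["inc-100", "inc-200"], flaky)
```

With the old `break` behavior, the first failure on `inc-100` would have ended the loop with nothing fetched; with `continue` plus re-probe, `result` ends up containing both snapshots.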
check-status.py
@@ -18,6 +18,7 @@ from __future__ import annotations

 import argparse
 import json
+import os
 import subprocess
 import sys
 import time
@@ -206,9 +207,11 @@ def display(iteration: int = 0) -> None:
     snapshots = check_snapshots()
     ramdisk = check_ramdisk()

-    print(f"\n{'=' * 60}")
-    print(f"  Biscayne Agave Status — {ts}")
-    print(f"{'=' * 60}")
+    # Clear screen and home cursor for clean redraw in watch mode
+    if iteration > 0:
+        print("\033[2J\033[H", end="")
+
+    print(f"\n  Biscayne Agave Status — {ts}\n")

     # Pod
     print(f"\n  Pod: {pod['phase']}")
@@ -275,14 +278,30 @@ def display(iteration: int = 0) -> None:
 # -- Main ---------------------------------------------------------------------


+def spawn_tmux_pane(interval: int) -> None:
+    """Launch this script with --watch in a new tmux pane."""
+    script = os.path.abspath(__file__)
+    cmd = f"python3 {script} --watch -i {interval}"
+    subprocess.run(
+        ["tmux", "split-window", "-h", "-d", cmd],
+        check=True,
+    )
+
+
 def main() -> int:
     p = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
     p.add_argument("--watch", action="store_true", help="Repeat every interval")
+    p.add_argument("--pane", action="store_true",
+                   help="Launch --watch in a new tmux pane")
     p.add_argument("-i", "--interval", type=int, default=30,
                    help="Watch interval in seconds (default: 30)")
     args = p.parse_args()

+    if args.pane:
+        spawn_tmux_pane(args.interval)
+        return 0
+
     discover()

     try:
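The clean-redraw change relies on the ANSI sequence `\033[2J` (erase display) followed by `\033[H` (cursor to home), emitted only when `iteration > 0` so the first frame scrolls normally and later frames overwrite it instead of stacking. A standalone sketch of that pattern, writable to any stream for testing (`render` is an illustrative name, not the script's function):

```python
import io
import sys

CLEAR_HOME = "\033[2J\033[H"  # ANSI: erase display, then cursor to row 1, col 1

def render(ts: str, iteration: int, out=sys.stdout) -> None:
    """Redraw in place: skip the clear on the first frame, clear on repeats."""
    if iteration > 0:
        out.write(CLEAR_HOME)
    out.write(f"\n  Biscayne Agave Status — {ts}\n")

# Capture two frames: the first plain, the second preceded by a clear.
buf = io.StringIO()
render("12:00:00", 0, buf)
render("12:00:30", 1, buf)
frames = buf.getvalue()
```

Inspecting `frames` shows exactly one clear sequence between the two headers — on a real terminal the second frame replaces the first rather than appending below it.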