Biscayne Agave Runbook
Cluster Operations
Shutdown Order
The agave validator runs inside a kind-based k8s cluster managed by laconic-so.
The kind node is a Docker container. Never restart or kill the kind node container
while the validator is running. Agave uses io_uring for async I/O, and on ZFS,
killing the process can produce unkillable kernel threads (D-state in
io_wq_put_and_exit blocked on ZFS transaction commits). This deadlocks the
container's PID namespace, making docker stop, docker restart, docker exec,
and even reboot hang.
Correct shutdown sequence:
- Scale the deployment to 0 and wait for the pod to terminate:
  kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
    -n laconic-laconic-70ce4c4b47e23b85 --replicas=0
  kubectl wait --for=delete pod -l app=laconic-70ce4c4b47e23b85-deployment \
    -n laconic-laconic-70ce4c4b47e23b85 --timeout=120s
- Only then restart the kind node if needed:
  docker restart laconic-70ce4c4b47e23b85-control-plane
- Scale back up:
  kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
    -n laconic-laconic-70ce4c4b47e23b85 --replicas=1
Ramdisk
The accounts directory must be on a ramdisk for performance. /dev/ram0 loses its
filesystem on reboot and must be reformatted before mounting.
Boot ordering is handled by systemd units (installed by biscayne-boot.yml):
- format-ramdisk.service: runs mkfs.xfs -f /dev/ram0 before local-fs.target
- fstab entry: mounts /dev/ram0 at /srv/solana/ramdisk with x-systemd.requires=format-ramdisk.service
- ramdisk-accounts.service: creates /srv/solana/ramdisk/accounts and sets ownership after the mount
These units run before docker, so the kind node's bind mounts always see the ramdisk. No manual intervention is needed after reboot.
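A quick post-boot sanity check can confirm the chain worked. This is a sketch, not part of biscayne-boot.yml; the ramdisk_ready helper is an illustrative name:

```shell
# Hypothetical post-boot check: the path must be a real mount point (the
# ramdisk was formatted and mounted) and the accounts dir must exist
# (ramdisk-accounts.service ran after the mount).
ramdisk_ready() {
  findmnt -n "$1" >/dev/null 2>&1 && [ -d "$1/accounts" ]
}

if ramdisk_ready /srv/solana/ramdisk; then
  echo "ramdisk OK"
else
  echo "ramdisk NOT ready -- do not start the validator"
fi
```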
Mount propagation: The kind node bind-mounts /srv/kind → /mnt. Because
the ramdisk is mounted at /srv/solana/ramdisk and symlinked/overlaid through
/srv/kind/solana/ramdisk, mount propagation makes it visible inside the kind
node at /mnt/solana/ramdisk without restarting the kind node. Do NOT restart
the kind node just to pick up a ramdisk mount.
KUBECONFIG
kubectl must be told where the kubeconfig is when running as root or via ansible:
KUBECONFIG=/home/rix/.kube/config kubectl ...
The ansible playbooks set environment: KUBECONFIG: /home/rix/.kube/config.
SSH Agent
SSH to biscayne goes through a ProxyCommand jump host (abernathy.ch2.vaasl.io). The SSH agent socket rotates when the user reconnects. Find the current one:
ls -t /tmp/ssh-*/agent.* | head -1
Then export it:
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
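The two steps can be folded into one. pick_agent is just an illustrative helper; it assumes the most recently modified socket belongs to the live session:

```shell
# Illustrative helper: print the newest socket matching the given globs.
# Assumption: newest mtime == the socket for the current SSH session.
pick_agent() {
  ls -t "$@" 2>/dev/null | head -1
}

export SSH_AUTH_SOCK="$(pick_agent /tmp/ssh-*/agent.*)"
```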
io_uring/ZFS Deadlock — Root Cause
When agave-validator is killed while performing I/O against ZFS-backed paths (not the ramdisk), io_uring worker threads get stuck in D-state:
io_wq_put_and_exit → dsl_dir_tempreserve_space (ZFS module)
These threads are unkillable (SIGKILL has no effect on D-state processes). They
prevent the container's PID namespace from being reaped (zap_pid_ns_processes
waits forever), which breaks docker stop, docker restart, docker exec, and
even reboot. The only fix is a hard power cycle.
Prevention: Always scale the deployment to 0 and wait for the pod to terminate
before any destructive operation (namespace delete, kind restart, host reboot).
The biscayne-stop.yml playbook enforces this.
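Before any destructive operation it can also be worth checking whether workers are already stuck. This is a sketch; the iou-wrk/io_wq comm prefixes for io_uring worker threads are an assumption that varies by kernel version:

```shell
# Sketch: count kernel threads in uninterruptible sleep (D-state) whose
# comm looks like an io_uring worker (iou-wrk-* on recent kernels).
stuck_io_workers() {
  ps -eo stat=,comm= | awk '$1 ~ /^D/ && $2 ~ /^iou|^io_wq/ { n++ } END { print n+0 }'
}

if [ "$(stuck_io_workers)" -gt 0 ]; then
  echo "D-state io_uring workers present -- expect docker/reboot to hang"
fi
```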
laconic-so Architecture
laconic-so manages kind clusters atomically — deployment start creates the
kind cluster, namespace, PVs, PVCs, and deployment in one shot. There is no way
to create the cluster without deploying the pod.
Key code paths in stack-orchestrator:
- deploy_k8s.py:up() - creates everything atomically
- cluster_info.py:get_pvs() - translates host paths using kind-mount-root
- helpers_k8s.py:get_kind_pv_bind_mount_path() - strips the kind-mount-root prefix and prepends /mnt
- helpers_k8s.py:_generate_kind_mounts() - when kind-mount-root is set, emits a single /srv/kind → /mnt mount instead of individual mounts
The kind-mount-root: /srv/kind setting in spec.yml means all data volumes
whose host paths start with /srv/kind get translated to /mnt/... inside the
kind node via a single bind mount.
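The translation get_kind_pv_bind_mount_path() performs can be sketched as a one-liner, assuming kind-mount-root is /srv/kind as in spec.yml (kind_pv_path is an illustrative name, not the real function):

```shell
# Sketch of the kind-mount-root translation: strip the /srv/kind prefix
# and prepend /mnt, mirroring helpers_k8s.py:get_kind_pv_bind_mount_path().
kind_pv_path() {
  echo "/mnt${1#/srv/kind}"
}

kind_pv_path /srv/kind/solana/ledger   # -> /mnt/solana/ledger
```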
Key Identifiers
- Kind cluster: laconic-70ce4c4b47e23b85
- Namespace: laconic-laconic-70ce4c4b47e23b85
- Deployment: laconic-70ce4c4b47e23b85-deployment
- Kind node container: laconic-70ce4c4b47e23b85-control-plane
- Deployment dir: /srv/deployments/agave
- Snapshot dir: /srv/solana/snapshots
- Ledger dir: /srv/solana/ledger
- Accounts dir: /srv/solana/ramdisk/accounts
- Log dir: /srv/solana/log
- Host bind mount root: /srv/kind -> kind node /mnt
- laconic-so: /home/rix/.local/bin/laconic-so (editable install)
PV Mount Paths (inside kind node)
| PV Name | hostPath |
|---|---|
| validator-snapshots | /mnt/solana/snapshots |
| validator-ledger | /mnt/solana/ledger |
| validator-accounts | /mnt/solana/ramdisk/accounts |
| validator-log | /mnt/solana/log |
Snapshot Freshness
If the snapshot is more than 20,000 slots behind the current mainnet tip, it is too old. Stop the validator, download a fresh snapshot, and restart. Do NOT let it try to catch up from an old snapshot — it will take too long and may never converge.
Check with:
# Snapshot slot (from filename)
ls /srv/solana/snapshots/snapshot-*.tar.*
# Current mainnet slot
curl -s -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
https://api.mainnet-beta.solana.com
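A small helper can turn the two checks into a single number. snapshot_slot and slot_lag are illustrative names, and the snapshot-<slot>-<hash>.tar.zst filename pattern is an assumption about the usual agave naming; feed slot_lag the finalized slot returned by the getSlot call above:

```shell
# Illustrative freshness check, assuming filenames of the form
# snapshot-<slot>-<hash>.tar.zst.
snapshot_slot() {
  basename "$1" | sed -E 's/^snapshot-([0-9]+)-.*$/\1/'
}

slot_lag() {  # $1 = snapshot file, $2 = current finalized slot
  echo $(( $2 - $(snapshot_slot "$1") ))
}

snapshot_slot /srv/solana/snapshots/snapshot-250000000-AbCdEf.tar.zst  # -> 250000000
```

If slot_lag prints a value over 20,000, the snapshot is too old per the rule above.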
Snapshot Leapfrog Recovery
When the validator is stuck in a repair-dependent gap (incomplete shreds from a relay outage or insufficient turbine coverage), "grinding through" doesn't work. At 0.4 slots/sec replay through incomplete blocks vs 2.5 slots/sec chain production, the gap grows faster than it shrinks.
Strategy: Download a fresh snapshot whose slot lands past the incomplete zone, into the range where turbine+relay shreds are accumulating in the blockstore. Keep the existing ledger — it has those shreds. The validator replays from local blockstore data instead of waiting on repair.
Steps:
- Let the validator run — turbine+relay accumulate shreds at the tip
- Monitor shred completeness at the tip:
  scripts/check-shred-completeness.sh 500
- When there's a contiguous run of complete blocks (>100 slots), note the starting slot of that run
- Scale to 0, wipe accounts (ramdisk), wipe old snapshots
- Do NOT wipe ledger — it has the turbine shreds
- Download a fresh snapshot (its slot should be within the complete run)
- Scale to 1 — validator replays from local blockstore at 3-5 slots/sec
Why this works: Turbine delivers ~60% of shreds in real-time. Repair fills the rest for recent slots quickly (peers prioritize recent data). The only problem is repair for old slots (minutes/hours behind) which peers deprioritize. By snapshotting past the gap, we skip the old-slot repair bottleneck entirely.
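The catch-up arithmetic is worth spelling out, using the 0.4 and 2.5 slots/sec figures above (expressed in tenths of a slot per second so shell integer arithmetic suffices):

```shell
# Gap growth while "grinding through": replay at 0.4 slots/s vs chain
# production at 2.5 slots/s. The gap widens; catch-up is impossible
# whenever replay speed is below chain speed.
replay=4    # 0.4 slots/s, in tenths
chain=25    # 2.5 slots/s, in tenths
growth=$(( chain - replay ))   # growth = 21 tenths = 2.1 slots/s
echo "gap grows by $growth tenths of a slot every second"
```

By contrast, replaying from local blockstore at 3-5 slots/sec beats chain production, so the gap shrinks.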
Shred Relay (Ashburn)
The TVU shred relay from laconic-was-sw01 provides ~4,000 additional shreds/sec. Without it, turbine alone delivers ~60% of blocks. With it, completeness improves but still requires repair for full coverage.
Current state: Old pipeline (monitor session + socat + shred-unwrap.py).
The traffic-policy redirect was never committed (auto-revert after 5 min timer).
See docs/tvu-shred-relay.md for the traffic-policy config that needs to be
properly applied.
Boot dependency: shred-unwrap.py must be running on biscayne for the old
pipeline to work. It is NOT persistent across reboots. The iptables DNAT rule
for the new pipeline IS persistent (iptables-persistent installed).
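Since shred-unwrap.py does not survive reboots, a post-reboot check helps. require_proc is an illustrative helper, not an existing script:

```shell
# Illustrative helper: report whether any process command line matches the
# given pattern; used here for the non-persistent shred-unwrap.py.
require_proc() {
  if pgrep -f "$1" >/dev/null 2>&1; then
    echo "running: $1"
  else
    echo "MISSING: $1 -- old shred pipeline is down"
    return 1
  fi
}

require_proc shred-unwrap.py || true
```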
Redeploy Flow
See playbooks/biscayne-redeploy.yml. The scale-to-0 pattern is required because
laconic-so creates the cluster and deploys the pod atomically:
- Delete namespace (teardown)
- Optionally wipe data
- laconic-so deployment start (creates cluster + pod)
- Immediately scale to 0
- Download snapshot via aria2c
- Scale to 1
- Verify