# Biscayne Agave Runbook

## Deployment Layers

Operations on biscayne follow a strict layering. Each layer assumes the layers below it are correct. Playbooks belong to exactly one layer.

| Layer | What | Playbooks |
|-------|------|-----------|
| 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
| 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
| 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` → `/mnt` in kind node | `biscayne-redeploy.yml` (deploy tags) |
| 4. Prepare agave | Host storage for agave: ZFS dataset, ramdisk | `biscayne-prepare-agave.yml` |
| 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |

**Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):

- `/srv/kind/solana` is a ZFS dataset (`biscayne/DATA/srv/kind/solana`), child of the `/srv/kind` dataset
- `/srv/kind/solana/ramdisk` is tmpfs (1TB) — accounts must be in RAM
- `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`

These invariants are checked at runtime and persisted to fstab/systemd so they survive reboot.

**Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml` (layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).

## Cluster Operations

### Shutdown Order

The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`. The kind node is a Docker container. **Never restart or kill the kind node container while the validator is running.** Use `agave-validator exit --force` via the admin RPC socket for graceful shutdown, or scale the deployment to 0 and wait.

Correct shutdown sequence:

1. Scale the deployment to 0 and wait for the pod to terminate:

   ```
   kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --replicas=0
   kubectl wait --for=delete pod -l app=laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --timeout=120s
   ```

2. Only then restart the kind node if needed:

   ```
   docker restart laconic-70ce4c4b47e23b85-control-plane
   ```

3. Scale back up:

   ```
   kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --replicas=1
   ```

### Ramdisk

The accounts directory must be in RAM for performance. tmpfs is used instead of `/dev/ram0`: it is simpler (no format-on-boot service needed), resizable on the fly with `mount -o remount,size=`, and what most Solana operators use.

**Boot ordering**: `/srv/kind/solana` is a ZFS dataset mounted automatically by `zfs-mount.service`. The tmpfs ramdisk fstab entry uses `x-systemd.requires=zfs-mount.service` to ensure the dataset is mounted first. **No manual intervention after reboot.**

**Mount propagation**: The kind node bind-mounts `/srv/kind` → `/mnt` at container start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts (commit `a11d40f2` in stack-orchestrator), so host submounts propagate into the kind node automatically. After updating laconic-so, a kind restart is required to pick up the new config.

### KUBECONFIG

kubectl must be told where the kubeconfig is when running as root or via ansible:

```
KUBECONFIG=/home/rix/.kube/config kubectl ...
```

The ansible playbooks set `environment: KUBECONFIG: /home/rix/.kube/config`.

### SSH Agent

SSH to biscayne goes through a ProxyCommand jump host (abernathy.ch2.vaasl.io). The SSH agent socket rotates when the user reconnects. Find the current one:

```
ls -t /tmp/ssh-*/agent.* | head -1
```

Then export it:

```
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```

### io_uring/ZFS Deadlock — Historical Note

Agave uses io_uring for async I/O.
Killing agave ungracefully while it has outstanding I/O against ZFS can produce unkillable D-state kernel threads (`io_wq_put_and_exit` blocked on ZFS transactions), deadlocking the container.

**Prevention**: Use graceful shutdown (`agave-validator exit --force` via admin RPC, or scale to 0 and wait). The `biscayne-stop.yml` playbook enforces this. With graceful shutdown, io_uring contexts are closed cleanly and ZFS storage is safe to use directly (no zvol/XFS workaround needed).

**ZFS fix**: The underlying io_uring bug is fixed in ZFS 2.2.8+ (PR #17298). Biscayne currently runs ZFS 2.2.2. Upgrading ZFS will eliminate the deadlock risk entirely, even for ungraceful shutdowns.

### laconic-so Architecture

`laconic-so` manages kind clusters atomically — `deployment start` creates the kind cluster, namespace, PVs, PVCs, and deployment in one shot. There is no way to create the cluster without deploying the pod.

Key code paths in stack-orchestrator:

- `deploy_k8s.py:up()` — creates everything atomically
- `cluster_info.py:get_pvs()` — translates host paths using `kind-mount-root`
- `helpers_k8s.py:get_kind_pv_bind_mount_path()` — strips the `kind-mount-root` prefix and prepends `/mnt/`
- `helpers_k8s.py:_generate_kind_mounts()` — when `kind-mount-root` is set, emits a single `/srv/kind` → `/mnt` mount instead of individual mounts

The `kind-mount-root: /srv/kind` setting in `spec.yml` means all data volumes whose host paths start with `/srv/kind` are translated to `/mnt/...` inside the kind node via a single bind mount.
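As a rough illustration of that prefix translation, here is a minimal shell sketch. It is not the real implementation (that lives in `helpers_k8s.py:get_kind_pv_bind_mount_path()`); the function name `to_kind_path` is invented for this example.

```shell
# Sketch of the kind-mount-root prefix translation described above.
# to_kind_path is a made-up name; the real logic is in stack-orchestrator's
# helpers_k8s.py:get_kind_pv_bind_mount_path().
kind_mount_root="/srv/kind"

to_kind_path() {
  local host_path="$1"
  case "$host_path" in
    "$kind_mount_root"/*)
      # Strip the kind-mount-root prefix and prepend /mnt/
      printf '/mnt/%s\n' "${host_path#"$kind_mount_root"/}"
      ;;
    *)
      # Paths outside kind-mount-root are not translated
      printf '%s\n' "$host_path"
      ;;
  esac
}
```

For example, `to_kind_path /srv/kind/validator-ledger` prints `/mnt/validator-ledger`, consistent with the PV paths listed further down.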
### Key Identifiers

- Kind cluster: `laconic-70ce4c4b47e23b85`
- Namespace: `laconic-laconic-70ce4c4b47e23b85`
- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/kind/solana/snapshots` (ZFS dataset, visible to kind at `/mnt/validator-snapshots`)
- Ledger dir: `/srv/kind/solana/ledger` (ZFS dataset, visible to kind at `/mnt/validator-ledger`)
- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (tmpfs ramdisk, visible to kind at `/mnt/validator-accounts`)
- Log dir: `/srv/kind/solana/log` (ZFS dataset, visible to kind at `/mnt/validator-log`)
- **WARNING**: `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`.
- Host bind mount root: `/srv/kind` → kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)

### PV Mount Paths (inside kind node)

| PV Name             | hostPath                 |
|---------------------|--------------------------|
| validator-snapshots | /mnt/validator-snapshots |
| validator-ledger    | /mnt/validator-ledger    |
| validator-accounts  | /mnt/validator-accounts  |
| validator-log       | /mnt/validator-log       |

### Snapshot Freshness

If the snapshot is more than **20,000 slots behind** the current mainnet tip, it is too old. Stop the validator, download a fresh snapshot, and restart. Do NOT let the validator try to catch up from an old snapshot — it will take too long and may never converge.

Check with:

```
# Snapshot slot (from filename)
ls /srv/kind/solana/snapshots/snapshot-*.tar.*

# Current mainnet slot
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
  https://api.mainnet-beta.solana.com
```

### Snapshot Leapfrog Recovery

When the validator is stuck in a repair-dependent gap (incomplete shreds from a relay outage or insufficient turbine coverage), "grinding through" doesn't work.
At 0.4 slots/sec replay through incomplete blocks vs 2.5 slots/sec chain production, the gap grows faster than it shrinks.

**Strategy**: Download a fresh snapshot whose slot lands *past* the incomplete zone, into the range where turbine+relay shreds are accumulating in the blockstore. **Keep the existing ledger** — it has those shreds. The validator replays from local blockstore data instead of waiting on repair.

**Steps**:

1. Let the validator run — turbine+relay accumulate shreds at the tip
2. Monitor shred completeness at the tip: `scripts/check-shred-completeness.sh 500`
3. When there's a contiguous run of complete blocks (>100 slots), note the starting slot of that run
4. Scale to 0, wipe accounts (ramdisk), wipe old snapshots
5. **Do NOT wipe the ledger** — it has the turbine shreds
6. Download a fresh snapshot (its slot should be within the complete run)
7. Scale to 1 — the validator replays from local blockstore at 3–5 slots/sec

**Why this works**: Turbine delivers ~60% of shreds in real time. Repair fills the rest quickly for recent slots (peers prioritize recent data). The only problem is repair for *old* slots (minutes/hours behind), which peers deprioritize. By snapshotting past the gap, we skip the old-slot repair bottleneck entirely.

### Shred Relay (Ashburn)

The TVU shred relay from laconic-was-sw01 provides ~4,000 additional shreds/sec. Without it, turbine alone delivers ~60% of blocks. With it, completeness improves but still requires repair for full coverage.

**Current state**: Old pipeline (monitor session + socat + shred-unwrap.py). The traffic-policy redirect was never committed (auto-revert after 5 min timer). See `docs/tvu-shred-relay.md` for the traffic-policy config that needs to be properly applied.

**Boot dependency**: `shred-unwrap.py` must be running on biscayne for the old pipeline to work. It is NOT persistent across reboots. The iptables DNAT rule for the new pipeline IS persistent (iptables-persistent installed).
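The replay and production rates quoted in the leapfrog section above can be sanity-checked with quick arithmetic. The 50,000-slot gap below is an illustrative figure, not a measurement from this host:

```shell
# While grinding through incomplete blocks: replay 0.4 slots/sec vs
# production 2.5 slots/sec, so the gap widens at their difference.
gap_growth=$(awk 'BEGIN { printf "%.1f", 2.5 - 0.4 }')
echo "grinding: gap grows ${gap_growth} slots/sec"

# After a leapfrog: replay from local blockstore at 3 slots/sec (the low
# end of the 3-5 range) gives a net catch-up of 0.5 slots/sec.
# Time to close an illustrative 50,000-slot gap at that net rate:
hours=$(awk 'BEGIN { printf "%.1f", 50000 / (3 - 2.5) / 3600 }')
echo "leapfrog (worst case): ${hours} hours to close a 50,000-slot gap"
```

At the fast end (5 slots/sec replay, net 2.5 slots/sec) the same gap closes in about 5.6 hours; either way the gap shrinks, whereas grinding through incomplete blocks grows it.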
### Redeploy Flow

See `playbooks/biscayne-redeploy.yml`. The scale-to-0 pattern is required because `laconic-so` creates the cluster and deploys the pod atomically:

1. Delete namespace (teardown)
2. Optionally wipe data
3. `laconic-so deployment start` (creates cluster + pod)
4. Immediately scale to 0
5. Download snapshot via aria2c
6. Scale to 1
7. Verify
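The steps above can be sketched as a dry-run shell script. Identifiers are taken from the Key Identifiers section; the playbook remains the authoritative sequence, and the snapshot URL and aria2c flags are deliberately omitted here:

```shell
# Dry-run sketch of the redeploy flow. With DRYRUN=1 (the default) each
# command is printed instead of executed; biscayne-redeploy.yml is the
# source of truth for the real invocations.
NS="laconic-laconic-70ce4c4b47e23b85"
DEPLOY="laconic-70ce4c4b47e23b85-deployment"
DRYRUN="${DRYRUN:-1}"

run() { if [ "$DRYRUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

redeploy_sketch() {
  run kubectl delete namespace "$NS" --ignore-not-found         # 1. teardown
  # 2. optionally wipe data (accounts ramdisk, old snapshots; see the
  #    leapfrog notes before touching the ledger)
  run laconic-so deployment start                               # 3. cluster + pod, atomically
  run kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=0  # 4. immediately scale to 0
  # 5. download snapshot via aria2c (URL and flags omitted here)
  run kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=1  # 6. scale to 1
  # 7. verify, e.g. via health-check.yml
}
```

Calling `redeploy_sketch` with the default `DRYRUN=1` just prints the command sequence; nothing is executed.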