2026-03-07 01:44:25 +00:00
# Biscayne Agave Runbook
## Deployment Layers
Operations on biscayne follow a strict layering. Each layer assumes the layers
below it are correct. Playbooks belong to exactly one layer.
| Layer | What | Playbooks |
|-------|------|-----------|
| 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
| 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
| 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` → `/mnt` in kind node | `biscayne-redeploy.yml` (deploy tags) |
| 4. Prepare agave | Host storage for agave: zvol, ramdisk, rbind into `/srv/kind/solana` | `biscayne-prepare-agave.yml` |
| 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |
**Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):

- `/srv/solana` is XFS on a zvol — agave uses io_uring which deadlocks on ZFS
- `/srv/solana/ramdisk` is XFS on `/dev/ram0` — accounts must be on ramdisk
- `/srv/kind/solana` is an rbind of `/srv/solana` — makes the zvol visible to kind at `/mnt/solana`
These invariants are checked at runtime and persisted to fstab/systemd so they
survive reboot. They are agave's requirements reaching into the boot sequence,
not base system concerns.
**Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml`
(layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).
## Cluster Operations
### Shutdown Order
The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
The kind node is a Docker container. **Never restart or kill the kind node
container while the validator is running.** Agave uses `io_uring` for async I/O,
and on ZFS, killing the process can produce unkillable kernel threads (D-state
in `io_wq_put_and_exit` blocked on ZFS transaction commits). This deadlocks the
container's PID namespace, making `docker stop`, `docker restart`, `docker exec`,
and even `reboot` hang.
Correct shutdown sequence:
1. Scale the deployment to 0 and wait for the pod to terminate:

   ```
   kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --replicas=0
   kubectl wait --for=delete pod -l app=laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --timeout=120s
   ```

2. Only then restart the kind node if needed:

   ```
   docker restart laconic-70ce4c4b47e23b85-control-plane
   ```

3. Scale back up:

   ```
   kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --replicas=1
   ```
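The sequence above can be collected into one guard function — a sketch, assuming
`kubectl` and `docker` are on `PATH`; `restart_kind_safely` is a hypothetical
name, not part of any playbook:

```shell
# Sketch: safe kind-node restart. Identifiers are from "Key Identifiers".
restart_kind_safely() {
    DEPLOY=laconic-70ce4c4b47e23b85-deployment
    NS=laconic-laconic-70ce4c4b47e23b85
    NODE=laconic-70ce4c4b47e23b85-control-plane
    # 1. Scale to 0 and wait until the validator pod is actually gone
    kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=0 &&
    kubectl wait --for=delete pod -l app="$DEPLOY" -n "$NS" --timeout=120s &&
    # 2. Only now is it safe to restart the kind node
    docker restart "$NODE" &&
    # 3. Scale back up
    kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=1
}
```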
### Ramdisk
The accounts directory must be on a ramdisk for performance. `/dev/ram0` loses its
filesystem on reboot and must be reformatted before mounting.
**Boot ordering is handled by systemd units** (installed by `biscayne-prepare-agave.yml`):
- `format-ramdisk.service`: runs `mkfs.xfs -f /dev/ram0` before `local-fs.target`
- fstab entry: mounts `/dev/ram0` at `/srv/solana/ramdisk` with
  `x-systemd.requires=format-ramdisk.service`
- `ramdisk-accounts.service`: creates `/srv/solana/ramdisk/accounts` and sets
  ownership after the mount
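Reconstructed from the description above, the pieces look roughly like this — a
sketch only; `biscayne-prepare-agave.yml` is the source of truth for the exact
unit contents:

```
# format-ramdisk.service (sketch): format before any local filesystems mount
[Unit]
DefaultDependencies=no
Before=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/mkfs.xfs -f /dev/ram0

[Install]
WantedBy=local-fs.target

# fstab entry (sketch): mount only after the format unit has run
/dev/ram0  /srv/solana/ramdisk  xfs  defaults,x-systemd.requires=format-ramdisk.service  0 0
```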
These units run before docker, so the kind node's bind mounts always see the
ramdisk. **No manual intervention is needed after reboot.**
**Mount propagation**: The kind node bind-mounts `/srv/kind` → `/mnt` at container
start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts
(commit `a11d40f2` in stack-orchestrator), so host submounts (like the rbind at
`/srv/kind/solana`) propagate into the kind node automatically. A kind restart
is required to pick up the new config after updating laconic-so.
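After a reboot or kind restart, the mount chain can be confirmed end-to-end with
`findmnt`. A sketch — `check_mounts` is a hypothetical helper, and the container
name is from "Key Identifiers":

```shell
# Sketch: verify the layer-4 mounts exist and propagated into the kind node.
check_mounts() {
    findmnt -n /srv/solana >/dev/null         || { echo "missing: /srv/solana"; return 1; }
    findmnt -n /srv/solana/ramdisk >/dev/null || { echo "missing: ramdisk"; return 1; }
    findmnt -n /srv/kind/solana >/dev/null    || { echo "missing: rbind"; return 1; }
    # Propagation check: the submount must be visible inside the kind node
    docker exec laconic-70ce4c4b47e23b85-control-plane \
        findmnt -n /mnt/solana >/dev/null     || { echo "not propagated into kind"; return 1; }
    echo "all mounts present"
}
```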
### KUBECONFIG
kubectl must be told where the kubeconfig is when running as root or via ansible:
```
KUBECONFIG=/home/rix/.kube/config kubectl ...
```
The ansible playbooks set `environment: KUBECONFIG: /home/rix/.kube/config`.
### SSH Agent
SSH to biscayne goes through a ProxyCommand jump host (abernathy.ch2.vaasl.io).
The SSH agent socket rotates when the user reconnects. Find the current one:
```
ls -t /tmp/ssh-*/agent.* | head -1
```
Then export it:
```
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```
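The two steps collapse into one helper — a sketch; `pick_agent_sock` is a
hypothetical name, and the optional argument exists only so the glob root can be
overridden:

```shell
# Sketch: print the newest agent socket (newest mtime first via ls -t).
pick_agent_sock() {
    # $1 overrides the socket root; defaults to /tmp where sshd puts them
    ls -t "${1:-/tmp}"/ssh-*/agent.* 2>/dev/null | head -1
}
# Usage: export SSH_AUTH_SOCK="$(pick_agent_sock)"
```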
### io_uring/ZFS Deadlock — Root Cause
When agave-validator is killed while performing I/O against ZFS-backed paths (not
the ramdisk), io_uring worker threads get stuck in D-state:
```
io_wq_put_and_exit → dsl_dir_tempreserve_space (ZFS module)
```
These threads are unkillable (SIGKILL has no effect on D-state processes). They
prevent the container's PID namespace from being reaped (`zap_pid_ns_processes`
waits forever), which breaks `docker stop`, `docker restart`, `docker exec`, and
even `reboot`. The only fix is a hard power cycle.
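D-state threads can be spotted with plain `ps` before attempting any destructive
operation — a generic sketch, not agave-specific:

```shell
# Sketch: list uninterruptible-sleep (D-state) threads by pid and command.
# Stuck io_uring workers show up here and will not respond to SIGKILL.
list_dstate() {
    ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ { print $2, $3 }'
}
```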
**Prevention**: Always scale the deployment to 0 and wait for the pod to terminate
before any destructive operation (namespace delete, kind restart, host reboot).
The `biscayne-stop.yml` playbook enforces this.
### laconic-so Architecture
`laconic-so` manages kind clusters atomically — `deployment start` creates the
kind cluster, namespace, PVs, PVCs, and deployment in one shot. There is no way
to create the cluster without deploying the pod.
Key code paths in stack-orchestrator:
- `deploy_k8s.py:up()` — creates everything atomically
- `cluster_info.py:get_pvs()` — translates host paths using `kind-mount-root`
- `helpers_k8s.py:get_kind_pv_bind_mount_path()` — strips `kind-mount-root`
  prefix and prepends `/mnt/`
- `helpers_k8s.py:_generate_kind_mounts()` — when `kind-mount-root` is set,
  emits a single `/srv/kind` → `/mnt` mount instead of individual mounts
The `kind-mount-root: /srv/kind` setting in `spec.yml` means all data volumes
whose host paths start with `/srv/kind` get translated to `/mnt/...` inside the
kind node via a single bind mount.
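The effect of that translation can be sketched in one line — this mirrors what
`get_kind_pv_bind_mount_path()` is described as doing above; it is not the
actual stack-orchestrator code:

```shell
# Sketch: strip the kind-mount-root prefix and prepend /mnt.
kind_path() {
    echo "/mnt${1#/srv/kind}"
}
```

So `/srv/kind/solana/ledger` on the host becomes `/mnt/solana/ledger` inside the
kind node, matching the PV table below.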
### Key Identifiers
- Kind cluster: `laconic-70ce4c4b47e23b85`
- Namespace: `laconic-laconic-70ce4c4b47e23b85`
- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/solana/snapshots`
- Ledger dir: `/srv/solana/ledger`
- Accounts dir: `/srv/solana/ramdisk/accounts`
- Log dir: `/srv/solana/log`
- Host bind mount root: `/srv/kind` → kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)
### PV Mount Paths (inside kind node)
| PV Name | hostPath |
|----------------------|-------------------------------|
| validator-snapshots | /mnt/solana/snapshots |
| validator-ledger | /mnt/solana/ledger |
| validator-accounts | /mnt/solana/ramdisk/accounts |
| validator-log | /mnt/solana/log |
### Snapshot Freshness
If the snapshot is more than **20,000 slots behind** the current mainnet tip, it is
too old. Stop the validator, download a fresh snapshot, and restart. Do NOT let it
try to catch up from an old snapshot — it will take too long and may never converge.
Check with:
```
# Snapshot slot (from filename)
ls /srv/solana/snapshots/snapshot-*.tar.*

# Current mainnet slot
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
  https://api.mainnet-beta.solana.com
```
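The two checks can be combined into a lag calculation — a sketch that assumes
snapshot filenames of the form `snapshot-<slot>-<hash>.tar.zst`; `snapshot_slot`
and `snapshot_lag` are hypothetical helpers, not existing scripts:

```shell
# Sketch: extract the slot number from a snapshot filename.
snapshot_slot() {
    basename "$1" | cut -d- -f2
}

# Sketch: slots between the local snapshot and the finalized mainnet tip.
# A result over 20000 means: stop, fetch a fresh snapshot, restart.
snapshot_lag() {
    tip=$(curl -s -X POST -H "Content-Type: application/json" \
        -d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
        https://api.mainnet-beta.solana.com | grep -o '"result":[0-9]*' | cut -d: -f2)
    snap=$(snapshot_slot "$(ls /srv/solana/snapshots/snapshot-*.tar.* | head -1)")
    echo $((tip - snap))
}
```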
### Snapshot Leapfrog Recovery
When the validator is stuck in a repair-dependent gap (incomplete shreds from a
relay outage or insufficient turbine coverage), "grinding through" doesn't work.
At 0.4 slots/sec replay through incomplete blocks vs 2.5 slots/sec chain
production, the gap grows faster than it shrinks.
**Strategy**: Download a fresh snapshot whose slot lands *past* the incomplete zone,
into the range where turbine+relay shreds are accumulating in the blockstore.
**Keep the existing ledger** — it has those shreds. The validator replays from
local blockstore data instead of waiting on repair.
**Steps**:
1. Let the validator run — turbine+relay accumulate shreds at the tip
2. Monitor shred completeness at the tip:
   `scripts/check-shred-completeness.sh 500`
3. When there's a contiguous run of complete blocks (>100 slots), note the
   starting slot of that run
4. Scale to 0, wipe accounts (ramdisk), wipe old snapshots
5. **Do NOT wipe ledger** — it has the turbine shreds
6. Download a fresh snapshot (its slot should be within the complete run)
7. Scale to 1 — validator replays from local blockstore at 3-5 slots/sec
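Steps 4-5 are the dangerous part, so they are worth scripting. A sketch —
`leapfrog_wipe` is a hypothetical helper; the data root is parameterized so the
wipe can be rehearsed against a scratch directory first:

```shell
# Sketch: wipe accounts and snapshots but never the ledger.
# $1 is the data root (normally /srv/solana).
leapfrog_wipe() {
    root=${1:?usage: leapfrog_wipe /srv/solana}
    rm -rf "$root"/ramdisk/accounts/* "$root"/snapshots/*
    # Deliberately untouched: "$root"/ledger — it holds the turbine shreds
}
```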
**Why this works**: Turbine delivers ~60% of shreds in real-time. Repair fills
the rest for recent slots quickly (peers prioritize recent data). The only
problem is repair for *old* slots (minutes/hours behind), which peers deprioritize.
By snapshotting past the gap, we skip the old-slot repair bottleneck entirely.
### Shred Relay (Ashburn)
The TVU shred relay from laconic-was-sw01 provides ~4,000 additional shreds/sec.
Without it, turbine alone delivers ~60% of blocks. With it, completeness improves
but still requires repair for full coverage.
**Current state**: Old pipeline (monitor session + socat + shred-unwrap.py).
The traffic-policy redirect was never committed (auto-revert after 5 min timer).
See `docs/tvu-shred-relay.md` for the traffic-policy config that needs to be
properly applied.
**Boot dependency**: `shred-unwrap.py` must be running on biscayne for the old
pipeline to work. It is NOT persistent across reboots. The iptables DNAT rule
for the new pipeline IS persistent (iptables-persistent installed).
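A quick post-reboot check for that dependency — a sketch; `shred_unwrap_running`
is a hypothetical helper:

```shell
# Sketch: warn if the old pipeline's boot dependency is not running.
shred_unwrap_running() {
    pgrep -f shred-unwrap.py >/dev/null
}
shred_unwrap_running || echo "shred-unwrap.py NOT running - old relay pipeline is down"
```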
### Redeploy Flow
See `playbooks/biscayne-redeploy.yml`. The scale-to-0 pattern is required because
`laconic-so` creates the cluster and deploys the pod atomically:
1. Delete namespace (teardown)
2. Optionally wipe data
3. `laconic-so deployment start` (creates cluster + pod)
4. Immediately scale to 0
5. Download snapshot via aria2c
6. Scale to 1
7. Verify
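As a script skeleton — a sketch only; the playbook is authoritative, the `--dir`
flag form for `laconic-so deployment start` is an assumption, and the snapshot
URL is deliberately omitted:

```shell
# Sketch of the redeploy flow. Identifiers are from "Key Identifiers";
# LACONIC_SO can override the binary path (default from this runbook).
redeploy() {
    NS=laconic-laconic-70ce4c4b47e23b85
    DEPLOY=laconic-70ce4c4b47e23b85-deployment
    SO=${LACONIC_SO:-/home/rix/.local/bin/laconic-so}
    kubectl delete namespace "$NS" --wait=true                # 1. teardown
    # 2. optionally wipe data (see leapfrog notes: never the ledger)
    "$SO" deployment --dir /srv/deployments/agave start       # 3. cluster + pod (--dir form assumed)
    kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=0  # 4. park the pod immediately
    # 5. download a fresh snapshot with aria2c into /srv/solana/snapshots
    kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=1  # 6. start the validator
    # 7. verify (health-check.yml)
}
```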
|