stack-orchestrator/CLAUDE.md

205 lines
8.3 KiB
Markdown

# Biscayne Agave Runbook
## Cluster Operations
### Shutdown Order
The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
The kind node is a Docker container. **Never restart or kill the kind node container
while the validator is running.** Agave uses `io_uring` for async I/O, and on ZFS,
killing the process can produce unkillable kernel threads (D-state in
`io_wq_put_and_exit` blocked on ZFS transaction commits). This deadlocks the
container's PID namespace, making `docker stop`, `docker restart`, `docker exec`,
and even `reboot` hang.
Correct shutdown sequence:
1. Scale the deployment to 0 and wait for the pod to terminate:
```
kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
-n laconic-laconic-70ce4c4b47e23b85 --replicas=0
kubectl wait --for=delete pod -l app=laconic-70ce4c4b47e23b85-deployment \
-n laconic-laconic-70ce4c4b47e23b85 --timeout=120s
```
2. Only then restart the kind node if needed:
```
docker restart laconic-70ce4c4b47e23b85-control-plane
```
3. Scale back up:
```
kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
-n laconic-laconic-70ce4c4b47e23b85 --replicas=1
```
### Ramdisk
The accounts directory must be on a ramdisk for performance. `/dev/ram0` loses its
filesystem on reboot and must be reformatted before mounting.
**Boot ordering is handled by systemd units** (installed by `biscayne-boot.yml`):
- `format-ramdisk.service`: runs `mkfs.xfs -f /dev/ram0` before `local-fs.target`
- fstab entry: mounts `/dev/ram0` at `/srv/solana/ramdisk` with
`x-systemd.requires=format-ramdisk.service`
- `ramdisk-accounts.service`: creates `/srv/solana/ramdisk/accounts` and sets
ownership after the mount
These units run before docker, so the kind node's bind mounts always see the
ramdisk. **No manual intervention is needed after reboot.**
**Mount propagation**: The kind node bind-mounts `/srv/kind``/mnt`. Because
the ramdisk is mounted at `/srv/solana/ramdisk` and symlinked/overlaid through
`/srv/kind/solana/ramdisk`, mount propagation makes it visible inside the kind
node at `/mnt/solana/ramdisk` without restarting the kind node. **Do NOT restart
the kind node just to pick up a ramdisk mount.**
### KUBECONFIG
kubectl must be told where the kubeconfig is when running as root or via ansible:
```
KUBECONFIG=/home/rix/.kube/config kubectl ...
```
The ansible playbooks set `environment: KUBECONFIG: /home/rix/.kube/config`.
### SSH Agent
SSH to biscayne goes through a ProxyCommand jump host (abernathy.ch2.vaasl.io).
The SSH agent socket rotates when the user reconnects. Find the current one:
```
ls -t /tmp/ssh-*/agent.* | head -1
```
Then export it:
```
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```
### io_uring/ZFS Deadlock — Root Cause
When agave-validator is killed while performing I/O against ZFS-backed paths (not
the ramdisk), io_uring worker threads get stuck in D-state:
```
io_wq_put_and_exit → dsl_dir_tempreserve_space (ZFS module)
```
These threads are unkillable (SIGKILL has no effect on D-state processes). They
prevent the container's PID namespace from being reaped (`zap_pid_ns_processes`
waits forever), which breaks `docker stop`, `docker restart`, `docker exec`, and
even `reboot`. The only fix is a hard power cycle.
**Prevention**: Always scale the deployment to 0 and wait for the pod to terminate
before any destructive operation (namespace delete, kind restart, host reboot).
The `biscayne-stop.yml` playbook enforces this.
### laconic-so Architecture
`laconic-so` manages kind clusters atomically — `deployment start` creates the
kind cluster, namespace, PVs, PVCs, and deployment in one shot. There is no way
to create the cluster without deploying the pod.
Key code paths in stack-orchestrator:
- `deploy_k8s.py:up()` — creates everything atomically
- `cluster_info.py:get_pvs()` — translates host paths using `kind-mount-root`
- `helpers_k8s.py:get_kind_pv_bind_mount_path()` — strips `kind-mount-root`
prefix and prepends `/mnt/`
- `helpers_k8s.py:_generate_kind_mounts()` — when `kind-mount-root` is set,
emits a single `/srv/kind``/mnt` mount instead of individual mounts
The `kind-mount-root: /srv/kind` setting in `spec.yml` means all data volumes
whose host paths start with `/srv/kind` get translated to `/mnt/...` inside the
kind node via a single bind mount.
### Key Identifiers
- Kind cluster: `laconic-70ce4c4b47e23b85`
- Namespace: `laconic-laconic-70ce4c4b47e23b85`
- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/solana/snapshots`
- Ledger dir: `/srv/solana/ledger`
- Accounts dir: `/srv/solana/ramdisk/accounts`
- Log dir: `/srv/solana/log`
- Host bind mount root: `/srv/kind` -> kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)
### PV Mount Paths (inside kind node)
| PV Name | hostPath |
|----------------------|-------------------------------|
| validator-snapshots | /mnt/solana/snapshots |
| validator-ledger | /mnt/solana/ledger |
| validator-accounts | /mnt/solana/ramdisk/accounts |
| validator-log | /mnt/solana/log |
### Snapshot Freshness
If the snapshot is more than **20,000 slots behind** the current mainnet tip, it is
too old. Stop the validator, download a fresh snapshot, and restart. Do NOT let it
try to catch up from an old snapshot — it will take too long and may never converge.
Check with:
```
# Snapshot slot (from filename)
ls /srv/solana/snapshots/snapshot-*.tar.*
# Current mainnet slot
curl -s -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
https://api.mainnet-beta.solana.com
```
### Snapshot Leapfrog Recovery
When the validator is stuck in a repair-dependent gap (incomplete shreds from a
relay outage or insufficient turbine coverage), "grinding through" doesn't work.
At 0.4 slots/sec replay through incomplete blocks vs 2.5 slots/sec chain
production, the gap grows faster than it shrinks.
**Strategy**: Download a fresh snapshot whose slot lands *past* the incomplete zone,
into the range where turbine+relay shreds are accumulating in the blockstore.
**Keep the existing ledger** — it has those shreds. The validator replays from
local blockstore data instead of waiting on repair.
**Steps**:
1. Let the validator run — turbine+relay accumulate shreds at the tip
2. Monitor shred completeness at the tip:
`scripts/check-shred-completeness.sh 500`
3. When there's a contiguous run of complete blocks (>100 slots), note the
starting slot of that run
4. Scale to 0, wipe accounts (ramdisk), wipe old snapshots
5. **Do NOT wipe ledger** — it has the turbine shreds
6. Download a fresh snapshot (its slot should be within the complete run)
7. Scale to 1 — validator replays from local blockstore at 3-5 slots/sec
**Why this works**: Turbine delivers ~60% of shreds in real-time. Repair fills
the rest for recent slots quickly (peers prioritize recent data). The only
problem is repair for *old* slots (minutes/hours behind) which peers deprioritize.
By snapshotting past the gap, we skip the old-slot repair bottleneck entirely.
### Shred Relay (Ashburn)
The TVU shred relay from laconic-was-sw01 provides ~4,000 additional shreds/sec.
Without it, turbine alone delivers ~60% of blocks. With it, completeness improves
but still requires repair for full coverage.
**Current state**: Old pipeline (monitor session + socat + shred-unwrap.py).
The traffic-policy redirect was never committed (auto-revert after 5 min timer).
See `docs/tvu-shred-relay.md` for the traffic-policy config that needs to be
properly applied.
**Boot dependency**: `shred-unwrap.py` must be running on biscayne for the old
pipeline to work. It is NOT persistent across reboots. The iptables DNAT rule
for the new pipeline IS persistent (iptables-persistent installed).
### Redeploy Flow
See `playbooks/biscayne-redeploy.yml`. The scale-to-0 pattern is required because
`laconic-so` creates the cluster and deploys the pod atomically:
1. Delete namespace (teardown)
2. Optionally wipe data
3. `laconic-so deployment start` (creates cluster + pod)
4. Immediately scale to 0
5. Download snapshot via aria2c
6. Scale to 1
7. Verify