stack-orchestrator/CLAUDE.md

# Biscayne Agave Runbook
## Deployment Layers
Operations on biscayne follow a strict layering. Each layer assumes the layers
below it are correct. Playbooks belong to exactly one layer.
| Layer | What | Playbooks |
|-------|------|-----------|
| 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
| 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
| 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` -> `/mnt` in the kind node | `biscayne-redeploy.yml` (deploy tags) |
| 4. Prepare agave | Host storage for agave: ZFS dataset, ramdisk | `biscayne-prepare-agave.yml` |
| 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |
**Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):
- `/srv/kind/solana` is a ZFS dataset (`biscayne/DATA/srv/kind/solana`), child of the `/srv/kind` dataset
- `/srv/kind/solana/ramdisk` is tmpfs (1TB) — accounts must be in RAM
- `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`
These invariants are checked at runtime and persisted to fstab/systemd so they
survive reboot.
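A quick read-only spot-check of these invariants from the host (the playbook's assertions are authoritative; grepping `/proc/mounts` with `awk` is just one way to check):

```shell
# Dataset exists as a child of the /srv/kind dataset.
zfs list -H -o name biscayne/DATA/srv/kind/solana
# Ramdisk is mounted and is tmpfs (prints the mount line if so).
awk '$2 == "/srv/kind/solana/ramdisk" && $3 == "tmpfs"' /proc/mounts
```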
**Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml`
(layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).
## Cluster Operations
### Shutdown Order
The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
The kind node is a Docker container. **Never restart or kill the kind node container
while the validator is running.** Use `agave-validator exit --force` via the admin
RPC socket for graceful shutdown, or scale the deployment to 0 and wait.
Correct shutdown sequence:
1. Scale the deployment to 0 and wait for the pod to terminate:
```
kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
-n laconic-laconic-70ce4c4b47e23b85 --replicas=0
kubectl wait --for=delete pod -l app=laconic-70ce4c4b47e23b85-deployment \
-n laconic-laconic-70ce4c4b47e23b85 --timeout=120s
```
2. Only then restart the kind node if needed:
```
docker restart laconic-70ce4c4b47e23b85-control-plane
```
3. Scale back up:
```
kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
-n laconic-laconic-70ce4c4b47e23b85 --replicas=1
```
### Ramdisk
The accounts directory must be in RAM for performance. tmpfs is used instead of
`/dev/ram0` — simpler (no format-on-boot service needed), resizable on the fly
with `mount -o remount,size=<new>`, and it is what most Solana operators use.
**Boot ordering**: `/srv/kind/solana` is a ZFS dataset mounted automatically by
`zfs-mount.service`. The tmpfs ramdisk fstab entry uses
`x-systemd.requires=zfs-mount.service` to ensure the dataset is mounted first.
**No manual intervention after reboot.**
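A sketch of what the fstab entry looks like (exact size and options on biscayne may differ):

```
# /etc/fstab — ramdisk mounts only after the ZFS dataset below it
tmpfs  /srv/kind/solana/ramdisk  tmpfs  size=1T,x-systemd.requires=zfs-mount.service  0  0
```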
**Mount propagation**: The kind node bind-mounts `/srv/kind` to `/mnt` at container
start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts
(commit `a11d40f2` in stack-orchestrator), so host submounts propagate into the
kind node automatically. A kind restart is required to pick up the new config
after updating laconic-so.
### KUBECONFIG
kubectl must be told where the kubeconfig is when running as root or via ansible:
```
KUBECONFIG=/home/rix/.kube/config kubectl ...
```
The ansible playbooks set this via `environment: { KUBECONFIG: /home/rix/.kube/config }`.
### SSH Agent
SSH to biscayne goes through a ProxyCommand jump host (abernathy.ch2.vaasl.io).
The SSH agent socket rotates when the user reconnects. Find the current one:
```
ls -t /tmp/ssh-*/agent.* | head -1
```
Then export it:
```
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```
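The two steps collapse into one line (assuming at least one agent socket exists):

```shell
# Pick the most recently created agent socket and export it.
export SSH_AUTH_SOCK="$(ls -t /tmp/ssh-*/agent.* 2>/dev/null | head -1)"
```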
### io_uring/ZFS Deadlock — Historical Note
Agave uses io_uring for async I/O. Killing agave ungracefully while it has
outstanding I/O against ZFS can produce unkillable D-state kernel threads
(`io_wq_put_and_exit` blocked on ZFS transactions), deadlocking the container.
**Prevention**: Use graceful shutdown (`agave-validator exit --force` via admin
RPC, or scale to 0 and wait). The `biscayne-stop.yml` playbook enforces this.
With graceful shutdown, io_uring contexts are closed cleanly and ZFS storage
is safe to use directly (no zvol/XFS workaround needed).
**ZFS fix**: The underlying io_uring bug is fixed in ZFS 2.2.8+ (PR #17298).
Biscayne currently runs ZFS 2.2.2. Upgrading ZFS will eliminate the deadlock
risk entirely, even for ungraceful shutdowns.
### laconic-so Architecture
`laconic-so` manages kind clusters atomically — `deployment start` creates the
kind cluster, namespace, PVs, PVCs, and deployment in one shot. There is no way
to create the cluster without deploying the pod.
Key code paths in stack-orchestrator:
- `deploy_k8s.py:up()` — creates everything atomically
- `cluster_info.py:get_pvs()` — translates host paths using `kind-mount-root`
- `helpers_k8s.py:get_kind_pv_bind_mount_path()` — strips `kind-mount-root`
prefix and prepends `/mnt/`
- `helpers_k8s.py:_generate_kind_mounts()` — when `kind-mount-root` is set,
emits a single `/srv/kind` -> `/mnt` mount instead of individual mounts
The `kind-mount-root: /srv/kind` setting in `spec.yml` means all data volumes
whose host paths start with `/srv/kind` get translated to `/mnt/...` inside the
kind node via a single bind mount.
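The translation is a plain prefix swap; a shell sketch with a hypothetical PV host path:

```shell
# Hypothetical PV host path under kind-mount-root.
host_path=/srv/kind/validator-ledger
kind_mount_root=/srv/kind
# Strip the kind-mount-root prefix and prepend /mnt, as
# get_kind_pv_bind_mount_path() does per the notes above.
echo "/mnt${host_path#$kind_mount_root}"   # prints /mnt/validator-ledger
```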
### Key Identifiers
- Kind cluster: `laconic-70ce4c4b47e23b85`
- Namespace: `laconic-laconic-70ce4c4b47e23b85`
- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/kind/solana/snapshots` (ZFS dataset, visible to kind at `/mnt/validator-snapshots`)
- Ledger dir: `/srv/kind/solana/ledger` (ZFS dataset, visible to kind at `/mnt/validator-ledger`)
- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (tmpfs ramdisk, visible to kind at `/mnt/validator-accounts`)
- Log dir: `/srv/kind/solana/log` (ZFS dataset, visible to kind at `/mnt/validator-log`)
- **WARNING**: `/srv/solana` is just a directory on the parent ZFS dataset, not a data path. All data paths use `/srv/kind/solana`.
- Host bind mount root: `/srv/kind` -> kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)
### PV Mount Paths (inside kind node)
| PV Name | hostPath |
|----------------------|-------------------------------|
| validator-snapshots | /mnt/validator-snapshots |
| validator-ledger | /mnt/validator-ledger |
| validator-accounts | /mnt/validator-accounts |
| validator-log | /mnt/validator-log |
### Snapshot Freshness
If the snapshot is more than **20,000 slots behind** the current mainnet tip, it is
too old. Stop the validator, download a fresh snapshot, and restart. Do NOT let it
try to catch up from an old snapshot — it will take too long and may never converge.
Check with:
```
# Snapshot slot (from filename)
ls /srv/kind/solana/snapshots/snapshot-*.tar.*
# Current mainnet slot
curl -s -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
https://api.mainnet-beta.solana.com
```
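A sketch that computes the gap directly (assumes the usual `snapshot-<slot>-<hash>.tar.zst` naming and that `jq` is installed):

```shell
# Newest snapshot slot, parsed from the filename.
snap=$(ls -t /srv/kind/solana/snapshots/snapshot-*.tar.* | head -1)
snap_slot=$(basename "$snap" | sed -E 's/^snapshot-([0-9]+)-.*$/\1/')
# Current finalized mainnet slot.
tip=$(curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
  https://api.mainnet-beta.solana.com | jq -r .result)
echo "snapshot is $((tip - snap_slot)) slots behind (threshold: 20000)"
```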
### Snapshot Leapfrog Recovery
When the validator is stuck in a repair-dependent gap (incomplete shreds from a
relay outage or insufficient turbine coverage), "grinding through" doesn't work.
At 0.4 slots/sec replay through incomplete blocks vs 2.5 slots/sec chain
production, the gap grows faster than it shrinks.
**Strategy**: Download a fresh snapshot whose slot lands *past* the incomplete zone,
into the range where turbine+relay shreds are accumulating in the blockstore.
**Keep the existing ledger** — it has those shreds. The validator replays from
local blockstore data instead of waiting on repair.
**Steps**:
1. Let the validator run — turbine+relay accumulate shreds at the tip
2. Monitor shred completeness at the tip:
`scripts/check-shred-completeness.sh 500`
3. When there's a contiguous run of complete blocks (>100 slots), note the
starting slot of that run
4. Scale to 0, wipe accounts (ramdisk), wipe old snapshots
5. **Do NOT wipe ledger** — it has the turbine shreds
6. Download a fresh snapshot (its slot should be within the complete run)
7. Scale to 1 — validator replays from local blockstore at 3-5 slots/sec
**Why this works**: Turbine delivers ~60% of shreds in real-time. Repair fills
the rest for recent slots quickly (peers prioritize recent data). The only
problem is repair for *old* slots (minutes/hours behind) which peers deprioritize.
By snapshotting past the gap, we skip the old-slot repair bottleneck entirely.
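Steps 4-7 above as a hedged shell sketch (identifiers and paths from Key Identifiers below; verify before running):

```shell
NS=laconic-laconic-70ce4c4b47e23b85
DEPLOY=laconic-70ce4c4b47e23b85-deployment

# Step 4: scale to 0, wait, then wipe accounts (tmpfs) and old snapshots.
kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=0
kubectl wait --for=delete pod -l app="$DEPLOY" -n "$NS" --timeout=120s
rm -rf /srv/kind/solana/ramdisk/accounts/*
rm -f  /srv/kind/solana/snapshots/snapshot-*
# Step 5: /srv/kind/solana/ledger is deliberately left untouched.
# Step 6: download a fresh snapshot into /srv/kind/solana/snapshots.
# Step 7: scale back up; replay proceeds from the local blockstore.
kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=1
```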
### Shred Relay (Ashburn)
The TVU shred relay from laconic-was-sw01 provides ~4,000 additional shreds/sec.
Without it, turbine alone delivers ~60% of blocks. With it, completeness improves
but still requires repair for full coverage.
**Current state**: Old pipeline (monitor session + socat + shred-unwrap.py).
The traffic-policy redirect was never committed (auto-revert after 5 min timer).
See `docs/tvu-shred-relay.md` for the traffic-policy config that needs to be
properly applied.
**Boot dependency**: `shred-unwrap.py` must be running on biscayne for the old
pipeline to work. It is NOT persistent across reboots. The iptables DNAT rule
for the new pipeline IS persistent (iptables-persistent installed).
### Redeploy Flow
See `playbooks/biscayne-redeploy.yml`. The scale-to-0 pattern is required because
`laconic-so` creates the cluster and deploys the pod atomically:
1. Delete namespace (teardown)
2. Optionally wipe data
3. `laconic-so deployment start` (creates cluster + pod)
4. Immediately scale to 0
5. Download snapshot via aria2c
6. Scale to 1
7. Verify
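The flow, sketched with the identifiers from Key Identifiers (`biscayne-redeploy.yml` is authoritative; the snapshot URL is a placeholder):

```shell
NS=laconic-laconic-70ce4c4b47e23b85
DEPLOY=laconic-70ce4c4b47e23b85-deployment

kubectl delete namespace "$NS" --wait=true                 # 1. teardown
# 2. optionally wipe data (keep the ledger; see Snapshot Leapfrog Recovery)
laconic-so deployment start                                # 3. creates cluster + pod
kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=0   # 4. park the pod immediately
aria2c -d /srv/kind/solana/snapshots "$SNAPSHOT_URL"       # 5. placeholder URL
kubectl scale deployment "$DEPLOY" -n "$NS" --replicas=1   # 6. bring up
kubectl get pods -n "$NS"                                  # 7. verify
```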