# Biscayne Agave Runbook

## Deployment Layers

Operations on biscayne follow a strict layering. Each layer assumes the layers
below it are correct. Playbooks belong to exactly one layer.

| Layer | What | Playbooks |
|-------|------|-----------|
| 1. Base system | Docker, ZFS, packages | Out of scope (manual/PXE) |
| 2. Prepare kind | `/srv/kind` exists (ZFS dataset) | None needed (ZFS handles it) |
| 3. Install kind | `laconic-so deployment start` creates kind cluster, mounts `/srv/kind` → `/mnt` in kind node | `biscayne-redeploy.yml` (deploy tags) |
| 4. Prepare agave | Host storage for agave: ZFS dataset, ramdisk | `biscayne-prepare-agave.yml` |
| 5. Deploy agave | Deploy agave-stack into kind, snapshot download, scale up | `biscayne-redeploy.yml` (snapshot/verify tags), `biscayne-recover.yml` |

**Layer 4 invariants** (asserted by `biscayne-prepare-agave.yml`):

- `/srv/kind/solana` is a ZFS dataset (`biscayne/DATA/srv/kind/solana`), child of the `/srv/kind` dataset
- `/srv/kind/solana/ramdisk` is tmpfs (1TB) — accounts must be in RAM
- `/srv/solana` is NOT the data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`

These invariants are checked at runtime and persisted to fstab/systemd so they
survive reboot.

**Cross-cutting**: `health-check.yml` (read-only diagnostics), `biscayne-stop.yml`
(layer 5 — graceful shutdown), `fix-pv-mounts.yml` (layer 5 — PV repair).

## Cluster Operations

### Shutdown Order

The agave validator runs inside a kind-based k8s cluster managed by `laconic-so`.
The kind node is a Docker container. **Never restart or kill the kind node container
while the validator is running.** Use `agave-validator exit --force` via the admin
RPC socket for graceful shutdown, or scale the deployment to 0 and wait.

Correct shutdown sequence:

1. Scale the deployment to 0 and wait for the pod to terminate:
   ```
   kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --replicas=0
   kubectl wait --for=delete pod -l app=laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --timeout=120s
   ```
2. Only then restart the kind node if needed:
   ```
   docker restart laconic-70ce4c4b47e23b85-control-plane
   ```
3. Scale back up:
   ```
   kubectl scale deployment laconic-70ce4c4b47e23b85-deployment \
     -n laconic-laconic-70ce4c4b47e23b85 --replicas=1
   ```

### Ramdisk

The accounts directory must be in RAM for performance. tmpfs is used instead of
`/dev/ram0` — it's simpler (no format-on-boot service needed), resizable on the fly
with `mount -o remount,size=<new>`, and it's what most Solana operators use.

**Boot ordering**: `/srv/kind/solana` is a ZFS dataset mounted automatically by
`zfs-mount.service`. The tmpfs ramdisk fstab entry uses
`x-systemd.requires=zfs-mount.service` to ensure the dataset is mounted first.
**No manual intervention after reboot.**
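
The persisted fstab entry might look like this — a sketch only; the exact size and
option fields on the host may differ:

```
tmpfs  /srv/kind/solana/ramdisk  tmpfs  size=1T,x-systemd.requires=zfs-mount.service  0  0
```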

**Mount propagation**: The kind node bind-mounts `/srv/kind` → `/mnt` at container
start. laconic-so sets `propagation: HostToContainer` on all kind extraMounts
(commit `a11d40f2` in stack-orchestrator), so host submounts propagate into the
kind node automatically. A kind restart is required to pick up the new config
after updating laconic-so.

### KUBECONFIG

kubectl must be told where the kubeconfig is when running as root or via ansible:

```
KUBECONFIG=/home/rix/.kube/config kubectl ...
```

The ansible playbooks set this via `environment: {KUBECONFIG: /home/rix/.kube/config}`.
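
In playbook terms, that is a task-level `environment` block — a minimal sketch
(the task itself is illustrative, not copied from the playbooks):

```yaml
- name: Check pod status (illustrative task)
  ansible.builtin.command: kubectl get pods -n laconic-laconic-70ce4c4b47e23b85
  environment:
    KUBECONFIG: /home/rix/.kube/config
```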

### SSH Agent

SSH to biscayne goes through a ProxyCommand jump host (abernathy.ch2.vaasl.io).
The SSH agent socket rotates when the user reconnects. Find the current one:

```
ls -t /tmp/ssh-*/agent.* | head -1
```

Then export it:

```
export SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.NNNN
```
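
The two steps can be combined — a sketch where `newest_path` is a hypothetical
helper, not an existing script on biscayne:

```shell
# newest_path: print the most recently modified path matching the given globs.
newest_path() {
  ls -t "$@" 2>/dev/null | head -1
}

# Usage on biscayne (only if at least one forwarded agent socket exists):
#   export SSH_AUTH_SOCK="$(newest_path /tmp/ssh-*/agent.*)"
```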

### io_uring/ZFS Deadlock — Historical Note

Agave uses io_uring for async I/O. Killing agave ungracefully while it has
outstanding I/O against ZFS can produce unkillable D-state kernel threads
(`io_wq_put_and_exit` blocked on ZFS transactions), deadlocking the container.

**Prevention**: Use graceful shutdown (`agave-validator exit --force` via admin
RPC, or scale to 0 and wait). The `biscayne-stop.yml` playbook enforces this.
With graceful shutdown, io_uring contexts are closed cleanly and ZFS storage
is safe to use directly (no zvol/XFS workaround needed).

**ZFS fix**: The underlying io_uring bug is fixed in ZFS 2.2.8+ (PR #17298).
Biscayne currently runs ZFS 2.2.2. Upgrading ZFS (`biscayne-upgrade-zfs.yml`
upgrades to 2.2.9 via the arter97/zfs-lts PPA) eliminates the deadlock risk
entirely, even for ungraceful shutdowns.

### laconic-so Architecture

`laconic-so` manages kind clusters atomically — `deployment start` creates the
kind cluster, namespace, PVs, PVCs, and deployment in one shot. There is no way
to create the cluster without deploying the pod.

Key code paths in stack-orchestrator:

- `deploy_k8s.py:up()` — creates everything atomically
- `cluster_info.py:get_pvs()` — translates host paths using `kind-mount-root`
- `helpers_k8s.py:get_kind_pv_bind_mount_path()` — strips `kind-mount-root`
  prefix and prepends `/mnt/`
- `helpers_k8s.py:_generate_kind_mounts()` — when `kind-mount-root` is set,
  emits a single `/srv/kind` → `/mnt` mount instead of individual mounts

The `kind-mount-root: /srv/kind` setting in `spec.yml` means all data volumes
whose host paths start with `/srv/kind` get translated to `/mnt/...` inside the
kind node via a single bind mount.
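
The translation itself is plain prefix surgery — a sketch of the equivalent shell
logic (the host path below is illustrative; the real mapping lives in
stack-orchestrator's Python, not in a shell script):

```shell
# Mimic get_kind_pv_bind_mount_path(): strip the kind-mount-root prefix
# from a host path and prepend /mnt.
kind_mount_root="/srv/kind"
host_path="/srv/kind/validator-snapshots"   # illustrative spec.yml volume path

in_kind_path="/mnt${host_path#"$kind_mount_root"}"
echo "$in_kind_path"   # /mnt/validator-snapshots
```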

### Key Identifiers

- Kind cluster: `laconic-70ce4c4b47e23b85`
- Namespace: `laconic-laconic-70ce4c4b47e23b85`
- Deployment: `laconic-70ce4c4b47e23b85-deployment`
- Kind node container: `laconic-70ce4c4b47e23b85-control-plane`
- Deployment dir: `/srv/deployments/agave`
- Snapshot dir: `/srv/kind/solana/snapshots` (ZFS dataset, visible to kind at `/mnt/validator-snapshots`)
- Ledger dir: `/srv/kind/solana/ledger` (ZFS dataset, visible to kind at `/mnt/validator-ledger`)
- Accounts dir: `/srv/kind/solana/ramdisk/accounts` (tmpfs ramdisk, visible to kind at `/mnt/validator-accounts`)
- Log dir: `/srv/kind/solana/log` (ZFS dataset, visible to kind at `/mnt/validator-log`)
- **WARNING**: `/srv/solana` is NOT a data path — it's a directory on the parent ZFS dataset. All data paths use `/srv/kind/solana`.
- Host bind mount root: `/srv/kind` → kind node `/mnt`
- laconic-so: `/home/rix/.local/bin/laconic-so` (editable install)
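
Note the naming pattern: every k8s name is derived from the kind cluster ID, which
is why the namespace doubles the `laconic-` prefix. A sketch:

```shell
# All names derive from the kind cluster ID.
cluster="laconic-70ce4c4b47e23b85"

namespace="laconic-${cluster}"         # laconic-laconic-70ce4c4b47e23b85
deployment="${cluster}-deployment"     # laconic-70ce4c4b47e23b85-deployment
kind_node="${cluster}-control-plane"   # laconic-70ce4c4b47e23b85-control-plane

echo "$namespace $deployment $kind_node"
```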

### PV Mount Paths (inside kind node)

| PV Name             | hostPath                 |
|---------------------|--------------------------|
| validator-snapshots | /mnt/validator-snapshots |
| validator-ledger    | /mnt/validator-ledger    |
| validator-accounts  | /mnt/validator-accounts  |
| validator-log       | /mnt/validator-log       |

### Snapshot Freshness

If the snapshot is more than **20,000 slots behind** the current mainnet tip, it is
too old. Stop the validator, download a fresh snapshot, and restart. Do NOT let it
try to catch up from an old snapshot — it will take too long and may never converge.

Check with:

```
# Snapshot slot (from filename)
ls /srv/kind/solana/snapshots/snapshot-*.tar.*

# Current mainnet slot
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}' \
  https://api.mainnet-beta.solana.com
```
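
A sketch that puts the two numbers together — the filename and tip slot below are
made up; in practice the tip comes from the `getSlot` response above:

```shell
# Snapshot filenames embed the slot as the second dash-separated field:
# snapshot-<slot>-<hash>.tar.zst
snapshot_file="snapshot-344070000-7QxAbc.tar.zst"   # illustrative
snap_slot=$(basename "$snapshot_file" | cut -d- -f2)

current_slot=344085000   # illustrative; parse from the getSlot RPC response
lag=$(( current_slot - snap_slot ))

if (( lag > 20000 )); then
  echo "snapshot is $lag slots behind: too old, download a fresh one"
else
  echo "snapshot is $lag slots behind: OK"
fi
```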

### Snapshot Leapfrog Recovery

When the validator is stuck in a repair-dependent gap (incomplete shreds from a
relay outage or insufficient turbine coverage), "grinding through" doesn't work.
At 0.4 slots/sec replay through incomplete blocks vs 2.5 slots/sec chain
production, the gap grows faster than it shrinks.
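
The arithmetic is stark — a one-line check using the rates quoted above:

```shell
# Net progress while replaying incomplete blocks: replay rate minus
# chain production rate. Negative means the validator falls further behind.
awk 'BEGIN { printf "net: %.1f slots/sec\n", 0.4 - 2.5 }'   # net: -2.1 slots/sec
```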

**Strategy**: Download a fresh snapshot whose slot lands *past* the incomplete zone,
into the range where turbine+relay shreds are accumulating in the blockstore.
**Keep the existing ledger** — it has those shreds. The validator replays from
local blockstore data instead of waiting on repair.

**Steps**:

1. Let the validator run — turbine+relay accumulate shreds at the tip
2. Monitor shred completeness at the tip:
   `scripts/check-shred-completeness.sh 500`
3. When there's a contiguous run of complete blocks (>100 slots), note the
   starting slot of that run
4. Scale to 0, wipe accounts (ramdisk), wipe old snapshots
5. **Do NOT wipe ledger** — it has the turbine shreds
6. Download a fresh snapshot (its slot should be within the complete run)
7. Scale to 1 — validator replays from local blockstore at 3-5 slots/sec
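
Step 3's "contiguous run" check is just a scan over the completeness output — a
sketch with made-up slot data (`find_run_start` is a hypothetical helper; the real
script's output format may differ):

```shell
# Given a sorted list of slots marked complete, print the start of the
# first run of at least min_len consecutive complete slots.
find_run_start() {
  local min_len=$1; shift
  local prev=-10 start=-1 len=0 s
  for s in "$@"; do
    if (( s == prev + 1 )); then
      (( len++ ))
    else
      start=$s; len=1
    fi
    prev=$s
    if (( len >= min_len )); then echo "$start"; return 0; fi
  done
  return 1
}

# Illustrative: slots 100-102 complete, gap, then 200-205 complete.
find_run_start 4 100 101 102 200 201 202 203 204 205   # prints 200
```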

**Why this works**: Turbine delivers ~60% of shreds in real-time. Repair fills
the rest for recent slots quickly (peers prioritize recent data). The only
problem is repair for *old* slots (minutes/hours behind), which peers deprioritize.
By snapshotting past the gap, we skip the old-slot repair bottleneck entirely.

### Shred Relay (Ashburn)

The TVU shred relay from laconic-was-sw01 provides ~4,000 additional shreds/sec.
Without it, turbine alone delivers ~60% of blocks. With it, completeness improves
but still requires repair for full coverage.

**Current state**: Old pipeline (monitor session + socat + shred-unwrap.py).
The traffic-policy redirect was never committed (auto-revert after 5 min timer).
See `docs/tvu-shred-relay.md` for the traffic-policy config that needs to be
properly applied.

**Boot dependency**: `shred-unwrap.py` must be running on biscayne for the old
pipeline to work. It is NOT persistent across reboots. The iptables DNAT rule
for the new pipeline IS persistent (iptables-persistent installed).

### Redeploy Flow

See `playbooks/biscayne-redeploy.yml`. The scale-to-0 pattern is required because
`laconic-so` creates the cluster and deploys the pod atomically:

1. Delete namespace (teardown)
2. Optionally wipe data
3. `laconic-so deployment start` (creates cluster + pod)
4. Immediately scale to 0
5. Download snapshot via aria2c
6. Scale to 1
7. Verify