Process bug fix: no pre-commit existed for this repo's Python code.
Added pyproject.toml with unified dependencies (ruff, mypy, ansible-lint)
and .pre-commit-config.yaml with repo-based hooks (ruff) and local uv-run
hooks (mypy, ansible-lint).
Fixed 249 ruff errors (B023, B904, B006, B007, UP008, UP031, C408),
~13 mypy type errors, and 11 ansible-lint violations; applied ruff-format
across all Python files, including the stack-orchestrator subtree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kind load docker-image serializes the full image (docker save | ctr import),
taking 5-10 minutes per cluster recreate. Replace with a persistent local
registry (registry:2 on port 5001) that survives kind cluster deletes.
stack-orchestrator changes:
- helpers.py: replace load_images_into_kind() with ensure_local_registry(),
connect_registry_to_kind_network(), push_images_to_local_registry()
- helpers.py: add registry mirror to containerdConfigPatches so kind nodes
pull from localhost:5001 via the kind-registry container
- deploy_k8s.py: rewrite local container image refs to localhost:5001/...
so containerd pulls from the registry instead of local store
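A rough sketch of the new helpers.py flow, assuming the docker CLI is
driven via subprocess (function and registry names as listed above,
everything else illustrative):

  import subprocess

  REGISTRY_NAME = "kind-registry"   # assumed container name
  REGISTRY_PORT = 5001

  def _run(*args, check=True, capture=False):
      return subprocess.run(args, check=check,
                            capture_output=capture, text=True)

  def ensure_local_registry():
      # A plain docker container, so it survives kind cluster deletes.
      if _run("docker", "inspect", REGISTRY_NAME,
              check=False, capture=True).returncode != 0:
          _run("docker", "run", "-d", "--restart=always",
               "-p", f"127.0.0.1:{REGISTRY_PORT}:5000",
               "--name", REGISTRY_NAME, "registry:2")

  def connect_registry_to_kind_network():
      # Attach the registry to the "kind" docker network; ignore the
      # error if it is already connected.
      _run("docker", "network", "connect", "kind", REGISTRY_NAME,
           check=False)

  def push_images_to_local_registry(images):
      # Tag and push each locally built image; kind nodes then pull it
      # through the containerd registry mirror instead of importing a
      # docker save tarball.
      for image in images:
          target = f"localhost:{REGISTRY_PORT}/{image}"
          _run("docker", "tag", image, target)
          _run("docker", "push", target)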
Ansible changes:
- biscayne-sync-tools.yml: ensure registry container before build, then
tag+push to local registry after build (build-container tag)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three related fixes in the k8s deployer restart/up flow:
1. Clear stale claimRefs on Released PVs (_clear_released_pv_claim_refs):
After namespace deletion, PVs survive in Released state with claimRefs
pointing to deleted PVC UIDs. New PVCs can't bind until the stale
claimRef is removed. Now clears them before PVC creation.
2. Wait for namespace termination (_wait_for_namespace_deletion):
_ensure_namespace() now detects a terminating namespace and polls
until deletion completes (up to 120s) before creating the new one.
Replaces the racy 5s sleep in deployment restart.
3. Resilient PVC creation: wrap each PVC creation in error handling so
one failure doesn't prevent subsequent PVCs from being attempted.
All errors are collected and reported together.
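A minimal sketch of fixes 1 and 2, assuming the kubernetes Python client
that deploy_k8s.py already uses (argument plumbing simplified):

  import time
  from kubernetes import client
  from kubernetes.client.rest import ApiException

  def _clear_released_pv_claim_refs(core_api: client.CoreV1Api):
      # A Released PV still points at the deleted PVC's UID; dropping
      # the claimRef returns it to Available so the new PVC can bind.
      for pv in core_api.list_persistent_volume().items:
          if pv.status and pv.status.phase == "Released":
              core_api.patch_persistent_volume(
                  pv.metadata.name, {"spec": {"claimRef": None}})

  def _wait_for_namespace_deletion(core_api: client.CoreV1Api,
                                   name: str, timeout: int = 120):
      # Poll until the terminating namespace is really gone (404)
      # instead of sleeping a fixed 5s and racing the deletion.
      deadline = time.time() + timeout
      while time.time() < deadline:
          try:
              core_api.read_namespace(name)
          except ApiException as e:
              if e.status == 404:
                  return
              raise
          time.sleep(2)
      raise TimeoutError(f"namespace {name} still terminating after {timeout}s")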
Closes: bar-6cb, bar-31a, bar-fec
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use \033[H\033[J (home + clear-to-end) instead of just \033[H to
prevent stale lines from previous frames persisting when output
shrinks between refreshes.
- Fix cursor restore on exit: the code emitted \033[?25l (hide) instead
of \033[?25h (show), leaving the terminal with an invisible cursor.
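For reference, the escape sequences involved (a minimal illustration,
not the tool's actual rendering code):

  import sys

  CURSOR_HOME  = "\033[H"      # move cursor to top-left
  CLEAR_TO_END = "\033[J"      # erase from cursor to end of screen
  CURSOR_HIDE  = "\033[?25l"
  CURSOR_SHOW  = "\033[?25h"   # what must be emitted on exit

  def redraw(frame: str):
      # Home plus clear-to-end, so a shorter frame does not leave stale
      # lines from the previous, taller frame on screen.
      sys.stdout.write(CURSOR_HOME + CLEAR_TO_END + frame)
      sys.stdout.flush()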
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also fix the --include filter: the container name uses a slash
(laconicnetwork/agave), not a dash (laconicnetwork-agave). The old
filter silently skipped the build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for biscayne-restart.yml:
1. ansible_become_flags: "-E" on the restart task preserves SSH_AUTH_SOCK
through sudo so laconic-so can git pull the stack repo.
2. After restart, clear claimRef on any Released PVs. laconic-so restart
deletes the namespace (cascading to the PVCs) and then recreates it, but
the PVs retain stale claimRefs that prevent the new PVCs from binding.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Uses laconic-so deployment restart (GitOps) to pick up new container
images and config. Gracefully stops the validator first (scale to 0,
wait for pod termination, verify no agave processes). Preserves the
kind cluster, all data volumes, and cluster state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moving container scripts into agave-stack subtree (correct direction).
The source of truth will be agave-stack/ in this repo, pushed out to
LaconicNetwork/agave-stack via git subtree push.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ip_echo_preflight.py: reimplements the Solana ip_echo client protocol in
Python. It verifies UDP port reachability before the snapshot download
and is called from entrypoint.py, so we no longer spend hours on a
snapshot download only to crash-loop on an unreachable port.
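The overall shape of the check, sketched under the assumption that the
bincode-framed ip_echo request is handled elsewhere (trigger_probe below
is a hypothetical stand-in for it, not the real API):

  import select, socket, time

  def check_udp_reachability(trigger_probe, ports, timeout=10.0):
      # trigger_probe(ports) asks the remote ip_echo server to send
      # packets back to us; it stands in for the real request framing
      # implemented in ip_echo_preflight.py.
      socks = []
      for port in ports:
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          s.bind(("0.0.0.0", port))
          s.setblocking(False)
          socks.append(s)
      trigger_probe(ports)
      reachable = set()
      deadline = time.monotonic() + timeout
      while socks and time.monotonic() < deadline:
          readable, _, _ = select.select(socks, [], [], 1.0)
          for s in readable:
              s.recvfrom(2048)
              reachable.add(s.getsockname()[1])
              socks.remove(s)
      return {p: p in reachable for p in ports}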
docs/postmortem-ashburn-relay-outbound.md: root cause analysis of the
firewalld nftables FORWARD chain blocking outbound relay traffic.
docs/ashburn-relay-checklist.md: 7-layer verification checklist for
relay path debugging.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Firewalld zones/policies for forwarding (Docker bridge → gre-ashburn),
iptables for Docker-specific rules (DNAT, DOCKER-USER, mangle, SNAT).
Both coexist at different netfilter priorities.
See docs/postmortem-ashburn-relay-outbound.md for root cause analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entrypoint changes:
- Always require full + incremental before starting (retry until found)
- Check incremental freshness against convergence threshold (500 slots)
- Gap monitor thread: if validator falls >5000 slots behind for 3
consecutive checks, graceful stop + restart with fresh incremental
- cmd_serve is now a loop: download → run → monitor → leapfrog → repeat
- --no-snapshot-fetch moved to common args (both RPC and validator modes)
- --maximum-full-snapshots-to-retain default 1 (validator deletes
downloaded full after generating its own)
- SNAPSHOT_MAX_AGE_SLOTS default 100000 (one full snapshot generation)
snapshot_download.py refactoring:
- Extract _discover_and_benchmark() and _rolling_incremental_download()
as shared helpers
- Restore download_incremental_for_slot() using shared helpers (downloads
only an incremental for an existing full snapshot)
- download_best_snapshot() uses shared helpers, downloads full then
incremental as separate operations
The leapfrog cycle: validator generates full snapshots at standard 100k
block height intervals (same slots as the rest of the network). When the
gap monitor triggers, the entrypoint loops back to maybe_download_snapshot
which finds the validator's local full, downloads a fresh network
incremental (generated every ~40s, converges within the ~11hr full
generation window), and restarts.
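The gap monitor half of that loop, sketched (thresholds as above; the
callbacks are illustrative, not the entrypoint's real signatures):

  import threading

  GAP_THRESHOLD = 5000   # slots behind head before leapfrog is considered
  GAP_STRIKES = 3        # consecutive bad checks before restarting

  def run_gap_monitor(get_gap, request_restart, stop_event, interval=60.0):
      # Background thread body: count consecutive checks where the
      # validator is more than GAP_THRESHOLD slots behind, then ask
      # cmd_serve to loop back to maybe_download_snapshot.
      strikes = 0
      while not stop_event.wait(interval):
          strikes = strikes + 1 if get_gap() > GAP_THRESHOLD else 0
          if strikes >= GAP_STRIKES:
              request_restart()
              strikes = 0

  # Started once per validator run inside cmd_serve, e.g.:
  stop = threading.Event()
  threading.Thread(target=run_gap_monitor,
                   args=(lambda: 0, lambda: None, stop),
                   daemon=True).start()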
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Overwrite lines in place instead of clear+redraw (no flicker)
- Pad lines to terminal width to clear stale characters
- Blank leftover rows when output shrinks between frames
- Hide cursor during watch mode
- Remove section comment bars
- Replace unicode checkmarks with +/x
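A minimal illustration of the in-place redraw (assumes a plain ANSI
terminal; not the tool's actual code):

  import shutil, sys

  def draw_frame(lines, prev_line_count):
      width = shutil.get_terminal_size().columns
      sys.stdout.write("\033[H")          # home, no full clear: no flicker
      for line in lines:
          # Pad to terminal width so shorter text overwrites stale chars.
          sys.stdout.write(line[:width].ljust(width) + "\n")
      for _ in range(max(0, prev_line_count - len(lines))):
          # Blank rows left over from a taller previous frame.
          sys.stdout.write(" " * width + "\n")
      sys.stdout.flush()
      return len(lines)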
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
laconic-so creates PV hostPath dirs as root. Grafana runs as UID 472
and crashes on startup because it can't write to /var/lib/grafana.
Fix ownership inside the kind node before scaling the deployment up.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- biscayne-migrate-storage.yml: stop docker to release bind mounts
before destroying the zvol, skip the data copy (the data is stale and a
fresh snapshot is needed anyway), handle a partially-migrated state, and
restart docker at the end
- biscayne-upgrade-zfs.yml: use the add-apt-repository CLI (the module
times out), fix the libzfs package name (libzfs4linux, not 5), and
tolerate apt update warnings from the stale influxdata GPG key
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of hardcoding the laconic cluster ID, namespace, deployment
name, and pod label, read cluster-id from deployment.yml on biscayne
and derive everything from it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The recovery playbook now exits after scaling to 1. The container
entrypoint handles snapshot download (60+ min) and validator startup
autonomously. Removed all polling/verification steps that would
time out waiting.
Added scripts/check-status.py for monitoring download progress,
validator slot, gap to mainnet, catch-up rate, and ramdisk usage.
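The catch-up-rate arithmetic, roughly as check-status.py computes it
(a sketch: the validator RPC port 8899 and the 30s sample window are
assumptions; getSlot is the standard Solana JSON-RPC method):

  import json, time, urllib.request

  def get_slot(url):
      req = urllib.request.Request(
          url, data=json.dumps({"jsonrpc": "2.0", "id": 1,
                                "method": "getSlot"}).encode(),
          headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req, timeout=10) as resp:
          return json.load(resp)["result"]

  LOCAL = "http://localhost:8899"
  MAINNET = "https://api.mainnet-beta.solana.com"

  l1, m1, t1 = get_slot(LOCAL), get_slot(MAINNET), time.time()
  time.sleep(30)
  l2, m2, t2 = get_slot(LOCAL), get_slot(MAINNET), time.time()

  local_rate = (l2 - l1) / (t2 - t1)   # slots/s the validator processes
  net_rate = (m2 - m1) / (t2 - t1)     # slots/s mainnet advances
  gap = m2 - l2
  catch_up = local_rate - net_rate     # must be positive to ever catch up
  eta = gap / catch_up if catch_up > 0 else float("inf")
  print(f"gap={gap} slots, catch-up {catch_up:.2f} slots/s, ETA {eta/60:.0f} min")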
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container's entrypoint.py already handles snapshot freshness checks,
cleanup, download (with rolling incremental convergence), and validator
startup. Remove the host-side download and let the container do the work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without a bound, the loop runs forever if the sources never serve an
incremental close enough to head (e.g. the full snapshot's base slot is
too old). After 30 minutes, proceed with the best incremental available,
or with none.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After the full snapshot downloads, continuously re-probe all fast sources
for newer incrementals until the best available is within convergence_slots
(default 500) of head. Each iteration finds the highest-slot incremental
matching our full snapshot's base slot, downloads it (replacing any previous),
and checks the gap to mainnet head.
- Extract probe_incremental() from inline re-probe code
- Add convergence_slots param to download_best_snapshot() (default 500)
- Add --convergence-slots CLI arg
- Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py
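A simplified sketch of the convergence loop, folding in the 30-minute
bound from the commit above (probe_incremental and download stand in for
the real helpers; source probing and the aria2c invocation are omitted):

  import time

  def converge_incrementals(full_base_slot, head_slot, probe_incremental,
                            download, convergence_slots=500,
                            max_duration=30 * 60):
      deadline = time.time() + max_duration
      best = None
      while True:
          # Highest-slot incremental matching our full's base slot.
          cand = probe_incremental(full_base_slot)
          if cand and (best is None or cand.slot > best.slot):
              download(cand)            # replaces any previous incremental
              best = cand
          gap = head_slot() - (best.slot if best else full_base_slot)
          if gap <= convergence_slots:
              return best
          if time.time() > deadline:
              return best               # best available, or None
          time.sleep(10)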
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.
- biscayne-redeploy.yml: remove snapshot download section, simplify to
teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
hostNetwork translation code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scripts/agave-container/ is a git subtree of agave-stack's
container-build directory. This replaces the fragile cross-repo symlink
with a proper subtree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
rules for both host IP (186.233.184.235) and relay loopback (137.239.194.65).
Idempotent, persists via netfilter-persistent.
- scripts/snapshot-download.py: replaced standalone copy with symlink to
agave-stack source of truth, eliminating duplication.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remounting tmpfs is instant (the kernel just frees the pages), while
rm -rf on 400GB+ of accounts files traverses every inode. The recover
playbook keeps rm -rf because the kind node's bind mount prevents umount
while the container is running.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity left over from a migration mix-up; there was no actual tmpfs
bug with io_uring. tmpfs is simpler (no format-on-boot), resizable on
the fly, and what every other Solana operator uses.
Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small)
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactors K8sDeployer.up() into three composable methods:
- _setup_cluster_and_namespace(): kind cluster, API, namespace, ingress
- _create_infrastructure(): PVs, PVCs, ConfigMaps, Services, NodePorts
- _create_deployment(): Deployment resource (pods)
`prepare` calls the first two only — creates all cluster infrastructure
without starting pods. This eliminates the scale-to-0 workaround where
operators had to run `deployment start` then immediately scale down.
Usage: laconic-so deployment --dir <dir> prepare
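The resulting composition, sketched (method bodies elided; whether up()
literally calls prepare() or the three helpers directly is an
implementation detail):

  class K8sDeployer:
      def _setup_cluster_and_namespace(self): ...
      def _create_infrastructure(self): ...
      def _create_deployment(self): ...

      def prepare(self):
          # Everything except pods.
          self._setup_cluster_and_namespace()
          self._create_infrastructure()

      def up(self):
          self.prepare()
          self._create_deployment()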
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track stack-orchestrator work items with pebbles (append-only event log).
Epic so-076: Stack composition — deploy multiple stacks into one kind cluster
with independent lifecycle management per sub-stack.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert snapshot_dir to /srv/solana/snapshots — aria2c runs on the host
where this is the direct zvol mount (always available), unlike
/srv/kind/solana/snapshots which depends on the bind mount
- Add laconic_so_branch variable (default: main) and use it in both
git reset commands so the branch can be overridden via -e
- Move "Verify ramdisk visible inside kind node" from preflight to after
"Wait for deployment to exist" — the kind container may not exist
during preflight after teardown
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K8s PUT (replace) operations require metadata.resourceVersion for
optimistic concurrency control. Services additionally have an immutable
spec.clusterIP that must be preserved from the existing object.
On 409 conflict, all _ensure_* methods now read the existing resource
first and copy resourceVersion (and clusterIP for Services) into the
body before calling replace.
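For the Service case the 409 path now does roughly this (kubernetes
Python client attribute names; variable names illustrative):

  existing = core_api.read_namespaced_service(name, namespace)
  body.metadata.resource_version = existing.metadata.resource_version
  body.spec.cluster_ip = existing.spec.cluster_ip  # immutable, carry over
  core_api.replace_namespaced_service(name, namespace, body)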
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add laconic_so_repo variable (/home/rix/stack-orchestrator) and a
git pull task before deployment start — the editable install must be
current or stale code causes deploy failures
- Downgrade unified mount root check from fatal assertion to debug
warning — the mount style depends on which laconic-so version is
deployed, and individual PV mounts (/mnt/validator-*) work fine
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All K8s resource creation in deploy_k8s.py now uses try-create, catch
ApiException(409), then replace — matching the pattern already used for
secrets in deployment_create.py. This allows `deployment start` to be
safely re-run without 409 Conflict errors.
Resources made idempotent:
- Deployment (create_namespaced_deployment → replace on 409)
- Service (create_namespaced_service → replace on 409)
- Ingress (create_namespaced_ingress → replace on 409)
- NodePort services (same as Service)
- ConfigMap (create_namespaced_config_map → replace on 409)
- PV/PVC: bare `except: pass` replaced with explicit ApiException
catch for 404
Extracted _ensure_deployment(), _ensure_service(), _ensure_ingress(),
and _ensure_config_map() helpers to keep cyclomatic complexity in check.
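The shared pattern, sketched for the ConfigMap case (the replace step is
further hardened with resourceVersion handling in the commit above):

  from kubernetes import client
  from kubernetes.client.rest import ApiException

  def _ensure_config_map(core_api: client.CoreV1Api, namespace: str,
                         config_map: client.V1ConfigMap):
      # Try create; fall back to replace on 409 so `deployment start`
      # can be re-run safely.
      try:
          core_api.create_namespaced_config_map(namespace, config_map)
      except ApiException as e:
          if e.status != 409:
              raise
          core_api.replace_namespaced_config_map(
              config_map.metadata.name, namespace, config_map)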
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix snapshot_dir: /srv/solana/snapshots → /srv/kind/solana/snapshots
(kind node reads from the bind mount, not the zvol mount directly)
- Fix kind-internal paths: /mnt/solana/... → /mnt/validator-... to match
actual PV hostPath layout (individual mounts, not unified)
- Add 'scale-up' tag to "Scale validator to 1" task for partial recovery
(--tags snapshot,scale-up,verify resumes without re-running deploy)
- Make 'Start deployment' idempotent: failed_when: false + follow-up
check so existing deployment doesn't fail the play
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Docker's FORWARD chain DROP policy blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER chain
only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT rules in
the DOCKER-USER chain, which is evaluated before all the Docker-managed
chains.
Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
tag) and rollback cleanup
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test
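The shape of the two UDP test helpers, roughly (a minimal sketch; the
real scripts add argument parsing and payload checks):

  import socket

  def listen(port, timeout=30):
      # relay-test-udp-listen.py: bind inside the kind netns and wait
      # for a single datagram to arrive.
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.bind(("0.0.0.0", port))
      s.settimeout(timeout)
      data, addr = s.recvfrom(2048)
      print(f"received {len(data)} bytes from {addr}")

  def send(host, port, payload=b"relay-test"):
      # relay-test-udp-send.py: fire a datagram from kelce at the relay
      # address so the listener can assert arrival.
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.sendto(payload, (host, port))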
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Destroying the kind cluster on stop/start is almost never the intent.
The cluster holds PVs, ConfigMaps, and networking state that are
expensive to recreate. Default to preserving the cluster; pass
--perform-cluster-management explicitly when a full teardown is needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The update command only patches environment variables and adds a
restart annotation. It does not update ports, volumes, configmaps, or
any other part of the deployment spec. The old name was misleading: it
implied a full spec update, causing operators to expect changes that
never took effect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>