Squashed 'agave-stack/' content from commit 7100d11

git-subtree-dir: agave-stack
git-subtree-split: 7100d117421bd79fb52d3dfcd85b76cf18ed0ffa
2026-03-10 06:21:15 +00:00 · 481e9d2392
commit 481e9d2392
36 changed files with 14471 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,277 @@
 # agave-stack
 Unified Agave/Jito Solana stack for [laconic-so](https://github.com/LaconicNetwork/stack-orchestrator). Deploys Solana validators, RPC nodes, and test validators as containers with optional [DoubleZero](https://doublezero.xyz) network routing.
 ## Modes
 | Mode | Compose file | Use case |
 |------|-------------|----------|
 | `validator` | `docker-compose-agave.yml` | Voting validator (mainnet/testnet) |
 | `rpc` | `docker-compose-agave-rpc.yml` | Non-voting RPC node |
 | `test` | `docker-compose-agave-test.yml` | Local dev with instant finality |
 Mode is selected via the `AGAVE_MODE` environment variable.
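A minimal sketch of the mode selection described above, assuming the router only maps `AGAVE_MODE` to one of the start scripts shipped in the image. The helper name `select_start_script` is hypothetical; the real `entrypoint.sh` is a shell script and may do more than this.

```python
import os

# Mode -> (compose file, start script), as listed in this README.
MODES = {
    "validator": ("docker-compose-agave.yml", "start-validator.sh"),
    "rpc": ("docker-compose-agave-rpc.yml", "start-rpc.sh"),
    "test": ("docker-compose-agave-test.yml", "start-test.sh"),
}


def select_start_script(env=None):
    """Return the start script for the configured AGAVE_MODE (hypothetical helper)."""
    env = os.environ if env is None else env
    mode = env.get("AGAVE_MODE", "test")  # `test` is the documented default
    if mode not in MODES:
        raise ValueError(f"unknown AGAVE_MODE: {mode!r}")
    return MODES[mode][1]
```

Rejecting unknown values up front keeps a typo in `AGAVE_MODE` from silently starting the wrong kind of node.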
 ## Repository layout
 ```
 agave-stack/
 ├── deployment/                              # Reference deployment (biscayne)
 │   ├── spec.yml                            # k8s-kind deployment spec
 │   └── k8s-manifests/
 │       └── doublezero-daemonset.yaml       # DZ DaemonSet (hostNetwork)
 ├── stack-orchestrator/
 │   ├── stacks/agave/
 │   │   ├── stack.yml                       # laconic-so stack definition
 │   │   └── README.md                       # Stack-level docs
 │   ├── compose/
 │   │   ├── docker-compose-agave.yml        # Voting validator
 │   │   ├── docker-compose-agave-rpc.yml    # Non-voting RPC
 │   │   ├── docker-compose-agave-test.yml   # Test validator
 │   │   └── docker-compose-doublezero.yml   # DoubleZero daemon
 │   ├── container-build/
 │   │   ├── laconicnetwork-agave/           # Agave/Jito image
 │   │   │   ├── Dockerfile                  # Two-stage build from source
 │   │   │   ├── build.sh                    # laconic-so build script
 │   │   │   ├── entrypoint.sh               # Mode router
 │   │   │   ├── start-validator.sh          # Voting validator startup
 │   │   │   ├── start-rpc.sh               # RPC node startup
 │   │   │   └── start-test.sh              # Test validator + SPL setup
 │   │   └── laconicnetwork-doublezero/      # DoubleZero image
 │   │       ├── Dockerfile                  # Installs from Cloudsmith apt
 │   │       ├── build.sh
 │   │       └── entrypoint.sh
 │   └── config/agave/
 │       ├── restart-node.sh                 # Container restart helper
 │       └── restart.cron                    # Cron schedule for restarts
 ```
 ## Prerequisites
 - [laconic-so](https://github.com/LaconicNetwork/stack-orchestrator) (stack orchestrator)
 - Docker
 - Kind (for k8s deployments)
 ## Building
 ```bash
 # Vanilla Agave v3.1.9
 laconic-so --stack agave build-containers
 # Jito v3.1.8 (required for MEV)
 AGAVE_REPO=https://github.com/jito-foundation/jito-solana.git \
 AGAVE_VERSION=v3.1.8-jito \
 laconic-so --stack agave build-containers
 ```
 The Agave image is compiled from source (~30-60 min on first build). The command produces both the `laconicnetwork/agave:local` and `laconicnetwork/doublezero:local` images.
 ## Deploying
 ### Test validator (local dev)
 ```bash
 laconic-so --stack agave deploy init --output spec.yml
 laconic-so --stack agave deploy create --spec-file spec.yml --deployment-dir my-test
 laconic-so deployment --dir my-test start
 ```
 The test validator starts with instant finality and optionally creates SPL token mints and airdrops to configured pubkeys.
 ### Mainnet/testnet (Docker Compose)
 ```bash
 laconic-so --stack agave deploy init --output spec.yml
 # Edit spec.yml: set AGAVE_MODE, VALIDATOR_ENTRYPOINT, KNOWN_VALIDATOR, etc.
 laconic-so --stack agave deploy create --spec-file spec.yml --deployment-dir my-node
 laconic-so deployment --dir my-node start
 ```
 ### Kind/k8s deployment
 The `deployment/spec.yml` file is a reference spec targeting `k8s-kind`. The compose files set `network_mode: host`, which works for Docker Compose but is silently ignored by laconic-so's k8s conversion; the conversion uses the explicit ports from the deployment spec instead.
 ```bash
 laconic-so --stack agave deploy create \
  --spec-file deployment/spec.yml \
  --deployment-dir my-deployment
 # Mount validator keypairs
 cp validator-identity.json my-deployment/data/validator-config/
 cp vote-account-keypair.json my-deployment/data/validator-config/  # validator mode only
 laconic-so deployment --dir my-deployment start
 ```
 ## Configuration
 ### Common (all modes)
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `AGAVE_MODE` | `test` | `test`, `rpc`, or `validator` |
 | `VALIDATOR_ENTRYPOINT` | *required* | Cluster entrypoint (host:port) |
 | `KNOWN_VALIDATOR` | *required* | Known validator pubkey |
 | `EXTRA_ENTRYPOINTS` | | Space-separated additional entrypoints |
 | `EXTRA_KNOWN_VALIDATORS` | | Space-separated additional known validators |
 | `RPC_PORT` | `8899` | RPC HTTP port |
 | `RPC_BIND_ADDRESS` | `127.0.0.1` | RPC bind address |
 | `GOSSIP_PORT` | `8001` | Gossip protocol port |
 | `DYNAMIC_PORT_RANGE` | `8000-10000` | TPU/TVU/repair UDP port range |
 | `LIMIT_LEDGER_SIZE` | `50000000` | Max ledger slots to retain |
 | `SNAPSHOT_INTERVAL_SLOTS` | `1000` | Full snapshot interval |
 | `MAXIMUM_SNAPSHOTS_TO_RETAIN` | `5` | Max full snapshots |
 | `EXPECTED_GENESIS_HASH` | | Cluster genesis verification |
 | `EXPECTED_SHRED_VERSION` | | Shred version verification |
 | `RUST_LOG` | `info` | Log level |
 | `SOLANA_METRICS_CONFIG` | | Metrics reporting config |
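As an illustration of the space-separated variables in the table above, here is a sketch of how they could expand into repeated command-line flags. `--entrypoint` and `--known-validator` are existing `agave-validator` flags; the helper `build_cluster_args` is hypothetical, and the actual start scripts are shell and may assemble their arguments differently.

```python
def build_cluster_args(env):
    """Expand entrypoint and known-validator variables into CLI arguments."""
    args = []

    # Primary entrypoint first, then any space-separated extras.
    entrypoints = [env["VALIDATOR_ENTRYPOINT"]]
    entrypoints += env.get("EXTRA_ENTRYPOINTS", "").split()
    for entrypoint in entrypoints:
        args += ["--entrypoint", entrypoint]

    # Same pattern for known validators.
    known = [env["KNOWN_VALIDATOR"]]
    known += env.get("EXTRA_KNOWN_VALIDATORS", "").split()
    for pubkey in known:
        args += ["--known-validator", pubkey]

    return args
```

Using `str.split()` with no argument tolerates repeated spaces and an empty or unset variable, which suits values edited by hand in `spec.yml`.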
 ### Validator mode
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `VOTE_ACCOUNT_KEYPAIR` | `/data/config/vote-account-keypair.json` | Vote account keypair path |
 The identity keypair must be mounted at `/data/config/validator-identity.json`.
 ### RPC mode
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `PUBLIC_RPC_ADDRESS` | | If set, advertise as public RPC |
 | `ACCOUNT_INDEXES` | `program-id,spl-token-owner,spl-token-mint` | Account indexes for queries |
 The identity keypair is auto-generated if none is mounted.
 ### Jito MEV (validator and RPC modes)
 Set `JITO_ENABLE=true` and provide:
 | Variable | Description |
 |----------|-------------|
 | `JITO_BLOCK_ENGINE_URL` | Block engine endpoint |
 | `JITO_SHRED_RECEIVER_ADDR` | Shred receiver (region-specific) |
 | `JITO_RELAYER_URL` | Relayer URL (validator mode) |
 | `JITO_TIP_PAYMENT_PROGRAM` | Tip payment program pubkey |
 | `JITO_DISTRIBUTION_PROGRAM` | Tip distribution program pubkey |
 | `JITO_MERKLE_ROOT_AUTHORITY` | Merkle root upload authority |
 | `JITO_COMMISSION_BPS` | Commission basis points |
 The image must be built from `jito-foundation/jito-solana` for the Jito flags to work.
 ### Test mode
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `FACILITATOR_PUBKEY` | | Pubkey to airdrop SOL |
 | `SERVER_PUBKEY` | | Pubkey to airdrop SOL |
 | `CLIENT_PUBKEY` | | Pubkey to airdrop SOL + create ATA |
 | `MINT_DECIMALS` | `6` | SPL token decimals |
 | `MINT_AMOUNT` | `1000000` | SPL tokens to mint |
 ## DoubleZero
 [DoubleZero](https://doublezero.xyz) provides optimized network routing for Solana validators via GRE tunnels (IP protocol 47) and BGP (TCP/179) over link-local 169.254.0.0/16. Validator traffic to other DZ participants is routed through private fiber instead of the public internet.
 ### How it works
 `doublezerod` creates a `doublezero0` GRE tunnel interface and runs BGP peering through it. Routes are injected into the host routing table, so the validator transparently sends traffic over the fiber backbone. IBRL mode falls back to public internet if DZ is down.
 ### Requirements
 - Validator identity keypair at `/data/config/validator-identity.json`
 - `privileged: true` + `NET_ADMIN` (GRE tunnel + route table manipulation)
 - `hostNetwork: true` (GRE uses IP protocol 47 — cannot be port-mapped)
 - Node registered with DoubleZero passport system
 ### Docker Compose
 `docker-compose-doublezero.yml` runs alongside the validator with `network_mode: host`, sharing the `validator-config` volume for identity access.
 ### k8s
 laconic-so does not pass `hostNetwork` through to generated k8s resources. DoubleZero runs as a DaemonSet applied after `deployment start`:
 ```bash
 kubectl apply -f deployment/k8s-manifests/doublezero-daemonset.yaml
 ```
 Since the validator pods share the node's network namespace, they automatically see the GRE routes injected by `doublezerod`.
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `VALIDATOR_IDENTITY_PATH` | `/data/config/validator-identity.json` | Validator identity keypair |
 | `DOUBLEZERO_RPC_ENDPOINT` | `http://127.0.0.1:8899` | Solana RPC for DZ registration |
 | `DOUBLEZERO_EXTRA_ARGS` | | Additional doublezerod arguments |
 ## Runtime requirements
 The container requires the following (already set in compose files):
 | Setting | Value | Why |
 |---------|-------|-----|
 | `privileged` | `true` | `mlock()` syscall and raw network access |
 | `cap_add` | `IPC_LOCK` | Memory page locking for account indexes and ledger |
 | `ulimits.memlock` | `-1` (unlimited) | Agave locks gigabytes of memory |
 | `ulimits.nofile` | `1000000` | Gossip/TPU connections + memory-mapped ledger files |
 | `network_mode` | `host` | Direct host network stack for gossip, TPU, UDP ranges |
 Without these, Agave either refuses to start or dies under load.
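A quick way to confirm the limits from inside a running container is to read them back with Python's standard `resource` module. This is only a sketch: the helper name `check_limits` is hypothetical, and the threshold is copied from the table above rather than from Agave itself.

```python
import resource

REQUIRED_NOFILE = 1_000_000  # matches ulimits.nofile in the table above


def check_limits(memlock_soft, nofile_soft):
    """Return a list of problems; an empty list means the limits look fine."""
    problems = []
    if memlock_soft != resource.RLIM_INFINITY:
        problems.append(f"memlock is {memlock_soft} bytes, expected unlimited")
    if nofile_soft != resource.RLIM_INFINITY and nofile_soft < REQUIRED_NOFILE:
        problems.append(f"nofile is {nofile_soft}, expected >= {REQUIRED_NOFILE}")
    return problems


if __name__ == "__main__":
    # Soft limits of the current process, as the validator would inherit them.
    memlock, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    for problem in check_limits(memlock, nofile):
        print("WARNING:", problem)
```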
 ## Container overhead
 Containers with `privileged: true` and `network_mode: host` add **zero measurable overhead** vs bare metal. Linux containers are not VMs:
 - **Network**: Host network namespace directly — no bridge, no NAT, no veth. Same kernel code path as bare metal.
 - **CPU**: No hypervisor. Same physical cores, same scheduler priority.
 - **Memory**: `IPC_LOCK` + unlimited memlock = identical `mlock()` behavior.
 - **Disk I/O**: hostPath-backed PVs have identical I/O characteristics.
 The only overhead is cgroup accounting (nanoseconds per syscall) and overlayfs for cold file opens (single-digit microseconds, zero once cached).
 ## Scheduled restarts
 `config/agave/restart.cron` schedules periodic restarts to mitigate memory growth:
 - **Validator**: every 4 hours
 - **RPC**: every 6 hours (staggered by a 30 min offset)
 The cron jobs call `restart-node.sh`, which sends SIGTERM to the matching container for a graceful shutdown.
 ## Biscayne reference deployment
 The `deployment/` directory contains a reference deployment for biscayne.vaasl.io (186.233.184.235), a mainnet voting validator with Jito MEV and DoubleZero:
 ```bash
 # Build Jito image
 AGAVE_REPO=https://github.com/jito-foundation/jito-solana.git \
 AGAVE_VERSION=v3.1.8-jito \
 laconic-so --stack agave build-containers
 # Create deployment
 laconic-so --stack agave deploy create \
  --spec-file deployment/spec.yml \
  --deployment-dir biscayne-deployment
 # Mount keypairs
 cp validator-identity.json biscayne-deployment/data/validator-config/
 cp vote-account-keypair.json biscayne-deployment/data/validator-config/
 # Start
 laconic-so deployment --dir biscayne-deployment start
 # Start DoubleZero
 kubectl apply -f deployment/k8s-manifests/doublezero-daemonset.yaml
 ```
 To run as a non-voting RPC node, set `AGAVE_MODE: rpc` in `deployment/spec.yml`.
 ## Volumes
 | Volume | Mount | Content |
 |--------|-------|---------|
 | `validator-config` / `rpc-config` | `/data/config` | Identity keypairs, node config |
 | `validator-ledger` / `rpc-ledger` | `/data/ledger` | Blockchain ledger data |
 | `validator-accounts` / `rpc-accounts` | `/data/accounts` | Account state cache |
 | `validator-snapshots` / `rpc-snapshots` | `/data/snapshots` | Full and incremental snapshots |
 | `doublezero-config` | `~/.config/doublezero` | DZ identity and state |
--- a/WORK_IN_PROGRESS.md
+++ b/WORK_IN_PROGRESS.md
@@ -0,0 +1,198 @@
 # Work in Progress: Biscayne TVU Shred Relay
 ## Overview
 Biscayne's agave validator was shred-starved (~1.7 slots/sec replay vs ~2.5 mainnet).
 Root cause: not enough turbine shreds arriving. Solution: advertise a TVU address in
 Ashburn (dense validator population, better turbine tree neighbors) and relay shreds
 to biscayne in Miami over the laconic backbone.
 ### Architecture
 ```
 Turbine peers (hundreds of validators)
       |
       v UDP shreds to port 20000
 laconic-was-sw01 Et1/1 (64.92.84.81, Ashburn)
       |  ASIC receives on front-panel interface
       |  EOS monitor session mirrors matched packets to CPU
       v
 mirror0 interface (Linux userspace)
       |  socat reads raw frames, sends as UDP
       v  172.16.1.188 -> 186.233.184.235:9100  (Et4/1 backbone, 25.4ms RTT)
 laconic-mia-sw01 Et4/1 (172.16.1.189, Miami)
       |  forwards via default route (Et1/1, same metro)
       v  0.13ms
 biscayne:9100 (186.233.184.235, Miami)
       |  shred-unwrap.py strips IP+UDP headers
       v  clean shred payload to localhost:9000
 agave-validator TVU port
 ```
 Total one-way relay latency: ~12.8ms
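The ~12.8ms figure can be reproduced from the numbers in the diagram. This assumes the 25.4ms backbone figure is a round-trip time (as the Key Details table states), that the path is symmetric, and that the 0.13ms Miami hop is added as given.

```python
backbone_rtt_ms = 25.4      # WAS <-> MIA over Et4/1, round trip
mia_to_biscayne_ms = 0.13   # same-metro hop, taken from the diagram as-is

# One-way estimate: half the backbone RTT plus the final hop.
one_way_ms = backbone_rtt_ms / 2 + mia_to_biscayne_ms
print(f"{one_way_ms:.2f} ms")  # prints "12.83 ms"
```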
 ### Results
 Before relay: ~1.7 slots/sec replay, falling behind ~0.8 slots/sec.
 After relay: ~3.32 slots/sec replay, catching up ~0.82 slots/sec.
 ---
 ## Changes by Host
 ### laconic-was-sw01 (Ashburn) — `install@137.239.200.198`
 All changes are ephemeral (not persisted, lost on reboot).
 **1. EOS monitor session (running-config, not in startup-config)**
 Mirrors inbound UDP port 20000 traffic on Et1/1 to a CPU-accessible `mirror0` interface.
 Required because the Arista 7280CR3A ASIC handles front-panel traffic without punting to
 Linux userspace — regular sockets cannot receive packets on front-panel IPs.
 ```
 monitor session 1 source Ethernet1/1 rx
 monitor session 1 ip access-group SHRED-RELAY
 monitor session 1 destination Cpu
 ```
 **2. EOS ACL (running-config, not in startup-config)**
 ```
 ip access-list SHRED-RELAY
   10 permit udp any any eq 20000
 ```
 **3. EOS static route (running-config, not in startup-config)**
 ```
 ip route 186.233.184.235/32 172.16.1.189
 ```
 Routes biscayne traffic via Et4/1 backbone to laconic-mia-sw01 instead of the default
 route (64.92.84.80, Cogent public internet).
 **4. Linux kernel static route (ephemeral, `ip route add`)**
 ```
 ip route add 186.233.184.235/32 via 172.16.1.189 dev et4_1
 ```
 Required because socat runs in Linux userspace. The EOS static route programs the ASIC
 but does not always sync to the Linux kernel routing table. Without this, socat's UDP
 packets egress via the default route (et1_1, public internet).
 **5. socat relay process (foreground, pts/5)**
 ```bash
 sudo socat -u INTERFACE:mirror0,type=2 UDP-SENDTO:186.233.184.235:9100
 ```
 Reads frames from mirror0 through a packet socket. `type=2` selects SOCK_DGRAM, which strips the Ethernet header and leaves IP+UDP+payload.
 Each frame is then sent as a UDP datagram to biscayne:9100. Runs as root (packet socket access to mirror0 requires it).
 PID: 27743 (child of sudo PID 27742)
 ---
 ### laconic-mia-sw01 (Miami) — `install@209.42.167.130`
 **No changes made.** MIA already reaches biscayne at 0.13ms via its default route
 (`209.42.167.132` on Et1/1, same metro). Relay traffic from WAS arrives on Et4/1
 (`172.16.1.189`) and MIA forwards to `186.233.184.235` natively.
 Key interfaces for reference:
 - Et1/1: `209.42.167.133/31` (public uplink, default route via 209.42.167.132)
 - Et4/1: `172.16.1.189/31` (backbone link to WAS, peer 172.16.1.188)
 - Et8/1: `172.16.1.192/31` (another backbone link, not used for relay)
 ---
 ### biscayne (Miami) — `rix@biscayne.vaasl.io`
 **1. Custom agave image: `laconicnetwork/agave:tvu-relay`**
 Stock agave v3.1.9 with cherry-picked commit 9f4b3ae from anza master (adds
 `--public-tvu-address` flag, from anza PR #6778). Built in `/tmp/agave-tvu-patch/`,
 transferred via `docker save | scp | docker load | kind load docker-image`.
 **2. K8s deployment changes**
 Namespace: `laconic-laconic-70ce4c4b47e23b85`
 Deployment: `laconic-70ce4c4b47e23b85-deployment`
 Changes from previous deployment:
 - Image: `laconicnetwork/agave:local` -> `laconicnetwork/agave:tvu-relay`
 - Added env: `PUBLIC_TVU_ADDRESS=64.92.84.81:20000`
 - Set: `JITO_ENABLE=false` (stock agave has no Jito flags)
 - Strategy: changed to `Recreate` (hostNetwork port conflicts prevent RollingUpdate)
 The validator runs with `--public-tvu-address 64.92.84.81:20000`, causing it to
 advertise the Ashburn switch IP as its TVU address in gossip. Turbine tree peers
 send shreds to Ashburn instead of directly to Miami.
 **3. shred-unwrap.py (foreground process, PID 2497694)**
 ```bash
 python3 /tmp/shred-unwrap.py 9100 127.0.0.1 9000
 ```
 Listens on UDP port 9100, strips IP+UDP headers from mirrored packets (variable-length
 IP header via IHL field + 8-byte UDP header), forwards clean shred payloads to
 localhost:9000 (the validator's TVU port). Running as user `rix`.
 Script location: `/tmp/shred-unwrap.py`
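The header stripping described above can be sketched as follows. This is not the contents of `/tmp/shred-unwrap.py`, which is not part of this commit; the function names and the error handling are assumptions, and the sketch handles IPv4 only.

```python
import socket

UDP_HEADER_LEN = 8
IPPROTO_UDP = 17


def strip_ip_udp(packet: bytes) -> bytes:
    """Return the UDP payload of a raw IPv4+UDP packet."""
    if len(packet) < 20:
        raise ValueError("shorter than the minimum IPv4 header")
    version = packet[0] >> 4
    ihl_words = packet[0] & 0x0F        # header length in 32-bit words
    if version != 4 or ihl_words < 5:
        raise ValueError("not a valid IPv4 header")
    if packet[9] != IPPROTO_UDP:
        raise ValueError("not a UDP packet")
    payload_start = ihl_words * 4 + UDP_HEADER_LEN
    if len(packet) < payload_start:
        raise ValueError("truncated UDP header")
    return packet[payload_start:]


def serve(listen_port: int, dest_host: str, dest_port: int) -> None:
    """Receive wrapped packets over UDP and forward the bare payloads."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("0.0.0.0", listen_port))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        data, _ = rx.recvfrom(65535)
        try:
            tx.sendto(strip_ip_udp(data), (dest_host, dest_port))
        except ValueError:
            continue  # drop anything that is not IPv4+UDP
```

With these names, `serve(9100, "127.0.0.1", 9000)` would correspond to the command line shown above.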
 **4. agave-stack repo changes (uncommitted)**
 - `stack-orchestrator/container-build/laconicnetwork-agave/start-rpc.sh`:
  Added `PUBLIC_TVU_ADDRESS` to header docs and
  `[ -n "${PUBLIC_TVU_ADDRESS:-}" ] && ARGS+=(--public-tvu-address "$PUBLIC_TVU_ADDRESS")`
 - `stack-orchestrator/compose/docker-compose-agave-rpc.yml`:
  Added `PUBLIC_TVU_ADDRESS: ${PUBLIC_TVU_ADDRESS:-}` to environment section
 ---
 ## What's NOT Production-Ready
 ### Ephemeral processes
 - socat on laconic-was-sw01: foreground process in a terminal session
 - shred-unwrap.py on biscayne: foreground process, running from /tmp
 - Both die if the terminal disconnects or the host reboots
 - Need systemd units for both
 ### Ephemeral switch config
 - Monitor session, ACL, and static routes on was-sw01 are in running-config only
 - Not saved to startup-config (`write memory` was run but the route didn't persist)
 - Linux kernel route (`ip route add`) is completely ephemeral
 - All lost on switch reboot
 ### No monitoring
 - No alerting on relay health (socat crash, shred-unwrap crash, packet loss)
 - No metrics on relay throughput vs direct turbine throughput
 - No comparison of before/after slot gap trends
 ### Validator still catching up
 - ~50k slots behind as of initial relay activation
 - Catching up at ~0.82 slots/sec (~2,950 slots/hour)
 - ~17 hours to catch up from current position, or reset with fresh snapshot (~15-30 min)
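
The catch-up estimate above follows from the replay rates in the Results section. The sketch below assumes a constant mainnet rate of ~2.5 slots/sec and a constant replay rate, which is a simplification.

```python
MAINNET_SLOTS_PER_SEC = 2.5    # approximate mainnet rate from the Overview
replay_slots_per_sec = 3.32    # replay rate measured after the relay
slots_behind = 50_000          # gap at initial relay activation

net_gain = replay_slots_per_sec - MAINNET_SLOTS_PER_SEC  # roughly 0.82 slots/sec
slots_per_hour = net_gain * 3600                          # roughly 2,950 slots/hour
hours_to_catch_up = slots_behind / slots_per_hour         # roughly 17 hours

print(f"{net_gain:.2f} slots/sec, {slots_per_hour:.0f} slots/hour, "
      f"{hours_to_catch_up:.1f} h to catch up")
```

Because the net gain is a small difference between two larger rates, a modest drop in replay speed lengthens the estimate sharply, which is part of why the fresh-snapshot option is listed.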
 ---
 ## Key Details
 | Item | Value |
 |------|-------|
 | Biscayne validator identity | `4WeLUxfQghbhsLEuwaAzjZiHg2VBw87vqHc4iZrGvKPr` |
 | Biscayne IP | `186.233.184.235` |
 | laconic-was-sw01 public IP | `64.92.84.81` (Et1/1) |
 | laconic-was-sw01 backbone IP | `172.16.1.188` (Et4/1) |
 | laconic-was-sw01 SSH | `install@137.239.200.198` |
 | laconic-mia-sw01 backbone IP | `172.16.1.189` (Et4/1) |
 | laconic-mia-sw01 SSH | `install@209.42.167.130` |
 | Biscayne SSH | `rix@biscayne.vaasl.io` (via ProxyJump abernathy) |
 | Backbone RTT (WAS-MIA) | 25.4ms (Et4/1 ↔ Et4/1, 0.01ms jitter) |
 | Relay one-way latency | ~12.8ms |
 | Agave image | `laconicnetwork/agave:tvu-relay` (v3.1.9 + commit 9f4b3ae) |
 | EOS version | 4.34.0F |
--- a/ansible/biscayne-redeploy.yml
+++ b/ansible/biscayne-redeploy.yml
@@ -0,0 +1,193 @@
 ---
 # Redeploy agave-stack on biscayne with aria2c snapshot pre-download
 #
 # Usage:
 #   # Standard redeploy (download snapshot, preserve accounts + ledger)
 #   ansible-playbook -i biscayne.vaasl.io, ansible/biscayne-redeploy.yml
 #
 #   # Full wipe (accounts + ledger) — slow rebuild
 #   ansible-playbook -i biscayne.vaasl.io, ansible/biscayne-redeploy.yml \
 #     -e wipe_accounts=true -e wipe_ledger=true
 #
 #   # Skip snapshot download (use existing)
 #   ansible-playbook -i biscayne.vaasl.io, ansible/biscayne-redeploy.yml \
 #     -e skip_snapshot=true
 #
 #   # Pass extra args to snapshot-download.py
 #   ansible-playbook -i biscayne.vaasl.io, ansible/biscayne-redeploy.yml \
 #     -e 'snapshot_args=--version 2.2 --min-download-speed 50'
 #
 #   # Snapshot only (no redeploy)
 #   ansible-playbook -i biscayne.vaasl.io, ansible/biscayne-redeploy.yml --tags snapshot
 #
 - name: Redeploy agave validator on biscayne
  hosts: all
  gather_facts: false
  vars:
    deployment_dir: /srv/deployments/agave
    laconic_so: /home/rix/.local/bin/laconic-so
    kind_cluster: laconic-70ce4c4b47e23b85
    k8s_namespace: "laconic-{{ kind_cluster }}"
    snapshot_dir: /srv/solana/snapshots
    ledger_dir: /srv/solana/ledger
    accounts_dir: /srv/solana/ramdisk/accounts
    ramdisk_mount: /srv/solana/ramdisk
    ramdisk_device: /dev/ram0
    snapshot_script_local: "{{ playbook_dir }}/../scripts/snapshot-download.py"
    snapshot_script: /tmp/snapshot-download.py
    # Flags — non-destructive by default
    wipe_accounts: false
    wipe_ledger: false
    skip_snapshot: false
    snapshot_args: ""
  tasks:
    # --- Snapshot download (runs while validator is still up) ---
    - name: Verify aria2c installed
      command: which aria2c
      changed_when: false
      when: not skip_snapshot | bool
      tags: [snapshot]
    - name: Copy snapshot script to remote
      copy:
        src: "{{ snapshot_script_local }}"
        dest: "{{ snapshot_script }}"
        mode: "0755"
      when: not skip_snapshot | bool
      tags: [snapshot]
    - name: Download snapshot via aria2c
      command: >
        python3 {{ snapshot_script }}
        -o {{ snapshot_dir }}
        {{ snapshot_args }}
      become: true
      register: snapshot_result
      when: not skip_snapshot | bool
      timeout: 3600
      tags: [snapshot]
    - name: Show snapshot download result
      debug:
        msg: "{{ snapshot_result.stdout_lines | default(['skipped']) }}"
      tags: [snapshot]
    # --- Teardown (namespace only, preserve kind cluster) ---
    - name: Delete deployment namespace
      command: >
        kubectl delete namespace {{ k8s_namespace }} --timeout=120s
      register: ns_delete
      failed_when: false
      tags: [teardown]
    - name: Wait for namespace to terminate
      command: >
        kubectl get namespace {{ k8s_namespace }}
        -o jsonpath='{.status.phase}'
      register: ns_status
      retries: 30
      delay: 5
      until: ns_status.rc != 0
      failed_when: false
      when: ns_delete.rc == 0
      tags: [teardown]
    # --- Data wipe (opt-in) ---
    - name: Wipe ledger data
      shell: rm -rf {{ ledger_dir }}/*
      become: true
      when: wipe_ledger | bool
      tags: [wipe]
    - name: Wipe accounts ramdisk (umount + mkfs + mount)
      shell: |
        mountpoint -q {{ ramdisk_mount }} && umount {{ ramdisk_mount }} || true
        mkfs.ext4 -q {{ ramdisk_device }}
        mount {{ ramdisk_device }} {{ ramdisk_mount }}
        mkdir -p {{ accounts_dir }}
        chown solana:solana {{ ramdisk_mount }} {{ accounts_dir }}
      become: true
      when: wipe_accounts | bool
      tags: [wipe]
    - name: Clean old snapshots (keep newest full + incremental)
      shell: |
        cd {{ snapshot_dir }} || exit 0
        newest=$(ls -t snapshot-*.tar.* 2>/dev/null | head -1)
        if [ -n "$newest" ]; then
          newest_inc=$(ls -t incremental-snapshot-*.tar.* 2>/dev/null | head -1)
          find . -maxdepth 1 -name '*.tar.*' \
            ! -name "$newest" \
            ! -name "${newest_inc:-__none__}" \
            -delete
        fi
      become: true
      when: not skip_snapshot | bool
      tags: [wipe]
    # --- Deploy ---
    - name: Verify kind-config.yml has unified mount root
      command: "grep -c 'containerPath: /mnt$' {{ deployment_dir }}/kind-config.yml"
      register: mount_root_check
      failed_when: mount_root_check.stdout | int < 1
      tags: [deploy]
    - name: Start deployment
      command: "{{ laconic_so }} deployment --dir {{ deployment_dir }} start"
      timeout: 600
      tags: [deploy]
    - name: Wait for pod to be running
      command: >
        kubectl get pods -n {{ k8s_namespace }}
        -o jsonpath='{.items[0].status.phase}'
      register: pod_status
      retries: 60
      delay: 10
      until: pod_status.stdout == "Running"
      tags: [deploy]
    # --- Verify ---
    - name: Verify unified mount inside kind node
      command: "docker exec {{ kind_cluster }}-control-plane ls /mnt/solana/"
      register: mount_check
      tags: [verify]
    - name: Show mount contents
      debug:
        msg: "{{ mount_check.stdout_lines }}"
      tags: [verify]
    - name: Check validator log file is being written
      command: >
        kubectl exec -n {{ k8s_namespace }}
        deployment/{{ kind_cluster }}-deployment
        -c agave-validator -- test -f /data/log/validator.log
      retries: 12
      delay: 10
      until: log_file_check.rc == 0
      register: log_file_check
      failed_when: false
      tags: [verify]
    - name: Check RPC health
      uri:
        url: http://127.0.0.1:8899/health
        return_content: true
      register: rpc_health
      retries: 6
      delay: 10
      until: rpc_health.status == 200
      failed_when: false
      delegate_to: "{{ inventory_hostname }}"
      tags: [verify]
    - name: Report status
      debug:
        msg: >-
          Deployment complete.
          Log: {{ 'writing' if log_file_check.rc == 0 else 'not yet created' }}.
          RPC: {{ rpc_health.content | default('not responding') }}.
          Wiped: ledger={{ wipe_ledger }}, accounts={{ wipe_accounts }}.
      tags: [verify]
--- a/deployment/k8s-manifests/doublezero-daemonset.yaml
+++ b/deployment/k8s-manifests/doublezero-daemonset.yaml
@@ -0,0 +1,50 @@
 # DoubleZero DaemonSet - applied separately from laconic-so deployment
 # laconic-so does not support hostNetwork in generated k8s resources,
 # so this manifest is applied via kubectl after 'deployment start'.
 #
 # DoubleZero creates GRE tunnels (IP protocol 47) and runs BGP (tcp/179)
 # on link-local 169.254.0.0/16. This requires host network access.
 # The GRE routes injected into the node routing table are automatically
 # visible to all pods using hostNetwork.
 apiVersion: apps/v1
 kind: DaemonSet
 metadata:
  name: doublezero
  labels:
    app: doublezero
 spec:
  selector:
    matchLabels:
      app: doublezero
  template:
    metadata:
      labels:
        app: doublezero
    spec:
      hostNetwork: true
      containers:
        - name: doublezerod
          image: laconicnetwork/doublezero:local
          securityContext:
            privileged: true
            capabilities:
              add:
                - NET_ADMIN
          env:
            - name: VALIDATOR_IDENTITY_PATH
              value: /data/config/validator-identity.json
            - name: DOUBLEZERO_RPC_ENDPOINT
              value: http://127.0.0.1:8899
          volumeMounts:
            - name: validator-config
              mountPath: /data/config
              readOnly: true
            - name: doublezero-config
              mountPath: /root/.config/doublezero
      volumes:
        - name: validator-config
          persistentVolumeClaim:
            claimName: validator-config
        - name: doublezero-config
          persistentVolumeClaim:
            claimName: doublezero-config
--- a/deployment/spec.yml
+++ b/deployment/spec.yml
@@ -0,0 +1,113 @@
 # Biscayne Solana Validator deployment spec
 # Host: biscayne.vaasl.io (186.233.184.235)
 # Identity: 4WeLUxfQghbhsLEuwaAzjZiHg2VBw87vqHc4iZrGvKPr
 stack: /srv/deployments/agave-stack/stack-orchestrator/stacks/agave
 deploy-to: k8s-kind
 kind-mount-root: /srv/kind
 network:
  http-proxy:
    - host-name: biscayne.vaasl.io
      routes:
        - path: /
          proxy-to: agave-validator:8899
        - path: /
          proxy-to: agave-validator:8900
          websocket: true
  ports:
    agave-validator:
      - '8899'
      - '8900'
      - '8001'
      - 8001/udp
      - 9000/udp
      - 9001/udp
      - 9002/udp
      - 9003/udp
      - 9004/udp
      - 9005/udp
      - 9006/udp
      - 9007/udp
      - 9008/udp
      - 9009/udp
      - 9010/udp
      - 9011/udp
      - 9012/udp
      - 9013/udp
      - 9014/udp
      - 9015/udp
      - 9016/udp
      - 9017/udp
      - 9018/udp
      - 9019/udp
      - 9020/udp
      - 9021/udp
      - 9022/udp
      - 9023/udp
      - 9024/udp
      - 9025/udp
 resources:
  containers:
    reservations:
      cpus: '4.0'
      memory: 256000M
    limits:
      cpus: '32.0'
      memory: 921600M
 security:
  privileged: true
  unlimited-memlock: true
  capabilities:
    - IPC_LOCK
 volumes:
  # Config volumes — on ZFS dataset (backed up via snapshots)
  validator-config: /srv/deployments/agave/data/validator-config
  doublezero-validator-identity: /srv/deployments/agave/data/validator-config
  doublezero-config: /srv/deployments/agave/data/doublezero-config
  # Heavy data volumes — on zvol/ramdisk (not backed up, rebuildable)
  validator-ledger: /srv/kind/solana/ledger
  validator-accounts: /srv/kind/solana/ramdisk/accounts
  validator-snapshots: /srv/kind/solana/snapshots
  validator-log: /srv/kind/solana/log
  # Monitoring
  monitoring-influxdb-data: /srv/kind/solana/monitoring/influxdb
  monitoring-grafana-data: /srv/kind/solana/monitoring/grafana
 configmaps:
  monitoring-telegraf-config: config/monitoring/telegraf-config
  monitoring-telegraf-scripts: config/monitoring/scripts
  monitoring-grafana-datasources: config/monitoring/grafana-datasources
  monitoring-grafana-dashboards: config/monitoring/grafana-dashboards
 config:
  # Mode: 'rpc' (non-voting) — matches current biscayne systemd config
  AGAVE_MODE: rpc
  # Mainnet entrypoints
  VALIDATOR_ENTRYPOINT: entrypoint.mainnet-beta.solana.com:8001
  EXTRA_ENTRYPOINTS: entrypoint2.mainnet-beta.solana.com:8001 entrypoint3.mainnet-beta.solana.com:8001 entrypoint4.mainnet-beta.solana.com:8001 entrypoint5.mainnet-beta.solana.com:8001
  # Known validators (Solana Foundation, Everstake, Chorus One)
  KNOWN_VALIDATOR: 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2
  EXTRA_KNOWN_VALIDATORS: GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ dDzy5SR3AXdYWVqbDEkVFdvSPCtS9ihF5kJkHCtXoFs DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA GwHH8ciFhR8vejWCqmg8FWZUCNtubPY2esALvy5tBvji 6WgdYhhGE53WrZ7ywJA15hBVkw7CRbQ8yDBBTwmBtAHN
  # Network
  RPC_PORT: '8899'
  RPC_BIND_ADDRESS: 0.0.0.0
  GOSSIP_PORT: '8001'
  GOSSIP_HOST: 137.239.194.65
  DYNAMIC_PORT_RANGE: 9000-10000
  # Cluster verification
  EXPECTED_GENESIS_HASH: 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d
  EXPECTED_SHRED_VERSION: '50093'
  # Storage
  LIMIT_LEDGER_SIZE: '50000000'
  SNAPSHOT_INTERVAL_SLOTS: '1000'
  MAXIMUM_SNAPSHOTS_TO_RETAIN: '5'
  NO_INCREMENTAL_SNAPSHOTS: 'true'
  RUST_LOG: info,solana_metrics=warn
  SOLANA_METRICS_CONFIG: host=http://localhost:8086,db=agave_metrics,u=admin,p=admin
  # Jito MEV (NY region shred receiver) — disabled until voting enabled
  JITO_ENABLE: 'false'
  JITO_BLOCK_ENGINE_URL: https://mainnet.block-engine.jito.wtf
  JITO_SHRED_RECEIVER_ADDR: 141.98.216.96:1002
  JITO_TIP_PAYMENT_PROGRAM: T1pyyaTNZsKv2WcRAB8oVnk93mLJw2XzjtVYqCsaHqt
  JITO_DISTRIBUTION_PROGRAM: 4R3gSG8BpU4t19KYj8CfnbtRpnT8gtk4dvTHxVRwc2r7
  JITO_MERKLE_ROOT_AUTHORITY: 8F4jGUmxF36vQ6yabnsxX6AQVXdKBhs8kGSUuRKSg8Xt
  JITO_COMMISSION_BPS: '800'
  # DoubleZero
  DOUBLEZERO_RPC_ENDPOINT: http://127.0.0.1:8899
--- a/scripts/backlog.sh
+++ b/scripts/backlog.sh
@@ -0,0 +1,234 @@
 #!/bin/bash
 set -Eeuo pipefail
 export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin
 export XDG_RUNTIME_DIR="/run/user/$(id -u)"
 mkdir -p "$XDG_RUNTIME_DIR"
 # optional suffix from command-line, prepend dash if non-empty
 SUFFIX="${1:-}"
 SUFFIX="${SUFFIX:+-$SUFFIX}"
 # define variables
 DATASET="biscayne/DATA/deployments"
 DEPLOYMENT_DIR="/srv/deployments/agave"
 LOG_FILE="$HOME/.backlog_history"
 ZFS_HOLD="backlog:pending"
 SERVICE_STOP_TIMEOUT="300"
 SNAPSHOT_RETENTION="6"
 SNAPSHOT_PREFIX="backlog"
 SNAPSHOT_TAG="$(date +%Y%m%d)${SUFFIX}"
 SNAPSHOT="${DATASET}@${SNAPSHOT_PREFIX}-${SNAPSHOT_TAG}"
 # remote replication targets
 REMOTES=(
    "mysterio:edith/DATA/backlog/biscayne-main"
    "ardham:batterywharf/DATA/backlog/biscayne-main"
 )
 # log functions
 log() {
    local time_fmt
    time_fmt=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    echo "[$time_fmt] $1" >> "$LOG_FILE"
 }
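 log_close() {
    # Runs from the EXIT trap, so $? is the script's exit status.
    local rc=$? end_time duration
    end_time=$(date +%s)
    duration=$((end_time - start_time))
    if [[ "$rc" -eq 0 ]]; then
        log "Backlog completed in ${duration}s"
    else
        log "Backlog exited with status $rc after ${duration}s"
    fi
    echo "" >> "$LOG_FILE"
 }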
 # service controls
 services() {
    local action="$1"
    case "$action" in
        stop)
            log "Stopping agave deployment..."
            laconic-so deployment --dir "$DEPLOYMENT_DIR" stop
            log "Waiting for services to fully stop..."
            local deadline=$(( $(date +%s) + SERVICE_STOP_TIMEOUT ))
            while true; do
                local running
                running=$(docker ps --filter "label=com.docker.compose.project.working_dir=$DEPLOYMENT_DIR" -q 2>/dev/null | wc -l)
                if [[ "$running" -eq 0 ]]; then
                    break
                fi
                if (( $(date +%s) >= deadline )); then
                    log "WARNING: Timeout waiting for services to stop; continuing."
                    break
                fi
                sleep 0.2
            done
            ;;
        start)
            log "Starting agave deployment..."
            laconic-so deployment --dir "$DEPLOYMENT_DIR" start
            ;;
        *)
            log "ERROR: Unknown action '$action' in services()"
            exit 2
            ;;
    esac
 }
 # send a snapshot to one remote
 # args: snap remote_host remote_dataset
 snapshot_send_one() {
    local snap="$1" remote_host="$2" remote_dataset="$3"
    log "Checking remote snapshots on $remote_host..."
    local -a local_snaps remote_snaps
    mapfile -t local_snaps < <(zfs list -H -t snapshot -o name -s creation -d1 "$DATASET" | grep -F "${DATASET}@${SNAPSHOT_PREFIX}-")
    mapfile -t remote_snaps < <(ssh "$remote_host" zfs list -H -t snapshot -o name -s creation "$remote_dataset" | grep -F "${remote_dataset}@${SNAPSHOT_PREFIX}-" || true)
    # find latest common snapshot
    local base=""
    local local_snap remote_snap remote_check
    for local_snap in "${local_snaps[@]}"; do
        remote_snap="${local_snap/$DATASET/$remote_dataset}"
        for remote_check in "${remote_snaps[@]}"; do
            if [[ "$remote_check" == "$remote_snap" ]]; then
                base="$local_snap"
                break
            fi
        done
    done
    if [[ -z "$base" && ${#remote_snaps[@]} -eq 0 ]]; then
        log "No remote snapshots found on $remote_host — sending full snapshot."
        if zfs send "$snap" | ssh "$remote_host" zfs receive -sF "$remote_dataset"; then
            log "Full send to $remote_host succeeded."
            return 0
        else
            log "ERROR: Full send to $remote_host failed."
            return 1
        fi
    elif [[ -n "$base" ]]; then
        log "Common base snapshot $base found — sending incremental to $remote_host."
        if zfs send -i "$base" "$snap" | ssh "$remote_host" zfs receive -sF "$remote_dataset"; then
            log "Incremental send to $remote_host succeeded."
            return 0
        else
            log "ERROR: Incremental send to $remote_host failed."
            return 1
        fi
    else
        log "STALE DESTINATION: $remote_host has snapshots but no common base with local — skipping."
        return 1
    fi
 }
 # send snapshot to all remotes
 snapshot_send() {
    local snap="$1"
    local failure_count=0
    set +e
    local entry remote_host remote_dataset
    for entry in "${REMOTES[@]}"; do
        remote_host="${entry%%:*}"
        remote_dataset="${entry#*:}"
        if ! snapshot_send_one "$snap" "$remote_host" "$remote_dataset"; then
            failure_count=$((failure_count + 1))
        fi
    done
    set -e
    if [[ "$failure_count" -gt 0 ]]; then
        log "WARNING: $failure_count destination(s) failed or are out of sync."
        return 1
    fi
    return 0
 }
 # snapshot management
 snapshot() {
    local action="$1"
    case "$action" in
        create)
            log "Creating snapshot: $SNAPSHOT"
            zfs snapshot "$SNAPSHOT"
            zfs hold "$ZFS_HOLD" "$SNAPSHOT" || log "ERROR: Failed to hold $SNAPSHOT"
            ;;
        send)
            log "Sending snapshot $SNAPSHOT..."
            if snapshot_send "$SNAPSHOT"; then
                log "Snapshot send completed. Releasing hold."
                zfs release "$ZFS_HOLD" "$SNAPSHOT" || log "ERROR: Failed to release hold on $SNAPSHOT"
            else
                log "WARNING: Snapshot send encountered errors. Hold retained on $SNAPSHOT."
            fi
            ;;
        prune)
            if [[ "$SNAPSHOT_RETENTION" -gt 0 ]]; then
                log "Pruning old snapshots in $DATASET (retaining $SNAPSHOT_RETENTION destroyable snapshots)..."
                local -a all_snaps destroyable
                local snap
                mapfile -t all_snaps < <(zfs list -H -t snapshot -o name -s creation -d1 "$DATASET" | grep -F "${DATASET}@${SNAPSHOT_PREFIX}-")
                destroyable=()
                for snap in "${all_snaps[@]}"; do
                    if zfs destroy -n -- "$snap" &>/dev/null; then
                        destroyable+=("$snap")
                    else
                        log "Skipping $snap — snapshot not destroyable (likely held)"
                    fi
                done
                local count to_destroy
                count="${#destroyable[@]}"
                to_destroy=$((count - SNAPSHOT_RETENTION))
                if [[ "$to_destroy" -le 0 ]]; then
                    log "Nothing to prune — only $count destroyable snapshots exist"
                else
                    local i
                    for (( i=0; i<to_destroy; i++ )); do
                        snap="${destroyable[$i]}"
                        log "Destroying snapshot: $snap"
                        if ! zfs destroy -- "$snap"; then
                            log "WARNING: Failed to destroy $snap despite earlier check"
                        fi
                    done
                fi
            else
                log "Skipping pruning — retention is set to $SNAPSHOT_RETENTION"
            fi
            ;;
        *)
            log "ERROR: Snapshot unknown action: $action"
            exit 2
            ;;
    esac
 }
 # open logging and begin execution
 mkdir -p "$(dirname -- "$LOG_FILE")"
 start_time=$(date +%s)
 exec >> "$LOG_FILE" 2>&1
 trap 'log_close' EXIT
 trap 'rc=$?; log "ERROR: command failed at line $LINENO (exit $rc)"; exit $rc' ERR
 log "Backlog Started"
 if zfs list -H -t snapshot -o name -d1 "$DATASET" | grep -qxF "$SNAPSHOT"; then
    log "WARNING: Snapshot $SNAPSHOT already exists. Exiting."
    exit 1
 fi
 services stop
 snapshot create
 services start
 snapshot send
 snapshot prune
 # end
--- a/scripts/biscayne-status.py
+++ b/scripts/biscayne-status.py
@ -0,0 +1,280 @@
 #!/usr/bin/env python3
 """Biscayne agave validator status check.
 Collects and displays key health metrics:
 - Slot position (local vs mainnet, gap, replay rate)
 - Pod status (running, restarts, age)
 - Memory usage (cgroup current vs limit, % used)
 - OOM kills (recent dmesg entries)
 - Shred relay (packets/sec on port 9100, shred-unwrap.py alive)
 - Validator process state (from logs)
 """
 import json
 import subprocess
 import sys
 import time
 NAMESPACE = "laconic-laconic-70ce4c4b47e23b85"
 DEPLOYMENT = "laconic-70ce4c4b47e23b85-deployment"
 KIND_NODE = "laconic-70ce4c4b47e23b85-control-plane"
 SSH = "rix@biscayne.vaasl.io"
 MAINNET_RPC = "https://api.mainnet-beta.solana.com"
 LOCAL_RPC = "http://127.0.0.1:8899"
 def ssh(cmd: str, timeout: int = 10) -> str:
    try:
        r = subprocess.run(
            ["ssh", SSH, cmd],
            capture_output=True, text=True, timeout=timeout,
        )
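        # Prefer stdout; fall back to stderr only when stdout is empty, so
        # stray warnings never get glued onto output that callers parse.
        out = r.stdout.strip()
        return out or r.stderr.strip()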
    except subprocess.TimeoutExpired:
        return "<timeout>"
 def local(cmd: str, timeout: int = 10) -> str:
    try:
        r = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout,
        )
        return r.stdout.strip()
    except subprocess.TimeoutExpired:
        return "<timeout>"
 def rpc_call(method: str, url: str = LOCAL_RPC, remote: bool = True, params: list | None = None) -> dict | None:
    payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []})
    cmd = f"curl -s {url} -X POST -H 'Content-Type: application/json' -d '{payload}'"
    raw = ssh(cmd) if remote else local(cmd)
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
 def get_slots() -> tuple[int | None, int | None]:
    local_resp = rpc_call("getSlot")
    mainnet_resp = rpc_call("getSlot", MAINNET_RPC, remote=False)
    local_slot = local_resp.get("result") if local_resp else None
    mainnet_slot = mainnet_resp.get("result") if mainnet_resp else None
    return local_slot, mainnet_slot
 def get_health() -> str:
    resp = rpc_call("getHealth")
    if not resp:
        return "unreachable"
    if "result" in resp and resp["result"] == "ok":
        return "healthy"
    err = resp.get("error", {})
    msg = err.get("message", "unknown")
    behind = err.get("data", {}).get("numSlotsBehind")
    if behind is not None:
        return f"behind {behind:,} slots"
    return msg
 def get_pod_status() -> str:
    cmd = f"kubectl -n {NAMESPACE} get pods -o json"
    raw = ssh(cmd, timeout=15)
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "unknown"
    items = data.get("items", [])
    if not items:
        return "no pods"
    pod = items[0]
    name = pod["metadata"]["name"].split("-")[-1]
    phase = pod["status"].get("phase", "?")
    containers = pod["status"].get("containerStatuses", [])
    restarts = sum(c.get("restartCount", 0) for c in containers)
    ready = sum(1 for c in containers if c.get("ready"))
    total = len(containers)
    age = pod["metadata"].get("creationTimestamp", "?")
    return f"{ready}/{total} {phase}  restarts={restarts}  pod=..{name}  created={age}"
 def get_memory() -> str:
    cmd = (
        f"docker exec {KIND_NODE} bash -c '"
        "find /sys/fs/cgroup -name memory.current -path \"*burstable*\" 2>/dev/null | head -1 | "
        "while read f; do "
        "  dir=$(dirname $f); "
        "  cur=$(cat $f); "
        "  max=$(cat $dir/memory.max 2>/dev/null || echo unknown); "
        "  echo $cur $max; "
        "done'"
    )
    raw = ssh(cmd, timeout=10)
    try:
        parts = raw.split()
        current = int(parts[0])
        limit_str = parts[1]
        cur_gb = current / (1024**3)
        if limit_str == "max":
            return f"{cur_gb:.0f}GB / unlimited"
        limit = int(limit_str)
        lim_gb = limit / (1024**3)
        pct = (current / limit) * 100
        return f"{cur_gb:.0f}GB / {lim_gb:.0f}GB ({pct:.0f}%)"
    except (IndexError, ValueError):
        return raw or "unknown"
 def get_oom_kills() -> str:
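    # grep -c already prints 0 (and exits 1) when nothing matches, so do not
    # echo a second 0: "0\n0" would make int() fail and report "check failed".
    raw = ssh("sudo dmesg | grep -c 'oom-kill' || true")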
    try:
        count = int(raw.strip())
    except ValueError:
        return "check failed"
    if count == 0:
        return "none"
    # Get kernel uptime-relative timestamp and convert to UTC
    # dmesg timestamps are seconds since boot; combine with boot time
    raw = ssh(
        "BOOT=$(date -d \"$(uptime -s)\" +%s); "
        "KERN_TS=$(sudo dmesg | grep 'oom-kill' | tail -1 | "
        "  sed 's/\\[\\s*\\([0-9.]*\\)\\].*/\\1/'); "
        "echo $BOOT $KERN_TS"
    )
    try:
        parts = raw.split()
        boot_epoch = int(parts[0])
        kern_secs = float(parts[1])
        oom_epoch = boot_epoch + int(kern_secs)
        from datetime import datetime, timezone
        oom_utc = datetime.fromtimestamp(oom_epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        return f"{count} total (last: {oom_utc})"
    except (IndexError, ValueError):
        return f"{count} total (timestamp parse failed)"
 def get_relay_rate() -> str:
    # Two samples 3s apart from /proc/net/snmp
    cmd = (
        "T0=$(cat /proc/net/snmp | grep '^Udp:' | tail -1 | awk '{print $2}'); "
        "sleep 3; "
        "T1=$(cat /proc/net/snmp | grep '^Udp:' | tail -1 | awk '{print $2}'); "
        "echo $T0 $T1"
    )
    raw = ssh(cmd, timeout=15)
    try:
        parts = raw.split()
        t0, t1 = int(parts[0]), int(parts[1])
        rate = (t1 - t0) / 3
        return f"{rate:,.0f} UDP dgrams/sec (all ports)"
    except (IndexError, ValueError):
        return raw or "unknown"
 def get_shreds_per_sec() -> str:
    """Count UDP packets on TVU port 9000 over 3 seconds using tcpdump."""
    cmd = "sudo timeout 3 tcpdump -i any udp dst port 9000 -q 2>&1 | grep -oP '\\d+(?= packets captured)'"
    raw = ssh(cmd, timeout=15)
    try:
        count = int(raw.strip())
        rate = count / 3
        return f"{rate:,.0f} shreds/sec ({count:,} in 3s)"
    except (ValueError, TypeError):
        return raw or "unknown"
 def get_unwrap_status() -> str:
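    # The [s] bracket stops pgrep -f from matching the remote shell that is
    # running this very command line.
    raw = ssh("ps -p $(pgrep -f '[s]hred-unwrap' | head -1) -o pid,etime,rss --no-headers 2>/dev/null || echo dead")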
    if "dead" in raw or not raw.strip():
        return "NOT RUNNING"
    parts = raw.split()
    if len(parts) >= 3:
        pid, etime, rss_kb = parts[0], parts[1], parts[2]
        rss_mb = int(rss_kb) / 1024
        return f"pid={pid}  uptime={etime}  rss={rss_mb:.0f}MB"
    return raw
 def get_replay_rate() -> tuple[float | None, int | None, int | None]:
    """Sample processed slot twice over 10s to measure replay rate."""
    params = [{"commitment": "processed"}]
    r0 = rpc_call("getSlot", params=params)
    s0 = r0.get("result") if r0 else None
    if s0 is None:
        return None, None, None
    t0 = time.monotonic()
    time.sleep(10)
    r1 = rpc_call("getSlot", params=params)
    s1 = r1.get("result") if r1 else None
    if s1 is None:
        return None, s0, None
    dt = time.monotonic() - t0
    rate = (s1 - s0) / dt if s1 != s0 else 0
    return rate, s0, s1
 def main() -> None:
    print("=" * 60)
    print("  BISCAYNE VALIDATOR STATUS")
    print("=" * 60)
    # Health + slots
    print("\n--- RPC ---")
    health = get_health()
    local_slot, mainnet_slot = get_slots()
    print(f"  Health:       {health}")
    if local_slot is not None:
        print(f"  Local slot:   {local_slot:,}")
    else:
        print("  Local slot:   unreachable")
    if mainnet_slot is not None:
        print(f"  Mainnet slot: {mainnet_slot:,}")
    if local_slot and mainnet_slot:
        gap = mainnet_slot - local_slot
        print(f"  Gap:          {gap:,} slots")
    # Replay rate (10s sample)
    print("\n--- Replay ---")
    print("  Sampling replay rate (10s)...", end="", flush=True)
    rate, s0, s1 = get_replay_rate()
    if rate is not None:
        print(f"\r  Replay rate:  {rate:.1f} slots/sec ({s0:,} → {s1:,})")
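        # Mainnet targets ~400 ms slots, i.e. about 2.5 slots/sec; only the
        # replay rate above that closes the gap.
        net = rate - 2.5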
        if net > 0:
            print(f"  Net catchup:  +{net:.1f} slots/sec (gaining)")
        elif net < 0:
            print(f"  Net catchup:  {net:.1f} slots/sec (falling behind)")
        else:
            print("  Net catchup:  0 (keeping pace)")
    else:
        print("\r  Replay rate:  could not measure")
    # Pod
    print("\n--- Pod ---")
    pod = get_pod_status()
    print(f"  {pod}")
    # Memory
    print("\n--- Memory ---")
    mem = get_memory()
    print(f"  Cgroup:       {mem}")
    # OOM
    oom = get_oom_kills()
    print(f"  OOM kills:    {oom}")
    # Relay
    print("\n--- Shred Relay ---")
    unwrap = get_unwrap_status()
    print(f"  shred-unwrap: {unwrap}")
    print("  Measuring shred rate (3s)...", end="", flush=True)
    shreds = get_shreds_per_sec()
    print(f"\r  TVU shreds:   {shreds}          ")
    print("  Measuring UDP rate (3s)...", end="", flush=True)
    relay = get_relay_rate()
    print(f"\r  UDP inbound:  {relay}          ")
    print("\n" + "=" * 60)
 if __name__ == "__main__":
    main()
--- a/scripts/snapshot-download.py
+++ b/scripts/snapshot-download.py
@ -0,0 +1,546 @@
 #!/usr/bin/env python3
 """Download Solana snapshots using aria2c for parallel multi-connection downloads.
 Discovers snapshot sources by querying getClusterNodes for all RPCs in the
 cluster, probing each for available snapshots, benchmarking download speed,
 and downloading from the fastest source using aria2c (16 connections by default).
 Based on the discovery approach from etcusr/solana-snapshot-finder but replaces
 the single-connection wget download with aria2c parallel chunked downloads.
 Usage:
    # Download to /srv/solana/snapshots (mainnet, 16 connections)
    ./snapshot-download.py -o /srv/solana/snapshots
    # Dry run — find best source, print URL
    ./snapshot-download.py --dry-run
    # Custom RPC for cluster node discovery + 32 connections
    ./snapshot-download.py -r https://api.mainnet-beta.solana.com -n 32
    # Testnet
    ./snapshot-download.py -c testnet -o /data/snapshots
 Requirements:
    - aria2c (apt install aria2)
    - python3 >= 3.10 (stdlib only, no pip dependencies)
 """
 from __future__ import annotations
 import argparse
 import concurrent.futures
 import json
 import logging
 import os
 import re
 import shutil
 import subprocess
 import sys
 import time
 import urllib.error
 import urllib.request
 from dataclasses import dataclass, field
 from http.client import HTTPResponse
 from pathlib import Path
 from typing import NoReturn
 from urllib.request import Request
 log: logging.Logger = logging.getLogger("snapshot-download")
 CLUSTER_RPC: dict[str, str] = {
    "mainnet-beta": "https://api.mainnet-beta.solana.com",
    "testnet": "https://api.testnet.solana.com",
    "devnet": "https://api.devnet.solana.com",
 }
 # Snapshot filenames:
 #   snapshot-<slot>-<hash>.tar.zst
 #   incremental-snapshot-<base_slot>-<slot>-<hash>.tar.zst
 FULL_SNAP_RE: re.Pattern[str] = re.compile(
    r"^snapshot-(\d+)-([A-Za-z0-9]+)\.tar\.(zst|bz2)$"
 )
 INCR_SNAP_RE: re.Pattern[str] = re.compile(
    r"^incremental-snapshot-(\d+)-(\d+)-([A-Za-z0-9]+)\.tar\.(zst|bz2)$"
 )
 @dataclass
 class SnapshotSource:
    """A snapshot file available from a specific RPC node."""
    rpc_address: str
    # Full redirect paths as returned by the server (e.g. /snapshot-123-hash.tar.zst)
    file_paths: list[str] = field(default_factory=list)
    slots_diff: int = 0
    latency_ms: float = 0.0
    download_speed: float = 0.0  # bytes/sec
 # -- JSON-RPC helpers ----------------------------------------------------------
 class _NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Handler that captures redirect Location instead of following it."""
    def redirect_request(
        self,
        req: Request,
        fp: HTTPResponse,
        code: int,
        msg: str,
        headers: dict[str, str],  # type: ignore[override]
        newurl: str,
    ) -> None:
        return None
 def rpc_post(url: str, method: str, params: list[object] | None = None,
             timeout: int = 25) -> object | None:
    """JSON-RPC POST. Returns parsed 'result' field or None on error."""
    payload: bytes = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": method, "params": params or [],
    }).encode()
    req = Request(url, data=payload,
                  headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data: dict[str, object] = json.loads(resp.read())
            return data.get("result")
    except (urllib.error.URLError, json.JSONDecodeError, OSError, TimeoutError) as e:
        log.debug("rpc_post %s %s failed: %s", url, method, e)
        return None
 def head_no_follow(url: str, timeout: float = 3) -> tuple[str | None, float]:
    """HEAD request without following redirects.
    Returns (Location header value, latency_sec) if the server returned a
    3xx redirect. Returns (None, 0.0) on any error or non-redirect response.
    """
    opener: urllib.request.OpenerDirector = urllib.request.build_opener(_NoRedirectHandler)
    req = Request(url, method="HEAD")
    try:
        start: float = time.monotonic()
        resp: HTTPResponse = opener.open(req, timeout=timeout)  # type: ignore[assignment]
        latency: float = time.monotonic() - start
        # Reached only for non-redirect responses (e.g. 2xx), which normally
        # carry no Location header, so location is usually None here
        location: str | None = resp.headers.get("Location")
        resp.close()
        return location, latency
    except urllib.error.HTTPError as e:
        # 3xx redirects raise HTTPError with the redirect info
        latency = time.monotonic() - start  # type: ignore[possibly-undefined]
        location = e.headers.get("Location")
        if location and 300 <= e.code < 400:
            return location, latency
        return None, 0.0
    except (urllib.error.URLError, OSError, TimeoutError):
        return None, 0.0
 # -- Discovery -----------------------------------------------------------------
 def get_current_slot(rpc_url: str) -> int | None:
    """Get current slot from RPC."""
    result: object | None = rpc_post(rpc_url, "getSlot")
    if isinstance(result, int):
        return result
    return None
 def get_cluster_rpc_nodes(rpc_url: str, version_filter: str | None = None) -> list[str]:
    """Get all RPC node addresses from getClusterNodes."""
    result: object | None = rpc_post(rpc_url, "getClusterNodes")
    if not isinstance(result, list):
        return []
    rpc_addrs: list[str] = []
    for node in result:
        if not isinstance(node, dict):
            continue
        if version_filter is not None:
            node_version: str | None = node.get("version")
            if node_version and not node_version.startswith(version_filter):
                continue
        rpc: str | None = node.get("rpc")
        if rpc:
            rpc_addrs.append(rpc)
    return list(set(rpc_addrs))
 def _parse_snapshot_filename(location: str) -> tuple[str, str | None]:
    """Extract filename and full redirect path from Location header.
    Returns (filename, full_path). full_path includes any path prefix
    the server returned (e.g. '/snapshots/snapshot-123-hash.tar.zst').
    """
    # Location may be absolute URL or relative path
    if location.startswith("http://") or location.startswith("https://"):
        # Absolute URL — extract path
        from urllib.parse import urlparse
        path: str = urlparse(location).path
    else:
        path = location
    filename: str = path.rsplit("/", 1)[-1]
    return filename, path
 def probe_rpc_snapshot(
    rpc_address: str,
    current_slot: int,
    max_age_slots: int,
    max_latency_ms: float,
 ) -> SnapshotSource | None:
    """Probe a single RPC node for available snapshots.
    Probes for full snapshot first (required), then incremental. Records all
    available files. Which files to actually download is decided at download
    time based on what already exists locally — not here.
    Based on the discovery approach from etcusr/solana-snapshot-finder.
    """
    full_url: str = f"http://{rpc_address}/snapshot.tar.bz2"
    # Full snapshot is required — every source must have one
    full_location, full_latency = head_no_follow(full_url, timeout=2)
    if not full_location:
        return None
    latency_ms: float = full_latency * 1000
    if latency_ms > max_latency_ms:
        return None
    full_filename, full_path = _parse_snapshot_filename(full_location)
    fm: re.Match[str] | None = FULL_SNAP_RE.match(full_filename)
    if not fm:
        return None
    full_snap_slot: int = int(fm.group(1))
    slots_diff: int = current_slot - full_snap_slot
    if slots_diff > max_age_slots or slots_diff < -100:
        return None
    file_paths: list[str] = [full_path]
    # Also check for incremental snapshot
    inc_url: str = f"http://{rpc_address}/incremental-snapshot.tar.bz2"
    inc_location, _ = head_no_follow(inc_url, timeout=2)
    if inc_location:
        inc_filename, inc_path = _parse_snapshot_filename(inc_location)
        m: re.Match[str] | None = INCR_SNAP_RE.match(inc_filename)
        if m:
            inc_base_slot: int = int(m.group(1))
            # Incremental must be based on this source's full snapshot
            if inc_base_slot == full_snap_slot:
                file_paths.append(inc_path)
    return SnapshotSource(
        rpc_address=rpc_address,
        file_paths=file_paths,
        slots_diff=slots_diff,
        latency_ms=latency_ms,
    )
 def discover_sources(
    rpc_url: str,
    current_slot: int,
    max_age_slots: int,
    max_latency_ms: float,
    threads: int,
    version_filter: str | None,
 ) -> list[SnapshotSource]:
    """Discover all snapshot sources from the cluster."""
    rpc_nodes: list[str] = get_cluster_rpc_nodes(rpc_url, version_filter)
    if not rpc_nodes:
        log.error("No RPC nodes found via getClusterNodes")
        return []
    log.info("Found %d RPC nodes, probing for snapshots...", len(rpc_nodes))
    sources: list[SnapshotSource] = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
        futures: dict[concurrent.futures.Future[SnapshotSource | None], str] = {
            pool.submit(
                probe_rpc_snapshot, addr, current_slot,
                max_age_slots, max_latency_ms,
            ): addr
            for addr in rpc_nodes
        }
        done: int = 0
        for future in concurrent.futures.as_completed(futures):
            done += 1
            if done % 200 == 0:
                log.info("  probed %d/%d nodes, %d sources found",
                         done, len(rpc_nodes), len(sources))
            try:
                result: SnapshotSource | None = future.result()
            except Exception as e:  # e.g. http.client.BadStatusLine; one bad node must not abort discovery
                log.debug("Probe failed for %s: %s", futures[future], e)
                continue
            if result:
                sources.append(result)
    log.info("Found %d RPC nodes with suitable snapshots", len(sources))
    return sources
 # -- Speed benchmark -----------------------------------------------------------
 def measure_speed(rpc_address: str, measure_time: int = 7) -> float:
    """Measure download speed from an RPC node. Returns bytes/sec."""
    url: str = f"http://{rpc_address}/snapshot.tar.bz2"
    req = Request(url)
    try:
        with urllib.request.urlopen(req, timeout=measure_time + 5) as resp:
            start: float = time.monotonic()
            total: int = 0
            while True:
                elapsed: float = time.monotonic() - start
                if elapsed >= measure_time:
                    break
                chunk: bytes = resp.read(81920)
                if not chunk:
                    break
                total += len(chunk)
            elapsed = time.monotonic() - start
            if elapsed <= 0:
                return 0.0
            return total / elapsed
    except Exception:  # also covers http.client errors such as BadStatusLine
        return 0.0
 # -- Download ------------------------------------------------------------------
 def download_aria2c(
    urls: list[str],
    output_dir: str,
    filename: str,
    connections: int = 16,
 ) -> bool:
    """Download a file using aria2c with parallel connections.
    When multiple URLs are provided, aria2c treats them as mirrors of the
    same file and distributes chunks across all of them.
    """
    num_mirrors: int = len(urls)
    total_splits: int = max(connections, connections * num_mirrors)
    cmd: list[str] = [
        "aria2c",
        "--file-allocation=none",
        "--continue=true",
        # aria2c only accepts 1-16 for --max-connection-per-server
        f"--max-connection-per-server={min(connections, 16)}",
        f"--split={total_splits}",
        "--min-split-size=50M",
        # aria2c retries individual chunk connections on transient network
        # errors (TCP reset, timeout). This is transport-level retry analogous
        # to TCP retransmit, not application-level retry of a failed operation.
        "--max-tries=5",
        "--retry-wait=5",
        "--timeout=60",
        "--connect-timeout=10",
        "--summary-interval=10",
        "--console-log-level=notice",
        f"--dir={output_dir}",
        f"--out={filename}",
        "--auto-file-renaming=false",
        "--allow-overwrite=true",
        *urls,
    ]
    log.info("Downloading %s", filename)
    log.info("  aria2c: %d connections × %d mirrors (%d splits)",
             connections, num_mirrors, total_splits)
    start: float = time.monotonic()
    result: subprocess.CompletedProcess[bytes] = subprocess.run(cmd)
    elapsed: float = time.monotonic() - start
    if result.returncode != 0:
        log.error("aria2c failed with exit code %d", result.returncode)
        return False
    filepath: Path = Path(output_dir) / filename
    if not filepath.exists():
        log.error("aria2c reported success but %s does not exist", filepath)
        return False
    size_bytes: int = filepath.stat().st_size
    size_gb: float = size_bytes / (1024 ** 3)
    avg_mb: float = size_bytes / elapsed / (1024 ** 2) if elapsed > 0 else 0
    log.info("  Done: %.1f GB in %.0fs (%.1f MiB/s avg)", size_gb, elapsed, avg_mb)
    return True
 # -- Main ----------------------------------------------------------------------
 def main() -> int:
    p: argparse.ArgumentParser = argparse.ArgumentParser(
        description="Download Solana snapshots with aria2c parallel downloads",
    )
    p.add_argument("-o", "--output", default="/srv/solana/snapshots",
                   help="Snapshot output directory (default: /srv/solana/snapshots)")
    p.add_argument("-c", "--cluster", default="mainnet-beta",
                   choices=list(CLUSTER_RPC),
                   help="Solana cluster (default: mainnet-beta)")
    p.add_argument("-r", "--rpc", default=None,
                   help="RPC URL for cluster discovery (default: public RPC)")
    p.add_argument("-n", "--connections", type=int, default=16,
                   help="aria2c connections per download (default: 16)")
    p.add_argument("-t", "--threads", type=int, default=500,
                   help="Threads for parallel RPC probing (default: 500)")
    p.add_argument("--max-snapshot-age", type=int, default=1300,
                   help="Max snapshot age in slots (default: 1300)")
    p.add_argument("--max-latency", type=float, default=100,
                   help="Max RPC probe latency in ms (default: 100)")
    p.add_argument("--min-download-speed", type=int, default=20,
                   help="Min download speed in MiB/s (default: 20)")
    p.add_argument("--measurement-time", type=int, default=7,
                   help="Speed measurement duration in seconds (default: 7)")
    p.add_argument("--max-speed-checks", type=int, default=15,
                   help="Max nodes to benchmark before giving up (default: 15)")
    p.add_argument("--version", default=None,
                   help="Filter nodes by version prefix (e.g. '2.2')")
    p.add_argument("--full-only", action="store_true",
                   help="Download only full snapshot, skip incremental")
    p.add_argument("--dry-run", action="store_true",
                   help="Find best source and print URL, don't download")
    p.add_argument("-v", "--verbose", action="store_true")
    args: argparse.Namespace = p.parse_args()
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%H:%M:%S",
    )
    rpc_url: str = args.rpc or CLUSTER_RPC[args.cluster]
    # aria2c is required for actual downloads (not dry-run)
    if not args.dry_run and not shutil.which("aria2c"):
        log.error("aria2c not found. Install with: apt install aria2")
        return 1
    # Get current slot
    log.info("Cluster: %s | RPC: %s", args.cluster, rpc_url)
    current_slot: int | None = get_current_slot(rpc_url)
    if current_slot is None:
        log.error("Cannot get current slot from %s", rpc_url)
        return 1
    log.info("Current slot: %d", current_slot)
    # Discover sources
    sources: list[SnapshotSource] = discover_sources(
        rpc_url, current_slot,
        max_age_slots=args.max_snapshot_age,
        max_latency_ms=args.max_latency,
        threads=args.threads,
        version_filter=args.version,
    )
    if not sources:
        log.error("No snapshot sources found")
        return 1
    # Sort by latency (lowest first) for speed benchmarking
    sources.sort(key=lambda s: s.latency_ms)
    # Benchmark top candidates — all speeds in MiB/s (binary, 1 MiB = 1048576 bytes)
    log.info("Benchmarking download speed on top %d sources...", args.max_speed_checks)
    fast_sources: list[SnapshotSource] = []
    checked: int = 0
    min_speed_bytes: int = args.min_download_speed * 1024 * 1024  # MiB to bytes
    for source in sources:
        if checked >= args.max_speed_checks:
            break
        checked += 1
        speed: float = measure_speed(source.rpc_address, args.measurement_time)
        source.download_speed = speed
        speed_mib: float = speed / (1024 ** 2)
        if speed < min_speed_bytes:
            log.info("  %s: %.1f MiB/s (too slow, need >=%d MiB/s)",
                     source.rpc_address, speed_mib, args.min_download_speed)
            continue
        log.info("  %s: %.1f MiB/s (latency: %.0fms, age: %d slots)",
                 source.rpc_address, speed_mib,
                 source.latency_ms, source.slots_diff)
        fast_sources.append(source)
    if not fast_sources:
        log.error("No source met minimum speed requirement (%d MiB/s)",
                  args.min_download_speed)
        log.info("Try: --min-download-speed 10")
        return 1
    # Use the lowest-latency source that met the speed threshold as primary
    # (sources are sorted by latency, not speed), and collect mirrors for each file
    best: SnapshotSource = fast_sources[0]
    file_paths: list[str] = best.file_paths
    if args.full_only:
        file_paths = [fp for fp in file_paths
                      if fp.rsplit("/", 1)[-1].startswith("snapshot-")]
    # Build mirror URL lists: for each file, collect URLs from all fast sources
    # that serve the same filename
    download_plan: list[tuple[str, list[str]]] = []
    for fp in file_paths:
        filename: str = fp.rsplit("/", 1)[-1]
        mirror_urls: list[str] = [f"http://{best.rpc_address}{fp}"]
        for other in fast_sources[1:]:
            for other_fp in other.file_paths:
                if other_fp.rsplit("/", 1)[-1] == filename:
                    mirror_urls.append(f"http://{other.rpc_address}{other_fp}")
                    break
        download_plan.append((filename, mirror_urls))
    # Separate name: speed_mib is already declared (with an annotation) in the loop above
    best_speed_mib: float = best.download_speed / (1024 ** 2)
    log.info("Best source: %s (%.1f MiB/s), %d mirrors total",
             best.rpc_address, best_speed_mib, len(fast_sources))
    for filename, mirror_urls in download_plan:
        log.info("  %s (%d mirrors)", filename, len(mirror_urls))
        for url in mirror_urls:
            log.info("    %s", url)
    if args.dry_run:
        for _, mirror_urls in download_plan:
            for url in mirror_urls:
                print(url)
        return 0
    # Download — skip files that already exist locally
    os.makedirs(args.output, exist_ok=True)
    total_start: float = time.monotonic()
    for filename, mirror_urls in download_plan:
        filepath: Path = Path(args.output) / filename
        if filepath.exists() and filepath.stat().st_size > 0:
            log.info("Skipping %s (already exists: %.1f GB)",
                     filename, filepath.stat().st_size / (1024 ** 3))
            continue
        if not download_aria2c(mirror_urls, args.output, filename, args.connections):
            log.error("Failed to download %s", filename)
            return 1
    total_elapsed: float = time.monotonic() - total_start
    log.info("All downloads complete in %.0fs", total_elapsed)
    for filename, _ in download_plan:
        # Separate name: fp is already used above for str paths
        out_path: Path = Path(args.output) / filename
        if out_path.exists():
            log.info("  %s (%.1f GB)", out_path.name, out_path.stat().st_size / (1024 ** 3))
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/scripts/zfs-setup.md
+++ b/scripts/zfs-setup.md
@ -0,0 +1,109 @@
 # ZFS Setup for Biscayne
 ## Current State
 ```
 biscayne                      none          (pool root)
 biscayne/DATA                 none
 biscayne/DATA/home            /home         42G
 biscayne/DATA/home/solana     /home/solana  2.9G
 biscayne/DATA/srv             /srv          712G
 biscayne/DATA/srv/backups     /srv/backups  208G
 biscayne/DATA/volumes/solana  (zvol, 4T)    → block-mounted at /srv/solana
 ```
 Docker root: `/var/lib/docker` on root filesystem (`/dev/md0`, 439G).
 ## Target State
 ```
 biscayne/DATA/deployments     /srv/deployments   ← laconic-so deployment dirs (snapshotted)
 biscayne/DATA/var/docker      /var/lib/docker    ← docker storage on ZFS
 biscayne/DATA/volumes/solana  (zvol, 4T)         ← bulk solana data (not backed up)
 ```
 ## Steps
 ### 1. Create deployments dataset
 ```bash
 zfs create -o mountpoint=/srv/deployments biscayne/DATA/deployments
 ```
 ### 2. Move docker onto ZFS
 Stop docker and all containers first:
 ```bash
 systemctl stop docker.socket docker.service
 ```
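 Move the existing docker directory aside (the rsync step below reads from
 `/var/lib/docker.bak`), then create the datasets. The parent `biscayne/DATA/var`
 is only a container and gets no mountpoint; `biscayne/DATA/var/docker` is the
 dataset mounted at `/var/lib/docker`:
 ```bash
 mv /var/lib/docker /var/lib/docker.bak
 zfs create -o mountpoint=none biscayne/DATA/var
 zfs create -o mountpoint=/var/lib/docker biscayne/DATA/var/docker
 ```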
 Copy existing docker data (if any worth keeping):
 ```bash
 rsync -aHAX /var/lib/docker.bak/ /var/lib/docker/
 ```
 Or just start fresh — the only running containers are telegraf/influxdb monitoring
 which can be recreated.
 Start docker:
 ```bash
 systemctl start docker.service
 ```
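 One way to confirm the layout after this step is to compare `zfs list` output
 against the Target State table. The sketch below is illustrative only: the helper
 names `parse_zfs_list`, `check_mountpoints` and `check_host` are hypothetical and
 not part of this repository, and it assumes the tab-separated output of
 `zfs list -H -o name,mountpoint`.

```python
import subprocess

# Expected dataset -> mountpoint pairs, copied from the Target State section.
EXPECTED = {
    "biscayne/DATA/deployments": "/srv/deployments",
    "biscayne/DATA/var/docker": "/var/lib/docker",
}


def parse_zfs_list(output: str) -> dict[str, str]:
    """Parse `zfs list -H -o name,mountpoint` output (tab-separated, no header)."""
    mounts: dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, mountpoint = line.partition("\t")
        mounts[name] = mountpoint
    return mounts


def check_mountpoints(mounts: dict[str, str]) -> list[str]:
    """Return one message per dataset that is missing or mounted elsewhere."""
    problems: list[str] = []
    for name, want in EXPECTED.items():
        got = mounts.get(name)
        if got != want:
            problems.append(f"{name}: expected {want}, found {got}")
    return problems


def check_host() -> int:
    """Run `zfs list` on this host and print any mismatches (needs the zfs CLI)."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-o", "name,mountpoint"],
        check=True, capture_output=True, text=True,
    ).stdout
    problems = check_mountpoints(parse_zfs_list(out))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

 Calling `check_host()` on biscayne would return 0 only when both datasets are
 mounted where the Target State table says they should be.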
 ### 3. Grant ZFS permissions to the backup user
 ```bash
 zfs allow -u <backup-user> destroy,snapshot,send,hold,release,mount biscayne/DATA/deployments
 ```
 ### 4. Create remote receiving datasets
 On mysterio:
 ```bash
 zfs create -p edith/DATA/backlog/biscayne-main
 ```
 On ardham:
 ```bash
 zfs create -p batterywharf/DATA/backlog/biscayne-main
 ```
 Backups to these hosts will fail until SSH keys and network access are configured
 for biscayne to reach them. The backup script handles this gracefully.
 ### 5. Install backlog.sh and crontab
 ```bash
 mkdir -p ~/.local/bin
 cp scripts/backlog.sh ~/.local/bin/backlog.sh
 chmod +x ~/.local/bin/backlog.sh
 crontab -e
 # Add: 01 0 * * * /home/<user>/.local/bin/backlog.sh
 ```
 ## Volume Layout
 laconic-so deployment at `/srv/deployments/agave/`:
 | Volume | Location | Backed up |
 |---|---|---|
 | validator-config | `/srv/deployments/agave/data/validator-config/` | Yes (ZFS snapshot) |
 | doublezero-config | `/srv/deployments/agave/data/doublezero-config/` | Yes (ZFS snapshot) |
 | validator-ledger | `/srv/solana/ledger/` (zvol) | No (rebuildable) |
 | validator-accounts | `/srv/solana/accounts/` (zvol) | No (rebuildable) |
 | validator-snapshots | `/srv/solana/snapshots/` (zvol) | No (rebuildable) |
 The laconic-so spec.yml must map the heavy volumes to zvol paths and the small
 config volumes to the deployment directory.
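 The placement rule can be written down as a small check. This is a sketch under
 stated assumptions: the host paths are copied from the table above, while the
 names `HEAVY_VOLUMES` and `placement_errors` are hypothetical. It deliberately
 does not parse `spec.yml`, whose schema is not shown here.

```python
from pathlib import PurePosixPath

# Host paths copied from the Volume Layout table.
VOLUME_PATHS = {
    "validator-config": "/srv/deployments/agave/data/validator-config/",
    "doublezero-config": "/srv/deployments/agave/data/doublezero-config/",
    "validator-ledger": "/srv/solana/ledger/",
    "validator-accounts": "/srv/solana/accounts/",
    "validator-snapshots": "/srv/solana/snapshots/",
}

# Rebuildable bulk data that belongs on the zvol rather than the snapshotted dataset.
HEAVY_VOLUMES = {"validator-ledger", "validator-accounts", "validator-snapshots"}

SNAPSHOTTED_ROOT = PurePosixPath("/srv/deployments")  # ZFS dataset, snapshotted
ZVOL_ROOT = PurePosixPath("/srv/solana")              # zvol, not backed up


def placement_errors(paths: dict[str, str]) -> list[str]:
    """Return one message per volume that sits under the wrong root."""
    errors: list[str] = []
    for name, host_path in paths.items():
        root = ZVOL_ROOT if name in HEAVY_VOLUMES else SNAPSHOTTED_ROOT
        if not PurePosixPath(host_path).is_relative_to(root):
            errors.append(f"{name}: {host_path} is not under {root}")
    return errors
```

 An empty list from `placement_errors(VOLUME_PATHS)` means every heavy volume is on
 the zvol and every config volume is inside the snapshotted deployment directory.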
--- a/stack-orchestrator/compose/docker-compose-agave-rpc.yml
+++ b/stack-orchestrator/compose/docker-compose-agave-rpc.yml
@ -0,0 +1,112 @@
 services:
  agave-rpc:
    restart: unless-stopped
    image: laconicnetwork/agave:local
    network_mode: host
    privileged: true
    cap_add:
      - IPC_LOCK
    # Compose owns all defaults. spec.yml overrides per-deployment.
    environment:
      AGAVE_MODE: rpc
      # Required — no defaults
      VALIDATOR_ENTRYPOINT: ${VALIDATOR_ENTRYPOINT}
      KNOWN_VALIDATOR: ${KNOWN_VALIDATOR}
      # Optional with defaults
      EXTRA_ENTRYPOINTS: ${EXTRA_ENTRYPOINTS:-}
      EXTRA_KNOWN_VALIDATORS: ${EXTRA_KNOWN_VALIDATORS:-}
      RPC_PORT: ${RPC_PORT:-8899}
      RPC_BIND_ADDRESS: ${RPC_BIND_ADDRESS:-127.0.0.1}
      GOSSIP_PORT: ${GOSSIP_PORT:-8001}
      DYNAMIC_PORT_RANGE: ${DYNAMIC_PORT_RANGE:-9000-10000}
      EXPECTED_GENESIS_HASH: ${EXPECTED_GENESIS_HASH:-}
      EXPECTED_SHRED_VERSION: ${EXPECTED_SHRED_VERSION:-}
      LIMIT_LEDGER_SIZE: ${LIMIT_LEDGER_SIZE:-50000000}
      NO_SNAPSHOTS: ${NO_SNAPSHOTS:-false}
      SNAPSHOT_INTERVAL_SLOTS: ${SNAPSHOT_INTERVAL_SLOTS:-100000}
      MAXIMUM_SNAPSHOTS_TO_RETAIN: ${MAXIMUM_SNAPSHOTS_TO_RETAIN:-1}
      NO_INCREMENTAL_SNAPSHOTS: ${NO_INCREMENTAL_SNAPSHOTS:-false}
      ACCOUNT_INDEXES: ${ACCOUNT_INDEXES:-}
      PUBLIC_RPC_ADDRESS: ${PUBLIC_RPC_ADDRESS:-}
      GOSSIP_HOST: ${GOSSIP_HOST:-}
      PUBLIC_TVU_ADDRESS: ${PUBLIC_TVU_ADDRESS:-}
      RUST_LOG: ${RUST_LOG:-info}
      SOLANA_METRICS_CONFIG: ${SOLANA_METRICS_CONFIG:-}
      JITO_ENABLE: ${JITO_ENABLE:-false}
      JITO_BLOCK_ENGINE_URL: ${JITO_BLOCK_ENGINE_URL:-}
      JITO_SHRED_RECEIVER_ADDR: ${JITO_SHRED_RECEIVER_ADDR:-}
      JITO_TIP_PAYMENT_PROGRAM: ${JITO_TIP_PAYMENT_PROGRAM:-}
      JITO_DISTRIBUTION_PROGRAM: ${JITO_DISTRIBUTION_PROGRAM:-}
      JITO_MERKLE_ROOT_AUTHORITY: ${JITO_MERKLE_ROOT_AUTHORITY:-}
      JITO_COMMISSION_BPS: ${JITO_COMMISSION_BPS:-0}
      EXTRA_ARGS: ${EXTRA_ARGS:-}
      SNAPSHOT_AUTO_DOWNLOAD: ${SNAPSHOT_AUTO_DOWNLOAD:-true}
      SNAPSHOT_MAX_AGE_SLOTS: ${SNAPSHOT_MAX_AGE_SLOTS:-20000}
      PROBE_GRACE_SECONDS: ${PROBE_GRACE_SECONDS:-600}
      PROBE_MAX_SLOT_LAG: ${PROBE_MAX_SLOT_LAG:-20000}
    deploy:
      resources:
        reservations:
          cpus: '4.0'
          memory: 256000M
        limits:
          cpus: '32.0'
          memory: 921600M
    volumes:
      - rpc-config:/data/config
      - rpc-ledger:/data/ledger
      - rpc-accounts:/data/accounts
      - rpc-snapshots:/data/snapshots
    ports:
      # RPC ports
      - "8899"
      - "8900"
      # Gossip port
      - "8001"
      - "8001/udp"
      # First 26 ports of the dynamic range for TPU/TVU/repair (9000-9025);
      # DYNAMIC_PORT_RANGE itself defaults to 9000-10000
      - "9000/udp"
      - "9001/udp"
      - "9002/udp"
      - "9003/udp"
      - "9004/udp"
      - "9005/udp"
      - "9006/udp"
      - "9007/udp"
      - "9008/udp"
      - "9009/udp"
      - "9010/udp"
      - "9011/udp"
      - "9012/udp"
      - "9013/udp"
      - "9014/udp"
      - "9015/udp"
      - "9016/udp"
      - "9017/udp"
      - "9018/udp"
      - "9019/udp"
      - "9020/udp"
      - "9021/udp"
      - "9022/udp"
      - "9023/udp"
      - "9024/udp"
      - "9025/udp"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 1000000
        hard: 1000000
    healthcheck:
      test: ["CMD", "entrypoint.py", "probe"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 600s
 volumes:
  rpc-config:
  rpc-ledger:
  rpc-accounts:
  rpc-snapshots:
--- a/stack-orchestrator/compose/docker-compose-agave-test.yml
+++ b/stack-orchestrator/compose/docker-compose-agave-test.yml
@ -0,0 +1,27 @@
 services:
  agave-test:
    restart: unless-stopped
    image: laconicnetwork/agave:local
    security_opt:
      - seccomp=unconfined
    environment:
      AGAVE_MODE: test
      FACILITATOR_PUBKEY: ${FACILITATOR_PUBKEY:-}
      SERVER_PUBKEY: ${SERVER_PUBKEY:-}
      CLIENT_PUBKEY: ${CLIENT_PUBKEY:-}
      MINT_DECIMALS: ${MINT_DECIMALS:-6}
      MINT_AMOUNT: ${MINT_AMOUNT:-1000000000}
    volumes:
      - test-ledger:/data/ledger
    ports:
      - "8899"
      - "8900"
    healthcheck:
      test: ["CMD", "solana", "cluster-version", "--url", "http://127.0.0.1:8899"]
      interval: 5s
      timeout: 5s
      retries: 30
      start_period: 10s
 volumes:
  test-ledger:
--- a/stack-orchestrator/compose/docker-compose-agave.yml
+++ b/stack-orchestrator/compose/docker-compose-agave.yml
@ -0,0 +1,115 @@
 services:
  agave-validator:
    restart: unless-stopped
    image: laconicnetwork/agave:local
    network_mode: host
    privileged: true
    cap_add:
      - IPC_LOCK
    # Compose owns all defaults. spec.yml overrides per-deployment.
    environment:
      AGAVE_MODE: ${AGAVE_MODE:-validator}
      # Required — no defaults
      VALIDATOR_ENTRYPOINT: ${VALIDATOR_ENTRYPOINT}
      KNOWN_VALIDATOR: ${KNOWN_VALIDATOR}
      # Optional with defaults
      EXTRA_ENTRYPOINTS: ${EXTRA_ENTRYPOINTS:-}
      EXTRA_KNOWN_VALIDATORS: ${EXTRA_KNOWN_VALIDATORS:-}
      RPC_PORT: ${RPC_PORT:-8899}
      RPC_BIND_ADDRESS: ${RPC_BIND_ADDRESS:-127.0.0.1}
      GOSSIP_PORT: ${GOSSIP_PORT:-8001}
      DYNAMIC_PORT_RANGE: ${DYNAMIC_PORT_RANGE:-9000-10000}
      EXPECTED_GENESIS_HASH: ${EXPECTED_GENESIS_HASH:-}
      EXPECTED_SHRED_VERSION: ${EXPECTED_SHRED_VERSION:-}
      LIMIT_LEDGER_SIZE: ${LIMIT_LEDGER_SIZE:-50000000}
      NO_SNAPSHOTS: ${NO_SNAPSHOTS:-false}
      SNAPSHOT_INTERVAL_SLOTS: ${SNAPSHOT_INTERVAL_SLOTS:-100000}
      MAXIMUM_SNAPSHOTS_TO_RETAIN: ${MAXIMUM_SNAPSHOTS_TO_RETAIN:-1}
      NO_INCREMENTAL_SNAPSHOTS: ${NO_INCREMENTAL_SNAPSHOTS:-false}
      ACCOUNT_INDEXES: ${ACCOUNT_INDEXES:-}
      VOTE_ACCOUNT_KEYPAIR: ${VOTE_ACCOUNT_KEYPAIR:-/data/config/vote-account-keypair.json}
      GOSSIP_HOST: ${GOSSIP_HOST:-}
      PUBLIC_TVU_ADDRESS: ${PUBLIC_TVU_ADDRESS:-}
      RUST_LOG: ${RUST_LOG:-info}
      SOLANA_METRICS_CONFIG: ${SOLANA_METRICS_CONFIG:-}
      JITO_ENABLE: ${JITO_ENABLE:-false}
      JITO_BLOCK_ENGINE_URL: ${JITO_BLOCK_ENGINE_URL:-}
      JITO_RELAYER_URL: ${JITO_RELAYER_URL:-}
      JITO_SHRED_RECEIVER_ADDR: ${JITO_SHRED_RECEIVER_ADDR:-}
      JITO_TIP_PAYMENT_PROGRAM: ${JITO_TIP_PAYMENT_PROGRAM:-}
      JITO_DISTRIBUTION_PROGRAM: ${JITO_DISTRIBUTION_PROGRAM:-}
      JITO_MERKLE_ROOT_AUTHORITY: ${JITO_MERKLE_ROOT_AUTHORITY:-}
      JITO_COMMISSION_BPS: ${JITO_COMMISSION_BPS:-0}
      EXTRA_ARGS: ${EXTRA_ARGS:-}
      SNAPSHOT_AUTO_DOWNLOAD: ${SNAPSHOT_AUTO_DOWNLOAD:-true}
      SNAPSHOT_MAX_AGE_SLOTS: ${SNAPSHOT_MAX_AGE_SLOTS:-20000}
      PROBE_GRACE_SECONDS: ${PROBE_GRACE_SECONDS:-600}
      PROBE_MAX_SLOT_LAG: ${PROBE_MAX_SLOT_LAG:-20000}
    deploy:
      resources:
        reservations:
          cpus: '4.0'
          memory: 256000M
        limits:
          cpus: '32.0'
          memory: 921600M
    volumes:
      - validator-config:/data/config
      - validator-ledger:/data/ledger
      - validator-accounts:/data/accounts
      - validator-snapshots:/data/snapshots
      - validator-log:/data/log
    ports:
      # RPC ports
      - "8899"
      - "8900"
      # Gossip port
      - "8001"
      - "8001/udp"
      # First 26 ports of the dynamic range for TPU/TVU/repair (9000-9025);
      # DYNAMIC_PORT_RANGE itself defaults to 9000-10000
      - "9000/udp"
      - "9001/udp"
      - "9002/udp"
      - "9003/udp"
      - "9004/udp"
      - "9005/udp"
      - "9006/udp"
      - "9007/udp"
      - "9008/udp"
      - "9009/udp"
      - "9010/udp"
      - "9011/udp"
      - "9012/udp"
      - "9013/udp"
      - "9014/udp"
      - "9015/udp"
      - "9016/udp"
      - "9017/udp"
      - "9018/udp"
      - "9019/udp"
      - "9020/udp"
      - "9021/udp"
      - "9022/udp"
      - "9023/udp"
      - "9024/udp"
      - "9025/udp"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 1000000
        hard: 1000000
    healthcheck:
      test: ["CMD", "entrypoint.py", "probe"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 600s
 volumes:
  validator-config:
  validator-ledger:
  validator-accounts:
  validator-snapshots:
  validator-log:
--- a/stack-orchestrator/compose/docker-compose-doublezero.yml
+++ b/stack-orchestrator/compose/docker-compose-doublezero.yml
@ -0,0 +1,19 @@
 services:
  doublezerod:
    restart: unless-stopped
    image: laconicnetwork/doublezero:local
    network_mode: host
    privileged: true
    cap_add:
      - NET_ADMIN
    environment:
      DOUBLEZERO_RPC_ENDPOINT: ${DOUBLEZERO_RPC_ENDPOINT:-http://127.0.0.1:8899}
      DOUBLEZERO_ENV: ${DOUBLEZERO_ENV:-mainnet-beta}
      DOUBLEZERO_EXTRA_ARGS: ${DOUBLEZERO_EXTRA_ARGS:-}
    volumes:
      - doublezero-validator-identity:/data/config:ro
      - doublezero-config:/root/.config/doublezero
 volumes:
  doublezero-validator-identity:
  doublezero-config:
--- a/stack-orchestrator/compose/docker-compose-monitoring.yml
+++ b/stack-orchestrator/compose/docker-compose-monitoring.yml
@ -0,0 +1,49 @@
 services:
  monitoring-influxdb:
    image: influxdb:1.8
    restart: unless-stopped
    environment:
      INFLUXDB_DB: agave_metrics
      INFLUXDB_HTTP_AUTH_ENABLED: "true"
      INFLUXDB_ADMIN_USER: admin
      INFLUXDB_ADMIN_PASSWORD: admin
      INFLUXDB_REPORTING_DISABLED: "true"
    volumes:
      - monitoring-influxdb-data:/var/lib/influxdb
    ports:
      - "8086"
  monitoring-grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_SECURITY_ADMIN_USER: admin
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_PATHS_DATA: /var/lib/grafana
    volumes:
      - monitoring-grafana-data:/var/lib/grafana
      - monitoring-grafana-datasources:/etc/grafana/provisioning/datasources:ro
      - monitoring-grafana-dashboards:/etc/grafana/provisioning/dashboards:ro
    ports:
      - "3000"
  monitoring-telegraf:
    image: telegraf:1.36
    restart: unless-stopped
    network_mode: host
    environment:
      NODE_RPC_URL: ${NODE_RPC_URL:-http://localhost:8899}
      CANONICAL_RPC_URL: ${CANONICAL_RPC_URL:-https://api.mainnet-beta.solana.com}
      INFLUXDB_URL: ${INFLUXDB_URL:-http://localhost:8086}
    volumes:
      - monitoring-telegraf-config:/etc/telegraf:ro
      - monitoring-telegraf-scripts:/scripts:ro
 volumes:
  monitoring-influxdb-data:
  monitoring-grafana-data:
  monitoring-grafana-datasources:
  monitoring-grafana-dashboards:
  monitoring-telegraf-config:
  monitoring-telegraf-scripts:
--- a/stack-orchestrator/config/agave/restart-node.sh
+++ b/stack-orchestrator/config/agave/restart-node.sh
@ -0,0 +1,8 @@
 #!/bin/sh
 # Restart a container by label filter
 # Used by the cron-based restarter sidecar
 label_filter="$1"
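 containers=$(docker ps -qf "label=$label_filter")
 if [ -n "$containers" ]; then
  # The filter may match several containers. IDs are plain hex strings, so
  # leave the variable unquoted and let word splitting pass each ID separately.
  # shellcheck disable=SC2086
  docker restart -s TERM $containers > /dev/null
 fi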
--- a/stack-orchestrator/config/agave/restart.cron
+++ b/stack-orchestrator/config/agave/restart.cron
@ -0,0 +1,4 @@
 # Restart validator every 4 hours (mitigate memory leaks)
 0 */4 * * * /scripts/restart-node.sh role=validator
 # Restart RPC every 6 hours (staggered from validator)
 30 */6 * * * /scripts/restart-node.sh role=rpc
--- a/stack-orchestrator/config/monitoring/grafana-dashboards/agave-indexing.json
+++ b/stack-orchestrator/config/monitoring/grafana-dashboards/agave-indexing.json
--- a/stack-orchestrator/config/monitoring/grafana-dashboards/agave-transactions.json
+++ b/stack-orchestrator/config/monitoring/grafana-dashboards/agave-transactions.json
--- a/stack-orchestrator/config/monitoring/grafana-dashboards/dashboards.yml
+++ b/stack-orchestrator/config/monitoring/grafana-dashboards/dashboards.yml
@ -0,0 +1,12 @@
 apiVersion: 1
 providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
--- a/stack-orchestrator/config/monitoring/grafana-dashboards/sync-status.json
+++ b/stack-orchestrator/config/monitoring/grafana-dashboards/sync-status.json
--- a/stack-orchestrator/config/monitoring/grafana-dashboards/system-overview.json
+++ b/stack-orchestrator/config/monitoring/grafana-dashboards/system-overview.json
--- a/stack-orchestrator/config/monitoring/grafana-datasources/datasources.yml
+++ b/stack-orchestrator/config/monitoring/grafana-datasources/datasources.yml
@ -0,0 +1,16 @@
 apiVersion: 1
 datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://monitoring-influxdb:8086
    database: agave_metrics
    user: admin
    isDefault: true
    editable: true
    secureJsonData:
      password: admin
    jsonData:
      timeInterval: 10s
      httpMode: GET
--- a/stack-orchestrator/config/monitoring/scripts/check_canonical_slot.sh
+++ b/stack-orchestrator/config/monitoring/scripts/check_canonical_slot.sh
@ -0,0 +1,17 @@
 #!/bin/bash
 # Query canonical mainnet slot for sync lag comparison
 set -euo pipefail
 CANONICAL_RPC="${CANONICAL_RPC_URL:-https://api.mainnet-beta.solana.com}"
 response=$(curl -s --max-time 10 -X POST \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' \
  "$CANONICAL_RPC" 2>/dev/null || echo '{"result":0}')
 slot=$(echo "$response" | grep -o '"result":[0-9]*' | grep -o '[0-9]*' || echo "0")
 if [ "$slot" != "0" ]; then
  echo "canonical_slot slot=${slot}i"
 fi
--- a/stack-orchestrator/config/monitoring/scripts/check_getslot_latency.sh
+++ b/stack-orchestrator/config/monitoring/scripts/check_getslot_latency.sh
@ -0,0 +1,33 @@
 #!/bin/bash
 # Check getSlot RPC latency
 # Outputs metrics in InfluxDB line protocol format
 set -euo pipefail
 RPC_URL="${NODE_RPC_URL:-http://localhost:8899}"
 RPC_PAYLOAD='{"jsonrpc":"2.0","id":1,"method":"getSlot"}'
 response=$(curl -sk --max-time 10 -X POST \
  -H "Content-Type: application/json" \
  -d "$RPC_PAYLOAD" \
  -w "\n%{http_code}\n%{time_total}" \
  "$RPC_URL" 2>/dev/null || echo -e "\n000\n0")
 json_response=$(echo "$response" | head -n 1)
 # curl -w output follows response body; blank lines may appear between them
 http_code=$(echo "$response" | tail -2 | head -1)
 time_total=$(echo "$response" | tail -1)
 latency_ms="$(awk -v t="$time_total" 'BEGIN { printf "%.0f", (t * 1000) }')"
 # Strip leading zeros from http_code (influx line protocol rejects 000i)
 http_code=$((10#${http_code:-0}))
 if [ "$http_code" = "200" ]; then
  slot=$(echo "$json_response" | grep -o '"result":[0-9]*' | grep -o '[0-9]*' || echo "0")
  [ "$slot" != "0" ] && success=1 || success=0
 else
  success=0
  slot=0
 fi
 echo "rpc_latency,endpoint=direct,method=getSlot latency_ms=${latency_ms},success=${success}i,http_code=${http_code}i,slot=${slot}i"
--- a/stack-orchestrator/config/monitoring/telegraf-config/telegraf.conf
+++ b/stack-orchestrator/config/monitoring/telegraf-config/telegraf.conf
@ -0,0 +1,36 @@
 # Telegraf configuration for Agave monitoring
 [agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = "telegraf"
  omit_hostname = false
 # Output to InfluxDB
 [[outputs.influxdb]]
  # Use INFLUXDB_URL, which the monitoring compose file sets (default http://localhost:8086)
  urls = ["${INFLUXDB_URL}"]
  database = "agave_metrics"
  skip_database_creation = true
  username = "admin"
  password = "admin"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
 # Custom getSlot latency check
 [[inputs.exec]]
  commands = ["/scripts/check_getslot_latency.sh"]
  timeout = "30s"
  data_format = "influx"
 # Canonical mainnet slot tracking
 [[inputs.exec]]
  commands = ["/scripts/check_canonical_slot.sh"]
  timeout = "30s"
  data_format = "influx"
--- a/stack-orchestrator/container-build/laconicnetwork-agave/Dockerfile
+++ b/stack-orchestrator/container-build/laconicnetwork-agave/Dockerfile
@ -0,0 +1,81 @@
 # Unified Agave/Jito Solana image
 # Supports three modes via AGAVE_MODE env: test, rpc, validator
 #
 # Build args:
 #   AGAVE_REPO    - git repo URL (anza-xyz/agave or jito-foundation/jito-solana)
 #   AGAVE_VERSION - git tag to build (e.g. v3.1.9, v3.1.8-jito)
 ARG AGAVE_REPO=https://github.com/anza-xyz/agave.git
 ARG AGAVE_VERSION=v3.1.9
 # ---------- Stage 1: Build ----------
 FROM rust:1.85-bookworm AS builder
 ARG AGAVE_REPO
 ARG AGAVE_VERSION
 RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    pkg-config \
    libssl-dev \
    libudev-dev \
    libclang-dev \
    protobuf-compiler \
    ca-certificates \
    git \
    cmake \
    && rm -rf /var/lib/apt/lists/*
 WORKDIR /build
 RUN git clone "$AGAVE_REPO" --depth 1 --branch "$AGAVE_VERSION" --recurse-submodules agave
 WORKDIR /build/agave
 # Cherry-pick --public-tvu-address support (anza-xyz/agave PR #6778, commit 9f4b3ae)
 # This flag only exists on master, not in v3.1.9 — fetch the PR ref and cherry-pick
 ARG TVU_ADDRESS_PR=6778
 RUN if [ -n "$TVU_ADDRESS_PR" ]; then \
      git fetch --depth 50 origin "pull/${TVU_ADDRESS_PR}/head:tvu-pr" && \
      git cherry-pick --no-commit tvu-pr; \
    fi
 # Build all binaries using the upstream install script
 RUN CI_COMMIT=$(git rev-parse HEAD) scripts/cargo-install-all.sh /solana-release
 # ---------- Stage 2: Runtime ----------
 FROM debian:bookworm-slim
 RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl3 \
    libudev1 \
    curl \
    sudo \
    aria2 \
    python3 \
    && rm -rf /var/lib/apt/lists/*
 # Create non-root user with sudo
 RUN useradd -m -s /bin/bash agave \
    && echo "agave ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
 # Copy all compiled binaries
 COPY --from=builder /solana-release/bin/ /usr/local/bin/
 # Copy entrypoint and support scripts
 COPY entrypoint.py snapshot_download.py ip_echo_preflight.py /usr/local/bin/
 COPY start-test.sh /usr/local/bin/
 RUN chmod +x /usr/local/bin/entrypoint.py /usr/local/bin/start-test.sh
 # Create data directories
 RUN mkdir -p /data/config /data/ledger /data/accounts /data/snapshots /data/log \
    && chown -R agave:agave /data
 USER agave
 WORKDIR /data
 ENV RUST_LOG=info
 ENV RUST_BACKTRACE=1
 EXPOSE 8899 8900 8001 8001/udp
 ENTRYPOINT ["entrypoint.py"]
--- a/stack-orchestrator/container-build/laconicnetwork-agave/build.sh
+++ b/stack-orchestrator/container-build/laconicnetwork-agave/build.sh
@ -0,0 +1,17 @@
 #!/usr/bin/env bash
 # Build laconicnetwork/agave
 # Set AGAVE_REPO and AGAVE_VERSION env vars to build Jito or a different version
 source "${CERC_CONTAINER_BASE_DIR}/build-base.sh"
 SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
 AGAVE_REPO="${AGAVE_REPO:-https://github.com/anza-xyz/agave.git}"
 AGAVE_VERSION="${AGAVE_VERSION:-v3.1.9}"
 docker build -t laconicnetwork/agave:local \
  --build-arg AGAVE_REPO="$AGAVE_REPO" \
  --build-arg AGAVE_VERSION="$AGAVE_VERSION" \
  ${build_command_args} \
  -f "${SCRIPT_DIR}/Dockerfile" \
  "${SCRIPT_DIR}"
--- a/stack-orchestrator/container-build/laconicnetwork-agave/entrypoint.py
+++ b/stack-orchestrator/container-build/laconicnetwork-agave/entrypoint.py
@ -0,0 +1,686 @@
 #!/usr/bin/env python3
 """Agave validator entrypoint — snapshot management, arg construction, liveness probe.
 Two subcommands:
  entrypoint.py serve   (default) — snapshot freshness check + run agave-validator
  entrypoint.py probe   — liveness probe (slot lag check, exits 0/1)
 Replaces the bash entrypoint.sh / start-rpc.sh / start-validator.sh with a single
 Python module. Test mode still dispatches to start-test.sh.
 Python stays as PID 1 and traps SIGTERM. On SIGTERM, it runs
 ``agave-validator exit --force --ledger /data/ledger`` which connects to the
 admin RPC Unix socket and tells the validator to flush I/O and exit cleanly.
 This avoids the io_uring/ZFS deadlock that occurs when the process is killed.
 All configuration comes from environment variables — same vars as the original
 bash scripts. See compose files for defaults.
 """
 from __future__ import annotations
 import json
 import logging
 import os
 import re
 import signal
 import subprocess
 import sys
 import threading
 import time
 import urllib.error
 import urllib.request
 from pathlib import Path
 from urllib.request import Request
 log: logging.Logger = logging.getLogger("entrypoint")
 # Directories
 CONFIG_DIR = "/data/config"
 LEDGER_DIR = "/data/ledger"
 ACCOUNTS_DIR = "/data/accounts"
 SNAPSHOTS_DIR = "/data/snapshots"
 LOG_DIR = "/data/log"
 IDENTITY_FILE = f"{CONFIG_DIR}/validator-identity.json"
 # Snapshot filename patterns
 FULL_SNAP_RE: re.Pattern[str] = re.compile(
    r"^snapshot-(\d+)-[A-Za-z0-9]+\.tar\.(zst|bz2)$"
 )
 INCR_SNAP_RE: re.Pattern[str] = re.compile(
    r"^incremental-snapshot-(\d+)-(\d+)-[A-Za-z0-9]+\.tar\.(zst|bz2)$"
 )
 MAINNET_RPC = "https://api.mainnet-beta.solana.com"
 # -- Helpers -------------------------------------------------------------------
 def env(name: str, default: str = "") -> str:
    """Read env var with default."""
    return os.environ.get(name, default)
 def env_required(name: str) -> str:
    """Read required env var, exit if missing."""
    val = os.environ.get(name)
    if not val:
        log.error("%s is required but not set", name)
        sys.exit(1)
    return val
 def env_bool(name: str, default: bool = False) -> bool:
    """Read boolean env var: true/1/yes (case-insensitive) is True, any other non-empty value is False."""
    val = os.environ.get(name, "").lower()
    if not val:
        return default
    return val in ("true", "1", "yes")
 def rpc_get_slot(url: str, timeout: int = 10) -> int | None:
    """Get current slot from a Solana RPC endpoint."""
    payload = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": "getSlot", "params": [],
    }).encode()
    req = Request(url, data=payload,
                  headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data = json.loads(resp.read())
            result = data.get("result")
            if isinstance(result, int):
                return result
    except (urllib.error.URLError, json.JSONDecodeError, OSError, TimeoutError):
        pass
    return None
 # -- Snapshot management -------------------------------------------------------
 def get_local_snapshot_slot(snapshots_dir: str) -> int | None:
    """Find the highest slot among local snapshot files."""
    best_slot: int | None = None
    snap_path = Path(snapshots_dir)
    if not snap_path.is_dir():
        return None
    for entry in snap_path.iterdir():
        m = FULL_SNAP_RE.match(entry.name)
        if m:
            slot = int(m.group(1))
            if best_slot is None or slot > best_slot:
                best_slot = slot
    return best_slot
 def clean_snapshots(snapshots_dir: str) -> None:
    """Remove all snapshot files from the directory."""
    snap_path = Path(snapshots_dir)
    if not snap_path.is_dir():
        return
    for entry in snap_path.iterdir():
        if entry.name.startswith(("snapshot-", "incremental-snapshot-")):
            log.info("Removing old snapshot: %s", entry.name)
            entry.unlink(missing_ok=True)
 def get_incremental_slot(snapshots_dir: str, full_slot: int | None) -> int | None:
    """Get the highest incremental snapshot slot matching the full's base slot."""
    if full_slot is None:
        return None
    snap_path = Path(snapshots_dir)
    if not snap_path.is_dir():
        return None
    best: int | None = None
    for entry in snap_path.iterdir():
        m = INCR_SNAP_RE.match(entry.name)
        if m and int(m.group(1)) == full_slot:
            slot = int(m.group(2))
            if best is None or slot > best:
                best = slot
    return best
 def maybe_download_snapshot(snapshots_dir: str) -> None:
    """Ensure full + incremental snapshots exist before starting.
    The validator should always start from a full + incremental pair to
    minimize replay time. If either is missing or the full is too old,
    download fresh ones via download_best_snapshot (which does rolling
    incremental convergence after downloading the full).
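    Controlled by env vars:
      SNAPSHOT_AUTO_DOWNLOAD (default: true): enable/disable
      SNAPSHOT_MAX_AGE_SLOTS (default: 100000): full snapshot staleness threshold
        (one full snapshot generation, ~11 hours; the compose files set 20000)
      SNAPSHOT_CONVERGENCE_SLOTS (default: 500): max gap between the incremental
        snapshot slot and the current mainnet slot
      SNAPSHOT_RETRY_DELAY (default: 60): seconds to wait between download retries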
    """
    if not env_bool("SNAPSHOT_AUTO_DOWNLOAD", default=True):
        log.info("Snapshot auto-download disabled")
        return
    max_age = int(env("SNAPSHOT_MAX_AGE_SLOTS", "100000"))
    mainnet_slot = rpc_get_slot(MAINNET_RPC)
    if mainnet_slot is None:
        log.warning("Cannot reach mainnet RPC — skipping snapshot check")
        return
    script_dir = Path(__file__).resolve().parent
    sys.path.insert(0, str(script_dir))
    from snapshot_download import download_best_snapshot, download_incremental_for_slot
    convergence = int(env("SNAPSHOT_CONVERGENCE_SLOTS", "500"))
    retry_delay = int(env("SNAPSHOT_RETRY_DELAY", "60"))
    # Check local full snapshot
    local_slot = get_local_snapshot_slot(snapshots_dir)
    have_fresh_full = (local_slot is not None
                       and (mainnet_slot - local_slot) <= max_age)
    if have_fresh_full:
        assert local_slot is not None
        inc_slot = get_incremental_slot(snapshots_dir, local_slot)
        if inc_slot is not None:
            inc_gap = mainnet_slot - inc_slot
            if inc_gap <= convergence:
                log.info("Full (slot %d) + incremental (slot %d, gap %d) "
                         "within convergence, starting",
                         local_slot, inc_slot, inc_gap)
                return
            log.info("Incremental too stale (slot %d, gap %d > %d)",
                     inc_slot, inc_gap, convergence)
        # Fresh full, need a fresh incremental
        log.info("Downloading incremental for full at slot %d", local_slot)
        while True:
            if download_incremental_for_slot(snapshots_dir, local_slot,
                                             convergence_slots=convergence):
                return
            log.warning("Incremental download failed — retrying in %ds",
                        retry_delay)
            time.sleep(retry_delay)
    # No full or full too old — download both
    log.info("Downloading full + incremental")
    clean_snapshots(snapshots_dir)
    while True:
        if download_best_snapshot(snapshots_dir, convergence_slots=convergence):
            return
        log.warning("Snapshot download failed — retrying in %ds", retry_delay)
        time.sleep(retry_delay)
 # -- Directory and identity setup ----------------------------------------------
 def ensure_dirs(*dirs: str) -> None:
     """Create directories and chown them to the current user (best effort, via sudo)."""
    uid = os.getuid()
    gid = os.getgid()
    for d in dirs:
        os.makedirs(d, exist_ok=True)
        try:
            subprocess.run(
                ["sudo", "chown", "-R", f"{uid}:{gid}", d],
                check=False, capture_output=True,
            )
        except FileNotFoundError:
            pass  # sudo not available — dirs already owned correctly
 def ensure_identity_rpc() -> None:
    """Generate ephemeral identity keypair for RPC mode if not mounted."""
    if os.path.isfile(IDENTITY_FILE):
        return
    log.info("Generating RPC node identity keypair...")
    subprocess.run(
        ["solana-keygen", "new", "--no-passphrase", "--silent",
         "--force", "--outfile", IDENTITY_FILE],
        check=True,
    )
 def print_identity() -> None:
    """Print the node identity pubkey."""
    result = subprocess.run(
        ["solana-keygen", "pubkey", IDENTITY_FILE],
        capture_output=True, text=True, check=False,
    )
    if result.returncode == 0:
        log.info("Node identity: %s", result.stdout.strip())
 # -- Arg construction ----------------------------------------------------------
 def build_common_args() -> list[str]:
    """Build agave-validator args common to both RPC and validator modes."""
    args: list[str] = [
        "--identity", IDENTITY_FILE,
        "--entrypoint", env_required("VALIDATOR_ENTRYPOINT"),
        "--known-validator", env_required("KNOWN_VALIDATOR"),
        "--ledger", LEDGER_DIR,
        "--accounts", ACCOUNTS_DIR,
        "--snapshots", SNAPSHOTS_DIR,
        "--rpc-port", env("RPC_PORT", "8899"),
        "--rpc-bind-address", env("RPC_BIND_ADDRESS", "127.0.0.1"),
        "--gossip-port", env("GOSSIP_PORT", "8001"),
        "--dynamic-port-range", env("DYNAMIC_PORT_RANGE", "9000-10000"),
        "--no-os-network-limits-test",
        "--wal-recovery-mode", "skip_any_corrupted_record",
        "--limit-ledger-size", env("LIMIT_LEDGER_SIZE", "50000000"),
        "--no-snapshot-fetch",  # entrypoint handles snapshot download
    ]
    # Snapshot generation
    if env("NO_SNAPSHOTS") == "true":
        args.append("--no-snapshots")
    else:
        args += [
            "--full-snapshot-interval-slots", env("SNAPSHOT_INTERVAL_SLOTS", "100000"),
            "--maximum-full-snapshots-to-retain", env("MAXIMUM_SNAPSHOTS_TO_RETAIN", "1"),
        ]
        if env("NO_INCREMENTAL_SNAPSHOTS") != "true":
            args += ["--maximum-incremental-snapshots-to-retain", "2"]
    # Account indexes
    account_indexes = env("ACCOUNT_INDEXES")
    if account_indexes:
        for idx in account_indexes.split(","):
            idx = idx.strip()
            if idx:
                args += ["--account-index", idx]
    # Additional entrypoints
    for ep in env("EXTRA_ENTRYPOINTS").split():
        if ep:
            args += ["--entrypoint", ep]
    # Additional known validators
    for kv in env("EXTRA_KNOWN_VALIDATORS").split():
        if kv:
            args += ["--known-validator", kv]
    # Cluster verification
    genesis_hash = env("EXPECTED_GENESIS_HASH")
    if genesis_hash:
        args += ["--expected-genesis-hash", genesis_hash]
    shred_version = env("EXPECTED_SHRED_VERSION")
    if shred_version:
        args += ["--expected-shred-version", shred_version]
    # Metrics — just needs to be in the environment, agave reads it directly
    # (env var is already set, nothing to pass as arg)
    # Gossip host / TVU address
    gossip_host = env("GOSSIP_HOST")
    if gossip_host:
        args += ["--gossip-host", gossip_host]
    elif env("PUBLIC_TVU_ADDRESS"):
        args += ["--public-tvu-address", env("PUBLIC_TVU_ADDRESS")]
    # Jito flags
    if env("JITO_ENABLE") == "true":
        log.info("Jito MEV enabled")
        jito_flags: list[tuple[str, str]] = [
            ("JITO_TIP_PAYMENT_PROGRAM", "--tip-payment-program-pubkey"),
            ("JITO_DISTRIBUTION_PROGRAM", "--tip-distribution-program-pubkey"),
            ("JITO_MERKLE_ROOT_AUTHORITY", "--merkle-root-upload-authority"),
            ("JITO_COMMISSION_BPS", "--commission-bps"),
            ("JITO_BLOCK_ENGINE_URL", "--block-engine-url"),
            ("JITO_SHRED_RECEIVER_ADDR", "--shred-receiver-address"),
        ]
        for env_name, flag in jito_flags:
            val = env(env_name)
            if val:
                args += [flag, val]
    return args
 def build_rpc_args() -> list[str]:
    """Build agave-validator args for RPC (non-voting) mode."""
    args = build_common_args()
    args += [
        "--no-voting",
        "--log", f"{LOG_DIR}/validator.log",
        "--full-rpc-api",
        "--enable-rpc-transaction-history",
        "--rpc-pubsub-enable-block-subscription",
        "--enable-extended-tx-metadata-storage",
        "--no-wait-for-vote-to-start-leader",
    ]
    # Public vs private RPC
    public_rpc = env("PUBLIC_RPC_ADDRESS")
    if public_rpc:
        args += ["--public-rpc-address", public_rpc]
    else:
        args += ["--private-rpc", "--allow-private-addr", "--only-known-rpc"]
    # Jito relayer URL (RPC mode doesn't use it, but validator mode does —
    # handled in build_validator_args)
    return args
 def build_validator_args() -> list[str]:
    """Build agave-validator args for voting validator mode."""
    vote_keypair = env("VOTE_ACCOUNT_KEYPAIR",
                       "/data/config/vote-account-keypair.json")
    # Identity must be mounted for validator mode
    if not os.path.isfile(IDENTITY_FILE):
        log.error("Validator identity keypair not found at %s", IDENTITY_FILE)
        log.error("Mount your validator keypair to %s", IDENTITY_FILE)
        sys.exit(1)
    # Vote account keypair must exist
    if not os.path.isfile(vote_keypair):
        log.error("Vote account keypair not found at %s", vote_keypair)
        log.error("Mount your vote account keypair or set VOTE_ACCOUNT_KEYPAIR")
        sys.exit(1)
    # Print vote account pubkey
    result = subprocess.run(
        ["solana-keygen", "pubkey", vote_keypair],
        capture_output=True, text=True, check=False,
    )
    if result.returncode == 0:
        log.info("Vote account: %s", result.stdout.strip())
    args = build_common_args()
    args += [
        "--vote-account", vote_keypair,
        "--log", "-",
    ]
    # Jito relayer URL (validator-only)
    relayer_url = env("JITO_RELAYER_URL")
    if env("JITO_ENABLE") == "true" and relayer_url:
        args += ["--relayer-url", relayer_url]
    return args
 def append_extra_args(args: list[str]) -> list[str]:
    """Append EXTRA_ARGS passthrough flags."""
    extra = env("EXTRA_ARGS")
    if extra:
        args += extra.split()
    return args
 # -- Graceful shutdown --------------------------------------------------------
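 # Time to wait for the child to exit after the admin RPC exit request.
 # Leaves a 30s margin below k8s terminationGracePeriodSeconds (300s); the
 # exit request itself can take up to 30s before this wait starts.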
 GRACEFUL_EXIT_TIMEOUT = 270
 def graceful_exit(child: subprocess.Popen[bytes], reason: str = "SIGTERM") -> None:
    """Request graceful shutdown via the admin RPC Unix socket.
    Runs ``agave-validator exit --force --ledger /data/ledger`` which connects
    to the admin RPC socket at ``/data/ledger/admin.rpc`` and sets the
    validator's exit flag. The validator flushes all I/O and exits cleanly,
    avoiding the io_uring/ZFS deadlock.
    If the admin RPC exit fails or the child doesn't exit within the timeout,
    falls back to SIGTERM then SIGKILL.
    """
    log.info("%s — requesting graceful exit via admin RPC", reason)
    try:
        result = subprocess.run(
            ["agave-validator", "exit", "--force", "--ledger", LEDGER_DIR],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode == 0:
            log.info("Admin RPC exit requested successfully")
        else:
            log.warning(
                "Admin RPC exit returned %d: %s",
                result.returncode, result.stderr.strip(),
            )
    except subprocess.TimeoutExpired:
        log.warning("Admin RPC exit command timed out after 30s")
    except FileNotFoundError:
        log.warning("agave-validator binary not found for exit command")
    # Wait for child to exit
    try:
        child.wait(timeout=GRACEFUL_EXIT_TIMEOUT)
        log.info("Validator exited cleanly with code %d", child.returncode)
        return
    except subprocess.TimeoutExpired:
        log.warning(
            "Validator did not exit within %ds — sending SIGTERM",
            GRACEFUL_EXIT_TIMEOUT,
        )
    # Fallback: SIGTERM
    child.terminate()
    try:
        child.wait(timeout=15)
        log.info("Validator exited after SIGTERM with code %d", child.returncode)
        return
    except subprocess.TimeoutExpired:
        log.warning("Validator did not exit after SIGTERM — sending SIGKILL")
    # Last resort: SIGKILL
    child.kill()
    child.wait()
    log.info("Validator killed with SIGKILL, code %d", child.returncode)
 # -- Serve subcommand ---------------------------------------------------------
 def _gap_monitor(
    child: subprocess.Popen[bytes],
    leapfrog: threading.Event,
    shutting_down: threading.Event,
 ) -> None:
    """Background thread: poll slot gap and trigger leapfrog if too far behind.
    Waits for a grace period (SNAPSHOT_MONITOR_GRACE, default 600s) before
    monitoring — the validator needs time to extract snapshots and catch up.
    Then polls every SNAPSHOT_MONITOR_INTERVAL (default 30s). If the gap
    exceeds SNAPSHOT_LEAPFROG_SLOTS (default 5000) for SNAPSHOT_LEAPFROG_CHECKS
    (default 3) consecutive checks, triggers graceful shutdown and sets the
    leapfrog event so cmd_serve loops back to download a fresh incremental.
    """
    threshold = int(env("SNAPSHOT_LEAPFROG_SLOTS", "5000"))
    required_checks = int(env("SNAPSHOT_LEAPFROG_CHECKS", "3"))
    interval = int(env("SNAPSHOT_MONITOR_INTERVAL", "30"))
    grace = int(env("SNAPSHOT_MONITOR_GRACE", "600"))
    rpc_port = env("RPC_PORT", "8899")
    local_url = f"http://127.0.0.1:{rpc_port}"
    # Grace period — don't monitor during initial catch-up
    if shutting_down.wait(grace):
        return
    consecutive = 0
    while not shutting_down.is_set():
        local_slot = rpc_get_slot(local_url, timeout=5)
        mainnet_slot = rpc_get_slot(MAINNET_RPC, timeout=10)
        if local_slot is not None and mainnet_slot is not None:
            gap = mainnet_slot - local_slot
            if gap > threshold:
                consecutive += 1
                log.warning("Gap %d > %d (%d/%d consecutive)",
                            gap, threshold, consecutive, required_checks)
                if consecutive >= required_checks:
                    log.warning("Leapfrog triggered: gap %d", gap)
                    leapfrog.set()
                    graceful_exit(child, reason="Leapfrog")
                    return
            else:
                if consecutive > 0:
                    log.info("Gap %d within threshold, resetting counter", gap)
                consecutive = 0
        shutting_down.wait(interval)
 def cmd_serve() -> None:
    """Main serve flow: snapshot download, run validator, monitor gap, leapfrog.
    Python stays as PID 1. On each iteration:
      1. Download full + incremental snapshots (if needed)
      2. Start agave-validator as child process
      3. Monitor slot gap in background thread
      4. If gap exceeds threshold → graceful stop → loop back to step 1
      5. If SIGTERM → graceful stop → exit
      6. If validator crashes → exit with its return code
    """
    mode = env("AGAVE_MODE", "test")
    log.info("AGAVE_MODE=%s", mode)
    if mode == "test":
        os.execvp("start-test.sh", ["start-test.sh"])
    if mode not in ("rpc", "validator"):
        log.error("Unknown AGAVE_MODE: %s (valid: test, rpc, validator)", mode)
        sys.exit(1)
    # One-time setup
    dirs = [CONFIG_DIR, LEDGER_DIR, ACCOUNTS_DIR, SNAPSHOTS_DIR]
    if mode == "rpc":
        dirs.append(LOG_DIR)
    ensure_dirs(*dirs)
    if not env_bool("SKIP_IP_ECHO_PREFLIGHT"):
        script_dir = Path(__file__).resolve().parent
        sys.path.insert(0, str(script_dir))
        from ip_echo_preflight import main as ip_echo_main
        if ip_echo_main() != 0:
            sys.exit(1)
    if mode == "rpc":
        ensure_identity_rpc()
    print_identity()
    if mode == "rpc":
        args = build_rpc_args()
    else:
        args = build_validator_args()
    args = append_extra_args(args)
    # Main loop: download → run → monitor → leapfrog if needed
    while True:
        maybe_download_snapshot(SNAPSHOTS_DIR)
        Path("/tmp/entrypoint-start").write_text(str(time.time()))
        log.info("Starting agave-validator with %d arguments", len(args))
        child = subprocess.Popen(["agave-validator"] + args)
        shutting_down = threading.Event()
        leapfrog = threading.Event()
        signal.signal(signal.SIGUSR1,
                      lambda _sig, _frame: child.send_signal(signal.SIGUSR1))
        def _on_sigterm(_sig: int, _frame: object) -> None:
            shutting_down.set()
            threading.Thread(
                target=graceful_exit, args=(child,), daemon=True,
            ).start()
        signal.signal(signal.SIGTERM, _on_sigterm)
        # Start gap monitor
        monitor = threading.Thread(
            target=_gap_monitor,
            args=(child, leapfrog, shutting_down),
            daemon=True,
        )
        monitor.start()
        child.wait()
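         # A SIGTERM received during a leapfrog must win: exit instead of
         # restarting the validator.
         if leapfrog.is_set() and not shutting_down.is_set():
             log.info("Leapfrog: restarting with fresh incremental")
             continue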
        sys.exit(child.returncode)
 # -- Probe subcommand ---------------------------------------------------------
 def cmd_probe() -> None:
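     """Liveness probe: check local RPC slot vs mainnet.
     Exit 0 = healthy, exit 1 = unhealthy.
     Grace period: PROBE_GRACE_SECONDS (default 600). The probe always passes
     during the grace period to allow for snapshot unpacking and initial replay.
     After that, the probe fails if the local RPC is unreachable or lags mainnet
     by more than PROBE_MAX_SLOT_LAG (default 20000) slots. If mainnet cannot
     be reached for comparison, the probe passes.
     """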
    grace_seconds = int(env("PROBE_GRACE_SECONDS", "600"))
    max_lag = int(env("PROBE_MAX_SLOT_LAG", "20000"))
    # Check grace period
    start_file = Path("/tmp/entrypoint-start")
    if start_file.exists():
        try:
            start_time = float(start_file.read_text().strip())
            elapsed = time.time() - start_time
            if elapsed < grace_seconds:
                # Within grace period — always healthy
                sys.exit(0)
        except (ValueError, OSError):
            pass
    else:
        # No start file — serve hasn't started yet, within grace
        sys.exit(0)
    # Query local RPC
    rpc_port = env("RPC_PORT", "8899")
    local_url = f"http://127.0.0.1:{rpc_port}"
    local_slot = rpc_get_slot(local_url, timeout=5)
    if local_slot is None:
        # Local RPC unreachable after grace period — unhealthy
        sys.exit(1)
    # Query mainnet
    mainnet_slot = rpc_get_slot(MAINNET_RPC, timeout=10)
    if mainnet_slot is None:
        # Can't reach mainnet to compare — assume healthy (don't penalize
        # the validator for mainnet RPC being down)
        sys.exit(0)
    lag = mainnet_slot - local_slot
    if lag > max_lag:
        sys.exit(1)
    sys.exit(0)
 # -- Main ----------------------------------------------------------------------
 def main() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        datefmt="%H:%M:%S",
    )
    subcmd = sys.argv[1] if len(sys.argv) > 1 else "serve"
    if subcmd == "serve":
        cmd_serve()
    elif subcmd == "probe":
        cmd_probe()
    else:
        log.error("Unknown subcommand: %s (valid: serve, probe)", subcmd)
        sys.exit(1)
 if __name__ == "__main__":
    main()
--- a/stack-orchestrator/container-build/laconicnetwork-agave/ip_echo_preflight.py
+++ b/stack-orchestrator/container-build/laconicnetwork-agave/ip_echo_preflight.py
@ -0,0 +1,249 @@
 #!/usr/bin/env python3
 """ip_echo preflight — verify UDP port reachability before starting the validator.
 Implements the Solana ip_echo client protocol exactly:
 1. Bind UDP sockets on the ports the validator will use
 2. TCP connect to entrypoint gossip port, send IpEchoServerMessage
 3. Parse IpEchoServerResponse (our IP as seen by entrypoint)
 4. Wait for entrypoint's UDP probes on each port
 5. Exit 0 if all ports reachable, exit 1 if any fail
 Wire format (from agave net-utils/src/):
  Request:  4 null bytes + [u16; 4] tcp_ports LE + [u16; 4] udp_ports LE + \n
  Response: 4 null bytes + bincode IpAddr (u32 LE variant + addr) + optional shred_version
 Called from entrypoint.py before snapshot download. Prevents wasting hours
 downloading a snapshot only to crash-loop on port reachability.
 """
 from __future__ import annotations
 import logging
 import os
 import socket
 import struct
 import sys
 import threading
 import time
 log = logging.getLogger("ip_echo_preflight")
 HEADER = b"\x00\x00\x00\x00"
 TERMINUS = b"\x0a"
 RESPONSE_BUF = 27
 IO_TIMEOUT = 5.0
 PROBE_TIMEOUT = 10.0
 MAX_RETRIES = 3
 RETRY_DELAY = 2.0
 def build_request(tcp_ports: list[int], udp_ports: list[int]) -> bytes:
    """Build IpEchoServerMessage: header + [u16;4] tcp + [u16;4] udp + newline."""
    tcp = (tcp_ports + [0, 0, 0, 0])[:4]
    udp = (udp_ports + [0, 0, 0, 0])[:4]
    return HEADER + struct.pack("<4H", *tcp) + struct.pack("<4H", *udp) + TERMINUS
 def parse_response(data: bytes) -> tuple[str, int | None]:
    """Parse IpEchoServerResponse → (ip_string, shred_version | None).
    Wire format (bincode):
      4 bytes   header (\0\0\0\0)
      4 bytes   IpAddr enum variant (u32 LE: 0=IPv4, 1=IPv6)
      4|16 bytes  address octets
      1 byte    Option tag (0=None, 1=Some)
      2 bytes   shred_version (u16 LE, only if Some)
    """
    if len(data) < 8:
        raise ValueError(f"response too short: {len(data)} bytes")
    if data[:4] == b"HTTP":
        raise ValueError("got HTTP response — not an ip_echo server")
    if data[:4] != HEADER:
        raise ValueError(f"unexpected header: {data[:4].hex()}")
    variant = struct.unpack("<I", data[4:8])[0]
    if variant == 0:  # IPv4
        if len(data) < 12:
            raise ValueError(f"IPv4 response truncated: {len(data)} bytes")
        ip = socket.inet_ntoa(data[8:12])
        rest = data[12:]
    elif variant == 1:  # IPv6
        if len(data) < 24:
            raise ValueError(f"IPv6 response truncated: {len(data)} bytes")
        ip = socket.inet_ntop(socket.AF_INET6, data[8:24])
        rest = data[24:]
    else:
        raise ValueError(f"unknown IpAddr variant: {variant}")
    shred_version = None
    if len(rest) >= 3 and rest[0] == 1:
        shred_version = struct.unpack("<H", rest[1:3])[0]
    return ip, shred_version
 def _listen_udp(port: int, results: dict, stop: threading.Event) -> None:
    """Bind a UDP socket and wait for a probe packet."""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("0.0.0.0", port))
        sock.settimeout(0.5)
        try:
            while not stop.is_set():
                try:
                    _data, addr = sock.recvfrom(64)
                    results[port] = ("ok", addr)
                    return
                except socket.timeout:
                    continue
        finally:
            sock.close()
    except OSError as exc:
        results[port] = ("bind_error", str(exc))
 def ip_echo_check(
    entrypoint_host: str,
    entrypoint_port: int,
    udp_ports: list[int],
 ) -> tuple[str, dict[int, bool]]:
    """Run one ip_echo exchange and return (seen_ip, {port: reachable}).
    Raises on TCP failure (caller retries).
    """
    udp_ports = [p for p in udp_ports if p != 0][:4]
    # Start UDP listeners before sending the TCP request
    results: dict[int, tuple] = {}
    stop = threading.Event()
    threads = []
    for port in udp_ports:
        t = threading.Thread(target=_listen_udp, args=(port, results, stop), daemon=True)
        t.start()
        threads.append(t)
    time.sleep(0.1)  # let listeners bind
     # TCP: send request, read response. The listeners are stopped in the
     # outer finally block so a TCP failure cannot leave UDP ports bound.
     try:
         sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
         sock.settimeout(IO_TIMEOUT)
         try:
             sock.connect((entrypoint_host, entrypoint_port))
             sock.sendall(build_request([], udp_ports))
             resp = sock.recv(RESPONSE_BUF)
         finally:
             sock.close()
         seen_ip, shred_version = parse_response(resp)
         log.info(
             "entrypoint %s:%d sees us as %s (shred_version=%s)",
             entrypoint_host, entrypoint_port, seen_ip, shred_version,
         )
         # Wait for UDP probes
         deadline = time.monotonic() + PROBE_TIMEOUT
         while time.monotonic() < deadline:
             if all(p in results for p in udp_ports):
                 break
             time.sleep(0.2)
     finally:
         stop.set()
         for t in threads:
             t.join(timeout=1)
    port_ok: dict[int, bool] = {}
    for port in udp_ports:
        if port not in results:
            log.error("port %d: no probe received within %.0fs", port, PROBE_TIMEOUT)
            port_ok[port] = False
        else:
            status, detail = results[port]
            if status == "ok":
                log.info("port %d: probe received from %s", port, detail)
                port_ok[port] = True
            else:
                log.error("port %d: %s: %s", port, status, detail)
                port_ok[port] = False
    return seen_ip, port_ok
 def run_preflight(
    entrypoint_host: str,
    entrypoint_port: int,
    udp_ports: list[int],
    expected_ip: str = "",
 ) -> bool:
    """Run ip_echo check with retries. Returns True if all ports pass."""
    for attempt in range(1, MAX_RETRIES + 1):
        log.info("ip_echo attempt %d/%d → %s:%d, ports %s",
                 attempt, MAX_RETRIES, entrypoint_host, entrypoint_port, udp_ports)
        try:
            seen_ip, port_ok = ip_echo_check(entrypoint_host, entrypoint_port, udp_ports)
        except Exception as exc:
            log.error("attempt %d TCP failed: %s", attempt, exc)
            if attempt < MAX_RETRIES:
                time.sleep(RETRY_DELAY)
            continue
        if expected_ip and seen_ip != expected_ip:
            log.error(
                "IP MISMATCH: entrypoint sees %s, expected %s (GOSSIP_HOST). "
                "Outbound mangle/SNAT path is broken.",
                seen_ip, expected_ip,
            )
            if attempt < MAX_RETRIES:
                time.sleep(RETRY_DELAY)
            continue
        reachable = [p for p, ok in port_ok.items() if ok]
        unreachable = [p for p, ok in port_ok.items() if not ok]
        if not unreachable:
            log.info("PASS: all ports reachable %s, seen as %s", reachable, seen_ip)
            return True
        log.error(
            "attempt %d: unreachable %s, reachable %s, seen as %s",
            attempt, unreachable, reachable, seen_ip,
        )
        if attempt < MAX_RETRIES:
            time.sleep(RETRY_DELAY)
    log.error("FAIL: ip_echo preflight exhausted %d attempts", MAX_RETRIES)
    return False
 def main() -> int:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        datefmt="%H:%M:%S",
    )
    # Parse entrypoint — VALIDATOR_ENTRYPOINT is "host:port"
    raw = os.environ.get("VALIDATOR_ENTRYPOINT", "")
    if not raw and len(sys.argv) > 1:
        raw = sys.argv[1]
    if not raw:
        log.error("set VALIDATOR_ENTRYPOINT or pass host:port as argument")
        return 1
    if ":" in raw:
        host, port_str = raw.rsplit(":", 1)
        ep_port = int(port_str)
    else:
        host = raw
        ep_port = 8001
    gossip_port = int(os.environ.get("GOSSIP_PORT", "8001"))
    dynamic_range = os.environ.get("DYNAMIC_PORT_RANGE", "9000-10000")
    range_start = int(dynamic_range.split("-")[0])
    expected_ip = os.environ.get("GOSSIP_HOST", "")
     # Test gossip + first 3 ports from dynamic range (4 max per ip_echo message)
     udp_ports = [gossip_port, range_start, range_start + 1, range_start + 2]
    ok = run_preflight(host, ep_port, udp_ports, expected_ip)
    return 0 if ok else 1
 if __name__ == "__main__":
    sys.exit(main())
--- a/stack-orchestrator/container-build/laconicnetwork-agave/snapshot_download.py
+++ b/stack-orchestrator/container-build/laconicnetwork-agave/snapshot_download.py
@ -0,0 +1,878 @@
 #!/usr/bin/env python3
 """Download Solana snapshots using aria2c for parallel multi-connection downloads.
 Discovers snapshot sources by querying getClusterNodes for all RPCs in the
 cluster, probing each for available snapshots, benchmarking download speed,
 and downloading from the fastest source using aria2c (16 connections by default).
 Based on the discovery approach from etcusr/solana-snapshot-finder but replaces
 the single-connection wget download with aria2c parallel chunked downloads.
 Usage:
    # Download to /srv/kind/solana/snapshots (mainnet, 16 connections)
    ./snapshot_download.py -o /srv/kind/solana/snapshots
    # Dry run — find best source, print URL
    ./snapshot_download.py --dry-run
    # Custom RPC for cluster discovery + 32 connections
    ./snapshot_download.py -r https://api.mainnet-beta.solana.com -n 32
    # Testnet
    ./snapshot_download.py -c testnet -o /data/snapshots
    # Programmatic use from entrypoint.py:
    from snapshot_download import download_best_snapshot
    ok = download_best_snapshot("/data/snapshots")
 Requirements:
    - aria2c (apt install aria2)
    - python3 >= 3.10 (stdlib only, no pip dependencies)
 """
 from __future__ import annotations
 import argparse
 import concurrent.futures
 import json
 import logging
 import os
 import re
 import shutil
 import subprocess
 import sys
 import time
 import urllib.error
 import urllib.request
 from dataclasses import dataclass, field
 from http.client import HTTPResponse
 from pathlib import Path
 from urllib.request import Request
 log: logging.Logger = logging.getLogger("snapshot-download")
 CLUSTER_RPC: dict[str, str] = {
    "mainnet-beta": "https://api.mainnet-beta.solana.com",
    "testnet": "https://api.testnet.solana.com",
    "devnet": "https://api.devnet.solana.com",
 }
 # Snapshot filenames:
 #   snapshot-<slot>-<hash>.tar.zst
 #   incremental-snapshot-<base_slot>-<slot>-<hash>.tar.zst
 FULL_SNAP_RE: re.Pattern[str] = re.compile(
    r"^snapshot-(\d+)-([A-Za-z0-9]+)\.tar\.(zst|bz2)$"
 )
 INCR_SNAP_RE: re.Pattern[str] = re.compile(
    r"^incremental-snapshot-(\d+)-(\d+)-([A-Za-z0-9]+)\.tar\.(zst|bz2)$"
 )
 @dataclass
 class SnapshotSource:
    """A snapshot file available from a specific RPC node."""
    rpc_address: str
    # Full redirect paths as returned by the server (e.g. /snapshot-123-hash.tar.zst)
    file_paths: list[str] = field(default_factory=list)
    slots_diff: int = 0
    latency_ms: float = 0.0
    download_speed: float = 0.0  # bytes/sec
 # -- JSON-RPC helpers ----------------------------------------------------------
 class _NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Handler that captures redirect Location instead of following it."""
    def redirect_request(
        self,
        req: Request,
        fp: HTTPResponse,
        code: int,
        msg: str,
        headers: dict[str, str],  # type: ignore[override]
        newurl: str,
    ) -> None:
        return None
 def rpc_post(url: str, method: str, params: list[object] | None = None,
             timeout: int = 25) -> object | None:
    """JSON-RPC POST. Returns parsed 'result' field or None on error."""
    payload: bytes = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": method, "params": params or [],
    }).encode()
    req = Request(url, data=payload,
                  headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data: dict[str, object] = json.loads(resp.read())
            return data.get("result")
    except (urllib.error.URLError, json.JSONDecodeError, OSError, TimeoutError) as e:
        log.debug("rpc_post %s %s failed: %s", url, method, e)
        return None
 def head_no_follow(url: str, timeout: float = 3) -> tuple[str | None, float]:
    """HEAD request without following redirects.
    Returns (Location header value, latency_sec) if the server returned a
    3xx redirect. Returns (None, 0.0) on any error or non-redirect response.
    """
     opener: urllib.request.OpenerDirector = urllib.request.build_opener(_NoRedirectHandler)
     req = Request(url, method="HEAD")
     # Start the clock before the try block so it is always defined in handlers
     start: float = time.monotonic()
     try:
         resp: HTTPResponse = opener.open(req, timeout=timeout)  # type: ignore[assignment]
         # Non-redirect (2xx): the server did not redirect, so there is no
         # snapshot filename to discover
         resp.close()
         return None, 0.0
     except urllib.error.HTTPError as e:
         # With redirects disabled, a 3xx response surfaces as HTTPError
         latency: float = time.monotonic() - start
         location: str | None = e.headers.get("Location")
         if location and 300 <= e.code < 400:
             return location, latency
         return None, 0.0
     except (urllib.error.URLError, OSError, TimeoutError):
         return None, 0.0
 # -- Discovery -----------------------------------------------------------------
 def get_current_slot(rpc_url: str) -> int | None:
    """Get current slot from RPC."""
    result: object | None = rpc_post(rpc_url, "getSlot")
    if isinstance(result, int):
        return result
    return None
 def get_cluster_rpc_nodes(rpc_url: str, version_filter: str | None = None) -> list[str]:
    """Get all RPC node addresses from getClusterNodes."""
    result: object | None = rpc_post(rpc_url, "getClusterNodes")
    if not isinstance(result, list):
        return []
    rpc_addrs: list[str] = []
    for node in result:
        if not isinstance(node, dict):
            continue
        if version_filter is not None:
            node_version: str | None = node.get("version")
            if node_version and not node_version.startswith(version_filter):
                continue
        rpc: str | None = node.get("rpc")
        if rpc:
            rpc_addrs.append(rpc)
    # Deduplicate while keeping first-seen order so probing is deterministic.
    return list(dict.fromkeys(rpc_addrs))
 def _parse_snapshot_filename(location: str) -> tuple[str, str]:
    """Extract filename and full redirect path from Location header.
    Returns (filename, full_path). full_path includes any path prefix
    the server returned (e.g. '/snapshots/snapshot-123-hash.tar.zst').
    """
    # Location may be absolute URL or relative path
    if location.startswith("http://") or location.startswith("https://"):
        # Absolute URL — extract path
        from urllib.parse import urlparse
        path: str = urlparse(location).path
    else:
        path = location
    filename: str = path.rsplit("/", 1)[-1]
    return filename, path
 def probe_rpc_snapshot(
    rpc_address: str,
    current_slot: int,
 ) -> SnapshotSource | None:
    """Probe a single RPC node for available snapshots.
    Discovery only — no filtering. Returns a SnapshotSource with all available
    info so the caller can decide what to keep. Filtering happens after all
    probes complete, so rejected sources are still visible for debugging.
    """
    full_url: str = f"http://{rpc_address}/snapshot.tar.bz2"
    # Full snapshot is required — every source must have one
    full_location, full_latency = head_no_follow(full_url, timeout=2)
    if not full_location:
        return None
    latency_ms: float = full_latency * 1000
    full_filename, full_path = _parse_snapshot_filename(full_location)
    fm: re.Match[str] | None = FULL_SNAP_RE.match(full_filename)
    if not fm:
        return None
    full_snap_slot: int = int(fm.group(1))
    slots_diff: int = current_slot - full_snap_slot
    file_paths: list[str] = [full_path]
    # Also check for incremental snapshot
    inc_url: str = f"http://{rpc_address}/incremental-snapshot.tar.bz2"
    inc_location, _ = head_no_follow(inc_url, timeout=2)
    if inc_location:
        inc_filename, inc_path = _parse_snapshot_filename(inc_location)
        m: re.Match[str] | None = INCR_SNAP_RE.match(inc_filename)
        if m:
            inc_base_slot: int = int(m.group(1))
            # Incremental must be based on this source's full snapshot
            if inc_base_slot == full_snap_slot:
                file_paths.append(inc_path)
    return SnapshotSource(
        rpc_address=rpc_address,
        file_paths=file_paths,
        slots_diff=slots_diff,
        latency_ms=latency_ms,
    )
 def discover_sources(
    rpc_url: str,
    current_slot: int,
    max_age_slots: int,
    max_latency_ms: float,
    threads: int,
    version_filter: str | None,
 ) -> list[SnapshotSource]:
    """Discover all snapshot sources, then filter.
    Probing and filtering are separate: all reachable sources are collected
    first so we can report what exists even if filters reject everything.
    """
    rpc_nodes: list[str] = get_cluster_rpc_nodes(rpc_url, version_filter)
    if not rpc_nodes:
        log.error("No RPC nodes found via getClusterNodes")
        return []
    log.info("Found %d RPC nodes, probing for snapshots...", len(rpc_nodes))
    all_sources: list[SnapshotSource] = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
        futures: dict[concurrent.futures.Future[SnapshotSource | None], str] = {
            pool.submit(probe_rpc_snapshot, addr, current_slot): addr
            for addr in rpc_nodes
        }
        done: int = 0
        for future in concurrent.futures.as_completed(futures):
            done += 1
            if done % 200 == 0:
                log.info("  probed %d/%d nodes, %d reachable",
                         done, len(rpc_nodes), len(all_sources))
            try:
                result: SnapshotSource | None = future.result()
            except (urllib.error.URLError, OSError, TimeoutError) as e:
                log.debug("Probe failed for %s: %s", futures[future], e)
                continue
            if result:
                all_sources.append(result)
    log.info("Discovered %d reachable sources", len(all_sources))
    # Apply filters
    filtered: list[SnapshotSource] = []
    rejected_age: int = 0
    rejected_latency: int = 0
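    for src in all_sources:
        # Reject snapshots that are too old, or implausibly far ahead of the
        # slot reported by our RPC (more than 100 slots in the future).
        if src.slots_diff > max_age_slots or src.slots_diff < -100: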
            rejected_age += 1
            continue
        if src.latency_ms > max_latency_ms:
            rejected_latency += 1
            continue
        filtered.append(src)
    if rejected_age or rejected_latency:
        log.info("Filtered: %d rejected by age (>%d slots), %d by latency (>%.0fms)",
                 rejected_age, max_age_slots, rejected_latency, max_latency_ms)
    if not filtered and all_sources:
        # Show what was available so the user can adjust filters
        all_sources.sort(key=lambda s: s.slots_diff)
        best = all_sources[0]
        log.warning("All %d sources rejected by filters. Best available: "
                     "%s (age=%d slots, latency=%.0fms). "
                     "Try --max-snapshot-age %d --max-latency %.0f",
                     len(all_sources), best.rpc_address,
                     best.slots_diff, best.latency_ms,
                     best.slots_diff + 500,
                     max(best.latency_ms * 1.5, 500))
    log.info("Found %d sources after filtering", len(filtered))
    return filtered
 # -- Speed benchmark -----------------------------------------------------------
 def measure_speed(rpc_address: str, measure_time: int = 7) -> float:
    """Measure download speed from an RPC node. Returns bytes/sec."""
    url: str = f"http://{rpc_address}/snapshot.tar.bz2"
    req = Request(url)
    try:
        with urllib.request.urlopen(req, timeout=measure_time + 5) as resp:
            start: float = time.monotonic()
            total: int = 0
            while True:
                elapsed: float = time.monotonic() - start
                if elapsed >= measure_time:
                    break
                chunk: bytes = resp.read(81920)
                if not chunk:
                    break
                total += len(chunk)
            elapsed = time.monotonic() - start
            if elapsed <= 0:
                return 0.0
            return total / elapsed
    except (urllib.error.URLError, OSError, TimeoutError):
        return 0.0
 # -- Incremental probing -------------------------------------------------------
 def probe_incremental(
    fast_sources: list[SnapshotSource],
    full_snap_slot: int,
 ) -> tuple[str | None, list[str]]:
    """Probe fast sources for the best incremental matching full_snap_slot.
    Returns (filename, mirror_urls) or (None, []) if no match found.
    The "best" incremental is the one with the highest slot (closest to head).
    """
    best_filename: str | None = None
    best_slot: int = 0
    best_source: SnapshotSource | None = None
    best_path: str | None = None
    for source in fast_sources:
        inc_url: str = f"http://{source.rpc_address}/incremental-snapshot.tar.bz2"
        inc_location, _ = head_no_follow(inc_url, timeout=2)
        if not inc_location:
            continue
        inc_fn, inc_fp = _parse_snapshot_filename(inc_location)
        m: re.Match[str] | None = INCR_SNAP_RE.match(inc_fn)
        if not m:
            continue
        if int(m.group(1)) != full_snap_slot:
            log.debug("  %s: incremental base slot %s != full %d, skipping",
                      source.rpc_address, m.group(1), full_snap_slot)
            continue
        inc_slot: int = int(m.group(2))
        if inc_slot > best_slot:
            best_slot = inc_slot
            best_filename = inc_fn
            best_source = source
            best_path = inc_fp
    if best_filename is None or best_source is None or best_path is None:
        return None, []
    # Build mirror list — check other sources for the same filename
    mirror_urls: list[str] = [f"http://{best_source.rpc_address}{best_path}"]
    for other in fast_sources:
        if other.rpc_address == best_source.rpc_address:
            continue
        other_loc, _ = head_no_follow(
            f"http://{other.rpc_address}/incremental-snapshot.tar.bz2", timeout=2)
        if other_loc:
            other_fn, other_fp = _parse_snapshot_filename(other_loc)
            if other_fn == best_filename:
                mirror_urls.append(f"http://{other.rpc_address}{other_fp}")
    return best_filename, mirror_urls
 # -- Download ------------------------------------------------------------------
 def download_aria2c(
    urls: list[str],
    output_dir: str,
    filename: str,
    connections: int = 16,
 ) -> bool:
    """Download a file using aria2c with parallel connections.
    When multiple URLs are provided, aria2c treats them as mirrors of the
    same file and distributes chunks across all of them.
    """
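    if not urls:
        log.error("No URLs given for %s", filename)
        return False
    num_mirrors: int = len(urls)
    # --split is the total number of simultaneous connections across all mirrors.
    total_splits: int = connections * num_mirrors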
    cmd: list[str] = [
        "aria2c",
        "--file-allocation=none",
        "--continue=false",
        f"--max-connection-per-server={connections}",
        f"--split={total_splits}",
        "--min-split-size=50M",
        # aria2c retries individual chunk connections on transient network
        # errors (TCP reset, timeout). This is transport-level retry analogous
        # to TCP retransmit, not application-level retry of a failed operation.
        "--max-tries=5",
        "--retry-wait=5",
        "--timeout=60",
        "--connect-timeout=10",
        "--summary-interval=10",
        "--console-log-level=notice",
        f"--dir={output_dir}",
        f"--out={filename}",
        "--auto-file-renaming=false",
        "--allow-overwrite=true",
        *urls,
    ]
    log.info("Downloading %s", filename)
    log.info("  aria2c: %d connections x %d mirrors (%d splits)",
             connections, num_mirrors, total_splits)
    start: float = time.monotonic()
    result: subprocess.CompletedProcess[bytes] = subprocess.run(cmd)
    elapsed: float = time.monotonic() - start
    if result.returncode != 0:
        log.error("aria2c failed with exit code %d", result.returncode)
        return False
    filepath: Path = Path(output_dir) / filename
    if not filepath.exists():
        log.error("aria2c reported success but %s does not exist", filepath)
        return False
    size_bytes: int = filepath.stat().st_size
    size_gib: float = size_bytes / (1024 ** 3)
    avg_mib: float = size_bytes / elapsed / (1024 ** 2) if elapsed > 0 else 0.0
    log.info("  Done: %.1f GiB in %.0fs (%.1f MiB/s avg)", size_gib, elapsed, avg_mib)
    return True
 # -- Shared helpers ------------------------------------------------------------
 def _discover_and_benchmark(
    rpc_url: str,
    current_slot: int,
    *,
    max_snapshot_age: int = 10000,
    max_latency: float = 500,
    threads: int = 500,
    min_download_speed: int = 20,
    measurement_time: int = 7,
    max_speed_checks: int = 15,
    version_filter: str | None = None,
 ) -> list[SnapshotSource]:
    """Discover snapshot sources and benchmark download speed.
    Returns sources that meet the minimum speed requirement, sorted by speed.
    """
    sources: list[SnapshotSource] = discover_sources(
        rpc_url, current_slot,
        max_age_slots=max_snapshot_age,
        max_latency_ms=max_latency,
        threads=threads,
        version_filter=version_filter,
    )
    if not sources:
        return []
    sources.sort(key=lambda s: s.latency_ms)
    log.info("Benchmarking download speed on the %d lowest-latency sources...",
             min(max_speed_checks, len(sources)))
    fast_sources: list[SnapshotSource] = []
    checked: int = 0
    min_speed_bytes: int = min_download_speed * 1024 * 1024
    for source in sources:
        if checked >= max_speed_checks:
            break
        checked += 1
        speed: float = measure_speed(source.rpc_address, measurement_time)
        source.download_speed = speed
        speed_mib: float = speed / (1024 ** 2)
        if speed < min_speed_bytes:
            log.info("  %s: %.1f MiB/s (too slow, need >=%d MiB/s)",
                     source.rpc_address, speed_mib, min_download_speed)
            continue
        log.info("  %s: %.1f MiB/s (latency: %.0fms, age: %d slots)",
                 source.rpc_address, speed_mib,
                 source.latency_ms, source.slots_diff)
        fast_sources.append(source)
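    # Benchmarking ran in latency order; sort so the fastest source comes first,
    # as the docstring promises and download_best_snapshot() relies on.
    fast_sources.sort(key=lambda s: s.download_speed, reverse=True)
    return fast_sources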
 def _rolling_incremental_download(
    fast_sources: list[SnapshotSource],
    full_snap_slot: int,
    output_dir: str,
    convergence_slots: int,
    connections: int,
    rpc_url: str,
 ) -> str | None:
    """Download incrementals in a loop until converged.
    Probes fast_sources for incrementals matching full_snap_slot, downloads
    the freshest one, then re-probes until the gap to head is within
    convergence_slots. Returns the filename of the final incremental,
    or None if no incremental was found.
    """
    prev_inc_filename: str | None = None
    loop_start: float = time.monotonic()
    max_convergence_time: float = 1800.0  # 30 min wall-clock limit
    while True:
        if time.monotonic() - loop_start > max_convergence_time:
            if prev_inc_filename:
                log.warning("Convergence timeout (%.0fs) — using %s",
                            max_convergence_time, prev_inc_filename)
            else:
                log.warning("Convergence timeout (%.0fs) — no incremental downloaded",
                            max_convergence_time)
            break
        inc_fn, inc_mirrors = probe_incremental(fast_sources, full_snap_slot)
        if inc_fn is None:
            if prev_inc_filename is None:
                log.error("No matching incremental found for base slot %d",
                          full_snap_slot)
            else:
                log.info("No newer incremental available, using %s", prev_inc_filename)
            break
        m_inc: re.Match[str] | None = INCR_SNAP_RE.match(inc_fn)
        assert m_inc is not None
        inc_slot: int = int(m_inc.group(2))
        head_slot: int | None = get_current_slot(rpc_url)
        if head_slot is None:
            log.warning("Cannot get current slot — downloading best available incremental")
            gap: int = convergence_slots + 1
        else:
            gap = head_slot - inc_slot
        if inc_fn == prev_inc_filename:
            if gap <= convergence_slots:
                log.info("Incremental %s already downloaded (gap %d slots, converged)",
                         inc_fn, gap)
                break
            log.info("No newer incremental yet (slot %d, gap %d slots), waiting...",
                     inc_slot, gap)
            time.sleep(10)
            continue
        log.info("Downloading incremental %s (%d mirrors, slot %d, gap %d slots)",
                 inc_fn, len(inc_mirrors), inc_slot, gap)
        if not download_aria2c(inc_mirrors, output_dir, inc_fn, connections):
            log.warning("Failed to download incremental %s; re-probing in 10s", inc_fn)
            time.sleep(10)
            continue
        # Remove the superseded incremental only after the new one is on disk,
        # so a failed download never leaves us without the previous file.
        if prev_inc_filename is not None:
            old_path: Path = Path(output_dir) / prev_inc_filename
            if old_path.exists():
                log.info("Removing superseded incremental %s", prev_inc_filename)
                old_path.unlink()
        prev_inc_filename = inc_fn
        if gap <= convergence_slots:
            log.info("Converged: incremental slot %d is %d slots behind head",
                     inc_slot, gap)
            break
        if head_slot is None:
            break
        log.info("Not converged (gap %d > %d), re-probing in 10s...",
                 gap, convergence_slots)
        time.sleep(10)
    return prev_inc_filename
 # -- Public API ----------------------------------------------------------------
 def download_incremental_for_slot(
    output_dir: str,
    full_snap_slot: int,
    *,
    cluster: str = "mainnet-beta",
    rpc_url: str | None = None,
    connections: int = 16,
    threads: int = 500,
    max_snapshot_age: int = 10000,
    max_latency: float = 500,
    min_download_speed: int = 20,
    measurement_time: int = 7,
    max_speed_checks: int = 15,
    version_filter: str | None = None,
    convergence_slots: int = 500,
 ) -> bool:
    """Download an incremental snapshot for an existing full snapshot.
    Discovers sources, benchmarks speed, then runs the rolling incremental
    download loop for the given full snapshot base slot. Does NOT download
    a full snapshot.
    Returns True if an incremental was downloaded, False otherwise.
    """
    resolved_rpc: str = rpc_url or CLUSTER_RPC[cluster]
    if not shutil.which("aria2c"):
        log.error("aria2c not found. Install with: apt install aria2")
        return False
    log.info("Incremental download for base slot %d", full_snap_slot)
    current_slot: int | None = get_current_slot(resolved_rpc)
    if current_slot is None:
        log.error("Cannot get current slot from %s", resolved_rpc)
        return False
    fast_sources: list[SnapshotSource] = _discover_and_benchmark(
        resolved_rpc, current_slot,
        max_snapshot_age=max_snapshot_age,
        max_latency=max_latency,
        threads=threads,
        min_download_speed=min_download_speed,
        measurement_time=measurement_time,
        max_speed_checks=max_speed_checks,
        version_filter=version_filter,
    )
    if not fast_sources:
        log.error("No fast sources found")
        return False
    os.makedirs(output_dir, exist_ok=True)
    result: str | None = _rolling_incremental_download(
        fast_sources, full_snap_slot, output_dir,
        convergence_slots, connections, resolved_rpc,
    )
    return result is not None
 def download_best_snapshot(
    output_dir: str,
    *,
    cluster: str = "mainnet-beta",
    rpc_url: str | None = None,
    connections: int = 16,
    threads: int = 500,
    max_snapshot_age: int = 10000,
    max_latency: float = 500,
    min_download_speed: int = 20,
    measurement_time: int = 7,
    max_speed_checks: int = 15,
    version_filter: str | None = None,
    full_only: bool = False,
    convergence_slots: int = 500,
 ) -> bool:
    """Download the best available snapshot to output_dir.
    This is the programmatic API, called by entrypoint.py for automatic
    snapshot download. Returns True once the full snapshot is in place (a
    missing incremental does not count as failure), False otherwise.
    All parameters default to the same values as the CLI flags.
    """
    resolved_rpc: str = rpc_url or CLUSTER_RPC[cluster]
    if not shutil.which("aria2c"):
        log.error("aria2c not found. Install with: apt install aria2")
        return False
    log.info("Cluster: %s | RPC: %s", cluster, resolved_rpc)
    current_slot: int | None = get_current_slot(resolved_rpc)
    if current_slot is None:
        log.error("Cannot get current slot from %s", resolved_rpc)
        return False
    log.info("Current slot: %d", current_slot)
    fast_sources: list[SnapshotSource] = _discover_and_benchmark(
        resolved_rpc, current_slot,
        max_snapshot_age=max_snapshot_age,
        max_latency=max_latency,
        threads=threads,
        min_download_speed=min_download_speed,
        measurement_time=measurement_time,
        max_speed_checks=max_speed_checks,
        version_filter=version_filter,
    )
    if not fast_sources:
        log.error("No fast sources found")
        return False
    # Use the fastest source as primary, build full snapshot download plan
    best: SnapshotSource = fast_sources[0]
    full_paths: list[str] = [fp for fp in best.file_paths
                             if fp.rsplit("/", 1)[-1].startswith("snapshot-")]
    if not full_paths:
        log.error("Best source has no full snapshot")
        return False
    # Build mirror URLs for the full snapshot
    full_filename: str = full_paths[0].rsplit("/", 1)[-1]
    full_mirrors: list[str] = [f"http://{best.rpc_address}{full_paths[0]}"]
    for other in fast_sources[1:]:
        for other_fp in other.file_paths:
            if other_fp.rsplit("/", 1)[-1] == full_filename:
                full_mirrors.append(f"http://{other.rpc_address}{other_fp}")
                break
    speed_mib: float = best.download_speed / (1024 ** 2)
    log.info("Best source: %s (%.1f MiB/s), %d mirrors",
             best.rpc_address, speed_mib, len(full_mirrors))
    # Download full snapshot
    os.makedirs(output_dir, exist_ok=True)
    total_start: float = time.monotonic()
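    filepath: Path = Path(output_dir) / full_filename
    # aria2c keeps a "<name>.aria2" control file next to a download until it
    # finishes, so its presence means an earlier run left a partial file.
    partial: bool = Path(f"{filepath}.aria2").exists()
    if filepath.exists() and filepath.stat().st_size > 0 and not partial:
        log.info("Skipping %s (already exists: %.1f GiB)",
                 full_filename, filepath.stat().st_size / (1024 ** 3))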
    else:
        if not download_aria2c(full_mirrors, output_dir, full_filename, connections):
            log.error("Failed to download %s", full_filename)
            return False
    # Download incremental separately — the full download took minutes,
    # so any incremental from discovery is stale. Re-probe for fresh ones.
    if not full_only:
        fm: re.Match[str] | None = FULL_SNAP_RE.match(full_filename)
        if fm:
            full_snap_slot: int = int(fm.group(1))
            log.info("Downloading incremental for base slot %d...", full_snap_slot)
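            inc_result: str | None = _rolling_incremental_download(
                fast_sources, full_snap_slot, output_dir,
                convergence_slots, connections, resolved_rpc,
            )
            if inc_result is None:
                log.warning("No incremental downloaded; only the full snapshot is available")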
    total_elapsed: float = time.monotonic() - total_start
    log.info("All downloads complete in %.0fs", total_elapsed)
    return True
 # -- Main (CLI) ----------------------------------------------------------------
 def main() -> int:
    p: argparse.ArgumentParser = argparse.ArgumentParser(
        description="Download Solana snapshots with aria2c parallel downloads",
    )
    p.add_argument("-o", "--output", default="/srv/kind/solana/snapshots",
                   help="Snapshot output directory (default: /srv/kind/solana/snapshots)")
    p.add_argument("-c", "--cluster", default="mainnet-beta",
                   choices=list(CLUSTER_RPC),
                   help="Solana cluster (default: mainnet-beta)")
    p.add_argument("-r", "--rpc", default=None,
                   help="RPC URL for cluster discovery (default: public RPC)")
    p.add_argument("-n", "--connections", type=int, default=16,
                   help="aria2c connections per download (default: 16)")
    p.add_argument("-t", "--threads", type=int, default=500,
                   help="Threads for parallel RPC probing (default: 500)")
    p.add_argument("--max-snapshot-age", type=int, default=10000,
                   help="Max snapshot age in slots (default: 10000)")
    p.add_argument("--max-latency", type=float, default=500,
                   help="Max RPC probe latency in ms (default: 500)")
    p.add_argument("--min-download-speed", type=int, default=20,
                   help="Min download speed in MiB/s (default: 20)")
    p.add_argument("--measurement-time", type=int, default=7,
                   help="Speed measurement duration in seconds (default: 7)")
    p.add_argument("--max-speed-checks", type=int, default=15,
                   help="Max nodes to benchmark before giving up (default: 15)")
    p.add_argument("--version", default=None,
                   help="Filter nodes by version prefix (e.g. '2.2')")
    p.add_argument("--convergence-slots", type=int, default=500,
                   help="Max slot gap for incremental convergence (default: 500)")
    p.add_argument("--full-only", action="store_true",
                   help="Download only full snapshot, skip incremental")
    p.add_argument("--dry-run", action="store_true",
                   help="Find best source and print URL, don't download")
    p.add_argument("--post-cmd",
                   help="Shell command to run after successful download "
                        "(e.g. 'kubectl scale deployment ... --replicas=1')")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="Enable debug logging")
    args: argparse.Namespace = p.parse_args()
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%H:%M:%S",
    )
    # Dry-run uses the original inline flow (needs access to sources for URL printing)
    if args.dry_run:
        rpc_url: str = args.rpc or CLUSTER_RPC[args.cluster]
        current_slot: int | None = get_current_slot(rpc_url)
        if current_slot is None:
            log.error("Cannot get current slot from %s", rpc_url)
            return 1
        sources: list[SnapshotSource] = discover_sources(
            rpc_url, current_slot,
            max_age_slots=args.max_snapshot_age,
            max_latency_ms=args.max_latency,
            threads=args.threads,
            version_filter=args.version,
        )
        if not sources:
            log.error("No snapshot sources found")
            return 1
        sources.sort(key=lambda s: s.latency_ms)
        best = sources[0]
        for fp in best.file_paths:
            print(f"http://{best.rpc_address}{fp}")
        return 0
    ok: bool = download_best_snapshot(
        args.output,
        cluster=args.cluster,
        rpc_url=args.rpc,
        connections=args.connections,
        threads=args.threads,
        max_snapshot_age=args.max_snapshot_age,
        max_latency=args.max_latency,
        min_download_speed=args.min_download_speed,
        measurement_time=args.measurement_time,
        max_speed_checks=args.max_speed_checks,
        version_filter=args.version,
        full_only=args.full_only,
        convergence_slots=args.convergence_slots,
    )
    if ok and args.post_cmd:
        log.info("Running post-download command: %s", args.post_cmd)
        result: subprocess.CompletedProcess[bytes] = subprocess.run(
            args.post_cmd, shell=True,
        )
        if result.returncode != 0:
            log.error("Post-download command failed with exit code %d",
                      result.returncode)
            return 1
        log.info("Post-download command completed successfully")
    return 0 if ok else 1
 if __name__ == "__main__":
    sys.exit(main())
--- a/stack-orchestrator/container-build/laconicnetwork-agave/start-test.sh
+++ b/stack-orchestrator/container-build/laconicnetwork-agave/start-test.sh
@ -0,0 +1,112 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # -----------------------------------------------------------------------
 # Start solana-test-validator with optional SPL token setup
 #
 # Environment variables:
 #   FACILITATOR_PUBKEY  - facilitator fee-payer public key (base58)
 #   SERVER_PUBKEY       - server/payee wallet public key (base58)
 #   CLIENT_PUBKEY       - client/payer wallet public key (base58)
 #   MINT_DECIMALS       - token decimals (default: 6, matching USDC)
 #   MINT_AMOUNT         - amount to mint to client (default: 1000000000)
 #   LEDGER_DIR          - ledger directory (default: /data/ledger)
 # -----------------------------------------------------------------------
 LEDGER_DIR="${LEDGER_DIR:-/data/ledger}"
 MINT_DECIMALS="${MINT_DECIMALS:-6}"
 MINT_AMOUNT="${MINT_AMOUNT:-1000000000}"
 SETUP_MARKER="${LEDGER_DIR}/.setup-done"
 sudo chown -R "$(id -u):$(id -g)" "$LEDGER_DIR" 2>/dev/null || true
 # Start test-validator in the background
 solana-test-validator \
  --ledger "${LEDGER_DIR}" \
  --rpc-port 8899 \
  --bind-address 0.0.0.0 \
  --quiet &
 VALIDATOR_PID=$!
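 # Wait for RPC to become available; give up after 60 attempts
 echo "Waiting for test-validator RPC..."
 READY=0
 for i in $(seq 1 60); do
  if solana cluster-version --url http://127.0.0.1:8899 >/dev/null 2>&1; then
    echo "Test-validator is ready (attempt ${i})"
    READY=1
    break
  fi
  sleep 1
 done
 if [ "${READY}" -ne 1 ]; then
  echo "Test-validator RPC did not become ready after 60 attempts" >&2
  kill "${VALIDATOR_PID}" 2>/dev/null || true
  exit 1
 fi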
 solana config set --url http://127.0.0.1:8899
 # Only run setup once (idempotent via marker file)
 if [ ! -f "${SETUP_MARKER}" ]; then
  echo "Running first-time setup..."
  # Airdrop SOL to all wallets for gas
  for PUBKEY in "${FACILITATOR_PUBKEY:-}" "${SERVER_PUBKEY:-}" "${CLIENT_PUBKEY:-}"; do
    if [ -n "${PUBKEY}" ]; then
      echo "Airdropping 100 SOL to ${PUBKEY}..."
      solana airdrop 100 "${PUBKEY}" --url http://127.0.0.1:8899 || true
    fi
  done
  # Create a USDC-equivalent SPL token mint if any pubkeys are set
  if [ -n "${CLIENT_PUBKEY:-}" ] || [ -n "${FACILITATOR_PUBKEY:-}" ] || [ -n "${SERVER_PUBKEY:-}" ]; then
    MINT_AUTHORITY_FILE="${LEDGER_DIR}/mint-authority.json"
    if [ ! -f "${MINT_AUTHORITY_FILE}" ]; then
      solana-keygen new --no-bip39-passphrase --outfile "${MINT_AUTHORITY_FILE}" --force
      MINT_AUTH_PUBKEY=$(solana-keygen pubkey "${MINT_AUTHORITY_FILE}")
      solana airdrop 10 "${MINT_AUTH_PUBKEY}" --url http://127.0.0.1:8899
    fi
    MINT_ADDRESS_FILE="${LEDGER_DIR}/usdc-mint-address.txt"
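    if [ ! -f "${MINT_ADDRESS_FILE}" ]; then
      # Capture the address first so a failed create-token never leaves an
      # empty address file behind (which would skip creation on the next run).
      USDC_MINT_NEW=$(spl-token create-token \
        --decimals "${MINT_DECIMALS}" \
        --mint-authority "${MINT_AUTHORITY_FILE}" \
        --url http://127.0.0.1:8899 \
        2>&1 | awk '/Creating token/ {print $3}')
      if [ -z "${USDC_MINT_NEW}" ]; then
        echo "Failed to create USDC mint" >&2
        exit 1
      fi
      echo "${USDC_MINT_NEW}" > "${MINT_ADDRESS_FILE}"
      echo "Created USDC mint: ${USDC_MINT_NEW}"
    fi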
    USDC_MINT=$(cat "${MINT_ADDRESS_FILE}")
    # Create ATAs and mint tokens for the client
    if [ -n "${CLIENT_PUBKEY:-}" ]; then
      echo "Creating ATA for client ${CLIENT_PUBKEY}..."
      spl-token create-account "${USDC_MINT}" \
        --owner "${CLIENT_PUBKEY}" \
        --fee-payer "${MINT_AUTHORITY_FILE}" \
        --url http://127.0.0.1:8899 || true
      echo "Minting ${MINT_AMOUNT} tokens to client..."
      spl-token mint "${USDC_MINT}" "${MINT_AMOUNT}" \
        --recipient-owner "${CLIENT_PUBKEY}" \
        --mint-authority "${MINT_AUTHORITY_FILE}" \
        --url http://127.0.0.1:8899 || true
    fi
    # Create ATAs for server and facilitator
    for PUBKEY in "${SERVER_PUBKEY:-}" "${FACILITATOR_PUBKEY:-}"; do
      if [ -n "${PUBKEY}" ]; then
        echo "Creating ATA for ${PUBKEY}..."
        spl-token create-account "${USDC_MINT}" \
          --owner "${PUBKEY}" \
          --fee-payer "${MINT_AUTHORITY_FILE}" \
          --url http://127.0.0.1:8899 || true
      fi
    done
    # Expose mint address for other containers
    cp "${MINT_ADDRESS_FILE}" /tmp/usdc-mint-address.txt 2>/dev/null || true
  fi
  touch "${SETUP_MARKER}"
  echo "Setup complete."
 fi
 echo "solana-test-validator running (PID ${VALIDATOR_PID})"
 wait "${VALIDATOR_PID}"
--- a/stack-orchestrator/container-build/laconicnetwork-doublezero/Dockerfile
+++ b/stack-orchestrator/container-build/laconicnetwork-doublezero/Dockerfile
@ -0,0 +1,22 @@
 # DoubleZero network daemon for Solana validators
 # Provides GRE tunnel + BGP routing via the DoubleZero fiber backbone
 FROM debian:bookworm-slim
 RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    gnupg \
    iproute2 \
    && rm -rf /var/lib/apt/lists/*
 # Install DoubleZero from Cloudsmith apt repo
 RUN curl -1sLf https://dl.cloudsmith.io/public/malbeclabs/doublezero/setup.deb.sh | bash \
    && apt-get update \
    && apt-get install -y doublezero \
    && rm -rf /var/lib/apt/lists/*
 COPY entrypoint.sh /usr/local/bin/entrypoint.sh
 RUN chmod +x /usr/local/bin/entrypoint.sh
 ENTRYPOINT ["entrypoint.sh"]
--- a/stack-orchestrator/container-build/laconicnetwork-doublezero/build.sh
+++ b/stack-orchestrator/container-build/laconicnetwork-doublezero/build.sh
@ -0,0 +1,9 @@
 #!/usr/bin/env bash
 # Build laconicnetwork/doublezero
 source "${CERC_CONTAINER_BASE_DIR}/build-base.sh"
 # ${build_command_args} stays unquoted on purpose so it splits into separate flags.
 docker build -t laconicnetwork/doublezero:local \
  ${build_command_args} \
  -f "${CERC_CONTAINER_BASE_DIR}/laconicnetwork-doublezero/Dockerfile" \
  "${CERC_CONTAINER_BASE_DIR}/laconicnetwork-doublezero"
--- a/stack-orchestrator/container-build/laconicnetwork-doublezero/entrypoint.sh
+++ b/stack-orchestrator/container-build/laconicnetwork-doublezero/entrypoint.sh
@ -0,0 +1,38 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # -----------------------------------------------------------------------
 # Start doublezerod
 #
 # Optional environment:
 #   DOUBLEZERO_RPC_ENDPOINT - Solana RPC endpoint (default: http://127.0.0.1:8899)
 #   DOUBLEZERO_ENV          - DoubleZero environment (default: mainnet-beta)
 #   DOUBLEZERO_EXTRA_ARGS   - additional doublezerod arguments
 # -----------------------------------------------------------------------
 RPC_ENDPOINT="${DOUBLEZERO_RPC_ENDPOINT:-http://127.0.0.1:8899}"
 DZ_ENV="${DOUBLEZERO_ENV:-mainnet-beta}"
 # Ensure state directories exist
 mkdir -p /var/lib/doublezerod /var/run/doublezerod
 # Generate DZ identity if not already present
 DZ_CONFIG_DIR="${HOME}/.config/doublezero"
 mkdir -p "$DZ_CONFIG_DIR"
 if [ ! -f "$DZ_CONFIG_DIR/id.json" ]; then
  echo "Generating DoubleZero identity..."
  doublezero keygen
 fi
 echo "Starting doublezerod..."
 echo "Environment: $DZ_ENV"
 echo "RPC endpoint: $RPC_ENDPOINT"
 echo "DZ address: $(doublezero address)"
 ARGS=()
 [ -n "${DOUBLEZERO_EXTRA_ARGS:-}" ] && read -ra ARGS <<< "$DOUBLEZERO_EXTRA_ARGS"
 exec doublezerod \
  -env "$DZ_ENV" \
  -solana-rpc-endpoint "$RPC_ENDPOINT" \
  "${ARGS[@]}"
--- a/stack-orchestrator/stacks/agave/README.md
+++ b/stack-orchestrator/stacks/agave/README.md
@ -0,0 +1,169 @@
 # agave stack
 Unified Agave/Jito Solana stack supporting three modes:
 | Mode | Compose file | Use case |
 |------|-------------|----------|
 | `test` | `docker-compose-agave-test.yml` | Local dev with instant finality |
 | `rpc` | `docker-compose-agave-rpc.yml` | Non-voting mainnet/testnet RPC node |
 | `validator` | `docker-compose-agave.yml` | Voting validator |
 ## Build
 ```bash
 # Vanilla Agave v3.1.9
 laconic-so --stack agave build-containers
 # Jito v3.1.8
 AGAVE_REPO=https://github.com/jito-foundation/jito-solana.git \
 AGAVE_VERSION=v3.1.8-jito \
 laconic-so --stack agave build-containers
 ```
 The build compiles from source (~30-60 min on first build).
 ## Deploy
 ```bash
 # Test validator (dev)
 laconic-so --stack agave deploy init --output spec.yml
 laconic-so --stack agave deploy create --spec-file spec.yml --deployment-dir my-test
 laconic-so deployment --dir my-test start
 # Mainnet RPC (e.g. biscayne)
 laconic-so --stack agave deploy init --output spec.yml
 # Edit spec.yml to set AGAVE_MODE=rpc, VALIDATOR_ENTRYPOINT, KNOWN_VALIDATOR, etc.
 laconic-so --stack agave deploy create --spec-file spec.yml --deployment-dir my-rpc
 laconic-so deployment --dir my-rpc start
 ```
 ## Configuration
 Mode is selected via the `AGAVE_MODE` environment variable (`test`, `rpc`, or `validator`).
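As a rough illustration of the mode table above, the sketch below resolves an `AGAVE_MODE` value to the compose file listed for it. The function name `compose_file_for_mode` is hypothetical and not part of the stack; the actual routing lives in the image's `entrypoint.sh` and may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an AGAVE_MODE value to the compose file listed
# in the modes table. Not part of the stack itself.
compose_file_for_mode() {
  local mode="$1"
  case "$mode" in
    test)      echo "docker-compose-agave-test.yml" ;;
    rpc)       echo "docker-compose-agave-rpc.yml" ;;
    validator) echo "docker-compose-agave.yml" ;;
    *)
      echo "unknown AGAVE_MODE: '$mode' (expected test, rpc, or validator)" >&2
      return 1
      ;;
  esac
}
```

A caller could use it as `compose_file_for_mode "${AGAVE_MODE:?AGAVE_MODE is not set}"`, which fails early when the variable is missing or misspelled.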
 ### RPC mode required env
 - `VALIDATOR_ENTRYPOINT` - cluster entrypoint (e.g. `entrypoint.mainnet-beta.solana.com:8001`)
 - `KNOWN_VALIDATOR` - known validator pubkey
 ### Validator mode required env
 - `VALIDATOR_ENTRYPOINT` - cluster entrypoint
 - `KNOWN_VALIDATOR` - known validator pubkey
 - Identity and vote account keypairs mounted at `/data/config/`
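The required variables above could be checked before start-up with something like the following. This is a minimal sketch: `missing_env` is a hypothetical helper, not something the stack ships, and it only checks that variables are non-empty (it does not verify the keypair files).

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper. Prints the names of the given variables
# that are unset or empty, and returns non-zero if any are missing.
missing_env() {
  local name missing=()
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    [ -n "${!name:-}" ] || missing+=("$name")
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    echo "${missing[*]}"
    return 1
  fi
}
```

For example, `missing_env VALIDATOR_ENTRYPOINT KNOWN_VALIDATOR || exit 1` would stop an `rpc` or `validator` start when either variable is absent.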
 ### Jito (optional, any mode except test)
 Set `JITO_ENABLE=true` and provide:
 - `JITO_BLOCK_ENGINE_URL`
 - `JITO_SHRED_RECEIVER_ADDR`
 - `JITO_TIP_PAYMENT_PROGRAM`
 - `JITO_DISTRIBUTION_PROGRAM`
 - `JITO_MERKLE_ROOT_AUTHORITY`
 - `JITO_COMMISSION_BPS`
 The image must be built from the `jito-foundation/jito-solana` repo for the Jito flags to work.
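The Jito rules above (opt-in, unavailable in `test` mode, six variables required) can be sketched as a single check. `jito_config_ok` is a hypothetical name, and the 0 to 10000 range for `JITO_COMMISSION_BPS` is an assumption based on basis points meaning hundredths of a percent, not something the stack documents.

```bash
#!/usr/bin/env bash
# Hypothetical check, not part of the stack. Returns 0 when the Jito
# settings are consistent with the rules described above.
jito_config_ok() {
  # Jito is opt-in: nothing to check unless JITO_ENABLE=true.
  [ "${JITO_ENABLE:-false}" = "true" ] || return 0

  if [ "${AGAVE_MODE:-}" = "test" ]; then
    echo "JITO_ENABLE=true is not supported in test mode" >&2
    return 1
  fi

  local name
  for name in JITO_BLOCK_ENGINE_URL JITO_SHRED_RECEIVER_ADDR \
              JITO_TIP_PAYMENT_PROGRAM JITO_DISTRIBUTION_PROGRAM \
              JITO_MERKLE_ROOT_AUTHORITY JITO_COMMISSION_BPS; do
    if [ -z "${!name:-}" ]; then
      echo "missing required Jito variable: $name" >&2
      return 1
    fi
  done

  # Assumption: commission is in basis points, so 0..10000 (0% to 100%).
  case "$JITO_COMMISSION_BPS" in
    *[!0-9]*)
      echo "JITO_COMMISSION_BPS must be a non-negative integer" >&2
      return 1
      ;;
  esac
  if [ "$JITO_COMMISSION_BPS" -gt 10000 ]; then
    echo "JITO_COMMISSION_BPS must be at most 10000" >&2
    return 1
  fi
}
```

The check only validates configuration; it cannot tell whether the running image was actually built from the Jito repository.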
 ## Runtime requirements
 The container requires the following (already set in compose files):
 - `privileged: true` — allows `mlock()` and raw network access
 - `cap_add: IPC_LOCK` — memory page locking for account indexes and ledger mappings
 - `ulimits: memlock: -1` (unlimited) — Agave locks gigabytes of memory
 - `ulimits: nofile: 1000000` — gossip/TPU connections + memory-mapped ledger files
 - `network_mode: host` — direct host network stack for gossip, TPU, and UDP port ranges
 Without these, Agave either refuses to start or dies under load.
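To confirm that the limits listed above actually reached the container, a check along these lines could run before the validator starts. `limits_ok` is a hypothetical helper; it takes the values as printed by bash's `ulimit -l` and `ulimit -n` so the logic can be exercised without changing real limits.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check; thresholds mirror the compose settings above.
# Usage: limits_ok <memlock> <nofile>
limits_ok() {
  local memlock="$1" nofile="$2"
  if [ "$memlock" != "unlimited" ]; then
    echo "memlock is $memlock (1024-byte units), expected unlimited" >&2
    return 1
  fi
  # `ulimit -n` can itself print "unlimited"; treat that as sufficient.
  if [ "$nofile" != "unlimited" ] && [ "$nofile" -lt 1000000 ]; then
    echo "nofile is $nofile, expected at least 1000000" >&2
    return 1
  fi
}
```

Inside the container this would be called as `limits_ok "$(ulimit -l)" "$(ulimit -n)"`.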
 ## Container overhead
 Containers running with `privileged: true` and `network_mode: host` add **zero
 measurable overhead** compared to bare metal. Linux containers are not VMs — there
 is no hypervisor, no emulation layer, no packet translation:
 - **Network**: `network_mode: host` shares the host's network namespace directly.
  No virtual bridge, no NAT, no veth pair. Same kernel code path as bare metal.
  GRE tunnels (DoubleZero) and raw sockets work identically.
 - **CPU**: No hypervisor. The process runs on the same physical cores with the
  same scheduler priority as any host process.
 - **Memory**: `IPC_LOCK` + unlimited memlock means Agave can `mlock()` pages
  exactly like bare metal. No memory ballooning or overcommit.
 - **Disk I/O**: PersistentVolumes backed by hostPath mounts have identical I/O
  characteristics to direct filesystem access.
 The only overhead is cgroup accounting (nanoseconds per syscall) and overlayfs
 for cold file opens (single-digit microseconds, zero once cached).
 ## DoubleZero
 DoubleZero provides optimized network routing for Solana validators via GRE
 tunnels (IP protocol 47) and BGP (TCP/179) over link-local 169.254.0.0/16.
 Traffic to other DoubleZero participants is routed through private fiber
 instead of the public internet.
 ### How it works
 `doublezerod` creates a `doublezero0` GRE tunnel interface and runs BGP
 peering through it. Routes are injected into the host routing table, so
 the validator transparently sends traffic to other DZ validators over
 the fiber backbone. In IBRL mode, traffic falls back to the public internet if DZ is down.
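One way to see whether routes are currently being injected is to count the entries in the host routing table that point at the `doublezero0` interface. The helper below is a sketch with a hypothetical name; it reads `ip route show` output on stdin rather than calling `ip` itself, so it can be tried on a machine without a tunnel.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count routes bound to the doublezero0 interface.
# Reads `ip route show` output on stdin and prints the number of matches.
count_dz_routes() {
  # grep -c prints the count but exits 1 on zero matches; `|| true` keeps
  # the function usable under `set -e`.
  grep -c -E '(^| )dev doublezero0( |$)' || true
}
```

On a host running `doublezerod`, `ip route show | count_dz_routes` gives the count, and `ip -d link show doublezero0` shows the tunnel interface itself. A count of zero would be consistent with the fallback to the public internet described above, although it does not by itself say why the routes are absent.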
 ### Container build
 ```bash
 laconic-so --stack agave build-containers
 ```
 This builds both the `laconicnetwork/agave` and `laconicnetwork/doublezero` images.
 ### Requirements
 - Validator identity keypair at `/data/config/validator-identity.json`
 - `privileged: true` + `NET_ADMIN` (GRE tunnel + route table manipulation)
 - `hostNetwork: true` (GRE uses IP protocol 47, not TCP/UDP — cannot be port-mapped)
 - Node registered with DoubleZero passport system
 ### Docker Compose
 `docker-compose-doublezero.yml` runs `doublezerod` alongside the validator with
 `network_mode: host`, sharing the `validator-config` volume for identity access.
 ### k8s deployment
 laconic-so does not pass `hostNetwork` through to generated k8s resources.
 DoubleZero runs as a DaemonSet defined in `deployment/k8s-manifests/doublezero-daemonset.yaml`,
 applied after `deployment start`:
 ```bash
 kubectl apply -f deployment/k8s-manifests/doublezero-daemonset.yaml
 ```
 Since validator pods also use `hostNetwork: true` (via the compose `network_mode: host`
 which maps to the pod spec in k8s), they automatically see the GRE routes
 injected by `doublezerod` into the node's routing table.
 ## Biscayne deployment (biscayne.vaasl.io)
 Mainnet voting validator with Jito MEV and DoubleZero.
 ```bash
 # Build Jito image
 AGAVE_REPO=https://github.com/jito-foundation/jito-solana.git \
 AGAVE_VERSION=v3.1.8-jito \
 laconic-so --stack agave build-containers
 # Create deployment from biscayne spec
 laconic-so --stack agave deploy create \
  --spec-file deployment/spec.yml \
  --deployment-dir biscayne-deployment
 # Copy validator keypairs
 cp /path/to/validator-identity.json biscayne-deployment/data/validator-config/
 cp /path/to/vote-account-keypair.json biscayne-deployment/data/validator-config/
 # Start validator
 laconic-so deployment --dir biscayne-deployment start
 # Start DoubleZero (after deployment is running)
 kubectl apply -f deployment/k8s-manifests/doublezero-daemonset.yaml
 ```
 To run as a non-voting RPC node instead, set `AGAVE_MODE: rpc` in `deployment/spec.yml`.
--- a/stack-orchestrator/stacks/agave/stack.yml
+++ b/stack-orchestrator/stacks/agave/stack.yml
@ -0,0 +1,10 @@
 version: "1.1"
 name: agave
 description: "Agave/Jito Solana validator, RPC node, or test-validator"
 containers:
  - laconicnetwork/agave
  - laconicnetwork/doublezero
 pods:
  - agave
  - doublezero
  - monitoring