stack-orchestrator/docs/blue-green-upgrades.md

13 KiB

Blue-Green Upgrades for Biscayne

Zero-downtime upgrade procedures for the agave-stack deployment on biscayne. Uses ZFS clones for instant data duplication, Caddy health-check routing for traffic shifting, and k8s native sidecars for independent container upgrades.

Architecture

                    Caddy ingress (biscayne.vaasl.io)
                    ├── upstream A: localhost:8899  ← health: /health
                    └── upstream B: localhost:8897  ← health: /health
                              │
            ┌─────────────────┴──────────────────┐
            │         kind cluster                │
            │                                     │
            │  Deployment A        Deployment B   │
            │  ┌─────────────┐   ┌─────────────┐  │
            │  │ agave :8899 │   │ agave :8897 │  │
            │  │ doublezerod │   │ doublezerod │  │
            │  └──────┬──────┘   └──────┬──────┘  │
            └─────────┼─────────────────┼─────────┘
                      │                 │
              ZFS dataset A      ZFS clone B
              (original)         (instant CoW copy)

Both deployments run in the same kind cluster with hostNetwork: true. Caddy active health checks route traffic to whichever deployment has a healthy /health endpoint.

Storage Layout

Data Path Type Survives restart?
Ledger /srv/solana/ledger ZFS zvol (xfs) Yes
Snapshots /srv/solana/snapshots ZFS zvol (xfs) Yes
Accounts /srv/solana/ramdisk/accounts /dev/ram0 (xfs) Until host reboot
Validator config /srv/deployments/agave/data/validator-config ZFS Yes
DZ config /srv/deployments/agave/data/doublezero-config ZFS Yes

The ZFS zvol biscayne/DATA/volumes/solana backs /srv/solana (ledger, snapshots). The ramdisk at /dev/ram0 holds accounts — it's a block device, not tmpfs, so it survives process restarts but not host reboots.


Procedure 1: DoubleZero Binary Upgrade (zero downtime, single pod)

The GRE tunnel (doublezero0) and BGP routes live in kernel space. They persist across doublezerod process restarts. Upgrading the DZ binary does not require tearing down the tunnel or restarting the validator.

Prerequisites

  • doublezerod is defined as a k8s native sidecar (spec.initContainers with restartPolicy: Always). See Required Changes below.
  • k8s 1.29+ (biscayne runs 1.35.1)

Steps

  1. Build or pull the new doublezero container image.

  2. Patch the pod's sidecar image:

    kubectl -n <ns> patch pod <pod> --type='json' -p='[
      {"op": "replace", "path": "/spec/initContainers/0/image",
       "value": "laconicnetwork/doublezero:new-version"}
    ]'
    
  3. Only the doublezerod container restarts. The agave container is unaffected. The GRE tunnel interface and BGP routes remain in the kernel throughout.

  4. Verify:

    kubectl -n <ns> exec <pod> -c doublezerod -- doublezero --version
    kubectl -n <ns> exec <pod> -c doublezerod -- doublezero status
    ip route | grep doublezero0   # routes still present
    

Rollback

Patch the image back to the previous version. Same process, same zero downtime.


Procedure 2: Agave Version Upgrade (zero RPC downtime, blue-green)

Agave is the main container and must be restarted for a version change. To maintain zero RPC downtime, we run two deployments simultaneously and let Caddy shift traffic based on health checks.

Prerequisites

  • Caddy ingress configured with dual upstreams and active health checks
  • A parameterized spec.yml that accepts alternate ports and volume paths
  • ZFS snapshot/clone scripts

Steps

Phase 1: Prepare (no downtime, no risk)

  1. ZFS snapshot for rollback safety:

    zfs snapshot -r biscayne/DATA@pre-upgrade-$(date +%Y%m%d)
    
  2. ZFS clone the validator volumes:

    zfs clone biscayne/DATA/volumes/solana@pre-upgrade-$(date +%Y%m%d) \
      biscayne/DATA/volumes/solana-blue
    

    This is instant (copy-on-write). No additional storage until writes diverge.

  3. Clone the ramdisk accounts (not on ZFS):

    mkdir -p /srv/solana-blue/ramdisk/accounts
    cp -a /srv/solana/ramdisk/accounts/* /srv/solana-blue/ramdisk/accounts/
    

    This is the slow step — 460GB on ramdisk. Consider rsync with --inplace to minimize copy time, or investigate whether the ramdisk can move to a ZFS dataset for instant cloning in future deployments.

  4. Build or pull the new agave container image.

Phase 2: Start blue deployment (no downtime)

  1. Create Deployment B in the same kind cluster, pointing at cloned volumes, with RPC on port 8897:

    # Apply the blue deployment manifest (parameterized spec)
    kubectl apply -f deployment/k8s-manifests/agave-blue.yaml
    
  2. Deployment B catches up. It starts from the snapshot point and replays. Monitor progress:

    kubectl -n <ns> exec <blue-pod> -c agave-validator -- \
      solana -u http://127.0.0.1:8897 slot
    
  3. Validate the new version works:

    • RPC responds: curl -sf http://localhost:8897/health
    • Correct version: kubectl -n <ns> exec <blue-pod> -c agave-validator -- agave-validator --version
    • doublezerod connected (if applicable)

    Take as long as needed. Deployment A is still serving all traffic.

Phase 3: Traffic shift (zero downtime)

  1. Caddy routes traffic to B. Once B's /health returns 200, Caddy's active health check automatically starts routing to it. Alternatively, update the Caddy upstream config to prefer B.

  2. Verify B is serving live traffic:

    curl -sf https://biscayne.vaasl.io/health
    # Check Caddy access logs for requests hitting port 8897
    

Phase 4: Cleanup

  1. Stop Deployment A:

    kubectl -n <ns> delete deployment agave-green
    
  2. Reconfigure B to use standard port (8899) if desired, or update Caddy to only route to 8897.

  3. Clean up ZFS clone (or keep as rollback):

    zfs destroy biscayne/DATA/volumes/solana-blue
    

Rollback

At any point before Phase 4:

  • Deployment A is untouched and still serving traffic (or can be restarted)
  • Delete Deployment B: kubectl -n <ns> delete deployment agave-blue
  • Destroy the ZFS clone: zfs destroy biscayne/DATA/volumes/solana-blue

After Phase 4 (A already stopped):

  • zfs rollback to restore original data
  • Redeploy A with old image

Required Changes to agave-stack

1. Move doublezerod to native sidecar

In the pod spec generation (laconic-so or compose override), doublezerod must be defined as a native sidecar container instead of a regular container:

spec:
  initContainers:
    - name: doublezerod
      image: laconicnetwork/doublezero:local
      restartPolicy: Always          # makes it a native sidecar
      securityContext:
        privileged: true
        capabilities:
          add: [NET_ADMIN]
      env:
        - name: DOUBLEZERO_RPC_ENDPOINT
          value: https://api.mainnet-beta.solana.com
      volumeMounts:
        - name: doublezero-config
          mountPath: /root/.config/doublezero
  containers:
    - name: agave-validator
      image: laconicnetwork/agave:local
      # ... existing config

This change means:

  • doublezerod starts before agave and stays running
  • Patching the doublezerod image restarts only that container
  • agave can be restarted independently without affecting doublezerod

This requires a laconic-so change to support initContainers with restartPolicy in compose-to-k8s translation — or a post-deployment patch.

2. Caddy dual-upstream config

Add health-checked upstreams for both blue and green deployments:

biscayne.vaasl.io {
    reverse_proxy {
        to localhost:8899 localhost:8897

        health_uri /health
        health_interval 5s
        health_timeout 3s

        lb_policy first
    }
}

lb_policy first routes to the first healthy upstream. When only A is running, all traffic goes to :8899. When B comes up healthy, traffic shifts.

3. Parameterized deployment spec

Create a parameterized spec or kustomize overlay that accepts:

  • RPC port (8899 vs 8897)
  • Volume paths (original vs ZFS clone)
  • Deployment name suffix (green vs blue)

4. Delete DaemonSet workaround

Remove deployment/k8s-manifests/doublezero-daemonset.yaml from agave-stack.

5. Fix container DZ identity

Copy the registered identity into the container volume:

sudo cp /home/solana/.config/doublezero/id.json \
  /srv/deployments/agave/data/doublezero-config/id.json

6. Disable host systemd doublezerod

After the container sidecar is working:

sudo systemctl stop doublezerod
sudo systemctl disable doublezerod

Implementation Order

This is a spec-driven, test-driven plan. Each step produces a testable artifact.

Step 1: Fix existing DZ bugs (no code changes to laconic-so)

Fixes BUG-1 through BUG-5 from doublezero-status.md.

Spec: Container doublezerod shows correct identity, connects to laconic-mia-sw01, host systemd doublezerod is disabled.

Test:

kubectl -n <ns> exec <pod> -c doublezerod -- doublezero address
# assert: 3Bw6v7EruQvTwoY79h2QjQCs2KBQFzSneBdYUbcXK1Tr

kubectl -n <ns> exec <pod> -c doublezerod -- doublezero status
# assert: BGP Session Up, laconic-mia-sw01

systemctl is-active doublezerod
# assert: inactive

Changes:

  • Copy id.json to container volume
  • Update DOUBLEZERO_RPC_ENDPOINT in spec.yml
  • Deploy with hostNetwork-enabled stack-orchestrator
  • Stop and disable host doublezerod
  • Delete DaemonSet manifest from agave-stack

Step 2: Native sidecar for doublezerod

Spec: doublezerod image can be patched without restarting the agave container. GRE tunnel and routes persist across doublezerod restart.

Test:

# Record current agave container start time
BEFORE=$(kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[?(@.name=="agave-validator")].state.running.startedAt}')

# Patch DZ image
kubectl -n <ns> patch pod <pod> --type='json' -p='[
  {"op":"replace","path":"/spec/initContainers/0/image","value":"laconicnetwork/doublezero:test"}
]'

# Wait for DZ container to restart
sleep 10

# Verify agave was NOT restarted
AFTER=$(kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[?(@.name=="agave-validator")].state.running.startedAt}')
[ "$BEFORE" = "$AFTER" ]  # assert: same start time

# Verify tunnel survived
ip route | grep doublezero0  # assert: routes present

Changes:

  • laconic-so: support initContainers with restartPolicy: Always in compose-to-k8s translation (or: define doublezerod as native sidecar in compose via x-kubernetes-init-container extension or equivalent)
  • Alternatively: post-deploy kubectl patch to move doublezerod to initContainers

Step 3: Caddy dual-upstream routing

Spec: Caddy routes RPC traffic to whichever backend is healthy. Adding a second healthy backend on :8897 causes traffic to shift without configuration changes.

Test:

# Start a test HTTP server on :8897 with /health
python3 -c "
from http.server import HTTPServer, BaseHTTPRequestHandler
class H(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200); self.end_headers(); self.wfile.write(b'ok')
HTTPServer(('', 8897), H).serve_forever()
" &

# Verify Caddy discovers it
sleep 10
curl -sf https://biscayne.vaasl.io/health
# assert: 200

kill %1

Changes:

  • Update Caddy ingress config with dual upstreams and health checks

Step 4: ZFS clone and blue-green tooling

Spec: A script creates a ZFS clone, starts a blue deployment on alternate ports using the cloned data, and the deployment catches up and becomes healthy.

Test:

# Run the clone + deploy script
./scripts/blue-green-prepare.sh --target-version v2.2.1

# assert: ZFS clone exists
zfs list biscayne/DATA/volumes/solana-blue

# assert: blue deployment exists and is catching up
kubectl -n <ns> get deployment agave-blue

# assert: blue RPC eventually becomes healthy
timeout 600 bash -c 'until curl -sf http://localhost:8897/health; do sleep 5; done'

Changes:

  • scripts/blue-green-prepare.sh — ZFS snapshot, clone, deploy B
  • scripts/blue-green-promote.sh — tear down A, optional port swap
  • scripts/blue-green-rollback.sh — destroy B, restore A
  • Parameterized deployment spec (kustomize overlay or env-driven)

Step 5: End-to-end upgrade test

Spec: Full upgrade cycle completes with zero dropped RPC requests.

Test:

# Start continuous health probe in background
while true; do
  curl -sf -o /dev/null -w "%{http_code} %{time_total}\n" \
    https://biscayne.vaasl.io/health || echo "FAIL $(date)"
  sleep 0.5
done > /tmp/health-probe.log &

# Execute full blue-green upgrade
./scripts/blue-green-prepare.sh --target-version v2.2.1
# wait for blue to sync...
./scripts/blue-green-promote.sh

# Stop probe
kill %1

# assert: no FAIL lines in probe log
grep -c FAIL /tmp/health-probe.log
# assert: 0