# Blue-Green Upgrades for Biscayne
Zero-downtime upgrade procedures for the agave-stack deployment on biscayne. Uses ZFS clones for instant data duplication, Caddy health-check routing for traffic shifting, and k8s native sidecars for independent container upgrades.
## Architecture
```
Caddy ingress (biscayne.vaasl.io)
  ├── upstream A: localhost:8899  ← health: /health
  └── upstream B: localhost:8897  ← health: /health
                  │
┌─────────────────┴──────────────────┐
│            kind cluster            │
│                                    │
│  Deployment A       Deployment B   │
│ ┌─────────────┐   ┌─────────────┐  │
│ │ agave :8899 │   │ agave :8897 │  │
│ │ doublezerod │   │ doublezerod │  │
│ └──────┬──────┘   └──────┬──────┘  │
└────────┼─────────────────┼─────────┘
         │                 │
  ZFS dataset A        ZFS clone B
   (original)       (instant CoW copy)
```
Both deployments run in the same kind cluster with `hostNetwork: true`.
Caddy active health checks route traffic to whichever deployment has a
healthy `/health` endpoint.
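
For a quick sanity check of which backend is currently up, each upstream can be probed directly on the host, bypassing Caddy (ports as in the diagram above):

```bash
# Probe both upstreams directly on biscayne
curl -sf http://localhost:8899/health && echo "A healthy"
curl -sf http://localhost:8897/health && echo "B healthy"
```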
## Storage Layout
| Data | Path | Type | Survives restart? |
|---|---|---|---|
| Ledger | `/srv/solana/ledger` | ZFS zvol (xfs) | Yes |
| Snapshots | `/srv/solana/snapshots` | ZFS zvol (xfs) | Yes |
| Accounts | `/srv/solana/ramdisk/accounts` | `/dev/ram0` (xfs) | Until host reboot |
| Validator config | `/srv/deployments/agave/data/validator-config` | ZFS | Yes |
| DZ config | `/srv/deployments/agave/data/doublezero-config` | ZFS | Yes |
The ZFS zvol `biscayne/DATA/volumes/solana` backs `/srv/solana` (ledger, snapshots).
The ramdisk at `/dev/ram0` holds accounts; it's a block device, not tmpfs, so it
survives process restarts but not host reboots.
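
To confirm this layout on the host, something like the following should work (dataset name from the table; the ramdisk mountpoint is assumed to be `/srv/solana/ramdisk`):

```bash
# ZFS datasets, snapshots, and clones backing the validator volumes
zfs list -r -t all biscayne/DATA/volumes/solana

# Ramdisk block device and its mount
lsblk /dev/ram0
findmnt /srv/solana/ramdisk
```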
## Procedure 1: DoubleZero Binary Upgrade (zero downtime, single pod)
The GRE tunnel (`doublezero0`) and BGP routes live in kernel space. They persist
across doublezerod process restarts. Upgrading the DZ binary does not require
tearing down the tunnel or restarting the validator.
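
This is easy to confirm from the host, since the interface and routes are kernel objects rather than process state:

```bash
# Both outlive the doublezerod process
ip link show doublezero0
ip route show dev doublezero0
```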
### Prerequisites
- doublezerod is defined as a k8s native sidecar (`spec.initContainers` with `restartPolicy: Always`). See Required Changes below.
- k8s 1.29+ (biscayne runs 1.35.1)
### Steps
1. Build or pull the new doublezero container image.

2. Patch the pod's sidecar image:

   ```bash
   kubectl -n <ns> patch pod <pod> --type='json' -p='[
     {"op": "replace", "path": "/spec/initContainers/0/image", "value": "laconicnetwork/doublezero:new-version"}
   ]'
   ```

3. Only the doublezerod container restarts. The agave container is unaffected. The GRE tunnel interface and BGP routes remain in the kernel throughout.

4. Verify:

   ```bash
   kubectl -n <ns> exec <pod> -c doublezerod -- doublezero --version
   kubectl -n <ns> exec <pod> -c doublezerod -- doublezero status
   ip route | grep doublezero0   # routes still present
   ```
### Rollback
Patch the image back to the previous version. Same process, same zero downtime.
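
A sketch of the rollback patch, with `previous-version` standing in for whatever tag was running before:

```bash
kubectl -n <ns> patch pod <pod> --type='json' -p='[
  {"op": "replace", "path": "/spec/initContainers/0/image", "value": "laconicnetwork/doublezero:previous-version"}
]'
```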
## Procedure 2: Agave Version Upgrade (zero RPC downtime, blue-green)
Agave is the main container and must be restarted for a version change. To maintain zero RPC downtime, we run two deployments simultaneously and let Caddy shift traffic based on health checks.
### Prerequisites
- Caddy ingress configured with dual upstreams and active health checks
- A parameterized `spec.yml` that accepts alternate ports and volume paths
- ZFS snapshot/clone scripts
### Steps
#### Phase 1: Prepare (no downtime, no risk)
1. Take a recursive ZFS snapshot for rollback safety:

   ```bash
   zfs snapshot -r biscayne/DATA@pre-upgrade-$(date +%Y%m%d)
   ```

2. Clone the validator volumes:

   ```bash
   zfs clone biscayne/DATA/volumes/solana@pre-upgrade-$(date +%Y%m%d) \
     biscayne/DATA/volumes/solana-blue
   ```

   This is instant (copy-on-write); no additional storage is consumed until writes diverge.

3. Clone the ramdisk accounts (not on ZFS):

   ```bash
   mkdir -p /srv/solana-blue/ramdisk/accounts
   cp -a /srv/solana/ramdisk/accounts/* /srv/solana-blue/ramdisk/accounts/
   ```

   This is the slow step: 460 GB on ramdisk. Consider `rsync` with `--inplace` to minimize copy time (see the sketch after this list), or investigate whether the ramdisk can move to a ZFS dataset for instant cloning in future deployments.

4. Build or pull the new agave container image.
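
A minimal sketch of the `rsync` alternative from step 3 (trailing slashes matter to rsync's path semantics):

```bash
# --inplace rewrites existing files in place, avoiding temp-file churn on re-runs
rsync -a --inplace /srv/solana/ramdisk/accounts/ /srv/solana-blue/ramdisk/accounts/
```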
#### Phase 2: Start blue deployment (no downtime)
1. Create Deployment B in the same kind cluster, pointing at the cloned volumes, with RPC on port 8897:

   ```bash
   # Apply the blue deployment manifest (parameterized spec)
   kubectl apply -f deployment/k8s-manifests/agave-blue.yaml
   ```

2. Deployment B catches up. It starts from the snapshot point and replays. Monitor progress:

   ```bash
   kubectl -n <ns> exec <blue-pod> -c agave-validator -- \
     solana -u http://127.0.0.1:8897 slot
   ```

3. Validate that the new version works:

   - RPC responds: `curl -sf http://localhost:8897/health`
   - Correct version: `kubectl -n <ns> exec <blue-pod> -c agave-validator -- agave-validator --version`
   - doublezerod connected (if applicable)

   Take as long as needed. Deployment A is still serving all traffic.
#### Phase 3: Traffic shift (zero downtime)
1. Shift traffic to B. With `lb_policy first` (see Required Changes below), Caddy keeps routing to the first listed upstream (:8899) for as long as it is healthy, so B coming up healthy is not enough on its own. Either update the Caddy upstream order to put :8897 first, or stop A once B's `/health` returns 200 and let the active health check fail traffic over to B.

2. Verify B is serving live traffic:

   ```bash
   curl -sf https://biscayne.vaasl.io/health
   # Check Caddy access logs for requests hitting port 8897
   ```
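
Caddy's admin API reports the live state of each proxied upstream, which is a more direct check than grepping access logs. This assumes the default admin endpoint on localhost:2019 and that `jq` is installed:

```bash
# List reverse-proxy upstreams with request counts and failure state
curl -s http://localhost:2019/reverse_proxy/upstreams | jq
```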
#### Phase 4: Cleanup
1. Stop Deployment A:

   ```bash
   kubectl -n <ns> delete deployment agave-green
   ```

2. Reconfigure B to use the standard port (8899) if desired, or update Caddy to route only to :8897.

3. Clean up the ZFS clone (or keep it as a rollback option):

   ```bash
   zfs destroy biscayne/DATA/volumes/solana-blue
   ```
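
ZFS will refuse to destroy the pre-upgrade snapshot while the clone still depends on it. Once the clone is gone, the snapshot can be released too (date placeholder as created in Phase 1):

```bash
# Only possible after the dependent clone has been destroyed
zfs destroy -r biscayne/DATA@pre-upgrade-YYYYMMDD
```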
### Rollback
At any point before Phase 4:

- Deployment A is untouched and still serving traffic (or can be restarted)
- Delete Deployment B: `kubectl -n <ns> delete deployment agave-blue`
- Destroy the ZFS clone: `zfs destroy biscayne/DATA/volumes/solana-blue`

After Phase 4 (A already stopped):

- `zfs rollback` to restore the original data
- Redeploy A with the old image
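
A sketch of that post-Phase-4 rollback. Note that `zfs rollback` only targets the most recent snapshot; `-r` destroys any snapshots taken after the pre-upgrade one:

```bash
# Stop anything using the volume first, then restore the pre-upgrade state
zfs rollback -r biscayne/DATA/volumes/solana@pre-upgrade-YYYYMMDD
```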
## Required Changes to agave-stack
### 1. Move doublezerod to native sidecar
In the pod spec generation (laconic-so or compose override), doublezerod must be defined as a native sidecar container instead of a regular container:
```yaml
spec:
  initContainers:
    - name: doublezerod
      image: laconicnetwork/doublezero:local
      restartPolicy: Always   # makes it a native sidecar
      securityContext:
        privileged: true
        capabilities:
          add: [NET_ADMIN]
      env:
        - name: DOUBLEZERO_RPC_ENDPOINT
          value: https://api.mainnet-beta.solana.com
      volumeMounts:
        - name: doublezero-config
          mountPath: /root/.config/doublezero
  containers:
    - name: agave-validator
      image: laconicnetwork/agave:local
      # ... existing config
```
This change means:
- doublezerod starts before agave and stays running
- Patching the doublezerod image restarts only that container
- agave can be restarted independently without affecting doublezerod
This requires a laconic-so change to support `initContainers` with `restartPolicy`
in compose-to-k8s translation, or a post-deployment patch.
### 2. Caddy dual-upstream config
Add health-checked upstreams for both blue and green deployments:
```
biscayne.vaasl.io {
    reverse_proxy {
        to localhost:8899 localhost:8897
        health_uri /health
        health_interval 5s
        health_timeout 3s
        lb_policy first
    }
}
```
`lb_policy first` routes every request to the first healthy upstream in list order.
When only A is running, all traffic goes to :8899. Because A stays first in the
list, traffic remains on :8899 even after B becomes healthy; it shifts to :8897
when A is stopped or fails its health check, or when the upstream order is edited
to prefer B.
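
When an explicit shift is preferred over failover, edit the Caddyfile to list :8897 first and reload; Caddy applies config changes gracefully, without dropping in-flight connections (Caddyfile path assumed):

```bash
# Apply the edited upstream order without downtime
caddy reload --config /etc/caddy/Caddyfile
```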
### 3. Parameterized deployment spec
Create a parameterized spec or kustomize overlay that accepts:
- RPC port (8899 vs 8897)
- Volume paths (original vs ZFS clone)
- Deployment name suffix (green vs blue)
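
One lightweight, env-driven way to do this; the template file and variable names here are illustrative, not existing artifacts:

```bash
# Render a parameterized manifest and apply it as the blue deployment
RPC_PORT=8897 VOLUME_ROOT=/srv/solana-blue SUFFIX=blue \
  envsubst < deployment/k8s-manifests/agave.template.yaml | kubectl apply -f -
```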
### 4. Delete DaemonSet workaround
Remove `deployment/k8s-manifests/doublezero-daemonset.yaml` from agave-stack.
### 5. Fix container DZ identity
Copy the registered identity into the container volume:
```bash
sudo cp /home/solana/.config/doublezero/id.json \
  /srv/deployments/agave/data/doublezero-config/id.json
```
### 6. Disable host systemd doublezerod
After the container sidecar is working:
```bash
sudo systemctl stop doublezerod
sudo systemctl disable doublezerod
```
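
And verify it stays down:

```bash
systemctl is-active doublezerod    # expect: inactive
systemctl is-enabled doublezerod   # expect: disabled
```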
## Implementation Order
This is a spec-driven, test-driven plan. Each step produces a testable artifact.
### Step 1: Fix existing DZ bugs (no code changes to laconic-so)
Fixes BUG-1 through BUG-5 from `doublezero-status.md`.

**Spec:** Container doublezerod shows the correct identity, connects to laconic-mia-sw01, and host systemd doublezerod is disabled.
**Test:**

```bash
kubectl -n <ns> exec <pod> -c doublezerod -- doublezero address
# assert: 3Bw6v7EruQvTwoY79h2QjQCs2KBQFzSneBdYUbcXK1Tr
kubectl -n <ns> exec <pod> -c doublezerod -- doublezero status
# assert: BGP Session Up, laconic-mia-sw01
systemctl is-active doublezerod
# assert: inactive
```
**Changes:**

- Copy `id.json` to the container volume
- Update `DOUBLEZERO_RPC_ENDPOINT` in `spec.yml`
- Deploy with hostNetwork-enabled stack-orchestrator
- Stop and disable host doublezerod
- Delete the DaemonSet manifest from agave-stack
### Step 2: Native sidecar for doublezerod
**Spec:** The doublezerod image can be patched without restarting the agave container. The GRE tunnel and routes persist across a doublezerod restart.
**Test:**

```bash
# Record the current agave container start time
BEFORE=$(kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[?(@.name=="agave-validator")].state.running.startedAt}')

# Patch the DZ image
kubectl -n <ns> patch pod <pod> --type='json' -p='[
  {"op":"replace","path":"/spec/initContainers/0/image","value":"laconicnetwork/doublezero:test"}
]'

# Wait for the DZ container to restart
sleep 10

# Verify agave was NOT restarted
AFTER=$(kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[?(@.name=="agave-validator")].state.running.startedAt}')
[ "$BEFORE" = "$AFTER" ]   # assert: same start time

# Verify the tunnel survived
ip route | grep doublezero0   # assert: routes present
```
**Changes:**

- laconic-so: support `initContainers` with `restartPolicy: Always` in compose-to-k8s translation (or: define doublezerod as a native sidecar in compose via an `x-kubernetes-init-container` extension or equivalent)
- Alternatively: a post-deploy kubectl patch to move doublezerod to initContainers
### Step 3: Caddy dual-upstream routing
**Spec:** Caddy routes RPC traffic to a healthy backend. With a second healthy backend on :8897, traffic fails over to it when :8899 becomes unavailable, without configuration changes.
**Test:**

```bash
# Start a test HTTP server on :8897 with /health
# (to exercise the failover path, run this while the :8899 backend is stopped)
python3 -c "
from http.server import HTTPServer, BaseHTTPRequestHandler
class H(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200); self.end_headers(); self.wfile.write(b'ok')
HTTPServer(('', 8897), H).serve_forever()
" &

# Verify Caddy discovers it
sleep 10
curl -sf https://biscayne.vaasl.io/health
# assert: 200

kill %1
```
**Changes:**

- Update the Caddy ingress config with dual upstreams and health checks
### Step 4: ZFS clone and blue-green tooling
**Spec:** A script creates a ZFS clone, starts a blue deployment on alternate ports using the cloned data, and the deployment catches up and becomes healthy.
**Test:**

```bash
# Run the clone + deploy script
./scripts/blue-green-prepare.sh --target-version v2.2.1

# assert: ZFS clone exists
zfs list biscayne/DATA/volumes/solana-blue

# assert: blue deployment exists and is catching up
kubectl -n <ns> get deployment agave-blue

# assert: blue RPC eventually becomes healthy
timeout 600 bash -c 'until curl -sf http://localhost:8897/health; do sleep 5; done'
```
**Changes:**

- `scripts/blue-green-prepare.sh`: ZFS snapshot, clone, deploy B
- `scripts/blue-green-promote.sh`: tear down A, optional port swap
- `scripts/blue-green-rollback.sh`: destroy B, restore A
- Parameterized deployment spec (kustomize overlay or env-driven)
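
A skeleton for `blue-green-prepare.sh`, sketching the Phase 1-2 flow under the assumption that the parameterized blue manifest from change 3 exists; argument handling (e.g. `--target-version`) is omitted:

```bash
#!/usr/bin/env bash
# Sketch: snapshot, clone, copy ramdisk accounts, start deployment B.
set -euo pipefail

SNAP="pre-upgrade-$(date +%Y%m%d)"

# Phase 1: recursive safety snapshot, then instant CoW clone
zfs snapshot -r "biscayne/DATA@${SNAP}"
zfs clone "biscayne/DATA/volumes/solana@${SNAP}" biscayne/DATA/volumes/solana-blue

# Phase 1: copy accounts off the ramdisk (the slow step; not on ZFS)
mkdir -p /srv/solana-blue/ramdisk/accounts
rsync -a --inplace /srv/solana/ramdisk/accounts/ /srv/solana-blue/ramdisk/accounts/

# Phase 2: start deployment B on the cloned data (RPC on :8897)
kubectl apply -f deployment/k8s-manifests/agave-blue.yaml

# Wait for B's RPC to respond before handing back to the operator
timeout 600 bash -c 'until curl -sf http://localhost:8897/health; do sleep 5; done'
```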
### Step 5: End-to-end upgrade test
**Spec:** A full upgrade cycle completes with zero dropped RPC requests.
**Test:**

```bash
# Start a continuous health probe in the background
while true; do
  curl -sf -o /dev/null -w "%{http_code} %{time_total}\n" \
    https://biscayne.vaasl.io/health || echo "FAIL $(date)"
  sleep 0.5
done > /tmp/health-probe.log &

# Execute the full blue-green upgrade
./scripts/blue-green-prepare.sh --target-version v2.2.1
# wait for blue to sync...
./scripts/blue-green-promote.sh

# Stop the probe
kill %1

# assert: no FAIL lines in the probe log
grep -c FAIL /tmp/health-probe.log
# assert: 0
```