Commit Graph

1247 Commits (dd7aeb329a852ecac8748e28d48e11a7e225e8f8)

Author SHA1 Message Date
A. F. Dudley dd7aeb329a Merge remote-tracking branch 'stack-orchestrator/fix/kind-mount-propagation' into _so_main_merge 2026-03-10 17:09:45 +00:00
A. F. Dudley b129aaa9a5 Merge branch 'bar-822-kind-load-after-rebuild'
# Conflicts:
#	stack-orchestrator/stack_orchestrator/deploy/k8s/deploy_k8s.py
#	stack-orchestrator/stack_orchestrator/deploy/k8s/helpers.py
2026-03-10 16:53:55 +00:00
A. F. Dudley fdde3be5c8 fix: add pre-commit hooks and fix all lint/type/format errors
Process fix: no pre-commit hooks existed for this repo's Python code.
Added pyproject.toml with unified dependencies (ruff, mypy, ansible-lint),
.pre-commit-config.yaml with repo-based hooks (ruff) and local uv-run
hooks (mypy, ansible-lint).

Fixed 249 ruff errors (B023, B904, B006, B007, UP008, UP031, C408),
~13 mypy type errors, 11 ansible-lint violations, and ruff-format
across all Python files including stack-orchestrator subtree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 14:56:22 +00:00
A. F. Dudley 8119b25add bar-822: replace kind load with local registry for image loading
kind load docker-image serializes the full image (docker save | ctr import),
taking 5-10 minutes per cluster recreate. Replace with a persistent local
registry (registry:2 on port 5001) that survives kind cluster deletes.

stack-orchestrator changes:
- helpers.py: replace load_images_into_kind() with ensure_local_registry(),
  connect_registry_to_kind_network(), push_images_to_local_registry()
- helpers.py: add registry mirror to containerdConfigPatches so kind nodes
  pull from localhost:5001 via the kind-registry container
- deploy_k8s.py: rewrite local container image refs to localhost:5001/...
  so containerd pulls from the registry instead of local store

Ansible changes:
- biscayne-sync-tools.yml: ensure registry container before build, then
  tag+push to local registry after build (build-container tag)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 08:37:53 +00:00
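The deploy_k8s.py change above rewrites local image refs so containerd pulls from the registry. A minimal sketch of that rewrite, assuming the port from the commit (5001); the function name and edge-case handling are illustrative, not the actual helpers.py/deploy_k8s.py code:

```python
def to_local_registry_ref(image: str, registry: str = "localhost:5001") -> str:
    """Rewrite a local image ref to pull via the kind-registry mirror.

    Hypothetical helper: a ref is treated as already registry-qualified if
    the segment before the first slash looks like a host (dot or port) or
    is "localhost"; everything else gets the local registry prefixed.
    """
    head, sep, _ = image.partition("/")
    if sep and ("." in head or ":" in head or head == "localhost"):
        return image  # already points at a registry; leave untouched
    return f"{registry}/{image}"
```

With this shape, `laconicnetwork/agave:local` becomes `localhost:5001/laconicnetwork/agave:local`, while refs already qualified with a registry host pass through unchanged.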
A. F. Dudley 7f12270939 bar-6cb: fix PV claimRef, namespace race, and PVC creation resilience
Three related fixes in the k8s deployer restart/up flow:

1. Clear stale claimRefs on Released PVs (_clear_released_pv_claim_refs):
   After namespace deletion, PVs survive in Released state with claimRefs
   pointing to deleted PVC UIDs. New PVCs can't bind until the stale
   claimRef is removed. Now clears them before PVC creation.

2. Wait for namespace termination (_wait_for_namespace_deletion):
   _ensure_namespace() now detects a terminating namespace and polls
   until deletion completes (up to 120s) before creating the new one.
   Replaces the racy 5s sleep in deployment restart.

3. Resilient PVC creation: wrap each PVC creation in error handling so
   one failure doesn't prevent subsequent PVCs from being attempted.
   All errors are collected and reported together.

Closes: bar-6cb, bar-31a, bar-fec

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 08:33:45 +00:00
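The claimRef fix (item 1 above) can be sketched as a filter over dict-shaped PV objects (as from `V1PersistentVolume.to_dict()`); the helper name is illustrative, and the real `_clear_released_pv_claim_refs` presumably follows up by patching each match with `{"spec": {"claimRef": None}}`:

```python
def stale_claim_refs(pvs: list[dict]) -> list[str]:
    """Return names of Released PVs still carrying a claimRef.

    After namespace deletion these claimRefs point at deleted PVC UIDs,
    so new PVCs cannot bind until the refs are cleared.
    """
    return [
        pv["metadata"]["name"]
        for pv in pvs
        if pv.get("status", {}).get("phase") == "Released"
        and pv.get("spec", {}).get("claimRef")
    ]
```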
A. F. Dudley 03a5b5e39e Merge commit '19bb90f8148833ea7ff79cba312b048abc0d790b' as 'stack-orchestrator' 2026-03-10 08:08:04 +00:00
A. F. Dudley 12339ab46e pebbles: sync 2026-03-10 08:05:41 +00:00
A. F. Dudley 6464492009 fix: check-status.py smooth in-place redraw, remove comment bars
- Use \033[H\033[J (home + clear-to-end) instead of just \033[H to
  prevent stale lines from previous frames persisting when output
  shrinks between refreshes.
- Fix cursor restore on exit: was \033[?25l (hide) instead of
  \033[?25h (show), leaving terminal with invisible cursor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 08:04:29 +00:00
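The escape sequences in that fix are easy to mix up, so a small sketch of the redraw/cursor logic (names are illustrative, not check-status.py's actual structure):

```python
HOME_AND_CLEAR = "\033[H\033[J"  # cursor home + erase to end of screen
HIDE_CURSOR = "\033[?25l"        # during watch mode
SHOW_CURSOR = "\033[?25h"        # must be emitted on exit (the bug was ?25l here)

def render_frame(lines: list[str]) -> str:
    """Build one watch-mode frame: home the cursor, clear any leftovers
    from a taller previous frame, then draw the body."""
    return HOME_AND_CLEAR + "\n".join(lines)
```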
A. F. Dudley 9009fb0363 fix: build.sh must be executable for laconic-so build-containers
Also fix --include filter: container name uses slash (laconicnetwork/agave)
not dash (laconicnetwork-agave). The old filter silently skipped the build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:25:54 +00:00
A. F. Dudley a76431a5dd fix: spec.yml snapshot settings — retain 1, enable incrementals
MAXIMUM_SNAPSHOTS_TO_RETAIN: 1 (was 5)
NO_INCREMENTAL_SNAPSHOTS: false (was true)
Removed SNAPSHOT_INTERVAL_SLOTS override (compose default 100000 is correct)

Spec.yml overrides compose defaults, so changing compose was ineffective.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:18:38 +00:00
A. F. Dudley ceea8f0572 fix: restart playbook preserves SSH agent and clears stale PV claimRefs
Two fixes for biscayne-restart.yml:

1. ansible_become_flags: "-E" on the restart task preserves SSH_AUTH_SOCK
   through sudo so laconic-so can git pull the stack repo.

2. After restart, clear claimRef on any Released PVs. laconic-so restart
   deletes the namespace (cascading to PVCs) then recreates, but the PVs
   retain stale claimRefs that prevent new PVCs from binding.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:37:45 +00:00
A. F. Dudley e143bb45c7 feat: add biscayne-restart.yml for graceful restart without cluster teardown
Uses laconic-so deployment restart (GitOps) to pick up new container
images and config. Gracefully stops the validator first (scale to 0,
wait for pod termination, verify no agave processes). Preserves the
kind cluster, all data volumes, and cluster state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:21:46 +00:00
A. F. Dudley 0bbc3b5a64 Merge commit '481e9d239247c01604ed9e11160abc94e9dd9eb4' as 'agave-stack' 2026-03-10 06:21:15 +00:00
A. F. Dudley 481e9d2392 Squashed 'agave-stack/' content from commit 7100d11
git-subtree-dir: agave-stack
git-subtree-split: 7100d117421bd79fb52d3dfcd85b76cf18ed0ffa
2026-03-10 06:21:15 +00:00
A. F. Dudley 7c58809cc1 chore: remove scripts/agave-container before subtree add
Moving container scripts into agave-stack subtree (correct direction).
The source of truth will be agave-stack/ in this repo, pushed out to
LaconicNetwork/agave-stack via git subtree push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:21:12 +00:00
A. F. Dudley 08380ec070 fix: Dockerfile includes ip_echo_preflight.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:08:22 +00:00
A. F. Dudley 61b7f6a236 feat: ip_echo preflight tool + relay post-mortem and checklist
ip_echo_preflight.py: reimplements Solana ip_echo client protocol in
Python. Verifies UDP port reachability before snapshot download, called
from entrypoint.py. Prevents wasting hours on a snapshot only to
crash-loop on port reachability.

docs/postmortem-ashburn-relay-outbound.md: root cause analysis of the
firewalld nftables FORWARD chain blocking outbound relay traffic.

docs/ashburn-relay-checklist.md: 7-layer verification checklist for
relay path debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:54:23 +00:00
A. F. Dudley 68edcc60c7 fix: migrate ashburn relay playbook to firewalld + iptables coexistence
Firewalld zones/policies for forwarding (Docker bridge → gre-ashburn),
iptables for Docker-specific rules (DNAT, DOCKER-USER, mangle, SNAT).
Both coexist at different netfilter priorities.

See docs/postmortem-ashburn-relay-outbound.md for root cause analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:54:08 +00:00
A. F. Dudley 3bf87a2e9b feat: snapshot leapfrog — auto-recovery when validator falls behind
Entrypoint changes:
- Always require full + incremental before starting (retry until found)
- Check incremental freshness against convergence threshold (500 slots)
- Gap monitor thread: if validator falls >5000 slots behind for 3
  consecutive checks, graceful stop + restart with fresh incremental
- cmd_serve is now a loop: download → run → monitor → leapfrog → repeat
- --no-snapshot-fetch moved to common args (both RPC and validator modes)
- --maximum-full-snapshots-to-retain default 1 (validator deletes
  downloaded full after generating its own)
- SNAPSHOT_MAX_AGE_SLOTS default 100000 (one full snapshot generation)

snapshot_download.py refactoring:
- Extract _discover_and_benchmark() and _rolling_incremental_download()
  as shared helpers
- Restore download_incremental_for_slot() using shared helpers (downloads
  only an incremental for an existing full snapshot)
- download_best_snapshot() uses shared helpers, downloads full then
  incremental as separate operations

The leapfrog cycle: validator generates full snapshots at standard 100k
block height intervals (same slots as the rest of the network). When the
gap monitor triggers, the entrypoint loops back to maybe_download_snapshot
which finds the validator's local full, downloads a fresh network
incremental (generated every ~40s, converges within the ~11hr full
generation window), and restarts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:53:56 +00:00
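The gap-monitor trigger logic above (>5000 slots behind for 3 consecutive checks) can be sketched as a small state machine; thresholds mirror the commit, but the class itself is illustrative, not the entrypoint's actual structure:

```python
class GapMonitor:
    """Count consecutive behind-head checks; signal a leapfrog after N."""

    def __init__(self, max_gap: int = 5000, trigger_after: int = 3):
        self.max_gap = max_gap
        self.trigger_after = trigger_after
        self.behind = 0

    def check(self, validator_slot: int, head_slot: int) -> bool:
        if head_slot - validator_slot > self.max_gap:
            self.behind += 1
        else:
            self.behind = 0  # any healthy check resets the streak
        return self.behind >= self.trigger_after
```

When `check()` returns True, the serve loop would stop the validator gracefully and loop back to snapshot download, as described above.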
A. F. Dudley cd36bfe5ee fix: check-status.py smooth in-place redraw, remove comment bars
- Overwrite lines in place instead of clear+redraw (no flicker)
- Pad lines to terminal width to clear stale characters
- Blank leftover rows when output shrinks between frames
- Hide cursor during watch mode
- Remove section comment bars
- Replace unicode checkmarks with +/x

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 01:00:36 +00:00
A. F. Dudley e597968708 fix: recovery playbook fixes grafana PV ownership before scale-up
laconic-so creates PV hostPath dirs as root. Grafana runs as UID 472
and crashes on startup because it can't write to /var/lib/grafana.
Fix ownership inside the kind node before scaling the deployment up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 00:57:36 +00:00
A. F. Dudley ddbcd1a97c fix: migration playbook stops docker first, skips stale data copy
- biscayne-migrate-storage.yml: stop docker to release bind mounts
  before destroying zvol, no data copy (stale, fresh snapshot needed),
  handle partially-migrated state, restart docker at end
- biscayne-upgrade-zfs.yml: use add-apt-repository CLI (module times
  out), fix libzfs package name (libzfs4linux not 5), allow apt update
  warnings from stale influxdata GPG key

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 00:48:37 +00:00
AFDudley 8cc0a9a19a add/local-test-runner (#996)
Co-authored-by: A. F. Dudley <a.frederick.dudley@gmail.com>
Reviewed-on: https://git.vdb.to/cerc-io/stack-orchestrator/pulls/996
2026-03-09 20:04:58 +00:00
A. F. Dudley b88af2be70 feat: graceful shutdown, ZFS upgrade, storage migration, sync-tools build
- entrypoint.py: Python stays PID 1, traps SIGTERM, requests graceful exit
  via admin RPC (agave-validator exit --force) before falling back to signals
- snapshot_download.py: fix break-on-failure bug in incremental download loop
  (continue + re-probe instead of giving up)
- biscayne-upgrade-zfs.yml: upgrade ZFS 2.2.2 → 2.2.9 via arter97/zfs-lts
  PPA to fix io_uring deadlock at kernel module level
- biscayne-migrate-storage.yml: one-time migration from zvol/XFS to ZFS
  dataset (zvol workaround no longer needed with graceful shutdown + ZFS fix)
- biscayne-stop.yml: patch terminationGracePeriodSeconds to 300 before
  scaling to 0, updated docs for admin RPC shutdown
- biscayne-sync-tools.yml: fix SSH agent forwarding (vars: ansible_become),
  add --tags build-container support, add set -e to shell blocks
- biscayne-recover.yml: updated for graceful shutdown awareness
- check-status.py: add --pane flag for tmux, clean redraw in watch mode
- CLAUDE.md: update docs for ZFS dataset storage, graceful shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 07:58:37 +00:00
A. F. Dudley 173b807451 fix: check-status.py discovers cluster-id from deployment.yml
Instead of hardcoding the laconic cluster ID, namespace, deployment
name, and pod label, read cluster-id from deployment.yml on biscayne
and derive everything from it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:48:19 +00:00
A. F. Dudley ed6f6bfd59 fix: check-status.py pod label selector matches actual k8s labels
The pod label is app=laconic-70ce4c4b47e23b85, not
app=laconic-70ce4c4b47e23b85-deployment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:46:17 +00:00
A. F. Dudley 09728a719c fix: recovery playbook is fire-and-forget, add check-status.py
The recovery playbook now exits after scaling to 1. The container
entrypoint handles snapshot download (60+ min) and validator startup
autonomously. Removed all polling/verification steps that would
time out waiting.

Added scripts/check-status.py for monitoring download progress,
validator slot, gap to mainnet, catch-up rate, and ramdisk usage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:39:25 +00:00
A. F. Dudley 3dc345ea7d fix: recovery playbook delegates snapshot download to container entrypoint
The container's entrypoint.py already handles snapshot freshness checks,
cleanup, download (with rolling incremental convergence), and validator
startup. Remove the host-side download and let the container do the work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:28:01 +00:00
A. F. Dudley f842aba56a fix: sync-tools playbook uses agent forwarding, not socket hunting
- Add become: false to git tasks so SSH_AUTH_SOCK survives (sudo drops it)
- Fetch explicit branch names instead of bare `git fetch origin`
- Remove the fragile `Find SSH agent socket` workaround

Requires ForwardAgent yes in SSH config (added to ~/.ssh/config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:20:16 +00:00
A. F. Dudley 601f520a45 fix: add 30-min wall-clock timeout to incremental convergence loop
Without a bound, the loop runs forever if sources never serve an
incremental close enough to head (e.g. full snapshot base slot is
too old). After 30 minutes, proceed with the best incremental
available or none.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 06:11:19 +00:00
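The bounded convergence loop described above can be sketched as a deadline around a probe function; `probe`, the parameter names, and the injected clock/sleep are illustrative, not snapshot_download.py's actual signature:

```python
import time

def converge_incremental(probe, convergence_slots=500, timeout_s=1800,
                         sleep_s=30, clock=time.monotonic, sleep=time.sleep):
    """Re-probe sources until the best incremental is within
    convergence_slots of head, or the wall-clock deadline passes.

    `probe()` returns (best_incremental_slot_or_None, head_slot).
    On timeout, proceed with the best incremental seen (or None).
    """
    deadline = clock() + timeout_s
    best = None
    while clock() < deadline:
        best, head = probe()
        if best is not None and head - best <= convergence_slots:
            return best  # converged
        sleep(sleep_s)
    return best  # deadline hit: best available, possibly None
```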
A. F. Dudley bfde58431e feat: rolling incremental snapshot download loop
After the full snapshot downloads, continuously re-probe all fast sources
for newer incrementals until the best available is within convergence_slots
(default 500) of head. Each iteration finds the highest-slot incremental
matching our full snapshot's base slot, downloads it (replacing any previous),
and checks the gap to mainnet head.

- Extract probe_incremental() from inline re-probe code
- Add convergence_slots param to download_best_snapshot() (default 500)
- Add --convergence-slots CLI arg
- Pass SNAPSHOT_CONVERGENCE_SLOTS env var from entrypoint.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:33:47 +00:00
A. F. Dudley bd38c1b791 fix: remove Ansible snapshot download, add sync-tools playbook
The container entrypoint (entrypoint.py) handles snapshot download
internally via aria2c. Ansible no longer needs to scale-to-0, download,
scale-to-1 — it just deploys and lets the container manage startup.

- biscayne-redeploy.yml: remove snapshot download section, simplify to
  teardown → wipe → deploy → verify
- biscayne-sync-tools.yml: new playbook to sync laconic-so and
  agave-stack repos on biscayne, with separate branch controls
- snapshot_download.py: re-probe for fresh incremental after full
  snapshot download completes (old incremental is stale by then)
- Switch laconic_so_branch to fix/kind-mount-propagation (has
  hostNetwork translation code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 05:14:43 +00:00
A. F. Dudley 3574e387cc fix: update playbooks to use subtree path for snapshot_download.py
scripts/agave-container/ is a git subtree of agave-stack's container-build
directory. Replaces fragile cross-repo symlink with proper subtree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:13:53 +00:00
A. F. Dudley 25952b4fa7 Merge commit 'f4b3a46109a8da00fdd68d8999160ddc45dcc88a' as 'scripts/agave-container' 2026-03-08 19:13:38 +00:00
A. F. Dudley f4b3a46109 Squashed 'scripts/agave-container/' content from commit 4b5c875
git-subtree-dir: scripts/agave-container
git-subtree-split: 4b5c875a05cbbfbde38eeb053fd5443a8a50228c
2026-03-08 19:13:38 +00:00
A. F. Dudley ba015bf3b1 chore: remove snapshot-download.py symlink (replacing with subtree)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:13:34 +00:00
A. F. Dudley 078872d78d feat: add iptables playbook, symlink snapshot-download.py to agave-stack
- playbooks/biscayne-iptables.yml: manages PREROUTING DNAT and DOCKER-USER
  rules for both host IP (186.233.184.235) and relay loopback (137.239.194.65).
  Idempotent, persists via netfilter-persistent.
- scripts/snapshot-download.py: replaced standalone copy with symlink to
  agave-stack source of truth, eliminating duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:11:24 +00:00
A. F. Dudley ec12e6079b fix: redeploy wipe uses umount+remount instead of rm -rf
Remounting tmpfs is instant (kernel frees pages), while rm -rf on 400GB+
of accounts files traverses every inode. Recover playbook keeps rm -rf
because the kind node's bind mount prevents umount while the container
is running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:45:44 +00:00
A. F. Dudley b2342bc539 fix: switch ramdisk from /dev/ram0 to tmpfs, refactor snapshot-download.py
The /dev/ram0 + XFS + format-ramdisk.service approach was unnecessary
complexity from a migration confusion — there was no actual tmpfs bug
with io_uring. tmpfs is simpler (no format-on-boot), resizable on the
fly, and what every other Solana operator uses.

Changes:
- prepare-agave: remove format-ramdisk.service and ramdisk-accounts.service,
  use tmpfs fstab entry with size=1024G (was 600G /dev/ram0, too small)
- recover: remove ramdisk_device var (no longer needed)
- redeploy: wipe accounts by rm -rf instead of umount+mkfs
- snapshot-download.py: extract download_best_snapshot() public API for
  use by the new container entrypoint.py (in agave-stack)
- CLAUDE.md: update ramdisk docs, fix /srv/solana → /srv/kind/solana paths
- health-check: fix ramdisk path references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:43:41 +00:00
A. F. Dudley 591d158e1f chore: populate pebbles with known bugs and feature requests
Issues:
- bar-a3b [P0] agave-validator crash after ~57 seconds
- bar-41a [P1] telegraf volume mounts missing from pod spec
- bar-02e [P1] zvol mount bug (closed — fixed 2026-03-08)
- bar-b04 [P2] update redeploy to use deployment prepare
- bar-b41 [P2] snapshot leapfrog recovery playbook
- bar-0b4 [P3] prepare-agave unconditionally imports relay playbook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 06:59:07 +00:00
A. F. Dudley 974eed0c73 feat: add `deployment prepare` command (so-076.1)
Refactors K8sDeployer.up() into three composable methods:
- _setup_cluster_and_namespace(): kind cluster, API, namespace, ingress
- _create_infrastructure(): PVs, PVCs, ConfigMaps, Services, NodePorts
- _create_deployment(): Deployment resource (pods)

`prepare` calls the first two only — creates all cluster infrastructure
without starting pods. This eliminates the scale-to-0 workaround where
operators had to run `deployment start` then immediately scale down.

Usage: laconic-so deployment --dir <dir> prepare

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 06:56:34 +00:00
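The three-phase split above composes naturally: `prepare` runs the first two phases, `up` runs all three. A toy illustration of that composition (the real method bodies in deploy_k8s.py create actual cluster resources):

```python
class K8sDeployerSketch:
    """Illustrative shape of the refactor, not the real K8sDeployer."""

    def __init__(self):
        self.phases = []

    def _setup_cluster_and_namespace(self):
        self.phases.append("cluster+namespace")

    def _create_infrastructure(self):
        self.phases.append("infrastructure")  # PVs, PVCs, ConfigMaps, Services

    def _create_deployment(self):
        self.phases.append("deployment")      # pods start here

    def prepare(self):
        # `deployment prepare`: everything except the pods
        self._setup_cluster_and_namespace()
        self._create_infrastructure()

    def up(self):
        self.prepare()
        self._create_deployment()
```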
A. F. Dudley 9c5b8e3f4e chore: initialize pebbles issue tracker
Track stack-orchestrator work items with pebbles (append-only event log).

Epic so-076: Stack composition — deploy multiple stacks into one kind cluster
with independent lifecycle management per sub-stack.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 06:56:25 +00:00
A. F. Dudley 63735a9830 fix: revert snapshot_dir, add laconic_so_branch, move kind ramdisk check
- Revert snapshot_dir to /srv/solana/snapshots — aria2c runs on the host
  where this is the direct zvol mount (always available), unlike
  /srv/kind/solana/snapshots which depends on the bind mount
- Add laconic_so_branch variable (default: main) and use it in both
  git reset commands so the branch can be overridden via -e
- Move "Verify ramdisk visible inside kind node" from preflight to after
  "Wait for deployment to exist" — the kind container may not exist
  during preflight after teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:42:11 +00:00
A. F. Dudley 14f423ea0c fix(k8s): read existing resourceVersion/clusterIP before replace
K8s PUT (replace) operations require metadata.resourceVersion for
optimistic concurrency control. Services additionally have immutable
spec.clusterIP that must be preserved from the existing object.

On 409 conflict, all _ensure_* methods now read the existing resource
first and copy resourceVersion (and clusterIP for Services) into the
body before calling replace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:32:20 +00:00
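The field-copying step can be sketched over dict-shaped objects (the real code works on kubernetes client models; the helper name is illustrative):

```python
import copy

def prepare_for_replace(existing: dict, body: dict, is_service: bool = False) -> dict:
    """Copy the fields a K8s PUT (replace) needs from the live object:
    metadata.resourceVersion always, plus the immutable spec.clusterIP
    for Services."""
    out = copy.deepcopy(body)
    out.setdefault("metadata", {})
    out["metadata"]["resourceVersion"] = existing["metadata"]["resourceVersion"]
    if is_service:
        out.setdefault("spec", {})
        out["spec"]["clusterIP"] = existing["spec"]["clusterIP"]
    return out
```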
A. F. Dudley fe935037f7 fix: add laconic-so update step, downgrade unified mount check to warning
- Add laconic_so_repo variable (/home/rix/stack-orchestrator) and a
  git pull task before deployment start — the editable install must be
  current or stale code causes deploy failures
- Downgrade unified mount root check from fatal assertion to debug
  warning — the mount style depends on which laconic-so version is
  deployed, and individual PV mounts (/mnt/validator-*) work fine

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:32:20 +00:00
A. F. Dudley 1da69cf739 fix(k8s): make deploy_k8s.py idempotent with create-or-replace semantics
All K8s resource creation in deploy_k8s.py now uses try-create, catch
ApiException(409), then replace — matching the pattern already used for
secrets in deployment_create.py. This allows `deployment start` to be
safely re-run without 409 Conflict errors.

Resources made idempotent:
- Deployment (create_namespaced_deployment → replace on 409)
- Service (create_namespaced_service → replace on 409)
- Ingress (create_namespaced_ingress → replace on 409)
- NodePort services (same as Service)
- ConfigMap (create_namespaced_config_map → replace on 409)
- PV/PVC: bare `except: pass` replaced with explicit ApiException
  catch for 404

Extracted _ensure_deployment(), _ensure_service(), _ensure_ingress(),
and _ensure_config_map() helpers to keep cyclomatic complexity in check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:15:03 +00:00
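The try-create/catch-409/replace pattern reduces to a few lines; in this sketch `Conflict` stands in for `kubernetes.client.ApiException` with `status == 409`, and `create`/`replace` stand in for the corresponding client API calls:

```python
class Conflict(Exception):
    """Stand-in for ApiException(status=409) from the kubernetes client."""

def ensure(create, replace):
    """Create the resource; if it already exists (409), replace it.

    Makes re-running `deployment start` safe: first run creates,
    subsequent runs replace."""
    try:
        return create()
    except Conflict:
        return replace()
```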
A. F. Dudley ad68d505ae fix: redeploy playbook paths, tags, and idempotency
- Fix snapshot_dir: /srv/solana/snapshots → /srv/kind/solana/snapshots
  (kind node reads from the bind mount, not the zvol mount directly)
- Fix kind-internal paths: /mnt/solana/... → /mnt/validator-... to match
  actual PV hostPath layout (individual mounts, not unified)
- Add 'scale-up' tag to "Scale validator to 1" task for partial recovery
  (--tags snapshot,scale-up,verify resumes without re-running deploy)
- Make 'Start deployment' idempotent: failed_when: false + follow-up
  check so existing deployment doesn't fail the play

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 04:14:05 +00:00
A. F. Dudley 05f9acf8a0 fix: DOCKER-USER rules for inbound relay, add UDP test playbooks
Root cause: Docker FORWARD chain policy DROP blocked all DNAT'd relay
traffic (UDP/TCP 8001, UDP 9000-9025) to the kind node. The DOCKER
chain only ACCEPTs specific TCP ports (6443, 443, 80). Added ACCEPT
rules in DOCKER-USER chain which runs before all Docker chains.

Changes:
- ashburn-relay-biscayne.yml: add DOCKER-USER ACCEPT rules (inbound
  tag) and rollback cleanup
- ashburn-relay-setup.sh.j2: persist DOCKER-USER rules across reboot
- relay-inbound-udp-test.yml: controlled e2e test — listener in kind
  netns, sender from kelce, assert arrival
- relay-link-test.yml: link-by-link tcpdump captures at each hop
- relay-test-udp-listen.py, relay-test-udp-send.py: test helpers
- relay-test-ip-echo.py: full ip_echo protocol test
- inventory/kelce.yml, inventory/panic.yml: test host inventories
- test-ashburn-relay.sh: add ip_echo UDP reachability test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 02:43:31 +00:00
A. F. Dudley cc6acd5f09 fix: default skip-cluster-management to true
Destroying the kind cluster on stop/start is almost never the intent.
The cluster holds PVs, ConfigMaps, and networking state that are
expensive to recreate. Default to preserving the cluster; pass
--perform-cluster-management explicitly when a full teardown is needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 02:41:25 +00:00
A. F. Dudley 806c1bb723 refactor: rename `deployment update` to `deployment update-envs`
The update command only patches environment variables and adds a
restart annotation. It does not update ports, volumes, configmaps,
or any other deployment spec. The old name was misleading — it
implied a full spec update, causing operators to expect changes
that never took effect.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 02:33:20 +00:00