stack-orchestrator/.pebbles/events.jsonl

{"type":"create","timestamp":"2026-03-06T07:57:55.427398426Z","issue_id":"bar-48f","payload":{"description":"Route all validator traffic (gossip, repair, TVU, TPU) through 137.239.194.65 on laconic-was-sw01 in Ashburn. Supersedes old TVU-only shred relay. See docs/ashburn-validator-relay.md for full design.","priority":"1","title":"Ashburn Full Validator Traffic Relay","type":"epic"}}
{"type":"create","timestamp":"2026-03-06T07:58:01.589463071Z","issue_id":"bar-a47","payload":{"description":"Create Loopback101 (137.239.194.65/32), VALIDATOR-RELAY ACL + traffic-policy on Et1/1, replacing old SHRED-RELAY. Uses 5-min auto-revert config session. Playbook: playbooks/ashburn-relay-was-sw01.yml","priority":"1","title":"was-sw01: Inbound validator relay config","type":"task"}}
{"type":"create","timestamp":"2026-03-06T07:58:07.292140983Z","issue_id":"bar-0e5","payload":{"description":"Add 137.239.194.65/32 to lo, DNAT rules for ports 8001,9000-9025 to kind node 172.20.0.2. Playbook: playbooks/ashburn-relay-biscayne.yml -t inbound","priority":"1","title":"biscayne: Inbound DNAT rules","type":"task"}}
{"type":"create","timestamp":"2026-03-06T07:58:10.838534858Z","issue_id":"bar-f9b","payload":{"description":"Ping 137.239.194.65 from external host, check DNAT counters on biscayne, verify traffic-policy counters on was-sw01.","priority":"1","title":"Verify inbound relay","type":"task"}}
{"type":"create","timestamp":"2026-03-06T07:58:15.228970622Z","issue_id":"bar-bf4","payload":{"description":"Pre-flight to discover GRE tunnel interface, then apply VALIDATOR-OUTBOUND traffic-policy redirecting src 137.239.194.65 to was-sw01 via backbone. Playbook: playbooks/ashburn-relay-mia-sw01.yml","priority":"1","title":"mia-sw01: Outbound validator redirect","type":"task"}}
{"type":"create","timestamp":"2026-03-06T07:58:19.571640837Z","issue_id":"bar-78d","payload":{"description":"fwmark 100 on validator source ports, SNAT to 137.239.194.65, policy route via doublezero0 table ashburn. Playbook: playbooks/ashburn-relay-biscayne.yml -t outbound","priority":"1","title":"biscayne: Outbound SNAT + policy routing","type":"task"}}
{"type":"create","timestamp":"2026-03-06T07:58:23.377441628Z","issue_id":"bar-f3b","payload":{"description":"Verify traffic-policy counters on both switches, iptables counters on biscayne, validator gossip ContactInfo shows 137.239.194.65, repair peer count increases, slot catchup rate improves. Write memory on both switches.","priority":"1","title":"End-to-end verification","type":"task"}}
{"type":"create","timestamp":"2026-03-06T07:58:27.341320984Z","issue_id":"bar-8a9","payload":{"description":"After stable: remove old SHRED-RELAY policy and ACL from was-sw01, remove old 64.92.84.81:20000 DNAT from biscayne.","priority":"2","title":"Cleanup old SHRED-RELAY","type":"task"}}
{"type":"rename","timestamp":"2026-03-06T07:58:32.091645662Z","issue_id":"bar-a47","payload":{"new_id":"bar-48f.1"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:32.091647902Z","issue_id":"bar-48f.1","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"rename","timestamp":"2026-03-06T07:58:32.274391159Z","issue_id":"bar-0e5","payload":{"new_id":"bar-48f.2"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:32.274392749Z","issue_id":"bar-48f.2","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"rename","timestamp":"2026-03-06T07:58:32.468426932Z","issue_id":"bar-f9b","payload":{"new_id":"bar-48f.3"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:32.468428522Z","issue_id":"bar-48f.3","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"rename","timestamp":"2026-03-06T07:58:32.657295386Z","issue_id":"bar-bf4","payload":{"new_id":"bar-48f.4"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:32.657297846Z","issue_id":"bar-48f.4","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"rename","timestamp":"2026-03-06T07:58:32.864939519Z","issue_id":"bar-78d","payload":{"new_id":"bar-48f.5"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:32.864941739Z","issue_id":"bar-48f.5","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"rename","timestamp":"2026-03-06T07:58:33.364299485Z","issue_id":"bar-f3b","payload":{"new_id":"bar-48f.6"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:33.364301305Z","issue_id":"bar-48f.6","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"rename","timestamp":"2026-03-06T07:58:33.639638369Z","issue_id":"bar-8a9","payload":{"new_id":"bar-48f.7"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:33.639640069Z","issue_id":"bar-48f.7","payload":{"dep_type":"parent-child","depends_on":"bar-48f"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:39.486721446Z","issue_id":"bar-48f.2","payload":{"dep_type":"blocks","depends_on":"bar-48f.1"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:39.911749641Z","issue_id":"bar-48f.3","payload":{"dep_type":"blocks","depends_on":"bar-48f.2"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:40.398532353Z","issue_id":"bar-48f.4","payload":{"dep_type":"blocks","depends_on":"bar-48f.3"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:40.762666046Z","issue_id":"bar-48f.5","payload":{"dep_type":"blocks","depends_on":"bar-48f.4"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:41.173027726Z","issue_id":"bar-48f.6","payload":{"dep_type":"blocks","depends_on":"bar-48f.5"}}
{"type":"dep_add","timestamp":"2026-03-06T07:58:41.467313496Z","issue_id":"bar-48f.7","payload":{"dep_type":"blocks","depends_on":"bar-48f.6"}}
{"type":"update","timestamp":"2026-03-06T18:32:00.041874266Z","issue_id":"bar-48f.1","payload":{"description":"Run ansible playbook (pane A) to apply config session with 5-min auto-revert. Review output. In pane B, SSH to install@137.239.200.198 and manually verify (show session-config diffs, show traffic-policy counters). Type 'configure session validator-relay commit' and 'write memory' when satisfied. Playbook: playbooks/ashburn-relay-was-sw01.yml (do NOT use -e commit=true; commit is manual via SSH)."}}
{"type":"update","timestamp":"2026-03-06T18:32:05.861153312Z","issue_id":"bar-48f.4","payload":{"description":"Run ansible playbook pre-flight (pane A) to discover GRE tunnel interface. Then run with -e apply=true -e tunnel_interface=TunnelX for 5-min auto-revert. In pane B, SSH to install@209.42.167.133 and manually verify. Type 'configure session validator-outbound commit' and 'write memory' when satisfied. Playbook: playbooks/ashburn-relay-mia-sw01.yml (do NOT use -e commit=true; commit is manual via SSH)."}}
{"type":"status_update","timestamp":"2026-03-06T18:35:35.320628231Z","issue_id":"bar-48f","payload":{"status":"in_progress"}}
{"type":"status_update","timestamp":"2026-03-06T18:35:35.717040604Z","issue_id":"bar-48f.1","payload":{"status":"in_progress"}}
{"type":"close","timestamp":"2026-03-06T20:12:45.087966093Z","issue_id":"bar-48f.1","payload":{}}
{"type":"status_update","timestamp":"2026-03-06T20:16:34.00466057Z","issue_id":"bar-48f.2","payload":{"status":"in_progress"}}
{"type":"close","timestamp":"2026-03-06T20:17:18.681131396Z","issue_id":"bar-48f.2","payload":{}}
{"type":"status_update","timestamp":"2026-03-06T20:17:19.159927405Z","issue_id":"bar-48f.3","payload":{"status":"in_progress"}}
{"type":"close","timestamp":"2026-03-06T20:18:42.42112937Z","issue_id":"bar-48f.3","payload":{}}
{"type":"status_update","timestamp":"2026-03-06T20:18:42.930237032Z","issue_id":"bar-48f.4","payload":{"status":"in_progress"}}
{"type":"create","timestamp":"2026-03-08T06:58:52.122307149Z","issue_id":"bar-02e","payload":{"description":"/srv/solana is a directory on the ZFS dataset biscayne/DATA/srv (mounted at /srv\nwith overlay=on). The fstab zvol mount at /srv/solana was shadowed by ZFS.\n\nFixed 2026-03-08: removed /srv/solana fstab entries, canonical data path is now\n/srv/kind/solana. All playbooks updated. fstab clean. Mounts verified.","priority":"1","title":"zvol mount: /srv/solana resolves to ZFS dataset, not zvol","type":"bug"}}
{"type":"create","timestamp":"2026-03-08T06:58:52.557582445Z","issue_id":"bar-41a","payload":{"description":"laconic-so creates configmap resources for telegraf but does not generate\nvolumeMounts in the pod spec. The telegraf container crashes because\n/etc/telegraf and /scripts are empty. Manual configmap creation works but\nthe volume mounts are still missing. Root cause is in laconic-so's stack\nmigration — configmap volume mount generation is incomplete.","priority":"1","title":"telegraf volume mounts missing from pod spec","type":"bug"}}
{"type":"create","timestamp":"2026-03-08T06:58:53.065888933Z","issue_id":"bar-a3b","payload":{"description":"Validator exits shortly after starting. Log shows UDP port reachability checks\nand TCP port checks failing. Needs full log analysis from kind node path\n(/mnt/validator-log/validator.log). May be related to networking/firewall\nconfiguration or the shred relay setup.","priority":"0","title":"agave-validator crash after ~57 seconds","type":"bug"}}
{"type":"create","timestamp":"2026-03-08T06:58:53.589221516Z","issue_id":"bar-b04","payload":{"description":"Once laconic-so deployment prepare lands, update biscayne-redeploy.yml to use\nprepare instead of start+scale-to-0 workaround. The deploy tag section should\ncall deployment prepare, and scale-up should call deployment start\n--skip-cluster-management.","priority":"2","title":"update biscayne-redeploy to use deployment prepare","type":"task"}}
{"type":"create","timestamp":"2026-03-08T06:58:54.238136989Z","issue_id":"bar-b41","payload":{"description":"Automate the leapfrog recovery strategy documented in CLAUDE.md. When the\nvalidator is stuck in a repair-dependent gap, download a fresh snapshot past\nthe incomplete zone while preserving the existing ledger (which has turbine\nshreds at the tip). Needs: shred completeness check, snapshot slot targeting,\nselective wipe (accounts+snapshots only, keep ledger).","priority":"2","title":"snapshot leapfrog recovery playbook","type":"feature"}}
{"type":"create","timestamp":"2026-03-08T06:58:54.756609299Z","issue_id":"bar-0b4","payload":{"description":"biscayne-prepare-agave.yml unconditionally imports ashburn-relay-biscayne.yml\nat the end. This couples filesystem preparation to relay setup. The relay\nplaybook fails if the kind node isn't running (ping to 172.20.0.2 fails).\nShould be a separate playbook invocation, not an import.","priority":"3","title":"biscayne-prepare-agave imports ashburn-relay-biscayne unconditionally","type":"bug"}}
{"type":"close","timestamp":"2026-03-08T06:59:00.140156099Z","issue_id":"bar-02e","payload":{}}
{"type":"create","timestamp":"2026-03-10T08:05:07.190617713Z","issue_id":"bar-2c9","payload":{"description":"laconic-so build-containers --include filter does exact string match via\ninclude_exclude_check(). Container names use slash (laconicnetwork/agave),\nnot dash. Using --include laconicnetwork-agave silently skips the build\nand reports success.\n\nFixed in biscayne-sync-tools.yml (commit ceea8f0) but the underlying\nlaconic-so behavior of silently skipping with no warning is a bug.","priority":"2","title":"build-containers --include uses slash not dash in container names","type":"bug"}}
{"type":"create","timestamp":"2026-03-10T08:05:12.506655809Z","issue_id":"bar-6cb","payload":{"description":"When laconic-so deployment restart deletes the namespace, PVCs are\ncascade-deleted but PVs (cluster-scoped) survive in Released state with\nstale claimRefs pointing to the old PVC UIDs. New PVCs created by the\nrestarted deployment can't bind because the PVs still reference the\ndeleted PVCs.\n\nWorkaround: patch Released PVs to clear claimRef after restart.\nAdded to biscayne-restart.yml. Root cause is in laconic-so — it should\nclear stale claimRefs as part of the restart flow.\n\nRelated: so-933 (namespace termination race).","priority":"1","title":"PV claimRefs go stale after deployment restart","type":"bug"}}
{"type":"create","timestamp":"2026-03-10T08:05:15.941416301Z","issue_id":"bar-fec","payload":{"description":"monitoring-grafana-data volume is defined in spec.yml but laconic-so's\nget_pvcs() does not generate a PVC for it. The PV is created but no\nmatching PVC exists, so the grafana container can't mount its data volume.\n\nWorkaround: manually kubectl apply the PVC after each deployment restart.\nRoot cause is in stack-orchestrator deploy_k8s.py get_pvcs().","priority":"2","title":"grafana PVC not generated by get_pvcs()","type":"bug"}}
{"type":"create","timestamp":"2026-03-10T08:05:22.853965263Z","issue_id":"bar-822","payload":{"description":"Rebuilding a container image on the Docker host does NOT update the image\ninside the kind node. With imagePullPolicy: IfNotPresent (the default for\n:local tags), kind uses its cached copy. Must run:\n\n kind load docker-image laconicnetwork/agave:local \\\n --name laconic-70ce4c4b47e23b85\n\nafter every rebuild. This step is not in any playbook or laconic-so flow.\nShould be added to biscayne-sync-tools.yml build-container tag or to\nlaconic-so build-containers itself.","priority":"2","title":"kind load docker-image required after container rebuild","type":"bug"}}
{"type":"create","timestamp":"2026-03-10T08:05:28.585915055Z","issue_id":"bar-571","payload":{"description":"Full snapshot slots differ per validator depending on when each started.\nThe entrypoint's incremental download loop assumes it can find an\nincremental keyed to any full snapshot's base slot, but no other validator\nmay have produced a full at that exact slot.\n\nThis causes the incremental download to retry forever when the local\nfull snapshot has a base slot that no network peer has incrementals for.\n\nDocumented for awareness. The entrypoint's infinite retry is intentional\n(user decision) — eventually a matching incremental will appear or the\nentrypoint falls through to download a fresh full+incremental pair.","priority":"3","title":"snapshot base slots are not consensus-aligned across validators","type":"bug"}}
{"type":"create","timestamp":"2026-03-10T08:05:32.262889286Z","issue_id":"bar-2d9","payload":{"description":"When spec.yml has explicit values for env vars that also have defaults in\nthe compose file, the spec.yml values win. Changing compose file defaults\nhas no effect unless the spec.yml override is also removed.\n\nThis is by design (spec.yml is deployment-specific config) but the\ninteraction is non-obvious. Bit us when changing snapshot settings in\ncompose but spec.yml still had the old values.\n\nNot a code bug — more a documentation/workflow issue. Operators must\ncheck both compose defaults and spec.yml overrides.","priority":"3","title":"spec.yml overrides compose defaults silently","type":"bug"}}
{"type":"create","timestamp":"2026-03-10T08:05:36.212405156Z","issue_id":"bar-31a","payload":{"description":"laconic-so deployment restart sleeps only 5s between down and up. If the\nnamespace is still terminating when 'up' runs, k8s returns 403 Forbidden\ncreating configmaps in the new namespace.\n\nCross-ref: so-933 in the stack-orchestrator pebbles project.\n\nWorkaround: retry the restart or wait manually. The restart playbook\n(biscayne-restart.yml) handles this by scaling to 0 first, waiting for\npod termination, then calling laconic-so restart.","priority":"1","title":"deployment restart namespace termination race","type":"bug"}}