From 4eaca0ecb05869e995ae9b98b4cd2bcaddcd274d Mon Sep 17 00:00:00 2001 From: Prathamesh Musale Date: Mon, 11 May 2026 10:51:43 +0000 Subject: [PATCH] host-metrics: operator README --- .../data/stacks/host-metrics/README.md | 113 +++++++++++++++++- 1 file changed, 111 insertions(+), 2 deletions(-) diff --git a/stack_orchestrator/data/stacks/host-metrics/README.md b/stack_orchestrator/data/stacks/host-metrics/README.md index d281e638..a97a99c3 100644 --- a/stack_orchestrator/data/stacks/host-metrics/README.md +++ b/stack_orchestrator/data/stacks/host-metrics/README.md @@ -7,6 +7,115 @@ it runs on. Writes to an InfluxDB 1.x endpoint of your choosing. Deploy one instance per machine you want monitored. -## Quick deploy +## What gets collected -(Filled in by a later task.) +| Input | Measurements (in InfluxDB) | +|-------|----------------------------| +| inputs.cpu (totalcpu only) | cpu (`cpu=cpu-total`) | +| inputs.mem | mem | +| inputs.swap | swap | +| inputs.system | system (uptime, load1/5/15, n_users, n_cpus) | +| inputs.processes | processes (running/sleeping/blocked/zombies) | +| inputs.disk | disk (used/free/used_percent per mount) | +| inputs.diskio | diskio (read/write bytes/ops per device) | +| inputs.net | net (bytes/packets/err in/out per interface) | +| inputs.zfs (opt-in via COLLECT_ZFS=true) | zfs (ARC stats, pool state) | + +All rows are tagged with `host` (kernel hostname, or `HOST_TAG` override). + +## Deploy + +### Create a spec + +```bash +laconic-so --stack /stack_orchestrator/data/stacks/host-metrics \ + deploy init --output spec-host-metrics.yml +``` + +Edit `spec-host-metrics.yml` to look like: + +```yaml +stack: /stack_orchestrator/data/stacks/host-metrics +deploy-to: compose +credentials-files: + - ~/.credentials/host-metrics.env +config: + INFLUXDB_URL: 'https://influxdb.example.com' + INFLUXDB_DB: 'host_metrics' # default; override for a custom DB + HOST_TAG: 'validator-1' # optional; defaults to kernel hostname + COLLECT_INTERVAL: '10s' # telegraf collection + flush cadence + COLLECT_ZFS: 'false' # set to 'true' on ZFS hosts +``` + +`~/.credentials/host-metrics.env` must contain: + +``` +INFLUXDB_WRITE_USER= +INFLUXDB_WRITE_PASSWORD= +``` + +These are issued by the InfluxDB admin (the monitoring host operator); they +are the same writer-only credentials used by validators/RPCs to push agave +metrics. + +### Create and start + +```bash +laconic-so deployment create --spec-file spec-host-metrics.yml --deployment-dir ./deployment-host-metrics +laconic-so deployment --dir ./deployment-host-metrics start +``` + +### Verify + +```bash +docker logs $(docker ps -qf name=host-metrics) | head +``` + +Expected: telegraf prints its startup banner and `Loaded inputs: ...`. No +errors about missing config or auth failures. + +Within ~20 seconds, the host's data appears in the InfluxDB endpoint's +`host_metrics` database (or whichever DB you set in INFLUXDB_DB) and in +any Grafana dashboards bound to that DB. + +## Configuration reference + +| Env | Required | Default | Notes | +|-----|----------|---------|-------| +| `INFLUXDB_URL` | yes | - | Full URL including scheme. Example: `https://influxdb.example.com`. | +| `INFLUXDB_DB` | no | `host_metrics` | Target database. Must exist (writer is not granted CREATE). | +| `INFLUXDB_WRITE_USER` | yes | - | Writer-only user. | +| `INFLUXDB_WRITE_PASSWORD` | yes | - | Writer-only password. | +| `COLLECT_INTERVAL` | no | `10s` | Telegraf collection and flush cadence. | +| `HOST_TAG` | no | empty | Overrides the kernel hostname for the `host` tag on every row. Useful when a VM has a generic hostname. | +| `COLLECT_ZFS` | no | `false` | Set to `true` to enable `inputs.zfs` (pool state + ARC stats). | + +## ZFS hosts + +`inputs.disk` already reports used/free per mount for any filesystem type +including ZFS, so the disk-usage view works out of the box. Setting +`COLLECT_ZFS=true` additionally enables `inputs.zfs` which reads +`/proc/spl/kstat/zfs/...` and emits ARC hit ratio, ARC size, and per-pool +health metrics. The bind mount of `/proc` provides the necessary +visibility; no extra mounts are needed. + +If you set `COLLECT_ZFS=true` on a non-ZFS host, telegraf logs an error +once per collection cycle and skips the input. Harmless but noisy; leave +the toggle off on non-ZFS machines. + +## Troubleshooting + +| Symptom | Likely cause | +|---------|-------------| +| Container fails to start with `FATAL: INFLUXDB_URL is required but empty` | Missing required env. Check spec.yml + credentials file. | +| Container starts, no rows appear in InfluxDB | Writer credentials wrong, or InfluxDB unreachable from this host's network. Check `docker logs ` for `Post ... 401` / `connection refused`. | +| Two hosts overwriting each other's series | Both use the same kernel hostname. Set distinct `HOST_TAG` values. | +| `inputs.processes` reports only 1 process | `pid: host` missing from compose. Re-deploy. | + +## Caveats + +- Requires Docker with privileges to bind-mount `/`, `/proc`, `/sys`, and to + share the host PID namespace. Rootless Docker installations may refuse + `pid: host` and the `/` bind mount. +- One deployment per host. Running two on the same machine writes + duplicate rows under the same `host` tag.