host-metrics: operator README
parent f898d65983
commit 4eaca0ecb0

@@ -7,6 +7,115 @@ it runs on. Writes to an InfluxDB 1.x endpoint of your choosing.

Deploy one instance per machine you want monitored.

## What gets collected

| Input | Measurements (in InfluxDB) |
|-------|----------------------------|
| inputs.cpu (totalcpu only) | cpu (`cpu=cpu-total`) |
| inputs.mem | mem |
| inputs.swap | swap |
| inputs.system | system (uptime, load1/5/15, n_users, n_cpus) |
| inputs.processes | processes (running/sleeping/blocked/zombies) |
| inputs.disk | disk (used/free/used_percent per mount) |
| inputs.diskio | diskio (read/write bytes/ops per device) |
| inputs.net | net (bytes/packets/err in/out per interface) |
| inputs.zfs (opt-in via COLLECT_ZFS=true) | zfs (ARC stats, pool state) |

All rows are tagged with `host` (kernel hostname, or `HOST_TAG` override).

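Because every row carries the `host` tag, checking one machine's data comes down to a single InfluxQL query. The sketch below only builds that query string. The helper name `latest_row_query` is hypothetical, and running the query needs an account with read access (`READ_USER` and `READ_PASSWORD` are placeholders), since the writer-only credentials described in this README presumably cannot read.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the stack): print an InfluxQL query that
# returns the newest row of one measurement for one value of the host tag.
latest_row_query() {
  local measurement="$1" host="$2"
  printf "SELECT * FROM \"%s\" WHERE \"host\" = '%s' ORDER BY time DESC LIMIT 1\n" \
    "$measurement" "$host"
}

# Usage against the InfluxDB 1.x /query endpoint (placeholders, see above):
#   curl -sG "$INFLUXDB_URL/query" \
#     -u "$READ_USER:$READ_PASSWORD" \
#     --data-urlencode "db=host_metrics" \
#     --data-urlencode "q=$(latest_row_query system validator-1)"
```

Any measurement name from the table above can be passed as the first argument.
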
## Deploy

### Create a spec

```bash
laconic-so --stack <path>/stack_orchestrator/data/stacks/host-metrics \
  deploy init --output spec-host-metrics.yml
```

Edit `spec-host-metrics.yml` to look like:

```yaml
stack: <path>/stack_orchestrator/data/stacks/host-metrics
deploy-to: compose
credentials-files:
  - ~/.credentials/host-metrics.env
config:
  INFLUXDB_URL: 'https://influxdb.example.com'
  INFLUXDB_DB: 'host_metrics'      # default; override for a custom DB
  HOST_TAG: 'validator-1'          # optional; defaults to kernel hostname
  COLLECT_INTERVAL: '10s'          # telegraf collection + flush cadence
  COLLECT_ZFS: 'false'             # set to 'true' on ZFS hosts
```

`~/.credentials/host-metrics.env` must contain:

```
INFLUXDB_WRITE_USER=<writer-username>
INFLUXDB_WRITE_PASSWORD=<writer-password>
```

These are issued by the InfluxDB admin (the monitoring host operator); they
are the same writer-only credentials used by validators/RPCs to push agave
metrics.

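A quick check of the credentials file before deploying can catch a missing or empty variable early. This is a minimal sketch, not part of the stack: the helper name `check_credentials_file` is hypothetical, and it assumes the plain `KEY=value` layout shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the env file defines both writer
# variables with non-empty values.
check_credentials_file() {
  local file="$1" key
  for key in INFLUXDB_WRITE_USER INFLUXDB_WRITE_PASSWORD; do
    # Match "KEY=" followed by at least one character.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: ${key}" >&2
      return 1
    fi
  done
}

# Usage:
#   chmod 600 ~/.credentials/host-metrics.env   # keep the password private
#   check_credentials_file ~/.credentials/host-metrics.env && echo ok
```
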
### Create and start

```bash
laconic-so deployment create --spec-file spec-host-metrics.yml --deployment-dir ./deployment-host-metrics
laconic-so deployment --dir ./deployment-host-metrics start
```

### Verify

```bash
# telegraf logs to stderr by default, so merge it into stdout before piping
docker logs "$(docker ps -qf name=host-metrics)" 2>&1 | head
```

Expected: telegraf prints its startup banner and a `Loaded inputs: ...` line,
with no errors about missing config or auth failures.

Within ~20 seconds, the host's data appears in the InfluxDB endpoint's
`host_metrics` database (or whichever DB you set in `INFLUXDB_DB`) and in
any Grafana dashboards bound to that DB.

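To scan a longer log for problems rather than eyeballing the first lines, the output can be filtered by severity. A minimal sketch, assuming telegraf's default text log format, in which error lines carry an `E!` marker; the helper name `telegraf_errors` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only the error lines of a telegraf log read from
# stdin. Assumes the default text log format, where errors are marked "E!".
telegraf_errors() {
  # "|| true" keeps the exit status at 0 when there is nothing to report.
  grep -E ' E! ' || true
}

# Usage:
#   docker logs "$(docker ps -qf name=host-metrics)" 2>&1 | telegraf_errors
```

Empty output means no error lines were logged so far.
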
## Configuration reference

| Env | Required | Default | Notes |
|-----|----------|---------|-------|
| `INFLUXDB_URL` | yes | - | Full URL including scheme. Example: `https://influxdb.example.com`. |
| `INFLUXDB_DB` | no | `host_metrics` | Target database. Must exist (writer is not granted CREATE). |
| `INFLUXDB_WRITE_USER` | yes | - | Writer-only user. |
| `INFLUXDB_WRITE_PASSWORD` | yes | - | Writer-only password. |
| `COLLECT_INTERVAL` | no | `10s` | Telegraf collection and flush cadence. |
| `HOST_TAG` | no | empty | Overrides the kernel hostname for the `host` tag on every row. Useful when a VM has a generic hostname. |
| `COLLECT_ZFS` | no | `false` | Set to `true` to enable `inputs.zfs` (pool state + ARC stats). |

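The rules in this table can be checked before deploying. The sketch below is not part of the stack: `preflight` and `require_set` are hypothetical names, the settings are read from the current environment, and the interval check assumes a single number plus unit (such as `10s`), which is narrower than what telegraf itself may accept.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the stack). It reads the settings
# from the current environment and reports every problem it finds.

# Report a required variable whose value is empty.
require_set() {
  [ -n "$2" ] || { echo "required but empty: $1" >&2; return 1; }
}

preflight() {
  local rc=0
  require_set INFLUXDB_URL "${INFLUXDB_URL:-}" || rc=1
  require_set INFLUXDB_WRITE_USER "${INFLUXDB_WRITE_USER:-}" || rc=1
  require_set INFLUXDB_WRITE_PASSWORD "${INFLUXDB_WRITE_PASSWORD:-}" || rc=1

  # The URL must include its scheme; an empty value was reported above.
  case "${INFLUXDB_URL:-}" in
    http://*|https://*|"") ;;
    *) echo "INFLUXDB_URL needs a scheme (http:// or https://)" >&2; rc=1 ;;
  esac

  # Assumption: a single number plus unit (10s, 500ms, 1m) covers typical use.
  if ! printf '%s\n' "${COLLECT_INTERVAL:-10s}" | grep -Eq '^[0-9]+(ms|s|m|h)$'; then
    echo "COLLECT_INTERVAL does not look like a duration" >&2; rc=1
  fi

  case "${COLLECT_ZFS:-false}" in
    true|false) ;;
    *) echo "COLLECT_ZFS must be true or false" >&2; rc=1 ;;
  esac
  return "$rc"
}

# Usage: export the values from the spec and credentials file, then run
#   preflight && echo "config looks sane"
```
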
## ZFS hosts

`inputs.disk` already reports used/free per mount for any filesystem type,
including ZFS, so the disk-usage view works out of the box. Setting
`COLLECT_ZFS=true` additionally enables `inputs.zfs`, which reads
`/proc/spl/kstat/zfs/...` and emits ARC hit ratio, ARC size, and per-pool
health metrics. The bind mount of `/proc` provides the necessary
visibility; no extra mounts are needed.

If you set `COLLECT_ZFS=true` on a non-ZFS host, telegraf logs an error
once per collection cycle and skips the input. This is harmless but noisy;
leave the toggle off on non-ZFS machines.

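Whether a host exposes the ZFS kstat tree can be tested before choosing the toggle. A minimal sketch: the helper name `suggest_collect_zfs` is hypothetical, and it only checks that the `spl/kstat/zfs` directory exists under the given proc root, not that any pool is imported.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the COLLECT_ZFS value that matches this host.
# The optional argument is the proc root, so the check also works where the
# host's /proc is mounted at a different path.
suggest_collect_zfs() {
  local proc_root="${1:-/proc}"
  if [ -d "$proc_root/spl/kstat/zfs" ]; then
    echo true
  else
    echo false
  fi
}

# Usage:
#   echo "COLLECT_ZFS: '$(suggest_collect_zfs)'"
```
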
## Troubleshooting

| Symptom | Likely cause |
|---------|-------------|
| Container fails to start with `FATAL: INFLUXDB_URL is required but empty` | Missing required env. Check the spec file and the credentials file. |
| Container starts, no rows appear in InfluxDB | Writer credentials wrong, or InfluxDB unreachable from this host's network. Check `docker logs <telegraf>` for `Post ... 401` / `connection refused`. |
| Two hosts overwriting each other's series | Both use the same kernel hostname. Set distinct `HOST_TAG` values. |
| `inputs.processes` reports only 1 process | `pid: host` is missing from the compose file. Redeploy. |

## Caveats

- Requires Docker with privileges to bind-mount `/`, `/proc`, `/sys`, and to
  share the host PID namespace. Rootless Docker installations may refuse
  `pid: host` and the `/` bind mount.
- One deployment per host. Running two on the same machine writes
  duplicate rows under the same `host` tag.

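The rootless caveat can be checked up front. This sketch assumes that a rootless daemon lists `name=rootless` among the security options reported by `docker info`; the helper name `is_rootless` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read the output of
#   docker info --format '{{.SecurityOptions}}'
# on stdin and succeed if it lists the rootless security option.
is_rootless() {
  grep -q 'name=rootless'
}

# Usage:
#   if docker info --format '{{.SecurityOptions}}' | is_rootless; then
#     echo "rootless Docker: pid: host and the / bind mount may be refused" >&2
#   fi
```
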