diff --git a/SCRATCH/2026-02-07_prometheus-grafana.md b/SCRATCH/2026-02-07_prometheus-grafana.md
new file mode 100644
index 0000000..77a3581
--- /dev/null
+++ b/SCRATCH/2026-02-07_prometheus-grafana.md
@@ -0,0 +1,127 @@
# 2026-02-07 — Prometheus + Grafana (LXC node_exporter) status

## What works right now

- Each LXC template has **node_exporter** installed.
- Prometheus scrapes the container exporters via **HTTP service discovery** from the API.
- Grafana can display metrics using a standard Node Exporter dashboard (and a custom "ZLH LXC Overview" dashboard).

## Prometheus config (current)

On `zlh-monitor`, `/etc/prometheus/prometheus.yml` includes a dynamic job:

```yaml
  - job_name: 'zlh_lxc'
    scrape_interval: 15s
    http_sd_configs:
      - url: http://10.60.0.245:4000/sd/exporters
        refresh_interval: 30s
        authorization:
          type: Bearer
          credentials: ''
```

This is why containers show up even though no container IPs are hardcoded anywhere.

### The earlier failure mode

The Prometheus log showed:

* `dial tcp 10.60.0.245:80: connect: connection refused`

That happened when the SD URL omitted the port (or used the wrong path). The correct form is `:4000/...` (or whatever port the API listens on).

## Grafana dashboard import gotcha

When importing JSON dashboards:

* If the JSON references a datasource UID that does not exist (example error: **"Datasource PROM was not found"**), you must either:

  * select the real datasource during import, or
  * edit the datasource UID in the dashboard JSON to match your Grafana datasource's UID.

Once the datasource was mapped, the "ZeroLagHub - LXC Overview" dashboard started drawing data.

## CPU mismatch vs `top` (note)

Observed: CPU% in Grafana did not match what `top` was showing.

This can be normal depending on:

* how CPU is computed (rate window, per-core vs normalized, cgroup accounting in containers, etc.)
* dashboard query math (e.g.
averaging across cores)

We decided not to block on perfect parity right now — the dashboard is good enough for:

* "is it pegged?"
* "is it trending up/down?"
* "is this instance dead?"

If we need tighter parity later:

* validate the container is exporting the right counters
* compare the raw `node_cpu_seconds_total` series
* consider container-aware exporters/cAdvisor (optional)

## Next step (portal)

Goal:

* show "CPU / RAM / Disk / Net" indicators on the portal server cards.

Two practical routes:

1. **Query Prometheus from the API** (recommended)

   * the API calls the Prometheus HTTP API (server-side)
   * the portal calls the API
   * avoids exposing Prometheus or Grafana to end users

2. **Embed Grafana panels** (works, but needs careful auth + iframe settings)

   * useful for internal admin views
   * riskier for a multi-tenant UI unless you add strict access control

Short-term plan:

* build API endpoints like:

  * `GET /api/metrics/servers/:vmid/summary`
  * returns a small payload suitable for server cards:

    * cpu_percent
    * mem_percent
    * disk_percent
    * net_in/out (rate)

## Quick PromQL building blocks (reference)

CPU (percent used, 5m window):

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory (percent used):

```promql
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```

Disk (percent used, excluding tmpfs/overlay):

```promql
100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))
```

Network in/out:

```promql
sum by (instance) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
```

If your SD labels include `vmid`, prefer grouping by `vmid`.
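## Appendix: HTTP SD response shape (reference)

For reference, the `http_sd_configs` job in the Prometheus config above expects `GET /sd/exporters` to return a JSON array of target groups. A minimal sketch of the shape (the IPs and `vmid` label values here are made-up examples, not the API's confirmed output; 9100 is node_exporter's default port):

```json
[
  { "targets": ["10.60.0.101:9100"], "labels": { "vmid": "101" } },
  { "targets": ["10.60.0.102:9100"], "labels": { "vmid": "102" } }
]
```

Labels attached here (like `vmid`) land on every scraped series, which is what makes per-container grouping in PromQL possible.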
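## Appendix: debugging the CPU / `top` mismatch (reference)

If CPU parity with `top` needs chasing later, a hedged starting point for the "compare raw series" step: look at the per-core idle rate next to `top`'s per-CPU view (press `1` inside `top`). A shorter window such as `[1m]` tracks `top`'s near-instant numbers more closely than the smoother `[5m]` used in the dashboard queries:

```promql
# idle fraction per core over 1m; 1.0 = fully idle, near 0 = pegged
rate(node_cpu_seconds_total{mode="idle"}[1m])
```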