docs: add Prometheus + Grafana setup notes and portal metrics plan
parent 67b46d189f · commit 7b81770f16
SCRATCH/2026-02-07_prometheus-grafana.md (new file, 127 lines)
# 2026-02-07 — Prometheus + Grafana (LXC node_exporter) status

## What works right now

- Each LXC template has **node_exporter** installed.
- Prometheus is scraping container exporters via **HTTP service discovery** from the API.
- Grafana can display metrics using a standard Node Exporter dashboard (and a custom "ZLH LXC Overview" dashboard).
## Prometheus config (current)

On `zlh-monitor`:

`/etc/prometheus/prometheus.yml` includes a dynamic job:

```yaml
- job_name: 'zlh_lxc'
  scrape_interval: 15s
  http_sd_configs:
    - url: http://10.60.0.245:4000/sd/exporters
      refresh_interval: 30s
      authorization:
        type: Bearer
        credentials: '<TOKEN>'
```
This is why containers show up even though there are no hardcoded container IPs.
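For reference, Prometheus's HTTP SD protocol expects the endpoint to return a JSON array of target groups. Something like the following (the target IP, port, and labels here are illustrative, not copied from the running API):

```json
[
  {
    "targets": ["10.60.0.101:9100"],
    "labels": {
      "vmid": "101",
      "job": "zlh_lxc"
    }
  }
]
```

Any labels returned here (e.g. `vmid`) become labels on every scraped series, which is what makes per-container queries possible later.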
### The earlier failure mode

The Prometheus log showed:

* `dial tcp 10.60.0.245:80: connect: connection refused`

That happened when the SD URL was missing the port (or used the wrong path). The correct form is `:4000/...` (or whatever port the API listens on).
## Grafana dashboard import gotcha

When importing JSON dashboards:

* If the JSON references a datasource UID that does not exist (example error: **"Datasource PROM was not found"**), you must either:
  * select the real datasource during import, or
  * edit the dashboard JSON datasource UID to match your Grafana datasource UID.

Once the datasource was mapped, the "ZeroLagHub - LXC Overview" dashboard started drawing data.
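If editing the JSON by hand gets tedious, the UID rewrite can be scripted. A minimal sketch, assuming the exported dashboard JSON uses the common `"datasource": {"type": ..., "uid": ...}` shape (panel layout and key names vary between Grafana versions, so treat this as illustrative):

```python
def remap_datasource_uid(dashboard: dict, new_uid: str) -> dict:
    """Recursively rewrite every datasource UID in a dashboard dict.

    Walks panels, targets, templating variables, etc., and replaces
    any {"datasource": {"uid": ...}} reference it finds.
    """
    def walk(node):
        if isinstance(node, dict):
            ds = node.get("datasource")
            if isinstance(ds, dict) and "uid" in ds:
                ds["uid"] = new_uid
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(dashboard)
    return dashboard
```

Run it over the parsed dashboard JSON before import, with `new_uid` set to your real Prometheus datasource UID (visible in the Grafana datasource settings URL).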
## CPU mismatch vs `top` (note)

You observed CPU% in Grafana not matching what `top` was showing.

This can be normal depending on:

* how CPU is computed (rate window, per-core vs normalized, cgroup accounting in containers, etc.)
* dashboard query math (e.g. averaging across cores)

We decided not to block on perfect parity right now — the dashboard is good enough for:

* "is it pegged?"
* "is it trending up/down?"
* "is this instance dead?"

If we need tighter parity later:

* validate the container is exporting the right counters
* compare `node_cpu_seconds_total` raw series
* consider container-aware exporters/cAdvisor (optional)
||||
## Next step (portal)
|
||||
|
||||
Goal:
|
||||
|
||||
* show "CPU / RAM / Disk / Net" indicators on the portal server cards.
|
||||
|
||||
Two practical routes:
|
||||
|
||||
1. **Query Prometheus from API** (recommended)
|
||||
|
||||
* API calls Prometheus HTTP API (server-side)
|
||||
* portal calls API
|
||||
* avoids exposing Prometheus or Grafana to end users
|
||||
|
||||
2. **Embed Grafana panels** (works, but needs careful auth + iframe settings)
|
||||
|
||||
* useful for internal admin views
|
||||
* more risk for multi-tenant UI unless you add strict access control
|
||||
|
||||
Short-term plan:
|
||||
|
||||
* build API endpoints like:
|
||||
|
||||
* `GET /api/metrics/servers/:vmid/summary`
|
||||
* returns a small payload suitable for server cards:
|
||||
|
||||
* cpu_percent
|
||||
* mem_percent
|
||||
* disk_percent
|
||||
* net_in/out (rate)
|
||||
|
||||
## Quick PromQL building blocks (reference)
|
||||
|
||||
CPU (percent used, 5m window):
|
||||
|
||||
```promql
|
||||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
```
|
||||
|
||||
Memory (percent used):
|
||||
|
||||
```promql
|
||||
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
|
||||
```
|
||||
|
||||
Disk (percent used, excluding tmpfs/overlay):
|
||||
|
||||
```promql
|
||||
100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))
|
||||
```
|
||||
|
||||
Network in/out:
|
||||
|
||||
```promql
|
||||
sum by (instance) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
|
||||
sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
|
||||
```
|
||||
|
||||
If your SD labels include `vmid`, prefer grouping by `vmid`.
|
||||
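For example, the network query regrouped by that SD label (assuming every target really does carry `vmid`, as in the SD config above):

```promql
sum by (vmid) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
```

Grouping by `vmid` instead of `instance` keeps the series keyed the same way the portal and API identify containers.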