docs: add Prometheus + Grafana setup notes and portal metrics plan
parent 67b46d189f · commit 7b81770f16
SCRATCH/2026-02-07_prometheus-grafana.md (new file, 127 lines)
# 2026-02-07 — Prometheus + Grafana (LXC node_exporter) status

## What works right now

- Each LXC template has **node_exporter** installed.
- Prometheus is scraping container exporters via **HTTP service discovery** from the API.
- Grafana can display metrics using a standard Node Exporter dashboard (and a custom "ZLH LXC Overview" dashboard).
## Prometheus config (current)

On `zlh-monitor`:

`/etc/prometheus/prometheus.yml` includes a dynamic job:

```yaml
- job_name: 'zlh_lxc'
  scrape_interval: 15s
  http_sd_configs:
    - url: http://10.60.0.245:4000/sd/exporters
      refresh_interval: 30s
      authorization:
        type: Bearer
        credentials: '<TOKEN>'
```
This is why containers show up even though there are no hardcoded container IPs.
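For reference, Prometheus's HTTP SD protocol expects the endpoint to return a JSON array of target groups. Something like the following (the target IP, port, and labels here are illustrative, not copied from the running API):

```json
[
  {
    "targets": ["10.60.0.101:9100"],
    "labels": {
      "vmid": "101",
      "job": "zlh_lxc"
    }
  }
]
```

Any labels returned here (e.g. `vmid`) become labels on every scraped series, which is what makes per-container queries possible later.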
### The earlier failure mode

The Prometheus log showed:

* `dial tcp 10.60.0.245:80: connect: connection refused`

That happened when the SD URL was missing the port (or used the wrong path). The correct form is `:4000/...` (or whatever port the API listens on).
## Grafana dashboard import gotcha

When importing JSON dashboards:

* If the JSON references a datasource UID that does not exist (example error: **"Datasource PROM was not found"**), you must either:
  * select the real datasource during import, or
  * edit the dashboard JSON datasource UID to match your Grafana datasource UID.

Once the datasource was mapped, the "ZeroLagHub - LXC Overview" dashboard started drawing data.
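If editing the JSON by hand gets tedious, the UID rewrite can be scripted. A minimal sketch, assuming the exported dashboard JSON uses the common `"datasource": {"type": ..., "uid": ...}` shape (panel layout and key names vary between Grafana versions, so treat this as illustrative):

```python
def remap_datasource_uid(dashboard: dict, new_uid: str) -> dict:
    """Recursively rewrite every datasource UID in a dashboard dict.

    Walks panels, targets, templating variables, etc., and replaces
    any {"datasource": {"uid": ...}} reference it finds.
    """
    def walk(node):
        if isinstance(node, dict):
            ds = node.get("datasource")
            if isinstance(ds, dict) and "uid" in ds:
                ds["uid"] = new_uid
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(dashboard)
    return dashboard
```

Run it over the parsed dashboard JSON before import, with `new_uid` set to your real Prometheus datasource UID (visible in the Grafana datasource settings URL).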
## CPU mismatch vs `top` (note)

You observed CPU% in Grafana not matching what `top` was showing.

This can be normal depending on:

* how CPU is computed (rate window, per-core vs normalized, cgroup accounting in containers, etc.)
* dashboard query math (e.g. averaging across cores)

We decided not to block on perfect parity right now — the dashboard is good enough for:

* "is it pegged?"
* "is it trending up/down?"
* "is this instance dead?"

If we need tighter parity later:

* validate the container is exporting the right counters
* compare `node_cpu_seconds_total` raw series
* consider container-aware exporters/cAdvisor (optional)
||||
## Next step (portal)
|
||||
|
||||
Goal:
|
||||
|
||||
* show "CPU / RAM / Disk / Net" indicators on the portal server cards.
|
||||
|
||||
Two practical routes:
|
||||
|
||||
1. **Query Prometheus from API** (recommended)
|
||||
|
||||
* API calls Prometheus HTTP API (server-side)
|
||||
* portal calls API
|
||||
* avoids exposing Prometheus or Grafana to end users
|
||||
|
||||
2. **Embed Grafana panels** (works, but needs careful auth + iframe settings)
|
||||
|
||||
* useful for internal admin views
|
||||
* more risk for multi-tenant UI unless you add strict access control
|
||||
|
||||
Short-term plan:
|
||||
|
||||
* build API endpoints like:
|
||||
|
||||
* `GET /api/metrics/servers/:vmid/summary`
|
||||
* returns a small payload suitable for server cards:
|
||||
|
||||
* cpu_percent
|
||||
* mem_percent
|
||||
* disk_percent
|
||||
* net_in/out (rate)
|
||||
|
||||
## Quick PromQL building blocks (reference)
|
||||
|
||||
CPU (percent used, 5m window):
|
||||
|
||||
```promql
|
||||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
```
|
||||
|
||||
Memory (percent used):
|
||||
|
||||
```promql
|
||||
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
|
||||
```
|
||||
|
||||
Disk (percent used, excluding tmpfs/overlay):
|
||||
|
||||
```promql
|
||||
100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))
|
||||
```
|
||||
|
||||
Network in/out:
|
||||
|
||||
```promql
|
||||
sum by (instance) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
|
||||
sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
|
||||
```
|
||||
|
||||
If your SD labels include `vmid`, prefer grouping by `vmid`.
|
||||
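For example, the network query regrouped by that SD label (assuming every target really does carry `vmid`, as in the SD config above):

```promql
sum by (vmid) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
```

Grouping by `vmid` instead of `instance` keeps the series keyed the same way the portal and API identify containers.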