2026-02-07 — Prometheus + Grafana (LXC node_exporter) status
What works right now
- Each LXC template has node_exporter installed.
- Prometheus is scraping container exporters via HTTP service discovery from the API.
- Grafana can display metrics using a standard Node Exporter dashboard (and a custom "ZLH LXC Overview" dashboard).
Prometheus config (current)
On zlh-monitor:
/etc/prometheus/prometheus.yml includes a dynamic job:
- job_name: 'zlh_lxc'
  scrape_interval: 15s
  http_sd_configs:
    - url: http://10.60.0.245:4000/sd/exporters
      refresh_interval: 30s
      authorization:
        type: Bearer
        credentials: '<TOKEN>'
This is why containers show up even though there are no hardcoded container IPs.
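For reference, Prometheus HTTP SD expects the endpoint to return a JSON array of target groups (targets plus optional labels). A plausible /sd/exporters response could look like this; the IPs and the vmid label are illustrative, not the real API output:

```json
[
  { "targets": ["10.60.0.101:9100"], "labels": { "vmid": "101" } },
  { "targets": ["10.60.0.102:9100"], "labels": { "vmid": "102" } }
]
```

Labels attached here (like vmid) end up on every scraped series, which is what makes per-container queries possible later.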
The earlier failure mode
Prometheus log showed:
dial tcp 10.60.0.245:80: connect: connection refused
That happened when the SD URL was missing the port (or pointed at the wrong path), so Prometheus fell back to port 80. The correct URL includes the port: :4000/... (or whatever port the API actually listens on).
Grafana dashboard import gotcha
When importing JSON dashboards:
If the JSON references a datasource UID that does not exist (example error: "Datasource PROM was not found"), you must either:
- select the real datasource during import, or
- edit the datasource UID in the dashboard JSON to match your Grafana datasource UID.
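For reference, in recent Grafana dashboard JSON each panel carries a datasource reference shaped roughly like the fragment below (the "PROM" uid stands in for the stale value from the error above); repointing uid at your actual Prometheus datasource UID is the manual fix:

```json
"datasource": {
  "type": "prometheus",
  "uid": "PROM"
}
```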
Once the datasource was mapped, the "ZeroLagHub - LXC Overview" dashboard started drawing data.
CPU mismatch vs top (note)
You observed CPU% in Grafana not matching what top was showing.
This can be normal depending on:
- how CPU is computed (rate window, per-core vs normalized, cgroup accounting in containers, etc.)
- dashboard query math (e.g. averaging across cores)
We decided not to block on perfect parity right now — the dashboard is good enough for:
- "is it pegged?"
- "is it trending up/down?"
- "is this instance dead?"
If we need tighter parity later:
- validate the container is exporting the right counters
- compare the raw node_cpu_seconds_total series
- consider container-aware exporters / cAdvisor (optional)
Next step (portal)
Goal:
- show "CPU / RAM / Disk / Net" indicators on the portal server cards.
Two practical routes:
- Query Prometheus from the API (recommended)
  - the API calls the Prometheus HTTP API (server-side)
  - the portal calls the API
  - avoids exposing Prometheus or Grafana to end users
- Embed Grafana panels (works, but needs careful auth + iframe settings)
  - useful for internal admin views
  - more risk in a multi-tenant UI unless you add strict access control
Short-term plan:
- build an API endpoint like:
  GET /api/metrics/servers/:vmid/summary
  which returns a small payload suitable for server cards:
  - cpu_percent
  - mem_percent
  - disk_percent
  - net_in / net_out (rate)
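The endpoint above can be sketched server-side. This is a minimal sketch, assuming Prometheus's standard /api/v1/query instant-query API and a vmid label attached by service discovery; PROM_URL and every function name here are hypothetical, not existing portal code:

```python
# Sketch of the summary-endpoint logic. PROM_URL, build_queries, and
# vmid_summary are illustrative names; the `vmid` label is assumed to be
# attached by the HTTP SD endpoint.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: Prometheus address

def build_queries(vmid: str) -> dict:
    """PromQL instant queries per metric, filtered by the vmid label."""
    sel = f'{{vmid="{vmid}"}}'
    return {
        "cpu_percent": (
            f'100 - (avg(rate(node_cpu_seconds_total'
            f'{{vmid="{vmid}",mode="idle"}}[5m])) * 100)'
        ),
        "mem_percent": (
            f'100 * (1 - (node_memory_MemAvailable_bytes{sel}'
            f' / node_memory_MemTotal_bytes{sel}))'
        ),
        "disk_percent": (
            # max() collapses multiple mounted filesystems to the fullest one
            f'max(100 * (1 - (node_filesystem_avail_bytes'
            f'{{vmid="{vmid}",fstype!~"tmpfs|overlay"}}'
            f' / node_filesystem_size_bytes'
            f'{{vmid="{vmid}",fstype!~"tmpfs|overlay"}})))'
        ),
        "net_in_bps": (
            f'sum(rate(node_network_receive_bytes_total'
            f'{{vmid="{vmid}",device!~"lo"}}[5m]))'
        ),
        "net_out_bps": (
            f'sum(rate(node_network_transmit_bytes_total'
            f'{{vmid="{vmid}",device!~"lo"}}[5m]))'
        ),
    }

def scalar_from_response(body: dict):
    """Extract the single numeric value from a Prometheus instant-query
    response, or None when the query matched nothing."""
    result = body.get("data", {}).get("result", [])
    if not result:
        return None
    return float(result[0]["value"][1])

def vmid_summary(vmid: str) -> dict:
    """One HTTP call per metric; the payload is small and easy to cache."""
    summary = {}
    for name, query in build_queries(vmid).items():
        url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
        with urllib.request.urlopen(url, timeout=5) as resp:
            summary[name] = scalar_from_response(json.load(resp))
    return summary
```

Doing the Prometheus calls server-side keeps the token and the Prometheus address out of the browser, which is the main point of the recommended route.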
Quick PromQL building blocks (reference)
CPU (percent used, 5m window):
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory (percent used):
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
Disk (percent used, excluding tmpfs/overlay):
100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))
Network in/out:
sum by (instance) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
If your SD labels include vmid, prefer grouping by vmid.
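For example, assuming SD attaches a vmid label, the CPU query above regrouped per container would read:

```promql
100 - (avg by (vmid) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```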