# 2026-02-07 — Prometheus + Grafana (LXC node_exporter) status

## What works right now

- Each LXC template has **node_exporter** installed.
- Prometheus is scraping container exporters via **HTTP service discovery** from the API.
- Grafana can display metrics using a standard Node Exporter dashboard (and a custom "ZLH LXC Overview" dashboard).

## Prometheus config (current)

On `zlh-monitor`, `/etc/prometheus/prometheus.yml` includes a dynamic job:

```yaml
- job_name: 'zlh_lxc'
  scrape_interval: 15s
  http_sd_configs:
    - url: http://10.60.0.245:4000/sd/exporters
      refresh_interval: 30s
      authorization:
        type: Bearer
        credentials: ''
```

This is why containers show up even though no container IPs are hardcoded.

### The earlier failure mode

The Prometheus log showed:

* `dial tcp 10.60.0.245:80: connect: connection refused`

That happened when the SD URL was missing the port (or used the wrong path). Correct is `:4000/...` (or whatever port the API listens on).

## Grafana dashboard import gotcha

When importing JSON dashboards, if the JSON references a datasource UID that does not exist (example error: **"Datasource PROM was not found"**), you must either:

* select the real datasource during import, or
* edit the dashboard JSON's datasource UID to match your Grafana datasource UID.

Once the datasource was mapped, the "ZeroLagHub - LXC Overview" dashboard started drawing data.

## CPU mismatch vs `top` (note)

You observed CPU% in Grafana not matching what `top` was showing. This can be normal depending on:

* how CPU is computed (rate window, per-core vs. normalized, cgroup accounting in containers, etc.)
* dashboard query math (e.g. averaging across cores)

We decided not to block on perfect parity right now — the dashboard is good enough for:

* "is it pegged?"
* "is it trending up/down?"
* "is this instance dead?"
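For reference, the `http_sd_configs` job above only works if `/sd/exporters` returns JSON in Prometheus's HTTP SD format: an array of `{"targets": [...], "labels": {...}}` groups. A minimal sketch of building that payload on the API side (the container dict shape, hostnames, and IPs here are hypothetical; node_exporter's default port 9100 is assumed):

```python
import json

def sd_response(containers):
    """Build a Prometheus HTTP SD response: a JSON array of
    {"targets": [...], "labels": {...}} target groups.
    Each container dict here is a hypothetical shape -- adjust
    to whatever the API actually stores."""
    groups = []
    for c in containers:
        groups.append({
            "targets": [f"{c['ip']}:9100"],   # node_exporter default port
            "labels": {
                "vmid": str(c["vmid"]),       # label values must be strings
                "hostname": c["hostname"],
            },
        })
    return groups

# Example: two hypothetical LXC containers
containers = [
    {"vmid": 101, "ip": "10.60.0.101", "hostname": "zlh-web"},
    {"vmid": 102, "ip": "10.60.0.102", "hostname": "zlh-db"},
]
print(json.dumps(sd_response(containers), indent=2))
```

Attaching a `vmid` label here is what makes the per-container PromQL grouping at the end of this note possible.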
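The datasource-UID fix from the import gotcha can also be scripted instead of hand-editing the JSON. A sketch, assuming the newer dashboard format where panels and targets carry `{"datasource": {"type": ..., "uid": ...}}` objects (older exports that use plain string datasource names would need extra handling):

```python
import json

def remap_datasource_uid(node, old_uid, new_uid):
    """Recursively replace datasource UID references in a Grafana
    dashboard JSON tree (panels, targets, templating variables all
    carry nested {"datasource": {"uid": ...}} objects)."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") == old_uid:
            ds["uid"] = new_uid
        for v in node.values():
            remap_datasource_uid(v, old_uid, new_uid)
    elif isinstance(node, list):
        for v in node:
            remap_datasource_uid(v, old_uid, new_uid)
    return node

# Hypothetical fragment of an exported dashboard referencing UID "PROM"
dash = {"panels": [{"datasource": {"type": "prometheus", "uid": "PROM"}}]}
remap_datasource_uid(dash, "PROM", "my-real-uid")  # new UID is a placeholder
print(json.dumps(dash))
```

The real target UID can be read from Grafana's datasource settings page (or its API) before running this over the exported JSON.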
If we need tighter parity later:

* validate the container is exporting the right counters
* compare `node_cpu_seconds_total` raw series
* consider container-aware exporters/cAdvisor (optional)

## Next step (portal)

Goal: show "CPU / RAM / Disk / Net" indicators on the portal server cards.

Two practical routes:

1. **Query Prometheus from the API** (recommended)
   * the API calls the Prometheus HTTP API (server-side)
   * the portal calls the API
   * avoids exposing Prometheus or Grafana to end users
2. **Embed Grafana panels** (works, but needs careful auth + iframe settings)
   * useful for internal admin views
   * riskier for a multi-tenant UI unless you add strict access control

Short-term plan: build API endpoints like:

* `GET /api/metrics/servers/:vmid/summary`
  * returns a small payload suitable for server cards:
    * cpu_percent
    * mem_percent
    * disk_percent
    * net_in/out (rate)

## Quick PromQL building blocks (reference)

CPU (percent used, 5m window):

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory (percent used):

```promql
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```

Disk (percent used, excluding tmpfs/overlay):

```promql
100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))
```

Network in/out:

```promql
sum by (instance) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
```

If your SD labels include `vmid`, prefer grouping by `vmid`.
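Those building blocks can be wired into the proposed summary endpoint roughly like this. A sketch only: the Prometheus address, the presence of a `vmid` label on the series, and the `fetch` helper are all assumptions, and only two of the four summary fields are shown:

```python
from urllib.parse import urlencode

PROM_URL = "http://10.60.0.245:9090"  # assumed Prometheus address; adjust

# PromQL templates keyed by summary field; {vmid} assumes the SD labels
# attach a vmid label to every scraped series.
QUERIES = {
    "cpu_percent": '100 - (avg by (vmid) (rate(node_cpu_seconds_total'
                   '{{mode="idle",vmid="{vmid}"}}[5m])) * 100)',
    "mem_percent": '100 * (1 - (node_memory_MemAvailable_bytes{{vmid="{vmid}"}}'
                   ' / node_memory_MemTotal_bytes{{vmid="{vmid}"}}))',
}

def instant_query_url(query):
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': query})}"

def first_value(prom_response):
    """Pull the scalar out of a Prometheus instant-query JSON body,
    or None if the query matched no series."""
    results = prom_response["data"]["result"]
    if not results:
        return None
    return float(results[0]["value"][1])  # value = [timestamp, "number"]

def summary_for(vmid, fetch):
    """Assemble the payload for GET /api/metrics/servers/:vmid/summary.
    `fetch` is whatever HTTP client the API uses (hypothetical here):
    it takes a URL and returns the decoded JSON body."""
    return {
        field: first_value(fetch(instant_query_url(tmpl.format(vmid=vmid))))
        for field, tmpl in QUERIES.items()
    }

# Offline example with a canned instant-query response (no live Prometheus)
fake = {"data": {"result": [{"value": [1700000000, "42.5"]}]}}
print(summary_for(101, lambda url: fake))
```

Keeping this server-side in the API preserves route 1's main benefit: the portal never talks to Prometheus directly.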