
2026-02-07 — Prometheus + Grafana (LXC node_exporter) status

What works right now

  • Each LXC template has node_exporter installed.
  • Prometheus is scraping container exporters via HTTP service discovery from the API.
  • Grafana can display metrics using a standard Node Exporter dashboard (and a custom "ZLH LXC Overview" dashboard).

Prometheus config (current)

On zlh-monitor:

/etc/prometheus/prometheus.yml includes a dynamic job:

  - job_name: 'zlh_lxc'
    scrape_interval: 15s
    http_sd_configs:
      - url: http://10.60.0.245:4000/sd/exporters
        refresh_interval: 30s
        authorization:
          type: Bearer
          credentials: '<TOKEN>'

This is why containers show up even though there are no hardcoded container IPs.
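For reference, Prometheus expects the HTTP SD endpoint to return a JSON array of target groups, each with a `targets` list and optional `labels`. The IPs and the `vmid` label below are illustrative — the real labels depend on what the API emits:

```json
[
  { "targets": ["10.60.0.51:9100"], "labels": { "vmid": "101" } },
  { "targets": ["10.60.0.52:9100"], "labels": { "vmid": "102" } }
]
```

Any labels attached here are applied to every scraped series from that target, which is what makes per-container grouping possible later.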

The earlier failure mode

Prometheus log showed:

  • dial tcp 10.60.0.245:80: connect: connection refused

That happened when the SD URL was missing the port (so the request defaulted to :80) or used the wrong path. The correct URL includes the port: :4000/... (or whatever port the API actually listens on).

Grafana dashboard import gotcha

When importing JSON dashboards:

  • If the JSON references a datasource UID that does not exist (example error: "Datasource PROM was not found"), you must either:

    • select the real datasource during import, or
    • edit the dashboard JSON datasource UID to match your Grafana datasource UID.

Once the datasource was mapped, the "ZeroLagHub - LXC Overview" dashboard started showing data.
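For the second option, the field to edit lives on each panel in the dashboard JSON. A minimal sketch (the `"PROM"` UID here is the placeholder from the error message; the real UID is visible in Grafana under the datasource's settings page):

```json
{
  "panels": [
    {
      "title": "CPU",
      "datasource": { "type": "prometheus", "uid": "PROM" }
    }
  ]
}
```

Replacing every `"uid"` occurrence with the real datasource UID (or selecting the datasource in the import dialog) resolves the "was not found" error.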

CPU mismatch vs top (note)

You observed CPU% in Grafana not matching what top was showing.

This can be normal depending on:

  • how CPU is computed (rate window, per-core vs normalized, cgroup accounting in containers, etc.)
  • dashboard query math (e.g. averaging across cores)

We decided not to block on perfect parity right now — the dashboard is good enough for:

  • "is it pegged?"
  • "is it trending up/down?"
  • "is this instance dead?"

If we need tighter parity later:

  • validate the container is exporting the right counters
  • compare node_cpu_seconds_total raw series
  • consider container-aware exporters/cAdvisor (optional)
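For the "compare raw series" step above, a starting point is to look at per-mode CPU rates for one instance and check them against top's %us/%sy/%id split. The instance address is illustrative:

```promql
sum by (mode) (rate(node_cpu_seconds_total{instance="10.60.0.51:9100"}[1m]))
```

The per-mode rates should sum to roughly the core count; if idle alone accounts for nearly all of it while top shows load, the exporter is likely reporting host-level (not cgroup-level) counters.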

Next step (portal)

Goal:

  • show "CPU / RAM / Disk / Net" indicators on the portal server cards.

Two practical routes:

  1. Query Prometheus from API (recommended)

    • API calls Prometheus HTTP API (server-side)
    • portal calls API
    • avoids exposing Prometheus or Grafana to end users
  2. Embed Grafana panels (works, but needs careful auth + iframe settings)

    • useful for internal admin views
    • more risk for multi-tenant UI unless you add strict access control

Short-term plan:

  • build API endpoints like:

    • GET /api/metrics/servers/:vmid/summary

    • returns a small payload suitable for server cards:

      • cpu_percent
      • mem_percent
      • disk_percent
      • net_in/out (rate)

Quick PromQL building blocks (reference)

CPU (percent used, 5m window):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory (percent used):

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Disk (percent used, excluding tmpfs/overlay):

100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))

Network in/out:

sum by (instance) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))

If your SD labels include vmid, prefer grouping by vmid.
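For example, assuming the SD endpoint attaches a vmid label (not confirmed above), the CPU query becomes:

```promql
100 - (avg by (vmid) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Grouping by vmid keeps the series stable even if a container's IP (and therefore its instance label) changes.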