Monitoring — Decisions
Settled monitoring choices / do-not-re-litigate notes.
Runtime source of truth
/etc/zlh-monitor is the operational source of truth for the monitoring host.
- Prometheus runtime config: /etc/zlh-monitor/prometheus/prometheus.yml
- Prometheus service wiring: /etc/default/prometheus
- Grafana provisioning: /etc/zlh-monitor/grafana/provisioning
- Grafana dashboards: /etc/zlh-monitor/grafana/dashboards
Do not treat legacy /etc/prometheus or default Grafana sample provisioning as authoritative once the zlh-monitor layout is in place.
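As a quick sanity check of the layout above, the following is a minimal sketch that reports which of the authoritative paths are missing on a host. The `LAYOUT` key names and the `missing_paths` helper are hypothetical, introduced only for illustration; the paths themselves are the ones listed in this section.

```python
from pathlib import Path

# Authoritative paths from this note; the dictionary keys are illustrative labels.
LAYOUT = {
    "prometheus_config": "/etc/zlh-monitor/prometheus/prometheus.yml",
    "prometheus_service_wiring": "/etc/default/prometheus",
    "grafana_provisioning": "/etc/zlh-monitor/grafana/provisioning",
    "grafana_dashboards": "/etc/zlh-monitor/grafana/dashboards",
}


def missing_paths(root="/"):
    """Return the sorted labels of layout paths that do not exist under root.

    The root parameter is an assumption added so the check can be run
    against a staging directory instead of the live filesystem.
    """
    base = Path(root)
    return sorted(
        name
        for name, path in LAYOUT.items()
        if not (base / path.lstrip("/")).exists()
    )
```

Running `missing_paths()` on the monitoring host should return an empty list once the zlh-monitor layout is fully in place.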
Metrics model
Monitoring is not API-only.
- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- Prometheus listens on 10.60.0.25:9090 so Alloy remote-write from hosts/containers can reach it.
- Grafana reads Prometheus for dashboards.
Container Alloy model
Alloy is the standard collector for game/dev containers.
- Container Alloy listens on 0.0.0.0:12345.
- game-dev-alloy scrape health is valid because container Alloy is reachable on the container IP.
- Container Alloy also remote-writes metrics to Prometheus.
- node_exporter is not used in game/dev containers.
Template note should reflect:
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
Discovery behavior
Dynamic game/dev lifecycle discovery excludes non-lifecycle core dev systems by CIDR. Core/dev infrastructure such as 9091 / zpack-dev-velocity / 10.60.0.220 must not appear as a lifecycle dev container.
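The CIDR exclusion described above could look roughly like the sketch below. It is not the actual discovery code: the `EXCLUDED_CIDRS` value, the `id`/`ip` record fields, and both function names are assumptions for illustration. The real exclusion ranges are not stated in this note, so a single /32 covering the example address stands in for them.

```python
import ipaddress

# Hypothetical exclusion list; the real CIDR ranges are not given in this note.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.220/32")]


def is_lifecycle_target(ip, excluded=EXCLUDED_CIDRS):
    """Return True when the address falls outside every excluded network."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in excluded)


def filter_lifecycle(containers, excluded=EXCLUDED_CIDRS):
    """Keep only discovered records whose IP is not in an excluded CIDR.

    Each record is assumed to be a dict with at least an "ip" key.
    """
    return [c for c in containers if is_lifecycle_target(c["ip"], excluded)]
```

With this filter, a record for 10.60.0.220 is dropped before it can reach the file_sd output, while ordinary game/dev container addresses pass through unchanged.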
Monitor sync allows empty discovery:
ALLOW_EMPTY="true"
This is intentional so deleting the last game/dev container clears:
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
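A minimal sketch of how a sync step might honor this setting when writing the file_sd target file. The `write_file_sd` helper and its `allow_empty` parameter are hypothetical stand-ins for the real monitor sync logic; only the output shape (a JSON list of target groups, or `[]` when nothing is discovered) follows the Prometheus file_sd format.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, allow_empty=True):
    """Write a Prometheus file_sd JSON file and return the written groups.

    targets is a list of "host:port" strings. With allow_empty=True an
    empty list is written as [], which clears the previous inventory.
    """
    if not targets and not allow_empty:
        raise ValueError("refusing to write an empty discovery result")

    # One target group per container; labels are omitted in this sketch.
    groups = [{"targets": [t]} for t in targets]

    # Write to a temp file in the same directory, then rename atomically
    # so Prometheus never reads a half-written file.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh)
    os.replace(tmp_path, path)
    return groups
```

If empty results were rejected instead, the last deleted container would stay in the target file and keep showing up as a failing scrape.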
Grafana model
Grafana provisions from /etc/zlh-monitor/grafana/provisioning and dashboards from /etc/zlh-monitor/grafana/dashboards.
Live dashboards:
- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers
The old node_exporter router dashboard is archived at:
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
Logging model
No Loki is currently enabled.
Future direction:
- Loki should be centralized shared infrastructure.
- Do not run Loki per game/dev container.
- Containers should ship selected logs through Alloy when centralized logging is enabled.
Platform exceptions
OPNsense and PBS remain explicit platform monitoring exceptions.
OPNsense may later need a router-only node_exporter job or another router-specific monitoring path. That should stay separate from the game/dev container Alloy contract.