Monitoring — Current State
Launch readiness
Status: core lifecycle monitoring is launch-ready.
The monitoring stack has been consolidated so /etc/zlh-monitor is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
Remaining future work:
- centralized logs are not enabled yet
- no Loki currently
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
Runtime source of truth
/etc/zlh-monitor is now the monitoring runtime source of truth.
Prometheus now runs from:
/etc/zlh-monitor/prometheus/prometheus.yml
configured via:
/etc/default/prometheus
Runtime Prometheus jobs are limited to:
prometheus
alloy
game-dev-alloy
Removed from active Prometheus runtime config:
- old node_exporter jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active :9100 scraping jobs
Local prometheus-node-exporter is disabled.
Prometheus listens on:
10.60.0.25:9090
This allows Alloy remote-write from hosts and containers to reach Prometheus.
Grafana state
Grafana provisions from:
/etc/zlh-monitor/grafana/provisioning
Dashboard source directory:
/etc/zlh-monitor/grafana/dashboards
Live dashboards provisioned:
- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers
The old node_exporter router dashboard was archived:
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
Grafana listens on loopback only:
127.0.0.1:3000
Discovery behavior
API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.
9091 / zpack-dev-velocity / 10.60.0.220 no longer appears as a lifecycle dev container.
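As a rough illustration of CIDR-based exclusion, the sketch below drops any discovered container whose address falls inside an excluded network. The `10.60.0.0/24` range, the `EXCLUDED_CIDRS` name, and the record shape (`id`, `ip`) are assumptions made for the example; the real CIDR list and filter live in the discovery code and are not reproduced here.

```python
import ipaddress

# Hypothetical core range, assumed only because 10.60.0.220 is named above
# as a non-lifecycle system. The real exclusion list may differ.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.0/24")]


def is_excluded(ip: str) -> bool:
    """Return True if the address sits inside any excluded network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in EXCLUDED_CIDRS)


def filter_lifecycle(containers):
    """Keep only records whose 'ip' is outside every excluded network."""
    return [c for c in containers if not is_excluded(c["ip"])]
```

With these assumed values, a record for 10.60.0.220 is filtered out while one for 10.100.0.24 is kept.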
The monitor sync allows empty discovery:
ALLOW_EMPTY="true"
This ensures deleted containers correctly clear:
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
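The effect of the allow-empty setting can be sketched as follows. This is not the actual monitor sync script; `write_file_sd` and its parameters are hypothetical names, and the write-then-rename step is a design choice of the sketch so that Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, allow_empty=True):
    """Write a Prometheus file_sd JSON file; return True if it was written.

    With allow_empty=False an empty discovery result is refused and the
    existing file is left untouched, so deleted containers would linger.
    """
    if not targets and not allow_empty:
        return False
    # file_sd format: a list of target groups; an empty list means no targets.
    groups = [{"targets": sorted(targets)}] if targets else []
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh)
    # mkstemp creates the file as 0600; relax it so another user can read it.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)  # atomic rename within the same directory
    return True
```

Calling it with an empty target list and `allow_empty=True` leaves `[]` in the file, which matches the cleared state shown above.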
Alloy template state
The base template was updated so container Alloy listens on:
0.0.0.0:12345
instead of only:
127.0.0.1:12345
This allows Prometheus to scrape container Alloy health via the game-dev-alloy job.
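The reason the bind address matters can be shown with a small check: a listener bound to a loopback address only accepts connections from the same host, so a Prometheus instance elsewhere cannot scrape it. The helper name below is hypothetical and the sketch handles only IPv4 `host:port` strings.

```python
import ipaddress


def reachable_from_other_hosts(listen_addr: str) -> bool:
    """Return False when the listen address is loopback-only."""
    host, _, port = listen_addr.rpartition(":")
    int(port)  # raises ValueError if the port part is not numeric
    return not ipaddress.ip_address(host).is_loopback
```

Under this check `127.0.0.1:12345` is not reachable from other hosts, while `0.0.0.0:12345` is, subject to whatever firewall rules apply.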
The updated template note lists Alloy in place of node_exporter:
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
Metrics model
Monitoring is not API-only.
- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path. game-dev-alloy scrape health also works now that container Alloy binds on 0.0.0.0:12345.
- Grafana reads Prometheus for dashboards.
Smoke tests passed
First manual test
dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus
Template test after updating base image
dev 6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up
Full loop verified
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
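The "target up" and "targets gone" steps of this loop could be checked against the JSON returned by the Prometheus `/api/v1/targets` endpoint. The sketch below works on an already-parsed response so it needs no network access; the function name is hypothetical, and matching on an `instance` label of the form `ip:12345` is an assumption about how the targets are labeled.

```python
def target_health(targets_response, job, instance):
    """Return the health of the matching active target, or None if absent."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    for target in active:
        labels = target.get("labels", {})
        if labels.get("job") == job and labels.get("instance") == instance:
            return target.get("health")
    return None


# Example response shape, trimmed to the fields the function reads.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "game-dev-alloy", "instance": "10.100.0.24:12345"},
                "health": "up",
            }
        ]
    },
}
```

After a create, the expected result for the new container is `"up"`; after a delete, the same lookup should return `None`.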
Logging state
Centralized logs are not enabled yet.
No Loki currently.
Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.
Platform exceptions
OPNsense and PBS remain explicit platform monitoring exceptions.
OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.