zlh-grind/Codex/Monitoring/DECISIONS.md
2026-05-01 20:54:27 +00:00

Monitoring — Decisions

Settled monitoring choices / do-not-re-litigate notes.

Runtime source of truth

/etc/zlh-monitor is the operational source of truth for the monitoring host.

  • Prometheus runtime config: /etc/zlh-monitor/prometheus/prometheus.yml
  • Prometheus service wiring: /etc/default/prometheus
  • Grafana provisioning: /etc/zlh-monitor/grafana/provisioning
  • Grafana dashboards: /etc/zlh-monitor/grafana/dashboards

Do not treat legacy /etc/prometheus or default Grafana sample provisioning as authoritative once the zlh-monitor layout is in place.
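
As an illustration of the rule above, a small check like the following could flag references to non-authoritative paths in scripts or runbooks. This is a minimal sketch: the function name is_authoritative is hypothetical, and the list of roots is taken only from the paths named in this section.

```python
from pathlib import PurePosixPath

# Authoritative locations named in this section.
AUTHORITATIVE = (
    PurePosixPath("/etc/zlh-monitor"),
    PurePosixPath("/etc/default/prometheus"),
)


def is_authoritative(path: str) -> bool:
    """Return True if path is, or lives under, an authoritative location."""
    p = PurePosixPath(path)
    return any(p == root or root in p.parents for root in AUTHORITATIVE)
```

Under this sketch, /etc/zlh-monitor/prometheus/prometheus.yml is accepted, while legacy /etc/prometheus/prometheus.yml is rejected.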

Metrics model

Monitoring is not API-only.

  • API discovery, monitor sync, and file_sd together own lifecycle inventory and add/remove validation.
  • Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
  • Prometheus listens on 10.60.0.25:9090 so Alloy remote-write from hosts/containers can reach it.
  • Grafana reads Prometheus for dashboards.

Container Alloy model

Alloy is the standard collector for game/dev containers.

  • Container Alloy listens on 0.0.0.0:12345.
  • game-dev-alloy scrape health is valid because container Alloy is reachable on the container IP.
  • Container Alloy also remote-writes metrics to Prometheus.
  • node_exporter is not used in game/dev containers.

Template note should reflect:

Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
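
The contract above can be spot-checked from the monitoring host with a probe along these lines. This is a sketch under assumptions: plain HTTP is assumed (the document does not state the scheme), and the helper names alloy_metrics_url and alloy_is_reachable are hypothetical.

```python
import ipaddress
import urllib.request

ALLOY_PORT = 12345  # container Alloy listen port from the notes above


def alloy_metrics_url(container_ip: str) -> str:
    """Build the Alloy metrics URL for a container IP (HTTP is an assumption)."""
    ip = ipaddress.ip_address(container_ip)  # raises ValueError on a bad address
    return f"http://{ip}:{ALLOY_PORT}/metrics"


def alloy_is_reachable(container_ip: str, timeout: float = 3.0) -> bool:
    """Return True if the container's Alloy metrics endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(alloy_metrics_url(container_ip), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP errors, DNS failures
        return False
```

A container that fails this probe would also fail the game-dev-alloy scrape, since both depend on Alloy being reachable on the container IP.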

Discovery behavior

Dynamic game/dev lifecycle discovery excludes non-lifecycle core dev systems by CIDR. Core/dev infrastructure such as 9091 / zpack-dev-velocity / 10.60.0.220 must not appear as a lifecycle dev container.
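
The CIDR exclusion can be sketched as follows. The excluded range 10.60.0.0/24 and the container dev-1 at 10.60.1.15 are hypothetical values chosen for illustration; the actual excluded CIDRs are not listed in this document. Only 10.60.0.220 / zpack-dev-velocity comes from the text above.

```python
import ipaddress

# Hypothetical core range; substitute the real excluded CIDR(s).
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.0/24")]


def is_lifecycle_container(ip: str) -> bool:
    """Return True if the address falls outside every excluded core CIDR."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in EXCLUDED_CIDRS)


def filter_lifecycle(candidates):
    """Keep only discovered entries whose IP is a lifecycle container address."""
    return [c for c in candidates if is_lifecycle_container(c["ip"])]


discovered = [
    {"name": "zpack-dev-velocity", "ip": "10.60.0.220"},  # core dev system
    {"name": "dev-1", "ip": "10.60.1.15"},                # hypothetical lifecycle container
]
lifecycle = filter_lifecycle(discovered)
```

Filtering by address rather than by name means a renamed core system stays excluded as long as it remains in the core range.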

Monitor sync allows empty discovery:

ALLOW_EMPTY="true"

This is intentional so deleting the last game/dev container clears:

/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
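
A minimal sketch of how the empty-discovery case might be handled when rendering the file_sd target file. The function names are hypothetical and this is not the actual monitor sync implementation; only the ALLOW_EMPTY behavior, the port, and the [] result come from the notes above. The output follows the standard Prometheus file_sd JSON shape (a list of target groups).

```python
import json
import os
import tempfile

ALLOY_PORT = 12345  # container Alloy listen port


def render_file_sd(container_ips, allow_empty=True):
    """Render file_sd JSON; an empty inventory yields "[]" when allowed."""
    if not container_ips and not allow_empty:
        raise ValueError("empty discovery refused: ALLOW_EMPTY is not true")
    targets = [f"{ip}:{ALLOY_PORT}" for ip in sorted(container_ips)]
    groups = [{"targets": targets}] if targets else []
    return json.dumps(groups)


def write_atomic(path, text):
    """Write via a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # readers never observe a partially written file
```

With allow_empty left true, deleting the last container produces an empty list rather than leaving a stale target behind; with it false, the sync would fail instead of clearing the file.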

Grafana model

Grafana loads provisioning from /etc/zlh-monitor/grafana/provisioning and dashboards from /etc/zlh-monitor/grafana/dashboards.

Live dashboards:

  • Core Services - Overview
  • Core Service - Detail
  • Game Containers
  • Dev Containers

The old node_exporter router dashboard is archived at:

/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Logging model

Loki is not currently enabled.

Future direction:

  • Loki should be centralized shared infrastructure.
  • Do not run Loki per game/dev container.
  • Containers should ship selected logs through Alloy when centralized logging is enabled.

Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions.

OPNsense may later need a router-only node_exporter job or another router-specific monitoring path. That should stay separate from the game/dev container Alloy contract.