zlh-grind/Codex/Monitoring/DECISIONS.md
2026-05-01 20:54:27 +00:00

# Monitoring — Decisions
Settled monitoring choices / do-not-re-litigate notes.
## Runtime source of truth
`/etc/zlh-monitor` is the operational source of truth for the monitoring host.
- Prometheus runtime config: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Prometheus service wiring: `/etc/default/prometheus`
- Grafana provisioning: `/etc/zlh-monitor/grafana/provisioning`
- Grafana dashboards: `/etc/zlh-monitor/grafana/dashboards`
Do not treat legacy `/etc/prometheus` or default Grafana sample provisioning as authoritative once the zlh-monitor layout is in place.
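A minimal sketch of a layout check, assuming it is useful to confirm that the paths listed above exist on the monitoring host. The helper name `missing_paths` and the `root` parameter are hypothetical and only exist to keep the example testable off-host; nothing like this is confirmed to exist in the repo.

```python
from pathlib import Path

# Paths taken from the list above; the dictionary keys are illustrative names.
EXPECTED_PATHS = {
    "prometheus_config": Path("/etc/zlh-monitor/prometheus/prometheus.yml"),
    "prometheus_service_wiring": Path("/etc/default/prometheus"),
    "grafana_provisioning": Path("/etc/zlh-monitor/grafana/provisioning"),
    "grafana_dashboards": Path("/etc/zlh-monitor/grafana/dashboards"),
}


def missing_paths(expected=EXPECTED_PATHS, root=Path("/")):
    """Return the names of expected paths that do not exist under root."""
    missing = []
    for name, path in expected.items():
        # Re-anchor the absolute path under root so the check can run in a sandbox.
        candidate = root / path.relative_to("/")
        if not candidate.exists():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_paths():
        print(f"missing: {name} -> {EXPECTED_PATHS[name]}")
```

Run as-is on the monitoring host, it prints one line per missing path and nothing when the layout is complete.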
## Metrics model
Monitoring is not API-only; the API side owns inventory, while metrics flow through Alloy.
- API discovery, monitor sync, and file_sd together own lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- Prometheus listens on `10.60.0.25:9090` so Alloy remote-write from hosts/containers can reach it.
- Grafana reads Prometheus for dashboards.
## Container Alloy model
Alloy is the standard collector for game/dev containers.
- Container Alloy listens on `0.0.0.0:12345`.
- `game-dev-alloy` scrape health is valid because container Alloy is reachable on the container IP.
- Container Alloy also remote-writes metrics to Prometheus.
- `node_exporter` is not used in game/dev containers.
The container template note should read:
```text
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
```
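As a sketch of what the `game-dev-alloy` scrape target for one container could look like, the snippet below builds a Prometheus file_sd target group pointing at the container IP on port 12345. The function name, the example IP `10.60.0.101`, and the `role` label are assumptions for illustration; the real label set is not stated in these notes.

```python
import json

ALLOY_PORT = 12345  # container Alloy listen port, per the template note above


def alloy_target_group(container_ip, labels=None):
    """Return one Prometheus file_sd target group for a container's Alloy endpoint."""
    return {
        "targets": [f"{container_ip}:{ALLOY_PORT}"],
        "labels": dict(labels or {}),
    }


if __name__ == "__main__":
    # 10.60.0.101 and the label value are made-up illustration values.
    group = alloy_target_group("10.60.0.101", {"role": "game"})
    print(json.dumps([group], indent=2))
```

The file_sd file is a JSON list of such groups, so a host with no containers is represented by an empty list.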
## Discovery behavior
Dynamic game/dev lifecycle discovery excludes non-lifecycle core dev systems by CIDR. Core/dev infrastructure such as `9091 / zpack-dev-velocity / 10.60.0.220` must not appear as a lifecycle dev container.
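The CIDR exclusion can be sketched with the standard `ipaddress` module. The actual exclusion range is not recorded here, so the `/32` below, the function names, and the record shape (`name`, `ip`) are assumptions chosen only to cover the example host named above.

```python
import ipaddress

# Hypothetical exclusion list: the real CIDR is not stated in this document.
# A /32 for the example core dev host keeps the sketch from guessing a range.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.220/32")]


def is_lifecycle_container(ip):
    """True when the address falls outside every excluded core/dev range."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in EXCLUDED_CIDRS)


def filter_lifecycle(discovered):
    """Keep only discovered records that count as lifecycle game/dev containers."""
    return [c for c in discovered if is_lifecycle_container(c["ip"])]


if __name__ == "__main__":
    discovered = [
        {"name": "zpack-dev-velocity", "ip": "10.60.0.220"},
        {"name": "example-dev-container", "ip": "10.60.0.101"},  # made-up entry
    ]
    for record in filter_lifecycle(discovered):
        print(record["name"])
```

Filtering by address rather than by name means a renamed core system stays excluded as long as it keeps its IP.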
Monitor sync allows empty discovery:
```text
ALLOW_EMPTY="true"
```
This is intentional: deleting the last game/dev container must leave the file_sd target file as an empty list:
```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```
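A minimal sketch of the allow-empty behavior, assuming monitor sync writes the file_sd JSON itself. The function name `write_file_sd`, its `allow_empty` parameter, and the temp-file-then-rename approach are illustrative choices, not the confirmed implementation of monitor sync.

```python
import json
import os
import tempfile


def write_file_sd(path, target_groups, allow_empty=False):
    """Write target groups as file_sd JSON; refuse an empty list unless allowed."""
    if not target_groups and not allow_empty:
        raise ValueError("refusing to write an empty target list (allow_empty is false)")
    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory, then rename over the target,
    # so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(target_groups, handle)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Demonstrate against a scratch directory rather than /etc/zlh-monitor.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "game-dev-alloy.json")
        write_file_sd(target, [], allow_empty=True)
        with open(target) as handle:
            print(handle.read())
```

Without the allow-empty path, a guard against empty discovery would leave the last deleted container in the target file as a stale target.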
## Grafana model
Grafana loads its provisioning from `/etc/zlh-monitor/grafana/provisioning` and its dashboards from `/etc/zlh-monitor/grafana/dashboards`.
Live dashboards:
- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers
The old node_exporter router dashboard is archived at:
```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```
## Logging model
Loki is not currently enabled.
Future direction:
- Loki should be centralized shared infrastructure.
- Do not run Loki per game/dev container.
- Containers should ship selected logs through Alloy when centralized logging is enabled.
## Platform exceptions
OPNsense and PBS remain explicit platform monitoring exceptions.
OPNsense may later need a router-only `node_exporter` job or another router-specific monitoring path. That should stay separate from the game/dev container Alloy contract.