From 9c5984a95c1c1e5e0730d7d1551985bf5cb65789 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 20:54:27 +0000
Subject: [PATCH] Add monitoring decisions

---
 Codex/Monitoring/DECISIONS.md | 89 +++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 Codex/Monitoring/DECISIONS.md

diff --git a/Codex/Monitoring/DECISIONS.md b/Codex/Monitoring/DECISIONS.md
new file mode 100644
index 0000000..64550f6
--- /dev/null
+++ b/Codex/Monitoring/DECISIONS.md
@@ -0,0 +1,89 @@
+# Monitoring — Decisions
+
+Settled monitoring choices / do-not-re-litigate notes.
+
+## Runtime source of truth
+
+`/etc/zlh-monitor` is the operational source of truth for the monitoring host.
+
+- Prometheus runtime config: `/etc/zlh-monitor/prometheus/prometheus.yml`
+- Prometheus service wiring: `/etc/default/prometheus`
+- Grafana provisioning: `/etc/zlh-monitor/grafana/provisioning`
+- Grafana dashboards: `/etc/zlh-monitor/grafana/dashboards`
+
+Do not treat legacy `/etc/prometheus` or default Grafana sample provisioning as authoritative once the zlh-monitor layout is in place.
+
+## Metrics model
+
+Monitoring is not API-only.
+
+- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
+- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
+- Prometheus listens on `10.60.0.25:9090` so Alloy remote-write from hosts/containers can reach it.
+- Grafana reads Prometheus for dashboards.
+
+## Container Alloy model
+
+Alloy is the standard collector for game/dev containers.
+
+- Container Alloy listens on `0.0.0.0:12345`.
+- `game-dev-alloy` scrape health is valid because container Alloy is reachable on the container IP.
+- Container Alloy also remote-writes metrics to Prometheus.
+- `node_exporter` is not used in game/dev containers.
+
+Template note should reflect:
+
+```text
+Alloy Metrics/UI: :12345/metrics
+Alloy Listen Addr: 0.0.0.0:12345
+Node Exporter: not used
+```
+
+## Discovery behavior
+
+Dynamic game/dev lifecycle discovery excludes non-lifecycle core dev systems by CIDR. Core/dev infrastructure such as `9091 / zpack-dev-velocity / 10.60.0.220` must not appear as a lifecycle dev container.
+
+Monitor sync allows empty discovery:
+
+```text
+ALLOW_EMPTY="true"
+```
+
+This is intentional so deleting the last game/dev container clears:
+
+```text
+/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
+```
+
+## Grafana model
+
+Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` and dashboards from `/etc/zlh-monitor/grafana/dashboards`.
+
+Live dashboards:
+
+- Core Services - Overview
+- Core Service - Detail
+- Game Containers
+- Dev Containers
+
+The old node_exporter router dashboard is archived at:
+
+```text
+/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
+```
+
+## Logging model
+
+No Loki is currently enabled.
+
+Future direction:
+
+- Loki should be centralized shared infrastructure.
+- Do not run Loki per game/dev container.
+- Containers should ship selected logs through Alloy when centralized logging is enabled.
+
+## Platform exceptions
+
+OPNsense and PBS remain explicit platform monitoring exceptions.
+
+OPNsense may later need a router-only `node_exporter` job or another router-specific monitoring path. That should stay separate from the game/dev container Alloy contract.
\ No newline at end of file
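The "Discovery behavior" section of the added file combines two rules: core/dev systems are excluded by CIDR, and an empty result is written out as `[]` rather than treated as an error. The sketch below is one way those rules might fit together; it is not the actual monitor sync code. The function name `build_file_sd`, the `EXCLUDED_CIDRS` list (a single `/32` for the one address named in the patch), the container dict keys, and the label names are all assumptions made for illustration. Only the Alloy port, the example host, and the Prometheus file_sd shape (a JSON list of `targets`/`labels` groups) come from the patch or from Prometheus itself.

```python
import ipaddress
import json

# Hypothetical exclusion list: the patch says exclusion is "by CIDR" but does
# not give the ranges, so a /32 for the one named core/dev host stands in here.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.220/32")]

# Container Alloy listen port, as stated in the "Container Alloy model" section.
ALLOY_PORT = 12345


def build_file_sd(containers, allow_empty=True):
    """Return Prometheus file_sd target groups for lifecycle containers only.

    `containers` is an iterable of dicts with the (assumed) keys
    "vmid", "name", and "ip".
    """
    groups = []
    for c in containers:
        addr = ipaddress.ip_address(c["ip"])
        if any(addr in net for net in EXCLUDED_CIDRS):
            continue  # core/dev infrastructure is not a lifecycle container
        groups.append({
            "targets": [f"{c['ip']}:{ALLOY_PORT}"],
            "labels": {"vmid": str(c["vmid"]), "name": c["name"]},
        })
    # Mirrors ALLOW_EMPTY="true": an empty inventory is valid and yields [].
    if not groups and not allow_empty:
        raise ValueError("discovery returned no lifecycle containers")
    return groups


# Only the excluded core/dev host is discovered, so the inventory is empty.
discovered = [{"vmid": 9091, "name": "zpack-dev-velocity", "ip": "10.60.0.220"}]
print(json.dumps(build_file_sd(discovered)))  # prints []
```

With `allow_empty=False` the same input raises instead of emitting `[]`. That is the behavior the patch decides against, because deleting the last game/dev container would then leave stale targets in `game-dev-alloy.json`.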