Add monitoring decisions
parent a78c663f7b
commit 9c5984a95c
89
Codex/Monitoring/DECISIONS.md
Normal file
@@ -0,0 +1,89 @@
# Monitoring — Decisions

Settled monitoring choices / do-not-re-litigate notes.

## Runtime source of truth

`/etc/zlh-monitor` is the operational source of truth for the monitoring host.

- Prometheus runtime config: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Prometheus service wiring: `/etc/default/prometheus`
- Grafana provisioning: `/etc/zlh-monitor/grafana/provisioning`
- Grafana dashboards: `/etc/zlh-monitor/grafana/dashboards`

Do not treat legacy `/etc/prometheus` or default Grafana sample provisioning as authoritative once the zlh-monitor layout is in place.

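As a rough illustration of the layout above, the following sketch reports which of the listed paths are missing and whether the legacy `/etc/prometheus` directory is still present. The helper name `check_layout` and its `root` parameter are hypothetical and exist only so the check can be pointed at a test directory; nothing here is part of the actual zlh-monitor tooling.

```python
from pathlib import Path

# Paths named in this section, written relative to the filesystem root.
AUTHORITATIVE = [
    "etc/zlh-monitor/prometheus/prometheus.yml",
    "etc/default/prometheus",
    "etc/zlh-monitor/grafana/provisioning",
    "etc/zlh-monitor/grafana/dashboards",
]
# Legacy location that must not be treated as authoritative.
LEGACY = ["etc/prometheus"]


def check_layout(root="/"):
    """Return (missing, legacy_present) lists of relative paths under root.

    Hypothetical helper: it only tests for existence and does not read
    or validate any configuration.
    """
    base = Path(root)
    missing = [p for p in AUTHORITATIVE if not (base / p).exists()]
    legacy_present = [p for p in LEGACY if (base / p).exists()]
    return missing, legacy_present


if __name__ == "__main__":
    missing, legacy_present = check_layout()
    for p in missing:
        print(f"missing: /{p}")
    for p in legacy_present:
        print(f"legacy path present (not authoritative): /{p}")
```
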
## Metrics model

Monitoring is not API-only.

- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- Prometheus listens on `10.60.0.25:9090` so Alloy remote-write from hosts/containers can reach it.
- Grafana reads Prometheus for dashboards.

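For remote-write to work, Prometheus has to bind the documented address and accept remote-write requests; upstream Prometheus exposes these as `--web.listen-address` and `--web.enable-remote-write-receiver`. The sketch below checks a flag string for both. It assumes the flags are passed as a single string (for example the value of an `ARGS=` line in `/etc/default/prometheus`, which is an assumption about this host) and that values use the `--flag=value` form. The function names are hypothetical.

```python
import shlex


def prometheus_flags(args_value):
    """Parse a flag string into {flag_name: value}.

    Only the `--flag=value` and bare `--flag` forms are handled.
    """
    flags = {}
    for token in shlex.split(args_value):
        if token.startswith("--"):
            name, _, value = token.partition("=")
            flags[name] = value
    return flags


def check_remote_write_ready(args_value, listen="10.60.0.25:9090"):
    """Return a list of problems; an empty list means both checks passed."""
    flags = prometheus_flags(args_value)
    problems = []
    if flags.get("--web.listen-address") != listen:
        problems.append(f"--web.listen-address does not match {listen}")
    if "--web.enable-remote-write-receiver" not in flags:
        problems.append("--web.enable-remote-write-receiver is not set")
    return problems
```

An empty result only says the flags look right; it does not prove that Alloy can actually reach the listener from a container network.
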
## Container Alloy model

Alloy is the standard collector for game/dev containers.

- Container Alloy listens on `0.0.0.0:12345`.
- `game-dev-alloy` scrape health is valid because container Alloy is reachable on the container IP.
- Container Alloy also remote-writes metrics to Prometheus.
- `node_exporter` is not used in game/dev containers.

Template note should reflect:

```text
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
```

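The reachability claim above can be spot-checked by hand. This is a minimal sketch that builds the `/metrics` URL for a container IP and reports whether it answers with HTTP 200; the function names and the timeout default are hypothetical, and the check is not how Prometheus itself determines scrape health.

```python
import urllib.error
import urllib.request

ALLOY_PORT = 12345  # listen port stated in this section


def alloy_metrics_url(container_ip, port=ALLOY_PORT):
    """Build the metrics URL for a container's Alloy instance."""
    return f"http://{container_ip}:{port}/metrics"


def alloy_is_scrapable(container_ip, port=ALLOY_PORT, timeout=3.0):
    """Return True if GET /metrics answers with HTTP 200, else False."""
    url = alloy_metrics_url(container_ip, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts and non-2xx responses all end up here.
        return False
```
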
## Discovery behavior

Dynamic game/dev lifecycle discovery excludes non-lifecycle core dev systems by CIDR. Core/dev infrastructure such as `9091 / zpack-dev-velocity / 10.60.0.220` must not appear as a lifecycle dev container.

Monitor sync allows empty discovery:

```text
ALLOW_EMPTY="true"
```

This is intentional so deleting the last game/dev container clears:

```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```

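The two rules above (exclude by CIDR, allow an empty result) can be sketched as follows. This is not the actual monitor sync code. The container fields `vmid`, `hostname` and `ip`, the label names, and the `/32` exclusion are assumptions made for illustration; the real excluded CIDR is not stated in this document.

```python
import ipaddress
import json
import os
import tempfile

ALLOY_PORT = 12345
# Assumption: placeholder covering only the one address named in this section.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.220/32")]


def build_targets(containers, excluded=EXCLUDED_CIDRS):
    """Convert discovered containers into Prometheus file_sd target groups."""
    groups = []
    for c in containers:
        ip = ipaddress.ip_address(c["ip"])
        if any(ip in net for net in excluded):
            continue  # core/dev infrastructure is not a lifecycle container
        groups.append({
            "targets": [f"{c['ip']}:{ALLOY_PORT}"],
            "labels": {"vmid": str(c["vmid"]), "hostname": c["hostname"]},
        })
    return groups


def sync_file_sd(containers, path, allow_empty=True):
    """Write the file_sd JSON; return False if an empty result was refused."""
    groups = build_targets(containers)
    if not groups and not allow_empty:
        return False  # leave the previous file untouched
    # Write to a temp file in the same directory, then rename atomically,
    # so Prometheus never reads a half-written file.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    os.replace(tmp_path, path)
    return True
```

With `allow_empty=True`, an empty discovery writes `[]`, which is what lets the last deleted container disappear from the `game-dev-alloy` job instead of lingering as a stale target.
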
## Grafana model

Grafana loads provisioning from `/etc/zlh-monitor/grafana/provisioning` and dashboards from `/etc/zlh-monitor/grafana/dashboards`.

Live dashboards:

- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers

The old node_exporter router dashboard is archived at:

```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```

## Logging model

Loki is not currently enabled.

Future direction:

- Loki should be centralized shared infrastructure.
- Do not run Loki per game/dev container.
- Containers should ship selected logs through Alloy when centralized logging is enabled.

## Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions.

OPNsense may later need a router-only `node_exporter` job or another router-specific monitoring path. That should stay separate from the game/dev container Alloy contract.