# Monitoring — Current State
## Launch readiness
**Status: core lifecycle monitoring is launch-ready.**
The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
Remaining future work:
- centralized logging is not enabled yet, and Loki is not deployed
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
## Runtime source of truth
`/etc/zlh-monitor` is now the monitoring runtime source of truth.
Prometheus now runs from:
```text
/etc/zlh-monitor/prometheus/prometheus.yml
```
configured via:
```text
/etc/default/prometheus
```
Runtime Prometheus jobs are limited to:
```text
prometheus
alloy
game-dev-alloy
```
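A quick way to confirm the job list stays this small is to scan the config for `job_name` entries and compare them with the three names above. The following is a minimal sketch, not part of the existing tooling: it uses a line-based regex rather than a YAML parser, and the function names are hypothetical.

```python
import re

# Job names listed in this document as the only active runtime jobs.
EXPECTED_JOBS = {"prometheus", "alloy", "game-dev-alloy"}

# Line-based scan for `- job_name: <name>`; commented-out lines do not match
# because the pattern only allows whitespace and an optional dash before the key.
JOB_NAME_RE = re.compile(r"^\s*-?\s*job_name:\s*['\"]?([\w.-]+)", re.MULTILINE)


def job_names(config_text: str) -> set:
    """Return every job_name value found in a prometheus.yml text."""
    return set(JOB_NAME_RE.findall(config_text))


def unexpected_jobs(config_text: str) -> set:
    """Jobs present in the config but not in EXPECTED_JOBS."""
    return job_names(config_text) - EXPECTED_JOBS


def missing_jobs(config_text: str) -> set:
    """Jobs in EXPECTED_JOBS that the config does not define."""
    return EXPECTED_JOBS - job_names(config_text)
```

Passing the contents of `/etc/zlh-monitor/prometheus/prometheus.yml` to `unexpected_jobs` should return an empty set if no legacy `node_exporter` or cAdvisor job has crept back in.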
Removed from active Prometheus runtime config:
- old `node_exporter` jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active `:9100` scraping jobs
Local `prometheus-node-exporter` is disabled.
Prometheus listens on:
```text
10.60.0.25:9090
```
This allows Alloy remote-write from hosts and containers to reach Prometheus.
## Grafana state
Grafana provisions from:
```text
/etc/zlh-monitor/grafana/provisioning
```
Dashboard source directory:
```text
/etc/zlh-monitor/grafana/dashboards
```
Live dashboards provisioned:
- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers
The old node_exporter router dashboard was archived:
```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```
Grafana listens on loopback only:
```text
127.0.0.1:3000
```
## Discovery behavior
API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.
`9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container.
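The CIDR exclusion can be sketched as follows. This is an illustration under stated assumptions, not the actual discovery code: the excluded range `10.60.0.0/24` is inferred from the core addresses mentioned in this document, and the record fields (`vmid`, `name`, `ip`) are hypothetical.

```python
import ipaddress

# Assumption: core systems sit in 10.60.0.0/24 (inferred from 10.60.0.220
# and 10.60.0.25); the real exclusion list may differ.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.0/24")]


def is_lifecycle(ip: str) -> bool:
    """True when the address is outside every excluded core CIDR."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in EXCLUDED_CIDRS)


def filter_lifecycle(containers: list) -> list:
    """Keep only containers whose IP is not in an excluded CIDR."""
    return [c for c in containers if is_lifecycle(c["ip"])]
```

With this filter, a record for `10.60.0.220` is dropped while a dev container at `10.100.0.24` is kept.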
The monitor sync allows empty discovery:
```text
ALLOW_EMPTY="true"
```
This ensures that when the last lifecycle container is deleted, the target file is rewritten as an empty list instead of keeping stale entries:
```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```
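The effect of `ALLOW_EMPTY` can be illustrated with a small sketch of a file_sd writer. This is not the monitor sync script itself; the function names and the choice to skip the write when empty discovery is disallowed are assumptions made for illustration. The output format (a JSON list of objects with `targets` and `labels`) is the standard Prometheus file_sd format.

```python
import json
import os
import tempfile


def render_file_sd(targets, labels=None):
    """Build a file_sd payload; no targets yields an empty list."""
    if not targets:
        return []
    return [{"targets": sorted(targets), "labels": labels or {}}]


def write_file_sd(path, targets, allow_empty, labels=None):
    """Write the file_sd JSON atomically. Returns True if the file was written.

    Assumed behavior: when discovery is empty and allow_empty is False,
    the previous file is left untouched, so stale targets would remain.
    """
    if not targets and not allow_empty:
        return False
    payload = render_file_sd(targets, labels)
    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory, then rename over the target
    # so Prometheus never reads a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)
    return True
```

With `allow_empty=True`, an empty discovery result rewrites the file as `[]`, which is what lets Prometheus drop the deleted targets.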
## Alloy template state
The base template was updated so container Alloy listens on:
```text
0.0.0.0:12345
```
instead of only:
```text
127.0.0.1:12345
```
This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job.
The template note was also updated to list Alloy in place of node_exporter:
```text
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
```
## Metrics model
Monitoring does not rely on the API alone. Each layer has a distinct role:
- API discovery, monitor sync, and file_sd together own lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`.
- Grafana reads Prometheus for dashboards.
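Scrape health for the `game-dev-alloy` job can be checked through the Prometheus HTTP query API (`/api/v1/query`) with the `up` metric. The sketch below is a hedged example: the helper names are hypothetical, and the base URL is simply the listen address recorded in this document.

```python
import json
import urllib.parse
import urllib.request

# Listen address recorded in this document.
PROMETHEUS_URL = "http://10.60.0.25:9090"


def down_targets(response: dict) -> list:
    """Return instance labels whose `up` sample is not 1.

    Expects the JSON body of an instant query against /api/v1/query.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # value is a string such as "1"
        if float(value) != 1.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def query_up(job="game-dev-alloy", base_url=PROMETHEUS_URL):
    """Run an instant query for up{job="<job>"} and return the parsed JSON."""
    params = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=5) as resp:
        return json.load(resp)
```

Calling `down_targets(query_up())` from a host that can reach Prometheus returns an empty list when every discovered container Alloy is being scraped successfully.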
## Smoke tests passed
### First manual test
```text
dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus
```
### Template test after updating base image
```text
dev 6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up
```
### Full loop verified
```text
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
```
## Logging state
Centralized logging is not enabled yet, and Loki is not deployed.
Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.
## Platform exceptions
OPNsense and PBS remain explicit platform monitoring exceptions.
OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.