Update monitoring launch posture

jester 2026-05-01 21:05:56 +00:00
parent 5bd6ddd8d7
commit 2b3a552358


@@ -6,10 +6,10 @@ Monitoring implementation lives across infrastructure, API, Agent, Prometheus, G
 ## Files
-- `CURRENT_STATE.md` observed monitoring state and launch-readiness posture
-- `CONTRACT.md` intended monitoring/discovery/label contract
-- `OPEN_ITEMS.md` active monitoring blockers and follow-up work
-- `VALIDATION.md` smoke-test and acceptance checklist
+- `CURRENT_STATE.md` - observed monitoring state and launch-readiness posture
+- `CONTRACT.md` - intended monitoring/discovery/label contract
+- `OPEN_ITEMS.md` - active monitoring blockers and follow-up work
+- `VALIDATION.md` - smoke-test and acceptance checklist
 ## Ownership boundary
@@ -22,4 +22,17 @@ Monitoring implementation lives across infrastructure, API, Agent, Prometheus, G
 ## Current launch posture
-As of the latest monitoring audit, monitoring is **not launch-ready**. Core services are running, but public exposure, stale/failing discovery, missing dashboards, and missing centralized logs block launch-debug readiness.
+Core lifecycle monitoring is launch-ready for game/dev add-remove visibility and basic metrics debugging.
+The operational source of truth is `/etc/zlh-monitor`. Prometheus, Grafana provisioning, dashboard JSON, file_sd discovery output, and monitoring-host Alloy config should be treated from that layout rather than legacy default paths.
+Validated launch behavior includes:
+- API discovery -> monitor sync -> Prometheus file_sd for game/dev lifecycle inventory
+- new game/dev containers appear as `game-dev-alloy` targets
+- deleted game/dev containers disappear and file_sd can intentionally become `[]`
+- container Alloy remote-writes metrics to Prometheus at `10.60.0.25:9090`
+- container Alloy direct scrape health works on `0.0.0.0:12345`
+- Grafana dashboards are provisioned from `/etc/zlh-monitor/grafana/provisioning`
+Remaining future work is tracked in `OPEN_ITEMS.md`, especially centralized logs/Loki and optional OPNsense router-only monitoring. Those are known follow-ups, not blockers to the current core lifecycle monitoring path unless launch policy changes.
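For readers unfamiliar with the file_sd behavior the new posture text describes, here is a minimal Python sketch of how a sync step might render the game/dev inventory into Prometheus file_sd JSON. It is not the project's actual sync code. The function name `render_file_sd`, the inventory keys `ip` and `name`, the `container` label, and the use of `game-dev-alloy` as the `job` label are all assumptions made for illustration. Only the port `12345` and the target name come from the commit text.

```python
import json

# Assumption: the `game-dev-alloy` target name from the commit is used as the job label.
JOB_LABEL = "game-dev-alloy"
# Port of the container Alloy scrape endpoint named in the commit.
ALLOY_PORT = 12345


def render_file_sd(containers):
    """Render a Prometheus file_sd JSON document for a container inventory.

    `containers` is a list of dicts with hypothetical keys `ip` and `name`.
    An empty inventory renders as "[]", i.e. a valid file with no targets.
    """
    groups = [
        {
            "targets": [f"{c['ip']}:{ALLOY_PORT}"],
            "labels": {"job": JOB_LABEL, "container": c["name"]},
        }
        # Sort by name so the output is stable between sync runs.
        for c in sorted(containers, key=lambda c: c["name"])
    ]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address, not a real container.
    print(render_file_sd([{"ip": "192.0.2.10", "name": "dev-example"}]))
    # Deleting the last container leaves an intentionally empty target list.
    print(render_file_sd([]))
```

Rendering an empty inventory as a valid empty list, rather than deleting the file, matches the commit's point that `[]` is an intended state and not a discovery failure.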