# Monitoring Codex
This folder tracks the ZeroLagHub monitoring and observability workstream.

Monitoring implementation is spread across infrastructure, API, Agent, Prometheus, Grafana, Alloy, and future log collection. This folder is the coordination layer for the monitoring contract, launch-readiness status, and validation plan.
## Files
- `CURRENT_STATE.md` - observed monitoring state and launch-readiness posture
- `CONTRACT.md` - intended monitoring/discovery/label contract
- `OPEN_ITEMS.md` - active monitoring blockers and follow-up work
- `VALIDATION.md` - smoke-test and acceptance checklist
## Ownership boundary
`zlh-grind` tracks coordination and decisions only. Implementation belongs in the relevant source/config locations:
- `zpack-api` for monitoring discovery endpoints, health endpoints, app metrics, and auth boundaries
- `zlh-agent` for Alloy config/labels, structured logs, and container-local telemetry behavior
- monitoring host configuration for Prometheus, Grafana, Alloy, dashboards, firewall/bind policy, and file_sd sync
- infrastructure layer for OPNsense/PBS monitoring exceptions
## Current launch posture
Core lifecycle monitoring is launch-ready for game/dev add-remove visibility and basic metrics debugging.

The operational source of truth is `/etc/zlh-monitor`. Prometheus, Grafana provisioning, dashboard JSON, file_sd discovery output, and monitoring-host Alloy config should be read and managed from that layout rather than from legacy default paths.

Validated launch behavior includes:
- API discovery -> monitor sync -> Prometheus file_sd for game/dev lifecycle inventory
- new game/dev containers appear as `game-dev-alloy` targets
- deleted game/dev containers disappear and file_sd can intentionally become `[]`
- container Alloy remote-writes metrics to Prometheus at `10.60.0.25:9090`
- container Alloy direct scrape health works on `0.0.0.0:12345`
- Grafana dashboards are provisioned from `/etc/zlh-monitor/grafana/provisioning`
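
The discovery -> file_sd step in the list above can be illustrated with a minimal sketch. It is not the actual sync implementation: the record shape (`name`, `address`), the `container` label, and the function names are hypothetical, and the real contract is whatever `CONTRACT.md` defines. The sketch only shows two properties the validated behavior relies on: an empty inventory renders as `[]`, and the file is replaced atomically so Prometheus never reads a half-written document.

```python
import json
import os
import tempfile


def render_file_sd(containers):
    """Convert discovery records into Prometheus file_sd target groups.

    `containers` is a hypothetical list of dicts with "name" and "address"
    keys. An empty list yields [], which is a valid file_sd document.
    """
    groups = []
    # Sort by name so repeated syncs produce byte-identical output.
    for record in sorted(containers, key=lambda r: r["name"]):
        groups.append({
            "targets": [record["address"]],
            # "container" is an illustrative label, not the real contract.
            "labels": {"container": record["name"]},
        })
    return groups


def write_file_sd(path, groups):
    """Write target groups to `path` via a temp file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file as 0600; relax it so a separate
        # Prometheus user can read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

For example, `render_file_sd([{"name": "dev-1", "address": "10.0.0.5:12345"}])` produces one target group for a container whose Alloy listens on port `12345` (the IP is made up), while `render_file_sd([])` returns an empty list that `write_file_sd` stores as `[]`.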

Remaining future work is tracked in `OPEN_ITEMS.md`, especially centralized logs/Loki and optional OPNsense router-only monitoring. Those are known follow-ups, not blockers to the current core lifecycle monitoring path unless launch policy changes.