zlh-grind/Codex/Monitoring
2026-05-01 21:05:56 +00:00

Monitoring Codex

This folder tracks the ZeroLagHub monitoring and observability workstream.

Monitoring implementation is spread across infrastructure, the API, the Agent, Prometheus, Grafana, Alloy, and future log collection. This folder is the coordination layer for the monitoring contract, launch-readiness status, and validation plan.

Files

  • CURRENT_STATE.md - observed monitoring state and launch-readiness posture
  • CONTRACT.md - intended monitoring/discovery/label contract
  • DECISIONS.md - recorded monitoring decisions
  • OPEN_ITEMS.md - active monitoring blockers and follow-up work
  • VALIDATION.md - smoke-test and acceptance checklist

Ownership boundary

zlh-grind tracks coordination and decisions only. Implementation belongs in the relevant source/config locations:

  • zpack-api for monitoring discovery endpoints, health endpoints, app metrics, and auth boundaries
  • zlh-agent for Alloy config/labels, structured logs, and container-local telemetry behavior
  • monitoring host configuration for Prometheus, Grafana, Alloy, dashboards, firewall/bind policy, and file_sd sync
  • infrastructure layer for OPNsense/PBS monitoring exceptions

Current launch posture

Core lifecycle monitoring is launch-ready for game/dev add-remove visibility and basic metrics debugging.

The operational source of truth is /etc/zlh-monitor. Prometheus, Grafana provisioning, dashboard JSON, file_sd discovery output, and the monitoring-host Alloy config should all be read from and managed under that layout, not under legacy default paths.

Validated launch behavior includes:

  • API discovery -> monitor sync -> Prometheus file_sd for game/dev lifecycle inventory
  • new game/dev containers appear as game-dev-alloy targets
  • deleted game/dev containers disappear and file_sd can intentionally become []
  • container Alloy remote-writes metrics to Prometheus at 10.60.0.25:9090
  • container Alloy direct scrape health works on 0.0.0.0:12345
  • Grafana dashboards are provisioned from /etc/zlh-monitor/grafana/provisioning
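
The discovery-to-file_sd step in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual monitor sync code: the function names, the record fields (`id`, `ip`), the `container_id` label, and the example IP are hypothetical, and it assumes `game-dev-alloy` is used as the Prometheus job label. Only the file_sd JSON shape (a list of objects with `targets` and `labels`) and the "empty inventory becomes `[]`" behavior come from Prometheus and the validated behavior above.

```python
import json
import os
import tempfile


def render_file_sd(containers, job="game-dev-alloy", port=12345):
    """Build a Prometheus file_sd target list from discovery records.

    `containers` is a hypothetical list of dicts with "id" and "ip" keys,
    standing in for whatever the API discovery endpoint returns.
    """
    groups = []
    # Sort for deterministic output so unchanged inventories produce identical files.
    for c in sorted(containers, key=lambda c: c["ip"]):
        groups.append({
            "targets": [f"{c['ip']}:{port}"],
            "labels": {"job": job, "container_id": str(c["id"])},
        })
    return groups


def write_file_sd(path, groups):
    """Write the target list via a temp file and rename.

    An empty inventory is written as [] rather than being skipped,
    so deleted containers drop out of Prometheus.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    # Rename so Prometheus never reads a half-written file.
    os.replace(tmp, path)
```

For example, `render_file_sd([])` returns an empty list, and writing that result produces a file containing `[]`, which matches the intentional empty state described above.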

Remaining future work is tracked in OPEN_ITEMS.md, especially centralized logs/Loki and optional OPNsense router-only monitoring. Those are known follow-ups, not blockers to the current core lifecycle monitoring path unless launch policy changes.