Launch blocker: monitoring/observability readiness #5
Summary
Current monitoring posture is not launch-ready.
Core services are running, and Prometheus can see the API host via Alloy remote-written host metrics plus one current dev/node target from /sd/exporters, but launch-debug readiness is blocked by public exposure, stale/failing game/dev discovery, missing centralized logs, and dashboards not actually installed in Grafana.

Source of truth docs
- Codex/Monitoring/CURRENT_STATE.md
- Codex/Monitoring/CONTRACT.md
- Codex/Monitoring/OPEN_ITEMS.md
- Codex/Monitoring/VALIDATION.md

Launch blockers
- Restrict monitoring endpoint exposure.
  - *:9090.
  - *:3000.
  - *:9100.
- Fix game/dev discovery sync.
  - zlh-monitor-game-dev-discovery.service is failed.
  - 6050 on 10.60.0.220.
  - 10.100.0.0/16 and refuses to write an empty discovery file.
- Remove stale file_sd targets.
  - 5205 and VMID 6084.
- Install/provision Grafana dashboards.
  - /etc/zlh-monitor/grafana/dashboards.
- Add API health/application scrape.
  - /health returned 404 Cannot GET /health.
- Add launch-debug lifecycle visibility.
  - ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
- Add centralized logs.
- Tighten monitoring token storage.
  - /etc/zlh-monitor/game-dev-discovery.env.

Acceptance criteria
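As a rough sketch of how the endpoint-exposure blocker could be checked on the monitor host, the snippet below scans listener output in the format printed by `ss -ltnH` and reports which of the three monitoring ports are bound to a wildcard address. The function name, the sample text, and the choice of `ss` as the data source are assumptions made for illustration; they are not taken from the issue.

```python
# Ports named in the exposure blocker: Prometheus, Grafana, node_exporter.
MONITORING_PORTS = {9090, 3000, 9100}

# Address spellings that mean "all interfaces" in ss output.
WILDCARD_HOSTS = {"*", "0.0.0.0", "[::]", "::"}


def wildcard_listeners(ss_output, ports=MONITORING_PORTS):
    """Return the sorted subset of `ports` that listen on a wildcard address.

    `ss_output` is expected to look like the output of `ss -ltnH`:
    State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port.
    """
    exposed = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skip blank, header, or non-listening lines
        # Split on the last colon so IPv6 forms like [::]:9100 still work.
        host, _, port = fields[3].rpartition(":")
        if port.isdigit() and int(port) in ports and host in WILDCARD_HOSTS:
            exposed.add(int(port))
    return sorted(exposed)


# Hypothetical sample, not captured from the real host.
SAMPLE = """\
LISTEN 0 4096 *:9090 *:*
LISTEN 0 4096 127.0.0.1:3000 0.0.0.0:*
LISTEN 0 128 [::]:9100 [::]:*
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
"""

if __name__ == "__main__":
    print(wildcard_listeners(SAMPLE))
```

An empty result for the real host would suggest the listeners are no longer wildcard-bound. It would not by itself prove that firewall rules or reverse proxies keep the endpoints private.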
Progress update: container discovery is now clean.
Verified current API discovery output:
Monitor-side safe follow-up was applied:
A one-shot sync succeeded:
Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone.

Docs updated:
- Codex/Monitoring/CURRENT_STATE.md
- Codex/Monitoring/OPEN_ITEMS.md

Remaining priorities:
- alloy-dev-6076 hardcoded/down
- cadvisor 10.60.0.11:8081 down/no route
- zerolaghub-nodes 10.60.0.19:9100 down/connection refused
- *:9090, Grafana *:3000, and node_exporter *:9100.

Progress update:
/etc/zlh-monitor is now the runtime source of truth and the active Prometheus/Grafana state is much cleaner.

Prometheus now runs from:
Active Prometheus jobs are now only:
No active node_exporter, no active cAdvisor, and no active :9100 jobs remain in the runtime Prometheus config. Current targets are clean:

Grafana now provisions from:
Source-managed datasource:
Live provisioned dashboards:
Legacy node_exporter router dashboard archived to:
/etc/zlh-monitor/README.md now describes the new single-source layout. Prometheus config validates with promtool, and prometheus, grafana-server, alloy, and the discovery timer are active.

Docs updated:
- Codex/Monitoring/CURRENT_STATE.md
- Codex/Monitoring/OPEN_ITEMS.md

Remaining launch blockers are now narrowed to:
Progress update: the important game/dev monitoring path is confirmed working, and the monitoring contract is clarified as not API-only.
Current model:
Confirmed test servers:
Confirmed behavior:
- 10.60.0.25:9090.
- node_uname_info

Observed values:
Bind/state after the fix:
Design mismatch recorded:
- 127.0.0.1:12345.
- container-ip:12345.
- game-dev-alloy scrape targets are discovered but down.

Current decision: use remote-write as canonical and do not rely on game-dev-alloy scrape health panels unless direct container Alloy scrape is intentionally enabled later. Direct scrape would require exposing container Alloy HTTP on a reachable address such as 0.0.0.0:12345 plus security review.

Docs updated:
- Codex/Monitoring/CONTRACT.md
- Codex/Monitoring/CURRENT_STATE.md
- Codex/Monitoring/OPEN_ITEMS.md

Remaining next step: delete the test dev/game servers and verify API discovery/file_sd drops them, then confirm dashboard metrics age out as expected.
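The file_sd half of that verification could be scripted roughly as follows. This is a minimal sketch, assuming the discovery file uses the standard Prometheus file_sd JSON shape (a list of groups with `targets` and `labels`) and that each group carries a `vmid` label. The label name, the function name, and the sample VMIDs and addresses are hypothetical and should be adjusted to whatever the discovery sync actually writes.

```python
import json


def lingering_targets(file_sd_text, deleted_vmids, label="vmid"):
    """Return targets in a file_sd document still labelled with a deleted VMID.

    `file_sd_text` is the JSON content of a Prometheus file_sd file.
    `label` is an assumed label name; change it to match the real file.
    """
    groups = json.loads(file_sd_text)
    deleted = {str(v) for v in deleted_vmids}
    stale = []
    for group in groups:
        labels = group.get("labels", {})
        if labels.get(label) in deleted:
            stale.extend(group.get("targets", []))
    return stale


# Hypothetical discovery file content; VMIDs and addresses are placeholders.
SAMPLE = json.dumps([
    {"targets": ["192.0.2.10:12345"], "labels": {"vmid": "9001"}},
    {"targets": ["192.0.2.11:12345"], "labels": {"vmid": "9002"}},
])

if __name__ == "__main__":
    # After deleting server 9001, its target should no longer be listed.
    print(lingering_targets(SAMPLE, [9001]))
```

An empty list after the next sync would indicate that file_sd dropped the deleted servers. Whether dashboard metrics age out still has to be confirmed in Grafana, since remote-written series are not governed by the file_sd target list.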