diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md index 07afde1..05872df 100644 --- a/Codex/Monitoring/CURRENT_STATE.md +++ b/Codex/Monitoring/CURRENT_STATE.md @@ -4,9 +4,9 @@ **Status: PARTIAL FAIL for launch monitoring readiness.** -Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path. +Core monitoring services are running. Game/dev dynamic discovery works, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, Grafana dashboard provisioning is live from `/etc/zlh-monitor`, and the important game/dev metrics path now works through container Alloy remote-write. -Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test. +Launch-debug readiness is still blocked by final exposure verification/lockdown, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to finish the delete half of the game/dev lifecycle smoke test. ## Runtime source of truth @@ -26,12 +26,28 @@ Grafana now provisions from: `/etc/zlh-monitor/README.md` was updated to describe the new single-source layout. +## Canonical metrics model + +Monitoring is not API-only. + +The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. 
The canonical metrics path for game/dev containers is now: + +```text +container Alloy remote_write + -> Prometheus on 10.60.0.25:9090 + -> Grafana dashboards +``` + +The `game-dev-alloy` file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on `127.0.0.1:12345` and Prometheus file_sd scrapes `container-ip:12345`. + +If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as `0.0.0.0:12345`, with security review. + ## Recently verified progress Container discovery is clean: -- `/sd/exporters -> []` -- `/monitoring/game-dev -> containers: []` +- `/sd/exporters -> []` during empty state +- `/monitoring/game-dev -> containers: []` during empty state - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env` - one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` - Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone @@ -47,10 +63,13 @@ Prometheus runtime config is clean: - no active cAdvisor job - no active `:9100` jobs -Current Prometheus targets are clean: +Runtime bind/target state after the remote-write fix: -- `up alloy 127.0.0.1:12345` -- `up prometheus localhost:9090` +- Prometheus listens on `10.60.0.25:9090` +- Prometheus self-scrape target is `10.60.0.25:9090` +- Grafana remains loopback-only on `127.0.0.1:3000` +- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345` +- local node_exporter is disabled Grafana provisioning is live: @@ -71,23 +90,55 @@ Services active after cleanup: - `alloy` - discovery timer +## Game/dev metrics smoke evidence + +The important monitoring path works for current test containers. 
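+
+This remote-write path can be sketched as a container-side Grafana Alloy configuration fragment. The component wiring below follows upstream Alloy conventions; the exact collectors, labels, and auth used in the real containers are assumptions:
+
+```text
+// Collect host-style metrics inside the container.
+prometheus.exporter.unix "host" { }
+
+// Scrape the local exporter and forward samples via remote_write
+// instead of exposing a scrape endpoint beyond 127.0.0.1:12345.
+prometheus.scrape "host" {
+  targets    = prometheus.exporter.unix.host.targets
+  forward_to = [prometheus.remote_write.zlh.receiver]
+}
+
+// Push to the monitoring host. Prometheus must run with
+// --web.enable-remote-write-receiver to accept these writes.
+prometheus.remote_write "zlh" {
+  endpoint {
+    url = "http://10.60.0.25:9090/api/v1/write"
+  }
+}
+```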
+ +API discovery emits both test servers: + +- `6085 / dev-6085 / 10.100.0.23 / dev` +- `5206 / mc-vanilla-5206 / 10.200.0.79 / game` + +file_sd sync writes both targets, and Prometheus discovers both. + +Alloy remote-write from both containers works after Prometheus was rebound to `10.60.0.25:9090`. + +Dashboard-style metrics exist for both containers: + +- `node_uname_info` +- CPU metrics +- memory metrics + +Observed values: + +```text +6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58% +5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69% +``` + +Known design mismatch: + +- container Alloy listens on `127.0.0.1:12345` +- Prometheus file_sd scrapes `container-ip:12345` +- therefore `game-dev-alloy` scrape targets are discovered but down + +Current decision: use remote-write as canonical and do not rely on `game-dev-alloy` scrape health panels unless/until direct container Alloy scrape is intentionally enabled. + ## Exposure state -Observed listening ports from prior audit: +Current reported bind posture: -- `*:9090` Prometheus -- `*:3000` Grafana -- `*:9100` node_exporter -- `127.0.0.1:12345` Alloy +- Prometheus listens on `10.60.0.25:9090` +- Grafana remains loopback-only on `127.0.0.1:3000` +- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345` +- local node_exporter is disabled -Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown. +Remaining exposure work: -Host firewall posture from prior audit: - -- UFW inactive -- nftables input/forward/output policy accept - -This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths. 
+- verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet +- verify Grafana is not reachable except via intended admin path +- verify node_exporter is no longer listening or is firewalled if any listener remains +- verify host firewall/nftables policy is appropriate ## Prometheus state @@ -97,9 +148,8 @@ This remains a launch blocker until Prometheus, Grafana, and any remaining node_ - `prometheus` - `alloy` - `game-dev-alloy` -- Current targets: - - `up alloy 127.0.0.1:12345` - - `up prometheus localhost:9090` +- Prometheus listens on `10.60.0.25:9090` +- Prometheus self-scrape target is `10.60.0.25:9090` - `/sd/exporters` is configured with bearer auth and works with token - unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized` @@ -115,12 +165,18 @@ Resolved: - cAdvisor `10.60.0.11:8081` is gone from active runtime target list - `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list - empty discovery writes a valid empty file_sd file instead of failing +- active test containers appear in API discovery and file_sd -No stale/static down targets are currently present in active Prometheus targets. +Expected remaining validation: + +- delete test dev/game servers +- confirm API discovery drops them +- confirm file_sd drops them +- confirm dashboard metrics age out as expected ## API monitoring state -API host/system metrics were previously available via `integrations/unix`, but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery. +API discovery is working for game/dev monitoring inventory. Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found. @@ -128,14 +184,10 @@ No centralized API logs were found on the monitoring host. ## Agent/container monitoring state -Current dynamic game/dev discovery is empty and clean. 
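+
+For inventory context, the synced `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` for the current test containers plausibly has the following shape (any labels beyond `job` are assumptions):
+
+```text
+[
+  {
+    "targets": ["10.100.0.23:12345", "10.200.0.79:12345"],
+    "labels": { "job": "game-dev-alloy" }
+  }
+]
+```
+
+While container Alloy binds to loopback, these targets show as discovered but down even though remote-write metrics flow, per the design mismatch noted earlier.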
+Current test containers prove remote-write metrics are working: -Expected next validation is lifecycle smoke testing: - -- create a game and/or dev container -- confirm target appears -- delete it -- confirm target disappears now that empty discovery works +- `6085 dev-6085 dev-container` +- `5206 mc-vanilla-5206 game-container` Labels previously observed include: @@ -177,6 +229,8 @@ Archived: - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json` +Dashboard-style CPU and memory metrics exist for both current test containers. + ## Logs state No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit. @@ -185,14 +239,14 @@ Only journald/local logs for monitoring components were available on the monitor ## Security state -Working controls from prior audit: +Working controls: -- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401` +- Grafana is loopback-only on `127.0.0.1:3000` per current report - discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401` -- Prometheus labels/scrape URLs checked did not expose the bearer token +- Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit -Launch-blocking risk: +Remaining launch security checks: -- Prometheus and Grafana previously bound publicly on the host network -- node_exporter previously listened on `*:9100`; it is no longer scraped by active Prometheus config, but listener exposure still needs verification/removal or firewalling -- host firewall policy did not restrict access in prior audit +- Prometheus on `10.60.0.25:9090` must be restricted to intended container remote-write and admin/monitoring paths +- any remaining node_exporter listener must be disabled or firewalled +- host firewall policy must be verified \ No newline at end of file