Update monitoring current state after Prometheus Grafana cleanup

jester 2026-05-01 19:19:29 +00:00
parent c57c5ba3df
commit 1a837f3770


@@ -4,9 +4,27 @@
**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
## Runtime source of truth
`/etc/zlh-monitor` is now the runtime source of truth.
Prometheus now runs from:
- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`
Grafana now provisions from:
- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`
`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
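
A quick way to confirm the relocation took effect is to check that the runtime paths exist and that the systemd environment files reference them. The sketch below only greps for the directory strings, because the exact variable names inside `/etc/default/prometheus` and `/etc/default/grafana-server` are not recorded here.

```python
#!/usr/bin/env python3
"""Sketch: confirm the /etc/zlh-monitor runtime layout is wired into the systemd defaults."""
from pathlib import Path

CHECKS = [
    # (runtime path that should exist, environment file that should reference it)
    ("/etc/zlh-monitor/prometheus/prometheus.yml", "/etc/default/prometheus"),
    ("/etc/zlh-monitor/grafana/provisioning", "/etc/default/grafana-server"),
]

failures = 0
for runtime_path, env_file in CHECKS:
    if not Path(runtime_path).exists():
        print(f"MISSING         {runtime_path}")
        failures += 1
        continue
    env_text = Path(env_file).read_text() if Path(env_file).exists() else ""
    referenced = "/etc/zlh-monitor" in env_text
    status = "OK" if referenced else "NOT REFERENCED"
    print(f"{status:15s} {runtime_path} (via {env_file})")
    failures += 0 if referenced else 1

raise SystemExit(1 if failures else 0)
```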
## Recently verified progress

@@ -18,16 +36,40 @@ Container discovery is clean:

- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` (see the sketch below)
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
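
A minimal sketch of the empty-state write plus a validity check: Prometheus file_sd expects a JSON list of `{"targets": [...], "labels": {...}}` groups, and an empty list is valid. The atomic-write detail is an assumption about how the sync script should behave, not something this audit confirms.

```python
#!/usr/bin/env python3
"""Sketch: write and validate an empty Prometheus file_sd target file."""
import json
import os
import tempfile

SD_FILE = "/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json"

def write_targets(targets, path=SD_FILE):
    """Write the target list atomically so Prometheus never reads a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(targets, fh, indent=2)
        fh.write("\n")
    os.replace(tmp, path)  # atomic rename on the same filesystem

def count_target_groups(path=SD_FILE):
    """Return the number of target groups; raise if the file is not valid file_sd JSON."""
    with open(path) as fh:
        data = json.load(fh)
    assert isinstance(data, list), "file_sd content must be a JSON list"
    for group in data:
        assert isinstance(group.get("targets"), list), "each group needs a targets list"
    return len(data)

if __name__ == "__main__":
    write_targets([])                                 # empty state: zero discovered containers
    print("target groups:", count_target_groups())   # -> 0, yet the file stays valid
```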
Prometheus runtime config is clean:
- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs
Current Prometheus targets are clean (re-checked by the sketch below):
- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`
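
A small check against the Prometheus HTTP API can confirm that the active target set matches exactly these two entries. It assumes Prometheus is reachable on `localhost:9090`, as listed above.

```python
#!/usr/bin/env python3
"""Sketch: assert the active Prometheus targets match the expected set."""
import json
import urllib.request

PROM = "http://localhost:9090"
EXPECTED = {("prometheus", "localhost:9090"), ("alloy", "127.0.0.1:12345")}

with urllib.request.urlopen(f"{PROM}/api/v1/targets") as resp:
    active = json.load(resp)["data"]["activeTargets"]

seen = {(t["labels"]["job"], t["labels"]["instance"]) for t in active}
unhealthy = [t["labels"]["instance"] for t in active if t["health"] != "up"]

if seen == EXPECTED and not unhealthy:
    print("OK: only the expected targets are present and all are up")
else:
    print("MISMATCH:", "unexpected =", sorted(seen - EXPECTED), "unhealthy =", unhealthy)
```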
Grafana provisioning is live (re-checked by the sketch below):
- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
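
The same facts can be confirmed through the Grafana HTTP API. The `localhost:3000` address and the `GRAFANA_TOKEN` service-account token are assumptions; adjust them to the host's actual Grafana address and credentials.

```python
#!/usr/bin/env python3
"""Sketch: confirm the provisioned datasource and dashboards via the Grafana API."""
import json
import os
import urllib.request

GRAFANA = "http://localhost:3000"            # assumed Grafana address
TOKEN = os.environ["GRAFANA_TOKEN"]          # assumed service-account token
EXPECTED = {"Core Service - Detail", "Core Services - Overview",
            "Dev Containers", "Game Containers"}

def get(path):
    req = urllib.request.Request(GRAFANA + path,
                                 headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

datasources = [(d["uid"], d["url"], d.get("isDefault")) for d in get("/api/datasources")]
dashboards = {d["title"] for d in get("/api/search?type=dash-db")}

print("datasources:", datasources)
missing = EXPECTED - dashboards
print("dashboards OK" if not missing else f"missing dashboards: {sorted(missing)}")
```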
Services active after cleanup:
- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer
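
The post-cleanup service set can be re-verified after the next scheduled discovery run with a loop over `systemctl is-active`. The timer unit name below is an assumption, since the list above only calls it the discovery timer.

```python
#!/usr/bin/env python3
"""Sketch: re-check the post-cleanup service set with systemctl."""
import subprocess

UNITS = [
    "prometheus.service",
    "grafana-server.service",
    "alloy.service",
    "zlh-monitor-game-dev-discovery.timer",  # assumed unit name for the discovery timer
]

for unit in UNITS:
    state = subprocess.run(["systemctl", "is-active", unit],
                           capture_output=True, text=True).stdout.strip()
    print(f"{state or 'unknown':10s} {unit}")
```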
## Exposure state

@@ -38,19 +80,26 @@ Observed listening ports from prior audit:

- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy
Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown.
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept

This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
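
A first-pass check of current bind addresses can run on the host itself by parsing `ss -ltnH` output. The Prometheus/Grafana ports assume the defaults (`9090`/`3000`), and a loopback-only result here still does not replace an external reachability test once the firewall is tightened.

```python
#!/usr/bin/env python3
"""Sketch: flag monitoring listeners bound to non-loopback addresses."""
import subprocess

WATCHED = {"9090": "prometheus", "3000": "grafana", "9100": "node_exporter", "12345": "alloy"}
LOOPBACK = ("127.", "[::1]")

listing = subprocess.run(["ss", "-ltnH"], capture_output=True, text=True, check=True).stdout

for line in listing.splitlines():
    fields = line.split()
    if len(fields) < 4:
        continue
    local = fields[3]                       # e.g. 0.0.0.0:9090, [::]:3000, 127.0.0.1:12345
    addr, _, port = local.rpartition(":")
    if port not in WATCHED:
        continue
    flag = "loopback" if addr.startswith(LOOPBACK) else "PUBLIC BIND"
    print(f"{flag:12s} {WATCHED[port]:14s} {local}")
```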
## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Current targets:
  - `up alloy 127.0.0.1:12345`
  - `up prometheus localhost:9090`
- `/sd/exporters` is configured with bearer auth and works with token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized` (see the sketch below)
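
A recurring re-check of the bearer-auth behaviour could look like the sketch below. The base URL and the `SD_TOKEN` environment variable are placeholders, since this audit names only the paths and the expected 401-without-token behaviour.

```python
#!/usr/bin/env python3
"""Sketch: verify the discovery endpoints reject anonymous requests and accept the token."""
import os
import urllib.error
import urllib.request

BASE_URL = os.environ.get("SD_BASE_URL", "http://127.0.0.1:8080")  # placeholder host
TOKEN = os.environ.get("SD_TOKEN", "")                             # placeholder token source
PATHS = ["/sd/exporters", "/monitoring/game-dev"]

def status(path, token=None):
    req = urllib.request.Request(BASE_URL + path)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

for path in PATHS:
    anon, authed = status(path), status(path, TOKEN)
    ok = anon == 401 and authed == 200
    print(f"{'OK' if ok else 'CHECK'}: {path} anonymous={anon} with-token={authed}")
```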
@@ -62,21 +111,16 @@ Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing

No stale/static down targets are currently present in active Prometheus targets.
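
A quick guard against stale or static down targets creeping back in is to query Prometheus for `up == 0`; an empty result matches the statement above. It assumes the `localhost:9090` address used elsewhere in this document.

```python
#!/usr/bin/env python3
"""Sketch: list any scrape targets currently reporting up == 0."""
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"
params = urllib.parse.urlencode({"query": "up == 0"})

with urllib.request.urlopen(f"{PROM}/api/v1/query?{params}") as resp:
    down = json.load(resp)["data"]["result"]

if not down:
    print("OK: no down targets")
else:
    for series in down:
        labels = series["metric"]
        print(f"DOWN: job={labels.get('job')} instance={labels.get('instance')}")
```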
## API monitoring state

API host/system metrics were previously available via `integrations/unix`, but the active runtime Prometheus jobs are now intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.

Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
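
Before an API health scrape or alert can be added, a working health path has to exist on the API app. A throwaway probe like the following can narrow that down; the host/port and the candidate paths are assumptions, since the audit only records the `404` on `/health`.

```python
#!/usr/bin/env python3
"""Sketch: probe candidate API health paths before wiring a scrape or alert."""
import urllib.error
import urllib.request

API_BASE = "http://zlh-api:3000"    # placeholder address for the API app
CANDIDATES = ["/health", "/healthz", "/api/health", "/status"]

for path in CANDIDATES:
    try:
        with urllib.request.urlopen(API_BASE + path, timeout=3) as resp:
            print(f"{resp.status}  {path}")
    except urllib.error.HTTPError as err:
        print(f"{err.code}  {path}")
    except OSError as err:
        print(f"ERR  {path}  ({err})")
```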
@@ -113,19 +157,25 @@ Missing or incomplete for lifecycle debugging:

## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.

Datasource:
- `uid=prometheus`
- `name=Prometheus`
- `url=http://localhost:9090`
- default datasource

Live provisioned dashboards:
- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`
Archived:
- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
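
A quick sanity pass over the dashboard JSON that provisioning loads can catch broken files before they silently fail to appear in Grafana. The dashboards directory below is an assumption; the authoritative location is whatever the dashboard provider's `path` option points at under `/etc/zlh-monitor/grafana/provisioning`.

```python
#!/usr/bin/env python3
"""Sketch: sanity-check the dashboard JSON files that Grafana provisioning loads."""
import json
from pathlib import Path

DASHBOARD_DIR = Path("/etc/zlh-monitor/grafana/dashboards")  # assumed provider path

for dashboard_file in sorted(DASHBOARD_DIR.glob("*.json")):
    try:
        data = json.loads(dashboard_file.read_text())
    except json.JSONDecodeError as err:
        print(f"INVALID JSON  {dashboard_file.name}: {err}")
        continue
    title = data.get("title", "<no title>")
    panels = len(data.get("panels", []))
    print(f"OK  {dashboard_file.name}: title={title!r} panels={panels}")
```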
## Logs state

@@ -143,5 +193,6 @@ Working controls from prior audit:

Launch-blocking risk:
- Prometheus and Grafana previously bound publicly on the host network
- node_exporter previously listened on `*:9100`; it is no longer scraped by active Prometheus config, but listener exposure still needs verification/removal or firewalling
- host firewall policy did not restrict access in prior audit