Update monitoring current state after Prometheus Grafana cleanup
This commit is contained in: parent `c57c5ba3df`, commit `1a837f3770`
@@ -4,9 +4,27 @@
**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, a missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
## Runtime source of truth

`/etc/zlh-monitor` is now the runtime source of truth.

Prometheus now runs from:

- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`

Grafana now provisions from:

- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`

`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
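The `/etc/default` wiring can be sketched as below. The exact variable names are packaging-dependent assumptions (Debian's prometheus package reads `ARGS`; Grafana honors `GF_PATHS_*` environment overrides), so verify against the backup files listed above rather than treating this as a copy of the host files.

```shell
# Hedged sketch of the /etc/default wiring; variable names are
# packaging-dependent assumptions, not copied from the host.

# /etc/default/prometheus — point the daemon at the runtime config:
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml"

# /etc/default/grafana-server — Grafana reads GF_PATHS_* overrides:
GF_PATHS_PROVISIONING="/etc/zlh-monitor/grafana/provisioning"
```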
## Recently verified progress

@@ -18,16 +36,40 @@ Container discovery is clean:

- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
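The healthy empty state is just a valid empty JSON array in the file_sd file. A minimal reproduction (using a temp path rather than the real `/etc/zlh-monitor` one):

```shell
# What a clean empty sync leaves behind: a valid, empty file_sd target list.
# Temp path used here so the check is reproducible; the real file is
# /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json.
printf '[]\n' > /tmp/game-dev-alloy.json

# Prometheus treats an empty array as "zero targets", not an error,
# which is why ALLOW_EMPTY syncs stop breaking discovery.
python3 -c 'import json; assert json.load(open("/tmp/game-dev-alloy.json")) == []' \
  && echo "valid empty file_sd"
```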
Prometheus runtime config is clean:

- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs

## Running services observed from prior audit

- `prometheus.service` — active/running
- `grafana-server.service` — active/running
- `alloy.service` — active/running
- `prometheus-node-exporter.service` — active/running

`zlh-monitor-game-dev-discovery.service` had previously failed, but the latest manual sync succeeded after enabling `ALLOW_EMPTY="true"`. Verify timer/service health after the next scheduled run.

Current Prometheus targets are clean:

- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`
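Put together, the active jobs, targets, and intervals reported in this document imply a runtime config shaped roughly like the following. This is a reconstruction for orientation, not a copy of the live file; anything beyond the stated jobs, paths, and intervals is an assumption.

```yaml
# Sketch of /etc/zlh-monitor/prometheus/prometheus.yml implied by this doc.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
        refresh_interval: 30s
```

After any edit, re-run `promtool check config` against the runtime path, as the verification steps in this document do.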
Grafana provisioning is live:

- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
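The live provisioning state corresponds to two small files under `/etc/zlh-monitor/grafana/provisioning`. The file names and the dashboards path below are assumptions for illustration; the field values are taken from this document.

```yaml
# Hypothetical provisioning/datasources/prometheus.yml — values from this doc:
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    url: http://localhost:9090
    isDefault: true

# Hypothetical provisioning/dashboards/zlh.yml (shown commented so this block
# stays one valid YAML document) — loads dashboard JSON from disk; the path
# is an assumption:
# apiVersion: 1
# providers:
#   - name: zlh-dashboards
#     type: file
#     options:
#       path: /etc/zlh-monitor/grafana/dashboards
```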
Services active after cleanup:

- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer
## Exposure state

@@ -38,19 +80,26 @@ Observed listening ports from prior audit:

- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy

Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown.
Host firewall posture from prior audit:

- UFW inactive
- nftables input/forward/output policy accept

This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
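One way to close this blocker is a default-drop input chain that admits the monitoring ports only from a trusted admin network. The subnet and the SSH allowance below are assumptions, so treat this as a sketch to adapt, not a drop-in ruleset.

```
# Hedged nftables sketch; 10.0.0.0/24 as the trusted admin subnet is an
# assumption. Review against real management paths before applying.
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    ct state established,related accept
    iif "lo" accept

    # SSH for administration (adjust to your actual access path)
    tcp dport 22 accept

    # Prometheus (9090), Grafana (3000), node_exporter (9100)
    # reachable only from the trusted monitoring/admin subnet
    ip saddr 10.0.0.0/24 tcp dport { 9090, 3000, 9100 } accept
  }
}
```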
## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Global scrape interval: `15s`
- Prometheus self-scrape: `5s`
- Dynamic discovery refresh: `30s`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Current targets:
  - `up alloy 127.0.0.1:12345`
  - `up prometheus localhost:9090`
- `/sd/exporters` is configured with bearer auth and works with a token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 Unauthorized`
@@ -62,21 +111,16 @@ Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- `alloy-dev-6076` hardcoded/down target is gone from the active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from the active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from the active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing

No stale/static down targets are currently present in active Prometheus targets.
## API monitoring state

API host/system metrics were previously available via `integrations/unix` (`vmid=9020`, `role=API`, `instance=zlh-api`), but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.

The API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
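Once the API exposes a scrapeable endpoint, the missing health scrape is a small addition to the runtime config. The target host/port and metrics path below are hypothetical; only the `vmid` and `role` label values come from this document.

```yaml
# Hypothetical job for the API scrape; target and path are assumptions —
# the API currently 404s on /health and no metrics endpoint was found.
scrape_configs:
  - job_name: zlh-api
    metrics_path: /metrics
    static_configs:
      - targets: ["zlh-api:3000"]
        labels:
          vmid: "9020"
          role: API
```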
@@ -113,19 +157,25 @@ Missing or incomplete for lifecycle debugging:

## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.

Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:

- `core-routers.json`
- `core-service-detail.json`
- `core-services-overview.json`
- `dev-containers.json`
- `game-containers.json`

Datasource:

- `prometheus -> http://localhost:9090` (`uid=prometheus`, `name=Prometheus`, default)

Live provisioned dashboards:

- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`

Archived:

- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
## Logs state

@@ -143,5 +193,6 @@ Working controls from prior audit:

Launch-blocking risk:

- Prometheus and Grafana previously bound publicly on the host network
- node_exporter previously listened on `*:9100`; it is no longer scraped by the active Prometheus config, but listener exposure still needs verification/removal or firewalling
- host firewall policy did not restrict access in the prior audit
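For the missing centralized logs, one option consistent with the existing Alloy deployment is shipping the systemd journal to a Loki endpoint. Nothing in this document confirms Loki as the chosen backend, and the endpoint URL is hypothetical; this is a sketch of the shape such a config would take.

```
// Hedged Alloy (River) sketch for centralized logs; the Loki endpoint is
// hypothetical — this document does not confirm a log backend choice.
loki.source.journal "host" {
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    url = "http://loki.internal:3100/loki/api/v1/push"
  }
}
```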