Update monitoring current state after clean discovery

jester 2026-05-01 18:58:24 +00:00
parent 1169a89b71
commit c57c5ba3df


## Launch readiness
**Status: PARTIAL FAIL for launch monitoring readiness.**
Core monitoring services are running. The game/dev dynamic-discovery empty state is now clean, and the stale `game-dev-alloy` VMIDs `5205` and `6084` have disappeared from Prometheus after a successful empty sync.
Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, a missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.
## Recently verified progress
Container discovery is clean:
- `/sd/exporters -> []`
- `/monitoring/game-dev -> containers: []`
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
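For future re-checks, a minimal verification sketch (assuming Prometheus listens on `localhost:9090`, the job label is `game-dev-alloy`, and `jq` is available):

```bash
# Confirm the file_sd file is still a valid empty target list (path from this doc)
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json   # expected: [] (an empty JSON array is valid file_sd content)

# Confirm no game-dev-alloy targets remain active in Prometheus
curl -s http://localhost:9090/api/v1/targets \
  | jq '[.data.activeTargets[] | select(.labels.job == "game-dev-alloy")] | length'   # expected: 0
```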
## Running services observed from prior audit
- `prometheus.service` — active/running
- `grafana-server.service` — active/running
- `alloy.service` — active/running
- `prometheus-node-exporter.service` — active/running
`zlh-monitor-game-dev-discovery.service` previously showed as failed, but the latest manual sync succeeded after `ALLOW_EMPTY="true"` was enabled. Verify timer/service health after the next scheduled run.
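A possible follow-up check after the next scheduled run (unit name from this doc; the timer name is assumed to match):

```bash
systemctl status zlh-monitor-game-dev-discovery.service --no-pager
systemctl list-timers --all | grep -i 'game-dev-discovery' || echo "no matching timer found"
journalctl -u zlh-monitor-game-dev-discovery.service -n 20 --no-pager
```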
## Exposure state
Observed listening ports from prior audit:
- `*:9090` Prometheus
- `*:3000` Grafana
- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept
This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
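One possible remediation shape, sketched with UFW since it is installed but inactive; the CIDRs are placeholders rather than audited values, and the remote-write rule only applies if the API host's Alloy pushes to Prometheus over the network:

```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow from <ADMIN_CIDR> to any port 22 proto tcp      # keep SSH reachable before enabling
ufw allow from <ADMIN_CIDR> to any port 3000 proto tcp    # Grafana, operators only
ufw allow from <ADMIN_CIDR> to any port 9090 proto tcp    # Prometheus UI/API, operators only
ufw allow from <API_HOST_IP> to any port 9090 proto tcp   # Alloy remote write from the API host
# node_exporter (*:9100): if it is only scraped by the local Prometheus, no inbound rule is needed
ufw enable
```

An nftables ruleset with a default-drop input policy would be an equivalent alternative.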
## Prometheus state
- `/sd/exporters` is configured with bearer auth and works with token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
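One way to tighten storage is Prometheus' `credentials_file` support, sketched below; the job name, discovery URL host, and secret path are illustrative, not audited values:

```yaml
scrape_configs:
  - job_name: zlh_node
    http_sd_configs:
      - url: https://<api-host>/sd/exporters        # discovery endpoint named in this doc
        authorization:
          type: Bearer
          credentials_file: /etc/zlh-monitor/secrets/sd-token   # e.g. root:prometheus, mode 0640
```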
## Targets/discovery state
Resolved:
- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- empty discovery writes a valid empty file_sd file instead of failing
Remaining problem targets to remove or fix (a triage sketch follows the list):
- `alloy-dev-6076`, `10.100.0.14:12345`, hardcoded/down
- `cadvisor`, `10.60.0.11:8081`, down/no route
- `zerolaghub-nodes`, `10.60.0.19:9100`, down/connection refused
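A possible triage sketch for those remaining entries (the Prometheus config directory is assumed to sit alongside the `file_sd` path named above):

```bash
# Locate where the hardcoded/down targets are defined
grep -rn '10.100.0.14\|10.60.0.11\|10.60.0.19' /etc/zlh-monitor/prometheus/

# List every currently-down target with its job and scrape URL
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health == "down") | "\(.labels.job) \(.scrapeUrl)"'

# After removing or fixing them, validate the config and reload
promtool check config /etc/zlh-monitor/prometheus/prometheus.yml   # path assumed
curl -X POST http://localhost:9090/-/reload   # only if --web.enable-lifecycle is set; otherwise restart the service
```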
## API monitoring state
API host metrics exist via `integrations/unix`:
- `role=API`
- `instance=zlh-api`
Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
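Once the API exposes a working health endpoint, one possible scrape shape is a `blackbox_exporter` probe; blackbox_exporter was not part of this audit, and the host/port below are placeholders:

```yaml
scrape_configs:
  - job_name: zlh_api_health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['http://<api-host>:<api-port>/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115        # blackbox_exporter's default listen address
```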
No centralized API logs were found on the monitoring host.
## Agent/container monitoring state
Current dynamic game/dev discovery is empty and clean.
Expected next validation is lifecycle smoke testing (a check sketch follows the list):
- create a game and/or dev container
- confirm its target appears
- delete it
- confirm the target disappears now that empty discovery works
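A minimal check sketch for those steps (the job and `vmid` label names are taken from this doc; the VMID value is made up):

```bash
check_target() {
  # count active game-dev-alloy targets carrying the given vmid label
  local vmid="$1"
  curl -s http://localhost:9090/api/v1/targets \
    | jq --arg vmid "$vmid" \
         '[.data.activeTargets[] | select(.labels.job == "game-dev-alloy" and .labels.vmid == $vmid)] | length'
}

check_target 6101   # expect 1 shortly after creating the test container
# ...delete the container and wait for the next discovery sync...
check_target 6101   # expect 0 once the sync rewrites the file_sd file
```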
Labels previously observed include:
- `vmid`
- `ctype` or `container_type`
## Dashboards state
Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
- `dev-containers.json`
- `game-containers.json`
However, Grafana DB previously showed no dashboards installed. Datasource exists:
- `prometheus -> http://localhost:9090`
Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
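A sketch of a Grafana file provisioner that would load those JSON files (the provisioning path and provider name are assumptions; the dashboards directory is the one named above):

```yaml
# e.g. /etc/grafana/provisioning/dashboards/zlh.yaml
apiVersion: 1
providers:
  - name: zlh-dashboards
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/zlh-monitor/grafana/dashboards
```

New provider files are only read at Grafana startup, so a `grafana-server` restart would be needed after adding one.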
## Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
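One option named above is Promtail shipping journald into a Loki instance; neither is deployed yet, so the push URL is a placeholder and this is only a shape sketch:

```yaml
server:
  disable: true
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://<loki-host>:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: journald
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
```

The already-running Alloy agents could serve the same role with a `loki.source.journal`-to-`loki.write` pipeline instead of adding Promtail.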
## Security state
Working controls from prior audit:
- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
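A quick re-verification sketch for both controls (schemes and hostnames are placeholders for the monitoring and API hosts):

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://<monitor-host>:3000/api/search      # expect 401
curl -s -o /dev/null -w '%{http_code}\n' https://<api-host>/sd/exporters            # expect 401
curl -s -o /dev/null -w '%{http_code}\n' https://<api-host>/monitoring/game-dev     # expect 401
```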