From c57c5ba3df135c187d124e35faae619060fa88d6 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 18:58:24 +0000
Subject: [PATCH] Update monitoring current state after clean discovery

---
 Codex/Monitoring/CURRENT_STATE.md | 81 +++++++++++++++++--------------
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md
index 680e0ea..3fa4513 100644
--- a/Codex/Monitoring/CURRENT_STATE.md
+++ b/Codex/Monitoring/CURRENT_STATE.md
@@ -2,33 +2,48 @@
 
 ## Launch readiness
 
-**Status: FAIL for launch monitoring readiness.**
+**Status: PARTIAL FAIL for launch monitoring readiness.**
 
-Core monitoring services are running, and Prometheus can see the API host via Alloy remote-written host metrics plus one current dev/node target from `/sd/exporters`. Launch-debug readiness is blocked by public exposure, stale/failing game/dev discovery, missing centralized logs, and dashboards not actually installed in Grafana.
+Core monitoring services are running. Game/dev dynamic discovery empty-state is now clean and stale `game-dev-alloy` VMIDs `5205` and `6084` have disappeared from Prometheus after a successful empty sync.
 
-## Running services observed
+Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.
+
+## Recently verified progress
+
+Container discovery is clean:
+
+- `/sd/exporters -> []`
+- `/monitoring/game-dev -> containers: []`
+- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
+- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
+- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
+
+This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
+
+## Running services observed from prior audit
 
 - `prometheus.service` — active/running
 - `grafana-server.service` — active/running
 - `alloy.service` — active/running
 - `prometheus-node-exporter.service` — active/running
-- `zlh-monitor-game-dev-discovery.service` — failed
+
+`zlh-monitor-game-dev-discovery.service` was previously failed, but the latest manual sync succeeded after enabling `ALLOW_EMPTY="true"`. Verify timer/service health after the next scheduled run.
 
 ## Exposure state
 
-Observed listening ports:
+Observed listening ports from prior audit:
 
 - `*:9090` Prometheus
 - `*:3000` Grafana
 - `*:9100` node_exporter
 - `127.0.0.1:12345` Alloy
 
-Host firewall posture observed:
+Host firewall posture from prior audit:
 
 - UFW inactive
 - nftables input/forward/output policy accept
 
-This is a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
+This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
 
 ## Prometheus state
 
@@ -39,32 +54,21 @@ This is a launch blocker because Prometheus, Grafana, and node_exporter are reac
 - `/sd/exporters` is configured with bearer auth and works with token
 - unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
 
-Token risk: the bearer token is stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
+Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
 
 ## Targets/discovery state
 
-Current active targets observed: 12.
+Resolved:
 
-Up targets include:
+- `game-dev-alloy` VMID `5205` is gone
+- `game-dev-alloy` VMID `6084` is gone
+- empty discovery writes a valid empty file_sd file instead of failing
 
-- `prometheus`
-- `alloy`
-- local `node`
-- routers
-- monitor node
-- `zlh_node` VMID `6050`
+Remaining problem targets to remove/fix:
 
-Problem targets observed:
-
-- `game-dev-alloy` VMID `5205`, `10.200.0.78:12345`, timeout
-- `game-dev-alloy` VMID `6084`, `10.100.0.22:12345`, timeout
-- `alloy-dev-6076`, `10.100.0.14:12345`, timeout
-- `cadvisor`, `10.60.0.11:8081`, no route
-- `zerolaghub-nodes`, `10.60.0.19:9100`, connection refused
-
-`/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` contains stale down targets `5205` and `6084`.
-
-The discovery sync currently fails because the API returns dev VMID `6050` on `10.60.0.220`, while the sync script only allows dev IPs in `10.100.0.0/16`. The service then refuses to write an empty discovery file.
+- `alloy-dev-6076`, `10.100.0.14:12345`, hardcoded/down
+- `cadvisor`, `10.60.0.11:8081`, down/no route
+- `zerolaghub-nodes`, `10.60.0.19:9100`, down/connection refused
 
 ## API monitoring state
 
@@ -74,19 +78,22 @@ API host metrics exist via `integrations/unix`:
 - `role=API`
 - `instance=zlh-api`
 
-Observed API app health endpoint `/health` returned `404 Cannot GET /health`; no API health scrape was found.
+Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
 
 No centralized API logs were found on the monitoring host.
 
 ## Agent/container monitoring state
 
-`/sd/exporters` returned one node exporter target:
+Current dynamic game/dev discovery is empty and clean.
 
-- `10.60.0.220:9100`, `vmid=6050`, `ctype=dev`, `job=zlh_node`
+Expected next validation is lifecycle smoke testing:
 
-File discovery still contains stale Alloy collector targets for `5205` and `6084`, both down.
+- create a game and/or dev container
+- confirm target appears
+- delete it
+- confirm target disappears now that empty discovery works
 
-Labels currently observed include:
+Labels previously observed include:
 
 - `vmid`
 - `ctype` or `container_type`
@@ -114,21 +121,21 @@ Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
 - `dev-containers.json`
 - `game-containers.json`
 
-However, Grafana DB showed no dashboards installed. Datasource exists:
+However, Grafana DB previously showed no dashboards installed. Datasource exists:
 
 - `prometheus -> http://localhost:9090`
 
-Dashboard JSON queries reference `job="integrations/unix"` and `job="game-dev-alloy"`, but game/dev Alloy endpoint data is currently down/stale.
+Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
 
 ## Logs state
 
-No Loki, Promtail, or log-ingesting Alloy pipeline was found.
+No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
 
-Only journald/local logs for monitoring components are available on the monitoring host. Recent logs showed repeated discovery failures and Prometheus HTTP SD errors. No token values appeared in those log excerpts.
+Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
 
 ## Security state
 
-Working controls:
+Working controls from prior audit:
 
 - Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
 - discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`