Update monitoring open items after runtime source cleanup

This commit is contained in:
jester 2026-05-01 19:20:44 +00:00
parent 1a837f3770
commit 840394eb8e


@@ -4,6 +4,11 @@ Only keep unfinished monitoring/observability work here.
## Recently resolved / verified
- `/etc/zlh-monitor` is now the runtime source of truth.
- Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
- Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
- `/etc/zlh-monitor/README.md` describes the new single-source layout.
- Game/dev dynamic discovery empty-state now works.
- `/sd/exporters -> []`
- `/monitoring/game-dev -> containers: []`
@@ -11,43 +16,67 @@ Only keep unfinished monitoring/observability work here.
- one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
- Prometheus runtime config is clean.
- validates with `promtool`
- active jobs are only:
- `prometheus`
- `alloy -> 127.0.0.1:12345`
- `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs
- current active targets are clean:
- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`
- Remaining stale/static Prometheus targets are removed from active runtime config.
- `alloy-dev-6076` no longer appears as an active runtime target
- `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target
- `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target
- Grafana provisioning is live.
- source-managed datasource:
- `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`
- legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
- Services active after cleanup:
- `prometheus`
- `grafana-server`
- `alloy`
- the game-dev discovery sync timer
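
The one-shot sync's "wrote `0 targets`" empty state can be sketched as a small helper. The function name `write_file_sd` and the atomic-rename approach are illustrative, not taken from the actual sync script:

```python
import json
import os
import tempfile

def write_file_sd(targets, path):
    """Atomically write a Prometheus file_sd target list.

    `targets` is a list of {"targets": [...], "labels": {...}} groups;
    an empty list is valid file_sd JSON and is what "0 targets" looks
    like on disk.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(targets, f, indent=2)
            f.write("\n")
        # os.replace is atomic, so Prometheus never reads a half-written file.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return len(targets)
```

Writing to a temp file in the same directory and renaming matters here because Prometheus re-reads file_sd files on change; a partial write could briefly drop all targets or fail to parse.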
## Launch blockers
1. Restrict monitoring endpoint exposure.
- Prometheus previously listened on `*:9090`.
- Grafana previously listened on `*:3000`.
- node_exporter previously listened on `*:9100`.
- UFW was inactive and nftables policy was accept in the prior audit.
- Prometheus no longer scrapes node_exporter or any `:9100` jobs, but listener exposure still needs verification/removal or firewalling.
- Required fix: bind Prometheus/Grafana to loopback or management VLAN only, remove/disable any unneeded node_exporter listener, and/or firewall to trusted admin/monitoring sources.
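   - A possible shape for the fix, as a sketch only: `--web.listen-address` and Grafana's `http_addr` are standard options, but the nftables table/chain names and the management subnet below are assumptions.

     ```shell
     # /etc/default/prometheus -- loopback-only listener
     ARGS="--web.listen-address=127.0.0.1:9090"

     # grafana.ini, [server] section -- loopback-only listener
     # http_addr = 127.0.0.1

     # Or keep the binds and firewall instead (subnet and chain names assumed):
     nft add rule inet filter input ip saddr 10.60.0.0/24 tcp dport { 9090, 3000, 9100 } accept
     nft add rule inet filter input tcp dport { 9090, 3000, 9100 } drop
     ```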
2. Add API health scrape or equivalent application-level health signal.
- Observed `/health` previously returned `404 Cannot GET /health`.
- Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
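   - One common way to satisfy this, once the API actually serves `/health`, is probing it through `blackbox_exporter`; in this sketch the API URL and the exporter address are assumptions:

     ```yaml
     # prometheus.yml fragment (sketch): probe the API health endpoint
     - job_name: api-health
       metrics_path: /probe
       params:
         module: [http_2xx]
       static_configs:
         - targets: ['http://127.0.0.1:8080/health']   # assumed API address
       relabel_configs:
         - source_labels: [__address__]
           target_label: __param_target
         - source_labels: [__param_target]
           target_label: instance
         - target_label: __address__
           replacement: 127.0.0.1:9115                 # assumed blackbox_exporter address
     ```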
3. Add launch-debug lifecycle telemetry.
- Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
- Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
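   - The missing fields map naturally onto Prometheus gauges; the metric names, labels, and vmid below are illustrative, not an existing schema:

     ```
     # Sketch of an exposition format for the lifecycle signals (names assumed)
     zlh_container_ready{vmid="6101",kind="game"} 1
     zlh_container_connectable{vmid="6101",kind="game"} 1
     zlh_container_operation_in_progress{vmid="6101",operation_type="backup"} 0
     zlh_code_server_up{vmid="6101",kind="dev"} 1
     ```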
4. Add centralized logs.
- No Loki, Promtail, or log-ingesting Alloy config found in prior audit.
- Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
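   - If the log path goes through Alloy, a minimal pipeline could look like this; the component names are standard Alloy ones, but the log paths and Loki address are assumptions:

     ```
     // Alloy sketch: tail API/Agent logs and push them to a local Loki
     local.file_match "zlh_logs" {
       path_targets = [{ "__path__" = "/var/log/zlh/*.log" }]
     }

     loki.source.file "zlh" {
       targets    = local.file_match.zlh_logs.targets
       forward_to = [loki.write.default.receiver]
     }

     loki.write "default" {
       endpoint {
         url = "http://127.0.0.1:3100/loki/api/v1/push"
       }
     }
     ```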
5. Tighten bearer token storage.
- Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
- Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
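   - Two hedged options using standard tooling; the env-file path is from the audit, every other path and credential name below is an assumption:

     ```shell
     # Option A: tight-permission env file
     chmod 600 /etc/zlh-monitor/game-dev-discovery.env
     chown root:root /etc/zlh-monitor/game-dev-discovery.env

     # Option B: systemd credentials, so the token is encrypted at rest
     systemd-creds encrypt --name=zlh-bearer-token token.txt /etc/credstore.encrypted/zlh-bearer-token
     # unit file: LoadCredentialEncrypted=zlh-bearer-token
     # the service then reads $CREDENTIALS_DIRECTORY/zlh-bearer-token at runtime
     ```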
6. Run lifecycle smoke test once remaining blockers are controlled enough.
- Create a game container and/or dev container.
- Confirm target appears.
- Delete it.
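   - The verification steps can be driven from the Prometheus HTTP API; the `jq` filter is a sketch:

     ```shell
     # After creating the test container, the discovery sync should add a target:
     curl -s 'http://localhost:9090/api/v1/targets?state=active' \
       | jq -r '.data.activeTargets[] | .labels.job + " " + .labels.instance'

     # After deleting it, the file_sd list should be empty again:
     cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json   # expect []
     ```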