diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md
index 3fa4513..07afde1 100644
--- a/Codex/Monitoring/CURRENT_STATE.md
+++ b/Codex/Monitoring/CURRENT_STATE.md
@@ -4,9 +4,27 @@
 
 **Status: PARTIAL FAIL for launch monitoring readiness.**
 
-Core monitoring services are running. Game/dev dynamic discovery empty-state is now clean and stale `game-dev-alloy` VMIDs `5205` and `6084` have disappeared from Prometheus after a successful empty sync.
+Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.
 
-Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.
+Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
+
+## Runtime source of truth
+
+`/etc/zlh-monitor` is now the runtime source of truth.
+
+Prometheus now runs from:
+
+- `/etc/zlh-monitor/prometheus/prometheus.yml`
+- configured via `/etc/default/prometheus`
+- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`
+
+Grafana now provisions from:
+
+- `/etc/zlh-monitor/grafana/provisioning`
+- configured via `/etc/default/grafana-server`
+- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`
+
+`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
 
 ## Recently verified progress
 
@@ -18,16 +36,40 @@ Container discovery is clean:
 - one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
 - Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
 
-This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
+Prometheus runtime config is clean:
 
-## Running services observed from prior audit
+- config validates with `promtool`
+- active jobs are now only:
+  - `prometheus`
+  - `alloy -> 127.0.0.1:12345`
+  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
+- no active `node_exporter` job
+- no active cAdvisor job
+- no active `:9100` jobs
 
-- `prometheus.service` — active/running
-- `grafana-server.service` — active/running
-- `alloy.service` — active/running
-- `prometheus-node-exporter.service` — active/running
+Current Prometheus targets are clean:
 
-`zlh-monitor-game-dev-discovery.service` was previously failed, but the latest manual sync succeeded after enabling `ALLOW_EMPTY="true"`. Verify timer/service health after the next scheduled run.
+- `up alloy 127.0.0.1:12345`
+- `up prometheus localhost:9090`
+
+Grafana provisioning is live:
+
+- one source-managed datasource exists:
+  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
+- provisioned dashboards:
+  - `Core Service - Detail`
+  - `Core Services - Overview`
+  - `Dev Containers`
+  - `Game Containers`
+- legacy node_exporter router dashboard archived to:
+  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
+
+Services active after cleanup:
+
+- `prometheus`
+- `grafana-server`
+- `alloy`
+- discovery timer
 
 ## Exposure state
 
@@ -38,19 +80,26 @@ Observed listening ports from prior audit:
 - `*:9100` node_exporter
 - `127.0.0.1:12345` Alloy
 
+Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown.
+
 Host firewall posture from prior audit:
 
 - UFW inactive
 - nftables input/forward/output policy accept
 
-This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
+This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
 
 ## Prometheus state
 
-- Config path: `/etc/prometheus/prometheus.yml`
-- Global scrape interval: `15s`
-- Prometheus self-scrape: `5s`
-- Dynamic discovery refresh: `30s`
+- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
+- Config validates with `promtool`
+- Active jobs:
+  - `prometheus`
+  - `alloy`
+  - `game-dev-alloy`
+- Current targets:
+  - `up alloy 127.0.0.1:12345`
+  - `up prometheus localhost:9090`
 - `/sd/exporters` is configured with bearer auth and works with token
 - unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
 
@@ -62,21 +111,16 @@ Resolved:
 
 - `game-dev-alloy` VMID `5205` is gone
 - `game-dev-alloy` VMID `6084` is gone
+- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
+- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
+- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
 - empty discovery writes a valid empty file_sd file instead of failing
 
-Remaining problem targets to remove/fix:
-
-- `alloy-dev-6076`, `10.100.0.14:12345`, hardcoded/down
-- `cadvisor`, `10.60.0.11:8081`, down/no route
-- `zerolaghub-nodes`, `10.60.0.19:9100`, down/connection refused
+No stale/static down targets are currently present in active Prometheus targets.
 
 ## API monitoring state
 
-API host metrics exist via `integrations/unix`:
-
-- `vmid=9020`
-- `role=API`
-- `instance=zlh-api`
+API host/system metrics were previously available via `integrations/unix`, but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.
 
 Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
 
@@ -113,19 +157,25 @@ Missing or incomplete for lifecycle debugging:
 
 ## Grafana state
 
-Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
+Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
 
-- `core-routers.json`
-- `core-service-detail.json`
-- `core-services-overview.json`
-- `dev-containers.json`
-- `game-containers.json`
+Datasource:
 
-However, Grafana DB previously showed no dashboards installed. Datasource exists:
+- `uid=prometheus`
+- `name=Prometheus`
+- `url=http://localhost:9090`
+- default datasource
 
-- `prometheus -> http://localhost:9090`
+Live provisioned dashboards:
 
-Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
+- `Core Service - Detail`
+- `Core Services - Overview`
+- `Dev Containers`
+- `Game Containers`
+
+Archived:
+
+- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
 
 ## Logs state
 
@@ -143,5 +193,6 @@ Working controls from prior audit:
 
 Launch-blocking risk:
 
-- Prometheus, Grafana, and node_exporter bind publicly on the host network
-- host firewall policy does not restrict access
+- Prometheus and Grafana previously bound publicly on the host network
+- node_exporter previously listened on `*:9100`; it is no longer scraped by active Prometheus config, but listener exposure still needs verification/removal or firewalling
+- host firewall policy did not restrict access in prior audit
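
For reference, below is a minimal sketch of what the reduced runtime scrape config described by this patch could look like. The job names, target addresses, and file_sd path are taken from the patch itself; the interval values and exact YAML layout are illustrative assumptions, not settings confirmed by the audit.

```yaml
# Hypothetical sketch of /etc/zlh-monitor/prometheus/prometheus.yml after the cleanup.
# Job names, target addresses, and the file_sd path come from the patch above;
# the interval values are assumed, not audited.
global:
  scrape_interval: 15s   # assumed default-style interval

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]    # matches target "up prometheus localhost:9090"

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]   # matches target "up alloy 127.0.0.1:12345"

  - job_name: game-dev-alloy
    file_sd_configs:
      # the discovery one-shot writes this file; an empty sync produces a valid
      # empty JSON target list rather than failing
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
        refresh_interval: 30s          # assumed refresh cadence
```

A config shaped like this validates the same way the patch describes, e.g. `promtool check config /etc/zlh-monitor/prometheus/prometheus.yml`.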