Update monitoring current state after Prometheus Grafana cleanup

jester 2026-05-01 19:19:29 +00:00
parent c57c5ba3df
commit 1a837f3770


@@ -4,9 +4,27 @@
**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
## Runtime source of truth
`/etc/zlh-monitor` is now the runtime source of truth.
Prometheus now runs from:
- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`
Grafana now provisions from:
- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`
`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
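
A quick way to confirm the relocation took effect is to check that the runtime paths exist and that the systemd environment files reference them. The sketch below only greps for the directory strings, because the exact variable names inside `/etc/default/prometheus` and `/etc/default/grafana-server` are not recorded here.

```python
#!/usr/bin/env python3
"""Sketch: confirm the /etc/zlh-monitor runtime layout is wired into the systemd defaults."""
from pathlib import Path

CHECKS = [
    # (runtime path that should exist, environment file that should reference it)
    ("/etc/zlh-monitor/prometheus/prometheus.yml", "/etc/default/prometheus"),
    ("/etc/zlh-monitor/grafana/provisioning", "/etc/default/grafana-server"),
]

failures = 0
for runtime_path, env_file in CHECKS:
    if not Path(runtime_path).exists():
        print(f"MISSING         {runtime_path}")
        failures += 1
        continue
    env_text = Path(env_file).read_text() if Path(env_file).exists() else ""
    referenced = "/etc/zlh-monitor" in env_text
    status = "OK" if referenced else "NOT REFERENCED"
    print(f"{status:15s} {runtime_path} (via {env_file})")
    failures += 0 if referenced else 1

raise SystemExit(1 if failures else 0)
```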
## Recently verified progress

@@ -18,16 +36,40 @@ Container discovery is clean:

- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` (see the sketch below)
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
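
A minimal sketch of the empty-state write plus a validity check: Prometheus file_sd expects a JSON list of `{"targets": [...], "labels": {...}}` groups, and an empty list is valid. The atomic-write detail is an assumption about how the sync script should behave, not something this audit confirms.

```python
#!/usr/bin/env python3
"""Sketch: write and validate an empty Prometheus file_sd target file."""
import json
import os
import tempfile

SD_FILE = "/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json"

def write_targets(targets, path=SD_FILE):
    """Write the target list atomically so Prometheus never reads a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(targets, fh, indent=2)
        fh.write("\n")
    os.replace(tmp, path)  # atomic rename on the same filesystem

def count_target_groups(path=SD_FILE):
    """Return the number of target groups; raise if the file is not valid file_sd JSON."""
    with open(path) as fh:
        data = json.load(fh)
    assert isinstance(data, list), "file_sd content must be a JSON list"
    for group in data:
        assert isinstance(group.get("targets"), list), "each group needs a targets list"
    return len(data)

if __name__ == "__main__":
    write_targets([])                                 # empty state: zero discovered containers
    print("target groups:", count_target_groups())   # -> 0, yet the file stays valid
```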
Prometheus runtime config is clean:
- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs
Current Prometheus targets are clean (re-checked by the sketch below):
- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`
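
A small check against the Prometheus HTTP API can confirm that the active target set matches exactly these two entries. It assumes Prometheus is reachable on `localhost:9090`, as listed above.

```python
#!/usr/bin/env python3
"""Sketch: assert the active Prometheus targets match the expected set."""
import json
import urllib.request

PROM = "http://localhost:9090"
EXPECTED = {("prometheus", "localhost:9090"), ("alloy", "127.0.0.1:12345")}

with urllib.request.urlopen(f"{PROM}/api/v1/targets") as resp:
    active = json.load(resp)["data"]["activeTargets"]

seen = {(t["labels"]["job"], t["labels"]["instance"]) for t in active}
unhealthy = [t["labels"]["instance"] for t in active if t["health"] != "up"]

if seen == EXPECTED and not unhealthy:
    print("OK: only the expected targets are present and all are up")
else:
    print("MISMATCH:", "unexpected =", sorted(seen - EXPECTED), "unhealthy =", unhealthy)
```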
Grafana provisioning is live (re-checked by the sketch below):
- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
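
The same facts can be confirmed through the Grafana HTTP API. The `localhost:3000` address and the `GRAFANA_TOKEN` service-account token are assumptions; adjust them to the host's actual Grafana address and credentials.

```python
#!/usr/bin/env python3
"""Sketch: confirm the provisioned datasource and dashboards via the Grafana API."""
import json
import os
import urllib.request

GRAFANA = "http://localhost:3000"            # assumed Grafana address
TOKEN = os.environ["GRAFANA_TOKEN"]          # assumed service-account token
EXPECTED = {"Core Service - Detail", "Core Services - Overview",
            "Dev Containers", "Game Containers"}

def get(path):
    req = urllib.request.Request(GRAFANA + path,
                                 headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

datasources = [(d["uid"], d["url"], d.get("isDefault")) for d in get("/api/datasources")]
dashboards = {d["title"] for d in get("/api/search?type=dash-db")}

print("datasources:", datasources)
missing = EXPECTED - dashboards
print("dashboards OK" if not missing else f"missing dashboards: {sorted(missing)}")
```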
Services active after cleanup:
- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer
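
The post-cleanup service set can be re-verified after the next scheduled discovery run with a loop over `systemctl is-active`. The timer unit name below is an assumption, since the list above only calls it the discovery timer.

```python
#!/usr/bin/env python3
"""Sketch: re-check the post-cleanup service set with systemctl."""
import subprocess

UNITS = [
    "prometheus.service",
    "grafana-server.service",
    "alloy.service",
    "zlh-monitor-game-dev-discovery.timer",  # assumed unit name for the discovery timer
]

for unit in UNITS:
    state = subprocess.run(["systemctl", "is-active", unit],
                           capture_output=True, text=True).stdout.strip()
    print(f"{state or 'unknown':10s} {unit}")
```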
## Exposure state

@@ -38,19 +80,26 @@ Observed listening ports from prior audit:

- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy
Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown.
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept

This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
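
A first-pass check of current bind addresses can run on the host itself by parsing `ss -ltnH` output. The Prometheus/Grafana ports assume the defaults (`9090`/`3000`), and a loopback-only result here still does not replace an external reachability test once the firewall is tightened.

```python
#!/usr/bin/env python3
"""Sketch: flag monitoring listeners bound to non-loopback addresses."""
import subprocess

WATCHED = {"9090": "prometheus", "3000": "grafana", "9100": "node_exporter", "12345": "alloy"}
LOOPBACK = ("127.", "[::1]")

listing = subprocess.run(["ss", "-ltnH"], capture_output=True, text=True, check=True).stdout

for line in listing.splitlines():
    fields = line.split()
    if len(fields) < 4:
        continue
    local = fields[3]                       # e.g. 0.0.0.0:9090, [::]:3000, 127.0.0.1:12345
    addr, _, port = local.rpartition(":")
    if port not in WATCHED:
        continue
    flag = "loopback" if addr.startswith(LOOPBACK) else "PUBLIC BIND"
    print(f"{flag:12s} {WATCHED[port]:14s} {local}")
```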
## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Current targets:
  - `up alloy 127.0.0.1:12345`
  - `up prometheus localhost:9090`
- `/sd/exporters` is configured with bearer auth and works with token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized` (see the sketch below)
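
A recurring re-check of the bearer-auth behaviour could look like the sketch below. The base URL and the `SD_TOKEN` environment variable are placeholders, since this audit names only the paths and the expected 401-without-token behaviour.

```python
#!/usr/bin/env python3
"""Sketch: verify the discovery endpoints reject anonymous requests and accept the token."""
import os
import urllib.error
import urllib.request

BASE_URL = os.environ.get("SD_BASE_URL", "http://127.0.0.1:8080")  # placeholder host
TOKEN = os.environ.get("SD_TOKEN", "")                             # placeholder token source
PATHS = ["/sd/exporters", "/monitoring/game-dev"]

def status(path, token=None):
    req = urllib.request.Request(BASE_URL + path)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

for path in PATHS:
    anon, authed = status(path), status(path, TOKEN)
    ok = anon == 401 and authed == 200
    print(f"{'OK' if ok else 'CHECK'}: {path} anonymous={anon} with-token={authed}")
```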
@@ -62,21 +111,16 @@ Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing

No stale/static down targets are currently present in active Prometheus targets.
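
A quick guard against stale or static down targets creeping back in is to query Prometheus for `up == 0`; an empty result matches the statement above. It assumes the `localhost:9090` address used elsewhere in this document.

```python
#!/usr/bin/env python3
"""Sketch: list any scrape targets currently reporting up == 0."""
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"
params = urllib.parse.urlencode({"query": "up == 0"})

with urllib.request.urlopen(f"{PROM}/api/v1/query?{params}") as resp:
    down = json.load(resp)["data"]["result"]

if not down:
    print("OK: no down targets")
else:
    for series in down:
        labels = series["metric"]
        print(f"DOWN: job={labels.get('job')} instance={labels.get('instance')}")
```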
## API monitoring state

API host/system metrics were previously available via `integrations/unix`, but the active runtime Prometheus jobs are now intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.

Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
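
Before an API health scrape or alert can be added, a working health path has to exist on the API app. A throwaway probe like the following can narrow that down; the host/port and the candidate paths are assumptions, since the audit only records the `404` on `/health`.

```python
#!/usr/bin/env python3
"""Sketch: probe candidate API health paths before wiring a scrape or alert."""
import urllib.error
import urllib.request

API_BASE = "http://zlh-api:3000"    # placeholder address for the API app
CANDIDATES = ["/health", "/healthz", "/api/health", "/status"]

for path in CANDIDATES:
    try:
        with urllib.request.urlopen(API_BASE + path, timeout=3) as resp:
            print(f"{resp.status}  {path}")
    except urllib.error.HTTPError as err:
        print(f"{err.code}  {path}")
    except OSError as err:
        print(f"ERR  {path}  ({err})")
```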
@@ -113,19 +157,25 @@ Missing or incomplete for lifecycle debugging:

## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.

Datasource:
- `uid=prometheus`
- `name=Prometheus`
- `url=http://localhost:9090`
- default datasource

Live provisioned dashboards:
- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`
Archived:
- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
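
A quick sanity pass over the dashboard JSON that provisioning loads can catch broken files before they silently fail to appear in Grafana. The dashboards directory below is an assumption; the authoritative location is whatever the dashboard provider's `path` option points at under `/etc/zlh-monitor/grafana/provisioning`.

```python
#!/usr/bin/env python3
"""Sketch: sanity-check the dashboard JSON files that Grafana provisioning loads."""
import json
from pathlib import Path

DASHBOARD_DIR = Path("/etc/zlh-monitor/grafana/dashboards")  # assumed provider path

for dashboard_file in sorted(DASHBOARD_DIR.glob("*.json")):
    try:
        data = json.loads(dashboard_file.read_text())
    except json.JSONDecodeError as err:
        print(f"INVALID JSON  {dashboard_file.name}: {err}")
        continue
    title = data.get("title", "<no title>")
    panels = len(data.get("panels", []))
    print(f"OK  {dashboard_file.name}: title={title!r} panels={panels}")
```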
## Logs state

@@ -143,5 +193,6 @@ Working controls from prior audit:

Launch-blocking risk:
- Prometheus and Grafana previously bound publicly on the host network
- node_exporter previously listened on `*:9100`; it is no longer scraped by active Prometheus config, but listener exposure still needs verification/removal or firewalling
- host firewall policy did not restrict access in prior audit