Update monitoring current state after clean discovery

jester 2026-05-01 18:58:24 +00:00
parent 1169a89b71
commit c57c5ba3df


## Launch readiness
**Status: PARTIAL FAIL for launch monitoring readiness.**
Core monitoring services are running. The game/dev dynamic-discovery empty state is now clean, and the stale `game-dev-alloy` VMIDs `5205` and `6084` have disappeared from Prometheus after a successful empty sync.
Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, a missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.
## Recently verified progress
Container discovery is clean:
- `/sd/exporters -> []`
- `/monitoring/game-dev -> containers: []`
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
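For future re-checks, a minimal verification sketch (assuming Prometheus listens on `localhost:9090`, the job label is `game-dev-alloy`, and `jq` is available):

```bash
# Confirm the file_sd file is still a valid empty target list (path from this doc)
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json   # expected: [] (an empty JSON array is valid file_sd content)

# Confirm no game-dev-alloy targets remain active in Prometheus
curl -s http://localhost:9090/api/v1/targets \
  | jq '[.data.activeTargets[] | select(.labels.job == "game-dev-alloy")] | length'   # expected: 0
```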
## Running services observed from prior audit
- `prometheus.service` — active/running
- `grafana-server.service` — active/running
- `alloy.service` — active/running
- `prometheus-node-exporter.service` — active/running
`zlh-monitor-game-dev-discovery.service` previously showed as failed, but the latest manual sync succeeded after `ALLOW_EMPTY="true"` was enabled. Verify timer/service health after the next scheduled run.
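A possible follow-up check after the next scheduled run (unit name from this doc; the timer name is assumed to match):

```bash
systemctl status zlh-monitor-game-dev-discovery.service --no-pager
systemctl list-timers --all | grep -i 'game-dev-discovery' || echo "no matching timer found"
journalctl -u zlh-monitor-game-dev-discovery.service -n 20 --no-pager
```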
## Exposure state
Observed listening ports from prior audit:
- `*:9090` Prometheus
- `*:3000` Grafana
- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept
This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
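One possible remediation shape, sketched with UFW since it is installed but inactive; the CIDRs are placeholders rather than audited values, and the remote-write rule only applies if the API host's Alloy pushes to Prometheus over the network:

```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow from <ADMIN_CIDR> to any port 22 proto tcp      # keep SSH reachable before enabling
ufw allow from <ADMIN_CIDR> to any port 3000 proto tcp    # Grafana, operators only
ufw allow from <ADMIN_CIDR> to any port 9090 proto tcp    # Prometheus UI/API, operators only
ufw allow from <API_HOST_IP> to any port 9090 proto tcp   # Alloy remote write from the API host
# node_exporter (*:9100): if it is only scraped by the local Prometheus, no inbound rule is needed
ufw enable
```

An nftables ruleset with a default-drop input policy would be an equivalent alternative.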
## Prometheus state
- `/sd/exporters` is configured with bearer auth and works with token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
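One way to tighten storage is Prometheus' `credentials_file` support, sketched below; the job name, discovery URL host, and secret path are illustrative, not audited values:

```yaml
scrape_configs:
  - job_name: zlh_node
    http_sd_configs:
      - url: https://<api-host>/sd/exporters        # discovery endpoint named in this doc
        authorization:
          type: Bearer
          credentials_file: /etc/zlh-monitor/secrets/sd-token   # e.g. root:prometheus, mode 0640
```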
## Targets/discovery state
Resolved:
- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- empty discovery writes a valid empty file_sd file instead of failing
Remaining problem targets to remove or fix (a triage sketch follows the list):
- `alloy-dev-6076`, `10.100.0.14:12345`, hardcoded/down
- `cadvisor`, `10.60.0.11:8081`, down/no route
- `zerolaghub-nodes`, `10.60.0.19:9100`, down/connection refused
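A possible triage sketch for those remaining entries (the Prometheus config directory is assumed to sit alongside the `file_sd` path named above):

```bash
# Locate where the hardcoded/down targets are defined
grep -rn '10.100.0.14\|10.60.0.11\|10.60.0.19' /etc/zlh-monitor/prometheus/

# List every currently-down target with its job and scrape URL
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health == "down") | "\(.labels.job) \(.scrapeUrl)"'

# After removing or fixing them, validate the config and reload
promtool check config /etc/zlh-monitor/prometheus/prometheus.yml   # path assumed
curl -X POST http://localhost:9090/-/reload   # only if --web.enable-lifecycle is set; otherwise restart the service
```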
## API monitoring state
API host metrics exist via `integrations/unix`:
- `role=API`
- `instance=zlh-api`
Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
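Once the API exposes a working health endpoint, one possible scrape shape is a `blackbox_exporter` probe; blackbox_exporter was not part of this audit, and the host/port below are placeholders:

```yaml
scrape_configs:
  - job_name: zlh_api_health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['http://<api-host>:<api-port>/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115        # blackbox_exporter's default listen address
```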
No centralized API logs were found on the monitoring host.
## Agent/container monitoring state
Current dynamic game/dev discovery is empty and clean.
Expected next validation is lifecycle smoke testing (a check sketch follows the list):
- create a game and/or dev container
- confirm its target appears
- delete it
- confirm the target disappears now that empty discovery works
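A minimal check sketch for those steps (the job and `vmid` label names are taken from this doc; the VMID value is made up):

```bash
check_target() {
  # count active game-dev-alloy targets carrying the given vmid label
  local vmid="$1"
  curl -s http://localhost:9090/api/v1/targets \
    | jq --arg vmid "$vmid" \
         '[.data.activeTargets[] | select(.labels.job == "game-dev-alloy" and .labels.vmid == $vmid)] | length'
}

check_target 6101   # expect 1 shortly after creating the test container
# ...delete the container and wait for the next discovery sync...
check_target 6101   # expect 0 once the sync rewrites the file_sd file
```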
Labels previously observed include:
- `vmid`
- `ctype` or `container_type`
## Dashboards state
Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
- `dev-containers.json`
- `game-containers.json`
However, Grafana DB previously showed no dashboards installed. Datasource exists:
- `prometheus -> http://localhost:9090`
Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
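A sketch of a Grafana file provisioner that would load those JSON files (the provisioning path and provider name are assumptions; the dashboards directory is the one named above):

```yaml
# e.g. /etc/grafana/provisioning/dashboards/zlh.yaml
apiVersion: 1
providers:
  - name: zlh-dashboards
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/zlh-monitor/grafana/dashboards
```

New provider files are only read at Grafana startup, so a `grafana-server` restart would be needed after adding one.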
## Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
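One option named above is Promtail shipping journald into a Loki instance; neither is deployed yet, so the push URL is a placeholder and this is only a shape sketch:

```yaml
server:
  disable: true
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://<loki-host>:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: journald
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
```

The already-running Alloy agents could serve the same role with a `loki.source.journal`-to-`loki.write` pipeline instead of adding Promtail.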
## Security state
Working controls from prior audit:
- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
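A quick re-verification sketch for both controls (schemes and hostnames are placeholders for the monitoring and API hosts):

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://<monitor-host>:3000/api/search      # expect 401
curl -s -o /dev/null -w '%{http_code}\n' https://<api-host>/sd/exporters            # expect 401
curl -s -o /dev/null -w '%{http_code}\n' https://<api-host>/monitoring/game-dev     # expect 401
```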