# Monitoring — Current State
## Launch readiness
**Status: FAIL for launch monitoring readiness.**
Core monitoring services are running. Prometheus receives host metrics for the API host through Alloy remote write and sees one current dev/node target from `/sd/exporters`. Launch-debug readiness is blocked by four issues: public exposure, stale or failing game/dev discovery, missing centralized logs, and dashboards that are not installed in Grafana.
## Running services observed
- `prometheus.service` — active/running
- `grafana-server.service` — active/running
- `alloy.service` — active/running
- `prometheus-node-exporter.service` — active/running
- `zlh-monitor-game-dev-discovery.service` — failed
## Exposure state
Observed listening ports:
- `*:9090` Prometheus
- `*:3000` Grafana
- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy
Host firewall posture observed:
- UFW inactive
- nftables input/forward/output policy accept
This is a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
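The exposure check can be expressed as a small classification over the observed listeners. The following is a minimal sketch, not the audit tooling itself: the `LISTENERS` table is copied from the list above, and the helper names `is_exposed` and `exposed_services` are hypothetical.

```python
import ipaddress

# Bind address, port, and service name as observed in the audit.
LISTENERS = [
    ("*", 9090, "Prometheus"),
    ("*", 3000, "Grafana"),
    ("*", 9100, "node_exporter"),
    ("127.0.0.1", 12345, "Alloy"),
]

# Wildcard forms that mean "every interface".
WILDCARDS = {"*", "0.0.0.0", "::"}


def is_exposed(host: str) -> bool:
    """Return True when a bind address accepts non-loopback connections."""
    if host in WILDCARDS:
        return True
    return not ipaddress.ip_address(host).is_loopback


def exposed_services(listeners):
    """Return the names of services reachable beyond loopback."""
    return [name for host, _port, name in listeners if is_exposed(host)]


if __name__ == "__main__":
    for name in exposed_services(LISTENERS):
        print(f"exposed: {name}")
```

With no host firewall in place, every service this sketch reports is reachable from any network that can route to the host.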
## Prometheus state
- Config path: `/etc/prometheus/prometheus.yml`
- Global scrape interval: `15s`
- Prometheus self-scrape: `5s`
- Dynamic discovery refresh: `30s`
- `/sd/exporters` is configured with bearer auth and responds successfully when the token is supplied
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
Token risk: the bearer token is stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
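One way to audit for inline tokens is to scan config text for secret-bearing keys that carry a literal value. This is a rough sketch: it assumes the token appears under Prometheus's `credentials` or legacy `bearer_token` key, and the sample config, its URL, and the `/etc/prometheus/sd.token` path are hypothetical placeholders, not values from the audited host. It uses a line-based regex rather than a YAML parser so it needs only the standard library.

```python
import re

# Keys whose value is the secret itself. The *_file variants
# (credentials_file, bearer_token_file) only reference a path
# and are deliberately not flagged.
INLINE_KEYS = ("credentials", "bearer_token")

# Matches "key: value" lines, optionally prefixed by a YAML list dash.
_KEY_RE = re.compile(r"^\s*(?:-\s*)?([A-Za-z_]+)\s*:\s*(\S.*)?$")

# Hypothetical config fragment for illustration only.
SAMPLE = """\
scrape_configs:
  - job_name: zlh_node
    http_sd_configs:
      - url: http://example.invalid/sd/exporters
        authorization:
          type: Bearer
          credentials: PLACEHOLDER_TOKEN
  - job_name: other
    authorization:
      credentials_file: /etc/prometheus/sd.token
"""


def find_inline_secrets(config_text: str):
    """Return (line_number, key) for each inline secret key with a value."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = _KEY_RE.match(line)
        if match and match.group(1) in INLINE_KEYS and match.group(2):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    for lineno, key in find_inline_secrets(SAMPLE):
        print(f"line {lineno}: inline secret under '{key}'")
```

The same scan could be run over the backup configs; the `.env` file uses a different syntax and would need its own check.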
## Targets/discovery state
Current active targets observed: 12.
Up targets include:
- `prometheus`
- `alloy`
- local `node`
- routers
- monitor node
- `zlh_node` VMID `6050`
Problem targets observed:
- `game-dev-alloy` VMID `5205`, `10.200.0.78:12345`, timeout
- `game-dev-alloy` VMID `6084`, `10.100.0.22:12345`, timeout
- `alloy-dev-6076`, `10.100.0.14:12345`, timeout
- `cadvisor`, `10.60.0.11:8081`, no route
- `zerolaghub-nodes`, `10.60.0.19:9100`, connection refused
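A target summary like the one above can be derived from the Prometheus `/api/v1/targets` response. The sketch below works on an already-decoded payload; the two sample entries reuse targets from this document, and the `lastError` string is a shortened placeholder rather than the real Prometheus error text. `summarize_targets` is a hypothetical helper name.

```python
from collections import Counter

# Trimmed example of the /api/v1/targets response shape.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "zlh_node", "instance": "10.60.0.220:9100"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "cadvisor", "instance": "10.60.0.11:8081"},
                "health": "down",
                "lastError": "no route",
            },
        ]
    },
}


def summarize_targets(payload: dict):
    """Return (health counts, list of non-up targets) for active targets."""
    active = payload["data"]["activeTargets"]
    counts = Counter(target["health"] for target in active)
    problems = [
        (
            target["labels"].get("job"),
            target["labels"].get("instance"),
            target.get("lastError", ""),
        )
        for target in active
        if target["health"] != "up"
    ]
    return counts, problems


if __name__ == "__main__":
    counts, problems = summarize_targets(SAMPLE_PAYLOAD)
    print(dict(counts))
    for job, instance, error in problems:
        print(f"{job} {instance}: {error}")
```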
`/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` contains stale down targets `5205` and `6084`.
The discovery sync currently fails because the API returns dev VMID `6050` on `10.60.0.220`, which is outside the `10.100.0.0/16` range that the sync script allows for dev IPs. With no accepted targets left, the service refuses to write an empty discovery file.
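The failure mode can be reproduced with a few lines of allowlist logic. This is a sketch of the behavior described, not the actual sync script: the input record shape (`vmid`, `ip`), the function names, and the exception type are assumptions. Only the network range, the returned address, and the refusal to write an empty file come from the audit.

```python
import ipaddress

# Dev IP range the sync script is reported to allow.
ALLOWED_DEV_NET = ipaddress.ip_network("10.100.0.0/16")


class EmptyDiscoveryError(RuntimeError):
    """Raised instead of overwriting file_sd with an empty target list."""


def filter_dev_targets(targets, allowed=ALLOWED_DEV_NET):
    """Keep only targets whose IP falls inside the allowed network."""
    return [t for t in targets if ipaddress.ip_address(t["ip"]) in allowed]


def build_file_sd(targets, port=12345):
    """Build file_sd entries, refusing to produce an empty result."""
    accepted = filter_dev_targets(targets)
    if not accepted:
        raise EmptyDiscoveryError("no targets passed the IP allowlist")
    return [
        {
            "targets": [f"{t['ip']}:{port}"],
            "labels": {"vmid": str(t["vmid"])},  # label values must be strings
        }
        for t in accepted
    ]


if __name__ == "__main__":
    api_response = [{"vmid": 6050, "ip": "10.60.0.220"}]
    try:
        build_file_sd(api_response)
    except EmptyDiscoveryError as exc:
        print(f"sync failed: {exc}")
```

Refusing to write an empty file protects against wiping discovery on a transient API error, but it also means a rejected-but-valid target keeps the old file, including its stale entries, in place.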
## API monitoring state
API host metrics exist via `integrations/unix`:
- `vmid=9020`
- `role=API`
- `instance=zlh-api`
A request to the API app's `/health` endpoint returned `404 Cannot GET /health`, and no API health scrape was found.
No centralized API logs were found on the monitoring host.
## Agent/container monitoring state
`/sd/exporters` returned one node exporter target:
- `10.60.0.220:9100`, `vmid=6050`, `ctype=dev`, `job=zlh_node`
File discovery still contains stale Alloy collector targets for `5205` and `6084`, both down.
Labels currently observed include:
- `vmid`
- `ctype` or `container_type`
- `game`
- `variant`
- `hostname` / `name`
- `job`
- `instance`
Missing or incomplete for lifecycle debugging:
- `ready` / `connectable`
- `operationInProgress`
- `operationType`
- backup/restore state
- code-server state
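A label coverage check makes the gap concrete for any one target. The sketch below assumes the lifecycle fields would surface as labels under the names listed above, which the audit does not confirm. Backup/restore and code-server state are left out because no field names are given for them. `missing_groups` is a hypothetical helper.

```python
# Each tuple lists alternative names; any one of them satisfies the group.
LIFECYCLE_GROUPS = [
    ("ready", "connectable"),
    ("operationInProgress",),
    ("operationType",),
]

# Labels of the single target returned by /sd/exporters in the audit.
OBSERVED_TARGET_LABELS = {
    "vmid": "6050",
    "ctype": "dev",
    "job": "zlh_node",
    "instance": "10.60.0.220:9100",
}


def missing_groups(labels: dict, groups):
    """Return the groups for which none of the alternative names is present."""
    return [group for group in groups if not any(name in labels for name in group)]


if __name__ == "__main__":
    for group in missing_groups(OBSERVED_TARGET_LABELS, LIFECYCLE_GROUPS):
        print("missing:", " / ".join(group))
```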
## Grafana state
Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
- `core-routers.json`
- `core-service-detail.json`
- `core-services-overview.json`
- `dev-containers.json`
- `game-containers.json`
However, the Grafana database showed no dashboards installed. One datasource exists:
- `prometheus -> http://localhost:9090`
Dashboard JSON queries reference `job="integrations/unix"` and `job="game-dev-alloy"`, but game/dev Alloy endpoint data is currently down/stale.
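To see which jobs a dashboard depends on before installing it, the `job` selectors can be pulled out of its query expressions. This sketch walks a decoded dashboard and collects every string stored under an `expr` key; the sample dashboard is a made-up minimal structure, not the content of the real JSON files, and the helper names are hypothetical.

```python
import re

# Matches job="..." and job=~"..." selectors inside a PromQL expression.
JOB_RE = re.compile(r'\bjob\s*=~?\s*"([^"]+)"')

# Hypothetical, heavily trimmed dashboard for illustration.
SAMPLE_DASHBOARD = {
    "title": "example",
    "panels": [
        {"targets": [{"expr": 'up{job="integrations/unix"}'}]},
        {
            "panels": [
                {"targets": [{"expr": 'sum(up{job="game-dev-alloy"}) by (vmid)'}]}
            ]
        },
    ],
}


def iter_exprs(node):
    """Yield every string stored under an 'expr' key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from iter_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_exprs(item)


def jobs_in_dashboard(dashboard: dict) -> set:
    """Return the set of job names referenced by the dashboard's queries."""
    return {job for expr in iter_exprs(dashboard) for job in JOB_RE.findall(expr)}


if __name__ == "__main__":
    print(sorted(jobs_in_dashboard(SAMPLE_DASHBOARD)))
```

For the real files, the dict would come from `json.load` on each file under `/etc/zlh-monitor/grafana/dashboards`, and the result can be compared against the jobs that currently have up targets.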
## Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found.
Only journald/local logs for monitoring components are available on the monitoring host. Recent logs showed repeated discovery failures and Prometheus HTTP SD errors. No token values appeared in those log excerpts.
## Security state
Working controls:
- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
- Prometheus labels/scrape URLs checked did not expose the bearer token
Launch-blocking risk:
- Prometheus, Grafana, and node_exporter listen on all host interfaces (`*`)
- host firewall policy does not restrict access