diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md
new file mode 100644
index 0000000..680e0ea
--- /dev/null
+++ b/Codex/Monitoring/CURRENT_STATE.md
@@ -0,0 +1,140 @@
+# Monitoring — Current State
+
+## Launch readiness
+
+**Status: FAIL for launch monitoring readiness.**
+
+Core monitoring services are running, and Prometheus can see the API host via Alloy remote-written host metrics plus one current dev/node target from `/sd/exporters`. Launch-debug readiness is blocked by public exposure, stale/failing game/dev discovery, missing centralized logs, and dashboards not actually installed in Grafana.
+
+## Running services observed
+
+- `prometheus.service` — active/running
+- `grafana-server.service` — active/running
+- `alloy.service` — active/running
+- `prometheus-node-exporter.service` — active/running
+- `zlh-monitor-game-dev-discovery.service` — failed
+
+## Exposure state
+
+Observed listening ports:
+
+- `*:9090` Prometheus
+- `*:3000` Grafana
+- `*:9100` node_exporter
+- `127.0.0.1:12345` Alloy
+
+Host firewall posture observed:
+
+- UFW inactive
+- nftables input/forward/output policy accept
+
+This is a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
+
+## Prometheus state
+
+- Config path: `/etc/prometheus/prometheus.yml`
+- Global scrape interval: `15s`
+- Prometheus self-scrape: `5s`
+- Dynamic discovery refresh: `30s`
+- `/sd/exporters` is configured with bearer auth and works with token
+- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
+
+Token risk: the bearer token is stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
+
+## Targets/discovery state
+
+Current active targets observed: 12.
+
+Up targets include:
+
+- `prometheus`
+- `alloy`
+- local `node`
+- routers
+- monitor node
+- `zlh_node` VMID `6050`
+
+Problem targets observed:
+
+- `game-dev-alloy` VMID `5205`, `10.200.0.78:12345`, timeout
+- `game-dev-alloy` VMID `6084`, `10.100.0.22:12345`, timeout
+- `alloy-dev-6076`, `10.100.0.14:12345`, timeout
+- `cadvisor`, `10.60.0.11:8081`, no route
+- `zerolaghub-nodes`, `10.60.0.19:9100`, connection refused
+
+`/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` contains stale down targets `5205` and `6084`.
+
+The discovery sync currently fails because the API returns dev VMID `6050` on `10.60.0.220`, while the sync script only allows dev IPs in `10.100.0.0/16`. The service then refuses to write an empty discovery file.
+
+## API monitoring state
+
+API host metrics exist via `integrations/unix`:
+
+- `vmid=9020`
+- `role=API`
+- `instance=zlh-api`
+
+Observed API app health endpoint `/health` returned `404 Cannot GET /health`; no API health scrape was found.
+
+No centralized API logs were found on the monitoring host.
+
+## Agent/container monitoring state
+
+`/sd/exporters` returned one node exporter target:
+
+- `10.60.0.220:9100`, `vmid=6050`, `ctype=dev`, `job=zlh_node`
+
+File discovery still contains stale Alloy collector targets for `5205` and `6084`, both down.
+
+Labels currently observed include:
+
+- `vmid`
+- `ctype` or `container_type`
+- `game`
+- `variant`
+- `hostname` / `name`
+- `job`
+- `instance`
+
+Missing or incomplete for lifecycle debugging:
+
+- `ready` / `connectable`
+- `operationInProgress`
+- `operationType`
+- backup/restore state
+- code-server state
+
+## Grafana state
+
+Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
+
+- `core-routers.json`
+- `core-service-detail.json`
+- `core-services-overview.json`
+- `dev-containers.json`
+- `game-containers.json`
+
+However, Grafana DB showed no dashboards installed. Datasource exists:
+
+- `prometheus -> http://localhost:9090`
+
+Dashboard JSON queries reference `job="integrations/unix"` and `job="game-dev-alloy"`, but game/dev Alloy endpoint data is currently down/stale.
+
+## Logs state
+
+No Loki, Promtail, or log-ingesting Alloy pipeline was found.
+
+Only journald/local logs for monitoring components are available on the monitoring host. Recent logs showed repeated discovery failures and Prometheus HTTP SD errors. No token values appeared in those log excerpts.
+
+## Security state
+
+Working controls:
+
+- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
+- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
+- Prometheus labels/scrape URLs checked did not expose the bearer token
+
+Launch-blocking risk:
+
+- Prometheus, Grafana, and node_exporter bind publicly on the host network
+- host firewall policy does not restrict access
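
Reviewer note on the discovery failure recorded under "Targets/discovery state": the behaviour described there (a subnet allowlist, followed by a refusal to write an empty file) can be sketched roughly as below. This is not the actual sync script, which was not part of the audit excerpt. The names `ALLOWED_DEV_NET`, `filter_dev_targets`, and `plan_sync`, and the `{"vmid", "ip"}` record shape, are hypothetical and chosen only for illustration. The addresses and the `10.100.0.0/16` range are the ones recorded in the audit.

```python
import ipaddress

# Assumption: the allowlisted dev range, as stated in the audit notes.
ALLOWED_DEV_NET = ipaddress.ip_network("10.100.0.0/16")


def filter_dev_targets(targets, allowed=ALLOWED_DEV_NET):
    """Keep only targets whose IP falls inside the allowed network."""
    return [t for t in targets if ipaddress.ip_address(t["ip"]) in allowed]


def plan_sync(targets, allowed=ALLOWED_DEV_NET):
    """Return ("write", kept) or ("refuse", []) when nothing survives the filter.

    Mirrors the described behaviour: an empty result is never written,
    so the previous (possibly stale) discovery file stays in place.
    """
    kept = filter_dev_targets(targets, allowed)
    if not kept:
        return ("refuse", kept)
    return ("write", kept)


# Hypothetical record shape; VMID and IP are the ones the API returned.
api_response = [{"vmid": 6050, "ip": "10.60.0.220"}]
print(plan_sync(api_response))  # ('refuse', [])
```

Under these assumptions, `10.60.0.220` is filtered out, the kept list is empty, and the sync refuses to write. That is consistent with the stale entries for `5205` and `6084` remaining in the file_sd JSON.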