zlh-grind/Codex/Monitoring/CURRENT_STATE.md

Monitoring — Current State

Launch readiness

Status: FAIL for launch monitoring readiness.

Core monitoring services are running. Prometheus sees the API host through host metrics that Alloy remote-writes, plus one current dev node target from /sd/exporters. Launch-debug readiness is blocked by four issues: public exposure of the monitoring ports, stale or failing game/dev discovery, missing centralized logs, and dashboards that are not installed in Grafana.

Running services observed

  • prometheus.service — active/running
  • grafana-server.service — active/running
  • alloy.service — active/running
  • prometheus-node-exporter.service — active/running
  • zlh-monitor-game-dev-discovery.service — failed

Exposure state

Observed listening ports:

  • *:9090 Prometheus
  • *:3000 Grafana
  • *:9100 node_exporter
  • 127.0.0.1:12345 Alloy

Host firewall posture observed:

  • UFW inactive
  • nftables input/forward/output policy accept

This is a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
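
The difference between the wildcard listeners and the loopback-only Alloy listener can be checked mechanically. The following is a minimal Python sketch, assuming listener addresses are available as host:port strings in the form listed above. The function name is_exposed and the listeners mapping are illustrative and not part of the audit tooling.

```python
import ipaddress

def is_exposed(listen_addr: str) -> bool:
    """Return True if a 'host:port' listener is reachable beyond loopback."""
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::]:9090
    # '*', '0.0.0.0' and '::' all mean "every interface".
    if host in ("*", "0.0.0.0", "::"):
        return True
    return not ipaddress.ip_address(host).is_loopback

# Listener list copied from the observed ports in this document.
listeners = {
    "*:9090": "Prometheus",
    "*:3000": "Grafana",
    "*:9100": "node_exporter",
    "127.0.0.1:12345": "Alloy",
}

exposed = sorted(name for addr, name in listeners.items() if is_exposed(addr))
print(exposed)
```

With the observed ports, only Alloy is excluded from the exposed list. Whether an exposed listener is actually reachable still depends on the firewall posture, which this sketch does not inspect.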

Prometheus state

  • Config path: /etc/prometheus/prometheus.yml
  • Global scrape interval: 15s
  • Prometheus self-scrape: 5s
  • Dynamic discovery refresh: 30s
  • /sd/exporters is configured with bearer auth and responds successfully when the token is supplied
  • unauthenticated requests to /sd/exporters and /monitoring/game-dev return 401 Unauthorized

Token risk: the bearer token is stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
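
One way to start tightening storage is to check whether the files that hold the token are readable by users other than the owner. The following Python sketch is an assumption about how such a check might look, not something the audit ran. The helper name is_group_or_world_readable is hypothetical, and the demonstration uses a throwaway temporary file rather than the real config paths.

```python
import os
import stat
import tempfile

def is_group_or_world_readable(path: str) -> bool:
    """Return True if the file's mode grants read access to group or others."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))

# Demonstration on a temporary file (POSIX permissions assumed).
with tempfile.NamedTemporaryFile(delete=False) as fh:
    demo_path = fh.name

os.chmod(demo_path, 0o644)  # owner rw, group/other read
loose = is_group_or_world_readable(demo_path)

os.chmod(demo_path, 0o600)  # owner rw only
tight = is_group_or_world_readable(demo_path)

os.remove(demo_path)
print(loose, tight)
```

Prometheus also accepts authorization.credentials_file in place of an inline credential, which would keep the token out of prometheus.yml and its backups. Whether that fits this deployment has not been verified.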

Targets/discovery state

Current active targets observed: 12.

Up targets include:

  • prometheus
  • alloy
  • local node
  • routers
  • monitor node
  • zlh_node VMID 6050

Problem targets observed:

  • game-dev-alloy VMID 5205, 10.200.0.78:12345, timeout
  • game-dev-alloy VMID 6084, 10.100.0.22:12345, timeout
  • alloy-dev-6076, 10.100.0.14:12345, timeout
  • cadvisor, 10.60.0.11:8081, no route
  • zerolaghub-nodes, 10.60.0.19:9100, connection refused

/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json contains stale down targets 5205 and 6084.

The discovery sync currently fails because the API returns dev VMID 6050 on 10.60.0.220, while the sync script only allows dev IPs in 10.100.0.0/16. The filter rejects the only returned target, and the service then refuses to write an empty discovery file, so the existing file is not updated.
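
The failure mode can be reproduced in a few lines. This is a minimal Python sketch of the described behaviour, not the actual sync script. The function names, the shape of the API response, and the error message are assumptions. The network range and the target IP come from the audit.

```python
import ipaddress

# Allowed dev range as reported in the audit.
ALLOWED_DEV_NET = ipaddress.ip_network("10.100.0.0/16")

def filter_dev_targets(targets):
    """Keep only targets whose IP falls inside the allowed dev network."""
    return [t for t in targets
            if ipaddress.ip_address(t["ip"]) in ALLOWED_DEV_NET]

def sync(targets):
    """Mimic the observed behaviour: refuse to write an empty discovery file."""
    kept = filter_dev_targets(targets)
    if not kept:
        raise RuntimeError("refusing to write empty discovery file")
    return kept

# Hypothetical response shape; VMID and IP are the values observed in the audit.
api_response = [{"vmid": 6050, "ip": "10.60.0.220"}]

try:
    sync(api_response)
    outcome = "written"
except RuntimeError as exc:
    outcome = str(exc)

print(outcome)
```

Because 10.60.0.220 lies outside 10.100.0.0/16, the filtered list is empty and the sketch raises instead of writing. A target such as 10.100.0.22 would pass the same filter.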

API monitoring state

API host metrics exist via integrations/unix:

  • vmid=9020
  • role=API
  • instance=zlh-api

The API app health endpoint /health returned 404 (Cannot GET /health), and no API health scrape was found.

No centralized API logs were found on the monitoring host.

Agent/container monitoring state

/sd/exporters returned one node exporter target:

  • 10.60.0.220:9100, vmid=6050, ctype=dev, job=zlh_node

File discovery still contains stale Alloy collector targets for 5205 and 6084, both down.
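
Stale entries of this kind can be found by comparing the file discovery contents with the VMIDs the API currently reports. The following Python sketch assumes the file uses the standard Prometheus file_sd JSON layout (a list of groups with targets and labels) and that each group carries a vmid label. The sample data is modelled on the targets in this document and is not a copy of the real file.

```python
import json

def stale_entries(file_sd_json: str, live_vmids: set) -> list:
    """Return file_sd target groups whose vmid label is not in live_vmids."""
    groups = json.loads(file_sd_json)
    return [g for g in groups
            if g.get("labels", {}).get("vmid") not in live_vmids]

# Hypothetical file contents; label values are strings, as in file_sd.
sample = json.dumps([
    {"targets": ["10.200.0.78:12345"], "labels": {"vmid": "5205"}},
    {"targets": ["10.100.0.22:12345"], "labels": {"vmid": "6084"}},
])

# /sd/exporters currently returns only VMID 6050.
stale = stale_entries(sample, live_vmids={"6050"})
stale_vmids = [g["labels"]["vmid"] for g in stale]
print(stale_vmids)
```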

Labels currently observed include:

  • vmid
  • ctype or container_type
  • game
  • variant
  • hostname / name
  • job
  • instance

Missing or incomplete for lifecycle debugging:

  • ready / connectable
  • operationInProgress
  • operationType
  • backup/restore state
  • code-server state
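
A label gap like this can be reported per target with a set difference. In the Python sketch below, the label keys backup_state and code_server_state are hypothetical stand-ins, since the list above names the states rather than exact label keys. The example target uses the labels observed for VMID 6050.

```python
# Hypothetical label keys for the lifecycle states listed above.
REQUIRED_LIFECYCLE_LABELS = {
    "ready",
    "operationInProgress",
    "operationType",
    "backup_state",       # assumed key name
    "code_server_state",  # assumed key name
}

def missing_lifecycle_labels(labels: dict) -> set:
    """Return the required lifecycle labels that a target does not carry."""
    return REQUIRED_LIFECYCLE_LABELS - set(labels)

# Labels observed on the one target returned by /sd/exporters.
target_labels = {
    "instance": "10.60.0.220:9100",
    "vmid": "6050",
    "ctype": "dev",
    "job": "zlh_node",
}

missing = missing_lifecycle_labels(target_labels)
print(sorted(missing))
```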

Grafana state

Dashboard JSON files exist under /etc/zlh-monitor/grafana/dashboards:

  • core-routers.json
  • core-service-detail.json
  • core-services-overview.json
  • dev-containers.json
  • game-containers.json

However, the Grafana database showed no dashboards installed. One datasource exists:

  • prometheus -> http://localhost:9090

Dashboard JSON queries reference job="integrations/unix" and job="game-dev-alloy", but game/dev Alloy endpoint data is currently down/stale.
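
The job names a dashboard depends on can be extracted from its JSON and compared with the jobs that currently have healthy targets. This Python sketch assumes the common Grafana layout of top-level panels, each with targets that carry a PromQL expr. Panels nested inside rows are not handled. The sample dashboard and its queries are illustrative, not copied from the files listed above.

```python
import json
import re

JOB_RE = re.compile(r'job="([^"]+)"')

def jobs_referenced(dashboard_json: str) -> set:
    """Collect job="..." selectors from every top-level panel query."""
    dashboard = json.loads(dashboard_json)
    jobs = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            jobs.update(JOB_RE.findall(target.get("expr", "")))
    return jobs

# Hypothetical dashboard with two panels.
sample = json.dumps({
    "panels": [
        {"targets": [{"expr": 'node_load1{job="integrations/unix"}'}]},
        {"targets": [{"expr": 'up{job="game-dev-alloy"}'}]},
    ]
})

referenced = jobs_referenced(sample)
# Jobs assumed healthy for this example; game-dev-alloy targets are down.
healthy_jobs = {"integrations/unix"}
unbacked = referenced - healthy_jobs
print(sorted(unbacked))
```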

Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found.

Only journald/local logs for monitoring components are available on the monitoring host. Recent logs showed repeated discovery failures and Prometheus HTTP SD errors. No token values appeared in those log excerpts.

Security state

Working controls:

  • Grafana anonymous access is disabled; unauthenticated /api/search returned 401
  • discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
  • Prometheus labels/scrape URLs checked did not expose the bearer token

Launch-blocking risk:

  • Prometheus, Grafana, and node_exporter listen on all interfaces (*:9090, *:3000, *:9100)
  • host firewall policy does not restrict access