Monitoring — Current State
Launch readiness
Status: FAIL for launch monitoring readiness.
Core monitoring services are running. Prometheus can see the API host through host metrics remote-written by Alloy, plus one current dev/node target from /sd/exporters. Launch-debug readiness is blocked by four issues: public exposure, stale or failing game/dev discovery, missing centralized logs, and dashboards that are not actually installed in Grafana.
Running services observed
- prometheus.service: active/running
- grafana-server.service: active/running
- alloy.service: active/running
- prometheus-node-exporter.service: active/running
- zlh-monitor-game-dev-discovery.service: failed
Exposure state
Observed listening ports:
- *:9090 Prometheus
- *:3000 Grafana
- *:9100 node_exporter
- 127.0.0.1:12345 Alloy
Host firewall posture observed:
- UFW inactive
- nftables input/forward/output policy accept
This is a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
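The exposure check above can be sketched as a small classifier over the observed listen addresses. This is a minimal illustration, not a tool used in the audit: the function name `is_exposed` is hypothetical, and "exposed" here only means "bound to something other than loopback", which matters because no host firewall is filtering traffic.

```python
import ipaddress

def is_exposed(listen_addr: str) -> bool:
    """Return True if a host:port listen address is bound beyond loopback."""
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::]:3000
    if host in ("*", "0.0.0.0", "::", ""):
        return True  # wildcard bind: every interface
    return not ipaddress.ip_address(host).is_loopback

# Listening sockets as reported in this audit.
observed = {
    "Prometheus": "*:9090",
    "Grafana": "*:3000",
    "node_exporter": "*:9100",
    "Alloy": "127.0.0.1:12345",
}

exposed = sorted(name for name, addr in observed.items() if is_exposed(addr))
print(exposed)
```

With the ports listed above, only Alloy is bound to loopback; the other three services are flagged.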
Prometheus state
- Config path: /etc/prometheus/prometheus.yml
- Global scrape interval: 15s
- Prometheus self-scrape: 5s
- Dynamic discovery refresh: 30s
- /sd/exporters is configured with bearer auth and works with token
- unauthenticated /sd/exporters and /monitoring/game-dev return 401 unauthorized
Token risk: the bearer token is stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
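One way to track the token-storage risk is to scan config text for secrets written inline rather than referenced through a file. The sketch below assumes the token appears under one of Prometheus's inline secret keys (`credentials`, `bearer_token`, or `password`); the audit does not say which form is used. The sample YAML, its URL, and the secret file path are hypothetical placeholders. The function reports only the key and line number, never the value.

```python
import re

# Keys that carry a secret inline in a Prometheus config, as opposed to the
# *_file variants (credentials_file, bearer_token_file, password_file).
INLINE_SECRET_KEYS = ("bearer_token", "credentials", "password")

_PATTERN = re.compile(
    r"^[ \t]*(?:-[ \t]*)?(" + "|".join(INLINE_SECRET_KEYS) + r")[ \t]*:[ \t]*\S+",
    re.MULTILINE,
)

def find_inline_secrets(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, key) for each inline secret found."""
    hits = []
    for match in _PATTERN.finditer(config_text):
        line_no = config_text.count("\n", 0, match.start(1)) + 1
        hits.append((line_no, match.group(1)))
    return hits

# Hypothetical config fragment; not copied from the audited host.
sample = """\
scrape_configs:
  - job_name: zlh_node
    http_sd_configs:
      - url: http://api.example.invalid/sd/exporters
        authorization:
          type: Bearer
          credentials: PLACEHOLDER_TOKEN
  - job_name: other
    authorization:
      credentials_file: /etc/prometheus/secrets/sd_token
"""

print(find_inline_secrets(sample))
```

The second job in the sample uses `credentials_file`, so it is not flagged. The same scan could be pointed at the backup configs mentioned above.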
Targets/discovery state
Current active targets observed: 12.
Up targets include:
- prometheus
- alloy
- local node
- routers
- monitor node
- zlh_node VMID 6050
Problem targets observed:
- game-dev-alloy VMID 5205, 10.200.0.78:12345, timeout
- game-dev-alloy VMID 6084, 10.100.0.22:12345, timeout
- alloy-dev-6076, 10.100.0.14:12345, timeout
- cadvisor, 10.60.0.11:8081, no route
- zerolaghub-nodes, 10.60.0.19:9100, connection refused
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json contains stale down targets 5205 and 6084.
The discovery sync currently fails because the API returns dev VMID 6050 on 10.60.0.220, while the sync script only allows dev IPs in 10.100.0.0/16. The service then refuses to write an empty discovery file.
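The failure mode described above can be reproduced with a short sketch. This is not the actual sync script: the record shape and the function names `filter_dev_targets` and `build_discovery` are hypothetical. Only the allowed range 10.100.0.0/16, the returned address 10.60.0.220, and the refusal to write an empty file come from the audit.

```python
import ipaddress

# Allowed dev range as described in the audit.
ALLOWED_DEV_NET = ipaddress.ip_network("10.100.0.0/16")

def filter_dev_targets(records):
    """Split API records into (accepted, rejected) by the dev IP allowlist."""
    accepted, rejected = [], []
    for record in records:
        ip = ipaddress.ip_address(record["ip"])
        (accepted if ip in ALLOWED_DEV_NET else rejected).append(record)
    return accepted, rejected

def build_discovery(records):
    """Return targets to write, refusing to produce an empty discovery file."""
    accepted, rejected = filter_dev_targets(records)
    if not accepted:
        raise RuntimeError(
            f"refusing to write empty discovery file; rejected {len(rejected)} target(s)"
        )
    return accepted

# Hypothetical API response matching the observed dev container.
api_response = [{"vmid": 6050, "ctype": "dev", "ip": "10.60.0.220"}]

try:
    build_discovery(api_response)
except RuntimeError as err:
    print(f"sync failed: {err}")
```

Under these assumptions the only returned target is rejected, the accepted list is empty, and the sync aborts. That would leave the previous file, with its stale entries, in place.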
API monitoring state
API host metrics exist via integrations/unix:
- vmid=9020
- role=API
- instance=zlh-api
Observed API app health endpoint /health returned 404 Cannot GET /health; no API health scrape was found.
No centralized API logs were found on the monitoring host.
Agent/container monitoring state
/sd/exporters returned one node exporter target:
- 10.60.0.220:9100, vmid=6050, ctype=dev, job=zlh_node
File discovery still contains stale Alloy collector targets for 5205 and 6084, both down.
Labels currently observed include:
- vmid
- ctype or container_type
- game
- variant
- hostname/name
- job
- instance
Missing or incomplete for lifecycle debugging:
- ready/connectable
- operationInProgress
- operationType
- backup/restore state
- code-server state
Grafana state
Dashboard JSON files exist under /etc/zlh-monitor/grafana/dashboards:
- core-routers.json
- core-service-detail.json
- core-services-overview.json
- dev-containers.json
- game-containers.json
However, Grafana DB showed no dashboards installed. Datasource exists:
prometheus -> http://localhost:9090
Dashboard JSON queries reference job="integrations/unix" and job="game-dev-alloy", but game/dev Alloy endpoint data is currently down/stale.
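To see which jobs a dashboard depends on before installing it, the job selectors can be pulled out of its queries. The sketch below assumes the common Grafana layout where Prometheus queries live in string fields named `expr`. The sample dashboard is a hypothetical stand-in, not one of the files listed above.

```python
import re

# Matches job="..." and job=~"..." label selectors in a PromQL expression.
JOB_SELECTOR = re.compile(r'\bjob\s*=~?\s*"([^"]+)"')

def jobs_referenced(dashboard: dict) -> set[str]:
    """Collect job label values from every string field named 'expr'."""
    found: set[str] = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "expr" and isinstance(value, str):
                    found.update(JOB_SELECTOR.findall(value))
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(dashboard)
    return found

# Hypothetical dashboard body; real files would be read with json.load().
sample_dashboard = {
    "title": "sample",
    "panels": [
        {"targets": [{"expr": 'up{job="integrations/unix"}'}]},
        {"targets": [{"expr": 'up{job="game-dev-alloy", vmid="5205"}'}]},
    ],
}

print(sorted(jobs_referenced(sample_dashboard)))
```

Comparing this set with the jobs that currently have live targets would show which panels will render empty once the dashboards are installed.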
Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found.
Only journald/local logs for monitoring components are available on the monitoring host. Recent logs showed repeated discovery failures and Prometheus HTTP SD errors. No token values appeared in those log excerpts.
Security state
Working controls:
- Grafana anonymous access is disabled; unauthenticated /api/search returned 401
- discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
- Prometheus labels/scrape URLs checked did not expose the bearer token
Launch-blocking risk:
- Prometheus, Grafana, and node_exporter bind publicly on the host network
- host firewall policy does not restrict access