Monitoring — Current State
Launch readiness
Status: PARTIAL FAIL for launch monitoring readiness.
Core monitoring services are running. The game/dev dynamic discovery empty state is now clean, and the stale game-dev-alloy targets for VMIDs 5205 and 6084 disappeared from Prometheus after a successful empty sync.
Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.
Recently verified progress
Container discovery is clean:
- /sd/exporters -> []
- /monitoring/game-dev -> containers: []
- monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
- one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
- Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone
This resolves the previous dynamic discovery empty-state/stale game-dev-alloy blocker.
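The empty-state behaviour can be sketched as follows. This is a minimal illustration, not the actual sync script: the function name write_file_sd and its allow_empty parameter are hypothetical stand-ins for whatever the discovery service does when ALLOW_EMPTY="true" is set. It shows one way to write a valid empty file_sd file atomically so that Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(targets, path, allow_empty=False):
    """Atomically write a Prometheus file_sd JSON file (hypothetical helper).

    targets is a list of {"targets": [...], "labels": {...}} groups.
    An empty list is refused unless allow_empty is True, which is how
    this sketch models the ALLOW_EMPTY="true" setting.
    """
    if not targets and not allow_empty:
        raise ValueError("refusing to write an empty target list")

    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the
    # destination, so readers see either the old file or the new one.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(targets, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file with mode 0600; relax it so a
        # Prometheus process running as another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return len(targets)
```

With allow_empty=True and no containers, the file content is the JSON array [], which Prometheus treats as zero targets and uses to drop previously discovered ones.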
Running services observed from prior audit
- prometheus.service: active/running
- grafana-server.service: active/running
- alloy.service: active/running
- prometheus-node-exporter.service: active/running
zlh-monitor-game-dev-discovery.service was previously failed, but the latest manual sync succeeded after enabling ALLOW_EMPTY="true". Verify timer/service health after the next scheduled run.
Exposure state
Observed listening ports from prior audit:
- *:9090 Prometheus
- *:3000 Grafana
- *:9100 node_exporter
- 127.0.0.1:12345 Alloy
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept
This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
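As a rough aid for re-checking this after remediation, the sketch below classifies listen addresses as loopback-only or exposed. The helper names is_exposed and exposed_services are hypothetical, and the input format assumes "host:port" strings as shown in the port list above. It only inspects bind addresses and does not evaluate firewall rules.

```python
import ipaddress


def is_exposed(listen_addr):
    """Return True if a 'host:port' listen address is not loopback-only."""
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::]:9090
    if host in ("*", ""):
        return True  # wildcard bind: reachable on every interface
    # Raises ValueError for hostnames; this sketch expects IP literals.
    return not ipaddress.ip_address(host).is_loopback


def exposed_services(listeners):
    """Return sorted names of services whose bind address is exposed."""
    return sorted(name for name, addr in listeners.items() if is_exposed(addr))


# Listener data copied from the audit output above.
LISTENERS = {
    "Prometheus": "*:9090",
    "Grafana": "*:3000",
    "node_exporter": "*:9100",
    "Alloy": "127.0.0.1:12345",
}
```

Calling exposed_services(LISTENERS) on the audited state reports Prometheus, Grafana, and node_exporter; the blocker is cleared for a service once it binds to loopback or a restricted interface, or a firewall rule limits access to it.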
Prometheus state
- Config path: /etc/prometheus/prometheus.yml
- Global scrape interval: 15s
- Prometheus self-scrape: 5s
- Dynamic discovery refresh: 30s
- /sd/exporters is configured with bearer auth and works with token
- unauthenticated /sd/exporters and /monitoring/game-dev return 401 unauthorized
Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
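One common way to tighten storage is to move the token into a root-readable file and reference it with credentials_file in the Prometheus authorization block instead of an inline credentials value. The sketch below is a hypothetical helper (find_inline_tokens is not part of any existing tooling) that flags config lines still carrying an inline token. It uses a plain line scan rather than a YAML parser, so treat it as a quick check, not a full audit.

```python
import re

# Matches inline secrets such as "credentials: abc" or "bearer_token: abc".
# "credentials_file:" and "bearer_token_file:" do not match, because the
# key must be followed directly by optional whitespace and a colon.
INLINE_SECRET = re.compile(r"^\s*(credentials|bearer_token)\s*:\s*\S")


def find_inline_tokens(config_text):
    """Return (line_number, key) pairs for inline token settings.

    The token value itself is deliberately not returned, so the
    result is safe to log.
    """
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # ignore commented-out settings
        match = INLINE_SECRET.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Running this over /etc/prometheus/prometheus.yml and its backup copies gives a list of places to clean up; an empty result means no inline credentials or bearer_token keys remain in that file.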
Targets/discovery state
Resolved:
- game-dev-alloy VMID 5205 is gone
- game-dev-alloy VMID 6084 is gone
- empty discovery writes a valid empty file_sd file instead of failing
Remaining problem targets to remove/fix:
- alloy-dev-6076, 10.100.0.14:12345, hardcoded/down
- cadvisor, 10.60.0.11:8081, down/no route
- zerolaghub-nodes, 10.60.0.19:9100, down/connection refused
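To confirm when these targets have been removed or fixed, the remaining down targets can be listed from the Prometheus HTTP API. The sketch below assumes the standard GET /api/v1/targets response shape (data.activeTargets entries with labels, health, and lastError); the function names and the base URL argument are illustrative.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """Fetch and parse the /api/v1/targets response from Prometheus."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def down_targets(targets_response):
    """Return every active target whose health is not 'up'."""
    rows = []
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            rows.append({
                "job": labels.get("job", ""),
                "instance": labels.get("instance", ""),
                "health": target.get("health", "unknown"),
                "last_error": target.get("lastError", ""),
            })
    # Sort for stable output when comparing runs.
    return sorted(rows, key=lambda row: (row["job"], row["instance"]))
```

For example, down_targets(fetch_targets("http://localhost:9090")) run on the monitoring host should return an empty list once the three entries above are resolved.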
API monitoring state
API host metrics exist via integrations/unix:
- vmid=9020
- role=API
- instance=zlh-api
The API app health endpoint /health previously returned 404 (Cannot GET /health), and no API health scrape was found.
No centralized API logs were found on the monitoring host.
Agent/container monitoring state
Current dynamic game/dev discovery is empty and clean.
Expected next validation is lifecycle smoke testing:
- create a game and/or dev container
- confirm target appears
- delete it
- confirm target disappears now that empty discovery works
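The appear/disappear checks in the steps above can be sketched as a small polling helper. Everything here is an assumption for illustration: wait_for_target is a hypothetical name, and list_instances is a caller-supplied function that returns the set of instance labels currently known to Prometheus (for example, built from the targets API). The default timeout only needs to exceed the 30s discovery refresh plus a scrape interval.

```python
import time


def wait_for_target(list_instances, instance, present,
                    timeout=120.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until `instance` is present (or absent) in list_instances().

    Returns True if the desired state was observed before the timeout,
    False otherwise. sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        if (instance in list_instances()) == present:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A smoke test would call it with present=True after creating a container and with present=False after deleting it, failing the run if either call returns False.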
Labels previously observed include:
- vmid
- ctype or container_type
- game
- variant
- hostname/name
- job
- instance
Missing or incomplete for lifecycle debugging:
- ready/connectable
- operationInProgress
- operationType
- backup/restore state
- code-server state
Grafana state
Dashboard JSON files exist under /etc/zlh-monitor/grafana/dashboards:
- core-routers.json
- core-service-detail.json
- core-services-overview.json
- dev-containers.json
- game-containers.json
However, Grafana DB previously showed no dashboards installed. Datasource exists:
prometheus -> http://localhost:9090
Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
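One way to verify this is to compare the dashboard files on disk against what Grafana reports through its /api/search endpoint. The sketch below is hypothetical: it assumes each dashboard JSON file carries a top-level uid field and that search_results is the parsed JSON list returned by an authenticated /api/search call, where dashboards have type "dash-db". Files without a uid are reported as missing because they cannot be matched.

```python
import json
from pathlib import Path


def missing_dashboards(dashboard_dir, search_results):
    """Return dashboard file names whose uid Grafana does not report."""
    installed = {
        item.get("uid")
        for item in search_results
        if item.get("type") == "dash-db"  # skip folders
    }
    missing = []
    for path in sorted(Path(dashboard_dir).glob("*.json")):
        uid = json.loads(path.read_text()).get("uid")
        if uid is None or uid not in installed:
            missing.append(path.name)
    return missing
```

An empty result for /etc/zlh-monitor/grafana/dashboards would indicate that all five dashboards are installed; given the prior audit, the current expectation is that all five file names are returned.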
Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
Security state
Working controls from prior audit:
- Grafana anonymous access is disabled; unauthenticated /api/search returned 401
- discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
- Prometheus labels/scrape URLs checked did not expose the bearer token
Launch-blocking risk:
- Prometheus, Grafana, and node_exporter bind publicly on the host network
- host firewall policy does not restrict access