Monitoring — Current State
Launch readiness
Status: PARTIAL FAIL for launch monitoring readiness.
Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale game-dev-alloy VMIDs 5205 and 6084 are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the /etc/zlh-monitor runtime source-of-truth path.
Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
Runtime source of truth
/etc/zlh-monitor is now the runtime source of truth.
Prometheus now runs from /etc/zlh-monitor/prometheus/prometheus.yml:
- configured via /etc/default/prometheus
- backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak
Grafana now provisions from /etc/zlh-monitor/grafana/provisioning:
- configured via /etc/default/grafana-server
- backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak
/etc/zlh-monitor/README.md was updated to describe the new single-source layout.
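For reference, the redirection happens through the stock defaults files the packaged units already read. A minimal sketch of the relevant lines, assuming Debian-style units (the exact variable names may differ on this install):

```sh
# /etc/default/prometheus (sketch; assumes the packaged unit passes $ARGS to prometheus)
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml"

# /etc/default/grafana-server (sketch; assumes the packaged defaults expose PROVISIONING_CFG_DIR)
PROVISIONING_CFG_DIR=/etc/zlh-monitor/grafana/provisioning
```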
Recently verified progress
Container discovery is clean:
- /sd/exporters -> []
- /monitoring/game-dev -> containers: []
- monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
- one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
- Prometheus refreshed; stale game-dev-alloy targets 5205 and 6084 are gone
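For reference, the empty file_sd output is literally `[]`, which Prometheus accepts as zero targets. Once discovery finds containers, each entry follows the standard file_sd shape (the address and label values below are illustrative only, not observed data):

```json
[
  {
    "targets": ["10.60.0.x:12345"],
    "labels": { "vmid": "1234", "ctype": "game" }
  }
]
```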
Prometheus runtime config is clean:
- config validates with promtool
- active jobs are now only:
  - prometheus
  - alloy -> 127.0.0.1:12345
  - game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
- no active node_exporter job
- no active cAdvisor job
- no active :9100 jobs
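For orientation, the active job list above implies a scrape config of roughly this shape (a sketch, not the verbatim file; intervals, relabeling, and any auth blocks are omitted):

```yaml
# Sketch only; the real /etc/zlh-monitor/prometheus/prometheus.yml may differ in detail
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]
  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```

The validation referenced above is the standard `promtool check config /etc/zlh-monitor/prometheus/prometheus.yml`.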
Current Prometheus targets are clean:
- up alloy 127.0.0.1:12345
- up prometheus localhost:9090
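A quick way to re-confirm this target list from the monitoring host (standard Prometheus HTTP API; assumes jq is available):

```sh
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | "\(.health) \(.labels.job) \(.labels.instance)"'
```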
Grafana provisioning is live:
- one source-managed datasource exists: uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
- provisioned dashboards:
  - Core Service - Detail
  - Core Services - Overview
  - Dev Containers
  - Game Containers
- legacy node_exporter router dashboard archived to /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
Services active after cleanup:
- prometheus
- grafana-server
- alloy
- discovery timer
Exposure state
Observed listening ports from prior audit:
- *:9090 (Prometheus)
- *:3000 (Grafana)
- *:9100 (node_exporter)
- 127.0.0.1:12345 (Alloy)
Prometheus active jobs no longer include node_exporter or :9100 scraping, but network exposure still needs separate verification/lockdown.
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept
This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
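Until that lockdown happens, re-verification and one possible restriction path look roughly like this (sketch only; <ADMIN_NET> is a placeholder for the trusted admin/monitoring subnet, and SSH must be allowed before enabling any default-deny policy):

```sh
# Re-check which monitoring ports still listen, and on which addresses
ss -tlnp | grep -E ':(9090|3000|9100|12345)\b'

# Possible UFW-based restriction (sketch; <ADMIN_NET> is a placeholder)
ufw allow OpenSSH
ufw default deny incoming
ufw allow from <ADMIN_NET> to any port 9090 proto tcp   # Prometheus
ufw allow from <ADMIN_NET> to any port 3000 proto tcp   # Grafana
ufw enable
```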
Prometheus state
- Runtime config path: /etc/zlh-monitor/prometheus/prometheus.yml
- Config validates with promtool
- Active jobs: prometheus, alloy, game-dev-alloy
- Current targets:
  - up alloy 127.0.0.1:12345
  - up prometheus localhost:9090
- /sd/exporters is configured with bearer auth and works with the token
- unauthenticated /sd/exporters and /monitoring/game-dev return 401 Unauthorized
Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
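If the token is still inlined in YAML anywhere, one way to tighten storage is to reference a root-only file instead (a sketch; the secrets path below is an assumption, not an existing file), and to restrict /etc/zlh-monitor/game-dev-discovery.env to root:root 0600 for the sync script's copy:

```yaml
# In any scrape or HTTP SD block that still needs the token, point at a file instead of inlining it
authorization:
  type: Bearer
  credentials_file: /etc/zlh-monitor/secrets/discovery-token   # assumed path, root:root 0600
```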
Targets/discovery state
Resolved:
- game-dev-alloy VMID 5205 is gone
- game-dev-alloy VMID 6084 is gone
- alloy-dev-6076 hardcoded/down target is gone from the active runtime target list
- cAdvisor 10.60.0.11:8081 is gone from the active runtime target list
- zerolaghub-nodes 10.60.0.19:9100 is gone from the active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing
No stale/static down targets are currently present in active Prometheus targets.
API monitoring state
API host/system metrics were previously available via integrations/unix, but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.
The API app health endpoint /health previously returned 404 (Cannot GET /health), and no API health scrape was found.
No centralized API logs were found on the monitoring host.
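If an application health scrape is added later, one common pattern is an HTTP probe via blackbox_exporter; that exporter is not installed here per the audit, and the API would first need to actually serve /health. Everything below (host name, port, module) is a placeholder sketch, not current config:

```yaml
scrape_configs:
  - job_name: api-health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://<api-host>/health   # placeholder; endpoint currently returns 404
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115    # assumed blackbox_exporter address
```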
Agent/container monitoring state
Current dynamic game/dev discovery is empty and clean.
Expected next validation is lifecycle smoke testing (see the check sketch after this list):
- create a game and/or dev container
- confirm target appears
- delete it
- confirm target disappears now that empty discovery works
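A minimal way to observe the appear/disappear behaviour during that smoke test, from the monitoring host (assumes jq; the create/delete itself goes through the normal platform flow):

```sh
# Number of target groups discovery wrote (0 while no game/dev containers exist)
jq length /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

# Active game-dev-alloy targets as Prometheus currently sees them
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | select(.labels.job=="game-dev-alloy") | "\(.health) \(.labels.instance)"'
```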
Labels previously observed include: vmid, ctype or container_type, game, variant, hostname/name, job, instance.
Missing or incomplete for lifecycle debugging:
- ready/connectable
- operationInProgress
- operationType
- backup/restore state
- code-server state
Grafana state
Grafana now provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.
Datasource (see the provisioning sketch at the end of this section):
- uid=prometheus
- name=Prometheus
- url=http://localhost:9090
- default datasource
Live provisioned dashboards:
- Core Service - Detail
- Core Services - Overview
- Dev Containers
- Game Containers
Archived:
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
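A sketch of the datasource provisioning file implied by the state above (the actual file under /etc/zlh-monitor/grafana/provisioning/datasources/ may differ in name and extra fields):

```yaml
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```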
Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
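If log centralization is addressed with the existing Alloy agent, a minimal journald-to-Loki pipeline would look roughly like this; Loki is not installed yet, so the endpoint is a placeholder, and component names should be checked against the installed Alloy version:

```alloy
// Sketch only: ship the host journal to a (not yet existing) Loki endpoint
loki.source.journal "host_journal" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://<loki-host>:3100/loki/api/v1/push"   // placeholder endpoint
  }
}
```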
Security state
Working controls from prior audit:
- Grafana anonymous access is disabled; unauthenticated /api/search returned 401
- discovery endpoints require a token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
- Prometheus labels/scrape URLs checked did not expose the bearer token
Launch-blocking risk:
- Prometheus and Grafana previously bound publicly on the host network
- node_exporter previously listened on *:9100; it is no longer scraped by the active Prometheus config, but listener exposure still needs verification/removal or firewalling
- host firewall policy did not restrict access in prior audit
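Possible remediation sketches for the two listener items (unit names and flags depend on how node_exporter and Prometheus were installed, so verify before applying):

```sh
# If node_exporter is genuinely unused, stop the *:9100 listener
# (unit is often node_exporter or prometheus-node-exporter depending on the install)
systemctl disable --now prometheus-node-exporter

# Alternatively/additionally, bind the UIs away from public interfaces:
#   Prometheus: add --web.listen-address=127.0.0.1:9090 to ARGS in /etc/default/prometheus
#   Grafana:    set http_addr = 127.0.0.1 under [server] in grafana.ini
# Localhost-only binding then requires an SSH tunnel or reverse proxy for admin access.
```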