# Monitoring: Current State

## Launch readiness

**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery empty-state is now clean, and stale `game-dev-alloy` VMIDs `5205` and `6084` have disappeared from Prometheus after a successful empty sync.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.

## Recently verified progress

Container discovery is clean:

- `/sd/exporters -> []`
- `/monitoring/game-dev -> containers: []`
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone

This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.

## Running services observed from prior audit

- `prometheus.service`: active/running
- `grafana-server.service`: active/running
- `alloy.service`: active/running
- `prometheus-node-exporter.service`: active/running

`zlh-monitor-game-dev-discovery.service` was previously failed, but the latest manual sync succeeded after enabling `ALLOW_EMPTY="true"`. Verify timer/service health after the next scheduled run.

## Exposure state

Observed listening ports from prior audit:

- `*:9090` Prometheus
- `*:3000` Grafana
- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy

Host firewall posture from prior audit:

- UFW inactive
- nftables input/forward/output policy accept

This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
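
The exposure finding above can be re-checked mechanically. The following is a minimal sketch, not the audit's actual tooling: it takes the listen addresses recorded in the prior audit as a hardcoded table and flags any that are not loopback-only. The function name `is_exposed` and the `OBSERVED` table are illustrative assumptions.

```python
import ipaddress

def is_exposed(listen_addr: str) -> bool:
    """Return True if the socket listens on all interfaces or on a non-loopback address."""
    # Split on the last colon so bracketed IPv6 forms like "[::]:9090" also work.
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")
    if host in ("*", "0.0.0.0", "::", ""):
        return True
    return not ipaddress.ip_address(host).is_loopback

# Listen addresses copied from the prior audit (hardcoded here for illustration).
OBSERVED = {
    "*:9090": "Prometheus",
    "*:3000": "Grafana",
    "*:9100": "node_exporter",
    "127.0.0.1:12345": "Alloy",
}

exposed = sorted(name for addr, name in OBSERVED.items() if is_exposed(addr))
print(exposed)  # ['Grafana', 'Prometheus', 'node_exporter']
```

An empty result from a check like this would not by itself clear the blocker; the host firewall posture still needs to be verified separately.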

## Prometheus state

- Config path: `/etc/prometheus/prometheus.yml`
- Global scrape interval: `15s`
- Prometheus self-scrape: `5s`
- Dynamic discovery refresh: `30s`
- `/sd/exporters` is configured with bearer auth and works with token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.

## Targets/discovery state

Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- empty discovery writes a valid empty file_sd file instead of failing

Remaining problem targets to remove/fix:

- `alloy-dev-6076`, `10.100.0.14:12345`, hardcoded/down
- `cadvisor`, `10.60.0.11:8081`, down/no route
- `zerolaghub-nodes`, `10.60.0.19:9100`, down/connection refused

## API monitoring state

API host metrics exist via `integrations/unix`:

- `vmid=9020`
- `role=API`
- `instance=zlh-api`

Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.

No centralized API logs were found on the monitoring host.

## Agent/container monitoring state

Current dynamic game/dev discovery is empty and clean.
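
For reference, the empty-state behaviour can be sketched as follows. This is an illustration under assumptions, not the actual sync script: the helper names `render_file_sd` and `write_file_sd` and the container keys `address` and `vmid` are hypothetical. It relies only on the fact that a Prometheus file_sd JSON file is a list of target groups, so an empty list is a valid file.

```python
import json
import os
import tempfile

def render_file_sd(containers, allow_empty=False):
    """Render containers as file_sd JSON; an empty list is only allowed on request."""
    if not containers and not allow_empty:
        raise ValueError("discovery returned no containers and ALLOW_EMPTY is not set")
    groups = [
        # "address" and "vmid" are assumed keys, chosen for this sketch.
        {"targets": [c["address"]], "labels": {"vmid": str(c["vmid"])}}
        for c in containers
    ]
    return json.dumps(groups, indent=2) + "\n"

def write_file_sd(path, containers, allow_empty=False):
    """Write the file via a temp file and rename so readers never see a partial file."""
    body = render_file_sd(containers, allow_empty)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(body)
        os.replace(tmp, path)  # replaces the old file in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return len(json.loads(body))  # number of target groups written
```

With `allow_empty=True` and no containers, the file contains `[]`, which is how stale targets get dropped on the next refresh instead of lingering after a failed sync.
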

Expected next validation is lifecycle smoke testing:

- create a game and/or dev container
- confirm target appears
- delete it
- confirm target disappears now that empty discovery works

Labels previously observed include:

- `vmid`
- `ctype` or `container_type`
- `game`
- `variant`
- `hostname` / `name`
- `job`
- `instance`

Missing or incomplete for lifecycle debugging:

- `ready` / `connectable`
- `operationInProgress`
- `operationType`
- backup/restore state
- code-server state

## Grafana state

Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:

- `core-routers.json`
- `core-service-detail.json`
- `core-services-overview.json`
- `dev-containers.json`
- `game-containers.json`

However, Grafana DB previously showed no dashboards installed.

Datasource exists:

- `prometheus -> http://localhost:9090`

Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.

## Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit. Only journald/local logs for monitoring components were available on the monitoring host.

This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

## Security state

Working controls from prior audit:

- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
- Prometheus labels/scrape URLs checked did not expose the bearer token

Launch-blocking risk:

- Prometheus, Grafana, and node_exporter bind publicly on the host network
- host firewall policy does not restrict access
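
The working controls above could be re-verified with a small probe. The following is a hedged sketch using only the Python standard library; the function names are hypothetical, and the hosts serving `/api/search`, `/sd/exporters`, and `/monitoring/game-dev` are not recorded in this document, so the caller must supply full URLs.

```python
import urllib.error
import urllib.request

def unauthenticated_status(url, opener=urllib.request.urlopen, timeout=5):
    """Return the HTTP status for a request sent with no credentials.

    Connection-level failures (urllib.error.URLError) are deliberately not
    caught: an unreachable endpoint is not the same as a protected one.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is what we want.
        return err.code

def find_open_endpoints(urls, opener=urllib.request.urlopen):
    """Return the URLs that did NOT answer 401 when called without credentials."""
    return [u for u in urls if unauthenticated_status(u, opener) != 401]
```

Calling `find_open_endpoints` with the three endpoint URLs should return an empty list if the controls still hold. The `opener` parameter exists so the check can be exercised against a stub without touching the network.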