# Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

## Recently resolved / verified

- `/etc/zlh-monitor` is now the runtime source of truth.
  - Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
  - Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
  - `/etc/zlh-monitor/README.md` describes the new single-source layout.
- Game/dev dynamic discovery empty-state now works.
  - `/sd/exporters -> []`
  - `/monitoring/game-dev -> containers: []`
  - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
  - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
  - Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
- Prometheus runtime config is clean.
  - validates with `promtool`
  - active jobs are only:
    - `prometheus`
    - `alloy -> 127.0.0.1:12345`
    - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
  - no active `node_exporter` job
  - no active cAdvisor job
  - no active `:9100` jobs
  - current active targets are clean:
    - `up alloy 127.0.0.1:12345`
    - `up prometheus localhost:9090`
- Remaining stale/static Prometheus targets are removed from active runtime config.
  - `alloy-dev-6076` no longer appears as an active runtime target
  - `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target
  - `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target
- Grafana provisioning is live.
  - source-managed datasource:
    - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
  - provisioned dashboards:
    - `Core Service - Detail`
    - `Core Services - Overview`
    - `Dev Containers`
    - `Game Containers`
  - legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
- Services active after cleanup:
  - `prometheus`
  - `grafana-server`
  - `alloy`
  - discovery timer

## Launch blockers

1. Restrict monitoring endpoint exposure.
   - Prometheus previously listened on `*:9090`.
   - Grafana previously listened on `*:3000`.
   - node_exporter previously listened on `*:9100`.
   - UFW was inactive and the nftables policy was accept in the prior audit.
   - Prometheus no longer scrapes node_exporter or any `:9100` jobs, but listener exposure still needs to be verified and either removed or firewalled.
   - Required fix: bind Prometheus/Grafana to loopback or the management VLAN only, remove/disable any unneeded node_exporter listener, and/or firewall to trusted admin/monitoring sources.
2. Add an API health scrape or equivalent application-level health signal.
   - Observed: `/health` previously returned `404 Cannot GET /health`.
   - Acceptance: Prometheus has an API app health target or an equivalent application-level health signal.
3. Add launch-debug lifecycle telemetry.
   - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
   - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
4. Add centralized logs.
   - No Loki, Promtail, or log-ingesting Alloy config was found in the prior audit.
   - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
5. Tighten bearer token storage.
   - The token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
   - Acceptance: token is stored in a tight-permission env file or systemd credentials; not world-readable; not leaked in labels, URLs, or logs.
6. Run a lifecycle smoke test once the remaining blockers are controlled enough.
   - Create a game container and/or dev container.
   - Confirm the target appears.
   - Delete it.
   - Confirm the target disappears, now that empty discovery works.

## Standardization work

- Standardize labels: prefer a canonical `container_type` over the current mix of `ctype` / `container_type`.
- Decide a canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Keep optional metadata useful but not noisy.
- Review any user/customer-identifying labels before dashboard or alert exposure.

## Alerts to add

- discovery sync failed
- file_sd file stale / not updated recently
- target down by VMID/type
- API `/sd/exporters` failure
- API app health failure
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failing
- log pipeline down, once Loki/Alloy logs exist

## Follow-up improvements

- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, and Velocity/DNS/edge actions.
- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
- Add a dashboard row/panels for stale/deleted targets.
- Add a runbook for discovery sync failure.
- Add a runbook for debugging failed game/dev provisioning from monitoring alone.
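For blocker 5, one shape the acceptance criterion already allows is systemd credentials. A sketch only: the unit name `game-dev-discovery.service` and the secret path are assumptions; the real sync service/timer may be named differently:

```ini
# /etc/systemd/system/game-dev-discovery.service.d/token.conf  (unit name assumed)
[Service]
# Token lives in a root-owned, mode-0600 file containing nothing else.
LoadCredential=zlh-api-token:/etc/zlh-monitor/secrets/api-token
# The sync script then reads "$CREDENTIALS_DIRECTORY/zlh-api-token" instead of
# sourcing the token from game-dev-discovery.env.
```

This keeps the token out of the env file and out of the Prometheus YAML; the remaining acceptance checks (no leakage into labels, URLs, or logs) still need a separate review.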
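The first few alerts listed above can be sketched now, since the job and file names are already fixed by the runtime config described under "Recently resolved / verified". A hedged Prometheus rule sketch; the 15-minute staleness window, `for` durations, and severities are placeholders, not tuned values:

```yaml
groups:
  - name: zlh-monitor-discovery
    rules:
      # file_sd file stale: Prometheus exports the mtime of each file_sd file
      # it reads as prometheus_sd_file_mtime_seconds{filename=...}.
      - alert: GameDevFileSdStale
        expr: time() - prometheus_sd_file_mtime_seconds{filename=~".*game-dev-alloy\\.json"} > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "game-dev file_sd not rewritten by the sync timer for >15m"
      # target down by VMID/type: any discovered game/dev target failing its scrape.
      - alert: GameDevTargetDown
        expr: up{job="game-dev-alloy"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "game/dev target {{ $labels.instance }} (vmid {{ $labels.vmid }}) is down"
```

The `vmid` label in the annotation assumes the minimal label set chosen under "Standardization work" is actually attached to discovered targets. Validate the file with `promtool check rules` before wiring it into `prometheus.yml`.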
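Launch blocker 1 can be spot-checked from the monitor host itself. A minimal sketch, assuming `ss(8)` is available; the `is_exposed` helper and its loopback-or-flagged policy are illustrative, not an audited rule set:

```shell
#!/bin/sh
# Classify listener bind addresses (as printed by `ss -tlnH`, column 4) and
# flag monitoring ports that are reachable beyond loopback. Typical use:
#   ss -tlnH | awk '{print $4}' | xargs sh check-exposure.sh
# Ports 9090/3000/9100 are the Prometheus/Grafana/node_exporter listeners
# named in the prior audit; extend the pattern list as needed.

is_exposed() {
  # $1 is a "host:port" local address; returns 0 (true) if exposed.
  case "$1" in
    127.0.0.1:*|"[::1]:"*) return 1 ;;  # loopback-only bind: fine
    *:9090|*:3000|*:9100)  return 0 ;;  # monitoring port on a wider bind
    *)                     return 1 ;;  # other services: out of scope here
  esac
}

for addr in "$@"; do
  if is_exposed "$addr"; then
    echo "EXPOSED $addr"
  else
    echo "ok      $addr"
  fi
done
```

The durable fix remains the one stated in the blocker: `--web.listen-address=127.0.0.1:9090` in the Prometheus args (via `/etc/default/prometheus`) and `http_addr = 127.0.0.1` in Grafana's `[server]` section, with the firewall as a second layer.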