# Monitoring — Current State

## Launch readiness

**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery works, the stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, Grafana dashboard provisioning is live from `/etc/zlh-monitor`, and the important game/dev metrics path now works through container Alloy remote-write.

Launch-debug readiness is still blocked by:

- final exposure verification/lockdown
- missing or unfinished centralized logs
- a missing API application health scrape
- incomplete lifecycle telemetry
- the unfinished delete half of the game/dev lifecycle smoke test

## Runtime source of truth

`/etc/zlh-monitor` is now the runtime source of truth.

Prometheus now runs from:

- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`

Grafana now provisions from:

- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`

`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.

## Canonical metrics model

Monitoring is not API-only. The API owns discovery/metadata and tells the monitoring stack which game/dev containers exist. The canonical metrics path for game/dev containers is now:

```text
container Alloy remote_write -> Prometheus on 10.60.0.25:9090 -> Grafana dashboards
```

The `game-dev-alloy` file_sd path is useful for inventory and add/remove validation. It should not be treated as the canonical container health signal while container Alloy listens on `127.0.0.1:12345` and Prometheus file_sd scrapes `container-ip:12345`.
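The container-side half of this path can be sketched in Alloy configuration. This is an illustrative sketch only, not the deployed config: component names are invented here, and the URL assumes Prometheus runs with its remote-write receiver enabled (`--web.enable-remote-write-receiver`).

```alloy
// Sketch, not the deployed config: scrape local node metrics inside the
// container and remote-write them to the central Prometheus.
prometheus.exporter.unix "local" { }

prometheus.scrape "local" {
  targets    = prometheus.exporter.unix.local.targets
  forward_to = [prometheus.remote_write.central.receiver]
}

prometheus.remote_write "central" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
}
```

Because the data is pushed, this path works even while the Alloy HTTP listener stays on `127.0.0.1:12345`.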
If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as `0.0.0.0:12345`, with security review.

## Recently verified progress

Container discovery is clean:

- `/sd/exporters -> []` during the empty state
- `/monitoring/game-dev -> containers: []` during the empty state
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed; the stale `game-dev-alloy` targets `5205` and `6084` are gone

Prometheus runtime config is clean:

- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs

Runtime bind/target state after the remote-write fix:

- Prometheus listens on `10.60.0.25:9090`
- Prometheus self-scrape target is `10.60.0.25:9090`
- Grafana remains loopback-only on `127.0.0.1:3000`
- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
- local node_exporter is disabled

Grafana provisioning is live:

- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

Services active after cleanup:

- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer

## Game/dev metrics smoke evidence

The important monitoring path works for the current test containers.
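For context on the file_sd half of the inventory path: the target file the sync writes follows the standard Prometheus file_sd shape, a JSON list of target groups. The sketch below is a hypothetical rendering helper, not the real sync script; the `vmid`/`ctype` label names and port `12345` are assumptions taken from the audit notes.

```python
import json

def render_file_sd(containers):
    """Render a Prometheus file_sd document (a JSON list of target groups).

    Hypothetical helper, not the real sync script. An empty inventory must
    still produce a valid file ("[]"), which is what the one-shot sync
    wrote during the empty state.
    """
    groups = [
        {
            "targets": [f"{c['ip']}:12345"],  # port assumed from this doc
            "labels": {"vmid": str(c["vmid"]), "ctype": c["ctype"]},
        }
        for c in containers
    ]
    return json.dumps(groups, indent=2)

# Empty state: a valid empty file_sd file, not a failure.
print(render_file_sd([]))  # -> []
# Populated state for one of the test containers:
print(render_file_sd([{"ip": "10.100.0.23", "vmid": 6085, "ctype": "dev"}]))
```

Writing `[]` rather than failing is what lets Prometheus drop stale targets cleanly when the inventory is empty.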
API discovery emits both test servers:

- `6085 / dev-6085 / 10.100.0.23 / dev`
- `5206 / mc-vanilla-5206 / 10.200.0.79 / game`

file_sd sync writes both targets, and Prometheus discovers both. Alloy remote-write from both containers works after Prometheus was rebound to `10.60.0.25:9090`.

Dashboard-style metrics exist for both containers:

- `node_uname_info`
- CPU metrics
- memory metrics

Observed values:

```text
6085  dev-6085         dev-container   CPU ~1.04%  Mem ~6.58%
5206  mc-vanilla-5206  game-container  CPU ~0.68%  Mem ~61.69%
```

Known design mismatch:

- container Alloy listens on `127.0.0.1:12345`
- Prometheus file_sd scrapes `container-ip:12345`
- therefore `game-dev-alloy` scrape targets are discovered but down

Current decision: use remote-write as canonical and do not rely on `game-dev-alloy` scrape-health panels unless/until direct container Alloy scrape is intentionally enabled.

## Exposure state

Current reported bind posture:

- Prometheus listens on `10.60.0.25:9090`
- Grafana remains loopback-only on `127.0.0.1:3000`
- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
- local node_exporter is disabled

Remaining exposure work:

- verify Prometheus is reachable only from the intended management/container remote-write networks, not the public internet
- verify Grafana is not reachable except via the intended admin path
- verify node_exporter is no longer listening, or is firewalled if any listener remains
- verify host firewall/nftables policy is appropriate

## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Prometheus listens on `10.60.0.25:9090`
- Prometheus self-scrape target is `10.60.0.25:9090`
- `/sd/exporters` is configured with bearer auth and works with a token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 Unauthorized`

Token risk: the bearer token was previously observed stored directly in
Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.

## Targets/discovery state

Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- the hardcoded/down `alloy-dev-6076` target is gone from the active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from the active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from the active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing
- active test containers appear in API discovery and file_sd

Expected remaining validation:

- delete the test dev/game servers
- confirm API discovery drops them
- confirm file_sd drops them
- confirm dashboard metrics age out as expected

## API monitoring state

API discovery is working for game/dev monitoring inventory. The API app health endpoint `/health` previously returned `404 Cannot GET /health`, and no API health scrape was found. No centralized API logs were found on the monitoring host.

## Agent/container monitoring state

Current test containers prove remote-write metrics are working:

- `6085 dev-6085 dev-container`
- `5206 mc-vanilla-5206 game-container`

Labels previously observed include:

- `vmid`
- `ctype` or `container_type`
- `game`
- `variant`
- `hostname` / `name`
- `job`
- `instance`

Missing or incomplete for lifecycle debugging:

- `ready` / `connectable`
- `operationInProgress`
- `operationType`
- backup/restore state
- code-server state

## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
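A datasource provisioning file consistent with this setup would look roughly like the following. This is a sketch: the actual filename and layout under `/etc/zlh-monitor/grafana/provisioning` are not recorded here, and the `type`/`access` fields are standard Grafana values assumed for illustration.

```yaml
# Sketch: the shape of a Grafana datasource provisioning file for the
# source-managed Prometheus datasource. Real filename/layout may differ.
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```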
Datasource:

- `uid=prometheus`
- `name=Prometheus`
- `url=http://localhost:9090`
- default datasource

Live provisioned dashboards:

- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`

Archived:

- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

Dashboard-style CPU and memory metrics exist for both current test containers.

## Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit. Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for the API, the Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

## Security state

Working controls:

- Grafana is loopback-only on `127.0.0.1:3000` per the current report
- discovery endpoints require a token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
- the Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit

Remaining launch security checks:

- Prometheus on `10.60.0.25:9090` must be restricted to the intended container remote-write and admin/monitoring paths
- any remaining node_exporter listener must be disabled or firewalled
- host firewall policy must be verified
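The restrictions above could eventually be enforced with host firewall rules along these lines. This is a hedged sketch, not the live policy: the allowed CIDRs are placeholders, since the real management and container remote-write subnets are not recorded in this document.

```nft
# Sketch of an nftables input policy for the monitoring host.
# The CIDRs below are placeholders, not the actual subnets.
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    # Prometheus 9090: container remote-write + admin/monitoring networks only
    ip saddr { 10.100.0.0/24, 10.200.0.0/24, 10.60.0.0/24 } tcp dport 9090 accept
    # Grafana (3000) and any residual node_exporter (9100) get no accept
    # rule, so the default drop policy keeps them unreachable externally.
  }
}
```

Verifying the live policy against intent (rather than assuming this sketch) remains one of the listed launch checks.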