# Monitoring — Current State

## Launch readiness

**Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.**

The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests. Centralized Loki is now installed on `zlh-monitor` and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.

Remaining future work:

- add selected core-service logs beyond zlh-monitor
- add selected game/dev container logs after review
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later

## Runtime source of truth

`/etc/zlh-monitor` is now the monitoring runtime source of truth.

Prometheus now runs from:

```text
/etc/zlh-monitor/prometheus/prometheus.yml
```

configured via:

```text
/etc/default/prometheus
```

Runtime Prometheus jobs are limited to:

```text
prometheus
alloy
game-dev-alloy
```

Removed from the active Prometheus runtime config:

- old `node_exporter` jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active `:9100` scraping jobs

Local `prometheus-node-exporter` is disabled.

Prometheus listens on:

```text
10.60.0.25:9090
```

This allows Alloy remote-write from hosts and containers to reach Prometheus.
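The runtime job layout above can be sketched as a minimal `prometheus.yml`. This is a hedged illustration, not the actual file: the scrape interval, the local Alloy self-scrape target, and the static target addresses for the first two jobs are assumptions; only the three job names, the file_sd path, and the `10.60.0.25:9090` listener come from this document.

```yaml
# Hypothetical sketch of /etc/zlh-monitor/prometheus/prometheus.yml.
# Intervals and static targets are illustrative assumptions.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["10.60.0.25:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]   # monitoring-host Alloy (assumed)

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```

The file_sd job is what lets the monitor sync add and remove game/dev targets without a Prometheus reload of static config.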
## Grafana state

Grafana provisions from:

```text
/etc/zlh-monitor/grafana/provisioning
```

Dashboard source directory:

```text
/etc/zlh-monitor/grafana/dashboards
```

Live dashboards provisioned:

- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers

The old node_exporter router dashboard was archived to:

```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```

Grafana listens on loopback only:

```text
127.0.0.1:3000
```

Provisioned datasources:

- `Prometheus`
- `Loki`

## Loki state

Centralized Loki is installed on `zlh-monitor`.

Files created/changed on the monitoring host:

- `/etc/zlh-monitor/loki/loki.yml`
- `/etc/zlh-monitor/systemd/loki.service`
- `/etc/zlh-monitor/grafana/provisioning/datasources/loki.yml`
- `/etc/zlh-monitor/alloy/config.alloy`
- `/etc/zlh-monitor/README.md`

Installed package:

```text
loki 3.7.1
```

Installed service:

```text
/etc/systemd/system/loki.service
```

Services restarted/enabled:

- `loki`
- `grafana-server`
- `alloy`

Listening addresses:

```text
Loki HTTP:  127.0.0.1:3100
Loki gRPC:  127.0.0.1:9096
Grafana:    127.0.0.1:3000
Alloy:      127.0.0.1:12345
Prometheus: 10.60.0.25:9090
```

No public or non-loopback Loki listener is exposed.
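The loopback listeners above imply a Loki server block roughly like the following. This is a hedged sketch of `/etc/zlh-monitor/loki/loki.yml`, not the real file: the listener addresses and the backdated TSDB schema start are taken from this document, while the storage paths, schema version, and ring settings are assumptions.

```yaml
# Hypothetical sketch of /etc/zlh-monitor/loki/loki.yml.
# Listeners and the 2020-10-24 schema start are from this document;
# paths and other settings are illustrative assumptions.
auth_enabled: false

server:
  http_listen_address: 127.0.0.1
  http_listen_port: 3100
  grpc_listen_address: 127.0.0.1
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki          # assumed
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks   # assumed
      rules_directory: /var/lib/loki/rules     # assumed
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24                # backdated so backfill is accepted
      store: tsdb
      object_store: filesystem
      schema: v13                     # assumed
      index:
        prefix: index_
        period: 24h
```

Binding both the HTTP and gRPC listeners to `127.0.0.1` is what keeps Loki reachable only by the co-located Grafana and Alloy processes.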
Grafana Loki datasource:

```text
name: Loki
uid: loki
url: http://127.0.0.1:3100
```

Grafana logs confirmed datasource provisioning:

```text
inserting datasource from configuration name=Loki uid=loki
```

Successful Loki query verified with:

```bash
curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="zlh-monitor-journal"}' \
  --data-urlencode limit=5 \
  --data-urlencode direction=backward
```

The query returned live streams for `grafana-server` and `loki` with labels including:

- `host="zlh-monitor"`
- `vmid="9016"`
- `service`
- `job`
- `log_source`

Log sources currently wired:

- `prometheus.service`
- `grafana-server.service`
- `alloy.service`
- `loki.service`
- `zlh-monitor-game-dev-discovery.service`

Intentionally skipped in the first Loki pass:

- game/dev container logs
- zpack-api logs
- zpack-portal logs
- zpack-velocity logs
- DNS/edge action logs
- provisioning/backup/restore/delete logs beyond the existing monitor discovery unit

Loki schema note:

- Loki initially rejected 24-hour backfill entries because the schema start date was current-day.
- Schema start was moved to the standard `2020-10-24` TSDB start.
- Rejections stopped after the change.

## Discovery behavior

API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR. `9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container.

The monitor sync allows empty discovery:

```text
ALLOW_EMPTY="true"
```

This ensures deleted containers correctly clear:

```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```

## Alloy template state

The base template was updated so container Alloy listens on:

```text
0.0.0.0:12345
```

instead of only:

```text
127.0.0.1:12345
```

This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job.
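When a lifecycle container exists, the file_sd file written by the monitor sync presumably holds one entry per container in standard Prometheus file_sd shape. The sketch below is an assumption: only the `:12345` Alloy port, the empty-file (`[]`) behavior, and the example address `10.100.0.24` (from the template smoke test) come from this document; the label keys are hypothetical.

```json
[
  {
    "targets": ["10.100.0.24:12345"],
    "labels": {
      "role": "dev",
      "vmid": "6086"
    }
  }
]
```

After a delete, the sync (with `ALLOW_EMPTY="true"`) rewrites the file to `[]`, which removes the target from Prometheus on its next file_sd refresh.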
Updated template note replaces node_exporter with Alloy:

```text
Alloy Metrics/UI:  :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter:     not used
```

## Metrics model

Monitoring is not API-only.

- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`.
- Grafana reads Prometheus for dashboards.

## Smoke tests passed

### First manual test

```text
dev 6085  -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete    -> both disappeared from API discovery, file_sd, and Prometheus
```

### Template test after updating base image

```text
dev 6086 / 10.100.0.24  -> up
game 5207 / 10.200.0.80 -> up
```

### Full loop verified

```text
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
```

## Logging model

Centralized Loki is now installed for zlh-monitor logs.

Current model:

```text
logs -> Alloy -> centralized Loki -> Grafana
```

Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.

Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.

## Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions. OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
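If the OPNsense exception is eventually closed with a node_exporter job, a router-only scrape config could look roughly like the following. Everything here is a forward-looking assumption sketched only to scope the work: the job name, label, port, and the placeholder router address are not part of the current runtime config.

```yaml
# Hypothetical future router-only job; NOT in the current runtime config.
# The OPNsense address is a placeholder, and node_exporter's default
# :9100 port is assumed.
scrape_configs:
  - job_name: router-node-exporter
    static_configs:
      - targets: ["<opnsense-address>:9100"]
        labels:
          role: router
```

Keeping this as its own job (rather than reviving the removed general `:9100` jobs) would preserve the decision above that node_exporter scraping is gone from the core runtime config.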