Monitoring — Current State
Launch readiness
Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.
The monitoring stack has been consolidated so /etc/zlh-monitor is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
Centralized Loki is now installed on zlh-monitor and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.
Remaining future work:
- add selected core-service logs beyond zlh-monitor
- add selected game/dev container logs after review
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
Runtime source of truth
/etc/zlh-monitor is now the monitoring runtime source of truth.
Prometheus now runs from:
/etc/zlh-monitor/prometheus/prometheus.yml
configured via:
/etc/default/prometheus
Runtime Prometheus jobs are limited to:
prometheus
alloy
game-dev-alloy
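As a rough sketch, the active scrape section of /etc/zlh-monitor/prometheus/prometheus.yml could look like the fragment below. The three job names and the file_sd path are from this document; the target addresses and the absence of per-job intervals are assumptions.

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["127.0.0.1:9090"]   # self-scrape target is an assumption

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]  # host Alloy, per the listening-address list

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```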
Removed from active Prometheus runtime config:
- old node_exporter jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active :9100 scraping jobs
Local prometheus-node-exporter is disabled.
Prometheus listens on:
10.60.0.25:9090
This allows Alloy remote-write from hosts and containers to reach Prometheus.
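On the sending side, this path corresponds to a remote-write block in each Alloy config, sketched below. The component label is illustrative, and the sketch assumes Prometheus runs with --web.enable-remote-write-receiver so it accepts writes on /api/v1/write.

```
// Illustrative Alloy (River) fragment: ship metrics to the central Prometheus.
// Assumes Prometheus was started with --web.enable-remote-write-receiver.
prometheus.remote_write "central" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
}
```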
Grafana state
Grafana provisions from:
/etc/zlh-monitor/grafana/provisioning
Dashboard source directory:
/etc/zlh-monitor/grafana/dashboards
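A dashboards provider file under the provisioning directory would tie these two paths together; the fragment below is a minimal sketch in Grafana's provisioning format, with the provider name as an assumption.

```yaml
apiVersion: 1
providers:
  - name: zlh-dashboards        # provider name is illustrative
    type: file
    options:
      path: /etc/zlh-monitor/grafana/dashboards
```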
Live dashboards provisioned:
- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers
The old node_exporter router dashboard was archived:
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
Grafana listens on loopback only:
127.0.0.1:3000
Provisioned datasources:
Prometheus
Loki
Loki state
Centralized Loki is installed on zlh-monitor.
Files created/changed on the monitoring host:
/etc/zlh-monitor/loki/loki.yml
/etc/zlh-monitor/systemd/loki.service
/etc/zlh-monitor/grafana/provisioning/datasources/loki.yml
/etc/zlh-monitor/alloy/config.alloy
/etc/zlh-monitor/README.md
Installed package:
loki 3.7.1
Installed service:
/etc/systemd/system/loki.service
Services restarted/enabled:
loki
grafana-server
alloy
Listening addresses:
Loki HTTP: 127.0.0.1:3100
Loki gRPC: 127.0.0.1:9096
Grafana: 127.0.0.1:3000
Alloy: 127.0.0.1:12345
Prometheus: 10.60.0.25:9090
No Loki public or non-loopback listener is exposed.
Grafana Loki datasource:
name: Loki
uid: loki
url: http://127.0.0.1:3100
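In Grafana's datasource-provisioning format (the file /etc/zlh-monitor/grafana/provisioning/datasources/loki.yml listed above), this would look roughly like the fragment below; name, uid, and url are from this document, while type and access are the standard values for a Loki datasource.

```yaml
apiVersion: 1
datasources:
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://127.0.0.1:3100
```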
Grafana logs confirmed datasource provisioning:
inserting datasource from configuration name=Loki uid=loki
Current zlh-monitor log job:
job="zlh-monitor"
This replaced the initial temporary job="zlh-monitor-journal" label.
Successful Loki query verified with:
curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
--data-urlencode 'query={job="zlh-monitor"}' \
--data-urlencode limit=5 \
--data-urlencode direction=backward
The query returned fresh live streams for alloy and loki with labels still intact:
host="zlh-monitor"
vmid="9016"
service
job
log_source
Intended future Loki job names:
job="zlh-api"
job="zlh-portal"
job="zlh-velocity"
job="zlh-monitor"
Log sources currently wired under job="zlh-monitor":
prometheus.service
grafana-server.service
alloy.service
loki.service
zlh-monitor-game-dev-discovery.service
Intentionally skipped in first Loki pass:
- game/dev container logs
- zpack-api logs
- zpack-portal logs
- zpack-velocity logs
- DNS/edge action logs
- provisioning/backup/restore/delete logs beyond existing monitor discovery unit
Loki schema note:
- Loki initially rejected 24-hour backfill entries because the schema start date was current-day.
- Schema start was moved to the standard 2020-10-24 TSDB start date.
- Rejections stopped after the change.
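The relevant part of loki.yml is the schema_config block; a hedged sketch is below. The 2020-10-24 TSDB start is from this document, while the object store, schema version, and index settings are assumptions typical of a single-node filesystem deployment.

```yaml
schema_config:
  configs:
    - from: 2020-10-24        # start date moved back so 24h backfill is accepted
      store: tsdb
      object_store: filesystem  # assumption: local single-node storage
      schema: v13               # assumption: current standard schema version
      index:
        prefix: index_
        period: 24h
```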
Discovery behavior
API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.
9091 / zpack-dev-velocity / 10.60.0.220 no longer appears as a lifecycle dev container.
The monitor sync allows empty discovery:
ALLOW_EMPTY="true"
This ensures deleted containers correctly clear:
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
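For illustration, the file_sd lifecycle might look like the sketch below, using standard Prometheus file_sd JSON. The target address and labels are hypothetical; only the empty-array end state after a delete is taken from this document.

```shell
# Stand-in path for /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
state=/tmp/game-dev-alloy.json

# While a dev container exists, the sync writes standard file_sd entries
# (targets/labels here are hypothetical examples):
cat > "$state" <<'EOF'
[
  {
    "targets": ["10.100.0.24:12345"],
    "labels": { "job": "game-dev-alloy", "role": "dev", "vmid": "6086" }
  }
]
EOF

# After the container is deleted, ALLOW_EMPTY="true" lets the sync collapse
# the file to an empty list, and Prometheus drops the target:
printf '[]\n' > "$state"
cat "$state"
```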
Alloy template state
The base template was updated so container Alloy listens on:
0.0.0.0:12345
instead of only:
127.0.0.1:12345
This allows Prometheus to scrape container Alloy health via the game-dev-alloy job.
Updated template note replaces node_exporter with Alloy:
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
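In the container template, the bind address would typically be set on Alloy's run command; the unit fragment below is a sketch, where the unit layout and config path are assumptions and only the 0.0.0.0:12345 bind is from this document.

```ini
# Illustrative ExecStart for the container's Alloy unit.
[Service]
ExecStart=/usr/bin/alloy run --server.http.listen-addr=0.0.0.0:12345 /etc/alloy/config.alloy
```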
Metrics model
Monitoring is not API-only.
- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- game-dev-alloy scrape health also works now that container Alloy binds on 0.0.0.0:12345.
- Grafana reads Prometheus for dashboards.
Smoke tests passed
First manual test
dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus
Template test after updating base image
dev 6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up
Full loop verified
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
Logging model
Centralized Loki is now installed for zlh-monitor logs.
Current model:
logs -> Alloy -> centralized Loki -> Grafana
Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.
Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.
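The logs -> Alloy -> Loki -> Grafana path above maps onto a small Alloy pipeline, sketched below. The job="zlh-monitor" label and the loopback Loki endpoint are from this document; the component labels and the absence of per-unit match filters are assumptions.

```
// Illustrative sketch of the zlh-monitor log path in Alloy (River).
loki.source.journal "units" {
  // In practice this would be filtered to the listed .service units.
  forward_to = [loki.write.central.receiver]
  labels     = { job = "zlh-monitor", host = "zlh-monitor" }
}

loki.write "central" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}
```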
Platform exceptions
OPNsense and PBS remain explicit platform monitoring exceptions.
OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.