Monitoring — Current State

Launch readiness

Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.

The monitoring stack has been consolidated so /etc/zlh-monitor is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.

Centralized Loki is now installed on zlh-monitor and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.

Remaining future work:

  • add selected core-service logs beyond zlh-monitor
  • add selected game/dev container logs after review
  • OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later

Runtime source of truth

/etc/zlh-monitor is now the monitoring runtime source of truth.

Prometheus now runs from:

/etc/zlh-monitor/prometheus/prometheus.yml

configured via:

/etc/default/prometheus
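
Quick check that the packaged unit actually picks up the zlh-monitor config (this assumes the Debian-style ARGS variable in /etc/default/prometheus; adjust if the unit is wired differently):

grep ARGS /etc/default/prometheus
# expect --config.file=/etc/zlh-monitor/prometheus/prometheus.yml in the output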

Runtime Prometheus jobs are limited to:

prometheus
alloy
game-dev-alloy
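
One way to verify that only these jobs are active, assuming jq is installed on the monitoring host:

curl -s http://10.60.0.25:9090/api/v1/targets | \
  jq -r '.data.activeTargets[].labels.job' | sort -u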

Removed from active Prometheus runtime config:

  • old node_exporter jobs
  • cAdvisor jobs
  • stale static node targets
  • legacy game/dev targets
  • active :9100 scraping jobs

Local prometheus-node-exporter is disabled.
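
Quick check that the local exporter stays off:

systemctl is-enabled prometheus-node-exporter
# expect disabled (or masked)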

Prometheus listens on:

10.60.0.25:9090

This allows Alloy remote-write from hosts and containers to reach Prometheus.
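
For remote-write to land, Prometheus must also accept it; a minimal check, assuming the receiver was enabled via the --web.enable-remote-write-receiver flag and that jq is available:

curl -s http://10.60.0.25:9090/api/v1/status/flags | \
  jq -r '.data["web.enable-remote-write-receiver"]'
# expect: true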

Grafana state

Grafana provisions from:

/etc/zlh-monitor/grafana/provisioning

Dashboard source directory:

/etc/zlh-monitor/grafana/dashboards

Live dashboards provisioned:

  • Core Services - Overview
  • Core Service - Detail
  • Game Containers
  • Dev Containers

The old node_exporter router dashboard was archived:

/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Grafana listens on loopback only:

127.0.0.1:3000
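
Loopback health probe (Grafana's /api/health endpoint does not require a login):

curl -s http://127.0.0.1:3000/api/health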

Provisioned datasources:

  • Prometheus
  • Loki

Loki state

Centralized Loki is installed on zlh-monitor.

Files created/changed on the monitoring host:

  • /etc/zlh-monitor/loki/loki.yml
  • /etc/zlh-monitor/systemd/loki.service
  • /etc/zlh-monitor/grafana/provisioning/datasources/loki.yml
  • /etc/zlh-monitor/alloy/config.alloy
  • /etc/zlh-monitor/README.md

Installed package:

loki 3.7.1
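
Assuming the package installs the binary as loki on PATH, the version can be confirmed with:

loki --version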

Installed service:

/etc/systemd/system/loki.service

Services restarted/enabled:

  • loki
  • grafana-server
  • alloy
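
Quick confirmation that all three came back up after the restart:

systemctl is-active loki grafana-server alloy
# expect three lines of "active"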

Listening addresses:

Loki HTTP:     127.0.0.1:3100
Loki gRPC:     127.0.0.1:9096
Grafana:       127.0.0.1:3000
Alloy:         127.0.0.1:12345
Prometheus:    10.60.0.25:9090

No Loki public or non-loopback listener is exposed.
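
Listener check from the monitoring host (only Prometheus should show a non-loopback bind), plus Loki's readiness endpoint:

ss -ltn | grep -E ':(9090|3100|9096|3000|12345)'
curl -s http://127.0.0.1:3100/ready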

Grafana Loki datasource:

name: Loki
uid: loki
url: http://127.0.0.1:3100

Grafana logs confirmed datasource provisioning:

inserting datasource from configuration name=Loki uid=loki

Current zlh-monitor log job:

job="zlh-monitor"

This replaced the initial temporary job="zlh-monitor-journal" label.
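
Loki can be asked directly which job values it has seen recently, to confirm the temporary label is gone:

curl -s http://127.0.0.1:3100/loki/api/v1/label/job/values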

Successful Loki query verified with:

curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="zlh-monitor"}' \
  --data-urlencode limit=5 \
  --data-urlencode direction=backward

The query returned fresh live streams for alloy and loki with the expected labels intact:

  • host="zlh-monitor"
  • vmid="9016"
  • service
  • job
  • log_source

Intended future Loki job names:

  • job="zlh-api"
  • job="zlh-portal"
  • job="zlh-velocity"
  • job="zlh-monitor"

Log sources currently wired under job="zlh-monitor":

  • prometheus.service
  • grafana-server.service
  • alloy.service
  • loki.service
  • zlh-monitor-game-dev-discovery.service
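
The same label API can confirm which units are actually shipping, assuming the default lookback window covers recent entries:

curl -s http://127.0.0.1:3100/loki/api/v1/label/service/values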

Intentionally skipped in first Loki pass:

  • game/dev container logs
  • zpack-api logs
  • zpack-portal logs
  • zpack-velocity logs
  • DNS/edge action logs
  • provisioning/backup/restore/delete logs beyond the existing monitor discovery unit

Loki schema note:

  • Loki initially rejected 24-hour backfill entries because the schema start date was current-day.
  • Schema start was moved to the standard 2020-10-24 TSDB start.
  • Rejection stopped after the change.

Discovery behavior

API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.

9091 / zpack-dev-velocity / 10.60.0.220 no longer appears as a lifecycle dev container.

The monitor sync allows empty discovery:

ALLOW_EMPTY="true"

This ensures deleted containers correctly clear:

/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
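
Post-delete check of both ends of the chain (jq assumed):

cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
curl -s http://10.60.0.25:9090/api/v1/targets | \
  jq '[.data.activeTargets[] | select(.labels.job == "game-dev-alloy")] | length'
# expect [] from the file and 0 from Prometheus once the removal has propagated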

Alloy template state

The base template was updated so container Alloy listens on:

0.0.0.0:12345

instead of only:

127.0.0.1:12345

This allows Prometheus to scrape container Alloy health via the game-dev-alloy job.
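
Spot-check from the monitoring host against a live container (10.100.0.24 is the dev container from the smoke test below; substitute any current container IP):

curl -s http://10.100.0.24:12345/metrics | head -n 5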

The updated template note replaces the node_exporter reference with Alloy:

Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used

Metrics model

Monitoring is not API-only.

  • The API discovery + monitor sync + file_sd pipeline owns lifecycle inventory and add/remove validation.
  • Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
  • game-dev-alloy scrape health also works now that container Alloy binds on 0.0.0.0:12345.
  • Grafana reads Prometheus for dashboards.

Smoke tests passed

First manual test

dev  6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus

Template test after updating base image

dev  6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up

Full loop verified

create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone

Logging model

Centralized Loki is now installed for zlh-monitor logs.

Current model:

logs -> Alloy -> centralized Loki -> Grafana

Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.

Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.

Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions.

OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.