Monitoring — Current State

Launch readiness

Status: core lifecycle monitoring is launch-ready.

The monitoring stack has been consolidated so /etc/zlh-monitor is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.

Remaining future work:

  • centralized logs are not enabled yet
  • Loki is not deployed
  • OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later

Runtime source of truth

/etc/zlh-monitor is now the monitoring runtime source of truth.

Prometheus now loads its configuration from:

/etc/zlh-monitor/prometheus/prometheus.yml

with that path set in:

/etc/default/prometheus

Runtime Prometheus jobs are limited to:

prometheus
alloy
game-dev-alloy
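
One way to confirm that only these three jobs are active is to compare the job labels returned by the Prometheus targets endpoint (`/api/v1/targets`) against an allow-list. The sketch below works on an already-decoded JSON response; the names `active_jobs` and `unexpected_jobs` are illustrative assumptions and not part of the existing tooling.

```python
# Allow-list taken from the runtime job list in this document.
EXPECTED_JOBS = {"prometheus", "alloy", "game-dev-alloy"}


def active_jobs(payload):
    """Return the job labels of all active targets in a decoded
    /api/v1/targets response."""
    targets = payload.get("data", {}).get("activeTargets", [])
    jobs = (t.get("labels", {}).get("job") for t in targets)
    return {job for job in jobs if job}


def unexpected_jobs(payload, expected=EXPECTED_JOBS):
    """Return a sorted list of active jobs that are not on the allow-list."""
    return sorted(active_jobs(payload) - set(expected))
```

An empty list from `unexpected_jobs` would indicate that no legacy job (for example an old :9100 scrape) is still active.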

Removed from active Prometheus runtime config:

  • old node_exporter jobs
  • cAdvisor jobs
  • stale static node targets
  • legacy game/dev targets
  • active :9100 scraping jobs

Local prometheus-node-exporter is disabled.

Prometheus listens on:

10.60.0.25:9090

This allows Alloy remote-write from hosts and containers to reach Prometheus.

Grafana state

Grafana provisions from:

/etc/zlh-monitor/grafana/provisioning

Dashboard source directory:

/etc/zlh-monitor/grafana/dashboards

Live dashboards provisioned:

  • Core Services - Overview
  • Core Service - Detail
  • Game Containers
  • Dev Containers

The old node_exporter router dashboard was archived:

/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Grafana listens on loopback only:

127.0.0.1:3000

Discovery behavior

API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.

9091 / zpack-dev-velocity / 10.60.0.220 no longer appears as a lifecycle dev container.
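
The CIDR exclusion can be sketched with Python's standard `ipaddress` module. The excluded range `10.60.0.0/24` and the record shape (`id`, `ip`) are assumptions for illustration; the actual CIDR and discovery payload are not given in this document.

```python
import ipaddress

# Assumption: hypothetical core range; the real excluded CIDR is not stated here.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.0/24")]


def is_lifecycle(ip, excluded=EXCLUDED_CIDRS):
    """Return True when the address falls outside every excluded network."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in excluded)


def filter_lifecycle(records, excluded=EXCLUDED_CIDRS):
    """Keep only discovery records whose IP is a lifecycle container address."""
    return [r for r in records if is_lifecycle(r["ip"], excluded)]
```

Under that assumed range, a record at 10.60.0.220 is dropped while records at 10.100.0.24 and 10.200.0.80 are kept.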

The monitor sync allows empty discovery:

ALLOW_EMPTY="true"

With empty discovery allowed, deleting the last container clears the file_sd target file to an empty list:

/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
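
A minimal sketch of that behavior, assuming `ALLOW_EMPTY` gates whether an empty discovery result may overwrite the existing file. The function `write_file_sd` is hypothetical and is not the actual monitor sync; it writes the standard Prometheus file_sd JSON shape (a list of target groups) and replaces the file atomically so Prometheus never reads a partial write.

```python
import json
import os
import tempfile
from pathlib import Path


def write_file_sd(path, groups, allow_empty=False):
    """Write file_sd target groups to `path`.

    Returns False and leaves the file untouched when `groups` is empty
    and `allow_empty` is False; otherwise writes and returns True.
    """
    if not groups and not allow_empty:
        return False
    path = Path(path)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh)
    os.replace(tmp, path)
    return True
```

Without the `allow_empty` path, the last deleted container would linger in file_sd as a stale target.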

Alloy template state

The base template was updated so container Alloy listens on:

0.0.0.0:12345

instead of only:

127.0.0.1:12345

This allows Prometheus to scrape container Alloy health via the game-dev-alloy job.
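
The reason the bind address matters can be shown with a small check: a listener bound to a loopback address only accepts local connections, so a remote Prometheus cannot scrape it. The helper name `scrapeable_remotely` is an assumption for illustration.

```python
import ipaddress


def scrapeable_remotely(listen_addr):
    """Return True if a 'host:port' listen address accepts non-local connections.

    An empty host (':12345') is treated as all interfaces.
    """
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::1]:12345
    if not host:
        return True
    return not ipaddress.ip_address(host).is_loopback
```

This only inspects the configured address; it does not account for firewalls or routing between Prometheus and the container.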

The updated template note lists Alloy in place of node_exporter:

Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used

Metrics model

Monitoring is not API-only.

  • API discovery, monitor sync, and file_sd together own lifecycle inventory and add/remove validation.
  • Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
  • game-dev-alloy scrape health also works now that container Alloy binds on 0.0.0.0:12345.
  • Grafana reads Prometheus for dashboards.

Smoke tests passed

First manual test

dev  6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus

Template test after updating base image

dev  6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up

Full loop verified

create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
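
The Prometheus end of this loop can be asserted against a decoded `/api/v1/targets` response. This is a sketch only: the helper names are hypothetical, and the instance label format `ip:12345` is an assumption based on the Alloy listen port.

```python
def target_health(payload, job="game-dev-alloy"):
    """Map instance label -> health string for active targets of one job."""
    result = {}
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("job") == job:
            result[labels.get("instance")] = target.get("health")
    return result


def created_ok(payload, instance):
    """After create: the instance is present in the job and reports 'up'."""
    return target_health(payload).get(instance) == "up"


def deleted_ok(payload, instance):
    """After delete: the instance no longer appears in the job at all."""
    return instance not in target_health(payload)
```

Remote-write presence would need a separate query against the metrics themselves, which this sketch does not cover.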

Logging state

Centralized logs are not enabled yet, and Loki is not deployed.

The planned direction is a centralized Loki instance, with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.

Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions.

OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.