zlh-grind/Codex/Monitoring/CURRENT_STATE.md

Monitoring — Current State

Launch readiness

Status: PARTIAL FAIL for launch monitoring readiness.

Core monitoring services are running:

  • game/dev dynamic discovery works
  • stale game-dev-alloy VMIDs 5205 and 6084 are gone
  • legacy/static Prometheus jobs have been removed from the active runtime config
  • Grafana dashboard provisioning is live from /etc/zlh-monitor
  • the important game/dev metrics path now works through container Alloy remote-write

Launch-debug readiness is still blocked by:

  • final exposure verification/lockdown
  • missing/unfinished centralized logs
  • missing API application health scrape
  • incomplete lifecycle telemetry
  • the unfinished delete half of the game/dev lifecycle smoke test

Runtime source of truth

/etc/zlh-monitor is now the runtime source of truth.

Prometheus now runs from:

  • /etc/zlh-monitor/prometheus/prometheus.yml
  • configured via /etc/default/prometheus
  • backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak

Grafana now provisions from:

  • /etc/zlh-monitor/grafana/provisioning
  • configured via /etc/default/grafana-server
  • backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak

/etc/zlh-monitor/README.md was updated to describe the new single-source layout.

Canonical metrics model

Monitoring is not API-only.

The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. The canonical metrics path for game/dev containers is now:

container Alloy remote_write
  -> Prometheus on 10.60.0.25:9090
  -> Grafana dashboards

The game-dev-alloy file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on 127.0.0.1:12345 and Prometheus file_sd scrapes container-ip:12345.

If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as 0.0.0.0:12345, with security review.
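The canonical-vs-scrape decision can be reduced to a bind check. A minimal sketch, assuming the shell has access to the container's configured listen address (Grafana Alloy's HTTP listen address is typically set via its `--server.http.listen-addr` run flag; the variable and messages below are illustrative, not from the host):

```shell
# Classify an Alloy HTTP bind: loopback binds cannot be scraped by the
# Prometheus file_sd job, which targets container-ip:12345.
listen_addr="127.0.0.1:12345"   # bind observed on current containers

case "$listen_addr" in
  127.0.0.1:*|localhost:*)
    echo "loopback-only: file_sd scrape will show DOWN" ;;
  *)
    echo "externally bound: direct scrape possible (security review first)" ;;
esac
```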

Recently verified progress

Container discovery is clean:

  • /sd/exporters -> [] during empty state
  • /monitoring/game-dev -> containers: [] during empty state
  • monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
  • one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone
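The empty-state behavior above hinges on the sync writing a valid empty file rather than failing. A sketch of that write (a local path is used here for illustration; the real file is /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json):

```shell
# An empty JSON array is a valid Prometheus file_sd file: Prometheus keeps
# the job but drops all game-dev-alloy targets, which is how VMIDs 5205 and
# 6084 aged out cleanly.
printf '[]\n' > ./game-dev-alloy.json
cat ./game-dev-alloy.json   # -> []
```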

Prometheus runtime config is clean:

  • config validates with promtool
  • active jobs are now only:
    • prometheus
    • alloy -> 127.0.0.1:12345
    • game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • no active node_exporter job
  • no active cAdvisor job
  • no active :9100 jobs
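The three active jobs above can be sketched as a minimal runtime config (scrape interval and layout are assumptions, not copied from the host; only the job names and targets come from this document):

```shell
# Illustrative reconstruction of the active scrape set. Validate any real
# edit with promtool before reloading:  promtool check config <file>
cat > ./prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['10.60.0.25:9090']
  - job_name: alloy
    static_configs:
      - targets: ['127.0.0.1:12345']
  - job_name: game-dev-alloy
    file_sd_configs:
      - files: ['/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json']
EOF
grep -c 'job_name' ./prometheus.yml   # -> 3
```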

Runtime bind/target state after the remote-write fix:

  • Prometheus listens on 10.60.0.25:9090
  • Prometheus self-scrape target is 10.60.0.25:9090
  • Grafana remains loopback-only on 127.0.0.1:3000
  • zlh-monitor local Alloy remains loopback-only on 127.0.0.1:12345
  • local node_exporter is disabled

Grafana provisioning is live:

  • one source-managed datasource exists:
    • uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
  • provisioned dashboards:
    • Core Service - Detail
    • Core Services - Overview
    • Dev Containers
    • Game Containers
  • legacy node_exporter router dashboard archived to:
    • /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Services active after cleanup:

  • prometheus
  • grafana-server
  • alloy
  • discovery timer

Game/dev metrics smoke evidence

The important game/dev metrics path works for the current test containers.

API discovery emits both test servers:

  • 6085 / dev-6085 / 10.100.0.23 / dev
  • 5206 / mc-vanilla-5206 / 10.200.0.79 / game

file_sd sync writes both targets, and Prometheus discovers both.

Alloy remote-write from both containers works after Prometheus was rebound to 10.60.0.25:9090.

Dashboard-style metrics exist for both containers:

  • node_uname_info
  • CPU metrics
  • memory metrics

Observed values:

  • 6085 / dev-6085 / dev-container: CPU ~1.04%, Mem ~6.58%
  • 5206 / mc-vanilla-5206 / game-container: CPU ~0.68%, Mem ~61.69%

Known design mismatch:

  • container Alloy listens on 127.0.0.1:12345
  • Prometheus file_sd scrapes container-ip:12345
  • therefore game-dev-alloy scrape targets are discovered but down

Current decision: use remote-write as canonical and do not rely on game-dev-alloy scrape health panels unless/until direct container Alloy scrape is intentionally enabled.
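Dashboards can encode this decision by keying container health off the remote-written series instead of scrape health. A hedged PromQL sketch (metric and label names as observed in this document; grouping by `vmid` is an assumption about the label set):

```promql
# Canonical health: the container has pushed node metrics via remote-write.
count by (vmid) (node_uname_info)

# Not canonical while container Alloy is loopback-only: these targets read 0.
up{job="game-dev-alloy"}
```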

Exposure state

Current reported bind posture:

  • Prometheus listens on 10.60.0.25:9090
  • Grafana remains loopback-only on 127.0.0.1:3000
  • zlh-monitor local Alloy remains loopback-only on 127.0.0.1:12345
  • local node_exporter is disabled

Remaining exposure work:

  • verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet
  • verify Grafana is not reachable except via intended admin path
  • verify node_exporter is no longer listening or is firewalled if any listener remains
  • verify host firewall/nftables policy is appropriate
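The listener half of this verification can be scripted. A sketch, with sample data standing in for live `ss` output and an allow-policy that is an assumption (loopback plus the 10.60.0.25 management bind):

```shell
# Flag any TCP listener bound beyond loopback or the expected Prometheus
# management address. In production, replace the sample with:
#   binds="$(ss -ltnH | awk '{print $4}')"
allowed='^(127\.|10\.60\.0\.25)'
binds='127.0.0.1:3000
10.60.0.25:9090
127.0.0.1:12345'
echo "$binds" | grep -Ev "$allowed" || echo "no unexpected listeners"
```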

Prometheus state

  • Runtime config path: /etc/zlh-monitor/prometheus/prometheus.yml
  • Config validates with promtool
  • Active jobs:
    • prometheus
    • alloy
    • game-dev-alloy
  • Prometheus listens on 10.60.0.25:9090
  • Prometheus self-scrape target is 10.60.0.25:9090
  • /sd/exporters is configured with bearer auth and works with token
  • unauthenticated /sd/exporters and /monitoring/game-dev return 401 Unauthorized

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
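One way to tighten storage is Prometheus's `credentials_file` support, which keeps the token out of the YAML and backups. A sketch (the secrets path and token value are placeholders, not existing files on the host):

```shell
# Create a root-only token file; Prometheus reads it at scrape time, so the
# token no longer appears inline in prometheus.yml or its .bak copies.
install -m 600 /dev/null ./discovery.token    # real path would live under /etc/zlh-monitor
printf '%s\n' 'PLACEHOLDER-TOKEN' > ./discovery.token

# In the discovery scrape/http_sd config, replace the inline token with:
#   authorization:
#     credentials_file: /etc/zlh-monitor/secrets/discovery.token   # hypothetical path
stat -c '%a' ./discovery.token   # -> 600
```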

Targets/discovery state

Resolved:

  • game-dev-alloy VMID 5205 is gone
  • game-dev-alloy VMID 6084 is gone
  • alloy-dev-6076 hardcoded/down target is gone from active runtime target list
  • cAdvisor 10.60.0.11:8081 is gone from active runtime target list
  • zerolaghub-nodes 10.60.0.19:9100 is gone from active runtime target list
  • empty discovery writes a valid empty file_sd file instead of failing
  • active test containers appear in API discovery and file_sd

Expected remaining validation:

  • delete test dev/game servers
  • confirm API discovery drops them
  • confirm file_sd drops them
  • confirm dashboard metrics age out as expected
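The delete-half checks above can be sketched as a script skeleton (the API base URL and token variable are assumptions; the endpoints are the ones named in this document):

```shell
# 1. After deleting the test servers, discovery should report empty again:
#      curl -s -H "Authorization: Bearer $TOKEN" "$API/sd/exporters"          # expect []
#      curl -s -H "Authorization: Bearer $TOKEN" "$API/monitoring/game-dev"   # expect containers: []
# 2. After the next sync, file_sd should be empty; a stand-in for:
#      cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
sd='[]'
test "$sd" = "[]" && echo "file_sd empty: PASS"
```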

API monitoring state

API discovery is working for game/dev monitoring inventory.

The API application health endpoint /health previously returned 404 (Cannot GET /health), and no Prometheus scrape job for API health was found.

No centralized API logs were found on the monitoring host.

Agent/container monitoring state

Current test containers prove remote-write metrics are working:

  • 6085 dev-6085 dev-container
  • 5206 mc-vanilla-5206 game-container

Labels previously observed include:

  • vmid
  • ctype or container_type
  • game
  • variant
  • hostname / name
  • job
  • instance

Missing or incomplete for lifecycle debugging:

  • ready / connectable
  • operationInProgress
  • operationType
  • backup/restore state
  • code-server state
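One hedged way to close this gap: have the Agent write lifecycle gauges in Prometheus textfile-collector format, which the node_exporter-style collector inside container Alloy can pick up. The metric names below are proposals, not metrics that exist today:

```shell
# Proposed lifecycle gauges, written by the Agent on state changes and
# shipped over the existing remote-write path.
cat > ./zlh_lifecycle.prom <<'EOF'
# HELP zlh_container_ready 1 when the container is connectable.
# TYPE zlh_container_ready gauge
zlh_container_ready 1
# HELP zlh_operation_in_progress 1 while a lifecycle operation runs.
# TYPE zlh_operation_in_progress gauge
zlh_operation_in_progress{operation_type="backup"} 0
EOF
grep -c '^zlh_' ./zlh_lifecycle.prom   # -> 2
```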

Grafana state

Grafana now provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.

Datasource:

  • uid=prometheus
  • name=Prometheus
  • url=http://localhost:9090
  • default datasource

Live provisioned dashboards:

  • Core Service - Detail
  • Core Services - Overview
  • Dev Containers
  • Game Containers

Archived:

  • /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Dashboard-style CPU and memory metrics exist for both current test containers.

Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.

Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

Security state

Working controls:

  • Grafana is loopback-only on 127.0.0.1:3000 per current report
  • discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
  • Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit

Remaining launch security checks:

  • Prometheus on 10.60.0.25:9090 must be restricted to intended container remote-write and admin/monitoring paths
  • any remaining node_exporter listener must be disabled or firewalled
  • host firewall policy must be verified