Monitoring — Current State

Launch readiness

Status: PARTIAL FAIL for launch monitoring readiness.

Core monitoring services are running. Dynamic game/dev discovery now handles the empty state cleanly, stale game-dev-alloy VMIDs 5205 and 6084 are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is live from /etc/zlh-monitor, the runtime source of truth.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing or unfinished centralized logging, a missing API application health scrape, incomplete lifecycle telemetry, and an outstanding full game/dev lifecycle smoke test.

Runtime source of truth

/etc/zlh-monitor is now the runtime source of truth.

Prometheus now runs from:

  • /etc/zlh-monitor/prometheus/prometheus.yml
  • configured via /etc/default/prometheus
  • backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak

Grafana now provisions from:

  • /etc/zlh-monitor/grafana/provisioning
  • configured via /etc/default/grafana-server
  • backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak

/etc/zlh-monitor/README.md was updated to describe the new single-source layout.

Recently verified progress

Container discovery is clean:

  • /sd/exporters -> []
  • /monitoring/game-dev -> containers: []
  • monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
  • one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone
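
The empty-state behaviour above can be sketched as follows. This is a minimal illustration, not the actual sync script, which is not shown in this document: the function name write_file_sd and the allow_empty parameter are hypothetical stand-ins for whatever ALLOW_EMPTY="true" controls on the monitor side. It assumes the standard Prometheus file_sd JSON shape (a list of target groups) and writes the file atomically so Prometheus never reads a partial file.

```python
import json
import os
import tempfile


def write_file_sd(target_groups, path, allow_empty=False):
    """Write Prometheus file_sd target groups to `path` atomically.

    Returns the number of target groups written. Refuses to write an
    empty list unless allow_empty is set (assumed ALLOW_EMPTY semantics).
    """
    if not target_groups and not allow_empty:
        raise ValueError("discovery returned no targets and allow_empty is false")

    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the
    # destination so readers see either the old file or the new one.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(target_groups, handle, indent=2)
            handle.write("\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return len(target_groups)
```

With allow_empty=True and an empty discovery result, the file contains a valid empty JSON list, which is what lets stale targets drop out instead of lingering after a failed sync.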

Prometheus runtime config is clean:

  • config validates with promtool
  • active jobs are now only:
    • prometheus
    • alloy -> 127.0.0.1:12345
    • game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • no active node_exporter job
  • no active cAdvisor job
  • no active :9100 jobs

Current Prometheus targets are clean:

  • up alloy 127.0.0.1:12345
  • up prometheus localhost:9090
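
A check like the one above could be automated against the JSON returned by the Prometheus /api/v1/targets endpoint. The sketch below is illustrative only: the function name audit_targets is hypothetical, and it operates on an already-parsed payload rather than making the HTTP call itself.

```python
# Jobs expected in the active runtime config, per the list above.
EXPECTED_JOBS = frozenset({"prometheus", "alloy", "game-dev-alloy"})


def audit_targets(payload, expected_jobs=EXPECTED_JOBS):
    """Return (unexpected, down) lists of (job, instance) pairs.

    `payload` is the parsed JSON body of GET /api/v1/targets.
    """
    unexpected, down = [], []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        key = (labels.get("job", ""), labels.get("instance", ""))
        if key[0] not in expected_jobs:
            unexpected.append(key)  # job not in the cleaned-up config
        if target.get("health") != "up":
            down.append(key)  # target present but not scraping successfully
    return unexpected, down
```

For the current state (alloy and prometheus both up), both returned lists would be empty; a leftover static job such as the removed :9100 scrape would appear in the first list.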

Grafana provisioning is live:

  • one source-managed datasource exists:
    • uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
  • provisioned dashboards:
    • Core Service - Detail
    • Core Services - Overview
    • Dev Containers
    • Game Containers
  • legacy node_exporter router dashboard archived to:
    • /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Services active after cleanup:

  • prometheus
  • grafana-server
  • alloy
  • discovery timer

Exposure state

Observed listening ports from prior audit:

  • *:9090 Prometheus
  • *:3000 Grafana
  • *:9100 node_exporter
  • 127.0.0.1:12345 Alloy

Prometheus active jobs no longer include node_exporter or :9100 scraping, but network exposure still needs separate verification/lockdown.

Host firewall posture from prior audit:

  • UFW inactive
  • nftables input/forward/output policy accept

This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
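
As a rough aid for re-verifying exposure, the listener table above can be classified mechanically. The sketch below is an assumption-laden helper (the names is_loopback_only and exposed_listeners are hypothetical): it treats anything not bound to a loopback address as exposed, which is deliberately conservative and says nothing about what a firewall would allow.

```python
import ipaddress


def is_loopback_only(listen_addr):
    """Return True if a 'host:port' listen address is bound to loopback."""
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::1]:3000
    if host in ("", "*"):
        return False  # wildcard bind: all interfaces
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown hostname: treat as exposed


def exposed_listeners(listeners):
    """Return the sorted service names whose listener is not loopback-only."""
    return sorted(name for addr, name in listeners.items()
                  if not is_loopback_only(addr))


# Listener table from the prior audit.
AUDIT_LISTENERS = {
    "*:9090": "Prometheus",
    "*:3000": "Grafana",
    "*:9100": "node_exporter",
    "127.0.0.1:12345": "Alloy",
}
```

Applied to the audit table, exposed_listeners(AUDIT_LISTENERS) returns Grafana, Prometheus, and node_exporter, matching the three listeners called out as launch blockers.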

Prometheus state

  • Runtime config path: /etc/zlh-monitor/prometheus/prometheus.yml
  • Config validates with promtool
  • Active jobs:
    • prometheus
    • alloy
    • game-dev-alloy
  • Current targets:
    • up alloy 127.0.0.1:12345
    • up prometheus localhost:9090
  • /sd/exporters is configured with bearer auth and works with token
  • unauthenticated /sd/exporters and /monitoring/game-dev return 401 unauthorized

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
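
One way to track this risk is to scan Prometheus YAML (including backups) for secrets stored inline. The sketch below is a simple line-based heuristic, not a YAML parser, and the function name find_inline_secrets is hypothetical. It flags the inline bearer_token, credentials, and password fields of the Prometheus HTTP client config and ignores their *_file variants, which reference a separate file instead. It does not cover the .env file, whose variable names are not given here.

```python
import re

# Inline secret fields; the *_file variants do not match because the
# field name must be followed directly by optional spaces and a colon.
INLINE_SECRET = re.compile(
    r"^\s*(?:-\s+)?(bearer_token|credentials|password)\s*:\s*\S"
)


def find_inline_secrets(config_text):
    """Return (line_number, field) for each inline secret field found."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = INLINE_SECRET.match(line)
        if match:
            # Report only the field name, never the secret value itself.
            hits.append((number, match.group(1)))
    return hits
```

An empty result for every config and backup would suggest the token has been moved out of inline YAML, for example behind a credentials_file reference.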

Targets/discovery state

Resolved:

  • game-dev-alloy VMID 5205 is gone
  • game-dev-alloy VMID 6084 is gone
  • alloy-dev-6076 hardcoded/down target is gone from active runtime target list
  • cAdvisor 10.60.0.11:8081 is gone from active runtime target list
  • zerolaghub-nodes 10.60.0.19:9100 is gone from active runtime target list
  • empty discovery writes a valid empty file_sd file instead of failing

No stale/static down targets are currently present in active Prometheus targets.

API monitoring state

API host/system metrics were previously available via integrations/unix. The active runtime Prometheus jobs are now intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.

The API app health endpoint /health previously returned 404 (Cannot GET /health), and no API health scrape was found.

No centralized API logs were found on the monitoring host.

Agent/container monitoring state

Current dynamic game/dev discovery is empty and clean.

The next expected validation is a lifecycle smoke test:

  • create a game and/or dev container
  • confirm target appears
  • delete it
  • confirm target disappears now that empty discovery works
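
The appear/disappear checks in the steps above amount to polling until a VMID shows up in, or drops out of, the discovered target set. A minimal sketch under stated assumptions: wait_for_target and its parameters are hypothetical, and fetch_vmids is a caller-supplied callable (for example, one that reads the file_sd JSON or queries Prometheus) returning the set of currently visible VMIDs.

```python
import time


def wait_for_target(fetch_vmids, vmid, present, timeout=120.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until `vmid` is present (or absent) among discovered targets.

    Returns True once the desired state is seen, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if (vmid in fetch_vmids()) == present:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A smoke test would call it once with present=True after creating the container and once with present=False after deleting it; the timeout should comfortably exceed the discovery timer period plus the Prometheus file_sd refresh.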

Labels previously observed include:

  • vmid
  • ctype or container_type
  • game
  • variant
  • hostname / name
  • job
  • instance

Missing or incomplete for lifecycle debugging:

  • ready / connectable
  • operationInProgress
  • operationType
  • backup/restore state
  • code-server state

Grafana state

Grafana now provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.

Datasource:

  • uid=prometheus
  • name=Prometheus
  • url=http://localhost:9090
  • default datasource

Live provisioned dashboards:

  • Core Service - Detail
  • Core Services - Overview
  • Dev Containers
  • Game Containers

Archived:

  • /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.

Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

Security state

Working controls from prior audit:

  • Grafana anonymous access is disabled; unauthenticated /api/search returned 401
  • discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
  • Prometheus labels/scrape URLs checked did not expose the bearer token

Launch-blocking risk:

  • Prometheus (*:9090) and Grafana (*:3000) were previously bound to all interfaces on the host network
  • node_exporter previously listened on *:9100; it is no longer scraped by active Prometheus config, but listener exposure still needs verification/removal or firewalling
  • host firewall policy did not restrict access in prior audit