Monitoring — Current State

Launch readiness

Status: PARTIAL FAIL for launch monitoring readiness.

Core monitoring services are running. Dynamic discovery for game/dev containers now handles the empty state cleanly, and the stale game-dev-alloy targets for VMIDs 5205 and 6084 disappeared from Prometheus after a successful empty sync.

Launch-debug readiness is still blocked by publicly exposed monitoring endpoints, remaining stale/static down targets, dashboards that are not actually installed in Grafana, a missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.

Recently verified progress

Container discovery is clean:

  • /sd/exporters -> []
  • /monitoring/game-dev -> containers: []
  • monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
  • one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone

This resolves the previous dynamic discovery empty-state/stale game-dev-alloy blocker.
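
The empty-sync behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual discovery script: the function name `write_file_sd` and its `allow_empty` parameter are hypothetical, and it assumes the sync simply serializes a list of file_sd target groups to JSON.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, allow_empty=False):
    """Atomically write a Prometheus file_sd JSON file and return the target count.

    `targets` is a list of {"targets": [...], "labels": {...}} groups.
    `allow_empty` is a hypothetical stand-in for the ALLOW_EMPTY setting.
    """
    if not targets and not allow_empty:
        raise ValueError("refusing to write 0 targets (ALLOW_EMPTY is not set)")

    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory so os.replace() is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(targets, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file as 0600; widen it so a separate
        # Prometheus user can still read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return len(targets)
```

With `allow_empty=True`, an empty discovery result produces a file containing `[]`, which Prometheus treats as "no targets" and which lets previously discovered targets drop out on the next refresh.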

Running services observed from prior audit

  • prometheus.service — active/running
  • grafana-server.service — active/running
  • alloy.service — active/running
  • prometheus-node-exporter.service — active/running

zlh-monitor-game-dev-discovery.service had previously failed, but the latest manual sync succeeded after ALLOW_EMPTY="true" was enabled. Verify timer/service health after the next scheduled run.

Exposure state

Observed listening ports from prior audit:

  • *:9090 Prometheus
  • *:3000 Grafana
  • *:9100 node_exporter
  • 127.0.0.1:12345 Alloy

Host firewall posture from prior audit:

  • UFW inactive
  • nftables input/forward/output policy accept

This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
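
As a rough illustration of how the listener list above maps to exposure, the sketch below flags wildcard binds. The helper name `is_wildcard_bind` is hypothetical, and the address strings are assumed to use the `host:port` form shown in the audit output.

```python
def is_wildcard_bind(listen_addr):
    """Return True if a 'host:port' listen address binds all interfaces."""
    host, _, _port = listen_addr.rpartition(":")
    return host in ("*", "0.0.0.0", "::", "[::]")


# Listeners as recorded in the prior audit.
observed = {
    "Prometheus": "*:9090",
    "Grafana": "*:3000",
    "node_exporter": "*:9100",
    "Alloy": "127.0.0.1:12345",
}

exposed = sorted(name for name, addr in observed.items() if is_wildcard_bind(addr))
print(exposed)
```

Only Alloy is bound to loopback; the other three rely entirely on a host or network firewall that is currently not restricting access.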

Prometheus state

  • Config path: /etc/prometheus/prometheus.yml
  • Global scrape interval: 15s
  • Prometheus self-scrape: 5s
  • Dynamic discovery refresh: 30s
  • /sd/exporters is configured with bearer auth and works with token
  • unauthenticated /sd/exporters and /monitoring/game-dev return 401 unauthorized

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
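
One way to spot inline tokens during a follow-up audit is a simple line scan of the config text. The sketch below is an assumption about how such a check could look, not an existing tool: it flags the Prometheus keys `bearer_token` and `credentials` when they carry an inline value, and ignores their `_file` variants, which point at a separate secret file.

```python
import re

# Matches inline secrets such as "credentials: abc" or "bearer_token: abc",
# but not "credentials_file:" / "bearer_token_file:".
INLINE_SECRET = re.compile(r"^\s*(bearer_token|credentials)\s*:\s*\S+")


def find_inline_tokens(config_text):
    """Return 1-based line numbers where a token appears inline in YAML text."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip commented-out lines
        if INLINE_SECRET.match(line):
            hits.append(number)
    return hits
```

Any reported line is a candidate for moving the token into a root-owned file with restrictive permissions and referencing it via the `_file` form of the key.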

Targets/discovery state

Resolved:

  • game-dev-alloy VMID 5205 is gone
  • game-dev-alloy VMID 6084 is gone
  • empty discovery writes a valid empty file_sd file instead of failing

Remaining problem targets to remove/fix:

  • alloy-dev-6076, 10.100.0.14:12345, hardcoded/down
  • cadvisor, 10.60.0.11:8081, down/no route
  • zerolaghub-nodes, 10.60.0.19:9100, down/connection refused
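
To re-check this list after cleanup, the down targets can be pulled from the Prometheus `/api/v1/targets` response. The sketch below assumes the response has already been fetched and parsed into a dict; the helper name `down_targets` is hypothetical.

```python
def down_targets(payload):
    """Return sorted (job, instance, last_error) tuples for targets with health 'down'.

    `payload` is the parsed JSON body of GET /api/v1/targets.
    """
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") == "down":
            labels = target.get("labels", {})
            rows.append(
                (
                    labels.get("job", ""),
                    labels.get("instance", ""),
                    target.get("lastError", ""),
                )
            )
    return sorted(rows)
```

An empty result after the stale and static entries are removed would confirm that no down targets remain.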

API monitoring state

API host metrics exist via integrations/unix:

  • vmid=9020
  • role=API
  • instance=zlh-api

The API app's /health endpoint previously returned 404 (Cannot GET /health), and no API health scrape job was found.

No centralized API logs were found on the monitoring host.

Agent/container monitoring state

Current dynamic game/dev discovery is empty and clean.

The next expected validation step is a lifecycle smoke test:

  • create a game and/or dev container
  • confirm target appears
  • delete it
  • confirm target disappears now that empty discovery works
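
The appear/disappear checks in the smoke test above could be automated with a small polling helper. This is a sketch under stated assumptions: `wait_for_target` and its parameters are hypothetical, and `list_vmids` is a caller-supplied function that returns the set of VMIDs currently present in Prometheus (for example, extracted from the targets API).

```python
import time


def wait_for_target(list_vmids, vmid, present, timeout_s=120.0, interval_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until `vmid` is present (or absent) in discovery, or the timeout expires.

    Returns True if the expected state was reached, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if (vmid in list_vmids()) == present:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A smoke test would call it once with `present=True` after creating a container and once with `present=False` after deleting it. The timeout should comfortably exceed the 30s discovery refresh plus the sync interval.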

Labels previously observed include:

  • vmid
  • ctype or container_type
  • game
  • variant
  • hostname / name
  • job
  • instance

Missing or incomplete for lifecycle debugging:

  • ready / connectable
  • operationInProgress
  • operationType
  • backup/restore state
  • code-server state

Grafana state

Dashboard JSON files exist under /etc/zlh-monitor/grafana/dashboards:

  • core-routers.json
  • core-service-detail.json
  • core-services-overview.json
  • dev-containers.json
  • game-containers.json

However, the Grafana database previously showed no dashboards installed. The datasource does exist:

  • prometheus -> http://localhost:9090

Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
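
A readiness check for this gap could compare the dashboard files on disk with what Grafana reports from `/api/search`. The sketch below assumes both sides are already loaded into memory; the helper name `missing_dashboards` is hypothetical, and matching on `uid` or `title` is an assumption about how the files and the installed dashboards correspond.

```python
def missing_dashboards(dashboards_by_file, search_results):
    """Return filenames whose dashboard is not present in Grafana search results.

    `dashboards_by_file` maps filename -> parsed dashboard JSON (dict).
    `search_results` is the parsed JSON list returned by GET /api/search.
    """
    installed = [item for item in search_results if item.get("type") == "dash-db"]
    installed_uids = {item["uid"] for item in installed if item.get("uid")}
    installed_titles = {item["title"] for item in installed if item.get("title")}

    missing = []
    for filename, dashboard in sorted(dashboards_by_file.items()):
        uid = dashboard.get("uid")
        title = dashboard.get("title")
        if (uid and uid in installed_uids) or (title and title in installed_titles):
            continue
        missing.append(filename)
    return missing
```

With the state recorded in the prior audit (no dashboards in the Grafana database), every file under the dashboards directory would be reported as missing.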

Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.

Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

Security state

Working controls from prior audit:

  • Grafana anonymous access is disabled; unauthenticated /api/search returned 401
  • discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
  • Prometheus labels/scrape URLs checked did not expose the bearer token

Launch-blocking risk:

  • Prometheus, Grafana, and node_exporter listen on all interfaces (*:9090, *:3000, *:9100)
  • host firewall policy does not restrict access