Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

Recently resolved / verified

  • /etc/zlh-monitor is now the runtime source of truth.

    • Prometheus runs from /etc/zlh-monitor/prometheus/prometheus.yml via /etc/default/prometheus.
    • Grafana provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.
    • /etc/zlh-monitor/README.md describes the new single-source layout.
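
A minimal sketch of the /etc/default wiring described above. The variable names (ARGS for the Debian-packaged Prometheus, PROVISIONING_CFG_DIR for grafana-server) and the extra Prometheus flags are assumptions about how the rebinding and remote-write receive were configured, not copies of the live files.

```sh
# /etc/default/prometheus — sketch, assuming the Debian-style ARGS variable
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml \
      --web.listen-address=10.60.0.25:9090 \
      --web.enable-remote-write-receiver"

# /etc/default/grafana-server — sketch, assuming the packaged PROVISIONING_CFG_DIR variable
PROVISIONING_CFG_DIR=/etc/zlh-monitor/grafana/provisioning
```
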
  • Game/dev dynamic discovery empty-state now works.

    • /sd/exporters -> []
    • /monitoring/game-dev -> containers: []
    • monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
    • one-shot sync succeeded and wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
    • Prometheus refreshed and stale game-dev-alloy VMIDs 5205 and 6084 disappeared
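
For reference, the empty state the sync writes is literally `[]`; a populated file follows Prometheus's file_sd target-group shape. The port and label set below are assumptions drawn from the label-standardization notes later in this file, not a copy of the generated file.

```json
[
  {
    "targets": ["10.100.0.23:12345"],
    "labels": {
      "vmid": "6085",
      "container_type": "dev",
      "name": "dev-6085"
    }
  }
]
```
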
  • Prometheus runtime config is clean.

    • validates with promtool
    • active jobs are only:
      • prometheus
      • alloy -> 127.0.0.1:12345
      • game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
    • no active node_exporter job
    • no active cAdvisor job
    • no active :9100 jobs
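
A sketch of what the active scrape_configs amount to, reconstructed from the job list above rather than copied from prometheus.yml (intervals and any relabeling omitted). `promtool check config /etc/zlh-monitor/prometheus/prometheus.yml` is the validation referenced above.

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```
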
  • Remaining stale/static Prometheus targets have been removed from the active runtime config.

    • alloy-dev-6076 no longer appears as an active runtime target
    • cadvisor 10.60.0.11:8081 no longer appears as an active runtime target
    • zerolaghub-nodes 10.60.0.19:9100 no longer appears as an active runtime target
  • Grafana provisioning is live.

    • source-managed datasource:
      • uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
    • provisioned dashboards:
      • Core Service - Detail
      • Core Services - Overview
      • Dev Containers
      • Game Containers
    • legacy node_exporter router dashboard archived to /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
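
The provisioning files likely look roughly like the following; only the datasource fields above are confirmed, and the dashboards provider path is an assumption.

```yaml
# /etc/zlh-monitor/grafana/provisioning/datasources/prometheus.yml — sketch
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
---
# /etc/zlh-monitor/grafana/provisioning/dashboards/zlh.yml — sketch; dashboards path is assumed
apiVersion: 1
providers:
  - name: zlh-monitor
    type: file
    options:
      path: /etc/zlh-monitor/grafana/dashboards
```
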
  • The critical game/dev metrics path works end-to-end through remote-write.

    • API discovery emitted both test servers:
      • 6085 / dev-6085 / 10.100.0.23 / dev
      • 5206 / mc-vanilla-5206 / 10.200.0.79 / game
    • file_sd sync wrote both targets and Prometheus discovered both
    • container Alloy remote-write works after Prometheus was rebound to 10.60.0.25:9090
    • dashboard-style metrics exist for both:
      • node_uname_info
      • CPU
      • memory
    • observed values:
      • 6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%
      • 5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%
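
The "dashboard-style" CPU/memory numbers above are consistent with the conventional node-exporter-style expressions below; the actual panel queries were not inspected, so treat these as illustrative PromQL only.

```promql
# CPU busy %, per instance
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory used %
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Identity / presence
node_uname_info{container_type=~"game|dev"}
```
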
  • Current monitoring model clarified.

    • Not API-only: API owns discovery/metadata.
    • Canonical metrics path is container Alloy remote-write to Prometheus.
    • game-dev-alloy scrape-up panels should not be relied on while container Alloy listens on 127.0.0.1:12345.
    • Direct scrape health would require intentionally exposing container Alloy HTTP on a reachable address such as 0.0.0.0:12345 plus security review.
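
To make the canonical path concrete, a minimal container-side Alloy pipeline would look roughly like this. It is a sketch only: the real container config was not inspected, the external labels are illustrative, and it assumes Prometheus runs with --web.enable-remote-write-receiver.

```alloy
// Sketch of the container-side pipeline (assumed, not copied from the actual container Alloy config).
prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.zlh.receiver]
}

prometheus.remote_write "zlh" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
  // Illustrative labels; in practice these come from provisioning.
  external_labels = {
    vmid           = "6085",
    container_type = "dev"
  }
}
```
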
  • Services active after cleanup:

    • prometheus
    • grafana-server
    • alloy
    • discovery timer

Launch blockers

  1. Finish network exposure verification/lockdown.

    • Prometheus now listens on 10.60.0.25:9090 for container remote-write and self-scrape.
    • Grafana remains loopback-only on 127.0.0.1:3000.
    • zlh-monitor local Alloy remains loopback-only on 127.0.0.1:12345.
    • local node_exporter is disabled.
    • Required validation:
      • Prometheus is reachable only from the intended container remote-write sources and trusted admin/monitoring paths.
      • Grafana is not externally exposed.
      • node_exporter is not listening or is firewalled.
      • Host firewall/nftables posture is acceptable.
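
A quick way to check most of the validation list above from the monitor host; a sketch assuming ss and nft are available, and not a substitute for testing reachability from the actual remote networks.

```sh
# Listener check: expect 9090 bound to 10.60.0.25 only, 3000 and 12345 on 127.0.0.1, nothing on 9100.
ss -ltnp | grep -E ':(9090|3000|12345|9100)\b'

# Confirm the nftables posture actually restricts 9090 to the container/remote-write and admin ranges.
nft list ruleset | grep -n 9090
```
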
  2. Add API health scrape or equivalent application-level health signal.

    • Observed: /health previously returned 404 ("Cannot GET /health").
    • Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
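
One way to satisfy the acceptance criterion once the API exposes something scrapeable; everything below (path, host, port, job name) is hypothetical, since /health currently 404s.

```yaml
scrape_configs:
  - job_name: zlh-api-health
    metrics_path: /metrics            # hypothetical; a plain /healthz could instead be probed via blackbox_exporter
    static_configs:
      - targets: ["zlh-api.internal:8080"]   # hypothetical API host:port
```
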
  3. Add launch-debug lifecycle telemetry.

    • Missing/incomplete: ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
    • Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
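
The kind of signal this implies, shown in Prometheus exposition format; every metric name below is invented for illustration and would need to be agreed with the API/Agent owners.

```text
zlh_container_ready{vmid="5206",container_type="game"} 1
zlh_container_connectable{vmid="5206",container_type="game"} 1
zlh_container_operation_in_progress{vmid="6085",container_type="dev",operation_type="restore"} 1
zlh_backup_state{vmid="5206",state="idle"} 1
zlh_code_server_up{vmid="6085"} 1
```
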
  4. Add centralized logs.

    • No Loki, Promtail, or log-ingesting Alloy config found in prior audit.
    • Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
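
If the log pipeline ends up being Alloy plus Loki, a minimal monitor-host journal pipeline would look roughly like this; the Loki address and labels are placeholders, and no Loki instance exists yet per the audit.

```alloy
loki.source.journal "host" {
  forward_to = [loki.write.default.receiver]
  labels     = { job = "systemd-journal", host = "zlh-monitor" }
}

loki.write "default" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}
```
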
  5. Tighten bearer token storage.

    • Token previously appeared in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env.
    • Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
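
On the Prometheus side, the token can be referenced from a root-owned 0600 file instead of being embedded in YAML (Prometheus supports authorization.credentials_file); the secrets path and target below are assumptions. The discovery env file could similarly move to a 0600 file or a systemd LoadCredential.

```yaml
scrape_configs:
  - job_name: zlh-api
    authorization:
      type: Bearer
      credentials_file: /etc/zlh-monitor/secrets/api-token   # assumed path, mode 0600
    static_configs:
      - targets: ["zlh-api.internal:8080"]                   # hypothetical, as in the health-scrape sketch
```
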
  6. Finish lifecycle smoke test delete step.

    • Delete test dev/game servers.
    • Confirm API discovery drops them.
    • Confirm file_sd drops them.
    • Confirm dashboard metrics age out as expected.
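
A sketch of the delete-step checks; API_HOST is a placeholder, the token path reuses the assumed secrets location above, and jq is assumed to be installed on the monitor host.

```sh
# API discovery should no longer list the deleted VMIDs.
curl -s -H "Authorization: Bearer $(cat /etc/zlh-monitor/secrets/api-token)" \
  https://API_HOST/sd/exporters | jq .

# file_sd should drop them after the next sync run.
jq . /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

# Remote-written series for a deleted VMID should go stale and return no results within minutes.
curl -sG 'http://10.60.0.25:9090/api/v1/query' \
  --data-urlencode 'query=node_uname_info{vmid="5206"}' | jq '.data.result | length'
```
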

Standardization work

  • Standardize labels: prefer canonical container_type over mixed ctype / container_type.
  • Decide canonical label spelling for hostname/name/server identity.
  • Keep required game/dev labels minimal: vmid, instance, container_type.
  • Keep optional metadata useful but not noisy.
  • Review any user/customer-identifying labels before dashboard or alert exposure.
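
One way to enforce the canonical container_type label during the transition is to normalize it at ingestion. Since the canonical path is remote-write, this would most likely live in the container Alloy pipeline (e.g. a prometheus.relabel component) rather than a Prometheus scrape job; the fragment below is a sketch in Prometheus relabel syntax, assuming the mixed label is literally named ctype.

```yaml
    metric_relabel_configs:
      - source_labels: [ctype]
        regex: "(.+)"
        target_label: container_type
      - action: labeldrop
        regex: "ctype"
```
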

Alerts to add

  • discovery sync failed
  • file_sd file stale / not updated recently
  • missing/stale remote-write metrics by VMID/type
  • API /sd/exporters failure
  • API app health failure
  • Prometheus scrape failures by important job
  • Grafana dashboard provisioning missing/failure
  • log pipeline down once Loki/Alloy logs exist
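
Two of these can be expressed today against metrics Prometheus already exposes for file-based discovery; the thresholds below are placeholders, and the rest of the list needs new signals (sync status, API health) before rules can be written.

```yaml
groups:
  - name: zlh-monitoring
    rules:
      - alert: GameDevFileSDStale
        expr: 'time() - prometheus_sd_file_mtime_seconds{filename=~".*game-dev-alloy.json"} > 3600'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "game-dev file_sd file has not been rewritten by the discovery sync recently"

      - alert: FileSDReadErrors
        expr: 'increase(prometheus_sd_file_read_errors_total[15m]) > 0'
        labels:
          severity: warning
        annotations:
          summary: "Prometheus failed to read a file_sd target file"
```
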

Follow-up improvements

  • Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
  • Add panels for lifecycle state: ready, connectable, operationInProgress, operationType, backup/restore status, code-server status.
  • Add dashboard row/panels for stale/deleted targets.
  • Add a runbook for discovery sync failure.
  • Add a runbook for failed game/dev provisioning debugging from monitoring only.