zlh-grind/Codex/Monitoring/OPEN_ITEMS.md
2026-05-01 18:27:24 +00:00

Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

Launch blockers

  1. Restrict monitoring endpoint exposure.

    • Prometheus currently listens on *:9090.
    • Grafana currently listens on *:3000.
    • node_exporter currently listens on *:9100.
    • UFW is inactive and nftables policy is accept.
    • Required fix: bind each listener to loopback or the management VLAN only, or firewall the ports so only trusted admin/monitoring sources can reach them.
  2. Fix game/dev discovery sync.

    • zlh-monitor-game-dev-discovery.service is in the failed state.
    • Current failure: API returns dev VMID 6050 on 10.60.0.220, but the sync script only allows dev IPs in 10.100.0.0/16.
    • Decide whether VMID 6050 on 10.60.0.0/16 is intentional; then either update the allowlist or fix the API/IP assignment.
  3. Remove stale file_sd targets after discovery is healthy.

    • Stale targets observed: VMID 5205 and VMID 6084.
    • Acceptance: deleted containers disappear automatically and stale targets do not remain indefinitely.
  4. Install/provision Grafana dashboards.

    • Dashboard JSON exists under /etc/zlh-monitor/grafana/dashboards.
    • Grafana DB currently has no dashboards installed.
    • Acceptance: dashboards visible in Grafana and queries return live data.
  5. Add API health scrape.

    • Observed: /health returns 404 ("Cannot GET /health").
    • Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
  6. Add launch-debug lifecycle telemetry.

    • Missing/incomplete: ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
    • Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
  7. Add centralized logs.

    • No Loki, Promtail, or log-ingesting Alloy config found.
    • Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
  8. Tighten bearer token storage.

    • Token currently appears in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env.
    • Acceptance: token stored in a tightly permissioned env file or in systemd credentials; not world-readable; not leaked in labels/URLs/logs.
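
The discovery sync failure in blocker 2 comes down to a CIDR membership check. A minimal sketch of that check, assuming the sync script validates each dev IP against a list of allowed networks. The function name `ip_allowed` and both allowlist constants are hypothetical; the real script may be structured differently, and extending the allowlist is only correct if 10.60.0.0/16 turns out to be intentional.

```python
import ipaddress

# Hypothetical allowlists. Only the networks themselves come from the notes above.
CURRENT_DEV_ALLOWLIST = ["10.100.0.0/16"]
EXTENDED_DEV_ALLOWLIST = ["10.100.0.0/16", "10.60.0.0/16"]


def ip_allowed(ip: str, allowlist: list[str]) -> bool:
    """Return True if ip falls inside any network in allowlist."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in allowlist)


if __name__ == "__main__":
    dev_ip = "10.60.0.220"  # address the API reports for dev VMID 6050
    print(ip_allowed(dev_ip, CURRENT_DEV_ALLOWLIST))   # False: rejected today
    print(ip_allowed(dev_ip, EXTENDED_DEV_ALLOWLIST))  # True: accepted if the range is added
```

Keeping the allowlist as data rather than hard-coding a single range makes the "update the allowlist" branch of the decision a one-line change.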

Standardization work

  • Standardize labels: prefer canonical container_type over mixed ctype / container_type.
  • Decide canonical label spelling for hostname/name/server identity.
  • Keep required game/dev labels minimal: vmid, instance, container_type.
  • Keep optional metadata useful but not noisy.
  • Review any user/customer-identifying labels before dashboard or alert exposure.
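
One way to apply the label rules above is to normalize labels before they are written to file_sd. A minimal sketch, assuming labels arrive as a flat string-to-string mapping; `normalize_labels` and `missing_required` are hypothetical helpers, not part of the existing sync script.

```python
# Required labels as listed in the standardization notes.
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy with the legacy ctype label folded into container_type."""
    out = dict(labels)
    legacy = out.pop("ctype", None)
    if legacy is not None:
        # If both spellings are present, the canonical one wins.
        out.setdefault("container_type", legacy)
    return out


def missing_required(labels: dict[str, str]) -> list[str]:
    """List required labels that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


if __name__ == "__main__":
    raw = {"vmid": "6050", "ctype": "dev"}  # illustrative input
    clean = normalize_labels(raw)
    print(clean)                    # {'vmid': '6050', 'container_type': 'dev'}
    print(missing_required(clean))  # ['instance']
```

Rejecting or logging targets with missing required labels at sync time keeps the mixed ctype / container_type spelling from reaching dashboards and alerts.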

Alerts to add

  • discovery sync failed
  • file_sd file stale / not updated recently
  • target down by VMID/type
  • API /sd/exporters failure
  • API app health failure
  • Prometheus scrape failures by important job
  • Grafana dashboard provisioning missing/failure
  • log pipeline down (once Loki/Alloy log collection exists)
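
The "file_sd file stale" alert needs a concrete staleness condition. A minimal sketch of one, based on file modification time; the function name, the threshold, and the choice to treat a missing file as stale are all assumptions. How the result reaches Prometheus (for example, as a gauge exported by the sync job) is left open.

```python
import os
import time
from typing import Optional


def file_sd_is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file is missing or was last modified too long ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable file counts as stale
    return (now - mtime) > max_age_seconds
```

The `now` parameter exists so the check can be tested with fixed timestamps. In normal use it is omitted and the current time is used.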

Follow-up improvements

  • Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
  • Add panels for lifecycle state: ready, connectable, operationInProgress, operationType, backup/restore status, code-server status.
  • Add dashboard row/panels for stale/deleted targets.
  • Add a runbook for discovery sync failure.
  • Add a runbook for failed game/dev provisioning debugging from monitoring only.
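
The stale-target panels and the discovery-sync runbook both need a way to identify targets that outlived their containers. A minimal sketch, assuming file_sd target groups carry a `vmid` label and that the set of live VMIDs is available from the API. `stale_vmids` is a hypothetical helper, and the target addresses below are placeholders from the documentation range, not real hosts.

```python
def stale_vmids(file_sd_groups: list[dict], live_vmids: set[str]) -> set[str]:
    """Return VMIDs listed in file_sd target groups but absent from the live set."""
    listed = {group.get("labels", {}).get("vmid") for group in file_sd_groups}
    listed.discard(None)  # ignore groups without a vmid label
    return listed - live_vmids


if __name__ == "__main__":
    # Placeholder addresses; VMIDs 5205 and 6084 are the stale targets noted above.
    groups = [
        {"targets": ["192.0.2.10:9100"], "labels": {"vmid": "5205"}},
        {"targets": ["192.0.2.11:9100"], "labels": {"vmid": "6084"}},
        {"targets": ["192.0.2.12:9100"], "labels": {"vmid": "6050"}},
    ]
    live = {"6050"}  # assumed live set, as reported by the API
    print(sorted(stale_vmids(groups, live)))  # ['5205', '6084']
```

A non-empty result that persists across several sync intervals is a candidate signal for the stale-target panel.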