zlh-grind/Codex/Monitoring/OPEN_ITEMS.md

Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

Active

  1. Add centralized logs.

    • Loki is not deployed yet.
    • Future direction: centralized Loki as shared infrastructure.
    • Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
    • Desired sources:
      • API application logs
      • Agent logs
      • provisioning logs
      • backup/restore logs
      • delete/teardown logs
      • Velocity/DNS/edge action logs
      • discovery sync logs
  2. Decide whether OPNsense needs router-only monitoring.

    • Current game/dev monitoring intentionally uses Alloy, not node_exporter.
    • OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
    • Keep this separate from the game/dev container Alloy contract.

Optional hardening / improvements

  • Add API app health scrape or equivalent application-level health signal.
  • Add lifecycle/debug telemetry panels or metrics for:
    • ready
    • connectable
    • operationInProgress
    • operationType
    • backup/restore state
    • code-server state
  • Standardize labels: use container_type as the canonical label instead of the current mix of ctype and container_type.
  • Decide canonical label spelling for hostname/name/server identity.
  • Keep required game/dev labels minimal: vmid, instance, container_type.
  • Review any user/customer-identifying labels before dashboard or alert exposure.
  • Add runbooks for debugging discovery sync failures and failed game/dev provisioning using monitoring data only.
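
The label items above could be prototyped with a small check like the following. This is only a sketch: the function names `normalize_labels` and `missing_required` and the alias table are hypothetical, and it assumes `ctype` is the only legacy spelling that needs mapping to `container_type`.

```python
# Required game/dev labels, taken from the list above.
REQUIRED_LABELS = ("vmid", "instance", "container_type")

# Assumption: ctype is the only legacy alias that needs mapping.
LABEL_ALIASES = {"ctype": "container_type"}


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of labels with legacy names mapped to canonical ones."""
    normalized: dict[str, str] = {}
    for key, value in labels.items():
        canonical = LABEL_ALIASES.get(key, key)
        # Refuse to silently pick a winner when both spellings disagree.
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for label {canonical!r}")
        normalized[canonical] = value
    return normalized


def missing_required(labels: dict[str, str]) -> list[str]:
    """List required labels that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]
```

A check like this could run against discovery output before targets are written, so that mixed spellings are caught before they reach dashboards or alerts.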

Alerts to add

  • discovery sync failed
  • file_sd file stale / not updated recently
  • missing/stale remote-write metrics by VMID/type
  • game-dev-alloy scrape target down by VMID/type
  • API /sd/exporters failure
  • API app health failure once an app health signal exists
  • Prometheus scrape failures by important job
  • Grafana dashboard provisioning missing or failed
  • log pipeline down once Loki/Alloy logs exist