Monitoring — Validation Checklist

Use this checklist to decide whether launch monitoring is ready.

For each item, record:

  • PASS / FAIL / BLOCKED
  • command or UI path used
  • observed result
  • target VMID/server ID if applicable
  • follow-up issue link if failed
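
One way to keep these records consistent is a small structure that flags incomplete entries. The sketch below is illustrative only: the class and field names (`CheckRecord`, `method`, `observed`, `target`, `follow_up`) are assumptions, not a schema defined by this project.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"


@dataclass
class CheckRecord:
    """One checklist entry. Field names are hypothetical."""
    item: str
    status: Status
    method: str                       # command or UI path used
    observed: str                     # observed result
    target: Optional[str] = None      # VMID/server ID, if applicable
    follow_up: Optional[str] = None   # issue link, expected when the item failed

    def problems(self) -> List[str]:
        """Return reasons the record is incomplete; an empty list means complete."""
        issues = []
        if not self.method.strip():
            issues.append("missing command or UI path")
        if not self.observed.strip():
            issues.append("missing observed result")
        if self.status is Status.FAIL and not self.follow_up:
            issues.append("failed item has no follow-up issue link")
        return issues
```

A record with status FAIL and no `follow_up` reports one problem, which makes gaps in the write-up easy to spot before sign-off.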

Security/access validation

  • Prometheus is not reachable from untrusted networks.
  • Grafana is not reachable from untrusted networks except through the intended authenticated admin path.
  • node_exporter is not reachable from untrusted networks.
  • UFW/nftables or equivalent host/network policy restricts monitoring ports.
  • /sd/exporters returns 401 without token.
  • /monitoring/game-dev returns 401 without token.
  • bearer token is not present in labels, scrape URLs, dashboard JSON, or logs.
  • bearer token files have tight permissions or use systemd credentials.
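
Two of the checks above lend themselves to a short script: confirming that an endpoint answers 401 when no token is sent, and confirming that a token file grants nothing to group or others. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and "tight" is interpreted here as mode 0600 or stricter, which is an assumption rather than a stated project policy.

```python
import os
import stat
import urllib.error
import urllib.request


def unauthenticated_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET sent with no Authorization header."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code


def mode_is_tight(mode: int) -> bool:
    """True when neither group nor others have any permission bits set."""
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def token_file_is_tight(path: str) -> bool:
    """Apply mode_is_tight to the permission bits of a file on disk."""
    return mode_is_tight(stat.S_IMODE(os.stat(path).st_mode))
```

The expected result is that `unauthenticated_status(base + "/sd/exporters")` and `unauthenticated_status(base + "/monitoring/game-dev")` both return 401, where `base` stands for the API base URL, which this checklist does not specify. A connection error (raised as `urllib.error.URLError`) is a different outcome from a 401 and should be recorded as such.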

Service validation

  • prometheus.service active/running.
  • grafana-server.service active/running.
  • alloy.service active/running.
  • local node/system metrics target up.
  • monitoring discovery sync service active/successful.
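
The unit checks can be scripted around `systemctl is-active`, which prints the unit state and exits non-zero when the unit is not active. The sketch below covers only the three unit names listed above; the discovery sync unit is left out because its name is not given here. The function names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable

UNITS = ("prometheus.service", "grafana-server.service", "alloy.service")


def systemctl_state(unit: str) -> str:
    """Return the state printed by `systemctl is-active`, e.g. 'active' or 'failed'."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit just means "not active"
    )
    return result.stdout.strip()


def inactive_units(
    units: Iterable[str],
    state_of: Callable[[str], str] = systemctl_state,
) -> Dict[str, str]:
    """Map each unit that is not 'active' to the state that was reported."""
    found = {}
    for unit in units:
        state = state_of(unit)
        if state != "active":
            found[unit] = state
    return found
```

An empty result from `inactive_units(UNITS)` corresponds to a PASS for the first three items. The `state_of` parameter exists so the logic can be exercised without a running systemd.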

Discovery validation

  • authenticated /sd/exporters returns current active game/dev targets.
  • authenticated /monitoring/game-dev returns current active game/dev containers.
  • file_sd output is updated after sync.
  • file_sd output is not empty unless there truly are no active game/dev containers.
  • stale VMIDs are absent from file_sd after deletion/sync.
  • no stale target remains in the down state without a documented reason.
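
The stale and empty checks above amount to comparing the VMIDs in the file_sd output against the VMIDs that are known to be active. The sketch below assumes the file uses the standard Prometheus file_sd JSON layout (a list of objects with `targets` and `labels`) and that each group carries a `vmid` label; the function names and the sample addresses are hypothetical.

```python
import json
from typing import Iterable, Set, Tuple


def vmids_in_file_sd(file_sd_text: str) -> Set[str]:
    """Collect the vmid label from every target group in a file_sd JSON document."""
    groups = json.loads(file_sd_text)
    return {
        str(group["labels"]["vmid"])
        for group in groups
        if "vmid" in group.get("labels", {})
    }


def diff_vmids(file_sd_text: str, active_vmids: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (stale, missing).

    stale:   VMIDs present in file_sd but no longer active.
    missing: active VMIDs that file_sd does not list.
    """
    listed = vmids_in_file_sd(file_sd_text)
    active = {str(v) for v in active_vmids}
    return listed - active, active - listed


sample = json.dumps([
    {"targets": ["10.0.0.5:9100"], "labels": {"vmid": "101", "container_type": "game"}},
    {"targets": ["10.0.0.6:9100"], "labels": {"vmid": "102", "container_type": "dev"}},
])
stale, missing = diff_vmids(sample, {"102", "103"})
print(sorted(stale), sorted(missing))
```

Both sets being empty after a sync corresponds to a PASS for the stale-VMID and non-empty-output items.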

Game container lifecycle validation

Create a test game container and verify:

  • target appears in discovery.
  • Prometheus target appears.
  • metrics are scraped successfully.
  • labels include canonical vmid, instance, and container_type.
  • dashboard can filter/show the container.
  • lifecycle state is exposed in enough detail to debug readiness problems.
  • delete removes the target from API discovery.
  • delete removes the target from file_sd.
  • delete removes or stops the Prometheus target.
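
The target, scrape, and label items can be checked against the response of the Prometheus HTTP API endpoint `/api/v1/targets`, which lists active targets with their `labels` and `health`. The sketch below works on an already-parsed response so it does not assume a Prometheus address; `target_problems` is a hypothetical helper name.

```python
from typing import Any, Dict, List

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def target_problems(targets_response: Dict[str, Any], vmid: str) -> List[str]:
    """List contract violations for the active target(s) carrying the given vmid."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    matches = [t for t in active if t.get("labels", {}).get("vmid") == vmid]
    if not matches:
        return [f"no active target with vmid={vmid}"]

    problems = []
    for target in matches:
        labels = target.get("labels", {})
        for name in REQUIRED_LABELS:
            if not labels.get(name):
                problems.append(f"label {name} missing or empty")
        if target.get("health") != "up":
            problems.append(f"health is {target.get('health')!r}, expected 'up'")
    return problems
```

After creation, an empty list indicates that the target exists, is scraped successfully, and carries the three canonical labels. After deletion, the expected result is the single "no active target" message.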

Dev container lifecycle validation

Create a test dev container and verify:

  • target appears in discovery.
  • Prometheus target appears.
  • metrics are scraped successfully.
  • labels include canonical vmid, instance, and container_type.
  • dashboard can filter/show the container.
  • code-server state is visible, or its absence is recorded as an accepted launch gap.
  • delete removes the target from API discovery.
  • delete removes the target from file_sd.
  • delete removes or stops the Prometheus target.
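
Removal after delete is not instantaneous, since it depends on the sync and on Prometheus reloading file_sd, so the delete checks are easier to run as a bounded poll than as a single probe. The helper below is a generic sketch; the default timeout and interval are placeholders, not values taken from this project.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it returns True or the timeout elapses.

    Returns True on success and False on timeout. The predicate is always
    evaluated at least once, even with a timeout of zero.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The predicate would wrap whichever check applies, for example "the VMID is absent from API discovery", "the VMID is absent from file_sd", or "Prometheus no longer lists the target". A False result is a FAIL for the corresponding delete item.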

API/app validation

  • API host/system metrics are present.
  • API application health target exists or equivalent health metric exists.
  • API discovery endpoint failures are visible in Prometheus or logs.
  • API create/provision/delete/backup/restore failures are visible in centralized logs or accepted as a launch gap.

Grafana validation

  • Prometheus datasource exists and works.
  • core services overview dashboard is installed and visible.
  • core service detail dashboard is installed and visible.
  • game containers dashboard is installed and visible.
  • dev containers dashboard is installed and visible.
  • dashboard queries match the canonical label contract.
  • dashboard queries return live data for current targets.

Log validation

  • API logs queryable from monitoring surface.
  • Agent logs queryable from monitoring surface.
  • provisioning logs queryable from monitoring surface.
  • backup/restore logs queryable from monitoring surface.
  • delete/teardown logs queryable from monitoring surface.
  • Velocity/DNS/edge action logs queryable from monitoring surface.
  • discovery sync logs queryable from monitoring surface.

Launch acceptance

Monitoring can be considered launch-ready only when:

  • monitoring endpoints are not publicly exposed
  • discovery add/remove works for game and dev containers
  • stale file_sd entries do not persist after successful sync
  • dashboards are installed and return live data for current targets
  • API/app health is visible
  • logs are centralized or the missing log surface is explicitly accepted as a launch risk
  • discovery/auth token handling is locked down
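
The acceptance rule above can be expressed as a simple gate: every criterion must be PASS unless it is one that has been explicitly accepted as a launch risk. The sketch below is illustrative; the criterion keys and the `accepted_risks` parameter are hypothetical names, and which items may be accepted is a decision for the launch owners, not for the script.

```python
from typing import Dict, FrozenSet, List


def launch_blockers(
    results: Dict[str, str],
    accepted_risks: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return criteria that are not PASS and have not been explicitly accepted."""
    return [
        name
        for name, status in results.items()
        if status != "PASS" and name not in accepted_risks
    ]


results = {
    "endpoints_not_public": "PASS",
    "discovery_add_remove": "PASS",
    "no_stale_file_sd": "PASS",
    "dashboards_live": "FAIL",
    "api_health_visible": "PASS",
    "logs_centralized": "BLOCKED",
    "token_handling_locked_down": "PASS",
}
blockers = launch_blockers(results, accepted_risks=frozenset({"logs_centralized"}))
print(blockers)
```

Monitoring is launch-ready under this rule only when the returned list is empty.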