Monitoring — Validation Checklist
Use this checklist to decide whether launch monitoring is ready.
For each item record:
- PASS / FAIL / BLOCKED
- command or UI path used
- observed result
- target VMID/server ID if applicable
- follow-up issue link if failed
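One way to keep these records consistent is a small structure that rejects incomplete entries. The sketch below is illustrative only: the class name `ChecklistResult`, its field names, and the rule that a FAIL must carry a follow-up link are assumptions drawn from the list above, not an existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Status values taken from the checklist; everything else here is hypothetical.
VALID_STATUSES = ("PASS", "FAIL", "BLOCKED")


@dataclass
class ChecklistResult:
    item: str                        # checklist line being validated
    status: str                      # PASS / FAIL / BLOCKED
    method: str                      # command or UI path used
    observed: str                    # observed result
    target_id: Optional[str] = None  # VMID/server ID, if applicable
    follow_up: Optional[str] = None  # issue link, expected when status is FAIL

    def problems(self) -> List[str]:
        """Return a list of reasons this record is incomplete (empty if fine)."""
        found = []
        if self.status not in VALID_STATUSES:
            found.append(f"unknown status: {self.status!r}")
        if not self.method.strip():
            found.append("missing command or UI path")
        if not self.observed.strip():
            found.append("missing observed result")
        if self.status == "FAIL" and not self.follow_up:
            found.append("FAIL without follow-up issue link")
        return found
```

A record with `status="FAIL"` and no `follow_up` would report one problem, which makes missing issue links easy to spot before sign-off.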
Security/access validation
- Prometheus is not reachable from untrusted networks.
- Grafana is not reachable from untrusted networks except through intended authenticated/admin path.
- node_exporter is not reachable from untrusted networks.
- UFW/nftables or equivalent host/network policy restricts monitoring ports.
- /sd/exporters returns 401 without token.
- /monitoring/game-dev returns 401 without token.
- bearer token is not present in labels, scrape URLs, dashboard JSON, or logs.
- bearer token files have tight permissions or use systemd credentials.
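The "tight permissions" item can be checked mechanically. The following is a minimal sketch that treats "tight" as "no group or other permission bits set" (for example 0600 or 0400); that definition and the function names are assumptions, and deployments using systemd credentials would not need this check.

```python
import os
import stat


def mode_is_tight(mode: int) -> bool:
    """True when neither group nor other has any permission bit set."""
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def token_file_is_tight(path: str) -> bool:
    """Check the permission bits of a token file on disk (path is caller-supplied)."""
    return mode_is_tight(stat.S_IMODE(os.stat(path).st_mode))
```

Calling `token_file_is_tight` on each bearer token file gives a direct PASS/FAIL signal for this item; ownership of the file would still need a separate look.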
Service validation
- prometheus.service active/running.
- grafana-server.service active/running.
- alloy.service active/running.
- local node/system metrics target up.
- monitoring discovery sync service active/successful.
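The unit checks above can be scripted with `systemctl is-active`, which prints the unit state. This sketch assumes it runs on the monitoring host itself; the helper names are hypothetical, and the discovery sync unit is left out because its name is not given here.

```python
import subprocess
from typing import Dict, Iterable, List

# Unit names taken from the checklist above.
CORE_UNITS = ("prometheus.service", "grafana-server.service", "alloy.service")


def unit_state(unit: str) -> str:
    """Return the state string printed by `systemctl is-active <unit>`."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # non-active units exit non-zero; we only want the text
    )
    return result.stdout.strip()


def collect_states(units: Iterable[str] = CORE_UNITS) -> Dict[str, str]:
    """Query every unit and map its name to its reported state."""
    return {unit: unit_state(unit) for unit in units}


def inactive_units(states: Dict[str, str]) -> List[str]:
    """List the units whose reported state is anything other than 'active'."""
    return sorted(unit for unit, state in states.items() if state != "active")
```

An empty result from `inactive_units(collect_states())` corresponds to a PASS for the three core services.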
Discovery validation
- authenticated /sd/exporters returns current active game/dev targets.
- authenticated /monitoring/game-dev returns current active game/dev containers.
- file_sd output is updated after sync.
- file_sd output is not empty unless there truly are no active game/dev containers.
- stale VMIDs are absent from file_sd after deletion/sync.
- no old down target remains without an explicit reason.
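The stale-VMID check can be expressed as a comparison between the file_sd output and the set of VMIDs that are currently active. The sketch below relies on the standard Prometheus file_sd JSON shape (a list of groups, each with `targets` and `labels`); it assumes each group carries a `vmid` label, and the function name and the source of the active VMID list are hypothetical.

```python
import json
from typing import Iterable, List


def stale_vmids(file_sd_groups: List[dict], active_vmids: Iterable) -> List[str]:
    """Return VMIDs present in file_sd that are not in the active set."""
    active = {str(v) for v in active_vmids}
    stale = set()
    for group in file_sd_groups:
        vmid = group.get("labels", {}).get("vmid")
        if vmid is not None and str(vmid) not in active:
            stale.add(str(vmid))
    return sorted(stale)


def load_file_sd(path: str) -> List[dict]:
    """Read a file_sd JSON file from disk (path is caller-supplied)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

An empty list after a delete and sync supports the "stale VMIDs are absent" item; an empty file_sd file itself should still be cross-checked against whether any containers are actually active.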
Game container lifecycle validation
Create a test game container and verify:
- target appears in discovery.
- Prometheus target appears.
- metrics are scraped successfully.
- labels include canonical vmid, instance, and container_type.
- dashboard can filter/show the container.
- lifecycle state is visible enough to debug readiness.
- delete removes the target from API discovery.
- delete removes the target from file_sd.
- delete removes or stops the Prometheus target.
Dev container lifecycle validation
Create a test dev container and verify:
- target appears in discovery.
- Prometheus target appears.
- metrics are scraped successfully.
- labels include canonical vmid, instance, and container_type.
- dashboard can filter/show the container.
- code-server state is visible or planned as an accepted launch gap.
- delete removes the target from API discovery.
- delete removes the target from file_sd.
- delete removes or stops the Prometheus target.
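The label check in the game and dev lifecycle lists can be sketched as a small contract test over the label set of one scraped target. The three label names come from the checklist; the function name and the idea of feeding it the labels of a target as a plain dictionary are assumptions.

```python
from typing import Dict, List

# Canonical labels named in the lifecycle checklists.
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def missing_canonical_labels(labels: Dict[str, str]) -> List[str]:
    """Return the required labels that are absent or empty, in contract order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]
```

Running this against the labels of the test game container and the test dev container gives the same check for both lifecycle sections, which also keeps dashboard queries aligned with a single label contract.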
API/app validation
- API host/system metrics are present.
- API application health target exists or equivalent health metric exists.
- API discovery endpoint failures are visible in Prometheus or logs.
- API create/provision/delete/backup/restore failures are visible in centralized logs or accepted as a launch gap.
Grafana validation
- Prometheus datasource exists and works.
- core services overview dashboard is installed and visible.
- core service detail dashboard is installed and visible.
- game containers dashboard is installed and visible.
- dev containers dashboard is installed and visible.
- dashboard queries match the canonical label contract.
- dashboard queries return live data for current targets.
Log validation
- API logs queryable from monitoring surface.
- Agent logs queryable from monitoring surface.
- provisioning logs queryable from monitoring surface.
- backup/restore logs queryable from monitoring surface.
- delete/teardown logs queryable from monitoring surface.
- Velocity/DNS/edge action logs queryable from monitoring surface.
- discovery sync logs queryable from monitoring surface.
Launch acceptance
Monitoring can be considered launch-ready only when:
- monitoring endpoints are not publicly exposed
- discovery add/remove works for game and dev containers
- stale file_sd entries do not persist after successful sync
- dashboards are installed and return useful data
- API/app health is visible
- logs are centralized or the missing log surface is explicitly accepted as a launch risk
- discovery/auth token handling is locked down
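The acceptance rule can be summarized as: every item passes, or the gap is explicitly accepted. The sketch below is one hypothetical way to compute the remaining blockers from recorded statuses; the function name, the item keys, and the `accepted_gaps` set are illustrative, and which items may be accepted as launch risks is a decision for the team, not for the code.

```python
from typing import Dict, FrozenSet, List


def launch_blockers(
    statuses: Dict[str, str],
    accepted_gaps: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return items that are not PASS and have not been explicitly accepted."""
    return sorted(
        item
        for item, status in statuses.items()
        if status != "PASS" and item not in accepted_gaps
    )
```

Under this sketch, monitoring would be treated as launch-ready only when `launch_blockers(...)` returns an empty list.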