# Monitoring — Validation Checklist

Use this checklist to decide whether launch monitoring is ready.

For each item record:

- PASS / FAIL / BLOCKED
- command or UI path used
- observed result
- target VMID/server ID if applicable
- follow-up issue link if failed

## Security/access validation

- Prometheus is not reachable from untrusted networks.
- Grafana is not reachable from untrusted networks except through intended authenticated/admin path.
- node_exporter is not reachable from untrusted networks.
- UFW/nftables or equivalent host/network policy restricts monitoring ports.
- `/sd/exporters` returns `401` without token.
- `/monitoring/game-dev` returns `401` without token.
- bearer token is not present in labels, scrape URLs, dashboard JSON, or logs.
- bearer token files have tight permissions or use systemd credentials.

## Service validation

- `prometheus.service` active/running.
- `grafana-server.service` active/running.
- `alloy.service` active/running.
- local node/system metrics target up.
- monitoring discovery sync service active/successful.

## Discovery validation

- authenticated `/sd/exporters` returns current active game/dev targets.
- authenticated `/monitoring/game-dev` returns current active game/dev containers.
- file_sd output is updated after sync.
- file_sd output is not empty unless there truly are no active game/dev containers.
- stale VMIDs are absent from file_sd after deletion/sync.
- no old down target remains without an explicit reason.

## Game container lifecycle validation

Create a test game container and verify:

- target appears in discovery.
- Prometheus target appears.
- metrics are scraped successfully.
- labels include canonical `vmid`, `instance`, and `container_type`.
- dashboard can filter/show the container.
- lifecycle state is visible enough to debug readiness.
- delete removes the target from API discovery.
- delete removes the target from file_sd.
- delete removes or stops the Prometheus target.
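
The file_sd items above (non-empty output, no stale VMIDs, canonical labels) can be spot-checked offline against the generated file. The following is a minimal sketch, not the project's actual tooling. It assumes the sync service writes standard Prometheus file_sd JSON (a list of groups with `targets` and `labels`), that VMIDs are stored as strings in a `vmid` label, and that the set of active VMIDs is obtained separately. The function name `check_file_sd` and the `REQUIRED_LABELS` tuple are hypothetical. `instance` is not checked here because Prometheus normally derives it from the target address unless relabeling overrides it.

```python
import json

# Assumed label contract for file_sd entries; adjust to the real contract.
REQUIRED_LABELS = ("vmid", "container_type")


def check_file_sd(groups, active_vmids):
    """Return a list of human-readable problems found in file_sd target groups.

    groups: parsed file_sd JSON (list of {"targets": [...], "labels": {...}}).
    active_vmids: set of VMID strings that should currently be monitored.
    """
    problems = []
    seen = set()
    for i, group in enumerate(groups):
        labels = group.get("labels", {})
        if not group.get("targets"):
            problems.append(f"group {i}: no targets")
        for name in REQUIRED_LABELS:
            if not labels.get(name):
                problems.append(f"group {i}: missing label {name}")
        vmid = labels.get("vmid")
        if vmid:
            seen.add(vmid)
            if vmid not in active_vmids:
                # Entry survived a delete/sync cycle.
                problems.append(f"group {i}: stale vmid {vmid}")
    for vmid in sorted(active_vmids - seen):
        problems.append(f"active vmid {vmid} missing from file_sd")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"targets": ["10.0.0.5:9100"],'
        ' "labels": {"vmid": "101", "container_type": "game"}}]'
    )
    print(check_file_sd(sample, {"101"}))  # prints []
```

An empty list corresponds to PASS for these items; any returned string can be pasted into the "observed result" field of the checklist record.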
## Dev container lifecycle validation

Create a test dev container and verify:

- target appears in discovery.
- Prometheus target appears.
- metrics are scraped successfully.
- labels include canonical `vmid`, `instance`, and `container_type`.
- dashboard can filter/show the container.
- code-server state is visible or planned as an accepted launch gap.
- delete removes the target from API discovery.
- delete removes the target from file_sd.
- delete removes or stops the Prometheus target.

## API/app validation

- API host/system metrics are present.
- API application health target exists or equivalent health metric exists.
- API discovery endpoint failures are visible in Prometheus or logs.
- API create/provision/delete/backup/restore failures are visible in centralized logs or accepted as a launch gap.

## Grafana validation

- Prometheus datasource exists and works.
- core services overview dashboard is installed and visible.
- core service detail dashboard is installed and visible.
- game containers dashboard is installed and visible.
- dev containers dashboard is installed and visible.
- dashboard queries match the canonical label contract.
- dashboard queries return live data for current targets.

## Log validation

- API logs queryable from monitoring surface.
- Agent logs queryable from monitoring surface.
- provisioning logs queryable from monitoring surface.
- backup/restore logs queryable from monitoring surface.
- delete/teardown logs queryable from monitoring surface.
- Velocity/DNS/edge action logs queryable from monitoring surface.
- discovery sync logs queryable from monitoring surface.
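
The Grafana item "dashboard queries match the canonical label contract" can be approximated with a scan of exported dashboard JSON. This is a rough sketch under several assumptions: dashboards follow the usual Grafana JSON model (`panels[].targets[].expr`, with row panels nesting further `panels`), the allowed label set below stands in for the real contract, and a regular expression is good enough to find label matchers. It does not parse PromQL, so labels used only in `by (...)` or `on (...)` clauses are not seen. The names `CANONICAL_LABELS`, `iter_exprs`, and `non_canonical_labels` are hypothetical.

```python
import re

# Assumed contract; extend with metric-specific labels (for example "mode").
CANONICAL_LABELS = {"vmid", "instance", "container_type", "job"}

# Finds label names in matchers such as vmid="101" or container_type=~"game|dev".
MATCHER_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)\s*(?:=~|!~|!=|=)\s*"')


def iter_exprs(dashboard):
    """Yield (panel title, expr) pairs, descending into nested row panels."""
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        stack.extend(panel.get("panels", []))
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                yield panel.get("title", "<untitled>"), expr


def non_canonical_labels(dashboard, allowed=CANONICAL_LABELS):
    """Return sorted (panel title, label) pairs for labels outside the contract."""
    found = set()
    for title, expr in iter_exprs(dashboard):
        for label in MATCHER_RE.findall(expr):
            if label not in allowed:
                found.add((title, label))
    return sorted(found)
```

A typical use would be to load each dashboard file with `json.load` and treat a non-empty result as a FAIL for that dashboard, recording the offending panel titles as the observed result.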
## Launch acceptance

Monitoring can be considered launch-ready only when:

- monitoring endpoints are not publicly exposed
- discovery add/remove works for game and dev containers
- stale file_sd entries do not persist after successful sync
- dashboards are installed and return useful data
- API/app health is visible
- logs are centralized or the missing log surface is explicitly accepted as a launch risk
- discovery/auth token handling is locked down
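
One way to turn the per-item records into a launch decision is sketched below. It is illustrative only: the `CheckResult` fields mirror the "for each item record" list at the top of this checklist, while the `accepted_gap` flag and the function names are hypothetical. The flag should only be set for items where the checklist itself allows an explicitly accepted launch gap or risk (for example a missing log surface).

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"


@dataclass
class CheckResult:
    item: str                   # checklist item text
    status: Status
    how: str = ""               # command or UI path used
    observed: str = ""          # observed result
    target_id: str = ""         # VMID/server ID if applicable
    issue_link: str = ""        # follow-up issue link if failed
    accepted_gap: bool = False  # hypothetical: explicitly accepted launch risk


def launch_blockers(results):
    """Return results that are neither PASS nor an explicitly accepted gap."""
    return [
        r for r in results
        if r.status is not Status.PASS and not r.accepted_gap
    ]


def launch_ready(results):
    """Ready only if at least one item was checked and nothing blocks launch."""
    return bool(results) and not launch_blockers(results)
```

Treating BLOCKED the same as FAIL is a deliberate choice in this sketch: an item that could not be verified gives no evidence that monitoring is ready.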