Add monitoring validation checklist

2026-05-01 18:27:50 +00:00 · c1c2104936
commit c1c2104936
parent b0773cb97f
1 changed file with 106 additions and 0 deletions
--- a/Codex/Monitoring/VALIDATION.md
+++ b/Codex/Monitoring/VALIDATION.md
@@ -0,0 +1,106 @@
 # Monitoring — Validation Checklist
 Use this checklist to decide whether launch monitoring is ready.
 For each item record:
 - PASS / FAIL / BLOCKED
 - command or UI path used
 - observed result
 - target VMID/server ID if applicable
 - follow-up issue link if failed
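One way to keep these per-item records consistent is a small structured type. The sketch below is only an illustration: the names `Status`, `CheckRecord`, and `problems` are hypothetical and do not exist in this repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"


@dataclass
class CheckRecord:
    """Hypothetical record for one checklist item."""
    item: str                        # checklist line being validated
    status: Status
    method: str                      # command or UI path used
    observed: str                    # observed result
    target: Optional[str] = None     # VMID/server ID, if applicable
    follow_up: Optional[str] = None  # issue link, expected when status is FAIL

    def problems(self) -> list[str]:
        """Return reasons this record is incomplete (empty list if it is fine)."""
        issues = []
        if not self.method.strip():
            issues.append("missing command or UI path")
        if not self.observed.strip():
            issues.append("missing observed result")
        if self.status is Status.FAIL and not self.follow_up:
            issues.append("failed item has no follow-up issue link")
        return issues
```

A record with a non-empty `problems()` list can be sent back for completion before it counts toward launch acceptance.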
 ## Security/access validation
 - Prometheus is not reachable from untrusted networks.
 - Grafana is not reachable from untrusted networks except through the intended authenticated admin path.
 - node_exporter is not reachable from untrusted networks.
 - UFW/nftables or equivalent host/network policy restricts monitoring ports.
 - `/sd/exporters` returns `401` without token.
 - `/monitoring/game-dev` returns `401` without token.
 - bearer token is not present in labels, scrape URLs, dashboard JSON, or logs.
 - bearer token files have tight permissions or use systemd credentials.
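The two token-handling items above can be partly automated. The following is a minimal sketch, assuming the token lives in a regular file and that the labels, scrape URLs, dashboard JSON, and logs have already been read into strings. The function names are hypothetical, and "tight" is interpreted here as no group or other permission bits, which is an assumption rather than a stated policy.

```python
import os
import stat


def token_file_is_tight(path: str) -> bool:
    """True if the file grants no permissions to group or others (e.g. 0600 or 0400)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def find_token_leaks(token: str, sources: dict[str, str]) -> list[str]:
    """Return the names of sources whose text contains the token verbatim."""
    if not token:
        raise ValueError("token must be non-empty")
    return sorted(name for name, text in sources.items() if token in text)
```

A verbatim search will not catch encoded or truncated copies of the token, so an empty result supports a PASS but does not prove it.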
 ## Service validation
 - `prometheus.service` active/running.
 - `grafana-server.service` active/running.
 - `alloy.service` active/running.
 - local node/system metrics target up.
 - monitoring discovery sync service active/successful.
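The three named units above could be checked in one pass with `systemctl is-active`, which prints the unit state on stdout. This is a sketch only: the helper names are hypothetical, and the discovery sync service is left out because its unit name is not given here and a oneshot unit may legitimately report a state other than `active` after a successful run.

```python
import subprocess
from typing import Callable

UNITS = ["prometheus.service", "grafana-server.service", "alloy.service"]


def systemctl_is_active(unit: str) -> str:
    """Return the state printed by `systemctl is-active` (e.g. "active", "inactive", "failed")."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def inactive_units(
    units: list[str],
    probe: Callable[[str], str] = systemctl_is_active,
) -> dict[str, str]:
    """Map each unit that is not "active" to the state that was reported."""
    states = {unit: probe(unit) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}
```

Calling `inactive_units(UNITS)` on the monitoring host returns an empty dict when all three services are active. The `probe` parameter exists so the logic can be exercised without systemd.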
 ## Discovery validation
 - authenticated `/sd/exporters` returns current active game/dev targets.
 - authenticated `/monitoring/game-dev` returns current active game/dev containers.
 - file_sd output is updated after sync.
 - file_sd output is not empty unless there truly are no active game/dev containers.
 - stale VMIDs are absent from file_sd after deletion/sync.
 - no stale target in the `down` state remains without a documented reason.
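The stale-entry and empty-output checks above amount to a set comparison between the file_sd output and the currently active containers. A minimal sketch, assuming the file_sd output is JSON in the standard Prometheus shape (a list of groups with `targets` and `labels`) and that each group carries a `vmid` label. The function names are hypothetical.

```python
import json


def vmids_in_file_sd(file_sd_text: str) -> set[str]:
    """Collect the `vmid` label from every target group in a file_sd JSON document."""
    groups = json.loads(file_sd_text)
    return {
        str(group["labels"]["vmid"])
        for group in groups
        if "vmid" in group.get("labels", {})
    }


def diff_against_active(file_sd_text: str, active_vmids: set[str]) -> tuple[set[str], set[str]]:
    """Return (stale, missing): VMIDs only in file_sd, and active VMIDs absent from it."""
    present = vmids_in_file_sd(file_sd_text)
    return present - active_vmids, active_vmids - present
```

After a delete followed by a sync, `stale` should be empty. A non-empty `missing` set means an active container was not written to file_sd.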
 ## Game container lifecycle validation
 Create a test game container and verify:
 - target appears in discovery.
 - Prometheus target appears.
 - metrics are scraped successfully.
 - labels include canonical `vmid`, `instance`, and `container_type`.
 - dashboard can filter/show the container.
 - lifecycle state is visible in enough detail to debug readiness problems.
 - delete removes the target from API discovery.
 - delete removes the target from file_sd.
 - delete removes or stops the Prometheus target.
 ## Dev container lifecycle validation
 Create a test dev container and verify:
 - target appears in discovery.
 - Prometheus target appears.
 - metrics are scraped successfully.
 - labels include canonical `vmid`, `instance`, and `container_type`.
 - dashboard can filter/show the container.
 - code-server state is visible, or its absence is recorded as an accepted launch gap.
 - delete removes the target from API discovery.
 - delete removes the target from file_sd.
 - delete removes or stops the Prometheus target.
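For both lifecycle lists above, the label and scrape checks can be run against the JSON returned by Prometheus's `/api/v1/targets` endpoint. The sketch below assumes the response has already been fetched and filtered to game/dev container targets, since other targets such as the local node exporter may not carry `vmid`. The function names are hypothetical.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def label_contract_violations(targets_response: dict) -> dict[str, list[str]]:
    """Map each active target's scrape URL to the required labels it lacks or leaves empty."""
    violations = {}
    for target in targets_response["data"]["activeTargets"]:
        labels = target.get("labels", {})
        missing = [name for name in REQUIRED_LABELS if not labels.get(name)]
        if missing:
            violations[target.get("scrapeUrl", "<unknown>")] = missing
    return violations


def unhealthy_targets(targets_response: dict) -> list[str]:
    """Return scrape URLs of active targets whose health is not "up"."""
    return [
        target.get("scrapeUrl", "<unknown>")
        for target in targets_response["data"]["activeTargets"]
        if target.get("health") != "up"
    ]
```

After the test container is deleted and discovery has synced, its scrape URL should no longer appear in `activeTargets` at all.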
 ## API/app validation
 - API host/system metrics are present.
 - API application health target exists or equivalent health metric exists.
 - API discovery endpoint failures are visible in Prometheus or logs.
 - API create/provision/delete/backup/restore failures are visible in centralized logs or accepted as a launch gap.
 ## Grafana validation
 - Prometheus datasource exists and works.
 - core services overview dashboard is installed and visible.
 - core service detail dashboard is installed and visible.
 - game containers dashboard is installed and visible.
 - dev containers dashboard is installed and visible.
 - dashboard queries match the canonical label contract.
 - dashboard queries return live data for current targets.
 ## Log validation
 - API logs queryable from monitoring surface.
 - Agent logs queryable from monitoring surface.
 - provisioning logs queryable from monitoring surface.
 - backup/restore logs queryable from monitoring surface.
 - delete/teardown logs queryable from monitoring surface.
 - Velocity/DNS/edge action logs queryable from monitoring surface.
 - discovery sync logs queryable from monitoring surface.
 ## Launch acceptance
 Monitoring can be considered launch-ready only when:
 - monitoring endpoints are not publicly exposed
 - discovery add/remove works for game and dev containers
 - stale file_sd entries do not persist after successful sync
 - dashboards are installed and return useful data
 - API/app health is visible
 - logs are centralized or the missing log surface is explicitly accepted as a launch risk
 - discovery/auth token handling is locked down
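The acceptance rule above can be expressed as a small gate function. This is a sketch under stated assumptions: the criterion keys are hypothetical names for the seven bullets, and only the log criterion is treated as waivable because it is the only one the list allows to be accepted as a launch risk.

```python
# Hypothetical keys for the seven acceptance criteria listed above.
CRITERIA = (
    "endpoints_not_public",
    "discovery_add_remove",
    "no_stale_file_sd",
    "dashboards_useful",
    "api_health_visible",
    "logs_centralized",
    "token_handling_locked_down",
)
# Only the log criterion may be waived, and only as an explicitly accepted launch risk.
WAIVABLE = {"logs_centralized"}


def launch_blockers(results: dict[str, str], accepted_risks: set[str]) -> list[str]:
    """Return the criteria that still block launch; an empty list means launch-ready."""
    blockers = []
    for name in CRITERIA:
        if results.get(name) == "PASS":
            continue
        if name in WAIVABLE and name in accepted_risks:
            continue
        blockers.append(name)  # FAIL, BLOCKED, or not yet recorded
    return blockers
```

A criterion with no recorded result counts as a blocker, so an incomplete checklist cannot pass by omission.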