diff --git a/Codex/Monitoring/VALIDATION.md b/Codex/Monitoring/VALIDATION.md
new file mode 100644
index 0000000..9ddc881
--- /dev/null
+++ b/Codex/Monitoring/VALIDATION.md
@@ -0,0 +1,106 @@
+# Monitoring — Validation Checklist
+
+Use this checklist to decide whether launch monitoring is ready.
+
+For each item record:
+
+- PASS / FAIL / BLOCKED
+- command or UI path used
+- observed result
+- target VMID/server ID if applicable
+- follow-up issue link if failed
+
+## Security/access validation
+
+- Prometheus is not reachable from untrusted networks.
+- Grafana is not reachable from untrusted networks except through intended authenticated/admin path.
+- node_exporter is not reachable from untrusted networks.
+- UFW/nftables or equivalent host/network policy restricts monitoring ports.
+- `/sd/exporters` returns `401` without token.
+- `/monitoring/game-dev` returns `401` without token.
+- bearer token is not present in labels, scrape URLs, dashboard JSON, or logs.
+- bearer token files have tight permissions or use systemd credentials.
+
+## Service validation
+
+- `prometheus.service` active/running.
+- `grafana-server.service` active/running.
+- `alloy.service` active/running.
+- local node/system metrics target up.
+- monitoring discovery sync service active/successful.
+
+## Discovery validation
+
+- authenticated `/sd/exporters` returns current active game/dev targets.
+- authenticated `/monitoring/game-dev` returns current active game/dev containers.
+- file_sd output is updated after sync.
+- file_sd output is not empty unless there truly are no active game/dev containers.
+- stale VMIDs are absent from file_sd after deletion/sync.
+- no old down target remains without an explicit reason.
+
+## Game container lifecycle validation
+
+Create a test game container and verify:
+
+- target appears in discovery.
+- Prometheus target appears.
+- metrics are scraped successfully.
+- labels include canonical `vmid`, `instance`, and `container_type`.
+- dashboard can filter/show the container.
+- lifecycle state is visible enough to debug readiness.
+- delete removes the target from API discovery.
+- delete removes the target from file_sd.
+- delete removes or stops the Prometheus target.
+
+## Dev container lifecycle validation
+
+Create a test dev container and verify:
+
+- target appears in discovery.
+- Prometheus target appears.
+- metrics are scraped successfully.
+- labels include canonical `vmid`, `instance`, and `container_type`.
+- dashboard can filter/show the container.
+- code-server state is visible or planned as an accepted launch gap.
+- delete removes the target from API discovery.
+- delete removes the target from file_sd.
+- delete removes or stops the Prometheus target.
+
+## API/app validation
+
+- API host/system metrics are present.
+- API application health target exists or equivalent health metric exists.
+- API discovery endpoint failures are visible in Prometheus or logs.
+- API create/provision/delete/backup/restore failures are visible in centralized logs or accepted as a launch gap.
+
+## Grafana validation
+
+- Prometheus datasource exists and works.
+- core services overview dashboard is installed and visible.
+- core service detail dashboard is installed and visible.
+- game containers dashboard is installed and visible.
+- dev containers dashboard is installed and visible.
+- dashboard queries match the canonical label contract.
+- dashboard queries return live data for current targets.
+
+## Log validation
+
+- API logs queryable from monitoring surface.
+- Agent logs queryable from monitoring surface.
+- provisioning logs queryable from monitoring surface.
+- backup/restore logs queryable from monitoring surface.
+- delete/teardown logs queryable from monitoring surface.
+- Velocity/DNS/edge action logs queryable from monitoring surface.
+- discovery sync logs queryable from monitoring surface.
+
+## Launch acceptance
+
+Monitoring can be considered launch-ready only when:
+
+- monitoring endpoints are not publicly exposed
+- discovery add/remove works for game and dev containers
+- stale file_sd entries do not persist after successful sync
+- dashboards are installed and return useful data
+- API/app health is visible
+- logs are centralized or the missing log surface is explicitly accepted as a launch risk
+- discovery/auth token handling is locked down
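
The stale file_sd criterion in the checklist could be spot-checked with a short script. The following is a minimal sketch and not part of this change. It assumes the file_sd output uses the standard Prometheus target-group JSON shape (a list of objects with `targets` and `labels`), that each group carries a `vmid` label, and that the list of active VMIDs has already been fetched from the authenticated discovery endpoint. The function name `stale_vmids` and the sample addresses and VMIDs are hypothetical.

```python
import json


def stale_vmids(file_sd_groups, active_vmids):
    """Return VMIDs found in file_sd target groups that are not in the active set."""
    # Normalise to strings, since Prometheus label values are always strings.
    active = {str(v) for v in active_vmids}
    stale = set()
    for group in file_sd_groups:
        vmid = group.get("labels", {}).get("vmid")
        if vmid is not None and str(vmid) not in active:
            stale.add(str(vmid))
    return sorted(stale)


# Made-up file_sd content for illustration only.
example_file_sd = json.loads("""
[
  {"targets": ["10.0.0.11:9100"], "labels": {"vmid": "101", "container_type": "game"}},
  {"targets": ["10.0.0.12:9100"], "labels": {"vmid": "102", "container_type": "dev"}}
]
""")

# VMID 102 is in file_sd but not in the active list, so it is reported as stale.
print(stale_vmids(example_file_sd, active_vmids=[101]))  # prints ['102']
```

An empty result after a delete and a successful sync would support a PASS on the stale-entry items. A non-empty result gives the VMIDs to record in the follow-up issue.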