Add monitoring validation checklist
parent b0773cb97f
commit c1c2104936

Codex/Monitoring/VALIDATION.md (new file, 106 lines)

# Monitoring — Validation Checklist

Use this checklist to decide whether launch monitoring is ready.

For each item, record:

- PASS / FAIL / BLOCKED
- command or UI path used
- observed result
- target VMID/server ID, if applicable
- follow-up issue link, if failed

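The per-item record above could be captured in a small structure so that incomplete entries are easy to spot. This is only a sketch: the class and field names (`ChecklistRecord`, `method`, `observed`, and so on) are illustrative and do not come from the checklist itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"


@dataclass
class ChecklistRecord:
    """One validated checklist item (hypothetical shape, not a project API)."""

    item: str                          # checklist line being validated
    status: Status
    method: str                        # command or UI path used
    observed: str                      # observed result
    target_id: Optional[str] = None    # VMID/server ID, if applicable
    follow_up: Optional[str] = None    # issue link, expected when failed

    def problems(self) -> list[str]:
        """Return reasons this record is incomplete (empty list if fine)."""
        issues = []
        if not self.method.strip():
            issues.append("missing command or UI path")
        if not self.observed.strip():
            issues.append("missing observed result")
        if self.status is Status.FAIL and not self.follow_up:
            issues.append("FAIL without follow-up issue link")
        return issues
```

A record with `Status.FAIL` and no `follow_up` reports one problem; a complete `PASS` record reports none.
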
## Security/access validation

- Prometheus is not reachable from untrusted networks.
- Grafana is not reachable from untrusted networks except through the intended authenticated/admin path.
- node_exporter is not reachable from untrusted networks.
- UFW/nftables or an equivalent host/network policy restricts monitoring ports.
- `/sd/exporters` returns `401` without a token.
- `/monitoring/game-dev` returns `401` without a token.
- bearer token is not present in labels, scrape URLs, dashboard JSON, or logs.
- bearer token files have tight permissions, or the token is supplied through systemd credentials.

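The two `401` items can be checked with a short script. The sketch below assumes the API is reachable at some base URL you supply (the checklist does not name one); the two paths are taken from the list above, and the function names are illustrative.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

# Paths come from the checklist; the base URL is supplied by the caller.
PROTECTED_PATHS = ("/sd/exporters", "/monitoring/game-dev")


def status_without_token(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET sent with no Authorization header."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def unauthenticated_failures(
    base_url: str,
    paths: Iterable[str] = PROTECTED_PATHS,
    fetch: Callable[[str], int] = status_without_token,
) -> dict[str, int]:
    """Map each path that did NOT answer 401 to the status it returned."""
    failures = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 401:
            failures[path] = status
    return failures
```

An empty result means both endpoints rejected the unauthenticated request. Connection errors are left to propagate, since an unreachable endpoint is a different finding from a missing auth check.
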
## Service validation

- `prometheus.service` is active/running.
- `grafana-server.service` is active/running.
- `alloy.service` is active/running.
- local node/system metrics target is up.
- monitoring discovery sync service is active or its last run was successful.

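The unit checks above can be scripted around `systemctl is-active`, which prints `active` and exits with status 0 for a running unit. A minimal sketch, assuming it runs on the monitoring host; the unit names are the three listed above, and the `run` parameter exists only so the check can be exercised without systemd.

```python
import subprocess
from typing import Callable, Iterable

UNITS = ("prometheus.service", "grafana-server.service", "alloy.service")


def unit_is_active(unit: str, run: Callable = subprocess.run) -> bool:
    """True when `systemctl is-active <unit>` reports the unit as active."""
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is an expected answer, not an error
    )
    return result.returncode == 0 and result.stdout.strip() == "active"


def inactive_units(
    units: Iterable[str] = UNITS, run: Callable = subprocess.run
) -> list[str]:
    """Return the units that are not currently active."""
    return [unit for unit in units if not unit_is_active(unit, run=run)]
```

The discovery sync service is left out on purpose: if it is a oneshot unit it will normally be inactive between runs, so its last result needs a separate check.
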
## Discovery validation

- authenticated `/sd/exporters` returns current active game/dev targets.
- authenticated `/monitoring/game-dev` returns current active game/dev containers.
- file_sd output is updated after sync.
- file_sd output is not empty unless there truly are no active game/dev containers.
- stale VMIDs are absent from file_sd after deletion/sync.
- no old down target remains without an explicit reason.

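The stale-VMID item can be checked by comparing the file_sd output against the set of VMIDs the API currently reports. The sketch below assumes the sync writes the standard Prometheus file_sd JSON layout (a list of groups with `targets` and `labels`) and that each group carries a `vmid` label; how the active VMIDs are obtained is left to the caller.

```python
import json
from typing import Iterable


def vmids_in_file_sd(file_sd_text: str) -> set[str]:
    """Collect the `vmid` label from every target group in a file_sd JSON document."""
    groups = json.loads(file_sd_text)
    return {
        group["labels"]["vmid"]
        for group in groups
        if "vmid" in group.get("labels", {})
    }


def stale_vmids(file_sd_text: str, active_vmids: Iterable[str]) -> set[str]:
    """VMIDs still written to file_sd that are no longer reported as active."""
    return vmids_in_file_sd(file_sd_text) - {str(v) for v in active_vmids}
```

An empty set after a delete and sync is the expected result; any remaining VMID is a candidate for the "stale VMIDs" failure.
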
## Game container lifecycle validation

Create a test game container and verify:

- target appears in discovery.
- Prometheus target appears.
- metrics are scraped successfully.
- labels include canonical `vmid`, `instance`, and `container_type`.
- dashboard can filter/show the container.
- lifecycle state is visible enough to debug readiness.
- delete removes the target from API discovery.
- delete removes the target from file_sd.
- delete removes or stops the Prometheus target.

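The label and scrape checks above can be run against the JSON returned by the Prometheus `/api/v1/targets` endpoint. The sketch below only inspects an already-fetched response; it assumes the caller has narrowed the response to container targets, since host-level targets would not be expected to carry `vmid`.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def targets_missing_labels(targets_response: dict) -> dict[str, list[str]]:
    """Map scrape URL -> required labels that are absent or empty on that target."""
    missing = {}
    for target in targets_response["data"]["activeTargets"]:
        labels = target.get("labels", {})
        absent = [name for name in REQUIRED_LABELS if not labels.get(name)]
        if absent:
            missing[target.get("scrapeUrl", "<unknown>")] = absent
    return missing


def unhealthy_targets(targets_response: dict) -> list[str]:
    """Scrape URLs of active targets whose last scrape was not healthy."""
    return [
        target.get("scrapeUrl", "<unknown>")
        for target in targets_response["data"]["activeTargets"]
        if target.get("health") != "up"
    ]
```

Both functions returning empty results for the test container covers the "metrics are scraped successfully" and canonical-label items.
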
## Dev container lifecycle validation

Create a test dev container and verify:

- target appears in discovery.
- Prometheus target appears.
- metrics are scraped successfully.
- labels include canonical `vmid`, `instance`, and `container_type`.
- dashboard can filter/show the container.
- code-server state is visible or planned as an accepted launch gap.
- delete removes the target from API discovery.
- delete removes the target from file_sd.
- delete removes or stops the Prometheus target.

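The delete checks are time-dependent, because removal only shows up after the next sync or scrape-config reload. One way to test them is to poll until the VMID disappears or a deadline passes. This is a sketch: `list_vmids` is a hypothetical callable that returns the VMIDs currently visible in whichever surface is being checked (API discovery, file_sd, or Prometheus targets), and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable, Collection


def wait_until_absent(
    vmid: str,
    list_vmids: Callable[[], Collection[str]],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `vmid` disappears from `list_vmids()`; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if vmid not in list_vmids():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Running the same poll once per surface gives a separate PASS/FAIL for each of the three delete items.
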
## API/app validation

- API host/system metrics are present.
- API application health target exists or equivalent health metric exists.
- API discovery endpoint failures are visible in Prometheus or logs.
- API create/provision/delete/backup/restore failures are visible in centralized logs or accepted as a launch gap.

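Whether an API health target exists can be read from an instant query on the `up` metric via the Prometheus `/api/v1/query` endpoint. The sketch below parses an already-fetched response; the query itself (for example `up` filtered by a job label) depends on how the API scrape job is named, which the checklist does not specify.

```python
def down_series(query_response: dict) -> list[dict]:
    """Label sets of series whose sample value is 0 in an instant-vector result."""
    if query_response.get("status") != "success":
        raise ValueError(f"query failed: {query_response.get('error', 'unknown error')}")
    data = query_response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample is [timestamp, "value"]; Prometheus encodes the value as a string.
    return [
        series["metric"]
        for series in data["result"]
        if float(series["value"][1]) == 0.0
    ]


def has_any_series(query_response: dict) -> bool:
    """False means the query matched nothing, i.e. no such target is being scraped."""
    return bool(query_response.get("data", {}).get("result"))
```

An empty `result` is a FAIL for "health target exists", while a non-empty `down_series` means the target exists but is currently failing its scrape.
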
## Grafana validation

- Prometheus datasource exists and works.
- core services overview dashboard is installed and visible.
- core service detail dashboard is installed and visible.
- game containers dashboard is installed and visible.
- dev containers dashboard is installed and visible.
- dashboard queries match the canonical label contract.
- dashboard queries return live data for current targets.

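The "installed and visible" items can be checked against the list returned by Grafana's dashboard search API (`/api/search`), where dashboards appear with `type` set to `dash-db`. The sketch below works on an already-fetched result list; the dashboard titles in the usage note are placeholders, since the checklist does not give the exact titles.

```python
from typing import Iterable


def missing_dashboards(
    search_results: list[dict], expected_titles: Iterable[str]
) -> list[str]:
    """Return expected dashboard titles that are absent from a Grafana search result."""
    found = {
        entry.get("title")
        for entry in search_results
        if entry.get("type") == "dash-db"  # skip folders and other entry types
    }
    return [title for title in expected_titles if title not in found]
```

Called with the four expected titles (for example "Core Services Overview", "Core Service Detail", "Game Containers", "Dev Containers", all hypothetical), an empty list means every dashboard is installed. It does not show that the queries return live data, which still needs a manual look or a datasource query.
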
## Log validation

- API logs queryable from monitoring surface.
- Agent logs queryable from monitoring surface.
- provisioning logs queryable from monitoring surface.
- backup/restore logs queryable from monitoring surface.
- delete/teardown logs queryable from monitoring surface.
- Velocity/DNS/edge action logs queryable from monitoring surface.
- discovery sync logs queryable from monitoring surface.

## Launch acceptance

Monitoring can be considered launch-ready only when:

- monitoring endpoints are not publicly exposed
- discovery add/remove works for game and dev containers
- stale file_sd entries do not persist after successful sync
- dashboards are installed and return useful data
- API/app health is visible
- logs are centralized or the missing log surface is explicitly accepted as a launch risk
- discovery/auth token handling is locked down

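The acceptance rule above amounts to "every item is PASS, or its gap has been explicitly accepted". A minimal sketch of that gate, assuming results are kept as a mapping from item name to status string; the item names and the `accepted_risks` set are illustrative, and which gaps may be accepted at all is a decision the checklist leaves to the reviewer.

```python
from typing import AbstractSet, Mapping


def launch_blockers(
    results: Mapping[str, str],
    accepted_risks: AbstractSet[str] = frozenset(),
) -> list[str]:
    """Items that are not PASS and have not been explicitly accepted as a launch risk."""
    return [
        item
        for item, status in results.items()
        if status != "PASS" and item not in accepted_risks
    ]
```

Monitoring would be treated as launch-ready only when this returns an empty list.
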