Add monitoring contract

2026-05-01 18:27:04 +00:00 · 2026-05-01 18:27:04 +00:00 · c3aca715ee
commit c3aca715ee
parent 15b29b54c5
1 changed files with 157 additions and 0 deletions
--- a/Codex/Monitoring/CONTRACT.md
+++ b/Codex/Monitoring/CONTRACT.md
@@ -0,0 +1,157 @@
+# Monitoring Contract
+
+This file defines the intended monitoring contract for ZeroLagHub launch readiness.
+
+## Scope
+
+Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.
+
+The monitoring system must support launch debugging for:
+
+- platform service health
+- API availability and discovery behavior
+- game/dev container presence and deletion
+- container host metrics
+- game/dev lifecycle state
+- provisioning, backup, restore, delete, and code-server/debug workflows
+- stale target detection
+
+## Explicit exceptions
+
+The game/dev Alloy contract does not replace every platform-specific monitoring path.
+
+Exceptions:
+
+- OPNsense monitoring remains plugin/exporter-specific.
+- PBS monitoring remains platform-specific.
+- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
+
+## Dynamic discovery path
+
+Preferred launch model:
+
+```text
+API discovery endpoint
+  -> monitor sync script
+  -> Prometheus file_sd
+  -> Grafana dashboards / alerts
+```
+
+Expected behavior:
+
+- active game/dev containers appear automatically
+- deleted game/dev containers disappear automatically
+- stale file_sd entries are removed once sync is healthy
+- discovery failures alert loudly and do not silently preserve misleading stale state
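+
+A minimal sketch of one sync pass, under stated assumptions: `fetch_targets` is a hypothetical callable standing in for the API discovery request, and each target is assumed to be a dict with an `address` and a `labels` mapping. The file is replaced atomically so Prometheus never reads a partial write. On discovery failure the file is left untouched so its age grows and a stale-age alert can fire; this is one possible reading of "alert loudly", not a settled design.
+
+```python
+import json
+import os
+import tempfile
+
+
+def write_file_sd(targets, path):
+    """Atomically replace a Prometheus file_sd JSON file."""
+    groups = [
+        {"targets": [t["address"]], "labels": t["labels"]}
+        for t in targets
+    ]
+    directory = os.path.dirname(os.path.abspath(path))
+    # Write to a temp file in the same directory so os.replace is atomic.
+    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
+    try:
+        with os.fdopen(fd, "w") as handle:
+            json.dump(groups, handle, indent=2)
+        # file_sd content holds no secrets, so let Prometheus read it.
+        os.chmod(tmp_path, 0o644)
+        os.replace(tmp_path, path)
+    except BaseException:
+        if os.path.exists(tmp_path):
+            os.unlink(tmp_path)
+        raise
+
+
+def sync(fetch_targets, path):
+    """Run one sync pass; return True on success, False on discovery failure."""
+    try:
+        targets = fetch_targets()  # hypothetical discovery call
+    except Exception as exc:
+        # Do not rewrite the file: its mtime keeps aging, which a
+        # stale-age alert can detect. The caller should also alert.
+        print(f"discovery sync failed: {exc}")
+        return False
+    # An empty list is a valid result and removes deleted containers.
+    write_file_sd(targets, path)
+    return True
+```
+
+A caller would typically exit non-zero when `sync` returns `False` so the scheduler or systemd unit records the failure.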
+
+## API discovery requirements
+
+API-owned monitoring discovery endpoints must:
+
+- require bearer/internal monitoring auth outside explicit development/test mode
+- fail closed without token
+- return only active/current targets unless historical/debug mode is explicitly requested
+- provide stable labels for Prometheus/Grafana
+- avoid leaking secrets in labels, URLs, logs, or target metadata
+- expose enough metadata to debug game/dev lifecycle state
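+
+The fail-closed rule can be sketched as a small check. This is an illustration only: `is_authorized`, its parameters, and the `dev_mode` flag are hypothetical names, not the API's actual code. The point is that an unset server-side token denies access instead of opening the endpoint.
+
+```python
+import hmac
+
+
+def is_authorized(auth_header, expected_token, dev_mode=False):
+    """Return True only when the request may read discovery targets."""
+    if dev_mode:
+        # Explicit development/test mode is the only unauthenticated path.
+        return True
+    if not expected_token:
+        # Fail closed: a missing server-side token must never mean "open".
+        return False
+    if not auth_header or not auth_header.startswith("Bearer "):
+        return False
+    presented = auth_header[len("Bearer "):]
+    # Constant-time comparison avoids leaking the token through timing.
+    return hmac.compare_digest(presented.encode(), expected_token.encode())
+```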
+
+## Required labels
+
+For game/dev targets, standardize on the following required labels:
+
+- `vmid`
+- `instance`
+- `container_type`
+
+`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.
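+
+During the transition, a sync or discovery step could fold the legacy spelling into the canonical one and reject targets that lack a required label. The helper below is a sketch; `normalize_labels` is a hypothetical name, and rejecting conflicting values is an assumption rather than agreed behavior.
+
+```python
+REQUIRED_LABELS = ("vmid", "instance", "container_type")
+
+
+def normalize_labels(labels):
+    """Return a copy with `ctype` folded into `container_type`.
+
+    Raises ValueError if the two spellings disagree or a required
+    label is missing or empty.
+    """
+    result = dict(labels)
+    legacy = result.pop("ctype", None)
+    if legacy is not None:
+        current = result.setdefault("container_type", legacy)
+        if current != legacy:
+            raise ValueError("ctype and container_type disagree")
+    missing = [name for name in REQUIRED_LABELS if not result.get(name)]
+    if missing:
+        raise ValueError("missing required labels: " + ", ".join(missing))
+    return result
+```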
+
+## Optional labels
+
+Optional labels may include:
+
+- `server_id`
+- `game`
+- `variant`
+- `runtime`
+- `hostname`
+- `name`
+- `plan`
+- `role`
+
+Any customer/user-identifying label should be reviewed before it is exposed in dashboards or alerts.
+
+## Lifecycle/debug state
+
+Launch-debug monitoring should expose or derive these states:
+
+- `ready`
+- `connectable`
+- `operationInProgress`
+- `operationType`
+- backup status
+- restore status
+- code-server status for dev containers
+- provisioning-complete state
+- recent failure reason where safe
+
+## Agent responsibilities
+
+Agent/container-side monitoring responsibilities:
+
+- keep Alloy installed/prepared in game/dev templates
+- remove legacy `node-exporter` from game/dev templates once Alloy is standard
+- keep `/etc/default/alloy` pinned to the config path `/etc/alloy/config.alloy` and listen port `12345`
+- write or refresh Alloy config after `POST /config`
+- keep labels minimal and consistent with dashboards/discovery
+- eventually emit structured logs suitable for Loki/Alloy log ingestion
+
+## Monitoring host responsibilities
+
+The monitoring host must:
+
+- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
+- run discovery sync reliably
+- provision dashboards into Grafana, not merely store dashboard JSON on disk
+- alert on failed discovery sync
+- alert on stale file_sd age
+- alert on target-down, identified by `vmid` and `container_type`
+- alert on API discovery endpoint failure
+- store bearer tokens with tight permissions or systemd credentials
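+
+Stale file_sd age can be measured from the file's modification time. The sketch below assumes the sync script rewrites the file only on success, so the mtime reflects the last healthy sync; the function names and the threshold are illustrative.
+
+```python
+import os
+import time
+
+
+def file_sd_age_seconds(path, now=None):
+    """Seconds since the file_sd file was last rewritten."""
+    now = time.time() if now is None else now
+    return max(0.0, now - os.path.getmtime(path))
+
+
+def is_stale(path, max_age_seconds, now=None):
+    """True when the last successful sync is older than the threshold."""
+    return file_sd_age_seconds(path, now) > max_age_seconds
+```
+
+A check like this could run alongside the sync job and feed an alert, so a silently failing sync still surfaces.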
+
+## Grafana responsibilities
+
+Grafana must have:
+
+- Prometheus datasource provisioned
+- core service dashboards installed
+- game container dashboard installed
+- dev container dashboard installed
+- dashboard queries aligned with the canonical label contract
+
+Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
+
+## Logs contract
+
+Launch-ready debugging requires centralized logs for at least:
+
+- API application logs
+- Agent logs
+- provisioning flow
+- backup/restore flow
+- delete/teardown flow
+- Velocity registration callbacks
+- DNS/edge publish and cleanup actions
+- monitoring discovery sync
+
+Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.
+
+## Security contract
+
+Required before launch:
+
+- Prometheus not publicly reachable
+- Grafana not publicly reachable except through intended authenticated/admin path
+- node_exporter not publicly reachable
+- API monitoring endpoints fail closed without token
+- no monitoring bearer token in world-readable files
+- no token leakage in labels, target URLs, dashboards, or logs
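+
+The token-file rule lends itself to a pre-launch check. A minimal sketch, assuming a POSIX host; `token_file_problems` is a hypothetical helper, and treating group access as a problem is a stricter reading than the contract strictly requires.
+
+```python
+import os
+import stat
+
+
+def token_file_problems(path):
+    """Return a list of permission problems for a token file (empty means OK)."""
+    mode = os.stat(path).st_mode
+    problems = []
+    if mode & stat.S_IROTH:
+        problems.append("world-readable")
+    if mode & stat.S_IWOTH:
+        problems.append("world-writable")
+    if mode & (stat.S_IRGRP | stat.S_IWGRP):
+        # Stricter than the contract: flags group read/write as well.
+        problems.append("group-accessible")
+    return problems
+```
+
+A file with mode `0600` passes this check; `0644` is flagged.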