Add monitoring contract

2026-05-01 18:27:04 +00:00 · c3aca715ee
commit c3aca715ee
parent 15b29b54c5
1 changed file with 157 additions and 0 deletions
--- a/Codex/Monitoring/CONTRACT.md
+++ b/Codex/Monitoring/CONTRACT.md
@@ -0,0 +1,157 @@
 # Monitoring Contract
 This file defines the intended monitoring contract for ZeroLagHub launch readiness.
 ## Scope
 Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.
 The monitoring system must support launch debugging for:
 - platform service health
 - API availability and discovery behavior
 - game/dev container presence and deletion
 - container host metrics
 - game/dev lifecycle state
 - provisioning, backup, restore, delete, and code-server/debug workflows
 - stale target detection
 ## Explicit exceptions
 The game/dev Alloy contract does not replace every platform-specific monitoring path.
 Exceptions:
 - OPNsense monitoring remains plugin/exporter-specific.
 - PBS monitoring remains platform-specific.
 - Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
 ## Dynamic discovery path
 Preferred launch model:
 ```text
 API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd
  -> Grafana dashboards / alerts
 ```
 Expected behavior:
 - active game/dev containers appear automatically
 - deleted game/dev containers disappear automatically
 - stale file_sd entries are removed once sync is healthy
 - discovery failures alert loudly and do not silently preserve misleading stale state
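The sync step in the path above can be sketched as follows. This is a minimal illustration, not the project's actual sync script: the function names, the target record shape (`address`, `port`, `labels`), and the output path are assumptions. It emits the Prometheus `file_sd` JSON shape (a list of groups, each with `targets` and `labels`) and replaces the file atomically, so a full rewrite drops deleted containers and a failed discovery call never leaves a partial file.

```python
import json
import os
import tempfile


def build_file_sd(targets):
    """Convert discovered targets into Prometheus file_sd target groups."""
    groups = []
    for t in targets:
        groups.append({
            "targets": [f"{t['address']}:{t['port']}"],
            # Prometheus label values must be strings.
            "labels": {k: str(v) for k, v in sorted(t["labels"].items())},
        })
    return groups


def write_file_sd_atomic(path, groups):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the temp file out of a typical *.json file_sd glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(groups, fh, indent=2)
        # mkstemp creates the file as 0600; assume Prometheus runs as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def sync_once(fetch_targets, path):
    """Run one sync pass and return the number of target groups written."""
    # If discovery raises, the exception propagates before the file is touched.
    targets = fetch_targets()
    groups = build_file_sd(targets)
    write_file_sd_atomic(path, groups)
    return len(groups)
```

Because `sync_once` raises before touching the file when discovery fails, the caller can exit non-zero and let the failed-sync and file-age alerts fire instead of continuing quietly on old data.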
 ## API discovery requirements
 API-owned monitoring discovery endpoints must:
 - require bearer/internal monitoring auth outside explicit development/test mode
 - fail closed when no token is presented
 - return only active/current targets unless historical/debug mode is explicitly requested
 - provide stable labels for Prometheus/Grafana
 - avoid leaking secrets in labels, URLs, logs, or target metadata
 - expose enough metadata to debug game/dev lifecycle state
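The fail-closed and active-only requirements above could look roughly like this. It is a sketch under assumptions: the function names, the `dev_mode` flag, and the `active` field on target records are illustrative and not taken from the API code.

```python
import hmac


def is_authorized(auth_header, expected_token, dev_mode=False):
    """Return True only if the request carries the expected bearer token."""
    if dev_mode:
        # Explicit development/test mode is the only path that skips auth.
        return True
    if not expected_token:
        # Fail closed: a missing server-side token must never mean "open".
        return False
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


def select_targets(records, include_historical=False):
    """Return only active records unless historical/debug mode is requested."""
    if include_historical:
        return list(records)
    return [r for r in records if r.get("active")]
```

The key design choice is that an unset server-side token denies every request, so a misconfigured deployment cannot accidentally expose the endpoint.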
 ## Required labels
 For game/dev targets, standardize on the following required labels:
 - `vmid`
 - `instance`
 - `container_type`
 `container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.
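One way to enforce the required labels during a migration away from `ctype` is a small normalization step in the sync or discovery code. The function below is a hypothetical helper, not part of the existing codebase; it maps the legacy spelling onto `container_type` and rejects targets that lack a required label.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels):
    """Return a copy of labels that uses the canonical `container_type` key."""
    out = dict(labels)
    if "ctype" in out:
        legacy = out.pop("ctype")
        # Refuse to guess if both spellings are present but disagree.
        if out.setdefault("container_type", legacy) != legacy:
            raise ValueError("ctype and container_type disagree")
    missing = [name for name in REQUIRED_LABELS if out.get(name) in (None, "")]
    if missing:
        raise ValueError("missing required labels: " + ", ".join(missing))
    return out
```

For example, `normalize_labels({"vmid": "101", "instance": "10.0.0.5:12345", "ctype": "game"})` returns the same labels with `ctype` renamed to `container_type`.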
 ## Optional labels
 Optional labels may include:
 - `server_id`
 - `game`
 - `variant`
 - `runtime`
 - `hostname`
 - `name`
 - `plan`
 - `role`
 Any customer/user-identifying label should be reviewed before dashboard or alert exposure.
 ## Lifecycle/debug state
 Launch-debug monitoring should expose or derive these states:
 - `ready`
 - `connectable`
 - `operationInProgress`
 - `operationType`
 - backup status
 - restore status
 - code-server status for dev containers
 - provisioning-complete state
 - recent failure reason where safe
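If these states are exposed as metrics, the boolean flags map naturally onto gauges. The sketch below renders them as Prometheus text-format lines; the `zlh_` metric names are invented for illustration, and label values are assumed to be simple identifiers that need no escaping.

```python
def lifecycle_metrics(vmid, container_type, state):
    """Render lifecycle flags as Prometheus text-format gauge lines."""
    labels = f'vmid="{vmid}",container_type="{container_type}"'
    lines = []
    for field, metric in (
        ("ready", "zlh_container_ready"),
        ("connectable", "zlh_container_connectable"),
        ("operationInProgress", "zlh_container_operation_in_progress"),
    ):
        value = 1 if state.get(field) else 0
        lines.append(f"{metric}{{{labels}}} {value}")
    op = state.get("operationType")
    # Only report the operation type while an operation is actually running.
    if state.get("operationInProgress") and op:
        lines.append(f'zlh_container_operation{{{labels},operation_type="{op}"}} 1')
    return lines
```

Failure reasons are left out of this sketch on purpose: free-text values make poor label values and are usually better carried in logs.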
 ## Agent responsibilities
 Agent/container-side monitoring responsibilities:
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
 - keep `/etc/default/alloy` pointing at `/etc/alloy/config.alloy`, with Alloy listening on port `12345`
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually emit structured logs suitable for Loki/Alloy log ingestion
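A template check for the `/etc/default/alloy` requirement might look like the following. It assumes the file is a shell-style environment file with `CONFIG_FILE` and `CUSTOM_ARGS` variables and that the listen address is set through a `--server.http.listen-addr=` argument; verify both against the installed Alloy package before relying on it.

```python
import shlex

EXPECTED_CONFIG = "/etc/alloy/config.alloy"
EXPECTED_PORT = "12345"


def parse_env_file(text):
    """Parse simple KEY=value lines, honoring shell-style quotes."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = " ".join(shlex.split(value))
    return env


def check_alloy_defaults(text):
    """Return a list of problems; an empty list means the file matches."""
    env = parse_env_file(text)
    problems = []
    if env.get("CONFIG_FILE") != EXPECTED_CONFIG:
        problems.append("CONFIG_FILE is not " + EXPECTED_CONFIG)
    listen = [arg for arg in env.get("CUSTOM_ARGS", "").split()
              if arg.startswith("--server.http.listen-addr=")]
    # With no explicit flag, this sketch accepts whatever default Alloy uses.
    if listen and not listen[-1].endswith(":" + EXPECTED_PORT):
        problems.append("listen address is not on port " + EXPECTED_PORT)
    return problems
```

Running this during template builds would catch drift before a container is provisioned from the template.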
 ## Monitoring host responsibilities
 The monitoring host must:
 - restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
 - run discovery sync reliably
 - provision dashboards into Grafana, not merely store dashboard JSON on disk
 - alert on failed discovery sync
 - alert on stale file_sd age
 - alert on target-down by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens in files with tight permissions or as systemd credentials
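The stale `file_sd` age alert needs a signal to alert on. One simple option, sketched here with hypothetical function names and an arbitrary threshold, is to derive staleness from the target file's modification time and treat a missing file as stale.

```python
import os
import time


def file_age_seconds(path, now=None):
    """Seconds since the file was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return max(0.0, current - mtime)


def is_stale(path, max_age_seconds, now=None):
    """A missing file counts as stale so that a broken sync cannot hide."""
    age = file_age_seconds(path, now)
    return age is None or age > max_age_seconds
```

This only works if the sync script rewrites the file on every successful pass, even when the target list is unchanged; otherwise a healthy but quiet system would look stale.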
 ## Grafana responsibilities
 Grafana must have:
 - Prometheus datasource installed
 - core service dashboards installed
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
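Visibility can be checked against Grafana itself instead of the filesystem. The sketch below queries Grafana's dashboard search endpoint (`/api/search?type=dash-db`) and reports which expected dashboards are absent; the base URL, the token, and the dashboard titles are placeholders, not values from this project.

```python
import json
import urllib.request


def fetch_search_results(base_url, api_token):
    """Fetch the dashboard list from Grafana's search API."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/search?type=dash-db",
        headers={"Authorization": "Bearer " + api_token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def missing_dashboards(search_results, expected_titles):
    """Return expected dashboard titles absent from a Grafana search result."""
    visible = {item.get("title") for item in search_results
               if item.get("type") == "dash-db"}
    return sorted(set(expected_titles) - visible)
```

A non-empty result from `missing_dashboards` means provisioning has not actually landed, even if the JSON files exist on disk.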
 ## Logs contract
 Launch-ready debugging requires centralized logs for at least:
 - API application logs
 - Agent logs
 - provisioning flow
 - backup/restore flow
 - delete/teardown flow
 - Velocity registration callbacks
 - DNS/edge publish and cleanup actions
 - monitoring discovery sync
 Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.
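Whichever ingestion path is chosen, one-JSON-object-per-line logs are straightforward for a collector to parse. A minimal sketch using Python's standard `logging` module follows; the `flow` and `vmid` context fields are hypothetical names chosen to match the flows listed above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `flow` and `vmid` are hypothetical context fields passed via `extra=`.
        for key in ("flow", "vmid"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, sort_keys=True)
```

A caller would attach the formatter to a handler and log with, for example, `logger.info("backup finished", extra={"flow": "backup", "vmid": "101"})`. Secrets must still be kept out of messages and context fields.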
 ## Security contract
 Required before launch:
 - Prometheus not publicly reachable
 - Grafana not publicly reachable except through the intended authenticated/admin path
 - node_exporter not publicly reachable
 - API monitoring endpoints fail closed when no token is presented
 - no monitoring bearer token in world-readable files
 - no token leakage in labels, target URLs, dashboards, or logs
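Two of the checks above lend themselves to automation in a pre-launch script. The helpers below are illustrative only (the names are not from the codebase): one flags a token file that other users can read or write, the other finds labels or metadata values that contain the token verbatim.

```python
import os
import stat


def token_file_problems(path):
    """Return permission problems for a bearer-token file."""
    mode = os.stat(path).st_mode
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    return problems


def leaked_token_fields(token, mapping):
    """Return keys whose string value contains the token verbatim."""
    if not token:
        return []
    return sorted(k for k, v in mapping.items() if token in str(v))
```

A verbatim match only catches the obvious case; encoded or truncated tokens would need a separate review of labels, target URLs, dashboards, and logs.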