From c3aca715ee8a2b6abb31aa991f0149c69e2bdb6a Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 18:27:04 +0000
Subject: [PATCH] Add monitoring contract

---
 Codex/Monitoring/CONTRACT.md | 157 +++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 Codex/Monitoring/CONTRACT.md

diff --git a/Codex/Monitoring/CONTRACT.md b/Codex/Monitoring/CONTRACT.md
new file mode 100644
index 0000000..3c413ac
--- /dev/null
+++ b/Codex/Monitoring/CONTRACT.md
@@ -0,0 +1,157 @@
+# Monitoring Contract
+
+This file defines the intended monitoring contract for ZeroLagHub launch readiness.
+
+## Scope
+
+Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.
+
+The monitoring system must support launch debugging for:
+
+- platform service health
+- API availability and discovery behavior
+- game/dev container presence and deletion
+- container host metrics
+- game/dev lifecycle state
+- provisioning, backup, restore, delete, and code-server/debug workflows
+- stale target detection
+
+## Explicit exceptions
+
+The game/dev Alloy contract does not replace every platform-specific monitoring path.
+
+Exceptions:
+
+- OPNsense monitoring remains plugin/exporter-specific.
+- PBS monitoring remains platform-specific.
+- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
+
+## Dynamic discovery path
+
+Preferred launch model:
+
+```text
+API discovery endpoint
+  -> monitor sync script
+  -> Prometheus file_sd
+  -> Grafana dashboards / alerts
+```
+
+Expected behavior:
+
+- active game/dev containers appear automatically
+- deleted game/dev containers disappear automatically
+- stale file_sd entries are removed once sync is healthy
+- discovery failures alert loudly and do not silently preserve misleading stale state
+
+## API discovery requirements
+
+API-owned monitoring discovery endpoints must:
+
+- require bearer/internal monitoring auth outside explicit development/test mode
+- fail closed without token
+- return only active/current targets unless historical/debug mode is explicitly requested
+- provide stable labels for Prometheus/Grafana
+- avoid leaking secrets in labels, URLs, logs, or target metadata
+- expose enough metadata to debug game/dev lifecycle state
+
+## Required labels
+
+For game/dev targets, standardize on the following required labels:
+
+- `vmid`
+- `instance`
+- `container_type`
+
+`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.
+
+## Optional labels
+
+Optional labels may include:
+
+- `server_id`
+- `game`
+- `variant`
+- `runtime`
+- `hostname`
+- `name`
+- `plan`
+- `role`
+
+Any customer/user-identifying label should be reviewed before dashboard or alert exposure.
+
+## Lifecycle/debug state
+
+Launch-debug monitoring should expose or derive these states:
+
+- `ready`
+- `connectable`
+- `operationInProgress`
+- `operationType`
+- backup status
+- restore status
+- code-server status for dev containers
+- provisioning-complete state
+- recent failure reason where safe
+
+## Agent responsibilities
+
+Agent/container-side monitoring responsibilities:
+
+- keep Alloy installed/prepared in game/dev templates
+- remove legacy `node-exporter` from game/dev templates once Alloy is standard
+- keep `/etc/default/alloy` fixed to `/etc/alloy/config.alloy` on port `12345`
+- write or refresh Alloy config after `POST /config`
+- keep labels minimal and consistent with dashboards/discovery
+- eventually emit structured logs suitable for Loki/Alloy log ingestion
+
+## Monitoring host responsibilities
+
+The monitoring host must:
+
+- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
+- run discovery sync reliably
+- provision dashboards into Grafana, not merely store dashboard JSON on disk
+- alert on failed discovery sync
+- alert on stale file_sd age
+- alert on target-down by VMID/type
+- alert on API discovery endpoint failure
+- store bearer tokens with tight permissions or systemd credentials
+
+## Grafana responsibilities
+
+Grafana must have:
+
+- Prometheus datasource installed
+- core service dashboards installed
+- game container dashboard installed
+- dev container dashboard installed
+- dashboard queries aligned with the canonical label contract
+
+Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
+
+## Logs contract
+
+Launch-ready debugging requires centralized logs for at least:
+
+- API application logs
+- Agent logs
+- provisioning flow
+- backup/restore flow
+- delete/teardown flow
+- Velocity registration callbacks
+- DNS/edge publish and cleanup actions
+- monitoring discovery sync
+
+Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.
+
+## Security contract
+
+Required before launch:
+
+- Prometheus not publicly reachable
+- Grafana not publicly reachable except through intended authenticated/admin path
+- node_exporter not publicly reachable
+- API monitoring endpoints fail closed without token
+- no monitoring bearer token in world-readable files
+- no token leakage in labels, target URLs, dashboards, or logs
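Review note: the "monitor sync script" step in the discovery path, together with the required-label and no-secret-leakage rules, could be sketched roughly as below. This is a minimal illustration, not the project's implementation. The function names (`to_file_sd`, `write_file_sd`), the input shape (`address` plus `labels`), the `SECRET_HINTS` deny-list, and the choice to normalize `ctype` to `container_type` are all assumptions made for the example; only the three required label names come from the contract.

```python
import json
import os
import tempfile

# Required label names are taken from the contract.
REQUIRED_LABELS = ("vmid", "instance", "container_type")
# Hypothetical: normalize the legacy spelling instead of carrying both.
LEGACY_ALIASES = {"ctype": "container_type"}
# Hypothetical deny-list for label names that look like secrets.
SECRET_HINTS = ("token", "secret", "password")


def to_file_sd(targets):
    """Convert discovery targets into Prometheus file_sd target groups.

    Each input item is assumed to look like
    {"address": "host:port", "labels": {...}}; that shape is an assumption.
    Raises ValueError so a bad payload fails loudly instead of being written.
    """
    groups = []
    for target in targets:
        labels = {}
        for key, value in target["labels"].items():
            # file_sd label values must be strings.
            labels[LEGACY_ALIASES.get(key, key)] = str(value)
        missing = [name for name in REQUIRED_LABELS if not labels.get(name)]
        if missing:
            raise ValueError(f"{target['address']}: missing required labels {missing}")
        leaked = [name for name in labels
                  if any(hint in name.lower() for hint in SECRET_HINTS)]
        if leaked:
            raise ValueError(f"{target['address']}: secret-looking labels {leaked}")
        groups.append({"targets": [target["address"]], "labels": labels})
    return groups


def write_file_sd(path, groups):
    """Write the groups as JSON via a temp file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2, sort_keys=True)
        # Assumption: Prometheus runs as another user and needs read access;
        # the file holds no secrets, so world-readable is acceptable here.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the output is rebuilt from the current discovery response on every run, deleted containers drop out of the file on the next successful sync. If discovery or validation fails, the exception propagates and the existing file is left untouched, so this sketch relies on the separate "failed discovery sync" and "stale file_sd age" alerts to keep that state from going unnoticed.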