# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd
  -> Grafana dashboards / alerts
```

Expected behavior:

- active game/dev containers appear automatically
- deleted game/dev containers disappear automatically
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state

## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state

## Required labels

For game/dev targets, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` should be the canonical spelling.
Avoid using both `ctype` and `container_type` long-term.

## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node-exporter` from game/dev templates once Alloy is standard
- keep `/etc/default/alloy` fixed to `/etc/alloy/config.alloy` on port `12345`
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually emit structured logs suitable for Loki/Alloy log ingestion

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on target-down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials

## Grafana responsibilities

Grafana must have:

- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
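
For dashboard queries to stay aligned with the canonical label contract, the monitor sync script has to emit those labels consistently. The following is a minimal sketch of that step, not the actual sync script: it assumes the discovery endpoint returns a list of entries shaped like `{"address": ..., "labels": {...}}` (a hypothetical payload shape), folds the legacy `ctype` label into `container_type`, rejects targets missing a required label, and writes the Prometheus file_sd JSON atomically. The function names `to_file_sd` and `write_file_sd` are illustrative.

```python
import json
import os
import tempfile

# Required labels from the contract above.
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def to_file_sd(targets):
    """Convert discovery entries (assumed shape) into Prometheus file_sd groups."""
    groups = []
    for entry in targets:
        # Prometheus label values must be strings.
        labels = {k: str(v) for k, v in entry.get("labels", {}).items()}
        # Fold the legacy `ctype` spelling into canonical `container_type`.
        legacy = labels.pop("ctype", None)
        if legacy is not None:
            labels.setdefault("container_type", legacy)
        missing = [name for name in REQUIRED_LABELS if not labels.get(name)]
        if missing:
            # Fail loudly instead of emitting a half-labeled target.
            raise ValueError(
                f"target {entry.get('address')!r} is missing labels: {missing}"
            )
        groups.append({"targets": [entry["address"]], "labels": labels})
    return groups


def write_file_sd(path, groups):
    """Replace the file_sd file atomically so Prometheus never reads a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2, sort_keys=True)
        # mkstemp creates the file as 0600; the target list holds no secrets,
        # so make it readable by the Prometheus user (assumed to differ).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the file is rewritten in full from the current discovery response, deleted containers drop out of file_sd on the next successful sync. A caller would likely catch the `ValueError`, leave the previous file untouched, and surface the failure through the discovery-sync alert.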
## Logs contract

Launch-ready debugging requires centralized logs for at least:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion.

Until this exists, launch-debug readiness remains incomplete.

## Security contract

Required before launch:

- Prometheus not publicly reachable
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
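
Two of the security items above lend themselves to an automated pre-launch check. The sketch below is illustrative only: `token_file_is_private` tests that a token file grants no permissions to group or others (stricter than "not world-readable", in line with the "tight permissions" requirement), and `suspicious_labels` applies a simple keyword heuristic to target labels. The function names and the `SECRET_HINTS` keyword list are assumptions, not part of the contract, and the heuristic will not catch every leak.

```python
import os
import stat

# Hypothetical keyword list; tune to the naming used in the real deployment.
SECRET_HINTS = ("token", "secret", "password", "bearer")


def token_file_is_private(path):
    """Return True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def suspicious_labels(labels):
    """Return sorted label names whose name or value hints at a secret."""
    flagged = []
    for name, value in labels.items():
        text = f"{name}={value}".lower()
        if any(hint in text for hint in SECRET_HINTS):
            flagged.append(name)
    return sorted(flagged)
```

A check like this could run as part of the discovery sync or a launch-readiness script, failing the run when a token file is too permissive or a label is flagged.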