# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.

That means there are two related but separate paths:

```text
API discovery -> monitor sync -> file_sd target inventory / add-remove validation
container Alloy -> remote_write -> Prometheus time series -> Grafana dashboards
```

Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.

Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.

## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
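
The freshness rule from "Not API-only" (health from received remote-write metrics, not scrape-up status) could be checked roughly as below. This is a minimal sketch, not the project's implementation. It assumes the monitoring host runs the PromQL expression `timestamp(node_uname_info)` as an instant query against the Prometheus HTTP API and passes the decoded JSON response in. The function names and the 120-second threshold are hypothetical.

```python
# Assumed PromQL: the sample value of each returned series is the timestamp
# of the newest node_uname_info sample for that series.
FRESHNESS_QUERY = "timestamp(node_uname_info)"


def last_sample_by_vmid(response):
    """Map vmid -> newest sample timestamp from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    newest = {}
    for series in response["data"]["result"]:
        vmid = series["metric"].get("vmid")
        if vmid is None:
            continue  # series without the required label cannot be attributed
        # value is [evaluation_time, "<sample value as string>"]
        sample_ts = float(series["value"][1])
        newest[vmid] = max(sample_ts, newest.get(vmid, 0.0))
    return newest


def stale_vmids(newest, now, max_age_seconds=120.0):
    """Return vmids whose newest sample is older than the allowed age."""
    return sorted(v for v, ts in newest.items() if now - ts > max_age_seconds)
```

A container that stopped sending long ago may drop out of the query result entirely, so absence still has to be checked against the expected inventory rather than inferred from this result alone.
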
## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint -> monitor sync script -> Prometheus file_sd inventory -> add/remove validation and optional direct scrape targets
```

Expected behavior:

- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.

## Metrics path

Canonical launch path for game/dev container metrics:

```text
container Alloy remote_write -> Prometheus at 10.60.0.25:9090 -> Grafana dashboards
```

Acceptance evidence for a game/dev container should include real time series such as:

- `node_uname_info`
- CPU metrics
- memory metrics
- expected labels such as `vmid`, `instance`, and `container_type`

## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state

## Required labels

For game/dev targets/time series, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.

## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.
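
The sync step in "Dynamic discovery path", combined with the required-label rule, might look something like the sketch below. It assumes the discovery endpoint returns a list of objects with an `address` string and a `labels` mapping; that response shape and both function names are hypothetical. The output uses the standard Prometheus file_sd JSON layout (a list of groups with `targets` and `labels`).

```python
import json
import os
import tempfile

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def build_file_sd(targets):
    """Convert discovery targets (assumed shape) into file_sd target groups."""
    groups = []
    for target in targets:
        raw = target["labels"]
        missing = [name for name in REQUIRED_LABELS if raw.get(name) in (None, "")]
        if missing:
            # Fail loudly instead of writing an inventory with broken labels.
            raise ValueError(f"target {target.get('address')!r} is missing labels: {missing}")
        labels = {key: str(value) for key, value in raw.items()}
        groups.append({"targets": [target["address"]], "labels": labels})
    return sorted(groups, key=lambda group: group["labels"]["vmid"])


def write_file_sd(path, groups):
    """Replace the inventory atomically so Prometheus never reads a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because each run rewrites the whole file from the current discovery result, deleted containers disappear on the next healthy sync. `mkstemp` creates the file with owner-only permissions, so the mode may need adjusting for the Prometheus user. What to do with the previous file when discovery fails (keep it and alert, or clear it) is a policy choice this sketch leaves open.
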
## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node-exporter` from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually emit structured logs suitable for Loki/Alloy log ingestion

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials

## Grafana responsibilities

Grafana must have:

- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
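
Two of the monitoring host alerts above (stale file_sd age, and missing remote-write metrics by VMID/type) can be illustrated with a small check. This is a sketch under assumptions: the inventory is the file_sd group list, the set of VMIDs currently reporting comes from some separate Prometheus query, and the helper names are hypothetical.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the inventory file was last rewritten by the sync job."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def expected_from_file_sd(groups):
    """Map vmid -> container_type for every target group in the inventory."""
    return {g["labels"]["vmid"]: g["labels"]["container_type"] for g in groups}


def missing_by_type(expected, reporting):
    """Group expected vmids that are not reporting metrics by container_type."""
    missing = {}
    for vmid, container_type in expected.items():
        if vmid not in reporting:
            missing.setdefault(container_type, []).append(vmid)
    return {ctype: sorted(vmids) for ctype, vmids in missing.items()}
```

The age check only means something if the sync job rewrites the file on every successful run, even when the inventory is unchanged.
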
## Logs contract

Launch-ready debugging requires centralized logs for at least:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.

## Security contract

Required before launch:

- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
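
Two of the security items above (no token in world-readable files, no token leakage in labels) lend themselves to a pre-launch check. The sketch below is illustrative only. It assumes a POSIX host, and both function names are hypothetical.

```python
import os
import stat


def token_file_problems(path):
    """Report permission problems on a bearer-token file (POSIX mode bits)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    return problems


def labels_leaking_token(labels, token):
    """Return the names of labels whose value contains the token verbatim."""
    if not token:
        return []
    return sorted(name for name, value in labels.items() if token in str(value))
```

A substring match only catches verbatim leakage. Encoded or truncated tokens would need a separate review of target URLs, dashboards, and logs.
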