# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.

There are two related but separate paths:

```text
API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation
  -> game-dev-alloy scrape health

container Alloy
  -> remote_write
  -> Prometheus time series
  -> Grafana dashboards
```

Container Alloy intentionally listens on `0.0.0.0:12345` for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the `game-dev-alloy` job. This direct scrape proves collector reachability and target add/remove behavior.

Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.

## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
  -> add/remove validation and direct Alloy scrape targets
```

Expected behavior:

- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.

## Metrics path

Canonical launch path for game/dev container metrics:

```text
container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards
```

Acceptance evidence for a game/dev container should include real time series such as:

- `node_uname_info`
- CPU metrics
- memory metrics
- expected labels such as `vmid`, `instance`, and `container_type`

Direct Alloy scrape health should also be visible through the `game-dev-alloy` Prometheus job for current lifecycle targets.

## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state

## Required labels

For game/dev targets/time series, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.

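The discovery path and the required-label contract above can be sketched together. The following is a minimal illustration, not the actual monitor sync script: the record shape (`ip` plus a `labels` mapping) and the function names are hypothetical, and the real API response may look different. It folds a legacy `ctype` label into `container_type`, rejects records that lack a required label, and rebuilds the file_sd target groups from the active list only, so deleted containers drop out on the next sync.

```python
import json

# Required labels taken from the contract above.
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(raw):
    """Return string labels with legacy `ctype` folded into `container_type`.

    Raises ValueError if any required label is missing or empty.
    """
    labels = {str(k): str(v) for k, v in raw.items()}
    legacy = labels.pop("ctype", None)
    if legacy is not None and "container_type" not in labels:
        labels["container_type"] = legacy
    missing = [name for name in REQUIRED_LABELS if not labels.get(name)]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    return labels


def build_file_sd(discovered, port=12345):
    """Turn discovery records into Prometheus file_sd target groups.

    `discovered` is assumed (hypothetically) to be a list of dicts with
    an `ip` string and a `labels` mapping. The output is rebuilt from
    scratch on every call, so nothing stale carries over.
    """
    groups = []
    for record in discovered:
        labels = normalize_labels(record["labels"])
        groups.append({"targets": [f"{record['ip']}:{port}"], "labels": labels})
    return groups


if __name__ == "__main__":
    # Example record with made-up values.
    sample = [{"ip": "10.0.0.5", "labels": {"vmid": 101, "instance": "mc-101", "ctype": "game"}}]
    print(json.dumps(build_file_sd(sample), indent=2))
```

A real sync would also need to write the file atomically and raise an alert when discovery fails, so that a failed run does not leave misleading stale targets in place.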
## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node-exporter` from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
- configure container Alloy HTTP to listen on `0.0.0.0:12345` for the intended internal monitoring networks
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually ship selected container logs through Alloy to centralized Loki
- do not run Loki itself inside each game/dev container

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on `game-dev-alloy` scrape target down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
- host or reach the centralized Loki service once log collection is enabled

## Grafana responsibilities

Grafana must have:

- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container metrics
- direct scrape health panels or queries for `game-dev-alloy` target health
- Loki datasource installed once centralized logs are enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

## Logs contract

Loki should be centralized, not installed per game/dev container.

Preferred future direction:

```text
container/app logs
  -> container Alloy log collection
  -> centralized Loki
  -> Grafana Explore / dashboard links
```

Container Alloy should tail only selected, useful logs and attach stable labels such as:

- `vmid`
- `instance`
- `container_type`
- `server_id` where safe/useful
- `log_source`, for example `agent`, `minecraft`, `codeserver`, `provisioning`

Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.

Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.

When logs are added, prioritize:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Initial container log candidates:

- Agent service logs
- Minecraft/server stdout logs where useful
- code-server logs for dev containers
- provisioning/config application logs
- backup/restore operation logs

Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.

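As one possible aid for the review step above, a log sample can be screened for lines that look like they carry credentials before that log source is approved for shipping. This is a rough heuristic sketch, not part of the contract: the patterns and the function name `looks_sensitive` are assumptions, and it will produce both false positives and false negatives, so it supplements manual review rather than replacing it.

```python
import re

# Heuristic patterns (assumed, not exhaustive): bearer tokens and
# key/value pairs whose key name suggests a credential.
SUSPECT_PATTERNS = (
    re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]{8,}"),
    re.compile(r"(?i)(token|secret|password|passwd|api[_-]?key)\w*\s*[=:]\s*\S+"),
)


def looks_sensitive(line):
    """Return True if the log line matches any credential-like pattern."""
    return any(pattern.search(line) for pattern in SUSPECT_PATTERNS)


def flag_sample(lines):
    """Return the subset of sampled log lines that need manual review."""
    return [line for line in lines if looks_sensitive(line)]
```

Running `flag_sample` over a few hundred lines from a candidate log source gives a quick indication of whether that source needs filtering or should stay out of Loki entirely.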
## Security contract

Required before launch:

- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
- Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths
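The token-file requirement above lends itself to a simple automated check. The sketch below is illustrative only: the function name is hypothetical, and treating group-readable files as too loose is an assumption about what "tight permissions" means here. It inspects the file mode and reports which permission bits violate the contract.

```python
import os
import stat


def token_file_problems(path):
    """Return permission problems for a bearer-token file (empty list means OK)."""
    mode = os.stat(path).st_mode
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    # Assumption: group read access is also considered too permissive.
    if mode & stat.S_IRGRP:
        problems.append("group-readable")
    return problems
```

A launch-readiness script could call this for each known token path and fail if any list is non-empty. A file with mode `0600` passes, while `0644` is reported as both world-readable and group-readable.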