Monitoring Contract
This file defines the intended monitoring contract for ZeroLagHub launch readiness.
Scope
Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.
The monitoring system must support launch debugging for:
- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection
Explicit exceptions
The game/dev Alloy contract does not replace every platform-specific monitoring path.
Exceptions:
- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
Dynamic discovery path
Preferred launch model:
API discovery endpoint
-> monitor sync script
-> Prometheus file_sd
-> Grafana dashboards / alerts
Expected behavior:
- active game/dev containers appear automatically
- deleted game/dev containers disappear automatically
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state
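The sync step in this path could be sketched as follows. This is a minimal illustration, not the actual sync script: the discovery record shape (`address` plus a `labels` object) and the function names `to_file_sd` and `write_atomically` are assumptions, since this contract does not define the response format.

```python
import json
import os
import tempfile


def to_file_sd(targets):
    """Convert discovery records into Prometheus file_sd target groups.

    Assumed (hypothetical) record shape:
    {"address": "10.0.0.5:12345", "labels": {"vmid": "101", ...}}
    """
    groups = []
    for record in targets:
        groups.append({
            "targets": [record["address"]],
            # Prometheus label values must be strings.
            "labels": {k: str(v) for k, v in record["labels"].items()},
        })
    return groups


def write_atomically(path, groups):
    """Replace the file_sd file in one step so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the temp file outside a typical *.json file_sd glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2, sort_keys=True)
        # mkstemp creates the file as 0600; file_sd data holds no secrets,
        # so make it readable by the Prometheus user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because the file is rewritten in full from the current discovery response, deleted containers drop out on the next successful sync. On a failed fetch, one option is to exit non-zero without touching the file and rely on the sync-failure and stale-age alerts to surface the problem.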
API discovery requirements
API-owned monitoring discovery endpoints must:
- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state
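The fail-closed auth requirement above could look roughly like this. It is a sketch under assumptions: `is_authorized` and its parameters are hypothetical names, and how the real API reads the header, the configured token, and the development/test flag is not specified here.

```python
import hmac


def is_authorized(authorization_header, expected_token, dev_mode=False):
    """Return True only when the request may see monitoring discovery data."""
    if dev_mode:
        # Explicit development/test mode is the only bypass.
        return True
    if not expected_token:
        # Fail closed: a missing server-side token never means "open".
        return False
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

The empty-token check matters: an unset token on the server side should reject every request rather than accept requests that also send nothing.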
Required labels
For game/dev targets, standardize on the following required labels:
- vmid
- instance
- container_type
container_type should be the canonical spelling. Avoid using both ctype and container_type long-term.
Optional labels
Optional labels may include:
- server_id
- game
- variant
- runtime
- hostname
- name
- plan
- role
Any customer/user-identifying label should be reviewed before dashboard or alert exposure.
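The label contract could be enforced at sync time with something like the sketch below. `normalize_labels` is a hypothetical helper. Mapping ctype onto container_type follows the canonical-spelling rule above, while dropping labels outside the contract is a design choice of this sketch rather than a stated requirement.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")
OPTIONAL_LABELS = ("server_id", "game", "variant", "runtime",
                   "hostname", "name", "plan", "role")


def normalize_labels(labels):
    """Return a label dict restricted to the contract, or raise ValueError."""
    normalized = dict(labels)
    # Map the legacy spelling onto the canonical one.
    if "ctype" in normalized:
        legacy = normalized.pop("ctype")
        normalized.setdefault("container_type", legacy)
    missing = [name for name in REQUIRED_LABELS
               if normalized.get(name) in (None, "")]
    if missing:
        raise ValueError("missing required labels: " + ", ".join(missing))
    allowed = set(REQUIRED_LABELS) | set(OPTIONAL_LABELS)
    # Drop anything outside the contract so unreviewed fields are not exported.
    return {k: str(v) for k, v in normalized.items() if k in allowed}
```

Rejecting a target with missing required labels, instead of exporting it half-labelled, keeps dashboard queries on vmid and container_type from silently missing containers.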
Lifecycle/debug state
Launch-debug monitoring should expose or derive these states:
- ready
- connectable
- operationInProgress
- operationType
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe
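One way to expose the boolean states above is as Prometheus gauges in the text exposition format. The sketch below is illustrative only: the `zlh_container_*` metric names and the `render_lifecycle_metrics` helper are assumptions, not names defined by this contract.

```python
def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return (str(value).replace("\\", "\\\\")
            .replace('"', '\\"').replace("\n", "\\n"))


def render_lifecycle_metrics(vmid, container_type, state):
    """Render lifecycle flags as Prometheus text-exposition gauge lines."""
    labels = 'vmid="{}",container_type="{}"'.format(
        _escape(vmid), _escape(container_type))
    lines = []
    # Hypothetical metric names; each boolean becomes a 0/1 gauge.
    for field, metric in (
        ("ready", "zlh_container_ready"),
        ("connectable", "zlh_container_connectable"),
        ("operationInProgress", "zlh_container_operation_in_progress"),
    ):
        value = 1 if state.get(field) else 0
        lines.append("{}{{{}}} {}".format(metric, labels, value))
    operation = state.get("operationType")
    if operation:
        # Info-style metric: the operation type is carried in a label.
        lines.append(
            'zlh_container_operation_info{{{},operation_type="{}"}} 1'.format(
                labels, _escape(operation)))
    return "\n".join(lines) + "\n"
```

Failure reasons are left out of the sketch on purpose: free-text reasons make poor label values and would need review before exposure.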
Agent responsibilities
Agent/container-side monitoring responsibilities:
- keep Alloy installed/prepared in game/dev templates
- remove legacy node-exporter from game/dev templates once Alloy is standard
- keep /etc/default/alloy fixed to /etc/alloy/config.alloy on port 12345
- write or refresh Alloy config after POST /config
- keep labels minimal and consistent with dashboards/discovery
- eventually emit structured logs suitable for Loki/Alloy log ingestion
Monitoring host responsibilities
The monitoring host must:
- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on target-down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
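The stale file_sd age check could be derived from the file's modification time, as in this sketch. The helper names and the 300-second default threshold are assumptions. In practice the age would more likely be exported as a metric and alerted on in Prometheus than checked inline.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the file_sd file was last rewritten."""
    now = time.time() if now is None else now
    return max(0.0, now - os.path.getmtime(path))


def is_stale(path, max_age_seconds=300, now=None):
    """True when the file is missing or older than the allowed age."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # A missing file means sync never succeeded; treat it as stale.
        return True
```

This check only works if the sync script rewrites the file on every successful run, even when the target list is unchanged. Otherwise a quiet but healthy sync would look stale.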
Grafana responsibilities
Grafana must have:
- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
Logs contract
Launch-ready debugging requires centralized logs for at least:
- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync
Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.
Security contract
Required before launch:
- Prometheus not publicly reachable
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
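The token file permission item can be verified mechanically before launch. A minimal sketch, assuming the token is stored in a regular file; `token_file_is_private` is a hypothetical helper, and it is stricter than the contract wording because it rejects group access as well as world access.

```python
import os
import stat


def token_file_is_private(path):
    """True when neither group nor other has any access to the token file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A file with mode 0600 passes, while 0640 and 0644 both fail. Tokens delivered through systemd credentials would not need this check.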