Monitoring Contract
This file defines the intended monitoring contract for ZeroLagHub launch readiness.
Scope
Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.
The monitoring system must support launch debugging for:
- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection
Not API-only
Monitoring is not an API-only design.
The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.
The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on 10.60.0.25:9090.
There are two related but separate paths:
API discovery
-> monitor sync
-> file_sd target inventory / add-remove validation
-> game-dev-alloy scrape health
container Alloy
-> remote_write
-> Prometheus time series
-> Grafana dashboards
Container Alloy intentionally listens on 0.0.0.0:12345 for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the game-dev-alloy job. This direct scrape proves collector reachability and target add/remove behavior.
Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.
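The two paths above can be sketched against the Prometheus HTTP instant-query API (a real endpoint, /api/v1/query). The PromQL expressions and the collector_reachable helper are illustrative assumptions, not an existing tool:

```python
import json
import urllib.parse
import urllib.request

PROM = "http://10.60.0.25:9090"

# Path 1: direct scrape health (target presence / collector reachability)
SCRAPE_HEALTH = 'up{job="game-dev-alloy"}'

# Path 2: canonical delivery, proven from received remote-write series
REMOTE_WRITE_AGE = "time() - timestamp(node_uname_info)"

def instant_query(expr: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    url = f"{PROM}/api/v1/query?query={urllib.parse.quote(expr)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

def collector_reachable(up_vector: list, vmid: str) -> bool:
    """True when the game-dev-alloy scrape for this vmid reports up == 1."""
    return any(
        s["metric"].get("vmid") == vmid and s["value"][1] == "1"
        for s in up_vector
    )
```

Scrape health answers "is the collector there?"; the freshness query answers "are metrics actually arriving?" — both are needed because either can pass while the other fails.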
Explicit exceptions
The game/dev Alloy contract does not replace every platform-specific monitoring path.
Exceptions:
- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
Dynamic discovery path
Preferred launch model:
API discovery endpoint
-> monitor sync script
-> Prometheus file_sd inventory
-> add/remove validation and direct Alloy scrape targets
Expected behavior:
- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state
Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
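The sync step above can be sketched as a small script that converts API discovery output into a file_sd inventory and replaces it atomically, so deletions propagate and Prometheus never reads a half-written file. The API response shape and field names here are assumptions:

```python
import json
import os
import tempfile

def to_file_sd(targets: list[dict], alloy_port: int = 12345) -> list[dict]:
    """Convert discovered containers into Prometheus file_sd entries.

    Assumes the API returns dicts like
    {"ip": "10.60.0.50", "vmid": 101, "container_type": "game"}.
    """
    return [
        {
            "targets": [f'{t["ip"]}:{alloy_port}'],
            "labels": {
                "vmid": str(t["vmid"]),
                "container_type": t["container_type"],
            },
        }
        for t in targets
    ]

def write_file_sd(entries: list[dict], path: str) -> None:
    """Atomically replace the inventory: deleted containers disappear on
    the next successful sync, and a failed sync leaves the old file intact
    (which is why stale-age alerting is still required)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(entries, f, indent=2)
    os.replace(tmp, path)  # atomic rename on POSIX
```

Because a failed sync preserves the previous file, the loud-alerting requirement above is what prevents silently misleading stale state.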
Metrics path
Canonical launch path for game/dev container metrics:
container Alloy remote_write
-> Prometheus at 10.60.0.25:9090
-> Grafana dashboards
Acceptance evidence for a game/dev container should include real time series such as:
- node_uname_info
- CPU metrics
- memory metrics
- expected labels such as vmid, instance, and container_type
Direct Alloy scrape health should also be visible through the game-dev-alloy Prometheus job for current lifecycle targets.
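An acceptance check for the evidence above can be sketched as a predicate over the Prometheus instant-query result shape; the sample series and helper name are illustrative:

```python
# Labels the contract expects on received game/dev series.
EXPECTED_LABELS = ("vmid", "instance", "container_type")

def accepted(series: list[dict]) -> bool:
    """True when at least one received series (e.g. node_uname_info,
    queried via the Prometheus HTTP API) carries every expected label."""
    return any(
        all(lbl in s.get("metric", {}) for lbl in EXPECTED_LABELS)
        for s in series
    )
```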
API discovery requirements
API-owned monitoring discovery endpoints must:
- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state
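The fail-closed auth requirement can be sketched as a guard like the following; the DEV_MODE flag, header handling, and token source are assumptions about the API framework, and the token is a placeholder (in practice it would be loaded from a tightly-permissioned file or systemd credential):

```python
import hmac

MONITORING_TOKEN = "example-token"  # placeholder; never hardcode in practice
DEV_MODE = False  # explicit development/test mode only

def authorize(headers: dict) -> bool:
    """Fail closed: a missing or wrong bearer token is rejected unless
    the service is explicitly in development/test mode."""
    if DEV_MODE:
        return True
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    presented = auth[len("Bearer "):]
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(presented, MONITORING_TOKEN)
```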
Required labels
For game/dev targets/time series, standardize on the following required labels:
- vmid
- instance
- container_type
container_type should be the canonical spelling. Avoid using both ctype and container_type long-term.
Optional labels
Optional labels may include:
- server_id
- game
- variant
- runtime
- hostname
- name
- plan
- role
Any customer/user-identifying label should be reviewed before dashboard or alert exposure.
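The label contract, including normalization of the legacy ctype spelling, can be sketched as a validator; the function name is illustrative:

```python
# Labels required on every game/dev target/time series.
REQUIRED = {"vmid", "instance", "container_type"}

def validate_labels(labels: dict) -> list[str]:
    """Return the sorted list of missing required labels, after renaming
    the legacy ctype spelling to the canonical container_type."""
    labels = dict(labels)  # do not mutate the caller's dict
    if "ctype" in labels and "container_type" not in labels:
        labels["container_type"] = labels.pop("ctype")
    return sorted(REQUIRED - labels.keys())
```

Running this check in the sync script (or as a CI gate on dashboards) keeps the two spellings from drifting apart long-term.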
Lifecycle/debug state
Launch-debug monitoring should expose or derive these states:
- ready
- connectable
- operationInProgress
- operationType
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe
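The fields above can be collapsed into a single launch-debug state for dashboards; the field names follow this section, but the precedence order is an assumption:

```python
def lifecycle_state(status: dict) -> str:
    """Derive one display state from the lifecycle fields, checking the
    most specific condition first."""
    if status.get("operationInProgress"):
        return f'busy:{status.get("operationType", "unknown")}'
    if not status.get("ready"):
        return "not-ready"
    if not status.get("connectable"):
        return "ready-unreachable"
    return "ready"
```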
Agent responsibilities
Agent/container-side monitoring responsibilities:
- keep Alloy installed/prepared in game/dev templates
- remove legacy node-exporter from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at 10.60.0.25:9090
- configure container Alloy HTTP to listen on 0.0.0.0:12345 for the intended internal monitoring networks
- write or refresh Alloy config after POST /config
- keep labels minimal and consistent with dashboards/discovery
- eventually ship selected container logs through Alloy to centralized Loki
- do not run Loki itself inside each game/dev container
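The config-refresh responsibility can be sketched as the agent rendering an Alloy config after POST /config. The embedded snippet uses real Alloy components (prometheus.exporter.unix, prometheus.scrape, prometheus.remote_write); the label values, placeholders, and render helper are assumptions, and the 0.0.0.0:12345 listener would come from Alloy's --server.http.listen-addr flag:

```python
# River config template; __VMID__ and __CONTAINER_TYPE__ are
# illustrative placeholders filled in per container.
ALLOY_CONFIG_TEMPLATE = '''\
prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
  external_labels = {
    vmid           = "__VMID__",
    container_type = "__CONTAINER_TYPE__",
  }
}
'''

def render_alloy_config(vmid: str, container_type: str) -> str:
    """Fill per-container labels; the agent would write the result to the
    Alloy config path and reload/restart Alloy afterwards."""
    return (ALLOY_CONFIG_TEMPLATE
            .replace("__VMID__", vmid)
            .replace("__CONTAINER_TYPE__", container_type))
```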
Monitoring host responsibilities
The monitoring host must:
- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on game-dev-alloy scrape target down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
- host or reach the centralized Loki service once log collection is enabled
Grafana responsibilities
Grafana must have:
- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container metrics
- direct scrape health panels or queries for game-dev-alloy target health
- Loki datasource installed once centralized logs are enabled
Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
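That visibility requirement can be checked against Grafana's real /api/search endpoint; the base URL, API key, and required titles here are assumptions:

```python
import json
import urllib.request

def visible_dashboards(base_url: str, api_key: str) -> set[str]:
    """Titles of dashboards actually provisioned into Grafana,
    via the /api/search endpoint filtered to dashboards."""
    req = urllib.request.Request(
        f"{base_url}/api/search?type=dash-db",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return {d["title"] for d in json.load(resp)}

def missing_dashboards(visible: set[str], required: set[str]) -> set[str]:
    """Dashboards the contract requires but Grafana does not show."""
    return required - visible
```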
Logs contract
Loki should be centralized, not installed per game/dev container.
Preferred future direction:
container/app logs
-> container Alloy log collection
-> centralized Loki
-> Grafana Explore / dashboard links
Container Alloy should tail only selected, useful logs and attach stable labels such as:
- vmid
- instance
- container_type
- server_id where safe/useful
- log_source, for example agent, minecraft, codeserver, provisioning
Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.
Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.
When logs are added, prioritize:
- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync
Initial container log candidates:
- Agent service logs
- Minecraft/server stdout logs where useful
- code-server logs for dev containers
- provisioning/config application logs
- backup/restore operation logs
Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.
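A pre-ship screen for obviously sensitive lines can be sketched as a simple pattern filter; the patterns are illustrative only and do not replace the review step above:

```python
import re

# Illustrative screening patterns; tune per log_source before enabling.
SENSITIVE = re.compile(r"(?i)(token|secret|password|authorization:)")

def shippable(line: str) -> bool:
    """Drop lines that look like they carry credentials before they
    leave the container for Loki."""
    return SENSITIVE.search(line) is None
```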
Security contract
Required before launch:
- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
- Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths