jester/zlh-grind

jester c3aca715ee Add monitoring contract

2026-05-01 18:27:04 +00:00

4.5 KiB

Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

platform service health
API availability and discovery behavior
game/dev container presence and deletion
container host metrics
game/dev lifecycle state
provisioning, backup, restore, delete, and code-server/debug workflows
stale target detection

Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

OPNsense monitoring remains plugin/exporter-specific.
PBS monitoring remains platform-specific.
Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

Dynamic discovery path

Preferred launch model:

API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd
  -> Grafana dashboards / alerts

Expected behavior:

active game/dev containers appear automatically
deleted game/dev containers disappear automatically
stale file_sd entries are removed once sync is healthy
discovery failures alert loudly and do not silently preserve misleading stale state
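
The sync step in the path above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `fetch_targets`, the `address` field, and the record shape are assumptions. It rewrites the whole file_sd file on every successful run, so deleted containers drop out, and it lets any failure propagate so the run exits non-zero instead of quietly keeping old state.

```python
import json
import os
import tempfile


def build_file_sd(targets):
    """Convert discovery records into Prometheus file_sd target groups."""
    groups = []
    for t in targets:
        # Prometheus label values must be strings.
        labels = {k: str(v) for k, v in t.get("labels", {}).items()}
        groups.append({"targets": [t["address"]], "labels": labels})
    return groups


def write_file_sd(path, groups):
    """Atomically replace the file so Prometheus never reads a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(groups, fh, indent=2, sort_keys=True)
        # mkstemp creates the file as 0600; Prometheus may run as another user.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def sync(fetch_targets, path):
    """fetch_targets is a hypothetical callable returning the API's active targets.

    Exceptions are not caught here: a failed fetch leaves the old file in
    place but surfaces as a failed run that alerting can pick up.
    """
    targets = fetch_targets()
    write_file_sd(path, build_file_sd(targets))
    return len(targets)
```

Because a failed run leaves the previous file untouched, this sketch only meets the "no silent stale state" expectation when paired with alerts on sync failure and on file_sd age.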

API discovery requirements

API-owned monitoring discovery endpoints must:

require bearer/internal monitoring auth outside explicit development/test mode
fail closed when no valid token is presented
return only active/current targets unless historical/debug mode is explicitly requested
provide stable labels for Prometheus/Grafana
avoid leaking secrets in labels, URLs, logs, or target metadata
expose enough metadata to debug game/dev lifecycle state
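
A fail-closed bearer check along the lines of the requirements above might look like this. It is a sketch under assumptions: the function name, the `dev_mode` flag, and how the expected token reaches the process are all hypothetical. The points it illustrates are that a missing or empty server-side token denies access rather than allowing it, and that the comparison is constant-time.

```python
import hmac


def is_authorized(auth_header, expected_token, dev_mode=False):
    """Return True only when the request carries the expected bearer token."""
    if dev_mode:
        # Explicit development/test mode is the only bypass.
        return True
    if not expected_token:
        # Server has no token configured: deny everything (fail closed).
        return False
    if not auth_header:
        return False
    scheme, _, presented = auth_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```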

Required labels

For game/dev targets, standardize on the following required labels:

vmid
instance
container_type

container_type is the canonical spelling. Do not keep both ctype and container_type long-term; migrate remaining ctype usages to container_type.
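
One way to enforce the required labels during sync or in the discovery endpoint is a small normalizer like the one below. The function name and the choice to raise on missing labels are assumptions for illustration; the label names come from the list above.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels):
    """Map legacy ctype to container_type and reject incomplete label sets."""
    out = dict(labels)
    if "ctype" in out:
        legacy = out.pop("ctype")
        # Keep an existing container_type; otherwise adopt the legacy value.
        out.setdefault("container_type", legacy)
    missing = [name for name in REQUIRED_LABELS if not out.get(name)]
    if missing:
        raise ValueError("missing required labels: " + ", ".join(missing))
    return out
```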

Optional labels

Optional labels may include:

server_id
game
variant
runtime
hostname
name
plan
role

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

ready
connectable
operationInProgress
operationType
backup status
restore status
code-server status for dev containers
provisioning-complete state
recent failure reason where safe
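
As an illustration of exposing a few of these states, the sketch below renders them as Prometheus text-format gauges. The metric names (`zlh_container_*`), the dataclass, and the `"none"` placeholder are hypothetical, and label values are assumed to need no escaping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LifecycleState:
    vmid: str
    container_type: str
    ready: bool
    connectable: bool
    operation_in_progress: bool
    operation_type: Optional[str] = None


def to_exposition(state):
    """Render a LifecycleState as Prometheus text-format gauge lines."""
    base = f'vmid="{state.vmid}",container_type="{state.container_type}"'
    # Always emit operation_type so the series' label set stays stable.
    op = state.operation_type or "none"
    lines = [
        f"zlh_container_ready{{{base}}} {int(state.ready)}",
        f"zlh_container_connectable{{{base}}} {int(state.connectable)}",
        f'zlh_container_operation_in_progress{{{base},operation_type="{op}"}} '
        f"{int(state.operation_in_progress)}",
    ]
    return "\n".join(lines) + "\n"
```

Free-text fields such as a failure reason are usually a better fit for logs than for labels, since they can be high-cardinality and may carry sensitive detail.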

Agent responsibilities

Agent/container-side monitoring responsibilities:

keep Alloy installed/prepared in game/dev templates
remove legacy node-exporter from game/dev templates once Alloy is standard
keep /etc/default/alloy pointing Alloy at /etc/alloy/config.alloy, with Alloy listening on port 12345
write or refresh Alloy config after POST /config
keep labels minimal and consistent with dashboards/discovery
eventually emit structured logs suitable for Loki/Alloy log ingestion
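
A template check for the /etc/default/alloy expectation could be sketched as below. It assumes the file follows the packaged layout with `CONFIG_FILE` and `CUSTOM_ARGS` variables and that a non-default port would be set through a `--server.http.listen-addr` argument; verify both against the Alloy version in the templates before relying on it.

```python
import shlex

EXPECTED_CONFIG = "/etc/alloy/config.alloy"
EXPECTED_PORT = "12345"


def parse_env_file(text):
    """Parse simple KEY="value" lines, ignoring comments and blank lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex strips the surrounding quotes from the value.
        values[key.strip()] = " ".join(shlex.split(value))
    return values


def check_alloy_defaults(text):
    """Return a list of problems; an empty list means the file matches."""
    env = parse_env_file(text)
    problems = []
    if env.get("CONFIG_FILE") != EXPECTED_CONFIG:
        problems.append("CONFIG_FILE is not " + EXPECTED_CONFIG)
    for arg in shlex.split(env.get("CUSTOM_ARGS", "")):
        if arg.startswith("--server.http.listen-addr="):
            port = arg.rsplit(":", 1)[-1]
            if port != EXPECTED_PORT:
                problems.append("listen port is " + port + ", expected " + EXPECTED_PORT)
    return problems
```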

Monitoring host responsibilities

The monitoring host must:

restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
run discovery sync reliably
provision dashboards into Grafana, not merely store dashboard JSON on disk
alert on failed discovery sync
alert on stale file_sd age
alert on target-down by VMID/type
alert on API discovery endpoint failure
store bearer tokens with tight permissions or systemd credentials
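
The stale file_sd alert needs some measure of file age. A minimal sketch of that measurement, with hypothetical function names and a caller-chosen threshold, is shown below; a missing file is deliberately treated as stale.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the file_sd file was last modified."""
    now = time.time() if now is None else now
    return max(0.0, now - os.stat(path).st_mtime)


def is_stale(path, max_age_seconds, now=None):
    """True when the file is older than the threshold or does not exist."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # No file at all means discovery output is absent, which is worse than old.
        return True
```

The threshold should be comfortably larger than the sync interval, for example several missed runs, so a single slow sync does not page anyone.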

Grafana responsibilities

Grafana must have:

Prometheus datasource configured
core service dashboards installed
game container dashboard installed
dev container dashboard installed
dashboard queries aligned with the canonical label contract

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

Logs contract

Launch-ready debugging requires centralized logs for at least:

API application logs
Agent logs
provisioning flow
backup/restore flow
delete/teardown flow
Velocity registration callbacks
DNS/edge publish and cleanup actions
monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.
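
Whichever ingestion path is chosen, one JSON object per line is an easy format for it to parse. The formatter below is a sketch using Python's standard logging module; the field names (`ts`, `level`, `logger`, `msg`) and the `fields` extra key are assumptions, not an agreed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured context is passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name):
    """Build a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `make_logger("agent").info("backup finished", extra={"fields": {"flow": "backup", "vmid": "5001"}})` would then carry the flow and VMID as top-level JSON keys. Tokens and other secrets must never be placed in `fields`.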

Security contract

Required before launch:

Prometheus not publicly reachable
Grafana not publicly reachable except through intended authenticated/admin path
node_exporter not publicly reachable
API monitoring endpoints fail closed without token
no monitoring bearer token in world-readable files
no token leakage in labels, target URLs, dashboards, or logs
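
The token-file item can be checked mechanically on POSIX hosts. The sketch below uses a hypothetical helper name and flags any group or world access, which is stricter than the "world-readable" wording above and is an assumption about what "tight permissions" should mean.

```python
import os
import stat


def token_file_problems(path):
    """Return permission problems for a token file; empty list means OK."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    if mode & (stat.S_IRGRP | stat.S_IWGRP):
        problems.append("group-accessible")
    return problems
```

A file with mode 0600 passes; if a service group genuinely needs read access, the group check can be relaxed deliberately rather than by accident.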
