zlh-grind/Codex/Monitoring/CONTRACT.md
2026-05-01 18:27:04 +00:00

Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

  • platform service health
  • API availability and discovery behavior
  • game/dev container presence and deletion
  • container host metrics
  • game/dev lifecycle state
  • provisioning, backup, restore, delete, and code-server/debug workflows
  • stale target detection

Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

  • OPNsense monitoring remains plugin/exporter-specific.
  • PBS monitoring remains platform-specific.
  • Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

Dynamic discovery path

Preferred launch model:

API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd
  -> Grafana dashboards / alerts

Expected behavior:

  • active game/dev containers appear automatically
  • deleted game/dev containers disappear automatically
  • stale file_sd entries are removed once sync is healthy
  • discovery failures alert loudly and do not silently preserve misleading stale state
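
The discovery path above could be sketched as a single sync pass. This is a minimal illustration, not the actual monitor sync script: the record fields `address`, `vmid`, and `container_type`, the `fetch_targets` callable, and the function names are all assumptions. The sketch replaces the file_sd file atomically so that deleted containers disappear on the next successful pass, and lets a discovery failure propagate so the caller can exit non-zero and trigger an alert rather than hiding the failure.

```python
import json
import os
import tempfile


def build_file_sd(targets):
    """Convert discovery records (hypothetical shape) into file_sd target groups."""
    groups = []
    for t in targets:
        groups.append({
            "targets": [t["address"]],
            "labels": {
                "vmid": str(t["vmid"]),
                "container_type": t["container_type"],
            },
        })
    return groups


def write_file_sd(path, groups):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(groups, fh, indent=2)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; Prometheus must be able to read it
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def sync(fetch_targets, path):
    """Run one sync pass; any exception from fetch_targets propagates to the caller."""
    targets = fetch_targets()
    write_file_sd(path, build_file_sd(targets))
    return len(targets)
```

On failure this sketch leaves the previous file untouched, so it relies on the failed-sync and stale-age alerts to make the problem visible. Clearing the file on repeated failure is an alternative policy that the contract does not settle.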

API discovery requirements

API-owned monitoring discovery endpoints must:

  • require bearer/internal monitoring auth outside explicit development/test mode
  • fail closed when no valid token is presented
  • return only active/current targets unless historical/debug mode is explicitly requested
  • provide stable labels for Prometheus/Grafana
  • avoid leaking secrets in labels, URLs, logs, or target metadata
  • expose enough metadata to debug game/dev lifecycle state
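
The fail-closed requirement can be illustrated with a small check. This is a sketch under assumptions: the function name, the `dev_mode` flag, and the way the expected token reaches the function are hypothetical and not taken from the API code. It rejects requests when no token is configured, when the header is missing or malformed, and when the token does not match.

```python
import hmac


def is_authorized(auth_header, expected_token, dev_mode=False):
    """Return True only for a matching bearer token, or in explicit dev/test mode."""
    if dev_mode:
        return True  # explicit development/test mode is the only bypass
    if not expected_token:
        return False  # fail closed: no token configured on the server side
    if not auth_header or not auth_header.startswith("Bearer "):
        return False  # fail closed: missing or malformed Authorization header
    presented = auth_header[len("Bearer "):]
    # Constant-time comparison avoids leaking the token through timing differences.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```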

Required labels

For game/dev targets, standardize on the following required labels:

  • vmid
  • instance
  • container_type

container_type should be the canonical spelling. Avoid using both ctype and container_type long-term.
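
One way to enforce the required labels during a migration away from `ctype` is a small normalization step in the discovery or sync code. The function name and the decision to raise on missing labels are assumptions made for illustration.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels):
    """Map legacy ctype to container_type and verify the required labels exist."""
    out = dict(labels)
    if "ctype" in out:
        legacy = out.pop("ctype")
        # Prefer an explicit container_type if both spellings are present.
        out.setdefault("container_type", legacy)
    missing = [key for key in REQUIRED_LABELS if not out.get(key)]
    if missing:
        raise ValueError("missing required labels: " + ", ".join(missing))
    return out
```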

Optional labels

Optional labels may include:

  • server_id
  • game
  • variant
  • runtime
  • hostname
  • name
  • plan
  • role

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

  • ready
  • connectable
  • operationInProgress
  • operationType
  • backup status
  • restore status
  • code-server status for dev containers
  • provisioning-complete state
  • recent failure reason where safe
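
As an illustration of how some of these states might be exposed, the sketch below renders a subset of them in the Prometheus text exposition format. The metric names (`zlh_container_ready` and so on), the dataclass, and its field names are hypothetical, and label values are assumed to contain no characters that need escaping.

```python
from dataclasses import dataclass


@dataclass
class LifecycleState:
    vmid: str
    container_type: str
    ready: bool
    connectable: bool
    operation_in_progress: bool
    operation_type: str = ""


def render_metrics(state):
    """Render one container's lifecycle state as Prometheus gauge samples."""
    base = f'vmid="{state.vmid}",container_type="{state.container_type}"'
    lines = [
        f"zlh_container_ready{{{base}}} {int(state.ready)}",
        f"zlh_container_connectable{{{base}}} {int(state.connectable)}",
        f'zlh_container_operation_in_progress{{{base},operation_type="{state.operation_type}"}} '
        f"{int(state.operation_in_progress)}",
    ]
    return "\n".join(lines) + "\n"
```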

Agent responsibilities

Agent/container-side monitoring responsibilities:

  • keep Alloy installed/prepared in game/dev templates
  • remove legacy node-exporter from game/dev templates once Alloy is standard
  • keep /etc/default/alloy pinned so Alloy loads /etc/alloy/config.alloy and listens on port 12345
  • write or refresh Alloy config after POST /config
  • keep labels minimal and consistent with dashboards/discovery
  • eventually emit structured logs suitable for Loki/Alloy log ingestion

Monitoring host responsibilities

The monitoring host must:

  • restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
  • run discovery sync reliably
  • provision dashboards into Grafana, not merely store dashboard JSON on disk
  • alert on failed discovery sync
  • alert on stale file_sd age
  • alert on target-down, grouped by vmid and container_type
  • alert on API discovery endpoint failure
  • store bearer tokens in files with tight permissions or as systemd credentials
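
The stale file_sd age alert needs a measurable signal. A minimal sketch, assuming file modification time is an acceptable proxy for the last successful sync; the function names and the threshold parameter are illustrative:

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the file_sd file was last modified."""
    now = time.time() if now is None else now
    return max(0.0, now - os.stat(path).st_mtime)


def is_stale(path, max_age_seconds, now=None):
    """Treat a missing file as stale so the condition cannot pass silently."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        return True
```

A check like this could feed a gauge or an exit code that the alerting path watches. The acceptable age depends on the sync interval, which this contract does not specify.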

Grafana responsibilities

Grafana must have:

  • Prometheus datasource configured
  • core service dashboards installed
  • game container dashboard installed
  • dev container dashboard installed
  • dashboard queries aligned with the canonical label contract

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

Logs contract

Launch-ready debugging requires centralized logs for at least:

  • API application logs
  • Agent logs
  • provisioning flow
  • backup/restore flow
  • delete/teardown flow
  • Velocity registration callbacks
  • DNS/edge publish and cleanup actions
  • monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.
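
Structured logs are easier to ingest and query once a log pipeline exists. The sketch below shows one possible JSON log format using Python's standard `logging` module. The field names `flow`, `vmid`, and `operation` are assumptions chosen to match the flows listed above, not an agreed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Optional context passed via logger.info(..., extra={...}).
        for key in ("flow", "vmid", "operation"):
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload, sort_keys=True)
```

A caller would attach the formatter to a handler and pass context with `extra`, for example `logger.info("backup started", extra={"flow": "backup", "vmid": "5001"})`. Secrets and tokens must never be placed in these fields.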

Security contract

Required before launch:

  • Prometheus not publicly reachable
  • Grafana not publicly reachable except through intended authenticated/admin path
  • node_exporter not publicly reachable
  • API monitoring endpoints fail closed when no valid token is presented
  • no monitoring bearer token in world-readable files
  • no token leakage in labels, target URLs, dashboards, or logs
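
The token file requirement can be verified mechanically. A minimal sketch for POSIX systems, with a hypothetical function name; it reports permission bits that would let users other than the owner read or modify the token file:

```python
import os
import stat


def token_file_problems(path):
    """Return a list of permission problems for a bearer token file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    if mode & stat.S_IRGRP:
        problems.append("group-readable")
    return problems
```

An empty list means the file is readable only by its owner, for example mode 0600. Whether group read access is acceptable depends on how the sync service account is set up, so treating it as a problem here is a conservative assumption.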