zlh-grind/Codex/Monitoring/CONTRACT.md

Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

  • platform service health
  • API availability and discovery behavior
  • game/dev container presence and deletion
  • container host metrics
  • game/dev lifecycle state
  • provisioning, backup, restore, delete, and code-server/debug workflows
  • stale target detection

Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on 10.60.0.25:9090.

That means there are two related but separate paths:

API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation

container Alloy
  -> remote_write
  -> Prometheus time series
  -> Grafana dashboards

Prometheus scraping container-ip:12345 is not canonical while container Alloy listens on 127.0.0.1:12345, because a loopback-bound listener is not reachable from the Prometheus host. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to 0.0.0.0:12345 or another reachable address, and the security model must be reviewed first.

Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not game-dev-alloy scrape-up status.
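
As an illustration of freshness-based health, the sketch below flags containers whose most recent remote-write sample is older than a threshold. The function name, the `last_seen` mapping (vmid to Unix timestamp of the newest sample) and the 120-second default are assumptions made for the example, not values defined by this contract.

```python
import time


def stale_vmids(last_seen, now=None, max_age_seconds=120.0):
    """Return the vmids whose newest sample is older than max_age_seconds.

    last_seen maps vmid -> Unix timestamp of the most recent sample received
    via remote-write. The 120 s default is an illustrative assumption.
    """
    now = time.time() if now is None else now
    return sorted(
        vmid for vmid, ts in last_seen.items() if now - ts > max_age_seconds
    )
```

A dashboard or alert built this way reports a container as unhealthy when its series stop arriving, regardless of whether Prometheus can reach the container's Alloy listener.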

Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

  • OPNsense monitoring remains plugin/exporter-specific.
  • PBS monitoring remains platform-specific.
  • Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

Dynamic discovery path

Preferred launch model:

API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
  -> add/remove validation and optional direct scrape targets

Expected behavior:

  • active game/dev containers appear automatically in discovery/file_sd
  • deleted game/dev containers disappear automatically from discovery/file_sd
  • stale file_sd entries are removed once sync is healthy
  • discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
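
The sync step can be sketched roughly as follows, assuming the discovery endpoint returns a list of records that each carry an `address` plus the required labels. The record shape, the function names and the port default are hypothetical. The inventory is rebuilt from the current API response on every run, so deleted containers drop out, and a malformed record raises instead of leaving old state in place.

```python
import json
import os
import tempfile

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def build_file_sd(targets, port=12345):
    """Convert discovery records into Prometheus file_sd target groups.

    Each record is assumed (hypothetically) to look like:
    {"address": "...", "vmid": ..., "instance": "...", "container_type": "..."}
    """
    groups = []
    for record in targets:
        missing = [key for key in REQUIRED_LABELS if not record.get(key)]
        if missing:
            # Fail loudly rather than emit a half-labelled target.
            raise ValueError(f"target missing required labels: {missing}")
        groups.append({
            "targets": [f"{record['address']}:{port}"],
            # file_sd label values must be strings.
            "labels": {key: str(record[key]) for key in REQUIRED_LABELS},
        })
    return groups


def write_file_sd(path, groups):
    """Atomically replace the inventory so Prometheus never reads a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # mkstemp creates the file as 0600; the inventory holds no secrets and
    # Prometheus may run as a different user, so make it readable.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

Because the file is fully regenerated each time, a failed API call should abort the run and raise an alert before `write_file_sd` is reached, so a failure is visible rather than masked by an inventory that still looks current.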

Metrics path

Canonical launch path for game/dev container metrics:

container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards

Acceptance evidence for a game/dev container should include real time series such as:

  • node_uname_info
  • CPU metrics
  • memory metrics
  • expected labels such as vmid, instance, and container_type
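
One way to turn this acceptance evidence into a repeatable check is sketched below. It assumes each series is available as a flat dict of labels including `__name__`. The contract does not name the CPU and memory metrics, so the `node_cpu_` and `node_memory_` prefixes are assumptions based on the usual node_exporter-style naming.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")
# Assumed prefixes; only node_uname_info is named explicitly by the contract.
REQUIRED_PREFIXES = ("node_uname_info", "node_cpu_", "node_memory_")


def acceptance_gaps(series):
    """Return a list of human-readable gaps; an empty list means accepted."""
    gaps = []
    names = [s.get("__name__", "") for s in series]
    for prefix in REQUIRED_PREFIXES:
        if not any(name.startswith(prefix) for name in names):
            gaps.append(f"no series matching {prefix}*")
    for s in series:
        for label in REQUIRED_LABELS:
            if not s.get(label):
                gaps.append(f"{s.get('__name__', '?')} missing label {label}")
    return gaps
```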

API discovery requirements

API-owned monitoring discovery endpoints must:

  • require bearer/internal monitoring auth outside explicit development/test mode
  • fail closed when no valid token is presented
  • return only active/current targets unless historical/debug mode is explicitly requested
  • provide stable labels for Prometheus/Grafana
  • avoid leaking secrets in labels, URLs, logs, or target metadata
  • expose enough metadata to debug game/dev lifecycle state
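
The fail-closed requirement can be illustrated with a minimal sketch. The function name and the `dev_mode` flag are hypothetical; how the API actually reads its configuration is not specified here.

```python
import hmac


def is_authorized(authorization_header, expected_token, dev_mode=False):
    """Return True only when a matching bearer token is presented.

    dev_mode stands in for the explicit development/test mode; everything
    else is denied by default.
    """
    if dev_mode:
        return True
    if not expected_token:
        # No token configured on the server side: deny rather than allow.
        return False
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

The important property is that every unexpected input (missing header, wrong scheme, unset server token) lands on a `False` branch.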

Required labels

For game/dev targets/time series, standardize on the following required labels:

  • vmid
  • instance
  • container_type

container_type should be the canonical spelling. Avoid using both ctype and container_type long-term.
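
During a transition, a sync script or relabel helper could fold the legacy spelling into the canonical one. The following is a sketch with a hypothetical function name; it raises when both spellings are present with different values instead of guessing which one is right.

```python
def normalize_labels(labels):
    """Return a copy of labels with ctype folded into container_type."""
    out = dict(labels)
    legacy = out.pop("ctype", None)
    if legacy is not None:
        current = out.setdefault("container_type", legacy)
        if current != legacy:
            raise ValueError("ctype and container_type disagree")
    return out
```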

Optional labels

Optional labels may include:

  • server_id
  • game
  • variant
  • runtime
  • hostname
  • name
  • plan
  • role

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

  • ready
  • connectable
  • operationInProgress
  • operationType
  • backup status
  • restore status
  • code-server status for dev containers
  • provisioning-complete state
  • recent failure reason where safe
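
If the boolean states are exposed as metrics, one possible encoding is a single gauge with a `state` label, sketched below in Prometheus text exposition format. The metric name `zlh_lifecycle_state` and the function are hypothetical; the contract only lists the states and does not prescribe how they are exported.

```python
BOOLEAN_STATES = ("ready", "connectable", "operationInProgress")


def lifecycle_samples(vmid, state):
    """Render boolean lifecycle states as exposition-format sample lines.

    A state that is absent from the record is reported as 0 here, which is
    an assumption of this sketch.
    """
    lines = []
    for key in BOOLEAN_STATES:
        value = 1 if state.get(key) else 0
        lines.append(
            f'zlh_lifecycle_state{{vmid="{vmid}",state="{key}"}} {value}'
        )
    return lines
```

Free-text fields such as a failure reason are usually a better fit for logs than for label values, since they are unbounded and may contain sensitive detail.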

Agent responsibilities

Agent/container-side monitoring responsibilities:

  • keep Alloy installed/prepared in game/dev templates
  • remove legacy node_exporter from game/dev templates once Alloy is standard
  • configure container Alloy remote-write to Prometheus at 10.60.0.25:9090
  • keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
  • write or refresh Alloy config after POST /config
  • keep labels minimal and consistent with dashboards/discovery
  • eventually emit structured logs suitable for Loki/Alloy log ingestion
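
As a rough sketch of the config-writing step, the helper below renders a minimal Alloy configuration that scrapes the built-in unix exporter and remote-writes to Prometheus. Several things are assumptions: the helper name, the component labels `node` and `central`, the `/api/v1/write` path (which requires the Prometheus remote-write receiver to be enabled), and attaching only `vmid` and `container_type` as external labels. How `instance` should be set is left out because the contract does not specify it.

```python
def render_alloy_config(vmid, container_type,
                        remote_write_url="http://10.60.0.25:9090/api/v1/write"):
    """Render a minimal Alloy config as text.

    Label values are inserted verbatim, so this sketch assumes they contain
    no quotes or newlines.
    """
    labels = {"vmid": str(vmid), "container_type": container_type}
    label_lines = ",\n".join(f'    {k} = "{v}"' for k, v in labels.items())
    return "\n".join([
        'prometheus.exporter.unix "node" { }',
        '',
        'prometheus.scrape "node" {',
        '  targets    = prometheus.exporter.unix.node.targets',
        '  forward_to = [prometheus.remote_write.central.receiver]',
        '}',
        '',
        'prometheus.remote_write "central" {',
        '  endpoint {',
        f'    url = "{remote_write_url}"',
        '  }',
        '  external_labels = {',
        label_lines,
        '  }',
        '}',
        '',
    ])
```

The rendered config contains no HTTP listener setting; Alloy's listen address is controlled by a command-line flag, so the loopback binding is a matter of how the service is started rather than of this file.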

Monitoring host responsibilities

The monitoring host must:

  • restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
  • run discovery sync reliably
  • provision dashboards into Grafana, not merely store dashboard JSON on disk
  • alert on failed discovery sync
  • alert on stale file_sd age
  • alert on missing/stale remote-write metrics by vmid and container_type
  • alert on API discovery endpoint failure
  • store bearer tokens with tight permissions or systemd credentials
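
Two of the alert conditions above lend themselves to small checks, sketched here under stated assumptions. The function names are hypothetical, the file_sd age is taken from the file's modification time, and "missing" is defined as present in discovery but absent from the set of vmids currently reporting metrics.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the file_sd inventory was last written."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def missing_vmids(expected, reporting):
    """vmids that discovery lists but that have no remote-write series."""
    return sorted(set(expected) - set(reporting))
```

Comparing the discovery inventory with what Prometheus has actually received is what ties the two paths together: discovery says what should exist, remote-write shows what is reporting.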

Grafana responsibilities

Grafana must have:

  • Prometheus datasource installed
  • core service dashboards installed
  • game container dashboard installed
  • dev container dashboard installed
  • dashboard queries aligned with the canonical label contract
  • dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

Logs contract

Launch-ready debugging requires centralized logs for at least:

  • API application logs
  • Agent logs
  • provisioning flow
  • backup/restore flow
  • delete/teardown flow
  • Velocity registration callbacks
  • DNS/edge publish and cleanup actions
  • monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.

Security contract

Required before launch:

  • Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
  • Grafana not publicly reachable except through intended authenticated/admin path
  • node_exporter not publicly reachable
  • API monitoring endpoints fail closed without token
  • no monitoring bearer token in world-readable files
  • no token leakage in labels, target URLs, dashboards, or logs
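
The token-file requirement can be spot-checked with a short script such as the sketch below. The function names are hypothetical, and treating any group access as a problem is a stricter reading than the contract's "not world-readable".

```python
import os
import stat


def mode_problems(mode):
    """Describe permission bits that are too loose for a token file."""
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    if mode & (stat.S_IRGRP | stat.S_IWGRP):
        # Stricter than the contract requires; an assumption of this sketch.
        problems.append("group-accessible")
    return problems


def token_file_problems(path):
    """Check the permission bits of the token file at path."""
    return mode_problems(os.stat(path).st_mode)
```

For example, a file with mode 0600 yields no findings, while 0644 is reported as both world-readable and group-accessible.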