jester/zlh-grind

jester e355bbb52a Clarify remote-write as canonical monitoring path

2026-05-01 19:38:06 +00:00

6.6 KiB

Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

platform service health
API availability and discovery behavior
game/dev container presence and deletion
container host metrics
game/dev lifecycle state
provisioning, backup, restore, delete, and code-server/debug workflows
stale target detection

Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on 10.60.0.25:9090.

That means there are two related but separate paths:

API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation

container Alloy
  -> remote_write
  -> Prometheus time series
  -> Grafana dashboards

Prometheus scraping container-ip:12345 is not canonical, and cannot succeed, while container Alloy listens on 127.0.0.1:12345, because a loopback listener is unreachable from the Prometheus host. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to 0.0.0.0:12345 or another reachable address, and the security model must be reviewed first.

Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not game-dev-alloy scrape-up status.
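
As a rough illustration of freshness-based health, the sketch below classifies each expected container by the age of its newest remote-write sample. The function name, the 120-second threshold, and the input shape (a map of vmid to the latest sample timestamp) are assumptions made for this example, not part of the contract.

```python
from typing import Dict, Iterable


def container_health(expected_vmids: Iterable[str],
                     last_sample_ts: Dict[str, float],
                     now: float,
                     max_age_s: float = 120.0) -> Dict[str, str]:
    """Classify each expected vmid as 'fresh', 'stale', or 'missing'.

    expected_vmids comes from API discovery; last_sample_ts maps a vmid to
    the Unix timestamp of its newest received sample (hypothetical input).
    """
    health = {}
    for vmid in expected_vmids:
        ts = last_sample_ts.get(vmid)
        if ts is None:
            health[vmid] = "missing"   # discovered, but no series received
        elif now - ts > max_age_s:
            health[vmid] = "stale"     # series exists but stopped updating
        else:
            health[vmid] = "fresh"
    return health
```

Keeping "missing" separate from "stale" lets a dashboard tell a container that never reported apart from one that stopped reporting.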

Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

OPNsense monitoring remains plugin/exporter-specific.
PBS monitoring remains platform-specific.
Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

Dynamic discovery path

Preferred launch model:

API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
  -> add/remove validation and optional direct scrape targets

Expected behavior:

active game/dev containers appear automatically in discovery/file_sd
deleted game/dev containers disappear automatically from discovery/file_sd
stale file_sd entries are removed once sync is healthy
discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
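
A minimal sketch of the sync step, assuming the discovery endpoint returns a list of container records. The field names `ip` and `status`, the value "active", and both function names are hypothetical. Only the output shape (a JSON list of objects with `targets` and `labels`) follows the Prometheus file_sd format.

```python
import json
import os
import tempfile
from typing import Dict, List


def build_file_sd(containers: List[Dict[str, str]], port: int = 12345) -> List[Dict]:
    """Turn discovery records into file_sd entries, keeping only active ones."""
    entries = []
    for c in containers:
        if c.get("status") != "active":   # assumed field: deleted targets drop out
            continue
        entries.append({
            "targets": [f"{c['ip']}:{port}"],
            "labels": {"vmid": c["vmid"], "container_type": c["container_type"]},
        })
    return entries


def write_file_sd(path: str, entries: List[Dict]) -> None:
    """Write atomically so Prometheus never reads a half-written inventory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(entries, fh, indent=2)
        os.replace(tmp, path)             # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp)                    # temp file still exists on any failure
        raise
```

In this sketch a discovery failure should surface as an exception before `write_file_sd` is called, so the sync run exits non-zero. The old file is then left untouched and its modification time keeps aging, which a stale-age alert can catch.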

Metrics path

Canonical launch path for game/dev container metrics:

container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards

Acceptance evidence for a game/dev container should include real time series such as:

node_uname_info
CPU metrics
memory metrics
expected labels such as vmid, instance, and container_type
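
One way to collect this evidence is to run an instant query for node_uname_info against the Prometheus HTTP API and inspect the JSON response. The sketch below only checks an already-fetched response. The function name is hypothetical, and the response shape assumed is the standard instant-query vector result.

```python
from typing import Dict

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def has_acceptance_series(query_response: Dict, vmid: str) -> bool:
    """True if the response holds a series for vmid carrying all required labels."""
    if query_response.get("status") != "success":
        return False
    for series in query_response.get("data", {}).get("result", []):
        metric = series.get("metric", {})
        if metric.get("vmid") != vmid:
            continue
        if all(metric.get(label) for label in REQUIRED_LABELS):
            return True
    return False
```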

API discovery requirements

API-owned monitoring discovery endpoints must:

require bearer/internal monitoring auth outside explicit development/test mode
fail closed without token
return only active/current targets unless historical/debug mode is explicitly requested
provide stable labels for Prometheus/Grafana
avoid leaking secrets in labels, URLs, logs, or target metadata
expose enough metadata to debug game/dev lifecycle state
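
The fail-closed requirement can be sketched as a single authorization check. The function name and the `dev_mode` flag are assumptions for illustration. The point is that a missing server-side token, a missing header, or a malformed header is always a denial outside explicit development/test mode.

```python
import hmac
from typing import Optional


def authorize(auth_header: Optional[str],
              expected_token: Optional[str],
              dev_mode: bool = False) -> bool:
    """Return True only when the request may read monitoring discovery data."""
    if dev_mode:
        return True                       # explicit development/test mode only
    if not expected_token:
        return False                      # fail closed: no token configured
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return False                      # fail closed: no or malformed header
    presented = auth_header[len(prefix):]
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```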

Required labels

For game/dev targets/time series, standardize on the following required labels:

vmid
instance
container_type

container_type should be the canonical spelling. Avoid using both ctype and container_type long-term.
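
During a transition period, a small normalization step could fold the legacy ctype spelling into container_type and reject label sets that miss a required label. This is a sketch under that assumption. The function name and the choice to raise ValueError are illustrative only.

```python
from typing import Dict

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Return a copy using the canonical container_type spelling.

    Raises ValueError if any required label is absent or empty.
    """
    out = dict(labels)
    if "ctype" in out:
        legacy = out.pop("ctype")
        out.setdefault("container_type", legacy)  # canonical value wins if both exist
    missing = [name for name in REQUIRED_LABELS if not out.get(name)]
    if missing:
        raise ValueError(f"missing required labels: {', '.join(missing)}")
    return out
```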

Optional labels

Optional labels may include:

server_id
game
variant
runtime
hostname
name
plan
role

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

ready
connectable
operationInProgress
operationType
backup status
restore status
code-server status for dev containers
provisioning-complete state
recent failure reason where safe
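
To make "expose or derive" concrete, the sketch below derives one summary string from a few of the states listed above. The class, the snake_case field names, and the summary values are hypothetical. The real API payload may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LifecycleState:
    ready: bool
    connectable: bool
    operation_in_progress: bool = False    # mirrors operationInProgress
    operation_type: Optional[str] = None   # mirrors operationType


def summarize(state: LifecycleState) -> str:
    """Collapse lifecycle flags into one dashboard-friendly status string."""
    if state.operation_in_progress:
        return f"busy:{state.operation_type or 'unknown'}"
    if state.ready and state.connectable:
        return "ready"
    if state.ready:
        return "ready_not_connectable"
    return "not_ready"
```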

Agent responsibilities

Agent/container-side monitoring responsibilities:

keep Alloy installed/prepared in game/dev templates
remove legacy node_exporter from game/dev templates once Alloy is standard
configure container Alloy remote-write to Prometheus at 10.60.0.25:9090
keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
write or refresh Alloy config after POST /config
keep labels minimal and consistent with dashboards/discovery
eventually emit structured logs suitable for Loki/Alloy log ingestion
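
As one possible shape for the "write or refresh Alloy config" step, the sketch below renders a small Alloy configuration that collects host metrics and remote-writes them. The function name is hypothetical. The `/api/v1/write` path is an assumption, based on Prometheus running with its remote-write receiver enabled. Label values are quoted with JSON string escaping, which is assumed to be adequate for simple values. The loopback HTTP listener is set outside this file, through Alloy's `--server.http.listen-addr` flag.

```python
import json


def render_alloy_config(vmid: str, container_type: str,
                        remote_write_url: str = "http://10.60.0.25:9090/api/v1/write") -> str:
    """Render a minimal Alloy config: unix exporter -> scrape -> remote_write."""
    q = json.dumps  # double-quoted, escaped string literal
    return f"""prometheus.exporter.unix "default" {{ }}

prometheus.scrape "default" {{
  targets    = prometheus.exporter.unix.default.targets
  forward_to = [prometheus.remote_write.default.receiver]
}}

prometheus.remote_write "default" {{
  external_labels = {{
    vmid           = {q(vmid)},
    container_type = {q(container_type)},
  }}

  endpoint {{
    url = {q(remote_write_url)}
  }}
}}
"""
```

Only vmid and container_type are set here, which keeps the label set minimal. The instance label is left to whatever the exporter component assigns.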

Monitoring host responsibilities

The monitoring host must:

restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
run discovery sync reliably
provision dashboards into Grafana, not merely store dashboard JSON on disk
alert on failed discovery sync
alert on stale file_sd age
alert on missing/stale remote-write metrics by vmid/container_type
alert on API discovery endpoint failure
store bearer tokens with tight permissions or systemd credentials
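
Two of these checks are easy to sketch with the standard library: the age of the file_sd inventory and the permissions on the token file. The function names and any thresholds a caller applies are assumptions. The permission test treats any group or other access bit as too loose.

```python
import os
import stat
import time
from typing import Optional


def file_age_seconds(path: str, now: Optional[float] = None) -> float:
    """Seconds since the file was last modified (e.g. the file_sd inventory)."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def token_file_is_private(path: str) -> bool:
    """True if only the owner has any access to the token file (e.g. 0600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

A sync wrapper could export `file_age_seconds` as a gauge, or exit non-zero when it exceeds a chosen limit, so the stale-age alert has something to fire on.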

Grafana responsibilities

Grafana must have:

Prometheus datasource installed
core service dashboards installed
game container dashboard installed
dev container dashboard installed
dashboard queries aligned with the canonical label contract
dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
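
Visibility can be checked against Grafana itself rather than the filesystem, for example by comparing expected dashboard titles with the items returned by Grafana's dashboard search API. The sketch below works on an already-fetched result list. The function name and the example titles are hypothetical.

```python
from typing import Dict, Iterable, List


def missing_dashboards(expected_titles: Iterable[str],
                       search_results: List[Dict]) -> List[str]:
    """Return expected dashboard titles that Grafana did not report."""
    present = {
        item.get("title")
        for item in search_results
        if item.get("type") == "dash-db"   # skip folders in the search output
    }
    return sorted(set(expected_titles) - present)
```

An empty return value means every expected dashboard is actually loaded. Anything else names the dashboards that exist only as JSON on disk, or not at all.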

Logs contract

Launch-ready debugging requires centralized logs for at least:

API application logs
Agent logs
provisioning flow
backup/restore flow
delete/teardown flow
Velocity registration callbacks
DNS/edge publish and cleanup actions
monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.

Security contract

Required before launch:

Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
Grafana not publicly reachable except through intended authenticated/admin path
node_exporter not publicly reachable
API monitoring endpoints fail closed without token
no monitoring bearer token in world-readable files
no token leakage in labels, target URLs, dashboards, or logs
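
The last item can be partly automated by scanning generated target inventory for the token value before it is written. The sketch below assumes file_sd-style entries with `targets` and `labels`. The function name and the report strings are illustrative, and a real check would also need to cover dashboards and logs.

```python
from typing import Dict, List


def find_token_leaks(token: str, entries: List[Dict]) -> List[str]:
    """Describe every target or label value that contains the token."""
    leaks: List[str] = []
    if not token:
        return leaks                       # nothing meaningful to search for
    for index, entry in enumerate(entries):
        for target in entry.get("targets", []):
            if token in target:
                leaks.append(f"entry {index}: target")
        for name, value in entry.get("labels", {}).items():
            if token in str(value):
                leaks.append(f"entry {index}: label {name}")
    return leaks
```

The report names the location of each leak, not the leaked value, so the check itself does not copy the token into logs.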
