Monitoring Contract
This file defines the intended monitoring contract for ZeroLagHub launch readiness.
Scope
Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.
The monitoring system must support launch debugging for:
- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection
Not API-only
Monitoring is not an API-only design.
The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.
The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on 10.60.0.25:9090.
There are two related but separate paths:
API discovery
-> monitor sync
-> file_sd target inventory / add-remove validation
-> game-dev-alloy scrape health
container Alloy
-> remote_write
-> Prometheus time series
-> Grafana dashboards
Container Alloy intentionally listens on 0.0.0.0:12345 for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the game-dev-alloy job. This direct scrape proves collector reachability and target add/remove behavior.
Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.
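The two paths above can be sketched against the Prometheus HTTP instant-query API (a real endpoint, /api/v1/query). The PromQL expressions and the collector_reachable helper are illustrative assumptions, not an existing tool:

```python
import json
import urllib.parse
import urllib.request

PROM = "http://10.60.0.25:9090"

# Path 1: direct scrape health (target presence / collector reachability)
SCRAPE_HEALTH = 'up{job="game-dev-alloy"}'

# Path 2: canonical delivery, proven from received remote-write series
REMOTE_WRITE_AGE = "time() - timestamp(node_uname_info)"

def instant_query(expr: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    url = f"{PROM}/api/v1/query?query={urllib.parse.quote(expr)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

def collector_reachable(up_vector: list, vmid: str) -> bool:
    """True when the game-dev-alloy scrape for this vmid reports up == 1."""
    return any(
        s["metric"].get("vmid") == vmid and s["value"][1] == "1"
        for s in up_vector
    )
```

Scrape health answers "is the collector there?"; the freshness query answers "are metrics actually arriving?" — both are needed because either can pass while the other fails.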
Explicit exceptions
The game/dev Alloy contract does not replace every platform-specific monitoring path.
Exceptions:
- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.
Dynamic discovery path
Preferred launch model:
API discovery endpoint
-> monitor sync script
-> Prometheus file_sd inventory
-> add/remove validation and direct Alloy scrape targets
Expected behavior:
- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state
Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
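The sync step above can be sketched as a small script that converts API discovery output into a file_sd inventory and replaces it atomically, so deletions propagate and Prometheus never reads a half-written file. The API response shape and field names here are assumptions:

```python
import json
import os
import tempfile

def to_file_sd(targets: list[dict], alloy_port: int = 12345) -> list[dict]:
    """Convert discovered containers into Prometheus file_sd entries.

    Assumes the API returns dicts like
    {"ip": "10.60.0.50", "vmid": 101, "container_type": "game"}.
    """
    return [
        {
            "targets": [f'{t["ip"]}:{alloy_port}'],
            "labels": {
                "vmid": str(t["vmid"]),
                "container_type": t["container_type"],
            },
        }
        for t in targets
    ]

def write_file_sd(entries: list[dict], path: str) -> None:
    """Atomically replace the inventory: deleted containers disappear on
    the next successful sync, and a failed sync leaves the old file intact
    (which is why stale-age alerting is still required)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(entries, f, indent=2)
    os.replace(tmp, path)  # atomic rename on POSIX
```

Because a failed sync preserves the previous file, the loud-alerting requirement above is what prevents silently misleading stale state.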
Metrics path
Canonical launch path for game/dev container metrics:
container Alloy remote_write
-> Prometheus at 10.60.0.25:9090
-> Grafana dashboards
Acceptance evidence for a game/dev container should include real time series such as:
- node_uname_info
- CPU metrics
- memory metrics
- expected labels such as vmid, instance, and container_type
Direct Alloy scrape health should also be visible through the game-dev-alloy Prometheus job for current lifecycle targets.
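An acceptance check for the evidence above can be sketched as a predicate over the Prometheus instant-query result shape; the sample series and helper name are illustrative:

```python
# Labels the contract expects on received game/dev series.
EXPECTED_LABELS = ("vmid", "instance", "container_type")

def accepted(series: list[dict]) -> bool:
    """True when at least one received series (e.g. node_uname_info,
    queried via the Prometheus HTTP API) carries every expected label."""
    return any(
        all(lbl in s.get("metric", {}) for lbl in EXPECTED_LABELS)
        for s in series
    )
```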
API discovery requirements
API-owned monitoring discovery endpoints must:
- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state
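The fail-closed auth requirement can be sketched as a guard like the following; the DEV_MODE flag, header handling, and token source are assumptions about the API framework, and the token is a placeholder (in practice it would be loaded from a tightly-permissioned file or systemd credential):

```python
import hmac

MONITORING_TOKEN = "example-token"  # placeholder; never hardcode in practice
DEV_MODE = False  # explicit development/test mode only

def authorize(headers: dict) -> bool:
    """Fail closed: a missing or wrong bearer token is rejected unless
    the service is explicitly in development/test mode."""
    if DEV_MODE:
        return True
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    presented = auth[len("Bearer "):]
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(presented, MONITORING_TOKEN)
```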
Required labels
For game/dev targets/time series, standardize on the following required labels:
- vmid
- instance
- container_type
container_type should be the canonical spelling. Avoid using both ctype and container_type long-term.
Optional labels
Optional labels may include:
- server_id
- game
- variant
- runtime
- hostname
- name
- plan
- role
Any customer/user-identifying label should be reviewed before dashboard or alert exposure.
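The label contract, including normalization of the legacy ctype spelling, can be sketched as a validator; the function name is illustrative:

```python
# Labels required on every game/dev target/time series.
REQUIRED = {"vmid", "instance", "container_type"}

def validate_labels(labels: dict) -> list[str]:
    """Return the sorted list of missing required labels, after renaming
    the legacy ctype spelling to the canonical container_type."""
    labels = dict(labels)  # do not mutate the caller's dict
    if "ctype" in labels and "container_type" not in labels:
        labels["container_type"] = labels.pop("ctype")
    return sorted(REQUIRED - labels.keys())
```

Running this check in the sync script (or as a CI gate on dashboards) keeps the two spellings from drifting apart long-term.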
Lifecycle/debug state
Launch-debug monitoring should expose or derive these states:
- ready
- connectable
- operationInProgress
- operationType
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe
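The fields above can be collapsed into a single launch-debug state for dashboards; the field names follow this section, but the precedence order is an assumption:

```python
def lifecycle_state(status: dict) -> str:
    """Derive one display state from the lifecycle fields, checking the
    most specific condition first."""
    if status.get("operationInProgress"):
        return f'busy:{status.get("operationType", "unknown")}'
    if not status.get("ready"):
        return "not-ready"
    if not status.get("connectable"):
        return "ready-unreachable"
    return "ready"
```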
Agent responsibilities
Agent/container-side monitoring responsibilities:
- keep Alloy installed/prepared in game/dev templates
- remove legacy node-exporter from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at 10.60.0.25:9090
- configure container Alloy HTTP to listen on 0.0.0.0:12345 for the intended internal monitoring networks
- write or refresh Alloy config after POST /config
- keep labels minimal and consistent with dashboards/discovery
- eventually ship selected container logs through Alloy to centralized Loki
- do not run Loki itself inside each game/dev container
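The config-refresh responsibility can be sketched as the agent rendering an Alloy config after POST /config. The embedded snippet uses real Alloy components (prometheus.exporter.unix, prometheus.scrape, prometheus.remote_write); the label values, placeholders, and render helper are assumptions, and the 0.0.0.0:12345 listener would come from Alloy's --server.http.listen-addr flag:

```python
# River config template; __VMID__ and __CONTAINER_TYPE__ are
# illustrative placeholders filled in per container.
ALLOY_CONFIG_TEMPLATE = '''\
prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
  external_labels = {
    vmid           = "__VMID__",
    container_type = "__CONTAINER_TYPE__",
  }
}
'''

def render_alloy_config(vmid: str, container_type: str) -> str:
    """Fill per-container labels; the agent would write the result to the
    Alloy config path and reload/restart Alloy afterwards."""
    return (ALLOY_CONFIG_TEMPLATE
            .replace("__VMID__", vmid)
            .replace("__CONTAINER_TYPE__", container_type))
```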
Monitoring host responsibilities
The monitoring host must:
- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on game-dev-alloy scrape target down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
- host or reach the centralized Loki service once log collection is enabled
Grafana responsibilities
Grafana must have:
- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container metrics
- direct scrape health panels or queries for game-dev-alloy target health
- Loki datasource installed once centralized logs are enabled
Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
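That visibility requirement can be checked against Grafana's real /api/search endpoint; the base URL, API key, and required titles here are assumptions:

```python
import json
import urllib.request

def visible_dashboards(base_url: str, api_key: str) -> set[str]:
    """Titles of dashboards actually provisioned into Grafana,
    via the /api/search endpoint filtered to dashboards."""
    req = urllib.request.Request(
        f"{base_url}/api/search?type=dash-db",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return {d["title"] for d in json.load(resp)}

def missing_dashboards(visible: set[str], required: set[str]) -> set[str]:
    """Dashboards the contract requires but Grafana does not show."""
    return required - visible
```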
Logs contract
Loki should be centralized, not installed per game/dev container.
Preferred future direction:
container/app logs
-> container Alloy log collection
-> centralized Loki
-> Grafana Explore / dashboard links
Container Alloy should tail only selected, useful logs and attach stable labels such as:
- vmid
- instance
- container_type
- server_id where safe/useful
- log_source, for example agent, minecraft, codeserver, provisioning
Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.
Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.
When logs are added, prioritize:
- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync
Initial container log candidates:
- Agent service logs
- Minecraft/server stdout logs where useful
- code-server logs for dev containers
- provisioning/config application logs
- backup/restore operation logs
Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.
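A pre-ship screen for obviously sensitive lines can be sketched as a simple pattern filter; the patterns are illustrative only and do not replace the review step above:

```python
import re

# Illustrative screening patterns; tune per log_source before enabling.
SENSITIVE = re.compile(r"(?i)(token|secret|password|authorization:)")

def shippable(line: str) -> bool:
    """Drop lines that look like they carry credentials before they
    leave the container for Loki."""
    return SENSITIVE.search(line) is None
```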
Security contract
Required before launch:
- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
- Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths