zlh-grind/Codex/Monitoring/CONTRACT.md

Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

  • platform service health
  • API availability and discovery behavior
  • game/dev container presence and deletion
  • container host metrics
  • game/dev lifecycle state
  • provisioning, backup, restore, delete, and code-server/debug workflows
  • stale target detection

Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on 10.60.0.25:9090.

That means there are two related but separate paths:

API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation

container Alloy
  -> remote_write
  -> Prometheus time series
  -> Grafana dashboards

Prometheus scraping container-ip:12345 is not canonical while container Alloy listens on 127.0.0.1:12345. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to 0.0.0.0:12345 or another reachable address and the security model must be reviewed.

Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not on the scrape `up` status of each game/dev Alloy.

Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

  • OPNsense monitoring remains plugin/exporter-specific.
  • PBS monitoring remains platform-specific.
  • Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

Dynamic discovery path

Preferred launch model:

API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
  -> add/remove validation and optional direct scrape targets

Expected behavior:

  • active game/dev containers appear automatically in discovery/file_sd
  • deleted game/dev containers disappear automatically from discovery/file_sd
  • stale file_sd entries are removed once sync is healthy
  • discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
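
As a concrete illustration of the inventory format, a file_sd entry written by the sync script might look like the following (YAML form; JSON works equally well). The address, port, and label values here are illustrative, not prescribed by this contract:

```yaml
# One entry per active game/dev container, maintained by the sync script.
# Address, port, and label values are illustrative examples.
- targets: ["10.60.0.101:12345"]
  labels:
    vmid: "101"
    instance: "game-101"
    container_type: "game"
```

While container Alloy binds loopback, these targets document inventory and add/remove behavior; they are not a canonical scrape path.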

Metrics path

Canonical launch path for game/dev container metrics:

container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards

Acceptance evidence for a game/dev container should include real time series such as:

  • node_uname_info
  • CPU metrics
  • memory metrics
  • expected labels such as vmid, instance, and container_type
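
A spot-check of that acceptance evidence can be expressed in PromQL; the vmid value below is illustrative:

```promql
# Does the acceptance series exist with the required labels?
node_uname_info{vmid="101", container_type="game"}

# Freshness: seconds since the newest received sample, per container.
time() - max by (vmid, container_type) (timestamp(node_uname_info))
```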

API discovery requirements

API-owned monitoring discovery endpoints must:

  • require bearer/internal monitoring auth outside explicit development/test mode
  • fail closed without token
  • return only active/current targets unless historical/debug mode is explicitly requested
  • provide stable labels for Prometheus/Grafana
  • avoid leaking secrets in labels, URLs, logs, or target metadata
  • expose enough metadata to debug game/dev lifecycle state

Required labels

For game/dev targets/time series, standardize on the following required labels:

  • vmid
  • instance
  • container_type

container_type is the canonical spelling. Avoid carrying both ctype and container_type long-term.
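
If some templates still emit the legacy ctype label during the transition, one option is to normalize it in the collector before remote_write. A sketch in Alloy configuration syntax, with component names illustrative:

```alloy
// Transitional sketch: rewrite the legacy `ctype` label to
// `container_type` before samples leave the container.
prometheus.relabel "normalize_labels" {
  forward_to = [prometheus.remote_write.central.receiver]

  rule {
    source_labels = ["ctype"]
    regex         = "(.+)"
    target_label  = "container_type"
  }
  rule {
    action = "labeldrop"
    regex  = "ctype"
  }
}
```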

Optional labels

Optional labels may include:

  • server_id
  • game
  • variant
  • runtime
  • hostname
  • name
  • plan
  • role

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

  • ready
  • connectable
  • operationInProgress
  • operationType
  • backup status
  • restore status
  • code-server status for dev containers
  • provisioning-complete state
  • recent failure reason where safe

Agent responsibilities

Agent/container-side monitoring responsibilities:

  • keep Alloy installed/prepared in game/dev templates
  • remove legacy node-exporter from game/dev templates once Alloy is standard
  • configure container Alloy remote-write to Prometheus at 10.60.0.25:9090
  • keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
  • write or refresh Alloy config after POST /config
  • keep labels minimal and consistent with dashboards/discovery
  • eventually ship selected container logs through Alloy to centralized Loki
  • do not run Loki itself inside each game/dev container
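
A minimal sketch of the container-side Alloy configuration these responsibilities imply, assuming current Alloy component syntax; the vmid and container_type values are per-container and illustrative. Note that the loopback HTTP listener is set on the Alloy command line (for example `--server.http.listen-addr=127.0.0.1:12345`), not in this file:

```alloy
// Host metrics via the built-in node-exporter-equivalent component.
prometheus.exporter.unix "host" { }

// Attach the required contract labels; values written at provisioning time.
discovery.relabel "host" {
  targets = prometheus.exporter.unix.host.targets

  rule {
    target_label = "vmid"
    replacement  = "101"    // per-container value (illustrative)
  }
  rule {
    target_label = "container_type"
    replacement  = "game"   // "game" or "dev"
  }
}

prometheus.scrape "host" {
  targets    = discovery.relabel.host.output
  forward_to = [prometheus.remote_write.central.receiver]
}

// Canonical metrics path: remote-write into central Prometheus.
prometheus.remote_write "central" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
}
```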

Monitoring host responsibilities

The monitoring host must:

  • restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
  • run discovery sync reliably
  • provision dashboards into Grafana, not merely store dashboard JSON on disk
  • alert on failed discovery sync
  • alert on stale file_sd age
  • alert on missing/stale remote-write metrics by VMID/type
  • alert on API discovery endpoint failure
  • store bearer tokens with tight permissions or systemd credentials
  • host or reach the centralized Loki service once log collection is enabled
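
The alerting duties above can be sketched as Prometheus alerting rules. Thresholds and rule names are illustrative, and `gamedev_sync_last_success_timestamp` is an assumed metric the sync script would export, not something this contract defines:

```yaml
groups:
  - name: gamedev-monitoring
    rules:
      # Missing/stale remote-write metrics, keyed by vmid and type.
      - alert: GameDevMetricsStale
        expr: time() - max by (vmid, container_type) (timestamp(node_uname_info)) > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No remote-write metrics from vmid {{ $labels.vmid }} for over 5 minutes"

      # Discovery sync has stopped succeeding (assumed sync-script metric).
      - alert: DiscoverySyncFailed
        expr: time() - gamedev_sync_last_success_timestamp > 600
        labels:
          severity: critical
        annotations:
          summary: "Monitoring discovery sync has not succeeded recently"
```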

Grafana responsibilities

Grafana must have:

  • Prometheus datasource installed
  • core service dashboards installed
  • game container dashboard installed
  • dev container dashboard installed
  • dashboard queries aligned with the canonical label contract
  • dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
  • Loki datasource installed once centralized logs are enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
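
The provisioning requirement can be met with a Grafana dashboard provider definition so that JSON on disk is actually loaded into Grafana. A sketch, with the provider name and path illustrative:

```yaml
# /etc/grafana/provisioning/dashboards/zerolaghub.yaml (path illustrative)
apiVersion: 1
providers:
  - name: zerolaghub
    type: file
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```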

Logs contract

Loki should be centralized, not installed per game/dev container.

Preferred launch direction:

container/app logs
  -> container Alloy log collection
  -> centralized Loki
  -> Grafana Explore / dashboard links

Container Alloy should tail only selected, useful logs and attach stable labels such as:

  • vmid
  • instance
  • container_type
  • server_id where safe/useful
  • log_source, for example agent, minecraft, codeserver, provisioning

Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.
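
A sketch of that collector/shipper role in Alloy configuration syntax, tailing one selected log source. The log path, label values, and the Loki endpoint address are assumptions for illustration; this contract does not fix a Loki address:

```alloy
// Tail only a selected, useful log file (path illustrative).
local.file_match "agent_logs" {
  path_targets = [{
    __path__   = "/var/log/zlh-agent/*.log",
    log_source = "agent",
  }]
}

loki.source.file "agent" {
  targets    = local.file_match.agent_logs.targets
  forward_to = [loki.process.labels.receiver]
}

// Attach the stable contract labels before shipping.
loki.process "labels" {
  forward_to = [loki.write.central.receiver]

  stage.static_labels {
    values = {
      vmid           = "101",   // per-container value (illustrative)
      container_type = "game",
    }
  }
}

// Ship to the centralized Loki (address is an assumed example).
loki.write "central" {
  endpoint {
    url = "http://10.60.0.30:3100/loki/api/v1/push"
  }
}
```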

Launch-ready debugging requires centralized logs for at least:

  • API application logs
  • Agent logs
  • provisioning flow
  • backup/restore flow
  • delete/teardown flow
  • Velocity registration callbacks
  • DNS/edge publish and cleanup actions
  • monitoring discovery sync

Initial container log candidates:

  • Agent service logs
  • Minecraft/server stdout logs where useful
  • code-server logs for dev containers
  • provisioning/config application logs
  • backup/restore operation logs

Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.

Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.

Security contract

Required before launch:

  • Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
  • Grafana not publicly reachable except through intended authenticated/admin path
  • node_exporter not publicly reachable
  • API monitoring endpoints fail closed without token
  • no monitoring bearer token in world-readable files
  • no token leakage in labels, target URLs, dashboards, or logs
  • Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths
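
One way to satisfy the token-handling requirements is systemd credentials rather than a world-readable file. A sketch for the sync service; the unit fragment, file path, binary name, and `--token-file` flag are illustrative, not an existing interface:

```ini
[Service]
# Token file is readable only by root on disk; systemd exposes it to the
# service privately under $CREDENTIALS_DIRECTORY at runtime.
LoadCredential=monitoring-token:/etc/zerolaghub/monitoring-token
ExecStart=/usr/local/bin/monitor-sync --token-file=${CREDENTIALS_DIRECTORY}/monitoring-token
```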