# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.

There are two related but separate paths:

```text
API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation
  -> game-dev-alloy scrape health

container Alloy
  -> remote_write
  -> Prometheus time series
  -> Grafana dashboards
```

Container Alloy intentionally listens on `0.0.0.0:12345` for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the `game-dev-alloy` job. This direct scrape proves collector reachability and target add/remove behavior.

Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.

## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
  -> add/remove validation and direct Alloy scrape targets
```

Expected behavior:

- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
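The sync step above could be sketched roughly as follows. This is a minimal illustration, not the actual monitor sync script: the record field `ip`, the function names, and the `DiscoveryError` type are assumptions made for the example. It rebuilds the whole inventory from the current discovery result, so deleted containers drop out, and it raises instead of writing when a record is incomplete, so a bad discovery response does not silently replace good state.

```python
import json
import os
import tempfile

REQUIRED_LABELS = ("vmid", "instance", "container_type")


class DiscoveryError(Exception):
    """Raised when a discovery record cannot be trusted (hypothetical type)."""


def build_file_sd(records, port=12345):
    """Convert discovery records into Prometheus file_sd target groups.

    Each record is assumed to be a dict with an `ip` field plus the
    required labels; the field name `ip` is an assumption of this sketch.
    """
    groups = []
    for record in records:
        missing = [k for k in ("ip", *REQUIRED_LABELS)
                   if record.get(k) in (None, "")]
        if missing:
            raise DiscoveryError(f"discovery record missing fields: {missing}")
        groups.append({
            "targets": [f"{record['ip']}:{port}"],
            "labels": {k: str(record[k]) for k in REQUIRED_LABELS},
        })
    return groups


def write_file_sd(path, groups):
    """Atomically replace the inventory file with the current target list."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so Prometheus never reads a half-written inventory.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A caller would fetch the discovery payload, call `build_file_sd`, and only then call `write_file_sd`. Any exception before the write leaves the previous file in place and should be surfaced as a sync failure alert rather than swallowed.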

## Metrics path

Canonical launch path for game/dev container metrics:

```text
container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards
```

Acceptance evidence for a game/dev container should include real time series such as:

- `node_uname_info`
- CPU metrics
- memory metrics
- expected labels such as `vmid`, `instance`, and `container_type`

Direct Alloy scrape health should also be visible through the `game-dev-alloy` Prometheus job for current lifecycle targets.
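One way to turn the acceptance evidence into a repeatable check is to inspect the JSON returned by a Prometheus instant query (for example for `node_uname_info`) and confirm that every series carries the required labels. The sketch below only validates an already-parsed response; the function name and the wording of the problem strings are assumptions for illustration.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def missing_label_report(response):
    """Return a list of problems found in a parsed instant-query response.

    `response` is the decoded JSON body of a Prometheus /api/v1/query call.
    An empty list means every returned series has the required labels.
    """
    if response.get("status") != "success":
        return ["query did not succeed"]
    series_list = response.get("data", {}).get("result", [])
    if not series_list:
        # No series at all means remote-write evidence is missing.
        return ["no series returned"]
    problems = []
    for series in series_list:
        labels = series.get("metric", {})
        ident = labels.get("vmid") or "unknown"
        for name in REQUIRED_LABELS:
            if not labels.get(name):
                problems.append(f"vmid={ident}: missing label {name}")
    return problems
```

An empty result for a container that is known to exist is itself a finding: it means discovery may be healthy while remote-write delivery is not.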

## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed: reject the request when no valid token is presented
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state
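The fail-closed requirement can be illustrated with a small check like the one below. It is a sketch under assumptions: the function name, the `dev_mode` flag, and the way the expected token reaches the function are hypothetical, and the real API may structure this differently. The key property is that a missing header, a missing configured token, or a wrong scheme all result in denial.

```python
import hmac


def is_authorized(authorization_header, expected_token, dev_mode=False):
    """Decide whether a monitoring discovery request may proceed."""
    if dev_mode:
        # Explicit development/test mode is the only bypass.
        return True
    if not expected_token:
        # Fail closed: an unconfigured token must never mean "allow all".
        return False
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

For example, `is_authorized(None, "secret")` and `is_authorized("Bearer secret", "")` both return `False`, while `is_authorized("Bearer secret", "secret")` returns `True`.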

## Required labels

For game/dev targets/time series, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` is the canonical spelling. Do not carry both `ctype` and `container_type` long-term.
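During a transition away from `ctype`, a normalization step along these lines could keep discovery output and dashboards on one spelling. This is a hedged sketch: the function name is hypothetical, and where such a step would run (API, sync script, or relabeling) is not specified by this contract.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels):
    """Return a copy of `labels` using only the canonical `container_type`."""
    result = dict(labels)
    legacy = result.pop("ctype", None)
    if legacy is not None:
        # Adopt the legacy value only if the canonical label is absent.
        current = result.setdefault("container_type", legacy)
        if current != legacy:
            raise ValueError(
                f"ctype={legacy!r} conflicts with container_type={current!r}"
            )
    missing = [name for name in REQUIRED_LABELS if not result.get(name)]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    return result
```

Raising on a conflict, instead of silently picking one value, keeps a mislabeled target from showing up under the wrong dashboard.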

## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node_exporter` from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
- configure container Alloy HTTP to listen on `0.0.0.0:12345` for the intended internal monitoring networks
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually ship selected container logs through Alloy to centralized Loki
- do not run Loki itself inside each game/dev container

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on `game-dev-alloy` scrape target down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
- host or reach the centralized Loki service once log collection is enabled
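The stale file_sd age alert can be backed by a check as simple as the following sketch. It assumes the sync job rewrites (or at least touches) the inventory file on every successful run, so the file's modification time reflects the last healthy sync; the function names and the 300-second default are illustrative, not part of the contract.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the file_sd inventory was last modified."""
    now = time.time() if now is None else now
    return max(0.0, now - os.stat(path).st_mtime)


def is_stale(path, max_age_seconds=300, now=None):
    """True when the inventory is missing or older than the allowed age."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # A missing inventory is treated as stale, not as "nothing to check".
        return True
```

The result could be exported as a metric or used as the exit status of a timer-driven check. Either way, a missing file is reported as stale so that a broken sync cannot look healthy.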

## Grafana responsibilities

Grafana must have:

- Prometheus datasource configured
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container metrics
- direct scrape health panels or queries for `game-dev-alloy` target health
- Loki datasource configured once centralized logs are enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

## Logs contract

Loki should be centralized, not installed per game/dev container.

Preferred future direction:

```text
container/app logs
  -> container Alloy log collection
  -> centralized Loki
  -> Grafana Explore / dashboard links
```

Container Alloy should tail only selected, useful logs and attach stable labels such as:

- `vmid`
- `instance`
- `container_type`
- `server_id` where safe/useful
- `log_source`, for example `agent`, `minecraft`, `codeserver`, `provisioning`

Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.

Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.

When logs are added, prioritize:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Initial container log candidates:

- Agent service logs
- Minecraft/server stdout logs where useful
- code-server logs for dev containers
- provisioning/config application logs
- backup/restore operation logs

Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.

## Security contract

Required before launch:

- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through intended authenticated/admin path
- `node_exporter`, where still present, not publicly reachable
- API monitoring endpoints fail closed when no valid token is presented
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
- Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths
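The token file requirement lends itself to an automated pre-launch check. The sketch below inspects POSIX permission bits only; the function name, the returned strings, and the `allow_group` option are assumptions for illustration, and it does not cover systemd credentials or ACLs.

```python
import os
import stat


def token_file_problems(path, allow_group=False):
    """Return permission problems for a bearer token file (empty list if none)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & (stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH):
        # Any "other" permission bit means the token is exposed to all users.
        problems.append("world-accessible")
    if not allow_group and mode & (stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP):
        # Stricter than the contract requires; relax with allow_group=True.
        problems.append("group-accessible")
    return problems
```

A file with mode `0600` yields no problems, while `0644` is reported as both world-accessible and group-accessible.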