zlh-grind/Codex/Monitoring/CONTRACT.md

# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.

That means there are two related but separate paths:

```text
API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation

container Alloy
  -> remote_write
  -> Prometheus time series
  -> Grafana dashboards
```

Prometheus scraping `container-ip:12345` is not canonical: while container Alloy listens on `127.0.0.1:12345`, that listener is not reachable from the Prometheus host. If direct scrape health for each container Alloy is wanted later, the container Alloy HTTP listener must be deliberately bound to `0.0.0.0:12345` or another reachable address, and the security model must be reviewed.

Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
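A freshness check of this kind could look roughly like the sketch below. It assumes the newest sample timestamp per `vmid` has already been fetched from Prometheus; the function name `stale_vmids` and the 120-second threshold are illustrative assumptions, not part of this contract.

```python
from typing import Dict, List


def stale_vmids(
    last_sample_ts: Dict[str, float],
    expected: List[str],
    now: float,
    max_age_s: float = 120.0,  # assumed threshold; tune to the real remote-write interval
) -> List[str]:
    """Return expected VMIDs whose newest remote-write sample is missing or too old."""
    stale = []
    for vmid in expected:
        ts = last_sample_ts.get(vmid)
        # A container with no samples at all is treated the same as a stale one.
        if ts is None or now - ts > max_age_s:
            stale.append(vmid)
    return stale
```

The expected list would come from API discovery, so a container that exists but has never delivered metrics shows up as stale rather than being silently absent.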

## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
  -> add/remove validation and optional direct scrape targets
```

Expected behavior:

- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
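As a minimal sketch of the sync step, the function below turns a discovery response into Prometheus `file_sd` target groups. The input field names (`address`, `vmid`, `instance`, `container_type`) and the `DiscoveryError` type are assumptions for illustration; the real discovery payload may differ.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("address", "vmid", "instance", "container_type")


class DiscoveryError(RuntimeError):
    """Raised so that a bad discovery response fails loudly instead of being written."""


def build_file_sd(discovered: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build file_sd target groups from the current discovery result only."""
    groups = []
    for item in discovered:
        missing = [name for name in REQUIRED_FIELDS if not item.get(name)]
        if missing:
            raise DiscoveryError(f"target missing fields: {missing}")
        groups.append(
            {
                "targets": [str(item["address"])],
                "labels": {
                    "vmid": str(item["vmid"]),
                    "instance": str(item["instance"]),
                    "container_type": str(item["container_type"]),
                },
            }
        )
    return groups


def render_file_sd(discovered: List[Dict[str, Any]]) -> str:
    """Serialize the inventory as the JSON document Prometheus file_sd reads."""
    return json.dumps(build_file_sd(discovered), indent=2, sort_keys=True)
```

Because the output is rebuilt from the current discovery result on every run, deleted containers drop out without any separate cleanup pass. A caller would typically write the rendered text to a temporary file and rename it into place, and let `DiscoveryError` propagate to whatever raises the sync-failure alert.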

## Metrics path

Canonical launch path for game/dev container metrics:

```text
container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards
```

Acceptance evidence for a game/dev container should include real time series such as:

- `node_uname_info`
- CPU metrics
- memory metrics
- expected labels such as `vmid`, `instance`, and `container_type`
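The acceptance evidence above could be checked with something like the following sketch. It assumes the series label sets have already been fetched from Prometheus as dictionaries that include `__name__`. The `node_cpu_` and `node_memory_` name prefixes are an assumption based on common node-style metric naming; this contract does not name specific CPU or memory metrics.

```python
from typing import Dict, List

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def acceptance_gaps(series: List[Dict[str, str]], vmid: str) -> List[str]:
    """List what is missing from the acceptance evidence for one container."""
    mine = [s for s in series if s.get("vmid") == vmid]
    names = {s.get("__name__", "") for s in mine}
    gaps: List[str] = []
    if "node_uname_info" not in names:
        gaps.append("missing node_uname_info")
    # Assumed prefixes for CPU and memory metrics; adjust to the real metric names.
    if not any(name.startswith("node_cpu_") for name in names):
        gaps.append("missing CPU metrics")
    if not any(name.startswith("node_memory_") for name in names):
        gaps.append("missing memory metrics")
    for s in mine:
        for label in REQUIRED_LABELS:
            if not s.get(label):
                gaps.append(f"{s.get('__name__', '?')} lacks label {label}")
    return gaps
```

An empty result means the container has delivered the expected series with the expected labels.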

## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed when no valid token is presented
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state
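The fail-closed auth requirement can be illustrated with a small sketch. The function name and the `dev_mode` flag are hypothetical; how the API actually reads its configuration is not specified here.

```python
import hmac
from typing import Optional


def is_authorized(
    auth_header: Optional[str],
    expected_token: Optional[str],
    dev_mode: bool = False,
) -> bool:
    """Decide whether a monitoring discovery request may proceed."""
    if dev_mode:
        # Only explicit development/test mode skips the token check.
        return True
    if not expected_token:
        # Fail closed: a server with no token configured rejects everything.
        return False
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

The important property is the second branch: a missing or empty server-side token denies access instead of disabling the check.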

## Required labels

For game/dev targets/time series, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.
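One way to enforce the required labels and retire `ctype` during a transition is sketched below. The helper name `normalize_labels` and the choice to reject conflicting values are assumptions, not an agreed migration plan.

```python
from typing import Dict

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Map legacy `ctype` to `container_type` and verify the required labels exist."""
    out = dict(labels)
    if "ctype" in out:
        legacy = out.pop("ctype")
        current = out.get("container_type")
        if current is not None and current != legacy:
            raise ValueError("ctype and container_type disagree")
        out["container_type"] = legacy
    missing = [name for name in REQUIRED_LABELS if not out.get(name)]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    return out
```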

## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node_exporter` from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually ship selected container logs through Alloy to centralized Loki
- do not run Loki itself inside each game/dev container
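As a rough sketch of writing the Alloy config, the function below renders a minimal Alloy configuration that collects host metrics and remote-writes them to Prometheus. Several details are assumptions: the `http://` scheme, the `/api/v1/write` path (Prometheus's standard remote-write receiver endpoint, which has to be enabled on the server), the component labels `node` and `central`, and the use of `external_labels` to carry the contract labels.

```python
import json
from typing import Dict


def render_alloy_config(labels: Dict[str, str], prometheus_addr: str = "10.60.0.25:9090") -> str:
    """Render a minimal Alloy config: unix exporter -> scrape -> remote_write."""
    # Assumed scheme and path; confirm against the real Prometheus setup.
    url = f"http://{prometheus_addr}/api/v1/write"
    # json.dumps yields a double-quoted, escaped string literal for simple values.
    label_lines = [f"    {key} = {json.dumps(value)}," for key, value in sorted(labels.items())]
    lines = [
        'prometheus.exporter.unix "node" { }',
        "",
        'prometheus.scrape "node" {',
        "  targets    = prometheus.exporter.unix.node.targets",
        "  forward_to = [prometheus.remote_write.central.receiver]",
        "}",
        "",
        'prometheus.remote_write "central" {',
        "  endpoint {",
        f"    url = {json.dumps(url)}",
        "  }",
        "  external_labels = {",
        *label_lines,
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_alloy_config({"vmid": "101", "container_type": "game"})` produces a config whose remote-write endpoint points at `10.60.0.25:9090`. The `instance` label is normally set at scrape time, so it may need a relabel rule rather than an external label; that detail is left out of this sketch. The HTTP listen address is not part of this file, which fits keeping the local listener on loopback.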

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
- host or reach the centralized Loki service once log collection is enabled
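The stale `file_sd` age alert could be driven by a check like the sketch below. The 300-second threshold and the function name are illustrative assumptions; a missing inventory file is treated as stale so that the failure is visible.

```python
import os
import time
from typing import Optional


def file_sd_is_stale(path: str, max_age_s: float = 300.0, now: Optional[float] = None) -> bool:
    """Return True when the file_sd inventory is missing or has not been rewritten recently."""
    current = time.time() if now is None else now
    try:
        age = current - os.stat(path).st_mtime
    except FileNotFoundError:
        # No inventory at all is worse than an old one; report it as stale.
        return True
    return age > max_age_s
```

This only works as a health signal if the sync script rewrites the file on every successful run, even when the target list is unchanged.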

## Grafana responsibilities

Grafana must have:

- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
- Loki datasource installed once centralized logs are enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

## Logs contract

Loki should be centralized, not installed per game/dev container.

Preferred launch direction:

```text
container/app logs
  -> container Alloy log collection
  -> centralized Loki
  -> Grafana Explore / dashboard links
```

Container Alloy should tail only selected, useful logs and attach stable labels such as:

- `vmid`
- `instance`
- `container_type`
- `server_id` where safe/useful
- `log_source`, for example `agent`, `minecraft`, `codeserver`, `provisioning`

Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.

Launch-ready debugging requires centralized logs for at least:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Initial container log candidates:

- Agent service logs
- Minecraft/server stdout logs where useful
- code-server logs for dev containers
- provisioning/config application logs
- backup/restore operation logs

Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.
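To support that review, a sample of a candidate log could be scanned with a simple heuristic like the sketch below. The patterns are illustrative assumptions and will miss many kinds of secret; a clean result is not proof that a log is safe to ship.

```python
import re
from typing import Iterable, List

# Heuristic patterns only: bearer tokens and key=value / key: value style secrets.
SENSITIVE = re.compile(
    r"(?i)(bearer\s+\S+|(token|password|secret|api[_-]?key)\s*[=:]\s*\S+)"
)


def flag_sensitive_lines(lines: Iterable[str]) -> List[int]:
    """Return the indexes of lines that look like they contain credentials."""
    return [index for index, line in enumerate(lines) if SENSITIVE.search(line)]
```

For example, running it over a sample of Agent or provisioning logs gives a quick list of lines to inspect before that log source is added to the Alloy collection.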

Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.

## Security contract

Required before launch:

- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed when no valid token is presented
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
- Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths
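
The token-file requirement can be spot-checked with a sketch like the one below. It is stricter than "not world-readable": it assumes the token file should grant no permissions to group or other at all, which is an assumption about what "tight permissions" means here.

```python
import os
import stat


def token_file_is_private(path: str) -> bool:
    """Return True when the path is a regular file with no group/other permission bits."""
    mode = os.stat(path).st_mode
    no_group_or_other = (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return stat.S_ISREG(mode) and no_group_or_other
```

A file with mode `0600` passes and one with mode `0644` fails. Tokens delivered through systemd credentials would need a different check.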