# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Not API-only

Monitoring is not an API-only design.

The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.

The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.

That means there are two related but separate paths:

```text
API discovery
-> monitor sync
-> file_sd target inventory / add-remove validation

container Alloy
-> remote_write
-> Prometheus time series
-> Grafana dashboards
```

Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address, and the security model must be reviewed.

Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.

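
Freshness-based health, as opposed to scrape-up status, can be sketched as a small classification step. This is only an illustration under stated assumptions: the `LastSample` record, the `classify_freshness` helper, and the 120-second threshold are hypothetical and not part of the contract. How the newest sample timestamp per VMID is obtained from Prometheus is left out.

```python
from dataclasses import dataclass

# Hypothetical threshold: how old the newest remote-write sample may be
# before a container is treated as stale. Tune to the real send interval.
STALE_AFTER_SECONDS = 120.0


@dataclass(frozen=True)
class LastSample:
    """Newest sample seen for one container (hypothetical record shape)."""
    vmid: str
    container_type: str
    timestamp: float  # Unix time of the newest received sample


def classify_freshness(expected_vmids, last_samples, now,
                       stale_after=STALE_AFTER_SECONDS):
    """Return {vmid: "fresh" | "stale" | "missing"} for every expected VMID."""
    newest = {}
    for sample in last_samples:
        # Keep only the most recent timestamp per VMID.
        if sample.vmid not in newest or sample.timestamp > newest[sample.vmid]:
            newest[sample.vmid] = sample.timestamp

    status = {}
    for vmid in expected_vmids:
        if vmid not in newest:
            status[vmid] = "missing"   # discovered, but no metrics received
        elif now - newest[vmid] > stale_after:
            status[vmid] = "stale"     # metrics stopped arriving
        else:
            status[vmid] = "fresh"
    return status
```

The expected VMIDs would come from the discovery path and the timestamps from the remote-write path, which keeps the two paths separate while still letting a dashboard or alert compare them.
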
## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint
-> monitor sync script
-> Prometheus file_sd inventory
-> add/remove validation and optional direct scrape targets
```

Expected behavior:

- active game/dev containers appear automatically in discovery/file_sd
- deleted game/dev containers disappear automatically from discovery/file_sd
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state

Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.

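
The sync step can be sketched roughly as follows. This is a minimal illustration, not the actual monitor sync script: the discovery record shape (`address` plus `labels`), the helper names, and the example addresses are assumptions. The output follows the standard Prometheus file_sd JSON layout, a list of target groups with `targets` and `labels`.

```python
import json
import os
import tempfile

REQUIRED_LABELS = ("vmid", "instance", "container_type")


def render_file_sd(discovered):
    """Convert discovery records into Prometheus file_sd target groups.

    Each record is assumed (hypothetically) to look like
    {"address": "host:port", "labels": {...}}.
    """
    groups = []
    for record in discovered:
        labels = {key: str(value) for key, value in record["labels"].items()}
        missing = [key for key in REQUIRED_LABELS if key not in labels]
        if missing:
            # Fail loudly instead of writing a partially labelled target.
            raise ValueError(
                f"target {record.get('address')!r} is missing labels: {missing}"
            )
        groups.append({"targets": [record["address"]], "labels": labels})
    # Stable ordering keeps the file diff-friendly between sync runs.
    groups.sort(key=lambda group: group["targets"][0])
    return groups


def write_file_sd(path, groups):
    """Atomically replace the file_sd file so Prometheus never reads a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file as 0600; file_sd holds no secrets, so make
        # it readable for the Prometheus user (assumption about the deployment).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the whole file is rewritten from the current discovery result, deleted containers drop out on the next successful run. A failed discovery call should raise before `write_file_sd` is reached, so an error is surfaced rather than an empty or stale inventory being written silently.
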
## Metrics path

Canonical launch path for game/dev container metrics:

```text
container Alloy remote_write
-> Prometheus at 10.60.0.25:9090
-> Grafana dashboards
```

Acceptance evidence for a game/dev container should include real time series such as:

- `node_uname_info`
- CPU metrics
- memory metrics
- expected labels such as `vmid`, `instance`, and `container_type`

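
One way to turn this acceptance evidence into a repeatable check is to inspect the result of a Prometheus instant query for `node_uname_info`. The sketch below only parses an already-fetched response in the standard Prometheus HTTP API shape (`status`, `data.result[].metric`); the `check_acceptance` helper and its problem messages are hypothetical, and fetching the response is left to the caller.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def check_acceptance(response, vmid):
    """Return a list of problems for one VMID; an empty list means evidence is present.

    `response` is the decoded JSON of a Prometheus instant query
    for `node_uname_info`.
    """
    if response.get("status") != "success":
        return ["query did not succeed"]

    series = [
        item for item in response.get("data", {}).get("result", [])
        if item.get("metric", {}).get("vmid") == vmid
    ]
    if not series:
        return [f"no node_uname_info series for vmid {vmid}"]

    problems = []
    for item in series:
        for label in REQUIRED_LABELS:
            # An absent or empty label both count as missing.
            if not item["metric"].get(label):
                problems.append(f"series is missing label {label}")
    return problems
```

The same pattern could be repeated for one CPU and one memory metric, so that acceptance does not rest on a single series.
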
## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed when no token is presented
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state

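
The fail-closed rule can be illustrated with a small authorization check. This is a sketch under assumptions: the function name, its parameters, and the `dev_mode` flag are hypothetical and do not describe the API's real implementation. It uses `hmac.compare_digest` from the standard library for a constant-time comparison.

```python
import hmac


def is_authorized(authorization_header, expected_token, dev_mode=False):
    """Decide whether a discovery request may proceed.

    Denies by default: any missing, malformed, or mismatched credential
    results in False.
    """
    if dev_mode:
        # Only for explicit development/test mode.
        return True
    if not expected_token:
        # No token configured on the server side: fail closed.
        return False
    if not authorization_header:
        return False

    scheme, _, presented = authorization_header.partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return False

    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

A missing server-side token is treated the same as a wrong client token, so a misconfigured deployment denies access instead of opening the endpoint.
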
## Required labels

For game/dev targets/time series, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.

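
During a transition away from `ctype`, a normalization step like the following could keep label sets consistent. It is a hypothetical helper, shown only to make the "one canonical spelling" rule concrete.

```python
def normalize_labels(labels):
    """Return a copy of `labels` using `container_type` instead of legacy `ctype`."""
    normalized = dict(labels)
    legacy = normalized.pop("ctype", None)
    if legacy is not None:
        current = normalized.get("container_type")
        if current is not None and current != legacy:
            # Conflicting values indicate a real inconsistency; do not guess.
            raise ValueError(
                f"ctype={legacy!r} conflicts with container_type={current!r}"
            )
        normalized["container_type"] = legacy
    return normalized
```

Raising on a conflict, rather than picking one value, keeps a mislabelled container from silently landing on the wrong dashboard.
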
## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node-exporter` from game/dev templates once Alloy is standard
- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually ship selected container logs through Alloy to centralized Loki
- do not run Loki itself inside each game/dev container

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on missing/stale remote-write metrics by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
- host or reach the centralized Loki service once log collection is enabled

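
The "stale file_sd age" alert can be approximated by checking the file's modification time. The sketch below assumes the sync job rewrites or touches the file on every successful run; if the job only writes on change, mtime alone would not be a reliable signal. The helper names and the age limit passed by the caller are hypothetical.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Return how many seconds ago the file_sd file was last modified."""
    now = time.time() if now is None else now
    return max(0.0, now - os.stat(path).st_mtime)


def is_file_sd_stale(path, max_age_seconds, now=None):
    """Treat a missing file as stale so a broken sync cannot look healthy."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        return True
```

The result could feed whatever alerting route the monitoring host uses; exposing the age as a metric is another option this sketch does not cover.
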
## Grafana responsibilities

Grafana must have:

- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract
- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
- Loki datasource installed once centralized logs are enabled

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

## Logs contract

Loki should be centralized, not installed per game/dev container.

Preferred launch direction:

```text
container/app logs
-> container Alloy log collection
-> centralized Loki
-> Grafana Explore / dashboard links
```

Container Alloy should tail only selected, useful logs and attach stable labels such as:

- `vmid`
- `instance`
- `container_type`
- `server_id` where safe/useful
- `log_source`, for example `agent`, `minecraft`, `codeserver`, `provisioning`

Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.

Launch-ready debugging requires centralized logs for at least:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Initial container log candidates:

- Agent service logs
- Minecraft/server stdout logs where useful
- code-server logs for dev containers
- provisioning/config application logs
- backup/restore operation logs

Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.

Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.

## Security contract

Required before launch:

- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
- Grafana not publicly reachable except through the intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed when no token is presented
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
- Loki, once added, is not publicly reachable and accepts writes only from intended Alloy collectors or trusted internal paths
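
The "no monitoring bearer token in world-readable files" item lends itself to an automated pre-launch check. The following is a minimal sketch for POSIX hosts; the `token_file_problems` helper is hypothetical, and flagging group access goes slightly beyond the stated requirement as one possible reading of "tight permissions".

```python
import os
import stat


def token_file_problems(path):
    """Return permission problems for a token file; an empty list means owner-only."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IRWXO:
        # Any read/write/execute bit for "other" violates the contract.
        problems.append("world-accessible")
    if mode & stat.S_IRWXG:
        # Stricter than the contract requires; drop this check if group
        # access is intentional in the deployment.
        problems.append("group-accessible")
    return problems
```

A file with mode `0600` passes, while `0644` is reported as both world- and group-accessible.
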