Add monitoring contract
parent 15b29b54c5
commit c3aca715ee
157
Codex/Monitoring/CONTRACT.md
Normal file
@@ -0,0 +1,157 @@
# Monitoring Contract

This file defines the intended monitoring contract for ZeroLagHub launch readiness.

## Scope

Linux-managed game/dev containers should converge on Grafana Alloy as the standard telemetry collector.

The monitoring system must support launch debugging for:

- platform service health
- API availability and discovery behavior
- game/dev container presence and deletion
- container host metrics
- game/dev lifecycle state
- provisioning, backup, restore, delete, and code-server/debug workflows
- stale target detection

## Explicit exceptions

The game/dev Alloy contract does not replace every platform-specific monitoring path.

Exceptions:

- OPNsense monitoring remains plugin/exporter-specific.
- PBS monitoring remains platform-specific.
- Appliance-style infrastructure should not be forced through the game/dev Alloy contract unless it is intentionally Linux-managed in the same way.

## Dynamic discovery path

Preferred launch model:

```text
API discovery endpoint
-> monitor sync script
-> Prometheus file_sd
-> Grafana dashboards / alerts
```

Expected behavior:

- active game/dev containers appear automatically
- deleted game/dev containers disappear automatically
- stale file_sd entries are removed once sync is healthy
- discovery failures alert loudly and do not silently preserve misleading stale state
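
The sync step in the path above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the record fields `address`, `vmid`, and `container_type`, the `fetch_targets` callable, and the function names are all assumptions. It rewrites the file_sd file atomically on every successful pass, so deleted containers drop out, and it leaves the file untouched on failure so that its modification time stops advancing and a separate staleness alert can fire.

```python
import json
import os
import tempfile
from pathlib import Path


def build_file_sd(targets):
    """Convert discovery records (hypothetical shape) into file_sd target groups."""
    groups = []
    for t in targets:
        groups.append({
            # Prometheus derives the default `instance` label from this address.
            "targets": [t["address"]],
            "labels": {
                "vmid": str(t["vmid"]),
                "container_type": t["container_type"],
            },
        })
    # Sort so that unchanged discovery data produces identical file content.
    return sorted(groups, key=lambda g: g["targets"][0])


def write_file_sd(path, groups):
    """Atomically replace the file so Prometheus never reads a partial write."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(groups, fh, indent=2)
        # mkstemp creates the file as 0600; the target list holds no secrets,
        # so make it readable by the Prometheus user.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def sync(fetch_targets, path):
    """Run one sync pass. Returns True on success, False on discovery failure."""
    try:
        targets = fetch_targets()
    except Exception as exc:
        # Report loudly and keep the old file; its mtime no longer advances.
        print(f"discovery sync FAILED: {exc}")
        return False
    write_file_sd(path, build_file_sd(targets))
    return True
```

Whether a failed sync should keep the last good target list or clear it is a design decision this contract leaves open; the sketch keeps it and relies on the failure being reported and on the file age being alerted on.
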

## API discovery requirements

API-owned monitoring discovery endpoints must:

- require bearer/internal monitoring auth outside explicit development/test mode
- fail closed without token
- return only active/current targets unless historical/debug mode is explicitly requested
- provide stable labels for Prometheus/Grafana
- avoid leaking secrets in labels, URLs, logs, or target metadata
- expose enough metadata to debug game/dev lifecycle state
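
One way to read the "fail closed" requirement is sketched below. The function name `is_authorized` and its parameters are hypothetical; the point is only that a missing server-side token or a missing request header results in denial, and that explicit development/test mode is the sole bypass.

```python
import hmac


def is_authorized(authorization_header, expected_token, dev_mode=False):
    """Return True only if the request may read monitoring discovery data."""
    if dev_mode:
        # Explicit development/test mode is the only path that skips auth.
        return True
    if not expected_token:
        # Fail closed: an unconfigured token must never mean "open".
        return False
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

For example, `is_authorized(None, "s3cret")` and `is_authorized("Bearer s3cret", "")` both return `False`, while `is_authorized("Bearer s3cret", "s3cret")` returns `True`.
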

## Required labels

For game/dev targets, standardize on the following required labels:

- `vmid`
- `instance`
- `container_type`

`container_type` should be the canonical spelling. Avoid using both `ctype` and `container_type` long-term.
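
During the transition away from `ctype`, a small normalization step could enforce the label contract before targets are written out. The sketch below is illustrative only; the helper name `normalize_labels` and the alias table are assumptions, not existing project code.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")
# Legacy spelling mapped to the canonical one named in this contract.
LEGACY_ALIASES = {"ctype": "container_type"}


def normalize_labels(labels):
    """Return a copy of labels with legacy names mapped to canonical ones."""
    normalized = {}
    for key, value in labels.items():
        canonical = LEGACY_ALIASES.get(key, key)
        if canonical in normalized and normalized[canonical] != str(value):
            # Both spellings present with different values: refuse to guess.
            raise ValueError(f"conflicting values for label {canonical!r}")
        normalized[canonical] = str(value)
    missing = [name for name in REQUIRED_LABELS if not normalized.get(name)]
    if missing:
        raise ValueError(f"missing required labels: {', '.join(missing)}")
    return normalized
```

A target that still carries `ctype` comes out with `container_type` only, and a target missing any required label is rejected rather than silently published.
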

## Optional labels

Optional labels may include:

- `server_id`
- `game`
- `variant`
- `runtime`
- `hostname`
- `name`
- `plan`
- `role`

Any customer/user-identifying label should be reviewed before dashboard or alert exposure.

## Lifecycle/debug state

Launch-debug monitoring should expose or derive these states:

- `ready`
- `connectable`
- `operationInProgress`
- `operationType`
- backup status
- restore status
- code-server status for dev containers
- provisioning-complete state
- recent failure reason where safe

## Agent responsibilities

Agent/container-side monitoring responsibilities:

- keep Alloy installed/prepared in game/dev templates
- remove legacy `node-exporter` from game/dev templates once Alloy is standard
- keep `/etc/default/alloy` pointed at `/etc/alloy/config.alloy`, with Alloy on port `12345`
- write or refresh Alloy config after `POST /config`
- keep labels minimal and consistent with dashboards/discovery
- eventually emit structured logs suitable for Loki/Alloy log ingestion

## Monitoring host responsibilities

The monitoring host must:

- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
- run discovery sync reliably
- provision dashboards into Grafana, not merely store dashboard JSON on disk
- alert on failed discovery sync
- alert on stale file_sd age
- alert on target-down by VMID/type
- alert on API discovery endpoint failure
- store bearer tokens with tight permissions or systemd credentials
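
The "stale file_sd age" alert needs a measurable signal. A minimal sketch, assuming the sync script rewrites the file on every successful pass so that its modification time marks the last good sync, might look like this; the function names and the threshold parameter are hypothetical.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds since the file_sd file was last modified."""
    now = time.time() if now is None else now
    return max(0.0, now - os.stat(path).st_mtime)


def is_stale(path, max_age_seconds, now=None):
    """True if the file is missing or older than the allowed age."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # A missing file means discovery has never succeeded: treat as stale.
        return True
```

The age value could be exported as a gauge (for example through the node_exporter textfile collector) and alerted on in Prometheus, though the contract does not prescribe that mechanism.
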

## Grafana responsibilities

Grafana must have:

- Prometheus datasource installed
- core service dashboards installed
- game container dashboard installed
- dev container dashboard installed
- dashboard queries aligned with the canonical label contract

Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.

## Logs contract

Launch-ready debugging requires centralized logs for at least:

- API application logs
- Agent logs
- provisioning flow
- backup/restore flow
- delete/teardown flow
- Velocity registration callbacks
- DNS/edge publish and cleanup actions
- monitoring discovery sync

Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debug readiness remains incomplete.

## Security contract

Required before launch:

- Prometheus not publicly reachable
- Grafana not publicly reachable except through intended authenticated/admin path
- node_exporter not publicly reachable
- API monitoring endpoints fail closed without token
- no monitoring bearer token in world-readable files
- no token leakage in labels, target URLs, dashboards, or logs
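
Two of the items above lend themselves to a mechanical pre-launch check. The following sketch is an assumption-laden illustration (the helper names are hypothetical, and it treats group access as a problem too, which is stricter than the contract requires): it inspects a token file's permission bits and scans a label set for the token value.

```python
import os
import stat


def token_file_problems(path):
    """Return a list of permission problems for a bearer-token file."""
    mode = os.stat(path).st_mode
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    if mode & (stat.S_IRGRP | stat.S_IWGRP):
        # Stricter than the contract: flag group access as well.
        problems.append("group-accessible")
    return problems


def labels_leaking_token(labels, token):
    """Return the names of labels whose value contains the token."""
    if not token:
        return []
    return sorted(name for name, value in labels.items() if token in str(value))
```

A file with mode `0600` yields an empty problem list, while `0644` is reported as both world-readable and group-accessible.
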