diff --git a/Codex/Monitoring/CONTRACT.md b/Codex/Monitoring/CONTRACT.md
index 36a9af4..f9c2779 100644
--- a/Codex/Monitoring/CONTRACT.md
+++ b/Codex/Monitoring/CONTRACT.md
@@ -24,12 +24,13 @@ The API is the discovery and metadata/control-plane source. It tells the monitor
 
 The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.
 
-That means there are two related but separate paths:
+There are two related but separate paths:
 
 ```text
 API discovery
   -> monitor sync
   -> file_sd target inventory / add-remove validation
+  -> game-dev-alloy scrape health
 
 container Alloy
   -> remote_write
@@ -37,9 +38,9 @@ container Alloy
   -> Grafana dashboards
 ```
 
-Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.
+Container Alloy intentionally listens on `0.0.0.0:12345` for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the `game-dev-alloy` job. This direct scrape proves collector reachability and target add/remove behavior.
 
-Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.
 
 ## Explicit exceptions
 
@@ -59,7 +60,7 @@ Preferred launch model:
 API discovery endpoint
   -> monitor sync script
   -> Prometheus file_sd inventory
-  -> add/remove validation and optional direct scrape targets
+  -> add/remove validation and direct Alloy scrape targets
 ```
 
 Expected behavior:
@@ -88,6 +89,8 @@ Acceptance evidence for a game/dev container should include real time series suc
 - memory metrics
 - expected labels such as `vmid`, `instance`, and `container_type`
 
+Direct Alloy scrape health should also be visible through the `game-dev-alloy` Prometheus job for current lifecycle targets.
+
 ## API discovery requirements
 
 API-owned monitoring discovery endpoints must:
@@ -145,7 +148,7 @@ Agent/container-side monitoring responsibilities:
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
 - configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
-- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
+- configure container Alloy HTTP to listen on `0.0.0.0:12345` for the intended internal monitoring networks
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually ship selected container logs through Alloy to centralized Loki
@@ -161,6 +164,7 @@ The monitoring host must:
 - alert on failed discovery sync
 - alert on stale file_sd age
 - alert on missing/stale remote-write metrics by VMID/type
+- alert on `game-dev-alloy` scrape target down by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 - host or reach the centralized Loki service once log collection is enabled
@@ -174,7 +178,8 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
-- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
+- dashboards based on remote-write metric freshness for game/dev container metrics
+- direct scrape health panels or queries for `game-dev-alloy` target health
 - Loki datasource installed once centralized logs are enabled
 
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
@@ -183,7 +188,7 @@ Dashboard JSON existing on disk is not sufficient; dashboards must be visible in
 
 Loki should be centralized, not installed per game/dev container.
 
-Preferred launch direction:
+Preferred future direction:
 
 ```text
 container/app logs
@@ -202,7 +207,9 @@ Container Alloy should tail only selected, useful logs and attach stable labels
 
 Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.
 
-Launch-ready debugging requires centralized logs for at least:
+Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.
+
+When logs are added, prioritize:
 
 - API application logs
 - Agent logs
@@ -223,8 +230,6 @@ Initial container log candidates:
 
 Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.
 
-Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.
-
 ## Security contract
 
 Required before launch: