Align monitoring contract with Alloy scrape model

2026-05-01 21:06:35 +00:00 · 130fc5eeda
commit 130fc5eeda
parent 2b3a552358
1 changed file with 15 additions and 10 deletions
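
Reviewer note: the contract in this diff has monitor sync write a Prometheus `file_sd` inventory that the `game-dev-alloy` job scrapes at `container-ip:12345`. The following is a minimal sketch of that step, not the project's actual sync script. The discovery record shape (`vmid`, `ip`, `container_type` keys), the function name `build_file_sd`, and the sample IPs are all assumptions made for illustration; only the port and the label names come from the contract.

```python
import json

ALLOY_PORT = 12345  # Alloy HTTP port named in the contract


def build_file_sd(containers):
    """Convert discovery records into Prometheus file_sd target groups.

    `containers` is a hypothetical payload: a list of dicts with
    `vmid`, `ip`, and `container_type` keys.
    """
    groups = []
    # Sort for stable output so add/remove changes are easy to review.
    for c in sorted(containers, key=lambda c: str(c["vmid"])):
        groups.append({
            "targets": [f'{c["ip"]}:{ALLOY_PORT}'],
            # file_sd label values must be strings. Prometheus fills in
            # `instance` from the target address unless it is relabeled.
            "labels": {
                "vmid": str(c["vmid"]),
                "container_type": c["container_type"],
            },
        })
    return groups


if __name__ == "__main__":
    # Sample records with made-up addresses, for illustration only.
    sample = [
        {"vmid": 201, "ip": "10.60.1.21", "container_type": "game"},
        {"vmid": 105, "ip": "10.60.1.5", "container_type": "dev"},
    ]
    print(json.dumps(build_file_sd(sample), indent=2))
```

A real sync would presumably write this JSON atomically (write to a temporary file, then rename) so Prometheus never reads a partial inventory, which also keeps the stale `file_sd` age alert meaningful.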
--- a/Codex/Monitoring/CONTRACT.md
+++ b/Codex/Monitoring/CONTRACT.md
@@ -24,12 +24,13 @@ The API is the discovery and metadata/control-plane source. It tells the monitor

 The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.

-That means there are two related but separate paths:
+There are two related but separate paths:

 ```text
 API discovery
  -> monitor sync
  -> file_sd target inventory / add-remove validation
+  -> game-dev-alloy scrape health

 container Alloy
  -> remote_write
@@ -37,9 +38,9 @@ container Alloy
  -> Grafana dashboards
 ```

-Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.
+Container Alloy intentionally listens on `0.0.0.0:12345` for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the `game-dev-alloy` job. This direct scrape proves collector reachability and target add/remove behavior.

-Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.

 ## Explicit exceptions

@@ -59,7 +60,7 @@ Preferred launch model:
 API discovery endpoint
  -> monitor sync script
  -> Prometheus file_sd inventory
-  -> add/remove validation and optional direct scrape targets
+  -> add/remove validation and direct Alloy scrape targets
 ```

 Expected behavior:
@@ -88,6 +89,8 @@ Acceptance evidence for a game/dev container should include real time series suc
 - memory metrics
 - expected labels such as `vmid`, `instance`, and `container_type`

+Direct Alloy scrape health should also be visible through the `game-dev-alloy` Prometheus job for current lifecycle targets.
+
 ## API discovery requirements

 API-owned monitoring discovery endpoints must:
@@ -145,7 +148,7 @@ Agent/container-side monitoring responsibilities:
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
 - configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
-- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
+- configure container Alloy HTTP to listen on `0.0.0.0:12345` for the intended internal monitoring networks
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually ship selected container logs through Alloy to centralized Loki
@@ -161,6 +164,7 @@ The monitoring host must:
 - alert on failed discovery sync
 - alert on stale file_sd age
 - alert on missing/stale remote-write metrics by VMID/type
+- alert on `game-dev-alloy` scrape target down by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 - host or reach the centralized Loki service once log collection is enabled
@@ -174,7 +178,8 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
-- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
+- dashboards based on remote-write metric freshness for game/dev container metrics
+- direct scrape health panels or queries for `game-dev-alloy` target health
 - Loki datasource installed once centralized logs are enabled

 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
@@ -183,7 +188,7 @@ Dashboard JSON existing on disk is not sufficient; dashboards must be visible in

 Loki should be centralized, not installed per game/dev container.

-Preferred launch direction:
+Preferred future direction:

 ```text
 container/app logs
@@ -202,7 +207,9 @@ Container Alloy should tail only selected, useful logs and attach stable labels

 Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.

-Launch-ready debugging requires centralized logs for at least:
+Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.
+
+When logs are added, prioritize:

 - API application logs
 - Agent logs
@@ -223,8 +230,6 @@ Initial container log candidates:

 Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.

-Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.
-
 ## Security contract

 Required before launch: