From e355bbb52a092a53a71bdf1b900dbacf4a7c339f Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 19:38:06 +0000
Subject: [PATCH] Clarify remote-write as canonical monitoring path

---
 Codex/Monitoring/CONTRACT.md | 64 +++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 9 deletions(-)

diff --git a/Codex/Monitoring/CONTRACT.md b/Codex/Monitoring/CONTRACT.md
index 3c413ac..a2c2f0c 100644
--- a/Codex/Monitoring/CONTRACT.md
+++ b/Codex/Monitoring/CONTRACT.md
@@ -16,6 +16,31 @@ The monitoring system must support launch debugging for:
 - provisioning, backup, restore, delete, and code-server/debug workflows
 - stale target detection
 
+## Not API-only
+
+Monitoring is not an API-only design.
+
+The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.
+
+The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.
+
+That means there are two related but separate paths:
+
+```text
+API discovery
+  -> monitor sync
+  -> file_sd target inventory / add-remove validation
+
+container Alloy
+  -> remote_write
+  -> Prometheus time series
+  -> Grafana dashboards
+```
+
+Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.
+
+Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+
 ## Explicit exceptions
 
 The game/dev Alloy contract does not replace every platform-specific monitoring path.
@@ -33,17 +58,36 @@ Preferred launch model:
 ```text
 API discovery endpoint
   -> monitor sync script
-  -> Prometheus file_sd
-  -> Grafana dashboards / alerts
+  -> Prometheus file_sd inventory
+  -> add/remove validation and optional direct scrape targets
 ```
 
 Expected behavior:
 
-- active game/dev containers appear automatically
-- deleted game/dev containers disappear automatically
+- active game/dev containers appear automatically in discovery/file_sd
+- deleted game/dev containers disappear automatically from discovery/file_sd
 - stale file_sd entries are removed once sync is healthy
 - discovery failures alert loudly and do not silently preserve misleading stale state
 
+Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
+
+## Metrics path
+
+Canonical launch path for game/dev container metrics:
+
+```text
+container Alloy remote_write
+  -> Prometheus at 10.60.0.25:9090
+  -> Grafana dashboards
+```
+
+Acceptance evidence for a game/dev container should include real time series such as:
+
+- `node_uname_info`
+- CPU metrics
+- memory metrics
+- expected labels such as `vmid`, `instance`, and `container_type`
+
 ## API discovery requirements
 
 API-owned monitoring discovery endpoints must:
@@ -57,7 +101,7 @@ API-owned monitoring discovery endpoints must:
 
 ## Required labels
 
-For game/dev targets, standardize on the following required labels:
+For game/dev targets/time series, standardize on the following required labels:
 
 - `vmid`
 - `instance`
@@ -100,7 +144,8 @@ Agent/container-side monitoring responsibilities:
 
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
-- keep `/etc/default/alloy` fixed to `/etc/alloy/config.alloy` on port `12345`
+- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
+- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually emit structured logs suitable for Loki/Alloy log ingestion
@@ -109,12 +154,12 @@ Agent/container-side monitoring responsibilities:
 
 The monitoring host must:
 
-- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
+- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
 - run discovery sync reliably
 - provision dashboards into Grafana, not merely store dashboard JSON on disk
 - alert on failed discovery sync
 - alert on stale file_sd age
-- alert on target-down by VMID/type
+- alert on missing/stale remote-write metrics by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 
@@ -127,6 +172,7 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
+- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
 
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
 
@@ -149,7 +195,7 @@ Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debu
 
 Required before launch:
 
-- Prometheus not publicly reachable
+- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
 - Grafana not publicly reachable except through intended authenticated/admin path
 - node_exporter not publicly reachable
 - API monitoring endpoints fail closed without token
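
The patch replaces the target-down alert with an alert on missing/stale remote-write metrics by VMID/type. The freshness check this implies could be sketched roughly as below. This is only an illustration under stated assumptions: `Target`, `classify_freshness`, and the 120-second `max_age_s` threshold are hypothetical and not part of the contract, and how the last-sample timestamps are obtained from Prometheus is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One expected game/dev container from the discovery inventory (hypothetical shape)."""
    vmid: str
    container_type: str


def classify_freshness(expected, last_sample_ts, now, max_age_s=120.0):
    """Classify each expected container by the age of its newest remote-write sample.

    expected        -- iterable of Target taken from the discovery/file_sd inventory
    last_sample_ts  -- dict mapping vmid -> unix timestamp of the newest received sample
    now             -- current unix timestamp
    max_age_s       -- assumed staleness threshold in seconds (not defined by the contract)
    """
    status = {}
    for target in expected:
        ts = last_sample_ts.get(target.vmid)
        if ts is None:
            # In inventory, but no time series was ever received for this VMID.
            status[target.vmid] = "missing"
        elif now - ts > max_age_s:
            # Series exists, but the newest sample is older than the threshold.
            status[target.vmid] = "stale"
        else:
            status[target.vmid] = "fresh"
    return status
```

Keeping the inventory (`expected`) separate from the received samples (`last_sample_ts`) mirrors the two paths the patch describes: discovery says which containers should exist, and remote-write shows whether their metrics are actually arriving.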