From e355bbb52a092a53a71bdf1b900dbacf4a7c339f Mon Sep 17 00:00:00 2001
From: jester <admin@zerolaghub.com>
Date: Fri, 1 May 2026 19:38:06 +0000
Subject: [PATCH] Clarify remote-write as canonical monitoring path

---
 Codex/Monitoring/CONTRACT.md | 64 +++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 9 deletions(-)

diff --git a/Codex/Monitoring/CONTRACT.md b/Codex/Monitoring/CONTRACT.md
index 3c413ac..a2c2f0c 100644
--- a/Codex/Monitoring/CONTRACT.md
+++ b/Codex/Monitoring/CONTRACT.md
@@ -16,6 +16,31 @@ The monitoring system must support launch debugging for:
 - provisioning, backup, restore, delete, and code-server/debug workflows
 - stale target detection
 
+## Not API-only
+
+Monitoring is not an API-only design.
+
+The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.
+
+The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.
+
+That means there are two related but separate paths:
+
+```text
+API discovery
+  -> monitor sync
+  -> file_sd target inventory / add-remove validation
+
+container Alloy
+  -> remote_write
+  -> Prometheus time series
+  -> Grafana dashboards
+```
+
+Prometheus scraping `container-ip:12345` is not canonical: while container Alloy listens on `127.0.0.1:12345`, the listener is unreachable from the Prometheus host and such scrapes fail. If direct scrape health for each container Alloy is wanted later, the container Alloy HTTP listener must be deliberately bound to `0.0.0.0:12345` or another reachable address, and the security model must be reviewed.
+
+Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+
 ## Explicit exceptions
 
 The game/dev Alloy contract does not replace every platform-specific monitoring path.
@@ -33,17 +58,36 @@ Preferred launch model:
 ```text
 API discovery endpoint
   -> monitor sync script
-  -> Prometheus file_sd
-  -> Grafana dashboards / alerts
+  -> Prometheus file_sd inventory
+  -> add/remove validation and optional direct scrape targets
 ```
 
 Expected behavior:
 
-- active game/dev containers appear automatically
-- deleted game/dev containers disappear automatically
+- active game/dev containers appear automatically in discovery/file_sd
+- deleted game/dev containers disappear automatically from discovery/file_sd
 - stale file_sd entries are removed once sync is healthy
 - discovery failures alert loudly and do not silently preserve misleading stale state
 
+Note: discovery/file_sd proves target inventory and add/remove behavior. On its own it does not prove that metrics are actually delivered over the canonical remote-write path.
+
+## Metrics path
+
+Canonical launch path for game/dev container metrics:
+
+```text
+container Alloy remote_write
+  -> Prometheus at 10.60.0.25:9090
+  -> Grafana dashboards
+```
+
+Acceptance evidence for a game/dev container should include real time series and labels such as:
+
+- `node_uname_info`
+- CPU metrics
+- memory metrics
+- expected labels such as `vmid`, `instance`, and `container_type`
+
 ## API discovery requirements
 
 API-owned monitoring discovery endpoints must:
@@ -57,7 +101,7 @@ API-owned monitoring discovery endpoints must:
 
 ## Required labels
 
-For game/dev targets, standardize on the following required labels:
+For game/dev targets/time series, standardize on the following required labels:
 
 - `vmid`
 - `instance`
@@ -100,7 +144,8 @@ Agent/container-side monitoring responsibilities:
 
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
-- keep `/etc/default/alloy` fixed to `/etc/alloy/config.alloy` on port `12345`
+- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
+- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually emit structured logs suitable for Loki/Alloy log ingestion
@@ -109,12 +154,12 @@ Agent/container-side monitoring responsibilities:
 
 The monitoring host must:
 
-- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
+- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
 - run discovery sync reliably
 - provision dashboards into Grafana, not merely store dashboard JSON on disk
 - alert on failed discovery sync
 - alert on stale file_sd age
-- alert on target-down by VMID/type
+- alert on missing/stale remote-write metrics by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 
@@ -127,6 +172,7 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
+- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
 
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
 
@@ -149,7 +195,7 @@ Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debu
 
 Required before launch:
 
-- Prometheus not publicly reachable
+- Prometheus not publicly reachable, but still reachable over the intended container remote-write network paths
 - Grafana not publicly reachable except through intended authenticated/admin path
 - node_exporter not publicly reachable
 - API monitoring endpoints fail closed without token