Align monitoring contract with Alloy scrape model

jester 2026-05-01 21:06:35 +00:00
parent 2b3a552358
commit 130fc5eeda


@@ -24,12 +24,13 @@ The API is the discovery and metadata/control-plane source. It tells the monitor
 The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.
-That means there are two related but separate paths:
+There are two related but separate paths:
 ```text
 API discovery
 -> monitor sync
 -> file_sd target inventory / add-remove validation
+-> game-dev-alloy scrape health
 container Alloy
 -> remote_write
@@ -37,9 +38,9 @@ container Alloy
 -> Grafana dashboards
 ```
-Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.
+Container Alloy intentionally listens on `0.0.0.0:12345` for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the `game-dev-alloy` job. This direct scrape proves collector reachability and target add/remove behavior.
-Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.
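The two signals described in this hunk (direct scrape health and remote-write freshness) could be combined roughly as follows. This is a minimal sketch, not part of the contract: the function name, the status strings, and the 120-second freshness limit are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical freshness limit; the contract does not specify a value.
FRESHNESS_LIMIT_SECONDS = 120.0


def classify_container(scrape_up: bool, last_sample_age: Optional[float]) -> str:
    """Combine direct scrape health with remote-write freshness.

    scrape_up: whether the game-dev-alloy scrape of the container succeeded.
    last_sample_age: seconds since the newest remote-write sample, or None
    if no series has been received for this container.
    """
    fresh = last_sample_age is not None and last_sample_age <= FRESHNESS_LIMIT_SECONDS
    if scrape_up and fresh:
        return "healthy"
    if scrape_up:
        # Collector is reachable, but the canonical delivery path is stale.
        return "collector-up-metrics-stale"
    if fresh:
        # Metrics still arrive, but the direct scrape cannot reach Alloy.
        return "metrics-ok-scrape-down"
    return "down"
```

Keeping the two inputs separate mirrors the contract: scrape status speaks to target presence and collector health, while freshness speaks to whether workload/system metrics are actually being delivered.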
 ## Explicit exceptions
@@ -59,7 +60,7 @@ Preferred launch model:
 API discovery endpoint
 -> monitor sync script
 -> Prometheus file_sd inventory
--> add/remove validation and optional direct scrape targets
+-> add/remove validation and direct Alloy scrape targets
 ```
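The discovery-to-file_sd step in the flow above might look something like the sketch below. The shape of the discovery records (`ip`, `vmid`, `container_type` keys) and the function names are assumptions for illustration; only the `targets`/`labels` structure of a file_sd target group and the `12345` port come from Prometheus and this contract.

```python
import json
from typing import Any, Dict, List

ALLOY_PORT = 12345  # container Alloy HTTP port named in this contract


def build_file_sd(containers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert discovery records into Prometheus file_sd target groups.

    Each record is assumed (hypothetically) to carry "ip", "vmid", and
    "container_type" keys; the real discovery payload may differ.
    """
    groups = []
    # Sort so that the rendered file is stable between sync runs.
    for c in sorted(containers, key=lambda c: str(c["vmid"])):
        groups.append({
            "targets": [f"{c['ip']}:{ALLOY_PORT}"],
            # file_sd label values must be strings.
            "labels": {
                "vmid": str(c["vmid"]),
                "container_type": c["container_type"],
            },
        })
    return groups


def render_file_sd(containers: List[Dict[str, Any]]) -> str:
    """Serialize the target groups as JSON text for a file_sd file."""
    return json.dumps(build_file_sd(containers), indent=2, sort_keys=True)
```

Because the output is rebuilt from the discovery response on every run, a container that disappears from discovery also disappears from the inventory, which is the add/remove behavior the direct scrape is meant to validate.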
 Expected behavior:
@@ -88,6 +89,8 @@ Acceptance evidence for a game/dev container should include real time series suc
 - memory metrics
 - expected labels such as `vmid`, `instance`, and `container_type`
+Direct Alloy scrape health should also be visible through the `game-dev-alloy` Prometheus job for current lifecycle targets.
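One way to check the label part of this acceptance evidence is a small helper that reports which expected labels are absent from a received series. This is a sketch only: the helper name is hypothetical, and the label list is limited to the three examples named above, which the contract presents as examples rather than a complete set.

```python
from typing import Dict, List

# The three labels given as examples in the contract; the full set may be larger.
EXPECTED_LABELS = ("vmid", "instance", "container_type")


def missing_labels(series_labels: Dict[str, str]) -> List[str]:
    """Return the expected label names that are absent or empty on a series."""
    return [name for name in EXPECTED_LABELS if not series_labels.get(name)]
```

An empty result means the series carries all three labels; any names returned point at a gap between the container's Alloy config and the dashboard/discovery label contract.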
 ## API discovery requirements
 API-owned monitoring discovery endpoints must:
@@ -145,7 +148,7 @@ Agent/container-side monitoring responsibilities:
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
 - configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
-- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
+- configure container Alloy HTTP to listen on `0.0.0.0:12345` for the intended internal monitoring networks
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually ship selected container logs through Alloy to centralized Loki
@@ -161,6 +164,7 @@ The monitoring host must:
 - alert on failed discovery sync
 - alert on stale file_sd age
 - alert on missing/stale remote-write metrics by VMID/type
+- alert on `game-dev-alloy` scrape target down by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 - host or reach the centralized Loki service once log collection is enabled
@@ -174,7 +178,8 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
-- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
+- dashboards based on remote-write metric freshness for game/dev container metrics
+- direct scrape health panels or queries for `game-dev-alloy` target health
 - Loki datasource installed once centralized logs are enabled
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
@@ -183,7 +188,7 @@ Dashboard JSON existing on disk is not sufficient; dashboards must be visible in
 Loki should be centralized, not installed per game/dev container.
-Preferred launch direction:
+Preferred future direction:
 ```text
 container/app logs
@@ -202,7 +207,9 @@ Container Alloy should tail only selected, useful logs and attach stable labels
 Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.
-Launch-ready debugging requires centralized logs for at least:
+Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.
+When logs are added, prioritize:
 - API application logs
 - Agent logs
@@ -223,8 +230,6 @@ Initial container log candidates:
 Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.
-Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.
 ## Security contract
 Required before launch: