Align monitoring contract with Alloy scrape model
parent 2b3a552358
commit 130fc5eeda
@@ -24,12 +24,13 @@ The API is the discovery and metadata/control-plane source. It tells the monitor
 
 The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.
 
-That means there are two related but separate paths:
+There are two related but separate paths:
 
 ```text
 API discovery
 -> monitor sync
 -> file_sd target inventory / add-remove validation
+-> game-dev-alloy scrape health
 
 container Alloy
 -> remote_write
@@ -37,9 +38,9 @@ container Alloy
 -> Grafana dashboards
 ```
 
-Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.
+Container Alloy intentionally listens on `0.0.0.0:12345` for Linux-managed game/dev containers so Prometheus can scrape Alloy health by container IP through the `game-dev-alloy` job. This direct scrape proves collector reachability and target add/remove behavior.
 
-Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+Remote-write remains the canonical metrics delivery path. Dashboards may use direct scrape health for target presence and collector health, but workload/system metrics should be validated from received time series.
 
 ## Explicit exceptions
 
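The hunk above separates two signals: the direct `game-dev-alloy` scrape shows that the collector is reachable, while remote-write freshness shows that metrics are actually arriving. A minimal Python sketch of how the two could be combined into one status follows. The `TargetState` type, the `classify` helper, the status names, and the 120-second freshness window are assumptions made for illustration; none of them come from the contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetState:
    # Result of the direct game-dev-alloy scrape for this container.
    scrape_up: bool
    # Age in seconds of the newest remote-write sample; None if none was seen.
    last_sample_age_s: Optional[float]


def classify(state: TargetState, max_age_s: float = 120.0) -> str:
    """Combine scrape health and remote-write freshness into one status.

    max_age_s is a hypothetical freshness window, not a documented value.
    """
    fresh = state.last_sample_age_s is not None and state.last_sample_age_s <= max_age_s
    if state.scrape_up and fresh:
        return "healthy"
    if state.scrape_up:
        # Collector is reachable, but metrics are not being delivered.
        return "metrics_stale"
    if fresh:
        # Metrics still arrive, so the HTTP listener or network path is suspect.
        return "scrape_down"
    return "down"
```

Keeping the two inputs separate mirrors the contract: scrape status alone never marks workload metrics as valid, and fresh metrics alone do not hide an unreachable collector.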
@@ -59,7 +60,7 @@ Preferred launch model:
 API discovery endpoint
 -> monitor sync script
 -> Prometheus file_sd inventory
--> add/remove validation and optional direct scrape targets
+-> add/remove validation and direct Alloy scrape targets
 ```
 
 Expected behavior:
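The flow in the hunk above (discovery endpoint, sync script, file_sd inventory, add/remove validation) could be sketched roughly as below. Prometheus file_sd files are JSON lists of target groups, each with `targets` and `labels`; everything else here is an assumption for illustration, including the record fields (`vmid`, `ip`, `container_type`), the helper names, and the sample addresses. The real API response shape may differ.

```python
import json

ALLOY_HTTP_PORT = 12345  # port the contract names for the container Alloy listener


def build_file_sd(containers):
    """Turn discovery records into Prometheus file_sd target groups.

    Each record is assumed to be a dict with "vmid", "ip", and
    "container_type" keys (hypothetical field names).
    """
    groups = []
    for c in sorted(containers, key=lambda c: str(c["vmid"])):
        groups.append({
            "targets": [f'{c["ip"]}:{ALLOY_HTTP_PORT}'],
            "labels": {
                "vmid": str(c["vmid"]),
                "container_type": c["container_type"],
            },
        })
    return groups


def diff_targets(previous, current):
    """Return (added, removed) target addresses between two inventories."""
    before = {t for g in previous for t in g["targets"]}
    after = {t for g in current for t in g["targets"]}
    return sorted(after - before), sorted(before - after)


if __name__ == "__main__":
    # Placeholder records; the addresses are made up for the example.
    old = build_file_sd([{"vmid": 101, "ip": "10.60.1.11", "container_type": "game"}])
    new = build_file_sd([
        {"vmid": 101, "ip": "10.60.1.11", "container_type": "game"},
        {"vmid": 102, "ip": "10.60.1.12", "container_type": "dev"},
    ])
    print(json.dumps(new, indent=2))
    print(diff_targets(old, new))
```

Comparing the previous and current inventories gives the sync script an explicit added/removed list, which is one way to produce the add/remove validation evidence the contract asks for.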
@@ -88,6 +89,8 @@ Acceptance evidence for a game/dev container should include real time series suc
 - memory metrics
 - expected labels such as `vmid`, `instance`, and `container_type`
 
+Direct Alloy scrape health should also be visible through the `game-dev-alloy` Prometheus job for current lifecycle targets.
+
 ## API discovery requirements
 
 API-owned monitoring discovery endpoints must:
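The acceptance evidence in the hunk above asks for expected labels on received series. A small Python sketch of that check follows, assuming each series has already been fetched and reduced to a plain dict of label names to values. The `missing_labels` helper and that input shape are hypothetical; only the three label names come from the contract.

```python
# Label names the contract lists as expected on game/dev container series.
EXPECTED_LABELS = ("vmid", "instance", "container_type")


def missing_labels(series):
    """Map each series index to the expected labels it lacks or leaves empty."""
    problems = {}
    for index, labels in enumerate(series):
        missing = [name for name in EXPECTED_LABELS if not labels.get(name)]
        if missing:
            problems[index] = missing
    return problems
```

An empty result means every inspected series carries all three labels; any other result points at the series that would break dashboard queries built on the label contract.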
@@ -145,7 +148,7 @@ Agent/container-side monitoring responsibilities:
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
 - configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
-- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
+- configure container Alloy HTTP to listen on `0.0.0.0:12345` for the intended internal monitoring networks
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually ship selected container logs through Alloy to centralized Loki
@@ -161,6 +164,7 @@ The monitoring host must:
 - alert on failed discovery sync
 - alert on stale file_sd age
 - alert on missing/stale remote-write metrics by VMID/type
+- alert on `game-dev-alloy` scrape target down by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 - host or reach the centralized Loki service once log collection is enabled
@@ -174,7 +178,8 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
-- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
+- dashboards based on remote-write metric freshness for game/dev container metrics
+- direct scrape health panels or queries for `game-dev-alloy` target health
 - Loki datasource installed once centralized logs are enabled
 
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
@@ -183,7 +188,7 @@ Dashboard JSON existing on disk is not sufficient; dashboards must be visible in
 
 Loki should be centralized, not installed per game/dev container.
 
-Preferred launch direction:
+Preferred future direction:
 
 ```text
 container/app logs
@@ -202,7 +207,9 @@ Container Alloy should tail only selected, useful logs and attach stable labels
 
 Do not turn every container into a Loki server. Loki belongs on the monitoring/logging side as shared infrastructure. Containers should run Alloy as the lightweight collector/shipper.
 
-Launch-ready debugging requires centralized logs for at least:
+Centralized logs are not enabled yet. This is a known follow-up unless launch policy requires log search from Grafana before release.
+
+When logs are added, prioritize:
 
 - API application logs
 - Agent logs
@@ -223,8 +230,6 @@ Initial container log candidates:
 
 Avoid collecting noisy or sensitive logs by default. Review anything that may include customer secrets, tokens, console input, file contents, or environment variables before shipping to Loki.
 
-Until centralized Loki/Alloy log shipping exists, launch-debug readiness remains incomplete.
-
 ## Security contract
 
 Required before launch: