Clarify remote-write as canonical monitoring path
parent 840394eb8e
commit e355bbb52a
@@ -16,6 +16,31 @@ The monitoring system must support launch debugging for:
 - provisioning, backup, restore, delete, and code-server/debug workflows
 - stale target detection
 
+## Not API-only
+
+Monitoring is not an API-only design.
+
+The API is the discovery and metadata/control-plane source. It tells the monitoring stack which game/dev containers exist and what labels/metadata they should carry.
+
+The canonical metrics transport for game/dev containers is Alloy remote-write from each container into Prometheus on `10.60.0.25:9090`.
+
+That means there are two related but separate paths:
+
+```text
+API discovery
+-> monitor sync
+-> file_sd target inventory / add-remove validation
+
+container Alloy
+-> remote_write
+-> Prometheus time series
+-> Grafana dashboards
+```
+
+Prometheus scraping `container-ip:12345` is not canonical while container Alloy listens on `127.0.0.1:12345`. If direct scrape health for each container Alloy is desired later, the container Alloy HTTP listener must intentionally bind to `0.0.0.0:12345` or another reachable address and the security model must be reviewed.
+
+Until that decision changes, dashboard health should rely on received remote-write metrics and expected time-series freshness, not `game-dev-alloy` scrape-up status.
+
 ## Explicit exceptions
 
 The game/dev Alloy contract does not replace every platform-specific monitoring path.
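Review note: the "freshness, not scrape-up" rule in this hunk could be checked roughly as follows. This is a minimal sketch rather than the project's implementation; the function names, the input shape (a map of `vmid` to the Unix timestamp of its newest received sample), and the 120-second threshold are all assumptions made for illustration.

```python
from typing import Dict, List


def stale_vmids(last_sample_ts: Dict[str, float], now: float,
                max_age_s: float = 120.0) -> List[str]:
    """Return VMIDs whose newest remote-write sample is older than max_age_s.

    max_age_s is a hypothetical threshold, not a value from the spec.
    """
    return sorted(
        vmid for vmid, ts in last_sample_ts.items() if now - ts > max_age_s
    )


def missing_vmids(expected: List[str],
                  last_sample_ts: Dict[str, float]) -> List[str]:
    """Return VMIDs known to discovery that have delivered no samples at all."""
    return sorted(set(expected) - set(last_sample_ts))
```

Keeping "stale" and "missing" separate mirrors the two paths described above: discovery says which containers should exist, and remote-write freshness says whether their metrics are actually arriving.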
@@ -33,17 +58,36 @@ Preferred launch model:
 ```text
 API discovery endpoint
 -> monitor sync script
--> Prometheus file_sd
--> Grafana dashboards / alerts
+-> Prometheus file_sd inventory
+-> add/remove validation and optional direct scrape targets
 ```
 
 Expected behavior:
 
-- active game/dev containers appear automatically
-- deleted game/dev containers disappear automatically
+- active game/dev containers appear automatically in discovery/file_sd
+- deleted game/dev containers disappear automatically from discovery/file_sd
 - stale file_sd entries are removed once sync is healthy
 - discovery failures alert loudly and do not silently preserve misleading stale state
 
+Note: discovery/file_sd proves target inventory and add/remove behavior. It does not by itself prove canonical metrics delivery when remote-write is the chosen metrics path.
+
+## Metrics path
+
+Canonical launch path for game/dev container metrics:
+
+```text
+container Alloy remote_write
+-> Prometheus at 10.60.0.25:9090
+-> Grafana dashboards
+```
+
+Acceptance evidence for a game/dev container should include real time series such as:
+
+- `node_uname_info`
+- CPU metrics
+- memory metrics
+- expected labels such as `vmid`, `instance`, and `container_type`
+
 ## API discovery requirements
 
 API-owned monitoring discovery endpoints must:
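Review note: the acceptance evidence listed in this hunk could be verified mechanically. The sketch below assumes the series for a container are available as plain label maps with a `__name__` key, which is how Prometheus represents label sets; the function name and the choice to check only `node_uname_info` by name are illustrative assumptions, since the spec does not name the CPU and memory metrics.

```python
from typing import Dict, Iterable, List

# Labels the spec lists as expected on game/dev time series.
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def acceptance_gaps(series: Iterable[Dict[str, str]], vmid: str) -> List[str]:
    """List what is missing from the acceptance evidence for one container."""
    own = [s for s in series if s.get("vmid") == vmid]
    gaps: List[str] = []
    # The spec names node_uname_info explicitly as evidence of real delivery.
    if not any(s.get("__name__") == "node_uname_info" for s in own):
        gaps.append("no node_uname_info series")
    # Every series received for this container should carry the label contract.
    for s in own:
        for label in REQUIRED_LABELS:
            if not s.get(label):
                gaps.append(f"{s.get('__name__', '?')} missing label {label}")
    return gaps
```

An empty result means the checked evidence is present; it does not by itself confirm that CPU and memory metrics exist, which would need the concrete metric names.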
@@ -57,7 +101,7 @@ API-owned monitoring discovery endpoints must:
 
 ## Required labels
 
-For game/dev targets, standardize on the following required labels:
+For game/dev targets/time series, standardize on the following required labels:
 
 - `vmid`
 - `instance`
@@ -100,7 +144,8 @@ Agent/container-side monitoring responsibilities:
 
 - keep Alloy installed/prepared in game/dev templates
 - remove legacy `node-exporter` from game/dev templates once Alloy is standard
-- keep `/etc/default/alloy` fixed to `/etc/alloy/config.alloy` on port `12345`
+- configure container Alloy remote-write to Prometheus at `10.60.0.25:9090`
+- keep local Alloy HTTP bound to loopback unless direct scrape health is intentionally adopted
 - write or refresh Alloy config after `POST /config`
 - keep labels minimal and consistent with dashboards/discovery
 - eventually emit structured logs suitable for Loki/Alloy log ingestion
@@ -109,12 +154,12 @@ Agent/container-side monitoring responsibilities:
 
 The monitoring host must:
 
-- restrict Prometheus, Grafana, and node_exporter exposure to trusted admin/monitoring paths
+- restrict Prometheus and Grafana exposure to trusted admin/monitoring paths while still allowing container Alloy remote-write to Prometheus
 - run discovery sync reliably
 - provision dashboards into Grafana, not merely store dashboard JSON on disk
 - alert on failed discovery sync
 - alert on stale file_sd age
-- alert on target-down by VMID/type
+- alert on missing/stale remote-write metrics by VMID/type
 - alert on API discovery endpoint failure
 - store bearer tokens with tight permissions or systemd credentials
 
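Review note: for the discovery sync items in this hunk, one way to get "deleted containers disappear" and "failures do not silently preserve stale state" is to re-render the whole file_sd document from each discovery response and refuse to write on any malformed entry. The sketch below assumes a discovery payload shaped as a list of objects with `ip`, `vmid`, `instance`, and `container_type` fields; those field names, the `DiscoveryError` type, and `render_file_sd` are hypothetical. The output follows the Prometheus file_sd JSON shape (a list of groups with `targets` and `labels`).

```python
import json
from typing import Dict, List


class DiscoveryError(RuntimeError):
    """Raised when a discovery payload cannot be trusted (hypothetical type)."""


def render_file_sd(containers: List[Dict[str, str]], port: int = 12345) -> str:
    """Render a complete file_sd document from the current discovery result.

    The document is rebuilt from scratch each time, so containers absent from
    discovery drop out of the inventory instead of lingering as stale entries.
    """
    groups = []
    for c in containers:
        missing = [k for k in ("ip", "vmid", "instance", "container_type")
                   if not c.get(k)]
        if missing:
            # Fail loudly instead of emitting a partial or misleading inventory.
            raise DiscoveryError(f"container entry missing {missing}: {c}")
        groups.append({
            "targets": [f"{c['ip']}:{port}"],
            "labels": {
                "vmid": c["vmid"],
                "instance": c["instance"],
                "container_type": c["container_type"],
            },
        })
    return json.dumps(groups, indent=2, sort_keys=True)
```

A caller would typically write the returned string to a temporary file and rename it into place, and treat a raised `DiscoveryError` as the trigger for the "failed discovery sync" alert. That part is left out here because the file paths and alerting mechanism are not specified in this commit.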
@@ -127,6 +172,7 @@ Grafana must have:
 - game container dashboard installed
 - dev container dashboard installed
 - dashboard queries aligned with the canonical label contract
+- dashboards based on remote-write metric freshness for game/dev container health unless direct scrape health is intentionally enabled
 
 Dashboard JSON existing on disk is not sufficient; dashboards must be visible in Grafana.
 
@@ -149,7 +195,7 @@ Preferred direction: Loki or Alloy log ingestion. Until this exists, launch-debu
 
 Required before launch:
 
-- Prometheus not publicly reachable
+- Prometheus not publicly reachable, while still reachable from intended container remote-write network paths
 - Grafana not publicly reachable except through intended authenticated/admin path
 - node_exporter not publicly reachable
 - API monitoring endpoints fail closed without token