Record remote-write smoke test evidence
parent e355bbb52a
commit 3b74817e4e
@@ -4,9 +4,9 @@
 
 **Status: PARTIAL FAIL for launch monitoring readiness.**
 
-Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.
+Core monitoring services are running. Game/dev dynamic discovery works, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, Grafana dashboard provisioning is live from `/etc/zlh-monitor`, and the important game/dev metrics path now works through container Alloy remote-write.
 
-Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
+Launch-debug readiness is still blocked by final exposure verification/lockdown, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to finish the delete half of the game/dev lifecycle smoke test.
 
 ## Runtime source of truth
 
@@ -26,12 +26,28 @@ Grafana now provisions from:
 
 `/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
 
+## Canonical metrics model
+
+Monitoring is not API-only.
+
+The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. The canonical metrics path for game/dev containers is now:
+
+```text
+container Alloy remote_write
+-> Prometheus on 10.60.0.25:9090
+-> Grafana dashboards
+```
+
+The `game-dev-alloy` file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on `127.0.0.1:12345` and Prometheus file_sd scrapes `container-ip:12345`.
+
+If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as `0.0.0.0:12345`, with security review.
+
 ## Recently verified progress
 
 Container discovery is clean:
 
-- `/sd/exporters -> []`
-- `/monitoring/game-dev -> containers: []`
+- `/sd/exporters -> []` during empty state
+- `/monitoring/game-dev -> containers: []` during empty state
 - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
 - one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
 - Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
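The empty-state sync recorded in the hunk above can be illustrated with a minimal Python sketch. It is not the actual sync script: the discovery field names (`vmid`, `name`, `ip`, `type`), the label set, and the function names are assumptions made for illustration. Only the port `12345`, the `ALLOW_EMPTY` behaviour, and the Prometheus file_sd shape (a JSON list of groups with `targets` and `labels`) come from the document or from Prometheus itself.

```python
import json

ALLOY_PORT = 12345  # container Alloy HTTP port named in the document


def build_file_sd(containers):
    """Turn a discovery container list into Prometheus file_sd target groups.

    The keys "vmid", "name", "ip" and "type" are assumed field names,
    not the confirmed shape of the discovery payload.
    """
    groups = []
    for c in containers:
        groups.append({
            "targets": [f"{c['ip']}:{ALLOY_PORT}"],
            "labels": {
                "vmid": str(c["vmid"]),  # label values must be strings
                "name": c["name"],
                "type": c["type"],
            },
        })
    return groups


def render_file_sd(containers, allow_empty=True):
    """Serialize target groups; an empty inventory yields a valid "[]" file."""
    groups = build_file_sd(containers)
    if not groups and not allow_empty:
        raise ValueError("discovery returned no containers and ALLOW_EMPTY is off")
    return json.dumps(groups, indent=2)
```

With `allow_empty=True`, an empty inventory serializes to `[]`, which Prometheus accepts as a file_sd file with zero targets. That matches the "wrote `0 targets`" result instead of a failed sync.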
@@ -47,10 +63,13 @@ Prometheus runtime config is clean:
 - no active cAdvisor job
 - no active `:9100` jobs
 
-Current Prometheus targets are clean:
+Runtime bind/target state after the remote-write fix:
 
-- `up alloy 127.0.0.1:12345`
-- `up prometheus localhost:9090`
+- Prometheus listens on `10.60.0.25:9090`
+- Prometheus self-scrape target is `10.60.0.25:9090`
+- Grafana remains loopback-only on `127.0.0.1:3000`
+- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
+- local node_exporter is disabled
 
 Grafana provisioning is live:
 
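The bind posture listed in the hunk above separates loopback-only listeners from a listener on a routable interface address. A small Python sketch of that classification follows. The function name and the three category names are hypothetical. It only interprets `host:port` strings such as those quoted in the document and does not inspect live sockets.

```python
import ipaddress


def classify_bind(listen):
    """Classify a "host:port" listener as 'wildcard', 'loopback' or 'interface'."""
    host, _, _port = listen.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::1]:3000
    if host == "*":
        return "wildcard"
    if host == "localhost":
        return "loopback"
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:  # 0.0.0.0 or ::
        return "wildcard"
    return "loopback" if addr.is_loopback else "interface"
```

An `interface` result, as for `10.60.0.25:9090`, only says the listener is not bound to every address. Whether it is reachable from untrusted networks still depends on routing and firewall policy.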
@@ -71,23 +90,55 @@ Services active after cleanup:
 - `alloy`
 - discovery timer
 
+## Game/dev metrics smoke evidence
+
+The important monitoring path works for current test containers.
+
+API discovery emits both test servers:
+
+- `6085 / dev-6085 / 10.100.0.23 / dev`
+- `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
+
+file_sd sync writes both targets, and Prometheus discovers both.
+
+Alloy remote-write from both containers works after Prometheus was rebound to `10.60.0.25:9090`.
+
+Dashboard-style metrics exist for both containers:
+
+- `node_uname_info`
+- CPU metrics
+- memory metrics
+
+Observed values:
+
+```text
+6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%
+5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%
+```
+
+Known design mismatch:
+
+- container Alloy listens on `127.0.0.1:12345`
+- Prometheus file_sd scrapes `container-ip:12345`
+- therefore `game-dev-alloy` scrape targets are discovered but down
+
+Current decision: use remote-write as canonical and do not rely on `game-dev-alloy` scrape health panels unless/until direct container Alloy scrape is intentionally enabled.
+
 ## Exposure state
 
-Observed listening ports from prior audit:
+Current reported bind posture:
 
-- `*:9090` Prometheus
-- `*:3000` Grafana
-- `*:9100` node_exporter
-- `127.0.0.1:12345` Alloy
+- Prometheus listens on `10.60.0.25:9090`
+- Grafana remains loopback-only on `127.0.0.1:3000`
+- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
+- local node_exporter is disabled
 
-Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown.
+Remaining exposure work:
 
-Host firewall posture from prior audit:
-
-- UFW inactive
-- nftables input/forward/output policy accept
+- verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet
+- verify Grafana is not reachable except via intended admin path
+- verify node_exporter is no longer listening or is firewalled if any listener remains
+- verify host firewall/nftables policy is appropriate
 
-This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
-
 ## Prometheus state
 
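The smoke evidence in the hunk above rests on `node_uname_info` being present for both test containers. One way to re-check that is an instant query against the Prometheus HTTP API (`/api/v1/query`), sketched below in Python. The `http` scheme and the identifying label `vmid` are assumptions, because the document does not list the labels attached to remote-written series. The helper names are hypothetical too.

```python
import json
from urllib.parse import urlencode

# Bind address reported in the document; the http scheme is an assumption.
PROMETHEUS = "http://10.60.0.25:9090"


def instant_query_url(expr, base=PROMETHEUS):
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def missing_containers(response_text, expected, label="vmid"):
    """Return expected label values absent from an instant-query vector result.

    `label` is a hypothetical identifying label; substitute whichever label
    the container Alloy configuration really attaches.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    seen = {series["metric"].get(label) for series in payload["data"]["result"]}
    return sorted(set(expected) - seen)
```

Fetching `instant_query_url("node_uname_info")` and passing the response body to `missing_containers(body, ["6085", "5206"])` would return an empty list when both containers are reporting.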
@@ -97,9 +148,8 @@ This remains a launch blocker until Prometheus, Grafana, and any remaining node_
 - `prometheus`
 - `alloy`
 - `game-dev-alloy`
-- Current targets:
-- `up alloy 127.0.0.1:12345`
-- `up prometheus localhost:9090`
+- Prometheus listens on `10.60.0.25:9090`
+- Prometheus self-scrape target is `10.60.0.25:9090`
 - `/sd/exporters` is configured with bearer auth and works with token
 - unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
 
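The bearer-auth behaviour noted in the hunk above (token accepted, anonymous calls answered with `401`) could be probed roughly as follows. This Python sketch only builds the requests and interprets status codes. The API base URL is a placeholder because the document does not give the API host, and treating `200` as the "works with token" status is an assumption.

```python
import urllib.request

# Hypothetical placeholder; the real API host is not stated in the document.
API_BASE = "http://api.example.internal"


def discovery_request(path, token=None, base=API_BASE):
    """Build a GET request for a discovery endpoint, optionally with bearer auth."""
    req = urllib.request.Request(f"{base}{path}", method="GET")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def check_auth_enforced(status_without_token, status_with_token):
    """True when anonymous calls get 401 and the token call gets 200 (assumed)."""
    return status_without_token == 401 and status_with_token == 200
```

Sending both variants for `/sd/exporters` and `/monitoring/game-dev` and feeding the two status codes to `check_auth_enforced` gives a repeatable pass/fail for this control.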
@@ -115,12 +165,18 @@ Resolved:
 - cAdvisor `10.60.0.11:8081` is gone from active runtime target list
 - `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
 - empty discovery writes a valid empty file_sd file instead of failing
+- active test containers appear in API discovery and file_sd
 
-No stale/static down targets are currently present in active Prometheus targets.
+Expected remaining validation:
+
+- delete test dev/game servers
+- confirm API discovery drops them
+- confirm file_sd drops them
+- confirm dashboard metrics age out as expected
 
 ## API monitoring state
 
-API host/system metrics were previously available via `integrations/unix`, but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.
+API discovery is working for game/dev monitoring inventory.
 
 Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
 
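The delete half of the lifecycle test listed under "Expected remaining validation" comes down to checking that deleted VMIDs no longer appear in API discovery or in file_sd. A minimal Python sketch of that comparison is below. The function names are hypothetical, and it assumes the VMID lists have already been extracted from each source. It does not cover the metrics age-out check.

```python
def lingering_targets(deleted_vmids, api_vmids, file_sd_vmids):
    """Report deleted VMIDs that are still present in API discovery or file_sd."""
    deleted = set(deleted_vmids)
    return {
        "api": sorted(deleted & set(api_vmids)),
        "file_sd": sorted(deleted & set(file_sd_vmids)),
    }


def delete_half_passed(deleted_vmids, api_vmids, file_sd_vmids):
    """True only when no deleted VMID lingers in either inventory source."""
    report = lingering_targets(deleted_vmids, api_vmids, file_sd_vmids)
    return not report["api"] and not report["file_sd"]
```

For the current test servers, the check would run after deleting `6085` and `5206` and after the next discovery sync has completed.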
@@ -128,14 +184,10 @@ No centralized API logs were found on the monitoring host.
 
 ## Agent/container monitoring state
 
-Current dynamic game/dev discovery is empty and clean.
+Current test containers prove remote-write metrics are working:
 
-Expected next validation is lifecycle smoke testing:
-
-- create a game and/or dev container
-- confirm target appears
-- delete it
-- confirm target disappears now that empty discovery works
+- `6085 dev-6085 dev-container`
+- `5206 mc-vanilla-5206 game-container`
 
 Labels previously observed include:
 
@@ -177,6 +229,8 @@ Archived:
 
 - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
 
+Dashboard-style CPU and memory metrics exist for both current test containers.
+
 ## Logs state
 
 No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
@@ -185,14 +239,14 @@ Only journald/local logs for monitoring components were available on the monitor
 
 ## Security state
 
-Working controls from prior audit:
+Working controls:
 
-- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
+- Grafana is loopback-only on `127.0.0.1:3000` per current report
 - discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
-- Prometheus labels/scrape URLs checked did not expose the bearer token
+- Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit
 
-Launch-blocking risk:
+Remaining launch security checks:
 
-- Prometheus and Grafana previously bound publicly on the host network
-- node_exporter previously listened on `*:9100`; it is no longer scraped by active Prometheus config, but listener exposure still needs verification/removal or firewalling
-- host firewall policy did not restrict access in prior audit
+- Prometheus on `10.60.0.25:9090` must be restricted to intended container remote-write and admin/monitoring paths
+- any remaining node_exporter listener must be disabled or firewalled
+- host firewall policy must be verified