Update open items for remote-write monitoring model
parent 3b74817e4e
commit 9020e35107
@@ -25,9 +25,6 @@ Only keep unfinished monitoring/observability work here.
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs
- current active targets are clean:
- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`

- Remaining stale/static Prometheus targets are removed from active runtime config.
- `alloy-dev-6076` no longer appears as an active runtime target
@@ -44,6 +41,26 @@ Only keep unfinished monitoring/observability work here.
- `Game Containers`
- legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

- Important game/dev metrics path works through remote-write.
- API discovery emitted both test servers:
- `6085 / dev-6085 / 10.100.0.23 / dev`
- `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
- file_sd sync wrote both targets and Prometheus discovered both
- container Alloy remote-write works after Prometheus was rebound to `10.60.0.25:9090`
- dashboard-style metrics exist for both:
- `node_uname_info`
- CPU
- memory
- observed values:
- `6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%`
- `5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%`
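
The discovery-to-file_sd step described above can be sketched roughly as follows. This is a minimal illustration, not the actual sync job: the record field names (`vmid`, `hostname`, `ip`, `type`), the label names, and the target port are all assumptions. Only the overall `targets`/`labels` shape follows the Prometheus file_sd JSON format.

```python
import json


def build_file_sd(servers, port):
    """Convert discovery records into Prometheus file_sd target groups.

    Field and label names are hypothetical; the real API payload may differ.
    """
    groups = []
    for server in servers:
        groups.append({
            "targets": [f"{server['ip']}:{port}"],
            # file_sd label values must be strings.
            "labels": {
                "vmid": str(server["vmid"]),
                "hostname": server["hostname"],
                "type": server["type"],
            },
        })
    return groups


# The two test servers listed above; the port is an assumed placeholder.
servers = [
    {"vmid": 6085, "hostname": "dev-6085", "ip": "10.100.0.23", "type": "dev"},
    {"vmid": 5206, "hostname": "mc-vanilla-5206", "ip": "10.200.0.79", "type": "game"},
]

print(json.dumps(build_file_sd(servers, port=12345), indent=2))
```

An empty discovery result yields `[]`, which is still a valid file_sd document.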

- Current monitoring model clarified.
- Not API-only: API owns discovery/metadata.
- Canonical metrics path is container Alloy remote-write to Prometheus.
- `game-dev-alloy` scrape-up panels should not be relied on while container Alloy listens on `127.0.0.1:12345`.
- Direct scrape health would require intentionally exposing container Alloy HTTP on a reachable address such as `0.0.0.0:12345` plus security review.
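
Since scrape `up` is not a reliable signal in this model, one possible substitute is the age of the newest remote-written sample per container. The sketch below assumes an instant query such as `timestamp(node_uname_info)` against the Prometheus HTTP API and a `vmid` label on the series. The label name, the threshold, and the sample timestamps are illustrative assumptions, not observed values.

```python
def check_freshness(response, expected_vmids, now, max_age_s=120.0):
    """Return (missing, stale) vmids from a parsed instant-query response.

    Assumes each result value carries the sample timestamp, as produced by
    a query like timestamp(node_uname_info). The 'vmid' label is hypothetical.
    """
    newest = {}
    for item in response["data"]["result"]:
        vmid = item["metric"].get("vmid")
        if vmid is None:
            continue
        sample_ts = float(item["value"][1])
        newest[vmid] = max(newest.get(vmid, 0.0), sample_ts)

    # Series that fell out of the query result entirely count as missing.
    missing = sorted(set(expected_vmids) - set(newest))
    stale = sorted(v for v, ts in newest.items() if now - ts > max_age_s)
    return missing, stale


# Made-up response: 6085 wrote 10 s ago, 5206 wrote 300 s ago.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"vmid": "6085"}, "value": [1700000000, "1699999990"]},
            {"metric": {"vmid": "5206"}, "value": [1700000000, "1699999700"]},
        ],
    },
}

missing, stale = check_freshness(example, ["6085", "5206"], now=1700000000.0)
print("missing:", missing, "stale:", stale)
```

Comparing against the expected set matters because an instant query stops returning a series once its last sample is older than the lookback window.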

- Services active after cleanup:
- `prometheus`
- `grafana-server`
@@ -52,13 +69,12 @@ Only keep unfinished monitoring/observability work here.

## Launch blockers

1. Restrict monitoring endpoint exposure.
- Prometheus previously listened on `*:9090`.
- Grafana previously listened on `*:3000`.
- node_exporter previously listened on `*:9100`.
- UFW was inactive and nftables policy was accept in the prior audit.
- Prometheus no longer scrapes node_exporter or any `:9100` jobs, but listener exposure still needs verification/removal or firewalling.
- Required fix: bind Prometheus/Grafana to loopback or management VLAN only, remove/disable any unneeded node_exporter listener, and/or firewall to trusted admin/monitoring sources.
1. Finish network exposure verification/lockdown.
- Prometheus now listens on `10.60.0.25:9090` for container remote-write and self-scrape.
- Grafana remains loopback-only on `127.0.0.1:3000`.
- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`.
- local node_exporter is disabled.
- Required validation: Prometheus is reachable only from intended container remote-write and trusted admin/monitoring paths; Grafana is not externally exposed; node_exporter is not listening or is firewalled; host firewall/nftables posture is acceptable.
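
Part of the validation above could be automated by classifying listen addresses. The following is a rough sketch under stated assumptions: the input is a list of `host:port` strings already filtered to monitoring services (for example, copied from a socket listing), and the expected binding per port is a hypothetical policy table derived from the bullets above. It checks bind scope only and says nothing about firewall rules.

```python
import ipaddress


def classify_listener(addr):
    """Classify a 'host:port' listen address as wildcard, loopback, or specific."""
    host, _, _port = addr.rpartition(":")
    if host in ("*", ""):
        return "wildcard"
    ip = ipaddress.ip_address(host.strip("[]"))  # accepts IPv4 or bracketed IPv6
    if ip.is_unspecified:
        return "wildcard"
    if ip.is_loopback:
        return "loopback"
    return "specific"


def audit(listeners, expected):
    """Return findings for listeners that deviate from the expected policy."""
    findings = []
    for addr in listeners:
        port = int(addr.rpartition(":")[2])
        kind = classify_listener(addr)
        want = expected.get(port)
        if want is None:
            findings.append(f"{addr}: unexpected listener")
        elif kind != want:
            findings.append(f"{addr}: {kind}, expected {want}")
    return findings


# Hypothetical policy table: port -> intended bind scope.
expected = {9090: "specific", 3000: "loopback", 12345: "loopback"}

prior = ["*:9090", "*:3000", "*:9100"]
current = ["10.60.0.25:9090", "127.0.0.1:3000", "127.0.0.1:12345"]

print(audit(prior, expected))
print(audit(current, expected))
```

A `specific` bind such as `10.60.0.25:9090` still needs a separate check of which sources can actually reach it.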

2. Add API health scrape or equivalent application-level health signal.
- Observed `/health` previously returned `404 Cannot GET /health`.
@@ -76,11 +92,11 @@ Only keep unfinished monitoring/observability work here.
- Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
- Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
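
The "not world-readable" part of the acceptance criterion can be checked mechanically. This is a small POSIX-only sketch: it inspects the permission bits for "other" users and nothing else, so it does not cover ownership, group access, or leaks into labels, URLs, or logs. The demo uses a temporary file rather than the real env file.

```python
import os
import stat
import tempfile


def world_access(path):
    """Return which 'other' permission bits are set on a file (POSIX modes)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    return problems


# Demo on a throwaway file; point world_access() at the real env file instead.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

os.chmod(demo_path, 0o644)
loose = world_access(demo_path)

os.chmod(demo_path, 0o600)
tight = world_access(demo_path)

os.remove(demo_path)
print("0644:", loose, "0600:", tight)
```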

6. Run lifecycle smoke test once remaining blockers are controlled enough.
- Create a game container and/or dev container.
- Confirm target appears.
- Delete it.
- Confirm target disappears now that empty discovery works.
6. Finish lifecycle smoke test delete step.
- Delete test dev/game servers.
- Confirm API discovery drops them.
- Confirm file_sd drops them.
- Confirm dashboard metrics age out as expected.
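
For the "file_sd drops them" check, one option is to compare the deleted servers' targets against the file_sd content after the next sync. The sketch below assumes the file_sd JSON has already been loaded into Python objects; the target strings, including the port, are placeholders rather than the real values.

```python
def targets_in(groups):
    """Collect every target string from a list of file_sd target groups."""
    return {target for group in groups for target in group.get("targets", [])}


def lingering_targets(file_sd_groups, deleted_targets):
    """Return deleted targets that are still present in file_sd."""
    return sorted(targets_in(file_sd_groups) & set(deleted_targets))


# Placeholder targets for the two test servers (port is an assumption).
deleted = ["10.100.0.23:12345", "10.200.0.79:12345"]

before_sync = [
    {"targets": ["10.100.0.23:12345"], "labels": {"type": "dev"}},
    {"targets": ["10.200.0.79:12345"], "labels": {"type": "game"}},
]
after_sync = []  # empty discovery is expected to produce an empty file_sd list

print("still present before sync:", lingering_targets(before_sync, deleted))
print("still present after sync:", lingering_targets(after_sync, deleted))
```

The delete step passes for file_sd once the second call returns an empty list.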

## Standardization work

@@ -94,7 +110,7 @@ Only keep unfinished monitoring/observability work here.

- discovery sync failed
- file_sd file stale / not updated recently
- target down by VMID/type
- missing/stale remote-write metrics by VMID/type
- API `/sd/exporters` failure
- API app health failure
- Prometheus scrape failures by important job
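
As an illustration of the "file_sd file stale" condition, a simple check can compare the file's modification time against a maximum age. The threshold is an assumption, and the demo uses a temporary file with a forced modification time instead of the real file_sd path.

```python
import os
import tempfile
import time


def file_age_seconds(path, now=None):
    """Seconds since the file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(path, max_age_s, now=None):
    """True when the file has not been rewritten within max_age_s seconds."""
    return file_age_seconds(path, now) > max_age_s


# Demo: force a known mtime on a throwaway file, then test two points in time.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

os.utime(demo_path, (1000, 1000))  # atime and mtime set to t=1000
fresh_result = is_stale(demo_path, max_age_s=300, now=1100)
old_result = is_stale(demo_path, max_age_s=300, now=2000)

os.remove(demo_path)
print("after 100 s:", fresh_result, "after 1000 s:", old_result)
```

This only works as a signal if the sync job rewrites the file on every run, even when the target set is unchanged.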