diff --git a/Codex/Monitoring/OPEN_ITEMS.md b/Codex/Monitoring/OPEN_ITEMS.md
index 495a7d1..7ec8145 100644
--- a/Codex/Monitoring/OPEN_ITEMS.md
+++ b/Codex/Monitoring/OPEN_ITEMS.md
@@ -25,9 +25,6 @@ Only keep unfinished monitoring/observability work here.
   - no active `node_exporter` job
   - no active cAdvisor job
   - no active `:9100` jobs
-  - current active targets are clean:
-    - `up alloy 127.0.0.1:12345`
-    - `up prometheus localhost:9090`
 
 - Remaining stale/static Prometheus targets are removed from active runtime config.
   - `alloy-dev-6076` no longer appears as an active runtime target
@@ -44,6 +41,26 @@ Only keep unfinished monitoring/observability work here.
     - `Game Containers`
   - legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
 
+- Important game/dev metrics path works through remote-write.
+  - API discovery emitted both test servers:
+    - `6085 / dev-6085 / 10.100.0.23 / dev`
+    - `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
+  - file_sd sync wrote both targets and Prometheus discovered both
+  - container Alloy remote-write works after Prometheus was rebound to `10.60.0.25:9090`
+  - dashboard-style metrics exist for both:
+    - `node_uname_info`
+    - CPU
+    - memory
+  - observed values:
+    - `6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%`
+    - `5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%`
+
+- Current monitoring model clarified.
+  - Not API-only: API owns discovery/metadata.
+  - Canonical metrics path is container Alloy remote-write to Prometheus.
+  - `game-dev-alloy` scrape-up panels should not be relied on while container Alloy listens on `127.0.0.1:12345`.
+  - Direct scrape health would require intentionally exposing container Alloy HTTP on a reachable address such as `0.0.0.0:12345` plus security review.
+
 - Services active after cleanup:
   - `prometheus`
   - `grafana-server`
@@ -52,13 +69,12 @@ Only keep unfinished monitoring/observability work here.
 
 ## Launch blockers
 
-1. Restrict monitoring endpoint exposure.
-   - Prometheus previously listened on `*:9090`.
-   - Grafana previously listened on `*:3000`.
-   - node_exporter previously listened on `*:9100`.
-   - UFW was inactive and nftables policy was accept in the prior audit.
-   - Prometheus no longer scrapes node_exporter or any `:9100` jobs, but listener exposure still needs verification/removal or firewalling.
-   - Required fix: bind Prometheus/Grafana to loopback or management VLAN only, remove/disable any unneeded node_exporter listener, and/or firewall to trusted admin/monitoring sources.
+1. Finish network exposure verification/lockdown.
+   - Prometheus now listens on `10.60.0.25:9090` for container remote-write and self-scrape.
+   - Grafana remains loopback-only on `127.0.0.1:3000`.
+   - zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`.
+   - local node_exporter is disabled.
+   - Required validation: Prometheus is reachable only from intended container remote-write and trusted admin/monitoring paths; Grafana is not externally exposed; node_exporter is not listening or is firewalled; host firewall/nftables posture is acceptable.
 
 2. Add API health scrape or equivalent application-level health signal.
    - Observed `/health` previously returned `404 Cannot GET /health`.
@@ -76,11 +92,11 @@ Only keep unfinished monitoring/observability work here.
    - Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
    - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
 
-6. Run lifecycle smoke test once remaining blockers are controlled enough.
-   - Create a game container and/or dev container.
-   - Confirm target appears.
-   - Delete it.
-   - Confirm target disappears now that empty discovery works.
+6. Finish lifecycle smoke test delete step.
+   - Delete test dev/game servers.
+   - Confirm API discovery drops them.
+   - Confirm file_sd drops them.
+   - Confirm dashboard metrics age out as expected.
 
 ## Standardization work
 
@@ -94,7 +110,7 @@ Only keep unfinished monitoring/observability work here.
 
 - discovery sync failed
 - file_sd file stale / not updated recently
-- target down by VMID/type
+- missing/stale remote-write metrics by VMID/type
 - API `/sd/exporters` failure
 - API app health failure
 - Prometheus scrape failures by important job