From 1169a89b714f3e34614953851c9ce07c17fce5d1 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 18:57:15 +0000
Subject: [PATCH] Update monitoring open items after clean discovery

---
 Codex/Monitoring/OPEN_ITEMS.md | 42 ++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Codex/Monitoring/OPEN_ITEMS.md b/Codex/Monitoring/OPEN_ITEMS.md
index bfcfbfd..f8cb5a6 100644
--- a/Codex/Monitoring/OPEN_ITEMS.md
+++ b/Codex/Monitoring/OPEN_ITEMS.md
@@ -2,6 +2,15 @@

 Only keep unfinished monitoring/observability work here.

+## Recently resolved / verified
+
+- Game/dev dynamic discovery empty-state now works.
+  - `/sd/exporters -> []`
+  - `/monitoring/game-dev -> containers: []`
+  - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
+  - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
+  - Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
+
 ## Launch blockers

 1. Restrict monitoring endpoint exposure.
@@ -11,36 +20,39 @@ Only keep unfinished monitoring/observability work here.
    - UFW is inactive and nftables policy is accept.
    - Required fix: bind to loopback/management VLAN only or firewall to trusted admin/monitoring sources.

-2. Fix game/dev discovery sync.
-   - `zlh-monitor-game-dev-discovery.service` is failed.
-   - Current failure: API returns dev VMID `6050` on `10.60.0.220`, but sync script only allows dev IPs in `10.100.0.0/16`.
-   - Decide whether `6050` on `10.60.0.0/16` is intentional; then either update the allowlist or fix API/IP assignment.
+2. Remove or fix remaining stale/static targets.
+   - `alloy-dev-6076` is still hardcoded/down.
+   - `cadvisor`, `10.60.0.11:8081`, is down/no route.
+   - `zerolaghub-nodes`, `10.60.0.19:9100`, is down/connection refused.
+   - Acceptance: remaining down targets are either removed, fixed, or explicitly documented as intentional disabled targets.

-3. Remove stale file_sd targets after discovery is healthy.
-   - Stale targets observed: VMID `5205` and VMID `6084`.
-   - Acceptance: deleted containers disappear automatically and stale targets do not remain indefinitely.
-
-4. Install/provision Grafana dashboards.
+3. Install/provision Grafana dashboards.
    - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
-   - Grafana DB currently has no dashboards installed.
+   - Grafana DB previously had no dashboards installed.
    - Acceptance: dashboards visible in Grafana and queries return live data.

-5. Add API health scrape.
+4. Add API health scrape.
    - Observed `/health` returns `404 Cannot GET /health`.
    - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.

-6. Add launch-debug lifecycle telemetry.
+5. Add launch-debug lifecycle telemetry.
    - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
    - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.

-7. Add centralized logs.
+6. Add centralized logs.
    - No Loki, Promtail, or log-ingesting Alloy config found.
    - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.

-8. Tighten bearer token storage.
-   - Token currently appears in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
+7. Tighten bearer token storage.
+   - Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
    - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.

+8. Run lifecycle smoke test once remaining blockers are controlled enough.
+   - Create a game container and/or dev container.
+   - Confirm target appears.
+   - Delete it.
+   - Confirm target disappears now that empty discovery works.
+
 ## Standardization work

 - Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.