diff --git a/Codex/Monitoring/OPEN_ITEMS.md b/Codex/Monitoring/OPEN_ITEMS.md
index f8cb5a6..495a7d1 100644
--- a/Codex/Monitoring/OPEN_ITEMS.md
+++ b/Codex/Monitoring/OPEN_ITEMS.md
@@ -4,6 +4,11 @@ Only keep unfinished monitoring/observability work here.
 
 ## Recently resolved / verified
 
+- `/etc/zlh-monitor` is now the runtime source of truth.
+  - Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
+  - Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
+  - `/etc/zlh-monitor/README.md` describes the new single-source layout.
+
 - Game/dev dynamic discovery empty-state now works.
   - `/sd/exporters -> []`
   - `/monitoring/game-dev -> containers: []`
@@ -11,43 +16,67 @@ Only keep unfinished monitoring/observability work here.
   - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
   - Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
 
+- Prometheus runtime config is clean.
+  - validates with `promtool`
+  - active jobs are only:
+    - `prometheus`
+    - `alloy -> 127.0.0.1:12345`
+    - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
+  - no active `node_exporter` job
+  - no active cAdvisor job
+  - no active `:9100` jobs
+  - current active targets are clean:
+    - `up alloy 127.0.0.1:12345`
+    - `up prometheus localhost:9090`
+
+- Remaining stale/static Prometheus targets are removed from active runtime config.
+  - `alloy-dev-6076` no longer appears as an active runtime target
+  - `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target
+  - `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target
+
+- Grafana provisioning is live.
+  - source-managed datasource:
+    - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
+  - provisioned dashboards:
+    - `Core Service - Detail`
+    - `Core Services - Overview`
+    - `Dev Containers`
+    - `Game Containers`
+  - legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
+
+- Services active after cleanup:
+  - `prometheus`
+  - `grafana-server`
+  - `alloy`
+  - discovery timer
+
 ## Launch blockers
 
 1. Restrict monitoring endpoint exposure.
-   - Prometheus currently listens on `*:9090`.
-   - Grafana currently listens on `*:3000`.
-   - node_exporter currently listens on `*:9100`.
-   - UFW is inactive and nftables policy is accept.
-   - Required fix: bind to loopback/management VLAN only or firewall to trusted admin/monitoring sources.
+   - Prometheus previously listened on `*:9090`.
+   - Grafana previously listened on `*:3000`.
+   - node_exporter previously listened on `*:9100`.
+   - UFW was inactive and nftables policy was accept in the prior audit.
+   - Prometheus no longer scrapes node_exporter or any `:9100` jobs, but listener exposure still needs verification/removal or firewalling.
+   - Required fix: bind Prometheus/Grafana to loopback or management VLAN only, remove/disable any unneeded node_exporter listener, and/or firewall to trusted admin/monitoring sources.
 
-2. Remove or fix remaining stale/static targets.
-   - `alloy-dev-6076` is still hardcoded/down.
-   - `cadvisor`, `10.60.0.11:8081`, is down/no route.
-   - `zerolaghub-nodes`, `10.60.0.19:9100`, is down/connection refused.
-   - Acceptance: remaining down targets are either removed, fixed, or explicitly documented as intentional disabled targets.
-
-3. Install/provision Grafana dashboards.
-   - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
-   - Grafana DB previously had no dashboards installed.
-   - Acceptance: dashboards visible in Grafana and queries return live data.
-
-4. Add API health scrape.
-   - Observed `/health` returns `404 Cannot GET /health`.
+2. Add API health scrape or equivalent application-level health signal.
+   - Observed `/health` previously returned `404 Cannot GET /health`.
    - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
 
-5. Add launch-debug lifecycle telemetry.
+3. Add launch-debug lifecycle telemetry.
    - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
    - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
 
-6. Add centralized logs.
-   - No Loki, Promtail, or log-ingesting Alloy config found.
+4. Add centralized logs.
+   - No Loki, Promtail, or log-ingesting Alloy config found in prior audit.
    - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
 
-7. Tighten bearer token storage.
+5. Tighten bearer token storage.
    - Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
    - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
 
-8. Run lifecycle smoke test once remaining blockers are controlled enough.
+6. Run lifecycle smoke test once remaining blockers are controlled enough.
    - Create a game container and/or dev container.
    - Confirm target appears.
    - Delete it.