Trim monitoring open items after launch-ready validation
Only keep unfinished monitoring/observability work here.

## Active

1. Add centralized logs.
   - No Loki currently.
   - Future direction: centralized Loki as shared infrastructure.
   - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
   - Desired sources:
     - API application logs
     - Agent logs
     - provisioning logs
     - backup/restore logs
     - delete/teardown logs
     - Velocity/DNS/edge action logs
     - discovery sync logs
2. Decide whether OPNsense needs router-only monitoring.
   - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
   - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
   - Keep this separate from the game/dev container Alloy contract.

## Recently resolved / verified

- `/etc/zlh-monitor` is now the runtime source of truth.
  - Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
  - Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
  - `/etc/zlh-monitor/README.md` describes the new single-source layout.
- Game/dev dynamic discovery empty-state now works.
  - `/sd/exporters -> []`
  - `/monitoring/game-dev -> containers: []`
  - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
  - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
  - Prometheus refreshed, and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
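
The file_sd contract described above is simple: the sync job writes a JSON array that is `[]` in the empty state and otherwise carries one entry per discovered container. A sketch of a populated `game-dev-alloy.json` using the two test servers from these notes (the label set follows the `vmid` / `instance` / `container_type` convention used elsewhere in this document; the target port is illustrative):

```json
[
  {
    "targets": ["10.100.0.23:12345"],
    "labels": { "vmid": "6085", "instance": "dev-6085", "container_type": "dev" }
  },
  {
    "targets": ["10.200.0.79:12345"],
    "labels": { "vmid": "5206", "instance": "mc-vanilla-5206", "container_type": "game" }
  }
]
```
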
- Prometheus runtime config is clean.
  - validates with `promtool`
  - active jobs are only:
    - `prometheus`
    - `alloy -> 127.0.0.1:12345`
    - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
  - no active `node_exporter` job
  - no active cAdvisor job
  - no active `:9100` jobs
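
A minimal `prometheus.yml` matching the three active jobs above might look like the following sketch. Only the job names, the local Alloy address, and the file_sd path come from these notes; scrape intervals and other settings are assumptions:

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
        refresh_interval: 1m
```

A config like this can be checked with `promtool check config /etc/zlh-monitor/prometheus/prometheus.yml`, which is the validation path these notes already rely on.
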
- Remaining stale/static Prometheus targets are removed from the active runtime config.
  - `alloy-dev-6076` no longer appears as an active runtime target
  - `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target
  - `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target
- Grafana provisioning is live.
  - source-managed datasource:
    - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
  - provisioned dashboards:
    - `Core Service - Detail`
    - `Core Services - Overview`
    - `Dev Containers`
    - `Game Containers`
  - legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
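
The source-managed datasource above corresponds to a Grafana provisioning file roughly like this; the field values come from the entry above, while the file path under `provisioning/datasources/` is an assumption:

```yaml
# e.g. /etc/zlh-monitor/grafana/provisioning/datasources/prometheus.yml (path illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```
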
- Important game/dev metrics path works through remote-write.
  - API discovery emitted both test servers:
    - `6085 / dev-6085 / 10.100.0.23 / dev`
    - `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
  - file_sd sync wrote both targets, and Prometheus discovered both
  - container Alloy remote-write works after Prometheus was rebound to `10.60.0.25:9090`
  - dashboard-style metrics exist for both:
    - `node_uname_info`
    - CPU
    - memory
  - observed values:
    - `6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%`
    - `5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%`
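
The CPU and memory percentages above can be reproduced with standard node_exporter-style queries against the remote-written series. A sketch, assuming the default metric names and a `vmid` label on each series:

```promql
# CPU utilization % per container
100 * (1 - avg by (vmid) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory utilization % per container
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```
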
- Current monitoring model clarified.
  - Not API-only: the API owns discovery/metadata.
  - The canonical metrics path is container Alloy remote-write to Prometheus.
  - `game-dev-alloy` scrape-up panels should not be relied on while container Alloy listens only on `127.0.0.1:12345`.
  - Direct scrape health would require intentionally exposing container Alloy HTTP on a reachable address such as `0.0.0.0:12345`, plus a security review.
- Services active after cleanup:
  - `prometheus`
  - `grafana-server`
  - `alloy`
  - discovery timer

## Launch blockers

1. Finish network exposure verification/lockdown.
   - Prometheus now listens on `10.60.0.25:9090` for container remote-write and self-scrape.
   - Grafana remains loopback-only on `127.0.0.1:3000`.
   - zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`.
   - Local node_exporter is disabled.
   - Required validation:
     - Prometheus is reachable only from intended container remote-write and trusted admin/monitoring paths.
     - Grafana is not externally exposed.
     - node_exporter is not listening, or is firewalled.
     - Host firewall/nftables posture is acceptable.
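
One way to express the lockdown above is an nftables rule set that admits remote-write only from the container networks while keeping Grafana and the local Alloy loopback-only. This is a sketch: the source subnets are assumptions and must match the real game/dev ranges before use.

```
table inet zlh_monitor {
  chain input {
    type filter hook input priority 0; policy accept;

    # Prometheus: container remote-write networks only (subnets are assumptions)
    tcp dport 9090 ip saddr { 10.100.0.0/16, 10.200.0.0/16 } accept
    tcp dport 9090 drop

    # Grafana and local Alloy must stay loopback-only
    tcp dport { 3000, 12345 } iifname != "lo" drop
  }
}
```
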
2. Add an API health scrape or equivalent application-level health signal.
   - Observed: `/health` previously returned `404 Cannot GET /health`.
   - Acceptance: Prometheus has an API app health target or an equivalent application-level health signal.
3. Add launch-debug lifecycle telemetry.
   - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
   - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
4. Add centralized logs.
   - No Loki, Promtail, or log-ingesting Alloy config was found in the prior audit.
   - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
5. Tighten bearer token storage.
   - The token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
   - Acceptance: the token is stored in a tight-permission env file or via systemd credentials; not world-readable; not leaked in labels, URLs, or logs.
6. Finish the lifecycle smoke-test delete step.
   - Delete the test dev/game servers.
   - Confirm API discovery drops them.
   - Confirm file_sd drops them.
   - Confirm dashboard metrics age out as expected.
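
For blocker 5, one pattern that meets the acceptance criteria is a systemd credential plus a token file reference, so the token never sits inline in Prometheus YAML. A sketch; the unit drop-in path, credential name, and secrets path are assumptions:

```ini
# /etc/systemd/system/prometheus.service.d/token.conf (illustrative)
[Service]
# Source file root-owned, mode 0600; systemd exposes it read-only to the unit at
# /run/credentials/prometheus.service/zlh_api_token
LoadCredential=zlh_api_token:/etc/zlh-monitor/secrets/zlh_api_token
```

Scrape configs can then point `bearer_token_file` (or `authorization.credentials_file`) at `/run/credentials/prometheus.service/zlh_api_token` instead of embedding the token.
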

## Optional hardening / improvements

- Add an API app health scrape or equivalent application-level health signal.
- Add lifecycle/debug telemetry panels or metrics for:
  - `ready`
  - `connectable`
  - `operationInProgress`
  - `operationType`
  - backup/restore state
  - code-server state
- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
- Decide the canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Keep optional metadata useful but not noisy.
- Review any user/customer-identifying labels before dashboard or alert exposure.
- Add runbooks for discovery sync failure and for debugging failed game/dev provisioning from monitoring alone.
## Alerts to add

- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- `game-dev-alloy` scrape target down, by VMID/type
- API `/sd/exporters` failure
- API app health failure, once an app health signal exists
- Prometheus scrape failures for important jobs
- Grafana dashboard provisioning missing/failing
- log pipeline down, once Loki/Alloy logs exist

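The "file_sd file stale" alert above can be driven by Prometheus's own `prometheus_sd_file_mtime_seconds` metric. A sketch of a rule; the 15-minute threshold and group name are assumptions:

```yaml
groups:
  - name: zlh-discovery
    rules:
      - alert: GameDevFileSdStale
        expr: time() - prometheus_sd_file_mtime_seconds{filename=~".*game-dev-alloy\\.json"} > 900
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "game-dev file_sd file has not been updated in over 15 minutes"
```
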
## Follow-up improvements

- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, and Velocity/DNS/edge actions.
- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
- Add a dashboard row/panels for stale/deleted targets.
- Add a runbook for discovery sync failure.
- Add a runbook for debugging failed game/dev provisioning from monitoring alone.
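
The first follow-up above (selected logs shipped through Alloy to a shared Loki, no per-container Loki) could be sketched in Alloy configuration syntax like this; the log path, job label, and Loki URL are assumptions:

```alloy
// Tail selected API logs and forward them to the central Loki.
local.file_match "api_logs" {
  path_targets = [{ "__path__" = "/var/log/zlh/api/*.log", "job" = "api" }]
}

loki.source.file "api" {
  targets    = local.file_match.api_logs.targets
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    url = "http://loki.internal:3100/loki/api/v1/push"
  }
}
```
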