Add monitoring open items

2026-05-01 18:27:24 +00:00 · b0773cb97f
commit b0773cb97f
parent c3aca715ee
1 changed file with 69 additions and 0 deletions
--- a/Codex/Monitoring/OPEN_ITEMS.md
+++ b/Codex/Monitoring/OPEN_ITEMS.md
@@ -0,0 +1,69 @@
+# Monitoring — Open Items
+
+Only keep unfinished monitoring/observability work here.
+
+## Launch blockers
+
+1. Restrict monitoring endpoint exposure.
+   - Prometheus currently listens on `*:9090`.
+   - Grafana currently listens on `*:3000`.
+   - node_exporter currently listens on `*:9100`.
+   - UFW is inactive and the nftables default policy is `accept`.
+   - Required fix: bind each service to loopback or the management VLAN only, or firewall the ports so that only trusted admin/monitoring sources can reach them.
+
+2. Fix game/dev discovery sync.
+   - `zlh-monitor-game-dev-discovery.service` is in a failed state.
+   - Current failure: the API returns dev VMID `6050` on `10.60.0.220`, but the sync script only allows dev IPs in `10.100.0.0/16`.
+   - Decide whether `6050` on `10.60.0.0/16` is intentional; then either update the allowlist or fix the API/IP assignment.
+
+3. Remove stale file_sd targets after discovery is healthy.
+   - Stale targets observed: VMID `5205` and VMID `6084`.
+   - Acceptance: targets for deleted containers are removed automatically, and no stale target remains indefinitely.
+
+4. Install/provision Grafana dashboards.
+   - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
+   - Grafana DB currently has no dashboards installed.
+   - Acceptance: dashboards visible in Grafana and queries return live data.
+
+5. Add API health scrape.
+   - Observed: `GET /health` returns `404 Cannot GET /health`.
+   - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
+
+6. Add launch-debug lifecycle telemetry.
+   - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
+   - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
+
+7. Add centralized logs.
+   - No Loki, Promtail, or log-ingesting Alloy config found.
+   - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
+
+8. Tighten bearer token storage.
+   - Token currently appears in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
+   - Acceptance: token is stored in a tightly permissioned env file or in systemd credentials, is not world-readable, and does not leak into labels, URLs, or logs.
+
+## Standardization work
+
+- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
+- Decide on a canonical label spelling for hostname/name/server identity.
+- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
+- Keep optional metadata useful but not noisy.
+- Review any user/customer-identifying labels before dashboard or alert exposure.
+
+## Alerts to add
+
+- discovery sync failed
+- file_sd file stale / not updated recently
+- target down by VMID/type
+- API `/sd/exporters` failure
+- API app health failure
+- Prometheus scrape failures by important job
+- Grafana dashboard provisioning missing or failed
+- log pipeline down (once Loki/Alloy log collection exists)
+
+## Follow-up improvements
+
+- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
+- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
+- Add dashboard row/panels for stale/deleted targets.
+- Add a runbook for discovery sync failure.
+- Add a runbook for debugging failed game/dev provisioning using monitoring data only.
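
Launch blocker 2 comes down to a CIDR membership test: the API reports `10.60.0.220` for VMID `6050`, while the sync script only accepts `10.100.0.0/16`. The following is a minimal sketch of that check, not the actual sync script. The names `DEV_ALLOWED_NETWORKS` and `is_allowed_dev_ip` are hypothetical, and it assumes the allowlist is a plain list of networks.

```python
import ipaddress

# Assumption: the dev allowlist is the single range named in launch blocker 2.
DEV_ALLOWED_NETWORKS = [ipaddress.ip_network("10.100.0.0/16")]


def is_allowed_dev_ip(ip, networks=DEV_ALLOWED_NETWORKS):
    """Return True if `ip` falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# The address reported for VMID 6050 is rejected by the current allowlist.
print(is_allowed_dev_ip("10.60.0.220"))  # False

# If 10.60.0.0/16 turns out to be intentional, extending the list accepts it.
extended = DEV_ALLOWED_NETWORKS + [ipaddress.ip_network("10.60.0.0/16")]
print(is_allowed_dev_ip("10.60.0.220", extended))  # True
```

Keeping the allowlist as data rather than a hard-coded comparison means that either outcome of the decision in blocker 2 is a one-line change.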
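
For launch blocker 3, one way to meet the acceptance criterion is to prune file_sd target groups whose `vmid` label is no longer reported by the API. The sketch below assumes the standard Prometheus file_sd JSON shape (a list of objects with `targets` and `labels`). The function name `prune_stale_groups` and the sample target addresses are hypothetical; only the VMIDs come from the open items.

```python
def prune_stale_groups(target_groups, live_vmids):
    """Split file_sd target groups into (kept, removed) by their `vmid` label.

    Groups without a `vmid` label are kept, on the assumption that they are
    not managed by the discovery sync.
    """
    live = {str(v) for v in live_vmids}
    kept, removed = [], []
    for group in target_groups:
        vmid = group.get("labels", {}).get("vmid")
        if vmid is None or str(vmid) in live:
            kept.append(group)
        else:
            removed.append(group)
    return kept, removed


# Hypothetical file_sd content; addresses use the documentation range 192.0.2.0/24.
groups = [
    {"targets": ["192.0.2.10:9100"], "labels": {"vmid": "5205", "container_type": "game"}},
    {"targets": ["192.0.2.11:9100"], "labels": {"vmid": "6084", "container_type": "dev"}},
    {"targets": ["192.0.2.12:9100"], "labels": {"vmid": "6050", "container_type": "dev"}},
]

# Assumption: the API currently reports only VMID 6050 as live.
kept, removed = prune_stale_groups(groups, live_vmids=[6050])
print([g["labels"]["vmid"] for g in removed])  # ['5205', '6084']
```

Returning the removed groups as well as the kept ones lets the sync log what it dropped, which also feeds the stale-target panels listed under follow-up improvements.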
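
The label standardization items can be checked mechanically. The following sketch maps the legacy `ctype` label onto `container_type` and reports which of the required labels (`vmid`, `instance`, `container_type`) are missing. The helper names and the alias table are assumptions for illustration, and the sketch treats labels as a plain dict as they would appear on a scraped target.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")

# Assumption: `ctype` is the only legacy spelling that needs mapping.
LEGACY_ALIASES = {"ctype": "container_type"}


def normalize_labels(labels):
    """Return a copy of `labels` with legacy names mapped to canonical ones.

    If both spellings are present, the value under the canonical name wins.
    """
    out = {}
    for name, value in labels.items():
        canonical = LEGACY_ALIASES.get(name, name)
        if name != canonical and canonical in labels:
            continue  # canonical spelling already present; drop the alias
        out[canonical] = value
    return out


def missing_required(labels):
    """List required labels that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


labels = normalize_labels({"vmid": "6050", "ctype": "dev"})
print(labels)                    # {'vmid': '6050', 'container_type': 'dev'}
print(missing_required(labels))  # ['instance']
```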