diff --git a/Codex/Monitoring/OPEN_ITEMS.md b/Codex/Monitoring/OPEN_ITEMS.md
new file mode 100644
index 0000000..bfcfbfd
--- /dev/null
+++ b/Codex/Monitoring/OPEN_ITEMS.md
@@ -0,0 +1,69 @@
+# Monitoring — Open Items
+
+Only keep unfinished monitoring/observability work here.
+
+## Launch blockers
+
+1. Restrict monitoring endpoint exposure.
+   - Prometheus currently listens on `*:9090`.
+   - Grafana currently listens on `*:3000`.
+   - node_exporter currently listens on `*:9100`.
+   - UFW is inactive and nftables policy is accept.
+   - Required fix: bind to loopback/management VLAN only or firewall to trusted admin/monitoring sources.
+
+2. Fix game/dev discovery sync.
+   - `zlh-monitor-game-dev-discovery.service` is failed.
+   - Current failure: API returns dev VMID `6050` on `10.60.0.220`, but sync script only allows dev IPs in `10.100.0.0/16`.
+   - Decide whether `6050` on `10.60.0.0/16` is intentional; then either update the allowlist or fix API/IP assignment.
+
+3. Remove stale file_sd targets after discovery is healthy.
+   - Stale targets observed: VMID `5205` and VMID `6084`.
+   - Acceptance: deleted containers disappear automatically and stale targets do not remain indefinitely.
+
+4. Install/provision Grafana dashboards.
+   - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
+   - Grafana DB currently has no dashboards installed.
+   - Acceptance: dashboards visible in Grafana and queries return live data.
+
+5. Add API health scrape.
+   - Observed `/health` returns `404 Cannot GET /health`.
+   - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
+
+6. Add launch-debug lifecycle telemetry.
+   - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
+   - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
+
+7. Add centralized logs.
+   - No Loki, Promtail, or log-ingesting Alloy config found.
+   - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
+
+8. Tighten bearer token storage.
+   - Token currently appears in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
+   - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
+
+## Standardization work
+
+- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
+- Decide canonical label spelling for hostname/name/server identity.
+- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
+- Keep optional metadata useful but not noisy.
+- Review any user/customer-identifying labels before dashboard or alert exposure.
+
+## Alerts to add
+
+- discovery sync failed
+- file_sd file stale / not updated recently
+- target down by VMID/type
+- API `/sd/exporters` failure
+- API app health failure
+- Prometheus scrape failures by important job
+- Grafana dashboard provisioning missing/failure
+- log pipeline down once Loki/Alloy logs exist
+
+## Follow-up improvements
+
+- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
+- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
+- Add dashboard row/panels for stale/deleted targets.
+- Add a runbook for discovery sync failure.
+- Add a runbook for failed game/dev provisioning debugging from monitoring only.
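Launch blocker 2 in the file above reduces to a CIDR membership check: the API reports VMID `6050` at `10.60.0.220`, while the sync script only accepts dev IPs in `10.100.0.0/16`. The sketch below reproduces that rejection with Python's standard `ipaddress` module. It is not the actual sync script; `DEV_ALLOWLIST` and `is_allowed` are hypothetical names used only for illustration, and the only values taken from the file are the two networks and the one address.

```python
import ipaddress

# Hypothetical stand-in for the sync script's dev allowlist.
# The network value comes from launch blocker 2; the structure is assumed.
DEV_ALLOWLIST = [ipaddress.ip_network("10.100.0.0/16")]


def is_allowed(ip, allowlist=DEV_ALLOWLIST):
    """Return True if ip falls inside any network in allowlist."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in allowlist)


if __name__ == "__main__":
    # The address reported for VMID 6050 against the current allowlist.
    print(is_allowed("10.60.0.220"))

    # The same address if 10.60.0.0/16 is judged intentional and added.
    widened = DEV_ALLOWLIST + [ipaddress.ip_network("10.60.0.0/16")]
    print(is_allowed("10.60.0.220", widened))
```

If `10.60.0.0/16` turns out to be intentional, widening the allowlist is the smaller change; if not, the allowlist stays as-is and the fix belongs in the API or IP assignment, as the item already states.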