Monitoring — Open Items
Only keep unfinished monitoring/observability work here.
Launch blockers
- Restrict monitoring endpoint exposure.
  - Prometheus currently listens on *:9090.
  - Grafana currently listens on *:3000.
  - node_exporter currently listens on *:9100.
  - UFW is inactive and nftables policy is accept.
  - Required fix: bind to loopback/management VLAN only or firewall to trusted admin/monitoring sources.
- Fix game/dev discovery sync.
  - zlh-monitor-game-dev-discovery.service is failed.
  - Current failure: API returns dev VMID 6050 on 10.60.0.220, but sync script only allows dev IPs in 10.100.0.0/16.
  - Decide whether 6050 on 10.60.0.0/16 is intentional; then either update the allowlist or fix API/IP assignment.
- Remove stale file_sd targets after discovery is healthy.
  - Stale targets observed: VMID 5205 and VMID 6084.
  - Acceptance: deleted containers disappear automatically and stale targets do not remain indefinitely.
- Install/provision Grafana dashboards.
  - Dashboard JSON exists under /etc/zlh-monitor/grafana/dashboards.
  - Grafana DB currently has no dashboards installed.
  - Acceptance: dashboards visible in Grafana and queries return live data.
- Add API health scrape.
  - Observed: /health returns 404 Cannot GET /health.
  - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
- Add launch-debug lifecycle telemetry.
  - Missing/incomplete: ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
  - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
- Add centralized logs.
  - No Loki, Promtail, or log-ingesting Alloy config found.
  - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
- Tighten bearer token storage.
  - Token currently appears in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env.
  - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
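
For the discovery sync blocker above, the failure reduces to a subnet membership check. The sketch below reproduces it with Python's standard ipaddress module. It assumes the sync script keeps its dev allowlist as a list of CIDR strings. The names CURRENT_DEV_ALLOWLIST, PROPOSED_DEV_ALLOWLIST, and ip_allowed are hypothetical and are not taken from the actual script.

```python
import ipaddress

# Assumption: the allowlist is a list of CIDR strings.
# 10.100.0.0/16 is the range the sync script currently allows for dev IPs.
CURRENT_DEV_ALLOWLIST = ["10.100.0.0/16"]
# Hypothetical updated allowlist, only valid if 10.60.0.0/16 is intentional.
PROPOSED_DEV_ALLOWLIST = ["10.100.0.0/16", "10.60.0.0/16"]


def ip_allowed(ip: str, allowlist: list[str]) -> bool:
    """Return True if ip falls inside any CIDR block in allowlist."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)


if __name__ == "__main__":
    dev_ip = "10.60.0.220"  # address the API returns for dev VMID 6050
    print("current allowlist accepts it:", ip_allowed(dev_ip, CURRENT_DEV_ALLOWLIST))
    print("proposed allowlist accepts it:", ip_allowed(dev_ip, PROPOSED_DEV_ALLOWLIST))
```

If 10.60.0.0/16 turns out not to be intentional, the allowlist should stay as it is and the fix belongs in the API/IP assignment instead.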
Standardization work
- Standardize labels: prefer canonical container_type over mixed ctype/container_type.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: vmid, instance, container_type.
- Keep optional metadata useful but not noisy.
- Review any user/customer-identifying labels before dashboard or alert exposure.
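
One possible way to apply the label rules above when the discovery sync writes targets is sketched here. It assumes labels arrive as a plain string-to-string mapping. The function name normalize_labels and the error handling are illustrative choices, not existing code.

```python
# Required game/dev labels, as listed above.
REQUIRED_LABELS = ("vmid", "instance", "container_type")


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Rename legacy ctype to container_type and check required labels.

    Hypothetical helper: raises ValueError if ctype and container_type
    disagree, or if a required label is missing or empty.
    """
    out = dict(labels)  # do not mutate the caller's mapping
    legacy = out.pop("ctype", None)
    if legacy is not None:
        existing = out.setdefault("container_type", legacy)
        if existing != legacy:
            raise ValueError(
                f"conflicting ctype={legacy!r} and container_type={existing!r}"
            )
    missing = [name for name in REQUIRED_LABELS if not out.get(name)]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    return out


if __name__ == "__main__":
    # The instance value here is an illustrative example only.
    print(normalize_labels({"vmid": "6050", "instance": "10.60.0.220:9100", "ctype": "dev"}))
```

Rejecting conflicting values rather than silently picking one keeps the mixed ctype/container_type state visible until the canonical spelling is settled.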
Alerts to add
- discovery sync failed
- file_sd file stale / not updated recently
- target down by VMID/type
- API /sd/exporters failure
- API app health failure
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
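
The "file_sd file stale / not updated recently" condition could be checked outside Prometheus with something like the sketch below, for example from the sync job or a cron probe. The function name, the threshold parameter, and the choice to treat a missing file as stale are all assumptions. The real alert may be better expressed as a Prometheus rule once the relevant metrics are confirmed.

```python
import os
import time


def file_sd_is_stale(path, max_age_seconds, now=None):
    """Return True if path is missing or older than max_age_seconds.

    Hypothetical helper: 'now' can be injected (epoch seconds) so the
    check is easy to test; it defaults to the current time.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        # Assumption: a missing file_sd file counts as stale.
        return True
    return (now - mtime) > max_age_seconds
```

A caller would pass the path of the file_sd file that the discovery sync writes, plus a threshold somewhat larger than the sync interval.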
Follow-up improvements
- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
- Add panels for lifecycle state: ready, connectable, operationInProgress, operationType, backup/restore status, code-server status.
- Add dashboard row/panels for stale/deleted targets.
- Add a runbook for discovery sync failure.
- Add a runbook for failed game/dev provisioning debugging from monitoring only.