Update monitoring open items after clean discovery

jester 2026-05-01 18:57:15 +00:00
parent c1c2104936
commit 1169a89b71


@@ -2,6 +2,15 @@
 Only keep unfinished monitoring/observability work here.
+## Recently resolved / verified
+- Game/dev dynamic discovery empty-state now works.
+- `/sd/exporters -> []`
+- `/monitoring/game-dev -> containers: []`
+- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
+- one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
+- Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
 ## Launch blockers
 1. Restrict monitoring endpoint exposure.
@@ -11,36 +20,39 @@ Only keep unfinished monitoring/observability work here.
 - UFW is inactive and nftables policy is accept.
 - Required fix: bind to loopback/management VLAN only or firewall to trusted admin/monitoring sources.
-2. Fix game/dev discovery sync.
-- `zlh-monitor-game-dev-discovery.service` is failed.
-- Current failure: API returns dev VMID `6050` on `10.60.0.220`, but sync script only allows dev IPs in `10.100.0.0/16`.
-- Decide whether `6050` on `10.60.0.0/16` is intentional; then either update the allowlist or fix API/IP assignment.
-3. Remove stale file_sd targets after discovery is healthy.
-- Stale targets observed: VMID `5205` and VMID `6084`.
-- Acceptance: deleted containers disappear automatically and stale targets do not remain indefinitely.
-4. Install/provision Grafana dashboards.
+2. Remove or fix remaining stale/static targets.
+- `alloy-dev-6076` is still hardcoded/down.
+- `cadvisor`, `10.60.0.11:8081`, is down/no route.
+- `zerolaghub-nodes`, `10.60.0.19:9100`, is down/connection refused.
+- Acceptance: remaining down targets are either removed, fixed, or explicitly documented as intentional disabled targets.
+3. Install/provision Grafana dashboards.
 - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
-- Grafana DB currently has no dashboards installed.
+- Grafana DB previously had no dashboards installed.
 - Acceptance: dashboards visible in Grafana and queries return live data.
-5. Add API health scrape.
+4. Add API health scrape.
 - Observed `/health` returns `404 Cannot GET /health`.
 - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
-6. Add launch-debug lifecycle telemetry.
+5. Add launch-debug lifecycle telemetry.
 - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
 - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
-7. Add centralized logs.
+6. Add centralized logs.
 - No Loki, Promtail, or log-ingesting Alloy config found.
 - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
-8. Tighten bearer token storage.
+7. Tighten bearer token storage.
-- Token currently appears in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
+- Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
 - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
+8. Run lifecycle smoke test once remaining blockers are controlled enough.
+- Create a game container and/or dev container.
+- Confirm target appears.
+- Delete it.
+- Confirm target disappears now that empty discovery works.
 ## Standardization work
 - Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
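
The empty-state behaviour listed under "Recently resolved / verified" and the appear/disappear check in the lifecycle smoke test could be sketched roughly as follows. This is a minimal illustration, not the actual sync script, which is not shown in this commit. The function names `write_file_sd`, `allow_empty_from_env`, and `target_present` are hypothetical, as is the assumption that the script refuses an empty target list unless `ALLOW_EMPTY` is `"true"`. The output uses the standard Prometheus file_sd JSON shape (a list of target groups).

```python
import json
import os
import tempfile
from pathlib import Path


def allow_empty_from_env(env=None):
    """Read ALLOW_EMPTY the way an env file would set it (assumed semantics)."""
    env = os.environ if env is None else env
    return env.get("ALLOW_EMPTY", "false").strip().lower() == "true"


def write_file_sd(targets, out_path, allow_empty=False, labels=None):
    """Write a Prometheus file_sd JSON file atomically.

    Refuses to write an empty target list unless allow_empty is set, so a
    failed discovery call does not silently wipe every target.
    Returns the number of targets written.
    """
    targets = sorted(set(targets))
    if not targets and not allow_empty:
        raise ValueError("discovery returned no targets and ALLOW_EMPTY is not set")

    # An empty list is valid file_sd content and clears all targets.
    groups = [{"targets": targets, "labels": labels or {}}] if targets else []

    out_path = Path(out_path)
    # Write to a temp file in the same directory, then rename over the old
    # file so Prometheus never reads a half-written document.
    fd, tmp = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the scraper must be able to read it
    os.replace(tmp, out_path)
    return len(targets)


def target_present(file_sd_path, target):
    """Return True if target (host:port) is listed in the file_sd file."""
    groups = json.loads(Path(file_sd_path).read_text())
    return any(target in group.get("targets", []) for group in groups)
```

In a smoke test, `target_present` would be expected to return `True` after a container is created and the sync has run, and `False` after the container is deleted and the sync has run again with `ALLOW_EMPTY="true"`.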