# Monitoring — Open Items
Only keep unfinished monitoring/observability work here.

## Recently resolved / verified

- Game/dev dynamic discovery empty-state now works.
  - `/sd/exporters -> []`
  - `/monitoring/game-dev -> containers: []`
  - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
  - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
  - Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
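
For reference, the empty-state behaviour verified above can be sketched roughly as follows. This is an illustration only, not the actual sync script: the input shape (dicts with `vmid`, `address`, `container_type`), the example address, and the name `render_file_sd` are assumptions. Only the output format, a JSON list of `targets`/`labels` groups, is standard Prometheus file_sd.

```python
import json


def render_file_sd(exporters, allow_empty=False):
    """Turn a list of exporter dicts into Prometheus file_sd JSON text.

    Assumed (hypothetical) input shape, one dict per container:
        {"vmid": 5205, "address": "192.0.2.10:12345", "container_type": "game"}
    """
    if not exporters and not allow_empty:
        # Without ALLOW_EMPTY, refuse to replace the target file with an
        # empty list, so a failed discovery call cannot wipe all targets.
        raise ValueError("empty discovery result and ALLOW_EMPTY is not set")

    groups = [
        {
            "targets": [e["address"]],
            "labels": {
                "vmid": str(e["vmid"]),
                "container_type": e["container_type"],
            },
        }
        for e in exporters
    ]
    return json.dumps(groups, indent=2)
```

With `allow_empty=True` and no exporters the function returns an empty JSON list, which is what lets stale VMIDs drop out of Prometheus after the next file_sd refresh.
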
## Launch blockers

1. Restrict monitoring endpoint exposure.
   - Prometheus currently listens on `*:9090`.
   - Grafana currently listens on `*:3000`.
   - node_exporter currently listens on `*:9100`.
   - UFW is inactive and the nftables policy is `accept`.
   - Required fix: bind to loopback or the management VLAN only, or firewall to trusted admin/monitoring sources.

2. Remove or fix remaining stale/static targets.
   - `alloy-dev-6076` is still hardcoded/down.
   - `cadvisor` (`10.60.0.11:8081`) is down/no route.
   - `zerolaghub-nodes` (`10.60.0.19:9100`) is down/connection refused.
   - Acceptance: remaining down targets are either removed, fixed, or explicitly documented as intentionally disabled targets.

3. Install/provision Grafana dashboards.
   - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
   - Grafana DB previously had no dashboards installed.
   - Acceptance: dashboards visible in Grafana and queries return live data.

4. Add API health scrape.
   - Observed `/health` returns `404 Cannot GET /health`.
   - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.

5. Add launch-debug lifecycle telemetry.
   - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
   - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.

6. Add centralized logs.
   - No Loki, Promtail, or log-ingesting Alloy config found.
   - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.

7. Tighten bearer token storage.
   - Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
   - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.

8. Run lifecycle smoke test once remaining blockers are controlled enough.
   - Create a game container and/or dev container.
   - Confirm target appears.
   - Delete it.
   - Confirm target disappears now that empty discovery works.
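
For blocker 1, a quick way to re-check exposure after the fix is to scan listening sockets for wildcard binds on the monitoring ports. A minimal sketch, assuming the input is the text printed by `ss -H -ltn` (state, queues, local address, peer address per line); the port-to-service map mirrors the listeners named above, and `exposed_listeners` is a hypothetical helper name.

```python
# Addresses that mean "listening on every interface".
WILDCARD_HOSTS = {"*", "0.0.0.0", "[::]", "::"}
# Ports taken from the blocker list above.
MONITORING_PORTS = {9090: "prometheus", 3000: "grafana", 9100: "node_exporter"}


def exposed_listeners(ss_output):
    """Return sorted (port, service) pairs bound to a wildcard address."""
    exposed = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        # The local address is the 4th column, e.g. "*:9090" or "[::]:9100".
        host, _, port = fields[3].rpartition(":")
        if not port.isdigit():
            continue
        if int(port) in MONITORING_PORTS and host in WILDCARD_HOSTS:
            exposed.add((int(port), MONITORING_PORTS[int(port)]))
    return sorted(exposed)
```

An empty result means none of the three services is bound to all interfaces. It says nothing about firewall rules, so it only covers the "bind" half of the required fix.
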
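
For blocker 7, the "not world-readable" acceptance check can be automated with a permission-mode test on the token file. A minimal sketch: the helper names are hypothetical, and treating group-readable as a problem is an assumption stricter than the acceptance text, which only rules out world-readable.

```python
import os
import stat


def mode_problems(mode):
    """Describe why a permission mode is too loose for a secret file."""
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IRGRP:
        problems.append("group-readable")
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("writable by group or others")
    return problems


def token_file_problems(path):
    """Check the on-disk mode of a token file such as an env file."""
    return mode_problems(stat.S_IMODE(os.stat(path).st_mode))
```

A file with mode `0600` yields no problems; `0644` is flagged as both world-readable and group-readable. This does not detect the token leaking into labels, URLs, or logs, which still needs a separate review.
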
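
For blocker 8, the "target appears" and "target disappears" checks can be made against the Prometheus targets API. A minimal sketch, assuming `payload` is the decoded JSON body of `GET /api/v1/targets`, that the job is named `game-dev-alloy`, and that each target carries a `vmid` label; `targets_for_vmid` is a hypothetical helper name.

```python
def targets_for_vmid(payload, vmid, job="game-dev-alloy"):
    """Return active targets whose labels match the given job and VMID."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        target
        for target in active
        if target.get("labels", {}).get("job") == job
        and target.get("labels", {}).get("vmid") == str(vmid)
    ]
```

After creating a container, poll until the list is non-empty; after deleting it, poll until the list is empty again. Allow for the discovery sync interval and the file_sd refresh before treating a missing or lingering target as a failure.
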
## Standardization work

- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Keep optional metadata useful but not noisy.
- Review any user/customer-identifying labels before dashboard or alert exposure.
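
The label rules above can be expressed as a small check to run against discovery output. A minimal sketch, assuming labels arrive as a plain dict per target; the helper names are hypothetical, and the alias table only covers the `ctype` spelling mentioned above.

```python
REQUIRED_LABELS = ("vmid", "instance", "container_type")
ALIASES = {"ctype": "container_type"}  # legacy spelling -> canonical name


def normalize_labels(labels):
    """Return a copy of `labels` with legacy names mapped to canonical ones."""
    normalized = {}
    for name, value in labels.items():
        canonical = ALIASES.get(name, name)
        if canonical in normalized and normalized[canonical] != value:
            # Both spellings present with different values: refuse to guess.
            raise ValueError(f"conflicting values for label {canonical!r}")
        normalized[canonical] = value
    return normalized


def missing_required(labels):
    """List required labels that are absent after normalization."""
    normalized = normalize_labels(labels)
    return [name for name in REQUIRED_LABELS if name not in normalized]
```

In Prometheus itself the same rename would more likely be done with relabeling rules than in Python; the sketch is meant for validating what the discovery API or sync job emits before it reaches Prometheus.
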
## Alerts to add

- discovery sync failed
- file_sd file stale / not updated recently
- target down by VMID/type
- API `/sd/exporters` failure
- API app health failure
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
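
As one example, the "file_sd file stale" condition could be derived from the file's modification time. A minimal sketch: the 300-second threshold and the helper names are assumptions, and the check is only meaningful if the sync job rewrites or touches the file on every successful run, which has not been confirmed here.

```python
import os
import time


def file_age_seconds(path, now=None):
    """Seconds since `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(age_seconds, max_age_seconds=300):
    """True when the file has not been rewritten within the allowed window."""
    return age_seconds > max_age_seconds
```

A check like `is_stale(file_age_seconds("/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json"))` could then feed whatever mechanism ends up exporting the result as a metric for alerting.
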
## Follow-up improvements

- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
- Add dashboard row/panels for stale/deleted targets.
- Add a runbook for discovery sync failure.
- Add a runbook for failed game/dev provisioning debugging from monitoring only.