# Monitoring — Open Items
Only keep unfinished monitoring/observability work here.
## Active
1. Add centralized logs.
   - No Loki currently.
   - Future direction: centralized Loki as shared infrastructure.
   - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
   - Desired sources:
     - API application logs
     - Agent logs
     - provisioning logs
     - backup/restore logs
     - delete/teardown logs
     - Velocity/DNS/edge action logs
     - discovery sync logs
2. Decide whether OPNsense needs router-only monitoring.
   - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
   - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
   - Keep this separate from the game/dev container Alloy contract.
## Optional hardening / improvements
- Add API app health scrape or equivalent application-level health signal.
- Add lifecycle/debug telemetry panels or metrics for:
  - `ready`
  - `connectable`
  - `operationInProgress`
  - `operationType`
  - backup/restore state
  - code-server state
- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Review any user/customer-identifying labels before dashboard or alert exposure.
- Add runbooks for debugging discovery sync failures and failed game/dev provisioning using monitoring data alone.
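
The label items above (canonical `container_type`, minimal required labels) could be checked with a small normalization step. The following is a minimal sketch in Python, not the project's actual implementation: the function names, the alias table, and the idea of running this against label dictionaries are all assumptions made for illustration. Only the label names `vmid`, `instance`, `container_type`, and `ctype` come from this document.

```python
# Hypothetical helper: normalize legacy label names to the canonical set.
# The alias table and function names are illustrative assumptions.

REQUIRED_LABELS = ("vmid", "instance", "container_type")

# Legacy spelling -> canonical spelling (only `ctype` is named in this document).
LABEL_ALIASES = {"ctype": "container_type"}


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of `labels` with legacy names replaced by canonical ones."""
    normalized: dict[str, str] = {}
    for key, value in labels.items():
        canonical = LABEL_ALIASES.get(key, key)
        # Refuse to silently pick a winner if both spellings disagree.
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for label {canonical!r}")
        normalized[canonical] = value
    return normalized


def missing_required_labels(labels: dict[str, str]) -> list[str]:
    """List required labels that are absent or empty, in REQUIRED_LABELS order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]
```

A check like this could run wherever target labels are generated, so that mixed `ctype` / `container_type` usage is caught before it reaches dashboards or alerts.
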
## Alerts to add
- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- game-dev-alloy scrape target down by VMID/type
- API `/sd/exporters` failure
- API app health failure once an app health signal exists
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
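
As an illustration of the "file_sd file stale / not updated recently" item, the check can be sketched as a comparison of the file's modification time against a threshold. This is a hedged Python sketch only: the function names and the 600-second default are assumptions, and the real alert would more likely be expressed as an alerting rule once the relevant metric is chosen.

```python
# Hypothetical staleness check for a file_sd target file.
# The threshold and function names are illustrative assumptions.
import os
import time
from typing import Optional


def file_sd_age_seconds(path: str, now: Optional[float] = None) -> float:
    """Seconds since `path` was last modified."""
    current = time.time() if now is None else now
    return current - os.stat(path).st_mtime


def is_file_sd_stale(
    path: str, max_age_seconds: float = 600, now: Optional[float] = None
) -> bool:
    """True if the file is missing or older than `max_age_seconds`."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # A missing file is treated as stale so the condition still fires.
        return True
```

Treating a missing file as stale is a design choice in this sketch: it keeps "never written" and "stopped updating" under the same alert instead of letting the first case go unnoticed.
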