Monitoring — Open Items
Only keep unfinished monitoring/observability work here.
Active
- Add centralized logs.
  - No Loki currently.
  - Future direction: centralized Loki as shared infrastructure.
  - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
  - Desired sources:
    - API application logs
    - Agent logs
    - provisioning logs
    - backup/restore logs
    - delete/teardown logs
    - Velocity/DNS/edge action logs
    - discovery sync logs
- Decide whether OPNsense needs router-only monitoring.
  - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
  - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
  - Keep this separate from the game/dev container Alloy contract.
Optional hardening / improvements
- Add API app health scrape or equivalent application-level health signal.
- Add lifecycle/debug telemetry panels or metrics for:
  - `ready`
  - `connectable`
  - `operationInProgress`
  - `operationType`
  - backup/restore state
  - code-server state
- Standardize labels: prefer canonical `container_type` over mixed `ctype`/`container_type`.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Review any user/customer-identifying labels before dashboard or alert exposure.
- Add runbooks for discovery sync failure and failed game/dev provisioning debugging from monitoring only.
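
The label items above could be checked with a small helper like the one below. This is only a sketch: the function names, the alias table, and the idea of running the check in Python are assumptions for illustration, not part of the current setup. It maps the legacy `ctype` label to `container_type` and reports which of the required labels (`vmid`, `instance`, `container_type`) are missing.

```python
# Hypothetical helper for illustration; not an existing part of the stack.
REQUIRED_LABELS = ("vmid", "instance", "container_type")
LEGACY_ALIASES = {"ctype": "container_type"}  # legacy name -> canonical name


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of labels with legacy names mapped to canonical ones."""
    normalized: dict[str, str] = {}
    for name, value in labels.items():
        canonical = LEGACY_ALIASES.get(name, name)
        # If both spellings are present, their values must agree.
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for label {canonical!r}")
        normalized[canonical] = value
    return normalized


def missing_required(labels: dict[str, str]) -> list[str]:
    """List required labels that are absent or empty, in REQUIRED_LABELS order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]
```

Raising on conflicting values rather than silently picking one keeps a mixed `ctype`/`container_type` target visible until the canonical spelling is settled.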
Alerts to add
- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- game-dev-alloy scrape target down by VMID/type
- API `/sd/exporters` failure
- API app health failure once an app health signal exists
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
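
As one possible starting point for the "file_sd file stale" alert above, a staleness check can compare the file's modification time against a maximum age. The sketch below assumes a hypothetical 10-minute threshold and treats a missing file as stale; the function names, the threshold, and whichever path is passed in are illustrative, not taken from the real deployment.

```python
import os
import time


def file_sd_age_seconds(path, now=None):
    """Seconds elapsed since the file at path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(path, max_age_seconds=600.0, now=None):
    """True if the file is missing or older than max_age_seconds."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # Assumption: a missing file_sd file should alert like a stale one.
        return True
```

The `now` parameter exists only so the check can be exercised with a fixed clock; in normal use it would be left unset.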