# Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

## Active

1. Add centralized logs.
   - No Loki currently.
   - Future direction: centralized Loki as shared infrastructure.
   - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
   - Desired sources:
     - API application logs
     - Agent logs
     - provisioning logs
     - backup/restore logs
     - delete/teardown logs
     - Velocity/DNS/edge action logs
     - discovery sync logs

2. Decide whether OPNsense needs router-only monitoring.
   - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
   - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
   - Keep this separate from the game/dev container Alloy contract.

## Optional hardening / improvements

- Add API app health scrape or equivalent application-level health signal.
- Add lifecycle/debug telemetry panels or metrics for:
  - `ready`
  - `connectable`
  - `operationInProgress`
  - `operationType`
  - backup/restore state
  - code-server state
- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Review any user/customer-identifying labels before dashboard or alert exposure.
- Add runbooks for discovery sync failure and failed game/dev provisioning debugging from monitoring only.

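The label items above (canonical `container_type`, minimal required set) could be checked mechanically before targets are published. The following is a minimal sketch only: the function names, the alias table, and the sample label values (`game`, `game-101`) are hypothetical and not taken from any existing code in this project.

```python
from __future__ import annotations

# Required game/dev labels, as listed in this document.
REQUIRED_LABELS = ("vmid", "instance", "container_type")

# Assumption: `ctype` is the only legacy spelling that needs mapping.
LABEL_ALIASES = {"ctype": "container_type"}


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of `labels` with legacy keys renamed to canonical ones.

    Raises ValueError if a legacy key and its canonical key disagree.
    """
    normalized: dict[str, str] = {}
    for key, value in labels.items():
        canonical = LABEL_ALIASES.get(key, key)
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(
                f"conflicting values for {canonical!r}: "
                f"{normalized[canonical]!r} vs {value!r}"
            )
        normalized[canonical] = value
    return normalized


def missing_required(labels: dict[str, str]) -> list[str]:
    """List required labels that are absent or empty, in canonical order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


if __name__ == "__main__":
    target = normalize_labels({"vmid": "101", "instance": "game-101", "ctype": "game"})
    print(target, missing_required(target))
```

Failing loudly on conflicting `ctype` / `container_type` values is a design choice in this sketch; silently preferring the canonical key would also be defensible once the canonical spelling is decided.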
## Alerts to add

- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- game-dev-alloy scrape target down by VMID/type
- API `/sd/exporters` failure
- API app health failure once an app health signal exists
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
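As an illustration of the "file_sd file stale" item, the check below compares a file's modification time against a maximum age. It is a hedged sketch, not the planned alert: the 15-minute default threshold and the helper names are assumptions, and the real file_sd path is not specified in this document.

```python
from __future__ import annotations

import os
import time


def file_age_seconds(path: str, now: float | None = None) -> float:
    """Seconds since `path` was last modified (based on st_mtime)."""
    current = time.time() if now is None else now
    return current - os.stat(path).st_mtime


def is_stale(path: str, max_age_seconds: float = 900.0, now: float | None = None) -> bool:
    """True if the file is older than `max_age_seconds` or does not exist.

    Assumption: a missing file_sd file should count as stale, because
    discovery sync is expected to keep writing it.
    """
    try:
        return file_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        return True
```

A check like this would typically run next to the process that writes the file, with its result exposed as a metric or exit code that an alert can watch.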