# Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

## Active

1. Add centralized logs.
   - No Loki currently.
   - Future direction: centralized Loki as shared infrastructure.
   - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
   - Desired sources:
     - API application logs
     - Agent logs
     - provisioning logs
     - backup/restore logs
     - delete/teardown logs
     - Velocity/DNS/edge action logs
     - discovery sync logs

2. Decide whether OPNsense needs router-only monitoring.
   - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
   - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
   - Keep this separate from the game/dev container Alloy contract.

## Optional hardening / improvements

- Add API app health scrape or equivalent application-level health signal.
- Add lifecycle/debug telemetry panels or metrics for:
  - `ready`
  - `connectable`
  - `operationInProgress`
  - `operationType`
  - backup/restore state
  - code-server state
- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Review any user/customer-identifying labels before dashboard or alert exposure.
- Add runbooks for discovery sync failure and failed game/dev provisioning debugging from monitoring only.

## Alerts to add

- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- game-dev-alloy scrape target down by VMID/type
- API `/sd/exporters` failure
- API app health failure once an app health signal exists
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
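
As a starting point for the "file_sd file stale / not updated recently" alert, the check could be sketched as a small script that compares the file's modification time against a threshold. This is only an illustration under assumptions: the function names and the 15-minute default are hypothetical and not taken from the current setup, and the real alert may be better expressed as a Prometheus rule.

```python
import os
import time
from typing import Optional

# Hypothetical threshold: treat the file as stale after 15 minutes without an update.
DEFAULT_MAX_AGE_SECONDS = 15 * 60


def file_sd_age_seconds(path: str, now: Optional[float] = None) -> float:
    """Return the number of seconds since the file at `path` was last modified."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(path)


def is_file_sd_stale(
    path: str,
    max_age_seconds: float = DEFAULT_MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the file is missing or older than `max_age_seconds`."""
    try:
        return file_sd_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        # A missing file_sd file is treated as stale so the alert still fires.
        return True
```

Treating a missing file as stale keeps "discovery sync never wrote the file" and "discovery sync stopped writing the file" under the same signal; split them if the runbooks end up differing.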