zlh-grind/Codex/Monitoring/OPEN_ITEMS.md

# Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

## Active

1. Add centralized logs.
   - Loki is not deployed yet.
   - Future direction: centralized Loki as shared infrastructure.
   - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
   - Desired sources:
     - API application logs
     - Agent logs
     - provisioning logs
     - backup/restore logs
     - delete/teardown logs
     - Velocity/DNS/edge action logs
     - discovery sync logs

2. Decide whether OPNsense needs router-only monitoring.
   - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
   - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
   - Keep this separate from the game/dev container Alloy contract.

## Optional hardening / improvements

- Add an API app health scrape or an equivalent application-level health signal.
- Add lifecycle/debug telemetry panels or metrics for:
  - `ready`
  - `connectable`
  - `operationInProgress`
  - `operationType`
  - backup/restore state
  - code-server state
- Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Review any user/customer-identifying labels before dashboard or alert exposure.
- Add runbooks for debugging discovery sync failures and failed game/dev provisioning using monitoring data alone.
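
A minimal sketch of the label standardization items above, assuming labels arrive as a plain string-to-string mapping. The helper names (`normalize_labels`, `missing_required`) and the alias table are hypothetical and not part of the existing codebase; only the label names `ctype`, `container_type`, `vmid`, and `instance` come from this list.

```python
# Hypothetical helpers: map the legacy `ctype` label to the canonical
# `container_type` and report which required game/dev labels are absent.
REQUIRED_LABELS = ("vmid", "instance", "container_type")
LEGACY_ALIASES = {"ctype": "container_type"}  # assumption: only one alias exists


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of `labels` with legacy keys renamed to canonical ones."""
    out: dict[str, str] = {}
    for key, value in labels.items():
        canonical = LEGACY_ALIASES.get(key, key)
        # Refuse to silently pick a winner if both spellings disagree.
        if canonical in out and out[canonical] != value:
            raise ValueError(f"conflicting values for label {canonical!r}")
        out[canonical] = value
    return out


def missing_required(labels: dict[str, str]) -> list[str]:
    """List required labels that are not present, in REQUIRED_LABELS order."""
    return [name for name in REQUIRED_LABELS if name not in labels]
```

Something like this could run wherever targets are generated, so mixed spellings are caught before they reach dashboards or alerts. Where that check actually belongs is still an open decision.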

## Alerts to add

- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- game-dev-alloy scrape target down by VMID/type
- API `/sd/exporters` failure
- API app health failure once an app health signal exists
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing or failed
- log pipeline down once Loki/Alloy logs exist