Monitoring — Open Items

Only keep unfinished monitoring/observability work here.

Recently resolved / verified

  • /etc/zlh-monitor is now the runtime source of truth.

    • Prometheus runs from /etc/zlh-monitor/prometheus/prometheus.yml via /etc/default/prometheus.
    • Grafana provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.
    • /etc/zlh-monitor/README.md describes the new single-source layout.
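
As a concrete sketch of that wiring (variable names follow Debian packaging conventions and are assumptions; check them against the actual units):

```shell
# /etc/default/prometheus -- Debian-style daemon args
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml"

# /etc/default/grafana-server -- Grafana honors GF_* environment overrides
GF_PATHS_PROVISIONING=/etc/zlh-monitor/grafana/provisioning
```
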
  • Game/dev dynamic discovery empty-state now works.

    • /sd/exporters -> []
    • /monitoring/game-dev -> containers: []
    • monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
    • one-shot sync succeeded and wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
    • Prometheus refreshed, and the stale game-dev-alloy VMIDs 5205 and 6084 disappeared from active targets
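
The empty-state contract is worth pinning down: an empty target set must serialize as the literal JSON array [], since Prometheus rejects a file_sd file that is empty or missing. A minimal check, run here against a temp file standing in for the real file_sd path:

```shell
# Simulate the empty-state sync output and verify it parses as a JSON
# array with zero targets (real file: the file_sd game-dev-alloy.json above).
sd=$(mktemp)
printf '[]' > "$sd"
out=$(python3 -c "import json,sys; print('targets:', len(json.load(open(sys.argv[1]))))" "$sd")
echo "$out"
```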
  • Prometheus runtime config is clean.

    • config validates with promtool check config
    • active jobs are only:
      • prometheus
      • alloy -> 127.0.0.1:12345
      • game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
    • no active node_exporter job
    • no active cAdvisor job
    • no active :9100 jobs
    • current active targets are clean:
      • up alloy 127.0.0.1:12345
      • up prometheus localhost:9090
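
The "active targets are clean" check can be scripted against the standard Prometheus targets API; the heredoc below is a captured-style sample response so the parsing is demonstrable offline (live use would pipe curl -s localhost:9090/api/v1/targets instead):

```shell
# Print health/job/instance for every active target from a targets-API
# response; the heredoc stands in for live curl output.
cat > /tmp/targets-sample.json <<'EOF'
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"alloy","instance":"127.0.0.1:12345"},"health":"up"},
  {"labels":{"job":"prometheus","instance":"localhost:9090"},"health":"up"}]}}
EOF
summary=$(python3 - <<'PYEOF'
import json
d = json.load(open('/tmp/targets-sample.json'))
for t in d['data']['activeTargets']:
    print(t['health'], t['labels']['job'], t['labels']['instance'])
PYEOF
)
echo "$summary"
```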
  • Remaining stale/static Prometheus targets are removed from active runtime config.

    • alloy-dev-6076 no longer appears as an active runtime target
    • cadvisor 10.60.0.11:8081 no longer appears as an active runtime target
    • zerolaghub-nodes 10.60.0.19:9100 no longer appears as an active runtime target
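
A regression guard for the retired targets, assuming the runtime config stays under /etc/zlh-monitor/prometheus (the check degrades to "clean" when the tree is absent, so it is safe to run anywhere):

```shell
# Fail if any retired target address resurfaces anywhere in the runtime
# Prometheus config tree; prints "clean" otherwise.
cfg=/etc/zlh-monitor/prometheus
if grep -rnE '10\.60\.0\.11:8081|10\.60\.0\.19:9100|alloy-dev-6076' "$cfg" 2>/dev/null; then
  result="stale target reference found"
else
  result=clean
fi
echo "$result"
```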
  • Grafana provisioning is live.

    • source-managed datasource:
      • uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
    • provisioned dashboards:
      • Core Service - Detail
      • Core Services - Overview
      • Dev Containers
      • Game Containers
    • legacy node_exporter router dashboard archived to /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
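
For reference, the source-managed datasource above corresponds to a Grafana provisioning file shaped like this (filename is illustrative; field values are the ones verified above):

```yaml
# /etc/zlh-monitor/grafana/provisioning/datasources/prometheus.yml (sketch)
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```
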
  • Services active after cleanup:

    • prometheus
    • grafana-server
    • alloy
    • discovery timer

Launch blockers

  1. Restrict monitoring endpoint exposure.

    • Prometheus previously listened on *:9090.
    • Grafana previously listened on *:3000.
    • node_exporter previously listened on *:9100.
    • UFW was inactive and nftables policy was accept in the prior audit.
    • Prometheus no longer scrapes node_exporter or any :9100 jobs, but the listeners themselves may still be exposed and need to be verified, removed, or firewalled.
    • Required fix: bind Prometheus and Grafana to loopback or a management VLAN only, remove or disable any unneeded node_exporter listener, and/or firewall these ports to trusted admin/monitoring sources.
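
A minimal sketch of the loopback-binding half of the fix, using standard Prometheus and Grafana knobs (the firewall half still needs UFW/nftables rules scoped to the admin sources; defaults-file variable names are assumptions):

```shell
# /etc/default/prometheus -- restrict the UI/API listener to loopback
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml --web.listen-address=127.0.0.1:9090"

# /etc/default/grafana-server -- Grafana equivalent via env override
GF_SERVER_HTTP_ADDR=127.0.0.1

# verify afterwards: ss -tlnp | grep -E ':9090|:3000|:9100'
```
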
  2. Add API health scrape or equivalent application-level health signal.

    • Observed: /health previously returned 404 (Cannot GET /health).
    • Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
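
If /health stays a plain HTTP endpoint rather than a Prometheus metrics page, the usual pattern is a blackbox_exporter HTTP probe. A sketch, assuming the API listens on 127.0.0.1:3001 and blackbox_exporter on its default 9115 (both are assumptions):

```yaml
scrape_configs:
  - job_name: zlh-api-health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['http://127.0.0.1:3001/health']   # hypothetical API address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # blackbox_exporter, if deployed
```

probe_success from the probe then serves as the application-level health signal.
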
  3. Add launch-debug lifecycle telemetry.

    • Missing/incomplete: ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
    • Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
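
One way to satisfy this is a small set of agent-exported gauges; the metric names and vmid below are proposals, not existing metrics:

```
zlh_container_ready{vmid="7001",container_type="game"} 1
zlh_container_connectable{vmid="7001",container_type="game"} 1
zlh_operation_in_progress{vmid="7001",operation_type="backup"} 1
zlh_code_server_up{vmid="7002",container_type="dev"} 0
```

A query like zlh_container_ready == 0 then directly answers "which containers are not ready", and operation_type explains why.
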
  4. Add centralized logs.

    • No Loki, Promtail, or log-ingesting Alloy config found in prior audit.
    • Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
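
Since Alloy already runs on the monitor, the lightest path is an Alloy log pipeline into a new Loki instance. A sketch in Alloy's configuration language, assuming Loki at localhost:3100 and that the services above log to the systemd journal:

```
loki.source.journal "system" {
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}
```
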
  5. Tighten bearer token storage.

    • Token previously appeared in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env.
    • Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
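
On the Prometheus side, the inline token can be replaced with a credentials-file reference (a supported scrape_config field); the secrets path is an assumption:

```yaml
# prometheus.yml fragment: token lives in a 0600 root/prometheus-only file
# instead of inline YAML
authorization:
  type: Bearer
  credentials_file: /etc/zlh-monitor/secrets/api-token
```
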
  6. Run the lifecycle smoke test once the remaining blockers are sufficiently mitigated.

    • Create a game container and/or dev container.
    • Confirm target appears.
    • Delete it.
    • Confirm target disappears now that empty discovery works.
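
The create/observe/delete loop can be scripted around the targets API. This sketch stubs the fetch step with canned JSON so the polling logic is demonstrable offline; live use would define fetch as curl -s localhost:9090/api/v1/targets (the standard Prometheus HTTP API):

```shell
# Poll until a VMID label appears in (or vanishes from) the active target
# list. fetch is stubbed with sample data here; swap in the curl above live.
fetch() { echo '{"data":{"activeTargets":[{"labels":{"vmid":"7001"}}]}}'; }

wait_for_vmid() {  # usage: wait_for_vmid <vmid> <present|absent>
  for _ in 1 2 3 4 5; do
    if fetch | grep -q "\"vmid\":\"$1\""; then state=present; else state=absent; fi
    if [ "$state" = "$2" ]; then echo "vmid $1 $2"; return 0; fi
    sleep 2
  done
  echo "timeout: vmid $1 never became $2"; return 1
}

wait_for_vmid 7001 present   # after create: expect the target to appear
wait_for_vmid 9999 absent    # after delete: expect the target to be gone
```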

Standardization work

  • Standardize labels: prefer canonical container_type over mixed ctype / container_type.
  • Decide canonical label spelling for hostname/name/server identity.
  • Keep required game/dev labels minimal: vmid, instance, container_type.
  • Keep optional metadata useful but not noisy.
  • Review any user/customer-identifying labels before dashboard or alert exposure.
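
The ctype to container_type migration can be bridged in the scrape config while sources are updated; a sketch for the game-dev job:

```yaml
relabel_configs:
  # copy the legacy ctype label into canonical container_type when present;
  # remove once all sources emit container_type directly
  - source_labels: [ctype]
    regex: (.+)
    target_label: container_type
```
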

Alerts to add

  • discovery sync failed
  • file_sd file stale / not updated recently
  • target down by VMID/type
  • API /sd/exporters failure
  • API app health failure
  • Prometheus scrape failures by important job
  • Grafana dashboard provisioning missing/failure
  • log pipeline down once Loki/Alloy logs exist
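
Two of these translate directly into rules over metrics Prometheus already exports (prometheus_sd_file_mtime_seconds and up); thresholds here are placeholders to tune:

```yaml
groups:
  - name: zlh-monitoring
    rules:
      - alert: FileSDStale
        expr: time() - prometheus_sd_file_mtime_seconds > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "file_sd file {{ $labels.filename }} not updated in >1h"
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```
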

Follow-up improvements

  • Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
  • Add panels for lifecycle state: ready, connectable, operationInProgress, operationType, backup/restore status, code-server status.
  • Add dashboard row/panels for stale/deleted targets.
  • Add a runbook for discovery sync failure.
  • Add a runbook for failed game/dev provisioning debugging from monitoring only.