Monitoring — Open Items
Only keep unfinished monitoring/observability work here.
Recently resolved / verified
- `/etc/zlh-monitor` is now the runtime source of truth.
  - Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
  - Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
  - `/etc/zlh-monitor/README.md` describes the new single-source layout.
- Game/dev dynamic discovery empty-state now works.
  - `/sd/exporters` -> `[]`; `/monitoring/game-dev` -> `containers: []`.
  - Monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`.
  - One-shot sync succeeded and wrote 0 targets to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`.
  - Prometheus refreshed, and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared.
- Prometheus runtime config is clean (see the scrape config sketch after this list).
  - Validates with `promtool`.
  - Active jobs are only: `prometheus`, `alloy -> 127.0.0.1:12345`, and `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`.
  - No active `node_exporter` job, no active cAdvisor job, no active `:9100` jobs.
- Remaining stale/static Prometheus targets are removed from the active runtime config.
  - `alloy-dev-6076` no longer appears as an active runtime target.
  - `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target.
  - `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target.
- Grafana provisioning is live (see the datasource provisioning sketch after this list).
  - Source-managed datasource: `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`.
  - Provisioned dashboards: Core Service - Detail, Core Services - Overview, Dev Containers, Game Containers.
  - Legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`.
- Important game/dev metrics path works through remote-write.
  - API discovery emitted both test servers:
    - `6085 / dev-6085 / 10.100.0.23 / dev`
    - `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
  - file_sd sync wrote both targets and Prometheus discovered both.
  - Container Alloy remote-write works after Prometheus was rebound to `10.60.0.25:9090`.
  - Dashboard-style metrics exist for both: `node_uname_info`, CPU, memory.
  - Observed values:
    - `6085 dev-6085` (dev container): CPU ~1.04%, Mem ~6.58%
    - `5206 mc-vanilla-5206` (game container): CPU ~0.68%, Mem ~61.69%
- Current monitoring model clarified.
  - Not API-only: the API owns discovery/metadata.
  - Canonical metrics path is container Alloy remote-write to Prometheus.
  - `game-dev-alloy` scrape-up panels should not be relied on while container Alloy listens on `127.0.0.1:12345`.
  - Direct scrape health would require intentionally exposing container Alloy HTTP on a reachable address such as `0.0.0.0:12345`, plus a security review.
- Services active after cleanup: `prometheus`, `grafana-server`, `alloy`, and the discovery timer.
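
For reference, a minimal sketch of the active scrape layout described above, assuming default intervals; the live `/etc/zlh-monitor/prometheus/prometheus.yml` is authoritative, and accepting container remote-write typically also requires Prometheus to run with `--web.enable-remote-write-receiver` (presumably set via `/etc/default/prometheus`).

```yaml
# Sketch only - intervals and the self-scrape address are assumptions, not the live file.
global:
  scrape_interval: 15s

scrape_configs:
  # Prometheus self-scrape on the rebound listen address.
  - job_name: prometheus
    static_configs:
      - targets: ['10.60.0.25:9090']

  # Monitor-host Alloy, loopback-only.
  - job_name: alloy
    static_configs:
      - targets: ['127.0.0.1:12345']

  # Game/dev containers discovered by the API sync and written to file_sd.
  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
        refresh_interval: 1m
```

Any change should keep passing `promtool check config /etc/zlh-monitor/prometheus/prometheus.yml` before a reload.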
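Likewise, a sketch of what the source-managed datasource could look like under `/etc/zlh-monitor/grafana/provisioning/datasources/`; the filename and `access` mode are assumptions, the values mirror those noted above.

```yaml
# Sketch only - reflects uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1.
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```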
Launch blockers
- Finish network exposure verification/lockdown.
  - Prometheus now listens on `10.60.0.25:9090` for container remote-write and self-scrape.
  - Grafana remains loopback-only on `127.0.0.1:3000`.
  - zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`.
  - Local node_exporter is disabled.
  - Required validation: Prometheus is reachable only from intended container remote-write and trusted admin/monitoring paths; Grafana is not externally exposed; node_exporter is not listening or is firewalled; host firewall/nftables posture is acceptable.
- Add an API health scrape or equivalent application-level health signal (see the health scrape sketch after this list).
  - Observed: `/health` previously returned `404 Cannot GET /health`.
  - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
- Add launch-debug lifecycle telemetry.
  - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
  - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
- Add centralized logs.
  - No Loki, Promtail, or log-ingesting Alloy config was found in the prior audit.
  - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
- Tighten bearer token storage.
  - The token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
  - Acceptance: token stored in a tight-permission env file or systemd credentials; not world-readable; not leaked in labels, URLs, or logs.
- Finish the lifecycle smoke test delete step.
  - Delete the test dev/game servers.
  - Confirm API discovery drops them.
  - Confirm file_sd drops them.
  - Confirm dashboard metrics age out as expected.
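
A minimal sketch for the API health scrape blocker above, assuming the API grows a scrapeable endpoint; the job name, target address, metrics path, and credentials file are placeholders. Using `credentials_file` (or systemd credentials feeding it) also keeps the bearer token out of the Prometheus YAML, which ties into the token-storage blocker.

```yaml
# Sketch only - placeholder endpoint; nothing below exists yet. A plain JSON /health
# endpoint would need a blackbox-exporter probe instead of a direct scrape.
scrape_configs:
  - job_name: zlh-api
    metrics_path: /metrics
    authorization:
      type: Bearer
      credentials_file: /etc/zlh-monitor/secrets/api-token   # hypothetical tight-permission file
    static_configs:
      - targets: ['api-host:8080']   # placeholder API address
```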
Standardization work
- Standardize labels: prefer canonical `container_type` over mixed `ctype`/`container_type` (see the target sketch after this list).
- Decide canonical label spelling for hostname/name/server identity.
- Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
- Keep optional metadata useful but not noisy.
- Review any user/customer-identifying labels before dashboard or alert exposure.
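
A sketch of the canonical target/label shape under this scheme; the live file is the JSON the discovery sync writes to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`, the same label names would be attached on the container Alloy remote-write side, and the values below are examples.

```yaml
# Sketch only - YAML form of one file_sd entry; the live file is JSON, and
# `instance` comes from the target address itself.
- targets: ['10.100.0.23:12345']
  labels:
    vmid: '6085'
    container_type: 'dev'
```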
Alerts to add
- discovery sync failed
- file_sd file stale / not updated recently
- missing/stale remote-write metrics by VMID/type
- API `/sd/exporters` failure
- API app health failure
- Prometheus scrape failures by important job
- Grafana dashboard provisioning missing/failure
- log pipeline down once Loki/Alloy logs exist
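
A hedged sketch of a few of the rules listed above; alert names, thresholds, and the assumption that game/dev series carry `container_type` are placeholders, and the discovery-sync and log-pipeline alerts need new metrics before they can be written.

```yaml
# Sketch only - example rules, not deployed config.
groups:
  - name: zlh-monitoring
    rules:
      # file_sd file stale / not rewritten recently (1h threshold is an assumption;
      # adjust if the sync only rewrites the file on changes).
      - alert: GameDevFileSdStale
        expr: time() - prometheus_sd_file_mtime_seconds{filename=~".*game-dev-alloy.json"} > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "game-dev-alloy file_sd has not been rewritten recently"

      # Missing/stale remote-write metrics from game/dev containers.
      - alert: GameDevMetricsMissing
        expr: absent(node_uname_info{container_type=~"game|dev"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No game/dev container metrics arriving via remote-write"

      # Scrape failures for the important local jobs.
      - alert: ImportantScrapeDown
        expr: up{job=~"prometheus|alloy"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.instance }} in job {{ $labels.job }} is down"
```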
Follow-up improvements
- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, and Velocity/DNS/edge actions (see the Promtail-based sketch at the end of this section).
- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
- Add a dashboard row/panels for stale/deleted targets.
- Add a runbook for discovery sync failure.
- Add a runbook for failed game/dev provisioning debugging from monitoring only.
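
For the centralized-logs item, a minimal sketch of one option (Promtail shipping the systemd journal to Loki); the Loki address, positions path, and the choice of Promtail over a log-ingesting Alloy are all assumptions.

```yaml
# Sketch only - assumes a Loki instance exists and that the API/Agent/provisioning/
# discovery-sync services log to the systemd journal.
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push   # placeholder Loki push endpoint

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      # Keep the unit name so logs are queryable per service in Grafana.
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
```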