Monitoring — Current State
Launch readiness
Status: PARTIAL FAIL for launch monitoring readiness.
Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale game-dev-alloy VMIDs 5205 and 6084 are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the /etc/zlh-monitor runtime source-of-truth path.
Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
Runtime source of truth
/etc/zlh-monitor is now the runtime source of truth.
Prometheus now runs from /etc/zlh-monitor/prometheus/prometheus.yml:
- configured via /etc/default/prometheus
- backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak
Grafana now provisions from /etc/zlh-monitor/grafana/provisioning:
- configured via /etc/default/grafana-server
- backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak
/etc/zlh-monitor/README.md was updated to describe the new single-source layout.
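For reference, the redirection happens through the stock defaults files the packaged units already read. A minimal sketch of the relevant lines, assuming Debian-style units (the exact variable names may differ on this install):

```sh
# /etc/default/prometheus (sketch; assumes the packaged unit passes $ARGS to prometheus)
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml"

# /etc/default/grafana-server (sketch; assumes the packaged defaults expose PROVISIONING_CFG_DIR)
PROVISIONING_CFG_DIR=/etc/zlh-monitor/grafana/provisioning
```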
Recently verified progress
Container discovery is clean:
- /sd/exporters -> []
- /monitoring/game-dev -> containers: []
- monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
- one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
- Prometheus refreshed; stale game-dev-alloy targets 5205 and 6084 are gone
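For reference, the empty file_sd output is literally `[]`, which Prometheus accepts as zero targets. Once discovery finds containers, each entry follows the standard file_sd shape (the address and label values below are illustrative only, not observed data):

```json
[
  {
    "targets": ["10.60.0.x:12345"],
    "labels": { "vmid": "1234", "ctype": "game" }
  }
]
```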
Prometheus runtime config is clean:
- config validates with promtool
- active jobs are now only:
  - prometheus
  - alloy -> 127.0.0.1:12345
  - game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
- no active node_exporter job
- no active cAdvisor job
- no active :9100 jobs
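For orientation, the active job list above implies a scrape config of roughly this shape (a sketch, not the verbatim file; intervals, relabeling, and any auth blocks are omitted):

```yaml
# Sketch only; the real /etc/zlh-monitor/prometheus/prometheus.yml may differ in detail
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]
  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```

The validation referenced above is the standard `promtool check config /etc/zlh-monitor/prometheus/prometheus.yml`.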
Current Prometheus targets are clean:
- up alloy 127.0.0.1:12345
- up prometheus localhost:9090
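A quick way to re-confirm this target list from the monitoring host (standard Prometheus HTTP API; assumes jq is available):

```sh
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | "\(.health) \(.labels.job) \(.labels.instance)"'
```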
Grafana provisioning is live:
- one source-managed datasource exists: uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
- provisioned dashboards:
  - Core Service - Detail
  - Core Services - Overview
  - Dev Containers
  - Game Containers
- legacy node_exporter router dashboard archived to /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
Services active after cleanup:
- prometheus
- grafana-server
- alloy
- discovery timer
Exposure state
Observed listening ports from prior audit:
- *:9090 (Prometheus)
- *:3000 (Grafana)
- *:9100 (node_exporter)
- 127.0.0.1:12345 (Alloy)
Prometheus active jobs no longer include node_exporter or :9100 scraping, but network exposure still needs separate verification/lockdown.
Host firewall posture from prior audit:
- UFW inactive
- nftables input/forward/output policy accept
This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
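Until that lockdown happens, re-verification and one possible restriction path look roughly like this (sketch only; <ADMIN_NET> is a placeholder for the trusted admin/monitoring subnet, and SSH must be allowed before enabling any default-deny policy):

```sh
# Re-check which monitoring ports still listen, and on which addresses
ss -tlnp | grep -E ':(9090|3000|9100|12345)\b'

# Possible UFW-based restriction (sketch; <ADMIN_NET> is a placeholder)
ufw allow OpenSSH
ufw default deny incoming
ufw allow from <ADMIN_NET> to any port 9090 proto tcp   # Prometheus
ufw allow from <ADMIN_NET> to any port 3000 proto tcp   # Grafana
ufw enable
```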
Prometheus state
- Runtime config path: /etc/zlh-monitor/prometheus/prometheus.yml
- Config validates with promtool
- Active jobs: prometheus, alloy, game-dev-alloy
- Current targets:
  - up alloy 127.0.0.1:12345
  - up prometheus localhost:9090
- /sd/exporters is configured with bearer auth and works with the token
- unauthenticated /sd/exporters and /monitoring/game-dev return 401 Unauthorized
Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
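If the token is still inlined in YAML anywhere, one way to tighten storage is to reference a root-only file instead (a sketch; the secrets path below is an assumption, not an existing file), and to restrict /etc/zlh-monitor/game-dev-discovery.env to root:root 0600 for the sync script's copy:

```yaml
# In any scrape or HTTP SD block that still needs the token, point at a file instead of inlining it
authorization:
  type: Bearer
  credentials_file: /etc/zlh-monitor/secrets/discovery-token   # assumed path, root:root 0600
```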
Targets/discovery state
Resolved:
- game-dev-alloy VMID 5205 is gone
- game-dev-alloy VMID 6084 is gone
- alloy-dev-6076 hardcoded/down target is gone from the active runtime target list
- cAdvisor 10.60.0.11:8081 is gone from the active runtime target list
- zerolaghub-nodes 10.60.0.19:9100 is gone from the active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing
No stale/static down targets are currently present in active Prometheus targets.
API monitoring state
API host/system metrics were previously available via integrations/unix, but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.
The API app health endpoint /health previously returned 404 (Cannot GET /health), and no API health scrape was found.
No centralized API logs were found on the monitoring host.
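If an application health scrape is added later, one common pattern is an HTTP probe via blackbox_exporter; that exporter is not installed here per the audit, and the API would first need to actually serve /health. Everything below (host name, port, module) is a placeholder sketch, not current config:

```yaml
scrape_configs:
  - job_name: api-health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://<api-host>/health   # placeholder; endpoint currently returns 404
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115    # assumed blackbox_exporter address
```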
Agent/container monitoring state
Current dynamic game/dev discovery is empty and clean.
Expected next validation is lifecycle smoke testing (see the check sketch after this list):
- create a game and/or dev container
- confirm target appears
- delete it
- confirm target disappears now that empty discovery works
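A minimal way to observe the appear/disappear behaviour during that smoke test, from the monitoring host (assumes jq; the create/delete itself goes through the normal platform flow):

```sh
# Number of target groups discovery wrote (0 while no game/dev containers exist)
jq length /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

# Active game-dev-alloy targets as Prometheus currently sees them
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | select(.labels.job=="game-dev-alloy") | "\(.health) \(.labels.instance)"'
```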
Labels previously observed include: vmid, ctype or container_type, game, variant, hostname/name, job, instance.
Missing or incomplete for lifecycle debugging:
- ready/connectable
- operationInProgress
- operationType
- backup/restore state
- code-server state
Grafana state
Grafana now provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.
Datasource (see the provisioning sketch at the end of this section):
- uid=prometheus
- name=Prometheus
- url=http://localhost:9090
- default datasource
Live provisioned dashboards:
- Core Service - Detail
- Core Services - Overview
- Dev Containers
- Game Containers
Archived:
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
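A sketch of the datasource provisioning file implied by the state above (the actual file under /etc/zlh-monitor/grafana/provisioning/datasources/ may differ in name and extra fields):

```yaml
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```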
Logs state
No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
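If log centralization is addressed with the existing Alloy agent, a minimal journald-to-Loki pipeline would look roughly like this; Loki is not installed yet, so the endpoint is a placeholder, and component names should be checked against the installed Alloy version:

```alloy
// Sketch only: ship the host journal to a (not yet existing) Loki endpoint
loki.source.journal "host_journal" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://<loki-host>:3100/loki/api/v1/push"   // placeholder endpoint
  }
}
```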
Security state
Working controls from prior audit:
- Grafana anonymous access is disabled; unauthenticated /api/search returned 401
- discovery endpoints require a token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
- Prometheus labels/scrape URLs checked did not expose the bearer token
Launch-blocking risk:
- Prometheus and Grafana previously bound publicly on the host network
- node_exporter previously listened on *:9100; it is no longer scraped by the active Prometheus config, but listener exposure still needs verification/removal or firewalling
- host firewall policy did not restrict access in prior audit
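Possible remediation sketches for the two listener items (unit names and flags depend on how node_exporter and Prometheus were installed, so verify before applying):

```sh
# If node_exporter is genuinely unused, stop the *:9100 listener
# (unit is often node_exporter or prometheus-node-exporter depending on the install)
systemctl disable --now prometheus-node-exporter

# Alternatively/additionally, bind the UIs away from public interfaces:
#   Prometheus: add --web.listen-address=127.0.0.1:9090 to ARGS in /etc/default/prometheus
#   Grafana:    set http_addr = 127.0.0.1 under [server] in grafana.ini
# Localhost-only binding then requires an SSH tunnel or reverse proxy for admin access.
```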