zlh-grind/Codex/Monitoring/CURRENT_STATE.md

Monitoring — Current State

Launch readiness

Status: PARTIAL FAIL for launch monitoring readiness.

Core monitoring services are running:

  • game/dev dynamic discovery works
  • stale game-dev-alloy VMIDs 5205 and 6084 are gone
  • legacy/static Prometheus jobs have been removed from the active runtime config
  • Grafana dashboard provisioning is live from /etc/zlh-monitor
  • the important game/dev metrics path now works through container Alloy remote-write

Launch-debug readiness is still blocked by:

  • final exposure verification/lockdown
  • missing/unfinished centralized logs
  • missing API application health scrape
  • incomplete lifecycle telemetry
  • the unfinished delete half of the game/dev lifecycle smoke test

Runtime source of truth

/etc/zlh-monitor is now the runtime source of truth.

Prometheus now runs from:

  • /etc/zlh-monitor/prometheus/prometheus.yml
  • configured via /etc/default/prometheus
  • backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak

Grafana now provisions from:

  • /etc/zlh-monitor/grafana/provisioning
  • configured via /etc/default/grafana-server
  • backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak

/etc/zlh-monitor/README.md was updated to describe the new single-source layout.

Canonical metrics model

Monitoring is not API-only.

The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. The canonical metrics path for game/dev containers is now:

container Alloy remote_write
  -> Prometheus on 10.60.0.25:9090
  -> Grafana dashboards

The game-dev-alloy file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on 127.0.0.1:12345 and Prometheus file_sd scrapes container-ip:12345.

If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as 0.0.0.0:12345, with security review.
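The canonical-vs-scrape decision can be reduced to a bind check. A minimal sketch, assuming the shell has access to the container's configured listen address (Grafana Alloy's HTTP listen address is typically set via its `--server.http.listen-addr` run flag; the variable and messages below are illustrative, not from the host):

```shell
# Classify an Alloy HTTP bind: loopback binds cannot be scraped by the
# Prometheus file_sd job, which targets container-ip:12345.
listen_addr="127.0.0.1:12345"   # bind observed on current containers

case "$listen_addr" in
  127.0.0.1:*|localhost:*)
    echo "loopback-only: file_sd scrape will show DOWN" ;;
  *)
    echo "externally bound: direct scrape possible (security review first)" ;;
esac
```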

Recently verified progress

Container discovery is clean:

  • /sd/exporters -> [] during empty state
  • /monitoring/game-dev -> containers: [] during empty state
  • monitor-side ALLOW_EMPTY="true" added in /etc/zlh-monitor/game-dev-discovery.env
  • one-shot sync succeeded: wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone
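The empty-state behavior above hinges on the sync writing a valid empty file rather than failing. A sketch of that write (a local path is used here for illustration; the real file is /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json):

```shell
# An empty JSON array is a valid Prometheus file_sd file: Prometheus keeps
# the job but drops all game-dev-alloy targets, which is how VMIDs 5205 and
# 6084 aged out cleanly.
printf '[]\n' > ./game-dev-alloy.json
cat ./game-dev-alloy.json   # -> []
```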

Prometheus runtime config is clean:

  • config validates with promtool
  • active jobs are now only:
    • prometheus
    • alloy -> 127.0.0.1:12345
    • game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
  • no active node_exporter job
  • no active cAdvisor job
  • no active :9100 jobs
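The three active jobs above can be sketched as a minimal runtime config (scrape interval and layout are assumptions, not copied from the host; only the job names and targets come from this document):

```shell
# Illustrative reconstruction of the active scrape set. Validate any real
# edit with promtool before reloading:  promtool check config <file>
cat > ./prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['10.60.0.25:9090']
  - job_name: alloy
    static_configs:
      - targets: ['127.0.0.1:12345']
  - job_name: game-dev-alloy
    file_sd_configs:
      - files: ['/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json']
EOF
grep -c 'job_name' ./prometheus.yml   # -> 3
```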

Runtime bind/target state after the remote-write fix:

  • Prometheus listens on 10.60.0.25:9090
  • Prometheus self-scrape target is 10.60.0.25:9090
  • Grafana remains loopback-only on 127.0.0.1:3000
  • zlh-monitor local Alloy remains loopback-only on 127.0.0.1:12345
  • local node_exporter is disabled

Grafana provisioning is live:

  • one source-managed datasource exists:
    • uid=prometheus, name=Prometheus, url=http://localhost:9090, default=1
  • provisioned dashboards:
    • Core Service - Detail
    • Core Services - Overview
    • Dev Containers
    • Game Containers
  • legacy node_exporter router dashboard archived to:
    • /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Services active after cleanup:

  • prometheus
  • grafana-server
  • alloy
  • discovery timer

Game/dev metrics smoke evidence

The important game/dev metrics path works for the current test containers.

API discovery emits both test servers:

  • 6085 / dev-6085 / 10.100.0.23 / dev
  • 5206 / mc-vanilla-5206 / 10.200.0.79 / game

file_sd sync writes both targets, and Prometheus discovers both.

Alloy remote-write from both containers works after Prometheus was rebound to 10.60.0.25:9090.

Dashboard-style metrics exist for both containers:

  • node_uname_info
  • CPU metrics
  • memory metrics

Observed values:

  • 6085 / dev-6085 / dev-container: CPU ~1.04%, Mem ~6.58%
  • 5206 / mc-vanilla-5206 / game-container: CPU ~0.68%, Mem ~61.69%

Known design mismatch:

  • container Alloy listens on 127.0.0.1:12345
  • Prometheus file_sd scrapes container-ip:12345
  • therefore game-dev-alloy scrape targets are discovered but down

Current decision: use remote-write as canonical and do not rely on game-dev-alloy scrape health panels unless/until direct container Alloy scrape is intentionally enabled.
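Dashboards can encode this decision by keying container health off the remote-written series instead of scrape health. A hedged PromQL sketch (metric and label names as observed in this document; grouping by `vmid` is an assumption about the label set):

```promql
# Canonical health: the container has pushed node metrics via remote-write.
count by (vmid) (node_uname_info)

# Not canonical while container Alloy is loopback-only: these targets read 0.
up{job="game-dev-alloy"}
```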

Exposure state

Current reported bind posture:

  • Prometheus listens on 10.60.0.25:9090
  • Grafana remains loopback-only on 127.0.0.1:3000
  • zlh-monitor local Alloy remains loopback-only on 127.0.0.1:12345
  • local node_exporter is disabled

Remaining exposure work:

  • verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet
  • verify Grafana is not reachable except via intended admin path
  • verify node_exporter is no longer listening or is firewalled if any listener remains
  • verify host firewall/nftables policy is appropriate
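The listener half of this verification can be scripted. A sketch, with sample data standing in for live `ss` output and an allow-policy that is an assumption (loopback plus the 10.60.0.25 management bind):

```shell
# Flag any TCP listener bound beyond loopback or the expected Prometheus
# management address. In production, replace the sample with:
#   binds="$(ss -ltnH | awk '{print $4}')"
allowed='^(127\.|10\.60\.0\.25)'
binds='127.0.0.1:3000
10.60.0.25:9090
127.0.0.1:12345'
echo "$binds" | grep -Ev "$allowed" || echo "no unexpected listeners"
```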

Prometheus state

  • Runtime config path: /etc/zlh-monitor/prometheus/prometheus.yml
  • Config validates with promtool
  • Active jobs:
    • prometheus
    • alloy
    • game-dev-alloy
  • Prometheus listens on 10.60.0.25:9090
  • Prometheus self-scrape target is 10.60.0.25:9090
  • /sd/exporters is configured with bearer auth and works with token
  • unauthenticated /sd/exporters and /monitoring/game-dev return 401 Unauthorized

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
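One way to tighten storage is Prometheus's `credentials_file` support, which keeps the token out of the YAML and backups. A sketch (the secrets path and token value are placeholders, not existing files on the host):

```shell
# Create a root-only token file; Prometheus reads it at scrape time, so the
# token no longer appears inline in prometheus.yml or its .bak copies.
install -m 600 /dev/null ./discovery.token    # real path would live under /etc/zlh-monitor
printf '%s\n' 'PLACEHOLDER-TOKEN' > ./discovery.token

# In the discovery scrape/http_sd config, replace the inline token with:
#   authorization:
#     credentials_file: /etc/zlh-monitor/secrets/discovery.token   # hypothetical path
stat -c '%a' ./discovery.token   # -> 600
```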

Targets/discovery state

Resolved:

  • game-dev-alloy VMID 5205 is gone
  • game-dev-alloy VMID 6084 is gone
  • alloy-dev-6076 hardcoded/down target is gone from active runtime target list
  • cAdvisor 10.60.0.11:8081 is gone from active runtime target list
  • zerolaghub-nodes 10.60.0.19:9100 is gone from active runtime target list
  • empty discovery writes a valid empty file_sd file instead of failing
  • active test containers appear in API discovery and file_sd

Expected remaining validation:

  • delete test dev/game servers
  • confirm API discovery drops them
  • confirm file_sd drops them
  • confirm dashboard metrics age out as expected
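The delete-half checks above can be sketched as a script skeleton (the API base URL and token variable are assumptions; the endpoints are the ones named in this document):

```shell
# 1. After deleting the test servers, discovery should report empty again:
#      curl -s -H "Authorization: Bearer $TOKEN" "$API/sd/exporters"          # expect []
#      curl -s -H "Authorization: Bearer $TOKEN" "$API/monitoring/game-dev"   # expect containers: []
# 2. After the next sync, file_sd should be empty; a stand-in for:
#      cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
sd='[]'
test "$sd" = "[]" && echo "file_sd empty: PASS"
```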

API monitoring state

API discovery is working for game/dev monitoring inventory.

The API application health endpoint /health previously returned 404 (Cannot GET /health), and no Prometheus scrape job for API health was found.

No centralized API logs were found on the monitoring host.

Agent/container monitoring state

Current test containers prove remote-write metrics are working:

  • 6085 dev-6085 dev-container
  • 5206 mc-vanilla-5206 game-container

Labels previously observed include:

  • vmid
  • ctype or container_type
  • game
  • variant
  • hostname / name
  • job
  • instance

Missing or incomplete for lifecycle debugging:

  • ready / connectable
  • operationInProgress
  • operationType
  • backup/restore state
  • code-server state
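One hedged way to close this gap: have the Agent write lifecycle gauges in Prometheus textfile-collector format, which the node_exporter-style collector inside container Alloy can pick up. The metric names below are proposals, not metrics that exist today:

```shell
# Proposed lifecycle gauges, written by the Agent on state changes and
# shipped over the existing remote-write path.
cat > ./zlh_lifecycle.prom <<'EOF'
# HELP zlh_container_ready 1 when the container is connectable.
# TYPE zlh_container_ready gauge
zlh_container_ready 1
# HELP zlh_operation_in_progress 1 while a lifecycle operation runs.
# TYPE zlh_operation_in_progress gauge
zlh_operation_in_progress{operation_type="backup"} 0
EOF
grep -c '^zlh_' ./zlh_lifecycle.prom   # -> 2
```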

Grafana state

Grafana now provisions from /etc/zlh-monitor/grafana/provisioning via /etc/default/grafana-server.

Datasource:

  • uid=prometheus
  • name=Prometheus
  • url=http://localhost:9090
  • default datasource

Live provisioned dashboards:

  • Core Service - Detail
  • Core Services - Overview
  • Dev Containers
  • Game Containers

Archived:

  • /etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

Dashboard-style CPU and memory metrics exist for both current test containers.

Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.

Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

Security state

Working controls:

  • Grafana is loopback-only on 127.0.0.1:3000 per current report
  • discovery endpoints require token; unauthenticated /sd/exporters and /monitoring/game-dev returned 401
  • Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit

Remaining launch security checks:

  • Prometheus on 10.60.0.25:9090 must be restricted to intended container remote-write and admin/monitoring paths
  • any remaining node_exporter listener must be disabled or firewalled
  • host firewall policy must be verified