Launch blocker: monitoring/observability readiness #5

Open
opened 2026-05-01 18:29:06 +00:00 by jester · 3 comments
Owner

Summary

Current monitoring posture is not launch-ready.

Core services are running, and Prometheus can see the API host via Alloy remote-written host metrics plus one current dev/node target from /sd/exporters. Launch-debug readiness is still blocked by public endpoint exposure, stale/failing game/dev discovery, missing centralized logs, and dashboards that are not actually installed in Grafana.

Source of truth docs

  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/CONTRACT.md
  • Codex/Monitoring/OPEN_ITEMS.md
  • Codex/Monitoring/VALIDATION.md

Launch blockers

  1. Restrict monitoring endpoint exposure.

    • Prometheus currently listens on *:9090.
    • Grafana currently listens on *:3000.
    • node_exporter currently listens on *:9100.
    • UFW is inactive and the nftables default policy is accept, so nothing currently restricts inbound access (see the firewall sketch after this list).
  2. Fix game/dev discovery sync.

    • zlh-monitor-game-dev-discovery.service is failed.
    • API returns dev VMID 6050 on 10.60.0.220, which is outside the range the sync script accepts.
    • Sync script only allows dev IPs in 10.100.0.0/16 and refuses to write an empty discovery file, so the sync fails instead of updating targets.
  3. Remove stale file_sd targets.

    • Stale down targets observed: VMID 5205 and VMID 6084.
  4. Install/provision Grafana dashboards.

    • Dashboard JSON exists under /etc/zlh-monitor/grafana/dashboards.
    • Grafana DB currently has no dashboards installed.
  5. Add API health/application scrape.

    • GET /health on the API returned 404 (Cannot GET /health).
    • No API health scrape was found (a scrape-job sketch follows this list).
  6. Add launch-debug lifecycle visibility.

    • Missing/incomplete: ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
  7. Add centralized logs.

    • No Loki, Promtail, or log-ingesting Alloy pipeline found.
    • API/Agent/provisioning/backup/delete failures are not visible from the monitoring surface (a journal-shipping sketch follows this list).
  8. Tighten monitoring token storage.

    • Bearer token appears in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env.
    • Move the token to tighter-permission env files or systemd credentials (see the credential sketch after this list).
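
For launch blocker 1, a minimal UFW sketch. The 10.60.0.0/24 admin range and the container-network allowance for Alloy remote_write are assumptions and must be checked against the real topology before enabling:

# Sketch only: substitute the real admin and container ranges first.
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 10.60.0.0/24 to any port 22 proto tcp     # keep SSH reachable
sudo ufw allow from 10.60.0.0/24 to any port 9090 proto tcp   # Prometheus, admin only
sudo ufw allow from 10.60.0.0/24 to any port 3000 proto tcp   # Grafana, admin only
sudo ufw allow from 10.60.0.0/24 to any port 9100 proto tcp   # node_exporter, admin only
sudo ufw allow from 10.100.0.0/16 to any port 9090 proto tcp  # container ranges that push metrics to Prometheus (assumption)
sudo ufw enable
sudo ufw status verbose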
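
For launch blocker 5, once the API exposes a Prometheus-format metrics endpoint (or a blackbox-style probe of /health is added), the scrape side is one small job. A sketch; the job name, API host, port, and path are placeholders, not verified values:

API_ADDR="api-host:3000"                      # placeholder - substitute the real API host:port
curl -fsS "http://$API_ADDR/metrics" | head   # confirm the endpoint responds before wiring it up
# Merge a job along these lines under scrape_configs: in
# /etc/zlh-monitor/prometheus/prometheus.yml:
#   - job_name: zlh-api
#     metrics_path: /metrics
#     static_configs:
#       - targets: ['api-host:3000']
promtool check config /etc/zlh-monitor/prometheus/prometheus.yml
sudo systemctl reload prometheus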
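
For launch blocker 7, one minimal path is shipping the systemd journal to Loki with Promtail (an Alloy loki pipeline would work equally well). A sketch, assuming Loki will run on the monitoring host at 127.0.0.1:3100; paths and labels are illustrative:

# Sketch only: Loki address, positions path, and labels are assumptions.
sudo tee /etc/promtail/config.yml >/dev/null <<'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
EOF
sudo systemctl restart promtail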
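
For launch blocker 8, systemd credentials (LoadCredential=, systemd 247+) keep the bearer token out of YAML and env files. A sketch against the discovery service; the secret path and credential name are assumptions:

sudo install -d -m 0700 /etc/zlh-monitor/secrets
sudo install -m 0600 /dev/null /etc/zlh-monitor/secrets/discovery-token
# paste the real token into that file, then remove it from
# /etc/zlh-monitor/game-dev-discovery.env and the Prometheus/backup configs
sudo mkdir -p /etc/systemd/system/zlh-monitor-game-dev-discovery.service.d
sudo tee /etc/systemd/system/zlh-monitor-game-dev-discovery.service.d/credentials.conf >/dev/null <<'EOF'
[Service]
LoadCredential=discovery-token:/etc/zlh-monitor/secrets/discovery-token
EOF
sudo systemctl daemon-reload
# the sync script then reads the token from "$CREDENTIALS_DIRECTORY/discovery-token"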

Acceptance criteria

  • Prometheus, Grafana, and node_exporter are not publicly reachable.
  • Game and dev targets automatically appear after create.
  • Game and dev targets automatically disappear after delete.
  • Discovery sync is healthy and alerts if it fails.
  • Stale file_sd targets are gone after successful sync.
  • Grafana dashboards are installed, visible, and return live data.
  • API app health is visible to Prometheus or equivalent monitoring.
  • Lifecycle state is visible enough for launch debugging.
  • Centralized logs exist or missing logs are explicitly accepted as launch risk.
  • Discovery/auth bearer token handling is locked down.
Author
Owner

Progress update: container discovery is now clean.

Verified current API discovery output:

/sd/exporters -> []
/monitoring/game-dev -> containers: []

A safe monitor-side follow-up was applied:

/etc/zlh-monitor/game-dev-discovery.env
ALLOW_EMPTY="true"

A one-shot sync succeeded:

wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone.

Docs updated:

  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/OPEN_ITEMS.md

Remaining priorities:

  1. Remove or fix remaining stale/static targets:
    • alloy-dev-6076 hardcoded/down
    • cadvisor 10.60.0.11:8081 down/no route
    • zerolaghub-nodes 10.60.0.19:9100 down/connection refused
  2. Lock down exposure for Prometheus *:9090, Grafana *:3000, and node_exporter *:9100.
  3. Provision/check Grafana dashboards.
  4. Add/verify centralized logs for API/Agent/provision/backup/restore/delete paths.
  5. Run one lifecycle smoke test: create a game/dev container, confirm the target appears, delete it, and confirm the target disappears (a verification sketch follows this list).
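
A verification sketch for that smoke test, run from the monitoring host. It assumes the current local Prometheus bind and the game-dev-alloy job/file names used above:

# After creating a test game/dev container, a target should appear here within
# one sync interval; after deleting it, the target should disappear again.
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[]
        | select(.labels.job == "game-dev-alloy")
        | {instance: .labels.instance, health: .health}'
# The file the sync writes can be inspected directly as well:
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
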
Author
Owner

Progress update: /etc/zlh-monitor is now the runtime source of truth and the active Prometheus/Grafana state is much cleaner.

Prometheus now runs from:

/etc/zlh-monitor/prometheus/prometheus.yml
configured via /etc/default/prometheus
backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak

Active Prometheus jobs are now only:

prometheus
alloy -> 127.0.0.1:12345
game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

No active node_exporter, no active cAdvisor, and no active :9100 jobs remain in the runtime Prometheus config. Current targets are clean:

up alloy       127.0.0.1:12345
up prometheus  localhost:9090

Grafana now provisions from:

/etc/zlh-monitor/grafana/provisioning
configured via /etc/default/grafana-server
backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak

Source-managed datasource:

uid=prometheus name=Prometheus url=http://localhost:9090 default=1
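
For reference, reproducing that datasource on a fresh host would use a provisioning file along these lines (the filename is an assumption; the live file under /etc/zlh-monitor/grafana/provisioning may differ in detail):

# Sketch: values mirror the source-managed datasource recorded above.
sudo tee /etc/zlh-monitor/grafana/provisioning/datasources/prometheus.yml >/dev/null <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
sudo systemctl restart grafana-server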

Live provisioned dashboards:

Core Service - Detail
Core Services - Overview
Dev Containers
Game Containers

Legacy node_exporter router dashboard archived to:

/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

/etc/zlh-monitor/README.md now describes the new single-source layout. Prometheus config validates with promtool, and prometheus, grafana-server, alloy, and the discovery timer are active.
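
To re-check that state later, the same validation can be scripted (the timer name is assumed to match the zlh-monitor-* prefix):

# Validate the runtime config and confirm the services/timer are active.
promtool check config /etc/zlh-monitor/prometheus/prometheus.yml
systemctl is-active prometheus grafana-server alloy
systemctl list-timers 'zlh-monitor-*'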

Docs updated:

  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/OPEN_ITEMS.md

Remaining launch blockers are now narrowed to:

  1. Verify/lock down network exposure for Prometheus/Grafana and any remaining node_exporter listener.
  2. Add API health scrape or equivalent application-level health signal.
  3. Add launch-debug lifecycle telemetry.
  4. Add centralized logs.
  5. Tighten bearer-token storage if permissions/credential handling are not already locked down.
  6. Run the game/dev lifecycle discovery smoke test.
Author
Owner

Progress update: the important game/dev monitoring path is confirmed working, and the monitoring contract is clarified as not API-only.

Current model:

API discovery/metadata
  -> monitor sync/file_sd inventory

container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards

Confirmed test servers:

6085 / dev-6085 / 10.100.0.23 / dev
5206 / mc-vanilla-5206 / 10.200.0.79 / game

Confirmed behavior:

  • API discovery emits both test servers.
  • file_sd sync writes both targets.
  • Prometheus discovers both.
  • Alloy remote-write from both containers works after Prometheus was rebound to 10.60.0.25:9090.
  • Dashboard-style metrics exist for both:
    • node_uname_info
    • CPU
    • memory

Observed values:

6085 dev-6085        dev-container   CPU ~1.04%   Mem ~6.58%
5206 mc-vanilla-5206 game-container  CPU ~0.68%   Mem ~61.69%
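
The CPU and memory percentages come from standard node_exporter-style series arriving over remote_write; queries along these lines reproduce them (a sketch, the exact dashboard expressions may differ):

PROM=http://10.60.0.25:9090
# CPU busy % per instance
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'
# Memory used % per instance
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'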

Bind/state after the fix:

Prometheus listens on 10.60.0.25:9090
Prometheus self-scrape target is 10.60.0.25:9090
Grafana remains loopback-only on 127.0.0.1:3000
Alloy remains loopback-only on 127.0.0.1:12345 on zlh-monitor
local node_exporter is disabled

Design mismatch recorded:

  • Container Alloy listens on 127.0.0.1:12345.
  • Prometheus file_sd scrapes container-ip:12345.
  • Therefore game-dev-alloy scrape targets are discovered but down.

Current decision: use remote-write as canonical and do not rely on game-dev-alloy scrape health panels unless direct container Alloy scrape is intentionally enabled later. Direct scrape would require exposing container Alloy HTTP on a reachable address such as 0.0.0.0:12345 and a security review.
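
Given that decision, the useful launch-debug distinction is "scrape target down" versus "no data arriving at all"; a quick check (job and metric names follow the model above, Prometheus address as rebound):

PROM=http://10.60.0.25:9090
# Expected to report 0 (down) for each container per the mismatch above:
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=up{job="game-dev-alloy"}'
# Should return one series per live game/dev container if remote_write is flowing:
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=node_uname_info'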

Docs updated:

  • Codex/Monitoring/CONTRACT.md
  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/OPEN_ITEMS.md

Remaining next step: delete the test dev/game servers and verify API discovery/file_sd drops them, then confirm dashboard metrics age out as expected.
