Launch blocker: monitoring/observability readiness #5

Open
opened 2026-05-01 18:29:06 +00:00 by jester · 3 comments
Owner

Summary

Current monitoring posture is not launch-ready.

Core services are running, and Prometheus can see the API host via Alloy remote-written host metrics plus one current dev/node target from /sd/exporters. Launch-debug readiness is still blocked by public endpoint exposure, stale/failing game/dev discovery, missing centralized logs, and dashboards that are not actually installed in Grafana.

Source of truth docs

  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/CONTRACT.md
  • Codex/Monitoring/OPEN_ITEMS.md
  • Codex/Monitoring/VALIDATION.md

Launch blockers

  1. Restrict monitoring endpoint exposure.

    • Prometheus currently listens on *:9090.
    • Grafana currently listens on *:3000.
    • node_exporter currently listens on *:9100.
    • UFW is inactive and the nftables default policy is accept, so nothing currently restricts inbound access (see the firewall sketch after this list).
  2. Fix game/dev discovery sync.

    • zlh-monitor-game-dev-discovery.service is failed.
    • API returns dev VMID 6050 on 10.60.0.220, which is outside the range the sync script accepts.
    • Sync script only allows dev IPs in 10.100.0.0/16 and refuses to write an empty discovery file, so the sync fails instead of updating targets.
  3. Remove stale file_sd targets.

    • Stale down targets observed: VMID 5205 and VMID 6084.
  4. Install/provision Grafana dashboards.

    • Dashboard JSON exists under /etc/zlh-monitor/grafana/dashboards.
    • Grafana DB currently has no dashboards installed.
  5. Add API health/application scrape.

    • GET /health on the API returned 404 (Cannot GET /health).
    • No API health scrape was found (a scrape-job sketch follows this list).
  6. Add launch-debug lifecycle visibility.

    • Missing/incomplete: ready, connectable, operationInProgress, operationType, backup/restore state, code-server state.
  7. Add centralized logs.

    • No Loki, Promtail, or log-ingesting Alloy pipeline found.
    • API/Agent/provisioning/backup/delete failures are not visible from the monitoring surface (a journal-shipping sketch follows this list).
  8. Tighten monitoring token storage.

    • Bearer token appears in Prometheus YAML, backup configs, and /etc/zlh-monitor/game-dev-discovery.env.
    • Move the token to tighter-permission env files or systemd credentials (see the credential sketch after this list).
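
For launch blocker 1, a minimal UFW sketch. The 10.60.0.0/24 admin range and the container-network allowance for Alloy remote_write are assumptions and must be checked against the real topology before enabling:

# Sketch only: substitute the real admin and container ranges first.
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 10.60.0.0/24 to any port 22 proto tcp     # keep SSH reachable
sudo ufw allow from 10.60.0.0/24 to any port 9090 proto tcp   # Prometheus, admin only
sudo ufw allow from 10.60.0.0/24 to any port 3000 proto tcp   # Grafana, admin only
sudo ufw allow from 10.60.0.0/24 to any port 9100 proto tcp   # node_exporter, admin only
sudo ufw allow from 10.100.0.0/16 to any port 9090 proto tcp  # container ranges that push metrics to Prometheus (assumption)
sudo ufw enable
sudo ufw status verbose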
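
For launch blocker 5, once the API exposes a Prometheus-format metrics endpoint (or a blackbox-style probe of /health is added), the scrape side is one small job. A sketch; the job name, API host, port, and path are placeholders, not verified values:

API_ADDR="api-host:3000"                      # placeholder - substitute the real API host:port
curl -fsS "http://$API_ADDR/metrics" | head   # confirm the endpoint responds before wiring it up
# Merge a job along these lines under scrape_configs: in
# /etc/zlh-monitor/prometheus/prometheus.yml:
#   - job_name: zlh-api
#     metrics_path: /metrics
#     static_configs:
#       - targets: ['api-host:3000']
promtool check config /etc/zlh-monitor/prometheus/prometheus.yml
sudo systemctl reload prometheus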
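
For launch blocker 7, one minimal path is shipping the systemd journal to Loki with Promtail (an Alloy loki pipeline would work equally well). A sketch, assuming Loki will run on the monitoring host at 127.0.0.1:3100; paths and labels are illustrative:

# Sketch only: Loki address, positions path, and labels are assumptions.
sudo tee /etc/promtail/config.yml >/dev/null <<'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
EOF
sudo systemctl restart promtail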
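
For launch blocker 8, systemd credentials (LoadCredential=, systemd 247+) keep the bearer token out of YAML and env files. A sketch against the discovery service; the secret path and credential name are assumptions:

sudo install -d -m 0700 /etc/zlh-monitor/secrets
sudo install -m 0600 /dev/null /etc/zlh-monitor/secrets/discovery-token
# paste the real token into that file, then remove it from
# /etc/zlh-monitor/game-dev-discovery.env and the Prometheus/backup configs
sudo mkdir -p /etc/systemd/system/zlh-monitor-game-dev-discovery.service.d
sudo tee /etc/systemd/system/zlh-monitor-game-dev-discovery.service.d/credentials.conf >/dev/null <<'EOF'
[Service]
LoadCredential=discovery-token:/etc/zlh-monitor/secrets/discovery-token
EOF
sudo systemctl daemon-reload
# the sync script then reads the token from "$CREDENTIALS_DIRECTORY/discovery-token"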

Acceptance criteria

  • Prometheus, Grafana, and node_exporter are not publicly reachable.
  • Game and dev targets automatically appear after create.
  • Game and dev targets automatically disappear after delete.
  • Discovery sync is healthy and alerts if it fails.
  • Stale file_sd targets are gone after successful sync.
  • Grafana dashboards are installed, visible, and return live data.
  • API app health is visible to Prometheus or equivalent monitoring.
  • Lifecycle state is visible enough for launch debugging.
  • Centralized logs exist or missing logs are explicitly accepted as launch risk.
  • Discovery/auth bearer token handling is locked down.
Author
Owner

Progress update: container discovery is now clean.

Verified current API discovery output:

/sd/exporters -> []
/monitoring/game-dev -> containers: []

A safe monitor-side follow-up was applied:

/etc/zlh-monitor/game-dev-discovery.env
ALLOW_EMPTY="true"

A one-shot sync succeeded:

wrote 0 targets to /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

Prometheus refreshed and stale game-dev-alloy targets 5205 and 6084 are gone.

Docs updated:

  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/OPEN_ITEMS.md

Remaining priorities:

  1. Remove or fix remaining stale/static targets:
    • alloy-dev-6076 hardcoded/down
    • cadvisor 10.60.0.11:8081 down/no route
    • zerolaghub-nodes 10.60.0.19:9100 down/connection refused
  2. Lock down exposure for Prometheus *:9090, Grafana *:3000, and node_exporter *:9100.
  3. Provision/check Grafana dashboards.
  4. Add/verify centralized logs for API/Agent/provision/backup/restore/delete paths.
  5. Run one lifecycle smoke test: create a game/dev container, confirm the target appears, delete it, and confirm the target disappears (a verification sketch follows this list).
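
A verification sketch for that smoke test, run from the monitoring host. It assumes the current local Prometheus bind and the game-dev-alloy job/file names used above:

# After creating a test game/dev container, a target should appear here within
# one sync interval; after deleting it, the target should disappear again.
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[]
        | select(.labels.job == "game-dev-alloy")
        | {instance: .labels.instance, health: .health}'
# The file the sync writes can be inspected directly as well:
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
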
Author
Owner

Progress update: /etc/zlh-monitor is now the runtime source of truth and the active Prometheus/Grafana state is much cleaner.

Prometheus now runs from:

/etc/zlh-monitor/prometheus/prometheus.yml
configured via /etc/default/prometheus
backup: /etc/default/prometheus.zlh-monitor-20260501191319.bak

Active Prometheus jobs are now only:

prometheus
alloy -> 127.0.0.1:12345
game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json

No active node_exporter, no active cAdvisor, and no active :9100 jobs remain in the runtime Prometheus config. Current targets are clean:

up alloy       127.0.0.1:12345
up prometheus  localhost:9090

Grafana now provisions from:

/etc/zlh-monitor/grafana/provisioning
configured via /etc/default/grafana-server
backup: /etc/default/grafana-server.zlh-monitor-20260501191319.bak

Source-managed datasource:

uid=prometheus name=Prometheus url=http://localhost:9090 default=1
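
For reference, reproducing that datasource on a fresh host would use a provisioning file along these lines (the filename is an assumption; the live file under /etc/zlh-monitor/grafana/provisioning may differ in detail):

# Sketch: values mirror the source-managed datasource recorded above.
sudo tee /etc/zlh-monitor/grafana/provisioning/datasources/prometheus.yml >/dev/null <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
sudo systemctl restart grafana-server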

Live provisioned dashboards:

Core Service - Detail
Core Services - Overview
Dev Containers
Game Containers

Legacy node_exporter router dashboard archived to:

/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json

/etc/zlh-monitor/README.md now describes the new single-source layout. Prometheus config validates with promtool, and prometheus, grafana-server, alloy, and the discovery timer are active.
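
To re-check that state later, the same validation can be scripted (the timer name is assumed to match the zlh-monitor-* prefix):

# Validate the runtime config and confirm the services/timer are active.
promtool check config /etc/zlh-monitor/prometheus/prometheus.yml
systemctl is-active prometheus grafana-server alloy
systemctl list-timers 'zlh-monitor-*'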

Docs updated:

  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/OPEN_ITEMS.md

Remaining launch blockers are now narrowed to:

  1. Verify/lock down network exposure for Prometheus/Grafana and any remaining node_exporter listener.
  2. Add API health scrape or equivalent application-level health signal.
  3. Add launch-debug lifecycle telemetry.
  4. Add centralized logs.
  5. Tighten bearer-token storage if permissions/credential handling are not already locked down.
  6. Run the game/dev lifecycle discovery smoke test.
Author
Owner

Progress update: the important game/dev monitoring path is confirmed working, and the monitoring contract is clarified as not API-only.

Current model:

API discovery/metadata
  -> monitor sync/file_sd inventory

container Alloy remote_write
  -> Prometheus at 10.60.0.25:9090
  -> Grafana dashboards

Confirmed test servers:

6085 / dev-6085 / 10.100.0.23 / dev
5206 / mc-vanilla-5206 / 10.200.0.79 / game

Confirmed behavior:

  • API discovery emits both test servers.
  • file_sd sync writes both targets.
  • Prometheus discovers both.
  • Alloy remote-write from both containers works after Prometheus was rebound to 10.60.0.25:9090.
  • Dashboard-style metrics exist for both:
    • node_uname_info
    • CPU
    • memory

Observed values:

6085 dev-6085        dev-container   CPU ~1.04%   Mem ~6.58%
5206 mc-vanilla-5206 game-container  CPU ~0.68%   Mem ~61.69%
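
The CPU and memory percentages come from standard node_exporter-style series arriving over remote_write; queries along these lines reproduce them (a sketch, the exact dashboard expressions may differ):

PROM=http://10.60.0.25:9090
# CPU busy % per instance
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'
# Memory used % per instance
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'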

Bind/state after the fix:

Prometheus listens on 10.60.0.25:9090
Prometheus self-scrape target is 10.60.0.25:9090
Grafana remains loopback-only on 127.0.0.1:3000
Alloy remains loopback-only on 127.0.0.1:12345 on zlh-monitor
local node_exporter is disabled

Design mismatch recorded:

  • Container Alloy listens on 127.0.0.1:12345.
  • Prometheus file_sd scrapes container-ip:12345.
  • Therefore game-dev-alloy scrape targets are discovered but down.

Current decision: use remote-write as canonical and do not rely on game-dev-alloy scrape health panels unless direct container Alloy scrape is intentionally enabled later. Direct scrape would require exposing container Alloy HTTP on a reachable address such as 0.0.0.0:12345 and a security review.
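
Given that decision, the useful launch-debug distinction is "scrape target down" versus "no data arriving at all"; a quick check (job and metric names follow the model above, Prometheus address as rebound):

PROM=http://10.60.0.25:9090
# Expected to report 0 (down) for each container per the mismatch above:
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=up{job="game-dev-alloy"}'
# Should return one series per live game/dev container if remote_write is flowing:
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=node_uname_info'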

Docs updated:

  • Codex/Monitoring/CONTRACT.md
  • Codex/Monitoring/CURRENT_STATE.md
  • Codex/Monitoring/OPEN_ITEMS.md

Remaining next step: delete the test dev/game servers and verify API discovery/file_sd drops them, then confirm dashboard metrics age out as expected.
