From aab9569a490e29f8f2f1879c08926cb8e6aef207 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 20:55:42 +0000
Subject: [PATCH] Update monitoring current state after launch-ready smoke tests

---
 Codex/Monitoring/CURRENT_STATE.md | 341 ++++++++++++------------------
 1 file changed, 133 insertions(+), 208 deletions(-)

diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md
index 05872df..d132c28 100644
--- a/Codex/Monitoring/CURRENT_STATE.md
+++ b/Codex/Monitoring/CURRENT_STATE.md
@@ -2,251 +2,176 @@
 ## Launch readiness
 
-**Status: PARTIAL FAIL for launch monitoring readiness.**
+**Status: core lifecycle monitoring is launch-ready.**
 
-Core monitoring services are running. Game/dev dynamic discovery works, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, Grafana dashboard provisioning is live from `/etc/zlh-monitor`, and the important game/dev metrics path now works through container Alloy remote-write.
+The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
 
-Launch-debug readiness is still blocked by final exposure verification/lockdown, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to finish the delete half of the game/dev lifecycle smoke test.
+Remaining future work:
+
+- centralized logs are not enabled yet
+- no Loki currently
+- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
 
 ## Runtime source of truth
 
-`/etc/zlh-monitor` is now the runtime source of truth.
+`/etc/zlh-monitor` is now the monitoring runtime source of truth.
 Prometheus now runs from:
 
-- `/etc/zlh-monitor/prometheus/prometheus.yml`
-- configured via `/etc/default/prometheus`
-- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`
-
-Grafana now provisions from:
-
-- `/etc/zlh-monitor/grafana/provisioning`
-- configured via `/etc/default/grafana-server`
-- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`
-
-`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
-
-## Canonical metrics model
-
-Monitoring is not API-only.
-
-The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. The canonical metrics path for game/dev containers is now:
-
 ```text
-container Alloy remote_write
-  -> Prometheus on 10.60.0.25:9090
-  -> Grafana dashboards
+/etc/zlh-monitor/prometheus/prometheus.yml
 ```
-
-The `game-dev-alloy` file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on `127.0.0.1:12345` and Prometheus file_sd scrapes `container-ip:12345`.
-
-If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as `0.0.0.0:12345`, with security review.
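The `game-dev-alloy` file_sd path discussed in this hunk is fed by JSON target files that Prometheus re-reads on change. As a rough sketch of that mechanism — the entry below is hypothetical, with labels and layout assumed rather than taken from the real sync script's output:

```shell
# Hypothetical file_sd entry for one dev container; the real sync output's
# exact label set may differ. Prometheus turns each "targets" address into
# a scrape target and attaches the labels to every scraped series.
cat > /tmp/game-dev-alloy.example.json <<'EOF'
[
  {
    "targets": ["10.100.0.23:12345"],
    "labels": { "vmid": "6085", "ctype": "dev", "name": "dev-6085" }
  }
]
EOF
grep -c '"targets"' /tmp/game-dev-alloy.example.json
```

An empty discovery round would write `[]` to the same file, which is a valid file_sd document and removes all targets.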
-
-## Recently verified progress
-
-Container discovery is clean:
-
-- `/sd/exporters -> []` during empty state
-- `/monitoring/game-dev -> containers: []` during empty state
-- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
-- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
-- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
-
-Prometheus runtime config is clean:
-
-- config validates with `promtool`
-- active jobs are now only:
-  - `prometheus`
-  - `alloy -> 127.0.0.1:12345`
-  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
-- no active `node_exporter` job
-- no active cAdvisor job
-- no active `:9100` jobs
-
-Runtime bind/target state after the remote-write fix:
-
-- Prometheus listens on `10.60.0.25:9090`
-- Prometheus self-scrape target is `10.60.0.25:9090`
-- Grafana remains loopback-only on `127.0.0.1:3000`
-- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
-- local node_exporter is disabled
-
-Grafana provisioning is live:
-
-- one source-managed datasource exists:
-  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
-- provisioned dashboards:
-  - `Core Service - Detail`
-  - `Core Services - Overview`
-  - `Dev Containers`
-  - `Game Containers`
-- legacy node_exporter router dashboard archived to:
-  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
-
-Services active after cleanup:
-
-- `prometheus`
-- `grafana-server`
-- `alloy`
-- discovery timer
-
-## Game/dev metrics smoke evidence
-
-The important monitoring path works for current test containers.
-
-API discovery emits both test servers:
-
-- `6085 / dev-6085 / 10.100.0.23 / dev`
-- `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
-
-file_sd sync writes both targets, and Prometheus discovers both.
-
-Alloy remote-write from both containers works after Prometheus was rebound to `10.60.0.25:9090`.
-
-Dashboard-style metrics exist for both containers:
-
-- `node_uname_info`
-- CPU metrics
-- memory metrics
-
-Observed values:
+configured via:
 
 ```text
-6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%
-5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%
+/etc/default/prometheus
 ```
 
-Known design mismatch:
+Runtime Prometheus jobs are limited to:
 
-- container Alloy listens on `127.0.0.1:12345`
-- Prometheus file_sd scrapes `container-ip:12345`
-- therefore `game-dev-alloy` scrape targets are discovered but down
+```text
+prometheus
+alloy
+game-dev-alloy
+```
 
-Current decision: use remote-write as canonical and do not rely on `game-dev-alloy` scrape health panels unless/until direct container Alloy scrape is intentionally enabled.
+Removed from active Prometheus runtime config:
 
-## Exposure state
-
-Current reported bind posture:
-
-- Prometheus listens on `10.60.0.25:9090`
-- Grafana remains loopback-only on `127.0.0.1:3000`
-- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
-- local node_exporter is disabled
-
-Remaining exposure work:
+- old `node_exporter` jobs
+- cAdvisor jobs
+- stale static node targets
+- legacy game/dev targets
+- active `:9100` scraping jobs
+
+Local `prometheus-node-exporter` is disabled.
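The three-job runtime layout added in this hunk could be expressed roughly as follows. This is an illustrative shape only, not the actual contents of `/etc/zlh-monitor/prometheus/prometheus.yml`:

```shell
# Illustrative-only sketch of the three runtime jobs; the real config lives
# at /etc/zlh-monitor/prometheus/prometheus.yml and should be validated
# with `promtool check config` before any reload.
cat > /tmp/prometheus.sketch.yml <<'EOF'
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['10.60.0.25:9090']
  - job_name: alloy
    static_configs:
      - targets: ['127.0.0.1:12345']
  - job_name: game-dev-alloy
    file_sd_configs:
      - files: ['/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json']
EOF
cat /tmp/prometheus.sketch.yml
```

The `file_sd_configs` job is what picks up target add/remove without a Prometheus restart.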
+Prometheus listens on:
+
+```text
+10.60.0.25:9090
+```
+
+This allows Alloy remote-write from hosts and containers to reach Prometheus.
+
+## Grafana state
+
+Grafana provisions from:
+
+```text
+/etc/zlh-monitor/grafana/provisioning
+```
+
+Dashboard source directory:
+
+```text
+/etc/zlh-monitor/grafana/dashboards
+```
+
+Live dashboards provisioned:
+
+- Core Services - Overview
+- Core Service - Detail
+- Game Containers
+- Dev Containers
+
-- verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet
-- verify Grafana is not reachable except via intended admin path
-- verify node_exporter is no longer listening or is firewalled if any listener remains
-- verify host firewall/nftables policy is appropriate
-
-## Prometheus state
-
-- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
-- Config validates with `promtool`
-- Active jobs:
-  - `prometheus`
-  - `alloy`
-  - `game-dev-alloy`
-- Prometheus listens on `10.60.0.25:9090`
-- Prometheus self-scrape target is `10.60.0.25:9090`
-- `/sd/exporters` is configured with bearer auth and works with token
-- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
-
-Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
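The rebind to `10.60.0.25:9090` is wired through `/etc/default/prometheus`. A hedged sketch of what that ARGS line could look like — the flags are real Prometheus flags, but the actual file contents on the monitor host are an assumption; `--web.enable-remote-write-receiver` is needed whenever Prometheus itself is the remote-write target:

```shell
# Sketch of /etc/default/prometheus; real contents are assumed, not quoted.
# --web.listen-address controls the bind described above, and
# --web.enable-remote-write-receiver opens /api/v1/write for Alloy pushes.
cat > /tmp/default-prometheus.sketch <<'EOF'
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml \
  --web.listen-address=10.60.0.25:9090 \
  --web.enable-remote-write-receiver"
EOF
cat /tmp/default-prometheus.sketch
```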
-## Targets/discovery state
-
-Resolved:
-
-- `game-dev-alloy` VMID `5205` is gone
-- `game-dev-alloy` VMID `6084` is gone
-- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
-- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
-- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
-- empty discovery writes a valid empty file_sd file instead of failing
-- active test containers appear in API discovery and file_sd
-
-Expected remaining validation:
-
-- delete test dev/game servers
-- confirm API discovery drops them
-- confirm file_sd drops them
-- confirm dashboard metrics age out as expected
-
-## API monitoring state
-
-API discovery is working for game/dev monitoring inventory.
-
-Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
-
-No centralized API logs were found on the monitoring host.
-
-## Agent/container monitoring state
-
-Current test containers prove remote-write metrics are working:
-
-- `6085 dev-6085 dev-container`
-- `5206 mc-vanilla-5206 game-container`
-
-Labels previously observed include:
-
-- `vmid`
-- `ctype` or `container_type`
-- `game`
-- `variant`
-- `hostname` / `name`
-- `job`
-- `instance`
-
-Missing or incomplete for lifecycle debugging:
-
-- `ready` / `connectable`
-- `operationInProgress`
-- `operationType`
-- backup/restore state
-- code-server state
-
-## Grafana state
-
-Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
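The Grafana provisioning tree referenced around this hunk implies a datasource file somewhere under `/etc/zlh-monitor/grafana/provisioning`. A sketch using standard Grafana provisioning keys — the file name and exact contents here are assumptions; only the `uid`, `name`, `url`, and default flag values appear in this document:

```shell
# Example Grafana datasource provisioning file. The keys are standard
# Grafana provisioning fields; the actual file on the monitor host
# may differ.
cat > /tmp/grafana-datasource.sketch.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
cat /tmp/grafana-datasource.sketch.yml
```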
+Grafana provisions from: -Datasource: +```text +/etc/zlh-monitor/grafana/provisioning +``` -- `uid=prometheus` -- `name=Prometheus` -- `url=http://localhost:9090` -- default datasource +Dashboard source directory: -Live provisioned dashboards: +```text +/etc/zlh-monitor/grafana/dashboards +``` -- `Core Service - Detail` -- `Core Services - Overview` -- `Dev Containers` -- `Game Containers` +Live dashboards provisioned: -Archived: +- Core Services - Overview +- Core Service - Detail +- Game Containers +- Dev Containers -- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json` +The old node_exporter router dashboard was archived: -Dashboard-style CPU and memory metrics exist for both current test containers. +```text +/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json +``` -## Logs state +Grafana listens on loopback only: -No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit. +```text +127.0.0.1:3000 +``` -Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync. +## Discovery behavior -## Security state +API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR. -Working controls: +`9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container. 
-- Grafana is loopback-only on `127.0.0.1:3000` per current report -- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401` -- Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit +The monitor sync allows empty discovery: -Remaining launch security checks: +```text +ALLOW_EMPTY="true" +``` -- Prometheus on `10.60.0.25:9090` must be restricted to intended container remote-write and admin/monitoring paths -- any remaining node_exporter listener must be disabled or firewalled -- host firewall policy must be verified \ No newline at end of file +This ensures deleted containers correctly clear: + +```text +/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> [] +``` + +## Alloy template state + +The base template was updated so container Alloy listens on: + +```text +0.0.0.0:12345 +``` + +instead of only: + +```text +127.0.0.1:12345 +``` + +This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job. + +Updated template note replaces node_exporter with Alloy: + +```text +Alloy Metrics/UI: :12345/metrics +Alloy Listen Addr: 0.0.0.0:12345 +Node Exporter: not used +``` + +## Metrics model + +Monitoring is not API-only. + +- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation. +- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path. +- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`. +- Grafana reads Prometheus for dashboards. 
+## Smoke tests passed
+
+### First manual test
+
+```text
+dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
+game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
+delete -> both disappeared from API discovery, file_sd, and Prometheus
+```
+
+### Template test after updating base image
+
+```text
+dev 6086 / 10.100.0.24 -> up
+game 5207 / 10.200.0.80 -> up
+```
+
+### Full loop verified
+
+```text
+create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
+delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
+```
+
+## Logging state
+
+Centralized logs are not enabled yet.
+
+No Loki currently.
+
+Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.
+
+## Platform exceptions
+
+OPNsense and PBS remain explicit platform monitoring exceptions.
+
+OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
\ No newline at end of file
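For future re-runs of the full loop, the Prometheus target state and file_sd contents can be re-checked by hand. A sketch of the commands — printed here rather than executed, since the monitor host is not reachable from this document; `/api/v1/targets` is a real Prometheus HTTP API endpoint, while `jq` availability is an assumption:

```shell
# Printed, not run: the first command lists active scrape targets by job,
# the second shows the current lifecycle file_sd contents (expected to be
# [] after a clean delete).
tee /tmp/recheck.sketch <<'EOF'
curl -s 'http://10.60.0.25:9090/api/v1/targets?state=active' | jq '.data.activeTargets[].labels.job'
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
EOF
```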