From aab9569a490e29f8f2f1879c08926cb8e6aef207 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 20:55:42 +0000
Subject: [PATCH] Update monitoring current state after launch-ready smoke tests

---
 Codex/Monitoring/CURRENT_STATE.md | 341 ++++++++++++------------------
 1 file changed, 133 insertions(+), 208 deletions(-)

diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md
index 05872df..d132c28 100644
--- a/Codex/Monitoring/CURRENT_STATE.md
+++ b/Codex/Monitoring/CURRENT_STATE.md
@@ -2,251 +2,176 @@
 ## Launch readiness
 
-**Status: PARTIAL FAIL for launch monitoring readiness.**
+**Status: core lifecycle monitoring is launch-ready.**
 
-Core monitoring services are running. Game/dev dynamic discovery works, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, Grafana dashboard provisioning is live from `/etc/zlh-monitor`, and the important game/dev metrics path now works through container Alloy remote-write.
+The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
 
-Launch-debug readiness is still blocked by final exposure verification/lockdown, missing/unfinished centralized logs, missing API application health scrape, incomplete lifecycle telemetry, and the need to finish the delete half of the game/dev lifecycle smoke test.
+Remaining future work:
+
+- centralized logs are not enabled yet
+- no Loki currently
+- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
 
 ## Runtime source of truth
 
-`/etc/zlh-monitor` is now the runtime source of truth.
+`/etc/zlh-monitor` is now the monitoring runtime source of truth.
 Prometheus now runs from:
 
-- `/etc/zlh-monitor/prometheus/prometheus.yml`
-- configured via `/etc/default/prometheus`
-- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`
-
-Grafana now provisions from:
-
-- `/etc/zlh-monitor/grafana/provisioning`
-- configured via `/etc/default/grafana-server`
-- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`
-
-`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
-
-## Canonical metrics model
-
-Monitoring is not API-only.
-
-The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. The canonical metrics path for game/dev containers is now:
-
 ```text
-container Alloy remote_write
-  -> Prometheus on 10.60.0.25:9090
-  -> Grafana dashboards
+/etc/zlh-monitor/prometheus/prometheus.yml
 ```
-
-The `game-dev-alloy` file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on `127.0.0.1:12345` and Prometheus file_sd scrapes `container-ip:12345`.
-
-If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as `0.0.0.0:12345`, with security review.
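The `game-dev-alloy` file_sd path discussed in this hunk is fed by JSON target files that Prometheus re-reads on change. As a rough sketch of that mechanism — the entry below is hypothetical, with labels and layout assumed rather than taken from the real sync script's output:

```shell
# Hypothetical file_sd entry for one dev container; the real sync output's
# exact label set may differ. Prometheus turns each "targets" address into
# a scrape target and attaches the labels to every scraped series.
cat > /tmp/game-dev-alloy.example.json <<'EOF'
[
  {
    "targets": ["10.100.0.23:12345"],
    "labels": { "vmid": "6085", "ctype": "dev", "name": "dev-6085" }
  }
]
EOF
grep -c '"targets"' /tmp/game-dev-alloy.example.json
```

An empty discovery round would write `[]` to the same file, which is a valid file_sd document and removes all targets.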
-
-## Recently verified progress
-
-Container discovery is clean:
-
-- `/sd/exporters -> []` during empty state
-- `/monitoring/game-dev -> containers: []` during empty state
-- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
-- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
-- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
-
-Prometheus runtime config is clean:
-
-- config validates with `promtool`
-- active jobs are now only:
-  - `prometheus`
-  - `alloy -> 127.0.0.1:12345`
-  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
-- no active `node_exporter` job
-- no active cAdvisor job
-- no active `:9100` jobs
-
-Runtime bind/target state after the remote-write fix:
-
-- Prometheus listens on `10.60.0.25:9090`
-- Prometheus self-scrape target is `10.60.0.25:9090`
-- Grafana remains loopback-only on `127.0.0.1:3000`
-- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
-- local node_exporter is disabled
-
-Grafana provisioning is live:
-
-- one source-managed datasource exists:
-  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
-- provisioned dashboards:
-  - `Core Service - Detail`
-  - `Core Services - Overview`
-  - `Dev Containers`
-  - `Game Containers`
-- legacy node_exporter router dashboard archived to:
-  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
-
-Services active after cleanup:
-
-- `prometheus`
-- `grafana-server`
-- `alloy`
-- discovery timer
-
-## Game/dev metrics smoke evidence
-
-The important monitoring path works for current test containers.
-
-API discovery emits both test servers:
-
-- `6085 / dev-6085 / 10.100.0.23 / dev`
-- `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
-
-file_sd sync writes both targets, and Prometheus discovers both.
-
-Alloy remote-write from both containers works after Prometheus was rebound to `10.60.0.25:9090`.
-
-Dashboard-style metrics exist for both containers:
-
-- `node_uname_info`
-- CPU metrics
-- memory metrics
-
-Observed values:
+configured via:
 
 ```text
-6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%
-5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%
+/etc/default/prometheus
 ```
 
-Known design mismatch:
+Runtime Prometheus jobs are limited to:
 
-- container Alloy listens on `127.0.0.1:12345`
-- Prometheus file_sd scrapes `container-ip:12345`
-- therefore `game-dev-alloy` scrape targets are discovered but down
+```text
+prometheus
+alloy
+game-dev-alloy
+```
 
-Current decision: use remote-write as canonical and do not rely on `game-dev-alloy` scrape health panels unless/until direct container Alloy scrape is intentionally enabled.
+Removed from active Prometheus runtime config:
 
-## Exposure state
-
-Current reported bind posture:
-
-- Prometheus listens on `10.60.0.25:9090`
-- Grafana remains loopback-only on `127.0.0.1:3000`
-- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
-- local node_exporter is disabled
-
-Remaining exposure work:
+- old `node_exporter` jobs
+- cAdvisor jobs
+- stale static node targets
+- legacy game/dev targets
+- active `:9100` scraping jobs
+
+Local `prometheus-node-exporter` is disabled.
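The three-job runtime layout added in this hunk could be expressed roughly as follows. This is an illustrative shape only, not the actual contents of `/etc/zlh-monitor/prometheus/prometheus.yml`:

```shell
# Illustrative-only sketch of the three runtime jobs; the real config lives
# at /etc/zlh-monitor/prometheus/prometheus.yml and should be validated
# with `promtool check config` before any reload.
cat > /tmp/prometheus.sketch.yml <<'EOF'
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['10.60.0.25:9090']
  - job_name: alloy
    static_configs:
      - targets: ['127.0.0.1:12345']
  - job_name: game-dev-alloy
    file_sd_configs:
      - files: ['/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json']
EOF
cat /tmp/prometheus.sketch.yml
```

The `file_sd_configs` job is what picks up target add/remove without a Prometheus restart.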
+Prometheus listens on:
+
+```text
+10.60.0.25:9090
+```
+
+This allows Alloy remote-write from hosts and containers to reach Prometheus.
+
+## Grafana state
+
+Grafana provisions from:
+
+```text
+/etc/zlh-monitor/grafana/provisioning
+```
+
+Dashboard source directory:
+
+```text
+/etc/zlh-monitor/grafana/dashboards
+```
+
+Live dashboards provisioned:
+
+- Core Services - Overview
+- Core Service - Detail
+- Game Containers
+- Dev Containers
+
-- verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet
-- verify Grafana is not reachable except via intended admin path
-- verify node_exporter is no longer listening or is firewalled if any listener remains
-- verify host firewall/nftables policy is appropriate
-
-## Prometheus state
-
-- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
-- Config validates with `promtool`
-- Active jobs:
-  - `prometheus`
-  - `alloy`
-  - `game-dev-alloy`
-- Prometheus listens on `10.60.0.25:9090`
-- Prometheus self-scrape target is `10.60.0.25:9090`
-- `/sd/exporters` is configured with bearer auth and works with token
-- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
-
-Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
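The rebind to `10.60.0.25:9090` is wired through `/etc/default/prometheus`. A hedged sketch of what that ARGS line could look like — the flags are real Prometheus flags, but the actual file contents on the monitor host are an assumption; `--web.enable-remote-write-receiver` is needed whenever Prometheus itself is the remote-write target:

```shell
# Sketch of /etc/default/prometheus; real contents are assumed, not quoted.
# --web.listen-address controls the bind described above, and
# --web.enable-remote-write-receiver opens /api/v1/write for Alloy pushes.
cat > /tmp/default-prometheus.sketch <<'EOF'
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml \
  --web.listen-address=10.60.0.25:9090 \
  --web.enable-remote-write-receiver"
EOF
cat /tmp/default-prometheus.sketch
```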
-## Targets/discovery state
-
-Resolved:
-
-- `game-dev-alloy` VMID `5205` is gone
-- `game-dev-alloy` VMID `6084` is gone
-- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
-- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
-- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
-- empty discovery writes a valid empty file_sd file instead of failing
-- active test containers appear in API discovery and file_sd
-
-Expected remaining validation:
-
-- delete test dev/game servers
-- confirm API discovery drops them
-- confirm file_sd drops them
-- confirm dashboard metrics age out as expected
-
-## API monitoring state
-
-API discovery is working for game/dev monitoring inventory.
-
-Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
-
-No centralized API logs were found on the monitoring host.
-
-## Agent/container monitoring state
-
-Current test containers prove remote-write metrics are working:
-
-- `6085 dev-6085 dev-container`
-- `5206 mc-vanilla-5206 game-container`
-
-Labels previously observed include:
-
-- `vmid`
-- `ctype` or `container_type`
-- `game`
-- `variant`
-- `hostname` / `name`
-- `job`
-- `instance`
-
-Missing or incomplete for lifecycle debugging:
-
-- `ready` / `connectable`
-- `operationInProgress`
-- `operationType`
-- backup/restore state
-- code-server state
-
-## Grafana state
-
-Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
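The Grafana provisioning tree referenced around this hunk implies a datasource file somewhere under `/etc/zlh-monitor/grafana/provisioning`. A sketch using standard Grafana provisioning keys — the file name and exact contents here are assumptions; only the `uid`, `name`, `url`, and default flag values appear in this document:

```shell
# Example Grafana datasource provisioning file. The keys are standard
# Grafana provisioning fields; the actual file on the monitor host
# may differ.
cat > /tmp/grafana-datasource.sketch.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
cat /tmp/grafana-datasource.sketch.yml
```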
+Grafana provisions from: -Datasource: +```text +/etc/zlh-monitor/grafana/provisioning +``` -- `uid=prometheus` -- `name=Prometheus` -- `url=http://localhost:9090` -- default datasource +Dashboard source directory: -Live provisioned dashboards: +```text +/etc/zlh-monitor/grafana/dashboards +``` -- `Core Service - Detail` -- `Core Services - Overview` -- `Dev Containers` -- `Game Containers` +Live dashboards provisioned: -Archived: +- Core Services - Overview +- Core Service - Detail +- Game Containers +- Dev Containers -- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json` +The old node_exporter router dashboard was archived: -Dashboard-style CPU and memory metrics exist for both current test containers. +```text +/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json +``` -## Logs state +Grafana listens on loopback only: -No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit. +```text +127.0.0.1:3000 +``` -Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync. +## Discovery behavior -## Security state +API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR. -Working controls: +`9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container. 
-- Grafana is loopback-only on `127.0.0.1:3000` per current report -- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401` -- Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit +The monitor sync allows empty discovery: -Remaining launch security checks: +```text +ALLOW_EMPTY="true" +``` -- Prometheus on `10.60.0.25:9090` must be restricted to intended container remote-write and admin/monitoring paths -- any remaining node_exporter listener must be disabled or firewalled -- host firewall policy must be verified \ No newline at end of file +This ensures deleted containers correctly clear: + +```text +/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> [] +``` + +## Alloy template state + +The base template was updated so container Alloy listens on: + +```text +0.0.0.0:12345 +``` + +instead of only: + +```text +127.0.0.1:12345 +``` + +This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job. + +Updated template note replaces node_exporter with Alloy: + +```text +Alloy Metrics/UI: :12345/metrics +Alloy Listen Addr: 0.0.0.0:12345 +Node Exporter: not used +``` + +## Metrics model + +Monitoring is not API-only. + +- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation. +- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path. +- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`. +- Grafana reads Prometheus for dashboards. 
+## Smoke tests passed
+
+### First manual test
+
+```text
+dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
+game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
+delete -> both disappeared from API discovery, file_sd, and Prometheus
+```
+
+### Template test after updating base image
+
+```text
+dev 6086 / 10.100.0.24 -> up
+game 5207 / 10.200.0.80 -> up
+```
+
+### Full loop verified
+
+```text
+create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
+delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
+```
+
+## Logging state
+
+Centralized logs are not enabled yet.
+
+No Loki currently.
+
+Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.
+
+## Platform exceptions
+
+OPNsense and PBS remain explicit platform monitoring exceptions.
+
+OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
\ No newline at end of file
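For future re-runs of the full loop, the Prometheus target state and file_sd contents can be re-checked by hand. A sketch of the commands — printed here rather than executed, since the monitor host is not reachable from this document; `/api/v1/targets` is a real Prometheus HTTP API endpoint, while `jq` availability is an assumption:

```shell
# Printed, not run: the first command lists active scrape targets by job,
# the second shows the current lifecycle file_sd contents (expected to be
# [] after a clean delete).
tee /tmp/recheck.sketch <<'EOF'
curl -s 'http://10.60.0.25:9090/api/v1/targets?state=active' | jq '.data.activeTargets[].labels.job'
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
EOF
```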