From a146979dad29a167142401e00f804b34a4344528 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 1 May 2026 20:57:03 +0000
Subject: [PATCH] Trim monitoring open items after launch-ready validation

---
 Codex/Monitoring/OPEN_ITEMS.md | 133 +++++++-------------------------
 1 file changed, 29 insertions(+), 104 deletions(-)

diff --git a/Codex/Monitoring/OPEN_ITEMS.md b/Codex/Monitoring/OPEN_ITEMS.md
index 7ec8145..2e75f84 100644
--- a/Codex/Monitoring/OPEN_ITEMS.md
+++ b/Codex/Monitoring/OPEN_ITEMS.md
@@ -2,125 +2,50 @@
 
 Only keep unfinished monitoring/observability work here.
 
-## Recently resolved / verified
+## Active
 
-- `/etc/zlh-monitor` is now the runtime source of truth.
-  - Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
-  - Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
-  - `/etc/zlh-monitor/README.md` describes the new single-source layout.
+1. Add centralized logs.
+   - No Loki currently.
+   - Future direction: centralized Loki as shared infrastructure.
+   - Containers should ship selected logs through Alloy; do not run Loki per game/dev container.
+   - Desired sources:
+     - API application logs
+     - Agent logs
+     - provisioning logs
+     - backup/restore logs
+     - delete/teardown logs
+     - Velocity/DNS/edge action logs
+     - discovery sync logs
 
-- Game/dev dynamic discovery empty-state now works.
-  - `/sd/exporters -> []`
-  - `/monitoring/game-dev -> containers: []`
-  - monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
-  - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
-  - Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
+2. Decide whether OPNsense needs router-only monitoring.
+   - Current game/dev monitoring intentionally uses Alloy, not node_exporter.
+   - OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.
+   - Keep this separate from the game/dev container Alloy contract.
 
-- Prometheus runtime config is clean.
-  - validates with `promtool`
-  - active jobs are only:
-    - `prometheus`
-    - `alloy -> 127.0.0.1:12345`
-    - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
-  - no active `node_exporter` job
-  - no active cAdvisor job
-  - no active `:9100` jobs
-
-- Remaining stale/static Prometheus targets are removed from active runtime config.
-  - `alloy-dev-6076` no longer appears as an active runtime target
-  - `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target
-  - `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target
-
-- Grafana provisioning is live.
-  - source-managed datasource:
-    - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
-  - provisioned dashboards:
-    - `Core Service - Detail`
-    - `Core Services - Overview`
-    - `Dev Containers`
-    - `Game Containers`
-  - legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
-
-- Important game/dev metrics path works through remote-write.
-  - API discovery emitted both test servers:
-    - `6085 / dev-6085 / 10.100.0.23 / dev`
-    - `5206 / mc-vanilla-5206 / 10.200.0.79 / game`
-  - file_sd sync wrote both targets and Prometheus discovered both
-  - container Alloy remote-write works after Prometheus was rebound to `10.60.0.25:9090`
-  - dashboard-style metrics exist for both:
-    - `node_uname_info`
-    - CPU
-    - memory
-  - observed values:
-    - `6085 dev-6085 dev-container CPU ~1.04% Mem ~6.58%`
-    - `5206 mc-vanilla-5206 game-container CPU ~0.68% Mem ~61.69%`
-
-- Current monitoring model clarified.
-  - Not API-only: API owns discovery/metadata.
-  - Canonical metrics path is container Alloy remote-write to Prometheus.
-  - `game-dev-alloy` scrape-up panels should not be relied on while container Alloy listens on `127.0.0.1:12345`.
-  - Direct scrape health would require intentionally exposing container Alloy HTTP on a reachable address such as `0.0.0.0:12345` plus security review.
-
-- Services active after cleanup:
-  - `prometheus`
-  - `grafana-server`
-  - `alloy`
-  - discovery timer
-
-## Launch blockers
-
-1. Finish network exposure verification/lockdown.
-   - Prometheus now listens on `10.60.0.25:9090` for container remote-write and self-scrape.
-   - Grafana remains loopback-only on `127.0.0.1:3000`.
-   - zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`.
-   - local node_exporter is disabled.
-   - Required validation: Prometheus is reachable only from intended container remote-write and trusted admin/monitoring paths; Grafana is not externally exposed; node_exporter is not listening or is firewalled; host firewall/nftables posture is acceptable.
-
-2. Add API health scrape or equivalent application-level health signal.
-   - Observed `/health` previously returned `404 Cannot GET /health`.
-   - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
-
-3. Add launch-debug lifecycle telemetry.
-   - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
-   - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
-
-4. Add centralized logs.
-   - No Loki, Promtail, or log-ingesting Alloy config found in prior audit.
-   - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
-
-5. Tighten bearer token storage.
-   - Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
-   - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
-
-6. Finish lifecycle smoke test delete step.
-   - Delete test dev/game servers.
-   - Confirm API discovery drops them.
-   - Confirm file_sd drops them.
-   - Confirm dashboard metrics age out as expected.
-
-## Standardization work
+## Optional hardening / improvements
 
+- Add API app health scrape or equivalent application-level health signal.
+- Add lifecycle/debug telemetry panels or metrics for:
+  - `ready`
+  - `connectable`
+  - `operationInProgress`
+  - `operationType`
+  - backup/restore state
+  - code-server state
 - Standardize labels: prefer canonical `container_type` over mixed `ctype` / `container_type`.
 - Decide canonical label spelling for hostname/name/server identity.
 - Keep required game/dev labels minimal: `vmid`, `instance`, `container_type`.
-- Keep optional metadata useful but not noisy.
 - Review any user/customer-identifying labels before dashboard or alert exposure.
+- Add runbooks for discovery sync failure and failed game/dev provisioning debugging from monitoring only.
 
 ## Alerts to add
 
 - discovery sync failed
 - file_sd file stale / not updated recently
 - missing/stale remote-write metrics by VMID/type
+- game-dev-alloy scrape target down by VMID/type
 - API `/sd/exporters` failure
-- API app health failure
+- API app health failure once an app health signal exists
 - Prometheus scrape failures by important job
 - Grafana dashboard provisioning missing/failure
 - log pipeline down once Loki/Alloy logs exist
-
-## Follow-up improvements
-
-- Add Loki or Alloy log collection for API, Agent, provisioning, backup/restore, Velocity/DNS/edge actions.
-- Add panels for lifecycle state: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore status, code-server status.
-- Add dashboard row/panels for stale/deleted targets.
-- Add a runbook for discovery sync failure.
-- Add a runbook for failed game/dev provisioning debugging from monitoring only.
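Active item 1 above (ship selected container logs through Alloy to a future shared Loki, with no per-container Loki) could eventually take a shape like the following Alloy config fragment. This is only a sketch: the log path, labels, and Loki endpoint URL are placeholders, since per the open item no centralized Loki exists yet.

```alloy
// Sketch only: ship one selected log file from a game/dev container
// to a hypothetical shared Loki. Paths, labels, and the endpoint URL
// are assumptions, not existing infrastructure.
loki.source.file "app_logs" {
  targets = [
    {__path__ = "/var/log/app/server.log", job = "game-container", vmid = "5206"},
  ]
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    // Placeholder address for the future centralized Loki.
    url = "http://loki.internal:3100/loki/api/v1/push"
  }
}
```

Each desired source (API, Agent, provisioning, backup/restore, teardown, Velocity/DNS/edge, discovery sync) would become its own `loki.source.file` or equivalent component forwarding to the same `loki.write`.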
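For the "missing/stale remote-write metrics by VMID/type" alert, one plausible starting point is a PromQL rule over a metric the containers already push, such as `node_uname_info`, grouped by the required `vmid` and `container_type` labels. The rule name, threshold, and durations below are assumptions, not an agreed design:

```yaml
groups:
  - name: game-dev-remote-write   # hypothetical rule group name
    rules:
      - alert: GameDevRemoteWriteStale
        # Fires when the newest pushed sample for a VMID/type pair
        # is more than 5 minutes old (threshold is an assumption).
        expr: (time() - max by (vmid, container_type) (timestamp(node_uname_info))) > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Stale remote-write metrics for vmid {{ $labels.vmid }} ({{ $labels.container_type }})"
```

Because the canonical metrics path is remote-write rather than scraping, a staleness rule like this is the reliable health signal; `up`-style scrape checks against `game-dev-alloy` targets only work if container Alloy is ever exposed beyond loopback.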