zlh-grind/Codex/Monitoring/CURRENT_STATE.md

# Monitoring — Current State

## Launch readiness

**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery works, and the stale `game-dev-alloy` VMIDs `5205` and `6084` are gone. Legacy/static Prometheus jobs have been removed from the active runtime config. Grafana dashboard provisioning is live from `/etc/zlh-monitor`, and the important game/dev metrics path now works through container Alloy remote-write.

Launch-debug readiness is still blocked by pending exposure verification and lockdown, missing or unfinished centralized logs, a missing API application health scrape, incomplete lifecycle telemetry, and the unfinished delete half of the game/dev lifecycle smoke test.

## Runtime source of truth

`/etc/zlh-monitor` is now the runtime source of truth.

Prometheus now runs from:

- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`

Grafana now provisions from:

- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`

`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.

## Canonical metrics model

Monitoring is not API-only.

The API owns discovery/metadata and tells the monitoring stack what game/dev containers exist. The canonical metrics path for game/dev containers is now:

```text
container Alloy remote_write
  -> Prometheus on 10.60.0.25:9090
  -> Grafana dashboards
```

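The `game-dev-alloy` file_sd path is useful for inventory/add-remove validation. It should not be treated as the canonical container health signal while container Alloy listens on `127.0.0.1:12345` and Prometheus file_sd scrapes `container-ip:12345`: a loopback-only listener is not reachable from the Prometheus host, so those targets report down regardless of container health.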

If direct scrape health for each container Alloy is desired later, container Alloy must intentionally expose its HTTP listener on a reachable address such as `0.0.0.0:12345`, with security review.
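
The bind-address mismatch can be illustrated with a minimal sketch. The helper name `remotely_scrapable` is hypothetical and only classifies a listen address; it does not inspect a real Alloy config.

```python
import ipaddress


def remotely_scrapable(listen_addr: str) -> bool:
    """Return True if a listener bound to listen_addr can accept
    connections from other hosts (that is, it is not loopback-only)."""
    host, _, _port = listen_addr.rpartition(":")
    return not ipaddress.ip_address(host).is_loopback


print(remotely_scrapable("127.0.0.1:12345"))  # current container Alloy bind -> False
print(remotely_scrapable("0.0.0.0:12345"))    # bind needed for direct scrape -> True
```

A non-loopback bind only makes direct scrape possible; firewall rules and the security review mentioned above still decide whether it should be allowed.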

## Recently verified progress

Container discovery is clean:

- `/sd/exporters -> []` during empty state
- `/monitoring/game-dev -> containers: []` during empty state
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
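
The empty-state behavior can be sketched as follows. This is an assumption about how the sync step treats `ALLOW_EMPTY`, not the actual sync script: `render_file_sd`, the container dict shape (`vmid`, `ip`, `ctype`), and the label names are hypothetical. Only the file_sd layout (a JSON list of `targets`/`labels` groups) is standard Prometheus.

```python
import json


def render_file_sd(containers, allow_empty=False):
    """Build file_sd JSON for the game-dev-alloy job.

    containers: list of dicts with 'vmid', 'ip', 'ctype' keys (assumed shape).
    """
    if not containers and not allow_empty:
        # Without ALLOW_EMPTY, an empty discovery result is treated as an error
        # so a failed API call cannot silently wipe all targets.
        raise ValueError("discovery returned no containers and ALLOW_EMPTY is not set")
    groups = [
        {
            "targets": [f"{c['ip']}:12345"],
            "labels": {"vmid": str(c["vmid"]), "ctype": c["ctype"]},
        }
        for c in containers
    ]
    return json.dumps(groups, indent=2)


print(render_file_sd([], allow_empty=True))  # valid empty file_sd content: []
```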

Prometheus runtime config is clean:

- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs
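
A check like the one above could be repeated against the Prometheus `/api/v1/targets` response. The following is a minimal sketch: `unexpected_jobs` is a hypothetical helper, and the sample dict is hand-written to resemble the API response shape rather than captured from this host.

```python
EXPECTED_JOBS = {"prometheus", "alloy", "game-dev-alloy"}


def unexpected_jobs(targets_response: dict) -> set:
    """Return job names present in activeTargets that are not expected."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    seen = {t.get("labels", {}).get("job", "") for t in active}
    return seen - EXPECTED_JOBS


# Hand-written sample shaped like a /api/v1/targets response (not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "10.60.0.25:9090"}},
            {"labels": {"job": "alloy", "instance": "127.0.0.1:12345"}},
            {"labels": {"job": "node_exporter", "instance": "10.60.0.19:9100"}},
        ]
    },
}

print(sorted(unexpected_jobs(sample)))
```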

Runtime bind/target state after the remote-write fix:

- Prometheus listens on `10.60.0.25:9090`
- Prometheus self-scrape target is `10.60.0.25:9090`
- Grafana remains loopback-only on `127.0.0.1:3000`
- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
- local node_exporter is disabled

Grafana provisioning is live:

- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

Services active after cleanup:

- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer

## Game/dev metrics smoke evidence

The important monitoring path works for current test containers.

API discovery emits both test servers:

- `6085 / dev-6085 / 10.100.0.23 / dev`
- `5206 / mc-vanilla-5206 / 10.200.0.79 / game`

file_sd sync writes both targets, and Prometheus discovers both.

Alloy remote-write from both containers works after Prometheus was rebound to `10.60.0.25:9090`.

Dashboard-style metrics exist for both containers:

- `node_uname_info`
- CPU metrics
- memory metrics

Observed values:

```text
6085 dev-6085        dev-container   CPU ~1.04%   Mem ~6.58%
5206 mc-vanilla-5206 game-container  CPU ~0.68%   Mem ~61.69%
```
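
To re-check that remote-write series exist for a given container, an instant query against the Prometheus HTTP API is one option. The sketch below only builds the query URL. It assumes the `vmid` label is attached to `node_uname_info`, which matches labels previously observed but has not been confirmed for every series; `presence_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

PROMETHEUS = "http://10.60.0.25:9090"  # bind address reported in this document


def presence_query_url(vmid: int, metric: str = "node_uname_info") -> str:
    """Build an instant-query URL selecting one metric for one VMID."""
    promql = f'{metric}{{vmid="{vmid}"}}'
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"


for vmid in (6085, 5206):
    print(presence_query_url(vmid))
```

An empty `result` array in the response would suggest that no series with that `vmid` label has been written recently.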

Known design mismatch:

- container Alloy listens on `127.0.0.1:12345`
- Prometheus file_sd scrapes `container-ip:12345`
- therefore `game-dev-alloy` scrape targets are discovered but down

Current decision: use remote-write as canonical and do not rely on `game-dev-alloy` scrape health panels unless/until direct container Alloy scrape is intentionally enabled.

## Exposure state

Current reported bind posture:

- Prometheus listens on `10.60.0.25:9090`
- Grafana remains loopback-only on `127.0.0.1:3000`
- zlh-monitor local Alloy remains loopback-only on `127.0.0.1:12345`
- local node_exporter is disabled

Remaining exposure work:

- verify Prometheus is reachable only from intended management/container remote-write networks, not the public internet
- verify Grafana is not reachable except via intended admin path
- verify node_exporter is no longer listening or is firewalled if any listener remains
- verify host firewall/nftables policy is appropriate
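
One way to approach the reachability checks is a plain TCP connect probe run from each vantage point (admin network, container network, an external host). This is a minimal sketch; `tcp_reachable` is a hypothetical helper, and a successful connect shows only that the port is open from that vantage point, not that access is authorized.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, `tcp_reachable("10.60.0.25", 9090)` would be expected to succeed from a container network and fail from an external host, while `tcp_reachable("10.60.0.25", 3000)` and port `9100` would be expected to fail from anywhere other than the monitoring host itself.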

## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Prometheus listens on `10.60.0.25:9090`
- Prometheus self-scrape target is `10.60.0.25:9090`
- `/sd/exporters` is configured with bearer auth and works with token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened, for example by restricting file permissions and keeping the token out of backup copies.
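
A rough way to find inline tokens is to scan config text for secret-bearing keys. The sketch below is illustrative only: the regex and `inline_secret_keys` are hypothetical, and the env variable name used in the real discovery env file is not known here. Prometheus does accept `authorization.credentials_file` as a file-based alternative to inline `credentials`, which is why `*_file` keys are not flagged.

```python
import re

# Flags inline secrets such as `credentials: abc` or `API_TOKEN=abc`.
# Keys ending in _file / _FILE are not matched, since they hold a path.
INLINE_SECRET = re.compile(
    r"^[ \t]*(bearer_token|credentials|[A-Z_]*TOKEN)[ \t]*[:=][ \t]*\S+",
    re.MULTILINE,
)


def inline_secret_keys(text: str) -> list:
    """Return the key names in text that appear to hold an inline secret."""
    return [m.group(1) for m in INLINE_SECRET.finditer(text)]


inline_yaml = "authorization:\n  type: Bearer\n  credentials: abc123\n"
file_yaml = "authorization:\n  type: Bearer\n  credentials_file: /path/to/token\n"

print(inline_secret_keys(inline_yaml))
print(inline_secret_keys(file_yaml))
```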

## Targets/discovery state

Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing
- active test containers appear in API discovery and file_sd

Expected remaining validation:

- delete test dev/game servers
- confirm API discovery drops them
- confirm file_sd drops them
- confirm dashboard metrics age out as expected
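
The first three delete checks could be automated along these lines. This is a sketch under assumed data shapes: the API container list is taken to be dicts with a `vmid` key, and file_sd groups are taken to carry a `vmid` label. `leftover_vmids` is a hypothetical helper.

```python
def leftover_vmids(deleted, api_containers, file_sd_groups):
    """Return {vmid: [places still listing it]} for VMIDs that were deleted."""
    api_vmids = {str(c["vmid"]) for c in api_containers}
    sd_vmids = {str(g.get("labels", {}).get("vmid")) for g in file_sd_groups}
    report = {}
    for vmid in map(str, deleted):
        places = []
        if vmid in api_vmids:
            places.append("api")
        if vmid in sd_vmids:
            places.append("file_sd")
        if places:
            report[vmid] = places
    return report


# Example: 6085 was deleted cleanly, 5206 still lingers in file_sd.
print(leftover_vmids(
    deleted=[6085, 5206],
    api_containers=[],
    file_sd_groups=[{"targets": ["10.200.0.79:12345"], "labels": {"vmid": "5206"}}],
))
```

An empty report means both the API and file_sd dropped the deleted servers; metric age-out still needs to be confirmed separately in the dashboards.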

## API monitoring state

API discovery is working for game/dev monitoring inventory.

The API application health endpoint `/health` previously returned `404 Cannot GET /health`, and no API health scrape was found.

No centralized API logs were found on the monitoring host.

## Agent/container monitoring state

Current test containers prove remote-write metrics are working:

- `6085 dev-6085 dev-container`
- `5206 mc-vanilla-5206 game-container`

Labels previously observed include:

- `vmid`
- `ctype` or `container_type`
- `game`
- `variant`
- `hostname` / `name`
- `job`
- `instance`

Missing or incomplete for lifecycle debugging:

- `ready` / `connectable`
- `operationInProgress`
- `operationType`
- backup/restore state
- code-server state

## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.

Datasource:

- `uid=prometheus`
- `name=Prometheus`
- `url=http://localhost:9090`
- default datasource

Live provisioned dashboards:

- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`

Archived:

- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

Dashboard-style CPU and memory metrics exist for both current test containers.

## Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.

Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

## Security state

Working controls:

- Grafana is loopback-only on `127.0.0.1:3000` per current report
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
- Prometheus labels/scrape URLs checked did not expose the bearer token in the prior audit

Remaining launch security checks:

- Prometheus on `10.60.0.25:9090` must be restricted to intended container remote-write and admin/monitoring paths
- any remaining node_exporter listener must be disabled or firewalled
- host firewall policy must be verified