# Monitoring — Current State
## Launch readiness
**Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.**
The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
Centralized Loki is now installed on `zlh-monitor` and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.
Remaining future work:
- add selected core-service logs beyond zlh-monitor
- add selected game/dev container logs after review
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
## Runtime source of truth
`/etc/zlh-monitor` is now the monitoring runtime source of truth.
Prometheus now runs from:
```text
/etc/zlh-monitor/prometheus/prometheus.yml
```
configured via:
```text
/etc/default/prometheus
```
Runtime Prometheus jobs are limited to:
```text
prometheus
alloy
game-dev-alloy
```
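A minimal sketch of how those three jobs could look in `prometheus.yml` (job names, listen addresses, and the file_sd path are from this document; scrape settings beyond that are assumptions):

```yaml
# Sketch only — intervals and other settings are assumptions.
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["10.60.0.25:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```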
Removed from active Prometheus runtime config:
- old `node_exporter` jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active `:9100` scraping jobs
Local `prometheus-node-exporter` is disabled.
Prometheus listens on:
```text
10.60.0.25:9090
```
This allows Alloy remote-write from hosts and containers to reach Prometheus.
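The listen address and config path are typically wired through `/etc/default/prometheus`; a sketch under the assumption that a Debian-style `ARGS` variable is used, with the remote-write receiver flag enabled so Alloy push can land:

```text
# /etc/default/prometheus — sketch; exact flags here are an assumption
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml \
      --web.listen-address=10.60.0.25:9090 \
      --web.enable-remote-write-receiver"
```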
## Grafana state
Grafana provisions from:
```text
/etc/zlh-monitor/grafana/provisioning
```
Dashboard source directory:
```text
/etc/zlh-monitor/grafana/dashboards
```
Live dashboards provisioned:
- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers
The old node_exporter router dashboard was archived to:
```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```
Grafana listens on loopback only:
```text
127.0.0.1:3000
```
Provisioned datasources:
- `Prometheus`
- `Loki`
## Loki state
Centralized Loki is installed on `zlh-monitor`.
Files created/changed on the monitoring host:
- `/etc/zlh-monitor/loki/loki.yml`
- `/etc/zlh-monitor/systemd/loki.service`
- `/etc/zlh-monitor/grafana/provisioning/datasources/loki.yml`
- `/etc/zlh-monitor/alloy/config.alloy`
- `/etc/zlh-monitor/README.md`
Installed package:
```text
loki 3.7.1
```
Installed service:
```text
/etc/systemd/system/loki.service
```
Services restarted/enabled:
- `loki`
- `grafana-server`
- `alloy`
Listening addresses:
```text
Loki HTTP: 127.0.0.1:3100
Loki gRPC: 127.0.0.1:9096
Grafana: 127.0.0.1:3000
Alloy: 127.0.0.1:12345
Prometheus: 10.60.0.25:9090
```
Loki exposes no public or non-loopback listener.
Grafana Loki datasource:
```text
name: Loki
uid: loki
url: http://127.0.0.1:3100
```
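In Grafana's datasource provisioning format, that corresponds to roughly the following (field layout per Grafana's provisioning schema; `access` is an assumption, the other values are from this document):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://127.0.0.1:3100
```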
Grafana logs confirmed datasource provisioning:
```text
inserting datasource from configuration name=Loki uid=loki
```
Successful Loki query verified with:
```bash
curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="zlh-monitor-journal"}' \
  --data-urlencode limit=5 \
  --data-urlencode direction=backward
```
The query returned live streams for `grafana-server` and `loki` with labels including:
- `host="zlh-monitor"`
- `vmid="9016"`
- `service`
- `job`
- `log_source`
Log sources currently wired:
- `prometheus.service`
- `grafana-server.service`
- `alloy.service`
- `loki.service`
- `zlh-monitor-game-dev-discovery.service`
Intentionally skipped in first Loki pass:
- game/dev container logs
- zpack-api logs
- zpack-portal logs
- zpack-velocity logs
- DNS/edge action logs
- provisioning/backup/restore/delete logs beyond existing monitor discovery unit
Loki schema note:
- Loki initially rejected 24-hour backfill entries because the schema start date was set to the current day.
- Schema start was moved to the standard `2020-10-24` TSDB start.
- Rejection stopped after the change.
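The corrected schema block in `loki.yml` would look roughly like this (the `2020-10-24` start date is from this document; the store, schema version, and index settings are assumptions typical for Loki 3.x TSDB deployments):

```yaml
schema_config:
  configs:
    - from: "2020-10-24"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
```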
## Discovery behavior
API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.
`9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container.
The monitor sync allows empty discovery:
```text
ALLOW_EMPTY="true"
```
This ensures that deleted containers are correctly cleared from:
```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```
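The empty-discovery guard can be sketched in shell (a minimal illustration, not the actual sync script; the output path is redirected to `/tmp` here and the empty API discovery result is simulated):

```shell
#!/bin/sh
# Sketch of the ALLOW_EMPTY guard — not the real sync script.
# Real output path: /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
ALLOW_EMPTY="true"
FILE_SD="/tmp/game-dev-alloy.json"

targets="[]"   # simulated empty discovery result from the API

# Refuse to clear the target list unless empty discovery is explicitly allowed.
if [ "$targets" = "[]" ] && [ "$ALLOW_EMPTY" != "true" ]; then
    echo "refusing to write empty target list" >&2
    exit 1
fi

printf '%s\n' "$targets" > "$FILE_SD"
```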
## Alloy template state
The base template was updated so container Alloy listens on:
```text
0.0.0.0:12345
```
instead of only:
```text
127.0.0.1:12345
```
This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job.
The updated template note replaces node_exporter with Alloy:
```text
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
```
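One common way to set this bind is Alloy's HTTP server flag (an assumption about how the template wires it; it may instead be set in an environment file, and the config path shown is the stock package default rather than necessarily the template's):

```text
alloy run --server.http.listen-addr=0.0.0.0:12345 /etc/alloy/config.alloy
```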
## Metrics model
Monitoring is not API-only.
- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`.
- Grafana reads Prometheus for dashboards.
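In Alloy's config language, the canonical remote-write path amounts to roughly the following component (the component label is hypothetical; the URL assumes Prometheus's remote-write receiver endpoint is enabled):

```text
prometheus.remote_write "zlh_monitor" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
}
```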
## Smoke tests passed
### First manual test
```text
dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus
```
### Template test after updating base image
```text
dev 6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up
```
### Full loop verified
```text
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
```
## Logging model
Centralized Loki is now installed for zlh-monitor logs.
Current model:
```text
logs -> Alloy -> centralized Loki -> Grafana
```
Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.
Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.
## Platform exceptions
OPNsense and PBS remain explicit platform monitoring exceptions.
OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.