# Monitoring — Current State

## Launch readiness

**Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.**

The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.

Centralized Loki is now installed on `zlh-monitor` and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.

Remaining future work:

- add selected core-service logs beyond zlh-monitor
- add selected game/dev container logs after review
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later

## Runtime source of truth

`/etc/zlh-monitor` is now the monitoring runtime source of truth.

Prometheus now runs from:

```text
/etc/zlh-monitor/prometheus/prometheus.yml
```

configured via:

```text
/etc/default/prometheus
```

Runtime Prometheus jobs are limited to:

```text
prometheus
alloy
game-dev-alloy
```

Removed from active Prometheus runtime config:

- old `node_exporter` jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active `:9100` scraping jobs
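The job allow-list and the removal of `:9100` targets can be checked mechanically. The following is a minimal sketch, not the tooling actually used on `zlh-monitor`: it assumes `prometheus.yml` has already been parsed into a dictionary (for example with a YAML loader), and the function name, sample configs, and the `10.60.0.30` address are hypothetical.

```python
# Allowed job names come from this document; everything else is illustrative.
ALLOWED_JOBS = {"prometheus", "alloy", "game-dev-alloy"}


def audit_scrape_config(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed prometheus.yml."""
    problems = []
    for job in config.get("scrape_configs", []):
        name = job.get("job_name", "<unnamed>")
        if name not in ALLOWED_JOBS:
            problems.append(f"unexpected job: {name}")
        # A static target on :9100 would mean node_exporter scraping is back.
        for group in job.get("static_configs", []):
            for target in group.get("targets", []):
                if target.endswith(":9100"):
                    problems.append(f"node_exporter target in {name}: {target}")
    return problems


# Hypothetical parsed configs, shaped like the output of a YAML loader.
clean = {
    "scrape_configs": [
        {"job_name": "prometheus", "static_configs": [{"targets": ["10.60.0.25:9090"]}]},
        {"job_name": "alloy", "static_configs": [{"targets": ["127.0.0.1:12345"]}]},
        {"job_name": "game-dev-alloy", "file_sd_configs": [
            {"files": ["/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json"]}
        ]},
    ]
}
stale = {
    "scrape_configs": [
        {"job_name": "node_exporter", "static_configs": [{"targets": ["10.60.0.30:9100"]}]},
    ]
}

print(audit_scrape_config(clean))
print(audit_scrape_config(stale))
```

An empty list means the config matches the intended runtime state; any entry names the job or target that would need review.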

Local `prometheus-node-exporter` is disabled.

Prometheus listens on:

```text
10.60.0.25:9090
```

This allows Alloy remote-write from hosts and containers to reach Prometheus.

## Grafana state

Grafana provisions from:

```text
/etc/zlh-monitor/grafana/provisioning
```

Dashboard source directory:

```text
/etc/zlh-monitor/grafana/dashboards
```

Live dashboards provisioned:

- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers

The old node_exporter router dashboard was archived:

```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```

Grafana listens on loopback only:

```text
127.0.0.1:3000
```

Provisioned datasources:

- `Prometheus`
- `Loki`

## Loki state

Centralized Loki is installed on `zlh-monitor`.

Files created/changed on the monitoring host:

- `/etc/zlh-monitor/loki/loki.yml`
- `/etc/zlh-monitor/systemd/loki.service`
- `/etc/zlh-monitor/grafana/provisioning/datasources/loki.yml`
- `/etc/zlh-monitor/alloy/config.alloy`
- `/etc/zlh-monitor/README.md`

Installed package:

```text
loki 3.7.1
```

Installed service:

```text
/etc/systemd/system/loki.service
```

Services restarted/enabled:

- `loki`
- `grafana-server`
- `alloy`

Listening addresses:

```text
Loki HTTP:  127.0.0.1:3100
Loki gRPC:  127.0.0.1:9096
Grafana:    127.0.0.1:3000
Alloy:      127.0.0.1:12345
Prometheus: 10.60.0.25:9090
```

No Loki public or non-loopback listener is exposed.
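The exposure claim can be restated as a small check over the listener table. This is only a sketch using the addresses listed above; the function names are hypothetical and it handles IPv4 `host:port` strings only.

```python
import ipaddress

# Listener table copied from this document.
LISTENERS = {
    "Loki HTTP": "127.0.0.1:3100",
    "Loki gRPC": "127.0.0.1:9096",
    "Grafana": "127.0.0.1:3000",
    "Alloy": "127.0.0.1:12345",
    "Prometheus": "10.60.0.25:9090",
}


def is_loopback(listen_addr: str) -> bool:
    """True when the host part of an IPv4 'host:port' string is loopback."""
    host, _, _port = listen_addr.rpartition(":")
    return ipaddress.ip_address(host).is_loopback


def non_loopback_services(listeners: dict) -> list[str]:
    """Names of services bound to an address other hosts could reach."""
    return [name for name, addr in listeners.items() if not is_loopback(addr)]


print(non_loopback_services(LISTENERS))
```

With the table above, only Prometheus is reported, which matches the intent that it alone accepts remote-write from other hosts. A wildcard bind such as `0.0.0.0` would also be reported, since it is not a loopback address.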

Grafana Loki datasource:

```text
name: Loki
uid: loki
url: http://127.0.0.1:3100
```

Grafana logs confirmed datasource provisioning:

```text
inserting datasource from configuration name=Loki uid=loki
```

Successful Loki query verified with:

```bash
curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="zlh-monitor-journal"}' \
  --data-urlencode limit=5 \
  --data-urlencode direction=backward
```

The query returned live streams for `grafana-server` and `loki` with labels including:

- `host="zlh-monitor"`
- `vmid="9016"`
- `service`
- `job`
- `log_source`
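The same verification can be scripted. The sketch below builds the request URL that the curl command sends and extracts the `service` label from a `query_range` reply. The function names are hypothetical, and the sample reply is invented for illustration (including the `service` label values and log lines); only the endpoint, query, and `host`/`vmid` labels come from this document.

```python
import json
from urllib.parse import urlencode

LOKI_URL = "http://127.0.0.1:3100/loki/api/v1/query_range"


def build_query_url(job: str, limit: int = 5) -> str:
    """Build the same request the curl command sends."""
    params = {
        "query": f'{{job="{job}"}}',
        "limit": limit,
        "direction": "backward",
    }
    return f"{LOKI_URL}?{urlencode(params)}"


def services_in_response(body: str) -> set[str]:
    """Collect the `service` label from every stream in a query_range reply."""
    result = json.loads(body)["data"]["result"]
    return {stream["stream"].get("service", "") for stream in result}


# Hypothetical reply in the Loki "streams" result shape; values are made up.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {"stream": {"host": "zlh-monitor", "vmid": "9016",
                        "job": "zlh-monitor-journal", "service": "grafana-server"},
             "values": [["1700000000000000000", "example log line"]]},
            {"stream": {"host": "zlh-monitor", "vmid": "9016",
                        "job": "zlh-monitor-journal", "service": "loki"},
             "values": [["1700000000000000001", "example log line"]]},
        ],
    },
})

print(build_query_url("zlh-monitor-journal"))
print(sorted(services_in_response(SAMPLE)))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the sketch runs without a live Loki instance.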

Log sources currently wired:

- `prometheus.service`
- `grafana-server.service`
- `alloy.service`
- `loki.service`
- `zlh-monitor-game-dev-discovery.service`

Intentionally skipped in first Loki pass:

- game/dev container logs
- zpack-api logs
- zpack-portal logs
- zpack-velocity logs
- DNS/edge action logs
- provisioning/backup/restore/delete logs beyond existing monitor discovery unit

Loki schema note:

- Loki initially rejected 24-hour backfill entries because the schema start date was current-day.
- Schema start was moved to the standard `2020-10-24` TSDB start.
- Rejection stopped after the change.
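The schema note comes down to a date comparison: an entry timestamped before the schema start has no schema period to land in. The sketch below is a deliberately simplified model of that behavior, not Loki's actual validation logic; the function name and clock values are hypothetical, and only the `2020-10-24` start date comes from this document.

```python
from datetime import datetime, timedelta, timezone


def accepted_by_schema(entry_time: datetime, schema_start: datetime) -> bool:
    """Simplified rule: an entry needs a schema period covering its timestamp."""
    return entry_time >= schema_start


# Hypothetical clock values, chosen only to illustrate the failure.
now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
backfill_entry = now - timedelta(hours=24)

same_day_start = datetime(2026, 1, 15, tzinfo=timezone.utc)  # original setup
fixed_start = datetime(2020, 10, 24, tzinfo=timezone.utc)    # after the change

print(accepted_by_schema(backfill_entry, same_day_start))
print(accepted_by_schema(backfill_entry, fixed_start))
```

Under this model the 24-hour-old entry falls before a same-day schema start and is refused, while the earlier start date covers it.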

## Discovery behavior

API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.

`9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container.
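A CIDR exclusion of this kind can be sketched with the standard `ipaddress` module. The excluded range `10.60.0.0/24` is an assumption made for illustration (this document confirms only the single address `10.60.0.220`), and the function names and the `dev-6086` / `game-5207` container names are hypothetical.

```python
import ipaddress

# Assumption: the core range is 10.60.0.0/24. Only the single address
# 10.60.0.220 is confirmed by this document.
EXCLUDED_CIDRS = [ipaddress.ip_network("10.60.0.0/24")]


def is_lifecycle(ip: str) -> bool:
    """True when the address is outside every excluded core range."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in EXCLUDED_CIDRS)


def filter_lifecycle(containers: list[dict]) -> list[dict]:
    """Keep only containers that count as lifecycle game/dev containers."""
    return [c for c in containers if is_lifecycle(c["ip"])]


containers = [
    {"vmid": 9091, "name": "zpack-dev-velocity", "ip": "10.60.0.220"},
    {"vmid": 6086, "name": "dev-6086", "ip": "10.100.0.24"},
    {"vmid": 5207, "name": "game-5207", "ip": "10.200.0.80"},
]

print([c["vmid"] for c in filter_lifecycle(containers)])
```

Filtering by network rather than by name or VMID keeps the rule stable when core systems are renamed or renumbered.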

The monitor sync allows empty discovery:

```text
ALLOW_EMPTY="true"
```

This ensures deleted containers correctly clear:

```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```
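The effect of `ALLOW_EMPTY` can be illustrated with a small renderer for the file_sd document. This is a sketch of the idea, not the actual sync script: the function name is hypothetical, and the `ip:12345` target format is an assumption based on the container Alloy port mentioned elsewhere in this document.

```python
import json


def render_file_sd(targets: list[str], allow_empty: bool) -> str:
    """Render a Prometheus file_sd document for the discovered targets."""
    if not targets:
        if not allow_empty:
            # Without the flag, an empty discovery result is treated as a
            # failure, so stale targets would stay in place.
            raise ValueError("refusing to write empty file_sd (ALLOW_EMPTY is false)")
        return "[]\n"
    return json.dumps([{"targets": sorted(targets)}], indent=2) + "\n"


# Hypothetical targets using the smoke-test container addresses.
print(render_file_sd(["10.200.0.80:12345", "10.100.0.24:12345"], allow_empty=True))
print(render_file_sd([], allow_empty=True))
```

With the flag enabled, deleting the last container yields `[]`, which is what lets Prometheus drop the final `game-dev-alloy` target instead of keeping it indefinitely.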

## Alloy template state

The base template was updated so container Alloy listens on:

```text
0.0.0.0:12345
```

instead of only:

```text
127.0.0.1:12345
```

This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job.

Updated template note replaces node_exporter with Alloy:

```text
Alloy Metrics/UI:  :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter:     not used
```

## Metrics model

Monitoring is not API-only.

- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`.
- Grafana reads Prometheus for dashboards.

## Smoke tests passed

### First manual test

```text
dev 6085  -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete    -> both disappeared from API discovery, file_sd, and Prometheus
```

### Template test after updating base image

```text
dev 6086 / 10.100.0.24  -> up
game 5207 / 10.200.0.80 -> up
```

### Full loop verified

```text
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
```
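The "target up" and "targets gone" steps of the loop can be checked against the Prometheus targets API (`/api/v1/targets`). The sketch below only parses a reply; the function name is hypothetical and both sample payloads are invented for illustration, using the smoke-test addresses with an assumed `:12345` port.

```python
import json


def job_targets(targets_json: str, job: str) -> dict[str, str]:
    """Map instance -> health for the active targets of one job."""
    active = json.loads(targets_json)["data"]["activeTargets"]
    return {
        t["labels"]["instance"]: t["health"]
        for t in active
        if t["labels"].get("job") == job
    }


# Hypothetical replies, trimmed to the fields this check reads.
after_create = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "prometheus", "instance": "10.60.0.25:9090"}, "health": "up"},
        {"labels": {"job": "game-dev-alloy", "instance": "10.100.0.24:12345"}, "health": "up"},
        {"labels": {"job": "game-dev-alloy", "instance": "10.200.0.80:12345"}, "health": "up"},
    ]},
})
after_delete = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "prometheus", "instance": "10.60.0.25:9090"}, "health": "up"},
    ]},
})

print(job_targets(after_create, "game-dev-alloy"))
print(job_targets(after_delete, "game-dev-alloy"))
```

An empty mapping after delete corresponds to the "Prometheus targets gone" step; any instance still present, or any health other than `up` after create, would indicate the loop did not complete.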

## Logging model

Centralized Loki is now installed for zlh-monitor logs.

Current model:

```text
logs -> Alloy -> centralized Loki -> Grafana
```

Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.

Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.

## Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions.

OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.