Update monitoring state after Loki installation
parent be9244e2af
commit 7cf4d9f247
@@ -2,14 +2,16 @@

## Launch readiness

-**Status: core lifecycle monitoring is launch-ready.**
+**Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.**

The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.

Centralized Loki is now installed on `zlh-monitor` and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.

Remaining future work:

-- centralized logs are not enabled yet
-- no Loki currently
+- add selected core-service logs beyond zlh-monitor
+- add selected game/dev container logs after review
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later

## Runtime source of truth
@@ -87,6 +89,107 @@ Grafana listens on loopback only:
127.0.0.1:3000
```

Provisioned datasources:

- `Prometheus`
- `Loki`

## Loki state

Centralized Loki is installed on `zlh-monitor`.

Files created/changed on the monitoring host:

- `/etc/zlh-monitor/loki/loki.yml`
- `/etc/zlh-monitor/systemd/loki.service`
- `/etc/zlh-monitor/grafana/provisioning/datasources/loki.yml`
- `/etc/zlh-monitor/alloy/config.alloy`
- `/etc/zlh-monitor/README.md`

Installed package:

```text
loki 3.7.1
```

Installed service:

```text
/etc/systemd/system/loki.service
```

Services restarted/enabled:

- `loki`
- `grafana-server`
- `alloy`

Listening addresses:

```text
Loki HTTP: 127.0.0.1:3100
Loki gRPC: 127.0.0.1:9096
Grafana: 127.0.0.1:3000
Alloy: 127.0.0.1:12345
Prometheus: 10.60.0.25:9090
```

No Loki public or non-loopback listener is exposed.
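
The loopback-only claim can be re-checked after future config changes. The following is a minimal sketch, not part of the installed tooling: the function name `non_loopback_loki_listeners` is hypothetical, and it assumes the iproute2 `ss` utility is available and that the fourth column of `ss -ltnH` output is the local `address:port`.

```bash
#!/usr/bin/env bash
# Sketch: flag Loki listeners bound to anything other than loopback.
# Reads `ss -ltnH` style lines on stdin (4th field is Local Address:Port).
# The port list defaults to the two Loki ports recorded in this document.
non_loopback_loki_listeners() {
  awk -v ports="${1:-3100 9096}" '
    BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
    {
      local_addr = $4
      port = local_addr; sub(/^.*:/, "", port)      # text after the last colon
      host = local_addr; sub(/:[^:]*$/, "", host)   # text before the last colon
      if ((port in want) && host != "127.0.0.1" && host != "[::1]") print local_addr
    }'
}

# On the monitoring host (assumption: iproute2 `ss` is installed):
#   ss -ltnH | non_loopback_loki_listeners
# Empty output means ports 3100 and 9096 are bound to loopback only.
```

The function only inspects the ports it is given, so the Prometheus listener on `10.60.0.25:9090` is ignored by design.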

Grafana Loki datasource:

```text
name: Loki
uid: loki
url: http://127.0.0.1:3100
```

Grafana logs confirmed datasource provisioning:

```text
inserting datasource from configuration name=Loki uid=loki
```

Successful Loki query verified with:

```bash
curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="zlh-monitor-journal"}' \
  --data-urlencode limit=5 \
  --data-urlencode direction=backward
```

The query returned live streams for `grafana-server` and `loki` with labels including:

- `host="zlh-monitor"`
- `vmid="9016"`
- `service`
- `job`
- `log_source`
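
The verification query omits `start` and `end`, so Loki applies its default window (the last hour, according to the Loki HTTP API documentation). To look further back, for example across a 24-hour backfill, the window can be passed explicitly. This is a hedged sketch: the helper name `loki_range` is hypothetical, and it relies on `query_range` accepting `start` and `end` as Unix epoch nanoseconds.

```bash
#!/usr/bin/env bash
# Sketch: print explicit start/end values (Unix epoch, nanoseconds) for a
# Loki query_range call covering the last N hours (default 24).
loki_range() {
  local hours="${1:-24}"
  local now="${2:-$(date +%s)}"          # optional fixed "now" for repeatable runs
  local start=$(( now - hours * 3600 ))
  # Append nine zeros to turn whole seconds into nanoseconds.
  printf 'start=%s000000000 end=%s000000000\n' "$start" "$now"
}

# Example use against the local Loki (same endpoint and selector as the
# verification query; not executed here):
#   read -r start end < <(loki_range 24)
#   curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={job="zlh-monitor-journal"}' \
#     --data-urlencode "$start" --data-urlencode "$end" \
#     --data-urlencode limit=5 --data-urlencode direction=backward
```

Passing a fixed second argument makes the output deterministic, which is convenient when comparing two query runs over the same window.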

Log sources currently wired:

- `prometheus.service`
- `grafana-server.service`
- `alloy.service`
- `loki.service`
- `zlh-monitor-game-dev-discovery.service`

Intentionally skipped in the first Loki pass:

- game/dev container logs
- zpack-api logs
- zpack-portal logs
- zpack-velocity logs
- DNS/edge action logs
- provisioning/backup/restore/delete logs beyond the existing monitor discovery unit

Loki schema note:

- Loki initially rejected 24-hour backfill entries because the schema start date was the current day.
- The schema start was moved to the standard `2020-10-24` TSDB start.
- The rejections stopped after the change.
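
The schema note can be turned into a quick pre-flight check before shipping older logs. The sketch below is an illustration under stated assumptions: the helper names `schema_from` and `schema_covers` are hypothetical, the config is scanned with plain text matching rather than a YAML parser, and it assumes the schema start appears as a `from:` key with a `YYYY-MM-DD` value.

```bash
#!/usr/bin/env bash
# Sketch: print the earliest schema `from:` date found in a Loki config
# read from stdin. Plain text matching, not a YAML parser.
schema_from() {
  awk '
    { line = $0; sub(/^[ \t-]*/, "", line) }   # drop indentation and list dash
    line ~ /^from:/ {
      sub(/^from:[ \t]*/, "", line)            # keep only the value
      split(line, a, /[ \t]+/)                 # ignore trailing comments
      print a[1]
    }' | tr -d "\"'" | sort | head -n 1
}

# Succeeds when the schema start is on or before the oldest date being shipped.
# ISO dates (YYYY-MM-DD) compare correctly as strings.
schema_covers() {   # usage: schema_covers <schema_from> <oldest_entry_date>
  [[ "$1" < "$2" || "$1" == "$2" ]]
}

# On the monitoring host (assumption: GNU date for the `-d` option):
#   from=$(schema_from < /etc/zlh-monitor/loki/loki.yml)
#   schema_covers "$from" "$(date -u -d '24 hours ago' +%F)" || echo "backfill predates schema start"
```

With the schema start at `2020-10-24`, the check passes for any realistic backfill window; with a current-day start it fails for a 24-hour backfill, which matches the rejection described in the note.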

## Discovery behavior

API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.
@@ -162,13 +265,19 @@ create -> API discovery -> file_sd -> Prometheus target up -> remote-write metri
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
```

-## Logging state
+## Logging model

-Centralized logs are not enabled yet.
+Centralized Loki is now installed for zlh-monitor logs.

-No Loki currently.
+Current model:

-Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.
```text
logs -> Alloy -> centralized Loki -> Grafana
```

Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.

Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.

## Platform exceptions