diff --git a/Codex/Monitoring/CURRENT_STATE.md b/Codex/Monitoring/CURRENT_STATE.md
index d132c28..9e062f3 100644
--- a/Codex/Monitoring/CURRENT_STATE.md
+++ b/Codex/Monitoring/CURRENT_STATE.md
@@ -2,14 +2,16 @@
 
 ## Launch readiness
 
-**Status: core lifecycle monitoring is launch-ready.**
+**Status: core lifecycle monitoring is launch-ready, and centralized Loki is installed for zlh-monitor logs.**
 
 The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.
 
+Centralized Loki is now installed on `zlh-monitor` and receives selected monitoring-host logs through Alloy. Broader core-service and game/dev container log shipping remains future work.
+
 Remaining future work:
 
-- centralized logs are not enabled yet
-- no Loki currently
+- add selected core-service logs beyond zlh-monitor
+- add selected game/dev container logs after review
 - OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later
 
 ## Runtime source of truth
@@ -87,6 +89,107 @@ Grafana listens on loopback only:
 127.0.0.1:3000
 ```
 
+Provisioned datasource:
+
+- `Prometheus`
+- `Loki`
+
+## Loki state
+
+Centralized Loki is installed on `zlh-monitor`.
+
+Files created/changed on the monitoring host:
+
+- `/etc/zlh-monitor/loki/loki.yml`
+- `/etc/zlh-monitor/systemd/loki.service`
+- `/etc/zlh-monitor/grafana/provisioning/datasources/loki.yml`
+- `/etc/zlh-monitor/alloy/config.alloy`
+- `/etc/zlh-monitor/README.md`
+
+Installed package:
+
+```text
+loki 3.7.1
+```
+
+Installed service:
+
+```text
+/etc/systemd/system/loki.service
+```
+
+Services restarted/enabled:
+
+- `loki`
+- `grafana-server`
+- `alloy`
+
+Listening addresses:
+
+```text
+Loki HTTP: 127.0.0.1:3100
+Loki gRPC: 127.0.0.1:9096
+Grafana: 127.0.0.1:3000
+Alloy: 127.0.0.1:12345
+Prometheus: 10.60.0.25:9090
+```
+
+No Loki public or non-loopback listener is exposed.
+
+Grafana Loki datasource:
+
+```text
+name: Loki
+uid: loki
+url: http://127.0.0.1:3100
+```
+
+Grafana logs confirmed datasource provisioning:
+
+```text
+inserting datasource from configuration name=Loki uid=loki
+```
+
+Successful Loki query verified with:
+
+```bash
+curl -G http://127.0.0.1:3100/loki/api/v1/query_range \
+  --data-urlencode 'query={job="zlh-monitor-journal"}' \
+  --data-urlencode limit=5 \
+  --data-urlencode direction=backward
+```
+
+The query returned live streams for `grafana-server` and `loki` with labels including:
+
+- `host="zlh-monitor"`
+- `vmid="9016"`
+- `service`
+- `job`
+- `log_source`
+
+Log sources currently wired:
+
+- `prometheus.service`
+- `grafana-server.service`
+- `alloy.service`
+- `loki.service`
+- `zlh-monitor-game-dev-discovery.service`
+
+Intentionally skipped in first Loki pass:
+
+- game/dev container logs
+- zpack-api logs
+- zpack-portal logs
+- zpack-velocity logs
+- DNS/edge action logs
+- provisioning/backup/restore/delete logs beyond existing monitor discovery unit
+
+Loki schema note:
+
+- Loki initially rejected 24-hour backfill entries because the schema start date was current-day.
+- Schema start was moved to the standard `2020-10-24` TSDB start.
+- Rejection stopped after the change.
+
 ## Discovery behavior
 
 API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.
@@ -162,13 +265,19 @@ create -> API discovery -> file_sd -> Prometheus target up -> remote-write metri
 delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
 ```
 
-## Logging state
+## Logging model
 
-Centralized logs are not enabled yet.
+Centralized Loki is now installed for zlh-monitor logs.
 
-No Loki currently.
+Current model:
 
-Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.
+```text
+logs -> Alloy -> centralized Loki -> Grafana
+```
+
+Loki is centralized monitoring infrastructure. Do not run Loki inside every game/dev container.
+
+Future work is to expand selected log shipping from core services and, later, reviewed game/dev container log sources.
 
 ## Platform exceptions
 