Update monitoring current state after Prometheus Grafana cleanup
This commit is contained in: parent `c57c5ba3df`, commit `1a837f3770`
@@ -4,9 +4,27 @@
**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. Game/dev dynamic discovery empty-state is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing/unfinished centralized logs, a missing API application health scrape, incomplete lifecycle telemetry, and the need to run one full game/dev lifecycle smoke test.
## Runtime source of truth

`/etc/zlh-monitor` is now the runtime source of truth.

Prometheus now runs from:

- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`

Grafana now provisions from:

- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`

`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.
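The `/etc/default` wiring can be sketched as below. The exact variable names are packaging-dependent assumptions (Debian's prometheus package reads `ARGS`; Grafana honors `GF_PATHS_*` environment overrides), so verify against the backup files listed above rather than treating this as a copy of the host files.

```shell
# Hedged sketch of the /etc/default wiring; variable names are
# packaging-dependent assumptions, not copied from the host.

# /etc/default/prometheus — point the daemon at the runtime config:
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml"

# /etc/default/grafana-server — Grafana reads GF_PATHS_* overrides:
GF_PATHS_PROVISIONING="/etc/zlh-monitor/grafana/provisioning"
```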
## Recently verified progress

@@ -18,16 +36,40 @@ Container discovery is clean:

- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
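The healthy empty state is just a valid empty JSON array in the file_sd file. A minimal reproduction (using a temp path rather than the real `/etc/zlh-monitor` one):

```shell
# What a clean empty sync leaves behind: a valid, empty file_sd target list.
# Temp path used here so the check is reproducible; the real file is
# /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json.
printf '[]\n' > /tmp/game-dev-alloy.json

# Prometheus treats an empty array as "zero targets", not an error,
# which is why ALLOW_EMPTY syncs stop breaking discovery.
python3 -c 'import json; assert json.load(open("/tmp/game-dev-alloy.json")) == []' \
  && echo "valid empty file_sd"
```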
Prometheus runtime config is clean:

- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs

## Running services observed from prior audit

- `prometheus.service` — active/running
- `grafana-server.service` — active/running
- `alloy.service` — active/running
- `prometheus-node-exporter.service` — active/running

`zlh-monitor-game-dev-discovery.service` had previously failed, but the latest manual sync succeeded after enabling `ALLOW_EMPTY="true"`. Verify timer/service health after the next scheduled run.

Current Prometheus targets are clean:

- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`
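Put together, the active jobs, targets, and intervals reported in this document imply a runtime config shaped roughly like the following. This is a reconstruction for orientation, not a copy of the live file; anything beyond the stated jobs, paths, and intervals is an assumption.

```yaml
# Sketch of /etc/zlh-monitor/prometheus/prometheus.yml implied by this doc.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]

  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
        refresh_interval: 30s
```

After any edit, re-run `promtool check config` against the runtime path, as the verification steps in this document do.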
Grafana provisioning is live:

- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
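The live provisioning state corresponds to two small files under `/etc/zlh-monitor/grafana/provisioning`. The file names and the dashboards path below are assumptions for illustration; the field values are taken from this document.

```yaml
# Hypothetical provisioning/datasources/prometheus.yml — values from this doc:
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    url: http://localhost:9090
    isDefault: true

# Hypothetical provisioning/dashboards/zlh.yml (shown commented so this block
# stays one valid YAML document) — loads dashboard JSON from disk; the path
# is an assumption:
# apiVersion: 1
# providers:
#   - name: zlh-dashboards
#     type: file
#     options:
#       path: /etc/zlh-monitor/grafana/dashboards
```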
Services active after cleanup:

- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer
## Exposure state

@@ -38,19 +80,26 @@ Observed listening ports from prior audit:

- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy

Prometheus active jobs no longer include `node_exporter` or `:9100` scraping, but network exposure still needs separate verification/lockdown.
Host firewall posture from prior audit:

- UFW inactive
- nftables input/forward/output policy accept

This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener exposure are restricted to trusted admin/monitoring paths.
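One way to close this blocker is a default-drop input chain that admits the monitoring ports only from a trusted admin network. The subnet and the SSH allowance below are assumptions, so treat this as a sketch to adapt, not a drop-in ruleset.

```
# Hedged nftables sketch; 10.0.0.0/24 as the trusted admin subnet is an
# assumption. Review against real management paths before applying.
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    ct state established,related accept
    iif "lo" accept

    # SSH for administration (adjust to your actual access path)
    tcp dport 22 accept

    # Prometheus (9090), Grafana (3000), node_exporter (9100)
    # reachable only from the trusted monitoring/admin subnet
    ip saddr 10.0.0.0/24 tcp dport { 9090, 3000, 9100 } accept
  }
}
```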
## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Global scrape interval: `15s`
- Prometheus self-scrape: `5s`
- Dynamic discovery refresh: `30s`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Current targets:
  - `up alloy 127.0.0.1:12345`
  - `up prometheus localhost:9090`
- `/sd/exporters` is configured with bearer auth and works with a token
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 Unauthorized`
@@ -62,21 +111,16 @@ Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- `alloy-dev-6076` hardcoded/down target is gone from the active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from the active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from the active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing

No stale/static down targets are currently present in active Prometheus targets.
## API monitoring state

API host/system metrics were previously available via `integrations/unix` (`vmid=9020`, `role=API`, `instance=zlh-api`), but the new active runtime Prometheus jobs are intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.

The API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
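Once the API exposes a scrapeable endpoint, the missing health scrape is a small addition to the runtime config. The target host/port and metrics path below are hypothetical; only the `vmid` and `role` label values come from this document.

```yaml
# Hypothetical job for the API scrape; target and path are assumptions —
# the API currently 404s on /health and no metrics endpoint was found.
scrape_configs:
  - job_name: zlh-api
    metrics_path: /metrics
    static_configs:
      - targets: ["zlh-api:3000"]
        labels:
          vmid: "9020"
          role: API
```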
@@ -113,19 +157,25 @@ Missing or incomplete for lifecycle debugging:

## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.

Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:

- `core-routers.json`
- `core-service-detail.json`
- `core-services-overview.json`
- `dev-containers.json`
- `game-containers.json`

Datasource:

- `prometheus -> http://localhost:9090` (`uid=prometheus`, `name=Prometheus`, default)

Live provisioned dashboards:

- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`

Archived:

- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
## Logs state

@@ -143,5 +193,6 @@ Working controls from prior audit:

Launch-blocking risk:

- Prometheus and Grafana previously bound publicly on the host network
- node_exporter previously listened on `*:9100`; it is no longer scraped by the active Prometheus config, but listener exposure still needs verification/removal or firewalling
- host firewall policy did not restrict access in the prior audit
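For the missing centralized logs, one option consistent with the existing Alloy deployment is shipping the systemd journal to a Loki endpoint. Nothing in this document confirms Loki as the chosen backend, and the endpoint URL is hypothetical; this is a sketch of the shape such a config would take.

```
// Hedged Alloy (River) sketch for centralized logs; the Loki endpoint is
// hypothetical — this document does not confirm a log backend choice.
loki.source.journal "host" {
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    url = "http://loki.internal:3100/loki/api/v1/push"
  }
}
```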