Update monitoring current state after clean discovery
parent 1169a89b71
commit c57c5ba3df
@@ -2,33 +2,48 @@
|
|||||||
|
|
||||||
## Launch readiness
|
## Launch readiness
|
||||||
|
|
||||||
**Status: FAIL for launch monitoring readiness.**
|
**Status: PARTIAL FAIL for launch monitoring readiness.**
|
||||||
|
|
||||||
Core monitoring services are running, and Prometheus can see the API host via Alloy remote-written host metrics plus one current dev/node target from `/sd/exporters`. Launch-debug readiness is blocked by public exposure, stale/failing game/dev discovery, missing centralized logs, and dashboards not actually installed in Grafana.
|
Core monitoring services are running. Game/dev dynamic discovery empty-state is now clean and stale `game-dev-alloy` VMIDs `5205` and `6084` have disappeared from Prometheus after a successful empty sync.
|
||||||
|
|
||||||
## Running services observed
|
Launch-debug readiness is still blocked by public monitoring endpoint exposure, remaining stale/static down targets, dashboards not actually installed in Grafana, missing API health scrape, incomplete lifecycle telemetry, and missing centralized logs.
|
||||||
|
|
||||||
|
## Recently verified progress
|
||||||
|
|
||||||
|
Container discovery is clean:
|
||||||
|
|
||||||
|
- `/sd/exporters -> []`
|
||||||
|
- `/monitoring/game-dev -> containers: []`
|
||||||
|
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
|
||||||
|
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
|
||||||
|
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone
|
||||||
|
|
||||||
|
This resolves the previous dynamic discovery empty-state/stale `game-dev-alloy` blocker.
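The empty-state behaviour described above can be sketched as follows. This is a minimal illustration, not the actual sync script: the function name `write_file_sd` and its `allow_empty` parameter are hypothetical stand-ins for whatever the script does when `ALLOW_EMPTY="true"` is set. The only fact relied on is that an empty JSON list is a valid Prometheus `file_sd` file.

```python
import json
import os
import tempfile


def write_file_sd(target_groups, path, allow_empty=False):
    """Write file_sd target groups to `path` and return the group count.

    Hypothetical helper: `allow_empty` mirrors the ALLOW_EMPTY setting.
    """
    groups = list(target_groups)
    if not groups and not allow_empty:
        # Mirrors the earlier behaviour: refuse to write an empty file.
        raise ValueError("refusing to write empty discovery file")

    # Write to a temp file in the same directory, then rename, so that
    # Prometheus never reads a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
            handle.write("\n")
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return len(groups)
```

With `allow_empty=True`, an empty discovery result produces a file containing `[]`, which lets Prometheus drop targets that no longer exist instead of keeping the last non-empty list.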
|
||||||
|
|
||||||
|
## Running services observed from prior audit
|
||||||
|
|
||||||
- `prometheus.service` — active/running
|
- `prometheus.service` — active/running
|
||||||
- `grafana-server.service` — active/running
|
- `grafana-server.service` — active/running
|
||||||
- `alloy.service` — active/running
|
- `alloy.service` — active/running
|
||||||
- `prometheus-node-exporter.service` — active/running
|
- `prometheus-node-exporter.service` — active/running
|
||||||
- `zlh-monitor-game-dev-discovery.service` — failed
|
|
||||||
|
`zlh-monitor-game-dev-discovery.service` was previously failed, but the latest manual sync succeeded after enabling `ALLOW_EMPTY="true"`. Verify timer/service health after the next scheduled run.
|
||||||
|
|
||||||
## Exposure state
|
## Exposure state
|
||||||
|
|
||||||
Observed listening ports:
|
Observed listening ports from prior audit:
|
||||||
|
|
||||||
- `*:9090` Prometheus
|
- `*:9090` Prometheus
|
||||||
- `*:3000` Grafana
|
- `*:3000` Grafana
|
||||||
- `*:9100` node_exporter
|
- `*:9100` node_exporter
|
||||||
- `127.0.0.1:12345` Alloy
|
- `127.0.0.1:12345` Alloy
|
||||||
|
|
||||||
Host firewall posture observed:
|
Host firewall posture from prior audit:
|
||||||
|
|
||||||
- UFW inactive
|
- UFW inactive
|
||||||
- nftables input/forward/output policy accept
|
- nftables input/forward/output policy accept
|
||||||
|
|
||||||
This is a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
|
This remains a launch blocker because Prometheus, Grafana, and node_exporter are reachable on all interfaces without host firewall restriction.
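As a rough illustration of how the listening-port list above maps to the exposure finding, the sketch below flags listeners bound to a wildcard address. The function names and the set of wildcard spellings are assumptions made for this example; they are not taken from the audit tooling.

```python
# Host parts that mean "all interfaces" (assumed spellings for this sketch).
WILDCARD_HOSTS = {"*", "0.0.0.0", "::", "[::]"}


def split_listener(listener):
    """Split 'host:port' on the last colon and return (host, port)."""
    host, _, port = listener.rpartition(":")
    return host, int(port)


def exposed_listeners(listeners):
    """Return the listeners that are bound to all interfaces."""
    return [item for item in listeners if split_listener(item)[0] in WILDCARD_HOSTS]


observed = ["*:9090", "*:3000", "*:9100", "127.0.0.1:12345"]
print(exposed_listeners(observed))
```

Only the Alloy listener on `127.0.0.1:12345` is excluded, which matches the finding that Prometheus, Grafana, and node_exporter are reachable on all interfaces.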
|
||||||
|
|
||||||
## Prometheus state
|
## Prometheus state
|
||||||
|
|
||||||
@@ -39,32 +54,21 @@ This is a launch blocker because Prometheus, Grafana, and node_exporter are reac
|
|||||||
- `/sd/exporters` is configured with bearer auth and works with token
|
- `/sd/exporters` is configured with bearer auth and works with token
|
||||||
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
|
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`
|
||||||
|
|
||||||
Token risk: the bearer token is stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
|
Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. Tokens were not observed leaking through labels or scrape URLs in the audit, but storage should still be tightened.
|
||||||
|
|
||||||
## Targets/discovery state
|
## Targets/discovery state
|
||||||
|
|
||||||
Current active targets observed: 12.
|
Resolved:
|
||||||
|
|
||||||
Up targets include:
|
- `game-dev-alloy` VMID `5205` is gone
|
||||||
|
- `game-dev-alloy` VMID `6084` is gone
|
||||||
|
- empty discovery writes a valid empty file_sd file instead of failing
|
||||||
|
|
||||||
- `prometheus`
|
Remaining problem targets to remove/fix:
|
||||||
- `alloy`
|
|
||||||
- local `node`
|
|
||||||
- routers
|
|
||||||
- monitor node
|
|
||||||
- `zlh_node` VMID `6050`
|
|
||||||
|
|
||||||
Problem targets observed:
|
- `alloy-dev-6076`, `10.100.0.14:12345`, hardcoded/down
|
||||||
|
- `cadvisor`, `10.60.0.11:8081`, down/no route
|
||||||
- `game-dev-alloy` VMID `5205`, `10.200.0.78:12345`, timeout
|
- `zerolaghub-nodes`, `10.60.0.19:9100`, down/connection refused
|
||||||
- `game-dev-alloy` VMID `6084`, `10.100.0.22:12345`, timeout
|
|
||||||
- `alloy-dev-6076`, `10.100.0.14:12345`, timeout
|
|
||||||
- `cadvisor`, `10.60.0.11:8081`, no route
|
|
||||||
- `zerolaghub-nodes`, `10.60.0.19:9100`, connection refused
|
|
||||||
|
|
||||||
`/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` contains stale down targets `5205` and `6084`.
|
|
||||||
|
|
||||||
The discovery sync currently fails because the API returns dev VMID `6050` on `10.60.0.220`, while the sync script only allows dev IPs in `10.100.0.0/16`. The service then refuses to write an empty discovery file.
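The rejection described above can be reproduced with a small sketch using Python's standard `ipaddress` module. The function name `filter_allowed` and the record shape are hypothetical; only the network `10.100.0.0/16` and the address `10.60.0.220` come from the audit text.

```python
import ipaddress

# Allowed dev network, as stated in the audit text.
ALLOWED_DEV_NETWORK = ipaddress.ip_network("10.100.0.0/16")


def filter_allowed(containers, network=ALLOWED_DEV_NETWORK):
    """Split container records into (allowed, rejected) by IP address."""
    allowed, rejected = [], []
    for container in containers:
        address = ipaddress.ip_address(container["ip"])
        (allowed if address in network else rejected).append(container)
    return allowed, rejected


allowed, rejected = filter_allowed([{"vmid": 6050, "ip": "10.60.0.220"}])
print(allowed, rejected)
```

Because `10.60.0.220` falls outside `10.100.0.0/16`, the only returned container is rejected and the allowed list is empty, which is the condition under which the service refused to write the discovery file.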
|
|
||||||
|
|
||||||
## API monitoring state
|
## API monitoring state
|
||||||
|
|
||||||
@@ -74,19 +78,22 @@ API host metrics exist via `integrations/unix`:
|
|||||||
- `role=API`
|
- `role=API`
|
||||||
- `instance=zlh-api`
|
- `instance=zlh-api`
|
||||||
|
|
||||||
Observed API app health endpoint `/health` returned `404 Cannot GET /health`; no API health scrape was found.
|
Observed API app health endpoint `/health` previously returned `404 Cannot GET /health`; no API health scrape was found.
|
||||||
|
|
||||||
No centralized API logs were found on the monitoring host.
|
No centralized API logs were found on the monitoring host.
|
||||||
|
|
||||||
## Agent/container monitoring state
|
## Agent/container monitoring state
|
||||||
|
|
||||||
`/sd/exporters` returned one node exporter target:
|
Current dynamic game/dev discovery is empty and clean.
|
||||||
|
|
||||||
- `10.60.0.220:9100`, `vmid=6050`, `ctype=dev`, `job=zlh_node`
|
Expected next validation is lifecycle smoke testing:
|
||||||
|
|
||||||
File discovery still contains stale Alloy collector targets for `5205` and `6084`, both down.
|
- create a game and/or dev container
|
||||||
|
- confirm target appears
|
||||||
|
- delete it
|
||||||
|
- confirm target disappears now that empty discovery works
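The "confirm target appears / disappears" steps above could be checked against the JSON returned by the Prometheus `/api/v1/targets` endpoint. The sketch below only filters an already-fetched response; the function names are hypothetical, and it assumes the `vmid` label is present on the target's labels, as in the labels listed in this document.

```python
def find_targets(payload, job, vmid):
    """Return active targets matching a job name and a vmid label."""
    active = payload.get("data", {}).get("activeTargets", [])
    matches = []
    for target in active:
        labels = target.get("labels", {})
        if labels.get("job") == job and labels.get("vmid") == str(vmid):
            matches.append(target)
    return matches


def target_present(payload, job, vmid):
    """True if at least one matching target is listed as active."""
    return bool(find_targets(payload, job, vmid))


# Hand-written sample in the shape of an /api/v1/targets response.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "game-dev-alloy", "vmid": "6084"}, "health": "down"}
        ]
    },
}
print(target_present(sample, "game-dev-alloy", 6084))
```

In a smoke test, the same check would be run once after creating the container (expecting a match) and again after deleting it and letting the sync run (expecting no match).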
|
||||||
|
|
||||||
Labels currently observed include:
|
Labels previously observed include:
|
||||||
|
|
||||||
- `vmid`
|
- `vmid`
|
||||||
- `ctype` or `container_type`
|
- `ctype` or `container_type`
|
||||||
@@ -114,21 +121,21 @@ Dashboard JSON files exist under `/etc/zlh-monitor/grafana/dashboards`:
|
|||||||
- `dev-containers.json`
|
- `dev-containers.json`
|
||||||
- `game-containers.json`
|
- `game-containers.json`
|
||||||
|
|
||||||
However, Grafana DB showed no dashboards installed. Datasource exists:
|
However, Grafana DB previously showed no dashboards installed. Datasource exists:
|
||||||
|
|
||||||
- `prometheus -> http://localhost:9090`
|
- `prometheus -> http://localhost:9090`
|
||||||
|
|
||||||
Dashboard JSON queries reference `job="integrations/unix"` and `job="game-dev-alloy"`, but game/dev Alloy endpoint data is currently down/stale.
|
Dashboard JSON must be provisioned/visible in Grafana before launch monitoring is considered ready.
|
||||||
|
|
||||||
## Logs state
|
## Logs state
|
||||||
|
|
||||||
No Loki, Promtail, or log-ingesting Alloy pipeline was found.
|
No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.
|
||||||
|
|
||||||
Only journald/local logs for monitoring components are available on the monitoring host. Recent logs showed repeated discovery failures and Prometheus HTTP SD errors. No token values appeared in those log excerpts.
|
Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.
|
||||||
|
|
||||||
## Security state
|
## Security state
|
||||||
|
|
||||||
Working controls:
|
Working controls from prior audit:
|
||||||
|
|
||||||
- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
|
- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
|
||||||
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
|
- discovery endpoints require token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
|
||||||
|
|||||||