Update monitoring current state after launch-ready smoke tests

parent 9c5984a95c
commit aab9569a49

## Launch readiness

**Status: core lifecycle monitoring is launch-ready.**

The monitoring stack has been consolidated so `/etc/zlh-monitor` is now the operational source of truth. Game/dev lifecycle discovery, file_sd sync, Prometheus target appearance/removal, Alloy scrape health, and Alloy remote-write metrics have been validated through create/delete smoke tests.

Remaining future work:

- centralized logs are not enabled yet
- no Loki currently
- OPNsense may still need a router-only node_exporter job or equivalent router-specific monitoring later

## Runtime source of truth

`/etc/zlh-monitor` is now the monitoring runtime source of truth.

Prometheus now runs from:

```text
/etc/zlh-monitor/prometheus/prometheus.yml
```

configured via:

```text
/etc/default/prometheus
```

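As a hedged sketch of how that wiring typically looks (these are standard Prometheus flags and the listen address matches this note, but the real contents of the override file are an assumption):

```shell
# Hypothetical /etc/default/prometheus contents (sketch, not the real file).
# --web.enable-remote-write-receiver is required for container Alloy
# remote-write to land on this Prometheus; this note implies it is enabled.
ARGS="--config.file=/etc/zlh-monitor/prometheus/prometheus.yml \
--web.listen-address=10.60.0.25:9090 \
--web.enable-remote-write-receiver"
```

Only the config path and the listen address are confirmed by this document; treat the rest as an assumption.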
Runtime Prometheus jobs are limited to:

```text
prometheus
alloy
game-dev-alloy
```

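Expressed as a config sketch, that job list corresponds to roughly the following `scrape_configs` (the self-scrape and local Alloy targets are taken from this audit; intervals and other options are omitted and would need checking against the real file):

```yaml
# Sketch of the scrape section in /etc/zlh-monitor/prometheus/prometheus.yml.
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["10.60.0.25:9090"]   # Prometheus self-scrape
  - job_name: alloy
    static_configs:
      - targets: ["127.0.0.1:12345"]   # zlh-monitor local Alloy
  - job_name: game-dev-alloy
    file_sd_configs:
      - files:
          - /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```
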
Removed from active Prometheus runtime config:

- old `node_exporter` jobs
- cAdvisor jobs
- stale static node targets
- legacy game/dev targets
- active `:9100` scraping jobs

Local `prometheus-node-exporter` is disabled.

Prometheus listens on:

```text
10.60.0.25:9090
```

This allows Alloy remote-write from hosts and containers to reach Prometheus.

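On the container side, that path is an Alloy remote_write pipeline; a minimal sketch using stock Grafana Alloy components (the component labels and the exact pipeline in the template are assumptions):

```alloy
// Host-style metrics without node_exporter (source of node_uname_info,
// CPU and memory series), pushed to the central Prometheus.
prometheus.exporter.unix "host" { }

prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.zlh.receiver]
}

prometheus.remote_write "zlh" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
}
```
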
## Grafana state

Grafana provisions from:

```text
/etc/zlh-monitor/grafana/provisioning
```

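The provisioned Prometheus datasource (uid `prometheus`, default, pointing at `http://localhost:9090`) corresponds to a provisioning file roughly like this; the file name and its location under `provisioning/datasources/` are assumed:

```yaml
# Sketch: e.g. provisioning/datasources/prometheus.yml (name assumed).
apiVersion: 1
datasources:
  - uid: prometheus
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```
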
Dashboard source directory:

```text
/etc/zlh-monitor/grafana/dashboards
```

Live dashboards provisioned:

- Core Services - Overview
- Core Service - Detail
- Game Containers
- Dev Containers

The old node_exporter router dashboard was archived to:

```text
/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json
```

Grafana listens on loopback only:

```text
127.0.0.1:3000
```

## Discovery behavior

API lifecycle discovery was corrected to exclude non-lifecycle core dev systems by CIDR.

`9091 / zpack-dev-velocity / 10.60.0.220` no longer appears as a lifecycle dev container.

The monitor sync allows empty discovery:

```text
ALLOW_EMPTY="true"
```

This ensures deleted containers are correctly cleared from:

```text
/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json -> []
```

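For contrast, a populated `game-dev-alloy.json` carries entries of this general shape (the address pattern matches the smoke-test containers; the exact label set written by the sync is an assumption modeled on the labels observed in metrics):

```json
[
  {
    "targets": ["10.100.0.24:12345"],
    "labels": {
      "vmid": "6086",
      "container_type": "dev",
      "name": "dev-6086"
    }
  }
]
```

After delete, the sync rewrites this back to `[]`.
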
## Alloy template state

The base template was updated so container Alloy listens on:

```text
0.0.0.0:12345
```

instead of only:

```text
127.0.0.1:12345
```

This allows Prometheus to scrape container Alloy health via the `game-dev-alloy` job.

The updated template note replaces node_exporter with Alloy:

```text
Alloy Metrics/UI: :12345/metrics
Alloy Listen Addr: 0.0.0.0:12345
Node Exporter: not used
```

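The mechanism the template uses for the bind is not shown here; with packaged Grafana Alloy it is typically the `--server.http.listen-addr` flag (stock default `127.0.0.1:12345`, matching the old behavior), set via the service's environment file. A hedged sketch:

```shell
# Hypothetical /etc/default/alloy override (file and variable name follow
# the packaged-Alloy convention; the template's actual mechanism is assumed).
CUSTOM_ARGS="--server.http.listen-addr=0.0.0.0:12345"
```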
## Metrics model

Monitoring is not API-only.

- API discovery + monitor sync + file_sd owns lifecycle inventory and add/remove validation.
- Container Alloy remote-write to Prometheus is the canonical game/dev metrics path.
- `game-dev-alloy` scrape health also works now that container Alloy binds on `0.0.0.0:12345`.
- Grafana reads Prometheus for dashboards.

## Smoke tests passed

### First manual test

```text
dev 6085 -> appeared, scraped after Alloy bind fix, remote-write worked
game 5206 -> appeared, scraped after Alloy bind fix, remote-write worked
delete -> both disappeared from API discovery, file_sd, and Prometheus
```

### Template test after updating base image

```text
dev 6086 / 10.100.0.24 -> up
game 5207 / 10.200.0.80 -> up
```

### Full loop verified

```text
create -> API discovery -> file_sd -> Prometheus target up -> remote-write metrics present
delete -> API discovery empty -> file_sd [] -> Prometheus targets gone
```

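These checks can be repeated by hand against the standard Prometheus HTTP API; a command sketch (the `vmid` label comes from the remote-write label set, and `node_uname_info` was observed in an earlier audit):

```text
curl -s 'http://10.60.0.25:9090/api/v1/targets?state=active'
curl -s -G 'http://10.60.0.25:9090/api/v1/query' --data-urlencode 'query=node_uname_info{vmid="6086"}'
cat /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json
```

An empty `[]` from the last command is the expected post-delete state.
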
## Logging state

Centralized logs are not enabled yet.

No Loki currently.

Future direction is centralized Loki with selected logs shipped by Alloy. Do not run Loki inside every game/dev container.

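When that lands, the shipping side would be an Alloy Loki pipeline rather than per-container Loki; a minimal sketch with stock components (no Loki endpoint exists yet, so the URL is a placeholder):

```alloy
loki.write "central" {
  endpoint {
    url = "http://LOKI_HOST:3100/loki/api/v1/push"
  }
}

// Ship selected journald logs only, per the direction above.
loki.source.journal "system" {
  forward_to = [loki.write.central.receiver]
}
```
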
## Platform exceptions

OPNsense and PBS remain explicit platform monitoring exceptions.

OPNsense may still need a router-only node_exporter job or another router-specific monitoring path later.