# Monitoring — Current State
## Launch readiness

**Status: PARTIAL FAIL for launch monitoring readiness.**

Core monitoring services are running. The empty state for dynamic game/dev discovery is clean, stale `game-dev-alloy` VMIDs `5205` and `6084` are gone, legacy/static Prometheus jobs have been removed from the active runtime config, and Grafana dashboard provisioning is now live from the `/etc/zlh-monitor` runtime source-of-truth path.

Launch-debug readiness is still blocked by public monitoring endpoint exposure, missing or unfinished centralized logging, a missing API application health scrape, incomplete lifecycle telemetry, and the outstanding full game/dev lifecycle smoke test.

## Runtime source of truth

`/etc/zlh-monitor` is now the runtime source of truth.

Prometheus now runs from:

- `/etc/zlh-monitor/prometheus/prometheus.yml`
- configured via `/etc/default/prometheus`
- backup: `/etc/default/prometheus.zlh-monitor-20260501191319.bak`

Grafana now provisions from:

- `/etc/zlh-monitor/grafana/provisioning`
- configured via `/etc/default/grafana-server`
- backup: `/etc/default/grafana-server.zlh-monitor-20260501191319.bak`

`/etc/zlh-monitor/README.md` was updated to describe the new single-source layout.

## Recently verified progress

Container discovery is clean:

- `/sd/exporters -> []`
- `/monitoring/game-dev -> containers: []`
- monitor-side `ALLOW_EMPTY="true"` added in `/etc/zlh-monitor/game-dev-discovery.env`
- one-shot sync succeeded: wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- Prometheus refreshed and stale `game-dev-alloy` targets `5205` and `6084` are gone

Prometheus runtime config is clean:

- config validates with `promtool`
- active jobs are now only:
  - `prometheus`
  - `alloy -> 127.0.0.1:12345`
  - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
- no active `node_exporter` job
- no active cAdvisor job
- no active `:9100` jobs

Current Prometheus targets are clean:

- `up alloy 127.0.0.1:12345`
- `up prometheus localhost:9090`

Grafana provisioning is live:

- one source-managed datasource exists:
  - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
- provisioned dashboards:
  - `Core Service - Detail`
  - `Core Services - Overview`
  - `Dev Containers`
  - `Game Containers`
- legacy node_exporter router dashboard archived to:
  - `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

Services active after cleanup:

- `prometheus`
- `grafana-server`
- `alloy`
- discovery timer

## Exposure state

Observed listening ports from prior audit:

- `*:9090` Prometheus
- `*:3000` Grafana
- `*:9100` node_exporter
- `127.0.0.1:12345` Alloy

Active Prometheus jobs no longer scrape `node_exporter` or any `:9100` target, but the listeners themselves still need separate verification and lockdown.

Host firewall posture from prior audit:

- UFW inactive
- nftables input/forward/output policy accept

This remains a launch blocker until Prometheus, Grafana, and any remaining node_exporter listener are restricted to trusted admin/monitoring paths.

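The listener list above can be triaged mechanically. The following is a minimal sketch, not the audit tooling itself: it assumes listener strings in the `host:port` form shown in this section, and the helper names `is_non_loopback_bind` and `exposed_services` are hypothetical.

```python
import ipaddress

# Listener table copied from the prior audit in this section.
LISTENERS = {
    "*:9090": "Prometheus",
    "*:3000": "Grafana",
    "*:9100": "node_exporter",
    "127.0.0.1:12345": "Alloy",
}


def is_non_loopback_bind(listener: str) -> bool:
    """Return True when the listener is bound to anything other than loopback."""
    host, _, _port = listener.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::]:9090
    if host in ("*", "", "0.0.0.0", "::"):
        return True  # wildcard bind: reachable on every interface
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Unknown host syntax: treat as exposed so it gets reviewed.
        return True


def exposed_services(listeners: dict) -> list:
    """Names of services whose listener is not loopback-only, sorted."""
    return sorted(name for addr, name in listeners.items() if is_non_loopback_bind(addr))


if __name__ == "__main__":
    for name in exposed_services(LISTENERS):
        print(f"needs lockdown or firewalling: {name}")
```

A wildcard bind is only a problem if the firewall does not restrict it, so this check flags candidates for review and does not prove external reachability on its own.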
## Prometheus state

- Runtime config path: `/etc/zlh-monitor/prometheus/prometheus.yml`
- Config validates with `promtool`
- Active jobs:
  - `prometheus`
  - `alloy`
  - `game-dev-alloy`
- Current targets:
  - `up alloy 127.0.0.1:12345`
  - `up prometheus localhost:9090`
- `/sd/exporters` is configured with bearer auth and works when the token is supplied
- unauthenticated `/sd/exporters` and `/monitoring/game-dev` return `401 unauthorized`

Token risk: the bearer token was previously observed stored directly in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`. The audit did not observe the token leaking through labels or scrape URLs, but its storage should still be tightened.

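One part of tightening token storage is confirming that files holding the token are not readable by group or others. The sketch below is an assumption about how such a check could look, not something the audit ran; `mode_is_too_open` and `audit_token_files` are hypothetical names, and the path list covers only the two files named explicitly in this document.

```python
import os
import stat

# Files this document names as having held the bearer token.
# Backup configs are not listed here because their exact paths are not stated.
TOKEN_FILES = [
    "/etc/zlh-monitor/prometheus/prometheus.yml",
    "/etc/zlh-monitor/game-dev-discovery.env",
]


def mode_is_too_open(mode: int) -> bool:
    """True if group or others have any permission bit set."""
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def audit_token_files(paths):
    """Map each path to 'ok', 'too open', or the name of the OS error hit."""
    findings = {}
    for path in paths:
        try:
            mode = stat.S_IMODE(os.stat(path).st_mode)
        except OSError as exc:
            findings[path] = type(exc).__name__
            continue
        findings[path] = "too open" if mode_is_too_open(mode) else "ok"
    return findings


if __name__ == "__main__":
    for path, result in audit_token_files(TOKEN_FILES).items():
        print(f"{result:>18}  {path}")
```

File permissions only limit who can read the token on the host. Moving the token out of the YAML entirely, for example through Prometheus's `credentials_file` setting in the `authorization` block, would be a separate step.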
## Targets/discovery state

Resolved:

- `game-dev-alloy` VMID `5205` is gone
- `game-dev-alloy` VMID `6084` is gone
- `alloy-dev-6076` hardcoded/down target is gone from active runtime target list
- cAdvisor `10.60.0.11:8081` is gone from active runtime target list
- `zerolaghub-nodes 10.60.0.19:9100` is gone from active runtime target list
- empty discovery writes a valid empty file_sd file instead of failing

No stale/static down targets are currently present in active Prometheus targets.

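The empty-discovery behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actual sync script: the function name `write_file_sd`, the `allow_empty` parameter (mirroring the `ALLOW_EMPTY` setting), and the atomic-rename approach are all hypothetical. The only fact relied on is that Prometheus file_sd accepts a JSON list of target groups, so `[]` is a valid file.

```python
import contextlib
import json
import os
import tempfile


def write_file_sd(path, target_groups, allow_empty=False):
    """Write target groups as file_sd JSON; return the number of targets written.

    An empty result is written as `[]` only when allow_empty is true,
    otherwise it is treated as a discovery failure.
    """
    if not target_groups and not allow_empty:
        raise ValueError("empty discovery result and allow_empty is false")

    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(target_groups, handle, indent=2)
            handle.write("\n")
        # Assumption: Prometheus runs as a different user and needs read access.
        os.chmod(tmp_path, 0o644)
        # Rename within the same directory so readers never see a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    return sum(len(group.get("targets", [])) for group in target_groups)
```

Calling `write_file_sd(path, [], allow_empty=True)` produces a file containing `[]` and returns `0`, which corresponds to the `0 targets` one-shot sync result recorded earlier.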
## API monitoring state

API host/system metrics were previously available via `integrations/unix`, but the active runtime Prometheus jobs are now intentionally reduced to Prometheus, local Alloy, and dynamic game/dev Alloy discovery.

The API app health endpoint `/health` previously returned `404 Cannot GET /health`, and no API health scrape was found.

No centralized API logs were found on the monitoring host.

## Agent/container monitoring state

Current dynamic game/dev discovery is empty and clean.

Expected next validation is lifecycle smoke testing:

- create a game and/or dev container
- confirm target appears
- delete it
- confirm target disappears now that empty discovery works

Labels previously observed include:

- `vmid`
- `ctype` or `container_type`
- `game`
- `variant`
- `hostname` / `name`
- `job`
- `instance`

Missing or incomplete for lifecycle debugging:

- `ready` / `connectable`
- `operationInProgress`
- `operationType`
- backup/restore state
- code-server state

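The appear/disappear checks in the smoke test can be polled against the Prometheus targets API. The sketch below makes several assumptions: Prometheus is reachable at `http://localhost:9090` (the datasource URL recorded in this document), discovered targets carry the `vmid` label listed above, and the helper names `fetch_targets`, `targets_for_vmid`, and `wait_until` are hypothetical. Container creation and deletion are out of scope and happen elsewhere.

```python
import json
import time
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: run on the monitoring host


def fetch_targets(base_url=PROMETHEUS_URL):
    """Fetch the parsed response of Prometheus's /api/v1/targets endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def targets_for_vmid(payload, vmid, job="game-dev-alloy"):
    """Active targets in the given job whose vmid label matches."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        target
        for target in active
        if target.get("labels", {}).get("job") == job
        and target.get("labels", {}).get("vmid") == str(vmid)
    ]


def wait_until(check, timeout=120.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

After creating a container, `wait_until(lambda: targets_for_vmid(fetch_targets(), vmid))` would confirm the target appears; after deleting it, `wait_until(lambda: not targets_for_vmid(fetch_targets(), vmid))` would confirm it disappears. The timeout should exceed the discovery timer interval plus the file_sd refresh delay, neither of which is recorded in this document.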
## Grafana state

Grafana now provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.

Datasource:

- `uid=prometheus`
- `name=Prometheus`
- `url=http://localhost:9090`
- default datasource

Live provisioned dashboards:

- `Core Service - Detail`
- `Core Services - Overview`
- `Dev Containers`
- `Game Containers`

Archived:

- `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`

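The dashboard list above can be compared against what Grafana actually serves. This is a minimal sketch: it assumes the input is the already-parsed JSON array from an authenticated call to Grafana's `/api/search` endpoint (unauthenticated calls return `401`, as noted in this document), and `dashboard_drift` is a hypothetical helper name.

```python
# Dashboard titles this document lists as live and provisioned.
EXPECTED_DASHBOARDS = frozenset({
    "Core Service - Detail",
    "Core Services - Overview",
    "Dev Containers",
    "Game Containers",
})


def dashboard_drift(search_results, expected=EXPECTED_DASHBOARDS):
    """Compare dashboards in a parsed /api/search response with the expected set.

    Folder entries are ignored; only items of type "dash-db" are counted.
    """
    found = {
        item.get("title")
        for item in search_results
        if item.get("type") == "dash-db"
    }
    return {
        "missing": sorted(set(expected) - found),
        "unexpected": sorted(found - set(expected)),
    }
```

An empty `missing` list means every expected dashboard is present. A non-empty `unexpected` list would be worth checking in case the archived `core-routers.json` dashboard, or another dashboard outside the source-managed set, is still being served.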
## Logs state

No Loki, Promtail, or log-ingesting Alloy pipeline was found in the prior audit.

Only journald/local logs for monitoring components were available on the monitoring host. This remains a launch-debug gap for API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge actions, and discovery sync.

## Security state

Working controls from prior audit:

- Grafana anonymous access is disabled; unauthenticated `/api/search` returned `401`
- discovery endpoints require a token; unauthenticated `/sd/exporters` and `/monitoring/game-dev` returned `401`
- the Prometheus labels and scrape URLs that were checked did not expose the bearer token

Launch-blocking risk:

- Prometheus and Grafana were previously bound publicly on the host network
- node_exporter previously listened on `*:9100`; it is no longer scraped by the active Prometheus config, but the listener still needs to be verified and then removed or firewalled
- host firewall policy did not restrict access in the prior audit