Update monitoring open items after runtime source cleanup
parent 1a837f3770
commit 840394eb8e
@@ -4,6 +4,11 @@ Only keep unfinished monitoring/observability work here.
 
 ## Recently resolved / verified
 
+- `/etc/zlh-monitor` is now the runtime source of truth.
+  - Prometheus runs from `/etc/zlh-monitor/prometheus/prometheus.yml` via `/etc/default/prometheus`.
+  - Grafana provisions from `/etc/zlh-monitor/grafana/provisioning` via `/etc/default/grafana-server`.
+  - `/etc/zlh-monitor/README.md` describes the new single-source layout.
+
 - Game/dev dynamic discovery empty-state now works.
   - `/sd/exporters -> []`
   - `/monitoring/game-dev -> containers: []`
@@ -11,43 +16,67 @@ Only keep unfinished monitoring/observability work here.
   - one-shot sync succeeded and wrote `0 targets` to `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
   - Prometheus refreshed and stale `game-dev-alloy` VMIDs `5205` and `6084` disappeared
 
+- Prometheus runtime config is clean.
+  - validates with `promtool`
+  - active jobs are only:
+    - `prometheus`
+    - `alloy -> 127.0.0.1:12345`
+    - `game-dev-alloy -> /etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json`
+  - no active `node_exporter` job
+  - no active cAdvisor job
+  - no active `:9100` jobs
+  - current active targets are clean:
+    - `up alloy 127.0.0.1:12345`
+    - `up prometheus localhost:9090`
+
+- Remaining stale/static Prometheus targets are removed from active runtime config.
+  - `alloy-dev-6076` no longer appears as an active runtime target
+  - `cadvisor 10.60.0.11:8081` no longer appears as an active runtime target
+  - `zerolaghub-nodes 10.60.0.19:9100` no longer appears as an active runtime target
+
+- Grafana provisioning is live.
+  - source-managed datasource:
+    - `uid=prometheus`, `name=Prometheus`, `url=http://localhost:9090`, `default=1`
+  - provisioned dashboards:
+    - `Core Service - Detail`
+    - `Core Services - Overview`
+    - `Dev Containers`
+    - `Game Containers`
+  - legacy node_exporter router dashboard archived to `/etc/zlh-monitor/grafana/dashboards-archive/core-routers.json`
+
+- Services active after cleanup:
+  - `prometheus`
+  - `grafana-server`
+  - `alloy`
+  - discovery timer
+
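
Review note: the `0 targets` result for `game-dev-alloy.json` can be re-checked without reading the sync logs. The sketch below is a minimal example, assuming the file follows the standard Prometheus `file_sd` JSON shape (a list of groups, each with a `targets` list of strings). The function name `count_file_sd_targets` is hypothetical and is not part of the discovery sync itself.

```python
import json


def count_file_sd_targets(text: str) -> int:
    """Count targets in a Prometheus file_sd JSON document.

    Assumes the standard shape: a JSON list of target groups, each an
    object with a "targets" list of "host:port" strings.
    """
    groups = json.loads(text)
    if not isinstance(groups, list):
        raise ValueError("file_sd document must be a JSON list of target groups")
    total = 0
    for group in groups:
        if not isinstance(group, dict):
            raise ValueError("each target group must be a JSON object")
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            raise ValueError("each target group needs a 'targets' list of strings")
        total += len(targets)
    return total
```

Passing the contents of `/etc/zlh-monitor/prometheus/file_sd/game-dev-alloy.json` to this function should return `0` while no game/dev containers exist; an empty list `[]` is a valid empty state.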
 ## Launch blockers
 
 1. Restrict monitoring endpoint exposure.
-   - Prometheus currently listens on `*:9090`.
-   - Grafana currently listens on `*:3000`.
-   - node_exporter currently listens on `*:9100`.
-   - UFW is inactive and nftables policy is accept.
-   - Required fix: bind to loopback/management VLAN only or firewall to trusted admin/monitoring sources.
+   - Prometheus previously listened on `*:9090`.
+   - Grafana previously listened on `*:3000`.
+   - node_exporter previously listened on `*:9100`.
+   - UFW was inactive and nftables policy was accept in the prior audit.
+   - Prometheus no longer scrapes node_exporter or any `:9100` jobs, but listener exposure still needs verification/removal or firewalling.
+   - Required fix: bind Prometheus/Grafana to loopback or management VLAN only, remove/disable any unneeded node_exporter listener, and/or firewall to trusted admin/monitoring sources.
 
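
Review note: listener exposure for blocker 1 still needs verification, and one way to do that is to scan the output of `ss -ltn` for wildcard binds on the three ports named above. This is a rough sketch, assuming the usual `ss` column layout in which the local address is the fourth field. The helper name `exposed_listeners` and the set of wildcard spellings are assumptions made for illustration.

```python
# Host spellings that mean "all interfaces" in `ss -ltn` output (assumed set).
WILDCARD_HOSTS = {"*", "0.0.0.0", "[::]"}


def exposed_listeners(ss_output: str, ports=(9090, 3000, 9100)) -> list:
    """Return local addresses that listen on a wildcard host for the given ports."""
    exposed = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a LISTEN socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        local = fields[3]
        host, _, port = local.rpartition(":")
        if port.isdigit() and int(port) in ports and host in WILDCARD_HOSTS:
            exposed.append(local)
    return exposed
```

An empty result for the monitoring host would suggest that Prometheus, Grafana, and any leftover node_exporter are no longer bound to all interfaces. It does not replace a firewall check.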
-2. Remove or fix remaining stale/static targets.
-   - `alloy-dev-6076` is still hardcoded/down.
-   - `cadvisor`, `10.60.0.11:8081`, is down/no route.
-   - `zerolaghub-nodes`, `10.60.0.19:9100`, is down/connection refused.
-   - Acceptance: remaining down targets are either removed, fixed, or explicitly documented as intentional disabled targets.
-
-3. Install/provision Grafana dashboards.
-   - Dashboard JSON exists under `/etc/zlh-monitor/grafana/dashboards`.
-   - Grafana DB previously had no dashboards installed.
-   - Acceptance: dashboards visible in Grafana and queries return live data.
-
-4. Add API health scrape.
-   - Observed `/health` returns `404 Cannot GET /health`.
+2. Add API health scrape or equivalent application-level health signal.
+   - Observed `/health` previously returned `404 Cannot GET /health`.
    - Acceptance: Prometheus has an API app health target or equivalent application-level health signal.
 
-5. Add launch-debug lifecycle telemetry.
+3. Add launch-debug lifecycle telemetry.
    - Missing/incomplete: `ready`, `connectable`, `operationInProgress`, `operationType`, backup/restore state, code-server state.
    - Acceptance: dashboards or Prometheus queries can answer what a game/dev container is doing and why it is not ready.
 
-6. Add centralized logs.
-   - No Loki, Promtail, or log-ingesting Alloy config found.
+4. Add centralized logs.
+   - No Loki, Promtail, or log-ingesting Alloy config found in prior audit.
    - Acceptance: API, Agent, provisioning, backup/restore, delete/teardown, Velocity/DNS/edge, and discovery-sync logs are queryable from the monitoring surface.
 
-7. Tighten bearer token storage.
+5. Tighten bearer token storage.
    - Token previously appeared in Prometheus YAML, backup configs, and `/etc/zlh-monitor/game-dev-discovery.env`.
    - Acceptance: token stored in tight-permission env file or systemd credentials; not world-readable; not leaked in labels/URLs/logs.
 
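
Review note: the "not world-readable" part of the bearer token acceptance can be checked mechanically. The sketch below assumes the token lives in a plain env file such as `/etc/zlh-monitor/game-dev-discovery.env` and treats any group or other permission bit as a failure. Both helper names are hypothetical, and the check does not cover the systemd-credentials option or leaks in labels, URLs, or logs.

```python
import os
import stat


def mode_is_private(mode: int) -> bool:
    """True when no group/other permission bits are set (e.g. 0o600 or 0o400)."""
    return stat.S_IMODE(mode) & 0o077 == 0


def token_file_is_private(path: str) -> bool:
    """Check the permission bits of an env file that holds the bearer token."""
    return mode_is_private(os.stat(path).st_mode)
```

A file with mode `0o600` passes; `0o640` and `0o644` do not, because group or world read access remains.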
-8. Run lifecycle smoke test once remaining blockers are controlled enough.
+6. Run lifecycle smoke test once remaining blockers are controlled enough.
    - Create a game container and/or dev container.
    - Confirm target appears.
    - Delete it.
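
Review note: the "Confirm target appears" step of the smoke test could be scripted against the Prometheus HTTP API, which lists active targets at `/api/v1/targets`. The sketch below only inspects an already-decoded response and leaves the HTTP call to the caller. The function name `target_health` and the instance value in the usage note are hypothetical; the job name `game-dev-alloy` comes from the notes in this commit.

```python
from typing import Optional


def target_health(targets_response: dict, job: str, instance: str) -> Optional[str]:
    """Return the health string of a matching active target, or None if absent.

    Expects the decoded JSON body of GET /api/v1/targets, where each entry in
    data.activeTargets carries "labels" (with "job" and "instance") and "health".
    """
    active = targets_response.get("data", {}).get("activeTargets", [])
    for target in active:
        labels = target.get("labels", {})
        if labels.get("job") == job and labels.get("instance") == instance:
            return target.get("health")
    return None
```

After creating a container, a call such as `target_health(resp, "game-dev-alloy", "10.0.0.5:12345")` would be expected to return a health value once discovery has synced, and `None` again after the container is deleted and Prometheus has refreshed.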