From c2c8d0c8d16e7827ca1c7a16a8e24c27a37d3d9e Mon Sep 17 00:00:00 2001 From: jester Date: Fri, 17 Apr 2026 14:45:17 +0000 Subject: [PATCH] Add session summary for Alloy monitor rollout and container test prep --- .../2026-04-17-alloy-monitor-rollout.md | 148 ++++++++++++++++++ 1 file changed, 148 insertions(+) create mode 100644 Session_Summaries/2026-04-17-alloy-monitor-rollout.md diff --git a/Session_Summaries/2026-04-17-alloy-monitor-rollout.md b/Session_Summaries/2026-04-17-alloy-monitor-rollout.md new file mode 100644 index 0000000..0ce3028 --- /dev/null +++ b/Session_Summaries/2026-04-17-alloy-monitor-rollout.md @@ -0,0 +1,148 @@ +# Session Summary — Alloy Monitor Rollout (2026-04-17) + +## Objective +Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing. + +--- + +## What Was Completed + +### 1. Root / Codex Restructure +- Introduced `Codex/` ownership model: + - API / Portal / Agent now maintain their own state and open items +- Root docs updated: + - `README.md` + - `SESSION_START.md` + - `OPEN_THREADS.md` + - `PROJECT_CONTEXT.md` +- Enforced cleanup rule: + - remove completed items + - move durable behavior to `CURRENT_STATE.md` + - keep root cross-repo only + +--- + +### 2. Monitoring Strategy Alignment + +Established current (2026) observability direction: + +- Prometheus → metrics storage +- Grafana → visualization +- Loki → optional log storage (not required yet) +- Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer) + +Key decisions: +- Metrics-first rollout +- Do NOT introduce Loki yet +- Do NOT use Promtail or Grafana Agent + +--- + +### 3. zlh-monitor Preparation + +Created a clean, self-contained repo structure at: + +`/etc/zlh-monitor` + +Added: +- Alloy config (host metrics) +- Prometheus scrape snippet for Alloy +- Prometheus args snippet (remote write receiver) +- systemd service for Alloy +- Grafana datasource provisioning stub +- validation script +- README with audit + deployment notes + +--- + +### 4. Live Host Integration (zlh-monitor) + +Completed live wiring: + +- Installed Alloy (`v1.15.1`) +- Installed and enabled systemd service +- Enabled Prometheus remote-write receiver +- Added Alloy scrape job to Prometheus +- Restarted Prometheus successfully +- Started Alloy successfully + +Validation: +- `up{job="alloy"} == 1` +- Alloy `/metrics` endpoint reachable +- Host metrics confirmed via Alloy: + - CPU + - memory + - filesystem + - uname + +Important: +- Existing node-exporter was intentionally left running +- No Loki added + +--- + +### 5. Current System State + +Working: +- Alloy (host-level metrics) +- Prometheus (healthy) +- Grafana (unchanged, functional) + +Known unrelated issues (not addressed in this session): +- cadvisor target down +- one node-exporter target down +- `zlh_lxc` HTTP discovery endpoint failing + +--- + +### 6. Container Monitoring Test Plan + +Defined next step: + +> Perform a **single narrow container scrape test** using Alloy + +Approach: +- Do NOT modify containers +- Use existing node-exporter inside container +- Add one static scrape target in Alloy +- Validate ingestion in Prometheus + +Manual test flow: +1. Verify container exporter reachable (`:9100`) +2. Add static Alloy scrape target +3. Restart Alloy +4. Confirm: + - `up{job="integrations/container-node"} == 1` + - container metrics present +5. Inspect labels for future scaling + +--- + +## Key Architectural Decisions + +- Alloy replaces per-node collectors (node-exporter layer in this system) +- Prometheus remains the metrics backend +- Grafana remains the UI +- Loki deferred until logs are needed +- Rollout strategy is incremental, not global + +--- + +## Next Step + +Proceed with container test: + +> Add one Alloy scrape target for a running container and validate metrics ingestion. + +After that: +- evaluate labeling strategy +- decide scaling approach (manual vs discovery) + +--- + +## Status + +- Host monitoring: ✅ complete +- Alloy rollout: ✅ complete (host) +- Container monitoring: ⏳ next step +- Logging (Loki): ❌ not implemented (intentional)