Add session summary for Alloy monitor rollout and container test prep

2026-04-17 14:45:17 +00:00 · 2026-04-17 14:45:17 +00:00 · c2c8d0c8d1
commit c2c8d0c8d1
parent 7f90c8f5c8
1 changed file with 148 additions and 0 deletions
--- a/Session_Summaries/2026-04-17-alloy-monitor-rollout.md
+++ b/Session_Summaries/2026-04-17-alloy-monitor-rollout.md
@@ -0,0 +1,148 @@
+# Session Summary — Alloy Monitor Rollout (2026-04-17)
+
+## Objective
+Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.
+
+---
+
+## What Was Completed
+
+### 1. Root / Codex Restructure
+- Introduced `Codex/` ownership model:
+  - API, Portal, and Agent each maintain their own state and open items
+- Root docs updated:
+  - `README.md`
+  - `SESSION_START.md`
+  - `OPEN_THREADS.md`
+  - `PROJECT_CONTEXT.md`
+- Enforced cleanup rule:
+  - remove completed items
+  - move durable behavior to `CURRENT_STATE.md`
+  - keep root docs limited to cross-repo concerns
+
+---
+
+### 2. Monitoring Strategy Alignment
+
+Established the current (2026) observability direction:
+
+- Prometheus → metrics storage
+- Grafana → visualization
+- Loki → optional log storage (not required yet)
+- Grafana Alloy → unified collector (replaces the node-exporter / Promtail / Grafana Agent layer)
+
+Key decisions:
+- Metrics-first rollout
+- Do NOT introduce Loki yet
+- Do NOT use Promtail or Grafana Agent
+
+---
+
+### 3. zlh-monitor Preparation
+
+Created a clean, self-contained repo structure at:
+
+`/etc/zlh-monitor`
+
+Added:
+- Alloy config (host metrics)
+- Prometheus scrape snippet for Alloy
+- Prometheus args snippet (remote write receiver)
+- systemd service for Alloy
+- Grafana datasource provisioning stub
+- validation script
+- README with audit + deployment notes
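+
+A minimal sketch of what the host-metrics Alloy config could look like, assuming Prometheus listens on `localhost:9090` with its remote-write receiver enabled. The component labels (`host`, `local`) and the URL are illustrative assumptions, not the contents of the committed file:
+
+```alloy
+// Expose host metrics (CPU, memory, filesystem, uname) via the built-in unix exporter.
+prometheus.exporter.unix "host" { }
+
+// Scrape the embedded exporter and forward samples to the remote_write component below.
+prometheus.scrape "host" {
+  targets    = prometheus.exporter.unix.host.targets
+  forward_to = [prometheus.remote_write.local.receiver]
+}
+
+// Push samples to Prometheus. The URL is an assumption; the receiver must be
+// enabled on the Prometheus side (for example with --web.enable-remote-write-receiver).
+prometheus.remote_write "local" {
+  endpoint {
+    url = "http://localhost:9090/api/v1/write"
+  }
+}
+```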
+
+---
+
+### 4. Live Host Integration (zlh-monitor)
+
+Completed live wiring:
+
+- Installed Alloy (`v1.15.1`)
+- Installed and enabled systemd service
+- Enabled Prometheus remote-write receiver
+- Added Alloy scrape job to Prometheus
+- Restarted Prometheus successfully
+- Started Alloy successfully
+
+Validation:
+- `up{job="alloy"} == 1`
+- Alloy `/metrics` endpoint reachable
+- Host metrics confirmed via Alloy:
+  - CPU
+  - memory
+  - filesystem
+  - uname
+
+Important:
+- Existing node-exporter was intentionally left running
+- No Loki added
+
+---
+
+### 5. Current System State
+
+Working:
+- Alloy (host-level metrics)
+- Prometheus (healthy)
+- Grafana (unchanged, functional)
+
+Known unrelated issues (not addressed in this session):
+- cadvisor target down
+- one node-exporter target down
+- `zlh_lxc` HTTP discovery endpoint failing
+
+---
+
+### 6. Container Monitoring Test Plan
+
+Defined next step:
+
+> Perform a **single narrow container scrape test** using Alloy
+
+Approach:
+- Do NOT modify containers
+- Use existing node-exporter inside container
+- Add one static scrape target in Alloy
+- Validate ingestion in Prometheus
+
+Manual test flow:
+1. Verify the container's node-exporter is reachable (`:9100`)
+2. Add static Alloy scrape target
+3. Restart Alloy
+4. Confirm:
+   - `up{job="integrations/container-node"} == 1`
+   - container metrics present
+5. Inspect labels for future scaling
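+
+Step 2 could be sketched as follows. The address `192.0.2.10:9100` is a placeholder for the container's node-exporter, and the component labels (`container_node`, `local`) and the remote-write URL are hypothetical; only the job name comes from the check in step 4:
+
+```alloy
+// One static target: the node-exporter already running inside the container.
+// Replace the placeholder address with the container's real IP and port.
+prometheus.scrape "container_node" {
+  targets = [
+    {"__address__" = "192.0.2.10:9100"},
+  ]
+
+  // Sets the `job` label so that up{job="integrations/container-node"} matches.
+  job_name   = "integrations/container-node"
+  forward_to = [prometheus.remote_write.local.receiver]
+}
+
+// Assumed remote_write component; omit this block if the host config
+// already defines one with the same label.
+prometheus.remote_write "local" {
+  endpoint {
+    url = "http://localhost:9090/api/v1/write"
+  }
+}
+```
+
+Keeping the test to a single static target leaves the containers untouched and makes the resulting labels easy to inspect before choosing between manual targets and discovery.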
+
+---
+
+## Key Architectural Decisions
+
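+- Alloy is the target replacement for per-node collectors (the node-exporter layer in this system); existing node-exporter instances stay in place during the incremental rollout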
+- Prometheus remains the metrics backend
+- Grafana remains the UI
+- Loki deferred until logs are needed
+- Rollout strategy is incremental, not global
+
+---
+
+## Next Step
+
+Proceed with container test:
+
+> Add one Alloy scrape target for a running container and validate metrics ingestion.
+
+After that:
+- evaluate labeling strategy
+- decide scaling approach (manual vs discovery)
+
+---
+
+## Status
+
+- Host monitoring: ✅ complete
+- Alloy rollout: ✅ complete (host)
+- Container monitoring: ⏳ next step
+- Logging (Loki): ❌ not implemented (intentional)