# Session Summary — Alloy Monitor Rollout (2026-04-17) ## Objective Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing. --- ## What Was Completed ### 1. Root / Codex Restructure - Introduced `Codex/` ownership model: - API / Portal / Agent now maintain their own state and open items - Root docs updated: - `README.md` - `SESSION_START.md` - `OPEN_THREADS.md` - `PROJECT_CONTEXT.md` - Enforced cleanup rule: - remove completed items - move durable behavior to `CURRENT_STATE.md` - keep root cross-repo only --- ### 2. Monitoring Strategy Alignment Established current (2026) observability direction: - Prometheus → metrics storage - Grafana → visualization - Loki → optional log storage (not required yet) - Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer) Key decisions: - Metrics-first rollout - Do NOT introduce Loki yet - Do NOT use Promtail or Grafana Agent --- ### 3. zlh-monitor Preparation Created a clean, self-contained repo structure at: `/etc/zlh-monitor` Added: - Alloy config (host metrics) - Prometheus scrape snippet for Alloy - Prometheus args snippet (remote write receiver) - systemd service for Alloy - Grafana datasource provisioning stub - validation script - README with audit + deployment notes --- ### 4. Live Host Integration (zlh-monitor) Completed live wiring: - Installed Alloy (`v1.15.1`) - Installed and enabled systemd service - Enabled Prometheus remote-write receiver - Added Alloy scrape job to Prometheus - Restarted Prometheus successfully - Started Alloy successfully Validation: - `up{job="alloy"} == 1` - Alloy `/metrics` endpoint reachable - Host metrics confirmed via Alloy: - CPU - memory - filesystem - uname Important: - Existing node-exporter was intentionally left running - No Loki added --- ### 5. Current System State Working: - Alloy (host-level metrics) - Prometheus (healthy) - Grafana (unchanged, functional) Known unrelated issues (not addressed in this session): - cadvisor target down - one node-exporter target down - `zlh_lxc` HTTP discovery endpoint failing --- ### 6. Container Monitoring Test Plan Defined next step: > Perform a **single narrow container scrape test** using Alloy Approach: - Do NOT modify containers - Use existing node-exporter inside container - Add one static scrape target in Alloy - Validate ingestion in Prometheus Manual test flow: 1. Verify container exporter reachable (`:9100`) 2. Add static Alloy scrape target 3. Restart Alloy 4. Confirm: - `up{job="integrations/container-node"} == 1` - container metrics present 5. Inspect labels for future scaling --- ## Key Architectural Decisions - Alloy replaces per-node collectors (node-exporter layer in this system) - Prometheus remains the metrics backend - Grafana remains the UI - Loki deferred until logs are needed - Rollout strategy is incremental, not global --- ## Next Step Proceed with container test: > Add one Alloy scrape target for a running container and validate metrics ingestion. After that: - evaluate labeling strategy - decide scaling approach (manual vs discovery) --- ## Status - Host monitoring: ✅ complete - Alloy rollout: ✅ complete (host) - Container monitoring: ⏳ next step - Logging (Loki): ❌ not implemented (intentional)