Add session summary for Alloy monitor rollout and container test prep

jester 2026-04-17 14:45:17 +00:00
parent 7f90c8f5c8
commit c2c8d0c8d1

@@ -0,0 +1,148 @@
# Session Summary — Alloy Monitor Rollout (2026-04-17)
## Objective

Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.

---
## What Was Completed
### 1. Root / Codex Restructure
- Introduced `Codex/` ownership model:
  - API / Portal / Agent now maintain their own state and open items
- Root docs updated:
  - `README.md`
  - `SESSION_START.md`
  - `OPEN_THREADS.md`
  - `PROJECT_CONTEXT.md`
- Enforced cleanup rule:
  - remove completed items
  - move durable behavior to `CURRENT_STATE.md`
  - keep root cross-repo only
---
### 2. Monitoring Strategy Alignment
Established current (2026) observability direction:

- Prometheus → metrics storage
- Grafana → visualization
- Loki → optional log storage (not required yet)
- Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)

Key decisions:

- Metrics-first rollout
- Do NOT introduce Loki yet
- Do NOT use Promtail or Grafana Agent
---
### 3. zlh-monitor Preparation
Created a clean, self-contained repo structure at:

`/etc/zlh-monitor`

Added:

- Alloy config (host metrics)
- Prometheus scrape snippet for Alloy
- Prometheus args snippet (remote write receiver)
- systemd service for Alloy
- Grafana datasource provisioning stub
- validation script
- README with audit + deployment notes
---
### 4. Live Host Integration (zlh-monitor)
Completed live wiring:

- Installed Alloy (`v1.15.1`)
- Installed and enabled systemd service
- Enabled Prometheus remote-write receiver
- Added Alloy scrape job to Prometheus
- Restarted Prometheus successfully
- Started Alloy successfully

Validation:

- `up{job="alloy"} == 1`
- Alloy `/metrics` endpoint reachable
- Host metrics confirmed via Alloy:
  - CPU
  - memory
  - filesystem
  - uname

Important:

- Existing node-exporter was intentionally left running
- No Loki added
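
The `up{job="alloy"} == 1` validation above can be repeated from a script. The following is a minimal sketch rather than the repo's actual validation script: it assumes Prometheus is reachable at `http://localhost:9090` (an assumption, not stated in this summary) and uses the standard instant-query endpoint `/api/v1/query`. The function names `targets_up` and `query_up` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens here; adjust for the real host.
PROMETHEUS_URL = "http://localhost:9090"


def targets_up(response: dict) -> dict:
    """Map each returned series' instance label to True if its sample value is 1."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = {}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "<no instance>")
        _timestamp, value = series["value"]  # the sample value arrives as a string
        result[instance] = float(value) == 1.0
    return result


def query_up(job: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run the instant query up{job="<job>"} against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return targets_up(json.load(resp))
```

Calling `query_up("alloy")` returns one entry per scraped instance. An empty dict means Prometheus has no `up` series for that job at all, which is a different failure from a target that exists but reports `0`.
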
---
### 5. Current System State
Working:

- Alloy (host-level metrics)
- Prometheus (healthy)
- Grafana (unchanged, functional)

Known unrelated issues (not addressed in this session):

- cadvisor target down
- one node-exporter target down
- `zlh_lxc` HTTP discovery endpoint failing
---
### 6. Container Monitoring Test Plan
Defined next step:

> Perform a **single narrow container scrape test** using Alloy

Approach:

- Do NOT modify containers
- Use existing node-exporter inside container
- Add one static scrape target in Alloy
- Validate ingestion in Prometheus

Manual test flow:

1. Verify container exporter reachable (`:9100`)
2. Add static Alloy scrape target
3. Restart Alloy
4. Confirm:
   - `up{job="integrations/container-node"} == 1`
   - container metrics present
5. Inspect labels for future scaling
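
Step 1 of the flow above (exporter reachable on `:9100`) can be checked without touching the container. The sketch below fetches the exporter's `/metrics` page and looks for metric names with the `node_` prefix that node-exporter uses. The container address is whatever host or IP the test container has; this summary does not name one, so `host` is left as a parameter. The helper names are hypothetical.

```python
import urllib.request


def metric_names(exposition_text: str) -> set:
    """Collect metric names from Prometheus text-format output, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, # HELP, or # TYPE lines
            continue
        # A sample line is either `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def exporter_reachable(host: str, port: int = 9100, timeout: float = 5.0) -> bool:
    """Return True if http://host:port/metrics answers and exposes node_* metrics.

    Raises urllib.error.URLError if the endpoint cannot be reached at all.
    """
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return any(name.startswith("node_") for name in metric_names(text))
```

Running this from `zlh-monitor` itself, rather than from a workstation, tests the same network path that the Alloy scrape will use.
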
---
## Key Architectural Decisions
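- Alloy is the intended replacement for per-node collectors (the node-exporter layer in this system); the existing node-exporter stays running during the incremental rollout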
- Prometheus remains the metrics backend
- Grafana remains the UI
- Loki deferred until logs are needed
- Rollout strategy is incremental, not global
---
## Next Step
Proceed with container test:

> Add one Alloy scrape target for a running container and validate metrics ingestion.

After that:

- evaluate labeling strategy
- decide scaling approach (manual vs discovery)
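
If the discovery route is chosen, the endpoint's response format becomes part of the decision. Prometheus HTTP service discovery expects a JSON list of target groups, each with a `targets` list of `host:port` strings and an optional `labels` map of string values. The sketch below checks a response body against that shape. It is an illustration only: whether the failing `zlh_lxc` endpoint is meant to serve exactly this format is an assumption, and `validate_http_sd` is a hypothetical helper.

```python
import json


def validate_http_sd(payload: str) -> list:
    """Return a list of problems found in an HTTP SD response body (empty if none)."""
    try:
        groups = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(groups, list):
        return ["top level must be a JSON list of target groups"]

    problems = []
    for i, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"group {i}: must be an object")
            continue
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            problems.append(f"group {i}: 'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(isinstance(v, str) for v in labels.values()):
            problems.append(f"group {i}: 'labels' must map strings to strings")
    return problems
```

The labels returned per group are also where a labeling strategy would be applied, so the two follow-up items are closely linked: a discovery endpoint can attach per-container labels that a manual static target list would have to repeat by hand.
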
---
## Status
- Host monitoring: ✅ complete
- Alloy rollout: ✅ complete (host)
- Container monitoring: ⏳ next step
- Logging (Loki): ❌ not implemented (intentional)