Add session summary for Alloy monitor rollout and container test prep
parent 7f90c8f5c8
commit c2c8d0c8d1
Session_Summaries/2026-04-17-alloy-monitor-rollout.md (Normal file, 148 lines)
@@ -0,0 +1,148 @@
# Session Summary — Alloy Monitor Rollout (2026-04-17)

## Objective

Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.

---

## What Was Completed

### 1. Root / Codex Restructure

- Introduced `Codex/` ownership model:
  - API / Portal / Agent now maintain their own state and open items
- Root docs updated:
  - `README.md`
  - `SESSION_START.md`
  - `OPEN_THREADS.md`
  - `PROJECT_CONTEXT.md`
- Enforced cleanup rule:
  - remove completed items
  - move durable behavior to `CURRENT_STATE.md`
  - keep root cross-repo only

---

### 2. Monitoring Strategy Alignment

Established current (2026) observability direction:

- Prometheus → metrics storage
- Grafana → visualization
- Loki → optional log storage (not required yet)
- Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)

Key decisions:

- Metrics-first rollout
- Do NOT introduce Loki yet
- Do NOT use Promtail or Grafana Agent

---

### 3. zlh-monitor Preparation

Created a clean, self-contained repo structure at:

`/etc/zlh-monitor`

Added:

- Alloy config (host metrics)
- Prometheus scrape snippet for Alloy
- Prometheus args snippet (remote write receiver)
- systemd service for Alloy
- Grafana datasource provisioning stub
- validation script
- README with audit + deployment notes

---

### 4. Live Host Integration (zlh-monitor)

Completed live wiring:

- Installed Alloy (`v1.15.1`)
- Installed and enabled systemd service
- Enabled Prometheus remote-write receiver
- Added Alloy scrape job to Prometheus
- Restarted Prometheus successfully
- Started Alloy successfully

Validation:

- `up{job="alloy"} == 1`
- Alloy `/metrics` endpoint reachable
- Host metrics confirmed via Alloy:
  - CPU
  - memory
  - filesystem
  - uname

Important:

- Existing node-exporter was intentionally left running
- No Loki added
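
The `up{job="alloy"} == 1` check above can also be scripted. The following is a minimal Python sketch, not the validation script added to the repo: it assumes Prometheus serves its standard HTTP API at a hypothetical address such as `http://localhost:9090`, and the helper names `instant_query` and `all_targets_up` are illustrative only.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url: str, expr: str, timeout: float = 5.0) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the decoded JSON."""
    # base_url is an assumption, e.g. "http://localhost:9090"; adjust to the real listener.
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def all_targets_up(response: dict) -> bool:
    """True only if the query succeeded, matched at least one series, and every sample is 1."""
    if response.get("status") != "success":
        return False
    series = response.get("data", {}).get("result", [])
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    return bool(series) and all(float(s["value"][1]) == 1.0 for s in series)
```

Calling `all_targets_up(instant_query("http://localhost:9090", 'up{job="alloy"}'))` would then return `True` only when every matching target reports `1`. An empty result is treated as a failure, since a missing scrape job should not pass validation.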

---

### 5. Current System State

Working:

- Alloy (host-level metrics)
- Prometheus (healthy)
- Grafana (unchanged, functional)

Known unrelated issues (not addressed in this session):

- cadvisor target down
- one node-exporter target down
- `zlh_lxc` HTTP discovery endpoint failing

---

### 6. Container Monitoring Test Plan

Defined next step:

> Perform a **single narrow container scrape test** using Alloy

Approach:

- Do NOT modify containers
- Use existing node-exporter inside container
- Add one static scrape target in Alloy
- Validate ingestion in Prometheus

Manual test flow:

1. Verify container exporter reachable (`:9100`)
2. Add static Alloy scrape target
3. Restart Alloy
4. Confirm:
   - `up{job="integrations/container-node"} == 1`
   - container metrics present
5. Inspect labels for future scaling
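
Step 1 of the flow above (checking that the container's exporter answers on `:9100`) could be sketched in Python as follows. This is an illustration under assumptions, not part of the session's tooling: the container address is whatever the operator supplies, and the helper names `port_open`, `fetch_metrics`, and `metric_names` are hypothetical.

```python
import socket
import urllib.request


def port_open(host: str, port: int = 9100, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_metrics(host: str, port: int = 9100, timeout: float = 5.0) -> str:
    """Fetch the exporter's /metrics page as text (host is supplied by the operator)."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format output, skipping comments."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The sample name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names
```

If `port_open` succeeds, passing the output of `fetch_metrics` to `metric_names` gives a quick view of which `node_*` metrics the container exposes before any Alloy scrape target is added.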

---

## Key Architectural Decisions

- Alloy replaces per-node collectors (node-exporter layer in this system)
- Prometheus remains the metrics backend
- Grafana remains the UI
- Loki deferred until logs are needed
- Rollout strategy is incremental, not global

---

## Next Step

Proceed with container test:

> Add one Alloy scrape target for a running container and validate metrics ingestion.

After that:

- evaluate labeling strategy
- decide scaling approach (manual vs discovery)
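
One possible aid for the labeling evaluation is to summarize which labels actually vary across the ingested series. The Python sketch below assumes its input is the `data.result` list from a Prometheus instant-query response; the function name `label_summary` is hypothetical and nothing here reflects a decision made in this session.

```python
from collections import defaultdict


def label_summary(series: list[dict]) -> dict[str, set[str]]:
    """Map each label name to the set of distinct values seen across the series."""
    summary = defaultdict(set)
    for s in series:
        for name, value in s.get("metric", {}).items():
            summary[name].add(value)
    return dict(summary)
```

Labels whose value set has a single entry cannot tell containers apart, while labels that differ per target (typically `instance`) are candidates for the eventual scaling approach.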

---

## Status

- Host monitoring: ✅ complete
- Alloy rollout: ✅ complete (host)
- Container monitoring: ⏳ next step
- Logging (Loki): ❌ not implemented (intentional)