Add session summary for Alloy monitor rollout and container test prep
parent 7f90c8f5c8
commit c2c8d0c8d1
148
Session_Summaries/2026-04-17-alloy-monitor-rollout.md
Normal file
@@ -0,0 +1,148 @@
# Session Summary — Alloy Monitor Rollout (2026-04-17)

## Objective

Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.

---

## What Was Completed

### 1. Root / Codex Restructure

- Introduced `Codex/` ownership model:
  - API / Portal / Agent now maintain their own state and open items
- Root docs updated:
  - `README.md`
  - `SESSION_START.md`
  - `OPEN_THREADS.md`
  - `PROJECT_CONTEXT.md`
- Enforced cleanup rule:
  - remove completed items
  - move durable behavior to `CURRENT_STATE.md`
  - keep root cross-repo only

---

### 2. Monitoring Strategy Alignment

Established current (2026) observability direction:

- Prometheus → metrics storage
- Grafana → visualization
- Loki → optional log storage (not required yet)
- Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)

Key decisions:
- Metrics-first rollout
- Do NOT introduce Loki yet
- Do NOT use Promtail or Grafana Agent

---

### 3. zlh-monitor Preparation

Created a clean, self-contained repo structure at:

`/etc/zlh-monitor`

Added:
- Alloy config (host metrics)
- Prometheus scrape snippet for Alloy
- Prometheus args snippet (remote write receiver)
- systemd service for Alloy
- Grafana datasource provisioning stub
- validation script
- README with audit + deployment notes

---

### 4. Live Host Integration (zlh-monitor)

Completed live wiring:

- Installed Alloy (`v1.15.1`)
- Installed and enabled systemd service
- Enabled Prometheus remote-write receiver
- Added Alloy scrape job to Prometheus
- Restarted Prometheus successfully
- Started Alloy successfully

Validation:
- `up{job="alloy"} == 1`
- Alloy `/metrics` endpoint reachable
- Host metrics confirmed via Alloy:
  - CPU
  - memory
  - filesystem
  - uname

Important:
- Existing node-exporter was intentionally left running
- No Loki added
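
The `up{job="alloy"} == 1` check listed under Validation could be automated roughly as follows. This is a minimal sketch, not the contents of the repo's validation script: it assumes Prometheus listens on its default `http://localhost:9090`, and the function names are hypothetical. It relies only on the standard Prometheus instant-query endpoint (`/api/v1/query`).

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable on its default port on the monitor host.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{query}"


def targets_up(response: dict) -> bool:
    """Return True if the query returned at least one series and every value is 1."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each instant-vector sample looks like {"metric": {...}, "value": [timestamp, "1"]}.
    return bool(results) and all(float(r["value"][1]) == 1.0 for r in results)


def check_job_up(job: str, base_url: str = PROMETHEUS_URL) -> bool:
    """Query Prometheus for up{job="<job>"} and report whether all targets are up."""
    url = instant_query_url(base_url, f'up{{job="{job}"}}')
    with urllib.request.urlopen(url, timeout=5) as resp:
        return targets_up(json.load(resp))
```

Running `check_job_up("alloy")` on the monitor host would then correspond to the first validation bullet. An empty result is treated as a failure, so a missing scrape job is not mistaken for a healthy one.
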

---

### 5. Current System State

Working:
- Alloy (host-level metrics)
- Prometheus (healthy)
- Grafana (unchanged, functional)

Known unrelated issues (not addressed in this session):
- cadvisor target down
- one node-exporter target down
- `zlh_lxc` HTTP discovery endpoint failing
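
For later triage of the down targets listed above, one option is to read them from Prometheus's targets endpoint (`/api/v1/targets`). The sketch below is illustrative only: the base URL is an assumed default and the function names are hypothetical.

```python
import json
import urllib.request


def down_targets(targets_response: dict) -> list[tuple[str, str, str]]:
    """Return (job, scrape URL, last error) for every active target not reporting 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            t.get("labels", {}).get("job", ""),
            t.get("scrapeUrl", ""),
            t.get("lastError", ""),
        )
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from Prometheus (assumed default address)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Calling `down_targets(fetch_targets())` returns one tuple per unhealthy target, which keeps the unrelated failures visible without touching their configuration.
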

---

### 6. Container Monitoring Test Plan

Defined next step:

> Perform a **single narrow container scrape test** using Alloy

Approach:
- Do NOT modify containers
- Use existing node-exporter inside container
- Add one static scrape target in Alloy
- Validate ingestion in Prometheus

Manual test flow:
1. Verify container exporter reachable (`:9100`)
2. Add static Alloy scrape target
3. Restart Alloy
4. Confirm:
   - `up{job="integrations/container-node"} == 1`
   - container metrics present
5. Inspect labels for future scaling
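
Step 1 of the flow above, plus a quick look at which metrics the container exporter exposes, could be scripted along these lines. This is a sketch under assumptions: the exporter serves the standard Prometheus text format at `/metrics` on port 9100, the container address is supplied by the operator, and all function names are hypothetical.

```python
import urllib.request


def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format exposition output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is either `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def exporter_looks_healthy(exposition: str, required_prefix: str = "node_") -> bool:
    """Return True if at least one metric with the expected prefix is present."""
    return any(name.startswith(required_prefix) for name in metric_names(exposition))


def fetch_exporter_metrics(host: str, port: int = 9100) -> str:
    """Fetch the raw /metrics payload from the exporter inside the container."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")
```

With the container's address substituted for `host`, `exporter_looks_healthy(fetch_exporter_metrics(host))` covers the reachability check before any Alloy change is made. The `up{job="integrations/container-node"}` confirmation in step 4 remains a Prometheus query run after Alloy restarts.
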

---

## Key Architectural Decisions

- Alloy replaces per-node collectors (node-exporter layer in this system)
- Prometheus remains the metrics backend
- Grafana remains the UI
- Loki deferred until logs are needed
- Rollout strategy is incremental, not global

---

## Next Step

Proceed with container test:

> Add one Alloy scrape target for a running container and validate metrics ingestion.

After that:
- evaluate labeling strategy
- decide scaling approach (manual vs discovery)
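
To make the manual-versus-discovery trade-off concrete, the sketch below builds target groups in the JSON shape used by Prometheus HTTP service discovery (a list of objects with `targets` and `labels`). It is an illustration, not a decision: the container names, addresses, and the `container` and `env` label names are hypothetical, and nothing here reflects how the existing `zlh_lxc` endpoint is implemented.

```python
import json


def target_groups(containers: dict[str, str], port: int = 9100, extra_labels=None) -> list[dict]:
    """Build one HTTP SD target group per container, labeled by container name."""
    groups = []
    for name, address in sorted(containers.items()):
        labels = {"container": name}  # hypothetical label name
        labels.update(extra_labels or {})
        groups.append({"targets": [f"{address}:{port}"], "labels": labels})
    return groups


def render_http_sd(containers: dict[str, str], port: int = 9100, extra_labels=None) -> str:
    """Serialize the target groups as the JSON body a discovery endpoint would return."""
    return json.dumps(target_groups(containers, port, extra_labels), indent=2)
```

One group per container keeps labels independent per target, which makes it easier to compare candidate labeling schemes before committing to static targets or a discovery endpoint.
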

---

## Status

- Host monitoring: ✅ complete
- Alloy rollout: ✅ complete (host)
- Container monitoring: ⏳ next step
- Logging (Loki): ❌ not implemented (intentional)