Session Summary — Alloy Monitor Rollout (2026-04-17)
Objective
Prepare zlh-monitor for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.
What Was Completed
1. Root / Codex Restructure
- Introduced Codex/ ownership model:
  - API / Portal / Agent now maintain their own state and open items
- Root docs updated:
  - README.md
  - SESSION_START.md
  - OPEN_THREADS.md
  - PROJECT_CONTEXT.md
- Enforced cleanup rule:
  - remove completed items
  - move durable behavior to CURRENT_STATE.md
  - keep root cross-repo only
2. Monitoring Strategy Alignment
Established current (2026) observability direction:
- Prometheus → metrics storage
- Grafana → visualization
- Loki → optional log storage (not required yet)
- Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)
Key decisions:
- Metrics-first rollout
- Do NOT introduce Loki yet
- Do NOT use Promtail or Grafana Agent
3. zlh-monitor Preparation
Created a clean, self-contained repo structure at:
/etc/zlh-monitor
Added:
- Alloy config (host metrics)
- Prometheus scrape snippet for Alloy
- Prometheus args snippet (remote write receiver)
- systemd service for Alloy
- Grafana datasource provisioning stub
- validation script
- README with audit + deployment notes
4. Live Host Integration (zlh-monitor)
Completed live wiring:
- Installed Alloy (v1.15.1)
- Installed and enabled systemd service
- Enabled Prometheus remote-write receiver
- Added Alloy scrape job to Prometheus
- Restarted Prometheus successfully
- Started Alloy successfully
Validation:
- up{job="alloy"} == 1
- Alloy /metrics endpoint reachable
- Host metrics confirmed via Alloy:
  - CPU
  - memory
  - filesystem
  - uname
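The up{job="alloy"} == 1 check can also be scripted against the Prometheus HTTP API. The sketch below is illustrative only and is not the validation script from the repo: the Prometheus address (http://localhost:9090) and the function names fetch_instant_query and targets_up are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed Prometheus address; adjust to the actual zlh-monitor setup.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_instant_query(expr, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run an instant query via the Prometheus HTTP API and return the decoded JSON."""
    query = urllib.parse.urlencode({"query": expr})
    url = f"{base_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def targets_up(response):
    """Map each series' instance label to True when its sample value equals 1."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    status = {}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "<no instance>")
        _timestamp, value = series["value"]  # the value arrives as a string, e.g. "1"
        status[instance] = float(value) == 1.0
    return status


# Hypothetical usage against a live Prometheus:
#   targets_up(fetch_instant_query('up{job="alloy"}'))
```

An empty result means Prometheus has no series for that job at all, which is a different failure from a target that reports 0, so it is worth treating the two cases separately.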
Important:
- Existing node-exporter was intentionally left running
- No Loki added
5. Current System State
Working:
- Alloy (host-level metrics)
- Prometheus (healthy)
- Grafana (unchanged, functional)
Known unrelated issues (not addressed in this session):
- cadvisor target down
- one node-exporter target down
- zlh_lxc HTTP discovery endpoint failing
6. Container Monitoring Test Plan
Defined next step:
Perform a single narrow container scrape test using Alloy
Approach:
- Do NOT modify containers
- Use existing node-exporter inside container
- Add one static scrape target in Alloy
- Validate ingestion in Prometheus
Manual test flow:
- Verify container exporter reachable (:9100)
- Add static Alloy scrape target
- Restart Alloy
- Confirm:
  - up{job="integrations/container-node"} == 1
  - container metrics present
- Inspect labels for future scaling
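Two steps of this flow lend themselves to small helpers: the reachability check on :9100 and the label inspection. The following is a minimal sketch, not part of the repo. The names exporter_reachable and label_values are hypothetical, and label_values assumes its input is the decoded JSON of a Prometheus instant query such as up{job="integrations/container-node"}.

```python
import socket
from collections import defaultdict


def exporter_reachable(host, port=9100, timeout=3.0):
    """Return True if a TCP connection to the exporter port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def label_values(response):
    """Collect the distinct values seen for each label across all returned series."""
    seen = defaultdict(set)
    for series in response["data"]["result"]:
        for name, value in series["metric"].items():
            seen[name].add(value)
    return {name: sorted(values) for name, values in seen.items()}
```

Listing the distinct values per label shows which labels would identify a container once more targets are added, which feeds into the labeling and scaling decisions.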
Key Architectural Decisions
- Alloy replaces per-node collectors (the node-exporter layer in this system); the existing node-exporter stays running for now
- Prometheus remains the metrics backend
- Grafana remains the UI
- Loki deferred until logs are needed
- Rollout strategy is incremental, not global
Next Step
Proceed with container test:
Add one Alloy scrape target for a running container and validate metrics ingestion.
After that:
- evaluate labeling strategy
- decide scaling approach (manual vs discovery)
Status
- Host monitoring: ✅ complete
- Alloy rollout: ✅ complete (host)
- Container monitoring: ⏳ next step
- Logging (Loki): ❌ not implemented (intentional)