jester c2c8d0c8d1 Add session summary for Alloy monitor rollout and container test prep

2026-04-17 14:45:17 +00:00

3.3 KiB

Raw Blame History

Session Summary — Alloy Monitor Rollout (2026-04-17)

Objective

Prepare zlh-monitor for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.

What Was Completed

1. Root / Codex Restructure

Introduced Codex/ ownership model:
- API / Portal / Agent now maintain their own state and open items
Root docs updated:
- README.md
- SESSION_START.md
- OPEN_THREADS.md
- PROJECT_CONTEXT.md
Enforced cleanup rule:
- remove completed items
- move durable behavior to CURRENT_STATE.md
- keep root cross-repo only

2. Monitoring Strategy Alignment

Established current (2026) observability direction:

Prometheus → metrics storage
Grafana → visualization
Loki → optional log storage (not required yet)
Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)

Key decisions:

Metrics-first rollout
Do NOT introduce Loki yet
Do NOT use Promtail or Grafana Agent

3. zlh-monitor Preparation

Created a clean, self-contained repo structure at:

/etc/zlh-monitor

Added:

Alloy config (host metrics)
Prometheus scrape snippet for Alloy
Prometheus args snippet (remote write receiver)
systemd service for Alloy
Grafana datasource provisioning stub
validation script
README with audit + deployment notes

4. Live Host Integration (zlh-monitor)

Completed live wiring:

Installed Alloy (v1.15.1)
Installed and enabled systemd service
Enabled Prometheus remote-write receiver
Added Alloy scrape job to Prometheus
Restarted Prometheus successfully
Started Alloy successfully

Validation:

up{job="alloy"} == 1
Alloy /metrics endpoint reachable
Host metrics confirmed via Alloy:
- CPU
- memory
- filesystem
- uname

Important:

Existing node-exporter was intentionally left running
No Loki added

5. Current System State

Working:

Alloy (host-level metrics)
Prometheus (healthy)
Grafana (unchanged, functional)

Known unrelated issues (not addressed in this session):

cadvisor target down
one node-exporter target down
zlh_lxc HTTP discovery endpoint failing

6. Container Monitoring Test Plan

Defined next step:

Perform a single narrow container scrape test using Alloy

Approach:

Do NOT modify containers
Use existing node-exporter inside container
Add one static scrape target in Alloy
Validate ingestion in Prometheus

Manual test flow:

Verify container exporter reachable (:9100)
Add static Alloy scrape target
Restart Alloy
Confirm:
- up{job="integrations/container-node"} == 1
- container metrics present
Inspect labels for future scaling

Key Architectural Decisions

Alloy replaces per-node collectors (node-exporter layer in this system)
Prometheus remains the metrics backend
Grafana remains the UI
Loki deferred until logs are needed
Rollout strategy is incremental, not global

Next Step

Proceed with container test:

Add one Alloy scrape target for a running container and validate metrics ingestion.

After that:

evaluate labeling strategy
decide scaling approach (manual vs discovery)

Status

Host monitoring: ✅ complete
Alloy rollout: ✅ complete (host)
Container monitoring: ⏳ next step
Logging (Loki): ❌ not implemented (intentional)

3.3 KiB Raw Blame History