# Session Summary — Alloy Monitor Rollout (2026-04-17)

## Objective
Prepare `zlh-monitor` for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.

---

## What Was Completed

### 1. Root / Codex Restructure
- Introduced `Codex/` ownership model:
  - API / Portal / Agent now maintain their own state and open items
- Root docs updated:
  - `README.md`
  - `SESSION_START.md`
  - `OPEN_THREADS.md`
  - `PROJECT_CONTEXT.md`
- Enforced cleanup rule:
  - remove completed items
  - move durable behavior to `CURRENT_STATE.md`
  - keep root cross-repo only

---

### 2. Monitoring Strategy Alignment

Established current (2026) observability direction:

- Prometheus → metrics storage
- Grafana → visualization
- Loki → optional log storage (not required yet)
- Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)

Key decisions:
- Metrics-first rollout
- Do NOT introduce Loki yet
- Do NOT use Promtail or Grafana Agent

---

### 3. zlh-monitor Preparation

Created a clean, self-contained repo structure at:

`/etc/zlh-monitor`

Added:
- Alloy config (host metrics)
- Prometheus scrape snippet for Alloy
- Prometheus args snippet (remote write receiver)
- systemd service for Alloy
- Grafana datasource provisioning stub
- validation script
- README with audit + deployment notes

---

### 4. Live Host Integration (zlh-monitor)

Completed live wiring:

- Installed Alloy (`v1.15.1`)
- Installed and enabled systemd service
- Enabled Prometheus remote-write receiver
- Added Alloy scrape job to Prometheus
- Restarted Prometheus successfully
- Started Alloy successfully

Validation:
- `up{job="alloy"} == 1`
- Alloy `/metrics` endpoint reachable
- Host metrics confirmed via Alloy:
  - CPU
  - memory
  - filesystem
  - uname

Important:
- Existing node-exporter was intentionally left running
- No Loki added

---

### 5. Current System State

Working:
- Alloy (host-level metrics)
- Prometheus (healthy)
- Grafana (unchanged, functional)

Known unrelated issues (not addressed in this session):
- cadvisor target down
- one node-exporter target down
- `zlh_lxc` HTTP discovery endpoint failing

---

### 6. Container Monitoring Test Plan

Defined next step:

> Perform a **single narrow container scrape test** using Alloy

Approach:
- Do NOT modify containers
- Use existing node-exporter inside container
- Add one static scrape target in Alloy
- Validate ingestion in Prometheus

Manual test flow:
1. Verify container exporter reachable (`:9100`)
2. Add static Alloy scrape target
3. Restart Alloy
4. Confirm:
   - `up{job="integrations/container-node"} == 1`
   - container metrics present
5. Inspect labels for future scaling

---

## Key Architectural Decisions

- Alloy replaces per-node collectors (node-exporter layer in this system)
- Prometheus remains the metrics backend
- Grafana remains the UI
- Loki deferred until logs are needed
- Rollout strategy is incremental, not global

---

## Next Step

Proceed with container test:

> Add one Alloy scrape target for a running container and validate metrics ingestion.

After that:
- evaluate labeling strategy
- decide scaling approach (manual vs discovery)

---

## Status

- Host monitoring: ✅ complete
- Alloy rollout: ✅ complete (host)
- Container monitoring: ⏳ next step
- Logging (Loki): ❌ not implemented (intentional)