zlh-grind/Session_Summaries/2026-04-17-alloy-monitor-rollout.md

3.3 KiB

Session Summary — Alloy Monitor Rollout (2026-04-17)

Objective

Prepare zlh-monitor for a modern Alloy-based observability stack and validate readiness for container-level metrics testing.


What Was Completed

1. Root / Codex Restructure

  • Introduced Codex/ ownership model:
    • API / Portal / Agent now maintain their own state and open items
  • Root docs updated:
    • README.md
    • SESSION_START.md
    • OPEN_THREADS.md
    • PROJECT_CONTEXT.md
  • Enforced cleanup rule:
    • remove completed items
    • move durable behavior to CURRENT_STATE.md
    • keep root cross-repo only

2. Monitoring Strategy Alignment

Established current (2026) observability direction:

  • Prometheus → metrics storage
  • Grafana → visualization
  • Loki → optional log storage (not required yet)
  • Grafana Alloy → unified collector (replaces node-exporter/promtail/agent layer)

Key decisions:

  • Metrics-first rollout
  • Do NOT introduce Loki yet
  • Do NOT use Promtail or Grafana Agent

3. zlh-monitor Preparation

Created a clean, self-contained repo structure at:

/etc/zlh-monitor

Added:

  • Alloy config (host metrics)
  • Prometheus scrape snippet for Alloy
  • Prometheus args snippet (remote write receiver)
  • systemd service for Alloy
  • Grafana datasource provisioning stub
  • validation script
  • README with audit + deployment notes

4. Live Host Integration (zlh-monitor)

Completed live wiring:

  • Installed Alloy (v1.15.1)
  • Installed and enabled systemd service
  • Enabled Prometheus remote-write receiver
  • Added Alloy scrape job to Prometheus
  • Restarted Prometheus successfully
  • Started Alloy successfully

Validation:

  • up{job="alloy"} == 1
  • Alloy /metrics endpoint reachable
  • Host metrics confirmed via Alloy:
    • CPU
    • memory
    • filesystem
    • uname

Important:

  • Existing node-exporter was intentionally left running
  • No Loki added

5. Current System State

Working:

  • Alloy (host-level metrics)
  • Prometheus (healthy)
  • Grafana (unchanged, functional)

Known unrelated issues (not addressed in this session):

  • cadvisor target down
  • one node-exporter target down
  • zlh_lxc HTTP discovery endpoint failing

6. Container Monitoring Test Plan

Defined next step:

Perform a single narrow container scrape test using Alloy

Approach:

  • Do NOT modify containers
  • Use existing node-exporter inside container
  • Add one static scrape target in Alloy
  • Validate ingestion in Prometheus

Manual test flow:

  1. Verify container exporter reachable (:9100)
  2. Add static Alloy scrape target
  3. Restart Alloy
  4. Confirm:
    • up{job="integrations/container-node"} == 1
    • container metrics present
  5. Inspect labels for future scaling

Key Architectural Decisions

  • Alloy replaces per-node collectors (node-exporter layer in this system)
  • Prometheus remains the metrics backend
  • Grafana remains the UI
  • Loki deferred until logs are needed
  • Rollout strategy is incremental, not global

Next Step

Proceed with container test:

Add one Alloy scrape target for a running container and validate metrics ingestion.

After that:

  • evaluate labeling strategy
  • decide scaling approach (manual vs discovery)

Status

  • Host monitoring: complete
  • Alloy rollout: complete (host)
  • Container monitoring: next step
  • Logging (Loki): not implemented (intentional)