zlh-grind/Session_Summaries/2026-05-03-autonomy-controller-handoff.md


ZLH Architecture Session Summary — Autonomy, Controller, and Launch Order

Date: 2026-05-03

Context

James works full time and operates ZLH solo with no secondary operator. The platform must be autonomous enough to self-heal without human intervention for routine failures. This is a hard requirement, not a nice-to-have.

Decisions Made

1. Provisioning Worker — Build First

Move provisionAgentInstance() out of the API path into a BullMQ worker. The current implementation has a 10-minute polling loop inside the HTTP request lifecycle and should not remain in the request/response path.

Implementation:

- Create src/queues/provisioning.js
- Add npm run worker:provision script
- POST /api/servers/create validates auth/billing/quota, enqueues job, returns 202
- Worker calls existing provisionAgentInstance() logic
- Deploy as separate systemd service on same VM: zpack-provision-worker.service
- Phase 1: one queue, concurrency 1
- Do not split game/dev yet

Additional requirements:

- Error cleanup must still run in the worker context, not only in the old HTTP path; because BullMQ retries must not re-run partially completed jobs, each provisioning step must be idempotent or checkpointed
- confirmVmidAllocated must be atomic with the DB persist to prevent VMID leaks
- Add a deduplication key on the queue, derived from customerId + ctype + requestNonce, to prevent duplicate provisioning from double Portal submissions

2. Controller / Reconciler — Build at Step 3, Not Post-Launch

Given solo operation requirements, the controller is a launch requirement. It is what makes billing, provisioning, and edge state reliable in production.

Build order:

1. Provisioning worker + operation leases
2. Minimal controller (observe + Level 1 repairs)
3. Billing webhook live
4. Expand controller to workload-level repairs

Minimal controller scope for launch:

Every 60 seconds:
- Clear stale operation locks (heartbeat expired, no active BullMQ job)
- Detect agent offline (LXC running, agent unreachable) -> restart agent service
- Detect missing Velocity registration -> re-register automatically
- Detect missing DNS record -> republish automatically
- Detect stuck provisioning jobs (enqueued > 15 min, no progress) -> fail cleanly, release VMID
- Billing suspended + server running -> stop workload
- Backup missed -> notify only, do not auto-repair
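The pass above can be kept unit-testable by separating observation and side effects: one pure function maps observed state to a list of repair actions, and the loop executes them. A sketch, assuming an observed-state object with illustrative field names:

```javascript
// Pure decision function for the 60s reconcile pass. Field and action
// names are assumptions; the point is that the repair policy can be
// tested without touching Proxmox, Redis, or DNS.
function planRepairs(s) {
  const actions = [];
  if (s.lockHeld && s.heartbeatExpired && !s.activeBullJob)
    actions.push("clear-stale-lock");
  if (s.lxcRunning && !s.agentReachable) actions.push("restart-agent");
  if (s.agentReachable && !s.velocityRegistered) actions.push("reregister-velocity");
  if (s.agentReachable && !s.dnsRecordPresent) actions.push("republish-dns");
  if (s.provisioningEnqueuedMin > 15 && !s.provisioningProgress)
    actions.push("fail-provision-release-vmid");
  if (s.billingSuspended && s.workloadRunning) actions.push("stop-workload");
  if (s.backupMissed) actions.push("notify-backup-missed"); // notify only, no auto-repair
  return actions;
}
```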

Controller must be a singleton with a Redis-backed distributed lock. If two controller instances run simultaneously there is split-brain repair risk. Wire this from day one.

Repair policy levels — only Level 0 and Level 1 enabled at launch:

Level 0: observe and log only
Level 1: stale lock clear, edge republish, DNS republish, Velocity re-register
Level 2: agent service restart, workload restart (enable after proven stable)
Level 3: restore backup, rebuild container — never automatic, admin only

Circuit breakers required from day one:

Max 3 repair attempts per server per hour
Max 1 destructive candidate per server until admin review
Global repair concurrency limit
Per-host repair concurrency limit
No repair during active backup/restore/maintenance lock
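The level gate and the per-server attempt limit compose into one guard that runs before any repair executes. A sketch under the policy above, with illustrative action names and an attempt history passed in as timestamps:

```javascript
// Repair level per action, mirroring the policy table: only levels 0-1
// auto-enabled at launch; level 2 opt-in later; level 3 never automatic.
const REPAIR_LEVEL = {
  "clear-stale-lock": 1,
  "republish-dns": 1,
  "reregister-velocity": 1,
  "restart-agent": 2,
  "restart-workload": 2,
  "restore-backup": 3,
};

function shouldAutoRepair(action, attemptTimesMs, nowMs, opts = {}) {
  const { maxLevel = 1, maxPerHour = 3, maintenanceLock = false } = opts;
  if (maintenanceLock) return false; // never repair during backup/restore/maintenance
  const level = REPAIR_LEVEL[action] ?? 3; // unknown action: treat as destructive
  if (level > maxLevel) return false;
  const recent = attemptTimesMs.filter((t) => nowMs - t < 3600000);
  return recent.length < maxPerHour; // max 3 attempts per server per hour
}
```

The global and per-host concurrency limits sit outside this per-server guard, in the repair queue's worker configuration.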

Placement:

src/controllers/reconciler.js in zpack-api repo
Deployed as zlh-controller.service

Queue structure:

provisioning queue  -> create servers
operation queue     -> start/stop/restart/backup/restore
repair queue        -> fix drift
reconcile loop      -> scheduled 60s health check

3. Discord Notification Channel

James is creating a dedicated ZLH Discord server. Wire Discord webhook notifications into the controller as src/services/discord.js.

Channel structure:

#alerts-critical     -> repair failed 3x, data at risk — always ping
#alerts-repairs      -> controller acted automatically — no ping
#alerts-billing      -> payment events, webhook failures — ping
#alerts-provisioning -> server create/fail — ping on failure
#system-health       -> daily digest, backup status, queue depth — no ping
#audit-log           -> full timestamped controller action history — no ping

Notification rules:

Level 1 repair succeeded     -> #alerts-repairs, no ping
Level 2 repair succeeded     -> #alerts-repairs, ping if outside business hours
Repair failed 3x             -> #alerts-critical, always ping
Provisioning job failed      -> #alerts-provisioning, always ping
Billing webhook missed       -> #alerts-billing, always ping
Backup missed 2x consecutive -> #alerts-critical, always ping
Daily health digest          -> #system-health, no ping

Implementation:

Discord webhook per channel
HTTP POST only
No bot required at this stage
One function per alert level in the notification service
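The routing rules above are table-shaped, so src/services/discord.js can hold them as data rather than branching code. A sketch with event names of my own choosing, and business hours assumed to be 09:00-17:00 (the summary does not define them):

```javascript
// Route table: one webhook URL per channel, plain HTTP POST of { content },
// no bot. Ping is a predicate on hour-of-day for the business-hours rule.
const ROUTES = {
  "repair-l1-ok":           { channel: "#alerts-repairs",      ping: () => false },
  "repair-l2-ok":           { channel: "#alerts-repairs",      ping: (h) => h < 9 || h >= 17 },
  "repair-failed-3x":       { channel: "#alerts-critical",     ping: () => true },
  "provision-failed":       { channel: "#alerts-provisioning", ping: () => true },
  "billing-webhook-missed": { channel: "#alerts-billing",      ping: () => true },
  "backup-missed-2x":       { channel: "#alerts-critical",     ping: () => true },
  "daily-digest":           { channel: "#system-health",       ping: () => false },
};

function routeAlert(event, hourOfDay) {
  const r = ROUTES[event];
  if (!r) return { channel: "#alerts-critical", ping: true }; // unknown event: escalate
  return { channel: r.channel, ping: r.ping(hourOfDay) };
}
```

Escalating unknown events to #alerts-critical is a deliberate fail-loud default for a solo operator.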

4. Console / Terminal — No Stack Change

xterm.js + creack/pty + gorilla/websocket is the correct stack. No language or library change is warranted.

Remaining console flake is a Portal-side WebSocket state machine problem. Fix required before launch:

- Add a 10-15 second connect timeout in TerminalView
- Suspend timeout if operationInProgress is true
- Derive isStreaming from connectionStatus rather than setting it independently
- Error handler must close socket, clear socketRef, allow reconnect
- API websocket bridge must propagate agent-side close events to browser — verify this before assuming fix is frontend-only
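One way to make those four fixes enforceable is to model the Portal connection as a pure reducer, with isStreaming derived from the status rather than stored separately. This is a sketch of the intended state machine, not the existing TerminalView code; event and field names are assumptions:

```javascript
const CONNECT_TIMEOUT_MS = 12000; // within the 10-15s window above

// Pure state machine: status is the single source of truth.
function reduce(state, event) {
  switch (event.type) {
    case "connect":
      return { ...state, status: "connecting", connectedAt: event.now };
    case "open":
      return { ...state, status: "streaming" };
    case "tick":
      // Fail a hung connect after the timeout, unless an operation
      // (restart, backup) legitimately delays the agent side.
      if (
        state.status === "connecting" &&
        !state.operationInProgress &&
        event.now - state.connectedAt > CONNECT_TIMEOUT_MS
      )
        return { ...state, status: "error" };
      return state;
    case "close":
    case "error": // caller closes the socket, clears socketRef, may reconnect
      return { ...state, status: "disconnected" };
    default:
      return state;
  }
}

const isStreaming = (state) => state.status === "streaming";
```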

Post-launch improvement:

Add console_ready / terminal_attached control frame from Agent so Portal knows the PTY is actually attached, not just that the browser-to-API socket opened.

5. Patch / Update Management

Proxmox host:
  manual, monthly, after PBS snapshot

Infrastructure VMs:
  unattended-upgrades, security pocket only, no auto-reboot
  blacklist manually managed packages such as Node.js, Caddy, Traefik

Customer LXC containers:
  update-lxcs.sh from Proxmox VE community scripts, run from Proxmox shell

Application dependencies:
  add Renovate (self-hosted) post-launch for automated dependency CVE PRs
  add to OPEN_THREADS
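For the infrastructure VMs, the policy above could translate into an apt configuration fragment along these lines (a sketch assuming Debian/Ubuntu unattended-upgrades; the exact blacklist entries depend on which apt package names the Node.js/Caddy/Traefik installs actually use, e.g. NodeSource ships `nodejs`):

```
// /etc/apt/apt.conf.d/50unattended-upgrades (fragment)
Unattended-Upgrade::Allowed-Origins {
        "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Package-Blacklist {
        "nodejs";
        "caddy";
        "traefik";
};
Unattended-Upgrade::Automatic-Reboot "false";
```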

6. API / Portal Deployment

Keep API and Portal as separate VMs with separate systemd services. Do not combine them into one container: they have independent failure modes and deployment cadences, and the boundary between them is where auth and ownership checks live.

Both should be migrated from pm2 to systemd services. pm2 is not wrong, but systemd aligns with the rest of the stack, gives journald integration for Loki/Alloy, and removes one process supervisor.

Updated Pre-Launch Blocker Order

1. Provisioning worker + operation leases
2. Minimal controller + Discord notifications
3. Billing webhook delivery
4. Fabric readiness gating (agent TCP probe gate)
5. User onboarding flow
6. Console WebSocket state machine fix
7. Usage limits / quota enforcement
8. Email notifications
9. Stress testing
10. OPNsense audit
11. Final smoke test

What Does Not Change

- DEC-001 through DEC-011 all remain locked
- API orchestrates, Agent executes
- Frontend never calls Agent directly
- Redis never restored from backup
- Vanilla uses Fabric loader + FabricProxy-Lite

Notes for next session

Treat the controller/reconciler as launch architecture, not post-launch polish. The purpose is to make the platform survivable while James is unavailable during normal work hours. The first controller should be conservative: observe, clear stale locks, repair edge drift, and notify/escalate when safe automatic repair is not possible.