Add autonomy controller handoff summary
parent 3d462f264d, commit fa575c45f4
Session_Summaries/2026-05-03-autonomy-controller-handoff.md (new file, 206 lines)
# ZLH Architecture Session Summary — Autonomy, Controller, and Launch Order

Date: 2026-05-03

## Context

James works full time and operates ZLH solo with no secondary operator. The platform must be autonomous enough to self-heal without human intervention for routine failures. This is a hard requirement, not a nice-to-have.

## Decisions Made

### 1. Provisioning Worker — Build First
Move `provisionAgentInstance()` out of the API path into a BullMQ worker. The current implementation has a 10-minute polling loop inside the HTTP request lifecycle and should not remain in the request/response path.

Implementation:

```text
- Create src/queues/provisioning.js
- Add npm run worker:provision script
- POST /api/servers/create validates auth/billing/quota, enqueues job, returns 202
- Worker calls existing provisionAgentInstance() logic
- Deploy as separate systemd service on same VM: zpack-provision-worker.service
- Phase 1: one queue, concurrency 1
- Do not split game/dev yet
```
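A minimal sketch of the enqueue/worker split listed above, assuming BullMQ and a local Redis connection. `provisionAgentInstance()` is the existing function; the route shown in comments stands in for the real Express handler and its auth/billing/quota checks.

```js
// src/queues/provisioning.js (sketch, not the final implementation)
import { Queue, Worker } from 'bullmq';

const connection = { host: '127.0.0.1', port: 6379 }; // assumed local Redis

// Phase 1: a single queue, concurrency 1; game/dev split comes later.
export const provisioningQueue = new Queue('provisioning', { connection });

// Started via `npm run worker:provision`, deployed as zpack-provision-worker.service.
export function startProvisioningWorker(provisionAgentInstance) {
  const worker = new Worker(
    'provisioning',
    // The existing provisioning logic (including its 10-minute polling loop)
    // now runs here instead of inside the HTTP request lifecycle.
    async (job) => provisionAgentInstance(job.data),
    { connection, concurrency: 1 }
  );

  worker.on('failed', (job, err) => {
    // Cleanup/notification hook; must tolerate partially completed jobs.
    console.error(`provisioning job ${job?.id} failed:`, err);
  });

  return worker;
}

// In the API route (illustrative only): validate, enqueue, return 202 immediately.
//
// app.post('/api/servers/create', async (req, res) => {
//   await validateAuthBillingQuota(req);                       // existing checks
//   const job = await provisioningQueue.add('provision', req.body);
//   res.status(202).json({ jobId: job.id, status: 'queued' });
// });
```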
Additional requirements:

```text
- Error cleanup must survive worker context; BullMQ retry behavior must not re-run partially completed jobs
- confirmVmidAllocated must be atomic with DB persist to prevent VMID leaks
- Add deduplication key on queue keyed to customerId + ctype + requestNonce to prevent duplicate provisioning from double Portal submissions
```
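For the deduplication key, one option is BullMQ's custom `jobId`: an `add()` whose id already exists is ignored while the original job is still stored, so a double Portal submission collapses onto one job. A sketch; the retry and retention options are assumptions, not decisions from this session.

```js
import { Queue } from 'bullmq';

const provisioningQueue = new Queue('provisioning', {
  connection: { host: '127.0.0.1', port: 6379 }, // assumed local Redis
});

// Deterministic id: a repeated submission with the same nonce maps to the same job.
function provisioningJobId({ customerId, ctype, requestNonce }) {
  return `provision:${customerId}:${ctype}:${requestNonce}`;
}

export async function enqueueProvisioning(payload) {
  return provisioningQueue.add('provision', payload, {
    jobId: provisioningJobId(payload),
    attempts: 1,         // no blind retries until cleanup is proven idempotent
    removeOnFail: false, // keep failed jobs visible to the controller and admin
  });
}
```

The VMID requirement is separate: `confirmVmidAllocated` and the DB persist still have to happen atomically inside the worker (for example in one transaction), since job options alone cannot guarantee it.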
### 2. Controller / Reconciler — Build at Step 3, Not Post-Launch

Given solo operation requirements, the controller is a launch requirement. It is what makes billing, provisioning, and edge state reliable in production.

Build order:

```text
1. Provisioning worker + operation leases
2. Minimal controller (observe + Level 1 repairs)
3. Billing webhook live
4. Expand controller to workload-level repairs
```
Minimal controller scope for launch:

```text
Every 60 seconds:
- Clear stale operation locks (heartbeat expired, no active BullMQ job)
- Detect agent offline (LXC running, agent unreachable) -> restart agent service
- Detect missing Velocity registration -> re-register automatically
- Detect missing DNS record -> republish automatically
- Detect stuck provisioning jobs (enqueued > 15 min, no progress) -> fail cleanly, release VMID
- Billing suspended + server running -> stop workload
- Backup missed -> notify only, do not auto-repair
```
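A sketch of the shape of that pass; every finder and repair below is a stub standing in for real queries against the DB, Proxmox, Velocity, and DNS, and the check list is illustrative rather than exhaustive.

```js
// src/controllers/reconciler.js (sketch): one conservative pass every 60 seconds.

const MAX_AUTO_LEVEL = 1; // launch policy: Level 0/1 repairs only

const checks = [
  {
    name: 'stale-operation-lock',      // heartbeat expired, no active BullMQ job
    level: 1,
    find: async () => [],              // stub: query the operation lock table
    repair: async (item) => { /* clear the lock row */ },
  },
  {
    name: 'missing-dns-record',
    level: 1,
    find: async () => [],              // stub: compare expected vs published records
    repair: async (item) => { /* republish DNS */ },
  },
  {
    name: 'agent-offline',             // LXC running, agent unreachable
    level: 2,                          // observed only until Level 2 is enabled
    find: async () => [],
    repair: async (item) => { /* restart agent service */ },
  },
];

async function reconcileTick() {
  for (const check of checks) {
    for (const drift of await check.find()) {
      if (check.level > MAX_AUTO_LEVEL) {
        console.log(`[observe] ${check.name}`, drift); // notify instead of repairing
        continue;
      }
      try {
        await check.repair(drift);
        console.log(`[repaired] ${check.name}`, drift);
      } catch (err) {
        console.error(`[repair-failed] ${check.name}`, err); // escalate to #alerts-critical
      }
    }
  }
}

setInterval(() => reconcileTick().catch(console.error), 60_000);
```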
Controller must be a singleton with a Redis-backed distributed lock. If two controller instances run simultaneously there is split-brain repair risk. Wire this from day one.

Repair policy levels — only Level 0 and Level 1 enabled at launch:

```text
Level 0: observe and log only
Level 1: stale lock clear, edge republish, DNS republish, Velocity re-register
Level 2: agent service restart, workload restart (enable after proven stable)
Level 3: restore backup, rebuild container — never automatic, admin only
```
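One way to get the singleton guarantee described above: a single Redis key used as a leader lock, acquired with NX plus a TTL and renewed only by the holder. The sketch assumes ioredis; any Redis client already in the API would do.

```js
import Redis from 'ioredis';
import { randomUUID } from 'node:crypto';

const redis = new Redis();              // assumes the stack's existing Redis
const LOCK_KEY = 'zlh:controller:leader';
const LOCK_TTL_MS = 90_000;             // longer than one reconcile interval
const instanceId = randomUUID();        // identifies this controller process

// Become leader if nobody holds the lock.
async function acquireLeadership() {
  return (await redis.set(LOCK_KEY, instanceId, 'PX', LOCK_TTL_MS, 'NX')) === 'OK';
}

// Refresh the TTL only if we still own the lock (compare-and-expire via Lua).
async function renewLeadership() {
  const script = `
    if redis.call('get', KEYS[1]) == ARGV[1] then
      return redis.call('pexpire', KEYS[1], ARGV[2])
    end
    return 0`;
  return (await redis.eval(script, 1, LOCK_KEY, instanceId, LOCK_TTL_MS)) === 1;
}

// Wrap each reconcile tick: a non-leader instance simply does nothing.
export async function runIfLeader(tick) {
  if (!(await renewLeadership()) && !(await acquireLeadership())) return;
  await tick();
}
```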
Circuit breakers required from day one:

```text
Max 3 repair attempts per server per hour
Max 1 destructive candidate per server until admin review
Global repair concurrency limit
Per-host repair concurrency limit
No repair during active backup/restore/maintenance lock
```
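As one concrete example, the per-server hourly budget can be a Redis counter with a one-hour expiry; the gate around the repair call is shown as a comment because the surrounding repair and escalation functions are placeholders.

```js
import Redis from 'ioredis';

const redis = new Redis();                     // assumed shared Redis instance
const MAX_REPAIRS_PER_SERVER_PER_HOUR = 3;

// True while a repair on this server is still inside the hourly budget.
// The key expires on its own, so the window resets without a cleanup job.
export async function underRepairBudget(serverId) {
  const key = `zlh:repair-budget:${serverId}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, 3600);             // start the 1-hour window
  }
  return count <= MAX_REPAIRS_PER_SERVER_PER_HOUR;
}

// Inside the reconcile loop (illustrative):
//
// if (await underRepairBudget(drift.serverId)) {
//   await repairServer(drift);
// } else {
//   await notifyCritical(`repair budget exhausted for ${drift.serverId}`);
// }
```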
Placement:

```text
src/controllers/reconciler.js in zpack-api repo
Deployed as zlh-controller.service
```

Queue structure:

```text
provisioning queue -> create servers
operation queue -> start/stop/restart/backup/restore
repair queue -> fix drift
reconcile loop -> scheduled 60s health check
```
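A sketch of that wiring with BullMQ: three named queues, plus detected drift flowing through the repair queue rather than being fixed inline, so repairs get the same retry and visibility handling as any other job. The job naming and payload shape are assumptions; the reconcile loop itself stays a plain 60-second timer as sketched earlier.

```js
import { Queue } from 'bullmq';

const connection = { host: '127.0.0.1', port: 6379 };  // assumed local Redis

// One queue per concern, matching the structure above.
export const provisioningQueue = new Queue('provisioning', { connection });
export const operationQueue = new Queue('operation', { connection });
export const repairQueue = new Queue('repair', { connection });

// The reconcile loop only detects drift; the repair worker performs the fix.
export async function enqueueRepair(drift) {
  return repairQueue.add(drift.kind, drift, {
    // Deterministic id so repeated detections of the same drift do not stack up.
    jobId: `repair:${drift.kind}:${drift.serverId}`,
  });
}
```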
### 3. Discord Notification Channel

James is creating a dedicated ZLH Discord server. Wire Discord webhook notifications into the controller as `src/services/discord.js`.

Channel structure:

```text
#alerts-critical -> repair failed 3x, data at risk — always ping
#alerts-repairs -> controller acted automatically — no ping
#alerts-billing -> payment events, webhook failures — ping
#alerts-provisioning -> server create/fail — ping on failure
#system-health -> daily digest, backup status, queue depth — no ping
#audit-log -> full timestamped controller action history — no ping
```
Notification rules:

```text
Level 1 repair succeeded -> #alerts-repairs, no ping
Level 2 repair succeeded -> #alerts-repairs, ping if outside business hours
Repair failed 3x -> #alerts-critical, always ping
Provisioning job failed -> #alerts-provisioning, always ping
Billing webhook missed -> #alerts-billing, always ping
Backup missed 2x consecutive -> #alerts-critical, always ping
Daily health digest -> #system-health, no ping
```

Implementation:

```text
Discord webhook per channel
HTTP POST only
No bot required at this stage
One function per alert level in the notification service
```
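A sketch of `src/services/discord.js` along those lines, assuming Node 18+ (global `fetch`) and one webhook URL per channel supplied through environment variables; the environment variable names and exported function names are placeholders. Including `@here` in the message content is what produces the ping.

```js
// src/services/discord.js (sketch): one webhook per channel, plain HTTP POST, no bot.

const WEBHOOKS = {
  critical: process.env.DISCORD_WEBHOOK_CRITICAL,         // #alerts-critical
  repairs: process.env.DISCORD_WEBHOOK_REPAIRS,           // #alerts-repairs
  billing: process.env.DISCORD_WEBHOOK_BILLING,           // #alerts-billing
  provisioning: process.env.DISCORD_WEBHOOK_PROVISIONING, // #alerts-provisioning
  health: process.env.DISCORD_WEBHOOK_HEALTH,             // #system-health
  audit: process.env.DISCORD_WEBHOOK_AUDIT,               // #audit-log
};

async function post(channel, content) {
  const res = await fetch(WEBHOOKS[channel], {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ content }),
  });
  if (!res.ok) {
    // A notification failure must never take the controller down.
    console.error(`discord webhook ${channel} failed: ${res.status}`);
  }
}

// One function per alert level, per the implementation notes above.
export const notifyCritical = (msg) => post('critical', `@here ${msg}`);
export const notifyRepair = (msg) => post('repairs', msg);
export const notifyBilling = (msg) => post('billing', `@here ${msg}`);
export const notifyProvisioningFailure = (msg) => post('provisioning', `@here ${msg}`);
export const notifyHealthDigest = (msg) => post('health', msg);
export const auditLog = (msg) => post('audit', msg);
```

The routing table from the notification rules above then reduces to picking one of these functions per event.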
### 4. Console / Terminal — No Stack Change

xterm.js + `creack/pty` + `gorilla/websocket` is the correct stack. No language or library change is warranted.

The remaining console flakiness is a Portal-side WebSocket state machine problem. Fix required before launch:

```text
- Add 10–15 second connect timeout in TerminalView
- Suspend timeout if operationInProgress is true
- Derive isStreaming from connectionStatus rather than setting it independently
- Error handler must close socket, clear socketRef, allow reconnect
- API websocket bridge must propagate agent-side close events to browser — verify this before assuming fix is frontend-only
```
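A sketch of that state machine as a React hook. `connectionStatus`, `operationInProgress`, `isStreaming`, and `socketRef` follow the naming above; the hook name, the status values, and re-running the effect when `operationInProgress` changes are simplifications for illustration, not the final TerminalView design.

```js
import { useEffect, useRef, useState } from 'react';

const CONNECT_TIMEOUT_MS = 12_000; // within the 10-15 s window above

export function useConsoleSocket(url, operationInProgress) {
  const socketRef = useRef(null);
  const [connectionStatus, setConnectionStatus] = useState('connecting');

  // Derived, never set independently of the connection state.
  const isStreaming = connectionStatus === 'open';

  useEffect(() => {
    const socket = new WebSocket(url);
    socketRef.current = socket;
    setConnectionStatus('connecting');

    // Connect timeout, suspended while a server operation is in progress.
    const timer = operationInProgress
      ? null
      : setTimeout(() => {
          if (socket.readyState !== WebSocket.OPEN) {
            socket.close();
            setConnectionStatus('timeout');
          }
        }, CONNECT_TIMEOUT_MS);

    socket.onopen = () => setConnectionStatus('open');

    // Errors funnel into close; close clears socketRef so a reconnect can start cleanly.
    socket.onerror = () => socket.close();
    socket.onclose = () => {
      socketRef.current = null;
      setConnectionStatus('closed');
    };

    return () => {
      if (timer) clearTimeout(timer);
      socket.close();
    };
  }, [url, operationInProgress]);

  return { socketRef, connectionStatus, isStreaming };
}
```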
Post-launch improvement:

```text
Add console_ready / terminal_attached control frame from Agent so Portal knows the PTY is actually attached, not just that the browser-to-API socket opened.
```
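On the Portal side, handling such a frame could look like the sketch below, assuming the Agent sends control messages as JSON text frames and raw PTY output as binary frames. `terminal` is the xterm.js instance and `onAttached` a callback into the connection state machine; both, and the frame shape, are assumptions until the Agent side exists.

```js
// Sketch: distinguish control frames from PTY output on the console socket.
export function attachConsoleHandlers(socket, terminal, onAttached) {
  socket.onmessage = (event) => {
    if (typeof event.data === 'string') {
      try {
        const msg = JSON.parse(event.data);
        if (msg.type === 'console_ready' || msg.type === 'terminal_attached') {
          onAttached();          // PTY confirmed attached, not just the socket open
          return;
        }
      } catch {
        // not JSON: fall through and treat it as terminal output
      }
    }
    terminal.write(event.data);  // regular PTY output goes straight to xterm.js
  };
}
```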
### 5. Patch / Update Management

```text
Proxmox host:
manual, monthly, after PBS snapshot

Infrastructure VMs:
unattended-upgrades, security pocket only, no auto-reboot
blacklist manually managed packages such as Node.js, Caddy, Traefik

Customer LXC containers:
update-lxcs.sh from Proxmox VE community scripts, run from Proxmox shell

Application dependencies:
add Renovate (self-hosted) post-launch for automated dependency CVE PRs
add to OPEN_THREADS
```
### 6. API / Portal Deployment

Keep API and Portal as separate VMs running separate systemd services. Do not combine them into one container: they have independent failure modes and deployment cadences, and the boundary between them is where auth and ownership checks live.

Both should be migrated from pm2 to systemd services. pm2 is not wrong, but systemd aligns with the rest of the stack, gives journald integration for Loki/Alloy, and removes one process supervisor.
## Updated Pre-Launch Blocker Order

```text
1. Provisioning worker + operation leases
2. Minimal controller + Discord notifications
3. Billing webhook delivery
4. Fabric readiness gating (agent TCP probe gate)
5. User onboarding flow
6. Console WebSocket state machine fix
7. Usage limits / quota enforcement
8. Email notifications
9. Stress testing
10. OPNsense audit
11. Final smoke test
```
## What Does Not Change

```text
- DEC-001 through DEC-011 all remain locked
- API orchestrates, Agent executes
- Frontend never calls Agent directly
- Redis never restored from backup
- Vanilla uses Fabric loader + FabricProxy-Lite
```
## Notes for Next Session

Treat the controller/reconciler as launch architecture, not post-launch polish. The purpose is to make the platform survivable while James is unavailable during normal work hours. The first controller should be conservative: observe, clear stale locks, repair edge drift, and notify/escalate when safe automatic repair is not possible.