From 7c1e95892184e8aa17c0d0c5530eb2ce9fc3165e Mon Sep 17 00:00:00 2001
From: jester
Date: Tue, 7 Apr 2026 21:09:53 +0000
Subject: [PATCH] Archive resolved: debug-provisioning-consistency

---
 SCRATCH/debug-provisioning-consistency.md | 121 ----------------------
 1 file changed, 121 deletions(-)
 delete mode 100644 SCRATCH/debug-provisioning-consistency.md

diff --git a/SCRATCH/debug-provisioning-consistency.md b/SCRATCH/debug-provisioning-consistency.md
deleted file mode 100644
index 98e6b72..0000000
--- a/SCRATCH/debug-provisioning-consistency.md
+++ /dev/null
@@ -1,121 +0,0 @@
-# ZeroLagHub – Debug Focus: Provisioning + Database Consistency
-
-## Context
-
-Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked successfully once but is not consistently reliable across subsequent server creations.
-
-There are ongoing issues with:
-
-- Duplicate server creation during a single request
-- Inconsistent or failed server termination
-- Potential mismatch between system state and database state
-
----
-
-## Suspected Root Causes
-
-### 1. Database State Drift
-
-There may be inconsistencies between:
-
-- Database (`ContainerInstance`, status fields, VMID allocation)
-- Actual Proxmox state (container exists / does not exist)
-- Agent state (running, crashed, not initialized)
-
-Potential issues:
-
-- Multiple job executions writing to DB
-- Missing transactional guarantees during create/delete
-- Lack of idempotency in provisioning flow
-- VMID or port allocation not properly locked
-
----
-
-### 2. Job / Queue Behavior (BullMQ)
-
-Provisioning may be executing more than once due to:
-
-- Duplicate job enqueueing
-- Retry logic triggering unintended second runs
-- Missing job deduplication or idempotency keys
-
----
-
-### 3. DNS-Based Routing Side Effects
-
-Current architecture relies on wildcard DNS:
-
-```
-dev-.zerolaghub.dev → Traefik → API → container
-```
-
-This may be introducing timing-related issues:
-
-- Server considered "ready" before DNS propagation / routing works
-- API relying on hostname instead of internal state
-- Failure paths incorrectly handled when DNS resolution or routing is delayed
-
-This is especially problematic during:
-
-- First-time provisioning
-- Rapid create/delete cycles
-
----
-
-## Areas to Review
-
-### Database Layer
-
-- `ContainerInstance` lifecycle: creation, status transitions, deletion
-- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
-- Port allocation: ensure no duplicate assignment
-- Verify no duplicate records created per request
-
-### Provisioning Flow
-
-- Ensure create endpoint is idempotent and guarded against double execution
-- Confirm only one job is enqueued per request
-- Validate job completion updates DB exactly once
-
-### Termination Flow
-
-- Ensure delete: removes container in Proxmox, updates DB state correctly, handles partial failures (e.g., container already gone)
-
-### Agent Interaction
-
-- Confirm agent is not re-triggering provisioning or reporting stale state
-- Validate `/status` vs DB consistency
-
-### DNS / Routing (Secondary Concern)
-
-- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
-- Consider temporarily validating flows using direct IP instead of DNS
-
----
-
-## Goal
-
-Restore deterministic lifecycle management:
-
-```
-Request → Single DB record → Single provision job → Single container → Clean termination
-```
-
-No duplicates, no drift, no reliance on DNS timing.
-
----
-
-## Key Principle
-
-**The database must be the source of truth, not DNS or runtime reachability.**
-
----
-
-## Next Step
-
-Full audit of:
-
-- DB writes during create/delete
-- Job queue behavior
-- VMID + resource allocation
-- State reconciliation between DB and Proxmox