# ZeroLagHub – Debug Focus: Provisioning + Database Consistency

## Context

Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked successfully once but is not consistently reliable across subsequent server creations.

There are ongoing issues with:

- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state

---

## Suspected Root Causes

### 1. Database State Drift

There may be inconsistencies between:

- Database (`ContainerInstance`, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)

Potential issues:

- Multiple job executions writing to DB
- Missing transactional guarantees during create/delete
- Lack of idempotency in provisioning flow
- VMID or port allocation not properly locked

---

### 2. Job / Queue Behavior (BullMQ)

Provisioning may be executing more than once due to:

- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys

---

### 3. DNS-Based Routing Side Effects

Current architecture relies on wildcard DNS:

```
dev-.zerolaghub.dev → Traefik → API → container
```

This may be introducing timing-related issues:

- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed

This is especially problematic during:

- First-time provisioning
- Rapid create/delete cycles

---

## Areas to Review

### Database Layer

- `ContainerInstance` lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request

### Provisioning Flow

- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once

### Termination Flow

- Ensure delete: removes container in Proxmox, updates DB state correctly, handles partial failures (e.g., container already gone)

### Agent Interaction

- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate `/status` vs DB consistency

### DNS / Routing (Secondary Concern)

- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
- Consider temporarily validating flows using direct IP instead of DNS

---

## Goal

Restore deterministic lifecycle management:

```
Request → Single DB record → Single provision job → Single container → Clean termination
```

No duplicates, no drift, no reliance on DNS timing.

---

## Key Principle

**The database must be the source of truth, not DNS or runtime reachability.**

---

## Next Step

Full audit of:

- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
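
The "single DB record, single provision job" guarantee from the Goal section can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the `ProvisioningStore` class, the `requestId` idempotency key, and the starting VMID are stand-ins for the real `ContainerInstance` table and queue, not the existing implementation. The point is only that a repeated request with the same key returns the existing record instead of allocating a second VMID or enqueueing a second job.

```typescript
// Hypothetical in-memory stand-ins for the DB table and the job queue.
type Status = "provisioning" | "running" | "deleted";

interface ContainerRecord {
  requestId: string; // idempotency key supplied by the caller (assumed)
  vmid: number;
  status: Status;
}

class ProvisioningStore {
  private records = new Map<string, ContainerRecord>(); // keyed by requestId
  private jobs = new Map<string, { vmid: number }>(); // keyed by job id
  private nextVmid = 5000; // arbitrary starting VMID for this sketch

  // Creates the record and enqueues the job at most once per requestId.
  createServer(requestId: string): { record: ContainerRecord; created: boolean } {
    const existing = this.records.get(requestId);
    if (existing) {
      // Duplicate or retried request: return what already exists.
      return { record: existing, created: false };
    }
    const record: ContainerRecord = {
      requestId,
      vmid: this.nextVmid++,
      status: "provisioning",
    };
    this.records.set(requestId, record);

    // Derive the job id from the key so a second enqueue is a no-op.
    const jobId = `provision-${requestId}`;
    if (!this.jobs.has(jobId)) {
      this.jobs.set(jobId, { vmid: record.vmid });
    }
    return { record, created: true };
  }

  recordCount(): number {
    return this.records.size;
  }

  jobCount(): number {
    return this.jobs.size;
  }
}

const store = new ProvisioningStore();
const first = store.createServer("req-123");
const retry = store.createServer("req-123"); // same request arriving twice
const other = store.createServer("req-456");
```

In a real database the check-then-insert would need to be atomic, for example a unique constraint on the idempotency key inside a transaction, since the `Map` only works here because the sketch is single-threaded. For the queue half, BullMQ's `jobId` option may give a similar effect, because a job added with an id that already exists is not added again; how that interacts with the current retry settings would need to be verified.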
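
The state reconciliation item in the audit could start from a simple diff between what the database expects and what Proxmox reports. The sketch below is an assumption-laden illustration: `DbRow`, the status values, and the `reconcile` function are hypothetical names, and the Proxmox side is reduced to a plain list of VMIDs rather than a real API call. Rows still in `provisioning` are deliberately not flagged as missing, since their container may not exist yet.

```typescript
// Hypothetical shapes; the real ContainerInstance schema may differ.
type DbStatus = "provisioning" | "running" | "deleted";

interface DbRow {
  vmid: number;
  status: DbStatus;
}

type Drift =
  | { kind: "missing_in_proxmox"; vmid: number } // DB says running, no container found
  | { kind: "orphan_in_proxmox"; vmid: number }; // container exists, DB has no live row

// Compares DB rows against the VMIDs Proxmox reports and lists mismatches.
function reconcile(dbRows: DbRow[], proxmoxVmids: number[]): Drift[] {
  const actual = new Set(proxmoxVmids);
  const expected = new Set<number>();
  const drift: Drift[] = [];

  for (const row of dbRows) {
    if (row.status === "deleted") continue; // deleted rows expect no container
    expected.add(row.vmid);
    if (row.status === "running" && !actual.has(row.vmid)) {
      drift.push({ kind: "missing_in_proxmox", vmid: row.vmid });
    }
  }

  for (const vmid of proxmoxVmids) {
    // Covers untracked containers and failed terminations (row marked deleted).
    if (!expected.has(vmid)) {
      drift.push({ kind: "orphan_in_proxmox", vmid });
    }
  }
  return drift;
}

const drift = reconcile(
  [
    { vmid: 101, status: "running" },
    { vmid: 102, status: "running" }, // container vanished
    { vmid: 103, status: "deleted" }, // termination did not remove the container
    { vmid: 104, status: "provisioning" }, // not created yet, not flagged
  ],
  [101, 103, 105], // 105 is not tracked in the DB at all
);
console.log(drift);
```

With this sample input the diff reports VMID 102 as missing in Proxmox and VMIDs 103 and 105 as orphans. Reporting drift without acting on it keeps the database as the source of truth while the audit decides which side to repair in each case.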