diff --git a/SCRATCH/debug-provisioning-consistency.md b/SCRATCH/debug-provisioning-consistency.md
new file mode 100644
index 0000000..98e6b72
--- /dev/null
+++ b/SCRATCH/debug-provisioning-consistency.md
@@ -0,0 +1,121 @@
+# ZeroLagHub – Debug Focus: Provisioning + Database Consistency
+
+## Context
+
+Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked successfully once but is not consistently reliable across subsequent server creations.
+
+There are ongoing issues with:
+
+- Duplicate server creation during a single request
+- Inconsistent or failed server termination
+- Potential mismatch between system state and database state
+
+---
+
+## Suspected Root Causes
+
+### 1. Database State Drift
+
+There may be inconsistencies between:
+
+- Database (`ContainerInstance`, status fields, VMID allocation)
+- Actual Proxmox state (container exists / does not exist)
+- Agent state (running, crashed, not initialized)
+
+Potential issues:
+
+- Multiple job executions writing to DB
+- Missing transactional guarantees during create/delete
+- Lack of idempotency in provisioning flow
+- VMID or port allocation not properly locked
+
+---
+
+### 2. Job / Queue Behavior (BullMQ)
+
+Provisioning may be executing more than once due to:
+
+- Duplicate job enqueueing
+- Retry logic triggering unintended second runs
+- Missing job deduplication or idempotency keys
+
+---
+
+### 3. DNS-Based Routing Side Effects
+
+Current architecture relies on wildcard DNS:
+
+```
+dev-.zerolaghub.dev → Traefik → API → container
+```
+
+This may be introducing timing-related issues:
+
+- Server considered "ready" before DNS propagation / routing works
+- API relying on hostname instead of internal state
+- Failure paths incorrectly handled when DNS resolution or routing is delayed
+
+This is especially problematic during:
+
+- First-time provisioning
+- Rapid create/delete cycles
+
+---
+
+## Areas to Review
+
+### Database Layer
+
+- `ContainerInstance` lifecycle: creation, status transitions, deletion
+- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
+- Port allocation: ensure no duplicate assignment
+- Verify no duplicate records created per request
+
+### Provisioning Flow
+
+- Ensure create endpoint is idempotent and guarded against double execution
+- Confirm only one job is enqueued per request
+- Validate job completion updates DB exactly once
+
+### Termination Flow
+
+- Ensure delete: removes container in Proxmox, updates DB state correctly, handles partial failures (e.g., container already gone)
+
+### Agent Interaction
+
+- Confirm agent is not re-triggering provisioning or reporting stale state
+- Validate `/status` vs DB consistency
+
+### DNS / Routing (Secondary Concern)
+
+- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
+- Consider temporarily validating flows using direct IP instead of DNS
+
+---
+
+## Goal
+
+Restore deterministic lifecycle management:
+
+```
+Request → Single DB record → Single provision job → Single container → Clean termination
+```
+
+No duplicates, no drift, no reliance on DNS timing.
+
+---
+
+## Key Principle
+
+**The database must be the source of truth, not DNS or runtime reachability.**
+
+---
+
+## Next Step
+
+Full audit of:
+
+- DB writes during create/delete
+- Job queue behavior
+- VMID + resource allocation
+- State reconciliation between DB and Proxmox
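
The last audit item, state reconciliation between DB and Proxmox, could start from a read-only drift check along the lines of the sketch below. Everything in it is an assumption made for illustration: the `DbInstance` shape, the status values, and the `findDrift` helper are hypothetical and are not taken from ZeroLagHub's actual schema or code. The list of live VMIDs is passed in as a plain array rather than fetched from Proxmox, so the comparison logic can be tested in isolation.

```typescript
// Hypothetical record shape; the real ContainerInstance model is not shown in this note.
type InstanceStatus = "provisioning" | "running" | "deleting" | "deleted";

interface DbInstance {
  id: string;
  vmid: number;
  status: InstanceStatus;
}

interface DriftReport {
  // DB says "running", but no container with that VMID exists.
  missingInProxmox: DbInstance[];
  // Containers that exist without any non-deleted DB record.
  orphanedVmids: number[];
  // VMIDs claimed by more than one non-deleted DB record.
  duplicateVmids: number[];
}

// Compares DB records against the VMIDs that currently exist in Proxmox.
// Read-only: it reports drift and changes nothing.
function findDrift(dbInstances: DbInstance[], proxmoxVmids: number[]): DriftReport {
  const active = dbInstances.filter((inst) => inst.status !== "deleted");

  // Count non-deleted records per VMID.
  const recordsPerVmid: Record<number, number> = {};
  for (const inst of active) {
    recordsPerVmid[inst.vmid] = (recordsPerVmid[inst.vmid] || 0) + 1;
  }

  // Only "running" records are expected to have a container already;
  // "provisioning" records may legitimately still be in flight.
  const missingInProxmox = active.filter(
    (inst) => inst.status === "running" && proxmoxVmids.indexOf(inst.vmid) === -1,
  );

  const orphanedVmids = proxmoxVmids
    .filter((vmid) => recordsPerVmid[vmid] === undefined)
    .sort((a, b) => a - b);

  const duplicateVmids = Object.keys(recordsPerVmid)
    .map(Number)
    .filter((vmid) => recordsPerVmid[vmid] > 1)
    .sort((a, b) => a - b);

  return { missingInProxmox, orphanedVmids, duplicateVmids };
}

// Example: two records share VMID 101, record "c" has no container,
// and container 103 only has a deleted record.
const report = findDrift(
  [
    { id: "a", vmid: 101, status: "running" },
    { id: "b", vmid: 101, status: "provisioning" },
    { id: "c", vmid: 102, status: "running" },
    { id: "d", vmid: 103, status: "deleted" },
  ],
  [101, 103],
);
console.log(JSON.stringify(report));
```

Keeping the check side-effect free means it can run during the audit without masking the bugs under investigation. Any repair step (marking records failed, destroying orphaned containers) would be a separate decision once the sources of drift are understood.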