ZeroLagHub – Debug Focus: Provisioning + Database Consistency
Context
Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked once but has not been reliable across subsequent server creations.
There are ongoing issues with:
- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state
Suspected Root Causes
1. Database State Drift
There may be inconsistencies between:
- Database (ContainerInstance, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)
Potential issues:
- Multiple job executions writing to DB
- Missing transactional guarantees during create/delete
- Lack of idempotency in provisioning flow
- VMID or port allocation not properly locked
2. Job / Queue Behavior (BullMQ)
Provisioning may be executing more than once due to:
- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys
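One way to picture the deduplication point above is a queue that derives a deterministic job ID from the request and ignores repeats. The sketch below is an in-memory stand-in, not the BullMQ API; `DedupingQueue`, `ProvisionJob`, and the `provision-<requestId>` key format are hypothetical names chosen for illustration. BullMQ's `jobId` option on `queue.add` is documented to behave similarly (a job whose ID already exists is not added again), but that should be verified against the version in use.

```typescript
// Hypothetical in-memory stand-in for a job queue; not the BullMQ API.
type ProvisionJob = { jobId: string; requestId: string };

class DedupingQueue {
  private jobs = new Map<string, ProvisionJob>();

  // Returns the existing job when the same request was already enqueued.
  add(requestId: string): { job: ProvisionJob; created: boolean } {
    const jobId = `provision-${requestId}`; // deterministic key per request
    const existing = this.jobs.get(jobId);
    if (existing) {
      return { job: existing, created: false };
    }
    const job: ProvisionJob = { jobId, requestId };
    this.jobs.set(jobId, job);
    return { job, created: true };
  }

  size(): number {
    return this.jobs.size;
  }
}

const provisionQueue = new DedupingQueue();
const firstEnqueue = provisionQueue.add("req-123");
const secondEnqueue = provisionQueue.add("req-123"); // e.g. a client retry

console.log(firstEnqueue.created, secondEnqueue.created, provisionQueue.size());
// prints: true false 1
```

The request ID must be stable across retries of the same logical request (for example, generated once by the client or by the create endpoint before enqueueing), otherwise the key does not deduplicate anything.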
3. DNS-Based Routing Side Effects
Current architecture relies on wildcard DNS:
dev-<vmid>.zerolaghub.dev → Traefik → API → container
This may be introducing timing-related issues:
- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed
This is especially problematic during:
- First-time provisioning
- Rapid create/delete cycles
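To illustrate deciding readiness from internal state rather than hostname reachability, here is a minimal sketch. The status values, the `ReadinessInput` shape, and `isServerReady` are assumptions made for the example; only the agent states (running, crashed, not initialized) come from the notes above.

```typescript
// Hypothetical status values; the real schema may differ.
type DbStatus = "provisioning" | "running" | "stopped" | "failed";
type AgentState = "running" | "crashed" | "not_initialized";

interface ReadinessInput {
  dbStatus: DbStatus;
  agentState: AgentState;
  hostnameResolves: boolean; // recorded for diagnostics, not used in the decision
}

// Readiness depends only on DB and agent state, never on DNS or routing.
function isServerReady(input: ReadinessInput): boolean {
  return input.dbStatus === "running" && input.agentState === "running";
}

const readyBeforeDns = isServerReady({
  dbStatus: "running",
  agentState: "running",
  hostnameResolves: false, // DNS not propagated yet
});
const notReadyDespiteDns = isServerReady({
  dbStatus: "provisioning",
  agentState: "not_initialized",
  hostnameResolves: true, // wildcard DNS resolves even before the server exists
});

console.log(readyBeforeDns, notReadyDespiteDns);
// prints: true false
```

With wildcard DNS, a hostname can resolve before the container is usable, so a successful lookup says little about the server itself.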
Areas to Review
Database Layer
- ContainerInstance lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request
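The uniqueness requirement for VMIDs and ports can be sketched as an allocator whose check and reserve happen in a single step. This is an in-memory illustration only; `IdAllocator` and the ID range are hypothetical, and in the real system the equivalent guarantee would more likely come from a unique constraint plus a transaction or row lock in the database.

```typescript
// Hypothetical allocator: check-and-reserve happens in one synchronous step,
// so two callers can never receive the same ID.
class IdAllocator {
  private taken = new Set<number>();
  private min: number;
  private max: number;

  constructor(min: number, max: number) {
    this.min = min;
    this.max = max;
  }

  allocate(): number {
    for (let id = this.min; id <= this.max; id++) {
      if (!this.taken.has(id)) {
        this.taken.add(id); // reserve before returning
        return id;
      }
    }
    throw new Error("no free IDs in range");
  }

  release(id: number): void {
    this.taken.delete(id);
  }
}

const vmids = new IdAllocator(5000, 5002); // range is an assumption
const vmidA = vmids.allocate();
const vmidB = vmids.allocate();
vmids.release(vmidA);
const vmidC = vmids.allocate(); // reuses the released ID

console.log(vmidA, vmidB, vmidC);
// prints: 5000 5001 5000
```

The same pattern applies to port allocation. A process-local allocator stops being sufficient once more than one worker process can allocate, at which point the reservation has to live in the database.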
Provisioning Flow
- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once
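"Updates DB exactly once" can be approximated with a guarded status transition: the write only applies if the record is still in the expected prior state, similar in spirit to `UPDATE ... WHERE id = ? AND status = ?`. The row shape, status names, and `transition` helper below are assumptions for illustration, backed by an in-memory map rather than a real database.

```typescript
// Hypothetical row shape and statuses.
type InstanceStatus = "provisioning" | "running" | "failed";
interface InstanceRow {
  id: string;
  status: InstanceStatus;
  updates: number; // counts applied transitions, for demonstration
}

const instanceRows = new Map<string, InstanceRow>();

// Applies the transition only if the row is still in the expected status.
function transition(id: string, from: InstanceStatus, to: InstanceStatus): boolean {
  const row = instanceRows.get(id);
  if (!row || row.status !== from) {
    return false; // already transitioned, or row missing: do nothing
  }
  row.status = to;
  row.updates += 1;
  return true;
}

instanceRows.set("inst-1", { id: "inst-1", status: "provisioning", updates: 0 });
const firstCompletion = transition("inst-1", "provisioning", "running");
const retriedCompletion = transition("inst-1", "provisioning", "running"); // job retry

console.log(firstCompletion, retriedCompletion, instanceRows.get("inst-1")?.updates);
// prints: true false 1
```

A retried or duplicated job then becomes a no-op at the write, which makes second executions visible (the `false` return) instead of silently overwriting state.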
Termination Flow
- Ensure delete removes the container in Proxmox, updates DB state correctly, and handles partial failures (e.g., container already gone)
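A termination path that tolerates "container already gone" might look like the sketch below. `FakeProxmox` is a hypothetical stand-in for the real Proxmox client, and the `"deleted"` DB status is an assumed value; the point is only that a missing container is treated as success and the DB is still updated.

```typescript
type DestroyResult = "destroyed" | "not_found";
type TerminateOutcome = "deleted" | "already_gone";

// Hypothetical stand-in for a Proxmox client.
class FakeProxmox {
  private containers: Set<number>;

  constructor(existingVmids: number[]) {
    this.containers = new Set(existingVmids);
  }

  destroy(vmid: number): DestroyResult {
    return this.containers.delete(vmid) ? "destroyed" : "not_found";
  }
}

// Idempotent terminate: a missing container still ends with the DB marked deleted.
function terminate(
  proxmox: FakeProxmox,
  dbStatus: Map<number, string>,
  vmid: number
): TerminateOutcome {
  const result = proxmox.destroy(vmid);
  dbStatus.set(vmid, "deleted"); // DB converges regardless of which path ran
  return result === "destroyed" ? "deleted" : "already_gone";
}

const proxmox = new FakeProxmox([5001]);
const dbStatus = new Map<number, string>([[5001, "running"]]);

const firstTerminate = terminate(proxmox, dbStatus, 5001);
const repeatTerminate = terminate(proxmox, dbStatus, 5001); // safe to repeat

console.log(firstTerminate, repeatTerminate, dbStatus.get(5001));
// prints: deleted already_gone deleted
```

Other failures (timeouts, permission errors) are deliberately not modeled here; those should leave the DB record in a state that a later retry or reconciliation pass can pick up.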
Agent Interaction
- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate /status vs DB consistency
DNS / Routing (Secondary Concern)
- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
- Consider temporarily validating flows using direct IP instead of DNS
Goal
Restore deterministic lifecycle management:
Request → Single DB record → Single provision job → Single container → Clean termination
No duplicates, no drift, no reliance on DNS timing.
Key Principle
The database must be the source of truth, not DNS or runtime reachability.
Next Step
Full audit of:
- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
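As a starting point for the reconciliation part of the audit, a drift report can be computed by comparing the VMIDs the database believes exist against those Proxmox actually reports. The function and field names below are hypothetical, and the input lists are made-up sample values, not real data.

```typescript
interface DriftReport {
  orphanedInProxmox: number[]; // container exists, but no DB record
  missingInProxmox: number[]; // DB record exists, but no container
}

// Compares the two views of the world and reports where they disagree.
function findDrift(dbVmids: number[], proxmoxVmids: number[]): DriftReport {
  const inDb = new Set(dbVmids);
  const inProxmox = new Set(proxmoxVmids);
  return {
    orphanedInProxmox: proxmoxVmids.filter((vmid) => !inDb.has(vmid)),
    missingInProxmox: dbVmids.filter((vmid) => !inProxmox.has(vmid)),
  };
}

// Sample values for illustration only.
const drift = findDrift([5001, 5002, 5003], [5002, 5003, 5004]);

console.log(JSON.stringify(drift));
// prints: {"orphanedInProxmox":[5004],"missingInProxmox":[5001]}
```

Running a report like this before and after each create/delete cycle would show which step introduces the drift, without changing any state.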