Add provisioning + DB consistency debug focus doc

jester 2026-04-05 21:58:01 +00:00
parent 3f117ade7e
commit ae9724d6ff

@ -0,0 +1,121 @@
# ZeroLagHub Debug Focus: Provisioning + Database Consistency
## Context
Recent changes implemented deterministic Fabric provisioning (mods and config pre-installed before first boot). This succeeded once but has not been reliable across subsequent server creations.
There are ongoing issues with:
- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state
---
## Suspected Root Causes
### 1. Database State Drift
There may be inconsistencies between:
- Database (`ContainerInstance`, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)
Potential issues:
- Multiple job executions writing to DB
- Missing transactional guarantees during create/delete
- Lack of idempotency in provisioning flow
- VMID or port allocation not properly locked
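
One way to close the idempotency gap is to key every create on a request ID and return the existing record when the same request is replayed. The sketch below is an in-memory stand-in, not the real schema: `InstanceStore`, `requestId`, the status names, and the starting VMID are all assumptions for illustration. In the actual database the same guarantee would likely come from a unique index on the request key.

```typescript
// Hypothetical record shape; field names are illustrative, not the real schema.
interface ContainerInstance {
  requestId: string;
  vmid: number;
  status: "provisioning" | "running" | "deleted";
}

// In-memory stand-in for a table with a unique index on requestId.
class InstanceStore {
  private byRequest = new Map<string, ContainerInstance>();
  private nextVmid = 5000; // arbitrary starting VMID for this sketch

  // Returns the existing record when the same request is replayed,
  // so a retry or duplicate job run cannot create a second row.
  createOnce(requestId: string): { record: ContainerInstance; created: boolean } {
    const existing = this.byRequest.get(requestId);
    if (existing) {
      return { record: existing, created: false };
    }
    const record: ContainerInstance = {
      requestId,
      vmid: this.nextVmid++,
      status: "provisioning",
    };
    this.byRequest.set(requestId, record);
    return { record, created: true };
  }

  count(): number {
    return this.byRequest.size;
  }
}

const instanceStore = new InstanceStore();
const firstCreate = instanceStore.createOnce("req-1");
const replayedCreate = instanceStore.createOnce("req-1"); // same request again
```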
---
### 2. Job / Queue Behavior (BullMQ)
Provisioning may be executing more than once due to:
- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys
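
BullMQ accepts a custom `jobId` in the options passed to `Queue.add`, and an add whose ID already exists in the queue is ignored (only while the earlier job is still retained). The sketch below mimics that behaviour in memory so it runs without BullMQ or Redis. `DedupQueue` and the `provision-<requestId>` naming are assumptions, not the project's actual queue code.

```typescript
// Hypothetical job payload for a provisioning request.
interface ProvisionJob {
  jobId: string;
  requestId: string;
}

// In-memory stand-in for a queue that deduplicates on job ID.
class DedupQueue {
  private jobs = new Map<string, ProvisionJob>();

  // Derives a deterministic job ID from the request, then enqueues
  // only if no job with that ID is already known.
  add(requestId: string): { job: ProvisionJob; enqueued: boolean } {
    const jobId = `provision-${requestId}`;
    const existing = this.jobs.get(jobId);
    if (existing) {
      return { job: existing, enqueued: false };
    }
    const job: ProvisionJob = { jobId, requestId };
    this.jobs.set(jobId, job);
    return { job, enqueued: true };
  }

  size(): number {
    return this.jobs.size;
  }
}

const provisionQueue = new DedupQueue();
const firstAdd = provisionQueue.add("req-42");
const duplicateAdd = provisionQueue.add("req-42"); // e.g. a double-submitted request
```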
---
### 3. DNS-Based Routing Side Effects
Current architecture relies on wildcard DNS:
```
dev-<vmid>.zerolaghub.dev → Traefik → API → container
```
This may be introducing timing-related issues:
- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed
This is especially problematic during:
- First-time provisioning
- Rapid create/delete cycles
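
To illustrate readiness that does not depend on DNS timing, the sketch below derives "ready" from the DB status and the agent state only. The type names, the status values, and the `dnsResolves` field are assumptions; the field is carried in the input only to show that it is deliberately ignored.

```typescript
// Hypothetical status values; the agent states mirror the ones listed above.
type DbStatus = "provisioning" | "running" | "failed";
type AgentState = "running" | "crashed" | "not_initialized";

interface ReadinessInput {
  dbStatus: DbStatus;
  agentState: AgentState;
  dnsResolves: boolean; // observed, but not part of the decision
}

// Ready means the DB says running and the agent agrees.
// Hostname resolution is intentionally not consulted.
function isReady(input: ReadinessInput): boolean {
  return input.dbStatus === "running" && input.agentState === "running";
}

// DNS has not propagated yet, but internal state says the server is up.
const readyBeforeDns = isReady({
  dbStatus: "running",
  agentState: "running",
  dnsResolves: false,
});
```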
---
## Areas to Review
### Database Layer
- `ContainerInstance` lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request
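
The uniqueness requirement for VMIDs comes down to checking and reserving in a single step. A minimal sketch, assuming an in-memory set of used IDs and a hypothetical `allocateVmid` helper; the range values are arbitrary:

```typescript
// Reserves the lowest free VMID in [min, max] and records it as used.
// Check and reserve happen in one synchronous step, so two callers in the
// same process cannot be handed the same ID.
function allocateVmid(used: Set<number>, min: number, max: number): number {
  for (let id = min; id <= max; id++) {
    if (!used.has(id)) {
      used.add(id);
      return id;
    }
  }
  throw new Error(`no free VMID in range ${min}-${max}`);
}

const usedVmids = new Set<number>([5000, 5001]);
const vmidA = allocateVmid(usedVmids, 5000, 5003); // 5002
const vmidB = allocateVmid(usedVmids, 5000, 5003); // 5003
```

This only protects a single process. With multiple workers, the same check-and-reserve would have to be enforced by the database itself, for example with a unique constraint on the VMID column or a transaction that locks the allocation row. The same pattern applies to port allocation.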
### Provisioning Flow
- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once
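
"Updates DB exactly once" can be expressed as a guarded status transition: the write only happens if the row is still in the expected state. The sketch below uses a hypothetical `Row` type and an `updates` counter purely to make the effect visible. In SQL the equivalent would be an `UPDATE ... WHERE status = 'provisioning'` followed by a check of the affected row count.

```typescript
type ProvisionStatus = "provisioning" | "running" | "failed";

// Hypothetical row; `updates` counts how many completion writes were applied.
interface Row {
  id: string;
  status: ProvisionStatus;
  updates: number;
}

function newRow(id: string): Row {
  return { id, status: "provisioning", updates: 0 };
}

// Applies the completion write only if the row is still provisioning.
// Returns true when this call performed the transition.
function completeProvisioning(row: Row): boolean {
  if (row.status !== "provisioning") {
    return false; // another job run already completed (or failed) this row
  }
  row.status = "running";
  row.updates += 1;
  return true;
}

const instanceRow = newRow("inst-1");
const firstCompletion = completeProvisioning(instanceRow);
const secondCompletion = completeProvisioning(instanceRow); // duplicate job run
```

A duplicate or retried job then becomes a no-op instead of a second write.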
### Termination Flow
- Ensure delete:
  - removes the container in Proxmox
  - updates DB state correctly
  - handles partial failures (e.g., container already gone)
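
A delete that tolerates partial failure treats "container already gone" as success rather than an error, so the DB row still reaches its terminal state. A minimal sketch, where the Proxmox side is modelled as a plain set of existing VMIDs and all names (`destroyContainer`, `terminate`, the status values) are assumptions:

```typescript
type DeleteOutcome = "deleted" | "already_gone";

// Hypothetical stand-in for the Proxmox call: removes the VMID if present.
function destroyContainer(existing: Set<number>, vmid: number): DeleteOutcome {
  return existing.delete(vmid) ? "deleted" : "already_gone";
}

interface InstanceRow {
  vmid: number;
  status: "running" | "deleting" | "deleted";
}

// Marks the row as deleting, removes the container, then marks it deleted.
// A container that is already gone still ends with a deleted row.
function terminate(row: InstanceRow, existing: Set<number>): DeleteOutcome {
  row.status = "deleting";
  const outcome = destroyContainer(existing, row.vmid);
  row.status = "deleted";
  return outcome;
}

const proxmoxVmids = new Set<number>([101]);
const liveRow: InstanceRow = { vmid: 101, status: "running" };
const staleRow: InstanceRow = { vmid: 202, status: "running" }; // container missing

const liveOutcome = terminate(liveRow, proxmoxVmids);
const staleOutcome = terminate(staleRow, proxmoxVmids);
```

Calling `terminate` a second time on the same row is also safe, which matters if the termination job is retried.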
### Agent Interaction
- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate `/status` vs DB consistency
### DNS / Routing (Secondary Concern)
- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
- Consider temporarily validating flows using direct IP instead of DNS
---
## Goal
Restore deterministic lifecycle management:
```
Request → Single DB record → Single provision job → Single container → Clean termination
```
No duplicates, no drift, no reliance on DNS timing.

---
## Key Principle
**The database must be the source of truth, not DNS or runtime reachability.**

---
## Next Step
Full audit of:
- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
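
As a starting point for the reconciliation part of the audit, the sketch below compares the VMIDs the DB believes exist with the VMIDs Proxmox reports. `findDrift` and `DriftReport` are hypothetical names, and the inputs are plain arrays rather than real DB or Proxmox API results.

```typescript
interface DriftReport {
  missingInProxmox: number[]; // DB has a record, Proxmox has no container
  unknownToDb: number[]; // Proxmox has a container, DB has no record
}

// Compares the two views of the world and reports each kind of mismatch.
function findDrift(dbVmids: number[], proxmoxVmids: number[]): DriftReport {
  const db = new Set(dbVmids);
  const proxmox = new Set(proxmoxVmids);
  return {
    missingInProxmox: dbVmids.filter((id) => !proxmox.has(id)),
    unknownToDb: proxmoxVmids.filter((id) => !db.has(id)),
  };
}

const driftReport = findDrift([100, 101, 102], [101, 102, 103]);
// driftReport.missingInProxmox is [100]
// driftReport.unknownToDb is [103]
```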