# ZeroLagHub – Debug Focus: Provisioning + Database Consistency

## Context

Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked once but has not been reliable across subsequent server creations.

There are ongoing issues with:

- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state

---

## Suspected Root Causes

### 1. Database State Drift

There may be inconsistencies between:

- Database (`ContainerInstance`, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)

Potential issues:

- Multiple job executions writing to the DB for the same request
- Missing transactional guarantees during create/delete
- Lack of idempotency in the provisioning flow
- VMID or port allocation not properly locked

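
A minimal sketch of atomic VMID allocation, assuming an in-memory set stands in for the real table. `VmidAllocator` and the 5000-5002 range are hypothetical; in the actual system the same guarantee would more likely come from a unique constraint or a row lock inside a transaction. The point is that finding a free ID and reserving it happen in one step, so two jobs can never be handed the same VMID.

```typescript
// Hypothetical in-memory stand-in for a VMID table with a unique constraint.
class VmidAllocator {
  private inUse: Set<number> = new Set<number>();
  private min: number;
  private max: number;

  constructor(min: number, max: number) {
    this.min = min;
    this.max = max;
  }

  // Find and reserve the lowest free VMID in one synchronous step, so two
  // callers can never observe the same ID as free.
  allocate(): number {
    for (let vmid = this.min; vmid <= this.max; vmid++) {
      if (!this.inUse.has(vmid)) {
        this.inUse.add(vmid);
        return vmid;
      }
    }
    throw new Error(`no free VMID in range ${this.min}-${this.max}`);
  }

  // Releasing an ID that is not allocated is a no-op, which keeps
  // repeated deletes harmless.
  release(vmid: number): void {
    this.inUse.delete(vmid);
  }
}

const allocator = new VmidAllocator(5000, 5002);
const first = allocator.allocate();  // 5000
const second = allocator.allocate(); // 5001
allocator.release(first);
const third = allocator.allocate();  // 5000 again, because it was released
```
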
---

### 2. Job / Queue Behavior (BullMQ)

Provisioning may be executing more than once due to:

- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys

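
One way to get deduplication is to derive the job ID from the request rather than letting the queue generate one. The sketch below assumes each create request carries a stable `requestId`; `provisionJobId` and the `provisionQueue` name in the comment are hypothetical. BullMQ accepts a custom `jobId` option on `Queue.add` and skips the add when a job with that ID already exists, but only while the earlier job is still stored, so removal settings such as `removeOnComplete` affect how long the protection lasts.

```typescript
// Hypothetical helper: derive a stable job ID from the request that
// triggered provisioning, so a repeated enqueue maps to the same job.
function provisionJobId(requestId: string): string {
  // Keep only characters that are safe in an ID; everything else becomes "_".
  const safe = requestId.replace(/[^A-Za-z0-9_-]/g, "_");
  return `provision-${safe}`;
}

// With BullMQ the ID would be passed as the `jobId` option, for example:
//   await provisionQueue.add("provision", { requestId },
//     { jobId: provisionJobId(requestId) });
const idA = provisionJobId("req-123");
const idB = provisionJobId("req-123"); // same request, same job ID
```
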
---

### 3. DNS-Based Routing Side Effects

Current architecture relies on wildcard DNS:

```
dev-<vmid>.zerolaghub.dev → Traefik → API → container
```

This may be introducing timing-related issues:

- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed

This is especially problematic during:

- First-time provisioning
- Rapid create/delete cycles

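
A sketch of keeping "ready" separate from "reachable by hostname", under the assumption that the DB status and the agent health report are both available to the API. The `InstanceState` shape, its field names, and the status values are hypothetical. Readiness is computed from internal state only, and DNS is consulted solely to decide whether the public hostname can be advertised yet.

```typescript
// Hypothetical status values; the real ContainerInstance fields may differ.
type InstanceStatus = "provisioning" | "running" | "failed";

interface InstanceState {
  status: InstanceStatus;    // status recorded in the database
  agentHealthy: boolean;     // last health report from the agent
  hostnameResolves: boolean; // whether dev-<vmid>.zerolaghub.dev routes right now
}

// Ready is decided from internal state only; DNS is deliberately ignored.
function isReady(state: InstanceState): boolean {
  return state.status === "running" && state.agentHealthy;
}

// Routable is a separate, weaker signal that may lag behind readiness.
function isRoutable(state: InstanceState): boolean {
  return isReady(state) && state.hostnameResolves;
}

// A freshly provisioned server whose hostname has not started routing yet.
const fresh: InstanceState = {
  status: "running",
  agentHealthy: true,
  hostnameResolves: false,
};
const freshReady = isReady(fresh);       // true
const freshRoutable = isRoutable(fresh); // false
```
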
---

## Areas to Review

### Database Layer

- `ContainerInstance` lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request

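
When reviewing status transitions, it can help to write the allowed moves down as a table and reject everything else. The status names below are assumptions, not the actual `ContainerInstance` values. A guard like this also makes a second "completed" write from a duplicate job fail visibly instead of silently overwriting state.

```typescript
// Hypothetical lifecycle states for a ContainerInstance record.
type Status =
  | "pending"
  | "provisioning"
  | "running"
  | "terminating"
  | "terminated"
  | "failed";

// Each status lists the only statuses it may move to next.
const allowedTransitions: Record<Status, Status[]> = {
  pending: ["provisioning", "failed"],
  provisioning: ["running", "failed"],
  running: ["terminating", "failed"],
  terminating: ["terminated", "failed"],
  terminated: [],
  failed: ["terminating"],
};

function canTransition(from: Status, to: Status): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}

const firstCompletion = canTransition("provisioning", "running"); // true
const secondCompletion = canTransition("running", "running");     // false
```

In a real database the check and the write would need to happen together, for example as an update that only matches rows still in the expected `from` status.
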
### Provisioning Flow

- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once

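
A minimal get-or-create sketch for an idempotent create endpoint, assuming the client (or the API layer) supplies a stable key per request. `InstanceStore` and `requestKey` are hypothetical, and the `Map` stands in for a table with a unique index on that key. The caller would enqueue a provisioning job only when `created` is true.

```typescript
interface InstanceRecord {
  id: number;
  requestKey: string;
}

// Hypothetical store; a Map stands in for a table with a unique index
// on requestKey.
class InstanceStore {
  private byKey: Map<string, InstanceRecord> = new Map<string, InstanceRecord>();
  private nextId = 1;

  // Return the existing record for this key, or create exactly one.
  getOrCreate(requestKey: string): { record: InstanceRecord; created: boolean } {
    const existing = this.byKey.get(requestKey);
    if (existing) {
      return { record: existing, created: false };
    }
    const record: InstanceRecord = { id: this.nextId++, requestKey };
    this.byKey.set(requestKey, record);
    return { record, created: true };
  }
}

const store = new InstanceStore();
const firstCall = store.getOrCreate("req-123");  // created: true
const secondCall = store.getOrCreate("req-123"); // created: false, same record
```
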
### Termination Flow

- Ensure delete:
  - removes the container in Proxmox
  - updates DB state correctly
  - handles partial failures (e.g., container already gone)

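
The partial-failure case can be sketched as a pure decision function: given what the Proxmox destroy call reported, decide what the DB should record. The outcome names and the `TerminationDecision` shape are hypothetical. Treating "not found" the same as "destroyed" is what makes a repeated delete safe.

```typescript
// Hypothetical summary of what the Proxmox destroy call reported.
type DestroyOutcome = "destroyed" | "not_found" | "error";

interface TerminationDecision {
  dbStatus: "terminated" | "terminating";
  releaseResources: boolean; // free the VMID and ports
  retry: boolean;
}

function resolveTermination(outcome: DestroyOutcome): TerminationDecision {
  switch (outcome) {
    case "destroyed":
    case "not_found":
      // The container is gone either way, so the record can be closed out.
      return { dbStatus: "terminated", releaseResources: true, retry: false };
    case "error":
      // State is unknown: keep the record open and hold on to its resources.
      return { dbStatus: "terminating", releaseResources: false, retry: true };
  }
}

const alreadyGone = resolveTermination("not_found");
const apiFailure = resolveTermination("error");
```

Holding the VMID and ports until the container is confirmed gone avoids handing them to a new server while the old one may still exist.
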
### Agent Interaction

- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate `/status` vs DB consistency

### DNS / Routing (Secondary Concern)

- Evaluate whether DNS-based routing is being used as a source of truth (it should not be)
- Consider temporarily validating flows using direct IP instead of DNS

---

## Goal

Restore deterministic lifecycle management:

```
Request → Single DB record → Single provision job → Single container → Clean termination
```

No duplicates, no drift, no reliance on DNS timing.

---

## Key Principle

**The database must be the source of truth, not DNS or runtime reachability.**

---

## Next Step

Full audit of:

- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
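
For the reconciliation item, a first pass could simply diff the VMIDs the DB believes are active against the VMIDs Proxmox reports. The sketch below assumes both lists have already been fetched; `findDrift` and the `DriftReport` shape are hypothetical.

```typescript
interface DriftReport {
  missingInProxmox: number[];  // DB says active, Proxmox has no such container
  orphanedInProxmox: number[]; // container exists, DB has no active record
}

function findDrift(dbVmids: number[], proxmoxVmids: number[]): DriftReport {
  const inDb = new Set<number>(dbVmids);
  const inProxmox = new Set<number>(proxmoxVmids);
  return {
    missingInProxmox: dbVmids.filter((vmid) => !inProxmox.has(vmid)),
    orphanedInProxmox: proxmoxVmids.filter((vmid) => !inDb.has(vmid)),
  };
}

const report = findDrift([5000, 5001, 5002], [5001, 5002, 5003]);
// report.missingInProxmox is [5000]; report.orphanedInProxmox is [5003]
```

Duplicate containers from double provisioning would show up as orphans here, while failed or half-finished creates would show up as missing.
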