Add provisioning + DB consistency debug focus doc

parent 3f117ade7e
commit ae9724d6ff

SCRATCH/debug-provisioning-consistency.md (new file, 121 lines)

# ZeroLagHub – Debug Focus: Provisioning + Database Consistency

## Context

Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked once but has not been reliable across subsequent server creations.

There are ongoing issues with:

- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state

---

## Suspected Root Causes

### 1. Database State Drift

There may be inconsistencies between:

- Database (`ContainerInstance`, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)

Potential issues:

- Multiple job executions writing to DB
- Missing transactional guarantees during create/delete
- Lack of idempotency in provisioning flow
- VMID or port allocation not properly locked
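
One way to make the locking point concrete is to serialize allocation so that the read of used VMIDs and the write of the new one cannot interleave across concurrent jobs. The sketch below is a minimal illustration, not the project's actual code: `VmidStore`, `MemoryVmidStore`, and `VmidAllocator` are hypothetical names, the in-memory store stands in for the database, and the in-process lock only helps while a single process performs allocation. A unique constraint on the VMID column would still be the final guard.

```typescript
// Hypothetical storage interface; in the real system this would be the DB.
interface VmidStore {
  listVmids(): Promise<number[]>;
  insert(vmid: number): Promise<void>;
}

// In-memory stand-in so the sketch runs without a database.
class MemoryVmidStore implements VmidStore {
  private readonly used = new Set<number>();

  async listVmids(): Promise<number[]> {
    return Array.from(this.used);
  }

  async insert(vmid: number): Promise<void> {
    // Mimics a unique constraint: a duplicate insert is an error.
    if (this.used.has(vmid)) throw new Error(`VMID ${vmid} already allocated`);
    this.used.add(vmid);
  }
}

class VmidAllocator {
  private readonly store: VmidStore;
  private readonly min: number;
  private readonly max: number;
  // Tail of a promise chain: each allocation starts after the previous one settles.
  private tail: Promise<unknown> = Promise.resolve();

  constructor(store: VmidStore, min: number, max: number) {
    this.store = store;
    this.min = min;
    this.max = max;
  }

  allocate(): Promise<number> {
    const run = this.tail.then(() => this.allocateUnlocked());
    // Swallow errors on the chain only, so one failure does not block later callers.
    this.tail = run.catch(() => undefined);
    return run;
  }

  // Read-then-write; only safe because allocate() serializes calls.
  private async allocateUnlocked(): Promise<number> {
    const taken = new Set(await this.store.listVmids());
    for (let id = this.min; id <= this.max; id++) {
      if (!taken.has(id)) {
        await this.store.insert(id);
        return id;
      }
    }
    throw new Error("VMID range exhausted");
  }
}
```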

---

### 2. Job / Queue Behavior (BullMQ)

Provisioning may be executing more than once due to:

- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys
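
A possible deduplication approach is to derive the job ID from the request itself. BullMQ's `Queue.add` accepts a custom `jobId` option, and a job added with an ID that already exists in the queue is ignored instead of being enqueued again (this holds only while the earlier job has not been removed). The sketch below assumes a per-request `requestId`; `QueueLike`, `ProvisionRequest`, `provisionJobId`, `enqueueProvision`, and `InMemoryQueue` are hypothetical names, and the in-memory queue only imitates the ignore-duplicate behavior so the example runs without Redis.

```typescript
// Minimal shape of the one queue method used here. BullMQ's Queue#add takes
// (name, data, opts) and accepts opts.jobId; everything else is omitted.
interface QueueLike<T> {
  add(name: string, data: T, opts: { jobId: string }): Promise<unknown>;
}

// Hypothetical payload; requestId is assumed to be generated once per
// create request (for example by the API handler).
interface ProvisionRequest {
  requestId: string;
  gameType: string;
}

// Same request, same job ID: a repeated enqueue maps onto the existing job.
function provisionJobId(req: ProvisionRequest): string {
  return `provision-${req.requestId}`;
}

async function enqueueProvision(
  queue: QueueLike<ProvisionRequest>,
  req: ProvisionRequest
): Promise<string> {
  const jobId = provisionJobId(req);
  await queue.add("provision", req, { jobId });
  return jobId;
}

// In-memory stand-in (assumption) that ignores an add with an existing ID.
class InMemoryQueue<T> implements QueueLike<T> {
  readonly jobs = new Map<string, T>();

  async add(_name: string, data: T, opts: { jobId: string }): Promise<unknown> {
    if (!this.jobs.has(opts.jobId)) this.jobs.set(opts.jobId, data);
    return undefined;
  }
}
```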

---

### 3. DNS-Based Routing Side Effects

Current architecture relies on wildcard DNS:

```
dev-<vmid>.zerolaghub.dev → Traefik → API → container
```

This may be introducing timing-related issues:

- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed

This is especially problematic during:

- First-time provisioning
- Rapid create/delete cycles
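
To illustrate keeping readiness independent of DNS, the sketch below derives readiness from container and agent state only and treats hostname resolution as a separate, informational signal. The type names, the string values for agent state, and `classifyReadiness` are assumptions made for this example; the three agent states mirror the ones listed under Database State Drift.

```typescript
// Agent states as described earlier in this document; the literals are assumed.
type AgentState = "running" | "crashed" | "not_initialized";

interface InstanceObservation {
  containerExists: boolean;  // reported by the hypervisor
  agentState: AgentState;    // reported by the agent
  hostnameResolves: boolean; // DNS / routing check, informational only
}

type Readiness = "ready" | "ready_routing_pending" | "not_ready";

// Readiness is decided by internal state. A slow DNS or routing path can
// delay reachability, but it never turns a healthy instance into a failure.
function classifyReadiness(obs: InstanceObservation): Readiness {
  if (!obs.containerExists || obs.agentState !== "running") {
    return "not_ready";
  }
  return obs.hostnameResolves ? "ready" : "ready_routing_pending";
}
```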

---

## Areas to Review

### Database Layer

- `ContainerInstance` lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request

### Provisioning Flow

- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once
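
The "exactly once" requirement can be sketched as a compare-and-set on the status field: a completion is applied only if the record is still in the state the job expects, so a duplicate or retried completion becomes a no-op. The status values, `InstanceStatusStore`, and `completeProvisioning` below are hypothetical, and a `Map` stands in for the database table.

```typescript
// Hypothetical status values; the real ContainerInstance fields may differ.
type Status = "provisioning" | "running" | "failed";

class InstanceStatusStore {
  private readonly rows = new Map<number, Status>();

  create(vmid: number): void {
    if (this.rows.has(vmid)) throw new Error(`instance ${vmid} already exists`);
    this.rows.set(vmid, "provisioning");
  }

  get(vmid: number): Status | undefined {
    return this.rows.get(vmid);
  }

  // Compare-and-set: the update applies only if the row is still in `from`.
  // A SQL equivalent would be UPDATE ... WHERE vmid = ? AND status = ?.
  compareAndSet(vmid: number, from: Status, to: Status): boolean {
    if (this.rows.get(vmid) !== from) return false;
    this.rows.set(vmid, to);
    return true;
  }
}

// Called when a provision job finishes. Returns false when the completion
// was already recorded, which signals a duplicate run to the caller.
function completeProvisioning(
  store: InstanceStatusStore,
  vmid: number,
  ok: boolean
): boolean {
  return store.compareAndSet(vmid, "provisioning", ok ? "running" : "failed");
}
```

A `false` return is worth logging, since it points at the duplicate execution this document is trying to track down.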

### Termination Flow

- Ensure delete: removes container in Proxmox, updates DB state correctly, handles partial failures (e.g., container already gone)
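
A minimal sketch of a delete that tolerates the "container already gone" case follows. It assumes a hypothetical `Hypervisor` wrapper that reports `"not_found"` instead of throwing when the container does not exist, and a hypothetical `InstanceRepo` whose `markTerminated` is safe to call repeatedly; neither is an actual Proxmox or project API.

```typescript
// Hypothetical wrapper around the Proxmox API: reports "not_found" when the
// container is already gone; any other failure is expected to throw.
interface Hypervisor {
  destroyContainer(vmid: number): Promise<"destroyed" | "not_found">;
}

// Hypothetical DB access; markTerminated is assumed to be idempotent.
interface InstanceRepo {
  markTerminated(vmid: number): Promise<void>;
}

type TerminateOutcome = "destroyed" | "already_gone";

async function terminateInstance(
  vmid: number,
  hv: Hypervisor,
  repo: InstanceRepo
): Promise<TerminateOutcome> {
  // Step 1: remove the container. If this throws, the DB row is left
  // untouched so the delete can be retried later.
  const result = await hv.destroyContainer(vmid);

  // Step 2: record the outcome. A missing container is already the desired
  // end state, so it is recorded as terminated as well.
  await repo.markTerminated(vmid);

  return result === "destroyed" ? "destroyed" : "already_gone";
}
```

Calling `terminateInstance` twice for the same VMID should then be harmless: the second call sees `"not_found"` and leaves the record terminated.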

### Agent Interaction

- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate `/status` vs DB consistency

### DNS / Routing (Secondary Concern)

- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
- Consider temporarily validating flows using direct IP instead of DNS

---

## Goal

Restore deterministic lifecycle management:

```
Request → Single DB record → Single provision job → Single container → Clean termination
```

No duplicates, no drift, no reliance on DNS timing.

---

## Key Principle

**The database must be the source of truth, not DNS or runtime reachability.**

---

## Next Step

Full audit of:

- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
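
As a starting point for the reconciliation audit, the sketch below compares the VMIDs the database considers live with the VMIDs Proxmox reports, and lists the differences in both directions. `DbInstance`, `DriftReport`, `findDrift`, and the `"terminated"` status value are assumptions for illustration; fetching the two input lists is left out.

```typescript
// Hypothetical row shape; any status other than "terminated" is treated
// as "should have a container".
interface DbInstance {
  vmid: number;
  status: string;
}

interface DriftReport {
  missingInProxmox: number[];  // DB expects a container, Proxmox has none
  orphanedInProxmox: number[]; // Proxmox has a container, DB has no live record
}

function findDrift(dbRows: DbInstance[], proxmoxVmids: number[]): DriftReport {
  const live = new Set<number>();
  dbRows.forEach((row) => {
    if (row.status !== "terminated") live.add(row.vmid);
  });
  const actual = new Set<number>(proxmoxVmids);

  const byNumber = (a: number, b: number) => a - b;
  return {
    missingInProxmox: Array.from(live).filter((v) => !actual.has(v)).sort(byNumber),
    orphanedInProxmox: Array.from(actual).filter((v) => !live.has(v)).sort(byNumber),
  };
}
```

An empty report in both fields would mean the DB and Proxmox agree; anything else is a concrete list of records or containers to inspect.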