Archive resolved: debug-provisioning-consistency
Parent: 2a49eb9054
Commit: 7c1e958921

@@ -1,121 +0,0 @@

# ZeroLagHub – Debug Focus: Provisioning + Database Consistency

## Context

Recent changes implemented deterministic Fabric provisioning (mods and config pre-installed before first boot). This worked once but has not been reliable across subsequent server creations.

There are ongoing issues with:

- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state

---

## Suspected Root Causes

### 1. Database State Drift

There may be inconsistencies between:

- Database (`ContainerInstance`, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)

Potential issues:

- Multiple job executions writing to DB
- Missing transactional guarantees during create/delete
- Lack of idempotency in provisioning flow
- VMID or port allocation not properly locked
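
One possible way to address the idempotency gap is to key each create request by a caller-supplied identifier, so a repeated execution returns the existing record instead of writing a second one. The sketch below is in-memory only; `requestId`, `createInstance`, and the `Map` standing in for the `ContainerInstance` table are illustrative assumptions, not the project's actual schema or code.

```typescript
// Hypothetical shape; the real ContainerInstance model may differ.
interface ContainerInstance {
  id: number;
  requestId: string; // assumed idempotency key supplied by the caller
  status: "pending" | "running" | "deleted";
}

// In-memory stand-in for the table, indexed by requestId.
// In a real database this role would be played by a UNIQUE constraint.
const instancesByRequest = new Map<string, ContainerInstance>();
let nextId = 1;

// Returns the existing record when the same requestId is seen again,
// so a double-submitted request cannot create a second row.
function createInstance(requestId: string): { instance: ContainerInstance; created: boolean } {
  const existing = instancesByRequest.get(requestId);
  if (existing !== undefined) {
    return { instance: existing, created: false };
  }
  const instance: ContainerInstance = { id: nextId++, requestId, status: "pending" };
  instancesByRequest.set(requestId, instance);
  return { instance, created: true };
}
```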

---

### 2. Job / Queue Behavior (BullMQ)

Provisioning may be executing more than once due to:

- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys
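
Deduplication by deterministic job ID could look roughly like the sketch below. The queue here is a stand-in written for illustration, not the BullMQ API; BullMQ documents a `jobId` job option under which an add with an already-existing ID is ignored, but whether that fits this codebase is an assumption. `DedupQueue` and the `provision-<id>` naming are hypothetical.

```typescript
// Stand-in queue that ignores an add when the job ID already exists.
// This is NOT the BullMQ API; it only illustrates the deduplication idea.
interface ProvisionJob {
  jobId: string;
  instanceId: number;
}

class DedupQueue {
  private jobs = new Map<string, ProvisionJob>();

  // Returns true if the job was enqueued, false if it was a duplicate.
  add(instanceId: number): boolean {
    const jobId = `provision-${instanceId}`; // deterministic: one ID per instance
    if (this.jobs.has(jobId)) {
      return false;
    }
    this.jobs.set(jobId, { jobId, instanceId });
    return true;
  }

  // Number of distinct jobs currently held.
  count(): number {
    return this.jobs.size;
  }
}
```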

---

### 3. DNS-Based Routing Side Effects

Current architecture relies on wildcard DNS:

```
dev-<vmid>.zerolaghub.dev → Traefik → API → container
```

This may be introducing timing-related issues:

- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed

This is especially problematic during:

- First-time provisioning
- Rapid create/delete cycles
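
To separate lifecycle readiness from routing reachability, readiness could be derived from internal state alone, with DNS tracked as its own signal. A minimal sketch under assumptions: the status values, `InstanceSnapshot`, and `evaluateHealth` are hypothetical names, and `dnsResolves` is assumed to come from some separate probe.

```typescript
// Hypothetical status values; the agent states mirror those listed earlier
// (running, crashed, not initialized).
type DbStatus = "provisioning" | "running" | "failed";
type AgentStatus = "running" | "crashed" | "not_initialized";

interface InstanceSnapshot {
  dbStatus: DbStatus;
  agentStatus: AgentStatus;
  dnsResolves: boolean; // observed separately, e.g. by a routing probe
}

interface Health {
  ready: boolean;    // lifecycle readiness, from internal state only
  routable: boolean; // whether the hostname path is usable right now
}

// Readiness never depends on DNS; a delayed hostname only affects `routable`.
function evaluateHealth(s: InstanceSnapshot): Health {
  const ready = s.dbStatus === "running" && s.agentStatus === "running";
  return { ready, routable: ready && s.dnsResolves };
}
```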

---

## Areas to Review

### Database Layer

- `ContainerInstance` lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request
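
The uniqueness requirement for VMIDs (and, by the same pattern, ports) can be sketched as an allocator that reserves a value in the same step that picks it. This is an in-memory illustration only; `VmidAllocator` and the range bounds are hypothetical, and in the real system the reservation would need to happen inside a transaction or be backed by a unique constraint.

```typescript
// Hands out the lowest free VMID in [min, max] and reserves it immediately,
// so the same value cannot be returned twice until it is released.
class VmidAllocator {
  private reserved = new Set<number>();
  private min: number;
  private max: number;

  constructor(min: number, max: number) {
    this.min = min;
    this.max = max;
  }

  allocate(): number {
    for (let vmid = this.min; vmid <= this.max; vmid++) {
      if (!this.reserved.has(vmid)) {
        this.reserved.add(vmid); // pick and reserve in one step
        return vmid;
      }
    }
    throw new Error("VMID range exhausted");
  }

  // Makes a VMID available again, e.g. after a confirmed deletion.
  release(vmid: number): void {
    this.reserved.delete(vmid);
  }
}
```

Picking and reserving in one step is the point of the sketch: a "find free, then insert later" flow leaves a window in which two jobs can choose the same VMID.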

### Provisioning Flow

- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once
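
An "exactly once" completion update can be approximated with a compare-and-set on the status field: the write only succeeds if the row is still in the expected state. The sketch below is in-memory; `InstanceRow`, the status names, and `transition` are assumptions. In SQL terms it is comparable to an `UPDATE ... WHERE id = ? AND status = ?` followed by a check of the affected-row count.

```typescript
type Status = "pending" | "provisioning" | "running" | "failed";

interface InstanceRow {
  id: number;
  status: Status;
}

// Applies the transition only if the row is still in the `from` state.
// Returns false when another execution has already moved the row on.
function transition(row: InstanceRow, from: Status, to: Status): boolean {
  if (row.status !== from) {
    return false;
  }
  row.status = to;
  return true;
}
```

A retried or duplicated job then sees `false` on its completion write and can exit without touching the record again.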

### Termination Flow

- Ensure delete:
  - removes the container in Proxmox
  - updates DB state correctly
  - handles partial failures (e.g., container already gone)
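
A delete path that tolerates the "container already gone" case might be shaped like this. The `Set` of VMIDs is a stand-in for Proxmox, and `terminate`, `InstanceRecord`, and the status names are hypothetical; no real Proxmox API call is shown.

```typescript
type DeleteOutcome = "deleted" | "already_gone";

interface InstanceRecord {
  vmid: number;
  status: "running" | "deleting" | "deleted";
}

// proxmoxVmids stands in for the set of containers that currently exist.
function terminate(record: InstanceRecord, proxmoxVmids: Set<number>): DeleteOutcome {
  record.status = "deleting"; // record intent before touching the container
  const existed = proxmoxVmids.delete(record.vmid); // false if it was already gone
  record.status = "deleted"; // a missing container still counts as success
  return existed ? "deleted" : "already_gone";
}
```

Calling `terminate` twice for the same record yields `"deleted"` and then `"already_gone"`, with the record ending in the same state either way.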

### Agent Interaction

- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate `/status` vs DB consistency

### DNS / Routing (Secondary Concern)

- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
- Consider temporarily validating flows using direct IP instead of DNS

---

## Goal

Restore deterministic lifecycle management:

```
Request → Single DB record → Single provision job → Single container → Clean termination
```

No duplicates, no drift, no reliance on DNS timing.

---

## Key Principle

**The database must be the source of truth, not DNS or runtime reachability.**

---

## Next Step

Full audit of:

- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
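
For the reconciliation item, a first pass could simply diff the VMIDs the database considers active against the VMIDs Proxmox reports. A minimal sketch; `reconcile` and `DriftReport` are hypothetical names, and how each list is obtained is left out.

```typescript
interface DriftReport {
  missingInProxmox: number[];  // DB says active, but no container exists
  orphanedInProxmox: number[]; // container exists, but no active DB record
}

// Compares the two VMID lists and reports drift in both directions.
function reconcile(dbVmids: number[], proxmoxVmids: number[]): DriftReport {
  const db = new Set(dbVmids);
  const px = new Set(proxmoxVmids);
  return {
    missingInProxmox: dbVmids.filter((v) => !px.has(v)),
    orphanedInProxmox: proxmoxVmids.filter((v) => !db.has(v)),
  };
}
```

With the database as the source of truth, orphaned containers are candidates for cleanup and missing ones point at records whose status needs correcting.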