# ZeroLagHub Debug Focus: Provisioning + Database Consistency
## Context
Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). It succeeded once but has not been reliable across subsequent server creations.
There are ongoing issues with:
- Duplicate server creation during a single request
- Inconsistent or failed server termination
- Potential mismatch between system state and database state
---
## Suspected Root Causes
### 1. Database State Drift
There may be inconsistencies between:
- Database (`ContainerInstance`, status fields, VMID allocation)
- Actual Proxmox state (container exists / does not exist)
- Agent state (running, crashed, not initialized)
Potential issues:
- Multiple job executions writing to DB
- Missing transactional guarantees during create/delete
- Lack of idempotency in provisioning flow
- VMID or port allocation not properly locked
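
The allocation and idempotency gaps above can be illustrated with a small in-memory sketch. `VmidAllocator`, the VMID range, and keying on a request ID are assumptions made for illustration only; in a real database the equivalent would be a transaction plus a unique constraint on the VMID column.

```typescript
// Hypothetical in-memory allocator (not the project's actual code).
// It hands out each VMID at most once and is idempotent per request ID,
// so a retried or duplicated job cannot consume a second VMID.
class VmidAllocator {
  private readonly byRequest = new Map<string, number>(); // requestId -> vmid
  private readonly inUse = new Set<number>();
  private readonly min: number;
  private readonly max: number;

  constructor(min: number, max: number) {
    this.min = min;
    this.max = max;
  }

  // Returns the existing VMID when the same requestId is seen again.
  allocate(requestId: string): number {
    const existing = this.byRequest.get(requestId);
    if (existing !== undefined) return existing;
    for (let vmid = this.min; vmid <= this.max; vmid++) {
      if (!this.inUse.has(vmid)) {
        this.inUse.add(vmid);
        this.byRequest.set(requestId, vmid);
        return vmid;
      }
    }
    throw new Error("VMID range exhausted");
  }

  // Releasing an unknown or already-released request is a no-op.
  release(requestId: string): void {
    const vmid = this.byRequest.get(requestId);
    if (vmid === undefined) return;
    this.inUse.delete(vmid);
    this.byRequest.delete(requestId);
  }
}
```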
---
### 2. Job / Queue Behavior (BullMQ)
Provisioning may be executing more than once due to:
- Duplicate job enqueueing
- Retry logic triggering unintended second runs
- Missing job deduplication or idempotency keys
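
Deduplication on a deterministic job ID is one way to rule out double enqueueing. BullMQ accepts a custom `jobId` in the options passed to `Queue.add`, and its documentation describes a job whose ID already exists as not being added again; whether that covers this case depends on how long finished jobs are retained, which should be verified. The sketch below is an in-memory stand-in, and `DedupQueue` and `provisionJobId` are hypothetical names, not BullMQ APIs.

```typescript
// In-memory stand-in for a queue that deduplicates on a caller-supplied ID.
interface ProvisionJob {
  id: string;
  serverId: string;
}

class DedupQueue {
  private readonly jobs = new Map<string, ProvisionJob>();

  // Returns true if the job was enqueued, false if the ID is already present.
  add(job: ProvisionJob): boolean {
    if (this.jobs.has(job.id)) return false;
    this.jobs.set(job.id, job);
    return true;
  }

  get size(): number {
    return this.jobs.size;
  }
}

// Derive the job ID from the server record so that a double-submitted
// request or an accidental second enqueue maps to the same job.
function provisionJobId(serverId: string): string {
  return `provision-${serverId}`;
}

const queue = new DedupQueue();
const job = { id: provisionJobId("srv-42"), serverId: "srv-42" };
const firstAdd = queue.add(job);  // enqueued
const secondAdd = queue.add(job); // ignored as a duplicate
```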
---
### 3. DNS-Based Routing Side Effects
Current architecture relies on wildcard DNS:
```
dev-<vmid>.zerolaghub.dev → Traefik → API → container
```
This may be introducing timing-related issues:
- Server considered "ready" before DNS propagation / routing works
- API relying on hostname instead of internal state
- Failure paths incorrectly handled when DNS resolution or routing is delayed
This is especially problematic during:
- First-time provisioning
- Rapid create/delete cycles
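
To make the distinction concrete, readiness can be computed from internal state only, with DNS kept as an informational field. This is a sketch under assumptions: the field names, the status values, and the idea of an agent health check over the internal IP are all illustrative.

```typescript
// Hypothetical readiness check: DNS is recorded but never consulted.
type ReadinessStatus = "provisioning" | "running" | "failed" | "deleted";

interface ReadinessInput {
  dbStatus: ReadinessStatus; // status stored on the DB record
  agentHealthy: boolean;     // assumed: agent health check over the internal IP
  dnsResolves: boolean;      // informational only, not part of the decision
}

function isReady(input: ReadinessInput): boolean {
  return input.dbStatus === "running" && input.agentHealthy;
}
```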
---
## Areas to Review
### Database Layer
- `ContainerInstance` lifecycle: creation, status transitions, deletion
- VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
- Port allocation: ensure no duplicate assignment
- Verify no duplicate records created per request
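
When auditing the `ContainerInstance` lifecycle, it can help to write the allowed status transitions down as a table and reject everything else. The status names and the transition table below are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical status set and transition table.
type LifecycleStatus =
  | "pending"
  | "provisioning"
  | "running"
  | "deleting"
  | "deleted"
  | "failed";

const ALLOWED_TRANSITIONS: Record<LifecycleStatus, readonly LifecycleStatus[]> = {
  pending: ["provisioning", "failed"],
  provisioning: ["running", "failed"],
  running: ["deleting", "failed"],
  failed: ["deleting"],
  deleting: ["deleted", "failed"],
  deleted: [], // terminal
};

// Returns the next status, or throws if the transition is not in the table.
function transition(current: LifecycleStatus, next: LifecycleStatus): LifecycleStatus {
  if (!ALLOWED_TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```

With a guard like this, a stale job that tries to move a `deleted` record back to `running` fails loudly instead of silently creating drift.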
### Provisioning Flow
- Ensure create endpoint is idempotent and guarded against double execution
- Confirm only one job is enqueued per request
- Validate job completion updates DB exactly once
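
One way to check the "exactly once" property is a compare-and-set on the status column, so that only the first completion takes effect. The sketch below is in-memory; `InstanceStore` and `completeProvisioning` are hypothetical names, and the SQL in the comment only shows the general shape of a conditional update.

```typescript
type ProvisionStatus = "provisioning" | "running" | "failed";

class InstanceStore {
  private readonly rows = new Map<string, ProvisionStatus>();

  insert(id: string): void {
    this.rows.set(id, "provisioning");
  }

  // Compare-and-set. Roughly the in-memory analogue of:
  //   UPDATE ... SET status = :next WHERE id = :id AND status = :expected
  // followed by checking that exactly one row was affected.
  compareAndSet(id: string, expected: ProvisionStatus, next: ProvisionStatus): boolean {
    if (this.rows.get(id) !== expected) return false;
    this.rows.set(id, next);
    return true;
  }

  get(id: string): ProvisionStatus | undefined {
    return this.rows.get(id);
  }
}

// Returns true only for the first completion; a duplicate job run gets false
// and can skip any follow-up side effects.
function completeProvisioning(store: InstanceStore, id: string): boolean {
  return store.compareAndSet(id, "provisioning", "running");
}
```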
### Termination Flow
- Ensure delete removes the container in Proxmox, updates DB state correctly, and handles partial failures (e.g., container already gone)
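
A delete flow that tolerates partial failure might look like the sketch below. The `Hypervisor` interface is a hypothetical abstraction over the Proxmox call, not the real Proxmox API, and the status names are assumptions. The key points are that "already gone" counts as success and that a thrown error leaves the record in `deleting` so a retry can resume.

```typescript
type DestroyResult = "destroyed" | "not_found";

// Hypothetical wrapper around the hypervisor call.
interface Hypervisor {
  destroy(vmid: number): DestroyResult; // may throw on transient errors
}

interface InstanceRow {
  vmid: number;
  status: "running" | "deleting" | "deleted";
}

function deleteInstance(
  row: InstanceRow,
  hypervisor: Hypervisor
): DestroyResult | "already_deleted" {
  if (row.status === "deleted") return "already_deleted"; // idempotent re-run
  row.status = "deleting"; // record intent before touching the hypervisor
  // If destroy throws, the row stays in "deleting" and the call can be retried.
  const result = hypervisor.destroy(row.vmid);
  row.status = "deleted"; // both "destroyed" and "not_found" mean it is gone
  return result;
}
```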
### Agent Interaction
- Confirm agent is not re-triggering provisioning or reporting stale state
- Validate `/status` vs DB consistency
### DNS / Routing (Secondary Concern)
- Evaluate whether DNS-based routing is being used as a source of truth (should not be)
- Consider temporarily validating flows using direct IP instead of DNS
---
## Goal
Restore deterministic lifecycle management:
```
Request → Single DB record → Single provision job → Single container → Clean termination
```
No duplicates, no drift, no reliance on DNS timing.

---
## Key Principle
**The database must be the source of truth, not DNS or runtime reachability.**

---
## Next Step
Full audit of:
- DB writes during create/delete
- Job queue behavior
- VMID + resource allocation
- State reconciliation between DB and Proxmox
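
For the reconciliation item, a first pass can be a plain set difference between the VMIDs the database believes exist and the VMIDs Proxmox reports. The sketch below assumes both lists have already been fetched; `findDrift` and the field names are hypothetical.

```typescript
interface DriftReport {
  missingInProxmox: number[];  // DB has a record, Proxmox has no container
  orphanedInProxmox: number[]; // Proxmox has a container, DB has no record
}

function findDrift(
  dbVmids: readonly number[],
  proxmoxVmids: readonly number[]
): DriftReport {
  const inDb = new Set(dbVmids);
  const inProxmox = new Set(proxmoxVmids);
  return {
    missingInProxmox: dbVmids
      .filter((vmid) => !inProxmox.has(vmid))
      .sort((a, b) => a - b),
    orphanedInProxmox: proxmoxVmids
      .filter((vmid) => !inDb.has(vmid))
      .sort((a, b) => a - b),
  };
}

// Example: 101 exists only in the DB, 104 exists only in Proxmox.
const drift = findDrift([101, 102, 103], [102, 103, 104]);
```

Running a report like this before and after a create/delete cycle would show whether a given flow introduces drift.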