zlh-grind/SCRATCH/debug-provisioning-consistency.md

ZeroLagHub Debug Focus: Provisioning + Database Consistency

Context

Recent changes implemented deterministic Fabric provisioning (mods + config pre-installed before first boot). This worked once but has not been reliable across subsequent server creations.

There are ongoing issues with:

  • Duplicate server creation during a single request
  • Inconsistent or failed server termination
  • Potential mismatch between system state and database state

Suspected Root Causes

1. Database State Drift

There may be inconsistencies between:

  • Database (ContainerInstance, status fields, VMID allocation)
  • Actual Proxmox state (container exists / does not exist)
  • Agent state (running, crashed, not initialized)

Potential issues:

  • Multiple executions of the same job writing to the DB
  • Missing transactional guarantees during create/delete
  • Lack of idempotency in provisioning flow
  • VMID or port allocation not properly locked
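
One way to close the allocation gap is to make "look up, pick a VMID, insert" a single write-locked transaction, backed by UNIQUE constraints. The sketch below uses Python's built-in sqlite3 purely as a stand-in for the real database. The table layout, the `request_id` column, and the starting VMID of 5000 are assumptions for illustration, not the actual ContainerInstance schema.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode so that the
    # BEGIN / COMMIT statements below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE container_instance ("
        " id INTEGER PRIMARY KEY,"
        " request_id TEXT NOT NULL UNIQUE,"  # hypothetical idempotency key
        " vmid INTEGER NOT NULL UNIQUE,"     # the DB itself rejects duplicates
        " status TEXT NOT NULL)"
    )
    return conn


def allocate_vmid(conn: sqlite3.Connection, request_id: str,
                  first_vmid: int = 5000) -> int:
    # BEGIN IMMEDIATE takes the write lock before the read, so two
    # allocators cannot both observe the same MAX(vmid).
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT vmid FROM container_instance WHERE request_id = ?",
            (request_id,),
        ).fetchone()
        if row is not None:
            # Same request seen again: return the existing allocation.
            conn.execute("COMMIT")
            return row[0]
        (max_vmid,) = conn.execute(
            "SELECT MAX(vmid) FROM container_instance"
        ).fetchone()
        vmid = first_vmid if max_vmid is None else max_vmid + 1
        conn.execute(
            "INSERT INTO container_instance (request_id, vmid, status)"
            " VALUES (?, ?, 'provisioning')",
            (request_id, vmid),
        )
        conn.execute("COMMIT")
        return vmid
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Calling `allocate_vmid` twice with the same `request_id` returns the same VMID and leaves a single row. MAX + 1 is a simplification; the point is that the read and the insert happen under one lock and that the UNIQUE constraints act as a last line of defense.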

2. Job / Queue Behavior (BullMQ)

Provisioning may be executing more than once due to:

  • Duplicate job enqueueing
  • Retry logic triggering unintended second runs
  • Missing job deduplication or idempotency keys
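
Deduplication can be illustrated with a job ID derived deterministically from the request, so that a second enqueue for the same request is a no-op. The queue below is an in-memory stand-in, not BullMQ. BullMQ documents a custom `jobId` option with similar "already exists, not added" behavior, but whether the current code passes one is something the audit has to confirm. `provision_job_id` and `DedupQueue` are hypothetical names.

```python
def provision_job_id(request_id: str) -> str:
    # Deterministic: the same request always maps to the same job ID.
    return f"provision-{request_id}"


class DedupQueue:
    """In-memory stand-in for a queue that ignores duplicate job IDs."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}

    def add(self, job_id: str, payload: dict) -> bool:
        # Returns False (and enqueues nothing) if the ID is already known.
        if job_id in self._jobs:
            return False
        self._jobs[job_id] = payload
        return True

    def __len__(self) -> int:
        return len(self._jobs)
```

With this shape, a double-submitted request or a retried enqueue call produces one job instead of two. Retries of the job itself still need an idempotent worker, since the same job may legitimately run more than once.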

3. DNS-Based Routing Side Effects

Current architecture relies on wildcard DNS:

dev-<vmid>.zerolaghub.dev → Traefik → API → container

This may be introducing timing-related issues:

  • Server considered "ready" before DNS propagation / routing works
  • API relying on hostname instead of internal state
  • Failure paths incorrectly handled when DNS resolution or routing is delayed

This is especially problematic during:

  • First-time provisioning
  • Rapid create/delete cycles
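
A minimal sketch of separating readiness from routing: readiness is computed only from internal state, and DNS/routing lag is reported as its own condition instead of a failure. The field names and status strings are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceState:
    db_status: str           # hypothetical values: "provisioning", "running", "deleted"
    container_running: bool  # as reported by the hypervisor
    agent_status: str        # "running", "crashed", or "not_initialized"
    hostname_resolves: bool  # whether dev-<vmid> currently resolves and routes


def is_ready(state: InstanceState) -> bool:
    # Deliberately ignores hostname_resolves: readiness is internal state only.
    return (
        state.db_status == "running"
        and state.container_running
        and state.agent_status == "running"
    )


def routing_pending(state: InstanceState) -> bool:
    # The server is ready but the hostname is not reachable yet.
    # This is a "wait and retry" condition, not a provisioning failure.
    return is_ready(state) and not state.hostname_resolves
```

Treating `routing_pending` as a distinct state keeps a slow DNS or Traefik update from triggering failure paths such as cleanup or re-provisioning.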

Areas to Review

Database Layer

  • ContainerInstance lifecycle: creation, status transitions, deletion
  • VMID allocation logic: ensure strict uniqueness, confirm locking or transactional safety
  • Port allocation: ensure no duplicate assignment
  • Verify no duplicate records created per request
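
To check existing data for duplicates, a GROUP BY / HAVING query per column is usually enough. The sketch below runs against an in-memory sqlite3 table. The column names (`request_id`, `vmid`, `port`) are assumed and would need to be mapped onto the real ContainerInstance fields.

```python
import sqlite3

AUDITED_COLUMNS = {"request_id", "vmid", "port"}  # assumed column names


def make_audit_db(rows: list[tuple]) -> sqlite3.Connection:
    # No UNIQUE constraints here on purpose: the audit must be able to
    # see duplicates that already slipped in.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE container_instance"
        " (request_id TEXT, vmid INTEGER, port INTEGER)"
    )
    conn.executemany("INSERT INTO container_instance VALUES (?, ?, ?)", rows)
    return conn


def find_duplicates(conn: sqlite3.Connection, column: str) -> list[tuple]:
    # Column names cannot be bound as parameters, so whitelist them.
    if column not in AUDITED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    return conn.execute(
        f"SELECT {column}, COUNT(*) FROM container_instance"
        f" GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    ).fetchall()
```

Each returned tuple is `(value, occurrences)`. An empty list for all three columns means no duplicate requests, VMIDs, or ports are currently recorded.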

Provisioning Flow

  • Ensure create endpoint is idempotent and guarded against double execution
  • Confirm only one job is enqueued per request
  • Validate job completion updates DB exactly once
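
"Updates DB exactly once" can be enforced with a guarded state transition: the UPDATE only matches while the row is still in the expected prior state, and the affected row count tells the caller whether it won. A sketch with sqlite3 as a stand-in; the `container_instance` table with `id` and `status` columns and the status names are assumptions.

```python
import sqlite3


def mark_provisioned(conn: sqlite3.Connection, instance_id: int) -> bool:
    # Compare-and-set: only a row still in 'provisioning' is updated.
    cur = conn.execute(
        "UPDATE container_instance SET status = 'running'"
        " WHERE id = ? AND status = 'provisioning'",
        (instance_id,),
    )
    conn.commit()
    # rowcount is 1 for the first completion and 0 for any later one.
    return cur.rowcount == 1
```

A retried or duplicated job that reaches completion a second time gets `False` back and can skip its side effects instead of overwriting state.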

Termination Flow

  • Ensure delete removes the container in Proxmox, updates DB state correctly, and handles partial failures (e.g., container already gone)
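
An idempotent termination path might look like the sketch below: "container already gone" counts as success, while any other failure leaves the DB untouched so a retry can finish the job. `ContainerNotFound`, `destroy_container`, and `mark_deleted` are hypothetical stand-ins for the real Proxmox call and DB update.

```python
from typing import Callable


class ContainerNotFound(Exception):
    """Hypothetical error raised when the container does not exist."""


def terminate(
    vmid: int,
    destroy_container: Callable[[int], None],
    mark_deleted: Callable[[int], None],
) -> str:
    try:
        destroy_container(vmid)
        outcome = "destroyed"
    except ContainerNotFound:
        # Partial failure from an earlier attempt: the container is
        # already gone, so only the DB state still needs fixing.
        outcome = "already_gone"
    # Any other exception propagates before this line, so the DB is
    # never marked deleted while the container may still exist.
    mark_deleted(vmid)
    return outcome
```

Running `terminate` twice for the same VMID returns `"destroyed"` and then `"already_gone"`, and both runs end with the DB marked deleted.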

Agent Interaction

  • Confirm agent is not re-triggering provisioning or reporting stale state
  • Validate /status vs DB consistency

DNS / Routing (Secondary Concern)

  • Evaluate whether DNS-based routing is being used as a source of truth (should not be)
  • Consider temporarily validating flows using direct IP instead of DNS

Goal

Restore deterministic lifecycle management:

Request → Single DB record → Single provision job → Single container → Clean termination

No duplicates, no drift, no reliance on DNS timing.


Key Principle

The database must be the source of truth, not DNS or runtime reachability.


Next Step

Full audit of:

  • DB writes during create/delete
  • Job queue behavior
  • VMID + resource allocation
  • State reconciliation between DB and Proxmox
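
For the reconciliation item, a first pass can be a plain set difference between the VMIDs the DB believes are active and the VMIDs Proxmox actually reports. How each list is fetched is left out; `reconcile` and its result keys are illustrative names.

```python
from typing import Iterable


def reconcile(db_vmids: Iterable[int],
              proxmox_vmids: Iterable[int]) -> dict[str, list[int]]:
    db, px = set(db_vmids), set(proxmox_vmids)
    return {
        # Recorded in the DB but no container exists.
        "missing_in_proxmox": sorted(db - px),
        # Container exists but the DB has no active record for it.
        "orphaned_in_proxmox": sorted(px - db),
    }
```

For example, `reconcile([5000, 5001], [5001, 5002])` reports 5000 as missing in Proxmox and 5002 as orphaned. Both lists being empty is the state the audit should converge on.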