From 6d16e3e00e97dddd83f1f378b3558c912ddb03bf Mon Sep 17 00:00:00 2001
From: jester
Date: Sun, 5 Apr 2026 22:31:44 +0000
Subject: [PATCH] Update provisioning bug with confirmed root cause and add Redis/DB/worker cleanup to pre-launch checklist

---
 OPEN_THREADS.md | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/OPEN_THREADS.md b/OPEN_THREADS.md
index d421ada..a331a84 100644
--- a/OPEN_THREADS.md
+++ b/OPEN_THREADS.md
@@ -147,18 +147,26 @@ Outstanding:
 - Headscale auth key generation
 - **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart
 
-**Provisioning + DB Consistency (Active Bug — needs audit before launch):**
-- Duplicate server creation occurring during single requests
-- Inconsistent / failed server termination
-- DB state drifting from actual Proxmox + agent state
-- BullMQ jobs may be executing more than once — missing deduplication / idempotency keys
-- VMID and port allocation not confirmed to be transactionally safe
-- Create endpoint not confirmed idempotent or guarded against double execution
-- Delete flow not confirmed to handle partial failures (container already gone)
-- DNS-based routing must not be used as source of truth — DB is the source of truth
-- **internal.zlh DNS may be introducing timing issues in provisioning hot path** — replace with env-based IP config for all service-to-service calls
+**Provisioning + DB Consistency (Active Bug — root cause identified, needs cleanup):**
+
+Root cause confirmed: state inconsistency from Denver → Detroit migration, not a code bug.
+- Redis likely has stale/replayed jobs from migration overlap period
+- Denver worker may have briefly run alongside Detroit
+- DB/Redis migration may have been partial or unsynchronized
+- internal.zlh DNS adding latency/timing risk in provisioning hot path
+
+Immediate remediation steps (next session):
+1. `FLUSHALL` Redis — do NOT restore from backup, start clean
+2. Validate DB — check for duplicate VMIDs, stuck "creating" states
+3. Confirm only one worker running (Detroit only)
+4. If DB inconsistent → restore MariaDB from Denver PBS snapshot
+5. Run single controlled provisioning test — verify 1 record, 1 job, 1 agent execution
+
+Architectural fix:
+- Replace internal.zlh FQDNs with env-based IPs for all service-to-service calls
+- See `SCRATCH/service-discovery.md` for spec
 - See `SCRATCH/debug-provisioning-consistency.md` for full audit plan
-- See `SCRATCH/service-discovery.md` for env-based service discovery approach
+- See `SCRATCH/session-update-migration-stability.md` for session findings
 
 ---
 
@@ -246,7 +254,10 @@ Outstanding before launch:
 - **Billing endpoints** — add back to API
 - **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
 - **OPNsense audit** — both routers need systematic validation
-- **Provisioning + DB consistency audit** — duplicate creation, state drift, BullMQ idempotency
+- **Redis FLUSHALL** — clear stale migration jobs before launch
+- **DB consistency check** — verify no duplicate VMIDs or stuck states from migration
+- **Single worker verification** — confirm only Detroit API/worker running
+- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
 - **Service discovery migration** — replace internal.zlh with env-based IPs in API and agent hot paths
 - Remove `testdaemon` binary from zpac-portal repo root
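
Review note on remediation step 2 (duplicate VMIDs, stuck "creating" states): the check could look roughly like the sketch below. It is not code from the repository. The `ServerRow` shape, the `checkConsistency` name, and the idea of treating a row as stuck after a caller-chosen age are all assumptions made for illustration, and the rows would have to be loaded from MariaDB by whatever query layer the API already uses.

```typescript
// Hypothetical row shape; the real table schema is not shown in this patch.
interface ServerRow {
  id: number;
  vmid: number;
  status: string; // e.g. "creating", "running"
  updatedAt: Date;
}

interface ConsistencyReport {
  duplicateVmids: Map<number, number[]>; // vmid -> ids of rows sharing it
  stuckCreating: number[]; // ids of rows left in "creating" for too long
}

// Pure function: takes already-loaded rows, performs no DB access itself.
function checkConsistency(
  rows: ServerRow[],
  now: Date,
  maxCreatingMs: number
): ConsistencyReport {
  // Group row ids by VMID.
  const byVmid = new Map<number, number[]>();
  rows.forEach((row) => {
    const ids = byVmid.get(row.vmid) || [];
    ids.push(row.id);
    byVmid.set(row.vmid, ids);
  });

  // Keep only VMIDs claimed by more than one row.
  const duplicateVmids = new Map<number, number[]>();
  byVmid.forEach((ids, vmid) => {
    if (ids.length > 1) {
      duplicateVmids.set(vmid, ids);
    }
  });

  // A row counts as stuck if it is still "creating" and older than the threshold.
  const stuckCreating = rows
    .filter(
      (row) =>
        row.status === "creating" &&
        now.getTime() - row.updatedAt.getTime() > maxCreatingMs
    )
    .map((row) => row.id);

  return { duplicateVmids, stuckCreating };
}
```

An empty `duplicateVmids` map and an empty `stuckCreating` list would mean step 4 (restoring MariaDB from the Denver PBS snapshot) can probably be skipped. The threshold is left to the caller because this patch does not say how long a normal provisioning run takes.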