diff --git a/OPEN_THREADS.md b/OPEN_THREADS.md
index a331a84..75265d9 100644
--- a/OPEN_THREADS.md
+++ b/OPEN_THREADS.md
@@ -78,6 +78,19 @@ Completed:
 
 ---
 
+### Fabric Readiness Gating (Active — agent-level fix needed)
+
+Root cause confirmed: Velocity registration happens before the Fabric server is fully ready to accept proxy traffic, causing "proxy starting" errors until Velocity is restarted.
+
+Required fix in agent:
+- Do not register with Velocity until readiness probe confirms server is accepting connections
+- TCP port check (port 25565) is the most reliable signal — already used by agent probe
+- Velocity registration must be gated behind probe success, not just process start
+- Pre-seeding mods + FabricProxy config before first run ✅ already done
+- See `SCRATCH/session-stabilization-fabric-findings.md` for full findings
+
+---
+
 ### Agent Future Work (priority order)
 
 1. Structured logging (slog) for Loki
@@ -136,6 +149,11 @@ Completed:
 - `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
 - `handleHostedProxy` — host-based routing via `Host` header vmid extraction
 - hosted flow browser-verified end-to-end
+- ✅ Backend lifecycle hardened — idempotent create/delete/start/stop
+- ✅ Duplicate server creation fixed — was frontend issue, not API
+- ✅ Console routing corrected — browser → API → agent
+- ✅ Control plane switched to IP-based service communication
+- ✅ Velocity rehydration uses DB + Redis instead of Proxmox live state
 
 Outstanding:
 - dev runtime catalog endpoint for portal
 - Headscale auth key generation
 - **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart
-
-**Provisioning + DB Consistency (Active Bug — root cause identified, needs cleanup):**
-
-Root cause confirmed: state inconsistency from Denver → Detroit migration, not a code bug.
-- Redis likely has stale/replayed jobs from migration overlap period
-- Denver worker may have briefly run alongside Detroit
-- DB/Redis migration may have been partial or unsynchronized
-- internal.zlh DNS adding latency/timing risk in provisioning hot path
-
-Immediate remediation steps (next session):
-1. `FLUSHALL` Redis — do NOT restore from backup, start clean
-2. Validate DB — check for duplicate VMIDs, stuck "creating" states
-3. Confirm only one worker running (Detroit only)
-4. If DB inconsistent → restore MariaDB from Denver PBS snapshot
-5. Run single controlled provisioning test — verify 1 record, 1 job, 1 agent execution
-
-Architectural fix:
-- Replace internal.zlh FQDNs with env-based IPs for all service-to-service calls
-- See `SCRATCH/service-discovery.md` for spec
-- See `SCRATCH/debug-provisioning-consistency.md` for full audit plan
-- See `SCRATCH/session-update-migration-stability.md` for session findings
+- **Service discovery migration** — replace remaining internal.zlh FQDNs with env-based IPs in hot paths
 
 ---
 
@@ -254,11 +252,9 @@ Outstanding before launch:
 - **Billing endpoints** — add back to API
 - **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
 - **OPNsense audit** — both routers need systematic validation
-- **Redis FLUSHALL** — clear stale migration jobs before launch
-- **DB consistency check** — verify no duplicate VMIDs or stuck states from migration
-- **Single worker verification** — confirm only Detroit API/worker running
+- **Fabric readiness gating** — agent must gate Velocity registration behind TCP probe success
+- **Service discovery migration** — replace remaining internal.zlh with env-based IPs in hot paths
 - **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
-- **Service discovery migration** — replace internal.zlh with env-based IPs in API and agent hot paths
 - Remove `testdaemon` binary from zpac-portal repo root
 
 ---
@@ -302,3 +298,6 @@ Future work:
 - ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
 - ✅ Migration to Detroit — complete Apr 2, 2026
 - ✅ Internal FQDN migration — all services on internal.zlh
+- ✅ System stabilization — duplicate creation, DB/Redis drift, console routing all resolved
+- ✅ Provisioning + DB consistency — root cause was migration state pollution, now clean
+- ✅ IP-based control plane — internal.zlh removed from service-to-service hot paths
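
The readiness gate described in the Fabric section above is small enough to sketch. A minimal illustration, assuming the agent is written in Go (the slog item suggests it) and using invented names (`waitForTCP`, `registerWhenReady`) rather than the agent's real API:

```go
// Minimal sketch: gate Velocity registration behind a TCP readiness probe.
// Function and parameter names here are illustrative, not the agent's real API.
package agent

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForTCP polls addr (e.g. "127.0.0.1:25565") until a TCP dial succeeds or
// ctx is cancelled. A successful dial is the "server accepting connections" signal.
func waitForTCP(ctx context.Context, addr string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		d := net.Dialer{Timeout: 2 * time.Second}
		if conn, err := d.DialContext(ctx, "tcp", addr); err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("fabric server %s never became ready: %w", addr, ctx.Err())
		case <-ticker.C:
			// try again on the next tick
		}
	}
}

// registerWhenReady assumes the Fabric server process has already been started.
// It registers with Velocity only after the probe succeeds, never on process
// start alone, so a slow boot delays registration instead of breaking the proxy.
func registerWhenReady(ctx context.Context, serverAddr string, register func() error) error {
	probeCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()
	if err := waitForTCP(probeCtx, serverAddr, 3*time.Second); err != nil {
		return err
	}
	return register()
}
```

The point is ordering: probe success, not process start, is what unblocks registration, which is exactly the gating the launch checklist item calls for.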
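
For the service discovery migration item, the shape of the change is reading an address from the environment instead of resolving an internal.zlh name in the hot path. A sketch, again in Go, with the env key and fallback value invented for illustration:

```go
// Minimal sketch: env-based service addressing for hot-path calls.
// The key name VELOCITY_ADDR and the fallback value are assumptions for this example.
package agent

import (
	"fmt"
	"os"
)

// serviceAddr reads a host:port pair from the environment, falling back to a
// default so local development keeps working without extra configuration.
func serviceAddr(envKey, fallback string) string {
	if v := os.Getenv(envKey); v != "" {
		return v
	}
	return fallback
}

func velocityBaseURL() string {
	// An IP literal here avoids the DNS hop through internal.zlh in the provisioning hot path.
	return fmt.Sprintf("http://%s", serviceAddr("VELOCITY_ADDR", "127.0.0.1:8080"))
}
```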