Update with stabilization completions and add Fabric readiness gating as active agent fix
parent f55dfac860
commit b26fe62116
@@ -78,6 +78,19 @@ Completed:

---

### Fabric Readiness Gating (Active — agent-level fix needed)

Root cause confirmed: Velocity registration happens before the Fabric server is fully ready to accept proxy traffic, causing "proxy starting" errors until Velocity is restarted.

Required fix in agent:
- Do not register with Velocity until readiness probe confirms server is accepting connections
- TCP port check (port 25565) is the most reliable signal — already used by agent probe
- Velocity registration must be gated behind probe success, not just process start
- Pre-seeding mods + FabricProxy config before first run ✅ already done
- See `SCRATCH/session-stabilization-fabric-findings.md` for full findings

---

### Agent Future Work (priority order)

1. Structured logging (slog) for Loki
@@ -136,6 +149,11 @@ Completed:
- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
- hosted flow browser-verified end-to-end
- ✅ Backend lifecycle hardened — idempotent create/delete/start/stop
- ✅ Console routing corrected — browser → API → agent
- ✅ Duplicate server creation fixed — was frontend issue, not API
- ✅ Control plane switched to IP-based service communication
- ✅ Velocity rehydration uses DB + Redis instead of Proxmox live state
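
For the `Host` header vmid extraction mentioned in the list above, one possible shape is sketched below. The hostname layout is an assumption: the notes do not say how the vmid is encoded, so this sketch supposes hosts of the form `<vmid>.<base domain>` with a numeric leftmost label. `vmidFromHost` is a hypothetical helper, not the body of `handleHostedProxy`.

```go
package main

import (
	"errors"
	"net"
	"strconv"
	"strings"
)

// vmidFromHost extracts a numeric vmid from a Host header value.
// Assumed layout (not confirmed by the notes): "<vmid>.<baseDomain>[:port]".
func vmidFromHost(host, baseDomain string) (int, error) {
	// Drop an optional ":port" suffix; hosts without a port are left as-is.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	host = strings.ToLower(strings.TrimSuffix(host, "."))

	// Require exactly one label in front of the base domain.
	label, ok := strings.CutSuffix(host, "."+strings.ToLower(baseDomain))
	if !ok || label == "" || strings.Contains(label, ".") {
		return 0, errors.New("host is not a direct subdomain of the base domain")
	}

	// ParseUint rejects signs and non-digits, so only plain numbers pass.
	n, err := strconv.ParseUint(label, 10, 31)
	if err != nil || n == 0 {
		return 0, errors.New("leftmost label is not a positive numeric vmid")
	}
	return int(n), nil
}
```

Rejecting anything that is not a direct subdomain keeps a crafted `Host` value from steering the proxy at an unintended target.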

Outstanding:

@@ -146,27 +164,7 @@ Outstanding:
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart

**Provisioning + DB Consistency (Active Bug — root cause identified, needs cleanup):**

Root cause confirmed: state inconsistency from Denver → Detroit migration, not a code bug.
- Redis likely has stale/replayed jobs from migration overlap period
- Denver worker may have briefly run alongside Detroit
- DB/Redis migration may have been partial or unsynchronized
- internal.zlh DNS adding latency/timing risk in provisioning hot path

Immediate remediation steps (next session):
1. `FLUSHALL` Redis — do NOT restore from backup, start clean
2. Validate DB — check for duplicate VMIDs, stuck "creating" states
3. Confirm only one worker running (Detroit only)
4. If DB inconsistent → restore MariaDB from Denver PBS snapshot
5. Run single controlled provisioning test — verify 1 record, 1 job, 1 agent execution

Architectural fix:
- Replace internal.zlh FQDNs with env-based IPs for all service-to-service calls
- See `SCRATCH/service-discovery.md` for spec
- See `SCRATCH/debug-provisioning-consistency.md` for full audit plan
- See `SCRATCH/session-update-migration-stability.md` for session findings
- **Service discovery migration** — replace remaining internal.zlh FQDNs with env-based IPs in hot paths

---

@@ -254,11 +252,9 @@ Outstanding before launch:
- **Billing endpoints** — add back to API
- **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
- **OPNsense audit** — both routers need systematic validation
- **Redis FLUSHALL** — clear stale migration jobs before launch
- **DB consistency check** — verify no duplicate VMIDs or stuck states from migration
- **Single worker verification** — confirm only Detroit API/worker running
- **Fabric readiness gating** — agent must gate Velocity registration behind TCP probe success
- **Service discovery migration** — replace remaining internal.zlh with env-based IPs in hot paths
- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
- **Service discovery migration** — replace internal.zlh with env-based IPs in API and agent hot paths
- Remove `testdaemon` binary from zpac-portal repo root

---
@@ -302,3 +298,6 @@ Future work:
- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
- ✅ Migration to Detroit — complete Apr 2, 2026
- ✅ Internal FQDN migration — all services on internal.zlh
- ✅ System stabilization — duplicate creation, DB/Redis drift, console routing all resolved
- ✅ Provisioning + DB consistency — root cause was migration state pollution, now clean
- ✅ IP-based control plane — internal.zlh removed from service-to-service hot paths