diff --git a/SCRATCH/session-stabilization-fabric-findings.md b/SCRATCH/session-stabilization-fabric-findings.md new file mode 100644 index 0000000..f01f6cf --- /dev/null +++ b/SCRATCH/session-stabilization-fabric-findings.md @@ -0,0 +1,72 @@ +# ZeroLagHub – System Stabilization & Fabric Proxy Findings + +## Current State + +Following API and frontend revisions, the system is now stable and deterministic: + +- Backend lifecycle handling hardened — idempotent create/delete/start/stop +- Redis and database verified healthy +- Duplicate server creation traced to frontend (not API) +- Console routing corrected (browser → API → agent) +- Internal domain routing removed in favor of explicit IP-based communication + +--- + +## Key Architectural Improvements + +- Control plane now uses direct IP-based service communication +- Data plane remains domain-based (Traefik) +- Velocity rehydration now uses DB + Redis instead of Proxmox live state +- API now acts as the central authority for routing, console, and orchestration + +--- + +## Fabric / Velocity Issue — Root Cause + +The remaining issue with Fabric servers is NOT related to: +- Fabric API version +- FabricProxy-Lite version +- Mod installation + +The issue is caused by **timing of server readiness vs Velocity registration**: + +- Server is registered with Velocity before it is fully ready to accept proxy traffic +- This results in "proxy starting" errors until Velocity is restarted +- After restart, the server is already fully initialized and works correctly + +**Conclusion:** The system lacks a reliable readiness signal for Fabric-based servers. + +--- + +## Required Fix + +Agent (or API) must delay Velocity registration until true readiness is confirmed. + +### Recommended approach + +1. Pre-seed Fabric API and FabricProxy config before first run ✅ (already done) +2. Do not rely solely on "Done" log output +3. Introduce readiness gating — one of: + - Short delay buffer after startup + - TCP port readiness check (port 25565 accepting connections) + - Log-based readiness confirmation (watch for specific "done" log line) + +TCP port check is the most reliable — if the port is accepting connections, the server is ready. This is already what the agent's probe does. The issue is likely that Velocity registration happens before the probe confirms readiness. + +### Fix location + +Agent-level orchestration change — do not register with Velocity until the readiness probe returns success. + +Avoid modifying Velocity itself — the issue is orchestration timing, not proxy configuration. + +--- + +## Summary + +| Issue | Status | +|-------|--------| +| Duplicate server creation | ✅ Fixed — was frontend, not API | +| DB/Redis state drift | ✅ Resolved — verified healthy | +| Console routing | ✅ Fixed | +| Internal DNS timing | ✅ Resolved — replaced with IP env vars | +| Fabric readiness timing | 🔧 Pending — readiness gating needed in agent |