ZeroLagHub System Stabilization & Fabric Proxy Findings

Current State

Following API and frontend revisions, the system is now stable and deterministic:

  • Backend lifecycle handling hardened — idempotent create/delete/start/stop
  • Redis and database verified healthy
  • Duplicate server creation traced to frontend (not API)
  • Console routing corrected (browser → API → agent)
  • Internal domain routing removed in favor of explicit IP-based communication
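As an illustration of what "idempotent" means for the lifecycle calls above, here is a minimal sketch. It is not the actual API code: `ServerStore`, its method names, and the in-memory dict standing in for the database are all hypothetical. The point is only that repeating a request leaves the state unchanged and raises no error.

```python
from dataclasses import dataclass, field


@dataclass
class ServerStore:
    """Hypothetical in-memory stand-in for the real DB: server id -> state."""
    servers: dict[str, str] = field(default_factory=dict)

    def create(self, server_id: str) -> bool:
        """Create the server if absent. Returns True only when newly created."""
        if server_id in self.servers:
            return False  # repeated request: no duplicate record, no error
        self.servers[server_id] = "stopped"
        return True

    def start(self, server_id: str) -> None:
        """Starting an already running server is a no-op."""
        if server_id in self.servers:
            self.servers[server_id] = "running"

    def stop(self, server_id: str) -> None:
        """Stopping an already stopped server is a no-op."""
        if server_id in self.servers:
            self.servers[server_id] = "stopped"

    def delete(self, server_id: str) -> bool:
        """Delete if present. Returns False when there was nothing to delete."""
        return self.servers.pop(server_id, None) is not None
```

With this shape, a double-submitted create from the frontend would produce one server rather than two, which is the behavior the hardened backend is described as having.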

Key Architectural Improvements

  • Control plane now uses direct IP-based service communication
  • Data plane remains domain-based (Traefik)
  • Velocity rehydration now uses DB + Redis instead of Proxmox live state
  • API now acts as the central authority for routing, console, and orchestration

Fabric / Velocity Issue — Root Cause

The remaining issue with Fabric servers is NOT related to:

  • Fabric API version
  • FabricProxy-Lite version
  • Mod installation

The issue is caused by the ordering of Velocity registration relative to server readiness:

  • Server is registered with Velocity before it is fully ready to accept proxy traffic
  • This results in "proxy starting" errors until Velocity is restarted
  • After restart, the server is already fully initialized and works correctly

Conclusion: The system lacks a reliable readiness signal for Fabric-based servers.


Required Fix

The agent (or API) must delay Velocity registration until true readiness is confirmed:

  1. Pre-seed Fabric API and FabricProxy config before first run (already done)
  2. Do not rely solely on "Done" log output
  3. Introduce readiness gating — one of:
    • Short delay buffer after startup
    • TCP port readiness check (port 25565 accepting connections)
    • Log-based readiness confirmation (watch for specific "done" log line)

The TCP port check is the most reliable of the three options: once port 25565 accepts connections, the server is ready. The agent's probe already performs this check, so the likely problem is ordering: Velocity registration happens before the probe confirms readiness.
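For reference, a TCP readiness check of this kind can be sketched as follows. This is not the agent's actual probe; `port_is_open`, `wait_for_port`, and the default timings are assumptions chosen for illustration.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def wait_for_port(host: str, port: int, deadline_s: float = 120.0,
                  interval_s: float = 1.0) -> bool:
    """Poll host:port until it accepts a connection or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return True
        time.sleep(interval_s)
    return False
```

Usage would look like `wait_for_port(server_ip, 25565)`, where `server_ip` is whatever address the control plane already holds for the game server.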

Fix location

This is an agent-level orchestration change: do not register with Velocity until the readiness probe returns success.

Avoid modifying Velocity itself — the issue is orchestration timing, not proxy configuration.
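A minimal sketch of the gating described above, assuming the probe and the registration step can each be wrapped as a plain callable. `register_when_ready`, `probe`, and `register` are hypothetical names, not the agent's real interface; `probe` could be any readiness check, such as a TCP connect to port 25565.

```python
import time
from typing import Callable


def register_when_ready(
    probe: Callable[[], bool],
    register: Callable[[], None],
    deadline_s: float = 120.0,
    interval_s: float = 1.0,
) -> bool:
    """Call register() only after probe() returns True.

    Returns True if registration happened, False if the deadline passed
    first. register() is never called while the probe is still failing.
    """
    deadline = time.monotonic() + deadline_s
    while True:
        if probe():
            register()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Returning False instead of registering anyway keeps a server that never became ready out of Velocity, so the caller can decide whether to retry or surface the failure.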


Summary

| Issue | Status |
|---|---|
| Duplicate server creation | Fixed (was frontend, not API) |
| DB/Redis state drift | Resolved (verified healthy) |
| Console routing | Fixed |
| Internal DNS timing | Resolved (replaced with IP env vars) |
| Fabric readiness timing | 🔧 Pending (readiness gating needed in agent) |