# ZeroLagHub System Stabilization & Fabric Proxy Findings
## Current State
Following API and frontend revisions, the system is now stable and deterministic:
- Backend lifecycle handling hardened — idempotent create/delete/start/stop
- Redis and database verified healthy
- Duplicate server creation traced to frontend (not API)
- Console routing corrected (browser → API → agent)
- Internal domain routing removed in favor of explicit IP-based communication
---
## Key Architectural Improvements
- Control plane now uses direct IP-based service communication
- Data plane remains domain-based (Traefik)
- Velocity rehydration now uses DB + Redis instead of Proxmox live state
- API now acts as the central authority for routing, console, and orchestration
---
## Fabric / Velocity Issue — Root Cause
The remaining issue with Fabric servers is NOT related to:
- Fabric API version
- FabricProxy-Lite version
- Mod installation
The issue is caused by **timing of server readiness vs Velocity registration**:
- Server is registered with Velocity before it is fully ready to accept proxy traffic
- This results in "proxy starting" errors until Velocity is restarted
- After restart, the server is already fully initialized and works correctly
**Conclusion:** The system lacks a reliable readiness signal for Fabric-based servers.
---
## Required Fix
Agent (or API) must delay Velocity registration until true readiness is confirmed.
### Recommended approach
1. Pre-seed Fabric API and FabricProxy config before first run ✅ (already done)
2. Do not rely solely on "Done" log output
3. Introduce readiness gating, using one of:
   - Short delay buffer after startup
   - TCP port readiness check (port 25565 accepting connections)
   - Log-based readiness confirmation (watch for specific "done" log line)

A TCP port check is the most reliable of the three: if port 25565 is accepting connections, the server is ready. The agent's probe already performs this check, so the likely cause is ordering: Velocity registration happens before the probe confirms readiness.
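
For illustration, a TCP readiness check of this kind could look like the minimal Python sketch below. It is not the agent's actual probe: the function names, the timeout and interval defaults, and the requirement of two consecutive successful connections are all assumptions made for the example.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def wait_for_port(host: str, port: int = 25565, deadline_s: float = 120.0,
                  interval_s: float = 2.0, required_successes: int = 2) -> bool:
    """Poll until the port accepts several consecutive connections.

    Requiring more than one success (an assumption, not a documented
    requirement) guards against a port that opens briefly and then drops.
    Returns False if the deadline passes first.
    """
    deadline = time.monotonic() + deadline_s
    streak = 0
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0
        time.sleep(interval_s)
    return False
```

A caller would run something like `wait_for_port(server_ip)` and treat a `False` result as "not ready, do not register yet" rather than as a hard failure.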
### Fix location
Agent-level orchestration change — do not register with Velocity until the readiness probe returns success.
Avoid modifying Velocity itself — the issue is orchestration timing, not proxy configuration.
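
The ordering change can be sketched as a small gate in the agent: registration is only attempted once the readiness probe has returned success. The Python below is a hypothetical outline, not the agent's real code. `register_when_ready`, its parameters, and the `probe` / `register` callables are placeholders for whatever the agent uses to check readiness and to register a backend with Velocity.

```python
import time
from typing import Callable


def register_when_ready(probe: Callable[[], bool],
                        register: Callable[[], None],
                        timeout_s: float = 120.0,
                        interval_s: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call register() once, and only after probe() has returned True.

    probe    -- hypothetical readiness check (for example a TCP port probe)
    register -- hypothetical hook that registers the server with Velocity
    Returns False without registering if the probe never succeeds in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            register()  # Registration strictly follows a successful probe.
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Passing the probe and the registration step in as callables keeps the gate independent of how readiness is measured, so the same wrapper would work with a port check, a log watcher, or both combined.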
---
## Summary
| Issue | Status |
|-------|--------|
| Duplicate server creation | ✅ Fixed — was frontend, not API |
| DB/Redis state drift | ✅ Resolved — verified healthy |
| Console routing | ✅ Fixed |
| Internal DNS timing | ✅ Resolved — replaced with IP env vars |
| Fabric readiness timing | 🔧 Pending — readiness gating needed in agent |