# ZeroLagHub – Session Update: Migration, Stability, and Next Steps

## Context

System was migrated from Denver → Detroit (better CPU, same cost).

Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.

---

## Key Findings

Issues are not code-related; they are caused by state inconsistency during/after migration.

Likely causes:

- Redis queue replay or stale jobs
- Multiple workers briefly running (Denver + Detroit overlap)
- Partial or unsynchronized DB/Redis migration
- Internal domain routing introducing unintended request paths

---

## Architecture Adjustment

Moving forward, all internal service communication will use IP-based env variables:

```
ARTIFACT_SERVER=http://10.x.x.x:port
```

Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).

This enforces proper separation of:

- **Control plane**: API ↔ Agent ↔ services (IP env vars)
- **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)

See `SCRATCH/service-discovery.md` for full spec.

---

## Fabric / Vanilla Progress

Fabric server now successfully starts with:

- Fabric API present in `/mods`
- `FabricProxy-Lite.toml` configured (including forwarding secret)

Remaining issue:

- Initial startup requires Velocity restart before proxy recognizes backend

Conclusion:

- Agent must handle pre-seeding mods + config before first run
- Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)

---

## Next Session Plan (Priority Order)

### 1. Database Validation

- Check for duplicate VMIDs or container records
- Verify lifecycle states are consistent (no stuck "creating")

### 2. Redis Reset

- Do NOT restore from backup
- Start clean with `FLUSHALL`
- Ensure no duplicate or replayed jobs

### 3. Worker Verification

- Confirm only one API/worker instance is running (Detroit only)
- Confirm Denver host is fully decommissioned (it is: OS wiped Apr 2, 2026)

### 4. PBS Backup Contingency

- If DB is inconsistent → restore MariaDB from Denver snapshot
- Redis remains fresh (never restored)

### 5. Provisioning Validation

- Run a single controlled server creation
- Confirm:
  - 1 DB record
  - 1 job
  - 1 agent execution

---

## Key Takeaway

This session confirmed the system is operating as a distributed, state-driven platform:

| Layer    | Role                                            |
|----------|-------------------------------------------------|
| Redis    | Execution layer: ephemeral, never restore       |
| Database | Source of truth: authoritative state            |
| Routing  | Strict boundaries: control plane vs data plane  |

Stability depends on maintaining a single authoritative state and clean execution flow across all components.
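
As a starting point for the Database Validation step in the plan, the two checks (duplicate VMIDs, records stuck in "creating") could be sketched roughly as below. This is only an illustration: the record fields (`id`, `vmid`, `state`, `updated_at`), the function names, and the 15-minute staleness threshold are assumptions and do not reflect the actual ZeroLagHub schema. In practice the rows would come from MariaDB rather than an in-memory list.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical record shape (NOT the real schema):
#   {"id": int, "vmid": int, "state": str, "updated_at": datetime}


def find_duplicate_vmids(records):
    """Return a sorted list of VMIDs that appear on more than one record."""
    counts = Counter(r["vmid"] for r in records)
    return sorted(vmid for vmid, n in counts.items() if n > 1)


def find_stuck_creating(records, now, max_age=timedelta(minutes=15)):
    """Return ids of records still in "creating" whose last update is older than max_age.

    The 15-minute default is an arbitrary placeholder threshold.
    """
    return [
        r["id"]
        for r in records
        if r["state"] == "creating" and now - r["updated_at"] > max_age
    ]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    now = datetime.now(timezone.utc)
    sample = [
        {"id": 1, "vmid": 101, "state": "running", "updated_at": now},
        {"id": 2, "vmid": 101, "state": "creating", "updated_at": now - timedelta(hours=1)},
        {"id": 3, "vmid": 102, "state": "creating", "updated_at": now - timedelta(minutes=5)},
    ]
    print("duplicate VMIDs:", find_duplicate_vmids(sample))
    print("stuck in creating:", find_stuck_creating(sample, now))
```

Keeping the checks as pure functions over plain records means they can be run against a query result before any cleanup, and rerun after the single controlled server creation to confirm exactly one new record appeared.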