diff --git a/SCRATCH/session-update-migration-stability.md b/SCRATCH/session-update-migration-stability.md
new file mode 100644
index 0000000..afe19d8
--- /dev/null
+++ b/SCRATCH/session-update-migration-stability.md
@@ -0,0 +1,92 @@
# ZeroLagHub – Session Update: Migration, Stability, and Next Steps

## Context

The system was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.

---

## Key Findings

The issues are not code-related; they stem from state inconsistency during and after the migration.

Likely causes:
- Redis queue replay or stale jobs
- Multiple workers briefly running in parallel (Denver + Detroit overlap)
- Partial or unsynchronized DB/Redis migration
- Internal domain routing introducing unintended request paths

---

## Architecture Adjustment

Going forward, all internal service communication will use IP-based environment variables:

```
ARTIFACT_SERVER=http://10.x.x.x:port
```

Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).

This enforces a clean separation between:
- **Control plane**: API ↔ Agent ↔ services (IP env vars)
- **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)

See `SCRATCH/service-discovery.md` for the full spec.

---

## Fabric / Vanilla Progress

The Fabric server now starts successfully with:
- Fabric API present in `/mods`
- `FabricProxy-Lite.toml` configured (including the forwarding secret)

Remaining issue:
- Initial startup requires a Velocity restart before the proxy recognizes the backend

Conclusions:
- The agent must pre-seed mods and config before the first run
- The Velocity/plugin behavior is likely tied to readiness timing (the server is not fully "ready" at registration)

---

## Next Session Plan (Priority Order)

### 1. Database Validation
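This validation step can be sketched in Python over rows fetched from the database. A sketch only: the record shape with `vmid` and `state` fields is an assumption about the actual schema, not the current codebase.

```python
# Sketch only: checks for duplicate VMIDs and stuck lifecycle states.
# The record shape ({"vmid": ..., "state": ...}) is an assumed schema.
from collections import Counter

STUCK_STATES = {"creating"}  # lifecycle states treated as "stuck"

def find_inconsistencies(rows):
    """Return (duplicate_vmids, stuck_rows) for a list of server records."""
    counts = Counter(r["vmid"] for r in rows)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    stuck = [r for r in rows if r["state"] in STUCK_STATES]
    return duplicates, stuck

# Example: a replayed job left a second record for VMID 102.
rows = [
    {"vmid": 101, "state": "running"},
    {"vmid": 102, "state": "creating"},
    {"vmid": 102, "state": "running"},
]
dups, stuck = find_inconsistencies(rows)
print(dups)   # [102]
print(stuck)  # [{'vmid': 102, 'state': 'creating'}]
```

An empty result for both lists would confirm the DB is clean and the PBS restore contingency is unnecessary.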
- Check for duplicate VMIDs or container records
- Verify lifecycle states are consistent (no records stuck in "creating")

### 2. Redis Reset
- Do NOT restore from backup
- Start clean: `FLUSHALL`
- Ensure no duplicate or replayed jobs

### 3. Worker Verification
- Confirm only one API/worker instance is running (Detroit only)
- Confirm the Denver host is fully decommissioned (it is; OS wiped Apr 2, 2026)

### 4. PBS Backup Contingency
- If the DB is inconsistent → restore MariaDB from the Denver snapshot
- Redis stays fresh (never restored)

### 5. Provisioning Validation
- Run a single controlled server creation
- Confirm exactly:
  - 1 DB record
  - 1 job
  - 1 agent execution

---

## Key Takeaway

This session confirmed the system is operating as a distributed, state-driven platform:

| Layer | Role |
|-------|------|
| Redis | Execution layer: ephemeral, never restore |
| Database | Source of truth: authoritative state |
| Routing | Strict boundaries: control plane vs data plane |

Stability depends on maintaining a single authoritative state and a clean execution flow across all components.
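The control-plane rule above (IP env vars for internal traffic, domains reserved for Traefik) can be enforced with a small startup guard. A minimal sketch, assuming a hypothetical `internal_url` helper; the variable value shown is an example, not a real address:

```python
# Sketch only: a guard that refuses domain-based values for internal
# (control-plane) service URLs. `internal_url` and the example address
# are illustrative, not part of the current codebase.
import ipaddress
import os
from urllib.parse import urlparse

def internal_url(var: str) -> str:
    """Read an internal service URL from an env var; reject hostnames."""
    url = os.environ[var]  # raises KeyError: fail fast if the var is unset
    ipaddress.ip_address(urlparse(url).hostname)  # ValueError for domains
    return url

os.environ["ARTIFACT_SERVER"] = "http://10.0.0.5:8080"  # example value
print(internal_url("ARTIFACT_SERVER"))  # http://10.0.0.5:8080
```

Running such a check at boot turns a misconfigured env var into an immediate startup failure instead of an unintended request path through domain routing.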