Add session update: migration stability findings and next steps

2026-04-05 22:30:16 +00:00 · 2026-04-05 22:30:16 +00:00 · ff06e6211d
commit ff06e6211d
parent 0d1df03127
1 changed file with 92 additions and 0 deletions
--- a/SCRATCH/session-update-migration-stability.md
+++ b/SCRATCH/session-update-migration-stability.md
@@ -0,0 +1,92 @@
 # ZeroLagHub – Session Update: Migration, Stability, and Next Steps
 ## Context
 The system was migrated from Denver to Detroit (better CPU, same cost). After the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.
 ---
 ## Key Findings
 Issues do not appear to be code-related, since no API code changed during this period. The evidence points to state inconsistency during/after migration.
 Likely causes:
 - Redis queue replay or stale jobs
 - Multiple workers briefly running (Denver + Detroit overlap)
 - Partial or unsynchronized DB/Redis migration
 - Internal domain routing introducing unintended request paths
 ---
 ## Architecture Adjustment
 Moving forward, all internal service communication will use IP-based env variables:
 ```
 ARTIFACT_SERVER=http://10.x.x.x:port
 ```
 Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).
 This enforces proper separation of:
 - **Control plane**: API ↔ Agent ↔ services (IP env vars)
 - **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)
 See `SCRATCH/service-discovery.md` for full spec.
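
A minimal sketch of how a service might enforce this rule at startup, assuming the variable holds a full `http(s)://IP:port` URL as in the example above. The helper name `internal_endpoint` is hypothetical and not taken from the codebase; the authoritative rules are in the spec referenced above.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def internal_endpoint(var_name: str) -> str:
    """Return the URL stored in var_name, rejecting domain-based hosts.

    Hypothetical helper: the name and the exact checks are assumptions.
    """
    url = os.environ[var_name]  # KeyError if the variable is not set
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or parts.hostname is None:
        raise ValueError(f"{var_name} is not an http(s) URL: {url!r}")
    try:
        # ip_address() raises ValueError for anything that is not a
        # literal IPv4/IPv6 address, which is how hostnames are rejected.
        ipaddress.ip_address(parts.hostname)
    except ValueError:
        raise ValueError(
            f"{var_name} must use an IP, not a hostname: {url!r}"
        ) from None
    if parts.port is None:
        raise ValueError(f"{var_name} must include an explicit port: {url!r}")
    return url
```

Failing fast here means a domain name that slips into an internal variable is caught at boot instead of silently routing control-plane traffic through the proxy.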
 ---
 ## Fabric / Vanilla Progress
 Fabric server now successfully starts with:
 - Fabric API present in `/mods`
 - `FabricProxy-Lite.toml` configured (including forwarding secret)
 Remaining issue:
 - On initial startup, Velocity must be restarted before the proxy recognizes the backend
 Conclusion:
 - Agent must handle pre-seeding mods + config before first run
 - Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
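
If the readiness-timing hypothesis holds, one option is for the agent to hold off registering the backend until it accepts TCP connections. The sketch below is an assumption about how that could look, not the agent's current behavior. `wait_for_port` is a hypothetical helper, and an open port is only a rough proxy for the server being fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Hypothetical helper; the defaults are placeholders, not tuned values.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # refused or timed out; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The agent would call this with the backend's IP and game port, then register with Velocity only when it returns `True`.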
 ---
 ## Next Session Plan (Priority Order)
 ### 1. Database Validation
 - Check for duplicate VMIDs or container records
 - Verify lifecycle states are consistent (no stuck "creating")
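
Both checks can be sketched over rows already fetched from the database. The field names `vmid`, `state`, and `updated_at` and the 30-minute threshold are assumptions for illustration; the real schema is not shown in these notes.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def duplicate_vmids(records):
    """Return VMIDs that appear in more than one record, sorted."""
    counts = Counter(r["vmid"] for r in records)
    return sorted(vmid for vmid, n in counts.items() if n > 1)


def stuck_creating(records, now, max_age=timedelta(minutes=30)):
    """Return records still in 'creating' whose last update is older than max_age."""
    return [
        r for r in records
        if r["state"] == "creating" and now - r["updated_at"] > max_age
    ]


# Illustrative rows only; these are not real data from the system.
now = datetime(2026, 4, 6, 12, 0, tzinfo=timezone.utc)
rows = [
    {"vmid": 101, "state": "running", "updated_at": now - timedelta(hours=5)},
    {"vmid": 102, "state": "creating", "updated_at": now - timedelta(hours=2)},
    {"vmid": 102, "state": "running", "updated_at": now - timedelta(hours=1)},
]
print(duplicate_vmids(rows))
print([r["vmid"] for r in stuck_creating(rows, now)])
```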
 ### 2. Redis Reset
 - Do NOT restore from backup
 - Start clean with `FLUSHALL` (removes all keys from every database on the instance)
 - Ensure no duplicate or replayed jobs
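
Once jobs are flowing again, the duplicate check could look like the sketch below. It assumes each queued payload is a JSON string carrying `action` and `server_id` fields. That shape is hypothetical, and how payloads are read out of Redis depends on the queue library, which these notes do not name.

```python
import json
from collections import defaultdict


def duplicate_jobs(raw_payloads):
    """Group payloads by (action, server_id); return keys queued more than once.

    The value for each key is the list of queue positions where it appeared.
    """
    seen = defaultdict(list)
    for index, raw in enumerate(raw_payloads):
        job = json.loads(raw)
        seen[(job["action"], job["server_id"])].append(index)
    return {key: positions for key, positions in seen.items() if len(positions) > 1}
```

An empty result means no two queued jobs target the same server with the same action.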
 ### 3. Worker Verification
 - Confirm only one API/worker instance is running (Detroit only)
 - Confirm Denver host is fully decommissioned (confirmed: OS wiped Apr 2, 2026)
 ### 4. PBS Backup Contingency
 - If DB is inconsistent → restore MariaDB from the Denver PBS snapshot
 - Redis remains fresh (never restored)
 ### 5. Provisioning Validation
 - Run a single controlled server creation
 - Confirm:
   - 1 DB record
   - 1 job
   - 1 agent execution
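
The pass criteria above reduce to three counts that must each equal one. A small sketch follows, with the hypothetical function name `check_single_provision`; how the counts are gathered from the DB, Redis, and agent logs is left open.

```python
def check_single_provision(db_records: int, jobs: int,
                           agent_executions: int) -> list[str]:
    """Return human-readable problems; an empty list means the run was clean."""
    observed = {
        "DB record": db_records,
        "job": jobs,
        "agent execution": agent_executions,
    }
    return [
        f"expected 1 {name}, found {count}"
        for name, count in observed.items()
        if count != 1
    ]
```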
 ---
 ## Key Takeaway
 This session confirmed the system is operating as a distributed, state-driven platform:
 | Layer | Role |
 |-------|------|
 | Redis | Execution layer — ephemeral, never restore |
 | Database | Source of truth — authoritative state |
 | Routing | Strict boundaries — control plane vs data plane |
 Stability depends on maintaining a single authoritative state and clean execution flow across all components.