From 56cc7cdce4d5833b181c12b9aaa45502fd5fc62c Mon Sep 17 00:00:00 2001
From: jester
Date: Tue, 7 Apr 2026 21:01:19 +0000
Subject: [PATCH] Archive resolved: session-update-migration-stability

---
 SCRATCH/session-update-migration-stability.md | 92 -------------------
 1 file changed, 92 deletions(-)
 delete mode 100644 SCRATCH/session-update-migration-stability.md

diff --git a/SCRATCH/session-update-migration-stability.md b/SCRATCH/session-update-migration-stability.md
deleted file mode 100644
index afe19d8..0000000
--- a/SCRATCH/session-update-migration-stability.md
+++ /dev/null
@@ -1,92 +0,0 @@
-# ZeroLagHub – Session Update: Migration, Stability, and Next Steps
-
-## Context
-
-System was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.
-
----
-
-## Key Findings
-
-Issues are not code-related — they are caused by state inconsistency during/after migration.
-
-Likely causes:
-- Redis queue replay or stale jobs
-- Multiple workers briefly running (Denver + Detroit overlap)
-- Partial or unsynchronized DB/Redis migration
-- Internal domain routing introducing unintended request paths
-
----
-
-## Architecture Adjustment
-
-Moving forward, all internal service communication will use IP-based env variables:
-
-```
-ARTIFACT_SERVER=http://10.x.x.x:port
-```
-
-Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).
-
-This enforces proper separation of:
-- **Control plane**: API ↔ Agent ↔ services (IP env vars)
-- **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)
-
-See `SCRATCH/service-discovery.md` for full spec.
-
----
-
-## Fabric / Vanilla Progress
-
-Fabric server now successfully starts with:
-- Fabric API present in `/mods`
-- `FabricProxy-Lite.toml` configured (including forwarding secret)
-
-Remaining issue:
-- Initial startup requires Velocity restart before proxy recognizes backend
-
-Conclusion:
-- Agent must handle pre-seeding mods + config before first run
-- Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
-
----
-
-## Next Session Plan (Priority Order)
-
-### 1. Database Validation
-- Check for duplicate VMIDs or container records
-- Verify lifecycle states are consistent (no stuck "creating")
-
-### 2. Redis Reset
-- Do NOT restore from backup
-- Start clean — `FLUSHALL`
-- Ensure no duplicate or replayed jobs
-
-### 3. Worker Verification
-- Confirm only one API/worker instance is running (Detroit only)
-- Confirm Denver host is fully decommissioned (it is — OS wiped Apr 2, 2026)
-
-### 4. PBS Backup Contingency
-- If DB is inconsistent → restore MariaDB from Denver snapshot
-- Redis remains fresh (never restored)
-
-### 5. Provisioning Validation
-- Run a single controlled server creation
-- Confirm:
-  - 1 DB record
-  - 1 job
-  - 1 agent execution
-
----
-
-## Key Takeaway
-
-This session confirmed the system is operating as a distributed, state-driven platform:
-
-| Layer | Role |
-|-------|------|
-| Redis | Execution layer — ephemeral, never restore |
-| Database | Source of truth — authoritative state |
-| Routing | Strict boundaries — control plane vs data plane |
-
-Stability depends on maintaining a single authoritative state and clean execution flow across all components.