zlh-grind/SCRATCH/session-update-migration-stability.md


ZeroLagHub Session Update: Migration, Stability, and Next Steps

Context

System was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.


Key Findings

Issues are not code-related; they are caused by state inconsistency during/after migration.

Likely causes:

  • Redis queue replay or stale jobs
  • Multiple workers briefly running (Denver + Detroit overlap)
  • Partial or unsynchronized DB/Redis migration
  • Internal domain routing introducing unintended request paths

Architecture Adjustment

Moving forward, all internal service communication will use IP-based env variables:

```
ARTIFACT_SERVER=http://10.x.x.x:port
```

Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).

This enforces proper separation of:

  • Control plane: API ↔ Agent ↔ services (IP env vars)
  • Data plane: user/browser ↔ domain ↔ proxy (Traefik)

See SCRATCH/service-discovery.md for full spec.
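The control/data plane split above can be sketched as a small resolver that only accepts IP-based env vars for internal calls. This is an illustrative assumption, not the actual ZeroLagHub code; the function name `resolve_internal` and the validation rule are hypothetical.

```python
import os

def resolve_internal(service: str) -> str:
    """Return the internal base URL for a service from its IP env var.

    Rejects unset vars and domain-based values so control-plane traffic
    can never fall through to Traefik/domain routing by accident.
    """
    url = os.environ.get(service)
    if url is None:
        raise RuntimeError(f"{service} not set; internal services must be "
                           "configured via IP env vars")
    # Crude guard (assumption): host part of an internal URL starts with a digit.
    if "://" in url and not url.split("://", 1)[1][0].isdigit():
        raise RuntimeError(f"{service} must point at an IP, not a domain")
    return url

os.environ["ARTIFACT_SERVER"] = "http://10.0.0.5:8080"  # example value
print(resolve_internal("ARTIFACT_SERVER"))
```

A misconfigured domain value fails loudly at startup rather than silently routing control-plane requests through the proxy.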


Fabric / Vanilla Progress

Fabric server now successfully starts with:

  • Fabric API present in /mods
  • FabricProxy-Lite.toml configured (including forwarding secret)

Remaining issue:

  • Initial startup requires Velocity restart before proxy recognizes backend

Conclusion:

  • Agent must handle pre-seeding mods + config before first run
  • Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
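A rough sketch of the pre-seed step the agent could run before first start: ensure the Fabric API jar is in `mods/` and that `FabricProxy-Lite.toml` exists with a forwarding secret. The directory layout and config keys shown are assumptions, not the agent's actual implementation.

```python
import secrets
import tempfile
from pathlib import Path

def preseed(server_dir: Path, fabric_api_jar: Path, secret: str) -> None:
    """Idempotently seed mods + proxy config before the server's first run."""
    mods = server_dir / "mods"
    mods.mkdir(parents=True, exist_ok=True)
    target = mods / fabric_api_jar.name
    if not target.exists():
        target.write_bytes(fabric_api_jar.read_bytes())  # copy jar into mods/

    cfg = server_dir / "config" / "FabricProxy-Lite.toml"
    cfg.parent.mkdir(parents=True, exist_ok=True)
    if not cfg.exists():
        # Minimal config (assumed keys): enable forwarding with the shared secret.
        cfg.write_text(f'hackOnlineMode = true\nsecret = "{secret}"\n')

# Demo against a throwaway directory with a stand-in jar.
tmp = Path(tempfile.mkdtemp())
jar = tmp / "fabric-api.jar"
jar.write_bytes(b"stand-in jar bytes")
preseed(tmp / "server", jar, secrets.token_hex(16))
```

Because every step is guarded by an existence check, the agent can run this on every start without clobbering user-modified configs.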

Next Session Plan (Priority Order)

1. Database Validation

  • Check for duplicate VMIDs or container records
  • Verify lifecycle states are consistent (no stuck "creating")
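The two checks above boil down to simple queries. This sketch runs them against an in-memory SQLite stand-in; the real checks would run against MariaDB, and the table/column names (`servers`, `vmid`, `state`) are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE servers (id INTEGER PRIMARY KEY, vmid INTEGER, state TEXT)")
# Seed data simulating migration fallout: a duplicated VMID and a stuck record.
db.executemany("INSERT INTO servers (vmid, state) VALUES (?, ?)",
               [(101, "running"), (102, "creating"), (101, "running")])

# 1. Duplicate VMIDs: any vmid with more than one record.
dupes = db.execute(
    "SELECT vmid, COUNT(*) FROM servers GROUP BY vmid HAVING COUNT(*) > 1"
).fetchall()

# 2. Stuck lifecycle states: records that never left 'creating'.
stuck = db.execute(
    "SELECT id, vmid FROM servers WHERE state = 'creating'"
).fetchall()

print(dupes)  # duplicated vmid rows
print(stuck)  # records stuck in 'creating'
```

Both queries should return empty result sets on a healthy database; anything else is migration fallout to reconcile before touching Redis.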

2. Redis Reset

  • Do NOT restore from backup
  • Start clean: FLUSHALL
  • Ensure no duplicate or replayed jobs
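Even after a clean FLUSHALL, a guard against replayed jobs is cheap. A sketch of first-seen-wins de-duplication on job IDs; the job shape and key name are assumptions, not existing ZeroLagHub code.

```python
def dedupe_jobs(jobs):
    """Keep the first occurrence of each job id, drop replays."""
    seen = set()
    unique = []
    for job in jobs:
        if job["id"] in seen:
            continue  # replayed job: skip rather than re-execute
        seen.add(job["id"])
        unique.append(job)
    return unique

# Simulated replay: the same creation job enqueued twice.
replayed = [{"id": "create-42"}, {"id": "create-42"}, {"id": "delete-7"}]
print(dedupe_jobs(replayed))
```

In production the `seen` set would live in Redis itself (e.g. a set of processed job IDs), so replays are dropped even across worker restarts.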

3. Worker Verification

  • Confirm only one API/worker instance is running (Detroit only)
  • Confirm Denver host is fully decommissioned (it is; OS wiped Apr 2, 2026)
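Beyond a one-time check, a single-instance guard makes overlap structurally impossible: the worker takes an exclusive, non-blocking `flock` at startup, so a second instance fails fast. The lock path is an illustrative assumption.

```python
import fcntl
import os
import tempfile

LOCK_PATH = os.path.join(tempfile.gettempdir(), "zlh-worker.lock")

def acquire_single_instance_lock(path: str = LOCK_PATH) -> int:
    """Take an exclusive lock; raise if another worker already holds it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError("another worker instance already holds the lock")
    return fd  # keep the fd open for the process lifetime

fd = acquire_single_instance_lock()
```

Note this only protects a single host; guarding against cross-host overlap (Denver + Detroit) would need a shared lock, e.g. a Redis `SET ... NX` lease.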

4. PBS Backup Contingency

  • If DB is inconsistent → restore MariaDB from Denver snapshot
  • Redis remains fresh (never restored)

5. Provisioning Validation

  • Run a single controlled server creation
  • Confirm:
    • 1 DB record
    • 1 job
    • 1 agent execution
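The 1/1/1 invariant above can be expressed as a single assertion helper. The counts would come from real DB/queue/agent queries; here they are plain parameters.

```python
def validate_provisioning(db_records: int, jobs: int, agent_runs: int) -> None:
    """Fail loudly unless exactly one of each artifact exists post-creation."""
    for name, count in [("DB record", db_records),
                        ("job", jobs),
                        ("agent execution", agent_runs)]:
        if count != 1:
            raise AssertionError(f"expected exactly 1 {name}, saw {count}")

validate_provisioning(1, 1, 1)  # healthy controlled run: passes silently
```

Any count above 1 would point back at the duplicate-creation symptom from the migration; any count of 0 would indicate a lost job or a stuck lifecycle state.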

Key Takeaway

This session confirmed the system is operating as a distributed, state-driven platform:

| Layer | Role |
| --- | --- |
| Redis | Execution layer: ephemeral, never restore |
| Database | Source of truth: authoritative state |
| Routing | Strict boundaries: control plane vs data plane |

Stability depends on maintaining a single authoritative state and clean execution flow across all components.