zlh-grind/SCRATCH/session-update-migration-stability.md


ZeroLagHub Session Update: Migration, Stability, and Next Steps

Context

System was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.


Key Findings

Issues are not code-related; they are caused by state inconsistency during/after migration.

Likely causes:

  • Redis queue replay or stale jobs
  • Multiple workers briefly running (Denver + Detroit overlap)
  • Partial or unsynchronized DB/Redis migration
  • Internal domain routing introducing unintended request paths

Architecture Adjustment

Moving forward, all internal service communication will use IP-based env variables:

```
ARTIFACT_SERVER=http://10.x.x.x:port
```

Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).

This enforces proper separation of:

  • Control plane: API ↔ Agent ↔ services (IP env vars)
  • Data plane: user/browser ↔ domain ↔ proxy (Traefik)

See SCRATCH/service-discovery.md for full spec.
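The control/data plane split above can be sketched as a small resolver that only accepts IP-based env vars for internal calls. This is an illustrative assumption, not the actual ZeroLagHub code; the function name `resolve_internal` and the validation rule are hypothetical.

```python
import os

def resolve_internal(service: str) -> str:
    """Return the internal base URL for a service from its IP env var.

    Rejects unset vars and domain-based values so control-plane traffic
    can never fall through to Traefik/domain routing by accident.
    """
    url = os.environ.get(service)
    if url is None:
        raise RuntimeError(f"{service} not set; internal services must be "
                           "configured via IP env vars")
    # Crude guard (assumption): host part of an internal URL starts with a digit.
    if "://" in url and not url.split("://", 1)[1][0].isdigit():
        raise RuntimeError(f"{service} must point at an IP, not a domain")
    return url

os.environ["ARTIFACT_SERVER"] = "http://10.0.0.5:8080"  # example value
print(resolve_internal("ARTIFACT_SERVER"))
```

A misconfigured domain value fails loudly at startup rather than silently routing control-plane requests through the proxy.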


Fabric / Vanilla Progress

Fabric server now successfully starts with:

  • Fabric API present in /mods
  • FabricProxy-Lite.toml configured (including forwarding secret)

Remaining issue:

  • Initial startup requires Velocity restart before proxy recognizes backend

Conclusion:

  • Agent must handle pre-seeding mods + config before first run
  • Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
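A rough sketch of the pre-seed step the agent could run before first start: ensure the Fabric API jar is in `mods/` and that `FabricProxy-Lite.toml` exists with a forwarding secret. The directory layout and config keys shown are assumptions, not the agent's actual implementation.

```python
import secrets
import tempfile
from pathlib import Path

def preseed(server_dir: Path, fabric_api_jar: Path, secret: str) -> None:
    """Idempotently seed mods + proxy config before the server's first run."""
    mods = server_dir / "mods"
    mods.mkdir(parents=True, exist_ok=True)
    target = mods / fabric_api_jar.name
    if not target.exists():
        target.write_bytes(fabric_api_jar.read_bytes())  # copy jar into mods/

    cfg = server_dir / "config" / "FabricProxy-Lite.toml"
    cfg.parent.mkdir(parents=True, exist_ok=True)
    if not cfg.exists():
        # Minimal config (assumed keys): enable forwarding with the shared secret.
        cfg.write_text(f'hackOnlineMode = true\nsecret = "{secret}"\n')

# Demo against a throwaway directory with a stand-in jar.
tmp = Path(tempfile.mkdtemp())
jar = tmp / "fabric-api.jar"
jar.write_bytes(b"stand-in jar bytes")
preseed(tmp / "server", jar, secrets.token_hex(16))
```

Because every step is guarded by an existence check, the agent can run this on every start without clobbering user-modified configs.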

Next Session Plan (Priority Order)

1. Database Validation

  • Check for duplicate VMIDs or container records
  • Verify lifecycle states are consistent (no stuck "creating")
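The two checks above boil down to simple queries. This sketch runs them against an in-memory SQLite stand-in; the real checks would run against MariaDB, and the table/column names (`servers`, `vmid`, `state`) are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE servers (id INTEGER PRIMARY KEY, vmid INTEGER, state TEXT)")
# Seed data simulating migration fallout: a duplicated VMID and a stuck record.
db.executemany("INSERT INTO servers (vmid, state) VALUES (?, ?)",
               [(101, "running"), (102, "creating"), (101, "running")])

# 1. Duplicate VMIDs: any vmid with more than one record.
dupes = db.execute(
    "SELECT vmid, COUNT(*) FROM servers GROUP BY vmid HAVING COUNT(*) > 1"
).fetchall()

# 2. Stuck lifecycle states: records that never left 'creating'.
stuck = db.execute(
    "SELECT id, vmid FROM servers WHERE state = 'creating'"
).fetchall()

print(dupes)  # duplicated vmid rows
print(stuck)  # records stuck in 'creating'
```

Both queries should return empty result sets on a healthy database; anything else is migration fallout to reconcile before touching Redis.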

2. Redis Reset

  • Do NOT restore from backup
  • Start clean: FLUSHALL
  • Ensure no duplicate or replayed jobs
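Even after a clean FLUSHALL, a guard against replayed jobs is cheap. A sketch of first-seen-wins de-duplication on job IDs; the job shape and key name are assumptions, not existing ZeroLagHub code.

```python
def dedupe_jobs(jobs):
    """Keep the first occurrence of each job id, drop replays."""
    seen = set()
    unique = []
    for job in jobs:
        if job["id"] in seen:
            continue  # replayed job: skip rather than re-execute
        seen.add(job["id"])
        unique.append(job)
    return unique

# Simulated replay: the same creation job enqueued twice.
replayed = [{"id": "create-42"}, {"id": "create-42"}, {"id": "delete-7"}]
print(dedupe_jobs(replayed))
```

In production the `seen` set would live in Redis itself (e.g. a set of processed job IDs), so replays are dropped even across worker restarts.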

3. Worker Verification

  • Confirm only one API/worker instance is running (Detroit only)
  • Confirm Denver host is fully decommissioned (it is; OS wiped Apr 2, 2026)
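Beyond a one-time check, a single-instance guard makes overlap structurally impossible: the worker takes an exclusive, non-blocking `flock` at startup, so a second instance fails fast. The lock path is an illustrative assumption.

```python
import fcntl
import os
import tempfile

LOCK_PATH = os.path.join(tempfile.gettempdir(), "zlh-worker.lock")

def acquire_single_instance_lock(path: str = LOCK_PATH) -> int:
    """Take an exclusive lock; raise if another worker already holds it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError("another worker instance already holds the lock")
    return fd  # keep the fd open for the process lifetime

fd = acquire_single_instance_lock()
```

Note this only protects a single host; guarding against cross-host overlap (Denver + Detroit) would need a shared lock, e.g. a Redis `SET ... NX` lease.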

4. PBS Backup Contingency

  • If DB is inconsistent → restore MariaDB from Denver snapshot
  • Redis remains fresh (never restored)

5. Provisioning Validation

  • Run a single controlled server creation
  • Confirm:
    • 1 DB record
    • 1 job
    • 1 agent execution
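The 1/1/1 invariant above can be expressed as a single assertion helper. The counts would come from real DB/queue/agent queries; here they are plain parameters.

```python
def validate_provisioning(db_records: int, jobs: int, agent_runs: int) -> None:
    """Fail loudly unless exactly one of each artifact exists post-creation."""
    for name, count in [("DB record", db_records),
                        ("job", jobs),
                        ("agent execution", agent_runs)]:
        if count != 1:
            raise AssertionError(f"expected exactly 1 {name}, saw {count}")

validate_provisioning(1, 1, 1)  # healthy controlled run: passes silently
```

Any count above 1 would point back at the duplicate-creation symptom from the migration; any count of 0 would indicate a lost job or a stuck lifecycle state.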

Key Takeaway

This session confirmed the system is operating as a distributed, state-driven platform:

| Layer | Role |
| --- | --- |
| Redis | Execution layer: ephemeral, never restore |
| Database | Source of truth: authoritative state |
| Routing | Strict boundaries: control plane vs data plane |

Stability depends on maintaining a single authoritative state and clean execution flow across all components.