
# ZeroLagHub Session Update: Migration, Stability, and Next Steps
## Context
System was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.
---
## Key Findings
The issues are not code-related; they stem from state inconsistency during and after the migration.
Likely causes:
- Redis queue replay or stale jobs
- Multiple workers briefly running (Denver + Detroit overlap)
- Partial or unsynchronized DB/Redis migration
- Internal domain routing introducing unintended request paths
---
## Architecture Adjustment
Moving forward, all internal service communication will use IP-based env variables:
```
ARTIFACT_SERVER=http://10.x.x.x:port
```
Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).
This enforces proper separation of:
- **Control plane**: API ↔ Agent ↔ services (IP env vars)
- **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)
See `SCRATCH/service-discovery.md` for full spec.
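A minimal sketch of the control-plane rule above: internal clients resolve their peers from IP-based env variables and fail fast if one is missing, rather than silently falling back to domain routing. The helper name `resolve_service_url` is hypothetical; only the `ARTIFACT_SERVER` variable comes from this note.

```python
import os

def resolve_service_url(env_var: str) -> str:
    """Resolve an internal service URL from an IP-based env variable.

    Control-plane calls must never fall back to a public domain:
    raising here keeps internal traffic off the Traefik data plane.
    """
    url = os.environ.get(env_var)
    if not url:
        raise RuntimeError(
            f"{env_var} is not set; refusing to fall back to domain routing"
        )
    return url

# Example value only -- real deployments set this in the service env.
os.environ["ARTIFACT_SERVER"] = "http://10.0.0.5:8080"
print(resolve_service_url("ARTIFACT_SERVER"))
```

Failing loudly is the point: a misconfigured control-plane service should crash at startup, not quietly route internal requests through the user-facing proxy.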
---
## Fabric / Vanilla Progress
Fabric server now successfully starts with:
- Fabric API present in `/mods`
- `FabricProxy-Lite.toml` configured (including forwarding secret)
Remaining issue:
- Initial startup requires a Velocity restart before the proxy recognizes the backend
Conclusion:
- Agent must handle pre-seeding mods + config before first run
- Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
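If the readiness-timing hypothesis holds, the fix on the agent side is to register the backend with the proxy only after the server actually accepts connections. A hedged sketch of that wait loop (the function name and the `probe` callable, e.g. a TCP connect to the backend port, are assumptions, not existing agent code):

```python
import time

def wait_until_ready(probe, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll `probe` until it returns True or `timeout` elapses.

    `probe` is any zero-arg callable, e.g. a TCP connect attempt against
    the backend's Minecraft port. Registering with Velocity only after
    this returns True should avoid the restart-to-recognize problem.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

This keeps the fix in the agent's control flow (pre-seed mods and config, start the server, wait, then register) instead of relying on Velocity retrying on its own schedule.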
---
## Next Session Plan (Priority Order)
### 1. Database Validation
- Check for duplicate VMIDs or container records
- Verify lifecycle states are consistent (no stuck "creating")
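The two checks above can each be expressed as one query. A self-contained sketch using an in-memory SQLite stand-in (the `servers` table and its `vmid`/`state` columns are illustrative; the real schema is MariaDB's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE servers (id INTEGER PRIMARY KEY, vmid INTEGER, state TEXT);
-- Simulated post-migration data: vmid 101 was created twice.
INSERT INTO servers (vmid, state) VALUES
    (101, 'running'), (101, 'creating'), (102, 'running');
""")

# Check 1: duplicate VMIDs -- any vmid appearing more than once is suspect.
dupes = conn.execute(
    "SELECT vmid, COUNT(*) FROM servers GROUP BY vmid HAVING COUNT(*) > 1"
).fetchall()

# Check 2: stuck lifecycle states -- records left in 'creating'.
stuck = conn.execute(
    "SELECT id, vmid FROM servers WHERE state = 'creating'"
).fetchall()

print(dupes)  # [(101, 2)]
print(stuck)  # [(2, 101)]
```

Both queries are read-only, so they are safe to run against the live Detroit database before deciding whether the PBS restore in step 4 is needed.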
### 2. Redis Reset
- Do NOT restore from backup
- Start clean — `FLUSHALL`
- Ensure no duplicate or replayed jobs
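After the `FLUSHALL`, the "no replayed jobs" check reduces to scanning the queue for job IDs that appear more than once. A minimal sketch (the function name is hypothetical; it operates on whatever ID list the queue inspection yields):

```python
from collections import Counter

def find_replayed_jobs(job_ids):
    """Return job IDs enqueued more than once.

    Duplicates here are a symptom of queue replay after an unclean
    migration; a freshly flushed Redis should yield an empty list.
    """
    return [job for job, count in Counter(job_ids).items() if count > 1]
```

Running this once before re-enabling workers gives a cheap invariant: an empty result means the execution layer really did start clean.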
### 3. Worker Verification
- Confirm only one API/worker instance is running (Detroit only)
- Confirm Denver host is fully decommissioned (it is — OS wiped Apr 2, 2026)
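One way to make "only one worker" checkable rather than assumed is to filter worker heartbeats by age. This is a sketch under the assumption that workers record last-seen timestamps somewhere inspectable (e.g. Redis keys); the heartbeat map shape is invented for illustration:

```python
import time

def live_workers(heartbeats, ttl: float = 30.0, now: float = None):
    """Given {worker_id: last_heartbeat_unix_ts}, return the workers
    seen within `ttl` seconds. After decommissioning Denver, exactly
    one entry (Detroit) should remain."""
    if now is None:
        now = time.time()
    return sorted(w for w, ts in heartbeats.items() if now - ts <= ttl)
```

A stale Denver entry would age out of this list on its own, so the check distinguishes "Denver is gone" from "Denver merely stopped updating its record".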
### 4. PBS Backup Contingency
- If DB is inconsistent → restore MariaDB from Denver snapshot
- Redis remains fresh (never restored)
### 5. Provisioning Validation
- Run a single controlled server creation
- Confirm:
  - 1 DB record
  - 1 job
  - 1 agent execution
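The 1/1/1 check above can be made mechanical. A hedged sketch (the function and its three count arguments are assumptions; they would be fed from the DB query, the queue inspection, and the agent log respectively):

```python
def validate_single_provision(db_records: int, jobs: int, agent_runs: int) -> None:
    """Assert that one controlled creation produced exactly one artifact
    at every layer. Anything other than 1/1/1 means duplicate or lost
    state survived the migration."""
    counts = {"db_records": db_records, "jobs": jobs, "agent_runs": agent_runs}
    bad = {layer: n for layer, n in counts.items() if n != 1}
    if bad:
        raise AssertionError(f"provisioning invariant violated: {bad}")
```

Raising on any mismatch, rather than just the duplicate case, also catches the opposite failure mode (a creation that silently produced nothing at some layer).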
---
## Key Takeaway
This session confirmed the system is operating as a distributed, state-driven platform:
| Layer | Role |
|-------|------|
| Redis | Execution layer — ephemeral, never restore |
| Database | Source of truth — authoritative state |
| Routing | Strict boundaries — control plane vs data plane |
Stability depends on maintaining a single authoritative state and clean execution flow across all components.