
# ZeroLagHub Session Update: Migration, Stability, and Next Steps
## Context
System was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.
---
## Key Findings
The issues are not code-related; they stem from state inconsistency during and after the migration.
Likely causes:
- Redis queue replay or stale jobs
- Multiple workers briefly running (Denver + Detroit overlap)
- Partial or unsynchronized DB/Redis migration
- Internal domain routing introducing unintended request paths
---
## Architecture Adjustment
Moving forward, all internal service communication will use IP-based env variables:
```
ARTIFACT_SERVER=http://10.x.x.x:port
```
Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).
This enforces proper separation of:
- **Control plane**: API ↔ Agent ↔ services (IP env vars)
- **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)
See `SCRATCH/service-discovery.md` for full spec.
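A minimal sketch of the control-plane rule above: internal clients resolve their peers from IP-based env variables and fail fast if one is missing, rather than silently falling back to domain routing. The helper name `resolve_service_url` is hypothetical; only the `ARTIFACT_SERVER` variable comes from this note.

```python
import os

def resolve_service_url(env_var: str) -> str:
    """Resolve an internal service URL from an IP-based env variable.

    Control-plane calls must never fall back to a public domain:
    raising here keeps internal traffic off the Traefik data plane.
    """
    url = os.environ.get(env_var)
    if not url:
        raise RuntimeError(
            f"{env_var} is not set; refusing to fall back to domain routing"
        )
    return url

# Example value only -- real deployments set this in the service env.
os.environ["ARTIFACT_SERVER"] = "http://10.0.0.5:8080"
print(resolve_service_url("ARTIFACT_SERVER"))
```

Failing loudly is the point: a misconfigured control-plane service should crash at startup, not quietly route internal requests through the user-facing proxy.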
---
## Fabric / Vanilla Progress
Fabric server now successfully starts with:
- Fabric API present in `/mods`
- `FabricProxy-Lite.toml` configured (including forwarding secret)
Remaining issue:
- Initial startup requires a Velocity restart before the proxy recognizes the backend
Conclusion:
- Agent must handle pre-seeding mods + config before first run
- Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
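If the readiness-timing hypothesis holds, the fix on the agent side is to register the backend with the proxy only after the server actually accepts connections. A hedged sketch of that wait loop (the function name and the `probe` callable, e.g. a TCP connect to the backend port, are assumptions, not existing agent code):

```python
import time

def wait_until_ready(probe, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll `probe` until it returns True or `timeout` elapses.

    `probe` is any zero-arg callable, e.g. a TCP connect attempt against
    the backend's Minecraft port. Registering with Velocity only after
    this returns True should avoid the restart-to-recognize problem.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

This keeps the fix in the agent's control flow (pre-seed mods and config, start the server, wait, then register) instead of relying on Velocity retrying on its own schedule.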
---
## Next Session Plan (Priority Order)
### 1. Database Validation
- Check for duplicate VMIDs or container records
- Verify lifecycle states are consistent (no stuck "creating")
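The two checks above can each be expressed as one query. A self-contained sketch using an in-memory SQLite stand-in (the `servers` table and its `vmid`/`state` columns are illustrative; the real schema is MariaDB's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE servers (id INTEGER PRIMARY KEY, vmid INTEGER, state TEXT);
-- Simulated post-migration data: vmid 101 was created twice.
INSERT INTO servers (vmid, state) VALUES
    (101, 'running'), (101, 'creating'), (102, 'running');
""")

# Check 1: duplicate VMIDs -- any vmid appearing more than once is suspect.
dupes = conn.execute(
    "SELECT vmid, COUNT(*) FROM servers GROUP BY vmid HAVING COUNT(*) > 1"
).fetchall()

# Check 2: stuck lifecycle states -- records left in 'creating'.
stuck = conn.execute(
    "SELECT id, vmid FROM servers WHERE state = 'creating'"
).fetchall()

print(dupes)  # [(101, 2)]
print(stuck)  # [(2, 101)]
```

Both queries are read-only, so they are safe to run against the live Detroit database before deciding whether the PBS restore in step 4 is needed.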
### 2. Redis Reset
- Do NOT restore from backup
- Start clean — `FLUSHALL`
- Ensure no duplicate or replayed jobs
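After the `FLUSHALL`, the "no replayed jobs" check reduces to scanning the queue for job IDs that appear more than once. A minimal sketch (the function name is hypothetical; it operates on whatever ID list the queue inspection yields):

```python
from collections import Counter

def find_replayed_jobs(job_ids):
    """Return job IDs enqueued more than once.

    Duplicates here are a symptom of queue replay after an unclean
    migration; a freshly flushed Redis should yield an empty list.
    """
    return [job for job, count in Counter(job_ids).items() if count > 1]
```

Running this once before re-enabling workers gives a cheap invariant: an empty result means the execution layer really did start clean.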
### 3. Worker Verification
- Confirm only one API/worker instance is running (Detroit only)
- Confirm Denver host is fully decommissioned (it is — OS wiped Apr 2, 2026)
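One way to make "only one worker" checkable rather than assumed is to filter worker heartbeats by age. This is a sketch under the assumption that workers record last-seen timestamps somewhere inspectable (e.g. Redis keys); the heartbeat map shape is invented for illustration:

```python
import time

def live_workers(heartbeats, ttl: float = 30.0, now: float = None):
    """Given {worker_id: last_heartbeat_unix_ts}, return the workers
    seen within `ttl` seconds. After decommissioning Denver, exactly
    one entry (Detroit) should remain."""
    if now is None:
        now = time.time()
    return sorted(w for w, ts in heartbeats.items() if now - ts <= ttl)
```

A stale Denver entry would age out of this list on its own, so the check distinguishes "Denver is gone" from "Denver merely stopped updating its record".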
### 4. PBS Backup Contingency
- If DB is inconsistent → restore MariaDB from Denver snapshot
- Redis remains fresh (never restored)
### 5. Provisioning Validation
- Run a single controlled server creation
- Confirm:
  - 1 DB record
  - 1 job
  - 1 agent execution
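The 1/1/1 check above can be made mechanical. A hedged sketch (the function and its three count arguments are assumptions; they would be fed from the DB query, the queue inspection, and the agent log respectively):

```python
def validate_single_provision(db_records: int, jobs: int, agent_runs: int) -> None:
    """Assert that one controlled creation produced exactly one artifact
    at every layer. Anything other than 1/1/1 means duplicate or lost
    state survived the migration."""
    counts = {"db_records": db_records, "jobs": jobs, "agent_runs": agent_runs}
    bad = {layer: n for layer, n in counts.items() if n != 1}
    if bad:
        raise AssertionError(f"provisioning invariant violated: {bad}")
```

Raising on any mismatch, rather than just the duplicate case, also catches the opposite failure mode (a creation that silently produced nothing at some layer).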
---
## Key Takeaway
This session confirmed the system is operating as a distributed, state-driven platform:
| Layer | Role |
|-------|------|
| Redis | Execution layer — ephemeral, never restore |
| Database | Source of truth — authoritative state |
| Routing | Strict boundaries — control plane vs data plane |
Stability depends on maintaining a single authoritative state and clean execution flow across all components.