Add session update: migration stability findings and next steps
# ZeroLagHub – Session Update: Migration, Stability, and Next Steps

## Context

System was migrated from Denver → Detroit (better CPU, same cost). Following the move, several non-deterministic issues appeared, including duplicate server creation and inconsistent lifecycle behavior. No API code changes were made during this period.

---

## Key Findings

The issues are not code-related; they stem from state inconsistency during or after the migration.

Likely causes:

- Redis queue replay or stale jobs
- Multiple workers briefly running (Denver + Detroit overlap)
- Partial or unsynchronized DB/Redis migration
- Internal domain routing introducing unintended request paths

---

## Architecture Adjustment

Moving forward, all internal service communication will use IP-based env variables:

```
ARTIFACT_SERVER=http://10.x.x.x:port
```

Domain-based routing is reserved strictly for external/user-facing traffic (Traefik).
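
This split can be enforced at startup with a small guard that refuses domain-based URLs for control-plane endpoints. A minimal Python sketch (the helper name and example IP are illustrative; only the `ARTIFACT_SERVER` variable comes from this note):

```python
import ipaddress
import os
from urllib.parse import urlsplit

def internal_endpoint(var_name: str) -> str:
    """Resolve a control-plane endpoint from an IP-based env variable.

    Rejects domain-based URLs so internal traffic can never be routed
    through Traefik by accident. (Helper name is illustrative.)
    """
    url = os.environ[var_name]
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for domain names
    except ValueError:
        raise ValueError(
            f"{var_name}={url!r} uses a domain; control-plane URLs must be IP-based"
        )
    return url

# Example (the IP below is a placeholder, not a real deployment value):
os.environ["ARTIFACT_SERVER"] = "http://10.0.0.5:8080"
print(internal_endpoint("ARTIFACT_SERVER"))  # http://10.0.0.5:8080
```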

This enforces proper separation of:

- **Control plane**: API ↔ Agent ↔ services (IP env vars)
- **Data plane**: user/browser ↔ domain ↔ proxy (Traefik)

See `SCRATCH/service-discovery.md` for full spec.

---

## Fabric / Vanilla Progress

Fabric server now successfully starts with:
- Fabric API present in `/mods`
- `FabricProxy-Lite.toml` configured (including forwarding secret)

Remaining issue:

- Initial startup requires Velocity restart before proxy recognizes backend

Conclusion:

- Agent must handle pre-seeding mods + config before first run
- Velocity/plugin behavior likely tied to readiness timing (server not fully "ready" when registered)
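
The readiness-timing hypothesis suggests the agent should poll the backend and only register it with the proxy once it reports ready. A sketch under that assumption, with caller-supplied probe and register callables (both hypothetical; a real probe might be a Velocity ping or a log-line check):

```python
import time

def register_when_ready(is_ready, register, timeout=120.0, interval=2.0,
                        sleep=time.sleep):
    """Poll a backend readiness probe, then register it with the proxy.

    `is_ready` and `register` are caller-supplied callables. Returns True
    once registration happens, False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            register()
            return True
        sleep(interval)
    return False  # caller decides whether to retry or surface an error

# Example with fake probes: the backend becomes ready on the third poll.
state = {"polls": 0, "registered": False}
ready = lambda: (state.__setitem__("polls", state["polls"] + 1)
                 or state["polls"] >= 3)
ok = register_when_ready(ready,
                         lambda: state.__setitem__("registered", True),
                         timeout=10.0, interval=0.0, sleep=lambda _: None)
print(ok, state["registered"])  # True True
```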

---

## Next Session Plan (Priority Order)

### 1. Database Validation
- Check for duplicate VMIDs or container records
- Verify lifecycle states are consistent (no stuck "creating")
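
Both checks can be sketched as pure functions over the fetched rows. Field names (`vmid`, `state`) and the terminal-state list below are assumptions for illustration, not the actual schema:

```python
from collections import Counter

def find_duplicate_vmids(records):
    """Return VMIDs that appear on more than one server record."""
    counts = Counter(r["vmid"] for r in records)
    return sorted(v for v, n in counts.items() if n > 1)

def find_stuck(records, terminal_states=("running", "stopped", "deleted")):
    """Records whose lifecycle state never reached a terminal value."""
    return [r for r in records if r["state"] not in terminal_states]

rows = [
    {"vmid": 101, "state": "running"},
    {"vmid": 102, "state": "creating"},   # stuck since the migration
    {"vmid": 102, "state": "running"},    # duplicate VMID
]
print(find_duplicate_vmids(rows))                 # [102]
print([r["vmid"] for r in find_stuck(rows)])      # [102]
```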

### 2. Redis Reset

- Do NOT restore from backup
- Start clean — `FLUSHALL`
- Ensure no duplicate or replayed jobs
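
One way to guarantee the last point going forward is an idempotency guard keyed on job ID. In production the seen-set would live in Redis itself (e.g. a `SET ... NX` key with a TTL); a plain in-memory set stands in below so the sketch is self-contained:

```python
class JobDeduper:
    """Drop jobs whose IDs have already been accepted once."""

    def __init__(self):
        self._seen = set()

    def accept(self, job_id: str) -> bool:
        """Return True exactly once per job ID; replays return False."""
        if job_id in self._seen:
            return False
        self._seen.add(job_id)
        return True

dedupe = JobDeduper()
incoming = ["create:srv-7", "create:srv-7", "delete:srv-3"]  # one replayed job
accepted = [j for j in incoming if dedupe.accept(j)]
print(accepted)  # ['create:srv-7', 'delete:srv-3']
```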

### 3. Worker Verification

- Confirm only one API/worker instance is running (Detroit only)
- Confirm Denver host is fully decommissioned (it is — OS wiped Apr 2, 2026)
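
A single-instance guard would turn the first check into a startup invariant rather than a manual step. The sketch below uses an exclusive lock file (`O_CREAT | O_EXCL` creation is atomic, so exactly one process wins); a Redis lock or a systemd unit constraint would serve equally well, and all names here are illustrative:

```python
import atexit
import os
import tempfile

def acquire_single_instance_lock(path):
    """Refuse to start if another worker already holds the lock file.

    Returns the path on success, None if another instance is running.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # another instance is running
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    atexit.register(lambda: os.path.exists(path) and os.unlink(path))
    return path

# Demo with a throwaway path; a real worker would use one fixed,
# well-known location.
lock_path = os.path.join(tempfile.mkdtemp(), "zerolaghub-worker.lock")
print("acquired" if acquire_single_instance_lock(lock_path) else "already running")
print("second attempt:", acquire_single_instance_lock(lock_path))  # None
```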

### 4. PBS Backup Contingency

- If DB is inconsistent → restore MariaDB from Denver snapshot
- Redis remains fresh (never restored)

### 5. Provisioning Validation

- Run a single controlled server creation
- Confirm:
  - 1 DB record
  - 1 job
  - 1 agent execution
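
The confirmation step above reduces to a counting invariant. A minimal sketch, assuming the three inputs are lists already filtered by caller queries and keyed by a `server_id` field (names are illustrative):

```python
def check_provisioning_invariants(db_records, jobs, agent_runs, server_id):
    """Verify one controlled creation produced exactly one of everything."""
    counts = {
        "db_records": sum(1 for r in db_records if r["server_id"] == server_id),
        "jobs": sum(1 for j in jobs if j["server_id"] == server_id),
        "agent_runs": sum(1 for a in agent_runs if a["server_id"] == server_id),
    }
    # Any count other than exactly 1 signals a duplicate or a lost step.
    violations = {k: n for k, n in counts.items() if n != 1}
    return counts, violations

counts, violations = check_provisioning_invariants(
    db_records=[{"server_id": "srv-9"}],
    jobs=[{"server_id": "srv-9"}, {"server_id": "srv-9"}],  # a replayed job
    agent_runs=[{"server_id": "srv-9"}],
    server_id="srv-9",
)
print(violations)  # {'jobs': 2}
```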

---

## Key Takeaway

This session confirmed the system is operating as a distributed, state-driven platform:

| Layer | Role |
|-------|------|
| Redis | Execution layer — ephemeral, never restore |
| Database | Source of truth — authoritative state |
| Routing | Strict boundaries — control plane vs data plane |

Stability depends on maintaining a single authoritative state and clean execution flow across all components.