# 🚀 ZeroLagHub - GPT Implementation Handover (December 2025)

**Last Updated**: December 7, 2025
**Version**: 4.0 (Launch-Ready Implementation Guide)
**Status**: 85% Platform Complete - Active Development Sprint

---

## 🎯 Your Role (GPT - Implementation AI)

**You are**: The tactical implementation AI responsible for **building features**
**Claude is**: The strategic architecture AI responsible for **design decisions**

### **Your Responsibilities**
- ✅ Implement features within architectural boundaries
- ✅ Write code for API, Agent, and Frontend
- ✅ Fix bugs and optimize performance
- ✅ Execute sprint tasks from Kanban board
- ✅ Consult Cross-Project Tracker before crossing system boundaries

### **Your Constraints**
- ❌ Do NOT make architectural decisions without Claude
- ❌ Do NOT violate ownership boundaries (API vs Agent vs Frontend)
- ❌ Do NOT change contracts without updating both sides
- ❌ Do NOT skip drift prevention checks

---

## 📋 Critical Documents (READ THESE FIRST)

### **🛡️ MANDATORY Before ANY Code**
1. **[Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md)** (30 seconds)
   - Quick boundary check for every session
   - Violation triggers to watch for

2. **[Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)** (consult before changes)
   - Ownership matrix (API vs Agent vs Frontend)
   - Canonical contracts (API ↔ Agent, Frontend ↔ API)
   - Drift detection rules with examples

### **📊 Implementation Context**
3. **[Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md)** (5 minutes)
   - Engineering Kanban (DONE/IN PROGRESS/TODO)
   - 3-day sprint plan with tasks
   - Troubleshooting guides per variant
   - Launch readiness matrix

4. **Engineering Handover** (from today's uploaded document)
   - System lifecycle diagram
   - Provisioning sequence
   - Verification system specs
   - Today's accomplishments

### **🔧 Technical Reference**
5. **[Agent Complete Spec](computer:///home/claude/ZeroLagHub_Agent_Complete_Spec.md)** (as needed)
   - Go agent implementation details
   - API endpoints and contracts

6. **[Infrastructure Specs](computer:///mnt/user-data/outputs/ZeroLagHub_Infrastructure_Specifications.md)** (as needed)
   - GTHost hardware constraints
   - Capacity planning

---

## 🎯 Current Platform Status (December 7, 2025)

### **What's Working** ✅ (85% Complete)

**Core Provisioning Pipeline**:
- ✅ All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge)
- ✅ VMID allocation (sequential)
- ✅ LXC container creation (template VMID 800)
- ✅ IP detection (10.200.0.X)
- ✅ Go agent deployment + self-repair
- ✅ Java runtime auto-selection (17/21)
- ✅ DNS automation (Cloudflare + Technitium)
- ✅ Velocity proxy registration
- ✅ Start/stop/restart control
- ✅ Console command injection
- ✅ Log tailing (HTTP polling)
- ✅ Crash detection

**Supported Minecraft Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x

### **What's Missing** ❌ (15% To-Do)

**Critical for Launch** (7-9 hours):
- ❌ WebSocket console streaming (4-6 hours) - **HIGH PRIORITY**
- ❌ Crash loop protection with backoff (2 hours) - **HIGH PRIORITY**
- ❌ Disk space monitoring (1 hour) - **HIGH PRIORITY**

**Dev Platform** (1 day):
- 🔧 Dev container provisioning (Day 1 of sprint)
- 🔧 EdgeState schema migration (Day 2 of sprint)
- 🔧 Reconciliation job (Day 3 of sprint)

**Nice-to-Have** (future):
- 📋 File upload/download
- 📋 Backup/restore UI
- 📋 Resource monitoring dashboard
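
For the "crash loop protection with backoff" item, one possible shape is a small restart policy: count crashes inside a sliding window, double the delay after each one, and stop restarting once the budget is spent. This is only a sketch. `RestartPolicy`, `NextDelay`, `simulate`, and every threshold below are hypothetical names and values, not taken from the existing agent code.

```go
package main

import (
	"fmt"
	"time"
)

// RestartPolicy is a hypothetical crash-loop guard. It allows at most
// MaxRestarts restarts inside Window and doubles the delay after each crash.
type RestartPolicy struct {
	BaseDelay   time.Duration // delay before the first restart
	MaxDelay    time.Duration // cap for the backoff delay
	MaxRestarts int           // restarts allowed inside Window
	Window      time.Duration // sliding window used to count crashes
	crashes     []time.Time   // timestamps of crashes still inside Window
}

// NextDelay records a crash at time now and returns how long to wait before
// restarting. ok is false once the restart budget for the window is spent.
func (p *RestartPolicy) NextDelay(now time.Time) (delay time.Duration, ok bool) {
	kept := p.crashes[:0]
	for _, t := range p.crashes {
		if now.Sub(t) < p.Window {
			kept = append(kept, t) // still inside the window
		}
	}
	p.crashes = append(kept, now)
	if len(p.crashes) > p.MaxRestarts {
		return 0, false // crash loop: stop restarting, report an error state
	}
	delay = p.BaseDelay << (len(p.crashes) - 1) // base, 2x, 4x, ...
	if delay <= 0 || delay > p.MaxDelay {
		delay = p.MaxDelay
	}
	return delay, true
}

// simulate feeds n crashes, one second apart, into a fresh policy and
// returns each decision as text ("5s", "10s", ... or "stop").
func simulate(n int) []string {
	p := &RestartPolicy{BaseDelay: 5 * time.Second, MaxDelay: time.Minute, MaxRestarts: 4, Window: 10 * time.Minute}
	start := time.Unix(0, 0)
	var out []string
	for i := 0; i < n; i++ {
		d, ok := p.NextDelay(start.Add(time.Duration(i) * time.Second))
		if !ok {
			out = append(out, "stop")
			continue
		}
		out = append(out, d.String())
	}
	return out
}

func main() {
	fmt.Println(simulate(6)) // [5s 10s 20s 40s stop stop]
}
```
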
---

## 🗺️ System Architecture (Know Your Boundaries)

### **Three-System Ownership**

```
┌─────────────────────────────────────────┐
│  NODE.JS API (Orchestrator)             │
│  You Own: Routes, services, job queue   │
│  Speaks To: Proxmox, DNS, Velocity, DB  │
│  Never: Installs Java, downloads files  │
└────────────┬────────────────────────────┘
             │ HTTP Contract
             │ POST /config, /start, /stop
             │ GET /status, /health
             ▼
┌─────────────────────────────────────────┐
│  GO AGENT (Container Manager)           │
│  You Own: Installation, verification    │
│  Speaks To: Filesystem, game process    │
│  Never: Allocates ports, creates DNS    │
└────────────┬────────────────────────────┘
             │ Status Polling
             ▼
┌─────────────────────────────────────────┐
│  NEXT.JS FRONTEND (UI Only)             │
│  You Own: Components, client state      │
│  Speaks To: API only                    │
│  Never: Agent, Proxmox, DNS, Velocity   │
└─────────────────────────────────────────┘
```
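
To make the API ↔ Agent HTTP contract in the diagram concrete, the sketch below wires the five endpoints onto a standard-library `http.ServeMux` and exercises them in-process with `net/http/httptest`. It is not the real agent: the `agent` type, the `postOnly`, `newMux`, and `call` helpers, the response codes, and every state name other than `RUNNING` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// agent is a hypothetical in-memory stand-in for the agent's state machine.
type agent struct {
	mu    sync.Mutex
	state string
}

func (a *agent) set(s string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.state = s
}

func (a *agent) get() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.state
}

// postOnly rejects anything but POST and then moves the agent to next.
func (a *agent) postOnly(next string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		a.set(next)
		w.WriteHeader(http.StatusAccepted)
	}
}

// newMux wires the five contract endpoints from the diagram above.
func newMux(a *agent) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/config", a.postOnly("INSTALLING")) // real agent saves the payload and installs async
	mux.HandleFunc("/start", a.postOnly("STARTING"))    // real agent reports RUNNING only after READY
	mux.HandleFunc("/stop", a.postOnly("STOPPED"))
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{"state": a.get()})
	})
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// call drives the mux in-process and returns the status code and body.
func call(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	mux := newMux(&agent{state: "IDLE"})
	fmt.Println(call(mux, "POST", "/config"))
	fmt.Println(call(mux, "GET", "/status"))
}
```

Note what the mux does not contain: no port allocation, DNS, Velocity, or Proxmox calls. Those stay on the API side of the contract.
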
### **Critical Boundaries (NEVER CROSS)**

**API Must NOT**:
- ❌ Install Java inside containers
- ❌ Download game server files
- ❌ Execute commands directly in containers (use Agent)

**Agent Must NOT**:
- ❌ Allocate ports
- ❌ Create DNS records
- ❌ Register with Velocity
- ❌ Talk to Proxmox API
- ❌ Manage VMIDs

**Frontend Must NOT**:
- ❌ Talk directly to Agent
- ❌ Call Proxmox API
- ❌ Create DNS records
- ❌ Bypass API for any infrastructure

**Violation = STOP → Consult Cross-Project Tracker**

---

## 📅 3-Day Sprint Plan (Your Tasks)

### **Day 1: Dev Containers** (December 8)

**Goal**: Enable developer environment provisioning

**Tasks**:
1. [ ] Define dev container spec (Python, Node, Go, Java)
2. [ ] Create template VMID 6000 (base dev environment)
3. [ ] API: Add `/api/dev-instances` endpoints (create, delete, status)
4. [ ] Agent: Add dev provisioning flow (no game server start)
5. [ ] Test: Provision Python + Node dev environments

**Success Criteria**:
- Can provision dev environment with chosen language
- Dev container accessible via SSH or web console
- No game server logic triggered

**Files to Create**:
- `src/routes/devInstances.js` (new)
- `src/services/devProvisioner.js` (new)
- Agent: `dev.go` (new provisioning flow)

---

### **Day 2: EdgeState + DNS Reliability** (December 9)

**Goal**: Fix Cloudflare SRV deletion problem

**Tasks**:
1. [ ] Implement EdgeState model in Prisma schema
2. [ ] Update `edgePublisher.js` to store Cloudflare record IDs
3. [ ] Update `dePublisher.js` to delete by record ID (not hostname)
4. [ ] Test: Create → Delete → Recreate same hostname

**Success Criteria**:
- No orphaned DNS records after deletion
- EdgeState tracks all Cloudflare record IDs
- Re-provisioning same hostname works

**Files to Modify**:
- `prisma/schema.prisma` (add EdgeState model)
- `src/services/edgePublisher.js`
- `src/services/dePublisher.js`
- `src/clients/cloudflareClient.js` (return record IDs)
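
The Day 2 idea (remember each record ID at creation, delete by ID rather than inferring records from the hostname) can be sketched independently of Prisma and Cloudflare. The example is in Go only to match the other code in this document; the real change belongs in the Node.js API. `EdgeState`'s fields, `fakeDNS`, `publish`, `unpublish`, the `rec-N` IDs, and the choice of one address record plus one `_minecraft._tcp` SRV record are all assumptions.

```go
package main

import "fmt"

// EdgeState mirrors the idea of the planned model: one entry per server that
// remembers every DNS record ID created for it. Field names are assumptions.
type EdgeState struct {
	Hostname  string
	RecordIDs []string
}

// fakeDNS stands in for the Cloudflare client. It maps record ID to name.
type fakeDNS struct {
	next    int
	records map[string]string
}

// create adds a record and returns its provider-assigned ID.
func (d *fakeDNS) create(name string) string {
	d.next++
	id := fmt.Sprintf("rec-%d", d.next)
	d.records[id] = name
	return id
}

// deleteByID removes exactly one record, identified by ID.
func (d *fakeDNS) deleteByID(id string) { delete(d.records, id) }

// publish creates the address and SRV records and stores their IDs.
func publish(d *fakeDNS, host string) EdgeState {
	return EdgeState{Hostname: host, RecordIDs: []string{
		d.create(host),
		d.create("_minecraft._tcp." + host),
	}}
}

// unpublish deletes exactly the records that publish created, by ID.
func unpublish(d *fakeDNS, st EdgeState) {
	for _, id := range st.RecordIDs {
		d.deleteByID(id)
	}
}

func main() {
	dns := &fakeDNS{records: map[string]string{}}
	first := publish(dns, "mc.example.com")
	unpublish(dns, first)
	fmt.Println("records after delete:", len(dns.records))
	second := publish(dns, "mc.example.com")
	fmt.Println("records after recreate:", len(dns.records), second.RecordIDs)
}
```
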
---

### **Day 3: Reconciliation + Hardening** (December 10)

**Goal**: Self-healing infrastructure

**Tasks**:
1. [ ] Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity)
2. [ ] Detect orphaned containers (in Proxmox but not DB)
3. [ ] Detect orphaned DNS records (in DNS but not DB)
4. [ ] Auto-cleanup with confirmation prompt
5. [ ] Regression test suite

**Success Criteria**:
- Reconciliation job detects all orphans
- Can auto-clean with user confirmation
- System recovers from partial failures

**Files to Create or Rewrite**:
- `src/jobs/reconciliationJob.js`
- `src/services/reconciler.js` (rewrite)
- `tests/reconciliation.test.js`
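
Orphan detection in tasks 2 and 3 reduces to a set difference: anything the external system reports that the database does not know about. A minimal sketch follows, again in Go for consistency with the rest of this document even though the job itself is planned for the Node.js API. The `orphans` helper and all VMIDs and hostnames are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// orphans returns the entries present in actual but missing from expected,
// sorted so that repeated runs produce stable output.
func orphans(expected, actual []string) []string {
	known := make(map[string]struct{}, len(expected))
	for _, e := range expected {
		known[e] = struct{}{}
	}
	var out []string
	for _, a := range actual {
		if _, ok := known[a]; !ok {
			out = append(out, a)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical snapshots of the database and the external systems.
	dbVMIDs := []string{"5001", "5002"}
	proxmoxVMIDs := []string{"5001", "5002", "5003"}
	dbHosts := []string{"a.example.com"}
	dnsHosts := []string{"a.example.com", "stale.example.com"}

	fmt.Println("orphaned containers:", orphans(dbVMIDs, proxmoxVMIDs))
	fmt.Println("orphaned DNS records:", orphans(dbHosts, dnsHosts))
	// Swapping the arguments finds the opposite drift: DB rows with no container.
	fmt.Println("missing containers:", orphans(proxmoxVMIDs, dbVMIDs))
}
```
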
---

## 🐛 Known Bugs (Fix These)

### **Go Agent Bugs** (Non-Blocking, But Should Fix)

1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
   ```go
   // REMOVE THIS - Forge doesn't create server.jar
   serverJarPath := filepath.Join(installDir, "*server.jar")
   ```
   **Fix**: Remove glob/rename logic entirely

2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
   ```go
   // ADD ELSE HERE
   if variant == "forge" || variant == "neoforge" {
       // Forge logic
   } else { // <-- ADD THIS
       // Vanilla-like logic
   }
   ```
   **Fix**: Add the `else` branch so Forge/NeoForge no longer falls through into the Vanilla-like logic

3. **Forge Stop Command Exclusion** (`process.go` line 83)
   ```go
   // REMOVE THIS EXCLUSION - Forge accepts stop commands
   if p.variant != "forge" && p.variant != "neoforge" {
       p.sendCommand("stop")
   }
   ```
   **Fix**: Remove the `if` condition and send `stop` to all variants

---

## 🚨 High-Risk Integration Zones (Careful!)

These areas have caused drift in past sessions:

1. **Forge/NeoForge Installation**
   - Agent owns ALL installation logic
   - API only passes config, never executes install commands

2. **Cloudflare SRV Deletion**
   - Must use record IDs (not hostname inference)
   - Store IDs in EdgeState on creation

3. **Velocity Registration Order**
   - Wait for Agent to report RUNNING state
   - Then register with Velocity (not before)

4. **Port Allocation**
   - API allocates → provisions → commits
   - Rollback on failure (don't leak ports)

5. **Agent READY Detection**
   - Use variant-aware log parsing
   - Forge takes 60-90s (don't timeout early)
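
For item 5, a rough sketch of READY detection is shown below. It assumes the server prints the usual `Done (…s)! For help, type "help"` line once startup finishes, and it limits variant awareness to the timeout. The regular expression, `isReady`, `readyTimeout`, and both timeout values are assumptions, not the agent's actual implementation. Only the 60-90s Forge startup figure comes from the list above.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// readyLine matches the startup-complete log line, for example:
//   [12:00:01] [Server thread/INFO]: Done (3.214s)! For help, type "help"
// The comma in the character class tolerates locales that print "3,214s".
var readyLine = regexp.MustCompile(`Done \([0-9.,]+s\)!`)

// isReady reports whether a single log line signals that the server is up.
func isReady(line string) bool {
	return readyLine.MatchString(line)
}

// readyTimeout returns how long to wait for the READY line. Forge and
// NeoForge get extra headroom because they can take 60-90s to start.
func readyTimeout(variant string) time.Duration {
	switch variant {
	case "forge", "neoforge":
		return 180 * time.Second // assumption: 2x the slowest observed start
	default:
		return 90 * time.Second // assumption
	}
}

func main() {
	line := `[12:00:01] [Server thread/INFO]: Done (3.214s)! For help, type "help"`
	fmt.Println(isReady(line), readyTimeout("forge"), readyTimeout("paper"))
}
```
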
---

## 📁 Key File Locations

### **API Service** (`/home/zlh/zlh-api-v2/`)
```
src/
├── routes/
│   ├── containers.js          # Game server endpoints
│   └── devInstances.js        # Dev environment endpoints (TO CREATE)
├── services/
│   ├── edgePublisher.js       # DNS + Velocity publishing
│   ├── dePublisher.js         # Edge cleanup (NEEDS REWRITE)
│   ├── portAllocator.js       # Port management
│   └── reconciler.js          # Orphan detection (NEEDS REWRITE)
├── clients/
│   ├── cloudflareClient.js    # Cloudflare API (UPDATE for record IDs)
│   ├── technitiumClient.js    # Technitium DNS
│   └── proxmoxClient.js       # Proxmox API
└── jobs/
    └── reconciliationJob.js   # Self-healing job (TO CREATE)
```

### **Go Agent** (`/opt/zlh-agent/`)
```
├── agent.go       # Main provisioning logic (FIX fallthrough)
├── artifacts.go   # Download + verification (REMOVE Forge glob)
├── process.go     # Server lifecycle (FIX Forge stop exclusion)
├── api.go         # HTTP server for control
└── dev.go         # Dev environment provisioning (TO CREATE)
```

### **Frontend** (`/home/zlh/zlh-portal/`)
```
src/
├── app/
│   ├── containers/    # Game server UI
│   └── dev/           # Dev environment UI (TO CREATE)
└── components/
    └── Console.tsx    # WebSocket console (TO CREATE)
```

---

## 🎮 Minecraft Variant Status

| Variant | Install | Verify | Start | READY Detection | Status |
|---------|---------|--------|-------|-----------------|--------|
| Vanilla | ✅ | ✅ | ✅ | ✅ | Production |
| Paper | ✅ | ✅ | ✅ | ✅ | Production |
| Purpur | ✅ | ✅ | ✅ | ✅ | Production |
| Fabric | ✅ | ✅ | ✅ | ✅ | Production |
| Forge | ✅ | ✅ | ✅ | ✅ | Production (has bugs) |
| NeoForge | ✅ | ✅ | ✅ | ✅ | Production |

**All variants work** - 3 non-blocking bugs should be fixed for code quality.

---

## 🔄 Provisioning Flow (Know This)

```
1. User creates server via Frontend
   ↓
2. Frontend → API: POST /api/containers/create
   ↓
3. API allocates VMID, ports (if needed)
   ↓
4. API clones LXC from template VMID 800
   ↓
5. API configures container (IP, resources)
   ↓
6. API starts LXC
   ↓
7. API detects container IP (10.200.0.X)
   ↓
8. API → Agent: POST /config (payload)
   ↓
9. Agent saves payload.json
   ↓
10. Agent spawns install goroutine (async)
    ├─ Download Java
    ├─ Download game artifacts
    ├─ Verify installation
    └─ Self-repair if needed
    ↓
11. Agent starts server
    ↓
12. Agent detects READY (log parsing)
    ↓
13. Agent sets state = RUNNING
    ↓
14. API polls /status until RUNNING
    ↓
15. API saves to database
    ↓
16. API publishes DNS (Cloudflare + Technitium)
    ↓
17. API registers with Velocity
    ↓
18. API returns SUCCESS to user
    ↓
COMPLETE ✅
```
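
Two ordering rules in this flow are easy to get wrong: nothing is published until the agent reports RUNNING (steps 14 to 17), and allocated ports must be released when any later step fails. The sketch below shows only that ordering, with every side effect injected as a function. It is written in Go for consistency with the rest of this document, although the orchestrator is the Node.js API. Container creation and the database write are left out, and `Steps`, `waitForRunning`, `provision`, `demo`, and the `INSTALLING` state are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Steps bundles the side effects as functions so the ordering can be shown
// without real Proxmox, DNS, or Velocity clients. All names are hypothetical.
type Steps struct {
	AllocatePorts    func() error
	ReleasePorts     func()
	AgentState       func() (string, error) // wraps GET /status on the agent
	PublishDNS       func() error
	RegisterVelocity func() error
}

// waitForRunning polls the agent until it reports RUNNING or attempts run out.
func waitForRunning(state func() (string, error), attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		if s, err := state(); err == nil && s == "RUNNING" {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("agent did not reach RUNNING")
}

// provision allocates ports, waits for RUNNING, then publishes DNS and
// registers with Velocity. Ports are released if any step after allocation fails.
func provision(s Steps, attempts int, interval time.Duration) (err error) {
	if err = s.AllocatePorts(); err != nil {
		return err
	}
	defer func() {
		if err != nil {
			s.ReleasePorts() // rollback so failed provisions do not leak ports
		}
	}()
	if err = waitForRunning(s.AgentState, attempts, interval); err != nil {
		return err
	}
	if err = s.PublishDNS(); err != nil {
		return err
	}
	return s.RegisterVelocity()
}

// demo runs provision against fakes. The agent reports RUNNING on its third poll.
func demo(failDNS bool) (log []string, err error) {
	polls := 0
	s := Steps{
		AllocatePorts: func() error { log = append(log, "allocate"); return nil },
		ReleasePorts:  func() { log = append(log, "release") },
		AgentState: func() (string, error) {
			polls++
			if polls < 3 {
				return "INSTALLING", nil
			}
			return "RUNNING", nil
		},
		PublishDNS: func() error {
			if failDNS {
				return errors.New("dns publish failed")
			}
			log = append(log, "dns")
			return nil
		},
		RegisterVelocity: func() error { log = append(log, "velocity"); return nil },
	}
	err = provision(s, 5, time.Millisecond)
	return log, err
}

func main() {
	fmt.Println(demo(false)) // happy path: allocate, dns, velocity
	fmt.Println(demo(true))  // DNS failure: allocate, release, plus the error
}
```
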
---

## 🛡️ Drift Prevention (ACTIVE)

### **Before Writing ANY Code**

**Ask yourself**:
1. Which system am I modifying? (API / Agent / Frontend)
2. Does this cross boundaries? (If yes → read Cross-Project Tracker)
3. Am I adding external calls to Agent? (If yes → VIOLATION)
4. Am I adding container execution to API? (If yes → VIOLATION)
5. Am I bypassing API in Frontend? (If yes → VIOLATION)

**If ANY doubt** → Stop and consult [Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)

### **Common Violations to Avoid**

- ❌ Agent allocating ports → API owns this
- ❌ API installing Java → Agent owns this
- ❌ Frontend calling Agent directly → Must go through API
- ❌ Agent creating DNS → API owns this
- ❌ API deciding Java version → Agent owns this (version-aware)

---

## 🎯 Success Metrics (How You'll Be Measured)

### **Sprint Completion**
- [ ] All 3 days of sprint tasks completed
- [ ] Dev containers operational
- [ ] EdgeState tracking DNS record IDs
- [ ] Reconciliation job working

### **Code Quality**
- [ ] No architectural violations (follow Cross-Project Tracker)
- [ ] All 3 Go agent bugs fixed
- [ ] Tests passing
- [ ] No new drift introduced

### **Platform Readiness**
- [ ] 95% launch-ready after sprint
- [ ] All MC variants still working
- [ ] No regressions from changes

---

## 🧪 Testing Requirements

### **Before Committing Code**

**Test Each Variant**:
```bash
# API_URL and VMID are placeholders: set them for your environment.
# Request bodies show only the fields named in this guide; add auth as required.
API_URL="${API_URL:?set API_URL to the API base URL}"

# Test provisioning (one container per variant)
for variant in vanilla paper purpur fabric forge neoforge; do
  curl -fsS -X POST "$API_URL/api/containers/create" \
    -H 'Content-Type: application/json' \
    -d "{\"variant\": \"$variant\"}"
done

# Verify RUNNING state
VMID="${VMID:?set VMID to a container created above}"
curl -fsS "$API_URL/api/containers/$VMID/status"
# Should return: {state: "RUNNING"}

# Test control
curl -fsS -X POST "$API_URL/api/containers/$VMID/stop"
curl -fsS -X POST "$API_URL/api/containers/$VMID/start"
curl -fsS -X POST "$API_URL/api/containers/$VMID/restart"

# Test cleanup
curl -fsS -X DELETE "$API_URL/api/containers/$VMID"
# Verify no orphaned DNS records
```

**Test Dev Containers**:
```bash
# API_URL is a placeholder: set it for your environment.
API_URL="${API_URL:?set API_URL to the API base URL}"

for language in python node; do
  curl -fsS -X POST "$API_URL/api/dev-instances/create" \
    -H 'Content-Type: application/json' \
    -d "{\"language\": \"$language\"}"
done

# Verify accessible: SSH into each dev container
# Should have language toolchain installed
```

---

## 📊 Infrastructure Constraints (Know Your Limits)

**Hardware** (GTHost Dedicated):
- CPU: 12 cores / 24 threads (Intel Xeon Silver 4116)
- RAM: 192 GB (can run 30-50 simultaneous 4GB servers)
- Storage: 1.8 TB free (300-500 servers capacity)
- Network: 300 Mbit/s (30-60 concurrent players)

**Current Allocation**:
- 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk
- Available for servers: 128 GB RAM, 1.8 TB disk

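
As a worked example of the RAM limit, the sketch below does the arithmetic for fixed-size servers. It uses the 128 GB available figure and the 4 GB server size from the lists above, and ignores CPU, disk, network, and any memory overcommit. `canProvision` and `maxServers` are hypothetical helpers, not existing API code.

```go
package main

import "fmt"

const (
	availableRAMGB = 128 // "Available for servers" figure from the list above
	serverRAMGB    = 4   // example server size from the hardware notes
)

// maxServers is the RAM-only upper bound on simultaneously running servers.
func maxServers() int {
	return availableRAMGB / serverRAMGB
}

// canProvision reports whether one more server fits next to the ones
// that are already running, looking at RAM alone.
func canProvision(runningServers int) bool {
	return (runningServers+1)*serverRAMGB <= availableRAMGB
}

func main() {
	fmt.Println(maxServers(), canProvision(31), canProvision(32)) // 32 true false
}
```
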
**Don't exceed these limits** - check before provisioning.

---

## 🚀 Launch Decision (Context)

- **Option A**: Launch NOW (85% ready, soft beta only)
- **Option B**: +3 Days Sprint (95% ready, RECOMMENDED)
- **Option C**: +1 Week (98% ready, over-engineering)

**Your role**: Execute Option B sprint (complete 3 days of tasks)

**After sprint**: Platform ready for professional launch

---

## 📞 Session Continuity

### **Starting Fresh Session (You)**

```
1. Read Drift Prevention Card (30 seconds)
   └─ Activate boundary awareness

2. Read Complete Current State (5 minutes)
   └─ Get Kanban state + sprint tasks

3. Consult Cross-Project Tracker before code
   └─ Verify no boundary violations

4. Execute sprint tasks with constraints active

5. Test all variants before committing
```

### **Handoff to Claude (Architecture Questions)**

If you encounter:
- Architectural decisions (e.g., should we change contracts?)
- Strategic questions (e.g., which features to prioritize?)
- Business model questions
- Major design changes

**STOP and escalate to Claude** for architectural guidance.

---

## ✅ Quick Reference

### **Your Mission**
Execute 3-day sprint → Deliver 95% launch-ready platform

### **Your Boundaries**
API orchestrates | Agent installs | Frontend displays

### **Your Critical Docs**
1. Drift Prevention Card (session start)
2. Cross-Project Tracker (before code)
3. Complete Current State (sprint tasks)
4. Engineering Handover (technical details)

### **Your Success**
- [ ] 3-day sprint complete
- [ ] No architectural violations
- [ ] All variants still working
- [ ] Tests passing

---

## 🎯 Start Here (First Actions)

**Right Now**:
1. ✅ Read [Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md) (30 sec)
2. ✅ Read [Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md) (5 min)
3. ✅ Check Engineering Kanban for current TODO
4. ✅ Begin Day 1 sprint: Dev containers

**Remember**:
- 🛡️ Drift prevention ACTIVE
- 📋 Consult tracker before crossing boundaries
- ✅ Test all variants before committing
- 🚀 Goal: 95% launch-ready after 3 days

---

**Status**: You have everything you need to execute the sprint. Let's build! 🚀