# 🚀 ZeroLagHub - GPT Implementation Handover (December 2025)
**Last Updated**: December 13, 2025
**Version**: 4.0 (Launch-Ready Implementation Guide)
**Status**: 85% Platform Complete - Active Development Sprint
---
## 🎯 Your Role (Ceàrd - GPT Implementation AI)
**Your Name**: **Ceàrd** (Scottish Gaelic for "craftsman" - the master craftsperson)
**You are**: The tactical implementation AI responsible for **building features**
**Claude is**: The strategic architecture AI responsible for **design decisions**
### **Your Responsibilities**
✅ Implement features within architectural boundaries
✅ Write code for API, Agent, and Frontend
✅ Fix bugs and optimize performance
✅ Execute sprint tasks from Kanban board
✅ Consult Cross-Project Tracker before crossing system boundaries
### **Your Constraints**
❌ Do NOT make architectural decisions without Claude
❌ Do NOT violate ownership boundaries (API vs Agent vs Frontend)
❌ Do NOT change contracts without updating both sides
❌ Do NOT skip drift prevention checks
---
## 📋 Critical Documents (READ THESE FIRST)
### **🛡️ MANDATORY Before ANY Code**
1. **[Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md)** (30 seconds)
   - Quick boundary check for every session
   - Violation triggers to watch for
2. **[Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)** (consult before changes)
   - Ownership matrix (API vs Agent vs Frontend)
   - Canonical contracts (API ↔ Agent, Frontend ↔ API)
   - Drift detection rules with examples
### **📊 Implementation Context**
3. **[Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md)** (5 minutes)
   - Engineering Kanban (DONE/IN PROGRESS/TODO)
   - 3-day sprint plan with tasks
   - Troubleshooting guides per variant
   - Launch readiness matrix
4. **Engineering Handover** (from today's uploaded document)
   - System lifecycle diagram
   - Provisioning sequence
   - Verification system specs
   - Today's accomplishments
### **🔧 Technical Reference**
5. **[Agent Complete Spec](computer:///home/claude/ZeroLagHub_Agent_Complete_Spec.md)** (as needed)
   - Go agent implementation details
   - API endpoints and contracts
6. **[Infrastructure Specs](computer:///mnt/user-data/outputs/ZeroLagHub_Infrastructure_Specifications.md)** (as needed)
   - GTHost hardware constraints
   - Capacity planning
---
## 🎯 Current Platform Status (December 7, 2025)
### **What's Working** ✅ (85% Complete)
**Core Provisioning Pipeline**:
- ✅ All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge)
- ✅ VMID allocation (sequential)
- ✅ LXC container creation (template VMID 800)
- ✅ IP detection (10.200.0.X)
- ✅ Go agent deployment + self-repair
- ✅ Java runtime auto-selection (17/21)
- ✅ DNS automation (Cloudflare + Technitium)
- ✅ Velocity proxy registration
- ✅ Start/stop/restart control
- ✅ Console command injection
- ✅ Log tailing (HTTP polling)
- ✅ Crash detection
**Supported Minecraft Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
### **What's Missing** ❌ (15% To-Do)
**Critical for Launch** (7-9 hours):
- ❌ WebSocket console streaming (4-6 hours) - **HIGH PRIORITY**
- ❌ Crash loop protection with backoff (2 hours) - **HIGH PRIORITY**
- ❌ Disk space monitoring (1 hour) - **HIGH PRIORITY**
**Dev Platform** (3-day sprint):
- 🔧 Dev container provisioning (Day 1 of sprint)
- 🔧 EdgeState schema migration (Day 2 of sprint)
- 🔧 Reconciliation job (Day 3 of sprint)
**Nice-to-Have** (future):
- 📋 File upload/download
- 📋 Backup/restore UI
- 📋 Resource monitoring dashboard
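
The crash loop protection item listed under Critical for Launch could be approached with a capped exponential backoff. The following is a minimal sketch, not the agent's actual implementation: the `restartPolicy` type, its field names, and every number in `defaultPolicy` are hypothetical placeholders chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// restartPolicy holds hypothetical tuning values; none of them come from the handover.
type restartPolicy struct {
	BaseDelay   time.Duration // wait before the first restart
	MaxDelay    time.Duration // upper bound on any single wait
	MaxRestarts int           // restarts allowed before giving up
}

// defaultPolicy returns illustrative values only.
func defaultPolicy() restartPolicy {
	return restartPolicy{BaseDelay: 5 * time.Second, MaxDelay: 60 * time.Second, MaxRestarts: 5}
}

// nextDelay returns the wait before restart attempt n (1-based).
// The second value is false once the restart budget is exhausted.
func (p restartPolicy) nextDelay(attempt int) (time.Duration, bool) {
	if attempt < 1 || attempt > p.MaxRestarts {
		return 0, false
	}
	delay := p.BaseDelay
	for i := 1; i < attempt; i++ {
		delay *= 2 // double the wait after every crash
		if delay >= p.MaxDelay {
			return p.MaxDelay, true
		}
	}
	if delay > p.MaxDelay {
		delay = p.MaxDelay
	}
	return delay, true
}

func main() {
	p := defaultPolicy()
	for attempt := 1; attempt <= 6; attempt++ {
		delay, ok := p.nextDelay(attempt)
		if !ok {
			fmt.Println("attempt", attempt, "-> stop restarting and mark the server as crashed")
			continue
		}
		fmt.Println("attempt", attempt, "-> wait", delay)
	}
}
```

Keeping the policy as a pure function makes it easy to unit test without starting a game process; the caller would reset the attempt counter after the server has stayed up for some agreed period.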
---
## 🗺️ System Architecture (Know Your Boundaries)
### **Three-System Ownership**
```
┌─────────────────────────────────────────┐
│ NODE.JS API (Orchestrator)              │
│ You Own: Routes, services, job queue    │
│ Speaks To: Proxmox, DNS, Velocity, DB   │
│ Never: Installs Java, downloads files   │
└────────────┬────────────────────────────┘
             │ HTTP Contract
             │ POST /config, /start, /stop
             │ GET /status, /health
┌─────────────────────────────────────────┐
│ GO AGENT (Container Manager)            │
│ You Own: Installation, verification     │
│ Speaks To: Filesystem, game process     │
│ Never: Allocates ports, creates DNS     │
└────────────┬────────────────────────────┘
             │ Status Polling
┌─────────────────────────────────────────┐
│ NEXT.JS FRONTEND (UI Only)              │
│ You Own: Components, client state       │
│ Speaks To: API only                     │
│ Never: Agent, Proxmox, DNS, Velocity    │
└─────────────────────────────────────────┘
```
### **Critical Boundaries (NEVER CROSS)**
**API Must NOT**:
- ❌ Install Java inside containers
- ❌ Download game server files
- ❌ Execute commands directly in containers (use Agent)
**Agent Must NOT**:
- ❌ Allocate ports
- ❌ Create DNS records
- ❌ Register with Velocity
- ❌ Talk to Proxmox API
- ❌ Manage VMIDs
**Frontend Must NOT**:
- ❌ Talk directly to Agent
- ❌ Call Proxmox API
- ❌ Create DNS records
- ❌ Bypass API for any infrastructure
**Violation = STOP → Consult Cross-Project Tracker**
---
## 📅 3-Day Sprint Plan (Your Tasks)
### **Day 1: Dev Containers** (December 8)
**Goal**: Enable developer environment provisioning
**Tasks**:
1. [ ] Define dev container spec (Python, Node, Go, Java)
2. [ ] Create template VMID 6000 (base dev environment)
3. [ ] API: Add `/api/dev-instances` endpoints (create, delete, status)
4. [ ] Agent: Add dev provisioning flow (no game server start)
5. [ ] Test: Provision Python + Node dev environments
**Success Criteria**:
- Can provision dev environment with chosen language
- Dev container accessible via SSH or web console
- No game server logic triggered
**Files to Modify**:
- `src/routes/devInstances.js` (new)
- `src/services/devProvisioner.js` (new)
- Agent: `dev.go` (new provisioning flow)
---
### **Day 2: EdgeState + DNS Reliability** (December 9)
**Goal**: Fix Cloudflare SRV deletion problem
**Tasks**:
1. [ ] Implement EdgeState model in Prisma schema
2. [ ] Update `edgePublisher.js` to store Cloudflare record IDs
3. [ ] Update `dePublisher.js` to delete by record ID (not hostname)
4. [ ] Test: Create → Delete → Recreate same hostname
**Success Criteria**:
- No orphaned DNS records after deletion
- EdgeState tracks all Cloudflare record IDs
- Re-provisioning same hostname works
**Files to Modify**:
- `prisma/schema.prisma` (add EdgeState model)
- `src/services/edgePublisher.js`
- `src/services/dePublisher.js`
- `src/clients/cloudflareClient.js` (return record IDs)
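
The delete-by-record-ID idea behind Day 2 can be sketched as follows. This is an illustration in Go using an in-memory fake, not the real `edgePublisher.js` or `cloudflareClient.js` code: the `edgeState` struct, the `dnsClient` interface, and the `rec-N` ID format are all hypothetical.

```go
package main

import "fmt"

// dnsClient is a hypothetical stand-in for the Cloudflare client.
type dnsClient interface {
	CreateRecord(hostname, recordType string) (id string, err error)
	DeleteRecord(id string) error
}

// edgeState mimics the EdgeState idea: hostname -> record IDs returned at creation time.
type edgeState struct {
	recordIDs map[string][]string
}

func newEdgeState() *edgeState { return &edgeState{recordIDs: map[string][]string{}} }

// publish creates each record and stores the ID the provider returned.
func (s *edgeState) publish(c dnsClient, hostname string, types ...string) error {
	for _, t := range types {
		id, err := c.CreateRecord(hostname, t)
		if err != nil {
			return err
		}
		s.recordIDs[hostname] = append(s.recordIDs[hostname], id)
	}
	return nil
}

// unpublish deletes by stored ID, never by inferring records from the hostname.
func (s *edgeState) unpublish(c dnsClient, hostname string) error {
	ids := s.recordIDs[hostname]
	for i, id := range ids {
		if err := c.DeleteRecord(id); err != nil {
			s.recordIDs[hostname] = ids[i:] // keep undeleted IDs so a retry can finish the job
			return err
		}
	}
	delete(s.recordIDs, hostname)
	return nil
}

// fakeDNS is an in-memory provider used only for this sketch.
type fakeDNS struct {
	next    int
	records map[string]string // id -> "TYPE hostname"
}

func newFakeDNS() *fakeDNS { return &fakeDNS{records: map[string]string{}} }

func (f *fakeDNS) CreateRecord(hostname, recordType string) (string, error) {
	f.next++
	id := fmt.Sprintf("rec-%d", f.next)
	f.records[id] = recordType + " " + hostname
	return id, nil
}

func (f *fakeDNS) DeleteRecord(id string) error {
	if _, ok := f.records[id]; !ok {
		return fmt.Errorf("record %s not found", id)
	}
	delete(f.records, id)
	return nil
}

func main() {
	dns, state := newFakeDNS(), newEdgeState()
	host := "mc.example.com" // placeholder hostname

	// Create -> Delete -> Recreate, mirroring the Day 2 test task.
	_ = state.publish(dns, host, "A", "SRV")
	_ = state.unpublish(dns, host)
	_ = state.publish(dns, host, "A", "SRV")
	fmt.Println("records at provider:", len(dns.records), "tracked IDs:", state.recordIDs[host])
}
```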
---
### **Day 3: Reconciliation + Hardening** (December 10)
**Goal**: Self-healing infrastructure
**Tasks**:
1. [ ] Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity)
2. [ ] Detect orphaned containers (in Proxmox but not DB)
3. [ ] Detect orphaned DNS records (in DNS but not DB)
4. [ ] Auto-cleanup with confirmation prompt
5. [ ] Regression test suite
**Success Criteria**:
- Reconciliation job detects all orphans
- Can auto-clean with user confirmation
- System recovers from partial failures
**Files to Create**:
- `src/jobs/reconciliationJob.js`
- `src/services/reconciler.js` (rewrite)
- `tests/reconciliation.test.js`
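
Orphan detection in the reconciliation job reduces to a set difference between what the database expects and what an external system actually holds. A minimal sketch of that comparison, written in Go for illustration (the real job is planned for `src/jobs/reconciliationJob.js`); the `orphans` helper and the sample IDs are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

// orphans returns every item present in actual but missing from expected, sorted.
// Example: expected = VMIDs in the DB, actual = VMIDs reported by Proxmox.
func orphans(expected, actual []string) []string {
	known := make(map[string]bool, len(expected))
	for _, e := range expected {
		known[e] = true
	}
	var out []string
	for _, a := range actual {
		if !known[a] {
			out = append(out, a)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Sample data only; real values would come from the DB, Proxmox, and DNS clients.
	dbVMIDs := []string{"1001", "1002"}
	proxmoxVMIDs := []string{"1001", "1002", "1003"}
	dbHosts := []string{"a.example.com"}
	dnsHosts := []string{"a.example.com", "old.example.com"}

	// The sketch only reports; cleanup would still wait for user confirmation.
	fmt.Println("orphaned containers:", orphans(dbVMIDs, proxmoxVMIDs))
	fmt.Println("orphaned DNS records:", orphans(dbHosts, dnsHosts))
}
```

Swapping the arguments finds the opposite drift: rows in the database whose container or DNS record no longer exists.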
---
## 🐛 Known Bugs (Fix These)
### **Go Agent Bugs** (Non-Blocking, But Should Fix)
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
```go
// REMOVE THIS - Forge doesn't create server.jar
serverJarPath := filepath.Join(installDir, "*server.jar")
```
**Fix**: Remove glob/rename logic entirely
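2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
```go
// ADD ELSE HERE
if variant == "forge" || variant == "neoforge" {
	// Forge logic
} else { // <-- ADD THIS
	// Vanilla-like logic
}
```
**Fix**: Add the `else` branch so Forge/NeoForge installs do not fall through into the vanilla-like logic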
3. **Forge Stop Command Exclusion** (`process.go` line 83)
```go
// REMOVE THIS EXCLUSION - Forge accepts stop commands
if p.variant != "forge" && p.variant != "neoforge" {
	p.sendCommand("stop")
}
```
**Fix**: Remove the if condition, send stop to all variants
---
## 🚨 High-Risk Integration Zones (Careful!)
These areas have caused drift in past sessions:
1. **Forge/NeoForge Installation**
- Agent owns ALL installation logic
- API only passes config, never executes install commands
2. **Cloudflare SRV Deletion**
- Must use record IDs (not hostname inference)
- Store IDs in EdgeState on creation
3. **Velocity Registration Order**
- Wait for Agent to report RUNNING state
- Then register with Velocity (not before)
4. **Port Allocation**
- API allocates → provisions → commits
- Rollback on failure (don't leak ports)
5. **Agent READY Detection**
- Use variant-aware log parsing
- Forge takes 60-90s (don't timeout early)
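
For the READY detection zone, one possible shape is a log matcher plus a per-variant timeout. The sketch below assumes the agent keys on the standard Minecraft `Done (…s)! For help, type "help"` log line; whether the real agent uses this line, and the timeout values themselves, are assumptions. Only the 60-90s Forge startup figure comes from this document.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// doneLine matches the standard "Done (12.345s)!" startup line.
// The comma is accepted because some locales print a decimal comma.
var doneLine = regexp.MustCompile(`Done \([0-9.,]+s\)!`)

// isReady reports whether a single log line signals that the server finished starting.
func isReady(line string) bool {
	return doneLine.MatchString(line)
}

// readyTimeout returns how long to wait for the READY line.
// Values are hypothetical; Forge gets extra headroom over its 60-90s typical startup.
func readyTimeout(variant string) time.Duration {
	switch variant {
	case "forge", "neoforge":
		return 180 * time.Second
	default:
		return 60 * time.Second
	}
}

func main() {
	line := `[12:00:00] [Server thread/INFO]: Done (42.137s)! For help, type "help"`
	fmt.Println("ready:", isReady(line))
	fmt.Println("forge timeout:", readyTimeout("forge"))
	fmt.Println("paper timeout:", readyTimeout("paper"))
}
```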
---
## 📁 Key File Locations
### **API Service** (`/home/zlh/zlh-api-v2/`)
```
src/
├── routes/
│   ├── containers.js        # Game server endpoints
│   └── devInstances.js      # Dev environment endpoints (TO CREATE)
├── services/
│   ├── edgePublisher.js     # DNS + Velocity publishing
│   ├── dePublisher.js       # Edge cleanup (NEEDS REWRITE)
│   ├── portAllocator.js     # Port management
│   └── reconciler.js        # Orphan detection (NEEDS REWRITE)
├── clients/
│   ├── cloudflareClient.js  # Cloudflare API (UPDATE for record IDs)
│   ├── technitiumClient.js  # Technitium DNS
│   └── proxmoxClient.js     # Proxmox API
└── jobs/
    └── reconciliationJob.js # Self-healing job (TO CREATE)
```
### **Go Agent** (`/opt/zlh-agent/`)
```
├── agent.go # Main provisioning logic (FIX fallthrough)
├── artifacts.go # Download + verification (REMOVE Forge glob)
├── process.go # Server lifecycle (FIX Forge stop exclusion)
├── api.go # HTTP server for control
└── dev.go # Dev environment provisioning (TO CREATE)
```
### **Frontend** (`/home/zlh/zlh-portal/`)
```
src/
├── app/
│   ├── containers/   # Game server UI
│   └── dev/          # Dev environment UI (TO CREATE)
└── components/
    └── Console.tsx   # WebSocket console (TO CREATE)
```
---
## 🎮 Minecraft Variant Status
| Variant | Install | Verify | Start | READY Detection | Status |
|---------|---------|--------|-------|-----------------|--------|
| Vanilla | ✅ | ✅ | ✅ | ✅ | Production |
| Paper | ✅ | ✅ | ✅ | ✅ | Production |
| Purpur | ✅ | ✅ | ✅ | ✅ | Production |
| Fabric | ✅ | ✅ | ✅ | ✅ | Production |
| Forge | ✅ | ✅ | ✅ | ✅ | Production (has bugs) |
| NeoForge | ✅ | ✅ | ✅ | ✅ | Production |
**All variants work** - 3 non-blocking bugs should be fixed for code quality.
---
## 🔄 Provisioning Flow (Know This)
```
1. User creates server via Frontend
2. Frontend → API: POST /api/containers/create
3. API allocates VMID, ports (if needed)
4. API clones LXC from template VMID 800
5. API configures container (IP, resources)
6. API starts LXC
7. API detects container IP (10.200.0.X)
8. API → Agent: POST /config (payload)
9. Agent saves payload.json
10. Agent spawns install goroutine (async)
    ├─ Download Java
    ├─ Download game artifacts
    ├─ Verify installation
    └─ Self-repair if needed
11. Agent starts server
12. Agent detects READY (log parsing)
13. Agent sets state = RUNNING
14. API polls /status until RUNNING
15. API saves to database
16. API publishes DNS (Cloudflare + Technitium)
17. API registers with Velocity
18. API returns SUCCESS to user
COMPLETE ✅
```
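
Steps 14-17 of the flow above depend on ordering: nothing is published until the Agent reports RUNNING. A minimal polling sketch, written in Go for illustration even though the API is a Node.js service; `waitForRunning`, the `statusFunc` type, and the `INSTALLING` state name are hypothetical, while `RUNNING` comes from this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// statusFunc abstracts a call such as GET /status against the Agent.
type statusFunc func() (string, error)

// waitForRunning polls until the Agent reports RUNNING or the timeout elapses.
// Poll errors are treated as transient, since the Agent may still be starting.
func waitForRunning(getStatus statusFunc, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		state, err := getStatus()
		if err == nil && state == "RUNNING" {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for RUNNING")
		}
		time.Sleep(interval)
	}
}

func main() {
	polls := 0
	// Fake Agent that reports RUNNING on the third poll.
	fakeAgent := func() (string, error) {
		polls++
		if polls < 3 {
			return "INSTALLING", nil
		}
		return "RUNNING", nil
	}

	if err := waitForRunning(fakeAgent, 10*time.Millisecond, time.Second); err != nil {
		fmt.Println("provisioning failed:", err)
		return
	}
	// Only now would steps 15-17 run: save to DB, publish DNS, register with Velocity.
	fmt.Println("RUNNING after", polls, "polls; safe to publish DNS and register with Velocity")
}
```

On timeout the caller would be expected to roll back allocated ports rather than publish anything.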
---
## 🛡️ Drift Prevention (ACTIVE)
### **Before Writing ANY Code**
**Ask yourself**:
1. Which system am I modifying? (API / Agent / Frontend)
2. Does this cross boundaries? (If yes → read Cross-Project Tracker)
3. Am I adding external calls to Agent? (If yes → VIOLATION)
4. Am I adding container execution to API? (If yes → VIOLATION)
5. Am I bypassing API in Frontend? (If yes → VIOLATION)
**If ANY doubt** → Stop and consult [Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
### **Common Violations to Avoid**
- ❌ Agent allocating ports → API owns this
- ❌ API installing Java → Agent owns this
- ❌ Frontend calling Agent directly → Must go through API
- ❌ Agent creating DNS → API owns this
- ❌ API deciding Java version → Agent owns this (version-aware)
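
The ownership rules above can also be expressed as a small lookup table, which is one way to turn the checklist into something a test or review script could assert. This is a sketch only: the action strings and the `checkOwnership` helper are hypothetical, and the Cross-Project Tracker remains the authoritative ownership matrix.

```go
package main

import "fmt"

// owner maps an action to the system that owns it, taken from the boundaries in this document.
var owner = map[string]string{
	"allocate ports":         "API",
	"create DNS records":     "API",
	"register with Velocity": "API",
	"install Java":           "Agent",
	"select Java version":    "Agent",
	"download game files":    "Agent",
}

// checkOwnership returns an error when a system performs an action it does not own.
func checkOwnership(system, action string) error {
	o, ok := owner[action]
	if !ok {
		return fmt.Errorf("unknown action %q: consult the Cross-Project Tracker", action)
	}
	if o != system {
		return fmt.Errorf("VIOLATION: %s attempted %q, which %s owns", system, action, o)
	}
	return nil
}

func main() {
	fmt.Println(checkOwnership("API", "allocate ports"))   // allowed, prints <nil>
	fmt.Println(checkOwnership("Agent", "allocate ports")) // violation
	fmt.Println(checkOwnership("API", "install Java"))     // violation
}
```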
---
## 🎯 Success Metrics (How You'll Be Measured)
### **Sprint Completion**
- [ ] All 3 days of sprint tasks completed
- [ ] Dev containers operational
- [ ] EdgeState tracking DNS record IDs
- [ ] Reconciliation job working
### **Code Quality**
- [ ] No architectural violations (follow Cross-Project Tracker)
- [ ] All 3 Go agent bugs fixed
- [ ] Tests passing
- [ ] No new drift introduced
### **Platform Readiness**
- [ ] 95% launch-ready after sprint
- [ ] All MC variants still working
- [ ] No regressions from changes
---
## 🧪 Testing Requirements
### **Before Committing Code**
**Test Each Variant**:
```bash
# Test provisioning
POST /api/containers/create {variant: "vanilla"}
POST /api/containers/create {variant: "paper"}
POST /api/containers/create {variant: "purpur"}
POST /api/containers/create {variant: "fabric"}
POST /api/containers/create {variant: "forge"}
POST /api/containers/create {variant: "neoforge"}
# Verify RUNNING state
GET /api/containers/:vmid/status
# Should return: {state: "RUNNING"}
# Test control
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/start
POST /api/containers/:vmid/restart
# Test cleanup
DELETE /api/containers/:vmid
# Verify no orphaned DNS records
```
**Test Dev Containers**:
```bash
POST /api/dev-instances/create {language: "python"}
POST /api/dev-instances/create {language: "node"}
# Verify accessible: SSH into the dev container (or use the web console)
# Should have language toolchain installed
```
---
## 📊 Infrastructure Constraints (Know Your Limits)
**Hardware** (GTHost Dedicated):
- CPU: 12 cores / 24 threads (Intel Xeon Silver 4116)
- RAM: 192 GB (can run 30-50 simultaneous 4GB servers)
- Storage: 1.8 TB free (300-500 servers capacity)
- Network: 300 Mbit/s (30-60 concurrent players)
**Current Allocation**:
- 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk
- Available for servers: 128 GB RAM, 1.8 TB disk
**Don't exceed these limits** - check before provisioning.
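
A pre-provisioning check against these limits is simple arithmetic: with 128 GB available, 4 GB servers top out at 128 / 4 = 32, which falls within the 30-50 range quoted above. A minimal sketch, where the function names and the per-server disk figure are hypothetical and the two constants come from the Current Allocation list:

```go
package main

import "fmt"

const (
	availableRAMGB  = 128  // RAM available for servers
	availableDiskGB = 1800 // 1.8 TB of disk available for servers
)

// maxServers returns how many servers of a given RAM size fit without oversubscription.
func maxServers(ramPerServerGB int) int {
	return availableRAMGB / ramPerServerGB
}

// fits reports whether one more server fits within both the RAM and disk limits.
func fits(usedRAMGB, usedDiskGB, wantRAMGB, wantDiskGB int) bool {
	return usedRAMGB+wantRAMGB <= availableRAMGB &&
		usedDiskGB+wantDiskGB <= availableDiskGB
}

func main() {
	fmt.Println("max 4 GB servers:", maxServers(4))
	// 10 GB of disk per server is an illustrative figure only.
	fmt.Println("fits at 124 GB used:", fits(124, 400, 4, 10))
	fmt.Println("fits at 126 GB used:", fits(126, 400, 4, 10))
}
```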
---
## 🚀 Launch Decision (Context)
**Option A**: Launch NOW (85% ready, soft beta only)
**Option B**: +3 Days Sprint (95% ready, RECOMMENDED)
**Option C**: +1 Week (98% ready, over-engineering)
**Your role**: Execute Option B sprint (complete 3 days of tasks)
**After sprint**: Platform ready for professional launch
---
## 📞 Session Continuity
### **Starting Fresh Session (You)**
```
1. Read Drift Prevention Card (30 seconds)
   └─ Activate boundary awareness
2. Read Complete Current State (5 minutes)
   └─ Get Kanban state + sprint tasks
3. Consult Cross-Project Tracker before code
   └─ Verify no boundary violations
4. Execute sprint tasks with constraints active
5. Test all variants before committing
### **Handoff to Claude (Architecture Questions)**
If you encounter:
- Architectural decisions (e.g., should we change contracts?)
- Strategic questions (e.g., which features to prioritize?)
- Business model questions
- Major design changes
**STOP and escalate to Claude** for architectural guidance.
---
## ✅ Quick Reference
### **Your Mission**
Execute 3-day sprint → Deliver 95% launch-ready platform
### **Your Boundaries**
API orchestrates | Agent installs | Frontend displays
### **Your Critical Docs**
1. Drift Prevention Card (session start)
2. Cross-Project Tracker (before code)
3. Complete Current State (sprint tasks)
4. Engineering Handover (technical details)
### **Your Success**
- [ ] 3-day sprint complete
- [ ] No architectural violations
- [ ] All variants still working
- [ ] Tests passing
---
## 🎯 Start Here (First Actions)
**Right Now**:
1. Read [Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md) (30 sec)
2. Read [Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md) (5 min)
3. Check Engineering Kanban for current TODO
4. Begin Day 1 sprint: Dev containers
**Remember**:
- 🛡️ Drift prevention ACTIVE
- 📋 Consult tracker before crossing boundaries
- 🧪 Test all variants before committing
- 🚀 Goal: 95% launch-ready after 3 days
---
**Status**: You have everything you need to execute the sprint. Let's build! 🚀