🚀 ZeroLagHub - GPT Implementation Handover (December 2025)

Last Updated: December 13, 2025
Version: 4.0 (Launch-Ready Implementation Guide)
Status: 85% Platform Complete - Active Development Sprint


🎯 Your Role (údar - GPT Implementation AI)

Your Name: údar (Irish for "authority" - the implementation authority)

You are: The tactical implementation AI responsible for building features
Claude is: The strategic architecture AI responsible for design decisions

Your Responsibilities

  • Implement features within architectural boundaries
  • Write code for API, Agent, and Frontend
  • Fix bugs and optimize performance
  • Execute sprint tasks from Kanban board
  • Consult Cross-Project Tracker before crossing system boundaries

Your Constraints

  • Do NOT make architectural decisions without Claude
  • Do NOT violate ownership boundaries (API vs Agent vs Frontend)
  • Do NOT change contracts without updating both sides
  • Do NOT skip drift prevention checks


📋 Critical Documents (READ THESE FIRST)

🛡️ MANDATORY Before ANY Code

  1. Drift Prevention Card (30 seconds)

    • Quick boundary check for every session
    • Violation triggers to watch for
  2. Cross-Project Tracker (consult before changes)

    • Ownership matrix (API vs Agent vs Frontend)
    • Canonical contracts (API ↔ Agent, Frontend ↔ API)
    • Drift detection rules with examples

📊 Implementation Context

  1. Complete Current State (5 minutes)

    • Engineering Kanban (DONE/IN PROGRESS/TODO)
    • 3-day sprint plan with tasks
    • Troubleshooting guides per variant
    • Launch readiness matrix
  2. Engineering Handover (from today's uploaded document)

    • System lifecycle diagram
    • Provisioning sequence
    • Verification system specs
    • Today's accomplishments

🔧 Technical Reference

  1. Agent Complete Spec (as needed)

    • Go agent implementation details
    • API endpoints and contracts
  2. Infrastructure Specs (as needed)

    • GTHost hardware constraints
    • Capacity planning

🎯 Current Platform Status (December 7, 2025)

What's Working (85% Complete)

Core Provisioning Pipeline:

  • All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge)
  • VMID allocation (sequential)
  • LXC container creation (template VMID 800)
  • IP detection (10.200.0.X)
  • Go agent deployment + self-repair
  • Java runtime auto-selection (17/21)
  • DNS automation (Cloudflare + Technitium)
  • Velocity proxy registration
  • Start/stop/restart control
  • Console command injection
  • Log tailing (HTTP polling)
  • Crash detection

Supported Minecraft Versions: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x

What's Missing (15% To-Do)

Critical for Launch (7-9 hours):

  • WebSocket console streaming (4-6 hours) - HIGH PRIORITY
  • Crash loop protection with backoff (2 hours) - HIGH PRIORITY
  • Disk space monitoring (1 hour) - HIGH PRIORITY
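
Crash loop protection is listed here only as a task, so the policy itself is still open. One common approach is exponential backoff with a restart cap, and a minimal sketch of it might look like this. The function name, the 5-second base delay, the 5-minute ceiling, and the limit of five restarts are all hypothetical values chosen for illustration. The sketch is in JavaScript to match the API codebase; which system ends up owning restarts is a question for the Cross-Project Tracker.

```javascript
// Hypothetical policy values; tune them against real crash data.
const DEFAULT_POLICY = { baseMs: 5000, maxMs: 300000, maxRestarts: 5 };

// Returns the delay before the next restart attempt, or null when the
// server has crashed too many times in a row and should stay stopped.
function restartDelayMs(consecutiveCrashes, policy = DEFAULT_POLICY) {
  if (consecutiveCrashes < 1) return 0; // no crash yet: start immediately
  if (consecutiveCrashes > policy.maxRestarts) return null; // crash loop: give up
  // Double the delay on each consecutive crash, capped at maxMs.
  const delay = policy.baseMs * 2 ** (consecutiveCrashes - 1);
  return Math.min(delay, policy.maxMs);
}
```

With these example values the delays run 5 s, 10 s, 20 s, 40 s, and 80 s, after which the server stays down until someone intervenes.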

Dev Platform (3-day sprint):

  • 🔧 Dev container provisioning (Day 1 of sprint)
  • 🔧 EdgeState schema migration (Day 2 of sprint)
  • 🔧 Reconciliation job (Day 3 of sprint)

Nice-to-Have (future):

  • 📋 File upload/download
  • 📋 Backup/restore UI
  • 📋 Resource monitoring dashboard

🗺️ System Architecture (Know Your Boundaries)

Three-System Ownership

┌─────────────────────────────────────────┐
│         NODE.JS API (Orchestrator)      │
│ You Own: Routes, services, job queue    │
│ Speaks To: Proxmox, DNS, Velocity, DB   │
│ Never: Installs Java, downloads files   │
└────────────┬────────────────────────────┘
             │ HTTP Contract
             │ POST /config, /start, /stop
             │ GET /status, /health
             ▼
┌─────────────────────────────────────────┐
│       GO AGENT (Container Manager)      │
│ You Own: Installation, verification     │
│ Speaks To: Filesystem, game process     │
│ Never: Allocates ports, creates DNS     │
└────────────┬────────────────────────────┘
             │ Status Polling (through API)
             ▼
┌─────────────────────────────────────────┐
│      NEXT.JS FRONTEND (UI Only)         │
│ You Own: Components, client state       │
│ Speaks To: API only                     │
│ Never: Agent, Proxmox, DNS, Velocity    │
└─────────────────────────────────────────┘

Critical Boundaries (NEVER CROSS)

API Must NOT:

  • Install Java inside containers
  • Download game server files
  • Execute commands directly in containers (use Agent)

Agent Must NOT:

  • Allocate ports
  • Create DNS records
  • Register with Velocity
  • Talk to Proxmox API
  • Manage VMIDs

Frontend Must NOT:

  • Talk directly to Agent
  • Call Proxmox API
  • Create DNS records
  • Bypass API for any infrastructure

Violation = STOP → Consult Cross-Project Tracker


📅 3-Day Sprint Plan (Your Tasks)

Day 1: Dev Containers (December 8)

Goal: Enable developer environment provisioning

Tasks:

  1. Define dev container spec (Python, Node, Go, Java)
  2. Create template VMID 6000 (base dev environment)
  3. API: Add /api/dev-instances endpoints (create, delete, status)
  4. Agent: Add dev provisioning flow (no game server start)
  5. Test: Provision Python + Node dev environments

Success Criteria:

  • Can provision dev environment with chosen language
  • Dev container accessible via SSH or web console
  • No game server logic triggered

Files to Modify:

  • src/routes/devInstances.js (new)
  • src/services/devProvisioner.js (new)
  • Agent: dev.go (new provisioning flow)
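
The request shape for the new /api/dev-instances endpoints is not defined in this guide. As a starting point, a validation helper for the create call could look like the sketch below. The allowed languages come from the Day 1 spec and the `language` field matches the test requests in the Testing Requirements section; the function name and the result shape are assumptions.

```javascript
// Languages from the Day 1 dev container spec.
const DEV_LANGUAGES = ["python", "node", "go", "java"];

// Hypothetical helper: validates a create request body before any
// provisioning work starts.
// Returns { ok: true, language } or { ok: false, error }.
function validateDevInstanceRequest(body) {
  const language = body && typeof body.language === "string"
    ? body.language.trim().toLowerCase()
    : "";
  if (!DEV_LANGUAGES.includes(language)) {
    return { ok: false, error: `language must be one of: ${DEV_LANGUAGES.join(", ")}` };
  }
  return { ok: true, language };
}
```

Rejecting unknown languages at the route keeps bad requests from reaching the provisioner or the Agent.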

Day 2: EdgeState + DNS Reliability (December 9)

Goal: Fix Cloudflare SRV deletion problem

Tasks:

  1. Implement EdgeState model in Prisma schema
  2. Update edgePublisher.js to store Cloudflare record IDs
  3. Update dePublisher.js to delete by record ID (not hostname)
  4. Test: Create → Delete → Recreate same hostname

Success Criteria:

  • No orphaned DNS records after deletion
  • EdgeState tracks all Cloudflare record IDs
  • Re-provisioning same hostname works

Files to Modify:

  • prisma/schema.prisma (add EdgeState model)
  • src/services/edgePublisher.js
  • src/services/dePublisher.js
  • src/clients/cloudflareClient.js (return record IDs)
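
The core idea of Day 2 is to remember the ID of every record at creation time, then delete by those IDs instead of guessing from the hostname. A rough sketch of that bookkeeping is below. An in-memory array stands in for the EdgeState table, and the field names (`hostname`, `type`, `recordId`) and function names are hypothetical. No Cloudflare calls are shown, since the client's actual interface is not described here.

```javascript
// In-memory stand-in for the EdgeState table. Field names are hypothetical.
function createEdgeState() {
  return [];
}

// Call after each successful DNS record creation, with the ID the provider returned.
function trackRecord(edgeState, hostname, type, recordId) {
  edgeState.push({ hostname, type, recordId });
}

// Returns the exact record IDs to delete for a hostname.
function recordIdsFor(edgeState, hostname) {
  return edgeState.filter((r) => r.hostname === hostname).map((r) => r.recordId);
}

// Call only after the provider confirms a deletion, so a failed delete
// leaves the ID in place for a later retry.
function forgetRecord(edgeState, recordId) {
  const index = edgeState.findIndex((r) => r.recordId === recordId);
  if (index !== -1) edgeState.splice(index, 1);
}
```

Because a row is removed only after a confirmed deletion, a partial failure leaves the remaining IDs available for the next attempt, and re-provisioning the same hostname starts from a clean state.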

Day 3: Reconciliation + Hardening (December 10)

Goal: Self-healing infrastructure

Tasks:

  1. Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity)
  2. Detect orphaned containers (in Proxmox but not DB)
  3. Detect orphaned DNS records (in DNS but not DB)
  4. Auto-cleanup with confirmation prompt
  5. Regression test suite

Success Criteria:

  • Reconciliation job detects all orphans
  • Can auto-clean with user confirmation
  • System recovers from partial failures

Files to Create or Rewrite:

  • src/jobs/reconciliationJob.js
  • src/services/reconciler.js (rewrite)
  • tests/reconciliation.test.js
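
Orphan detection comes down to set differences between what the database expects and what each external system reports. The sketch below shows only that comparison step, under the assumption that the job has already gathered plain lists from the database, Proxmox, and DNS. The function names, the report shape, and the `protectedVmids` parameter are hypothetical.

```javascript
// Returns items present in `actual` but missing from `expected`.
function difference(actual, expected) {
  const known = new Set(expected);
  return actual.filter((item) => !known.has(item));
}

// Hypothetical report shape. `protectedVmids` lists templates and
// infrastructure VMs that must never be flagged as orphans.
function findOrphans({ dbVmids, proxmoxVmids, dbHostnames, dnsHostnames, protectedVmids = [] }) {
  return {
    // In Proxmox but not in the database (and not protected).
    orphanedContainers: difference(proxmoxVmids, [...dbVmids, ...protectedVmids]),
    // In DNS but not in the database.
    orphanedDnsRecords: difference(dnsHostnames, dbHostnames),
    // In the database but no longer in Proxmox.
    missingContainers: difference(dbVmids, proxmoxVmids),
  };
}
```

The function only reports; cleanup stays a separate step behind the confirmation prompt. Template VMID 800 and the existing infrastructure VMs are examples of IDs that would belong in the protected list.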

🐛 Known Bugs (Fix These)

Go Agent Bugs (Non-Blocking, But Should Fix)

  1. Forge server.jar Glob Logic (artifacts.go lines 112-116, 147-151)

    // REMOVE THIS - Forge doesn't create server.jar
    serverJarPath := filepath.Join(installDir, "*server.jar")
    

    Fix: Remove glob/rename logic entirely

  2. ensureProvisioned() Fallthrough (agent.go lines 155-171)

    // ADD ELSE HERE
    if variant == "forge" || variant == "neoforge" {
        // Forge logic
    } else {  // <-- ADD THIS
        // Vanilla-like logic
    }

    Fix: Add the else branch so Forge and NeoForge do not fall through into the vanilla-like logic

  3. Forge Stop Command Exclusion (process.go line 83)

    // REMOVE THIS EXCLUSION - Forge accepts stop commands
    if p.variant != "forge" && p.variant != "neoforge" {
        p.sendCommand("stop")
    }
    

    Fix: Remove the if condition, send stop to all variants


🚨 High-Risk Integration Zones (Careful!)

These areas have caused drift in past sessions:

  1. Forge/NeoForge Installation

    • Agent owns ALL installation logic
    • API only passes config, never executes install commands
  2. Cloudflare SRV Deletion

    • Must use record IDs (not hostname inference)
    • Store IDs in EdgeState on creation
  3. Velocity Registration Order

    • Wait for Agent to report RUNNING state
    • Then register with Velocity (not before)
  4. Port Allocation

    • API allocates → provisions → commits
    • Rollback on failure (don't leak ports)
  5. Agent READY Detection

    • Use variant-aware log parsing
    • Forge takes 60-90s (don't timeout early)
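
The allocate → provision → commit rule for ports can be illustrated with a small reserve/commit/release pattern. This is a sketch under stated assumptions, not the interface of the existing portAllocator.js: the class, its method names, and the in-memory sets are all hypothetical. It is written synchronously for brevity, whereas a real provisioning call would be awaited.

```javascript
// Hypothetical in-memory allocator: ports move from free -> reserved -> committed.
class PortAllocator {
  constructor(ports) {
    this.free = new Set(ports);
    this.reserved = new Set();
    this.committed = new Set();
  }
  reserve() {
    const [port] = this.free; // first free port, undefined if none are left
    if (port === undefined) throw new Error("no free ports");
    this.free.delete(port);
    this.reserved.add(port);
    return port;
  }
  commit(port) {
    if (!this.reserved.delete(port)) throw new Error(`port ${port} is not reserved`);
    this.committed.add(port);
  }
  release(port) {
    // Only reserved ports go back to the free pool.
    if (this.reserved.delete(port)) this.free.add(port);
  }
}

// Reserve a port, run the provisioning step, and commit only on success.
// Any failure releases the reservation so the port is not leaked.
function provisionWithPort(allocator, provision) {
  const port = allocator.reserve();
  try {
    const result = provision(port);
    allocator.commit(port);
    return result;
  } catch (err) {
    allocator.release(port);
    throw err;
  }
}
```

Keeping the reservation separate from the commit means a failed provisioning run can never leave a port marked as in use.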

📁 Key File Locations

API Service (/home/zlh/zlh-api-v2/)

src/
├── routes/
│   ├── containers.js          # Game server endpoints
│   └── devInstances.js        # Dev environment endpoints (TO CREATE)
├── services/
│   ├── edgePublisher.js       # DNS + Velocity publishing
│   ├── dePublisher.js         # Edge cleanup (NEEDS REWRITE)
│   ├── portAllocator.js       # Port management
│   └── reconciler.js          # Orphan detection (NEEDS REWRITE)
├── clients/
│   ├── cloudflareClient.js    # Cloudflare API (UPDATE for record IDs)
│   ├── technitiumClient.js    # Technitium DNS
│   └── proxmoxClient.js       # Proxmox API
└── jobs/
    └── reconciliationJob.js   # Self-healing job (TO CREATE)

Go Agent (/opt/zlh-agent/)

├── agent.go              # Main provisioning logic (FIX fallthrough)
├── artifacts.go          # Download + verification (REMOVE Forge glob)
├── process.go            # Server lifecycle (FIX Forge stop exclusion)
├── api.go                # HTTP server for control
└── dev.go                # Dev environment provisioning (TO CREATE)

Frontend (/home/zlh/zlh-portal/)

src/
├── app/
│   ├── containers/       # Game server UI
│   └── dev/              # Dev environment UI (TO CREATE)
└── components/
    └── Console.tsx       # WebSocket console (TO CREATE)

🎮 Minecraft Variant Status

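| Variant | Install | Verify | Start | READY Detection | Status |
|---|---|---|---|---|---|
| Vanilla | ✅ | ✅ | ✅ | ✅ | Production |
| Paper | ✅ | ✅ | ✅ | ✅ | Production |
| Purpur | ✅ | ✅ | ✅ | ✅ | Production |
| Fabric | ✅ | ✅ | ✅ | ✅ | Production |
| Forge | ✅ | ✅ | ✅ | ✅ | Production (has bugs) |
| NeoForge | ✅ | ✅ | ✅ | ✅ | Production |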

All variants work - 3 non-blocking bugs should be fixed for code quality.


🔄 Provisioning Flow (Know This)

1. User creates server via Frontend
   ↓
2. Frontend → API: POST /api/containers/create
   ↓
3. API allocates VMID, ports (if needed)
   ↓
4. API clones LXC from template VMID 800
   ↓
5. API configures container (IP, resources)
   ↓
6. API starts LXC
   ↓
7. API detects container IP (10.200.0.X)
   ↓
8. API → Agent: POST /config (payload)
   ↓
9. Agent saves payload.json
   ↓
10. Agent spawns install goroutine (async)
    ├─ Download Java
    ├─ Download game artifacts
    ├─ Verify installation
    └─ Self-repair if needed
   ↓
11. Agent starts server
   ↓
12. Agent detects READY (log parsing)
   ↓
13. Agent sets state = RUNNING
   ↓
14. API polls /status until RUNNING
   ↓
15. API saves to database
   ↓
16. API publishes DNS (Cloudflare + Technitium)
   ↓
17. API registers with Velocity
   ↓
18. API returns SUCCESS to user
   ↓
COMPLETE ✅
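
Steps 14 through 17 hinge on one decision after each /status poll: keep waiting, give up, or move on to saving and publishing. A minimal sketch of that decision is below. The function name and the timeout values are hypothetical; the only documented figure is that Forge takes 60-90 seconds to become ready, so its timeout is set well above that. Treating NeoForge the same way is an assumption.

```javascript
// Hypothetical per-variant timeouts in milliseconds. Forge is documented
// to need 60-90 seconds; NeoForge is assumed to behave similarly.
const READY_TIMEOUT_MS = { default: 120000, forge: 300000, neoforge: 300000 };

// Decides what the API should do after one /status poll.
// "publish": Agent reports RUNNING, so it is safe to save to the database,
//            publish DNS, and register with Velocity (in that order).
// "wait":    keep polling.
// "fail":    the variant's timeout has elapsed without reaching RUNNING.
function nextProvisioningStep(agentState, elapsedMs, variant) {
  if (agentState === "RUNNING") return "publish";
  const timeout = READY_TIMEOUT_MS[variant] ?? READY_TIMEOUT_MS.default;
  return elapsedMs >= timeout ? "fail" : "wait";
}
```

Nothing is published until the Agent reports RUNNING, which enforces the Velocity registration order without a separate check.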

🛡️ Drift Prevention (ACTIVE)

Before Writing ANY Code

Ask yourself:

  1. Which system am I modifying? (API / Agent / Frontend)
  2. Does this cross boundaries? (If yes → read Cross-Project Tracker)
  3. Am I adding external calls to Agent? (If yes → VIOLATION)
  4. Am I adding container execution to API? (If yes → VIOLATION)
  5. Am I bypassing API in Frontend? (If yes → VIOLATION)

If ANY doubt → Stop and consult Cross-Project Tracker

Common Violations to Avoid

  • Agent allocating ports → API owns this
  • API installing Java → Agent owns this
  • Frontend calling Agent directly → Must go through API
  • Agent creating DNS → API owns this
  • API deciding Java version → Agent owns this (version-aware)


🎯 Success Metrics (How You'll Be Measured)

Sprint Completion

  • All 3 days of sprint tasks completed
  • Dev containers operational
  • EdgeState tracking DNS record IDs
  • Reconciliation job working

Code Quality

  • No architectural violations (follow Cross-Project Tracker)
  • All 3 Go agent bugs fixed
  • Tests passing
  • No new drift introduced

Platform Readiness

  • 95% launch-ready after sprint
  • All MC variants still working
  • No regressions from changes

🧪 Testing Requirements

Before Committing Code

Test Each Variant:

# Test provisioning
POST /api/containers/create {variant: "vanilla"}
POST /api/containers/create {variant: "paper"}
POST /api/containers/create {variant: "purpur"}
POST /api/containers/create {variant: "fabric"}
POST /api/containers/create {variant: "forge"}
POST /api/containers/create {variant: "neoforge"}

# Verify RUNNING state
GET /api/containers/:vmid/status
# Should return: {state: "RUNNING"}

# Test control
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/start
POST /api/containers/:vmid/restart

# Test cleanup
DELETE /api/containers/:vmid
# Verify no orphaned DNS records

Test Dev Containers:

POST /api/dev-instances/create {language: "python"}
POST /api/dev-instances/create {language: "node"}

# Verify accessible
ssh into dev container
# Should have language toolchain installed

📊 Infrastructure Constraints (Know Your Limits)

Hardware (GTHost Dedicated):

  • CPU: 12 cores / 24 threads (Intel Xeon Silver 4116)
  • RAM: 192 GB (can run 30-50 simultaneous 4GB servers)
  • Storage: 1.8 TB free (300-500 servers capacity)
  • Network: 300 Mbit/s (30-60 concurrent players)

Current Allocation:

  • 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk
  • Available for servers: 128 GB RAM, 1.8 TB disk

Don't exceed these limits - check before provisioning.
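
A pre-provisioning check against the RAM budget could be as simple as the sketch below. The 128 GB figure comes from the Current Allocation list above and the 4 GB default follows the server sizing in the hardware notes. The function names are hypothetical, and RAM is treated as the only constraint for the sake of the example.

```javascript
// Figures from this section: 128 GB available for servers, 4 GB per server.
const AVAILABLE_RAM_GB = 128;
const DEFAULT_SERVER_RAM_GB = 4;

// Returns how many more servers of the given size fit in the RAM budget.
function remainingServerSlots(ramInUseGb, serverRamGb = DEFAULT_SERVER_RAM_GB) {
  const freeGb = AVAILABLE_RAM_GB - ramInUseGb;
  return Math.max(0, Math.floor(freeGb / serverRamGb));
}

// Hypothetical guard to call before allocating a VMID or ports.
function canProvision(ramInUseGb, serverRamGb = DEFAULT_SERVER_RAM_GB) {
  return remainingServerSlots(ramInUseGb, serverRamGb) >= 1;
}
```

With nothing provisioned, the 128 GB budget holds 32 servers at 4 GB each, in line with the 30-50 estimate above. Disk and CPU would need similar checks.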


🚀 Launch Decision (Context)

Option A: Launch NOW (85% ready, soft beta only)
Option B: +3 Days Sprint (95% ready, RECOMMENDED)
Option C: +1 Week (98% ready, over-engineering)

Your role: Execute Option B sprint (complete 3 days of tasks)

After sprint: Platform ready for professional launch


📞 Session Continuity

Starting Fresh Session (You)

1. Read Drift Prevention Card (30 seconds)
   └─ Activate boundary awareness

2. Read Complete Current State (5 minutes)
   └─ Get Kanban state + sprint tasks

3. Consult Cross-Project Tracker before code
   └─ Verify no boundary violations

4. Execute sprint tasks with constraints active

5. Test all variants before committing

Handoff to Claude (Architecture Questions)

If you encounter:

  • Architectural decisions (e.g., should we change contracts?)
  • Strategic questions (e.g., which features to prioritize?)
  • Business model questions
  • Major design changes

STOP and escalate to Claude for architectural guidance.


Quick Reference

Your Mission

Execute 3-day sprint → Deliver 95% launch-ready platform

Your Boundaries

API orchestrates | Agent installs | Frontend displays

Your Critical Docs

  1. Drift Prevention Card (session start)
  2. Cross-Project Tracker (before code)
  3. Complete Current State (sprint tasks)
  4. Engineering Handover (technical details)

Your Success

  • 3-day sprint complete
  • No architectural violations
  • All variants still working
  • Tests passing

🎯 Start Here (First Actions)

Right Now:

  1. Read Drift Prevention Card (30 sec)
  2. Read Complete Current State (5 min)
  3. Check Engineering Kanban for current TODO
  4. Begin Day 1 sprint: Dev containers

Remember:

  • 🛡️ Drift prevention ACTIVE
  • 📋 Consult tracker before crossing boundaries
  • Test all variants before committing
  • 🚀 Goal: 95% launch-ready after 3 days

Status: You have everything you need to execute the sprint. Let's build! 🚀