Upload files to "/"

This commit is contained in:
jester 2025-12-13 16:40:48 +00:00
parent 882c0c7169
commit 5a5ad001af
5 changed files with 2595 additions and 0 deletions


@ -0,0 +1,846 @@
# 🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System
**Last Updated**: December 7, 2025
**Version**: 1.0 (Canonical Architecture Enforcement)
**Status**: ACTIVE - Must Be Consulted Before All Code Changes
---
## 🎯 Document Purpose
This document establishes **architectural boundaries** across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.
**Critical Insight**: Most project failures come from **gradual architectural erosion**, not sudden breaking changes. This document prevents that.
---
## 📊 The Three Systems (Canonical Ownership)
```
┌─────────────────────────────────────────────────────────────┐
│ NODE.JS API v2 │
│ (Orchestration Engine) │
│ │
│ Owns: VMID allocation, port management, DNS publishing, │
│ Velocity registration, job queue, database state │
│ │
│ Speaks To: Proxmox, Cloudflare, Technitium, Velocity, │
│ MariaDB, BullMQ, Go Agent (HTTP) │
│ │
│ Never Touches: Container filesystem, game server files, │
│ Java installation, artifact downloads │
└─────────────────────┬───────────────────────────────────────┘
│ HTTP Contract
│ POST /config, /start, /stop
│ GET /status, /health
┌─────────────────────────────────────────────────────────────┐
│ GO AGENT │
│ (Container-Internal Manager) │
│ │
│ Owns: Server installation, Java runtime, artifact │
│ downloads, process management, READY detection, │
│ filesystem layout, verification + self-repair │
│ │
│ Speaks To: Local filesystem, game server process, │
│ API (status updates via HTTP polling) │
│ │
│ Never Touches: Proxmox, DNS, Cloudflare, Velocity, │
│ port allocation, VMID selection │
└─────────────────────┬───────────────────────────────────────┘
│ Status Polling
│ Agent reports state
┌─────────────────────────────────────────────────────────────┐
│ NEXT.JS FRONTEND │
│ (Customer + Admin UI) │
│ │
│ Owns: User interaction, form validation, display logic, │
│ client-side state, UI components │
│ │
│ Speaks To: API v2 only (REST + WebSocket when added) │
│ │
│ Never Touches: Proxmox, Go Agent, DNS, Velocity, │
│ Cloudflare, direct container access │
└─────────────────────────────────────────────────────────────┘
```
---
## 🗺️ System Ownership Matrix (CANONICAL)
| Area | Node.js API v2 | Go Agent | Frontend |
|------|----------------|----------|----------|
| **Provisioning Orchestration** | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
| **Template Selection** | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
| **Server Installation** | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
| **Runtime Control** | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
| **DNS (Cloudflare + Technitium)** | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
| **Velocity Registration** | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
| **IP Logic** | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
| **Port Allocation** | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
| **Monitoring** | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
| **Error Handling** | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |
---
## 🔄 API ↔ Agent Contract (IMMUTABLE)
### **API → Agent Endpoints**
```
POST /config
├─ Payload: {
│ game: "minecraft",
│ variant: "paper",
│ version: "1.21.3",
│ ports: [25565, 25575],
│ memory: "4G",
│ motd: "Welcome to ZeroLagHub",
│ worldSettings: {...}
│ }
└─ Response: 200 OK
POST /start
└─ Triggers: ensureProvisioned() → StartServer()
POST /stop
└─ Triggers: Graceful shutdown via server stop command
POST /restart
└─ Triggers: stop → start sequence
GET /status
└─ Response: {
state: "RUNNING" | "INSTALLING" | "FAILED",
pid: 12345,
uptime: 3600,
lastError: null
}
GET /health
└─ Response: {
healthy: true,
java: "/usr/lib/jvm/java-21-openjdk-amd64",
variant: "paper",
ready: true
}
```
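A minimal sketch of how the API side might drive this contract, assuming Node 18+ (built-in `fetch`) and an agent base URL supplied by the caller; the endpoint paths and payload fields come from the contract above, while the example IP/port in the usage comment are placeholders.
```javascript
// Illustrative only: push a config payload to the agent, then start the server.
// The agent base URL (host:port) is an assumption supplied by the caller.
async function configureAndStart(agentBaseUrl, config) {
  const cfg = await fetch(`${agentBaseUrl}/config`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // config mirrors the contract: { game, variant, version, ports, memory, motd, worldSettings }
    body: JSON.stringify(config),
  });
  if (!cfg.ok) throw new Error(`POST /config failed: ${cfg.status}`);

  const start = await fetch(`${agentBaseUrl}/start`, { method: 'POST' });
  if (!start.ok) throw new Error(`POST /start failed: ${start.status}`);
}

// Example usage with the payload from the contract above (placeholder address):
// await configureAndStart('http://10.200.0.42:8080', {
//   game: 'minecraft', variant: 'paper', version: '1.21.3',
//   ports: [25565, 25575], memory: '4G', motd: 'Welcome to ZeroLagHub', worldSettings: {},
// });
```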
### **Agent → API Signals (via /status polling)**
```
State Machine:
INSTALLING
├─> DOWNLOADING_ARTIFACT
├─> INSTALLING_JAVA
├─> FINALIZING
├─> RUNNING
└─> FAILED | CRASHED
```
### **🚫 Forbidden Agent Behaviors**
Agent **NEVER**:
- ❌ Allocates or manages ports
- ❌ Talks to Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Modifies LXC config
- ❌ Allocates VMIDs
- ❌ Manages templates
- ❌ Talks to Cloudflare or Technitium
**Violation Detection**: If agent code imports Proxmox client, DNS client, or port allocation logic → **DRIFT VIOLATION**
---
## 🖥️ Frontend ↔ API Contract (IMMUTABLE)
### **Frontend Allowed Endpoints**
```
Provisioning:
POST /api/containers/create
GET /api/containers/:vmid
DELETE /api/containers/:vmid
Control:
POST /api/containers/:vmid/start
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/restart
Status:
GET /api/containers/:vmid/status
GET /api/containers/:vmid/logs
GET /api/containers/:vmid/stats
Discovery:
GET /api/templates
GET /api/containers (list user's containers)
Read-Only Info:
GET /api/dns/:hostname (display only)
GET /api/velocity/status (display only)
```
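A minimal sketch of a frontend API client restricted to the endpoints listed above, shown in plain JavaScript for brevity; the `/api` base path and wrapper shape are assumptions, not an existing module.
```javascript
// Illustrative thin client: every call goes through API v2, never to the agent or Proxmox.
const API_BASE = '/api'; // assumption: frontend and API share an origin or a proxy rewrite

async function apiFetch(path, options = {}) {
  const res = await fetch(`${API_BASE}${path}`, {
    headers: { 'Content-Type': 'application/json' },
    ...options,
  });
  if (!res.ok) throw new Error(`${options.method || 'GET'} ${path} failed: ${res.status}`);
  if (res.status === 204) return null; // e.g. DELETE with no body
  return res.json();
}

export const containers = {
  create: (spec) => apiFetch('/containers/create', { method: 'POST', body: JSON.stringify(spec) }),
  get: (vmid) => apiFetch(`/containers/${vmid}`),
  remove: (vmid) => apiFetch(`/containers/${vmid}`, { method: 'DELETE' }),
  start: (vmid) => apiFetch(`/containers/${vmid}/start`, { method: 'POST' }),
  stop: (vmid) => apiFetch(`/containers/${vmid}/stop`, { method: 'POST' }),
  restart: (vmid) => apiFetch(`/containers/${vmid}/restart`, { method: 'POST' }),
  status: (vmid) => apiFetch(`/containers/${vmid}/status`),
  logs: (vmid) => apiFetch(`/containers/${vmid}/logs`),
  stats: (vmid) => apiFetch(`/containers/${vmid}/stats`),
};
```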
### **🚫 Forbidden Frontend Behaviors**
Frontend **NEVER**:
- ❌ Talks directly to Go Agent
- ❌ Calls Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Allocates ports
- ❌ Executes container commands
- ❌ Accesses MariaDB directly
**Violation Detection**: If frontend code imports Proxmox client, agent HTTP client (except via API), or database client → **DRIFT VIOLATION**
---
## 🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)
### **Rule #1: Provisioning Ownership**
**Violation**: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs
**Correct Path**: API owns ALL provisioning orchestration
**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
// ... port allocation logic
}
```
**Correct Pattern**:
```javascript
// RIGHT - API allocates, agent receives
async function provisionInstance(vmid, game, variant) {
const ports = await portAllocator.allocate(vmid);
await agent.postConfig({ ports, game, variant });
}
```
---
### **Rule #2: Artifact Ownership**
**Violation**: API asked to install Java, download server.jar, run installers
**Correct Path**: Agent owns ALL in-container installation
**Example Violation**:
```javascript
// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
await proxmox.exec(vmid, `wget https://...`);
await proxmox.exec(vmid, `java -jar installer.jar`);
}
```
**Correct Pattern**:
```go
// RIGHT - Agent handles installation
func (p *Provisioner) ProvisionAll(cfg Config) error {
    if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
        return err
    }
    if err := installJavaRuntime(cfg.Version); err != nil {
        return err
    }
    return verifyInstallation()
}
```
---
### **Rule #3: Direct Container Access**
**Violation**: Frontend or API wants to exec commands directly into container
**Correct Path**: Agent owns container execution layer
**Example Violation**:
```typescript
// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
const ssh = new SSHClient();
await ssh.connect(containerIP);
await ssh.exec('systemctl restart minecraft');
}
```
**Correct Pattern**:
```typescript
// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
await api.post(`/containers/${vmid}/restart`);
}
```
---
### **Rule #4: Networking Responsibilities**
**Violation**: Agent asked to select public vs internal IPs, decide DNS zones
**Correct Path**: API owns dual-IP logic (Cloudflare external, Technitium internal)
**Example Violation**:
```go
// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
if a.needsCloudflare() {
return "139.64.165.248"
}
return a.containerIP
}
```
**Correct Pattern**:
```javascript
// RIGHT - API decides network topology
function determineIPs(vmid, game) {
const internalIP = `10.200.0.${vmid - 1000}`;
const externalIP = "139.64.165.248"; // Cloudflare target
const velocityIP = "10.70.0.241"; // Internal routing
return { internalIP, externalIP, velocityIP };
}
```
---
### **Rule #5: Proxy Responsibilities**
**Violation**: Agent asked to register with Velocity, configure proxy routing
**Correct Path**: API owns ALL proxy integrations
**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
client := velocity.NewClient()
return client.Register(a.hostname, a.port)
}
```
**Correct Pattern**:
```javascript
// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
await velocityBridge.registerBackend({
name: hostname,
address: internalIP,
port: 25565
});
}
```
---
## 🔄 Context Switching Safety Workflow
### **When Moving to Node.js API Work:**
**Pre-Switch Checklist**:
- [ ] Agent contract unchanged? (POST /config, /start, /stop, GET /status)
- [ ] Database schema unchanged? (Prisma models consistent)
- [ ] LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
- [ ] DNS/IP logic consistent? (Cloudflare external, Technitium internal)
- [ ] Port allocation logic preserved? (PortPool DB-backed)
**Common Drift Patterns**:
- ⚠️ Adding agent installation logic to API
- ⚠️ Changing agent contract without updating both sides
- ⚠️ Moving DNS logic to different service
- ⚠️ Bypassing job queue for provisioning
---
### **When Moving to Go Agent Work:**
**Pre-Switch Checklist**:
- [ ] No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
- [ ] File paths remain canonical (`/opt/zlh/<game>/<variant>/world`)
- [ ] Naming conventions maintained (`server.jar`, `fabric-server.jar`, etc.)
- [ ] No external API calls (Proxmox, DNS, Velocity)
- [ ] Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)
**Common Drift Patterns**:
- ⚠️ Adding port allocation to agent
- ⚠️ Making agent talk to external services
- ⚠️ Changing directory structure without API coordination
- ⚠️ Adding orchestration logic to agent
---
### **When Moving to Frontend Work:**
**Pre-Switch Checklist**:
- [ ] Only API-approved fields used in UI
- [ ] No direct agent HTTP calls
- [ ] VMID not exposed in user-facing UI
- [ ] Internal IPs not displayed to users
- [ ] All state from API, not computed locally
**Common Drift Patterns**:
- ⚠️ Adding direct agent calls from frontend
- ⚠️ Computing server state client-side
- ⚠️ Exposing internal infrastructure details
- ⚠️ Bypassing API for container control
---
## 🚨 High-Risk Integration Zones (GUARDED)
These areas have historically caused drift across sessions:
### **1. Forge / NeoForge Installation Logic**
- **Risk**: Agent vs API confusion on who handles `run.sh` patching
- **Guard**: Agent owns ALL Forge installation, API just passes config
- **Test**: Can provision Forge 1.21.3 without API filesystem access?
### **2. Cloudflare SRV Deletion**
- **Risk**: Case sensitivity, subdomain normalization, record ID tracking
- **Guard**: API stores Cloudflare record IDs in EdgeState, deletes by ID (see the sketch below)
- **Test**: Create → Delete → Recreate same hostname without orphans?
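A minimal sketch of this guard, assuming direct calls to the public Cloudflare v4 REST API and a hypothetical `edgeState` store for record IDs; the real `edgePublisher.js` / `dePublisher.js` implementations may differ.
```javascript
// Illustrative only: create a record, persist its Cloudflare ID, delete by ID later.
const CF_API = 'https://api.cloudflare.com/client/v4';

async function publishSrvRecord({ zoneId, token, record, edgeState }) {
  const res = await fetch(`${CF_API}/zones/${zoneId}/dns_records`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(record),
  });
  const body = await res.json();
  if (!body.success) throw new Error('Cloudflare record creation failed');
  // Persist the record ID immediately: deletion must never rely on hostname inference.
  await edgeState.saveRecordId(record.name, body.result.id); // edgeState is a hypothetical store
  return body.result.id;
}

async function unpublishSrvRecord({ zoneId, token, hostname, edgeState }) {
  const recordId = await edgeState.getRecordId(hostname);
  await fetch(`${CF_API}/zones/${zoneId}/dns_records/${recordId}`, {
    method: 'DELETE',
    headers: { Authorization: `Bearer ${token}` },
  });
  await edgeState.clearRecordId(hostname);
}
```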
### **3. Technitium DNS Zone Mismatch**
- **Risk**: Wrong zone selection, duplicate records
- **Guard**: API hardcodes zone as `zpack.zerolaghub.com`, validates before creation
- **Test**: No records created in wrong zones?
### **4. Velocity Registration Order**
- **Risk**: Registering before server ready, deregistering incorrectly
- **Guard**: API waits for agent RUNNING state, then registers Velocity (see the sketch below)
- **Test**: Player connection works immediately after provisioning complete?
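A minimal sketch of this ordering guard, reusing the `velocityBridge.registerBackend` shape from Rule #5; the polling interval, timeout, and agent URL handling are assumptions.
```javascript
// Illustrative only: register with Velocity only after the agent reports RUNNING.
async function waitForRunning(agentBaseUrl, { intervalMs = 5000, timeoutMs = 180000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${agentBaseUrl}/status`);
    const { state, lastError } = await res.json();
    if (state === 'RUNNING') return;
    if (state === 'FAILED' || state === 'CRASHED') {
      throw new Error(`Provisioning failed: ${lastError}`);
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error('Timed out waiting for agent to report RUNNING');
}

async function registerAfterReady({ agentBaseUrl, hostname, internalIP, velocityBridge }) {
  await waitForRunning(agentBaseUrl);
  await velocityBridge.registerBackend({ name: hostname, address: internalIP, port: 25565 });
}
```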
### **5. PortPool Commit Logic**
- **Risk**: Race conditions, double-allocation, uncommitted ports
- **Guard**: API allocates → provisions → commits (rollback on failure; see the sketch below)
- **Test**: Concurrent provisions don't collide on ports?
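A minimal sketch of the allocate → provision → commit flow with rollback, assuming a `portAllocator` with `allocate`, `commit`, and `release` methods (only `allocate` appears earlier in this document; the other two are illustrative).
```javascript
// Illustrative only: never leave ports half-allocated if provisioning fails partway.
async function provisionWithPorts(vmid, provisionFn, portAllocator) {
  const ports = await portAllocator.allocate(vmid); // reserve, but do not commit yet
  try {
    await provisionFn(ports);                // clone LXC, push config, wait for RUNNING
    await portAllocator.commit(vmid, ports); // mark ports as permanently assigned
    return ports;
  } catch (err) {
    await portAllocator.release(vmid, ports); // rollback: return ports to the pool
    throw err;
  }
}
```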
### **6. Agent READY Detection**
- **Risk**: False negatives, false positives, variant-specific patterns
- **Guard**: Agent uses variant-aware log parsing, multiple confirmation lines
- **Test**: All 6 variants correctly detect READY state?
### **7. Server Start-Up False Negatives**
- **Risk**: Timeout too short, log parsing too strict
- **Guard**: Agent increases timeout for Forge (90s), multiple log patterns
- **Test**: Forge installer completes without false failure?
### **8. IP Selection Logic**
- **Risk**: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
- **Guard**: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
- **Test**: DNS points to correct IPs, Velocity routes to correct internal IP?
---
## 📋 Architecture Decision Log (LOCKED) ⭐ NEW
**Purpose**: Records finalized architectural decisions that **must not be re-litigated** unless explicitly requested.
**Status**: LOCKED - These decisions are final and cannot be changed without the user explicitly saying "Revisit decision X"
---
### **DEC-001: Templates vs Go Agent (FINAL)**
**Decision**: Hybrid model
- LXC templates define **base environment only**
- Go Agent is **authoritative execution layer** inside containers
**Rationale**:
- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
- Agent enables self-repair, async provisioning, runtime control
- Hybrid provides speed + flexibility without API container access
**Applies To**: API v2, Go Agent, Frontend
**Status**: ✅ LOCKED
---
### **DEC-002: Provisioning Authority**
**Decision**: API orchestrates, Agent executes
**Rationale**:
- API has global visibility (DB, DNS, Proxmox, Velocity)
- Agent is intentionally sandboxed to container filesystem + process
**Applies To**: All systems
**Status**: ✅ LOCKED
---
### **DEC-003: DNS & Edge Publishing Ownership**
**Decision**: API-only responsibility
**Rationale**:
- Requires external credentials (Cloudflare, Technitium)
- Must correlate DB state, record IDs, reconciliation jobs
**Applies To**: API v2
**Status**: ✅ LOCKED
---
### **DEC-004: Proxy Stack**
**Decision**: Traefik + Velocity only
**Rationale**:
- Traefik for HTTP/control-plane
- Velocity for Minecraft TCP routing
- **HAProxy explicitly deprecated** for ZeroLagHub
**Applies To**: Infrastructure, API
**Status**: ✅ LOCKED
---
### **DEC-005: State Persistence**
**Decision**: MariaDB is single source of truth
**Rationale**:
- Flat files caused race conditions and drift
- DB enables reconciliation, recovery, observability
**Applies To**: API v2
**Status**: ✅ LOCKED
---
### **DEC-006: Frontend Access Model**
**Decision**: Frontend communicates with API only
**Rationale**:
- Security boundary
- Prevents leaking infrastructure details
**Applies To**: Frontend
**Status**: ✅ LOCKED
---
### **DEC-007: Architecture Enforcement Policy**
**Decision**: Drift prevention is mandatory
**Rationale**:
- Prevents oscillation between alternatives
- Preserves velocity during late-stage development
**Applies To**: All work sessions
**Status**: ✅ LOCKED
---
## 📋 Canonical Architecture Anchors (ABSOLUTE)
These rules are **immutable** unless explicitly changed with full system review:
### **Anchor #1: Orchestration**
✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)
### **Anchor #2: Container-Internal**
✅ Agent performs EVERYTHING inside container (install, start, stop, detect)
### **Anchor #3: Templates**
✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)
### **Anchor #4: Job Queue**
✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)
### **Anchor #5: Database**
✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)
### **Anchor #6: Infrastructure**
✅ Proxmox API (not Ansible) for LXC management
### **Anchor #7: Routing**
✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems
### **Anchor #8: DNS**
✅ Cloudflare = authoritative public DNS
✅ Technitium = authoritative internal DNS
### **Anchor #9: Frontend Isolation**
✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)
### **Anchor #10: Directory Structure**
`/opt/zlh/<game>/<variant>/world` is canonical game server path
---
## 🛡️ Enforcement Policies (ACTIVE)
When future instructions conflict with Canonical Architecture or Architecture Decision Log:
### **Step 1: STOP**
Immediately halt the task. Do not proceed with drift-inducing change.
### **Step 2: RAISE DRIFT WARNING**
```
⚠️ DRIFT WARNING ⚠️
Proposed change violates Canonical Architecture:
Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]
Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]
Impact: [What breaks if this drift is allowed]
Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
```
### **Step 3: PROVIDE CORRECT PATH**
Show the architecture-aligned implementation that achieves the same goal.
### **Step 4: ASK FOR DIRECTION**
```
Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?
```
### **Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE**
If violation is against a **LOCKED** Architecture Decision (DEC-001 through DEC-007):
**ADDITIONAL CHECK**:
```
⚠️ LOCKED DECISION WARNING ⚠️
This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED
This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"
Without explicit user request to revisit, this decision is FINAL.
Proceeding with this change would violate architectural governance.
```
**Never proceed** with drift-inducing changes without explicit confirmation.
**Never re-litigate** locked decisions without user explicitly requesting revision.
---
## 📊 Drift Detection Examples
### **Example 1: Agent Port Allocation (VIOLATION)**
**Proposed Change**:
```go
// Agent code proposal
func (a *Agent) allocatePort() int {
// Find available port...
return port
}
```
**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #1 - Provisioning Ownership
Violation: Agent attempting to allocate ports
Correct Path: API allocates ports, agent receives them via /config
Impact: Port collisions, database inconsistency, broken PortPool
Recommendation: Remove port allocation from agent, ensure API sends ports in config payload
```
---
### **Example 2: Frontend Direct Agent Call (VIOLATION)**
**Proposed Change**:
```typescript
// Frontend code proposal
async function getServerLogs(vmid: number) {
const agentURL = `http://10.200.0.${vmid}:8080/logs`;
return await fetch(agentURL);
}
```
**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #3 - Direct Container Access
Violation: Frontend bypassing API to talk to agent
Correct Path: Frontend → API → Agent (API proxies logs)
Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk
Recommendation: Add GET /api/containers/:vmid/logs endpoint that proxies to agent
```
---
### **Example 3: API Installing Java (VIOLATION)**
**Proposed Change**:
```javascript
// API code proposal
async function provisionServer(vmid, game, variant) {
await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
await proxmox.exec(vmid, 'wget https://papermc.io/...');
}
```
**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #2 - Artifact Ownership
Violation: API performing in-container installation
Correct Path: API sends config to agent, agent handles installation
Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system
Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config
```
---
## 📁 Integration with Existing Documentation
### **Relationship to Master Bootstrap**
- Master Bootstrap: Strategic overview and business model
- **This Document**: Technical governance and boundary enforcement
- **Usage**: Consult this before implementing ANY code changes
### **Relationship to Complete Current State**
- Complete Current State: What's working, what's next
- **This Document**: How things MUST work (regardless of current state)
- **Usage**: This is the "law", current state is "status"
### **Relationship to Engineering Handover**
- Engineering Handover: Daily tactical tasks and sprint plan
- **This Document**: Constraints within which tasks must be implemented
- **Usage**: Check this before starting each handover task
---
## 🔄 Architecture Amendment Process
If there is a legitimate need to change the Canonical Architecture:
### **Step 1: Identify Change**
Document exactly what architectural boundary needs to change and why.
### **Step 2: Full System Impact Analysis**
- What breaks in API?
- What breaks in Agent?
- What breaks in Frontend?
- What changes to contracts?
- What database migrations needed?
### **Step 3: Update ALL Affected Documents**
- This document (Canonical Architecture)
- Master Bootstrap (if strategic impact)
- Complete Current State (implementation changes)
- Engineering Handover (sprint tasks)
- Agent Spec, Operational Guide (if affected)
### **Step 4: Update ALL Systems**
- API code + tests
- Agent code + tests
- Frontend code + tests
- Database schema (migration)
- Infrastructure config
### **Step 5: Validation**
- Integration tests pass?
- No new drift introduced?
- Documentation consistent?
- All AIs briefed on change?
**Only after ALL steps** is architecture amendment complete.
---
## 🎯 Quick Reference Card
### **API Owns**
- ✅ Provisioning orchestration
- ✅ Port allocation (PortPool)
- ✅ DNS (Cloudflare + Technitium)
- ✅ Velocity registration
- ✅ IP logic (external + internal)
- ✅ Job queue (BullMQ)
- ✅ Database state
### **Agent Owns**
- ✅ Container-internal installation
- ✅ Java runtime
- ✅ Artifact downloads
- ✅ Server process management
- ✅ READY detection
- ✅ Self-repair + verification
- ✅ Filesystem layout
### **Frontend Owns**
- ✅ User interaction
- ✅ Display logic
- ✅ Client state
- ✅ Form validation
### **Never**
- ❌ Agent allocates ports
- ❌ Agent talks to DNS/Velocity/Proxmox
- ❌ API installs server files
- ❌ API executes in-container commands
- ❌ Frontend talks to agent directly
- ❌ Frontend talks to infrastructure
---
## 📋 Session Start Checklist
Before every coding session with any AI:
- [ ] Read this document's Quick Reference Card
- [ ] Identify which system you're working on (API, Agent, Frontend)
- [ ] Review that system's "Owns" list
- [ ] Check High-Risk Integration Zones if touching those areas
- [ ] Verify no drift from previous session
- [ ] Confirm contracts unchanged since last session
**If ANY doubt**: Re-read full Canonical Architecture Anchors section.
---
## ✅ Document Status
**Status**: ACTIVE - Must be consulted before all code changes
**Enforcement**: MANDATORY - Drift violations must be caught
**Authority**: CANONICAL - Overrides conflicting guidance
**Updates**: Only via Architecture Amendment Process
---
🛡️ **This document prevents architectural drift. Violate at your own risk.**


@ -0,0 +1,105 @@
# 🛡️ ZeroLagHub Drift Prevention - Quick Start Card
**Use This**: At the start of EVERY coding session (API, Agent, or Frontend)
---
## ⚡ 30-Second Architecture Check
### **Working on API?**
✅ Can allocate: Ports, VMIDs, DNS, Velocity
❌ Cannot: Install Java, download artifacts, exec in container
### **Working on Agent?**
✅ Can install: Java, artifacts, server files
❌ Cannot: Allocate ports, create DNS, call Proxmox, register Velocity
### **Working on Frontend?**
✅ Can: Display data, call API endpoints
❌ Cannot: Talk to Agent, Proxmox, DNS, Velocity, allocate anything
---
## 🔒 7 Locked Architectural Decisions (NEW) ⭐
**These decisions are FINAL** - they cannot be changed without the user saying "Revisit decision X":
1. **DEC-001**: Templates + Agent hybrid (not templates-only or agent-only)
2. **DEC-002**: API orchestrates, Agent executes (not reversed)
3. **DEC-003**: API owns DNS (Agent never creates DNS)
4. **DEC-004**: Traefik + Velocity only (no HAProxy)
5. **DEC-005**: MariaDB is source of truth (no flat files)
6. **DEC-006**: Frontend → API only (no direct agent calls)
7. **DEC-007**: Drift prevention mandatory (always enforced)
**If proposing a change that conflicts with DEC-001 through DEC-007**:
→ STOP → Consult Cross-Project Tracker → Request "Revisit decision X" from user
---
## 🚨 Drift Detection Triggers
**STOP and consult full tracker if you hear:**
1. "Agent should allocate ports..."
2. "API should install Java inside container..."
3. "Frontend should call agent directly..."
4. "Agent should register with Velocity..."
5. "API should decide what Java version..."
6. "Frontend should manage DNS records..."
7. "Let's use templates only..." (violates DEC-001)
8. "Let's add HAProxy..." (violates DEC-004)
9. "Let's use flat files instead of DB..." (violates DEC-005)
**All of these are VIOLATIONS** → Consult [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
---
## 📋 Pre-Coding Checklist
Before writing ANY code:
- [ ] Which system am I modifying? (API / Agent / Frontend)
- [ ] Does this change cross boundaries? (If yes → read full tracker)
- [ ] Am I adding external API calls to Agent? (If yes → VIOLATION)
- [ ] Am I adding container execution to API? (If yes → VIOLATION)
- [ ] Am I bypassing API in Frontend? (If yes → VIOLATION)
---
## 🎯 The Golden Rules
1. **API orchestrates** (allocates resources, publishes state)
2. **Agent executes** (installs, runs, monitors inside container)
3. **Frontend displays** (no direct infrastructure access)
**Anything else** → Drift → Consult full tracker
---
## 📞 Quick Reference
**Full Tracker**: [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
**High-Risk Zones**:
- Forge/NeoForge installation
- Cloudflare SRV deletion
- Velocity registration order
- PortPool commit logic
- Agent READY detection
- IP selection logic
---
## ✅ Session Start Command
```
Read ZeroLagHub Cross-Project Tracker Quick Start Card.
Activate drift detection.
Confirm which system I'm working on: [API / Agent / Frontend]
Proceed with architecture-aligned implementation only.
```
---
🛡️ **Drift prevention ACTIVE. Proceed with confidence.**


@ -0,0 +1,573 @@
# 🚀 ZeroLagHub - GPT Implementation Handover (December 2025)
**Last Updated**: December 7, 2025
**Version**: 4.0 (Launch-Ready Implementation Guide)
**Status**: 85% Platform Complete - Active Development Sprint
---
## 🎯 Your Role (GPT - Implementation AI)
**You are**: The tactical implementation AI responsible for **building features**
**Claude is**: The strategic architecture AI responsible for **design decisions**
### **Your Responsibilities**
✅ Implement features within architectural boundaries
✅ Write code for API, Agent, and Frontend
✅ Fix bugs and optimize performance
✅ Execute sprint tasks from Kanban board
✅ Consult Cross-Project Tracker before crossing system boundaries
### **Your Constraints**
❌ Do NOT make architectural decisions without Claude
❌ Do NOT violate ownership boundaries (API vs Agent vs Frontend)
❌ Do NOT change contracts without updating both sides
❌ Do NOT skip drift prevention checks
---
## 📋 Critical Documents (READ THESE FIRST)
### **🛡️ MANDATORY Before ANY Code**
1. **[Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md)** (30 seconds)
- Quick boundary check for every session
- Violation triggers to watch for
2. **[Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)** (consult before changes)
- Ownership matrix (API vs Agent vs Frontend)
- Canonical contracts (API ↔ Agent, Frontend ↔ API)
- Drift detection rules with examples
### **📊 Implementation Context**
3. **[Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md)** (5 minutes)
- Engineering Kanban (DONE/IN PROGRESS/TODO)
- 3-day sprint plan with tasks
- Troubleshooting guides per variant
- Launch readiness matrix
4. **Engineering Handover** (from today's uploaded document)
- System lifecycle diagram
- Provisioning sequence
- Verification system specs
- Today's accomplishments
### **🔧 Technical Reference**
5. **[Agent Complete Spec](computer:///home/claude/ZeroLagHub_Agent_Complete_Spec.md)** (as needed)
- Go agent implementation details
- API endpoints and contracts
6. **[Infrastructure Specs](computer:///mnt/user-data/outputs/ZeroLagHub_Infrastructure_Specifications.md)** (as needed)
- GTHost hardware constraints
- Capacity planning
---
## 🎯 Current Platform Status (December 7, 2025)
### **What's Working** ✅ (85% Complete)
**Core Provisioning Pipeline**:
- ✅ All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge)
- ✅ VMID allocation (sequential)
- ✅ LXC container creation (template VMID 800)
- ✅ IP detection (10.200.0.X)
- ✅ Go agent deployment + self-repair
- ✅ Java runtime auto-selection (17/21)
- ✅ DNS automation (Cloudflare + Technitium)
- ✅ Velocity proxy registration
- ✅ Start/stop/restart control
- ✅ Console command injection
- ✅ Log tailing (HTTP polling)
- ✅ Crash detection
**Supported Minecraft Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
### **What's Missing** ❌ (15% To-Do)
**Critical for Launch** (7-9 hours):
- ❌ WebSocket console streaming (4-6 hours) - **HIGH PRIORITY**
- ❌ Crash loop protection with backoff (2 hours) - **HIGH PRIORITY**
- ❌ Disk space monitoring (1 hour) - **HIGH PRIORITY**
**Dev Platform** (1 day):
- 🔧 Dev container provisioning (Day 1 of sprint)
- 🔧 EdgeState schema migration (Day 2 of sprint)
- 🔧 Reconciliation job (Day 3 of sprint)
**Nice-to-Have** (future):
- 📋 File upload/download
- 📋 Backup/restore UI
- 📋 Resource monitoring dashboard
---
## 🗺️ System Architecture (Know Your Boundaries)
### **Three-System Ownership**
```
┌─────────────────────────────────────────┐
│ NODE.JS API (Orchestrator) │
│ You Own: Routes, services, job queue │
│ Speaks To: Proxmox, DNS, Velocity, DB │
│ Never: Installs Java, downloads files │
└────────────┬────────────────────────────┘
│ HTTP Contract
│ POST /config, /start, /stop
│ GET /status, /health
┌─────────────────────────────────────────┐
│ GO AGENT (Container Manager) │
│ You Own: Installation, verification │
│ Speaks To: Filesystem, game process │
│ Never: Allocates ports, creates DNS │
└────────────┬────────────────────────────┘
│ Status Polling
┌─────────────────────────────────────────┐
│ NEXT.JS FRONTEND (UI Only) │
│ You Own: Components, client state │
│ Speaks To: API only │
│ Never: Agent, Proxmox, DNS, Velocity │
└─────────────────────────────────────────┘
```
### **Critical Boundaries (NEVER CROSS)**
**API Must NOT**:
- ❌ Install Java inside containers
- ❌ Download game server files
- ❌ Execute commands directly in containers (use Agent)
**Agent Must NOT**:
- ❌ Allocate ports
- ❌ Create DNS records
- ❌ Register with Velocity
- ❌ Talk to Proxmox API
- ❌ Manage VMIDs
**Frontend Must NOT**:
- ❌ Talk directly to Agent
- ❌ Call Proxmox API
- ❌ Create DNS records
- ❌ Bypass API for any infrastructure
**Violation = STOP → Consult Cross-Project Tracker**
---
## 📅 3-Day Sprint Plan (Your Tasks)
### **Day 1: Dev Containers** (December 8)
**Goal**: Enable developer environment provisioning
**Tasks**:
1. [ ] Define dev container spec (Python, Node, Go, Java)
2. [ ] Create template VMID 6000 (base dev environment)
3. [ ] API: Add `/api/dev-instances` endpoints (create, delete, status; sketched below)
4. [ ] Agent: Add dev provisioning flow (no game server start)
5. [ ] Test: Provision Python + Node dev environments
**Success Criteria**:
- Can provision dev environment with chosen language
- Dev container accessible via SSH or web console
- No game server logic triggered
**Files to Modify**:
- `src/routes/devInstances.js` (new)
- `src/services/devProvisioner.js` (new)
- Agent: `dev.go` (new provisioning flow)
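A rough sketch of the Day 1 route surface, assuming an Express-style router and a hypothetical `devProvisioner` service matching the file names above; the real handlers will differ.
```javascript
// Illustrative only: dev-instance routes mirroring the game-server endpoints.
const express = require('express');
const devProvisioner = require('../services/devProvisioner'); // hypothetical service from the file list

const router = express.Router();
const wrap = (fn) => (req, res, next) => fn(req, res).catch(next); // forward async errors

// Create a dev environment for a chosen toolchain (python, node, go, java).
router.post('/api/dev-instances/create', wrap(async (req, res) => {
  const { language, memory } = req.body;
  const instance = await devProvisioner.create({ language, memory });
  res.status(201).json(instance);
}));

// Provisioning status for a dev instance (no game-server state machine involved).
router.get('/api/dev-instances/:vmid/status', wrap(async (req, res) => {
  res.json(await devProvisioner.status(Number(req.params.vmid)));
}));

// Tear down a dev instance.
router.delete('/api/dev-instances/:vmid', wrap(async (req, res) => {
  await devProvisioner.destroy(Number(req.params.vmid));
  res.status(204).end();
}));

module.exports = router;
```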
---
### **Day 2: EdgeState + DNS Reliability** (December 9)
**Goal**: Fix Cloudflare SRV deletion problem
**Tasks**:
1. [ ] Implement EdgeState model in Prisma schema
2. [ ] Update `edgePublisher.js` to store Cloudflare record IDs
3. [ ] Update `dePublisher.js` to delete by record ID (not hostname)
4. [ ] Test: Create → Delete → Recreate same hostname
**Success Criteria**:
- No orphaned DNS records after deletion
- EdgeState tracks all Cloudflare record IDs
- Re-provisioning same hostname works
**Files to Modify**:
- `prisma/schema.prisma` (add EdgeState model)
- `src/services/edgePublisher.js`
- `src/services/dePublisher.js`
- `src/clients/cloudflareClient.js` (return record IDs)
---
### **Day 3: Reconciliation + Hardening** (December 10)
**Goal**: Self-healing infrastructure
**Tasks**:
1. [ ] Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity; sketched below)
2. [ ] Detect orphaned containers (in Proxmox but not DB)
3. [ ] Detect orphaned DNS records (in DNS but not DB)
4. [ ] Auto-cleanup with confirmation prompt
5. [ ] Regression test suite
**Success Criteria**:
- Reconciliation job detects all orphans
- Can auto-clean with user confirmation
- System recovers from partial failures
**Files to Create**:
- `src/jobs/reconciliationJob.js`
- `src/services/reconciler.js` (rewrite)
- `tests/reconciliation.test.js`
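A rough sketch of a single reconciliation pass, assuming hypothetical `db`, `proxmox`, and `dns` clients that can each list what they know about; the real `reconciler.js` may be structured differently.
```javascript
// Illustrative only: diff the three views of the world and report orphans.
async function reconcile({ db, proxmox, dns }) {
  const [dbInstances, proxmoxCts, dnsRecords] = await Promise.all([
    db.listContainerInstances(), // rows in MariaDB (source of truth)
    proxmox.listContainers(),    // LXC containers actually present
    dns.listManagedRecords(),    // A/SRV records the platform created
  ]);

  const knownVmids = new Set(dbInstances.map((i) => i.vmid));
  const knownHostnames = new Set(dbInstances.map((i) => i.hostname));

  return {
    // In Proxmox but unknown to the database: candidates for cleanup.
    orphanedContainers: proxmoxCts.filter((ct) => !knownVmids.has(ct.vmid)),
    // In DNS but unknown to the database: stale records to delete.
    orphanedDnsRecords: dnsRecords.filter((r) => !knownHostnames.has(r.hostname)),
    // In the database but missing from Proxmox: state to repair or mark FAILED.
    missingContainers: dbInstances.filter((i) => !proxmoxCts.some((ct) => ct.vmid === i.vmid)),
  };
}
```
Auto-cleanup (task 4) would consume this report and ask for confirmation before deleting anything.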
---
## 🐛 Known Bugs (Fix These)
### **Go Agent Bugs** (Non-Blocking, But Should Fix)
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
```go
// REMOVE THIS - Forge doesn't create server.jar
serverJarPath := filepath.Join(installDir, "*server.jar")
```
**Fix**: Remove glob/rename logic entirely
2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
```go
// ADD ELSE HERE
if variant == "forge" || variant == "neoforge" {
// Forge logic
} else { // <-- ADD THIS
// Vanilla-like logic
}
```
3. **Forge Stop Command Exclusion** (`process.go` line 83)
```go
// REMOVE THIS EXCLUSION - Forge accepts stop commands
if p.variant != "forge" && p.variant != "neoforge" {
p.sendCommand("stop")
}
```
**Fix**: Remove the if condition, send stop to all variants
---
## 🚨 High-Risk Integration Zones (Careful!)
These areas have caused drift in past sessions:
1. **Forge/NeoForge Installation**
- Agent owns ALL installation logic
- API only passes config, never executes install commands
2. **Cloudflare SRV Deletion**
- Must use record IDs (not hostname inference)
- Store IDs in EdgeState on creation
3. **Velocity Registration Order**
- Wait for Agent to report RUNNING state
- Then register with Velocity (not before)
4. **Port Allocation**
- API allocates → provisions → commits
- Rollback on failure (don't leak ports)
5. **Agent READY Detection**
- Use variant-aware log parsing
- Forge takes 60-90s (don't timeout early)
---
## 📁 Key File Locations
### **API Service** (`/home/zlh/zlh-api-v2/`)
```
src/
├── routes/
│ ├── containers.js # Game server endpoints
│ └── devInstances.js # Dev environment endpoints (TO CREATE)
├── services/
│ ├── edgePublisher.js # DNS + Velocity publishing
│ ├── dePublisher.js # Edge cleanup (NEEDS REWRITE)
│ ├── portAllocator.js # Port management
│ └── reconciler.js # Orphan detection (NEEDS REWRITE)
├── clients/
│ ├── cloudflareClient.js # Cloudflare API (UPDATE for record IDs)
│ ├── technitiumClient.js # Technitium DNS
│ └── proxmoxClient.js # Proxmox API
└── jobs/
└── reconciliationJob.js # Self-healing job (TO CREATE)
```
### **Go Agent** (`/opt/zlh-agent/`)
```
├── agent.go # Main provisioning logic (FIX fallthrough)
├── artifacts.go # Download + verification (REMOVE Forge glob)
├── process.go # Server lifecycle (FIX Forge stop exclusion)
├── api.go # HTTP server for control
└── dev.go # Dev environment provisioning (TO CREATE)
```
### **Frontend** (`/home/zlh/zlh-portal/`)
```
src/
├── app/
│ ├── containers/ # Game server UI
│ └── dev/ # Dev environment UI (TO CREATE)
└── components/
└── Console.tsx # WebSocket console (TO CREATE)
```
---
## 🎮 Minecraft Variant Status
| Variant | Install | Verify | Start | READY Detection | Status |
|---------|---------|--------|-------|-----------------|--------|
| Vanilla | ✅ | ✅ | ✅ | ✅ | Production |
| Paper | ✅ | ✅ | ✅ | ✅ | Production |
| Purpur | ✅ | ✅ | ✅ | ✅ | Production |
| Fabric | ✅ | ✅ | ✅ | ✅ | Production |
| Forge | ✅ | ✅ | ✅ | ✅ | Production (has bugs) |
| NeoForge | ✅ | ✅ | ✅ | ✅ | Production |
**All variants work** - 3 non-blocking bugs should be fixed for code quality.
---
## 🔄 Provisioning Flow (Know This)
```
1. User creates server via Frontend
2. Frontend → API: POST /api/containers/create
3. API allocates VMID, ports (if needed)
4. API clones LXC from template VMID 800
5. API configures container (IP, resources)
6. API starts LXC
7. API detects container IP (10.200.0.X)
8. API → Agent: POST /config (payload)
9. Agent saves payload.json
10. Agent spawns install goroutine (async)
├─ Download Java
├─ Download game artifacts
├─ Verify installation
└─ Self-repair if needed
11. Agent starts server
12. Agent detects READY (log parsing)
13. Agent sets state = RUNNING
14. API polls /status until RUNNING
15. API saves to database
16. API publishes DNS (Cloudflare + Technitium)
17. API registers with Velocity
18. API returns SUCCESS to user
COMPLETE ✅
```
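A condensed sketch of steps 2-17 as one API-side function; the service modules (`vmidAllocator`, `agentClient`, and so on) are hypothetical names chosen to mirror the flow above, not the actual codebase.
```javascript
// Illustrative only: orchestration order for a new game server (steps 2-17 above).
async function provisionGameServer(spec, deps) {
  const { vmidAllocator, portAllocator, proxmox, agentClient, db, edgePublisher, velocityBridge } = deps;

  const vmid = await vmidAllocator.next();                      // 3. allocate VMID
  const ports = await portAllocator.allocate(vmid);             // 3. allocate ports
  await proxmox.cloneTemplate(800, vmid);                       // 4. clone LXC from template VMID 800
  await proxmox.configure(vmid, spec.resources);                // 5. configure container
  await proxmox.start(vmid);                                    // 6. start LXC
  const internalIP = await proxmox.waitForIP(vmid);             // 7. detect container IP (10.200.0.X)

  await agentClient.postConfig(internalIP, { ...spec, ports }); // 8. push payload to agent
  await agentClient.waitForState(internalIP, 'RUNNING');        // 9-14. agent installs, starts, reports

  await db.saveInstance({ vmid, internalIP, ports, ...spec });  // 15. persist state
  await edgePublisher.publish(spec.hostname, internalIP);       // 16. Cloudflare + Technitium
  await velocityBridge.registerBackend({                        // 17. register with Velocity
    name: spec.hostname, address: internalIP, port: 25565,
  });
  return { vmid, internalIP, ports };
}
```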
---
## 🛡️ Drift Prevention (ACTIVE)
### **Before Writing ANY Code**
**Ask yourself**:
1. Which system am I modifying? (API / Agent / Frontend)
2. Does this cross boundaries? (If yes → read Cross-Project Tracker)
3. Am I adding external calls to Agent? (If yes → VIOLATION)
4. Am I adding container execution to API? (If yes → VIOLATION)
5. Am I bypassing API in Frontend? (If yes → VIOLATION)
**If ANY doubt** → Stop and consult [Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
### **Common Violations to Avoid**
❌ Agent allocating ports → API owns this
❌ API installing Java → Agent owns this
❌ Frontend calling Agent directly → Must go through API
❌ Agent creating DNS → API owns this
❌ API deciding Java version → Agent owns this (version-aware)
---
## 🎯 Success Metrics (How You'll Be Measured)
### **Sprint Completion**
- [ ] All 3 days of sprint tasks completed
- [ ] Dev containers operational
- [ ] EdgeState tracking DNS record IDs
- [ ] Reconciliation job working
### **Code Quality**
- [ ] No architectural violations (follow Cross-Project Tracker)
- [ ] All 3 Go agent bugs fixed
- [ ] Tests passing
- [ ] No new drift introduced
### **Platform Readiness**
- [ ] 95% launch-ready after sprint
- [ ] All MC variants still working
- [ ] No regressions from changes
---
## 🧪 Testing Requirements
### **Before Committing Code**
**Test Each Variant**:
```bash
# Test provisioning
POST /api/containers/create {variant: "vanilla"}
POST /api/containers/create {variant: "paper"}
POST /api/containers/create {variant: "fabric"}
POST /api/containers/create {variant: "forge"}
POST /api/containers/create {variant: "neoforge"}
# Verify RUNNING state
GET /api/containers/:vmid/status
# Should return: {state: "RUNNING"}
# Test control
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/start
POST /api/containers/:vmid/restart
# Test cleanup
DELETE /api/containers/:vmid
# Verify no orphaned DNS records
```
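A minimal smoke-test sketch of the same sequence, assuming Node 18+ `fetch` and an API base URL taken from the environment; it only checks that a variant reaches RUNNING and then cleans up.
```javascript
// Illustrative only: provision a variant, wait for RUNNING, then delete.
const API = process.env.ZLH_API_URL || 'http://localhost:3000'; // assumption

async function smokeTestVariant(variant) {
  const createRes = await fetch(`${API}/api/containers/create`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ variant }),
  });
  const { vmid } = await createRes.json();

  // Poll status until RUNNING (Forge/NeoForge can take 60-90s, so allow several minutes).
  for (let i = 0; i < 60; i++) {
    const { state } = await (await fetch(`${API}/api/containers/${vmid}/status`)).json();
    if (state === 'RUNNING') break;
    if (state === 'FAILED') throw new Error(`${variant} failed to provision`);
    await new Promise((r) => setTimeout(r, 10_000));
  }

  await fetch(`${API}/api/containers/${vmid}`, { method: 'DELETE' });
}

// for (const v of ['vanilla', 'paper', 'purpur', 'fabric', 'forge', 'neoforge']) {
//   await smokeTestVariant(v);
// }
```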
**Test Dev Containers**:
```bash
POST /api/dev-instances/create {language: "python"}
POST /api/dev-instances/create {language: "node"}
# Verify accessible
ssh into dev container
# Should have language toolchain installed
```
---
## 📊 Infrastructure Constraints (Know Your Limits)
**Hardware** (GTHost Dedicated):
- CPU: 12 cores / 24 threads (Intel Xeon Silver 4116)
- RAM: 192 GB (can run 30-50 simultaneous 4GB servers)
- Storage: 1.8 TB free (300-500 servers capacity)
- Network: 300 Mbit/s (30-60 concurrent players)
**Current Allocation**:
- 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk
- Available for servers: 128 GB RAM, 1.8 TB disk
**Don't exceed these limits** - check before provisioning.
---
## 🚀 Launch Decision (Context)
**Option A**: Launch NOW (85% ready, soft beta only)
**Option B**: +3 Days Sprint (95% ready, RECOMMENDED)
**Option C**: +1 Week (98% ready, over-engineering)
**Your role**: Execute Option B sprint (complete 3 days of tasks)
**After sprint**: Platform ready for professional launch
---
## 📞 Session Continuity
### **Starting Fresh Session (You)**
```
1. Read Drift Prevention Card (30 seconds)
└─ Activate boundary awareness
2. Read Complete Current State (5 minutes)
└─ Get Kanban state + sprint tasks
3. Consult Cross-Project Tracker before code
└─ Verify no boundary violations
4. Execute sprint tasks with constraints active
5. Test all variants before committing
```
### **Handoff to Claude (Architecture Questions)**
If you encounter:
- Architectural decisions (e.g., should we change contracts?)
- Strategic questions (e.g., which features to prioritize?)
- Business model questions
- Major design changes
**STOP and escalate to Claude** for architectural guidance.
---
## ✅ Quick Reference
### **Your Mission**
Execute 3-day sprint → Deliver 95% launch-ready platform
### **Your Boundaries**
API orchestrates | Agent installs | Frontend displays
### **Your Critical Docs**
1. Drift Prevention Card (session start)
2. Cross-Project Tracker (before code)
3. Complete Current State (sprint tasks)
4. Engineering Handover (technical details)
### **Your Success**
- [ ] 3-day sprint complete
- [ ] No architectural violations
- [ ] All variants still working
- [ ] Tests passing
---
## 🎯 Start Here (First Actions)
**Right Now**:
1. ✅ Read [Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md) (30 sec)
2. ✅ Read [Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md) (5 min)
3. ✅ Check Engineering Kanban for current TODO
4. ✅ Begin Day 1 sprint: Dev containers
**Remember**:
- 🛡️ Drift prevention ACTIVE
- 📋 Consult tracker before crossing boundaries
- ✅ Test all variants before committing
- 🚀 Goal: 95% launch-ready after 3 days
---
**Status**: You have everything you need to execute the sprint. Let's build! 🚀


@ -0,0 +1,465 @@
# 🏗️ ZeroLagHub Infrastructure - Complete Specifications
**Last Updated**: December 7, 2025
**Provider**: GTHost
**Cost**: $109/month
**Status**: Production Infrastructure
---
## 🖥️ Dedicated Server Hardware
### **Server Platform**
**Model**: Supermicro 2029TP-HC1R (hot-swap HDD chassis)
**Form Factor**: Enterprise rackmount server
### **CPU**
**Model**: Intel Xeon Silver 4116
**Cores**: 12 cores / 24 threads
**Clock Speed**: 2.1 GHz base, 3.0 GHz turbo
**Architecture**: Intel Skylake-SP (Server Processor)
**TDP**: 85W
**Cache**: 16.5 MB L3
**Performance**:
- Single-threaded: Excellent for game servers
- Multi-threaded: Handles 11 VMs + 75-100 containers
- Turbo boost ensures responsive provisioning
### **Memory (RAM)**
**Configuration**: 6 x 32GB DIMMs
**Total**: 192 GB
**Type**: Hynix DDR4 RDIMM (Registered)
**Speed**: 2400 MHz
**ECC**: Yes (Error-Correcting Code)
**Benefits**:
- ECC protects against memory corruption
- Registered DIMMs = enterprise reliability
- 192GB = ample headroom for VM overhead + game servers
### **Storage**
**Configuration**: 2 x 1.92TB SSD
**Model**: Samsung PM863 (Enterprise SSD)
**Total Capacity**: 3.84 TB raw
**Available**: ~1.8 TB (after Proxmox + VMs + overhead)
**Interface**: SATA (likely in RAID configuration)
**Samsung PM863 Specs**:
- Enterprise-grade datacenter SSD
- Optimized for mixed workload
- Power Loss Protection (PLP)
- High endurance rating
**RAID Configuration** (Likely):
- RAID 1 (mirrored) for redundancy
- Or ZFS RAID-Z for Proxmox storage
### **Network**
**Bandwidth**: 300 Mbit/s (37.5 MB/s)
**Metering**: Unmetered (no bandwidth caps)
**Uplink**: Enterprise datacenter connection
**Performance Assessment**:
- 300 Mbit/s = 30-40 concurrent Minecraft players comfortably
- Unmetered = no surprise overage charges
- Sufficient for soft launch, may need upgrade at scale
### **Operating System**
**OS**: Proxmox VE 8 (64-bit)
**Base**: Debian 12 "Bookworm"
**Kernel**: Linux 6.2+
---
## 📊 Partition Layout
### **Disk Partitions**
```
/boot: 1024 MB (1 GB) - Boot partition
/swap: 2048 MB (2 GB) - Swap space
/root: Auto size - Main Proxmox system + VM storage
(~3.8 TB usable after RAID)
```
**Storage Allocation** (Estimated):
```
Total: 3.84 TB raw (2 x 1.92TB)
RAID overhead: -0.04 TB (metadata, alignment)
--------
Available: 3.80 TB
Proxmox OS: -0.10 TB (Proxmox + system)
VM Disks: -1.80 TB (11 VMs, templates)
LXC Containers: -0.10 TB (current containers)
--------
Free Space: 1.80 TB (for game servers + growth)
```
---
## 🎯 Performance Characteristics
### **CPU Capabilities**
**Per-Core Performance**: ⭐⭐⭐⭐ (Excellent for game servers)
- Minecraft servers are single-threaded
- 3.0 GHz turbo provides snappy performance
- 12 cores = 12 simultaneous high-performance game servers
**Multi-Core Throughput**: ⭐⭐⭐⭐⭐ (Excellent for hosting)
- 24 threads handle VM overhead efficiently
- Can run 11 Proxmox VMs + 75-100 LXC containers
- Provisioning operations don't impact running servers
**Virtualization**: ⭐⭐⭐⭐⭐ (Native support)
- Intel VT-x + VT-d enabled
- Hardware-accelerated virtualization
- LXC containers = near-native performance
### **Memory Capabilities**
**Capacity**: 192 GB = **Excellent** for scale
- Average Minecraft server: 2-4 GB
- 192 GB / 4 GB = **48 simultaneous 4GB servers**
- Or 96 lightweight 2GB servers
**Speed**: 2400 MHz DDR4 = **Good** (not bleeding edge, but sufficient)
- DDR4-2400 provides adequate bandwidth for game hosting
- ECC ensures data integrity under load
**Reliability**: ECC RDIMM = **Enterprise-grade**
- Detects and corrects memory errors
- Critical for 24/7 uptime
### **Storage Capabilities**
**Capacity**: 3.84 TB = **Very Good** for initial scale
- Each Minecraft server: 1-5 GB (world size varies)
- Can host 300-500 Minecraft servers comfortably
- 1.8 TB free = room for significant growth
**Performance**: Samsung PM863 = **Excellent** for workload
- Random IOPS: ~10,000 read, ~2,000 write
- Sequential: 520 MB/s read, 485 MB/s write
- Perfect for database + game world I/O
**Reliability**: Enterprise SSD = **Excellent**
- Power Loss Protection prevents corruption
- Rated for 1.3 PB writes (years of 24/7 operation)
- RAID 1 (likely) provides redundancy
### **Network Capabilities**
**Bandwidth**: 300 Mbit/s = **Adequate for soft launch**
- Minecraft player: ~0.5-1 Mbit/s
- 300 Mbit/s = 30-60 players (conservative estimate)
- Unmetered = no bandwidth overage charges
**Upgrade Path**: GTHost likely offers 1 Gbps upgrades
- 1 Gbps would support 100-200 players
- Consider upgrade when approaching 40+ concurrent players
---
## 🏗️ Current Resource Allocation
### **VM Resource Breakdown** (11 VMs)
```
Hypervisor Overhead: ~8 GB RAM, 2 CPU cores
Critical Production:
├─ VM 100 (zlh-panel): 4 GB RAM, 2 cores, 32 GB disk
├─ VM 103 (zlh-api): 4 GB RAM, 2 cores, 32 GB disk
└─ VM 101 (zlh-wings): 8 GB RAM, 4 cores, 64 GB disk
Platform Services:
├─ VM 102 (zlh-portal): 4 GB RAM, 2 cores, 32 GB disk
├─ VM 104 (zlh-monitor): 8 GB RAM, 2 cores, 64 GB disk
└─ VM 1002 (zlh-proxy): 2 GB RAM, 1 core, 16 GB disk
Network Layer:
├─ VM 1000 (zlh-router): 4 GB RAM, 2 cores, 32 GB disk
├─ VM 1006 (zpack-router): 4 GB RAM, 2 cores, 32 GB disk
└─ VM 1001 (zlh-dns): 2 GB RAM, 1 core, 16 GB disk
Development/Support:
├─ VM 300 (zlh-panel-dev): 4 GB RAM, 2 cores, 32 GB disk
└─ VM 2000 (zlh-ci): 4 GB RAM, 2 cores, 32 GB disk
Backup:
└─ VM [zlh-back]: 8 GB RAM, 2 cores, 128 GB disk
TOTAL VM ALLOCATION: 56 GB RAM, 24 cores, ~512 GB disk
```
### **Available for Game Servers**
```
RAM Available: 192 GB - 56 GB (VMs) - 8 GB (overhead) = 128 GB
CPU Available: 24 threads - 24 (VM allocation) = 0 (shared)
Disk Available: 1.8 TB free
Game Server Capacity (Conservative):
├─ 2GB servers: 64 simultaneous servers
├─ 4GB servers: 32 simultaneous servers
└─ 8GB servers: 16 simultaneous servers
Developer Environment Capacity:
├─ 2GB dev envs: 64 simultaneous environments
└─ 4GB dev envs: 32 simultaneous environments
```
**Note**: CPU is oversubscribed (common in hosting) since most game servers idle at <20% CPU usage. Turbo boost ensures good single-thread performance when needed.
---
## 📈 Capacity & Scaling Projections
### **Current Capacity** (As Deployed)
**Game Servers**: 30-50 active servers with current VM allocation
**Developer Environments**: 75-100 environments (documented capacity)
**Concurrent Players**: 30-60 players (network limited)
### **Optimized Capacity** (With Tuning)
**Game Servers**: 60-80 active servers (after VM consolidation)
**Developer Environments**: 100-150 environments
**Concurrent Players**: Still 30-60 (network bottleneck)
### **Maximum Theoretical Capacity**
**Game Servers**: 128 lightweight servers (if only game hosting, no dev)
**Developer Environments**: 192 environments (if only dev, no games)
**Storage**: 300-500 servers before storage exhaustion
**Limiting Factors**:
1. **Network** (300 Mbit/s) - limits concurrent players
2. **RAM** (192 GB) - limits concurrent heavy servers
3. **Storage** (1.8 TB free) - limits total servers
---
## 💰 Cost Analysis
### **Current Infrastructure Cost**
**Monthly**: $109 GTHost dedicated server
**Annually**: $1,308
**Cost per Resource**:
- Per GB RAM: $0.57/month ($109 ÷ 192 GB)
- Per CPU core: $9.08/month ($109 ÷ 12 cores)
- Per TB storage: $28.39/month ($109 ÷ 3.84 TB)
### **Competitive Analysis**
**AWS Equivalent** (m5.2xlarge + storage + bandwidth):
- 8 vCPU, 32 GB RAM, 1 TB storage, 1 Gbps
- Cost: ~$300-400/month
**Hetzner Dedicated** (Similar specs):
- 12 core Xeon, 128 GB RAM, 2x2TB SSD
- Cost: ~$100/month (but higher network costs)
**GTHost Value**: ⭐⭐⭐⭐⭐ Excellent
- 40-60% cheaper than AWS
- Competitive with Hetzner
- Unmetered bandwidth (key advantage)
---
## 🎯 Competitive Advantages
### **1. LXC Performance**
- Host hardware enables 20-30% better performance vs Docker
- Intel Xeon Silver 4116 single-thread performance excellent for games
### **2. Resource Density**
- 192 GB RAM supports 30-50 simultaneous 4GB servers
- Competitors typically offer 64-128 GB at this price point
### **3. Storage Performance**
- Samsung PM863 enterprise SSDs outperform consumer SSDs
- Power Loss Protection prevents world corruption
- Hot-swap chassis enables maintenance without downtime
### **4. Network**
- Unmetered = no bandwidth surprises
- 300 Mbit/s adequate for soft launch
- Upgrade path available when needed
---
## ⚠️ Identified Constraints
### **1. Network Bandwidth** (Current Bottleneck)
- **300 Mbit/s limits the platform to 30-60 concurrent players**
- **Recommendation**: Monitor bandwidth usage, upgrade to 1 Gbps when approaching 40 players
- **Upgrade Cost**: Likely +$20-50/month for 1 Gbps
### **2. CPU Oversubscription**
- 24 threads allocated to VMs, but most VMs idle
- Game servers share CPU via time-slicing
- **Risk**: If all servers spike simultaneously, performance degrades
- **Mitigation**: Limit concurrent servers to 40-50 until load testing proves higher safe
### **3. Storage Growth**
- 1.8 TB free supports 300-500 servers
- Each server grows over time (world expansion)
- **Recommendation**: Monitor disk usage, plan expansion at 70% utilization
- **Expansion Options**: Add external storage or upgrade to larger SSDs
---
## 🔧 Optimization Opportunities
### **Immediate Optimizations** (No Cost)
1. **VM Consolidation**
- Merge zlh-panel-dev into zlh-panel (save 4 GB RAM, 2 cores)
- Merge zlh-proxy into zlh-router (save 2 GB RAM, 1 core)
- **Gain**: 6 GB RAM, 3 cores for game servers
2. **LXC Over VMs**
- Convert lightweight VMs to LXC containers
- Example: zlh-dns, zlh-proxy candidates
- **Gain**: Lower overhead, faster provisioning
3. **Memory Ballooning**
- Enable KSM (Kernel Same-page Merging) on Proxmox
- Deduplicate identical memory pages
- **Gain**: 5-10% more available RAM
### **Paid Optimizations** (Consider at Scale)
1. **Network Upgrade**: 1 Gbps uplink (+$20-50/month)
- Removes player concurrency bottleneck
- Enables 100-200 player capacity
2. **Storage Expansion**: Add 4TB NVMe (+$50/month)
- Doubles storage to ~6 TB total
- Supports 600-1000 servers
3. **Cloudflare Enterprise** (+$200/month)
- DDoS protection for game traffic
- CDN for static assets
- Worth it at 100+ servers
---
## 📊 Hardware Lifecycle
### **Current Status** (December 2025)
**Server Age**: Unknown (likely 1-3 years based on Xeon Silver 4116 era)
**Expected Lifespan**: 5-7 years for enterprise server
**Remaining Life**: Likely 3-5 years
**Components**:
- CPU: Xeon Silver 4116 (2017 release) - still very capable
- RAM: DDR4-2400 (current gen, plenty of life)
- SSD: Samsung PM863 (enterprise grade, high endurance)
### **Upgrade Path** (Future)
**Year 1-2** (Current plan):
- Optimize existing hardware
- Minor network upgrades if needed
**Year 3-4** (Growth phase):
- Consider second dedicated server
- Load balance across servers
- Geographic distribution
**Year 5+** (Scale phase):
- Migrate to colocation or cloud
- Multi-datacenter deployment
---
## 🛡️ Reliability Features
### **Hardware Reliability**
**ECC Memory** - Corrects single-bit errors automatically
**Enterprise SSDs** - Power Loss Protection, high endurance
**Hot-Swap Chassis** - Replace drives without shutdown
**Redundant Power** (likely) - Supermicro chassis typically dual PSU
### **Software Reliability**
**Proxmox High Availability** - VM failover (if configured)
**PBS Backup** - Incremental backups to Backblaze B2
**LXC Snapshots** - Fast rollback capability
**RAID Mirroring** (likely) - Disk failure protection
### **Network Reliability**
**Datacenter Uptime** - GTHost likely 99.9%+ SLA
**Unmetered Bandwidth** - No throttling during spikes
⚠️ **Single Uplink** - No network redundancy (acceptable for price point)
---
## 🎯 Summary & Recommendations
### **Hardware Assessment**: ⭐⭐⭐⭐ Very Good for Use Case
**Strengths**:
- Excellent CPU for game server hosting (Xeon Silver 4116)
- Abundant RAM (192 GB = 30-50 servers)
- Enterprise storage (Samsung PM863 + hot-swap)
- Unmetered bandwidth (no surprise charges)
- Great value ($109/month for these specs)
**Limitations**:
- Network bandwidth (300 Mbit/s = 30-60 players)
- Storage growth constraint (monitor usage)
- CPU oversubscription (limit concurrent servers initially)
### **Recommendations**
**Now** (Launch Phase):
1. ✅ Deploy on current hardware - adequate for soft launch
2. ✅ Limit to 40-50 concurrent servers initially
3. ✅ Monitor bandwidth, RAM, and disk usage
**Month 1-3** (Early Growth):
1. 🔧 Optimize VM allocation (consolidate where possible)
2. 🔧 Implement aggressive monitoring
3. 🔧 Consider 1 Gbps network upgrade if approaching 40 players
**Month 6-12** (Scale Phase):
1. 📈 Evaluate storage expansion based on usage
2. 📈 Consider second server for geographic distribution
3. 📈 Implement Cloudflare Enterprise for DDoS protection
### **Capacity Targets by Phase**
**Soft Launch** (Month 1-3): 20-30 servers, 10-20 players
**Public Launch** (Month 3-6): 40-50 servers, 30-40 players
**Growth Phase** (Month 6-12): 60-80 servers, 60-100 players (with 1 Gbps upgrade)
**Scale Phase** (Month 12+): 100+ servers, multi-server deployment
---
## ✅ Conclusion
**Status**: Infrastructure is **production-ready** for ZeroLagHub launch.
**Key Points**:
- Hardware specifications are excellent for initial scale
- 192 GB RAM supports 30-50 game servers
- Storage capacity adequate for 300-500 servers
- Network bandwidth is current bottleneck (acceptable for soft launch)
- Cost-effective ($109/month for enterprise-grade hardware)
**Green Light**: ✅ Launch when platform development complete (currently 85% ready).
---
**Last Updated**: December 7, 2025
**Source**: GTHost server specifications + ZeroLagHub infrastructure analysis


@ -0,0 +1,606 @@
# 🚀 ZeroLagHub - Master Bootstrap Document (December 2025)
**Last Updated**: December 7, 2025
**Version**: 4.0 (Platform Launch Ready)
**Status**: 85% Complete - Launch Decision Point
---
## 📌 Quick Start for New AI Sessions
> **Resume Point**: Platform 85% launch-ready, all core provisioning operational, critical UX features needed.
>
> **Current Phase**: Launch readiness assessment - choose NOW vs +1 week vs +1 month
>
> **Critical Context**: All 6 Minecraft variants provisioning successfully via Go agent. Need WebSocket console, crash protection, and disk monitoring for competitive parity.
---
## 🎯 Project Overview
**ZeroLagHub** is a developer-focused game server hosting platform built on:
- **Proxmox VE** with LXC containers (20-30% performance advantage over Docker)
- **Hybrid Architecture**: Pterodactyl panel + Custom Node.js API + Go provisioning agent
- **Velocity proxy** for seamless Minecraft routing
- **Dual-router architecture** for traffic separation
- **Developer-to-player revenue pipeline** with 9.75x revenue multiplier
### Core Value Proposition
Complete dev-to-production pipeline: Development environments ($20/mo) → Testing servers (50% discount) → Player hosting (25% discount) → Revenue sharing (7.5% commission) = viral growth through developer ecosystem.
---
## 🏗️ Current Architecture (December 2025)
### Infrastructure Overview (11 VMs)
```
Critical Production:
├── VM 100 (zlh-panel) - Pterodactyl panel + OAuth customization
├── VM 103 (zlh-api) - Node.js backend + developer platform APIs
└── VM 101 (zlh-wings) - Game servers + LXC integration target
Platform Services:
├── VM 102 (zlh-portal) - Next.js frontend + developer dashboard
└── VM 104 (zlh-monitor) - Prometheus/Grafana monitoring
Network & Infrastructure:
├── VM 1000 (zlh-router) - Platform services routing + VLANs
├── VM 1006 (zpack-router) - Game traffic routing + Velocity
├── VM 1001 (zlh-dns) - Technitium DNS + development domains
├── VM 1002 (zlh-proxy) - Caddy reverse proxy + SSL automation
├── VM 300 (zlh-panel-dev) - Development environment + testing
├── VM 2000 (zlh-ci) - CI/CD pipeline + automation
└── VM [zlh-back] - PBS backup + Backblaze B2 replication
```
### Network Topology
```
zlh-router (VM 1000):
├─ WAN1: Platform services (API, portal, monitoring)
├─ CORE_LAN: 10.60.0.0/24 (internal services)
├─ MGMT_LAN: 172.60.0.10/24 (inter-router communication)
└─ WireGuard: Admin access
zpack-router (VM 1006):
├─ WAN2: 139.64.165.248 (game services)
├─ ZPACK_LAN: 10.70.0.0/24 (Velocity @ 10.70.0.241)
├─ DEV_LAN: 10.100.0.0/24 (developer environments - future)
├─ GAME_LAN: 10.200.0.0/24 (game server LXCs)
└─ MGMT_LAN: 172.60.0.20/24 (control plane communication)
```
### Traffic Flows
- **Platform Access**: Client → WAN1 → zlh-router → Frontend/API
- **Game Play**: Player → WAN2 (139.64.165.248) → zpack-router → Velocity (10.70.0.241) → Game Server (10.200.0.X)
- **Control Plane**: API → MGMT_LAN (172.60.0.X) → Velocity/DNS/Monitoring
---
## ✅ What's Working (December 7, 2025)
### Provisioning Pipeline (100% Operational)
| Component | Status | Notes |
|-----------|--------|-------|
| **LXC Container Creation** | ✅ | Template VMID 800, auto-cloning working |
| **VMID Allocation** | ✅ | Sequential assignment from range |
| **IP Detection** | ✅ | Automatic network configuration |
| **Go Agent Deployment** | ✅ | Payload delivery + self-repair system |
| **Java Runtime Selection** | ✅ | Auto-detect MC version → Java 17/21 |
| **All 6 MC Variants** | ✅ | Vanilla, Paper, Purpur, Fabric, Forge, NeoForge |
| **Server Startup** | ✅ | All variants start successfully |
| **DNS Publishing** | ✅ | Cloudflare + Technitium A + SRV records |
| **Velocity Registration** | ✅ | Dynamic backend server registration |
| **Client Connectivity** | ✅ | Players can connect and play |
### Control Functions
| Function | Status | Implementation |
|----------|--------|----------------|
| **Start/Stop/Restart** | ✅ | HTTP API → Go agent |
| **Console Commands** | ✅ | Command injection working |
| **Log Tailing** | ⚠️ | HTTP polling only (need WebSocket) |
| **Status Reporting** | ✅ | Agent emits RUNNING state |
| **Crash Detection** | ✅ | Agent tracks exit codes |
### Game Support Matrix
**Launch Ready (Minecraft Only)**:
- ✅ **Vanilla** - Official Mojang server
- ✅ **Paper** - Primary recommendation (vanilla + plugins)
- ✅ **Purpur** - Paper fork with extra features
- ✅ **Fabric** - Lightweight mod support
- ✅ **Forge** - Heavy mod support (tech/magic mods)
- ✅ **NeoForge** - Modern Forge fork (**competitive advantage**)
**Supported Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
**Deferred to Post-Launch**:
- 📋 Terraria
- 📋 Project Zomboid
- 📋 Valheim
- 📋 Rust
---
## 🚨 Known Issues & Gaps
### Critical Bugs (Non-Blocking, System Works)
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
- Tries to find `*server.jar` but Forge ≥1.17 doesn't create this
- **Fix**: Remove glob/rename logic (Forge uses `run.sh` + `libraries/`)
- **Impact**: System works, but unnecessary code
2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
- After Forge check, falls through to check `server.jar`
   - **Fix**: Add `else` to prevent fallthrough (see the sketch after this list)
- **Impact**: Minor efficiency issue
3. **Forge Stop Command Exclusion** (`process.go` line 83)
- Excludes Forge from receiving `stop` command
- **Fix**: Remove exclusion (Forge accepts stop commands)
- **Impact**: Manual workaround needed for Forge stops
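Bugs 1 and 2 share a root cause: launch detection is not variant-aware. A minimal, hypothetical sketch of the direction of the fix (names and layout are illustrative and do not mirror the real `agent.go` / `artifacts.go`): each variant declares its own launch markers, the Forge/NeoForge branch checks `run.sh` + `libraries/` and never falls through to a `server.jar` check, and no `*server.jar` glob/rename is attempted.

```go
// Hypothetical sketch of a variant-aware provisioning check addressing bugs 1
// and 2. Names are illustrative; they do not mirror the real agent code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// launchMarkers lists the files/dirs whose presence means the variant is
// installed and launchable; there is no *server.jar glob or rename anywhere.
func launchMarkers(variant string) []string {
	switch variant {
	case "forge", "neoforge":
		// Forge >= 1.17 ships run.sh plus a libraries/ tree, never server.jar.
		return []string{"run.sh", "libraries"}
	case "fabric":
		// Fabric's artifact is the pre-built fabric-server.jar (no installer).
		return []string{"fabric-server.jar"}
	default:
		// Vanilla, Paper and Purpur all boot from a single server.jar.
		return []string{"server.jar"}
	}
}

// isProvisioned returns true only if every marker for the variant exists, so
// the Forge branch can never fall through to a server.jar check.
func isProvisioned(serverDir, variant string) bool {
	for _, marker := range launchMarkers(variant) {
		if _, err := os.Stat(filepath.Join(serverDir, marker)); err != nil {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isProvisioned("/opt/zlh/minecraft/forge", "forge"))
}
```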
### Missing Competitive Features (CRITICAL)
| Feature | Apex | Shockbyte | ZeroLagHub | Priority |
|---------|------|-----------|------------|----------|
| **All MC Variants** | ✅ | ✅ | ✅ | - |
| **NeoForge** | ❌ | ❌ | ✅ | ADVANTAGE |
| **Performance** | 🟡 | 🟡 | ✅ | ADVANTAGE |
| **Console Streaming** | ✅ | ✅ | ❌ | 🔴 HIGH |
| **File Management** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
| **Backups** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
| **Crash Protection** | ✅ | ✅ | ❌ | 🔴 HIGH |
| **Disk Monitoring** | ✅ | ✅ | ❌ | 🔴 HIGH |
---
## 🎯 Platform Readiness Assessment (85%)
### Core Platform (100%)
- ✅ Container orchestration
- ✅ Multi-variant provisioning
- ✅ Network routing (dual-router)
- ✅ DNS automation (Cloudflare + Technitium)
- ✅ Velocity proxy integration
- ✅ Start/stop/restart control
- ✅ Console command injection
- ✅ Status monitoring
### Operational Features (70%)
- ✅ Log tailing (HTTP polling)
- ✅ Crash detection
- ❌ **WebSocket console** (need real-time streaming; see the agent-side sketch after this list)
- ❌ **Crash loop protection** (need exponential backoff)
- ❌ **Disk space monitoring** (prevent corruption)
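For the WebSocket console gap, one plausible shape (an assumption, not the decided design) is for the Go agent to stream appended log lines over a plain WebSocket that the API then bridges to the browser (e.g. via Socket.io). A minimal sketch assuming `gorilla/websocket`; the endpoint, log path, poll interval, and port are illustrative, and log rotation is not handled.

```go
// Minimal sketch of agent-side console streaming over a WebSocket (assumed
// gorilla/websocket on the agent; the Socket.io bridge would live in the API).
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	// The agent is only reachable on the management network, so accept all
	// origins here; tighten this if the agent is ever exposed more widely.
	CheckOrigin: func(r *http.Request) bool { return true },
}

func streamLogs(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	f, err := os.Open("/opt/zlh/minecraft/paper/logs/latest.log") // illustrative path
	if err != nil {
		conn.WriteMessage(websocket.TextMessage, []byte("log file not available"))
		return
	}
	defer f.Close()
	f.Seek(0, io.SeekEnd) // only stream output appended after connect

	buf := make([]byte, 4096)
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		n, readErr := f.Read(buf)
		if n > 0 {
			if werr := conn.WriteMessage(websocket.TextMessage, buf[:n]); werr != nil {
				return // client went away
			}
		}
		if readErr != nil && readErr != io.EOF {
			return
		}
	}
}

func main() {
	http.HandleFunc("/console/stream", streamLogs)
	log.Fatal(http.ListenAndServe(":8090", nil)) // illustrative agent port
}
```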
### File Management (0%)
- ❌ File upload/download
- ❌ Backup/restore system
- ❌ World file management
### Advanced Features (Planned)
- 📋 Resource monitoring dashboard
- 📋 Plugin marketplace
- 📋 Developer platform APIs
- 📋 Performance optimization tools
---
## 🚀 Launch Decision Point
### Option A: Launch NOW (Soft Beta)
**Status**: 85% ready
**Timeline**: Immediate
**Pros**: Fast to market, gather user feedback
**Cons**: Missing competitive UX features, higher support burden
**Recommendation**: ⚠️ Acceptable for 10-20 beta users only
### Option B: +1 Week (Critical Features) ⭐ RECOMMENDED
**Status**: 95% ready after additions
**Timeline**: December 14, 2025
**Add**: WebSocket console + Crash protection + Disk monitoring
**Effort**: 7-9 hours total
**Pros**: Competitive feature parity, professional launch
**Cons**: Minimal delay
**Recommendation**: ✅ Best balance of quality and speed
### Option C: +1 Month (Full Feature Parity)
**Status**: 100% ready
**Timeline**: January 7, 2026
**Add**: All UX features + file management + backups
**Effort**: ~30 hours
**Pros**: Complete competitive offering
**Cons**: Slower to market, feature creep risk
**Recommendation**: ⚠️ Over-engineering for launch
---
## 📋 Critical Outstanding Items
### 🔴 High Priority (Before Launch)
**1. WebSocket Console Streaming** [4-6 hours]
- **Current**: HTTP polling via `/logs/tail`
- **Needed**: Real-time WebSocket streaming
- **Why**: Industry standard, users expect it
- **Technical**: Socket.io integration to Go agent
**2. Crash Loop Protection** [2 hours]
- **Current**: Immediate restart on crash
- **Needed**: Exponential backoff (5s, 10s, 15s), stop after 3 crashes
- **Why**: Prevents resource thrashing
- **Technical**: Agent retry logic with backoff timer
**3. Disk Space Monitoring** [1 hour]
- **Current**: No checks
- **Needed**: Alert when <1GB free, prevent start if insufficient
- **Why**: Prevents world corruption
- **Technical**: Agent disk space check before start (combined with the crash backoff in the sketch after this list)
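A compact sketch of how items 2 and 3 could sit together in the agent's supervisor loop: a pre-start free-space check (refuse below 1 GB) and the 5s/10s/15s restart delays, stopping once three delayed restarts have all crashed again. The 10-minute "healthy run resets the streak" window, the paths, and the launch command are assumptions.

```go
// Sketch of crash-loop protection plus the pre-start disk check (items 2 and 3
// above). The 1 GB floor and the 5s/10s/15s delays come from this document;
// the exact stop condition, the 10-minute healthy-run window, paths, and the
// launch command are assumptions.
package main

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

const minFreeBytes = 1 << 30 // refuse to start with less than 1 GB free

var restartDelays = []time.Duration{5 * time.Second, 10 * time.Second, 15 * time.Second}

// freeBytes reports the available space on the filesystem holding path (Linux only).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

// superviseServer launches the server and restarts it with escalating delays,
// giving up once the delay schedule is exhausted. launch must build a fresh
// Cmd each time, because an exec.Cmd cannot be reused after it has run.
func superviseServer(serverDir string, launch func() *exec.Cmd) {
	crashes := 0
	for {
		free, err := freeBytes(serverDir)
		if err != nil {
			log.Printf("disk check failed: %v", err)
			return
		}
		if free < minFreeBytes {
			log.Printf("refusing to start: only %d MiB free (< 1 GiB)", free>>20)
			return
		}

		cmd := launch()
		cmd.Dir = serverDir
		start := time.Now()
		runErr := cmd.Run()
		if runErr == nil {
			return // clean exit (e.g. a "stop" command); do not restart
		}
		if time.Since(start) > 10*time.Minute {
			crashes = 0 // ran long enough to call the previous crash streak over
		}

		crashes++
		if crashes > len(restartDelays) {
			log.Printf("giving up after %d crashes in a row; manual intervention required", crashes)
			return
		}
		delay := restartDelays[crashes-1]
		log.Printf("server crashed (streak %d/%d), restarting in %s", crashes, len(restartDelays), delay)
		time.Sleep(delay)
	}
}

func main() {
	superviseServer("/opt/zlh/minecraft/paper", func() *exec.Cmd {
		return exec.Command("java", "-Xmx4G", "-jar", "server.jar", "nogui")
	})
}
```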
### 🟡 Medium Priority (Week 1)
**4. File Upload/Download** [6-8 hours]
- Plugin management, world uploads
- HTTP multipart + streaming
**5. Backup System** [8-10 hours]
- World backup/restore
- Integration with PBS backup infrastructure
**6. Enhanced Health Checks** [3-4 hours]
- Query server status
- Resource monitoring (CPU/RAM)
### 🟢 Low Priority (Month 1)
7. Resource monitoring dashboard
8. Plugin marketplace integration
9. Developer platform APIs
10. Performance optimization
---
## 🗄️ Technical Architecture Details
### Directory Structure (Finalized)
```
/opt/zlh/<game>/<variant>/world/
Examples:
/opt/zlh/minecraft/vanilla/world/
/opt/zlh/minecraft/forge/world/
/opt/zlh/minecraft/fabric/world/
```
**Benefits**:
- Clear game/variant separation
- Scalable to all future games
- Self-documenting paths
- Easy backup automation
### Container Model
**Architecture**: One game per LXC container
**Rationale**: Industry standard, 3-5x simpler than multi-game
**Benefits**:
- Better resource isolation
- Simpler billing
- Clearer security boundaries
- Easier debugging
### Java Runtime Selection
```
MC 1.21.x → Java 21
MC ≥1.20.5 → Java 21
MC <1.20.5  → Java 17
```
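The mapping is small enough to pin down in a few lines. A sketch of how the agent's check might look (parsing and names are illustrative, not the actual agent code): 1.20.5 and newer map to Java 21, everything older to Java 17, exactly as listed above.

```go
// Sketch of the documented MC-version -> Java-runtime mapping. The rule is the
// one in the table above; the parsing and function names are illustrative.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// javaMajorFor returns the Java major version for a Minecraft version string
// such as "1.21.1", "1.20.5" or "1.18.2".
func javaMajorFor(mcVersion string) int {
	parts := strings.Split(mcVersion, ".")
	minor, patch := 0, 0
	if len(parts) > 1 {
		minor, _ = strconv.Atoi(parts[1])
	}
	if len(parts) > 2 {
		patch, _ = strconv.Atoi(parts[2])
	}
	if minor > 20 || (minor == 20 && patch >= 5) {
		return 21 // MC 1.20.5+ and all 1.21.x
	}
	return 17 // everything older, per the table above
}

func main() {
	for _, v := range []string{"1.12.2", "1.18.2", "1.20.1", "1.20.5", "1.21.4"} {
		fmt.Printf("MC %-7s -> Java %d\n", v, javaMajorFor(v))
	}
}
```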
### Artifact Download Paths
```
minecraft/vanilla/<version>/server.jar
minecraft/paper/<version>/server.jar
minecraft/purpur/<version>/server.jar
minecraft/fabric/<version>/fabric-server.jar
minecraft/forge/<version>/forge-installer.jar
minecraft/neoforge/<version>/neoforge-installer.jar
```
**Critical Note**: Fabric uses `fabric-server.jar` (pre-built), not installer pattern
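Because the only per-variant differences are the file names (Fabric's pre-built jar, the two installer-based loaders), the relative artifact key can be derived mechanically. A small sketch following the layout above; the base URL or bucket it would be joined to is not specified here and is left out.

```go
// Sketch of building the artifact key from game/variant/version, following the
// path layout above. Where this key is downloaded from is not specified here.
package main

import (
	"fmt"
	"path"
)

func artifactKey(game, variant, version string) string {
	var file string
	switch variant {
	case "fabric":
		file = "fabric-server.jar" // pre-built jar, no installer step
	case "forge":
		file = "forge-installer.jar"
	case "neoforge":
		file = "neoforge-installer.jar"
	default:
		file = "server.jar" // vanilla, paper, purpur
	}
	return path.Join(game, variant, version, file)
}

func main() {
	fmt.Println(artifactKey("minecraft", "neoforge", "1.21.1"))
	// minecraft/neoforge/1.21.1/neoforge-installer.jar
}
```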
---
## 💰 Business Model & Revenue Strategy
### Developer-to-Player Pipeline
```
Step 1: Developer Acquisition
├─ Development Environment: $20/month
└─ Testing Server: $25/month (50% discount)
Step 2: Player Acquisition (via developer)
├─ Player 1-10: $15/month each (25% discount)
└─ Total Player Revenue: $150/month
Step 3: Developer Commission
├─ Revenue Share: 7.5% of player revenue
├─ Developer Earns: $11.25/month
└─ Platform Keeps: $138.75/month
Total Monthly Revenue from One Developer:
$20 (dev env) + $25 (test server) + $150 (players) = $195/month
Revenue Multiplier: 9.75x on developer acquisition cost
```
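The pipeline arithmetic is easy to sanity-check in code. A few lines reproducing the figures above; the only assumption is that the 9.75x multiplier is total monthly revenue divided by the $20/month development-environment price.

```go
// Worked check of the developer-to-player pipeline figures above.
package main

import "fmt"

func main() {
	devEnv := 20.0      // development environment, $/month
	testServer := 25.0  // testing server at 50% discount, $/month
	playerPrice := 15.0 // discounted player server, $/month
	players := 10.0     // players referred by this developer

	playerRevenue := players * playerPrice       // 150.00
	commission := 0.075 * playerRevenue          // 11.25 paid to the developer
	platformKeeps := playerRevenue - commission  // 138.75
	total := devEnv + testServer + playerRevenue // 195.00
	multiplier := total / devEnv                 // 9.75 (assumed denominator: $20 dev env)

	fmt.Printf("player revenue:  $%.2f\n", playerRevenue)
	fmt.Printf("dev commission:  $%.2f\n", commission)
	fmt.Printf("platform keeps:  $%.2f\n", platformKeeps)
	fmt.Printf("total/month:     $%.2f\n", total)
	fmt.Printf("multiplier:      %.2fx\n", multiplier)
}
```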
### Financial Projections
**Month 6**: $8K-30K (LXC advantage + developer pipeline)
**Month 12**: $25K-100K (custom platform competitive advantages)
**Month 24**: $75K-300K (market leadership + technology licensing)
### Competitive Advantages
1. **LXC Performance**: 20-30% improvement over Docker competitors
2. **Developer Ecosystem**: Complete dev-to-production pipeline vs pure hosting
3. **Open Source Foundation**: 30-40% cost advantage over corporate providers
4. **Gaming-First Architecture**: Purpose-built vs adapted generic hosting
5. **NeoForge Support**: Ahead of Apex and Shockbyte
---
## 🔐 Security Vulnerabilities (CRITICAL - Active Fix Required)
### API Department Issues
1. **Server Ownership Bypass**
- Any user can control any server via UUID
- No ownership validation in API endpoints
- **Impact**: Critical security flaw
2. **Admin Privilege Escalation**
- Frontend can claim admin via JWT manipulation
- No server-side role validation
- **Impact**: Complete access control bypass
3. **Token URL Exposure**
- JWTs visible in browser history/logs
- Tokens passed as URL parameters
- **Impact**: Token theft vulnerability
4. **API Key Validation Missing**
- Authentication bypass vulnerabilities
- Inconsistent validation patterns
- **Impact**: Unauthorized API access
### Required Fixes
- Implement ownership checks on all server operations
- Server-side JWT validation and role enforcement
- Move tokens from URL to headers/cookies
- Comprehensive API key validation
**Priority**: Must fix before public launch (current soft beta acceptable)
---
## 🛠️ Ford Assembly Line Department Structure
### Management Department (Coordination Hub)
- **Role**: Strategic oversight, cross-department integration
- **AI Resource**: Claude (architecture) + ChatGPT (implementation)
- **Current Focus**: Launch readiness + critical feature completion
### 5 Specialized Departments
**1. API Department** ⚠️ CRITICAL SECURITY + DEVELOPER PLATFORM
- Tech: Node.js/Express, MariaDB, JWT auth, Pterodactyl integration
- Priority: Security fixes + developer environment APIs
**2. Infrastructure Department** ✅ LXC INTEGRATION PRIORITY
- Tech: Proxmox VMs, Ansible automation, PBS backup, Monitoring
- Achievement: Enterprise backup system operational
- Capacity: 1.8TB available, supports 75-100 developers
**3. Frontend Department** 🔧 TOKEN SECURITY + DEVELOPER UI
- Tech: Next.js 15, TailwindCSS, sci-fi HUD aesthetic, TypeScript
- Priority: Token security + developer dashboard
**4. Pterodactyl Department** ⚠️ OAUTH + WINGS LXC
- Role: Panel customization, OAuth integration
- Future: Wings LXC integration for performance advantage
**5. Planning & Brainstorming Department** 🧠 STRATEGIC EXECUTION
- Role: Long-term vision, competitive strategy
- Focus: Developer acquisition, viral growth mechanics
---
## 📋 Immediate Next Steps (Priority Order)
### Phase 1: Critical Features (Before Launch)
1. 🔧 **Fix Go Agent Bugs** - Remove Forge glob, fix fallthrough, enable stop commands
2. 🔧 **WebSocket Console** - Implement real-time streaming (4-6 hours)
3. 🔧 **Crash Loop Protection** - Add exponential backoff (2 hours)
4. 🔧 **Disk Space Monitoring** - Prevent starts on low disk (1 hour)
### Phase 2: Launch Readiness
5. 📋 **Security Audit** - Review critical vulnerabilities
6. 📋 **Documentation** - User guides, API docs
7. 📋 **Monitoring** - Alert thresholds, dashboards
8. 📋 **Soft Beta** - 10-20 users, gather feedback
### Phase 3: Week 1 Post-Launch
9. 📋 **File Management** - Upload/download interface
10. 📋 **Backup System** - World backup/restore
11. 📋 **Enhanced Health Checks** - Resource monitoring
---
## 🎯 Success Metrics
### Technical Metrics
- ✅ 100% provisioning success rate (all 6 variants)
- ⚠️ Zero DNS orphan records (needs EdgeState migration)
- ⚠️ Sub-second WebSocket latency (needs implementation)
- ✅ LXC 20-30% performance advantage (validated)
### Business Metrics (Future)
- Developer referral system operational
- Revenue sharing calculations accurate
- Customer quota enforcement working
- Usage metering for billing
### User Experience Metrics
- Professional HUD aesthetic maintained
- Zero breaking changes during updates
- Seamless dev-to-production pipeline
- <3s average provisioning time
---
## 📁 Key Files & Locations
### API Service (`/home/zlh/zlh-api-v2/`)
- `prisma/schema.prisma` - Database schema
- `src/services/edgePublisher.js` - DNS + Velocity publishing
- `src/services/dePublisher.js` - Edge cleanup
- `src/services/portAllocator.js` - Port management
- `src/clients/cloudflareClient.js` - Cloudflare API wrapper
- `src/clients/technitiumClient.js` - Technitium DNS API wrapper
### Go Agent (`/opt/zlh-agent/`)
- `agent.go` - Main provisioning logic
- `artifacts.go` - Download + verification (has bugs)
- `process.go` - Server lifecycle management (has bug)
- `api.go` - HTTP server for control commands
- `payload.json` - Configuration from API
### Frontend (`/home/zlh/zlh-portal/`)
- Next.js 15 application
- Steel-texture HUD aesthetic
- Developer dashboard (in progress)
---
## ⚠️ Critical Rules & Constraints
### DO NOT
- ❌ Infer hostnames from DNS records
- ❌ Use DNS as source of truth
- ❌ Delete Cloudflare records without record IDs
- ❌ Launch without WebSocket console (competitive requirement)
- ❌ Skip crash protection (operational stability)
- ❌ Ignore disk space monitoring (data safety)
### ALWAYS
- ✅ Treat DB as authoritative source of truth
- ✅ Store Cloudflare record IDs in EdgeState
- ✅ Use exact hostname matching
- ✅ Track all async operations in JobLog
- ✅ Audit significant actions
- ✅ Test all 6 MC variants before deploy
---
## 💡 Key Architectural Decisions (ADRs)
**ADR-001: Minecraft-Only Launch**
**Decision**: Launch with Minecraft only, defer other games
**Rationale**: Market validation, focused quality, faster to market
**Consequence**: 6 variants + 6 versions = comprehensive MC offering
**ADR-002: One Game Per Container**
**Decision**: Single game per LXC container
**Rationale**: Industry standard, 3-5x simpler than multi-game
**Consequence**: Better isolation, clearer billing, easier debugging
**ADR-003: Velocity Over Direct Port Forwarding**
**Decision**: Use Velocity proxy for Minecraft routing
**Rationale**: Single entry point, dynamic registration, no NAT complexity
**Consequence**: No external port allocation needed for MC
**ADR-004: Hybrid Pterodactyl + Custom API**
**Decision**: Keep Pterodactyl panel, build custom API alongside
**Rationale**: Preserve working OAuth, gradual migration path
**Consequence**: Dual system complexity, eventual migration needed
**ADR-005: Go Agent Architecture**
**Decision**: Containerized Go agent handles provisioning
**Rationale**: Language-agnostic, self-healing, version-aware
**Consequence**: Robust provisioning, automatic repair, clean separation
---
## 🧠 Session Continuity Prompt
For AI assistants resuming work on this project:
> Resume from ZeroLagHub Master Bootstrap (December 7, 2025).
>
> **Current State**: Platform 85% launch-ready. All 6 Minecraft variants provisioning successfully via Go agent. Core functionality operational, need critical UX features for competitive parity.
>
> **Launch Decision**: Recommend +1 week for WebSocket console, crash protection, and disk monitoring.
>
> **Known Bugs**: 3 non-blocking Go agent issues (Forge glob, fallthrough, stop exclusion).
>
> **Critical Context**:
> - Security vulnerabilities exist but acceptable for soft beta
> - Business model validated with 9.75x revenue multiplier
> - Developer-to-player pipeline is core differentiator
> - LXC performance advantage is primary competitive edge
>
> **Next Actions**: Fix Go agent bugs, implement critical features, launch beta.
---
## 📞 Support & Escalation
- **Platform Owner**: 44 years old, full-stack developer
- **AI Coordination**: Claude (architecture) + ChatGPT (implementation)
- **Infrastructure**: GTHost dedicated server ($109/month)
- **Domain**: zerolaghub.com, zpack.zerolaghub.com
- **Public Game IP**: 139.64.165.248
---
## 📊 Platform Status Summary
**Technical Readiness**: 85% complete
**Competitive Position**: Ready to compete on core provisioning, need UX polish
**Strategic Clarity**: Clear path to launch with validated business model
**Infrastructure**: Production-grade with enterprise backup system
**Security**: Known vulnerabilities, acceptable for soft beta, must fix before public launch
---
## 🎯 Strategic Recommendation
**Recommended Path**: Option B (+1 Week)
**Rationale**:
1. WebSocket console is table stakes (competitors have it)
2. Crash protection prevents operational nightmares
3. Disk monitoring prevents data loss
4. 1 week is negligible for long-term platform success
5. Professional launch > rushed launch
**Timeline**:
- **Dec 7-10**: Implement critical features (WebSocket, crash, disk)
- **Dec 11-13**: Testing + bug fixes
- **Dec 14**: Soft beta launch (10-20 users)
- **Dec 21**: Public launch after beta feedback
---
**This document serves as the single source of truth for project continuity. Update after each major milestone or architectural change.**
🚀 **Next action: Decide launch timeline, then implement critical features or launch beta.**