Upload files to "/"
This commit is contained in:
parent
882c0c7169
commit
5a5ad001af
846
ZeroLagHub_Cross_Project_Tracker.md
Normal file
@ -0,0 +1,846 @@
# 🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System

**Last Updated**: December 7, 2025
**Version**: 1.0 (Canonical Architecture Enforcement)
**Status**: ACTIVE - Must Be Consulted Before All Code Changes

---

## 🎯 Document Purpose

This document establishes **architectural boundaries** across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.

**Critical Insight**: Most project failures come from **gradual architectural erosion**, not sudden breaking changes. This document prevents that.

---

## 📊 The Three Systems (Canonical Ownership)

```
┌─────────────────────────────────────────────────────────────┐
│                       NODE.JS API v2                        │
│                   (Orchestration Engine)                    │
│                                                             │
│  Owns: VMID allocation, port management, DNS publishing,    │
│        Velocity registration, job queue, database state     │
│                                                             │
│  Speaks To: Proxmox, Cloudflare, Technitium, Velocity,      │
│             MariaDB, BullMQ, Go Agent (HTTP)                │
│                                                             │
│  Never Touches: Container filesystem, game server files,    │
│                 Java installation, artifact downloads       │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      │ HTTP Contract
                      │ POST /config, /start, /stop
                      │ GET /status, /health
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                          GO AGENT                           │
│                (Container-Internal Manager)                 │
│                                                             │
│  Owns: Server installation, Java runtime, artifact          │
│        downloads, process management, READY detection,      │
│        filesystem layout, verification + self-repair        │
│                                                             │
│  Speaks To: Local filesystem, game server process,          │
│             API (status updates via HTTP polling)           │
│                                                             │
│  Never Touches: Proxmox, DNS, Cloudflare, Velocity,         │
│                 port allocation, VMID selection             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      │ Status Polling
                      │ Agent reports state
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                      NEXT.JS FRONTEND                       │
│                    (Customer + Admin UI)                    │
│                                                             │
│  Owns: User interaction, form validation, display logic,    │
│        client-side state, UI components                     │
│                                                             │
│  Speaks To: API v2 only (REST + WebSocket when added)       │
│                                                             │
│  Never Touches: Proxmox, Go Agent, DNS, Velocity,           │
│                 Cloudflare, direct container access         │
└─────────────────────────────────────────────────────────────┘
```

---

## 🗺️ System Ownership Matrix (CANONICAL)

| Area | Node.js API v2 | Go Agent | Frontend |
|------|----------------|----------|----------|
| **Provisioning Orchestration** | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
| **Template Selection** | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
| **Server Installation** | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
| **Runtime Control** | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
| **DNS (Cloudflare + Technitium)** | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
| **Velocity Registration** | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
| **IP Logic** | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
| **Port Allocation** | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
| **Monitoring** | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
| **Error Handling** | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |

---

## 🔄 API ↔ Agent Contract (IMMUTABLE)

### **API → Agent Endpoints**

```
POST /config
  ├─ Payload: {
  │    game: "minecraft",
  │    variant: "paper",
  │    version: "1.21.3",
  │    ports: [25565, 25575],
  │    memory: "4G",
  │    motd: "Welcome to ZeroLagHub",
  │    worldSettings: {...}
  │  }
  └─ Response: 200 OK

POST /start
  └─ Triggers: ensureProvisioned() → StartServer()

POST /stop
  └─ Triggers: Graceful shutdown via server stop command

POST /restart
  └─ Triggers: stop → start sequence

GET /status
  └─ Response: {
       state: "RUNNING" | "INSTALLING" | "FAILED",
       pid: 12345,
       uptime: 3600,
       lastError: null
     }

GET /health
  └─ Response: {
       healthy: true,
       java: "/usr/lib/jvm/java-21-openjdk-amd64",
       variant: "paper",
       ready: true
     }
```
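
The `POST /config` payload above could be typed and checked on the API side before it is sent. The following is a minimal sketch, not the project's actual code: the field names come from the payload listing, while `validateAgentConfig`, the optional markers on `motd` and `worldSettings`, and every validation rule are assumptions made for illustration.

```typescript
// Shape of the POST /config payload as listed in the contract.
// Treating motd and worldSettings as optional is an assumption.
interface AgentConfig {
  game: string;
  variant: string;
  version: string;
  ports: number[];
  memory: string;
  motd?: string;
  worldSettings?: Record<string, unknown>;
}

// Hypothetical pre-flight check; returns a list of problems (empty when valid).
function validateAgentConfig(cfg: AgentConfig): string[] {
  const errors: string[] = [];
  if (cfg.game.trim() === "") errors.push("game is required");
  if (cfg.variant.trim() === "") errors.push("variant is required");
  // Assumed rule: dotted numeric versions such as 1.21.3.
  if (!/^\d+(\.\d+)*$/.test(cfg.version)) {
    errors.push("version must look like 1.21.3");
  }
  if (cfg.ports.length === 0) errors.push("at least one port is required");
  for (const port of cfg.ports) {
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      errors.push(`invalid port: ${port}`);
    }
  }
  // Assumed rule: memory is a number followed by M or G, as in "4G".
  if (!/^\d+[MG]$/.test(cfg.memory)) {
    errors.push("memory must look like 4G or 512M");
  }
  return errors;
}
```

Rejecting a malformed payload in the API keeps the agent free of orchestration concerns: it only ever receives a config that the API has already vetted.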

### **Agent → API Signals (via /status polling)**

```
State Machine:
  INSTALLING
    ├─> DOWNLOADING_ARTIFACT
    ├─> INSTALLING_JAVA
    ├─> FINALIZING
    ├─> RUNNING
    └─> FAILED | CRASHED
```
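
One way the API could sanity-check the states it observes while polling `/status` is a transition table. This is a sketch under a stated assumption: the diagram above is read as a linear install sequence in which any install step may fail and a running server may crash. The table and the helper names (`NEXT_STATES`, `canTransition`, `firstIllegalStep`) are illustrative, not part of the documented contract.

```typescript
type AgentState =
  | "INSTALLING"
  | "DOWNLOADING_ARTIFACT"
  | "INSTALLING_JAVA"
  | "FINALIZING"
  | "RUNNING"
  | "FAILED"
  | "CRASHED";

// Assumed transition table: one reading of the diagram, not a confirmed spec.
// FAILED and CRASHED are treated as terminal for a single provisioning run.
const NEXT_STATES: Record<AgentState, AgentState[]> = {
  INSTALLING: ["DOWNLOADING_ARTIFACT", "FAILED"],
  DOWNLOADING_ARTIFACT: ["INSTALLING_JAVA", "FAILED"],
  INSTALLING_JAVA: ["FINALIZING", "FAILED"],
  FINALIZING: ["RUNNING", "FAILED"],
  RUNNING: ["CRASHED"],
  FAILED: [],
  CRASHED: [],
};

function canTransition(from: AgentState, to: AgentState): boolean {
  return NEXT_STATES[from].includes(to);
}

// Returns the first illegal [previous, current] pair in a polled sequence,
// or null when every change is allowed. Repeated states are ignored because
// polling often sees the same state several times in a row.
function firstIllegalStep(
  observed: AgentState[],
): [AgentState, AgentState] | null {
  for (let i = 1; i < observed.length; i++) {
    const prev = observed[i - 1];
    const curr = observed[i];
    if (prev !== curr && !canTransition(prev, curr)) return [prev, curr];
  }
  return null;
}
```

An unexpected jump (for example straight from `DOWNLOADING_ARTIFACT` to `RUNNING`) would then surface as a contract mismatch instead of passing silently.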

### **🚫 Forbidden Agent Behaviors**

Agent **NEVER**:
- ❌ Allocates or manages ports
- ❌ Talks to Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Modifies LXC config
- ❌ Allocates VMIDs
- ❌ Manages templates
- ❌ Talks to Cloudflare or Technitium

**Violation Detection**: If agent code imports Proxmox client, DNS client, or port allocation logic → **DRIFT VIOLATION**

---

## 🖥️ Frontend ↔ API Contract (IMMUTABLE)

### **Frontend Allowed Endpoints**

```
Provisioning:
  POST /api/containers/create
  GET /api/containers/:vmid
  DELETE /api/containers/:vmid

Control:
  POST /api/containers/:vmid/start
  POST /api/containers/:vmid/stop
  POST /api/containers/:vmid/restart

Status:
  GET /api/containers/:vmid/status
  GET /api/containers/:vmid/logs
  GET /api/containers/:vmid/stats

Discovery:
  GET /api/templates
  GET /api/containers (list user's containers)

Read-Only Info:
  GET /api/dns/:hostname (display only)
  GET /api/velocity/status (display only)
```

### **🚫 Forbidden Frontend Behaviors**

Frontend **NEVER**:
- ❌ Talks directly to Go Agent
- ❌ Calls Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Allocates ports
- ❌ Executes container commands
- ❌ Accesses MariaDB directly

**Violation Detection**: If frontend code imports Proxmox client, agent HTTP client (except via API), or database client → **DRIFT VIOLATION**
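
Both violation-detection rules (agent and frontend) describe a check on imports, which could be approximated by a small source scanner run in CI. The sketch below is hypothetical: the pattern lists are guesses at what forbidden package names might look like, since the real module names are not given in this document, and the line matcher is a heuristic rather than a parser.

```typescript
type System = "agent" | "frontend";

interface DriftViolation {
  line: number; // 1-based line number in the scanned source
  text: string;
}

// Hypothetical patterns: the real package and module names are not given here.
const FORBIDDEN_IMPORTS: Record<System, RegExp[]> = {
  agent: [/proxmox/i, /cloudflare/i, /technitium/i, /velocity/i, /portpool/i],
  frontend: [/proxmox/i, /agent-client/i, /mariadb/i, /mysql/i, /prisma/i],
};

// Matches `import ...` statements (Go or TypeScript) and bare quoted paths,
// which is how entries inside a Go `import ( ... )` block appear.
const IMPORT_LINE = /^\s*(import\b|(\w+\s+)?"[^"]+"\s*$)/;

function findDriftViolations(system: System, source: string): DriftViolation[] {
  const violations: DriftViolation[] = [];
  source.split("\n").forEach((text, index) => {
    if (!IMPORT_LINE.test(text)) return;
    if (FORBIDDEN_IMPORTS[system].some((pattern) => pattern.test(text))) {
      violations.push({ line: index + 1, text: text.trim() });
    }
  });
  return violations;
}
```

A text scan like this can produce false positives (a quoted string on its own line is treated as an import), so it is better suited to raising a DRIFT WARNING for review than to blocking a build outright.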

---

## 🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)

### **Rule #1: Provisioning Ownership**

**Violation**: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs

**Correct Path**: API owns ALL provisioning orchestration

**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
    // ... port allocation logic
}
```

**Correct Pattern**:
```javascript
// RIGHT - API allocates, agent receives
// vmid is allocated by the API earlier in the provisioning flow
async function provisionInstance(vmid, game, variant) {
  const ports = await portAllocator.allocate(vmid);
  await agent.postConfig({ ports, game, variant });
}
```

---

### **Rule #2: Artifact Ownership**

**Violation**: API asked to install Java, download server.jar, run installers

**Correct Path**: Agent owns ALL in-container installation

**Example Violation**:
```javascript
// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
  await proxmox.exec(vmid, `wget https://...`);
  await proxmox.exec(vmid, `java -jar installer.jar`);
}
```

**Correct Pattern**:
```go
// RIGHT - Agent handles installation
// The three helpers are assumed to return an error.
func (p *Provisioner) ProvisionAll(cfg Config) error {
    if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
        return err
    }
    if err := installJavaRuntime(cfg.Version); err != nil {
        return err
    }
    return verifyInstallation()
}
```

---

### **Rule #3: Direct Container Access**

**Violation**: Frontend or API tries to exec commands directly inside the container

**Correct Path**: Agent owns container execution layer

**Example Violation**:
```typescript
// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
  const ssh = new SSHClient();
  await ssh.connect(containerIP);
  await ssh.exec('systemctl restart minecraft');
}
```

**Correct Pattern**:
```typescript
// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
  await api.post(`/containers/${vmid}/restart`);
}
```

---

### **Rule #4: Networking Responsibilities**

**Violation**: Agent asked to select public vs internal IPs, decide DNS zones

**Correct Path**: API owns dual-IP logic (Cloudflare external, Technitium internal)

**Example Violation**:
```go
// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
    if a.needsCloudflare() {
        return "139.64.165.248"
    }
    return a.containerIP
}
```

**Correct Pattern**:
```javascript
// RIGHT - API decides network topology
function determineIPs(vmid, game) {
  const internalIP = `10.200.0.${vmid - 1000}`;
  const externalIP = "139.64.165.248"; // Cloudflare target
  const velocityIP = "10.70.0.241"; // Internal routing

  return { internalIP, externalIP, velocityIP };
}
```

---

### **Rule #5: Proxy Responsibilities**

**Violation**: Agent asked to register with Velocity, configure proxy routing

**Correct Path**: API owns ALL proxy integrations

**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
    client := velocity.NewClient()
    return client.Register(a.hostname, a.port)
}
```

**Correct Pattern**:
```javascript
// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
  await velocityBridge.registerBackend({
    name: hostname,
    address: internalIP,
    port: 25565
  });
}
```

---

## 🔄 Context Switching Safety Workflow

### **When Moving to Node.js API Work:**

**Pre-Switch Checklist**:
- [ ] Agent contract unchanged? (POST /config, /start, /stop, GET /status)
- [ ] Database schema unchanged? (Prisma models consistent)
- [ ] LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
- [ ] DNS/IP logic consistent? (Cloudflare external, Technitium internal)
- [ ] Port allocation logic preserved? (PortPool DB-backed)

**Common Drift Patterns**:
- ⚠️ Adding agent installation logic to API
- ⚠️ Changing agent contract without updating both sides
- ⚠️ Moving DNS logic to different service
- ⚠️ Bypassing job queue for provisioning

---

### **When Moving to Go Agent Work:**

**Pre-Switch Checklist**:
- [ ] No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
- [ ] File paths remain canonical (`/opt/zlh/<game>/<variant>/world`)
- [ ] Naming conventions maintained (`server.jar`, `fabric-server.jar`, etc.)
- [ ] No external API calls (Proxmox, DNS, Velocity)
- [ ] Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)

**Common Drift Patterns**:
- ⚠️ Adding port allocation to agent
- ⚠️ Making agent talk to external services
- ⚠️ Changing directory structure without API coordination
- ⚠️ Adding orchestration logic to agent

---

### **When Moving to Frontend Work:**

**Pre-Switch Checklist**:
- [ ] Only API-approved fields used in UI
- [ ] No direct agent HTTP calls
- [ ] VMID not exposed in user-facing UI
- [ ] Internal IPs not displayed to users
- [ ] All state from API, not computed locally

**Common Drift Patterns**:
- ⚠️ Adding direct agent calls from frontend
- ⚠️ Computing server state client-side
- ⚠️ Exposing internal infrastructure details
- ⚠️ Bypassing API for container control
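
The two frontend checklist items about VMIDs and internal IPs could be backed by a display model that simply omits those fields, plus a guard usable in UI tests. This is an illustrative sketch: `ContainerRecord`, `ContainerView`, and their field names are hypothetical, since the real API response shape is not shown in this document. The VMID can still be held in routing state for API calls; the point is only that it is not rendered.

```typescript
// Hypothetical shape of a container as the API might return it.
interface ContainerRecord {
  vmid: number;
  name: string;
  hostname: string;
  internalIP: string;
  state: string;
  ports: number[];
}

// What the UI is allowed to render: no VMID and no internal IP.
interface ContainerView {
  name: string;
  hostname: string;
  state: string;
  ports: number[];
}

function toContainerView(record: ContainerRecord): ContainerView {
  return {
    name: record.name,
    hostname: record.hostname,
    state: record.state,
    ports: [...record.ports],
  };
}

// Test guard: flags RFC 1918 private addresses that leak into rendered text.
function containsPrivateIP(text: string): boolean {
  const privateIP =
    /\b(10\.\d{1,3}\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3}|172\.(1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3})\b/;
  return privateIP.test(text);
}
```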

---

## 🚨 High-Risk Integration Zones (GUARDED)

These areas have historically caused drift across sessions:

### **1. Forge / NeoForge Installation Logic**
- **Risk**: Agent vs API confusion on who handles `run.sh` patching
- **Guard**: Agent owns ALL Forge installation, API just passes config
- **Test**: Can provision Forge 1.21.3 without API filesystem access?

### **2. Cloudflare SRV Deletion**
- **Risk**: Case sensitivity, subdomain normalization, record ID tracking
- **Guard**: API stores Cloudflare record IDs in EdgeState, deletes by ID
- **Test**: Create → Delete → Recreate same hostname without orphans?
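
The guard for Cloudflare SRV deletion (track record IDs, delete by ID, normalize hostnames) can be sketched with an in-memory stand-in for EdgeState. Everything here is an assumption for illustration: the real EdgeState schema is not shown in this document, `EdgeStateStore` and its methods are hypothetical, and no Cloudflare API call is made.

```typescript
// Normalizes a hostname so lookups are case-insensitive and ignore a trailing dot.
function normalizeHostname(hostname: string): string {
  return hostname.trim().toLowerCase().replace(/\.$/, "");
}

// In-memory stand-in for the EdgeState table (hypothetical shape).
class EdgeStateStore {
  private recordIds = new Map<string, string[]>();

  // Remembers a record ID returned when the DNS record was created.
  track(hostname: string, recordId: string): void {
    const key = normalizeHostname(hostname);
    const ids = this.recordIds.get(key) ?? [];
    ids.push(recordId);
    this.recordIds.set(key, ids);
  }

  // Returns the IDs to delete and forgets them, so a recreate starts clean.
  takeForDeletion(hostname: string): string[] {
    const key = normalizeHostname(hostname);
    const ids = this.recordIds.get(key) ?? [];
    this.recordIds.delete(key);
    return ids;
  }
}
```

Because deletion works from stored IDs under a normalized key, a hostname created as `MyServer.example.com.` and deleted as `myserver.example.com` resolves to the same records, which is the orphan case the test line above describes.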

### **3. Technitium DNS Zone Mismatch**
- **Risk**: Wrong zone selection, duplicate records
- **Guard**: API hardcodes zone as `zpack.zerolaghub.com`, validates before creation
- **Test**: No records created in wrong zones?
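
The "validates before creation" guard might look like the sketch below. The zone name comes from the guard itself; the function names and the duplicate check against a caller-supplied list are assumptions, and no Technitium API call is shown.

```typescript
const INTERNAL_ZONE = "zpack.zerolaghub.com"; // zone named in the guard

function normalizeName(fqdn: string): string {
  return fqdn.trim().toLowerCase().replace(/\.$/, "");
}

// True only for the zone apex or a name that ends on a label boundary.
function isInZone(fqdn: string, zone: string = INTERNAL_ZONE): boolean {
  const name = normalizeName(fqdn);
  return name === zone || name.endsWith("." + zone);
}

// Returns a refusal reason, or null when the record may be created.
function checkInternalRecord(fqdn: string, existing: string[]): string | null {
  if (!isInZone(fqdn)) {
    return `refusing ${fqdn}: outside zone ${INTERNAL_ZONE}`;
  }
  const name = normalizeName(fqdn);
  if (existing.map(normalizeName).includes(name)) {
    return `refusing ${fqdn}: duplicate record`;
  }
  return null;
}
```

Matching on `"." + zone` rather than a plain suffix avoids accepting look-alike names such as `evilzpack.zerolaghub.com`.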

### **4. Velocity Registration Order**
- **Risk**: Registering before server ready, deregistering incorrectly
- **Guard**: API waits for agent RUNNING state, then registers Velocity
- **Test**: Player connection works immediately after provisioning complete?
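
The ordering guard (wait for RUNNING, then register) can be expressed as a small gate. This sketch is deliberately synchronous and dependency-injected so it can be tested without a network: `getAgentState` stands in for polling `GET /status`, `registerBackend` stands in for the Velocity registration call, and both names are hypothetical. A real implementation would presumably be asynchronous with a delay between polls.

```typescript
type AgentState = "INSTALLING" | "RUNNING" | "FAILED";

interface RegistrationDeps {
  getAgentState: () => AgentState; // stand-in for polling GET /status
  registerBackend: () => void; // stand-in for the Velocity registration call
}

type RegistrationResult = "registered" | "failed" | "timeout";

// Polls up to maxPolls times and registers only after RUNNING is observed.
function registerWhenRunning(
  deps: RegistrationDeps,
  maxPolls: number,
): RegistrationResult {
  for (let i = 0; i < maxPolls; i++) {
    const state = deps.getAgentState();
    if (state === "RUNNING") {
      deps.registerBackend();
      return "registered";
    }
    if (state === "FAILED") return "failed";
  }
  return "timeout";
}
```

Returning a distinct `timeout` result leaves the retry decision with the job queue instead of registering a backend that may never come up.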

### **5. PortPool Commit Logic**
- **Risk**: Race conditions, double-allocation, uncommitted ports
- **Guard**: API allocates → provisions → commits (rollback on failure)
- **Test**: Concurrent provisions don't collide on ports?
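
The allocate → provision → commit sequence with rollback can be sketched against an in-memory pool. This is not the real PortPool: the class, its three statuses, and `provisionWithPorts` are assumptions for illustration, and a DB-backed version would need a transaction or row lock to get the same guarantee across processes.

```typescript
type PortStatus = "free" | "reserved" | "committed";

// In-memory stand-in for the DB-backed PortPool (hypothetical shape).
class PortPool {
  private status = new Map<number, PortStatus>();

  constructor(ports: number[]) {
    for (const port of ports) this.status.set(port, "free");
  }

  // Reserves `count` free ports, or throws without reserving any.
  allocate(count: number): number[] {
    const free = Array.from(this.status.entries())
      .filter(([, status]) => status === "free")
      .map(([port]) => port);
    if (free.length < count) throw new Error("port pool exhausted");
    const picked = free.slice(0, count);
    for (const port of picked) this.status.set(port, "reserved");
    return picked;
  }

  commit(ports: number[]): void {
    for (const port of ports) this.status.set(port, "committed");
  }

  release(ports: number[]): void {
    for (const port of ports) this.status.set(port, "free");
  }

  statusOf(port: number): PortStatus | undefined {
    return this.status.get(port);
  }
}

// allocate -> provision -> commit, releasing the reservation if provisioning throws.
function provisionWithPorts(
  pool: PortPool,
  count: number,
  provision: (ports: number[]) => void,
): number[] {
  const ports = pool.allocate(count);
  try {
    provision(ports);
  } catch (err) {
    pool.release(ports);
    throw err;
  }
  pool.commit(ports);
  return ports;
}
```

Marking ports as reserved at allocation time is what prevents double-allocation: a second provision that starts before the first commits cannot pick the same ports.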

### **6. Agent READY Detection**
- **Risk**: False negatives, false positives, variant-specific patterns
- **Guard**: Agent uses variant-aware log parsing, multiple confirmation lines
- **Test**: All 6 variants correctly detect READY state?

### **7. Server Start-Up False Negatives**
- **Risk**: Timeout too short, log parsing too strict
- **Guard**: Agent increases timeout for Forge (90s), multiple log patterns
- **Test**: Forge installer completes without false failure?

### **8. IP Selection Logic**
- **Risk**: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
- **Guard**: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
- **Test**: DNS points to correct IPs, Velocity routes to correct internal IP?

---

## 📋 Architecture Decision Log (LOCKED) ⭐ NEW

**Purpose**: Records finalized architectural decisions that **must not be re-litigated** unless explicitly requested.

**Status**: LOCKED - These decisions are final and cannot be changed unless the user explicitly says "Revisit decision X"

---

### **DEC-001: Templates vs Go Agent (FINAL)**

**Decision**: Hybrid model
- LXC templates define **base environment only**
- Go Agent is **authoritative execution layer** inside containers

**Rationale**:
- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
- Agent enables self-repair, async provisioning, runtime control
- Hybrid provides speed + flexibility without API container access

**Applies To**: API v2, Go Agent, Frontend
**Status**: ✅ LOCKED

---

### **DEC-002: Provisioning Authority**

**Decision**: API orchestrates, Agent executes

**Rationale**:
- API has global visibility (DB, DNS, Proxmox, Velocity)
- Agent is intentionally sandboxed to container filesystem + process

**Applies To**: All systems
**Status**: ✅ LOCKED

---

### **DEC-003: DNS & Edge Publishing Ownership**

**Decision**: API-only responsibility

**Rationale**:
- Requires external credentials (Cloudflare, Technitium)
- Must correlate DB state, record IDs, reconciliation jobs

**Applies To**: API v2
**Status**: ✅ LOCKED

---

### **DEC-004: Proxy Stack**

**Decision**: Traefik + Velocity only

**Rationale**:
- Traefik for HTTP/control-plane
- Velocity for Minecraft TCP routing
- **HAProxy explicitly deprecated** for ZeroLagHub

**Applies To**: Infrastructure, API
**Status**: ✅ LOCKED

---

### **DEC-005: State Persistence**

**Decision**: MariaDB is single source of truth

**Rationale**:
- Flat files caused race conditions and drift
- DB enables reconciliation, recovery, observability

**Applies To**: API v2
**Status**: ✅ LOCKED

---

### **DEC-006: Frontend Access Model**

**Decision**: Frontend communicates with API only

**Rationale**:
- Security boundary
- Prevents leaking infrastructure details

**Applies To**: Frontend
**Status**: ✅ LOCKED

---

### **DEC-007: Architecture Enforcement Policy**

**Decision**: Drift prevention is mandatory

**Rationale**:
- Prevents oscillation between alternatives
- Preserves velocity during late-stage development

**Applies To**: All work sessions
**Status**: ✅ LOCKED

---

## 📋 Canonical Architecture Anchors (ABSOLUTE)

These rules are **immutable** unless explicitly changed with full system review:

### **Anchor #1: Orchestration**
✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)

### **Anchor #2: Container-Internal**
✅ Agent performs EVERYTHING inside container (install, start, stop, detect)

### **Anchor #3: Templates**
✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)

### **Anchor #4: Job Queue**
✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)

### **Anchor #5: Database**
✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)

### **Anchor #6: Infrastructure**
✅ Proxmox API (not Ansible) for LXC management

### **Anchor #7: Routing**
✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems

### **Anchor #8: DNS**
✅ Cloudflare = authoritative public DNS
✅ Technitium = authoritative internal DNS

### **Anchor #9: Frontend Isolation**
✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)

### **Anchor #10: Directory Structure**
✅ `/opt/zlh/<game>/<variant>/world` is canonical game server path

---

## 🛡️ Enforcement Policies (ACTIVE)

When future instructions conflict with Canonical Architecture or Architecture Decision Log:

### **Step 1: STOP**
Immediately halt the task. Do not proceed with drift-inducing change.

### **Step 2: RAISE DRIFT WARNING**
```
⚠️ DRIFT WARNING ⚠️

Proposed change violates Canonical Architecture:

Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]

Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]

Impact: [What breaks if this drift is allowed]

Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
```

### **Step 3: PROVIDE CORRECT PATH**
Show the architecture-aligned implementation that achieves the same goal.

### **Step 4: ASK FOR DIRECTION**
```
Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?
```

### **Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE**
If violation is against a **LOCKED** Architecture Decision (DEC-001 through DEC-007):

**ADDITIONAL CHECK**:
```
⚠️ LOCKED DECISION WARNING ⚠️

This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED

This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"

Without explicit user request to revisit, this decision is FINAL.

Proceeding with this change would violate architectural governance.
```

**Never proceed** with drift-inducing changes without explicit confirmation.
**Never re-litigate** locked decisions without user explicitly requesting revision.

---

## 📊 Drift Detection Examples

### **Example 1: Agent Port Allocation (VIOLATION)**

**Proposed Change**:
```go
// Agent code proposal
func (a *Agent) allocatePort() int {
    // Find available port...
    return port
}
```

**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #1 - Provisioning Ownership
Violation: Agent attempting to allocate ports
Correct Path: API allocates ports, agent receives them via /config

Impact: Port collisions, database inconsistency, broken PortPool

Recommendation: Remove port allocation from agent, ensure API sends ports in config payload
```

---

### **Example 2: Frontend Direct Agent Call (VIOLATION)**

**Proposed Change**:
```typescript
// Frontend code proposal
async function getServerLogs(vmid: number) {
  const agentURL = `http://10.200.0.${vmid}:8080/logs`;
  return await fetch(agentURL);
}
```

**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #3 - Direct Container Access
Violation: Frontend bypassing API to talk to agent
Correct Path: Frontend → API → Agent (API proxies logs)

Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk

Recommendation: Add GET /api/containers/:vmid/logs endpoint that proxies to agent
```

---

### **Example 3: API Installing Java (VIOLATION)**

**Proposed Change**:
```javascript
// API code proposal
async function provisionServer(vmid, game, variant) {
  await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
  await proxmox.exec(vmid, 'wget https://papermc.io/...');
}
```

**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #2 - Artifact Ownership
Violation: API performing in-container installation
Correct Path: API sends config to agent, agent handles installation

Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system

Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config
```

---

## 📁 Integration with Existing Documentation

### **Relationship to Master Bootstrap**
- Master Bootstrap: Strategic overview and business model
- **This Document**: Technical governance and boundary enforcement
- **Usage**: Consult this before implementing ANY code changes

### **Relationship to Complete Current State**
- Complete Current State: What's working, what's next
- **This Document**: How things MUST work (regardless of current state)
- **Usage**: This is the "law", current state is "status"

### **Relationship to Engineering Handover**
- Engineering Handover: Daily tactical tasks and sprint plan
- **This Document**: Constraints within which tasks must be implemented
- **Usage**: Check this before starting each handover task

---

## 🔄 Architecture Amendment Process

If there is a legitimate need to change the Canonical Architecture:

### **Step 1: Identify Change**
Document exactly what architectural boundary needs to change and why.

### **Step 2: Full System Impact Analysis**
- What breaks in API?
- What breaks in Agent?
- What breaks in Frontend?
- What changes to contracts?
- What database migrations are needed?

### **Step 3: Update ALL Affected Documents**
- This document (Canonical Architecture)
- Master Bootstrap (if strategic impact)
- Complete Current State (implementation changes)
- Engineering Handover (sprint tasks)
- Agent Spec, Operational Guide (if affected)

### **Step 4: Update ALL Systems**
- API code + tests
- Agent code + tests
- Frontend code + tests
- Database schema (migration)
- Infrastructure config

### **Step 5: Validation**
- Integration tests pass?
- No new drift introduced?
- Documentation consistent?
- All AIs briefed on change?

**Only after ALL steps** is the architecture amendment complete.

---

## 🎯 Quick Reference Card

### **API Owns**
- ✅ Provisioning orchestration
- ✅ Port allocation (PortPool)
- ✅ DNS (Cloudflare + Technitium)
- ✅ Velocity registration
- ✅ IP logic (external + internal)
- ✅ Job queue (BullMQ)
- ✅ Database state

### **Agent Owns**
- ✅ Container-internal installation
- ✅ Java runtime
- ✅ Artifact downloads
- ✅ Server process management
- ✅ READY detection
- ✅ Self-repair + verification
- ✅ Filesystem layout

### **Frontend Owns**
- ✅ User interaction
- ✅ Display logic
- ✅ Client state
- ✅ Form validation

### **Never**
- ❌ Agent allocates ports
- ❌ Agent talks to DNS/Velocity/Proxmox
- ❌ API installs server files
- ❌ API executes in-container commands
- ❌ Frontend talks to agent directly
- ❌ Frontend talks to infrastructure

---

## 📋 Session Start Checklist

Before every coding session with any AI:

- [ ] Read this document's Quick Reference Card
- [ ] Identify which system you're working on (API, Agent, Frontend)
- [ ] Review that system's "Owns" list
- [ ] Check High-Risk Integration Zones if touching those areas
- [ ] Verify no drift from previous session
- [ ] Confirm contracts unchanged since last session

**If ANY doubt**: Re-read full Canonical Architecture Anchors section.

---

## ✅ Document Status

**Status**: ACTIVE - Must be consulted before all code changes
**Enforcement**: MANDATORY - Drift violations must be caught
**Authority**: CANONICAL - Overrides conflicting guidance
**Updates**: Only via Architecture Amendment Process

---

🛡️ **This document prevents architectural drift. Violate at your own risk.**
105
ZeroLagHub_Drift_Prevention_Card.md
Normal file
@ -0,0 +1,105 @@
|
|||||||
|
# 🛡️ ZeroLagHub Drift Prevention - Quick Start Card
|
||||||
|
|
||||||
|
**Use This**: At the start of EVERY coding session (API, Agent, or Frontend)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ⚡ 30-Second Architecture Check
|
||||||
|
|
||||||
|
### **Working on API?**
|
||||||
|
✅ Can: Allocate ports and VMIDs, publish DNS, register with Velocity
|
||||||
|
❌ Cannot: Install Java, download artifacts, exec in container
|
||||||
|
|
||||||
|
### **Working on Agent?**
|
||||||
|
✅ Can install: Java, artifacts, server files
|
||||||
|
❌ Cannot: Allocate ports, create DNS, call Proxmox, register Velocity
|
||||||
|
|
||||||
|
### **Working on Frontend?**
|
||||||
|
✅ Can: Display data, call API endpoints
|
||||||
|
❌ Cannot: Talk to Agent, Proxmox, DNS, Velocity, allocate anything
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔒 7 Locked Architectural Decisions (NEW) ⭐
|
||||||
|
|
||||||
|
**These decisions are FINAL** - cannot be changed without user saying "Revisit decision X":
|
||||||
|
|
||||||
|
1. **DEC-001**: Templates + Agent hybrid (not templates-only or agent-only)
|
||||||
|
2. **DEC-002**: API orchestrates, Agent executes (not reversed)
|
||||||
|
3. **DEC-003**: API owns DNS (Agent never creates DNS)
|
||||||
|
4. **DEC-004**: Traefik + Velocity only (no HAProxy)
|
||||||
|
5. **DEC-005**: MariaDB is source of truth (no flat files)
|
||||||
|
6. **DEC-006**: Frontend → API only (no direct agent calls)
|
||||||
|
7. **DEC-007**: Drift prevention mandatory (always enforced)
|
||||||
|
|
||||||
|
**If proposing change that conflicts with DEC-001 through DEC-007**:
|
||||||
|
→ STOP → Consult Cross-Project Tracker → Request "Revisit decision X" from user
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚨 Drift Detection Triggers
|
||||||
|
|
||||||
|
**STOP and consult full tracker if you hear:**
|
||||||
|
|
||||||
|
1. "Agent should allocate ports..."
|
||||||
|
2. "API should install Java inside container..."
|
||||||
|
3. "Frontend should call agent directly..."
|
||||||
|
4. "Agent should register with Velocity..."
|
||||||
|
5. "API should decide what Java version..."
|
||||||
|
6. "Frontend should manage DNS records..."
|
||||||
|
7. "Let's use templates only..." (violates DEC-001)
|
||||||
|
8. "Let's add HAProxy..." (violates DEC-004)
|
||||||
|
9. "Let's use flat files instead of DB..." (violates DEC-005)
|
||||||
|
|
||||||
|
**All of these are VIOLATIONS** → Consult [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📋 Pre-Coding Checklist
|
||||||
|
|
||||||
|
Before writing ANY code:
|
||||||
|
|
||||||
|
- [ ] Which system am I modifying? (API / Agent / Frontend)
|
||||||
|
- [ ] Does this change cross boundaries? (If yes → read full tracker)
|
||||||
|
- [ ] Am I adding external API calls to Agent? (If yes → VIOLATION)
|
||||||
|
- [ ] Am I adding container execution to API? (If yes → VIOLATION)
|
||||||
|
- [ ] Am I bypassing API in Frontend? (If yes → VIOLATION)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 The Golden Rules
|
||||||
|
|
||||||
|
1. **API orchestrates** (allocates resources, publishes state)
|
||||||
|
2. **Agent executes** (installs, runs, monitors inside container)
|
||||||
|
3. **Frontend displays** (no direct infrastructure access)
|
||||||
|
|
||||||
|
**Anything else** → Drift → Consult full tracker
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📞 Quick Reference
|
||||||
|
|
||||||
|
**Full Tracker**: [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
|
||||||
|
|
||||||
|
**High-Risk Zones**:
|
||||||
|
- Forge/NeoForge installation
|
||||||
|
- Cloudflare SRV deletion
|
||||||
|
- Velocity registration order
|
||||||
|
- PortPool commit logic
|
||||||
|
- Agent READY detection
|
||||||
|
- IP selection logic
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ✅ Session Start Command
|
||||||
|
|
||||||
|
```
Read ZeroLagHub Cross-Project Tracker Quick Start Card.
Activate drift detection.
Confirm which system I'm working on: [API / Agent / Frontend]
Proceed with architecture-aligned implementation only.
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
🛡️ **Drift prevention ACTIVE. Proceed with confidence.**
|
||||||
573
ZeroLagHub_GPT_Implementation_Handover_Dec2025.md
Normal file
@ -0,0 +1,573 @@
|
|||||||
|
# 🚀 ZeroLagHub - GPT Implementation Handover (December 2025)
|
||||||
|
|
||||||
|
**Last Updated**: December 7, 2025
|
||||||
|
**Version**: 4.0 (Launch-Ready Implementation Guide)
|
||||||
|
**Status**: 85% Platform Complete - Active Development Sprint
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Your Role (GPT - Implementation AI)
|
||||||
|
|
||||||
|
**You are**: The tactical implementation AI responsible for **building features**
|
||||||
|
**Claude is**: The strategic architecture AI responsible for **design decisions**
|
||||||
|
|
||||||
|
### **Your Responsibilities**
|
||||||
|
✅ Implement features within architectural boundaries
|
||||||
|
✅ Write code for API, Agent, and Frontend
|
||||||
|
✅ Fix bugs and optimize performance
|
||||||
|
✅ Execute sprint tasks from Kanban board
|
||||||
|
✅ Consult Cross-Project Tracker before crossing system boundaries
|
||||||
|
|
||||||
|
### **Your Constraints**
|
||||||
|
❌ Do NOT make architectural decisions without Claude
|
||||||
|
❌ Do NOT violate ownership boundaries (API vs Agent vs Frontend)
|
||||||
|
❌ Do NOT change contracts without updating both sides
|
||||||
|
❌ Do NOT skip drift prevention checks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📋 Critical Documents (READ THESE FIRST)
|
||||||
|
|
||||||
|
### **🛡️ MANDATORY Before ANY Code**
|
||||||
|
1. **[Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md)** (30 seconds)
|
||||||
|
- Quick boundary check for every session
|
||||||
|
- Violation triggers to watch for
|
||||||
|
|
||||||
|
2. **[Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)** (consult before changes)
|
||||||
|
- Ownership matrix (API vs Agent vs Frontend)
|
||||||
|
- Canonical contracts (API ↔ Agent, Frontend ↔ API)
|
||||||
|
- Drift detection rules with examples
|
||||||
|
|
||||||
|
### **📊 Implementation Context**
|
||||||
|
3. **[Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md)** (5 minutes)
|
||||||
|
- Engineering Kanban (DONE/IN PROGRESS/TODO)
|
||||||
|
- 3-day sprint plan with tasks
|
||||||
|
- Troubleshooting guides per variant
|
||||||
|
- Launch readiness matrix
|
||||||
|
|
||||||
|
4. **Engineering Handover** (from today's uploaded document)
|
||||||
|
- System lifecycle diagram
|
||||||
|
- Provisioning sequence
|
||||||
|
- Verification system specs
|
||||||
|
- Today's accomplishments
|
||||||
|
|
||||||
|
### **🔧 Technical Reference**
|
||||||
|
5. **[Agent Complete Spec](computer:///home/claude/ZeroLagHub_Agent_Complete_Spec.md)** (as needed)
|
||||||
|
- Go agent implementation details
|
||||||
|
- API endpoints and contracts
|
||||||
|
|
||||||
|
6. **[Infrastructure Specs](computer:///mnt/user-data/outputs/ZeroLagHub_Infrastructure_Specifications.md)** (as needed)
|
||||||
|
- GTHost hardware constraints
|
||||||
|
- Capacity planning
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Current Platform Status (December 7, 2025)
|
||||||
|
|
||||||
|
### **What's Working** ✅ (85% Complete)
|
||||||
|
|
||||||
|
**Core Provisioning Pipeline**:
|
||||||
|
- ✅ All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge)
|
||||||
|
- ✅ VMID allocation (sequential)
|
||||||
|
- ✅ LXC container creation (template VMID 800)
|
||||||
|
- ✅ IP detection (10.200.0.X)
|
||||||
|
- ✅ Go agent deployment + self-repair
|
||||||
|
- ✅ Java runtime auto-selection (17/21)
|
||||||
|
- ✅ DNS automation (Cloudflare + Technitium)
|
||||||
|
- ✅ Velocity proxy registration
|
||||||
|
- ✅ Start/stop/restart control
|
||||||
|
- ✅ Console command injection
|
||||||
|
- ✅ Log tailing (HTTP polling)
|
||||||
|
- ✅ Crash detection
|
||||||
|
|
||||||
|
**Supported Minecraft Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
|
||||||
|
|
||||||
|
### **What's Missing** ❌ (15% To-Do)
|
||||||
|
|
||||||
|
**Critical for Launch** (7-9 hours):
|
||||||
|
- ❌ WebSocket console streaming (4-6 hours) - **HIGH PRIORITY**
|
||||||
|
- ❌ Crash loop protection with backoff (2 hours) - **HIGH PRIORITY**
|
||||||
|
- ❌ Disk space monitoring (1 hour) - **HIGH PRIORITY**
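
As a rough illustration of what "crash loop protection with backoff" could look like, here is a minimal sketch. It assumes exponential backoff with a cap and a crash-count limit inside a time window; the function names, the 1 s base delay, the 60 s cap, the 5-minute window, and the limit of 5 crashes are all placeholder assumptions, not decisions recorded in this document.

```javascript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
// All default values are illustrative placeholders.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Decide what to do after a crash, given the timestamps (in ms) of earlier
// crashes. Too many crashes inside the window means: stop restarting.
function nextRestartAction(crashTimes, now, { windowMs = 300000, maxCrashes = 5 } = {}) {
  const recent = crashTimes.filter((t) => now - t <= windowMs);
  if (recent.length >= maxCrashes) {
    return { action: "give-up", delayMs: null };
  }
  return { action: "restart", delayMs: backoffDelayMs(recent.length) };
}
```

The delay grows with the number of recent crashes, so a server that fails immediately on every start is retried less and less often, then left stopped once the limit is reached.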
|
||||||
|
|
||||||
|
**Dev Platform** (3-day sprint):
|
||||||
|
- 🔧 Dev container provisioning (Day 1 of sprint)
|
||||||
|
- 🔧 EdgeState schema migration (Day 2 of sprint)
|
||||||
|
- 🔧 Reconciliation job (Day 3 of sprint)
|
||||||
|
|
||||||
|
**Nice-to-Have** (future):
|
||||||
|
- 📋 File upload/download
|
||||||
|
- 📋 Backup/restore UI
|
||||||
|
- 📋 Resource monitoring dashboard
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🗺️ System Architecture (Know Your Boundaries)
|
||||||
|
|
||||||
|
### **Three-System Ownership**
|
||||||
|
|
||||||
|
```
┌─────────────────────────────────────────┐
│  NODE.JS API (Orchestrator)             │
│  You Own: Routes, services, job queue   │
│  Speaks To: Proxmox, DNS, Velocity, DB  │
│  Never: Installs Java, downloads files  │
└────────────┬────────────────────────────┘
             │ HTTP Contract
             │ POST /config, /start, /stop
             │ GET /status, /health
             ▼
┌─────────────────────────────────────────┐
│  GO AGENT (Container Manager)           │
│  You Own: Installation, verification    │
│  Speaks To: Filesystem, game process    │
│  Never: Allocates ports, creates DNS    │
└────────────┬────────────────────────────┘
             │ Status Polling (via API)
             ▼
┌─────────────────────────────────────────┐
│  NEXT.JS FRONTEND (UI Only)             │
│  You Own: Components, client state      │
│  Speaks To: API only                    │
│  Never: Agent, Proxmox, DNS, Velocity   │
└─────────────────────────────────────────┘
```
|
||||||
|
|
||||||
|
### **Critical Boundaries (NEVER CROSS)**
|
||||||
|
|
||||||
|
**API Must NOT**:
|
||||||
|
- ❌ Install Java inside containers
|
||||||
|
- ❌ Download game server files
|
||||||
|
- ❌ Execute commands directly in containers (use Agent)
|
||||||
|
|
||||||
|
**Agent Must NOT**:
|
||||||
|
- ❌ Allocate ports
|
||||||
|
- ❌ Create DNS records
|
||||||
|
- ❌ Register with Velocity
|
||||||
|
- ❌ Talk to Proxmox API
|
||||||
|
- ❌ Manage VMIDs
|
||||||
|
|
||||||
|
**Frontend Must NOT**:
|
||||||
|
- ❌ Talk directly to Agent
|
||||||
|
- ❌ Call Proxmox API
|
||||||
|
- ❌ Create DNS records
|
||||||
|
- ❌ Bypass API for any infrastructure
|
||||||
|
|
||||||
|
**Violation = STOP → Consult Cross-Project Tracker**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📅 3-Day Sprint Plan (Your Tasks)
|
||||||
|
|
||||||
|
### **Day 1: Dev Containers** (December 8)
|
||||||
|
|
||||||
|
**Goal**: Enable developer environment provisioning
|
||||||
|
|
||||||
|
**Tasks**:
|
||||||
|
1. [ ] Define dev container spec (Python, Node, Go, Java)
|
||||||
|
2. [ ] Create template VMID 6000 (base dev environment)
|
||||||
|
3. [ ] API: Add `/api/dev-instances` endpoints (create, delete, status)
|
||||||
|
4. [ ] Agent: Add dev provisioning flow (no game server start)
|
||||||
|
5. [ ] Test: Provision Python + Node dev environments
|
||||||
|
|
||||||
|
**Success Criteria**:
|
||||||
|
- Can provision dev environment with chosen language
|
||||||
|
- Dev container accessible via SSH or web console
|
||||||
|
- No game server logic triggered
|
||||||
|
|
||||||
|
**Files to Modify**:
|
||||||
|
- `src/routes/devInstances.js` (new)
|
||||||
|
- `src/services/devProvisioner.js` (new)
|
||||||
|
- Agent: `dev.go` (new provisioning flow)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Day 2: EdgeState + DNS Reliability** (December 9)
|
||||||
|
|
||||||
|
**Goal**: Fix Cloudflare SRV deletion problem
|
||||||
|
|
||||||
|
**Tasks**:
|
||||||
|
1. [ ] Implement EdgeState model in Prisma schema
|
||||||
|
2. [ ] Update `edgePublisher.js` to store Cloudflare record IDs
|
||||||
|
3. [ ] Update `dePublisher.js` to delete by record ID (not hostname)
|
||||||
|
4. [ ] Test: Create → Delete → Recreate same hostname
|
||||||
|
|
||||||
|
**Success Criteria**:
|
||||||
|
- No orphaned DNS records after deletion
|
||||||
|
- EdgeState tracks all Cloudflare record IDs
|
||||||
|
- Re-provisioning same hostname works
|
||||||
|
|
||||||
|
**Files to Modify**:
|
||||||
|
- `prisma/schema.prisma` (add EdgeState model)
|
||||||
|
- `src/services/edgePublisher.js`
|
||||||
|
- `src/services/dePublisher.js`
|
||||||
|
- `src/clients/cloudflareClient.js` (return record IDs)
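
The Day 2 goal (store record IDs on publish, delete by ID on cleanup) can be sketched as follows. This is a self-contained illustration under stated assumptions: `FakeCloudflare`, `publishEdge`, `unpublishEdge`, the `cloudflareRecordIds` field, and the choice of one A record plus one `_minecraft._tcp` SRV record are hypothetical. The real implementation would use the Prisma `EdgeState` model and the actual Cloudflare client.

```javascript
// In-memory stand-in for the Cloudflare client (hypothetical shape).
class FakeCloudflare {
  constructor() {
    this.records = new Map();
    this.nextId = 1;
  }
  // Returns the new record's ID so the caller can persist it.
  createRecord(type, name) {
    const id = `rec-${this.nextId++}`;
    this.records.set(id, { type, name });
    return id;
  }
  // Deletes by ID only; returns true if a record was removed.
  deleteRecord(id) {
    return this.records.delete(id);
  }
}

// Publish: create the records and remember every returned ID per VMID.
// `edgeState` is a Map standing in for the EdgeState table.
function publishEdge(cf, edgeState, vmid, hostname) {
  const ids = [
    cf.createRecord("A", hostname),
    cf.createRecord("SRV", `_minecraft._tcp.${hostname}`),
  ];
  edgeState.set(vmid, { hostname, cloudflareRecordIds: ids });
  return ids;
}

// Unpublish: delete strictly by stored ID, never by hostname inference.
function unpublishEdge(cf, edgeState, vmid) {
  const state = edgeState.get(vmid);
  if (!state) return 0;
  let deleted = 0;
  for (const id of state.cloudflareRecordIds) {
    if (cf.deleteRecord(id)) deleted++;
  }
  edgeState.delete(vmid);
  return deleted;
}
```

Because deletion never searches by hostname, the create → delete → recreate test from the task list leaves exactly the records of the latest publish and nothing else.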
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Day 3: Reconciliation + Hardening** (December 10)
|
||||||
|
|
||||||
|
**Goal**: Self-healing infrastructure
|
||||||
|
|
||||||
|
**Tasks**:
|
||||||
|
1. [ ] Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity)
|
||||||
|
2. [ ] Detect orphaned containers (in Proxmox but not DB)
|
||||||
|
3. [ ] Detect orphaned DNS records (in DNS but not DB)
|
||||||
|
4. [ ] Auto-cleanup with confirmation prompt
|
||||||
|
5. [ ] Regression test suite
|
||||||
|
|
||||||
|
**Success Criteria**:
|
||||||
|
- Reconciliation job detects all orphans
|
||||||
|
- Can auto-clean with user confirmation
|
||||||
|
- System recovers from partial failures
|
||||||
|
|
||||||
|
**Files to Create or Rewrite**:
|
||||||
|
- `src/jobs/reconciliationJob.js`
|
||||||
|
- `src/services/reconciler.js` (rewrite)
|
||||||
|
- `tests/reconciliation.test.js`
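
Orphan detection in tasks 2 and 3 is essentially a set difference between what the infrastructure reports and what the database knows. A minimal sketch, assuming the inputs have already been fetched into plain arrays (the `findOrphans` name and its input field names are hypothetical):

```javascript
// Returns the items present in `actual` that are missing from `expected`.
function difference(actual, expected) {
  const known = new Set(expected);
  return actual.filter((item) => !known.has(item));
}

// Compares database state against what Proxmox and DNS actually contain.
// Input field names are illustrative, not taken from the real reconciler.
function findOrphans({ dbVmids, proxmoxVmids, dbHostnames, dnsHostnames }) {
  return {
    // In Proxmox but not in the DB
    orphanedContainers: difference(proxmoxVmids, dbVmids),
    // In DNS but not in the DB
    orphanedDnsRecords: difference(dnsHostnames, dbHostnames),
  };
}
```

The function only reports orphans. Keeping detection separate from deletion leaves room for the confirmation step called for in task 4.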
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🐛 Known Bugs (Fix These)
|
||||||
|
|
||||||
|
### **Go Agent Bugs** (Non-Blocking, But Should Fix)
|
||||||
|
|
||||||
|
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)

   ```go
   // REMOVE THIS - Forge doesn't create server.jar
   serverJarPath := filepath.Join(installDir, "*server.jar")
   ```

   **Fix**: Remove the glob/rename logic entirely.

2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)

   ```go
   // ADD ELSE HERE
   if variant == "forge" || variant == "neoforge" {
   	// Forge logic
   } else { // <-- ADD THIS
   	// Vanilla-like logic
   }
   ```

   **Fix**: Add the missing `else` so Forge/NeoForge provisioning does not fall through into the vanilla-like logic.

3. **Forge Stop Command Exclusion** (`process.go` line 83)

   ```go
   // REMOVE THIS EXCLUSION - Forge accepts stop commands
   if p.variant != "forge" && p.variant != "neoforge" {
   	p.sendCommand("stop")
   }
   ```

   **Fix**: Remove the `if` condition and send `stop` to all variants.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚨 High-Risk Integration Zones (Careful!)
|
||||||
|
|
||||||
|
These areas have caused drift in past sessions:
|
||||||
|
|
||||||
|
1. **Forge/NeoForge Installation**
|
||||||
|
- Agent owns ALL installation logic
|
||||||
|
- API only passes config, never executes install commands
|
||||||
|
|
||||||
|
2. **Cloudflare SRV Deletion**
|
||||||
|
- Must use record IDs (not hostname inference)
|
||||||
|
- Store IDs in EdgeState on creation
|
||||||
|
|
||||||
|
3. **Velocity Registration Order**
|
||||||
|
- Wait for Agent to report RUNNING state
|
||||||
|
- Then register with Velocity (not before)
|
||||||
|
|
||||||
|
4. **Port Allocation**
|
||||||
|
- API allocates → provisions → commits
|
||||||
|
- Rollback on failure (don't leak ports)
|
||||||
|
|
||||||
|
5. **Agent READY Detection**
|
||||||
|
- Use variant-aware log parsing
|
||||||
|
- Forge takes 60-90s (don't timeout early)
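
The "allocate → provision → commit" rule from item 4, including rollback on failure, can be sketched like this. It is an in-memory illustration only: this `PortPool` class and the `provisionWithPort` helper are hypothetical and say nothing about how the real PortPool persists its state.

```javascript
// Hypothetical in-memory port pool with three states: free, reserved, committed.
class PortPool {
  constructor(ports) {
    this.free = new Set(ports);
    this.reserved = new Set();
    this.committed = new Set();
  }
  // Reserve the first free port; it is not yet permanently assigned.
  allocate() {
    const port = this.free.values().next().value;
    if (port === undefined) throw new Error("no free ports");
    this.free.delete(port);
    this.reserved.add(port);
    return port;
  }
  // Make a reserved port permanent after provisioning succeeded.
  commit(port) {
    if (!this.reserved.delete(port)) throw new Error(`port ${port} is not reserved`);
    this.committed.add(port);
  }
  // Return a reserved port to the free set (rollback).
  release(port) {
    if (this.reserved.delete(port)) this.free.add(port);
  }
}

// allocate -> provision -> commit; release the reservation on any failure.
function provisionWithPort(pool, provision) {
  const port = pool.allocate();
  try {
    provision(port);
    pool.commit(port);
    return port;
  } catch (err) {
    pool.release(port); // rollback so the port is not leaked
    throw err;
  }
}
```

The `try`/`catch` around the provisioning callback is what keeps a failed provision from leaving a port stuck in the reserved state.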
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📁 Key File Locations
|
||||||
|
|
||||||
|
### **API Service** (`/home/zlh/zlh-api-v2/`)
|
||||||
|
```
src/
├── routes/
│   ├── containers.js          # Game server endpoints
│   └── devInstances.js        # Dev environment endpoints (TO CREATE)
├── services/
│   ├── edgePublisher.js       # DNS + Velocity publishing
│   ├── dePublisher.js         # Edge cleanup (NEEDS REWRITE)
│   ├── portAllocator.js       # Port management
│   └── reconciler.js          # Orphan detection (NEEDS REWRITE)
├── clients/
│   ├── cloudflareClient.js    # Cloudflare API (UPDATE for record IDs)
│   ├── technitiumClient.js    # Technitium DNS
│   └── proxmoxClient.js       # Proxmox API
└── jobs/
    └── reconciliationJob.js   # Self-healing job (TO CREATE)
```
|
||||||
|
|
||||||
|
### **Go Agent** (`/opt/zlh-agent/`)
|
||||||
|
```
├── agent.go        # Main provisioning logic (FIX fallthrough)
├── artifacts.go    # Download + verification (REMOVE Forge glob)
├── process.go      # Server lifecycle (FIX Forge stop exclusion)
├── api.go          # HTTP server for control
└── dev.go          # Dev environment provisioning (TO CREATE)
```
|
||||||
|
|
||||||
|
### **Frontend** (`/home/zlh/zlh-portal/`)
|
||||||
|
```
src/
├── app/
│   ├── containers/     # Game server UI
│   └── dev/            # Dev environment UI (TO CREATE)
└── components/
    └── Console.tsx     # WebSocket console (TO CREATE)
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎮 Minecraft Variant Status
|
||||||
|
|
||||||
|
| Variant  | Install | Verify | Start | READY Detection | Status                |
|----------|---------|--------|-------|-----------------|-----------------------|
| Vanilla  | ✅      | ✅     | ✅    | ✅              | Production            |
| Paper    | ✅      | ✅     | ✅    | ✅              | Production            |
| Purpur   | ✅      | ✅     | ✅    | ✅              | Production            |
| Fabric   | ✅      | ✅     | ✅    | ✅              | Production            |
| Forge    | ✅      | ✅     | ✅    | ✅              | Production (has bugs) |
| NeoForge | ✅      | ✅     | ✅    | ✅              | Production            |
|
||||||
|
|
||||||
|
**All variants work** - 3 non-blocking bugs should be fixed for code quality.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔄 Provisioning Flow (Know This)
|
||||||
|
|
||||||
|
```
1. User creates server via Frontend
   ↓
2. Frontend → API: POST /api/containers/create
   ↓
3. API allocates VMID, ports (if needed)
   ↓
4. API clones LXC from template VMID 800
   ↓
5. API configures container (IP, resources)
   ↓
6. API starts LXC
   ↓
7. API detects container IP (10.200.0.X)
   ↓
8. API → Agent: POST /config (payload)
   ↓
9. Agent saves payload.json
   ↓
10. Agent spawns install goroutine (async)
    ├─ Download Java
    ├─ Download game artifacts
    ├─ Verify installation
    └─ Self-repair if needed
    ↓
11. Agent starts server
    ↓
12. Agent detects READY (log parsing)
    ↓
13. Agent sets state = RUNNING
    ↓
14. API polls /status until RUNNING
    ↓
15. API saves to database
    ↓
16. API publishes DNS (Cloudflare + Technitium)
    ↓
17. API registers with Velocity
    ↓
18. API returns SUCCESS to user
    ↓
COMPLETE ✅
```
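
Step 14 of the flow above (the API polling `/status` until the Agent reports RUNNING) gates steps 15-17. A minimal sketch of that wait loop follows. It assumes an injected `getStatus` function that resolves to an object with a `state` field; `waitForRunning`, the 180 s timeout, and the 2 s interval are hypothetical placeholders rather than the API's actual values.

```javascript
// Polls getStatus() until it reports RUNNING or the timeout elapses.
// `sleep` and `now` are injectable so the loop can be tested without waiting.
async function waitForRunning(getStatus, {
  timeoutMs = 180000, // placeholder; should comfortably exceed slow Forge starts
  intervalMs = 2000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  now = () => Date.now(),
} = {}) {
  const deadline = now() + timeoutMs;
  while (now() <= deadline) {
    const { state } = await getStatus();
    if (state === "RUNNING") return true;
    await sleep(intervalMs);
  }
  return false; // caller should treat this as a failed provision
}
```

In this sketch the caller would save to the database, publish DNS, and register with Velocity only after `waitForRunning` resolves to `true`, which matches the registration-order rule in the High-Risk Integration Zones.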
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🛡️ Drift Prevention (ACTIVE)
|
||||||
|
|
||||||
|
### **Before Writing ANY Code**
|
||||||
|
|
||||||
|
**Ask yourself**:
|
||||||
|
1. Which system am I modifying? (API / Agent / Frontend)
|
||||||
|
2. Does this cross boundaries? (If yes → read Cross-Project Tracker)
|
||||||
|
3. Am I adding external calls to Agent? (If yes → VIOLATION)
|
||||||
|
4. Am I adding container execution to API? (If yes → VIOLATION)
|
||||||
|
5. Am I bypassing API in Frontend? (If yes → VIOLATION)
|
||||||
|
|
||||||
|
**If ANY doubt** → Stop and consult [Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
|
||||||
|
|
||||||
|
### **Common Violations to Avoid**
|
||||||
|
|
||||||
|
❌ Agent allocating ports → API owns this
|
||||||
|
❌ API installing Java → Agent owns this
|
||||||
|
❌ Frontend calling Agent directly → Must go through API
|
||||||
|
❌ Agent creating DNS → API owns this
|
||||||
|
❌ API deciding Java version → Agent owns this (version-aware)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Success Metrics (How You'll Be Measured)
|
||||||
|
|
||||||
|
### **Sprint Completion**
|
||||||
|
- [ ] All 3 days of sprint tasks completed
|
||||||
|
- [ ] Dev containers operational
|
||||||
|
- [ ] EdgeState tracking DNS record IDs
|
||||||
|
- [ ] Reconciliation job working
|
||||||
|
|
||||||
|
### **Code Quality**
|
||||||
|
- [ ] No architectural violations (follow Cross-Project Tracker)
|
||||||
|
- [ ] All 3 Go agent bugs fixed
|
||||||
|
- [ ] Tests passing
|
||||||
|
- [ ] No new drift introduced
|
||||||
|
|
||||||
|
### **Platform Readiness**
|
||||||
|
- [ ] 95% launch-ready after sprint
|
||||||
|
- [ ] All MC variants still working
|
||||||
|
- [ ] No regressions from changes
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🧪 Testing Requirements
|
||||||
|
|
||||||
|
### **Before Committing Code**
|
||||||
|
|
||||||
|
**Test Each Variant**:
|
||||||
|
```bash
# Request shorthand: METHOD path {body}

# Test provisioning (all six variants)
POST /api/containers/create {variant: "vanilla"}
POST /api/containers/create {variant: "paper"}
POST /api/containers/create {variant: "purpur"}
POST /api/containers/create {variant: "fabric"}
POST /api/containers/create {variant: "forge"}
POST /api/containers/create {variant: "neoforge"}

# Verify RUNNING state
GET /api/containers/:vmid/status
# Should return: {state: "RUNNING"}

# Test control
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/start
POST /api/containers/:vmid/restart

# Test cleanup
DELETE /api/containers/:vmid
# Verify no orphaned DNS records
```
|
||||||
|
|
||||||
|
**Test Dev Containers**:
|
||||||
|
```bash
POST /api/dev-instances/create {language: "python"}
POST /api/dev-instances/create {language: "node"}

# Verify accessible
ssh into dev container
# Should have language toolchain installed
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Infrastructure Constraints (Know Your Limits)
|
||||||
|
|
||||||
|
**Hardware** (GTHost Dedicated):
|
||||||
|
- CPU: 12 cores / 24 threads (Intel Xeon Silver 4116)
|
||||||
|
- RAM: 192 GB (can run 30-50 simultaneous 4GB servers)
|
||||||
|
- Storage: 1.8 TB free (300-500 servers capacity)
|
||||||
|
- Network: 300 Mbit/s (30-60 concurrent players)
|
||||||
|
|
||||||
|
**Current Allocation**:
|
||||||
|
- 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk
|
||||||
|
- Available for servers: 128 GB RAM, 1.8 TB disk
|
||||||
|
|
||||||
|
**Don't exceed these limits** - check before provisioning.
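
The "check before provisioning" rule amounts to simple arithmetic on the figures above. A minimal sketch, using the 128 GB available-RAM figure and the 4 GB server size quoted in this section (the helper names are hypothetical):

```javascript
// How many servers of a given size fit into the available RAM.
function maxServersByRam(availableGb, perServerGb) {
  return Math.floor(availableGb / perServerGb);
}

// True if a new server of requestGb still fits under the RAM limit.
function canProvision({ availableGb, usedGb, requestGb }) {
  return usedGb + requestGb <= availableGb;
}
```

`maxServersByRam(128, 4)` gives 32, which sits at the lower end of the 30-50 range quoted above.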
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚀 Launch Decision (Context)
|
||||||
|
|
||||||
|
**Option A**: Launch NOW (85% ready, soft beta only)
|
||||||
|
**Option B**: +3 Days Sprint (95% ready, RECOMMENDED)
|
||||||
|
**Option C**: +1 Week (98% ready, over-engineering)
|
||||||
|
|
||||||
|
**Your role**: Execute Option B sprint (complete 3 days of tasks)
|
||||||
|
|
||||||
|
**After sprint**: Platform ready for professional launch
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📞 Session Continuity
|
||||||
|
|
||||||
|
### **Starting Fresh Session (You)**
|
||||||
|
|
||||||
|
```
1. Read Drift Prevention Card (30 seconds)
   └─ Activate boundary awareness

2. Read Complete Current State (5 minutes)
   └─ Get Kanban state + sprint tasks

3. Consult Cross-Project Tracker before code
   └─ Verify no boundary violations

4. Execute sprint tasks with constraints active

5. Test all variants before committing
```
|
||||||
|
|
||||||
|
### **Handoff to Claude (Architecture Questions)**
|
||||||
|
|
||||||
|
If you encounter:
|
||||||
|
- Architectural decisions (e.g., should we change contracts?)
|
||||||
|
- Strategic questions (e.g., which features to prioritize?)
|
||||||
|
- Business model questions
|
||||||
|
- Major design changes
|
||||||
|
|
||||||
|
**STOP and escalate to Claude** for architectural guidance.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ✅ Quick Reference
|
||||||
|
|
||||||
|
### **Your Mission**
|
||||||
|
Execute 3-day sprint → Deliver 95% launch-ready platform
|
||||||
|
|
||||||
|
### **Your Boundaries**
|
||||||
|
API orchestrates | Agent installs | Frontend displays
|
||||||
|
|
||||||
|
### **Your Critical Docs**
|
||||||
|
1. Drift Prevention Card (session start)
|
||||||
|
2. Cross-Project Tracker (before code)
|
||||||
|
3. Complete Current State (sprint tasks)
|
||||||
|
4. Engineering Handover (technical details)
|
||||||
|
|
||||||
|
### **Your Success**
|
||||||
|
- [ ] 3-day sprint complete
|
||||||
|
- [ ] No architectural violations
|
||||||
|
- [ ] All variants still working
|
||||||
|
- [ ] Tests passing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Start Here (First Actions)
|
||||||
|
|
||||||
|
**Right Now**:
|
||||||
|
1. ✅ Read [Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md) (30 sec)
|
||||||
|
2. ✅ Read [Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md) (5 min)
|
||||||
|
3. ✅ Check Engineering Kanban for current TODO
|
||||||
|
4. ✅ Begin Day 1 sprint: Dev containers
|
||||||
|
|
||||||
|
**Remember**:
|
||||||
|
- 🛡️ Drift prevention ACTIVE
|
||||||
|
- 📋 Consult tracker before crossing boundaries
|
||||||
|
- ✅ Test all variants before committing
|
||||||
|
- 🚀 Goal: 95% launch-ready after 3 days
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Status**: You have everything you need to execute the sprint. Let's build! 🚀
|
||||||
465
ZeroLagHub_Infrastructure_Specifications.md
Normal file
@ -0,0 +1,465 @@
|
|||||||
|
# 🏗️ ZeroLagHub Infrastructure - Complete Specifications
|
||||||
|
|
||||||
|
**Last Updated**: December 7, 2025
|
||||||
|
**Provider**: GTHost
|
||||||
|
**Cost**: $109/month
|
||||||
|
**Status**: Production Infrastructure
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🖥️ Dedicated Server Hardware
|
||||||
|
|
||||||
|
### **Server Platform**
|
||||||
|
**Model**: Supermicro 2029TP-HC1R (hot-swap HDD chassis)
|
||||||
|
**Form Factor**: Enterprise rackmount server
|
||||||
|
|
||||||
|
### **CPU**
|
||||||
|
**Model**: Intel Xeon Silver 4116
|
||||||
|
**Cores**: 12 cores / 24 threads
|
||||||
|
**Clock Speed**: 2.1 GHz base, 3.0 GHz turbo
|
||||||
|
**Architecture**: Intel Skylake-SP (1st Gen Xeon Scalable)
|
||||||
|
**TDP**: 85W
|
||||||
|
**Cache**: 16.5 MB L3
|
||||||
|
|
||||||
|
**Performance**:
|
||||||
|
- Single-threaded: Excellent for game servers
|
||||||
|
- Multi-threaded: Handles 11 VMs + 75-100 containers
|
||||||
|
- Turbo boost ensures responsive provisioning
|
||||||
|
|
||||||
|
### **Memory (RAM)**
|
||||||
|
**Configuration**: 6 x 32GB DIMMs
|
||||||
|
**Total**: 192 GB
|
||||||
|
**Type**: Hynix DDR4 RDIMM (Registered)
|
||||||
|
**Speed**: 2400 MT/s (DDR4-2400)
|
||||||
|
**ECC**: Yes (Error-Correcting Code)
|
||||||
|
|
||||||
|
**Benefits**:
|
||||||
|
- ECC protects against memory corruption
|
||||||
|
- Registered DIMMs = enterprise reliability
|
||||||
|
- 192GB = ample headroom for VM overhead + game servers
|
||||||
|
|
||||||
|
### **Storage**
|
||||||
|
**Configuration**: 2 x 1.92TB SSD
|
||||||
|
**Model**: Samsung PM863 (Enterprise SSD)
|
||||||
|
**Total Capacity**: 3.84 TB raw
|
||||||
|
**Available**: ~1.8 TB (after Proxmox + VMs + overhead)
|
||||||
|
**Interface**: SATA (likely in RAID configuration)
|
||||||
|
|
||||||
|
**Samsung PM863 Specs**:
|
||||||
|
- Enterprise-grade datacenter SSD
|
||||||
|
- Optimized for mixed workload
|
||||||
|
- Power Loss Protection (PLP)
|
||||||
|
- High endurance rating
|
||||||
|
|
||||||
|
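**RAID Configuration** (Unconfirmed):

- RAID 1 or a ZFS mirror would provide redundancy but leave only ~1.92 TB usable
- The ~3.8 TB usable figure in the partition layout implies striping or no redundancy; confirm the actual layout on the host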
|
||||||
|
|
||||||
|
### **Network**
|
||||||
|
**Bandwidth**: 300 Mbit/s (37.5 MB/s)
|
||||||
|
**Metering**: Unmetered (no bandwidth caps)
|
||||||
|
**Uplink**: Enterprise datacenter connection
|
||||||
|
|
||||||
|
**Performance Assessment**:
|
||||||
|
- 300 Mbit/s = 30-40 concurrent Minecraft players comfortably
|
||||||
|
- Unmetered = no surprise overage charges
|
||||||
|
- Sufficient for soft launch, may need upgrade at scale
|
||||||
|
|
||||||
|
### **Operating System**
|
||||||
|
**OS**: Proxmox VE 8 (64-bit)
|
||||||
|
**Base**: Debian 12 "Bookworm"
|
||||||
|
**Kernel**: Linux 6.2+
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Partition Layout
|
||||||
|
|
||||||
|
### **Disk Partitions**
|
||||||
|
```
/boot:  1024 MB (1 GB)  - Boot partition
/swap:  2048 MB (2 GB)  - Swap space
/root:  Auto size       - Main Proxmox system + VM storage
                          (~3.8 TB usable after RAID)
```
|
||||||
|
|
||||||
|
**Storage Allocation** (Estimated):
|
||||||
|
```
Total:            3.84 TB raw (2 x 1.92TB)
RAID overhead:   -0.04 TB (metadata, alignment)
                 --------
Available:        3.80 TB

Proxmox OS:      -0.10 TB (Proxmox + system)
VM Disks:        -1.80 TB (11 VMs, templates)
LXC Containers:  -0.10 TB (current containers)
                 --------
Free Space:       1.80 TB (for game servers + growth)
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Performance Characteristics
|
||||||
|
|
||||||
|
### **CPU Capabilities**
|
||||||
|
|
||||||
|
**Per-Core Performance**: ⭐⭐⭐⭐ (Excellent for game servers)
|
||||||
|
- Minecraft servers are largely single-threaded (the main game loop runs on one thread)
|
||||||
|
- 3.0 GHz turbo provides snappy performance
|
||||||
|
- 12 cores = 12 simultaneous high-performance game servers
|
||||||
|
|
||||||
|
**Multi-Core Throughput**: ⭐⭐⭐⭐⭐ (Excellent for hosting)
|
||||||
|
- 24 threads handle VM overhead efficiently
|
||||||
|
- Can run 11 Proxmox VMs + 75-100 LXC containers
|
||||||
|
- Provisioning operations don't impact running servers
|
||||||
|
|
||||||
|
**Virtualization**: ⭐⭐⭐⭐⭐ (Native support)
|
||||||
|
- Intel VT-x + VT-d enabled
|
||||||
|
- Hardware-accelerated virtualization
|
||||||
|
- LXC containers = near-native performance
|
||||||
|
|
||||||
|
### **Memory Capabilities**
|
||||||
|
|
||||||
|
**Capacity**: 192 GB = **Excellent** for scale
|
||||||
|
- Average Minecraft server: 2-4 GB
|
||||||
|
- 192 GB / 4 GB = **48 simultaneous 4GB servers** (raw ceiling, before VM and hypervisor overhead)
|
||||||
|
- Or 96 lightweight 2GB servers
|
||||||
|
|
||||||
|
**Speed**: DDR4-2400 (2400 MT/s) = **Good** (not bleeding edge, but sufficient)
|
||||||
|
- DDR4-2400 provides adequate bandwidth for game hosting
|
||||||
|
- ECC ensures data integrity under load
|
||||||
|
|
||||||
|
**Reliability**: ECC RDIMM = **Enterprise-grade**
|
||||||
|
- Detects and corrects memory errors
|
||||||
|
- Critical for 24/7 uptime
|
||||||
|
|
||||||
|
### **Storage Capabilities**
|
||||||
|
|
||||||
|
**Capacity**: 3.84 TB = **Very Good** for initial scale
|
||||||
|
- Each Minecraft server: 1-5 GB (world size varies)
|
||||||
|
- Can host 300-500 Minecraft servers comfortably
|
||||||
|
- 1.8 TB free = room for significant growth
|
||||||
|
|
||||||
|
**Performance**: Samsung PM863 = **Excellent** for workload
|
||||||
|
- Random IOPS: ~10,000 read, ~2,000 write
|
||||||
|
- Sequential: 520 MB/s read, 485 MB/s write
|
||||||
|
- Perfect for database + game world I/O
|
||||||
|
|
||||||
|
**Reliability**: Enterprise SSD = **Excellent**
|
||||||
|
- Power Loss Protection prevents corruption
|
||||||
|
- Rated for 1.3 PB writes (years of 24/7 operation)
|
||||||
|
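- RAID 1 (likely) provides redundancy; note that mirroring two 1.92 TB drives yields ~1.92 TB usable, so the ~3.8 TB usable figure should be verified against the actual RAID level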
|
||||||
|
|
||||||
|
### **Network Capabilities**
|
||||||
|
|
||||||
|
**Bandwidth**: 300 Mbit/s = **Adequate for soft launch**
|
||||||
|
- Minecraft player: ~0.5-1 Mbit/s
|
||||||
|
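- 300 Mbit/s = 30-60 players (conservative estimate that budgets 5-10 Mbit/s per player, about 10x the steady-state rate above; raw throughput alone would allow 300-600)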
|
||||||
|
- Unmetered = no bandwidth overage charges
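The player estimate can be reproduced with a short calculation. This is a minimal sketch: `max_players` and its `headroom` parameter are hypothetical names, and the 10x headroom factor is an assumption chosen to match the 30-60 figure, not a measured value.

```python
def max_players(link_mbit: float, per_player_mbit: float, headroom: float = 10.0) -> int:
    """Estimate how many concurrent players a link can carry.

    headroom is an assumed safety factor on top of the steady-state
    per-player rate; 10x reproduces the planning range in this document.
    """
    return int(link_mbit / (per_player_mbit * headroom))


low = max_players(300, 1.0)    # pessimistic: 1 Mbit/s per player
high = max_players(300, 0.5)   # optimistic: 0.5 Mbit/s per player
print(f"300 Mbit/s supports roughly {low}-{high} players")
```

The same function applied to a 1000 Mbit/s link gives the 100-200 range quoted for the 1 Gbps upgrade.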
|
||||||
|
|
||||||
|
**Upgrade Path**: GTHost likely offers 1 Gbps upgrades
|
||||||
|
- 1 Gbps would support 100-200 players
|
||||||
|
- Consider upgrade when approaching 40+ concurrent players
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🏗️ Current Resource Allocation
|
||||||
|
|
||||||
|
### **VM Resource Breakdown** (11 VMs)
|
||||||
|
|
||||||
|
```
|
||||||
|
Hypervisor Overhead: ~8 GB RAM, 2 CPU cores
|
||||||
|
|
||||||
|
Critical Production:
|
||||||
|
├─ VM 100 (zlh-panel): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
├─ VM 103 (zlh-api): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
└─ VM 101 (zlh-wings): 8 GB RAM, 4 cores, 64 GB disk
|
||||||
|
|
||||||
|
Platform Services:
|
||||||
|
├─ VM 102 (zlh-portal): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
├─ VM 104 (zlh-monitor): 8 GB RAM, 2 cores, 64 GB disk
|
||||||
|
└─ VM 1002 (zlh-proxy): 2 GB RAM, 1 core, 16 GB disk
|
||||||
|
|
||||||
|
Network Layer:
|
||||||
|
├─ VM 1000 (zlh-router): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
├─ VM 1006 (zpack-router): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
└─ VM 1001 (zlh-dns): 2 GB RAM, 1 core, 16 GB disk
|
||||||
|
|
||||||
|
Development/Support:
|
||||||
|
├─ VM 300 (zlh-panel-dev): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
└─ VM 2000 (zlh-ci): 4 GB RAM, 2 cores, 32 GB disk
|
||||||
|
|
||||||
|
Backup:
|
||||||
|
└─ VM [zlh-back]: 8 GB RAM, 2 cores, 128 GB disk
|
||||||
|
|
||||||
|
TOTAL VM ALLOCATION: 56 GB RAM, 24 cores, ~512 GB disk
|
||||||
|
```
|
||||||
|
|
||||||
|
### **Available for Game Servers**
|
||||||
|
|
||||||
|
```
|
||||||
|
RAM Available: 192 GB - 56 GB (VMs) - 8 GB (overhead) = 128 GB
|
||||||
|
CPU Available: 24 threads - 24 (VM allocation) = 0 (shared)
|
||||||
|
Disk Available: 1.8 TB free
|
||||||
|
|
||||||
|
Game Server Capacity (Conservative):
|
||||||
|
├─ 2GB servers: 64 simultaneous servers
|
||||||
|
├─ 4GB servers: 32 simultaneous servers
|
||||||
|
└─ 8GB servers: 16 simultaneous servers
|
||||||
|
|
||||||
|
Developer Environment Capacity:
|
||||||
|
├─ 2GB dev envs: 64 simultaneous environments
|
||||||
|
└─ 4GB dev envs: 32 simultaneous environments
|
||||||
|
```
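The capacity figures above follow from integer division of the remaining RAM. A minimal sketch, with constant and function names chosen for illustration only:

```python
TOTAL_RAM_GB = 192
VM_RAM_GB = 56          # sum of the VM allocations listed above
HYPERVISOR_RAM_GB = 8   # Proxmox overhead


def server_capacity(per_server_gb: int) -> int:
    """Number of servers of the given size that fit in the remaining RAM."""
    available = TOTAL_RAM_GB - VM_RAM_GB - HYPERVISOR_RAM_GB
    return available // per_server_gb


capacity = {size: server_capacity(size) for size in (2, 4, 8)}
print(capacity)
```

This only models RAM; CPU is shared and oversubscribed, so real limits should come from load testing.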
|
||||||
|
|
||||||
|
**Note**: CPU is oversubscribed (common in hosting) since most game servers idle at <20% CPU usage. Turbo boost ensures good single-thread performance when needed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📈 Capacity & Scaling Projections
|
||||||
|
|
||||||
|
### **Current Capacity** (As Deployed)
|
||||||
|
|
||||||
|
**Game Servers**: 30-50 active servers with current VM allocation
|
||||||
|
**Developer Environments**: 75-100 environments (documented capacity)
|
||||||
|
**Concurrent Players**: 30-60 players (network limited)
|
||||||
|
|
||||||
|
### **Optimized Capacity** (With Tuning)
|
||||||
|
|
||||||
|
**Game Servers**: 60-80 active servers (after VM consolidation)
|
||||||
|
**Developer Environments**: 100-150 environments
|
||||||
|
**Concurrent Players**: Still 30-60 (network bottleneck)
|
||||||
|
|
||||||
|
### **Maximum Theoretical Capacity**
|
||||||
|
|
||||||
|
**Game Servers**: 128 lightweight servers (if only game hosting, no dev)
|
||||||
|
**Developer Environments**: 192 environments (if only dev, no games)
|
||||||
|
**Storage**: 300-500 servers before storage exhaustion
|
||||||
|
|
||||||
|
**Limiting Factors**:
|
||||||
|
1. **Network** (300 Mbit/s) - limits concurrent players
|
||||||
|
2. **RAM** (192 GB) - limits concurrent heavy servers
|
||||||
|
3. **Storage** (1.8 TB free) - limits total servers
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 💰 Cost Analysis
|
||||||
|
|
||||||
|
### **Current Infrastructure Cost**
|
||||||
|
|
||||||
|
**Monthly**: $109 GTHost dedicated server
|
||||||
|
**Annually**: $1,308
|
||||||
|
|
||||||
|
**Cost per Resource**:
|
||||||
|
- Per GB RAM: $0.57/month ($109 ÷ 192 GB)
|
||||||
|
- Per CPU core: $9.08/month ($109 ÷ 12 cores)
|
||||||
|
- Per TB storage: $28.39/month ($109 ÷ 3.84 TB)
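The per-resource costs are the monthly price divided by each quantity. A small sketch that re-derives them (names are illustrative):

```python
MONTHLY_COST_USD = 109


def unit_cost(quantity: float) -> float:
    """Monthly cost per unit of a resource, rounded to cents."""
    return round(MONTHLY_COST_USD / quantity, 2)


per_gb_ram = unit_cost(192)     # GB of RAM
per_core = unit_cost(12)        # physical cores
per_tb = unit_cost(3.84)        # TB of raw storage
annual = MONTHLY_COST_USD * 12

print(per_gb_ram, per_core, per_tb, annual)
```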
|
||||||
|
|
||||||
|
### **Competitive Analysis**
|
||||||
|
|
||||||
|
**AWS Equivalent** (m5.2xlarge + storage + bandwidth):
|
||||||
|
- 8 vCPU, 32 GB RAM, 1 TB storage, 1 Gbps
|
||||||
|
- Cost: ~$300-400/month
|
||||||
|
|
||||||
|
**Hetzner Dedicated** (Similar specs):
|
||||||
|
- 12 core Xeon, 128 GB RAM, 2x2TB SSD
|
||||||
|
- Cost: ~$100/month (but higher network costs)
|
||||||
|
|
||||||
|
**GTHost Value**: ⭐⭐⭐⭐⭐ Excellent
|
||||||
|
- Roughly 64-73% cheaper than the AWS estimate above ($109 vs ~$300-400/month), for more RAM and cores
|
||||||
|
- Competitive with Hetzner
|
||||||
|
- Unmetered bandwidth (key advantage)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Competitive Advantages
|
||||||
|
|
||||||
|
### **1. LXC Performance**
|
||||||
|
- Host hardware enables 20-30% better performance vs Docker
|
||||||
|
- Intel Xeon Silver 4116 single-thread performance excellent for games
|
||||||
|
|
||||||
|
### **2. Resource Density**
|
||||||
|
- 192 GB RAM supports 30-50 simultaneous servers at 2-4 GB each
|
||||||
|
- Competitors typically offer 64-128 GB at this price point
|
||||||
|
|
||||||
|
### **3. Storage Performance**
|
||||||
|
- Samsung PM863 enterprise SSDs outperform consumer SSDs
|
||||||
|
- Power Loss Protection prevents world corruption
|
||||||
|
- Hot-swap chassis enables maintenance without downtime
|
||||||
|
|
||||||
|
### **4. Network**
|
||||||
|
- Unmetered = no bandwidth surprises
|
||||||
|
- 300 Mbit/s adequate for soft launch
|
||||||
|
- Upgrade path available when needed
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ⚠️ Identified Constraints
|
||||||
|
|
||||||
|
### **1. Network Bandwidth** (Current Bottleneck)
|
||||||
|
- **300 Mbit/s limits to 30-60 concurrent players**
|
||||||
|
- **Recommendation**: Monitor bandwidth usage, upgrade to 1 Gbps when approaching 40 players
|
||||||
|
- **Upgrade Cost**: Likely +$20-50/month for 1 Gbps
|
||||||
|
|
||||||
|
### **2. CPU Oversubscription**
|
||||||
|
- 24 threads allocated to VMs, but most VMs idle
|
||||||
|
- Game servers share CPU via time-slicing
|
||||||
|
- **Risk**: If all servers spike simultaneously, performance degrades
|
||||||
|
- **Mitigation**: Limit concurrent servers to 40-50 until load testing shows a higher limit is safe
|
||||||
|
|
||||||
|
### **3. Storage Growth**
|
||||||
|
- 1.8 TB free supports 300-500 servers
|
||||||
|
- Each server grows over time (world expansion)
|
||||||
|
- **Recommendation**: Monitor disk usage, plan expansion at 70% utilization
|
||||||
|
- **Expansion Options**: Add external storage or upgrade to larger SSDs
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔧 Optimization Opportunities
|
||||||
|
|
||||||
|
### **Immediate Optimizations** (No Cost)
|
||||||
|
|
||||||
|
1. **VM Consolidation**
|
||||||
|
- Merge zlh-panel-dev into zlh-panel (save 4 GB RAM, 2 cores)
|
||||||
|
- Merge zlh-proxy into zlh-router (save 2 GB RAM, 1 core)
|
||||||
|
- **Gain**: 6 GB RAM, 3 cores for game servers
|
||||||
|
|
||||||
|
2. **LXC Over VMs**
|
||||||
|
- Convert lightweight VMs to LXC containers
|
||||||
|
- Candidates: zlh-dns, zlh-proxy
|
||||||
|
- **Gain**: Lower overhead, faster provisioning
|
||||||
|
|
||||||
|
3. **Memory Deduplication (KSM)**
|
||||||
|
- Enable KSM (Kernel Same-page Merging) on Proxmox
|
||||||
|
- Deduplicate identical memory pages
|
||||||
|
- **Gain**: 5-10% more available RAM
|
||||||
|
|
||||||
|
### **Paid Optimizations** (Consider at Scale)
|
||||||
|
|
||||||
|
1. **Network Upgrade**: 1 Gbps uplink (+$20-50/month)
|
||||||
|
- Removes player concurrency bottleneck
|
||||||
|
- Enables 100-200 player capacity
|
||||||
|
|
||||||
|
2. **Storage Expansion**: Add 4TB NVMe (+$50/month)
|
||||||
|
- Roughly doubles raw storage to ~7.8 TB total (3.84 TB + 4 TB)
|
||||||
|
- Supports 600-1000 servers
|
||||||
|
|
||||||
|
3. **Cloudflare Enterprise** (+$200/month)
|
||||||
|
- DDoS protection for game traffic
|
||||||
|
- CDN for static assets
|
||||||
|
- Worth it at 100+ servers
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Hardware Lifecycle
|
||||||
|
|
||||||
|
### **Current Status** (December 2025)
|
||||||
|
|
||||||
|
**Server Age**: Unknown (assumed 1-3 years in service; note that the Xeon Silver 4116 platform itself dates from 2017)
|
||||||
|
**Expected Lifespan**: 5-7 years for enterprise server
|
||||||
|
**Remaining Life**: Likely 3-5 years
|
||||||
|
|
||||||
|
**Components**:
|
||||||
|
- CPU: Xeon Silver 4116 (2017 release) - still very capable
|
||||||
|
- RAM: DDR4-2400 (previous generation now that DDR5 is shipping, but plenty of life)
|
||||||
|
- SSD: Samsung PM863 (enterprise grade, high endurance)
|
||||||
|
|
||||||
|
### **Upgrade Path** (Future)
|
||||||
|
|
||||||
|
**Year 1-2** (Current plan):
|
||||||
|
- Optimize existing hardware
|
||||||
|
- Minor network upgrades if needed
|
||||||
|
|
||||||
|
**Year 3-4** (Growth phase):
|
||||||
|
- Consider second dedicated server
|
||||||
|
- Load balance across servers
|
||||||
|
- Geographic distribution
|
||||||
|
|
||||||
|
**Year 5+** (Scale phase):
|
||||||
|
- Migrate to colocation or cloud
|
||||||
|
- Multi-datacenter deployment
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🛡️ Reliability Features
|
||||||
|
|
||||||
|
### **Hardware Reliability**
|
||||||
|
|
||||||
|
✅ **ECC Memory** - Corrects single-bit errors automatically
|
||||||
|
✅ **Enterprise SSDs** - Power Loss Protection, high endurance
|
||||||
|
✅ **Hot-Swap Chassis** - Replace drives without shutdown
|
||||||
|
✅ **Redundant Power** (likely) - Supermicro chassis typically dual PSU
|
||||||
|
|
||||||
|
### **Software Reliability**
|
||||||
|
|
||||||
|
✅ **Proxmox High Availability** - VM failover (if configured; requires a multi-node cluster, so it does not apply to a single host)
|
||||||
|
✅ **PBS Backup** - Incremental backups to Backblaze B2
|
||||||
|
✅ **LXC Snapshots** - Fast rollback capability
|
||||||
|
✅ **RAID Mirroring** (likely) - Disk failure protection
|
||||||
|
|
||||||
|
### **Network Reliability**
|
||||||
|
|
||||||
|
✅ **Datacenter Uptime** - GTHost likely 99.9%+ SLA
|
||||||
|
✅ **Unmetered Bandwidth** - No throttling during spikes
|
||||||
|
⚠️ **Single Uplink** - No network redundancy (acceptable for price point)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Summary & Recommendations
|
||||||
|
|
||||||
|
### **Hardware Assessment**: ⭐⭐⭐⭐ Very Good for Use Case
|
||||||
|
|
||||||
|
**Strengths**:
|
||||||
|
- Excellent CPU for game server hosting (Xeon Silver 4116)
|
||||||
|
- Abundant RAM (192 GB = 30-50 servers)
|
||||||
|
- Enterprise storage (Samsung PM863 + hot-swap)
|
||||||
|
- Unmetered bandwidth (no surprise charges)
|
||||||
|
- Great value ($109/month for these specs)
|
||||||
|
|
||||||
|
**Limitations**:
|
||||||
|
- Network bandwidth (300 Mbit/s = 30-60 players)
|
||||||
|
- Storage growth constraint (monitor usage)
|
||||||
|
- CPU oversubscription (limit concurrent servers initially)
|
||||||
|
|
||||||
|
### **Recommendations**
|
||||||
|
|
||||||
|
**Now** (Launch Phase):
|
||||||
|
1. ✅ Deploy on current hardware - adequate for soft launch
|
||||||
|
2. ✅ Limit to 40-50 concurrent servers initially
|
||||||
|
3. ✅ Monitor bandwidth, RAM, and disk usage
|
||||||
|
|
||||||
|
**Month 1-3** (Early Growth):
|
||||||
|
1. 🔧 Optimize VM allocation (consolidate where possible)
|
||||||
|
2. 🔧 Implement aggressive monitoring
|
||||||
|
3. 🔧 Consider 1 Gbps network upgrade if approaching 40 players
|
||||||
|
|
||||||
|
**Month 6-12** (Scale Phase):
|
||||||
|
1. 📈 Evaluate storage expansion based on usage
|
||||||
|
2. 📈 Consider second server for geographic distribution
|
||||||
|
3. 📈 Implement Cloudflare Enterprise for DDoS protection
|
||||||
|
|
||||||
|
### **Capacity Targets by Phase**
|
||||||
|
|
||||||
|
**Soft Launch** (Month 1-3): 20-30 servers, 10-20 players
|
||||||
|
**Public Launch** (Month 3-6): 40-50 servers, 30-40 players
|
||||||
|
**Growth Phase** (Month 6-12): 60-80 servers, 60-100 players (with 1 Gbps upgrade)
|
||||||
|
**Scale Phase** (Month 12+): 100+ servers, multi-server deployment
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ✅ Conclusion
|
||||||
|
|
||||||
|
**Status**: Infrastructure is **production-ready** for ZeroLagHub launch.
|
||||||
|
|
||||||
|
**Key Points**:
|
||||||
|
- Hardware specifications are excellent for initial scale
|
||||||
|
- 192 GB RAM supports 30-50 game servers
|
||||||
|
- Storage capacity adequate for 300-500 servers
|
||||||
|
- Network bandwidth is the current bottleneck (acceptable for soft launch)
|
||||||
|
- Cost-effective ($109/month for enterprise-grade hardware)
|
||||||
|
|
||||||
|
**Green Light**: ✅ Launch when platform development is complete (currently 85% ready).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Last Updated**: December 7, 2025
|
||||||
|
**Source**: GTHost server specifications + ZeroLagHub infrastructure analysis
|
||||||
606
ZeroLagHub_Master_Bootstrap_Dec2025.md
Normal file
@ -0,0 +1,606 @@
|
|||||||
|
# 🚀 ZeroLagHub - Master Bootstrap Document (December 2025)
|
||||||
|
|
||||||
|
**Last Updated**: December 7, 2025
|
||||||
|
**Version**: 4.0 (Platform Launch Ready)
|
||||||
|
**Status**: 85% Complete - Launch Decision Point
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📌 Quick Start for New AI Sessions
|
||||||
|
|
||||||
|
> **Resume Point**: Platform 85% launch-ready, all core provisioning operational, critical UX features needed.
|
||||||
|
>
|
||||||
|
> **Current Phase**: Launch readiness assessment - choose NOW vs +1 week vs +1 month
|
||||||
|
>
|
||||||
|
> **Critical Context**: All 6 Minecraft variants provisioning successfully via Go agent. Need WebSocket console, crash protection, and disk monitoring for competitive parity.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Project Overview
|
||||||
|
|
||||||
|
**ZeroLagHub** is a developer-focused game server hosting platform built on:
|
||||||
|
- **Proxmox VE** with LXC containers (20-30% performance advantage over Docker)
|
||||||
|
- **Hybrid Architecture**: Pterodactyl panel + Custom Node.js API + Go provisioning agent
|
||||||
|
- **Velocity proxy** for seamless Minecraft routing
|
||||||
|
- **Dual-router architecture** for traffic separation
|
||||||
|
- **Developer-to-player revenue pipeline** with 9.75x revenue multiplier
|
||||||
|
|
||||||
|
### Core Value Proposition
|
||||||
|
Complete dev-to-production pipeline: Development environments ($20/mo) → Testing servers (50% discount) → Player hosting (25% discount) → Revenue sharing (7.5% commission) = viral growth through developer ecosystem.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🏗️ Current Architecture (December 2025)
|
||||||
|
|
||||||
|
### Infrastructure Overview (11 VMs)
|
||||||
|
|
||||||
|
```
|
||||||
|
Critical Production:
|
||||||
|
├── VM 100 (zlh-panel) - Pterodactyl panel + OAuth customization
|
||||||
|
├── VM 103 (zlh-api) - Node.js backend + developer platform APIs
|
||||||
|
├── VM 101 (zlh-wings) - Game servers + LXC integration target
|
||||||
|
|
||||||
|
Platform Services:
|
||||||
|
├── VM 102 (zlh-portal) - Next.js frontend + developer dashboard
|
||||||
|
├── VM 104 (zlh-monitor) - Prometheus/Grafana monitoring
|
||||||
|
|
||||||
|
Network & Infrastructure:
|
||||||
|
├── VM 1000 (zlh-router) - Platform services routing + VLANs
|
||||||
|
├── VM 1006 (zpack-router) - Game traffic routing + Velocity
|
||||||
|
├── VM 1001 (zlh-dns) - Technitium DNS + development domains
|
||||||
|
├── VM 1002 (zlh-proxy) - Caddy reverse proxy + SSL automation
|
||||||
|
├── VM 300 (zlh-panel-dev) - Development environment + testing
|
||||||
|
├── VM 2000 (zlh-ci) - CI/CD pipeline + automation
|
||||||
|
└── VM [zlh-back] - PBS backup + Backblaze B2 replication
|
||||||
|
```
|
||||||
|
|
||||||
|
### Network Topology
|
||||||
|
|
||||||
|
```
|
||||||
|
zlh-router (VM 1000):
|
||||||
|
├─ WAN1: Platform services (API, portal, monitoring)
|
||||||
|
├─ CORE_LAN: 10.60.0.0/24 (internal services)
|
||||||
|
├─ MGMT_LAN: 172.60.0.10/24 (inter-router communication)
|
||||||
|
└─ WireGuard: Admin access
|
||||||
|
|
||||||
|
zpack-router (VM 1006):
|
||||||
|
├─ WAN2: 139.64.165.248 (game services)
|
||||||
|
├─ ZPACK_LAN: 10.70.0.0/24 (Velocity @ 10.70.0.241)
|
||||||
|
├─ DEV_LAN: 10.100.0.0/24 (developer environments - future)
|
||||||
|
├─ GAME_LAN: 10.200.0.0/24 (game server LXCs)
|
||||||
|
└─ MGMT_LAN: 172.60.0.20/24 (control plane communication)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Traffic Flows
|
||||||
|
|
||||||
|
- **Platform Access**: Client → WAN1 → zlh-router → Frontend/API
|
||||||
|
- **Game Play**: Player → WAN2 (139.64.165.248) → zpack-router → Velocity (10.70.0.241) → Game Server (10.200.0.X)
|
||||||
|
- **Control Plane**: API → MGMT_LAN (172.60.0.X) → Velocity/DNS/Monitoring
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ✅ What's Working (December 7, 2025)
|
||||||
|
|
||||||
|
### Provisioning Pipeline (100% Operational)
|
||||||
|
|
||||||
|
| Component | Status | Notes |
|
||||||
|
|-----------|--------|-------|
|
||||||
|
| **LXC Container Creation** | ✅ | Template VMID 800, auto-cloning working |
|
||||||
|
| **VMID Allocation** | ✅ | Sequential assignment from range |
|
||||||
|
| **IP Detection** | ✅ | Automatic network configuration |
|
||||||
|
| **Go Agent Deployment** | ✅ | Payload delivery + self-repair system |
|
||||||
|
| **Java Runtime Selection** | ✅ | Auto-detect MC version → Java 17/21 |
|
||||||
|
| **All 6 MC Variants** | ✅ | Vanilla, Paper, Purpur, Fabric, Forge, NeoForge |
|
||||||
|
| **Server Startup** | ✅ | All variants start successfully |
|
||||||
|
| **DNS Publishing** | ✅ | Cloudflare + Technitium A + SRV records |
|
||||||
|
| **Velocity Registration** | ✅ | Dynamic backend server registration |
|
||||||
|
| **Client Connectivity** | ✅ | Players can connect and play |
|
||||||
|
|
||||||
|
### Control Functions
|
||||||
|
|
||||||
|
| Function | Status | Implementation |
|
||||||
|
|----------|--------|----------------|
|
||||||
|
| **Start/Stop/Restart** | ✅ | HTTP API → Go agent |
|
||||||
|
| **Console Commands** | ✅ | Command injection working |
|
||||||
|
| **Log Tailing** | ⚠️ | HTTP polling only (need WebSocket) |
|
||||||
|
| **Status Reporting** | ✅ | Agent emits RUNNING state |
|
||||||
|
| **Crash Detection** | ✅ | Agent tracks exit codes |
|
||||||
|
|
||||||
|
### Game Support Matrix
|
||||||
|
|
||||||
|
**Launch Ready (Minecraft Only)**:
|
||||||
|
- ✅ **Vanilla** - Official Mojang server
|
||||||
|
- ✅ **Paper** - Primary recommendation (vanilla + plugins)
|
||||||
|
- ✅ **Purpur** - Paper fork with extra features
|
||||||
|
- ✅ **Fabric** - Lightweight mod support
|
||||||
|
- ✅ **Forge** - Heavy mod support (tech/magic mods)
|
||||||
|
- ✅ **NeoForge** - Modern Forge fork (**competitive advantage**)
|
||||||
|
|
||||||
|
**Supported Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
|
||||||
|
|
||||||
|
**Deferred to Post-Launch**:
|
||||||
|
- 📋 Terraria
|
||||||
|
- 📋 Project Zomboid
|
||||||
|
- 📋 Valheim
|
||||||
|
- 📋 Rust
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚨 Known Issues & Gaps
|
||||||
|
|
||||||
|
### Critical Bugs (Non-Blocking, System Works)
|
||||||
|
|
||||||
|
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
|
||||||
|
- Tries to find `*server.jar` but Forge ≥1.17 doesn't create this
|
||||||
|
- **Fix**: Remove glob/rename logic (Forge uses `run.sh` + `libraries/`)
|
||||||
|
- **Impact**: System works, but the code is unnecessary
|
||||||
|
|
||||||
|
2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
|
||||||
|
- After Forge check, falls through to check `server.jar`
|
||||||
|
- **Fix**: Add `else` to prevent fallthrough
|
||||||
|
- **Impact**: Minor efficiency issue
|
||||||
|
|
||||||
|
3. **Forge Stop Command Exclusion** (`process.go` line 83)
|
||||||
|
- Excludes Forge from receiving `stop` command
|
||||||
|
- **Fix**: Remove exclusion (Forge accepts stop commands)
|
||||||
|
- **Impact**: Manual workaround needed for Forge stops
|
||||||
|
|
||||||
|
### Missing Competitive Features (CRITICAL)
|
||||||
|
|
||||||
|
| Feature | Apex | Shockbyte | ZeroLagHub | Priority |
|
||||||
|
|---------|------|-----------|------------|----------|
|
||||||
|
| **All MC Variants** | ✅ | ✅ | ✅ | - |
|
||||||
|
| **NeoForge** | ❌ | ❌ | ✅ | ADVANTAGE |
|
||||||
|
| **Performance** | 🟡 | 🟡 | ✅ | ADVANTAGE |
|
||||||
|
| **Console Streaming** | ✅ | ✅ | ❌ | 🔴 HIGH |
|
||||||
|
| **File Management** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
|
||||||
|
| **Backups** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
|
||||||
|
| **Crash Protection** | ✅ | ✅ | ❌ | 🔴 HIGH |
|
||||||
|
| **Disk Monitoring** | ✅ | ✅ | ❌ | 🔴 HIGH |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Platform Readiness Assessment (85%)
|
||||||
|
|
||||||
|
### Core Platform (100%)
|
||||||
|
- ✅ Container orchestration
|
||||||
|
- ✅ Multi-variant provisioning
|
||||||
|
- ✅ Network routing (dual-router)
|
||||||
|
- ✅ DNS automation (Cloudflare + Technitium)
|
||||||
|
- ✅ Velocity proxy integration
|
||||||
|
- ✅ Start/stop/restart control
|
||||||
|
- ✅ Console command injection
|
||||||
|
- ✅ Status monitoring
|
||||||
|
|
||||||
|
### Operational Features (70%)
|
||||||
|
- ✅ Log tailing (HTTP polling)
|
||||||
|
- ✅ Crash detection
|
||||||
|
- ❌ **WebSocket console** (need real-time streaming)
|
||||||
|
- ❌ **Crash loop protection** (need exponential backoff)
|
||||||
|
- ❌ **Disk space monitoring** (prevent corruption)
|
||||||
|
|
||||||
|
### File Management (0%)
|
||||||
|
- ❌ File upload/download
|
||||||
|
- ❌ Backup/restore system
|
||||||
|
- ❌ World file management
|
||||||
|
|
||||||
|
### Advanced Features (Planned)
|
||||||
|
- 📋 Resource monitoring dashboard
|
||||||
|
- 📋 Plugin marketplace
|
||||||
|
- 📋 Developer platform APIs
|
||||||
|
- 📋 Performance optimization tools
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚀 Launch Decision Point
|
||||||
|
|
||||||
|
### Option A: Launch NOW (Soft Beta)
|
||||||
|
**Status**: 85% ready
|
||||||
|
**Timeline**: Immediate
|
||||||
|
**Pros**: Fast to market, gather user feedback
|
||||||
|
**Cons**: Missing competitive UX features, higher support burden
|
||||||
|
**Recommendation**: ⚠️ Acceptable for 10-20 beta users only
|
||||||
|
|
||||||
|
### Option B: +1 Week (Critical Features) ⭐ RECOMMENDED
|
||||||
|
**Status**: 95% ready after additions
|
||||||
|
**Timeline**: December 14, 2025
|
||||||
|
**Add**: WebSocket console + Crash protection + Disk monitoring
|
||||||
|
**Effort**: 7-9 hours total
|
||||||
|
**Pros**: Competitive feature parity, professional launch
|
||||||
|
**Cons**: Minimal delay
|
||||||
|
**Recommendation**: ✅ Best balance of quality and speed
|
||||||
|
|
||||||
|
### Option C: +1 Month (Full Feature Parity)
|
||||||
|
**Status**: 100% ready
|
||||||
|
**Timeline**: January 7, 2026
|
||||||
|
**Add**: All UX features + file management + backups
|
||||||
|
**Effort**: ~30 hours
|
||||||
|
**Pros**: Complete competitive offering
|
||||||
|
**Cons**: Slower to market, feature creep risk
|
||||||
|
**Recommendation**: ⚠️ Over-engineering for launch
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📋 Critical Outstanding Items
|
||||||
|
|
||||||
|
### 🔴 High Priority (Before Launch)
|
||||||
|
|
||||||
|
**1. WebSocket Console Streaming** [4-6 hours]
|
||||||
|
- **Current**: HTTP polling via `/logs/tail`
|
||||||
|
- **Needed**: Real-time WebSocket streaming
|
||||||
|
- **Why**: Industry standard, users expect it
|
||||||
|
- **Technical**: Socket.io integration to Go agent
|
||||||
|
|
||||||
|
**2. Crash Loop Protection** [2 hours]
|
||||||
|
- **Current**: Immediate restart on crash
|
||||||
|
- **Needed**: Backoff with increasing delays (5s, 10s, 15s; linear as listed, whereas a strictly exponential schedule would be 5s, 10s, 20s), stop after 3 crashes
|
||||||
|
- **Why**: Prevents resource thrashing
|
||||||
|
- **Technical**: Agent retry logic with backoff timer
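The retry behaviour described in this item can be sketched as a small supervisor loop. This is an illustration in Python, not the Go agent's code: `supervise` and `run_server` are hypothetical names, the 5s/10s/15s delays are taken from the item above, and the reading that the server stays stopped once a crash follows the third restart is an assumption.

```python
import time
from typing import Callable, Sequence


def supervise(run_server: Callable[[], int],
              delays: Sequence[float] = (5, 10, 15),
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run the server, restarting after each crash with a growing delay.

    run_server blocks until the process exits and returns its exit code
    (0 means a clean stop). Returns the number of crashes observed.
    """
    crashes = 0
    while True:
        if run_server() == 0:
            return crashes            # clean shutdown, nothing to restart
        crashes += 1
        if crashes > len(delays):
            return crashes            # crash loop detected: stay stopped
        sleep(delays[crashes - 1])    # wait 5s, then 10s, then 15s
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a production agent would also reset the crash counter after a period of stable uptime, which this sketch omits.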
|
||||||
|
|
||||||
|
**3. Disk Space Monitoring** [1 hour]
|
||||||
|
- **Current**: No checks
|
||||||
|
- **Needed**: Alert when <1GB free, prevent start if insufficient
|
||||||
|
- **Why**: Prevents world corruption
|
||||||
|
- **Technical**: Agent disk space check before start
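A pre-start disk check along these lines could look like the sketch below. It is written in Python for illustration rather than in the agent's Go; `can_start` and `MIN_FREE_BYTES` are hypothetical names, and the 1 GB threshold comes from the requirement above.

```python
import shutil

MIN_FREE_BYTES = 1 * 1024**3  # 1 GB threshold from the requirement above


def free_bytes(path: str) -> int:
    """Free space, in bytes, on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def can_start(path: str, min_free: int = MIN_FREE_BYTES) -> bool:
    """Return True only if enough free space remains to start the server."""
    return free_bytes(path) >= min_free


if __name__ == "__main__":
    # "." stands in for the server's data directory in this example.
    print("ok to start" if can_start(".") else "refusing to start: low disk")
```

The same `free_bytes` helper can feed a periodic alert, so the start gate and the monitoring threshold stay in agreement.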
|
||||||
|
|
||||||
|
### 🟡 Medium Priority (Week 1)
|
||||||
|
|
||||||
|
**4. File Upload/Download** [6-8 hours]
|
||||||
|
- Plugin management, world uploads
|
||||||
|
- HTTP multipart + streaming
|
||||||
|
|
||||||
|
**5. Backup System** [8-10 hours]
|
||||||
|
- World backup/restore
|
||||||
|
- Integration with PBS backup infrastructure
|
||||||
|
|
||||||
|
**6. Enhanced Health Checks** [3-4 hours]
|
||||||
|
- Query server status
|
||||||
|
- Resource monitoring (CPU/RAM)
|
||||||
|
|
||||||
|
### 🟢 Low Priority (Month 1)
|
||||||
|
|
||||||
|
7. Resource monitoring dashboard
|
||||||
|
8. Plugin marketplace integration
|
||||||
|
9. Developer platform APIs
|
||||||
|
10. Performance optimization
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🗄️ Technical Architecture Details
|
||||||
|
|
||||||
|
### Directory Structure (Finalized)
|
||||||
|
|
||||||
|
```
|
||||||
|
/opt/zlh/<game>/<variant>/world/
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
/opt/zlh/minecraft/vanilla/world/
|
||||||
|
/opt/zlh/minecraft/forge/world/
|
||||||
|
/opt/zlh/minecraft/fabric/world/
|
||||||
|
```
|
||||||
|
|
||||||
|
**Benefits**:
|
||||||
|
- Clear game/variant separation
|
||||||
|
- Scalable to all future games
|
||||||
|
- Self-documenting paths
|
||||||
|
- Easy backup automation
|
||||||
|
|
||||||
|
### Container Model
|
||||||
|
|
||||||
|
**Architecture**: One game per LXC container
|
||||||
|
**Rationale**: Industry standard, 3-5x simpler than multi-game
|
||||||
|
**Benefits**:
|
||||||
|
- Better resource isolation
|
||||||
|
- Simpler billing
|
||||||
|
- Clearer security boundaries
|
||||||
|
- Easier debugging
|
||||||
|
|
||||||
|
### Java Runtime Selection
|
||||||
|
|
||||||
|
```
|
||||||
|
MC 1.21.x → Java 21
|
||||||
|
MC ≥1.20.5 → Java 21
|
||||||
|
MC <1.20.5 → Java 17
|
||||||
|
```
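The selection table above reduces to a single version comparison, since 1.21.x is already covered by the ≥1.20.5 rule. A minimal sketch, with hypothetical function names; it encodes only the table as written, so any special handling for much older releases is not modeled here.

```python
def parse_version(mc_version: str) -> tuple:
    """Turn '1.20.5' or '1.21' into a comparable (major, minor, patch) tuple."""
    parts = [int(p) for p in mc_version.split(".")]
    while len(parts) < 3:
        parts.append(0)               # treat '1.21' as '1.21.0'
    return tuple(parts[:3])


def java_for(mc_version: str) -> int:
    """Java major version for a Minecraft release, per the table above."""
    return 21 if parse_version(mc_version) >= (1, 20, 5) else 17


for version in ("1.20.1", "1.20.5", "1.21.1"):
    print(version, "-> Java", java_for(version))
```

Comparing integer tuples avoids the string-comparison trap where "1.9" would sort after "1.20".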
|
||||||
|
|
||||||
|
### Artifact Download Paths
|
||||||
|
|
||||||
|
```
|
||||||
|
minecraft/vanilla/<version>/server.jar
|
||||||
|
minecraft/paper/<version>/server.jar
|
||||||
|
minecraft/purpur/<version>/server.jar
|
||||||
|
minecraft/fabric/<version>/fabric-server.jar
|
||||||
|
minecraft/forge/<version>/forge-installer.jar
|
||||||
|
minecraft/neoforge/<version>/neoforge-installer.jar
|
||||||
|
```
|
||||||
|
|
||||||
|
**Critical Note**: Fabric uses `fabric-server.jar` (pre-built), not installer pattern
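The path layout above can be captured in a lookup table so that the Fabric exception is explicit rather than a special case in code. This is a sketch only; `ARTIFACT_FILENAMES` and `artifact_path` are hypothetical names, not identifiers from the agent.

```python
# Filename per variant, copied from the path listing above.
ARTIFACT_FILENAMES = {
    "vanilla": "server.jar",
    "paper": "server.jar",
    "purpur": "server.jar",
    "fabric": "fabric-server.jar",        # pre-built server, not an installer
    "forge": "forge-installer.jar",
    "neoforge": "neoforge-installer.jar",
}


def artifact_path(variant: str, version: str, game: str = "minecraft") -> str:
    """Build the relative artifact path for a game variant and version."""
    try:
        filename = ARTIFACT_FILENAMES[variant]
    except KeyError:
        raise ValueError(f"unsupported variant: {variant!r}") from None
    return f"{game}/{variant}/{version}/{filename}"


print(artifact_path("fabric", "1.20.1"))
```

Unknown variants fail loudly instead of silently falling back to `server.jar`.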
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 💰 Business Model & Revenue Strategy
|
||||||
|
|
||||||
|
### Developer-to-Player Pipeline
|
||||||
|
|
||||||
|
```
|
||||||
|
Step 1: Developer Acquisition
|
||||||
|
├─ Development Environment: $20/month
|
||||||
|
└─ Testing Server: $25/month (50% discount)
|
||||||
|
|
||||||
|
Step 2: Player Acquisition (via developer)
|
||||||
|
├─ Player 1-10: $15/month each (25% discount)
|
||||||
|
└─ Total Player Revenue: $150/month
|
||||||
|
|
||||||
|
Step 3: Developer Commission
|
||||||
|
├─ Revenue Share: 7.5% of player revenue
|
||||||
|
├─ Developer Earns: $11.25/month
|
||||||
|
└─ Platform Keeps: $138.75/month
|
||||||
|
|
||||||
|
Total Monthly Revenue from One Developer:
|
||||||
|
$20 (dev env) + $25 (test server) + $150 (players) = $195/month
|
||||||
|
Revenue Multiplier: 9.75x the $20 developer environment price ($195 ÷ $20)
|
||||||
|
```
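The pipeline arithmetic above can be checked with exact decimal math. A minimal sketch using the prices and the 7.5% commission from the breakdown; the function and key names are illustrative.

```python
from decimal import Decimal

DEV_ENV = Decimal("20")        # development environment, per month
TEST_SERVER = Decimal("25")    # testing server, per month
PLAYER_PRICE = Decimal("15")   # per referred player, per month
COMMISSION = Decimal("0.075")  # developer revenue share


def developer_pipeline(players: int = 10) -> dict:
    """Monthly revenue figures for one developer and their referred players."""
    player_revenue = PLAYER_PRICE * players
    commission = player_revenue * COMMISSION
    gross = DEV_ENV + TEST_SERVER + player_revenue
    return {
        "player_revenue": player_revenue,
        "developer_commission": commission,
        "platform_keeps": player_revenue - commission,
        "gross_monthly": gross,
        "multiplier": gross / DEV_ENV,
    }


result = developer_pipeline()
print(result)
```

Note that the $195 total is gross; the 9.75x multiplier is computed before the developer commission is paid out.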
|
||||||
|
|
||||||
|
### Financial Projections
|
||||||
|
|
||||||
|
**Month 6**: $8K-30K (LXC advantage + developer pipeline)
|
||||||
|
**Month 12**: $25K-100K (custom platform competitive advantages)
|
||||||
|
**Month 24**: $75K-300K (market leadership + technology licensing)
|
||||||
|
|
||||||
|
### Competitive Advantages
|
||||||
|
|
||||||
|
1. **LXC Performance**: 20-30% improvement over Docker competitors
|
||||||
|
2. **Developer Ecosystem**: Complete dev-to-production pipeline vs pure hosting
|
||||||
|
3. **Open Source Foundation**: 30-40% cost advantage over corporate providers
|
||||||
|
4. **Gaming-First Architecture**: Purpose-built vs adapted generic hosting
|
||||||
|
5. **NeoForge Support**: Ahead of Apex and Shockbyte
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔐 Security Vulnerabilities (CRITICAL - Active Fix Required)
|
||||||
|
|
||||||
|
### API Department Issues
|
||||||
|
|
||||||
|
1. **Server Ownership Bypass**
|
||||||
|
- Any user can control any server via UUID
|
||||||
|
- No ownership validation in API endpoints
|
||||||
|
- **Impact**: Critical security flaw
|
||||||
|
|
||||||
|
2. **Admin Privilege Escalation**
|
||||||
|
- Frontend can claim admin via JWT manipulation
|
||||||
|
- No server-side role validation
|
||||||
|
- **Impact**: Complete access control bypass
|
||||||
|
|
||||||
|
3. **Token URL Exposure**
|
||||||
|
- JWTs visible in browser history/logs
|
||||||
|
- Tokens passed as URL parameters
|
||||||
|
- **Impact**: Token theft vulnerability
|
||||||
|
|
||||||
|
4. **API Key Validation Missing**
|
||||||
|
- Authentication bypass vulnerabilities
|
||||||
|
- Inconsistent validation patterns
|
||||||
|
- **Impact**: Unauthorized API access
|
||||||
|
|
||||||
|
### Required Fixes
|
||||||
|
|
||||||
|
- Implement ownership checks on all server operations
|
||||||
|
- Server-side JWT validation and role enforcement
|
||||||
|
- Move tokens from URL to headers/cookies
|
||||||
|
- Comprehensive API key validation
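The first two fixes amount to one rule: every server operation checks ownership or a server-side verified role before acting. The sketch below shows that rule in Python for illustration; the real API is Node.js/Express, and `Server`, `Forbidden`, and `authorize_server_action` are hypothetical names. It assumes `user_id` and `is_admin` come from a token verified on the server and a database lookup, never from client-supplied claims.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the caller may not act on the requested server."""


@dataclass(frozen=True)
class Server:
    uuid: str
    owner_id: int


def authorize_server_action(user_id: int, is_admin: bool, server: Server) -> None:
    """Allow the action only for the owner or a verified admin.

    Knowing a server's UUID is never sufficient on its own.
    """
    if is_admin or server.owner_id == user_id:
        return
    raise Forbidden(f"user {user_id} does not own server {server.uuid}")
```

Calling this at the top of every start, stop, console, and file handler keeps the check in one place instead of scattered across endpoints.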
|
||||||
|
|
||||||
|
**Priority**: Must be fixed before public launch (tolerable only for the current limited soft beta)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🛠️ Ford Assembly Line Department Structure
|
||||||
|
|
||||||
|
### Management Department (Coordination Hub)
|
||||||
|
- **Role**: Strategic oversight, cross-department integration
|
||||||
|
- **AI Resource**: Claude (architecture) + ChatGPT (implementation)
|
||||||
|
- **Current Focus**: Launch readiness + critical feature completion
|
||||||
|
|
||||||
|
### 5 Specialized Departments
|
||||||
|
|
||||||
|
**1. API Department** ⚠️ CRITICAL SECURITY + DEVELOPER PLATFORM
|
||||||
|
- Tech: Node.js/Express, MariaDB, JWT auth, Pterodactyl integration
|
||||||
|
- Priority: Security fixes + developer environment APIs
|
||||||
|
|
||||||
|
**2. Infrastructure Department** ✅ LXC INTEGRATION PRIORITY
|
||||||
|
- Tech: Proxmox VMs, Ansible automation, PBS backup, Monitoring
|
||||||
|
- Achievement: Enterprise backup system operational
|
||||||
|
- Capacity: 1.8TB available, supports 75-100 developers
|
||||||
|
|
||||||
|
**3. Frontend Department** 🔧 TOKEN SECURITY + DEVELOPER UI
|
||||||
|
- Tech: Next.js 15, TailwindCSS, sci-fi HUD aesthetic, TypeScript
|
||||||
|
- Priority: Token security + developer dashboard
|
||||||
|
|
||||||
|
**4. Pterodactyl Department** ⚠️ OAUTH + WINGS LXC
|
||||||
|
- Role: Panel customization, OAuth integration
|
||||||
|
- Future: Wings LXC integration for performance advantage
|
||||||
|
|
||||||
|
**5. Planning & Brainstorming Department** 🧠 STRATEGIC EXECUTION
|
||||||
|
- Role: Long-term vision, competitive strategy
|
||||||
|
- Focus: Developer acquisition, viral growth mechanics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📋 Immediate Next Steps (Priority Order)
|
||||||
|
|
||||||
|
### Phase 1: Critical Features (Before Launch)
|
||||||
|
1. ✅ **Fix Go Agent Bugs** - Remove Forge glob, fix fallthrough, enable stop commands
|
||||||
|
2. 🔧 **WebSocket Console** - Implement real-time streaming (4-6 hours)
|
||||||
|
3. 🔧 **Crash Loop Protection** - Add exponential backoff (2 hours)
|
||||||
|
4. 🔧 **Disk Space Monitoring** - Prevent starts on low disk (1 hour)
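
The crash-loop and disk-space items above reduce to two small decisions, sketched here under stated assumptions. The agent itself is written in Go; JavaScript is used only to keep the examples in one language, and the base delay, cap, restart limit, and function names are illustrative values that this document does not specify.

```javascript
// Illustrative values; the real thresholds are not defined in this document.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60000;
const MAX_RESTARTS = 5;

// Delay before restart attempt N (1-based): 1s, 2s, 4s, ... capped at 60s.
function restartDelayMs(attempt) {
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

// Decide what to do after a crash, given the number of consecutive crashes.
function nextRestartAction(consecutiveCrashes) {
  if (consecutiveCrashes > MAX_RESTARTS) {
    // Stop retrying and surface the failure instead of looping forever.
    return { restart: false, delayMs: 0 };
  }
  return { restart: true, delayMs: restartDelayMs(consecutiveCrashes) };
}

// Refuse to start a server when free disk space is below a minimum.
function canStartServer(freeBytes, minFreeBytes) {
  return freeBytes >= minFreeBytes;
}
```

The crash counter would be reset once a server stays up for some healthy interval; that interval is likewise left open here.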
|
||||||
|
|
||||||
|
### Phase 2: Launch Readiness
|
||||||
|
5. 📋 **Security Audit** - Review critical vulnerabilities
|
||||||
|
6. 📋 **Documentation** - User guides, API docs
|
||||||
|
7. 📋 **Monitoring** - Alert thresholds, dashboards
|
||||||
|
8. 📋 **Soft Beta** - 10-20 users, gather feedback
|
||||||
|
|
||||||
|
### Phase 3: Week 1 Post-Launch
|
||||||
|
9. 📋 **File Management** - Upload/download interface
|
||||||
|
10. 📋 **Backup System** - World backup/restore
|
||||||
|
11. 📋 **Enhanced Health Checks** - Resource monitoring
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Success Metrics
|
||||||
|
|
||||||
|
### Technical Metrics
|
||||||
|
- ✅ 100% provisioning success rate (all 6 variants)
|
||||||
|
- ⚠️ Zero DNS orphan records (needs EdgeState migration)
|
||||||
|
- ⚠️ Sub-second WebSocket latency (needs implementation)
|
||||||
|
- ✅ LXC 20-30% performance advantage (validated)
|
||||||
|
|
||||||
|
### Business Metrics (Future)
|
||||||
|
- Developer referral system operational
|
||||||
|
- Revenue sharing calculations accurate
|
||||||
|
- Customer quota enforcement working
|
||||||
|
- Usage metering for billing
|
||||||
|
|
||||||
|
### User Experience Metrics
|
||||||
|
- Professional HUD aesthetic maintained
|
||||||
|
- Zero breaking changes during updates
|
||||||
|
- Seamless dev-to-production pipeline
|
||||||
|
- <3s average provisioning time
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📁 Key Files & Locations
|
||||||
|
|
||||||
|
### API Service (`/home/zlh/zlh-api-v2/`)
|
||||||
|
- `prisma/schema.prisma` - Database schema
|
||||||
|
- `src/services/edgePublisher.js` - DNS + Velocity publishing
|
||||||
|
- `src/services/dePublisher.js` - Edge cleanup
|
||||||
|
- `src/services/portAllocator.js` - Port management
|
||||||
|
- `src/clients/cloudflareClient.js` - Cloudflare API wrapper
|
||||||
|
- `src/clients/technitiumClient.js` - Technitium DNS API wrapper
|
||||||
|
|
||||||
|
### Go Agent (`/opt/zlh-agent/`)
|
||||||
|
- `agent.go` - Main provisioning logic
|
||||||
|
- `artifacts.go` - Download + verification (has bugs)
|
||||||
|
- `process.go` - Server lifecycle management (has bug)
|
||||||
|
- `api.go` - HTTP server for control commands
|
||||||
|
- `payload.json` - Configuration from API
|
||||||
|
|
||||||
|
### Frontend (`/home/zlh/zlh-portal/`)
|
||||||
|
- Next.js 15 application
|
||||||
|
- Steel-texture HUD aesthetic
|
||||||
|
- Developer dashboard (in progress)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ⚠️ Critical Rules & Constraints
|
||||||
|
|
||||||
|
### DO NOT
|
||||||
|
- ❌ Infer hostnames from DNS records
|
||||||
|
- ❌ Use DNS as source of truth
|
||||||
|
- ❌ Delete Cloudflare records without record IDs
|
||||||
|
- ❌ Launch without WebSocket console (competitive requirement)
|
||||||
|
- ❌ Skip crash protection (operational stability)
|
||||||
|
- ❌ Ignore disk space monitoring (data safety)
|
||||||
|
|
||||||
|
### ALWAYS
|
||||||
|
- ✅ Treat DB as authoritative source of truth
|
||||||
|
- ✅ Store Cloudflare record IDs in EdgeState
|
||||||
|
- ✅ Use exact hostname matching
|
||||||
|
- ✅ Track all async operations in JobLog
|
||||||
|
- ✅ Audit significant actions
|
||||||
|
- ✅ Test all 6 MC variants before deploy
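
The DNS rules above (the DB is authoritative, record IDs are stored in EdgeState, hostnames are matched exactly) can be sketched as a cleanup planner. This is a hedged illustration only: the function name `planDnsCleanup` and the row shape `{ hostname, cfRecordId }` are assumptions, not the real EdgeState schema.

```javascript
// Hypothetical EdgeState row shape: { hostname, cfRecordId }.
// Rows come from the database, never from a DNS lookup.
function planDnsCleanup(edgeStateRows, hostname) {
  // DNS names are case-insensitive, so compare in lowercase.
  const target = hostname.toLowerCase();
  const deletable = [];
  const blocked = [];
  for (const row of edgeStateRows) {
    // Exact match only: no prefix, suffix, or wildcard matching.
    if (row.hostname.toLowerCase() !== target) continue;
    if (row.cfRecordId) {
      deletable.push(row.cfRecordId);
    } else {
      // No stored record ID: do not infer one from DNS; flag for review.
      blocked.push(row.hostname);
    }
  }
  return { deletable, blocked };
}
```

Returning a plan instead of calling Cloudflare directly keeps the decision testable and lets the caller log the plan (for example in JobLog) before anything is deleted.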
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 💡 Key Architectural Decisions (ADRs)
|
||||||
|
|
||||||
|
**ADR-001: Minecraft-Only Launch**
|
||||||
|
**Decision**: Launch with Minecraft only, defer other games
|
||||||
|
**Rationale**: Market validation, focused quality, faster to market
|
||||||
|
**Consequence**: 6 variants + 6 versions = comprehensive MC offering
|
||||||
|
|
||||||
|
**ADR-002: One Game Per Container**
|
||||||
|
**Decision**: Single game per LXC container
|
||||||
|
**Rationale**: Industry standard, 3-5x simpler than multi-game
|
||||||
|
**Consequence**: Better isolation, clearer billing, easier debugging
|
||||||
|
|
||||||
|
**ADR-003: Velocity Over Direct Port Forwarding**
|
||||||
|
**Decision**: Use Velocity proxy for Minecraft routing
|
||||||
|
**Rationale**: Single entry point, dynamic registration, no NAT complexity
|
||||||
|
**Consequence**: No external port allocation needed for MC
|
||||||
|
|
||||||
|
**ADR-004: Hybrid Pterodactyl + Custom API**
|
||||||
|
**Decision**: Keep Pterodactyl panel, build custom API alongside
|
||||||
|
**Rationale**: Preserve working OAuth, gradual migration path
|
||||||
|
**Consequence**: Dual system complexity, eventual migration needed
|
||||||
|
|
||||||
|
**ADR-005: Go Agent Architecture**
|
||||||
|
**Decision**: Containerized Go agent handles provisioning
|
||||||
|
**Rationale**: Language-agnostic, self-healing, version-aware
|
||||||
|
**Consequence**: Robust provisioning, automatic repair, clean separation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🧠 Session Continuity Prompt
|
||||||
|
|
||||||
|
For AI assistants resuming work on this project:
|
||||||
|
|
||||||
|
> Resume from ZeroLagHub Master Bootstrap (December 7, 2025).
|
||||||
|
>
|
||||||
|
> **Current State**: Platform 85% launch-ready. All 6 Minecraft variants provision successfully via the Go agent. Core functionality is operational; critical UX features are still needed for competitive parity.
|
||||||
|
>
|
||||||
|
> **Launch Decision**: Recommend +1 week for WebSocket console, crash protection, and disk monitoring.
|
||||||
|
>
|
||||||
|
> **Known Bugs**: 3 non-blocking Go agent issues (Forge glob, fallthrough, stop exclusion).
|
||||||
|
>
|
||||||
|
> **Critical Context**:
|
||||||
|
> - Security vulnerabilities exist but are acceptable for the soft beta
|
||||||
|
> - Business model validated with 9.75x revenue multiplier
|
||||||
|
> - Developer-to-player pipeline is core differentiator
|
||||||
|
> - LXC performance advantage is primary competitive edge
|
||||||
|
>
|
||||||
|
> **Next Actions**: Fix Go agent bugs, implement critical features, launch beta.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📞 Support & Escalation
|
||||||
|
|
||||||
|
- **Platform Owner**: 44 years old, full-stack developer
|
||||||
|
- **AI Coordination**: Claude (architecture) + ChatGPT (implementation)
|
||||||
|
- **Infrastructure**: GTHost dedicated server ($109/month)
|
||||||
|
- **Domain**: zerolaghub.com, zpack.zerolaghub.com
|
||||||
|
- **Public Game IP**: 139.64.165.248
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Platform Status Summary
|
||||||
|
|
||||||
|
**Technical Readiness**: 85% complete
|
||||||
|
**Competitive Position**: Ready to compete on core provisioning; needs UX polish
|
||||||
|
**Strategic Clarity**: Clear path to launch with validated business model
|
||||||
|
**Infrastructure**: Production-grade with enterprise backup system
|
||||||
|
**Security**: Known vulnerabilities; acceptable for soft beta, but must be fixed before public launch
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Strategic Recommendation
|
||||||
|
|
||||||
|
**Recommended Path**: Option B (+1 Week)
|
||||||
|
|
||||||
|
**Rationale**:
|
||||||
|
1. WebSocket console is table stakes (competitors have it)
|
||||||
|
2. Crash protection prevents operational nightmares
|
||||||
|
3. Disk monitoring prevents data loss
|
||||||
|
4. A one-week delay is negligible relative to long-term platform success
|
||||||
|
5. Professional launch > rushed launch
|
||||||
|
|
||||||
|
**Timeline**:
|
||||||
|
- **Dec 7-10**: Implement critical features (WebSocket, crash, disk)
|
||||||
|
- **Dec 11-13**: Testing + bug fixes
|
||||||
|
- **Dec 14**: Soft beta launch (10-20 users)
|
||||||
|
- **Dec 21**: Public launch after beta feedback
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**This document serves as the single source of truth for project continuity. Update after each major milestone or architectural change.**
|
||||||
|
|
||||||
|
🚀 **Next action: Decide launch timeline, then implement critical features or launch beta.**
|
||||||