Upload files to "/"
parent 882c0c7169
commit 5a5ad001af

ZeroLagHub_Cross_Project_Tracker.md (Normal file, +846)
@ -0,0 +1,846 @@
# 🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System

**Last Updated**: December 7, 2025
**Version**: 1.0 (Canonical Architecture Enforcement)
**Status**: ACTIVE - Must Be Consulted Before All Code Changes

---

## 🎯 Document Purpose

This document establishes **architectural boundaries** across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.

**Critical Insight**: Most project failures come from **gradual architectural erosion**, not sudden breaking changes. This document prevents that.

---

## 📊 The Three Systems (Canonical Ownership)

```
┌─────────────────────────────────────────────────────────────┐
│                       NODE.JS API v2                        │
│                   (Orchestration Engine)                    │
│                                                             │
│  Owns: VMID allocation, port management, DNS publishing,    │
│        Velocity registration, job queue, database state     │
│                                                             │
│  Speaks To: Proxmox, Cloudflare, Technitium, Velocity,      │
│             MariaDB, BullMQ, Go Agent (HTTP)                │
│                                                             │
│  Never Touches: Container filesystem, game server files,    │
│                 Java installation, artifact downloads       │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      │ HTTP Contract
                      │ POST /config, /start, /stop
                      │ GET /status, /health
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                          GO AGENT                           │
│                (Container-Internal Manager)                 │
│                                                             │
│  Owns: Server installation, Java runtime, artifact          │
│        downloads, process management, READY detection,      │
│        filesystem layout, verification + self-repair        │
│                                                             │
│  Speaks To: Local filesystem, game server process,          │
│             API (status updates via HTTP polling)           │
│                                                             │
│  Never Touches: Proxmox, DNS, Cloudflare, Velocity,         │
│                 port allocation, VMID selection             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      │ Status Polling
                      │ Agent reports state
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                      NEXT.JS FRONTEND                       │
│                    (Customer + Admin UI)                    │
│                                                             │
│  Owns: User interaction, form validation, display logic,    │
│        client-side state, UI components                     │
│                                                             │
│  Speaks To: API v2 only (REST + WebSocket when added)       │
│                                                             │
│  Never Touches: Proxmox, Go Agent, DNS, Velocity,           │
│                 Cloudflare, direct container access         │
└─────────────────────────────────────────────────────────────┘
```

---

## 🗺️ System Ownership Matrix (CANONICAL)

| Area | Node.js API v2 | Go Agent | Frontend |
|------|----------------|----------|----------|
| **Provisioning Orchestration** | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
| **Template Selection** | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
| **Server Installation** | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
| **Runtime Control** | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
| **DNS (Cloudflare + Technitium)** | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
| **Velocity Registration** | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
| **IP Logic** | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
| **Port Allocation** | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
| **Monitoring** | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
| **Error Handling** | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |

---

## 🔄 API ↔ Agent Contract (IMMUTABLE)

### **API → Agent Endpoints**

```
POST /config
  ├─ Payload: {
  │    game: "minecraft",
  │    variant: "paper",
  │    version: "1.21.3",
  │    ports: [25565, 25575],
  │    memory: "4G",
  │    motd: "Welcome to ZeroLagHub",
  │    worldSettings: {...}
  │  }
  └─ Response: 200 OK

POST /start
  └─ Triggers: ensureProvisioned() → StartServer()

POST /stop
  └─ Triggers: Graceful shutdown via server stop command

POST /restart
  └─ Triggers: stop → start sequence

GET /status
  └─ Response: {
       state: "RUNNING" | "INSTALLING" | "FAILED",
       pid: 12345,
       uptime: 3600,
       lastError: null
     }

GET /health
  └─ Response: {
       healthy: true,
       java: "/usr/lib/jvm/java-21-openjdk-amd64",
       variant: "paper",
       ready: true
     }
```
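
Because the contract is treated as immutable, the API side may want to reject a malformed `/config` payload before it ever reaches the agent. The following is a minimal sketch only: `validateAgentConfig` is a hypothetical helper, the field names are taken from the payload above, and the specific checks (port range, an `M` or `G` memory suffix) are assumptions rather than documented rules.

```javascript
// Hypothetical helper: checks a POST /config payload before the API sends it.
// Field names come from the contract above; the individual checks are assumptions.
function validateAgentConfig(payload) {
  const errors = [];

  // game, variant, and version are shown as strings in the contract.
  for (const field of ["game", "variant", "version"]) {
    if (typeof payload[field] !== "string" || payload[field].length === 0) {
      errors.push(`${field} must be a non-empty string`);
    }
  }

  // Ports are assigned by the API (PortPool); the agent only receives them.
  const ports = payload.ports;
  if (!Array.isArray(ports) || ports.length === 0) {
    errors.push("ports must be a non-empty array");
  } else if (!ports.every((p) => Number.isInteger(p) && p >= 1 && p <= 65535)) {
    errors.push("every port must be an integer between 1 and 65535");
  }

  // Memory is written as "4G" in the contract; accepting "M" is an assumption.
  if (typeof payload.memory !== "string" || !/^\d+[MG]$/.test(payload.memory)) {
    errors.push('memory must look like "4G" or "512M"');
  }

  return { ok: errors.length === 0, errors };
}

const result = validateAgentConfig({
  game: "minecraft",
  variant: "paper",
  version: "1.21.3",
  ports: [25565, 25575],
  memory: "4G",
  motd: "Welcome to ZeroLagHub",
});
console.log(result.ok, result.errors);
```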

### **Agent → API Signals (via /status polling)**

```
State Machine:
  INSTALLING
    ├─> DOWNLOADING_ARTIFACT
    ├─> INSTALLING_JAVA
    ├─> FINALIZING
    ├─> RUNNING
    └─> FAILED | CRASHED
```
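
One way the API could sanity-check the states it sees while polling `/status` is a small transition table. This is a sketch under stated assumptions: the state names come from the state machine above, but the allowed edges (a strictly linear install sequence, any install step may fail, only `RUNNING` may crash) are an illustrative reading and not a documented rule. `TRANSITIONS`, `isValidTransition`, and `firstInvalidStep` are hypothetical names.

```javascript
// Assumed transition table. State names are from the document; edges are illustrative.
const TRANSITIONS = {
  INSTALLING: ["DOWNLOADING_ARTIFACT", "FAILED"],
  DOWNLOADING_ARTIFACT: ["INSTALLING_JAVA", "FAILED"],
  INSTALLING_JAVA: ["FINALIZING", "FAILED"],
  FINALIZING: ["RUNNING", "FAILED"],
  RUNNING: ["CRASHED"],
  FAILED: [],
  CRASHED: [],
};

// True when the table lists `to` as a legal successor of `from`.
function isValidTransition(from, to) {
  const allowed = TRANSITIONS[from];
  return Array.isArray(allowed) && allowed.includes(to);
}

// Walks a sequence of polled states and returns the first illegal step, or null.
function firstInvalidStep(states) {
  for (let i = 1; i < states.length; i++) {
    if (states[i] === states[i - 1]) continue; // repeated polls of one state are fine
    if (!isValidTransition(states[i - 1], states[i])) {
      return { from: states[i - 1], to: states[i], index: i };
    }
  }
  return null;
}

console.log(firstInvalidStep(["INSTALLING", "DOWNLOADING_ARTIFACT", "RUNNING"]));
```

A restart path (for example `FAILED` back to `INSTALLING`) is deliberately left out because the document does not specify it.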

### **🚫 Forbidden Agent Behaviors**

Agent **NEVER**:
- ❌ Allocates or manages ports
- ❌ Talks to Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Modifies LXC config
- ❌ Allocates VMIDs
- ❌ Manages templates
- ❌ Talks to Cloudflare or Technitium

**Violation Detection**: If agent code imports a Proxmox client, a DNS client, or port allocation logic → **DRIFT VIOLATION**

---

## 🖥️ Frontend ↔ API Contract (IMMUTABLE)

### **Frontend Allowed Endpoints**

```
Provisioning:
  POST   /api/containers/create
  GET    /api/containers/:vmid
  DELETE /api/containers/:vmid

Control:
  POST   /api/containers/:vmid/start
  POST   /api/containers/:vmid/stop
  POST   /api/containers/:vmid/restart

Status:
  GET    /api/containers/:vmid/status
  GET    /api/containers/:vmid/logs
  GET    /api/containers/:vmid/stats

Discovery:
  GET    /api/templates
  GET    /api/containers (list user's containers)

Read-Only Info:
  GET    /api/dns/:hostname (display only)
  GET    /api/velocity/status (display only)
```

### **🚫 Forbidden Frontend Behaviors**

Frontend **NEVER**:
- ❌ Talks directly to Go Agent
- ❌ Calls Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Allocates ports
- ❌ Executes container commands
- ❌ Accesses MariaDB directly

**Violation Detection**: If frontend code imports a Proxmox client, an agent HTTP client (agent access goes through the API only), or a database client → **DRIFT VIOLATION**

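The two "Violation Detection" rules above could be approximated by a rough import scan in CI. This is a sketch, not an existing tool: `FORBIDDEN_IMPORTS` and `findDriftViolations` are hypothetical, and the regular expressions are placeholders because the real module names of the Proxmox, DNS, Velocity, and database clients are not given in this document. Aliased Go imports are not detected.

```javascript
// Hypothetical deny-list per system. Patterns are placeholders, not real package names.
const FORBIDDEN_IMPORTS = {
  agent: [/proxmox/i, /cloudflare/i, /technitium/i, /velocity/i, /portpool/i],
  frontend: [/proxmox/i, /mariadb|mysql|prisma/i, /agent-client/i],
};

// Returns import-like lines that match a forbidden pattern for the given system.
function findDriftViolations(system, source) {
  const rules = FORBIDDEN_IMPORTS[system] || [];
  const violations = [];
  source.split("\n").forEach((line, i) => {
    // Only inspect lines that look like imports: JS/TS `import`, `require(...)`,
    // or a bare quoted path as found inside a Go import block.
    const looksLikeImport = /^\s*(import\b|.*\brequire\(|"[^"]+"\s*$)/.test(line);
    if (!looksLikeImport) return;
    if (rules.some((rule) => rule.test(line))) {
      violations.push({ line: i + 1, text: line.trim() });
    }
  });
  return violations;
}

const agentSource = [
  "package agent",
  "",
  "import (",
  '\t"net/http"',
  '\t"github.com/example/proxmox-client"', // made-up path for illustration
  ")",
].join("\n");
console.log(findDriftViolations("agent", agentSource));
```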

---

## 🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)

### **Rule #1: Provisioning Ownership**

**Violation**: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs

**Correct Path**: API owns ALL provisioning orchestration

**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
	// ... port allocation logic
}
```

**Correct Pattern**:
```javascript
// RIGHT - API allocates, agent receives
// vmid is allocated by the API before this call and passed in.
async function provisionInstance(vmid, game, variant) {
  const ports = await portAllocator.allocate(vmid);
  await agent.postConfig({ ports, game, variant });
}
```

---

### **Rule #2: Artifact Ownership**

**Violation**: API asked to install Java, download server.jar, run installers

**Correct Path**: Agent owns ALL in-container installation

**Example Violation**:
```javascript
// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
  await proxmox.exec(vmid, `wget https://...`);
  await proxmox.exec(vmid, `java -jar installer.jar`);
}
```

**Correct Pattern**:
```go
// RIGHT - Agent handles installation
// The three helpers are assumed to return an error.
func (p *Provisioner) ProvisionAll(cfg Config) error {
	if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
		return err
	}
	if err := installJavaRuntime(cfg.Version); err != nil {
		return err
	}
	return verifyInstallation()
}
```

---

### **Rule #3: Direct Container Access**

**Violation**: Frontend or API wants to exec commands directly into container

**Correct Path**: Agent owns container execution layer

**Example Violation**:
```typescript
// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
  const ssh = new SSHClient();
  await ssh.connect(containerIP);
  await ssh.exec('systemctl restart minecraft');
}
```

**Correct Pattern**:
```typescript
// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
  await api.post(`/containers/${vmid}/restart`);
}
```

---

### **Rule #4: Networking Responsibilities**

**Violation**: Agent asked to select public vs internal IPs, decide DNS zones

**Correct Path**: API owns dual-IP logic (Cloudflare external, Technitium internal)

**Example Violation**:
```go
// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
	if a.needsCloudflare() {
		return "139.64.165.248"
	}
	return a.containerIP
}
```

**Correct Pattern**:
```javascript
// RIGHT - API decides network topology
function determineIPs(vmid, game) {
  const internalIP = `10.200.0.${vmid - 1000}`;
  const externalIP = "139.64.165.248"; // Cloudflare target
  const velocityIP = "10.70.0.241"; // Internal routing

  return { internalIP, externalIP, velocityIP };
}
```

---

### **Rule #5: Proxy Responsibilities**

**Violation**: Agent asked to register with Velocity, configure proxy routing

**Correct Path**: API owns ALL proxy integrations

**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
	client := velocity.NewClient()
	return client.Register(a.hostname, a.port)
}
```

**Correct Pattern**:
```javascript
// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
  await velocityBridge.registerBackend({
    name: hostname,
    address: internalIP,
    port: 25565
  });
}
```

---

## 🔄 Context Switching Safety Workflow

### **When Moving to Node.js API Work:**

**Pre-Switch Checklist**:
- [ ] Agent contract unchanged? (POST /config, /start, /stop, /restart; GET /status, /health)
- [ ] Database schema unchanged? (Prisma models consistent)
- [ ] LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
- [ ] DNS/IP logic consistent? (Cloudflare external, Technitium internal)
- [ ] Port allocation logic preserved? (PortPool DB-backed)

**Common Drift Patterns**:
- ⚠️ Adding agent installation logic to API
- ⚠️ Changing agent contract without updating both sides
- ⚠️ Moving DNS logic to different service
- ⚠️ Bypassing job queue for provisioning

---

### **When Moving to Go Agent Work:**

**Pre-Switch Checklist**:
- [ ] No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
- [ ] File paths remain canonical (`/opt/zlh/<game>/<variant>/world`)
- [ ] Naming conventions maintained (`server.jar`, `fabric-server.jar`, etc.)
- [ ] No external API calls (Proxmox, DNS, Velocity)
- [ ] Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)

**Common Drift Patterns**:
- ⚠️ Adding port allocation to agent
- ⚠️ Making agent talk to external services
- ⚠️ Changing directory structure without API coordination
- ⚠️ Adding orchestration logic to agent

---

### **When Moving to Frontend Work:**

**Pre-Switch Checklist**:
- [ ] Only API-approved fields used in UI
- [ ] No direct agent HTTP calls
- [ ] VMID not exposed in user-facing UI
- [ ] Internal IPs not displayed to users
- [ ] All state from API, not computed locally

**Common Drift Patterns**:
- ⚠️ Adding direct agent calls from frontend
- ⚠️ Computing server state client-side
- ⚠️ Exposing internal infrastructure details
- ⚠️ Bypassing API for container control

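The frontend checklist items "VMID not exposed" and "Internal IPs not displayed" can be enforced at a single choke point before data reaches UI components. The sketch below is illustrative only: `toDisplayModel` is a hypothetical helper, and the field names (`vmid`, `internalIP`, `velocityIP`) are assumptions about the API response shape, which this document does not define.

```javascript
// Assumed names of fields that must never be rendered to end users.
const INTERNAL_FIELDS = new Set(["vmid", "internalIP", "velocityIP"]);

// Returns a shallow copy of an API container record without internal-only fields.
// The frontend may still keep the vmid elsewhere for building API request URLs.
function toDisplayModel(container) {
  const display = {};
  for (const [key, value] of Object.entries(container)) {
    if (!INTERNAL_FIELDS.has(key)) display[key] = value;
  }
  return display;
}

console.log(
  toDisplayModel({
    vmid: 1042,
    hostname: "mc-01",
    state: "RUNNING",
    internalIP: "10.200.0.42",
  })
);
```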

---

## 🚨 High-Risk Integration Zones (GUARDED)

These areas have historically caused drift across sessions:

### **1. Forge / NeoForge Installation Logic**
- **Risk**: Agent vs API confusion on who handles `run.sh` patching
- **Guard**: Agent owns ALL Forge installation, API just passes config
- **Test**: Can provision Forge 1.21.3 without API filesystem access?

### **2. Cloudflare SRV Deletion**
- **Risk**: Case sensitivity, subdomain normalization, record ID tracking
- **Guard**: API stores Cloudflare record IDs in EdgeState, deletes by ID
- **Test**: Create → Delete → Recreate same hostname without orphans?

### **3. Technitium DNS Zone Mismatch**
- **Risk**: Wrong zone selection, duplicate records
- **Guard**: API hardcodes zone as `zpack.zerolaghub.com`, validates before creation
- **Test**: No records created in wrong zones?

### **4. Velocity Registration Order**
- **Risk**: Registering before server ready, deregistering incorrectly
- **Guard**: API waits for agent RUNNING state, then registers Velocity
- **Test**: Player connection works immediately after provisioning complete?

### **5. PortPool Commit Logic**
- **Risk**: Race conditions, double-allocation, uncommitted ports
- **Guard**: API allocates → provisions → commits (rollback on failure)
- **Test**: Concurrent provisions don't collide on ports?

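The "allocates → provisions → commits (rollback on failure)" guard for PortPool can be sketched as a reserve/commit/rollback sequence. This is an in-memory illustration only: the real PortPool is described as DB-backed, provisioning would be an asynchronous job, and the class and method names below are hypothetical.

```javascript
// In-memory stand-in for the DB-backed PortPool; all names are hypothetical.
class PortPool {
  constructor(ports) {
    this.free = new Set(ports);
    this.reserved = new Map(); // vmid -> ports held but not yet committed
    this.committed = new Map(); // vmid -> ports owned by a live instance
  }

  // Takes `count` ports out of the free set and holds them for `vmid`.
  reserve(vmid, count) {
    if (this.free.size < count) throw new Error("port pool exhausted");
    const ports = [...this.free].slice(0, count);
    ports.forEach((p) => this.free.delete(p));
    this.reserved.set(vmid, ports);
    return ports;
  }

  // Marks the held ports as permanently owned by `vmid`.
  commit(vmid) {
    this.committed.set(vmid, this.reserved.get(vmid));
    this.reserved.delete(vmid);
  }

  // Returns the held ports to the free set.
  rollback(vmid) {
    (this.reserved.get(vmid) || []).forEach((p) => this.free.add(p));
    this.reserved.delete(vmid);
  }
}

// allocate -> provision -> commit, releasing the ports if provisioning throws.
// `provision` is synchronous here purely to keep the sketch short.
function provisionWithPorts(pool, vmid, count, provision) {
  const ports = pool.reserve(vmid, count);
  try {
    provision(ports);
    pool.commit(vmid);
    return ports;
  } catch (err) {
    pool.rollback(vmid);
    throw err;
  }
}

const pool = new PortPool([25565, 25566, 25567, 25568]);
console.log(provisionWithPorts(pool, 1001, 2, () => {}));
```

In a database-backed version the reserve and commit steps would presumably run inside transactions so that concurrent provisions cannot pick the same rows.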
### **6. Agent READY Detection**
- **Risk**: False negatives, false positives, variant-specific patterns
- **Guard**: Agent uses variant-aware log parsing, multiple confirmation lines
- **Test**: All 6 variants correctly detect READY state?

### **7. Server Start-Up False Negatives**
- **Risk**: Timeout too short, log parsing too strict
- **Guard**: Agent increases timeout for Forge (90s), multiple log patterns
- **Test**: Forge installer completes without false failure?

### **8. IP Selection Logic**
- **Risk**: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
- **Guard**: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
- **Test**: DNS points to correct IPs, Velocity routes to correct internal IP?

---

## 📋 Architecture Decision Log (LOCKED) ⭐ NEW

**Purpose**: Records finalized architectural decisions that **must not be re-litigated** unless explicitly requested.

**Status**: LOCKED - These decisions are final and cannot be changed without the user explicitly saying "Revisit decision X"

---

### **DEC-001: Templates vs Go Agent (FINAL)**

**Decision**: Hybrid model
- LXC templates define **base environment only**
- Go Agent is **authoritative execution layer** inside containers

**Rationale**:
- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
- Agent enables self-repair, async provisioning, runtime control
- Hybrid provides speed + flexibility without API container access

**Applies To**: API v2, Go Agent, Frontend
**Status**: ✅ LOCKED

---

### **DEC-002: Provisioning Authority**

**Decision**: API orchestrates, Agent executes

**Rationale**:
- API has global visibility (DB, DNS, Proxmox, Velocity)
- Agent is intentionally sandboxed to container filesystem + process

**Applies To**: All systems
**Status**: ✅ LOCKED

---

### **DEC-003: DNS & Edge Publishing Ownership**

**Decision**: API-only responsibility

**Rationale**:
- Requires external credentials (Cloudflare, Technitium)
- Must correlate DB state, record IDs, reconciliation jobs

**Applies To**: API v2
**Status**: ✅ LOCKED

---

### **DEC-004: Proxy Stack**

**Decision**: Traefik + Velocity only

**Rationale**:
- Traefik for HTTP/control-plane
- Velocity for Minecraft TCP routing
- **HAProxy explicitly deprecated** for ZeroLagHub

**Applies To**: Infrastructure, API
**Status**: ✅ LOCKED

---

### **DEC-005: State Persistence**

**Decision**: MariaDB is single source of truth

**Rationale**:
- Flat files caused race conditions and drift
- DB enables reconciliation, recovery, observability

**Applies To**: API v2
**Status**: ✅ LOCKED

---

### **DEC-006: Frontend Access Model**

**Decision**: Frontend communicates with API only

**Rationale**:
- Security boundary
- Prevents leaking infrastructure details

**Applies To**: Frontend
**Status**: ✅ LOCKED

---

### **DEC-007: Architecture Enforcement Policy**

**Decision**: Drift prevention is mandatory

**Rationale**:
- Prevents oscillation between alternatives
- Preserves velocity during late-stage development

**Applies To**: All work sessions
**Status**: ✅ LOCKED

---

## 📋 Canonical Architecture Anchors (ABSOLUTE)

These rules are **immutable** unless explicitly changed with full system review:

### **Anchor #1: Orchestration**
✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)

### **Anchor #2: Container-Internal**
✅ Agent performs EVERYTHING inside container (install, start, stop, detect)

### **Anchor #3: Templates**
✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)

### **Anchor #4: Job Queue**
✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)

### **Anchor #5: Database**
✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)

### **Anchor #6: Infrastructure**
✅ Proxmox API (not Ansible) for LXC management

### **Anchor #7: Routing**
✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems

### **Anchor #8: DNS**
✅ Cloudflare = authoritative public DNS
✅ Technitium = authoritative internal DNS

### **Anchor #9: Frontend Isolation**
✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)

### **Anchor #10: Directory Structure**
✅ `/opt/zlh/<game>/<variant>/world` is canonical game server path

---

## 🛡️ Enforcement Policies (ACTIVE)

When future instructions conflict with the Canonical Architecture or the Architecture Decision Log:

### **Step 1: STOP**
Immediately halt the task. Do not proceed with the drift-inducing change.

### **Step 2: RAISE DRIFT WARNING**
```
⚠️ DRIFT WARNING ⚠️

Proposed change violates Canonical Architecture:

Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]

Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]

Impact: [What breaks if this drift is allowed]

Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
```
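
If warnings are ever generated by tooling rather than typed by hand, the template above can be filled from a plain object. The following is a minimal sketch: `formatDriftWarning` and its field names (`rule`, `decision`, `violation`, `correctPath`, `impact`) are hypothetical, and the output text mirrors the template as written.

```javascript
// Renders the drift-warning template from a plain object.
// Pass either `rule` or `decision`; if both are given, `decision` wins (an assumption).
function formatDriftWarning({ rule, decision, violation, correctPath, impact }) {
  const violated = decision
    ? `Decision Violated: ${decision}`
    : `Rule Violated: ${rule}`;
  return [
    "⚠️ DRIFT WARNING ⚠️",
    "",
    "Proposed change violates Canonical Architecture:",
    "",
    violated,
    "",
    `Violation: ${violation}`,
    `Correct Path: ${correctPath}`,
    "",
    `Impact: ${impact}`,
    "",
    "Options:",
    "1. Implement correct architecture-aligned path",
    "2. Amend Canonical Architecture (requires full system review)",
    '3. Request user to "Revisit decision X" (for ADL changes)',
    "4. Cancel proposed change",
  ].join("\n");
}

console.log(
  formatDriftWarning({
    rule: "Rule #1 - Provisioning Ownership",
    violation: "Agent attempting to allocate ports",
    correctPath: "API allocates ports, agent receives them via /config",
    impact: "Port collisions, database inconsistency, broken PortPool",
  })
);
```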

### **Step 3: PROVIDE CORRECT PATH**
Show the architecture-aligned implementation that achieves the same goal.

### **Step 4: ASK FOR DIRECTION**
```
Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?
```

### **Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE**
If the violation is against a **LOCKED** Architecture Decision (DEC-001 through DEC-007):

**ADDITIONAL CHECK**:
```
⚠️ LOCKED DECISION WARNING ⚠️

This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED

This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"

Without explicit user request to revisit, this decision is FINAL.

Proceeding with this change would violate architectural governance.
```

**Never proceed** with drift-inducing changes without explicit confirmation.
**Never re-litigate** locked decisions without the user explicitly requesting revision.

---

## 📊 Drift Detection Examples

### **Example 1: Agent Port Allocation (VIOLATION)**

**Proposed Change**:
```go
// Agent code proposal
func (a *Agent) allocatePort() int {
	// Find available port...
	return port
}
```

**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #1 - Provisioning Ownership
Violation: Agent attempting to allocate ports
Correct Path: API allocates ports, agent receives them via /config

Impact: Port collisions, database inconsistency, broken PortPool

Recommendation: Remove port allocation from agent, ensure API sends ports in config payload
```

---

### **Example 2: Frontend Direct Agent Call (VIOLATION)**

**Proposed Change**:
```typescript
// Frontend code proposal
async function getServerLogs(vmid: number) {
  const agentURL = `http://10.200.0.${vmid}:8080/logs`;
  return await fetch(agentURL);
}
```

**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #3 - Direct Container Access
Violation: Frontend bypassing API to talk to agent
Correct Path: Frontend → API → Agent (API proxies logs)

Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk

Recommendation: Add GET /api/containers/:vmid/logs endpoint that proxies to agent
```

---

### **Example 3: API Installing Java (VIOLATION)**

**Proposed Change**:
```javascript
// API code proposal
async function provisionServer(vmid, game, variant) {
  await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
  await proxmox.exec(vmid, 'wget https://papermc.io/...');
}
```

**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #2 - Artifact Ownership
Violation: API performing in-container installation
Correct Path: API sends config to agent, agent handles installation

Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system

Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config
```

---

## 📁 Integration with Existing Documentation

### **Relationship to Master Bootstrap**
- Master Bootstrap: Strategic overview and business model
- **This Document**: Technical governance and boundary enforcement
- **Usage**: Consult this before implementing ANY code changes

### **Relationship to Complete Current State**
- Complete Current State: What's working, what's next
- **This Document**: How things MUST work (regardless of current state)
- **Usage**: This is the "law", current state is "status"

### **Relationship to Engineering Handover**
- Engineering Handover: Daily tactical tasks and sprint plan
- **This Document**: Constraints within which tasks must be implemented
- **Usage**: Check this before starting each handover task

---

## 🔄 Architecture Amendment Process

If there is a legitimate need to change the Canonical Architecture:

### **Step 1: Identify Change**
Document exactly what architectural boundary needs to change and why.

### **Step 2: Full System Impact Analysis**
- What breaks in API?
- What breaks in Agent?
- What breaks in Frontend?
- What changes to contracts?
- What database migrations are needed?

### **Step 3: Update ALL Affected Documents**
- This document (Canonical Architecture)
- Master Bootstrap (if strategic impact)
- Complete Current State (implementation changes)
- Engineering Handover (sprint tasks)
- Agent Spec, Operational Guide (if affected)

### **Step 4: Update ALL Systems**
- API code + tests
- Agent code + tests
- Frontend code + tests
- Database schema (migration)
- Infrastructure config

### **Step 5: Validation**
- Integration tests pass?
- No new drift introduced?
- Documentation consistent?
- All AIs briefed on change?

**Only after ALL steps** is the architecture amendment complete.

---

## 🎯 Quick Reference Card

### **API Owns**
- ✅ Provisioning orchestration
- ✅ Port allocation (PortPool)
- ✅ DNS (Cloudflare + Technitium)
- ✅ Velocity registration
- ✅ IP logic (external + internal)
- ✅ Job queue (BullMQ)
- ✅ Database state

### **Agent Owns**
- ✅ Container-internal installation
- ✅ Java runtime
- ✅ Artifact downloads
- ✅ Server process management
- ✅ READY detection
- ✅ Self-repair + verification
- ✅ Filesystem layout

### **Frontend Owns**
- ✅ User interaction
- ✅ Display logic
- ✅ Client state
- ✅ Form validation

### **Never**
- ❌ Agent allocates ports
- ❌ Agent talks to DNS/Velocity/Proxmox
- ❌ API installs server files
- ❌ API executes in-container commands
- ❌ Frontend talks to agent directly
- ❌ Frontend talks to infrastructure

---

## 📋 Session Start Checklist

Before every coding session with any AI:

- [ ] Read this document's Quick Reference Card
- [ ] Identify which system you're working on (API, Agent, Frontend)
- [ ] Review that system's "Owns" list
- [ ] Check High-Risk Integration Zones if touching those areas
- [ ] Verify no drift from previous session
- [ ] Confirm contracts unchanged since last session

**If ANY doubt**: Re-read full Canonical Architecture Anchors section.

---

## ✅ Document Status

**Status**: ACTIVE - Must be consulted before all code changes
**Enforcement**: MANDATORY - Drift violations must be caught
**Authority**: CANONICAL - Overrides conflicting guidance
**Updates**: Only via Architecture Amendment Process

---

🛡️ **This document prevents architectural drift. Violate at your own risk.**

ZeroLagHub_Drift_Prevention_Card.md (Normal file, +105)
@ -0,0 +1,105 @@
# 🛡️ ZeroLagHub Drift Prevention - Quick Start Card
|
||||
|
||||
**Use This**: At the start of EVERY coding session (API, Agent, or Frontend)
|
||||
|
||||
---
|
||||
|
||||
## ⚡ 30-Second Architecture Check
|
||||
|
||||
### **Working on API?**
|
||||
✅ Can allocate: Ports, VMIDs, DNS, Velocity
|
||||
❌ Cannot: Install Java, download artifacts, exec in container
|
||||
|
||||
### **Working on Agent?**
|
||||
✅ Can install: Java, artifacts, server files
|
||||
❌ Cannot: Allocate ports, create DNS, call Proxmox, register Velocity
|
||||
|
||||
### **Working on Frontend?**
|
||||
✅ Can: Display data, call API endpoints
|
||||
❌ Cannot: Talk to Agent, Proxmox, DNS, Velocity, allocate anything
|
||||
|
||||
---
|
||||
|
||||
## 🔒 7 Locked Architectural Decisions (NEW) ⭐
|
||||
|
||||
**These decisions are FINAL** - cannot be changed without user saying "Revisit decision X":
|
||||
|
||||
1. **DEC-001**: Templates + Agent hybrid (not templates-only or agent-only)
|
||||
2. **DEC-002**: API orchestrates, Agent executes (not reversed)
|
||||
3. **DEC-003**: API owns DNS (Agent never creates DNS)
|
||||
4. **DEC-004**: Traefik + Velocity only (no HAProxy)
|
||||
5. **DEC-005**: MariaDB is source of truth (no flat files)
|
||||
6. **DEC-006**: Frontend → API only (no direct agent calls)
|
||||
7. **DEC-007**: Drift prevention mandatory (always enforced)
|
||||
|
||||
**If proposing change that conflicts with DEC-001 through DEC-007**:
|
||||
→ STOP → Consult Cross-Project Tracker → Request "Revisit decision X" from user
|
||||
|
||||
---
|
||||
|
||||
## 🚨 Drift Detection Triggers
|
||||
|
||||
**STOP and consult full tracker if you hear:**
|
||||
|
||||
1. "Agent should allocate ports..."
|
||||
2. "API should install Java inside container..."
|
||||
3. "Frontend should call agent directly..."
|
||||
4. "Agent should register with Velocity..."
|
||||
5. "API should decide what Java version..."
|
||||
6. "Frontend should manage DNS records..."
|
||||
7. "Let's use templates only..." (violates DEC-001)
|
||||
8. "Let's add HAProxy..." (violates DEC-004)
|
||||
9. "Let's use flat files instead of DB..." (violates DEC-005)
|
||||
|
||||
**All of these are VIOLATIONS** → Consult [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
|
||||
|
||||
---
|
||||
|
||||
## 📋 Pre-Coding Checklist
|
||||
|
||||
Before writing ANY code:
|
||||
|
||||
- [ ] Which system am I modifying? (API / Agent / Frontend)
|
||||
- [ ] Does this change cross boundaries? (If yes → read full tracker)
|
||||
- [ ] Am I adding external API calls to Agent? (If yes → VIOLATION)
|
||||
- [ ] Am I adding container execution to API? (If yes → VIOLATION)
|
||||
- [ ] Am I bypassing API in Frontend? (If yes → VIOLATION)
|
||||
|
||||
---
|
||||
|
||||
## 🎯 The Golden Rules
|
||||
|
||||
1. **API orchestrates** (allocates resources, publishes state)
|
||||
2. **Agent executes** (installs, runs, monitors inside container)
|
||||
3. **Frontend displays** (no direct infrastructure access)
|
||||
|
||||
**Anything else** → Drift → Consult full tracker
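
The golden rules above can be expressed as a small lookup table. The following is a minimal sketch, not code from the project: the target labels (`"dns"`, `"java-install"`, and so on) and the `IsViolation` helper are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// forbidden maps each system to the targets it must never touch.
// The target labels are illustrative, not identifiers from the codebase.
var forbidden = map[string][]string{
	"api":      {"container-filesystem", "java-install", "artifact-download"},
	"agent":    {"port-allocation", "dns", "velocity", "proxmox", "vmid"},
	"frontend": {"agent", "proxmox", "dns", "velocity"},
}

// IsViolation reports whether system touching target crosses a boundary.
func IsViolation(system, target string) bool {
	for _, t := range forbidden[system] {
		if t == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(IsViolation("agent", "dns"))      // true: the API owns DNS
	fmt.Println(IsViolation("api", "dns"))        // false
	fmt.Println(IsViolation("frontend", "agent")) // true: Frontend talks to the API only
}
```

A table like this could back a lint step or a review checklist; it does not replace consulting the full tracker.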
|
||||
|
||||
---
|
||||
|
||||
## 📞 Quick Reference
|
||||
|
||||
**Full Tracker**: [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
|
||||
|
||||
**High-Risk Zones**:
|
||||
- Forge/NeoForge installation
|
||||
- Cloudflare SRV deletion
|
||||
- Velocity registration order
|
||||
- PortPool commit logic
|
||||
- Agent READY detection
|
||||
- IP selection logic
|
||||
|
||||
---
|
||||
|
||||
## ✅ Session Start Command
|
||||
|
||||
```
Read ZeroLagHub Cross-Project Tracker Quick Start Card.
Activate drift detection.
Confirm which system I'm working on: [API / Agent / Frontend]
Proceed with architecture-aligned implementation only.
```
|
||||
|
||||
---
|
||||
|
||||
🛡️ **Drift prevention ACTIVE. Proceed with confidence.**
|
||||
573
ZeroLagHub_GPT_Implementation_Handover_Dec2025.md
Normal file
@ -0,0 +1,573 @@
|
||||
# 🚀 ZeroLagHub - GPT Implementation Handover (December 2025)
|
||||
|
||||
**Last Updated**: December 7, 2025
|
||||
**Version**: 4.0 (Launch-Ready Implementation Guide)
|
||||
**Status**: 85% Platform Complete - Active Development Sprint
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Your Role (GPT - Implementation AI)
|
||||
|
||||
**You are**: The tactical implementation AI responsible for **building features**
|
||||
**Claude is**: The strategic architecture AI responsible for **design decisions**
|
||||
|
||||
### **Your Responsibilities**
|
||||
✅ Implement features within architectural boundaries
|
||||
✅ Write code for API, Agent, and Frontend
|
||||
✅ Fix bugs and optimize performance
|
||||
✅ Execute sprint tasks from Kanban board
|
||||
✅ Consult Cross-Project Tracker before crossing system boundaries
|
||||
|
||||
### **Your Constraints**
|
||||
❌ Do NOT make architectural decisions without Claude
|
||||
❌ Do NOT violate ownership boundaries (API vs Agent vs Frontend)
|
||||
❌ Do NOT change contracts without updating both sides
|
||||
❌ Do NOT skip drift prevention checks
|
||||
|
||||
---
|
||||
|
||||
## 📋 Critical Documents (READ THESE FIRST)
|
||||
|
||||
### **🛡️ MANDATORY Before ANY Code**
|
||||
1. **[Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md)** (30 seconds)
|
||||
- Quick boundary check for every session
|
||||
- Violation triggers to watch for
|
||||
|
||||
2. **[Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)** (consult before changes)
|
||||
- Ownership matrix (API vs Agent vs Frontend)
|
||||
- Canonical contracts (API ↔ Agent, Frontend ↔ API)
|
||||
- Drift detection rules with examples
|
||||
|
||||
### **📊 Implementation Context**
|
||||
3. **[Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md)** (5 minutes)
|
||||
- Engineering Kanban (DONE/IN PROGRESS/TODO)
|
||||
- 3-day sprint plan with tasks
|
||||
- Troubleshooting guides per variant
|
||||
- Launch readiness matrix
|
||||
|
||||
4. **Engineering Handover** (from today's uploaded document)
|
||||
- System lifecycle diagram
|
||||
- Provisioning sequence
|
||||
- Verification system specs
|
||||
- Today's accomplishments
|
||||
|
||||
### **🔧 Technical Reference**
|
||||
5. **[Agent Complete Spec](computer:///home/claude/ZeroLagHub_Agent_Complete_Spec.md)** (as needed)
|
||||
- Go agent implementation details
|
||||
- API endpoints and contracts
|
||||
|
||||
6. **[Infrastructure Specs](computer:///mnt/user-data/outputs/ZeroLagHub_Infrastructure_Specifications.md)** (as needed)
|
||||
- GTHost hardware constraints
|
||||
- Capacity planning
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Current Platform Status (December 7, 2025)
|
||||
|
||||
### **What's Working** ✅ (85% Complete)
|
||||
|
||||
**Core Provisioning Pipeline**:
|
||||
- ✅ All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge)
|
||||
- ✅ VMID allocation (sequential)
|
||||
- ✅ LXC container creation (template VMID 800)
|
||||
- ✅ IP detection (10.200.0.X)
|
||||
- ✅ Go agent deployment + self-repair
|
||||
- ✅ Java runtime auto-selection (17/21)
|
||||
- ✅ DNS automation (Cloudflare + Technitium)
|
||||
- ✅ Velocity proxy registration
|
||||
- ✅ Start/stop/restart control
|
||||
- ✅ Console command injection
|
||||
- ✅ Log tailing (HTTP polling)
|
||||
- ✅ Crash detection
|
||||
|
||||
**Supported Minecraft Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
|
||||
|
||||
### **What's Missing** ❌ (15% To-Do)
|
||||
|
||||
**Critical for Launch** (7-9 hours):
|
||||
- ❌ WebSocket console streaming (4-6 hours) - **HIGH PRIORITY**
|
||||
- ❌ Crash loop protection with backoff (2 hours) - **HIGH PRIORITY**
|
||||
- ❌ Disk space monitoring (1 hour) - **HIGH PRIORITY**
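
Crash loop protection with backoff is not implemented yet. One possible shape for it, sketched under assumptions: the function name `NextDelay`, the 5-second base, the 60-second cap, and the limit of 5 attempts are all hypothetical values, not decisions recorded in this document.

```go
package main

import (
	"fmt"
	"time"
)

// NextDelay returns the wait before restart attempt n (0-based). The delay
// doubles from base and is capped at ceiling; ok is false once maxAttempts
// restarts have been used and the server should stay down.
func NextDelay(n, maxAttempts int, base, ceiling time.Duration) (delay time.Duration, ok bool) {
	if n >= maxAttempts {
		return 0, false
	}
	d := base
	for i := 0; i < n && d < ceiling; i++ {
		d *= 2
	}
	if d > ceiling {
		d = ceiling
	}
	return d, true
}

func main() {
	// Hypothetical policy: 5 restarts, starting at 5s, capped at 60s.
	for n := 0; n <= 5; n++ {
		d, ok := NextDelay(n, 5, 5*time.Second, 60*time.Second)
		fmt.Println(n, d, ok)
	}
}
```

With these example values the delays are 5s, 10s, 20s, 40s, then the 60s cap, after which the function reports that attempts are exhausted. Since the Agent owns process management, logic like this would live in the Agent rather than the API.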
|
||||
|
||||
**Dev Platform** (3-day sprint):
|
||||
- 🔧 Dev container provisioning (Day 1 of sprint)
|
||||
- 🔧 EdgeState schema migration (Day 2 of sprint)
|
||||
- 🔧 Reconciliation job (Day 3 of sprint)
|
||||
|
||||
**Nice-to-Have** (future):
|
||||
- 📋 File upload/download
|
||||
- 📋 Backup/restore UI
|
||||
- 📋 Resource monitoring dashboard
|
||||
|
||||
---
|
||||
|
||||
## 🗺️ System Architecture (Know Your Boundaries)
|
||||
|
||||
### **Three-System Ownership**
|
||||
|
||||
```
┌─────────────────────────────────────────┐
│ NODE.JS API (Orchestrator)              │
│ You Own: Routes, services, job queue    │
│ Speaks To: Proxmox, DNS, Velocity, DB   │
│ Never: Installs Java, downloads files   │
└────────────┬────────────────────────────┘
             │ HTTP Contract
             │ POST /config, /start, /stop
             │ GET /status, /health
             ▼
┌─────────────────────────────────────────┐
│ GO AGENT (Container Manager)            │
│ You Own: Installation, verification     │
│ Speaks To: Filesystem, game process     │
│ Never: Allocates ports, creates DNS     │
└────────────┬────────────────────────────┘
             │ Status Polling
             ▼
┌─────────────────────────────────────────┐
│ NEXT.JS FRONTEND (UI Only)              │
│ You Own: Components, client state       │
│ Speaks To: API only                     │
│ Never: Agent, Proxmox, DNS, Velocity    │
└─────────────────────────────────────────┘
```
|
||||
|
||||
### **Critical Boundaries (NEVER CROSS)**
|
||||
|
||||
**API Must NOT**:
|
||||
- ❌ Install Java inside containers
|
||||
- ❌ Download game server files
|
||||
- ❌ Execute commands directly in containers (use Agent)
|
||||
|
||||
**Agent Must NOT**:
|
||||
- ❌ Allocate ports
|
||||
- ❌ Create DNS records
|
||||
- ❌ Register with Velocity
|
||||
- ❌ Talk to Proxmox API
|
||||
- ❌ Manage VMIDs
|
||||
|
||||
**Frontend Must NOT**:
|
||||
- ❌ Talk directly to Agent
|
||||
- ❌ Call Proxmox API
|
||||
- ❌ Create DNS records
|
||||
- ❌ Bypass API for any infrastructure
|
||||
|
||||
**Violation = STOP → Consult Cross-Project Tracker**
|
||||
|
||||
---
|
||||
|
||||
## 📅 3-Day Sprint Plan (Your Tasks)
|
||||
|
||||
### **Day 1: Dev Containers** (December 8)
|
||||
|
||||
**Goal**: Enable developer environment provisioning
|
||||
|
||||
**Tasks**:
|
||||
1. [ ] Define dev container spec (Python, Node, Go, Java)
|
||||
2. [ ] Create template VMID 6000 (base dev environment)
|
||||
3. [ ] API: Add `/api/dev-instances` endpoints (create, delete, status)
|
||||
4. [ ] Agent: Add dev provisioning flow (no game server start)
|
||||
5. [ ] Test: Provision Python + Node dev environments
|
||||
|
||||
**Success Criteria**:
|
||||
- Can provision dev environment with chosen language
|
||||
- Dev container accessible via SSH or web console
|
||||
- No game server logic triggered
|
||||
|
||||
**Files to Modify**:
|
||||
- `src/routes/devInstances.js` (new)
|
||||
- `src/services/devProvisioner.js` (new)
|
||||
- Agent: `dev.go` (new provisioning flow)
|
||||
|
||||
---
|
||||
|
||||
### **Day 2: EdgeState + DNS Reliability** (December 9)
|
||||
|
||||
**Goal**: Fix Cloudflare SRV deletion problem
|
||||
|
||||
**Tasks**:
|
||||
1. [ ] Implement EdgeState model in Prisma schema
|
||||
2. [ ] Update `edgePublisher.js` to store Cloudflare record IDs
|
||||
3. [ ] Update `dePublisher.js` to delete by record ID (not hostname)
|
||||
4. [ ] Test: Create → Delete → Recreate same hostname
|
||||
|
||||
**Success Criteria**:
|
||||
- No orphaned DNS records after deletion
|
||||
- EdgeState tracks all Cloudflare record IDs
|
||||
- Re-provisioning same hostname works
|
||||
|
||||
**Files to Modify**:
|
||||
- `prisma/schema.prisma` (add EdgeState model)
|
||||
- `src/services/edgePublisher.js`
|
||||
- `src/services/dePublisher.js`
|
||||
- `src/clients/cloudflareClient.js` (return record IDs)
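
The Day 2 idea (store record IDs at creation, delete by ID rather than by hostname) can be sketched as follows. This is an illustration only: the real implementation lives in the Node.js API with Prisma, and the `DNSClient` interface, the `EdgeState` map, and `fakeClient` are hypothetical stand-ins written in Go to match the other code in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// DNSClient is a hypothetical stand-in for the Cloudflare client.
type DNSClient interface {
	CreateRecord(hostname string) (recordID string, err error)
	DeleteRecord(recordID string) error
}

// EdgeState maps a hostname to the record IDs created for it.
type EdgeState map[string][]string

// Publish creates a record and remembers its ID.
func Publish(c DNSClient, st EdgeState, hostname string) error {
	id, err := c.CreateRecord(hostname)
	if err != nil {
		return err
	}
	st[hostname] = append(st[hostname], id)
	return nil
}

// Unpublish deletes exactly the records that were stored for hostname.
func Unpublish(c DNSClient, st EdgeState, hostname string) error {
	ids, ok := st[hostname]
	if !ok {
		return errors.New("no EdgeState for " + hostname)
	}
	for _, id := range ids {
		if err := c.DeleteRecord(id); err != nil {
			return err
		}
	}
	delete(st, hostname)
	return nil
}

// fakeClient is an in-memory DNSClient used only for this demo.
type fakeClient struct {
	next    int
	records map[string]string // recordID -> hostname
}

func (f *fakeClient) CreateRecord(h string) (string, error) {
	f.next++
	id := fmt.Sprintf("rec-%d", f.next)
	f.records[id] = h
	return id, nil
}

func (f *fakeClient) DeleteRecord(id string) error {
	if _, ok := f.records[id]; !ok {
		return errors.New("unknown record " + id)
	}
	delete(f.records, id)
	return nil
}

func main() {
	c := &fakeClient{records: map[string]string{}}
	st := EdgeState{}
	// Create -> Delete -> Recreate the same hostname.
	_ = Publish(c, st, "mc.example.com")
	_ = Unpublish(c, st, "mc.example.com")
	_ = Publish(c, st, "mc.example.com")
	fmt.Println(len(c.records), st["mc.example.com"]) // 1 [rec-2]
}
```

The sketch stops at the first failed deletion and leaves the stored IDs in place, so a retry is possible; a production version would also need to record which IDs were already removed.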
|
||||
|
||||
---
|
||||
|
||||
### **Day 3: Reconciliation + Hardening** (December 10)
|
||||
|
||||
**Goal**: Self-healing infrastructure
|
||||
|
||||
**Tasks**:
|
||||
1. [ ] Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity)
|
||||
2. [ ] Detect orphaned containers (in Proxmox but not DB)
|
||||
3. [ ] Detect orphaned DNS records (in DNS but not DB)
|
||||
4. [ ] Auto-cleanup with confirmation prompt
|
||||
5. [ ] Regression test suite
|
||||
|
||||
**Success Criteria**:
|
||||
- Reconciliation job detects all orphans
|
||||
- Can auto-clean with user confirmation
|
||||
- System recovers from partial failures
|
||||
|
||||
**Files to Create or Rewrite**:
|
||||
- `src/jobs/reconciliationJob.js`
|
||||
- `src/services/reconciler.js` (rewrite)
|
||||
- `tests/reconciliation.test.js`
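
Orphan detection is a set difference between what the database expects and what a backing system reports. A minimal sketch, assuming identifiers are compared as strings; the `Orphans` helper and the VMID values are hypothetical, and the real job would be written in the Node.js API.

```go
package main

import (
	"fmt"
	"sort"
)

// Orphans returns the items present in actual but absent from desired, sorted.
func Orphans(desired, actual []string) []string {
	want := make(map[string]struct{}, len(desired))
	for _, d := range desired {
		want[d] = struct{}{}
	}
	var out []string
	for _, a := range actual {
		if _, ok := want[a]; !ok {
			out = append(out, a)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	dbVMIDs := []string{"5001", "5002"}              // what the DB knows (example values)
	proxmoxVMIDs := []string{"5001", "5002", "5003"} // what Proxmox reports (example values)
	fmt.Println(Orphans(dbVMIDs, proxmoxVMIDs))      // [5003]
	fmt.Println(Orphans(proxmoxVMIDs, dbVMIDs))      // []
}
```

The same function works for DNS records or Velocity registrations by passing the corresponding lists. Cleanup would still go through the confirmation prompt listed in the tasks.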
|
||||
|
||||
---
|
||||
|
||||
## 🐛 Known Bugs (Fix These)
|
||||
|
||||
### **Go Agent Bugs** (Non-Blocking, But Should Fix)
|
||||
|
||||
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)

   ```go
   // REMOVE THIS - Forge doesn't create server.jar
   serverJarPath := filepath.Join(installDir, "*server.jar")
   ```

   **Fix**: Remove the glob/rename logic entirely.

2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)

   ```go
   // ADD ELSE HERE
   if variant == "forge" || variant == "neoforge" {
       // Forge logic
   } else { // <-- ADD THIS
       // Vanilla-like logic
   }
   ```

   **Fix**: Add the missing `else` so Forge/NeoForge provisioning does not fall through into the vanilla-like path.

3. **Forge Stop Command Exclusion** (`process.go` line 83)

   ```go
   // REMOVE THIS EXCLUSION - Forge accepts stop commands
   if p.variant != "forge" && p.variant != "neoforge" {
       p.sendCommand("stop")
   }
   ```

   **Fix**: Remove the `if` condition and send `stop` to all variants.
|
||||
|
||||
---
|
||||
|
||||
## 🚨 High-Risk Integration Zones (Careful!)
|
||||
|
||||
These areas have caused drift in past sessions:
|
||||
|
||||
1. **Forge/NeoForge Installation**
|
||||
- Agent owns ALL installation logic
|
||||
- API only passes config, never executes install commands
|
||||
|
||||
2. **Cloudflare SRV Deletion**
|
||||
- Must use record IDs (not hostname inference)
|
||||
- Store IDs in EdgeState on creation
|
||||
|
||||
3. **Velocity Registration Order**
|
||||
- Wait for Agent to report RUNNING state
|
||||
- Then register with Velocity (not before)
|
||||
|
||||
4. **Port Allocation**
|
||||
- API allocates → provisions → commits
|
||||
- Rollback on failure (don't leak ports)
|
||||
|
||||
5. **Agent READY Detection**
|
||||
- Use variant-aware log parsing
|
||||
- Forge takes 60-90s (don't timeout early)
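
The port allocation rule in item 4 (allocate, provision, commit, and roll back on failure) can be sketched as below. This is an in-memory illustration under assumptions: the `PortPool` type, its methods, and the port numbers are hypothetical, and the real allocator is part of the API and backed by the database.

```go
package main

import (
	"errors"
	"fmt"
)

// PortPool is a hypothetical in-memory pool used only for illustration.
type PortPool struct {
	free      []int
	reserved  map[int]bool
	committed map[int]bool
}

func NewPortPool(ports ...int) *PortPool {
	return &PortPool{free: ports, reserved: map[int]bool{}, committed: map[int]bool{}}
}

// Reserve takes a port out of the free list without committing it.
func (p *PortPool) Reserve() (int, error) {
	if len(p.free) == 0 {
		return 0, errors.New("port pool exhausted")
	}
	port := p.free[0]
	p.free = p.free[1:]
	p.reserved[port] = true
	return port, nil
}

// Commit marks a reserved port as permanently in use.
func (p *PortPool) Commit(port int) {
	if p.reserved[port] {
		delete(p.reserved, port)
		p.committed[port] = true
	}
}

// Rollback returns a reserved port to the free list.
func (p *PortPool) Rollback(port int) {
	if p.reserved[port] {
		delete(p.reserved, port)
		p.free = append(p.free, port)
	}
}

// ProvisionWithPort reserves a port, runs provision, then commits or rolls back.
func ProvisionWithPort(p *PortPool, provision func(port int) error) (int, error) {
	port, err := p.Reserve()
	if err != nil {
		return 0, err
	}
	if err := provision(port); err != nil {
		p.Rollback(port) // don't leak the port on failure
		return 0, err
	}
	p.Commit(port)
	return port, nil
}

func main() {
	pool := NewPortPool(25565, 25566)
	_, err := ProvisionWithPort(pool, func(int) error { return errors.New("clone failed") })
	fmt.Println(err, len(pool.free)) // clone failed 2
	port, _ := ProvisionWithPort(pool, func(int) error { return nil })
	fmt.Println(port, pool.committed[port], len(pool.free)) // 25566 true 1
}
```

The key property is that a port is never left in the reserved state after `ProvisionWithPort` returns, whichever branch is taken.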
|
||||
|
||||
---
|
||||
|
||||
## 📁 Key File Locations
|
||||
|
||||
### **API Service** (`/home/zlh/zlh-api-v2/`)
|
||||
```
src/
├── routes/
│   ├── containers.js          # Game server endpoints
│   └── devInstances.js        # Dev environment endpoints (TO CREATE)
├── services/
│   ├── edgePublisher.js       # DNS + Velocity publishing
│   ├── dePublisher.js         # Edge cleanup (NEEDS REWRITE)
│   ├── portAllocator.js       # Port management
│   └── reconciler.js          # Orphan detection (NEEDS REWRITE)
├── clients/
│   ├── cloudflareClient.js    # Cloudflare API (UPDATE for record IDs)
│   ├── technitiumClient.js    # Technitium DNS
│   └── proxmoxClient.js       # Proxmox API
└── jobs/
    └── reconciliationJob.js   # Self-healing job (TO CREATE)
```
|
||||
|
||||
### **Go Agent** (`/opt/zlh-agent/`)
|
||||
```
├── agent.go       # Main provisioning logic (FIX fallthrough)
├── artifacts.go   # Download + verification (REMOVE Forge glob)
├── process.go     # Server lifecycle (FIX Forge stop exclusion)
├── api.go         # HTTP server for control
└── dev.go         # Dev environment provisioning (TO CREATE)
```
|
||||
|
||||
### **Frontend** (`/home/zlh/zlh-portal/`)
|
||||
```
src/
├── app/
│   ├── containers/    # Game server UI
│   └── dev/           # Dev environment UI (TO CREATE)
└── components/
    └── Console.tsx    # WebSocket console (TO CREATE)
```
|
||||
|
||||
---
|
||||
|
||||
## 🎮 Minecraft Variant Status
|
||||
|
||||
| Variant | Install | Verify | Start | READY Detection | Status |
|---------|---------|--------|-------|-----------------|--------|
| Vanilla | ✅ | ✅ | ✅ | ✅ | Production |
| Paper | ✅ | ✅ | ✅ | ✅ | Production |
| Purpur | ✅ | ✅ | ✅ | ✅ | Production |
| Fabric | ✅ | ✅ | ✅ | ✅ | Production |
| Forge | ✅ | ✅ | ✅ | ✅ | Production (has bugs) |
| NeoForge | ✅ | ✅ | ✅ | ✅ | Production |
|
||||
|
||||
**All variants work** - 3 non-blocking bugs should be fixed for code quality.
|
||||
|
||||
---
|
||||
|
||||
## 🔄 Provisioning Flow (Know This)
|
||||
|
||||
```
1. User creates server via Frontend
   ↓
2. Frontend → API: POST /api/containers/create
   ↓
3. API allocates VMID, ports (if needed)
   ↓
4. API clones LXC from template VMID 800
   ↓
5. API configures container (IP, resources)
   ↓
6. API starts LXC
   ↓
7. API detects container IP (10.200.0.X)
   ↓
8. API → Agent: POST /config (payload)
   ↓
9. Agent saves payload.json
   ↓
10. Agent spawns install goroutine (async)
    ├─ Download Java
    ├─ Download game artifacts
    ├─ Verify installation
    └─ Self-repair if needed
    ↓
11. Agent starts server
    ↓
12. Agent detects READY (log parsing)
    ↓
13. Agent sets state = RUNNING
    ↓
14. API polls /status until RUNNING
    ↓
15. API saves to database
    ↓
16. API publishes DNS (Cloudflare + Technitium)
    ↓
17. API registers with Velocity
    ↓
18. API returns SUCCESS to user
    ↓
COMPLETE ✅
```
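
Step 14 (poll `/status` until RUNNING) is a wait loop with a deadline. The sketch below shows the pattern under assumptions: `WaitForRunning` is a hypothetical helper, the status source is injected as a function so the example runs without a network, and `"INSTALLING"` is an invented intermediate state name. The real polling is done by the Node.js API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// WaitForRunning calls getState every interval until it returns "RUNNING"
// or the timeout elapses. Errors from getState are treated as "not ready yet".
func WaitForRunning(getState func() (string, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		state, err := getState()
		if err == nil && state == "RUNNING" {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for RUNNING")
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Fake status source: reports RUNNING on the third poll.
	fake := func() (string, error) {
		calls++
		if calls < 3 {
			return "INSTALLING", nil // hypothetical intermediate state
		}
		return "RUNNING", nil
	}
	err := WaitForRunning(fake, 10*time.Millisecond, time.Second)
	fmt.Println(err, calls) // <nil> 3
}
```

Because Forge can take 60-90s to become ready, the timeout passed in practice needs to be comfortably above that. DNS publishing and Velocity registration only happen after this wait returns without error.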
|
||||
|
||||
---
|
||||
|
||||
## 🛡️ Drift Prevention (ACTIVE)
|
||||
|
||||
### **Before Writing ANY Code**
|
||||
|
||||
**Ask yourself**:
|
||||
1. Which system am I modifying? (API / Agent / Frontend)
|
||||
2. Does this cross boundaries? (If yes → read Cross-Project Tracker)
|
||||
3. Am I adding external calls to Agent? (If yes → VIOLATION)
|
||||
4. Am I adding container execution to API? (If yes → VIOLATION)
|
||||
5. Am I bypassing API in Frontend? (If yes → VIOLATION)
|
||||
|
||||
**If ANY doubt** → Stop and consult [Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)
|
||||
|
||||
### **Common Violations to Avoid**
|
||||
|
||||
❌ Agent allocating ports → API owns this
|
||||
❌ API installing Java → Agent owns this
|
||||
❌ Frontend calling Agent directly → Must go through API
|
||||
❌ Agent creating DNS → API owns this
|
||||
❌ API deciding Java version → Agent owns this (version-aware)
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Success Metrics (How You'll Be Measured)
|
||||
|
||||
### **Sprint Completion**
|
||||
- [ ] All 3 days of sprint tasks completed
|
||||
- [ ] Dev containers operational
|
||||
- [ ] EdgeState tracking DNS record IDs
|
||||
- [ ] Reconciliation job working
|
||||
|
||||
### **Code Quality**
|
||||
- [ ] No architectural violations (follow Cross-Project Tracker)
|
||||
- [ ] All 3 Go agent bugs fixed
|
||||
- [ ] Tests passing
|
||||
- [ ] No new drift introduced
|
||||
|
||||
### **Platform Readiness**
|
||||
- [ ] 95% launch-ready after sprint
|
||||
- [ ] All MC variants still working
|
||||
- [ ] No regressions from changes
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Testing Requirements
|
||||
|
||||
### **Before Committing Code**
|
||||
|
||||
**Test Each Variant**:
|
||||
```bash
# Test provisioning (one request per variant)
POST /api/containers/create {variant: "vanilla"}
POST /api/containers/create {variant: "paper"}
POST /api/containers/create {variant: "purpur"}
POST /api/containers/create {variant: "fabric"}
POST /api/containers/create {variant: "forge"}
POST /api/containers/create {variant: "neoforge"}

# Verify RUNNING state
GET /api/containers/:vmid/status
# Should return: {state: "RUNNING"}

# Test control
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/start
POST /api/containers/:vmid/restart

# Test cleanup
DELETE /api/containers/:vmid
# Verify no orphaned DNS records
```
|
||||
|
||||
**Test Dev Containers**:
|
||||
```bash
POST /api/dev-instances/create {language: "python"}
POST /api/dev-instances/create {language: "node"}

# Verify accessible: SSH into the dev container
# Should have language toolchain installed
```
|
||||
|
||||
---
|
||||
|
||||
## 📊 Infrastructure Constraints (Know Your Limits)
|
||||
|
||||
**Hardware** (GTHost Dedicated):
|
||||
- CPU: 12 cores / 24 threads (Intel Xeon Silver 4116)
|
||||
- RAM: 192 GB (can run 30-50 simultaneous 4GB servers)
|
||||
- Storage: 1.8 TB free (300-500 servers capacity)
|
||||
- Network: 300 Mbit/s (30-60 concurrent players)
|
||||
|
||||
**Current Allocation**:
|
||||
- 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk
|
||||
- Available for servers: 128 GB RAM, 1.8 TB disk
|
||||
|
||||
**Don't exceed these limits** - check before provisioning.
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Launch Decision (Context)
|
||||
|
||||
**Option A**: Launch NOW (85% ready, soft beta only)
|
||||
**Option B**: +3 Days Sprint (95% ready, RECOMMENDED)
|
||||
**Option C**: +1 Week (98% ready, over-engineering)
|
||||
|
||||
**Your role**: Execute Option B sprint (complete 3 days of tasks)
|
||||
|
||||
**After sprint**: Platform ready for professional launch
|
||||
|
||||
---
|
||||
|
||||
## 📞 Session Continuity
|
||||
|
||||
### **Starting Fresh Session (You)**
|
||||
|
||||
```
1. Read Drift Prevention Card (30 seconds)
   └─ Activate boundary awareness

2. Read Complete Current State (5 minutes)
   └─ Get Kanban state + sprint tasks

3. Consult Cross-Project Tracker before code
   └─ Verify no boundary violations

4. Execute sprint tasks with constraints active

5. Test all variants before committing
```
|
||||
|
||||
### **Handoff to Claude (Architecture Questions)**
|
||||
|
||||
If you encounter:
|
||||
- Architectural decisions (e.g., should we change contracts?)
|
||||
- Strategic questions (e.g., which features to prioritize?)
|
||||
- Business model questions
|
||||
- Major design changes
|
||||
|
||||
**STOP and escalate to Claude** for architectural guidance.
|
||||
|
||||
---
|
||||
|
||||
## ✅ Quick Reference
|
||||
|
||||
### **Your Mission**
|
||||
Execute 3-day sprint → Deliver 95% launch-ready platform
|
||||
|
||||
### **Your Boundaries**
|
||||
API orchestrates | Agent installs | Frontend displays
|
||||
|
||||
### **Your Critical Docs**
|
||||
1. Drift Prevention Card (session start)
|
||||
2. Cross-Project Tracker (before code)
|
||||
3. Complete Current State (sprint tasks)
|
||||
4. Engineering Handover (technical details)
|
||||
|
||||
### **Your Success**
|
||||
- [ ] 3-day sprint complete
|
||||
- [ ] No architectural violations
|
||||
- [ ] All variants still working
|
||||
- [ ] Tests passing
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Start Here (First Actions)
|
||||
|
||||
**Right Now**:
|
||||
1. ✅ Read [Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md) (30 sec)
|
||||
2. ✅ Read [Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md) (5 min)
|
||||
3. ✅ Check Engineering Kanban for current TODO
|
||||
4. ✅ Begin Day 1 sprint: Dev containers
|
||||
|
||||
**Remember**:
|
||||
- 🛡️ Drift prevention ACTIVE
|
||||
- 📋 Consult tracker before crossing boundaries
|
||||
- ✅ Test all variants before committing
|
||||
- 🚀 Goal: 95% launch-ready after 3 days
|
||||
|
||||
---
|
||||
|
||||
**Status**: You have everything you need to execute the sprint. Let's build! 🚀
|
||||
465
ZeroLagHub_Infrastructure_Specifications.md
Normal file
@ -0,0 +1,465 @@
|
||||
# 🏗️ ZeroLagHub Infrastructure - Complete Specifications
|
||||
|
||||
**Last Updated**: December 7, 2025
|
||||
**Provider**: GTHost
|
||||
**Cost**: $109/month
|
||||
**Status**: Production Infrastructure
|
||||
|
||||
---
|
||||
|
||||
## 🖥️ Dedicated Server Hardware
|
||||
|
||||
### **Server Platform**
|
||||
**Model**: Supermicro 2029TP-HC1R (hot-swap HDD chassis)
|
||||
**Form Factor**: Enterprise rackmount server
|
||||
|
||||
### **CPU**
|
||||
**Model**: Intel Xeon Silver 4116
|
||||
**Cores**: 12 cores / 24 threads
|
||||
**Clock Speed**: 2.1 GHz base, 3.0 GHz turbo
|
||||
**Architecture**: Intel Skylake-SP (Server Processor)
|
||||
**TDP**: 85W
|
||||
**Cache**: 16.5 MB L3
|
||||
|
||||
**Performance**:
|
||||
- Single-threaded: Excellent for game servers
|
||||
- Multi-threaded: Handles 11 VMs + 75-100 containers
|
||||
- Turbo boost ensures responsive provisioning
|
||||
|
||||
### **Memory (RAM)**
|
||||
**Configuration**: 6 x 32GB DIMMs
|
||||
**Total**: 192 GB
|
||||
**Type**: Hynix DDR4 RDIMM (Registered)
|
||||
**Speed**: 2400 MHz
|
||||
**ECC**: Yes (Error-Correcting Code)
|
||||
|
||||
**Benefits**:
|
||||
- ECC protects against memory corruption
|
||||
- Registered DIMMs = enterprise reliability
|
||||
- 192GB = ample headroom for VM overhead + game servers
|
||||
|
||||
### **Storage**
|
||||
**Configuration**: 2 x 1.92TB SSD
|
||||
**Model**: Samsung PM863 (Enterprise SSD)
|
||||
**Total Capacity**: 3.84 TB raw
|
||||
**Available**: ~1.8 TB (after Proxmox + VMs + overhead)
|
||||
**Interface**: SATA (likely in RAID configuration)
|
||||
|
||||
**Samsung PM863 Specs**:
|
||||
- Enterprise-grade datacenter SSD
|
||||
- Optimized for mixed workload
|
||||
- Power Loss Protection (PLP)
|
||||
- High endurance rating
|
||||
|
||||
**RAID Configuration** (Likely):
|
||||
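- RAID 1 (mirrored) for redundancy (note: mirroring two 1.92 TB drives would leave ~1.92 TB usable, not the ~3.8 TB assumed in the partition layout below)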
|
||||
- Or ZFS RAID-Z for Proxmox storage
|
||||
|
||||
### **Network**
|
||||
**Bandwidth**: 300 Mbit/s (37.5 MB/s)
|
||||
**Metering**: Unmetered (no bandwidth caps)
|
||||
**Uplink**: Enterprise datacenter connection
|
||||
|
||||
**Performance Assessment**:
|
||||
- 300 Mbit/s = 30-40 concurrent Minecraft players comfortably
|
||||
- Unmetered = no surprise overage charges
|
||||
- Sufficient for soft launch, may need upgrade at scale
|
||||
|
||||
### **Operating System**
|
||||
**OS**: Proxmox VE 8 (64-bit)
|
||||
**Base**: Debian 12 "Bookworm"
|
||||
**Kernel**: Linux 6.2+
|
||||
|
||||
---
|
||||
|
||||
## 📊 Partition Layout
|
||||
|
||||
### **Disk Partitions**
|
||||
```
/boot:  1024 MB (1 GB)  - Boot partition
/swap:  2048 MB (2 GB)  - Swap space
/root:  Auto size       - Main Proxmox system + VM storage
                          (~3.8 TB usable after RAID)
```
|
||||
|
||||
**Storage Allocation** (Estimated):
|
||||
```
Total:           3.84 TB raw (2 x 1.92TB)
RAID overhead:  -0.04 TB (metadata, alignment)
                --------
Available:       3.80 TB

Proxmox OS:     -0.10 TB (Proxmox + system)
VM Disks:       -1.80 TB (11 VMs, templates)
LXC Containers: -0.10 TB (current containers)
                --------
Free Space:      1.80 TB (for game servers + growth)
```
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Performance Characteristics
|
||||
|
||||
### **CPU Capabilities**
|
||||
|
||||
**Per-Core Performance**: ⭐⭐⭐⭐ (Excellent for game servers)
|
||||
- Minecraft servers are largely single-threaded (the main game tick runs on one thread)
|
||||
- 3.0 GHz turbo provides snappy performance
|
||||
- 12 cores = 12 simultaneous high-performance game servers
|
||||
|
||||
**Multi-Core Throughput**: ⭐⭐⭐⭐⭐ (Excellent for hosting)
|
||||
- 24 threads handle VM overhead efficiently
|
||||
- Can run 11 Proxmox VMs + 75-100 LXC containers
|
||||
- Provisioning operations don't impact running servers
|
||||
|
||||
**Virtualization**: ⭐⭐⭐⭐⭐ (Native support)
|
||||
- Intel VT-x + VT-d enabled
|
||||
- Hardware-accelerated virtualization
|
||||
- LXC containers = near-native performance
|
||||
|
||||
### **Memory Capabilities**
|
||||
|
||||
**Capacity**: 192 GB = **Excellent** for scale
|
||||
- Average Minecraft server: 2-4 GB
|
||||
- 192 GB / 4 GB = **48 simultaneous 4GB servers** in theory (32 once the 56 GB VM allocation and 8 GB hypervisor overhead are subtracted)
|
||||
- Or 96 lightweight 2GB servers
|
||||
|
||||
**Speed**: 2400 MHz DDR4 = **Good** (not bleeding edge, but sufficient)
|
||||
- DDR4-2400 provides adequate bandwidth for game hosting
|
||||
- ECC ensures data integrity under load
|
||||
|
||||
**Reliability**: ECC RDIMM = **Enterprise-grade**
|
||||
- Detects and corrects memory errors
|
||||
- Critical for 24/7 uptime
|
||||
|
||||
### **Storage Capabilities**
|
||||
|
||||
**Capacity**: 3.84 TB = **Very Good** for initial scale
|
||||
- Each Minecraft server: 1-5 GB (world size varies)
|
||||
- Can host 300-500 Minecraft servers comfortably
|
||||
- 1.8 TB free = room for significant growth
|
||||
|
||||
**Performance**: Samsung PM863 = **Excellent** for workload
|
||||
- Random IOPS: ~10,000 read, ~2,000 write
|
||||
- Sequential: 520 MB/s read, 485 MB/s write
|
||||
- Perfect for database + game world I/O
|
||||
|
||||
**Reliability**: Enterprise SSD = **Excellent**
|
||||
- Power Loss Protection prevents corruption
|
||||
- Rated for 1.3 PB writes (years of 24/7 operation)
|
||||
- RAID 1 (likely) provides redundancy
|
||||
|
||||
### **Network Capabilities**
|
||||
|
||||
**Bandwidth**: 300 Mbit/s = **Adequate for soft launch**
|
||||
- Minecraft player: ~0.5-1 Mbit/s
|
||||
- 300 Mbit/s ÷ 0.5-1 Mbit/s ≈ 300-600 players in theory; 30-60 players is a deliberately conservative planning figure (about 10x headroom)
|
||||
- Unmetered = no bandwidth overage charges
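
The bandwidth estimate above is simple division. The sketch below works it through, assuming integer kbit/s units; the `MaxPlayers` helper and the safety factor of 10 are hypothetical, the factor being an inference from the gap between the per-player figures and the 30-60 player estimate rather than a number stated in this document.

```go
package main

import "fmt"

// MaxPlayers divides link capacity by per-player usage (both in kbit/s),
// then divides by a safety factor to leave headroom.
func MaxPlayers(linkKbps, perPlayerKbps, safetyFactor int) int {
	return linkKbps / perPlayerKbps / safetyFactor
}

func main() {
	const link = 300_000 // 300 Mbit/s expressed in kbit/s

	// Raw division, no headroom: 1 Mbit/s and 0.5 Mbit/s per player.
	fmt.Println(MaxPlayers(link, 1000, 1), MaxPlayers(link, 500, 1)) // 300 600

	// With an assumed 10x safety factor.
	fmt.Println(MaxPlayers(link, 1000, 10), MaxPlayers(link, 500, 10)) // 30 60
}
```

Measured per-player bandwidth from the monitoring VM would be a better input than these estimates once real traffic exists.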
|
||||
|
||||
**Upgrade Path**: GTHost likely offers 1 Gbps upgrades
|
||||
- 1 Gbps would support 100-200 players
|
||||
- Consider upgrade when approaching 40+ concurrent players
|
||||
|
||||
---
|
||||
|
||||
## 🏗️ Current Resource Allocation
|
||||
|
||||
### **VM Resource Breakdown** (11 VMs)
|
||||
|
||||
```
Hypervisor Overhead: ~8 GB RAM, 2 CPU cores

Critical Production:
├─ VM 100 (zlh-panel):      4 GB RAM, 2 cores, 32 GB disk
├─ VM 103 (zlh-api):        4 GB RAM, 2 cores, 32 GB disk
└─ VM 101 (zlh-wings):      8 GB RAM, 4 cores, 64 GB disk

Platform Services:
├─ VM 102 (zlh-portal):     4 GB RAM, 2 cores, 32 GB disk
├─ VM 104 (zlh-monitor):    8 GB RAM, 2 cores, 64 GB disk
└─ VM 1002 (zlh-proxy):     2 GB RAM, 1 core, 16 GB disk

Network Layer:
├─ VM 1000 (zlh-router):    4 GB RAM, 2 cores, 32 GB disk
├─ VM 1006 (zpack-router):  4 GB RAM, 2 cores, 32 GB disk
└─ VM 1001 (zlh-dns):       2 GB RAM, 1 core, 16 GB disk

Development/Support:
├─ VM 300 (zlh-panel-dev):  4 GB RAM, 2 cores, 32 GB disk
└─ VM 2000 (zlh-ci):        4 GB RAM, 2 cores, 32 GB disk

Backup:
└─ VM [zlh-back]:           8 GB RAM, 2 cores, 128 GB disk

TOTAL VM ALLOCATION: 56 GB RAM, 24 cores, ~512 GB disk
```
|
||||
|
||||
### **Available for Game Servers**
|
||||
|
||||
```
RAM Available:  192 GB - 56 GB (VMs) - 8 GB (overhead) = 128 GB
CPU Available:  24 threads - 24 (VM allocation) = 0 (shared)
Disk Available: 1.8 TB free

Game Server Capacity (Conservative):
├─ 2GB servers: 64 simultaneous servers
├─ 4GB servers: 32 simultaneous servers
└─ 8GB servers: 16 simultaneous servers

Developer Environment Capacity:
├─ 2GB dev envs: 64 simultaneous environments
└─ 4GB dev envs: 32 simultaneous environments
```
|
||||
|
||||
**Note**: CPU is oversubscribed (common in hosting) since most game servers idle at <20% CPU usage. Turbo boost ensures good single-thread performance when needed.
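
The RAM-based capacity figures above follow from one subtraction and one division. A small sketch that reproduces them, using the totals given in this document; the `ServersFor` helper is a hypothetical name.

```go
package main

import "fmt"

// ServersFor returns how many servers of perServerGB fit in availableGB.
func ServersFor(availableGB, perServerGB int) int {
	return availableGB / perServerGB
}

func main() {
	const totalGB, vmGB, overheadGB = 192, 56, 8 // figures from this document
	available := totalGB - vmGB - overheadGB
	fmt.Println("available GB:", available) // available GB: 128
	for _, size := range []int{2, 4, 8} {
		fmt.Printf("%d GB servers: %d\n", size, ServersFor(available, size))
	}
}
```

This prints 64, 32, and 16 for the 2 GB, 4 GB, and 8 GB cases. It models RAM only; CPU contention and network limits are not captured, so the result is an upper bound on memory grounds rather than a safe operating point.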
|
||||
|
||||
---
|
||||
|
||||
## 📈 Capacity & Scaling Projections
|
||||
|
||||
### **Current Capacity** (As Deployed)
|
||||
|
||||
**Game Servers**: 30-50 active servers with current VM allocation
|
||||
**Developer Environments**: 75-100 environments (documented capacity)
|
||||
**Concurrent Players**: 30-60 players (network limited)
|
||||
|
||||
### **Optimized Capacity** (With Tuning)
|
||||
|
||||
**Game Servers**: 60-80 active servers (after VM consolidation)
|
||||
**Developer Environments**: 100-150 environments
|
||||
**Concurrent Players**: Still 30-60 (network bottleneck)
|
||||
|
||||
### **Maximum Theoretical Capacity**
|
||||
|
||||
**Game Servers**: 128 lightweight servers (if only game hosting, no dev)
|
||||
**Developer Environments**: 192 environments (if only dev, no games)
|
||||
**Storage**: 300-500 servers before storage exhaustion
|
||||
|
||||
**Limiting Factors**:
|
||||
1. **Network** (300 Mbit/s) - limits concurrent players
|
||||
2. **RAM** (192 GB) - limits concurrent heavy servers
|
||||
3. **Storage** (1.8 TB free) - limits total servers
|
||||
|
||||
---
|
||||
|
||||
## 💰 Cost Analysis
|
||||
|
||||
### **Current Infrastructure Cost**
|
||||
|
||||
**Monthly**: $109 GTHost dedicated server
|
||||
**Annually**: $1,308
|
||||
|
||||
**Cost per Resource**:
|
||||
- Per GB RAM: $0.57/month ($109 ÷ 192 GB)
|
||||
- Per CPU core: $9.08/month ($109 ÷ 12 cores)
|
||||
- Per TB storage: $28.39/month ($109 ÷ 3.84 TB)
|
||||
|
||||
### **Competitive Analysis**
|
||||
|
||||
**AWS Equivalent** (m5.2xlarge + storage + bandwidth):
|
||||
- 8 vCPU, 32 GB RAM, 1 TB storage, 1 Gbps
|
||||
- Cost: ~$300-400/month
|
||||
|
||||
**Hetzner Dedicated** (Similar specs):
|
||||
- 12 core Xeon, 128 GB RAM, 2x2TB SSD
|
||||
- Cost: ~$100/month (but higher network costs)
|
||||
|
||||
**GTHost Value**: ⭐⭐⭐⭐⭐ Excellent
|
||||
- Roughly 64-73% cheaper than the AWS estimate above ($109 vs. $300-400/month)
|
||||
- Competitive with Hetzner
|
||||
- Unmetered bandwidth (key advantage)
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Competitive Advantages
|
||||
|
||||
### **1. LXC Performance**
|
||||
- Host hardware enables 20-30% better performance vs Docker
|
||||
- Intel Xeon Silver 4116 single-thread performance excellent for games
|
||||
|
||||
### **2. Resource Density**
|
||||
- 192 GB RAM supports 30-50 simultaneous 4GB servers
|
||||
- Competitors typically offer 64-128 GB at this price point
|
||||
|
||||
### **3. Storage Performance**
|
||||
- Samsung PM863 enterprise SSDs outperform consumer SSDs
|
||||
- Power Loss Protection prevents world corruption
|
||||
- Hot-swap chassis enables maintenance without downtime
|
||||
|
||||
### **4. Network**
|
||||
- Unmetered = no bandwidth surprises
|
||||
- 300 Mbit/s adequate for soft launch
|
||||
- Upgrade path available when needed
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Identified Constraints
|
||||
|
||||
### **1. Network Bandwidth** (Current Bottleneck)
|
||||
- **300 Mbit/s limits to 30-60 concurrent players**
|
||||
- **Recommendation**: Monitor bandwidth usage, upgrade to 1 Gbps when approaching 40 players
|
||||
- **Upgrade Cost**: Likely +$20-50/month for 1 Gbps
|
||||
|
||||
### **2. CPU Oversubscription**
|
||||
- 24 threads allocated to VMs, but most VMs idle
|
||||
- Game servers share CPU via time-slicing
|
||||
- **Risk**: If all servers spike simultaneously, performance degrades
|
||||
- **Mitigation**: Limit concurrent servers to 40-50 until load testing proves higher safe
|
||||
|
||||
### **3. Storage Growth**
|
||||
- 1.8 TB free supports 300-500 servers
|
||||
- Each server grows over time (world expansion)
|
||||
- **Recommendation**: Monitor disk usage, plan expansion at 70% utilization
|
||||
- **Expansion Options**: Add external storage or upgrade to larger SSDs
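
The 70% planning threshold can be checked with a few lines of arithmetic. The sketch below assumes the 3.84 TB total and 1.8 TB free figures quoted in this analysis; the function names are hypothetical.

```go
package main

// Figures from this analysis: 3.84 TB total, 1.8 TB free, expand at 70% used.
const (
	totalTB            = 3.84
	freeTB             = 1.8
	expansionThreshold = 0.70
)

// Utilization returns the fraction of storage in use (0.0 to 1.0).
func Utilization(total, free float64) float64 {
	return (total - free) / total
}

// NeedsExpansion reports whether usage has reached the planning threshold.
func NeedsExpansion(total, free, threshold float64) bool {
	return Utilization(total, free) >= threshold
}

// PerServerGB returns the average disk budget per server in GB if the free
// space is split evenly across the given number of servers (1 TB = 1000 GB).
func PerServerGB(freeTB float64, servers int) float64 {
	return freeTB * 1000 / float64(servers)
}
```

With the current figures, utilization is about 53%, so the threshold has not been reached. The 300-500 server estimate corresponds to an average budget of roughly 3.6-6 GB per server.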
|
||||
|
||||
---
|
||||
|
||||
## 🔧 Optimization Opportunities
|
||||
|
||||
### **Immediate Optimizations** (No Cost)
|
||||
|
||||
1. **VM Consolidation**
|
||||
- Merge zlh-panel-dev into zlh-panel (save 4 GB RAM, 2 cores)
|
||||
- Merge zlh-proxy into zlh-router (save 2 GB RAM, 1 core)
|
||||
- **Gain**: 6 GB RAM, 3 cores for game servers
|
||||
|
||||
2. **LXC Over VMs**
|
||||
- Convert lightweight VMs to LXC containers
|
||||
- Example: zlh-dns, zlh-proxy candidates
|
||||
- **Gain**: Lower overhead, faster provisioning
|
||||
|
||||
3. **Memory Deduplication (KSM)**
- Enable KSM (Kernel Samepage Merging) on Proxmox
- Deduplicate identical memory pages (mainly benefits KVM VMs)
|
||||
- **Gain**: 5-10% more available RAM
|
||||
|
||||
### **Paid Optimizations** (Consider at Scale)
|
||||
|
||||
1. **Network Upgrade**: 1 Gbps uplink (+$20-50/month)
|
||||
- Removes player concurrency bottleneck
|
||||
- Enables 100-200 player capacity
|
||||
|
||||
2. **Storage Expansion**: Add 4TB NVMe (+$50/month)
|
||||
- Roughly doubles storage to ~7.8 TB total (~5.8 TB free)
|
||||
- Supports 600-1000 servers
|
||||
|
||||
3. **Cloudflare Enterprise** (+$200/month)
|
||||
- DDoS protection for game traffic
|
||||
- CDN for static assets
|
||||
- Worth it at 100+ servers
|
||||
|
||||
---
|
||||
|
||||
## 📊 Hardware Lifecycle
|
||||
|
||||
### **Current Status** (December 2025)
|
||||
|
||||
**Server Age**: Unknown (the Xeon Silver 4116 launched in 2017, so the hardware could be up to ~8 years old)
**Expected Lifespan**: 5-7 years for enterprise server
**Remaining Life**: Uncertain until the actual deployment date is confirmed with GTHost
|
||||
|
||||
**Components**:
|
||||
- CPU: Xeon Silver 4116 (2017 release) - still very capable
|
||||
- RAM: DDR4-2400 (previous generation, but still widely supported)
|
||||
- SSD: Samsung PM863 (enterprise grade, high endurance)
|
||||
|
||||
### **Upgrade Path** (Future)
|
||||
|
||||
**Year 1-2** (Current plan):
|
||||
- Optimize existing hardware
|
||||
- Minor network upgrades if needed
|
||||
|
||||
**Year 3-4** (Growth phase):
|
||||
- Consider second dedicated server
|
||||
- Load balance across servers
|
||||
- Geographic distribution
|
||||
|
||||
**Year 5+** (Scale phase):
|
||||
- Migrate to colocation or cloud
|
||||
- Multi-datacenter deployment
|
||||
|
||||
---
|
||||
|
||||
## 🛡️ Reliability Features
|
||||
|
||||
### **Hardware Reliability**
|
||||
|
||||
✅ **ECC Memory** - Corrects single-bit errors automatically
|
||||
✅ **Enterprise SSDs** - Power Loss Protection, high endurance
|
||||
✅ **Hot-Swap Chassis** - Replace drives without shutdown
|
||||
✅ **Redundant Power** (likely) - Supermicro chassis typically dual PSU
|
||||
|
||||
### **Software Reliability**
|
||||
|
||||
⚠️ **Proxmox High Availability** - VM failover (requires a multi-node cluster; not available on a single server)
|
||||
✅ **PBS Backup** - Incremental backups to Backblaze B2
|
||||
✅ **LXC Snapshots** - Fast rollback capability
|
||||
✅ **RAID Mirroring** (likely) - Disk failure protection
|
||||
|
||||
### **Network Reliability**
|
||||
|
||||
✅ **Datacenter Uptime** - GTHost likely 99.9%+ SLA
|
||||
✅ **Unmetered Bandwidth** - No throttling during spikes
|
||||
⚠️ **Single Uplink** - No network redundancy (acceptable for price point)
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Summary & Recommendations
|
||||
|
||||
### **Hardware Assessment**: ⭐⭐⭐⭐ Very Good for Use Case
|
||||
|
||||
**Strengths**:
|
||||
- Excellent CPU for game server hosting (Xeon Silver 4116)
|
||||
- Abundant RAM (192 GB = 30-50 servers)
|
||||
- Enterprise storage (Samsung PM863 + hot-swap)
|
||||
- Unmetered bandwidth (no surprise charges)
|
||||
- Great value ($109/month for these specs)
|
||||
|
||||
**Limitations**:
|
||||
- Network bandwidth (300 Mbit/s = 30-60 players)
|
||||
- Storage growth constraint (monitor usage)
|
||||
- CPU oversubscription (limit concurrent servers initially)
|
||||
|
||||
### **Recommendations**
|
||||
|
||||
**Now** (Launch Phase):
|
||||
1. ✅ Deploy on current hardware - adequate for soft launch
|
||||
2. ✅ Limit to 40-50 concurrent servers initially
|
||||
3. ✅ Monitor bandwidth, RAM, and disk usage
|
||||
|
||||
**Month 1-3** (Early Growth):
|
||||
1. 🔧 Optimize VM allocation (consolidate where possible)
|
||||
2. 🔧 Implement aggressive monitoring
|
||||
3. 🔧 Consider 1 Gbps network upgrade if approaching 40 players
|
||||
|
||||
**Month 6-12** (Scale Phase):
|
||||
1. 📈 Evaluate storage expansion based on usage
|
||||
2. 📈 Consider second server for geographic distribution
|
||||
3. 📈 Implement Cloudflare Enterprise for DDoS protection
|
||||
|
||||
### **Capacity Targets by Phase**
|
||||
|
||||
**Soft Launch** (Month 1-3): 20-30 servers, 10-20 players
|
||||
**Public Launch** (Month 3-6): 40-50 servers, 30-40 players
|
||||
**Growth Phase** (Month 6-12): 60-80 servers, 60-100 players (with 1 Gbps upgrade)
|
||||
**Scale Phase** (Month 12+): 100+ servers, multi-server deployment
|
||||
|
||||
---
|
||||
|
||||
## ✅ Conclusion
|
||||
|
||||
**Status**: Infrastructure is **production-ready** for ZeroLagHub launch.
|
||||
|
||||
**Key Points**:
|
||||
- Hardware specifications are excellent for initial scale
|
||||
- 192 GB RAM supports 30-50 game servers
|
||||
- Storage capacity adequate for 300-500 servers
|
||||
- Network bandwidth is current bottleneck (acceptable for soft launch)
|
||||
- Cost-effective ($109/month for enterprise-grade hardware)
|
||||
|
||||
**Green Light**: ✅ Launch when platform development complete (currently 85% ready).
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: December 7, 2025
|
||||
**Source**: GTHost server specifications + ZeroLagHub infrastructure analysis
|
||||
606
ZeroLagHub_Master_Bootstrap_Dec2025.md
Normal file
@ -0,0 +1,606 @@
|
||||
# 🚀 ZeroLagHub - Master Bootstrap Document (December 2025)
|
||||
|
||||
**Last Updated**: December 7, 2025
|
||||
**Version**: 4.0 (Platform Launch Ready)
|
||||
**Status**: 85% Complete - Launch Decision Point
|
||||
|
||||
---
|
||||
|
||||
## 📌 Quick Start for New AI Sessions
|
||||
|
||||
> **Resume Point**: Platform 85% launch-ready, all core provisioning operational, critical UX features needed.
>
> **Current Phase**: Launch readiness assessment - choose NOW vs +1 week vs +1 month
>
> **Critical Context**: All 6 Minecraft variants provisioning successfully via Go agent. Need WebSocket console, crash protection, and disk monitoring for competitive parity.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Project Overview
|
||||
|
||||
**ZeroLagHub** is a developer-focused game server hosting platform built on:
|
||||
- **Proxmox VE** with LXC containers (20-30% performance advantage over Docker)
|
||||
- **Hybrid Architecture**: Pterodactyl panel + Custom Node.js API + Go provisioning agent
|
||||
- **Velocity proxy** for seamless Minecraft routing
|
||||
- **Dual-router architecture** for traffic separation
|
||||
- **Developer-to-player revenue pipeline** with 9.75x revenue multiplier
|
||||
|
||||
### Core Value Proposition
|
||||
Complete dev-to-production pipeline: Development environments ($20/mo) → Testing servers (50% discount) → Player hosting (25% discount) → Revenue sharing (7.5% commission) = viral growth through developer ecosystem.
|
||||
|
||||
---
|
||||
|
||||
## 🏗️ Current Architecture (December 2025)
|
||||
|
||||
### Infrastructure Overview (12 VMs)
|
||||
|
||||
```
Critical Production:
├── VM 100 (zlh-panel) - Pterodactyl panel + OAuth customization
├── VM 103 (zlh-api) - Node.js backend + developer platform APIs
├── VM 101 (zlh-wings) - Game servers + LXC integration target

Platform Services:
├── VM 102 (zlh-portal) - Next.js frontend + developer dashboard
├── VM 104 (zlh-monitor) - Prometheus/Grafana monitoring

Network & Infrastructure:
├── VM 1000 (zlh-router) - Platform services routing + VLANs
├── VM 1006 (zpack-router) - Game traffic routing + Velocity
├── VM 1001 (zlh-dns) - Technitium DNS + development domains
├── VM 1002 (zlh-proxy) - Caddy reverse proxy + SSL automation
├── VM 300 (zlh-panel-dev) - Development environment + testing
├── VM 2000 (zlh-ci) - CI/CD pipeline + automation
└── VM [zlh-back] - PBS backup + Backblaze B2 replication
```
|
||||
|
||||
### Network Topology
|
||||
|
||||
```
zlh-router (VM 1000):
├─ WAN1: Platform services (API, portal, monitoring)
├─ CORE_LAN: 10.60.0.0/24 (internal services)
├─ MGMT_LAN: 172.60.0.10/24 (inter-router communication)
└─ WireGuard: Admin access

zpack-router (VM 1006):
├─ WAN2: 139.64.165.248 (game services)
├─ ZPACK_LAN: 10.70.0.0/24 (Velocity @ 10.70.0.241)
├─ DEV_LAN: 10.100.0.0/24 (developer environments - future)
├─ GAME_LAN: 10.200.0.0/24 (game server LXCs)
└─ MGMT_LAN: 172.60.0.20/24 (control plane communication)
```
|
||||
|
||||
### Traffic Flows
|
||||
|
||||
- **Platform Access**: Client → WAN1 → zlh-router → Frontend/API
|
||||
- **Game Play**: Player → WAN2 (139.64.165.248) → zpack-router → Velocity (10.70.0.241) → Game Server (10.200.0.X)
|
||||
- **Control Plane**: API → MGMT_LAN (172.60.0.X) → Velocity/DNS/Monitoring
|
||||
|
||||
---
|
||||
|
||||
## ✅ What's Working (December 7, 2025)
|
||||
|
||||
### Provisioning Pipeline (100% Operational)
|
||||
|
||||
| Component | Status | Notes |
|-----------|--------|-------|
| **LXC Container Creation** | ✅ | Template VMID 800, auto-cloning working |
| **VMID Allocation** | ✅ | Sequential assignment from range |
| **IP Detection** | ✅ | Automatic network configuration |
| **Go Agent Deployment** | ✅ | Payload delivery + self-repair system |
| **Java Runtime Selection** | ✅ | Auto-detect MC version → Java 17/21 |
| **All 6 MC Variants** | ✅ | Vanilla, Paper, Purpur, Fabric, Forge, NeoForge |
| **Server Startup** | ✅ | All variants start successfully |
| **DNS Publishing** | ✅ | Cloudflare + Technitium A + SRV records |
| **Velocity Registration** | ✅ | Dynamic backend server registration |
| **Client Connectivity** | ✅ | Players can connect and play |
|
||||
|
||||
### Control Functions
|
||||
|
||||
| Function | Status | Implementation |
|----------|--------|----------------|
| **Start/Stop/Restart** | ✅ | HTTP API → Go agent |
| **Console Commands** | ✅ | Command injection working |
| **Log Tailing** | ⚠️ | HTTP polling only (need WebSocket) |
| **Status Reporting** | ✅ | Agent emits RUNNING state |
| **Crash Detection** | ✅ | Agent tracks exit codes |
|
||||
|
||||
### Game Support Matrix
|
||||
|
||||
**Launch Ready (Minecraft Only)**:
|
||||
- ✅ **Vanilla** - Official Mojang server
|
||||
- ✅ **Paper** - Primary recommendation (vanilla + plugins)
|
||||
- ✅ **Purpur** - Paper fork with extra features
|
||||
- ✅ **Fabric** - Lightweight mod support
|
||||
- ✅ **Forge** - Heavy mod support (tech/magic mods)
|
||||
- ✅ **NeoForge** - Modern Forge fork (**competitive advantage**)
|
||||
|
||||
**Supported Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
|
||||
|
||||
**Deferred to Post-Launch**:
|
||||
- 📋 Terraria
|
||||
- 📋 Project Zomboid
|
||||
- 📋 Valheim
|
||||
- 📋 Rust
|
||||
|
||||
---
|
||||
|
||||
## 🚨 Known Issues & Gaps
|
||||
|
||||
### Known Bugs (Non-Blocking, System Works)
|
||||
|
||||
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
|
||||
- Tries to find `*server.jar` but Forge ≥1.17 doesn't create this
|
||||
- **Fix**: Remove glob/rename logic (Forge uses `run.sh` + `libraries/`)
|
||||
- **Impact**: System works, but unnecessary code
|
||||
|
||||
2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
|
||||
- After Forge check, falls through to check `server.jar`
|
||||
- **Fix**: Add `else` to prevent fallthrough
|
||||
- **Impact**: Minor efficiency issue
|
||||
|
||||
3. **Forge Stop Command Exclusion** (`process.go` line 83)
|
||||
- Excludes Forge from receiving `stop` command
|
||||
- **Fix**: Remove exclusion (Forge accepts stop commands)
|
||||
- **Impact**: Manual workaround needed for Forge stops
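
The fallthrough described in bug 2 can be avoided by returning from the Forge branch instead of continuing to the `server.jar` check. The sketch below is hypothetical: the real `ensureProvisioned()` in `agent.go` is not reproduced in this document, so `isProvisioned`, its parameters, and the injected `exists` helper are assumptions made for illustration.

```go
package main

import "path/filepath"

// isProvisioned is a hypothetical stand-in for the agent's provisioning
// check. exists is injected so the logic can be tested without touching disk.
func isProvisioned(variant, dir string, exists func(string) bool) bool {
	switch variant {
	case "forge":
		// Forge >= 1.17 installs run.sh and libraries/ instead of server.jar.
		// Returning here prevents the fallthrough to the jar check below.
		return exists(filepath.Join(dir, "run.sh")) &&
			exists(filepath.Join(dir, "libraries"))
	default:
		// Other variants are checked for server.jar, as in the bug report.
		return exists(filepath.Join(dir, "server.jar"))
	}
}
```

Using a `switch` with an explicit `return` per branch makes each variant's check mutually exclusive, which has the same effect as the `else` suggested in the fix.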
|
||||
|
||||
### Missing Competitive Features (CRITICAL)
|
||||
|
||||
| Feature | Apex | Shockbyte | ZeroLagHub | Priority |
|---------|------|-----------|------------|----------|
| **All MC Variants** | ✅ | ✅ | ✅ | - |
| **NeoForge** | ❌ | ❌ | ✅ | ADVANTAGE |
| **Performance** | 🟡 | 🟡 | ✅ | ADVANTAGE |
| **Console Streaming** | ✅ | ✅ | ❌ | 🔴 HIGH |
| **File Management** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
| **Backups** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
| **Crash Protection** | ✅ | ✅ | ❌ | 🔴 HIGH |
| **Disk Monitoring** | ✅ | ✅ | ❌ | 🔴 HIGH |
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Platform Readiness Assessment (85%)
|
||||
|
||||
### Core Platform (100%)
|
||||
- ✅ Container orchestration
|
||||
- ✅ Multi-variant provisioning
|
||||
- ✅ Network routing (dual-router)
|
||||
- ✅ DNS automation (Cloudflare + Technitium)
|
||||
- ✅ Velocity proxy integration
|
||||
- ✅ Start/stop/restart control
|
||||
- ✅ Console command injection
|
||||
- ✅ Status monitoring
|
||||
|
||||
### Operational Features (70%)
|
||||
- ✅ Log tailing (HTTP polling)
|
||||
- ✅ Crash detection
|
||||
- ❌ **WebSocket console** (need real-time streaming)
|
||||
- ❌ **Crash loop protection** (need exponential backoff)
|
||||
- ❌ **Disk space monitoring** (prevent corruption)
|
||||
|
||||
### File Management (0%)
|
||||
- ❌ File upload/download
|
||||
- ❌ Backup/restore system
|
||||
- ❌ World file management
|
||||
|
||||
### Advanced Features (Planned)
|
||||
- 📋 Resource monitoring dashboard
|
||||
- 📋 Plugin marketplace
|
||||
- 📋 Developer platform APIs
|
||||
- 📋 Performance optimization tools
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Launch Decision Point
|
||||
|
||||
### Option A: Launch NOW (Soft Beta)
|
||||
**Status**: 85% ready
|
||||
**Timeline**: Immediate
|
||||
**Pros**: Fast to market, gather user feedback
|
||||
**Cons**: Missing competitive UX features, higher support burden
|
||||
**Recommendation**: ⚠️ Acceptable for 10-20 beta users only
|
||||
|
||||
### Option B: +1 Week (Critical Features) ⭐ RECOMMENDED
|
||||
**Status**: 95% ready after additions
|
||||
**Timeline**: December 14, 2025
|
||||
**Add**: WebSocket console + Crash protection + Disk monitoring
|
||||
**Effort**: 7-9 hours total
|
||||
**Pros**: Competitive feature parity, professional launch
|
||||
**Cons**: Minimal delay
|
||||
**Recommendation**: ✅ Best balance of quality and speed
|
||||
|
||||
### Option C: +1 Month (Full Feature Parity)
|
||||
**Status**: 100% ready
|
||||
**Timeline**: January 7, 2026
|
||||
**Add**: All UX features + file management + backups
|
||||
**Effort**: ~30 hours
|
||||
**Pros**: Complete competitive offering
|
||||
**Cons**: Slower to market, feature creep risk
|
||||
**Recommendation**: ⚠️ Over-engineering for launch
|
||||
|
||||
---
|
||||
|
||||
## 📋 Critical Outstanding Items
|
||||
|
||||
### 🔴 High Priority (Before Launch)
|
||||
|
||||
**1. WebSocket Console Streaming** [4-6 hours]
|
||||
- **Current**: HTTP polling via `/logs/tail`
|
||||
- **Needed**: Real-time WebSocket streaming
|
||||
- **Why**: Industry standard, users expect it
|
||||
- **Technical**: Socket.io integration to Go agent
|
||||
|
||||
**2. Crash Loop Protection** [2 hours]
|
||||
- **Current**: Immediate restart on crash
|
||||
- **Needed**: Escalating backoff between restarts, stop after 3 crashes (the listed 5s, 10s, 15s schedule is linear; a true exponential schedule would be 5s, 10s, 20s)
|
||||
- **Why**: Prevents resource thrashing
|
||||
- **Technical**: Agent retry logic with backoff timer
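
A minimal sketch of the backoff decision, assuming a doubling schedule starting at 5 seconds and one reading of "stop after 3 crashes" (three restarts are attempted, the fourth crash stops the server). `RestartPolicy`, `DefaultPolicy`, and `NextDelay` are illustrative names, not existing agent code.

```go
package main

import "time"

// RestartPolicy describes a crash-loop guard. Field names are illustrative.
type RestartPolicy struct {
	BaseDelay   time.Duration // wait before the first restart
	MaxRestarts int           // restarts allowed before giving up
}

// DefaultPolicy returns the assumed schedule: 5s base delay, 3 restarts.
func DefaultPolicy() RestartPolicy {
	return RestartPolicy{BaseDelay: 5 * time.Second, MaxRestarts: 3}
}

// NextDelay returns how long to wait after the crashCount-th consecutive
// crash (1-based) and whether a restart should be attempted at all.
// The delay doubles with each crash: base, 2*base, 4*base, ...
func (p RestartPolicy) NextDelay(crashCount int) (time.Duration, bool) {
	if crashCount < 1 || crashCount > p.MaxRestarts {
		return 0, false
	}
	return p.BaseDelay << uint(crashCount-1), true
}
```

The agent would reset the crash counter once a server has stayed up for some healthy interval; that interval is a design choice the source does not specify.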
|
||||
|
||||
**3. Disk Space Monitoring** [1 hour]
|
||||
- **Current**: No checks
|
||||
- **Needed**: Alert when <1GB free, prevent start if insufficient
|
||||
- **Why**: Prevents world corruption
|
||||
- **Technical**: Agent disk space check before start
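
A pre-start disk check along these lines could look like the following sketch. It assumes the agent runs on Linux (it uses `statfs` via the standard `syscall` package) and treats the 1 GB threshold as 1 GiB; `freeBytes` and `checkDisk` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"syscall"
)

// minFreeBytes is the 1 GB threshold named above, taken here as 1 GiB.
const minFreeBytes uint64 = 1 << 30

// freeBytes reports the space available to unprivileged processes on the
// filesystem containing path. It relies on statfs, so it is Unix-only.
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, fmt.Errorf("statfs %s: %w", path, err)
	}
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// checkDisk returns an error when free space is below the minimum, so the
// caller can refuse to start the server.
func checkDisk(free, min uint64) error {
	if free < min {
		return fmt.Errorf("insufficient disk: %d bytes free, need %d", free, min)
	}
	return nil
}
```

Before starting a server, the agent would call `freeBytes` on the server directory and pass the result to `checkDisk`, aborting the start and raising an alert on error.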
|
||||
|
||||
### 🟡 Medium Priority (Week 1)
|
||||
|
||||
**4. File Upload/Download** [6-8 hours]
|
||||
- Plugin management, world uploads
|
||||
- HTTP multipart + streaming
|
||||
|
||||
**5. Backup System** [8-10 hours]
|
||||
- World backup/restore
|
||||
- Integration with PBS backup infrastructure
|
||||
|
||||
**6. Enhanced Health Checks** [3-4 hours]
|
||||
- Query server status
|
||||
- Resource monitoring (CPU/RAM)
|
||||
|
||||
### 🟢 Low Priority (Month 1)
|
||||
|
||||
7. Resource monitoring dashboard
|
||||
8. Plugin marketplace integration
|
||||
9. Developer platform APIs
|
||||
10. Performance optimization
|
||||
|
||||
---
|
||||
|
||||
## 🗄️ Technical Architecture Details
|
||||
|
||||
### Directory Structure (Finalized)
|
||||
|
||||
```
/opt/zlh/<game>/<variant>/world/

Examples:
/opt/zlh/minecraft/vanilla/world/
/opt/zlh/minecraft/forge/world/
/opt/zlh/minecraft/fabric/world/
```
|
||||
|
||||
**Benefits**:
|
||||
- Clear game/variant separation
|
||||
- Scalable to all future games
|
||||
- Self-documenting paths
|
||||
- Easy backup automation
|
||||
|
||||
### Container Model
|
||||
|
||||
**Architecture**: One game per LXC container
|
||||
**Rationale**: Industry standard, 3-5x simpler than multi-game
|
||||
**Benefits**:
|
||||
- Better resource isolation
|
||||
- Simpler billing
|
||||
- Clearer security boundaries
|
||||
- Easier debugging
|
||||
|
||||
### Java Runtime Selection
|
||||
|
||||
```
MC 1.21.x → Java 21
MC ≥1.20.5 → Java 21
MC <1.20.5 → Java 17
```
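
The selection rule above reduces to a single version comparison against 1.20.5. The sketch below implements the rule exactly as written; `parseVersion` and `JavaVersionFor` are hypothetical names. Note that Minecraft versions before 1.17 were built for Java 8, so older modded setups such as 1.12.2 may need a runtime this rule does not cover.

```go
package main

import (
	"strconv"
	"strings"
)

// parseVersion turns "1.20.5" into [1 20 5]; missing parts count as 0.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	for i, part := range strings.Split(v, ".") {
		if i >= 3 {
			break
		}
		n, err := strconv.Atoi(part)
		if err != nil {
			return out, err
		}
		out[i] = n
	}
	return out, nil
}

// JavaVersionFor applies the stated rule: 1.20.5 and newer use Java 21,
// everything older uses Java 17.
func JavaVersionFor(mc string) (int, error) {
	v, err := parseVersion(mc)
	if err != nil {
		return 0, err
	}
	threshold := [3]int{1, 20, 5}
	for i := 0; i < 3; i++ {
		if v[i] > threshold[i] {
			return 21, nil
		}
		if v[i] < threshold[i] {
			return 17, nil
		}
	}
	return 21, nil // exactly 1.20.5
}
```

Comparing numeric components rather than strings avoids ordering mistakes such as treating "1.9" as newer than "1.20".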
|
||||
|
||||
### Artifact Download Paths
|
||||
|
||||
```
minecraft/vanilla/<version>/server.jar
minecraft/paper/<version>/server.jar
minecraft/purpur/<version>/server.jar
minecraft/fabric/<version>/fabric-server.jar
minecraft/forge/<version>/forge-installer.jar
minecraft/neoforge/<version>/neoforge-installer.jar
```
|
||||
|
||||
**Critical Note**: Fabric uses `fabric-server.jar` (pre-built), not installer pattern
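
The path layout above can be captured in a small lookup, which keeps the Fabric exception in one place. This is an illustrative sketch; `artifactFile` and `ArtifactPath` are assumed names and do not come from `artifacts.go`.

```go
package main

import "fmt"

// artifactFile maps each variant to the file name listed above.
var artifactFile = map[string]string{
	"vanilla":  "server.jar",
	"paper":    "server.jar",
	"purpur":   "server.jar",
	"fabric":   "fabric-server.jar", // pre-built server, not an installer
	"forge":    "forge-installer.jar",
	"neoforge": "neoforge-installer.jar",
}

// ArtifactPath builds the relative download path for a Minecraft variant.
func ArtifactPath(variant, version string) (string, error) {
	file, ok := artifactFile[variant]
	if !ok {
		return "", fmt.Errorf("unknown variant %q", variant)
	}
	return fmt.Sprintf("minecraft/%s/%s/%s", variant, version, file), nil
}
```

Returning an error for unknown variants, rather than defaulting to `server.jar`, surfaces typos early instead of producing a path that silently fails to download.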
|
||||
|
||||
---
|
||||
|
||||
## 💰 Business Model & Revenue Strategy
|
||||
|
||||
### Developer-to-Player Pipeline
|
||||
|
||||
```
Step 1: Developer Acquisition
├─ Development Environment: $20/month
└─ Testing Server: $25/month (50% discount)

Step 2: Player Acquisition (via developer)
├─ Player 1-10: $15/month each (25% discount)
└─ Total Player Revenue: $150/month

Step 3: Developer Commission
├─ Revenue Share: 7.5% of player revenue
├─ Developer Earns: $11.25/month
└─ Platform Keeps: $138.75/month

Total Monthly Revenue from One Developer:
$20 (dev env) + $25 (test server) + $150 (players) = $195/month
Revenue Multiplier: 9.75x the $20/month development environment price ($195 ÷ $20)
```
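
The pipeline figures can be reproduced with integer arithmetic in cents, which avoids floating-point rounding on the 7.5% commission. The names below are illustrative, and the sketch assumes exactly 10 referred players as in the example above.

```go
package main

// Amounts are in cents to keep the arithmetic exact.
const (
	devEnvCents      = 2000 // $20 development environment
	testServerCents  = 2500 // $25 testing server
	playerCents      = 1500 // $15 per referred player server
	referredPlayers  = 10
	commissionBasis  = 750 // 7.5% expressed in basis points
	basisPointsWhole = 10000
)

// PlayerRevenue returns total monthly player revenue in cents.
func PlayerRevenue(players, perPlayer int) int {
	return players * perPlayer
}

// Commission returns the developer's share of revenue in cents.
func Commission(revenue, basisPoints int) int {
	return revenue * basisPoints / basisPointsWhole
}

// GrossPerDeveloper returns total monthly revenue tied to one developer.
func GrossPerDeveloper() int {
	return devEnvCents + testServerCents + PlayerRevenue(referredPlayers, playerCents)
}

// Multiplier is gross revenue divided by the development environment price.
func Multiplier() float64 {
	return float64(GrossPerDeveloper()) / float64(devEnvCents)
}
```

The 9.75x figure is gross of the developer commission; subtracting the $11.25 commission leaves $183.75 per developer for the platform.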
|
||||
|
||||
### Financial Projections
|
||||
|
||||
**Month 6**: $8K-30K (LXC advantage + developer pipeline)
|
||||
**Month 12**: $25K-100K (custom platform competitive advantages)
|
||||
**Month 24**: $75K-300K (market leadership + technology licensing)
|
||||
|
||||
### Competitive Advantages
|
||||
|
||||
1. **LXC Performance**: 20-30% improvement over Docker competitors
|
||||
2. **Developer Ecosystem**: Complete dev-to-production pipeline vs pure hosting
|
||||
3. **Open Source Foundation**: 30-40% cost advantage over corporate providers
|
||||
4. **Gaming-First Architecture**: Purpose-built vs adapted generic hosting
|
||||
5. **NeoForge Support**: Ahead of Apex and Shockbyte
|
||||
|
||||
---
|
||||
|
||||
## 🔐 Security Vulnerabilities (CRITICAL - Active Fix Required)
|
||||
|
||||
### API Department Issues
|
||||
|
||||
1. **Server Ownership Bypass**
|
||||
- Any user can control any server via UUID
|
||||
- No ownership validation in API endpoints
|
||||
- **Impact**: Critical security flaw
|
||||
|
||||
2. **Admin Privilege Escalation**
|
||||
- Frontend can claim admin via JWT manipulation
|
||||
- No server-side role validation
|
||||
- **Impact**: Complete access control bypass
|
||||
|
||||
3. **Token URL Exposure**
|
||||
- JWTs visible in browser history/logs
|
||||
- Tokens passed as URL parameters
|
||||
- **Impact**: Token theft vulnerability
|
||||
|
||||
4. **API Key Validation Missing**
|
||||
- Authentication bypass vulnerabilities
|
||||
- Inconsistent validation patterns
|
||||
- **Impact**: Unauthorized API access
|
||||
|
||||
### Required Fixes
|
||||
|
||||
- Implement ownership checks on all server operations
|
||||
- Server-side JWT validation and role enforcement
|
||||
- Move tokens from URL to headers/cookies
|
||||
- Comprehensive API key validation
|
||||
|
||||
**Priority**: Must fix before public launch (current soft beta acceptable)
|
||||
|
||||
---
|
||||
|
||||
## 🛠️ Ford Assembly Line Department Structure
|
||||
|
||||
### Management Department (Coordination Hub)
|
||||
- **Role**: Strategic oversight, cross-department integration
|
||||
- **AI Resource**: Claude (architecture) + ChatGPT (implementation)
|
||||
- **Current Focus**: Launch readiness + critical feature completion
|
||||
|
||||
### 5 Specialized Departments
|
||||
|
||||
**1. API Department** ⚠️ CRITICAL SECURITY + DEVELOPER PLATFORM
|
||||
- Tech: Node.js/Express, MariaDB, JWT auth, Pterodactyl integration
|
||||
- Priority: Security fixes + developer environment APIs
|
||||
|
||||
**2. Infrastructure Department** ✅ LXC INTEGRATION PRIORITY
|
||||
- Tech: Proxmox VMs, Ansible automation, PBS backup, Monitoring
|
||||
- Achievement: Enterprise backup system operational
|
||||
- Capacity: 1.8TB available, supports 75-100 developers
|
||||
|
||||
**3. Frontend Department** 🔧 TOKEN SECURITY + DEVELOPER UI
|
||||
- Tech: Next.js 15, TailwindCSS, sci-fi HUD aesthetic, TypeScript
|
||||
- Priority: Token security + developer dashboard
|
||||
|
||||
**4. Pterodactyl Department** ⚠️ OAUTH + WINGS LXC
|
||||
- Role: Panel customization, OAuth integration
|
||||
- Future: Wings LXC integration for performance advantage
|
||||
|
||||
**5. Planning & Brainstorming Department** 🧠 STRATEGIC EXECUTION
|
||||
- Role: Long-term vision, competitive strategy
|
||||
- Focus: Developer acquisition, viral growth mechanics
|
||||
|
||||
---
|
||||
|
||||
## 📋 Immediate Next Steps (Priority Order)
|
||||
|
||||
### Phase 1: Critical Features (Before Launch)
|
||||
1. 🔧 **Fix Go Agent Bugs** - Remove Forge glob, fix fallthrough, enable stop commands
|
||||
2. 🔧 **WebSocket Console** - Implement real-time streaming (4-6 hours)
|
||||
3. 🔧 **Crash Loop Protection** - Add exponential backoff (2 hours)
|
||||
4. 🔧 **Disk Space Monitoring** - Prevent starts on low disk (1 hour)
|
||||
|
||||
### Phase 2: Launch Readiness
|
||||
5. 📋 **Security Audit** - Review critical vulnerabilities
|
||||
6. 📋 **Documentation** - User guides, API docs
|
||||
7. 📋 **Monitoring** - Alert thresholds, dashboards
|
||||
8. 📋 **Soft Beta** - 10-20 users, gather feedback
|
||||
|
||||
### Phase 3: Week 1 Post-Launch
|
||||
9. 📋 **File Management** - Upload/download interface
|
||||
10. 📋 **Backup System** - World backup/restore
|
||||
11. 📋 **Enhanced Health Checks** - Resource monitoring
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Success Metrics
|
||||
|
||||
### Technical Metrics
|
||||
- ✅ 100% provisioning success rate (all 6 variants)
|
||||
- ⚠️ Zero DNS orphan records (needs EdgeState migration)
|
||||
- ⚠️ Sub-second WebSocket latency (needs implementation)
|
||||
- ✅ LXC 20-30% performance advantage (validated)
|
||||
|
||||
### Business Metrics (Future)
|
||||
- Developer referral system operational
|
||||
- Revenue sharing calculations accurate
|
||||
- Customer quota enforcement working
|
||||
- Usage metering for billing
|
||||
|
||||
### User Experience Metrics
|
||||
- Professional HUD aesthetic maintained
|
||||
- Zero breaking changes during updates
|
||||
- Seamless dev-to-production pipeline
|
||||
- <3s average provisioning time
|
||||
|
||||
---
|
||||
|
||||
## 📁 Key Files & Locations
|
||||
|
||||
### API Service (`/home/zlh/zlh-api-v2/`)
|
||||
- `prisma/schema.prisma` - Database schema
|
||||
- `src/services/edgePublisher.js` - DNS + Velocity publishing
|
||||
- `src/services/dePublisher.js` - Edge cleanup
|
||||
- `src/services/portAllocator.js` - Port management
|
||||
- `src/clients/cloudflareClient.js` - Cloudflare API wrapper
|
||||
- `src/clients/technitiumClient.js` - Technitium DNS API wrapper
|
||||
|
||||
### Go Agent (`/opt/zlh-agent/`)
|
||||
- `agent.go` - Main provisioning logic
|
||||
- `artifacts.go` - Download + verification (has bugs)
|
||||
- `process.go` - Server lifecycle management (has bug)
|
||||
- `api.go` - HTTP server for control commands
|
||||
- `payload.json` - Configuration from API
|
||||
|
||||
### Frontend (`/home/zlh/zlh-portal/`)
|
||||
- Next.js 15 application
|
||||
- Steel-texture HUD aesthetic
|
||||
- Developer dashboard (in progress)
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Critical Rules & Constraints
|
||||
|
||||
### DO NOT
|
||||
- ❌ Infer hostnames from DNS records
|
||||
- ❌ Use DNS as source of truth
|
||||
- ❌ Delete Cloudflare records without record IDs
|
||||
- ❌ Launch without WebSocket console (competitive requirement)
|
||||
- ❌ Skip crash protection (operational stability)
|
||||
- ❌ Ignore disk space monitoring (data safety)
|
||||
|
||||
### ALWAYS
|
||||
- ✅ Treat DB as authoritative source of truth
|
||||
- ✅ Store Cloudflare record IDs in EdgeState
|
||||
- ✅ Use exact hostname matching
|
||||
- ✅ Track all async operations in JobLog
|
||||
- ✅ Audit significant actions
|
||||
- ✅ Test all 6 MC variants before deploy
|
||||
|
||||
---
|
||||
|
||||
## 💡 Key Architectural Decisions (ADRs)
|
||||
|
||||
**ADR-001: Minecraft-Only Launch**
|
||||
**Decision**: Launch with Minecraft only, defer other games
|
||||
**Rationale**: Market validation, focused quality, faster to market
|
||||
**Consequence**: 6 variants + 6 versions = comprehensive MC offering
|
||||
|
||||
**ADR-002: One Game Per Container**
|
||||
**Decision**: Single game per LXC container
|
||||
**Rationale**: Industry standard, 3-5x simpler than multi-game
|
||||
**Consequence**: Better isolation, clearer billing, easier debugging
|
||||
|
||||
**ADR-003: Velocity Over Direct Port Forwarding**
|
||||
**Decision**: Use Velocity proxy for Minecraft routing
|
||||
**Rationale**: Single entry point, dynamic registration, no NAT complexity
|
||||
**Consequence**: No external port allocation needed for MC
|
||||
|
||||
**ADR-004: Hybrid Pterodactyl + Custom API**
|
||||
**Decision**: Keep Pterodactyl panel, build custom API alongside
|
||||
**Rationale**: Preserve working OAuth, gradual migration path
|
||||
**Consequence**: Dual system complexity, eventual migration needed
|
||||
|
||||
**ADR-005: Go Agent Architecture**
|
||||
**Decision**: Containerized Go agent handles provisioning
|
||||
**Rationale**: Language-agnostic, self-healing, version-aware
|
||||
**Consequence**: Robust provisioning, automatic repair, clean separation
|
||||
|
||||
---
|
||||
|
||||
## 🧠 Session Continuity Prompt
|
||||
|
||||
For AI assistants resuming work on this project:
|
||||
|
||||
> Resume from ZeroLagHub Master Bootstrap (December 7, 2025).
>
> **Current State**: Platform 85% launch-ready. All 6 Minecraft variants provisioning successfully via Go agent. Core functionality operational, need critical UX features for competitive parity.
>
> **Launch Decision**: Recommend +1 week for WebSocket console, crash protection, and disk monitoring.
>
> **Known Bugs**: 3 non-blocking Go agent issues (Forge glob, fallthrough, stop exclusion).
>
> **Critical Context**:
> - Security vulnerabilities exist but acceptable for soft beta
> - Business model validated with 9.75x revenue multiplier
> - Developer-to-player pipeline is core differentiator
> - LXC performance advantage is primary competitive edge
>
> **Next Actions**: Fix Go agent bugs, implement critical features, launch beta.
|
||||
|
||||
---
|
||||
|
||||
## 📞 Support & Escalation
|
||||
|
||||
- **Platform Owner**: 44 years old, full-stack developer
|
||||
- **AI Coordination**: Claude (architecture) + ChatGPT (implementation)
|
||||
- **Infrastructure**: GTHost dedicated server ($109/month)
|
||||
- **Domain**: zerolaghub.com, zpack.zerolaghub.com
|
||||
- **Public Game IP**: 139.64.165.248
|
||||
|
||||
---
|
||||
|
||||
## 📊 Platform Status Summary
|
||||
|
||||
**Technical Readiness**: 85% complete
|
||||
**Competitive Position**: Ready to compete on core provisioning, need UX polish
|
||||
**Strategic Clarity**: Clear path to launch with validated business model
|
||||
**Infrastructure**: Production-grade with enterprise backup system
|
||||
**Security**: Known vulnerabilities, acceptable for soft beta, must fix before public launch
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Strategic Recommendation
|
||||
|
||||
**Recommended Path**: Option B (+1 Week)
|
||||
|
||||
**Rationale**:
|
||||
1. WebSocket console is table stakes (competitors have it)
|
||||
2. Crash protection prevents operational nightmares
|
||||
3. Disk monitoring prevents data loss
|
||||
4. 1 week is negligible for long-term platform success
|
||||
5. Professional launch > rushed launch
|
||||
|
||||
**Timeline**:
|
||||
- **Dec 7-10**: Implement critical features (WebSocket, crash, disk)
|
||||
- **Dec 11-13**: Testing + bug fixes
|
||||
- **Dec 14**: Soft beta launch (10-20 users)
|
||||
- **Dec 21**: Public launch after beta feedback
|
||||
|
||||
---
|
||||
|
||||
**This document serves as the single source of truth for project continuity. Update after each major milestone or architectural change.**
|
||||
|
||||
🚀 **Next action: Decide launch timeline, then implement critical features or launch beta.**
|
||||