From 5a5ad001afb31e86db3d9f5a091febb962bd9d6b Mon Sep 17 00:00:00 2001
From: jester
Date: Sat, 13 Dec 2025 16:40:48 +0000
Subject: [PATCH] Upload files to "/"

---
 ZeroLagHub_Cross_Project_Tracker.md           | 846 ++++++++++++++++++
 ZeroLagHub_Drift_Prevention_Card.md           | 105 +++
 ...Hub_GPT_Implementation_Handover_Dec2025.md | 573 ++++++++++++
 ZeroLagHub_Infrastructure_Specifications.md   | 465 ++++++++++
 ZeroLagHub_Master_Bootstrap_Dec2025.md        | 606 +++++++++++++
 5 files changed, 2595 insertions(+)
 create mode 100644 ZeroLagHub_Cross_Project_Tracker.md
 create mode 100644 ZeroLagHub_Drift_Prevention_Card.md
 create mode 100644 ZeroLagHub_GPT_Implementation_Handover_Dec2025.md
 create mode 100644 ZeroLagHub_Infrastructure_Specifications.md
 create mode 100644 ZeroLagHub_Master_Bootstrap_Dec2025.md

diff --git a/ZeroLagHub_Cross_Project_Tracker.md b/ZeroLagHub_Cross_Project_Tracker.md
new file mode 100644
index 0000000..729c765
--- /dev/null
+++ b/ZeroLagHub_Cross_Project_Tracker.md
@@ -0,0 +1,846 @@
+# 🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System
+
+**Last Updated**: December 7, 2025
+**Version**: 1.0 (Canonical Architecture Enforcement)
+**Status**: ACTIVE - Must Be Consulted Before All Code Changes
+
+---
+
+## 🎯 Document Purpose
+
+This document establishes **architectural boundaries** across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.
+
+**Critical Insight**: Most project failures come from **gradual architectural erosion**, not sudden breaking changes. This document prevents that.
+
+---
+
+## 📊 The Three Systems (Canonical Ownership)
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                       NODE.JS API v2                        │
+│                   (Orchestration Engine)                    │
+│                                                             │
+│  Owns: VMID allocation, port management, DNS publishing,    │
+│        Velocity registration, job queue, database state     │
+│                                                             │
+│  Speaks To: Proxmox, Cloudflare, Technitium, Velocity,      │
+│             MariaDB, BullMQ, Go Agent (HTTP)                │
+│                                                             │
+│  Never Touches: Container filesystem, game server files,    │
+│                 Java installation, artifact downloads       │
+└─────────────────────┬───────────────────────────────────────┘
+                      │
+                      │ HTTP Contract
+                      │ POST /config, /start, /stop
+                      │ GET /status, /health
+                      │
+                      ▼
+┌─────────────────────────────────────────────────────────────┐
+│                          GO AGENT                           │
+│                (Container-Internal Manager)                 │
+│                                                             │
+│  Owns: Server installation, Java runtime, artifact          │
+│        downloads, process management, READY detection,      │
+│        filesystem layout, verification + self-repair        │
+│                                                             │
+│  Speaks To: Local filesystem, game server process,          │
+│             API (status updates via HTTP polling)           │
+│                                                             │
+│  Never Touches: Proxmox, DNS, Cloudflare, Velocity,         │
+│                 port allocation, VMID selection             │
+└─────────────────────┬───────────────────────────────────────┘
+                      │
+                      │ Status Polling
+                      │ Agent reports state
+                      │
+                      ▼
+┌─────────────────────────────────────────────────────────────┐
+│                      NEXT.JS FRONTEND                       │
+│                    (Customer + Admin UI)                    │
+│                                                             │
+│  Owns: User interaction, form validation, display logic,    │
+│        client-side state, UI components                     │
+│                                                             │
+│  Speaks To: API v2 only (REST + WebSocket when added)       │
+│                                                             │
+│  Never Touches: Proxmox, Go Agent, DNS, Velocity,           │
+│                 Cloudflare, direct container access         │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 🗺️ System Ownership Matrix (CANONICAL)
+
+| Area | Node.js API v2 | Go Agent | Frontend |
+|------|----------------|----------|----------|
+| **Provisioning Orchestration** | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
+| **Template Selection** | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
+| **Server Installation** | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
+| **Runtime Control** | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
+| **DNS (Cloudflare + Technitium)** | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
+| **Velocity Registration** | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
+| **IP Logic** | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
+| **Port Allocation** | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
+| **Monitoring** | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
+| **Error Handling** | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |
+
+---
+
+## 🔄 API ↔ Agent Contract (IMMUTABLE)
+
+### **API → Agent Endpoints**
+
+```
+POST /config
+├─ Payload: {
+│    game: "minecraft",
+│    variant: "paper",
+│    version: "1.21.3",
+│    ports: [25565, 25575],
+│    memory: "4G",
+│    motd: "Welcome to ZeroLagHub",
+│    worldSettings: {...}
+│  }
+└─ Response: 200 OK
+
+POST /start
+└─ Triggers: ensureProvisioned() → StartServer()
+
+POST /stop
+└─ Triggers: Graceful shutdown via server stop command
+
+POST /restart
+└─ Triggers: stop → start sequence
+
+GET /status
+└─ Response: {
+     state: "RUNNING" | "INSTALLING" | "FAILED",
+     pid: 12345,
+     uptime: 3600,
+     lastError: null
+   }
+
+GET /health
+└─ Response: {
+     healthy: true,
+     java: "/usr/lib/jvm/java-21-openjdk-amd64",
+     variant: "paper",
+     ready: true
+   }
+```
+
+### **Agent → API Signals (via /status polling)**
+
+```
+State Machine:
+INSTALLING
+  ├─> DOWNLOADING_ARTIFACT
+  ├─> INSTALLING_JAVA
+  ├─> FINALIZING
+  ├─> RUNNING
+  └─> FAILED | CRASHED
+```
+
+### **🚫 Forbidden Agent Behaviors**
+
+Agent **NEVER**:
+- ❌ Allocates or manages ports
+- ❌ Talks to Proxmox API
+- ❌ Creates DNS records
+- ❌ Registers with Velocity
+- ❌ Modifies LXC config
+- ❌ Allocates VMIDs
+- ❌ Manages templates
+- ❌ Talks to Cloudflare or Technitium
+
+**Violation Detection**: If agent code imports Proxmox client, DNS client, or port allocation logic → **DRIFT VIOLATION**
+
+---
+
+## 🖥️ Frontend ↔ API Contract (IMMUTABLE)
+
+### **Frontend Allowed Endpoints**
+
+```
+Provisioning:
+POST   /api/containers/create
+GET    /api/containers/:vmid
+DELETE /api/containers/:vmid
+
+Control:
+POST   /api/containers/:vmid/start
+POST   /api/containers/:vmid/stop
+POST   /api/containers/:vmid/restart
+
+Status:
+GET    /api/containers/:vmid/status
+GET    /api/containers/:vmid/logs
+GET    /api/containers/:vmid/stats
+
+Discovery:
+GET    /api/templates
+GET    /api/containers (list user's containers)
+
+Read-Only Info:
+GET    /api/dns/:hostname (display only)
+GET    /api/velocity/status (display only)
+```
+
+### **🚫 Forbidden Frontend Behaviors**
+
+Frontend **NEVER**:
+- ❌ Talks directly to Go Agent
+- ❌ Calls Proxmox API
+- ❌ Creates DNS records
+- ❌ Registers with Velocity
+- ❌ Allocates ports
+- ❌ Executes container commands
+- ❌ Accesses MariaDB directly
+
+**Violation Detection**: If frontend code imports Proxmox client, agent HTTP client (except via API), or database client → **DRIFT VIOLATION**
+
+---
+
+## 🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)
+
+### **Rule #1: Provisioning Ownership**
+
+**Violation**: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs
+
+**Correct Path**: API owns ALL provisioning orchestration
+
+**Example Violation**:
+```go
+// WRONG - Agent should NEVER do this
+func (a *Agent) allocatePorts() ([]int, error) {
+    // ... port allocation logic
+}
+```
+
+**Correct Pattern**:
+```javascript
+// RIGHT - API allocates, agent receives
+async function provisionInstance(vmid, game, variant) {
+  // vmid is allocated by the API before ports are requested
+  const ports = await portAllocator.allocate(vmid);
+  await agent.postConfig({ ports, game, variant });
+}
+```
+
+---
+
+### **Rule #2: Artifact Ownership**
+
+**Violation**: API asked to install Java, download server.jar, run installers
+
+**Correct Path**: Agent owns ALL in-container installation
+
+**Example Violation**:
+```javascript
+// WRONG - API should NEVER do this
+async function installMinecraft(vmid, version) {
+  await proxmox.exec(vmid, `wget https://...`);
+  await proxmox.exec(vmid, `java -jar installer.jar`);
+}
+```
+
+**Correct Pattern**:
+```go
+// RIGHT - Agent handles installation
+// (illustrative outline; per-step error handling omitted)
+func (p *Provisioner) ProvisionAll(cfg Config) error {
+    downloadArtifact(cfg.Variant, cfg.Version)
+    installJavaRuntime(cfg.Version)
+    verifyInstallation()
+    return nil
+}
+```
+
+---
+
+### **Rule #3: Direct Container Access**
+
+**Violation**: Frontend or API wants to exec commands directly into container
+
+**Correct Path**: Agent owns container execution layer
+
+**Example Violation**:
+```typescript
+// WRONG - Frontend should NEVER do this
+async function restartServer(vmid: number) {
+  const ssh = new SSHClient();
+  await ssh.connect(containerIP);
+  await ssh.exec('systemctl restart minecraft');
+}
+```
+
+**Correct Pattern**:
+```typescript
+// RIGHT - Frontend talks to API, API talks to agent
+async function restartServer(vmid: number) {
+  await api.post(`/containers/${vmid}/restart`);
+}
+```
+
+---
+
+### **Rule #4: Networking Responsibilities**
+
+**Violation**: Agent asked to select public vs internal IPs, decide DNS zones
+
+**Correct Path**: API owns dual-IP logic (Cloudflare external, Technitium internal)
+
+**Example Violation**:
+```go
+// WRONG - Agent should NEVER decide this
+func (a *Agent) determinePublicIP() string {
+    if a.needsCloudflare() {
+        return "139.64.165.248"
+    }
+    return a.containerIP
+}
+```
+
+**Correct Pattern**:
+```javascript
+// RIGHT - API decides network topology
+function determineIPs(vmid, game) {
+  const internalIP = `10.200.0.${vmid - 1000}`;
+  const externalIP = "139.64.165.248"; // Cloudflare target
+  const velocityIP = "10.70.0.241"; // Internal routing
+
+  return { internalIP, externalIP, velocityIP };
+}
+```
+
+---
+
+### **Rule #5: Proxy Responsibilities**
+
+**Violation**: Agent asked to register with Velocity, configure proxy routing
+
+**Correct Path**: API owns ALL proxy integrations
+
+**Example Violation**:
+```go
+// WRONG - Agent should NEVER do this
+func (a *Agent) registerWithVelocity() error {
+    client := velocity.NewClient()
+    return client.Register(a.hostname, a.port)
+}
+```
+
+**Correct Pattern**:
+```javascript
+// RIGHT - API handles Velocity registration
+async function registerVelocity(vmid, hostname, internalIP) {
+  await velocityBridge.registerBackend({
+    name: hostname,
+    address: internalIP,
+    port: 25565
+  });
+}
+```
+
+---
+
+## 🔄 Context Switching Safety Workflow
+
+### **When Moving to Node.js API Work:**
+
+**Pre-Switch Checklist**:
+- [ ] Agent contract unchanged? (POST /config, /start, /stop, GET /status)
+- [ ] Database schema unchanged? (Prisma models consistent)
+- [ ] LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
+- [ ] DNS/IP logic consistent? (Cloudflare external, Technitium internal)
+- [ ] Port allocation logic preserved? (PortPool DB-backed)
+
+**Common Drift Patterns**:
+- ⚠️ Adding agent installation logic to API
+- ⚠️ Changing agent contract without updating both sides
+- ⚠️ Moving DNS logic to different service
+- ⚠️ Bypassing job queue for provisioning
+
+---
+
+### **When Moving to Go Agent Work:**
+
+**Pre-Switch Checklist**:
+- [ ] No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
+- [ ] File paths remain canonical (`/opt/zlh///world`)
+- [ ] Naming conventions maintained (`server.jar`, `fabric-server.jar`, etc.)
+- [ ] No external API calls (Proxmox, DNS, Velocity)
+- [ ] Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)
+
+**Common Drift Patterns**:
+- ⚠️ Adding port allocation to agent
+- ⚠️ Making agent talk to external services
+- ⚠️ Changing directory structure without API coordination
+- ⚠️ Adding orchestration logic to agent
+
+---
+
+### **When Moving to Frontend Work:**
+
+**Pre-Switch Checklist**:
+- [ ] Only API-approved fields used in UI
+- [ ] No direct agent HTTP calls
+- [ ] VMID not exposed in user-facing UI
+- [ ] Internal IPs not displayed to users
+- [ ] All state from API, not computed locally
+
+**Common Drift Patterns**:
+- ⚠️ Adding direct agent calls from frontend
+- ⚠️ Computing server state client-side
+- ⚠️ Exposing internal infrastructure details
+- ⚠️ Bypassing API for container control
+
+---
+
+## 🚨 High-Risk Integration Zones (GUARDED)
+
+These areas have historically caused drift across sessions:
+
+### **1. Forge / NeoForge Installation Logic**
+- **Risk**: Agent vs API confusion on who handles `run.sh` patching
+- **Guard**: Agent owns ALL Forge installation, API just passes config
+- **Test**: Can provision Forge 1.21.3 without API filesystem access?
+
+### **2. Cloudflare SRV Deletion**
+- **Risk**: Case sensitivity, subdomain normalization, record ID tracking
+- **Guard**: API stores Cloudflare record IDs in EdgeState, deletes by ID
+- **Test**: Create → Delete → Recreate same hostname without orphans?
+
+### **3. Technitium DNS Zone Mismatch**
+- **Risk**: Wrong zone selection, duplicate records
+- **Guard**: API hardcodes zone as `zpack.zerolaghub.com`, validates before creation
+- **Test**: No records created in wrong zones?
+
+### **4. Velocity Registration Order**
+- **Risk**: Registering before server ready, deregistering incorrectly
+- **Guard**: API waits for agent RUNNING state, then registers Velocity
+- **Test**: Player connection works immediately after provisioning complete?
+
+### **5. PortPool Commit Logic**
+- **Risk**: Race conditions, double-allocation, uncommitted ports
+- **Guard**: API allocates → provisions → commits (rollback on failure)
+- **Test**: Concurrent provisions don't collide on ports?
+
+### **6. Agent READY Detection**
+- **Risk**: False negatives, false positives, variant-specific patterns
+- **Guard**: Agent uses variant-aware log parsing, multiple confirmation lines
+- **Test**: All 6 variants correctly detect READY state?
+
+### **7. Server Start-Up False Negatives**
+- **Risk**: Timeout too short, log parsing too strict
+- **Guard**: Agent increases timeout for Forge (90s), multiple log patterns
+- **Test**: Forge installer completes without false failure?
+
+### **8. IP Selection Logic**
+- **Risk**: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
+- **Guard**: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
+- **Test**: DNS points to correct IPs, Velocity routes to correct internal IP?
+
+---
+
+## 📋 Architecture Decision Log (LOCKED) ⭐ NEW
+
+**Purpose**: Records finalized architectural decisions that **must not be re-litigated** unless explicitly requested.
+
+**Status**: LOCKED - These decisions are final and cannot be changed without user explicitly saying "Revisit decision X"
+
+---
+
+### **DEC-001: Templates vs Go Agent (FINAL)**
+
+**Decision**: Hybrid model
+- LXC templates define **base environment only**
+- Go Agent is **authoritative execution layer** inside containers
+
+**Rationale**:
+- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
+- Agent enables self-repair, async provisioning, runtime control
+- Hybrid provides speed + flexibility without API container access
+
+**Applies To**: API v2, Go Agent, Frontend
+**Status**: ✅ LOCKED
+
+---
+
+### **DEC-002: Provisioning Authority**
+
+**Decision**: API orchestrates, Agent executes
+
+**Rationale**:
+- API has global visibility (DB, DNS, Proxmox, Velocity)
+- Agent is intentionally sandboxed to container filesystem + process
+
+**Applies To**: All systems
+**Status**: ✅ LOCKED
+
+---
+
+### **DEC-003: DNS & Edge Publishing Ownership**
+
+**Decision**: API-only responsibility
+
+**Rationale**:
+- Requires external credentials (Cloudflare, Technitium)
+- Must correlate DB state, record IDs, reconciliation jobs
+
+**Applies To**: API v2
+**Status**: ✅ LOCKED
+
+---
+
+### **DEC-004: Proxy Stack**
+
+**Decision**: Traefik + Velocity only
+
+**Rationale**:
+- Traefik for HTTP/control-plane
+- Velocity for Minecraft TCP routing
+- **HAProxy explicitly deprecated** for ZeroLagHub
+
+**Applies To**: Infrastructure, API
+**Status**: ✅ LOCKED
+
+---
+
+### **DEC-005: State Persistence**
+
+**Decision**: MariaDB is single source of truth
+
+**Rationale**:
+- Flat files caused race conditions and drift
+- DB enables reconciliation, recovery, observability
+
+**Applies To**: API v2
+**Status**: ✅ LOCKED
+
+---
+
+### **DEC-006: Frontend Access Model**
+
+**Decision**: Frontend communicates with API only
+
+**Rationale**:
+- Security boundary
+- Prevents leaking infrastructure details
+
+**Applies To**: Frontend
+**Status**: ✅ LOCKED
+
+---
+
+### **DEC-007: Architecture Enforcement Policy**
+
+**Decision**: Drift prevention is mandatory
+
+**Rationale**:
+- Prevents oscillation between alternatives
+- Preserves velocity during late-stage development
+
+**Applies To**: All work sessions
+**Status**: ✅ LOCKED
+
+---
+
+## 📋 Canonical Architecture Anchors (ABSOLUTE)
+
+These rules are **immutable** unless explicitly changed with full system review:
+
+### **Anchor #1: Orchestration**
+✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)
+
+### **Anchor #2: Container-Internal**
+✅ Agent performs EVERYTHING inside container (install, start, stop, detect)
+
+### **Anchor #3: Templates**
+✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)
+
+### **Anchor #4: Job Queue**
+✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)
+
+### **Anchor #5: Database**
+✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)
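
The PortPool guard from High-Risk Zone 5 (allocate → provision → commit, rollback on failure) relies on the database being the only place port state lives. A minimal sketch of that sequence follows. The `PortPool` class here is a hypothetical in-memory stand-in for the MariaDB-backed table, and `provisionWithPorts`, `reserve`, `commit`, `release`, and `stateOf` are illustrative names, not the project's actual API.

```javascript
// Hypothetical in-memory stand-in for the DB-backed PortPool table.
// In the real system this state is assumed to live in MariaDB.
class PortPool {
  constructor(ports) {
    // port -> { state: "free" | "reserved" | "committed", vmid }
    this.rows = new Map(ports.map((p) => [p, { state: "free", vmid: null }]));
  }

  // Reserve `count` free ports for a VMID; throws if the pool is exhausted.
  reserve(vmid, count) {
    const free = [...this.rows]
      .filter(([, row]) => row.state === "free")
      .map(([port]) => port);
    if (free.length < count) throw new Error("PortPool exhausted");
    const picked = free.slice(0, count);
    for (const port of picked) this.rows.set(port, { state: "reserved", vmid });
    return picked;
  }

  // Promote this VMID's reserved ports to committed.
  commit(vmid) {
    for (const [port, row] of this.rows) {
      if (row.vmid === vmid && row.state === "reserved") {
        this.rows.set(port, { state: "committed", vmid });
      }
    }
  }

  // Roll back: return this VMID's reserved (uncommitted) ports to the pool.
  release(vmid) {
    for (const [port, row] of this.rows) {
      if (row.vmid === vmid && row.state === "reserved") {
        this.rows.set(port, { state: "free", vmid: null });
      }
    }
  }

  stateOf(port) {
    return this.rows.get(port).state;
  }
}

// Allocate, run the caller-supplied provisioning step, then commit.
// Any failure during provisioning releases the reservation.
async function provisionWithPorts(pool, vmid, count, provision) {
  const ports = pool.reserve(vmid, count);
  try {
    await provision(ports);
    pool.commit(vmid);
    return ports;
  } catch (err) {
    pool.release(vmid);
    throw err;
  }
}
```

An in-memory map does not address concurrent provisions; with a real database the reserve step would presumably need a transaction or row lock, which this sketch leaves out.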
+
+### **Anchor #6: Infrastructure**
+✅ Proxmox API (not Ansible) for LXC management
+
+### **Anchor #7: Routing**
+✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems
+
+### **Anchor #8: DNS**
+✅ Cloudflare = authoritative public DNS
+✅ Technitium = authoritative internal DNS
+
+### **Anchor #9: Frontend Isolation**
+✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)
+
+### **Anchor #10: Directory Structure**
+✅ `/opt/zlh///world` is canonical game server path
+
+---
+
+## 🛡️ Enforcement Policies (ACTIVE)
+
+When future instructions conflict with Canonical Architecture or Architecture Decision Log:
+
+### **Step 1: STOP**
+Immediately halt the task. Do not proceed with drift-inducing change.
+
+### **Step 2: RAISE DRIFT WARNING**
+```
+⚠️ DRIFT WARNING ⚠️
+
+Proposed change violates Canonical Architecture:
+
+Rule Violated: [Rule #X: Description]
+OR
+Decision Violated: [DEC-XXX: Decision Name]
+
+Violation: [Specific behavior that violates rule/decision]
+Correct Path: [Architecture-aligned approach]
+
+Impact: [What breaks if this drift is allowed]
+
+Options:
+1. Implement correct architecture-aligned path
+2. Amend Canonical Architecture (requires full system review)
+3. Request user to "Revisit decision X" (for ADL changes)
+4. Cancel proposed change
+```
+
+### **Step 3: PROVIDE CORRECT PATH**
+Show the architecture-aligned implementation that achieves the same goal.
+
+### **Step 4: ASK FOR DIRECTION**
+```
+Should we:
+A) Implement the correct architecture-aligned path?
+B) Perform full system review to amend architecture?
+C) Request user to explicitly revisit locked decision?
+D) Cancel this change?
+```
+
+### **Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE**
+If violation is against a **LOCKED** Architecture Decision (DEC-001 through DEC-007):
+
+**ADDITIONAL CHECK**:
+```
+⚠️ LOCKED DECISION WARNING ⚠️
+
+This change conflicts with Architecture Decision Log entry: [DEC-XXX]
+Status: LOCKED
+
+This decision can ONLY be changed if the user explicitly says:
+"Revisit decision [DEC-XXX]"
+
+Without explicit user request to revisit, this decision is FINAL.
+
+Proceeding with this change would violate architectural governance.
+```
+
+**Never proceed** with drift-inducing changes without explicit confirmation.
+**Never re-litigate** locked decisions without user explicitly requesting revision.
+
+---
+
+## 📊 Drift Detection Examples
+
+### **Example 1: Agent Port Allocation (VIOLATION)**
+
+**Proposed Change**:
+```go
+// Agent code proposal
+func (a *Agent) allocatePort() int {
+    // Find available port...
+    return port
+}
+```
+
+**Drift Warning**:
+```
+⚠️ DRIFT WARNING ⚠️
+
+Rule Violated: Rule #1 - Provisioning Ownership
+Violation: Agent attempting to allocate ports
+Correct Path: API allocates ports, agent receives them via /config
+
+Impact: Port collisions, database inconsistency, broken PortPool
+
+Recommendation: Remove port allocation from agent, ensure API sends ports in config payload
+```
+
+---
+
+### **Example 2: Frontend Direct Agent Call (VIOLATION)**
+
+**Proposed Change**:
+```typescript
+// Frontend code proposal
+async function getServerLogs(vmid: number) {
+  const agentURL = `http://10.200.0.${vmid}:8080/logs`;
+  return await fetch(agentURL);
+}
+```
+
+**Drift Warning**:
+```
+⚠️ DRIFT WARNING ⚠️
+
+Rule Violated: Rule #3 - Direct Container Access
+Violation: Frontend bypassing API to talk to agent
+Correct Path: Frontend → API → Agent (API proxies logs)
+
+Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk
+
+Recommendation: Add GET /api/containers/:vmid/logs endpoint that proxies to agent
+```
+
+---
+
+### **Example 3: API Installing Java (VIOLATION)**
+
+**Proposed Change**:
+```javascript
+// API code proposal
+async function provisionServer(vmid, game, variant) {
+  await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
+  await proxmox.exec(vmid, 'wget https://papermc.io/...');
+}
+```
+
+**Drift Warning**:
+```
+⚠️ DRIFT WARNING ⚠️
+
+Rule Violated: Rule #2 - Artifact Ownership
+Violation: API performing in-container installation
+Correct Path: API sends config to agent, agent handles installation
+
+Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system
+
+Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config
+```
+
+---
+
+## Integration with Existing Documentation
+
+### **Relationship to Master Bootstrap**
+- Master Bootstrap: Strategic overview and business model
+- **This Document**: Technical governance and boundary enforcement
+- **Usage**: Consult this before implementing ANY code changes
+
+### **Relationship to Complete Current State**
+- Complete Current State: What's working, what's next
+- **This Document**: How things MUST work (regardless of current state)
+- **Usage**: This is the "law", current state is "status"
+
+### **Relationship to Engineering Handover**
+- Engineering Handover: Daily tactical tasks and sprint plan
+- **This Document**: Constraints within which tasks must be implemented
+- **Usage**: Check this before starting each handover task
+
+---
+
+## 🔄 Architecture Amendment Process
+
+If legitimate need to change Canonical Architecture:
+
+### **Step 1: Identify Change**
+Document exactly what architectural boundary needs to change and why.
+
+### **Step 2: Full System Impact Analysis**
+- What breaks in API?
+- What breaks in Agent?
+- What breaks in Frontend?
+- What changes to contracts?
+- What database migrations needed?
+
+### **Step 3: Update ALL Affected Documents**
+- This document (Canonical Architecture)
+- Master Bootstrap (if strategic impact)
+- Complete Current State (implementation changes)
+- Engineering Handover (sprint tasks)
+- Agent Spec, Operational Guide (if affected)
+
+### **Step 4: Update ALL Systems**
+- API code + tests
+- Agent code + tests
+- Frontend code + tests
+- Database schema (migration)
+- Infrastructure config
+
+### **Step 5: Validation**
+- Integration tests pass?
+- No new drift introduced?
+- Documentation consistent?
+- All AIs briefed on change?
+
+**Only after ALL steps** is architecture amendment complete.
+
+---
+
+## 🎯 Quick Reference Card
+
+### **API Owns**
+- ✅ Provisioning orchestration
+- ✅ Port allocation (PortPool)
+- ✅ DNS (Cloudflare + Technitium)
+- ✅ Velocity registration
+- ✅ IP logic (external + internal)
+- ✅ Job queue (BullMQ)
+- ✅ Database state
+
+### **Agent Owns**
+- ✅ Container-internal installation
+- ✅ Java runtime
+- ✅ Artifact downloads
+- ✅ Server process management
+- ✅ READY detection
+- ✅ Self-repair + verification
+- ✅ Filesystem layout
+
+### **Frontend Owns**
+- ✅ User interaction
+- ✅ Display logic
+- ✅ Client state
+- ✅ Form validation
+
+### **Never**
+- ❌ Agent allocates ports
+- ❌ Agent talks to DNS/Velocity/Proxmox
+- ❌ API installs server files
+- ❌ API executes in-container commands
+- ❌ Frontend talks to agent directly
+- ❌ Frontend talks to infrastructure
+
+---
+
+## 📋 Session Start Checklist
+
+Before every coding session with any AI:
+
+- [ ] Read this document's Quick Reference Card
+- [ ] Identify which system you're working on (API, Agent, Frontend)
+- [ ] Review that system's "Owns" list
+- [ ] Check High-Risk Integration Zones if touching those areas
+- [ ] Verify no drift from previous session
+- [ ] Confirm contracts unchanged since last session
+
+**If ANY doubt**: Re-read full Canonical Architecture Anchors section.
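
The Quick Reference Card above can also be expressed as a lookup table, which makes the "which system owns this?" step of the checklist mechanical. The sketch below is only an illustration: the responsibility keys and the `checkDrift` helper are hypothetical names chosen here, and only single-owner entries from the card are encoded (co-owned areas such as Runtime Control and Monitoring are left out).

```javascript
// Hypothetical encoding of the Quick Reference Card's "Owns" lists.
// Keys are illustrative labels, not identifiers from the codebase.
const OWNERS = new Map([
  ["provisioning-orchestration", "api"],
  ["port-allocation", "api"],
  ["dns", "api"],
  ["velocity-registration", "api"],
  ["ip-logic", "api"],
  ["job-queue", "api"],
  ["database-state", "api"],
  ["container-installation", "agent"],
  ["java-runtime", "agent"],
  ["artifact-downloads", "agent"],
  ["process-management", "agent"],
  ["ready-detection", "agent"],
  ["filesystem-layout", "agent"],
  ["user-interaction", "frontend"],
  ["display-logic", "frontend"],
  ["client-state", "frontend"],
  ["form-validation", "frontend"],
]);

// Returns { ok, reason }. ok is false when `system` would take on a
// responsibility owned by another system, or one the table does not know.
function checkDrift(system, responsibility) {
  const owner = OWNERS.get(responsibility);
  if (owner === undefined) {
    return { ok: false, reason: `unknown responsibility: ${responsibility}` };
  }
  if (owner !== system) {
    return {
      ok: false,
      reason: `DRIFT: ${responsibility} is owned by ${owner}, not ${system}`,
    };
  }
  return { ok: true, reason: null };
}
```

For example, `checkDrift("agent", "port-allocation")` reports drift because the card assigns port allocation to the API. Unknown responsibilities are treated as failures so that new work has to be classified before it is accepted.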
+ +--- + +## โœ… Document Status + +**Status**: ACTIVE - Must be consulted before all code changes +**Enforcement**: MANDATORY - Drift violations must be caught +**Authority**: CANONICAL - Overrides conflicting guidance +**Updates**: Only via Architecture Amendment Process + +--- + +๐Ÿ›ก๏ธ **This document prevents architectural drift. Violate at your own risk.** diff --git a/ZeroLagHub_Drift_Prevention_Card.md b/ZeroLagHub_Drift_Prevention_Card.md new file mode 100644 index 0000000..33ca5e6 --- /dev/null +++ b/ZeroLagHub_Drift_Prevention_Card.md @@ -0,0 +1,105 @@ +# ๐Ÿ›ก๏ธ ZeroLagHub Drift Prevention - Quick Start Card + +**Use This**: At the start of EVERY coding session (API, Agent, or Frontend) + +--- + +## โšก 30-Second Architecture Check + +### **Working on API?** +โœ… Can allocate: Ports, VMIDs, DNS, Velocity +โŒ Cannot: Install Java, download artifacts, exec in container + +### **Working on Agent?** +โœ… Can install: Java, artifacts, server files +โŒ Cannot: Allocate ports, create DNS, call Proxmox, register Velocity + +### **Working on Frontend?** +โœ… Can: Display data, call API endpoints +โŒ Cannot: Talk to Agent, Proxmox, DNS, Velocity, allocate anything + +--- + +## ๐Ÿ”’ 7 Locked Architectural Decisions (NEW) โญ + +**These decisions are FINAL** - cannot be changed without user saying "Revisit decision X": + +1. **DEC-001**: Templates + Agent hybrid (not templates-only or agent-only) +2. **DEC-002**: API orchestrates, Agent executes (not reversed) +3. **DEC-003**: API owns DNS (Agent never creates DNS) +4. **DEC-004**: Traefik + Velocity only (no HAProxy) +5. **DEC-005**: MariaDB is source of truth (no flat files) +6. **DEC-006**: Frontend โ†’ API only (no direct agent calls) +7. 
**DEC-007**: Drift prevention mandatory (always enforced) + +**If proposing change that conflicts with DEC-001 through DEC-007**: +โ†’ STOP โ†’ Consult Cross-Project Tracker โ†’ Request "Revisit decision X" from user + +--- + +## ๐Ÿšจ Drift Detection Triggers + +**STOP and consult full tracker if you hear:** + +1. "Agent should allocate ports..." +2. "API should install Java inside container..." +3. "Frontend should call agent directly..." +4. "Agent should register with Velocity..." +5. "API should decide what Java version..." +6. "Frontend should manage DNS records..." +7. "Let's use templates only..." (violates DEC-001) +8. "Let's add HAProxy..." (violates DEC-004) +9. "Let's use flat files instead of DB..." (violates DEC-005) + +**All of these are VIOLATIONS** โ†’ Consult [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md) + +--- + +## ๐Ÿ“‹ Pre-Coding Checklist + +Before writing ANY code: + +- [ ] Which system am I modifying? (API / Agent / Frontend) +- [ ] Does this change cross boundaries? (If yes โ†’ read full tracker) +- [ ] Am I adding external API calls to Agent? (If yes โ†’ VIOLATION) +- [ ] Am I adding container execution to API? (If yes โ†’ VIOLATION) +- [ ] Am I bypassing API in Frontend? (If yes โ†’ VIOLATION) + +--- + +## ๐ŸŽฏ The Golden Rules + +1. **API orchestrates** (allocates resources, publishes state) +2. **Agent executes** (installs, runs, monitors inside container) +3. 
**Frontend displays** (no direct infrastructure access) + +**Anything else** โ†’ Drift โ†’ Consult full tracker + +--- + +## ๐Ÿ“ž Quick Reference + +**Full Tracker**: [ZeroLagHub_Cross_Project_Tracker.md](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md) + +**High-Risk Zones**: +- Forge/NeoForge installation +- Cloudflare SRV deletion +- Velocity registration order +- PortPool commit logic +- Agent READY detection +- IP selection logic + +--- + +## โœ… Session Start Command + +``` +Read ZeroLagHub Cross-Project Tracker Quick Start Card. +Activate drift detection. +Confirm which system I'm working on: [API / Agent / Frontend] +Proceed with architecture-aligned implementation only. +``` + +--- + +๐Ÿ›ก๏ธ **Drift prevention ACTIVE. Proceed with confidence.** diff --git a/ZeroLagHub_GPT_Implementation_Handover_Dec2025.md b/ZeroLagHub_GPT_Implementation_Handover_Dec2025.md new file mode 100644 index 0000000..69a311c --- /dev/null +++ b/ZeroLagHub_GPT_Implementation_Handover_Dec2025.md @@ -0,0 +1,573 @@ +# ๐Ÿš€ ZeroLagHub - GPT Implementation Handover (December 2025) + +**Last Updated**: December 7, 2025 +**Version**: 4.0 (Launch-Ready Implementation Guide) +**Status**: 85% Platform Complete - Active Development Sprint + +--- + +## ๐ŸŽฏ Your Role (GPT - Implementation AI) + +**You are**: The tactical implementation AI responsible for **building features** +**Claude is**: The strategic architecture AI responsible for **design decisions** + +### **Your Responsibilities** +โœ… Implement features within architectural boundaries +โœ… Write code for API, Agent, and Frontend +โœ… Fix bugs and optimize performance +โœ… Execute sprint tasks from Kanban board +โœ… Consult Cross-Project Tracker before crossing system boundaries + +### **Your Constraints** +โŒ Do NOT make architectural decisions without Claude +โŒ Do NOT violate ownership boundaries (API vs Agent vs Frontend) +โŒ Do NOT change contracts without updating both sides +โŒ Do NOT skip drift 
prevention checks + +--- + +## ๐Ÿ“‹ Critical Documents (READ THESE FIRST) + +### **๐Ÿ›ก๏ธ MANDATORY Before ANY Code** +1. **[Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md)** (30 seconds) + - Quick boundary check for every session + - Violation triggers to watch for + +2. **[Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md)** (consult before changes) + - Ownership matrix (API vs Agent vs Frontend) + - Canonical contracts (API โ†” Agent, Frontend โ†” API) + - Drift detection rules with examples + +### **๐Ÿ“Š Implementation Context** +3. **[Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md)** (5 minutes) + - Engineering Kanban (DONE/IN PROGRESS/TODO) + - 3-day sprint plan with tasks + - Troubleshooting guides per variant + - Launch readiness matrix + +4. **Engineering Handover** (from today's uploaded document) + - System lifecycle diagram + - Provisioning sequence + - Verification system specs + - Today's accomplishments + +### **๐Ÿ”ง Technical Reference** +5. **[Agent Complete Spec](computer:///home/claude/ZeroLagHub_Agent_Complete_Spec.md)** (as needed) + - Go agent implementation details + - API endpoints and contracts + +6. 
**[Infrastructure Specs](computer:///mnt/user-data/outputs/ZeroLagHub_Infrastructure_Specifications.md)** (as needed) + - GTHost hardware constraints + - Capacity planning + +--- + +## 🎯 Current Platform Status (December 7, 2025) + +### **What's Working** ✅ (85% Complete) + +**Core Provisioning Pipeline**: +- ✅ All 6 Minecraft variants (Vanilla, Paper, Purpur, Fabric, Forge, NeoForge) +- ✅ VMID allocation (sequential) +- ✅ LXC container creation (template VMID 800) +- ✅ IP detection (10.200.0.X) +- ✅ Go agent deployment + self-repair +- ✅ Java runtime auto-selection (17/21) +- ✅ DNS automation (Cloudflare + Technitium) +- ✅ Velocity proxy registration +- ✅ Start/stop/restart control +- ✅ Console command injection +- ✅ Log tailing (HTTP polling) +- ✅ Crash detection + +**Supported Minecraft Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x + +### **What's Missing** ❌ (15% To-Do) + +**Critical for Launch** (7-9 hours): +- ❌ WebSocket console streaming (4-6 hours) - **HIGH PRIORITY** +- ❌ Crash loop protection with backoff (2 hours) - **HIGH PRIORITY** +- ❌ Disk space monitoring (1 hour) - **HIGH PRIORITY** + +**Dev Platform** (3-day sprint): +- 🔧 Dev container provisioning (Day 1 of sprint) +- 🔧 EdgeState schema migration (Day 2 of sprint) +- 🔧 Reconciliation job (Day 3 of sprint) + +**Nice-to-Have** (future): +- 📋 File upload/download +- 📋 Backup/restore UI +- 📋 Resource monitoring dashboard + +--- + +## 🗺️ System Architecture (Know Your Boundaries) + +### **Three-System Ownership** + +``` +┌─────────────────────────────────────────┐ +│ NODE.JS API (Orchestrator) │ +│ You Own: Routes, services, job queue │ +│ Speaks To: Proxmox, DNS, Velocity, DB │ +│ Never: Installs Java, downloads files │
+└────────────┬────────────────────────────┘ + │ HTTP Contract + │ POST /config, /start, /stop + │ GET /status, /health + ▼ +┌─────────────────────────────────────────┐ +│ GO AGENT (Container Manager) │ +│ You Own: Installation, verification │ +│ Speaks To: Filesystem, game process │ +│ Never: Allocates ports, creates DNS │ +└────────────┬────────────────────────────┘ + │ Status Polling + ▼ +┌─────────────────────────────────────────┐ +│ NEXT.JS FRONTEND (UI Only) │ +│ You Own: Components, client state │ +│ Speaks To: API only │ +│ Never: Agent, Proxmox, DNS, Velocity │ +└─────────────────────────────────────────┘ +``` + +### **Critical Boundaries (NEVER CROSS)** + +**API Must NOT**: +- ❌ Install Java inside containers +- ❌ Download game server files +- ❌ Execute commands directly in containers (use Agent) + +**Agent Must NOT**: +- ❌ Allocate ports +- ❌ Create DNS records +- ❌ Register with Velocity +- ❌ Talk to Proxmox API +- ❌ Manage VMIDs + +**Frontend Must NOT**: +- ❌ Talk directly to Agent +- ❌ Call Proxmox API +- ❌ Create DNS records +- ❌ Bypass API for any infrastructure + +**Violation = STOP → Consult Cross-Project Tracker** + +--- + +## 📅 3-Day Sprint Plan (Your Tasks) + +### **Day 1: Dev Containers** (December 8) + +**Goal**: Enable developer environment provisioning + +**Tasks**: +1. [ ] Define dev container spec (Python, Node, Go, Java) +2. [ ] Create template VMID 6000 (base dev environment) +3.
[ ] API: Add `/api/dev-instances` endpoints (create, delete, status) +4. [ ] Agent: Add dev provisioning flow (no game server start) +5. [ ] Test: Provision Python + Node dev environments + +**Success Criteria**: +- Can provision dev environment with chosen language +- Dev container accessible via SSH or web console +- No game server logic triggered + +**Files to Modify**: +- `src/routes/devInstances.js` (new) +- `src/services/devProvisioner.js` (new) +- Agent: `dev.go` (new provisioning flow) + +--- + +### **Day 2: EdgeState + DNS Reliability** (December 9) + +**Goal**: Fix Cloudflare SRV deletion problem + +**Tasks**: +1. [ ] Implement EdgeState model in Prisma schema +2. [ ] Update `edgePublisher.js` to store Cloudflare record IDs +3. [ ] Update `dePublisher.js` to delete by record ID (not hostname) +4. [ ] Test: Create → Delete → Recreate same hostname + +**Success Criteria**: +- No orphaned DNS records after deletion +- EdgeState tracks all Cloudflare record IDs +- Re-provisioning same hostname works + +**Files to Modify**: +- `prisma/schema.prisma` (add EdgeState model) +- `src/services/edgePublisher.js` +- `src/services/dePublisher.js` +- `src/clients/cloudflareClient.js` (return record IDs) + +--- + +### **Day 3: Reconciliation + Hardening** (December 10) + +**Goal**: Self-healing infrastructure + +**Tasks**: +1. [ ] Create reconciliation job (DB ↔ Proxmox ↔ DNS ↔ Velocity) +2. [ ] Detect orphaned containers (in Proxmox but not DB) +3. [ ] Detect orphaned DNS records (in DNS but not DB) +4. [ ] Auto-cleanup with confirmation prompt +5. [ ] Regression test suite + +**Success Criteria**: +- Reconciliation job detects all orphans +- Can auto-clean with user confirmation +- System recovers from partial failures + +**Files to Create**: +- `src/jobs/reconciliationJob.js` +- `src/services/reconciler.js` (rewrite) +- `tests/reconciliation.test.js` + +--- + +## 🐛 Known Bugs (Fix These) + +### **Go Agent Bugs** (Non-Blocking, But Should Fix) + +1.
**Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151) + ```go + // REMOVE THIS - Forge doesn't create server.jar + serverJarPath := filepath.Join(installDir, "*server.jar") + ``` + **Fix**: Remove glob/rename logic entirely + +2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171) + ```go + // ADD ELSE HERE + if variant == "forge" || variant == "neoforge" { + // Forge logic + } else { // <-- ADD THIS + // Vanilla-like logic + } + ``` + +3. **Forge Stop Command Exclusion** (`process.go` line 83) + ```go + // REMOVE THIS EXCLUSION - Forge accepts stop commands + if p.variant != "forge" && p.variant != "neoforge" { + p.sendCommand("stop") + } + ``` + **Fix**: Remove the if condition, send stop to all variants + +--- + +## 🚨 High-Risk Integration Zones (Careful!) + +These areas have caused drift in past sessions: + +1. **Forge/NeoForge Installation** + - Agent owns ALL installation logic + - API only passes config, never executes install commands + +2. **Cloudflare SRV Deletion** + - Must use record IDs (not hostname inference) + - Store IDs in EdgeState on creation + +3. **Velocity Registration Order** + - Wait for Agent to report RUNNING state + - Then register with Velocity (not before) + +4. **Port Allocation** + - API allocates → provisions → commits + - Rollback on failure (don't leak ports) + +5.
**Agent READY Detection** + - Use variant-aware log parsing + - Forge takes 60-90s (don't timeout early) + +--- + +## 📁 Key File Locations + +### **API Service** (`/home/zlh/zlh-api-v2/`) +``` +src/ +├── routes/ +│ ├── containers.js # Game server endpoints +│ └── devInstances.js # Dev environment endpoints (TO CREATE) +├── services/ +│ ├── edgePublisher.js # DNS + Velocity publishing +│ ├── dePublisher.js # Edge cleanup (NEEDS REWRITE) +│ ├── portAllocator.js # Port management +│ └── reconciler.js # Orphan detection (NEEDS REWRITE) +├── clients/ +│ ├── cloudflareClient.js # Cloudflare API (UPDATE for record IDs) +│ ├── technitiumClient.js # Technitium DNS +│ └── proxmoxClient.js # Proxmox API +└── jobs/ + └── reconciliationJob.js # Self-healing job (TO CREATE) +``` + +### **Go Agent** (`/opt/zlh-agent/`) +``` +├── agent.go # Main provisioning logic (FIX fallthrough) +├── artifacts.go # Download + verification (REMOVE Forge glob) +├── process.go # Server lifecycle (FIX Forge stop exclusion) +├── api.go # HTTP server for control +└── dev.go # Dev environment provisioning (TO CREATE) +``` + +### **Frontend** (`/home/zlh/zlh-portal/`) +``` +src/ +├── app/ +│ ├── containers/ # Game server UI +│ └── dev/ # Dev environment UI (TO CREATE) +└── components/ + └── Console.tsx # WebSocket console (TO CREATE) +``` + +--- + +## 🎮 Minecraft Variant Status + +| Variant | Install | Verify | Start | READY Detection | Status | +|---------|---------|--------|-------|-----------------|--------| +| Vanilla | ✅ | ✅ | ✅ | ✅ | Production | +| Paper | ✅ | ✅ | ✅ | ✅ | Production | +| Purpur | ✅ | ✅ | ✅ | ✅ | Production | +| Fabric | ✅ | ✅ | ✅ | ✅ | Production | +| Forge | ✅ | ✅ | ✅ | ✅ | Production (has bugs) | +| NeoForge | ✅ | ✅ | ✅ | ✅ | Production | + +**All variants
work** - 3 non-blocking bugs should be fixed for code quality. + +--- + +## 🔄 Provisioning Flow (Know This) + +``` +1. User creates server via Frontend + ↓ +2. Frontend → API: POST /api/containers/create + ↓ +3. API allocates VMID, ports (if needed) + ↓ +4. API clones LXC from template VMID 800 + ↓ +5. API configures container (IP, resources) + ↓ +6. API starts LXC + ↓ +7. API detects container IP (10.200.0.X) + ↓ +8. API → Agent: POST /config (payload) + ↓ +9. Agent saves payload.json + ↓ +10. Agent spawns install goroutine (async) + ├─ Download Java + ├─ Download game artifacts + ├─ Verify installation + └─ Self-repair if needed + ↓ +11. Agent starts server + ↓ +12. Agent detects READY (log parsing) + ↓ +13. Agent sets state = RUNNING + ↓ +14. API polls /status until RUNNING + ↓ +15. API saves to database + ↓ +16. API publishes DNS (Cloudflare + Technitium) + ↓ +17. API registers with Velocity + ↓ +18. API returns SUCCESS to user + ↓ +COMPLETE ✅ +``` + +--- + +## 🛡️ Drift Prevention (ACTIVE) + +### **Before Writing ANY Code** + +**Ask yourself**: +1. Which system am I modifying? (API / Agent / Frontend) +2. Does this cross boundaries? (If yes → read Cross-Project Tracker) +3. Am I adding external calls to Agent? (If yes → VIOLATION) +4. Am I adding container execution to API? (If yes → VIOLATION) +5. Am I bypassing API in Frontend?
(If yes → VIOLATION) + +**If ANY doubt** → Stop and consult [Cross-Project Tracker](computer:///mnt/user-data/outputs/ZeroLagHub_Cross_Project_Tracker.md) + +### **Common Violations to Avoid** + +❌ Agent allocating ports → API owns this +❌ API installing Java → Agent owns this +❌ Frontend calling Agent directly → Must go through API +❌ Agent creating DNS → API owns this +❌ API deciding Java version → Agent owns this (version-aware) + +--- + +## 🎯 Success Metrics (How You'll Be Measured) + +### **Sprint Completion** +- [ ] All 3 days of sprint tasks completed +- [ ] Dev containers operational +- [ ] EdgeState tracking DNS record IDs +- [ ] Reconciliation job working + +### **Code Quality** +- [ ] No architectural violations (follow Cross-Project Tracker) +- [ ] All 3 Go agent bugs fixed +- [ ] Tests passing +- [ ] No new drift introduced + +### **Platform Readiness** +- [ ] 95% launch-ready after sprint +- [ ] All MC variants still working +- [ ] No regressions from changes + +--- + +## 🧪 Testing Requirements + +### **Before Committing Code** + +**Test Each Variant**: +```bash +# Test provisioning +POST /api/containers/create {variant: "vanilla"} +POST /api/containers/create {variant: "paper"} +POST /api/containers/create {variant: "purpur"} +POST /api/containers/create {variant: "fabric"} +POST /api/containers/create {variant: "forge"} +POST /api/containers/create {variant: "neoforge"} + +# Verify RUNNING state +GET /api/containers/:vmid/status +# Should return: {state: "RUNNING"} + +# Test control +POST /api/containers/:vmid/stop +POST /api/containers/:vmid/start +POST /api/containers/:vmid/restart + +# Test cleanup +DELETE /api/containers/:vmid +# Verify no orphaned DNS records +``` + +**Test Dev Containers**: +```bash +POST /api/dev-instances/create {language: "python"} +POST /api/dev-instances/create {language: "node"} + +# Verify accessible +ssh into dev container +# Should have language toolchain installed +``` + +--- + +## 📊 Infrastructure Constraints (Know Your Limits) +
+**Hardware** (GTHost Dedicated): +- CPU: 12 cores / 24 threads (Intel Xeon Silver 4116) +- RAM: 192 GB (can run 30-50 simultaneous 4GB servers) +- Storage: 1.8 TB free (300-500 servers capacity) +- Network: 300 Mbit/s (30-60 concurrent players) + +**Current Allocation**: +- 11 VMs: 56 GB RAM, 24 CPU threads, ~512 GB disk +- Available for servers: 128 GB RAM, 1.8 TB disk + +**Don't exceed these limits** - check before provisioning. + +--- + +## 🚀 Launch Decision (Context) + +**Option A**: Launch NOW (85% ready, soft beta only) +**Option B**: +3 Days Sprint (95% ready, RECOMMENDED) +**Option C**: +1 Week (98% ready, over-engineering) + +**Your role**: Execute Option B sprint (complete 3 days of tasks) + +**After sprint**: Platform ready for professional launch + +--- + +## 📞 Session Continuity + +### **Starting Fresh Session (You)** + +``` +1. Read Drift Prevention Card (30 seconds) + └─ Activate boundary awareness + +2. Read Complete Current State (5 minutes) + └─ Get Kanban state + sprint tasks + +3. Consult Cross-Project Tracker before code + └─ Verify no boundary violations + +4. Execute sprint tasks with constraints active + +5. Test all variants before committing +``` + +### **Handoff to Claude (Architecture Questions)** + +If you encounter: +- Architectural decisions (e.g., should we change contracts?) +- Strategic questions (e.g., which features to prioritize?) +- Business model questions +- Major design changes + +**STOP and escalate to Claude** for architectural guidance. + +--- + +## ✅ Quick Reference + +### **Your Mission** +Execute 3-day sprint → Deliver 95% launch-ready platform + +### **Your Boundaries** +API orchestrates | Agent installs | Frontend displays + +### **Your Critical Docs** +1. Drift Prevention Card (session start) +2. Cross-Project Tracker (before code) +3. Complete Current State (sprint tasks) +4.
Engineering Handover (technical details) + +### **Your Success** +- [ ] 3-day sprint complete +- [ ] No architectural violations +- [ ] All variants still working +- [ ] Tests passing + +--- + +## 🎯 Start Here (First Actions) + +**Right Now**: +1. ✅ Read [Drift Prevention Card](computer:///mnt/user-data/outputs/ZeroLagHub_Drift_Prevention_Card.md) (30 sec) +2. ✅ Read [Complete Current State](computer:///mnt/user-data/outputs/ZeroLagHub_Complete_Current_State_Dec7.md) (5 min) +3. ✅ Check Engineering Kanban for current TODO +4. ✅ Begin Day 1 sprint: Dev containers + +**Remember**: +- 🛡️ Drift prevention ACTIVE +- 📋 Consult tracker before crossing boundaries +- ✅ Test all variants before committing +- 🚀 Goal: 95% launch-ready after 3 days + +--- + +**Status**: You have everything you need to execute the sprint. Let's build! 🚀 diff --git a/ZeroLagHub_Infrastructure_Specifications.md b/ZeroLagHub_Infrastructure_Specifications.md new file mode 100644 index 0000000..97420e8 --- /dev/null +++ b/ZeroLagHub_Infrastructure_Specifications.md @@ -0,0 +1,465 @@ +# 🏗️ ZeroLagHub Infrastructure - Complete Specifications + +**Last Updated**: December 7, 2025 +**Provider**: GTHost +**Cost**: $109/month +**Status**: Production Infrastructure + +--- + +## 🖥️ Dedicated Server Hardware + +### **Server Platform** +**Model**: Supermicro 2029TP-HC1R (hot-swap HDD chassis) +**Form Factor**: Enterprise rackmount server + +### **CPU** +**Model**: Intel Xeon Silver 4116 +**Cores**: 12 cores / 24 threads +**Clock Speed**: 2.1 GHz base, 3.0 GHz turbo +**Architecture**: Intel Skylake-SP (Xeon Scalable, 1st generation) +**TDP**: 85W +**Cache**: 16.5 MB L3 + +**Performance**: +- Single-threaded: Excellent for game servers +- Multi-threaded: Handles 11 VMs + 75-100 containers +- Turbo boost ensures responsive provisioning + +### **Memory (RAM)** +**Configuration**: 6 x 32GB DIMMs +**Total**: 192 GB +**Type**: Hynix DDR4 RDIMM (Registered) +**Speed**: 2400 MHz +**ECC**: Yes
(Error-Correcting Code) + +**Benefits**: +- ECC protects against memory corruption +- Registered DIMMs = enterprise reliability +- 192GB = ample headroom for VM overhead + game servers + +### **Storage** +**Configuration**: 2 x 1.92TB SSD +**Model**: Samsung PM863 (Enterprise SSD) +**Total Capacity**: 3.84 TB raw +**Available**: ~1.8 TB (after Proxmox + VMs + overhead) +**Interface**: SATA (likely in RAID configuration) + +**Samsung PM863 Specs**: +- Enterprise-grade datacenter SSD +- Optimized for mixed workload +- Power Loss Protection (PLP) +- High endurance rating + +**RAID Configuration** (Likely): +- RAID 1 (mirrored) for redundancy +- Or ZFS RAID-Z for Proxmox storage + +### **Network** +**Bandwidth**: 300 Mbit/s (37.5 MB/s) +**Metering**: Unmetered (no bandwidth caps) +**Uplink**: Enterprise datacenter connection + +**Performance Assessment**: +- 300 Mbit/s = 30-40 concurrent Minecraft players comfortably +- Unmetered = no surprise overage charges +- Sufficient for soft launch, may need upgrade at scale + +### **Operating System** +**OS**: Proxmox VE 8 (64-bit) +**Base**: Debian 12 "Bookworm" +**Kernel**: Linux 6.2+ + +--- + +## 📊 Partition Layout + +### **Disk Partitions** +``` +/boot: 1024 MB (1 GB) - Boot partition +/swap: 2048 MB (2 GB) - Swap space +/root: Auto size - Main Proxmox system + VM storage + (~3.8 TB usable after RAID) +``` + +**Storage Allocation** (Estimated): +``` +Total: 3.84 TB raw (2 x 1.92TB) +RAID overhead: -0.04 TB (metadata, alignment) + -------- +Available: 3.80 TB + +Proxmox OS: -0.10 TB (Proxmox + system) +VM Disks: -1.80 TB (11 VMs, templates) +LXC Containers: -0.10 TB (current containers) + -------- +Free Space: 1.80 TB (for game servers + growth) +``` + +--- + +## 🎯 Performance Characteristics + +### **CPU Capabilities** + +**Per-Core Performance**: ⭐⭐⭐⭐ (Excellent for game servers) +- Minecraft servers are single-threaded +- 3.0 GHz turbo provides snappy performance +- 12 cores = 12 simultaneous
high-performance game servers + +**Multi-Core Throughput**: ⭐⭐⭐⭐⭐ (Excellent for hosting) +- 24 threads handle VM overhead efficiently +- Can run 11 Proxmox VMs + 75-100 LXC containers +- Provisioning operations don't impact running servers + +**Virtualization**: ⭐⭐⭐⭐⭐ (Native support) +- Intel VT-x + VT-d enabled +- Hardware-accelerated virtualization +- LXC containers = near-native performance + +### **Memory Capabilities** + +**Capacity**: 192 GB = **Excellent** for scale +- Average Minecraft server: 2-4 GB +- 192 GB / 4 GB = **48 simultaneous 4GB servers** +- Or 96 lightweight 2GB servers + +**Speed**: 2400 MHz DDR4 = **Good** (not bleeding edge, but sufficient) +- DDR4-2400 provides adequate bandwidth for game hosting +- ECC ensures data integrity under load + +**Reliability**: ECC RDIMM = **Enterprise-grade** +- Detects and corrects memory errors +- Critical for 24/7 uptime + +### **Storage Capabilities** + +**Capacity**: 3.84 TB = **Very Good** for initial scale +- Each Minecraft server: 1-5 GB (world size varies) +- Can host 300-500 Minecraft servers comfortably +- 1.8 TB free = room for significant growth + +**Performance**: Samsung PM863 = **Excellent** for workload +- Random IOPS: ~10,000 read, ~2,000 write +- Sequential: 520 MB/s read, 485 MB/s write +- Perfect for database + game world I/O + +**Reliability**: Enterprise SSD = **Excellent** +- Power Loss Protection prevents corruption +- Rated for 1.3 PB writes (years of 24/7 operation) +- RAID 1 (likely) provides redundancy + +### **Network Capabilities** + +**Bandwidth**: 300 Mbit/s = **Adequate for soft launch** +- Minecraft player: ~0.5-1 Mbit/s +- 300 Mbit/s = 30-60 players (conservative estimate) +- Unmetered = no bandwidth overage charges + +**Upgrade Path**: GTHost likely offers 1 Gbps upgrades +- 1 Gbps would support 100-200 players +- Consider upgrade when approaching 40+ concurrent players + +--- + +## 🏗️ Current Resource Allocation + +### **VM Resource Breakdown**
(11 VMs) + +``` +Hypervisor Overhead: ~8 GB RAM, 2 CPU cores + +Critical Production: +├─ VM 100 (zlh-panel): 4 GB RAM, 2 cores, 32 GB disk +├─ VM 103 (zlh-api): 4 GB RAM, 2 cores, 32 GB disk +└─ VM 101 (zlh-wings): 8 GB RAM, 4 cores, 64 GB disk + +Platform Services: +├─ VM 102 (zlh-portal): 4 GB RAM, 2 cores, 32 GB disk +├─ VM 104 (zlh-monitor): 8 GB RAM, 2 cores, 64 GB disk +└─ VM 1002 (zlh-proxy): 2 GB RAM, 1 core, 16 GB disk + +Network Layer: +├─ VM 1000 (zlh-router): 4 GB RAM, 2 cores, 32 GB disk +├─ VM 1006 (zpack-router): 4 GB RAM, 2 cores, 32 GB disk +└─ VM 1001 (zlh-dns): 2 GB RAM, 1 core, 16 GB disk + +Development/Support: +├─ VM 300 (zlh-panel-dev): 4 GB RAM, 2 cores, 32 GB disk +└─ VM 2000 (zlh-ci): 4 GB RAM, 2 cores, 32 GB disk + +Backup: +└─ VM [zlh-back]: 8 GB RAM, 2 cores, 128 GB disk + +TOTAL VM ALLOCATION: 56 GB RAM, 24 vCPUs, ~512 GB disk +``` + +### **Available for Game Servers** + +``` +RAM Available: 192 GB - 56 GB (VMs) - 8 GB (overhead) = 128 GB +CPU Available: 24 threads - 24 (VM allocation) = 0 (shared) +Disk Available: 1.8 TB free + +Game Server Capacity (Conservative): +├─ 2GB servers: 64 simultaneous servers +├─ 4GB servers: 32 simultaneous servers +└─ 8GB servers: 16 simultaneous servers + +Developer Environment Capacity: +├─ 2GB dev envs: 64 simultaneous environments +└─ 4GB dev envs: 32 simultaneous environments +``` + +**Note**: CPU is oversubscribed (common in hosting) since most game servers idle at <20% CPU usage. Turbo boost ensures good single-thread performance when needed.
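The RAM-based capacity figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the ZeroLagHub codebase: the constant and function names (`totalRAMGB`, `availableRAMGB`, `maxServers`) are hypothetical, and only the numbers come from the allocation tables above. It considers RAM alone, so CPU, disk, and network limits still apply on top of it.

```go
package main

import "fmt"

// Figures taken from the allocation tables above; names are illustrative only.
const (
	totalRAMGB      = 192 // installed RAM on the host
	vmRAMGB         = 56  // sum of RAM assigned to the VMs
	hypervisorRAMGB = 8   // estimated Proxmox overhead
)

// availableRAMGB returns the RAM left for LXC game servers and dev environments.
func availableRAMGB() int {
	return totalRAMGB - vmRAMGB - hypervisorRAMGB
}

// maxServers returns how many servers of the given size fit in the free RAM.
// It ignores CPU, disk, and network, so treat the result as an upper bound.
func maxServers(perServerGB int) int {
	if perServerGB <= 0 {
		return 0
	}
	return availableRAMGB() / perServerGB
}

func main() {
	fmt.Printf("RAM available: %d GB\n", availableRAMGB())
	for _, size := range []int{2, 4, 8} {
		fmt.Printf("%d GB servers: %d\n", size, maxServers(size))
	}
}
```

Running it yields 128 GB available and 64, 32, and 16 servers for the 2 GB, 4 GB, and 8 GB tiers, matching the table. Keeping the inputs as named constants makes it easy to re-run the estimate after any VM consolidation.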
+ +--- + +## 📈 Capacity & Scaling Projections + +### **Current Capacity** (As Deployed) + +**Game Servers**: 30-50 active servers with current VM allocation +**Developer Environments**: 75-100 environments (documented capacity) +**Concurrent Players**: 30-60 players (network limited) + +### **Optimized Capacity** (With Tuning) + +**Game Servers**: 60-80 active servers (after VM consolidation) +**Developer Environments**: 100-150 environments +**Concurrent Players**: Still 30-60 (network bottleneck) + +### **Maximum Theoretical Capacity** + +**Game Servers**: 128 lightweight servers (if only game hosting, no dev) +**Developer Environments**: 192 environments (if only dev, no games) +**Storage**: 300-500 servers before storage exhaustion + +**Limiting Factors**: +1. **Network** (300 Mbit/s) - limits concurrent players +2. **RAM** (192 GB) - limits concurrent heavy servers +3. **Storage** (1.8 TB free) - limits total servers + +--- + +## 💰 Cost Analysis + +### **Current Infrastructure Cost** + +**Monthly**: $109 GTHost dedicated server +**Annually**: $1,308 + +**Cost per Resource**: +- Per GB RAM: $0.57/month ($109 ÷ 192 GB) +- Per CPU core: $9.08/month ($109 ÷ 12 cores) +- Per TB storage: $28.39/month ($109 ÷ 3.84 TB) + +### **Competitive Analysis** + +**AWS Equivalent** (m5.2xlarge + storage + bandwidth): +- 8 vCPU, 32 GB RAM, 1 TB storage, 1 Gbps +- Cost: ~$300-400/month + +**Hetzner Dedicated** (Similar specs): +- 12 core Xeon, 128 GB RAM, 2x2TB SSD +- Cost: ~$100/month (but higher network costs) + +**GTHost Value**: ⭐⭐⭐⭐⭐ Excellent +- Roughly 64-73% cheaper than the AWS estimate above ($109 vs $300-400/month) +- Competitive with Hetzner +- Unmetered bandwidth (key advantage) + +--- + +## 🎯 Competitive Advantages + +### **1. LXC Performance** +- Host hardware enables 20-30% better performance vs Docker +- Intel Xeon Silver 4116 single-thread performance excellent for games + +### **2.
Resource Density** +- 192 GB RAM supports 30-50 simultaneous 4GB servers +- Competitors typically offer 64-128 GB at this price point + +### **3. Storage Performance** +- Samsung PM863 enterprise SSDs outperform consumer SSDs +- Power Loss Protection prevents world corruption +- Hot-swap chassis enables maintenance without downtime + +### **4. Network** +- Unmetered = no bandwidth surprises +- 300 Mbit/s adequate for soft launch +- Upgrade path available when needed + +--- + +## ⚠️ Identified Constraints + +### **1. Network Bandwidth** (Current Bottleneck) +- **300 Mbit/s limits to 30-60 concurrent players** +- **Recommendation**: Monitor bandwidth usage, upgrade to 1 Gbps when approaching 40 players +- **Upgrade Cost**: Likely +$20-50/month for 1 Gbps + +### **2. CPU Oversubscription** +- 24 threads allocated to VMs, but most VMs idle +- Game servers share CPU via time-slicing +- **Risk**: If all servers spike simultaneously, performance degrades +- **Mitigation**: Limit concurrent servers to 40-50 until load testing proves a higher count is safe + +### **3. Storage Growth** +- 1.8 TB free supports 300-500 servers +- Each server grows over time (world expansion) +- **Recommendation**: Monitor disk usage, plan expansion at 70% utilization +- **Expansion Options**: Add external storage or upgrade to larger SSDs + +--- + +## 🔧 Optimization Opportunities + +### **Immediate Optimizations** (No Cost) + +1. **VM Consolidation** + - Merge zlh-panel-dev into zlh-panel (save 4 GB RAM, 2 cores) + - Merge zlh-proxy into zlh-router (save 2 GB RAM, 1 core) + - **Gain**: 6 GB RAM, 3 cores for game servers + +2. **LXC Over VMs** + - Convert lightweight VMs to LXC containers + - Example: zlh-dns, zlh-proxy candidates + - **Gain**: Lower overhead, faster provisioning + +3. **Memory Deduplication (KSM)** + - Enable KSM (Kernel Same-page Merging) on Proxmox + - Deduplicate identical memory pages + - **Gain**: 5-10% more available RAM + +### **Paid Optimizations** (Consider at Scale) + +1.
**Network Upgrade**: 1 Gbps uplink (+$20-50/month) + - Removes player concurrency bottleneck + - Enables 100-200 player capacity + +2. **Storage Expansion**: Add 4TB NVMe (+$50/month) + - Doubles storage to ~6 TB total + - Supports 600-1000 servers + +3. **Cloudflare Enterprise** (+$200/month) + - DDoS protection for game traffic + - CDN for static assets + - Worth it at 100+ servers + +--- + +## 📊 Hardware Lifecycle + +### **Current Status** (December 2025) + +**Server Age**: Unknown (likely 1-3 years based on Xeon Silver 4116 era) +**Expected Lifespan**: 5-7 years for enterprise server +**Remaining Life**: Likely 3-5 years + +**Components**: +- CPU: Xeon Silver 4116 (2017 release) - still very capable +- RAM: DDR4-2400 (current gen, plenty of life) +- SSD: Samsung PM863 (enterprise grade, high endurance) + +### **Upgrade Path** (Future) + +**Year 1-2** (Current plan): +- Optimize existing hardware +- Minor network upgrades if needed + +**Year 3-4** (Growth phase): +- Consider second dedicated server +- Load balance across servers +- Geographic distribution + +**Year 5+** (Scale phase): +- Migrate to colocation or cloud +- Multi-datacenter deployment + +--- + +## 🛡️ Reliability Features + +### **Hardware Reliability** + +✅ **ECC Memory** - Corrects single-bit errors automatically +✅ **Enterprise SSDs** - Power Loss Protection, high endurance +✅ **Hot-Swap Chassis** - Replace drives without shutdown +✅ **Redundant Power** (likely) - Supermicro chassis typically dual PSU + +### **Software Reliability** + +✅ **Proxmox High Availability** - VM failover (if configured) +✅ **PBS Backup** - Incremental backups to Backblaze B2 +✅ **LXC Snapshots** - Fast rollback capability +✅ **RAID Mirroring** (likely) - Disk failure protection + +### **Network Reliability** + +✅ **Datacenter Uptime** - GTHost likely 99.9%+ SLA +✅ **Unmetered Bandwidth** - No throttling during spikes +⚠️ **Single Uplink** - No network redundancy (acceptable for price
point) + +--- + +## 🎯 Summary & Recommendations + +### **Hardware Assessment**: ⭐⭐⭐⭐ Very Good for Use Case + +**Strengths**: +- Excellent CPU for game server hosting (Xeon Silver 4116) +- Abundant RAM (192 GB = 30-50 servers) +- Enterprise storage (Samsung PM863 + hot-swap) +- Unmetered bandwidth (no surprise charges) +- Great value ($109/month for these specs) + +**Limitations**: +- Network bandwidth (300 Mbit/s = 30-60 players) +- Storage growth constraint (monitor usage) +- CPU oversubscription (limit concurrent servers initially) + +### **Recommendations** + +**Now** (Launch Phase): +1. ✅ Deploy on current hardware - adequate for soft launch +2. ✅ Limit to 40-50 concurrent servers initially +3. ✅ Monitor bandwidth, RAM, and disk usage + +**Month 1-3** (Early Growth): +1. 🔧 Optimize VM allocation (consolidate where possible) +2. 🔧 Implement aggressive monitoring +3. 🔧 Consider 1 Gbps network upgrade if approaching 40 players + +**Month 6-12** (Scale Phase): +1. 📈 Evaluate storage expansion based on usage +2. 📈 Consider second server for geographic distribution +3. 📈 Implement Cloudflare Enterprise for DDoS protection + +### **Capacity Targets by Phase** + +**Soft Launch** (Month 1-3): 20-30 servers, 10-20 players +**Public Launch** (Month 3-6): 40-50 servers, 30-40 players +**Growth Phase** (Month 6-12): 60-80 servers, 60-100 players (with 1 Gbps upgrade) +**Scale Phase** (Month 12+): 100+ servers, multi-server deployment + +--- + +## ✅ Conclusion + +**Status**: Infrastructure is **production-ready** for ZeroLagHub launch. + +**Key Points**: +- Hardware specifications are excellent for initial scale +- 192 GB RAM supports 30-50 game servers +- Storage capacity adequate for 300-500 servers +- Network bandwidth is current bottleneck (acceptable for soft launch) +- Cost-effective ($109/month for enterprise-grade hardware) + +**Green Light**: ✅ Launch when platform development complete (currently 85% ready).
+ +--- + +**Last Updated**: December 7, 2025 +**Source**: GTHost server specifications + ZeroLagHub infrastructure analysis diff --git a/ZeroLagHub_Master_Bootstrap_Dec2025.md b/ZeroLagHub_Master_Bootstrap_Dec2025.md new file mode 100644 index 0000000..e3f74a7 --- /dev/null +++ b/ZeroLagHub_Master_Bootstrap_Dec2025.md @@ -0,0 +1,606 @@ +# 🚀 ZeroLagHub - Master Bootstrap Document (December 2025) + +**Last Updated**: December 7, 2025 +**Version**: 4.0 (Platform Launch Ready) +**Status**: 85% Complete - Launch Decision Point + +--- + +## 📌 Quick Start for New AI Sessions + +> **Resume Point**: Platform 85% launch-ready, all core provisioning operational, critical UX features needed. +> +> **Current Phase**: Launch readiness assessment - choose NOW vs +1 week vs +1 month +> +> **Critical Context**: All 6 Minecraft variants provisioning successfully via Go agent. Need WebSocket console, crash protection, and disk monitoring for competitive parity. + +--- + +## 🎯 Project Overview + +**ZeroLagHub** is a developer-focused game server hosting platform built on: +- **Proxmox VE** with LXC containers (20-30% performance advantage over Docker) +- **Hybrid Architecture**: Pterodactyl panel + Custom Node.js API + Go provisioning agent +- **Velocity proxy** for seamless Minecraft routing +- **Dual-router architecture** for traffic separation +- **Developer-to-player revenue pipeline** with 9.75x revenue multiplier + +### Core Value Proposition +Complete dev-to-production pipeline: Development environments ($20/mo) → Testing servers (50% discount) → Player hosting (25% discount) → Revenue sharing (7.5% commission) = viral growth through developer ecosystem.
+ +--- + +## 🏗️ Current Architecture (December 2025) + +### Infrastructure Overview (11 VMs) + +``` +Critical Production: +├── VM 100 (zlh-panel) - Pterodactyl panel + OAuth customization +├── VM 103 (zlh-api) - Node.js backend + developer platform APIs +├── VM 101 (zlh-wings) - Game servers + LXC integration target + +Platform Services: +├── VM 102 (zlh-portal) - Next.js frontend + developer dashboard +├── VM 104 (zlh-monitor) - Prometheus/Grafana monitoring + +Network & Infrastructure: +├── VM 1000 (zlh-router) - Platform services routing + VLANs +├── VM 1006 (zpack-router) - Game traffic routing + Velocity +├── VM 1001 (zlh-dns) - Technitium DNS + development domains +├── VM 1002 (zlh-proxy) - Caddy reverse proxy + SSL automation +├── VM 300 (zlh-panel-dev) - Development environment + testing +├── VM 2000 (zlh-ci) - CI/CD pipeline + automation +└── VM [zlh-back] - PBS backup + Backblaze B2 replication +``` + +### Network Topology + +``` +zlh-router (VM 1000): +├─ WAN1: Platform services (API, portal, monitoring) +├─ CORE_LAN: 10.60.0.0/24 (internal services) +├─ MGMT_LAN: 172.60.0.10/24 (inter-router communication) +└─ WireGuard: Admin access + +zpack-router (VM 1006): +├─ WAN2: 139.64.165.248 (game services) +├─ ZPACK_LAN: 10.70.0.0/24 (Velocity @ 10.70.0.241) +├─ DEV_LAN: 10.100.0.0/24 (developer environments - future) +├─ GAME_LAN: 10.200.0.0/24 (game server LXCs) +└─ MGMT_LAN: 172.60.0.20/24 (control plane communication) +``` + +### Traffic Flows + +- **Platform Access**: Client → WAN1 → zlh-router → Frontend/API +- **Game Play**: Player → WAN2 (139.64.165.248) → zpack-router → Velocity (10.70.0.241) → Game Server (10.200.0.X) +- **Control Plane**: API → MGMT_LAN (172.60.0.X) → Velocity/DNS/Monitoring + +--- + +## ✅ What's Working (December 7, 2025) + +### Provisioning Pipeline (100% Operational) + +| Component | Status | Notes
|
+|-----------|--------|-------|
+| **LXC Container Creation** | ✅ | Template VMID 800, auto-cloning working |
+| **VMID Allocation** | ✅ | Sequential assignment from range |
+| **IP Detection** | ✅ | Automatic network configuration |
+| **Go Agent Deployment** | ✅ | Payload delivery + self-repair system |
+| **Java Runtime Selection** | ✅ | Auto-detect MC version → Java 17/21 |
+| **All 6 MC Variants** | ✅ | Vanilla, Paper, Purpur, Fabric, Forge, NeoForge |
+| **Server Startup** | ✅ | All variants start successfully |
+| **DNS Publishing** | ✅ | Cloudflare + Technitium A + SRV records |
+| **Velocity Registration** | ✅ | Dynamic backend server registration |
+| **Client Connectivity** | ✅ | Players can connect and play |
+
+### Control Functions
+
+| Function | Status | Implementation |
+|----------|--------|----------------|
+| **Start/Stop/Restart** | ✅ | HTTP API → Go agent |
+| **Console Commands** | ✅ | Command injection working |
+| **Log Tailing** | ⚠️ | HTTP polling only (need WebSocket) |
+| **Status Reporting** | ✅ | Agent emits RUNNING state |
+| **Crash Detection** | ✅ | Agent tracks exit codes |
+
+### Game Support Matrix
+
+**Launch Ready (Minecraft Only)**:
+- ✅ **Vanilla** - Official Mojang server
+- ✅ **Paper** - Primary recommendation (vanilla + plugins)
+- ✅ **Purpur** - Paper fork with extra features
+- ✅ **Fabric** - Lightweight mod support
+- ✅ **Forge** - Heavy mod support (tech/magic mods)
+- ✅ **NeoForge** - Modern Forge fork (**competitive advantage**)
+
+**Supported Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
+
+**Deferred to Post-Launch**:
+- 📋 Terraria
+- 📋 Project Zomboid
+- 📋 Valheim
+- 📋 Rust
+
+---
+
+## 🚨 Known Issues & Gaps
+
+### Critical Bugs (Non-Blocking, System Works)
+
+1.
**Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
+   - Tries to find `*server.jar` but Forge ≥1.17 doesn't create this
+   - **Fix**: Remove glob/rename logic (Forge uses `run.sh` + `libraries/`)
+   - **Impact**: System works, but unnecessary code
+
+2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
+   - After Forge check, falls through to check `server.jar`
+   - **Fix**: Add `else` to prevent fallthrough
+   - **Impact**: Minor efficiency issue
+
+3. **Forge Stop Command Exclusion** (`process.go` line 83)
+   - Excludes Forge from receiving `stop` command
+   - **Fix**: Remove exclusion (Forge accepts stop commands)
+   - **Impact**: Manual workaround needed for Forge stops
+
+### Missing Competitive Features (CRITICAL)
+
+| Feature | Apex | Shockbyte | ZeroLagHub | Priority |
+|---------|------|-----------|------------|----------|
+| **All MC Variants** | ✅ | ✅ | ✅ | - |
+| **NeoForge** | ❌ | ❌ | ✅ | ADVANTAGE |
+| **Performance** | 🟡 | 🟡 | ✅ | ADVANTAGE |
+| **Console Streaming** | ✅ | ✅ | ❌ | 🔴 HIGH |
+| **File Management** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
+| **Backups** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
+| **Crash Protection** | ✅ | ✅ | ❌ | 🔴 HIGH |
+| **Disk Monitoring** | ✅ | ✅ | ❌ | 🔴 HIGH |
+
+---
+
+## 🎯 Platform Readiness Assessment (85%)
+
+### Core Platform (100%)
+- ✅ Container orchestration
+- ✅ Multi-variant provisioning
+- ✅ Network routing (dual-router)
+- ✅ DNS automation (Cloudflare + Technitium)
+- ✅ Velocity proxy integration
+- ✅ Start/stop/restart control
+- ✅ Console command injection
+- ✅ Status monitoring
+
+### Operational Features (70%)
+- ✅ Log tailing (HTTP polling)
+- ✅ Crash detection
+- ❌ **WebSocket console** (need real-time streaming)
+- ❌ **Crash loop protection** (need exponential backoff)
+- ❌ **Disk space monitoring** (prevent corruption)
+
+### File Management (0%)
+- ❌ File upload/download
+- ❌ Backup/restore system
+-
❌ World file management
+
+### Advanced Features (Planned)
+- 📋 Resource monitoring dashboard
+- 📋 Plugin marketplace
+- 📋 Developer platform APIs
+- 📋 Performance optimization tools
+
+---
+
+## 🚀 Launch Decision Point
+
+### Option A: Launch NOW (Soft Beta)
+**Status**: 85% ready
+**Timeline**: Immediate
+**Pros**: Fast to market, gather user feedback
+**Cons**: Missing competitive UX features, higher support burden
+**Recommendation**: ⚠️ Acceptable for 10-20 beta users only
+
+### Option B: +1 Week (Critical Features) ⭐ RECOMMENDED
+**Status**: 95% ready after additions
+**Timeline**: December 14, 2025
+**Add**: WebSocket console + Crash protection + Disk monitoring
+**Effort**: 7-9 hours total
+**Pros**: Competitive feature parity, professional launch
+**Cons**: Minimal delay
+**Recommendation**: ✅ Best balance of quality and speed
+
+### Option C: +1 Month (Full Feature Parity)
+**Status**: 100% ready
+**Timeline**: January 7, 2026
+**Add**: All UX features + file management + backups
+**Effort**: ~30 hours
+**Pros**: Complete competitive offering
+**Cons**: Slower to market, feature creep risk
+**Recommendation**: ⚠️ Over-engineering for launch
+
+---
+
+## 📋 Critical Outstanding Items
+
+### 🔴 High Priority (Before Launch)
+
+**1. WebSocket Console Streaming** [4-6 hours]
+- **Current**: HTTP polling via `/logs/tail`
+- **Needed**: Real-time WebSocket streaming
+- **Why**: Industry standard, users expect it
+- **Technical**: Socket.io integration to Go agent
+
+**2. Crash Loop Protection** [2 hours]
+- **Current**: Immediate restart on crash
+- **Needed**: Increasing backoff (5s, 10s, 15s), stop after 3 crashes
+- **Why**: Prevents resource thrashing
+- **Technical**: Agent retry logic with backoff timer
+
+**3.
Disk Space Monitoring** [1 hour]
+- **Current**: No checks
+- **Needed**: Alert when <1GB free, prevent start if insufficient
+- **Why**: Prevents world corruption
+- **Technical**: Agent disk space check before start
+
+### 🟡 Medium Priority (Week 1)
+
+**4. File Upload/Download** [6-8 hours]
+- Plugin management, world uploads
+- HTTP multipart + streaming
+
+**5. Backup System** [8-10 hours]
+- World backup/restore
+- Integration with PBS backup infrastructure
+
+**6. Enhanced Health Checks** [3-4 hours]
+- Query server status
+- Resource monitoring (CPU/RAM)
+
+### 🟢 Low Priority (Month 1)
+
+7. Resource monitoring dashboard
+8. Plugin marketplace integration
+9. Developer platform APIs
+10. Performance optimization
+
+---
+
+## 🗄️ Technical Architecture Details
+
+### Directory Structure (Finalized)
+
+```
+/opt/zlh///world/
+
+Examples:
+/opt/zlh/minecraft/vanilla/world/
+/opt/zlh/minecraft/forge/world/
+/opt/zlh/minecraft/fabric/world/
+```
+
+**Benefits**:
+- Clear game/variant separation
+- Scalable to all future games
+- Self-documenting paths
+- Easy backup automation
+
+### Container Model
+
+**Architecture**: One game per LXC container
+**Rationale**: Industry standard, 3-5x simpler than multi-game
+**Benefits**:
+- Better resource isolation
+- Simpler billing
+- Clearer security boundaries
+- Easier debugging
+
+### Java Runtime Selection
+
+```
+MC 1.21.x → Java 21
+MC ≥1.20.5 → Java 21
+MC <1.20.5 → Java 17
+```
+
+### Artifact Download Paths
+
+```
+minecraft/vanilla//server.jar
+minecraft/paper//server.jar
+minecraft/purpur//server.jar
+minecraft/fabric//fabric-server.jar
+minecraft/forge//forge-installer.jar
+minecraft/neoforge//neoforge-installer.jar
+```
+
+**Critical Note**: Fabric uses `fabric-server.jar` (pre-built), not installer pattern
+
+---
+
+## 💰 Business Model & Revenue Strategy
+
+### Developer-to-Player Pipeline
+
+```
+Step 1: Developer Acquisition
+├─ Development Environment: $20/month
+└─ Testing
Server: $25/month (50% discount)
+
+Step 2: Player Acquisition (via developer)
+├─ Player 1-10: $15/month each (25% discount)
+└─ Total Player Revenue: $150/month
+
+Step 3: Developer Commission
+├─ Revenue Share: 7.5% of player revenue
+├─ Developer Earns: $11.25/month
+└─ Platform Keeps: $138.75/month
+
+Total Monthly Revenue from One Developer:
+$20 (dev env) + $25 (test server) + $150 (players) = $195/month
+Revenue Multiplier: 9.75x the $20 development environment fee ($195 / $20)
+```
+
+### Financial Projections
+
+**Month 6**: $8K-30K (LXC advantage + developer pipeline)
+**Month 12**: $25K-100K (custom platform competitive advantages)
+**Month 24**: $75K-300K (market leadership + technology licensing)
+
+### Competitive Advantages
+
+1. **LXC Performance**: 20-30% improvement over Docker competitors
+2. **Developer Ecosystem**: Complete dev-to-production pipeline vs pure hosting
+3. **Open Source Foundation**: 30-40% cost advantage over corporate providers
+4. **Gaming-First Architecture**: Purpose-built vs adapted generic hosting
+5. **NeoForge Support**: Ahead of Apex and Shockbyte
+
+---
+
+## 🔐 Security Vulnerabilities (CRITICAL - Active Fix Required)
+
+### API Department Issues
+
+1. **Server Ownership Bypass**
+   - Any user can control any server via UUID
+   - No ownership validation in API endpoints
+   - **Impact**: Critical security flaw
+
+2. **Admin Privilege Escalation**
+   - Frontend can claim admin via JWT manipulation
+   - No server-side role validation
+   - **Impact**: Complete access control bypass
+
+3. **Token URL Exposure**
+   - JWTs visible in browser history/logs
+   - Tokens passed as URL parameters
+   - **Impact**: Token theft vulnerability
+
+4.
**API Key Validation Missing**
+   - Authentication bypass vulnerabilities
+   - Inconsistent validation patterns
+   - **Impact**: Unauthorized API access
+
+### Required Fixes
+
+- Implement ownership checks on all server operations
+- Server-side JWT validation and role enforcement
+- Move tokens from URL to headers/cookies
+- Comprehensive API key validation
+
+**Priority**: Must fix before public launch (current soft beta acceptable)
+
+---
+
+## 🛠️ Ford Assembly Line Department Structure
+
+### Management Department (Coordination Hub)
+- **Role**: Strategic oversight, cross-department integration
+- **AI Resource**: Claude (architecture) + ChatGPT (implementation)
+- **Current Focus**: Launch readiness + critical feature completion
+
+### 5 Specialized Departments
+
+**1. API Department** ⚠️ CRITICAL SECURITY + DEVELOPER PLATFORM
+- Tech: Node.js/Express, MariaDB, JWT auth, Pterodactyl integration
+- Priority: Security fixes + developer environment APIs
+
+**2. Infrastructure Department** ✅ LXC INTEGRATION PRIORITY
+- Tech: Proxmox VMs, Ansible automation, PBS backup, Monitoring
+- Achievement: Enterprise backup system operational
+- Capacity: 1.8TB available, supports 75-100 developers
+
+**3. Frontend Department** 🔧 TOKEN SECURITY + DEVELOPER UI
+- Tech: Next.js 15, TailwindCSS, sci-fi HUD aesthetic, TypeScript
+- Priority: Token security + developer dashboard
+
+**4. Pterodactyl Department** ⚠️ OAUTH + WINGS LXC
+- Role: Panel customization, OAuth integration
+- Future: Wings LXC integration for performance advantage
+
+**5. Planning & Brainstorming Department** 🧠 STRATEGIC EXECUTION
+- Role: Long-term vision, competitive strategy
+- Focus: Developer acquisition, viral growth mechanics
+
+---
+
+## 📋 Immediate Next Steps (Priority Order)
+
+### Phase 1: Critical Features (Before Launch)
+1. ✅ **Fix Go Agent Bugs** - Remove Forge glob, fix fallthrough, enable stop commands
+2.
🔧 **WebSocket Console** - Implement real-time streaming (4-6 hours)
+3. 🔧 **Crash Loop Protection** - Add exponential backoff (2 hours)
+4. 🔧 **Disk Space Monitoring** - Prevent starts on low disk (1 hour)
+
+### Phase 2: Launch Readiness
+5. 📋 **Security Audit** - Review critical vulnerabilities
+6. 📋 **Documentation** - User guides, API docs
+7. 📋 **Monitoring** - Alert thresholds, dashboards
+8. 📋 **Soft Beta** - 10-20 users, gather feedback
+
+### Phase 3: Week 1 Post-Launch
+9. 📋 **File Management** - Upload/download interface
+10. 📋 **Backup System** - World backup/restore
+11. 📋 **Enhanced Health Checks** - Resource monitoring
+
+---
+
+## 🎯 Success Metrics
+
+### Technical Metrics
+- ✅ 100% provisioning success rate (all 6 variants)
+- ⚠️ Zero DNS orphan records (needs EdgeState migration)
+- ⚠️ Sub-second WebSocket latency (needs implementation)
+- ✅ LXC 20-30% performance advantage (validated)
+
+### Business Metrics (Future)
+- Developer referral system operational
+- Revenue sharing calculations accurate
+- Customer quota enforcement working
+- Usage metering for billing
+
+### User Experience Metrics
+- Professional HUD aesthetic maintained
+- Zero breaking changes during updates
+- Seamless dev-to-production pipeline
+- <3s average provisioning time
+
+---
+
+## 📁 Key Files & Locations
+
+### API Service (`/home/zlh/zlh-api-v2/`)
+- `prisma/schema.prisma` - Database schema
+- `src/services/edgePublisher.js` - DNS + Velocity publishing
+- `src/services/dePublisher.js` - Edge cleanup
+- `src/services/portAllocator.js` - Port management
+- `src/clients/cloudflareClient.js` - Cloudflare API wrapper
+- `src/clients/technitiumClient.js` - Technitium DNS API wrapper
+
+### Go Agent (`/opt/zlh-agent/`)
+- `agent.go` - Main provisioning logic
+- `artifacts.go` - Download + verification (has bugs)
+- `process.go` - Server lifecycle management (has bug)
+- `api.go` - HTTP server for control commands
+-
`payload.json` - Configuration from API
+
+### Frontend (`/home/zlh/zlh-portal/`)
+- Next.js 15 application
+- Steel-texture HUD aesthetic
+- Developer dashboard (in progress)
+
+---
+
+## ⚠️ Critical Rules & Constraints
+
+### DO NOT
+- ❌ Infer hostnames from DNS records
+- ❌ Use DNS as source of truth
+- ❌ Delete Cloudflare records without record IDs
+- ❌ Launch without WebSocket console (competitive requirement)
+- ❌ Skip crash protection (operational stability)
+- ❌ Ignore disk space monitoring (data safety)
+
+### ALWAYS
+- ✅ Treat DB as authoritative source of truth
+- ✅ Store Cloudflare record IDs in EdgeState
+- ✅ Use exact hostname matching
+- ✅ Track all async operations in JobLog
+- ✅ Audit significant actions
+- ✅ Test all 6 MC variants before deploy
+
+---
+
+## 💡 Key Architectural Decisions (ADRs)
+
+**ADR-001: Minecraft-Only Launch**
+**Decision**: Launch with Minecraft only, defer other games
+**Rationale**: Market validation, focused quality, faster to market
+**Consequence**: 6 variants + 6 versions = comprehensive MC offering
+
+**ADR-002: One Game Per Container**
+**Decision**: Single game per LXC container
+**Rationale**: Industry standard, 3-5x simpler than multi-game
+**Consequence**: Better isolation, clearer billing, easier debugging
+
+**ADR-003: Velocity Over Direct Port Forwarding**
+**Decision**: Use Velocity proxy for Minecraft routing
+**Rationale**: Single entry point, dynamic registration, no NAT complexity
+**Consequence**: No external port allocation needed for MC
+
+**ADR-004: Hybrid Pterodactyl + Custom API**
+**Decision**: Keep Pterodactyl panel, build custom API alongside
+**Rationale**: Preserve working OAuth, gradual migration path
+**Consequence**: Dual system complexity, eventual migration needed
+
+**ADR-005: Go Agent Architecture**
+**Decision**: Containerized Go agent handles provisioning
+**Rationale**: Language-agnostic, self-healing, version-aware
+**Consequence**: Robust provisioning,
automatic repair, clean separation
+
+---
+
+## 🧠 Session Continuity Prompt
+
+For AI assistants resuming work on this project:
+
+> Resume from ZeroLagHub Master Bootstrap (December 7, 2025).
+>
+> **Current State**: Platform 85% launch-ready. All 6 Minecraft variants provisioning successfully via Go agent. Core functionality is operational; critical UX features are still needed for competitive parity.
+>
+> **Launch Decision**: Recommend +1 week for WebSocket console, crash protection, and disk monitoring.
+>
+> **Known Bugs**: 3 non-blocking Go agent issues (Forge glob, fallthrough, stop exclusion).
+>
+> **Critical Context**:
+> - Security vulnerabilities exist but are acceptable for soft beta
+> - Business model validated with 9.75x revenue multiplier
+> - Developer-to-player pipeline is core differentiator
+> - LXC performance advantage is primary competitive edge
+>
+> **Next Actions**: Fix Go agent bugs, implement critical features, launch beta.
+
+---
+
+## 📞 Support & Escalation
+
+- **Platform Owner**: 44 years old, full-stack developer
+- **AI Coordination**: Claude (architecture) + ChatGPT (implementation)
+- **Infrastructure**: GTHost dedicated server ($109/month)
+- **Domains**: zerolaghub.com, zpack.zerolaghub.com
+- **Public Game IP**: 139.64.165.248
+
+---
+
+## 📊 Platform Status Summary
+
+**Technical Readiness**: 85% complete
+**Competitive Position**: Ready to compete on core provisioning; UX polish still needed
+**Strategic Clarity**: Clear path to launch with validated business model
+**Infrastructure**: Production-grade with enterprise backup system
+**Security**: Known vulnerabilities, acceptable for soft beta, must fix before public launch
+
+---
+
+## 🎯 Strategic Recommendation
+
+**Recommended Path**: Option B (+1 Week)
+
+**Rationale**:
+1. WebSocket console is table stakes (competitors have it)
+2. Crash protection prevents operational nightmares
+3. Disk monitoring prevents data loss
+4. 1 week is negligible for long-term platform success
+5.
Professional launch > rushed launch
+
+**Timeline**:
+- **Dec 7-10**: Implement critical features (WebSocket, crash, disk)
+- **Dec 11-13**: Testing + bug fixes
+- **Dec 14**: Soft beta launch (10-20 users)
+- **Dec 21**: Public launch after beta feedback
+
+---
+
+**This document serves as the single source of truth for project continuity. Update after each major milestone or architectural change.**
+
+🚀 **Next action: Decide launch timeline, then implement critical features or launch beta.**
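
Two of the critical features named in the timeline above (crash protection and disk monitoring) reduce to small decision rules, sketched below. This is a minimal illustration, not the agent's code: the real agent is written in Go, Python is used here only for brevity, and the names `restart_delay`, `has_enough_disk`, and `can_start` are hypothetical. It assumes the tracker's figures of 5s, 10s, and 15s delays (which grow linearly) and a 1GB free-space threshold, read here as 1 GiB. It also assumes one delay per consecutive crash, with no further restarts once the three delays are used up; the tracker's "stop after 3 crashes" could also be read as giving up one crash earlier.

```python
import shutil
from typing import Optional

# Delays taken from the tracker's crash-protection item: 5s, 10s, 15s.
# Assumption: one delay per consecutive crash, then stop retrying.
RESTART_DELAYS_SECONDS = (5, 10, 15)

# Threshold taken from the tracker's disk-monitoring item.
# Assumption: "1GB" is interpreted as 1 GiB (1024**3 bytes).
MIN_FREE_BYTES = 1024 ** 3


def restart_delay(consecutive_crashes: int) -> Optional[int]:
    """Return the seconds to wait before restarting, or None to give up."""
    if consecutive_crashes < 1:
        raise ValueError("consecutive_crashes must be at least 1")
    if consecutive_crashes > len(RESTART_DELAYS_SECONDS):
        return None  # crash loop detected: leave the server stopped
    return RESTART_DELAYS_SECONDS[consecutive_crashes - 1]


def has_enough_disk(free_bytes: int, min_free_bytes: int = MIN_FREE_BYTES) -> bool:
    """Return True when free space meets or exceeds the threshold."""
    return free_bytes >= min_free_bytes


def can_start(server_dir: str) -> bool:
    """Check free space on the filesystem that holds server_dir."""
    return has_enough_disk(shutil.disk_usage(server_dir).free)
```

A supervisor loop would call `can_start` before launching the server process and `restart_delay` after each crash. When to reset the crash counter (for example, after the server has stayed up for some period) is not specified in the tracker and is left open here.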