knowledge-base/ZeroLagHub_Master_Bootstrap_Dec2025.md
2025-12-13 16:40:48 +00:00

# 🚀 ZeroLagHub - Master Bootstrap Document (December 2025)
**Last Updated**: December 7, 2025
**Version**: 4.0 (Platform Launch Ready)
**Status**: 85% Complete - Launch Decision Point

---
## 📌 Quick Start for New AI Sessions
> **Resume Point**: Platform 85% launch-ready, all core provisioning operational, critical UX features needed.
>
> **Current Phase**: Launch readiness assessment - choose NOW vs +1 week vs +1 month
>
> **Critical Context**: All 6 Minecraft variants provisioning successfully via Go agent. Need WebSocket console, crash protection, and disk monitoring for competitive parity.
---
## 🎯 Project Overview
**ZeroLagHub** is a developer-focused game server hosting platform built on:
- **Proxmox VE** with LXC containers (20-30% performance advantage over Docker)
- **Hybrid Architecture**: Pterodactyl panel + Custom Node.js API + Go provisioning agent
- **Velocity proxy** for seamless Minecraft routing
- **Dual-router architecture** for traffic separation
- **Developer-to-player revenue pipeline** with 9.75x revenue multiplier
### Core Value Proposition
Complete dev-to-production pipeline: development environments ($20/mo) → testing servers (50% discount) → player hosting (25% discount) → revenue sharing (7.5% commission). The intended result is viral growth through the developer ecosystem.

---
## 🏗️ Current Architecture (December 2025)
### Infrastructure Overview (11 VMs)
```
Critical Production:
├── VM 100 (zlh-panel) - Pterodactyl panel + OAuth customization
├── VM 103 (zlh-api) - Node.js backend + developer platform APIs
├── VM 101 (zlh-wings) - Game servers + LXC integration target
Platform Services:
├── VM 102 (zlh-portal) - Next.js frontend + developer dashboard
├── VM 104 (zlh-monitor) - Prometheus/Grafana monitoring
Network & Infrastructure:
├── VM 1000 (zlh-router) - Platform services routing + VLANs
├── VM 1006 (zpack-router) - Game traffic routing + Velocity
├── VM 1001 (zlh-dns) - Technitium DNS + development domains
├── VM 1002 (zlh-proxy) - Caddy reverse proxy + SSL automation
├── VM 300 (zlh-panel-dev) - Development environment + testing
├── VM 2000 (zlh-ci) - CI/CD pipeline + automation
└── VM [zlh-back] - PBS backup + Backblaze B2 replication
```
### Network Topology
```
zlh-router (VM 1000):
├─ WAN1: Platform services (API, portal, monitoring)
├─ CORE_LAN: 10.60.0.0/24 (internal services)
├─ MGMT_LAN: 172.60.0.10/24 (inter-router communication)
└─ WireGuard: Admin access
zpack-router (VM 1006):
├─ WAN2: 139.64.165.248 (game services)
├─ ZPACK_LAN: 10.70.0.0/24 (Velocity @ 10.70.0.241)
├─ DEV_LAN: 10.100.0.0/24 (developer environments - future)
├─ GAME_LAN: 10.200.0.0/24 (game server LXCs)
└─ MGMT_LAN: 172.60.0.20/24 (control plane communication)
```
### Traffic Flows
- **Platform Access**: Client → WAN1 → zlh-router → Frontend/API
- **Game Play**: Player → WAN2 (139.64.165.248) → zpack-router → Velocity (10.70.0.241) → Game Server (10.200.0.X)
- **Control Plane**: API → MGMT_LAN (172.60.0.X) → Velocity/DNS/Monitoring
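
A minimal sketch of how the subnets above could be used to classify an address, for example when validating that a newly provisioned container landed on GAME_LAN. The `lanFor` helper and the `lans` table are illustrative names, not part of the existing codebase, and the MGMT_LAN prefix is assumed to be the /24 containing the two router addresses.

```go
package main

import (
	"fmt"
	"net/netip"
)

// lans lists the subnets named in the topology. MGMT_LAN is assumed to be the
// /24 network that contains the router addresses 172.60.0.10 and 172.60.0.20.
var lans = []struct {
	name   string
	prefix netip.Prefix
}{
	{"CORE_LAN", netip.MustParsePrefix("10.60.0.0/24")},
	{"ZPACK_LAN", netip.MustParsePrefix("10.70.0.0/24")},
	{"DEV_LAN", netip.MustParsePrefix("10.100.0.0/24")},
	{"GAME_LAN", netip.MustParsePrefix("10.200.0.0/24")},
	{"MGMT_LAN", netip.MustParsePrefix("172.60.0.0/24")},
}

// lanFor returns the name of the LAN containing addr, or "" if the address
// is invalid or belongs to none of the known subnets.
func lanFor(addr string) string {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return ""
	}
	for _, l := range lans {
		if l.prefix.Contains(ip) {
			return l.name
		}
	}
	return ""
}

func main() {
	fmt.Println(lanFor("10.70.0.241")) // Velocity proxy address from the topology
	fmt.Println(lanFor("10.200.0.15")) // hypothetical game server address
}
```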
---
## ✅ What's Working (December 7, 2025)
### Provisioning Pipeline (100% Operational)
| Component | Status | Notes |
|-----------|--------|-------|
| **LXC Container Creation** | ✅ | Template VMID 800, auto-cloning working |
| **VMID Allocation** | ✅ | Sequential assignment from range |
| **IP Detection** | ✅ | Automatic network configuration |
| **Go Agent Deployment** | ✅ | Payload delivery + self-repair system |
| **Java Runtime Selection** | ✅ | Auto-detect MC version → Java 17/21 |
| **All 6 MC Variants** | ✅ | Vanilla, Paper, Purpur, Fabric, Forge, NeoForge |
| **Server Startup** | ✅ | All variants start successfully |
| **DNS Publishing** | ✅ | Cloudflare + Technitium A + SRV records |
| **Velocity Registration** | ✅ | Dynamic backend server registration |
| **Client Connectivity** | ✅ | Players can connect and play |
### Control Functions
| Function | Status | Implementation |
|----------|--------|----------------|
| **Start/Stop/Restart** | ✅ | HTTP API → Go agent |
| **Console Commands** | ✅ | Command injection working |
| **Log Tailing** | ⚠️ | HTTP polling only (need WebSocket) |
| **Status Reporting** | ✅ | Agent emits RUNNING state |
| **Crash Detection** | ✅ | Agent tracks exit codes |
### Game Support Matrix
**Launch Ready (Minecraft Only)**:
- **Vanilla** - Official Mojang server
- **Paper** - Primary recommendation (vanilla + plugins)
- **Purpur** - Paper fork with extra features
- **Fabric** - Lightweight mod support
- **Forge** - Heavy mod support (tech/magic mods)
- **NeoForge** - Modern Forge fork (**competitive advantage**)
**Supported Versions**: 1.12.2, 1.16.5, 1.18.2, 1.19.2, 1.20.1, 1.21.x
**Deferred to Post-Launch**:
- 📋 Terraria
- 📋 Project Zomboid
- 📋 Valheim
- 📋 Rust
---
## 🚨 Known Issues & Gaps
### Critical Bugs (Non-Blocking, System Works)
1. **Forge server.jar Glob Logic** (`artifacts.go` lines 112-116, 147-151)
   - Tries to find `*server.jar` but Forge ≥1.17 doesn't create this
   - **Fix**: Remove glob/rename logic (Forge uses `run.sh` + `libraries/`)
   - **Impact**: System works, but unnecessary code
2. **ensureProvisioned() Fallthrough** (`agent.go` lines 155-171)
   - After Forge check, falls through to check `server.jar`
   - **Fix**: Add `else` to prevent fallthrough
   - **Impact**: Minor efficiency issue
3. **Forge Stop Command Exclusion** (`process.go` line 83)
   - Excludes Forge from receiving `stop` command
   - **Fix**: Remove exclusion (Forge accepts stop commands)
   - **Impact**: Manual workaround needed for Forge stops
### Missing Competitive Features (CRITICAL)
| Feature | Apex | Shockbyte | ZeroLagHub | Priority |
|---------|------|-----------|------------|----------|
| **All MC Variants** | ✅ | ✅ | ✅ | - |
| **NeoForge** | ❌ | ❌ | ✅ | ADVANTAGE |
| **Performance** | 🟡 | 🟡 | ✅ | ADVANTAGE |
| **Console Streaming** | ✅ | ✅ | ❌ | 🔴 HIGH |
| **File Management** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
| **Backups** | ✅ | ✅ | ❌ | 🟡 MEDIUM |
| **Crash Protection** | ✅ | ✅ | ❌ | 🔴 HIGH |
| **Disk Monitoring** | ✅ | ✅ | ❌ | 🔴 HIGH |
---
## 🎯 Platform Readiness Assessment (85%)
### Core Platform (100%)
- ✅ Container orchestration
- ✅ Multi-variant provisioning
- ✅ Network routing (dual-router)
- ✅ DNS automation (Cloudflare + Technitium)
- ✅ Velocity proxy integration
- ✅ Start/stop/restart control
- ✅ Console command injection
- ✅ Status monitoring
### Operational Features (70%)
- ✅ Log tailing (HTTP polling)
- ✅ Crash detection
- **WebSocket console** (need real-time streaming)
- **Crash loop protection** (need exponential backoff)
- **Disk space monitoring** (prevent corruption)
### File Management (0%)
- ❌ File upload/download
- ❌ Backup/restore system
- ❌ World file management
### Advanced Features (Planned)
- 📋 Resource monitoring dashboard
- 📋 Plugin marketplace
- 📋 Developer platform APIs
- 📋 Performance optimization tools
---
## 🚀 Launch Decision Point
### Option A: Launch NOW (Soft Beta)
**Status**: 85% ready
**Timeline**: Immediate
**Pros**: Fast to market, gather user feedback
**Cons**: Missing competitive UX features, higher support burden
**Recommendation**: ⚠️ Acceptable for 10-20 beta users only
### Option B: +1 Week (Critical Features) ⭐ RECOMMENDED
**Status**: 95% ready after additions
**Timeline**: December 14, 2025
**Add**: WebSocket console + Crash protection + Disk monitoring
**Effort**: 7-9 hours total
**Pros**: Competitive feature parity, professional launch
**Cons**: Minimal delay
**Recommendation**: ✅ Best balance of quality and speed
### Option C: +1 Month (Full Feature Parity)
**Status**: 100% ready
**Timeline**: January 7, 2026
**Add**: All UX features + file management + backups
**Effort**: ~30 hours
**Pros**: Complete competitive offering
**Cons**: Slower to market, feature creep risk
**Recommendation**: ⚠️ Over-engineering for launch

---
## 📋 Critical Outstanding Items
### 🔴 High Priority (Before Launch)
**1. WebSocket Console Streaming** [4-6 hours]
- **Current**: HTTP polling via `/logs/tail`
- **Needed**: Real-time WebSocket streaming
- **Why**: Industry standard, users expect it
- **Technical**: Socket.io integration to Go agent
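
Whatever transport is chosen, the agent needs a way to push each console line to every connected viewer as it is produced. The sketch below shows only that fan-out piece using the standard library. `Hub`, `Subscribe`, and `Publish` are hypothetical names, and the WebSocket layer itself is deliberately left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans out console lines to every subscriber (hypothetical type).
type Hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func NewHub() *Hub { return &Hub{subs: make(map[chan string]struct{})} }

// Subscribe registers a buffered channel that receives future lines.
func (h *Hub) Subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Unsubscribe removes the channel from the hub and closes it.
func (h *Hub) Unsubscribe(ch chan string) {
	h.mu.Lock()
	if _, ok := h.subs[ch]; ok {
		delete(h.subs, ch)
		close(ch)
	}
	h.mu.Unlock()
}

// Publish delivers line to all subscribers. A subscriber whose buffer is
// full misses the line, so a slow client cannot stall the game server.
func (h *Hub) Publish(line string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- line:
		default:
		}
	}
}

func main() {
	hub := NewHub()
	client := hub.Subscribe(16)
	hub.Publish("[Server thread/INFO]: Done")
	fmt.Println(<-client)
	hub.Unsubscribe(client)
}
```

Dropping lines for slow subscribers is a design choice made here for simplicity; a production console might instead disconnect the slow client or replay from the log file.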
**2. Crash Loop Protection** [2 hours]
- **Current**: Immediate restart on crash
- **Needed**: Backoff between restarts, stop after 3 crashes. The listed delays (5s, 10s, 15s) grow linearly; a strictly exponential schedule would be 5s, 10s, 20s
- **Why**: Prevents resource thrashing
- **Technical**: Agent retry logic with backoff timer
**3. Disk Space Monitoring** [1 hour]
- **Current**: No checks
- **Needed**: Alert when <1GB free, prevent start if insufficient
- **Why**: Prevents world corruption
- **Technical**: Agent disk space check before start
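
The pre-start check could look roughly like the sketch below. It assumes a Unix-like host (the agent runs inside LXC containers) and treats the "<1GB" threshold as 1 GiB. `freeBytes` and `canStart` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"syscall"
)

const minFreeBytes = 1 << 30 // 1 GiB, the assumed reading of "<1GB free"

// freeBytes returns the space available to unprivileged processes on the
// filesystem that holds path. Uses statfs, so it is Unix-only.
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// canStart returns an error when free space is below the threshold.
func canStart(free uint64) error {
	if free < minFreeBytes {
		return fmt.Errorf("only %d MiB free, need at least %d MiB",
			free>>20, minFreeBytes>>20)
	}
	return nil
}

func main() {
	free, err := freeBytes("/")
	if err != nil {
		fmt.Println("disk check failed:", err)
		return
	}
	if err := canStart(free); err != nil {
		fmt.Println("refusing to start:", err)
		return
	}
	fmt.Printf("ok to start, %d MiB free\n", free>>20)
}
```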
### 🟡 Medium Priority (Week 1)
**4. File Upload/Download** [6-8 hours]
- Plugin management, world uploads
- HTTP multipart + streaming
**5. Backup System** [8-10 hours]
- World backup/restore
- Integration with PBS backup infrastructure
**6. Enhanced Health Checks** [3-4 hours]
- Query server status
- Resource monitoring (CPU/RAM)
### 🟢 Low Priority (Month 1)
7. Resource monitoring dashboard
8. Plugin marketplace integration
9. Developer platform APIs
10. Performance optimization
---
## 🗄️ Technical Architecture Details
### Directory Structure (Finalized)
```
/opt/zlh/<game>/<variant>/world/
Examples:
/opt/zlh/minecraft/vanilla/world/
/opt/zlh/minecraft/forge/world/
/opt/zlh/minecraft/fabric/world/
```
**Benefits**:
- Clear game/variant separation
- Scalable to all future games
- Self-documenting paths
- Easy backup automation
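
As a small illustration of the layout, a path helper might look like this. `worldDir` is a hypothetical name, and input validation (for example rejecting `..` in a variant name) is left out of the sketch.

```go
package main

import (
	"fmt"
	"path"
)

const root = "/opt/zlh"

// worldDir builds /opt/zlh/<game>/<variant>/world from its parts.
func worldDir(game, variant string) string {
	return path.Join(root, game, variant, "world")
}

func main() {
	fmt.Println(worldDir("minecraft", "forge"))
}
```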
### Container Model
**Architecture**: One game per LXC container
**Rationale**: Industry standard, 3-5x simpler than multi-game
**Benefits**:
- Better resource isolation
- Simpler billing
- Clearer security boundaries
- Easier debugging
### Java Runtime Selection
```
MC 1.21.x → Java 21
MC ≥1.20.5 → Java 21
MC <1.20.5 → Java 17
```
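
The mapping above reduces to one comparison against 1.20.5. The sketch below assumes version strings of the form `major.minor[.patch]`, with `x` allowed as a wildcard patch (as in `1.21.x`). `javaFor` and `parseVersion` are illustrative names, and the function only reproduces the table as written; it does not cover versions that may need an older runtime.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "1.20.5" into [1 20 5]. A missing or "x" patch
// component counts as 0.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		if i == 2 && p == "x" {
			continue // wildcard patch such as 1.21.x
		}
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// javaFor applies the table: 1.20.5 and newer use Java 21, older use Java 17.
func javaFor(mc string) (int, error) {
	v, err := parseVersion(mc)
	if err != nil {
		return 0, err
	}
	threshold := [3]int{1, 20, 5}
	for i := range v {
		if v[i] > threshold[i] {
			return 21, nil
		}
		if v[i] < threshold[i] {
			return 17, nil
		}
	}
	return 21, nil // exactly 1.20.5
}

func main() {
	for _, mc := range []string{"1.19.2", "1.20.1", "1.20.5", "1.21.x"} {
		java, err := javaFor(mc)
		fmt.Println(mc, java, err)
	}
}
```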
### Artifact Download Paths
```
minecraft/vanilla/<version>/server.jar
minecraft/paper/<version>/server.jar
minecraft/purpur/<version>/server.jar
minecraft/fabric/<version>/fabric-server.jar
minecraft/forge/<version>/forge-installer.jar
minecraft/neoforge/<version>/neoforge-installer.jar
```
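
The path scheme can be captured in a lookup table, which keeps the Fabric exception in one place. A sketch with hypothetical names (`artifactFile`, `artifactPath`):

```go
package main

import (
	"fmt"
	"path"
)

// artifactFile maps each variant to the file name used in the artifact store.
var artifactFile = map[string]string{
	"vanilla":  "server.jar",
	"paper":    "server.jar",
	"purpur":   "server.jar",
	"fabric":   "fabric-server.jar", // pre-built server, not an installer
	"forge":    "forge-installer.jar",
	"neoforge": "neoforge-installer.jar",
}

// artifactPath returns minecraft/<variant>/<version>/<file> for a known
// variant and an error for anything else.
func artifactPath(variant, version string) (string, error) {
	file, ok := artifactFile[variant]
	if !ok {
		return "", fmt.Errorf("unknown variant %q", variant)
	}
	return path.Join("minecraft", variant, version, file), nil
}

func main() {
	p, err := artifactPath("fabric", "1.20.1")
	fmt.Println(p, err)
}
```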
**Critical Note**: Fabric uses `fabric-server.jar` (pre-built), not installer pattern

---
## 💰 Business Model & Revenue Strategy
### Developer-to-Player Pipeline
```
Step 1: Developer Acquisition
├─ Development Environment: $20/month
└─ Testing Server: $25/month (50% discount)
Step 2: Player Acquisition (via developer)
├─ Player 1-10: $15/month each (25% discount)
└─ Total Player Revenue: $150/month
Step 3: Developer Commission
├─ Revenue Share: 7.5% of player revenue
├─ Developer Earns: $11.25/month
└─ Platform Keeps: $138.75/month
Total Monthly Revenue from One Developer:
$20 (dev env) + $25 (test server) + $150 (players) = $195/month (gross, before the $11.25 commission)
Revenue Multiplier: $195 / $20 = 9.75x the base developer subscription
```
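
The arithmetic in the pipeline can be checked with a few lines of integer math. Amounts are held in cents to avoid floating-point rounding. The net figure (gross minus commission) is derived here and is not stated in the pipeline itself, and all names are illustrative.

```go
package main

import "fmt"

const (
	devEnvCents     = 2000 // $20 development environment
	testServerCents = 2500 // $25 testing server
	playerCents     = 1500 // $15 per player server
	players         = 10
	commissionBps   = 750 // 7.5% expressed in basis points
)

// Summary holds monthly figures for one developer, in cents.
type Summary struct {
	PlayerRevenue int
	Commission    int
	Gross         int
	Net           int
}

// summarize reproduces the pipeline arithmetic.
func summarize() Summary {
	playerRev := playerCents * players
	commission := playerRev * commissionBps / 10000
	gross := devEnvCents + testServerCents + playerRev
	return Summary{
		PlayerRevenue: playerRev,
		Commission:    commission,
		Gross:         gross,
		Net:           gross - commission,
	}
}

// dollars formats a cent amount as $D.CC.
func dollars(c int) string { return fmt.Sprintf("$%d.%02d", c/100, c%100) }

func main() {
	s := summarize()
	fmt.Println("player revenue:", dollars(s.PlayerRevenue))
	fmt.Println("commission:    ", dollars(s.Commission))
	fmt.Println("gross:         ", dollars(s.Gross))
	fmt.Println("net:           ", dollars(s.Net))
	fmt.Printf("multiplier:     %.2fx\n", float64(s.Gross)/float64(devEnvCents))
}
```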
### Financial Projections
**Month 6**: $8K-30K (LXC advantage + developer pipeline)
**Month 12**: $25K-100K (custom platform competitive advantages)
**Month 24**: $75K-300K (market leadership + technology licensing)
### Competitive Advantages
1. **LXC Performance**: 20-30% improvement over Docker competitors
2. **Developer Ecosystem**: Complete dev-to-production pipeline vs pure hosting
3. **Open Source Foundation**: 30-40% cost advantage over corporate providers
4. **Gaming-First Architecture**: Purpose-built vs adapted generic hosting
5. **NeoForge Support**: Ahead of Apex and Shockbyte
---
## 🔐 Security Vulnerabilities (CRITICAL - Active Fix Required)
### API Department Issues
1. **Server Ownership Bypass**
   - Any user can control any server via UUID
   - No ownership validation in API endpoints
   - **Impact**: Critical security flaw
2. **Admin Privilege Escalation**
   - Frontend can claim admin via JWT manipulation
   - No server-side role validation
   - **Impact**: Complete access control bypass
3. **Token URL Exposure**
   - JWTs visible in browser history/logs
   - Tokens passed as URL parameters
   - **Impact**: Token theft vulnerability
4. **API Key Validation Missing**
   - Authentication bypass vulnerabilities
   - Inconsistent validation patterns
   - **Impact**: Unauthorized API access
### Required Fixes
- Implement ownership checks on all server operations
- Server-side JWT validation and role enforcement
- Move tokens from URL to headers/cookies
- Comprehensive API key validation
**Priority**: Must fix before public launch (current soft beta acceptable)
---
## 🛠️ Ford Assembly Line Department Structure
### Management Department (Coordination Hub)
- **Role**: Strategic oversight, cross-department integration
- **AI Resource**: Claude (architecture) + ChatGPT (implementation)
- **Current Focus**: Launch readiness + critical feature completion
### 5 Specialized Departments
**1. API Department** CRITICAL SECURITY + DEVELOPER PLATFORM
- Tech: Node.js/Express, MariaDB, JWT auth, Pterodactyl integration
- Priority: Security fixes + developer environment APIs
**2. Infrastructure Department** LXC INTEGRATION PRIORITY
- Tech: Proxmox VMs, Ansible automation, PBS backup, Monitoring
- Achievement: Enterprise backup system operational
- Capacity: 1.8TB available, supports 75-100 developers
**3. Frontend Department** 🔧 TOKEN SECURITY + DEVELOPER UI
- Tech: Next.js 15, TailwindCSS, sci-fi HUD aesthetic, TypeScript
- Priority: Token security + developer dashboard
**4. Pterodactyl Department** OAUTH + WINGS LXC
- Role: Panel customization, OAuth integration
- Future: Wings LXC integration for performance advantage
**5. Planning & Brainstorming Department** 🧠 STRATEGIC EXECUTION
- Role: Long-term vision, competitive strategy
- Focus: Developer acquisition, viral growth mechanics
---
## 📋 Immediate Next Steps (Priority Order)
### Phase 1: Critical Features (Before Launch)
1. **Fix Go Agent Bugs** - Remove Forge glob, fix fallthrough, enable stop commands
2. 🔧 **WebSocket Console** - Implement real-time streaming (4-6 hours)
3. 🔧 **Crash Loop Protection** - Add exponential backoff (2 hours)
4. 🔧 **Disk Space Monitoring** - Prevent starts on low disk (1 hour)
### Phase 2: Launch Readiness
5. 📋 **Security Audit** - Review critical vulnerabilities
6. 📋 **Documentation** - User guides, API docs
7. 📋 **Monitoring** - Alert thresholds, dashboards
8. 📋 **Soft Beta** - 10-20 users, gather feedback
### Phase 3: Week 1 Post-Launch
9. 📋 **File Management** - Upload/download interface
10. 📋 **Backup System** - World backup/restore
11. 📋 **Enhanced Health Checks** - Resource monitoring
---
## 🎯 Success Metrics
### Technical Metrics
- 100% provisioning success rate (all 6 variants)
- Zero DNS orphan records (needs EdgeState migration)
- Sub-second WebSocket latency (needs implementation)
- LXC 20-30% performance advantage (validated)
### Business Metrics (Future)
- Developer referral system operational
- Revenue sharing calculations accurate
- Customer quota enforcement working
- Usage metering for billing
### User Experience Metrics
- Professional HUD aesthetic maintained
- Zero breaking changes during updates
- Seamless dev-to-production pipeline
- <3s average provisioning time
---
## 📁 Key Files & Locations
### API Service (`/home/zlh/zlh-api-v2/`)
- `prisma/schema.prisma` - Database schema
- `src/services/edgePublisher.js` - DNS + Velocity publishing
- `src/services/dePublisher.js` - Edge cleanup
- `src/services/portAllocator.js` - Port management
- `src/clients/cloudflareClient.js` - Cloudflare API wrapper
- `src/clients/technitiumClient.js` - Technitium DNS API wrapper
### Go Agent (`/opt/zlh-agent/`)
- `agent.go` - Main provisioning logic
- `artifacts.go` - Download + verification (has bugs)
- `process.go` - Server lifecycle management (has bug)
- `api.go` - HTTP server for control commands
- `payload.json` - Configuration from API
### Frontend (`/home/zlh/zlh-portal/`)
- Next.js 15 application
- Steel-texture HUD aesthetic
- Developer dashboard (in progress)
---
## ⚠️ Critical Rules & Constraints
### DO NOT
- Infer hostnames from DNS records
- Use DNS as source of truth
- Delete Cloudflare records without record IDs
- Launch without WebSocket console (competitive requirement)
- Skip crash protection (operational stability)
- Ignore disk space monitoring (data safety)
### ALWAYS
- Treat DB as authoritative source of truth
- Store Cloudflare record IDs in EdgeState
- Use exact hostname matching
- Track all async operations in JobLog
- Audit significant actions
- Test all 6 MC variants before deploy
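
A sketch of how the DNS rules translate into cleanup logic: deletions are planned from stored record IDs for an exactly matching hostname, and DNS is never queried to discover what exists. The `EdgeState` fields and `recordsToDelete` shown here are assumptions for illustration and may not match the Prisma schema.

```go
package main

import "fmt"

// EdgeState is a simplified, hypothetical view of the DB row that tracks
// what was published for one server.
type EdgeState struct {
	Hostname            string
	CloudflareRecordIDs []string
}

// recordsToDelete returns the stored record IDs for exactly hostname.
// It uses the database rows only and does no prefix or suffix matching.
func recordsToDelete(states []EdgeState, hostname string) []string {
	var ids []string
	for _, s := range states {
		if s.Hostname == hostname {
			ids = append(ids, s.CloudflareRecordIDs...)
		}
	}
	return ids
}

func main() {
	// Example hostnames and IDs are made up for the demonstration.
	states := []EdgeState{
		{Hostname: "mc1.zpack.zerolaghub.com", CloudflareRecordIDs: []string{"rec-a", "rec-srv-a"}},
		{Hostname: "mc10.zpack.zerolaghub.com", CloudflareRecordIDs: []string{"rec-b"}},
	}
	fmt.Println(recordsToDelete(states, "mc1.zpack.zerolaghub.com"))
}
```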
---
## 💡 Key Architectural Decisions (ADRs)
**ADR-001: Minecraft-Only Launch**
**Decision**: Launch with Minecraft only, defer other games
**Rationale**: Market validation, focused quality, faster to market
**Consequence**: 6 variants + 6 versions = comprehensive MC offering
**ADR-002: One Game Per Container**
**Decision**: Single game per LXC container
**Rationale**: Industry standard, 3-5x simpler than multi-game
**Consequence**: Better isolation, clearer billing, easier debugging
**ADR-003: Velocity Over Direct Port Forwarding**
**Decision**: Use Velocity proxy for Minecraft routing
**Rationale**: Single entry point, dynamic registration, no NAT complexity
**Consequence**: No external port allocation needed for MC
**ADR-004: Hybrid Pterodactyl + Custom API**
**Decision**: Keep Pterodactyl panel, build custom API alongside
**Rationale**: Preserve working OAuth, gradual migration path
**Consequence**: Dual system complexity, eventual migration needed
**ADR-005: Go Agent Architecture**
**Decision**: Containerized Go agent handles provisioning
**Rationale**: Language-agnostic, self-healing, version-aware
**Consequence**: Robust provisioning, automatic repair, clean separation

---
## 🧠 Session Continuity Prompt
For AI assistants resuming work on this project:
> Resume from ZeroLagHub Master Bootstrap (December 7, 2025).
>
> **Current State**: Platform 85% launch-ready. All 6 Minecraft variants provisioning successfully via Go agent. Core functionality operational, need critical UX features for competitive parity.
>
> **Launch Decision**: Recommend +1 week for WebSocket console, crash protection, and disk monitoring.
>
> **Known Bugs**: 3 non-blocking Go agent issues (Forge glob, fallthrough, stop exclusion).
>
> **Critical Context**:
> - Security vulnerabilities exist but acceptable for soft beta
> - Business model validated with 9.75x revenue multiplier
> - Developer-to-player pipeline is core differentiator
> - LXC performance advantage is primary competitive edge
>
> **Next Actions**: Fix Go agent bugs, implement critical features, launch beta.
---
## 📞 Support & Escalation
- **Platform Owner**: 44 years old, full-stack developer
- **AI Coordination**: Claude (architecture) + ChatGPT (implementation)
- **Infrastructure**: GTHost dedicated server ($109/month)
- **Domain**: zerolaghub.com, zpack.zerolaghub.com
- **Public Game IP**: 139.64.165.248
---
## 📊 Platform Status Summary
**Technical Readiness**: 85% complete
**Competitive Position**: Ready to compete on core provisioning, need UX polish
**Strategic Clarity**: Clear path to launch with validated business model
**Infrastructure**: Production-grade with enterprise backup system
**Security**: Known vulnerabilities, acceptable for soft beta, must fix before public launch

---
## 🎯 Strategic Recommendation
**Recommended Path**: Option B (+1 Week)
**Rationale**:
1. WebSocket console is table stakes (competitors have it)
2. Crash protection prevents operational nightmares
3. Disk monitoring prevents data loss
4. 1 week is negligible for long-term platform success
5. Professional launch > rushed launch
**Timeline**:
- **Dec 7-10**: Implement critical features (WebSocket, crash, disk)
- **Dec 11-13**: Testing + bug fixes
- **Dec 14**: Soft beta launch (10-20 users)
- **Dec 21**: Public launch after beta feedback
---
**This document serves as the single source of truth for project continuity. Update after each major milestone or architectural change.**
🚀 **Next action: Decide launch timeline, then implement critical features or launch beta.**