knowledge-base/ZeroLagHub_Master_Bootstrap_Apr2026.md

# ZeroLagHub — Master Bootstrap (April 2026)
**For**: Claude (strategic/architecture sessions)
**Last Updated**: April 30, 2026
**Supersedes**: Previous April 7, 2026 version
**Source refs**: zlh-grind/SCRATCH/handover-apr-2026.md, zlh-grind/OPEN_THREADS.md, zlh-grind/SCRATCH/billing-stripe-handover-apr11-2026.md
---
## AI Role Split
| AI | Role | Workspace |
|----|------|-----------|
| **Claude** | Architecture, strategy, design decisions, cross-cutting concerns | `jester/knowledge-base` |
| **GPT (Ceàrd)** | Implementation, code changes, verification, session continuity | `jester/zlh-grind` |
**Hard rule**: GPT must not make architecture decisions. If architectural interpretation is required, stop and defer to Claude or canonical docs.
**Claude's session start checklist**:
1. Read this file
2. Read `ZeroLagHub_Cross_Project_Tracker.md` (locked decisions + boundaries)
3. Reference `INFRASTRUCTURE.md` in `zlh-grind` for current IPs/VMs
4. Read `zlh-grind/OPEN_THREADS.md` for current active work
---
## What ZeroLagHub Is
Build-and-run platform for developers, modders, and game communities. Browser-based dev environments + managed game servers + developer-to-player pipeline in one platform.
**Competitive advantages**:
- LXC containers (20-30% performance advantage over Docker) — system-container infrastructure without Docker's runtime overhead
- Custom Go agent architecture — purpose-built, not adapted generic tooling
- Open-source stack (30-40% cost advantage)
- Developer-to-player pipeline: mod devs become a distribution channel
**System posture as of Apr 30, 2026**: Stable. Pre-launch validation phase. Most core workflows verified end-to-end.
---
## Infrastructure
**Single host**: GTHost Detroit
Supermicro 2029TP-HTR, Xeon Gold 6152 22c/44t, 192GB DDR4, 2×1.92TB SSD, $99/mo, Proxmox VE 9.x
**Denver host**: Fully decommissioned April 2, 2026. OS and disks wiped.
### Active VMs (all 9000s range — authoritative in zlh-grind/INFRASTRUCTURE.md)
| VM | Name | Role |
|----|------|------|
| 9001 | zlh-router | OPNsense, WAN 66.163.115.221 |
| 9002 | zpack-router | OPNsense, WAN 66.163.115.115 |
| 9010 | zpack-dns | Technitium DNS |
| 9011 | zlh-proxy | Caddy reverse proxy |
| 9012 | zpack-proxy | Traefik v3 — game/dev edge routing + wildcard TLS |
| 9014 | zlh-artifacts | Caddy file server — runtime binaries + server jars |
| 9015 | zpack-velocity | Velocity 3.5 — Minecraft proxy |
| 9017 | zlh-back | PBS backup + Backblaze B2 |
| 9020 | zpack-api | Node.js API, MariaDB, Redis |
| 9021 | zpack-portal | Next.js frontend |
**LXC containers**:
- Game containers: IDs 5000+
- Dev containers: IDs 6000+
- Base template: ID 820 (zlh-base)
**Legacy VMs (NOT active production)**: 100 (zlh-panel), 101 (zlh-wings), 103 (zlh-api), 1000 (zlh-router)
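The ID conventions above can be captured in a small guard, useful when a script should refuse to touch legacy or out-of-range IDs. A minimal sketch — the function name is illustrative, and the upper bounds on the dev/game ranges are assumptions inferred from the listed ranges:

```javascript
// Classify a Proxmox VMID/CTID per the ranges documented above.
// Upper bounds for "dev" and "game" are an assumption (doc says 5000+/6000+).
function classifyId(id) {
  if (id === 820) return "base-template";       // zlh-base LXC template
  if (id >= 9000 && id <= 9999) return "vm";    // production VMs (9000s range)
  if (id >= 6000 && id < 9000) return "dev";    // dev containers (6000+)
  if (id >= 5000 && id < 6000) return "game";   // game containers (5000+)
  return "legacy-or-unknown";                   // e.g. 100, 101, 103, 1000
}
```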
---
## Naming Conventions
- `zlh-*` = core infrastructure (DNS, monitoring, backup, routing, artifacts)
- `zpack-*` = game and dev server stack (portal, API, containers)
---
## Stack
**API (zpack-api, VM 9020)**: Node.js ESM, Express 5, Prisma 6, MariaDB, Redis, BullMQ, JWT, Stripe, argon2, ssh2, WebSocket, http-proxy-middleware
**Portal (zpack-portal, VM 9021)**: Next.js 15, TypeScript, TailwindCSS, Axios, WebSocket console. Hybrid marketing + SaaS structure with SEO landing pages. Pricing tiers: Starter / Pro / Performance.
**Agent (zlh-agent)**: Go 1.21, stdlib HTTP, creack/pty, gorilla/websocket. Runs inside every game/dev container. Only process with direct filesystem access. Pulls runtimes + server jars from zlh-artifacts (VM 9014).
---
## Architecture Rules (Locked)
These cannot be changed without an explicit "Revisit decision X" conversation with Claude.
| # | Decision |
|---|----------|
| DEC-001 | Templates + Agent hybrid (not templates-only or agent-only) |
| DEC-002 | API orchestrates, Agent executes (never reversed) |
| DEC-003 | API owns DNS (Agent never creates DNS records) |
| DEC-004 | Traefik + Velocity only (no HAProxy) |
| DEC-005 | MariaDB is source of truth (no flat files) |
| DEC-006 | Frontend → API only (no direct agent calls) |
| DEC-007 | Drift prevention mandatory |
| DEC-008 | Control plane uses IP env vars, not internal.zlh DNS in hot paths |
| DEC-009 | Redis never restored from backup — always start clean |
| DEC-010 | Vanilla Minecraft servers use Fabric loader + FabricProxy-Lite (not vanilla JAR) |
| DEC-011 | Velocity forwarding mode = `modern` (not `none` — breaks UUID stability) |
---
## Control Plane Architecture
```
Browser
→ Traefik (zpack-proxy, 10.70.0.242) — TLS termination, domain routing
→ API (10.60.0.245:4000) — orchestration, auth, state
→ Agent (:18888) — container execution (game + dev)
```
- Agent is HTTP server on :18888, internal only. API is the only caller.
- Portal never calls agents directly.
- Service-to-service communication uses IP env vars, not DNS FQDNs.
- Upload transport: raw `http.request` piping (`req.pipe(proxyReq)`), never `fetch()`.
### Dev IDE Access (Browser — Current Working Model)
```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```
Flow: frontend calls `POST /api/dev/:id/ide-token` → API returns hosted URL → browser opens → Traefik wildcard routes to API → API validates token, sets HTTP-only cookie, proxies to code-server. Browser-verified end-to-end.
---
## Game Support
**Production**: Minecraft (Vanilla/Fabric/Paper/Forge/Neoforge), Rust, Terraria, Project Zomboid
**Pipeline**: Valheim, Palworld, Vintage Story, Core Keeper
**Target market**: modded and indie game communities underserved by mainstream providers
### Minecraft Runtime Split (Verified)
- `vanilla` = Fabric-based internal profile with proxy/API/config injection (FabricProxy-Lite pre-seeded)
- `fabric` = plain Fabric jar delivery only
- Forge/NeoForge first-start flow avoids premature readiness gating, applies post-start property enforcement, restarts through readiness-aware path
---
## Developer-to-Player Revenue Pipeline
```
LXC Dev Environment ($15-40/mo)
→ Game/mod creation + testing
→ Testing servers (50% dev discount)
→ Player community referrals (25% player discount)
→ Developer revenue share (5-10% commission)
→ Viral growth
```
Revenue multiplier: 1 developer → ~10 players → $147.50/mo total from one developer acquisition.
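The $147.50 figure is not broken down in the source docs; one decomposition consistent with the discounts above, using assumed plan prices of $35/mo for the dev environment and $15/mo base per player, is:

```javascript
// One plausible decomposition of the $147.50/mo figure. The $35 dev-env
// price and $15 base player plan are assumptions for illustration only.
const devEnv = 35;                          // dev environment subscription
const playerBase = 15;                      // assumed base player plan
const playerDiscounted = playerBase * 0.75; // 25% player referral discount
const total = devEnv + 10 * playerDiscounted;
console.log(total); // 147.5
```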
---
## Current System State (April 30, 2026)
**System is stable and pre-launch validated** across core workflows.
### Resolved since last bootstrap (Apr 7)
- ✅ Password reset — 5-minute tokens, hashed at rest, single-use, old tokens invalidated on deploy
- ✅ Billing / Stripe — checkout flow works, customer creation works, Stripe sandbox confirmed, Portal billing UI working. **Remaining**: webhook delivery (Stripe cannot reach API — not yet publicly exposed). Once webhook is live, billing is functionally complete.
- ✅ Vanilla/Fabric runtime split restored and validated
- ✅ Forge/NeoForge first-start flow correct end-to-end
- ✅ Delete/teardown lifecycle removes Velocity, Cloudflare, and Technitium records
- ✅ Portal consumes API-owned `connectable`/`connection` state — no longer infers Minecraft readiness itself
- ✅ Velocity proxy lifecycle callbacks live with `registered_with_proxy` and `proxy_ping_ok` in API state
- ✅ Portal status labels fixed — non-connectable states no longer all show "Needs attention"
- ✅ Portal server creation redirects to `/servers` and tracks setup progress there
- ✅ Portal public marketing site: hybrid SEO structure, Starter/Pro/Performance pricing tiers, root metadata fixed, hero copy fixed, fake CLI line removed
- ✅ SEO landing pages added: `/minecraft-server-hosting`, `/modded-minecraft-hosting`, `/browser-dev-environment`
- ✅ Local Minecraft backup create/restore verified live end-to-end
- ✅ Restore creates intentional pre-restore checkpoint; API starts restore asynchronously
- ✅ Backup timestamps normalized; pre-restore checkpoints filtered from default backup list
- ✅ Vanilla datapack upload works; direct vanilla `mods/` upload rejected by API
- ✅ NeoForge mod search/install/list works
- ✅ Agent-backed file edits create shadow copies for revert; API route/stream forwarding issues fixed
- ✅ Public exposure model in place: Portal public, control plane private
- ✅ Dev container creation succeeds; hosted IDE access verified post-cleanup passes
- ✅ Minecraft server creation succeeds across supported runtime variants
### Still Active: Fabric Readiness Gating (agent fix needed)
**Root cause**: Velocity registration fires before Fabric server is ready to accept proxy traffic → "proxy starting" errors until Velocity restart.
**Fix**: Gate Velocity registration behind TCP probe success (port 25565). Agent already runs probe — registration must fire after probe returns success, not at process start.
**Scope**: Agent only. No Velocity or FabricProxy-Lite changes needed.
**Ref**: `zlh-grind/SCRATCH/session-stabilization-fabric-findings.md`
---
## Pre-Launch Blockers
In priority order:
1. **Billing webhook delivery** — Stripe cannot POST to `/api/billing/webhook` (API not publicly exposed). Checkout works, customer/subscription created in Stripe, but `subscriptionStatus` and `plan` in DB not updating. Fix: expose webhook endpoint via public domain with HTTPS, or use Stripe CLI forwarding for dev testing. Ref: `zlh-grind/SCRATCH/billing-stripe-handover-apr11-2026.md`
2. **Fabric readiness gating** — agent TCP probe gate (see above)
3. **User onboarding flow** — guided first-server creation after registration
4. **Usage limits / quota enforcement** — prevent unbounded server creation
5. **Email notifications** — crash, billing, provisioning complete
6. **Upload testing** — end-to-end verification in dev containers
7. **Stress testing** — k6 IDE session load + Minecraft bot + code-server memory baseline
8. **OPNsense audit** — both routers need systematic validation
9. **Service discovery migration** — remaining non-hot-path internal.zlh refs
10. **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
11. **Final smoke test** — full lifecycle: create → ready → connectable → backup → restore → stop/start/restart → delete. Confirm Velocity unregister, Cloudflare cleanup, Technitium cleanup.
12. **Portal public-site QA** — desktop + mobile layouts, CTA routing, metadata verification. Mobile not yet optimized.
13. **Monitoring / observability** — normalize game/dev Alloy label contract across API discovery, agent-written labels, Prometheus targets, and Grafana dashboards. Finish template cleanup (remove node-exporter, keep Alloy).
14. **Billing Portal UI polish** — `trialing` state handling, upgrade/downgrade flow, plan limit gating in Portal
---
## Repo Registry
| Repo | Purpose |
|------|---------|
| `jester/knowledge-base` | Claude's home — architecture, strategy, canonical decisions |
| `jester/zlh-grind` | GPT's workspace — execution continuity, session handovers, debug notes |
| `jester/zlh-docs` | API/agent/portal operational reference docs |
| `jester/zpac-api` | API source |
| `jester/zpac-portal` | Portal source |
| `jester/zlh-agent` | Agent source (Go) |
---
## Session Guidance for Claude
- **Architecture decisions belong here** — if GPT hits an architectural question, it should stop and bring it to Claude
- `zlh-grind` is GPT's execution ledger, not an architecture source
- `zlh-grind/INFRASTRUCTURE.md` is authoritative for current IPs and VM inventory
- `zlh-grind/OPEN_THREADS.md` has full active + outstanding work
- `zlh-grind/SCRATCH/` has detailed debug docs for specific topics
- Do not mark unimplemented work as complete
- Game publish flow must never be modified by dev routing changes
- Synthesis of Codex review outputs belongs here, not in Codex