# ZeroLagHub Project Context
## What It Is
Game server hosting platform targeting modded, indie, and emerging games.
Competitive advantages: LXC containers (20-30% performance advantage over
Docker), a custom agent architecture, an open-source stack, and a
developer-to-player pipeline that turns mod developers into a distribution channel.
System posture: stable, controlled expansion phase.
---
## Naming Convention
- `zlh-*` = core infrastructure (routing, monitoring, backup, artifacts, shared services)
- `zpack-*` = game and dev stack (portal, API, containers, Velocity/game edge)
---
## Infrastructure (Current Reality)
### Active host / environment
- Active dedicated host: **GTHost Detroit**
- Denver host: **decommissioned Apr 2, 2026**
- Active infra VM/LXC IDs are in the **9000s range**
- Legacy IDs in the 100s / 300s / 1000s / 2000s are reference only and should **not** be treated as active production
### Infrastructure source of truth
- Authoritative VM/IP inventory: `INFRASTRUCTURE.md`
- `PROJECT_CONTEXT.md` is a platform snapshot, **not** the authoritative VM/IP registry
### Active core infrastructure (high level)
- `9001 zlh-router` — OPNsense core router
- `9002 zpack-router` — OPNsense game/dev router
- `9010 zpack-dns` — Technitium DNS
- `9011 zlh-proxy` — core reverse proxy
- `9012 zpack-proxy` — game/dev edge proxy
- `9014 zlh-artifacts` — runtime, jar, and agent artifact server
- `9015 zpack-velocity` — Velocity proxy
- `9016 zlh-monitor` — Prometheus/Grafana
- `9017 zlh-back` — Proxmox Backup Server
- `9020 zpack-api` — API VM
- `9021 zpack-portal` — portal VM
### Service discovery note
- `internal.zlh` is **not currently used** for hot-path runtime service discovery
- Prefer explicit env-configured IPs / addresses in runtime-critical paths
---
## Stack
**API (`zpack-api`):** Node.js ESM, Express 5, Prisma 6, MariaDB, Redis,
BullMQ, JWT, Stripe, argon2, ssh2, WebSocket, http-proxy-middleware
**Portal (`zpack-portal`):** Next.js 15, TypeScript, TailwindCSS,
Axios, WebSocket console. Sci-fi HUD aesthetic (steel textures, neon
accents, beveled panels).
**Agent (`zlh-agent`):** Go 1.21, stdlib HTTP, creack/pty, gorilla/websocket.
Runs inside every game/dev container. Only process with direct filesystem
access. Pulls runtimes + server jars from `zlh-artifacts`.
**Velocity plugin (`ZpackVelocityBridge`):** custom Velocity-side bridge that
hydrates/registers backend servers for the proxy and exposes plugin-local
register/unregister/status HTTP endpoints.
---
## Agent (Operational)
- HTTP server on :18888, internal only — API is the only intended caller
- Container types: `game` and `dev`
- Lifecycle: `POST /config` triggers async provision + start pipeline
- Filesystem: strict path allowlist for games, workspace-root sandbox for dev containers
- Upload transport: raw `http.request` piping (`req.pipe(proxyReq)`), never `fetch()`
- Console: PTY-backed WebSocket, one read loop per container
- Self-update: periodic check + apply
- Forge/Neoforge: automated post-install patch sequence
- Modrinth mod lifecycle: install/enable/disable/delete — operational
- Provenance: `.zlh_metadata.json` — source is `null` if not set
- Status transport model: poll-based (`/status`), not push-based
- Lifecycle states: `idle`, `installing`, `starting`, `running`, `stopping`, `crashed`, `error`
- Crash recovery: backoff 30s/60s/120s, resets if uptime ≥ 30s, `error` state after repeated failures
- Crash observability: exit code, signal, uptime, log tail, classification
- Real Minecraft readiness probing exists in `internal/minecraft/readiness.go`
- Current open Minecraft issue is sequencing/use of readiness, not absence of readiness logic
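The crash-recovery policy above can be sketched as a small state machine. The real agent is written in Go; this TypeScript sketch only mirrors the documented numbers (30s/60s/120s ladder, reset when the crashed run stayed up at least 30s, `error` after the ladder is exhausted). Function and field names are illustrative, not the agent's actual API.

```typescript
// Sketch of the documented crash-recovery policy (the real agent is Go;
// only the 30s/60s/120s and 30s-uptime numbers come from the doc).
const BACKOFF_MS = [30_000, 60_000, 120_000];
const STABLE_UPTIME_MS = 30_000;

interface CrashState {
  attempt: number; // index into BACKOFF_MS
}

// Decide what to do after a crash: restart with a delay, or give up
// and enter the error state once the backoff ladder is exhausted.
function onCrash(
  state: CrashState,
  uptimeMs: number, // how long the crashed run stayed up
): { action: 'restart'; delayMs: number } | { action: 'error' } {
  // A run that stayed up long enough resets the ladder.
  if (uptimeMs >= STABLE_UPTIME_MS) state.attempt = 0;
  if (state.attempt >= BACKOFF_MS.length) return { action: 'error' };
  const delayMs = BACKOFF_MS[state.attempt];
  state.attempt += 1;
  return { action: 'restart', delayMs };
}
```

Four fast crashes in a row walk 30s → 60s → 120s → `error`; a single stable run anywhere in between starts the ladder over.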
---
## Dev Containers (Current State)
- supported runtimes: node, python, go, java, dotnet
- runtime installs are artifact-backed and idempotent
- runtime root: `/opt/zlh/runtimes/<runtime>/<version>`
- dev identity: `dev:dev`
- workspace root: `/home/dev/workspace`
- shell env: `HOME`, `USER`, `LOGNAME`, `TERM` set correctly
- code-server install path: `/opt/zlh/services/code-server`
- code-server port: `6000`
- code-server lifecycle: `POST /dev/codeserver/start|stop|restart`
- code-server detection: `/proc/*/cmdline` scan
- agent port: `18888`
Code-server launch model:
- binds to `0.0.0.0`
- `--auth none`
- API/hosted flow handles auth and proxying
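The launch model above translates to a code-server invocation roughly like the following. The real agent is Go; this TypeScript sketch only assembles the argv. The install path, port, and workspace come from this doc; `--bind-addr` and `--auth` are real code-server flags, but the `bin/code-server` layout inside the install path is an assumption.

```typescript
// Sketch of the documented code-server launch model (real agent is Go).
const CODE_SERVER_ROOT = '/opt/zlh/services/code-server';
const CODE_SERVER_PORT = 6000;
const WORKSPACE = '/home/dev/workspace';

function codeServerArgv(): string[] {
  return [
    `${CODE_SERVER_ROOT}/bin/code-server`, // assumed binary location
    '--bind-addr', `0.0.0.0:${CODE_SERVER_PORT}`, // reachable by the API proxy
    '--auth', 'none', // auth is handled upstream by the API/hosted flow
    WORKSPACE, // open directly into the dev workspace
  ];
}
```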
---
## Dev Container Access Model
### Browser IDE (Current Working Model)
```text
Browser
  ↓
Traefik / hosted dev edge
  ↓
API
  ↓
container:6000
```
Working hosted flow:
1. frontend calls `POST /api/dev/:id/ide-token`
2. API returns hosted IDE URL with short-lived token
3. browser opens hosted URL
4. edge forwards to API
5. API validates token, sets HTTP-only IDE cookie, redirects to clean hosted URL
6. subsequent cookie-backed requests proxy to container code-server
7. code-server redirects to `/?folder=/home/dev/workspace`
8. IDE loads successfully
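Steps 5-6 (validate short-lived token, set the HTTP-only cookie, redirect to the clean URL) can be sketched as a pure function. The token format (`vmid.expiry.hmac`), the HMAC scheme, the cookie name, and the vmid value are illustrative assumptions, not the real zpack-api implementation.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical short-lived IDE token: "<vmid>.<expiresAtMs>.<hmac>".
function sign(vmid: string, expiresAt: number, secret: string): string {
  const mac = createHmac('sha256', secret).update(`${vmid}.${expiresAt}`).digest('hex');
  return `${vmid}.${expiresAt}.${mac}`;
}

// Validate the token; on success return the cookie to set and the clean
// redirect target (steps 5-6 of the hosted flow).
function redeemIdeToken(token: string, secret: string, nowMs: number):
  | { ok: true; setCookie: string; location: string }
  | { ok: false } {
  const [vmid, expStr, mac] = token.split('.');
  if (!vmid || !expStr || !mac) return { ok: false };
  const expiresAt = Number(expStr);
  if (!Number.isFinite(expiresAt) || nowMs > expiresAt) return { ok: false };
  const expected = createHmac('sha256', secret).update(`${vmid}.${expiresAt}`).digest('hex');
  if (mac.length !== expected.length ||
      !timingSafeEqual(Buffer.from(mac), Buffer.from(expected))) return { ok: false };
  return {
    ok: true,
    // HTTP-only cookie backs all subsequent proxied requests (step 6)
    setCookie: `zlh_ide=${vmid}; HttpOnly; Secure; Path=/`,
    location: '/', // clean hosted URL, token stripped
  };
}
```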
### Traefik / edge role
- terminates TLS for hosted dev traffic
- forwards hosted dev traffic to the API
- preserves original `Host` header
- does **not** route directly to containers for hosted IDE access
### API role
- extracts vmid from hosted request context
- validates short-lived IDE token
- sets HTTP-only IDE cookie
- redirects token URL to clean hostname URL
- proxies live code-server HTTP + WebSocket traffic to the correct container
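The last step above, like the upload transport rule elsewhere in this doc, uses raw `http.request` piping rather than `fetch()`. A minimal sketch of that transport, assuming only Node's stdlib (the real API's target selection, header handling, and WebSocket upgrade path are more involved):

```typescript
import http from 'node:http';

// Stream an incoming request through to a container and stream the
// response back, without buffering the body in memory.
function pipeProxy(
  req: http.IncomingMessage,
  res: http.ServerResponse,
  target: { host: string; port: number },
): void {
  const proxyReq = http.request(
    { host: target.host, port: target.port, method: req.method,
      path: req.url, headers: req.headers },
    (proxyRes) => {
      res.writeHead(proxyRes.statusCode ?? 502, proxyRes.headers);
      proxyRes.pipe(res); // stream the response back unchanged
    },
  );
  proxyReq.on('error', () => { res.statusCode = 502; res.end('bad gateway'); });
  req.pipe(proxyReq); // stream the body as it arrives (never fetch())
}
```

Because both directions are piped, memory stays flat regardless of payload size, which is the point of the no-`fetch()` rule for uploads.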
### Local developer access (future / separate track)
Headscale/Tailscale for SSH, VS Code Remote, local tools.
Constraints: no exit nodes, `magic_dns: false`.
### Removed / No Longer Current
- path-based `/api/dev/:id/ide` as the primary browser entry
- Caddy-hosted dev IDE edge
- per-container Traefik file creation from dev provisioning
- per-container Cloudflare/Technitium publish/unpublish for dev IDE browser access
`proxyClient.js` remains in repo and is still used by game edge publish logic.
---
## API / Frontend Status
- API polls agent `/status`
- API exposes polled state back to frontend via `GET /api/servers/:id/status`
- Portal uses the API-mediated hosted IDE flow
- Portal uses the API websocket bridge for console access
- Portal still has some migration debt through `src/lib/api/legacy.ts`
- Portal no longer relies on stale DB-only state for console availability
- Game publish flow remains untouched by dev routing work
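The poll-based model in the first two bullets can be sketched as a single poll step: call the agent's `/status`, cache the result, and serve the cache from `GET /api/servers/:id/status`. The cache shape and the injected `agentStatus` stand-in are illustrative assumptions.

```typescript
// One of the agent's documented lifecycle states.
type AgentState = 'idle' | 'installing' | 'starting' | 'running'
                | 'stopping' | 'crashed' | 'error';

interface StatusCache { state: AgentState; fetchedAt: number; }

// One poll iteration: fetch the agent state and record it. agentStatus is
// an injected stand-in for the real HTTP call to the agent's /status.
async function pollOnce(
  agentStatus: () => Promise<AgentState>,
  cache: Map<string, StatusCache>,
  serverId: string,
  nowMs: number,
): Promise<void> {
  try {
    const state = await agentStatus();
    cache.set(serverId, { state, fetchedAt: nowMs });
  } catch {
    // On poll failure, keep the last known state rather than fabricating one.
  }
}
```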
---
## Velocity / Registration Model
### Current model
- API is the source of backend inventory/state for Minecraft routing
- Velocity plugin (`ZpackVelocityBridge`) is the component that actually registers/unregisters backends inside Velocity
- Plugin supports startup rehydrate from API plus plugin-local webhook endpoints:
- `POST /zpack/register`
- `POST /zpack/unregister`
- `GET /zpack/status`
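The API side of the webhook path above can be sketched as a plain stdlib HTTP call. The `/zpack/register` path comes from this doc; the JSON payload shape (`name`, `address`) is an assumption about the bridge contract, not the plugin's confirmed schema.

```typescript
import http from 'node:http';

// Notify the plugin-local webhook that a backend should be registered.
// Resolves with the HTTP status code the plugin returned.
function registerBackend(
  host: string,
  port: number,
  backend: { name: string; address: string }, // assumed payload shape
): Promise<number> {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify(backend);
    const req = http.request(
      { host, port, method: 'POST', path: '/zpack/register',
        headers: { 'content-type': 'application/json',
                   'content-length': Buffer.byteLength(body) } },
      (res) => { res.resume(); resolve(res.statusCode ?? 0); },
    );
    req.on('error', reject);
    req.end(body);
  });
}
```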
### Important current finding
- The likely current bug is **not** generic "Velocity broken"
- The likely issue is that the API/plugin path can expose/register a backend while the game server is `running` but not actually `ready`
- The plugin is the critical part of that registration path
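The fix this finding implies is a registration gate, sketched below: expose a backend to Velocity only when the server is both `running` and actually ready. The readiness signal would come from the agent's probe in `internal/minecraft/readiness.go`; here it is just a parameter, and the function name is illustrative.

```typescript
// Gate backend registration on readiness, not merely on process state.
// 'running' means the process is up; isReady means the readiness probe
// (sketched here as a boolean input) confirmed the server accepts players.
function shouldRegisterBackend(state: string, isReady: boolean): boolean {
  return state === 'running' && isReady;
}
```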
### Important implementation note
- Current plugin default endpoint behavior still references `zpack-api.internal.zlh` unless overridden
- That default is stale relative to current hot-path architecture and should not be relied on long-term
---
## Game Support
**Production:** Minecraft (vanilla/Fabric/Paper/Forge/Neoforge), Rust,
Terraria, Project Zomboid
**In Pipeline:** Valheim, Palworld, Vintage Story, Core Keeper
---
## Developer-to-Player Pipeline (Revenue Model)
```text
LXC Dev Environment ($15-40/mo)
→ Game/mod creation + testing
→ Testing servers (50% dev discount)
→ Player community referrals (25% player discount)
→ Developer revenue share (5-10% commission)
→ Viral growth
```
Revenue multiplier: 1 developer → ~10 players → $147.50/mo total.
---
## Open Threads (High Level)
1. Billing / Stripe completion
2. Game server world backup / restore
3. User onboarding flow
4. Fabric readiness gating / Velocity exposure sequencing
5. Password reset verification
6. Usage limits / quota enforcement
7. Email notifications
8. Velocity resync / refresh behavior
9. Upload testing, stress testing, OPNsense audit, provisioning validation
See `OPEN_THREADS.md` for active detail and priority order.
---
## Repo Registry
| Repo | Purpose |
|------|---------|
| `zlh-grind` | execution workspace / continuity / active constraints |
| `knowledge-base` | canonical architecture / strategy / bootstrap |
| `zlh-docs` | API/agent/portal reference docs |
| `zpack-api` | API source |
| `zpack-portal` | portal source |
| `zlh-agent` | agent source |
| `ZpackVelocityBridge` | Velocity plugin / backend registration layer |
All at `git.zerolaghub.com/jester/<repo>`
---
## Session Guidance
- `knowledge-base` is the architecture authority
- `zlh-grind` is the execution continuity layer
- `INFRASTRUCTURE.md` is the authoritative VM/IP inventory
- Agent is the authority on filesystem enforcement — API must **not** duplicate filesystem logic
- Portal does not enforce real policy — agent enforces
- Portal never calls agents directly — all traffic goes through API
- Upload transport uses raw `http.request` piping, never `fetch()`
- Do not mark unimplemented work as complete
- Game publish flow must never be modified by dev routing changes
- `proxyClient.js` must not be deleted — used by game edge publish path