# ZeroLagHub Project Context
## What It Is
Game server hosting platform targeting modded, indie, and emerging games.
Current operating strengths:
- LXC-based container model
- custom agent architecture
- open-source-oriented stack
- developer-friendly game and dev environment focus
- integrated hosted IDE / dev container surface through the API
System posture: stable, controlled expansion phase.
---
## Naming Convention
- `zlh-*` = core infrastructure (routing, monitoring, backup, artifacts, shared services)
- `zpack-*` = game and dev stack (portal, API, containers, Velocity/game edge)
---
## Infrastructure (Current Reality)
### Active host / environment
- Active dedicated host: **GTHost Detroit**
- Denver host: **decommissioned Apr 2, 2026**
- Active infra VM/LXC IDs are in the **9000s range**
- Legacy IDs in the 100s / 300s / 1000s / 2000s are reference only and should **not** be treated as active production
### Infrastructure source of truth
- Authoritative VM/IP inventory: `INFRASTRUCTURE.md`
- `PROJECT_CONTEXT.md` is a platform snapshot, **not** the authoritative VM/IP registry
### Active core infrastructure (high level)
- `9001 zlh-router` — OPNsense core router
- `9002 zpack-router` — OPNsense game/dev router
- `9010 zpack-dns` — Technitium DNS
- `9011 zlh-proxy` — core reverse proxy
- `9012 zpack-proxy` — game/dev edge proxy
- `9014 zlh-artifacts` — runtime, jar, and agent artifact server
- `9015 zpack-velocity` — Velocity proxy
- `9016 zlh-monitor` — Prometheus/Grafana
- `9017 zlh-back` — Proxmox Backup Server
- `9020 zpack-api` — API VM
- `9021 zpack-portal` — portal VM
### Service discovery note
- `internal.zlh` is **not currently used** for hot-path runtime service discovery
- Prefer explicit env-configured IPs / addresses in runtime-critical paths
---
## Stack
**API (`zpack-api`):** Node.js ESM on the Node 24 runtime line, Express 5, Prisma 6, MariaDB, Redis, BullMQ, JWT, Stripe, argon2, ssh2, WebSocket, http-proxy-middleware
**Portal (`zpack-portal`):** Next.js 16, TypeScript, TailwindCSS, Axios, WebSocket console, aligned to the Node 24 runtime line
**Agent (`zlh-agent`):** Go 1.21, stdlib HTTP, creack/pty, gorilla/websocket. Runs inside every game/dev container. Only process with direct filesystem access. Pulls runtimes + server jars from `zlh-artifacts`.
**Velocity plugin (`ZpackVelocityBridge`):** custom Velocity-side bridge that hydrates/registers backend servers for the proxy and exposes plugin-local register/unregister/status HTTP endpoints.
---
## Agent (Operational)
- HTTP server on :18888, internal only — API is the only intended caller
- Container types: `game` and `dev`
- Lifecycle: `POST /config` triggers async provision + start pipeline
- Filesystem: strict path allowlist for games; dev file API behavior is intended to be workspace-root-scoped
- The interactive console/PTY shell in dev containers is **not currently proven to be workspace-confined**; current live validation shows `cd ..` can escape upward from `/home/dev/workspace`
- Upload transport: raw `http.request` piping (`req.pipe(proxyReq)`), never `fetch()` (see the sketch after this list)
- Console: PTY-backed WebSocket, one read loop per container
- Self-update: periodic check + apply
- Forge/Neoforge: automated post-install patch sequence
- Modrinth mod lifecycle: install/enable/disable/delete — operational
- Provenance: `.zlh_metadata.json` — source is `null` if not set
- Status transport model: poll-based (`/status`), not push-based
- State transitions: `idle`, `installing`, `starting`, `running`, `stopping`, `crashed`, `error`
- Crash recovery: backoff 30s/60s/120s, resets if uptime ≥ 30s, `error` state after repeated failures
- Crash observability: exit code, signal, uptime, log tail, classification
- Real Minecraft readiness probing exists in `internal/minecraft/readiness.go`
- File handlers are live for both `game` and `dev` containers
- File edits create shadow copies for revert when supported by the agent file policy
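The upload-transport rule above has a specific shape on the API side. The sketch below is a minimal illustration, assuming an Express-style route; the agent path, header handling, and helper name are assumptions, while the `:18888` port and the pipe-never-fetch rule are the documented pieces.
```ts
// Minimal sketch of API -> agent upload forwarding via raw http.request piping.
// The /files/upload path, header handling, and helper name are illustrative assumptions;
// the documented rule is the shape: stream req into proxyReq, never fetch()/buffering.
import http from "node:http";
import type { Request, Response } from "express";

const AGENT_PORT = 18888; // agent HTTP server, internal only

export function forwardUpload(req: Request, res: Response, agentHost: string): void {
  const headers: http.OutgoingHttpHeaders = {};
  if (req.headers["content-type"]) headers["content-type"] = req.headers["content-type"];
  if (req.headers["content-length"]) headers["content-length"] = req.headers["content-length"];

  const proxyReq = http.request(
    { host: agentHost, port: AGENT_PORT, path: "/files/upload", method: "POST", headers },
    (proxyRes) => {
      res.status(proxyRes.statusCode ?? 502);
      proxyRes.pipe(res); // stream the agent's response straight back to the caller
    }
  );
  proxyReq.on("error", () => res.status(502).json({ error: "agent unreachable" }));
  req.pipe(proxyReq); // raw stream piping; the request body is never buffered or re-fetched
}
```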
### Minecraft runtime behavior
- `vanilla`
- implemented as the internal `vanilla-fabric` profile
- downloads `minecraft/fabric/<version>/server.jar`
- installs FabricProxy-Lite
- installs version-matched Fabric API from `minecraft/fabric/fabric-api/<version>/fabric-api.jar`
- installs FabricProxy-Lite config
- `fabric`
- downloads `minecraft/fabric/<version>/server.jar`
- no proxy jar
- no Fabric API injection
- no proxy config
- `forge` / `neoforge`
- use installer-based setup
- first boot avoids readiness ping gating until post-install files are generated
- agent waits for `server.properties`, stops the process, enforces server properties, then restarts through the readiness-aware path
- extended readiness timeout for first-start / post-processing flow
This split is important: ZeroLagHub `vanilla` is not plain Mojang jar delivery; it is the Fabric-based vanilla profile used for proxy compatibility and platform behavior. Normal Fabric is plain Fabric jar delivery only.
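The split can also be restated as data. The sketch below is an illustrative TypeScript summary only (the agent itself is Go); type and constant names are hypothetical, and the artifact paths are the ones listed above.
```ts
// Illustrative restatement of the Minecraft runtime split; not the agent implementation.
type MinecraftProfile = {
  serverJar: (version: string) => string; // path on zlh-artifacts
  installsFabricApi: boolean;             // version-matched Fabric API jar
  installsFabricProxyLite: boolean;       // FabricProxy-Lite jar + config
  installerBased: boolean;                // Forge/Neoforge installer flow
};

const PROFILES: Record<string, MinecraftProfile> = {
  // ZeroLagHub "vanilla" = the internal vanilla-fabric profile, not a plain Mojang jar
  vanilla: {
    serverJar: (v) => `minecraft/fabric/${v}/server.jar`,
    installsFabricApi: true,
    installsFabricProxyLite: true,
    installerBased: false,
  },
  // plain Fabric jar delivery only: no proxy jar, no Fabric API injection, no proxy config
  fabric: {
    serverJar: (v) => `minecraft/fabric/${v}/server.jar`,
    installsFabricApi: false,
    installsFabricProxyLite: false,
    installerBased: false,
  },
  // forge / neoforge: installer-based setup with the deferred-readiness first boot described above
};
```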
### Backup boundary
- Agent-owned game backups are local, app-aware rollback backups
- Current implemented game backup scope is local Minecraft backup create/list/restore/delete plus pre-restore checkpoint hardening
- Restore is real and validated in practice; API now starts restore asynchronously and Portal should rely on polled status rather than a long-lived restore POST
- PBS / platform backups are the durability and disaster-recovery layer
- Do not treat offsite/PBS durability work as agent implementation work unless ownership changes
---
## Dev Containers (Current State)
- supported runtimes: node, python, go, java, dotnet
- runtime installs are artifact-backed and idempotent
- runtime root: `/opt/zlh/runtimes/<runtime>/<version>`
- dev identity: `dev:dev`
- workspace root: `/home/dev/workspace`
- shell env: `HOME`, `USER`, `LOGNAME`, `TERM` set correctly
- code-server install path: `/opt/zlh/services/code-server`
- code-server port: `6000`
- code-server lifecycle: `POST /dev/codeserver/start|stop|restart` (call sketch below)
- code-server detection: `/proc/*/cmdline` scan
- agent port: `18888`
Code-server launch model:
- binds to `0.0.0.0`
- `--auth none`
- API/hosted flow handles auth and proxying
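The lifecycle endpoint above is called from the API side. A minimal caller sketch follows, assuming a plain HTTP control call; the helper name and timeout are assumptions, while the port and path are the ones documented here. The never-`fetch()` rule applies to upload streaming, not to control-plane calls like this.
```ts
// Illustrative API-side call to the agent's code-server lifecycle endpoint.
// Port 18888 and the /dev/codeserver/<action> path are documented above;
// the helper name and timeout handling are assumptions.
type CodeServerAction = "start" | "stop" | "restart";

export async function codeServerLifecycle(containerIp: string, action: CodeServerAction): Promise<void> {
  const res = await fetch(`http://${containerIp}:18888/dev/codeserver/${action}`, {
    method: "POST",
    signal: AbortSignal.timeout(10_000), // don't let a wedged agent hang the API
  });
  if (!res.ok) {
    throw new Error(`agent rejected codeserver ${action}: HTTP ${res.status}`);
  }
}
```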
---
## Dev Container Access Model
### Browser IDE (Current Working Model)
```text
Browser
  -> Traefik / hosted dev edge
       -> API
            -> container:6000
```
Working hosted flow (token/cookie handoff sketched after the list):
1. frontend calls `POST /api/dev/:id/ide-token`
2. API returns hosted IDE URL with short-lived token
3. browser opens hosted URL
4. edge forwards to API
5. API validates token, sets HTTP-only IDE cookie, redirects to clean hosted URL
6. subsequent cookie-backed requests proxy to container code-server
7. code-server redirects to `/?folder=/home/dev/workspace`
8. IDE loads successfully
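A minimal sketch of the token/cookie handoff in steps 1, 2, and 5, assuming Express-style handlers; the cookie name, token TTL, hostname, and in-memory token store are illustrative assumptions, and a real implementation would sign and persist these.
```ts
// Sketch of the hosted IDE handoff: short-lived token -> HTTP-only cookie -> clean redirect.
// Cookie name, TTL, hostname, and the in-memory store are illustrative assumptions.
import crypto from "node:crypto";
import type { Request, Response } from "express";

const IDE_TOKEN_TTL_MS = 60_000; // assumed short lifetime
const pendingTokens = new Map<string, { vmid: string; expires: number }>();

// Steps 1-2: POST /api/dev/:id/ide-token returns a hosted URL carrying a one-time token
export function issueIdeToken(req: Request, res: Response): void {
  const token = crypto.randomBytes(24).toString("hex");
  pendingTokens.set(token, { vmid: req.params.id, expires: Date.now() + IDE_TOKEN_TTL_MS });
  res.json({ url: `https://dev-ide.example/?ide_token=${token}` }); // hostname illustrative
}

// Step 5: validate the token, set the HTTP-only IDE cookie, redirect to the clean hosted URL
export function redeemIdeToken(req: Request, res: Response): void {
  const token = String(req.query.ide_token ?? "");
  const entry = pendingTokens.get(token);
  if (!entry || entry.expires < Date.now()) {
    res.status(401).send("IDE token invalid or expired");
    return;
  }
  pendingTokens.delete(token); // one-time use
  res.cookie("zlh_ide", entry.vmid, { httpOnly: true, secure: true, sameSite: "lax" });
  res.redirect("/"); // subsequent cookie-backed requests proxy to container:6000
}
```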
### Traefik / edge role
- terminates TLS for hosted dev traffic
- forwards hosted dev traffic to the API
- preserves original `Host` header
- does **not** route directly to containers for hosted IDE access
### API role
- extracts vmid from hosted request context
- validates short-lived IDE token
- sets HTTP-only IDE cookie
- redirects token URL to clean hostname URL
- proxies live code-server HTTP + WebSocket traffic to the correct container
### Local developer access (future / separate track)
Headscale/Tailscale for SSH, VS Code Remote, local tools. Constraints: no exit nodes, `magic_dns: false`.
### Removed / No Longer Current
- path-based `/api/dev/:id/ide` as the primary browser entry
- Caddy-hosted dev IDE edge
- per-container Traefik file creation from dev provisioning
- per-container Cloudflare/Technitium publish/unpublish for dev IDE browser access
`proxyClient.js` remains in repo and is still used by game edge publish logic.
---
## API / Frontend Status
- API polls agent `/status`
- API exposes polled state back to frontend via `GET /api/servers/:id/status`
- Portal uses the API-mediated hosted IDE flow
- Portal uses the API websocket bridge for console access
- Portal no longer relies on stale DB-only state for console availability
- Portal uses API-mediated game files, backups, and restore flows
- Restore start is async at the API layer; the frontend should treat `202 Accepted` as success and then poll status (see the sketch after this list)
- API normalizes backup timestamps and filters pre-restore checkpoints out of the default backup list
- Game publish flow remains untouched by dev routing work
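A portal-side sketch of the async restore contract, assuming illustrative endpoint paths and polling intervals; the `202 Accepted` handling, the polled status endpoint, and the agent state names are the documented pieces.
```ts
// Portal-side sketch: start a restore, treat 202 as "accepted", then poll status.
// The restore path, field names, and intervals are illustrative assumptions;
// GET /api/servers/:id/status and the agent state names are documented above.
async function restoreBackupAndWait(serverId: string, backupId: string): Promise<void> {
  const start = await fetch(`/api/servers/${serverId}/backups/${backupId}/restore`, { method: "POST" });
  if (start.status !== 202) throw new Error(`restore not accepted: HTTP ${start.status}`);
  for (let attempt = 0; attempt < 120; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, 5_000)); // poll instead of holding the POST open
    const res = await fetch(`/api/servers/${serverId}/status`);
    const body = (await res.json()) as { state?: string };
    if (body.state === "running" || body.state === "idle") return; // restore settled
    if (body.state === "error" || body.state === "crashed") throw new Error(`restore failed: ${body.state}`);
  }
  throw new Error("restore polling timed out");
}
```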
---
## Velocity / Registration Model
### Current model
- API is the source of backend inventory/state for Minecraft routing
- Velocity plugin (`ZpackVelocityBridge`) is the component that actually registers/unregisters backends inside Velocity
- Plugin supports startup rehydrate from API plus plugin-local webhook endpoints (caller sketch below):
- `POST /zpack/register`
- `POST /zpack/unregister`
- `GET /zpack/status`
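A minimal caller sketch for the register webhook, assuming the API is the caller; the host, port, and body fields are assumptions, and per the sequencing finding below this should only run after the backend's readiness probe succeeds.
```ts
// Illustrative API-side call to the plugin-local register webhook.
// The /zpack/register path is documented; host, port, and body fields are assumptions.
// Per the sequencing finding below, call this only after the readiness probe succeeds.
type BackendRegistration = { name: string; address: string; port: number };

export async function registerBackend(
  velocityHost: string,
  pluginPort: number,
  backend: BackendRegistration
): Promise<void> {
  const res = await fetch(`http://${velocityHost}:${pluginPort}/zpack/register`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(backend),
  });
  if (!res.ok) throw new Error(`Velocity bridge rejected registration: HTTP ${res.status}`);
}
```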
### Important current finding
- The likely current issue is sequencing: a backend must not be surfaced before its semantic readiness probe succeeds
- Current work is verifying whether any remaining registration path can expose a backend before the readiness probe succeeds
### Important implementation note
- Current plugin default endpoint behavior still references `zpack-api.internal.zlh` unless overridden
- That default is stale relative to current hot-path architecture and should not be relied on long-term
---
## Game Support
**Production:** Minecraft (vanilla/Fabric/Paper/Forge/Neoforge), Rust, Terraria, Project Zomboid
**In Pipeline:** Valheim, Palworld, Vintage Story, Core Keeper
---
## Open Threads (High Level)
Cross-repo / platform work remains in `OPEN_THREADS.md`.
Repo-specific active work now lives under:
- `Codex/API/*`
- `Codex/Portal/*`
- `Codex/Agent/*`
High-level active themes:
1. Backup / restore UX and status polish
2. Dev access / SSH / hosted IDE hardening
3. Service discovery and provisioning validation
4. Email notifications and launch polish
5. Launch testing and infrastructure audit
---
## Repo Registry
| Repo | Purpose |
|------|---------|
| `zlh-grind` | execution workspace / continuity / active constraints |
| `knowledge-base` | canonical architecture / strategy / bootstrap |
| `zlh-docs` | API/agent/portal reference docs |
| `zpack-api` | API source |
| `zpack-portal` | portal source |
| `zlh-agent` | agent source |
| `ZpackVelocityBridge` | Velocity plugin / backend registration layer |
All at `git.zerolaghub.com/jester/<repo>`
---
## Session Guidance
- `knowledge-base` is the architecture authority
- `zlh-grind` is the execution continuity layer
- `INFRASTRUCTURE.md` is the authoritative VM/IP inventory
- Repo-specific truth is maintained in `Codex/*`
- Root docs should stay focused on cross-repo/platform truth
- Agent is the authority on filesystem enforcement — API must **not** duplicate filesystem logic
- Portal does not enforce real policy — agent enforces
- Portal never calls agents directly — all traffic goes through API
- Upload transport uses raw `http.request` piping, never `fetch()`
- Do not mark unimplemented work as complete
- Remove completed items from `OPEN_ITEMS.md` instead of letting them linger
- Game publish flow must never be modified by dev routing changes
- `proxyClient.js` must not be deleted — used by game edge publish path