# Open Threads — zlh-grind

This file tracks active but unfinished work. Keep it short.

---

## Agent (zlh-agent)

### Dev Runtime System

Completed:

- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server (no local artifact assumption)

Outstanding:

- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- **runtime update process** — in-place runtime version upgrade for dev containers

---

### Dev Environment

Completed:

- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly

Outstanding:

- PATH normalization
- shell profile consistency
- runtime PATH injection

---

### Code Server Addon

Status: ✅ Installed, running, browser-verified end-to-end

Confirmed:

- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`

**Outstanding — Memory concern:**

- code-server consumes 1-2GB RAM at idle
- observed 99% memory usage in dev container during active use with AI coding agent
- 8GB container allocation may be insufficient for heavy dev workloads
- need to establish baseline memory usage for code-server alone vs with active coding session
- consider raising default dev container RAM or making it tier-based
- **stress test: k6 IDE session load test** — still outstanding from pre-launch checklist

---

### Game Server Supervision

Completed:

- crash recovery with backoff: 30s → 60s → 120s
- backoff resets if uptime ≥ 30s
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`,
`unexpected_exit`

---

### Agent Future Work (priority order)

1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)

---

## Dev IDE Access

### Browser IDE ✅ Fully Working (browser-verified) — Zero Install

```
Browser → dev-.zerolaghub.dev → Traefik → API → container:6000
```

This is the primary zero-install access method. VS Code in the browser, no client software required. Browser-verified and production ready.

### Remaining Work

- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy` — remove stale path-based assumptions

### Local Dev Access — SSH via CF Tunnel (Power Users Only)

**Important:** CF Tunnel SSH requires `cloudflared` installed on the developer's local machine. This is not zero-install and is an optional feature for power users who want local VS Code or terminal access. The browser IDE remains the zero-install story for all developers.

Current state:

- ✅ CF Tunnel created and connected to bastion VM
- ✅ Cloudflare Zero Trust free plan active
- 🟡 Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard
- 🟡 Bastion SSH proxy jump config not yet done
- 🟡 Dev container SSH server not yet verified
- 🟡 Portal SSH config snippet not yet built

---

## API (zpac-api)

Completed:

- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- `GET /api/servers/:id/status` — server status endpoint
- `POST /api/dev/:id/ide-token` — IDE token generation + hosted URL
- `GET /api/dev/:id/ide` — bootstrap route
- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
- hosted flow browser-verified end-to-end

Outstanding:

- **Billing / Stripe integration** — cannot take payments without this, launch blocker
- **Password reset flow** — verify it is fully wired up
- **Usage limits / quota enforcement** — prevent unbounded server creation per account
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart

**Provisioning + DB Consistency (Active Bug — needs audit before launch):**

- Duplicate server creation occurring during single requests
- Inconsistent / failed server termination
- DB state drifting from actual Proxmox + agent state
- BullMQ jobs may be executing more than once — missing deduplication / idempotency keys
- VMID and port allocation not confirmed to be transactionally safe
- Create endpoint not confirmed idempotent or guarded against double execution
- Delete flow not confirmed to handle partial failures (container already gone)
- DNS-based routing must not be used as source of truth — DB is the source of truth
- **internal.zlh DNS may be introducing timing issues in provisioning hot path** — replace with
env-based IP config for all service-to-service calls
- See `SCRATCH/debug-provisioning-consistency.md` for full audit plan
- See `SCRATCH/service-discovery.md` for env-based service discovery approach

---

## Portal (zpac-portal)

Completed:

- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- ✅ Site copy/wording rewrite — landing, features, FAQ, about updated for new platform
- ✅ Pricing page updated — Vanilla/Modded/Heavy tiers, Minecraft only

Outstanding:

- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install)
- **User onboarding flow** — what happens after registration? Guided first-server creation needed
- **Email notifications** — server crashed, provisioning complete, billing receipts
- Remove `testdaemon` binary from repo root (7MB, should not be in repo)

---

## Game Servers

Outstanding:

- **Game server world backup / restore** — player world data backup separate from PBS infrastructure backup. Trust-critical — players losing world data will kill retention.
- **Game server subdomain** — how do players connect? Verify IP vs subdomain (e.g. `mc.zerolaghub.com` style)

---

## Velocity / ZpackVelocityBridge

Outstanding:

- **Server registrations are in-memory only** — lost on Velocity restart. API needs `POST /api/velocity/resync` to re-register all running containers
- **ZpackCommands not registered** — `/zpack status/list/reload` commands exist in code but not wired up
- **Player routing on restart** — existing running containers go unrouted until new provisioning happens or resync is called
- See `SCRATCH/velocity-plugin.md` for full plugin documentation

---

## Infrastructure

### Hosting — GTHost (Decision: Stay, Most Cost-Effective)

GTHost Detroit is the primary host.
Decision made to stay on GTHost long-term:

- Bare metal, instant Proxmox install without workarounds
- Unmetered bandwidth
- Competitive pricing — most cost-effective option evaluated
- Digital Ocean: too expensive, no bare metal/Proxmox
- OVH: more expensive, overkill for current scale
- Hetzner: Proxmox install was painful historically

### DDoS Protection (Resolved — Minimal Attack Surface, Accepted Risk)

Investigation complete. Attack surface is actually very small:

- All infrastructure is internal to Proxmox — not publicly exposed
- Portal, API, admin access all internal or behind Twingate
- Minecraft player traffic is proxied through Velocity VM — individual game containers never directly exposed
- Only the Velocity VM TCP port is public-facing
- Minecraft Java uses TCP — harder to volumetric flood than UDP games
- OPNsense rate limiting on Velocity-facing port is sufficient for launch
- Cloudflare DNS (non-proxied) hides real IP from casual attackers

**Decision:** Architecture is well-designed — attack surface is minimal. Accept remaining risk at launch. Revisit if attack occurs or revenue supports additional mitigation.

---

## Pre-Launch Checklist

Outstanding before launch:

- **Billing / Stripe integration** — launch blocker, cannot take money
- **Game server world backup / restore** — trust-critical
- **User onboarding flow** — guided first-server creation after register
- **Password reset flow** — verify wired up
- **Usage limits / quota enforcement** — per account
- **Game server subdomain** — verify player connection method
- **Email notifications** — crashed, billing, provisioning
- **Upload testing** — test file upload flow end-to-end in dev containers
- **Billing endpoints** — add back to API
- **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
- **OPNsense audit** — both routers need systematic validation
- **Provisioning + DB consistency audit** — duplicate creation, state drift, BullMQ idempotency
- **Service discovery migration** — replace internal.zlh with env-based IPs in API and agent hot paths
- Remove `testdaemon` binary from zpac-portal repo root

---

## Platform

Future work:

- CF Tunnel SSH completion — power user feature, requires client cloudflared install
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery at scale
- **Admin panel** — manage users/servers as operator
- **Referral / dev pipeline reward system** — revenue sharing for developers
- **Uptime history** — visible to users per server
- **DDoS mitigation** — revisit Path.net or similar when revenue supports it

---

## Closed Threads

- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata)
- ✅ Code-server lifecycle endpoints (start/stop/restart)
- ✅ Code-server process detection via /proc scan
- ✅ Dev IDE proxy — path-based
browser IDE working end-to-end
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
- ✅ Per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
- ✅ Browser IDE fully loading at dev-.zerolaghub.dev
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Portal copy rewrite — landing, features, FAQ, about, pricing
- ✅ DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk
- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
- ✅ Migration to Detroit — complete Apr 2, 2026
- ✅ Internal FQDN migration — all services on internal.zlh