Open Threads — zlh-grind

This file tracks active but unfinished work.

Keep it short.


Agent (zlh-agent)

Dev Runtime System

Completed:

  • catalog validation implemented
  • runtime installs artifact-backed
  • install guard implemented
  • all installs now fetch from artifact server (no local artifact assumption)

Outstanding:

  • runtime install verification improvements
  • catalog hash validation
  • runtime removal / upgrade handling
  • runtime update process — in-place runtime version upgrade for dev containers
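
The outstanding catalog hash validation could look roughly like the sketch below. It assumes each catalog entry carries a SHA-256 digest for its artifact; the `sha256` field name and the `verify_artifact` helper are hypothetical and not taken from the agent's actual catalog schema.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large runtime tarballs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, catalog_entry: dict) -> bool:
    """Return True only if the downloaded file matches the catalog digest.

    catalog_entry["sha256"] is an assumed field name.
    """
    expected = catalog_entry.get("sha256", "").lower()
    if not expected:
        return False  # refuse to install when the catalog has no hash
    return hmac.compare_digest(sha256_of(path), expected)
```

Running this check after the fetch from the artifact server and before unpacking would let the install guard reject a corrupted or mismatched download.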

Dev Environment

Completed:

  • dev user creation
  • workspace root /home/dev/workspace
  • console runs as dev user
  • HOME, USER, LOGNAME, TERM env vars set correctly

Outstanding:

  • PATH normalization
  • shell profile consistency
  • runtime PATH injection
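
One possible shape for PATH normalization with runtime injection is sketched below: runtime bin directories go first, and empty or duplicate entries are dropped while order is preserved. The `/opt/zlh/runtimes/...` directory in the usage example is a hypothetical location, not a confirmed install path.

```python
def normalize_path(current: str, runtime_bins: list[str]) -> str:
    """Prepend runtime bin dirs, then drop empty and duplicate entries."""
    seen: set[str] = set()
    ordered: list[str] = []
    for entry in [*runtime_bins, *current.split(":")]:
        if not entry:
            continue  # an empty entry means "current directory"; drop it
        entry = entry.rstrip("/") or "/"  # treat /usr/bin/ and /usr/bin alike
        if entry in seen:
            continue
        seen.add(entry)
        ordered.append(entry)
    return ":".join(ordered)


if __name__ == "__main__":
    # Hypothetical runtime location, for illustration only.
    print(normalize_path("/usr/bin:/bin::/usr/bin/", ["/opt/zlh/runtimes/node/bin"]))
```

Applying the same function in both the console spawn path and the shell profile would keep the two from drifting apart.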

Code Server Addon

Status: Installed, running, browser-verified end-to-end

Confirmed:

  • pulled from artifact server (tar.gz)
  • installed to /opt/zlh/services/code-server
  • binds to 0.0.0.0:6000
  • lifecycle endpoints: POST /dev/codeserver/start|stop|restart
  • detection via /proc/*/cmdline scan
  • full browser IDE loading confirmed at dev-6070.zerolaghub.dev
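
The `/proc/*/cmdline` detection above can be illustrated with a minimal sketch. The function name and the `proc_root` parameter are hypothetical (the parameter exists only so the scan can be pointed at a test directory), and the agent's real matching rules may differ.

```python
import os


def find_pids_by_cmdline(needle: str, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose command line contains `needle` in any argument."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, name, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited mid-scan or is not readable
        # cmdline is a NUL-separated argv; kernel threads have an empty one
        argv = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if any(needle in arg for arg in argv):
            pids.append(int(name))
    return sorted(pids)
```

A substring match like this can also hit unrelated processes that mention the same string in their arguments, so a stricter match on the install path under /opt/zlh/services/code-server may be preferable.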

Outstanding — Memory concern:

  • code-server consumes 1-2GB RAM at idle
  • observed 99% memory usage in dev container during active use with AI coding agent
  • 8GB container allocation may be insufficient for heavy dev workloads
  • need to establish baseline memory usage for code-server alone vs with active coding session
  • consider raising default dev container RAM or making it tier-based
  • stress test: k6 IDE session load test — still outstanding from pre-launch checklist
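
For the memory baseline, one low-effort option is to sample resident memory from `/proc/<pid>/status` for the code-server processes, once idle and once during an active session. The helpers below are a hypothetical sketch; the 8 GiB default mirrors the allocation mentioned above.

```python
from typing import Optional


def rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (resident memory, in KiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def percent_of_allocation(rss_kib_total: int, allocation_gib: float = 8.0) -> float:
    """Express a resident-memory total as a percentage of the container allocation."""
    return 100.0 * rss_kib_total / (allocation_gib * 1024 * 1024)
```

Summing `rss_kib` across all code-server PIDs gives a rough figure; shared pages are counted once per process, so the sum overstates real usage somewhat.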

Game Server Supervision

Completed:

  • crash recovery with backoff: 30s → 60s → 120s
  • backoff resets if uptime ≥ 30s
  • transitions to error state after repeated failures
  • crash observability: time, exit code, signal, uptime, log tail, classification
  • classifications: oom, mod_or_plugin_error, missing_dependency, nonzero_exit, unexpected_exit
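
The backoff rules above can be sketched as a small state machine. The delays and the 30-second stability window come from the list; the threshold of three consecutive crashes before entering the error state is an assumption, since the list only says "repeated failures", and the class and state names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_STEPS = (30, 60, 120)  # seconds between restart attempts
STABLE_UPTIME = 30             # uptime (seconds) that resets the backoff
MAX_CRASHES = 3                # assumed threshold for "repeated failures"


@dataclass
class Supervisor:
    crashes: int = 0
    state: str = "running"

    def on_crash(self, uptime_s: float) -> Optional[int]:
        """Return seconds to wait before restarting, or None when giving up."""
        if uptime_s >= STABLE_UPTIME:
            self.crashes = 0  # the server ran long enough to count as healthy
        if self.crashes >= MAX_CRASHES:
            self.state = "error"
            return None
        delay = BACKOFF_STEPS[min(self.crashes, len(BACKOFF_STEPS) - 1)]
        self.crashes += 1
        self.state = "restarting"
        return delay
```

With these assumptions, three quick crashes in a row yield delays of 30, 60, and 120 seconds, and a fourth moves the server to the error state.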

Agent Future Work (priority order)

  1. Structured logging (slog) for Loki
  2. Dev container provisioningComplete state in /status
  3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
  4. Process reattachment on agent restart
  5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)
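
Item 3 (SIGTERM + wait) follows a common pattern, sketched here under the assumption of a POSIX host. The function name and the 30-second grace period are hypothetical; a Minecraft server may need a longer window to flush world data.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_s: float = 30.0) -> int:
    """Send SIGTERM, wait up to grace_s seconds, then SIGKILL as a last resort."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # the process did not exit within the grace period
        return proc.wait()
```

Verification would then mean checking that the return path without the SIGKILL fallback is the one taken in practice, and that the world save completed before exit.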

Dev IDE Access

Browser IDE Fully Working (browser-verified) — Zero Install

Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000

This is the primary zero-install access method. VS Code in the browser, no client software required. Browser-verified and production ready.

Remaining Work

  • confirm "Open IDE" button in portal uses hosted URL in production path
  • reduce legacy /__ide/:id compatibility paths once portal button confirmed
  • simplify and harden devProxy — remove stale path-based assumptions

Local Dev Access — SSH via CF Tunnel (Power Users Only)

Important: CF Tunnel SSH requires cloudflared installed on the developer's local machine. This is not zero-install and is an optional feature for power users who want local VS Code or terminal access.

The browser IDE remains the zero-install story for all developers.

Current state:

  • CF Tunnel created and connected to bastion VM
  • Cloudflare Zero Trust free plan active
  • 🟡 Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard
  • 🟡 Bastion SSH proxy jump config not yet done
  • 🟡 Dev container SSH server not yet verified
  • 🟡 Portal SSH config snippet not yet built
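
The portal SSH config snippet might be generated along these lines. The `cloudflared access ssh --hostname %h` ProxyCommand is the form Cloudflare documents for client-side SSH over a tunnel; the host aliases, the function name, and all parameters are hypothetical, and the real tunnel hostname is not yet configured.

```python
def ssh_config_snippet(vmid: int, container_ip: str, bastion_host: str) -> str:
    """Render an ~/.ssh/config fragment: bastion via cloudflared, container via ProxyJump."""
    lines = [
        "Host zlh-bastion",
        f"    HostName {bastion_host}",
        "    ProxyCommand cloudflared access ssh --hostname %h",
        "",
        f"Host zlh-dev-{vmid}",
        f"    HostName {container_ip}",
        "    User dev",  # the dev user created during provisioning
        "    ProxyJump zlh-bastion",
    ]
    return "\n".join(lines) + "\n"
```

A developer would paste the output into `~/.ssh/config` and then run `ssh zlh-dev-<vmid>`, which still requires cloudflared on the local machine.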

API (zpac-api)

Completed:

  • dev provisioning payload
  • runtime/version fields
  • enable_code_server flag
  • GET /api/servers/:id/status — server status endpoint
  • POST /api/dev/:id/ide-token — IDE token generation + hosted URL
  • GET /api/dev/:id/ide — bootstrap route
  • /__ide/:id/* — live tunnel proxy (HTTP + WS, target-bound)
  • handleHostedProxy — host-based routing via Host header vmid extraction
  • hosted flow browser-verified end-to-end
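
The Host-header vmid extraction used by host-based routing can be illustrated as follows. This is a sketch of the idea, not the actual `handleHostedProxy` code; the function name is hypothetical and the pattern assumes numeric vmids on `dev-<vmid>.zerolaghub.dev`.

```python
import re
from typing import Optional

# Anchored with fullmatch so lookalike hosts do not slip through.
_HOST_RE = re.compile(r"dev-([0-9]+)\.zerolaghub\.dev")


def vmid_from_host(host_header: str) -> Optional[int]:
    """Return the vmid encoded in the Host header, or None if it does not match."""
    host = host_header.strip().lower()
    if ":" in host:
        host = host.rsplit(":", 1)[0]  # drop an optional :port suffix
    match = _HOST_RE.fullmatch(host)
    return int(match.group(1)) if match else None
```

Rejecting anything that does not fully match keeps the proxy target-bound: a request for `dev-6070.zerolaghub.dev.example.com` resolves to no container at all.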

Outstanding:

  • Billing / Stripe integration — cannot take payments without this, launch blocker
  • Password reset flow — verify it is fully wired up
  • Usage limits / quota enforcement — prevent unbounded server creation per account
  • simplify and harden host-native devProxy
  • dev runtime catalog endpoint for portal
  • Headscale auth key generation
  • Velocity resync endpoint: POST /api/velocity/resync to re-register all running containers after a Velocity restart

Provisioning + DB Consistency (Active Bug — needs audit before launch):

  • Duplicate server creation occurring during single requests
  • Inconsistent / failed server termination
  • DB state drifting from actual Proxmox + agent state
  • BullMQ jobs may be executing more than once — missing deduplication / idempotency keys
  • VMID and port allocation not confirmed to be transactionally safe
  • Create endpoint not confirmed idempotent or guarded against double execution
  • Delete flow not confirmed to handle partial failures (container already gone)
  • DNS-based routing must not be used as source of truth — DB is the source of truth
  • internal.zlh DNS may be introducing timing issues in provisioning hot path — replace with env-based IP config for all service-to-service calls
  • See SCRATCH/debug-provisioning-consistency.md for full audit plan
  • See SCRATCH/service-discovery.md for env-based service discovery approach
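
As a starting point for the audit, the sketch below shows one way to make create idempotent and VMID allocation transactional. It uses SQLite as a stand-in for the real database; the table layout, the `idempotency_key` column, and the starting VMID of 5000 are all assumptions. The connection must be opened with `isolation_level=None` so the explicit `BEGIN IMMEDIATE` controls the transaction.

```python
import sqlite3


def init(db: sqlite3.Connection) -> None:
    db.execute(
        """CREATE TABLE IF NOT EXISTS servers (
               vmid INTEGER PRIMARY KEY,
               idempotency_key TEXT NOT NULL UNIQUE,
               state TEXT NOT NULL)"""
    )


def create_server(db: sqlite3.Connection, idempotency_key: str, first_vmid: int = 5000):
    """Return (vmid, created). A repeated key returns the existing row."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = db.execute(
            "SELECT vmid FROM servers WHERE idempotency_key = ?", (idempotency_key,)
        ).fetchone()
        if row:
            db.execute("COMMIT")
            return row[0], False
        vmid = db.execute(
            "SELECT COALESCE(MAX(vmid) + 1, ?) FROM servers", (first_vmid,)
        ).fetchone()[0]
        db.execute(
            "INSERT INTO servers (vmid, idempotency_key, state) VALUES (?, ?, 'provisioning')",
            (vmid, idempotency_key),
        )
        db.execute("COMMIT")
        return vmid, True
    except Exception:
        db.execute("ROLLBACK")
        raise
```

The UNIQUE and PRIMARY KEY constraints are the real safety net here: even if two workers race, the database rejects the second insert. On the queue side, BullMQ skips adding a job whose custom `jobId` already exists, which may offer a similar guard if the job ID is derived from the same key.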

Portal (zpac-portal)

Completed:

  • dev runtime dropdown
  • dotnet runtime support
  • enable code-server checkbox
  • dev file browser support
  • Site copy/wording rewrite — landing, features, FAQ, about updated for new platform
  • Pricing page updated — Vanilla/Modded/Heavy tiers, Minecraft only

Outstanding:

  • confirm "Open IDE" button fully uses hosted URL flow
  • SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install)
  • User onboarding flow — what happens after registration? Guided first-server creation needed
  • Email notifications — server crashed, provisioning complete, billing receipts
  • Remove testdaemon binary from repo root (7MB, should not be in repo)

Game Servers

Outstanding:

  • Game server world backup / restore — player world data backup separate from PBS infrastructure backup. Trust-critical — players losing world data will kill retention.
  • Game server subdomain — how do players connect? Verify IP vs subdomain (e.g. mc.zerolaghub.com style)

Velocity / ZpackVelocityBridge

Outstanding:

  • Server registrations are in-memory only — lost on Velocity restart. API needs POST /api/velocity/resync to re-register all running containers
  • ZpackCommands not registered: /zpack status/list/reload commands exist in code but are not wired up
  • Player routing on restart — existing running containers go unrouted until new provisioning happens or resync is called
  • See SCRATCH/velocity-plugin.md for full plugin documentation

Infrastructure

Hosting — GTHost (Decision: Stay, Most Cost-Effective)

GTHost Detroit is the primary host. Decision made to stay on GTHost long-term:

  • Bare metal, instant Proxmox install without workarounds
  • Unmetered bandwidth
  • Competitive pricing — most cost-effective option evaluated
  • Digital Ocean: too expensive, no bare metal/Proxmox
  • OVH: more expensive, overkill for current scale
  • Hetzner: Proxmox install was painful historically

DDoS Protection (Resolved — Minimal Attack Surface, Accepted Risk)

Investigation complete. Attack surface is actually very small:

  • All infrastructure is internal to Proxmox — not publicly exposed
  • Portal, API, admin access all internal or behind Twingate
  • Minecraft player traffic is proxied through Velocity VM — individual game containers never directly exposed
  • Only the Velocity VM TCP port is public-facing
  • Minecraft Java uses TCP — harder to volumetric flood than UDP games
  • OPNsense rate limiting on Velocity-facing port is sufficient for launch
  • Cloudflare DNS is non-proxied (DNS-only), so records resolve directly to the real IP; treat the IP as discoverable rather than hidden

Decision: Architecture is well-designed — attack surface is minimal. Accept remaining risk at launch. Revisit if attack occurs or revenue supports additional mitigation.


Pre-Launch Checklist

Outstanding before launch:

  • Billing / Stripe integration — launch blocker, cannot take money
  • Game server world backup / restore — trust-critical
  • User onboarding flow — guided first-server creation after register
  • Password reset flow — verify wired up
  • Usage limits / quota enforcement — per account
  • Game server subdomain — verify player connection method
  • Email notifications — crashed, billing, provisioning
  • Upload testing — test file upload flow end-to-end in dev containers
  • Billing endpoints — add back to API
  • Stress testing — k6 IDE session load test + Minecraft bot test + code-server memory baseline
  • OPNsense audit — both routers need systematic validation
  • Provisioning + DB consistency audit — duplicate creation, state drift, BullMQ idempotency
  • Service discovery migration — replace internal.zlh with env-based IPs in API and agent hot paths
  • Remove testdaemon binary from zpac-portal repo root

Platform

Future work:

  • CF Tunnel SSH completion — power user feature, requires client cloudflared install
  • artifact version promotion
  • runtime rollback support
  • Cloudflare R2 for large artifact/mod file delivery at scale
  • Admin panel — manage users/servers as operator
  • Referral / dev pipeline reward system — revenue sharing for developers
  • Uptime history — visible to users per server
  • DDoS mitigation — revisit Path.net or similar when revenue supports it

Closed Threads

  • PTY console (dev + game)
  • Mod lifecycle
  • Upload pipeline
  • Runtime artifact installs
  • Dev container filesystem model
  • Code-server artifact fix
  • API status endpoint for frontend agent-state consumption
  • Game server crash recovery with backoff
  • Crash observability (classification, log tail, exit metadata)
  • Code-server lifecycle endpoints (start/stop/restart)
  • Code-server process detection via /proc scan
  • Dev IDE proxy — path-based browser IDE working end-to-end
  • Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
  • Per-container dev IDE edge publish/unpublish removed from API
  • Wildcard TLS cert *.zerolaghub.dev via Let's Encrypt + Cloudflare DNS-01
  • Browser IDE fully loading at dev-<vmid>.zerolaghub.dev
  • CF Tunnel created and connected to bastion VM
  • Portal copy rewrite — landing, features, FAQ, about, pricing
  • DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk
  • Hosting provider decision — GTHost Detroit, most cost-effective option
  • Migration to Detroit — complete Apr 2, 2026
  • Internal FQDN migration — all services on internal.zlh