From 6d6be18a84fcf4557551e96cbbf11518688fc5e0 Mon Sep 17 00:00:00 2001
From: jester
Date: Fri, 10 Apr 2026 20:44:44 +0000
Subject: [PATCH] Update Velocity thread after plugin rehydrate fix validation

---
 OPEN_THREADS.md | 299 +++++++++++++++++------------------------------
 1 file changed, 107 insertions(+), 192 deletions(-)

diff --git a/OPEN_THREADS.md b/OPEN_THREADS.md
index 75265d9..8c56b41 100644
--- a/OPEN_THREADS.md
+++ b/OPEN_THREADS.md
@@ -9,295 +9,210 @@ Keep it short.
 
 ## Agent (zlh-agent)
 
 ### Dev Runtime System
-
 Completed:
-
 - catalog validation implemented
 - runtime installs artifact-backed
 - install guard implemented
-- all installs now fetch from artifact server (no local artifact assumption)
+- all installs now fetch from artifact server
 
 Outstanding:
-
 - runtime install verification improvements
 - catalog hash validation
 - runtime removal / upgrade handling
-- **runtime update process** — in-place runtime version upgrade for dev containers
-
----
+- runtime update process for dev containers
 
 ### Dev Environment
-
 Completed:
-
 - dev user creation
 - workspace root `/home/dev/workspace`
 - console runs as dev user
 - `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
 
 Outstanding:
-
 - PATH normalization
 - shell profile consistency
 - runtime PATH injection
 
----
-
 ### Code Server Addon
+Status: Installed, running, browser-verified end-to-end.
-Status: ✅ Installed, running, browser-verified end-to-end
-
-Confirmed:
-
-- pulled from artifact server (tar.gz)
-- installed to `/opt/zlh/services/code-server`
-- binds to `0.0.0.0:6000`
-- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
-- detection via `/proc/*/cmdline` scan
-- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
-
-**Outstanding — Memory concern:**
-- code-server consumes 1-2GB RAM at idle
-- observed 99% memory usage in dev container during active use with AI coding agent
-- 8GB container allocation may be insufficient for heavy dev workloads
-- need to establish baseline memory usage for code-server alone vs with active coding session
-- consider raising default dev container RAM or making it tier-based
-- **stress test: k6 IDE session load test** — still outstanding from pre-launch checklist
-
----
+Outstanding:
+- code-server memory baseline
+- decide whether default dev RAM should increase or become tier-based
+- k6 IDE session load test
 
 ### Game Server Supervision
-
 Completed:
+- crash recovery with backoff
+- crash observability and classification
-- crash recovery with backoff: 30s → 60s → 120s
-- backoff resets if uptime ≥ 30s
-- transitions to `error` state after repeated failures
-- crash observability: time, exit code, signal, uptime, log tail, classification
-- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
 
+### Fabric Readiness Gating
+Status: startup rehydrate path fixed in the plugin and the happy path validated live.
----
-
-### Fabric Readiness Gating (Active — agent-level fix needed)
-
-Root cause confirmed: Velocity registration happens before the Fabric server is fully ready to accept proxy traffic, causing "proxy starting" errors until Velocity is restarted.
-
-Required fix in agent:
-- Do not register with Velocity until readiness probe confirms server is accepting connections
-- TCP port check (port 25565) is the most reliable signal — already used by agent probe
-- Velocity registration must be gated behind probe success, not just process start
-- Pre-seeding mods + FabricProxy config before first run ✅ already done
-- See `SCRATCH/session-stabilization-fabric-findings.md` for full findings
-
----
-
-### Agent Future Work (priority order)
+Still outstanding:
+- negative-path validation: verify a `ready=false` backend is skipped on Velocity startup rehydrate
+- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
+- see `SCRATCH/session-stabilization-fabric-findings.md`
+### Agent Future Work
 1. Structured logging (slog) for Loki
 2. Dev container `provisioningComplete` state in `/status`
-3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
+3. Graceful shutdown verification
 4. Process reattachment on agent restart
-5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)
+5. SSH server install in dev container provisioning pipeline
 
 ---
 
 ## Dev IDE Access
 
-### Browser IDE ✅ Fully Working (browser-verified) — Zero Install
-
-```
-Browser → dev-.zerolaghub.dev → Traefik → API → container:6000
-```
-
-This is the primary zero-install access method. VS Code in the browser,
-no client software required. Browser-verified and production ready.
-
-### Remaining Work
+### Browser IDE
+Status: fully working, browser-verified, zero-install.
+Remaining:
 - confirm "Open IDE" button in portal uses hosted URL in production path
 - reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
-- simplify and harden `devProxy` — remove stale path-based assumptions
+- simplify and harden `devProxy`
 
-### Local Dev Access — SSH via CF Tunnel (Power Users Only)
-
-**Important:** CF Tunnel SSH requires `cloudflared` installed on the
-developer's local machine. This is not zero-install and is an optional
-feature for power users who want local VS Code or terminal access.
-
-The browser IDE remains the zero-install story for all developers.
+### Local Dev Access — SSH via CF Tunnel
 
 Current state:
-- ✅ CF Tunnel created and connected to bastion VM
-- ✅ Cloudflare Zero Trust free plan active
-- 🟡 Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard
-- 🟡 Bastion SSH proxy jump config not yet done
-- 🟡 Dev container SSH server not yet verified
-- 🟡 Portal SSH config snippet not yet built
+- tunnel created and connected to bastion VM
+- Zero Trust free plan active
+- SSH hostname mapping not yet configured
+- bastion SSH proxy jump config not yet done
+- dev container SSH server not yet verified
+- portal SSH config snippet not yet built
 
 ---
 
-## API (zpac-api)
+## API (zpack-api)
 
 Completed:
-
 - dev provisioning payload
 - runtime/version fields
 - enable_code_server flag
-- `GET /api/servers/:id/status` — server status endpoint
-- `POST /api/dev/:id/ide-token` — IDE token generation + hosted URL
-- `GET /api/dev/:id/ide` — bootstrap route
-- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
-- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
+- status endpoint
+- IDE token generation + hosted URL
+- bootstrap IDE route
+- live tunnel proxy
+- host-based routing for hosted IDE
 - hosted flow browser-verified end-to-end
-- ✅ Backend lifecycle hardened — idempotent create/delete/start/stop
-- ✅ Duplicate server creation fixed — was frontend issue, not API
-- ✅ Console routing corrected — browser → API → agent
-- ✅ Control plane switched to IP-based service communication
-- ✅ Velocity rehydration uses DB + Redis instead of Proxmox live state
+- backend lifecycle hardened
+- duplicate server creation fixed
+- console routing corrected
+- control plane switched to IP-based service communication
+- Velocity rehydration uses DB + Redis instead of Proxmox live state
 
 Outstanding:
-
-- **Billing / Stripe integration** — cannot take payments without this, launch blocker
-- **Password reset flow** — verify it is fully wired up
-- **Usage limits / quota enforcement** — prevent unbounded server creation per account
+- Billing / Stripe integration
+- password reset flow verification
+- usage limits / quota enforcement
 - simplify and harden host-native `devProxy`
 - dev runtime catalog endpoint for portal
 - Headscale auth key generation
-- **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart
-- **Service discovery migration** — replace remaining internal.zlh FQDNs with env-based IPs in hot paths
+- service discovery migration for remaining hot-path `internal.zlh` references
 
 ---
 
-## Portal (zpac-portal)
+## Portal (zpack-portal)
 
 Completed:
-
 - dev runtime dropdown
 - dotnet runtime support
 - enable code-server checkbox
 - dev file browser support
-- ✅ Site copy/wording rewrite — landing, features, FAQ, about updated for new platform
-- ✅ Pricing page updated — Vanilla/Modded/Heavy tiers, Minecraft only
+- site copy rewrite
+- pricing page updated
 
 Outstanding:
-
 - confirm "Open IDE" button fully uses hosted URL flow
-- SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install)
-- **User onboarding flow** — what happens after registration? Guided first-server creation needed
-- **Email notifications** — server crashed, provisioning complete, billing receipts
-- Remove `testdaemon` binary from repo root (7MB, should not be in repo)
+- SSH config snippet for power users
+- user onboarding flow
+- email notifications
+- remove `testdaemon` binary from repo root
 
 ---
 
 ## Game Servers
 
 Outstanding:
-
-- **Game server world backup / restore** — player world data backup separate from PBS infrastructure backup. Trust-critical — players losing world data will kill retention.
-- **Game server subdomain** — how do players connect? Verify IP vs subdomain (e.g. `mc.zerolaghub.com` style)
+- game server world backup / restore
+- game server subdomain / player connection method verification
 
 ---
 
 ## Velocity / ZpackVelocityBridge
 
+Completed:
+- startup rehydrate now requires `ready == true`
+- stale default `zpack-api.internal.zlh` fallback removed
+- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
+- happy-path player routing validated live after restart
+
 Outstanding:
-
-- **Server registrations are in-memory only** — lost on Velocity restart. API needs `POST /api/velocity/resync` to re-register all running containers
-- **ZpackCommands not registered** — `/zpack status/list/reload` commands exist in code but not wired up
-- **Player routing on restart** — existing running containers go unrouted until new provisioning happens or resync is called
-- See `SCRATCH/velocity-plugin.md` for full plugin documentation
-
----
-
-## Infrastructure
-
-### Hosting — GTHost (Decision: Stay, Most Cost-Effective)
-
-GTHost Detroit is the primary host. Decision made to stay on GTHost long-term:
-- Bare metal, instant Proxmox install without workarounds
-- Unmetered bandwidth
-- Competitive pricing — most cost-effective option evaluated
-- Digital Ocean: too expensive, no bare metal/Proxmox
-- OVH: more expensive, overkill for current scale
-- Hetzner: Proxmox install was painful historically
-
-### DDoS Protection (Resolved — Minimal Attack Surface, Accepted Risk)
-
-Investigation complete. Attack surface is actually very small:
-
-- All infrastructure is internal to Proxmox — not publicly exposed
-- Portal, API, admin access all internal or behind Twingate
-- Minecraft player traffic is proxied through Velocity VM — individual game containers never directly exposed
-- Only the Velocity VM TCP port is public-facing
-- Minecraft Java uses TCP — harder to volumetric flood than UDP games
-- OPNsense rate limiting on Velocity-facing port is sufficient for launch
-- Cloudflare DNS (non-proxied) hides real IP from casual attackers
-
-**Decision:** Architecture is well-designed — attack surface is minimal. Accept remaining risk at launch. Revisit if attack occurs or revenue supports additional mitigation.
+- ZpackCommands not registered (`/zpack status`, `/zpack list`, `/zpack reload`)
+- negative-path readiness validation still pending for a `ready=false` backend
+- see `SCRATCH/velocity-plugin.md`
 
 ---
 
 ## Pre-Launch Checklist
 
 Outstanding before launch:
-
-- **Billing / Stripe integration** — launch blocker, cannot take money
-- **Game server world backup / restore** — trust-critical
-- **User onboarding flow** — guided first-server creation after register
-- **Password reset flow** — verify wired up
-- **Usage limits / quota enforcement** — per account
-- **Game server subdomain** — verify player connection method
-- **Email notifications** — crashed, billing, provisioning
-- **Upload testing** — test file upload flow end-to-end in dev containers
-- **Billing endpoints** — add back to API
-- **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
-- **OPNsense audit** — both routers need systematic validation
-- **Fabric readiness gating** — agent must gate Velocity registration behind TCP probe success
-- **Service discovery migration** — replace remaining internal.zlh with env-based IPs in hot paths
-- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
-- Remove `testdaemon` binary from zpac-portal repo root
+- Billing / Stripe integration
+- game server world backup / restore
+- user onboarding flow
+- password reset flow verification
+- usage limits / quota enforcement
+- game server subdomain verification
+- email notifications
+- upload testing
+- billing endpoints
+- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
+- OPNsense audit
+- Fabric readiness gating full validation
+- service discovery migration
+- provisioning validation
+- remove `testdaemon` from `zpack-portal`
 
 ---
 
 ## Platform
 
 Future work:
-
-- CF Tunnel SSH completion — power user feature, requires client cloudflared install
+- CF Tunnel SSH completion
 - artifact version promotion
 - runtime rollback support
-- Cloudflare R2 for large artifact/mod file delivery at scale
-- **Admin panel** — manage users/servers as operator
-- **Referral / dev pipeline reward system** — revenue sharing for developers
-- **Uptime history** — visible to users per server
-- **DDoS mitigation** — revisit Path.net or similar when revenue supports it
+- Cloudflare R2 for large artifact/mod file delivery
+- admin panel
+- referral / dev pipeline reward system
+- uptime history
+- revisit DDoS mitigation later if needed
 
 ---
 
 ## Closed Threads
 
-- ✅ PTY console (dev + game)
-- ✅ Mod lifecycle
-- ✅ Upload pipeline
-- ✅ Runtime artifact installs
-- ✅ Dev container filesystem model
-- ✅ Code-server artifact fix
-- ✅ API status endpoint for frontend agent-state consumption
-- ✅ Game server crash recovery with backoff
-- ✅ Crash observability (classification, log tail, exit metadata)
-- ✅ Code-server lifecycle endpoints (start/stop/restart)
-- ✅ Code-server process detection via /proc scan
-- ✅ Dev IDE proxy — path-based browser IDE working end-to-end
-- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
-- ✅ Per-container dev IDE edge publish/unpublish removed from API
-- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
-- ✅ Browser IDE fully loading at dev-.zerolaghub.dev
-- ✅ CF Tunnel created and connected to bastion VM
-- ✅ Portal copy rewrite — landing, features, FAQ, about, pricing
-- ✅ DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk
-- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
-- ✅ Migration to Detroit — complete Apr 2, 2026
-- ✅ Internal FQDN migration — all services on internal.zlh
-- ✅ System stabilization — duplicate creation, DB/Redis drift, console routing all resolved
-- ✅ Provisioning + DB consistency — root cause was migration state pollution, now clean
-- ✅ IP-based control plane — internal.zlh removed from service-to-service hot paths
+- PTY console (dev + game)
+- mod lifecycle
+- upload pipeline
+- runtime artifact installs
+- dev container filesystem model
+- code-server artifact fix
+- API status endpoint for frontend agent-state consumption
+- game server crash recovery with backoff
+- crash observability
+- code-server lifecycle endpoints
+- code-server process detection
+- dev IDE proxy
+- hosted wildcard Traefik → API → container IDE flow
+- per-container dev IDE edge publish/unpublish removed from API
+- wildcard TLS cert `*.zerolaghub.dev`
+- browser IDE fully loading at `dev-.zerolaghub.dev`
+- CF Tunnel created and connected to bastion VM
+- portal copy rewrite
+- DDoS investigation
+- hosting provider decision: GTHost Detroit
+- migration to Detroit complete
+- system stabilization after migration
+- IP-based control plane
+- Velocity startup rehydrate fixed and validated on happy path
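The `ready == true` startup-rehydrate invariant this patch validates (and whose negative path is still pending) comes down to one filter over the rehydrate payload: a booting backend must never be registered. A minimal sketch in Go; the struct fields and JSON shape here are illustrative, not the plugin's real wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Backend mirrors one entry in a rehydrate payload. Field names are
// assumptions for illustration; the real contract lives in the plugin
// and zpack-api.
type Backend struct {
	Name  string `json:"name"`
	Addr  string `json:"addr"`
	Ready bool   `json:"ready"`
}

// selectReady keeps only backends with ready == true. This is the invariant
// the rehydrate fix enforces: a ready=false backend is skipped on startup.
func selectReady(all []Backend) []Backend {
	out := make([]Backend, 0, len(all))
	for _, b := range all {
		if b.Ready {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	payload := []byte(`[
		{"name":"mc-1","addr":"10.0.0.11:25565","ready":true},
		{"name":"mc-2","addr":"10.0.0.12:25565","ready":false}
	]`)
	var all []Backend
	if err := json.Unmarshal(payload, &all); err != nil {
		panic(err)
	}
	// Only mc-1 passes the filter; mc-2 (ready=false) is skipped.
	for _, b := range selectReady(all) {
		fmt.Println("register:", b.Name, b.Addr)
	}
}
```

The pending negative-path check is exactly this behavior observed live: restart Velocity with one `ready=false` backend in the payload and confirm it is absent from the proxy's server list.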