# Open Threads — zlh-grind
This file tracks active but unfinished work.
Keep it short.

---
## Agent (zlh-agent)
### Dev Runtime System
Completed:
- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server (no local artifact assumption)
Outstanding:
- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- **runtime update process** — in-place runtime version upgrade for dev containers
---
### Dev Environment
Completed:
- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
Outstanding:
- PATH normalization
- shell profile consistency
- runtime PATH injection
---
### Code Server Addon
Status: ✅ Installed, running, browser-verified end-to-end
Confirmed:
- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
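
The `/proc/*/cmdline` detection above can be illustrated with a minimal sketch. It is not the agent's actual code: the function names and the `needle` argument are hypothetical, and the only facts relied on are that each `/proc/<pid>/cmdline` file holds NUL-separated arguments and that code-server lives under `/opt/zlh/services/code-server`.

```python
import os


def parse_cmdline(raw: bytes) -> list:
    """Split a raw /proc/<pid>/cmdline buffer into its NUL-separated arguments."""
    return [arg.decode("utf-8", "replace") for arg in raw.split(b"\x00") if arg]


def find_pids(needle: str, proc_root: str = "/proc") -> list:
    """Return sorted PIDs that have at least one argument containing `needle`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                args = parse_cmdline(fh.read())
        except OSError:
            continue  # process exited mid-scan or is not readable
        if any(needle in arg for arg in args):
            pids.append(int(entry))
    return sorted(pids)


# Hypothetical usage: treat code-server as running if any process matches.
# running = bool(find_pids("/opt/zlh/services/code-server"))
```
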
**Outstanding — Memory concern:**
- code-server consumes 1-2GB RAM at idle
- observed 99% memory usage in dev container during active use with AI coding agent
- 8GB container allocation may be insufficient for heavy dev workloads
- need to establish baseline memory usage for code-server alone vs with active coding session
- consider raising default dev container RAM or making it tier-based
- **stress test: k6 IDE session load test** — still outstanding from pre-launch checklist
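
One possible way to capture the code-server memory baseline is to sum the resident set size of its processes from `/proc/<pid>/status`. This is a sketch under assumptions: the helper names are hypothetical, and the list of PIDs has to be supplied by the caller. `VmRSS` is reported in kB by the kernel; summing it across processes over-counts shared pages, so the result is best read as an upper bound.

```python
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def total_rss_mib(pids, proc_root: str = "/proc") -> float:
    """Sum VmRSS over the given PIDs and return the total in MiB."""
    total_kib = 0
    for pid in pids:
        try:
            with open(f"{proc_root}/{pid}/status") as fh:
                total_kib += parse_vm_rss_kib(fh.read()) or 0
        except OSError:
            continue  # process exited between discovery and sampling
    return total_kib / 1024
```
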
---
### Game Server Supervision
Completed:
- crash recovery with backoff: 30s → 60s → 120s
- backoff resets if uptime ≥ 30s
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
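
The recovery policy above can be modeled as a small state machine. This is a sketch, not the agent's implementation: the 30s → 60s → 120s schedule, the 30s uptime reset, and the `error` state come from the list above, while the number of crashes tolerated before `error` (assumed here to be one per backoff step) and the `running` / `restarting` state names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_STEPS_S = (30, 60, 120)          # restart delays, from the list above
STABLE_UPTIME_S = 30                     # uptime that resets the backoff
MAX_CRASHES = len(BACKOFF_STEPS_S)       # assumption: one attempt per step


@dataclass
class Supervisor:
    consecutive_crashes: int = 0
    state: str = "running"               # hypothetical state name

    def on_crash(self, uptime_s: float) -> Optional[int]:
        """Return the restart delay in seconds, or None once in `error`."""
        if uptime_s >= STABLE_UPTIME_S:
            self.consecutive_crashes = 0  # the server was stable; start over
        if self.consecutive_crashes >= MAX_CRASHES:
            self.state = "error"          # give up until an operator intervenes
            return None
        delay = BACKOFF_STEPS_S[self.consecutive_crashes]
        self.consecutive_crashes += 1
        self.state = "restarting"         # hypothetical state name
        return delay
```
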
---
### Agent Future Work (priority order)
1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)
---
## Dev IDE Access
### Browser IDE ✅ Fully Working (browser-verified) — Zero Install
```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```
This is the primary zero-install access method. VS Code in the browser,
no client software required. Browser-verified and production-ready.
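
The host-based step in this flow (mapping `dev-<vmid>.zerolaghub.dev` to a container) might look roughly like the sketch below. It does not reproduce `handleHostedProxy`: the function names are hypothetical, the vmid is assumed to be numeric, and the container IP lookup is left to the caller. Port `6000` is the code-server bind port noted earlier in this file.

```python
import re
from typing import Optional

# Anchored pattern so that extra labels (e.g. "x.dev-1.zerolaghub.dev") are rejected.
_HOST_RE = re.compile(r"dev-(\d+)\.zerolaghub\.dev")


def vmid_from_host(host_header: str) -> Optional[int]:
    """Extract the vmid from a Host header, ignoring case and an optional port."""
    host = host_header.strip().lower()
    if ":" in host:
        host = host.rsplit(":", 1)[0]    # drop ":443" style suffix
    match = _HOST_RE.fullmatch(host)
    return int(match.group(1)) if match else None


def upstream_url(container_ip: str) -> str:
    """Build the proxy target for a container; the IP lookup is out of scope here."""
    return f"http://{container_ip}:6000"
```

Rejecting any host that does not match the full pattern keeps the proxy target-bound: a request can only reach the container whose vmid appears in its own hostname.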
### Remaining Work
- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy` — remove stale path-based assumptions
### Local Dev Access — SSH via CF Tunnel (Power Users Only)
**Important:** CF Tunnel SSH requires `cloudflared` installed on the
developer's local machine. This is not zero-install and is an optional
feature for power users who want local VS Code or terminal access.
The browser IDE remains the zero-install story for all developers.
Current state:
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Cloudflare Zero Trust free plan active
- 🟡 Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard
- 🟡 Bastion SSH proxy jump config not yet done
- 🟡 Dev container SSH server not yet verified
- 🟡 Portal SSH config snippet not yet built
---
## API (zpac-api)
Completed:
- dev provisioning payload
- runtime/version fields
- `enable_code_server` flag
- `GET /api/servers/:id/status` — server status endpoint
- `POST /api/dev/:id/ide-token` — IDE token generation + hosted URL
- `GET /api/dev/:id/ide` — bootstrap route
- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
- hosted flow browser-verified end-to-end
Outstanding:
- **Billing / Stripe integration** — cannot take payments without this, launch blocker
- **Password reset flow** — verify it is fully wired up
- **Usage limits / quota enforcement** — prevent unbounded server creation per account
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart
**Provisioning + DB Consistency (active issue, root cause identified, needs cleanup):**
Root cause confirmed: state inconsistency from Denver → Detroit migration, not a code bug.
- Redis likely has stale/replayed jobs from migration overlap period
- Denver worker may have briefly run alongside Detroit
- DB/Redis migration may have been partial or unsynchronized
- `internal.zlh` DNS lookups add latency and timing risk to the provisioning hot path
Immediate remediation steps (next session):
1. `FLUSHALL` Redis — do NOT restore from backup, start clean
2. Validate DB — check for duplicate VMIDs, stuck "creating" states
3. Confirm only one worker running (Detroit only)
4. If DB inconsistent → restore MariaDB from Denver PBS snapshot
5. Run single controlled provisioning test — verify 1 record, 1 job, 1 agent execution
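
Step 2 (DB validation) could be sketched as two checks over rows already fetched from the database. The field names (`id`, `vmid`, `status`, `created_at`) and the 15-minute threshold for a stuck record are assumptions, not the actual schema.

```python
from collections import Counter
from datetime import datetime, timedelta


def duplicate_vmids(rows) -> list:
    """Return every vmid that appears on more than one record."""
    counts = Counter(row["vmid"] for row in rows)
    return sorted(vmid for vmid, n in counts.items() if n > 1)


def stuck_creating(rows, now: datetime, max_age=timedelta(minutes=15)) -> list:
    """Return ids of records that have sat in "creating" longer than max_age."""
    return [
        row["id"]
        for row in rows
        if row["status"] == "creating" and now - row["created_at"] > max_age
    ]
```
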
Architectural fix:
- Replace `internal.zlh` FQDNs with env-based IPs for all service-to-service calls
- See `SCRATCH/service-discovery.md` for spec
- See `SCRATCH/debug-provisioning-consistency.md` for full audit plan
- See `SCRATCH/session-update-migration-stability.md` for session findings
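
As an illustration of env-based addressing (the actual spec is in `SCRATCH/service-discovery.md` and is not reproduced here), a service address could be read from an environment variable and rejected unless it is a literal IP, so no DNS lookup can enter the hot path. The `ZLH_<NAME>_ADDR` variable naming and the `ip:port` format are assumptions, and this sketch handles IPv4 only.

```python
import ipaddress
import os


def service_address(name: str, env=None):
    """Return (ip, port) for a service from a hypothetical ZLH_<NAME>_ADDR variable."""
    env = os.environ if env is None else env
    key = f"ZLH_{name.upper()}_ADDR"      # hypothetical naming convention
    raw = env.get(key)
    if not raw:
        raise KeyError(f"{key} is not set")
    host, _, port = raw.rpartition(":")
    ipaddress.ip_address(host)            # raises ValueError for hostnames/FQDNs
    return host, int(port)
```
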
---
## Portal (zpac-portal)
Completed:
- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- ✅ Site copy/wording rewrite — landing, features, FAQ, about updated for new platform
- ✅ Pricing page updated — Vanilla/Modded/Heavy tiers, Minecraft only
Outstanding:
- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install)
- **User onboarding flow** — what happens after registration? Guided first-server creation needed
- **Email notifications** — server crashed, provisioning complete, billing receipts
- Remove `testdaemon` binary from repo root (7MB, should not be in repo)
---
## Game Servers
Outstanding:
- **Game server world backup / restore** — player world data backup separate from PBS infrastructure backup. Trust-critical — players losing world data will kill retention.
- **Game server subdomain** — how do players connect? Verify IP vs subdomain (e.g. `mc.zerolaghub.com` style)
---
## Velocity / ZpackVelocityBridge
Outstanding:
- **Server registrations are in-memory only** — lost on Velocity restart. API needs `POST /api/velocity/resync` to re-register all running containers
- **ZpackCommands not registered** — `/zpack status/list/reload` commands exist in code but not wired up
- **Player routing on restart** — existing running containers go unrouted until new provisioning happens or resync is called
- See `SCRATCH/velocity-plugin.md` for full plugin documentation
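
The resync idea can be sketched as a loop that replays every running container into the proxy's registry. Everything here is hypothetical: the `register` callback stands in for whatever call the bridge plugin exposes, and the `name` / `ip` / `port` fields are placeholders for the real record shape.

```python
def resync(running_servers, register) -> list:
    """Re-register every running server; return the names that failed."""
    failed = []
    for srv in running_servers:
        try:
            register(srv["name"], srv["ip"], srv["port"])
        except Exception:
            failed.append(srv["name"])    # keep going so one failure does not block the rest
    return failed


# Hypothetical usage with an in-memory registry that was emptied by a restart.
registry = {}
servers = [{"name": "mc-6070", "ip": "10.0.0.5", "port": 25565}]
failures = resync(servers, lambda name, ip, port: registry.update({name: (ip, port)}))
```
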
---
## Infrastructure
### Hosting — GTHost (Decision: Stay, Most Cost-Effective)
GTHost Detroit is the primary host. Decision made to stay on GTHost long-term:
- Bare metal, instant Proxmox install without workarounds
- Unmetered bandwidth
- Competitive pricing — most cost-effective option evaluated
- DigitalOcean: too expensive, no bare metal/Proxmox
- OVH: more expensive, overkill for current scale
- Hetzner: Proxmox install was painful historically
### DDoS Protection (Resolved — Minimal Attack Surface, Accepted Risk)
Investigation complete. Attack surface is actually very small:
- All infrastructure is internal to Proxmox — not publicly exposed
- Portal, API, admin access all internal or behind Twingate
- Minecraft player traffic is proxied through Velocity VM — individual game containers never directly exposed
- Only the Velocity VM TCP port is public-facing
- Minecraft Java uses TCP, which is harder to flood volumetrically than UDP-based games
- OPNsense rate limiting on Velocity-facing port is sufficient for launch
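- Cloudflare DNS is non-proxied (DNS-only) for game traffic, so the record resolves to the real Velocity IP; the IP is not hidden, and protection relies on the rate limiting above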
**Decision:** Architecture is well-designed — attack surface is minimal. Accept remaining risk at launch. Revisit if attack occurs or revenue supports additional mitigation.

---
## Pre-Launch Checklist
Outstanding before launch:
- **Billing / Stripe integration** — launch blocker, cannot take money
- **Game server world backup / restore** — trust-critical
- **User onboarding flow** — guided first-server creation after register
- **Password reset flow** — verify wired up
- **Usage limits / quota enforcement** — per account
- **Game server subdomain** — verify player connection method
- **Email notifications** — crashed, billing, provisioning
- **Upload testing** — test file upload flow end-to-end in dev containers
- **Billing endpoints** — add back to API
- **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
- **OPNsense audit** — both routers need systematic validation
- **Redis FLUSHALL** — clear stale migration jobs before launch
- **DB consistency check** — verify no duplicate VMIDs or stuck states from migration
- **Single worker verification** — confirm only Detroit API/worker running
- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
- **Service discovery migration** — replace internal.zlh with env-based IPs in API and agent hot paths
- Remove `testdaemon` binary from zpac-portal repo root
---
## Platform
Future work:
- CF Tunnel SSH completion — power user feature, requires client cloudflared install
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery at scale
- **Admin panel** — manage users/servers as operator
- **Referral / dev pipeline reward system** — revenue sharing for developers
- **Uptime history** — visible to users per server
- **DDoS mitigation** — revisit Path.net or similar when revenue supports it
---
## Closed Threads
- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata)
- ✅ Code-server lifecycle endpoints (start/stop/restart)
- ✅ Code-server process detection via /proc scan
- ✅ Dev IDE proxy — path-based browser IDE working end-to-end
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
- ✅ Per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
- ✅ Browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Portal copy rewrite — landing, features, FAQ, about, pricing
- ✅ DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk
- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
- ✅ Migration to Detroit — complete Apr 2, 2026
- ✅ Internal FQDN migration — all services on internal.zlh