# Open Threads (zlh-grind)
This file tracks active but unfinished work.
Keep it short.

---
## Agent (zlh-agent)
### Dev Runtime System
Completed:
- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server (no local artifact assumption)
Outstanding:
- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
---
### Dev Environment
Completed:
- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
Outstanding:
- PATH normalization
- shell profile consistency
- runtime PATH injection
---
### Code Server Addon
Status: ✅ Installed, running, browser-verified end-to-end
Confirmed:
- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
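
A minimal sketch of how detection via a `/proc/*/cmdline` scan can work, written in Go. The function names and the choice to match on the install path are assumptions for illustration, not the agent's actual code. `procRoot` is a parameter so the scan can be pointed at a fixture directory.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// matchesCmdline reports whether a raw /proc/<pid>/cmdline buffer contains
// needle. Arguments in that file are NUL-separated, so NULs are turned into
// spaces before matching. (Hypothetical helper.)
func matchesCmdline(raw []byte, needle string) bool {
	cmdline := bytes.ReplaceAll(raw, []byte{0}, []byte{' '})
	return bytes.Contains(cmdline, []byte(needle))
}

// findPIDs returns the PIDs under procRoot whose command line contains needle.
func findPIDs(procRoot, needle string) ([]int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory (e.g. "self", "meminfo")
		}
		raw, err := os.ReadFile(filepath.Join(procRoot, e.Name(), "cmdline"))
		if err != nil {
			continue // process exited mid-scan or is not readable
		}
		if matchesCmdline(raw, needle) {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	// Matching on the install path is an assumption about what identifies
	// the code-server process.
	pids, err := findPIDs("/proc", "/opt/zlh/services/code-server")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println("code-server pids:", pids)
}
```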
---
### Game Server Supervision
Completed:
- crash recovery with backoff: 30s → 60s → 120s
- backoff resets if uptime ≥ 30s
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
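
The restart rules above can be sketched in Go as follows. The delays (30s, 60s, 120s) and the 30s stability window come from the list. The type name `restartPolicy` and the rule that the server moves to `error` once all three delays have been used are assumptions, since the notes only say "repeated failures".

```go
package main

import (
	"fmt"
	"time"
)

// Delays and the stability window are taken from the notes above.
var backoffSteps = []time.Duration{30 * time.Second, 60 * time.Second, 120 * time.Second}

const stableUptime = 30 * time.Second

// restartPolicy (hypothetical name) counts consecutive short-lived crashes.
type restartPolicy struct {
	failures int
}

// onExit is called when the supervised process dies. It returns the delay
// before the next restart, or giveUp=true when the server should move to
// the `error` state. Giving up after the last step is an assumption.
func (p *restartPolicy) onExit(uptime time.Duration) (delay time.Duration, giveUp bool) {
	if uptime >= stableUptime {
		p.failures = 0 // ran long enough: start a fresh failure sequence
	}
	if p.failures >= len(backoffSteps) {
		return 0, true
	}
	delay = backoffSteps[p.failures]
	p.failures++
	return delay, false
}

func main() {
	p := &restartPolicy{}
	uptimes := []time.Duration{2 * time.Second, 5 * time.Second, 45 * time.Second, 1 * time.Second}
	for _, uptime := range uptimes {
		delay, giveUp := p.onExit(uptime)
		fmt.Printf("uptime=%v delay=%v giveUp=%v\n", uptime, delay, giveUp)
	}
}
```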
---
### Agent Future Work (priority order)
1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)
---
## Dev IDE Access
### Browser IDE ✅ Fully Working (browser-verified) — Zero Install
```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```
This is the primary zero-install access method: VS Code in the browser,
no client software required. Browser-verified and production-ready.
### Remaining Work
- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy` — remove stale path-based assumptions
### Local Dev Access — SSH via CF Tunnel (Power Users Only)
**Important:** CF Tunnel SSH requires `cloudflared` installed on the
developer's local machine. This is not zero-install and is an optional
feature for power users who want local VS Code or terminal access.
The browser IDE remains the zero-install story for all developers.
See `knowledge-base/network/cf-tunnel-ssh.md` for full detail.
Current state:
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Cloudflare Zero Trust free plan active
- ⏳ Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard
- ⏳ Bastion SSH proxy jump config not yet done
- ⏳ Dev container SSH server not yet verified
- ⏳ Portal SSH config snippet not yet built
---
## API (zpack-api)
Completed:
- dev provisioning payload
- runtime/version fields
- `enable_code_server` flag
- `GET /api/servers/:id/status` — server status endpoint
- `POST /api/dev/:id/ide-token` — IDE token generation + hosted URL
- `GET /api/dev/:id/ide` — bootstrap route
- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
- hosted flow browser-verified end-to-end
Outstanding:
- **Billing endpoints** — need to be added back
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
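
As an illustration of the `Host`-header vmid extraction that `handleHostedProxy` is described as doing, here is a small Go sketch. It assumes the `dev-<vmid>.zerolaghub.dev` hostname pattern shown elsewhere in this file. The function name `vmidFromHost` is hypothetical, and nothing here is a claim about how zpack-api is implemented or what language it uses. Strict parsing like this is one way to avoid the stale path-based assumptions noted under Outstanding.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

const (
	devPrefix = "dev-"
	devSuffix = ".zerolaghub.dev"
)

// vmidFromHost extracts the numeric vmid from a Host header value such as
// "dev-6070.zerolaghub.dev" or "dev-6070.zerolaghub.dev:443".
// It returns ok=false for anything that does not match exactly.
func vmidFromHost(host string) (vmid int, ok bool) {
	// Drop an optional ":port"; SplitHostPort errors when no port is present.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	host = strings.ToLower(host)
	if !strings.HasSuffix(host, devSuffix) {
		return 0, false
	}
	label := strings.TrimSuffix(host, devSuffix)
	if !strings.HasPrefix(label, devPrefix) {
		return 0, false
	}
	digits := strings.TrimPrefix(label, devPrefix)
	if digits == "" {
		return 0, false
	}
	// Only plain digits: rejects signs and extra labels like "1.evil".
	for _, r := range digits {
		if r < '0' || r > '9' {
			return 0, false
		}
	}
	id, err := strconv.Atoi(digits)
	if err != nil {
		return 0, false
	}
	return id, true
}

func main() {
	for _, h := range []string{"dev-6070.zerolaghub.dev", "dev-x.zerolaghub.dev", "example.com"} {
		id, ok := vmidFromHost(h)
		fmt.Printf("host=%q vmid=%d ok=%v\n", h, id, ok)
	}
}
```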
---
## Portal (zpack-portal)
Completed:
- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
Outstanding:
- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install)
- site copy/wording — needs rewriting for public audience
---
## Pre-Launch Checklist
Outstanding before launch:
- **Upload testing** — test file upload flow end-to-end in dev containers
- **Portal copy/wording** — site needs rewriting for public audience
- **Billing endpoints** — add back to API
- **Stress testing** — k6 IDE session load test + Minecraft bot test
  - See `knowledge-base/operations/stress-testing.md`
- **OPNsense audit** — both routers need systematic validation
  - See `knowledge-base/network/opnsense-checklist.md`
- **Dedicated host migration** — evaluate GTHost upgrade (Gold 6152, Detroit)
  - Trial period: $5/day up to 10 days, PBS restore approach
---
## Platform
Future work:
- CF Tunnel SSH completion — power user feature, requires client cloudflared install
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery at scale
---
## Closed Threads
- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata)
- ✅ Code-server lifecycle endpoints (start/stop/restart)
- ✅ Code-server process detection via `/proc` scan
- ✅ Dev IDE proxy — path-based browser IDE working end-to-end
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
- ✅ Per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
- ✅ Browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
- ✅ CF Tunnel created and connected to bastion VM