# Open Threads – zlh-grind

This file tracks active but unfinished work.

Keep it short.

---

## Agent (zlh-agent)

### Dev Runtime System

Completed:

- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server (no local artifact assumption)

Outstanding:

- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
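
A possible shape for the outstanding catalog hash validation, as a minimal sketch only. It assumes each catalog entry carries a SHA-256 hex digest for its artifact; the function names and that catalog field are hypothetical, and Python is used here purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large runtime archives are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches_catalog(path: Path, expected_sha256: str) -> bool:
    """Return True only if the downloaded artifact matches the catalog digest.

    `expected_sha256` is an assumed catalog field (hex string); whitespace and
    case differences are tolerated.
    """
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

An install guard could call `artifact_matches_catalog` after the download and refuse to unpack on a mismatch.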

---

### Dev Environment

Completed:

- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly

Outstanding:

- PATH normalization
- shell profile consistency
- runtime PATH injection
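
One way PATH normalization and runtime PATH injection could fit together, sketched under assumptions: runtime bin directories take priority over the inherited PATH, duplicates and empty entries are dropped, and the example runtime path in the usage note is hypothetical.

```python
def normalize_path(current: str, runtime_bins: list[str]) -> str:
    """Build a PATH with runtime bin dirs first, then the existing entries.

    Empty entries are dropped, trailing slashes are trimmed, and the first
    occurrence of each directory wins so the order stays stable.
    """
    seen = set()
    ordered = []
    for entry in [*runtime_bins, *current.split(":")]:
        if not entry:
            continue  # skip empty segments such as "a::b"
        cleaned = entry.rstrip("/") or "/"
        if cleaned not in seen:
            seen.add(cleaned)
            ordered.append(cleaned)
    return ":".join(ordered)
```

For example, `normalize_path("/usr/bin:/bin::/usr/bin/", ["/opt/zlh/runtimes/node/bin"])` yields `/opt/zlh/runtimes/node/bin:/usr/bin:/bin`. Applying the same function in the console launcher and in the shell profile would be one route to profile consistency.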

---

### Code Server Addon

Status: ✅ Installed, running, browser-verified end-to-end

Confirmed:

- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
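
The `/proc/*/cmdline` detection can be illustrated roughly as follows. This is a sketch, not the agent's code: the function name and the substring match are assumptions, and `proc_root` is a parameter only so the scan can be pointed at a test directory.

```python
from pathlib import Path


def find_pids_by_cmdline(needle: str, proc_root: Path = Path("/proc")) -> list[int]:
    """Return PIDs whose command line contains `needle` in any argument."""
    pids = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are processes
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited mid-scan or is not readable
        # cmdline is a NUL-separated argv; kernel threads have it empty.
        argv = [a.decode("utf-8", "replace") for a in raw.split(b"\0") if a]
        if any(needle in arg for arg in argv):
            pids.append(int(entry.name))
    return sorted(pids)
```

A caller might treat code-server as running when `find_pids_by_cmdline("code-server")` is non-empty.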

---

### Game Server Supervision

Completed:

- crash recovery with backoff: 30s → 60s → 120s
- backoff resets if uptime ≥ 30s
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
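
The backoff rules above can be written down as a small state machine. This is a sketch only: the class and state names other than `error` are hypothetical, and it assumes the `error` transition happens once all three backoff steps are used up, which these notes do not specify.

```python
from dataclasses import dataclass

BACKOFF_STEPS = (30, 60, 120)  # seconds, as listed above
STABLE_UPTIME = 30             # uptime in seconds at which the backoff resets


@dataclass
class CrashSupervisor:
    failures: int = 0
    state: str = "running"

    def on_exit(self, uptime_seconds: float):
        """Return the restart delay in seconds, or None once in `error`."""
        if uptime_seconds >= STABLE_UPTIME:
            self.failures = 0  # the server ran long enough to count as healthy
        if self.failures >= len(BACKOFF_STEPS):
            self.state = "error"  # assumed threshold: all steps exhausted
            return None
        delay = BACKOFF_STEPS[self.failures]
        self.failures += 1
        self.state = "backoff"
        return delay
```

Three quick crashes in a row produce delays of 30, 60, and 120 seconds; a fourth moves the server to `error`, while any run of 30 seconds or more restarts the sequence at 30.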

---

### Agent Future Work (priority order)

1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
4. Process reattachment on agent restart
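
For item 3, the "SIGTERM + wait" pattern might look like the sketch below. The 30-second default and the SIGKILL fallback are assumptions for illustration, not decided behavior.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> bool:
    """Send SIGTERM and wait; fall back to SIGKILL if the process hangs.

    Returns True if the process exited within `timeout` after SIGTERM,
    False if it had to be killed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort; world saves may be incomplete
        proc.wait()
        return False
```

Verification could then amount to logging the return value and the exit code for each shutdown and checking that Minecraft consistently exits before the timeout.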

---

## Dev IDE Access

### Browser IDE ✅ Fully Working (browser-verified)

```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```

Browser-verified: VS Code loads in browser at `dev-6070.zerolaghub.dev/?folder=/home/dev/workspace`
with workspace mounted, extensions panel visible, AI chat panel active.

Verified flow:

1. frontend calls `POST /api/dev/:id/ide-token`
2. API returns `https://dev-<vmid>.zerolaghub.dev/?token=...`
3. browser opens hosted URL
4. Traefik wildcard router forwards to API at `http://10.60.0.245:4000`
5. API validates token, sets the `zlh_dev_ide_token` cookie, redirects to clean host URL
6. subsequent cookie-backed request redirects to `/?folder=/home/dev/workspace`
7. IDE loads fully in browser
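
Steps 5 and 6 of the flow above can be sketched as a pure decision function. This is illustrative only: the function name, the returned dictionary shape, the 401 status, and storing the raw token as the cookie value are all assumptions; only the cookie name, the `token` query parameter, and the workspace path come from these notes.

```python
from urllib.parse import parse_qs, urlsplit

COOKIE_NAME = "zlh_dev_ide_token"
WORKSPACE = "/home/dev/workspace"


def next_step(url: str, cookies: dict, is_valid_token) -> dict:
    """Decide what the API does with one request to a hosted IDE URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    token = query.get("token", [None])[0]
    if token is not None:
        # Step 5: validate, set the cookie, redirect to the clean host URL.
        if not is_valid_token(token):
            return {"action": "reject", "status": 401}
        clean = f"{parts.scheme}://{parts.netloc}/"
        return {"action": "redirect", "location": clean,
                "set_cookie": {COOKIE_NAME: token}}
    if not is_valid_token(cookies.get(COOKIE_NAME, "")):
        return {"action": "reject", "status": 401}
    if parts.path in ("", "/") and "folder" not in query:
        # Step 6: cookie-backed request to the bare host opens the workspace.
        return {"action": "redirect", "location": f"/?folder={WORKSPACE}"}
    return {"action": "proxy"}  # Step 7: hand the request to the container
```

Keeping the decision separate from the HTTP handler makes the token-to-cookie handoff easy to test without a running proxy.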

### Remaining Work

- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy` — remove stale path-based assumptions

### Wildcard Edge (Traefik)

- Traefik on `zlh-zpack-proxy` (10.70.0.242) handles wildcard TLS via DNS challenge
- wildcard cert `*.zerolaghub.dev` issued via Let's Encrypt + Cloudflare DNS-01
- Traefik routes `dev-*.zerolaghub.dev` → API at `http://10.60.0.245:4000`
- `passHostHeader: true` preserves original hostname through to API
- no Caddy, no `:8081`, no per-container DNS/Traefik side effects from API

### Local Dev Access — SSH via CF Tunnel (Next Step)

Decision: Cloudflare Tunnel on bastion VM for SSH access. Free tier covers up to 50 users.

Planned architecture:

```
Developer laptop
  ↓ ssh dev-6070.zerolaghub.dev
Cloudflare edge
  ↓ CF Tunnel (persistent, runs on bastion)
Bastion VM (internal)
  ↓ SSH proxy jump
Dev container (10.100.x.x)
```

Same hostname as browser IDE — different protocol. Cloudflare routes HTTPS to
Traefik and SSH to CF Tunnel separately.

Developer one-time SSH config:

```
Host *.zerolaghub.dev
  ProxyCommand cloudflared access ssh --hostname %h
```

After that `ssh dev-6070.zerolaghub.dev` just works. Portal can surface this
config snippet as a copyable block.

Outstanding:

- Install `cloudflared` on bastion VM
- Create CF Tunnel pointed at bastion SSH port
- Map `*.zerolaghub.dev` SSH through tunnel
- Portal SSH config snippet UI
- Agent: surface SSH hostname in `/status` or via API

---

## API (zpack-api)

Completed:

- dev provisioning payload
- runtime/version fields
- `enable_code_server` flag
- `GET /api/servers/:id/status` — server status endpoint
- `POST /api/dev/:id/ide-token` — IDE token generation + hosted URL
- `GET /api/dev/:id/ide` — bootstrap route (validates token, sets cookie, redirects)
- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
- dev routing experiment removed (`devRouting.js`, `devDePublisher.js` deleted)
- host-based URL generation (`DEV_IDE_HOST_SUFFIX`, `DEV_IDE_RETURN_HOSTED_URL`)
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
- token bootstrap → cookie handoff working under hosted flow
- hosted flow browser-verified end-to-end
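
The host-based pieces above (vmid extraction from the `Host` header and hosted URL generation) could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, `HOST_SUFFIX` stands in for `DEV_IDE_HOST_SUFFIX` and is assumed to include the leading dot, and the real API is not written in Python.

```python
import re
from typing import Optional
from urllib.parse import quote

HOST_SUFFIX = ".zerolaghub.dev"  # assumed value of DEV_IDE_HOST_SUFFIX


def vmid_from_host(host_header: str, suffix: str = HOST_SUFFIX) -> Optional[int]:
    """Extract the vmid from a `dev-<vmid>` host, or None if it does not match."""
    host = host_header.strip().lower().split(":", 1)[0]  # drop any :port
    if not host.endswith(suffix):
        return None
    # Require exactly one label of the form dev-<digits> before the suffix.
    match = re.fullmatch(r"dev-(\d+)", host[: -len(suffix)])
    return int(match.group(1)) if match else None


def hosted_ide_url(vmid: int, token: str, suffix: str = HOST_SUFFIX) -> str:
    """Build the hosted URL returned to the frontend with the token attached."""
    return f"https://dev-{vmid}{suffix}/?token={quote(token, safe='')}"
```

Matching the whole label with `fullmatch` means hosts such as `dev-6070.evil.zerolaghub.dev` are rejected rather than routed to container 6070.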

Outstanding:

- simplify and harden host-native `devProxy` — remove stale path-based assumptions
- dev runtime catalog endpoint for portal
- Headscale auth key generation

---

## Portal (zpack-portal)

Completed:

- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support

Outstanding:

- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for local VS Code / terminal access
- Headscale setup instructions

---

## Pre-Launch Checklist

Outstanding before launch:

- **Upload testing** — test file upload flow end-to-end in dev containers
- **Portal copy/wording** — site needs rewriting for public audience
- **Dedicated host migration** — evaluate GTHost upgrade (Gold 6152, Detroit)
  - Trial period approach: $5/day up to 10 days
  - PBS restore for safe migration validation
  - Two-host split (core vs game/dev) is a longer-term option

---

## Platform

Future work:

- CF Tunnel SSH access (see Local Dev Access above)
- Tailscale dev access (alternative/complement to CF Tunnel)
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery at scale

---

## Closed Threads

- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata)
- ✅ Code-server lifecycle endpoints (start/stop/restart)
- ✅ Code-server process detection via `/proc` scan
- ✅ Dev IDE proxy — path-based browser IDE working end-to-end
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
- ✅ Per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
- ✅ Browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`