Open Threads zlh-grind

This file tracks active but unfinished work.

Keep it short.


Agent (zlh-agent)

Dev Runtime System

Completed:

  • catalog validation implemented
  • runtime installs artifact-backed
  • install guard implemented
  • all installs now fetch from artifact server (no local artifact assumption)

Outstanding:

  • runtime install verification improvements
  • catalog hash validation
  • runtime removal / upgrade handling
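
Catalog hash validation is still outstanding, so the following is only a minimal sketch of what it could look like, not the agent's code. It assumes each catalog entry carries a SHA-256 hex digest for its artifact; the `sha256` field name and the `verify_artifact` helper are hypothetical, and Python is used purely for illustration.

```python
import hashlib
import hmac


def verify_artifact(data: bytes, catalog_entry: dict) -> bool:
    """Return True when the artifact bytes match the digest in the catalog entry.

    `catalog_entry["sha256"]` is a hypothetical field name, assumed to hold a
    hex digest in either upper or lower case.
    """
    expected = catalog_entry.get("sha256")
    if not expected:
        # A missing digest is treated as a validation failure, not a pass.
        return False
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected.strip().lower())
```

Failing closed on a missing digest keeps an incomplete catalog entry from silently skipping the check.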

Dev Environment

Completed:

  • dev user creation
  • workspace root /home/dev/workspace
  • console runs as dev user
  • HOME, USER, LOGNAME, TERM env vars set correctly

Outstanding:

  • PATH normalization
  • shell profile consistency
  • runtime PATH injection

Code Server Addon

Status: Installed, running, browser-verified end-to-end

Confirmed:

  • pulled from artifact server (tar.gz)
  • installed to /opt/zlh/services/code-server
  • binds to 0.0.0.0:6000
  • lifecycle endpoints: POST /dev/codeserver/start|stop|restart
  • detection via /proc/*/cmdline scan
  • full browser IDE loading confirmed at dev-6070.zerolaghub.dev
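
Detection via a `/proc/*/cmdline` scan can be sketched roughly as below. This is an illustrative Python version, not the agent's implementation; the `find_pids_by_cmdline` name and the `proc_root` parameter (added so the scan can be pointed at a fake tree in tests) are assumptions.

```python
import os


def find_pids_by_cmdline(needle: str, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose command line contains `needle`, sorted ascending."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable; ignore it
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        args = [a.decode("utf-8", "replace") for a in raw.split(b"\0") if a]
        if any(needle in arg for arg in args):
            matches.append(int(entry))
    return sorted(matches)
```

Calling `find_pids_by_cmdline("code-server")` on a Linux host would then return the matching PIDs, or an empty list when the service is not running.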

Game Server Supervision

Completed:

  • crash recovery with backoff: 30s → 60s → 120s
  • backoff resets if uptime ≥ 30s
  • transitions to error state after repeated failures
  • crash observability: time, exit code, signal, uptime, log tail, classification
  • classifications: oom, mod_or_plugin_error, missing_dependency, nonzero_exit, unexpected_exit
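
The backoff rules above can be expressed as a small state machine. The sketch below is illustrative Python, not the agent's code. The number of failures that triggers the error state is not stated here, so `max_failures` is an assumption (defaulting to one failure per backoff step), as are the `RestartPolicy` and `on_crash` names.

```python
from dataclasses import dataclass

BACKOFF_STEPS = (30, 60, 120)  # seconds, from the list above
STABLE_UPTIME = 30             # seconds of uptime that resets the backoff


@dataclass
class RestartPolicy:
    max_failures: int = len(BACKOFF_STEPS)  # assumption: not stated in the source
    failures: int = 0

    def on_crash(self, uptime_s: float):
        """Return the delay before the next restart, or None to enter the error state."""
        if uptime_s >= STABLE_UPTIME:
            self.failures = 0  # the process ran long enough to count as healthy
        self.failures += 1
        if self.failures > self.max_failures:
            return None
        step = min(self.failures, len(BACKOFF_STEPS)) - 1
        return BACKOFF_STEPS[step]
```

With these defaults, three quick crashes yield delays of 30, 60, and 120 seconds, and a fourth returns `None`; a crash after 30 seconds or more of uptime starts again at 30.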

Agent Future Work (priority order)

  1. Structured logging (slog) for Loki
  2. Dev container provisioningComplete state in /status
  3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
  4. Process reattachment on agent restart

Dev IDE Access

Browser IDE Fully Working (browser-verified)

Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000

Browser-verified: VS Code loads in the browser at dev-6070.zerolaghub.dev/?folder=/home/dev/workspace with the workspace mounted, the extensions panel visible, and the AI chat panel active.

Verified flow:

  1. frontend calls POST /api/dev/:id/ide-token
  2. API returns https://dev-<vmid>.zerolaghub.dev/?token=...
  3. browser opens hosted URL
  4. Traefik wildcard router forwards to API at http://10.60.0.245:4000
  5. API validates token, sets zlh_dev_ide_token, redirects to clean host URL
  6. subsequent cookie-backed request redirects to /?folder=/home/dev/workspace
  7. IDE loads fully in browser
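
Steps 5 to 7 amount to a per-request decision on the API side. The sketch below models that decision as a pure function in Python for illustration only; it is not the API's code. The `route_ide_request` name, the `is_valid_token` callable, the relative redirect targets, and the 401 response for a bad or missing token are all assumptions.

```python
from urllib.parse import parse_qs, urlsplit

COOKIE_NAME = "zlh_dev_ide_token"  # cookie name from the flow above
WORKSPACE_QUERY = "folder=/home/dev/workspace"


def route_ide_request(url: str, cookies: dict, is_valid_token) -> tuple:
    """Decide what to do with one request to a hosted dev IDE URL.

    Returns (action, detail) where action is "redirect", "proxy" or "deny".
    `is_valid_token` is a hypothetical callable standing in for the real check.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    token = query.get("token", [None])[0]

    if token is not None:
        if not is_valid_token(token):
            return ("deny", 401)
        # Step 5: store the token in a cookie and drop it from the URL.
        return ("redirect", {"location": "/", "set_cookie": (COOKIE_NAME, token)})

    if not is_valid_token(cookies.get(COOKIE_NAME, "")):
        return ("deny", 401)

    if parts.path == "/" and "folder" not in query:
        # Step 6: the first cookie-backed request lands on the workspace folder.
        return ("redirect", {"location": "/?" + WORKSPACE_QUERY})

    # Step 7: everything else is forwarded to the container.
    return ("proxy", parts.path)
```

Moving the token from the query string into a cookie keeps it out of the address bar and browser history once the redirect completes.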

Remaining Work

  • confirm "Open IDE" button in portal uses hosted URL in production path
  • reduce legacy /__ide/:id compatibility paths once portal button confirmed
  • simplify and harden devProxy — remove stale path-based assumptions

Wildcard Edge (Traefik)

  • Traefik on zlh-zpack-proxy (10.70.0.242) handles wildcard TLS via DNS challenge
  • wildcard cert *.zerolaghub.dev issued via Let's Encrypt + Cloudflare DNS-01
  • Traefik routes dev-*.zerolaghub.dev → API at http://10.60.0.245:4000
  • passHostHeader: true preserves original hostname through to API
  • no Caddy, no :8081, no per-container DNS/Traefik side effects from API

Local Dev Access — SSH via CF Tunnel (Next Step)

Decision: Cloudflare Tunnel on bastion VM for SSH access. Free tier covers up to 50 users.

Planned architecture:

Developer laptop
  ↓ ssh dev-6070.zerolaghub.dev
Cloudflare edge
  ↓ CF Tunnel (persistent, runs on bastion)
Bastion VM (internal)
  ↓ SSH proxy jump
Dev container (10.100.x.x)

Same hostname as browser IDE — different protocol. Cloudflare routes HTTPS to Traefik and SSH to CF Tunnel separately.

Developer one-time SSH config:

Host *.zerolaghub.dev
    ProxyCommand cloudflared access ssh --hostname %h

After that, ssh dev-6070.zerolaghub.dev just works. Portal can surface this config snippet as a copyable block.
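
For the copyable block, the portal would only need to render the two config lines above for the right domain. A minimal sketch in Python, for illustration; the `ssh_config_snippet` helper and its `domain` parameter are hypothetical:

```python
def ssh_config_snippet(domain: str = "zerolaghub.dev") -> str:
    """Render the one-time ~/.ssh/config block for hosts under `domain`."""
    return (
        f"Host *.{domain}\n"
        "    ProxyCommand cloudflared access ssh --hostname %h\n"
    )
```

The developer still needs `cloudflared` installed locally for the `ProxyCommand` to run.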

Outstanding:

  • Install cloudflared on bastion VM
  • Create CF Tunnel pointed at bastion SSH port
  • Map *.zerolaghub.dev SSH through tunnel
  • Portal SSH config snippet UI
  • Agent: surface SSH hostname in /status or via API

API (zpack-api)

Completed:

  • dev provisioning payload
  • runtime/version fields
  • enable_code_server flag
  • GET /api/servers/:id/status — server status endpoint
  • POST /api/dev/:id/ide-token — IDE token generation + hosted URL
  • GET /api/dev/:id/ide — bootstrap route (validates token, sets cookie, redirects)
  • /__ide/:id/* — live tunnel proxy (HTTP + WS, target-bound)
  • dev routing experiment removed (devRouting.js, devDePublisher.js deleted)
  • host-based URL generation (DEV_IDE_HOST_SUFFIX, DEV_IDE_RETURN_HOSTED_URL)
  • handleHostedProxy — host-based routing via Host header vmid extraction
  • token bootstrap → cookie handoff working under hosted flow
  • hosted flow browser-verified end-to-end
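
Host-based routing depends on pulling the vmid out of the Host header. The following Python sketch shows one way to do that strictly; it is not the code in handleHostedProxy. It assumes vmids are numeric (as in dev-6070) and hard-codes a suffix that stands in for DEV_IDE_HOST_SUFFIX, whose real value and format are not shown here.

```python
import re

# Stand-in for DEV_IDE_HOST_SUFFIX; the actual value and format are assumptions.
HOST_SUFFIX = ".zerolaghub.dev"
_HOST_RE = re.compile(r"^dev-(\d+)" + re.escape(HOST_SUFFIX) + r"$")


def vmid_from_host(host_header: str):
    """Return the vmid encoded in a hosted IDE hostname, or None if it does not match."""
    host = host_header.strip().lower()
    host = host.rsplit(":", 1)[0] if ":" in host else host  # drop an optional port
    match = _HOST_RE.match(host)
    return int(match.group(1)) if match else None
```

Anchoring the pattern at both ends rejects lookalike hosts such as `dev-6070.zerolaghub.dev.evil.example`, which matters because the vmid selects the proxy target.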

Outstanding:

  • simplify and harden host-native devProxy — remove stale path-based assumptions
  • dev runtime catalog endpoint for portal
  • Headscale auth key generation

Portal (zpack-portal)

Completed:

  • dev runtime dropdown
  • dotnet runtime support
  • enable code-server checkbox
  • dev file browser support

Outstanding:

  • confirm "Open IDE" button fully uses hosted URL flow
  • SSH config snippet for local VS Code / terminal access
  • Headscale setup instructions

Pre-Launch Checklist

Outstanding before launch:

  • Upload testing — test file upload flow end-to-end in dev containers
  • Portal copy/wording — site needs rewriting for public audience
  • Dedicated host migration — evaluate GTHost upgrade (Gold 6152, Detroit)
    • Trial period approach: $5/day up to 10 days
    • PBS restore for safe migration validation
    • Two-host split (core vs game/dev) is a longer-term option

Platform

Future work:

  • CF Tunnel SSH access (see Local Dev Access above)
  • Tailscale dev access (alternative/complement to CF Tunnel)
  • artifact version promotion
  • runtime rollback support
  • Cloudflare R2 for large artifact/mod file delivery at scale

Closed Threads

  • PTY console (dev + game)
  • Mod lifecycle
  • Upload pipeline
  • Runtime artifact installs
  • Dev container filesystem model
  • Code-server artifact fix
  • API status endpoint for frontend agent-state consumption
  • Game server crash recovery with backoff
  • Crash observability (classification, log tail, exit metadata)
  • Code-server lifecycle endpoints (start/stop/restart)
  • Code-server process detection via /proc scan
  • Dev IDE proxy — path-based browser IDE working end-to-end
  • Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
  • Per-container dev IDE edge publish/unpublish removed from API
  • Wildcard TLS cert *.zerolaghub.dev via Let's Encrypt + Cloudflare DNS-01
  • Browser IDE fully loading at dev-<vmid>.zerolaghub.dev