Open Threads zlh-grind

This file tracks active but unfinished work.

Keep it short.


Agent (zlh-agent)

Dev Runtime System

Completed:

  • catalog validation implemented
  • runtime installs artifact-backed
  • install guard implemented
  • all installs now fetch from artifact server (no local artifact assumption)

Outstanding:

  • runtime install verification improvements
  • catalog hash validation
  • runtime removal / upgrade handling
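
For the outstanding catalog hash validation item, one possible shape is sketched below in Python, used here only for illustration. It assumes the catalog entry carries a SHA-256 hex digest under a `sha256` key. That field name, and the function names, are hypothetical and not taken from the actual catalog format.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large runtime archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches_catalog(path: Path, catalog_entry: dict) -> bool:
    """Return True only if the downloaded artifact matches the catalog hash.

    catalog_entry["sha256"] is a hypothetical field name.
    """
    expected = str(catalog_entry.get("sha256", "")).strip().lower()
    if not expected:
        # A missing hash is treated as a failure, not silently accepted.
        return False
    return sha256_of_file(path) == expected
```

An install guard could call `artifact_matches_catalog` after the fetch from the artifact server and before unpacking, and refuse the install on a mismatch.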

Dev Environment

Completed:

  • dev user creation
  • workspace root /home/dev/workspace
  • console runs as dev user
  • HOME, USER, LOGNAME, TERM env vars set correctly

Outstanding:

  • PATH normalization
  • shell profile consistency
  • runtime PATH injection
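
A minimal sketch of what PATH normalization plus runtime PATH injection could look like. It assumes injected runtime directories should take precedence and that duplicates and empty entries should be dropped. The function name and the example runtime directory are hypothetical.

```python
PATH_SEP = ":"  # the dev containers are assumed to be Linux


def normalize_path(current: str, runtime_bins: list) -> str:
    """Prepend runtime bin dirs, then drop empty and duplicate entries.

    The first occurrence of each directory wins, so injected runtimes
    shadow anything already present on PATH.
    """
    seen = set()
    ordered = []
    for entry in [*runtime_bins, *current.split(PATH_SEP)]:
        if not entry:
            # An empty PATH entry means "current directory"; drop it.
            continue
        cleaned = entry.rstrip("/") or "/"
        if cleaned in seen:
            continue
        seen.add(cleaned)
        ordered.append(cleaned)
    return PATH_SEP.join(ordered)
```

For example, `normalize_path("/usr/bin:/bin::/usr/bin/", ["/opt/example-runtime/bin"])` yields `/opt/example-runtime/bin:/usr/bin:/bin`. Applying the same function in both the console launch path and the shell profile would be one way to keep the two consistent.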

Code Server Addon

Status: Installed and running

Confirmed:

  • pulled from artifact server (tar.gz)
  • installed to /opt/zlh/services/code-server
  • binds to 0.0.0.0:6000
  • lifecycle endpoints: POST /dev/codeserver/start|stop|restart
  • detection via /proc/*/cmdline scan (no longer relies solely on PID file)
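
The /proc-based detection mentioned above can be sketched roughly as follows. This illustrates the technique and is not the agent's actual code. Matching on the substring `code-server` in any argument is an assumption about how the process is invoked.

```python
from pathlib import Path


def cmdline_is_code_server(raw: bytes) -> bool:
    """Decide whether a raw /proc/<pid>/cmdline blob looks like code-server.

    Arguments in cmdline are NUL-separated.
    """
    args = [a.decode("utf-8", "replace") for a in raw.split(b"\0") if a]
    return any("code-server" in a for a in args)


def find_code_server_pids(proc_root: Path = Path("/proc")) -> list:
    """Scan numeric /proc entries and return PIDs that look like code-server."""
    pids = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            # Process exited between listing and reading, or access was denied.
            continue
        if cmdline_is_code_server(raw):
            pids.append(int(entry.name))
    return sorted(pids)
```

A substring match also catches unrelated processes whose arguments mention code-server (an editor with a code-server log open, for instance), so a stricter check on the first argument may be preferable in practice.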

BLOCKING — next task:

code-server must launch with:

--bind-addr 0.0.0.0:6000
--auth none
--disable-telemetry
--base-path /api/dev/<vmid>/ide
/home/dev/workspace

Without --base-path, WebSocket paths and static assets mismatch through the proxy. Result: IDE loads partially, WS closes with 1006, workspace shows ! (not mounted), extension host fails to start.
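
As an illustration, the launch arguments listed above could be assembled per container like this. The flags are copied verbatim from this file and have not been checked here against the installed code-server's `--help` output. The function name and the `binary` default are hypothetical.

```python
WORKSPACE = "/home/dev/workspace"


def build_code_server_argv(vmid: str, binary: str = "code-server") -> list:
    """Build the argv for code-server so its base path matches the API proxy route."""
    if not vmid or "/" in vmid:
        # A bad vmid would produce a base path the proxy never routes to.
        raise ValueError(f"invalid vmid: {vmid!r}")
    return [
        binary,
        "--bind-addr", "0.0.0.0:6000",
        "--auth", "none",
        "--disable-telemetry",
        "--base-path", f"/api/dev/{vmid}/ide",  # flag as written in this file
        WORKSPACE,
    ]
```

Building the list from the vmid in one place keeps the base path in step with the `/api/dev/:id/ide` route, which is the mismatch described above.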


Game Server Supervision

Completed:

  • crash recovery with backoff: 30s → 60s → 120s
  • backoff resets if uptime ≥ 30s
  • transitions to error state after repeated failures
  • crash observability: time, exit code, signal, uptime, log tail, classification
  • classifications: oom, mod_or_plugin_error, missing_dependency, nonzero_exit, unexpected_exit
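
The backoff rules above can be modeled with a small state holder. This is a sketch under one stated assumption: the error state is entered on the next short-lived crash after the 120s step has been used. The file only says "repeated failures", so that threshold is a guess. Class and state names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_STEPS = (30, 60, 120)  # seconds, as listed above
STABLE_UPTIME = 30             # uptime in seconds that resets the backoff


@dataclass
class CrashSupervisor:
    failures: int = 0
    state: str = "running"

    def on_crash(self, uptime_seconds: float) -> Optional[int]:
        """Return seconds to wait before restarting, or None when giving up."""
        if uptime_seconds >= STABLE_UPTIME:
            # The server ran long enough to count as healthy; start over.
            self.failures = 0
        if self.failures >= len(BACKOFF_STEPS):
            # Assumed threshold: all backoff steps exhausted.
            self.state = "error"
            return None
        delay = BACKOFF_STEPS[self.failures]
        self.failures += 1
        self.state = "backoff"
        return delay
```

With this model, four consecutive crashes at 5s uptime produce delays of 30, 60 and 120 seconds and then the error state, while a crash after 45s of uptime starts again at 30.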

Agent Future Work (priority order)

  1. Fix code-server --base-path launch arg — unblocks IDE (IMMEDIATE)
  2. Structured logging (slog) for Loki
  3. Expose dev container provisioningComplete state in /status
  4. Graceful shutdown verification (SIGTERM + wait for Minecraft)
  5. Process reattachment on agent restart
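
For item 4, a minimal sketch of SIGTERM-then-wait is shown below. It is one way to make graceful shutdown verifiable, not a description of the agent's current behaviour. The 30-second default and the SIGKILL fallback are assumptions.

```python
import signal
import subprocess


def graceful_stop(proc: subprocess.Popen, timeout: float = 30.0) -> tuple:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Returns (exit_code, forced), where forced is True if SIGKILL was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout), False
    except subprocess.TimeoutExpired:
        # The server did not exit in time; world data may not be flushed.
        proc.kill()
        return proc.wait(), True
```

Recording the `forced` flag next to the exit metadata would give a direct signal for whether the Minecraft process actually honoured SIGTERM within the allowed window.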

Dev IDE Access

Browser IDE

Browser → Portal → API (/api/dev/:id/ide) → container:6000

API layer: complete

Agent layer: ⚠️ blocked on --base-path

What is confirmed working:

  • API auth
  • Token flow
  • Proxy routing
  • WebSocket upgrade handler
  • Upstream targeting
  • code-server process running

What is failing:

  • Workbench WebSocket session
  • Filesystem provider initialization
  • Extension host startup

Root cause: code-server launched without --base-path /api/dev/<vmid>/ide

Local Dev Access (Headscale/Tailscale — Future)

Outstanding:

  • confirm zlh-ctl Headscale server status
  • implement Tailscale addon install in agent
  • API auth key generation
  • portal setup instructions

Constraints: magic_dns: false, no exit nodes, no DNS takeover


API (zpack-api)

Completed:

  • dev provisioning payload
  • runtime/version fields
  • enable_code_server flag
  • GET /api/servers/:id/status — server status endpoint
  • POST /api/dev/:id/ide-token — IDE token generation
  • GET /api/dev/:id/ide + GET /api/dev/:id/ide/* — IDE proxy with WebSocket
  • dev routing experiment removed (devRouting.js, devDePublisher.js deleted)

Outstanding:

  • dev runtime catalog endpoint for portal
  • Headscale auth key generation

Portal (zpack-portal)

Completed:

  • dev runtime dropdown
  • dotnet runtime support
  • enable code-server checkbox
  • dev file browser support

Outstanding:

  • "Open IDE" button — calls POST /api/dev/:id/ide-token, opens returned URL in new tab
  • Headscale setup instructions

Platform

Future work:

  • Tailscale dev access
  • artifact version promotion
  • runtime rollback support

Closed Threads

  • PTY console (dev + game)
  • Mod lifecycle
  • Upload pipeline
  • Runtime artifact installs
  • Dev container filesystem model
  • Code-server artifact fix
  • API status endpoint for frontend agent-state consumption
  • Dev IDE proxy implementation (API proxy + token system)
  • Dev DNS/Traefik routing experiment — removed
  • Game server crash recovery with backoff
  • Crash observability (classification, log tail, exit metadata)
  • Code-server lifecycle endpoints (start/stop/restart)
  • Code-server process detection via /proc scan