Add session summary 2026-03-17 — IDE proxy complete, base-path blocking issue, agent enhancements

jester 2026-03-17 23:07:24 +00:00
parent 525366c5df
commit 393911c443

@ -0,0 +1,132 @@
# 2026-03-17 IDE proxy blocking issue + agent enhancements
## Summary
API proxy layer for dev IDE is complete. Agent has significant new
capabilities. One blocking issue remains before IDE is fully functional.
---
## What Was Completed
### API
- Dev routing removed: `devRouting.js`, `devDePublisher.js` deleted
- All hooks removed from provisioning and container routes
- No DNS records, no Traefik routes, no container exposure
- `GET /api/dev/:id/ide` + `GET /api/dev/:id/ide/*` — proxy to container:6000
- WebSocket: `server.on("upgrade")` wired, `ws: true` enabled
- Path rewrite: `/api/dev/:id/ide/...` → `/...`
- `POST /api/dev/:id/ide-token` — short-lived signed token (300s TTL)
- Token payload: `sub`, `vmid`, `type: "dev-ide"`
- Proxy accepts `Authorization: Bearer` or `?token=`
- WebSocket upgrades validate same token
- `GET /api/servers/:id/status` — reads Redis `agent:<vmid>`, returns agent state
- `proxyClient.js` intentionally untouched — game edge publish path depends on it
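The path rewrite and token handling listed above can be sketched roughly as follows. This is an illustration only, not the API's actual code. The HMAC-based token format, the `exp` field, the `IDE_TOKEN_SECRET` variable, and every function name here are assumptions. Only the 300s TTL, the `sub`/`vmid`/`type` payload fields, and the two accepted token locations come from the notes.

```javascript
const crypto = require("node:crypto");

// Hypothetical secret source; the notes do not say where the signing key lives.
const SECRET = process.env.IDE_TOKEN_SECRET || "dev-only-secret";
const TTL_SECONDS = 300; // TTL from the notes

// Strip the /api/dev/:id/ide prefix so the container sees a root-relative path.
function rewriteIdePath(url) {
  const rewritten = url.replace(/^\/api\/dev\/[^/]+\/ide(?=\/|\?|$)/, "");
  return rewritten.startsWith("/") ? rewritten : "/" + rewritten;
}

// Issue a token: base64url(JSON payload) + "." + HMAC-SHA256 signature.
function signIdeToken(sub, vmid, now = Date.now()) {
  const payload = {
    sub,
    vmid,
    type: "dev-ide",
    exp: Math.floor(now / 1000) + TTL_SECONDS, // assumed expiry field
  };
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = crypto.createHmac("sha256", SECRET).update(body).digest("base64url");
  return `${body}.${sig}`;
}

// Return the payload if the token is authentic, unexpired, and bound to vmid; else null.
function verifyIdeToken(token, vmid, now = Date.now()) {
  const [body, sig] = String(token).split(".");
  if (!body || !sig) return null;
  const expected = crypto.createHmac("sha256", SECRET).update(body).digest();
  const given = Buffer.from(sig, "base64url");
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) {
    return null;
  }
  const payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  if (payload.type !== "dev-ide" || payload.vmid !== vmid) return null;
  if (payload.exp <= Math.floor(now / 1000)) return null;
  return payload;
}

// Accept the token from either the Authorization header or ?token=.
function extractToken(req) {
  const auth = req.headers.authorization || "";
  if (auth.startsWith("Bearer ")) return auth.slice(7);
  return new URL(req.url, "http://localhost").searchParams.get("token");
}
```

Both the HTTP route and the `upgrade` handler could call `extractToken` followed by `verifyIdeToken`, so that WebSocket upgrades are checked the same way as plain requests.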
### Agent
- Full dev container filesystem support, workspace root `/home/dev/workspace`
- Shell runs as `dev` user, `HOME`/`USER`/`LOGNAME`/`TERM` set
- Dev provisioning creates `/home/dev` and `/home/dev/workspace` with correct ownership
- Catalog-driven runtime validation against `devcontainer/_catalog.json`
- All installs now fetch from artifact server — no local artifact assumption
- Dotnet installer fetched from artifact server, installed into runtime path
- code-server pulled from artifact server (tar.gz)
- Installed to `/opt/zlh/services/code-server`
- Lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- Process detection via `/proc/*/cmdline` scan — no longer relies solely on PID file
- Game server crash recovery with backoff: 30s → 60s → 120s
- Backoff resets if process uptime ≥ 30s
- Transitions to `error` state after repeated failures
- Crash observability: time, exit code, signal, uptime, log tail, classification
- Classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
- Structured logging across provisioning, installs, file ops, control plane
- Installer failures now include output tail for debugging
- `/status` now exposes: `runtimeInstalled`, `devProvisioned`, `devReadyAt`,
`workspaceRoot`, `serverRoot`, `codeServerInstalled`, `codeServerRunning`,
`lastCrashClassification`
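The crash-recovery backoff described above (30s → 60s → 120s, reset after ≥ 30s of uptime) might look something like this. It is a sketch under assumptions: the agent's implementation language is not stated in these notes, the `CrashBackoff` class and the `restarting` state name are hypothetical, and "repeated failures" is interpreted here as exhausting all three backoff steps.

```javascript
const BACKOFF_STEPS_MS = [30000, 60000, 120000]; // delays from the notes
const STABLE_UPTIME_MS = 30000; // uptime at or above this resets the backoff

class CrashBackoff {
  constructor() {
    this.failures = 0; // consecutive crashes that were not preceded by a stable run
  }

  // Record a crash and decide what to do next.
  // Returns { state: "restarting", delayMs } or { state: "error", delayMs: null }.
  onCrash(uptimeMs) {
    if (uptimeMs >= STABLE_UPTIME_MS) {
      this.failures = 0; // the process ran long enough to count as healthy
    }
    if (this.failures >= BACKOFF_STEPS_MS.length) {
      return { state: "error", delayMs: null }; // assumed threshold: all steps used
    }
    const delayMs = BACKOFF_STEPS_MS[this.failures];
    this.failures += 1;
    return { state: "restarting", delayMs };
  }
}
```

Keeping the counter separate from the timer makes the reset rule easy to test: a supervisor would call `onCrash(uptimeMs)` on every exit and schedule the restart after `delayMs`.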
---
## Current Blocking Issue
### Symptom
Browser opens IDE URL. Loads partially. Popup: "An unexpected error occurred."
Console:
- `WebSocket close 1006`
- `No file system provider found`
Workspace shows `!` (not mounted). Extension host fails to start.
### What works
- API auth ✅
- Token validation ✅
- Proxy routing to container ✅
- WebSocket upgrade handler ✅
- code-server process running on :6000 ✅
### What fails
- Workbench WebSocket session ❌
- Filesystem provider initialization ❌
- Extension host startup ❌
### Root cause (high confidence)
code-server is running without `--base-path /api/dev/<vmid>/ide`.
It assumes it is served from `/`. All WebSocket connection paths, static
asset paths, and VS Code remote authority resolution are wrong when served
through a proxy subpath without this flag.
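The general mechanism can be illustrated with standard URL resolution: a root-absolute reference leaves the proxy prefix, while a relative one stays under it. The hostname, the vmid `101`, and the `/socket` path are placeholders chosen for the example, not paths code-server is known to request.

```javascript
// Resolve a reference the way a browser would, relative to the proxied IDE page.
function resolveFromIde(ref, vmid) {
  // Hypothetical public URL of the IDE behind the API proxy.
  const pageUrl = `https://portal.example.com/api/dev/${vmid}/ide/`;
  return new URL(ref, pageUrl).pathname;
}

// True when the resolved path would still be routed by the /api/dev/:id/ide proxy.
function staysBehindProxy(ref, vmid) {
  return resolveFromIde(ref, vmid).startsWith(`/api/dev/${vmid}/ide/`);
}

console.log(resolveFromIde("/socket", "101"));  // "/socket"
console.log(resolveFromIde("./socket", "101")); // "/api/dev/101/ide/socket"
```

A request to `/socket` never reaches the `/api/dev/:id/ide/*` route, which is consistent with the WebSocket session failing while the initial page load succeeds.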
---
## Required Fix
Update code-server launch in agent to:
```bash
code-server \
--bind-addr 0.0.0.0:6000 \
--auth none \
--disable-telemetry \
--base-path /api/dev/${vmid}/ide \
/home/dev/workspace
```
The `vmid` must be injected at launch time from the agent's config.
This is a one-line change to the code-server launch script in the agent.
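One way to inject `vmid` at launch time is to build the argument list in code instead of interpolating into a shell string. The sketch below copies the flags verbatim from the command above; whether the installed code-server build accepts each of them is worth confirming with `code-server --help`. The function name is hypothetical, the numeric-only `vmid` check is an assumption, and the agent's actual language is not stated in these notes.

```javascript
// Build the code-server argv for a given vmid.
function buildCodeServerArgs(vmid) {
  // Assumption: vmids are plain integers; reject anything else so the
  // value cannot smuggle extra path segments or flags into the command.
  if (!/^[0-9]+$/.test(String(vmid))) {
    throw new Error(`invalid vmid: ${vmid}`);
  }
  return [
    "--bind-addr", "0.0.0.0:6000",
    "--auth", "none",
    "--disable-telemetry",
    "--base-path", `/api/dev/${vmid}/ide`,
    "/home/dev/workspace",
  ];
}
```

The resulting array can be passed to a process spawner without a shell (for example `child_process.spawn("code-server", buildCodeServerArgs(vmid))` in Node), which avoids quoting issues entirely.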
---
## Architectural State
```
User → Portal → API → Proxy → Container (code-server)
```
Containers never publicly exposed. No Traefik per dev container.
No DNS per dev container. API token is sole auth mechanism.
---
## Next Tasks
1. Fix `--base-path` in agent code-server launch — immediate priority
2. Redeploy container, verify: no popup, no WS 1006, workspace loads, terminal works
3. Validate proxy consistency (HTTP + WS + static assets)
4. Portal "Open IDE" button implementation
5. Container hardening (non-root dev user, restrict /opt)
6. Headscale/Tailscale integration (future)