# 2026-03-17 – IDE proxy blocking issue + agent enhancements

## Summary

API proxy layer for dev IDE is complete. Agent has significant new
capabilities. One blocking issue remains before IDE is fully functional.

---

## What Was Completed

### API

- Dev routing removed: `devRouting.js`, `devDePublisher.js` deleted
- All hooks removed from provisioning and container routes
- No DNS records, no Traefik routes, no container exposure

- `GET /api/dev/:id/ide` + `GET /api/dev/:id/ide/*`: proxy to container:6000
- WebSocket: `server.on("upgrade")` wired, `ws: true` enabled
- Path rewrite: `/api/dev/:id/ide/...` → `/...`

- `POST /api/dev/:id/ide-token`: short-lived signed token (300s TTL)
- Token payload: `sub`, `vmid`, `type: "dev-ide"`
- Proxy accepts `Authorization: Bearer` or `?token=`
- WebSocket upgrades validate same token

- `GET /api/servers/:id/status`: reads Redis `agent:<vmid>`, returns agent state

- `proxyClient.js` intentionally untouched; game edge publish path depends on it

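A minimal sketch of how a short-lived signed token like the one described above could be issued and checked, using only Node's built-in `crypto`. The notes do not say which signing scheme the API uses, so the HMAC-SHA256 format, the `exp` field, and the helper names `issueIdeToken` / `verifyIdeToken` are all assumptions for illustration. Only the 300s TTL and the `sub`, `vmid`, `type: "dev-ide"` payload come from the notes.

```javascript
const crypto = require("node:crypto");

const TTL_SECONDS = 300; // TTL taken from the notes

// Hypothetical helper: HMAC-SHA256 signature over the encoded payload.
function sign(data, secret) {
  return crypto.createHmac("sha256", secret).update(data).digest("base64url");
}

// Issue a token with the documented payload plus an assumed `exp` field.
function issueIdeToken({ sub, vmid }, secret, nowSec = Math.floor(Date.now() / 1000)) {
  const payload = { sub, vmid, type: "dev-ide", exp: nowSec + TTL_SECONDS };
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  return `${body}.${sign(body, secret)}`;
}

// Return the payload if the token is authentic, unexpired, and bound to
// the requested vmid; otherwise return null.
function verifyIdeToken(token, vmid, secret, nowSec = Math.floor(Date.now() / 1000)) {
  const [body, sig] = String(token).split(".");
  if (!body || !sig) return null;

  // Constant-time comparison; lengths must match before timingSafeEqual.
  const expected = Buffer.from(sign(body, secret));
  const given = Buffer.from(sig);
  if (expected.length !== given.length || !crypto.timingSafeEqual(expected, given)) {
    return null;
  }

  let payload;
  try {
    payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  } catch {
    return null;
  }
  if (payload.type !== "dev-ide" || payload.vmid !== vmid || payload.exp <= nowSec) {
    return null;
  }
  return payload;
}
```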
### Agent

- Full dev container filesystem support, workspace root `/home/dev/workspace`
- Shell runs as `dev` user, `HOME`/`USER`/`LOGNAME`/`TERM` set
- Dev provisioning creates `/home/dev` and `/home/dev/workspace` with correct ownership
- Catalog-driven runtime validation against `devcontainer/_catalog.json`
- All installs now fetch from artifact server; no local artifact assumption
- Dotnet installer fetched from artifact server, installed into runtime path

- code-server pulled from artifact server (tar.gz)
- Installed to `/opt/zlh/services/code-server`
- Lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- Process detection via `/proc/*/cmdline` scan; no longer relies solely on PID file

- Game server crash recovery with backoff: 30s → 60s → 120s
- Backoff resets if process uptime ≥ 30s
- Transitions to `error` state after repeated failures
- Crash observability: time, exit code, signal, uptime, log tail, classification
- Classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`

- Structured logging across provisioning, installs, file ops, control plane
- Installer failures now include output tail for debugging

- `/status` now exposes: `runtimeInstalled`, `devProvisioned`, `devReadyAt`,
  `workspaceRoot`, `serverRoot`, `codeServerInstalled`, `codeServerRunning`,
  `lastCrashClassification`

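The crash-recovery backoff above can be sketched as a small pure function. This is an illustration, not the agent's code: the notes do not state the agent's language or how many failures count as "repeated", so the sketch assumes the `error` transition happens once all three delays have been used. The function name and the `restarting` state label are hypothetical; the delays, the 30s uptime reset, and the `error` state come from the notes.

```javascript
const BACKOFF_STEPS_SEC = [30, 60, 120]; // delays from the notes
const STABLE_UPTIME_SEC = 30;            // uptime that resets the backoff

// failures: consecutive crashes recorded before this one.
// uptimeSec: how long the process ran before this crash.
function nextRecoveryStep(failures, uptimeSec) {
  // A run that stayed up long enough is treated as healthy, so start over.
  const prior = uptimeSec >= STABLE_UPTIME_SEC ? 0 : failures;

  // Assumption: once every backoff step has been used, give up.
  if (prior >= BACKOFF_STEPS_SEC.length) {
    return { state: "error", failures: prior, delaySec: null };
  }
  return {
    state: "restarting",
    failures: prior + 1,
    delaySec: BACKOFF_STEPS_SEC[prior],
  };
}
```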
---

## Current Blocking Issue

### Symptom

Browser opens IDE URL. Loads partially. Popup: "An unexpected error occurred."

Console:
- `WebSocket close 1006`
- `No file system provider found`

Workspace shows `!` (not mounted). Extension host fails to start.

### What works

- API auth ✅
- Token validation ✅
- Proxy routing to container ✅
- WebSocket upgrade handler ✅
- code-server process running on :6000 ✅

### What fails

- Workbench WebSocket session ❌
- Filesystem provider initialization ❌
- Extension host startup ❌

### Root cause (high confidence)

code-server is running without `--base-path /api/dev/<vmid>/ide`.

It assumes it is served from `/`. All WebSocket connection paths, static
asset paths, and VS Code remote authority resolution are wrong when served
through a proxy subpath without this flag.

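To make the mismatch concrete, here is a rough sketch of the `/api/dev/:id/ide/...` → `/...` rewrite the proxy performs. After the rewrite, the upstream request carries no trace of the public prefix, which fits the diagnosis that code-server believes it is served from `/`. The function name and regex are hypothetical, not taken from the API code.

```javascript
// Matches the IDE proxy prefix and captures the id segment. The lookahead
// keeps sibling routes such as /api/dev/:id/ide-token from matching.
const IDE_PREFIX = /^\/api\/dev\/([^/]+)\/ide(?=\/|\?|$)/;

// Returns the captured id and the path the container would see,
// or null when the URL is not an IDE proxy route.
function rewriteIdePath(url) {
  const match = IDE_PREFIX.exec(url);
  if (!match) return null;
  const rest = url.slice(match[0].length);
  return {
    vmid: match[1],
    upstreamPath: rest.startsWith("/") ? rest : `/${rest}`,
  };
}
```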
---

## Required Fix

Update code-server launch in agent to:

```bash
code-server \
  --bind-addr 0.0.0.0:6000 \
  --auth none \
  --disable-telemetry \
  --base-path "/api/dev/${vmid}/ide" \
  /home/dev/workspace
```

The `vmid` must be injected at launch time from the agent's config.
This is a one-line change to the code-server launch script in the agent.

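If the agent assembles the launch as an argument array rather than a shell string, the `vmid` injection might look like the sketch below. The flags are copied verbatim from the command above and have not been checked here against code-server's CLI. The function name and the `vmid` validation pattern are assumptions. The resulting array could be handed to something like `child_process.spawn("code-server", args)`, which avoids shell interpolation of the id.

```javascript
// Build the code-server argument list for one dev container.
function codeServerArgs(vmid) {
  // Assumed constraint: ids are simple tokens. Rejecting anything else
  // keeps a malformed id from altering the base path.
  if (!/^[A-Za-z0-9_-]+$/.test(String(vmid))) {
    throw new Error(`invalid vmid: ${vmid}`);
  }
  return [
    "--bind-addr", "0.0.0.0:6000",
    "--auth", "none",
    "--disable-telemetry",
    "--base-path", `/api/dev/${vmid}/ide`,
    "/home/dev/workspace",
  ];
}
```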
---

## Architectural State

```
User → Portal → API → Proxy → Container (code-server)
```

Containers never publicly exposed. No Traefik per dev container.
No DNS per dev container. API token is sole auth mechanism.

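Since the token is the only gate, the proxy has to read it the same way for plain HTTP requests and WebSocket upgrades. A minimal sketch, assuming a Node `IncomingMessage`-shaped request (lowercased `headers`, path-only `url`); the function name and the header-over-query precedence are assumptions. The query fallback matters because the browser WebSocket API cannot set an `Authorization` header.

```javascript
// Pull the token from `Authorization: Bearer <token>` or `?token=<token>`.
// Returns null when neither is present.
function extractToken(req) {
  const header = req.headers["authorization"] || "";
  const match = /^Bearer\s+(\S+)$/i.exec(header);
  if (match) return match[1];

  // req.url is path-only, so URL needs a base; the host is a placeholder.
  const url = new URL(req.url, "http://placeholder.invalid");
  return url.searchParams.get("token");
}
```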
---

## Next Tasks

1. Fix `--base-path` in agent code-server launch (immediate priority)
2. Redeploy container, verify: no popup, no WS 1006, workspace loads, terminal works
3. Validate proxy consistency (HTTP + WS + static assets)
4. Portal "Open IDE" button implementation
5. Container hardening (non-root dev user, restrict /opt)
6. Headscale/Tailscale integration (future)