# 2026-03-17 IDE proxy blocking issue + agent enhancements
## Summary
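The API proxy layer for the dev IDE is complete. The agent gained dev container
filesystem support, code-server lifecycle management, game server crash recovery,
and structured logging. One blocking issue remains before the IDE is fully functional.

---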
## What Was Completed
### API
- Dev routing removed: `devRouting.js`, `devDePublisher.js` deleted
- All hooks removed from provisioning and container routes
- No DNS records, no Traefik routes, no container exposure
- `GET /api/dev/:id/ide` + `GET /api/dev/:id/ide/*` — proxy to container:6000
- WebSocket: `server.on("upgrade")` wired, `ws: true` enabled
- Path rewrite: `/api/dev/:id/ide/...` → `/...`
- `POST /api/dev/:id/ide-token` — short-lived signed token (300s TTL)
- Token payload: `sub`, `vmid`, `type: "dev-ide"`
- Proxy accepts `Authorization: Bearer` or `?token=`
- WebSocket upgrades validate same token
- `GET /api/servers/:id/status` — reads Redis `agent:<vmid>`, returns agent state
- `proxyClient.js` intentionally untouched — game edge publish path depends on it
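
The token flow above can be sketched as follows. This is a minimal illustration, not the API's actual code: it swaps in a plain HMAC-SHA256 signature over a JSON payload instead of whatever token library the API uses, and the `exp` field, the function names, and the secret handling are all assumptions. Only the 300s TTL, the `sub`/`vmid`/`type: "dev-ide"` payload, and the two token locations come from the notes.

```javascript
const crypto = require("node:crypto");

const TTL_SECONDS = 300; // TTL from the notes

// Hypothetical signer: payload fields sub/vmid/type come from the notes,
// `exp` is an assumed way of carrying the expiry.
function signIdeToken({ sub, vmid }, secret, nowMs = Date.now()) {
  const payload = { sub, vmid, type: "dev-ide", exp: Math.floor(nowMs / 1000) + TTL_SECONDS };
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = crypto.createHmac("sha256", secret).update(body).digest("base64url");
  return `${body}.${sig}`;
}

// Returns the payload if the token is valid for this vmid, otherwise null.
function verifyIdeToken(token, secret, vmid, nowMs = Date.now()) {
  const [body, sig] = String(token).split(".");
  if (!body || !sig) return null;
  const expected = crypto.createHmac("sha256", secret).update(body).digest();
  const given = Buffer.from(sig, "base64url");
  // timingSafeEqual throws on length mismatch, so check length first.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) return null;
  let payload;
  try {
    payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  } catch {
    return null;
  }
  if (payload.type !== "dev-ide") return null;
  if (String(payload.vmid) !== String(vmid)) return null;
  if (payload.exp <= Math.floor(nowMs / 1000)) return null;
  return payload;
}

// Accepts `Authorization: Bearer <token>` or `?token=<token>`.
// Works for both plain HTTP requests and WebSocket upgrade requests.
function extractToken(req) {
  const auth = req.headers["authorization"] || "";
  if (auth.startsWith("Bearer ")) return auth.slice(7).trim();
  const url = new URL(req.url, "http://placeholder.invalid");
  return url.searchParams.get("token");
}
```

Because the same `extractToken` and `verifyIdeToken` pair can run in both the HTTP route and the `upgrade` handler, the two code paths cannot drift apart in what they accept.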
### Agent
- Full dev container filesystem support, workspace root `/home/dev/workspace`
- Shell runs as `dev` user, `HOME`/`USER`/`LOGNAME`/`TERM` set
- Dev provisioning creates `/home/dev` and `/home/dev/workspace` with correct ownership
- Catalog-driven runtime validation against `devcontainer/_catalog.json`
- All installs now fetch from artifact server — no local artifact assumption
- Dotnet installer fetched from artifact server, installed into runtime path
- code-server pulled from artifact server (tar.gz)
- Installed to `/opt/zlh/services/code-server`
- Lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- Process detection via `/proc/*/cmdline` scan — no longer relies solely on PID file
- Game server crash recovery with backoff: 30s → 60s → 120s
- Backoff resets if process uptime ≥ 30s
- Transitions to `error` state after repeated failures
- Crash observability: time, exit code, signal, uptime, log tail, classification
- Classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
- Structured logging across provisioning, installs, file ops, control plane
- Installer failures now include output tail for debugging
- `/status` now exposes: `runtimeInstalled`, `devProvisioned`, `devReadyAt`,
`workspaceRoot`, `serverRoot`, `codeServerInstalled`, `codeServerRunning`,
`lastCrashClassification`
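
The crash recovery schedule above (30s → 60s → 120s, reset after ≥ 30s of uptime, then `error`) can be expressed as a small pure function. This is a sketch in JavaScript for consistency with the API examples; the agent's real implementation and language are not shown in these notes. The function name, the `restarting` state name, and the choice to enter `error` after the third delay is exhausted are assumptions, since the notes only say "after repeated failures".

```javascript
const BACKOFF_MS = [30_000, 60_000, 120_000]; // schedule from the notes
const STABLE_UPTIME_MS = 30_000;              // uptime that resets the backoff

// failures: consecutive crashes recorded before this one.
// uptimeMs: how long the process ran before this crash.
function nextRecoveryStep(failures, uptimeMs) {
  // A run that lasted long enough counts as healthy, so this crash is "first" again.
  const count = uptimeMs >= STABLE_UPTIME_MS ? 1 : failures + 1;
  if (count > BACKOFF_MS.length) {
    // Assumed threshold: give up once every delay in the schedule has been used.
    return { state: "error", failures: count, delayMs: null };
  }
  return { state: "restarting", failures: count, delayMs: BACKOFF_MS[count - 1] };
}

// Example: three quick crashes in a row, then a fourth.
let failures = 0;
for (const uptimeMs of [2_000, 5_000, 1_000, 3_000]) {
  const step = nextRecoveryStep(failures, uptimeMs);
  failures = step.failures;
  console.log(step.state, step.delayMs);
}
```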
---
## Current Blocking Issue
### Symptom
Browser opens IDE URL. Loads partially. Popup: "An unexpected error occurred."
Console:
- `WebSocket close 1006`
- `No file system provider found`
Workspace shows `!` (not mounted). Extension host fails to start.
### What works
- API auth ✅
- Token validation ✅
- Proxy routing to container ✅
- WebSocket upgrade handler ✅
- code-server process running on :6000 ✅
### What fails
- Workbench WebSocket session ❌
- Filesystem provider initialization ❌
- Extension host startup ❌
### Root cause (high confidence)
code-server is running without `--base-path /api/dev/<vmid>/ide`.
It assumes it is served from `/`. All WebSocket connection paths, static
asset paths, and VS Code remote authority resolution are wrong when served
through a proxy subpath without this flag.
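
To make the failure mode concrete, the sketch below models the proxy's path rewrite as a pure function. The function name and regular expression are illustrative, not the API's code. A request that keeps the `/api/dev/:id/ide` prefix is forwarded with the prefix stripped, while a root-absolute URL such as `/static/app.js` never matches the proxy route at all. Whether code-server actually emits such root-absolute URLs is the hypothesis stated above, not something this sketch proves.

```javascript
// Matches /api/dev/<id>/ide only when followed by "/", "?" or end of string,
// so /api/dev/<id>/ide-token is not mistaken for an IDE request.
const IDE_PREFIX = /^\/api\/dev\/([^/]+)\/ide(?=\/|\?|$)/;

// Returns { id, path } for IDE requests, or null when the URL is outside the proxy.
function rewriteIdePath(url) {
  const m = IDE_PREFIX.exec(url);
  if (!m) return null;
  const rest = url.slice(m[0].length);
  return { id: m[1], path: rest.startsWith("/") ? rest : "/" + rest };
}

// Prefixed request: reaches the container as /static/app.js
console.log(rewriteIdePath("/api/dev/101/ide/static/app.js"));
// Root-absolute request: no match, so it never reaches the container
console.log(rewriteIdePath("/static/app.js"));
```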

---
## Required Fix
Update code-server launch in agent to:
```bash
# vmid is supplied by the agent's config at launch time
code-server \
  --bind-addr 0.0.0.0:6000 \
  --auth none \
  --disable-telemetry \
  --base-path "/api/dev/${vmid}/ide" \
  /home/dev/workspace
The `vmid` must be injected at launch time from the agent's config.
This is a one-line change to the code-server launch script in the agent.
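
One way to inject the `vmid` is to build the argument list in code rather than interpolating it into a shell string. The sketch below is in JavaScript for consistency with the API examples; the agent's language is not stated in these notes, and `buildCodeServerArgs` is a hypothetical helper. It assumes `vmid` is a numeric ID, and it copies the `--base-path` flag from the command above, so it is worth confirming that flag against `code-server --help` for the installed version. The resulting array can be passed to something like Node's `child_process.spawn("code-server", args)`, which avoids shell quoting issues.

```javascript
// Hypothetical helper: returns the argv for the code-server launch described above.
function buildCodeServerArgs(vmid) {
  // Assumption: vmid is numeric. Rejecting anything else keeps odd values
  // (slashes, "..", spaces) out of the proxy subpath.
  if (!/^\d+$/.test(String(vmid))) {
    throw new Error(`invalid vmid: ${vmid}`);
  }
  return [
    "--bind-addr", "0.0.0.0:6000",
    "--auth", "none",
    "--disable-telemetry",
    "--base-path", `/api/dev/${vmid}/ide`, // flag as given in the notes
    "/home/dev/workspace",
  ];
}

console.log(buildCodeServerArgs(101).join(" "));
```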

---
## Architectural State
```
User → Portal → API → Proxy → Container (code-server)
```
Containers never publicly exposed. No Traefik per dev container.
No DNS per dev container. API token is sole auth mechanism.

---
## Next Tasks
1. Fix `--base-path` in agent code-server launch — immediate priority
2. Redeploy container, verify: no popup, no WS 1006, workspace loads, terminal works
3. Validate proxy consistency (HTTP + WS + static assets)
4. Portal "Open IDE" button implementation
5. Container hardening (non-root dev user, restrict /opt)
6. Headscale/Tailscale integration (future)