Pivot dev access — abandon Traefik/DNS, adopt API proxy + Headscale

jester 2026-03-15 22:58:55 +00:00
parent 88db6134f3
commit d128d92d15


@@ -40,9 +40,9 @@ Outstanding:
---
### Code Server Addon
## Code Server Addon
Status: ✅ Install + launch operational inside dev containers
Status: ✅ Installed and running inside dev containers
Confirmed:
@@ -51,62 +51,114 @@ Confirmed:
- process confirmed running inside container
- binds to `0.0.0.0:6000`
- launched from `/opt/zlh/services/code-server`
- API now writes dev Traefik dynamic config during provisioning
- API now uses proxy SSH service account (`zlh`) instead of personal user
Port: `6000`
Routing model:
---
- DNS: Cloudflare + Technitium
- Proxy: Traefik dynamic file written by API during dev provisioning
- Host format currently in use: `dev-<vmid>.zerolaghub.dev`
### Access Model (Updated)
Outstanding:
The previous approach using:
- finalize external browser reachability for code-server through Cloudflare → Traefik → container
- remove manual proxy-file edits from debugging path and ensure generated config is the sole source
- standardize hostname format everywhere (`dev-<vmid>` only)
- add code-server launch link in portal
- remove dynamic Traefik file on dev container deletion
- Cloudflare DNS
- Technitium DNS
- Traefik dynamic config per container
has been **abandoned**.
Reason:
- too many moving pieces
- TLS and proxy complexity
- per-container DNS automation
- unnecessary exposure of internal dev services
---
### Agent Future Work (priority order)
### New Access Strategy
1. Unified structured logging (slog) — Promtail/Loki needs structured fields
2. Dev container /status — provisioningComplete + provisioningError fields
3. Crash recovery with backoff — 30s/60s/120s, max 3 attempts, then error state
4. Graceful shutdown verification — SIGTERM + wait before SIGKILL for Minecraft
5. Agent restart/process reattachment — detect existing process on restart
Dev containers will support **two access paths**.
#### Path 1 — Browser IDE (Primary)
```
Browser
Portal
API proxy
container:6000
```
URL format: `/dev/<vmid>/ide`
Implementation requirements:
- API proxy using `http-proxy-middleware`
- WebSocket support (`ws: true`)
- `server.on('upgrade', proxy.upgrade)`
- code-server launch args: `--base-path /dev/<vmid>/ide --auth none`
Authentication handled by portal JWT.
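
The routing step behind Path 1 can be sketched as below. This is a minimal illustration, not the zlh API implementation: `parseIdePath`, `proxyTarget`, and `codeServerArgs` are hypothetical helper names, and the container address is assumed to be resolved elsewhere. Only the URL format, port `6000`, and the launch flags are taken from these notes; the flags are reproduced as listed and have not been checked against the code-server CLI. The proxying itself (including WebSocket upgrades) would still be done by `http-proxy-middleware`.

```typescript
// Port code-server binds to inside the dev container (from the notes above).
const CODE_SERVER_PORT = 6000;

interface IdeRoute {
  vmid: number;     // container id taken from the URL
  basePath: string; // prefix the IDE is served under: /dev/<vmid>/ide
  rest: string;     // remainder of the path after the prefix
}

// Parse a request pathname of the form /dev/<vmid>/ide[/...].
// Returns null for anything that is not an IDE route.
function parseIdePath(pathname: string): IdeRoute | null {
  const match = /^\/dev\/(\d+)\/ide(\/.*)?$/.exec(pathname);
  if (!match) return null;
  const vmid = Number(match[1]);
  return { vmid, basePath: `/dev/${vmid}/ide`, rest: match[2] ?? "/" };
}

// Build the upstream URL for a container address.
// How the address is looked up is outside this sketch.
function proxyTarget(containerIp: string): string {
  return `http://${containerIp}:${CODE_SERVER_PORT}`;
}

// Launch arguments for code-server, exactly as listed in the notes.
function codeServerArgs(vmid: number): string[] {
  return ["--base-path", `/dev/${vmid}/ide`, "--auth", "none"];
}
```

Keeping the path parsing separate from the proxy middleware means the same function can be reused by the HTTP handler and by the `upgrade` handler, so both reject malformed paths the same way.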
---
## API (zlh-api)
#### Path 2 — Local Dev Access (Advanced Users)
Direct developer access via **Headscale/Tailscale**.
Use cases:
- SSH
- VS Code Remote
- local development tools
Outstanding tasks:
- confirm `zlh-ctl` Headscale server status
- implement Tailscale addon install
- API auth key generation
- portal instructions
Headscale constraints:
- `magic_dns: false`
- no exit nodes
- no DNS takeover
---
## Agent Future Work (priority order)
1. Structured logging (slog) for Loki
2. Dev container provisioningComplete state
3. Crash recovery backoff
4. Graceful shutdown verification
5. Process reattachment on agent restart
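
The crash recovery item (3) was specified in the earlier revision of this list as 30s/60s/120s delays, a maximum of 3 attempts, then an error state. A minimal sketch of that schedule follows. It is written in TypeScript only for illustration; the agent's actual language and the names `BACKOFF_SECONDS`, `RecoveryDecision`, and `nextRecoveryStep` are assumptions, not part of the agent.

```typescript
// Delay before each restart attempt, in seconds (from the earlier notes).
const BACKOFF_SECONDS: readonly number[] = [30, 60, 120];

type RecoveryDecision =
  | { action: "restart"; delaySeconds: number }
  | { action: "error" };

// Decide what to do after a crash, given how many restart attempts
// have already been made since the process was last healthy.
function nextRecoveryStep(attemptsSoFar: number): RecoveryDecision {
  if (!Number.isInteger(attemptsSoFar) || attemptsSoFar < 0) {
    throw new RangeError("attemptsSoFar must be a non-negative integer");
  }
  const delay = BACKOFF_SECONDS[attemptsSoFar];
  // Past the end of the schedule: give up and report an error state.
  if (delay === undefined) return { action: "error" };
  return { action: "restart", delaySeconds: delay };
}
```

The caller would reset `attemptsSoFar` to zero once the process has stayed up long enough to count as healthy; that threshold is not defined in these notes.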
---
## API (zpack-api)
Completed:
- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- dev-only routing hook added during provisioning
- Technitium + Cloudflare dev DNS creation
- remote Traefik dynamic file writing via proxy SSH
- proxy SSH moved to service-user model (`zlh`)
- server status endpoint added so frontend can consume agent state
- frontend status/console availability now update correctly via API polling model
- API status endpoint for frontend state
Outstanding:
- runtime validation endpoint
- dev runtime catalog endpoint for portal
- remove Traefik dynamic config on dev container deletion
- domain / hostname normalization audit
- proxy/TLS generation cleanup so manual edits are no longer needed
- `/dev/:id/ide` proxy route
- websocket upgrade handling
- ownership validation before proxy
- Headscale auth key generation
- dev runtime catalog endpoint
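
The "ownership validation before proxy" item could look roughly like the check below. Everything here is an assumption made for illustration: the claim shape (`sub`), the `ownerId` field, and the choice of status codes are hypothetical and not taken from the API.

```typescript
// Hypothetical shapes; the real JWT claims and container record may differ.
interface PortalClaims { sub: string }
interface DevContainer { vmid: number; ownerId: string }

// Returns the HTTP status the proxy route should respond with
// before any request is forwarded to the container.
function authorizeIdeAccess(
  claims: PortalClaims | null,
  container: DevContainer | undefined,
): 200 | 401 | 404 {
  if (claims === null) return 401;         // no valid portal JWT
  if (container === undefined) return 404; // unknown vmid
  // Answer 404 rather than 403 so other customers' vmids are not revealed.
  if (container.ownerId !== claims.sub) return 404;
  return 200;
}
```

The same check would need to run on the WebSocket `upgrade` path as well, since upgrade requests bypass normal route middleware.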
---
## Portal (zlh-portal)
## Portal (zpack-portal)
Completed:
@@ -114,30 +166,12 @@ Completed:
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- frontend now consumes API-backed status correctly for host/console state
Outstanding:
- runtime list driven from catalog API
- dev port exposure UI
- code-server launch link
- clearer dev readiness states (`installing`, `starting`, `running`, `error`, etc.)
---
## Artifact Server
Completed:
- runtime artifacts hosted
- devcontainer catalog
- runtime archive structure
- code-server compiled release artifact ✅
Outstanding:
- checksum publishing
- artifact metadata support
- "Open IDE" button
- `/dev/<vmid>/ide` page
- Headscale setup instructions
---
@@ -145,12 +179,11 @@ Outstanding:
Active thread:
- complete external dev IDE access path end-to-end
- implement browser IDE proxy
Future work:
- dev port routing
- dev service detection
- Tailscale dev access
- artifact version promotion
- runtime rollback support
@@ -158,22 +191,10 @@ Future work:
## Closed Threads
- ✅ Interactive PTY-backed console (dev + game)
- ✅ WebSocket stability and PTY ownership
- ✅ Customer isolation (API + frontend)
- ✅ Agent update system (versioned, hash-verified)
- ✅ Minecraft player presence (agent-sourced)
- ✅ Game telemetry router separation (`/api/game/*`)
- ✅ Agent Phase 1 mod management endpoints
- ✅ Agent process metrics endpoint
- ✅ Minecraft readiness probe + restart race mitigation
- ✅ Modrinth resolver + full mod lifecycle
- ✅ Direct runtime upload model (no staging, no symlinks)
- ✅ `.zlh_metadata.json` provenance tracking
- ✅ Raw `http.request` streaming in API upload proxy
- ✅ Filesystem architecture docs consolidated
- ✅ Upload transport timeout tuning
- ✅ Dev container filesystem support (container-aware, /workspace root)
- ✅ Code-server artifact fix — compiled release on zlh-artifacts
- ✅ Dev routing hook added to provisioning without changing game publish flow
- ✅ API status endpoint added for frontend agent-state consumption
- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption