Pivot dev access — abandon Traefik/DNS, adopt API proxy + Headscale

jester 2026-03-15 22:58:55 +00:00
parent 88db6134f3
commit d128d92d15


@@ -40,9 +40,9 @@ Outstanding:
---
### Code Server Addon ## Code Server Addon
Status: ✅ Install + launch operational inside dev containers Status: ✅ Installed and running inside dev containers
Confirmed:
@@ -51,62 +51,114 @@ Confirmed:
- process confirmed running inside container
- binds to `0.0.0.0:6000`
- launched from `/opt/zlh/services/code-server`
- API now writes dev Traefik dynamic config during provisioning
- API now uses proxy SSH service account (`zlh`) instead of personal user
Port: `6000`
Routing model: ---
- DNS: Cloudflare + Technitium ### Access Model (Updated)
- Proxy: Traefik dynamic file written by API during dev provisioning
- Host format currently in use: `dev-<vmid>.zerolaghub.dev`
Outstanding: The previous approach using:
- finalize external browser reachability for code-server through Cloudflare → Traefik → container - Cloudflare DNS
- remove manual proxy-file edits from debugging path and ensure generated config is the sole source - Technitium DNS
- standardize hostname format everywhere (`dev-<vmid>` only) - Traefik dynamic config per container
- add code-server launch link in portal
- remove dynamic Traefik file on dev container deletion has been **abandoned**.
Reason:
- too many moving pieces
- TLS and proxy complexity
- per-container DNS automation
- unnecessary exposure of internal dev services
---
### Agent Future Work (priority order) ### New Access Strategy
1. Unified structured logging (slog) — Promtail/Loki needs structured fields Dev containers will support **two access paths**.
2. Dev container /status — provisioningComplete + provisioningError fields
3. Crash recovery with backoff — 30s/60s/120s, max 3 attempts, then error state #### Path 1 — Browser IDE (Primary)
4. Graceful shutdown verification — SIGTERM + wait before SIGKILL for Minecraft
5. Agent restart/process reattachment — detect existing process on restart ```
Browser
Portal
API proxy
container:6000
```
URL format: `/dev/<vmid>/ide`
Implementation requirements:
- API proxy using `http-proxy-middleware`
- WebSocket support (`ws: true`)
- `server.on('upgrade', proxy.upgrade)`
- code-server launch args: `--base-path /dev/<vmid>/ide --auth none`
Authentication handled by portal JWT.
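The routing side of Path 1 can be sketched with a few pure helpers. This is a minimal sketch, not the zpack-api implementation: the function names, the numeric-`vmid` assumption, and the idea that the API reaches the container by address are all hypothetical. The launch arguments are copied verbatim from the requirements above and have not been checked against the installed code-server build.

```javascript
// Hypothetical helpers for the /dev/<vmid>/ide route; names are illustrative only.

// Assumption: vmid is numeric. Matches /dev/<vmid>/ide and anything below it.
const IDE_ROUTE = /^\/dev\/(\d+)\/ide(\/.*)?$/;

// Extract the vmid and the path below the IDE prefix from a request URL.
// Returns null when the URL is not an IDE route.
function parseIdeRoute(url) {
  const path = url.split('?')[0]; // ignore the query string when matching
  const match = IDE_ROUTE.exec(path);
  if (!match) return null;
  return { vmid: match[1], rest: match[2] || '/' };
}

// Upstream target for the proxy. Port 6000 comes from the notes above;
// containerHost is whatever address the API uses to reach the container (assumption).
function ideTarget(containerHost) {
  return `http://${containerHost}:6000`;
}

// Launch arguments exactly as listed in the implementation requirements.
function codeServerArgs(vmid) {
  return ['--base-path', `/dev/${vmid}/ide`, '--auth', 'none'];
}
```

The value returned by `ideTarget` would typically feed the `target` (or `router`) option of `http-proxy-middleware` together with `ws: true`. Because `--auth none` disables code-server's own login, every request and every WebSocket upgrade has to pass the portal JWT check before it reaches the proxy.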
---
## API (zlh-api) #### Path 2 — Local Dev Access (Advanced Users)
Direct developer access via **Headscale/Tailscale**.
Use cases:
- SSH
- VS Code Remote
- local development tools
Outstanding tasks:
- confirm `zlh-ctl` Headscale server status
- implement Tailscale addon install
- API auth key generation
- portal instructions
Headscale constraints:
- `magic_dns: false`
- no exit nodes
- no DNS takeover
---
## Agent Future Work (priority order)
1. Structured logging (slog) for Loki
2. Dev container provisioningComplete state
3. Crash recovery backoff
4. Graceful shutdown verification
5. Process reattachment on agent restart
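For the crash recovery item, the removed side of this hunk gives the schedule as 30s/60s/120s with a maximum of 3 attempts before entering an error state. A minimal sketch of that arithmetic follows. It is written in JavaScript only to stay consistent with the API-side examples, the function and constant names are hypothetical, and it says nothing about how the agent actually implements restarts.

```javascript
// Hypothetical sketch of the 30s/60s/120s restart schedule (max 3 attempts).
const BASE_DELAY_MS = 30000;
const MAX_ATTEMPTS = 3;

// attempt is 1-based. Returns the delay in milliseconds before that restart
// attempt, or null once the attempts are exhausted (caller moves to error state).
function nextRestartDelay(attempt) {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError('attempt must be a positive integer');
  }
  if (attempt > MAX_ATTEMPTS) return null;
  return BASE_DELAY_MS * 2 ** (attempt - 1); // doubles on each attempt
}
```

Returning `null` rather than throwing on the fourth attempt keeps "give up and report an error" as an ordinary branch for the caller.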
---
## API (zpack-api)
Completed:
- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- dev-only routing hook added during provisioning - API status endpoint for frontend state
- Technitium + Cloudflare dev DNS creation
- remote Traefik dynamic file writing via proxy SSH
- proxy SSH moved to service-user model (`zlh`)
- server status endpoint added so frontend can consume agent state
- frontend status/console availability now update correctly via API polling model
Outstanding:
- runtime validation endpoint - `/dev/:id/ide` proxy route
- dev runtime catalog endpoint for portal - websocket upgrade handling
- remove Traefik dynamic config on dev container deletion - ownership validation before proxy
- domain / hostname normalization audit - Headscale auth key generation
- proxy/TLS generation cleanup so manual edits are no longer needed - dev runtime catalog endpoint
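The "ownership validation before proxy" item could take roughly the following shape. This is a sketch under stated assumptions: `claims.sub` as the portal user id and `container.ownerId` as the owning customer are hypothetical field names, and JWT signature verification is assumed to have happened before this function is called.

```javascript
// Hypothetical ownership check run before a request is handed to the IDE proxy.
// claims:    decoded, already-verified portal JWT payload (assumed shape: { sub })
// container: record looked up by vmid (assumed shape: { ownerId }), or null if unknown
function authorizeIdeRequest(claims, container) {
  if (!claims || typeof claims.sub !== 'string') {
    return { ok: false, status: 401 }; // no valid identity
  }
  if (!container) {
    return { ok: false, status: 404 }; // unknown vmid
  }
  if (container.ownerId !== claims.sub) {
    return { ok: false, status: 403 }; // authenticated, but not the owner
  }
  return { ok: true, status: 200 };
}
```

The same check needs to be called from the `upgrade` handler as well as the HTTP route, since WebSocket upgrade events on the Node HTTP server do not pass through the Express middleware chain.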
---
## Portal (zlh-portal) ## Portal (zpack-portal)
Completed:
@@ -114,30 +166,12 @@ Completed:
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- frontend now consumes API-backed status correctly for host/console state
Outstanding:
- runtime list driven from catalog API - "Open IDE" button
- dev port exposure UI - `/dev/<vmid>/ide` page
- code-server launch link - Headscale setup instructions
- clearer dev readiness states (`installing`, `starting`, `running`, `error`, etc.)
---
## Artifact Server
Completed:
- runtime artifacts hosted
- devcontainer catalog
- runtime archive structure
- code-server compiled release artifact ✅
Outstanding:
- checksum publishing
- artifact metadata support
---
@@ -145,12 +179,11 @@ Outstanding:
Active thread:
- complete external dev IDE access path end-to-end - implement browser IDE proxy
Future work:
- dev port routing - Tailscale dev access
- dev service detection
- artifact version promotion
- runtime rollback support
@@ -158,22 +191,10 @@ Future work:
## Closed Threads
- ✅ Interactive PTY-backed console (dev + game) - ✅ PTY console (dev + game)
- ✅ WebSocket stability and PTY ownership - ✅ Mod lifecycle
- ✅ Customer isolation (API + frontend) - ✅ Upload pipeline
- ✅ Agent update system (versioned, hash-verified) - ✅ Runtime artifact installs
- ✅ Minecraft player presence (agent-sourced) - ✅ Dev container filesystem model
- ✅ Game telemetry router separation (`/api/game/*`) - ✅ Code-server artifact fix
- ✅ Agent Phase 1 mod management endpoints - ✅ API status endpoint for frontend agent-state consumption
- ✅ Agent process metrics endpoint
- ✅ Minecraft readiness probe + restart race mitigation
- ✅ Modrinth resolver + full mod lifecycle
- ✅ Direct runtime upload model (no staging, no symlinks)
- ✅ `.zlh_metadata.json` provenance tracking
- ✅ Raw `http.request` streaming in API upload proxy
- ✅ Filesystem architecture docs consolidated
- ✅ Upload transport timeout tuning
- ✅ Dev container filesystem support (container-aware, /workspace root)
- ✅ Code-server artifact fix — compiled release on zlh-artifacts
- ✅ Dev routing hook added to provisioning without changing game publish flow
- ✅ API status endpoint added for frontend agent-state consumption