Update Velocity thread after plugin rehydrate fix validation

jester 2026-04-10 20:44:44 +00:00
parent 9beac4015a
commit 6d6be18a84


@@ -9,295 +9,210 @@ Keep it short.
## Agent (zlh-agent) ## Agent (zlh-agent)
### Dev Runtime System ### Dev Runtime System
Completed: Completed:
- catalog validation implemented - catalog validation implemented
- runtime installs artifact-backed - runtime installs artifact-backed
- install guard implemented - install guard implemented
- all installs now fetch from artifact server (no local artifact assumption) - all installs now fetch from artifact server
Outstanding: Outstanding:
- runtime install verification improvements - runtime install verification improvements
- catalog hash validation - catalog hash validation
- runtime removal / upgrade handling - runtime removal / upgrade handling
- **runtime update process** — in-place runtime version upgrade for dev containers - runtime update process for dev containers
---
### Dev Environment ### Dev Environment
Completed: Completed:
- dev user creation - dev user creation
- workspace root `/home/dev/workspace` - workspace root `/home/dev/workspace`
- console runs as dev user - console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly - `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
Outstanding: Outstanding:
- PATH normalization - PATH normalization
- shell profile consistency - shell profile consistency
- runtime PATH injection - runtime PATH injection
---
### Code Server Addon ### Code Server Addon
Status: Installed, running, browser-verified end-to-end.
Status: ✅ Installed, running, browser-verified end-to-end Outstanding:
- code-server memory baseline
Confirmed: - decide whether default dev RAM should increase or become tier-based
- k6 IDE session load test
- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
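The `/proc/*/cmdline` detection noted above can be sketched roughly as follows. This is a minimal illustration in Python, not the agent's actual code; the function names `cmdline_matches` and `find_pids` and the choice of substring matching are assumptions made for the example.

```python
from pathlib import Path


def cmdline_matches(raw: bytes, needle: str) -> bool:
    """Return True if any NUL-separated argument contains `needle`."""
    args = [a.decode("utf-8", "replace") for a in raw.split(b"\0") if a]
    return any(needle in arg for arg in args)


def find_pids(needle: str, proc_root: str = "/proc") -> list:
    """Scan <proc_root>/<pid>/cmdline and return the sorted matching PIDs."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited mid-scan, or access was denied
        if cmdline_matches(raw, needle):
            pids.append(int(entry.name))
    return sorted(pids)
```

Calling `find_pids("/opt/zlh/services/code-server")` would then report whether a process started from the install path is running. Matching on the install path rather than the bare binary name is one way to reduce false positives from unrelated processes.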
**Outstanding — Memory concern:**
- code-server consumes 1-2GB RAM at idle
- observed 99% memory usage in dev container during active use with AI coding agent
- 8GB container allocation may be insufficient for heavy dev workloads
- need to establish baseline memory usage for code-server alone vs with active coding session
- consider raising default dev container RAM or making it tier-based
- **stress test: k6 IDE session load test** — still outstanding from pre-launch checklist
---
### Game Server Supervision ### Game Server Supervision
Completed: Completed:
- crash recovery with backoff
- crash observability and classification
- crash recovery with backoff: 30s → 60s → 120s ### Fabric Readiness Gating
- backoff resets if uptime ≥ 30s Status: startup rehydrate path fixed in plugin and happy-path validated live.
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
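The backoff rules listed in the old revision (30s → 60s → 120s, reset when uptime ≥ 30s, `error` after repeated failures) can be sketched as a small state holder. This is an illustrative Python sketch only; the class name `CrashSupervisor` and the threshold of three consecutive failures are assumptions, since the notes do not say how many failures count as "repeated".

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_STEPS = (30, 60, 120)  # restart delays in seconds, from the notes
STABLE_UPTIME = 30             # uptime in seconds that resets the backoff
MAX_FAILURES = 3               # assumption: the threshold is not stated


@dataclass
class CrashSupervisor:
    failures: int = 0
    state: str = "running"

    def on_exit(self, uptime_s: float) -> Optional[int]:
        """Return the restart delay in seconds, or None once in `error`."""
        if uptime_s >= STABLE_UPTIME:
            self.failures = 0  # the server ran long enough to count as stable
        if self.failures >= MAX_FAILURES:
            self.state = "error"
            return None
        delay = BACKOFF_STEPS[self.failures]
        self.failures += 1
        self.state = "backoff"
        return delay
```

Under these assumptions, three quick crashes in a row yield delays of 30, 60, and 120 seconds, and a fourth moves the server to `error`. A crash after at least 30 seconds of uptime starts again from the 30-second delay.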
--- Still outstanding:
- negative-path validation: verify a `ready=false` backend is skipped on Velocity startup rehydrate
### Fabric Readiness Gating (Active — agent-level fix needed) - confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/session-stabilization-fabric-findings.md`
Root cause confirmed: Velocity registration happens before the Fabric server is fully ready to accept proxy traffic, causing "proxy starting" errors until Velocity is restarted.
Required fix in agent:
- Do not register with Velocity until readiness probe confirms server is accepting connections
- TCP port check (port 25565) is the most reliable signal — already used by agent probe
- Velocity registration must be gated behind probe success, not just process start
- Pre-seeding mods + FabricProxy config before first run ✅ already done
- See `SCRATCH/session-stabilization-fabric-findings.md` for full findings
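The gating described above, where registration waits for a successful TCP probe instead of process start, might look like the following. This is a hedged Python sketch, not the agent's implementation; `register_when_ready`, its retry parameters, and the injected `register` callback are hypothetical names chosen for illustration.

```python
import socket
import time
from typing import Callable


def tcp_ready(host: str, port: int = 25565, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def register_when_ready(
    host: str,
    port: int,
    register: Callable[[], None],
    probe: Callable[[str, int], bool] = tcp_ready,
    attempts: int = 30,       # assumption: retry budget is not specified
    interval: float = 2.0,    # assumption: seconds between probes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `register` only after the probe succeeds; never on timeout."""
    for attempt in range(attempts):
        if probe(host, port):
            register()
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

The key property is that `register` is never invoked on the failure path, so a backend that never opens its port is never surfaced to Velocity. Passing `probe` and `sleep` as parameters keeps the loop testable without a live server.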
---
### Agent Future Work (priority order)
### Agent Future Work
1. Structured logging (slog) for Loki 1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status` 2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft) 3. Graceful shutdown verification
4. Process reattachment on agent restart 4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH) 5. SSH server install in dev container provisioning pipeline
--- ---
## Dev IDE Access ## Dev IDE Access
### Browser IDE ✅ Fully Working (browser-verified) — Zero Install ### Browser IDE
Status: fully working, browser-verified, zero-install.
```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```
This is the primary zero-install access method. VS Code in the browser,
no client software required. Browser-verified and production ready.
### Remaining Work
Remaining:
- confirm "Open IDE" button in portal uses hosted URL in production path - confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed - reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy` — remove stale path-based assumptions - simplify and harden `devProxy`
### Local Dev Access — SSH via CF Tunnel (Power Users Only)
**Important:** CF Tunnel SSH requires `cloudflared` installed on the
developer's local machine. This is not zero-install and is an optional
feature for power users who want local VS Code or terminal access.
The browser IDE remains the zero-install story for all developers.
### Local Dev Access — SSH via CF Tunnel
Current state: Current state:
- ✅ CF Tunnel created and connected to bastion VM - tunnel created and connected to bastion VM
- ✅ Cloudflare Zero Trust free plan active - Zero Trust free plan active
- 🟡 Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard - SSH hostname mapping not yet configured
- 🟡 Bastion SSH proxy jump config not yet done - bastion SSH proxy jump config not yet done
- 🟡 Dev container SSH server not yet verified - dev container SSH server not yet verified
- 🟡 Portal SSH config snippet not yet built - portal SSH config snippet not yet built
--- ---
## API (zpac-api) ## API (zpack-api)
Completed: Completed:
- dev provisioning payload - dev provisioning payload
- runtime/version fields - runtime/version fields
- enable_code_server flag - enable_code_server flag
- `GET /api/servers/:id/status` — server status endpoint - status endpoint
- `POST /api/dev/:id/ide-token`: IDE token generation + hosted URL - IDE token generation + hosted URL
- `GET /api/dev/:id/ide`: bootstrap route - bootstrap IDE route
- `/__ide/:id/*`: live tunnel proxy (HTTP + WS, target-bound) - live tunnel proxy
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction - host-based routing for hosted IDE
- hosted flow browser-verified end-to-end - hosted flow browser-verified end-to-end
- ✅ Backend lifecycle hardened — idempotent create/delete/start/stop - backend lifecycle hardened
- ✅ Duplicate server creation fixed — was frontend issue, not API - duplicate server creation fixed
- ✅ Console routing corrected — browser → API → agent - console routing corrected
- ✅ Control plane switched to IP-based service communication - control plane switched to IP-based service communication
- Velocity rehydration uses DB + Redis instead of Proxmox live state - Velocity rehydration uses DB + Redis instead of Proxmox live state
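The host-based routing item above relies on extracting the vmid from the `Host` header of a `dev-<vmid>.zerolaghub.dev` request. A minimal Python sketch of that extraction follows; it is not the API's `handleHostedProxy` code, and both the function name `vmid_from_host` and the assumption that vmids are purely numeric are illustrative.

```python
import re
from typing import Optional

# Anchored on both ends so look-alike hosts cannot smuggle in a vmid.
# Assumption: vmids are decimal digits only.
HOST_RE = re.compile(r"^dev-(\d+)\.zerolaghub\.dev$")


def vmid_from_host(host_header: str) -> Optional[int]:
    """Return the vmid encoded in a Host header, or None if it does not match."""
    host = host_header.strip().lower().split(":", 1)[0]  # drop optional port
    match = HOST_RE.match(host)
    return int(match.group(1)) if match else None
```

Anchoring the pattern at both ends matters here: a host such as `dev-6070.zerolaghub.dev.example.com` should not resolve to container 6070.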
Outstanding: Outstanding:
- Billing / Stripe integration
- **Billing / Stripe integration** — cannot take payments without this, launch blocker - password reset flow verification
- **Password reset flow** — verify it is fully wired up - usage limits / quota enforcement
- **Usage limits / quota enforcement** — prevent unbounded server creation per account
- simplify and harden host-native `devProxy` - simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal - dev runtime catalog endpoint for portal
- Headscale auth key generation - Headscale auth key generation
- **Velocity resync endpoint**: `POST /api/velocity/resync` to re-register all running containers after Velocity restart - service discovery migration for remaining hot-path `internal.zlh` references
- **Service discovery migration** — replace remaining internal.zlh FQDNs with env-based IPs in hot paths
--- ---
## Portal (zpac-portal) ## Portal (zpack-portal)
Completed: Completed:
- dev runtime dropdown - dev runtime dropdown
- dotnet runtime support - dotnet runtime support
- enable code-server checkbox - enable code-server checkbox
- dev file browser support - dev file browser support
- ✅ Site copy/wording rewrite — landing, features, FAQ, about updated for new platform - site copy rewrite
- ✅ Pricing page updated — Vanilla/Modded/Heavy tiers, Minecraft only - pricing page updated
Outstanding: Outstanding:
- confirm "Open IDE" button fully uses hosted URL flow - confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install) - SSH config snippet for power users
- **User onboarding flow** — what happens after registration? Guided first-server creation needed - user onboarding flow
- **Email notifications** — server crashed, provisioning complete, billing receipts - email notifications
- Remove `testdaemon` binary from repo root (7MB, should not be in repo) - remove `testdaemon` binary from repo root
--- ---
## Game Servers ## Game Servers
Outstanding: Outstanding:
- game server world backup / restore
- **Game server world backup / restore** — player world data backup separate from PBS infrastructure backup. Trust-critical — players losing world data will kill retention. - game server subdomain / player connection method verification
- **Game server subdomain** — how do players connect? Verify IP vs subdomain (e.g. `mc.zerolaghub.com` style)
--- ---
## Velocity / ZpackVelocityBridge ## Velocity / ZpackVelocityBridge
Completed:
- startup rehydrate now requires `ready == true`
- stale default `zpack-api.internal.zlh` fallback removed
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
- happy-path player routing validated live after restart
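The `ready == true` requirement on startup rehydrate amounts to a strict filter over the backends returned by the rehydrate endpoint. The sketch below is in Python for illustration only; the plugin itself is not shown in these notes, and the record shape (a dict with `name` and `ready` keys) is an assumption.

```python
def select_rehydratable(backends: list) -> list:
    """Keep only backends whose `ready` field is exactly boolean True."""
    # Strict identity check: a missing field, None, or the string "true"
    # must not be treated as ready.
    return [b for b in backends if b.get("ready") is True]
```

This is also the behaviour the pending negative-path validation would exercise: a backend reported with `ready=false` should be absent from the filtered result.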
Outstanding: Outstanding:
- ZpackCommands not registered (`/zpack status`, `/zpack list`, `/zpack reload`)
- **Server registrations are in-memory only** — lost on Velocity restart. API needs `POST /api/velocity/resync` to re-register all running containers - negative-path readiness validation still pending for a `ready=false` backend
- **ZpackCommands not registered**: `/zpack status/list/reload` commands exist in code but not wired up - see `SCRATCH/velocity-plugin.md`
- **Player routing on restart** — existing running containers go unrouted until new provisioning happens or resync is called
- See `SCRATCH/velocity-plugin.md` for full plugin documentation
---
## Infrastructure
### Hosting — GTHost (Decision: Stay, Most Cost-Effective)
GTHost Detroit is the primary host. Decision made to stay on GTHost long-term:
- Bare metal, instant Proxmox install without workarounds
- Unmetered bandwidth
- Competitive pricing — most cost-effective option evaluated
- Digital Ocean: too expensive, no bare metal/Proxmox
- OVH: more expensive, overkill for current scale
- Hetzner: Proxmox install was painful historically
### DDoS Protection (Resolved — Minimal Attack Surface, Accepted Risk)
Investigation complete. Attack surface is actually very small:
- All infrastructure is internal to Proxmox — not publicly exposed
- Portal, API, admin access all internal or behind Twingate
- Minecraft player traffic is proxied through Velocity VM — individual game containers never directly exposed
- Only the Velocity VM TCP port is public-facing
- Minecraft Java uses TCP — harder to volumetric flood than UDP games
- OPNsense rate limiting on Velocity-facing port is sufficient for launch
- Cloudflare DNS (non-proxied) hides real IP from casual attackers
**Decision:** Architecture is well-designed — attack surface is minimal. Accept remaining risk at launch. Revisit if attack occurs or revenue supports additional mitigation.
--- ---
## Pre-Launch Checklist ## Pre-Launch Checklist
Outstanding before launch: Outstanding before launch:
- Billing / Stripe integration
- **Billing / Stripe integration** — launch blocker, cannot take money - game server world backup / restore
- **Game server world backup / restore** — trust-critical - user onboarding flow
- **User onboarding flow** — guided first-server creation after register - password reset flow verification
- **Password reset flow** — verify wired up - usage limits / quota enforcement
- **Usage limits / quota enforcement** — per account - game server subdomain verification
- **Game server subdomain** — verify player connection method - email notifications
- **Email notifications** — crashed, billing, provisioning - upload testing
- **Upload testing** — test file upload flow end-to-end in dev containers - billing endpoints
- **Billing endpoints** — add back to API - stress testing: k6 IDE + Minecraft bot + code-server memory baseline
- **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline** - OPNsense audit
- **OPNsense audit** — both routers need systematic validation - Fabric readiness gating full validation
- **Fabric readiness gating** — agent must gate Velocity registration behind TCP probe success - service discovery migration
- **Service discovery migration** — replace remaining internal.zlh with env-based IPs in hot paths - provisioning validation
- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution - remove `testdaemon` from `zpack-portal`
- Remove `testdaemon` binary from zpac-portal repo root
--- ---
## Platform ## Platform
Future work: Future work:
- CF Tunnel SSH completion
- CF Tunnel SSH completion — power user feature, requires client cloudflared install
- artifact version promotion - artifact version promotion
- runtime rollback support - runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery at scale - Cloudflare R2 for large artifact/mod file delivery
- **Admin panel** — manage users/servers as operator - admin panel
- **Referral / dev pipeline reward system** — revenue sharing for developers - referral / dev pipeline reward system
- **Uptime history** — visible to users per server - uptime history
- **DDoS mitigation** — revisit Path.net or similar when revenue supports it - revisit DDoS mitigation later if needed
--- ---
## Closed Threads ## Closed Threads
- ✅ PTY console (dev + game) - PTY console (dev + game)
- ✅ Mod lifecycle - mod lifecycle
- ✅ Upload pipeline - upload pipeline
- ✅ Runtime artifact installs - runtime artifact installs
- ✅ Dev container filesystem model - dev container filesystem model
- ✅ Code-server artifact fix - code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption - API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff - game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata) - crash observability
- ✅ Code-server lifecycle endpoints (start/stop/restart) - code-server lifecycle endpoints
- ✅ Code-server process detection via /proc scan - code-server process detection
- ✅ Dev IDE proxy — path-based browser IDE working end-to-end - dev IDE proxy
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified - hosted wildcard Traefik → API → container IDE flow
- ✅ Per-container dev IDE edge publish/unpublish removed from API - per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01 - wildcard TLS cert `*.zerolaghub.dev`
- ✅ Browser IDE fully loading at dev-<vmid>.zerolaghub.dev - browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
- ✅ CF Tunnel created and connected to bastion VM - CF Tunnel created and connected to bastion VM
- ✅ Portal copy rewrite — landing, features, FAQ, about, pricing - portal copy rewrite
- ✅ DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk - DDoS investigation
- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option - hosting provider decision: GTHost Detroit
- ✅ Migration to Detroit — complete Apr 2, 2026 - migration to Detroit complete
- ✅ Internal FQDN migration — all services on internal.zlh - system stabilization after migration
- ✅ System stabilization — duplicate creation, DB/Redis drift, console routing all resolved - IP-based control plane
- ✅ Provisioning + DB consistency — root cause was migration state pollution, now clean - Velocity startup rehydrate fixed and validated on happy path
- ✅ IP-based control plane — internal.zlh removed from service-to-service hot paths