Update Velocity thread after plugin rehydrate fix validation

This commit is contained in:
jester 2026-04-10 20:44:44 +00:00
parent 9beac4015a
commit 6d6be18a84

## Agent (zlh-agent)
### Dev Runtime System
Completed:
- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server
Outstanding:
- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- runtime update process for dev containers (in-place runtime version upgrade)
---
### Dev Environment
Completed:
- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
Outstanding:
- PATH normalization
- shell profile consistency
- runtime PATH injection
---
### Code Server Addon
Status: ✅ Installed, running, browser-verified end-to-end
Confirmed:
- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
Outstanding:
- code-server memory baseline (idle vs active coding session; roughly 1-2GB RAM observed at idle and 99% container memory during active use with an AI coding agent, so the 8GB allocation may be insufficient for heavy workloads)
- decide whether default dev RAM should increase or become tier-based
- k6 IDE session load test (still outstanding from the pre-launch checklist)
---
### Game Server Supervision
Completed:
- crash recovery with backoff: 30s → 60s → 120s
- backoff resets if uptime ≥ 30s
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
### Fabric Readiness Gating
Status: startup rehydrate path fixed in plugin and happy-path validated live. (Root cause: Velocity registration happened before the Fabric server was ready to accept proxy traffic; startup rehydrate now requires `ready == true`, and the agent's TCP probe on port 25565 is the readiness signal.)
Still outstanding:
- negative-path validation: verify a `ready=false` backend is skipped on Velocity startup rehydrate
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/session-stabilization-fabric-findings.md`
### Agent Future Work
1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)
---
## Dev IDE Access
### Browser IDE
Status: fully working, browser-verified, zero-install.
```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```
This is the primary access method: VS Code in the browser, no client software required.
Remaining:
- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy`, removing stale path-based assumptions
### Local Dev Access — SSH via CF Tunnel (power users only)
Note: CF Tunnel SSH requires `cloudflared` on the developer's local machine, so it is not zero-install; it is an optional feature for power users who want local VS Code or terminal access. The browser IDE remains the zero-install path for all developers.
Current state:
- tunnel created and connected to bastion VM
- Zero Trust free plan active
- SSH hostname mapping not yet configured
- bastion SSH proxy jump config not yet done
- dev container SSH server not yet verified
- portal SSH config snippet not yet built
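For the portal SSH snippet, one plausible shape is the standard `cloudflared` ProxyCommand pattern. This is a sketch only: the hostname and user below are placeholders, since the tunnel SSH hostname mapping is not yet configured.

```
# ~/.ssh/config — requires cloudflared installed locally
Host zlh-dev
  HostName dev-ssh.zerolaghub.dev   # placeholder: tunnel hostname not yet mapped
  User dev
  ProxyCommand cloudflared access ssh --hostname %h
```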
---
## API (zpack-api)
Completed:
- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- status endpoint (`GET /api/servers/:id/status`)
- IDE token generation + hosted URL (`POST /api/dev/:id/ide-token`)
- bootstrap IDE route (`GET /api/dev/:id/ide`)
- live tunnel proxy (`/__ide/:id/*`, HTTP + WS, target-bound)
- host-based routing for hosted IDE (`handleHostedProxy`, vmid extracted from the `Host` header)
- hosted flow browser-verified end-to-end
- backend lifecycle hardened (idempotent create/delete/start/stop)
- duplicate server creation fixed (was a frontend issue, not API)
- console routing corrected (browser → API → agent)
- control plane switched to IP-based service communication
- Velocity rehydration uses DB + Redis instead of Proxmox live state
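The host-based routing item above can be illustrated with a small Go helper. `vmidFromHost` is a hypothetical name; the real `handleHostedProxy` logic in zpack-api is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// vmidFromHost extracts the container vmid from a hosted-IDE hostname
// of the form dev-<vmid>.zerolaghub.dev, the shape used by the hosted
// proxy flow described in the notes.
func vmidFromHost(host string) (string, bool) {
	// Strip an optional :port before inspecting the name.
	if i := strings.IndexByte(host, ':'); i >= 0 {
		host = host[:i]
	}
	label, _, ok := strings.Cut(host, ".")
	if !ok || !strings.HasPrefix(label, "dev-") {
		return "", false
	}
	vmid := strings.TrimPrefix(label, "dev-")
	if vmid == "" {
		return "", false
	}
	return vmid, true
}

func main() {
	fmt.Println(vmidFromHost("dev-6070.zerolaghub.dev"))
	fmt.Println(vmidFromHost("portal.zerolaghub.dev"))
}
```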
Outstanding:
- Billing / Stripe integration
- password reset flow verification
- usage limits / quota enforcement
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- service discovery migration for remaining hot-path `internal.zlh` references
---
## Portal (zpack-portal)
Completed:
- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- site copy rewrite (landing, features, FAQ, about)
- pricing page updated (Vanilla/Modded/Heavy tiers, Minecraft only)
Outstanding:
- SSH config snippet for power users
- user onboarding flow
- email notifications
- remove `testdaemon` binary from repo root
---
## Game Servers
Outstanding:
- game server world backup / restore (player world data, separate from PBS infrastructure backup; trust-critical for retention)
- game server subdomain / player connection method verification (IP vs subdomain, e.g. `mc.zerolaghub.com`)
---
## Velocity / ZpackVelocityBridge
Completed:
- startup rehydrate now requires `ready == true`
- stale default `zpack-api.internal.zlh` fallback removed
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
- happy-path player routing validated live after restart
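The `ready == true` requirement can be sketched as a filter over rehydrate candidates. `Backend` and `rehydratable` are hypothetical names for illustration; the plugin's actual types are documented in `SCRATCH/velocity-plugin.md`.

```go
package main

import "fmt"

// Backend is a hypothetical shape for what the rehydrate endpoint
// returns; the real payload lives in zpack-api.
type Backend struct {
	Name  string
	Addr  string
	Ready bool
}

// rehydratable keeps only backends that are safe to register on
// Velocity startup, matching the plugin's new ready-flag requirement.
func rehydratable(all []Backend) []Backend {
	out := make([]Backend, 0, len(all))
	for _, b := range all {
		if b.Ready {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	all := []Backend{
		{Name: "mc-101", Addr: "10.0.0.101:25565", Ready: true},
		{Name: "mc-102", Addr: "10.0.0.102:25565", Ready: false}, // still booting: skipped
	}
	for _, b := range rehydratable(all) {
		fmt.Println("registering", b.Name)
	}
}
```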
Outstanding:
- ZpackCommands not registered (`/zpack status`, `/zpack list`, `/zpack reload`)
- negative-path readiness validation still pending for a `ready=false` backend
- see `SCRATCH/velocity-plugin.md`
---
## Pre-Launch Checklist
Outstanding before launch:
- Billing / Stripe integration
- game server world backup / restore
- user onboarding flow
- password reset flow verification
- usage limits / quota enforcement
- game server subdomain verification
- email notifications
- upload testing
- billing endpoints
- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
- OPNsense audit
- Fabric readiness gating full validation
- service discovery migration
- provisioning validation
- remove `testdaemon` from `zpack-portal`
---
## Platform
Future work:
- CF Tunnel SSH completion
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery
- admin panel
- referral / dev pipeline reward system
- uptime history
- revisit DDoS mitigation later if needed
---
## Closed Threads
- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata)
- ✅ Code-server lifecycle endpoints (start/stop/restart)
- ✅ Code-server process detection via /proc scan
- ✅ Dev IDE proxy — path-based browser IDE working end-to-end
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
- ✅ Per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
- ✅ Browser IDE fully loading at dev-<vmid>.zerolaghub.dev
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Portal copy rewrite — landing, features, FAQ, about, pricing
- ✅ DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk
- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
- ✅ Migration to Detroit — complete Apr 2, 2026
- ✅ Internal FQDN migration — all services on internal.zlh
- ✅ System stabilization — duplicate creation, DB/Redis drift, console routing all resolved
- ✅ Provisioning + DB consistency — root cause was migration state pollution, now clean
- ✅ IP-based control plane — internal.zlh removed from service-to-service hot paths
- ✅ Velocity startup rehydrate fixed and validated on happy path