Update Velocity thread after plugin rehydrate fix validation
parent 9beac4015a
commit 6d6be18a84
299 OPEN_THREADS.md
@@ -9,295 +9,210 @@ Keep it short.
## Agent (zlh-agent)

### Dev Runtime System

Completed:

- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server (no local artifact assumption)
- all installs now fetch from artifact server

Outstanding:

- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- **runtime update process** — in-place runtime version upgrade for dev containers

---
- runtime update process for dev containers

### Dev Environment

Completed:

- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly

Outstanding:

- PATH normalization
- shell profile consistency
- runtime PATH injection

---

### Code Server Addon
Status: Installed, running, browser-verified end-to-end.

Status: ✅ Installed, running, browser-verified end-to-end

Confirmed:

- pulled from artifact server (tar.gz)
- installed to `/opt/zlh/services/code-server`
- binds to `0.0.0.0:6000`
- lifecycle endpoints: `POST /dev/codeserver/start|stop|restart`
- detection via `/proc/*/cmdline` scan
- full browser IDE loading confirmed at `dev-6070.zerolaghub.dev`
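
The `/proc/*/cmdline` detection listed above can be sketched roughly as follows. This is a minimal Python illustration and not the agent's actual code: the function name `find_code_server_pids`, the `proc_root` parameter, and the `code-server` match string are all assumptions made for the example.

```python
import os


def find_code_server_pids(proc_root="/proc", needle="code-server"):
    """Return PIDs whose command line mentions `needle` (hypothetical helper)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited mid-scan or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        args = [a.decode("utf-8", "replace") for a in raw.split(b"\x00") if a]
        if any(needle in a for a in args):
            pids.append(int(entry))
    return sorted(pids)
```

Taking `proc_root` as a parameter keeps the scan testable against a fake directory tree; on a real container the default `/proc` would be used.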

**Outstanding — Memory concern:**
- code-server consumes 1-2GB RAM at idle
- observed 99% memory usage in dev container during active use with AI coding agent
- 8GB container allocation may be insufficient for heavy dev workloads
- need to establish baseline memory usage for code-server alone vs with active coding session
- consider raising default dev container RAM or making it tier-based
- **stress test: k6 IDE session load test** — still outstanding from pre-launch checklist

---
Outstanding:
- code-server memory baseline
- decide whether default dev RAM should increase or become tier-based
- k6 IDE session load test

### Game Server Supervision

Completed:
- crash recovery with backoff
- crash observability and classification

- crash recovery with backoff: 30s → 60s → 120s
- backoff resets if uptime ≥ 30s
- transitions to `error` state after repeated failures
- crash observability: time, exit code, signal, uptime, log tail, classification
- classifications: `oom`, `mod_or_plugin_error`, `missing_dependency`, `nonzero_exit`, `unexpected_exit`
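
The backoff rules above (30s → 60s → 120s, reset after 30s of uptime, `error` after repeated failures) can be illustrated with a small state holder. This is a hedged sketch, not the agent's implementation: the class name `CrashBackoff` is hypothetical, and the failure limit of 3 is an assumption, since the notes only say "repeated failures".

```python
from dataclasses import dataclass

BACKOFF_STEPS = (30, 60, 120)  # seconds, taken from the list above
STABLE_UPTIME = 30             # uptime in seconds that resets the backoff
MAX_FAILURES = 3               # assumption: the notes only say "repeated failures"


@dataclass
class CrashBackoff:
    failures: int = 0
    state: str = "running"

    def on_crash(self, uptime_seconds):
        """Return the restart delay in seconds, or None once in the `error` state."""
        if uptime_seconds >= STABLE_UPTIME:
            self.failures = 0  # the server was stable, so start the ladder over
        if self.failures >= MAX_FAILURES:
            self.state = "error"
            return None
        delay = BACKOFF_STEPS[min(self.failures, len(BACKOFF_STEPS) - 1)]
        self.failures += 1
        return delay
```

With these assumed constants, three quick crashes in a row yield delays of 30, 60, and 120 seconds, and a fourth moves the server to `error`.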
### Fabric Readiness Gating
Status: startup rehydrate path fixed in plugin and happy-path validated live.

---

### Fabric Readiness Gating (Active — agent-level fix needed)

Root cause confirmed: Velocity registration happens before the Fabric server is fully ready to accept proxy traffic, causing "proxy starting" errors until Velocity is restarted.

Required fix in agent:
- Do not register with Velocity until readiness probe confirms server is accepting connections
- TCP port check (port 25565) is the most reliable signal — already used by agent probe
- Velocity registration must be gated behind probe success, not just process start
- Pre-seeding mods + FabricProxy config before first run ✅ already done
- See `SCRATCH/session-stabilization-fabric-findings.md` for full findings
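
The gating described above, where registration waits for a successful TCP probe, might look like the following. It is a sketch under assumptions: `wait_for_port` and `register_when_ready` are hypothetical names, the `register` callback stands in for whatever call the agent makes to Velocity, and the timeout values are illustrative.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Poll until a TCP connect to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not accepting connections yet, retry
    return False


def register_when_ready(host, port, register, timeout=120.0, interval=1.0):
    """Call `register()` only after the probe succeeds (hypothetical callback)."""
    if not wait_for_port(host, port, timeout, interval):
        return False  # never surface a backend that is not accepting connections
    register()
    return True
```

For a game container this would be called with port 25565, so a process that has started but is not yet listening is never registered.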

---

### Agent Future Work (priority order)
Still outstanding:
- negative-path validation: verify a `ready=false` backend is skipped on Velocity startup rehydrate
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/session-stabilization-fabric-findings.md`

### Agent Future Work
1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification (SIGTERM + wait for Minecraft)
3. Graceful shutdown verification
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline (for CF Tunnel SSH)
5. SSH server install in dev container provisioning pipeline
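
For the graceful shutdown item (SIGTERM, then wait for Minecraft), one way to verify the behavior is a helper that reports whether SIGTERM alone was enough. This is only an illustrative sketch: `stop_gracefully` is a hypothetical name, the 30-second grace period is an assumption, and the escalation to SIGKILL is a common pattern rather than something these notes specify.

```python
import subprocess


def stop_gracefully(proc, grace_seconds=30.0):
    """Send SIGTERM and wait; escalate to SIGKILL on timeout.

    Returns True if the process exited within the grace period.
    """
    if proc.poll() is not None:
        return True  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # assumed escalation step, not confirmed by the notes
        proc.wait()
        return False
```

A `False` return during verification would indicate the server did not finish saving and exiting inside the grace window.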

---

## Dev IDE Access

### Browser IDE ✅ Fully Working (browser-verified) — Zero Install

```
Browser → dev-<vmid>.zerolaghub.dev → Traefik → API → container:6000
```

This is the primary zero-install access method. VS Code in the browser,
no client software required. Browser-verified and production ready.

### Remaining Work
### Browser IDE
Status: fully working, browser-verified, zero-install.

Remaining:
- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy` — remove stale path-based assumptions

### Local Dev Access — SSH via CF Tunnel (Power Users Only)

**Important:** CF Tunnel SSH requires `cloudflared` installed on the
developer's local machine. This is not zero-install and is an optional
feature for power users who want local VS Code or terminal access.

The browser IDE remains the zero-install story for all developers.
- simplify and harden `devProxy`

### Local Dev Access — SSH via CF Tunnel
Current state:
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Cloudflare Zero Trust free plan active
- 🟡 Tunnel SSH hostname mapping not yet configured in Zero Trust dashboard
- 🟡 Bastion SSH proxy jump config not yet done
- 🟡 Dev container SSH server not yet verified
- 🟡 Portal SSH config snippet not yet built
- tunnel created and connected to bastion VM
- Zero Trust free plan active
- SSH hostname mapping not yet configured
- bastion SSH proxy jump config not yet done
- dev container SSH server not yet verified
- portal SSH config snippet not yet built

---

## API (zpac-api)
## API (zpack-api)

Completed:

- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- `GET /api/servers/:id/status` — server status endpoint
- `POST /api/dev/:id/ide-token` — IDE token generation + hosted URL
- `GET /api/dev/:id/ide` — bootstrap route
- `/__ide/:id/*` — live tunnel proxy (HTTP + WS, target-bound)
- `handleHostedProxy` — host-based routing via `Host` header vmid extraction
- status endpoint
- IDE token generation + hosted URL
- bootstrap IDE route
- live tunnel proxy
- host-based routing for hosted IDE
- hosted flow browser-verified end-to-end
- ✅ Backend lifecycle hardened — idempotent create/delete/start/stop
- ✅ Duplicate server creation fixed — was frontend issue, not API
- ✅ Console routing corrected — browser → API → agent
- ✅ Control plane switched to IP-based service communication
- ✅ Velocity rehydration uses DB + Redis instead of Proxmox live state
- backend lifecycle hardened
- duplicate server creation fixed
- console routing corrected
- control plane switched to IP-based service communication
- Velocity rehydration uses DB + Redis instead of Proxmox live state
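
The host-based routing item above relies on pulling the vmid out of the `Host` header. A rough Python sketch of that extraction follows; it is not the API's real `handleHostedProxy` code. The name `vmid_from_host` is hypothetical, and the pattern assumes hostnames of the form `dev-<vmid>.zerolaghub.dev` with a numeric vmid, as in `dev-6070.zerolaghub.dev`.

```python
import re

# Assumption: hosted IDE hostnames are dev-<numeric vmid>.zerolaghub.dev
HOST_PATTERN = re.compile(r"^dev-(\d+)\.zerolaghub\.dev$")


def vmid_from_host(host_header):
    """Return the vmid encoded in a Host header, or None if it does not match."""
    if not host_header:
        return None
    # Normalize case and drop an optional :port suffix.
    hostname = host_header.strip().lower().split(":", 1)[0]
    match = HOST_PATTERN.match(hostname)
    return int(match.group(1)) if match else None
```

Anchoring the pattern at both ends means a lookalike such as `dev-1.zerolaghub.dev.example.com` is rejected rather than routed.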

Outstanding:

- **Billing / Stripe integration** — cannot take payments without this, launch blocker
- **Password reset flow** — verify it is fully wired up
- **Usage limits / quota enforcement** — prevent unbounded server creation per account
- Billing / Stripe integration
- password reset flow verification
- usage limits / quota enforcement
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart
- **Service discovery migration** — replace remaining internal.zlh FQDNs with env-based IPs in hot paths
- service discovery migration for remaining hot-path `internal.zlh` references

---

## Portal (zpac-portal)
## Portal (zpack-portal)

Completed:

- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- ✅ Site copy/wording rewrite — landing, features, FAQ, about updated for new platform
- ✅ Pricing page updated — Vanilla/Modded/Heavy tiers, Minecraft only
- site copy rewrite
- pricing page updated

Outstanding:

- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users (optional, clearly labeled as requiring cloudflared install)
- **User onboarding flow** — what happens after registration? Guided first-server creation needed
- **Email notifications** — server crashed, provisioning complete, billing receipts
- Remove `testdaemon` binary from repo root (7MB, should not be in repo)
- SSH config snippet for power users
- user onboarding flow
- email notifications
- remove `testdaemon` binary from repo root

---

## Game Servers

Outstanding:

- **Game server world backup / restore** — player world data backup separate from PBS infrastructure backup. Trust-critical — players losing world data will kill retention.
- **Game server subdomain** — how do players connect? Verify IP vs subdomain (e.g. `mc.zerolaghub.com` style)
- game server world backup / restore
- game server subdomain / player connection method verification

---

## Velocity / ZpackVelocityBridge

Completed:
- startup rehydrate now requires `ready == true`
- stale default `zpack-api.internal.zlh` fallback removed
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
- happy-path player routing validated live after restart
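
The `ready == true` requirement on startup rehydrate amounts to a strict filter over the backend records returned by the rehydrate endpoint. The sketch below shows the idea in Python for illustration only; the plugin itself is not written in Python, and the record shape (dicts with `name` and `ready` keys) and the function name are assumptions.

```python
def select_rehydrate_backends(records):
    """Keep only records whose `ready` field is exactly True.

    Missing, null, or string values such as "true" are treated as not ready.
    """
    return [r for r in records if r.get("ready") is True]
```

The same filter is what the pending negative-path validation would exercise: a record with `ready=false` should never appear in the result.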

Outstanding:

- **Server registrations are in-memory only** — lost on Velocity restart. API needs `POST /api/velocity/resync` to re-register all running containers
- **ZpackCommands not registered** — `/zpack status/list/reload` commands exist in code but not wired up
- **Player routing on restart** — existing running containers go unrouted until new provisioning happens or resync is called
- See `SCRATCH/velocity-plugin.md` for full plugin documentation

---

## Infrastructure

### Hosting — GTHost (Decision: Stay, Most Cost-Effective)

GTHost Detroit is the primary host. Decision made to stay on GTHost long-term:
- Bare metal, instant Proxmox install without workarounds
- Unmetered bandwidth
- Competitive pricing — most cost-effective option evaluated
- Digital Ocean: too expensive, no bare metal/Proxmox
- OVH: more expensive, overkill for current scale
- Hetzner: Proxmox install was painful historically

### DDoS Protection (Resolved — Minimal Attack Surface, Accepted Risk)

Investigation complete. Attack surface is actually very small:

- All infrastructure is internal to Proxmox — not publicly exposed
- Portal, API, admin access all internal or behind Twingate
- Minecraft player traffic is proxied through Velocity VM — individual game containers never directly exposed
- Only the Velocity VM TCP port is public-facing
- Minecraft Java uses TCP — harder to volumetric flood than UDP games
- OPNsense rate limiting on Velocity-facing port is sufficient for launch
- Cloudflare DNS (non-proxied) hides real IP from casual attackers

**Decision:** Architecture is well-designed — attack surface is minimal. Accept remaining risk at launch. Revisit if attack occurs or revenue supports additional mitigation.
- ZpackCommands not registered (`/zpack status`, `/zpack list`, `/zpack reload`)
- negative-path readiness validation still pending for a `ready=false` backend
- see `SCRATCH/velocity-plugin.md`

---

## Pre-Launch Checklist

Outstanding before launch:

- **Billing / Stripe integration** — launch blocker, cannot take money
- **Game server world backup / restore** — trust-critical
- **User onboarding flow** — guided first-server creation after register
- **Password reset flow** — verify wired up
- **Usage limits / quota enforcement** — per account
- **Game server subdomain** — verify player connection method
- **Email notifications** — crashed, billing, provisioning
- **Upload testing** — test file upload flow end-to-end in dev containers
- **Billing endpoints** — add back to API
- **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
- **OPNsense audit** — both routers need systematic validation
- **Fabric readiness gating** — agent must gate Velocity registration behind TCP probe success
- **Service discovery migration** — replace remaining internal.zlh with env-based IPs in hot paths
- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
- Remove `testdaemon` binary from zpac-portal repo root
- Billing / Stripe integration
- game server world backup / restore
- user onboarding flow
- password reset flow verification
- usage limits / quota enforcement
- game server subdomain verification
- email notifications
- upload testing
- billing endpoints
- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
- OPNsense audit
- Fabric readiness gating full validation
- service discovery migration
- provisioning validation
- remove `testdaemon` from `zpack-portal`

---

## Platform

Future work:

- CF Tunnel SSH completion — power user feature, requires client cloudflared install
- CF Tunnel SSH completion
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery at scale
- **Admin panel** — manage users/servers as operator
- **Referral / dev pipeline reward system** — revenue sharing for developers
- **Uptime history** — visible to users per server
- **DDoS mitigation** — revisit Path.net or similar when revenue supports it
- Cloudflare R2 for large artifact/mod file delivery
- admin panel
- referral / dev pipeline reward system
- uptime history
- revisit DDoS mitigation later if needed

---

## Closed Threads

- ✅ PTY console (dev + game)
- ✅ Mod lifecycle
- ✅ Upload pipeline
- ✅ Runtime artifact installs
- ✅ Dev container filesystem model
- ✅ Code-server artifact fix
- ✅ API status endpoint for frontend agent-state consumption
- ✅ Game server crash recovery with backoff
- ✅ Crash observability (classification, log tail, exit metadata)
- ✅ Code-server lifecycle endpoints (start/stop/restart)
- ✅ Code-server process detection via /proc scan
- ✅ Dev IDE proxy — path-based browser IDE working end-to-end
- ✅ Hosted wildcard Traefik → API → container dev IDE flow — browser-verified
- ✅ Per-container dev IDE edge publish/unpublish removed from API
- ✅ Wildcard TLS cert `*.zerolaghub.dev` via Let's Encrypt + Cloudflare DNS-01
- ✅ Browser IDE fully loading at dev-<vmid>.zerolaghub.dev
- ✅ CF Tunnel created and connected to bastion VM
- ✅ Portal copy rewrite — landing, features, FAQ, about, pricing
- ✅ DDoS investigation — minimal attack surface, Velocity proxy + internal architecture, accepted risk
- ✅ Hosting provider decision — GTHost Detroit, most cost-effective option
- ✅ Migration to Detroit — complete Apr 2, 2026
- ✅ Internal FQDN migration — all services on internal.zlh
- ✅ System stabilization — duplicate creation, DB/Redis drift, console routing all resolved
- ✅ Provisioning + DB consistency — root cause was migration state pollution, now clean
- ✅ IP-based control plane — internal.zlh removed from service-to-service hot paths
- PTY console (dev + game)
- mod lifecycle
- upload pipeline
- runtime artifact installs
- dev container filesystem model
- code-server artifact fix
- API status endpoint for frontend agent-state consumption
- game server crash recovery with backoff
- crash observability
- code-server lifecycle endpoints
- code-server process detection
- dev IDE proxy
- hosted wildcard Traefik → API → container IDE flow
- per-container dev IDE edge publish/unpublish removed from API
- wildcard TLS cert `*.zerolaghub.dev`
- browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
- CF Tunnel created and connected to bastion VM
- portal copy rewrite
- DDoS investigation
- hosting provider decision: GTHost Detroit
- migration to Detroit complete
- system stabilization after migration
- IP-based control plane
- Velocity startup rehydrate fixed and validated on happy path