# Open Threads — zlh-grind
This file tracks active but unfinished work.
Keep it short.
---
## Agent (zlh-agent)
### Dev Runtime System
Completed:
- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server
Outstanding:
- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- runtime update process for dev containers
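For the outstanding catalog hash validation item, the check could look roughly like the sketch below. It assumes the catalog stores a hex SHA-256 digest per artifact; the function name and that catalog field are hypothetical and not taken from the agent.

```python
import hashlib
import hmac


def artifact_matches(data: bytes, expected_sha256_hex: str) -> bool:
    """Compare an artifact's SHA-256 digest against a catalog entry.

    Assumption: the catalog stores a hex SHA-256 digest for each artifact.
    """
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual, expected_sha256_hex.lower())


# Usage with an in-memory payload standing in for a downloaded artifact.
payload = b"example artifact bytes"
expected = hashlib.sha256(payload).hexdigest()
ok = artifact_matches(payload, expected)
tampered = artifact_matches(payload + b"!", expected)
```

A real install path would likely hash the file in chunks rather than load it whole; that detail is left out here.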
### Dev Environment
Completed:
- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
Outstanding:
- PATH normalization
- shell profile consistency
- runtime PATH injection
### Code Server Addon
Status: Installed, running, browser-verified end-to-end.
Outstanding:
- code-server memory baseline
- decide whether default dev RAM should increase or become tier-based
- k6 IDE session load test
### Game Server Supervision
Completed:
- crash recovery with backoff
- crash observability and classification
- unified readiness-aware start / restart path across manual start, restart, autostart, and supervisor recovery
- dead duplicate crash monitor removed
- `/ready` endpoint added and operation/maintenance state surfaced through `/status`
- guarded operation lock added for mutating / stateful flows
- console command endpoint hardened
- agent self-update rollback symlink handling corrected
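The backoff policy behind crash recovery is not recorded in this file. As a minimal sketch only, assuming exponential backoff with a cap (the function name, `base_seconds`, and `cap_seconds` are hypothetical, not the agent's values):

```python
def restart_delay(crash_count: int, base_seconds: float = 2.0, cap_seconds: float = 60.0) -> float:
    """Return the wait before restart attempt number `crash_count` (1-based).

    Hypothetical policy: double the delay per consecutive crash, up to a cap.
    """
    if crash_count < 1:
        raise ValueError("crash_count must be >= 1")
    return min(cap_seconds, base_seconds * 2 ** (crash_count - 1))


# Delays for the first six consecutive crashes under the assumed defaults.
delays = [restart_delay(n) for n in range(1, 7)]
```

Under these assumed defaults the sixth delay is held at the cap rather than doubling again.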
### Game Server Backups
Completed:
- first guarded Minecraft backup flow implemented in agent
- local backup create/list/restore endpoints added
- local backup delete endpoint added
- live backup uses `save-all flush` -> `save-off` -> archive -> `save-on`
- restore stops server, waits for exit, restores manifest-declared paths, then restarts through readiness-aware path
Outstanding:
- pre-restore safety checkpoint
- remote backup storage / transfer path
- backup job history / progress beyond current operation state
- retention policy refinement beyond initial local pruning
- real-world validation on live Minecraft server
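The live backup ordering listed under Completed can be sketched as follows. `send_command` and `archive` are hypothetical hooks standing in for the agent's console writer and archiver, and re-enabling saves after a failed archive is an assumption, not confirmed agent behavior.

```python
from typing import Callable, List


def live_backup(send_command: Callable[[str], None], archive: Callable[[], None]) -> None:
    """Run the live backup ordering: flush, pause saves, archive, resume saves."""
    send_command("save-all flush")  # write pending world data to disk
    send_command("save-off")        # pause autosave so files stay stable
    try:
        archive()
    finally:
        # Assumption: autosave is re-enabled even if archiving fails.
        send_command("save-on")


# Usage with in-memory fakes that record the order of operations.
log: List[str] = []
live_backup(log.append, lambda: log.append("archive"))
```

Putting `save-on` in a `finally` block is one way to avoid leaving a server with autosave disabled after a failed archive.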
### Fabric Readiness Gating
Status: startup rehydrate path fixed in plugin; happy-path and negative-path startup validation both passed.
Outstanding:
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/session-stabilization-fabric-findings.md`
### Agent Future Work
1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline
---
## Dev IDE Access
### Browser IDE
Status: fully working, browser-verified, zero-install.
Remaining:
- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy`
### Local Dev Access — SSH via CF Tunnel
Current state:
- tunnel created and connected to bastion VM
- Zero Trust free plan active
- SSH hostname mapping not yet configured
- bastion SSH proxy jump config not yet done
- dev container SSH server not yet verified
- portal SSH config snippet not yet built
---
## API (zpack-api)
Completed:
- dev provisioning payload
- runtime/version fields
- `enable_code_server` flag
- status endpoint
- IDE token generation + hosted URL
- bootstrap IDE route
- live tunnel proxy
- host-based routing for hosted IDE
- hosted flow browser-verified end-to-end
- backend lifecycle hardened
- duplicate server creation fixed
- console routing corrected
- control plane switched to IP-based service communication
- Velocity rehydration uses DB + Redis instead of Proxmox live state
- Stripe webhook delivery/reachability fixed via public billing hostname
- webhook-driven persistence of billing state (`subscriptionStatus`, `plan`)
- billing page/API alignment for active state and Stripe portal flow
- direct in-app plan upgrade endpoint (`/api/billing/upgrade`)
- direct in-app plan downgrade scheduling endpoint (`/api/billing/downgrade`)
- persisted billing fields for `currentPeriodEnd`, `lastInvoicePaidAt`, `billingSyncedAt`
- persisted scheduled downgrade state (`scheduledPlan`, `scheduledPlanEffectiveAt`)
- plan-based quota enforcement in `POST /api/instances`
- password reset request + confirm flow implemented
- agent contract updated for POST control actions, `/ready`, operation state, and backup routes
- agent transport consolidated into shared `agentClient.js`
- semantic readiness split implemented with shared `isAgentReadyResult()`
- API backup forwarding added for list / create / restore / delete
- agent `409` conflict and readiness-oriented error handling preserved instead of collapsing to generic `500`
- Velocity routing made more conservative around missing readiness
Outstanding:
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- service discovery migration for remaining hot-path `internal.zlh` references
- normalize backup response shape now that portal is tolerating multiple field names
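The error-handling item under Completed (agent `409` and readiness errors preserved rather than collapsed to `500`) can be illustrated with a small mapping. This is a sketch under assumptions: the pass-through set and function name are hypothetical, and the real logic lives in the API's `agentClient.js`, which is not reproduced here.

```python
# Assumption: 409 (operation conflict) and 503 (not ready) are the agent
# statuses worth surfacing to the portal unchanged.
PASS_THROUGH = frozenset({409, 503})


def api_status_for_agent_error(agent_status: int) -> int:
    """Pick the API status to return when an agent call fails.

    Conflict and not-ready statuses pass through; anything else falls back
    to a generic 500.
    """
    return agent_status if agent_status in PASS_THROUGH else 500


# Usage: map a few agent failures to what the API would return.
mapped = {code: api_status_for_agent_error(code) for code in (409, 503, 400)}
```

Keeping the pass-through set explicit makes it easy to see which agent states the portal can message specifically.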
---
## Portal (zpack-portal)
Completed:
- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- site copy rewrite
- pricing page updated
- billing page aligned with API v2 billing state
- honest Stripe portal section with single portal CTA
- in-app Basic → Pro upgrade wiring
- in-app Pro → Basic scheduled downgrade wiring
- quota/limit messaging on create flow with billing upgrade guidance
- forgot-password + reset-password pages and login linkage
- first-login onboarding modal with quick/full tour and skip
- dashboard IA refresh: spotlight server card replaces duplicate mini-listing
- operation / maintenance state surfaced in game server UI
- first backup UI added for list / create / restore
- backup delete UI added with destructive confirmation
- targeted `409` / `503` messaging added for operation conflict and not-ready states
- console command submission updated to POST JSON
- console page action gating fixed so stopped MC servers remain startable while console send stays gated to running + ready
Outstanding:
- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users
- email notifications
- remove `testdaemon` binary from repo root
---
## Game Servers
Completed:
- first local Minecraft backup / restore flow wired end-to-end through agent, API, and portal
- manual local backup delete wired end-to-end through agent, API, and portal
Outstanding:
- harden world backup / restore with pre-restore checkpoint, remote storage, and live validation
- game server subdomain / player connection method verification
---
## Velocity / ZpackVelocityBridge
Completed:
- startup rehydrate now requires `ready == true`
- stale default `zpack-api.internal.zlh` fallback removed
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
- happy-path player routing validated live after restart
- negative-path startup validation passed: no backend registered when rehydrate returned zero eligible servers
Outstanding:
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/velocity-plugin.md`
Optional / Nonessential:
- `ZpackCommands` exists in code but is not currently needed operationally
- prefer plugin HTTP status endpoint and host metrics over in-proxy admin commands
- if metrics coverage is sufficient, command registration can remain omitted or the command class can be removed later
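The startup rehydrate rule listed under Completed (register only servers with `ready == true`, and nothing when none qualify) amounts to a strict filter. The sketch below assumes each rehydrate entry is a dict with a `ready` field; the `name` field and the function name are hypothetical, and the actual plugin is not written in Python.

```python
from typing import Dict, List


def eligible_backends(servers: List[Dict]) -> List[Dict]:
    """Keep only entries whose `ready` field is exactly True.

    Missing or non-boolean `ready` values are treated as not ready.
    """
    return [s for s in servers if s.get("ready") is True]


# Usage: one ready server, one not ready, one with no readiness info.
rehydrated = [
    {"name": "mc-a", "ready": True},
    {"name": "mc-b", "ready": False},
    {"name": "mc-c"},
]
to_register = eligible_backends(rehydrated)
```

Treating a missing `ready` field as not ready matches the conservative direction described for Velocity routing.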
---
## Pre-Launch Checklist
Outstanding before launch:
- harden game server world backup / restore (checkpoint + remote storage + validation)
- game server subdomain verification
- email notifications
- upload testing
- billing endpoint/path cleanup verification
- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
- OPNsense audit
- Fabric readiness gating full validation
- service discovery migration
- provisioning validation
- remove `testdaemon` from `zpack-portal`
---
## Platform
Future work:
- CF Tunnel SSH completion
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery
- admin panel
- referral / dev pipeline reward system
- uptime history
- revisit DDoS mitigation later if needed
---
## Closed Threads
- PTY console (dev + game)
- mod lifecycle
- upload pipeline
- runtime artifact installs
- dev container filesystem model
- code-server artifact fix
- API status endpoint for frontend agent-state consumption
- game server crash recovery with backoff
- crash observability
- code-server lifecycle endpoints
- code-server process detection
- dev IDE proxy
- hosted wildcard Traefik → API → container IDE flow
- per-container dev IDE edge publish/unpublish removed from API
- wildcard TLS cert `*.zerolaghub.dev`
- browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
- CF Tunnel created and connected to bastion VM
- portal copy rewrite
- DDoS investigation
- hosting provider decision: GTHost Detroit
- migration to Detroit complete
- system stabilization after migration
- IP-based control plane
- Velocity startup rehydrate fixed and validated on happy path
- billing / Stripe webhook delivery and persistence
- password reset flow
- usage limits / quota enforcement
- user onboarding flow
- dashboard spotlight server IA refresh