# Open Threads — zlh-grind

This file tracks active but unfinished work. Keep it short.

---

## Agent (zlh-agent)

### Dev Runtime System

Completed:

- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server

Outstanding:

- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- runtime update process for dev containers

### Dev Environment

Completed:

- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly

Outstanding:

- PATH normalization
- shell profile consistency
- runtime PATH injection

### Dev Container Backups

Recommendation:

- implement dev backups as workspace snapshots, not whole-container backups
- primary scope should be `/home/dev/workspace`
- restore should rebuild from config, then restore workspace snapshot
- treat dotfiles / user settings as optional follow-up, not default backup scope
- avoid backing up reproducible runtime payloads and caches by default
- plan remote storage early for dev backups so node loss does not equal workspace loss

### Code Server Addon

Status: Installed, running, browser-verified end-to-end.

Outstanding:

- code-server memory baseline
- decide whether default dev RAM should increase or become tier-based
- k6 IDE session load test

### Game Server Supervision

Completed:

- crash recovery with backoff
- crash observability and classification
- unified readiness-aware start / restart path across manual start, restart, autostart, and supervisor recovery
- dead duplicate crash monitor removed
- `/ready` endpoint added and operation/maintenance state surfaced through `/status`
- guarded operation lock added for mutating / stateful flows
- console command endpoint hardened
- agent self-update rollback symlink handling corrected

### Game Server Backups

Completed:

- first guarded Minecraft backup flow implemented in agent
- local backup create/list/restore endpoints added
- local backup delete endpoint added
- live backup uses `save-all flush` -> `save-off` -> archive -> `save-on`
- restore stops server, waits for exit, restores manifest-declared paths, then restarts through readiness-aware path
- **pre-restore checkpoint hardening implemented (Apr 16 2026)**
  - backup metadata fields added: `type` and optional `reason`
  - manual backups record `type: "manual"`
  - restore creates a local `checkpoint` backup with `reason: "pre_restore"` before any destructive operation
  - checkpoint creation aborts the restore if it fails — live data never touched
  - checkpoint creation disables pruning so safety backup is not immediately removed
  - collision-safe backup IDs for same-second creation
  - `POST /game/backups/restore` response now includes `restored`, `backup`, `checkpoint`
  - full test coverage: checkpoint creation, abort-before-delete, manual metadata, listability, unsafe restore rejection

Outstanding:

- remote backup storage / transfer path
- backup job history / progress beyond current operation state
- retention policy refinement beyond initial local pruning
- real-world validation on live Minecraft server

### Fabric Readiness Gating

Status: startup rehydrate path fixed in plugin and both happy-path and negative-path startup validation passed.

Still outstanding:

- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/session-stabilization-fabric-findings.md`

### Agent Future Work (post-launch)

1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline
6. Long-running job model (job IDs, progress phases, cancel/retry)
7. Typed platform-action wrappers over raw console commands
8. Persistent operation recovery after agent restart
9. `RestartServer()` readiness probe bypass — fix or document

---

## Dev IDE Access

### Browser IDE

Status: fully working, browser-verified, zero-install.

Remaining:

- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy`

### Local Dev Access — SSH via CF Tunnel

Current state:

- tunnel created and connected to bastion VM
- Zero Trust free plan active
- SSH hostname mapping not yet configured
- bastion SSH proxy jump config not yet done
- dev container SSH server not yet verified
- portal SSH config snippet not yet built

---

## API (zpack-api)

Completed:

- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- status endpoint
- IDE token generation + hosted URL
- bootstrap IDE route
- live tunnel proxy
- host-based routing for hosted IDE
- hosted flow browser-verified end-to-end
- backend lifecycle hardened
- duplicate server creation fixed
- console routing corrected
- control plane switched to IP-based service communication
- Velocity rehydration uses DB + Redis instead of Proxmox live state
- Stripe webhook delivery/reachability fixed via public billing hostname
- webhook-driven persistence of billing state (`subscriptionStatus`, `plan`)
- billing page/API alignment for active state and Stripe portal flow
- direct in-app plan upgrade endpoint (`/api/billing/upgrade`)
- direct in-app plan downgrade scheduling endpoint (`/api/billing/downgrade`)
- persisted billing fields for `currentPeriodEnd`, `lastInvoicePaidAt`, `billingSyncedAt`
- persisted scheduled downgrade state (`scheduledPlan`, `scheduledPlanEffectiveAt`)
- plan-based quota enforcement in `POST /api/instances`
- password reset request + confirm flow implemented
- agent contract updated for POST control actions, `/ready`, operation state, and backup routes
- agent transport consolidated into shared `agentClient.js`
- semantic readiness split implemented with shared `isAgentReadyResult()`
- API backup forwarding added for list / create / restore / delete
- agent `409` conflict and readiness-oriented error handling preserved instead of collapsing to generic `500`
- Velocity routing made more conservative around missing readiness

Outstanding:

- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- service discovery migration for remaining hot-path `internal.zlh` references
- normalize backup response shape now that portal is tolerating multiple field names

---

## Portal (zpack-portal)

Completed:

- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- site copy rewrite
- pricing page updated
- billing page aligned with API v2 billing state
- honest Stripe portal section with single portal CTA
- in-app Basic → Pro upgrade wiring
- in-app Pro → Basic scheduled downgrade wiring
- quota/limit messaging on create flow with billing upgrade guidance
- forgot-password + reset-password pages and login linkage
- first-login onboarding modal with quick/full tour and skip
- dashboard IA refresh: spotlight server card replaces duplicate mini-listing
- operation / maintenance state surfaced in game server UI
- first backup UI added for list / create / restore
- backup delete UI added with destructive confirmation
- targeted `409` / `503` messaging added for operation conflict and not-ready states
- console command submission updated to POST JSON
- console page action gating fixed so stopped MC servers remain startable while console send stays gated to running + ready

Outstanding:

- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users
- email notifications
- remove `testdaemon` binary from repo root

---

## Game Servers

Completed:

- first local Minecraft backup / restore flow wired end-to-end through agent, API, and portal
- manual local backup delete wired end-to-end through agent, API, and portal
- pre-restore checkpoint hardening complete in agent

Outstanding:

- remote storage for game server backups
- real-world backup/restore validation on live Minecraft server
- game server subdomain / player connection method verification

---

## Velocity / ZpackVelocityBridge

Completed:

- startup rehydrate now requires `ready == true`
- stale default `zpack-api.internal.zlh` fallback removed
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
- happy-path player routing validated live after restart
- negative-path startup validation passed: no backend registered when rehydrate returned zero eligible servers

Outstanding:

- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/velocity-plugin.md`

Optional / Nonessential:

- `ZpackCommands` exist in code but are not currently needed operationally
- prefer plugin HTTP status endpoint and host metrics over in-proxy admin commands
- if metrics coverage is sufficient, command registration can remain omitted or the command class can be removed later

---

## Pre-Launch Checklist

Outstanding before launch:

- remote storage for game server backups
- real-world backup/restore validation on live Minecraft server
- game server subdomain verification
- email notifications
- upload testing
- billing endpoint/path cleanup verification
- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
- OPNsense audit
- Fabric readiness gating full validation
- service discovery migration
- provisioning validation
- remove `testdaemon` from `zpack-portal`

---

## Platform

Future work:

- CF Tunnel SSH completion
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery
- admin panel
- referral / dev pipeline reward system
- uptime history
- revisit DDoS mitigation later if needed

---

## Closed Threads

- PTY console (dev + game)
- mod lifecycle
- upload pipeline
- runtime artifact installs
- dev container filesystem model
- code-server artifact fix
- API status endpoint for frontend agent-state consumption
- game server crash recovery with backoff
- crash observability
- code-server lifecycle endpoints
- code-server process detection
- dev IDE proxy
- hosted wildcard Traefik → API → container IDE flow
- per-container dev IDE edge publish/unpublish removed from API
- wildcard TLS cert `*.zerolaghub.dev`
- browser IDE fully loading at `dev-.zerolaghub.dev`
- CF Tunnel created and connected to bastion VM
- portal copy rewrite
- DDoS investigation
- hosting provider decision: GTHost Detroit
- migration to Detroit complete
- system stabilization after migration
- IP-based control plane
- Velocity startup rehydrate fixed and validated on happy path
- billing / Stripe webhook delivery and persistence
- password reset flow
- usage limits / quota enforcement
- user onboarding flow
- dashboard spotlight server IA refresh
- pre-restore checkpoint hardening (agent-side, Apr 16 2026)