Open Threads — zlh-grind

This file tracks active but unfinished work.

Keep it short.


Agent (zlh-agent)

Dev Runtime System

Completed:

  • catalog validation implemented
  • runtime installs artifact-backed
  • install guard implemented
  • all installs now fetch from artifact server

Outstanding:

  • runtime install verification improvements
  • catalog hash validation
  • runtime removal / upgrade handling
  • runtime update process for dev containers

Dev Environment

Completed:

  • dev user creation
  • workspace root /home/dev/workspace
  • console runs as dev user
  • HOME, USER, LOGNAME, TERM env vars set correctly

Outstanding:

  • PATH normalization
  • shell profile consistency
  • runtime PATH injection

Code Server Addon

Status: Installed, running, browser-verified end-to-end.

Outstanding:

  • code-server memory baseline
  • decide whether default dev RAM should increase or become tier-based
  • k6 IDE session load test

Game Server Supervision

Completed:

  • crash recovery with backoff
  • crash observability and classification
  • unified readiness-aware start / restart path across manual start, restart, autostart, and supervisor recovery
  • dead duplicate crash monitor removed
  • /ready endpoint added and operation/maintenance state surfaced through /status
  • guarded operation lock added for mutating / stateful flows
  • console command endpoint hardened
  • agent self-update rollback symlink handling corrected
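
As a rough illustration of the crash-recovery backoff listed above, the restart delay could be computed as follows. This is a minimal sketch only: the function name, the base delay, the cap, and the doubling policy are all assumptions, not the agent's actual values.

```python
def restart_delay(crash_count: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return the seconds to wait before the next restart attempt.

    Hypothetical policy: the delay doubles with each consecutive crash
    and is capped so a crash loop never waits longer than `cap` seconds.
    """
    if crash_count < 1:
        return 0.0  # no crash recorded yet, start immediately
    return min(cap, base * 2 ** (crash_count - 1))
```

Capping the delay keeps a persistently crashing server retrying at a bounded interval instead of backing off indefinitely.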

Game Server Backups

Completed:

  • first guarded Minecraft backup flow implemented in agent
  • local backup create/list/restore endpoints added
  • live backup uses save-all flush -> save-off -> archive -> save-on
  • restore stops server, waits for exit, restores manifest-declared paths, then restarts through readiness-aware path
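
The live backup ordering above (save-all flush -> save-off -> archive -> save-on) can be sketched like this. The two callables are hypothetical stand-ins for the console command path and the archive step, and re-enabling saves in a `finally` block is an assumption about the desired failure behavior, not a description of the agent's code.

```python
from typing import Callable


def live_backup(send_command: Callable[[str], None],
                archive: Callable[[], str]) -> str:
    """Run the live backup sequence and return the archive path.

    send_command: hypothetical hook that sends a console command to the server.
    archive: hypothetical hook that writes the archive and returns its path.
    """
    send_command("save-all flush")  # flush pending world data to disk
    send_command("save-off")        # pause autosave while files are copied
    try:
        return archive()
    finally:
        # Assumed policy: always re-enable autosave, even if archiving fails.
        send_command("save-on")
```

Keeping `save-on` in `finally` means a failed archive cannot leave the world with autosave disabled.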

Outstanding:

  • pre-restore safety checkpoint
  • remote backup storage / transfer path
  • backup job history / progress beyond current operation state
  • retention policy refinement beyond initial local pruning
  • real-world validation on live Minecraft server

Fabric Readiness Gating

Status: startup rehydrate path fixed in plugin; happy-path and negative-path startup validation both passed.

Outstanding:

  • confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
  • see SCRATCH/session-stabilization-fabric-findings.md

Agent Future Work

  1. Structured logging (slog) for Loki
  2. Dev container provisioningComplete state in /status
  3. Graceful shutdown verification
  4. Process reattachment on agent restart
  5. SSH server install in dev container provisioning pipeline

Dev IDE Access

Browser IDE

Status: fully working, browser-verified, zero-install.

Remaining:

  • confirm "Open IDE" button in portal uses hosted URL in production path
  • reduce legacy /__ide/:id compatibility paths once portal button confirmed
  • simplify and harden devProxy

Local Dev Access — SSH via CF Tunnel

Current state:

  • tunnel created and connected to bastion VM
  • Zero Trust free plan active
  • SSH hostname mapping not yet configured
  • bastion SSH proxy jump config not yet done
  • dev container SSH server not yet verified
  • portal SSH config snippet not yet built

API (zpack-api)

Completed:

  • dev provisioning payload
  • runtime/version fields
  • enable_code_server flag
  • status endpoint
  • IDE token generation + hosted URL
  • bootstrap IDE route
  • live tunnel proxy
  • host-based routing for hosted IDE
  • hosted flow browser-verified end-to-end
  • backend lifecycle hardened
  • duplicate server creation fixed
  • console routing corrected
  • control plane switched to IP-based service communication
  • Velocity rehydration uses DB + Redis instead of Proxmox live state
  • Stripe webhook delivery/reachability fixed via public billing hostname
  • webhook-driven persistence of billing state (subscriptionStatus, plan)
  • billing page/API alignment for active state and Stripe portal flow
  • direct in-app plan upgrade endpoint (/api/billing/upgrade)
  • direct in-app plan downgrade scheduling endpoint (/api/billing/downgrade)
  • persisted billing fields for currentPeriodEnd, lastInvoicePaidAt, billingSyncedAt
  • persisted scheduled downgrade state (scheduledPlan, scheduledPlanEffectiveAt)
  • plan-based quota enforcement in POST /api/instances
  • password reset request + confirm flow implemented
  • agent contract updated for POST control actions, /ready, operation state, and backup routes
  • API backup forwarding added for list / create / restore
  • agent 409 conflict and readiness-oriented error handling preserved instead of collapsing to generic 500
  • Velocity routing made more conservative around missing readiness
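
The 409 / readiness error preservation noted above amounts to passing selected agent statuses through unchanged. A minimal sketch, assuming 409 and 503 are the statuses worth preserving; the function name, the status set, and the fallback body are hypothetical, and the API itself may not be written in Python.

```python
from typing import Dict, Tuple

# Assumption: conflict (409) and not-ready (503) carry meaning for the portal.
PASSTHROUGH_STATUSES = {409, 503}


def forward_agent_error(agent_status: int, agent_body: Dict) -> Tuple[int, Dict]:
    """Map an agent error response to the API response.

    Statuses in PASSTHROUGH_STATUSES are returned as-is with the agent's body;
    anything else collapses to a generic 500.
    """
    if agent_status in PASSTHROUGH_STATUSES:
        return agent_status, agent_body
    return 500, {"error": "agent request failed"}
```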

Outstanding:

  • simplify and harden host-native devProxy
  • dev runtime catalog endpoint for portal
  • Headscale auth key generation
  • service discovery migration for remaining hot-path internal.zlh references
  • normalize backup response shape now that portal is tolerating multiple field names

Portal (zpack-portal)

Completed:

  • dev runtime dropdown
  • dotnet runtime support
  • enable code-server checkbox
  • dev file browser support
  • site copy rewrite
  • pricing page updated
  • billing page aligned with API v2 billing state
  • honest Stripe portal section with single portal CTA
  • in-app Basic → Pro upgrade wiring
  • in-app Pro → Basic scheduled downgrade wiring
  • quota/limit messaging on create flow with billing upgrade guidance
  • forgot-password + reset-password pages and login linkage
  • first-login onboarding modal with quick/full tour and skip
  • dashboard IA refresh: spotlight server card replaces duplicate mini-listing
  • operation / maintenance state surfaced in game server UI
  • first backup UI added for list / create / restore
  • targeted 409 / 503 messaging added for operation conflict and not-ready states
  • console command submission updated to POST JSON

Outstanding:

  • confirm "Open IDE" button fully uses hosted URL flow
  • SSH config snippet for power users
  • email notifications
  • remove testdaemon binary from repo root
  • fix console page action gating so stopped MC servers remain startable while console send stays gated to running + ready
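
The intended console gating from the last item can be stated as a small rule. This is a sketch of the target behavior only: the state strings and field names are hypothetical, and the portal itself is unlikely to be written in Python.

```python
def console_actions(state: str, ready: bool) -> dict:
    """Return which console page actions should be enabled.

    Start is allowed whenever the server is stopped; sending console
    commands requires the server to be both running and ready.
    """
    running = state == "running"
    return {
        "can_start": state == "stopped",
        "can_send_command": running and ready,
    }
```

The point of separating the two flags is that a stopped server disables command send without also disabling the start button.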

Game Servers

Completed:

  • first local Minecraft backup / restore flow wired end-to-end through agent, API, and portal

Outstanding:

  • harden world backup / restore with pre-restore checkpoint, remote storage, and live validation
  • game server subdomain / player connection method verification

Velocity / ZpackVelocityBridge

Completed:

  • startup rehydrate now requires ready == true
  • stale default zpack-api.internal.zlh fallback removed
  • explicit ZPACK_REHYDRATE_ENDPOINT env wiring validated
  • happy-path player routing validated live after restart
  • negative-path startup validation passed: no backend registered when rehydrate returned zero eligible servers
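
The `ready == true` requirement above can be pictured as a strict filter over the rehydrate response. A minimal sketch under assumptions: the response is modeled as a list of dicts with a `ready` field, the function name is hypothetical, and the plugin itself is not written in Python.

```python
from typing import Dict, List


def eligible_backends(servers: List[Dict]) -> List[Dict]:
    """Keep only servers whose `ready` field is exactly True.

    A missing field, False, or a truthy non-boolean such as the string
    "true" all count as not ready, so nothing is registered by accident.
    """
    return [s for s in servers if s.get("ready") is True]
```

With this rule, a rehydrate response containing no ready servers yields an empty list, which matches the negative-path result of registering no backend.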

Outstanding:

  • confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
  • see SCRATCH/velocity-plugin.md

Optional / Nonessential:

  • ZpackCommands exist in code but are not currently needed operationally
  • prefer plugin HTTP status endpoint and host metrics over in-proxy admin commands
  • if metrics coverage is sufficient, command registration can remain omitted or the command class can be removed later

Pre-Launch Checklist

Outstanding before launch:

  • harden game server world backup / restore (checkpoint + remote storage + validation)
  • game server subdomain verification
  • email notifications
  • upload testing
  • billing endpoint/path cleanup verification
  • stress testing: k6 IDE + Minecraft bot + code-server memory baseline
  • OPNsense audit
  • Fabric readiness gating full validation
  • service discovery migration
  • provisioning validation
  • remove testdaemon from zpack-portal
  • portal console gating fix for stopped-but-startable MC servers

Platform

Future work:

  • CF Tunnel SSH completion
  • artifact version promotion
  • runtime rollback support
  • Cloudflare R2 for large artifact/mod file delivery
  • admin panel
  • referral / dev pipeline reward system
  • uptime history
  • revisit DDoS mitigation later if needed

Closed Threads

  • PTY console (dev + game)
  • mod lifecycle
  • upload pipeline
  • runtime artifact installs
  • dev container filesystem model
  • code-server artifact fix
  • API status endpoint for frontend agent-state consumption
  • game server crash recovery with backoff
  • crash observability
  • code-server lifecycle endpoints
  • code-server process detection
  • dev IDE proxy
  • hosted wildcard Traefik → API → container IDE flow
  • per-container dev IDE edge publish/unpublish removed from API
  • wildcard TLS cert *.zerolaghub.dev
  • browser IDE fully loading at dev-<vmid>.zerolaghub.dev
  • CF Tunnel created and connected to bastion VM
  • portal copy rewrite
  • DDoS investigation
  • hosting provider decision: GTHost Detroit
  • migration to Detroit complete
  • system stabilization after migration
  • IP-based control plane
  • Velocity startup rehydrate fixed and validated on happy path
  • billing / Stripe webhook delivery and persistence
  • password reset flow
  • usage limits / quota enforcement
  • user onboarding flow
  • dashboard spotlight server IA refresh