From 7db130baf46472ed99a3b99a89c8affa3c193128 Mon Sep 17 00:00:00 2001
From: jester
Date: Thu, 16 Apr 2026 19:57:24 +0000
Subject: [PATCH] Reconcile root open threads with Codex ownership model

---
 OPEN_THREADS.md | 309 +++++++-----------------------------------------
 1 file changed, 41 insertions(+), 268 deletions(-)

diff --git a/OPEN_THREADS.md b/OPEN_THREADS.md
index 17072cf..83d6a02 100644
--- a/OPEN_THREADS.md
+++ b/OPEN_THREADS.md
@@ -1,263 +1,60 @@
 # Open Threads — zlh-grind
 
-This file tracks active but unfinished work.
+This file tracks **active cross-repo and platform-level work only**.
 
-Keep it short.
+Repo-specific work belongs in:
+- `Codex/API/OPEN_ITEMS.md`
+- `Codex/Portal/OPEN_ITEMS.md`
+- `Codex/Agent/OPEN_ITEMS.md`
+
+Keep this file short.
 
 ---
 
-## Agent (zlh-agent)
+## Cross-Repo Active
 
-### Dev Runtime System
-Completed:
-- catalog validation implemented
-- runtime installs artifact-backed
-- install guard implemented
-- all installs now fetch from artifact server
+### Game backup integration
+- normalize backup response shape across API and Portal so list/create/restore/delete have stable success bodies and a consistent error envelope
+- validate local Minecraft backup/restore flow on a real live server end-to-end
+- confirm checkpoint metadata presentation once API exposes the final stable fields to Portal
 
-Outstanding:
-- runtime install verification improvements
-- catalog hash validation
-- runtime removal / upgrade handling
-- runtime update process for dev containers
+### Dev access / IDE / SSH
+- simplify and harden API `devProxy`
+- complete SSH / CF tunnel access path across platform, API, Agent, and Portal UX
+- add Portal SSH config snippet for power users
 
-### Dev Environment
-Completed:
-- dev user creation
-- workspace root `/home/dev/workspace`
-- console runs as dev user
-- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
-
-Outstanding:
-- PATH normalization
-- shell profile consistency
-- runtime PATH injection
-
-### Dev Container Backups
-Recommendation:
-- implement dev backups as workspace snapshots, not whole-container backups
-- primary scope should be `/home/dev/workspace`
-- restore should rebuild from config, then restore workspace snapshot
-- treat dotfiles / user settings as optional follow-up, not default backup scope
-- avoid backing up reproducible runtime payloads and caches by default
-- plan remote storage early for dev backups so node loss does not equal workspace loss
-
-### Code Server Addon
-Status: Installed, running, browser-verified end-to-end.
-
-Outstanding:
-- code-server memory baseline
-- decide whether default dev RAM should increase or become tier-based
-- k6 IDE session load test
-
-### Game Server Supervision
-Completed:
-- crash recovery with backoff
-- crash observability and classification
-- unified readiness-aware start / restart path across manual start, restart, autostart, and supervisor recovery
-- dead duplicate crash monitor removed
-- `/ready` endpoint added and operation/maintenance state surfaced through `/status`
-- guarded operation lock added for mutating / stateful flows
-- console command endpoint hardened
-- agent self-update rollback symlink handling corrected
-
-### Game Server Backups
-Completed:
-- first guarded Minecraft backup flow implemented in agent
-- local backup create/list/restore endpoints added
-- local backup delete endpoint added
-- live backup uses `save-all flush` -> `save-off` -> archive -> `save-on`
-- restore stops server, waits for exit, restores manifest-declared paths, then restarts through readiness-aware path
-- **pre-restore checkpoint hardening implemented (Apr 16 2026)**
-  - backup metadata fields added: `type` and optional `reason`
-  - manual backups record `type: "manual"`
-  - restore creates a local `checkpoint` backup with `reason: "pre_restore"` before any destructive operation
-  - checkpoint creation aborts the restore if it fails — live data never touched
-  - checkpoint creation disables pruning so safety backup is not immediately removed
-  - collision-safe backup IDs for same-second creation
-  - `POST /game/backups/restore` response now includes `restored`, `backup`, `checkpoint`
-  - full test coverage: checkpoint creation, abort-before-delete, manual metadata, listability, unsafe restore rejection
-
-Outstanding:
-- remote backup storage / transfer path
-- backup job history / progress beyond current operation state
-- retention policy refinement beyond initial local pruning
-- real-world validation on live Minecraft server
-
-### Fabric Readiness Gating
-Status: startup rehydrate path fixed in plugin and both happy-path and negative-path startup validation passed.
-
-Still outstanding:
-- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
-- see `SCRATCH/session-stabilization-fabric-findings.md`
-
-### Agent Future Work (post-launch)
-1. Structured logging (slog) for Loki
-2. Dev container `provisioningComplete` state in `/status`
-3. Graceful shutdown verification
-4. Process reattachment on agent restart
-5. SSH server install in dev container provisioning pipeline
-6. Long-running job model (job IDs, progress phases, cancel/retry)
-7. Typed platform-action wrappers over raw console commands
-8. Persistent operation recovery after agent restart
-9. `RestartServer()` readiness probe bypass — fix or document
-
----
-
-## Dev IDE Access
-
-### Browser IDE
-Status: fully working, browser-verified, zero-install.
-
-Remaining:
-- confirm "Open IDE" button in portal uses hosted URL in production path
-- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
-- simplify and harden `devProxy`
-
-### Local Dev Access — SSH via CF Tunnel
-Current state:
-- tunnel created and connected to bastion VM
-- Zero Trust free plan active
-- SSH hostname mapping not yet configured
-- bastion SSH proxy jump config not yet done
-- dev container SSH server not yet verified
-- portal SSH config snippet not yet built
-
----
-
-## API (zpack-api)
-
-Completed:
-- dev provisioning payload
-- runtime/version fields
-- enable_code_server flag
-- status endpoint
-- IDE token generation + hosted URL
-- bootstrap IDE route
-- live tunnel proxy
-- host-based routing for hosted IDE
-- hosted flow browser-verified end-to-end
-- backend lifecycle hardened
-- duplicate server creation fixed
-- console routing corrected
-- control plane switched to IP-based service communication
-- Velocity rehydration uses DB + Redis instead of Proxmox live state
-- Stripe webhook delivery/reachability fixed via public billing hostname
-- webhook-driven persistence of billing state (`subscriptionStatus`, `plan`)
-- billing page/API alignment for active state and Stripe portal flow
-- direct in-app plan upgrade endpoint (`/api/billing/upgrade`)
-- direct in-app plan downgrade scheduling endpoint (`/api/billing/downgrade`)
-- persisted billing fields for `currentPeriodEnd`, `lastInvoicePaidAt`, `billingSyncedAt`
-- persisted scheduled downgrade state (`scheduledPlan`, `scheduledPlanEffectiveAt`)
-- plan-based quota enforcement in `POST /api/instances`
-- password reset request + confirm flow implemented
-- agent contract updated for POST control actions, `/ready`, operation state, and backup routes
-- agent transport consolidated into shared `agentClient.js`
-- semantic readiness split implemented with shared `isAgentReadyResult()`
-- API backup forwarding added for list / create / restore / delete
-- agent `409` conflict and readiness-oriented error handling preserved instead of collapsing to generic `500`
-- Velocity routing made more conservative around missing readiness
-
-Outstanding:
-- simplify and harden host-native `devProxy`
-- dev runtime catalog endpoint for portal
-- Headscale auth key generation
-- service discovery migration for remaining hot-path `internal.zlh` references
-- normalize backup response shape now that portal is tolerating multiple field names
-
----
-
-## Portal (zpack-portal)
-
-Completed:
-- dev runtime dropdown
-- dotnet runtime support
-- enable code-server checkbox
-- dev file browser support
-- site copy rewrite
-- pricing page updated
-- billing page aligned with API v2 billing state
-- honest Stripe portal section with single portal CTA
-- in-app Basic → Pro upgrade wiring
-- in-app Pro → Basic scheduled downgrade wiring
-- quota/limit messaging on create flow with billing upgrade guidance
-- forgot-password + reset-password pages and login linkage
-- first-login onboarding modal with quick/full tour and skip
-- dashboard IA refresh: spotlight server card replaces duplicate mini-listing
-- operation / maintenance state surfaced in game server UI
-- first backup UI added for list / create / restore
-- backup delete UI added with destructive confirmation
-- targeted `409` / `503` messaging added for operation conflict and not-ready states
-- console command submission updated to POST JSON
-- console page action gating fixed so stopped MC servers remain startable while console send stays gated to running + ready
-
-Outstanding:
-- confirm "Open IDE" button fully uses hosted URL flow
-- SSH config snippet for power users
-- email notifications
-- remove `testdaemon` binary from repo root
-
----
-
-## Game Servers
-
-Completed:
-- first local Minecraft backup / restore flow wired end-to-end through agent, API, and portal
-- manual local backup delete wired end-to-end through agent, API, and portal
-- pre-restore checkpoint hardening complete in agent
-
-Outstanding:
-- remote storage for game server backups
-- real-world backup/restore validation on live Minecraft server
+### Service discovery / launch validation
+- service discovery migration for remaining hot-path references
+- provisioning validation across current API/Agent/Portal assumptions
+- Fabric / readiness / Velocity exposure final cross-component verification
 - game server subdomain / player connection method verification
 
----
-
-## Velocity / ZpackVelocityBridge
-
-Completed:
-- startup rehydrate now requires `ready == true`
-- stale default `zpack-api.internal.zlh` fallback removed
-- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
-- happy-path player routing validated live after restart
-- negative-path startup validation passed: no backend registered when rehydrate returned zero eligible servers
-
-Outstanding:
-- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
-- see `SCRATCH/velocity-plugin.md`
-
-Optional / Nonessential:
-- `ZpackCommands` exist in code but are not currently needed operationally
-- prefer plugin HTTP status endpoint and host metrics over in-proxy admin commands
-- if metrics coverage is sufficient, command registration can remain omitted or the command class can be removed later
+### Notifications / launch polish
+- email notifications across backend contract + Portal UX
+- remove stray `testdameon` / `testdaemon` binary from Portal repo
 
 ---
 
-## Pre-Launch Checklist
+## Platform / Infrastructure Active
 
-Outstanding before launch:
-- remote storage for game server backups
-- real-world backup/restore validation on live Minecraft server
-- game server subdomain verification
-- email notifications
 - upload testing
-- billing endpoint/path cleanup verification
-- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
+- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
 - OPNsense audit
-- Fabric readiness gating full validation
-- service discovery migration
-- provisioning validation
-- remove `testdaemon` from `zpack-portal`
+- billing endpoint/path cleanup verification
+
+### Backup boundary
+- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config
+- PBS / platform backup strategy is the durability / disaster-recovery layer
+- do not track PBS/offsite durability work as agent implementation work unless that ownership changes
 
 ---
 
-## Platform
+## Platform Future
 
-Future work:
-- CF Tunnel SSH completion
+- CF Tunnel SSH completion beyond first working path
 - artifact version promotion
 - runtime rollback support
-- Cloudflare R2 for large artifact/mod file delivery
+- Cloudflare R2 for large artifact/mod delivery
 - admin panel
 - referral / dev pipeline reward system
 - uptime history
@@ -265,35 +62,11 @@ Future work:
 
 ---
 
-## Closed Threads
+## Cleaning Rule
 
-- PTY console (dev + game)
-- mod lifecycle
-- upload pipeline
-- runtime artifact installs
-- dev container filesystem model
-- code-server artifact fix
-- API status endpoint for frontend agent-state consumption
-- game server crash recovery with backoff
-- crash observability
-- code-server lifecycle endpoints
-- code-server process detection
-- dev IDE proxy
-- hosted wildcard Traefik → API → container IDE flow
-- per-container dev IDE edge publish/unpublish removed from API
-- wildcard TLS cert `*.zerolaghub.dev`
-- browser IDE fully loading at `dev-.zerolaghub.dev`
-- CF Tunnel created and connected to bastion VM
-- portal copy rewrite
-- DDoS investigation
-- hosting provider decision: GTHost Detroit
-- migration to Detroit complete
-- system stabilization after migration
-- IP-based control plane
-- Velocity startup rehydrate fixed and validated on happy path
-- billing / Stripe webhook delivery and persistence
-- password reset flow
-- usage limits / quota enforcement
-- user onboarding flow
-- dashboard spotlight server IA refresh
-- pre-restore checkpoint hardening (agent-side, Apr 16 2026)
+- Root keeps only cross-repo/platform work
+- Repo-specific items must be removed from root once they live only in one Codex tracker
+- Completed items should be removed, not left in place as historical clutter
+- Use `CURRENT_STATE.md` for durable implemented behavior
+- Use `DECISIONS.md` for settled choices
+- Re-open old items only when there is current evidence they are still unfinished or have regressed