Reconcile root open threads with Codex ownership model

jester 2026-04-16 19:57:24 +00:00
parent 948b207bc4
commit 7db130baf4

@@ -1,263 +1,60 @@
# Open Threads — zlh-grind
This file tracks active but unfinished work.
This file tracks **active cross-repo and platform-level work only**.
Keep it short.
Repo-specific work belongs in:
- `Codex/API/OPEN_ITEMS.md`
- `Codex/Portal/OPEN_ITEMS.md`
- `Codex/Agent/OPEN_ITEMS.md`
Keep this file short.
---
## Agent (zlh-agent)
## Cross-Repo Active
### Dev Runtime System
Completed:
- catalog validation implemented
- runtime installs artifact-backed
- install guard implemented
- all installs now fetch from artifact server
### Game backup integration
- normalize backup response shape across API and Portal so list/create/restore/delete have stable success bodies and a consistent error envelope
- validate local Minecraft backup/restore flow on a real live server end-to-end
- confirm checkpoint metadata presentation once API exposes the final stable fields to Portal
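
A minimal sketch of what a "stable success body and consistent error envelope" could look like for the backup routes. The field names `ok`, `data`, `error`, `code`, and `message` are hypothetical and chosen only for illustration; the final shape is still an open item between API and Portal.

```python
from typing import Any, Dict


def success(data: Any) -> Dict[str, Any]:
    """Wrap a successful list/create/restore/delete result (hypothetical shape)."""
    return {"ok": True, "data": data, "error": None}


def failure(code: str, message: str) -> Dict[str, Any]:
    """Wrap any failure in the same envelope so Portal checks one field."""
    return {"ok": False, "data": None, "error": {"code": code, "message": message}}
```

With a single envelope, Portal could branch on one key instead of tolerating several field names per route.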
Outstanding:
- runtime install verification improvements
- catalog hash validation
- runtime removal / upgrade handling
- runtime update process for dev containers
### Dev access / IDE / SSH
- simplify and harden API `devProxy`
- complete SSH / CF tunnel access path across platform, API, Agent, and Portal UX
- add Portal SSH config snippet for power users
### Dev Environment
Completed:
- dev user creation
- workspace root `/home/dev/workspace`
- console runs as dev user
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
Outstanding:
- PATH normalization
- shell profile consistency
- runtime PATH injection
### Dev Container Backups
Recommendation:
- implement dev backups as workspace snapshots, not whole-container backups
- primary scope should be `/home/dev/workspace`
- restore should rebuild from config, then restore workspace snapshot
- treat dotfiles / user settings as optional follow-up, not default backup scope
- avoid backing up reproducible runtime payloads and caches by default
- plan remote storage early for dev backups so node loss does not equal workspace loss
### Code Server Addon
Status: Installed, running, browser-verified end-to-end.
Outstanding:
- code-server memory baseline
- decide whether default dev RAM should increase or become tier-based
- k6 IDE session load test
### Game Server Supervision
Completed:
- crash recovery with backoff
- crash observability and classification
- unified readiness-aware start / restart path across manual start, restart, autostart, and supervisor recovery
- dead duplicate crash monitor removed
- `/ready` endpoint added and operation/maintenance state surfaced through `/status`
- guarded operation lock added for mutating / stateful flows
- console command endpoint hardened
- agent self-update rollback symlink handling corrected
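
For reference, "crash recovery with backoff" is often implemented as capped exponential backoff. The sketch below assumes that shape; the base delay, cap, and jitter values are illustrative assumptions, not the agent's actual settings.

```python
import random


def restart_delay(attempt: int, base: float = 2.0, cap: float = 300.0,
                  jitter: float = 0.0) -> float:
    """Seconds to wait before restart attempt `attempt` (1-based).

    Assumed policy: the delay doubles per consecutive crash, up to `cap`.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    delay = min(cap, base * (2 ** (attempt - 1)))
    # Optional jitter spreads out restarts; jitter=0.0 keeps the result deterministic.
    return delay + random.uniform(0, jitter * delay)
```

With the defaults, attempts 1, 2, and 3 wait 2, 4, and 8 seconds, and later attempts level off at 300 seconds.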
### Game Server Backups
Completed:
- first guarded Minecraft backup flow implemented in agent
- local backup create/list/restore endpoints added
- local backup delete endpoint added
- live backup uses `save-all flush` -> `save-off` -> archive -> `save-on`
- restore stops server, waits for exit, restores manifest-declared paths, then restarts through readiness-aware path
- **pre-restore checkpoint hardening implemented (Apr 16 2026)**
- backup metadata fields added: `type` and optional `reason`
- manual backups record `type: "manual"`
- restore creates a local `checkpoint` backup with `reason: "pre_restore"` before any destructive operation
  - a failed checkpoint creation aborts the restore, so live data is never touched
  - checkpoint creation disables pruning so the safety backup is not immediately removed
- collision-safe backup IDs for same-second creation
- `POST /game/backups/restore` response now includes `restored`, `backup`, `checkpoint`
- full test coverage: checkpoint creation, abort-before-delete, manual metadata, listability, unsafe restore rejection
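
The ordering guarantees listed above can be sketched as follows. This is an illustration under stated assumptions, not the agent's code: the callables (`send_console`, `archive`, `create_checkpoint`, `stop_server`, `restore_paths`, `start_server`) are hypothetical stand-ins, running the checkpoint before the server stop is assumed, and re-enabling saves in a `finally` block is an assumption about failure handling.

```python
from typing import Callable, Dict


def live_backup(send_console: Callable[[str], None],
                archive: Callable[[], str]) -> str:
    """Flush and pause world saves, archive, then re-enable saves."""
    send_console("save-all flush")
    send_console("save-off")
    try:
        return archive()
    finally:
        # Assumption: saving is re-enabled even if archiving raised.
        send_console("save-on")


def restore(backup_id: str,
            create_checkpoint: Callable[..., str],
            stop_server: Callable[[], None],
            restore_paths: Callable[[str], None],
            start_server: Callable[[], None]) -> Dict[str, object]:
    """Checkpoint first; only then stop, restore, and restart."""
    # If this raises, nothing below runs, so live data is untouched.
    checkpoint_id = create_checkpoint(reason="pre_restore", prune=False)
    stop_server()              # stands in for "stop and wait for exit"
    restore_paths(backup_id)   # stands in for restoring manifest-declared paths
    start_server()             # stands in for the readiness-aware start path
    # Keys match the documented response fields; value shapes are assumed.
    return {"restored": True, "backup": backup_id, "checkpoint": checkpoint_id}
```

Passing the steps in as callables keeps the ordering testable without a running server.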
Outstanding:
- remote backup storage / transfer path
- backup job history / progress beyond current operation state
- retention policy refinement beyond initial local pruning
- real-world validation on live Minecraft server
### Fabric Readiness Gating
Status: startup rehydrate path fixed in plugin and both happy-path and negative-path startup validation passed.
Still outstanding:
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/session-stabilization-fabric-findings.md`
### Agent Future Work (post-launch)
1. Structured logging (slog) for Loki
2. Dev container `provisioningComplete` state in `/status`
3. Graceful shutdown verification
4. Process reattachment on agent restart
5. SSH server install in dev container provisioning pipeline
6. Long-running job model (job IDs, progress phases, cancel/retry)
7. Typed platform-action wrappers over raw console commands
8. Persistent operation recovery after agent restart
9. `RestartServer()` readiness probe bypass — fix or document
---
## Dev IDE Access
### Browser IDE
Status: fully working, browser-verified, zero-install.
Remaining:
- confirm "Open IDE" button in portal uses hosted URL in production path
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
- simplify and harden `devProxy`
### Local Dev Access — SSH via CF Tunnel
Current state:
- tunnel created and connected to bastion VM
- Zero Trust free plan active
- SSH hostname mapping not yet configured
- bastion SSH proxy jump config not yet done
- dev container SSH server not yet verified
- portal SSH config snippet not yet built
---
## API (zpack-api)
Completed:
- dev provisioning payload
- runtime/version fields
- enable_code_server flag
- status endpoint
- IDE token generation + hosted URL
- bootstrap IDE route
- live tunnel proxy
- host-based routing for hosted IDE
- hosted flow browser-verified end-to-end
- backend lifecycle hardened
- duplicate server creation fixed
- console routing corrected
- control plane switched to IP-based service communication
- Velocity rehydration uses DB + Redis instead of Proxmox live state
- Stripe webhook delivery/reachability fixed via public billing hostname
- webhook-driven persistence of billing state (`subscriptionStatus`, `plan`)
- billing page/API alignment for active state and Stripe portal flow
- direct in-app plan upgrade endpoint (`/api/billing/upgrade`)
- direct in-app plan downgrade scheduling endpoint (`/api/billing/downgrade`)
- persisted billing fields for `currentPeriodEnd`, `lastInvoicePaidAt`, `billingSyncedAt`
- persisted scheduled downgrade state (`scheduledPlan`, `scheduledPlanEffectiveAt`)
- plan-based quota enforcement in `POST /api/instances`
- password reset request + confirm flow implemented
- agent contract updated for POST control actions, `/ready`, operation state, and backup routes
- agent transport consolidated into shared `agentClient.js`
- semantic readiness split implemented with shared `isAgentReadyResult()`
- API backup forwarding added for list / create / restore / delete
- agent `409` conflict and readiness-oriented error handling preserved instead of collapsing to generic `500`
- Velocity routing made more conservative around missing readiness
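
A small sketch of the status-preservation idea noted above: pass through the agent statuses that Portal has targeted messaging for, instead of collapsing them to `500`. The function name and the exact pass-through set are assumptions for illustration (the real logic lives in the API's JavaScript, not Python); `409` and `503` are taken from the conflict and not-ready cases mentioned in this file.

```python
# Assumed pass-through set: 409 = operation conflict, 503 = not ready.
PASS_THROUGH = frozenset({409, 503})


def map_agent_status(agent_status: int) -> int:
    """Map an agent HTTP status to the status the API returns to Portal."""
    if 200 <= agent_status < 300:
        return agent_status
    if agent_status in PASS_THROUGH:
        return agent_status
    # Anything else is treated as a generic upstream failure in this sketch.
    return 500
```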
Outstanding:
- simplify and harden host-native `devProxy`
- dev runtime catalog endpoint for portal
- Headscale auth key generation
- service discovery migration for remaining hot-path `internal.zlh` references
- normalize backup response shape now that portal is tolerating multiple field names
---
## Portal (zpack-portal)
Completed:
- dev runtime dropdown
- dotnet runtime support
- enable code-server checkbox
- dev file browser support
- site copy rewrite
- pricing page updated
- billing page aligned with API v2 billing state
- honest Stripe portal section with single portal CTA
- in-app Basic → Pro upgrade wiring
- in-app Pro → Basic scheduled downgrade wiring
- quota/limit messaging on create flow with billing upgrade guidance
- forgot-password + reset-password pages and login linkage
- first-login onboarding modal with quick/full tour and skip
- dashboard IA refresh: spotlight server card replaces duplicate mini-listing
- operation / maintenance state surfaced in game server UI
- first backup UI added for list / create / restore
- backup delete UI added with destructive confirmation
- targeted `409` / `503` messaging added for operation conflict and not-ready states
- console command submission updated to POST JSON
- console page action gating fixed so stopped MC servers remain startable while console send stays gated to running + ready
Outstanding:
- confirm "Open IDE" button fully uses hosted URL flow
- SSH config snippet for power users
- email notifications
- remove `testdaemon` binary from repo root
---
## Game Servers
Completed:
- first local Minecraft backup / restore flow wired end-to-end through agent, API, and portal
- manual local backup delete wired end-to-end through agent, API, and portal
- pre-restore checkpoint hardening complete in agent
Outstanding:
- remote storage for game server backups
- real-world backup/restore validation on live Minecraft server
### Service discovery / launch validation
- service discovery migration for remaining hot-path references
- provisioning validation across current API/Agent/Portal assumptions
- Fabric / readiness / Velocity exposure final cross-component verification
- game server subdomain / player connection method verification
---
## Velocity / ZpackVelocityBridge
Completed:
- startup rehydrate now requires `ready == true`
- stale default `zpack-api.internal.zlh` fallback removed
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
- happy-path player routing validated live after restart
- negative-path startup validation passed: no backend registered when rehydrate returned zero eligible servers
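
The `ready == true` requirement can be illustrated as a strict filter over the rehydrate response. The record shape (a list of dicts with a `ready` key) is an assumption for this sketch; the plugin itself is not written in Python.

```python
from typing import Any, Dict, List


def eligible_backends(servers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only servers whose `ready` field is exactly True.

    Missing, false, or non-boolean values are treated as not ready,
    so nothing is registered on ambiguous data.
    """
    return [s for s in servers if s.get("ready") is True]
```

An empty result corresponds to the negative-path case above: no backend is registered.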
Outstanding:
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
- see `SCRATCH/velocity-plugin.md`
Optional / Nonessential:
- `ZpackCommands` exist in code but are not currently needed operationally
- prefer plugin HTTP status endpoint and host metrics over in-proxy admin commands
- if metrics coverage is sufficient, command registration can remain omitted or the command class can be removed later
### Notifications / launch polish
- email notifications across backend contract + Portal UX
- remove stray `testdameon` / `testdaemon` binary from Portal repo
---
## Pre-Launch Checklist
## Platform / Infrastructure Active
Outstanding before launch:
- remote storage for game server backups
- real-world backup/restore validation on live Minecraft server
- game server subdomain verification
- email notifications
- upload testing
- billing endpoint/path cleanup verification
- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense audit
- Fabric readiness gating full validation
- service discovery migration
- provisioning validation
- remove `testdaemon` from `zpack-portal`
- billing endpoint/path cleanup verification
### Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config
- PBS / platform backup strategy is the durability / disaster-recovery layer
- do not track PBS/offsite durability work as agent implementation work unless that ownership changes
---
## Platform
## Platform Future
Future work:
- CF Tunnel SSH completion
- CF Tunnel SSH completion beyond first working path
- artifact version promotion
- runtime rollback support
- Cloudflare R2 for large artifact/mod file delivery
- Cloudflare R2 for large artifact/mod delivery
- admin panel
- referral / dev pipeline reward system
- uptime history
@@ -265,35 +62,11 @@ Future work:
---
## Closed Threads
## Cleaning Rule
- PTY console (dev + game)
- mod lifecycle
- upload pipeline
- runtime artifact installs
- dev container filesystem model
- code-server artifact fix
- API status endpoint for frontend agent-state consumption
- game server crash recovery with backoff
- crash observability
- code-server lifecycle endpoints
- code-server process detection
- dev IDE proxy
- hosted wildcard Traefik → API → container IDE flow
- per-container dev IDE edge publish/unpublish removed from API
- wildcard TLS cert `*.zerolaghub.dev`
- browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
- CF Tunnel created and connected to bastion VM
- portal copy rewrite
- DDoS investigation
- hosting provider decision: GTHost Detroit
- migration to Detroit complete
- system stabilization after migration
- IP-based control plane
- Velocity startup rehydrate fixed and validated on happy path
- billing / Stripe webhook delivery and persistence
- password reset flow
- usage limits / quota enforcement
- user onboarding flow
- dashboard spotlight server IA refresh
- pre-restore checkpoint hardening (agent-side, Apr 16 2026)
- Root keeps only cross-repo/platform work
- Repo-specific items must be removed from root once they live only in one Codex tracker
- Completed items should be removed, not left in place as historical clutter
- Use `CURRENT_STATE.md` for durable implemented behavior
- Use `DECISIONS.md` for settled choices
- Re-open old items only when there is current evidence they are still unfinished or have regressed