Reconcile root open threads with Codex ownership model
parent 948b207bc4
commit 7db130baf4
309
OPEN_THREADS.md
@@ -1,263 +1,60 @@
|
||||
# Open Threads — zlh-grind
|
||||
|
||||
This file tracks active but unfinished work.
|
||||
This file tracks **active cross-repo and platform-level work only**.
|
||||
|
||||
Keep it short.
|
||||
Repo-specific work belongs in:
|
||||
- `Codex/API/OPEN_ITEMS.md`
|
||||
- `Codex/Portal/OPEN_ITEMS.md`
|
||||
- `Codex/Agent/OPEN_ITEMS.md`
|
||||
|
||||
Keep this file short.
|
||||
|
||||
---
|
||||
|
||||
## Agent (zlh-agent)
|
||||
## Cross-Repo Active
|
||||
|
||||
### Dev Runtime System
|
||||
Completed:
|
||||
- catalog validation implemented
|
||||
- runtime installs artifact-backed
|
||||
- install guard implemented
|
||||
- all installs now fetch from artifact server
|
||||
### Game backup integration
|
||||
- normalize backup response shape across API and Portal so list/create/restore/delete have stable success bodies and a consistent error envelope
|
||||
- validate local Minecraft backup/restore flow on a real live server end-to-end
|
||||
- confirm checkpoint metadata presentation once API exposes the final stable fields to Portal
|
||||
|
||||
Outstanding:
|
||||
- runtime install verification improvements
|
||||
- catalog hash validation
|
||||
- runtime removal / upgrade handling
|
||||
- runtime update process for dev containers
|
||||
### Dev access / IDE / SSH
|
||||
- simplify and harden API `devProxy`
|
||||
- complete SSH / CF tunnel access path across platform, API, Agent, and Portal UX
|
||||
- add Portal SSH config snippet for power users
|
||||
|
||||
### Dev Environment
|
||||
Completed:
|
||||
- dev user creation
|
||||
- workspace root `/home/dev/workspace`
|
||||
- console runs as dev user
|
||||
- `HOME`, `USER`, `LOGNAME`, `TERM` env vars set correctly
|
||||
|
||||
Outstanding:
|
||||
- PATH normalization
|
||||
- shell profile consistency
|
||||
- runtime PATH injection
|
||||
|
||||
### Dev Container Backups
|
||||
Recommendation:
|
||||
- implement dev backups as workspace snapshots, not whole-container backups
|
||||
- primary scope should be `/home/dev/workspace`
|
||||
- restore should rebuild from config, then restore workspace snapshot
|
||||
- treat dotfiles / user settings as optional follow-up, not default backup scope
|
||||
- avoid backing up reproducible runtime payloads and caches by default
|
||||
- plan remote storage early for dev backups so node loss does not equal workspace loss
|
||||
|
||||
### Code Server Addon
|
||||
Status: Installed, running, browser-verified end-to-end.
|
||||
|
||||
Outstanding:
|
||||
- code-server memory baseline
|
||||
- decide whether default dev RAM should increase or become tier-based
|
||||
- k6 IDE session load test
|
||||
|
||||
### Game Server Supervision
|
||||
Completed:
|
||||
- crash recovery with backoff
|
||||
- crash observability and classification
|
||||
- unified readiness-aware start / restart path across manual start, restart, autostart, and supervisor recovery
|
||||
- dead duplicate crash monitor removed
|
||||
- `/ready` endpoint added and operation/maintenance state surfaced through `/status`
|
||||
- guarded operation lock added for mutating / stateful flows
|
||||
- console command endpoint hardened
|
||||
- agent self-update rollback symlink handling corrected
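The "crash recovery with backoff" item above can be illustrated with a capped exponential schedule. This is a minimal sketch only: the list does not say which backoff policy the agent uses, so `backoff_delays` and all of its parameters (`base_seconds`, `factor`, `cap_seconds`, `attempts`) are hypothetical.

```python
from typing import List


def backoff_delays(
    base_seconds: float = 1.0,
    factor: float = 2.0,
    cap_seconds: float = 60.0,
    attempts: int = 6,
) -> List[float]:
    """Return the wait before each restart attempt (hypothetical policy)."""
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        # Never wait longer than the cap, however many crashes occurred.
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays


if __name__ == "__main__":
    print(backoff_delays())
```

A supervisor following this sketch would sleep for the next delay after each crash and reset the schedule once the server stays up.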
|
||||
|
||||
### Game Server Backups
|
||||
Completed:
|
||||
- first guarded Minecraft backup flow implemented in agent
|
||||
- local backup create/list/restore endpoints added
|
||||
- local backup delete endpoint added
|
||||
- live backup uses `save-all flush` -> `save-off` -> archive -> `save-on`
|
||||
- restore stops server, waits for exit, restores manifest-declared paths, then restarts through readiness-aware path
|
||||
- **pre-restore checkpoint hardening implemented (Apr 16 2026)**
|
||||
- backup metadata fields added: `type` and optional `reason`
|
||||
- manual backups record `type: "manual"`
|
||||
- restore creates a local `checkpoint` backup with `reason: "pre_restore"` before any destructive operation
|
||||
- checkpoint creation aborts the restore if it fails — live data never touched
|
||||
- checkpoint creation disables pruning so safety backup is not immediately removed
|
||||
- collision-safe backup IDs for same-second creation
|
||||
- `POST /game/backups/restore` response now includes `restored`, `backup`, `checkpoint`
|
||||
- full test coverage: checkpoint creation, abort-before-delete, manual metadata, listability, unsafe restore rejection
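The ordering described above (`save-all flush` -> `save-off` -> archive -> `save-on`, and a `pre_restore` checkpoint that must succeed before anything destructive happens) can be sketched as follows. This is not the agent's code: every function and parameter name here is hypothetical, the `try`/`finally` around archiving is an assumption, and so is taking the checkpoint before stopping the server. The response values are placeholders; only the key names `restored`, `backup`, and `checkpoint` come from the list.

```python
from typing import Callable, Dict


def live_backup(send_command: Callable[[str], None],
                archive: Callable[[], str]) -> str:
    """Flush and pause world saves, archive, then resume saves."""
    send_command("save-all flush")
    send_command("save-off")
    try:
        return archive()
    finally:
        # Assumption: autosave is re-enabled even if archiving fails.
        send_command("save-on")


def restore_with_checkpoint(backup_id: str,
                            create_checkpoint: Callable[[str], str],
                            stop_server: Callable[[], None],
                            restore_paths: Callable[[str], None],
                            start_server: Callable[[], None]) -> Dict[str, object]:
    """Take a safety checkpoint first; any failure aborts before live data is touched."""
    checkpoint_id = create_checkpoint("pre_restore")  # raises on failure
    stop_server()
    restore_paths(backup_id)
    start_server()
    return {"restored": True, "backup": backup_id, "checkpoint": checkpoint_id}
```

Because the checkpoint call comes first and is allowed to raise, a failed checkpoint leaves the running server and its data untouched, which is the guarantee the list describes.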
|
||||
|
||||
Outstanding:
|
||||
- remote backup storage / transfer path
|
||||
- backup job history / progress beyond current operation state
|
||||
- retention policy refinement beyond initial local pruning
|
||||
- real-world validation on live Minecraft server
|
||||
|
||||
### Fabric Readiness Gating
|
||||
Status: startup rehydrate path fixed in plugin and both happy-path and negative-path startup validation passed.
|
||||
|
||||
Still outstanding:
|
||||
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
|
||||
- see `SCRATCH/session-stabilization-fabric-findings.md`
|
||||
|
||||
### Agent Future Work (post-launch)
|
||||
1. Structured logging (slog) for Loki
|
||||
2. Dev container `provisioningComplete` state in `/status`
|
||||
3. Graceful shutdown verification
|
||||
4. Process reattachment on agent restart
|
||||
5. SSH server install in dev container provisioning pipeline
|
||||
6. Long-running job model (job IDs, progress phases, cancel/retry)
|
||||
7. Typed platform-action wrappers over raw console commands
|
||||
8. Persistent operation recovery after agent restart
|
||||
9. `RestartServer()` readiness probe bypass — fix or document
|
||||
|
||||
---
|
||||
|
||||
## Dev IDE Access
|
||||
|
||||
### Browser IDE
|
||||
Status: fully working, browser-verified, zero-install.
|
||||
|
||||
Remaining:
|
||||
- confirm "Open IDE" button in portal uses hosted URL in production path
|
||||
- reduce legacy `/__ide/:id` compatibility paths once portal button confirmed
|
||||
- simplify and harden `devProxy`
|
||||
|
||||
### Local Dev Access — SSH via CF Tunnel
|
||||
Current state:
|
||||
- tunnel created and connected to bastion VM
|
||||
- Zero Trust free plan active
|
||||
- SSH hostname mapping not yet configured
|
||||
- bastion SSH proxy jump config not yet done
|
||||
- dev container SSH server not yet verified
|
||||
- portal SSH config snippet not yet built
|
||||
|
||||
---
|
||||
|
||||
## API (zpack-api)
|
||||
|
||||
Completed:
|
||||
- dev provisioning payload
|
||||
- runtime/version fields
|
||||
- enable_code_server flag
|
||||
- status endpoint
|
||||
- IDE token generation + hosted URL
|
||||
- bootstrap IDE route
|
||||
- live tunnel proxy
|
||||
- host-based routing for hosted IDE
|
||||
- hosted flow browser-verified end-to-end
|
||||
- backend lifecycle hardened
|
||||
- duplicate server creation fixed
|
||||
- console routing corrected
|
||||
- control plane switched to IP-based service communication
|
||||
- Velocity rehydration uses DB + Redis instead of Proxmox live state
|
||||
- Stripe webhook delivery/reachability fixed via public billing hostname
|
||||
- webhook-driven persistence of billing state (`subscriptionStatus`, `plan`)
|
||||
- billing page/API alignment for active state and Stripe portal flow
|
||||
- direct in-app plan upgrade endpoint (`/api/billing/upgrade`)
|
||||
- direct in-app plan downgrade scheduling endpoint (`/api/billing/downgrade`)
|
||||
- persisted billing fields for `currentPeriodEnd`, `lastInvoicePaidAt`, `billingSyncedAt`
|
||||
- persisted scheduled downgrade state (`scheduledPlan`, `scheduledPlanEffectiveAt`)
|
||||
- plan-based quota enforcement in `POST /api/instances`
|
||||
- password reset request + confirm flow implemented
|
||||
- agent contract updated for POST control actions, `/ready`, operation state, and backup routes
|
||||
- agent transport consolidated into shared `agentClient.js`
|
||||
- semantic readiness split implemented with shared `isAgentReadyResult()`
|
||||
- API backup forwarding added for list / create / restore / delete
|
||||
- agent `409` conflict and readiness-oriented error handling preserved instead of collapsing to generic `500`
|
||||
- Velocity routing made more conservative around missing readiness
|
||||
|
||||
Outstanding:
|
||||
- simplify and harden host-native `devProxy`
|
||||
- dev runtime catalog endpoint for portal
|
||||
- Headscale auth key generation
|
||||
- service discovery migration for remaining hot-path `internal.zlh` references
|
||||
- normalize backup response shape now that portal is tolerating multiple field names
|
||||
|
||||
---
|
||||
|
||||
## Portal (zpack-portal)
|
||||
|
||||
Completed:
|
||||
- dev runtime dropdown
|
||||
- dotnet runtime support
|
||||
- enable code-server checkbox
|
||||
- dev file browser support
|
||||
- site copy rewrite
|
||||
- pricing page updated
|
||||
- billing page aligned with API v2 billing state
|
||||
- honest Stripe portal section with single portal CTA
|
||||
- in-app Basic → Pro upgrade wiring
|
||||
- in-app Pro → Basic scheduled downgrade wiring
|
||||
- quota/limit messaging on create flow with billing upgrade guidance
|
||||
- forgot-password + reset-password pages and login linkage
|
||||
- first-login onboarding modal with quick/full tour and skip
|
||||
- dashboard IA refresh: spotlight server card replaces duplicate mini-listing
|
||||
- operation / maintenance state surfaced in game server UI
|
||||
- first backup UI added for list / create / restore
|
||||
- backup delete UI added with destructive confirmation
|
||||
- targeted `409` / `503` messaging added for operation conflict and not-ready states
|
||||
- console command submission updated to POST JSON
|
||||
- console page action gating fixed so stopped MC servers remain startable while console send stays gated to running + ready
|
||||
|
||||
Outstanding:
|
||||
- confirm "Open IDE" button fully uses hosted URL flow
|
||||
- SSH config snippet for power users
|
||||
- email notifications
|
||||
- remove `testdaemon` binary from repo root
|
||||
|
||||
---
|
||||
|
||||
## Game Servers
|
||||
|
||||
Completed:
|
||||
- first local Minecraft backup / restore flow wired end-to-end through agent, API, and portal
|
||||
- manual local backup delete wired end-to-end through agent, API, and portal
|
||||
- pre-restore checkpoint hardening complete in agent
|
||||
|
||||
Outstanding:
|
||||
- remote storage for game server backups
|
||||
- real-world backup/restore validation on live Minecraft server
|
||||
### Service discovery / launch validation
|
||||
- service discovery migration for remaining hot-path references
|
||||
- provisioning validation across current API/Agent/Portal assumptions
|
||||
- Fabric / readiness / Velocity exposure final cross-component verification
|
||||
- game server subdomain / player connection method verification
|
||||
|
||||
---
|
||||
|
||||
## Velocity / ZpackVelocityBridge
|
||||
|
||||
Completed:
|
||||
- startup rehydrate now requires `ready == true`
|
||||
- stale default `zpack-api.internal.zlh` fallback removed
|
||||
- explicit `ZPACK_REHYDRATE_ENDPOINT` env wiring validated
|
||||
- happy-path player routing validated live after restart
|
||||
- negative-path startup validation passed: no backend registered when rehydrate returned zero eligible servers
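The readiness gate described above amounts to a strict filter over the rehydrate response. The sketch below is written in Python for illustration rather than in the plugin's own language. Only the `ready` field comes from the list; the function name and the `name` field are hypothetical.

```python
from typing import Any, Dict, List


def eligible_backends(servers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only servers whose `ready` flag is exactly True."""
    # A missing or non-boolean `ready` value is treated as not ready.
    return [s for s in servers if s.get("ready") is True]


if __name__ == "__main__":
    response = [
        {"name": "mc-1", "ready": True},
        {"name": "mc-2", "ready": False},
        {"name": "mc-3"},
    ]
    print([s["name"] for s in eligible_backends(response)])
```

An empty result corresponds to the negative-path case: with zero eligible servers, nothing is registered with the proxy.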
|
||||
|
||||
Outstanding:
|
||||
- confirm whether any remaining agent-side registration path can surface a backend before readiness probe success
|
||||
- see `SCRATCH/velocity-plugin.md`
|
||||
|
||||
Optional / Nonessential:
|
||||
- `ZpackCommands` exist in code but are not currently needed operationally
|
||||
- prefer plugin HTTP status endpoint and host metrics over in-proxy admin commands
|
||||
- if metrics coverage is sufficient, command registration can remain omitted or the command class can be removed later
|
||||
### Notifications / launch polish
|
||||
- email notifications across backend contract + Portal UX
|
||||
- remove stray `testdameon` / `testdaemon` binary from Portal repo
|
||||
|
||||
---
|
||||
|
||||
## Pre-Launch Checklist
|
||||
## Platform / Infrastructure Active
|
||||
|
||||
Outstanding before launch:
|
||||
- remote storage for game server backups
|
||||
- real-world backup/restore validation on live Minecraft server
|
||||
- game server subdomain verification
|
||||
- email notifications
|
||||
- upload testing
|
||||
- billing endpoint/path cleanup verification
|
||||
- stress testing: k6 IDE + Minecraft bot + code-server memory baseline
|
||||
- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
|
||||
- OPNsense audit
|
||||
- Fabric readiness gating full validation
|
||||
- service discovery migration
|
||||
- provisioning validation
|
||||
- remove `testdaemon` from `zpack-portal`
|
||||
- billing endpoint/path cleanup verification
|
||||
|
||||
### Backup boundary
|
||||
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config
|
||||
- PBS / platform backup strategy is the durability / disaster-recovery layer
|
||||
- do not track PBS/offsite durability work as agent implementation work unless that ownership changes
|
||||
|
||||
---
|
||||
|
||||
## Platform
|
||||
## Platform Future
|
||||
|
||||
Future work:
|
||||
- CF Tunnel SSH completion
|
||||
- CF Tunnel SSH completion beyond first working path
|
||||
- artifact version promotion
|
||||
- runtime rollback support
|
||||
- Cloudflare R2 for large artifact/mod file delivery
|
||||
- Cloudflare R2 for large artifact/mod delivery
|
||||
- admin panel
|
||||
- referral / dev pipeline reward system
|
||||
- uptime history
|
||||
@@ -265,35 +62,11 @@ Future work:
|
||||
|
||||
---
|
||||
|
||||
## Closed Threads
|
||||
## Cleaning Rule
|
||||
|
||||
- PTY console (dev + game)
|
||||
- mod lifecycle
|
||||
- upload pipeline
|
||||
- runtime artifact installs
|
||||
- dev container filesystem model
|
||||
- code-server artifact fix
|
||||
- API status endpoint for frontend agent-state consumption
|
||||
- game server crash recovery with backoff
|
||||
- crash observability
|
||||
- code-server lifecycle endpoints
|
||||
- code-server process detection
|
||||
- dev IDE proxy
|
||||
- hosted wildcard Traefik → API → container IDE flow
|
||||
- per-container dev IDE edge publish/unpublish removed from API
|
||||
- wildcard TLS cert `*.zerolaghub.dev`
|
||||
- browser IDE fully loading at `dev-<vmid>.zerolaghub.dev`
|
||||
- CF Tunnel created and connected to bastion VM
|
||||
- portal copy rewrite
|
||||
- DDoS investigation
|
||||
- hosting provider decision: GTHost Detroit
|
||||
- migration to Detroit complete
|
||||
- system stabilization after migration
|
||||
- IP-based control plane
|
||||
- Velocity startup rehydrate fixed and validated on happy path
|
||||
- billing / Stripe webhook delivery and persistence
|
||||
- password reset flow
|
||||
- usage limits / quota enforcement
|
||||
- user onboarding flow
|
||||
- dashboard spotlight server IA refresh
|
||||
- pre-restore checkpoint hardening (agent-side, Apr 16 2026)
|
||||
- Root keeps only cross-repo/platform work
|
||||
- Repo-specific items must be removed from root once they live only in one Codex tracker
|
||||
- Completed items should be removed, not left in place as historical clutter
|
||||
- Use `CURRENT_STATE.md` for durable implemented behavior
|
||||
- Use `DECISIONS.md` for settled choices
|
||||
- Re-open old items only when there is current evidence they are still unfinished or have regressed
|
||||
|
||||