Refresh open launch threads after provisioning billing support work

jester 2026-05-03 19:50:56 +00:00
parent fa575c45f4
commit 58c43fd0c9


---
## Launch Active
### Final launch smoke test
- create Minecraft server
- confirm it reaches `Ready` / `connectable=true`
- verify public game hostname is shown only when connectable
- upload datapack on vanilla or install mod on supported modded runtime
- create backup
- restore backup
- stop/start/restart host lifecycle actions
- delete server
- confirm Velocity unregister, Cloudflare cleanup, and Technitium cleanup
### 1. Portal terminal reliability
- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
- WebSocket connect timeout in `TerminalView`.
- Error path closes/clears current socket refs and sender callback.
- `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
- Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.
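The required timeout behavior could be sketched as below; this is a minimal sketch under assumptions, not `TerminalView`'s actual API — the `WsLike` shape, `makeSocket` factory, and 10-second default are all illustrative.

```typescript
// Sketch: wrap a WebSocket-like object so a connection that never fires
// `onopen` is torn down instead of leaving the UI stuck at "Connecting...".
// On both timeout and error, the socket is closed so refs can be cleared
// and the button can recover to "Open Console" / "Reconnect".
type WsLike = {
  onopen: (() => void) | null;
  onerror: (() => void) | null;
  close: () => void;
};

function connectWithTimeout(
  makeSocket: () => WsLike,
  timeoutMs = 10_000, // assumed default, not a documented value
): Promise<WsLike> {
  return new Promise((resolve, reject) => {
    const ws = makeSocket();
    const timer = setTimeout(() => {
      ws.close(); // tear down the half-open socket so state can recover
      reject(new Error("connect timeout"));
    }, timeoutMs);
    ws.onopen = () => {
      clearTimeout(timer);
      resolve(ws);
    };
    ws.onerror = () => {
      clearTimeout(timer);
      ws.close();
      reject(new Error("connect error"));
    };
  });
}
```

The caller's `catch` branch is where `ServerConsole` would reset its streaming state and sender callback.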
### Portal public-site QA
- marketing site now has hybrid SaaS structure and SEO landing pages; verify visually before public push
- check desktop and mobile layouts for Home, Features, Pricing, FAQ, About, Support, and SEO landing pages
- confirm public CTAs route correctly to Register, Login, Pricing, Features, FAQ, and SEO pages
- confirm root metadata, page metadata, and hero copy reflect browser dev environments + managed server hosting
- mobile web is not yet optimized; do a targeted usability pass before driving launch marketing traffic to it
### 2. Monitoring / observability readiness
- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
- restrict public exposure for Prometheus/Grafana/node_exporter
- validate game/dev discovery sync and stale target removal
- ensure Grafana dashboards are installed and useful
- expose API health/app-level state
- decide centralized logs/Loki timing or accept as launch risk
- add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`
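One plausible shape for the queue-staleness check: a queue is stale when it has waiting work but nothing has completed recently. This is a sketch under assumptions — the `QueueStats` shape and thresholds are illustrative; the real check would read BullMQ job counts for each queue.

```typescript
// Sketch: staleness = pending work with no recent progress.
// Field names and thresholds are assumptions, not the deployed contract.
interface QueueStats {
  waiting: number;                 // jobs queued but not picked up
  lastCompletedAt: number | null;  // epoch ms of newest finished job
}

function isQueueStale(stats: QueueStats, nowMs: number, thresholdMs: number): boolean {
  if (stats.waiting === 0) return false;           // nothing pending -> not stale
  if (stats.lastCompletedAt === null) return true; // pending but never progressed
  return nowMs - stats.lastCompletedAt > thresholdMs;
}
```

Each of `provisioning`, `repair`, and `billing_enforcement` would likely get its own threshold, since their normal job durations differ.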
### Backup / restore polish
- happy-path local Minecraft backup create/restore has been verified live
- API restore starts asynchronously and Portal polls restore status
- keep restore wording/status transitions clear through completion and restart
- confirm checkpoint metadata presentation remains clean when exposed to Portal
- later hardening: persist last restore failure/checkpoint state in Agent `/status`
- later hardening: automatic rollback from pre-restore checkpoint if restore apply/start fails after destructive replace
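The asynchronous-restore-plus-polling flow above could be sketched as follows; the status names, interval, and attempt cap are assumptions, not the actual Portal/API contract.

```typescript
// Sketch of the Portal-side poll loop: keep asking for restore status
// until a terminal state is reached, then let the UI render the result.
type RestoreStatus = "pending" | "running" | "completed" | "failed";

async function pollRestore(
  fetchStatus: () => Promise<RestoreStatus>, // stand-in for the status API call
  intervalMs = 2_000,
  maxAttempts = 150, // ~5 minutes at the default interval
): Promise<RestoreStatus> {
  for (let i = 0; i < maxAttempts; i++) {
    const status = await fetchStatus();
    if (status === "completed" || status === "failed") return status;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return "failed"; // treat a poll timeout as failure so the UI can recover
}
```

Treating poll exhaustion as `failed` keeps the wording/status transitions unambiguous through completion and restart.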
### 3. Patch management / maintenance window policy
- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.
### Service discovery / launch validation
- service discovery migration audit for remaining non-launch hot-path references
- provisioning validation across current API/Agent/Portal assumptions
- keep public exposure model explicit:
- Portal public
- Minecraft game hostnames public as needed
- API/control plane/internal bridge/agent/admin services private
### 4. Notepad / announcements / messaging retest
- Billing announcements and support ticket flow were validated live.
- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.
### Monitoring / observability
- core lifecycle monitoring is launch-ready
- `/etc/zlh-monitor` is now the operational monitoring source of truth
- game/dev monitoring uses API discovery -> monitor sync -> file_sd for lifecycle inventory and add/remove validation
- container Alloy remote-write to Prometheus `10.60.0.25:9090` is the canonical game/dev metrics path
- `game-dev-alloy` scrape health also works because container Alloy now listens on `0.0.0.0:12345`
- remaining future work lives in `Codex/Monitoring/OPEN_ITEMS.md`: centralized logs/Loki and optional OPNsense router-only monitoring
- keep OPNsense plugin/PBS monitoring as explicit platform exceptions while Linux-managed game/dev targets converge on Alloy
### 5. Final integrated smoke test
Run this after the launch blockers above are clean:
- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.
### Notifications / launch polish
- email notifications across backend contract + Portal UX
- billing launch validation:
- plan limit gating verified in Portal
- still verify checkout/portal/webhook/upgrade-downgrade if Stripe is live
---
## Current Architecture Validated / No Longer Blocking
### Provisioning worker / async create
- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.
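The create-or-reuse pattern described above can be sketched like this; a `Map` stands in for the durable operations table and a plain function for the BullMQ add, so all names and shapes here are illustrative.

```typescript
// Sketch of the idempotency guard behind POST /api/instances:
// reuse the durable operation for duplicate requests, enqueue only once,
// and always answer 202 Accepted.
interface ProvisionOp { id: string; status: "queued" | "running" | "done"; }

function createInstance(
  ops: Map<string, ProvisionOp>,   // keyed by idempotency key (stand-in for DB)
  idempotencyKey: string,
  enqueue: (opId: string) => void, // stand-in for the BullMQ add
): { httpStatus: 202; op: ProvisionOp } {
  const existing = ops.get(idempotencyKey);
  if (existing) {
    // Duplicate request: reuse the durable operation, do not enqueue again.
    return { httpStatus: 202, op: existing };
  }
  const op: ProvisionOp = { id: `op-${ops.size + 1}`, status: "queued" };
  ops.set(idempotencyKey, op);
  enqueue(op.id); // hand off to the provisioning worker
  return { httpStatus: 202, op };
}
```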
### Controller / repair foundation
- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller is currently expected to run in dry-run unless deliberately enabling Level 1 repair.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
- stale provisioning operation → `stale`
- live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.
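A singleton controller lock of this kind is typically an expiring Redis `SET ... NX PX` plus an owner-checked release. The sketch below uses an in-memory store in place of Redis; the key name, TTL, and method names are assumptions, not `zlh-controller`'s actual implementation.

```typescript
// Sketch of the singleton-controller lock: acquire with NX + TTL semantics
// so a crashed controller's lock expires on its own.
class LockStore {
  private held = new Map<string, { owner: string; expiresAt: number }>();

  // Mirrors Redis `SET key owner NX PX ttl`: succeed only if unheld or expired.
  acquire(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const cur = this.held.get(key);
    if (cur && cur.expiresAt > nowMs) return false;
    this.held.set(key, { owner, expiresAt: nowMs + ttlMs });
    return true;
  }

  // Release only if we still own the lock -- the check-and-delete that a
  // real deployment would do atomically, e.g. via a Lua script.
  release(key: string, owner: string): boolean {
    const cur = this.held.get(key);
    if (!cur || cur.owner !== owner) return false;
    this.held.delete(key);
    return true;
  }
}
```

The owner check on release matters: without it, a controller whose lock already expired could delete a lock now held by its replacement.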
### Billing enforcement
- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
- payment failed / warning flow
- Stripe replay idempotency
- final warning / backup block state
- suspension / shutdown safety
- API gates while suspended
- controller does not repair suspended game servers
- payment restored flow
- destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.
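The Stripe replay idempotency validated above comes down to deduplicating on the event id, since Stripe retries webhook deliveries. A minimal sketch, with a `Set` standing in for the durable processed-events table:

```typescript
// Sketch of webhook replay protection: each Stripe `event.id` is applied
// at most once; a replayed delivery is acknowledged but changes nothing.
function handleStripeEvent(
  processed: Set<string>,                          // stand-in for durable storage
  event: { id: string; type: string },
  apply: (event: { id: string; type: string }) => void,
): boolean {
  if (processed.has(event.id)) return false; // replay: acknowledge, do nothing
  processed.add(event.id); // in real code, recorded in one transaction with apply
  apply(event);
  return true;
}
```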
### Support ticket path
- `POST /api/support/create` exists and creates a `SupportTicket` DB row with a human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.
---
- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit
- billing endpoint/path cleanup verification
### Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config
- PBS / platform backup strategy is the durability / disaster-recovery layer
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow
- do not track PBS/offsite durability work as agent implementation work unless that ownership changes
---
## Recently Verified / No Longer Considered Blocked
- password reset and logged-in change-password work end-to-end
- password reset tokens are 5-minute, hashed at rest, single-use, and old unused tokens are invalidated on deploy
- API-owned Minecraft connection state derives from agent readiness, edge/DNS state, Velocity registration, and backend ping
- Velocity proxy lifecycle callbacks are live with `registered_with_proxy` and `proxy_ping_ok` landing in API state
- Portal consumes API-owned `connectable` / `connection` state and no longer infers Minecraft readiness itself
- Portal server creation redirects to `/servers` and tracks setup progress there
- Portal status labels no longer treat all non-connectable states as `Needs attention`
- Portal public marketing site now has hybrid conversion + SEO structure
- Portal pricing tiers are now Starter / Pro / Performance workload tiers rather than Minecraft-only tier names
- Portal root metadata, homepage hero copy, and fake CLI line were corrected to match actual product capabilities
- SEO landing pages were added for Minecraft hosting, modded Minecraft hosting, and browser dev environments
- local Minecraft backup create/restore works end-to-end on live validation
- restore creates an intentional pre-restore checkpoint, and the API now starts the restore asynchronously instead of holding the full request open
- backup timestamps are normalized and pre-restore checkpoints are filtered from the default backup list
- agent-backed file edits create shadow copies for revert and API route/stream forwarding issues were fixed
- vanilla datapack upload works
- vanilla Mods UI is hidden and direct vanilla `mods/` upload is rejected by API
- NeoForge mod search/install/list works
- delete/teardown lifecycle removes Velocity, Cloudflare, and Technitium records
- public exposure model is in place: Portal public, control plane private
- vanilla / fabric runtime split is restored:
- `vanilla` = Fabric-based internal profile with proxy/API/config injection
- `fabric` = plain Fabric jar delivery only
- Forge / Neoforge first-start flow avoids premature readiness gating, applies post-start property enforcement, and restarts through the readiness-aware path
- current validation indicates Minecraft server creation succeeds across supported runtime variants
- current validation indicates dev container creation succeeds and hosted IDE access still works after the latest API/Portal runtime and cleanup passes
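The password-reset token policy noted above (5-minute, hashed at rest, single-use) could be sketched as follows; the store shape and function names are illustrative, not the actual implementation.

```typescript
// Sketch: persist only the SHA-256 of the token, enforce a 5-minute TTL,
// and delete on first successful redeem so tokens are single-use.
import { createHash, randomBytes } from "node:crypto";

const TOKEN_TTL_MS = 5 * 60 * 1000; // the stated 5-minute policy

function issueToken(store: Map<string, number>, nowMs: number): string {
  const token = randomBytes(32).toString("hex"); // sent to the user, never stored
  const digest = createHash("sha256").update(token).digest("hex");
  store.set(digest, nowMs + TOKEN_TTL_MS); // only the hash is persisted
  return token;
}

function redeemToken(store: Map<string, number>, token: string, nowMs: number): boolean {
  const digest = createHash("sha256").update(token).digest("hex");
  const expiresAt = store.get(digest);
  if (expiresAt === undefined || expiresAt < nowMs) return false;
  store.delete(digest); // single-use: gone after the first redeem
  return true;
}
```

Invalidating old unused tokens on deploy, as described, would just clear the store (or the user's rows) at issue time.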
---
## Platform Future / Phase 2
- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.
---
## Cleaning Rule
- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.