From 58c43fd0c980faf90366b4817d48ea0a3e111624 Mon Sep 17 00:00:00 2001 From: jester Date: Sun, 3 May 2026 19:50:56 +0000 Subject: [PATCH] Refresh open launch threads after provisioning billing support work --- OPEN_THREADS.md | 191 ++++++++++++++++++++++++------------------------ 1 file changed, 94 insertions(+), 97 deletions(-) diff --git a/OPEN_THREADS.md b/OPEN_THREADS.md index b359d72..261be8f 100644 --- a/OPEN_THREADS.md +++ b/OPEN_THREADS.md @@ -12,56 +12,83 @@ Keep this file short. --- -## Cross-Repo Active +## Launch Active -### Final launch smoke test -- create Minecraft server -- confirm it reaches `Ready` / `connectable=true` -- verify public game hostname is shown only when connectable -- upload datapack on vanilla or install mod on supported modded runtime -- create backup -- restore backup -- stop/start/restart host lifecycle actions -- delete server -- confirm Velocity unregister, Cloudflare cleanup, and Technitium cleanup +### 1. Portal terminal reliability +- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`. +- Required behavior: + - WebSocket connect timeout in `TerminalView`. + - Error path closes/clears current socket refs and sender callback. + - `ServerConsole` resets streaming state on `closed`, `error`, and `idle`. + - Button recovers to `Open Console` / `Reconnect`. +- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model. 
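The required timeout/error behavior above can be sketched as follows. This is a sketch only, not the actual `TerminalView` code: the helper name, callback wiring, and timeout budget are illustrative assumptions.

```typescript
// Sketch: a connect timeout whose every failure path runs cleanup, so the UI
// can fall back from `Connecting...` to `Open Console` / `Reconnect`.
// (Names and the 10s budget are assumptions, not the real Portal wiring.)
function withConnectTimeout<T>(
  open: Promise<T>,      // resolves when the WebSocket reaches `open`
  ms: number,            // connect budget, e.g. 10_000
  cleanup: () => void,   // close/clear socket refs and sender callback
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error("console connect timeout"));
    }, ms);
    open.then(
      (socket) => { clearTimeout(timer); resolve(socket); },
      (err) => { clearTimeout(timer); cleanup(); reject(err); },
    );
  });
}
```

On rejection, `ServerConsole` would reset its streaming state so the button renders `Reconnect` instead of staying stuck.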
-### Portal public-site QA -- marketing site now has hybrid SaaS structure and SEO landing pages; verify visually before public push -- check desktop and mobile layouts for Home, Features, Pricing, FAQ, About, Support, and SEO landing pages -- confirm public CTAs route correctly to Register, Login, Pricing, Features, FAQ, and SEO pages -- confirm root metadata, page metadata, and hero copy reflect browser dev environments + managed server hosting -- mobile web is not yet considered optimized; perform a targeted usability pass before launch marketing traffic +### 2. Monitoring / observability readiness +- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`. +- Current launch focus: + - restrict public exposure for Prometheus/Grafana/node_exporter + - validate game/dev discovery sync and stale target removal + - ensure Grafana dashboards are installed and useful + - expose API health/app-level state + - decide centralized logs/Loki timing or accept as launch risk + - add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement` -### Backup / restore polish -- happy-path local Minecraft backup create/restore has been verified live -- API restore starts asynchronously and Portal polls restore status -- keep restore wording/status transitions clear through completion and restart -- confirm checkpoint metadata presentation remains clean when exposed to Portal -- later hardening: persist last restore failure/checkpoint state in Agent `/status` -- later hardening: automatic rollback from pre-restore checkpoint if restore apply/start fails after destructive replace +### 3. Patch management / maintenance window policy +- Define normal maintenance window cadence and timezone. +- Define emergency maintenance behavior and customer notification expectations. 
+- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring. -### Service discovery / launch validation -- service discovery migration audit for remaining non-launch hot-path references -- provisioning validation across current API/Agent/Portal assumptions -- keep public exposure model explicit: - - Portal public - - Minecraft game hostnames public as needed - - API/control plane/internal bridge/agent/admin services private +### 4. Notepad / announcements / messaging retest +- Billing announcements and support ticket flow were validated live. +- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states. -### Monitoring / observability -- core lifecycle monitoring is launch-ready -- `/etc/zlh-monitor` is now the operational monitoring source of truth -- game/dev monitoring uses API discovery -> monitor sync -> file_sd for lifecycle inventory and add/remove validation -- container Alloy remote-write to Prometheus `10.60.0.25:9090` is the canonical game/dev metrics path -- `game-dev-alloy` scrape health also works because container Alloy now listens on `0.0.0.0:12345` -- remaining future work lives in `Codex/Monitoring/OPEN_ITEMS.md`: centralized logs/Loki and optional OPNsense router-only monitoring -- keep OPNsense plugin/PBS monitoring as explicit platform exceptions while Linux-managed game/dev targets converge on Alloy +### 5. Final integrated smoke test +Run this after the launch blockers above are clean: +- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified. +- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified. +- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets. 
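For the discovery-sync validation in item 2, the moving part is a Prometheus `file_sd` target file that the monitor sync writes and prunes. A representative entry might look like the following; the target address and label are illustrative assumptions, while the `game-dev-alloy` job and Alloy port 12345 match the current setup:

```json
[
  {
    "targets": ["10.60.0.41:12345"],
    "labels": { "zlh_role": "game" }
  }
]
```

Validation then reduces to creating/deleting a server and confirming its entry appears in, and disappears from, this file; Prometheus re-reads `file_sd` files on change without a reload.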
-### Notifications / launch polish
-- email notifications across backend contract + Portal UX
-- billing launch validation:
-  - plan limit gating verified in Portal
-  - still verify checkout/portal/webhook/upgrade-downgrade if Stripe is live
+---
+
+## Current Architecture Validated / No Longer Blocking
+
+### Provisioning worker / async create
+- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
+- `zpack-provision-worker.service` is installed/running under systemd.
+- Game and dev provisioning through the worker were live-validated.
+- Portal async pending cards were visually validated and replace themselves with real server cards.
+- Duplicate/idempotency guards and controlled failure behavior were validated.
+- API teardown still works for worker-created servers.
+
+### Controller / repair foundation
+- `zlh-controller.service` exists as the singleton controller/reconciler with a Redis lock.
+- The controller is currently expected to run in dry-run unless Level 1 repair is deliberately enabled.
+- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
+- Validated repairs:
+  - stale provisioning operation → `stale`
+  - live Cloudflare SRV drift → `edge_republish` restored edge state
+- Level 2/3 repairs remain disabled.
+
+### Billing enforcement
+- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
+- `zpack-billing-worker.service` is installed/running under systemd.
+- Validated: + - payment failed / warning flow + - Stripe replay idempotency + - final warning / backup block state + - suspension / shutdown safety + - API gates while suspended + - controller does not repair suspended game servers + - payment restored flow + - destructive billing actions rejected/audited +- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent. + +### Support ticket path +- `POST /api/support/create` exists and creates a `SupportTicket` DB row with human-readable ticket number. +- Portal form submit was validated. +- Customer acknowledgement email was received. +- Discord `#support` alert was received. +- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments. --- @@ -69,68 +96,38 @@ Keep this file short. - stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline - OPNsense / public exposure audit -- billing endpoint/path cleanup verification ### Backup boundary -- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config -- PBS / platform backup strategy is the durability / disaster-recovery layer -- do not track PBS/offsite durability work as agent implementation work unless that ownership changes - ---- - -## Recently Verified / No Longer Considered Blocked - -- password reset and logged-in change-password work end-to-end -- password reset tokens are 5-minute, hashed at rest, single-use, and old unused tokens are invalidated on deploy -- API-owned Minecraft connection state derives from agent readiness, edge/DNS state, Velocity registration, and backend ping -- Velocity proxy lifecycle callbacks are live with `registered_with_proxy` and `proxy_ping_ok` landing in API state -- Portal consumes API-owned `connectable` / `connection` state and no longer infers Minecraft 
readiness itself -- Portal server creation redirects to `/servers` and tracks setup progress there -- Portal status labels no longer treat all non-connectable states as `Needs attention` -- Portal public marketing site now has hybrid conversion + SEO structure -- Portal pricing tiers are now Starter / Pro / Performance workload tiers rather than Minecraft-only tier names -- Portal root metadata, homepage hero copy, and fake CLI line were corrected to match actual product capabilities -- SEO landing pages were added for Minecraft hosting, modded Minecraft hosting, and browser dev environments -- local Minecraft backup create/restore works end-to-end on live validation -- restore creates intentional pre-restore checkpoint and API now starts restore asynchronously instead of holding the full request open -- backup timestamps are normalized and pre-restore checkpoints are filtered from the default backup list -- agent-backed file edits create shadow copies for revert and API route/stream forwarding issues were fixed -- vanilla datapack upload works -- vanilla Mods UI is hidden and direct vanilla `mods/` upload is rejected by API -- NeoForge mod search/install/list works -- delete/teardown lifecycle removes Velocity, Cloudflare, and Technitium records -- public exposure model is in place: Portal public, control plane private -- vanilla / fabric runtime split is restored: - - `vanilla` = Fabric-based internal profile with proxy/API/config injection - - `fabric` = plain Fabric jar delivery only -- Forge / Neoforge first-start flow avoids premature readiness gating, applies post-start property enforcement, and restarts through the readiness-aware path -- current validation indicates Minecraft server creation succeeds across supported runtime variants -- current validation indicates dev container creation succeeds and hosted IDE access still works after the latest API/Portal runtime and cleanup passes +- Agent-owned backups are local, app-aware rollback backups for 
Minecraft worlds/config.
+- PBS / platform backup strategy is the durability / disaster-recovery layer.
+- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
+- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.
 
 ---
 
 ## Platform Future / Phase 2
 
-- SSH / CF tunnel power-user access
-- Portal SSH config snippets
-- true interactive shell confinement / workspace-boundary decision
-- dev-container backup ownership and user-facing restore contract
-- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups
-- artifact version promotion
-- runtime rollback support
-- Cloudflare R2 for large artifact/mod delivery
-- admin panel
-- referral / dev pipeline reward system
-- uptime history
-- revisit DDoS mitigation later if needed
+- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
+- SSH / CF tunnel power-user access.
+- Portal SSH config snippets.
+- True interactive shell confinement / workspace-boundary decision.
+- Dev-container backup ownership and user-facing restore contract.
+- Likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
+- Artifact version promotion.
+- Runtime rollback support.
+- Cloudflare R2 for large artifact/mod delivery.
+- Admin panel.
+- Referral / dev pipeline reward system.
+- Uptime history.
+- Revisit DDoS mitigation later if needed.
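Looping back to the billing enforcement section above: the validated Stripe replay idempotency reduces to recording each event id durably before enforcement side effects run. A minimal sketch, with the caveat that the in-memory `Set` stands in for the real durable store and the function names are assumptions:

```typescript
type StripeEvent = { id: string; type: string };

// Stand-in for the durable idempotency table keyed by Stripe event id.
const seenEventIds = new Set<string>();

// Returns "duplicate" for a replayed delivery so the handler is a no-op.
function handleStripeEvent(
  evt: StripeEvent,
  apply: (e: StripeEvent) => void,
): "applied" | "duplicate" {
  if (seenEventIds.has(evt.id)) return "duplicate";
  // Record first; a real implementation would record and apply in one
  // transaction so a crash mid-apply cannot lose or double-apply the event.
  seenEventIds.add(evt.id);
  apply(evt);
  return "applied";
}
```

Since Stripe retries deliveries with the same event id, replays then cannot double-apply enforcement transitions.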
--- ## Cleaning Rule -- Root keeps only cross-repo/platform work -- Repo-specific items must be removed from root once they live only in one Codex tracker -- Completed items should be removed, not left in place as historical clutter -- Use `CURRENT_STATE.md` for durable implemented behavior -- Use `DECISIONS.md` for settled choices -- Re-open old items only when there is current evidence they are still unfinished or have regressed +- Root keeps only cross-repo/platform work. +- Repo-specific items must be removed from root once they live only in one Codex tracker. +- Completed items should be removed from active sections, not left in place as historical clutter. +- Use `CURRENT_STATE.md` for durable implemented behavior. +- Use `DECISIONS.md` for settled choices. +- Re-open old items only when there is current evidence they are still unfinished or have regressed.