Refresh open launch threads after provisioning billing support work
This commit is contained in: parent fa575c45f4 · commit 58c43fd0c9 · OPEN_THREADS.md (191)
@@ -12,56 +12,83 @@ Keep this file short.

---

## Launch Active

### Final launch smoke test

- create Minecraft server
- confirm it reaches `Ready` / `connectable=true`
- verify public game hostname is shown only when connectable
- upload datapack on vanilla or install mod on supported modded runtime
- create backup
- restore backup
- stop/start/restart host lifecycle actions
- delete server
- confirm Velocity unregister, Cloudflare cleanup, and Technitium cleanup
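The `connectable` / hostname-visibility gate above can be sketched as a pure predicate. This is a minimal sketch; the field names are illustrative, not the actual API state shape:

```typescript
// Hypothetical shape of the API-owned Minecraft connection state; the
// real field names in the API may differ.
interface ConnectionState {
  agentReady: boolean;           // agent reports the server process ready
  edgeDnsPublished: boolean;     // Cloudflare/Technitium records in place
  registeredWithProxy: boolean;  // Velocity registration callback landed
  proxyPingOk: boolean;          // backend ping through the proxy succeeded
}

// connectable only when every layer agrees.
function connectable(s: ConnectionState): boolean {
  return s.agentReady && s.edgeDnsPublished && s.registeredWithProxy && s.proxyPingOk;
}

// The public game hostname is surfaced only when connectable.
function publicHostname(s: ConnectionState, hostname: string): string | null {
  return connectable(s) ? hostname : null;
}
```

The smoke-test step is then just checking that a server with all four layers green shows its hostname, and a server missing any layer does not.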
### 1. Portal terminal reliability

- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
  - WebSocket connect timeout in `TerminalView`.
  - Error path closes/clears current socket refs and sender callback.
  - `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
  - Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.

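The required recovery behavior can be modeled as a small state reducer: every socket event, including a connect timeout, lands on a state the button can recover from. The type and event names here are illustrative, not the actual Portal code:

```typescript
type ConsoleState = "idle" | "connecting" | "streaming" | "closed" | "error";
type ConsoleEvent =
  | "open_clicked"
  | "socket_open"
  | "socket_closed"
  | "socket_error"
  | "connect_timeout";

// Every event path must land on a recoverable state; there is no event
// that leaves the console stuck at "connecting".
function nextState(state: ConsoleState, event: ConsoleEvent): ConsoleState {
  switch (event) {
    case "open_clicked":    return "connecting";
    case "socket_open":     return "streaming";
    case "socket_closed":   return "closed";  // streaming state reset here
    case "socket_error":    return "error";   // socket refs/sender cleared here
    case "connect_timeout": return "error";   // timeout breaks out of "connecting"
  }
}

// The button always recovers to Open Console / Reconnect.
function buttonLabel(state: ConsoleState): string {
  switch (state) {
    case "idle":       return "Open Console";
    case "connecting": return "Connecting...";
    case "streaming":  return "Close Console";
    case "closed":
    case "error":      return "Reconnect";
  }
}
```

The key property is that `connect_timeout` and `socket_error` are ordinary transitions, so the UI never depends on the socket firing a close event to escape `Connecting...`.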
### Portal public-site QA

- marketing site now has hybrid SaaS structure and SEO landing pages; verify visually before public push
- check desktop and mobile layouts for Home, Features, Pricing, FAQ, About, Support, and SEO landing pages
- confirm public CTAs route correctly to Register, Login, Pricing, Features, FAQ, and SEO pages
- confirm root metadata, page metadata, and hero copy reflect browser dev environments + managed server hosting
- mobile web is not yet considered optimized; perform a targeted usability pass before launch marketing traffic
### 2. Monitoring / observability readiness

- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
  - restrict public exposure for Prometheus/Grafana/node_exporter
  - validate game/dev discovery sync and stale target removal
  - ensure Grafana dashboards are installed and useful
  - expose API health/app-level state
  - decide centralized logs/Loki timing or accept as launch risk
  - add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`

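The discovery-sync-with-stale-removal item boils down to regenerating the Prometheus `file_sd` contents from the live API inventory, so anything absent from the inventory disappears from scraping. A minimal sketch, with hypothetical inventory field names:

```typescript
// Hypothetical inventory entry from the API discovery endpoint.
interface Instance {
  id: string;
  host: string;
  port: number;
  active: boolean;
}

// Standard Prometheus file_sd JSON target-group shape.
interface TargetGroup {
  targets: string[];
  labels: Record<string, string>;
}

// Rebuild file_sd output from the live inventory on every sync pass.
// Stale targets are removed implicitly: a deleted or inactive instance
// is simply no longer emitted.
function buildFileSd(instances: Instance[]): TargetGroup[] {
  return instances
    .filter((i) => i.active)
    .map((i) => ({
      targets: [`${i.host}:${i.port}`],
      labels: { instance_id: i.id, job: "game-dev-alloy" },
    }));
}
```

Writing the whole file atomically each pass (rather than editing entries in place) is what makes the add/remove validation simple: the file always equals the inventory.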
### Backup / restore polish

- happy-path local Minecraft backup create/restore has been verified live
- API restore starts asynchronously and Portal polls restore status
- keep restore wording/status transitions clear through completion and restart
- confirm checkpoint metadata presentation remains clean when exposed to Portal
- later hardening: persist last restore failure/checkpoint state in Agent `/status`
- later hardening: automatic rollback from pre-restore checkpoint if restore apply/start fails after destructive replace
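Keeping checkpoint presentation clean mostly means the default backup list hides pre-restore checkpoints and sorts by normalized timestamps. A sketch of that filter, with hypothetical `kind` values:

```typescript
// Hypothetical backup record; "pre_restore_checkpoint" is an assumed
// kind value, not necessarily the real Agent/API naming.
interface Backup {
  id: string;
  createdAt: string; // normalized ISO-8601 timestamp
  kind: "user" | "pre_restore_checkpoint";
}

// Default Portal list: checkpoints hidden, newest first.
function defaultBackupList(all: Backup[]): Backup[] {
  return all
    .filter((b) => b.kind !== "pre_restore_checkpoint")
    .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt));
}
```

Checkpoints stay addressable by id for the rollback-hardening items, they are just excluded from the default view.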
### 3. Patch management / maintenance window policy

- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.

### Service discovery / launch validation

- service discovery migration audit for remaining non-launch hot-path references
- provisioning validation across current API/Agent/Portal assumptions
- keep public exposure model explicit:
  - Portal public
  - Minecraft game hostnames public as needed
  - API/control plane/internal bridge/agent/admin services private
### 4. Notepad / announcements / messaging retest

- Billing announcements and support ticket flow were validated live.
- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.

### Monitoring / observability

- core lifecycle monitoring is launch-ready
- `/etc/zlh-monitor` is now the operational monitoring source of truth
- game/dev monitoring uses API discovery → monitor sync → file_sd for lifecycle inventory and add/remove validation
- container Alloy remote-write to Prometheus `10.60.0.25:9090` is the canonical game/dev metrics path
- `game-dev-alloy` scrape health also works because container Alloy now listens on `0.0.0.0:12345`
- remaining future work lives in `Codex/Monitoring/OPEN_ITEMS.md`: centralized logs/Loki and optional OPNsense router-only monitoring
- keep OPNsense plugin/PBS monitoring as explicit platform exceptions while Linux-managed game/dev targets converge on Alloy
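The canonical remote-write path above corresponds to an Alloy config fragment along these lines (a minimal sketch; the live config will have scrape components feeding this via `forward_to`, and the block label `"zlh"` is arbitrary):

```alloy
prometheus.remote_write "zlh" {
  endpoint {
    url = "http://10.60.0.25:9090/api/v1/write"
  }
}
```

For this path to work, the receiving Prometheus must be started with `--web.enable-remote-write-receiver` so it accepts writes on `/api/v1/write`.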
### 5. Final integrated smoke test

Run this after the launch blockers above are clean:

- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.

### Notifications / launch polish

- email notifications across backend contract + Portal UX
- billing launch validation:
  - plan limit gating verified in Portal
  - still verify checkout/portal/webhook/upgrade-downgrade if Stripe is live

---

## Current Architecture Validated / No Longer Blocking

### Provisioning worker / async create

- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.

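The create/reuse + `202` contract can be sketched as an idempotency-keyed operation store (in-memory stand-in here; the real implementation is a durable DB row plus a BullMQ enqueue):

```typescript
// Hypothetical durable provisioning operation.
interface ProvisionOp {
  id: string;
  status: "queued" | "running" | "done" | "failed";
}

class ProvisioningOps {
  private ops = new Map<string, ProvisionOp>();
  private nextId = 1;

  // POST /api/instances semantics: create the operation once per
  // idempotency key, reuse it on retried requests, answer 202 either way.
  createOrReuse(idempotencyKey: string): { statusCode: 202; op: ProvisionOp } {
    let op = this.ops.get(idempotencyKey);
    if (!op) {
      op = { id: `op-${this.nextId++}`, status: "queued" };
      // real implementation: persist the row, then enqueue BullMQ work here
      this.ops.set(idempotencyKey, op);
    }
    return { statusCode: 202, op };
  }
}
```

Because a retried request maps to the same operation, the duplicate guard and the async pending card both key off one operation id.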
### Controller / repair foundation

- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller currently runs in dry-run unless Level 1 repair is deliberately enabled.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
  - stale provisioning operation → `stale`
  - live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.

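The singleton guarantee rests on Redis `SET key value NX PX ttl` semantics: only one controller acquires the lock per TTL window. An in-memory stand-in to illustrate (the real controller talks to Redis; key name and TTL here are assumptions):

```typescript
// In-memory stand-in for Redis SET <key> <value> NX PX <ttl>.
class LockStore {
  private locks = new Map<string, { owner: string; expiresAt: number }>();

  // Returns true only if the lock was free or expired at `now`.
  setNxPx(key: string, owner: string, ttlMs: number, now: number): boolean {
    const held = this.locks.get(key);
    if (held && held.expiresAt > now) return false; // another instance holds it
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }
}

const store = new LockStore();
const a = store.setNxPx("zlh:controller", "ctrl-a", 30_000, 0);     // acquired
const b = store.setNxPx("zlh:controller", "ctrl-b", 30_000, 1_000); // refused, lock held
```

A second instance only wins after the TTL lapses, which is why the lock holder must refresh it while reconciling.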
### Billing enforcement

- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
  - payment failed / warning flow
  - Stripe replay idempotency
  - final warning / backup block state
  - suspension / shutdown safety
  - API gates while suspended
  - controller does not repair suspended game servers
  - payment restored flow
  - destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.

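The Stripe replay idempotency that was validated comes down to recording each processed event id and making redelivery a no-op. A sketch with an in-memory set standing in for what would really be a DB table with a unique constraint on the event id:

```typescript
// Hypothetical processed-event ledger; real storage would be a DB table
// with a unique constraint on the Stripe event id.
class StripeEventLedger {
  private seen = new Set<string>();

  // Returns true if this delivery should be processed, false on replay.
  markProcessed(eventId: string): boolean {
    if (this.seen.has(eventId)) return false; // Stripe retry: skip side effects
    this.seen.add(eventId);
    return true;
  }
}
```

Gating all billing side effects behind `markProcessed` is what makes Stripe's at-least-once webhook delivery safe.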
### Support ticket path

- `POST /api/support/create` exists and creates a `SupportTicket` DB row with human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.

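A human-readable ticket number is typically a prefix plus a UTC date plus a zero-padded per-day counter. This format is an illustration, not the actual scheme the API uses:

```typescript
// Hypothetical ticket number format: ZLH-<UTC date>-<4-digit sequence>.
function ticketNumber(seq: number, now: Date): string {
  const d = now.toISOString().slice(0, 10).replace(/-/g, ""); // "20240115"
  return `ZLH-${d}-${String(seq).padStart(4, "0")}`;
}

ticketNumber(7, new Date(Date.UTC(2024, 0, 15))); // "ZLH-20240115-0007"
```

Deriving the sequence from a DB counter (rather than random digits) keeps numbers short and sortable for support staff.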
---

@@ -69,68 +96,38 @@ Keep this file short.

- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit
- billing endpoint/path cleanup verification

### Backup boundary

- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config
- PBS / platform backup strategy is the durability / disaster-recovery layer
- do not track PBS/offsite durability work as agent implementation work unless that ownership changes

---

## Recently Verified / No Longer Considered Blocked

- password reset and logged-in change-password work end-to-end
- password reset tokens are 5-minute, hashed at rest, single-use, and old unused tokens are invalidated on deploy
- API-owned Minecraft connection state derives from agent readiness, edge/DNS state, Velocity registration, and backend ping
- Velocity proxy lifecycle callbacks are live with `registered_with_proxy` and `proxy_ping_ok` landing in API state
- Portal consumes API-owned `connectable` / `connection` state and no longer infers Minecraft readiness itself
- Portal server creation redirects to `/servers` and tracks setup progress there
- Portal status labels no longer treat all non-connectable states as `Needs attention`
- Portal public marketing site now has hybrid conversion + SEO structure
- Portal pricing tiers are now Starter / Pro / Performance workload tiers rather than Minecraft-only tier names
- Portal root metadata, homepage hero copy, and fake CLI line were corrected to match actual product capabilities
- SEO landing pages were added for Minecraft hosting, modded Minecraft hosting, and browser dev environments
- local Minecraft backup create/restore works end-to-end on live validation
- restore creates an intentional pre-restore checkpoint and API now starts restore asynchronously instead of holding the full request open
- backup timestamps are normalized and pre-restore checkpoints are filtered from the default backup list
- agent-backed file edits create shadow copies for revert and API route/stream forwarding issues were fixed
- vanilla datapack upload works
- vanilla Mods UI is hidden and direct vanilla `mods/` upload is rejected by API
- NeoForge mod search/install/list works
- delete/teardown lifecycle removes Velocity, Cloudflare, and Technitium records
- public exposure model is in place: Portal public, control plane private
- vanilla / fabric runtime split is restored:
  - `vanilla` = Fabric-based internal profile with proxy/API/config injection
  - `fabric` = plain Fabric jar delivery only
- Forge / NeoForge first-start flow avoids premature readiness gating, applies post-start property enforcement, and restarts through the readiness-aware path
- current validation indicates Minecraft server creation succeeds across supported runtime variants
- current validation indicates dev container creation succeeds and hosted IDE access still works after the latest API/Portal runtime and cleanup passes
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
- PBS / platform backup strategy is the durability / disaster-recovery layer.
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.

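The reset-token properties listed above (5-minute lifetime, hashed at rest, single-use) can be sketched with Node's `crypto`. The record shape is illustrative, not the actual API schema:

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hypothetical stored record: only the SHA-256 of the token is kept,
// so a DB leak does not expose usable reset links.
interface ResetRecord {
  tokenHash: string;
  expiresAt: number; // epoch ms
  used: boolean;
}

const FIVE_MINUTES_MS = 5 * 60_000;

function issueToken(now: number): { token: string; record: ResetRecord } {
  const token = randomBytes(32).toString("hex"); // sent to the user only
  const tokenHash = createHash("sha256").update(token).digest("hex");
  return { token, record: { tokenHash, expiresAt: now + FIVE_MINUTES_MS, used: false } };
}

// Single-use: consuming a valid token marks the record used.
function consumeToken(record: ResetRecord, presented: string, now: number): boolean {
  const hash = createHash("sha256").update(presented).digest("hex");
  if (record.used || now > record.expiresAt || hash !== record.tokenHash) return false;
  record.used = true;
  return true;
}
```

Invalidating old unused tokens on deploy then reduces to deleting (or expiring) all outstanding records.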
---

## Platform Future / Phase 2

- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.

---

## Cleaning Rule

- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.