Refresh open launch threads after provisioning billing support work

jester 2026-05-03 19:50:56 +00:00
parent fa575c45f4
commit 58c43fd0c9


---
## Launch Active
### Final launch smoke test
- create Minecraft server
- confirm it reaches `Ready` / `connectable=true`
- verify public game hostname is shown only when connectable
- upload datapack on vanilla or install mod on supported modded runtime
- create backup
- restore backup
- stop/start/restart host lifecycle actions
- delete server
- confirm Velocity unregister, Cloudflare cleanup, and Technitium cleanup
### 1. Portal terminal reliability
- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
- WebSocket connect timeout in `TerminalView`.
- Error path closes/clears current socket refs and sender callback.
- `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
- Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.
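The required timeout behavior could be sketched as below; this is a minimal sketch under assumptions, not `TerminalView`'s actual API — the `WsLike` shape, `makeSocket` factory, and 10-second default are all illustrative.

```typescript
// Sketch: wrap a WebSocket-like object so a connection that never fires
// `onopen` is torn down instead of leaving the UI stuck at "Connecting...".
// On both timeout and error, the socket is closed so refs can be cleared
// and the button can recover to "Open Console" / "Reconnect".
type WsLike = {
  onopen: (() => void) | null;
  onerror: (() => void) | null;
  close: () => void;
};

function connectWithTimeout(
  makeSocket: () => WsLike,
  timeoutMs = 10_000, // assumed default, not a documented value
): Promise<WsLike> {
  return new Promise((resolve, reject) => {
    const ws = makeSocket();
    const timer = setTimeout(() => {
      ws.close(); // tear down the half-open socket so state can recover
      reject(new Error("connect timeout"));
    }, timeoutMs);
    ws.onopen = () => {
      clearTimeout(timer);
      resolve(ws);
    };
    ws.onerror = () => {
      clearTimeout(timer);
      ws.close();
      reject(new Error("connect error"));
    };
  });
}
```

The caller's `catch` branch is where `ServerConsole` would reset its streaming state and sender callback.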
### Portal public-site QA
- marketing site now has hybrid SaaS structure and SEO landing pages; verify visually before public push
- check desktop and mobile layouts for Home, Features, Pricing, FAQ, About, Support, and SEO landing pages
- confirm public CTAs route correctly to Register, Login, Pricing, Features, FAQ, and SEO pages
- confirm root metadata, page metadata, and hero copy reflect browser dev environments + managed server hosting
- mobile web is not yet optimized; do a targeted usability pass before driving launch marketing traffic to it
### 2. Monitoring / observability readiness
- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
- restrict public exposure for Prometheus/Grafana/node_exporter
- validate game/dev discovery sync and stale target removal
- ensure Grafana dashboards are installed and useful
- expose API health/app-level state
- decide centralized logs/Loki timing or accept as launch risk
- add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`
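One plausible shape for the queue-staleness check: a queue is stale when it has waiting work but nothing has completed recently. This is a sketch under assumptions — the `QueueStats` shape and thresholds are illustrative; the real check would read BullMQ job counts for each queue.

```typescript
// Sketch: staleness = pending work with no recent progress.
// Field names and thresholds are assumptions, not the deployed contract.
interface QueueStats {
  waiting: number;                 // jobs queued but not picked up
  lastCompletedAt: number | null;  // epoch ms of newest finished job
}

function isQueueStale(stats: QueueStats, nowMs: number, thresholdMs: number): boolean {
  if (stats.waiting === 0) return false;           // nothing pending -> not stale
  if (stats.lastCompletedAt === null) return true; // pending but never progressed
  return nowMs - stats.lastCompletedAt > thresholdMs;
}
```

Each of `provisioning`, `repair`, and `billing_enforcement` would likely get its own threshold, since their normal job durations differ.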
### Backup / restore polish
- happy-path local Minecraft backup create/restore has been verified live
- API restore starts asynchronously and Portal polls restore status
- keep restore wording/status transitions clear through completion and restart
- confirm checkpoint metadata presentation remains clean when exposed to Portal
- later hardening: persist last restore failure/checkpoint state in Agent `/status`
- later hardening: automatic rollback from pre-restore checkpoint if restore apply/start fails after destructive replace
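The asynchronous-restore-plus-polling flow above could be sketched as follows; the status names, interval, and attempt cap are assumptions, not the actual Portal/API contract.

```typescript
// Sketch of the Portal-side poll loop: keep asking for restore status
// until a terminal state is reached, then let the UI render the result.
type RestoreStatus = "pending" | "running" | "completed" | "failed";

async function pollRestore(
  fetchStatus: () => Promise<RestoreStatus>, // stand-in for the status API call
  intervalMs = 2_000,
  maxAttempts = 150, // ~5 minutes at the default interval
): Promise<RestoreStatus> {
  for (let i = 0; i < maxAttempts; i++) {
    const status = await fetchStatus();
    if (status === "completed" || status === "failed") return status;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return "failed"; // treat a poll timeout as failure so the UI can recover
}
```

Treating poll exhaustion as `failed` keeps the wording/status transitions unambiguous through completion and restart.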
### 3. Patch management / maintenance window policy
- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.
### Service discovery / launch validation
- service discovery migration audit for remaining non-launch hot-path references
- provisioning validation across current API/Agent/Portal assumptions
- keep public exposure model explicit:
- Portal public
- Minecraft game hostnames public as needed
- API/control plane/internal bridge/agent/admin services private
### 4. Notepad / announcements / messaging retest
- Billing announcements and support ticket flow were validated live.
- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.
### Monitoring / observability
- core lifecycle monitoring is launch-ready
- `/etc/zlh-monitor` is now the operational monitoring source of truth
- game/dev monitoring uses API discovery -> monitor sync -> file_sd for lifecycle inventory and add/remove validation
- container Alloy remote-write to Prometheus `10.60.0.25:9090` is the canonical game/dev metrics path
- `game-dev-alloy` scrape health also works because container Alloy now listens on `0.0.0.0:12345`
- remaining future work lives in `Codex/Monitoring/OPEN_ITEMS.md`: centralized logs/Loki and optional OPNsense router-only monitoring
- keep OPNsense plugin/PBS monitoring as explicit platform exceptions while Linux-managed game/dev targets converge on Alloy
### 5. Final integrated smoke test
Run this after the launch blockers above are clean:
- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.
### Notifications / launch polish
- email notifications across backend contract + Portal UX
- billing launch validation:
- plan limit gating verified in Portal
- still verify checkout/portal/webhook/upgrade-downgrade if Stripe is live
---
## Current Architecture Validated / No Longer Blocking
### Provisioning worker / async create
- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.
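The create-or-reuse pattern described above can be sketched like this; a `Map` stands in for the durable operations table and a plain function for the BullMQ add, so all names and shapes here are illustrative.

```typescript
// Sketch of the idempotency guard behind POST /api/instances:
// reuse the durable operation for duplicate requests, enqueue only once,
// and always answer 202 Accepted.
interface ProvisionOp { id: string; status: "queued" | "running" | "done"; }

function createInstance(
  ops: Map<string, ProvisionOp>,   // keyed by idempotency key (stand-in for DB)
  idempotencyKey: string,
  enqueue: (opId: string) => void, // stand-in for the BullMQ add
): { httpStatus: 202; op: ProvisionOp } {
  const existing = ops.get(idempotencyKey);
  if (existing) {
    // Duplicate request: reuse the durable operation, do not enqueue again.
    return { httpStatus: 202, op: existing };
  }
  const op: ProvisionOp = { id: `op-${ops.size + 1}`, status: "queued" };
  ops.set(idempotencyKey, op);
  enqueue(op.id); // hand off to the provisioning worker
  return { httpStatus: 202, op };
}
```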
### Controller / repair foundation
- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller is currently expected to run in dry-run unless deliberately enabling Level 1 repair.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
- stale provisioning operation → `stale`
- live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.
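A singleton controller lock of this kind is typically an expiring Redis `SET ... NX PX` plus an owner-checked release. The sketch below uses an in-memory store in place of Redis; the key name, TTL, and method names are assumptions, not `zlh-controller`'s actual implementation.

```typescript
// Sketch of the singleton-controller lock: acquire with NX + TTL semantics
// so a crashed controller's lock expires on its own.
class LockStore {
  private held = new Map<string, { owner: string; expiresAt: number }>();

  // Mirrors Redis `SET key owner NX PX ttl`: succeed only if unheld or expired.
  acquire(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const cur = this.held.get(key);
    if (cur && cur.expiresAt > nowMs) return false;
    this.held.set(key, { owner, expiresAt: nowMs + ttlMs });
    return true;
  }

  // Release only if we still own the lock -- the check-and-delete that a
  // real deployment would do atomically, e.g. via a Lua script.
  release(key: string, owner: string): boolean {
    const cur = this.held.get(key);
    if (!cur || cur.owner !== owner) return false;
    this.held.delete(key);
    return true;
  }
}
```

The owner check on release matters: without it, a controller whose lock already expired could delete a lock now held by its replacement.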
### Billing enforcement
- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
- payment failed / warning flow
- Stripe replay idempotency
- final warning / backup block state
- suspension / shutdown safety
- API gates while suspended
- controller does not repair suspended game servers
- payment restored flow
- destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.
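The Stripe replay idempotency validated above comes down to deduplicating on the event id, since Stripe retries webhook deliveries. A minimal sketch, with a `Set` standing in for the durable processed-events table:

```typescript
// Sketch of webhook replay protection: each Stripe `event.id` is applied
// at most once; a replayed delivery is acknowledged but changes nothing.
function handleStripeEvent(
  processed: Set<string>,                          // stand-in for durable storage
  event: { id: string; type: string },
  apply: (event: { id: string; type: string }) => void,
): boolean {
  if (processed.has(event.id)) return false; // replay: acknowledge, do nothing
  processed.add(event.id); // in real code, recorded in one transaction with apply
  apply(event);
  return true;
}
```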
### Support ticket path
- `POST /api/support/create` exists and creates a `SupportTicket` DB row with a human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.
---
- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit
- billing endpoint/path cleanup verification
### Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config
- PBS / platform backup strategy is the durability / disaster-recovery layer
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow
- do not track PBS/offsite durability work as agent implementation work unless that ownership changes
---
## Recently Verified / No Longer Considered Blocked
- password reset and logged-in change-password work end-to-end
- password reset tokens are 5-minute, hashed at rest, single-use, and old unused tokens are invalidated on deploy
- API-owned Minecraft connection state derives from agent readiness, edge/DNS state, Velocity registration, and backend ping
- Velocity proxy lifecycle callbacks are live with `registered_with_proxy` and `proxy_ping_ok` landing in API state
- Portal consumes API-owned `connectable` / `connection` state and no longer infers Minecraft readiness itself
- Portal server creation redirects to `/servers` and tracks setup progress there
- Portal status labels no longer treat all non-connectable states as `Needs attention`
- Portal public marketing site now has hybrid conversion + SEO structure
- Portal pricing tiers are now Starter / Pro / Performance workload tiers rather than Minecraft-only tier names
- Portal root metadata, homepage hero copy, and fake CLI line were corrected to match actual product capabilities
- SEO landing pages were added for Minecraft hosting, modded Minecraft hosting, and browser dev environments
- local Minecraft backup create/restore works end-to-end on live validation
- restore creates an intentional pre-restore checkpoint, and the API now starts the restore asynchronously instead of holding the full request open
- backup timestamps are normalized and pre-restore checkpoints are filtered from the default backup list
- agent-backed file edits create shadow copies for revert and API route/stream forwarding issues were fixed
- vanilla datapack upload works
- vanilla Mods UI is hidden and direct vanilla `mods/` upload is rejected by API
- NeoForge mod search/install/list works
- delete/teardown lifecycle removes Velocity, Cloudflare, and Technitium records
- public exposure model is in place: Portal public, control plane private
- vanilla / fabric runtime split is restored:
- `vanilla` = Fabric-based internal profile with proxy/API/config injection
- `fabric` = plain Fabric jar delivery only
- Forge / Neoforge first-start flow avoids premature readiness gating, applies post-start property enforcement, and restarts through the readiness-aware path
- current validation indicates Minecraft server creation succeeds across supported runtime variants
- current validation indicates dev container creation succeeds and hosted IDE access still works after the latest API/Portal runtime and cleanup passes
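The password-reset token policy noted above (5-minute, hashed at rest, single-use) could be sketched as follows; the store shape and function names are illustrative, not the actual implementation.

```typescript
// Sketch: persist only the SHA-256 of the token, enforce a 5-minute TTL,
// and delete on first successful redeem so tokens are single-use.
import { createHash, randomBytes } from "node:crypto";

const TOKEN_TTL_MS = 5 * 60 * 1000; // the stated 5-minute policy

function issueToken(store: Map<string, number>, nowMs: number): string {
  const token = randomBytes(32).toString("hex"); // sent to the user, never stored
  const digest = createHash("sha256").update(token).digest("hex");
  store.set(digest, nowMs + TOKEN_TTL_MS); // only the hash is persisted
  return token;
}

function redeemToken(store: Map<string, number>, token: string, nowMs: number): boolean {
  const digest = createHash("sha256").update(token).digest("hex");
  const expiresAt = store.get(digest);
  if (expiresAt === undefined || expiresAt < nowMs) return false;
  store.delete(digest); // single-use: gone after the first redeem
  return true;
}
```

Invalidating old unused tokens on deploy, as described, would just clear the store (or the user's rows) at issue time.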
---
## Platform Future / Phase 2
- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.
---
## Cleaning Rule
- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.