diff --git a/Session_Summaries/2026-05-03-launch-autonomy-billing-support-update.md b/Session_Summaries/2026-05-03-launch-autonomy-billing-support-update.md
new file mode 100644
index 0000000..f993404
--- /dev/null
+++ b/Session_Summaries/2026-05-03-launch-autonomy-billing-support-update.md
@@ -0,0 +1,245 @@
+# ZLH Session Summary — Launch Autonomy, Billing, and Support Update
+
+Date: 2026-05-03
+
+## Summary
+
+This session moved several major launch blockers from design/risk into implemented and validated status. The biggest completed areas were asynchronous provisioning, controller/reconciler foundation, billing enforcement, and the support ticket path.
+
+## Completed / Validated
+
+### Provisioning worker / async create
+
+Status: launch-ready.
+
+Implemented and validated:
+
+```text
+POST /api/instances now creates/reuses a durable ProvisioningOperation
+API returns 202 Accepted quickly
+BullMQ provisioning worker consumes jobs from provisioning queue
+zpack-provision-worker.service installed and running under systemd
+Portal async pending cards show queued/running phases and replace with real server cards
+Game provisioning through worker validated
+Dev provisioning through worker validated
+API teardown still works for worker-created servers
+Duplicate/idempotency guards validated
+Controlled failure handling validated
+```
+
+Important behavior:
+
+```text
+Provisioning is no longer run inside the HTTP request lifecycle.
+Portal sends Idempotency-Key.
+Operation state is pollable.
+Worker concurrency remains 1.
+Unsafe automatic retries remain disabled.
+```
+
+### Controller / reconciler foundation
+
+Status: implemented, validated, currently conservative.
+
+Implemented and validated:
+
+```text
+zlh-controller.service exists as singleton controller/reconciler with Redis lock
+zpack-repair-worker.service handles Level 1 repair jobs
+Discord notifications wired
+RepairEvent persistence added
+clear_stale_operation_lock validated
+live Cloudflare SRV drift detection validated
+edge_republish restored deleted Cloudflare SRV record through existing edge publish path
+Level 2 and Level 3 repairs remain disabled
+```
+
+Current operating posture:
+
+```text
+Controller is expected to remain in dry-run unless deliberately enabling Level 1 repairs.
+Repair worker is live.
+Level 1 repair path is proven.
+No destructive repairs are automatic.
+```
+
+### Billing enforcement / overdue handling
+
+Status: backend launch-ready.
+
+Implemented and validated:
+
+```text
+BillingEnforcementState
+BillingEnforcementEvent
+StripeEventLog
+Stripe event idempotency
+payment_failed warning flow
+final warning / backup block state
+suspension / shutdown state
+payment restored flow
+API billing gates while suspended
+controller does not repair suspended game servers
+billing worker installed and running under systemd
+billing announcements visible in Portal
+```
+
+Service:
+
+```text
+zpack-billing-worker.service installed and clean under systemd
+```
+
+Safety guarantees validated:
+
+```text
+No customer data deleted
+No backups deleted
+No DNS records deleted
+No Velocity records deleted
+No containers deleted
+Destructive billing actions are rejected and audited
+Suspended servers are not repaired back to connectable/running state
+```
+
+Remaining billing follow-ups are fixture validation only:
+
+```text
+File read/list against a responsive Agent
+Backup mutation route validation with a game backup fixture
+```
+
+### Support ticket path
+
+Status: launch-ready.
+
+Implemented and validated:
+
+```text
+POST /api/support/create exists
+SupportTicket DB model and migration added
+Human-readable ticket number: ZLH-YYYYMMDD-XXXX
+Portal form submits successfully
+Customer acknowledgement email received
+Discord #support alert received
+SupportTicket DB row created
+```
+
+Post-launch enhancements only:
+
+```text
+Admin ticket list/view
+Support triage diagnostics
+Self-hosted helpdesk integration
+Inbound email reply parsing
+Attachments
+```
+
+## Current launch service set
+
+```text
+zpack-api.service
+zpack-provision-worker.service
+zpack-repair-worker.service
+zlh-controller.service
+zpack-billing-worker.service
+```
+
+Launch guardrail:
+
+```text
+Do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.
+```
+
+## Remaining launch-active work
+
+### Portal terminal reliability
+
+Issue: console can hang at Connecting.
+
+Required:
+
+```text
+WebSocket connect timeout
+Error path clears socket refs
+isStreaming resets on closed/error/idle
+Button recovers to Open Console/Reconnect
+Validate console still works after fix
+```
+
+### Monitoring / observability readiness
+
+Still a major infrastructure item.
+
+Remaining:
+
+```text
+Restrict Prometheus/Grafana/node_exporter exposure
+Fix game/dev discovery sync
+Remove stale file_sd targets
+Install Grafana dashboards
+Add API health/app scrape
+Add lifecycle visibility
+Add or explicitly defer centralized logs/Loki
+Tighten monitoring token storage
+Add/verify queue staleness visibility for provisioning, repair, billing_enforcement
+```
+
+### Patch management / maintenance window policy
+
+Needs written policy/runbook:
+
+```text
+Normal maintenance window cadence/timezone
+Emergency maintenance behavior
+Customer notification expectations
+Patch order and rollback expectations
+```
+
+### Notepad / messaging retest
+
+Announcements are validated for billing and support context. Still validate:
+
+```text
+notepad/notes load and save
+persistence after reload/login
+permissions
+empty/error states
+```
+
+### Final integrated smoke test
+
+Run after the above launch blockers are clean:
+
+```text
+Game lifecycle: create -> ready/connectable -> console -> files -> backup -> restore -> delete -> DNS/Velocity/Cloudflare cleanup
+Dev lifecycle: create -> hosted IDE -> stop/restart/delete -> cleanup
+Security: Agent auth fail-closed, non-owner blocked, browser does not expose internal secrets
+```
+
+## Issues likely ready to close or supersede
+
+```text
+#9 Support email/ticket path — resolved
+#11 Provisioning worker / async create — resolved
+#10 Multi-create modal confusion — likely resolved by async inline cards; quick two-create validation or close as covered by #11
+#6 Non-payment grace flow — superseded by #14 billing enforcement
+```
+
+## Issues still active
+
+```text
+#12 Portal terminal reliability
+#5 Monitoring / observability readiness
+#7 Patch management / maintenance window policy
+#8 Notepad / announcements / messaging retest
+#4 Integrated Portal/API/Agent smoke test
+#13 Controller/reconciler — keep dry-run soak / decide Level 1 live posture
+#14 Billing enforcement — core resolved; minor fixture validation remains
+```
+
+## Notes
+
+- Support is launch-ready with ZLH-native DB ticket + customer email + Discord alert.
+- A self-hosted helpdesk such as FreeScout/Zammad can be considered post-launch, but ZLH should keep its SupportTicket intake/audit record either way.
+- Controller should not parse support ticket text or auto-run repairs from free text at launch. Post-launch support triage may add tags, read-only diagnostics, and suggested actions.
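
Reviewer sketch (not part of the patch): the summary gives the human-readable ticket number only as the shape `ZLH-YYYYMMDD-XXXX`. The snippet below is a minimal illustration of one way to produce that shape, not the project's implementation. The function name `make_ticket_number`, the random suffix, the character alphabet, and the use of the UTC date are all assumptions; the summary does not say whether `XXXX` is random or sequential, and the real service may well be written in a different language.

```python
import secrets
from datetime import date, datetime, timezone
from typing import Optional

# Assumed alphabet: digits and uppercase letters without 0/O/1/I, so a ticket
# number is easy to read back over chat or email. The summary does not
# specify which characters XXXX may contain.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def make_ticket_number(today: Optional[date] = None) -> str:
    """Return a string shaped like ZLH-YYYYMMDD-XXXX (hypothetical helper)."""
    if today is None:
        # Assumption: the date part is taken in UTC.
        today = datetime.now(timezone.utc).date()
    # Random 4-character suffix; collisions are possible, so a caller would
    # still need a unique constraint on the stored ticket number plus a retry.
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(4))
    return f"ZLH-{today:%Y%m%d}-{suffix}"
```

A call such as `make_ticket_number(date(2026, 5, 3))` yields a value beginning with `ZLH-20260503-`. Because a random four-character suffix can repeat within a day, this sketch only makes sense alongside a uniqueness check on the stored SupportTicket row.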