ZLH Session Summary — Launch Autonomy, Billing, and Support Update

Date: 2026-05-03

Summary

This session moved several major launch blockers from design or open-risk status to implemented and validated. The biggest completed areas were asynchronous provisioning, the controller/reconciler foundation, billing enforcement, and the support ticket path.

Completed / Validated

Provisioning worker / async create

Status: launch-ready.

Implemented and validated:

POST /api/instances now creates/reuses a durable ProvisioningOperation
API returns 202 Accepted quickly
BullMQ provisioning worker consumes jobs from provisioning queue
zpack-provision-worker.service installed and running under systemd
Portal async pending cards show queued/running phases and replace with real server cards
Game provisioning through worker validated
Dev provisioning through worker validated
API teardown still works for worker-created servers
Duplicate/idempotency guards validated
Controlled failure handling validated

Important behavior:

Provisioning is no longer run inside the HTTP request lifecycle.
Portal sends Idempotency-Key.
Operation state is pollable.
Worker concurrency remains 1.
Unsafe automatic retries remain disabled.
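
The create-or-reuse behavior above can be sketched as follows. This is a minimal illustration in Python, not the ZLH implementation: the in-memory `OperationStore`, the field names, and the phase names other than queued/running are assumptions, and the key is treated as globally unique rather than scoped per user.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ProvisioningOperation:
    """Durable record of one create request (hypothetical field names)."""
    id: str
    idempotency_key: str
    phase: str = "queued"  # assumed phases: queued -> running -> succeeded | failed


@dataclass
class OperationStore:
    """In-memory stand-in for the database table and the provisioning queue."""
    by_key: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)

    def create_or_reuse(self, idempotency_key: str):
        """Return (HTTP status, operation) without provisioning inline."""
        existing = self.by_key.get(idempotency_key)
        if existing is not None:
            # Same key seen before: return the same operation, enqueue nothing.
            return 202, existing
        op = ProvisioningOperation(id=str(uuid.uuid4()), idempotency_key=idempotency_key)
        self.by_key[idempotency_key] = op
        self.queue.append(op.id)  # a worker consumes this later
        return 202, op

    def poll(self, op_id: str) -> str:
        """Look up the current phase, as a status endpoint might."""
        for op in self.by_key.values():
            if op.id == op_id:
                return op.phase
        raise KeyError(op_id)


store = OperationStore()
status_a, op_a = store.create_or_reuse("key-123")
status_b, op_b = store.create_or_reuse("key-123")  # retried request, same key
```

Both calls return 202 with the same operation id, and only one job is queued, which is the property the duplicate/idempotency validation checks.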

Controller / reconciler foundation

Status: implemented, validated, currently conservative.

Implemented and validated:

zlh-controller.service exists as singleton controller/reconciler with Redis lock
zpack-repair-worker.service handles Level 1 repair jobs
Discord notifications wired
RepairEvent persistence added
clear_stale_operation_lock validated
live Cloudflare SRV drift detection validated
edge_republish restored deleted Cloudflare SRV record through existing edge publish path
Level 2 and Level 3 repairs remain disabled

Current operating posture:

Controller is expected to remain in dry-run unless Level 1 repairs are deliberately enabled.
Repair worker is live.
Level 1 repair path is proven.
No destructive repairs are automatic.
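
A rough sketch of the posture above, assuming an in-memory stand-in for the Redis lock and hypothetical names (`FakeLockStore`, `decide`, the lock key). The fake approximates `SET key value NX PX ttl` plus owner-checked renewal; the real controller's lock logic is not shown in these notes.

```python
import time


class FakeLockStore:
    """In-memory stand-in for a Redis lock taken with SET ... NX PX."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (owner, expires_at)

    def acquire(self, key: str, owner: str, ttl_seconds: float) -> bool:
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now and held[0] != owner:
            return False  # another live controller holds the lock
        self._locks[key] = (owner, now + ttl_seconds)  # take or renew
        return True


ENABLED_LEVELS = {1}  # Level 2 and Level 3 stay disabled


def decide(repair_level: int, dry_run: bool) -> str:
    """Return what the controller does with a detected drift."""
    if repair_level not in ENABLED_LEVELS:
        return "skip"     # disabled level: never queued
    if dry_run:
        return "report"   # notify only, no repair job
    return "enqueue"      # hand off to the repair worker


fake_now = [0.0]
locks = FakeLockStore(clock=lambda: fake_now[0])
first_holder = locks.acquire("zlh-controller", "instance-a", 30)
second_holder = locks.acquire("zlh-controller", "instance-b", 30)
```

Here `instance-b` is refused while `instance-a` holds an unexpired lock, and `decide(1, dry_run=True)` only reports, which matches the conservative default.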

Billing enforcement / overdue handling

Status: backend launch-ready.

Implemented and validated:

BillingEnforcementState
BillingEnforcementEvent
StripeEventLog
Stripe event idempotency
payment_failed warning flow
final warning / backup block state
suspension / shutdown state
payment restored flow
API billing gates while suspended
controller does not repair suspended game servers
billing worker installed and running under systemd
billing announcements visible in Portal
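
The flow listed above (event idempotency, warning, final warning, suspension, restore, API gate) might look roughly like this. It is a sketch under assumptions: the state names, the `escalate` trigger, and the in-memory collections standing in for StripeEventLog, BillingEnforcementState, and BillingEnforcementEvent are all illustrative.

```python
from dataclasses import dataclass, field

ESCALATION = ["active", "warning", "final_warning", "suspended"]  # assumed names


@dataclass
class BillingEnforcer:
    seen_event_ids: set = field(default_factory=set)  # stand-in for StripeEventLog
    state: dict = field(default_factory=dict)         # stand-in for BillingEnforcementState
    audit: list = field(default_factory=list)         # stand-in for BillingEnforcementEvent

    def _set(self, account: str, new_state: str, reason: str) -> None:
        old = self.state.get(account, "active")
        self.state[account] = new_state
        self.audit.append((account, old, new_state, reason))

    def handle_event(self, event_id: str, account: str, kind: str) -> str:
        """Apply a webhook event at most once, keyed by its event id."""
        if event_id in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event_id)
        if kind == "payment_failed":
            if self.state.get(account, "active") == "active":
                self._set(account, "warning", kind)
            return "applied"
        if kind == "payment_restored":
            self._set(account, "active", kind)
            return "applied"
        return "ignored"

    def escalate(self, account: str) -> str:
        """Advance one step; assumed to run when a grace period lapses."""
        current = self.state.get(account, "active")
        if current in ("active", "suspended"):
            return current  # nothing to escalate
        nxt = ESCALATION[ESCALATION.index(current) + 1]
        self._set(account, nxt, "grace_period_elapsed")
        return nxt

    def api_allows_start(self, account: str) -> bool:
        """Billing gate: suspended accounts cannot start servers."""
        return self.state.get(account, "active") != "suspended"


enforcer = BillingEnforcer()
first = enforcer.handle_event("evt_1", "acct_1", "payment_failed")
replay = enforcer.handle_event("evt_1", "acct_1", "payment_failed")  # redelivery
```

The replayed event returns "duplicate" and leaves the state untouched, so a redelivered webhook cannot escalate an account twice.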

Service:

zpack-billing-worker.service installed and running cleanly under systemd

Safety guarantees validated:

No customer data deleted
No backups deleted
No DNS records deleted
No Velocity records deleted
No containers deleted
Destructive billing actions are rejected and audited
Suspended servers are not repaired back to connectable/running state
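
One way to get the "rejected and audited" guarantee is an allow-list, so that anything not explicitly permitted, including every delete, is refused and recorded. The action names and function names below are hypothetical, and the sketch does not claim to match the actual worker.

```python
ALLOWED_ACTIONS = {"warn", "block_backups", "stop_server"}  # hypothetical names


def apply_billing_action(action: str, server_id: str, audit: list) -> bool:
    """Apply only allow-listed actions; record every attempt either way."""
    allowed = action in ALLOWED_ACTIONS
    audit.append({
        "server_id": server_id,
        "action": action,
        "result": "applied" if allowed else "rejected",
    })
    return allowed


def controller_should_repair(billing_state: str) -> bool:
    """Suspended servers are left stopped rather than repaired back to running."""
    return billing_state != "suspended"


audit_log = []
stopped = apply_billing_action("stop_server", "srv-1", audit_log)
deleted = apply_billing_action("delete_backups", "srv-1", audit_log)
```

An allow-list fails closed: a new destructive action added elsewhere in the codebase is rejected by default until someone deliberately lists it.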

Remaining billing follow-ups are fixture validation only:

File read/list against a responsive Agent
Backup mutation route validation with a game backup fixture

Support ticket path

Status: launch-ready.

Implemented and validated:

POST /api/support/create exists
SupportTicket DB model and migration added
Human-readable ticket number: ZLH-YYYYMMDD-XXXX
Portal form submits successfully
Customer acknowledgement email received
Discord #support alert received
SupportTicket DB row created
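
For illustration, a ticket number in the ZLH-YYYYMMDD-XXXX shape could be generated as below. The notes do not say what XXXX contains, so the random four-character suffix and its alphabet are assumptions; a sequence counter would fit the same pattern.

```python
import re
import secrets
from datetime import date
from typing import Optional

# Assumed suffix alphabet: uppercase letters and digits minus look-alikes (I, O, 0, 1).
SUFFIX_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
TICKET_RE = re.compile(r"^ZLH-\d{8}-[A-Z0-9]{4}$")


def make_ticket_number(today: date, suffix: Optional[str] = None) -> str:
    """Build a human-readable ticket number for the given day."""
    if suffix is None:
        suffix = "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(4))
    return f"ZLH-{today:%Y%m%d}-{suffix}"


example = make_ticket_number(date(2026, 5, 3), "A1B2")
# example == "ZLH-20260503-A1B2"
random_ticket = make_ticket_number(date(2026, 5, 3))
```

With a random suffix, collisions on a busy day are possible, so a unique constraint on the column plus a retry on conflict would be a sensible companion.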

Post-launch enhancements only:

Admin ticket list/view
Support triage diagnostics
Self-hosted helpdesk integration
Inbound email reply parsing
Attachments

Current launch service set

zpack-api.service
zpack-provision-worker.service
zpack-repair-worker.service
zlh-controller.service
zpack-billing-worker.service

Launch guardrail:

Do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.

Remaining launch-active work

Portal terminal reliability

Issue: console can hang at Connecting.

Required:

WebSocket connect timeout
Error path clears socket refs
isStreaming resets on closed/error/idle
Button recovers to Open Console/Reconnect
Validate console still works after fix
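
The required state handling can be sketched independently of the browser code. The Python below models it with `asyncio`: the `ConsoleSession` class, the `connect` callable, the 5-second default, and the "Disconnect" label are assumptions, and the Portal itself would do the equivalent around its WebSocket.

```python
import asyncio


class ConsoleSession:
    """Tracks socket ref, streaming flag, and button label for one console."""

    def __init__(self, connect, timeout_seconds: float = 5.0):
        self._connect = connect          # hypothetical async callable returning a socket
        self._timeout = timeout_seconds
        self.socket = None
        self.is_streaming = False
        self.button = "Open Console"

    async def open(self) -> bool:
        self.button = "Connecting"
        try:
            # Bound the connect attempt so the UI cannot sit at Connecting forever.
            self.socket = await asyncio.wait_for(self._connect(), self._timeout)
        except (asyncio.TimeoutError, OSError):
            self._reset("Reconnect")
            return False
        self.is_streaming = True
        self.button = "Disconnect"
        return True

    def on_closed(self) -> None:
        """Call on close, error, or idle so the session is always recoverable."""
        self._reset("Reconnect")

    def _reset(self, label: str) -> None:
        self.socket = None           # clear the socket ref
        self.is_streaming = False    # reset the streaming flag
        self.button = label          # give the user a way to retry


async def never_connects():
    await asyncio.sleep(10)


hung = ConsoleSession(never_connects, timeout_seconds=0.01)
hung_result = asyncio.run(hung.open())
```

After the timeout, `hung` has no socket, is not streaming, and shows "Reconnect", which is the recovery behavior the checklist asks for.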

Monitoring / observability readiness

Still a major infrastructure item.

Remaining:

Restrict Prometheus/Grafana/node_exporter exposure
Fix game/dev discovery sync
Remove stale file_sd targets
Install Grafana dashboards
Add API health/app scrape
Add lifecycle visibility
Add or explicitly defer centralized logs/Loki
Tighten monitoring token storage
Add/verify queue staleness visibility for provisioning, repair, billing_enforcement
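
For the queue staleness item, one simple signal is the age of the oldest waiting job per queue. The sketch below assumes that timestamp is already available for each queue; the metric name, the 300-second threshold, and the function names are hypothetical.

```python
QUEUES = ("provisioning", "repair", "billing_enforcement")


def staleness_seconds(oldest_waiting_at, now: float) -> float:
    """Age of the oldest waiting job; 0 when the queue is empty (None)."""
    if oldest_waiting_at is None:
        return 0.0
    return max(0.0, now - oldest_waiting_at)


def stale_queues(oldest_by_queue: dict, now: float, threshold_seconds: float = 300.0) -> list:
    """Names of queues whose oldest waiting job exceeds the threshold."""
    return [
        name for name in QUEUES
        if staleness_seconds(oldest_by_queue.get(name), now) > threshold_seconds
    ]


def render_metrics(oldest_by_queue: dict, now: float) -> str:
    """Render one gauge line per queue in Prometheus text exposition style."""
    lines = []
    for name in QUEUES:
        age = staleness_seconds(oldest_by_queue.get(name), now)
        lines.append(f'zlh_queue_oldest_waiting_seconds{{queue="{name}"}} {age:.0f}')
    return "\n".join(lines)


snapshot = {"provisioning": 100.0, "repair": None, "billing_enforcement": 950.0}
stale = stale_queues(snapshot, now=1000.0)
metrics_text = render_metrics(snapshot, now=1000.0)
```

With worker concurrency at 1, a growing oldest-job age is an early sign of a stuck worker even when the service still reports as running under systemd.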

Patch management / maintenance window policy

Needs written policy/runbook:

Normal maintenance window cadence/timezone
Emergency maintenance behavior
Customer notification expectations
Patch order and rollback expectations

Notepad / messaging retest

Announcements are validated in the billing and support contexts. Still to validate:

notepad/notes load and save
persistence after reload/login
permissions
empty/error states

Final integrated smoke test

Run after the above launch blockers are clean:

Game lifecycle: create -> ready/connectable -> console -> files -> backup -> restore -> delete -> DNS/Velocity/Cloudflare cleanup
Dev lifecycle: create -> hosted IDE -> stop/restart/delete -> cleanup
Security: Agent auth fail-closed, non-owner blocked, browser does not expose internal secrets
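
Because each lifecycle step depends on the one before it, a smoke run can stop at the first failure and report how far it got. The runner below is only a sketch: the step names are copied from the game lifecycle above, while `run_lifecycle` and the `check` callback are hypothetical.

```python
GAME_LIFECYCLE = [
    "create", "ready/connectable", "console", "files",
    "backup", "restore", "delete", "DNS/Velocity/Cloudflare cleanup",
]


def run_lifecycle(steps, check):
    """Run steps in order; return (passed_steps, first_failed_step_or_None)."""
    passed = []
    for step in steps:
        if not check(step):
            return passed, step  # later steps are not meaningful after a failure
        passed.append(step)
    return passed, None


# Example: pretend the backup step fails.
passed, failed = run_lifecycle(GAME_LIFECYCLE, lambda step: step != "backup")
```

In this example the run passes the first four steps, reports "backup" as the failure, and never attempts restore or delete against a server in an unknown state.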

Issues likely ready to close or supersede

#9 Support email/ticket path — resolved
#11 Provisioning worker / async create — resolved
#10 Multi-create modal confusion — likely resolved by async inline cards; quick two-create validation or close as covered by #11
#6 Non-payment grace flow — superseded by #14 billing enforcement

Issues still active

#12 Portal terminal reliability
#5 Monitoring / observability readiness
#7 Patch management / maintenance window policy
#8 Notepad / announcements / messaging retest
#4 Integrated Portal/API/Agent smoke test
#13 Controller/reconciler — keep dry-run soak / decide Level 1 live posture
#14 Billing enforcement — core resolved; minor fixture validation remains

Notes

  • Support is launch-ready with ZLH-native DB ticket + customer email + Discord alert.
  • A self-hosted helpdesk such as FreeScout/Zammad can be considered post-launch, but ZLH should keep its SupportTicket intake/audit record either way.
  • Controller should not parse support ticket text or auto-run repairs from free text at launch. Post-launch support triage may add tags, read-only diagnostics, and suggested actions.