zlh-grind/Session_Summaries/2026-05-03-launch-autonomy-billing-support-update.md

# ZLH Session Summary — Launch Autonomy, Billing, and Support Update
Date: 2026-05-03
## Summary
This session moved several major launch blockers from design/risk status to implemented and validated. The largest completed areas were asynchronous provisioning, the controller/reconciler foundation, billing enforcement, and the support ticket path.
## Completed / Validated
### Provisioning worker / async create
Status: launch-ready.
Implemented and validated:
```text
POST /api/instances now creates/reuses a durable ProvisioningOperation
API returns 202 Accepted quickly
BullMQ provisioning worker consumes jobs from provisioning queue
zpack-provision-worker.service installed and running under systemd
Portal async pending cards show queued/running phases and replace with real server cards
Game provisioning through worker validated
Dev provisioning through worker validated
API teardown still works for worker-created servers
Duplicate/idempotency guards validated
Controlled failure handling validated
```
Important behavior:
```text
Provisioning is no longer run inside the HTTP request lifecycle.
Portal sends Idempotency-Key.
Operation state is pollable.
Worker concurrency remains 1.
Unsafe automatic retries remain disabled.
```
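The create-or-reuse behavior behind `Idempotency-Key` can be illustrated with a minimal sketch. This is not the ZLH implementation: `OperationStore`, `handle_create`, the phase names, and the in-memory dict and list (standing in for the durable table and the provisioning queue) are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ProvisioningOperation:
    id: str
    idempotency_key: str
    phase: str = "queued"  # hypothetical phase names: queued -> running -> done/failed


class OperationStore:
    """In-memory stand-in for the durable operation table (hypothetical)."""

    def __init__(self):
        self._by_key = {}
        self.queue = []  # stand-in for the provisioning queue

    def create_or_reuse(self, owner_id, idempotency_key):
        """Return (operation, created). A repeated key reuses the same operation."""
        key = (owner_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            return existing, False
        op = ProvisioningOperation(id=str(uuid.uuid4()), idempotency_key=idempotency_key)
        self._by_key[key] = op
        self.queue.append(op.id)  # enqueue exactly once per new operation
        return op, True


def handle_create(store, owner_id, idempotency_key):
    """Sketch of the request handler: no provisioning work happens here."""
    op, _created = store.create_or_reuse(owner_id, idempotency_key)
    # 202 Accepted: the caller polls the operation for its current phase.
    return 202, {"operationId": op.id, "phase": op.phase}
```

In this sketch a retried request with the same key returns the same operation ID and does not enqueue a second job, which is the property the duplicate/idempotency guards are meant to protect.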
### Controller / reconciler foundation
Status: implemented, validated, currently conservative.
Implemented and validated:
```text
zlh-controller.service exists as singleton controller/reconciler with Redis lock
zpack-repair-worker.service handles Level 1 repair jobs
Discord notifications wired
RepairEvent persistence added
clear_stale_operation_lock validated
live Cloudflare SRV drift detection validated
edge_republish restored deleted Cloudflare SRV record through existing edge publish path
Level 2 and Level 3 repairs remain disabled
```
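The singleton guarantee from a Redis lock usually rests on the semantics of `SET key value NX PX ttl`: only one caller can set the key, and the key expires if the holder dies. The sketch below mimics those semantics in memory so it runs without Redis. `FakeRedis`, `try_become_leader`, the key name, and the TTL are assumptions, not the controller's actual code.

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics SET key value NX PX ttl (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at in seconds)
        self._clock = clock

    def set_nx_px(self, key, value, ttl_ms):
        """Set the key only if it is absent or expired; return True on success."""
        now = self._clock()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # another holder still owns the lock
        self._data[key] = (value, now + ttl_ms / 1000.0)
        return True

    def get(self, key):
        current = self._data.get(key)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]


def try_become_leader(store, instance_id, ttl_ms=30_000, key="zlh:controller:lock"):
    """Only the instance that wins the lock should run the reconcile loop."""
    return store.set_nx_px(key, instance_id, ttl_ms)
```

A real deployment would also need to renew the lock while the holder is healthy and to release it only if the stored value still matches the holder's ID; both are omitted here.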
Current operating posture:
```text
Controller is expected to remain in dry-run unless Level 1 repairs are deliberately enabled.
Repair worker is live.
Level 1 repair path is proven.
No destructive repairs are automatic.
```
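The operating posture above amounts to a small decision function. The following is a hedged sketch: `decide_repair`, `RepairDecision`, and the outcome strings are hypothetical names, and treating `edge_republish` as a Level 1 action is an assumption based on the validated items listed above.

```python
from dataclasses import dataclass

ENABLED_LEVELS = frozenset({1})  # Level 2 and Level 3 repairs remain disabled


@dataclass(frozen=True)
class RepairDecision:
    action: str
    outcome: str  # "enqueue", "dry_run", or "rejected"
    reason: str


def decide_repair(action, level, dry_run=True, suspended=False):
    """Decide what the controller does with a detected drift (illustrative only)."""
    if suspended:
        # Billing-suspended servers must not be repaired back to a running state.
        return RepairDecision(action, "rejected", "billing suspended")
    if level not in ENABLED_LEVELS:
        return RepairDecision(action, "rejected", f"level {level} disabled")
    if dry_run:
        # Default posture: report what would happen, enqueue nothing.
        return RepairDecision(action, "dry_run", "would repair; dry-run mode")
    return RepairDecision(action, "enqueue", "level 1 repair enabled")
```

With the defaults, `decide_repair("edge_republish", 1)` yields a `dry_run` outcome; a job is enqueued only when `dry_run=False` is passed explicitly, which mirrors the "deliberately enabled" posture.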
### Billing enforcement / overdue handling
Status: backend launch-ready.
Implemented and validated:
```text
BillingEnforcementState
BillingEnforcementEvent
StripeEventLog
Stripe event idempotency
payment_failed warning flow
final warning / backup block state
suspension / shutdown state
payment restored flow
API billing gates while suspended
controller does not repair suspended game servers
billing worker installed and running under systemd
billing announcements visible in Portal
```
Service:
```text
zpack-billing-worker.service installed and clean under systemd
```
Safety guarantees validated:
```text
No customer data deleted
No backups deleted
No DNS records deleted
No Velocity records deleted
No containers deleted
Destructive billing actions are rejected and audited
Suspended servers are not repaired back to connectable/running state
```
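Two of the guarantees above, Stripe event idempotency and "destructive billing actions are rejected and audited", can be sketched together. Everything here is an assumption for illustration: the `BillingEnforcer` class, the action and state names, and the in-memory set and list standing in for `StripeEventLog` and `BillingEnforcementEvent` rows.

```python
# Allow-list of enforcement actions and the state each one leads to.
# Action and state names are hypothetical.
STATE_AFTER = {
    "payment_failed": "warning",
    "final_warning": "final_warning",  # backups blocked in this state
    "suspend": "suspended",
    "payment_restored": "active",
}


class BillingEnforcer:
    def __init__(self):
        self.seen_event_ids = set()  # stand-in for StripeEventLog
        self.audit = []              # stand-in for BillingEnforcementEvent rows
        self.state = "active"

    def handle_event(self, event_id, action):
        """Apply an action at most once per event ID; audit every decision."""
        if event_id in self.seen_event_ids:
            return "duplicate"  # replayed webhook: no state change, no new audit row
        self.seen_event_ids.add(event_id)
        if action not in STATE_AFTER:
            # Anything outside the allow-list, including deletes, is refused.
            self.audit.append((event_id, action, "rejected"))
            return "rejected"
        self.state = STATE_AFTER[action]
        self.audit.append((event_id, action, "applied"))
        return "applied"
```

Using an allow-list rather than a deny-list means a newly invented destructive action is rejected by default, which fits the "no data deleted" guarantees.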
Remaining billing follow-ups are fixture validation only:
```text
File read/list against a responsive Agent
Backup mutation route validation with a game backup fixture
```
### Support ticket path
Status: launch-ready.
Implemented and validated:
```text
POST /api/support/create exists
SupportTicket DB model and migration added
Human-readable ticket number: ZLH-YYYYMMDD-XXXX
Portal form submits successfully
Customer acknowledgement email received
Discord #support alert received
SupportTicket DB row created
```
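The `ZLH-YYYYMMDD-XXXX` format can be generated and checked as in the sketch below. The notes do not say how the `XXXX` suffix is produced, so the random suffix, the alphabet, and the function names are assumptions; a real implementation would also need a uniqueness check against the `SupportTicket` table.

```python
import re
import secrets
from datetime import date

# Assumption: 4 random characters from an alphabet without look-alikes (0/O, 1/I).
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
TICKET_RE = re.compile(r"^ZLH-\d{8}-[A-Z0-9]{4}$")


def make_ticket_number(today, suffix_len=4):
    """Build a ticket number such as ZLH-20260503-<suffix> for the given date."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
    return f"ZLH-{today:%Y%m%d}-{suffix}"


def is_valid_ticket_number(value):
    """Check the overall shape only; it does not verify the ticket exists."""
    return TICKET_RE.match(value) is not None
```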
Post-launch enhancements only:
```text
Admin ticket list/view
Support triage diagnostics
Self-hosted helpdesk integration
Inbound email reply parsing
Attachments
```
## Current launch service set
```text
zpack-api.service
zpack-provision-worker.service
zpack-repair-worker.service
zlh-controller.service
zpack-billing-worker.service
```
Launch guardrail:
```text
Do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.
```
## Remaining launch-active work
### Portal terminal reliability
Issue: console can hang at Connecting.
Required:
```text
WebSocket connect timeout
Error path clears socket refs
isStreaming resets on closed/error/idle
Button recovers to Open Console/Reconnect
Validate console still works after fix
```
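The required behavior is a connect timeout plus one reset path that every failure goes through. The Portal console presumably runs in the browser, so the Python below is only a language-neutral sketch of the state handling: `ConsoleSession`, its `connect` callable, the `button` field, and the "Connected" label are hypothetical.

```python
import asyncio


class ConsoleSession:
    def __init__(self, connect, timeout_s=5.0):
        self._connect = connect  # hypothetical coroutine function returning a socket-like object
        self.timeout_s = timeout_s
        self.socket = None
        self.is_streaming = False
        self.button = "Open Console"

    async def open(self):
        """Try to connect; never stay in Connecting longer than timeout_s."""
        self.button = "Connecting"
        try:
            self.socket = await asyncio.wait_for(self._connect(), self.timeout_s)
        except (asyncio.TimeoutError, OSError):
            self.reset("Reconnect")
            return False
        self.is_streaming = True
        self.button = "Connected"
        return True

    def reset(self, label="Open Console"):
        """Single recovery path for closed/error/idle: drop refs and flags."""
        self.socket = None
        self.is_streaming = False
        self.button = label
```

Routing closed, error, and idle events through one `reset` function is a design choice that makes it harder to leave a stale socket reference or a stuck `is_streaming` flag behind.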
### Monitoring / observability readiness
Status: open; still a major infrastructure item.
Remaining:
```text
Restrict Prometheus/Grafana/node_exporter exposure
Fix game/dev discovery sync
Remove stale file_sd targets
Install Grafana dashboards
Add API health/app scrape
Add lifecycle visibility
Add or explicitly defer centralized logs/Loki
Tighten monitoring token storage
Add/verify queue staleness visibility for provisioning, repair, billing_enforcement
```
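One simple way to express queue staleness is the age of the oldest waiting job per queue. The sketch below assumes a snapshot mapping queue name to that job's enqueue timestamp (or `None` when the queue is empty); how the snapshot is collected, the function names, and the 300-second threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueStatus:
    name: str
    oldest_age_s: float
    stale: bool


def check_queue(name, oldest_waiting_ts, now, threshold_s):
    """Age of the oldest waiting job; an empty queue has age 0 and is never stale."""
    age = 0.0 if oldest_waiting_ts is None else max(0.0, now - oldest_waiting_ts)
    return QueueStatus(name, age, age > threshold_s)


def stale_queues(oldest_by_queue, now, threshold_s=300.0):
    """Return the names of queues whose oldest waiting job exceeds the threshold."""
    return [
        name
        for name, ts in sorted(oldest_by_queue.items())
        if check_queue(name, ts, now, threshold_s).stale
    ]
```

For example, a snapshot of `{"provisioning": 1000.0, "repair": None, "billing_enforcement": 1290.0}` evaluated at `now=1400.0` flags only `provisioning`, whose oldest job has waited 400 seconds.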
### Patch management / maintenance window policy
Needs written policy/runbook:
```text
Normal maintenance window cadence/timezone
Emergency maintenance behavior
Customer notification expectations
Patch order and rollback expectations
```
### Notepad / messaging retest
Announcements are validated for billing and support context. Still to validate:
```text
notepad/notes load and save
persistence after reload/login
permissions
empty/error states
```
### Final integrated smoke test
Run after the above launch blockers are clean:
```text
Game lifecycle: create -> ready/connectable -> console -> files -> backup -> restore -> delete -> DNS/Velocity/Cloudflare cleanup
Dev lifecycle: create -> hosted IDE -> stop/restart/delete -> cleanup
Security: Agent auth fail-closed, non-owner blocked, browser does not expose internal secrets
```
## Issues likely ready to close or supersede
```text
#9 Support email/ticket path — resolved
#11 Provisioning worker / async create — resolved
#10 Multi-create modal confusion — likely resolved by async inline cards; quick two-create validation or close as covered by #11
#6 Non-payment grace flow — superseded by #14 billing enforcement
```
## Issues still active
```text
#12 Portal terminal reliability
#5 Monitoring / observability readiness
#7 Patch management / maintenance window policy
#8 Notepad / announcements / messaging retest
#4 Integrated Portal/API/Agent smoke test
#13 Controller/reconciler — keep dry-run soak / decide Level 1 live posture
#14 Billing enforcement — core resolved; minor fixture validation remains
```
## Notes
- Support is launch-ready with ZLH-native DB ticket + customer email + Discord alert.
- A self-hosted helpdesk such as FreeScout/Zammad can be considered post-launch, but ZLH should keep its SupportTicket intake/audit record either way.
- Controller should not parse support ticket text or auto-run repairs from free text at launch. Post-launch support triage may add tags, read-only diagnostics, and suggested actions.