# ZLH Session Summary — Launch Autonomy, Billing, and Support Update
Date: 2026-05-03

## Summary

This session moved several major launch blockers from design/risk into implemented and validated status. The biggest completed areas were asynchronous provisioning, controller/reconciler foundation, billing enforcement, and the support ticket path.
## Completed / Validated

### Provisioning worker / async create

Status: launch-ready.

Implemented and validated:

```text
POST /api/instances now creates/reuses a durable ProvisioningOperation
API returns 202 Accepted quickly
BullMQ provisioning worker consumes jobs from provisioning queue
zpack-provision-worker.service installed and running under systemd
Portal async pending cards show queued/running phases and replace with real server cards
Game provisioning through worker validated
Dev provisioning through worker validated
API teardown still works for worker-created servers
Duplicate/idempotency guards validated
Controlled failure handling validated
```
Important behavior:

```text
Provisioning is no longer run inside the HTTP request lifecycle.
Portal sends Idempotency-Key.
Operation state is pollable.
Worker concurrency remains 1.
Unsafe automatic retries remain disabled.
```
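
The create/reuse behavior described above can be sketched roughly as follows. This is an illustrative in-memory model only, not the ZLH implementation: `OperationStore`, `create_or_reuse`, `poll`, and the phase names are hypothetical, and Python is used purely for illustration. It shows a repeated request with the same Idempotency-Key getting the same operation back without a second job being enqueued.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ProvisioningOperation:
    """Stand-in for a durable operation row (here it only lives in memory)."""
    operation_id: str
    idempotency_key: str
    phase: str = "queued"  # hypothetical phase names: queued, running, ready, failed


@dataclass
class OperationStore:
    """Maps Idempotency-Key values to operations so retries reuse the same one."""
    by_key: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)  # stand-in for the provisioning queue

    def create_or_reuse(self, idempotency_key: str):
        """Return (http_status, operation); enqueue a job only for a new key."""
        existing = self.by_key.get(idempotency_key)
        if existing is not None:
            return 202, existing  # duplicate request: no second job is enqueued
        op = ProvisioningOperation(str(uuid.uuid4()), idempotency_key)
        self.by_key[idempotency_key] = op
        self.queue.append(op.operation_id)
        return 202, op

    def poll(self, idempotency_key: str) -> str:
        """Pollable operation state, as a pending card might read it."""
        return self.by_key[idempotency_key].phase


store = OperationStore()
status_a, op_a = store.create_or_reuse("key-123")
status_b, op_b = store.create_or_reuse("key-123")  # retry with the same key
```

In this sketch both calls return 202 and the same operation object, and the queue holds a single job.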
### Controller / reconciler foundation

Status: implemented, validated, currently conservative.

Implemented and validated:

```text
zlh-controller.service exists as singleton controller/reconciler with Redis lock
zpack-repair-worker.service handles Level 1 repair jobs
Discord notifications wired
RepairEvent persistence added
clear_stale_operation_lock validated
live Cloudflare SRV drift detection validated
edge_republish restored deleted Cloudflare SRV record through existing edge publish path
Level 2 and Level 3 repairs remain disabled
```
Current operating posture:

```text
Controller is expected to remain in dry-run unless deliberately enabling Level 1 repairs.
Repair worker is live.
Level 1 repair path is proven.
No destructive repairs are automatic.
```
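
One way to picture this posture is as a small gating function. The sketch below is an assumption-laden illustration, not the controller's actual code: `RepairCandidate`, `decide`, `MAX_AUTOMATIC_LEVEL`, and the returned strings are all hypothetical names. It encodes three rules stated in this summary: levels above 1 stay disabled, suspended servers are not repaired, and dry-run only reports.

```python
from dataclasses import dataclass

# Assumption: Level 1 is the only repair level that may run automatically.
MAX_AUTOMATIC_LEVEL = 1


@dataclass(frozen=True)
class RepairCandidate:
    action: str                     # e.g. "edge_republish"
    level: int                      # repair level; 2 and 3 remain disabled
    server_suspended: bool = False  # True when billing has suspended the server


def decide(candidate: RepairCandidate, dry_run: bool = True) -> str:
    """Return what a controller following this posture would do with a drift."""
    if candidate.level > MAX_AUTOMATIC_LEVEL:
        return "skip: level disabled"
    if candidate.server_suspended:
        return "skip: billing suspension"
    if dry_run:
        return "report only"
    return "enqueue repair job"
```

With the default `dry_run=True`, a Level 1 candidate such as `edge_republish` is only reported; a repair job is enqueued only after dry-run is deliberately turned off.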
### Billing enforcement / overdue handling

Status: backend launch-ready.

Implemented and validated:

```text
BillingEnforcementState
BillingEnforcementEvent
StripeEventLog
Stripe event idempotency
payment_failed warning flow
final warning / backup block state
suspension / shutdown state
payment restored flow
API billing gates while suspended
controller does not repair suspended game servers
billing worker installed and running under systemd
billing announcements visible in Portal
```
Service:

```text
zpack-billing-worker.service installed and clean under systemd
```
Safety guarantees validated:

```text
No customer data deleted
No backups deleted
No DNS records deleted
No Velocity records deleted
No containers deleted
Destructive billing actions are rejected and audited
Suspended servers are not repaired back to connectable/running state
```
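
The enforcement flow and the safety guarantees above can be modeled as a small state machine. This is a minimal sketch under stated assumptions: the state names, the `grace_elapsed` trigger, the `Enforcement` class, and the action names are hypothetical and do not reflect the real `BillingEnforcementState` schema or Stripe payloads. It illustrates idempotent event handling (a replayed event id changes nothing) and rejection plus auditing of destructive actions.

```python
from dataclasses import dataclass, field

# Hypothetical states and triggers; the real model may differ.
TRANSITIONS = {
    ("active", "payment_failed"): "warning",
    ("warning", "grace_elapsed"): "final_warning",
    ("final_warning", "grace_elapsed"): "suspended",
    ("warning", "payment_restored"): "active",
    ("final_warning", "payment_restored"): "active",
    ("suspended", "payment_restored"): "active",
}

# Hypothetical action names standing in for anything that deletes resources.
DESTRUCTIVE_ACTIONS = {"delete_data", "delete_backups", "delete_dns", "delete_container"}


@dataclass
class Enforcement:
    state: str = "active"
    seen_event_ids: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def handle_event(self, event_id: str, kind: str) -> str:
        """Apply an event once; replays of the same event id are ignored."""
        if event_id in self.seen_event_ids:
            return self.state
        self.seen_event_ids.add(event_id)
        # Unknown (state, kind) pairs leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, kind), self.state)
        return self.state

    def request_action(self, action: str) -> bool:
        """Reject destructive actions and record the attempt for audit."""
        if action in DESTRUCTIVE_ACTIONS:
            self.audit_log.append(("rejected", action, self.state))
            return False
        return True
```

In this model, suspension never deletes anything: a destructive request returns `False` and leaves an audit entry, and `payment_restored` returns the account to `active` from any enforcement state.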
Remaining billing follow-ups are fixture validation only:

```text
File read/list against a responsive Agent
Backup mutation route validation with a game backup fixture
```
### Support ticket path

Status: launch-ready.

Implemented and validated:

```text
POST /api/support/create exists
SupportTicket DB model and migration added
Human-readable ticket number: ZLH-YYYYMMDD-XXXX
Portal form submits successfully
Customer acknowledgement email received
Discord #support alert received
SupportTicket DB row created
```
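
The ticket number format above could be produced and checked along these lines. The summary does not say what `XXXX` encodes, so this sketch assumes a zero-padded per-day counter; `format_ticket_number` and `is_ticket_number` are hypothetical helper names, and the validator deliberately accepts any four digits or uppercase letters in case the suffix is not numeric.

```python
import re
from datetime import date

# Shape check for ZLH-YYYYMMDD-XXXX; the XXXX alphabet is an assumption.
TICKET_PATTERN = re.compile(r"ZLH-\d{8}-[0-9A-Z]{4}")


def format_ticket_number(day: date, sequence: int) -> str:
    """Build ZLH-YYYYMMDD-XXXX, assuming XXXX is a zero-padded daily counter."""
    if not 0 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    return f"ZLH-{day:%Y%m%d}-{sequence:04d}"


def is_ticket_number(text: str) -> bool:
    """Return True when the whole string has the ticket-number shape."""
    return TICKET_PATTERN.fullmatch(text) is not None
```

For example, `format_ticket_number(date(2026, 5, 3), 7)` yields `ZLH-20260503-0007` under these assumptions.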
Post-launch enhancements only:

```text
Admin ticket list/view
Support triage diagnostics
Self-hosted helpdesk integration
Inbound email reply parsing
Attachments
```
## Current launch service set

```text
zpack-api.service
zpack-provision-worker.service
zpack-repair-worker.service
zlh-controller.service
zpack-billing-worker.service
```
Launch guardrail:

```text
Do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.
```
## Remaining launch-active work

### Portal terminal reliability

Issue: console can hang at Connecting.

Required:

```text
WebSocket connect timeout
Error path clears socket refs
isStreaming resets on closed/error/idle
Button recovers to Open Console/Reconnect
Validate console still works after fix
```
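
The required behavior amounts to a small connection state model in which every failure path runs the same reset. The sketch below is language-neutral logic written in Python for illustration; it is not the Portal's code. `ConsoleState`, its method names, and the "Connected" label are hypothetical, while the "Connecting", "Open Console", and "Reconnect" labels come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsoleState:
    status: str = "idle"             # idle | connecting | streaming | closed | error
    socket: Optional[object] = None  # stand-in for the WebSocket reference
    is_streaming: bool = False

    def connect(self, socket: object) -> None:
        self.status, self.socket, self.is_streaming = "connecting", socket, False

    def on_open(self) -> None:
        if self.status == "connecting":
            self.status, self.is_streaming = "streaming", True

    def on_timeout(self) -> None:
        # Only a connection still stuck in "connecting" counts as timed out.
        if self.status == "connecting":
            self._reset("error")

    def on_close(self) -> None:
        self._reset("closed")

    def on_error(self) -> None:
        self._reset("error")

    def _reset(self, status: str) -> None:
        # Single reset path: drop the socket ref and clear the streaming flag.
        self.status, self.socket, self.is_streaming = status, None, False

    def button_label(self) -> str:
        if self.status == "connecting":
            return "Connecting"
        if self.status == "streaming":
            return "Connected"  # hypothetical label
        return "Open Console" if self.status == "idle" else "Reconnect"
```

Routing timeout, close, and error through one `_reset` is the design point: the button cannot stay on "Connecting" once any of those events fires.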
### Monitoring / observability readiness

Still a major infrastructure item.

Remaining:

```text
Restrict Prometheus/Grafana/node_exporter exposure
Fix game/dev discovery sync
Remove stale file_sd targets
Install Grafana dashboards
Add API health/app scrape
Add lifecycle visibility
Add or explicitly defer centralized logs/Loki
Tighten monitoring token storage
Add/verify queue staleness visibility for provisioning, repair, billing_enforcement
```
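
For the queue staleness item, one possible definition is the age of the oldest job still waiting in each queue. The sketch below assumes that definition and an arbitrary threshold; `staleness_seconds`, `stale_queues`, and the input shape (queue name mapped to the enqueue timestamp of its oldest waiting job, or `None` when empty) are hypothetical, and how those timestamps are collected is out of scope here.

```python
from typing import Optional


def staleness_seconds(oldest_waiting_enqueued_at: Optional[float], now: float) -> float:
    """Age of the oldest waiting job in seconds; an empty queue is not stale."""
    if oldest_waiting_enqueued_at is None:
        return 0.0
    return max(0.0, now - oldest_waiting_enqueued_at)


def stale_queues(oldest_by_queue: dict, now: float, threshold_seconds: float) -> list:
    """Return the names of queues whose oldest waiting job exceeds the threshold."""
    return sorted(
        name
        for name, enqueued_at in oldest_by_queue.items()
        if staleness_seconds(enqueued_at, now) > threshold_seconds
    )


# Example timestamps (seconds) for the three queues named above.
example = {"provisioning": 990.0, "repair": None, "billing_enforcement": 400.0}
flagged = stale_queues(example, now=1000.0, threshold_seconds=300.0)
```

Here only `billing_enforcement` is flagged, because its oldest waiting job is 600 seconds old against a 300 second threshold.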
### Patch management / maintenance window policy

Needs written policy/runbook:

```text
Normal maintenance window cadence/timezone
Emergency maintenance behavior
Customer notification expectations
Patch order and rollback expectations
```
### Notepad / messaging retest

Announcements are validated for billing and support context. Still validate:

```text
notepad/notes load and save
persistence after reload/login
permissions
empty/error states
```
### Final integrated smoke test

Run after the above launch blockers are clean:

```text
Game lifecycle: create -> ready/connectable -> console -> files -> backup -> restore -> delete -> DNS/Velocity/Cloudflare cleanup
Dev lifecycle: create -> hosted IDE -> stop/restart/delete -> cleanup
Security: Agent auth fail-closed, non-owner blocked, browser does not expose internal secrets
```
## Issues likely ready to close or supersede

```text
#9 Support email/ticket path — resolved
#11 Provisioning worker / async create — resolved
#10 Multi-create modal confusion — likely resolved by async inline cards; quick two-create validation or close as covered by #11
#6 Non-payment grace flow — superseded by #14 billing enforcement
```
## Issues still active

```text
#12 Portal terminal reliability
#5 Monitoring / observability readiness
#7 Patch management / maintenance window policy
#8 Notepad / announcements / messaging retest
#4 Integrated Portal/API/Agent smoke test
#13 Controller/reconciler — keep dry-run soak / decide Level 1 live posture
#14 Billing enforcement — core resolved; minor fixture validation remains
```
## Notes

- Support is launch-ready with ZLH-native DB ticket + customer email + Discord alert.
- A self-hosted helpdesk such as FreeScout/Zammad can be considered post-launch, but ZLH should keep its SupportTicket intake/audit record either way.
- Controller should not parse support ticket text or auto-run repairs from free text at launch. Post-launch support triage may add tags, read-only diagnostics, and suggested actions.