API — Current State
This file records what is believed to be implemented now in zpack-api.
Runtime / dependency baseline
- API is tracked against Node 24 with repo-local pinning via `package.json` `engines` and `.nvmrc`.
- API repo is JavaScript ESM; do not introduce TypeScript into API launch work unless a separate migration is explicitly approved.
- Direct `node-fetch` dependency has been removed; API code uses the built-in global `fetch`.
- Prisma config lives in a dedicated Prisma config file, not the deprecated `package.json#prisma` config.
- Prisma generate / validate checks have passed on the current launch baseline.
Service/process model
Launch service/process set:
```
zpack-api.service               # HTTP/API
zpack-provision-worker.service  # BullMQ provisioning worker
zpack-repair-worker.service     # Level 1 repair worker
zlh-controller.service          # singleton reconciler/controller
zpack-billing-worker.service    # billing enforcement worker
```
Guardrail: do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.
Provisioning / async create
- `POST /api/instances` is the active create entrypoint.
- Create flow validates request/account state, creates or reuses a durable `ProvisioningOperation`, enqueues a BullMQ `provisioning` job, and returns `202 Accepted` with `operationId` / `statusUrl`.
- `src/queues/provisioning.js` is the live provisioning worker entrypoint.
- Worker concurrency is `1`; blind retries are disabled with `attempts: 1`.
- BullMQ job ID uses the colon-free operation ID.
- Portal sends an `Idempotency-Key`; backend duplicate protection also includes a no-key launch guard.
- Worker callbacks update operation status, phase, heartbeat, server persistence, and cleanup metadata.
- Live validation passed for game and dev provisioning through the worker path.
- Portal async pending cards were visually validated and replace themselves with the real server card after completion.
- Existing API teardown works for worker-created servers.
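The enqueue contract above can be sketched with two small helpers. This is an illustration only: `toJobId`, `buildAcceptedResponse`, and the `statusUrl` path are assumptions, not the real zpack-api code; only the colon-free job ID and the `202` body fields come from this document.

```javascript
// BullMQ custom job IDs must not contain colons, so the durable
// ProvisioningOperation ID is flattened before use as the job ID.
function toJobId(operationId) {
  return operationId.replaceAll(':', '-');
}

// Shape of the 202 Accepted body returned to the Portal after enqueueing.
// The statusUrl route shown here is hypothetical.
function buildAcceptedResponse(operationId) {
  return {
    operationId,
    statusUrl: `/api/instances/operations/${encodeURIComponent(operationId)}`,
  };
}
```

The worker side would pair this with `concurrency: 1` and `attempts: 1` job options so a failed provision surfaces immediately instead of being blindly retried.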
Controller / repair model
- `src/controllers/reconciler.js` is the controller loop.
- Controller runs as `zlh-controller.service`, separate from the HTTP API process.
- Controller is singleton-protected by Redis lock key `zlh:controller:lock`.
- Controller should remain conservative; Level 0/1 only for launch.
- Controller is expected to run in dry-run unless deliberately enabling Level 1 auto-repair.
- `src/controllers/repairPolicy.js` owns repair decision rules.
- `src/queues/repair.js` is the Level 1 repair worker queue.
- Validated Level 1 behavior:
  - an expired provisioning operation can be marked `stale`
  - live Cloudflare SRV drift can be detected and repaired by `edge_republish`
- `edge_republish` uses the existing full edge publish path and post-checks live edge state, so it does not fake success.
- Level 2 actions such as agent/workload restart remain disabled.
- Level 3 destructive actions such as restore/rebuild/delete are never automatic.
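The decision rules above can be condensed into a small pure function. This is a hedged sketch, not the real `src/controllers/repairPolicy.js`: the input field names and return shape are assumptions; the level cap, dry-run behavior, and billing suppression match what this document states.

```javascript
// Level assignments per this document: mark-stale and edge_republish are
// Level 1; restarts are Level 2 (disabled); destructive actions are Level 3.
const REPAIR_LEVELS = {
  mark_stale: 1,
  edge_republish: 1,
  workload_restart: 2, // disabled for launch
  rebuild: 3,          // never automatic
};

function decideRepair({ operationExpired, srvDrift, billingSuspended, dryRun }) {
  // Suspended/retained billing state suppresses edge/DNS repair entirely.
  if (billingSuspended) return { action: 'none', reason: 'billing_suspended' };

  let action = 'none';
  if (operationExpired) action = 'mark_stale';
  else if (srvDrift) action = 'edge_republish';

  // Anything above Level 1 is refused outright for launch.
  if (action !== 'none' && REPAIR_LEVELS[action] > 1) {
    return { action: 'none', reason: 'level_cap' };
  }
  // In dry-run mode the controller only reports what it would do.
  if (dryRun && action !== 'none') {
    return { action: 'none', wouldRepair: action };
  }
  return { action };
}
```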
Billing enforcement
- Durable billing enforcement exists through `BillingEnforcementState`, `BillingEnforcementEvent`, and `StripeEventLog`.
- Stripe webhook handling covers `invoice.payment_failed`, `invoice.paid`, `invoice.payment_succeeded`, `customer.subscription.updated`, and `customer.subscription.deleted`.
- Stripe events are idempotent through `StripeEventLog`.
- `src/services/billingEnforcement.js` owns billing states, guard/assert helpers, Stripe state transitions, and due enforcement calculation.
- `src/queues/billingEnforcement.js` owns billing enforcement queue execution.
- `zpack-billing-worker.service` is installed/running under systemd.
- Billing actions are allowlisted: warning, final warning, backup block, suspension shutdown, retained marking, restore access.
- Destructive billing actions are rejected and audited.
- API gates are implemented for provisioning, start/restart, backup mutations, console command/stream, and file mutations.
- File read/list/download remains allowed while suspended/retained by policy.
- Suspended/retained/pending-deletion billing state suppresses edge/DNS/Velocity repair and live edge observation.
- Validated flows include payment failed, replay idempotency, backup block state, suspension/shutdown safety, API gates while suspended, controller no-repair while suspended, payment restored, and destructive rejection.
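The allowlist gate can be sketched as follows. This is illustrative: the exact action identifiers and helper name are assumptions; the real guard/assert helpers live in `src/services/billingEnforcement.js`.

```javascript
// Allowlisted billing actions per this document. Anything not in the set
// (including destructive actions) is rejected, and the rejection would be
// audited in the real service.
const ALLOWED_BILLING_ACTIONS = new Set([
  'warning',
  'final_warning',
  'backup_block',
  'suspension_shutdown',
  'mark_retained',
  'restore_access',
]);

function assertBillingAction(action) {
  if (!ALLOWED_BILLING_ACTIONS.has(action)) {
    throw new Error(`billing action rejected: ${action}`);
  }
  return action;
}
```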
Support tickets
- `POST /api/support/create` is implemented and mounted under `/api/support`.
- API creates a `SupportTicket` DB row with human-readable ticket number `ZLH-YYYYMMDD-XXXX`.
- Customer acknowledgement email is sent through SMTP.
- Discord `#support` alert is sent via `DISCORD_SUPPORT_WEBHOOK`.
- Optional support mailbox copy uses `SUPPORT_EMAIL_TO` and email `Reply-To` support.
- Portal form submit, customer acknowledgement email, Discord alert, and DB-backed ticket creation were validated live.
- Support triage, admin ticket list/view, inbound reply parsing, attachments, and self-hosted helpdesk integration are post-launch enhancements.
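Generating the `ZLH-YYYYMMDD-XXXX` ticket number might look like the sketch below. The random four-digit suffix is an assumption; the real generator may use a sequence or collision check instead.

```javascript
// Builds a human-readable ticket number in the ZLH-YYYYMMDD-XXXX format.
// Suffix strategy (random digits) is illustrative only.
function ticketNumber(date = new Date()) {
  const ymd = date.toISOString().slice(0, 10).replaceAll('-', ''); // YYYYMMDD
  const suffix = String(Math.floor(Math.random() * 10000)).padStart(4, '0');
  return `ZLH-${ymd}-${suffix}`;
}
```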
Security / trust boundaries
- API has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
- `src/middleware/requireAdmin.js` protects admin-only API routes.
- `src/middleware/requireInternalToken.js` protects internal-only control-plane routes with `INTERNAL_API_TOKEN` / `ZLH_INTERNAL_API_TOKEN`.
- Internal-token routes accept `X-ZLH-Internal-Token`, `X-Internal-Token`, or bearer auth carrying the internal token.
- Internal-token routes fail closed when token config is missing, except in explicit development/test local flows.
- `/api/audit` is admin-only.
- `GET /api/instances` is admin-only global inventory.
- `/api/edge/*`, `/api/proxmox/*`, raw control-plane routes, and monitoring/service-discovery surfaces require the appropriate admin/internal/discovery token boundary.
- Portal never calls agents directly for normal user flows and must not expose internal tokens in browser code.
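The internal-token check can be sketched as two pure functions. This is not the real `src/middleware/requireInternalToken.js`; it only reflects the accepted token locations and the fail-closed rule stated above (Node lowercases incoming header names, which this assumes).

```javascript
// Extracts the internal token from the accepted locations:
// X-ZLH-Internal-Token, X-Internal-Token, or an Authorization bearer token.
function extractInternalToken(headers) {
  const bearer = (headers['authorization'] || '').match(/^Bearer\s+(.+)$/i);
  return (
    headers['x-zlh-internal-token'] ||
    headers['x-internal-token'] ||
    (bearer ? bearer[1] : null)
  );
}

function isInternalRequestAllowed(headers, configuredToken) {
  // Fail closed: missing server-side token config rejects every request.
  if (!configuredToken) return false;
  return extractInternalToken(headers) === configuredToken;
}
```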
Route and lifecycle split
- `/api/instances`, `/api/servers`, and `/api/containers` are distinct operational surfaces.
- `GET /api/servers` is the user-owned server list surface for Portal.
- `DELETE /api/servers/:id` is the preferred Portal/user-owned delete contract and checks ownership before teardown.
- `DELETE /api/containers/:vmid` remains raw cleanup/orphan remediation and should be internal/admin oriented.
- Orphan remediation stays in internal/admin workflows.
Teardown / orphan cleanup behavior
- Container teardown logic lives in `src/services/containerTeardown.js` so user-owned delete and raw internal delete share the same workflow.
- Teardown archives a `DeletedInstance` record before removing the active `ContainerInstance` row.
- Game teardown performs DNS / Cloudflare / Technitium / Velocity cleanup through existing publisher cleanup paths.
- Dev teardown performs dev IDE cleanup through the dev IDE publisher path.
- Live teardown was validated after worker-created game and dev server tests.
Readiness / agent state model
- API is the heartbeat authority by polling agents; Agent does not push state to API.
- API consumes `/health`, `/ready`, and `/status`.
- Portal should rely on API-normalized state, not direct agent state.
- Proxy lifecycle is tracked separately from agent readiness under `ContainerInstance.payload.proxy`.
- Edge publish state and Velocity callback state must be merged into `ContainerInstance.payload` atomically.
- Transient Proxmox `lxc/status/current` read errors are soft-handled by host-status polling and surface `powerState: unknown` rather than raw API 500s.
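The merge rule above (edge and Velocity state land in `ContainerInstance.payload` without clobbering sibling keys) can be sketched as a pure function; in the real API this merge would additionally run atomically, e.g. inside a transaction or under a lock, which this sketch does not show.

```javascript
// Shallow-merges a patch into the payload while deep-merging the proxy
// subtree, so an edge-publish update cannot wipe Velocity callback state
// (or vice versa). Field names beyond payload.proxy are assumptions.
function mergePayload(existingPayload, patch) {
  return {
    ...existingPayload,
    ...patch,
    proxy: { ...(existingPayload.proxy ?? {}), ...(patch.proxy ?? {}) },
  };
}
```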
Velocity / Minecraft edge lifecycle
- Minecraft edge publish uses Velocity instead of Traefik TCP config.
- API registers/unregisters Minecraft backends through the Velocity bridge routes `POST /zpack/register` and `POST /zpack/unregister`, and verifies with `GET /zpack/status`.
- API exposes `POST /internal/velocity/proxy-status` for bridge lifecycle callbacks.
- Accepted proxy lifecycle statuses are `registered_with_proxy`, `proxy_ping_ok`, and `proxy_ping_failed`.
- `GET /api/servers/:id/status` derives Minecraft connection state from agent readiness, persisted edge state, Velocity registration state, and backend ping state.
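The derivation behind `GET /api/servers/:id/status` could look roughly like this. The input sources and the three proxy statuses come from this document; the output labels (`starting`, `connectable`, etc.) are assumptions for illustration.

```javascript
// Derives a single Minecraft connection state from agent readiness,
// persisted edge publish state, and the Velocity proxy lifecycle status.
function deriveConnectionState({ agentReady, edgePublished, proxyStatus }) {
  if (!agentReady) return 'starting';
  if (!edgePublished) return 'publishing';
  if (proxyStatus === 'proxy_ping_ok') return 'connectable';
  if (proxyStatus === 'proxy_ping_failed') return 'degraded';
  if (proxyStatus === 'registered_with_proxy') return 'registering';
  return 'unknown';
}
```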
Host / LXC lifecycle state
- `/api/servers/:id/host/status` exposes Proxmox LXC power state for game and dev containers.
- Host lifecycle routes return `202 Accepted` with an operation/status URL rather than blocking until Proxmox finishes:
  - `POST /api/servers/:id/host/start`
  - `POST /api/servers/:id/host/stop`
  - `POST /api/servers/:id/host/restart`
- Overlapping host lifecycle operations return `409 host_operation_in_progress`.
- Host lifecycle routes perform ownership and billing checks before touching Proxmox.
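The overlap guard can be sketched as follows. The in-memory `Map` is purely illustrative; the real API presumably tracks in-flight operations durably rather than in process memory.

```javascript
// Tracks one in-flight host lifecycle operation per server. A second
// request while one is pending gets 409 host_operation_in_progress;
// otherwise the operation is accepted asynchronously with 202.
const inFlight = new Map(); // serverId -> operation name

function beginHostOperation(serverId, op) {
  if (inFlight.has(serverId)) {
    return { status: 409, error: 'host_operation_in_progress' };
  }
  inFlight.set(serverId, op);
  return { status: 202, operation: op };
}

function endHostOperation(serverId) {
  inFlight.delete(serverId);
}
```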
Backup support
- API forwards game backup operations:
  - `GET /api/game/servers/:id/backups`
  - `POST /api/game/servers/:id/backups`
  - `POST /api/game/servers/:id/backups/restore?id=<backup_id>`
  - `DELETE /api/game/servers/:id/backups/:backupId`
- Restore start is async at the API layer and Portal is expected to poll status.
- Backup response shape normalization remains open.
- Billing backup-block gates are implemented; full live route validation still needs a game backup fixture with backups available.
File proxy / route compatibility
- Duplicated game file proxy logic has been extracted into `src/routes/helpers/gameFileProxy.js`.
- Compatibility is intentionally preserved between `/api/game/servers/:id/files...` and `/api/servers/:id/files...`.
- Streamed upload/download/edit forwarding remains outside the generic non-streaming agent helper transport.
- Billing gates block file mutations while suspended/retained but allow read/list/download by policy.
Hosted IDE proxy
- `POST /api/dev/:id/ide-token` issues short-lived IDE proxy tokens.
- IDE proxy supports hosted `dev-<vmid>.<suffix>` hosts and tunnel paths.
- Hosted IDE access was validated during dev provisioning worker tests.
- API exposes code-server controls for owned dev containers:
  - `POST /api/dev/:id/codeserver/start`
  - `POST /api/dev/:id/codeserver/stop`
  - `POST /api/dev/:id/codeserver/restart`
Console / socket stability
- Console WebSocket proxy attachment is guarded so the console upgrade handler is only attached once per HTTP server.
- Console proxy raw socket error logging is guarded to avoid stacking duplicate socket listeners.
- API raises listener limits on inbound HTTP sockets and console WebSocket sockets to avoid false-positive listener warnings under proxy/websocket fan-out.
- Portal terminal connection-state hardening remains a Portal launch item.
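The "attach the console upgrade handler only once per HTTP server" guard might look like this. The `Symbol` marker is an assumption about mechanism; the single-attachment behavior is what this document describes.

```javascript
// Marks an HTTP server object so the WebSocket 'upgrade' handler for the
// console proxy is wired up at most once, even if route setup runs again.
const CONSOLE_ATTACHED = Symbol('consoleUpgradeAttached');

function attachConsoleProxy(httpServer, handleUpgrade) {
  if (httpServer[CONSOLE_ATTACHED]) return false; // already attached
  httpServer[CONSOLE_ATTACHED] = true;
  httpServer.on('upgrade', handleUpgrade);
  return true;
}
```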
Legacy / archived behavior
- Legacy port allocation / slot reservation is no longer part of the live route mounts.
- Legacy synchronous HTTP provisioning is no longer the launch model; async BullMQ provisioning worker is the current model.
- Raw streaming upload proxy behavior remains outside `agentClient.js`.
- Non-runtime clutter such as checked-in keys/tokens, local artifacts, `.old` scripts, `src/tmp`, and retired legacy trees should stay out of the active repo.