API — Current State

This file records what is believed to be implemented now in zpack-api.

Runtime / dependency baseline

  • API is tracked against Node 24 with repo-local pinning via package.json engines and .nvmrc.
  • API repo is JavaScript ESM; do not introduce TypeScript into API launch work unless a separate migration is explicitly approved.
  • Direct node-fetch dependency has been removed and API code uses built-in global fetch.
  • Prisma configuration lives in a dedicated Prisma config file, not the deprecated package.json#prisma field.
  • Prisma generate / validate checks have passed on the current launch baseline.

Service/process model

Launch service/process set:

zpack-api.service              # HTTP/API
zpack-provision-worker.service # BullMQ provisioning worker
zpack-repair-worker.service    # Level 1 repair worker
zlh-controller.service         # singleton reconciler/controller
zpack-billing-worker.service   # billing enforcement worker

Guardrail: do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.

Provisioning / async create

  • POST /api/instances is the active create entrypoint.
  • Create flow validates request/account state, creates or reuses a durable ProvisioningOperation, enqueues a BullMQ provisioning job, and returns 202 Accepted with operationId / statusUrl.
  • src/queues/provisioning.js is the live provisioning worker entrypoint.
  • Worker concurrency is 1; blind retries are disabled with attempts: 1.
  • BullMQ job ID uses the colon-free operation ID.
  • Portal sends an Idempotency-Key; backend duplicate protection also includes a no-key launch guard.
  • Worker callbacks update operation status, phase, heartbeat, server persistence, and cleanup metadata.
  • Live validation passed for game and dev provisioning through the worker path.
  • Portal async pending cards were visually validated and replace themselves with the real server card after completion.
  • Existing API teardown works for worker-created servers.
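
The async create contract above can be sketched as follows. This is a hedged, self-contained illustration, not the real zpack-api code: `operations` stands in for the durable ProvisioningOperation table, `queue` for the BullMQ provisioning queue, and all names (`createInstance`, `statusUrl` shape) are hypothetical.

```javascript
// Sketch of the POST /api/instances create path (hypothetical names).
// A durable operation is created or reused per Idempotency-Key, a job is
// enqueued, and a 202 Accepted body is returned immediately.
const operations = new Map(); // stand-in for the ProvisioningOperation table
const queue = [];             // stand-in for the BullMQ provisioning queue

function createInstance(request, idempotencyKey) {
  // Reuse the durable operation when the same Idempotency-Key is replayed.
  if (idempotencyKey && operations.has(idempotencyKey)) {
    const existing = operations.get(idempotencyKey);
    return { status: 202, body: { operationId: existing.id, statusUrl: existing.statusUrl } };
  }

  const id = `op:${Date.now()}:${Math.random().toString(16).slice(2, 8)}`;
  const op = { id, phase: 'queued', statusUrl: `/api/operations/${id}` };
  if (idempotencyKey) operations.set(idempotencyKey, op);

  // BullMQ job IDs use the colon-free form of the operation ID.
  queue.push({ jobId: id.replaceAll(':', ''), request });

  return { status: 202, body: { operationId: id, statusUrl: op.statusUrl } };
}
```

A replayed key returns the same `operationId` without enqueuing a second job, which is the duplicate protection the Portal's Idempotency-Key relies on.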

Controller / repair model

  • src/controllers/reconciler.js is the controller loop.
  • Controller runs as zlh-controller.service, separate from the HTTP API process.
  • Controller is singleton-protected by Redis lock key zlh:controller:lock.
  • Controller should remain conservative; Level 0/1 only for launch.
  • Controller is expected to run in dry-run mode unless Level 1 auto-repair is deliberately enabled.
  • src/controllers/repairPolicy.js owns repair decision rules.
  • src/queues/repair.js is the Level 1 repair worker queue.
  • Validated Level 1 behavior:
    • expired provisioning operation can be marked stale
    • live Cloudflare SRV drift can be detected and repaired by edge_republish
  • edge_republish uses the existing full edge publish path and post-checks live edge state so it does not fake success.
  • Level 2 actions such as agent/workload restart remain disabled.
  • Level 3 destructive actions such as restore/rebuild/delete are never automatic.
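
The level gating above can be sketched as a pure decision function. This is an illustrative shape only, with hypothetical action names; the real rules live in src/controllers/repairPolicy.js.

```javascript
// Sketch of the launch repair-level gate (hypothetical names). Level 0 is
// observe-only, Level 1 covers the validated safe repairs, Level 2 is
// disabled for launch, and Level 3 destructive actions are never automatic.
const ACTION_LEVELS = {
  observe: 0,
  mark_stale_operation: 1,
  edge_republish: 1,
  agent_restart: 2,    // disabled for launch
  workload_restart: 2, // disabled for launch
  restore: 3,          // never automatic
  rebuild: 3,          // never automatic
  delete: 3,           // never automatic
};

function decideRepair(action, { dryRun = true, maxLevel = 1 } = {}) {
  const level = ACTION_LEVELS[action];
  if (level === undefined) return { allowed: false, reason: 'unknown_action' };
  if (level >= 3) return { allowed: false, reason: 'destructive_never_automatic' };
  if (level > maxLevel) return { allowed: false, reason: 'level_disabled' };
  if (dryRun && level > 0) return { allowed: false, reason: 'dry_run' };
  return { allowed: true, level };
}
```

Note the ordering: the Level 3 check runs before the `maxLevel` check, so destructive actions stay blocked even under a misconfigured `maxLevel`.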

Billing enforcement

  • Durable billing enforcement exists through BillingEnforcementState, BillingEnforcementEvent, and StripeEventLog.
  • Stripe webhook handling covers invoice.payment_failed, invoice.paid, invoice.payment_succeeded, customer.subscription.updated, and customer.subscription.deleted.
  • Stripe events are idempotent through StripeEventLog.
  • src/services/billingEnforcement.js owns billing states, guard/assert helpers, Stripe state transitions, and due enforcement calculation.
  • src/queues/billingEnforcement.js owns billing enforcement queue execution.
  • zpack-billing-worker.service is installed/running under systemd.
  • Billing actions are allowlisted: warning, final warning, backup block, suspension shutdown, retained marking, restore access.
  • Destructive billing actions are rejected and audited.
  • API gates are implemented for provisioning, start/restart, backup mutations, console command/stream, and file mutations.
  • By policy, file read/list/download remains allowed while suspended/retained.
  • Suspended/retained/pending-deletion billing state suppresses edge/DNS/Velocity repair and live edge observation.
  • Validated flows include payment failed, replay idempotency, backup block state, suspension/shutdown safety, API gates while suspended, controller no-repair while suspended, payment restored, and destructive rejection.
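
The allowlist-plus-rejection behavior can be sketched as below. This is a hedged illustration; action names are adapted to identifiers and the real logic lives in src/services/billingEnforcement.js and src/queues/billingEnforcement.js.

```javascript
// Sketch of the billing enforcement action allowlist (hypothetical names).
// Anything outside the allowlist is rejected and audited, never executed.
const ALLOWED_BILLING_ACTIONS = new Set([
  'warning',
  'final_warning',
  'backup_block',
  'suspension_shutdown',
  'mark_retained',
  'restore_access',
]);

function executeBillingAction(action, audit) {
  if (!ALLOWED_BILLING_ACTIONS.has(action)) {
    audit.push({ action, outcome: 'rejected_destructive_or_unknown' });
    return { ok: false };
  }
  audit.push({ action, outcome: 'executed' });
  return { ok: true };
}
```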

Support tickets

  • POST /api/support/create is implemented and mounted under /api/support.
  • API creates a SupportTicket DB row with human-readable ticket number ZLH-YYYYMMDD-XXXX.
  • Customer acknowledgement email is sent through SMTP.
  • Discord #support alert is sent via DISCORD_SUPPORT_WEBHOOK.
  • Optional support mailbox copy uses SUPPORT_EMAIL_TO and email Reply-To support.
  • Portal form submit, customer acknowledgement email, Discord alert, and DB-backed ticket creation were validated live.
  • Support triage, admin ticket list/view, inbound reply parsing, attachments, and self-hosted helpdesk integration are post-launch enhancements.
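
Ticket-number formatting can be sketched as follows; the helper name and sequence source are hypothetical (the real API persists the SupportTicket row and derives the number there), but the ZLH-YYYYMMDD-XXXX shape matches this document.

```javascript
// Sketch of ZLH-YYYYMMDD-XXXX ticket-number generation (hypothetical
// helper). UTC date components are zero-padded; the 4-digit suffix is a
// per-day sequence or random value depending on implementation.
function ticketNumber(date = new Date(), seq = Math.floor(Math.random() * 10000)) {
  const ymd = [
    date.getUTCFullYear(),
    String(date.getUTCMonth() + 1).padStart(2, '0'),
    String(date.getUTCDate()).padStart(2, '0'),
  ].join('');
  return `ZLH-${ymd}-${String(seq).padStart(4, '0')}`;
}
```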

Security / trust boundaries

  • API has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
  • src/middleware/requireAdmin.js protects admin-only API routes.
  • src/middleware/requireInternalToken.js protects internal-only control-plane routes with INTERNAL_API_TOKEN / ZLH_INTERNAL_API_TOKEN.
  • Internal-token routes accept X-ZLH-Internal-Token, X-Internal-Token, or bearer auth carrying the internal token.
  • Internal-token routes fail closed when token config is missing except explicit development/test local flows.
  • /api/audit is admin-only.
  • GET /api/instances is admin-only global inventory.
  • /api/edge/*, /api/proxmox/*, raw control-plane routes, and monitoring/service-discovery surfaces require the appropriate admin/internal/discovery token boundary.
  • Portal never calls agents directly for normal user flows and must not expose internal tokens in browser code.
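
The internal-token boundary can be sketched as a pure check. This is an illustrative shape, not src/middleware/requireInternalToken.js itself; the accepted header forms and the fail-closed rule follow the bullets above.

```javascript
// Sketch of the internal-token boundary check (hypothetical helper).
// Accepts X-ZLH-Internal-Token, X-Internal-Token, or a bearer token;
// fails closed when no token is configured, except explicit local
// development/test flows.
function checkInternalToken(headers, config, env = 'production') {
  const expected = config.INTERNAL_API_TOKEN || config.ZLH_INTERNAL_API_TOKEN;
  if (!expected) {
    return { allowed: env === 'development' || env === 'test', reason: 'no_token_configured' };
  }
  const bearer = (headers.authorization || '').replace(/^Bearer\s+/i, '');
  const presented =
    headers['x-zlh-internal-token'] || headers['x-internal-token'] || bearer;
  return { allowed: presented === expected };
}
```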

Route and lifecycle split

  • /api/instances, /api/servers, and /api/containers are distinct operational surfaces.
  • GET /api/servers is the user-owned server list surface for Portal.
  • DELETE /api/servers/:id is the preferred Portal/user-owned delete contract and checks ownership before teardown.
  • DELETE /api/containers/:vmid remains raw cleanup/orphan-remediation and should be internal/admin oriented.
  • Orphan remediation stays in internal/admin workflows.

Teardown / orphan cleanup behavior

  • Container teardown logic lives in src/services/containerTeardown.js so user-owned delete and raw internal delete share the same workflow.
  • Teardown archives a DeletedInstance record before removing the active ContainerInstance row.
  • Game teardown performs DNS / Cloudflare / Technitium / Velocity cleanup through existing publisher cleanup paths.
  • Dev teardown performs dev IDE cleanup through the dev IDE publisher path.
  • Live teardown was validated after worker-created game and dev server tests.

Readiness / agent state model

  • API is the heartbeat authority and polls agents; the Agent does not push state to the API.
  • API consumes /health, /ready, and /status.
  • Portal should rely on API-normalized state, not direct agent state.
  • Proxy lifecycle is tracked separately from agent readiness under ContainerInstance.payload.proxy.
  • Edge publish state and Velocity callback state must be merged into ContainerInstance.payload atomically.
  • Transient Proxmox lxc/status/current read errors are soft-handled by host-status polling and surface powerState: unknown rather than raw API 500s.
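
The soft-handling of transient status-read errors can be sketched as below; `readLxcStatus` is a hypothetical stand-in for the real lxc/status/current call.

```javascript
// Sketch of soft-handling transient Proxmox status-read failures.
// Read errors surface as powerState "unknown" instead of propagating a
// raw API 500 to the caller.
async function hostPowerState(readLxcStatus, vmid) {
  try {
    const { status } = await readLxcStatus(vmid); // e.g. "running" | "stopped"
    return { powerState: status };
  } catch {
    return { powerState: 'unknown' };
  }
}
```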

Velocity / Minecraft edge lifecycle

  • Minecraft edge publish uses Velocity instead of Traefik TCP config.
  • API registers/unregisters Minecraft backends through the Velocity bridge routes POST /zpack/register, POST /zpack/unregister, and verifies with GET /zpack/status.
  • API exposes POST /internal/velocity/proxy-status for bridge lifecycle callbacks.
  • Accepted proxy lifecycle statuses are registered_with_proxy, proxy_ping_ok, and proxy_ping_failed.
  • GET /api/servers/:id/status derives Minecraft connection state from agent readiness, persisted edge state, Velocity registration state, and backend ping state.
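
The derived connection state can be sketched as a pure function over the inputs the status route merges. The state names other than the proxy lifecycle statuses, and the exact precedence, are hypothetical illustrations.

```javascript
// Sketch of deriving a Minecraft connection state (hypothetical precedence)
// from agent readiness, persisted edge state, and Velocity proxy state.
function minecraftConnectionState({ agentReady, edgePublished, proxyStatus }) {
  if (!agentReady) return 'starting';
  if (!edgePublished) return 'publishing';
  if (proxyStatus === 'proxy_ping_ok') return 'connectable';
  if (proxyStatus === 'proxy_ping_failed') return 'degraded';
  if (proxyStatus === 'registered_with_proxy') return 'registering';
  return 'unknown';
}
```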

Host / LXC lifecycle state

  • /api/servers/:id/host/status exposes Proxmox LXC power state for game and dev containers.
  • Host lifecycle routes return 202 Accepted with operation/status URL rather than blocking until Proxmox finishes:
    • POST /api/servers/:id/host/start
    • POST /api/servers/:id/host/stop
    • POST /api/servers/:id/host/restart
  • Overlapping host lifecycle operations return 409 host_operation_in_progress.
  • Host lifecycle routes perform ownership and billing checks before touching Proxmox.
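
The non-blocking 202/409 contract can be sketched with an in-memory guard; this is illustrative only (the real routes also run the ownership and billing checks noted above before touching Proxmox).

```javascript
// Sketch of the host lifecycle operation guard (hypothetical in-memory
// version). Overlapping operations on the same server return 409
// host_operation_in_progress; accepted operations return 202 with a
// status URL rather than blocking on Proxmox.
const inProgress = new Set(); // server IDs with an active host operation

function startHostOperation(serverId, action) {
  if (inProgress.has(serverId)) {
    return { status: 409, body: { error: 'host_operation_in_progress' } };
  }
  inProgress.add(serverId);
  const operationId = `host-${action}-${serverId}-${Date.now()}`;
  return { status: 202, body: { operationId, statusUrl: `/api/servers/${serverId}/host/status` } };
}

function finishHostOperation(serverId) {
  inProgress.delete(serverId);
}
```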

Backup support

  • API forwards game backup operations:
    • GET /api/game/servers/:id/backups
    • POST /api/game/servers/:id/backups
    • POST /api/game/servers/:id/backups/restore?id=<backup_id>
    • DELETE /api/game/servers/:id/backups/:backupId
  • Restore start is async at the API layer and Portal is expected to poll status.
  • Backup response shape normalization remains open.
  • Billing backup-block gates are implemented; full live route validation still needs a game backup fixture with backups available.

File proxy / route compatibility

  • Duplicated game file proxy logic has been extracted into src/routes/helpers/gameFileProxy.js.
  • Compatibility is intentionally preserved between /api/game/servers/:id/files... and /api/servers/:id/files....
  • Streamed upload/download/edit forwarding remains outside the generic non-streaming agent-helper transport.
  • Billing gates block file mutations while suspended/retained but allow read/list/download by policy.
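
The file-route billing gate can be sketched as a pure predicate; helper and action names are hypothetical, but the policy (mutations blocked, read/list/download allowed while suspended/retained) follows this document.

```javascript
// Sketch of the suspended/retained file-route gate (hypothetical helper).
// Mutations are blocked in restricted billing states; read-only actions
// remain allowed by policy.
const READ_ONLY_FILE_ACTIONS = new Set(['read', 'list', 'download']);

function fileActionAllowed(billingState, action) {
  const restricted = billingState === 'suspended' || billingState === 'retained';
  if (!restricted) return true;
  return READ_ONLY_FILE_ACTIONS.has(action);
}
```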

Hosted IDE proxy

  • POST /api/dev/:id/ide-token issues short-lived IDE proxy tokens.
  • IDE proxy supports hosted dev-<vmid>.<suffix> hosts and tunnel paths.
  • Hosted IDE access was validated during dev provisioning worker tests.
  • API exposes code-server controls for owned dev containers:
    • POST /api/dev/:id/codeserver/start
    • POST /api/dev/:id/codeserver/stop
    • POST /api/dev/:id/codeserver/restart
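
Short-lived token issuance can be sketched as below; the token format, TTL, and in-memory store are hypothetical illustrations of the POST /api/dev/:id/ide-token contract (the real implementation would use cryptographic randomness and durable or signed tokens).

```javascript
// Sketch of short-lived IDE proxy token issuance and validation
// (hypothetical; Math.random is for illustration only, not a secure
// token source).
const issued = new Map(); // token -> expiry timestamp (ms)

function issueIdeToken(ttlMs = 5 * 60 * 1000, now = Date.now()) {
  const token = `ide-${now.toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
  issued.set(token, now + ttlMs);
  return token;
}

function ideTokenValid(token, now = Date.now()) {
  const expiry = issued.get(token);
  return expiry !== undefined && now < expiry;
}
```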

Console / socket stability

  • Console WebSocket proxy attachment is guarded so the console upgrade handler is only attached once per HTTP server.
  • Console proxy raw socket error logging is guarded to avoid stacking duplicate socket listeners.
  • API raises listener limits on inbound HTTP sockets and console WebSocket sockets to avoid false-positive listener warnings under proxy/websocket fan-out.
  • Portal terminal connection-state hardening remains a Portal launch item.
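
The attach-once guard can be sketched as below; the symbol name and function shape are hypothetical, but the intent matches the bullets above: only one console 'upgrade' handler per HTTP server, no stacked duplicate listeners.

```javascript
// Sketch of the attach-once guard for the console WebSocket upgrade
// handler (hypothetical symbol name). A flag on the server object makes
// repeated mounting a no-op instead of stacking duplicate listeners.
const CONSOLE_PROXY_ATTACHED = Symbol('consoleProxyAttached');

function attachConsoleProxy(httpServer, onUpgrade) {
  if (httpServer[CONSOLE_PROXY_ATTACHED]) return false; // already attached
  httpServer[CONSOLE_PROXY_ATTACHED] = true;
  httpServer.on('upgrade', onUpgrade);
  return true;
}
```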

Legacy / archived behavior

  • Legacy port allocation / slot reservation is no longer part of the live route mounts.
  • Legacy synchronous HTTP provisioning is no longer the launch model; async BullMQ provisioning worker is the current model.
  • Raw streaming upload proxy behavior remains outside agentClient.js.
  • Non-runtime clutter such as checked-in keys/tokens, local artifacts, .old scripts, src/tmp, and retired legacy trees should stay out of the active repo.