From 8943e16a7ec24c8df48fcf759d3836645755e400 Mon Sep 17 00:00:00 2001
From: jester
Date: Sun, 3 May 2026 19:54:05 +0000
Subject: [PATCH] Update API current state for worker/controller/billing/support launch work

---
 Codex/API/CURRENT_STATE.md | 220 +++++++++++++++++--------------------
 1 file changed, 103 insertions(+), 117 deletions(-)

diff --git a/Codex/API/CURRENT_STATE.md b/Codex/API/CURRENT_STATE.md
index e6aab37..8f14095 100644
--- a/Codex/API/CURRENT_STATE.md
+++ b/Codex/API/CURRENT_STATE.md
@@ -1,173 +1,159 @@
 # API — Current State
-This file records what is believed to be implemented now.
+This file records what is believed to be implemented now in `zpack-api`.
 
 ## Runtime / dependency baseline
 - API is tracked against Node 24 with repo-local pinning via `package.json` engines and `.nvmrc`.
-- Direct `node-fetch` dependency has been removed and API code now uses built-in global `fetch`.
-- Prisma config has been migrated out of deprecated `package.json#prisma` into `prisma.config.ts`.
-- Prisma generate / validate checks reportedly pass on the current API baseline.
+- API repo is JavaScript ESM; do not introduce TypeScript into API launch work unless a separate migration is explicitly approved.
+- Direct `node-fetch` dependency has been removed and API code uses built-in global `fetch`.
+- Prisma configuration lives in a dedicated Prisma config file, not in the deprecated `package.json#prisma` block.
+- Prisma generate / validate checks have passed on the current launch baseline.
+
+## Service/process model
+Launch service/process set:
+
+```text
+zpack-api.service               # HTTP/API
+zpack-provision-worker.service  # BullMQ provisioning worker
+zpack-repair-worker.service     # Level 1 repair worker
+zlh-controller.service          # singleton reconciler/controller
+zpack-billing-worker.service    # billing enforcement worker
+```
+
+Guardrail: do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.
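As an illustration of the one-process-per-service split above, a worker unit might look roughly like the following. This is a hypothetical sketch, not the deployed configuration: the paths, user, and environment file are placeholders, and only the entrypoint `src/queues/provisioning.js` is taken from this document.

```ini
# Hypothetical sketch only — ExecStart path, User, WorkingDirectory, and
# EnvironmentFile are assumptions; only the worker entrypoint name comes
# from this document.
[Unit]
Description=zPack provisioning worker (BullMQ)
After=network-online.target

[Service]
Type=simple
User=zpack
WorkingDirectory=/opt/zpack-api
EnvironmentFile=/etc/zpack/api.env
ExecStart=/usr/bin/node src/queues/provisioning.js
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Keeping each safety boundary in its own unit means a crash loop in one worker cannot take down the HTTP API or the singleton controller.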
+
+## Provisioning / async create
+- `POST /api/instances` is the active create entrypoint.
+- Create flow validates request/account state, creates or reuses a durable `ProvisioningOperation`, enqueues a BullMQ `provisioning` job, and returns `202 Accepted` with `operationId` / `statusUrl`.
+- `src/queues/provisioning.js` is the live provisioning worker entrypoint.
+- Worker concurrency is `1`; blind retries are disabled with `attempts: 1`.
+- The BullMQ job ID is the colon-free form of the operation ID.
+- Portal sends an `Idempotency-Key`; backend duplicate protection also includes a no-key launch guard.
+- Worker callbacks update operation status, phase, heartbeat, server persistence, and cleanup metadata.
+- Live validation passed for game and dev provisioning through the worker path.
+- Portal async pending cards were visually validated and replace themselves with the real server card after completion.
+- Existing API teardown works for worker-created servers.
+
+## Controller / repair model
+- `src/controllers/reconciler.js` is the controller loop.
+- Controller runs as `zlh-controller.service`, separate from the HTTP API process.
+- Controller is singleton-protected by the Redis lock key `zlh:controller:lock`.
+- Controller should remain conservative; Level 0/1 only for launch.
+- Controller is expected to run in dry-run mode unless Level 1 auto-repair is deliberately enabled.
+- `src/controllers/repairPolicy.js` owns the repair decision rules.
+- `src/queues/repair.js` is the Level 1 repair worker queue.
+- Validated Level 1 behavior:
+  - an expired provisioning operation can be marked `stale`
+  - live Cloudflare SRV drift can be detected and repaired by `edge_republish`
+- `edge_republish` uses the existing full edge publish path and post-checks live edge state, so it does not fake success.
+- Level 2 actions such as agent/workload restart remain disabled.
+- Level 3 destructive actions such as restore/rebuild/delete are never automatic.
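The singleton protection above is the classic Redis `SET key NX PX` pattern. A minimal sketch, using an in-memory `Map` as a stand-in for Redis so it is self-contained: the key name `zlh:controller:lock` comes from this document, while the helper names, TTL, and release logic are illustrative, not the actual controller code.

```javascript
// In-memory stand-in for Redis SET NX PX: key -> { value, expiresAt }.
import { randomUUID } from "node:crypto";

const store = new Map();

// Equivalent of: SET key value NX PX ttlMs — succeeds only if the lock
// is unheld or its TTL has expired.
function acquireLock(key, value, ttlMs, now = Date.now()) {
  const current = store.get(key);
  if (current && current.expiresAt > now) return false; // held elsewhere
  store.set(key, { value, expiresAt: now + ttlMs });
  return true;
}

// Release only if we still own the lock (compare-and-delete), so a slow
// instance cannot delete a lock that expired and was re-acquired.
function releaseLock(key, value) {
  const current = store.get(key);
  if (!current || current.value !== value) return false;
  store.delete(key);
  return true;
}

const me = randomUUID();
const winner = acquireLock("zlh:controller:lock", me, 30_000);
const loser = acquireLock("zlh:controller:lock", randomUUID(), 30_000);
console.log(winner, loser); // true false — only one controller instance wins
```

Against real Redis the compare-and-delete in `releaseLock` must itself be atomic (typically a small Lua script); the `Map` version only illustrates the ownership check.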
+
+## Billing enforcement
+- Durable billing enforcement exists through `BillingEnforcementState`, `BillingEnforcementEvent`, and `StripeEventLog`.
+- Stripe webhook handling covers `invoice.payment_failed`, `invoice.paid`, `invoice.payment_succeeded`, `customer.subscription.updated`, and `customer.subscription.deleted`.
+- Stripe events are processed idempotently through `StripeEventLog`.
+- `src/services/billingEnforcement.js` owns billing states, guard/assert helpers, Stripe state transitions, and due-enforcement calculation.
+- `src/queues/billingEnforcement.js` owns billing enforcement queue execution.
+- `zpack-billing-worker.service` is installed and running under systemd.
+- Billing actions are allowlisted: warning, final warning, backup block, suspension shutdown, retained marking, restore access.
+- Destructive billing actions are rejected and audited.
+- API gates are implemented for provisioning, start/restart, backup mutations, console command/stream, and file mutations.
+- File read/list/download remains allowed while suspended/retained, by policy.
+- Suspended/retained/pending-deletion billing state suppresses edge/DNS/Velocity repair and live edge observation.
+- Validated flows include payment failed, replay idempotency, backup block state, suspension/shutdown safety, API gates while suspended, controller no-repair while suspended, payment restored, and destructive rejection.
+
+## Support tickets
+- `POST /api/support/create` is implemented and mounted under `/api/support`.
+- API creates a `SupportTicket` DB row with a human-readable ticket number of the form `ZLH-YYYYMMDD-XXXX`.
+- Customer acknowledgement email is sent through SMTP.
+- Discord `#support` alert is sent via `DISCORD_SUPPORT_WEBHOOK`.
+- Optional support mailbox copy uses `SUPPORT_EMAIL_TO` and supports setting the email `Reply-To` header.
+- Portal form submit, customer acknowledgement email, Discord alert, and DB-backed ticket creation were validated live.
+- Support triage, admin ticket list/view, inbound reply parsing, attachments, and self-hosted helpdesk integration are post-launch enhancements.
 
 ## Security / trust boundaries
-- The API now has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
+- API has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
 - `src/middleware/requireAdmin.js` protects admin-only API routes.
 - `src/middleware/requireInternalToken.js` protects internal-only control-plane routes with `INTERNAL_API_TOKEN` / `ZLH_INTERNAL_API_TOKEN`.
 - Internal-token routes accept `X-ZLH-Internal-Token`, `X-Internal-Token`, or bearer auth carrying the internal token.
-- Internal-token routes fail closed when the token is missing, except for explicit `NODE_ENV=development` or `NODE_ENV=test` local flows.
+- Internal-token routes fail closed when token config is missing, except for explicit development/test local flows.
 - `/api/audit` is admin-only.
 - `GET /api/instances` is admin-only global inventory.
-- `/api/edge/*` and `/api/proxmox/*` are internal-token protected.
-- `/sd/exporters` and `/monitoring/*` now fail closed in production-like environments when discovery tokens are missing.
-- Security boundary documentation lives in `docs/security-boundaries.md` in the API repo.
+- `/api/edge/*`, `/api/proxmox/*`, raw control-plane routes, and monitoring/service-discovery surfaces require the appropriate admin/internal/discovery token boundary.
+- Portal never calls agents directly for normal user flows and must not expose internal tokens in browser code.
 
 ## Route and lifecycle split
-- `/api/instances`, `/api/servers`, and `/api/containers` are different operational surfaces and should not be treated as duplicates.
-- `POST /api/instances` is the active agent-driven provisioning entrypoint.
-- `GET /api/instances` is now admin-only global inventory rather than a normal Portal user list.
+- `/api/instances`, `/api/servers`, and `/api/containers` are distinct operational surfaces.
 - `GET /api/servers` is the user-owned server list surface for Portal.
-- `DELETE /api/servers/:id` is the preferred Portal/user-owned delete contract. It requires a normal user bearer token and checks ownership before teardown.
-- `DELETE /api/containers/:vmid` remains the raw cleanup/orphan-remediation route. It is primarily internal-token protected and currently also accepts owned-user bearer deletes for Portal compatibility during migration.
-- Orphan remediation should stay in internal/admin workflows. Portal deletes only active servers the DB says the user owns.
+- `DELETE /api/servers/:id` is the preferred Portal/user-owned delete contract and checks ownership before teardown.
+- `DELETE /api/containers/:vmid` remains raw cleanup/orphan-remediation and should stay internal/admin-oriented.
+- Orphan remediation stays in internal/admin workflows.
 
 ## Teardown / orphan cleanup behavior
-- Container teardown logic has been extracted into `src/services/containerTeardown.js` so user-owned delete and raw internal delete share the same workflow.
+- Container teardown logic lives in `src/services/containerTeardown.js` so user-owned delete and raw internal delete share the same workflow.
 - Teardown archives a `DeletedInstance` record before removing the active `ContainerInstance` row.
-- Archive metadata is derived from current instance fields (`payload`, `engineType`, `allocatedPorts`) instead of assuming pre-v2 `game` / `variant` / `ports` columns still exist on `ContainerInstance`.
-- Teardown still refuses to delete a running Proxmox container and returns a conflict-style response until the host is stopped.
 - Game teardown performs DNS / Cloudflare / Technitium / Velocity cleanup through existing publisher cleanup paths.
 - Dev teardown performs dev IDE cleanup through the dev IDE publisher path.
-- Live validation completed for one game VMID and one dev VMID: both active rows were removed and archived with `reason=user_delete`.
-- A read-only helper script exists at `src/scripts/monitorTeardown.js`, exposed as `npm run monitor:teardown -- [...]`, for watching active/archive DB state during teardown validation.
+- Live teardown was validated after worker-created game and dev server tests.
 
 ## Readiness / agent state model
-- API is the heartbeat authority by polling agents.
-- Agent does not push state to API.
-- API consumes:
-  - `/health` for liveness
-  - `/ready` for semantic readiness
-  - `/status` for detailed state snapshot
+- API is the heartbeat authority by polling agents; the agent does not push state to the API.
+- API consumes `/health`, `/ready`, and `/status`.
 - Portal should rely on API-normalized state, not direct agent state.
 - Proxy lifecycle is tracked separately from agent readiness under `ContainerInstance.payload.proxy`.
-- `ContainerInstance.payload.edge` and `ContainerInstance.payload.proxy` are merged through an atomic JSON merge helper so edge publish state and Velocity callback state do not clobber each other during near-simultaneous writes.
-
-## Readiness cleanup already done
-- `agentClient.js` centralizes non-streaming agent transport.
-- `getAgentReady()` remains low-level transport.
-- `isAgentReadyResult()` is the shared semantic readiness helper.
-- `assertAgentReady()` uses semantic readiness.
-- Poller only caches `ready: true` when `/ready` returns semantic success.
-- Provisioning requires semantic readiness before success/persist/publish.
-- Timeout handling in `agentClient.js` has been modernized to `AbortSignal.timeout(...)`.
-- A final alignment pass is still needed in merged live status so any Portal-facing `agentReady` field fully matches semantic `/ready` rather than looser cache presence.
-
-## Provisioning / post-provision flow
-- `src/api/provisionAgent.js` is the active provisioning path.
-- Older worker-based provisioning code is no longer treated as a live API path.
-- Post-provision edge publishing is still active.
-- Post-provision behavior has been adjusted away from legacy port-allocation commits and should be understood as request-driven / payload-driven.
+- Edge publish state and Velocity callback state must be merged into `ContainerInstance.payload` atomically.
+- Transient Proxmox `lxc/status/current` read errors are soft-handled by host-status polling and surface `powerState: unknown` rather than raw API 500s.
 
 ## Velocity / Minecraft edge lifecycle
 - Minecraft edge publish uses Velocity instead of Traefik TCP config.
-- API registers Minecraft backends with the Velocity bridge using `POST /zpack/register` and unregisters with `POST /zpack/unregister`.
-- Velocity registration verification uses the bridge's real `GET /zpack/status` route instead of the nonexistent `/zpack/list` route.
-- `/zpack/status` is expected to expose a `servers` array containing `{ name, address, port }` entries.
-- If `/zpack/status` exists but does not expose registered backends, API treats the register response as acknowledged but unverifiable instead of falsely failing the registration.
-- API exposes `POST /internal/velocity/proxy-status` for bridge lifecycle callbacks using the same hashed `X-Zpack-Secret` shared-secret auth style.
+- API registers/unregisters Minecraft backends through the Velocity bridge routes `POST /zpack/register` and `POST /zpack/unregister`, and verifies registration with `GET /zpack/status`.
+- API exposes `POST /internal/velocity/proxy-status` for bridge lifecycle callbacks.
 - Accepted proxy lifecycle statuses are `registered_with_proxy`, `proxy_ping_ok`, and `proxy_ping_failed`.
-- Proxy lifecycle callbacks are stored under `ContainerInstance.payload.proxy` with source, status, server name, address, port, duplicate flag, timestamps, latency, detail, and last event time.
-- `GET /api/servers/:id/status` includes the stored `proxy` object beside agent-derived live status.
 - `GET /api/servers/:id/status` derives Minecraft connection state from agent readiness, persisted edge state, Velocity registration state, and backend ping state.
-- When Velocity has registered a backend but has not posted a separate `proxy_ping_ok`, API can treat an agent-confirmed Minecraft ping (`readySource: "minecraft_ping"`) as a backend ping fallback unless Velocity explicitly reported `proxy_ping_failed`.
-- The bridge should set `ZPACK_PROXY_STATUS_ENDPOINT` to the API internal callback URL, for example `http://:4000/internal/velocity/proxy-status`.
 
 ## Host / LXC lifecycle state
-- `/api/servers/:id/host/status` exposes the underlying Proxmox LXC state for both game and dev containers.
-- Host status distinguishes container power state from agent/game runtime state.
-- Host status response includes `hostStatus`, `powerState`, `running`, active `operation`, and selected Proxmox stats.
-- `hostStatus` can report `running`, `stopped`, `starting`, `stopping`, or `restarting`.
-- Host lifecycle routes return `202 Accepted` with an operation object and `statusUrl` instead of blocking until Proxmox finishes:
+- `/api/servers/:id/host/status` exposes Proxmox LXC power state for game and dev containers.
+- Host lifecycle routes return `202 Accepted` with an operation/status URL rather than blocking until Proxmox finishes:
   - `POST /api/servers/:id/host/start`
   - `POST /api/servers/:id/host/stop`
   - `POST /api/servers/:id/host/restart`
 - Overlapping host lifecycle operations return `409 host_operation_in_progress`.
-- Host lifecycle operation state is cached under Redis `hostop:` with a short TTL and is also surfaced in `GET /api/servers` as `hostOperation`.
-- `GET /api/servers` includes Proxmox-backed `hostStatus` and `powerState` so Portal list views can show LXC-level start/stop/restart state for game and dev containers.
-- Host lifecycle routes perform ownership checks before touching Proxmox.
-- Proxmox client can resolve the actual node for an LXC via `/cluster/resources` instead of assuming every VMID lives on the configured default `PROXMOX_NODE`.
+- Host lifecycle routes perform ownership and billing checks before touching Proxmox.
 
 ## Backup support
-- API forwards game backup operations.
-- Current API route shape:
+- API forwards game backup operations:
   - `GET /api/game/servers/:id/backups`
   - `POST /api/game/servers/:id/backups`
   - `POST /api/game/servers/:id/backups/restore?id=`
   - `DELETE /api/game/servers/:id/backups/:backupId`
-- Restore start is async at the API layer and Portal is expected to poll status rather than hold the restore POST open.
-- API forwards agent HTTP status codes for backup responses.
-- Successful backup responses currently pass through the agent body.
-- Non-OK backup responses currently use the shared agent response envelope: `{ error: , details: }`.
+- Restore start is async at the API layer, and Portal is expected to poll status.
 - Backup response shape normalization remains open.
+- Billing backup-block gates are implemented; full live route validation still needs a game backup fixture with backups available.
 
 ## File proxy / route compatibility
-- Duplicated game file proxy logic has been extracted into `src/routes/helpers/gameFileProxy.js` in the API repo.
-- Route compatibility is intentionally preserved between:
-  - `/api/game/servers/:id/files...`
-  - `/api/servers/:id/files...`
-- Streamed upload / download / file edit forwarding still exists outside the generic non-streaming agent helper path.
-- Compatibility between canonical game routes and legacy/compatibility server routes should be treated as part of the API contract.
+- Duplicated game file proxy logic has been extracted into `src/routes/helpers/gameFileProxy.js`.
+- Compatibility is intentionally preserved between `/api/game/servers/:id/files...` and `/api/servers/:id/files...`.
+- Streamed upload/download/edit forwarding remains outside the generic non-streaming agent helper transport.
+- Billing gates block file mutations while suspended/retained but allow read/list/download by policy.
 
-## Agent contract alignment already done
-- `/start`, `/stop`, `/restart` forwarded as POST.
-- `/console/command` forwarded as POST JSON.
-- `/ready` is part of poller/readiness logic.
-- There is no API-native `/api/agent/:serverId/:action` route in `zpack-api`.
-- The reviewed Portal repo contained a Portal-side bridge with that shape which calls `/api/instances` for IP lookup and then contacts the agent directly; static review did not find an obvious current frontend caller for that bridge.
-
-## Billing / auth lifecycle
-- API issues access tokens and refresh tokens.
-- Password reset tokens are stored hashed and exchanged through API routes.
-- Password reset request delivers email through the configured support mailbox SMTP path first, with optional Resend fallback and console-link fallback for local development.
-- Password reset request routes are `POST /api/auth/password-reset/request` and alias `POST /api/auth/forgot-password`.
-- Password reset confirm routes are `POST /api/auth/password-reset/confirm` and alias `POST /api/auth/reset-password`.
-- Logged-in password change is available at `POST /api/auth/change-password` with bearer auth and body `{ currentPassword, newPassword }`.
-- Logged-in password change verifies the current password, enforces the same 8-character minimum, updates the password hash, and marks outstanding password reset tokens used.
-- Reset links use `RESET_PASSWORD_URL_BASE`, then `PORTAL_URL`, then `http://localhost:3000`, and point at `/reset-password?token=...`.
-- Reset request responses remain generic to avoid account enumeration.
-- Reset confirmation rejects passwords shorter than 8 characters and marks all outstanding reset tokens for that user used after a successful password change.
-- Default reset sender is `ZeroLag Hub Support ` and production SMTP is configured through `SMTP_HOST`, `SMTP_PORT`, `SMTP_SECURE`, `SMTP_USER`, and `SMTP_PASS`.
-- Stripe billing routes cover checkout, upgrade, downgrade, portal, and current billing state.
-- Stripe webhooks are mounted with raw body parsing before normal JSON middleware.
-- Billing scheduler starts in-process and performs limited reminder/reconciliation work.
-- Admin users are billing-exempt in billing flows.
-- JWT verification has reportedly been tightened to fixed algorithm plus issuer/audience separation for access, refresh, and IDE proxy tokens.
-- Pre-hardening tokens may no longer verify and a re-login may be required after this change.
 
 ## Hosted IDE proxy
 - `POST /api/dev/:id/ide-token` issues short-lived IDE proxy tokens.
-- IDE proxy supports both tunnel paths under `/__ide/:id` and hosted `dev-.` hosts.
-- Hosted IDE tokens can be delivered by query parameter and then persisted as the IDE proxy cookie.
-- Hosted URL return is controlled by `DEV_IDE_RETURN_HOSTED_URL`.
+- IDE proxy supports hosted `dev-.` hosts and tunnel paths.
+- Hosted IDE access was validated during dev provisioning worker tests.
 - API exposes code-server controls for owned dev containers:
-  - `POST /api/dev/:id/codeserver/start` -> Agent `POST /dev/codeserver/start`
-  - `POST /api/dev/:id/codeserver/stop` -> Agent `POST /dev/codeserver/stop`
-  - `POST /api/dev/:id/codeserver/restart` -> Agent `POST /dev/codeserver/restart`
+  - `POST /api/dev/:id/codeserver/start`
+  - `POST /api/dev/:id/codeserver/stop`
+  - `POST /api/dev/:id/codeserver/restart`
-- IDE proxy cookie hardening is expected to include `httpOnly`, `sameSite: "lax"`, and secure-cookie behavior tied to public HTTPS or explicit secure-cookie config.
-- Sensitive proxy logging has reportedly been reduced so cookies and forwarded header detail are not exposed in normal logs.
+ - `POST /api/dev/:id/codeserver/start` + - `POST /api/dev/:id/codeserver/stop` + - `POST /api/dev/:id/codeserver/restart` -## Console / outbound socket stability +## Console / socket stability - Console WebSocket proxy attachment is guarded so the console upgrade handler is only attached once per HTTP server. - Console proxy raw socket error logging is guarded to avoid stacking duplicate socket listeners. - API raises listener limits on inbound HTTP sockets and console WebSocket sockets to avoid false-positive listener warnings under proxy/websocket fan-out. -- Axios-backed outbound clients use shared HTTP/HTTPS agent helpers that raise outbound socket listener limits at socket creation time. -- The outbound socket agent helper is used by Proxmox, OPNsense, Cloudflare, and Prometheus metrics query paths. -- A temporary `MaxListenersExceededWarning` tracer exists in `src/app.js` to log emitter/event/count/stack if listener warnings recur. +- Portal terminal connection-state hardening remains a Portal launch item. ## Legacy / archived behavior - Legacy port allocation / slot reservation is no longer part of the live route mounts. -- Legacy worker provisioning, detached reconcile helpers, explicit `.old` files, `src/tmp`, checked-in keys/tokens, and local artifacts have been removed from the active tree rather than kept as live repo clutter. -- Websocket console proxy wiring remains outside `agentClient.js`. +- Legacy synchronous HTTP provisioning is no longer the launch model; async BullMQ provisioning worker is the current model. - Raw streaming upload proxy behavior remains outside `agentClient.js`. - -## Node 24 cleanup already reflected in API repo -- `RegExp.escape(...)` is used where host / suffix regex escaping was previously manual. -- Selected built-in imports have been normalized to `node:` style. +- Non-runtime clutter such as checked-in keys/tokens, local artifacts, `.old` scripts, `src/tmp`, and retired legacy trees should stay out of the active repo.