# API — Current State

This file records what is believed to be implemented now in `zpack-api`.

## Runtime / dependency baseline

- API is tracked against Node 24 with repo-local pinning via `package.json` engines and `.nvmrc`.
- Direct `node-fetch` dependency has been removed; API code now uses the built-in global `fetch`.
- Prisma config has been migrated out of the deprecated `package.json#prisma` block into `prisma.config.ts`.
- Prisma generate / validate checks have passed on the current launch baseline.
- API repo is JavaScript ESM; do not introduce TypeScript into API launch work unless a separate migration is explicitly approved.

## Service/process model

Launch service/process set:

```text
zpack-api.service               # HTTP/API
zpack-provision-worker.service  # BullMQ provisioning worker
zpack-repair-worker.service     # Level 1 repair worker
zlh-controller.service          # singleton reconciler/controller
zpack-billing-worker.service    # billing enforcement worker
```

Guardrail: do not add more worker/systemd services before launch unless there is a strong safety-boundary reason.

## Provisioning / async create

- `POST /api/instances` is the active create entrypoint.
- Create flow validates request/account state, creates or reuses a durable `ProvisioningOperation`, enqueues a BullMQ `provisioning` job, and returns `202 Accepted` with `operationId` / `statusUrl`.
- `src/queues/provisioning.js` is the live provisioning worker entrypoint.
- Worker concurrency is `1`; blind retries are disabled with `attempts: 1` (see the sketch after this list).
- BullMQ job ID uses the colon-free operation ID.
- Portal sends an `Idempotency-Key`; backend duplicate protection also includes a no-key launch guard.
- Worker callbacks update operation status, phase, heartbeat, server persistence, and cleanup metadata.
- Live validation passed for game and dev provisioning through the worker path.
- Portal async pending cards were visually validated and replace themselves with the real server card after completion.
- Existing API teardown works for worker-created servers.

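A minimal sketch of the enqueue/worker pairing under the constraints above, assuming BullMQ with a local Redis connection; the job name, payload shape, and handler body are illustrative:

```js
// Enqueue side: create/reuse the durable ProvisioningOperation first, then
// add the BullMQ job keyed by the colon-free operation ID so replays collapse
// onto a single job.
import { Queue, Worker } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 }; // assumed local Redis

const queue = new Queue("provisioning", { connection });

export async function enqueueProvisioning(operation, payload) {
  const jobId = operation.id.replaceAll(":", ""); // BullMQ job IDs must be colon-free
  await queue.add("provision", payload, { jobId, attempts: 1 }); // no blind retries
}

// Worker side (zpack-provision-worker.service): one job at a time.
new Worker(
  "provisioning",
  async (job) => {
    // ...validate, provision, poll agent readiness, publish edge, persist...
  },
  { connection, concurrency: 1 }
);
```
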
## Controller / repair model

- `src/controllers/reconciler.js` is the controller loop.
- Controller runs as `zlh-controller.service`, separate from the HTTP API process.
- Controller is singleton-protected by Redis lock key `zlh:controller:lock` (see the sketch below).
- Controller should remain conservative: Level 0/1 only for launch.
- Controller is expected to run in dry-run mode unless Level 1 auto-repair is deliberately enabled.
- `src/controllers/repairPolicy.js` owns repair decision rules.
- `src/queues/repair.js` is the Level 1 repair worker queue.
- Validated Level 1 behavior:
  - an expired provisioning operation can be marked `stale`
  - live Cloudflare SRV drift can be detected and repaired by `edge_republish`
- `edge_republish` uses the existing full edge publish path and post-checks live edge state so it does not fake success.
- Level 2 actions such as agent/workload restart remain disabled.
- Level 3 destructive actions such as restore/rebuild/delete are never automatic.

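A minimal sketch of the singleton lock, assuming `ioredis`; the lock key matches the notes above, while the TTL and holder ID are assumptions:

```js
// Singleton guard for zlh-controller.service, assuming ioredis.
import Redis from "ioredis";

const redis = new Redis();
const LOCK_KEY = "zlh:controller:lock";
const LOCK_TTL_MS = 30_000; // illustrative; renewed while the loop is healthy
const holderId = `${process.pid}-${Date.now()}`; // illustrative holder identity

export async function acquireControllerLock() {
  // SET NX PX: only one controller process holds the lock at any time.
  const ok = await redis.set(LOCK_KEY, holderId, "PX", LOCK_TTL_MS, "NX");
  return ok === "OK";
}
```
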
## Billing enforcement

- Durable billing enforcement exists through `BillingEnforcementState`, `BillingEnforcementEvent`, and `StripeEventLog`.
- Stripe webhook handling covers `invoice.payment_failed`, `invoice.paid`, `invoice.payment_succeeded`, `customer.subscription.updated`, and `customer.subscription.deleted`.
- Stripe events are idempotent through `StripeEventLog` (sketched below).
- `src/services/billingEnforcement.js` owns billing states, guard/assert helpers, Stripe state transitions, and due-enforcement calculation.
- `src/queues/billingEnforcement.js` owns billing enforcement queue execution.
- `zpack-billing-worker.service` is installed and running under systemd.
- Billing actions are allowlisted: warning, final warning, backup block, suspension shutdown, retained marking, restore access.
- Destructive billing actions are rejected and audited.
- API gates are implemented for provisioning, start/restart, backup mutations, console command/stream, and file mutations.
- File read/list/download remains allowed while suspended/retained, by policy.
- Suspended/retained/pending-deletion billing state suppresses edge/DNS/Velocity repair and live edge observation.
- Validated flows include payment failed, replay idempotency, backup block state, suspension/shutdown safety, API gates while suspended, controller no-repair while suspended, payment restored, and destructive rejection.

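A minimal sketch of the replay protection, assuming `StripeEventLog` is a Prisma model with a unique Stripe event ID column (field names are illustrative):

```js
// Record-once guard: insert the Stripe event ID into StripeEventLog and treat
// a unique violation as a replayed delivery.
export async function recordStripeEventOnce(prisma, event) {
  try {
    await prisma.stripeEventLog.create({
      data: { stripeEventId: event.id, type: event.type }, // illustrative fields
    });
    return true; // first delivery: safe to process
  } catch (err) {
    if (err.code === "P2002") return false; // unique violation: replay, skip
    throw err;
  }
}
```
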
## Support tickets

- `POST /api/support/create` is implemented and mounted under `/api/support`.
- API creates a `SupportTicket` DB row with human-readable ticket number `ZLH-YYYYMMDD-XXXX`.
- Customer acknowledgement email is sent through SMTP.
- Discord `#support` alert is sent via `DISCORD_SUPPORT_WEBHOOK` (sketched below).
- Optional support mailbox copy uses `SUPPORT_EMAIL_TO` and sets email `Reply-To` for replies.
- Portal form submit, customer acknowledgement email, Discord alert, and DB-backed ticket creation were validated live.
- Support triage, admin ticket list/view, inbound reply parsing, attachments, and self-hosted helpdesk integration are post-launch enhancements.

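A minimal sketch of the Discord alert using the built-in global `fetch`; only `DISCORD_SUPPORT_WEBHOOK` comes from the notes above, and the message shape is an assumption:

```js
// Post a short alert to the #support channel webhook, if configured.
export async function sendSupportAlert(ticket) {
  const url = process.env.DISCORD_SUPPORT_WEBHOOK;
  if (!url) return; // alerting is optional; skip quietly when unconfigured
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      content: `New support ticket ${ticket.number}: ${ticket.subject}`, // illustrative shape
    }),
  });
}
```
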
## Security / trust boundaries

- API has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
- `src/middleware/requireAdmin.js` protects admin-only API routes.
- `src/middleware/requireInternalToken.js` protects internal-only control-plane routes with `INTERNAL_API_TOKEN` / `ZLH_INTERNAL_API_TOKEN` (see the sketch below).
- Internal-token routes accept `X-ZLH-Internal-Token`, `X-Internal-Token`, or bearer auth carrying the internal token.
- Internal-token routes fail closed when token config is missing, except for explicit `NODE_ENV=development` or `NODE_ENV=test` local flows.
- `/api/audit` is admin-only.
- `GET /api/instances` is admin-only global inventory.
- `/api/edge/*`, `/api/proxmox/*`, raw control-plane routes, and monitoring/service-discovery surfaces require the appropriate admin/internal/discovery token boundary.
- `/sd/exporters` and `/monitoring/*` fail closed in production-like environments when discovery tokens are missing.
- Security boundary documentation lives in `docs/security-boundaries.md` in the API repo.
- Portal never calls agents directly for normal user flows and must not expose internal tokens in browser code.

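A minimal sketch of the internal-token boundary as Express middleware; header names, env vars, and fail-closed behavior follow the notes above, while the response codes and timing-safe compare are assumptions:

```js
import { timingSafeEqual } from "node:crypto";

export function requireInternalToken(req, res, next) {
  const expected = process.env.INTERNAL_API_TOKEN || process.env.ZLH_INTERNAL_API_TOKEN;
  if (!expected) {
    // Fail closed when token config is missing, except explicit dev/test local flows.
    const local = ["development", "test"].includes(process.env.NODE_ENV);
    return local ? next() : res.status(503).json({ error: "internal_token_unconfigured" });
  }
  const bearer = (req.headers.authorization || "").replace(/^Bearer\s+/i, "");
  const given = req.headers["x-zlh-internal-token"] || req.headers["x-internal-token"] || bearer;
  const a = Buffer.from(String(given ?? ""));
  const b = Buffer.from(expected);
  if (a.length === b.length && timingSafeEqual(a, b)) return next();
  return res.status(403).json({ error: "forbidden" });
}
```
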
## Route and lifecycle split

- `/api/instances`, `/api/servers`, and `/api/containers` are distinct operational surfaces and should not be treated as duplicates.
- `POST /api/instances` is the active agent-driven provisioning entrypoint.
- `GET /api/instances` is admin-only global inventory rather than a normal Portal user list.
- `GET /api/servers` is the user-owned server list surface for Portal.
- `DELETE /api/servers/:id` is the preferred Portal/user-owned delete contract. It requires a normal user bearer token and checks ownership before teardown (see the sketch below).
- `DELETE /api/containers/:vmid` remains the raw cleanup/orphan-remediation route. It is primarily internal-token protected and currently also accepts owned-user bearer deletes for Portal compatibility during migration.
- Orphan remediation should stay in internal/admin workflows; Portal deletes only active servers the DB says the user owns.

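A minimal sketch of the user-owned delete contract; the route shape and ownership rule come from the notes above, while the Prisma field names and the `teardownContainer` helper are hypothetical stand-ins:

```js
// DELETE /api/servers/:id — normal user bearer token, ownership checked first.
app.delete("/api/servers/:id", requireUser, async (req, res) => {
  const server = await prisma.containerInstance.findUnique({
    where: { id: req.params.id }, // illustrative key shape
  });
  if (!server || server.userId !== req.user.id) {
    // 404 rather than 403 so other users' servers are not enumerable.
    return res.status(404).json({ error: "not_found" });
  }
  const result = await teardownContainer(server); // hypothetical shared teardown entry
  return res.json(result);
});
```
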
## Teardown / orphan cleanup behavior

- Container teardown logic lives in `src/services/containerTeardown.js` so user-owned delete and raw internal delete share the same workflow.
- Teardown archives a `DeletedInstance` record before removing the active `ContainerInstance` row (sketched below).
- Archive metadata is derived from current instance fields (`payload`, `engineType`, `allocatedPorts`) instead of assuming pre-v2 `game` / `variant` / `ports` columns still exist on `ContainerInstance`.
- Teardown refuses to delete a running Proxmox container and returns a conflict-style response until the host is stopped.
- Game teardown performs DNS / Cloudflare / Technitium / Velocity cleanup through existing publisher cleanup paths.
- Dev teardown performs dev IDE cleanup through the dev IDE publisher path.
- Live teardown was validated for worker-created game and dev servers (one VMID each): both active rows were removed and archived with `reason=user_delete`.
- A read-only helper script exists at `src/scripts/monitorTeardown.js`, exposed as `npm run monitor:teardown -- <vmid> [...]`, for watching active/archive DB state during teardown validation.

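A minimal sketch of the archive-before-delete ordering, assuming Prisma models matching `DeletedInstance` / `ContainerInstance`; the exact field mapping is illustrative:

```js
// Archive first, then remove the active row, in one transaction so a crash
// cannot leave a deleted instance with no archive record.
export async function archiveAndRemove(prisma, instance, reason) {
  await prisma.$transaction([
    prisma.deletedInstance.create({
      data: {
        vmid: instance.vmid, // illustrative field mapping
        reason, // e.g. "user_delete"
        // Metadata derived from current fields, not pre-v2 columns.
        payload: instance.payload,
        engineType: instance.engineType,
        allocatedPorts: instance.allocatedPorts,
      },
    }),
    prisma.containerInstance.delete({ where: { id: instance.id } }),
  ]);
}
```
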
## Readiness / agent state model

- API is the heartbeat authority by polling agents; the Agent does not push state to the API.
- API consumes:
  - `/health` for liveness
  - `/ready` for semantic readiness
  - `/status` for a detailed state snapshot
- Portal should rely on API-normalized state, not direct agent state.
- Proxy lifecycle is tracked separately from agent readiness under `ContainerInstance.payload.proxy`.
- `ContainerInstance.payload.edge` and `ContainerInstance.payload.proxy` are merged through an atomic JSON merge helper (sketched below) so edge publish state and Velocity callback state do not clobber each other during near-simultaneous writes.

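A minimal sketch of the atomic merge, assuming Postgres jsonb storage; table and column names are illustrative. The server-side `||` concatenation replaces only the named top-level key, so an `edge` write cannot clobber a concurrent `proxy` write:

```js
// Merge one top-level key into ContainerInstance.payload server-side instead
// of read-modify-write in JS, so concurrent writers cannot clobber each other.
export async function mergePayloadKey(prisma, id, key, value) {
  await prisma.$executeRaw`
    UPDATE "ContainerInstance"
    SET "payload" = COALESCE("payload", '{}'::jsonb)
                 || jsonb_build_object(${key}::text, ${JSON.stringify(value)}::jsonb)
    WHERE "id" = ${id}`;
}

// e.g. mergePayloadKey(prisma, serverId, "proxy", { status: "proxy_ping_ok" })
```
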
## Readiness cleanup already done

- `agentClient.js` centralizes non-streaming agent transport.
- `getAgentReady()` remains low-level transport.
- `isAgentReadyResult()` is the shared semantic readiness helper.
- `assertAgentReady()` uses semantic readiness.
- Poller only caches `ready: true` when `/ready` returns semantic success.
- Provisioning requires semantic readiness before success/persist/publish.
- Timeout handling in `agentClient.js` has been modernized to `AbortSignal.timeout(...)` (see the sketch below).
- A final alignment pass is still needed in merged live status so any Portal-facing `agentReady` field fully matches semantic `/ready` rather than looser cache presence.

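A minimal sketch of the modernized timeout style; the 5-second budget and response handling are illustrative:

```js
// AbortSignal.timeout replaces manual AbortController + setTimeout plumbing.
export async function fetchAgentReady(agentBaseUrl) {
  const res = await fetch(`${agentBaseUrl}/ready`, {
    signal: AbortSignal.timeout(5_000), // rejects with TimeoutError on expiry
  });
  return { ok: res.ok, status: res.status, body: await res.json().catch(() => null) };
}
```
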
## Provisioning / post-provision flow

- `src/api/provisionAgent.js` is the active provisioning path.
- Older worker-based provisioning code is no longer treated as a live API path.
- Post-provision edge publishing is still active.
- Post-provision behavior has been adjusted away from legacy port-allocation commits and should be understood as request-driven / payload-driven.
- Edge publish state and Velocity callback state must be merged into `ContainerInstance.payload` atomically.
- Transient Proxmox `lxc/status/current` read errors are soft-handled by host-status polling and surface `powerState: unknown` rather than raw API 500s (sketched below).

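A minimal sketch of that soft-handling; the Proxmox client call is a stand-in for whatever transport the API actually uses:

```js
// Transient lxc/status/current read failures degrade to "unknown" instead of
// bubbling up as a raw 500 from the status route; the next poll retries.
async function readHostPower(proxmoxGet, node, vmid) {
  try {
    const current = await proxmoxGet(`/nodes/${node}/lxc/${vmid}/status/current`);
    return { powerState: current.status === "running" ? "running" : "stopped" };
  } catch {
    return { powerState: "unknown" };
  }
}
```
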
## Velocity / Minecraft edge lifecycle

- Minecraft edge publish uses Velocity instead of Traefik TCP config.
- API registers Minecraft backends with the Velocity bridge using `POST /zpack/register` and unregisters with `POST /zpack/unregister`.
- Velocity registration verification uses the bridge's real `GET /zpack/status` route instead of the nonexistent `/zpack/list` route.
- `/zpack/status` is expected to expose a `servers` array containing `{ name, address, port }` entries.
- If `/zpack/status` exists but does not expose registered backends, API treats the register response as acknowledged but unverifiable instead of falsely failing the registration (see the verification sketch below).
- API exposes `POST /internal/velocity/proxy-status` for bridge lifecycle callbacks using the same hashed `X-Zpack-Secret` shared-secret auth style.
- Accepted proxy lifecycle statuses are `registered_with_proxy`, `proxy_ping_ok`, and `proxy_ping_failed`.
- Proxy lifecycle callbacks are stored under `ContainerInstance.payload.proxy` with source, status, server name, address, port, duplicate flag, timestamps, latency, detail, and last event time.
- `GET /api/servers/:id/status` includes the stored `proxy` object beside agent-derived live status, and derives Minecraft connection state from agent readiness, persisted edge state, Velocity registration state, and backend ping state.
- When Velocity has registered a backend but has not posted a separate `proxy_ping_ok`, API can treat an agent-confirmed Minecraft ping (`readySource: "minecraft_ping"`) as a backend ping fallback unless Velocity explicitly reported `proxy_ping_failed`.
- The bridge should set `ZPACK_PROXY_STATUS_ENDPOINT` to the API internal callback URL, for example `http://<api-host>:4000/internal/velocity/proxy-status`.

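A minimal sketch of the verification step against `GET /zpack/status`; the secret header value and the return shape are assumptions:

```js
// Verify a registration against the bridge's /zpack/status servers list.
async function verifyVelocityRegistration(bridgeUrl, secretHash, serverName) {
  const res = await fetch(`${bridgeUrl}/zpack/status`, {
    headers: { "X-Zpack-Secret": secretHash },
  });
  if (!res.ok) return { verified: false, reason: "status_unavailable" };
  const body = await res.json();
  if (!Array.isArray(body.servers)) {
    // Status exists but exposes no backends: acknowledged but unverifiable.
    return { verified: false, reason: "unverifiable" };
  }
  return { verified: body.servers.some((s) => s.name === serverName) };
}
```
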
## Host / LXC lifecycle state

- `/api/servers/:id/host/status` exposes the underlying Proxmox LXC state for both game and dev containers.
- Host status distinguishes container power state from agent/game runtime state.
- Host status response includes `hostStatus`, `powerState`, `running`, active `operation`, and selected Proxmox stats.
- `hostStatus` can report `running`, `stopped`, `starting`, `stopping`, or `restarting`.
- Host lifecycle routes return `202 Accepted` with an operation object and `statusUrl` instead of blocking until Proxmox finishes:
  - `POST /api/servers/:id/host/start`
  - `POST /api/servers/:id/host/stop`
  - `POST /api/servers/:id/host/restart`
- Overlapping host lifecycle operations return `409 host_operation_in_progress` (see the sketch below).
- Host lifecycle operation state is cached under Redis `hostop:<vmid>` with a short TTL and is also surfaced in `GET /api/servers` as `hostOperation`.
- `GET /api/servers` includes Proxmox-backed `hostStatus` and `powerState` so Portal list views can show LXC-level start/stop/restart state for game and dev containers.
- Host lifecycle routes perform ownership and billing checks before touching Proxmox.
- Proxmox client can resolve the actual node for an LXC via `/cluster/resources` instead of assuming every VMID lives on the configured default `PROXMOX_NODE`.

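A minimal sketch of the async lifecycle contract for one route; the `hostop:<vmid>` key, `409` code, and `202` shape come from the notes above, while the Redis client, TTL, and helper names are hypothetical:

```js
// Reject overlap with NX, cache the operation under hostop:<vmid>, answer 202.
app.post("/api/servers/:id/host/restart", requireUser, async (req, res) => {
  const vmid = await resolveOwnedVmid(req); // hypothetical ownership + billing gate
  const claimed = await redis.set(
    `hostop:${vmid}`,
    JSON.stringify({ operation: "restart", startedAt: Date.now() }),
    "EX", 120, "NX" // short TTL so a crashed worker cannot wedge the VMID
  );
  if (!claimed) {
    return res.status(409).json({ error: "host_operation_in_progress" });
  }
  runHostRestart(vmid); // hypothetical async executor; clears hostop when done
  return res.status(202).json({
    operation: "restart",
    statusUrl: `/api/servers/${req.params.id}/host/status`,
  });
});
```
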
## Backup support

- API forwards game backup operations. Current route shape:
  - `GET /api/game/servers/:id/backups`
  - `POST /api/game/servers/:id/backups`
  - `POST /api/game/servers/:id/backups/restore?id=<backup_id>`
  - `DELETE /api/game/servers/:id/backups/:backupId`
- Restore start is async at the API layer and Portal is expected to poll status rather than hold the restore POST open.
- API forwards agent HTTP status codes for backup responses.
- Successful backup responses currently pass through the agent body.
- Non-OK backup responses currently use the shared agent response envelope: `{ error: <fallback>, details: <agent_body> }` (sketched below).
- Backup response shape normalization remains open.
- Billing backup-block gates are implemented; full live route validation still needs a game backup fixture with backups available.

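A minimal sketch of the forwarding/envelope behavior; the helper name is illustrative:

```js
// Forward the agent's status code; pass the body through on success, wrap it
// in the shared envelope on failure.
async function forwardBackupResponse(res, agentRes, fallbackError) {
  const body = await agentRes.json().catch(() => null);
  if (agentRes.ok) return res.status(agentRes.status).json(body);
  return res.status(agentRes.status).json({ error: fallbackError, details: body });
}
```
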
## File proxy / route compatibility

- Duplicated game file proxy logic has been extracted into `src/routes/helpers/gameFileProxy.js` in the API repo.
- Route compatibility is intentionally preserved between:
  - `/api/game/servers/:id/files...`
  - `/api/servers/:id/files...`
- Streamed upload / download / file edit forwarding still exists outside the generic non-streaming agent helper path.
- Compatibility between canonical game routes and legacy/compatibility server routes should be treated as part of the API contract.

## Agent contract alignment already done

- `/start`, `/stop`, `/restart` are forwarded as POST.
- `/console/command` is forwarded as POST JSON.
- `/ready` is part of poller/readiness logic.
- There is no API-native `/api/agent/:serverId/:action` route in `zpack-api`.
- The reviewed Portal repo contained a Portal-side bridge with that shape, which calls `/api/instances` for IP lookup and then contacts the agent directly; static review did not find an obvious current frontend caller for that bridge.

## Billing / auth lifecycle

- API issues access tokens and refresh tokens.
- Password reset tokens are stored hashed and exchanged through API routes.
- Password reset request delivers email through the configured support mailbox SMTP path first, with optional Resend fallback and console-link fallback for local development.
- Password reset request routes are `POST /api/auth/password-reset/request` and alias `POST /api/auth/forgot-password`.
- Password reset confirm routes are `POST /api/auth/password-reset/confirm` and alias `POST /api/auth/reset-password`.
- Logged-in password change is available at `POST /api/auth/change-password` with bearer auth and body `{ currentPassword, newPassword }`.
- Logged-in password change verifies the current password, enforces the same 8-character minimum, updates the password hash, and marks outstanding password reset tokens used.
- Reset links use `RESET_PASSWORD_URL_BASE`, then `PORTAL_URL`, then `http://localhost:3000`, and point at `/reset-password?token=...`.
- Reset request responses remain generic to avoid account enumeration.
- Reset confirmation rejects passwords shorter than 8 characters and marks all outstanding reset tokens for that user used after a successful password change.
- Default reset sender is `ZeroLag Hub Support <support@zerolaghub.com>` and production SMTP is configured through `SMTP_HOST`, `SMTP_PORT`, `SMTP_SECURE`, `SMTP_USER`, and `SMTP_PASS`.
- Stripe billing routes cover checkout, upgrade, downgrade, portal, and current billing state.
- Stripe webhooks are mounted with raw body parsing before normal JSON middleware.
- Billing scheduler starts in-process and performs limited reminder/reconciliation work.
- Admin users are billing-exempt in billing flows.
- JWT verification has reportedly been tightened to a fixed algorithm plus issuer/audience separation for access, refresh, and IDE proxy tokens (see the sketch below).
- Pre-hardening tokens may no longer verify, and a re-login may be required after this change.

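A minimal sketch of the tightened verification, assuming the `jsonwebtoken` package; the exact algorithm, issuer, and audience strings are illustrative:

```js
// Fixed algorithm plus per-token-type audience, so an IDE proxy token can
// never verify as an access token.
import jwt from "jsonwebtoken";

export function verifyAccessToken(token) {
  return jwt.verify(token, process.env.JWT_SECRET, {
    algorithms: ["HS256"], // fixed; no algorithm negotiation
    issuer: "zpack-api", // illustrative issuer
    audience: "zpack-access", // separate audiences for access / refresh / IDE proxy
  });
}
```
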
## Hosted IDE proxy

- `POST /api/dev/:id/ide-token` issues short-lived IDE proxy tokens.
- IDE proxy supports both tunnel paths under `/__ide/:id` and hosted `dev-<vmid>.<suffix>` hosts.
- Hosted IDE tokens can be delivered by query parameter and then persisted as the IDE proxy cookie.
- Hosted URL return is controlled by `DEV_IDE_RETURN_HOSTED_URL`.
- Hosted IDE access was validated during dev provisioning worker tests.
- API exposes code-server controls for owned dev containers:
  - `POST /api/dev/:id/codeserver/start` -> Agent `POST /dev/codeserver/start`
  - `POST /api/dev/:id/codeserver/stop` -> Agent `POST /dev/codeserver/stop`
  - `POST /api/dev/:id/codeserver/restart` -> Agent `POST /dev/codeserver/restart`
- IDE proxy cookie hardening is expected to include `httpOnly`, `sameSite: "lax"`, and secure-cookie behavior tied to public HTTPS or explicit secure-cookie config (sketched below).
- Sensitive proxy logging has reportedly been reduced so cookies and forwarded header detail are not exposed in normal logs.

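A minimal sketch of the expected cookie hardening, assuming Express's `res.cookie`; the cookie name and TTL are illustrative:

```js
// Persist the query-delivered IDE token as a hardened cookie.
export function setIdeProxyCookie(res, token, { publicHttps }) {
  res.cookie("zpack_ide_token", token, {
    httpOnly: true, // not readable from IDE page scripts
    sameSite: "lax",
    secure: Boolean(publicHttps), // tied to public HTTPS / explicit secure config
    maxAge: 10 * 60 * 1000, // short-lived, matching short-lived IDE tokens
  });
}
```
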
## Console / outbound socket stability

- Console WebSocket proxy attachment is guarded so the console upgrade handler is only attached once per HTTP server.
- Console proxy raw socket error logging is guarded to avoid stacking duplicate socket listeners.
- API raises listener limits on inbound HTTP sockets and console WebSocket sockets to avoid false-positive listener warnings under proxy/websocket fan-out.
- Axios-backed outbound clients use shared HTTP/HTTPS agent helpers that raise outbound socket listener limits at socket creation time, as sketched below.
- The outbound socket agent helper is used by Proxmox, OPNsense, Cloudflare, and Prometheus metrics query paths.
- A temporary `MaxListenersExceededWarning` tracer exists in `src/app.js` to log emitter/event/count/stack if listener warnings recur.
- Portal terminal connection-state hardening remains a Portal launch item.

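One way to raise outbound socket listener limits at creation time is an `http.Agent` subclass; this is a sketch, not the repo's actual helper, and the wrapper shape and limit value are assumptions:

```js
import http from "node:http";
import https from "node:https";

// Bump per-socket listener headroom the moment the agent creates a socket,
// before axios/proxy layers start attaching their own handlers.
function withRaisedListeners(AgentClass) {
  return class extends AgentClass {
    createConnection(options, callback) {
      const socket = super.createConnection(options, callback);
      socket.setMaxListeners(30); // illustrative headroom for fan-out
      return socket;
    }
  };
}

export const sharedHttpAgent = new (withRaisedListeners(http.Agent))({ keepAlive: true });
export const sharedHttpsAgent = new (withRaisedListeners(https.Agent))({ keepAlive: true });
// e.g. axios.create({ httpAgent: sharedHttpAgent, httpsAgent: sharedHttpsAgent })
```
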
## Legacy / archived behavior

- Legacy port allocation / slot reservation is no longer part of the live route mounts.
- Legacy worker provisioning, detached reconcile helpers, explicit `.old` files, `src/tmp`, checked-in keys/tokens, and local artifacts have been removed from the active tree rather than kept as live repo clutter.
- Legacy synchronous HTTP provisioning is no longer the launch model; the async BullMQ provisioning worker is the current model.
- WebSocket console proxy wiring remains outside `agentClient.js`.
- Raw streaming upload proxy behavior remains outside `agentClient.js`.

## Node 24 cleanup already reflected in API repo

- `RegExp.escape(...)` is used where host / suffix regex escaping was previously manual (sketched below).
- Selected built-in imports have been normalized to `node:` style.
- Non-runtime clutter such as checked-in keys/tokens, local artifacts, `.old` scripts, `src/tmp`, and retired legacy trees should stay out of the active repo.

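A minimal sketch of the pattern; the suffix value is illustrative:

```js
// Node 24 ships RegExp.escape, replacing hand-rolled escaping of host suffixes.
const suffix = ".dev.example.com"; // illustrative hosted-IDE suffix
const hostedDevHost = new RegExp(`^dev-(\\d+)${RegExp.escape(suffix)}$`);

console.log(hostedDevHost.test("dev-105.dev.example.com")); // true
```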