API — Current State

This file records what is believed to be implemented now.

Runtime / dependency baseline

  • API is tracked against Node 24 with repo-local pinning via package.json engines and .nvmrc.
  • Direct node-fetch dependency has been removed and API code now uses built-in global fetch.
  • Prisma config has been migrated out of deprecated package.json#prisma into prisma.config.ts.
  • Prisma generate / validate checks reportedly pass on the current API baseline.

Security / trust boundaries

  • The API now has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
  • src/middleware/requireAdmin.js protects admin-only API routes.
  • src/middleware/requireInternalToken.js protects internal-only control-plane routes with INTERNAL_API_TOKEN / ZLH_INTERNAL_API_TOKEN.
  • Internal-token routes accept X-ZLH-Internal-Token, X-Internal-Token, or bearer auth carrying the internal token.
  • Internal-token routes fail closed when the token is missing, except for explicit NODE_ENV=development or NODE_ENV=test local flows.
  • /api/audit is admin-only.
  • GET /api/instances is admin-only global inventory.
  • /api/edge/* and /api/proxmox/* are internal-token protected.
  • /sd/exporters and /monitoring/* now fail closed in production-like environments when discovery tokens are missing.
  • Security boundary documentation lives in docs/security-boundaries.md in the API repo.
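A minimal sketch of the internal-token acceptance rules above. The accepted header names, env vars, and fail-closed behavior follow the bullets; the function names and return shape are illustrative, not the actual requireInternalToken.js code:

```javascript
// Hypothetical sketch of the internal-token check; real middleware differs.
function extractInternalToken(headers) {
  const auth = headers["authorization"] || "";
  const bearer = auth.match(/^Bearer\s+(.+)$/i);
  return (
    headers["x-zlh-internal-token"] ||
    headers["x-internal-token"] ||
    (bearer ? bearer[1] : null)
  );
}

function checkInternalToken(headers, env) {
  const expected = env.INTERNAL_API_TOKEN || env.ZLH_INTERNAL_API_TOKEN;
  if (!expected) {
    // Fail closed when no token is configured, except explicit local flows.
    const local = env.NODE_ENV === "development" || env.NODE_ENV === "test";
    return { ok: local, reason: local ? "local_bypass" : "token_not_configured" };
  }
  const presented = extractInternalToken(headers);
  return presented === expected
    ? { ok: true, reason: "token_match" }
    : { ok: false, reason: "token_mismatch" };
}
```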

Route and lifecycle split

  • /api/instances, /api/servers, and /api/containers are different operational surfaces and should not be treated as duplicates.
  • POST /api/instances is the active agent-driven provisioning entrypoint.
  • GET /api/instances is now admin-only global inventory rather than a normal Portal user list.
  • GET /api/servers is the user-owned server list surface for Portal.
  • DELETE /api/servers/:id is the preferred Portal/user-owned delete contract. It requires a normal user bearer token and checks ownership before teardown.
  • DELETE /api/containers/:vmid remains the raw cleanup/orphan-remediation route. It is primarily internal-token protected and currently also accepts owned-user bearer deletes for Portal compatibility during migration.
  • Orphan remediation should stay in internal/admin workflows. Portal deletes only active servers the DB says the user owns.

Teardown / orphan cleanup behavior

  • Container teardown logic has been extracted into src/services/containerTeardown.js so user-owned delete and raw internal delete share the same workflow.
  • Teardown archives a DeletedInstance record before removing the active ContainerInstance row.
  • Archive metadata is derived from current instance fields (payload, engineType, allocatedPorts) instead of assuming pre-v2 game / variant / ports columns still exist on ContainerInstance.
  • Teardown still refuses to delete a running Proxmox container and returns a conflict-style response until the host is stopped.
  • Game teardown performs DNS / Cloudflare / Technitium / Velocity cleanup through existing publisher cleanup paths.
  • Dev teardown performs dev IDE cleanup through the dev IDE publisher path.
  • Live validation completed for one game VMID and one dev VMID: both active rows were removed and archived with reason=user_delete.
  • A read-only helper script exists at src/scripts/monitorTeardown.js, exposed as npm run monitor:teardown -- <vmid> [...], for watching active/archive DB state during teardown validation.
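The archive-before-delete step above can be sketched as a record builder. Field names mirror the description (payload, engineType, allocatedPorts, reason=user_delete), not the actual Prisma schema:

```javascript
// Illustrative DeletedInstance shape derived from current v2 instance fields,
// not legacy game/variant/ports columns.
function buildArchiveRecord(instance, reason) {
  return {
    vmid: instance.vmid,
    engineType: instance.engineType,
    allocatedPorts: instance.allocatedPorts ?? [],
    payload: instance.payload ?? {},
    reason,
    deletedAt: new Date().toISOString(),
  };
}
```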

Readiness / agent state model

  • API is the heartbeat authority: it polls agents rather than relying on agent-pushed state.
  • API consumes:
    • /health for liveness
    • /ready for semantic readiness
    • /status for detailed state snapshot
  • Portal should rely on API-normalized state, not direct agent state.
  • Proxy lifecycle is tracked separately from agent readiness under ContainerInstance.payload.proxy.
  • ContainerInstance.payload.edge and ContainerInstance.payload.proxy are merged through an atomic JSON merge helper so edge publish state and Velocity callback state do not clobber each other during near-simultaneous writes.
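The merge semantics of the payload.edge / payload.proxy helper can be sketched as below. This shows only the non-clobbering merge rule; the real helper is described as applying it atomically at the database layer:

```javascript
// Merge only the named subtree (e.g. "edge" or "proxy"), leaving sibling
// keys untouched so near-simultaneous writers do not clobber each other.
function mergePayloadKey(payload, key, patch) {
  const current = (payload && payload[key]) || {};
  return { ...payload, [key]: { ...current, ...patch } };
}
```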

Readiness cleanup already done

  • agentClient.js centralizes non-streaming agent transport.
  • getAgentReady() remains low-level transport.
  • isAgentReadyResult() is the shared semantic readiness helper.
  • assertAgentReady() uses semantic readiness.
  • Poller only caches ready: true when /ready returns semantic success.
  • Provisioning requires semantic readiness before success/persist/publish.
  • Timeout handling in agentClient.js has been modernized to AbortSignal.timeout(...).
  • A final alignment pass is still needed in merged live status so any Portal-facing agentReady field fully matches semantic /ready rather than looser cache presence.
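The semantic-readiness distinction above amounts to: an HTTP 200 from /ready is not enough, the body must assert readiness. A sketch with assumed field names (the real isAgentReadyResult() may inspect a different shape):

```javascript
// Semantic readiness: transport success AND an explicit ready: true body.
function isAgentReadyResult(result) {
  return Boolean(result && result.ok && result.body && result.body.ready === true);
}
```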

Provisioning / post-provision flow

  • src/api/provisionAgent.js is the active provisioning path.
  • Older worker-based provisioning code is no longer treated as a live API path.
  • Post-provision edge publishing is still active.
  • Post-provision behavior has been adjusted away from legacy port-allocation commits and should be understood as request-driven / payload-driven.

Velocity / Minecraft edge lifecycle

  • Minecraft edge publish uses Velocity instead of Traefik TCP config.
  • API registers Minecraft backends with the Velocity bridge using POST /zpack/register and unregisters with POST /zpack/unregister.
  • Velocity registration verification uses the bridge's real GET /zpack/status route instead of the nonexistent /zpack/list route.
  • /zpack/status is expected to expose a servers array containing { name, address, port } entries.
  • If /zpack/status exists but does not expose registered backends, API treats the register response as acknowledged but unverifiable instead of falsely failing the registration.
  • API exposes POST /internal/velocity/proxy-status for bridge lifecycle callbacks using the same hashed X-Zpack-Secret shared-secret auth style.
  • Accepted proxy lifecycle statuses are registered_with_proxy, proxy_ping_ok, and proxy_ping_failed.
  • Proxy lifecycle callbacks are stored under ContainerInstance.payload.proxy with source, status, server name, address, port, duplicate flag, timestamps, latency, detail, and last event time.
  • GET /api/servers/:id/status includes the stored proxy object beside agent-derived live status.
  • GET /api/servers/:id/status derives Minecraft connection state from agent readiness, persisted edge state, Velocity registration state, and backend ping state.
  • When Velocity has registered a backend but has not posted a separate proxy_ping_ok, API can treat an agent-confirmed Minecraft ping (readySource: "minecraft_ping") as a backend ping fallback unless Velocity explicitly reported proxy_ping_failed.
  • The bridge should set ZPACK_PROXY_STATUS_ENDPOINT to the API internal callback URL, for example http://<api-host>:4000/internal/velocity/proxy-status.
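The verification rule above can be sketched as: look for the backend in the servers array from GET /zpack/status, and treat a status body without that array as acknowledged-but-unverifiable rather than a failure. Function name and return states are illustrative:

```javascript
// Verify a Velocity registration against the bridge's /zpack/status body.
function verifyVelocityRegistration(statusBody, serverName) {
  const servers =
    statusBody && Array.isArray(statusBody.servers) ? statusBody.servers : null;
  if (!servers) {
    // /zpack/status exists but does not expose registered backends.
    return { verified: false, state: "acknowledged_unverifiable" };
  }
  const entry = servers.find((s) => s && s.name === serverName);
  return entry
    ? { verified: true, state: "registered", address: entry.address, port: entry.port }
    : { verified: false, state: "not_registered" };
}
```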

Host / LXC lifecycle state

  • /api/servers/:id/host/status exposes the underlying Proxmox LXC state for both game and dev containers.
  • Host status distinguishes container power state from agent/game runtime state.
  • Host status response includes hostStatus, powerState, running, active operation, and selected Proxmox stats.
  • hostStatus can report running, stopped, starting, stopping, or restarting.
  • Host lifecycle routes return 202 Accepted with an operation object and statusUrl instead of blocking until Proxmox finishes:
    • POST /api/servers/:id/host/start
    • POST /api/servers/:id/host/stop
    • POST /api/servers/:id/host/restart
  • Overlapping host lifecycle operations return 409 host_operation_in_progress.
  • Host lifecycle operation state is cached under Redis hostop:<vmid> with a short TTL and is also surfaced in GET /api/servers as hostOperation.
  • GET /api/servers includes Proxmox-backed hostStatus and powerState so Portal list views can show LXC-level start/stop/restart state for game and dev containers.
  • Host lifecycle routes perform ownership checks before touching Proxmox.
  • Proxmox client can resolve the actual node for an LXC via /cluster/resources instead of assuming every VMID lives on the configured default PROXMOX_NODE.
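The async lifecycle contract above can be sketched as a 202 builder. The operation fields and the exact statusUrl path are assumptions beyond what the bullets state; only the 202 + operation + statusUrl + hostop:<vmid> cache key pattern is from the notes:

```javascript
// Build a 202 Accepted response for a host start/stop/restart request.
function buildHostOperationAccepted(vmid, action) {
  const operation = {
    vmid,
    action, // "start" | "stop" | "restart"
    state: "pending",
    startedAt: new Date().toISOString(),
  };
  return {
    status: 202,
    body: { operation, statusUrl: `/api/servers/${vmid}/host/status` },
    redisKey: `hostop:${vmid}`, // cached with a short TTL per the notes above
  };
}
```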

Backup support

  • API forwards game backup operations.
  • Current API route shape:
    • GET /api/game/servers/:id/backups
    • POST /api/game/servers/:id/backups
    • POST /api/game/servers/:id/backups/restore?id=<backup_id>
    • DELETE /api/game/servers/:id/backups/:backupId
  • Restore start is async at the API layer and Portal is expected to poll status rather than hold the restore POST open.
  • API forwards agent HTTP status codes for backup responses.
  • Successful backup responses currently pass through the agent body.
  • Non-OK backup responses currently use the shared agent response envelope: { error: <fallback>, details: <agent_body> }.
  • Backup response shape normalization remains open.
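The current pass-through behavior can be sketched as: forward the agent's status code, return successful bodies verbatim, and wrap non-OK bodies in the shared { error, details } envelope. Function and parameter names are illustrative:

```javascript
// Forward a backup response from the agent, wrapping failures in the
// shared envelope: { error: <fallback>, details: <agent_body> }.
function forwardBackupResponse(agentStatus, agentBody, fallbackError) {
  if (agentStatus >= 200 && agentStatus < 300) {
    return { status: agentStatus, body: agentBody };
  }
  return { status: agentStatus, body: { error: fallbackError, details: agentBody } };
}
```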

File proxy / route compatibility

  • Duplicated game file proxy logic has been extracted into src/routes/helpers/gameFileProxy.js in the API repo.
  • Route compatibility is intentionally preserved between:
    • /api/game/servers/:id/files...
    • /api/servers/:id/files...
  • Streamed upload / download / file edit forwarding still exists outside the generic non-streaming agent helper path.
  • Compatibility between canonical game routes and legacy/compatibility server routes should be treated as part of the API contract.

Agent contract alignment already done

  • /start, /stop, /restart forwarded as POST.
  • /console/command forwarded as POST JSON.
  • /ready is part of poller/readiness logic.
  • There is no API-native /api/agent/:serverId/:action route in zpack-api.
  • The reviewed Portal repo contained a Portal-side bridge with that route shape; it calls /api/instances for an IP lookup and then contacts the agent directly. Static review did not find an obvious current frontend caller for that bridge.

Billing / auth lifecycle

  • API issues access tokens and refresh tokens.
  • Password reset tokens are stored hashed and exchanged through API routes.
  • Password reset request delivers email through the configured support mailbox SMTP path first, with optional Resend fallback and console-link fallback for local development.
  • Password reset request routes are POST /api/auth/password-reset/request and alias POST /api/auth/forgot-password.
  • Password reset confirm routes are POST /api/auth/password-reset/confirm and alias POST /api/auth/reset-password.
  • Logged-in password change is available at POST /api/auth/change-password with bearer auth and body { currentPassword, newPassword }.
  • Logged-in password change verifies the current password, enforces the same 8-character minimum, updates the password hash, and marks outstanding password reset tokens used.
  • Reset links use RESET_PASSWORD_URL_BASE, then PORTAL_URL, then http://localhost:3000, and point at /reset-password?token=....
  • Reset request responses remain generic to avoid account enumeration.
  • Reset confirmation rejects passwords shorter than 8 characters and marks all outstanding reset tokens for that user used after a successful password change.
  • Default reset sender is ZeroLag Hub Support <support@zerolaghub.com> and production SMTP is configured through SMTP_HOST, SMTP_PORT, SMTP_SECURE, SMTP_USER, and SMTP_PASS.
  • Stripe billing routes cover checkout, upgrade, downgrade, portal, and current billing state.
  • Stripe webhooks are mounted with raw body parsing before normal JSON middleware.
  • Billing scheduler starts in-process and performs limited reminder/reconciliation work.
  • Admin users are billing-exempt in billing flows.
  • JWT verification has reportedly been tightened to fixed algorithm plus issuer/audience separation for access, refresh, and IDE proxy tokens.
  • Pre-hardening tokens may no longer verify and a re-login may be required after this change.
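The hardened verification described above (fixed algorithm plus per-kind issuer/audience separation) can be sketched as an options builder for a JWT library's verify call. The pinned algorithm and the issuer/audience strings here are assumptions; the notes confirm only the separation pattern:

```javascript
// Per-kind verify options: the algorithm is pinned (never taken from the
// token header) and each token kind gets a distinct audience.
function jwtVerifyOptions(kind) {
  const audiences = { access: "zlh:access", refresh: "zlh:refresh", ide: "zlh:ide-proxy" };
  if (!audiences[kind]) throw new Error(`unknown token kind: ${kind}`);
  return {
    algorithms: ["HS256"], // fixed; an access token cannot verify as refresh
    issuer: "zlh-api",
    audience: audiences[kind],
  };
}
```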

Hosted IDE proxy

  • POST /api/dev/:id/ide-token issues short-lived IDE proxy tokens.
  • IDE proxy supports both tunnel paths under /__ide/:id and hosted dev-<vmid>.<suffix> hosts.
  • Hosted IDE tokens can be delivered by query parameter and then persisted as the IDE proxy cookie.
  • Hosted URL return is controlled by DEV_IDE_RETURN_HOSTED_URL.
  • API exposes code-server controls for owned dev containers:
    • POST /api/dev/:id/codeserver/start -> Agent POST /dev/codeserver/start
    • POST /api/dev/:id/codeserver/stop -> Agent POST /dev/codeserver/stop
    • POST /api/dev/:id/codeserver/restart -> Agent POST /dev/codeserver/restart
  • IDE proxy cookie hardening is expected to include httpOnly, sameSite: "lax", and secure-cookie behavior tied to public HTTPS or explicit secure-cookie config.
  • Sensitive proxy logging has reportedly been reduced so cookies and forwarded header detail are not exposed in normal logs.
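The expected cookie hardening can be sketched as below. The env var names used to decide `secure` (IDE_COOKIE_SECURE, PUBLIC_URL) are assumptions standing in for "public HTTPS or explicit secure-cookie config":

```javascript
// IDE proxy cookie options: httpOnly + lax, with secure tied to HTTPS
// exposure or an explicit override.
function ideProxyCookieOptions(env) {
  const secure =
    env.IDE_COOKIE_SECURE === "true" || (env.PUBLIC_URL || "").startsWith("https://");
  return { httpOnly: true, sameSite: "lax", secure };
}
```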

Console / outbound socket stability

  • Console WebSocket proxy attachment is guarded so the console upgrade handler is only attached once per HTTP server.
  • Console proxy raw socket error logging is guarded to avoid stacking duplicate socket listeners.
  • API raises listener limits on inbound HTTP sockets and console WebSocket sockets to avoid false-positive listener warnings under proxy/websocket fan-out.
  • Axios-backed outbound clients use shared HTTP/HTTPS agent helpers that raise outbound socket listener limits at socket creation time.
  • The outbound socket agent helper is used by Proxmox, OPNsense, Cloudflare, and Prometheus metrics query paths.
  • A temporary MaxListenersExceededWarning tracer exists in src/app.js to log emitter/event/count/stack if listener warnings recur.
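A tracer like the one described can hang off Node's process "warning" event; MaxListenersExceededWarning objects carry emitter, type, and count fields. The formatter below is a sketch, not the actual src/app.js code:

```javascript
// Shape a MaxListenersExceededWarning into a loggable record.
function formatListenerWarning(warning) {
  return {
    emitter:
      warning.emitter && warning.emitter.constructor
        ? warning.emitter.constructor.name
        : "unknown",
    event: warning.type,
    count: warning.count,
    stack: warning.stack,
  };
}

// Wiring sketch (the tracer in src/app.js may differ):
// process.on("warning", (w) => {
//   if (w.name === "MaxListenersExceededWarning") console.warn(formatListenerWarning(w));
// });
```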

Legacy / archived behavior

  • Legacy port allocation / slot reservation is no longer part of the live route mounts.
  • Legacy worker provisioning, detached reconcile helpers, explicit .old files, src/tmp, checked-in keys/tokens, and local artifacts have been removed from the active tree rather than kept as live repo clutter.
  • Websocket console proxy wiring remains outside agentClient.js.
  • Raw streaming upload proxy behavior remains outside agentClient.js.

Node 24 cleanup already reflected in API repo

  • RegExp.escape(...) is used where host / suffix regex escaping was previously manual.
  • Selected built-in imports have been normalized to node: style.
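An example of the host/suffix escaping described above, with a manual fallback for runtimes that predate RegExp.escape. The dev-<vmid> host shape follows the hosted IDE section; the suffix value in the usage is illustrative:

```javascript
// Build a regex matching hosted dev-<vmid>.<suffix> hosts with the suffix
// escaped literally (RegExp.escape on Node 24, manual escape otherwise).
function hostedDevHostRegex(suffix) {
  const escaped =
    typeof RegExp.escape === "function"
      ? RegExp.escape(suffix)
      : suffix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^dev-(\\d+)\\.${escaped}$`, "i");
}
```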