From c59973a97cc2aaa121d922359686ce0aaa7f1e2d Mon Sep 17 00:00:00 2001
From: jester
Date: Thu, 30 Apr 2026 19:19:38 +0000
Subject: [PATCH] Update API current state after hardening and teardown work

---
 Codex/API/CURRENT_STATE.md | 63 ++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 20 deletions(-)

diff --git a/Codex/API/CURRENT_STATE.md b/Codex/API/CURRENT_STATE.md
index 2bd92c7..e6aab37 100644
--- a/Codex/API/CURRENT_STATE.md
+++ b/Codex/API/CURRENT_STATE.md
@@ -3,18 +3,41 @@
 This file records what is believed to be implemented now.
 
 ## Runtime / dependency baseline
-- API is now tracked against Node 24 with repo-local pinning via `package.json` engines and `.nvmrc`.
+- API is tracked against Node 24 with repo-local pinning via `package.json` engines and `.nvmrc`.
 - Direct `node-fetch` dependency has been removed and API code now uses built-in global `fetch`.
-- Dependency / audit cleanup has been performed and the reported audit state is clean.
 - Prisma config has been migrated out of deprecated `package.json#prisma` into `prisma.config.ts`.
 - Prisma generate / validate checks reportedly pass on the current API baseline.
 
+## Security / trust boundaries
+- The API now has explicit route-level trust boundaries in addition to internal network placement behind OPNsense.
+- `src/middleware/requireAdmin.js` protects admin-only API routes.
+- `src/middleware/requireInternalToken.js` protects internal-only control-plane routes with `INTERNAL_API_TOKEN` / `ZLH_INTERNAL_API_TOKEN`.
+- Internal-token routes accept `X-ZLH-Internal-Token`, `X-Internal-Token`, or bearer auth carrying the internal token.
+- Internal-token routes fail closed when the token is missing, except for explicit `NODE_ENV=development` or `NODE_ENV=test` local flows.
+- `/api/audit` is admin-only.
+- `GET /api/instances` is admin-only global inventory.
+- `/api/edge/*` and `/api/proxmox/*` are internal-token protected.
+- `/sd/exporters` and `/monitoring/*` now fail closed in production-like environments when discovery tokens are missing.
+- Security boundary documentation lives in `docs/security-boundaries.md` in the API repo.
+
 ## Route and lifecycle split
-- `/api/instances` and `/api/containers` are different operational surfaces and should not be treated as duplicates.
-- `GET /api/instances` lists `ContainerInstance` rows and is still used for lookup/discovery by some Portal-side code paths.
+- `/api/instances`, `/api/servers`, and `/api/containers` are different operational surfaces and should not be treated as duplicates.
 - `POST /api/instances` is the active agent-driven provisioning entrypoint.
-- `DELETE /api/containers/:vmid` is the cleanup/delete/orphan-remediation route used for failed creates, agent failures, manual deletions, and other orphaned container cases.
-- Current delete cleanup has been updated to derive archive metadata from current instance fields (`payload`, `engineType`, `allocatedPorts`) instead of assuming the pre-v2 `game` / `variant` / `ports` fields are still present on `ContainerInstance`.
+- `GET /api/instances` is now admin-only global inventory rather than a normal Portal user list.
+- `GET /api/servers` is the user-owned server list surface for Portal.
+- `DELETE /api/servers/:id` is the preferred Portal/user-owned delete contract. It requires a normal user bearer token and checks ownership before teardown.
+- `DELETE /api/containers/:vmid` remains the raw cleanup/orphan-remediation route. It is primarily internal-token protected and, during the migration window, also accepts bearer-authenticated deletes from the owning user for Portal compatibility.
+- Orphan remediation should stay in internal/admin workflows; Portal only deletes active servers that the DB records as owned by the requesting user.
+
+## Teardown / orphan cleanup behavior
+- Container teardown logic has been extracted into `src/services/containerTeardown.js` so user-owned delete and raw internal delete share the same workflow.
+- Teardown archives a `DeletedInstance` record before removing the active `ContainerInstance` row.
+- Archive metadata is derived from current instance fields (`payload`, `engineType`, `allocatedPorts`) instead of assuming pre-v2 `game` / `variant` / `ports` columns still exist on `ContainerInstance`.
+- Teardown still refuses to delete a running Proxmox container and returns a conflict-style response until the host is stopped.
+- Game teardown performs DNS / Cloudflare / Technitium / Velocity cleanup through existing publisher cleanup paths.
+- Dev teardown performs dev IDE cleanup through the dev IDE publisher path.
+- Live validation completed for one game VMID and one dev VMID: both active rows were removed and archived with `reason=user_delete`.
+- A read-only helper script exists at `src/scripts/monitorTeardown.js`, exposed as `npm run monitor:teardown -- [...]`, for watching active/archive DB state during teardown validation.
 
 ## Readiness / agent state model
 - API is the heartbeat authority by polling agents.
@@ -24,8 +47,8 @@ This file records what is believed to be implemented now.
 - `/ready` for semantic readiness
 - `/status` for detailed state snapshot
 - Portal should rely on API-normalized state, not direct agent state.
-- Proxy lifecycle is now tracked separately from agent readiness under `ContainerInstance.payload.proxy`.
-- `ContainerInstance.payload.edge` and `ContainerInstance.payload.proxy` are now merged through an atomic JSON merge helper so edge publish state and Velocity callback state do not clobber each other during near-simultaneous writes.
+- Proxy lifecycle is tracked separately from agent readiness under `ContainerInstance.payload.proxy`.
+- `ContainerInstance.payload.edge` and `ContainerInstance.payload.proxy` are merged through an atomic JSON merge helper so edge publish state and Velocity callback state do not clobber each other during near-simultaneous writes.
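The clobber-free merge of `payload.edge` and `payload.proxy` described in this patch can be sketched roughly as follows. This is a minimal illustration only, not the repo's real helper: the function name `mergePayloadKey`, the `store` interface, and the per-instance promise chain are all assumptions; the actual implementation presumably does an atomic JSON merge at the database layer.

```javascript
// Sketch: writers for payload.edge and payload.proxy serialize through a
// per-instance promise chain, so a near-simultaneous writer always sees the
// other writer's keys instead of overwriting the whole payload JSON.
const chains = new Map(); // instanceId -> tail of that instance's write chain

function mergePayloadKey(store, instanceId, key, patch) {
  const tail = chains.get(instanceId) ?? Promise.resolve();
  const next = tail.then(async () => {
    const current = (await store.read(instanceId)) ?? {};
    // Shallow-merge the patch into only the targeted subtree (edge or proxy).
    const merged = { ...current, [key]: { ...(current[key] ?? {}), ...patch } };
    await store.write(instanceId, merged);
    return merged;
  });
  chains.set(instanceId, next);
  return next;
}

module.exports = { mergePayloadKey };
```

With this shape, an edge-publish write and a Velocity callback write racing on the same instance both land, each under its own key.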
 
 ## Readiness cleanup already done
 - `agentClient.js` centralizes non-streaming agent transport.
@@ -39,37 +62,37 @@ This file records what is believed to be implemented now.
 
 ## Provisioning / post-provision flow
 - `src/api/provisionAgent.js` is the active provisioning path.
-- The older worker-based provisioning chain has been archived for reference and is no longer treated as a live API path.
+- Older worker-based provisioning code is no longer treated as a live API path.
 - Post-provision edge publishing is still active.
 - Post-provision behavior has been adjusted away from legacy port-allocation commits and should be understood as request-driven / payload-driven.
 
 ## Velocity / Minecraft edge lifecycle
 - Minecraft edge publish uses Velocity instead of Traefik TCP config.
 - API registers Minecraft backends with the Velocity bridge using `POST /zpack/register` and unregisters with `POST /zpack/unregister`.
-- Velocity registration verification now uses the bridge's real `GET /zpack/status` route instead of the nonexistent `/zpack/list` route.
+- Velocity registration verification uses the bridge's real `GET /zpack/status` route instead of the nonexistent `/zpack/list` route.
 - `/zpack/status` is expected to expose a `servers` array containing `{ name, address, port }` entries.
 - If `/zpack/status` exists but does not expose registered backends, API treats the register response as acknowledged but unverifiable instead of falsely failing the registration.
 - API exposes `POST /internal/velocity/proxy-status` for bridge lifecycle callbacks using the same hashed `X-Zpack-Secret` shared-secret auth style.
 - Accepted proxy lifecycle statuses are `registered_with_proxy`, `proxy_ping_ok`, and `proxy_ping_failed`.
 - Proxy lifecycle callbacks are stored under `ContainerInstance.payload.proxy` with source, status, server name, address, port, duplicate flag, timestamps, latency, detail, and last event time.
-- `GET /api/servers/:id/status` now includes the stored `proxy` object beside agent-derived live status.
+- `GET /api/servers/:id/status` includes the stored `proxy` object beside agent-derived live status.
 - `GET /api/servers/:id/status` derives Minecraft connection state from agent readiness, persisted edge state, Velocity registration state, and backend ping state.
 - When Velocity has registered a backend but has not posted a separate `proxy_ping_ok`, API can treat an agent-confirmed Minecraft ping (`readySource: "minecraft_ping"`) as a backend ping fallback unless Velocity explicitly reported `proxy_ping_failed`.
 - The bridge should set `ZPACK_PROXY_STATUS_ENDPOINT` to the API internal callback URL, for example `http://:4000/internal/velocity/proxy-status`.
 
 ## Host / LXC lifecycle state
 - `/api/servers/:id/host/status` exposes the underlying Proxmox LXC state for both game and dev containers.
-- Host status now distinguishes container power state from agent/game runtime state.
+- Host status distinguishes container power state from agent/game runtime state.
 - Host status response includes `hostStatus`, `powerState`, `running`, active `operation`, and selected Proxmox stats.
 - `hostStatus` can report `running`, `stopped`, `starting`, `stopping`, or `restarting`.
-- Host lifecycle routes now return `202 Accepted` with an operation object and `statusUrl` instead of blocking until Proxmox finishes:
+- Host lifecycle routes return `202 Accepted` with an operation object and `statusUrl` instead of blocking until Proxmox finishes:
   - `POST /api/servers/:id/host/start`
   - `POST /api/servers/:id/host/stop`
   - `POST /api/servers/:id/host/restart`
 - Overlapping host lifecycle operations return `409 host_operation_in_progress`.
 - Host lifecycle operation state is cached under Redis `hostop:` with a short TTL and is also surfaced in `GET /api/servers` as `hostOperation`.
 - `GET /api/servers` includes Proxmox-backed `hostStatus` and `powerState` so Portal list views can show LXC-level start/stop/restart state for game and dev containers.
-- Host lifecycle routes now perform ownership checks before touching Proxmox.
+- Host lifecycle routes perform ownership checks before touching Proxmox.
 - Proxmox client can resolve the actual node for an LXC via `/cluster/resources` instead of assuming every VMID lives on the configured default `PROXMOX_NODE`.
 
 ## Backup support
@@ -103,7 +126,7 @@ This file records what is believed to be implemented now.
 ## Billing / auth lifecycle
 - API issues access tokens and refresh tokens.
 - Password reset tokens are stored hashed and exchanged through API routes.
-- Password reset request now delivers email through the configured support mailbox SMTP path first, with optional Resend fallback and console-link fallback for local development.
+- Password reset request delivers email through the configured support mailbox SMTP path first, with optional Resend fallback and console-link fallback for local development.
 - Password reset request routes are `POST /api/auth/password-reset/request` and alias `POST /api/auth/forgot-password`.
 - Password reset confirm routes are `POST /api/auth/password-reset/confirm` and alias `POST /api/auth/reset-password`.
 - Logged-in password change is available at `POST /api/auth/change-password` with bearer auth and body `{ currentPassword, newPassword }`.
@@ -135,15 +158,15 @@ This file records what is believed to be implemented now.
 - Console WebSocket proxy attachment is guarded so the console upgrade handler is only attached once per HTTP server.
 - Console proxy raw socket error logging is guarded to avoid stacking duplicate socket listeners.
 - API raises listener limits on inbound HTTP sockets and console WebSocket sockets to avoid false-positive listener warnings under proxy/websocket fan-out.
-- Axios-backed outbound clients now use shared HTTP/HTTPS agent helpers that raise outbound socket listener limits at socket creation time.
+- Axios-backed outbound clients use shared HTTP/HTTPS agent helpers that raise outbound socket listener limits at socket creation time.
 - The outbound socket agent helper is used by Proxmox, OPNsense, Cloudflare, and Prometheus metrics query paths.
 - A temporary `MaxListenersExceededWarning` tracer exists in `src/app.js` to log emitter/event/count/stack if listener warnings recur.
 
 ## Legacy / archived behavior
-- legacy port allocation / slot reservation is no longer part of the live route mounts and has been archived for reference
-- legacy worker provisioning, detached reconcile helpers, and explicit `.old` files have been moved under archive for reference rather than kept in the live tree
-- websocket console proxy wiring remains outside `agentClient.js`
-- raw streaming upload proxy behavior remains outside `agentClient.js`
+- Legacy port allocation / slot reservation is no longer part of the live route mounts.
+- Legacy worker provisioning, detached reconcile helpers, explicit `.old` files, `src/tmp`, checked-in keys/tokens, and local artifacts have been removed from the active tree rather than kept as live repo clutter.
+- WebSocket console proxy wiring remains outside `agentClient.js`.
+- Raw streaming upload proxy behavior remains outside `agentClient.js`.
 
 ## Node 24 cleanup already reflected in API repo
 - `RegExp.escape(...)` is used where host / suffix regex escaping was previously manual.