diff --git a/Codex/API/AGENT_TRUST_MODEL.md b/Codex/API/AGENT_TRUST_MODEL.md
new file mode 100644
index 0000000..9a17a42
--- /dev/null
+++ b/Codex/API/AGENT_TRUST_MODEL.md
@@ -0,0 +1,115 @@
+# API Agent Trust Model
+
+The API is the authority for user identity, ownership, billing, plan limits, and platform policy. The Agent is an internal executor inside a game/dev container.
+
+This means the API must protect the platform from unsafe Agent invocation. The Agent should not be user-facing, and Portal should never call Agent directly.
+
+## Launch / short term
+
+For launch, use static `ZLH_AGENT_TOKEN` as the API-to-Agent control-plane secret.
+
+Required launch behavior:
+
+- API refuses to start in production without `ZLH_AGENT_TOKEN`.
+- API injects `ZLH_AGENT_TOKEN` into every Agent request.
+- API centralizes Agent calls through the shared Agent client rather than ad hoc `fetch()` calls.
+- Agent refuses protected requests in production without a configured token.
+- Portal never receives or constructs the Agent token.
+- Normal user requests are authenticated and ownership-checked before API calls Agent.
+
+## API responsibilities before calling Agent
+
+Before sending work to Agent, API must verify:
+
+- the user is authenticated, unless the caller is an internal/admin service
+- the requested server/container belongs to the user
+- billing/account state allows the operation
+- plan limits allow create/upgrade operations
+- the requested operation is valid for the server type (`game` vs `dev`)
+- no conflicting operation is already in progress
+- request payloads are bounded and normalized
+
+The Agent may validate filesystem/runtime safety, but it does not own customer authorization.
+
+## What API must not do
+
+- Do not expose `ZLH_AGENT_TOKEN` to Portal or browser-visible code.
+- Do not let Portal call Agent IPs or `:18888` directly.
+- Do not treat Agent responses as authorization proof.
+- Do not let users supply arbitrary VMID/IP and proxy directly to Agent.
+- Do not leave active-looking DB rows after provisioning fails.
+- Do not allow duplicate provisioning retries to create duplicate LXCs.
+
+## Error handling
+
+Agent failures should be mapped to safe API responses:
+
+- Agent unreachable -> `502 Agent unreachable`
+- Agent not ready -> `503 agent_not_ready`
+- Agent operation in progress -> `409 operation_in_progress`
+- Agent unauthorized -> internal misconfiguration, not user-auth failure
+
+Do not leak Agent URLs, IPs, stack traces, tokens, or raw internal error bodies to Portal.
+
+## Longer-term target
+
+Static `ZLH_AGENT_TOKEN` is a launch hardening layer, not the final trust model.
+
+The better long-term model is short-lived signed API-to-Agent request tokens.
+
+In that model:
+
+1. API keeps a signing secret or private key.
+2. Agent knows only a verification secret or public key.
+3. API signs a short-lived token per Agent request, usually 30-120 seconds.
+4. Agent verifies issuer, audience, expiry, VMID, and scope before executing.
+
+Example claims:
+
+```json
+{
+  "iss": "zpack-api",
+  "aud": "zlh-agent",
+  "vmid": 5202,
+  "scope": "files:read",
+  "iat": 1714500000,
+  "exp": 1714500060,
+  "jti": "request-id"
+}
+```
+
+This limits blast radius if a token is exposed. A stolen short-lived `files:read` token for VMID `5202` cannot be reused later to stop another server, restore a backup, or mutate files on a different container.
+
+## Roadmap
+
+- Phase 1: static `ZLH_AGENT_TOKEN`, fail-closed in production.
+- Phase 2: token rotation and deployment validation so stale or missing tokens are caught before traffic.
+- Phase 3: short-lived API-signed request tokens for API-to-Agent calls.
+- Phase 4: optional mTLS between API and Agent for transport-level service identity.
+
+## JWT vs HMAC request signatures
+
+JWT is useful when explicit claims and public-key verification are desired.
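The Agent-side claim checks listed under "Longer-term target" (issuer, audience, expiry, VMID, scope) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes the token signature has already been verified by a JWT library, and the function name `check_claims`, the reason strings, and the clock-skew tolerance are hypothetical. Replay protection via `jti` is not covered.

```python
import time

EXPECTED_ISSUER = "zpack-api"    # taken from the example claims
EXPECTED_AUDIENCE = "zlh-agent"  # taken from the example claims
MAX_LIFETIME_SECONDS = 120       # upper end of the 30-120 second range
CLOCK_SKEW_SECONDS = 5           # assumed tolerance, not specified in the document


def check_claims(claims, vmid, scope, now=None):
    """Return None if the claims allow `scope` on `vmid`, else a reason string.

    Assumes the signature over `claims` was verified before this call.
    """
    now = time.time() if now is None else now
    if claims.get("iss") != EXPECTED_ISSUER:
        return "bad_issuer"
    if claims.get("aud") != EXPECTED_AUDIENCE:
        return "bad_audience"
    iat, exp = claims.get("iat"), claims.get("exp")
    if not isinstance(iat, (int, float)) or not isinstance(exp, (int, float)):
        return "missing_times"
    if exp - iat > MAX_LIFETIME_SECONDS:
        return "lifetime_too_long"
    if now > exp + CLOCK_SKEW_SECONDS:
        return "expired"
    if iat > now + CLOCK_SKEW_SECONDS:
        return "issued_in_future"
    if claims.get("vmid") != vmid:
        return "wrong_vmid"
    if claims.get("scope") != scope:
        return "wrong_scope"
    return None


# The example claims from this document, checked 30 seconds after issuance.
EXAMPLE_CLAIMS = {
    "iss": "zpack-api",
    "aud": "zlh-agent",
    "vmid": 5202,
    "scope": "files:read",
    "iat": 1714500000,
    "exp": 1714500060,
    "jti": "request-id",
}
result = check_claims(EXAMPLE_CLAIMS, 5202, "files:read", now=1714500030)
```

Returning a reason string rather than raising keeps the rejection cause available for Agent logs while the API still maps the failure to a safe response for Portal.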
+
+A compact HMAC request-signing scheme may be simpler operationally:
+
+```text
+X-ZLH-Agent-Timestamp:
+X-ZLH-Agent-Vmid:
+X-ZLH-Agent-Scope: lifecycle:stop
+X-ZLH-Agent-Signature: HMAC(secret, method + path + timestamp + vmid + scope + bodyHash)
+```
+
+Either approach is better than a forever bearer token once the launch system is stable.
+
+## Not a launch blocker
+
+Do not jump directly to short-lived signed tokens before launch unless required. They add operational complexity:
+
+- clock drift
+- issuer/audience mismatch
+- token scope bugs
+- signing-key/config drift
+- harder provisioning debugging
+
+For launch, static token plus fail-closed production behavior is the correct low-risk hardening step. For scale, short-lived signed API-to-Agent tokens are the cleaner end state.
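For reference, the compact HMAC request-signing scheme from "JWT vs HMAC request signatures" might look roughly like the sketch below, using only the Python standard library. Several details are assumptions rather than anything this document specifies: HMAC-SHA256 with hex encoding, a Unix-seconds timestamp, a 60-second freshness window, and joining the signed fields with newlines so field boundaries stay unambiguous. The function names, the `/stop` path, and the secret are hypothetical.

```python
import hashlib
import hmac
import time

MAX_AGE_SECONDS = 60  # assumed freshness window, not specified in the document


def sign_request(secret, method, path, timestamp, vmid, scope, body):
    """API side: compute the hex value for X-ZLH-Agent-Signature."""
    body_hash = hashlib.sha256(body).hexdigest()
    # Newline-joined fields are an assumed canonical form for this sketch.
    fields = [method.upper(), path, str(timestamp), str(vmid), scope, body_hash]
    message = "\n".join(fields).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, headers, body, now=None):
    """Agent side: accept only a fresh timestamp with a matching signature."""
    now = time.time() if now is None else now
    try:
        timestamp = int(headers["X-ZLH-Agent-Timestamp"])
        vmid = headers["X-ZLH-Agent-Vmid"]
        scope = headers["X-ZLH-Agent-Scope"]
        received = headers["X-ZLH-Agent-Signature"]
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False
    expected = sign_request(secret, method, path, timestamp, vmid, scope, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


# Hypothetical request: stop the server in VMID 5202.
secret = b"example-secret"
body = b'{"action":"stop"}'
ts = 1714500000
headers = {
    "X-ZLH-Agent-Timestamp": str(ts),
    "X-ZLH-Agent-Vmid": "5202",
    "X-ZLH-Agent-Scope": "lifecycle:stop",
    "X-ZLH-Agent-Signature": sign_request(
        secret, "POST", "/stop", ts, 5202, "lifecycle:stop", body
    ),
}
accepted = verify_request(secret, "POST", "/stop", headers, body, now=ts + 10)
```

Because the VMID, scope, and body hash are all covered by the signature, changing any of them after signing causes verification to fail, which is the same blast-radius property the signed-token model aims for.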