# API Agent Trust Model

The API is the authority for user identity, ownership, billing, plan limits, and platform policy.

The Agent is an internal executor inside a game/dev container. This means the API must protect the platform from unsafe Agent invocation.

The Agent should not be user-facing, and Portal should never call Agent directly.

## Launch / short term

For launch, use static `ZLH_AGENT_TOKEN` as the API-to-Agent control-plane secret.

Required launch behavior:

- API refuses to start in production without `ZLH_AGENT_TOKEN`.
- API injects `ZLH_AGENT_TOKEN` into every Agent request.
- API centralizes Agent calls through the shared Agent client rather than ad hoc `fetch()` calls.
- Agent refuses protected requests in production without a configured token.
- Portal never receives or constructs the Agent token.
- Normal user requests are authenticated and ownership-checked before API calls Agent.

## API responsibilities before calling Agent

Before sending work to Agent, API must verify:

- the user is authenticated, unless the caller is an internal/admin service
- the requested server/container belongs to the user
- billing/account state allows the operation
- plan limits allow create/upgrade operations
- the requested operation is valid for the server type (`game` vs `dev`)
- no conflicting operation is already in progress
- request payloads are bounded and normalized

The Agent may validate filesystem/runtime safety, but it does not own customer authorization.

## What API must not do

- Do not expose `ZLH_AGENT_TOKEN` to Portal or browser-visible code.
- Do not let Portal call Agent IPs or `:18888` directly.
- Do not treat Agent responses as authorization proof.
- Do not let users supply arbitrary VMID/IP and proxy directly to Agent.
- Do not leave active-looking DB rows after provisioning fails.
- Do not allow duplicate provisioning retries to create duplicate LXCs.

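As a rough illustration of the rules above (fail closed without a token, check ownership first, resolve the Agent target from the API's own records, inject the token server-side), a minimal TypeScript sketch might look like the following. The `ServerRecord` shape, the `NODE_ENV` production check, the `Authorization: Bearer` header, and all function names are assumptions made for this example, not the project's actual shared Agent client.

```typescript
// Hypothetical record shape; the real schema is not described in this document.
type ServerRecord = {
  id: string;
  ownerId: string;
  vmid: number;
  agentHost: string; // looked up by the API, never taken from user input
  type: "game" | "dev";
};

class AgentConfigError extends Error {}
class ForbiddenError extends Error {}

// Fail closed: in production, a missing token is a startup error.
// Using NODE_ENV to detect production is an assumption.
function loadAgentToken(env: Record<string, string | undefined>): string | undefined {
  const token = env.ZLH_AGENT_TOKEN?.trim();
  if (!token && env.NODE_ENV === "production") {
    throw new AgentConfigError("ZLH_AGENT_TOKEN is required in production");
  }
  return token || undefined;
}

// Build the Agent request from the API's own server record after the
// ownership check, so callers cannot point the API at an arbitrary VMID/IP.
function buildAgentRequest(
  userId: string,
  server: ServerRecord,
  path: string,
  token: string | undefined,
): { url: string; headers: Record<string, string> } {
  if (server.ownerId !== userId) {
    throw new ForbiddenError("server does not belong to user");
  }
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) {
    // The header carrying the token is an assumption for illustration.
    headers["Authorization"] = `Bearer ${token}`;
  }
  return { url: `http://${server.agentHost}:18888${path}`, headers };
}
```

In this sketch, the API would call `loadAgentToken(process.env)` once at startup and route every Agent call through `buildAgentRequest`, so the token never appears in Portal-facing code paths.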
## Error handling

Agent failures should be mapped to safe API responses:

- Agent unreachable -> `502 Agent unreachable`
- Agent not ready -> `503 agent_not_ready`
- Agent operation in progress -> `409 operation_in_progress`
- Agent unauthorized -> internal misconfiguration, not user-auth failure

Do not leak Agent URLs, IPs, stack traces, tokens, or raw internal error bodies to Portal.

## Longer-term target

Static `ZLH_AGENT_TOKEN` is a launch hardening layer, not the final trust model.

The better long-term model is short-lived signed API-to-Agent request tokens.

In that model:

1. API keeps a signing secret or private key.
2. Agent knows only a verification secret or public key.
3. API signs a short-lived token per Agent request, usually 30-120 seconds.
4. Agent verifies issuer, audience, expiry, VMID, and scope before executing.

Example claims:

```json
{
  "iss": "zpack-api",
  "aud": "zlh-agent",
  "vmid": 5202,
  "scope": "files:read",
  "iat": 1714500000,
  "exp": 1714500060,
  "jti": "request-id"
}
```

This limits blast radius if a token is exposed. A stolen short-lived `files:read` token for VMID `5202` cannot be reused later to stop another server, restore a backup, or mutate files on a different container.

## Roadmap

- Phase 1: static `ZLH_AGENT_TOKEN`, fail-closed in production.
- Phase 2: token rotation and deployment validation so stale or missing tokens are caught before traffic.
- Phase 3: short-lived API-signed request tokens for API-to-Agent calls.
- Phase 4: optional mTLS between API and Agent for transport-level service identity.

## JWT vs HMAC request signatures

JWT is useful when explicit claims and public-key verification are desired.

A compact HMAC request-signing scheme may be simpler operationally:

```text
X-ZLH-Agent-Timestamp:
X-ZLH-Agent-Vmid:
X-ZLH-Agent-Scope: lifecycle:stop
X-ZLH-Agent-Signature: HMAC(secret, method + path + timestamp + vmid + scope + bodyHash)
```

Either approach is better than a forever bearer token once the launch system is stable.

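The HMAC scheme above could be sketched roughly as follows, using Node's built-in `crypto` module. This is one possible reading of it, not a settled design: the newline-joined canonical string, hex encoding, SHA-256, the 60-second skew window, and the names `signRequest` and `verifyRequest` are all assumptions introduced for this example.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// Fields covered by the signature, mirroring the header list above.
type SignedFields = {
  method: string;
  path: string;
  timestamp: number; // Unix seconds
  vmid: number;
  scope: string;
  body: string;
};

// Join fields with newlines (an assumption) so adjacent values
// cannot run together and produce the same signed string.
function canonicalString(f: SignedFields): string {
  const bodyHash = createHash("sha256").update(f.body).digest("hex");
  return [f.method.toUpperCase(), f.path, String(f.timestamp), String(f.vmid), f.scope, bodyHash].join("\n");
}

// API side: compute the signature header value.
function signRequest(secret: string, f: SignedFields): string {
  return createHmac("sha256", secret).update(canonicalString(f)).digest("hex");
}

// Agent side: reject stale timestamps, then compare in constant time.
function verifyRequest(
  secret: string,
  f: SignedFields,
  signature: string,
  nowSeconds: number,
  maxSkewSeconds = 60,
): boolean {
  if (Math.abs(nowSeconds - f.timestamp) > maxSkewSeconds) return false;
  const expected = Buffer.from(signRequest(secret, f), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

Because the VMID and scope are part of the signed string, a signature captured for `lifecycle:stop` on one container should not verify for a different scope or VMID, which is the blast-radius property the longer-term model is after.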
## Not a launch blocker

Do not jump directly to short-lived signed tokens before launch unless required. They add operational complexity:

- clock drift
- issuer/audience mismatch
- token scope bugs
- signing-key/config drift
- harder provisioning debugging

For launch, static token plus fail-closed production behavior is the correct low-risk hardening step. For scale, short-lived signed API-to-Agent tokens are the cleaner end state.