API Agent Trust Model

The API is the authority for user identity, ownership, billing, plan limits, and platform policy. The Agent is an internal executor inside a game/dev container.

This means the API must protect the platform from unsafe Agent invocation. The Agent should not be user-facing, and Portal should never call Agent directly.

Launch / short term

For launch, use a static ZLH_AGENT_TOKEN as the API-to-Agent control-plane secret.

Required launch behavior:

  • API refuses to start in production without ZLH_AGENT_TOKEN.
  • API injects ZLH_AGENT_TOKEN into every Agent request.
  • API centralizes Agent calls through the shared Agent client rather than ad hoc fetch() calls.
  • Agent refuses protected requests in production without a configured token.
  • Portal never receives or constructs the Agent token.
  • Normal user requests are authenticated and ownership-checked before API calls Agent.
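
The fail-closed startup check and token injection above can be sketched as follows. This is a minimal illustration, not the actual API code: the APP_ENV switch, the function names, and the Authorization: Bearer header are assumptions, and the real Agent may expect a different header.

```python
from typing import Mapping, Optional


class AgentConfigError(RuntimeError):
    """Raised when the API-to-Agent secret is missing in production."""


def load_agent_token(env: Mapping[str, str]) -> Optional[str]:
    """Return ZLH_AGENT_TOKEN, failing closed in production if it is unset."""
    token = env.get("ZLH_AGENT_TOKEN", "").strip()
    # APP_ENV is an assumed environment switch; substitute the real one.
    if env.get("APP_ENV") == "production" and not token:
        raise AgentConfigError("ZLH_AGENT_TOKEN is required in production")
    return token or None


def agent_headers(token: Optional[str]) -> dict:
    """Headers the shared Agent client attaches to every Agent request."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Hypothetical header choice; the real Agent may expect another name.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Calling load_agent_token(os.environ) once at startup, and building headers only inside the shared Agent client, keeps the token out of route handlers and out of anything Portal can see.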

API responsibilities before calling Agent

Before sending work to Agent, API must verify:

  • the user is authenticated, unless the caller is an internal/admin service
  • the requested server/container belongs to the user
  • billing/account state allows the operation
  • plan limits allow create/upgrade operations
  • the requested operation is valid for the server type (game vs dev)
  • no conflicting operation is already in progress
  • request payloads are bounded and normalized

The Agent may validate filesystem/runtime safety, but it does not own customer authorization.
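
A rough sketch of these pre-flight checks, for a user request against an existing server. The Server and Account shapes, the ALLOWED_KINDS table, and the denial reason strings are hypothetical. Plan-limit checks for create/upgrade and payload normalization are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    owner_id: str
    kind: str  # "game" or "dev"
    pending_operation: Optional[str] = None


@dataclass
class Account:
    user_id: str
    billing_ok: bool


# Hypothetical operation table: which server kinds each operation accepts.
ALLOWED_KINDS = {
    "lifecycle:stop": {"game", "dev"},
    "files:read": {"game", "dev"},
    "game:console": {"game"},
}


def preflight(account: Optional[Account], server: Server, operation: str) -> Optional[str]:
    """Return None if the call may go to Agent, else a short denial reason."""
    if account is None:
        return "unauthenticated"
    if server.owner_id != account.user_id:
        return "not_owner"
    if not account.billing_ok:
        return "billing_blocked"
    if server.kind not in ALLOWED_KINDS.get(operation, set()):
        return "invalid_operation"
    if server.pending_operation is not None:
        return "operation_in_progress"
    return None
```

Only a None result should reach the shared Agent client. Every other result is answered by the API itself.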

What API must not do

  • Do not expose ZLH_AGENT_TOKEN to Portal or browser-visible code.
  • Do not let Portal call Agent IPs or :18888 directly.
  • Do not treat Agent responses as authorization proof.
  • Do not let users supply arbitrary VMID/IP and proxy directly to Agent.
  • Do not leave active-looking DB rows after provisioning fails; mark them failed or remove them.
  • Do not allow duplicate provisioning retries to create duplicate LXCs.
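
Two of these rules lend themselves to a small sketch: resolving the Agent address from the stored server record rather than from request fields, and making provisioning retries idempotent. The in-memory SERVERS table, the field names, the status values, and the http scheme are all assumptions for illustration. Only port 18888 comes from this document.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the servers table, keyed by server id.
SERVERS: Dict[str, dict] = {
    "srv-1": {"owner_id": "u1", "vmid": 5202, "agent_ip": "10.0.0.12", "status": "active"},
}


def resolve_agent_url(user_id: str, server_id: str) -> Optional[str]:
    """Build the Agent URL from the stored record, never from request fields."""
    row = SERVERS.get(server_id)
    if row is None or row["owner_id"] != user_id or row["status"] != "active":
        return None
    return f"http://{row['agent_ip']}:18888"


def provision(server_id: str, owner_id: str, create_lxc) -> dict:
    """Provision once per server id and mark the row failed on error."""
    row = SERVERS.get(server_id)
    if row is not None and row["status"] in ("provisioning", "active"):
        return row  # duplicate retry: do not create a second LXC
    row = {"owner_id": owner_id, "vmid": None, "agent_ip": None, "status": "provisioning"}
    SERVERS[server_id] = row
    try:
        row["vmid"], row["agent_ip"] = create_lxc()
        row["status"] = "active"
    except Exception:
        row["status"] = "failed"  # failed rows are never treated as live servers
        raise
    return row
```

A real implementation would enforce the same idea with a database uniqueness constraint or an idempotency key rather than a dictionary lookup.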

Error handling

Agent failures should be mapped to safe API responses:

  • Agent unreachable -> 502 Agent unreachable
  • Agent not ready -> 503 agent_not_ready
  • Agent operation in progress -> 409 operation_in_progress
  • Agent unauthorized -> internal misconfiguration (the API's Agent token is missing or stale), not a user-auth failure

Do not leak Agent URLs, IPs, stack traces, tokens, or raw internal error bodies to Portal.
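
One way to express this mapping is a single translation function in the shared Agent client. The failure-kind names, the response body shape, and the generic 500 for the unauthorized case are assumptions. The status codes and messages for the first three cases are the ones listed above.

```python
from typing import Callable, Tuple

# Hypothetical internal failure kinds produced by the shared Agent client.
SAFE_RESPONSES = {
    "unreachable": (502, "Agent unreachable"),
    "not_ready": (503, "agent_not_ready"),
    "in_progress": (409, "operation_in_progress"),
}


def to_api_response(failure_kind: str, raw_detail: str,
                    log: Callable[[str], None]) -> Tuple[int, dict]:
    """Map an Agent failure to a response that is safe to return to Portal."""
    # The raw detail goes to server-side logs only, never into the response.
    log(f"agent failure kind={failure_kind} detail={raw_detail}")
    if failure_kind in SAFE_RESPONSES:
        status, message = SAFE_RESPONSES[failure_kind]
        return status, {"error": message}
    # An Agent 401/403 means the API's own token is wrong: a platform fault,
    # so the user sees a generic error (500 is an assumed choice).
    return 500, {"error": "internal_error"}
```

Because the response body is built only from the fixed table, Agent IPs, URLs, and stack traces in raw_detail cannot reach Portal.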

Longer-term target

Static ZLH_AGENT_TOKEN is a launch hardening layer, not the final trust model.

The better long-term model is short-lived signed API-to-Agent request tokens.

In that model:

  1. API keeps a signing secret or private key.
  2. Agent knows only a verification secret or public key.
  3. API signs a short-lived token for each Agent request, typically valid for 30-120 seconds.
  4. Agent verifies issuer, audience, expiry, VMID, and scope before executing.

Example claims:

{
  "iss": "zpack-api",
  "aud": "zlh-agent",
  "vmid": 5202,
  "scope": "files:read",
  "iat": 1714500000,
  "exp": 1714500060,
  "jti": "request-id"
}

This limits the blast radius if a token is exposed. A stolen short-lived files:read token for VMID 5202 cannot be reused later to stop another server, restore a backup, or mutate files on a different container.
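
The issue-and-verify flow can be sketched with the standard library alone. This is a simplified, JWT-like token signed with a shared secret (HMAC-SHA256), without the JWT header segment and without jti replay tracking. A real deployment would more likely use a maintained JWT library, and asymmetric keys if the Agent should hold only a public key. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(secret: bytes, vmid: int, scope: str, ttl: int = 60,
               now: Optional[int] = None) -> str:
    """API side: issue a short-lived token bound to one VMID and one scope."""
    now = int(time.time()) if now is None else now
    claims = {"iss": "zpack-api", "aud": "zlh-agent", "vmid": vmid,
              "scope": scope, "iat": now, "exp": now + ttl}
    payload = _b64(json.dumps(claims, separators=(",", ":")).encode())
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
    return f"{payload}.{_b64(sig)}"


def verify_token(secret: bytes, token: str, vmid: int, scope: str,
                 now: Optional[int] = None) -> bool:
    """Agent side: check signature, issuer, audience, expiry, VMID, and scope."""
    now = int(time.time()) if now is None else now
    try:
        payload, sig = token.split(".")
        expected = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _unb64(sig)):
            return False
        claims = json.loads(_unb64(payload))
    except (ValueError, TypeError):
        return False
    return (claims.get("iss") == "zpack-api" and claims.get("aud") == "zlh-agent"
            and claims.get("exp", 0) > now
            and claims.get("vmid") == vmid and claims.get("scope") == scope)
```

The Agent passes in the VMID and scope that the incoming request actually targets, so a files:read token for one container fails verification for any other container or operation.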

Roadmap

  • Phase 1: static ZLH_AGENT_TOKEN, fail-closed in production.
  • Phase 2: token rotation and deployment validation so stale or missing tokens are caught before traffic.
  • Phase 3: short-lived API-signed request tokens for API-to-Agent calls.
  • Phase 4: optional mTLS between API and Agent for transport-level service identity.

JWT vs HMAC request signatures

JWT is useful when explicit claims and public-key verification are desired.

A compact HMAC request-signing scheme may be simpler operationally:

X-ZLH-Agent-Timestamp: <unix timestamp>
X-ZLH-Agent-Vmid: <vmid>
X-ZLH-Agent-Scope: lifecycle:stop
X-ZLH-Agent-Signature: HMAC(secret, method + path + timestamp + vmid + scope + bodyHash)

Either approach is better than a bearer token that never expires, once the launch system is stable.
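
A minimal sketch of the HMAC header scheme, assuming HMAC-SHA256 with hex output, a newline-joined canonical string, and a 60-second allowed clock skew. None of these choices is fixed by this document.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: int,
                 vmid: int, scope: str, body: bytes) -> str:
    """Compute the X-ZLH-Agent-Signature value as a hex HMAC-SHA256."""
    body_hash = hashlib.sha256(body).hexdigest()
    # Newline-joined fields (an assumed canonical form) keep field boundaries
    # unambiguous, which plain concatenation would not.
    message = "\n".join([method.upper(), path, str(timestamp), str(vmid), scope, body_hash])
    return hmac.new(secret, message.encode(), hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, headers: dict,
                   body: bytes, now: int, max_skew: int = 60) -> bool:
    """Agent side: reject stale timestamps, then compare in constant time."""
    try:
        timestamp = int(headers["X-ZLH-Agent-Timestamp"])
        vmid = int(headers["X-ZLH-Agent-Vmid"])
        scope = headers["X-ZLH-Agent-Scope"]
        given = headers["X-ZLH-Agent-Signature"]
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > max_skew:
        return False
    expected = sign_request(secret, method, path, timestamp, vmid, scope, body)
    return hmac.compare_digest(expected.encode(), given.encode())
```

Because the body hash, VMID, and scope are all part of the signed string, changing any of them in transit invalidates the signature. The timestamp window bounds how long a captured request can be replayed.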

Not a launch blocker

Do not jump directly to short-lived signed tokens before launch unless required. They add operational complexity:

  • clock drift
  • issuer/audience mismatch
  • token scope bugs
  • signing-key/config drift
  • harder provisioning debugging

For launch, static token plus fail-closed production behavior is the correct low-risk hardening step. For scale, short-lived signed API-to-Agent tokens are the cleaner end state.