API Agent Trust Model

The API is the authority for user identity, ownership, billing, plan limits, and platform policy. The Agent is an internal executor inside a game/dev container.

This means the API must protect the platform from unsafe Agent invocation. The Agent should not be user-facing, and Portal should never call Agent directly.

Launch / short term

For launch, use static ZLH_AGENT_TOKEN as the API-to-Agent control-plane secret.

Required launch behavior:

API refuses to start in production without ZLH_AGENT_TOKEN.
API injects ZLH_AGENT_TOKEN into every Agent request.
API centralizes Agent calls through the shared Agent client rather than ad hoc fetch() calls.
Agent refuses protected requests in production without a configured token.
Portal never receives or constructs the Agent token.
Normal user requests are authenticated and ownership-checked before API calls Agent.
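
A minimal sketch of the fail-closed startup check and the shared Agent client follows. Only ZLH_AGENT_TOKEN and port 18888 come from this document. The function names, the NODE_ENV check, the /stop path, and sending the token as an Authorization: Bearer header are assumptions for illustration, as is letting non-production environments run without a token.

```typescript
// Hypothetical helper names; only ZLH_AGENT_TOKEN is taken from this document.
type Env = Record<string, string | undefined>;

interface AgentRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body?: string;
}

// Fail closed: in production a missing token is a startup error, not a warning.
function loadAgentToken(env: Env): string | undefined {
  const token = env.ZLH_AGENT_TOKEN?.trim();
  if (env.NODE_ENV === "production" && !token) {
    throw new Error("ZLH_AGENT_TOKEN is required in production");
  }
  return token || undefined;
}

// The one place where Agent requests are built, so the token is injected
// consistently and never has to be handled by route code or by Portal.
function createAgentClient(env: Env) {
  const token = loadAgentToken(env);
  return {
    buildRequest(baseUrl: string, method: string, path: string, payload?: unknown): AgentRequest {
      const headers: Record<string, string> = { "Content-Type": "application/json" };
      // Assumption: the Agent reads the token from a bearer Authorization header.
      if (token) headers["Authorization"] = `Bearer ${token}`;
      const request: AgentRequest = {
        url: baseUrl.replace(/\/+$/, "") + path,
        method,
        headers,
      };
      if (payload !== undefined) request.body = JSON.stringify(payload);
      return request;
    },
  };
}
```

Route handlers would then pass the built request to whatever HTTP transport the API uses, instead of assembling headers themselves.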

API responsibilities before calling Agent

Before sending work to Agent, API must verify:

the user is authenticated, unless the caller is an internal/admin service
the requested server/container belongs to the user
billing/account state allows the operation
plan limits allow create/upgrade operations
the requested operation is valid for the server type (game vs dev)
no conflicting operation is already in progress
request payloads are bounded and normalized

The Agent may validate filesystem/runtime safety, but it does not own customer authorization.
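
The checks above could be gathered into one gate that runs before any Agent call. This is a sketch under stated assumptions: the record shapes, the operation table, the console:command operation, and the status codes other than 409 operation_in_progress are hypothetical. Plan-limit checks, payload bounding, and the internal/admin exemption are left out for brevity.

```typescript
type ServerType = "game" | "dev";

interface User { id: string; billingActive: boolean }

interface ServerRecord {
  id: string;
  ownerId: string;
  type: ServerType;
  vmid: number;
  busy: boolean; // true while another operation is in progress
}

type Decision =
  | { ok: true; vmid: number }
  | { ok: false; status: number; code: string };

// Hypothetical table of which operations each server type accepts.
const ALLOWED_OPS: Record<ServerType, ReadonlySet<string>> = {
  game: new Set(["lifecycle:stop", "files:read", "console:command"]),
  dev: new Set(["lifecycle:stop", "files:read"]),
};

function authorizeAgentCall(user: User | null, server: ServerRecord | null, op: string): Decision {
  if (!user) return { ok: false, status: 401, code: "unauthenticated" };
  // A server owned by someone else is reported as missing, so its existence is not leaked.
  if (!server || server.ownerId !== user.id) {
    return { ok: false, status: 404, code: "server_not_found" };
  }
  if (!user.billingActive) return { ok: false, status: 402, code: "billing_inactive" };
  if (!ALLOWED_OPS[server.type].has(op)) {
    return { ok: false, status: 400, code: "invalid_operation" };
  }
  if (server.busy) return { ok: false, status: 409, code: "operation_in_progress" };
  // The VMID comes from the API's own record, never from user input.
  return { ok: true, vmid: server.vmid };
}
```

Returning the VMID from the ownership record, rather than accepting it from the request, keeps the Agent target under API control.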

What API must not do

Do not expose ZLH_AGENT_TOKEN to Portal or browser-visible code.
Do not let Portal call Agent IPs or :18888 directly.
Do not treat Agent responses as authorization proof.
Do not let users supply arbitrary VMID/IP and proxy directly to Agent.
Do not leave active-looking DB rows after provisioning fails.
Do not allow duplicate provisioning retries to create duplicate LXCs.
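
The last two rules can be illustrated with an idempotency key per provisioning request. This is a simplified in-memory sketch, not the real persistence layer: ProvisioningStore, the row shape, and the requestKey idea are assumptions, and createLxc is a hypothetical callback kept synchronous for brevity where a real provisioner would be asynchronous and backed by the database.

```typescript
type Status = "provisioning" | "active" | "failed";

interface ServerRow {
  requestKey: string;
  status: Status;
  vmid?: number;
}

class ProvisioningStore {
  private rows = new Map<string, ServerRow>();

  // createLxc is a hypothetical callback that creates the container and returns its VMID.
  provision(requestKey: string, createLxc: () => number): ServerRow {
    const existing = this.rows.get(requestKey);
    // A retry with the same key reuses the earlier row instead of creating a
    // second LXC, unless the earlier attempt failed.
    if (existing && existing.status !== "failed") return existing;

    const row: ServerRow = { requestKey, status: "provisioning" };
    this.rows.set(requestKey, row);
    try {
      row.vmid = createLxc();
      row.status = "active";
    } catch {
      // Never leave an active-looking row behind after a failed attempt.
      row.status = "failed";
    }
    return row;
  }
}
```

In a real deployment the same effect would more likely come from a unique constraint on the key plus a transaction, so that concurrent retries are also covered.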

Error handling

Agent failures should be mapped to safe API responses:

Agent unreachable -> 502 Agent unreachable
Agent not ready -> 503 agent_not_ready
Agent operation in progress -> 409 operation_in_progress
Agent unauthorized -> internal misconfiguration (5xx), not a user-auth failure (401/403)

Do not leak Agent URLs, IPs, stack traces, tokens, or raw internal error bodies to Portal.
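
One way to keep this mapping in a single place is a small translation function. The failure kinds and the function name are hypothetical, and the 500 internal_error response for the unauthorized case is an assumption, since the table above only says it must not look like a user-auth failure. How the Agent client classifies a raw failure into one of these kinds is out of scope here.

```typescript
// Hypothetical failure kinds produced by the shared Agent client.
type AgentFailure =
  | "unreachable"
  | "not_ready"
  | "operation_in_progress"
  | "unauthorized";

interface ApiError {
  status: number;
  error: string;
}

// Maps an internal Agent failure to a response that is safe to show Portal.
// Nothing from the Agent's raw error body is copied into the result.
function toApiError(failure: AgentFailure): ApiError {
  switch (failure) {
    case "unreachable":
      return { status: 502, error: "Agent unreachable" };
    case "not_ready":
      return { status: 503, error: "agent_not_ready" };
    case "operation_in_progress":
      return { status: 409, error: "operation_in_progress" };
    case "unauthorized":
      // The API's token was rejected: a platform misconfiguration to log and
      // alert on internally, never a 401/403 for the end user.
      return { status: 500, error: "internal_error" };
  }
}
```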

Longer-term target

Static ZLH_AGENT_TOKEN is a launch hardening layer, not the final trust model.

The better long-term model is short-lived signed API-to-Agent request tokens.

In that model:

API keeps a signing secret or private key.
Agent knows only a verification secret or public key.
API signs a short-lived token per Agent request, typically valid for 30-120 seconds.
Agent verifies issuer, audience, expiry, VMID, and scope before executing.

Example claims:

{
  "iss": "zpack-api",
  "aud": "zlh-agent",
  "vmid": 5202,
  "scope": "files:read",
  "iat": 1714500000,
  "exp": 1714500060,
  "jti": "request-id"
}

This limits blast radius if a token is exposed. A stolen short-lived files:read token for VMID 5202 cannot be reused later to stop another server, restore a backup, or mutate files on a different container.
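
The issue-and-verify flow might look like the sketch below, which builds a compact HS256 token in the JWT layout using Node's built-in crypto module. It assumes a shared secret rather than a public/private key pair. The function names are hypothetical, and a production system would more likely use a maintained JWT library and track jti values to reject replays.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface AgentClaims {
  iss: string; aud: string; vmid: number; scope: string;
  iat: number; exp: number; jti: string;
}

const b64url = (text: string): string => Buffer.from(text, "utf8").toString("base64url");

function sign(input: string, secret: string): string {
  return createHmac("sha256", secret).update(input).digest("base64url");
}

// API side: issue a token bound to one VMID and one scope, valid for ttlSec seconds.
function issueToken(secret: string, vmid: number, scope: string, jti: string,
                    nowSec: number, ttlSec = 60): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const claims: AgentClaims = {
    iss: "zpack-api", aud: "zlh-agent", vmid, scope,
    iat: nowSec, exp: nowSec + ttlSec, jti,
  };
  const payload = b64url(JSON.stringify(claims));
  return `${header}.${payload}.${sign(`${header}.${payload}`, secret)}`;
}

// Agent side: check the signature first, then issuer, audience, expiry, VMID, and scope.
function verifyToken(token: string, secret: string, expectedVmid: number,
                     requiredScope: string, nowSec: number): AgentClaims | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;
  const expected = Buffer.from(sign(`${header}.${payload}`, secret));
  const actual = Buffer.from(signature);
  // Constant-time comparison; lengths must match before timingSafeEqual is called.
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) return null;

  let claims: AgentClaims;
  try {
    claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as AgentClaims;
  } catch {
    return null;
  }
  const valid =
    claims.iss === "zpack-api" &&
    claims.aud === "zlh-agent" &&
    claims.vmid === expectedVmid &&
    claims.scope === requiredScope &&
    nowSec < claims.exp;
  return valid ? claims : null;
}
```

The Agent passes in its own VMID and the scope the requested operation needs, so a token minted for files:read on VMID 5202 is rejected for any other container or action.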

Roadmap

Phase 1: static ZLH_AGENT_TOKEN, fail-closed in production.
Phase 2: token rotation and deployment validation so stale or missing tokens are caught before traffic.
Phase 3: short-lived API-signed request tokens for API-to-Agent calls.
Phase 4: optional mTLS between API and Agent for transport-level service identity.

JWT vs HMAC request signatures

JWT is useful when explicit claims and public-key verification are desired.

A compact HMAC request-signing scheme may be simpler operationally:

X-ZLH-Agent-Timestamp: <unix timestamp>
X-ZLH-Agent-Vmid: <vmid>
X-ZLH-Agent-Scope: lifecycle:stop
X-ZLH-Agent-Signature: HMAC(secret, method + path + timestamp + vmid + scope + bodyHash)

Either approach is better than a long-lived static bearer token once the launch system is stable.
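
A sketch of the HMAC variant, again using Node's built-in crypto module, is shown here. The header names come from the scheme above. The newline delimiter between signed fields, the hex encoding, the SHA-256 body hash, and the 60-second skew window are assumptions; a delimiter is used because plain concatenation would let different field combinations produce the same signed string.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

interface AgentCall {
  method: string;
  path: string;
  vmid: number;
  scope: string;
  body: string;
}

// Canonical string covered by the signature. Fields are newline-delimited (an assumption).
function canonical(call: AgentCall, timestamp: number): string {
  const bodyHash = createHash("sha256").update(call.body).digest("hex");
  return [call.method.toUpperCase(), call.path, String(timestamp),
          String(call.vmid), call.scope, bodyHash].join("\n");
}

function hmacHex(secret: string, input: string): string {
  return createHmac("sha256", secret).update(input).digest("hex");
}

// API side: produce the headers to attach to the Agent request.
function signRequest(secret: string, call: AgentCall, nowSec: number): Record<string, string> {
  return {
    "X-ZLH-Agent-Timestamp": String(nowSec),
    "X-ZLH-Agent-Vmid": String(call.vmid),
    "X-ZLH-Agent-Scope": call.scope,
    "X-ZLH-Agent-Signature": hmacHex(secret, canonical(call, nowSec)),
  };
}

// Agent side: recompute the signature from what was actually received.
function verifyRequest(secret: string, method: string, path: string, body: string,
                       headers: Record<string, string | undefined>,
                       nowSec: number, maxSkewSec = 60): boolean {
  const timestamp = Number(headers["X-ZLH-Agent-Timestamp"]);
  const vmid = Number(headers["X-ZLH-Agent-Vmid"]);
  const scope = headers["X-ZLH-Agent-Scope"] ?? "";
  const signature = headers["X-ZLH-Agent-Signature"] ?? "";
  // Reject missing, malformed, or stale timestamps before doing any HMAC work.
  if (!Number.isInteger(timestamp) || Math.abs(nowSec - timestamp) > maxSkewSec) return false;

  const expected = Buffer.from(hmacHex(secret, canonical({ method, path, vmid, scope, body }, timestamp)));
  const actual = Buffer.from(signature);
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Real HTTP stacks treat header names case-insensitively (Node lowercases them), so an actual verifier would normalize names before lookup, and would still need to check that the signed VMID and scope match the container and operation being requested.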

Not a launch blocker

Do not jump directly to short-lived signed tokens before launch unless required. They add operational complexity:

clock drift
issuer/audience mismatch
token scope bugs
signing-key/config drift
harder provisioning debugging

For launch, static token plus fail-closed production behavior is the correct low-risk hardening step. For scale, short-lived signed API-to-Agent tokens are the cleaner end state.
