Add API agent trust model guidance
parent 8e520d35cc
commit ac1b4a365c
Codex/API/AGENT_TRUST_MODEL.md (new file, 115 lines)

# API Agent Trust Model

The API is the authority for user identity, ownership, billing, plan limits, and platform policy. The Agent is an internal executor inside a game/dev container.

This means the API must protect the platform from unsafe Agent invocation. The Agent should not be user-facing, and Portal should never call Agent directly.

## Launch / short term

For launch, use static `ZLH_AGENT_TOKEN` as the API-to-Agent control-plane secret.

Required launch behavior:

- API refuses to start in production without `ZLH_AGENT_TOKEN`.
- API injects `ZLH_AGENT_TOKEN` into every Agent request.
- API centralizes Agent calls through the shared Agent client rather than ad hoc `fetch()` calls.
- Agent refuses protected requests in production without a configured token.
- Portal never receives or constructs the Agent token.
- Normal user requests are authenticated and ownership-checked before API calls Agent.

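
The fail-closed startup check and the shared Agent client could look roughly like the sketch below. It is illustrative only: the use of `NODE_ENV` to detect production, the `Authorization: Bearer` header, and the names `loadAgentToken` and `buildAgentRequest` are assumptions, not something this document specifies.

```typescript
// Sketch only: NODE_ENV, the Authorization header, and all names here are assumptions.
type Env = Record<string, string | undefined>;

// Fail closed: in production a missing token aborts startup instead of running unauthenticated.
function loadAgentToken(env: Env): string | undefined {
  const token = (env.ZLH_AGENT_TOKEN || "").trim();
  if (token) return token;
  if (env.NODE_ENV === "production") {
    throw new Error("ZLH_AGENT_TOKEN is required in production");
  }
  return undefined; // outside production, running without a token is tolerated
}

interface AgentRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body?: string;
}

// Single place where Agent requests are built, so the token is injected consistently.
function buildAgentRequest(
  baseUrl: string,
  path: string,
  token: string | undefined,
  method: string = "GET",
  payload?: unknown
): AgentRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) headers["Authorization"] = "Bearer " + token;
  const request: AgentRequest = {
    url: baseUrl.replace(/\/+$/, "") + path, // drop trailing slashes before joining
    method: method,
    headers: headers,
  };
  if (payload !== undefined) request.body = JSON.stringify(payload);
  return request;
}
```

Route handlers would pass the returned `url`, `method`, `headers`, and `body` to `fetch()` through this one helper, so no handler ever touches the token itself.
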
## API responsibilities before calling Agent

Before sending work to Agent, API must verify:

- the user is authenticated, unless the caller is an internal/admin service
- the requested server/container belongs to the user
- billing/account state allows the operation
- plan limits allow create/upgrade operations
- the requested operation is valid for the server type (`game` vs `dev`)
- no conflicting operation is already in progress
- request payloads are bounded and normalized

The Agent may validate filesystem/runtime safety, but it does not own customer authorization.

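
The checks above can be sketched as a single pre-flight function that runs before any Agent call. Everything in it is hypothetical: the record shapes, the operation names, the payload limit, and every status code other than `409 operation_in_progress` are placeholders chosen for illustration.

```typescript
// Sketch only: types, operation names, the size limit, and most status codes are assumptions.
type ServerType = "game" | "dev";

interface Caller { userId?: string; internal: boolean; }
interface ServerRecord {
  ownerId: string;
  type: ServerType;
  accountActive: boolean;   // billing/account state allows operations
  operationInProgress: boolean;
}
interface Preflight { ok: boolean; status: number; code: string; }

const MAX_PAYLOAD_BYTES = 64 * 1024;
const OPERATIONS_BY_TYPE: Record<ServerType, string[]> = {
  game: ["start", "stop", "files:read"],
  dev: ["start", "stop", "files:read", "files:write"],
};

function deny(status: number, code: string): Preflight {
  return { ok: false, status: status, code: code };
}

function preflight(
  caller: Caller,
  server: ServerRecord | undefined,
  operation: string,
  payloadBytes: number
): Preflight {
  // Internal/admin services skip user authentication and ownership checks.
  if (!caller.internal && !caller.userId) return deny(401, "unauthenticated");
  // A server the user does not own is reported the same way as a missing one.
  if (!server || (!caller.internal && server.ownerId !== caller.userId)) {
    return deny(404, "server_not_found");
  }
  if (!server.accountActive) return deny(402, "account_blocked");
  if (OPERATIONS_BY_TYPE[server.type].indexOf(operation) === -1) {
    return deny(400, "invalid_operation");
  }
  if (server.operationInProgress) return deny(409, "operation_in_progress");
  if (payloadBytes > MAX_PAYLOAD_BYTES) return deny(413, "payload_too_large");
  return { ok: true, status: 200, code: "ok" };
}
```

Plan-limit checks for create/upgrade operations are omitted here because they depend on plan data this document does not describe.
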
## What API must not do

- Do not expose `ZLH_AGENT_TOKEN` to Portal or browser-visible code.
- Do not let Portal call Agent IPs or `:18888` directly.
- Do not treat Agent responses as authorization proof.
- Do not let users supply arbitrary VMID/IP and proxy directly to Agent.
- Do not leave active-looking DB rows after provisioning fails.
- Do not allow duplicate provisioning retries to create duplicate LXCs.

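
One way to avoid proxying user-supplied VMIDs or IPs is to let the request carry only a server ID and derive the Agent target from the API's own records. The sketch below assumes a hypothetical `ServerRow` shape and an in-memory array standing in for the database; only port `18888` comes from this document.

```typescript
// Sketch only: ServerRow, its fields, and resolveAgentTarget are hypothetical names.
interface ServerRow {
  id: string;
  ownerId: string;
  vmid: number;
  agentIp: string;
  status: "provisioning" | "active" | "failed";
}
interface AgentTarget { vmid: number; baseUrl: string; }

const AGENT_PORT = 18888;

// The caller supplies only a server ID. VMID and IP always come from stored rows,
// and rows that are not active (for example, failed provisioning) are never used.
function resolveAgentTarget(
  rows: ServerRow[],
  userId: string,
  serverId: string
): AgentTarget | undefined {
  for (let i = 0; i < rows.length; i++) {
    const row = rows[i];
    if (row.id === serverId && row.ownerId === userId && row.status === "active") {
      return { vmid: row.vmid, baseUrl: "http://" + row.agentIp + ":" + AGENT_PORT };
    }
  }
  return undefined;
}
```

A route handler would return a generic not-found response when this function yields `undefined`, rather than forwarding anything to Agent.
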
## Error handling

Agent failures should be mapped to safe API responses:

- Agent unreachable -> `502 Agent unreachable`
- Agent not ready -> `503 agent_not_ready`
- Agent operation in progress -> `409 operation_in_progress`
- Agent unauthorized -> internal misconfiguration, not user-auth failure

Do not leak Agent URLs, IPs, stack traces, tokens, or raw internal error bodies to Portal.

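
The mapping above can be sketched as one function that turns a classified Agent failure into a response safe to show Portal. The `AgentFailure` names, the response shape, and the generic `500 internal_error` used for the unauthorized case are assumptions; this document only says that case is an internal misconfiguration rather than a user-auth failure.

```typescript
// Sketch only: AgentFailure, SafeResponse, and the 500 internal_error choice are assumptions.
type AgentFailure = "unreachable" | "not_ready" | "operation_in_progress" | "unauthorized";

interface SafeResponse { status: number; body: { error: string }; }

function toSafeResponse(
  failure: AgentFailure,
  log: (message: string) => void = function () {}
): SafeResponse {
  switch (failure) {
    case "unreachable":
      return { status: 502, body: { error: "Agent unreachable" } };
    case "not_ready":
      return { status: 503, body: { error: "agent_not_ready" } };
    case "operation_in_progress":
      return { status: 409, body: { error: "operation_in_progress" } };
    case "unauthorized":
    default:
      // The Agent rejected the API's own token: log it for operators and
      // return a generic error, never a 401/403 that blames the user.
      log("Agent rejected control-plane credentials; check ZLH_AGENT_TOKEN configuration");
      return { status: 500, body: { error: "internal_error" } };
  }
}
```

The response bodies carry only fixed strings, so Agent URLs, IPs, and raw error bodies never reach the caller.
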
## Longer-term target

Static `ZLH_AGENT_TOKEN` is a launch hardening layer, not the final trust model.

The better long-term model is short-lived signed API-to-Agent request tokens.

In that model:

1. API keeps a signing secret or private key.
2. Agent knows only a verification secret or public key.
3. API signs a short-lived token per Agent request, usually 30-120 seconds.
4. Agent verifies issuer, audience, expiry, VMID, and scope before executing.

Example claims:

```json
{
  "iss": "zpack-api",
  "aud": "zlh-agent",
  "vmid": 5202,
  "scope": "files:read",
  "iat": 1714500000,
  "exp": 1714500060,
  "jti": "request-id"
}
```

This limits blast radius if a token is exposed. A stolen short-lived `files:read` token for VMID `5202` cannot be reused later to stop another server, restore a backup, or mutate files on a different container.

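
The Agent-side claim checks (issuer, audience, expiry, VMID, and scope) can be sketched as below. The sketch assumes the token signature has already been verified elsewhere, and the function name, the returned reason strings, and the 5-second clock-skew allowance are all hypothetical.

```typescript
// Sketch only: assumes the signature was verified before this runs.
// checkClaims, the reason strings, and the default skew are hypothetical.
interface AgentClaims {
  iss: string;
  aud: string;
  vmid: number;
  scope: string;
  iat: number; // issued-at, unix seconds
  exp: number; // expiry, unix seconds
  jti: string;
}

interface Expected {
  vmid: number;        // the container this Agent runs in
  scope: string;       // the scope the requested operation needs
  nowSeconds: number;
  clockSkewSeconds?: number;
}

// Returns null when the claims are acceptable, otherwise a short reason.
function checkClaims(claims: AgentClaims, expected: Expected): string | null {
  const skew = expected.clockSkewSeconds === undefined ? 5 : expected.clockSkewSeconds;
  if (claims.iss !== "zpack-api") return "bad_issuer";
  if (claims.aud !== "zlh-agent") return "bad_audience";
  if (claims.exp + skew < expected.nowSeconds) return "expired";
  if (claims.iat - skew > expected.nowSeconds) return "not_yet_valid";
  if (claims.vmid !== expected.vmid) return "vmid_mismatch";
  if (claims.scope !== expected.scope) return "scope_mismatch";
  return null;
}
```

With the example claims above, a request for `lifecycle:stop` or for a different VMID is rejected even while the token is still within its lifetime, which is the blast-radius limit the paragraph describes.
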
## Roadmap

- Phase 1: static `ZLH_AGENT_TOKEN`, fail-closed in production.
- Phase 2: token rotation and deployment validation so stale or missing tokens are caught before traffic.
- Phase 3: short-lived API-signed request tokens for API-to-Agent calls.
- Phase 4: optional mTLS between API and Agent for transport-level service identity.

## JWT vs HMAC request signatures

JWT is useful when explicit claims and public-key verification are desired.

A compact HMAC request-signing scheme may be simpler operationally:

```text
X-ZLH-Agent-Timestamp: <unix timestamp>
X-ZLH-Agent-Vmid: <vmid>
X-ZLH-Agent-Scope: lifecycle:stop
X-ZLH-Agent-Signature: HMAC(secret, method + path + timestamp + vmid + scope + bodyHash)
```

Either approach is better than a forever bearer token once the launch system is stable.

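
A minimal sketch of the HMAC variant, using Node's built-in `crypto` module, might look like this. SHA-256, hex encoding, the newline-separated canonical string, the 60-second freshness window, and the function names are assumptions; the header scheme above does not fix any of them.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// Sketch only: algorithm, encoding, separator, and names are assumptions.
interface SignedRequest {
  method: string;
  path: string;
  timestamp: number; // unix seconds, sent as X-ZLH-Agent-Timestamp
  vmid: number;      // sent as X-ZLH-Agent-Vmid
  scope: string;     // sent as X-ZLH-Agent-Scope
  body: string;
}

// API side: compute the value for X-ZLH-Agent-Signature.
function signRequest(secret: string, req: SignedRequest): string {
  const bodyHash = createHash("sha256").update(req.body).digest("hex");
  // Newlines separate the fields so adjacent values cannot run together.
  const canonical = [
    req.method.toUpperCase(),
    req.path,
    String(req.timestamp),
    String(req.vmid),
    req.scope,
    bodyHash,
  ].join("\n");
  return createHmac("sha256", secret).update(canonical).digest("hex");
}

// Agent side: reject stale timestamps, then compare signatures in constant time.
function verifyRequest(
  secret: string,
  req: SignedRequest,
  signature: string,
  nowSeconds: number,
  maxAgeSeconds: number = 60
): boolean {
  if (Math.abs(nowSeconds - req.timestamp) > maxAgeSeconds) return false;
  const expected = Buffer.from(signRequest(secret, req), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Because the VMID, scope, and body hash are all covered by the signature, changing any of them after signing makes verification fail. The sketch uses a separator between fields, where the pseudo-formula above concatenates them directly, so that a value like path `/a` plus timestamp `12` cannot collide with path `/a1` plus timestamp `2`.
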
## Not a launch blocker

Do not jump directly to short-lived signed tokens before launch unless required. They add operational complexity:

- clock drift
- issuer/audience mismatch
- token scope bugs
- signing-key/config drift
- harder provisioning debugging

For launch, static token plus fail-closed production behavior is the correct low-risk hardening step. For scale, short-lived signed API-to-Agent tokens are the cleaner end state.