# Mod Deployment Safety Model
**Version:** 2.0
**Updated:** 2026-03-01
**Phase:** Phase 1 — Active
**Applies To:** Game Containers
**Owner:** Agent Layer
---
## Goal
Allow user uploads and automated mod installs while:
- Preventing path escape
- Preserving runtime integrity
- Tracking provenance
- Avoiding hidden deployment layers
---
## Deployment Types
| Type | Path | Extension |
|------|------|-----------|
| Mod (Modrinth) | `mods/<file>.jar` | `.jar` |
| Mod (user upload) | `mods/<file>.jar` | `.jar` |
| Datapack | `world/datapacks/<file>.zip` | `.zip` |
---
## Direct Runtime Writes
Uploads and automated installs are written directly into runtime directories.
No:
- Staging
- Symlinking
- Delayed deployment
---
## Atomic Write Process
1. Create temp file in target directory
2. Stream multipart body (or downloaded artifact) into temp file
3. Enforce size limit while streaming
4. `os.Rename()` temp → final filename
5. Update `.zlh_metadata.json`
If `overwrite=false` and the target file already exists, the upload is rejected with `409`.
---
## Size Limits
Configured in agent:
| Type | Limit |
|------|-------|
| Mods | 250MB |
| Datapacks | 100MB |
Enforced by the agent while the body is streaming, not at the API layer.
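
To illustrate how the deployment types and size limits might be tied together, the sketch below maps a relative path to its limit. The names `deployTypes` and `limitFor` are hypothetical, reading "MB" as MiB is an assumption, and so is the rule that files must sit directly inside the listed directory.

```go
package main

import "path"

// deployType mirrors one row of the Deployment Types and Size Limits tables.
type deployType struct {
	dir   string // directory relative to the runtime root
	ext   string // required file extension
	limit int64  // maximum upload size in bytes
}

// Assumption: "250MB" and "100MB" are read as MiB here.
var deployTypes = []deployType{
	{dir: "mods", ext: ".jar", limit: 250 << 20},
	{dir: "world/datapacks", ext: ".zip", limit: 100 << 20},
}

// limitFor returns the size limit for a slash-separated relative path, or
// false if the path does not match any known deployment type.
func limitFor(rel string) (int64, bool) {
	rel = path.Clean(rel)
	for _, t := range deployTypes {
		if path.Dir(rel) == t.dir && path.Ext(rel) == t.ext {
			return t.limit, true
		}
	}
	return 0, false
}
```

A path that matches no row would be a natural candidate for the `403` response, though the agent's actual allowlist logic is not shown in this document.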
---
## Overwrite Behavior
- `overwrite=false` (default) → reject if file exists → `409`
- `overwrite=true` → replace file, update `uploaded_at` in metadata
---
## Agent-Side Guarantees
The agent guarantees:
- Final path is a file (not directory)
- Target is not a symlink
- Path resolves within runtime root
- Parent directory exists
- `.zlh_metadata.json` is hidden from file listing API
- `.zlh-shadow` is hidden from file listing API
---
## Provenance Metadata
All user uploads are marked `source: "user"` in `.zlh_metadata.json`.
```json
{
"mods/sodium.jar": {
"source": "user",
"uploaded_at": "2026-03-01T22:37:01Z"
}
}
```
Modrinth-installed mods do not currently write provenance. Future curated installs may optionally write `source: "curated"`. Not currently implemented.
Source is never inferred automatically from filename or path.
---
## Failure Categories
| Code | Meaning |
|------|---------|
| `409` | File exists (`overwrite=false`) |
| `413` | Size limit exceeded |
| `403` | Path rejected by allowlist |
| `502` | API transport failure (not agent validation) |
---
## API Transport Considerations
The API acts strictly as a streaming proxy for uploads.
```js
req.pipe(proxyReq)
proxyRes.pipe(res)
```
- Uses raw `http.request` piping — not `fetch`
- Does not buffer file contents
- Does not re-validate upload policy
- Upload timeout must be substantially larger than normal routes
---
## Mod Lifecycle (Current Phase 1 State)
**Install (Modrinth):**
Modrinth resolver → API → Agent → Verified download → `<serverRoot>/mods`
**Install (User Upload):**
Portal → API (streaming proxy) → Agent → Atomic write → `<serverRoot>/mods`
**Enable/Disable:**
Filesystem rename: `.jar` ↔ `.jar.disabled`
**Delete:**
Soft delete to `<serverRoot>/mods-removed` (no auto-purge, no retention policy)
Filesystem is canonical state. Agent cache invalidated after every mutation.
Restore of deleted mods handled manually via file browser (see `OPEN_THREADS.md`).
---
## Self-Healing Model (Automated Installs)
Applies to Modrinth installs and dev artifact promotions. Does **not** apply to user uploads — users assume responsibility for files they upload directly.
### Deployment States
```
IDLE → DEPLOYING → STABILIZING → STABLE → IDLE
ROLLBACK_FILE → ROLLBACK_SNAPSHOT → FAILED_RECOVERY
```
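
One way to encode these states is a transition table, sketched below. The state names come from the diagram; the exact set of allowed edges (for example, that failures branch out of `STABILIZING` and that a successful rollback returns to `IDLE`) is inferred from the Failure Triggers and Recovery sections and should be treated as an assumption.

```go
package main

// State is one deployment state from the diagram above.
type State string

const (
	Idle             State = "IDLE"
	Deploying        State = "DEPLOYING"
	Stabilizing      State = "STABILIZING"
	Stable           State = "STABLE"
	RollbackFile     State = "ROLLBACK_FILE"
	RollbackSnapshot State = "ROLLBACK_SNAPSHOT"
	FailedRecovery   State = "FAILED_RECOVERY"
)

// transitions lists the edges this sketch allows. FAILED_RECOVERY has no
// outgoing edges because automation stops there.
var transitions = map[State][]State{
	Idle:             {Deploying},
	Deploying:        {Stabilizing},
	Stabilizing:      {Stable, RollbackFile, RollbackSnapshot},
	Stable:           {Idle},
	RollbackFile:     {Idle, RollbackSnapshot},
	RollbackSnapshot: {Idle, FailedRecovery},
	FailedRecovery:   {},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```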
### Snapshot Scope
Included:
- `mods/`
- `config/`
- `server.properties`
Excluded:
- `world/` (separate backup system)
- `logs/`
- Cache and temp files
### Stabilization Window
Hardcoded Phase 1: **35 minutes**
Server is stable when:
- Readiness probe succeeds
- No crash restart during the window
- Server runs continuously for full window duration
### Failure Triggers
| Trigger | Action |
|---------|--------|
| Early boot crash (exit within ~30s) | → `ROLLBACK_FILE` |
| Crash loop (≥3 crashes in window) | → `ROLLBACK_SNAPSHOT` |
| Readiness timeout (full window) | → `ROLLBACK_SNAPSHOT` |
### Recovery
**File rollback:** Restore `.jar.shadow` → `.jar`, monitor again. If stable → `IDLE`. If not → escalate to snapshot restore.
**Snapshot restore:** Extract `.zlh_snapshots/deploy-<timestamp>.tar.gz`, monitor. If stable → `IDLE`. If not → `FAILED_RECOVERY`.
### Safety Constraints
- Only one file rollback attempt
- Only one snapshot restore attempt
- Both fail → `FAILED_RECOVERY`, stop automation, surface to API + UI
- World data never touched
- Only one active snapshot at a time
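
The bounded escalation described by these constraints can be sketched as a function that tries each recovery step at most once. `runRecovery` and its callback parameters are hypothetical; each callback stands in for "perform the step, monitor, and report whether the server stabilized".

```go
package main

// runRecovery performs at most one file rollback and at most one snapshot
// restore, in that order, and returns the resulting state. start is the state
// the failure trigger selected: "ROLLBACK_FILE" or "ROLLBACK_SNAPSHOT".
func runRecovery(start string, fileRollback, snapshotRestore func() bool) string {
	// The file rollback is only attempted when the trigger asked for it.
	if start == "ROLLBACK_FILE" && fileRollback() {
		return "IDLE"
	}
	// Either the file rollback failed or the trigger went straight to snapshot.
	if snapshotRestore() {
		return "IDLE"
	}
	// Both options are exhausted: stop automation and surface the failure.
	return "FAILED_RECOVERY"
}
```

Because neither callback is ever invoked twice, the number of recovery attempts per deployment stays fixed no matter how the server behaves.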
---
## Metadata Tracking (Automated Installs)
Agent-level fields, cleared after stabilization success:
```
lastChangedMod
lastChangeTimestamp
lastChangeSource # modrinth | dev
deploymentState
crashCount
snapshotId
```
---
## Why No Symlinks
Symlink-based deployment was rejected because:
- Complicates mod loader behavior
- Breaks server-side loader compatibility
- Introduces unexpected runtime indirection
- Makes provenance ambiguous
Direct writes + metadata is simpler and safer.
---
## Logging Requirements
Structured logs must emit:
- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed
- User upload received
- User upload rejected (with reason)
All events visible via agent logs and surfaced through API status endpoints.
---
## Operational Guarantees
- A broken automated mod install will not permanently brick the server
- Most automated failures auto-recover without operator intervention
- User-uploaded files bypass self-healing (user responsibility)
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded
---
## Non-Goals (Phase 1)
- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot UI
- No automatic dependency analysis
- No virus/malware scanning of user uploads
- No retention policy for `mods-removed/`
---
## Future Extensions (Phase 2+)
- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Provenance for Modrinth installs
- Retention automation for `mods-removed/`
---
## Final Decision Lock
| Decision | Value |
|----------|-------|
| Upload model | Direct runtime write, atomic via `os.Rename()` |
| Staging | None |
| Symlinks | None |
| Shadow copies | Single, automated installs only |
| Snapshots | Single, temporary, automated installs only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Provenance | `.zlh_metadata.json` at runtime root |
| User upload self-healing | Not applicable — user responsibility |