diff --git a/docs/architecture/mod-deployment-safety.md b/docs/architecture/mod-deployment-safety.md
index e5c634d..e820c78 100644
--- a/docs/architecture/mod-deployment-safety.md
+++ b/docs/architecture/mod-deployment-safety.md
@@ -1,238 +1,216 @@
-# ZeroLagHub – Mod Deployment Safety & Self-Healing Specification
+# Mod Deployment Safety Model

-**Version:** 1.0
-**Phase:** Phase 1 Core Stability
+**Version:** 2.0
+**Updated:** 2026-03-01
+**Phase:** Phase 1 — Active
 **Applies To:** Game Containers
 **Owner:** Agent Layer

 ---

-## 1. Purpose
+## Goal

-Define the deterministic, self-healing deployment model for:
+Allow user uploads and automated mod installs while:

-- Modrinth-based automated mod installs
-- Dev artifact promotion installs
-
-The system must:
-
-- Prevent catastrophic mod deployment failures
-- Automatically recover from crash loops
-- Avoid operator intervention
-- Preserve world/player data
-- Avoid state explosion and disk bloat
-
-> This system is **not** a backup solution. It is a **deployment safety mechanism**.
+- Preventing path escape
+- Preserving runtime integrity
+- Tracking provenance
+- Avoiding hidden deployment layers

 ---

-## 2. Scope
+## Deployment Types

-### Included in Safety Snapshot
-
-- `/mods`
-- `/config`
-- `server.properties`
-
-### Explicitly Excluded
-
-- `/world`
-- `/logs`
-- Cache directories
-- Runtime temp files
-
-World backup is handled by a separate backup system.
+| Type | Path | Extension |
+|------|------|-----------|
+| Mod (Modrinth) | `mods/.jar` | `.jar` |
+| Mod (user upload) | `mods/.jar` | `.jar` |
+| Datapack | `world/datapacks/.zip` | `.zip` |

 ---

-## 3. Core Concepts
+## Direct Runtime Writes

-### 3.1 Shadow Copy (File-Level Protection)
+Uploads and automated installs are written directly into runtime directories.

-Used for fast rollback of a single mod change.
+No:
+- Staging
+- Symlinking
+- Delayed deployment

-- Only applies to the most recently modified mod
-- Only one shadow exists at any time
-- Shadow lifetime is limited to the stabilization window
+---

-**Example:**
+## Atomic Write Process

 ```
-mods/
-  coolmod.jar
-  coolmod.jar.shadow
+1. Create temp file in target directory
+2. Stream multipart body (or downloaded artifact) into temp file
+3. Enforce size limit while streaming
+4. `os.Rename()` temp → final filename
+5. Update `.zlh_metadata.json`
+
+If `overwrite=false` and file exists → `409`
+
+---
+
+## Size Limits
+
+Configured in agent:
+
+| Type | Limit |
+|------|-------|
+| Mods | 250MB |
+| Datapacks | 100MB |
+
+Enforced server-side during streaming. Not at the API layer.
+
+---
+
+## Overwrite Behavior
+
+- `overwrite=false` (default) → reject if file exists → `409`
+- `overwrite=true` → replace file, update `uploaded_at` in metadata
+
+---
+
+## Agent-Side Guarantees
+
+The agent guarantees:
+
+- Final path is a file (not directory)
+- Target is not a symlink
+- Path resolves within runtime root
+- Parent directory exists
+- `.zlh_metadata.json` is hidden from file listing API
+- `.zlh-shadow` is hidden from file listing API
+
+---
+
+## Provenance Metadata
+
+All user uploads are marked `source: "user"` in `.zlh_metadata.json`.
+
+```json
+{
+  "mods/sodium.jar": {
+    "source": "user",
+    "uploaded_at": "2026-03-01T22:37:01Z"
+  }
+}
 ```

-### 3.2 Snapshot (State-Level Protection)
+Modrinth-installed mods do not currently write provenance. Future curated installs may optionally write `source: "curated"`. Not currently implemented.

-Lightweight archive of mod/config state. Created before any mod deployment.
+No automatic inference of source from filename or path.
-**Example path:**
+---

-```
-.zlh_snapshots/deploy-.tar.gz
+## Failure Categories
+
+| Code | Meaning |
+|------|---------|
+| `409` | File exists (`overwrite=false`) |
+| `413` | Size limit exceeded |
+| `403` | Path rejected by allowlist |
+| `502` | API transport failure (not agent validation) |
+
+---
+
+## API Transport Considerations
+
+The API acts strictly as a streaming proxy for uploads.
+
+```js
+req.pipe(proxyReq)
+proxyRes.pipe(res)
 ```

-**Contains:**
+- Uses raw `http.request` piping — not `fetch`
+- Does not buffer file contents
+- Does not re-validate upload policy
+- Upload timeout must be substantially larger than normal routes
+---
+
+## Mod Lifecycle (Current Phase 1 State)
+
+**Install (Modrinth):**
+Modrinth resolver → API → Agent → Verified download → `/mods`
+
+**Install (User Upload):**
+Portal → API (streaming proxy) → Agent → Atomic write → `/mods`
+
+**Enable/Disable:**
+Filesystem rename: `.jar` ↔ `.jar.disabled`
+
+**Delete:**
+Soft delete to `/mods-removed` (no auto-purge, no retention policy)
+
+Filesystem is canonical state. Agent cache invalidated after every mutation.
+
+Restore of deleted mods handled manually via file browser (see `OPEN_THREADS.md`).
+
+---
+
+## Self-Healing Model (Automated Installs)
+
+Applies to Modrinth installs and dev artifact promotions. Does **not** apply to user uploads — users assume responsibility for files they upload directly.
+
+### Deployment States
+
+```
+IDLE → DEPLOYING → STABILIZING → STABLE → IDLE
+                        ↓
+              ROLLBACK_FILE → ROLLBACK_SNAPSHOT → FAILED_RECOVERY
+```
+
+### Snapshot Scope
+
+Included:

 - `mods/`
 - `config/`
 - `server.properties`

-Only one active snapshot exists at a time.
+Excluded:
+- `world/` (separate backup system)
+- `logs/`
+- Cache and temp files
+
+### Stabilization Window
+
+Hardcoded Phase 1: **3–5 minutes**
+
+Server is stable when:
+- Readiness probe succeeds
+- No crash restart during the window
+- Server runs continuously for full window duration
+
+### Failure Triggers
+
+| Trigger | Action |
+|---------|--------|
+| Early boot crash (exit within ~30s) | → `ROLLBACK_FILE` |
+| Crash loop (≥3 crashes in window) | → `ROLLBACK_SNAPSHOT` |
+| Readiness timeout (full window) | → `ROLLBACK_SNAPSHOT` |
+
+### Recovery
+
+**File rollback:** Restore `.jar.shadow` → `.jar`, monitor again. If stable → `IDLE`. If not → escalate.
+
+**Snapshot restore:** Extract `.zlh_snapshots/deploy-.tar.gz`, monitor. If stable → `IDLE`. If not → `FAILED_RECOVERY`.
+
+### Safety Constraints
+
+- Only one file rollback attempt
+- Only one snapshot restore attempt
+- Both fail → `FAILED_RECOVERY`, stop automation, surface to API + UI
+- World data never touched
+- Only one active snapshot at a time

 ---

-## 4. Deployment State Machine
+## Metadata Tracking (Automated Installs)

-### 4.1 States
-
-```
-IDLE
-DEPLOYING
-STABILIZING
-STABLE
-ROLLBACK_FILE
-ROLLBACK_SNAPSHOT
-FAILED_RECOVERY
-```
-
-### 4.2 Deployment Flow
-
-**Step 1 – Stop Server**
-- Ensure clean shutdown
-- Reset crash counters
-
-**Step 2 – Create Snapshot**
-- Archive `mods/`, `config/`, `server.properties`
-- Mark snapshot status = `pending`
-
-**Step 3 – Prepare Shadow** *(if replacing existing mod)*
-- If target mod exists, rename:
-  ```
-  coolmod.jar → coolmod.jar.shadow
-  ```
-
-**Step 4 – Install New Mod**
-- Validate SHA256 (if Modrinth source)
-- Write file to `/mods`
-- Set deployment metadata:
-  - `lastChangedMod`
-  - `changeTimestamp`
-  - `changeSource`
-
-**Step 5 – Start Server**
-- Transition to `STABILIZING` state
-- Begin stabilization timer
-
----
-
-## 5. Stabilization Window
-
-### Duration
-
-Hardcoded Phase 1 value: **3–5 minutes**
-
-### Stability Conditions
-
-Server is considered stable when:
-
-- Readiness probe returns success
-- Server remains running continuously for the full window duration
-- No crash restart occurred during the window
-
-### On Stability Confirmed
-
-- Delete snapshot
-- Delete shadow copy
-- Clear deployment metadata
-- Transition: `STABLE` → `IDLE`
-
----
-
-## 6. Failure Detection
-
-### 6.1 Early Boot Crash
-
-**Condition:** Process exits within X seconds (e.g., 30s) during stabilization window
-
-**Action:** Transition to `ROLLBACK_FILE`
-
-### 6.2 Crash Loop
-
-**Condition:** ≥ N crashes (e.g., 3) within stabilization window
-
-**Action:** Transition to `ROLLBACK_SNAPSHOT`
-
-### 6.3 Readiness Timeout
-
-**Condition:** Server fails readiness probe for the entire stabilization window
-
-**Action:** Transition to `ROLLBACK_SNAPSHOT`
-
----
-
-## 7. Recovery Logic
-
-### 7.1 File-Level Rollback (`ROLLBACK_FILE`)
-
-**Conditions:**
-- `lastChangedMod` exists
-- Shadow copy exists
-
-**Action:**
-1. Stop server
-2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
-3. Start server
-4. Monitor again
-
-**If stable:**
-- Delete snapshot
-- Delete shadow
-- Return to `IDLE`
-
-**If still unstable:**
-- Escalate to snapshot restore
-
-### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)
-
-**Conditions:**
-- Snapshot exists
-- Snapshot has not already been restored
-
-**Action:**
-1. Stop server
-2. Extract snapshot archive
-3. Start server
-4. Mark snapshot as `restored`
-
-**If stable:**
-- Delete snapshot
-- Clear metadata
-- Return to `IDLE`
-
-**If still unstable:**
-- Transition to `FAILED_RECOVERY`
-- Stop all automation
-- Surface failure state to API and UI
-
----
-
-## 8. Safety Constraints
-
-- Never attempt infinite recovery loops
-- Only **one** file rollback attempt allowed
-- Only **one** snapshot restore attempt allowed
-- If both fail → mark `FAILED_RECOVERY`
-- No world data is touched under any circumstance
-- No multiple snapshot retention
-- Only one snapshot exists during the stabilization window
-
----
-
-## 9. Metadata Tracking (Agent-Level)
-
-Track the following fields (cleared after stabilization success):
+Agent-level fields, cleared after stabilization success:

 ```
 lastChangedMod
@@ -245,9 +223,22 @@ snapshotId

 ---

-## 10. Logging Requirements
+## Why No Symlinks

-Structured logs must emit events for:
+Symlink-based deployment was rejected because:
+
+- Complicates mod loader behavior
+- Breaks server-side loader compatibility
+- Introduces unexpected runtime indirection
+- Makes provenance ambiguous
+
+Direct writes + metadata is simpler and safer.
+
+---
+
+## Logging Requirements
+
+Structured logs must emit:

 - Deployment started
 - Snapshot created
@@ -258,41 +249,45 @@ Structured logs must emit events for:
 - Snapshot restore triggered
 - Deployment stabilized
 - Recovery failed
+- User upload received
+- User upload rejected (with reason)

-All events must be visible via agent logs and surfaced through API status endpoints.
+All events visible via agent logs and surfaced through API status endpoints.

 ---

-## 11. Non-Goals (Phase 1)
+## Operational Guarantees

-- No multi-mod version history
-- No dependency resolution
-- No world snapshotting
-- No long-term snapshot retention
-- No manual snapshot management UI
-- No automatic dependency analysis
-
----
-
-## 12. Operational Guarantees
-
-This system guarantees:
-
-- A broken mod will not permanently brick the server
-- Most failures will auto-recover without operator intervention
+- A broken automated mod install will not permanently brick the server
+- Most automated failures auto-recover without operator intervention
+- User-uploaded files bypass self-healing (user responsibility)
 - Deployment state will not accumulate disk artifacts
 - The system remains deterministic and bounded

 ---

-## 13. Future Extensions (Phase 2+)
+## Non-Goals (Phase 1)
+
+- No multi-mod version history
+- No dependency resolution
+- No world snapshotting
+- No long-term snapshot retention
+- No manual snapshot UI
+- No automatic dependency analysis
+- No virus/malware scanning of user uploads
+- No retention policy for `mods-removed/`
+
+---
+
+## Future Extensions (Phase 2+)

 - Snapshot retention policies
 - Manual snapshot restore via UI
 - Multi-mod change grouping
 - Deployment audit history
 - Version diffing
-- Dependency resolution
+- Provenance for Modrinth installs
+- Retention automation for `mods-removed/`

 ---

@@ -300,31 +295,13 @@ This system guarantees:

 | Decision | Value |
 |----------|-------|
-| Shadow copies | Single, file-level only |
-| Snapshots | Single, temporary only |
+| Upload model | Direct runtime write, atomic via `os.Rename()` |
+| Staging | None |
+| Symlinks | None |
+| Shadow copies | Single, automated installs only |
+| Snapshots | Single, temporary, automated installs only |
 | Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
 | World data | Excluded, never touched |
 | Stabilization window | Fixed, hardcoded Phase 1 |
-| Autonomy | Agent-level, no operator required |
-
----
-
-## Current Mod Lifecycle Model (2026-02)
-
-This section reflects the **implemented Phase 1 state** as of February 2026.
-
-**Install:**
-Modrinth resolver → API → Agent → Verified download → `/mods`
-
-**Enable/Disable:**
-Filesystem rename: `.jar` ↔ `.jar.disabled`
-
-**Delete:**
-Soft delete to `/mods-removed` (no auto-purge, no retention policy)
-
-**Filesystem is canonical state.**
-Agent cache invalidated after every mutation.
-
-Restore of deleted mods handled manually via file browser (planned — see `OPEN_THREADS.md`).
-
-No retention policy implemented. No install queue. No DB tracking of mod state.
+| Provenance | `.zlh_metadata.json` at runtime root |
+| User upload self-healing | Not applicable — user responsibility |