From c926cad4ffb9123e8d1e77e9b265cb428b64e9ba Mon Sep 17 00:00:00 2001
From: jester
Date: Mon, 23 Feb 2026 00:12:01 +0000
Subject: [PATCH] docs: add mod deployment safety & self-healing specification v1.0

---
 docs/architecture/mod-deployment-safety.md | 308 +++++++++++++++++++++
 1 file changed, 308 insertions(+)
 create mode 100644 docs/architecture/mod-deployment-safety.md

diff --git a/docs/architecture/mod-deployment-safety.md b/docs/architecture/mod-deployment-safety.md
new file mode 100644
index 0000000..164ad09
--- /dev/null
+++ b/docs/architecture/mod-deployment-safety.md
@@ -0,0 +1,308 @@
+# ZeroLagHub – Mod Deployment Safety & Self-Healing Specification
+
+**Version:** 1.0
+**Phase:** Phase 1 Core Stability
+**Applies To:** Game Containers
+**Owner:** Agent Layer
+
+---
+
+## 1. Purpose
+
+Define the deterministic, self-healing deployment model for:
+
+- Modrinth-based automated mod installs
+- Dev artifact promotion installs
+
+The system must:
+
+- Prevent catastrophic mod deployment failures
+- Automatically recover from crash loops
+- Avoid operator intervention
+- Preserve world/player data
+- Avoid state explosion and disk bloat
+
+> This system is **not** a backup solution. It is a **deployment safety mechanism**.
+
+---
+
+## 2. Scope
+
+### Included in Safety Snapshot
+
+- `/mods`
+- `/config`
+- `server.properties`
+
+### Explicitly Excluded
+
+- `/world`
+- `/logs`
+- Cache directories
+- Runtime temp files
+
+World backup is handled by a separate backup system.
+
+---
+
+## 3. Core Concepts
+
+### 3.1 Shadow Copy (File-Level Protection)
+
+Used for fast rollback of a single mod change.
+
+- Only applies to the most recently modified mod
+- Only one shadow exists at any time
+- Shadow lifetime is limited to the stabilization window
+
+**Example:**
+
+```
+mods/
+  coolmod.jar
+  coolmod.jar.shadow
+```
+
+### 3.2 Snapshot (State-Level Protection)
+
+Lightweight archive of mod/config state. Created before any mod deployment.
+
+**Example path:**
+
+```
+.zlh_snapshots/deploy-.tar.gz
+```
+
+**Contains:**
+
+- `mods/`
+- `config/`
+- `server.properties`
+
+Only one active snapshot exists at a time.
+
+---
+
+## 4. Deployment State Machine
+
+### 4.1 States
+
+```
+IDLE
+DEPLOYING
+STABILIZING
+STABLE
+ROLLBACK_FILE
+ROLLBACK_SNAPSHOT
+FAILED_RECOVERY
+```
+
+### 4.2 Deployment Flow
+
+**Step 1 – Stop Server**
+- Ensure clean shutdown
+- Reset crash counters
+
+**Step 2 – Create Snapshot**
+- Archive `mods/`, `config/`, `server.properties`
+- Mark snapshot status = `pending`
+
+**Step 3 – Prepare Shadow** *(if replacing existing mod)*
+- If target mod exists, rename:
+  ```
+  coolmod.jar → coolmod.jar.shadow
+  ```
+
+**Step 4 – Install New Mod**
+- Validate SHA256 (if Modrinth source)
+- Write file to `/mods`
+- Set deployment metadata:
+  - `lastChangedMod`
+  - `lastChangeTimestamp`
+  - `lastChangeSource`
+
+**Step 5 – Start Server**
+- Transition to `STABILIZING` state
+- Begin stabilization timer
+
+---
+
+## 5. Stabilization Window
+
+### Duration
+
+Hardcoded Phase 1 value: **3–5 minutes**
+
+### Stability Conditions
+
+Server is considered stable when:
+
+- Readiness probe returns success
+- Server remains running continuously for the full window duration
+- No crash restart occurred during the window
+
+### On Stability Confirmed
+
+- Delete snapshot
+- Delete shadow copy
+- Clear deployment metadata
+- Transition: `STABLE` → `IDLE`
+
+---
+
+## 6. Failure Detection
+
+### 6.1 Early Boot Crash
+
+**Condition:** Process exits within X seconds (e.g., 30s) during the stabilization window
+
+**Action:** Transition to `ROLLBACK_FILE`
+
+### 6.2 Crash Loop
+
+**Condition:** ≥ N crashes (e.g., 3) within the stabilization window
+
+**Action:** Transition to `ROLLBACK_SNAPSHOT`
+
+### 6.3 Readiness Timeout
+
+**Condition:** Server fails readiness probe for the entire stabilization window
+
+**Action:** Transition to `ROLLBACK_SNAPSHOT`
+
+---
+
+## 7. Recovery Logic
+
+### 7.1 File-Level Rollback (`ROLLBACK_FILE`)
+
+**Conditions:**
+- `lastChangedMod` exists
+- Shadow copy exists
+
+**Action:**
+1. Stop server
+2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
+3. Start server
+4. Monitor again
+
+**If stable:**
+- Delete snapshot
+- Delete shadow
+- Return to `IDLE`
+
+**If still unstable:**
+- Escalate to snapshot restore
+
+### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)
+
+**Conditions:**
+- Snapshot exists
+- Snapshot has not already been restored
+
+**Action:**
+1. Stop server
+2. Extract snapshot archive
+3. Start server
+4. Mark snapshot as `restored`
+
+**If stable:**
+- Delete snapshot
+- Clear metadata
+- Return to `IDLE`
+
+**If still unstable:**
+- Transition to `FAILED_RECOVERY`
+- Stop all automation
+- Surface failure state to API and UI
+
+---
+
+## 8. Safety Constraints
+
+- Never attempt infinite recovery loops
+- Only **one** file rollback attempt allowed
+- Only **one** snapshot restore attempt allowed
+- If both fail → mark `FAILED_RECOVERY`
+- No world data is touched under any circumstance
+- No retention of multiple snapshots
+- Only one snapshot exists during the stabilization window
+
+---
+
+## 9. Metadata Tracking (Agent-Level)
+
+Track the following fields (cleared after stabilization success):
+
+```
+lastChangedMod
+lastChangeTimestamp
+lastChangeSource # modrinth | dev
+deploymentState
+crashCount
+snapshotId
+```
+
+---
+
+## 10. Logging Requirements
+
+Structured logs must emit events for:
+
+- Deployment started
+- Snapshot created
+- Shadow created
+- Stabilization started
+- Crash detected
+- File rollback triggered
+- Snapshot restore triggered
+- Deployment stabilized
+- Recovery failed
+
+All events must be visible via agent logs and surfaced through API status endpoints.
+
+---
+
+## 11. Non-Goals (Phase 1)
+
+- No multi-mod version history
+- No dependency resolution
+- No world snapshotting
+- No long-term snapshot retention
+- No manual snapshot management UI
+- No automatic dependency analysis
+
+---
+
+## 12. Operational Guarantees
+
+This system guarantees:
+
+- A broken mod will not permanently brick the server
+- Most failures will auto-recover without operator intervention
+- Deployment state will not accumulate disk artifacts
+- The system remains deterministic and bounded
+
+---
+
+## 13. Future Extensions (Phase 2+)
+
+- Snapshot retention policies
+- Manual snapshot restore via UI
+- Multi-mod change grouping
+- Deployment audit history
+- Version diffing
+- Dependency resolution
+
+---
+
+## Final Decision Lock
+
+| Decision | Value |
+|----------|-------|
+| Shadow copies | Single, file-level only |
+| Snapshots | Single, temporary only |
+| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
+| World data | Excluded, never touched |
+| Stabilization window | Fixed, hardcoded Phase 1 |
+| Autonomy | Agent-level, no operator required |
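
Review note: the escalation model locked in by the spec (file rollback, then snapshot restore, then `FAILED_RECOVERY`) can be sketched as a small bounded state machine. This is a minimal illustration and not the agent's actual implementation. The names `Event`, `Deployment`, and `advance` are hypothetical. The sketch also assumes that an early boot crash with no shadow copy available falls through to snapshot restore, which the spec does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Deployment states, as listed in section 4.1 of the spec."""
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


class Event(Enum):
    """Hypothetical outcomes of one stabilization window (sections 5 and 6)."""
    WINDOW_PASSED = "window_passed"
    EARLY_BOOT_CRASH = "early_boot_crash"
    CRASH_LOOP = "crash_loop"
    READINESS_TIMEOUT = "readiness_timeout"


@dataclass
class Deployment:
    """Hypothetical per-deployment bookkeeping kept by the agent."""
    has_shadow: bool
    has_snapshot: bool = True
    file_rollback_used: bool = False
    snapshot_restored: bool = False
    state: State = State.STABILIZING


def advance(d: Deployment, event: Event) -> State:
    """Apply one window outcome and return the resulting state.

    Each recovery step is attempted at most once, so the machine always
    terminates in STABLE or FAILED_RECOVERY.
    """
    if d.state is State.FAILED_RECOVERY:
        # Automation is stopped; no further transitions are taken.
        return d.state

    if event is Event.WINDOW_PASSED:
        d.state = State.STABLE
        return d.state

    # File rollback applies only to an early boot crash, needs a shadow
    # copy, and may be attempted once, before any snapshot restore.
    can_file_rollback = (
        d.has_shadow and not d.file_rollback_used and not d.snapshot_restored
    )
    if event is Event.EARLY_BOOT_CRASH and can_file_rollback:
        d.file_rollback_used = True
        d.state = State.ROLLBACK_FILE
    elif d.has_snapshot and not d.snapshot_restored:
        # Assumption: any other failure, or a repeat failure after file
        # rollback, escalates to the single allowed snapshot restore.
        d.snapshot_restored = True
        d.state = State.ROLLBACK_SNAPSHOT
    else:
        d.state = State.FAILED_RECOVERY
    return d.state


if __name__ == "__main__":
    deployment = Deployment(has_shadow=True)
    for outcome in (Event.EARLY_BOOT_CRASH, Event.CRASH_LOOP, Event.CRASH_LOOP):
        print(outcome.name, "->", advance(deployment, outcome).name)
```

Keeping the "already attempted" flags on the deployment record, instead of inferring them from the current state, is what enforces the section 8 constraint that each recovery step runs at most once.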