# ZeroLagHub – Mod Deployment Safety & Self-Healing Specification

**Version:** 1.0
**Phase:** Phase 1 Core Stability
**Applies To:** Game Containers
**Owner:** Agent Layer

---

## 1. Purpose

Define the deterministic, self-healing deployment model for:

- Modrinth-based automated mod installs
- Dev artifact promotion installs

The system must:

- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat

> This system is **not** a backup solution. It is a **deployment safety mechanism**.

---

## 2. Scope

### Included in Safety Snapshot

- `/mods`
- `/config`
- `server.properties`

### Explicitly Excluded

- `/world`
- `/logs`
- Cache directories
- Runtime temp files

World backup is handled by a separate backup system.

---

## 3. Core Concepts

### 3.1 Shadow Copy (File-Level Protection)

Used for fast rollback of a single mod change.

- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window

**Example:**

```
mods/
  coolmod.jar
  coolmod.jar.shadow
```

### 3.2 Snapshot (State-Level Protection)

Lightweight archive of mod/config state. Created before any mod deployment.

**Example path:**

```
.zlh_snapshots/deploy-.tar.gz
```

**Contains:**

- `mods/`
- `config/`
- `server.properties`

Only one active snapshot exists at a time.

---

## 4. Deployment State Machine

### 4.1 States

```
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
```

### 4.2 Deployment Flow

**Step 1 – Stop Server**

- Ensure clean shutdown
- Reset crash counters

**Step 2 – Create Snapshot**

- Archive `mods/`, `config/`, `server.properties`
- Mark snapshot status = `pending`

**Step 3 – Prepare Shadow** *(if replacing existing mod)*

- If target mod exists, rename:

  ```
  coolmod.jar → coolmod.jar.shadow
  ```

**Step 4 – Install New Mod**

- Validate SHA256 (if Modrinth source)
- Write file to `/mods`
- Set deployment metadata:
  - `lastChangedMod`
  - `changeTimestamp`
  - `changeSource`

**Step 5 – Start Server**

- Transition to `STABILIZING` state
- Begin stabilization timer

---

## 5. Stabilization Window

### Duration

Hardcoded Phase 1 value: **3–5 minutes**

### Stability Conditions

Server is considered stable when:

- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window

### On Stability Confirmed

- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: `STABLE` → `IDLE`

---

## 6. Failure Detection

### 6.1 Early Boot Crash

**Condition:** Process exits within X seconds (e.g., 30s) during stabilization window

**Action:** Transition to `ROLLBACK_FILE`

### 6.2 Crash Loop

**Condition:** ≥ N crashes (e.g., 3) within stabilization window

**Action:** Transition to `ROLLBACK_SNAPSHOT`

### 6.3 Readiness Timeout

**Condition:** Server fails readiness probe for the entire stabilization window

**Action:** Transition to `ROLLBACK_SNAPSHOT`

---

## 7. Recovery Logic

### 7.1 File-Level Rollback (`ROLLBACK_FILE`)

**Conditions:**

- `lastChangedMod` exists
- Shadow copy exists

**Action:**

1. Stop server
2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
3. Start server
4. Monitor again

**If stable:**

- Delete snapshot
- Delete shadow
- Return to `IDLE`

**If still unstable:**

- Escalate to snapshot restore

### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)

**Conditions:**

- Snapshot exists
- Snapshot has not already been restored

**Action:**

1. Stop server
2. Extract snapshot archive
3. Start server
4. Mark snapshot as `restored`

**If stable:**

- Delete snapshot
- Clear metadata
- Return to `IDLE`

**If still unstable:**

- Transition to `FAILED_RECOVERY`
- Stop all automation
- Surface failure state to API and UI

---

## 8. Safety Constraints

- Never attempt infinite recovery loops
- Only **one** file rollback attempt allowed
- Only **one** snapshot restore attempt allowed
- If both fail → mark `FAILED_RECOVERY`
- No world data is touched under any circumstance
- No multiple snapshot retention
- Only one snapshot exists during the stabilization window

---

## 9. Metadata Tracking (Agent-Level)

Track the following fields (cleared after stabilization success):

```
lastChangedMod
lastChangeTimestamp
lastChangeSource      # modrinth | dev
deploymentState
crashCount
snapshotId
```

---

## 10. Logging Requirements

Structured logs must emit events for:

- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed

All events must be visible via agent logs and surfaced through API status endpoints.

---

## 11. Non-Goals (Phase 1)

- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis

---

## 12. Operational Guarantees

This system guarantees:

- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded

---

## 13. Future Extensions (Phase 2+)

- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution

---

## Final Decision Lock

| Decision | Value |
|----------|-------|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Autonomy | Agent-level, no operator required |

---

## Current Mod Lifecycle Model (2026-02)

This section reflects the **implemented Phase 1 state** as of February 2026.

**Install:** Modrinth resolver → API → Agent → Verified download → `/mods`

**Enable/Disable:** Filesystem rename: `.jar` ↔ `.jar.disabled`

**Delete:** Soft delete to `/mods-removed` (no auto-purge, no retention policy)

**Filesystem is canonical state.** Agent cache invalidated after every mutation.

Restore of deleted mods handled manually via file browser (planned; see `OPEN_THREADS.md`).

No retention policy implemented. No install queue. No DB tracking of mod state.
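
The enable/disable and soft-delete operations in the lifecycle model can be illustrated with a minimal Python sketch. This is not the agent's actual implementation. The function names (`disable_mod`, `enable_mod`, `soft_delete_mod`, `list_mods`), the assumption that `mods` and `mods-removed` sit side by side under one server root, and the choice to refuse overwriting an existing file in `mods-removed` are all hypothetical. The source specifies only the rename convention, the soft-delete target directory, and that the filesystem is the canonical state.

```python
from pathlib import Path
import shutil

# From the spec: enable/disable is a rename between .jar and .jar.disabled.
DISABLED_SUFFIX = ".disabled"


def disable_mod(mods_dir: Path, jar_name: str) -> Path:
    """Rename e.g. coolmod.jar -> coolmod.jar.disabled."""
    src = mods_dir / jar_name
    dst = mods_dir / (jar_name + DISABLED_SUFFIX)
    src.rename(dst)
    return dst


def enable_mod(mods_dir: Path, jar_name: str) -> Path:
    """Rename e.g. coolmod.jar.disabled -> coolmod.jar."""
    src = mods_dir / (jar_name + DISABLED_SUFFIX)
    dst = mods_dir / jar_name
    src.rename(dst)
    return dst


def soft_delete_mod(server_root: Path, jar_name: str) -> Path:
    """Move a mod from mods/ to mods-removed/ instead of deleting it."""
    removed_dir = server_root / "mods-removed"
    removed_dir.mkdir(exist_ok=True)
    src = server_root / "mods" / jar_name
    dst = removed_dir / jar_name
    # Assumption: the source does not define collision handling, so this
    # sketch refuses to overwrite a previously removed file of the same name.
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    shutil.move(str(src), str(dst))
    return dst


def list_mods(mods_dir: Path) -> dict[str, bool]:
    """Derive mod state from the filesystem only: jar name -> enabled."""
    state: dict[str, bool] = {}
    for entry in sorted(mods_dir.iterdir()):
        name = entry.name
        if name.endswith(".jar" + DISABLED_SUFFIX):
            state[name[: -len(DISABLED_SUFFIX)]] = False
        elif name.endswith(".jar"):
            state[name] = True
    return state
```

Because `list_mods` re-reads the directory on every call, it reflects the "filesystem is canonical state" rule: any cache kept by a caller would need to be invalidated after each of the mutating functions, as the lifecycle model describes.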