zlh-grind/docs/architecture/mod-deployment-safety.md

ZeroLagHub Mod Deployment Safety & Self-Healing Specification

Version: 1.0
Phase: Phase 1 Core Stability
Applies To: Game Containers
Owner: Agent Layer


1. Purpose

Define the deterministic, self-healing deployment model for:

  • Modrinth-based automated mod installs
  • Dev artifact promotion installs

The system must:

  • Prevent catastrophic mod deployment failures
  • Automatically recover from crash loops
  • Avoid operator intervention
  • Preserve world/player data
  • Avoid state explosion and disk bloat

This system is not a backup solution. It is a deployment safety mechanism.


2. Scope

Included in Safety Snapshot

  • /mods
  • /config
  • server.properties

Explicitly Excluded

  • /world
  • /logs
  • Cache directories
  • Runtime temp files

World backup is handled by a separate backup system.


3. Core Concepts

3.1 Shadow Copy (File-Level Protection)

Used for fast rollback of a single mod change.

  • Only applies to the most recently modified mod
  • Only one shadow exists at any time
  • Shadow lifetime is limited to the stabilization window

Example:

mods/
  coolmod.jar
  coolmod.jar.shadow
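A minimal sketch of how the shadow rename could be done, assuming the agent works directly on the mods directory. The function name `prepare_shadow` and the decision to raise when a shadow is already present are illustrative assumptions, not part of this specification.

```python
from pathlib import Path
from typing import Optional

SHADOW_SUFFIX = ".shadow"  # suffix taken from the example above


def prepare_shadow(mods_dir: Path, mod_name: str) -> Optional[Path]:
    """Rename an existing mod to its shadow; return the shadow path, or None."""
    target = mods_dir / mod_name
    if not target.exists():
        return None  # fresh install: there is no previous file to protect
    # Assumption: refuse to stack shadows so that only one exists at any time.
    existing = list(mods_dir.glob(f"*{SHADOW_SUFFIX}"))
    if existing:
        raise RuntimeError(f"shadow already present: {existing[0].name}")
    shadow = target.with_name(target.name + SHADOW_SUFFIX)
    target.rename(shadow)
    return shadow
```

Calling `prepare_shadow(Path("mods"), "coolmod.jar")` on the example layout would leave only `coolmod.jar.shadow` in place until the new file is written.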

3.2 Snapshot (State-Level Protection)

Lightweight archive of mod/config state. Created before any mod deployment.

Example path:

.zlh_snapshots/deploy-<timestamp>.tar.gz

Contains:

  • mods/
  • config/
  • server.properties

Only one active snapshot exists at a time.
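Snapshot creation could be sketched as follows. This assumes `.zlh_snapshots` lives under the server root and that `<timestamp>` is epoch seconds; neither is stated above. The name `create_snapshot` and the refusal to overwrite an existing snapshot are likewise assumptions.

```python
import tarfile
import time
from pathlib import Path

SNAPSHOT_DIR = ".zlh_snapshots"
INCLUDED = ("mods", "config", "server.properties")  # world/ and logs/ are never added


def create_snapshot(server_root: Path) -> Path:
    """Archive mod/config state into a single deploy-<timestamp>.tar.gz."""
    snap_dir = server_root / SNAPSHOT_DIR
    snap_dir.mkdir(exist_ok=True)
    # Assumption: an existing snapshot means a deployment is still in flight.
    if any(snap_dir.glob("deploy-*.tar.gz")):
        raise FileExistsError("an active snapshot already exists")
    archive = snap_dir / f"deploy-{int(time.time())}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for name in INCLUDED:
            path = server_root / name
            if path.exists():
                tar.add(str(path), arcname=name)  # directories are added recursively
    return archive
```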


4. Deployment State Machine

4.1 States

IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
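One way to encode these states is an enum with an explicit transition table. The table below is inferred from Sections 4 through 7 of this document rather than stated in one place, so treat it as an assumption; `DeployState`, `ALLOWED`, and `transition` are illustrative names.

```python
from enum import Enum


class DeployState(Enum):
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


S = DeployState
# Inferred transition table; FAILED_RECOVERY is terminal (automation stops).
ALLOWED = {
    S.IDLE: {S.DEPLOYING},
    S.DEPLOYING: {S.STABILIZING},
    S.STABILIZING: {S.STABLE, S.ROLLBACK_FILE, S.ROLLBACK_SNAPSHOT},
    S.STABLE: {S.IDLE},
    S.ROLLBACK_FILE: {S.IDLE, S.ROLLBACK_SNAPSHOT},
    S.ROLLBACK_SNAPSHOT: {S.IDLE, S.FAILED_RECOVERY},
    S.FAILED_RECOVERY: set(),
}


def transition(current: DeployState, target: DeployState) -> DeployState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the table explicit makes the bounded behavior auditable: no path leads out of `FAILED_RECOVERY`, and no path revisits a rollback state.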

4.2 Deployment Flow

Step 1 Stop Server

  • Ensure clean shutdown
  • Reset crash counters

Step 2 Create Snapshot

  • Archive mods/, config/, server.properties
  • Mark snapshot status = pending

Step 3 Prepare Shadow (if replacing existing mod)

  • If target mod exists, rename:
    coolmod.jar → coolmod.jar.shadow
    

Step 4 Install New Mod

  • Validate SHA256 (if Modrinth source)
  • Write file to /mods
  • Set deployment metadata:
    • lastChangedMod
    • lastChangeTimestamp
    • lastChangeSource

Step 5 Start Server

  • Transition to STABILIZING state
  • Begin stabilization timer
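The SHA256 validation in Step 4 might look like the sketch below. It assumes the expected digest is supplied by the caller as a hex string; how that digest is obtained is outside this example, and `sha256_matches` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str) -> bool:
    """Hash the file in chunks and compare against the expected hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read 1 MiB at a time so large jars are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

Under this sketch, a mismatch would be handled before the file is written to /mods, so a corrupted download never reaches the server.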

5. Stabilization Window

Duration

Hardcoded Phase 1 value: 35 minutes

Stability Conditions

Server is considered stable when:

  • Readiness probe returns success
  • Server remains running continuously for the full window duration
  • No crash restart occurred during the window

On Stability Confirmed

  • Delete snapshot
  • Delete shadow copy
  • Clear deployment metadata
  • Transition: STABLE → IDLE
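The three stability conditions can be expressed as a single predicate. This is a sketch only: `WindowObservation` and its fields are hypothetical names for whatever the agent actually measures.

```python
from dataclasses import dataclass

STABILIZATION_WINDOW_S = 35 * 60  # hardcoded Phase 1 value from this section


@dataclass
class WindowObservation:
    ready: bool                 # readiness probe currently succeeds
    continuous_uptime_s: float  # seconds since the last process start
    crash_restarts: int         # crash restarts seen during the window


def is_stable(obs: WindowObservation, window_s: float = STABILIZATION_WINDOW_S) -> bool:
    """All three stability conditions must hold at the same time."""
    return (
        obs.ready
        and obs.continuous_uptime_s >= window_s
        and obs.crash_restarts == 0
    )
```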

6. Failure Detection

6.1 Early Boot Crash

Condition: Process exits within X seconds (e.g., 30s) during stabilization window

Action: Transition to ROLLBACK_FILE

6.2 Crash Loop

Condition: ≥ N crashes (e.g., 3) within stabilization window

Action: Transition to ROLLBACK_SNAPSHOT

6.3 Readiness Timeout

Condition: Server fails readiness probe for the entire stabilization window

Action: Transition to ROLLBACK_SNAPSHOT
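The three detection rules can be sketched as one classifier. The thresholds use the example values above (X = 30s, N = 3). Checking the crash-loop rule before the early-boot rule is an assumption made here so that repeated early crashes escalate; this section does not define a precedence. All names are illustrative.

```python
from enum import Enum
from typing import Optional

EARLY_BOOT_S = 30    # "X" in 6.1, example value
CRASH_LOOP_N = 3     # "N" in 6.2, example value
WINDOW_S = 35 * 60   # stabilization window


class Rollback(Enum):
    FILE = "ROLLBACK_FILE"
    SNAPSHOT = "ROLLBACK_SNAPSHOT"


def classify(
    crash_count: int,
    last_exit_after_s: Optional[float],  # runtime of the last crashed process, if any
    ready_seen: bool,                    # readiness probe succeeded at least once
    elapsed_s: float,                    # time since stabilization began
) -> Optional[Rollback]:
    """Map observations to a rollback action, or None if no rule fires."""
    if crash_count >= CRASH_LOOP_N:
        return Rollback.SNAPSHOT                      # 6.2 crash loop
    if last_exit_after_s is not None and last_exit_after_s <= EARLY_BOOT_S:
        return Rollback.FILE                          # 6.1 early boot crash
    if not ready_seen and elapsed_s >= WINDOW_S:
        return Rollback.SNAPSHOT                      # 6.3 readiness timeout
    return None
```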


7. Recovery Logic

7.1 File-Level Rollback (ROLLBACK_FILE)

Conditions:

  • lastChangedMod exists
  • Shadow copy exists

Action:

  1. Stop server
  2. Replace: coolmod.jar.shadow → coolmod.jar
  3. Start server
  4. Monitor again

If stable:

  • Delete snapshot
  • Delete shadow
  • Return to IDLE

If still unstable:

  • Escalate to snapshot restore
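The file swap in step 2 of the file-level rollback could be sketched like this. Stopping and starting the server are left out, and `restore_shadow` is a hypothetical helper name.

```python
from pathlib import Path


def restore_shadow(mods_dir: Path, mod_name: str) -> bool:
    """Move <mod>.shadow back over <mod>; return False if no shadow exists."""
    target = mods_dir / mod_name
    shadow = target.with_name(target.name + ".shadow")
    if not shadow.exists():
        return False  # precondition not met: caller should escalate
    # Path.replace overwrites the newly installed file with the previous one.
    shadow.replace(target)
    return True
```

A `False` return corresponds to the "Shadow copy exists" condition failing, in which case escalation to snapshot restore is the only remaining option.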

7.2 Snapshot Restore (ROLLBACK_SNAPSHOT)

Conditions:

  • Snapshot exists
  • Snapshot has not already been restored

Action:

  1. Stop server
  2. Extract snapshot archive
  3. Start server
  4. Mark snapshot as restored

If stable:

  • Delete snapshot
  • Clear metadata
  • Return to IDLE

If still unstable:

  • Transition to FAILED_RECOVERY
  • Stop all automation
  • Surface failure state to API and UI
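The extraction step of the snapshot restore might be sketched as below. It assumes that mods/ and config/ are removed before extraction so that files added after the snapshot do not survive the restore; this section does not say whether extraction overwrites or replaces. `restore_snapshot` is an illustrative name.

```python
import shutil
import tarfile
from pathlib import Path

RESTORED_DIRS = ("mods", "config")  # world/ is never touched


def restore_snapshot(server_root: Path, archive: Path) -> None:
    """Replace mods/ and config/ (and server.properties) from the archive."""
    # Assumption: clear the directories first so a newly added bad mod is removed.
    for name in RESTORED_DIRS:
        shutil.rmtree(str(server_root / name), ignore_errors=True)
    # Use the "data" extraction filter where the running Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(server_root), **kwargs)
```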

8. Safety Constraints

  • Never attempt infinite recovery loops
  • Only one file rollback attempt allowed
  • Only one snapshot restore attempt allowed
  • If both fail → mark FAILED_RECOVERY
  • No world data is touched under any circumstance
  • No retention of multiple snapshots
  • Only one snapshot exists during the stabilization window
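The one-attempt-each constraint can be made explicit with a small budget object. This is a sketch under the assumption that the agent asks for the next recovery step each time instability is detected; `RecoveryBudget` and `next_recovery_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    file_rollback_used: bool = False
    snapshot_restore_used: bool = False


def next_recovery_step(
    budget: RecoveryBudget, shadow_exists: bool, snapshot_exists: bool
) -> str:
    """Return the next state name; each recovery kind is granted at most once."""
    if not budget.file_rollback_used and shadow_exists:
        budget.file_rollback_used = True
        return "ROLLBACK_FILE"
    if not budget.snapshot_restore_used and snapshot_exists:
        budget.snapshot_restore_used = True
        return "ROLLBACK_SNAPSHOT"
    return "FAILED_RECOVERY"  # both attempts spent or unavailable
```

Because the flags only ever flip from False to True, at most two recovery attempts can occur per deployment, which is what rules out infinite recovery loops.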

9. Metadata Tracking (Agent-Level)

Track the following fields (cleared after stabilization success):

lastChangedMod
lastChangeTimestamp
lastChangeSource    # modrinth | dev
deploymentState
crashCount
snapshotId
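These fields could be held in a small record like the one below. The field names come from the list above; the types, the defaults, and the `clear` and `to_json` helpers are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeploymentMetadata:
    lastChangedMod: Optional[str] = None
    lastChangeTimestamp: Optional[str] = None
    lastChangeSource: Optional[str] = None  # "modrinth" | "dev"
    deploymentState: str = "IDLE"
    crashCount: int = 0
    snapshotId: Optional[str] = None

    def clear(self) -> None:
        """Reset every field to its default after stabilization success."""
        fresh = DeploymentMetadata()
        for key, value in asdict(fresh).items():
            setattr(self, key, value)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```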

10. Logging Requirements

Structured logs must emit events for:

  • Deployment started
  • Snapshot created
  • Shadow created
  • Stabilization started
  • Crash detected
  • File rollback triggered
  • Snapshot restore triggered
  • Deployment stabilized
  • Recovery failed

All events must be visible via agent logs and surfaced through API status endpoints.
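A structured event could be serialized as one JSON object per line, as sketched here. The snake_case event identifiers and the `ts` field are hypothetical; this document names the events but does not fix a wire format.

```python
import json
import time

# Hypothetical identifiers for the nine events listed above.
EVENTS = frozenset({
    "deployment_started", "snapshot_created", "shadow_created",
    "stabilization_started", "crash_detected", "file_rollback_triggered",
    "snapshot_restore_triggered", "deployment_stabilized", "recovery_failed",
})


def format_event(event: str, **fields) -> str:
    """Return a single-line JSON record for a known deployment event."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    record = {"ts": time.time(), "event": event}
    record.update(fields)
    return json.dumps(record, sort_keys=True)
```

For example, `format_event("crash_detected", crashCount=2)` yields a record that both the agent log and an API status endpoint could expose unchanged.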


11. Non-Goals (Phase 1)

  • No multi-mod version history
  • No dependency resolution
  • No world snapshotting
  • No long-term snapshot retention
  • No manual snapshot management UI
  • No automatic dependency analysis

12. Operational Guarantees

This system guarantees:

  • A broken mod will not permanently brick the server
  • Most failures will auto-recover without operator intervention
  • Deployment state will not accumulate disk artifacts
  • The system remains deterministic and bounded

13. Future Extensions (Phase 2+)

  • Snapshot retention policies
  • Manual snapshot restore via UI
  • Multi-mod change grouping
  • Deployment audit history
  • Version diffing
  • Dependency resolution

Final Decision Lock

| Decision | Value |
| --- | --- |
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Autonomy | Agent-level, no operator required |

Current Mod Lifecycle Model (2026-02)

This section reflects the implemented Phase 1 state as of February 2026.

Install: Modrinth resolver → API → Agent → Verified download → <serverRoot>/mods

Enable/Disable: Filesystem rename: .jar ↔ .jar.disabled

Delete: Soft delete to <serverRoot>/mods-removed (no auto-purge, no retention policy)

Filesystem is canonical state. Agent cache invalidated after every mutation.

Restore of deleted mods is handled manually via the file browser (planned; see OPEN_THREADS.md).

No retention policy implemented. No install queue. No DB tracking of mod state.
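The enable/disable rename and the soft delete described above could be sketched as follows. The function names are illustrative, and the behavior when a file of the same name already sits in mods-removed is not specified in this section, so the sketch does not handle it specially.

```python
import shutil
from pathlib import Path


def set_enabled(mods_dir: Path, jar_name: str, enabled: bool) -> Path:
    """Toggle a mod by renaming between <name>.jar and <name>.jar.disabled."""
    active = mods_dir / jar_name
    disabled = mods_dir / (jar_name + ".disabled")
    src, dst = (disabled, active) if enabled else (active, disabled)
    if src.exists():
        src.rename(dst)
    return dst


def soft_delete(server_root: Path, jar_name: str) -> Path:
    """Move a mod from mods/ into mods-removed/ instead of deleting it."""
    removed_dir = server_root / "mods-removed"
    removed_dir.mkdir(exist_ok=True)
    dst = removed_dir / jar_name
    shutil.move(str(server_root / "mods" / jar_name), str(dst))
    return dst
```

Since the filesystem is canonical, any agent cache would need to be invalidated after either call.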