ZeroLagHub – Mod Deployment Safety & Self-Healing Specification
Version: 1.0
Phase: Phase 1 Core Stability
Applies To: Game Containers
Owner: Agent Layer
1. Purpose
Define the deterministic, self-healing deployment model for:
- Modrinth-based automated mod installs
- Dev artifact promotion installs
The system must:
- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat
This system is not a backup solution. It is a deployment safety mechanism.
2. Scope
Included in Safety Snapshot
- /mods
- /config
- server.properties
Explicitly Excluded
- /world
- /logs
- Cache directories
- Runtime temp files
World backup is handled by a separate backup system.
3. Core Concepts
3.1 Shadow Copy (File-Level Protection)
Used for fast rollback of a single mod change.
- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window
Example:
mods/
coolmod.jar
coolmod.jar.shadow
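The shadow rename can be sketched as below. This is a minimal illustration, not the agent's actual code: the function name `create_shadow` and the choice to raise when a shadow is already present are assumptions made to show the single-shadow rule.

```python
from pathlib import Path
from typing import Optional


def create_shadow(mods_dir: Path, mod_name: str) -> Optional[Path]:
    """Rename an existing mod to <name>.shadow; return the shadow path or None."""
    target = mods_dir / mod_name
    if not target.exists():
        return None  # fresh install: there is no previous file to protect

    # Single-shadow rule: refuse to continue if any shadow is still present.
    # (Raising here is an assumption; the spec only says one shadow may exist.)
    leftovers = sorted(mods_dir.glob("*.shadow"))
    if leftovers:
        raise RuntimeError(f"shadow already exists: {leftovers[0].name}")

    shadow = target.with_name(target.name + ".shadow")
    target.rename(shadow)
    return shadow
```

Because the shadow no longer ends in `.jar`, it stays in `mods/` without being treated as a loadable mod file by name.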
3.2 Snapshot (State-Level Protection)
Lightweight archive of mod/config state. Created before any mod deployment.
Example path:
.zlh_snapshots/deploy-<timestamp>.tar.gz
Contains:
- mods/
- config/
- server.properties
Only one active snapshot exists at a time.
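A snapshot of this shape could be produced with Python's standard `tarfile` module, as sketched here. The function name, the epoch-seconds timestamp format, and the decision to delete a stale archive before writing a new one are assumptions; the spec only fixes the path pattern, the contents, and the one-snapshot limit.

```python
import tarfile
import time
from pathlib import Path

SNAPSHOT_DIR = ".zlh_snapshots"
SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def create_snapshot(server_root: Path) -> Path:
    """Archive mods/, config/ and server.properties into one tar.gz."""
    snap_dir = server_root / SNAPSHOT_DIR
    snap_dir.mkdir(exist_ok=True)

    # Only one active snapshot may exist, so remove any stale archive first
    # (assumed policy; refusing to deploy would be an equally valid choice).
    for old in snap_dir.glob("deploy-*.tar.gz"):
        old.unlink()

    archive = snap_dir / f"deploy-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in SNAPSHOT_MEMBERS:
            path = server_root / name
            if path.exists():  # a fresh server may not have config/ yet
                tar.add(path, arcname=name)
    return archive
```

World data is never added because only the three listed members are archived.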
4. Deployment State Machine
4.1 States
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
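One way to keep the state machine bounded is an explicit transition table. The sketch below encodes the states listed above; the `ALLOWED` table is an interpretation of sections 4 to 7 of this document rather than a confirmed implementation detail, and the names `DeployState` and `transition` are hypothetical.

```python
from enum import Enum


class DeployState(Enum):
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


S = DeployState

# Assumed transition table, read off the deployment and recovery sections.
ALLOWED = {
    S.IDLE: {S.DEPLOYING},
    S.DEPLOYING: {S.STABILIZING},
    S.STABILIZING: {S.STABLE, S.ROLLBACK_FILE, S.ROLLBACK_SNAPSHOT},
    S.STABLE: {S.IDLE},
    S.ROLLBACK_FILE: {S.IDLE, S.ROLLBACK_SNAPSHOT},
    S.ROLLBACK_SNAPSHOT: {S.IDLE, S.FAILED_RECOVERY},
    S.FAILED_RECOVERY: set(),  # terminal: automation stops here
}


def transition(current: DeployState, nxt: DeployState) -> DeployState:
    """Return nxt if the move is allowed, otherwise raise ValueError."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```

Rejecting every move that is not in the table makes an accidental recovery loop fail loudly instead of silently repeating.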
4.2 Deployment Flow
Step 1 – Stop Server
- Ensure clean shutdown
- Reset crash counters
Step 2 – Create Snapshot
- Archive mods/, config/, server.properties
- Mark snapshot status = pending
Step 3 – Prepare Shadow (if replacing existing mod)
- If target mod exists, rename:
coolmod.jar → coolmod.jar.shadow
Step 4 – Install New Mod
- Validate SHA256 (if Modrinth source)
- Write file to /mods
- Set deployment metadata: lastChangedMod, changeTimestamp, changeSource
Step 5 – Start Server
- Transition to STABILIZING state
- Begin stabilization timer
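The checksum validation in Step 4 might look like the following sketch. It assumes the file has already been downloaded to a staging path on the same filesystem as `mods/` and that the expected SHA256 hex digest was supplied with the install request; `install_verified` and `sha256_of` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large jars are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_verified(download: Path, mods_dir: Path, expected_sha256: str) -> Path:
    """Move a staged download into mods/ only if its SHA256 matches."""
    actual = sha256_of(download)
    if actual != expected_sha256.strip().lower():
        download.unlink()  # never leave an unverified file lying around
        raise ValueError(f"SHA256 mismatch for {download.name}")

    dest = mods_dir / download.name
    # Path.replace is atomic on the same filesystem (assumption stated above).
    download.replace(dest)
    return dest
```

Verifying before the move means a corrupt or tampered download never appears in `mods/`, so the server is not started against it.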
5. Stabilization Window
Duration
Hardcoded in Phase 1 to a single fixed value in the 3–5 minute range (not configurable at runtime)
Stability Conditions
Server is considered stable when:
- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window
On Stability Confirmed
- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: STABLE → IDLE
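The three stability conditions combine into a single predicate, sketched below. The 180-second default is an assumption (the lower bound of the 3–5 minute range), and `WindowObservation` is a hypothetical container for whatever the agent's process supervisor and readiness probe report.

```python
from dataclasses import dataclass

STABILIZATION_WINDOW_S = 180.0  # assumption: lower bound of the 3-5 minute range


@dataclass
class WindowObservation:
    uptime_s: float      # continuous uptime since the post-deploy start
    ready: bool          # latest readiness probe result
    crash_restarts: int  # crash restarts seen since the window opened


def is_stable(obs: WindowObservation,
              window_s: float = STABILIZATION_WINDOW_S) -> bool:
    """True only when all three stability conditions hold at once."""
    return (
        obs.ready
        and obs.crash_restarts == 0
        and obs.uptime_s >= window_s
    )
```

Only when this predicate returns true would the agent delete the snapshot and shadow and clear the deployment metadata.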
6. Failure Detection
6.1 Early Boot Crash
Condition: Process exits within X seconds of start (e.g., 30 s) during the stabilization window
Action: Transition to ROLLBACK_FILE
6.2 Crash Loop
Condition: ≥ N crashes (e.g., 3) within stabilization window
Action: Transition to ROLLBACK_SNAPSHOT
6.3 Readiness Timeout
Condition: Server fails readiness probe for the entire stabilization window
Action: Transition to ROLLBACK_SNAPSHOT
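The three detection rules can be expressed as one classifier, as in this sketch. The thresholds reuse the example values above (30 s, 3 crashes) plus an assumed 180 s window, and the precedence (crash loop checked before early boot crash) is an assumption, since the spec does not state which rule wins when both match.

```python
from typing import Optional

EARLY_BOOT_LIMIT_S = 30.0   # "X seconds" example value from section 6.1
CRASH_LOOP_THRESHOLD = 3    # "N crashes" example value from section 6.2
WINDOW_S = 180.0            # assumed stabilization window length


def classify_failure(crash_count: int,
                     last_exit_uptime_s: Optional[float],
                     window_elapsed_s: float,
                     ever_ready: bool) -> Optional[str]:
    """Map observations to a rollback state name, or None if no rule fires."""
    # 6.2 Crash loop: checked first (assumed precedence).
    if crash_count >= CRASH_LOOP_THRESHOLD:
        return "ROLLBACK_SNAPSHOT"
    # 6.1 Early boot crash: the process died shortly after starting.
    if last_exit_uptime_s is not None and last_exit_uptime_s <= EARLY_BOOT_LIMIT_S:
        return "ROLLBACK_FILE"
    # 6.3 Readiness timeout: window is over and the probe never succeeded.
    if window_elapsed_s >= WINDOW_S and not ever_ready:
        return "ROLLBACK_SNAPSHOT"
    return None
```

For example, a single exit 12 seconds after start yields `ROLLBACK_FILE`, while a third crash inside the window yields `ROLLBACK_SNAPSHOT`.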
7. Recovery Logic
7.1 File-Level Rollback (ROLLBACK_FILE)
Conditions:
- lastChangedMod exists
- Shadow copy exists
Action:
- Stop server
- Replace: coolmod.jar.shadow → coolmod.jar
- Start server
- Monitor again
If stable:
- Delete snapshot
- Delete shadow
- Return to IDLE
If still unstable:
- Escalate to snapshot restore
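The file-level replace step could be implemented roughly as follows. The sketch assumes the new mod was written under the same filename as the one it replaced, so restoring the shadow overwrites it; `rollback_file` is a hypothetical name, and stopping and starting the server is left to the caller.

```python
from pathlib import Path


def rollback_file(mods_dir: Path, mod_name: str) -> bool:
    """Restore <mod_name>.shadow over <mod_name>; False if no shadow exists."""
    target = mods_dir / mod_name
    shadow = target.with_name(target.name + ".shadow")
    if not shadow.exists():
        return False  # precondition not met: caller should escalate

    # Path.replace overwrites the newly installed jar in one step.
    shadow.replace(target)
    return True
```

A `False` return corresponds to the missing-shadow case, where the only remaining option is the snapshot restore.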
7.2 Snapshot Restore (ROLLBACK_SNAPSHOT)
Conditions:
- Snapshot exists
- Snapshot has not already been restored
Action:
- Stop server
- Extract snapshot archive
- Start server
- Mark snapshot as restored
If stable:
- Delete snapshot
- Clear metadata
- Return to IDLE
If still unstable:
- Transition to FAILED_RECOVERY
- Stop all automation
- Surface failure state to API and UI
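The extract step of a snapshot restore is sketched below using the standard `tarfile` module. Two details are assumptions not stated in the spec: `mods/` and `config/` are cleared before extraction so that files added by the failed deployment do not survive, and archive members outside the three snapshot paths are skipped as a guard for world data.

```python
import shutil
import tarfile
from pathlib import Path

SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def restore_snapshot(server_root: Path, archive: Path) -> None:
    """Replace mods/, config/ and server.properties with the archived copies."""
    # Assumption: wipe the mod/config trees first so newly added files vanish.
    for name in ("mods", "config"):
        shutil.rmtree(server_root / name, ignore_errors=True)

    with tarfile.open(archive, "r:gz") as tar:
        # Only extract members under the snapshot paths; never world data.
        members = [
            m for m in tar.getmembers()
            if m.name.split("/")[0] in SNAPSHOT_MEMBERS
        ]
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(server_root, members=members, **kwargs)
```

The caller would stop the server before this call, start it afterwards, and mark the snapshot as restored so that a second restore is never attempted.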
8. Safety Constraints
- Never attempt infinite recovery loops
- Only one file rollback attempt allowed
- Only one snapshot restore attempt allowed
- If both fail → mark FAILED_RECOVERY
- No world data is touched under any circumstance
- No multiple snapshot retention
- Only one snapshot exists during the stabilization window
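The one-attempt-each rule can be enforced with a small budget object, sketched here. `RecoveryBudget` is a hypothetical name, and the sketch simplifies by always preferring the file rollback when a shadow is available; it does not model the direct routing of crash loops to a snapshot restore described in section 6.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    """Tracks which recovery attempts were already spent for this deployment."""
    file_rollback_used: bool = False
    snapshot_restore_used: bool = False

    def next_action(self, shadow_exists: bool, snapshot_exists: bool) -> str:
        """Return the next recovery state, spending at most one attempt each."""
        if not self.file_rollback_used and shadow_exists:
            self.file_rollback_used = True
            return "ROLLBACK_FILE"
        if not self.snapshot_restore_used and snapshot_exists:
            self.snapshot_restore_used = True
            return "ROLLBACK_SNAPSHOT"
        # Both attempts are spent or unavailable: stop automation.
        return "FAILED_RECOVERY"
```

Since each flag can flip only once, repeated calls settle on `FAILED_RECOVERY` after at most two recovery attempts, which is what rules out infinite loops.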
9. Metadata Tracking (Agent-Level)
Track the following fields (cleared after stabilization success):
lastChangedMod
lastChangeTimestamp
lastChangeSource # modrinth | dev
deploymentState
crashCount
snapshotId
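These fields map naturally onto a small record type. The sketch below is one possible shape; the field names come from the list above, while the defaults, the JSON serialization, and the `cleared` helper are assumptions about how the agent might persist and reset them.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeploymentMetadata:
    lastChangedMod: Optional[str] = None
    lastChangeTimestamp: Optional[str] = None
    lastChangeSource: Optional[str] = None  # "modrinth" | "dev"
    deploymentState: str = "IDLE"
    crashCount: int = 0
    snapshotId: Optional[str] = None

    def to_json(self) -> str:
        """Serialize for persistence or an API status response."""
        return json.dumps(asdict(self), sort_keys=True)


def cleared() -> DeploymentMetadata:
    """Fresh record, as used after stabilization succeeds."""
    return DeploymentMetadata()
```

Persisting the record after every state change would let the agent resume the correct recovery step if it restarts mid-deployment, though the spec does not say where the metadata is stored.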
10. Logging Requirements
Structured logs must emit events for:
- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed
All events must be visible via agent logs and surfaced through API status endpoints.
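A structured event could be emitted as one JSON object per line, as in this sketch. The snake_case event identifiers, the `ts` field, and the `format_event` helper are hypothetical; the spec names the events but not their wire format.

```python
import json
import time

# Hypothetical identifiers for the nine events listed in this section.
EVENTS = frozenset({
    "deployment_started",
    "snapshot_created",
    "shadow_created",
    "stabilization_started",
    "crash_detected",
    "file_rollback_triggered",
    "snapshot_restore_triggered",
    "deployment_stabilized",
    "recovery_failed",
})


def format_event(event: str, **fields) -> str:
    """Return one JSON log line; reject event names outside the known set."""
    if event not in EVENTS:
        raise ValueError(f"unknown deployment event: {event}")
    record = {"ts": time.time(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)
```

Restricting the event set keeps log consumers and API status endpoints working from one shared vocabulary.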
11. Non-Goals (Phase 1)
- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis
12. Operational Guarantees
This system guarantees:
- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded
13. Future Extensions (Phase 2+)
- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution
Final Decision Lock
| Decision | Value |
|---|---|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Autonomy | Agent-level, no operator required |
Current Mod Lifecycle Model (2026-02)
This section reflects the implemented Phase 1 state as of February 2026.
Install:
Modrinth resolver → API → Agent → Verified download → <serverRoot>/mods
Enable/Disable:
Filesystem rename: .jar ↔ .jar.disabled
Delete:
Soft delete to <serverRoot>/mods-removed (no auto-purge, no retention policy)
Filesystem is canonical state. Agent cache invalidated after every mutation.
Restore of deleted mods handled manually via file browser (planned — see OPEN_THREADS.md).
No retention policy implemented. No install queue. No DB tracking of mod state.
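The enable/disable rename and the soft delete described above can be sketched with plain filesystem moves. Function names are hypothetical, and overwriting an earlier removed copy with the same filename is an assumed collision policy; the document does not say how name clashes in mods-removed are handled.

```python
from pathlib import Path

DISABLED_SUFFIX = ".disabled"


def set_enabled(mod_path: Path, enabled: bool) -> Path:
    """Toggle a mod via rename: name.jar <-> name.jar.disabled."""
    name = mod_path.name
    if enabled and name.endswith(".jar" + DISABLED_SUFFIX):
        new_path = mod_path.with_name(name[: -len(DISABLED_SUFFIX)])
    elif not enabled and name.endswith(".jar"):
        new_path = mod_path.with_name(name + DISABLED_SUFFIX)
    else:
        return mod_path  # already in the requested state
    mod_path.rename(new_path)
    return new_path


def soft_delete(server_root: Path, mod_name: str) -> Path:
    """Move a mod from mods/ into mods-removed/ instead of deleting it."""
    removed_dir = server_root / "mods-removed"
    removed_dir.mkdir(exist_ok=True)
    dest = removed_dir / mod_name
    # Assumed policy: an older removed copy with the same name is overwritten.
    (server_root / "mods" / mod_name).replace(dest)
    return dest
```

Because the filesystem is the canonical state, a caller would invalidate the agent cache after either operation.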