docs: add mod deployment safety & self-healing specification v1.0

jester 2026-02-23 00:12:01 +00:00
parent 0be7da49e8
commit c926cad4ff

# ZeroLagHub Mod Deployment Safety & Self-Healing Specification
**Version:** 1.0
**Phase:** Phase 1 Core Stability
**Applies To:** Game Containers
**Owner:** Agent Layer
---
## 1. Purpose
Define the deterministic, self-healing deployment model for:
- Modrinth-based automated mod installs
- Dev artifact promotion installs
The system must:
- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat
> This system is **not** a backup solution. It is a **deployment safety mechanism**.
---
## 2. Scope
### Included in Safety Snapshot
- `/mods`
- `/config`
- `server.properties`
### Explicitly Excluded
- `/world`
- `/logs`
- Cache directories
- Runtime temp files
World backup is handled by a separate backup system.
---
## 3. Core Concepts
### 3.1 Shadow Copy (File-Level Protection)
Used for fast rollback of a single mod change.
- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window
**Example:**
```
mods/
coolmod.jar
coolmod.jar.shadow
```
### 3.2 Snapshot (State-Level Protection)
Lightweight archive of mod/config state. Created before any mod deployment.
**Example path:**
```
.zlh_snapshots/deploy-<timestamp>.tar.gz
```
**Contains:**
- `mods/`
- `config/`
- `server.properties`
Only one active snapshot exists at a time.
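A minimal sketch of snapshot creation under the rules above, using Python's standard `tarfile` module. The function name `create_snapshot`, the `server_root` parameter, and the choice to refuse (rather than overwrite) when an archive already exists are assumptions for illustration, not part of this specification.

```python
import tarfile
import time
from pathlib import Path

# Paths included in the safety snapshot, per the Scope section.
SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def create_snapshot(server_root: Path) -> Path:
    """Archive mod/config state. World data and logs are never added."""
    snapshot_dir = server_root / ".zlh_snapshots"
    snapshot_dir.mkdir(exist_ok=True)

    # Assumption: enforce the single-snapshot rule by refusing to create a second one.
    if any(snapshot_dir.glob("deploy-*.tar.gz")):
        raise RuntimeError("an active snapshot already exists")

    archive = snapshot_dir / f"deploy-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in SNAPSHOT_MEMBERS:
            member = server_root / name
            if member.exists():
                # arcname keeps member paths relative so the archive restores in place.
                tar.add(member, arcname=name)
    return archive
```

Because only the three listed paths are ever passed to `tar.add`, excluded directories such as `/world` cannot end up in the archive by accident.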
---
## 4. Deployment State Machine
### 4.1 States
```
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
```
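The states listed above can be sketched as an enum with an explicit transition table. The table is inferred from the flow and recovery sections of this document; the names `DeployState`, `ALLOWED`, and `transition` are hypothetical, and a real agent may permit different edges.

```python
from enum import Enum


class DeployState(Enum):
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


S = DeployState

# Assumed edges, read off the deployment flow and recovery logic.
ALLOWED = {
    S.IDLE: frozenset({S.DEPLOYING}),
    S.DEPLOYING: frozenset({S.STABILIZING}),
    S.STABILIZING: frozenset({S.STABLE, S.ROLLBACK_FILE, S.ROLLBACK_SNAPSHOT}),
    S.STABLE: frozenset({S.IDLE}),
    S.ROLLBACK_FILE: frozenset({S.IDLE, S.ROLLBACK_SNAPSHOT}),
    S.ROLLBACK_SNAPSHOT: frozenset({S.IDLE, S.FAILED_RECOVERY}),
    # Terminal: automation stops and the state is surfaced to API and UI.
    S.FAILED_RECOVERY: frozenset(),
}


def transition(current: DeployState, target: DeployState) -> DeployState:
    """Return the new state, rejecting any edge not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping `FAILED_RECOVERY` without outgoing edges is one way to make the "no infinite recovery loops" constraint structural rather than a runtime check.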
### 4.2 Deployment Flow
**Step 1: Stop Server**
- Ensure clean shutdown
- Reset crash counters
**Step 2: Create Snapshot**
- Archive `mods/`, `config/`, `server.properties`
- Mark snapshot status = `pending`
**Step 3: Prepare Shadow** *(if replacing existing mod)*
- If target mod exists, rename:
```
coolmod.jar → coolmod.jar.shadow
```
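The shadow rename can be sketched as follows. `prepare_shadow` and its parameters are hypothetical names, and raising when a shadow is already present is one assumed way to honor the "only one shadow exists at any time" rule.

```python
from pathlib import Path
from typing import Optional


def prepare_shadow(mods_dir: Path, mod_filename: str) -> Optional[Path]:
    """Rename an existing mod to its shadow. Return the shadow path, or None."""
    target = mods_dir / mod_filename
    if not target.exists():
        return None  # Fresh install: there is nothing to shadow.

    # Assumption: a leftover shadow means a prior deployment never finished.
    existing = list(mods_dir.glob("*.shadow"))
    if existing:
        raise RuntimeError(f"shadow already present: {existing[0].name}")

    shadow = target.with_name(target.name + ".shadow")
    target.rename(shadow)
    return shadow
```

A `None` return tells the caller that file-level rollback will not be available for this deployment, so any failure would have to go straight to the snapshot.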
**Step 4: Install New Mod**
- Validate SHA256 (if Modrinth source)
- Write file to `/mods`
- Set deployment metadata:
  - `lastChangedMod`
  - `lastChangeTimestamp`
  - `lastChangeSource`
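The checksum validation and file write in Step 4 might look like the sketch below. It assumes the expected digest is supplied by the caller as a hex string; how that digest is obtained from Modrinth is outside this sketch. The names `sha256_of` and `install_mod` are hypothetical, and the write-then-rename step is an added precaution rather than a stated requirement.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large jars are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_mod(downloaded: Path, mods_dir: Path, expected_sha256: str) -> Path:
    """Verify the download, then place it in mods_dir under its own filename."""
    actual = sha256_of(downloaded)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"SHA256 mismatch for {downloaded.name}")

    destination = mods_dir / downloaded.name
    # Copy to a temporary name first so a partial write never looks like a mod.
    tmp = destination.with_name(destination.name + ".tmp")
    shutil.copyfile(downloaded, tmp)
    os.replace(tmp, destination)
    return destination
```

Validation happens before anything touches `mods_dir`, so a corrupted download fails the deployment without needing a rollback.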
**Step 5: Start Server**
- Transition to `STABILIZING` state
- Begin stabilization timer
---
## 5. Stabilization Window
### Duration
Hardcoded Phase 1 value: **35 minutes**
### Stability Conditions
Server is considered stable when:
- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window
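The three stability conditions reduce to a single predicate. In this sketch the `WindowObservation` type and its field names are hypothetical, and the constant simply restates the Phase 1 duration given above.

```python
from dataclasses import dataclass

# Phase 1 value as stated under Duration.
STABILIZATION_WINDOW_SECONDS = 35 * 60


@dataclass
class WindowObservation:
    seconds_running: float  # continuous uptime since the post-deploy start
    readiness_ok: bool      # latest readiness probe result
    crash_restarts: int     # crash restarts seen during the window


def is_stable(obs: WindowObservation,
              window: float = STABILIZATION_WINDOW_SECONDS) -> bool:
    """All three conditions must hold at the same time."""
    return (
        obs.readiness_ok
        and obs.crash_restarts == 0
        and obs.seconds_running >= window
    )
```

For example, `is_stable(WindowObservation(40 * 60, True, 1))` is false: the uptime is long enough, but a crash restart inside the window disqualifies the deployment.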
### On Stability Confirmed
- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: `STABLE` → `IDLE`
---
## 6. Failure Detection
### 6.1 Early Boot Crash
**Condition:** Process exits within X seconds of start (e.g., 30 s) during the stabilization window
**Action:** Transition to `ROLLBACK_FILE`
### 6.2 Crash Loop
**Condition:** ≥ N crashes (e.g., 3) within the stabilization window
**Action:** Transition to `ROLLBACK_SNAPSHOT`
### 6.3 Readiness Timeout
**Condition:** Server fails readiness probe for the entire stabilization window
**Action:** Transition to `ROLLBACK_SNAPSHOT`
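The three detection rules can be combined into one classifier, sketched below. The thresholds use the example values from the text for X and N. The function name, its parameters, and the order in which the rules are checked (crash loop before early boot crash) are assumptions; the specification does not state a precedence.

```python
from typing import Optional

EARLY_BOOT_SECONDS = 30   # "X" in 6.1, example value from the text
CRASH_LOOP_THRESHOLD = 3  # "N" in 6.2, example value from the text


def classify_failure(uptime_at_exit: Optional[float],
                     crash_count: int,
                     readiness_timed_out: bool) -> Optional[str]:
    """Map an observation to a rollback state, or None if no rollback applies.

    uptime_at_exit is None when the process has not exited.
    """
    # Assumed precedence: repeated crashes escalate straight to the snapshot.
    if crash_count >= CRASH_LOOP_THRESHOLD:
        return "ROLLBACK_SNAPSHOT"
    if readiness_timed_out:
        return "ROLLBACK_SNAPSHOT"
    if uptime_at_exit is not None and uptime_at_exit <= EARLY_BOOT_SECONDS:
        return "ROLLBACK_FILE"
    # A late, isolated crash: keep counting and let the supervisor restart.
    return None
```

With these values, a first crash 10 seconds after start yields `ROLLBACK_FILE`, while a third crash yields `ROLLBACK_SNAPSHOT` regardless of uptime.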
---
## 7. Recovery Logic
### 7.1 File-Level Rollback (`ROLLBACK_FILE`)
**Conditions:**
- `lastChangedMod` exists
- Shadow copy exists
**Action:**
1. Stop server
2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
3. Start server
4. Monitor again
**If stable:**
- Delete snapshot
- Delete shadow
- Return to `IDLE`
**If still unstable:**
- Escalate to snapshot restore
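The replace step of the file-level rollback can be sketched with `os.replace`, which overwrites the destination in a single operation when both paths are on the same filesystem. `restore_shadow` is a hypothetical name; stopping and starting the server is left to the caller.

```python
import os
from pathlib import Path


def restore_shadow(mods_dir: Path, last_changed_mod: str) -> bool:
    """Put the shadow back over the new mod. Return False if no shadow exists."""
    target = mods_dir / last_changed_mod
    shadow = target.with_name(target.name + ".shadow")
    if not shadow.exists():
        return False  # Precondition not met: the caller should escalate.
    os.replace(shadow, target)
    return True
```

This sketch moves the shadow rather than copying it, so the later "Delete shadow" step has nothing left to remove. An implementation that copies instead would keep the shadow until stability is confirmed.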
### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)
**Conditions:**
- Snapshot exists
- Snapshot has not already been restored
**Action:**
1. Stop server
2. Extract snapshot archive
3. Start server
4. Mark snapshot as `restored`
**If stable:**
- Delete snapshot
- Clear metadata
- Return to `IDLE`
**If still unstable:**
- Transition to `FAILED_RECOVERY`
- Stop all automation
- Surface failure state to API and UI
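A sketch of the extract step follows. Extracting an archive over existing directories does not remove files that were added after the snapshot was taken, so this sketch clears `mods/` and `config/` first; that clearing step is an assumption beyond what the text states. `restore_snapshot` is a hypothetical name.

```python
import shutil
import tarfile
from pathlib import Path


def restore_snapshot(server_root: Path, archive: Path) -> None:
    """Replace mod/config state with the archive contents. World data is untouched."""
    # Open the archive first so an unreadable snapshot fails before any deletion.
    with tarfile.open(archive, "r:gz") as tar:
        # Assumption: remove current state so mods added after the snapshot vanish.
        for name in ("mods", "config"):
            shutil.rmtree(server_root / name, ignore_errors=True)

        # Use the safer extraction filter where the Python version provides it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(server_root, filter="data")
        else:
            tar.extractall(server_root)
```

Only `mods/` and `config/` are ever removed, and `server.properties` is overwritten by the archived copy, which keeps the operation inside the snapshot scope.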
---
## 8. Safety Constraints
- Never attempt infinite recovery loops
- Only **one** file rollback attempt allowed
- Only **one** snapshot restore attempt allowed
- If both fail → mark `FAILED_RECOVERY`
- No world data is touched under any circumstance
- Never retain more than one snapshot
- Only one snapshot exists during the stabilization window
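The one-attempt-each rule can be sketched as a small budget object that is consumed as recovery escalates. `RecoveryBudget` and `next_recovery_state` are hypothetical names, not part of this specification.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    file_rollback_used: bool = False
    snapshot_restore_used: bool = False


def next_recovery_state(budget: RecoveryBudget,
                        shadow_exists: bool,
                        snapshot_exists: bool) -> str:
    """Pick the next recovery step, spending each attempt at most once."""
    if not budget.file_rollback_used and shadow_exists:
        budget.file_rollback_used = True
        return "ROLLBACK_FILE"
    if not budget.snapshot_restore_used and snapshot_exists:
        budget.snapshot_restore_used = True
        return "ROLLBACK_SNAPSHOT"
    # Both attempts are spent or unavailable: stop automating.
    return "FAILED_RECOVERY"
```

Calling the function three times with the same budget yields `ROLLBACK_FILE`, `ROLLBACK_SNAPSHOT`, then `FAILED_RECOVERY`, so the number of recovery actions is bounded by construction.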
---
## 9. Metadata Tracking (Agent-Level)
Track the following fields (cleared after stabilization success):
```
lastChangedMod
lastChangeTimestamp
lastChangeSource # modrinth | dev
deploymentState
crashCount
snapshotId
```
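One possible in-memory shape for these fields is a dataclass whose attribute names mirror the list above. The class name, the default values, the field types, and the JSON serialization are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeploymentMetadata:
    # Attribute names intentionally mirror the field names in this section.
    lastChangedMod: Optional[str] = None
    lastChangeTimestamp: Optional[str] = None
    lastChangeSource: Optional[str] = None  # "modrinth" or "dev"
    deploymentState: str = "IDLE"
    crashCount: int = 0
    snapshotId: Optional[str] = None

    def to_json(self) -> str:
        """Serialize for persistence or an API status response."""
        return json.dumps(asdict(self), sort_keys=True)

    def clear(self) -> None:
        """Reset every field to its default after stabilization succeeds."""
        self.__init__()
```

Resetting through the constructor keeps `clear()` correct if fields are added later, since the defaults live in one place.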
---
## 10. Logging Requirements
Structured logs must emit events for:
- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed
All events must be visible via agent logs and surfaced through API status endpoints.
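A structured event could be emitted as one JSON object per line, as in the sketch below. The event identifiers are hypothetical snake_case renderings of the list above, and the `ts` and `event` keys are assumed field names rather than a defined log schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical identifiers, one per required event in this section.
EVENTS = frozenset({
    "deployment_started",
    "snapshot_created",
    "shadow_created",
    "stabilization_started",
    "crash_detected",
    "file_rollback_triggered",
    "snapshot_restore_triggered",
    "deployment_stabilized",
    "recovery_failed",
})


def format_event(event: str, **fields: object) -> str:
    """Render one log line as JSON, rejecting unknown event names."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    return json.dumps(record, sort_keys=True)
```

A call such as `format_event("crash_detected", crashCount=2)` produces a single line that both the agent log and an API status endpoint can parse without a custom format.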
---
## 11. Non-Goals (Phase 1)
- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis
---
## 12. Operational Guarantees
This system guarantees:
- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded
---
## 13. Future Extensions (Phase 2+)
- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution
---
## Final Decision Lock
| Decision | Value |
|----------|-------|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Autonomy | Agent-level, no operator required |