docs: add mod deployment safety & self-healing specification v1.0
parent 0be7da49e8
commit c926cad4ff

docs/architecture/mod-deployment-safety.md (Normal file, 308 lines)

@ -0,0 +1,308 @

# ZeroLagHub – Mod Deployment Safety & Self-Healing Specification

**Version:** 1.0
**Phase:** Phase 1 Core Stability
**Applies To:** Game Containers
**Owner:** Agent Layer


---

## 1. Purpose

Define the deterministic, self-healing deployment model for:

- Modrinth-based automated mod installs
- Dev artifact promotion installs

The system must:

- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat

> This system is **not** a backup solution. It is a **deployment safety mechanism**.


---

## 2. Scope

### Included in Safety Snapshot

- `/mods`
- `/config`
- `server.properties`

### Explicitly Excluded

- `/world`
- `/logs`
- Cache directories
- Runtime temp files

World backup is handled by a separate backup system.


---

## 3. Core Concepts

### 3.1 Shadow Copy (File-Level Protection)

Used for fast rollback of a single mod change.

- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window

**Example:**

```
mods/
  coolmod.jar
  coolmod.jar.shadow
```

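A minimal sketch of how the shadow step might look, assuming the shadow is made by renaming the existing jar in place. The function name `prepare_shadow` is hypothetical, and deleting a stale shadow to enforce the single-shadow rule is an assumption; the spec only states that one shadow exists at a time.

```python
from pathlib import Path
from typing import Optional

SHADOW_SUFFIX = ".shadow"  # suffix taken from the example above


def prepare_shadow(mods_dir: Path, mod_name: str) -> Optional[Path]:
    """Rename an existing mod to its shadow.

    Returns the shadow path, or None when the mod is a fresh install.
    """
    # Assumption: any leftover shadow is stale and may be removed,
    # so that at most one shadow exists at any time.
    for stale in mods_dir.glob(f"*{SHADOW_SUFFIX}"):
        stale.unlink()

    target = mods_dir / mod_name
    if not target.exists():
        return None  # nothing to protect

    shadow = target.with_name(target.name + SHADOW_SUFFIX)
    target.rename(shadow)
    return shadow
```

Renaming rather than copying keeps the operation cheap and leaves no second copy of the jar on disk.
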
### 3.2 Snapshot (State-Level Protection)

Lightweight archive of mod/config state. Created before any mod deployment.

**Example path:**

```
.zlh_snapshots/deploy-<timestamp>.tar.gz
```

**Contains:**

- `mods/`
- `config/`
- `server.properties`

Only one active snapshot exists at a time.

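As an illustration, the snapshot could be produced with Python's standard `tarfile` module. This is a sketch under assumptions: `create_snapshot` is a hypothetical name, the `<timestamp>` is rendered as Unix seconds (the spec does not fix a format), and an older archive is deleted to keep a single active snapshot.

```python
import tarfile
import time
from pathlib import Path

SNAPSHOT_DIR = ".zlh_snapshots"
SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def create_snapshot(server_root: Path) -> Path:
    """Archive mods/, config/ and server.properties; return the archive path."""
    snap_dir = server_root / SNAPSHOT_DIR
    snap_dir.mkdir(exist_ok=True)

    # Assumption: enforce "only one active snapshot" by removing older archives.
    for old in snap_dir.glob("deploy-*.tar.gz"):
        old.unlink()

    # Assumption: timestamp format is Unix seconds.
    archive = snap_dir / f"deploy-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in SNAPSHOT_MEMBERS:
            member = server_root / name
            if member.exists():
                # arcname keeps paths relative to the server root.
                tar.add(member, arcname=name)
    return archive
```

Only the listed members are added, so `world/` and `logs/` never enter the archive.
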

---

## 4. Deployment State Machine

### 4.1 States

```
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
```

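The states can be modelled as an enum with an explicit transition table. The table below is one reading of the flow in sections 4.2 through 7, not something the spec lists verbatim; in particular `IDLE → DEPLOYING` and `DEPLOYING → STABILIZING` are inferred, and the names `DeployState`, `ALLOWED`, and `transition` are hypothetical.

```python
from enum import Enum


class DeployState(Enum):
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


S = DeployState

# Assumed transition table, derived from the deployment and recovery sections.
ALLOWED = {
    S.IDLE: {S.DEPLOYING},
    S.DEPLOYING: {S.STABILIZING},
    S.STABILIZING: {S.STABLE, S.ROLLBACK_FILE, S.ROLLBACK_SNAPSHOT},
    S.STABLE: {S.IDLE},
    S.ROLLBACK_FILE: {S.IDLE, S.ROLLBACK_SNAPSHOT},
    S.ROLLBACK_SNAPSHOT: {S.IDLE, S.FAILED_RECOVERY},
    S.FAILED_RECOVERY: set(),  # terminal: automation stops here
}


def transition(current: DeployState, target: DeployState) -> DeployState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping `FAILED_RECOVERY` without outgoing edges makes the "no infinite recovery loops" property visible in the data structure itself.
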
### 4.2 Deployment Flow

**Step 1 – Stop Server**
- Ensure clean shutdown
- Reset crash counters

**Step 2 – Create Snapshot**
- Archive `mods/`, `config/`, `server.properties`
- Mark snapshot status = `pending`

**Step 3 – Prepare Shadow** *(if replacing existing mod)*
- If target mod exists, rename:
  ```
  coolmod.jar → coolmod.jar.shadow
  ```

**Step 4 – Install New Mod**
- Validate SHA256 (if Modrinth source)
- Write file to `/mods`
- Set deployment metadata (field names as defined in section 9):
  - `lastChangedMod`
  - `lastChangeTimestamp`
  - `lastChangeSource`

**Step 5 – Start Server**
- Transition to `STABILIZING` state
- Begin stabilization timer

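Step 4 might be sketched as follows. `install_mod` is a hypothetical helper; the digest algorithm is a parameter because the hash types actually available depend on what the source publishes, and writing to a temporary file before renaming is an added assumption, not something the spec requires.

```python
import hashlib
import os
from pathlib import Path
from typing import Optional


def install_mod(
    mods_dir: Path,
    mod_name: str,
    data: bytes,
    expected_hex: Optional[str] = None,
    algorithm: str = "sha256",
) -> Path:
    """Verify the digest (when one is supplied) and write the mod into mods_dir."""
    if expected_hex is not None:
        actual = hashlib.new(algorithm, data).hexdigest()
        if actual != expected_hex.lower():
            raise ValueError(f"{algorithm} mismatch for {mod_name}")

    # Assumption: write to a temp name, then rename, so a partially
    # written jar is never visible under its final name.
    tmp = mods_dir / (mod_name + ".tmp")
    tmp.write_bytes(data)
    final = mods_dir / mod_name
    os.replace(tmp, final)
    return final
```

Because validation happens before any write, a failed check leaves `/mods` unchanged.
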

---

## 5. Stabilization Window

### Duration

Hardcoded Phase 1 value: a single fixed duration within **3–5 minutes**

### Stability Conditions

Server is considered stable when all of the following hold:

- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window

### On Stability Confirmed

- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: `STABLE` → `IDLE`

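The stability conditions reduce to a conjunction, which a small predicate makes explicit. This is a sketch: `WindowObservation` and `is_stable` are hypothetical names, and the 180-second default is an assumed value at the low end of the 3–5 minute range.

```python
from dataclasses import dataclass

STABILIZATION_WINDOW_S = 180.0  # assumed value; the spec gives 3–5 minutes


@dataclass
class WindowObservation:
    ready: bool          # readiness probe currently succeeding
    uptime_s: float      # continuous uptime since the post-deploy start
    crash_restarts: int  # crash restarts observed during the window


def is_stable(obs: WindowObservation, window_s: float = STABILIZATION_WINDOW_S) -> bool:
    """True only when all three stability conditions hold."""
    return obs.ready and obs.uptime_s >= window_s and obs.crash_restarts == 0
```

For example, `is_stable(WindowObservation(ready=True, uptime_s=200, crash_restarts=0))` is true, while the same observation with one crash restart is not.
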

---

## 6. Failure Detection

### 6.1 Early Boot Crash

**Condition:** Process exits within X seconds of startup (e.g., 30 s) during the stabilization window

**Action:** Transition to `ROLLBACK_FILE`

### 6.2 Crash Loop

**Condition:** ≥ N crashes (e.g., 3) within the stabilization window

**Action:** Transition to `ROLLBACK_SNAPSHOT`

### 6.3 Readiness Timeout

**Condition:** Server fails readiness probe for the entire stabilization window

**Action:** Transition to `ROLLBACK_SNAPSHOT`

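The three detection rules can be combined into one classifier. The sketch below uses the example thresholds from the text (30 s, 3 crashes) and an assumed 180 s window; checking the crash-loop rule first is an assumption about precedence, since the spec does not say which rule wins when several match. All names are hypothetical.

```python
from enum import Enum
from typing import Optional

EARLY_CRASH_S = 30.0  # "X" from 6.1, using the example value
CRASH_LOOP_N = 3      # "N" from 6.2, using the example value
WINDOW_S = 180.0      # assumed stabilization window length


class Rollback(Enum):
    FILE = "ROLLBACK_FILE"
    SNAPSHOT = "ROLLBACK_SNAPSHOT"


def classify_failure(
    crash_count: int,
    last_exit_after_s: Optional[float],
    ready_seen: bool,
    elapsed_s: float,
) -> Optional[Rollback]:
    """Map observations in the stabilization window to a rollback action.

    last_exit_after_s is the process lifetime of the most recent exit,
    or None if the process has not exited.
    """
    # Assumption: the crash-loop rule takes precedence over the others.
    if crash_count >= CRASH_LOOP_N:
        return Rollback.SNAPSHOT
    if last_exit_after_s is not None and last_exit_after_s <= EARLY_CRASH_S:
        return Rollback.FILE
    if elapsed_s >= WINDOW_S and not ready_seen:
        return Rollback.SNAPSHOT
    return None  # no failure condition met yet
```
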

---

## 7. Recovery Logic

### 7.1 File-Level Rollback (`ROLLBACK_FILE`)

**Conditions:**
- `lastChangedMod` exists
- Shadow copy exists

**Action:**
1. Stop server
2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
3. Start server
4. Monitor again

**If stable:**
- Delete snapshot
- Delete shadow
- Return to `IDLE`

**If still unstable:**
- Escalate to snapshot restore

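The replace step of the file-level rollback could be as small as the sketch below, assuming the shadow sits next to the mod with a `.shadow` suffix as in the earlier example. `rollback_file` is a hypothetical name; stopping and starting the server is left to the caller.

```python
import os
from pathlib import Path


def rollback_file(mods_dir: Path, last_changed_mod: str) -> bool:
    """Restore the shadow over the newly installed mod.

    Returns False when no shadow exists, so the caller can escalate.
    """
    shadow = mods_dir / (last_changed_mod + ".shadow")
    if not shadow.exists():
        return False  # precondition not met

    # os.replace overwrites the destination, discarding the new jar.
    os.replace(shadow, mods_dir / last_changed_mod)
    return True
```
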
### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)

**Conditions:**
- Snapshot exists
- Snapshot has not already been restored

**Action:**
1. Stop server
2. Extract snapshot archive
3. Start server
4. Mark snapshot as `restored`

**If stable:**
- Delete snapshot
- Clear metadata
- Return to `IDLE`

**If still unstable:**
- Transition to `FAILED_RECOVERY`
- Stop all automation
- Surface failure state to API and UI

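A sketch of the extraction step, using the standard `tarfile` and `shutil` modules. Clearing `mods/` and `config/` before extracting is an assumption made so that files added by the failed deployment do not survive the restore; the spec only says "extract snapshot archive". `restore_snapshot` is a hypothetical name.

```python
import shutil
import tarfile
from pathlib import Path

SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def restore_snapshot(server_root: Path, archive: Path) -> None:
    """Replace mods/, config/ and server.properties with the archived copies."""
    # Assumption: remove current directories first so stray new files are dropped.
    for name in ("mods", "config"):
        shutil.rmtree(server_root / name, ignore_errors=True)

    with tarfile.open(archive, "r:gz") as tar:
        # Extract only the expected top-level members; world/ is never touched.
        members = [
            m for m in tar.getmembers()
            if m.name.split("/")[0] in SNAPSHOT_MEMBERS
        ]
        tar.extractall(server_root, members=members)
```

Filtering by top-level name is a guard against an archive that contains unexpected paths; it does not replace validating where the archive came from.
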

---

## 8. Safety Constraints

- Never attempt infinite recovery loops
- Only **one** file rollback attempt allowed
- Only **one** snapshot restore attempt allowed
- If both fail → mark `FAILED_RECOVERY`
- No world data is touched under any circumstance
- No retention of multiple snapshots
- Only one snapshot exists during the stabilization window

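One way to make the one-attempt-each rule hard to violate is to track the attempts in a small object that hands out the next recovery action. This is an illustrative sketch of the escalation chain only; `RecoveryBudget` and `next_action` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    file_rollback_used: bool = False
    snapshot_restore_used: bool = False

    def next_action(self, shadow_available: bool, snapshot_available: bool) -> str:
        """Return the next recovery state, consuming at most one attempt of each kind."""
        if not self.file_rollback_used and shadow_available:
            self.file_rollback_used = True
            return "ROLLBACK_FILE"
        if not self.snapshot_restore_used and snapshot_available:
            self.snapshot_restore_used = True
            return "ROLLBACK_SNAPSHOT"
        # Both attempts spent (or unavailable): stop automation.
        return "FAILED_RECOVERY"
```

Called repeatedly with both artifacts available, it yields `ROLLBACK_FILE`, then `ROLLBACK_SNAPSHOT`, then `FAILED_RECOVERY` on every later call, so the recovery path is bounded by construction.
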

---

## 9. Metadata Tracking (Agent-Level)

Track the following fields (cleared after stabilization success):

```
lastChangedMod
lastChangeTimestamp
lastChangeSource   # modrinth | dev
deploymentState
crashCount
snapshotId
```

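These fields map naturally onto a small record persisted as JSON. The sketch below assumes a JSON file on disk as the storage format, which the spec does not prescribe; the field types and the helper names `save`, `load`, and `clear` are likewise assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DeploymentMetadata:
    lastChangedMod: Optional[str] = None
    lastChangeTimestamp: Optional[float] = None
    lastChangeSource: Optional[str] = None  # "modrinth" | "dev"
    deploymentState: str = "IDLE"
    crashCount: int = 0
    snapshotId: Optional[str] = None


def save(meta: DeploymentMetadata, path: Path) -> None:
    """Write via a temp file so a crash never leaves a half-written record."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(meta)))
    tmp.replace(path)


def load(path: Path) -> DeploymentMetadata:
    """Read the record, or return defaults when none has been written."""
    if not path.exists():
        return DeploymentMetadata()
    return DeploymentMetadata(**json.loads(path.read_text()))


def clear(path: Path) -> None:
    """Reset all fields after stabilization success."""
    save(DeploymentMetadata(), path)
```
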

---

## 10. Logging Requirements

Structured logs must emit events for:

- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed

All events must be visible via agent logs and surfaced through API status endpoints.

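A structured event could be emitted as one JSON object per line. In this sketch the snake_case event identifiers, the `ts` field, and the function name `format_event` are all assumptions; the spec names the events but not their wire format.

```python
import json
import time

# Assumed identifiers for the nine events listed above.
EVENTS = {
    "deployment_started",
    "snapshot_created",
    "shadow_created",
    "stabilization_started",
    "crash_detected",
    "file_rollback_triggered",
    "snapshot_restore_triggered",
    "deployment_stabilized",
    "recovery_failed",
}


def format_event(event: str, **fields) -> str:
    """Serialize one log event as a single JSON line."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    record = {"ts": time.time(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)
```

Rejecting unknown event names keeps the log vocabulary closed, which makes it easier for an API status endpoint to map events to states.
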

---

## 11. Non-Goals (Phase 1)

- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis


---

## 12. Operational Guarantees

This system guarantees:

- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded


---

## 13. Future Extensions (Phase 2+)

- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution


---

## Final Decision Lock

| Decision | Value |
|----------|-------|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Autonomy | Agent-level, no operator required |