# ZeroLagHub Mod Deployment Safety & Self-Healing Specification
**Version:** 1.0
**Phase:** Phase 1 Core Stability
**Applies To:** Game Containers
**Owner:** Agent Layer
---
## 1. Purpose
Define the deterministic, self-healing deployment model for:
- Modrinth-based automated mod installs
- Dev artifact promotion installs
The system must:
- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat
> This system is **not** a backup solution. It is a **deployment safety mechanism**.
---
## 2. Scope
### Included in Safety Snapshot
- `/mods`
- `/config`
- `server.properties`
### Explicitly Excluded
- `/world`
- `/logs`
- Cache directories
- Runtime temp files
World backup is handled by a separate backup system.
---
## 3. Core Concepts
### 3.1 Shadow Copy (File-Level Protection)
Used for fast rollback of a single mod change.
- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window
**Example:**
```
mods/
coolmod.jar
coolmod.jar.shadow
```
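A minimal sketch of how the shadow copy could be created, assuming the `.shadow` suffix from the example above. The function name `prepare_shadow` and the choice to discard a stale shadow before creating a new one are illustrative assumptions, not part of the specification.

```python
from pathlib import Path
from typing import Optional

SHADOW_SUFFIX = ".shadow"  # suffix taken from the example above


def prepare_shadow(mods_dir: Path, mod_name: str) -> Optional[Path]:
    """Rename an existing mod to its shadow copy.

    Returns the shadow path, or None when the mod does not exist yet
    (a fresh install has nothing to protect).
    """
    target = mods_dir / mod_name
    if not target.exists():
        return None
    # Assumption: a leftover shadow from an earlier deployment is discarded
    # so that only one shadow exists at any time.
    for stale in mods_dir.glob(f"*{SHADOW_SUFFIX}"):
        stale.unlink()
    shadow = target.with_name(target.name + SHADOW_SUFFIX)
    target.rename(shadow)
    return shadow
```

Renaming rather than copying keeps the operation cheap and leaves the original filename free for the new mod.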
### 3.2 Snapshot (State-Level Protection)
Lightweight archive of mod/config state. Created before any mod deployment.
**Example path:**
```
.zlh_snapshots/deploy-<timestamp>.tar.gz
```
**Contains:**
- `mods/`
- `config/`
- `server.properties`
Only one active snapshot exists at a time.
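As a rough illustration, the snapshot could be produced with Python's standard `tarfile` module. The function name `create_snapshot`, the Unix-seconds timestamp format, and the removal of any earlier archive are assumptions made for this sketch.

```python
import tarfile
import time
from pathlib import Path

SNAPSHOT_DIR = ".zlh_snapshots"
SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def create_snapshot(server_root: Path) -> Path:
    """Archive mods/, config/ and server.properties into a single tar.gz."""
    snap_dir = server_root / SNAPSHOT_DIR
    snap_dir.mkdir(exist_ok=True)
    # Assumption: enforce "only one active snapshot" by deleting older archives.
    for old in snap_dir.glob("deploy-*.tar.gz"):
        old.unlink()
    # Assumption: <timestamp> is rendered as Unix seconds.
    archive = snap_dir / f"deploy-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in SNAPSHOT_MEMBERS:
            member = server_root / name
            if member.exists():
                # arcname keeps paths relative to the server root.
                tar.add(member, arcname=name)
    return archive
```

Because only the three listed members are added, `world/` and `logs/` never enter the archive.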
---
## 4. Deployment State Machine
### 4.1 States
```
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
```
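The states above could be modelled as an enum with an explicit transition table. The table below is one reading of the flows described in this document, not an authoritative definition; `DeployState`, `ALLOWED` and `transition` are hypothetical names.

```python
from enum import Enum


class DeployState(Enum):
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


S = DeployState
# Assumed transition table, inferred from the deployment and recovery flows.
ALLOWED = {
    S.IDLE: {S.DEPLOYING},
    S.DEPLOYING: {S.STABILIZING},
    S.STABILIZING: {S.STABLE, S.ROLLBACK_FILE, S.ROLLBACK_SNAPSHOT},
    S.STABLE: {S.IDLE},
    S.ROLLBACK_FILE: {S.IDLE, S.ROLLBACK_SNAPSHOT},
    S.ROLLBACK_SNAPSHOT: {S.IDLE, S.FAILED_RECOVERY},
    S.FAILED_RECOVERY: set(),  # terminal: automation stops here
}


def transition(current: DeployState, target: DeployState) -> DeployState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting unknown transitions keeps the machine bounded: there is no path out of `FAILED_RECOVERY` without outside action.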
### 4.2 Deployment Flow
**Step 1: Stop Server**
- Ensure clean shutdown
- Reset crash counters
**Step 2: Create Snapshot**
- Archive `mods/`, `config/`, `server.properties`
- Mark snapshot status = `pending`
**Step 3: Prepare Shadow** *(if replacing an existing mod)*
- If the target mod exists, rename it:
```
coolmod.jar → coolmod.jar.shadow
```
**Step 4: Install New Mod**
- Validate SHA256 (if Modrinth source)
- Write file to `/mods`
- Set deployment metadata:
  - `lastChangedMod`
  - `lastChangeTimestamp`
  - `lastChangeSource`
**Step 5: Start Server**
- Transition to `STABILIZING` state
- Begin stabilization timer
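The install step of the flow above could look roughly like the following. It assumes the caller already holds the downloaded bytes and an expected SHA256 hex digest; the function name `install_mod` and the write-then-rename pattern are illustrative choices, not requirements of this specification.

```python
import hashlib
import os
from pathlib import Path


def install_mod(data: bytes, expected_sha256: str, mods_dir: Path, mod_name: str) -> Path:
    """Verify the digest, then place the mod file into the mods directory."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"SHA256 mismatch for {mod_name}: got {actual}")
    target = mods_dir / mod_name
    # Assumption: write to a temporary name first so a partial write
    # never appears under the final .jar filename.
    tmp = mods_dir / (mod_name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, target)
    return target
```

Verifying before writing means a corrupted download fails the deployment without touching `/mods`.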
---
## 5. Stabilization Window
### Duration
Hardcoded Phase 1 value: **35 minutes**
### Stability Conditions
Server is considered stable when:
- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window
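The three conditions above can be expressed as a single predicate. This is a sketch only: `WindowObservation` and its fields are hypothetical names for whatever the agent actually measures, and the window length is the Phase 1 value stated in this section.

```python
from dataclasses import dataclass

STABILIZATION_WINDOW_S = 35 * 60  # Phase 1 value from this section, in seconds


@dataclass
class WindowObservation:
    uptime_s: float       # continuous uptime since the post-deploy start
    readiness_ok: bool    # readiness probe has returned success
    crash_restarts: int   # crash restarts seen during the window


def is_stable(obs: WindowObservation, window_s: float = STABILIZATION_WINDOW_S) -> bool:
    """True only when all three stability conditions hold."""
    return (
        obs.readiness_ok
        and obs.crash_restarts == 0
        and obs.uptime_s >= window_s
    )
```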
### On Stability Confirmed
- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: `STABLE` → `IDLE`
---
## 6. Failure Detection
### 6.1 Early Boot Crash
**Condition:** The server process exits within X seconds of starting (e.g., 30s) during the stabilization window
**Action:** Transition to `ROLLBACK_FILE`
### 6.2 Crash Loop
**Condition:** ≥ N crashes (e.g., 3) within the stabilization window
**Action:** Transition to `ROLLBACK_SNAPSHOT`
### 6.3 Readiness Timeout
**Condition:** Server fails readiness probe for the entire stabilization window
**Action:** Transition to `ROLLBACK_SNAPSHOT`
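The three detection rules could be combined into one classifier, sketched below. The thresholds use the example values given above (30 seconds, 3 crashes). The function name, its parameters, and the precedence order (crash loop checked before early boot crash) are assumptions made for illustration.

```python
from typing import Optional

EARLY_BOOT_S = 30   # example value for X in 6.1
CRASH_LOOP_N = 3    # example value for N in 6.2


def classify_failure(
    exit_after_s: Optional[float],  # seconds the process ran before exiting, None if still running
    crash_count: int,               # crashes so far in the stabilization window
    ready_seen: bool,               # readiness probe succeeded at least once
    window_elapsed: bool,           # stabilization window has fully elapsed
) -> Optional[str]:
    """Return the rollback state to enter, or None if no failure is detected."""
    # Assumption: a crash loop takes precedence over a single early exit.
    if crash_count >= CRASH_LOOP_N:
        return "ROLLBACK_SNAPSHOT"
    if exit_after_s is not None and exit_after_s <= EARLY_BOOT_S:
        return "ROLLBACK_FILE"
    if window_elapsed and not ready_seen:
        return "ROLLBACK_SNAPSHOT"
    return None
```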
---
## 7. Recovery Logic
### 7.1 File-Level Rollback (`ROLLBACK_FILE`)
**Conditions:**
- `lastChangedMod` exists
- Shadow copy exists
**Action:**
1. Stop server
2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
3. Start server
4. Monitor again
**If stable:**
- Delete snapshot
- Delete shadow
- Return to `IDLE`
**If still unstable:**
- Escalate to snapshot restore
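A sketch of the replace step, assuming the new mod was written under the same filename as the one it replaced (as in the `coolmod.jar` example). The function name `rollback_file` is hypothetical; stopping and starting the server is left to the caller.

```python
import os
from pathlib import Path


def rollback_file(mods_dir: Path, last_changed_mod: str) -> bool:
    """Restore the shadow copy over the newly installed mod.

    Returns False when the shadow is missing, in which case the caller
    would escalate to a snapshot restore.
    """
    shadow = mods_dir / (last_changed_mod + ".shadow")
    if not shadow.exists():
        return False
    # os.replace overwrites the failing jar with the previous version.
    os.replace(shadow, mods_dir / last_changed_mod)
    return True
```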
### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)
**Conditions:**
- Snapshot exists
- Snapshot has not already been restored
**Action:**
1. Stop server
2. Extract snapshot archive
3. Start server
4. Mark snapshot as `restored`
**If stable:**
- Delete snapshot
- Clear metadata
- Return to `IDLE`
**If still unstable:**
- Transition to `FAILED_RECOVERY`
- Stop all automation
- Surface failure state to API and UI
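The extract step might be sketched as follows. Clearing `mods/` and `config/` before extraction is an assumption added here so that files introduced by the failed deployment do not survive the restore; the function name `restore_snapshot` is hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def restore_snapshot(server_root: Path, archive: Path) -> None:
    """Replace mods/ and config/ with the archived state; world/ is never touched."""
    # Assumption: remove current directories first so extra files from the
    # failed deployment are not left behind.
    for name in ("mods", "config"):
        shutil.rmtree(server_root / name, ignore_errors=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data" rejects unsafe members.
            tar.extractall(server_root, filter="data")
        else:
            tar.extractall(server_root)
```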
---
## 8. Safety Constraints
- Never attempt infinite recovery loops
- Only **one** file rollback attempt allowed
- Only **one** snapshot restore attempt allowed
- If both fail → mark `FAILED_RECOVERY`
- No world data is touched under any circumstance
- No retention of multiple snapshots
- Only one snapshot exists during the stabilization window
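The one-attempt-each rule can be made explicit with a small budget object. This sketch models only the escalation ladder (it does not cover failures that go straight to a snapshot restore); `RecoveryBudget` and `next_recovery_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    file_rollback_used: bool = False
    snapshot_restore_used: bool = False


def next_recovery_step(budget: RecoveryBudget, shadow_exists: bool, snapshot_exists: bool) -> str:
    """Pick the next recovery state, spending each attempt at most once."""
    if not budget.file_rollback_used and shadow_exists:
        budget.file_rollback_used = True
        return "ROLLBACK_FILE"
    if not budget.snapshot_restore_used and snapshot_exists:
        budget.snapshot_restore_used = True
        return "ROLLBACK_SNAPSHOT"
    # Both attempts are spent or unavailable: stop instead of looping.
    return "FAILED_RECOVERY"
```

Called repeatedly with the same budget, the function yields `ROLLBACK_FILE`, then `ROLLBACK_SNAPSHOT`, then `FAILED_RECOVERY` on every later call, so recovery cannot loop indefinitely.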
---
## 9. Metadata Tracking (Agent-Level)
Track the following fields (cleared after stabilization success):
```
lastChangedMod
lastChangeTimestamp
lastChangeSource # modrinth | dev
deploymentState
crashCount
snapshotId
```
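One possible in-memory shape for these fields is a dataclass with JSON round-tripping. The field names come from the list above; the types, default values, and JSON persistence are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeploymentMetadata:
    lastChangedMod: Optional[str] = None
    lastChangeTimestamp: Optional[float] = None   # assumed: Unix seconds
    lastChangeSource: Optional[str] = None        # "modrinth" or "dev"
    deploymentState: str = "IDLE"
    crashCount: int = 0
    snapshotId: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "DeploymentMetadata":
        return cls(**json.loads(raw))


def cleared() -> DeploymentMetadata:
    """Metadata as it would look after a successful stabilization."""
    return DeploymentMetadata()
```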
---
## 10. Logging Requirements
Structured logs must emit events for:
- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed
All events must be visible via agent logs and surfaced through API status endpoints.
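A minimal sketch of structured event emission as one JSON object per line. The snake_case event identifiers are hypothetical renderings of the events listed above, and `format_event` is an illustrative helper rather than an existing agent API.

```python
import json
import time

# Hypothetical identifiers, one per event listed above.
EVENTS = frozenset({
    "deployment_started",
    "snapshot_created",
    "shadow_created",
    "stabilization_started",
    "crash_detected",
    "file_rollback_triggered",
    "snapshot_restore_triggered",
    "deployment_stabilized",
    "recovery_failed",
})


def format_event(event: str, **fields) -> str:
    """Render one structured log line; unknown event names are rejected."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    record = {"ts": time.time(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)
```

For example, `print(format_event("crash_detected", crashCount=2))` would write a single JSON line that an API status endpoint could also return verbatim.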
---
## 11. Non-Goals (Phase 1)
- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis
---
## 12. Operational Guarantees
This system guarantees:
- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded
---
## 13. Future Extensions (Phase 2+)
- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution
---
## Final Decision Lock
| Decision | Value |
|----------|-------|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded for Phase 1 |
| Autonomy | Agent-level, no operator required |