# ZeroLagHub – Mod Deployment Safety & Self-Healing Specification

**Version:** 1.0
**Phase:** Phase 1 Core Stability
**Applies To:** Game Containers
**Owner:** Agent Layer

---

## 1. Purpose

Define the deterministic, self-healing deployment model for:

- Modrinth-based automated mod installs
- Dev artifact promotion installs

The system must:

- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat

> This system is **not** a backup solution. It is a **deployment safety mechanism**.

---

## 2. Scope

### Included in Safety Snapshot

- `/mods`
- `/config`
- `server.properties`

### Explicitly Excluded

- `/world`
- `/logs`
- Cache directories
- Runtime temp files

World backup is handled by a separate backup system.

---

## 3. Core Concepts

### 3.1 Shadow Copy (File-Level Protection)

Used for fast rollback of a single mod change.

- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window

**Example:**

```
mods/
  coolmod.jar
  coolmod.jar.shadow
```

### 3.2 Snapshot (State-Level Protection)

Lightweight archive of mod/config state. Created before any mod deployment.

**Example path:**

```
.zlh_snapshots/deploy-<timestamp>.tar.gz
```

**Contains:**

- `mods/`
- `config/`
- `server.properties`

Only one active snapshot exists at a time.

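
The snapshot described above could be produced roughly as follows. This is a minimal sketch, not the agent's actual implementation: the function name `create_snapshot`, the Unix-seconds timestamp format, and the choice to delete any leftover archive before writing a new one are all assumptions made for illustration.

```python
import tarfile
import time
from pathlib import Path

SNAPSHOT_DIR = ".zlh_snapshots"
SNAPSHOT_MEMBERS = ("mods", "config", "server.properties")


def create_snapshot(server_root: Path) -> Path:
    """Archive mods/, config/ and server.properties into one tar.gz file."""
    snapshot_dir = server_root / SNAPSHOT_DIR
    snapshot_dir.mkdir(exist_ok=True)

    # Assumption: enforce the single-snapshot rule by removing leftovers first.
    for stale in snapshot_dir.glob("deploy-*.tar.gz"):
        stale.unlink()

    # Assumption: <timestamp> is rendered as Unix seconds.
    archive = snapshot_dir / f"deploy-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in SNAPSHOT_MEMBERS:
            member = server_root / name
            if member.exists():
                # Relative arcname so the archive can be extracted in place.
                tar.add(member, arcname=name)
    return archive
```

Only the three listed members are ever added, so `world/` and `logs/` stay out of the archive by construction.
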

---

## 4. Deployment State Machine

### 4.1 States

```
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
```

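
One way to keep the state machine bounded is an explicit transition table. The sketch below uses the seven states listed above; the table itself is an assumption inferred from the deployment flow and recovery sections of this document, and the names `DeployState`, `ALLOWED`, and `transition` are hypothetical.

```python
from enum import Enum


class DeployState(Enum):
    IDLE = "IDLE"
    DEPLOYING = "DEPLOYING"
    STABILIZING = "STABILIZING"
    STABLE = "STABLE"
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"
    FAILED_RECOVERY = "FAILED_RECOVERY"


S = DeployState

# Assumed transition table; FAILED_RECOVERY is terminal for the automation.
ALLOWED = {
    S.IDLE: {S.DEPLOYING},
    S.DEPLOYING: {S.STABILIZING},
    S.STABILIZING: {S.STABLE, S.ROLLBACK_FILE, S.ROLLBACK_SNAPSHOT},
    S.STABLE: {S.IDLE},
    S.ROLLBACK_FILE: {S.IDLE, S.ROLLBACK_SNAPSHOT},
    S.ROLLBACK_SNAPSHOT: {S.IDLE, S.FAILED_RECOVERY},
    S.FAILED_RECOVERY: set(),
}


def transition(current: DeployState, target: DeployState) -> DeployState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Because no state can reach itself or loop back into a rollback state, any path through the table visits each rollback at most once.
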
### 4.2 Deployment Flow

**Step 1 – Stop Server**
- Ensure clean shutdown
- Reset crash counters

**Step 2 – Create Snapshot**
- Archive `mods/`, `config/`, `server.properties`
- Mark snapshot status = `pending`

**Step 3 – Prepare Shadow** *(if replacing existing mod)*
- If target mod exists, rename:
  ```
  coolmod.jar → coolmod.jar.shadow
  ```

**Step 4 – Install New Mod**
- Validate SHA256 (if Modrinth source)
- Write file to `/mods`
- Set deployment metadata:
  - `lastChangedMod`
  - `lastChangeTimestamp`
  - `lastChangeSource`

**Step 5 – Start Server**
- Transition to `STABILIZING` state
- Begin stabilization timer

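
Steps 3 and 4 can be sketched as a single helper. This is an illustrative sketch under stated assumptions: `install_mod` and `sha256_of` are hypothetical names, the payload is assumed to be already downloaded into memory, and the hash is checked before the rename (a slight reordering of Steps 3 and 4) so that a failed check leaves `mods/` untouched.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def install_mod(
    mods_dir: Path,
    filename: str,
    payload: bytes,
    expected_sha256: Optional[str] = None,
) -> Optional[Path]:
    """Verify the payload, shadow any existing jar, then write the new one.

    Returns the shadow path if an existing mod was replaced, else None.
    """
    # Assumption: dev artifacts pass expected_sha256=None and skip the check.
    if expected_sha256 is not None and sha256_of(payload) != expected_sha256.lower():
        raise ValueError(f"SHA256 mismatch for {filename}")

    target = mods_dir / filename
    shadow = None
    if target.exists():
        shadow = target.with_name(target.name + ".shadow")
        # Path.replace overwrites a stale shadow, so only one ever exists.
        target.replace(shadow)
    target.write_bytes(payload)
    return shadow
```
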

---

## 5. Stabilization Window

### Duration

Hardcoded in Phase 1 (not configurable): a single fixed value between **3 and 5 minutes**.

### Stability Conditions

Server is considered stable when all of the following hold:

- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window

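
The three conditions above combine into a single predicate. The sketch below is illustrative only: `WindowObservation` and `is_stable` are hypothetical names, and the 300-second constant is an assumed pick from the 3 to 5 minute range, not a value fixed by this specification.

```python
from dataclasses import dataclass

# Assumption: 5 minutes, the upper end of the documented range.
STABILIZATION_WINDOW_S = 300.0


@dataclass
class WindowObservation:
    readiness_ok: bool    # readiness probe currently returns success
    uptime_s: float       # continuous uptime of the current server process
    crash_restarts: int   # crash restarts seen since stabilization began


def is_stable(obs: WindowObservation, window_s: float = STABILIZATION_WINDOW_S) -> bool:
    """True only when every stability condition holds at once."""
    return (
        obs.readiness_ok
        and obs.uptime_s >= window_s
        and obs.crash_restarts == 0
    )
```
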
### On Stability Confirmed

- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: `STABLE` → `IDLE`

---

## 6. Failure Detection

### 6.1 Early Boot Crash

**Condition:** Process exits within X seconds of startup (e.g., 30 s) during the stabilization window

**Action:** Transition to `ROLLBACK_FILE`

### 6.2 Crash Loop

**Condition:** ≥ N crashes (e.g., 3) within the stabilization window

**Action:** Transition to `ROLLBACK_SNAPSHOT`

### 6.3 Readiness Timeout

**Condition:** Server fails the readiness probe for the entire stabilization window

**Action:** Transition to `ROLLBACK_SNAPSHOT`

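
The three detectors above can be expressed as one classification function. This is a sketch under assumptions: the thresholds use the example values (30 s and 3 crashes), the names `Action` and `classify_failure` are hypothetical, and checking the crash loop before the early boot crash is an assumed priority that the specification does not state.

```python
from enum import Enum
from typing import Optional

EARLY_CRASH_S = 30.0   # "X seconds", using the example value
CRASH_LOOP_N = 3       # "N crashes", using the example value


class Action(Enum):
    ROLLBACK_FILE = "ROLLBACK_FILE"
    ROLLBACK_SNAPSHOT = "ROLLBACK_SNAPSHOT"


def classify_failure(
    exit_after_s: Optional[float],  # seconds from start to exit, None if running
    crash_count: int,               # crashes seen in this stabilization window
    window_elapsed: bool,           # the full window has passed
    ever_ready: bool,               # readiness probe succeeded at least once
) -> Optional[Action]:
    """Map observations to a recovery action, or None if nothing has failed."""
    # Assumed priority: a crash loop outranks a single early crash.
    if crash_count >= CRASH_LOOP_N:
        return Action.ROLLBACK_SNAPSHOT
    if exit_after_s is not None and exit_after_s <= EARLY_CRASH_S:
        return Action.ROLLBACK_FILE
    if window_elapsed and not ever_ready:
        return Action.ROLLBACK_SNAPSHOT
    return None
```
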

---

## 7. Recovery Logic

### 7.1 File-Level Rollback (`ROLLBACK_FILE`)

**Conditions:**
- `lastChangedMod` exists
- Shadow copy exists

**Action:**
1. Stop server
2. Restore the shadow over the new file: `coolmod.jar.shadow` → `coolmod.jar`
3. Start server
4. Monitor again for a full stabilization window

**If stable:**
- Delete snapshot
- Delete shadow
- Return to `IDLE`

**If still unstable:**
- Escalate to snapshot restore (`ROLLBACK_SNAPSHOT`)

### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)

**Conditions:**
- Snapshot exists
- Snapshot has not already been restored

**Action:**
1. Stop server
2. Extract snapshot archive
3. Start server
4. Mark snapshot as `restored`

**If stable:**
- Delete snapshot
- Clear metadata
- Return to `IDLE`

**If still unstable:**
- Transition to `FAILED_RECOVERY`
- Stop all automation
- Surface failure state to API and UI

---

## 8. Safety Constraints

- Never attempt infinite recovery loops
- Only **one** file rollback attempt allowed
- Only **one** snapshot restore attempt allowed
- If both fail → mark `FAILED_RECOVERY`
- No world data is touched under any circumstance
- No retention of multiple snapshots
- Only one snapshot exists during the stabilization window

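
The bounded escalation these constraints describe can be written without any loop, which makes the two-attempt limit structural rather than counted. The sketch below is illustrative: `recover` is a hypothetical name, and each callable is assumed to perform one complete attempt (stop, restore, start, monitor) and return `True` if the server stabilized.

```python
from typing import Callable


def recover(
    file_rollback: Callable[[], bool],
    snapshot_restore: Callable[[], bool],
    has_shadow: bool,
    has_snapshot: bool,
) -> str:
    """Run at most one file rollback and one snapshot restore, in that order."""
    # Attempt 1: file-level rollback, only if a shadow copy is available.
    if has_shadow and file_rollback():
        return "IDLE"
    # Attempt 2: snapshot restore, only if a snapshot is available.
    if has_snapshot and snapshot_restore():
        return "IDLE"
    # Both avenues exhausted: stop automation and surface the failure.
    return "FAILED_RECOVERY"
```
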

---

## 9. Metadata Tracking (Agent-Level)

Track the following fields (cleared after stabilization success):

```
lastChangedMod
lastChangeTimestamp
lastChangeSource   # modrinth | dev
deploymentState
crashCount
snapshotId
```

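
A small record type is one way to hold these fields. The field names below come from the list above; everything else (the `DeploymentMetadata` class, the JSON file persistence, the default values, and the `save`/`load` helpers) is an assumption for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DeploymentMetadata:
    lastChangedMod: Optional[str] = None
    lastChangeTimestamp: Optional[float] = None
    lastChangeSource: Optional[str] = None  # "modrinth" | "dev"
    deploymentState: str = "IDLE"
    crashCount: int = 0
    snapshotId: Optional[str] = None


def save(meta: DeploymentMetadata, path: Path) -> None:
    """Persist the record as JSON so it survives an agent restart."""
    path.write_text(json.dumps(asdict(meta), indent=2))


def load(path: Path) -> DeploymentMetadata:
    """Read a previously saved record back into a dataclass instance."""
    return DeploymentMetadata(**json.loads(path.read_text()))
```

Under this sketch, clearing after stabilization success amounts to saving a fresh `DeploymentMetadata()`.
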

---

## 10. Logging Requirements

Structured logs must emit events for:

- Deployment started
- Snapshot created
- Shadow created
- Stabilization started
- Crash detected
- File rollback triggered
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed

All events must be visible via agent logs and surfaced through API status endpoints.

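
A structured event could be emitted as one JSON object per line. This sketch assumes snake_case event identifiers derived from the list above and a `ts`/`event` record shape; neither is defined by this specification, and `format_event` is a hypothetical helper.

```python
import json
import time

# Assumed identifiers, one per required event in the list above.
EVENTS = frozenset({
    "deployment_started",
    "snapshot_created",
    "shadow_created",
    "stabilization_started",
    "crash_detected",
    "file_rollback_triggered",
    "snapshot_restore_triggered",
    "deployment_stabilized",
    "recovery_failed",
})


def format_event(event: str, **fields) -> str:
    """Serialize one deployment event as a single JSON log line."""
    if event not in EVENTS:
        raise ValueError(f"unknown deployment event: {event}")
    record = {"ts": time.time(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)
```

For example, `format_event("crash_detected", crashCount=2)` yields a line that both the agent log and an API status endpoint could consume without further parsing rules.
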

---

## 11. Non-Goals (Phase 1)

- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis

---

## 12. Operational Guarantees

This system guarantees:

- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded

---

## 13. Future Extensions (Phase 2+)

- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution

---

## Final Decision Lock

| Decision | Value |
|----------|-------|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded in Phase 1 |
| Autonomy | Agent-level, no operator required |