docs: v2 — incorporate direct upload model, provenance metadata, API streaming, failure categories
parent 7d7e2378f5
commit ca105227c3
@@ -1,238 +1,216 @@
# ZeroLagHub – Mod Deployment Safety & Self-Healing Specification
|
||||
# Mod Deployment Safety Model
|
||||
|
||||
**Version:** 1.0
|
||||
**Phase:** Phase 1 Core Stability
|
||||
**Version:** 2.0
|
||||
**Updated:** 2026-03-01
|
||||
**Phase:** Phase 1 — Active
|
||||
**Applies To:** Game Containers
|
||||
**Owner:** Agent Layer
|
||||
|
||||
---
|
||||
|
||||
## 1. Purpose
|
||||
## Goal
|
||||
|
||||
Define the deterministic, self-healing deployment model for:
|
||||
Allow user uploads and automated mod installs while:
|
||||
|
||||
- Modrinth-based automated mod installs
|
||||
- Dev artifact promotion installs
|
||||
|
||||
The system must:
|
||||
|
||||
- Prevent catastrophic mod deployment failures
|
||||
- Automatically recover from crash loops
|
||||
- Avoid operator intervention
|
||||
- Preserve world/player data
|
||||
- Avoid state explosion and disk bloat
|
||||
|
||||
> This system is **not** a backup solution. It is a **deployment safety mechanism**.
|
||||
- Preventing path escape
|
||||
- Preserving runtime integrity
|
||||
- Tracking provenance
|
||||
- Avoiding hidden deployment layers
|
||||
|
||||
---
|
||||
|
||||
## 2. Scope
|
||||
## Deployment Types
|
||||
|
||||
### Included in Safety Snapshot
|
||||
|
||||
- `/mods`
|
||||
- `/config`
|
||||
- `server.properties`
|
||||
|
||||
### Explicitly Excluded
|
||||
|
||||
- `/world`
|
||||
- `/logs`
|
||||
- Cache directories
|
||||
- Runtime temp files
|
||||
|
||||
World backup is handled by a separate backup system.
|
||||
| Type | Path | Extension |
|------|------|-----------|
| Mod (Modrinth) | `mods/<file>.jar` | `.jar` |
| Mod (user upload) | `mods/<file>.jar` | `.jar` |
| Datapack | `world/datapacks/<file>.zip` | `.zip` |
|
||||
|
||||
---
|
||||
|
||||
## 3. Core Concepts
|
||||
## Direct Runtime Writes
|
||||
|
||||
### 3.1 Shadow Copy (File-Level Protection)
|
||||
Uploads and automated installs are written directly into runtime directories.
|
||||
|
||||
Used for fast rollback of a single mod change.
|
||||
No:
|
||||
- Staging
|
||||
- Symlinking
|
||||
- Delayed deployment
|
||||
|
||||
- Only applies to the most recently modified mod
|
||||
- Only one shadow exists at any time
|
||||
- Shadow lifetime is limited to the stabilization window
|
||||
---
|
||||
|
||||
**Example:**
|
||||
## Atomic Write Process
|
||||
|
||||
```
mods/
  coolmod.jar
  coolmod.jar.shadow
```
|
||||
1. Create temp file in target directory
2. Stream multipart body (or downloaded artifact) into temp file
3. Enforce size limit while streaming
4. `os.Rename()` temp → final filename
5. Update `.zlh_metadata.json`
|
||||
|
||||
If `overwrite=false` and file exists → `409`
|
||||
|
||||
---
|
||||
|
||||
## Size Limits
|
||||
|
||||
Configured in agent:
|
||||
|
||||
| Type | Limit |
|------|-------|
| Mods | 250 MB |
| Datapacks | 100 MB |
|
||||
|
||||
Enforced by the agent during streaming, not at the API layer.
|
||||
|
||||
---
|
||||
|
||||
## Overwrite Behavior
|
||||
|
||||
- `overwrite=false` (default) → reject if file exists → `409`
|
||||
- `overwrite=true` → replace file, update `uploaded_at` in metadata
|
||||
|
||||
---
|
||||
|
||||
## Agent-Side Guarantees
|
||||
|
||||
The agent guarantees:
|
||||
|
||||
- Final path is a file (not directory)
|
||||
- Target is not a symlink
|
||||
- Path resolves within runtime root
|
||||
- Parent directory exists
|
||||
- `.zlh_metadata.json` is hidden from file listing API
|
||||
- `.zlh-shadow` is hidden from file listing API
|
||||
|
||||
---
|
||||
|
||||
## Provenance Metadata
|
||||
|
||||
All user uploads are marked `source: "user"` in `.zlh_metadata.json`.
|
||||
|
||||
```json
{
  "mods/sodium.jar": {
    "source": "user",
    "uploaded_at": "2026-03-01T22:37:01Z"
  }
}
```
|
||||
|
||||
### 3.2 Snapshot (State-Level Protection)
|
||||
Modrinth-installed mods do not currently write provenance. Future curated installs may optionally write `source: "curated"`, but this is not currently implemented.
|
||||
|
||||
Lightweight archive of mod/config state. Created before any mod deployment.
|
||||
No automatic inference of source from filename or path.
|
||||
|
||||
**Example path:**
|
||||
---
|
||||
|
||||
```
.zlh_snapshots/deploy-<timestamp>.tar.gz
```
|
||||
## Failure Categories
|
||||
|
||||
| Code | Meaning |
|------|---------|
| `409` | File exists (`overwrite=false`) |
| `413` | Size limit exceeded |
| `403` | Path rejected by allowlist |
| `502` | API transport failure (not agent validation) |
|
||||
|
||||
---
|
||||
|
||||
## API Transport Considerations
|
||||
|
||||
The API acts strictly as a streaming proxy for uploads.
|
||||
|
||||
```js
req.pipe(proxyReq)
proxyRes.pipe(res)
```
|
||||
|
||||
**Contains:**
|
||||
- Uses raw `http.request` piping — not `fetch`
|
||||
- Does not buffer file contents
|
||||
- Does not re-validate upload policy
|
||||
- Upload timeout must be substantially larger than normal routes
|
||||
|
||||
---
|
||||
|
||||
## Mod Lifecycle (Current Phase 1 State)
|
||||
|
||||
**Install (Modrinth):**
|
||||
Modrinth resolver → API → Agent → Verified download → `<serverRoot>/mods`
|
||||
|
||||
**Install (User Upload):**
|
||||
Portal → API (streaming proxy) → Agent → Atomic write → `<serverRoot>/mods`
|
||||
|
||||
**Enable/Disable:**
|
||||
Filesystem rename: `.jar` ↔ `.jar.disabled`
|
||||
|
||||
**Delete:**
|
||||
Soft delete to `<serverRoot>/mods-removed` (no auto-purge, no retention policy)
|
||||
|
||||
Filesystem is canonical state. Agent cache invalidated after every mutation.
|
||||
|
||||
Restore of deleted mods handled manually via file browser (see `OPEN_THREADS.md`).
|
||||
|
||||
---
|
||||
|
||||
## Self-Healing Model (Automated Installs)
|
||||
|
||||
Applies to Modrinth installs and dev artifact promotions. Does **not** apply to user uploads — users assume responsibility for files they upload directly.
|
||||
|
||||
### Deployment States
|
||||
|
||||
```
IDLE → DEPLOYING → STABILIZING → STABLE → IDLE
                        ↓
                 ROLLBACK_FILE → ROLLBACK_SNAPSHOT → FAILED_RECOVERY
```
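
The diagram can be encoded as a transition table so that illegal transitions are rejected in one place. The sketch below is one reading of the diagram combined with the recovery rules in this document. The `allowed` map and `canTransition` are hypothetical, and treating `FAILED_RECOVERY` as terminal for automation is an assumption.

```go
package main

import "fmt"

// State names are taken from the deployment state diagram.
type State string

const (
	Idle             State = "IDLE"
	Deploying        State = "DEPLOYING"
	Stabilizing      State = "STABILIZING"
	Stable           State = "STABLE"
	RollbackFile     State = "ROLLBACK_FILE"
	RollbackSnapshot State = "ROLLBACK_SNAPSHOT"
	FailedRecovery   State = "FAILED_RECOVERY"
)

// allowed lists the transitions the agent may take from each state.
var allowed = map[State][]State{
	Idle:             {Deploying},
	Deploying:        {Stabilizing},
	Stabilizing:      {Stable, RollbackFile, RollbackSnapshot},
	Stable:           {Idle},
	RollbackFile:     {Idle, RollbackSnapshot}, // stable again, or escalate
	RollbackSnapshot: {Idle, FailedRecovery},   // stable again, or give up
	FailedRecovery:   {},                       // assumption: automation stops here
}

// canTransition reports whether moving from one state to another is permitted.
func canTransition(from, to State) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Stabilizing, RollbackFile))
	fmt.Println(canTransition(FailedRecovery, Deploying))
}
```

Because each rollback state has exactly one escalation edge and no edge back to itself, the table also expresses the "one attempt each" constraint structurally.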
|
||||
|
||||
### Snapshot Scope
|
||||
|
||||
Included:
|
||||
- `mods/`
|
||||
- `config/`
|
||||
- `server.properties`
|
||||
|
||||
Only one active snapshot exists at a time.
|
||||
Excluded:
|
||||
- `world/` (separate backup system)
|
||||
- `logs/`
|
||||
- Cache and temp files
|
||||
|
||||
### Stabilization Window
|
||||
|
||||
Hardcoded Phase 1: **3–5 minutes**
|
||||
|
||||
Server is stable when:
|
||||
- Readiness probe succeeds
|
||||
- No crash restart during the window
|
||||
- Server runs continuously for full window duration
|
||||
|
||||
### Failure Triggers
|
||||
|
||||
| Trigger | Action |
|---------|--------|
| Early boot crash (exit within ~30s) | → `ROLLBACK_FILE` |
| Crash loop (≥3 crashes in window) | → `ROLLBACK_SNAPSHOT` |
| Readiness timeout (full window) | → `ROLLBACK_SNAPSHOT` |
|
||||
|
||||
### Recovery
|
||||
|
||||
**File rollback:** Restore `.jar.shadow` → `.jar`, monitor again. If stable → `IDLE`. If not → escalate.
|
||||
|
||||
**Snapshot restore:** Extract `.zlh_snapshots/deploy-<timestamp>.tar.gz`, monitor. If stable → `IDLE`. If not → `FAILED_RECOVERY`.
|
||||
|
||||
### Safety Constraints
|
||||
|
||||
- Only one file rollback attempt
|
||||
- Only one snapshot restore attempt
|
||||
- Both fail → `FAILED_RECOVERY`, stop automation, surface to API + UI
|
||||
- World data never touched
|
||||
- Only one active snapshot at a time
|
||||
|
||||
---
|
||||
|
||||
## 4. Deployment State Machine
|
||||
## Metadata Tracking (Automated Installs)
|
||||
|
||||
### 4.1 States
|
||||
|
||||
```
|
||||
IDLE
|
||||
DEPLOYING
|
||||
STABILIZING
|
||||
STABLE
|
||||
ROLLBACK_FILE
|
||||
ROLLBACK_SNAPSHOT
|
||||
FAILED_RECOVERY
|
||||
```
|
||||
|
||||
### 4.2 Deployment Flow
|
||||
|
||||
**Step 1 – Stop Server**
|
||||
- Ensure clean shutdown
|
||||
- Reset crash counters
|
||||
|
||||
**Step 2 – Create Snapshot**
|
||||
- Archive `mods/`, `config/`, `server.properties`
|
||||
- Mark snapshot status = `pending`
|
||||
|
||||
**Step 3 – Prepare Shadow** *(if replacing existing mod)*
|
||||
- If target mod exists, rename:
|
||||
```
|
||||
coolmod.jar → coolmod.jar.shadow
|
||||
```
|
||||
|
||||
**Step 4 – Install New Mod**
|
||||
- Validate SHA256 (if Modrinth source)
|
||||
- Write file to `/mods`
|
||||
- Set deployment metadata:
|
||||
- `lastChangedMod`
|
||||
- `changeTimestamp`
|
||||
- `changeSource`
|
||||
|
||||
**Step 5 – Start Server**
|
||||
- Transition to `STABILIZING` state
|
||||
- Begin stabilization timer
|
||||
|
||||
---
|
||||
|
||||
## 5. Stabilization Window
|
||||
|
||||
### Duration
|
||||
|
||||
Hardcoded Phase 1 value: **3–5 minutes**
|
||||
|
||||
### Stability Conditions
|
||||
|
||||
Server is considered stable when:
|
||||
|
||||
- Readiness probe returns success
|
||||
- Server remains running continuously for the full window duration
|
||||
- No crash restart occurred during the window
|
||||
|
||||
### On Stability Confirmed
|
||||
|
||||
- Delete snapshot
|
||||
- Delete shadow copy
|
||||
- Clear deployment metadata
|
||||
- Transition: `STABLE` → `IDLE`
|
||||
|
||||
---
|
||||
|
||||
## 6. Failure Detection
|
||||
|
||||
### 6.1 Early Boot Crash
|
||||
|
||||
**Condition:** Process exits within X seconds (e.g., 30s) during stabilization window
|
||||
|
||||
**Action:** Transition to `ROLLBACK_FILE`
|
||||
|
||||
### 6.2 Crash Loop
|
||||
|
||||
**Condition:** ≥ N crashes (e.g., 3) within stabilization window
|
||||
|
||||
**Action:** Transition to `ROLLBACK_SNAPSHOT`
|
||||
|
||||
### 6.3 Readiness Timeout
|
||||
|
||||
**Condition:** Server fails readiness probe for the entire stabilization window
|
||||
|
||||
**Action:** Transition to `ROLLBACK_SNAPSHOT`
|
||||
|
||||
---
|
||||
|
||||
## 7. Recovery Logic
|
||||
|
||||
### 7.1 File-Level Rollback (`ROLLBACK_FILE`)
|
||||
|
||||
**Conditions:**
|
||||
- `lastChangedMod` exists
|
||||
- Shadow copy exists
|
||||
|
||||
**Action:**
|
||||
1. Stop server
|
||||
2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
|
||||
3. Start server
|
||||
4. Monitor again
|
||||
|
||||
**If stable:**
|
||||
- Delete snapshot
|
||||
- Delete shadow
|
||||
- Return to `IDLE`
|
||||
|
||||
**If still unstable:**
|
||||
- Escalate to snapshot restore
|
||||
|
||||
### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)
|
||||
|
||||
**Conditions:**
|
||||
- Snapshot exists
|
||||
- Snapshot has not already been restored
|
||||
|
||||
**Action:**
|
||||
1. Stop server
|
||||
2. Extract snapshot archive
|
||||
3. Start server
|
||||
4. Mark snapshot as `restored`
|
||||
|
||||
**If stable:**
|
||||
- Delete snapshot
|
||||
- Clear metadata
|
||||
- Return to `IDLE`
|
||||
|
||||
**If still unstable:**
|
||||
- Transition to `FAILED_RECOVERY`
|
||||
- Stop all automation
|
||||
- Surface failure state to API and UI
|
||||
|
||||
---
|
||||
|
||||
## 8. Safety Constraints
|
||||
|
||||
- Never attempt infinite recovery loops
|
||||
- Only **one** file rollback attempt allowed
|
||||
- Only **one** snapshot restore attempt allowed
|
||||
- If both fail → mark `FAILED_RECOVERY`
|
||||
- No world data is touched under any circumstance
|
||||
- No multiple snapshot retention
|
||||
- Only one snapshot exists during the stabilization window
|
||||
|
||||
---
|
||||
|
||||
## 9. Metadata Tracking (Agent-Level)
|
||||
|
||||
Track the following fields (cleared after stabilization success):
|
||||
Agent-level fields, cleared after stabilization success:
|
||||
|
||||
```
|
||||
lastChangedMod
|
||||
@ -245,9 +223,22 @@ snapshotId
|
||||
|
||||
---
|
||||
|
||||
## 10. Logging Requirements
|
||||
## Why No Symlinks
|
||||
|
||||
Structured logs must emit events for:
|
||||
Symlink-based deployment was rejected because:
|
||||
|
||||
- Complicates mod loader behavior
|
||||
- Breaks server-side loader compatibility
|
||||
- Introduces unexpected runtime indirection
|
||||
- Makes provenance ambiguous
|
||||
|
||||
Direct writes + metadata is simpler and safer.
|
||||
|
||||
---
|
||||
|
||||
## Logging Requirements
|
||||
|
||||
Structured logs must emit:
|
||||
|
||||
- Deployment started
|
||||
- Snapshot created
|
||||
@ -258,41 +249,45 @@ Structured logs must emit events for:
|
||||
- Snapshot restore triggered
|
||||
- Deployment stabilized
|
||||
- Recovery failed
|
||||
- User upload received
|
||||
- User upload rejected (with reason)
|
||||
|
||||
All events must be visible via agent logs and surfaced through API status endpoints.
|
||||
All events visible via agent logs and surfaced through API status endpoints.
|
||||
|
||||
---
|
||||
|
||||
## 11. Non-Goals (Phase 1)
|
||||
## Operational Guarantees
|
||||
|
||||
- No multi-mod version history
|
||||
- No dependency resolution
|
||||
- No world snapshotting
|
||||
- No long-term snapshot retention
|
||||
- No manual snapshot management UI
|
||||
- No automatic dependency analysis
|
||||
|
||||
---
|
||||
|
||||
## 12. Operational Guarantees
|
||||
|
||||
This system guarantees:
|
||||
|
||||
- A broken mod will not permanently brick the server
|
||||
- Most failures will auto-recover without operator intervention
|
||||
- A broken automated mod install will not permanently brick the server
|
||||
- Most automated failures auto-recover without operator intervention
|
||||
- User-uploaded files bypass self-healing (user responsibility)
|
||||
- Deployment state will not accumulate disk artifacts
|
||||
- The system remains deterministic and bounded
|
||||
|
||||
---
|
||||
|
||||
## 13. Future Extensions (Phase 2+)
|
||||
## Non-Goals (Phase 1)
|
||||
|
||||
- No multi-mod version history
|
||||
- No dependency resolution
|
||||
- No world snapshotting
|
||||
- No long-term snapshot retention
|
||||
- No manual snapshot UI
|
||||
- No automatic dependency analysis
|
||||
- No virus/malware scanning of user uploads
|
||||
- No retention policy for `mods-removed/`
|
||||
|
||||
---
|
||||
|
||||
## Future Extensions (Phase 2+)
|
||||
|
||||
- Snapshot retention policies
|
||||
- Manual snapshot restore via UI
|
||||
- Multi-mod change grouping
|
||||
- Deployment audit history
|
||||
- Version diffing
|
||||
- Dependency resolution
|
||||
- Provenance for Modrinth installs
|
||||
- Retention automation for `mods-removed/`
|
||||
|
||||
---
|
||||
|
||||
@ -300,31 +295,13 @@ This system guarantees:
|
||||
|
||||
| Decision | Value |
|
||||
|----------|-------|
|
||||
| Shadow copies | Single, file-level only |
|
||||
| Snapshots | Single, temporary only |
|
||||
| Upload model | Direct runtime write, atomic via `os.Rename()` |
|
||||
| Staging | None |
|
||||
| Symlinks | None |
|
||||
| Shadow copies | Single, automated installs only |
|
||||
| Snapshots | Single, temporary, automated installs only |
|
||||
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
|
||||
| World data | Excluded, never touched |
|
||||
| Stabilization window | Fixed, hardcoded Phase 1 |
|
||||
| Autonomy | Agent-level, no operator required |
|
||||
|
||||
---
|
||||
|
||||
## Current Mod Lifecycle Model (2026-02)
|
||||
|
||||
This section reflects the **implemented Phase 1 state** as of February 2026.
|
||||
|
||||
**Install:**
|
||||
Modrinth resolver → API → Agent → Verified download → `<serverRoot>/mods`
|
||||
|
||||
**Enable/Disable:**
|
||||
Filesystem rename: `.jar` ↔ `.jar.disabled`
|
||||
|
||||
**Delete:**
|
||||
Soft delete to `<serverRoot>/mods-removed` (no auto-purge, no retention policy)
|
||||
|
||||
**Filesystem is canonical state.**
|
||||
Agent cache invalidated after every mutation.
|
||||
|
||||
Restore of deleted mods handled manually via file browser (planned — see `OPEN_THREADS.md`).
|
||||
|
||||
No retention policy implemented. No install queue. No DB tracking of mod state.
|
||||
| Provenance | `.zlh_metadata.json` at runtime root |
|
||||
| User upload self-healing | Not applicable — user responsibility |
|
||||
|
||||