docs: v2 — incorporate direct upload model, provenance metadata, API streaming, failure categories

jester 2026-03-01 23:13:09 +00:00
parent 7d7e2378f5
commit ca105227c3


@@ -1,238 +1,216 @@
# ZeroLagHub Mod Deployment Safety & Self-Healing Specification
# Mod Deployment Safety Model
**Version:** 1.0
**Phase:** Phase 1 Core Stability
**Version:** 2.0
**Updated:** 2026-03-01
**Phase:** Phase 1 — Active
**Applies To:** Game Containers
**Owner:** Agent Layer
---
## 1. Purpose
## Goal
Define the deterministic, self-healing deployment model for:
Allow user uploads and automated mod installs while:
- Modrinth-based automated mod installs
- Dev artifact promotion installs
The system must:
- Prevent catastrophic mod deployment failures
- Automatically recover from crash loops
- Avoid operator intervention
- Preserve world/player data
- Avoid state explosion and disk bloat
> This system is **not** a backup solution. It is a **deployment safety mechanism**.
- Preventing path escape
- Preserving runtime integrity
- Tracking provenance
- Avoiding hidden deployment layers
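
The "preventing path escape" goal can be read as a containment check on the resolved target path. The following is a minimal sketch of such a check, written in Node.js to match the API snippet in this document. `resolveInsideRoot` and the example root are hypothetical names, and the agent's real validation may differ.

```js
const path = require("node:path");

// Hypothetical helper: resolve a client-supplied relative path against the
// runtime root and reject anything that would land outside it.
function resolveInsideRoot(runtimeRoot, relativePath) {
  const root = path.resolve(runtimeRoot);
  const target = path.resolve(root, relativePath);
  // path.relative() returns a ".."-prefixed (or absolute) result when
  // `target` is not located under `root`.
  const rel = path.relative(root, target);
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (rel === "" || escapes) {
    return null; // rejected: the root itself, or a path escape
  }
  return target;
}
```

This check is purely lexical. It does not cover symlinks, which would need a separate filesystem check on the target.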
---
## 2. Scope
## Deployment Types
### Included in Safety Snapshot
- `/mods`
- `/config`
- `server.properties`
### Explicitly Excluded
- `/world`
- `/logs`
- Cache directories
- Runtime temp files
World backup is handled by a separate backup system.
| Type | Path | Extension |
|------|------|-----------|
| Mod (Modrinth) | `mods/<file>.jar` | `.jar` |
| Mod (user upload) | `mods/<file>.jar` | `.jar` |
| Datapack | `world/datapacks/<file>.zip` | `.zip` |
---
## 3. Core Concepts
## Direct Runtime Writes
### 3.1 Shadow Copy (File-Level Protection)
Uploads and automated installs are written directly into runtime directories.
Used for fast rollback of a single mod change.
No:
- Staging
- Symlinking
- Delayed deployment
- Only applies to the most recently modified mod
- Only one shadow exists at any time
- Shadow lifetime is limited to the stabilization window
---
**Example:**
## Atomic Write Process
```
mods/
coolmod.jar
coolmod.jar.shadow
1. Create temp file in target directory
2. Stream multipart body (or downloaded artifact) into temp file
3. Enforce size limit while streaming
4. `os.Rename()` temp → final filename
5. Update `.zlh_metadata.json`
If `overwrite=false` and file exists → `409`
---
## Size Limits
Configured in agent:
| Type | Limit |
|------|-------|
| Mods | 250 MB |
| Datapacks | 100 MB |
Limits are enforced by the agent while the body is streaming, not at the API layer.
---
## Overwrite Behavior
- `overwrite=false` (default) → reject if file exists → `409`
- `overwrite=true` → replace file, update `uploaded_at` in metadata
---
## Agent-Side Guarantees
The agent guarantees:
- Final path is a file (not directory)
- Target is not a symlink
- Path resolves within runtime root
- Parent directory exists
- `.zlh_metadata.json` is hidden from file listing API
- `.zlh-shadow` is hidden from file listing API
---
## Provenance Metadata
All user uploads are marked `source: "user"` in `.zlh_metadata.json`.
```json
{
"mods/sodium.jar": {
"source": "user",
"uploaded_at": "2026-03-01T22:37:01Z"
}
}
```
### 3.2 Snapshot (State-Level Protection)
Modrinth-installed mods do not currently write provenance. Future curated installs may optionally write `source: "curated"`; this is not yet implemented.
Lightweight archive of mod/config state. Created before any mod deployment.
No automatic inference of source from filename or path.
**Example path:**
---
```
.zlh_snapshots/deploy-<timestamp>.tar.gz
## Failure Categories
| Code | Meaning |
|------|---------|
| `409` | File exists (`overwrite=false`) |
| `413` | Size limit exceeded |
| `403` | Path rejected by allowlist |
| `502` | API transport failure (not agent validation) |
---
## API Transport Considerations
The API acts strictly as a streaming proxy for uploads.
```js
req.pipe(proxyReq)
proxyRes.pipe(res)
```
**Contains:**
- Uses raw `http.request` piping — not `fetch`
- Does not buffer file contents
- Does not re-validate upload policy
- Upload timeout must be substantially larger than normal routes
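
The bullets above can be sketched as a small proxy function. This is an illustration only: `proxyUpload`, the `agent` address object, and the 10-minute timeout are assumptions rather than values from this spec. Mapping a transport error to `502` follows the Failure Categories table.

```js
const http = require("node:http");

// Assumed value for illustration; the spec only says "substantially larger".
const UPLOAD_TIMEOUT_MS = 10 * 60 * 1000;

// Stream an upload to the agent without buffering or re-validating it.
// `agent` is a hypothetical { host, port } object for the agent endpoint.
function proxyUpload(req, res, agent, agentPath) {
  const proxyReq = http.request(
    {
      host: agent.host,
      port: agent.port,
      path: agentPath,
      method: req.method,
      headers: req.headers,
    },
    (proxyRes) => {
      // Relay the agent's verdict (409, 413, 403, ...) unchanged.
      res.writeHead(proxyRes.statusCode, proxyRes.headers);
      proxyRes.pipe(res);
    }
  );
  proxyReq.setTimeout(UPLOAD_TIMEOUT_MS, () => {
    proxyReq.destroy(new Error("upload timeout"));
  });
  // A failure between API and agent is a transport error, not a validation error.
  proxyReq.on("error", () => {
    if (!res.headersSent) res.writeHead(502);
    res.end();
  });
  req.pipe(proxyReq);
}
```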
---
## Mod Lifecycle (Current Phase 1 State)
**Install (Modrinth):**
Modrinth resolver → API → Agent → Verified download → `<serverRoot>/mods`
**Install (User Upload):**
Portal → API (streaming proxy) → Agent → Atomic write → `<serverRoot>/mods`
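
The atomic write step in this path could look roughly like the sketch below. It is written in Node.js for consistency with the API snippet, whereas the spec names Go's `os.Rename()` for the agent. `atomicWrite`, `sizeLimiter`, the temp-file naming, and the error `status` field are hypothetical, and the metadata update step is omitted.

```js
const fs = require("node:fs");
const path = require("node:path");
const { Transform } = require("node:stream");
const { pipeline } = require("node:stream/promises");

// Pass-through stream that fails once more than `limit` bytes have been seen.
function sizeLimiter(limit) {
  let seen = 0;
  return new Transform({
    transform(chunk, _encoding, done) {
      seen += chunk.length;
      if (seen > limit) {
        done(Object.assign(new Error("size limit exceeded"), { status: 413 }));
      } else {
        done(null, chunk);
      }
    },
  });
}

// Stream `body` into a temp file next to `finalPath`, then rename into place.
async function atomicWrite(body, finalPath, { limit, overwrite = false }) {
  if (!overwrite && fs.existsSync(finalPath)) {
    throw Object.assign(new Error("file exists"), { status: 409 });
  }
  const tmpPath = path.join(
    path.dirname(finalPath),
    `.upload-${process.pid}-${Date.now()}.tmp`
  );
  try {
    // "wx" fails if the temp name already exists instead of truncating it.
    await pipeline(body, sizeLimiter(limit), fs.createWriteStream(tmpPath, { flags: "wx" }));
    // Temp and final live in the same directory, so this is a single rename.
    await fs.promises.rename(tmpPath, finalPath);
  } catch (err) {
    await fs.promises.rm(tmpPath, { force: true }); // drop any partial file
    throw err;
  }
}
```

The existence check and the rename are separate steps here, so two concurrent uploads of the same name could race. A real implementation would need to serialize writes per path or use an exclusive create.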
**Enable/Disable:**
Filesystem rename: `.jar` ↔ `.jar.disabled`
**Delete:**
Soft delete to `<serverRoot>/mods-removed` (no auto-purge, no retention policy)
Filesystem is canonical state. Agent cache invalidated after every mutation.
Restore of deleted mods is handled manually via the file browser (see `OPEN_THREADS.md`).
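
Because the filesystem is the canonical state, enable/disable and delete reduce to renames. The sketch below illustrates that idea. `toggleMod` and `softDelete` are hypothetical names. Name collisions in `mods-removed` are not addressed by this spec, and in this sketch a later delete simply replaces an earlier file of the same name.

```js
const fs = require("node:fs");
const path = require("node:path");

// Toggle a mod between `.jar` and `.jar.disabled`; returns the new path.
function toggleMod(modPath) {
  let next;
  if (modPath.endsWith(".jar.disabled")) {
    next = modPath.slice(0, -".disabled".length); // enable
  } else if (modPath.endsWith(".jar")) {
    next = modPath + ".disabled"; // disable
  } else {
    throw new Error("not a mod file: " + modPath);
  }
  fs.renameSync(modPath, next);
  return next;
}

// Soft delete: move the file into `<serverRoot>/mods-removed` rather than unlinking it.
function softDelete(serverRoot, fileName) {
  const removedDir = path.join(serverRoot, "mods-removed");
  fs.mkdirSync(removedDir, { recursive: true });
  const dest = path.join(removedDir, fileName);
  fs.renameSync(path.join(serverRoot, "mods", fileName), dest);
  return dest;
}
```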
---
## Self-Healing Model (Automated Installs)
Applies to Modrinth installs and dev artifact promotions. Does **not** apply to user uploads — users assume responsibility for files they upload directly.
### Deployment States
```
IDLE → DEPLOYING → STABILIZING → STABLE → IDLE
On failure: ROLLBACK_FILE → ROLLBACK_SNAPSHOT → FAILED_RECOVERY
```
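
One way to keep these states bounded is an explicit transition table. The table below is inferred from the diagram and from the failure triggers and recovery rules in this section. It is an illustrative assumption, not the agent's actual implementation, and `transition` is a hypothetical helper.

```js
// Allowed transitions, inferred from the state diagram and recovery rules.
const TRANSITIONS = {
  IDLE: ["DEPLOYING"],
  DEPLOYING: ["STABILIZING"],
  STABILIZING: ["STABLE", "ROLLBACK_FILE", "ROLLBACK_SNAPSHOT"],
  STABLE: ["IDLE"],
  ROLLBACK_FILE: ["IDLE", "ROLLBACK_SNAPSHOT"],
  ROLLBACK_SNAPSHOT: ["IDLE", "FAILED_RECOVERY"],
  FAILED_RECOVERY: [], // terminal: automation stops here
};

// Return the next state, or throw if the move is not in the table.
function transition(current, next) {
  if (!(TRANSITIONS[current] || []).includes(next)) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  return next;
}
```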
### Snapshot Scope
Included:
- `mods/`
- `config/`
- `server.properties`
Only one active snapshot exists at a time.
Excluded:
- `world/` (separate backup system)
- `logs/`
- Cache and temp files
### Stabilization Window
Hardcoded Phase 1 value: **35 minutes**
Server is stable when:
- Readiness probe succeeds
- No crash restart during the window
- Server runs continuously for full window duration
### Failure Triggers
| Trigger | Action |
|---------|--------|
| Early boot crash (exit within ~30s) | → `ROLLBACK_FILE` |
| Crash loop (≥3 crashes in window) | → `ROLLBACK_SNAPSHOT` |
| Readiness timeout (full window) | → `ROLLBACK_SNAPSHOT` |
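
The trigger table can be expressed as a small classifier. This is a sketch under assumptions: the event shape and the name `rollbackFor` are hypothetical, the 30 s threshold is approximate in the spec, and checking the crash-loop rule before the early-crash rule is a choice made here rather than something the table states.

```js
// Thresholds taken from the table above.
const EARLY_CRASH_MS = 30 * 1000;
const CRASH_LOOP_COUNT = 3;

// Hypothetical event shapes:
//   { kind: "crash", crashCount, uptimeMs }  crashCount = crashes so far in the window
//   { kind: "readiness_timeout" }            probe failed for the full window
function rollbackFor(event) {
  switch (event.kind) {
    case "crash":
      if (event.crashCount >= CRASH_LOOP_COUNT) return "ROLLBACK_SNAPSHOT";
      if (event.uptimeMs <= EARLY_CRASH_MS) return "ROLLBACK_FILE";
      return null; // a late, isolated crash is not covered by the table
    case "readiness_timeout":
      return "ROLLBACK_SNAPSHOT";
    default:
      return null;
  }
}
```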
### Recovery
**File rollback:** Restore `.jar.shadow` → `.jar`, monitor again. If stable → `IDLE`. If not → escalate to snapshot restore.
**Snapshot restore:** Extract `.zlh_snapshots/deploy-<timestamp>.tar.gz`, monitor. If stable → `IDLE`. If not → `FAILED_RECOVERY`.
### Safety Constraints
- Only one file rollback attempt
- Only one snapshot restore attempt
- Both fail → `FAILED_RECOVERY`, stop automation, surface to API + UI
- World data never touched
- Only one active snapshot at a time
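
The "one attempt each" constraints amount to a small recovery budget per deployment. The sketch below shows one way to enforce them. `createRecoveryBudget` is a hypothetical helper, and treating any request after a snapshot restore as `FAILED_RECOVERY` is an interpretation of the escalation order.

```js
// Tracks which recovery steps have been used for the current deployment.
function createRecoveryBudget() {
  const used = { file: false, snapshot: false };
  return {
    // Returns the step to run, escalating past steps that are already spent.
    next(requested) {
      if (used.snapshot) return "FAILED_RECOVERY"; // nothing left to try
      if (requested === "ROLLBACK_FILE" && !used.file) {
        used.file = true;
        return "ROLLBACK_FILE";
      }
      used.snapshot = true;
      return "ROLLBACK_SNAPSHOT";
    },
  };
}
```

Calling `next("ROLLBACK_FILE")` three times on one budget yields `ROLLBACK_FILE`, then `ROLLBACK_SNAPSHOT`, then `FAILED_RECOVERY`, so the loop cannot run unbounded.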
---
## 4. Deployment State Machine
## Metadata Tracking (Automated Installs)
### 4.1 States
```
IDLE
DEPLOYING
STABILIZING
STABLE
ROLLBACK_FILE
ROLLBACK_SNAPSHOT
FAILED_RECOVERY
```
### 4.2 Deployment Flow
**Step 1 Stop Server**
- Ensure clean shutdown
- Reset crash counters
**Step 2 Create Snapshot**
- Archive `mods/`, `config/`, `server.properties`
- Mark snapshot status = `pending`
**Step 3 Prepare Shadow** *(if replacing existing mod)*
- If target mod exists, rename:
```
coolmod.jar → coolmod.jar.shadow
```
**Step 4 Install New Mod**
- Validate SHA256 (if Modrinth source)
- Write file to `/mods`
- Set deployment metadata:
- `lastChangedMod`
- `changeTimestamp`
- `changeSource`
**Step 5 Start Server**
- Transition to `STABILIZING` state
- Begin stabilization timer
---
## 5. Stabilization Window
### Duration
Hardcoded Phase 1 value: **35 minutes**
### Stability Conditions
Server is considered stable when:
- Readiness probe returns success
- Server remains running continuously for the full window duration
- No crash restart occurred during the window
### On Stability Confirmed
- Delete snapshot
- Delete shadow copy
- Clear deployment metadata
- Transition: `STABLE` → `IDLE`
---
## 6. Failure Detection
### 6.1 Early Boot Crash
**Condition:** Process exits within X seconds (e.g., 30s) during stabilization window
**Action:** Transition to `ROLLBACK_FILE`
### 6.2 Crash Loop
**Condition:** ≥ N crashes (e.g., 3) within stabilization window
**Action:** Transition to `ROLLBACK_SNAPSHOT`
### 6.3 Readiness Timeout
**Condition:** Server fails readiness probe for the entire stabilization window
**Action:** Transition to `ROLLBACK_SNAPSHOT`
---
## 7. Recovery Logic
### 7.1 File-Level Rollback (`ROLLBACK_FILE`)
**Conditions:**
- `lastChangedMod` exists
- Shadow copy exists
**Action:**
1. Stop server
2. Replace: `coolmod.jar.shadow` → `coolmod.jar`
3. Start server
4. Monitor again
**If stable:**
- Delete snapshot
- Delete shadow
- Return to `IDLE`
**If still unstable:**
- Escalate to snapshot restore
### 7.2 Snapshot Restore (`ROLLBACK_SNAPSHOT`)
**Conditions:**
- Snapshot exists
- Snapshot has not already been restored
**Action:**
1. Stop server
2. Extract snapshot archive
3. Start server
4. Mark snapshot as `restored`
**If stable:**
- Delete snapshot
- Clear metadata
- Return to `IDLE`
**If still unstable:**
- Transition to `FAILED_RECOVERY`
- Stop all automation
- Surface failure state to API and UI
---
## 8. Safety Constraints
- Never attempt infinite recovery loops
- Only **one** file rollback attempt allowed
- Only **one** snapshot restore attempt allowed
- If both fail → mark `FAILED_RECOVERY`
- No world data is touched under any circumstance
- No multiple snapshot retention
- Only one snapshot exists during the stabilization window
---
## 9. Metadata Tracking (Agent-Level)
Track the following fields (cleared after stabilization success):
Agent-level fields, cleared after stabilization success:
```
lastChangedMod
@@ -245,9 +223,22 @@ snapshotId
---
## 10. Logging Requirements
## Why No Symlinks
Structured logs must emit events for:
Symlink-based deployment was rejected because it:
- Complicates mod loader behavior
- Breaks server-side loader compatibility
- Introduces unexpected runtime indirection
- Makes provenance ambiguous
Direct writes + metadata is simpler and safer.
---
## Logging Requirements
Structured logs must emit:
- Deployment started
- Snapshot created
@@ -258,41 +249,45 @@ Structured logs must emit events for:
- Snapshot restore triggered
- Deployment stabilized
- Recovery failed
- User upload received
- User upload rejected (with reason)
All events must be visible via agent logs and surfaced through API status endpoints.
All events are visible in agent logs and surfaced through API status endpoints.
---
## 11. Non-Goals (Phase 1)
## Operational Guarantees
- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot management UI
- No automatic dependency analysis
---
## 12. Operational Guarantees
This system guarantees:
- A broken mod will not permanently brick the server
- Most failures will auto-recover without operator intervention
- A broken automated mod install will not permanently brick the server
- Most automated failures auto-recover without operator intervention
- User-uploaded files bypass self-healing (user responsibility)
- Deployment state will not accumulate disk artifacts
- The system remains deterministic and bounded
---
## 13. Future Extensions (Phase 2+)
## Non-Goals (Phase 1)
- No multi-mod version history
- No dependency resolution
- No world snapshotting
- No long-term snapshot retention
- No manual snapshot UI
- No automatic dependency analysis
- No virus/malware scanning of user uploads
- No retention policy for `mods-removed/`
---
## Future Extensions (Phase 2+)
- Snapshot retention policies
- Manual snapshot restore via UI
- Multi-mod change grouping
- Deployment audit history
- Version diffing
- Dependency resolution
- Provenance for Modrinth installs
- Retention automation for `mods-removed/`
---
@@ -300,31 +295,13 @@ This system guarantees:
| Decision | Value |
|----------|-------|
| Shadow copies | Single, file-level only |
| Snapshots | Single, temporary only |
| Upload model | Direct runtime write, atomic via `os.Rename()` |
| Staging | None |
| Symlinks | None |
| Shadow copies | Single, automated installs only |
| Snapshots | Single, temporary, automated installs only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Autonomy | Agent-level, no operator required |
---
## Current Mod Lifecycle Model (2026-02)
This section reflects the **implemented Phase 1 state** as of February 2026.
**Install:**
Modrinth resolver → API → Agent → Verified download → `<serverRoot>/mods`
**Enable/Disable:**
Filesystem rename: `.jar` ↔ `.jar.disabled`
**Delete:**
Soft delete to `<serverRoot>/mods-removed` (no auto-purge, no retention policy)
**Filesystem is canonical state.**
Agent cache invalidated after every mutation.
Restore of deleted mods handled manually via file browser (planned — see `OPEN_THREADS.md`).
No retention policy implemented. No install queue. No DB tracking of mod state.
| Provenance | `.zlh_metadata.json` at runtime root |
| User upload self-healing | Not applicable — user responsibility |