zlh-grind/docs/architecture/mod-deployment-safety.md


Mod Deployment Safety Model

Version: 2.0
Updated: 2026-03-01
Phase: Phase 1 (Active)
Applies To: Game Containers
Owner: Agent Layer


Goal

Allow user uploads and automated mod installs while:

  • Preventing path escape
  • Preserving runtime integrity
  • Tracking provenance
  • Avoiding hidden deployment layers

Deployment Types

| Type | Path | Extension |
|---|---|---|
| Mod (Modrinth) | mods/<file>.jar | .jar |
| Mod (user upload) | mods/<file>.jar | .jar |
| Datapack | world/datapacks/<file>.zip | .zip |

Direct Runtime Writes

Uploads and automated installs are written directly into runtime directories.

No:

  • Staging
  • Symlinking
  • Delayed deployment

Atomic Write Process

  1. Create temp file in target directory
  2. Stream multipart body (or downloaded artifact) into temp file
  3. Enforce size limit while streaming
  4. os.Rename() temp → final filename
  5. Update .zlh_metadata.json

If overwrite=false and file exists → 409


Size Limits

Configured in agent:

| Type | Limit |
|---|---|
| Mods | 250MB |
| Datapacks | 100MB |

Limits are enforced by the agent while the body is streaming, not at the API layer.


Overwrite Behavior

  • overwrite=false (default) → reject if file exists → 409
  • overwrite=true → replace file, update uploaded_at in metadata

Agent-Side Guarantees

The agent guarantees:

  • Final path is a file (not directory)
  • Target is not a symlink
  • Path resolves within runtime root
  • Parent directory exists
  • .zlh_metadata.json is hidden from file listing API
  • .zlh-shadow is hidden from file listing API

Provenance Metadata

All user uploads are marked source: "user" in .zlh_metadata.json.

{
  "mods/sodium.jar": {
    "source": "user",
    "uploaded_at": "2026-03-01T22:37:01Z"
  }
}

Modrinth-installed mods do not currently write provenance. Future curated installs may optionally write source: "curated"; this is not currently implemented.

No automatic inference of source from filename or path.


Failure Categories

| Code | Meaning |
|---|---|
| 409 | File exists (overwrite=false) |
| 413 | Size limit exceeded |
| 403 | Path rejected by allowlist |
| 502 | API transport failure (not agent validation) |

API Transport Considerations

The API acts strictly as a streaming proxy for uploads.

req.pipe(proxyReq)
proxyRes.pipe(res)

  • Uses raw http.request piping, not fetch
  • Does not buffer file contents
  • Does not re-validate upload policy
  • Upload timeout must be substantially larger than normal routes

Mod Lifecycle (Current Phase 1 State)

Install (Modrinth): Modrinth resolver → API → Agent → Verified download → <serverRoot>/mods

Install (User Upload): Portal → API (streaming proxy) → Agent → Atomic write → <serverRoot>/mods

Enable/Disable: Filesystem rename: .jar ↔ .jar.disabled

Delete: Soft delete to <serverRoot>/mods-removed (no auto-purge, no retention policy)

Filesystem is canonical state. Agent cache invalidated after every mutation.

Restore of deleted mods handled manually via file browser (see OPEN_THREADS.md).


Self-Healing Model (Automated Installs)

Applies to Modrinth installs and dev artifact promotions. Does not apply to user uploads — users assume responsibility for files they upload directly.

Deployment States

IDLE → DEPLOYING → STABILIZING → STABLE → IDLE
                              ↓
                       ROLLBACK_FILE → ROLLBACK_SNAPSHOT → FAILED_RECOVERY

Snapshot Scope

Included:

  • mods/
  • config/
  • server.properties

Excluded:

  • world/ (separate backup system)
  • logs/
  • Cache and temp files

Stabilization Window

Hardcoded Phase 1: 35 minutes

Server is stable when:

  • Readiness probe succeeds
  • No crash restart during the window
  • Server runs continuously for full window duration

Failure Triggers

| Trigger | Action |
|---|---|
| Early boot crash (exit within ~30s) | ROLLBACK_FILE |
| Crash loop (≥3 crashes in window) | ROLLBACK_SNAPSHOT |
| Readiness timeout (full window) | ROLLBACK_SNAPSHOT |

Recovery

File rollback: Restore the shadow copy (.jar.shadow → .jar) and monitor again. If stable → IDLE. If not → escalate to snapshot restore.

Snapshot restore: Extract .zlh_snapshots/deploy-<timestamp>.tar.gz, monitor. If stable → IDLE. If not → FAILED_RECOVERY.

Safety Constraints

  • Only one file rollback attempt
  • Only one snapshot restore attempt
  • Both fail → FAILED_RECOVERY, stop automation, surface to API + UI
  • World data never touched
  • Only one active snapshot at a time
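
Because each recovery step is attempted at most once, the escalation ladder can be expressed as a transition function that only moves forward. This is a sketch under that reading. The state names come from the diagram above, while the `afterRecovery` function itself is hypothetical.

```go
package main

import "fmt"

type State string

const (
	Idle             State = "IDLE"
	RollbackFile     State = "ROLLBACK_FILE"
	RollbackSnapshot State = "ROLLBACK_SNAPSHOT"
	FailedRecovery   State = "FAILED_RECOVERY"
)

// afterRecovery returns the state that follows one monitored recovery
// attempt. No transition leads back to an earlier recovery state, so each
// attempt can happen only once per deployment.
func afterRecovery(attempt State, stable bool) (State, error) {
	switch attempt {
	case RollbackFile:
		if stable {
			return Idle, nil
		}
		return RollbackSnapshot, nil
	case RollbackSnapshot:
		if stable {
			return Idle, nil
		}
		return FailedRecovery, nil
	}
	return attempt, fmt.Errorf("%s is not a recovery state", attempt)
}

func main() {
	// Worst case: both recovery attempts fail to stabilize.
	s := RollbackFile
	for s == RollbackFile || s == RollbackSnapshot {
		next, err := afterRecovery(s, false)
		if err != nil {
			panic(err)
		}
		fmt.Println(s, "->", next)
		s = next
	}
}
```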

Metadata Tracking (Automated Installs)

Agent-level fields, cleared after stabilization success:

lastChangedMod
lastChangeTimestamp
lastChangeSource    # modrinth | dev
deploymentState
crashCount
snapshotId

Symlink-based deployment was rejected because:

  • Complicates mod loader behavior
  • Breaks server-side loader compatibility
  • Introduces unexpected runtime indirection
  • Makes provenance ambiguous

Direct writes + metadata is simpler and safer.


Logging Requirements

Structured logs must emit:

  • Deployment started
  • Snapshot created
  • Shadow created
  • Stabilization started
  • Crash detected
  • File rollback triggered
  • Snapshot restore triggered
  • Deployment stabilized
  • Recovery failed
  • User upload received
  • User upload rejected (with reason)

All events are visible in agent logs and surfaced through API status endpoints.


Operational Guarantees

  • A broken automated mod install will not permanently brick the server
  • Most automated failures auto-recover without operator intervention
  • User-uploaded files bypass self-healing (user responsibility)
  • Deployment state will not accumulate disk artifacts
  • The system remains deterministic and bounded

Non-Goals (Phase 1)

  • No multi-mod version history
  • No dependency resolution
  • No world snapshotting
  • No long-term snapshot retention
  • No manual snapshot UI
  • No automatic dependency analysis
  • No virus/malware scanning of user uploads
  • No retention policy for mods-removed/

Future Extensions (Phase 2+)

  • Snapshot retention policies
  • Manual snapshot restore via UI
  • Multi-mod change grouping
  • Deployment audit history
  • Version diffing
  • Provenance for Modrinth installs
  • Retention automation for mods-removed/

Final Decision Lock

| Decision | Value |
|---|---|
| Upload model | Direct runtime write, atomic via os.Rename() |
| Staging | None |
| Symlinks | None |
| Shadow copies | Single, automated installs only |
| Snapshots | Single, temporary, automated installs only |
| Escalation model | File rollback → snapshot restore → FAILED_RECOVERY |
| World data | Excluded, never touched |
| Stabilization window | Fixed, hardcoded Phase 1 |
| Provenance | .zlh_metadata.json at runtime root |
| User upload self-healing | Not applicable (user responsibility) |