
ZeroLagHub Infrastructure Architecture

Three-Layer Model

The platform is organized into three distinct layers.

Control Plane  →  Edge / Gateway  →  Runtime

This separation ensures that customer workloads cannot affect platform availability, and that the platform can always manage the runtime even if the runtime is degraded.


Layer 1 — Control Plane (Core Host)

Mission critical. Must never go down or the platform is dead.

Low CPU usage but high availability requirement.

Components:

  • zpack-api — platform API
  • zpack-portal — Next.js frontend
  • MariaDB — primary database
  • Redis — cache + agent state
  • Technitium DNS (zlh-dns)
  • Prometheus/Grafana monitoring (zlh-monitor)
  • PBS backup (zlh-back)
  • Headscale (zlh-ctl) — dev access VPN
  • zlh-proxy — Traefik for core/portal SSL termination
  • core OPNsense

Rule: Control plane must never depend on runtime.


Layer 2 — Edge / Gateway Layer

Handles incoming traffic routing. Sits between the internet and the runtime.

Components:

  • zlh-zpack-proxy — Traefik for game/dev runtime traffic
  • zlh-velocity — Minecraft Velocity proxy
  • game/dev OPNsense (zlh-zpack-router)

The two Traefik instances map cleanly to this split:

Instance                    Routes
zlh-proxy (core)            portal, API, monitoring, Headscale
zlh-zpack-proxy (runtime)   game servers, dev IDE (future)

Velocity belongs with runtime — proxying game traffic locally avoids cross-host latency.


Layer 3 — Runtime Layer (Runtime Host)

Customer workloads. Noisy and unpredictable. Isolated from control plane.

Components:

  • game containers
  • dev containers
  • zlh-agent (inside every container)
  • build systems
  • zlh-artifacts — runtime binaries + server jars

Rule: Runtime must not be able to break control plane.


Host Placement

Core Host (small, stable)

  • core OPNsense
  • zlh-proxy (Traefik — core traffic)
  • zpack-api
  • zpack-portal
  • MariaDB
  • Redis
  • zlh-dns (Technitium)
  • zlh-monitor (Prometheus/Grafana)
  • zlh-back (PBS backup)
  • zlh-ctl (Headscale)

Runtime Host (large, beefy)

  • game-dev OPNsense
  • zlh-zpack-proxy (Traefik — runtime traffic)
  • zlh-velocity (Minecraft proxy)
  • game containers
  • dev containers
  • agents
  • build systems
  • zlh-artifacts
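
Encoding this placement as data lets automation verify the core/runtime split before provisioning. A minimal sketch; the inventory sets and the check are illustrative, not part of the platform:

```python
# Hypothetical inventory derived from the placement lists above.
CORE_HOST = {
    "core-opnsense", "zlh-proxy", "zpack-api", "zpack-portal",
    "mariadb", "redis", "zlh-dns", "zlh-monitor", "zlh-back", "zlh-ctl",
}
RUNTIME_HOST = {
    "game-dev-opnsense", "zlh-zpack-proxy", "zlh-velocity",
    "game-containers", "dev-containers", "zlh-agent",
    "build-systems", "zlh-artifacts",
}

def placement_is_isolated(core: set[str], runtime: set[str]) -> bool:
    """No component may live on both hosts (design principles 1 and 2)."""
    return core.isdisjoint(runtime)
```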

Agent Communication Path

API → agent crosses the host boundary via management network.

Core Host (API)
  ↓ management network
Runtime Host
  ↓
agent :18888 inside container

Firewall rule: only the API (core host) may reach agent ports (:18888) on the runtime host. No other source should be able to reach agents directly.
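
The intent of that rule can be expressed as a small predicate, useful in tests for whatever generator produces the actual OPNsense rules. The management-network address below is a placeholder, not the real one:

```python
import ipaddress

AGENT_PORT = 18888
# Placeholder management-network address for the core host (API);
# substitute the real address in deployment.
CORE_MGMT_IP = ipaddress.ip_address("10.10.0.10")

def allow_to_agent(src_ip: str, dst_port: int) -> bool:
    """Only the core-host API may reach agent ports on the runtime host."""
    if dst_port != AGENT_PORT:
        return True  # this rule only constrains agent traffic
    return ipaddress.ip_address(src_ip) == CORE_MGMT_IP
```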


Failure Isolation

If runtime host degrades or goes down:

  • control plane stays up
  • operators can still log in, manage servers, and redeploy containers
  • platform is degraded but not dead

If core host degrades:

  • runtime containers continue running
  • game servers stay up
  • management capability is lost until core recovers

VMID Allocation Scheme

Block-based allocation makes the Proxmox UI readable and automation predictable. Container role is derivable from VMID range.

Range        Purpose
100–199      Core infrastructure
200–299      Base templates
300–399      Network / proxy services
1000–1099    Game containers
1100–1199    Dev containers
2000+        Monitoring / backup

Design Principles

  1. Control plane never depends on runtime
  2. Runtime cannot break control plane
  3. Management network is the only path from core to runtime agents
  4. Running two Traefik instances (one per traffic domain) is the right split
  5. Velocity stays with runtime to keep game traffic local
  6. VMID ranges communicate role without lookup