diff --git a/INFRASTRUCTURE_ARCHITECTURE.md b/INFRASTRUCTURE_ARCHITECTURE.md
new file mode 100644
index 0000000..67cae7d
--- /dev/null
+++ b/INFRASTRUCTURE_ARCHITECTURE.md
@@ -0,0 +1,166 @@
+# ZeroLagHub – Infrastructure Architecture
+
+## Three-Layer Model
+
+The platform is organized into three distinct layers.
+
+```
+Control Plane → Edge / Gateway → Runtime
+```
+
+This separation ensures that customer workloads cannot affect platform
+availability, and that the platform can always manage the runtime even
+if the runtime is degraded.
+
+---
+
+## Layer 1 — Control Plane (Core Host)
+
+Mission critical. Must never go down or the platform is dead.
+
+Low CPU usage but high availability requirement.
+
+Components:
+
+- `zpack-api` — platform API
+- `zpack-portal` — Next.js frontend
+- MariaDB — primary database
+- Redis — cache + agent state
+- Technitium DNS (`zlh-dns`)
+- Prometheus/Grafana monitoring (`zlh-monitor`)
+- PBS backup (`zlh-back`)
+- Headscale (`zlh-ctl`) — dev access VPN
+- `zlh-proxy` — Traefik for core/portal SSL termination
+- core OPNsense
+
+Rule: **Control plane must never depend on runtime.**
+
+---
+
+## Layer 2 — Edge / Gateway Layer
+
+Handles incoming traffic routing. Sits between the internet and the runtime.
+
+Components:
+
+- `zlh-zpack-proxy` — Traefik for game/dev runtime traffic
+- `zlh-velocity` — Minecraft Velocity proxy
+- game/dev OPNsense (`zlh-zpack-router`)
+
+The two Traefik instances map cleanly to this split:
+
+| Instance | Routes |
+|----------|--------|
+| `zlh-proxy` (core) | portal, API, monitoring, Headscale |
+| `zlh-zpack-proxy` (runtime) | game servers, dev IDE (future) |
+
+Velocity runs on the runtime host: proxying game traffic locally avoids
+cross-host latency.
+
+---
+
+## Layer 3 — Runtime Layer (Runtime Host)
+
+Customer workloads. Noisy and unpredictable. Isolated from control plane.
+
+Components:
+
+- game containers
+- dev containers
+- zlh-agent (inside every container)
+- build systems
+- `zlh-artifacts` — runtime binaries + server jars
+
+Rule: **Runtime must not be able to break control plane.**
+
+---
+
+## Host Placement
+
+### Core Host (small, stable)
+
+```
+core OPNsense
+zlh-proxy (Traefik — core traffic)
+zpack-api
+zpack-portal
+MariaDB
+Redis
+zlh-dns (Technitium)
+zlh-monitor (Prometheus/Grafana)
+zlh-back (PBS backup)
+zlh-ctl (Headscale)
+```
+
+### Runtime Host (large, beefy)
+
+```
+game/dev OPNsense (zlh-zpack-router)
+zlh-zpack-proxy (Traefik — runtime traffic)
+zlh-velocity (Minecraft proxy)
+game containers
+dev containers
+agents
+build systems
+zlh-artifacts
+```
+
+---
+
+## Agent Communication Path
+
+API → agent traffic crosses the host boundary via the management network.
+
+```
+Core Host (API)
+  ↓ management network
+Runtime Host
+  ↓
+agent :18888 inside container
+```
+
+Firewall rule: only the API (core host) may reach agent ports (:18888) on
+the runtime host. No other source should be able to reach agents directly.
+
+---
+
+## Failure Isolation
+
+If the runtime host degrades or goes down:
+
+- control plane stays up
+- operators can still log in, manage servers, redeploy containers
+- platform is degraded but not dead
+
+If the core host degrades:
+
+- runtime containers continue running
+- game servers stay up
+- management capability is lost until core recovers
+
+---
+
+## VMID Allocation Scheme
+
+Block-based allocation makes the Proxmox UI readable and automation
+predictable. Container role is derivable from VMID range.
+
+| Range | Purpose |
+|-------|---------|
+| 100–199 | Core infrastructure |
+| 200–299 | Base templates |
+| 300–399 | Network / proxy services |
+| 1000–1099 | Game containers |
+| 1100–1199 | Dev containers |
+| 2000+ | Monitoring / backup |
+
+---
+
+## Design Principles
+
+1. Control plane never depends on runtime
+2. Runtime cannot break control plane
+3. Management network is the only path from core to runtime agents
+4. Two Traefik instances, one per traffic domain, is the correct split
+5. Velocity stays with runtime to keep game traffic local
+6. VMID ranges communicate role without lookup
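
The VMID Allocation Scheme in the diff above states that a container's role is derivable from its VMID range. A minimal Python sketch of that lookup follows, assuming the ranges in the table are inclusive and that any ID outside every block is simply unallocated. The names `VMID_BLOCKS` and `role_for_vmid` are hypothetical and do not come from the ZeroLagHub codebase.

```python
from typing import Optional

# (first VMID, last VMID or None for an open-ended block, purpose).
# Values are copied from the allocation table; inclusive bounds are an assumption.
VMID_BLOCKS = [
    (100, 199, "Core infrastructure"),
    (200, 299, "Base templates"),
    (300, 399, "Network / proxy services"),
    (1000, 1099, "Game containers"),
    (1100, 1199, "Dev containers"),
    (2000, None, "Monitoring / backup"),
]


def role_for_vmid(vmid: int) -> Optional[str]:
    """Return the purpose of the block containing vmid, or None if unallocated."""
    for first, last, purpose in VMID_BLOCKS:
        if vmid >= first and (last is None or vmid <= last):
            return purpose
    return None
```

Under these assumptions, `role_for_vmid(1042)` returns `"Game containers"`, while an ID such as 500 falls in no block and returns `None`, which automation could treat as a misallocated container.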