Add infrastructure architecture doc — three-layer model, host placement, VMID scheme
parent 58ba631350
commit 75ca1303e9
INFRASTRUCTURE_ARCHITECTURE.md (new file, 166 lines)
@@ -0,0 +1,166 @@
# ZeroLagHub – Infrastructure Architecture

## Three-Layer Model

The platform is organized into three distinct layers:

```
Control Plane → Edge / Gateway → Runtime
```

This separation ensures that customer workloads cannot affect platform availability, and that the platform can always manage the runtime even if the runtime is degraded.
---

## Layer 1 — Control Plane (Core Host)

Mission-critical: if the control plane goes down, the platform is dead. CPU usage is low, but the availability requirement is absolute.

Components:

- `zpack-api` — platform API
- `zpack-portal` — Next.js frontend
- MariaDB — primary database
- Redis — cache + agent state
- Technitium DNS (`zlh-dns`)
- Prometheus/Grafana monitoring (`zlh-monitor`)
- PBS backup (`zlh-back`)
- Headscale (`zlh-ctl`) — dev access VPN
- `zlh-proxy` — Traefik for core/portal SSL termination
- core OPNsense

Rule: **Control plane must never depend on runtime.**
---

## Layer 2 — Edge / Gateway Layer

Handles incoming traffic routing; sits between the internet and the runtime.

Components:

- `zlh-zpack-proxy` — Traefik for game/dev runtime traffic
- `zlh-velocity` — Minecraft Velocity proxy
- game/dev OPNsense (`zlh-zpack-router`)

The two Traefik instances map cleanly to this split:

| Instance | Routes |
|----------|--------|
| `zlh-proxy` (core) | portal, API, monitoring, Headscale |
| `zlh-zpack-proxy` (runtime) | game servers, dev IDE (future) |

Velocity belongs with the runtime — proxying game traffic locally avoids cross-host latency.
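The routing split above can be sketched as a simple lookup. The service names here are illustrative labels, not actual Traefik router configuration:

```python
# Illustrative model of the two-Traefik split. Service names are examples
# invented for this sketch, not real router rules.
CORE_SERVICES = {"portal", "api", "monitoring", "headscale"}
RUNTIME_SERVICES = {"game", "dev-ide"}

def traefik_instance(service: str) -> str:
    """Return which Traefik instance routes a given service."""
    if service in CORE_SERVICES:
        return "zlh-proxy"        # core traffic: portal, API, monitoring
    if service in RUNTIME_SERVICES:
        return "zlh-zpack-proxy"  # runtime traffic: game servers, dev IDE
    raise ValueError(f"unknown service: {service}")

print(traefik_instance("portal"))  # zlh-proxy
print(traefik_instance("game"))    # zlh-zpack-proxy
```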
---

## Layer 3 — Runtime Layer (Runtime Host)

Customer workloads: noisy and unpredictable, therefore isolated from the control plane.

Components:

- game containers
- dev containers
- `zlh-agent` (inside every container)
- build systems
- `zlh-artifacts` — runtime binaries + server jars

Rule: **Runtime must not be able to break control plane.**
---

## Host Placement

### Core Host (small, stable)

```
core OPNsense
zlh-proxy (Traefik — core traffic)
zpack-api
zpack-portal
MariaDB
Redis
zlh-dns (Technitium)
zlh-monitor (Prometheus/Grafana)
zlh-back (PBS backup)
zlh-ctl (Headscale)
```

### Runtime Host (large, beefy)

```
game-dev OPNsense
zlh-zpack-proxy (Traefik — runtime traffic)
zlh-velocity (Minecraft proxy)
game containers
dev containers
agents
build systems
zlh-artifacts
```
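For automation, the placement above can be expressed as data. This is a sketch only; the component labels are taken from the lists above, normalized into identifiers for illustration:

```python
# Host placement as a lookup table (labels normalized for this sketch).
PLACEMENT = {
    "core": [
        "core-opnsense", "zlh-proxy", "zpack-api", "zpack-portal",
        "mariadb", "redis", "zlh-dns", "zlh-monitor", "zlh-back", "zlh-ctl",
    ],
    "runtime": [
        "game-dev-opnsense", "zlh-zpack-proxy", "zlh-velocity",
        "game-containers", "dev-containers", "agents",
        "build-systems", "zlh-artifacts",
    ],
}

def host_for(component: str) -> str:
    """Look up which host a component is placed on."""
    for host, components in PLACEMENT.items():
        if component in components:
            return host
    raise KeyError(f"unplaced component: {component}")

print(host_for("zpack-api"))     # core
print(host_for("zlh-velocity"))  # runtime
```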
---

## Agent Communication Path

API → agent traffic crosses the host boundary via the management network:

```
Core Host (API)
  ↓ management network
Runtime Host
  ↓
agent :18888 inside container
```

Firewall rule: only the API (core host) may reach agent ports (:18888) on the runtime host. No other source should be able to reach agents directly.
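The intent of that rule can be modeled with Python's `ipaddress` module. The subnet and API address below are placeholders — the document does not specify the actual management-network addressing:

```python
import ipaddress

# Placeholder addressing; the real management network is not specified here.
MANAGEMENT_NET = ipaddress.ip_network("10.10.0.0/24")
API_HOST = ipaddress.ip_address("10.10.0.5")
AGENT_PORT = 18888

def may_reach_agent(src_ip: str, dst_port: int) -> bool:
    """Model the firewall intent: only the API host, over the management
    network, may reach agent ports on the runtime host."""
    src = ipaddress.ip_address(src_ip)
    return dst_port == AGENT_PORT and src == API_HOST and src in MANAGEMENT_NET

print(may_reach_agent("10.10.0.5", 18888))   # True: the API host
print(may_reach_agent("10.10.0.99", 18888))  # False: some other host
```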
---

## Failure Isolation

If the runtime host degrades or goes down:

- the control plane stays up
- operators can still log in, manage servers, and redeploy containers
- the platform is degraded but not dead

If the core host degrades:

- runtime containers continue running
- game servers stay up
- management capability is lost until the core recovers
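The failure matrix above reduces to a small function. Capability names are illustrative:

```python
def surviving_capabilities(core_up: bool, runtime_up: bool) -> set:
    """Illustrative model of the failure-isolation matrix: each host
    contributes its own capabilities independently of the other."""
    caps = set()
    if core_up:
        caps |= {"operator-login", "management", "redeploy"}
    if runtime_up:
        caps |= {"game-servers", "dev-containers"}
    return caps

# Runtime down: degraded but still manageable.
print(surviving_capabilities(core_up=True, runtime_up=False))
# Core down: workloads keep running, management is lost.
print(surviving_capabilities(core_up=False, runtime_up=True))
```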
---

## VMID Allocation Scheme

Block-based allocation makes the Proxmox UI readable and automation predictable: a container's role is derivable from its VMID range.

| Range | Purpose |
|-------|---------|
| 100–199 | Core infrastructure |
| 200–299 | Base templates |
| 300–399 | Network / proxy services |
| 1000–1099 | Game containers |
| 1100–1199 | Dev containers |
| 2000+ | Monitoring / backup |
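The "role derivable from VMID" property can be sketched as a lookup that mirrors the table; role labels are normalized identifiers invented for this example:

```python
# VMID block → role, mirroring the allocation table above.
VMID_RANGES = [
    (100, 199, "core-infrastructure"),
    (200, 299, "base-template"),
    (300, 399, "network-proxy"),
    (1000, 1099, "game-container"),
    (1100, 1199, "dev-container"),
]

def role_for_vmid(vmid: int) -> str:
    """Derive a container's role from its VMID per the block scheme."""
    for lo, hi, role in VMID_RANGES:
        if lo <= vmid <= hi:
            return role
    if vmid >= 2000:
        return "monitoring-backup"
    raise ValueError(f"VMID {vmid} falls outside every allocated block")

print(role_for_vmid(1042))  # game-container
print(role_for_vmid(2001))  # monitoring-backup
```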
---

## Design Principles

1. Control plane never depends on runtime
2. Runtime cannot break control plane
3. Management network is the only path from core to runtime agents
4. Two Traefik instances — one per traffic domain — is correct
5. Velocity stays with runtime to keep game traffic local
6. VMID ranges communicate role without lookup