🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System
Last Updated: December 7, 2025
Version: 1.0 (Canonical Architecture Enforcement)
Status: ACTIVE - Must Be Consulted Before All Code Changes
🎯 Document Purpose
This document establishes architectural boundaries across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.
Critical Insight: Most project failures come from gradual architectural erosion, not sudden breaking changes. This document prevents that.
📊 The Three Systems (Canonical Ownership)
┌─────────────────────────────────────────────────────────────┐
│ NODE.JS API v2 │
│ (Orchestration Engine) │
│ │
│ Owns: VMID allocation, port management, DNS publishing, │
│ Velocity registration, job queue, database state │
│ │
│ Speaks To: Proxmox, Cloudflare, Technitium, Velocity, │
│ MariaDB, BullMQ, Go Agent (HTTP) │
│ │
│ Never Touches: Container filesystem, game server files, │
│ Java installation, artifact downloads │
└─────────────────────┬───────────────────────────────────────┘
│
│ HTTP Contract
│ POST /config, /start, /stop
│ GET /status, /health
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GO AGENT │
│ (Container-Internal Manager) │
│ │
│ Owns: Server installation, Java runtime, artifact │
│ downloads, process management, READY detection, │
│ filesystem layout, verification + self-repair │
│ │
│ Speaks To: Local filesystem, game server process, │
│ API (status updates via HTTP polling) │
│ │
│ Never Touches: Proxmox, DNS, Cloudflare, Velocity, │
│ port allocation, VMID selection │
└─────────────────────┬───────────────────────────────────────┘
│
│ Status Polling
│ Agent reports state
│
▼
┌─────────────────────────────────────────────────────────────┐
│ NEXT.JS FRONTEND │
│ (Customer + Admin UI) │
│ │
│ Owns: User interaction, form validation, display logic, │
│ client-side state, UI components │
│ │
│ Speaks To: API v2 only (REST + WebSocket when added) │
│ │
│ Never Touches: Proxmox, Go Agent, DNS, Velocity, │
│ Cloudflare, direct container access │
└─────────────────────────────────────────────────────────────┘
🗺️ System Ownership Matrix (CANONICAL)
| Area | Node.js API v2 | Go Agent | Frontend |
|---|---|---|---|
| Provisioning Orchestration | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
| Template Selection | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
| Server Installation | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
| Runtime Control | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
| DNS (Cloudflare + Technitium) | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
| Velocity Registration | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
| IP Logic | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
| Port Allocation | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
| Monitoring | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
| Error Handling | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |
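One way to make the matrix checkable in code is to encode it as data. The following is a minimal sketch, not part of the codebase: the area keys, `ownersOf`, and `mayOwn` are hypothetical names chosen for illustration.

```typescript
type System = "api" | "agent" | "frontend";

// Hypothetical shorthand keys for the rows of the ownership matrix.
const OWNERS: Record<string, readonly System[]> = {
  provisioning: ["api"],
  templateSelection: ["api"],
  serverInstallation: ["agent"],
  runtimeControl: ["api", "agent"], // API sends commands, agent executes them
  dns: ["api"],
  velocityRegistration: ["api"],
  ipLogic: ["api"],
  portAllocation: ["api"],
  monitoring: ["api", "agent"], // API collects metrics, agent exposes /health
  errorHandling: ["api"],
};

// Returns the owning systems for an area, or an empty list for unknown areas.
function ownersOf(area: string): readonly System[] {
  return Object.prototype.hasOwnProperty.call(OWNERS, area) ? OWNERS[area] : [];
}

// True when the given system is listed as an owner of the area.
function mayOwn(system: System, area: string): boolean {
  return ownersOf(area).indexOf(system) !== -1;
}
```

A review script could call `mayOwn("agent", "portAllocation")` and treat a `false` result as a reason to raise a drift warning.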
🔄 API ↔ Agent Contract (IMMUTABLE)
API → Agent Endpoints
POST /config
├─ Payload: {
│ game: "minecraft",
│ variant: "paper",
│ version: "1.21.3",
│ ports: [25565, 25575],
│ memory: "4G",
│ motd: "Welcome to ZeroLagHub",
│ worldSettings: {...}
│ }
└─ Response: 200 OK
POST /start
└─ Triggers: ensureProvisioned() → StartServer()
POST /stop
└─ Triggers: Graceful shutdown via server stop command
POST /restart
└─ Triggers: stop → start sequence
GET /status
└─ Response: {
state: "RUNNING" | "INSTALLING" | "FAILED",
pid: 12345,
uptime: 3600,
lastError: null
}
GET /health
└─ Response: {
healthy: true,
java: "/usr/lib/jvm/java-21-openjdk-amd64",
variant: "paper",
ready: true
}
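The POST /config payload above can be typed and checked on the API side before it is sent. This is a sketch under stated assumptions: the field list comes from the contract, but `validateAgentConfig` and its rules (port range, memory format) are hypothetical and may differ from what the real API enforces.

```typescript
// Shape of the POST /config payload, taken from the contract above.
interface AgentConfig {
  game: string;
  variant: string;
  version: string;
  ports: number[];
  memory: string;
  motd?: string;
  worldSettings?: Record<string, unknown>;
}

// Hypothetical validation rules; returns a list of human-readable problems.
function validateAgentConfig(cfg: AgentConfig): string[] {
  const errors: string[] = [];
  if (cfg.game.length === 0) errors.push("game is required");
  if (cfg.variant.length === 0) errors.push("variant is required");
  if (cfg.version.length === 0) errors.push("version is required");
  if (cfg.ports.length === 0) errors.push("at least one port is required");
  for (const port of cfg.ports) {
    // Ports must be whole numbers in the valid TCP/UDP range.
    if (Math.floor(port) !== port || port < 1 || port > 65535) {
      errors.push("invalid port: " + port);
    }
  }
  // Assumed memory format: digits followed by M or G, as in "4G".
  if (!/^\d+[MG]$/.test(cfg.memory)) errors.push("memory must look like 4G or 512M");
  return errors;
}
```

An empty result means the payload is safe to send; anything else should fail the provisioning job before the agent is contacted.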
Agent → API Signals (via /status polling)
State Machine:
INSTALLING
├─> DOWNLOADING_ARTIFACT
├─> INSTALLING_JAVA
├─> FINALIZING
├─> RUNNING
└─> FAILED | CRASHED
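The diagram does not spell out which transitions are legal. The sketch below assumes a linear install path, FAILED reachable from any pre-RUNNING state, and CRASHED reachable only from RUNNING. Those edges, and the helper names, are assumptions for illustration rather than the agent's documented behavior.

```typescript
type AgentState =
  | "INSTALLING" | "DOWNLOADING_ARTIFACT" | "INSTALLING_JAVA"
  | "FINALIZING" | "RUNNING" | "FAILED" | "CRASHED";

// Assumed transition table; adjust to match the agent's real behavior.
const NEXT: Record<AgentState, readonly AgentState[]> = {
  INSTALLING: ["DOWNLOADING_ARTIFACT", "FAILED"],
  DOWNLOADING_ARTIFACT: ["INSTALLING_JAVA", "FAILED"],
  INSTALLING_JAVA: ["FINALIZING", "FAILED"],
  FINALIZING: ["RUNNING", "FAILED"],
  RUNNING: ["CRASHED"],
  FAILED: [],
  CRASHED: [],
};

function canTransition(from: AgentState, to: AgentState): boolean {
  return NEXT[from].indexOf(to) !== -1;
}

// Returns the index of the first illegal step in a polled history, or -1.
// Repeated states are allowed because polling can observe the same state twice.
function firstIllegalStep(history: readonly AgentState[]): number {
  for (let i = 1; i < history.length; i++) {
    if (history[i] === history[i - 1]) continue;
    if (!canTransition(history[i - 1], history[i])) return i;
  }
  return -1;
}
```

Because the API polls, it can miss short-lived intermediate states, so a production check would likely need to tolerate skipped steps; this sketch does not.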
🚫 Forbidden Agent Behaviors
Agent NEVER:
- ❌ Allocates or manages ports
- ❌ Talks to Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Modifies LXC config
- ❌ Allocates VMIDs
- ❌ Manages templates
- ❌ Talks to Cloudflare or Technitium
Violation Detection: If agent code imports Proxmox client, DNS client, or port allocation logic → DRIFT VIOLATION
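The import check above could be automated with a small script run against the agent's Go sources. This is a heuristic sketch only: the keyword list and `findForbiddenImports` are hypothetical, and the real package names in the agent may differ.

```typescript
// Hypothetical keywords; real package names in the agent may differ.
const AGENT_FORBIDDEN: readonly string[] = [
  "proxmox", "cloudflare", "technitium", "velocity", "portpool",
];

// Heuristic: treats any line that consists only of an optionally aliased
// quoted path (with or without a leading "import") as a Go import.
function findForbiddenImports(goSource: string, forbidden: readonly string[]): string[] {
  const hits: string[] = [];
  const importLine = /^\s*(?:import\s+)?(?:[\w.]+\s+)?"([^"]+)"\s*$/;
  for (const line of goSource.split("\n")) {
    const match = importLine.exec(line);
    if (match === null) continue;
    const path = match[1].toLowerCase();
    for (const word of forbidden) {
      if (path.indexOf(word) !== -1) {
        hits.push(match[1]);
        break; // report each import path once
      }
    }
  }
  return hits;
}
```

A non-empty result is a candidate DRIFT VIOLATION; string literals elsewhere in the code are ignored unless they sit alone on a line.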
🖥️ Frontend ↔ API Contract (IMMUTABLE)
Frontend Allowed Endpoints
Provisioning:
POST /api/containers/create
GET /api/containers/:vmid
DELETE /api/containers/:vmid
Control:
POST /api/containers/:vmid/start
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/restart
Status:
GET /api/containers/:vmid/status
GET /api/containers/:vmid/logs
GET /api/containers/:vmid/stats
Discovery:
GET /api/templates
GET /api/containers (list user's containers)
Read-Only Info:
GET /api/dns/:hostname (display only)
GET /api/velocity/status (display only)
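The allow-list above can be expressed as data so that a frontend API wrapper rejects anything outside it. A minimal sketch, assuming hypothetical helpers `matchesPattern` and `isAllowedFrontendCall`; the routes themselves are copied from the list.

```typescript
type Method = "GET" | "POST" | "DELETE";

// Allow-list copied from the endpoint list above; ":param" matches one path segment.
const FRONTEND_ROUTES: ReadonlyArray<[Method, string]> = [
  ["POST", "/api/containers/create"],
  ["GET", "/api/containers/:vmid"],
  ["DELETE", "/api/containers/:vmid"],
  ["POST", "/api/containers/:vmid/start"],
  ["POST", "/api/containers/:vmid/stop"],
  ["POST", "/api/containers/:vmid/restart"],
  ["GET", "/api/containers/:vmid/status"],
  ["GET", "/api/containers/:vmid/logs"],
  ["GET", "/api/containers/:vmid/stats"],
  ["GET", "/api/templates"],
  ["GET", "/api/containers"],
  ["GET", "/api/dns/:hostname"],
  ["GET", "/api/velocity/status"],
];

// Compares a concrete path against a pattern, segment by segment.
function matchesPattern(pattern: string, path: string): boolean {
  const want = pattern.split("/");
  const got = path.split("/");
  if (want.length !== got.length) return false;
  for (let i = 0; i < want.length; i++) {
    const isParam = want[i].charAt(0) === ":";
    if (isParam ? got[i].length === 0 : want[i] !== got[i]) return false;
  }
  return true;
}

function isAllowedFrontendCall(method: Method, path: string): boolean {
  for (const [allowedMethod, pattern] of FRONTEND_ROUTES) {
    if (allowedMethod === method && matchesPattern(pattern, path)) return true;
  }
  return false;
}
```

For example, `isAllowedFrontendCall("POST", "/api/dns/example")` is rejected because DNS is read-only for the frontend.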
🚫 Forbidden Frontend Behaviors
Frontend NEVER:
- ❌ Talks directly to Go Agent
- ❌ Calls Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Allocates ports
- ❌ Executes container commands
- ❌ Accesses MariaDB directly
Violation Detection: If frontend code imports Proxmox client, agent HTTP client (except via API), or database client → DRIFT VIOLATION
🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)
Rule #1: Provisioning Ownership
Violation: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs
Correct Path: API owns ALL provisioning orchestration
Example Violation:
// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
// ... port allocation logic
}
Correct Pattern:
// RIGHT - API allocates, agent receives
async function provisionInstance(vmid, game, variant) {
  // vmid is allocated by the API before ports are requested
  const ports = await portAllocator.allocate(vmid);
  await agent.postConfig({ ports, game, variant });
}
Rule #2: Artifact Ownership
Violation: API asked to install Java, download server.jar, run installers
Correct Path: Agent owns ALL in-container installation
Example Violation:
// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
await proxmox.exec(vmid, `wget https://...`);
await proxmox.exec(vmid, `java -jar installer.jar`);
}
Correct Pattern:
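// RIGHT - Agent handles installation
// (helpers are assumed to return an error, which is propagated to the caller)
func (p *Provisioner) ProvisionAll(cfg Config) error {
	if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
		return err
	}
	if err := installJavaRuntime(cfg.Version); err != nil {
		return err
	}
	return verifyInstallation()
}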
Rule #3: Direct Container Access
Violation: Frontend or API executes commands directly inside the container
Correct Path: Agent owns container execution layer
Example Violation:
// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
const ssh = new SSHClient();
await ssh.connect(containerIP);
await ssh.exec('systemctl restart minecraft');
}
Correct Pattern:
// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
await api.post(`/containers/${vmid}/restart`);
}
Rule #4: Networking Responsibilities
Violation: Agent asked to select public vs internal IPs, decide DNS zones
Correct Path: API owns dual-IP logic (Cloudflare external, Technitium internal)
Example Violation:
// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
if a.needsCloudflare() {
return "139.64.165.248"
}
return a.containerIP
}
Correct Pattern:
// RIGHT - API decides network topology
function determineIPs(vmid, game) {
const internalIP = `10.200.0.${vmid - 1000}`;
const externalIP = "139.64.165.248"; // Cloudflare target
const velocityIP = "10.70.0.241"; // Internal routing
return { internalIP, externalIP, velocityIP };
}
Rule #5: Proxy Responsibilities
Violation: Agent asked to register with Velocity, configure proxy routing
Correct Path: API owns ALL proxy integrations
Example Violation:
// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
client := velocity.NewClient()
return client.Register(a.hostname, a.port)
}
Correct Pattern:
// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
await velocityBridge.registerBackend({
name: hostname,
address: internalIP,
port: 25565
});
}
🔄 Context Switching Safety Workflow
When Moving to Node.js API Work:
Pre-Switch Checklist:
- Agent contract unchanged? (POST /config, /start, /stop, /restart; GET /status, /health)
- Database schema unchanged? (Prisma models consistent)
- LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
- DNS/IP logic consistent? (Cloudflare external, Technitium internal)
- Port allocation logic preserved? (PortPool DB-backed)
Common Drift Patterns:
- ⚠️ Adding agent installation logic to API
- ⚠️ Changing agent contract without updating both sides
- ⚠️ Moving DNS logic to different service
- ⚠️ Bypassing job queue for provisioning
When Moving to Go Agent Work:
Pre-Switch Checklist:
- No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
- File paths remain canonical (/opt/zlh/<game>/<variant>/world)
- Naming conventions maintained (server.jar, fabric-server.jar, etc.)
- No external API calls (Proxmox, DNS, Velocity)
- Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)
Common Drift Patterns:
- ⚠️ Adding port allocation to agent
- ⚠️ Making agent talk to external services
- ⚠️ Changing directory structure without API coordination
- ⚠️ Adding orchestration logic to agent
When Moving to Frontend Work:
Pre-Switch Checklist:
- Only API-approved fields used in UI
- No direct agent HTTP calls
- VMID not exposed in user-facing UI
- Internal IPs not displayed to users
- All state from API, not computed locally
Common Drift Patterns:
- ⚠️ Adding direct agent calls from frontend
- ⚠️ Computing server state client-side
- ⚠️ Exposing internal infrastructure details
- ⚠️ Bypassing API for container control
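The checklist items about hiding VMIDs and internal IPs can be enforced by mapping API records to a narrower view type before rendering. A minimal sketch: `InstanceRecord`, `CustomerView`, and `toCustomerView` are hypothetical names, and the actual API response shape is not shown in this document.

```typescript
// Hypothetical API response shape; field names are illustrative only.
interface InstanceRecord {
  vmid: number;
  hostname: string;
  internalIP: string;
  state: string;
  ports: number[];
}

// What the customer-facing UI is allowed to render.
interface CustomerView {
  hostname: string;
  state: string;
  port: number | null;
}

// Copies only customer-safe fields; vmid and internalIP are deliberately dropped.
function toCustomerView(rec: InstanceRecord): CustomerView {
  return {
    hostname: rec.hostname,
    state: rec.state, // passed through from the API, never recomputed client-side
    port: rec.ports.length > 0 ? rec.ports[0] : null,
  };
}
```

Components that accept only `CustomerView` cannot display infrastructure details by accident, because those fields are absent from the type.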
🚨 High-Risk Integration Zones (GUARDED)
These areas have historically caused drift across sessions:
1. Forge / NeoForge Installation Logic
- Risk: Agent vs API confusion on who handles run.sh patching
- Guard: Agent owns ALL Forge installation, API just passes config
- Test: Can provision Forge 1.21.3 without API filesystem access?
2. Cloudflare SRV Deletion
- Risk: Case sensitivity, subdomain normalization, record ID tracking
- Guard: API stores Cloudflare record IDs in EdgeState, deletes by ID
- Test: Create → Delete → Recreate same hostname without orphans?
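The guard for this zone (normalize the hostname, then delete by stored record ID) can be sketched as follows. `EdgeRecord`, `normalizeHostname`, and `recordIdsFor` are hypothetical; the real EdgeState schema is not shown in this document.

```typescript
// In-memory stand-in for EdgeState rows; the real schema may differ.
interface EdgeRecord {
  hostname: string;
  cloudflareRecordId: string;
}

// DNS names are case-insensitive, so compare on a lowercased form
// with surrounding whitespace and any trailing dots removed.
function normalizeHostname(raw: string): string {
  return raw.trim().toLowerCase().replace(/\.+$/, "");
}

// Collects the stored record IDs for a hostname so deletion can go by ID,
// not by a name lookup against Cloudflare.
function recordIdsFor(hostname: string, edgeState: readonly EdgeRecord[]): string[] {
  const key = normalizeHostname(hostname);
  const ids: string[] = [];
  for (const rec of edgeState) {
    if (normalizeHostname(rec.hostname) === key) ids.push(rec.cloudflareRecordId);
  }
  return ids;
}
```

Deleting every returned ID before recreating the hostname is what keeps the Create → Delete → Recreate test free of orphans.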
3. Technitium DNS Zone Mismatch
- Risk: Wrong zone selection, duplicate records
- Guard: API hardcodes zone as zpack.zerolaghub.com, validates before creation
- Test: No records created in wrong zones?
4. Velocity Registration Order
- Risk: Registering before server ready, deregistering incorrectly
- Guard: API waits for agent RUNNING state, then registers Velocity
- Test: Player connection works immediately after provisioning complete?
5. PortPool Commit Logic
- Risk: Race conditions, double-allocation, uncommitted ports
- Guard: API allocates → provisions → commits (rollback on failure)
- Test: Concurrent provisions don't collide on ports?
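The allocate → provision → commit sequence with rollback can be sketched with an in-memory pool. This is an illustration only: the real PortPool is DB-backed and asynchronous, and the `PortPool` class and `provisionWithPorts` helper here are hypothetical simplifications.

```typescript
type PortStatus = "free" | "reserved" | "committed";

// In-memory stand-in for the DB-backed PortPool.
class PortPool {
  private status: { [port: number]: PortStatus } = {};

  constructor(ports: readonly number[]) {
    for (const p of ports) this.status[p] = "free";
  }

  // Marks `count` free ports as reserved, or throws if too few are free.
  reserve(count: number): number[] {
    const picked: number[] = [];
    for (const key of Object.keys(this.status)) {
      const port = Number(key);
      if (picked.length < count && this.status[port] === "free") picked.push(port);
    }
    if (picked.length < count) throw new Error("port pool exhausted");
    for (const p of picked) this.status[p] = "reserved";
    return picked;
  }

  commit(ports: readonly number[]): void {
    for (const p of ports) this.status[p] = "committed";
  }

  release(ports: readonly number[]): void {
    for (const p of ports) this.status[p] = "free";
  }

  statusOf(port: number): PortStatus | undefined {
    return this.status[port];
  }
}

// Reserve, run the provisioning step, then commit; release on failure.
function provisionWithPorts(
  pool: PortPool,
  count: number,
  provision: (ports: number[]) => void,
): number[] {
  const ports = pool.reserve(count);
  try {
    provision(ports);
  } catch (err) {
    pool.release(ports); // rollback so the ports can be reused
    throw err;
  }
  pool.commit(ports);
  return ports;
}
```

The sketch is single-threaded, so it does not prove the concurrency test; with a real database the reserve step would need a transaction or row lock to prevent double-allocation.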
6. Agent READY Detection
- Risk: False negatives, false positives, variant-specific patterns
- Guard: Agent uses variant-aware log parsing, multiple confirmation lines
- Test: All 6 variants correctly detect READY state?
7. Server Start-Up False Negatives
- Risk: Timeout too short, log parsing too strict
- Guard: Agent uses a longer timeout for Forge (90s) and matches multiple log patterns
- Test: Forge installer completes without false failure?
8. IP Selection Logic
- Risk: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
- Guard: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
- Test: DNS points to correct IPs, Velocity routes to correct internal IP?
📋 Architecture Decision Log (LOCKED) ⭐ NEW
Purpose: Records finalized architectural decisions that must not be re-litigated unless explicitly requested.
Status: LOCKED - These decisions are final and cannot be changed unless the user explicitly says "Revisit decision X"
DEC-001: Templates vs Go Agent (FINAL)
Decision: Hybrid model
- LXC templates define base environment only
- Go Agent is authoritative execution layer inside containers
Rationale:
- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
- Agent enables self-repair, async provisioning, runtime control
- Hybrid provides speed + flexibility without API container access
Applies To: API v2, Go Agent, Frontend
Status: ✅ LOCKED
DEC-002: Provisioning Authority
Decision: API orchestrates, Agent executes
Rationale:
- API has global visibility (DB, DNS, Proxmox, Velocity)
- Agent is intentionally sandboxed to container filesystem + process
Applies To: All systems
Status: ✅ LOCKED
DEC-003: DNS & Edge Publishing Ownership
Decision: API-only responsibility
Rationale:
- Requires external credentials (Cloudflare, Technitium)
- Must correlate DB state, record IDs, reconciliation jobs
Applies To: API v2
Status: ✅ LOCKED
DEC-004: Proxy Stack
Decision: Traefik + Velocity only
Rationale:
- Traefik for HTTP/control-plane
- Velocity for Minecraft TCP routing
- HAProxy explicitly deprecated for ZeroLagHub
Applies To: Infrastructure, API
Status: ✅ LOCKED
DEC-005: State Persistence
Decision: MariaDB is single source of truth
Rationale:
- Flat files caused race conditions and drift
- DB enables reconciliation, recovery, observability
Applies To: API v2
Status: ✅ LOCKED
DEC-006: Frontend Access Model
Decision: Frontend communicates with API only
Rationale:
- Security boundary
- Prevents leaking infrastructure details
Applies To: Frontend
Status: ✅ LOCKED
DEC-007: Architecture Enforcement Policy
Decision: Drift prevention is mandatory
Rationale:
- Prevents oscillation between alternatives
- Preserves velocity during late-stage development
Applies To: All work sessions
Status: ✅ LOCKED
📋 Canonical Architecture Anchors (ABSOLUTE)
These rules are immutable unless explicitly changed with full system review:
Anchor #1: Orchestration
✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)
Anchor #2: Container-Internal
✅ Agent performs EVERYTHING inside container (install, start, stop, detect)
Anchor #3: Templates
✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)
Anchor #4: Job Queue
✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)
Anchor #5: Database
✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)
Anchor #6: Infrastructure
✅ Proxmox API (not Ansible) for LXC management
Anchor #7: Routing
✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems
Anchor #8: DNS
✅ Cloudflare = authoritative public DNS
✅ Technitium = authoritative internal DNS
Anchor #9: Frontend Isolation
✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)
Anchor #10: Directory Structure
✅ /opt/zlh/<game>/<variant>/world is canonical game server path
🛡️ Enforcement Policies (ACTIVE)
When future instructions conflict with Canonical Architecture or Architecture Decision Log:
Step 1: STOP
Immediately halt the task. Do not proceed with drift-inducing change.
Step 2: RAISE DRIFT WARNING
⚠️ DRIFT WARNING ⚠️
Proposed change violates Canonical Architecture:
Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]
Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]
Impact: [What breaks if this drift is allowed]
Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
Step 3: PROVIDE CORRECT PATH
Show the architecture-aligned implementation that achieves the same goal.
Step 4: ASK FOR DIRECTION
Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?
Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE
If violation is against a LOCKED Architecture Decision (DEC-001 through DEC-007):
ADDITIONAL CHECK:
⚠️ LOCKED DECISION WARNING ⚠️
This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED
This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"
Without explicit user request to revisit, this decision is FINAL.
Proceeding with this change would violate architectural governance.
Never proceed with drift-inducing changes without explicit confirmation.
Never re-litigate locked decisions without user explicitly requesting revision.
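The warning template from Step 2 and the locked-decision rule from Step 5 can be generated from structured data so every warning has the same shape. A minimal sketch; `DriftWarning` and `formatDriftWarning` are hypothetical names, not existing tooling.

```typescript
interface DriftWarning {
  violated: string; // e.g. "Rule #1: Provisioning Ownership" or "DEC-003: DNS & Edge Publishing Ownership"
  violation: string;
  correctPath: string;
  impact: string;
  lockedDecision: boolean; // true when `violated` names a DEC-xxx entry
}

// Renders the warning in the layout used by the Step 2 template.
function formatDriftWarning(w: DriftWarning): string {
  const lines: string[] = [
    "⚠️ DRIFT WARNING ⚠️",
    "Proposed change violates Canonical Architecture:",
    (w.lockedDecision ? "Decision Violated: " : "Rule Violated: ") + w.violated,
    "Violation: " + w.violation,
    "Correct Path: " + w.correctPath,
    "Impact: " + w.impact,
  ];
  if (w.lockedDecision) {
    // Step 5: locked decisions need an explicit request from the user.
    lines.push('Status: LOCKED. Requires the user to say "Revisit decision [DEC-XXX]".');
  }
  return lines.join("\n");
}
```

The caller still has to stop, show the correct path, and ask for direction; the formatter only covers the text of the warning.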
📊 Drift Detection Examples
Example 1: Agent Port Allocation (VIOLATION)
Proposed Change:
// Agent code proposal
func (a *Agent) allocatePort() int {
// Find available port...
return port
}
Drift Warning:
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #1 - Provisioning Ownership
Violation: Agent attempting to allocate ports
Correct Path: API allocates ports, agent receives them via /config
Impact: Port collisions, database inconsistency, broken PortPool
Recommendation: Remove port allocation from agent, ensure API sends ports in config payload
Example 2: Frontend Direct Agent Call (VIOLATION)
Proposed Change:
// Frontend code proposal
async function getServerLogs(vmid: number) {
const agentURL = `http://10.200.0.${vmid}:8080/logs`;
return await fetch(agentURL);
}
Drift Warning:
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #3 - Direct Container Access
Violation: Frontend bypassing API to talk to agent
Correct Path: Frontend → API → Agent (API proxies logs)
Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk
Recommendation: Route the request through GET /api/containers/:vmid/logs (already on the allowed endpoint list) and have the API proxy to the agent
Example 3: API Installing Java (VIOLATION)
Proposed Change:
// API code proposal
async function provisionServer(vmid, game, variant) {
await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
await proxmox.exec(vmid, 'wget https://papermc.io/...');
}
Drift Warning:
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #2 - Artifact Ownership
Violation: API performing in-container installation
Correct Path: API sends config to agent, agent handles installation
Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system
Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config
📁 Integration with Existing Documentation
Relationship to Master Bootstrap
- Master Bootstrap: Strategic overview and business model
- This Document: Technical governance and boundary enforcement
- Usage: Consult this before implementing ANY code changes
Relationship to Complete Current State
- Complete Current State: What's working, what's next
- This Document: How things MUST work (regardless of current state)
- Usage: This is the "law", current state is "status"
Relationship to Engineering Handover
- Engineering Handover: Daily tactical tasks and sprint plan
- This Document: Constraints within which tasks must be implemented
- Usage: Check this before starting each handover task
🔄 Architecture Amendment Process
If there is a legitimate need to change the Canonical Architecture:
Step 1: Identify Change
Document exactly what architectural boundary needs to change and why.
Step 2: Full System Impact Analysis
- What breaks in API?
- What breaks in Agent?
- What breaks in Frontend?
- What changes to contracts?
- What database migrations needed?
Step 3: Update ALL Affected Documents
- This document (Canonical Architecture)
- Master Bootstrap (if strategic impact)
- Complete Current State (implementation changes)
- Engineering Handover (sprint tasks)
- Agent Spec, Operational Guide (if affected)
Step 4: Update ALL Systems
- API code + tests
- Agent code + tests
- Frontend code + tests
- Database schema (migration)
- Infrastructure config
Step 5: Validation
- Integration tests pass?
- No new drift introduced?
- Documentation consistent?
- All AIs briefed on change?
Only after ALL steps is the architecture amendment complete.
🎯 Quick Reference Card
API Owns
- ✅ Provisioning orchestration
- ✅ Port allocation (PortPool)
- ✅ DNS (Cloudflare + Technitium)
- ✅ Velocity registration
- ✅ IP logic (external + internal)
- ✅ Job queue (BullMQ)
- ✅ Database state
Agent Owns
- ✅ Container-internal installation
- ✅ Java runtime
- ✅ Artifact downloads
- ✅ Server process management
- ✅ READY detection
- ✅ Self-repair + verification
- ✅ Filesystem layout
Frontend Owns
- ✅ User interaction
- ✅ Display logic
- ✅ Client state
- ✅ Form validation
Never
- ❌ Agent allocates ports
- ❌ Agent talks to DNS/Velocity/Proxmox
- ❌ API installs server files
- ❌ API executes in-container commands
- ❌ Frontend talks to agent directly
- ❌ Frontend talks to infrastructure
📋 Session Start Checklist
Before every coding session with any AI:
- Read this document's Quick Reference Card
- Identify which system you're working on (API, Agent, Frontend)
- Review that system's "Owns" list
- Check High-Risk Integration Zones if touching those areas
- Verify no drift from previous session
- Confirm contracts unchanged since last session
If ANY doubt: Re-read full Canonical Architecture Anchors section.
✅ Document Status
Status: ACTIVE - Must be consulted before all code changes
Enforcement: MANDATORY - Drift violations must be caught
Authority: CANONICAL - Overrides conflicting guidance
Updates: Only via Architecture Amendment Process
🛡️ This document prevents architectural drift. Violate at your own risk.