knowledge-base/ZeroLagHub_Cross_Project_Tracker.md
2025-12-13 16:40:48 +00:00

🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System

Last Updated: December 7, 2025
Version: 1.0 (Canonical Architecture Enforcement)
Status: ACTIVE - Must Be Consulted Before All Code Changes


🎯 Document Purpose

This document establishes architectural boundaries across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.

Critical Insight: Architectural failures are more likely to come from gradual erosion of system boundaries than from sudden breaking changes. This document exists to catch that erosion before it lands in code.


📊 The Three Systems (Canonical Ownership)

┌─────────────────────────────────────────────────────────────┐
│                     NODE.JS API v2                          │
│                  (Orchestration Engine)                     │
│                                                             │
│  Owns: VMID allocation, port management, DNS publishing,   │
│        Velocity registration, job queue, database state    │
│                                                             │
│  Speaks To: Proxmox, Cloudflare, Technitium, Velocity,    │
│             MariaDB, BullMQ, Go Agent (HTTP)               │
│                                                             │
│  Never Touches: Container filesystem, game server files,   │
│                 Java installation, artifact downloads       │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      │ HTTP Contract
                      │ POST /config, /start, /stop, /restart
                      │ GET /status, /health
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                       GO AGENT                              │
│              (Container-Internal Manager)                   │
│                                                             │
│  Owns: Server installation, Java runtime, artifact         │
│        downloads, process management, READY detection,     │
│        filesystem layout, verification + self-repair       │
│                                                             │
│  Speaks To: Local filesystem, game server process,         │
│             API (status updates via HTTP polling)          │
│                                                             │
│  Never Touches: Proxmox, DNS, Cloudflare, Velocity,        │
│                 port allocation, VMID selection            │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      │ Status (relayed through API v2;
                      │ Frontend never calls the agent)
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                    NEXT.JS FRONTEND                         │
│                   (Customer + Admin UI)                     │
│                                                             │
│  Owns: User interaction, form validation, display logic,   │
│        client-side state, UI components                    │
│                                                             │
│  Speaks To: API v2 only (REST + WebSocket when added)      │
│                                                             │
│  Never Touches: Proxmox, Go Agent, DNS, Velocity,          │
│                 Cloudflare, direct container access        │
└─────────────────────────────────────────────────────────────┘

🗺️ System Ownership Matrix (CANONICAL)

| Area | Node.js API v2 | Go Agent | Frontend |
|---|---|---|---|
| Provisioning Orchestration | OWNER (allocates VMID, ports, builds LXC config) | Executes inside container | Triggers only |
| Template Selection | OWNER (selects template, passes config) | Template contains agent | Displays options |
| Server Installation | Never | OWNER (Java, artifacts, validation) | Displays results |
| Runtime Control | OWNER (sends commands) | OWNER (executes commands) | UI only |
| DNS (Cloudflare + Technitium) | OWNER (creates, deletes, tracks IDs) | Never | Displays info |
| Velocity Registration | OWNER (registers, deregisters) | Never | Displays status |
| IP Logic | OWNER (external + internal IPs) | Sees container IP only | Displays final |
| Port Allocation | OWNER (PortPool DB management) | Receives assignments | Displays ports |
| Monitoring | OWNER (collects metrics) | OWNER (exposes /health) | Displays data |
| Error Handling | OWNER (BullMQ jobs, retries) | Local output only | User notifications |

🔄 API ↔ Agent Contract (IMMUTABLE)

API → Agent Endpoints

POST /config
├─ Payload: {
│    game: "minecraft",
│    variant: "paper",
│    version: "1.21.3",
│    ports: [25565, 25575],
│    memory: "4G",
│    motd: "Welcome to ZeroLagHub",
│    worldSettings: {...}
│  }
└─ Response: 200 OK

POST /start
└─ Triggers: ensureProvisioned() → StartServer()

POST /stop
└─ Triggers: Graceful shutdown via server stop command

POST /restart
└─ Triggers: stop → start sequence

GET /status
└─ Response: {
     state: "INSTALLING" | "DOWNLOADING_ARTIFACT" | "INSTALLING_JAVA" |
            "FINALIZING" | "RUNNING" | "FAILED" | "CRASHED",
     pid: 12345,
     uptime: 3600,
     lastError: null
   }

GET /health
└─ Response: {
     healthy: true,
     java: "/usr/lib/jvm/java-21-openjdk-amd64",
     variant: "paper",
     ready: true
   }
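
Because the POST /config payload is the one place where API decisions cross into the agent, it may help to pin its shape down in a type and check it before sending. The following is a minimal sketch, not the project's actual code: the field names come from the payload above, while `validateConfigPayload`, `isConfigPayload`, the port range check, and the memory format regex are assumptions made for illustration.

```typescript
// Field names follow the POST /config payload above; the validation rules
// (port range, memory format) are assumptions for illustration.
interface ConfigPayload {
  game: string;
  variant: string;
  version: string;
  ports: number[];
  memory: string;
  motd?: string;
  worldSettings?: Record<string, unknown>;
}

// Returns a list of problems; an empty list means the payload looks well-formed.
function validateConfigPayload(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["payload must be an object"];
  }
  const payload = input as Record<string, unknown>;
  const errors: string[] = [];

  for (const key of ["game", "variant", "version"]) {
    const value = payload[key];
    if (typeof value !== "string" || value.length === 0) {
      errors.push(`${key} must be a non-empty string`);
    }
  }

  const ports = payload["ports"];
  const validPort = (n: unknown): boolean =>
    typeof n === "number" && Number.isInteger(n) && n >= 1 && n <= 65535;
  if (!Array.isArray(ports) || ports.length === 0 || !ports.every(validPort)) {
    errors.push("ports must be a non-empty array of integers in 1..65535");
  }

  const memory = payload["memory"];
  if (typeof memory !== "string" || !/^\d+[MG]$/.test(memory)) {
    errors.push('memory must look like "4G" or "512M"');
  }
  return errors;
}

// Type guard built on the validator, so callers get a typed payload.
function isConfigPayload(input: unknown): input is ConfigPayload {
  return validateConfigPayload(input).length === 0;
}
```

Running the same check on both sides of the contract (API before sending, agent on receipt) would make a silent contract change show up as a validation error rather than as a half-provisioned server.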

Agent → API Signals (via /status polling)

State Machine:
INSTALLING
  ├─> DOWNLOADING_ARTIFACT
  ├─> INSTALLING_JAVA
  ├─> FINALIZING
  ├─> RUNNING
  └─> FAILED | CRASHED
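
On the API side, the states above can be reduced to three outcomes when polling /status: keep waiting, ready, or failed. The sketch below uses the state names from the state machine; the grouping into phases and the names `classifyState` and `firstOutcome` are assumptions, not part of the documented contract.

```typescript
// State names are taken from the state machine above; the grouping into
// pending/ready/failed is an assumption made for illustration.
type AgentState =
  | "INSTALLING"
  | "DOWNLOADING_ARTIFACT"
  | "INSTALLING_JAVA"
  | "FINALIZING"
  | "RUNNING"
  | "FAILED"
  | "CRASHED";

type Phase = "pending" | "ready" | "failed";

// Maps one reported state to the action the poller should take.
function classifyState(state: AgentState): Phase {
  switch (state) {
    case "RUNNING":
      return "ready";
    case "FAILED":
    case "CRASHED":
      return "failed";
    default:
      return "pending";
  }
}

// Walks a sequence of polled states and returns the first non-pending
// outcome, or "pending" if the agent never left the install phases.
function firstOutcome(polled: AgentState[]): Phase {
  for (const state of polled) {
    const phase = classifyState(state);
    if (phase !== "pending") {
      return phase;
    }
  }
  return "pending";
}
```

Keeping the classification in one function means a new intermediate state added to the agent defaults to "keep waiting" on the API side instead of being misread as a failure.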

🚫 Forbidden Agent Behaviors

Agent NEVER:

  • Allocates or manages ports
  • Talks to Proxmox API
  • Creates DNS records
  • Registers with Velocity
  • Modifies LXC config
  • Allocates VMIDs
  • Manages templates
  • Talks to Cloudflare or Technitium

Violation Detection: If agent code imports Proxmox client, DNS client, or port allocation logic → DRIFT VIOLATION
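
The import rule above is mechanical enough to check with a small script, for example in CI. This is a rough sketch under stated assumptions: the deny-list substrings in `FORBIDDEN_AGENT_IMPORTS` are hypothetical (the real package names are not given in this document), and the Go import parsing is a simple line scan rather than a full parser.

```typescript
// Hypothetical deny-list: substrings of import paths the agent must not use.
const FORBIDDEN_AGENT_IMPORTS = ["proxmox", "cloudflare", "technitium", "velocity", "portpool"];

// Extracts quoted import paths from Go source (single-line and block imports).
function goImportPaths(source: string): string[] {
  const paths: string[] = [];
  let inBlock = false;
  for (const raw of source.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("import (")) {
      inBlock = true;
      continue;
    }
    if (inBlock && line.startsWith(")")) {
      inBlock = false;
      continue;
    }
    if (inBlock || line.startsWith("import ")) {
      const match = line.match(/"([^"]+)"/);
      if (match && match[1] !== undefined) {
        paths.push(match[1]);
      }
    }
  }
  return paths;
}

// Returns every import path that matches a forbidden substring.
function findDriftImports(source: string): string[] {
  return goImportPaths(source).filter((path) =>
    FORBIDDEN_AGENT_IMPORTS.some((bad) => path.toLowerCase().indexOf(bad) !== -1)
  );
}
```

A non-empty result from `findDriftImports` would correspond to the DRIFT VIOLATION case described above.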


🖥️ Frontend ↔ API Contract (IMMUTABLE)

Frontend Allowed Endpoints

Provisioning:
POST /api/containers/create
GET  /api/containers/:vmid
DELETE /api/containers/:vmid

Control:
POST /api/containers/:vmid/start
POST /api/containers/:vmid/stop
POST /api/containers/:vmid/restart

Status:
GET /api/containers/:vmid/status
GET /api/containers/:vmid/logs
GET /api/containers/:vmid/stats

Discovery:
GET /api/templates
GET /api/containers (list user's containers)

Read-Only Info:
GET /api/dns/:hostname (display only)
GET /api/velocity/status (display only)
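
One way to keep the frontend inside this list is to route every request through a single helper that checks the method and path against it. The sketch below copies the route table from the list above; `patternToRegExp` and `isAllowedFrontendCall` are illustrative names, and how the check is wired into the frontend's HTTP client is left open.

```typescript
// Route table copied from the allowed-endpoint list above.
const ALLOWED_ROUTES: ReadonlyArray<[string, string]> = [
  ["POST", "/api/containers/create"],
  ["GET", "/api/containers/:vmid"],
  ["DELETE", "/api/containers/:vmid"],
  ["POST", "/api/containers/:vmid/start"],
  ["POST", "/api/containers/:vmid/stop"],
  ["POST", "/api/containers/:vmid/restart"],
  ["GET", "/api/containers/:vmid/status"],
  ["GET", "/api/containers/:vmid/logs"],
  ["GET", "/api/containers/:vmid/stats"],
  ["GET", "/api/templates"],
  ["GET", "/api/containers"],
  ["GET", "/api/dns/:hostname"],
  ["GET", "/api/velocity/status"],
];

// Converts a pattern like /api/containers/:vmid into an anchored RegExp,
// treating each :param segment as "one path segment".
function patternToRegExp(pattern: string): RegExp {
  const body = pattern
    .split("/")
    .map((seg) =>
      seg.startsWith(":") ? "[^/]+" : seg.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    )
    .join("/");
  return new RegExp(`^${body}$`);
}

// True only when the method/path pair appears in the allow-list.
function isAllowedFrontendCall(method: string, path: string): boolean {
  const upper = method.toUpperCase();
  return ALLOWED_ROUTES.some(
    ([allowedMethod, pattern]) =>
      allowedMethod === upper && patternToRegExp(pattern).test(path)
  );
}
```

Because the patterns are anchored, a full URL pointing at a container IP never matches, so a direct agent call fails the check by construction.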

🚫 Forbidden Frontend Behaviors

Frontend NEVER:

  • Talks directly to Go Agent
  • Calls Proxmox API
  • Creates DNS records
  • Registers with Velocity
  • Allocates ports
  • Executes container commands
  • Accesses MariaDB directly

Violation Detection: If frontend code imports a Proxmox client, a database client, or an HTTP client that calls the agent directly (agent access goes through the API only) → DRIFT VIOLATION


🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)

Rule #1: Provisioning Ownership

Violation: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs

Correct Path: API owns ALL provisioning orchestration

Example Violation:

// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
    // ... port allocation logic
}

Correct Pattern:

// RIGHT - API allocates, agent receives
async function provisionInstance(vmid, game, variant) {
    const ports = await portAllocator.allocate(vmid);
    await agent.postConfig({ ports, game, variant });
}

Rule #2: Artifact Ownership

Violation: API asked to install Java, download server.jar, run installers

Correct Path: Agent owns ALL in-container installation

Example Violation:

// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
    await proxmox.exec(vmid, `wget https://...`);
    await proxmox.exec(vmid, `java -jar installer.jar`);
}

Correct Pattern:

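// RIGHT - Agent handles installation (helpers are assumed to return error)
func (p *Provisioner) ProvisionAll(cfg Config) error {
    if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
        return err
    }
    if err := installJavaRuntime(cfg.Version); err != nil {
        return err
    }
    return verifyInstallation()
}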

Rule #3: Direct Container Access

Violation: Frontend or API executes commands directly inside a container

Correct Path: Agent owns container execution layer

Example Violation:

// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
    const ssh = new SSHClient();
    await ssh.connect(containerIP);
    await ssh.exec('systemctl restart minecraft');
}

Correct Pattern:

// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
    await api.post(`/containers/${vmid}/restart`);
}

Rule #4: Networking Responsibilities

Violation: Agent asked to select public vs internal IPs, decide DNS zones

Correct Path: API owns dual-IP logic (Cloudflare external, Technitium internal)

Example Violation:

// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
    if a.needsCloudflare() {
        return "139.64.165.248"
    }
    return a.containerIP
}

Correct Pattern:

// RIGHT - API decides network topology
function determineIPs(vmid, game) {
    const internalIP = `10.200.0.${vmid - 1000}`;
    const externalIP = "139.64.165.248"; // Cloudflare target
    const velocityIP = "10.70.0.241"; // Internal routing
    
    return { internalIP, externalIP, velocityIP };
}

Rule #5: Proxy Responsibilities

Violation: Agent asked to register with Velocity, configure proxy routing

Correct Path: API owns ALL proxy integrations

Example Violation:

// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
    client := velocity.NewClient()
    return client.Register(a.hostname, a.port)
}

Correct Pattern:

// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
    await velocityBridge.registerBackend({
        name: hostname,
        address: internalIP,
        port: 25565
    });
}

🔄 Context Switching Safety Workflow

When Moving to Node.js API Work:

Pre-Switch Checklist:

  • Agent contract unchanged? (POST /config, /start, /stop, /restart; GET /status, /health)
  • Database schema unchanged? (Prisma models consistent)
  • LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
  • DNS/IP logic consistent? (Cloudflare external, Technitium internal)
  • Port allocation logic preserved? (PortPool DB-backed)

Common Drift Patterns:

  • ⚠️ Adding agent installation logic to API
  • ⚠️ Changing agent contract without updating both sides
  • ⚠️ Moving DNS logic to different service
  • ⚠️ Bypassing job queue for provisioning

When Moving to Go Agent Work:

Pre-Switch Checklist:

  • No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
  • File paths remain canonical (/opt/zlh/<game>/<variant>/world)
  • Naming conventions maintained (server.jar, fabric-server.jar, etc.)
  • No external API calls (Proxmox, DNS, Velocity)
  • Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)

Common Drift Patterns:

  • ⚠️ Adding port allocation to agent
  • ⚠️ Making agent talk to external services
  • ⚠️ Changing directory structure without API coordination
  • ⚠️ Adding orchestration logic to agent

When Moving to Frontend Work:

Pre-Switch Checklist:

  • Only API-approved fields used in UI
  • No direct agent HTTP calls
  • VMID not exposed in user-facing UI
  • Internal IPs not displayed to users
  • All state from API, not computed locally

Common Drift Patterns:

  • ⚠️ Adding direct agent calls from frontend
  • ⚠️ Computing server state client-side
  • ⚠️ Exposing internal infrastructure details
  • ⚠️ Bypassing API for container control
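
The checklist items about not exposing VMIDs or internal IPs can be backed by a mapping step that builds the user-facing view from an explicit list of safe fields. This is a sketch only: `ContainerRecord`, `PublicServerView`, and their field names are hypothetical and do not reflect the actual API response shape.

```typescript
// Hypothetical shape of what the API returns for a container.
interface ContainerRecord {
  vmid: number;
  hostname: string;
  internalIP: string;
  state: string;
  ports: number[];
}

// Hypothetical shape of what the UI is allowed to render.
interface PublicServerView {
  hostname: string;
  state: string;
  ports: number[];
}

// Copies only the fields that are safe to show; vmid and internalIP are
// dropped by construction rather than by remembering to hide them.
function toPublicView(record: ContainerRecord): PublicServerView {
  return {
    hostname: record.hostname,
    state: record.state,
    ports: record.ports.slice(),
  };
}

// True for RFC 1918 private IPv4 addresses (10/8, 172.16/12, 192.168/16).
// Can be used as a last-line check before rendering any address string.
function isPrivateIPv4(ip: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
  if (!m) {
    return false;
  }
  const octets = [m[1], m[2], m[3], m[4]].map((o) => Number(o));
  if (octets.some((o) => o > 255)) {
    return false;
  }
  const a = octets[0];
  const b = octets[1];
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}
```

An allow-list mapping like `toPublicView` fails safe: a new internal field added to the API response stays invisible in the UI until someone deliberately adds it to the view type.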

🚨 High-Risk Integration Zones (GUARDED)

These areas have historically caused drift across sessions:

1. Forge / NeoForge Installation Logic

  • Risk: Agent vs API confusion on who handles run.sh patching
  • Guard: Agent owns ALL Forge installation, API just passes config
  • Test: Can provision Forge 1.21.3 without API filesystem access?

2. Cloudflare SRV Deletion

  • Risk: Case sensitivity, subdomain normalization, record ID tracking
  • Guard: API stores Cloudflare record IDs in EdgeState, deletes by ID
  • Test: Create → Delete → Recreate same hostname without orphans?
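
The guard above (store record IDs, delete by ID) combined with hostname normalization could look roughly like the following. It is a sketch with an in-memory map standing in for the EdgeState table; `EdgeStateStore`, `normalizeHostname`, and the method names are hypothetical, and no Cloudflare API call is shown.

```typescript
// Lower-cases, trims, and strips a trailing dot so that lookups do not
// depend on how the hostname was typed.
function normalizeHostname(hostname: string): string {
  return hostname.trim().toLowerCase().replace(/\.$/, "");
}

// Hypothetical in-memory stand-in for the EdgeState table.
class EdgeStateStore {
  private recordIdsByHost = new Map<string, string[]>();

  // Remembers a record ID returned by the DNS provider at creation time.
  track(hostname: string, recordId: string): void {
    const key = normalizeHostname(hostname);
    const ids = this.recordIdsByHost.get(key) ?? [];
    if (ids.indexOf(recordId) === -1) {
      ids.push(recordId);
    }
    this.recordIdsByHost.set(key, ids);
  }

  // Returns the IDs that should be deleted and forgets them, so a later
  // create for the same hostname starts from a clean slate.
  takeForDeletion(hostname: string): string[] {
    const key = normalizeHostname(hostname);
    const ids = this.recordIdsByHost.get(key) ?? [];
    this.recordIdsByHost.delete(key);
    return ids;
  }
}
```

Deleting by stored ID rather than by name lookup sidesteps the case-sensitivity and subdomain-normalization risks listed above, since the name is only ever used as a normalized local key.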

3. Technitium DNS Zone Mismatch

  • Risk: Wrong zone selection, duplicate records
  • Guard: API hardcodes zone as zpack.zerolaghub.com, validates before creation
  • Test: No records created in wrong zones?

4. Velocity Registration Order

  • Risk: Registering before server ready, deregistering incorrectly
  • Guard: API waits for agent RUNNING state, then registers Velocity
  • Test: Player connection works immediately after provisioning complete?

5. PortPool Commit Logic

  • Risk: Race conditions, double-allocation, uncommitted ports
  • Guard: API allocates → provisions → commits (rollback on failure)
  • Test: Concurrent provisions don't collide on ports?
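
The allocate → provision → commit sequence with rollback can be sketched as a small reserve/commit/release cycle. This is an in-memory illustration only: the real PortPool is described as DB-backed, where collision safety would come from database transactions or row locks, and the class and function names here are assumptions.

```typescript
type PortStatus = "free" | "reserved" | "committed";

// In-memory stand-in for the DB-backed PortPool (illustration only).
class PortPool {
  private entries: { port: number; status: PortStatus }[];

  constructor(ports: number[]) {
    this.entries = ports.map((port) => ({ port, status: "free" as PortStatus }));
  }

  // Reserves `count` free ports, or throws without reserving anything.
  allocate(count: number): number[] {
    const free = this.entries.filter((e) => e.status === "free").slice(0, count);
    if (free.length < count) {
      throw new Error("port pool exhausted");
    }
    free.forEach((e) => { e.status = "reserved"; });
    return free.map((e) => e.port);
  }

  commit(ports: number[]): void { this.setStatus(ports, "committed"); }
  release(ports: number[]): void { this.setStatus(ports, "free"); }

  statusOf(port: number): PortStatus | undefined {
    const hit = this.entries.filter((e) => e.port === port)[0];
    return hit ? hit.status : undefined;
  }

  private setStatus(ports: number[], status: PortStatus): void {
    this.entries.forEach((e) => {
      if (ports.indexOf(e.port) !== -1) { e.status = status; }
    });
  }
}

// allocate -> provision -> commit; the reservation is released if the
// provisioning step throws, so failed jobs do not leak ports.
function provisionWithPorts<T>(
  pool: PortPool,
  count: number,
  provision: (ports: number[]) => T
): T {
  const ports = pool.allocate(count);
  try {
    const result = provision(ports);
    pool.commit(ports);
    return result;
  } catch (err) {
    pool.release(ports);
    throw err;
  }
}
```

Since reserved ports are no longer "free", a second allocation made while the first is still provisioning cannot receive the same ports, which is the property the concurrency test above is meant to confirm.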

6. Agent READY Detection

  • Risk: False negatives, false positives, variant-specific patterns
  • Guard: Agent uses variant-aware log parsing, multiple confirmation lines
  • Test: All 6 variants correctly detect READY state?
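
Variant-aware READY detection can be modeled as a table of log patterns per variant, where every pattern must appear before the server counts as ready. The agent itself is written in Go; the sketch below uses TypeScript only to illustrate the logic. The `Done (...)! For help, type "help"` line is what vanilla and Paper servers print on startup; treating it as the default for other variants, and the names `READY_PATTERNS` and `isReady`, are assumptions.

```typescript
// The "Done (...)" startup line printed by vanilla and Paper servers.
const DONE_LINE = /Done \(\d+(?:\.\d+)?s\)! For help, type "help"/;

// Per-variant confirmation patterns. Only "paper" is filled in here; other
// variants fall back to DEFAULT_PATTERNS, which is an assumption.
const READY_PATTERNS: Partial<Record<string, RegExp[]>> = {
  paper: [DONE_LINE],
};
const DEFAULT_PATTERNS: RegExp[] = [DONE_LINE];

// Ready only when every pattern for the variant matches at least one line.
function isReady(variant: string, logLines: string[]): boolean {
  const patterns = READY_PATTERNS[variant] ?? DEFAULT_PATTERNS;
  return patterns.every((pattern) => logLines.some((line) => pattern.test(line)));
}
```

Requiring all patterns in a variant's list to match is how "multiple confirmation lines" can be expressed: appending a second pattern for a variant tightens detection without touching the matching logic.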

7. Server Start-Up False Negatives

  • Risk: Timeout too short, log parsing too strict
  • Guard: Agent increases timeout for Forge (90s), multiple log patterns
  • Test: Forge installer completes without false failure?

8. IP Selection Logic

  • Risk: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
  • Guard: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
  • Test: DNS points to correct IPs, Velocity routes to correct internal IP?

📋 Architecture Decision Log (LOCKED)

Purpose: Records finalized architectural decisions that must not be re-litigated unless explicitly requested.

Status: LOCKED - These decisions are final and cannot be changed unless the user explicitly says "Revisit decision X"


DEC-001: Templates vs Go Agent (FINAL)

Decision: Hybrid model

  • LXC templates define base environment only
  • Go Agent is authoritative execution layer inside containers

Rationale:

  • Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
  • Agent enables self-repair, async provisioning, runtime control
  • Hybrid provides speed + flexibility without API container access

Applies To: API v2, Go Agent, Frontend
Status: LOCKED


DEC-002: Provisioning Authority

Decision: API orchestrates, Agent executes

Rationale:

  • API has global visibility (DB, DNS, Proxmox, Velocity)
  • Agent is intentionally sandboxed to container filesystem + process

Applies To: All systems
Status: LOCKED


DEC-003: DNS & Edge Publishing Ownership

Decision: API-only responsibility

Rationale:

  • Requires external credentials (Cloudflare, Technitium)
  • Must correlate DB state, record IDs, reconciliation jobs

Applies To: API v2
Status: LOCKED


DEC-004: Proxy Stack

Decision: Traefik + Velocity only

Rationale:

  • Traefik for HTTP/control-plane
  • Velocity for Minecraft TCP routing
  • HAProxy explicitly deprecated for ZeroLagHub

Applies To: Infrastructure, API
Status: LOCKED


DEC-005: State Persistence

Decision: MariaDB is single source of truth

Rationale:

  • Flat files caused race conditions and drift
  • DB enables reconciliation, recovery, observability

Applies To: API v2
Status: LOCKED


DEC-006: Frontend Access Model

Decision: Frontend communicates with API only

Rationale:

  • Security boundary
  • Prevents leaking infrastructure details

Applies To: Frontend
Status: LOCKED


DEC-007: Architecture Enforcement Policy

Decision: Drift prevention is mandatory

Rationale:

  • Prevents oscillation between alternatives
  • Preserves velocity during late-stage development

Applies To: All work sessions
Status: LOCKED


📋 Canonical Architecture Anchors (ABSOLUTE)

These rules are immutable unless explicitly changed with full system review:

Anchor #1: Orchestration

API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)

Anchor #2: Container-Internal

Agent performs EVERYTHING inside container (install, start, stop, detect)

Anchor #3: Templates

Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)

Anchor #4: Job Queue

BullMQ/Redis drive job system (async provisioning, retries, reconciliation)

Anchor #5: Database

MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)

Anchor #6: Infrastructure

Proxmox API (not Ansible) for LXC management

Anchor #7: Routing

Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems

Anchor #8: DNS

Cloudflare = authoritative public DNS
Technitium = authoritative internal DNS

Anchor #9: Frontend Isolation

Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)

Anchor #10: Directory Structure

/opt/zlh/<game>/<variant>/world is canonical game server path
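
Building this path in one helper, rather than by ad hoc string concatenation, makes it harder for the layout to drift. A minimal sketch, assuming game and variant names are lowercase identifiers; `worldPath`, `ZLH_ROOT`, and the segment pattern are illustrative, not taken from the agent code.

```typescript
const ZLH_ROOT = "/opt/zlh";

// Assumed naming rule: lowercase letters, digits, "_" and "-" only. This
// also rejects "..", "/" and empty segments.
const SEGMENT = /^[a-z0-9][a-z0-9_-]*$/;

// Builds /opt/zlh/<game>/<variant>/world, rejecting unsafe segments.
function worldPath(game: string, variant: string): string {
  for (const part of [game, variant]) {
    if (!SEGMENT.test(part)) {
      throw new Error(`invalid path segment: ${part}`);
    }
  }
  return `${ZLH_ROOT}/${game}/${variant}/world`;
}
```

For example, `worldPath("minecraft", "paper")` yields `/opt/zlh/minecraft/paper/world`, while a segment such as `../etc` is rejected before any path is formed.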


🛡️ Enforcement Policies (ACTIVE)

When future instructions conflict with Canonical Architecture or Architecture Decision Log:

Step 1: STOP

Immediately halt the task. Do not proceed with drift-inducing change.

Step 2: RAISE DRIFT WARNING

⚠️ DRIFT WARNING ⚠️

Proposed change violates Canonical Architecture:

Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]

Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]

Impact: [What breaks if this drift is allowed]

Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
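
If warnings are produced by tooling rather than written by hand, the template above can be generated from a small structure so that every warning carries the same fields. This is an illustrative sketch; `DriftWarning` and `formatDriftWarning` are hypothetical names, and the text is copied from the template above.

```typescript
interface DriftWarning {
  violated: string;     // e.g. "Rule #1: Provisioning Ownership" or "DEC-004: Proxy Stack"
  violation: string;    // specific behavior that violates the rule/decision
  correctPath: string;  // architecture-aligned approach
  impact: string;       // what breaks if the drift is allowed
}

// Renders the warning using the template wording from this document.
// Entries starting with "DEC-" are labeled as decisions, others as rules.
function formatDriftWarning(w: DriftWarning): string {
  const label = /^DEC-\d{3}/.test(w.violated) ? "Decision Violated" : "Rule Violated";
  return [
    "⚠️ DRIFT WARNING ⚠️",
    "",
    "Proposed change violates Canonical Architecture:",
    "",
    `${label}: ${w.violated}`,
    "",
    `Violation: ${w.violation}`,
    `Correct Path: ${w.correctPath}`,
    "",
    `Impact: ${w.impact}`,
    "",
    "Options:",
    "1. Implement correct architecture-aligned path",
    "2. Amend Canonical Architecture (requires full system review)",
    '3. Request user to "Revisit decision X" (for ADL changes)',
    "4. Cancel proposed change",
  ].join("\n");
}
```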

Step 3: PROVIDE CORRECT PATH

Show the architecture-aligned implementation that achieves the same goal.

Step 4: ASK FOR DIRECTION

Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?

Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE

If violation is against a LOCKED Architecture Decision (DEC-001 through DEC-007):

ADDITIONAL CHECK:

⚠️ LOCKED DECISION WARNING ⚠️

This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED

This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"

Without explicit user request to revisit, this decision is FINAL.

Proceeding with this change would violate architectural governance.

Never proceed with drift-inducing changes without explicit confirmation.
Never re-litigate locked decisions without user explicitly requesting revision.
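
The "explicit request" rule for locked decisions is strict enough to be checked mechanically. A minimal sketch, assuming the request must contain the literal phrase "Revisit decision" followed by the decision ID (matched case-insensitively); `isRevisitRequested` is a hypothetical helper name.

```typescript
const LOCKED_DECISIONS = [
  "DEC-001", "DEC-002", "DEC-003", "DEC-004", "DEC-005", "DEC-006", "DEC-007",
];

// True only if the message explicitly asks to revisit this exact decision.
// Unknown IDs are rejected rather than silently treated as unlocked.
function isRevisitRequested(decisionId: string, userMessage: string): boolean {
  if (LOCKED_DECISIONS.indexOf(decisionId) === -1) {
    throw new Error(`unknown decision: ${decisionId}`);
  }
  // decisionId comes from the fixed list above, so it is safe to embed.
  const pattern = new RegExp(`revisit decision\\s+${decisionId}\\b`, "i");
  return pattern.test(userMessage);
}
```

A message that merely argues for an alternative (for instance, proposing HAProxy) does not match, so the locked decision stays in force until the exact phrase is used.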


📊 Drift Detection Examples

Example 1: Agent Port Allocation (VIOLATION)

Proposed Change:

// Agent code proposal
func (a *Agent) allocatePort() int {
    // Find available port...
    return port
}

Drift Warning:

⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #1 - Provisioning Ownership
Violation: Agent attempting to allocate ports
Correct Path: API allocates ports, agent receives them via /config

Impact: Port collisions, database inconsistency, broken PortPool

Recommendation: Remove port allocation from agent, ensure API sends ports in config payload

Example 2: Frontend Direct Agent Call (VIOLATION)

Proposed Change:

// Frontend code proposal
async function getServerLogs(vmid: number) {
    const agentURL = `http://10.200.0.${vmid}:8080/logs`;
    return await fetch(agentURL);
}

Drift Warning:

⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #3 - Direct Container Access
Violation: Frontend bypassing API to talk to agent
Correct Path: Frontend → API → Agent (API proxies logs)

Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk

Recommendation: Fetch logs through GET /api/containers/:vmid/logs (already in the allowed endpoint list), with the API proxying the request to the agent

Example 3: API Installing Java (VIOLATION)

Proposed Change:

// API code proposal
async function provisionServer(vmid, game, variant) {
    await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
    await proxmox.exec(vmid, 'wget https://papermc.io/...');
}

Drift Warning:

⚠️ DRIFT WARNING ⚠️

Rule Violated: Rule #2 - Artifact Ownership
Violation: API performing in-container installation
Correct Path: API sends config to agent, agent handles installation

Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system

Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config

📁 Integration with Existing Documentation

Relationship to Master Bootstrap

  • Master Bootstrap: Strategic overview and business model
  • This Document: Technical governance and boundary enforcement
  • Usage: Consult this before implementing ANY code changes

Relationship to Complete Current State

  • Complete Current State: What's working, what's next
  • This Document: How things MUST work (regardless of current state)
  • Usage: This is the "law", current state is "status"

Relationship to Engineering Handover

  • Engineering Handover: Daily tactical tasks and sprint plan
  • This Document: Constraints within which tasks must be implemented
  • Usage: Check this before starting each handover task

🔄 Architecture Amendment Process

If there is a legitimate need to change the Canonical Architecture:

Step 1: Identify Change

Document exactly what architectural boundary needs to change and why.

Step 2: Full System Impact Analysis

  • What breaks in API?
  • What breaks in Agent?
  • What breaks in Frontend?
  • What changes in the contracts?
  • What database migrations are needed?

Step 3: Update ALL Affected Documents

  • This document (Canonical Architecture)
  • Master Bootstrap (if strategic impact)
  • Complete Current State (implementation changes)
  • Engineering Handover (sprint tasks)
  • Agent Spec, Operational Guide (if affected)

Step 4: Update ALL Systems

  • API code + tests
  • Agent code + tests
  • Frontend code + tests
  • Database schema (migration)
  • Infrastructure config

Step 5: Validation

  • Integration tests pass?
  • No new drift introduced?
  • Documentation consistent?
  • All AIs briefed on change?

Only after ALL steps are done is the architecture amendment complete.


🎯 Quick Reference Card

API Owns

  • Provisioning orchestration
  • Port allocation (PortPool)
  • DNS (Cloudflare + Technitium)
  • Velocity registration
  • IP logic (external + internal)
  • Job queue (BullMQ)
  • Database state

Agent Owns

  • Container-internal installation
  • Java runtime
  • Artifact downloads
  • Server process management
  • READY detection
  • Self-repair + verification
  • Filesystem layout

Frontend Owns

  • User interaction
  • Display logic
  • Client state
  • Form validation

Never

  • Agent allocates ports
  • Agent talks to DNS/Velocity/Proxmox
  • API installs server files
  • API executes in-container commands
  • Frontend talks to agent directly
  • Frontend talks to infrastructure

📋 Session Start Checklist

Before every coding session with any AI:

  • Read this document's Quick Reference Card
  • Identify which system you're working on (API, Agent, Frontend)
  • Review that system's "Owns" list
  • Check High-Risk Integration Zones if touching those areas
  • Verify no drift from previous session
  • Confirm contracts unchanged since last session

If ANY doubt: Re-read full Canonical Architecture Anchors section.


Document Status

Status: ACTIVE - Must be consulted before all code changes
Enforcement: MANDATORY - Drift violations must be caught
Authority: CANONICAL - Overrides conflicting guidance
Updates: Only via Architecture Amendment Process


🛡️ This document prevents architectural drift. Violate at your own risk.