knowledge-base/ZeroLagHub_Cross_Project_Tracker.md
2025-12-13 16:40:48 +00:00

# 🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System
**Last Updated**: December 7, 2025
**Version**: 1.0 (Canonical Architecture Enforcement)
**Status**: ACTIVE - Must Be Consulted Before All Code Changes
---
## 🎯 Document Purpose
This document establishes **architectural boundaries** across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.
**Critical Insight**: Architectural failures here are far more likely to come from **gradual erosion** (many small boundary violations across sessions) than from one sudden breaking change. This document exists to catch that erosion early.
---
## 📊 The Three Systems (Canonical Ownership)
```
┌─────────────────────────────────────────────────────────────┐
│                       NODE.JS API v2                        │
│                   (Orchestration Engine)                    │
│                                                             │
│  Owns: VMID allocation, port management, DNS publishing,    │
│        Velocity registration, job queue, database state     │
│                                                             │
│  Speaks To: Proxmox, Cloudflare, Technitium, Velocity,      │
│             MariaDB, BullMQ, Go Agent (HTTP)                │
│                                                             │
│  Never Touches: Container filesystem, game server files,    │
│                 Java installation, artifact downloads       │
└─────────────────────┬───────────────────────────────────────┘
                      │ HTTP Contract
                      │ POST /config, /start, /stop
                      │ GET /status, /health
┌─────────────────────────────────────────────────────────────┐
│                          GO AGENT                           │
│                (Container-Internal Manager)                 │
│                                                             │
│  Owns: Server installation, Java runtime, artifact          │
│        downloads, process management, READY detection,      │
│        filesystem layout, verification + self-repair        │
│                                                             │
│  Speaks To: Local filesystem, game server process,          │
│             API (status updates via HTTP polling)           │
│                                                             │
│  Never Touches: Proxmox, DNS, Cloudflare, Velocity,         │
│                 port allocation, VMID selection             │
└─────────────────────┬───────────────────────────────────────┘
                      │ Status Polling
                      │ Agent reports state
┌─────────────────────────────────────────────────────────────┐
│                      NEXT.JS FRONTEND                       │
│                    (Customer + Admin UI)                    │
│                                                             │
│  Owns: User interaction, form validation, display logic,    │
│        client-side state, UI components                     │
│                                                             │
│  Speaks To: API v2 only (REST + WebSocket when added)       │
│                                                             │
│  Never Touches: Proxmox, Go Agent, DNS, Velocity,           │
│                 Cloudflare, direct container access         │
└─────────────────────────────────────────────────────────────┘
```
---
## 🗺️ System Ownership Matrix (CANONICAL)
| Area | Node.js API v2 | Go Agent | Frontend |
|------|----------------|----------|----------|
| **Provisioning Orchestration** | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
| **Template Selection** | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
| **Server Installation** | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
| **Runtime Control** | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
| **DNS (Cloudflare + Technitium)** | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
| **Velocity Registration** | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
| **IP Logic** | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
| **Port Allocation** | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
| **Monitoring** | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
| **Error Handling** | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |
---
## 🔄 API ↔ Agent Contract (IMMUTABLE)
### **API → Agent Endpoints**
```
POST /config
├─ Payload: {
│    game: "minecraft",
│    variant: "paper",
│    version: "1.21.3",
│    ports: [25565, 25575],
│    memory: "4G",
│    motd: "Welcome to ZeroLagHub",
│    worldSettings: {...}
│  }
└─ Response: 200 OK

POST /start
└─ Triggers: ensureProvisioned() → StartServer()

POST /stop
└─ Triggers: Graceful shutdown via server stop command

POST /restart
└─ Triggers: stop → start sequence

GET /status
└─ Response: {
     state: "RUNNING" | "INSTALLING" | "FAILED",
     pid: 12345,
     uptime: 3600,
     lastError: null
   }

GET /health
└─ Response: {
     healthy: true,
     java: "/usr/lib/jvm/java-21-openjdk-amd64",
     variant: "paper",
     ready: true
   }
```
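The `POST /config` payload above can be checked on the API side before it is sent. The following is a minimal sketch, not the project's actual validator: the `AgentConfigPayload` type and `validateAgentConfig` function are hypothetical names, and the memory format (`4G`, `512M`) is an assumption inferred from the single example value.

```typescript
// Hypothetical type mirroring the POST /config fields listed in the contract.
interface AgentConfigPayload {
  game: string;
  variant: string;
  version: string;
  ports: number[];
  memory: string;
  motd: string;
  worldSettings: Record<string, unknown>;
}

// Returns a list of problems; an empty list means the payload looks well-formed.
function validateAgentConfig(payload: AgentConfigPayload): string[] {
  const problems: string[] = [];
  if (payload.game.trim() === "") problems.push("game is required");
  if (payload.variant.trim() === "") problems.push("variant is required");
  if (payload.version.trim() === "") problems.push("version is required");
  if (payload.ports.length === 0) problems.push("at least one port is required");
  for (const port of payload.ports) {
    // Ports must be whole numbers in the valid TCP/UDP range.
    if (Math.floor(port) !== port || port < 1 || port > 65535) {
      problems.push(`port ${port} is out of range`);
    }
  }
  // Assumption: memory is a number followed by M or G, as in the "4G" example.
  if (!/^\d+[MG]$/.test(payload.memory)) {
    problems.push("memory must look like 4G or 512M");
  }
  return problems;
}

const examplePayload: AgentConfigPayload = {
  game: "minecraft",
  variant: "paper",
  version: "1.21.3",
  ports: [25565, 25575],
  memory: "4G",
  motd: "Welcome to ZeroLagHub",
  worldSettings: {},
};
console.log(validateAgentConfig(examplePayload));
```

Returning a list of problems rather than throwing on the first one lets a BullMQ job record every contract violation in a single failure message.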
### **Agent → API Signals (via /status polling)**
```
State Machine:
INSTALLING
├─> DOWNLOADING_ARTIFACT
├─> INSTALLING_JAVA
├─> FINALIZING
├─> RUNNING
└─> FAILED | CRASHED
```
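When the API polls `/status`, it can sanity-check the sequence of states it observes. The sketch below assumes one reading of the diagram above (install steps happen in the listed order, `FAILED` can occur during any install step, `CRASHED` only after `RUNNING`); the transition table and function names are illustrative, not a confirmed specification.

```typescript
type AgentState =
  | "INSTALLING"
  | "DOWNLOADING_ARTIFACT"
  | "INSTALLING_JAVA"
  | "FINALIZING"
  | "RUNNING"
  | "FAILED"
  | "CRASHED";

// Assumed transition table: one interpretation of the state machine diagram.
const allowedTransitions: Record<AgentState, AgentState[]> = {
  INSTALLING: ["DOWNLOADING_ARTIFACT", "FAILED"],
  DOWNLOADING_ARTIFACT: ["INSTALLING_JAVA", "FAILED"],
  INSTALLING_JAVA: ["FINALIZING", "FAILED"],
  FINALIZING: ["RUNNING", "FAILED"],
  RUNNING: ["CRASHED"],
  FAILED: [],
  CRASHED: [],
};

function isAllowedTransition(from: AgentState, to: AgentState): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}

// Returns the index of the first illegal step in a polled sequence, or -1 if
// every step is legal. Repeated identical states are fine because polling can
// observe the same state many times.
function firstIllegalStep(observed: AgentState[]): number {
  for (let i = 1; i < observed.length; i++) {
    const previous = observed[i - 1];
    const current = observed[i];
    if (previous !== current && !isAllowedTransition(previous, current)) {
      return i;
    }
  }
  return -1;
}

console.log(firstIllegalStep(["INSTALLING", "INSTALLING", "DOWNLOADING_ARTIFACT"]));
```

Because polling can skip short-lived states, a real check would probably treat a skipped intermediate state as a warning rather than an error.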
### **🚫 Forbidden Agent Behaviors**
Agent **NEVER**:
- ❌ Allocates or manages ports
- ❌ Talks to Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Modifies LXC config
- ❌ Allocates VMIDs
- ❌ Manages templates
- ❌ Talks to Cloudflare or Technitium
**Violation Detection**: If agent code imports Proxmox client, DNS client, or port allocation logic → **DRIFT VIOLATION**
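
The import-based violation check above could be automated in CI. The following is a rough sketch under stated assumptions: the marker list and both function names are hypothetical, the scan is plain text matching rather than a real Go parser, and commented-out imports inside an `import ( ... )` block would still be counted.

```typescript
// Hypothetical markers: substrings of import paths that would indicate drift
// if they appeared in agent code. Adjust to the real package names.
const forbiddenAgentImportMarkers: string[] = [
  "proxmox",
  "cloudflare",
  "technitium",
  "velocity",
  "portpool",
];

// Collects double-quoted import paths from Go source text, covering both
// single-line imports and import ( ... ) blocks.
function extractGoImports(source: string): string[] {
  const imports: string[] = [];
  let insideBlock = false;
  for (const rawLine of source.split("\n")) {
    const line = rawLine.trim();
    if (insideBlock) {
      if (line === ")") {
        insideBlock = false;
        continue;
      }
      const blockMatch = /"([^"]+)"/.exec(line);
      if (blockMatch) imports.push(blockMatch[1]);
    } else if (line === "import (") {
      insideBlock = true;
    } else if (line.indexOf("import ") === 0) {
      const singleMatch = /"([^"]+)"/.exec(line);
      if (singleMatch) imports.push(singleMatch[1]);
    }
  }
  return imports;
}

// Returns every import path that contains a forbidden marker.
function findAgentDriftImports(source: string): string[] {
  const hits: string[] = [];
  for (const importPath of extractGoImports(source)) {
    const lowered = importPath.toLowerCase();
    for (const marker of forbiddenAgentImportMarkers) {
      if (lowered.indexOf(marker) !== -1) {
        hits.push(importPath);
        break;
      }
    }
  }
  return hits;
}
```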
---
## 🖥️ Frontend ↔ API Contract (IMMUTABLE)
### **Frontend Allowed Endpoints**
```
Provisioning:
  POST /api/containers/create
  GET /api/containers/:vmid
  DELETE /api/containers/:vmid

Control:
  POST /api/containers/:vmid/start
  POST /api/containers/:vmid/stop
  POST /api/containers/:vmid/restart

Status:
  GET /api/containers/:vmid/status
  GET /api/containers/:vmid/logs
  GET /api/containers/:vmid/stats

Discovery:
  GET /api/templates
  GET /api/containers (list user's containers)

Read-Only Info:
  GET /api/dns/:hostname (display only)
  GET /api/velocity/status (display only)
```
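The endpoint list above can double as an allowlist, for example in a frontend API client wrapper or a lint step. This is a minimal sketch with hypothetical function names; it matches path segments only and does not handle query strings.

```typescript
// Allowlist transcribed from the endpoint list above. Segments starting with
// ":" match any single non-empty path segment.
const frontendAllowedRoutes: string[] = [
  "POST /api/containers/create",
  "GET /api/containers/:vmid",
  "DELETE /api/containers/:vmid",
  "POST /api/containers/:vmid/start",
  "POST /api/containers/:vmid/stop",
  "POST /api/containers/:vmid/restart",
  "GET /api/containers/:vmid/status",
  "GET /api/containers/:vmid/logs",
  "GET /api/containers/:vmid/stats",
  "GET /api/templates",
  "GET /api/containers",
  "GET /api/dns/:hostname",
  "GET /api/velocity/status",
];

function routeMatches(pattern: string, method: string, path: string): boolean {
  const parts = pattern.split(" ");
  if (parts[0] !== method.toUpperCase()) return false;
  const patternSegments = parts[1].split("/");
  const pathSegments = path.split("/");
  if (patternSegments.length !== pathSegments.length) return false;
  for (let i = 0; i < patternSegments.length; i++) {
    const isParam = patternSegments[i].charAt(0) === ":";
    if (isParam) {
      if (pathSegments[i] === "") return false;
    } else if (patternSegments[i] !== pathSegments[i]) {
      return false;
    }
  }
  return true;
}

// True when the frontend is permitted to make this call under the contract.
function isFrontendCallAllowed(method: string, path: string): boolean {
  for (const pattern of frontendAllowedRoutes) {
    if (routeMatches(pattern, method, path)) return true;
  }
  return false;
}

console.log(isFrontendCallAllowed("POST", "/api/containers/1042/restart"));
```

Keeping the allowlist as plain strings makes it easy to diff against this document whenever the contract is amended.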
### **🚫 Forbidden Frontend Behaviors**
Frontend **NEVER**:
- ❌ Talks directly to Go Agent
- ❌ Calls Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Allocates ports
- ❌ Executes container commands
- ❌ Accesses MariaDB directly
**Violation Detection**: If frontend code imports Proxmox client, agent HTTP client (except via API), or database client → **DRIFT VIOLATION**
---
## 🚨 Drift Detection Rules (ACTIVE ENFORCEMENT)
### **Rule #1: Provisioning Ownership**
**Violation**: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs
**Correct Path**: API owns ALL provisioning orchestration
**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
	// ... port allocation logic
}
```
**Correct Pattern**:
```javascript
// RIGHT - API allocates, agent receives
// vmid is allocated by the API earlier in the provisioning flow
async function provisionInstance(vmid, game, variant) {
  const ports = await portAllocator.allocate(vmid);
  await agent.postConfig({ ports, game, variant });
}
```
---
### **Rule #2: Artifact Ownership**
**Violation**: API asked to install Java, download server.jar, run installers
**Correct Path**: Agent owns ALL in-container installation
**Example Violation**:
```javascript
// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
  await proxmox.exec(vmid, `wget https://...`);
  await proxmox.exec(vmid, `java -jar installer.jar`);
}
```
**Correct Pattern**:
```go
// RIGHT - Agent handles installation
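// Helpers are assumed to return an error; the first failure aborts provisioning.
func (p *Provisioner) ProvisionAll(cfg Config) error {
	if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
		return err
	}
	if err := installJavaRuntime(cfg.Version); err != nil {
		return err
	}
	return verifyInstallation()
}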
```
---
### **Rule #3: Direct Container Access**
**Violation**: Frontend or API wants to exec commands directly into container
**Correct Path**: Agent owns container execution layer
**Example Violation**:
```typescript
// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
  const ssh = new SSHClient();
  await ssh.connect(containerIP);
  await ssh.exec('systemctl restart minecraft');
}
```
**Correct Pattern**:
```typescript
// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
  await api.post(`/containers/${vmid}/restart`);
}
```
---
### **Rule #4: Networking Responsibilities**
**Violation**: Agent asked to select public vs internal IPs, decide DNS zones
**Correct Path**: API owns dual-IP logic (Cloudflare external, Technitium internal)
**Example Violation**:
```go
// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
	if a.needsCloudflare() {
		return "139.64.165.248"
	}
	return a.containerIP
}
```
**Correct Pattern**:
```javascript
// RIGHT - API decides network topology
function determineIPs(vmid, game) {
  const internalIP = `10.200.0.${vmid - 1000}`;
  const externalIP = "139.64.165.248"; // Cloudflare target
  const velocityIP = "10.70.0.241"; // Internal routing
  return { internalIP, externalIP, velocityIP };
}
```
---
### **Rule #5: Proxy Responsibilities**
**Violation**: Agent asked to register with Velocity, configure proxy routing
**Correct Path**: API owns ALL proxy integrations
**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
	client := velocity.NewClient()
	return client.Register(a.hostname, a.port)
}
```
**Correct Pattern**:
```javascript
// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
  await velocityBridge.registerBackend({
    name: hostname,
    address: internalIP,
    port: 25565
  });
}
```
---
## 🔄 Context Switching Safety Workflow
### **When Moving to Node.js API Work:**
**Pre-Switch Checklist**:
- [ ] Agent contract unchanged? (POST /config, /start, /stop, /restart; GET /status, /health)
- [ ] Database schema unchanged? (Prisma models consistent)
- [ ] LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
- [ ] DNS/IP logic consistent? (Cloudflare external, Technitium internal)
- [ ] Port allocation logic preserved? (PortPool DB-backed)
**Common Drift Patterns**:
- ⚠️ Adding agent installation logic to API
- ⚠️ Changing agent contract without updating both sides
- ⚠️ Moving DNS logic to different service
- ⚠️ Bypassing job queue for provisioning
---
### **When Moving to Go Agent Work:**
**Pre-Switch Checklist**:
- [ ] No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
- [ ] File paths remain canonical (`/opt/zlh/<game>/<variant>/world`)
- [ ] Naming conventions maintained (`server.jar`, `fabric-server.jar`, etc.)
- [ ] No external API calls (Proxmox, DNS, Velocity)
- [ ] Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)
**Common Drift Patterns**:
- ⚠️ Adding port allocation to agent
- ⚠️ Making agent talk to external services
- ⚠️ Changing directory structure without API coordination
- ⚠️ Adding orchestration logic to agent
---
### **When Moving to Frontend Work:**
**Pre-Switch Checklist**:
- [ ] Only API-approved fields used in UI
- [ ] No direct agent HTTP calls
- [ ] VMID not exposed in user-facing UI
- [ ] Internal IPs not displayed to users
- [ ] All state from API, not computed locally
**Common Drift Patterns**:
- ⚠️ Adding direct agent calls from frontend
- ⚠️ Computing server state client-side
- ⚠️ Exposing internal infrastructure details
- ⚠️ Bypassing API for container control
---
## 🚨 High-Risk Integration Zones (GUARDED)
These areas have historically caused drift across sessions:
### **1. Forge / NeoForge Installation Logic**
- **Risk**: Agent vs API confusion on who handles `run.sh` patching
- **Guard**: Agent owns ALL Forge installation, API just passes config
- **Test**: Can provision Forge 1.21.3 without API filesystem access?
### **2. Cloudflare SRV Deletion**
- **Risk**: Case sensitivity, subdomain normalization, record ID tracking
- **Guard**: API stores Cloudflare record IDs in EdgeState, deletes by ID
- **Test**: Create → Delete → Recreate same hostname without orphans?
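
The guard for Cloudflare SRV deletion (track record IDs, delete by ID, avoid case and subdomain mismatches) could look roughly like the sketch below. `EdgeStateSketch` is an in-memory stand-in for the real EdgeState table, and all names and record IDs here are hypothetical.

```typescript
// Normalizes a hostname so lookups are case-insensitive and ignore a trailing dot.
function normalizeHostname(hostname: string): string {
  return hostname.trim().toLowerCase().replace(/\.+$/, "");
}

// In-memory stand-in for the EdgeState table: normalized hostname -> record IDs.
class EdgeStateSketch {
  // Object.create(null) avoids accidental hits on inherited keys.
  private recordIdsByHost: Record<string, string[]> = Object.create(null);

  // Remembers a record ID returned by Cloudflare when the record was created.
  track(hostname: string, recordId: string): void {
    const key = normalizeHostname(hostname);
    const ids = this.recordIdsByHost[key] || [];
    if (ids.indexOf(recordId) === -1) ids.push(recordId);
    this.recordIdsByHost[key] = ids;
  }

  // Returns the IDs to delete and forgets them, so a later recreate of the
  // same hostname starts from a clean slate.
  takeRecordIds(hostname: string): string[] {
    const key = normalizeHostname(hostname);
    const ids = this.recordIdsByHost[key] || [];
    delete this.recordIdsByHost[key];
    return ids;
  }
}

const edgeState = new EdgeStateSketch();
edgeState.track("Play.Example.COM.", "rec-1");
edgeState.track("play.example.com", "rec-2");
console.log(edgeState.takeRecordIds("PLAY.example.com"));
```

Deleting by stored ID means the delete path never has to search Cloudflare by name, which is where case and normalization mismatches tend to leave orphans.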
### **3. Technitium DNS Zone Mismatch**
- **Risk**: Wrong zone selection, duplicate records
- **Guard**: API hardcodes zone as `zpack.zerolaghub.com`, validates before creation
- **Test**: No records created in wrong zones?
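
A pre-creation zone check for Technitium might be as small as the sketch below. The zone name comes from the guard above; the function name is hypothetical, and the check only covers zone membership, not duplicate detection.

```typescript
const technitiumZone = "zpack.zerolaghub.com";

// True when fqdn is the zone itself or a name strictly beneath it.
// A bare suffix check is not enough: "evilzpack.zerolaghub.com" ends with the
// zone string but is not inside the zone, so the leading dot is required.
function isInTechnitiumZone(fqdn: string): boolean {
  const host = fqdn.trim().toLowerCase().replace(/\.+$/, "");
  const suffix = "." + technitiumZone;
  if (host === technitiumZone) return true;
  return host.length > suffix.length && host.slice(-suffix.length) === suffix;
}

console.log(isInTechnitiumZone("mc-1042.zpack.zerolaghub.com"));
```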
### **4. Velocity Registration Order**
- **Risk**: Registering before server ready, deregistering incorrectly
- **Guard**: API waits for agent RUNNING state, then registers Velocity
- **Test**: Player connection works immediately after provisioning complete?
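
The ordering guard (register with Velocity only after the agent reports RUNNING) can be expressed as a small polling loop. This sketch is synchronous and omits the delay between polls to stay self-contained; a real implementation would be async and would sleep between attempts. `pollState` and `registerBackend` are hypothetical wrappers injected by the caller.

```typescript
type PolledState = "INSTALLING" | "RUNNING" | "FAILED";

interface RegistrationDeps {
  pollState: () => PolledState; // hypothetical wrapper around GET /status
  registerBackend: () => void; // hypothetical wrapper around Velocity registration
}

type RegistrationOutcome = "registered" | "failed" | "timed_out";

// Polls up to maxPolls times and registers only after RUNNING is observed.
// FAILED stops immediately; exhausting the polls never registers.
function registerWhenRunning(
  deps: RegistrationDeps,
  maxPolls: number
): RegistrationOutcome {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const state = deps.pollState();
    if (state === "RUNNING") {
      deps.registerBackend();
      return "registered";
    }
    if (state === "FAILED") return "failed";
  }
  return "timed_out";
}
```

Injecting both dependencies keeps the ordering rule testable without a running agent or proxy.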
### **5. PortPool Commit Logic**
- **Risk**: Race conditions, double-allocation, uncommitted ports
- **Guard**: API allocates → provisions → commits (rollback on failure)
- **Test**: Concurrent provisions don't collide on ports?
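
The allocate → provision → commit flow with rollback can be sketched as follows. `PortPoolSketch` is an in-memory illustration with hypothetical names; the real PortPool is DB-backed, where the reserve step would need a transaction or row lock to be safe under concurrency.

```typescript
type PortStatus = "free" | "reserved" | "committed";

class PortPoolSketch {
  private statusByPort: Record<number, PortStatus> = {};

  constructor(ports: number[]) {
    for (const port of ports) this.statusByPort[port] = "free";
  }

  // Reserves `count` free ports, or throws without changing any state.
  reserve(count: number): number[] {
    const free: number[] = [];
    for (const key of Object.keys(this.statusByPort)) {
      const port = Number(key);
      if (this.statusByPort[port] === "free" && free.length < count) {
        free.push(port);
      }
    }
    if (free.length < count) throw new Error("not enough free ports");
    for (const port of free) this.statusByPort[port] = "reserved";
    return free;
  }

  commit(ports: number[]): void {
    for (const port of ports) this.statusByPort[port] = "committed";
  }

  release(ports: number[]): void {
    for (const port of ports) this.statusByPort[port] = "free";
  }

  statusOf(port: number): PortStatus | undefined {
    return this.statusByPort[port];
  }
}

// Runs `provision` with reserved ports: commits on success, releases on failure.
function provisionWithPorts(
  pool: PortPoolSketch,
  count: number,
  provision: (ports: number[]) => void
): number[] {
  const ports = pool.reserve(count);
  try {
    provision(ports);
  } catch (err) {
    pool.release(ports); // rollback so the ports are not leaked
    throw err;
  }
  pool.commit(ports);
  return ports;
}
```

The intermediate `reserved` status is what prevents double-allocation: a second provision that starts before the first commits cannot see those ports as free.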
### **6. Agent READY Detection**
- **Risk**: False negatives, false positives, variant-specific patterns
- **Guard**: Agent uses variant-aware log parsing, multiple confirmation lines
- **Test**: All 6 variants correctly detect READY state?
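
Variant-aware READY detection with multiple confirmation lines might be structured as in the sketch below. It is written in TypeScript for consistency with the other examples, although the real logic lives in the Go agent. The two `paper` patterns are illustrative assumptions based on typical Minecraft server startup output; each variant would need its own verified patterns.

```typescript
// Illustrative patterns only: real variants may need different or extra lines.
const readyPatternsByVariant: Record<string, RegExp[]> = {
  paper: [
    /Starting Minecraft server on /,
    /Done \(\d+(\.\d+)?s\)! For help, type "help"/,
  ],
};

// READY only when every pattern for the variant has matched at least one log
// line. Unknown variants are never READY, which avoids false positives.
function isServerReady(variant: string, logLines: string[]): boolean {
  if (!Object.prototype.hasOwnProperty.call(readyPatternsByVariant, variant)) {
    return false;
  }
  const patterns = readyPatternsByVariant[variant];
  if (patterns.length === 0) return false;
  for (const pattern of patterns) {
    let matched = false;
    for (const line of logLines) {
      if (pattern.test(line)) {
        matched = true;
        break;
      }
    }
    if (!matched) return false;
  }
  return true;
}

const sampleLog = [
  "[12:00:01] [Server thread/INFO]: Starting Minecraft server on *:25565",
  '[12:00:09] [Server thread/INFO]: Done (7.912s)! For help, type "help"',
];
console.log(isServerReady("paper", sampleLog));
```

Requiring every pattern to match trades a slightly later READY signal for fewer false positives from a single stray log line.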
### **7. Server Start-Up False Negatives**
- **Risk**: Timeout too short, log parsing too strict
- **Guard**: Agent uses a longer start-up timeout for Forge (90s) and matches multiple log patterns
- **Test**: Forge installer completes without false failure?
### **8. IP Selection Logic**
- **Risk**: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
- **Guard**: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
- **Test**: DNS points to correct IPs, Velocity routes to correct internal IP?
---
## 📋 Architecture Decision Log (LOCKED) ⭐ NEW
**Purpose**: Records finalized architectural decisions that **must not be re-litigated** unless explicitly requested.
**Status**: LOCKED - These decisions are final and cannot be changed unless the user explicitly says "Revisit decision X"
---
### **DEC-001: Templates vs Go Agent (FINAL)**
**Decision**: Hybrid model
- LXC templates define **base environment only**
- Go Agent is **authoritative execution layer** inside containers
**Rationale**:
- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
- Agent enables self-repair, async provisioning, runtime control
- Hybrid provides speed + flexibility without API container access
**Applies To**: API v2, Go Agent, Frontend
**Status**: ✅ LOCKED
---
### **DEC-002: Provisioning Authority**
**Decision**: API orchestrates, Agent executes
**Rationale**:
- API has global visibility (DB, DNS, Proxmox, Velocity)
- Agent is intentionally sandboxed to container filesystem + process
**Applies To**: All systems
**Status**: ✅ LOCKED
---
### **DEC-003: DNS & Edge Publishing Ownership**
**Decision**: API-only responsibility
**Rationale**:
- Requires external credentials (Cloudflare, Technitium)
- Must correlate DB state, record IDs, reconciliation jobs
**Applies To**: API v2
**Status**: ✅ LOCKED
---
### **DEC-004: Proxy Stack**
**Decision**: Traefik + Velocity only
**Rationale**:
- Traefik for HTTP/control-plane
- Velocity for Minecraft TCP routing
- **HAProxy explicitly deprecated** for ZeroLagHub
**Applies To**: Infrastructure, API
**Status**: ✅ LOCKED
---
### **DEC-005: State Persistence**
**Decision**: MariaDB is single source of truth
**Rationale**:
- Flat files caused race conditions and drift
- DB enables reconciliation, recovery, observability
**Applies To**: API v2
**Status**: ✅ LOCKED
---
### **DEC-006: Frontend Access Model**
**Decision**: Frontend communicates with API only
**Rationale**:
- Security boundary
- Prevents leaking infrastructure details
**Applies To**: Frontend
**Status**: ✅ LOCKED
---
### **DEC-007: Architecture Enforcement Policy**
**Decision**: Drift prevention is mandatory
**Rationale**:
- Prevents oscillation between alternatives
- Preserves velocity during late-stage development
**Applies To**: All work sessions
**Status**: ✅ LOCKED
---
## 📋 Canonical Architecture Anchors (ABSOLUTE)
These rules are **immutable** unless explicitly changed with full system review:
### **Anchor #1: Orchestration**
✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)
### **Anchor #2: Container-Internal**
✅ Agent performs EVERYTHING inside container (install, start, stop, detect)
### **Anchor #3: Templates**
✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)
### **Anchor #4: Job Queue**
✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)
### **Anchor #5: Database**
✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)
### **Anchor #6: Infrastructure**
✅ Proxmox API (not Ansible) for LXC management
### **Anchor #7: Routing**
✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems
### **Anchor #8: DNS**
✅ Cloudflare = authoritative public DNS
✅ Technitium = authoritative internal DNS
### **Anchor #9: Frontend Isolation**
✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)
### **Anchor #10: Directory Structure**
✅ `/opt/zlh/<game>/<variant>/world` is canonical game server path
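
Building the canonical path in one place helps keep the API and agent from drifting apart on layout. This is a minimal sketch: `canonicalWorldPath` is a hypothetical helper, and the allowed character set for path segments is an assumption.

```typescript
// Builds /opt/zlh/<game>/<variant>/world from validated segments.
// Assumption: segments are lowercase letters, digits, "_" or "-".
function canonicalWorldPath(game: string, variant: string): string {
  const segmentPattern = /^[a-z0-9][a-z0-9_-]*$/;
  if (!segmentPattern.test(game)) {
    throw new Error("invalid game segment: " + game);
  }
  if (!segmentPattern.test(variant)) {
    throw new Error("invalid variant segment: " + variant);
  }
  return "/opt/zlh/" + game + "/" + variant + "/world";
}

console.log(canonicalWorldPath("minecraft", "paper"));
```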
---
## 🛡️ Enforcement Policies (ACTIVE)
When future instructions conflict with the Canonical Architecture or the Architecture Decision Log:
### **Step 1: STOP**
Immediately halt the task. Do not proceed with the drift-inducing change.
### **Step 2: RAISE DRIFT WARNING**
```
⚠️ DRIFT WARNING ⚠️
Proposed change violates Canonical Architecture:
Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]
Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]
Impact: [What breaks if this drift is allowed]
Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
```
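If drift warnings are produced by tooling rather than typed by hand, the template above can be filled in programmatically. The sketch below uses hypothetical names and covers only the descriptive fields; the Options list is fixed text and is left out.

```typescript
interface DriftWarning {
  kind: "rule" | "decision";
  violated: string; // e.g. "Rule #1: Provisioning Ownership"
  violation: string;
  correctPath: string;
  impact: string;
}

// Renders the descriptive part of the warning template as plain text.
function formatDriftWarning(warning: DriftWarning): string {
  const label = warning.kind === "rule" ? "Rule Violated" : "Decision Violated";
  return [
    "⚠️ DRIFT WARNING ⚠️",
    "Proposed change violates Canonical Architecture:",
    label + ": " + warning.violated,
    "Violation: " + warning.violation,
    "Correct Path: " + warning.correctPath,
    "Impact: " + warning.impact,
  ].join("\n");
}

console.log(
  formatDriftWarning({
    kind: "rule",
    violated: "Rule #1: Provisioning Ownership",
    violation: "Agent attempting to allocate ports",
    correctPath: "API allocates ports, agent receives them via /config",
    impact: "Port collisions, database inconsistency, broken PortPool",
  })
);
```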
### **Step 3: PROVIDE CORRECT PATH**
Show the architecture-aligned implementation that achieves the same goal.
### **Step 4: ASK FOR DIRECTION**
```
Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?
```
### **Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE**
If the violation is against a **LOCKED** Architecture Decision (DEC-001 through DEC-007):
**ADDITIONAL CHECK**:
```
⚠️ LOCKED DECISION WARNING ⚠️
This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED
This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"
Without explicit user request to revisit, this decision is FINAL.
Proceeding with this change would violate architectural governance.
```
**Never proceed** with drift-inducing changes without explicit confirmation.
**Never re-litigate** locked decisions unless the user explicitly requests a revision.
---
## 📊 Drift Detection Examples
### **Example 1: Agent Port Allocation (VIOLATION)**
**Proposed Change**:
```go
// Agent code proposal
func (a *Agent) allocatePort() int {
	// Find available port...
	return port
}
```
**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #1 - Provisioning Ownership
Violation: Agent attempting to allocate ports
Correct Path: API allocates ports, agent receives them via /config
Impact: Port collisions, database inconsistency, broken PortPool
Recommendation: Remove port allocation from agent, ensure API sends ports in config payload
```
---
### **Example 2: Frontend Direct Agent Call (VIOLATION)**
**Proposed Change**:
```typescript
// Frontend code proposal
async function getServerLogs(vmid: number) {
  const agentURL = `http://10.200.0.${vmid}:8080/logs`;
  return await fetch(agentURL);
}
```
**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #3 - Direct Container Access
Violation: Frontend bypassing API to talk to agent
Correct Path: Frontend → API → Agent (API proxies logs)
Impact: Broken frontend if container IP changes, no auth/rate limiting, security risk
Recommendation: Add GET /api/containers/:vmid/logs endpoint that proxies to agent
```
---
### **Example 3: API Installing Java (VIOLATION)**
**Proposed Change**:
```javascript
// API code proposal
async function provisionServer(vmid, game, variant) {
  await proxmox.exec(vmid, 'apt-get install openjdk-21-jdk');
  await proxmox.exec(vmid, 'wget https://papermc.io/...');
}
```
**Drift Warning**:
```
⚠️ DRIFT WARNING ⚠️
Rule Violated: Rule #2 - Artifact Ownership
Violation: API performing in-container installation
Correct Path: API sends config to agent, agent handles installation
Impact: Breaks agent self-repair, variant-specific logic duplicated, no verification system
Recommendation: Remove installation logic from API, ensure agent receives proper config via POST /config
```
---
## 📁 Integration with Existing Documentation
### **Relationship to Master Bootstrap**
- Master Bootstrap: Strategic overview and business model
- **This Document**: Technical governance and boundary enforcement
- **Usage**: Consult this before implementing ANY code changes
### **Relationship to Complete Current State**
- Complete Current State: What's working, what's next
- **This Document**: How things MUST work (regardless of current state)
- **Usage**: This is the "law", current state is "status"
### **Relationship to Engineering Handover**
- Engineering Handover: Daily tactical tasks and sprint plan
- **This Document**: Constraints within which tasks must be implemented
- **Usage**: Check this before starting each handover task
---
## 🔄 Architecture Amendment Process
If there is a legitimate need to change the Canonical Architecture:
### **Step 1: Identify Change**
Document exactly what architectural boundary needs to change and why.
### **Step 2: Full System Impact Analysis**
- What breaks in API?
- What breaks in Agent?
- What breaks in Frontend?
- Which contracts change?
- Which database migrations are needed?
### **Step 3: Update ALL Affected Documents**
- This document (Canonical Architecture)
- Master Bootstrap (if strategic impact)
- Complete Current State (implementation changes)
- Engineering Handover (sprint tasks)
- Agent Spec, Operational Guide (if affected)
### **Step 4: Update ALL Systems**
- API code + tests
- Agent code + tests
- Frontend code + tests
- Database schema (migration)
- Infrastructure config
### **Step 5: Validation**
- Integration tests pass?
- No new drift introduced?
- Documentation consistent?
- All AIs briefed on change?
**Only after ALL steps** is the architecture amendment complete.
---
## 🎯 Quick Reference Card
### **API Owns**
- ✅ Provisioning orchestration
- ✅ Port allocation (PortPool)
- ✅ DNS (Cloudflare + Technitium)
- ✅ Velocity registration
- ✅ IP logic (external + internal)
- ✅ Job queue (BullMQ)
- ✅ Database state
### **Agent Owns**
- ✅ Container-internal installation
- ✅ Java runtime
- ✅ Artifact downloads
- ✅ Server process management
- ✅ READY detection
- ✅ Self-repair + verification
- ✅ Filesystem layout
### **Frontend Owns**
- ✅ User interaction
- ✅ Display logic
- ✅ Client state
- ✅ Form validation
### **Never**
- ❌ Agent allocates ports
- ❌ Agent talks to DNS/Velocity/Proxmox
- ❌ API installs server files
- ❌ API executes in-container commands
- ❌ Frontend talks to agent directly
- ❌ Frontend talks to infrastructure
---
## 📋 Session Start Checklist
Before every coding session with any AI:
- [ ] Read this document's Quick Reference Card
- [ ] Identify which system you're working on (API, Agent, Frontend)
- [ ] Review that system's "Owns" list
- [ ] Check High-Risk Integration Zones if touching those areas
- [ ] Verify no drift from previous session
- [ ] Confirm contracts unchanged since last session
**If in ANY doubt**: Re-read the full Canonical Architecture Anchors section.
---
## ✅ Document Status
**Status**: ACTIVE - Must be consulted before all code changes
**Enforcement**: MANDATORY - Drift violations must be caught
**Authority**: CANONICAL - Overrides conflicting guidance
**Updates**: Only via Architecture Amendment Process
---
🛡️ **This document prevents architectural drift. Violate at your own risk.**