# 🛡️ ZeroLagHub Cross-Project Tracker & Drift Prevention System
**Last Updated**: January 18, 2026
**Version**: 1.1 (Canonical Architecture Enforcement + Git Status Update)
**Status**: ACTIVE - Must Be Consulted Before All Code Changes

---
## 🎯 Document Purpose
This document establishes **architectural boundaries** across the three major ZeroLagHub systems to prevent drift, confusion, and broken contracts when switching between contexts or AI assistants.
**Critical Insight**: Most project failures come from **gradual architectural erosion**, not sudden breaking changes. This document prevents that.

---
## 🚨 CRITICAL STATUS UPDATE (January 18, 2026)
### **Git Repository Status**
**zlh-api Repository: 🔴 EMPTY**
- **Created**: December 28, 2025
- **Status**: No code pushed to git
- **Impact**:
  - Cannot verify December 20 DNS fix application
  - No version control for API codebase
  - No change tracking or collaboration possible
- **Action Required**: IMMEDIATE - Push API codebase to git

**zlh-grind Repository: ✅ CURRENT**
- **Last Updated**: January 18, 2026
- **Status**: Architectural guardrails established
- **Recent Changes**:
  - PORTAL_MIGRATION.md - Architectural boundaries added
  - CONSTRAINTS.md - Network architecture rules added
  - ANTI_DRIFT_GUARDRAIL.md - AI-specific guardrails added

**knowledge-base Repository: ⚠️ OUTDATED**
- **Last Updated**: December 7, 2025 (6+ weeks ago)
- **Missing**: 4+ weeks of session summaries (Dec 20 - Jan 18)
- **Status**: Being updated now (Jan 18, 2026)

### **Outstanding Critical Issues**
1. **DNS Fix Status Unknown** 🔴
   - Fix identified: December 20, 2025
   - Location: `provisionAgent.js` lines 46, 330-331, 402
   - Status: Cannot verify if applied (API not in git)
   - Impact: Servers may still be unreachable if not applied
2. **API Codebase Not Version Controlled** 🔴
   - Running in production but not in git
   - Cannot track changes, review code, or collaborate
   - Immediate risk to project continuity
3. **Documentation Debt** 🟡
   - 6 weeks without tracker updates
   - 4 weeks of undocumented sessions
   - Risk of losing institutional knowledge

---
## 📊 The Three Systems (Canonical Ownership)
```
┌────────────────────────────────────────────────────────────┐
│ NODE.JS API v2 │
│ (Orchestration Engine) │
│ │
│ Owns: VMID allocation, port management, DNS publishing, │
│ Velocity registration, job queue, database state │
│ │
│ Speaks To: Proxmox, Cloudflare, Technitium, Velocity, │
│ MariaDB, BullMQ, Go Agent (HTTP) │
│ │
│ Never Touches: Container filesystem, game server files, │
│ Java installation, artifact downloads │
└────────────────────────────┬───────────────────────────────┘
│ HTTP Contract
│ POST /config, /start, /stop
│ GET /status, /health
┌────────────────────────────────────────────────────────────┐
│ GO AGENT │
│ (Container-Internal Manager) │
│ │
│ Owns: Server installation, Java runtime, artifact │
│ downloads, process management, READY detection, │
│ filesystem layout, verification + self-repair │
│ │
│ Speaks To: Local filesystem, game server process, │
│ API (status updates via HTTP polling) │
│ │
│ Never Touches: Proxmox, DNS, Cloudflare, Velocity, │
│ port allocation, VMID selection │
└────────────────────────────┬───────────────────────────────┘
│ Status (relayed via API v2)
│ Frontend never polls the agent
┌────────────────────────────────────────────────────────────┐
│ NEXT.JS FRONTEND │
│ (Customer + Admin UI) │
│ │
│ Owns: User interaction, form validation, display logic, │
│ client-side state, UI components │
│ │
│ Speaks To: API v2 only (REST + WebSocket when added) │
│ │
│ Never Touches: Proxmox, Go Agent, DNS, Velocity, │
│ Cloudflare, direct container access │
└────────────────────────────────────────────────────────────┘
```
---
## 🗺️ System Ownership Matrix (CANONICAL)
| Area | Node.js API v2 | Go Agent | Frontend |
|------|----------------|----------|----------|
| **Provisioning Orchestration** | ✅ OWNER (allocates VMID, ports, builds LXC config) | ❌ Executes inside container | ❌ Triggers only |
| **Template Selection** | ✅ OWNER (selects template, passes config) | ❌ Template contains agent | ❌ Displays options |
| **Server Installation** | ❌ Never | ✅ OWNER (Java, artifacts, validation) | ❌ Displays results |
| **Runtime Control** | ✅ OWNER (sends commands) | ✅ OWNER (executes commands) | ❌ UI only |
| **DNS (Cloudflare + Technitium)** | ✅ OWNER (creates, deletes, tracks IDs) | ❌ Never | ❌ Displays info |
| **Velocity Registration** | ✅ OWNER (registers, deregisters) | ❌ Never | ❌ Displays status |
| **IP Logic** | ✅ OWNER (external + internal IPs) | ❌ Sees container IP only | ❌ Displays final |
| **Port Allocation** | ✅ OWNER (PortPool DB management) | ❌ Receives assignments | ❌ Displays ports |
| **Monitoring** | ✅ OWNER (collects metrics) | ✅ OWNER (exposes /health) | ❌ Displays data |
| **Error Handling** | ✅ OWNER (BullMQ jobs, retries) | ❌ Local output only | ❌ User notifications |
---
## 🔒 API ↔ Agent Contract (IMMUTABLE)
### **API → Agent Endpoints**
```
POST /config
├─ Payload: {
│    game: "minecraft",
│    variant: "paper",
│    version: "1.21.3",
│    ports: [25565, 25575],
│    memory: "4G",
│    motd: "Welcome to ZeroLagHub",
│    worldSettings: {...}
│  }
└─ Response: 200 OK

POST /start
└─ Triggers: ensureProvisioned() → StartServer()

POST /stop
└─ Triggers: Graceful shutdown via server stop command

POST /restart
└─ Triggers: stop → start sequence

GET /status
└─ Response: {
     state: "RUNNING" | "INSTALLING" | "FAILED",
     pid: 12345,
     uptime: 3600,
     lastError: null
   }

GET /health
└─ Response: {
     healthy: true,
     java: "/usr/lib/jvm/java-21-openjdk-amd64",
     variant: "paper",
     ready: true
   }
```
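As a rough illustration of how the API side might sanity-check a `/config` payload before sending it, here is a minimal sketch. The field names come from the payload above; the helper name `validateAgentConfig` and the specific rules (for example, memory written as digits followed by `M` or `G`) are assumptions, not the project's actual validation.

```javascript
// Hypothetical helper: checks a /config payload against the fields shown above.
// Returns a list of human-readable problems; an empty list means it looks sane.
function validateAgentConfig(payload) {
  const problems = [];

  for (const field of ["game", "variant", "version", "memory"]) {
    if (typeof payload[field] !== "string" || payload[field].length === 0) {
      problems.push(`${field} must be a non-empty string`);
    }
  }

  const ports = payload.ports;
  if (!Array.isArray(ports) || ports.length === 0) {
    problems.push("ports must be a non-empty array");
  } else if (!ports.every((p) => Number.isInteger(p) && p >= 1 && p <= 65535)) {
    problems.push("ports must be integers in 1-65535");
  }

  // Assumed format: digits followed by M or G, e.g. "4G".
  if (typeof payload.memory === "string" && !/^\d+[MG]$/.test(payload.memory)) {
    problems.push("memory must look like 512M or 4G");
  }

  return problems;
}
```

Rejecting a malformed payload in the API keeps the agent's job simple: it can assume what it receives was already checked by the orchestrator.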
### **Agent → API Signals (via /status polling)**
```
State Machine:
INSTALLING
├─> DOWNLOADING_ARTIFACT
├─> INSTALLING_JAVA
├─> FINALIZING
├─> RUNNING
└─> FAILED | CRASHED
```
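The polling side of this state machine could be sketched as follows. This is only an illustration: `waitForSettledState` and its options are hypothetical names, and `fetchStatus` is injected so the sketch does not assume any particular HTTP client. It treats `RUNNING`, `FAILED`, and `CRASHED` as the states that end the wait, and everything else as "keep polling".

```javascript
// States that end the provisioning wait, taken from the state machine above.
const SETTLED_STATES = new Set(["RUNNING", "FAILED", "CRASHED"]);

// Hypothetical polling loop for the API side. `fetchStatus` must resolve to an
// object shaped like the GET /status response, e.g. { state: "INSTALLING" }.
async function waitForSettledState(fetchStatus, { intervalMs = 2000, maxAttempts = 60 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { state } = await fetchStatus();
    if (SETTLED_STATES.has(state)) {
      return state;
    }
    // Intermediate states (INSTALLING, DOWNLOADING_ARTIFACT, ...) mean "keep waiting".
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`agent did not settle after ${maxAttempts} polls`);
}
```

The caller decides what to do with the result, for example marking the job failed on `FAILED` or `CRASHED` and continuing with edge publishing on `RUNNING`.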
### **🚫 Forbidden Agent Behaviors**
Agent **NEVER**:
- ❌ Allocates or manages ports
- ❌ Talks to Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Modifies LXC config
- ❌ Allocates VMIDs
- ❌ Manages templates
- ❌ Talks to Cloudflare or Technitium
**Violation Detection**: If agent code imports Proxmox client, DNS client, or port allocation logic → **DRIFT VIOLATION**
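
One way to automate this check is a small script that scans agent source files for suspicious import paths. The sketch below is written in JavaScript so it can run alongside the API tooling; the keyword list and the function name `findForbiddenAgentImports` are assumptions, and real package names in the agent codebase may differ.

```javascript
// Hypothetical drift check: scan Go source text for import paths that mention
// systems the agent must never touch. The keyword list is an assumption.
const FORBIDDEN_AGENT_IMPORT_KEYWORDS = ["proxmox", "cloudflare", "technitium", "velocity", "portpool"];

function findForbiddenAgentImports(goSource) {
  const imports = [];

  // Single-line form: import "path" or import alias "path"
  for (const m of goSource.matchAll(/^\s*import\s+(?:[\w.]+\s+)?"([^"]+)"/gm)) {
    imports.push(m[1]);
  }

  // Grouped form: import ( "a" <newline> alias "b" )
  for (const block of goSource.matchAll(/^\s*import\s*\(([\s\S]*?)\)/gm)) {
    for (const m of block[1].matchAll(/"([^"]+)"/g)) {
      imports.push(m[1]);
    }
  }

  return imports.filter((path) =>
    FORBIDDEN_AGENT_IMPORT_KEYWORDS.some((kw) => path.toLowerCase().includes(kw))
  );
}
```

A keyword scan like this is deliberately crude. It will not catch a raw HTTP call to Proxmox, so it complements code review rather than replacing it.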
---
## 🖥️ Frontend ↔ API Contract (IMMUTABLE)
### **Frontend Allowed Endpoints**
```
Provisioning:
  POST   /api/containers/create
  GET    /api/containers/:vmid
  DELETE /api/containers/:vmid

Control:
  POST   /api/containers/:vmid/start
  POST   /api/containers/:vmid/stop
  POST   /api/containers/:vmid/restart

Status:
  GET    /api/containers/:vmid/status
  GET    /api/containers/:vmid/logs
  GET    /api/containers/:vmid/stats

Discovery:
  GET    /api/templates
  GET    /api/containers (list user's containers)

Read-Only Info:
  GET    /api/dns/:hostname (display only)
  GET    /api/velocity/status (display only)
```
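The endpoint list above can be turned into an allowlist that a lint rule or test could consult. The following is a minimal sketch under that assumption; `FRONTEND_ALLOWED_ROUTES` and `isAllowedFrontendCall` are hypothetical names, and `:vmid` / `:hostname` are treated as single path segments.

```javascript
// Hypothetical allowlist built from the endpoint list above.
const FRONTEND_ALLOWED_ROUTES = [
  ["POST", "/api/containers/create"],
  ["GET", "/api/containers"],
  ["GET", "/api/containers/:vmid"],
  ["DELETE", "/api/containers/:vmid"],
  ["POST", "/api/containers/:vmid/start"],
  ["POST", "/api/containers/:vmid/stop"],
  ["POST", "/api/containers/:vmid/restart"],
  ["GET", "/api/containers/:vmid/status"],
  ["GET", "/api/containers/:vmid/logs"],
  ["GET", "/api/containers/:vmid/stats"],
  ["GET", "/api/templates"],
  ["GET", "/api/dns/:hostname"],
  ["GET", "/api/velocity/status"],
];

// Convert "/api/containers/:vmid/logs" into an anchored regular expression.
function routeToRegExp(route) {
  const pattern = route
    .split("/")
    .map((seg) => (seg.startsWith(":") ? "[^/]+" : seg.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")))
    .join("/");
  return new RegExp(`^${pattern}$`);
}

function isAllowedFrontendCall(method, path) {
  return FRONTEND_ALLOWED_ROUTES.some(
    ([m, route]) => m === method.toUpperCase() && routeToRegExp(route).test(path)
  );
}
```

For example, `isAllowedFrontendCall("POST", "/api/containers/1042/exec")` returns `false`, because no exec route appears in the contract.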
### **🚫 Forbidden Frontend Behaviors**
Frontend **NEVER**:
- ❌ Talks directly to Go Agent
- ❌ Calls Proxmox API
- ❌ Creates DNS records
- ❌ Registers with Velocity
- ❌ Allocates ports
- ❌ Executes container commands
- ❌ Accesses MariaDB directly
**Violation Detection**: If frontend code imports Proxmox client, agent HTTP client (except via API), or database client → **DRIFT VIOLATION**
---
## 🚨 NEW: Architectural Boundary Enforcement (January 18, 2026)
### **Critical Rule: Frontend Cannot Call Agents**
**The Hard Reality**:
- Container IPs are internal-only (10.x network)
- No network path exists from browser to container
- Agents have no CORS headers (they're not web services)
- Direct calls would fail at network layer
**Why This Matters**:
- AI tools may suggest "quick fixes" calling agents directly
- Developers may try to add CORS to agents
- Frontend shortcuts bypass security/auth/rate limiting
- Breaks architectural isolation
**Correct Flow**:
```
User Action → Frontend → API → Agent → Response
```
**Forbidden Flow**:
```
User Action → Frontend → Agent (FAILS - no network path)
```
### **Common Drift Patterns (Now Prevented)**
**Pattern 1: AI Tool Suggests Direct Agent Call**
```javascript
// WRONG - AI suggestion
async function getServerLogs(vmid) {
  const agentURL = `http://10.200.0.${vmid}:8080/logs`;
  return await fetch(agentURL); // FAILS - no route
}

// CORRECT - Architectural pattern
async function getServerLogs(vmid) {
  return await api.get(`/containers/${vmid}/logs`);
}
```
**Pattern 2: Adding CORS to Agents**
```go
// WRONG - Never add this to agents
func (a *Agent) enableCORS() {
	a.router.Use(cors.New(cors.Config{
		AllowOrigins: []string{"*"},
	}))
}
```
**Pattern 3: Exposing Agent Ports**
```nginx
# WRONG - Never proxy agent ports
location /agent/ {
    proxy_pass http://10.200.0.100:8080/;
}
```
### **Enforcement Documentation**
Three-layer defense established in `zlh-grind`:
1. **PORTAL_MIGRATION.md** - High-level boundaries
2. **CONSTRAINTS.md** - Hard technical rules
3. **ANTI_DRIFT_GUARDRAIL.md** - AI-specific warnings
**Rule**: When code conflicts with documentation, **documentation wins**.
---
## 🛡️ Drift Detection Rules (ACTIVE ENFORCEMENT)
### **Rule #1: Provisioning Ownership**
**Violation**: Agent asked to allocate ports, choose templates, create DNS, call Proxmox, manage VMIDs
**Correct Path**: API owns ALL provisioning orchestration
**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) allocatePorts() ([]int, error) {
	// ... port allocation logic
}
```
**Correct Pattern**:
```javascript
// RIGHT - API allocates, agent receives
// vmid is passed in explicitly; it is allocated by the API before this call
async function provisionInstance(vmid, game, variant) {
  const ports = await portAllocator.allocate(vmid);
  await agent.postConfig({ ports, game, variant });
}
```
---
### **Rule #2: Artifact Ownership**
**Violation**: API asked to install Java, download server.jar, run installers
**Correct Path**: Agent owns ALL in-container installation
**Example Violation**:
```javascript
// WRONG - API should NEVER do this
async function installMinecraft(vmid, version) {
  await proxmox.exec(vmid, `wget https://...`);
  await proxmox.exec(vmid, `java -jar installer.jar`);
}
```
**Correct Pattern**:
```go
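// RIGHT - Agent handles installation
// (sketch: assumes each helper returns an error)
func (p *Provisioner) ProvisionAll(cfg Config) error {
	if err := downloadArtifact(cfg.Variant, cfg.Version); err != nil {
		return err
	}
	if err := installJavaRuntime(cfg.Version); err != nil {
		return err
	}
	return verifyInstallation()
}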
```
---
### **Rule #3: Direct Container Access**
**Violation**: Frontend or API wants to exec commands directly into container
**Correct Path**: Agent owns container execution layer
**Example Violation**:
```typescript
// WRONG - Frontend should NEVER do this
async function restartServer(vmid: number) {
  const ssh = new SSHClient();
  await ssh.connect(containerIP);
  await ssh.exec('systemctl restart minecraft');
}
```
**Correct Pattern**:
```typescript
// RIGHT - Frontend talks to API, API talks to agent
async function restartServer(vmid: number) {
  await api.post(`/containers/${vmid}/restart`);
}
```
---
### **Rule #4: Networking Responsibilities**
**Violation**: Agent asked to select public vs internal IPs, decide DNS zones
**Correct Path**: API owns dual-IP logic (Cloudflare external, Technitium internal)
**Example Violation**:
```go
// WRONG - Agent should NEVER decide this
func (a *Agent) determinePublicIP() string {
	if a.needsCloudflare() {
		return "139.64.165.248"
	}
	return a.containerIP
}
```
**Correct Pattern**:
```javascript
// RIGHT - API decides network topology
function determineIPs(vmid, game) {
  const internalIP = `10.200.0.${vmid - 1000}`;
  const externalIP = "139.64.165.248"; // Cloudflare target
  const velocityIP = "10.70.0.241";    // Internal routing
  return { internalIP, externalIP, velocityIP };
}
```
---
### **Rule #5: Proxy Responsibilities**
**Violation**: Agent asked to register with Velocity, configure proxy routing
**Correct Path**: API owns ALL proxy integrations
**Example Violation**:
```go
// WRONG - Agent should NEVER do this
func (a *Agent) registerWithVelocity() error {
	client := velocity.NewClient()
	return client.Register(a.hostname, a.port)
}
```
**Correct Pattern**:
```javascript
// RIGHT - API handles Velocity registration
async function registerVelocity(vmid, hostname, internalIP) {
  await velocityBridge.registerBackend({
    name: hostname,
    address: internalIP,
    port: 25565
  });
}
```
---
## 🔍 Context Switching Safety Workflow
### **When Moving to Node.js API Work:**
**Pre-Switch Checklist**:
- [ ] Agent contract unchanged? (POST /config, /start, /stop, GET /status)
- [ ] Database schema unchanged? (Prisma models consistent)
- [ ] LXC template IDs unchanged? (VMID 800 for game, 6000-series for dev)
- [ ] DNS/IP logic consistent? (Cloudflare external, Technitium internal)
- [ ] Port allocation logic preserved? (PortPool DB-backed)
**Common Drift Patterns**:
- ⚠️ Adding agent installation logic to API
- ⚠️ Changing agent contract without updating both sides
- ⚠️ Moving DNS logic to different service
- ⚠️ Bypassing job queue for provisioning
---
### **When Moving to Go Agent Work:**
**Pre-Switch Checklist**:
- [ ] No provisioning logic outside allowed scope (no port allocation, DNS, etc.)
- [ ] File paths remain canonical (`/opt/zlh/<game>/<variant>/world`)
- [ ] Naming conventions maintained (`server.jar`, `fabric-server.jar`, etc.)
- [ ] No external API calls (Proxmox, DNS, Velocity)
- [ ] Status states unchanged (INSTALLING, RUNNING, FAILED, etc.)
**Common Drift Patterns**:
- ⚠️ Adding port allocation to agent
- ⚠️ Making agent talk to external services
- ⚠️ Changing directory structure without API coordination
- ⚠️ Adding orchestration logic to agent
---
### **When Moving to Frontend Work:**
**Pre-Switch Checklist**:
- [ ] Only API-approved fields used in UI
- [ ] No direct agent HTTP calls
- [ ] VMID not exposed in user-facing UI
- [ ] Internal IPs not displayed to users
- [ ] All state from API, not computed locally
**Common Drift Patterns**:
- ⚠️ Adding direct agent calls from frontend
- ⚠️ Computing server state client-side
- ⚠️ Exposing internal infrastructure details
- ⚠️ Bypassing API for container control
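
To support the "Internal IPs not displayed to users" check, the API could strip internal addresses before a record ever reaches the frontend. The sketch below assumes a flat record whose internal addresses are plain string values; `isPrivate10Network` and `stripInternalAddresses` are hypothetical helper names.

```javascript
// True for dotted-quad IPv4 strings in the 10.0.0.0/8 range used by containers.
function isPrivate10Network(value) {
  const parts = String(value).split(".");
  return (
    parts.length === 4 &&
    parts[0] === "10" &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)
  );
}

// Hypothetical response shaper: drops any top-level string field holding a 10.x address.
function stripInternalAddresses(record) {
  return Object.fromEntries(
    Object.entries(record).filter(
      ([, value]) => !(typeof value === "string" && isPrivate10Network(value))
    )
  );
}
```

Filtering by value rather than by field name means a newly added internal field is dropped by default, which errs on the side of not leaking infrastructure details.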
---
## 🛡️ High-Risk Integration Zones (GUARDED)
These areas have historically caused drift across sessions:
### **1. Forge / NeoForge Installation Logic**
- **Risk**: Agent vs API confusion on who handles `run.sh` patching
- **Guard**: Agent owns ALL Forge installation, API just passes config
- **Test**: Can provision Forge 1.21.3 without API filesystem access?
### **2. Cloudflare SRV Deletion**
- **Risk**: Case sensitivity, subdomain normalization, record ID tracking
- **Guard**: API stores Cloudflare record IDs in EdgeState, deletes by ID
- **Test**: Create → Delete → Recreate same hostname without orphans?
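
The guard above (normalize the hostname, store the record ID, delete by ID) can be sketched with an in-memory stand-in for EdgeState. Everything here is illustrative: `EdgeStateStore` is a hypothetical class, and the injected `cloudflare` object with `createRecord(name)` and `deleteRecord(id)` is a placeholder, not the Cloudflare SDK's API.

```javascript
// Lowercase, trim, and drop a trailing dot so "MC1.example.com." and
// "mc1.example.com" map to the same key.
function normalizeHostname(hostname) {
  return hostname.trim().toLowerCase().replace(/\.$/, "");
}

// Hypothetical in-memory stand-in for the EdgeState table.
class EdgeStateStore {
  constructor(cloudflare) {
    this.cloudflare = cloudflare;
    this.recordIds = new Map(); // normalized hostname -> record ID
  }

  async publish(hostname) {
    const key = normalizeHostname(hostname);
    if (this.recordIds.has(key)) {
      return this.recordIds.get(key); // already published; avoid duplicates
    }
    const id = await this.cloudflare.createRecord(key);
    this.recordIds.set(key, id);
    return id;
  }

  async unpublish(hostname) {
    const key = normalizeHostname(hostname);
    const id = this.recordIds.get(key);
    if (id === undefined) return false; // nothing tracked, nothing to delete
    await this.cloudflare.deleteRecord(id); // delete by stored ID, not by name lookup
    this.recordIds.delete(key);
    return true;
  }
}
```

Because deletion uses the stored ID, a create, delete, recreate cycle on the same hostname leaves exactly one live record regardless of how the caller capitalized the name.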
### **3. Technitium DNS Zone Mismatch**
- **Risk**: Wrong zone selection, duplicate records
- **Guard**: API hardcodes zone as `zpack.zerolaghub.com`, validates before creation
- **Test**: No records created in wrong zones?
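
A zone check of this kind might look like the sketch below. The zone name is the one given above; the function name `assertInTechnitiumZone` is hypothetical, and the exact normalization rules are an assumption.

```javascript
// Zone name taken from the guard above.
const TECHNITIUM_ZONE = "zpack.zerolaghub.com";

// Hypothetical pre-creation check: returns the normalized name, or throws if
// the hostname does not belong to the expected zone.
function assertInTechnitiumZone(hostname) {
  const name = hostname.trim().toLowerCase().replace(/\.$/, "");
  if (name !== TECHNITIUM_ZONE && !name.endsWith(`.${TECHNITIUM_ZONE}`)) {
    throw new Error(`${hostname} is outside zone ${TECHNITIUM_ZONE}`);
  }
  return name;
}
```

Matching on `.` plus the zone name, rather than on the bare suffix, rejects look-alike names such as `evilzpack.zerolaghub.com`.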
### **4. Velocity Registration Order**
- **Risk**: Registering before server ready, deregistering incorrectly
- **Guard**: API waits for agent RUNNING state, then registers Velocity
- **Test**: Player connection works immediately after provisioning complete?
### **5. PortPool Commit Logic**
- **Risk**: Race conditions, double-allocation, uncommitted ports
- **Guard**: API allocates → provisions → commits (rollback on failure)
- **Test**: Concurrent provisions don't collide on ports?
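
The allocate, provision, commit sequence with rollback can be illustrated with an in-memory pool. This is a sketch only: the real PortPool is DB-backed, and `InMemoryPortPool` and `provisionWithPorts` are hypothetical names. Reservation is synchronous here, which is what keeps two concurrent callers from receiving the same port; a database version would need a transaction or row lock to get the same guarantee.

```javascript
// Hypothetical in-memory pool standing in for the DB-backed PortPool.
class InMemoryPortPool {
  constructor(ports) {
    this.free = new Set(ports);
    this.reserved = new Set();
    this.committed = new Set();
  }

  reserve(count) {
    if (this.free.size < count) throw new Error("port pool exhausted");
    const picked = [...this.free].slice(0, count);
    for (const p of picked) {
      this.free.delete(p);
      this.reserved.add(p);
    }
    return picked;
  }

  commit(ports) {
    for (const p of ports) {
      this.reserved.delete(p);
      this.committed.add(p);
    }
  }

  release(ports) {
    for (const p of ports) {
      this.reserved.delete(p);
      this.free.add(p);
    }
  }
}

async function provisionWithPorts(pool, count, provision) {
  const ports = pool.reserve(count); // 1. allocate
  try {
    await provision(ports);          // 2. provision (may throw)
    pool.commit(ports);              // 3. commit on success
    return ports;
  } catch (err) {
    pool.release(ports);             // rollback: ports return to the free set
    throw err;
  }
}
```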
### **6. Agent READY Detection**
- **Risk**: False negatives, false positives, variant-specific patterns
- **Guard**: Agent uses variant-aware log parsing, multiple confirmation lines
- **Test**: All 6 variants correctly detect READY state?
### **7. Server Start-Up False Negatives**
- **Risk**: Timeout too short, log parsing too strict
- **Guard**: Agent increases timeout for Forge (90s), multiple log patterns
- **Test**: Forge installer completes without false failure?
### **8. IP Selection Logic**
- **Risk**: Confusing external (Cloudflare) vs internal (Velocity/Technitium) IPs
- **Guard**: API clearly separates: externalIP (139.64.165.248), internalIP (10.200.0.X), velocityIP (10.70.0.241)
- **Test**: DNS points to correct IPs, Velocity routes to correct internal IP?
---
## 📊 Architecture Decision Log (LOCKED) ⭐ NEW
**Purpose**: Records finalized architectural decisions that **must not be re-litigated** unless explicitly requested.
**Status**: LOCKED - These decisions are final and cannot be changed without user explicitly saying "Revisit decision X"

---
### **DEC-001: Templates vs Go Agent (FINAL)**
**Decision**: Hybrid model
- LXC templates define **base environment only**
- Go Agent is **authoritative execution layer** inside containers
**Rationale**:
- Templates alone cannot handle multi-variant logic (Forge, NeoForge, Fabric)
- Agent enables self-repair, async provisioning, runtime control
- Hybrid provides speed + flexibility without API container access
**Applies To**: API v2, Go Agent, Frontend
**Status**: ✅ LOCKED
---
### **DEC-002: Provisioning Authority**
**Decision**: API orchestrates, Agent executes
**Rationale**:
- API has global visibility (DB, DNS, Proxmox, Velocity)
- Agent is intentionally sandboxed to container filesystem + process
**Applies To**: All systems
**Status**: ✅ LOCKED
---
### **DEC-003: DNS & Edge Publishing Ownership**
**Decision**: API-only responsibility
**Rationale**:
- Requires external credentials (Cloudflare, Technitium)
- Must correlate DB state, record IDs, reconciliation jobs
**Applies To**: API v2
**Status**: ✅ LOCKED
---
### **DEC-004: Proxy Stack**
**Decision**: Traefik + Velocity only
**Rationale**:
- Traefik for HTTP/control-plane
- Velocity for Minecraft TCP routing
- **HAProxy explicitly deprecated** for ZeroLagHub
**Applies To**: Infrastructure, API
**Status**: ✅ LOCKED
---
### **DEC-005: State Persistence**
**Decision**: MariaDB is single source of truth
**Rationale**:
- Flat files caused race conditions and drift
- DB enables reconciliation, recovery, observability
**Applies To**: API v2
**Status**: ✅ LOCKED
---
### **DEC-006: Frontend Access Model**
**Decision**: Frontend communicates with API only
**Rationale**:
- Security boundary
- Prevents leaking infrastructure details
**Applies To**: Frontend
**Status**: ✅ LOCKED
---
### **DEC-007: Architecture Enforcement Policy**
**Decision**: Drift prevention is mandatory
**Rationale**:
- Prevents oscillation between alternatives
- Preserves velocity during late-stage development
**Applies To**: All work sessions
**Status**: ✅ LOCKED
---
### **DEC-008: Frontend-Agent Network Isolation (NEW - January 18, 2026)**
**Decision**: Frontend can NEVER call agents directly
**Rationale**:
- Container IPs (10.x) are internal-only with no public routing
- Agents have no CORS headers (not web services)
- Direct calls would fail at network layer
- API enforces auth, rate limits, access control
**Technical Reality**:
- No network path exists from browser to container
- Even if CORS added, network routing blocks access
- This is architectural fact, not policy choice
**Applies To**: Frontend, All Documentation
**Status**: ✅ LOCKED
---
## 📊 Canonical Architecture Anchors (ABSOLUTE)
These rules are **immutable** unless explicitly changed with full system review:
### **Anchor #1: Orchestration**
✅ API v2 orchestrates EVERYTHING (jobs, provisioning, DNS, proxy, lifecycle)
### **Anchor #2: Container-Internal**
✅ Agent performs EVERYTHING inside container (install, start, stop, detect)
### **Anchor #3: Templates**
✅ Templates contain agent + base environment (VMID 800 for game, 6000-series for dev)
### **Anchor #4: Job Queue**
✅ BullMQ/Redis drive job system (async provisioning, retries, reconciliation)
### **Anchor #5: Database**
✅ MariaDB holds all state (PortPool, ContainerInstance, EdgeState, etc.)
### **Anchor #6: Infrastructure**
✅ Proxmox API (not Ansible) for LXC management
### **Anchor #7: Routing**
✅ Traefik (HTTP) + Velocity (Minecraft) are ONLY routing/proxy systems
### **Anchor #8: DNS**
✅ Cloudflare = authoritative public DNS
✅ Technitium = authoritative internal DNS
### **Anchor #9: Frontend Isolation**
✅ Frontend speaks ONLY to API (no direct agent, Proxmox, DNS, Velocity)
### **Anchor #10: Directory Structure**
✅ `/opt/zlh/<game>/<variant>/world` is the canonical game server path
### **Anchor #11: Network Isolation (NEW)**
✅ Container IPs (10.x) are internal-only
✅ No network path from frontend to agents
✅ API is the only bridge between public and internal networks
---
## 🛡️ Enforcement Policies (ACTIVE)
When future instructions conflict with Canonical Architecture or Architecture Decision Log:
### **Step 1: STOP**
Immediately halt the task. Do not proceed with drift-inducing change.
### **Step 2: RAISE DRIFT WARNING**
```
⚠️ DRIFT WARNING ⚠️
Proposed change violates Canonical Architecture:
Rule Violated: [Rule #X: Description]
OR
Decision Violated: [DEC-XXX: Decision Name]
Violation: [Specific behavior that violates rule/decision]
Correct Path: [Architecture-aligned approach]
Impact: [What breaks if this drift is allowed]
Options:
1. Implement correct architecture-aligned path
2. Amend Canonical Architecture (requires full system review)
3. Request user to "Revisit decision X" (for ADL changes)
4. Cancel proposed change
```
### **Step 3: PROVIDE CORRECT PATH**
Show the architecture-aligned implementation that achieves the same goal.
### **Step 4: ASK FOR DIRECTION**
```
Should we:
A) Implement the correct architecture-aligned path?
B) Perform full system review to amend architecture?
C) Request user to explicitly revisit locked decision?
D) Cancel this change?
```
### **Step 5: ARCHITECTURE DECISION LOG SPECIAL RULE**
If violation is against a **LOCKED** Architecture Decision (DEC-001 through DEC-008):
**ADDITIONAL CHECK**:
```
⚠️ LOCKED DECISION WARNING ⚠️
This change conflicts with Architecture Decision Log entry: [DEC-XXX]
Status: LOCKED
This decision can ONLY be changed if the user explicitly says:
"Revisit decision [DEC-XXX]"
Without explicit user request to revisit, this decision is FINAL.
Proceeding with this change would violate architectural governance.
```
**Never proceed** with drift-inducing changes without explicit confirmation.

**Never re-litigate** locked decisions without user explicitly requesting revision.

---
## 📚 Integration with Existing Documentation
### **Relationship to Master Bootstrap**
- Master Bootstrap: Strategic overview and business model
- **This Document**: Technical governance and boundary enforcement
- **Usage**: Consult this before implementing ANY code changes
### **Relationship to Complete Current State**
- Complete Current State: What's working, what's next
- **This Document**: How things MUST work (regardless of current state)
- **Usage**: This is the "law", current state is "status"
### **Relationship to Engineering Handover**
- Engineering Handover: Daily tactical tasks and sprint plan
- **This Document**: Constraints within which tasks must be implemented
- **Usage**: Check this before starting each handover task
---
## 🔧 Architecture Amendment Process
If there is a legitimate need to change the Canonical Architecture:
### **Step 1: Identify Change**
Document exactly what architectural boundary needs to change and why.
### **Step 2: Full System Impact Analysis**
- What breaks in API?
- What breaks in Agent?
- What breaks in Frontend?
- What changes to contracts?
- What database migrations needed?
### **Step 3: Update ALL Affected Documents**
- This document (Canonical Architecture)
- Master Bootstrap (if strategic impact)
- Complete Current State (implementation changes)
- Engineering Handover (sprint tasks)
- Agent Spec, Operational Guide (if affected)
### **Step 4: Update ALL Systems**
- API code + tests
- Agent code + tests
- Frontend code + tests
- Database schema (migration)
- Infrastructure config
### **Step 5: Validation**
- Integration tests pass?
- No new drift introduced?
- Documentation consistent?
- All AIs briefed on change?
**Only after ALL steps** is architecture amendment complete.
---
## 🎯 Quick Reference Card
### **API Owns**
- ✅ Provisioning orchestration
- ✅ Port allocation (PortPool)
- ✅ DNS (Cloudflare + Technitium)
- ✅ Velocity registration
- ✅ IP logic (external + internal)
- ✅ Job queue (BullMQ)
- ✅ Database state
### **Agent Owns**
- ✅ Container-internal installation
- ✅ Java runtime
- ✅ Artifact downloads
- ✅ Server process management
- ✅ READY detection
- ✅ Self-repair + verification
- ✅ Filesystem layout
### **Frontend Owns**
- ✅ User interaction
- ✅ Display logic
- ✅ Client state
- ✅ Form validation
### **Never**
- ❌ Agent allocates ports
- ❌ Agent talks to DNS/Velocity/Proxmox
- ❌ API installs server files
- ❌ API executes in-container commands
- ❌ Frontend talks to agent directly
- ❌ Frontend talks to infrastructure
---
## 📋 Session Start Checklist
Before every coding session with any AI:
- [ ] Read this document's Quick Reference Card
- [ ] Identify which system you're working on (API, Agent, Frontend)
- [ ] Review that system's "Owns" list
- [ ] Check High-Risk Integration Zones if touching those areas
- [ ] Verify no drift from previous session
- [ ] Confirm contracts unchanged since last session
**If ANY doubt**: Re-read full Canonical Architecture Anchors section.
---
## ✅ Document Status
**Status**: ACTIVE - Must be consulted before all code changes
**Enforcement**: MANDATORY - Drift violations must be caught
**Authority**: CANONICAL - Overrides conflicting guidance
**Updates**: Only via Architecture Amendment Process

---
🛡️ **This document prevents architectural drift. Violate at your own risk.**