Session Summary - February 7, 2026

🎯 Session Objectives

Resolve customer ID confusion, complete Prometheus/Grafana setup, and implement host lifecycle controls with proper delete safeguards.

Work Completed

1. Customer ID Schema Standardization

Problem Identified:

  • Mixed ID formats in ContainerInstance.customerId field
  • Sometimes u000 (manual test data)
  • Sometimes u-dev-001 (hardcoded in frontend)
  • Sometimes Prisma CUID (new user registrations)
  • Root cause: Frontend hardcoded IDs instead of using authenticated user's ID

Decision Made: Use Prisma CUID (User.id) as authoritative identifier

Implementation Rule:

// Always derive from auth context
const customerId = req.user.id;

// NEVER accept client-supplied customerId
// NEVER hardcode test IDs in production code
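
The rule above can be sketched as a small guard (the helper name and request shape are illustrative, not the actual codebase):

```typescript
// Illustrative guard: the customer ID comes only from the authenticated
// user. Rejecting loudly (rather than silently ignoring) makes any
// leftover hardcoded test ID surface immediately in development.
function resolveCustomerId(authUserId: string, bodyCustomerId?: string): string {
  if (bodyCustomerId !== undefined && bodyCustomerId !== authUserId) {
    throw new Error("customerId must come from the auth context, not the client");
  }
  return authUserId;
}
```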

Why This Works:

  • Guaranteed unique (Prisma generates)
  • Harder to guess (security benefit)
  • Already in database (no schema changes needed)
  • Eliminates format confusion

Display Strategy:

  • UI shows: displayName (username) • [first 8 chars of ID]
  • Example: Jester (jester) • cmk07b7n4
  • Logs include full CUID for debugging
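
A minimal sketch of the display rule (the `UserLike` shape and helper are hypothetical; note the session example "cmk07b7n4" shows 9 ID characters while the stated rule says 8 — treat the exact prefix length as a presentation detail):

```typescript
// Hypothetical shape for display purposes; logs keep user.id whole.
interface UserLike {
  id: string;          // full Prisma CUID
  displayName: string;
  username: string;
}

// displayName (username) • short ID prefix
function customerLabel(user: UserLike, prefixLen = 8): string {
  return `${user.displayName} (${user.username}) • ${user.id.slice(0, prefixLen)}`;
}
```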

Long-term Option (deferred):

  • Add optional publicCustomerId field (u001, u002, etc.)
  • Keep CUID as internal primary key
  • Use public ID only for invoices, support tickets, external references
  • Not blocking current development

Documentation:

  • zlh-grind/SCRATCH/2026-02-07_customer-id-schema.md

2. Prometheus + Grafana Operational

Achievement: Full metrics stack now working

Components Verified:

  • node_exporter installed in every LXC template
  • Prometheus scraping via HTTP service discovery
  • API endpoint /sd/exporters providing dynamic targets
  • Grafana dashboards importing and displaying data
  • Custom "ZLH LXC Overview" dashboard operational

Prometheus Configuration:

# /etc/prometheus/prometheus.yml on zlh-monitor
- job_name: 'zlh_lxc'
  scrape_interval: 15s
  http_sd_configs:
    - url: http://10.60.0.245:4000/sd/exporters
      refresh_interval: 30s
      authorization:
        type: Bearer
        credentials: '<TOKEN>'

Why This Works:

  • No hardcoded container IPs needed
  • New containers automatically discovered
  • API dynamically generates target list
  • Prometheus refreshes every 30s
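
The target list behind /sd/exporters could look like the sketch below (the container record shape and function names are assumptions; Prometheus HTTP SD expects an array of `{ targets, labels }` objects, matching the example later in this summary):

```typescript
// Assumed inventory record — not the actual schema.
interface ContainerRecord {
  vmid: number;
  ip: string;
  type: string;
  customer: string;
}

interface SdTargetGroup {
  targets: string[];
  labels: Record<string, string>;
}

// Map the API's container inventory to Prometheus HTTP SD format.
// node_exporter's default port is 9100.
function buildSdTargets(containers: ContainerRecord[], exporterPort = 9100): SdTargetGroup[] {
  return containers.map((c) => ({
    targets: [`${c.ip}:${exporterPort}`],
    labels: { vmid: String(c.vmid), type: c.type, customer: c.customer },
  }));
}

// In the API this would back something like:
//   app.get("/sd/exporters", auth, (_req, res) => res.json(buildSdTargets(listContainers())));
```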

Grafana Dashboard Setup:

  • Import gotcha resolved: datasource UID mapping required
  • Standard Node Exporter dashboard working
  • Custom ZLH dashboard with container-specific views
  • CPU metric variance vs top noted (acceptable for monitoring needs)

CPU Metric Note:

  • Grafana CPU% may not match top exactly
  • Reasons: rate windows, per-core vs normalized, cgroup accounting
  • Decision: Good enough for "is it pegged?" / "trending up/down?" / "is it dead?"
  • Not blocking on perfect parity

Next Step - Portal Integration: Plan to expose metrics to portal server cards via API:

GET /api/metrics/servers/:vmid/summary
Response:
{
  "cpu_percent": 45.2,
  "mem_percent": 62.8,
  "disk_percent": 38.5,
  "net_in": 1234567,   // bytes/sec
  "net_out": 987654    // bytes/sec
}

PromQL Building Blocks (documented):

  • CPU: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  • Memory: 100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
  • Disk: 100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))
  • Network: rate(node_network_receive_bytes_total[5m])
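
One way the planned summary endpoint could consume these queries — a hedged sketch, assuming the standard Prometheus /api/v1/query instant-query response shape (the fetch itself is assumed to happen elsewhere):

```typescript
// Shape of a Prometheus instant-query response (subset we care about).
interface PromInstantResponse {
  status: string;
  data: { result: { metric: Record<string, string>; value: [number, string] }[] };
}

// Pull the first sample value out of an instant-query result.
// Missing/stale series -> null, so the portal card can render "n/a"
// instead of a bogus zero.
function firstSampleValue(resp: PromInstantResponse): number | null {
  const sample = resp.data.result[0];
  return sample ? parseFloat(sample.value[1]) : null;
}
```

The endpoint would run one such query per field (cpu_percent, mem_percent, disk_percent, net_in, net_out) using the building blocks above.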

Documentation:

  • zlh-grind/SCRATCH/2026-02-07_prometheus-grafana.md

3. Host Controls + Delete Failsafe

Problem:

  • Frontend added "Delete Server" button with confirmation
  • Backend has safety check: refuse delete if LXC host is running
  • Portal only had game server controls (process), not host controls (LXC)
  • Users couldn't stop host to enable deletion

Decision: Keep delete failsafe + Add host controls

Why Not Remove Failsafe:

  • Prevents accidental data loss
  • Forces user to consciously stop host before deleting
  • Gives time to reconsider deletion
  • Aligns with "safety first" platform philosophy

New Host Controls Added:

  • Start Host
  • Stop Host
  • Restart Host

API Endpoints:

POST /servers/:id/host/start
POST /servers/:id/host/stop
POST /servers/:id/host/restart

Backend Implementation:

  • Calls proxmoxClient.startContainer(vmid)
  • Calls proxmoxClient.stopContainer(vmid) or shutdownContainer(vmid)
  • Restart = stop then start sequence
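
The lifecycle mapping can be sketched as a pure ordering function (the actual handlers would call proxmoxClient.startContainer / stopContainer per step; this only shows that restart = stop, then start):

```typescript
type HostAction = "start" | "stop" | "restart";

// Which proxmoxClient calls run, in order, for each host action.
function lifecycleSteps(action: HostAction): ("stop" | "start")[] {
  switch (action) {
    case "start":
      return ["start"];
    case "stop":
      return ["stop"];
    case "restart":
      return ["stop", "start"]; // restart is a stop-then-start sequence
  }
}
```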

UX Wording Decision:

  • Avoid "container" in user-facing UI
  • Use "Host Controls" or "Server Host"
  • Internal: LXC lifecycle management
  • User sees: "Start Host" / "Stop Host" / "Restart Host"

Delete Gate Logic:

DELETE /servers/:id
├─ Check: Is LXC host running?
│  ├─ YES → Deny (403 Forbidden)
│  └─ NO → Allow deletion
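
The gate above reduces to a small decision function (a minimal sketch; status code and message follow the diagram and user flow in this summary):

```typescript
// Deletion is refused with 403 Forbidden while the LXC host runs.
function deleteGate(hostRunning: boolean): { allowed: boolean; httpStatus: number; message: string } {
  return hostRunning
    ? { allowed: false, httpStatus: 403, message: "Please stop the host first" }
    : { allowed: true, httpStatus: 200, message: "Deletion proceeds" };
}
```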

User Flow:

  1. User clicks "Delete Server"
  2. If host running: "Please stop the host first"
  3. User uses "Stop Host" control
  4. Host shuts down
  5. User clicks "Delete Server" again
  6. Deletion proceeds

Benefits:

  • Safety without frustration
  • User control over compute spend (can stop idle containers)
  • Clear UX pattern (stop → delete)
  • Prevents orphaned running containers

Testing Checklist Verified:

  • Start Host → LXC goes running
  • Stop Host → LXC shuts down
  • Restart Host → stop then start
  • Delete while running → denied
  • Delete after stop → allowed

Documentation:

  • zlh-grind/SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md

📊 Technical Analysis

Customer ID Impact

Before:

  • Database had mixed formats (u000, u-dev-001, CUIDs)
  • Frontend hardcoded test IDs
  • Provisioning inconsistent with auth context
  • Schema confusion across sessions

After:

  • Single format: Prisma CUID from User.id
  • Always derive from req.user.id
  • Consistent across all container creation
  • No more format guessing

Migration Path:

  • New containers: automatic (uses req.user.id)
  • Existing test data: leave as-is or clean up later
  • Rule enforced going forward

Prometheus Architecture

HTTP Service Discovery Pattern:

Prometheus (zlh-monitor)
   ↓
GET /sd/exporters (API endpoint)
   ↓
Returns dynamic target list:
[
  {
    "targets": ["10.200.0.50:9100"],
    "labels": {
      "vmid": "1050",
      "type": "game",
      "customer": "jester"
    }
  },
  ...
]
   ↓
Prometheus scrapes all targets
   ↓
Metrics available in Grafana

Why This Works:

  • Scales automatically (new containers auto-discovered)
  • No manual config changes needed
  • API is source of truth for container inventory
  • Prometheus just consumes the list

Delete Safety Pattern

Design Philosophy:

Add controls, don't remove safeguards

Pattern:

  • Safety check remains: "host must be stopped"
  • User empowerment: give them stop/start controls
  • Clear feedback: explain why delete is blocked
  • Obvious path forward: "stop host first"

Alternative Considered and Rejected:

  • Auto-stop before delete: Too dangerous (user may not realize host will stop)
  • Remove safety check: Defeats purpose of safeguard
  • Admin-only delete: Breaks self-service model

Chosen Approach Benefits:

  • User maintains control
  • Clear cause-and-effect
  • Prevents accidents
  • Enables compute cost management (stop idle hosts)

🎯 Platform Status

Infrastructure

  • Prometheus/Grafana fully operational
  • HTTP service discovery working
  • Metrics available for all containers
  • Ready for portal integration

Backend

  • Customer ID standardization enforced
  • Host lifecycle controls implemented
  • Delete safeguards preserved
  • API endpoints ready for frontend

Frontend 🔧

  • Needs to stop hardcoding customer IDs
  • Needs to integrate host controls UI
  • Needs to handle delete flow (stop → delete)
  • Portal metrics cards planned (next)

Observability

  • Full metrics stack operational
  • Custom dashboards working
  • PromQL queries documented
  • Ready for production monitoring

📋 Implementation Checklist

Customer ID Enforcement (Backend)

  • Identify hardcoded IDs in provisioning code
  • Document decision to use Prisma CUID
  • Update provisioning endpoint to use req.user.id
  • Remove any client-supplied customerId acceptance
  • Verify new containers use correct ID format

Customer ID Updates (Frontend)

  • Remove u-dev-001 hardcoding
  • Use authenticated user's ID from token
  • Update display to show displayName (username) • ID
  • Test with new user registration

Host Controls (Frontend)

  • Add Start/Stop/Restart Host buttons to server cards
  • Wire to /servers/:id/host/{start|stop|restart} endpoints
  • Update delete flow to check host status
  • Show "Stop host first" message if delete blocked
  • Test full flow: stop → delete

Portal Metrics (Next Sprint)

  • Create /api/metrics/servers/:vmid/summary endpoint
  • Query Prometheus from API (server-side)
  • Return CPU/RAM/Disk/Net percentages
  • Add metrics display to server cards
  • Handle missing/stale data gracefully

🔑 Key Decisions Made

Decision 1: Prisma CUID as Customer ID

Rationale:

  • Already generated by Prisma automatically
  • Guaranteed unique, hard to guess
  • No schema changes needed
  • Eliminates format confusion immediately

Alternative Considered: Human-friendly sequential IDs (u001, u002)
Why Deferred: Adds complexity, not blocking current work

Implementation: Use req.user.id everywhere, display nicely in UI


Decision 2: Keep Delete Failsafe

Rationale:

  • Prevents accidental data loss
  • Forces conscious decision (stop then delete)
  • Aligns with safety-first philosophy
  • Better to add controls than remove safeguards

Alternative Considered: Auto-stop before delete
Why Rejected: Too dangerous, removes user control

Implementation: Add host controls so users can stop themselves


Decision 3: Prometheus HTTP Service Discovery

Rationale:

  • Scales automatically with container growth
  • No manual configuration maintenance
  • API is authoritative source of truth
  • Consistent with architectural boundaries

Alternative Considered: Static Prometheus config with container IPs
Why Rejected: Doesn't scale, requires manual updates, breaks when IPs change

Implementation: API endpoint /sd/exporters returns dynamic targets


🚀 Next Session Priorities

Immediate (Next Session)

  1. Frontend Customer ID Fix

    • Remove hardcoded u-dev-001
    • Use the authenticated user's ID from the auth token
    • Test with new user registration
  2. Host Controls UI

    • Add start/stop/restart buttons
    • Wire to API endpoints
    • Update delete confirmation flow
  3. Portal Metrics Endpoint

    • Create API endpoint querying Prometheus
    • Return summary metrics for server cards
    • Test with live containers

This Week

  1. Metrics in Portal

    • Display CPU/RAM/Disk/Net on server cards
    • Handle stale/missing data
    • Add loading states
  2. Testing

    • End-to-end customer ID flow
    • Host lifecycle controls
    • Delete with safety check
    • Metrics display accuracy

Next Week

  1. Security Vulnerability Fixes (still pending)
    • Server ownership bypass
    • Admin privilege escalation
    • Token URL exposure
    • API key validation

📁 Files Modified

zlh-grind Repository:

  • README.md - Updated with Feb 7 handover summary
  • SCRATCH/2026-02-07_customer-id-schema.md - Created
  • SCRATCH/2026-02-07_prometheus-grafana.md - Created
  • SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md - Created

knowledge-base Repository:

  • Session_Summaries/2026-02-07_Customer-ID-Prometheus-Host-Controls.md - This file (created)

🔗 Reference Documentation

Session Documents:

  • zlh-grind/SCRATCH/2026-02-07_customer-id-schema.md
  • zlh-grind/SCRATCH/2026-02-07_prometheus-grafana.md
  • zlh-grind/SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md

Platform Documentation:

  • Cross Project Tracker: ZeroLagHub_Cross_Project_Tracker.md
  • Current State: ZeroLagHub_Complete_Current_State_Jan2026.md


💡 Key Insights

1. Standardization Eliminates Confusion

  • Mixed ID formats caused confusion across sessions
  • Single format (Prisma CUID) removes ambiguity
  • Display layer can make IDs human-friendly
  • Backend uses what's already in database

2. Safety Through Empowerment

  • Don't remove safeguards, add user controls
  • Delete failsafe + host controls = safety + agency
  • Clear UX: "stop host first" tells user what to do
  • Prevents accidents without frustration

3. HTTP Service Discovery Scales

  • Dynamic target discovery from API
  • No manual Prometheus config updates
  • Containers auto-discovered on creation
  • API as authoritative inventory source

4. Monitoring is Production-Ready

  • Full metrics stack operational
  • Grafana dashboards working
  • Ready for portal integration
  • PromQL queries documented for reuse

Session Duration: ~2-3 hours
Session Type: Platform foundation + UX safety
Next AI: Claude or GPT (frontend customer ID fix + host controls UI)
Blocking Issues: None (all decisions made, implementation ready)
Session Outcome: Three critical platform decisions made and documented