docs: add Feb 7 2026 session summary (customer ID, Prometheus, host controls)
This commit is contained in:
parent
8d6eb6632a
commit
6d399a966f
@ -0,0 +1,475 @@
|
|||||||
|
# Session Summary - February 7, 2026
|
||||||
|
|
||||||
|
## 🎯 Session Objectives
|
||||||
|
Resolve customer ID confusion, complete Prometheus/Grafana setup, and implement host lifecycle controls with proper delete safeguards.
|
||||||
|
|
||||||
|
## ✅ Work Completed
|
||||||
|
|
||||||
|
### 1. **Customer ID Schema Standardization**
|
||||||
|
|
||||||
|
**Problem Identified**:
|
||||||
|
- Mixed ID formats in `ContainerInstance.customerId` field
|
||||||
|
- Sometimes `u000` (manual test data)
|
||||||
|
- Sometimes `u-dev-001` (hardcoded in frontend)
|
||||||
|
- Sometimes Prisma CUID (new user registrations)
|
||||||
|
- Root cause: Frontend hardcoded IDs instead of using authenticated user's ID
|
||||||
|
|
||||||
|
**Decision Made**: ✅ Use Prisma CUID (`User.id`) as authoritative identifier
|
||||||
|
|
||||||
|
**Implementation Rule**:
|
||||||
|
```javascript
|
||||||
|
// Always derive from auth context
|
||||||
|
const customerId = req.user.id;
|
||||||
|
|
||||||
|
// NEVER accept client-supplied customerId
|
||||||
|
// NEVER hardcode test IDs in production code
|
||||||
|
```
|
||||||
|
|
||||||
|
**Why This Works**:
|
||||||
|
- Guaranteed unique (Prisma generates)
|
||||||
|
- Harder to guess (security benefit)
|
||||||
|
- Already in database (no schema changes needed)
|
||||||
|
- Eliminates format confusion
|
||||||
|
|
||||||
|
**Display Strategy**:
|
||||||
|
- UI shows: `displayName (username) • [first 8 chars of ID]`
|
||||||
|
- Example: `Jester (jester) • cmk07b7n4`
|
||||||
|
- Logs include full CUID for debugging
|
||||||
|
|
||||||
|
**Long-term Option** (deferred):
|
||||||
|
- Add optional `publicCustomerId` field (u001, u002, etc.)
|
||||||
|
- Keep CUID as internal primary key
|
||||||
|
- Use public ID only for invoices, support tickets, external references
|
||||||
|
- Not blocking current development
|
||||||
|
|
||||||
|
**Documentation**:
|
||||||
|
- `zlh-grind/SCRATCH/2026-02-07_customer-id-schema.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. **Prometheus + Grafana Operational**
|
||||||
|
|
||||||
|
**Achievement**: Full metrics stack now working
|
||||||
|
|
||||||
|
**Components Verified**:
|
||||||
|
- ✅ node_exporter installed in every LXC template
|
||||||
|
- ✅ Prometheus scraping via HTTP service discovery
|
||||||
|
- ✅ API endpoint `/sd/exporters` providing dynamic targets
|
||||||
|
- ✅ Grafana dashboards importing and displaying data
|
||||||
|
- ✅ Custom "ZLH LXC Overview" dashboard operational
|
||||||
|
|
||||||
|
**Prometheus Configuration**:
|
||||||
|
```yaml
|
||||||
|
# /etc/prometheus/prometheus.yml on zlh-monitor
|
||||||
|
- job_name: 'zlh_lxc'
|
||||||
|
scrape_interval: 15s
|
||||||
|
http_sd_configs:
|
||||||
|
- url: http://10.60.0.245:4000/sd/exporters
|
||||||
|
refresh_interval: 30s
|
||||||
|
authorization:
|
||||||
|
type: Bearer
|
||||||
|
credentials: '<TOKEN>'
|
||||||
|
```
|
||||||
|
|
||||||
|
**Why This Works**:
|
||||||
|
- No hardcoded container IPs needed
|
||||||
|
- New containers automatically discovered
|
||||||
|
- API dynamically generates target list
|
||||||
|
- Prometheus refreshes every 30s
|
||||||
|
|
||||||
|
**Grafana Dashboard Setup**:
|
||||||
|
- Import gotcha resolved: datasource UID mapping required
|
||||||
|
- Standard Node Exporter dashboard working
|
||||||
|
- Custom ZLH dashboard with container-specific views
|
||||||
|
- CPU metric variance vs `top` noted (acceptable for monitoring needs)
|
||||||
|
|
||||||
|
**CPU Metric Note**:
|
||||||
|
- Grafana CPU% may not match `top` exactly
|
||||||
|
- Reasons: rate windows, per-core vs normalized, cgroup accounting
|
||||||
|
- Decision: Good enough for "is it pegged?" / "trending up/down?" / "is it dead?"
|
||||||
|
- Not blocking on perfect parity
|
||||||
|
|
||||||
|
**Next Step - Portal Integration**:
|
||||||
|
Plan to expose metrics to portal server cards via API:
|
||||||
|
|
||||||
|
```
|
||||||
|
GET /api/metrics/servers/:vmid/summary
|
||||||
|
Response:
|
||||||
|
{
|
||||||
|
cpu_percent: 45.2,
|
||||||
|
mem_percent: 62.8,
|
||||||
|
disk_percent: 38.5,
|
||||||
|
net_in: 1234567, // bytes/sec
|
||||||
|
net_out: 987654
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**PromQL Building Blocks** (documented):
|
||||||
|
- CPU: `100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
|
||||||
|
- Memory: `100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))`
|
||||||
|
- Disk: `100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))`
|
||||||
|
- Network: `rate(node_network_receive_bytes_total[5m])`
|
||||||
|
|
||||||
|
**Documentation**:
|
||||||
|
- `zlh-grind/SCRATCH/2026-02-07_prometheus-grafana.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. **Host Controls + Delete Failsafe**
|
||||||
|
|
||||||
|
**Problem**:
|
||||||
|
- Frontend added "Delete Server" button with confirmation
|
||||||
|
- Backend has safety check: refuse delete if LXC host is running
|
||||||
|
- Portal only had game server controls (process), not host controls (LXC)
|
||||||
|
- Users couldn't stop host to enable deletion
|
||||||
|
|
||||||
|
**Decision**: ✅ Keep delete failsafe + Add host controls
|
||||||
|
|
||||||
|
**Why Not Remove Failsafe**:
|
||||||
|
- Prevents accidental data loss
|
||||||
|
- Forces user to consciously stop host before deleting
|
||||||
|
- Gives time to reconsider deletion
|
||||||
|
- Aligns with "safety first" platform philosophy
|
||||||
|
|
||||||
|
**New Host Controls Added**:
|
||||||
|
- Start Host
|
||||||
|
- Stop Host
|
||||||
|
- Restart Host
|
||||||
|
|
||||||
|
**API Endpoints**:
|
||||||
|
```
|
||||||
|
POST /servers/:id/host/start
|
||||||
|
POST /servers/:id/host/stop
|
||||||
|
POST /servers/:id/host/restart
|
||||||
|
```
|
||||||
|
|
||||||
|
**Backend Implementation**:
|
||||||
|
- Calls `proxmoxClient.startContainer(vmid)`
|
||||||
|
- Calls `proxmoxClient.stopContainer(vmid)` or `shutdownContainer(vmid)`
|
||||||
|
- Restart = stop then start sequence
|
||||||
|
|
||||||
|
**UX Wording Decision**:
|
||||||
|
- Avoid "container" in user-facing UI
|
||||||
|
- Use "Host Controls" or "Server Host"
|
||||||
|
- Internal: LXC lifecycle management
|
||||||
|
- User sees: "Start Host" / "Stop Host" / "Restart Host"
|
||||||
|
|
||||||
|
**Delete Gate Logic**:
|
||||||
|
```
|
||||||
|
DELETE /servers/:id
|
||||||
|
├─ Check: Is LXC host running?
|
||||||
|
│ ├─ YES → Deny (403 Forbidden)
|
||||||
|
│ └─ NO → Allow deletion
|
||||||
|
```
|
||||||
|
|
||||||
|
**User Flow**:
|
||||||
|
1. User clicks "Delete Server"
|
||||||
|
2. If host running: "Please stop the host first"
|
||||||
|
3. User uses "Stop Host" control
|
||||||
|
4. Host shuts down
|
||||||
|
5. User clicks "Delete Server" again
|
||||||
|
6. Deletion proceeds
|
||||||
|
|
||||||
|
**Benefits**:
|
||||||
|
- Safety without frustration
|
||||||
|
- User control over compute spend (can stop idle containers)
|
||||||
|
- Clear UX pattern (stop → delete)
|
||||||
|
- Prevents orphaned running containers
|
||||||
|
|
||||||
|
**Testing Checklist Verified**:
|
||||||
|
- ✅ Start Host → LXC goes running
|
||||||
|
- ✅ Stop Host → LXC shuts down
|
||||||
|
- ✅ Restart Host → stop then start
|
||||||
|
- ✅ Delete while running → denied
|
||||||
|
- ✅ Delete after stop → allowed
|
||||||
|
|
||||||
|
**Documentation**:
|
||||||
|
- `zlh-grind/SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Technical Analysis
|
||||||
|
|
||||||
|
### **Customer ID Impact**
|
||||||
|
|
||||||
|
**Before**:
|
||||||
|
- Database had mixed formats (u000, u-dev-001, CUIDs)
|
||||||
|
- Frontend hardcoded test IDs
|
||||||
|
- Provisioning inconsistent with auth context
|
||||||
|
- Schema confusion across sessions
|
||||||
|
|
||||||
|
**After**:
|
||||||
|
- Single format: Prisma CUID from `User.id`
|
||||||
|
- Always derive from `req.user.id`
|
||||||
|
- Consistent across all container creation
|
||||||
|
- No more format guessing
|
||||||
|
|
||||||
|
**Migration Path**:
|
||||||
|
- New containers: automatic (uses req.user.id)
|
||||||
|
- Existing test data: leave as-is or clean up later
|
||||||
|
- Rule enforced going forward
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Prometheus Architecture**
|
||||||
|
|
||||||
|
**HTTP Service Discovery Pattern**:
|
||||||
|
```
|
||||||
|
Prometheus (zlh-monitor)
|
||||||
|
↓
|
||||||
|
GET /sd/exporters (API endpoint)
|
||||||
|
↓
|
||||||
|
Returns dynamic target list:
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"targets": ["10.200.0.50:9100"],
|
||||||
|
"labels": {
|
||||||
|
"vmid": "1050",
|
||||||
|
"type": "game",
|
||||||
|
"customer": "jester"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
↓
|
||||||
|
Prometheus scrapes all targets
|
||||||
|
↓
|
||||||
|
Metrics available in Grafana
|
||||||
|
```
|
||||||
|
|
||||||
|
**Why This Works**:
|
||||||
|
- Scales automatically (new containers auto-discovered)
|
||||||
|
- No manual config changes needed
|
||||||
|
- API is source of truth for container inventory
|
||||||
|
- Prometheus just consumes the list
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Delete Safety Pattern**
|
||||||
|
|
||||||
|
**Design Philosophy**:
|
||||||
|
> Add controls, don't remove safeguards
|
||||||
|
|
||||||
|
**Pattern**:
|
||||||
|
- Safety check remains: "host must be stopped"
|
||||||
|
- User empowerment: give them stop/start controls
|
||||||
|
- Clear feedback: explain why delete is blocked
|
||||||
|
- Obvious path forward: "stop host first"
|
||||||
|
|
||||||
|
**Alternative Considered and Rejected**:
|
||||||
|
- Auto-stop before delete: Too dangerous (user may not realize host will stop)
|
||||||
|
- Remove safety check: Defeats purpose of safeguard
|
||||||
|
- Admin-only delete: Breaks self-service model
|
||||||
|
|
||||||
|
**Chosen Approach Benefits**:
|
||||||
|
- User maintains control
|
||||||
|
- Clear cause-and-effect
|
||||||
|
- Prevents accidents
|
||||||
|
- Enables compute cost management (stop idle hosts)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎯 Platform Status
|
||||||
|
|
||||||
|
### **Infrastructure** ✅
|
||||||
|
- Prometheus/Grafana fully operational
|
||||||
|
- HTTP service discovery working
|
||||||
|
- Metrics available for all containers
|
||||||
|
- Ready for portal integration
|
||||||
|
|
||||||
|
### **Backend** ✅
|
||||||
|
- Customer ID standardization enforced
|
||||||
|
- Host lifecycle controls implemented
|
||||||
|
- Delete safeguards preserved
|
||||||
|
- API endpoints ready for frontend
|
||||||
|
|
||||||
|
### **Frontend** 🔧
|
||||||
|
- Needs to stop hardcoding customer IDs
|
||||||
|
- Needs to integrate host controls UI
|
||||||
|
- Needs to handle delete flow (stop → delete)
|
||||||
|
- Portal metrics cards planned (next)
|
||||||
|
|
||||||
|
### **Observability** ✅
|
||||||
|
- Full metrics stack operational
|
||||||
|
- Custom dashboards working
|
||||||
|
- PromQL queries documented
|
||||||
|
- Ready for production monitoring
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📋 Implementation Checklist
|
||||||
|
|
||||||
|
### **Customer ID Enforcement** (Backend)
|
||||||
|
- [x] Identify hardcoded IDs in provisioning code
|
||||||
|
- [x] Document decision to use Prisma CUID
|
||||||
|
- [ ] Update provisioning endpoint to use `req.user.id`
|
||||||
|
- [ ] Remove any client-supplied customerId acceptance
|
||||||
|
- [ ] Verify new containers use correct ID format
|
||||||
|
|
||||||
|
### **Customer ID Updates** (Frontend)
|
||||||
|
- [ ] Remove `u-dev-001` hardcoding
|
||||||
|
- [ ] Use authenticated user's ID from token
|
||||||
|
- [ ] Update display to show `displayName (username) • ID`
|
||||||
|
- [ ] Test with new user registration
|
||||||
|
|
||||||
|
### **Host Controls** (Frontend)
|
||||||
|
- [ ] Add Start/Stop/Restart Host buttons to server cards
|
||||||
|
- [ ] Wire to `/servers/:id/host/{start|stop|restart}` endpoints
|
||||||
|
- [ ] Update delete flow to check host status
|
||||||
|
- [ ] Show "Stop host first" message if delete blocked
|
||||||
|
- [ ] Test full flow: stop → delete
|
||||||
|
|
||||||
|
### **Portal Metrics** (Next Sprint)
|
||||||
|
- [ ] Create `/api/metrics/servers/:vmid/summary` endpoint
|
||||||
|
- [ ] Query Prometheus from API (server-side)
|
||||||
|
- [ ] Return CPU/RAM/Disk/Net percentages
|
||||||
|
- [ ] Add metrics display to server cards
|
||||||
|
- [ ] Handle missing/stale data gracefully
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔑 Key Decisions Made
|
||||||
|
|
||||||
|
### **Decision 1: Prisma CUID as Customer ID**
|
||||||
|
**Rationale**:
|
||||||
|
- Already generated by Prisma automatically
|
||||||
|
- Guaranteed unique, hard to guess
|
||||||
|
- No schema changes needed
|
||||||
|
- Eliminates format confusion immediately
|
||||||
|
|
||||||
|
**Alternative Considered**: Human-friendly sequential IDs (u001, u002)
|
||||||
|
**Why Deferred**: Adds complexity, not blocking current work
|
||||||
|
|
||||||
|
**Implementation**: Use `req.user.id` everywhere, display nicely in UI
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Decision 2: Keep Delete Failsafe**
|
||||||
|
**Rationale**:
|
||||||
|
- Prevents accidental data loss
|
||||||
|
- Forces conscious decision (stop then delete)
|
||||||
|
- Aligns with safety-first philosophy
|
||||||
|
- Better to add controls than remove safeguards
|
||||||
|
|
||||||
|
**Alternative Considered**: Auto-stop before delete
|
||||||
|
**Why Rejected**: Too dangerous, removes user control
|
||||||
|
|
||||||
|
**Implementation**: Add host controls so users can stop themselves
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Decision 3: Prometheus HTTP Service Discovery**
|
||||||
|
**Rationale**:
|
||||||
|
- Scales automatically with container growth
|
||||||
|
- No manual configuration maintenance
|
||||||
|
- API is authoritative source of truth
|
||||||
|
- Consistent with architectural boundaries
|
||||||
|
|
||||||
|
**Alternative Considered**: Static prometheus config with container IPs
|
||||||
|
**Why Rejected**: Doesn't scale, manual updates, IP changes break it
|
||||||
|
|
||||||
|
**Implementation**: API endpoint `/sd/exporters` returns dynamic targets
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚀 Next Session Priorities
|
||||||
|
|
||||||
|
### **Immediate (Next Session)**
|
||||||
|
1. **Frontend Customer ID Fix**
|
||||||
|
- Remove hardcoded `u-dev-001`
|
||||||
|
- Use `req.user.id` from auth context
|
||||||
|
- Test with new user registration
|
||||||
|
|
||||||
|
2. **Host Controls UI**
|
||||||
|
- Add start/stop/restart buttons
|
||||||
|
- Wire to API endpoints
|
||||||
|
- Update delete confirmation flow
|
||||||
|
|
||||||
|
3. **Portal Metrics Endpoint**
|
||||||
|
- Create API endpoint querying Prometheus
|
||||||
|
- Return summary metrics for server cards
|
||||||
|
- Test with live containers
|
||||||
|
|
||||||
|
### **This Week**
|
||||||
|
4. **Metrics in Portal**
|
||||||
|
- Display CPU/RAM/Disk/Net on server cards
|
||||||
|
- Handle stale/missing data
|
||||||
|
- Add loading states
|
||||||
|
|
||||||
|
5. **Testing**
|
||||||
|
- End-to-end customer ID flow
|
||||||
|
- Host lifecycle controls
|
||||||
|
- Delete with safety check
|
||||||
|
- Metrics display accuracy
|
||||||
|
|
||||||
|
### **Next Week**
|
||||||
|
6. **Security Vulnerability Fixes** (still pending)
|
||||||
|
- Server ownership bypass
|
||||||
|
- Admin privilege escalation
|
||||||
|
- Token URL exposure
|
||||||
|
- API key validation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📁 Files Modified
|
||||||
|
|
||||||
|
**zlh-grind Repository**:
|
||||||
|
- `README.md` - Updated with Feb 7 handover summary
|
||||||
|
- `SCRATCH/2026-02-07_customer-id-schema.md` - Created
|
||||||
|
- `SCRATCH/2026-02-07_prometheus-grafana.md` - Created
|
||||||
|
- `SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md` - Created
|
||||||
|
|
||||||
|
**knowledge-base Repository**:
|
||||||
|
- `Session_Summaries/2026-02-07_Customer-ID-Prometheus-Host-Controls.md` - This file (created)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔗 Reference Documentation
|
||||||
|
|
||||||
|
**Session Documents**:
|
||||||
|
- zlh-grind: https://git.zerolaghub.com/jester/zlh-grind
|
||||||
|
- Previous session: Session_Summaries/2026-01-18_Architectural_Guardrails.md
|
||||||
|
|
||||||
|
**Platform Documentation**:
|
||||||
|
- Cross Project Tracker: ZeroLagHub_Cross_Project_Tracker.md
|
||||||
|
- Current State: ZeroLagHub_Complete_Current_State_Jan2026.md
|
||||||
|
|
||||||
|
**PromQL Resources**:
|
||||||
|
- Prometheus documentation: https://prometheus.io/docs/
|
||||||
|
- Node Exporter metrics: Standard exporter for system metrics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 💡 Key Insights
|
||||||
|
|
||||||
|
### **1. Standardization Eliminates Confusion**
|
||||||
|
- Mixed ID formats caused confusion across sessions
|
||||||
|
- Single format (Prisma CUID) removes ambiguity
|
||||||
|
- Display layer can make IDs human-friendly
|
||||||
|
- Backend uses what's already in database
|
||||||
|
|
||||||
|
### **2. Safety Through Empowerment**
|
||||||
|
- Don't remove safeguards, add user controls
|
||||||
|
- Delete failsafe + host controls = safety + agency
|
||||||
|
- Clear UX: "stop host first" tells user what to do
|
||||||
|
- Prevents accidents without frustration
|
||||||
|
|
||||||
|
### **3. HTTP Service Discovery Scales**
|
||||||
|
- Dynamic target discovery from API
|
||||||
|
- No manual Prometheus config updates
|
||||||
|
- Containers auto-discovered on creation
|
||||||
|
- API as authoritative inventory source
|
||||||
|
|
||||||
|
### **4. Monitoring is Production-Ready**
|
||||||
|
- Full metrics stack operational
|
||||||
|
- Grafana dashboards working
|
||||||
|
- Ready for portal integration
|
||||||
|
- PromQL queries documented for reuse
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Session Duration**: ~2-3 hours
|
||||||
|
**Session Type**: Platform foundation + UX safety
|
||||||
|
**Next AI**: Claude or GPT (frontend customer ID fix + host controls UI)
|
||||||
|
**Blocking Issues**: None (all decisions made, implementation ready)
|
||||||
|
**Session Outcome**: ✅ Three critical platform decisions made and documented
|
||||||
Loading…
Reference in New Issue
Block a user