docs: add Feb 7 2026 session summary (customer ID, Prometheus, host controls)
This commit is contained in:
parent
8d6eb6632a
commit
6d399a966f
@ -0,0 +1,475 @@
|
||||
# Session Summary - February 7, 2026
|
||||
|
||||
## 🎯 Session Objectives
|
||||
Resolve customer ID confusion, complete Prometheus/Grafana setup, and implement host lifecycle controls with proper delete safeguards.
|
||||
|
||||
## ✅ Work Completed
|
||||
|
||||
### 1. **Customer ID Schema Standardization**
|
||||
|
||||
**Problem Identified**:
|
||||
- Mixed ID formats in `ContainerInstance.customerId` field
|
||||
- Sometimes `u000` (manual test data)
|
||||
- Sometimes `u-dev-001` (hardcoded in frontend)
|
||||
- Sometimes Prisma CUID (new user registrations)
|
||||
- Root cause: Frontend hardcoded IDs instead of using authenticated user's ID
|
||||
|
||||
**Decision Made**: ✅ Use Prisma CUID (`User.id`) as authoritative identifier
|
||||
|
||||
**Implementation Rule**:
|
||||
```javascript
|
||||
// Always derive from auth context
|
||||
const customerId = req.user.id;
|
||||
|
||||
// NEVER accept client-supplied customerId
|
||||
// NEVER hardcode test IDs in production code
|
||||
```
|
||||
|
||||
**Why This Works**:
|
||||
- Guaranteed unique (Prisma generates)
|
||||
- Harder to guess (security benefit)
|
||||
- Already in database (no schema changes needed)
|
||||
- Eliminates format confusion
|
||||
|
||||
**Display Strategy**:
|
||||
- UI shows: `displayName (username) • [first 8 chars of ID]`
|
||||
- Example: `Jester (jester) • cmk07b7n4`
|
||||
- Logs include full CUID for debugging
|
||||
|
||||
**Long-term Option** (deferred):
|
||||
- Add optional `publicCustomerId` field (u001, u002, etc.)
|
||||
- Keep CUID as internal primary key
|
||||
- Use public ID only for invoices, support tickets, external references
|
||||
- Not blocking current development
|
||||
|
||||
**Documentation**:
|
||||
- `zlh-grind/SCRATCH/2026-02-07_customer-id-schema.md`
|
||||
|
||||
---
|
||||
|
||||
### 2. **Prometheus + Grafana Operational**
|
||||
|
||||
**Achievement**: Full metrics stack now working
|
||||
|
||||
**Components Verified**:
|
||||
- ✅ node_exporter installed in every LXC template
|
||||
- ✅ Prometheus scraping via HTTP service discovery
|
||||
- ✅ API endpoint `/sd/exporters` providing dynamic targets
|
||||
- ✅ Grafana dashboards importing and displaying data
|
||||
- ✅ Custom "ZLH LXC Overview" dashboard operational
|
||||
|
||||
**Prometheus Configuration**:
|
||||
```yaml
|
||||
# /etc/prometheus/prometheus.yml on zlh-monitor
|
||||
- job_name: 'zlh_lxc'
|
||||
scrape_interval: 15s
|
||||
http_sd_configs:
|
||||
- url: http://10.60.0.245:4000/sd/exporters
|
||||
refresh_interval: 30s
|
||||
authorization:
|
||||
type: Bearer
|
||||
credentials: '<TOKEN>'
|
||||
```
|
||||
|
||||
**Why This Works**:
|
||||
- No hardcoded container IPs needed
|
||||
- New containers automatically discovered
|
||||
- API dynamically generates target list
|
||||
- Prometheus refreshes every 30s
|
||||
|
||||
**Grafana Dashboard Setup**:
|
||||
- Import gotcha resolved: datasource UID mapping required
|
||||
- Standard Node Exporter dashboard working
|
||||
- Custom ZLH dashboard with container-specific views
|
||||
- CPU metric variance vs `top` noted (acceptable for monitoring needs)
|
||||
|
||||
**CPU Metric Note**:
|
||||
- Grafana CPU% may not match `top` exactly
|
||||
- Reasons: rate windows, per-core vs normalized, cgroup accounting
|
||||
- Decision: Good enough for "is it pegged?" / "trending up/down?" / "is it dead?"
|
||||
- Not blocking on perfect parity
|
||||
|
||||
**Next Step - Portal Integration**:
|
||||
Plan to expose metrics to portal server cards via API:
|
||||
|
||||
```
|
||||
GET /api/metrics/servers/:vmid/summary
|
||||
Response:
|
||||
{
|
||||
cpu_percent: 45.2,
|
||||
mem_percent: 62.8,
|
||||
disk_percent: 38.5,
|
||||
net_in: 1234567, // bytes/sec
|
||||
net_out: 987654
|
||||
}
|
||||
```
|
||||
|
||||
**PromQL Building Blocks** (documented):
|
||||
- CPU: `100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
|
||||
- Memory: `100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))`
|
||||
- Disk: `100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))`
|
||||
- Network: `rate(node_network_receive_bytes_total[5m])`
|
||||
|
||||
**Documentation**:
|
||||
- `zlh-grind/SCRATCH/2026-02-07_prometheus-grafana.md`
|
||||
|
||||
---
|
||||
|
||||
### 3. **Host Controls + Delete Failsafe**
|
||||
|
||||
**Problem**:
|
||||
- Frontend added "Delete Server" button with confirmation
|
||||
- Backend has safety check: refuse delete if LXC host is running
|
||||
- Portal only had game server controls (process), not host controls (LXC)
|
||||
- Users couldn't stop host to enable deletion
|
||||
|
||||
**Decision**: ✅ Keep delete failsafe + Add host controls
|
||||
|
||||
**Why Not Remove Failsafe**:
|
||||
- Prevents accidental data loss
|
||||
- Forces user to consciously stop host before deleting
|
||||
- Gives time to reconsider deletion
|
||||
- Aligns with "safety first" platform philosophy
|
||||
|
||||
**New Host Controls Added**:
|
||||
- Start Host
|
||||
- Stop Host
|
||||
- Restart Host
|
||||
|
||||
**API Endpoints**:
|
||||
```
|
||||
POST /servers/:id/host/start
|
||||
POST /servers/:id/host/stop
|
||||
POST /servers/:id/host/restart
|
||||
```
|
||||
|
||||
**Backend Implementation**:
|
||||
- Calls `proxmoxClient.startContainer(vmid)`
|
||||
- Calls `proxmoxClient.stopContainer(vmid)` or `shutdownContainer(vmid)`
|
||||
- Restart = stop then start sequence
|
||||
|
||||
**UX Wording Decision**:
|
||||
- Avoid "container" in user-facing UI
|
||||
- Use "Host Controls" or "Server Host"
|
||||
- Internal: LXC lifecycle management
|
||||
- User sees: "Start Host" / "Stop Host" / "Restart Host"
|
||||
|
||||
**Delete Gate Logic**:
|
||||
```
|
||||
DELETE /servers/:id
|
||||
├─ Check: Is LXC host running?
|
||||
│ ├─ YES → Deny (403 Forbidden)
|
||||
│ └─ NO → Allow deletion
|
||||
```
|
||||
|
||||
**User Flow**:
|
||||
1. User clicks "Delete Server"
|
||||
2. If host running: "Please stop the host first"
|
||||
3. User uses "Stop Host" control
|
||||
4. Host shuts down
|
||||
5. User clicks "Delete Server" again
|
||||
6. Deletion proceeds
|
||||
|
||||
**Benefits**:
|
||||
- Safety without frustration
|
||||
- User control over compute spend (can stop idle containers)
|
||||
- Clear UX pattern (stop → delete)
|
||||
- Prevents orphaned running containers
|
||||
|
||||
**Testing Checklist Verified**:
|
||||
- ✅ Start Host → LXC goes running
|
||||
- ✅ Stop Host → LXC shuts down
|
||||
- ✅ Restart Host → stop then start
|
||||
- ✅ Delete while running → denied
|
||||
- ✅ Delete after stop → allowed
|
||||
|
||||
**Documentation**:
|
||||
- `zlh-grind/SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md`
|
||||
|
||||
---
|
||||
|
||||
## 📊 Technical Analysis
|
||||
|
||||
### **Customer ID Impact**
|
||||
|
||||
**Before**:
|
||||
- Database had mixed formats (u000, u-dev-001, CUIDs)
|
||||
- Frontend hardcoded test IDs
|
||||
- Provisioning inconsistent with auth context
|
||||
- Schema confusion across sessions
|
||||
|
||||
**After**:
|
||||
- Single format: Prisma CUID from `User.id`
|
||||
- Always derive from `req.user.id`
|
||||
- Consistent across all container creation
|
||||
- No more format guessing
|
||||
|
||||
**Migration Path**:
|
||||
- New containers: automatic (uses req.user.id)
|
||||
- Existing test data: leave as-is or clean up later
|
||||
- Rule enforced going forward
|
||||
|
||||
---
|
||||
|
||||
### **Prometheus Architecture**
|
||||
|
||||
**HTTP Service Discovery Pattern**:
|
||||
```
|
||||
Prometheus (zlh-monitor)
|
||||
↓
|
||||
GET /sd/exporters (API endpoint)
|
||||
↓
|
||||
Returns dynamic target list:
|
||||
[
|
||||
{
|
||||
"targets": ["10.200.0.50:9100"],
|
||||
"labels": {
|
||||
"vmid": "1050",
|
||||
"type": "game",
|
||||
"customer": "jester"
|
||||
}
|
||||
},
|
||||
...
|
||||
]
|
||||
↓
|
||||
Prometheus scrapes all targets
|
||||
↓
|
||||
Metrics available in Grafana
|
||||
```
|
||||
|
||||
**Why This Works**:
|
||||
- Scales automatically (new containers auto-discovered)
|
||||
- No manual config changes needed
|
||||
- API is source of truth for container inventory
|
||||
- Prometheus just consumes the list
|
||||
|
||||
---
|
||||
|
||||
### **Delete Safety Pattern**
|
||||
|
||||
**Design Philosophy**:
|
||||
> Add controls, don't remove safeguards
|
||||
|
||||
**Pattern**:
|
||||
- Safety check remains: "host must be stopped"
|
||||
- User empowerment: give them stop/start controls
|
||||
- Clear feedback: explain why delete is blocked
|
||||
- Obvious path forward: "stop host first"
|
||||
|
||||
**Alternative Considered and Rejected**:
|
||||
- Auto-stop before delete: Too dangerous (user may not realize host will stop)
|
||||
- Remove safety check: Defeats purpose of safeguard
|
||||
- Admin-only delete: Breaks self-service model
|
||||
|
||||
**Chosen Approach Benefits**:
|
||||
- User maintains control
|
||||
- Clear cause-and-effect
|
||||
- Prevents accidents
|
||||
- Enables compute cost management (stop idle hosts)
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Platform Status
|
||||
|
||||
### **Infrastructure** ✅
|
||||
- Prometheus/Grafana fully operational
|
||||
- HTTP service discovery working
|
||||
- Metrics available for all containers
|
||||
- Ready for portal integration
|
||||
|
||||
### **Backend** ✅
|
||||
- Customer ID standardization enforced
|
||||
- Host lifecycle controls implemented
|
||||
- Delete safeguards preserved
|
||||
- API endpoints ready for frontend
|
||||
|
||||
### **Frontend** 🔧
|
||||
- Needs to stop hardcoding customer IDs
|
||||
- Needs to integrate host controls UI
|
||||
- Needs to handle delete flow (stop → delete)
|
||||
- Portal metrics cards planned (next)
|
||||
|
||||
### **Observability** ✅
|
||||
- Full metrics stack operational
|
||||
- Custom dashboards working
|
||||
- PromQL queries documented
|
||||
- Ready for production monitoring
|
||||
|
||||
---
|
||||
|
||||
## 📋 Implementation Checklist
|
||||
|
||||
### **Customer ID Enforcement** (Backend)
|
||||
- [x] Identify hardcoded IDs in provisioning code
|
||||
- [x] Document decision to use Prisma CUID
|
||||
- [ ] Update provisioning endpoint to use `req.user.id`
|
||||
- [ ] Remove any client-supplied customerId acceptance
|
||||
- [ ] Verify new containers use correct ID format
|
||||
|
||||
### **Customer ID Updates** (Frontend)
|
||||
- [ ] Remove `u-dev-001` hardcoding
|
||||
- [ ] Use authenticated user's ID from token
|
||||
- [ ] Update display to show `displayName (username) • ID`
|
||||
- [ ] Test with new user registration
|
||||
|
||||
### **Host Controls** (Frontend)
|
||||
- [ ] Add Start/Stop/Restart Host buttons to server cards
|
||||
- [ ] Wire to `/servers/:id/host/{start|stop|restart}` endpoints
|
||||
- [ ] Update delete flow to check host status
|
||||
- [ ] Show "Stop host first" message if delete blocked
|
||||
- [ ] Test full flow: stop → delete
|
||||
|
||||
### **Portal Metrics** (Next Sprint)
|
||||
- [ ] Create `/api/metrics/servers/:vmid/summary` endpoint
|
||||
- [ ] Query Prometheus from API (server-side)
|
||||
- [ ] Return CPU/RAM/Disk/Net percentages
|
||||
- [ ] Add metrics display to server cards
|
||||
- [ ] Handle missing/stale data gracefully
|
||||
|
||||
---
|
||||
|
||||
## 🔑 Key Decisions Made
|
||||
|
||||
### **Decision 1: Prisma CUID as Customer ID**
|
||||
**Rationale**:
|
||||
- Already generated by Prisma automatically
|
||||
- Guaranteed unique, hard to guess
|
||||
- No schema changes needed
|
||||
- Eliminates format confusion immediately
|
||||
|
||||
**Alternative Considered**: Human-friendly sequential IDs (u001, u002)
|
||||
**Why Deferred**: Adds complexity, not blocking current work
|
||||
|
||||
**Implementation**: Use `req.user.id` everywhere, display nicely in UI
|
||||
|
||||
---
|
||||
|
||||
### **Decision 2: Keep Delete Failsafe**
|
||||
**Rationale**:
|
||||
- Prevents accidental data loss
|
||||
- Forces conscious decision (stop then delete)
|
||||
- Aligns with safety-first philosophy
|
||||
- Better to add controls than remove safeguards
|
||||
|
||||
**Alternative Considered**: Auto-stop before delete
|
||||
**Why Rejected**: Too dangerous, removes user control
|
||||
|
||||
**Implementation**: Add host controls so users can stop themselves
|
||||
|
||||
---
|
||||
|
||||
### **Decision 3: Prometheus HTTP Service Discovery**
|
||||
**Rationale**:
|
||||
- Scales automatically with container growth
|
||||
- No manual configuration maintenance
|
||||
- API is authoritative source of truth
|
||||
- Consistent with architectural boundaries
|
||||
|
||||
**Alternative Considered**: Static prometheus config with container IPs
|
||||
**Why Rejected**: Doesn't scale, manual updates, IP changes break it
|
||||
|
||||
**Implementation**: API endpoint `/sd/exporters` returns dynamic targets
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Next Session Priorities
|
||||
|
||||
### **Immediate (Next Session)**
|
||||
1. **Frontend Customer ID Fix**
|
||||
- Remove hardcoded `u-dev-001`
|
||||
- Use `req.user.id` from auth context
|
||||
- Test with new user registration
|
||||
|
||||
2. **Host Controls UI**
|
||||
- Add start/stop/restart buttons
|
||||
- Wire to API endpoints
|
||||
- Update delete confirmation flow
|
||||
|
||||
3. **Portal Metrics Endpoint**
|
||||
- Create API endpoint querying Prometheus
|
||||
- Return summary metrics for server cards
|
||||
- Test with live containers
|
||||
|
||||
### **This Week**
|
||||
4. **Metrics in Portal**
|
||||
- Display CPU/RAM/Disk/Net on server cards
|
||||
- Handle stale/missing data
|
||||
- Add loading states
|
||||
|
||||
5. **Testing**
|
||||
- End-to-end customer ID flow
|
||||
- Host lifecycle controls
|
||||
- Delete with safety check
|
||||
- Metrics display accuracy
|
||||
|
||||
### **Next Week**
|
||||
6. **Security Vulnerability Fixes** (still pending)
|
||||
- Server ownership bypass
|
||||
- Admin privilege escalation
|
||||
- Token URL exposure
|
||||
- API key validation
|
||||
|
||||
---
|
||||
|
||||
## 📁 Files Modified
|
||||
|
||||
**zlh-grind Repository**:
|
||||
- `README.md` - Updated with Feb 7 handover summary
|
||||
- `SCRATCH/2026-02-07_customer-id-schema.md` - Created
|
||||
- `SCRATCH/2026-02-07_prometheus-grafana.md` - Created
|
||||
- `SCRATCH/2026-02-07_host-controls-and-delete-failsafe.md` - Created
|
||||
|
||||
**knowledge-base Repository**:
|
||||
- `Session_Summaries/2026-02-07_Customer-ID-Prometheus-Host-Controls.md` - This file (created)
|
||||
|
||||
---
|
||||
|
||||
## 🔗 Reference Documentation
|
||||
|
||||
**Session Documents**:
|
||||
- zlh-grind: https://git.zerolaghub.com/jester/zlh-grind
|
||||
- Previous session: Session_Summaries/2026-01-18_Architectural_Guardrails.md
|
||||
|
||||
**Platform Documentation**:
|
||||
- Cross Project Tracker: ZeroLagHub_Cross_Project_Tracker.md
|
||||
- Current State: ZeroLagHub_Complete_Current_State_Jan2026.md
|
||||
|
||||
**PromQL Resources**:
|
||||
- Prometheus documentation: https://prometheus.io/docs/
|
||||
- Node Exporter metrics: Standard exporter for system metrics
|
||||
|
||||
---
|
||||
|
||||
## 💡 Key Insights
|
||||
|
||||
### **1. Standardization Eliminates Confusion**
|
||||
- Mixed ID formats caused confusion across sessions
|
||||
- Single format (Prisma CUID) removes ambiguity
|
||||
- Display layer can make IDs human-friendly
|
||||
- Backend uses what's already in database
|
||||
|
||||
### **2. Safety Through Empowerment**
|
||||
- Don't remove safeguards, add user controls
|
||||
- Delete failsafe + host controls = safety + agency
|
||||
- Clear UX: "stop host first" tells user what to do
|
||||
- Prevents accidents without frustration
|
||||
|
||||
### **3. HTTP Service Discovery Scales**
|
||||
- Dynamic target discovery from API
|
||||
- No manual Prometheus config updates
|
||||
- Containers auto-discovered on creation
|
||||
- API as authoritative inventory source
|
||||
|
||||
### **4. Monitoring is Production-Ready**
|
||||
- Full metrics stack operational
|
||||
- Grafana dashboards working
|
||||
- Ready for portal integration
|
||||
- PromQL queries documented for reuse
|
||||
|
||||
---
|
||||
|
||||
**Session Duration**: ~2-3 hours
|
||||
**Session Type**: Platform foundation + UX safety
|
||||
**Next AI**: Claude or GPT (frontend customer ID fix + host controls UI)
|
||||
**Blocking Issues**: None (all decisions made, implementation ready)
|
||||
**Session Outcome**: ✅ Three critical platform decisions made and documented
|
||||
Loading…
Reference in New Issue
Block a user