Add technical stack and architecture document

jester 2025-12-17 00:59:56 +00:00
parent 7b18b119bc
commit d733b600bb

technical-stack.md Normal file

# Technical Stack & Architecture
## Overview
This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.
---
## Phase 1: Consulting (Customer Hardware)
### Deployment Environment
**Location**: Customer premises
**Hardware**: Customer-provided or recommended purchase
- Raspberry Pi 4 (8GB): $75-100
- OR repurposed industrial PC (free if available)
- OR Intel NUC ($300-800)
### Software Stack
**Operating System:**
- Ubuntu Server 24.04 LTS (free)
- Debian 12 (alternative, free)
**Container Platform:**
- Docker (free, easier for customer maintenance)
- OR LXC (free, lower overhead)
**MQTT Broker:**
- Eclipse Mosquitto (free, open source)
- Configuration: Local network only, authenticated users
- Port: 1883 (or 8883 for TLS)
**Time-Series Database:**
- InfluxDB 2.x OSS (free, open source)
- Alternative: TimescaleDB (PostgreSQL extension, free)
- Retention: 30-90 days default
**Visualization:**
- Grafana OSS (free, open source)
- Dashboards: Production line overview, OEE tracking, downtime analysis
- Alerts: Email/SMS via SMTP or webhook
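OEE on these dashboards is conventionally the product of availability, performance, and quality. A minimal sketch of that arithmetic follows; the `ShiftCounts` fields and the sample numbers are illustrative assumptions, not values from a real production line.

```python
from dataclasses import dataclass


@dataclass
class ShiftCounts:
    planned_minutes: float      # scheduled production time
    downtime_minutes: float     # unplanned stops within that time
    ideal_cycle_seconds: float  # ideal time to produce one part
    total_parts: int            # all parts produced, good or bad
    good_parts: int             # parts that passed inspection


def oee(c: ShiftCounts) -> dict:
    """Return availability, performance, quality, and their product (OEE)."""
    run_minutes = c.planned_minutes - c.downtime_minutes
    if c.planned_minutes <= 0 or run_minutes <= 0 or c.total_parts <= 0:
        return {"availability": 0.0, "performance": 0.0, "quality": 0.0, "oee": 0.0}
    availability = run_minutes / c.planned_minutes
    performance = (c.ideal_cycle_seconds * c.total_parts) / (run_minutes * 60)
    quality = c.good_parts / c.total_parts
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Hypothetical 8-hour shift: 60 minutes of downtime, 30 s ideal cycle time.
shift = ShiftCounts(480, 60, 30, total_parts=800, good_parts=760)
result = oee(shift)
```

In practice the three factors would be computed from InfluxDB queries and shown as separate Grafana panels, so operators can see which factor is dragging OEE down.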
**PLC Integration:**
- Node-RED (free, visual programming)
  - node-red-contrib-modbus
  - node-red-contrib-opcua
  - node-red-contrib-s7
- Alternative: Python scripts
  - pymodbus (Modbus TCP/RTU)
  - asyncua (OPC UA)
  - python-snap7 (Siemens S7)
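For the Python-script route, most of the glue code is turning raw register values into engineering units and packaging them for MQTT. The sketch below uses only the standard library and assumes a 32-bit float spread over two 16-bit Modbus registers in big-endian word order (this varies by PLC). The topic layout `site/line/tag` and the function names are hypothetical.

```python
import json
import struct
import time
from typing import Optional, Tuple


def registers_to_float(high: int, low: int) -> float:
    """Combine two 16-bit registers (assumed big-endian word order) into a float32."""
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]


def build_message(site: str, line: str, tag: str, value: float,
                  ts: Optional[float] = None) -> Tuple[str, str]:
    """Return an (MQTT topic, JSON payload) pair for one reading."""
    topic = f"{site}/{line}/{tag}"  # hypothetical topic convention
    payload = json.dumps({
        "value": round(value, 3),
        "ts": ts if ts is not None else time.time(),
    })
    return topic, payload


# 0x41C8 0x0000 is the IEEE 754 encoding of 25.0
temperature = registers_to_float(0x41C8, 0x0000)
topic, payload = build_message("plant1", "line2", "temperature", temperature, ts=0)
```

The register read itself (pymodbus) and the publish call (an MQTT client library) are left out on purpose, since their exact APIs depend on the library version in use.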
### Network Architecture
```
Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile
```
**Security:**
- Isolated VLAN (recommended)
- Firewall rules (only necessary ports)
- HTTPS/TLS for Grafana (Let's Encrypt if the gateway has a public DNS name; otherwise an internal CA or self-signed certificate)
- VPN for remote access (WireGuard or OpenVPN)
### Protocols Supported
**OT Protocols:**
- Modbus TCP/RTU
- OPC UA
- Siemens S7 (via Snap7)
- EtherNet/IP (Allen-Bradley)
- BACnet (building automation)
- Profinet (via OPC UA gateway)
**IT Protocols:**
- MQTT (primary message bus)
- HTTP/REST APIs
- HTTPS for dashboards
- SMTP for email alerts
---
## Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)
### Infrastructure
**GTHost Dedicated Server #1:**
```
Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month
```
**Operating System:**
- Ubuntu Server 24.04 LTS
- Automated updates (unattended-upgrades)
- Fail2ban for security
- UFW firewall configured
### Multi-Tenant Architecture
**Container Platform:**
- LXC (Linux Containers)
  - Lightweight system containers (compared with Docker's per-application containers)
  - Better for long-running services
  - Kernel-level isolation (namespaces and cgroups)
  - Proven by prior ZeroLagHub experience
**Container Template:**
```
Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)
```
**Resource Allocation per Customer:**
- CPU: 1-2 cores (burstable)
- RAM: 2-4GB
- Disk: 50-100GB
- Network: Shared 1Gbps
**Capacity Planning:**
- 8-10 basic customers per server
- 5-8 customers if heavy data volume
- Monitor: CPU, RAM, disk I/O, network
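The customer counts above follow from dividing server resources by the per-customer allocation and taking the tightest constraint. A rough sketch of that calculation is below. The 4GB host reserve and the 2x CPU overcommit factor (for "burstable" cores) are assumptions made for illustration, not figures from this plan.

```python
import math


def max_customers(total_cores: float, total_ram_gb: float, total_disk_gb: float,
                  cores_each: float, ram_gb_each: float, disk_gb_each: float,
                  host_reserve_ram_gb: float = 4.0,   # assumed headroom for the host OS
                  cpu_overcommit: float = 2.0) -> int:  # assumed burst factor
    """Return how many containers fit, limited by the scarcest resource."""
    by_cpu = math.floor(total_cores * cpu_overcommit / cores_each)
    by_ram = math.floor((total_ram_gb - host_reserve_ram_gb) / ram_gb_each)
    by_disk = math.floor(total_disk_gb / disk_gb_each)
    return max(0, min(by_cpu, by_ram, by_disk))


# Server #1: 8 cores, 32GB RAM, 1TB disk
mid_tier = max_customers(8, 32, 1000, cores_each=1.5, ram_gb_each=3, disk_gb_each=75)
heavy = max_customers(8, 32, 1000, cores_each=2, ram_gb_each=4, disk_gb_each=100)
```

With these assumptions RAM is the binding constraint in both cases, which suggests RAM usage is the first metric to watch as customers are added.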
### Networking & Security
**Network Architecture:**
```
Internet
    ↓
Caddy Reverse Proxy (TLS termination)
    ↓
LXC Bridge (internal network)
    ├── Customer 1 Container (192.168.100.10)
    ├── Customer 2 Container (192.168.100.11)
    ├── Customer 3 Container (192.168.100.12)
    └── ...
```
**Subdomain Structure:**
- customer1.yourdomain.com → Grafana dashboard
- customer2.yourdomain.com → Grafana dashboard
- mqtt.yourdomain.com → MQTT broker (port per customer)
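The addressing scheme above is regular enough to derive from a customer number, which helps keep proxy config and provisioning scripts consistent. The following sketch assumes a /24 bridge network and that customer 1 starts at `.10`, as in the diagram. The MQTT base port `18830` is a hypothetical value chosen only for illustration.

```python
import ipaddress

BRIDGE_NET = ipaddress.ip_network("192.168.100.0/24")  # assumed bridge subnet
FIRST_HOST_OFFSET = 10   # customer 1 -> 192.168.100.10
MQTT_BASE_PORT = 18830   # hypothetical: customer n listens on 18830 + n


def tenant_endpoints(n: int, domain: str = "yourdomain.com") -> dict:
    """Derive container IP, dashboard hostname, and MQTT endpoint for customer n."""
    if n < 1:
        raise ValueError("customer numbers start at 1")
    ip = BRIDGE_NET.network_address + FIRST_HOST_OFFSET + (n - 1)
    if ip not in BRIDGE_NET or ip == BRIDGE_NET.broadcast_address:
        raise ValueError("bridge network exhausted")
    return {
        "container_ip": str(ip),
        "dashboard": f"customer{n}.{domain}",
        "mqtt": f"mqtt.{domain}:{MQTT_BASE_PORT + n}",
    }


customer3 = tenant_endpoints(3)
```

Here `customer3["container_ip"]` evaluates to `192.168.100.12`, matching the diagram.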
**Security Features:**
- TLS/SSL via Let's Encrypt (automated)
- Firewall (UFW) - only necessary ports
- Fail2ban - brute force protection
- Container isolation (LXC namespaces)
- VPN access for edge devices (WireGuard)
- Backup encryption (GPG)
### Data Flow
```
Customer Site:
  PLC → Node-RED → MQTT (local edge device)
      ↓
  (Over VPN or direct connection)
      ↓
GTHost Server:
  MQTT Broker → InfluxDB → Grafana
      ↓
  Alert Engine → Email/SMS
```
### Backup & Disaster Recovery
**Backup Strategy:**
- Automated daily backups (3am UTC)
- Retention: 7 daily, 4 weekly, 12 monthly
- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
- Encrypted with GPG
- Automated restore testing (monthly)
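The 7 daily / 4 weekly / 12 monthly retention rule can be expressed as a small pruning function. This is one possible interpretation: keep the 7 newest backups, plus the newest backup in each of the 4 most recent ISO weeks and each of the 12 most recent calendar months. The function name and the one-backup-per-day assumption are illustrative.

```python
from datetime import date, timedelta


def backups_to_keep(dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain; everything else can be pruned."""
    ordered = sorted(set(dates), reverse=True)  # newest first

    def newest_per_bucket(key, limit):
        chosen = {}
        for d in ordered:
            bucket = key(d)
            if bucket in chosen:
                continue   # a newer backup from this bucket is already kept
            if len(chosen) == limit:
                break      # older buckets fall outside the retention window
            chosen[bucket] = d
        return set(chosen.values())

    keep = set(ordered[:daily])
    keep |= newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly)
    keep |= newest_per_bucket(lambda d: (d.year, d.month), monthly)
    return keep


# Example: 400 consecutive daily backups ending on 2025-12-17.
history = [date(2025, 12, 17) - timedelta(days=i) for i in range(400)]
keep = backups_to_keep(history)
prune = sorted(set(history) - keep)
```

Because the three tiers overlap (the newest backup counts as daily, weekly, and monthly), the number of retained backups is at most 23 and usually fewer.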
**Disaster Recovery:**
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
- Documented restoration procedure
- Annual DR test
### Monitoring & Alerting
**Server Monitoring:**
- Prometheus + Grafana (internal)
- Alerts: CPU >80%, RAM >80%, Disk >85%
- UptimeRobot (external monitoring)
- PagerDuty or similar (if needed)
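The alert thresholds above amount to a simple comparison per metric. In production these would normally live in Prometheus alerting rules; the plain-Python sketch below only illustrates the logic, and the `breached` helper and metric names are hypothetical.

```python
# Percent thresholds taken from the list above.
THRESHOLDS = {"cpu": 80.0, "ram": 80.0, "disk": 85.0}


def breached(metrics: dict) -> list:
    """Return a human-readable message for each metric above its threshold."""
    return [
        f"{name} at {metrics[name]:.0f}% (limit {limit:.0f}%)"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]


alerts = breached({"cpu": 91.0, "ram": 40.0, "disk": 85.0})
```

Note that a disk reading of exactly 85% does not fire, because the rule is "greater than", not "greater than or equal to".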
**Customer Monitoring:**
- Per-container resource usage
- MQTT connection status
- InfluxDB query performance
- Grafana dashboard access logs
---
## Phase 3: GPU-Powered AI Platform
### Infrastructure
**GTHost Dedicated Server #2 (AI/Premium Tier):**
```
Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month
```
**Why Tesla P4:**
- Optimized for AI inference (not training)
- 8GB VRAM sufficient for production models
- Low power consumption (75W)
- Good performance/cost ratio
### AI/ML Stack
**ML Frameworks:**
- TensorFlow Lite (optimized inference, mainly for CPU and edge devices)
- PyTorch (model development, optional)
- ONNX Runtime (cross-framework inference)
- Scikit-learn (traditional ML)
**GPU Acceleration:**
- CUDA 12.x
- cuDNN (deep learning primitives)
- TensorRT (inference optimization)
**Model Serving:**
- FastAPI (REST API for predictions)
- Triton Inference Server (optional, for heavy workloads)
- Redis (result caching)
### AI Features Architecture
**1. Predictive Maintenance:**
```
Sensor Data → Feature Engineering → Model Inference → Alert
(MQTT) (Python script) (TensorFlow) (Email/SMS)
```
**Models:**
- Anomaly detection (vibration, temperature patterns)
- Failure prediction (time-to-failure models)
- Remaining Useful Life (RUL) estimation
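Before any trained model is available, anomaly detection on vibration or temperature can start from a statistical baseline. The sketch below is a rolling z-score detector, swapped in here as a simple stand-in for the TensorFlow models named above, not a description of them. The window size, warm-up length, and threshold of 3 standard deviations are assumed values.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScore:
    """Flag readings that deviate strongly from a rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0, warmup: int = 10):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def update(self, x: float) -> bool:
        """Return True if x is anomalous relative to the current window."""
        is_anomaly = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0 and abs(x - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            self.values.append(x)  # keep outliers out of the baseline
        return is_anomaly


detector = RollingZScore()
normal_flags = [detector.update(20.0 + 0.2 * (i % 2)) for i in range(20)]
spike_flag = detector.update(25.0)
recovered_flag = detector.update(20.1)
```

A detector like this also gives a false-positive baseline to compare against when a trained model is later validated.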
**2. Computer Vision Quality Inspection:**
```
Camera → Image Capture → Preprocessing → Model Inference → Classification
(HTTP) (Python) (OpenCV) (TensorFlow) (Pass/Fail)
```
**Models:**
- Object detection (YOLOv8, Faster R-CNN)
- Defect classification (CNN)
- OCR (text recognition on parts)
### Container Architecture (Phase 3)
**Premium Customer Container:**
```
├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)
```
**Resource Allocation (Premium):**
- CPU: 4-8 cores
- RAM: 16-32GB
- GPU: Shared (time-sliced; MIG partitioning requires an Ampere-or-newer data-center GPU and is not available on the Tesla P4 or T1000)
- Disk: 200-500GB
### Model Development Workflow
**Development (Offline):**
1. Collect customer data (4-8 weeks)
2. Feature engineering and labeling
3. Model training (local GPU or cloud)
4. Model validation (accuracy, false positives)
5. Export to ONNX or TensorFlow Lite
**Deployment:**
1. Upload model to server
2. A/B test against baseline
3. Monitor inference latency and accuracy
4. Gradual rollout to production
5. Continuous monitoring
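The A/B test and gradual rollout steps both need a stable way to decide which requests go to the new model. One common approach, sketched here as an assumption rather than the planned implementation, is to hash a request key (for example a machine or sensor ID) into one of 100 buckets and compare it against the rollout percentage.

```python
import hashlib


def use_candidate(request_key: str, rollout_percent: int) -> bool:
    """Route a stable share of keys to the candidate model.

    The same key always lands in the same bucket, so raising
    rollout_percent only ever adds keys to the candidate group.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


keys = [f"machine-{i}" for i in range(10000)]  # hypothetical machine IDs
at_10 = {k for k in keys if use_candidate(k, 10)}
at_50 = {k for k in keys if use_candidate(k, 50)}
```

Because assignment is deterministic, predictions for a given machine do not flip between models from one request to the next, which keeps accuracy comparisons clean.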
### Data Pipeline (AI Features)
```
Customer PLCs/Cameras
    ↓
Edge Device (optional preprocessing)
    ↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
    ↓
ML Inference Service (GPU-accelerated)
    ↓
Prediction Results → InfluxDB
    ↓
Grafana Dashboard + Alerts
```
---
## Development & Deployment Tools
### Local Development
**Workstation Setup:**
- Ubuntu 22.04 or macOS
- Docker Desktop (for testing containers)
- VS Code with extensions:
  - Python
  - Docker
  - YAML
  - Grafana dashboards
**Testing Environment:**
- Local LXC or Docker setup
- Simulated PLC data (Node-RED)
- Small InfluxDB + Grafana instance
### CI/CD Pipeline
**Source Control:**
- Git (self-hosted Gitea or GitHub)
- Branches: main, development, customer-specific
**Automation (Future):**
- GitHub Actions or Gitea Actions
- Automated testing on push
- Deployment scripts (Ansible)
**Deployment Process (Manual Initially):**
1. Test in local environment
2. Deploy to staging container
3. Validate with test data
4. Deploy to production
5. Monitor for issues
---
## Technology Decisions & Rationale
### Why LXC over Docker?
**Advantages:**
- Low overhead (system containers share the host kernel, with no hypervisor layer)
- Better for long-running services (MQTT, databases)
- Simpler networking (a single host bridge, no overlay network)
- Proven by prior ZeroLagHub experience
- Less complexity than Kubernetes
**Disadvantages:**
- Less popular than Docker (smaller community)
- Fewer pre-built images
- Manual setup required
**Decision**: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).
### Why InfluxDB over Prometheus?
**Advantages:**
- Push-based writes suit MQTT-fed sensor data (Prometheus scrapes its targets on a pull model)
- Expressive query languages (Flux/InfluxQL)
- Native downsampling and retention policies
- Well-supported Grafana data source for industrial data
- Can handle high-frequency data (1-10 second resolution)
**Disadvantages:**
- More complex than Prometheus
- Heavier resource usage
**Decision**: InfluxDB for customer data, Prometheus for internal monitoring.
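To make the downsampling advantage concrete: it means replacing many high-frequency points with one aggregate per time bucket, so older data takes less space. InfluxDB does this inside the database; the plain-Python sketch below only illustrates the idea, using mean aggregation and a 60-second bucket as assumed choices.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds: int = 60):
    """Aggregate (unix_ts, value) points into per-bucket means.

    Returns a list of (bucket_start_ts, mean_value), ordered by time.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


# Two minutes of 1-second readings collapse into two 1-minute points.
raw = [(t, float(t)) for t in range(120)]
per_minute = downsample(raw)
```

Paired with the 30-90 day raw retention mentioned earlier, downsampled series can be kept much longer at a fraction of the storage.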
### Why Grafana over Custom Dashboard?
**Advantages:**
- Industry standard
- Excellent out-of-box visualizations
- Plugin ecosystem
- Customer familiarity (many have seen it)
- Lower development time
**Disadvantages:**
- Less customizable than a custom solution
- Licensing considerations (Grafana OSS is AGPLv3)
**Decision**: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.
### Why MQTT over HTTP?
**Advantages:**
- Purpose-built for IoT (lightweight)
- Pub/sub model (flexible)
- Quality of Service levels (QoS 0, 1, 2)
- Better for unreliable networks
- Lower bandwidth overhead
**Disadvantages:**
- One more service to manage
- Not as universally understood as HTTP
**Decision**: MQTT for OT data collection, HTTP/REST for management APIs.
---
## Scaling Plan
### Server Capacity Thresholds
**Add Server #2 When:**
- Server #1 >70% CPU average
- OR >80% RAM average
- OR >10 customers on Server #1
**Add Server #3 When:**
- Combined utilization of existing servers >70%
- OR >20 total customers
- OR geographic distribution (West Coast + East Coast servers)
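The Server #1 triggers above can be checked mechanically against averaged monitoring data. A small sketch, assuming utilization samples are percentages collected over some review window (the window length and the function name are not specified in this plan and are hypothetical here):

```python
from statistics import mean


def scaling_reasons(cpu_samples, ram_samples, customers: int,
                    cpu_limit: float = 70.0, ram_limit: float = 80.0,
                    max_customers: int = 10) -> list:
    """Return which of the three triggers currently apply (empty list = no action)."""
    reasons = []
    if cpu_samples and mean(cpu_samples) > cpu_limit:
        reasons.append("cpu")
    if ram_samples and mean(ram_samples) > ram_limit:
        reasons.append("ram")
    if customers > max_customers:
        reasons.append("customers")
    return reasons


reasons = scaling_reasons([60, 65, 90], [50, 60], customers=8)
```

Using averages rather than instantaneous readings keeps a short CPU spike from triggering a purchase decision.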
### Database Scaling
**InfluxDB Scaling:**
- Start: Single node per customer container
- Scale: Consider InfluxDB clustering (Enterprise) if needed
- Alternative: TimescaleDB for SQL-familiar customers
**Backup Scaling:**
- Start: Daily backups to local disk
- Scale: Offsite backup to object storage (S3-compatible)
- Future: Real-time replication to hot standby
---
## Security Best Practices
### Server Hardening
- [ ] Disable root SSH login; allow key-based authentication only
- [ ] Fail2ban configured
- [ ] UFW firewall (only necessary ports)
- [ ] Automated security updates
- [ ] Regular security audits (quarterly)
### Application Security
- [ ] TLS/SSL everywhere (Let's Encrypt)
- [ ] Strong passwords (generated, stored in 1Password)
- [ ] API keys rotated (quarterly)
- [ ] Container isolation verified
- [ ] Database encryption at rest
### Compliance Considerations
- GDPR (if EU customers): Data residency, right to deletion
- HIPAA (if customer data includes protected health information, e.g. from medical devices): BAA required, encryption
- ISO 27001 (future): Information security management
---
## Tools & Subscriptions
### Required (Paid)
| Tool | Purpose | Cost/Month |
|------|---------|------------|
| GTHost Server #1 | Infrastructure | $100-150 |
| Domain + DNS | yourdomain.com | $1-2 |
| Email (Google Workspace or similar) | Professional email | $6-12 |
**Total**: $107-164/month
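The total is the sum of the low and high ends of each row in the table. A short check of that arithmetic, using the figures exactly as listed:

```python
# (low, high) monthly cost in USD for each required item
REQUIRED_COSTS = {
    "GTHost Server #1": (100, 150),
    "Domain + DNS": (1, 2),
    "Email": (6, 12),
}

total_low = sum(low for low, _ in REQUIRED_COSTS.values())
total_high = sum(high for _, high in REQUIRED_COSTS.values())
```

This gives `total_low = 107` and `total_high = 164`, matching the stated range.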
### Optional (Free/Paid)
| Tool | Purpose | Cost/Month |
|------|---------|------------|
| 1Password | Password management | $0 (personal) |
| Wasabi | Offsite backups | $6/TB |
| UptimeRobot | External monitoring | $0 (free tier) |
| Stripe | Payment processing | 2.9% + $0.30 |
| Twilio | SMS alerts | Pay-as-you-go |
---
## Documentation Strategy
### Internal Documentation
- Runbooks (how to deploy, backup, restore)
- Architecture diagrams (network, data flow)
- Troubleshooting guides
- Security incident response plan
### Customer Documentation
- User guide (how to access dashboards)
- FAQ (common questions)
- Alert configuration guide
- Troubleshooting (basic)
**Format**: Markdown in Git repository (easy to version, search)
---
*Last Updated: December 2025*