# Technical Stack & Architecture

## Overview

This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.

---

## Phase 1: Consulting (Customer Hardware)

### Deployment Environment

**Location**: Customer premises

**Hardware**: Customer-provided or recommended purchase
- Raspberry Pi 4 (8GB): $75-100
- OR repurposed industrial PC (free if available)
- OR Intel NUC ($300-800)

### Software Stack

**Operating System:**
- Ubuntu Server 24.04 LTS (free)
- Debian 12 (alternative, free)

**Container Platform:**
- Docker (free, easier for customer maintenance)
- OR LXC (free, lower overhead)

**MQTT Broker:**
- Eclipse Mosquitto (free, open source)
- Configuration: Local network only, authenticated users
- Port: 1883 (or 8883 for TLS)

**Time-Series Database:**
- InfluxDB 2.x OSS (free, open source)
- Alternative: TimescaleDB (PostgreSQL extension, free)
- Retention: 30-90 days default

**Visualization:**
- Grafana OSS (free, open source)
- Dashboards: Production line overview, OEE tracking, downtime analysis
- Alerts: Email/SMS via SMTP or webhook

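OEE (Overall Equipment Effectiveness) is conventionally the product of availability, performance, and quality. The sketch below shows how a dashboard value could be derived from per-shift counters; the `ShiftCounts` structure and its field names are assumptions for illustration, not something defined by this stack.

```python
from dataclasses import dataclass


@dataclass
class ShiftCounts:
    planned_minutes: float      # scheduled production time
    downtime_minutes: float     # unplanned and planned stops within that time
    ideal_cycle_seconds: float  # fastest possible time per part
    total_parts: int
    good_parts: int


def oee(shift: ShiftCounts) -> dict:
    """Return availability, performance, quality and their product (OEE)."""
    run_minutes = shift.planned_minutes - shift.downtime_minutes
    if shift.planned_minutes <= 0 or run_minutes <= 0 or shift.total_parts <= 0:
        # Nothing was produced, so every factor is reported as zero
        return {"availability": 0.0, "performance": 0.0, "quality": 0.0, "oee": 0.0}
    availability = run_minutes / shift.planned_minutes
    performance = (shift.ideal_cycle_seconds * shift.total_parts) / (run_minutes * 60)
    quality = shift.good_parts / shift.total_parts
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Example: 8-hour shift, 60 minutes of downtime, 30 s ideal cycle, 700 parts, 665 good
result = oee(ShiftCounts(480, 60, 30, 700, 665))
```

In this example availability is 87.5%, quality is 95%, and OEE works out to roughly 69%.
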
**PLC Integration:**
- Node-RED (free, visual programming)
  - node-red-contrib-modbus
  - node-red-contrib-opcua
  - node-red-contrib-s7
- Alternative: Python scripts
  - pymodbus (Modbus TCP/RTU)
  - asyncua (OPC UA client library)
  - python-snap7 (Siemens S7)

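Whichever Modbus client is used, the raw 16-bit registers still have to be decoded into engineering values. The sketch below covers two common encodings (a 32-bit float spread over two registers, and a scaled signed integer) with only the standard library. Word order and scale factors vary by device, so the defaults here are assumptions to check against the PLC documentation.

```python
import struct


def registers_to_float(registers, word_order="big"):
    """Decode two 16-bit registers into an IEEE 754 float32.

    word_order="big" means the first register holds the high word;
    "little" means the words are swapped, as some devices do.
    """
    if len(registers) != 2:
        raise ValueError("a float32 needs exactly two registers")
    high, low = registers
    if word_order == "little":
        high, low = low, high
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]


def scaled_register_to_value(register, scale=0.1, signed=True):
    """Decode one register that stores a scaled integer (e.g. tenths of a degree)."""
    if not 0 <= register <= 0xFFFF:
        raise ValueError("register must fit in 16 bits")
    if signed and register >= 0x8000:
        register -= 0x10000  # two's complement
    return register * scale


temperature = registers_to_float([0x428F, 0x0000])          # 71.5
swapped = registers_to_float([0x0000, 0x428F], "little")    # same value, swapped words
offset = scaled_register_to_value(0xFFF6)                    # -10 * 0.1
```
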
### Network Architecture

```
Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile
```

**Security:**
- Isolated VLAN (recommended)
- Firewall rules (only necessary ports)
- HTTPS/TLS for Grafana (Let's Encrypt; on a LAN-only gateway this requires the DNS-01 challenge, otherwise use an internal CA)
- VPN for remote access (WireGuard or OpenVPN)

### Protocols Supported

**OT Protocols:**
- Modbus TCP/RTU
- OPC UA
- Siemens S7 (via Snap7)
- EtherNet/IP (Allen-Bradley)
- BACnet (building automation)
- Profinet (via OPC UA gateway)

**IT Protocols:**
- MQTT (primary message bus)
- HTTP/REST APIs
- HTTPS for dashboards
- SMTP for email alerts

---

## Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)

### Infrastructure

**GTHost Dedicated Server #1:**
```
Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month
```

**Operating System:**
- Ubuntu Server 24.04 LTS
- Automated updates (unattended-upgrades)
- Fail2ban for security
- UFW firewall configured

### Multi-Tenant Architecture

**Container Platform:**
- LXC (Linux Containers)
  - Lightweight vs Docker
  - Better for long-running services
  - Kernel-level isolation
  - Proven from ZeroLagHub experience

**Container Template:**
```
Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)
```

**Resource Allocation per Customer:**
- CPU: 1-2 cores (burstable)
- RAM: 2-4GB
- Disk: 50-100GB
- Network: Shared 1Gbps

**Capacity Planning:**
- 8-10 basic customers per server
- 5-8 customers if heavy data volume
- Monitor: CPU, RAM, disk I/O, network

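The per-server customer counts above can be sanity-checked by dividing usable server resources by the per-customer allocation and taking the tightest limit. The sketch below does that arithmetic; the host reserve (1 core, 4GB RAM, 100GB disk) and the 1.5x CPU oversubscription factor for burstable cores are assumptions chosen for illustration, not figures from this plan.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    ram_gb: float
    disk_gb: float


def max_customers(server, per_customer, host_reserve, cpu_oversubscription=1.0):
    """Return (customer_count, bottleneck) for one server.

    cpu_oversubscription > 1 models burstable cores that are not all busy at once.
    """
    limits = {
        "cpu": math.floor(
            (server.cpu_cores - host_reserve.cpu_cores)
            * cpu_oversubscription / per_customer.cpu_cores
        ),
        "ram": math.floor((server.ram_gb - host_reserve.ram_gb) / per_customer.ram_gb),
        "disk": math.floor((server.disk_gb - host_reserve.disk_gb) / per_customer.disk_gb),
    }
    bottleneck = min(limits, key=limits.get)
    return max(limits[bottleneck], 0), bottleneck


server_1 = Resources(cpu_cores=8, ram_gb=32, disk_gb=1000)
reserve = Resources(cpu_cores=1, ram_gb=4, disk_gb=100)     # assumed host overhead

basic = max_customers(server_1, Resources(1, 2, 50), reserve, cpu_oversubscription=1.5)
heavy = max_customers(server_1, Resources(2, 4, 100), reserve, cpu_oversubscription=1.5)
```

Under these assumptions the result is 10 basic or 5 heavy customers, with CPU as the bottleneck in both cases, which is consistent with the ranges listed above.
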
### Networking & Security

**Network Architecture:**
```
Internet
    ↓
Caddy Reverse Proxy (TLS termination)
    ↓
LXC Bridge (internal network)
├── Customer 1 Container (192.168.100.10)
├── Customer 2 Container (192.168.100.11)
├── Customer 3 Container (192.168.100.12)
└── ...
```

**Subdomain Structure:**
- customer1.yourdomain.com → Grafana dashboard
- customer2.yourdomain.com → Grafana dashboard
- mqtt.yourdomain.com → MQTT broker (port per customer)

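The addressing pattern above (sequential bridge IPs starting at .10, one dashboard subdomain per customer, one MQTT port per customer) can be generated rather than tracked by hand. This is a minimal sketch: the /24 prefix and the port numbering scheme (`MQTT_BASE_PORT` plus the customer index) are assumptions, since the plan only states that each customer gets its own port.

```python
import ipaddress

BRIDGE_NETWORK = ipaddress.ip_network("192.168.100.0/24")  # prefix length assumed
FIRST_HOST_OFFSET = 10          # customer 1 -> 192.168.100.10
BASE_DOMAIN = "yourdomain.com"
MQTT_BASE_PORT = 8883           # hypothetical: customer N listens on 8883 + (N - 1)


def tenant_endpoints(customer_number: int) -> dict:
    """Derive the container IP and public endpoints for a 1-based customer number."""
    if customer_number < 1:
        raise ValueError("customer numbers start at 1")
    address = BRIDGE_NETWORK.network_address + (FIRST_HOST_OFFSET + customer_number - 1)
    if address not in BRIDGE_NETWORK or address == BRIDGE_NETWORK.broadcast_address:
        raise ValueError("bridge network has no free address for this customer")
    return {
        "container_ip": str(address),
        "dashboard_host": f"customer{customer_number}.{BASE_DOMAIN}",
        "mqtt_endpoint": f"mqtt.{BASE_DOMAIN}:{MQTT_BASE_PORT + customer_number - 1}",
    }


first = tenant_endpoints(1)
third = tenant_endpoints(3)
```
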
**Security Features:**
- TLS/SSL via Let's Encrypt (automated)
- Firewall (UFW) - only necessary ports
- Fail2ban - brute force protection
- Container isolation (LXC namespaces)
- VPN access for edge devices (WireGuard)
- Backup encryption (GPG)

### Data Flow

```
Customer Site:
  PLC → Node-RED → MQTT (local edge device)
      ↓
  (Over VPN or direct connection)
      ↓
GTHost Server:
  MQTT Broker → InfluxDB → Grafana
      ↓
  Alert Engine → Email/SMS
```

### Backup & Disaster Recovery

**Backup Strategy:**
- Automated daily backups (3am UTC)
- Retention: 7 daily, 4 weekly, 12 monthly
- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
- Encrypted with GPG
- Automated restore testing (monthly)

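The "7 daily, 4 weekly, 12 monthly" retention rule can be read as: keep the newest 7 backups, plus the newest backup from each of the 4 most recent ISO weeks, plus the newest backup from each of the 12 most recent calendar months. The sketch below implements that interpretation; other backup tools define weekly and monthly buckets slightly differently, so treat it as one reasonable reading rather than the definitive policy.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain; everything else can be pruned."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])

    def newest_per_bucket(bucket_of, limit):
        seen = {}
        for day in newest_first:
            # The first date seen in a bucket is the newest one in that bucket
            seen.setdefault(bucket_of(day), day)
        # dicts preserve insertion order, so buckets are already newest-first
        return list(seen.values())[:limit]

    keep.update(newest_per_bucket(lambda d: tuple(d.isocalendar()[:2]), weekly))
    keep.update(newest_per_bucket(lambda d: (d.year, d.month), monthly))
    return keep


# Example: 400 consecutive daily backups ending on 2025-12-31
history = [date(2025, 12, 31) - timedelta(days=n) for n in range(400)]
retained = backups_to_keep(history)
to_prune = sorted(set(history) - retained)
```
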
**Disaster Recovery:**
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
- Documented restoration procedure
- Annual DR test

### Monitoring & Alerting

**Server Monitoring:**
- Prometheus + Grafana (internal)
- Alerts: CPU >80%, RAM >80%, Disk >85%
- UptimeRobot (external monitoring)
- PagerDuty or similar (if needed)

**Customer Monitoring:**
- Per-container resource usage
- MQTT connection status
- InfluxDB query performance
- Grafana dashboard access logs

---

## Phase 3: GPU-Powered AI Platform

### Infrastructure

**GTHost Dedicated Server #2 (AI/Premium Tier):**
```
Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month
```

**Why Tesla P4:**
- Optimized for AI inference (not training)
- 8GB VRAM sufficient for production models
- Low power consumption (75W)
- Good performance/cost ratio

### AI/ML Stack

**ML Frameworks:**
- TensorFlow Lite (optimized inference)
- PyTorch (model development, optional)
- ONNX Runtime (cross-framework inference)
- Scikit-learn (traditional ML)

**GPU Acceleration:**
- CUDA 12.x
- cuDNN (deep learning primitives)
- TensorRT (inference optimization)

**Model Serving:**
- FastAPI (REST API for predictions)
- Triton Inference Server (optional, for heavy workloads)
- Redis (result caching)

### AI Features Architecture

**1. Predictive Maintenance:**
```
Sensor Data → Feature Engineering → Model Inference → Alert
(MQTT)        (Python script)       (TensorFlow)      (Email/SMS)
```

**Models:**
- Anomaly detection (vibration, temperature patterns)
- Failure prediction (time-to-failure models)
- Remaining Useful Life (RUL) estimation

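Before a trained model exists, the anomaly-detection stage of this pipeline can be prototyped with a simple statistical baseline. The sketch below uses a rolling z-score in place of the TensorFlow model: it is a stand-in for experimentation, not the planned production model, and the window size, threshold, and class name are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Flag a reading when it lies more than `threshold` standard deviations
    from the mean of the most recent `window` readings."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def update(self, value: float) -> bool:
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        # Every reading joins the window, so the baseline adapts to slow drift
        self.values.append(value)
        return is_anomaly


# Example: stable temperature around 20 °C followed by a spike
detector = RollingZScoreDetector(window=30)
normal_flags = [detector.update(20.0 + 0.1 * ((i % 3) - 1)) for i in range(30)]
spike_flag = detector.update(35.0)
```

Because flagged readings also enter the window, a sustained level shift stops alerting after a while; excluding flagged values from the baseline is the alternative trade-off.
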
**2. Computer Vision Quality Inspection:**
```
Camera → Image Capture → Preprocessing → Model Inference → Classification
(HTTP)   (Python)        (OpenCV)        (TensorFlow)      (Pass/Fail)
```

**Models:**
- Object detection (YOLOv8, Faster R-CNN)
- Defect classification (CNN)
- OCR (text recognition on parts)

### Container Architecture (Phase 3)

**Premium Customer Container:**
```
├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)
```

**Resource Allocation (Premium):**
- CPU: 4-8 cores
- RAM: 16-32GB
- GPU: Shared (time-sliced; MIG partitioning is not available on the Tesla P4)
- Disk: 200-500GB

### Model Development Workflow

**Development (Offline):**
1. Collect customer data (4-8 weeks)
2. Feature engineering and labeling
3. Model training (local GPU or cloud)
4. Model validation (accuracy, false positives)
5. Export to ONNX or TensorFlow Lite

**Deployment:**
1. Upload model to server
2. A/B test against baseline
3. Monitor inference latency and accuracy
4. Gradual rollout to production
5. Continuous monitoring

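The A/B test and gradual rollout steps both need a stable way to decide which model serves a given request. One common approach, sketched below, is to hash a key (for example a machine or sensor ID) into one of 100 buckets and send the lowest buckets to the candidate model. The function name and the choice of key are assumptions for illustration.

```python
import hashlib


def assign_model(request_key: str, candidate_percent: int) -> str:
    """Route a key to "candidate" or "baseline" deterministically.

    The same key always lands in the same bucket, so raising
    candidate_percent only ever moves keys from baseline to candidate.
    """
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "candidate" if bucket < candidate_percent else "baseline"


# Example: widen the rollout from 10% to 50% for hypothetical machine IDs
machines = [f"press-{n:03d}" for n in range(200)]
at_10 = {m for m in machines if assign_model(m, 10) == "candidate"}
at_50 = {m for m in machines if assign_model(m, 50) == "candidate"}
```

Hashing instead of random sampling keeps each machine on one model for the whole test, which makes before/after comparisons of latency and accuracy easier to interpret.
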
### Data Pipeline (AI Features)

```
Customer PLCs/Cameras
    ↓
Edge Device (optional preprocessing)
    ↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
    ↓
ML Inference Service (GPU-accelerated)
    ↓
Prediction Results → InfluxDB
    ↓
Grafana Dashboard + Alerts
```

---

## Development & Deployment Tools

### Local Development

**Workstation Setup:**
- Ubuntu 22.04 or macOS
- Docker Desktop (for testing containers)
- VS Code with extensions:
  - Python
  - Docker
  - YAML
  - Grafana dashboards

**Testing Environment:**
- Local LXC or Docker setup
- Simulated PLC data (Node-RED)
- Small InfluxDB + Grafana instance

### CI/CD Pipeline

**Source Control:**
- Git (self-hosted Gitea or GitHub)
- Branches: main, development, customer-specific

**Automation (Future):**
- GitHub Actions or Gitea Actions
- Automated testing on push
- Deployment scripts (Ansible)

**Deployment Process (Manual Initially):**
1. Test in local environment
2. Deploy to staging container
3. Validate with test data
4. Deploy to production
5. Monitor for issues

---

## Technology Decisions & Rationale

### Why LXC over Docker?

**Advantages:**
- Lower overhead (runs closer to bare metal)
- Better for long-running services (MQTT, databases)
- Simpler networking (bridge vs overlay)
- Proven from ZeroLagHub experience
- Less complexity than Kubernetes

**Disadvantages:**
- Less popular than Docker (smaller community)
- Fewer pre-built images
- Manual setup required

**Decision**: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).

### Why InfluxDB over Prometheus?

**Advantages:**
- Purpose-built for time-series data
- Better query language (Flux/InfluxQL)
- Native downsampling and retention policies
- Better Grafana integration for industrial data
- Can handle high-frequency data (1-10 second resolution)

**Disadvantages:**
- More complex than Prometheus
- Heavier resource usage

**Decision**: InfluxDB for customer data, Prometheus for internal monitoring.

### Why Grafana over Custom Dashboard?

**Advantages:**
- Industry standard
- Excellent out-of-the-box visualizations
- Plugin ecosystem
- Customer familiarity (many have seen it)
- Lower development time

**Disadvantages:**
- Not as customizable as a custom solution
- Licensing considerations (AGPL for self-hosted)

**Decision**: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.

### Why MQTT over HTTP?

**Advantages:**
- Purpose-built for IoT (lightweight)
- Pub/sub model (flexible)
- Quality of Service levels (QoS 0, 1, 2)
- Better for unreliable networks
- Lower bandwidth overhead

**Disadvantages:**
- One more service to manage
- Not as universally understood as HTTP

**Decision**: MQTT for OT data collection, HTTP/REST for management APIs.

---

## Scaling Plan

### Server Capacity Thresholds

**Add Server #2 When:**
- Server #1 >70% CPU average
- OR >80% RAM average
- OR >10 customers on Server #1

**Add Server #3 When:**
- Combined >70% capacity
- OR >20 total customers
- OR geographic distribution (West Coast + East Coast servers)

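The Server #2 triggers above are simple enough to evaluate automatically from monitoring data. A minimal sketch follows; it assumes utilization arrives as lists of percentage samples over whatever averaging period is chosen (the plan does not specify one), and the function name is illustrative.

```python
from statistics import mean


def reasons_to_add_server(cpu_percent_samples, ram_percent_samples, customer_count,
                          cpu_limit=70.0, ram_limit=80.0, customer_limit=10):
    """Return the list of thresholds that Server #1 currently exceeds."""
    reasons = []
    if cpu_percent_samples and mean(cpu_percent_samples) > cpu_limit:
        reasons.append("cpu")
    if ram_percent_samples and mean(ram_percent_samples) > ram_limit:
        reasons.append("ram")
    if customer_count > customer_limit:
        reasons.append("customers")
    return reasons  # an empty list means no new server is needed yet


healthy = reasons_to_add_server([60, 65], [70, 75], customer_count=8)
overloaded = reasons_to_add_server([75, 80], [50, 60], customer_count=11)
```
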
### Database Scaling

**InfluxDB Scaling:**
- Start: Single node per customer container
- Scale: Consider InfluxDB clustering (Enterprise) if needed
- Alternative: TimescaleDB for SQL-familiar customers

**Backup Scaling:**
- Start: Daily backups to local disk
- Scale: Offsite backup to object storage (S3-compatible)
- Future: Real-time replication to hot standby

---

## Security Best Practices

### Server Hardening
- [ ] Disable root SSH login; allow key-based authentication only
- [ ] Fail2ban configured
- [ ] UFW firewall (only necessary ports)
- [ ] Automated security updates
- [ ] Regular security audits (quarterly)

### Application Security
- [ ] TLS/SSL everywhere (Let's Encrypt)
- [ ] Strong passwords (generated, stored in 1Password)
- [ ] API keys rotated (quarterly)
- [ ] Container isolation verified
- [ ] Database encryption at rest

### Compliance Considerations
- GDPR (if EU customers): Data residency, right to deletion
- HIPAA (if medical devices): BAA required, encryption
- ISO 27001 (future): Information security management

---

## Tools & Subscriptions

### Required (Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| GTHost Server #1 | Infrastructure | $100-150 |
| Domain + DNS | yourdomain.com | $1-2 |
| Email (Google Workspace or similar) | Professional email | $6-12 |

**Total**: $107-164/month

### Optional (Free/Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| 1Password | Password management | $0 (personal) |
| Wasabi | Offsite backups | $6/TB |
| UptimeRobot | External monitoring | $0 (free tier) |
| Stripe | Payment processing | 2.9% + $0.30 per transaction |
| Twilio | SMS alerts | Pay-as-you-go |

---

## Documentation Strategy

### Internal Documentation
- Runbooks (how to deploy, backup, restore)
- Architecture diagrams (network, data flow)
- Troubleshooting guides
- Security incident response plan

### Customer Documentation
- User guide (how to access dashboards)
- FAQ (common questions)
- Alert configuration guide
- Troubleshooting (basic)

**Format**: Markdown in Git repository (easy to version, search)

---

*Last Updated: December 2025*