Add technical stack and architecture document

2025-12-17 00:59:56 +00:00 · 2025-12-17 00:59:56 +00:00 · d733b600bb
commit d733b600bb
parent 7b18b119bc
1 changed files with 530 additions and 0 deletions
--- a/technical-stack.md
+++ b/technical-stack.md
@@ -0,0 +1,530 @@
+# Technical Stack & Architecture
+
+## Overview
+
+This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.
+
+---
+
+## Phase 1: Consulting (Customer Hardware)
+
+### Deployment Environment
+**Location**: Customer premises  
+**Hardware**: Customer-provided or recommended purchase
+- Raspberry Pi 4 (8GB): $75-100
+- OR repurposed industrial PC (free if available)
+- OR Intel NUC ($300-800)
+
+### Software Stack
+
+**Operating System:**
+- Ubuntu Server 24.04 LTS (free)
+- Debian 12 (alternative, free)
+
+**Container Platform:**
+- Docker (free, easier for customer maintenance)
+- OR LXC (free, lower overhead)
+
+**MQTT Broker:**
+- Eclipse Mosquitto (free, open source)
+- Configuration: Local network only, authenticated users
+- Port: 1883 (or 8883 for TLS)
+
+**Time-Series Database:**
+- InfluxDB 2.x OSS (free, open source)
+- Alternative: TimescaleDB (PostgreSQL extension, free)
+- Retention: 30-90 days default
+
+**Visualization:**
+- Grafana OSS (free, open source)
+- Dashboards: Production line overview, OEE tracking, downtime analysis
+- Alerts: Email/SMS via SMTP or webhook
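+
+OEE on the tracking dashboard is conventionally the product of availability, performance, and quality. A minimal sketch of that arithmetic follows; the function and parameter names are hypothetical, and the inputs would come from whatever counters the PLC integration actually exposes.
+
+```python
+def oee(planned_min, downtime_min, ideal_cycle_s, total_count, good_count):
+    """Return (availability, performance, quality, oee) as fractions of 1."""
+    run_min = planned_min - downtime_min
+    if planned_min <= 0 or run_min <= 0 or total_count <= 0:
+        raise ValueError("planned time, run time, and total count must be positive")
+    availability = run_min / planned_min                           # share of planned time spent running
+    performance = (ideal_cycle_s * total_count) / (run_min * 60)   # actual vs ideal output rate
+    quality = good_count / total_count                             # share of parts that passed
+    return availability, performance, quality, availability * performance * quality
+```
+
+For a 500-minute shift with 100 minutes of downtime, a 2-second ideal cycle, and 8,100 good parts out of 9,000, the three factors are 0.8, 0.75, and 0.9, giving an OEE of about 54%.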
+
+**PLC Integration:**
+- Node-RED (free, visual programming)
+  - node-red-contrib-modbus
+  - node-red-contrib-opcua
+  - node-red-contrib-s7
+- Alternative: Python scripts
+  - pymodbus (Modbus TCP/RTU)
+  - asyncua (OPC UA; maintained successor to python-opcua)
+  - python-snap7 (Siemens S7)
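+
+Whichever Modbus library is used, holding registers arrive as 16-bit integers, and many PLCs expose a 32-bit float as two consecutive registers. The sketch below shows one way to decode such a pair with the standard library only. The function name is hypothetical, and the word order is an assumption that has to be checked against each device's documentation.
+
+```python
+import struct
+
+
+def registers_to_float(registers, word_order="big"):
+    """Combine two 16-bit register values into one IEEE 754 float32.
+
+    word_order="big" means the first register holds the high word;
+    "little" means the first register holds the low word.
+    """
+    if len(registers) != 2:
+        raise ValueError("expected exactly two 16-bit registers")
+    if word_order == "big":
+        high, low = registers
+    else:
+        low, high = registers
+    raw = struct.pack(">HH", high, low)   # 4 bytes, high word first
+    return struct.unpack(">f", raw)[0]
+```
+
+If decoded readings look implausible (huge or near-zero values), a swapped word order is a likely cause.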
+
+### Network Architecture
+
+```
+Production Floor:
+├── PLCs (Allen-Bradley, Siemens, etc.)
+│   └── Connected via: Ethernet, Serial, or OPC UA Server
+├── Edge Gateway (Raspberry Pi / Industrial PC)
+│   ├── Mosquitto MQTT broker
+│   ├── InfluxDB time-series database
+│   ├── Grafana visualization
+│   └── Node-RED PLC integration
+└── Local Network
+    ├── Operators access via web browser (http://edge-gateway:3000)
+    └── Managers access via web browser or mobile
+```
+
+**Security:**
+- Isolated VLAN (recommended)
+- Firewall rules (only necessary ports)
+- HTTPS/TLS for Grafana (Let's Encrypt via DNS-01 challenge, or an internal CA if the gateway has no public hostname)
+- VPN for remote access (WireGuard or OpenVPN)
+
+### Protocols Supported
+
+**OT Protocols:**
+- Modbus TCP/RTU
+- OPC UA
+- Siemens S7 (via Snap7)
+- EtherNet/IP (Allen-Bradley)
+- BACnet (building automation)
+- Profinet (via OPC UA gateway)
+
+**IT Protocols:**
+- MQTT (primary message bus)
+- HTTP/REST APIs
+- HTTPS for dashboards
+- SMTP for email alerts
+
+---
+
+## Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)
+
+### Infrastructure
+
+**GTHost Dedicated Server #1:**
+```
+Configuration:
+├── CPU: 8 cores (Intel Xeon or AMD EPYC)
+├── RAM: 32GB
+├── Storage: 1TB NVMe SSD
+├── Network: 1Gbps unmetered
+├── Location: Choose closest to majority of customers
+└── Cost: $100-150/month
+```
+
+**Operating System:**
+- Ubuntu Server 24.04 LTS
+- Automated updates (unattended-upgrades)
+- Fail2ban for security
+- UFW firewall configured
+
+### Multi-Tenant Architecture
+
+**Container Platform:**
+- LXC (Linux Containers)
+  - Lightweight compared with Docker
+  - Better for long-running services
+  - Isolation via kernel namespaces and cgroups (containers share the host kernel)
+  - Proven from ZeroLagHub experience
+
+**Container Template:**
+```
+Customer Container (LXC):
+├── Ubuntu 24.04 minimal
+├── Mosquitto MQTT broker (isolated)
+├── InfluxDB (isolated database)
+├── Grafana (customer-specific dashboards)
+├── Node-RED (optional, for advanced workflows)
+├── Backup agent (automated daily)
+└── Resource limits (CPU, RAM, disk)
+```
+
+**Resource Allocation per Customer:**
+- CPU: 1-2 cores (burstable)
+- RAM: 2-4GB
+- Disk: 50-100GB
+- Network: Shared 1Gbps
+
+**Capacity Planning:**
+- 8-10 basic customers per server
+- 5-8 customers if heavy data volume
+- Monitor: CPU, RAM, disk I/O, network
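+
+These estimates can be cross-checked against the Server #1 specification. The sketch below treats RAM and disk as hard limits and ignores CPU, since cores are burstable. The host reserve values are assumptions, not figures from this plan.
+
+```python
+def max_customers(server_ram_gb, server_disk_gb, ram_per_customer_gb,
+                  disk_per_customer_gb, host_reserve_ram_gb=4,
+                  host_reserve_disk_gb=100):
+    """Return how many customer containers fit by RAM and disk."""
+    by_ram = (server_ram_gb - host_reserve_ram_gb) // ram_per_customer_gb
+    by_disk = (server_disk_gb - host_reserve_disk_gb) // disk_per_customer_gb
+    return max(0, min(by_ram, by_disk))   # the scarcer resource sets the limit
+
+
+light = max_customers(32, 1000, ram_per_customer_gb=2, disk_per_customer_gb=50)
+heavy = max_customers(32, 1000, ram_per_customer_gb=4, disk_per_customer_gb=100)
+```
+
+Under these assumptions RAM is the binding constraint: 14 containers fit at the low end of the allocation range and 7 at the high end, so the 8-10 figure implies a mix of container sizes.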
+
+### Networking & Security
+
+**Network Architecture:**
+```
+Internet
+    ↓
+Caddy Reverse Proxy (TLS termination)
+    ↓
+LXC Bridge (internal network)
+    ├── Customer 1 Container (192.168.100.10)
+    ├── Customer 2 Container (192.168.100.11)
+    ├── Customer 3 Container (192.168.100.12)
+    └── ...
+```
+
+**Subdomain Structure:**
+- customer1.yourdomain.com → Grafana dashboard
+- customer2.yourdomain.com → Grafana dashboard
+- mqtt.yourdomain.com → MQTT broker (port per customer)
+
+**Security Features:**
+- TLS/SSL via Let's Encrypt (automated)
+- Firewall (UFW) - only necessary ports
+- Fail2ban - brute force protection
+- Container isolation (LXC namespaces)
+- VPN access for edge devices (WireGuard)
+- Backup encryption (GPG)
+
+### Data Flow
+
+```
+Customer Site:
+    PLC → Node-RED → MQTT (local edge device)
+                        ↓
+                    (Over VPN or direct connection)
+                        ↓
+GTHost Server:
+    MQTT Broker → InfluxDB → Grafana
+                        ↓
+                   Alert Engine → Email/SMS
+```
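+
+Each MQTT message on the GTHost side ends up as an InfluxDB point. A minimal sketch of that conversion into InfluxDB line protocol follows, assuming a hypothetical `site/line/measurement` topic layout. The function names and tag names are illustrative and are not part of the stack described above.
+
+```python
+def _esc(text):
+    """Escape commas, equals signs, and spaces in keys and tag values."""
+    return str(text).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")
+
+
+def to_line_protocol(measurement, tags, fields, timestamp_ns):
+    """Render one point as an InfluxDB line protocol string."""
+    if not fields:
+        raise ValueError("line protocol requires at least one field")
+    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
+    tag_part = "".join(f",{_esc(k)}={_esc(v)}" for k, v in sorted(tags.items()))
+    rendered = []
+    for key, value in fields.items():
+        if isinstance(value, bool):        # bool is a subclass of int, so check it first
+            text = "true" if value else "false"
+        elif isinstance(value, int):
+            text = f"{value}i"             # integer fields carry an 'i' suffix
+        elif isinstance(value, float):
+            text = repr(value)
+        else:
+            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
+            text = f'"{escaped}"'
+        rendered.append(f"{_esc(key)}={text}")
+    return f"{name}{tag_part} {','.join(rendered)} {timestamp_ns}"
+
+
+def point_from_mqtt(topic, payload, timestamp_ns):
+    """Map a 'site/line/measurement' topic (assumed layout) to one point."""
+    site, line, measurement = topic.split("/")
+    return to_line_protocol(measurement, {"site": site, "line": line},
+                            {"value": float(payload)}, timestamp_ns)
+```
+
+Keeping site and line as tags rather than fields means Grafana queries can filter and group by them cheaply, at the cost of one series per site/line/measurement combination.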
+
+### Backup & Disaster Recovery
+
+**Backup Strategy:**
+- Automated daily backups (3am UTC)
+- Retention: 7 daily, 4 weekly, 12 monthly
+- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
+- Encrypted with GPG
+- Automated restore testing (monthly)
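+
+The 7 daily / 4 weekly / 12 monthly retention rule can be expressed as "keep the newest backup in each of the last N days, ISO weeks, and calendar months". The sketch below is one possible pruning routine under that interpretation. The function name is hypothetical, and real backup tools may define the buckets slightly differently.
+
+```python
+from datetime import date, timedelta
+
+
+def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
+    """Return the set of backup dates that the retention rule preserves."""
+    rules = [
+        (daily, lambda d: d.toordinal()),            # one bucket per day
+        (weekly, lambda d: d.isocalendar()[:2]),     # (ISO year, ISO week)
+        (monthly, lambda d: (d.year, d.month)),      # calendar month
+    ]
+    newest_first = sorted(set(backup_dates), reverse=True)
+    keep = set()
+    for limit, bucket_of in rules:
+        seen = set()
+        for d in newest_first:
+            bucket = bucket_of(d)
+            if bucket in seen:
+                continue                 # a newer backup already covers this bucket
+            if len(seen) >= limit:
+                break                    # enough buckets kept for this rule
+            seen.add(bucket)
+            keep.add(d)
+    return keep
+
+
+# Example: 400 consecutive daily backups ending on 2025-12-17.
+history = [date(2025, 12, 17) - timedelta(days=i) for i in range(400)]
+kept = backups_to_keep(history)
+```
+
+Everything in `history` that is not in `kept` is a candidate for deletion. Because the three rules overlap, the number of retained backups is lower than 7 + 4 + 12.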
+
+**Disaster Recovery:**
+- RTO (Recovery Time Objective): 4 hours
+- RPO (Recovery Point Objective): 24 hours
+- Documented restoration procedure
+- Annual DR test
+
+### Monitoring & Alerting
+
+**Server Monitoring:**
+- Prometheus + Grafana (internal)
+- Alerts: CPU >80%, RAM >80%, Disk >85%
+- UptimeRobot (external monitoring)
+- PagerDuty or similar (if needed)
+
+**Customer Monitoring:**
+- Per-container resource usage
+- MQTT connection status
+- InfluxDB query performance
+- Grafana dashboard access logs
+
+---
+
+## Phase 3: GPU-Powered AI Platform
+
+### Infrastructure
+
+**GTHost Dedicated Server #2 (AI/Premium Tier):**
+```
+Configuration:
+├── CPU: 16 cores (Intel Xeon or AMD EPYC)
+├── RAM: 64GB
+├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
+├── Storage: 2TB NVMe SSD
+├── Network: 1Gbps unmetered
+└── Cost: $350-450/month
+```
+
+**Why Tesla P4:**
+- Optimized for AI inference (not training)
+- 8GB VRAM sufficient for production models
+- Low power consumption (75W)
+- Good performance/cost ratio
+
+### AI/ML Stack
+
+**ML Frameworks:**
+- TensorFlow Lite (optimized inference)
+- PyTorch (model development, optional)
+- ONNX Runtime (cross-framework inference)
+- Scikit-learn (traditional ML)
+
+**GPU Acceleration:**
+- CUDA 12.x
+- cuDNN (deep learning primitives)
+- TensorRT (inference optimization)
+
+**Model Serving:**
+- FastAPI (REST API for predictions)
+- Triton Inference Server (optional, for heavy workloads)
+- Redis (result caching)
+
+### AI Features Architecture
+
+**1. Predictive Maintenance:**
+```
+Sensor Data → Feature Engineering → Model Inference → Alert
+   (MQTT)      (Python script)      (TensorFlow)     (Email/SMS)
+```
+
+**Models:**
+- Anomaly detection (vibration, temperature patterns)
+- Failure prediction (time-to-failure models)
+- Remaining Useful Life (RUL) estimation
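+
+Before any trained model exists, a simple statistical baseline can flag obvious outliers in vibration or temperature readings. The sketch below uses a rolling z-score, which is a stand-in technique and not the TensorFlow models listed above. The class name, window size, and threshold are illustrative assumptions.
+
+```python
+from collections import deque
+from statistics import mean, stdev
+
+
+class RollingZScoreDetector:
+    """Flag readings that deviate strongly from the recent window."""
+
+    def __init__(self, window=60, threshold=3.0, min_samples=10):
+        self.values = deque(maxlen=window)
+        self.threshold = threshold
+        self.min_samples = min_samples
+
+    def update(self, value):
+        """Return True if value is anomalous; normal values join the window."""
+        is_anomaly = False
+        if len(self.values) >= self.min_samples:
+            mu = mean(self.values)
+            sigma = stdev(self.values)
+            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
+                is_anomaly = True
+        if not is_anomaly:
+            self.values.append(value)    # keep outliers out of the baseline
+        return is_anomaly
+```
+
+A baseline like this is also a useful comparison point when validating later models: a learned model that cannot beat it on false positives is probably not worth deploying.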
+
+**2. Computer Vision Quality Inspection:**
+```
+Camera → Image Capture → Preprocessing → Model Inference → Classification
+ (HTTP)    (Python)       (OpenCV)       (TensorFlow)      (Pass/Fail)
+```
+
+**Models:**
+- Object detection (YOLOv8, Faster R-CNN)
+- Defect classification (CNN)
+- OCR (text recognition on parts)
+
+### Container Architecture (Phase 3)
+
+**Premium Customer Container:**
+```
+├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
+├── ML inference service (FastAPI + TensorFlow)
+├── Feature engineering pipeline
+├── Model registry (versioned models)
+├── Result database (predictions, alerts)
+└── GPU access (controlled, per-customer limits)
+```
+
+**Resource Allocation (Premium):**
+- CPU: 4-8 cores
+- RAM: 16-32GB
+- GPU: Shared via time-slicing (MIG partitioning is not available on the Tesla P4 or T1000; it requires A100/A30/H100-class GPUs)
+- Disk: 200-500GB
+
+### Model Development Workflow
+
+**Development (Offline):**
+1. Collect customer data (4-8 weeks)
+2. Feature engineering and labeling
+3. Model training (local GPU or cloud)
+4. Model validation (accuracy, false positives)
+5. Export to ONNX or TensorFlow Lite
+
+**Deployment:**
+1. Upload model to server
+2. A/B test against baseline
+3. Monitor inference latency and accuracy
+4. Gradual rollout to production
+5. Continuous monitoring
+
+### Data Pipeline (AI Features)
+
+```
+Customer PLCs/Cameras
+    ↓
+Edge Device (optional preprocessing)
+    ↓
+MQTT → Feature Store (InfluxDB + PostgreSQL)
+    ↓
+ML Inference Service (GPU-accelerated)
+    ↓
+Prediction Results → InfluxDB
+    ↓
+Grafana Dashboard + Alerts
+```
+
+---
+
+## Development & Deployment Tools
+
+### Local Development
+
+**Workstation Setup:**
+- Ubuntu 22.04 or macOS
+- Docker Desktop (for testing containers)
+- VS Code with extensions:
+  - Python
+  - Docker
+  - YAML
+  - Grafana dashboards
+
+**Testing Environment:**
+- Local LXC or Docker setup
+- Simulated PLC data (Node-RED)
+- Small InfluxDB + Grafana instance
+
+### CI/CD Pipeline
+
+**Source Control:**
+- Git (self-hosted Gitea or GitHub)
+- Branches: main, development, customer-specific
+
+**Automation (Future):**
+- GitHub Actions or Gitea Actions
+- Automated testing on push
+- Deployment scripts (Ansible)
+
+**Deployment Process (Manual Initially):**
+1. Test in local environment
+2. Deploy to staging container
+3. Validate with test data
+4. Deploy to production
+5. Monitor for issues
+
+---
+
+## Technology Decisions & Rationale
+
+### Why LXC over Docker?
+
+**Advantages:**
+- Lower overhead (runs closer to bare metal)
+- Better for long-running services (MQTT, databases)
+- Simple bridge networking with a static IP per container
+- Proven from ZeroLagHub experience
+- Less complexity than Kubernetes
+
+**Disadvantages:**
+- Less popular than Docker (smaller community)
+- Fewer pre-built images
+- Manual setup required
+
+**Decision**: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).
+
+### Why InfluxDB over Prometheus?
+
+**Advantages:**
+- Purpose-built for time-series event data written by push (Prometheus scrapes metrics by pull)
+- Better query language (Flux/InfluxQL)
+- Native downsampling and retention policies
+- Better Grafana integration for industrial data
+- Can handle high-frequency data (1-10 second resolution)
+
+**Disadvantages:**
+- More complex than Prometheus
+- Heavier resource usage
+
+**Decision**: InfluxDB for customer data, Prometheus for internal monitoring.
+
+### Why Grafana over Custom Dashboard?
+
+**Advantages:**
+- Industry standard
+- Excellent out-of-box visualizations
+- Plugin ecosystem
+- Customer familiarity (many have seen it)
+- Lower development time
+
+**Disadvantages:**
+- Not as customizable as custom solution
+- Licensing considerations (AGPL for self-hosted)
+
+**Decision**: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.
+
+### Why MQTT over HTTP?
+
+**Advantages:**
+- Purpose-built for IoT (lightweight)
+- Pub/sub model (flexible)
+- Quality of Service levels (QoS 0, 1, 2)
+- Better for unreliable networks
+- Lower bandwidth overhead
+
+**Disadvantages:**
+- One more service to manage
+- Not as universally understood as HTTP
+
+**Decision**: MQTT for OT data collection, HTTP/REST for management APIs.
+
+---
+
+## Scaling Plan
+
+### Server Capacity Thresholds
+
+**Add Server #2 When:**
+- Server #1 >70% CPU average
+- OR >80% RAM average
+- OR >10 customers on Server #1
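+
+The three triggers above are easy to evaluate from the internal Prometheus metrics. A minimal sketch follows; the function name is hypothetical, and how the averages are computed (for example, over a 7-day window) is left as an assumption.
+
+```python
+def scale_out_reasons(avg_cpu_pct, avg_ram_pct, customer_count):
+    """Return which Server #1 thresholds are exceeded (empty list if none)."""
+    reasons = []
+    if avg_cpu_pct > 70:
+        reasons.append("cpu")
+    if avg_ram_pct > 80:
+        reasons.append("ram")
+    if customer_count > 10:
+        reasons.append("customers")
+    return reasons
+```
+
+Any non-empty result means it is time to order the next server.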
+
+**Add Server #3 When:**
+- Combined >70% capacity
+- OR >20 total customers
+- OR geographic distribution (West Coast + East Coast servers)
+
+### Database Scaling
+
+**InfluxDB Scaling:**
+- Start: Single node per customer container
+- Scale: Consider InfluxDB clustering (Enterprise) if needed
+- Alternative: TimescaleDB for SQL-familiar customers
+
+**Backup Scaling:**
+- Start: Daily backups to local disk
+- Scale: Offsite backup to object storage (S3-compatible)
+- Future: Real-time replication to hot standby
+
+---
+
+## Security Best Practices
+
+### Server Hardening
+- [ ] Disable root SSH login; allow key-based authentication only
+- [ ] Fail2ban configured
+- [ ] UFW firewall (only necessary ports)
+- [ ] Automated security updates
+- [ ] Regular security audits (quarterly)
+
+### Application Security
+- [ ] TLS/SSL everywhere (Let's Encrypt)
+- [ ] Strong passwords (generated, stored in 1Password)
+- [ ] API keys rotated (quarterly)
+- [ ] Container isolation verified
+- [ ] Database encryption at rest
+
+### Compliance Considerations
+- GDPR (if EU customers): Data residency, right to deletion
+- HIPAA (if handling protected health information, e.g. medical device customers): BAA required, encryption
+- ISO 27001 (future): Information security management
+
+---
+
+## Tools & Subscriptions
+
+### Required (Paid)
+
+| Tool | Purpose | Cost/Month |
+|------|---------|------------|
+| GTHost Server #1 | Infrastructure | $100-150 |
+| Domain + DNS | yourdomain.com | $1-2 |
+| Email (Google Workspace or similar) | Professional email | $6-12 |
+
+**Total**: $107-164/month
+
+### Optional (Free/Paid)
+
+| Tool | Purpose | Cost/Month |
+|------|---------|------------|
+| 1Password | Password management | $0 (personal) |
+| Wasabi | Offsite backups | $6/TB |
+| UptimeRobot | External monitoring | $0 (free tier) |
+| Stripe | Payment processing | 2.9% + $0.30 per transaction |
+| Twilio | SMS alerts | Pay-as-you-go |
+
+---
+
+## Documentation Strategy
+
+### Internal Documentation
+- Runbooks (how to deploy, backup, restore)
+- Architecture diagrams (network, data flow)
+- Troubleshooting guides
+- Security incident response plan
+
+### Customer Documentation
+- User guide (how to access dashboards)
+- FAQ (common questions)
+- Alert configuration guide
+- Troubleshooting (basic)
+
+**Format**: Markdown in Git repository (easy to version, search)
+
+---
+
+*Last Updated: December 2025*