# Technical Stack & Architecture

## Overview

This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.

---

## Phase 1: Consulting (Customer Hardware)

### Deployment Environment

**Location**: Customer premises

**Hardware**: Customer-provided or recommended purchase
- Raspberry Pi 4 (8GB): $75-100
- OR repurposed industrial PC (free if available)
- OR Intel NUC ($300-800)

### Software Stack

**Operating System:**
- Ubuntu Server 24.04 LTS (free)
- Debian 12 (alternative, free)

**Container Platform:**
- Docker (free, easier for customer maintenance)
- OR LXC (free, lower overhead)

**MQTT Broker:**
- Eclipse Mosquitto (free, open source)
- Configuration: Local network only, authenticated users
- Port: 1883 (or 8883 for TLS)

**Time-Series Database:**
- InfluxDB 2.x OSS (free, open source)
- Alternative: TimescaleDB (PostgreSQL extension, free)
- Retention: 30-90 days default

**Visualization:**
- Grafana OSS (free, open source)
- Dashboards: Production line overview, OEE tracking, downtime analysis
- Alerts: Email/SMS via SMTP or webhook

**PLC Integration:**
- Node-RED (free, visual programming)
  - node-red-contrib-modbus
  - node-red-contrib-opcua
  - node-red-contrib-s7
- Alternative: Python scripts
  - pymodbus (Modbus TCP/RTU)
  - asyncua (OPC UA)
  - python-snap7 (Siemens S7)

### Network Architecture

```
Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile
```

**Security:**
- Isolated VLAN (recommended)
- Firewall rules (only necessary ports)
- HTTPS/TLS for Grafana (Let's Encrypt if the gateway has a public DNS name; otherwise an internal CA or self-signed certificate)
- VPN for remote access (WireGuard or OpenVPN)

### Protocols Supported

**OT Protocols:**
- Modbus TCP/RTU
- OPC UA
- Siemens S7 (via Snap7)
- EtherNet/IP (Allen-Bradley)
- BACnet (building automation)
- PROFINET (via OPC UA gateway)

**IT Protocols:**
- MQTT (primary message bus)
- HTTP/REST APIs
- HTTPS for dashboards
- SMTP for email alerts

---

## Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)

### Infrastructure

**GTHost Dedicated Server #1:**

```
Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month
```

**Operating System:**
- Ubuntu Server 24.04 LTS
- Automated updates (unattended-upgrades)
- Fail2ban for security
- UFW firewall configured

### Multi-Tenant Architecture

**Container Platform:**
- LXC (Linux Containers)
  - Lighter weight than Docker
  - Better for long-running services
  - Kernel-level isolation
  - Proven from ZeroLagHub experience

**Container Template:**

```
Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)
```

**Resource Allocation per Customer:**
- CPU: 1-2 cores (burstable)
- RAM: 2-4GB
- Disk: 50-100GB
- Network: Shared 1Gbps

**Capacity Planning:**
- 8-10 basic customers per server
- 5-8 customers if heavy data volume
- Monitor: CPU, RAM, disk I/O, network

### Networking & Security

**Network Architecture:**

```
Internet
    ↓
Caddy Reverse Proxy (TLS termination)
    ↓
LXC Bridge (internal network)
├── Customer 1 Container (192.168.100.10)
├── Customer 2 Container (192.168.100.11)
├── Customer 3 Container (192.168.100.12)
└── ...
```

**Subdomain Structure:**
- customer1.yourdomain.com → Grafana dashboard
- customer2.yourdomain.com → Grafana dashboard
- mqtt.yourdomain.com → MQTT broker (port per customer)

**Security Features:**
- TLS/SSL via Let's Encrypt (automated)
- Firewall (UFW) - only necessary ports
- Fail2ban - brute force protection
- Container isolation (LXC namespaces)
- VPN access for edge devices (WireGuard)
- Backup encryption (GPG)

### Data Flow

```
Customer Site:
  PLC → Node-RED → MQTT (local edge device)
          ↓
  (Over VPN or direct connection)
          ↓
GTHost Server:
  MQTT Broker → InfluxDB → Grafana
          ↓
  Alert Engine → Email/SMS
```

### Backup & Disaster Recovery

**Backup Strategy:**
- Automated daily backups (3am UTC)
- Retention: 7 daily, 4 weekly, 12 monthly
- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
- Encrypted with GPG
- Automated restore testing (monthly)

**Disaster Recovery:**
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
- Documented restoration procedure
- Annual DR test

### Monitoring & Alerting

**Server Monitoring:**
- Prometheus + Grafana (internal)
- Alerts: CPU >80%, RAM >80%, Disk >85%
- UptimeRobot (external monitoring)
- PagerDuty or similar (if needed)

**Customer Monitoring:**
- Per-container resource usage
- MQTT connection status
- InfluxDB query performance
- Grafana dashboard access logs

---

## Phase 3: GPU-Powered AI Platform

### Infrastructure

**GTHost Dedicated Server #2 (AI/Premium Tier):**

```
Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month
```

**Why Tesla P4:**
- Optimized for AI inference (not training)
- 8GB VRAM sufficient for production models
- Low power consumption (75W)
- Good performance/cost ratio

### AI/ML Stack

**ML Frameworks:**
- TensorFlow Lite (optimized inference)
- PyTorch (model development, optional)
- ONNX Runtime (cross-framework inference)
- Scikit-learn (traditional ML)

**GPU Acceleration:**
- CUDA 12.x
- cuDNN (deep learning primitives)
- TensorRT (inference optimization)

**Model Serving:**
- FastAPI (REST API for predictions)
- Triton Inference Server (optional, for heavy workloads)
- Redis (result caching)

### AI Features Architecture

**1. Predictive Maintenance:**

```
Sensor Data → Feature Engineering → Model Inference → Alert
(MQTT)        (Python script)       (TensorFlow)      (Email/SMS)
```

**Models:**
- Anomaly detection (vibration, temperature patterns)
- Failure prediction (time-to-failure models)
- Remaining Useful Life (RUL) estimation

**2. Computer Vision Quality Inspection:**

```
Camera → Image Capture → Preprocessing → Model Inference → Classification
(HTTP)   (Python)        (OpenCV)        (TensorFlow)      (Pass/Fail)
```

**Models:**
- Object detection (YOLOv8, Faster R-CNN)
- Defect classification (CNN)
- OCR (text recognition on parts)

### Container Architecture (Phase 3)

**Premium Customer Container:**

```
├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)
```

**Resource Allocation (Premium):**
- CPU: 4-8 cores
- RAM: 16-32GB
- GPU: Shared (time-sliced; MIG partitioning is not available on the Tesla P4 or T1000 and would require a newer data-center GPU such as the A100)
- Disk: 200-500GB

### Model Development Workflow

**Development (Offline):**
1. Collect customer data (4-8 weeks)
2. Feature engineering and labeling
3. Model training (local GPU or cloud)
4. Model validation (accuracy, false positives)
5. Export to ONNX or TensorFlow Lite

**Deployment:**
1. Upload model to server
2. A/B test against baseline
3. Monitor inference latency and accuracy
4. Gradual rollout to production
5. Continuous monitoring

### Data Pipeline (AI Features)

```
Customer PLCs/Cameras
    ↓
Edge Device (optional preprocessing)
    ↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
    ↓
ML Inference Service (GPU-accelerated)
    ↓
Prediction Results → InfluxDB
    ↓
Grafana Dashboard + Alerts
```

---

## Development & Deployment Tools

### Local Development

**Workstation Setup:**
- Ubuntu 22.04 or macOS
- Docker Desktop (for testing containers)
- VS Code with extensions:
  - Python
  - Docker
  - YAML
  - Grafana dashboards

**Testing Environment:**
- Local LXC or Docker setup
- Simulated PLC data (Node-RED)
- Small InfluxDB + Grafana instance

### CI/CD Pipeline

**Source Control:**
- Git (self-hosted Gitea or GitHub)
- Branches: main, development, customer-specific

**Automation (Future):**
- GitHub Actions or Gitea Actions
- Automated testing on push
- Deployment scripts (Ansible)

**Deployment Process (Manual Initially):**
1. Test in local environment
2. Deploy to staging container
3. Validate with test data
4. Deploy to production
5. Monitor for issues

---

## Technology Decisions & Rationale

### Why LXC over Docker?

**Advantages:**
- Lower overhead (runs closer to bare metal)
- Better for long-running services (MQTT, databases)
- Simpler networking (bridge vs overlay)
- Proven from ZeroLagHub experience
- Less complexity than Kubernetes

**Disadvantages:**
- Less popular than Docker (smaller community)
- Fewer pre-built images
- Manual setup required

**Decision**: Use LXC for the multi-tenant platform and Docker for customer edge deployments (easier for customers to maintain).

### Why InfluxDB over Prometheus?

**Advantages:**
- Purpose-built for time-series data
- Better query language (Flux/InfluxQL)
- Native downsampling and retention policies
- Better Grafana integration for industrial data
- Can handle high-frequency data (1-10 second resolution)

**Disadvantages:**
- More complex than Prometheus
- Heavier resource usage

**Decision**: InfluxDB for customer data, Prometheus for internal monitoring.

### Why Grafana over Custom Dashboard?

**Advantages:**
- Industry standard
- Excellent out-of-box visualizations
- Plugin ecosystem
- Customer familiarity (many have seen it)
- Lower development time

**Disadvantages:**
- Not as customizable as custom solution
- Licensing considerations (AGPL for self-hosted)

**Decision**: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.

### Why MQTT over HTTP?

**Advantages:**
- Purpose-built for IoT (lightweight)
- Pub/sub model (flexible)
- Quality of Service levels (QoS 0, 1, 2)
- Better for unreliable networks
- Lower bandwidth overhead

**Disadvantages:**
- One more service to manage
- Not as universally understood as HTTP

**Decision**: MQTT for OT data collection, HTTP/REST for management APIs.

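
Much of the pub/sub flexibility listed above comes from MQTT topic wildcards: in a subscription filter, `+` matches exactly one topic level and `#` matches all remaining levels. The following is a minimal pure-Python sketch of that matching rule, for illustration only. The `plant1/line2/...` topic layout is a hypothetical naming scheme, not one defined in this document, and the sketch ignores the special handling of topics that start with `$`; in practice the broker performs this matching.

```python
def topic_matches(subscription: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter.

    '+' matches exactly one level; '#' matches the parent level and
    everything below it, and is only valid as the last filter level.
    Topics beginning with '$' are not treated specially in this sketch.
    """
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")

    for i, sub in enumerate(sub_levels):
        if sub == "#":
            # Multi-level wildcard: accept only if it ends the filter.
            return i == len(sub_levels) - 1
        if i >= len(topic_levels):
            # The filter has more levels than the topic.
            return False
        if sub != "+" and sub != topic_levels[i]:
            # Literal level that does not match.
            return False
    # Every filter level matched; the topic must not have extra levels.
    return len(sub_levels) == len(topic_levels)


# Hypothetical topic layout: <site>/<line>/<machine>/<measurement>
reading = "plant1/line2/press3/temperature"

# One dashboard subscription can cover a whole production line...
line_filter = "plant1/line2/#"
# ...while an alert rule can watch one measurement on every line.
press3_temperature_filter = "plant1/+/press3/temperature"

line_hit = topic_matches(line_filter, reading)
alert_hit = topic_matches(press3_temperature_filter, reading)
```

With this kind of layout, adding a new machine would not require changing existing subscribers: a filter such as `plant1/line2/#` would pick up its topics automatically.
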

---

## Scaling Plan

### Server Capacity Thresholds

**Add Server #2 When:**
- Server #1 >70% CPU average
- OR >80% RAM average
- OR >10 customers on Server #1

**Add Server #3 When:**
- Combined >70% capacity
- OR >20 total customers
- OR geographic distribution (West Coast + East Coast servers)

### Database Scaling

**InfluxDB Scaling:**
- Start: Single node per customer container
- Scale: Consider InfluxDB clustering (Enterprise) if needed
- Alternative: TimescaleDB for SQL-familiar customers

**Backup Scaling:**
- Start: Daily backups to local disk
- Scale: Offsite backup to object storage (S3-compatible)
- Future: Real-time replication to hot standby

---

## Security Best Practices

### Server Hardening

- [ ] Disable root SSH login (key-based authentication only)
- [ ] Fail2ban configured
- [ ] UFW firewall (only necessary ports)
- [ ] Automated security updates
- [ ] Regular security audits (quarterly)

### Application Security

- [ ] TLS/SSL everywhere (Let's Encrypt)
- [ ] Strong passwords (generated, stored in 1Password)
- [ ] API keys rotated (quarterly)
- [ ] Container isolation verified
- [ ] Database encryption at rest

### Compliance Considerations

- GDPR (if EU customers): Data residency, right to deletion
- HIPAA (if medical devices): BAA required, encryption
- ISO 27001 (future): Information security management

---

## Tools & Subscriptions

### Required (Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| GTHost Server #1 | Infrastructure | $100-150 |
| Domain + DNS | yourdomain.com | $1-2 |
| Email (Google Workspace or similar) | Professional email | $6-12 |

**Total**: $107-164/month

### Optional (Free/Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| 1Password | Password management | $0 (personal) |
| Wasabi | Offsite backups | $6/TB |
| UptimeRobot | External monitoring | $0 (free tier) |
| Stripe | Payment processing | 2.9% + $0.30 |
| Twilio | SMS alerts | Pay-as-you-go |

---

## Documentation Strategy

### Internal Documentation

- Runbooks (how to deploy, backup, restore)
- Architecture diagrams (network, data flow)
- Troubleshooting guides
- Security incident response plan

### Customer Documentation

- User guide (how to access dashboards)
- FAQ (common questions)
- Alert configuration guide
- Troubleshooting (basic)

**Format**: Markdown in Git repository (easy to version, search)

---

*Last Updated: December 2025*