Technical Stack & Architecture
Overview
This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.
Phase 1: Consulting (Customer Hardware)
Deployment Environment
Location: Customer premises
Hardware: Customer-provided or recommended purchase
- Raspberry Pi 4 (8GB): $75-100
- OR repurposed industrial PC (free if available)
- OR Intel NUC ($300-800)
Software Stack
Operating System:
- Ubuntu Server 24.04 LTS (free)
- Debian 12 (alternative, free)
Container Platform:
- Docker (free, easier for customer maintenance)
- OR LXC (free, lower overhead)
MQTT Broker:
- Eclipse Mosquitto (free, open source)
- Configuration: Local network only, authenticated users
- Port: 1883 (or 8883 for TLS)
Time-Series Database:
- InfluxDB 2.x OSS (free, open source)
- Alternative: TimescaleDB (PostgreSQL extension, free)
- Retention: 30-90 days default
Visualization:
- Grafana OSS (free, open source)
- Dashboards: Production line overview, OEE tracking, downtime analysis
- Alerts: Email/SMS via SMTP or webhook
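The OEE tracking dashboard needs the metric computed somewhere upstream of Grafana. The sketch below uses the standard OEE definition (availability × performance × quality). The `ShiftData` fields and the sample numbers are illustrative assumptions, not values from a customer site.

```python
from dataclasses import dataclass


@dataclass
class ShiftData:
    """Inputs for one shift (hypothetical field names)."""
    planned_minutes: float       # scheduled production time
    downtime_minutes: float      # unplanned stops during the shift
    ideal_cycle_seconds: float   # fastest possible time per part
    total_count: int             # parts produced, good or bad
    good_count: int              # parts that passed inspection


def oee(shift: ShiftData) -> dict:
    """Return availability, performance, quality and their product (OEE)."""
    run_minutes = shift.planned_minutes - shift.downtime_minutes
    if run_minutes <= 0 or shift.total_count <= 0:
        raise ValueError("shift has no run time or no parts produced")
    availability = run_minutes / shift.planned_minutes
    performance = (shift.ideal_cycle_seconds * shift.total_count) / (run_minutes * 60)
    quality = shift.good_count / shift.total_count
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Example: 8-hour shift, 60 minutes of downtime, 1-second ideal cycle.
result = oee(ShiftData(480, 60, 1.0, 20160, 19152))
print(round(result["oee"], 3))  # 0.665
```

Writing the three factors to InfluxDB alongside the product lets a dashboard show which factor is dragging OEE down.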
PLC Integration:
- Node-RED (free, visual programming)
  - node-red-contrib-modbus
  - node-red-contrib-opcua
  - node-red-contrib-s7
- Alternative: Python scripts
  - pymodbus (Modbus TCP/RTU)
  - asyncua (OPC UA; the maintained successor to python-opcua)
  - python-snap7 (Siemens S7)
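Whichever client library reads the PLC, Modbus holding registers are 16 bits wide, so a 32-bit float arrives as two registers that have to be recombined. The sketch below does this with the standard library only; it is not part of pymodbus. `registers_to_float` is a hypothetical helper, and the word order is device-specific, so the `word_swap` flag is an assumption to verify against the PLC's register map.

```python
import struct


def registers_to_float(first_word: int, second_word: int, word_swap: bool = False) -> float:
    """Combine two 16-bit Modbus registers into an IEEE 754 float32.

    By default the first register is treated as the high word.
    Set word_swap=True for devices that send the low word first.
    """
    if word_swap:
        first_word, second_word = second_word, first_word
    raw = struct.pack(">HH", first_word, second_word)  # 4 bytes, big-endian
    return struct.unpack(">f", raw)[0]


# 0x428F0000 is the float32 bit pattern for 71.5
print(registers_to_float(0x428F, 0x0000))                   # 71.5
print(registers_to_float(0x0000, 0x428F, word_swap=True))   # 71.5
```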
Network Architecture
Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile
Security:
- Isolated VLAN (recommended)
- Firewall rules (only necessary ports)
- HTTPS/TLS for Grafana (Let's Encrypt via DNS-01 challenge, or an internal CA if the gateway has no public DNS name)
- VPN for remote access (WireGuard or OpenVPN)
Protocols Supported
OT Protocols:
- Modbus TCP/RTU
- OPC UA
- Siemens S7 (via Snap7)
- EtherNet/IP (Allen-Bradley)
- BACnet (building automation)
- Profinet (via OPC UA gateway)
IT Protocols:
- MQTT (primary message bus)
- HTTP/REST APIs
- HTTPS for dashboards
- SMTP for email alerts
Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)
Infrastructure
GTHost Dedicated Server #1:
Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month
Operating System:
- Ubuntu Server 24.04 LTS
- Automated updates (unattended-upgrades)
- Fail2ban for security
- UFW firewall configured
Multi-Tenant Architecture
Container Platform:
- LXC (Linux Containers)
  - Lower overhead than Docker
  - Better for long-running services
  - Kernel-level isolation
  - Proven from ZeroLagHub experience
Container Template:
Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)
Resource Allocation per Customer:
- CPU: 1-2 cores (burstable)
- RAM: 2-4GB
- Disk: 50-100GB
- Network: Shared 1Gbps
Capacity Planning:
- 8-10 basic customers per server
- 5-8 customers if heavy data volume
- Monitor: CPU, RAM, disk I/O, network
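The capacity figures above can be sanity-checked with simple arithmetic on RAM and disk, since CPU is burstable and therefore oversubscribed. In the sketch below, the host reservations (4 GB RAM, 100 GB disk) and the treatment of 1 TB as 1000 GB are assumptions, not measured values.

```python
def max_customers(
    server_ram_gb: int,
    server_disk_gb: int,
    ram_per_customer_gb: int,
    disk_per_customer_gb: int,
    host_ram_reserve_gb: int = 4,     # assumed headroom for the host OS and proxy
    host_disk_reserve_gb: int = 100,  # assumed space for the OS, logs, local backups
) -> int:
    """Return how many containers fit, limited by the scarcer of RAM and disk."""
    by_ram = (server_ram_gb - host_ram_reserve_gb) // ram_per_customer_gb
    by_disk = (server_disk_gb - host_disk_reserve_gb) // disk_per_customer_gb
    return max(0, min(by_ram, by_disk))


# Server #1: 32 GB RAM, 1 TB NVMe
print(max_customers(32, 1000, 3, 100))  # 9  (basic customers at ~3 GB each)
print(max_customers(32, 1000, 4, 100))  # 7  (heavy customers at 4 GB each)
```

Both results fall inside the 8-10 and 5-8 ranges stated above. RAM is the binding constraint once customers use the upper end of their allocation.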
Networking & Security
Network Architecture:
Internet
↓
Caddy Reverse Proxy (TLS termination)
↓
LXC Bridge (internal network)
├── Customer 1 Container (192.168.100.10)
├── Customer 2 Container (192.168.100.11)
├── Customer 3 Container (192.168.100.12)
└── ...
Subdomain Structure:
- customer1.yourdomain.com → Grafana dashboard
- customer2.yourdomain.com → Grafana dashboard
- mqtt.yourdomain.com → MQTT broker (port per customer)
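The addressing scheme above is regular enough to derive from a customer number, which helps keep provisioning scripts and proxy configuration in sync. In this sketch the IP offsets follow the diagram (Customer 1 at 192.168.100.10). The /24 prefix and the MQTT port numbering (8883 plus the customer index) are hypothetical choices; the document only states "port per customer".

```python
import ipaddress

BRIDGE_NETWORK = ipaddress.ip_network("192.168.100.0/24")  # assumed /24 prefix
FIRST_HOST_OFFSET = 10   # Customer 1 -> 192.168.100.10, as in the diagram
BASE_MQTT_PORT = 8883    # hypothetical: customer N gets 8883 + (N - 1)


def tenant_endpoints(customer_number: int, domain: str = "yourdomain.com") -> dict:
    """Derive container IP, dashboard hostname and MQTT endpoint for a tenant."""
    if customer_number < 1:
        raise ValueError("customer numbers start at 1")
    offset = FIRST_HOST_OFFSET + customer_number - 1
    if offset >= BRIDGE_NETWORK.num_addresses - 1:  # keep clear of the broadcast address
        raise ValueError("bridge network exhausted")
    return {
        "container_ip": str(BRIDGE_NETWORK.network_address + offset),
        "dashboard_host": f"customer{customer_number}.{domain}",
        "mqtt_endpoint": f"mqtt.{domain}:{BASE_MQTT_PORT + customer_number - 1}",
    }


print(tenant_endpoints(3))
```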
Security Features:
- TLS/SSL via Let's Encrypt (automated)
- Firewall (UFW) - only necessary ports
- Fail2ban - brute force protection
- Container isolation (LXC namespaces)
- VPN access for edge devices (WireGuard)
- Backup encryption (GPG)
Data Flow
Customer Site:
PLC → Node-RED → MQTT (local edge device)
↓
(Over VPN or direct connection)
↓
GTHost Server:
MQTT Broker → InfluxDB → Grafana
↓
Alert Engine → Email/SMS
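One way to implement the MQTT Broker → InfluxDB hop is a small bridge that converts each decoded message into InfluxDB line protocol before writing it. The sketch below covers only that formatting step, using the standard library. The measurement, tag and field names are made up for illustration, and subscribing to MQTT and posting to InfluxDB are left out.

```python
def _escape(text: str, special: str) -> str:
    """Backslash-escape each character of `special` that occurs in `text`."""
    for ch in special:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value) -> str:
    """Render a Python value using line-protocol field syntax."""
    if isinstance(value, bool):      # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'            # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(
        f",{_escape(k, ', =')}={_escape(str(v), ', =')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{_escape(k, ', =')}={_field_value(v)}" for k, v in fields.items())
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


print(to_line_protocol(
    "line_status",
    {"plc": "s7-1200", "line": "press 1"},          # hypothetical tags
    {"temperature_c": 71.5, "cycle_count": 1042, "running": True},
    1735689600000000000,
))
```

Keeping low-cardinality identifiers (line, PLC) as tags and measured values as fields follows the usual InfluxDB schema guidance, since tags are indexed and fields are not.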
Backup & Disaster Recovery
Backup Strategy:
- Automated daily backups (3am UTC)
- Retention: 7 daily, 4 weekly, 12 monthly
- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
- Encrypted with GPG
- Automated restore testing (monthly)
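The "7 daily, 4 weekly, 12 monthly" retention rule can be expressed as a pruning function over backup dates. This sketch makes several assumptions: there is at most one backup per day, "daily" means the 7 most recent backups, and weekly and monthly mean the newest backup in each of the most recent ISO weeks and calendar months. `backups_to_keep` is a hypothetical helper, not part of any backup tool.

```python
from datetime import date, timedelta


def _newest_per_bucket(newest_first, bucket_of, limit):
    """Return the newest backup from each of the `limit` most recent buckets."""
    chosen = {}
    for day in newest_first:
        bucket = bucket_of(day)
        if bucket in chosen:
            continue
        if len(chosen) == limit:
            break  # every remaining backup belongs to an older bucket
        chosen[bucket] = day
    return set(chosen.values())


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates the retention policy keeps."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])
    keep |= _newest_per_bucket(newest_first, lambda d: tuple(d.isocalendar())[:2], weekly)
    keep |= _newest_per_bucket(newest_first, lambda d: (d.year, d.month), monthly)
    return keep


# Example: 400 consecutive daily backups ending 2025-12-31
last_day = date(2025, 12, 31)
history = [last_day - timedelta(days=n) for n in range(400)]
keep = backups_to_keep(history)
to_delete = sorted(set(history) - keep)
print(len(keep), "kept,", len(to_delete), "eligible for deletion")
```

Because the three tiers overlap (the newest backup counts as daily, weekly and monthly), the number of files retained is lower than 7 + 4 + 12.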
Disaster Recovery:
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
- Documented restoration procedure
- Annual DR test
Monitoring & Alerting
Server Monitoring:
- Prometheus + Grafana (internal)
- Alerts: CPU >80%, RAM >80%, Disk >85%
- UptimeRobot (external monitoring)
- PagerDuty or similar (if needed)
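The thresholds above are normally encoded as Prometheus alerting rules; the logic itself is small enough to sketch in plain Python. Requiring several consecutive samples over the limit before firing is an assumption added here to avoid alerting on short spikes (similar in spirit to the `for` clause in a Prometheus rule). The metric names are hypothetical.

```python
# Limits from the server monitoring list: CPU >80%, RAM >80%, Disk >85%
THRESHOLDS = {"cpu_percent": 80.0, "ram_percent": 80.0, "disk_percent": 85.0}


def sustained_breaches(samples, window=3):
    """Return metrics that exceeded their limit in each of the last `window` samples.

    `samples` is a list of dicts ordered oldest to newest, one per scrape.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return []  # not enough history to call anything sustained
    return [
        name
        for name, limit in THRESHOLDS.items()
        if all(sample[name] > limit for sample in recent)
    ]


samples = [
    {"cpu_percent": 91, "ram_percent": 60, "disk_percent": 86},
    {"cpu_percent": 88, "ram_percent": 85, "disk_percent": 87},
    {"cpu_percent": 93, "ram_percent": 62, "disk_percent": 84},
]
print(sustained_breaches(samples))  # ['cpu_percent']
```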
Customer Monitoring:
- Per-container resource usage
- MQTT connection status
- InfluxDB query performance
- Grafana dashboard access logs
Phase 3: GPU-Powered AI Platform
Infrastructure
GTHost Dedicated Server #2 (AI/Premium Tier):
Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month
Why Tesla P4:
- Optimized for AI inference (not training)
- 8GB VRAM sufficient for production models
- Low power consumption (75W)
- Good performance/cost ratio
AI/ML Stack
ML Frameworks:
- TensorFlow Lite (optimized inference)
- PyTorch (model development, optional)
- ONNX Runtime (cross-framework inference)
- Scikit-learn (traditional ML)
GPU Acceleration:
- CUDA 12.x
- cuDNN (deep learning primitives)
- TensorRT (inference optimization)
Model Serving:
- FastAPI (REST API for predictions)
- Triton Inference Server (optional, for heavy workloads)
- Redis (result caching)
AI Features Architecture
1. Predictive Maintenance:
Sensor Data (MQTT) → Feature Engineering (Python script) → Model Inference (TensorFlow) → Alert (Email/SMS)
Models:
- Anomaly detection (vibration, temperature patterns)
- Failure prediction (time-to-failure models)
- Remaining Useful Life (RUL) estimation
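Before a trained model exists, a statistical baseline is useful for validating the anomaly-detection pipeline end to end. The sketch below is a rolling z-score detector using only the standard library. It stands in for the TensorFlow models listed above and is not one of them, and the window size and threshold are assumed starting points that would need tuning per sensor.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=20, threshold=3.0):
    """Return indexes of readings more than `threshold` standard deviations
    from the mean of the previous `window` normal readings."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Simulated vibration signal with one spike at index 30
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95] * 5 + [5.0, 1.0, 1.1]
print(rolling_zscore_anomalies(vibration))  # [30]
```

Excluding flagged values from the baseline keeps one spike from inflating the standard deviation and masking the next one.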
2. Computer Vision Quality Inspection:
Camera (HTTP) → Image Capture (Python) → Preprocessing (OpenCV) → Model Inference (TensorFlow) → Classification (Pass/Fail)
Models:
- Object detection (YOLOv8, Faster R-CNN)
- Defect classification (CNN)
- OCR (text recognition on parts)
Container Architecture (Phase 3)
Premium Customer Container:
├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)
Resource Allocation (Premium):
- CPU: 4-8 cores
- RAM: 16-32GB
- GPU: Shared via time-slicing (MIG partitioning requires newer data-center GPUs such as the A100 and is not available on the Tesla P4 or T1000)
- Disk: 200-500GB
Model Development Workflow
Development (Offline):
- Collect customer data (4-8 weeks)
- Feature engineering and labeling
- Model training (local GPU or cloud)
- Model validation (accuracy, false positives)
- Export to ONNX or TensorFlow Lite
Deployment:
- Upload model to server
- A/B test against baseline
- Monitor inference latency and accuracy
- Gradual rollout to production
- Continuous monitoring
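The A/B test and gradual rollout steps need a stable way to decide which devices are served by the new model. One common approach, sketched here as an assumption rather than a settled design, is to hash a device identifier into a bucket from 0 to 99 and compare it with the rollout percentage. `use_candidate_model` and the device ID format are hypothetical.

```python
import hashlib


def use_candidate_model(device_id: str, rollout_percent: int) -> bool:
    """Return True if this device should be served by the candidate model.

    The same device always lands in the same bucket, so raising the
    percentage only adds devices and never flips existing ones back.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


devices = [f"press-{n:03d}" for n in range(1000)]
on_candidate = sum(use_candidate_model(d, 25) for d in devices)
print(f"{on_candidate} of {len(devices)} devices on the candidate model")
```

Logging the chosen model version with each prediction makes the later accuracy and latency comparison straightforward.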
Data Pipeline (AI Features)
Customer PLCs/Cameras
↓
Edge Device (optional preprocessing)
↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
↓
ML Inference Service (GPU-accelerated)
↓
Prediction Results → InfluxDB
↓
Grafana Dashboard + Alerts
Development & Deployment Tools
Local Development
Workstation Setup:
- Ubuntu 22.04 or macOS
- Docker Desktop (for testing containers)
- VS Code with extensions:
  - Python
  - Docker
  - YAML
  - Grafana dashboards
Testing Environment:
- Local LXC or Docker setup
- Simulated PLC data (Node-RED)
- Small InfluxDB + Grafana instance
CI/CD Pipeline
Source Control:
- Git (self-hosted Gitea or GitHub)
- Branches: main, development, customer-specific
Automation (Future):
- GitHub Actions or Gitea Actions
- Automated testing on push
- Deployment scripts (Ansible)
Deployment Process (Manual Initially):
- Test in local environment
- Deploy to staging container
- Validate with test data
- Deploy to production
- Monitor for issues
Technology Decisions & Rationale
Why LXC over Docker?
Advantages:
- Lower overhead (runs closer to bare metal)
- Better for long-running services (MQTT, databases)
- Simpler networking (bridge vs overlay)
- Proven from ZeroLagHub experience
- Less complexity than Kubernetes
Disadvantages:
- Less popular than Docker (smaller community)
- Fewer pre-built images
- Manual setup required
Decision: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).
Why InfluxDB over Prometheus?
Advantages:
- Purpose-built for time-series data
- Query languages suited to time-series analysis (Flux or InfluxQL; note that InfluxData has placed Flux in maintenance mode)
- Native downsampling and retention policies
- Better Grafana integration for industrial data
- Can handle high-frequency data (1-10 second resolution)
Disadvantages:
- More complex than Prometheus
- Heavier resource usage
Decision: InfluxDB for customer data, Prometheus for internal monitoring.
Why Grafana over Custom Dashboard?
Advantages:
- Industry standard
- Excellent out-of-box visualizations
- Plugin ecosystem
- Customer familiarity (many have seen it)
- Lower development time
Disadvantages:
- Not as customizable as custom solution
- Licensing considerations (AGPL for self-hosted)
Decision: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.
Why MQTT over HTTP?
Advantages:
- Purpose-built for IoT (lightweight)
- Pub/sub model (flexible)
- Quality of Service levels (QoS 0, 1, 2)
- Better for unreliable networks
- Lower bandwidth overhead
Disadvantages:
- One more service to manage
- Not as universally understood as HTTP
Decision: MQTT for OT data collection, HTTP/REST for management APIs.
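Much of the pub/sub flexibility comes from topic filters: `+` matches exactly one topic level and `#` matches all remaining levels. The sketch below implements those matching rules in plain Python to show how one subscription can cover many PLCs. The topic layout (`plant/line/metric`) is a hypothetical convention, and a real deployment would rely on the broker rather than this function.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches the MQTT subscription `topic_filter`."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    # Wildcards at the first level do not match topics starting with '$'.
    if topic.startswith("$") and filter_levels[0] in ("+", "#"):
        return False
    for i, level in enumerate(filter_levels):
        if level == "#":
            return i == len(filter_levels) - 1  # '#' is only valid as the last level
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


print(topic_matches("plant1/+/temperature", "plant1/line2/temperature"))  # True
print(topic_matches("plant1/#", "plant1/line2/motor/vibration"))          # True
print(topic_matches("plant1/+", "plant1/line2/temperature"))              # False
```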
Scaling Plan
Server Capacity Thresholds
Add Server #2 When:
- Server #1 >70% CPU average
- OR >80% RAM average
- OR >10 customers on Server #1
Add Server #3 When:
- Combined >70% capacity
- OR >20 total customers
- OR geographic distribution (West Coast + East Coast servers)
Database Scaling
InfluxDB Scaling:
- Start: Single node per customer container
- Scale: Consider InfluxDB clustering (Enterprise) if needed
- Alternative: TimescaleDB for SQL-familiar customers
Backup Scaling:
- Start: Daily backups to local disk
- Scale: Offsite backup to object storage (S3-compatible)
- Future: Real-time replication to hot standby
Security Best Practices
Server Hardening
- Disable root SSH login and password authentication (key-based access only)
- Fail2ban configured
- UFW firewall (only necessary ports)
- Automated security updates
- Regular security audits (quarterly)
Application Security
- TLS/SSL everywhere (Let's Encrypt)
- Strong passwords (generated, stored in 1Password)
- API keys rotated (quarterly)
- Container isolation verified
- Database encryption at rest
Compliance Considerations
- GDPR (if EU customers): Data residency, right to deletion
- HIPAA (if medical devices): BAA required, encryption
- ISO 27001 (future): Information security management
Tools & Subscriptions
Required (Paid)
| Tool | Purpose | Cost/Month |
|---|---|---|
| GTHost Server #1 | Infrastructure | $100-150 |
| Domain + DNS | yourdomain.com | $1-2 |
| Email (Google Workspace or similar) | Professional email | $6-12 |
Total: $107-164/month
Optional (Free/Paid)
| Tool | Purpose | Cost/Month |
|---|---|---|
| 1Password | Password management | $0 (personal) |
| Wasabi | Offsite backups | $6/TB |
| UptimeRobot | External monitoring | $0 (free tier) |
| Stripe | Payment processing | 2.9% + $0.30 |
| Twilio | SMS alerts | Pay-as-you-go |
Documentation Strategy
Internal Documentation
- Runbooks (how to deploy, backup, restore)
- Architecture diagrams (network, data flow)
- Troubleshooting guides
- Security incident response plan
Customer Documentation
- User guide (how to access dashboards)
- FAQ (common questions)
- Alert configuration guide
- Troubleshooting (basic)
Format: Markdown in Git repository (easy to version, search)
Last Updated: December 2025