Technical Stack & Architecture

Overview

This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.


Phase 1: Consulting (Customer Hardware)

Deployment Environment

Location: Customer premises
Hardware: Customer-provided or recommended purchase

  • Raspberry Pi 4 (8GB): $75-100
  • OR repurposed industrial PC (free if available)
  • OR Intel NUC ($300-800)

Software Stack

Operating System:

  • Ubuntu Server 24.04 LTS (free)
  • Debian 12 (alternative, free)

Container Platform:

  • Docker (free, easier for customer maintenance)
  • OR LXC (free, lower overhead)

MQTT Broker:

  • Eclipse Mosquitto (free, open source)
  • Configuration: Local network only, authenticated users
  • Port: 1883 (or 8883 for TLS)

Time-Series Database:

  • InfluxDB 2.x OSS (free, open source)
  • Alternative: TimescaleDB (PostgreSQL extension, free)
  • Retention: 30-90 days default
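
As a rough illustration of what one sample looks like when written to InfluxDB, the sketch below formats a point in InfluxDB line protocol using only the standard library. The measurement name `line_status`, the tag `line`, and the field names are hypothetical placeholders, not names from an actual deployment.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB line protocol string."""

    def esc(text, chars):
        # Backslash-escape each special character listed in `chars`.
        for ch in chars:
            text = text.replace(ch, "\\" + ch)
        return text

    def fmt_field(value):
        if isinstance(value, bool):   # test bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"        # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        return '"' + esc(str(value), '\\"') + '"'   # strings are double-quoted

    tag_part = "".join(
        f",{esc(key, ', =')}={esc(str(val), ', =')}"
        for key, val in sorted(tags.items())
    )
    field_part = ",".join(
        f"{esc(key, ', =')}={fmt_field(val)}" for key, val in fields.items()
    )
    return f"{esc(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample from one production line.
print(to_line_protocol(
    "line_status",
    {"line": "A1"},
    {"temperature_c": 72.5, "parts": 120, "running": True},
    1700000000000000000,
))
```

In practice an official client library or Node-RED's InfluxDB nodes would normally handle this formatting; the sketch only shows the shape of the data being stored.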

Visualization:

  • Grafana OSS (free, open source)
  • Dashboards: Production line overview, OEE tracking, downtime analysis
  • Alerts: Email/SMS via SMTP or webhook
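
For the OEE tracking dashboard, the usual definition is OEE = availability × performance × quality. The sketch below computes it from per-shift counters; the parameter names and the sample shift figures are illustrative assumptions, not values from a customer site.

```python
def oee(planned_minutes, downtime_minutes, ideal_cycle_seconds,
        total_parts, good_parts):
    """Return availability, performance, quality, and OEE as fractions (0-1)."""
    run_seconds = (planned_minutes - downtime_minutes) * 60
    if planned_minutes <= 0 or run_seconds <= 0 or total_parts <= 0:
        # Nothing ran or nothing was produced: report zeros instead of dividing by zero.
        return {"availability": 0.0, "performance": 0.0, "quality": 0.0, "oee": 0.0}

    availability = run_seconds / (planned_minutes * 60)          # run time / planned time
    performance = (ideal_cycle_seconds * total_parts) / run_seconds  # ideal time / actual time
    quality = good_parts / total_parts                           # good parts / all parts
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Hypothetical 8-hour shift with 60 minutes of downtime.
shift = oee(planned_minutes=480, downtime_minutes=60,
            ideal_cycle_seconds=1.0, total_parts=22680, good_parts=21546)
print({name: round(value, 4) for name, value in shift.items()})
```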

PLC Integration:

  • Node-RED (free, visual programming)
    • node-red-contrib-modbus
    • node-red-contrib-opcua
    • node-red-contrib-s7
  • Alternative: Python scripts
    • pymodbus (Modbus TCP/RTU)
    • asyncua (OPC UA; maintained successor to python-opcua)
    • python-snap7 (Siemens S7)
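
Whichever Modbus library is used, PLC values often arrive as raw 16-bit registers that have to be combined before they can be stored. A minimal sketch, assuming a 32-bit IEEE 754 float spread across two holding registers (the word order varies by PLC vendor, so `word_swap` is a guess to be confirmed against the device manual):

```python
import struct


def registers_to_float(first_register, second_register, word_swap=False):
    """Combine two 16-bit Modbus registers into one 32-bit float.

    By default the first register is treated as the high word (big-endian
    word order). Set word_swap=True for devices that send the low word first.
    """
    if word_swap:
        first_register, second_register = second_register, first_register
    raw = struct.pack(">HH", first_register, second_register)  # 4 bytes, big-endian
    return struct.unpack(">f", raw)[0]


# 0x41C8 0x0000 is the IEEE 754 encoding of 25.0.
print(registers_to_float(0x41C8, 0x0000))
print(registers_to_float(0x0000, 0x41C8, word_swap=True))
```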

Network Architecture

Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile

Security:

  • Isolated VLAN (recommended)
  • Firewall rules (only necessary ports)
  • HTTPS/TLS for Grafana (Let's Encrypt via DNS-01 challenge, or an internal CA, since the gateway is not publicly reachable)
  • VPN for remote access (WireGuard or OpenVPN)

Protocols Supported

OT Protocols:

  • Modbus TCP/RTU
  • OPC UA
  • Siemens S7 (via Snap7)
  • EtherNet/IP (Allen-Bradley)
  • BACnet (building automation)
  • Profinet (via OPC UA gateway)

IT Protocols:

  • MQTT (primary message bus)
  • HTTP/REST APIs
  • HTTPS for dashboards
  • SMTP for email alerts

Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)

Infrastructure

GTHost Dedicated Server #1:

Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month

Operating System:

  • Ubuntu Server 24.04 LTS
  • Automated updates (unattended-upgrades)
  • Fail2ban for security
  • UFW firewall configured

Multi-Tenant Architecture

Container Platform:

  • LXC (Linux Containers)
    • Lower overhead than Docker for full-system containers
    • Better for long-running services
    • Kernel-level isolation
    • Proven from ZeroLagHub experience

Container Template:

Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)

Resource Allocation per Customer:

  • CPU: 1-2 cores (burstable)
  • RAM: 2-4GB
  • Disk: 50-100GB
  • Network: Shared 1Gbps

Capacity Planning:

  • 8-10 basic customers per server
  • 5-8 customers if heavy data volume
  • Monitor: CPU, RAM, disk I/O, network
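
The customer counts above can be sanity-checked against the server specification and the per-customer allocations. The sketch below finds the tightest resource; the CPU overcommit ratios are assumptions standing in for "burstable" cores and should be tuned from real monitoring data.

```python
SERVER = {"cpu_cores": 8, "ram_gb": 32, "disk_gb": 1000}   # Server #1 spec
BASIC = {"cpu_cores": 1, "ram_gb": 2, "disk_gb": 50}       # low end of the allocation
HEAVY = {"cpu_cores": 2, "ram_gb": 4, "disk_gb": 100}      # high end of the allocation


def max_customers(server, per_customer, overcommit=None):
    """Number of containers that fit, limited by the scarcest resource.

    `overcommit` maps a resource name to a ratio (e.g. 1.25 means 25% more
    is promised than physically exists). Resources not listed use 1.0.
    """
    overcommit = overcommit or {}
    return min(
        int(server[name] * overcommit.get(name, 1.0) // per_customer[name])
        for name in per_customer
    )


print(max_customers(SERVER, BASIC))                        # no overcommit
print(max_customers(SERVER, BASIC, {"cpu_cores": 1.25}))   # assumed CPU overcommit
print(max_customers(SERVER, HEAVY, {"cpu_cores": 1.5}))    # assumed CPU overcommit
```

This ignores the resources the host itself needs (reverse proxy, monitoring, backups), so the real ceiling is somewhat lower.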

Networking & Security

Network Architecture:

Internet
    ↓
Caddy Reverse Proxy (TLS termination)
    ↓
LXC Bridge (internal network)
    ├── Customer 1 Container (192.168.100.10)
    ├── Customer 2 Container (192.168.100.11)
    ├── Customer 3 Container (192.168.100.12)
    └── ...

Subdomain Structure:

  • customer1.yourdomain.com → Grafana dashboard
  • customer2.yourdomain.com → Grafana dashboard
  • mqtt.yourdomain.com → MQTT broker (port per customer)
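
One way to keep subdomains, container addresses, and MQTT ports consistent is to derive all three from the customer number. This is a sketch under assumptions: the bridge is taken to be a /24, and the port scheme (starting at 8883 and counting up) is hypothetical, since the document only says "port per customer".

```python
import ipaddress

BRIDGE_NETWORK = ipaddress.ip_network("192.168.100.0/24")  # assumed /24 bridge
FIRST_HOST_OFFSET = 10   # Customer 1 -> 192.168.100.10, as in the diagram
MQTT_BASE_PORT = 8883    # hypothetical: Customer 1 -> 8883, Customer 2 -> 8884, ...


def customer_endpoint(customer_number, domain="yourdomain.com"):
    """Derive the dashboard hostname, container IP, and MQTT port for a customer."""
    if customer_number < 1:
        raise ValueError("customer numbers start at 1")
    ip = BRIDGE_NETWORK.network_address + FIRST_HOST_OFFSET + (customer_number - 1)
    if ip not in BRIDGE_NETWORK or ip == BRIDGE_NETWORK.broadcast_address:
        raise ValueError("bridge network has no free address for this customer")
    return {
        "dashboard": f"customer{customer_number}.{domain}",
        "container_ip": str(ip),
        "mqtt_port": MQTT_BASE_PORT + customer_number - 1,
    }


print(customer_endpoint(1))
print(customer_endpoint(3))
```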

Security Features:

  • TLS/SSL via Let's Encrypt (automated)
  • Firewall (UFW) - only necessary ports
  • Fail2ban - brute force protection
  • Container isolation (LXC namespaces)
  • VPN access for edge devices (WireGuard)
  • Backup encryption (GPG)

Data Flow

Customer Site:
    PLC → Node-RED → MQTT (local edge device)
                        ↓
                    (Over VPN or direct connection)
                        ↓
GTHost Server:
    MQTT Broker → InfluxDB → Grafana
                        ↓
                   Alert Engine → Email/SMS

Backup & Disaster Recovery

Backup Strategy:

  • Automated daily backups (3am UTC)
  • Retention: 7 daily, 4 weekly, 12 monthly
  • Storage: GTHost server + offsite (Wasabi/Backblaze B2)
  • Encrypted with GPG
  • Automated restore testing (monthly)
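
The 7 daily / 4 weekly / 12 monthly retention rule can be expressed as a small pruning function. The sketch below is one interpretation (keep the newest backup in each of the most recent ISO weeks and calendar months); the backup tool actually chosen may define the buckets differently.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain; everything else can be pruned."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])          # the most recent daily backups

    def keep_newest_per_bucket(bucket_key, limit):
        seen = []
        for backup_date in newest_first:
            key = bucket_key(backup_date)
            if key in seen:
                continue                      # a newer backup already covers this bucket
            if len(seen) == limit:
                break                         # enough buckets retained
            seen.append(key)
            keep.add(backup_date)

    keep_newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly)  # (ISO year, week)
    keep_newest_per_bucket(lambda d: (d.year, d.month), monthly)
    return keep


# Hypothetical history: one backup per day for 400 days.
history = [date(2025, 12, 31) - timedelta(days=i) for i in range(400)]
retained = backups_to_keep(history)
print(sorted(retained))
```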

Disaster Recovery:

  • RTO (Recovery Time Objective): 4 hours
  • RPO (Recovery Point Objective): 24 hours
  • Documented restoration procedure
  • Annual DR test

Monitoring & Alerting

Server Monitoring:

  • Prometheus + Grafana (internal)
  • Alerts: CPU >80%, RAM >80%, Disk >85%
  • UptimeRobot (external monitoring)
  • PagerDuty or similar (if needed)

Customer Monitoring:

  • Per-container resource usage
  • MQTT connection status
  • InfluxDB query performance
  • Grafana dashboard access logs

Phase 3: GPU-Powered AI Platform

Infrastructure

GTHost Dedicated Server #2 (AI/Premium Tier):

Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month

Why Tesla P4:

  • Optimized for AI inference (not training)
  • 8GB VRAM sufficient for production models
  • Low power consumption (75W)
  • Good performance/cost ratio

AI/ML Stack

ML Frameworks:

  • TensorFlow Lite (optimized inference)
  • PyTorch (model development, optional)
  • ONNX Runtime (cross-framework inference)
  • Scikit-learn (traditional ML)

GPU Acceleration:

  • CUDA 12.x
  • cuDNN (deep learning primitives)
  • TensorRT (inference optimization)

Model Serving:

  • FastAPI (REST API for predictions)
  • Triton Inference Server (optional, for heavy workloads)
  • Redis (result caching)

AI Features Architecture

1. Predictive Maintenance:

Sensor Data → Feature Engineering → Model Inference → Alert
   (MQTT)      (Python script)      (TensorFlow)     (Email/SMS)

Models:

  • Anomaly detection (vibration, temperature patterns)
  • Failure prediction (time-to-failure models)
  • Remaining Useful Life (RUL) estimation
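
Before any trained model exists, a simple statistical baseline can stand in for the anomaly detection step. The sketch below uses a rolling z-score rather than TensorFlow; the window size, warm-up length, and threshold are placeholder values that would need tuning per sensor.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Flag readings that deviate strongly from the recent baseline."""

    def __init__(self, window=30, min_samples=10, threshold=3.0):
        self.history = deque(maxlen=window)   # recent normal readings
        self.min_samples = min_samples        # warm-up before judging anything
        self.threshold = threshold            # z-score above which we flag

    def update(self, value):
        """Return True if `value` is anomalous relative to the rolling window."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = stdev(self.history)
            if spread > 0 and abs(value - baseline) / spread > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            self.history.append(value)        # keep outliers out of the baseline
        return is_anomaly


# Hypothetical bearing temperature readings with one spike.
detector = RollingZScoreDetector()
readings = [50.0, 50.2] * 10 + [75.0, 50.1]
flags = [detector.update(reading) for reading in readings]
print(flags.index(True))
```

A baseline like this also gives a reference point for judging whether a later learned model actually reduces false positives.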

2. Computer Vision Quality Inspection:

Camera → Image Capture → Preprocessing → Model Inference → Classification
 (HTTP)    (Python)       (OpenCV)       (TensorFlow)      (Pass/Fail)

Models:

  • Object detection (YOLOv8, Faster R-CNN)
  • Defect classification (CNN)
  • OCR (text recognition on parts)

Container Architecture (Phase 3)

Premium Customer Container:

├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)

Resource Allocation (Premium):

  • CPU: 4-8 cores
  • RAM: 16-32GB
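  • GPU: Shared via time-slicing (MIG partitioning is not supported on the Tesla P4 or T1000; it requires Ampere-or-newer data center GPUs such as the A100 or A30)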
  • Disk: 200-500GB

Model Development Workflow

Development (Offline):

  1. Collect customer data (4-8 weeks)
  2. Feature engineering and labeling
  3. Model training (local GPU or cloud)
  4. Model validation (accuracy, false positives)
  5. Export to ONNX or TensorFlow Lite

Deployment:

  1. Upload model to server
  2. A/B test against baseline
  3. Monitor inference latency and accuracy
  4. Gradual rollout to production
  5. Continuous monitoring
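
For the A/B test and gradual rollout steps, requests need to be split between the baseline and the candidate model in a stable way. A minimal sketch, assuming each data source has a string identifier (the `device_id` format here is hypothetical): hashing the identifier gives every device a fixed bucket, so raising the percentage only ever adds devices to the candidate group.

```python
import hashlib


def use_candidate_model(device_id, rollout_percent):
    """Return True if this device should be served by the candidate model."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return bucket < rollout_percent


# Hypothetical fleet of devices at a 20% rollout.
fleet = [f"device-{number}" for number in range(1000)]
on_candidate = [d for d in fleet if use_candidate_model(d, 20)]
print(len(on_candidate))
```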

Data Pipeline (AI Features)

Customer PLCs/Cameras
    ↓
Edge Device (optional preprocessing)
    ↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
    ↓
ML Inference Service (GPU-accelerated)
    ↓
Prediction Results → InfluxDB
    ↓
Grafana Dashboard + Alerts

Development & Deployment Tools

Local Development

Workstation Setup:

  • Ubuntu 22.04 or macOS
  • Docker Desktop (for testing containers)
  • VS Code with extensions:
    • Python
    • Docker
    • YAML
    • Grafana dashboards

Testing Environment:

  • Local LXC or Docker setup
  • Simulated PLC data (Node-RED)
  • Small InfluxDB + Grafana instance

CI/CD Pipeline

Source Control:

  • Git (self-hosted Gitea or GitHub)
  • Branches: main, development, customer-specific

Automation (Future):

  • GitHub Actions or Gitea Actions
  • Automated testing on push
  • Deployment scripts (Ansible)

Deployment Process (Manual Initially):

  1. Test in local environment
  2. Deploy to staging container
  3. Validate with test data
  4. Deploy to production
  5. Monitor for issues

Technology Decisions & Rationale

Why LXC over Docker?

Advantages:

  • Lower overhead (runs closer to bare metal)
  • Better for long-running services (MQTT, databases)
  • Simpler networking (bridge vs overlay)
  • Proven from ZeroLagHub experience
  • Less complexity than Kubernetes

Disadvantages:

  • Less popular than Docker (smaller community)
  • Fewer pre-built images
  • Manual setup required

Decision: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).

Why InfluxDB over Prometheus?

Advantages:

  • Purpose-built for time-series data
  • Better query language (Flux/InfluxQL)
  • Native downsampling and retention policies
  • Better Grafana integration for industrial data
  • Can handle high-frequency data (1-10 second resolution)

Disadvantages:

  • More complex than Prometheus
  • Heavier resource usage

Decision: InfluxDB for customer data, Prometheus for internal monitoring.

Why Grafana over Custom Dashboard?

Advantages:

  • Industry standard
  • Excellent out-of-box visualizations
  • Plugin ecosystem
  • Customer familiarity (many have seen it)
  • Lower development time

Disadvantages:

  • Not as customizable as custom solution
  • Licensing considerations (Grafana OSS is AGPLv3, which matters if a modified Grafana is offered to customers over the network)

Decision: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.

Why MQTT over HTTP?

Advantages:

  • Purpose-built for IoT (lightweight)
  • Pub/sub model (flexible)
  • Quality of Service levels (QoS 0, 1, 2)
  • Better for unreliable networks
  • Lower bandwidth overhead

Disadvantages:

  • One more service to manage
  • Not as universally understood as HTTP

Decision: MQTT for OT data collection, HTTP/REST for management APIs.


Scaling Plan

Server Capacity Thresholds

Add Server #2 When:

  • Server #1 >70% CPU average
  • OR >80% RAM average
  • OR >10 customers on Server #1
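
These thresholds are simple enough to check automatically. The sketch below assumes utilization samples are already available as percentages (for example, exported from Prometheus); the function name and the sample values are illustrative.

```python
from statistics import mean


def reasons_to_add_server(cpu_samples, ram_samples, customer_count):
    """Return the list of thresholds that Server #1 currently exceeds."""
    reasons = []
    if mean(cpu_samples) > 70:
        reasons.append("cpu")          # average CPU above 70%
    if mean(ram_samples) > 80:
        reasons.append("ram")          # average RAM above 80%
    if customer_count > 10:
        reasons.append("customers")    # more than 10 customers
    return reasons


# Hypothetical week of averaged utilization readings.
print(reasons_to_add_server([65, 72, 78], [60, 62, 64], customer_count=7))
```

An empty list means no threshold is exceeded; any entry is a prompt to start provisioning.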

Add Server #3 When:

  • Combined >70% capacity
  • OR >20 total customers
  • OR geographic distribution (West Coast + East Coast servers)

Database Scaling

InfluxDB Scaling:

  • Start: Single node per customer container
  • Scale: Consider InfluxDB clustering (Enterprise) if needed
  • Alternative: TimescaleDB for SQL-familiar customers

Backup Scaling:

  • Start: Daily backups to local disk
  • Scale: Offsite backup to object storage (S3-compatible)
  • Future: Real-time replication to hot standby

Security Best Practices

Server Hardening

  • Disable root SSH login; allow key-based authentication only
  • Fail2ban configured
  • UFW firewall (only necessary ports)
  • Automated security updates
  • Regular security audits (quarterly)

Application Security

  • TLS/SSL everywhere (Let's Encrypt)
  • Strong passwords (generated, stored in 1Password)
  • API keys rotated (quarterly)
  • Container isolation verified
  • Database encryption at rest
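
Generated credentials for Mosquitto, InfluxDB, and Grafana accounts should come from a cryptographically secure source. A minimal sketch using Python's `secrets` module; the alphabet, default length, and minimum length are assumptions, not a stated policy.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + "-_!@#%^*"


def generate_password(length=24):
    """Generate a random password containing lower, upper, and digit characters."""
    if length < 12:
        raise ValueError("use at least 12 characters")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # Retry in the rare case a character class is missing.
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)):
            return candidate


print(generate_password())
```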

Compliance Considerations

  • GDPR (if EU customers): Data residency, right to deletion
  • HIPAA (if handling protected health information, e.g. for medical device customers): BAA required, encryption
  • ISO 27001 (future): Information security management

Tools & Subscriptions

Required (Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| GTHost Server #1 | Infrastructure | $100-150 |
| Domain + DNS | yourdomain.com | $1-2 |
| Email (G Suite or similar) | Professional email | $6-12 |

Total: $107-164/month

Optional (Free/Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| 1Password | Password management | $0 (personal) |
| Wasabi | Offsite backups | $6/TB |
| UptimeRobot | External monitoring | $0 (free tier) |
| Stripe | Payment processing | 2.9% + $0.30 per transaction |
| Twilio | SMS alerts | Pay-as-you-go |

Documentation Strategy

Internal Documentation

  • Runbooks (how to deploy, backup, restore)
  • Architecture diagrams (network, data flow)
  • Troubleshooting guides
  • Security incident response plan

Customer Documentation

  • User guide (how to access dashboards)
  • FAQ (common questions)
  • Alert configuration guide
  • Troubleshooting (basic)

Format: Markdown in Git repository (easy to version, search)


Last Updated: December 2025