jester/venture

jester d733b600bb Add technical stack and architecture document

2025-12-17 00:59:56 +00:00

Technical Stack & Architecture

Overview

This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.

Phase 1: Consulting (Customer Hardware)

Deployment Environment

Location: Customer premises
Hardware: Customer-provided or recommended purchase

Raspberry Pi 4 (8GB): $75-100
OR repurposed industrial PC (free if available)
OR Intel NUC ($300-800)

Software Stack

Operating System:

Ubuntu Server 24.04 LTS (free)
Debian 12 (alternative, free)

Container Platform:

Docker (free, easier for customer maintenance)
OR LXC (free, lower overhead)

MQTT Broker:

Eclipse Mosquitto (free, open source)
Configuration: Local network only, authenticated users
Port: 1883 (or 8883 for TLS)

Time-Series Database:

InfluxDB 2.x OSS (free, open source)
Alternative: TimescaleDB (PostgreSQL extension, free)
Retention: 30-90 days default

Visualization:

Grafana OSS (free, open source)
Dashboards: Production line overview, OEE tracking, downtime analysis
Alerts: Email/SMS via SMTP or webhook
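
The OEE tracking dashboard needs a single number per shift or line to plot. Below is a minimal sketch of the standard OEE calculation (availability × performance × quality), assuming per-shift counts are already collected. The `ShiftCounts` fields and the `oee` function are illustrative names, not part of Grafana or InfluxDB.

```python
from dataclasses import dataclass


@dataclass
class ShiftCounts:
    """Raw counts for one shift (field names are assumptions for this sketch)."""
    planned_minutes: float      # scheduled production time
    downtime_minutes: float     # unplanned stops within the scheduled time
    ideal_cycle_seconds: float  # ideal time to produce one part
    total_parts: int            # all parts produced, good or bad
    good_parts: int             # parts that passed quality checks


def oee(c: ShiftCounts) -> dict:
    """Return availability, performance, quality and their product (OEE)."""
    run_minutes = c.planned_minutes - c.downtime_minutes
    if c.planned_minutes <= 0 or run_minutes <= 0 or c.total_parts <= 0:
        # Nothing ran or nothing was produced: report zeros instead of dividing by zero.
        return {"availability": 0.0, "performance": 0.0, "quality": 0.0, "oee": 0.0}
    availability = run_minutes / c.planned_minutes
    performance = (c.ideal_cycle_seconds * c.total_parts) / (run_minutes * 60)
    quality = c.good_parts / c.total_parts
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


if __name__ == "__main__":
    shift = ShiftCounts(480, 60, 30, 700, 665)
    print(oee(shift))
```

The three factors are returned separately so a dashboard can show which one is dragging OEE down.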

PLC Integration:

Node-RED (free, visual programming)
- node-red-contrib-modbus
- node-red-contrib-opcua
- node-red-contrib-s7
Alternative: Python scripts
- pymodbus (Modbus TCP/RTU)
- asyncua (OPC UA; successor to the python-opcua library)
- python-snap7 (Siemens S7)
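
Whichever client library reads the PLC, the Python route usually ends with raw 16-bit register values that still have to be turned into engineering units. Below is a minimal sketch using only the standard library. The helper names, the word order, and the 4000-20000 raw range are assumptions that must be checked against the actual device documentation.

```python
import struct


def registers_to_float(high: int, low: int, word_swap: bool = False) -> float:
    """Combine two 16-bit registers into one IEEE-754 float32.

    Many devices send the high word first; some send the low word first,
    in which case pass word_swap=True.
    """
    if word_swap:
        high, low = low, high
    raw = struct.pack(">HH", high, low)  # two big-endian unsigned 16-bit words
    return struct.unpack(">f", raw)[0]   # reinterpret the 4 bytes as a float32


def scale_raw(raw: int, raw_min: int, raw_max: int,
              eng_min: float, eng_max: float) -> float:
    """Linearly map a raw integer reading onto an engineering range."""
    if raw_max == raw_min:
        raise ValueError("raw range must not be empty")
    fraction = (raw - raw_min) / (raw_max - raw_min)
    return eng_min + fraction * (eng_max - eng_min)


if __name__ == "__main__":
    # 0x41CC0000 is the float32 encoding of 25.5
    print(registers_to_float(0x41CC, 0x0000))
    # Hypothetical pressure sensor: raw 4000..20000 maps to 0..10 bar
    print(scale_raw(12000, 4000, 20000, 0.0, 10.0))
```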

Network Architecture

Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile

Security:

Isolated VLAN (recommended)
Firewall rules (only necessary ports)
HTTPS/TLS for Grafana (Let's Encrypt via DNS-01 challenge, or an internal CA if the gateway has no public DNS name)
VPN for remote access (WireGuard or OpenVPN)

Protocols Supported

OT Protocols:

Modbus TCP/RTU
OPC UA
Siemens S7 (via Snap7)
EtherNet/IP (Allen-Bradley)
BACnet (building automation)
Profinet (via OPC UA gateway)

IT Protocols:

MQTT (primary message bus)
HTTP/REST APIs
HTTPS for dashboards
SMTP for email alerts

Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)

Infrastructure

GTHost Dedicated Server #1:

Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month

Operating System:

Ubuntu Server 24.04 LTS
Automated updates (unattended-upgrades)
Fail2ban for security
UFW firewall configured

Multi-Tenant Architecture

Container Platform:

LXC (Linux Containers)
- Lower overhead than Docker
- Better for long-running services
- Kernel-level isolation
- Proven from ZeroLagHub experience

Container Template:

Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)

Resource Allocation per Customer:

CPU: 1-2 cores (burstable)
RAM: 2-4GB
Disk: 50-100GB
Network: Shared 1Gbps

Capacity Planning:

8-10 basic customers per server
5-8 customers if heavy data volume
Monitor: CPU, RAM, disk I/O, network
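
The capacity figures above can be sanity-checked with a small calculation. The sketch below finds the limiting resource for a given per-customer allocation. The CPU overcommit factor (2x, reflecting "burstable" cores) and the host reserves (4GB RAM, 100GB disk) are assumptions introduced for illustration, not measured values.

```python
import math
from dataclasses import dataclass


@dataclass
class Server:
    cores: int = 8        # Server #1 configuration from this document
    ram_gb: float = 32
    disk_gb: float = 1000


@dataclass
class TenantProfile:
    cores: float
    ram_gb: float
    disk_gb: float


def max_tenants(server: Server, tenant: TenantProfile,
                cpu_overcommit: float = 2.0,   # assumption: burstable CPU
                host_ram_gb: float = 4.0,      # assumption: reserved for the host
                host_disk_gb: float = 100.0):  # assumption: reserved for the host
    """Return (tenant count, limiting resource) for one server."""
    limits = {
        "cpu": math.floor(server.cores * cpu_overcommit / tenant.cores),
        "ram": math.floor((server.ram_gb - host_ram_gb) / tenant.ram_gb),
        "disk": math.floor((server.disk_gb - host_disk_gb) / tenant.disk_gb),
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


if __name__ == "__main__":
    print(max_tenants(Server(), TenantProfile(1.5, 3, 75)))   # mid-range allocation
    print(max_tenants(Server(), TenantProfile(2, 4, 100)))    # top-end allocation
```

Under these assumptions a mid-range allocation gives 9 customers and a top-end allocation gives 7, both limited by RAM, which is in line with the planning ranges listed above.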

Networking & Security

Network Architecture:

Internet
    ↓
Caddy Reverse Proxy (TLS termination)
    ↓
LXC Bridge (internal network)
    ├── Customer 1 Container (192.168.100.10)
    ├── Customer 2 Container (192.168.100.11)
    ├── Customer 3 Container (192.168.100.12)
    └── ...

Subdomain Structure:

customer1.yourdomain.com → Grafana dashboard
customer2.yourdomain.com → Grafana dashboard
mqtt.yourdomain.com → MQTT broker (port per customer)
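
The address and subdomain scheme above is regular enough to derive from a customer number. Below is a minimal sketch, assuming customer 1 maps to 192.168.100.10 as in the diagram. The /24 bridge size and the MQTT port rule (8883 plus the customer index) are assumptions, since the document only says "port per customer".

```python
import ipaddress

BRIDGE_NET = ipaddress.ip_network("192.168.100.0/24")  # assumption: /24 bridge
FIRST_HOST_OFFSET = 10   # customer 1 gets .10, as in the diagram
MQTT_BASE_PORT = 8883    # assumption: customer 1 gets 8883, customer 2 gets 8884, ...


def tenant_endpoints(customer_number: int, domain: str = "yourdomain.com") -> dict:
    """Derive the container IP, dashboard hostname and MQTT port for a customer."""
    if customer_number < 1:
        raise ValueError("customer numbers start at 1")
    offset = FIRST_HOST_OFFSET + customer_number - 1
    if offset >= BRIDGE_NET.num_addresses - 1:  # stay below the broadcast address
        raise ValueError("bridge network is full")
    return {
        "ip": str(BRIDGE_NET.network_address + offset),
        "dashboard": f"customer{customer_number}.{domain}",
        "mqtt_host": f"mqtt.{domain}",
        "mqtt_port": MQTT_BASE_PORT + customer_number - 1,
    }


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(tenant_endpoints(n))
```

Keeping this mapping in one function means the reverse proxy configuration, firewall rules, and container setup can all be generated from the same source.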

Security Features:

TLS/SSL via Let's Encrypt (automated)
Firewall (UFW) - only necessary ports
Fail2ban - brute force protection
Container isolation (LXC namespaces)
VPN access for edge devices (WireGuard)
Backup encryption (GPG)

Data Flow

Customer Site:
    PLC → Node-RED → MQTT (local edge device)
                        ↓
                    (Over VPN or direct connection)
                        ↓
GTHost Server:
    MQTT Broker → InfluxDB → Grafana
                        ↓
                   Alert Engine → Email/SMS

Backup & Disaster Recovery

Backup Strategy:

Automated daily backups (3am UTC)
Retention: 7 daily, 4 weekly, 12 monthly
Storage: GTHost server + offsite (Wasabi/Backblaze B2)
Encrypted with GPG
Automated restore testing (monthly)
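
The 7 daily / 4 weekly / 12 monthly retention rule can be expressed as a pruning function. Below is a minimal sketch under one possible interpretation: keep the newest 7 backups, plus the newest backup in each of the 4 most recent ISO weeks, plus the newest backup in each of the 12 most recent calendar months. The function name and this interpretation are assumptions. Backup tools implement similar policies with their own rules.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily: int = 7, weekly: int = 4, monthly: int = 12) -> set:
    """Return the subset of backup dates that the retention policy keeps."""
    ordered = sorted(set(backup_dates), reverse=True)  # newest first

    def newest_per_bucket(bucket_key, limit):
        """Pick the newest backup from each of the `limit` most recent buckets."""
        seen, chosen = set(), []
        for d in ordered:
            if len(chosen) >= limit:
                break
            key = bucket_key(d)
            if key not in seen:
                seen.add(key)
                chosen.append(d)
        return chosen

    keep = set(ordered[:daily])
    keep.update(newest_per_bucket(lambda d: tuple(d.isocalendar()[:2]), weekly))  # (ISO year, week)
    keep.update(newest_per_bucket(lambda d: (d.year, d.month), monthly))
    return keep


if __name__ == "__main__":
    newest = date(2025, 12, 17)
    all_backups = [newest - timedelta(days=i) for i in range(400)]
    keep = backups_to_keep(all_backups)
    to_delete = sorted(set(all_backups) - keep)
    print(f"keeping {len(keep)} backups, deleting {len(to_delete)}")
```

Anything not in the returned set is a candidate for deletion. Running the function in a dry-run mode before deleting files is a sensible precaution.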

Disaster Recovery:

RTO (Recovery Time Objective): 4 hours
RPO (Recovery Point Objective): 24 hours
Documented restoration procedure
Annual DR test

Monitoring & Alerting

Server Monitoring:

Prometheus + Grafana (internal)
Alerts: CPU >80%, RAM >80%, Disk >85%
UptimeRobot (external monitoring)
PagerDuty or similar (if needed)

Customer Monitoring:

Per-container resource usage
MQTT connection status
InfluxDB query performance
Grafana dashboard access logs

Phase 3: GPU-Powered AI Platform

Infrastructure

GTHost Dedicated Server #2 (AI/Premium Tier):

Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month

Why Tesla P4:

Optimized for AI inference (not training)
8GB VRAM sufficient for production models
Low power consumption (75W)
Good performance/cost ratio

AI/ML Stack

ML Frameworks:

TensorFlow Lite (optimized inference)
PyTorch (model development, optional)
ONNX Runtime (cross-framework inference)
Scikit-learn (traditional ML)

GPU Acceleration:

CUDA 12.x
cuDNN (deep learning primitives)
TensorRT (inference optimization)

Model Serving:

FastAPI (REST API for predictions)
Triton Inference Server (optional, for heavy workloads)
Redis (result caching)

AI Features Architecture

1. Predictive Maintenance:

Sensor Data → Feature Engineering → Model Inference → Alert
   (MQTT)      (Python script)      (TensorFlow)     (Email/SMS)

Models:

Anomaly detection (vibration, temperature patterns)
Failure prediction (time-to-failure models)
Remaining Useful Life (RUL) estimation
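
Before a trained model exists, anomaly detection on a vibration or temperature signal can start from a simple statistical baseline. The sketch below uses a rolling z-score, which is a stand-in technique chosen for illustration, not the TensorFlow model described above. The class name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScore:
    """Flag readings that deviate strongly from the recent rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0, min_history: int = 10):
        self.values = deque(maxlen=window)  # recent normal readings
        self.threshold = threshold          # z-score above which a reading is anomalous
        self.min_history = min_history      # readings required before judging

    def update(self, value: float) -> bool:
        """Add a reading and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.values) >= self.min_history:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Only normal readings extend the baseline, so a spike does not
            # inflate the standard deviation and mask the next one.
            self.values.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingZScore()
    readings = [20.0, 20.2] * 15 + [35.0, 20.1]
    flags = [detector.update(r) for r in readings]
    print(flags.index(True), "is the index of the first anomalous reading")
```

A baseline like this also produces candidate labels for the 4-8 weeks of collected data, which can help when training the later models.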

2. Computer Vision Quality Inspection:

Camera → Image Capture → Preprocessing → Model Inference → Classification
 (HTTP)    (Python)       (OpenCV)       (TensorFlow)      (Pass/Fail)

Models:

Object detection (YOLOv8, Faster R-CNN)
Defect classification (CNN)
OCR (text recognition on parts)

Container Architecture (Phase 3)

Premium Customer Container:

├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)

Resource Allocation (Premium):

CPU: 4-8 cores
RAM: 16-32GB
GPU: Shared (time-sliced; MIG partitioning is not available on the Tesla P4 or T1000 and would require an A100-class or newer GPU)
Disk: 200-500GB

Model Development Workflow

Development (Offline):

Collect customer data (4-8 weeks)
Feature engineering and labeling
Model training (local GPU or cloud)
Model validation (accuracy, false positives)
Export to ONNX or TensorFlow Lite

Deployment:

Upload model to server
A/B test against baseline
Monitor inference latency and accuracy
Gradual rollout to production
Continuous monitoring
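
The A/B test and gradual rollout steps both need a stable way to decide which devices see the new model. Below is a minimal sketch using a hash-based bucket, so the same device always gets the same answer and raising the percentage only adds devices. The function name, the `salt` value, and the use of a device ID as the key are assumptions.

```python
import hashlib


def use_candidate_model(device_id: str, rollout_percent: int, salt: str = "model-v2") -> bool:
    """Return True if this device should be served by the candidate model.

    The device ID is hashed into a bucket from 0 to 99. Devices whose bucket
    is below rollout_percent get the candidate, all others keep the baseline.
    """
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    devices = [f"press-{i}" for i in range(1000)]
    for percent in (0, 10, 50, 100):
        share = sum(use_candidate_model(d, percent) for d in devices)
        print(f"{percent}% rollout -> {share} of {len(devices)} devices")
```

Changing the salt for each new model version reshuffles the buckets, so the same devices are not always the first to receive an untested model.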

Data Pipeline (AI Features)

Customer PLCs/Cameras
    ↓
Edge Device (optional preprocessing)
    ↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
    ↓
ML Inference Service (GPU-accelerated)
    ↓
Prediction Results → InfluxDB
    ↓
Grafana Dashboard + Alerts

Development & Deployment Tools

Local Development

Workstation Setup:

Ubuntu 22.04 or macOS
Docker Desktop (for testing containers)
VS Code with extensions:
- Python
- Docker
- YAML
- Grafana dashboards

Testing Environment:

Local LXC or Docker setup
Simulated PLC data (Node-RED)
Small InfluxDB + Grafana instance

CI/CD Pipeline

Source Control:

Git (self-hosted Gitea or GitHub)
Branches: main, development, customer-specific

Automation (Future):

GitHub Actions or Gitea Actions
Automated testing on push
Deployment scripts (Ansible)

Deployment Process (Manual Initially):

Test in local environment
Deploy to staging container
Validate with test data
Deploy to production
Monitor for issues

Technology Decisions & Rationale

Why LXC over Docker?

Advantages:

Lower overhead (runs closer to bare metal)
Better for long-running services (MQTT, databases)
Simpler networking (bridge vs overlay)
Proven from ZeroLagHub experience
Less complexity than Kubernetes

Disadvantages:

Less popular than Docker (smaller community)
Fewer pre-built images
Manual setup required

Decision: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).

Why InfluxDB over Prometheus?

Advantages:

Purpose-built for time-series data
Better query language (Flux/InfluxQL)
Native downsampling and retention policies
Better Grafana integration for industrial data
Can handle high-frequency data (1-10 second resolution)

Disadvantages:

More complex than Prometheus
Heavier resource usage

Decision: InfluxDB for customer data, Prometheus for internal monitoring.

Why Grafana over Custom Dashboard?

Advantages:

Industry standard
Excellent out-of-box visualizations
Plugin ecosystem
Customer familiarity (many have seen it)
Lower development time

Disadvantages:

Not as customizable as custom solution
Licensing considerations (AGPL for self-hosted)

Decision: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.

Why MQTT over HTTP?

Advantages:

Purpose-built for IoT (lightweight)
Pub/sub model (flexible)
Quality of Service levels (QoS 0, 1, 2)
Better for unreliable networks
Lower bandwidth overhead

Disadvantages:

One more service to manage
Not as universally understood as HTTP

Decision: MQTT for OT data collection, HTTP/REST for management APIs.
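
Much of the flexibility of the pub/sub model comes from topic wildcards: `+` matches exactly one topic level and `#` matches all remaining levels. Below is a minimal sketch of that matching rule. The topic names are hypothetical, and the sketch ignores the special handling of topics that start with `$`.

```python
def topic_matches(subscription: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter."""
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(sub_levels):
        if level == "#":
            # '#' is only valid as the last level and matches everything below,
            # including the parent level itself.
            return i == len(sub_levels) - 1
        if i >= len(topic_levels):
            return False  # the filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level does not match
    return len(sub_levels) == len(topic_levels)


if __name__ == "__main__":
    topic = "plant1/line2/press7/telemetry"
    for sub in ("plant1/+/press7/telemetry", "plant1/#", "plant1/+", "plant2/#"):
        print(f"{sub!r:32} matches: {topic_matches(sub, topic)}")
```

With a consistent topic hierarchy, a dashboard can subscribe to one machine while an archiver subscribes to a whole site, without any change on the publishing side.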

Scaling Plan

Server Capacity Thresholds

Add Server #2 When:

Server #1 >70% CPU average
OR >80% RAM average
OR >10 customers on Server #1

Add Server #3 When:

Combined >70% capacity
OR >20 total customers
OR geographic distribution (West Coast + East Coast servers)
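
The Server #2 thresholds above are simple enough to check automatically, for example in a weekly report. Below is a minimal sketch with the thresholds copied from the list. The function name and the idea of returning human-readable reasons are assumptions.

```python
def second_server_triggers(cpu_avg_percent: float, ram_avg_percent: float,
                           customers: int) -> list:
    """Return the reasons (if any) why Server #2 should be added."""
    reasons = []
    if cpu_avg_percent > 70:
        reasons.append("CPU average above 70%")
    if ram_avg_percent > 80:
        reasons.append("RAM average above 80%")
    if customers > 10:
        reasons.append("more than 10 customers")
    return reasons


if __name__ == "__main__":
    reasons = second_server_triggers(cpu_avg_percent=65, ram_avg_percent=82, customers=9)
    print(reasons or "no scaling needed yet")
```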

Database Scaling

InfluxDB Scaling:

Start: Single node per customer container
Scale: Consider InfluxDB clustering (Enterprise) if needed
Alternative: TimescaleDB for SQL-familiar customers

Backup Scaling:

Start: Daily backups to local disk
Scale: Offsite backup to object storage (S3-compatible)
Future: Real-time replication to hot standby

Security Best Practices

Server Hardening

Disable root login over SSH (key-based authentication only)
Fail2ban configured
UFW firewall (only necessary ports)
Automated security updates
Regular security audits (quarterly)

Application Security

TLS/SSL everywhere (Let's Encrypt)
Strong passwords (generated, stored in 1Password)
API keys rotated (quarterly)
Container isolation verified
Database encryption at rest

Compliance Considerations

GDPR (if EU customers): Data residency, right to deletion
HIPAA (if handling protected health information, e.g. for medical-device customers): BAA required, encryption
ISO 27001 (future): Information security management

Tools & Subscriptions

Required (Paid)

Tool	Purpose	Cost/Month
GTHost Server #1	Infrastructure	$100-150
Domain + DNS	yourdomain.com	$1-2
Email (Google Workspace or similar)	Professional email	$6-12

Total: $107-164/month

Optional (Free/Paid)

Tool	Purpose	Cost/Month
1Password	Password management	$0 (personal)
Wasabi	Offsite backups	$6/TB
UptimeRobot	External monitoring	$0 (free tier)
Stripe	Payment processing	2.9% + $0.30
Twilio	SMS alerts	Pay-as-you-go

Documentation Strategy

Internal Documentation

Runbooks (how to deploy, backup, restore)
Architecture diagrams (network, data flow)
Troubleshooting guides
Security incident response plan

Customer Documentation

User guide (how to access dashboards)
FAQ (common questions)
Alert configuration guide
Troubleshooting (basic)

Format: Markdown in Git repository (easy to version, search)

Last Updated: December 2025
