
# Technical Stack & Architecture

## Overview

This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.

---

## Phase 1: Consulting (Customer Hardware)

### Deployment Environment
**Location**: Customer premises
**Hardware**: Customer-provided or recommended purchase
- Raspberry Pi 4 (8GB): $75-100
- OR repurposed industrial PC (free if available)
- OR Intel NUC ($300-800)

### Software Stack

**Operating System:**
- Ubuntu Server 24.04 LTS (free)
- Debian 12 (alternative, free)

**Container Platform:**
- Docker (free, easier for customer maintenance)
- OR LXC (free, lower overhead)

**MQTT Broker:**
- Eclipse Mosquitto (free, open source)
- Configuration: Local network only, authenticated users
- Port: 1883 (or 8883 for TLS)

**Time-Series Database:**
- InfluxDB 2.x OSS (free, open source)
- Alternative: TimescaleDB (PostgreSQL extension, free)
- Retention: 30-90 days default
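
Readings are written to InfluxDB in its line protocol (`measurement,tags fields timestamp`). The following is a minimal sketch of formatting one reading by hand; the measurement name `line_metrics` and the tag names are hypothetical, and in practice Node-RED or a client library would usually build these lines.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB line-protocol string (illustrative sketch)."""
    if not fields:
        raise ValueError("line protocol requires at least one field")

    def esc(text, chars):
        # Backslash-escape the characters that are special in this position.
        for ch in chars:
            text = text.replace(ch, "\\" + ch)
        return text

    def fmt_field(value):
        if isinstance(value, bool):      # check bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"           # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'            # strings are double-quoted

    head = esc(measurement, ", ")
    tag_part = "".join(
        f",{esc(k, ',= ')}={esc(str(v), ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{esc(k, ',= ')}={fmt_field(v)}" for k, v in fields.items())
    return f"{head}{tag_part} {field_part} {timestamp_ns}"
```

Tags are sorted by key here only to make the output deterministic; the timestamp is assumed to be in nanoseconds.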

**Visualization:**
- Grafana OSS (free, open source)
- Dashboards: Production line overview, OEE tracking, downtime analysis
- Alerts: Email/SMS via SMTP or webhook

**PLC Integration:**
- Node-RED (free, visual programming)
  - node-red-contrib-modbus
  - node-red-contrib-opcua
  - node-red-contrib-s7
- Alternative: Python scripts
  - pymodbus (Modbus TCP/RTU)
  - asyncua (OPC UA; successor to the deprecated python-opcua)
  - python-snap7 (Siemens S7)
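
Whichever library reads the PLC, Modbus values arrive as 16-bit register words and have to be decoded before they are published. The sketch below decodes two registers into a 32-bit float and builds an MQTT topic and JSON payload using only the standard library. The big-endian word order, the `site/machine/metric` topic layout, and the payload keys are assumptions; word order in particular varies by device.

```python
import json
import struct


def registers_to_float32(high_word, low_word, word_swap=False):
    """Combine two 16-bit Modbus registers into an IEEE-754 float32.

    Assumes big-endian word order; set word_swap=True for devices that
    send the low word first.
    """
    if word_swap:
        high_word, low_word = low_word, high_word
    raw = struct.pack(">HH", high_word, low_word)
    return struct.unpack(">f", raw)[0]


def build_mqtt_message(site, machine, metric, value, unix_ts):
    """Return (topic, payload) for one reading; the topic layout is hypothetical."""
    topic = f"{site}/{machine}/{metric}"
    payload = json.dumps({"value": round(value, 3), "ts": unix_ts}, separators=(",", ":"))
    return topic, payload
```

The resulting topic and payload can be handed to any MQTT client for publishing to the local Mosquitto broker.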

### Network Architecture

```
Production Floor:
├── PLCs (Allen-Bradley, Siemens, etc.)
│   └── Connected via: Ethernet, Serial, or OPC UA Server
├── Edge Gateway (Raspberry Pi / Industrial PC)
│   ├── Mosquitto MQTT broker
│   ├── InfluxDB time-series database
│   ├── Grafana visualization
│   └── Node-RED PLC integration
└── Local Network
    ├── Operators access via web browser (http://edge-gateway:3000)
    └── Managers access via web browser or mobile
```

**Security:**
- Isolated VLAN (recommended)
- Firewall rules (only necessary ports)
- HTTPS/TLS for Grafana (Let's Encrypt via DNS-01 challenge, or an internal CA if the gateway has no public DNS name)
- VPN for remote access (WireGuard or OpenVPN)

### Protocols Supported

**OT Protocols:**
- Modbus TCP/RTU
- OPC UA
- Siemens S7 (via Snap7)
- EtherNet/IP (Allen-Bradley)
- BACnet (building automation)
- Profinet (via OPC UA gateway)

**IT Protocols:**
- MQTT (primary message bus)
- HTTP/REST APIs
- HTTPS for dashboards
- SMTP for email alerts

---

## Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)

### Infrastructure

**GTHost Dedicated Server #1:**
```
Configuration:
├── CPU: 8 cores (Intel Xeon or AMD EPYC)
├── RAM: 32GB
├── Storage: 1TB NVMe SSD
├── Network: 1Gbps unmetered
├── Location: Choose closest to majority of customers
└── Cost: $100-150/month
```

**Operating System:**
- Ubuntu Server 24.04 LTS
- Automated updates (unattended-upgrades)
- Fail2ban for security
- UFW firewall configured

### Multi-Tenant Architecture

**Container Platform:**
- LXC (Linux Containers)
  - Lightweight system containers (one full OS environment per customer)
  - Well suited to long-running services
  - Isolation via kernel namespaces and cgroups
  - Proven from ZeroLagHub experience

**Container Template:**
```
Customer Container (LXC):
├── Ubuntu 24.04 minimal
├── Mosquitto MQTT broker (isolated)
├── InfluxDB (isolated database)
├── Grafana (customer-specific dashboards)
├── Node-RED (optional, for advanced workflows)
├── Backup agent (automated daily)
└── Resource limits (CPU, RAM, disk)
```

**Resource Allocation per Customer:**
- CPU: 1-2 cores (burstable)
- RAM: 2-4GB
- Disk: 50-100GB
- Network: Shared 1Gbps

**Capacity Planning:**
- 8-10 basic customers per server
- 5-8 customers if heavy data volume
- Monitor: CPU, RAM, disk I/O, network
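
The capacity figures can be sanity-checked against the server specification and per-customer allocations above. The sketch below finds the resource that runs out first; the host reserve and the CPU oversubscription factor (to model "burstable" cores) are assumptions, not measured values.

```python
def max_customers(server, per_customer, host_reserve, cpu_oversubscription=1.0):
    """Return (customer_count, bottleneck_resource, per_resource_limits)."""
    limits = {}
    for resource in ("cpu_cores", "ram_gb", "disk_gb"):
        available = server[resource] - host_reserve.get(resource, 0)
        if resource == "cpu_cores":
            # Burstable cores are modelled as a simple oversubscription factor.
            available *= cpu_oversubscription
        limits[resource] = int(available // per_customer[resource])
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck, limits


SERVER = {"cpu_cores": 8, "ram_gb": 32, "disk_gb": 1000}
HOST_RESERVE = {"cpu_cores": 1, "ram_gb": 4, "disk_gb": 100}   # assumed overhead
BASIC = {"cpu_cores": 1, "ram_gb": 2, "disk_gb": 50}
HEAVY = {"cpu_cores": 2, "ram_gb": 4, "disk_gb": 100}

print(max_customers(SERVER, BASIC, HOST_RESERVE, cpu_oversubscription=1.5))
print(max_customers(SERVER, HEAVY, HOST_RESERVE, cpu_oversubscription=1.5))
```

Under these assumptions CPU is the limiting resource in both cases, at 10 basic or 5 heavy customers, which is in line with the ranges listed above.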

### Networking & Security

**Network Architecture:**
```
Internet
    ↓
Caddy Reverse Proxy (TLS termination)
    ↓
LXC Bridge (internal network)
    ├── Customer 1 Container (192.168.100.10)
    ├── Customer 2 Container (192.168.100.11)
    ├── Customer 3 Container (192.168.100.12)
    └── ...
```

**Subdomain Structure:**
- customer1.yourdomain.com → Grafana dashboard
- customer2.yourdomain.com → Grafana dashboard
- mqtt.yourdomain.com → MQTT broker (port per customer)
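
The addressing scheme above is regular enough to derive from a customer number. A minimal sketch, assuming customers are numbered from 1, container IPs start at `.10` on the bridge as in the diagram, and each customer's MQTT listener uses a port offset from a hypothetical base of 8883:

```python
import ipaddress

BRIDGE_NETWORK = ipaddress.ip_network("192.168.100.0/24")
FIRST_HOST_OFFSET = 10   # Customer 1 gets .10, as in the diagram
MQTT_BASE_PORT = 8883    # hypothetical: customer N uses 8883 + (N - 1)


def tenant_endpoints(customer_number, domain="yourdomain.com"):
    """Derive container IP, dashboard hostname and MQTT endpoint for a tenant."""
    if customer_number < 1:
        raise ValueError("customer numbers start at 1")
    offset = FIRST_HOST_OFFSET + customer_number - 1
    # Stop before the broadcast address of the bridge subnet.
    if offset >= BRIDGE_NETWORK.num_addresses - 1:
        raise ValueError("bridge subnet exhausted")
    return {
        "container_ip": str(BRIDGE_NETWORK.network_address + offset),
        "dashboard_host": f"customer{customer_number}.{domain}",
        "mqtt_endpoint": f"mqtt.{domain}:{MQTT_BASE_PORT + customer_number - 1}",
    }
```

Keeping this mapping in one function means the reverse proxy configuration, firewall rules, and container provisioning can all be generated from the same source.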

**Security Features:**
- TLS/SSL via Let's Encrypt (automated)
- Firewall (UFW) - only necessary ports
- Fail2ban - brute force protection
- Container isolation (LXC namespaces)
- VPN access for edge devices (WireGuard)
- Backup encryption (GPG)

### Data Flow

```
Customer Site:
    PLC → Node-RED → MQTT (local edge device)
                        ↓
                    (Over VPN or direct connection)
                        ↓
GTHost Server:
    MQTT Broker → InfluxDB → Grafana
                        ↓
                   Alert Engine → Email/SMS
```

### Backup & Disaster Recovery

**Backup Strategy:**
- Automated daily backups (3am UTC)
- Retention: 7 daily, 4 weekly, 12 monthly
- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
- Encrypted with GPG
- Automated restore testing (monthly)
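
The 7 daily / 4 weekly / 12 monthly retention rule can be expressed as a pruning function. The sketch below uses one common interpretation (keep the newest backup in each of the most recent 7 days, 4 ISO weeks, and 12 calendar months); that interpretation is an assumption, and an actual backup tool may count buckets differently.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates retained under a daily/weekly/monthly policy."""
    rules = [
        (daily, lambda d: d.toordinal()),          # one bucket per calendar day
        (weekly, lambda d: d.isocalendar()[:2]),   # (ISO year, ISO week)
        (monthly, lambda d: (d.year, d.month)),    # calendar month
    ]
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set()
    for limit, bucket_of in rules:
        seen = set()
        for d in newest_first:
            bucket = bucket_of(d)
            if bucket in seen:
                continue            # a newer backup already represents this bucket
            if len(seen) == limit:
                break               # enough buckets for this rule
            seen.add(bucket)
            keep.add(d)
    return keep


# Example: 400 consecutive daily backups ending 2025-12-31.
all_backups = [date(2025, 12, 31) - timedelta(days=i) for i in range(400)]
to_delete = set(all_backups) - backups_to_keep(all_backups)
```

Everything in `to_delete` is safe to prune under this policy; a backup kept by more than one rule is only stored once.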

**Disaster Recovery:**
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
- Documented restoration procedure
- Annual DR test

### Monitoring & Alerting

**Server Monitoring:**
- Prometheus + Grafana (internal)
- Alerts: CPU >80%, RAM >80%, Disk >85%
- UptimeRobot (external monitoring)
- PagerDuty or similar (if needed)
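
In Prometheus these thresholds would normally be written as alerting rules. The logic itself is simple, and the sketch below shows it in plain Python, with one added assumption: an alert fires only when a metric stays above its threshold for several consecutive samples, to avoid paging on short spikes. The metric names and the window size are hypothetical.

```python
THRESHOLDS = {"cpu_percent": 80.0, "ram_percent": 80.0, "disk_percent": 85.0}


def sustained_breaches(samples, window=3):
    """Return metrics that exceeded their threshold in each of the last `window` samples.

    `samples` is a list of dicts (oldest first), each holding every metric
    named in THRESHOLDS.
    """
    if len(samples) < window:
        return []   # not enough history yet
    recent = samples[-window:]
    return [
        name
        for name, limit in THRESHOLDS.items()
        if all(sample[name] > limit for sample in recent)
    ]
```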

**Customer Monitoring:**
- Per-container resource usage
- MQTT connection status
- InfluxDB query performance
- Grafana dashboard access logs

---

## Phase 3: GPU-Powered AI Platform

### Infrastructure

**GTHost Dedicated Server #2 (AI/Premium Tier):**
```
Configuration:
├── CPU: 16 cores (Intel Xeon or AMD EPYC)
├── RAM: 64GB
├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
├── Storage: 2TB NVMe SSD
├── Network: 1Gbps unmetered
└── Cost: $350-450/month
```

**Why Tesla P4:**
- Optimized for AI inference (not training)
- 8GB VRAM sufficient for production models
- Low power consumption (75W)
- Good performance/cost ratio

### AI/ML Stack

**ML Frameworks:**
- TensorFlow Lite (optimized inference)
- PyTorch (model development, optional)
- ONNX Runtime (cross-framework inference)
- Scikit-learn (traditional ML)

**GPU Acceleration:**
- CUDA 12.x
- cuDNN (deep learning primitives)
- TensorRT (inference optimization)

**Model Serving:**
- FastAPI (REST API for predictions)
- Triton Inference Server (optional, for heavy workloads)
- Redis (result caching)

### AI Features Architecture

**1. Predictive Maintenance:**
```
Sensor Data → Feature Engineering → Model Inference → Alert
   (MQTT)      (Python script)      (TensorFlow)     (Email/SMS)
```

**Models:**
- Anomaly detection (vibration, temperature patterns)
- Failure prediction (time-to-failure models)
- Remaining Useful Life (RUL) estimation
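
Before any trained model exists, anomaly detection can start from a simple statistical baseline. The sketch below is a rolling z-score detector using only the standard library; it is a stand-in for illustration, not the TensorFlow model named in the pipeline, and the window size, minimum sample count, and threshold are assumed values.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Flag readings that deviate strongly from the recent rolling window."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def update(self, value):
        """Return True if `value` is anomalous relative to the current window."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        if not is_anomaly:
            # Only normal readings update the baseline, so a spike
            # does not distort the statistics for later readings.
            self.values.append(value)
        return is_anomaly
```

One detector instance per sensor (for example, per vibration or temperature channel) keeps the baselines independent.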

**2. Computer Vision Quality Inspection:**
```
Camera → Image Capture → Preprocessing → Model Inference → Classification
 (HTTP)    (Python)       (OpenCV)       (TensorFlow)      (Pass/Fail)
```

**Models:**
- Object detection (YOLOv8, Faster R-CNN)
- Defect classification (CNN)
- OCR (text recognition on parts)

### Container Architecture (Phase 3)

**Premium Customer Container:**
```
├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
├── ML inference service (FastAPI + TensorFlow)
├── Feature engineering pipeline
├── Model registry (versioned models)
├── Result database (predictions, alerts)
└── GPU access (controlled, per-customer limits)
```

**Resource Allocation (Premium):**
- CPU: 4-8 cores
- RAM: 16-32GB
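- GPU: Shared via time-slicing (MIG partitioning requires Ampere-class or newer data-center GPUs such as the A100 or A30; the Tesla P4 and T1000 do not support it)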
- Disk: 200-500GB

### Model Development Workflow

**Development (Offline):**
1. Collect customer data (4-8 weeks)
2. Feature engineering and labeling
3. Model training (local GPU or cloud)
4. Model validation (accuracy, false positives)
5. Export to ONNX or TensorFlow Lite

**Deployment:**
1. Upload model to server
2. A/B test against baseline
3. Monitor inference latency and accuracy
4. Gradual rollout to production
5. Continuous monitoring
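
The A/B test and gradual rollout steps need a stable way to decide which model serves which machine. One possible approach, sketched below, hashes a machine identifier into a bucket from 0 to 99 and sends the lowest buckets to the candidate model. The function name, the `machine_id` key, and the model labels are hypothetical.

```python
import hashlib


def assigned_model(machine_id, candidate_percent):
    """Deterministically route a machine to the 'candidate' or 'baseline' model.

    The same machine always lands in the same bucket, so raising
    candidate_percent only moves machines from baseline to candidate.
    """
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "candidate" if bucket < candidate_percent else "baseline"
```

Rolling out then means raising `candidate_percent` in steps (for example 10, 50, 100) while watching inference latency and accuracy, and setting it back to 0 to roll back.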

### Data Pipeline (AI Features)

```
Customer PLCs/Cameras
    ↓
Edge Device (optional preprocessing)
    ↓
MQTT → Feature Store (InfluxDB + PostgreSQL)
    ↓
ML Inference Service (GPU-accelerated)
    ↓
Prediction Results → InfluxDB
    ↓
Grafana Dashboard + Alerts
```

---

## Development & Deployment Tools

### Local Development

**Workstation Setup:**
- Ubuntu 22.04 or macOS
- Docker Desktop (for testing containers)
- VS Code with extensions:
  - Python
  - Docker
  - YAML
  - Grafana dashboards

**Testing Environment:**
- Local LXC or Docker setup
- Simulated PLC data (Node-RED)
- Small InfluxDB + Grafana instance

### CI/CD Pipeline

**Source Control:**
- Git (self-hosted Gitea or GitHub)
- Branches: main, development, customer-specific

**Automation (Future):**
- GitHub Actions or Gitea Actions
- Automated testing on push
- Deployment scripts (Ansible)

**Deployment Process (Manual Initially):**
1. Test in local environment
2. Deploy to staging container
3. Validate with test data
4. Deploy to production
5. Monitor for issues

---

## Technology Decisions & Rationale

### Why LXC over Docker?

**Advantages:**
- Lower overhead (runs closer to bare metal)
- Better for long-running services (MQTT, databases)
- Simpler networking (bridge vs overlay)
- Proven from ZeroLagHub experience
- Less complexity than Kubernetes

**Disadvantages:**
- Less popular than Docker (smaller community)
- Fewer pre-built images
- Manual setup required

**Decision**: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).

### Why InfluxDB over Prometheus?

**Advantages:**
- Purpose-built for time-series data
- Query languages suited to long-range analysis of stored data (Flux, InfluxQL)
- Native downsampling and retention policies
- Better Grafana integration for industrial data
- Can handle high-frequency data (1-10 second resolution)

**Disadvantages:**
- More complex than Prometheus
- Heavier resource usage

**Decision**: InfluxDB for customer data, Prometheus for internal monitoring.

### Why Grafana over Custom Dashboard?

**Advantages:**
- Industry standard
- Excellent out-of-box visualizations
- Plugin ecosystem
- Customer familiarity (many have seen it)
- Lower development time

**Disadvantages:**
- Not as customizable as custom solution
- Licensing considerations (AGPL for self-hosted)

**Decision**: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.

### Why MQTT over HTTP?

**Advantages:**
- Purpose-built for IoT (lightweight)
- Pub/sub model (flexible)
- Quality of Service levels (QoS 0, 1, 2)
- Better for unreliable networks
- Lower bandwidth overhead

**Disadvantages:**
- One more service to manage
- Not as universally understood as HTTP

**Decision**: MQTT for OT data collection, HTTP/REST for management APIs.

---

## Scaling Plan

### Server Capacity Thresholds

**Add a Second Monitoring Server When:**
- Server #1 >70% CPU average
- OR >80% RAM average
- OR >10 customers on Server #1

**Add a Third Monitoring Server When:**
- Combined utilization of the existing servers >70%
- OR >20 total customers
- OR geographic distribution (West Coast + East Coast servers)
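
The first set of thresholds can be checked automatically from monitoring data. A minimal sketch, assuming the averages are computed elsewhere (for example from Prometheus) and passed in as percentages; the function and trigger names are hypothetical:

```python
def scaling_triggers(avg_cpu_percent, avg_ram_percent, customer_count):
    """Return the reasons, if any, to provision another monitoring server."""
    triggers = []
    if avg_cpu_percent > 70:
        triggers.append("cpu")
    if avg_ram_percent > 80:
        triggers.append("ram")
    if customer_count > 10:
        triggers.append("customers")
    return triggers
```

An empty list means no threshold has been crossed; any non-empty result is a prompt to start provisioning, since a new dedicated server takes time to order and configure.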

### Database Scaling

**InfluxDB Scaling:**
- Start: Single node per customer container
- Scale: Consider InfluxDB clustering (Enterprise) if needed
- Alternative: TimescaleDB for SQL-familiar customers

**Backup Scaling:**
- Start: Daily backups to local disk
- Scale: Offsite backup to object storage (S3-compatible)
- Future: Real-time replication to hot standby

---

## Security Best Practices

### Server Hardening
- [ ] Disable root SSH login; allow key-based authentication only
- [ ] Fail2ban configured
- [ ] UFW firewall (only necessary ports)
- [ ] Automated security updates
- [ ] Regular security audits (quarterly)

### Application Security
- [ ] TLS/SSL everywhere (Let's Encrypt)
- [ ] Strong passwords (generated, stored in 1Password)
- [ ] API keys rotated (quarterly)
- [ ] Container isolation verified
- [ ] Database encryption at rest

### Compliance Considerations
- GDPR (if EU customers): Data residency, right to deletion
- HIPAA (if medical devices): BAA required, encryption
- ISO 27001 (future): Information security management

---

## Tools & Subscriptions

### Required (Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| GTHost Server #1 | Infrastructure | $100-150 |
| Domain + DNS | yourdomain.com | $1-2 |
| Email (Google Workspace or similar) | Professional email | $6-12 |

**Total**: $107-164/month

### Optional (Free/Paid)

| Tool | Purpose | Cost/Month |
|------|---------|------------|
| 1Password | Password management | $0 (personal) |
| Wasabi | Offsite backups | $6/TB |
| UptimeRobot | External monitoring | $0 (free tier) |
| Stripe | Payment processing | 2.9% + $0.30 per transaction |
| Twilio | SMS alerts | Pay-as-you-go |

---

## Documentation Strategy

### Internal Documentation
- Runbooks (how to deploy, backup, restore)
- Architecture diagrams (network, data flow)
- Troubleshooting guides
- Security incident response plan

### Customer Documentation
- User guide (how to access dashboards)
- FAQ (common questions)
- Alert configuration guide
- Troubleshooting (basic)

**Format**: Markdown in Git repository (easy to version, search)

---

*Last Updated: December 2025*