From d733b600bb602310a04358a622691646a003317b Mon Sep 17 00:00:00 2001
From: jester
Date: Wed, 17 Dec 2025 00:59:56 +0000
Subject: [PATCH] Add technical stack and architecture document

---
 technical-stack.md | 530 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 530 insertions(+)
 create mode 100644 technical-stack.md

diff --git a/technical-stack.md b/technical-stack.md
new file mode 100644
index 0000000..4f545dc
--- /dev/null
+++ b/technical-stack.md
@@ -0,0 +1,530 @@
+# Technical Stack & Architecture
+
+## Overview
+
+This document outlines the technical infrastructure, tools, and architecture for all three phases of the venture.
+
+---
+
+## Phase 1: Consulting (Customer Hardware)
+
+### Deployment Environment
+**Location**: Customer premises
+**Hardware**: Customer-provided or recommended purchase
+- Raspberry Pi 4 (8GB): $75-100
+- OR repurposed industrial PC (free if available)
+- OR Intel NUC ($300-800)
+
+### Software Stack
+
+**Operating System:**
+- Ubuntu Server 24.04 LTS (free)
+- Debian 12 (alternative, free)
+
+**Container Platform:**
+- Docker (free, easier for customer maintenance)
+- OR LXC (free, lower overhead)
+
+**MQTT Broker:**
+- Eclipse Mosquitto (free, open source)
+- Configuration: Local network only, authenticated users
+- Port: 1883 (or 8883 for TLS)
+
+**Time-Series Database:**
+- InfluxDB 2.x OSS (free, open source)
+- Alternative: TimescaleDB (PostgreSQL extension, free)
+- Retention: 30-90 days default
+
+**Visualization:**
+- Grafana OSS (free, open source)
+- Dashboards: Production line overview, OEE tracking, downtime analysis
+- Alerts: Email/SMS via SMTP or webhook
+
+**PLC Integration:**
+- Node-RED (free, visual programming)
+  - node-red-contrib-modbus
+  - node-red-contrib-opcua
+  - node-red-contrib-s7
+- Alternative: Python scripts
+  - pymodbus (Modbus TCP/RTU)
+  - asyncua (OPC UA)
+  - python-snap7 (Siemens S7)
+
+### Network Architecture
+
+```
+Production Floor:
+├── PLCs (Allen-Bradley, Siemens, etc.)
+│   └── Connected via: Ethernet, Serial, or OPC UA Server
+├── Edge Gateway (Raspberry Pi / Industrial PC)
+│   ├── Mosquitto MQTT broker
+│   ├── InfluxDB time-series database
+│   ├── Grafana visualization
+│   └── Node-RED PLC integration
+└── Local Network
+    ├── Operators access via web browser (http://edge-gateway:3000)
+    └── Managers access via web browser or mobile
+```
+
+**Security:**
+- Isolated VLAN (recommended)
+- Firewall rules (only necessary ports)
+- HTTPS/TLS for Grafana (Let's Encrypt)
+- VPN for remote access (WireGuard or OpenVPN)
+
+### Protocols Supported
+
+**OT Protocols:**
+- Modbus TCP/RTU
+- OPC UA
+- Siemens S7 (via Snap7)
+- EtherNet/IP (Allen-Bradley)
+- BACnet (building automation)
+- Profinet (via OPC UA gateway)
+
+**IT Protocols:**
+- MQTT (primary message bus)
+- HTTP/REST APIs
+- HTTPS for dashboards
+- SMTP for email alerts
+
+---
+
+## Phase 2: Edge Monitoring Platform (GTHost Multi-Tenant)
+
+### Infrastructure
+
+**GTHost Dedicated Server #1:**
+```
+Configuration:
+├── CPU: 8 cores (Intel Xeon or AMD EPYC)
+├── RAM: 32GB
+├── Storage: 1TB NVMe SSD
+├── Network: 1Gbps unmetered
+├── Location: Choose closest to majority of customers
+└── Cost: $100-150/month
+```
+
+**Operating System:**
+- Ubuntu Server 24.04 LTS
+- Automated updates (unattended-upgrades)
+- Fail2ban for security
+- UFW firewall configured
+
+### Multi-Tenant Architecture
+
+**Container Platform:**
+- LXC (Linux Containers)
+  - Lighter weight than Docker
+  - Better for long-running services
+  - Kernel-level isolation
+  - Proven from ZeroLagHub experience
+
+**Container Template:**
+```
+Customer Container (LXC):
+├── Ubuntu 24.04 minimal
+├── Mosquitto MQTT broker (isolated)
+├── InfluxDB (isolated database)
+├── Grafana (customer-specific dashboards)
+├── Node-RED (optional, for advanced workflows)
+├── Backup agent (automated daily)
+└── Resource limits (CPU, RAM, disk)
+```
+
+**Resource Allocation per Customer:**
+- CPU: 1-2 cores (burstable)
+- RAM: 2-4GB
+- Disk: 50-100GB
+- Network: Shared 1Gbps
+
+**Capacity Planning:**
+- 8-10 basic customers per server
+- 5-8 customers if heavy data volume
+- Monitor: CPU, RAM, disk I/O, network
+
+### Networking & Security
+
+**Network Architecture:**
+```
+Internet
+    ↓
+Caddy Reverse Proxy (TLS termination)
+    ↓
+LXC Bridge (internal network)
+    ├── Customer 1 Container (192.168.100.10)
+    ├── Customer 2 Container (192.168.100.11)
+    ├── Customer 3 Container (192.168.100.12)
+    └── ...
+```
+
+**Subdomain Structure:**
+- customer1.yourdomain.com → Grafana dashboard
+- customer2.yourdomain.com → Grafana dashboard
+- mqtt.yourdomain.com → MQTT broker (port per customer)
+
+**Security Features:**
+- TLS/SSL via Let's Encrypt (automated)
+- Firewall (UFW) - only necessary ports
+- Fail2ban - brute force protection
+- Container isolation (LXC namespaces)
+- VPN access for edge devices (WireGuard)
+- Backup encryption (GPG)
+
+### Data Flow
+
+```
+Customer Site:
+  PLC → Node-RED → MQTT (local edge device)
+         ↓
+  (Over VPN or direct connection)
+         ↓
+GTHost Server:
+  MQTT Broker → InfluxDB → Grafana
+         ↓
+  Alert Engine → Email/SMS
+```
+
+### Backup & Disaster Recovery
+
+**Backup Strategy:**
+- Automated daily backups (3am UTC)
+- Retention: 7 daily, 4 weekly, 12 monthly
+- Storage: GTHost server + offsite (Wasabi/Backblaze B2)
+- Encrypted with GPG
+- Automated restore testing (monthly)
+
+**Disaster Recovery:**
+- RTO (Recovery Time Objective): 4 hours
+- RPO (Recovery Point Objective): 24 hours
+- Documented restoration procedure
+- Annual DR test
+
+### Monitoring & Alerting
+
+**Server Monitoring:**
+- Prometheus + Grafana (internal)
+- Alerts: CPU >80%, RAM >80%, Disk >85%
+- UptimeRobot (external monitoring)
+- PagerDuty or similar (if needed)
+
+**Customer Monitoring:**
+- Per-container resource usage
+- MQTT connection status
+- InfluxDB query performance
+- Grafana dashboard access logs
+
+---
+
+## Phase 3: GPU-Powered AI Platform
+
+### Infrastructure
+
+**GTHost Dedicated Server #2 (AI/Premium Tier):**
+```
+Configuration:
+├── CPU: 16 cores (Intel Xeon or AMD EPYC)
+├── RAM: 64GB
+├── GPU: NVIDIA Tesla P4 8GB (or T1000 8GB)
+├── Storage: 2TB NVMe SSD
+├── Network: 1Gbps unmetered
+└── Cost: $350-450/month
+```
+
+**Why Tesla P4:**
+- Optimized for AI inference (not training)
+- 8GB VRAM sufficient for production models
+- Low power consumption (75W)
+- Good performance/cost ratio
+
+### AI/ML Stack
+
+**ML Frameworks:**
+- TensorFlow Lite (optimized inference)
+- PyTorch (model development, optional)
+- ONNX Runtime (cross-framework inference)
+- Scikit-learn (traditional ML)
+
+**GPU Acceleration:**
+- CUDA 12.x
+- cuDNN (deep learning primitives)
+- TensorRT (inference optimization)
+
+**Model Serving:**
+- FastAPI (REST API for predictions)
+- Triton Inference Server (optional, for heavy workloads)
+- Redis (result caching)
+
+### AI Features Architecture
+
+**1. Predictive Maintenance:**
+```
+Sensor Data → Feature Engineering → Model Inference → Alert
+  (MQTT)       (Python script)       (TensorFlow)     (Email/SMS)
+```
+
+**Models:**
+- Anomaly detection (vibration, temperature patterns)
+- Failure prediction (time-to-failure models)
+- Remaining Useful Life (RUL) estimation
+
+**2. Computer Vision Quality Inspection:**
+```
+Camera → Image Capture → Preprocessing → Model Inference → Classification
+(HTTP)     (Python)        (OpenCV)       (TensorFlow)      (Pass/Fail)
+```
+
+**Models:**
+- Object detection (YOLOv8, Faster R-CNN)
+- Defect classification (CNN)
+- OCR (text recognition on parts)
+
+### Container Architecture (Phase 3)
+
+**Premium Customer Container:**
+```
+├── Basic monitoring stack (MQTT, InfluxDB, Grafana)
+├── ML inference service (FastAPI + TensorFlow)
+├── Feature engineering pipeline
+├── Model registry (versioned models)
+├── Result database (predictions, alerts)
+└── GPU access (controlled, per-customer limits)
+```
+
+**Resource Allocation (Premium):**
+- CPU: 4-8 cores
+- RAM: 16-32GB
+- GPU: Shared (time-sliced; MIG partitioning needs a MIG-capable GPU, which the Tesla P4 is not)
+- Disk: 200-500GB
+
+### Model Development Workflow
+
+**Development (Offline):**
+1. Collect customer data (4-8 weeks)
+2. Feature engineering and labeling
+3. Model training (local GPU or cloud)
+4. Model validation (accuracy, false positives)
+5. Export to ONNX or TensorFlow Lite
+
+**Deployment:**
+1. Upload model to server
+2. A/B test against baseline
+3. Monitor inference latency and accuracy
+4. Gradual rollout to production
+5. Continuous monitoring
+
+### Data Pipeline (AI Features)
+
+```
+Customer PLCs/Cameras
+    ↓
+Edge Device (optional preprocessing)
+    ↓
+MQTT → Feature Store (InfluxDB + PostgreSQL)
+    ↓
+ML Inference Service (GPU-accelerated)
+    ↓
+Prediction Results → InfluxDB
+    ↓
+Grafana Dashboard + Alerts
+```
+
+---
+
+## Development & Deployment Tools
+
+### Local Development
+
+**Workstation Setup:**
+- Ubuntu 22.04 or macOS
+- Docker Desktop (for testing containers)
+- VS Code with extensions:
+  - Python
+  - Docker
+  - YAML
+  - Grafana dashboards
+
+**Testing Environment:**
+- Local LXC or Docker setup
+- Simulated PLC data (Node-RED)
+- Small InfluxDB + Grafana instance
+
+### CI/CD Pipeline
+
+**Source Control:**
+- Git (self-hosted Gitea or GitHub)
+- Branches: main, development, customer-specific
+
+**Automation (Future):**
+- GitHub Actions or Gitea Actions
+- Automated testing on push
+- Deployment scripts (Ansible)
+
+**Deployment Process (Manual Initially):**
+1. Test in local environment
+2. Deploy to staging container
+3. Validate with test data
+4. Deploy to production
+5. Monitor for issues
+
+---
+
+## Technology Decisions & Rationale
+
+### Why LXC over Docker?
+
+**Advantages:**
+- Lower overhead (runs closer to bare metal)
+- Better for long-running services (MQTT, databases)
+- Simpler networking (bridge vs overlay)
+- Proven from ZeroLagHub experience
+- Less complexity than Kubernetes
+
+**Disadvantages:**
+- Less popular than Docker (smaller community)
+- Fewer pre-built images
+- Manual setup required
+
+**Decision**: Use LXC for multi-tenant platform, Docker for customer edge deployments (easier for them to maintain).
+
+### Why InfluxDB over Prometheus?
+
+**Advantages:**
+- Purpose-built for time-series data
+- Better query language (Flux/InfluxQL)
+- Native downsampling and retention policies
+- Better Grafana integration for industrial data
+- Can handle high-frequency data (1-10 second resolution)
+
+**Disadvantages:**
+- More complex than Prometheus
+- Heavier resource usage
+
+**Decision**: InfluxDB for customer data, Prometheus for internal monitoring.
+
+### Why Grafana over Custom Dashboard?
+
+**Advantages:**
+- Industry standard
+- Excellent out-of-the-box visualizations
+- Plugin ecosystem
+- Customer familiarity (many have seen it)
+- Lower development time
+
+**Disadvantages:**
+- Not as customizable as a custom solution
+- Licensing considerations (AGPL for self-hosted)
+
+**Decision**: Grafana for Phase 1-2, consider custom dashboard in Phase 3 if needed.
+
+### Why MQTT over HTTP?
+
+**Advantages:**
+- Purpose-built for IoT (lightweight)
+- Pub/sub model (flexible)
+- Quality of Service levels (QoS 0, 1, 2)
+- Better for unreliable networks
+- Lower bandwidth overhead
+
+**Disadvantages:**
+- One more service to manage
+- Not as universally understood as HTTP
+
+**Decision**: MQTT for OT data collection, HTTP/REST for management APIs.
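
The "lower bandwidth overhead" point in the MQTT comparison above can be made concrete with a rough per-message byte count. The sketch below hand-encodes an MQTT 3.1.1 PUBLISH packet and a minimal HTTP/1.1 POST for the same reading, using only the Python standard library. The topic, host name, URL path, and sensor values are hypothetical placeholders, not part of this stack, and the comparison assumes an already established connection (it ignores TCP/TLS framing, the MQTT CONNECT handshake, and HTTP response bytes).

```python
import json


def encode_remaining_length(n: int) -> bytes:
    """Encode the MQTT 3.1.1 'remaining length' field as a variable-length integer."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # continuation bit: more length bytes follow
        out.append(digit)
        if n == 0:
            return bytes(out)


def mqtt_publish_packet(topic: str, payload: bytes, qos: int = 0, packet_id: int = 1) -> bytes:
    """Build a raw MQTT 3.1.1 PUBLISH packet with DUP=0 and RETAIN=0."""
    if qos not in (0, 1, 2):
        raise ValueError("qos must be 0, 1 or 2")
    topic_bytes = topic.encode("utf-8")
    body = len(topic_bytes).to_bytes(2, "big") + topic_bytes
    if qos > 0:
        # The packet identifier is only present for QoS 1 and 2.
        body += packet_id.to_bytes(2, "big")
    body += payload
    fixed_header = bytes([0x30 | (qos << 1)]) + encode_remaining_length(len(body))
    return fixed_header + body


def http_post_request(host: str, path: str, payload: bytes) -> bytes:
    """Build a minimal HTTP/1.1 POST request carrying the same payload."""
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + payload


# Hypothetical reading and naming scheme, for illustration only.
TOPIC = "plant1/line2/press4/temp"
reading = json.dumps({"temp_c": 71.4, "ts": 1765933196}, separators=(",", ":")).encode()

mqtt_bytes = mqtt_publish_packet(TOPIC, reading, qos=1)
http_bytes = http_post_request("edge-gateway", "/api/readings/" + TOPIC, reading)

print(f"payload: {len(reading)} B, MQTT PUBLISH: {len(mqtt_bytes)} B, HTTP POST: {len(http_bytes)} B")
```

The MQTT framing here costs a handful of bytes on top of the topic and payload, while even a stripped-down HTTP request repeats its header block on every message. Real HTTP clients typically send more headers than this, so the gap in practice is likely wider; a production edge device would use an MQTT client library rather than hand-built packets.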
+
+---
+
+## Scaling Plan
+
+### Server Capacity Thresholds
+
+**Add Server #2 When:**
+- Server #1 >70% CPU average
+- OR >80% RAM average
+- OR >10 customers on Server #1
+
+**Add Server #3 When:**
+- Combined >70% capacity
+- OR >20 total customers
+- OR geographic distribution (West Coast + East Coast servers)
+
+### Database Scaling
+
+**InfluxDB Scaling:**
+- Start: Single node per customer container
+- Scale: Consider InfluxDB clustering (Enterprise) if needed
+- Alternative: TimescaleDB for SQL-familiar customers
+
+**Backup Scaling:**
+- Start: Daily backups to local disk
+- Scale: Offsite backup to object storage (S3-compatible)
+- Future: Real-time replication to hot standby
+
+---
+
+## Security Best Practices
+
+### Server Hardening
+- [ ] Disable root login (SSH key only)
+- [ ] Fail2ban configured
+- [ ] UFW firewall (only necessary ports)
+- [ ] Automated security updates
+- [ ] Regular security audits (quarterly)
+
+### Application Security
+- [ ] TLS/SSL everywhere (Let's Encrypt)
+- [ ] Strong passwords (generated, stored in 1Password)
+- [ ] API keys rotated (quarterly)
+- [ ] Container isolation verified
+- [ ] Database encryption at rest
+
+### Compliance Considerations
+- GDPR (if EU customers): Data residency, right to deletion
+- HIPAA (if medical devices): BAA required, encryption
+- ISO 27001 (future): Information security management
+
+---
+
+## Tools & Subscriptions
+
+### Required (Paid)
+
+| Tool | Purpose | Cost/Month |
+|------|---------|------------|
+| GTHost Server #1 | Infrastructure | $100-150 |
+| Domain + DNS | yourdomain.com | $1-2 |
+| Email (Google Workspace or similar) | Professional email | $6-12 |
+
+**Total**: $107-164/month
+
+### Optional (Free/Paid)
+
+| Tool | Purpose | Cost/Month |
+|------|---------|------------|
+| 1Password | Password management | $0 (personal) |
+| Wasabi | Offsite backups | $6/TB |
+| UptimeRobot | External monitoring | $0 (free tier) |
+| Stripe | Payment processing | 2.9% + $0.30 per transaction |
+| Twilio | SMS alerts | Pay-as-you-go |
+
+---
+
+## Documentation Strategy
+
+### Internal Documentation
+- Runbooks (how to deploy, backup, restore)
+- Architecture diagrams (network, data flow)
+- Troubleshooting guides
+- Security incident response plan
+
+### Customer Documentation
+- User guide (how to access dashboards)
+- FAQ (common questions)
+- Alert configuration guide
+- Troubleshooting (basic)
+
+**Format**: Markdown in Git repository (easy to version, search)
+
+---
+
+*Last Updated: December 2025*