# Stress Testing Guide

## Overview

Four distinct layers need stress testing. Each has different tools, goals, and failure modes. Test in this order: API first, then proxy/WS, then infrastructure, then game servers, with OPNsense router validation last.

---

## Layer 1: API + IDE Proxy (Start Here)

**Tool:** k6 (https://k6.io)

**Goal:** Validate concurrent IDE sessions, WebSocket stability, proxy memory behavior, and Traefik resilience under load.

**Install k6:**

```bash
# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6
```

**IDE session script** (`stress/ide-session.js`):

```javascript
import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = ''; // set to a valid user token before running
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 }, // ramp up to 10 users
    { duration: '2m', target: 10 },  // hold
    { duration: '30s', target: 30 }, // ramp to 30
    { duration: '2m', target: 30 },  // hold
    { duration: '30s', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.05'],    // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );

  check(tokenRes, {
    'token issued': (r) => r.status === 200,
    // Only parse the body on success so an error page does not throw
    'url present': (r) => r.status === 200 && JSON.parse(r.body).url !== undefined,
  });

  if (tokenRes.status !== 200) return;

  const { url } = JSON.parse(tokenRes.body);

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });

  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

  // Step 3: Hold a WebSocket connection for 60s
  // Response.cookies maps each cookie name to an array of cookie objects,
  // so serialize it into a Cookie header string. If the session cookie is set
  // on an intermediate redirect, it may only be available via http.cookieJar().
  const cookieHeader = Object.entries(bootstrapRes.cookies)
    .map(([name, values]) => `${name}=${values[0].value}`)
    .join('; ');

  const wsUrl = `wss://${IDE_HOST}/`;
  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e.error()));
  });

  check(wsRes, {
    'WS upgraded': (r) => r && r.status === 101,
  });

  sleep(1);
}
```

**Run:**

```bash
k6 run stress/ide-session.js
```

**What to watch:**

- `http_req_duration`: should stay under 2s at p95
- `ws_session_duration`: should hold near 60s for every session; shorter values mean connections are being dropped
- API process memory on `zlh-api`: watch for leaks with `pm2 monit`
- Traefik logs for 502s or connection resets
- File descriptor count on the API host: `lsof -p <PID> | wc -l`, where `<PID>` is the API process ID

---

## Layer 2: Infrastructure Capacity

**Goal:** Determine the maximum number of concurrent dev containers before the host degrades.

**Method:** Provision containers via the API in batches of 5 and watch the Proxmox host metrics in Grafana. Stop when CPU steal exceeds 10% or memory pressure causes swapping.

**Metrics to watch in Grafana:**

- Host CPU usage (not per-VM)
- Memory available
- Disk I/O on LXC datastore
- Network throughput on container bridge

**Expected baseline:** The current host (Xeon Silver 4116, 192 GB RAM) should comfortably handle 20-30 concurrent active dev containers. Idle containers cost almost nothing.

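
The batch-and-stop method for Layer 2 can be sketched as a small driver loop. This is a minimal sketch, not the project's tooling: `provisionOne` and `readHostMetrics` are hypothetical callbacks (the guide does not specify the provisioning endpoint or the metrics source), and `swapInPerSec` is an assumed stand-in for "memory pressure causes swapping".

```javascript
// Sketch: provision containers in batches of 5 until a stop condition trips.
// provisionOne and readHostMetrics are hypothetical, caller-supplied async functions.
const BATCH_SIZE = 5;
const MAX_CPU_STEAL_PCT = 10;

// Stop when CPU steal exceeds 10% or the host has started swapping.
function shouldStop(metrics) {
  return metrics.cpuStealPct > MAX_CPU_STEAL_PCT || metrics.swapInPerSec > 0;
}

async function rampContainers({ provisionOne, readHostMetrics, maxContainers, settleMs = 0 }) {
  let provisioned = 0;
  while (provisioned < maxContainers) {
    const size = Math.min(BATCH_SIZE, maxContainers - provisioned);

    // Launch one batch concurrently and count how many requests failed.
    const results = await Promise.allSettled(
      Array.from({ length: size }, (_, i) => provisionOne(provisioned + i))
    );
    const failures = results.filter((r) => r.status === 'rejected').length;
    provisioned += size - failures;

    // Optionally let the host settle before sampling metrics.
    if (settleMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, settleMs));
    }

    const metrics = await readHostMetrics();
    if (failures > 0 || shouldStop(metrics)) {
      return { provisioned, stoppedEarly: true, failures, metrics };
    }
  }
  return { provisioned, stoppedEarly: false, failures: 0, metrics: null };
}
```

In practice `provisionOne` would wrap the container-provisioning API call and `readHostMetrics` would query whatever backs the Grafana dashboards. The `provisioned` count at the point where `stoppedEarly` is true gives a rough capacity ceiling to compare against the expected baseline.
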

---

## Layer 3: Game Server Load

**Tool:** mineflayer (Minecraft), game-specific bots for others

**Minecraft bot script** (`stress/bot-load.js`):

```javascript
const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = ''; // set to the target server host before running
const SERVER_PORT = 25565;

for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });

    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
    bot.on('kicked', (reason) => console.error(`Bot ${i} kicked:`, reason));

    // Walk forward for 1s every 3s to generate movement load
    bot.once('spawn', () => {
      const walk = setInterval(() => {
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);
      // Stop the movement loop once the bot disconnects
      bot.once('end', () => clearInterval(walk));
    });
  }, i * 500); // stagger joins
}
```

**What to watch:**

- Agent crash recovery triggering unexpectedly
- Server TPS (ticks per second): 20 is nominal, and below 15 TPS means the server is struggling
- Container CPU throttling in Proxmox

---

## Layer 4: OPNsense Router Validation

See `network/opnsense-checklist.md` for router-specific validation.

---

## Recommended Test Order Before Launch

1. API token endpoint under 50 concurrent users (k6)
2. 20 simultaneous IDE WebSocket sessions (k6)
3. Provision 10 dev containers simultaneously via API
4. Minecraft server with 20 bot players for 10 minutes
5. OPNsense failover test (see network doc)

---

## Key Files

- `stress/ide-session.js`: IDE proxy load test (create in zpack-api repo)
- `stress/bot-load.js`: Minecraft bot load test

---

## Failure Mode Reference

| Symptom | Likely Cause |
|---------|--------------|
| API 502s under load | Proxy running out of connections or file descriptors |
| WS sessions dropping after ~2min | Traefik keepalive timeout; increase `respondingTimeouts` |
| Memory growth on API process | Connection leak in `http-proxy-middleware` under concurrent load |
| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
| Minecraft TPS drop with bots | Container CPU limit too low; check LXC resource allocation |
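
For the "memory growth on API process" symptom in the table above, it can help to separate sustained growth from normal fluctuation. The following is a minimal sketch under stated assumptions: the sample shape (`tSec`, `rssBytes`) and the threshold are hypothetical, and the samples could come from any source, such as periodic `process.memoryUsage().rss` readings inside the API process. It fits a least-squares line through the samples and flags a leak only when the slope exceeds the chosen threshold.

```javascript
// Sketch: estimate memory growth rate from periodic samples.
// samples: [{ tSec: number, rssBytes: number }, ...] (hypothetical shape)

// Least-squares slope of rssBytes over tSec, in bytes per second.
function memorySlope(samples) {
  const n = samples.length;
  if (n < 2) return 0;

  const meanT = samples.reduce((sum, p) => sum + p.tSec, 0) / n;
  const meanM = samples.reduce((sum, p) => sum + p.rssBytes, 0) / n;

  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.tSec - meanT) * (p.rssBytes - meanM);
    den += (p.tSec - meanT) ** 2;
  }
  // All samples at the same timestamp: no slope can be computed.
  return den === 0 ? 0 : num / den;
}

// Flag a probable leak when growth exceeds the given rate.
function looksLikeLeak(samples, thresholdBytesPerSec) {
  return memorySlope(samples) > thresholdBytesPerSec;
}
```

Sampling across the full k6 run, including the ramp-down stage, makes the result more meaningful: memory that climbs during the hold stages but falls back after ramp-down is more likely buffering than a leak.
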