diff --git a/operations/stress-testing.md b/operations/stress-testing.md
new file mode 100644
index 0000000..228a63c
--- /dev/null
+++ b/operations/stress-testing.md
@@ -0,0 +1,204 @@
+# Stress Testing Guide
+
+## Overview
+
+Four distinct layers need stress testing. Each has different tools, goals,
+and failure modes. Test in this order: API and IDE proxy first, then
+infrastructure capacity, then game servers, then the OPNsense router.
+
+---
+
+## Layer 1: API + IDE Proxy (Start Here)
+
+**Tool:** k6 (https://k6.io)
+
+**Goal:** Validate concurrent IDE sessions, WebSocket stability, proxy
+memory behavior, and Traefik resilience under load.
+
+**Install k6:**
+```bash
+# Ubuntu/Debian
+sudo gpg -k
+sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
+  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
+echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
+  | sudo tee /etc/apt/sources.list.d/k6.list
+sudo apt update && sudo apt install k6
+```
+
+**IDE session script** (`stress/ide-session.js`):
+
+```javascript
+import http from 'k6/http';
+import ws from 'k6/ws';
+import { check, sleep } from 'k6';
+
+// Configure these
+const API_BASE = 'https://10.60.0.245:4000';
+const IDE_HOST = 'dev-6070.zerolaghub.dev';
+const USER_TOKEN = '';
+const VMID = 6070;
+
+export const options = {
+  stages: [
+    { duration: '30s', target: 10 }, // ramp up to 10 users
+    { duration: '2m', target: 10 }, // hold
+    { duration: '30s', target: 30 }, // ramp to 30
+    { duration: '2m', target: 30 }, // hold
+    { duration: '30s', target: 0 }, // ramp down
+  ],
+  thresholds: {
+    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
+    http_req_failed: ['rate<0.05'], // less than 5% failure
+  },
+};
+
+export default function () {
+  // Step 1: Get IDE token
+  const tokenRes = http.post(
+    `${API_BASE}/api/dev/${VMID}/ide-token`,
+    null,
+    {
+      headers: {
+        Authorization: `Bearer ${USER_TOKEN}`,
+        'Content-Type': 'application/json',
+      },
+    }
+  );
+
+  check(tokenRes, {
+    'token issued': (r) => r.status === 200,
+    // Guard on status so a non-JSON error body cannot throw inside the check
+    'url present': (r) => r.status === 200 && JSON.parse(r.body).url !== undefined,
+  });
+
+  if (tokenRes.status !== 200) return;
+
+  const { url } = JSON.parse(tokenRes.body);
+
+  // Step 2: Bootstrap (follow redirect, get cookie)
+  const bootstrapRes = http.get(url, {
+    redirects: 5,
+    headers: { Host: IDE_HOST },
+  });
+
+  check(bootstrapRes, {
+    'IDE loads': (r) => r.status === 200,
+  });
+
+  // Step 3: Hold a WebSocket connection for 60s.
+  // bootstrapRes.cookies is an object, not a header string, so build the
+  // Cookie header from the VU cookie jar, which also keeps cookies that
+  // were set on redirect responses.
+  const jarCookies = http.cookieJar().cookiesForURL(bootstrapRes.url);
+  const cookieHeader = Object.keys(jarCookies)
+    .map((name) => `${name}=${jarCookies[name][0]}`)
+    .join('; ');
+  const wsUrl = `wss://${IDE_HOST}/`;
+  ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
+    socket.on('open', () => {
+      socket.setTimeout(() => socket.close(), 60000);
+    });
+    socket.on('error', (e) => console.error('WS error', e));
+  });
+
+  sleep(1);
+}
+```
+
+**Run:**
+```bash
+k6 run stress/ide-session.js
+```
+
+**What to watch:**
+- `http_req_duration`: should stay under 2s at p95
+- `ws_session_duration`: sessions should last close to the scripted 60s; shorter values mean dropped connections
+- API process memory on `zlh-api`: watch for leaks with `pm2 monit`
+- Traefik logs for 502s or connection resets
+- File descriptor count on API host: `lsof -p <PID> | wc -l` (PID of the API process)
+
+---
+
+## Layer 2: Infrastructure Capacity
+
+**Goal:** Determine the maximum number of concurrent dev containers before
+the host degrades.
+
+**Method:** Provision containers via API in batches of 5 and watch Proxmox
+host metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes
+swapping.
+
+**Metrics to watch in Grafana:**
+- Host CPU usage (not per-VM)
+- Memory available
+- Disk I/O on LXC datastore
+- Network throughput on container bridge
+
+**Expected baseline:**
+Current host (Xeon Silver 4116, 192GB) should comfortably handle 20-30
+concurrent active dev containers. Idle containers cost almost nothing.
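The batch method described for Layer 2 could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the project's tooling: `provisionInBatches`, `provisionOne`, `hostIsDegraded`, and `settleMs` are hypothetical names. The actual provisioning API call and the check for CPU steal or swapping are left to the caller, since this guide does not specify them.

```javascript
// Sketch only: provisionOne and hostIsDegraded are hypothetical callbacks
// supplied by the caller; neither is an existing API in this project.
async function provisionInBatches({
  total,
  batchSize = 5, // the guide's batch size
  provisionOne, // async (index) => provisions one container
  hostIsDegraded, // async () => true when the stop condition is met
  settleMs = 0, // optional pause so host metrics can catch up
}) {
  const results = [];
  for (let start = 0; start < total; start += batchSize) {
    const size = Math.min(batchSize, total - start);
    // Launch one batch concurrently; allSettled records failures
    // instead of aborting the whole batch on the first rejection.
    const batch = await Promise.allSettled(
      Array.from({ length: size }, (_, i) => provisionOne(start + i))
    );
    results.push(...batch);
    if (settleMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, settleMs));
    }
    // Stop before the next batch once the host shows degradation.
    if (await hostIsDegraded()) break;
  }
  const ok = results.filter((r) => r.status === 'fulfilled').length;
  return { attempted: results.length, ok, failed: results.length - ok };
}
```

The returned counts give the number of containers attempted before the stop condition fired, which is the capacity figure this layer is trying to find.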
+
+---
+
+## Layer 3: Game Server Load
+
+**Tool:** mineflayer (Minecraft), game-specific bots for others
+
+**Minecraft bot script:**
+```javascript
+const mineflayer = require('mineflayer');
+
+// Configure these
+const BOT_COUNT = 20;
+const SERVER_HOST = '';
+const SERVER_PORT = 25565;
+
+for (let i = 0; i < BOT_COUNT; i++) {
+  setTimeout(() => {
+    const bot = mineflayer.createBot({
+      host: SERVER_HOST,
+      port: SERVER_PORT,
+      username: `StressBot${i}`,
+    });
+    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
+    // Walk forward for 1s every 3s to generate movement load
+    bot.once('spawn', () => {
+      setInterval(() => {
+        bot.setControlState('forward', true);
+        setTimeout(() => bot.setControlState('forward', false), 1000);
+      }, 3000);
+    });
+  }, i * 500); // stagger joins by 500ms per bot
+}
+```
+
+**What to watch:**
+- Agent crash recovery triggering unexpectedly
+- Server TPS (ticks per second): nominal is 20; below 15 means the server is struggling
+- Container CPU throttling in Proxmox
+
+---
+
+## Layer 4: OPNsense Router Validation
+
+See `network/opnsense-checklist.md` for router-specific validation.
+
+---
+
+## Recommended Test Order Before Launch
+
+1. API token endpoint under 50 concurrent users (k6)
+2. 20 simultaneous IDE WebSocket sessions (k6)
+3. Provision 10 dev containers simultaneously via API
+4. Minecraft server with 20 bot players for 10 minutes
+5. OPNsense failover test (see `network/opnsense-checklist.md`)
+
+---
+
+## Key Files
+
+- `stress/ide-session.js`: IDE proxy load test (create in zpack-api repo)
+- `stress/bot-load.js`: Minecraft bot load test
+
+---
+
+## Failure Mode Reference
+
+| Symptom | Likely Cause |
+|---------|--------------|
+| API 502s under load | Proxy running out of connections or file descriptors |
+| WS sessions dropping after ~2min | Traefik keepalive timeout; increase `respondingTimeouts` |
+| Memory growth on API process | Connection leak in `http-proxy-middleware` under concurrent load |
+| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
+| Minecraft TPS drop with bots | Container CPU limit too low; check LXC resource allocation |
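For the "Memory growth on API process" row, a trend is easier to judge from a fitted slope than from eyeballing individual readings. The sketch below fits a least-squares line through memory samples and reports growth in MB per minute. The function name `memoryGrowthMBPerMin` and the sample shape (`tSec`, `rssMB`) are assumptions made for illustration; how the samples are collected (for example, noted from `pm2 monit` during the hold stages) and what slope counts as a leak are left to the operator.

```javascript
// Sketch only: the sample format is an assumption, not an existing API.
// samples: [{ tSec: seconds since test start, rssMB: process memory in MB }]
function memoryGrowthMBPerMin(samples) {
  if (samples.length < 2) return 0;
  const n = samples.length;
  const meanT = samples.reduce((sum, p) => sum + p.tSec, 0) / n;
  const meanM = samples.reduce((sum, p) => sum + p.rssMB, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.tSec - meanT) * (p.rssMB - meanM);
    den += (p.tSec - meanT) ** 2;
  }
  // Slope is in MB per second; convert to MB per minute.
  return den === 0 ? 0 : (num / den) * 60;
}
```

A slope that stays clearly positive across both hold stages of the k6 run, rather than flattening once load is steady, is consistent with the connection-leak cause listed in the table.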