Add stress testing guide with k6 IDE session script and layer breakdown

2026-03-24 23:09:44 +00:00 · 2026-03-24 23:09:44 +00:00 · d47445c4d4
commit d47445c4d4
parent 2a6bfe77dc
1 changed files with 204 additions and 0 deletions
--- a/operations/stress-testing.md
+++ b/operations/stress-testing.md
@@ -0,0 +1,204 @@
+# Stress Testing Guide
+
+## Overview
+
+Four distinct layers need stress testing. Each has different tools, goals,
+and failure modes. Test them in order: the API and IDE proxy (including
+WebSockets) first, then infrastructure capacity, then game servers, then
+the OPNsense router.
+
+---
+
+## Layer 1 — API + IDE Proxy (Start Here)
+
+**Tool:** k6 (https://k6.io)
+
+**Goal:** Validate concurrent IDE sessions, WebSocket stability, proxy
+memory behavior, and Traefik resilience under load.
+
+**Install k6:**
+```bash
+# Ubuntu/Debian
+sudo gpg -k
+sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
+  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
+echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
+  | sudo tee /etc/apt/sources.list.d/k6.list
+sudo apt update && sudo apt install k6
+```
+
+**IDE session script** (`stress/ide-session.js`):
+
+```javascript
+import http from 'k6/http';
+import ws from 'k6/ws';
+import { check, sleep } from 'k6';
+
+// Configure these
+const API_BASE = 'https://10.60.0.245:4000';
+const IDE_HOST = 'dev-6070.zerolaghub.dev';
+const USER_TOKEN = '<valid-jwt>';
+const VMID = 6070;
+
+export const options = {
+  stages: [
+    { duration: '30s', target: 10 },   // ramp up to 10 users
+    { duration: '2m',  target: 10 },   // hold
+    { duration: '30s', target: 30 },   // ramp to 30
+    { duration: '2m',  target: 30 },   // hold
+    { duration: '30s', target: 0 },    // ramp down
+  ],
+  thresholds: {
+    http_req_duration: ['p(95)<2000'],  // 95% of requests under 2s
+    http_req_failed:   ['rate<0.05'],   // less than 5% failure
+  },
+};
+
+export default function () {
+  // Step 1: Get IDE token
+  const tokenRes = http.post(
+    `${API_BASE}/api/dev/${VMID}/ide-token`,
+    null,
+    {
+      headers: {
+        Authorization: `Bearer ${USER_TOKEN}`,
+        'Content-Type': 'application/json',
+      },
+    }
+  );
+
+  check(tokenRes, {
+    'token issued': (r) => r.status === 200,
+    // Only parse the body on success; an error page may not be valid JSON
+    'url present':  (r) => r.status === 200 && r.json().url !== undefined,
+  });
+
+  if (tokenRes.status !== 200) return;
+
+  const { url } = tokenRes.json();
+
+  // Step 2: Bootstrap (follow redirect, get cookie)
+  const bootstrapRes = http.get(url, {
+    redirects: 5,
+    headers: { Host: IDE_HOST },
+  });
+
+  check(bootstrapRes, {
+    'IDE loads': (r) => r.status === 200,
+  });
+
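+  // Step 3: Hold a WebSocket connection for 60s.
+  // res.cookies maps each cookie name to an array of cookie objects, so the
+  // Cookie header has to be built as a string rather than passed as-is.
+  const cookieHeader = Object.keys(bootstrapRes.cookies)
+    .map((name) => `${name}=${bootstrapRes.cookies[name][0].value}`)
+    .join('; ');
+  const wsUrl = `wss://${IDE_HOST}/`;
+  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
+    socket.on('open', () => {
+      socket.setTimeout(() => socket.close(), 60000);
+    });
+    socket.on('error', (e) => console.error('WS error', e.error()));
+  });
+
+  check(wsRes, {
+    'WS upgraded': (r) => r && r.status === 101,
+  });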
+
+  sleep(1);
+}
+```
+
+**Run:**
+```bash
+k6 run stress/ide-session.js
+```
+
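As a rough planning aid, the stage list in the script can be summed to see how long one run takes and what the peak load is. This is a minimal sketch in plain Node.js, not part of the k6 script; `toSeconds` and `summarize` are hypothetical helpers, and the parser only handles single-unit durations such as `30s` or `2m`.

```javascript
// Stage plan copied from the options block in stress/ide-session.js.
const stages = [
  { duration: '30s', target: 10 },
  { duration: '2m', target: 10 },
  { duration: '30s', target: 30 },
  { duration: '2m', target: 30 },
  { duration: '30s', target: 0 },
];

// Hypothetical helper: convert a single-unit duration string to seconds.
function toSeconds(duration) {
  const match = /^(\d+)(s|m|h)$/.exec(duration);
  if (!match) throw new Error(`unsupported duration: ${duration}`);
  const unitSeconds = { s: 1, m: 60, h: 3600 };
  return Number(match[1]) * unitSeconds[match[2]];
}

// Hypothetical helper: total planned time and highest VU target.
function summarize(plan) {
  const totalSec = plan.reduce((sum, s) => sum + toSeconds(s.duration), 0);
  const peakVUs = Math.max(...plan.map((s) => s.target));
  return { totalSec, peakVUs };
}

console.log(summarize(stages)); // { totalSec: 330, peakVUs: 30 }
```

The stage plan alone comes to 330 seconds. Each iteration also holds a WebSocket for 60 seconds, so the real run can finish somewhat later than that.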
+**What to watch:**
+- `http_req_duration`: should stay under 2s at p95
+- `ws_session_duration`: should sit near 60s, the hold time set in the script; shorter sessions mean connections are being dropped
+- API process memory on `zlh-api`: watch for leaks with `pm2 monit`
+- Traefik logs for 502s or connection resets
+- File descriptor count on API host: `lsof -p <pid> | wc -l`
+
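Watching memory by eye makes slow leaks easy to miss. One way to make the check concrete is to sample the API process RSS during the hold stages and fit a line through the samples. The sketch below assumes you collect those samples yourself; the sample shape, the helper names, and the 5 MB/min limit are illustrative assumptions, not values from this guide.

```javascript
// Least-squares slope of RSS over time, in MB per minute.
// samples: [{ tSec: number, rssMB: number }, ...] (hypothetical shape)
function rssSlopeMBPerMin(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, p) => sum + p.tSec, 0) / n;
  const meanR = samples.reduce((sum, p) => sum + p.rssMB, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.tSec - meanT) * (p.rssMB - meanR);
    den += (p.tSec - meanT) ** 2;
  }
  if (den === 0) return 0; // all samples share one timestamp
  return (num / den) * 60; // MB per second -> MB per minute
}

// Assumed limit: flag sustained growth above 5 MB/min during a hold stage.
function looksLikeLeak(samples, limitMBPerMin = 5) {
  return rssSlopeMBPerMin(samples) > limitMBPerMin;
}

// Illustrative data only: a flat process and one growing 0.2 MB per second.
const times = [0, 30, 60, 90, 120];
const steady = times.map((t) => ({ tSec: t, rssMB: 200 }));
const growing = times.map((t) => ({ tSec: t, rssMB: 200 + t * 0.2 }));
console.log(looksLikeLeak(steady), looksLikeLeak(growing));
```

Measuring only during a hold stage matters: memory naturally rises while users ramp up, so a slope taken across a ramp would flag normal behavior.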
+---
+
+## Layer 2 — Infrastructure Capacity
+
+**Goal:** Determine max concurrent dev containers before host degrades.
+
+**Method:** Provision containers via API in batches of 5, watch Proxmox host
+metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes
+swapping.
+
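The batch-and-watch method above can be sketched as a loop with an explicit stop rule. This is an illustration under stated assumptions: `metricsAt` stands in for "provision a batch, then read host metrics", the metric field names are hypothetical, and `fakeHost` is a made-up model used only to exercise the loop.

```javascript
// Threshold from the method above: stop once CPU steal exceeds 10%.
const LIMITS = { cpuStealPct: 10 };

// Returns a reason string when the run should stop, otherwise null.
// `metrics` is a hypothetical snapshot of host readings.
function stopReason(metrics) {
  if (metrics.cpuStealPct > LIMITS.cpuStealPct) return 'cpu steal above 10%';
  if (metrics.swapInPagesPerSec > 0 || metrics.swapOutPagesPerSec > 0) {
    return 'host is swapping';
  }
  return null;
}

// Add `batchSize` containers per step until a limit trips.
function findCapacity(metricsAt, batchSize = 5, maxContainers = 100) {
  let lastHealthy = 0;
  for (let count = batchSize; count <= maxContainers; count += batchSize) {
    const reason = stopReason(metricsAt(count));
    if (reason) return { lastHealthy, stoppedAt: count, reason };
    lastHealthy = count;
  }
  return { lastHealthy, stoppedAt: null, reason: null };
}

// Fake host for illustration only: steal grows linearly with container count.
const fakeHost = (count) => ({
  cpuStealPct: count / 2,
  swapInPagesPerSec: 0,
  swapOutPagesPerSec: 0,
});

console.log(findCapacity(fakeHost));
// { lastHealthy: 20, stoppedAt: 25, reason: 'cpu steal above 10%' }
```

Recording the last healthy count, and not only the count where the limit tripped, gives the number to plan capacity around.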
+**Metrics to watch in Grafana:**
+- Host CPU usage (not per-VM)
+- Memory available
+- Disk I/O on LXC datastore
+- Network throughput on container bridge
+
+**Expected baseline:**
+Current host (Xeon Silver 4116, 192 GB RAM) should comfortably handle 20-30
+concurrent active dev containers. Idle containers cost almost nothing.
+
+---
+
+## Layer 3 — Game Server Load
+
+**Tool:** mineflayer (Minecraft), game-specific bots for others
+
+**Minecraft bot script:**
+```javascript
+const mineflayer = require('mineflayer');
+
+const BOT_COUNT = 20;
+const SERVER_HOST = '<game-container-ip>';
+const SERVER_PORT = 25565;
+
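+for (let i = 0; i < BOT_COUNT; i++) {
+  setTimeout(() => {
+    // These usernames are not real accounts, so this assumes the target
+    // server accepts offline-mode logins (online-mode=false).
+    const bot = mineflayer.createBot({
+      host: SERVER_HOST,
+      port: SERVER_PORT,
+      username: `StressBot${i}`,
+    });
+    let walkTimer = null;
+    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
+    bot.on('kicked', (reason) => console.error(`Bot ${i} kicked:`, reason));
+    // Stop the walk loop when the connection closes so timers do not pile up
+    bot.on('end', () => clearInterval(walkTimer));
+    // Walk forward for 1s out of every 3s to generate movement traffic
+    bot.once('spawn', () => {
+      walkTimer = setInterval(() => {
+        bot.setControlState('forward', true);
+        setTimeout(() => bot.setControlState('forward', false), 1000);
+      }, 3000);
+    });
+  }, i * 500); // stagger joins by 500 ms
+}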
+```
+
+**What to watch:**
+- Agent crash recovery triggering unexpectedly
+- Server TPS (ticks per second): the target is 20; below 15 TPS means the server is struggling
+- Container CPU throttling in Proxmox
+
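If the server reports tick time instead of TPS, the two convert directly: Minecraft targets 20 ticks per second, so each tick has a 50 ms budget. The sketch below applies the 15 TPS threshold from the list above to a milliseconds-per-tick reading. Where that reading comes from is an assumption left to your setup, and the function names are illustrative.

```javascript
const TARGET_TPS = 20; // Minecraft's target tick rate: one tick every 50 ms
const STRUGGLING_TPS = 15; // threshold from the list above

// Convert mean milliseconds-per-tick into TPS, capped at the 20 TPS target.
function tpsFromMspt(mspt) {
  if (mspt <= 0) return TARGET_TPS;
  return Math.min(TARGET_TPS, 1000 / mspt);
}

// Pair the TPS value with a flag for the "struggling" threshold.
function classify(mspt) {
  const tps = tpsFromMspt(mspt);
  return { tps, struggling: tps < STRUGGLING_TPS };
}

console.log(classify(40)); // { tps: 20, struggling: false }
console.log(classify(80)); // { tps: 12.5, struggling: true }
```

Tick time is the more sensitive signal: it climbs toward 50 ms while TPS still reads 20, so it shows load building before players see lag.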
+---
+
+## Layer 4 — OPNsense Router Validation
+
+See `network/opnsense-checklist.md` for router-specific validation.
+
+---
+
+## Recommended Test Order Before Launch
+
+1. API token endpoint under 50 concurrent users (k6)
+2. 20 simultaneous IDE WebSocket sessions (k6)
+3. Provision 10 dev containers simultaneously via API
+4. Minecraft server with 20 bot players for 10 minutes
+5. OPNsense failover test (see network doc)
+
+---
+
+## Key Files
+
+- `stress/ide-session.js` — IDE proxy load test (create in zpack-api repo)
+- `stress/bot-load.js` — Minecraft bot load test
+
+---
+
+## Failure Mode Reference
+
+| Symptom | Likely Cause |
+|---------|--------------|
+| API 502s under load | Proxy running out of connections or file descriptors |
+| WS sessions dropping after ~2min | Traefik entrypoint timeout; raise `transport.respondingTimeouts` (`readTimeout`, `idleTimeout`) |
+| Memory growth on API process | Connection leak in `http-proxy-middleware` under concurrent load |
+| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
+| Minecraft TPS drop with bots | Container CPU limit too low — check LXC resource allocation |