Add stress testing guide with k6 IDE session script and layer breakdown

jester 2026-03-24 23:09:44 +00:00
parent 2a6bfe77dc
commit d47445c4d4

@ -0,0 +1,204 @@
# Stress Testing Guide
## Overview
Four distinct layers need stress testing. Each has different tools, goals,
and failure modes. Test them in order: the API and IDE proxy/WebSocket path
first, then infrastructure capacity, then game servers, and finally the
OPNsense router.
---
## Layer 1 — API + IDE Proxy (Start Here)
**Tool:** k6 (https://k6.io)
**Goal:** Validate concurrent IDE sessions, WebSocket stability, proxy
memory behavior, and Traefik resilience under load.
**Install k6:**
```bash
# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
--keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
| sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6
```
**IDE session script** (`stress/ide-session.js`):
```javascript
import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = '<valid-jwt>';
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 }, // ramp up to 10 users
    { duration: '2m', target: 10 },  // hold
    { duration: '30s', target: 30 }, // ramp to 30
    { duration: '2m', target: 30 },  // hold
    { duration: '30s', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.05'],    // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );
  check(tokenRes, {
    'token issued': (r) => r.status === 200,
    // Only parse the body on success so an error page cannot throw here
    'url present': (r) => r.status === 200 && r.json('url') !== undefined,
  });
  if (tokenRes.status !== 200) return;
  const { url } = tokenRes.json();

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });
  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

  // res.cookies maps each cookie name to an array of cookie objects, so it
  // must be serialized into a Cookie header string for the WebSocket handshake
  const cookieHeader = Object.entries(bootstrapRes.cookies)
    .map(([name, values]) => `${name}=${values[0].value}`)
    .join('; ');

  // Step 3: Hold a WebSocket connection for 60s
  const wsUrl = `wss://${IDE_HOST}/`;
  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e));
  });
  // 101 Switching Protocols means the upgrade through the proxy succeeded
  check(wsRes, {
    'WS upgraded': (r) => r && r.status === 101,
  });

  sleep(1);
}
```
**Run:**
```bash
k6 run stress/ide-session.js
```
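
The `stages` array in `stress/ide-session.js` follows a ramp-then-hold pattern, which gets tedious to edit by hand when trying other load levels such as the 50-user run listed under the recommended test order. The helper below is a minimal sketch of generating that array; `buildStages` is a hypothetical name, not part of k6, and its result would be assigned to `options.stages` in the k6 script. Calling it with `[10, 30]` and the default durations reproduces the profile used above.

```javascript
// Build a k6-style `stages` array: for each target, ramp to it, then hold.
// Durations are k6 duration strings such as '30s' or '2m'.
// `buildStages` is a hypothetical helper, not a k6 API.
function buildStages(targets, rampDuration = '30s', holdDuration = '2m') {
  const stages = [];
  for (const target of targets) {
    stages.push({ duration: rampDuration, target }); // ramp to this level
    stages.push({ duration: holdDuration, target }); // hold at this level
  }
  stages.push({ duration: rampDuration, target: 0 }); // ramp down at the end
  return stages;
}

// Same shape as the stages in stress/ide-session.js
const stages = buildStages([10, 30]);
console.log(JSON.stringify(stages, null, 2));
```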
**What to watch:**
- `http_req_duration` — should stay under 2s at p95
- `ws_session_duration` — WebSocket sessions should stay stable
- API process memory on `zlh-api` — watch for leaks: `pm2 monit`
- Traefik logs for 502s or connection resets
- File descriptor count on API host: `lsof -p <pid> | wc -l`
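
Spotting a leak by eye in `pm2 monit` or repeated `lsof` counts is error-prone, so one option is to record a reading at a fixed interval during the hold phases and fit a trend line. The sketch below assumes evenly spaced samples (memory in MB or file descriptor counts); `slopePerSample`, `looksLikeLeak`, and the default limit are illustrative assumptions rather than values from this setup. A flat or noisy series yields a slope near zero, while steady growth yields a clearly positive one.

```javascript
// Least-squares slope of evenly spaced samples (units per sample interval).
function slopePerSample(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, y) => sum + y, 0) / n;
  let num = 0;
  let den = 0;
  samples.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Flag a series as growing when the fitted trend exceeds a chosen limit.
// The default limit of 1 unit per sample is an assumption; tune it per metric.
function looksLikeLeak(samples, maxSlope = 1) {
  return slopePerSample(samples) > maxSlope;
}

// Hypothetical readings taken once per interval during a hold phase
console.log(looksLikeLeak([100, 102, 104, 106])); // steady growth
console.log(looksLikeLeak([200, 201, 199, 200])); // noise around a flat level
```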
---
## Layer 2 — Infrastructure Capacity
**Goal:** Determine max concurrent dev containers before host degrades.
**Method:** Provision containers via API in batches of 5, watch Proxmox host
metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes
swapping.
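
The batch method above can be sketched as a small driver loop. Both callbacks are hypothetical and must be supplied by the caller: `provision(index)` would call the provisioning API for one container, and `hostHealthy()` would return `false` once CPU steal or swapping crosses the stop condition, for example by querying the data source behind Grafana. The `maxContainers` cap is an assumed safety limit, not a figure from this guide. The function reports how many containers were started and how many provision calls failed.

```javascript
// Provision containers in fixed-size batches until the host looks unhealthy.
// `provision` and `hostHealthy` are hypothetical callbacks:
//   provision(index) -> Promise that resolves when one container is ready
//   hostHealthy()    -> Promise<boolean>, false once the stop condition is hit
async function rampContainers({ provision, hostHealthy, batchSize = 5, maxContainers = 50 }) {
  let started = 0;
  let failed = 0;
  while (started < maxContainers) {
    const count = Math.min(batchSize, maxContainers - started);
    const batch = Array.from({ length: count }, (_, i) => provision(started + i));
    // allSettled keeps going when individual provisions fail
    const results = await Promise.allSettled(batch);
    failed += results.filter((r) => r.status === 'rejected').length;
    started += count;
    if (!(await hostHealthy())) break; // stop at the first unhealthy reading
  }
  return { started, failed };
}
```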
**Metrics to watch in Grafana:**
- Host CPU usage (not per-VM)
- Memory available
- Disk I/O on LXC datastore
- Network throughput on container bridge
**Expected baseline:**
Current host (Xeon Silver 4116, 192 GB RAM) should comfortably handle 20-30
concurrent active dev containers. Idle containers cost almost nothing.
---
## Layer 3 — Game Server Load
**Tool:** mineflayer (Minecraft), game-specific bots for others
**Minecraft bot script** (`stress/bot-load.js`):
```javascript
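const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = '<game-container-ip>';
const SERVER_PORT = 25565;

// No `auth` option is set, so the bots are expected to join in offline mode;
// this assumes the target server runs with online-mode=false.
for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });
    let walkTimer = null;

    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));

    // Walk around randomly: face a random direction, then step forward for 1s
    bot.once('spawn', () => {
      walkTimer = setInterval(() => {
        bot.look(Math.random() * 2 * Math.PI, 0);
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);
    });

    // Stop the movement loop once the bot disconnects
    bot.on('end', () => clearInterval(walkTimer));
  }, i * 500); // stagger joins
}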
```
**What to watch:**
- Agent crash recovery triggering unexpectedly
- Server TPS (ticks per second): nominal is 20, and sustained readings below 15 mean the server is struggling
- Container CPU throttling in Proxmox
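
A single TPS reading says little, so it can help to summarize readings taken across the whole bot run against the 15 TPS mark. The sketch below assumes the samples were collected some other way (this guide does not specify how TPS is read); `summarizeTps` and its field names are hypothetical. It reports the worst reading, the average, and the share of readings below the threshold.

```javascript
// Summarize TPS samples against a "struggling" threshold (15 in this guide).
// How the samples are collected is left open; this only does the arithmetic.
function summarizeTps(samples, threshold = 15) {
  if (samples.length === 0) return { min: null, avg: null, belowShare: 0 };
  const min = Math.min(...samples);
  const avg = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  const below = samples.filter((v) => v < threshold).length;
  return { min, avg, belowShare: below / samples.length };
}

// Hypothetical readings from a short run
console.log(summarizeTps([20, 20, 14, 18]));
```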
---
## Layer 4 — OPNsense Router Validation
See `network/opnsense-checklist.md` for router-specific validation.
---
## Recommended Test Order Before Launch
1. API token endpoint under 50 concurrent users (k6)
2. 20 simultaneous IDE WebSocket sessions (k6)
3. Provision 10 dev containers simultaneously via API
4. Minecraft server with 20 bot players for 10 minutes
5. OPNsense failover test (see network doc)
---
## Key Files
- `stress/ide-session.js` — IDE proxy load test (create in zpack-api repo)
- `stress/bot-load.js` — Minecraft bot load test
---
## Failure Mode Reference
| Symptom | Likely Cause |
|---------|--------------|
| API 502s under load | Proxy running out of connections or file descriptors |
| WS sessions dropping after ~2min | Traefik keepalive timeout — increase `respondingTimeouts` |
| Memory growth on API process | Connection leak in `http-proxy-middleware` under concurrent load |
| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
| Minecraft TPS drop with bots | Container CPU limit too low — check LXC resource allocation |