Stress Testing Guide
Overview
Four distinct layers need stress testing. Each has different tools, goals, and failure modes. Test in this order: API first, then proxy/WebSocket, then infrastructure, then game servers, and finally the router.
Layer 1 — API + IDE Proxy (Start Here)
Tool: k6 (https://k6.io)
Goal: Validate concurrent IDE sessions, WebSocket stability, proxy memory behavior, and Traefik resilience under load.
Install k6:
# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
--keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
| sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6
IDE session script (stress/ide-session.js):
import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = '<valid-jwt>';
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 }, // ramp up to 10 users
    { duration: '2m', target: 10 },  // hold
    { duration: '30s', target: 30 }, // ramp to 30
    { duration: '2m', target: 30 },  // hold
    { duration: '30s', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.05'],    // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );
  check(tokenRes, {
    'token issued': (r) => r.status === 200,
    // Only parse the body on success; an error page may not be valid JSON.
    'url present': (r) => r.status === 200 && JSON.parse(r.body).url !== undefined,
  });
  if (tokenRes.status !== 200) return;
  const { url } = JSON.parse(tokenRes.body);

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });
  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

  // Step 3: Hold a WebSocket connection for 60s.
  // bootstrapRes.cookies is an object keyed by cookie name (each value is an
  // array of cookie objects), so serialize it into a Cookie header string.
  const cookieHeader = Object.entries(bootstrapRes.cookies)
    .map(([name, values]) => `${name}=${values[0].value}`)
    .join('; ');
  const wsUrl = `wss://${IDE_HOST}/`;
  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e.error()));
  });
  check(wsRes, {
    'WS upgraded': (r) => r && r.status === 101,
  });
  sleep(1);
}
Run:
k6 run stress/ide-session.js
What to watch:
- http_req_duration: should stay under 2s at p95
- ws_session_duration: WebSocket sessions should stay stable
- API process memory on zlh-api; watch for leaks: pm2 monit
- Traefik logs for 502s or connection resets
- File descriptor count on API host:
lsof -p <pid> | wc -l
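One way to turn the memory and file-descriptor items above into a pass/fail signal is to sample the value at a fixed interval during the k6 run and fit a line through the samples. The sketch below is a minimal least-squares slope calculation. The function name `growthPerMinute`, the sample shape, and the example numbers are assumptions for illustration and are not part of the existing tooling.

```javascript
// Hypothetical helper: estimate growth rate from periodic samples.
// Each sample is { tMs, value }, where tMs is a timestamp in milliseconds
// and value is the measurement (for example RSS in MB or an open FD count).
function growthPerMinute(samples) {
  if (samples.length < 2) return 0;
  const n = samples.length;
  const meanT = samples.reduce((sum, p) => sum + p.tMs, 0) / n;
  const meanV = samples.reduce((sum, p) => sum + p.value, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.tMs - meanT) * (p.value - meanV);
    den += (p.tMs - meanT) ** 2;
  }
  if (den === 0) return 0; // all samples share one timestamp
  return (num / den) * 60000; // units per ms -> units per minute
}

// Illustrative data only: RSS sampled every 30s, climbing 5 MB per sample.
const rssSamples = [0, 1, 2, 3, 4].map((i) => ({
  tMs: i * 30000,
  value: 200 + 5 * i,
}));
console.log('MB per minute:', growthPerMinute(rssSamples).toFixed(2));
```

A slope that stays clearly positive through the hold stages, and does not flatten after ramp-down, is the pattern worth investigating. A brief rise during ramp-up alone is expected.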
Layer 2 — Infrastructure Capacity
Goal: Determine max concurrent dev containers before host degrades.
Method: Provision containers via API in batches of 5, watch Proxmox host metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes swapping.
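The batch-and-observe method above can be sketched as a small loop. This is a sketch under stated assumptions: `provisionOne` and `readHostStats` are hypothetical callbacks the caller must supply (the real provisioning endpoint and the Grafana/Proxmox metric source are not specified here), and the field names `cpuStealPct` and `swapUsedMb` are invented for illustration. Swapping is approximated as swap usage rising above the pre-test baseline.

```javascript
// Hypothetical capacity probe. Both callbacks are supplied by the caller:
//   provisionOne()  -> Promise that resolves once one dev container exists
//   readHostStats() -> Promise<{ cpuStealPct: number, swapUsedMb: number }>
async function findCapacity({
  provisionOne,
  readHostStats,
  batchSize = 5,
  maxContainers = 100,
  settleMs = 0,
}) {
  const baseline = await readHostStats();
  let provisioned = 0;
  while (provisioned < maxContainers) {
    const count = Math.min(batchSize, maxContainers - provisioned);
    // Provision one batch concurrently.
    await Promise.all(Array.from({ length: count }, () => provisionOne()));
    provisioned += count;
    // Let the host settle before reading metrics.
    await new Promise((resolve) => setTimeout(resolve, settleMs));
    const stats = await readHostStats();
    const stealTooHigh = stats.cpuStealPct > 10;
    const swapping = stats.swapUsedMb > baseline.swapUsedMb;
    if (stealTooHigh || swapping) {
      return { provisioned, stoppedBy: stats };
    }
  }
  return { provisioned, stoppedBy: null };
}
```

In a real run, `settleMs` would likely be tens of seconds so that the metrics reflect steady state rather than the provisioning burst. The `maxContainers` cap is a safety limit so the loop cannot run away if the stop condition never triggers.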
Metrics to watch in Grafana:
- Host CPU usage (not per-VM)
- Memory available
- Disk I/O on LXC datastore
- Network throughput on container bridge
Expected baseline: The current host (Xeon Silver 4116, 192 GB) should comfortably handle 20-30 concurrent active dev containers. Idle containers cost almost nothing.
Layer 3 — Game Server Load
Tool: mineflayer (Minecraft), game-specific bots for others
Minecraft bot script:
const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = '<game-container-ip>';
const SERVER_PORT = 25565;

for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });
    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
    // Walk forward for 1s out of every 3s once spawned
    bot.once('spawn', () => {
      setInterval(() => {
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);
    });
  }, i * 500); // stagger joins
}
What to watch:
- Agent crash recovery triggering unexpectedly
- Server TPS (ticks per second): nominal is 20; a sustained value below 15 means the server is struggling
- Container CPU throttling in Proxmox
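If the server does not expose TPS directly, a rough estimate can be derived from how far the world age advances over a known wall-clock interval. The helper below is a minimal sketch: `estimateTps` and `isStruggling` are hypothetical names, and it assumes the bot can read world age in ticks (mineflayer documents a `bot.time.age` property, but verify it against the installed version). The client-side value is only as fresh as the server's last time update, so sample over a window of ten seconds or more.

```javascript
// Hypothetical helper: estimate server TPS from two world-age readings.
// ageStartTicks / ageEndTicks: world age in ticks at the start and end
// elapsedMs: wall-clock milliseconds between the two readings
function estimateTps(ageStartTicks, ageEndTicks, elapsedMs) {
  if (elapsedMs <= 0) {
    throw new RangeError('elapsedMs must be positive');
  }
  return ((ageEndTicks - ageStartTicks) * 1000) / elapsedMs;
}

// Threshold taken from the watch list above: below 15 TPS = struggling.
function isStruggling(tps) {
  return tps < 15;
}

// Illustrative numbers only: 140 ticks advanced in a 10-second window.
const tps = estimateTps(1000, 1140, 10000);
console.log(tps, isStruggling(tps));
```

Comparing the estimate with no bots connected against the estimate with all 20 bots active separates bot-induced load from a server that was already slow.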
Layer 4 — OPNsense Router Validation
See network/opnsense-checklist.md for router-specific validation.
Recommended Test Order Before Launch
- API token endpoint under 50 concurrent users (k6)
- 20 simultaneous IDE WebSocket sessions (k6)
- Provision 10 dev containers simultaneously via API
- Minecraft server with 20 bot players for 10 minutes
- OPNsense failover test (see network doc)
Key Files
- stress/ide-session.js: IDE proxy load test (create in zpack-api repo)
- stress/bot-load.js: Minecraft bot load test
Failure Mode Reference
| Symptom | Likely Cause |
|---|---|
| API 502s under load | Proxy running out of connections or file descriptors |
| WS sessions dropping after ~2min | Traefik keepalive timeout — increase respondingTimeouts |
| Memory growth on API process | Connection leak in http-proxy-middleware under concurrent load |
| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
| Minecraft TPS drop with bots | Container CPU limit too low — check LXC resource allocation |