Add stress testing guide with k6 IDE session script and layer breakdown
parent
2a6bfe77dc
commit
d47445c4d4
204
operations/stress-testing.md
Normal file
@ -0,0 +1,204 @@
# Stress Testing Guide

## Overview

Four distinct layers need stress testing. Each has different tools, goals,
and failure modes. Test them in this order: API and IDE proxy/WebSocket first,
then infrastructure capacity, then game servers, then the OPNsense router.

---

## Layer 1: API + IDE Proxy (Start Here)

**Tool:** k6 (https://k6.io)

**Goal:** Validate concurrent IDE sessions, WebSocket stability, proxy
memory behavior, and Traefik resilience under load.

**Install k6:**
```bash
# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6
```

**IDE session script** (`stress/ide-session.js`):

```javascript
import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = '<valid-jwt>';
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 }, // ramp up to 10 users
    { duration: '2m', target: 10 },  // hold
    { duration: '30s', target: 30 }, // ramp to 30
    { duration: '2m', target: 30 },  // hold
    { duration: '30s', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.05'],    // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );

  check(tokenRes, {
    'token issued': (r) => r.status === 200,
    // Only parse the body on success; an error response may not be JSON.
    'url present': (r) => r.status === 200 && r.json('url') !== undefined,
  });

  if (tokenRes.status !== 200) return;

  const { url } = tokenRes.json();

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });

  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

  // Response.cookies maps each cookie name to an array of cookie objects,
  // so serialize it into the "name=value; ..." form a Cookie header expects.
  const cookieHeader = Object.keys(bootstrapRes.cookies)
    .map((name) => `${name}=${bootstrapRes.cookies[name][0].value}`)
    .join('; ');

  // Step 3: Hold a WebSocket connection for 60s
  const wsUrl = `wss://${IDE_HOST}/`;
  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e.error()));
  });

  // A successful upgrade returns HTTP 101 Switching Protocols.
  check(wsRes, {
    'WS upgraded': (r) => r && r.status === 101,
  });

  sleep(1);
}
```

**Run:**
```bash
k6 run stress/ide-session.js
```
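
As a rough planning aid, the stage list in the script determines how long one run lasts, which tells you how long to keep the monitoring tools open. The helper below is a hypothetical sketch (`parseDuration` and `totalRunSeconds` are illustrative names, not part of k6) that sums the stage durations. It assumes only the `s` and `m` suffixes used in the script, and k6 may add a graceful-stop period on top of this figure.

```javascript
// Hypothetical helper, not part of k6: sums stage durations like '30s' or '2m'.
const stages = [
  { duration: '30s', target: 10 },
  { duration: '2m', target: 10 },
  { duration: '30s', target: 30 },
  { duration: '2m', target: 30 },
  { duration: '30s', target: 0 },
];

// Convert '<n>s' or '<n>m' into seconds; anything else is rejected.
function parseDuration(text) {
  const match = /^(\d+)(s|m)$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  const value = Number(match[1]);
  return match[2] === 'm' ? value * 60 : value;
}

// Total scheduled run time in seconds across all stages.
function totalRunSeconds(stageList) {
  return stageList.reduce((sum, stage) => sum + parseDuration(stage.duration), 0);
}

console.log(totalRunSeconds(stages)); // 330 (5 minutes 30 seconds)
```

Because each iteration holds a WebSocket for 60 seconds, a virtual user completes only a handful of iterations in a run of this length, so the HTTP metrics are based on relatively few samples.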

**What to watch:**
- `http_req_duration`: should stay under 2s at p95
- `ws_session_duration`: WebSocket sessions should stay stable
- API process memory on `zlh-api`, watching for leaks: `pm2 monit`
- Traefik logs for 502s or connection resets
- File descriptor count on API host: `lsof -p <pid> | wc -l`
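
One way to turn "watch for leaks" into a number is to sample the memory or file descriptor count at a fixed interval during the hold stages and look at the trend. The sketch below is an illustration under that assumption, not part of any tool named above: `growthPerSample` is a hypothetical helper that fits a least-squares line through the samples and returns its slope.

```javascript
// Hypothetical helper: least-squares slope of equally spaced samples.
// A slope near zero during a hold stage suggests stable usage; a
// consistently positive slope suggests a leak worth investigating.
function growthPerSample(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, y) => sum + y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  samples.forEach((y, x) => {
    numerator += (x - meanX) * (y - meanY);
    denominator += (x - meanX) ** 2;
  });
  return numerator / denominator;
}

// Example: file descriptor counts taken every 30s during a hold stage.
console.log(growthPerSample([100, 110, 120, 130])); // 10 per sample
console.log(growthPerSample([100, 100, 100]));      // 0
```

Counts that climb during ramp-up are expected; the trend only matters while the number of virtual users is constant.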

---

## Layer 2: Infrastructure Capacity

**Goal:** Determine max concurrent dev containers before host degrades.

**Method:** Provision containers via API in batches of 5, watch Proxmox host
metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes
swapping.
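
The batch-and-check loop described in the method could be sketched as follows. Everything here is an assumption for illustration: `provisionOne` and `readHostMetrics` are hypothetical callbacks that you would back with the provisioning API and with whatever feeds the Grafana dashboards, and `maxContainers` is an arbitrary safety cap.

```javascript
// Stop rule from the method above: CPU steal over 10% or any swapping.
function shouldStop({ cpuStealPct, swapping }) {
  return cpuStealPct > 10 || swapping;
}

// Sketch only: provisionOne and readHostMetrics are hypothetical callbacks.
// readHostMetrics should wait for the host to settle before sampling.
async function findCapacity({ provisionOne, readHostMetrics, batchSize = 5, maxContainers = 100 }) {
  let running = 0;
  while (running < maxContainers) {
    const batch = Math.min(batchSize, maxContainers - running);
    // Provision one batch concurrently, then check the host.
    await Promise.all(Array.from({ length: batch }, () => provisionOne()));
    running += batch;
    const metrics = await readHostMetrics();
    if (shouldStop(metrics)) {
      return { running, reason: metrics.swapping ? 'swapping' : 'cpu steal' };
    }
  }
  return { running, reason: 'max reached' };
}
```

The returned `running` count is the first batch size at which the host degraded, so the usable capacity is one batch below it.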

**Metrics to watch in Grafana:**
- Host CPU usage (not per-VM)
- Memory available
- Disk I/O on LXC datastore
- Network throughput on container bridge

**Expected baseline:**
Current host (Xeon Silver 4116, 192 GB) should comfortably handle 20-30
concurrent active dev containers. Idle containers cost almost nothing.

---

## Layer 3: Game Server Load

**Tool:** mineflayer (Minecraft), game-specific bots for others

**Minecraft bot script:**
```javascript
const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = '<game-container-ip>';
const SERVER_PORT = 25565;

for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });

    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));

    // Walk forward for 1s out of every 3s to generate movement load
    bot.once('spawn', () => {
      const walkTimer = setInterval(() => {
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);

      // Stop the walk loop once the bot disconnects
      bot.once('end', () => clearInterval(walkTimer));
    });
  }, i * 500); // stagger joins
}
```

**What to watch:**
- Agent crash recovery triggering unexpectedly
- Server TPS (ticks per second): below 15 TPS means the server is struggling
- Container CPU throttling in Proxmox
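
If the server reports average tick time rather than TPS, the two are related by a simple formula: Minecraft targets 20 ticks per second, which is 50 ms per tick, and TPS falls once a tick takes longer than that. The sketch below shows the conversion and applies the 15 TPS threshold from this guide. The function names are illustrative, and where the tick time comes from (a server command or plugin) is left as an assumption.

```javascript
const TARGET_TPS = 20;       // Minecraft's nominal tick rate (50 ms per tick)
const STRUGGLING_BELOW = 15; // threshold used in this guide

// Convert average milliseconds per tick into effective TPS.
// The server never runs faster than the target rate, so cap at 20.
function tpsFromTickMillis(msPerTick) {
  return Math.min(TARGET_TPS, 1000 / msPerTick);
}

function isStruggling(msPerTick) {
  return tpsFromTickMillis(msPerTick) < STRUGGLING_BELOW;
}

console.log(tpsFromTickMillis(50)); // 20
console.log(tpsFromTickMillis(80)); // 12.5
console.log(isStruggling(80));      // true
```

In these terms the 15 TPS threshold corresponds to an average tick time of roughly 67 ms.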

---

## Layer 4: OPNsense Router Validation

See `network/opnsense-checklist.md` for router-specific validation.

---

## Recommended Test Order Before Launch

1. API token endpoint under 50 concurrent users (k6)
2. 20 simultaneous IDE WebSocket sessions (k6)
3. Provision 10 dev containers simultaneously via API
4. Minecraft server with 20 bot players for 10 minutes
5. OPNsense failover test (see network doc)

---

## Key Files

- `stress/ide-session.js`: IDE proxy load test (create in zpack-api repo)
- `stress/bot-load.js`: Minecraft bot load test

---

## Failure Mode Reference

| Symptom | Likely Cause |
|---------|--------------|
| API 502s under load | Proxy running out of connections or file descriptors |
| WS sessions dropping after ~2min | Traefik keepalive timeout; increase `respondingTimeouts` |
| Memory growth on API process | Connection leak in `http-proxy-middleware` under concurrent load |
| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
| Minecraft TPS drop with bots | Container CPU limit too low; check LXC resource allocation |