# Stress Testing Guide

## Overview

Four distinct layers need stress testing. Each has different tools, goals,
and failure modes. Test in this order: API and IDE proxy (including
WebSockets) first, then infrastructure capacity, then game servers, then the
OPNsense router.

---

## Layer 1 — API + IDE Proxy (Start Here)

**Tool:** k6 (https://k6.io)

**Goal:** Validate concurrent IDE sessions, WebSocket stability, proxy
memory behavior, and Traefik resilience under load.

**Install k6:**
```bash
# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6
```

**IDE session script** (`stress/ide-session.js`):

```javascript
import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = '<valid-jwt>';
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 }, // ramp up to 10 users
    { duration: '2m', target: 10 },  // hold
    { duration: '30s', target: 30 }, // ramp to 30
    { duration: '2m', target: 30 },  // hold
    { duration: '30s', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.05'],    // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );

  check(tokenRes, {
    'token issued': (r) => r.status === 200,
    // Only parse the body on success; error pages may not be JSON
    'url present': (r) => r.status === 200 && r.json('url') !== undefined,
  });

  if (tokenRes.status !== 200) return;

  // The response also carries `token`; only `url` is needed here
  const { url } = tokenRes.json();
  if (!url) return;

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });

  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

  // Step 3: Hold a WebSocket connection for 60s
  // The Cookie header must be a string, so serialize the cookies that k6
  // stored in the per-VU jar (assumes the bootstrap cookie is set for IDE_HOST).
  const jarCookies = http.cookieJar().cookiesForURL(`https://${IDE_HOST}/`);
  const cookieHeader = Object.entries(jarCookies)
    .map(([name, values]) => `${name}=${values[0]}`)
    .join('; ');

  const wsUrl = `wss://${IDE_HOST}/`;
  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e.error()));
  });

  check(wsRes, {
    'WS upgraded': (r) => r && r.status === 101,
  });

  sleep(1);
}
```

**Run:**
```bash
k6 run stress/ide-session.js
```

**What to watch:**
- `http_req_duration` — should stay under 2s at p95
- `ws_session_duration` — should sit near the 60s hold time; shorter sessions indicate dropped connections
- API process memory on `zlh-api` — watch for leaks: `pm2 monit`
- Traefik logs for 502s or connection resets
- File descriptor count on API host: `lsof -p <pid> | wc -l`

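The memory and file-descriptor checks above are easier to judge as a trend than as single readings. The following is a minimal sketch, assuming you sample a value (open descriptors, or RSS in MB) at intervals during the run; `growthPerMinute`, `looksLikeLeak`, and the sample shape are hypothetical names for illustration, not part of any existing tool. It fits a least-squares line through the samples and reports the slope per minute, so a steady climb stands out from normal jitter.

```javascript
// samples: [{ timeMs, value }] where value is e.g. the output of
// `lsof -p <pid> | wc -l` taken at timeMs (hypothetical sample shape).
// Returns the least-squares slope in value units per minute.
function growthPerMinute(samples) {
  if (samples.length < 2) return 0;
  const minutes = samples.map((s) => s.timeMs / 60000);
  const meanT = minutes.reduce((a, b) => a + b, 0) / minutes.length;
  const meanV = samples.reduce((a, s) => a + s.value, 0) / samples.length;
  let num = 0;
  let den = 0;
  samples.forEach((s, i) => {
    const dt = minutes[i] - meanT;
    num += dt * (s.value - meanV);
    den += dt * dt;
  });
  // den is 0 when all samples share one timestamp; treat that as no trend
  return den === 0 ? 0 : num / den;
}

// Assumed pass/fail rule: flag growth faster than maxPerMinute.
// Pick the limit from a baseline run; no value is prescribed by this guide.
function looksLikeLeak(samples, maxPerMinute) {
  return growthPerMinute(samples) > maxPerMinute;
}

// Example: descriptors sampled once a minute
const fdSamples = [
  { timeMs: 0, value: 100 },
  { timeMs: 60000, value: 110 },
  { timeMs: 120000, value: 120 },
];
console.log(growthPerMinute(fdSamples)); // 10 descriptors per minute
```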
---

## Layer 2 — Infrastructure Capacity

**Goal:** Determine max concurrent dev containers before the host degrades.

**Method:** Provision containers via API in batches of 5 and watch Proxmox host
metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes
swapping.

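The batch-and-check loop described in the method can be sketched as follows. This is an illustration under stated assumptions, not the project's tooling: `provisionOne` and `readHostMetrics` are hypothetical caller-supplied async functions (wire them to the provisioning API and to whatever feeds Grafana), and the metric field names `cpuStealPercent` and `swapInPagesPerSec` are made up for the example. Only the batch size of 5 and the stop conditions come from the text above.

```javascript
// Stop condition from the method: CPU steal above 10% or any swapping.
const MAX_CPU_STEAL_PERCENT = 10;

// metrics: { cpuStealPercent, swapInPagesPerSec } (hypothetical shape)
function hostDegraded(metrics) {
  return (
    metrics.cpuStealPercent > MAX_CPU_STEAL_PERCENT ||
    metrics.swapInPagesPerSec > 0
  );
}

// Provisions containers batch by batch and stops at the first degraded reading.
// provisionOne(index) and readHostMetrics() are hypothetical async callbacks.
async function rampContainers(
  provisionOne,
  readHostMetrics,
  { batchSize = 5, maxContainers = 50, settleMs = 0 } = {}
) {
  let provisioned = 0;
  while (provisioned < maxContainers) {
    const count = Math.min(batchSize, maxContainers - provisioned);
    const batch = [];
    for (let i = 0; i < count; i++) {
      batch.push(provisionOne(provisioned + i));
    }
    await Promise.all(batch);
    provisioned += count;

    // Give the host time to settle before reading metrics
    await new Promise((resolve) => setTimeout(resolve, settleMs));
    const metrics = await readHostMetrics();
    if (hostDegraded(metrics)) {
      return { provisioned, stoppedBy: metrics };
    }
  }
  return { provisioned, stoppedBy: null };
}
```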
**Metrics to watch in Grafana:**
- Host CPU usage, including steal (not per-VM)
- Memory available
- Disk I/O on LXC datastore
- Network throughput on container bridge

**Expected baseline:**
Current host (Xeon Silver 4116, 192 GB) should comfortably handle 20-30
concurrent active dev containers. Idle containers cost almost nothing.

---

## Layer 3 — Game Server Load

**Tool:** mineflayer (Minecraft), game-specific bots for others

**Minecraft bot script** (`stress/bot-load.js`):
```javascript
const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = '<game-container-ip>';
const SERVER_PORT = 25565;

// Username-only bots join without authentication, so the target server
// must run with online-mode=false.
for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });
    let walkTimer = null;

    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
    bot.on('kicked', (reason) => console.error(`Bot ${i} kicked:`, reason));
    // Stop the walk loop once the connection ends
    bot.on('end', () => clearInterval(walkTimer));

    // Walk forward for 1s out of every 3s
    bot.once('spawn', () => {
      walkTimer = setInterval(() => {
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);
    });
  }, i * 500); // stagger joins
}
```

**What to watch:**
- Agent crash recovery triggering unexpectedly
- Server TPS (ticks per second) — below 15 TPS (of a nominal 20) means the server is struggling
- Container CPU throttling in Proxmox

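If the server console is not handy, TPS can be approximated from the client side. The sketch below assumes you take two samples of the world age in ticks together with wall-clock time, for example from mineflayer's `bot.time.age` and `Date.now()`; that sampling method and the helper names `estimateTps` and `isStruggling` are assumptions for illustration. Because the server sends time updates only periodically, a window of 20 seconds or more gives a steadier estimate than a short one.

```javascript
const MAX_TPS = 20;              // Minecraft's nominal tick rate
const STRUGGLING_BELOW_TPS = 15; // threshold from the list above

// first/last: { ageTicks, wallMs } samples taken some seconds apart.
// Returns ticks per second over the window, capped at the nominal rate,
// or null when the window is empty.
function estimateTps(first, last) {
  const seconds = (last.wallMs - first.wallMs) / 1000;
  if (seconds <= 0) return null;
  const tps = (last.ageTicks - first.ageTicks) / seconds;
  return Math.min(MAX_TPS, tps);
}

function isStruggling(tps) {
  return tps !== null && tps < STRUGGLING_BELOW_TPS;
}

// Example: 240 ticks elapsed over a 20 second window
const tps = estimateTps({ ageTicks: 0, wallMs: 0 }, { ageTicks: 240, wallMs: 20000 });
console.log(tps, isStruggling(tps)); // 12 true
```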
---

## Layer 4 — OPNsense Router Validation

See `network/opnsense-checklist.md` for router-specific validation.

---

## Recommended Test Order Before Launch

1. API token endpoint under 50 concurrent users (k6)
2. 20 simultaneous IDE WebSocket sessions (k6)
3. Provision 10 dev containers simultaneously via API
4. Minecraft server with 20 bot players for 10 minutes
5. OPNsense failover test (see `network/opnsense-checklist.md`)

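The two k6 runs in this list differ only in their target user count, so the load profile can be generated rather than copied. The helper below is a small sketch: `rampHoldStages` is a hypothetical name, and the 30 second ramp and 120 second hold defaults simply mirror the stage lengths used in `stress/ide-session.js`. It returns an array in the shape k6 expects for its `stages` option.

```javascript
// Build a ramp-up / hold / ramp-down profile for k6's `stages` option.
// targetVus: number of concurrent virtual users to hold.
function rampHoldStages(targetVus, rampSeconds = 30, holdSeconds = 120) {
  return [
    { duration: `${rampSeconds}s`, target: targetVus }, // ramp up
    { duration: `${holdSeconds}s`, target: targetVus }, // hold
    { duration: `${rampSeconds}s`, target: 0 },         // ramp down
  ];
}

// In a k6 script this would be used as:
//   export const options = { stages: rampHoldStages(50) }; // test 1: token endpoint
//   export const options = { stages: rampHoldStages(20) }; // test 2: IDE WebSocket sessions
console.log(JSON.stringify(rampHoldStages(50)));
```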
---

## Key Files

- `stress/ide-session.js` — IDE proxy load test (create in zpack-api repo)
- `stress/bot-load.js` — Minecraft bot load test

---

## Failure Mode Reference

| Symptom | Likely Cause |
|---------|--------------|
| API 502s under load | Proxy running out of connections or file descriptors |
| WS sessions dropping after ~2min | Traefik keepalive timeout — increase `respondingTimeouts` |
| Memory growth on API process | Connection leak in `http-proxy-middleware` under concurrent load |
| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
| Minecraft TPS drop with bots | Container CPU limit too low — check LXC resource allocation |