Stress Testing Guide

Overview

Four distinct layers need stress testing. Each has different tools, goals, and failure modes. Test them in order: API and IDE proxy (including WebSockets) first, then infrastructure capacity, then game servers, then the OPNsense router.


Layer 1 — API + IDE Proxy (Start Here)

Tool: k6 (https://k6.io)

Goal: Validate concurrent IDE sessions, WebSocket stability, proxy memory behavior, and Traefik resilience under load.

Install k6:

# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6

IDE session script (stress/ide-session.js):

import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = '<valid-jwt>';
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 },   // ramp up to 10 users
    { duration: '2m',  target: 10 },   // hold
    { duration: '30s', target: 30 },   // ramp to 30
    { duration: '2m',  target: 30 },   // hold
    { duration: '30s', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'],  // 95% of requests under 2s
    http_req_failed:   ['rate<0.05'],   // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );

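  const tokenOk = check(tokenRes, {
    'token issued': (r) => r.status === 200,
  });

  // Bail out before parsing: error responses (e.g. a 502 from the proxy)
  // may not carry a JSON body, and parsing them would throw.
  if (!tokenOk) return;

  const { url } = tokenRes.json();
  const urlOk = check(url, {
    'url present': (u) => typeof u === 'string' && u.length > 0,
  });
  if (!urlOk) return;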

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });

  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

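  // Step 3: Hold a WebSocket connection for 60s.
  // Response.cookies maps each cookie name to an array of cookie objects,
  // so it must be serialized into a "name=value; ..." Cookie header.
  // If the session cookie is set on an earlier redirect hop, read it from
  // http.cookieJar() instead.
  const wsUrl = `wss://${IDE_HOST}/`;
  const cookieHeader = Object.keys(bootstrapRes.cookies)
    .map((name) => `${name}=${bootstrapRes.cookies[name][0].value}`)
    .join('; ');

  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e.error()));
  });

  check(wsRes, { 'WS upgraded': (r) => r && r.status === 101 });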

  sleep(1);
}

Run:

k6 run stress/ide-session.js
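
Hard-coding a JWT in the script makes it easy to commit a secret by accident. The following is a minimal sketch, assuming the constants at the top of stress/ide-session.js are replaced with values read from k6's `__ENV` global, which is populated by `k6 run -e NAME=value` and by the shell environment. The `loadConfig` helper and the variable names `API_BASE`, `IDE_HOST`, `USER_TOKEN`, and `VMID` as environment keys are illustrative choices, not part of the existing script.

```javascript
// Hypothetical helper: build the script configuration from environment
// variables instead of hard-coded constants. Defaults mirror the values
// currently used in stress/ide-session.js.
function loadConfig(env) {
  const token = env.USER_TOKEN;
  if (!token) {
    throw new Error('USER_TOKEN is required (pass it with: k6 run -e USER_TOKEN=...)');
  }

  // Environment values are always strings, so convert and validate the VMID.
  const vmid = Number(env.VMID || 6070);
  if (!Number.isInteger(vmid)) {
    throw new Error(`VMID must be an integer, got "${env.VMID}"`);
  }

  return {
    apiBase: env.API_BASE || 'https://10.60.0.245:4000',
    ideHost: env.IDE_HOST || 'dev-6070.zerolaghub.dev',
    userToken: token,
    vmid,
  };
}

// Inside a k6 script this would be called once at the top:
//   const cfg = loadConfig(__ENV);
```

With something like this in place, a run might look like `k6 run -e USER_TOKEN=<valid-jwt> stress/ide-session.js`, and the token never lands in the repository.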

What to watch:

  • http_req_duration — should stay under 2s at p95
  • ws_session_duration — WebSocket sessions should stay stable
  • API process memory on zlh-api — watch for leaks: pm2 monit
  • Traefik logs for 502s or connection resets
  • File descriptor count on API host: lsof -p <pid> | wc -l
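
Eyeballing memory or descriptor counts makes slow leaks easy to miss. As a rough sketch, the samples can be reduced to a growth rate with a least-squares fit. The function names below are hypothetical, and `countOpenFds` assumes a Linux host where the caller has permission to read `/proc/<pid>/fd` (same user or root).

```javascript
const fs = require('fs');

// Linux only: each entry in /proc/<pid>/fd is one open descriptor.
function countOpenFds(pid) {
  return fs.readdirSync(`/proc/${pid}/fd`).length;
}

// Least-squares slope of a metric over time, expressed per minute.
// samples: array of { t: <timestamp in ms>, value: <number> }.
function growthPerMinute(samples) {
  if (samples.length < 2) return 0;

  const n = samples.length;
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanV = samples.reduce((sum, p) => sum + p.value, 0) / n;

  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.value - meanV);
    den += (p.t - meanT) ** 2;
  }
  if (den === 0) return 0;      // all samples share one timestamp

  return (num / den) * 60000;   // per-millisecond slope -> per-minute
}

// Example use (not executed here): sample the API process every 10s
// during a hold stage, then inspect the trend.
//   const samples = [];
//   setInterval(() => samples.push({ t: Date.now(), value: countOpenFds(apiPid) }), 10000);
//   ...later: console.log(growthPerMinute(samples), 'fds/min');
```

A clearly positive slope during a hold stage, where the user count is constant, is a stronger leak signal than growth during ramp-up, when more connections are expected anyway.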

Layer 2 — Infrastructure Capacity

Goal: Determine max concurrent dev containers before host degrades.

Method: Provision containers via API in batches of 5, watch Proxmox host metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes swapping.

Metrics to watch in Grafana:

  • Host CPU usage (not per-VM)
  • Memory available
  • Disk I/O on LXC datastore
  • Network throughput on container bridge

Expected baseline: The current host (Xeon Silver 4116, 192 GB RAM) should comfortably handle 20-30 concurrent active dev containers. Idle containers cost almost nothing.
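
The batch-and-observe method above can be sketched as a small driver loop. This is only an outline under stated assumptions: `provisionOne` and `readHostMetrics` are hypothetical callbacks the caller supplies (for example, a call to the provisioning API and a query against the Grafana data source), and the metric field names `cpuStealPct` and `swapping`, the `maxContainers` safety cap, and the settle delay are illustrative.

```javascript
// Provision containers in batches and stop at the first sign of host
// degradation. All callbacks and field names are hypothetical.
async function rampContainers({
  provisionOne,          // async () => void, creates one dev container
  readHostMetrics,       // async () => ({ cpuStealPct, swapping })
  batchSize = 5,         // batch size from this guide
  maxContainers = 40,    // assumed safety cap so the loop always ends
  settleMs = 60000,      // assumed wait for metrics to reflect the new load
}) {
  let provisioned = 0;

  while (provisioned < maxContainers) {
    const count = Math.min(batchSize, maxContainers - provisioned);

    // Provision the batch concurrently; a single failure rejects the batch.
    await Promise.all(Array.from({ length: count }, () => provisionOne()));
    provisioned += count;

    await new Promise((resolve) => setTimeout(resolve, settleMs));

    const metrics = await readHostMetrics();
    if (metrics.cpuStealPct > 10) {
      return { provisioned, stoppedBy: 'cpu-steal', metrics };
    }
    if (metrics.swapping) {
      return { provisioned, stoppedBy: 'swap', metrics };
    }
  }

  return { provisioned, stoppedBy: 'max-containers', metrics: null };
}
```

The returned `provisioned` count at the stop point gives a concrete number to compare against the expected baseline.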


Layer 3 — Game Server Load

Tool: mineflayer (Minecraft), game-specific bots for others

Minecraft bot script (stress/bot-load.js):

const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = '<game-container-ip>';
const SERVER_PORT = 25565;

for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });
    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
    // Walk around randomly
    bot.once('spawn', () => {
      setInterval(() => {
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);
    });
  }, i * 500); // stagger joins
}

What to watch:

  • Agent crash recovery triggering unexpectedly
  • Server TPS (ticks per second) — below 15 TPS = struggling
  • Container CPU throttling in Proxmox
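
If no server-side TPS plugin is available, TPS can be approximated from a bot's point of view by comparing how far the world age advanced against wall-clock time. The sketch below assumes mineflayer's `bot.time.age` (world age in ticks) tracks the server's tick counter; the helper names and the 10-second sampling window are illustrative.

```javascript
const STRUGGLING_BELOW = 15; // threshold from this guide

// Ticks advanced divided by seconds elapsed.
function estimateTps(ageStart, ageEnd, elapsedMs) {
  if (elapsedMs <= 0) throw new RangeError('elapsedMs must be positive');
  return ((ageEnd - ageStart) * 1000) / elapsedMs;
}

function isStruggling(tps) {
  return tps < STRUGGLING_BELOW;
}

// Example use with one of the stress bots (not executed here):
//   bot.once('spawn', () => {
//     let lastAge = bot.time.age;
//     let lastAt = Date.now();
//     setInterval(() => {
//       const now = Date.now();
//       const tps = estimateTps(lastAge, bot.time.age, now - lastAt);
//       console.log(`TPS ~${tps.toFixed(1)}${isStruggling(tps) ? ' (struggling)' : ''}`);
//       lastAge = bot.time.age;
//       lastAt = now;
//     }, 10000);
//   });
```

Because the client only learns the world age from periodic time updates, short windows give noisy readings; a window of several seconds smooths that out.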

Layer 4 — OPNsense Router Validation

See network/opnsense-checklist.md for router-specific validation.

Test Checklist


  1. API token endpoint under 50 concurrent users (k6)
  2. 20 simultaneous IDE WebSocket sessions (k6)
  3. Provision 10 dev containers simultaneously via API
  4. Minecraft server with 20 bot players for 10 minutes
  5. OPNsense failover test (see network doc)
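
Items 1 and 2 could be expressed as k6 scenarios so each runs at a fixed concurrency rather than through ramping stages. This is a sketch only: the scenario names, the `tokenOnly` and `ideSession` function names, and the durations are assumptions, and the script would need to export functions with those names. In k6, `scenarios` replaces the `stages` shortcut; the two cannot be combined in one options object.

```javascript
// In a k6 script this object would be exported:
//   export const options = { ... };
const options = {
  scenarios: {
    // Item 1: token endpoint under 50 concurrent users.
    token_endpoint: {
      executor: 'constant-vus',
      exec: 'tokenOnly',     // hypothetical exported function
      vus: 50,
      duration: '2m',        // assumed hold time
    },
    // Item 2: 20 simultaneous IDE WebSocket sessions.
    ide_websockets: {
      executor: 'constant-vus',
      exec: 'ideSession',    // hypothetical exported function
      vus: 20,
      duration: '2m',        // assumed hold time
      startTime: '2m30s',    // start after the token scenario has finished
    },
  },
  thresholds: {
    // k6 tags every metric with the scenario name, so thresholds can be
    // scoped to a single scenario.
    'http_req_duration{scenario:token_endpoint}': ['p(95)<2000'],
    'http_req_failed{scenario:token_endpoint}': ['rate<0.05'],
  },
};
```

Running the two scenarios back to back rather than in parallel keeps their results separable when reading the summary.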

Key Files

  • stress/ide-session.js — IDE proxy load test (create in zpack-api repo)
  • stress/bot-load.js — Minecraft bot load test

Failure Mode Reference

| Symptom | Likely Cause |
|---|---|
| API 502s under load | Proxy running out of connections or file descriptors |
| WS sessions dropping after ~2min | Traefik keepalive timeout; increase respondingTimeouts |
| Memory growth on API process | Connection leak in http-proxy-middleware under concurrent load |
| Container provision failures under load | Proxmox API rate limiting or LXC storage I/O bottleneck |
| Minecraft TPS drop with bots | Container CPU limit too low; check LXC resource allocation |