Stress Testing Guide

Overview

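Four distinct layers need stress testing: the API and IDE proxy, infrastructure capacity, game servers, and the OPNsense router. Each has different tools, goals, and failure modes. Test them in this order: API first, then proxy/WS, then infrastructure, then game servers, with router validation last.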

Layer 1 — API + IDE Proxy (Start Here)

Tool: k6 (https://k6.io)

Goal: Validate concurrent IDE sessions, WebSocket stability, proxy memory behavior, and Traefik resilience under load.

Install k6:

# Ubuntu/Debian
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6

IDE session script (stress/ide-session.js):

import http from 'k6/http';
import ws from 'k6/ws';
import { check, sleep } from 'k6';

// Configure these
const API_BASE = 'https://10.60.0.245:4000';
const IDE_HOST = 'dev-6070.zerolaghub.dev';
const USER_TOKEN = '<valid-jwt>';
const VMID = 6070;

export const options = {
  stages: [
    { duration: '30s', target: 10 },   // ramp up to 10 users
    { duration: '2m',  target: 10 },   // hold
    { duration: '30s', target: 30 },   // ramp to 30
    { duration: '2m',  target: 30 },   // hold
    { duration: '30s', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'],  // 95% of requests under 2s
    http_req_failed:   ['rate<0.05'],   // less than 5% failure
  },
};

export default function () {
  // Step 1: Get IDE token
  const tokenRes = http.post(
    `${API_BASE}/api/dev/${VMID}/ide-token`,
    null,
    {
      headers: {
        Authorization: `Bearer ${USER_TOKEN}`,
        'Content-Type': 'application/json',
      },
    }
  );

  check(tokenRes, {
    'token issued': (r) => r.status === 200,
    // Parse the body only on success; an error response may not be JSON
    'url present':  (r) => r.status === 200 && JSON.parse(r.body).url !== undefined,
  });

  if (tokenRes.status !== 200) return;

  const { url, token } = JSON.parse(tokenRes.body);

  // Step 2: Bootstrap (follow redirect, get cookie)
  const bootstrapRes = http.get(url, {
    redirects: 5,
    headers: { Host: IDE_HOST },
  });

  check(bootstrapRes, {
    'IDE loads': (r) => r.status === 200,
  });

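  // Step 3: Hold a WebSocket connection for 60s
  // Build a Cookie header string from the jar; the raw cookies object is not a valid header value
  const jarCookies = http.cookieJar().cookiesForURL(bootstrapRes.url);
  const cookieHeader = Object.keys(jarCookies)
    .map((name) => `${name}=${jarCookies[name][0]}`)
    .join('; ');
  const wsUrl = `wss://${IDE_HOST}/`;
  const wsRes = ws.connect(wsUrl, { headers: { Cookie: cookieHeader } }, (socket) => {
    socket.on('open', () => {
      socket.setTimeout(() => socket.close(), 60000);
    });
    socket.on('error', (e) => console.error('WS error', e));
  });
  check(wsRes, { 'WS upgraded': (r) => r && r.status === 101 });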

  sleep(1);
}

Run:

k6 run stress/ide-session.js

What to watch:

http_req_duration: should stay under 2s at p95
ws_session_duration: should sit close to the 60s hold; much shorter sessions suggest the connection was dropped before the script closed it
API process memory on zlh-api — watch for leaks: pm2 monit
Traefik logs for 502s or connection resets
File descriptor count on API host: lsof -p <pid> | wc -l
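
Eyeballing pm2 monit makes slow leaks easy to miss. As a rough aid, the sketch below fits a least-squares line through periodic memory readings and reports the growth rate. The function name and the { tSec, rssMb } sample shape are assumptions made for illustration; how the readings are collected is left open.

```javascript
// Sketch: least-squares slope of resident memory over time, in MB per minute.
// `samples` is a hypothetical array of { tSec, rssMb } readings taken during a run.
function memoryGrowthPerMinute(samples) {
  if (samples.length < 2) return 0; // a trend needs at least two points
  const xs = samples.map((s) => s.tSec / 60); // convert seconds to minutes
  const ys = samples.map((s) => s.rssMb);
  const mean = (v) => v.reduce((a, b) => a + b, 0) / v.length;
  const xMean = mean(xs);
  const yMean = mean(ys);
  let num = 0;
  let den = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - xMean) * (ys[i] - yMean);
    den += (xs[i] - xMean) ** 2;
  }
  return den === 0 ? 0 : num / den; // all samples at the same time: no trend
}

// Example with made-up readings taken one minute apart.
const growth = memoryGrowthPerMinute([
  { tSec: 0, rssMb: 100 },
  { tSec: 60, rssMb: 110 },
  { tSec: 120, rssMb: 120 },
]);
console.log(`RSS growth: ${growth.toFixed(1)} MB/min`); // prints "RSS growth: 10.0 MB/min"
```

A slope that stays clearly positive during the hold stages, when the user count is constant, is the pattern worth investigating; growth during ramp-up alone is expected.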

Layer 2 — Infrastructure Capacity

Goal: Determine max concurrent dev containers before host degrades.

Method: Provision containers via API in batches of 5, watch Proxmox host metrics in Grafana. Stop when CPU steal > 10% or memory pressure causes swapping.
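
The batch-and-check loop described above can be sketched as follows. This is a minimal outline, not the project's tooling: provisionInBatches, provision, and shouldStop are hypothetical names, and the caller is assumed to supply the real API call and the health check (for example, a query against the Grafana data source).

```javascript
// Sketch: provision containers in fixed-size batches until a stop condition
// reports host degradation. `provision` and `shouldStop` are hypothetical
// callbacks supplied by the caller; neither is part of the real API.
async function provisionInBatches({ maxContainers, batchSize = 5, provision, shouldStop }) {
  let provisioned = 0;
  while (provisioned < maxContainers) {
    const size = Math.min(batchSize, maxContainers - provisioned);
    // Start one batch concurrently and wait for every request to settle.
    const results = await Promise.allSettled(
      Array.from({ length: size }, (_, i) => provision(provisioned + i))
    );
    const failed = results.filter((r) => r.status === 'rejected').length;
    provisioned += size - failed;
    if (failed > 0) return { provisioned, reason: 'provision-failed' };
    // Check host health between batches (e.g. CPU steal > 10% or swapping).
    if (await shouldStop()) return { provisioned, reason: 'host-degraded' };
  }
  return { provisioned, reason: 'max-reached' };
}

// Example run with stubs: the host "degrades" at the third health check,
// so the run stops after three batches of five.
let healthChecks = 0;
provisionInBatches({
  maxContainers: 40,
  provision: async (n) => `ct-${n}`,
  shouldStop: async () => ++healthChecks >= 3,
}).then((result) => console.log(result.provisioned, result.reason)); // prints "15 host-degraded"
```

Stopping on the first failed batch is a design choice in this sketch: a provisioning failure under load is itself a capacity signal, and continuing would blur the count of containers the host actually sustained.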

Metrics to watch in Grafana:

Host CPU usage (not per-VM)
Memory available
Disk I/O on LXC datastore
Network throughput on container bridge

Expected baseline: The current host (Xeon Silver 4116, 192 GB RAM) should comfortably handle 20-30 concurrent active dev containers. Idle containers cost almost nothing.

Layer 3 — Game Server Load

Tool: mineflayer (Minecraft), game-specific bots for others

Minecraft bot script:

const mineflayer = require('mineflayer');

const BOT_COUNT = 20;
const SERVER_HOST = '<game-container-ip>';
const SERVER_PORT = 25565;

for (let i = 0; i < BOT_COUNT; i++) {
  setTimeout(() => {
    const bot = mineflayer.createBot({
      host: SERVER_HOST,
      port: SERVER_PORT,
      username: `StressBot${i}`,
    });
    bot.on('error', (err) => console.error(`Bot ${i} error:`, err));
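    // Walk forward for 1s every 3s to generate movement load
    bot.once('spawn', () => {
      const walkTimer = setInterval(() => {
        bot.setControlState('forward', true);
        setTimeout(() => bot.setControlState('forward', false), 1000);
      }, 3000);
      // Stop the timer once the bot disconnects so it does not drive a closed connection
      bot.once('end', () => clearInterval(walkTimer));
    });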
  }, i * 500); // stagger joins
}

What to watch:

Agent crash recovery triggering unexpectedly
Server TPS (ticks per second) — below 15 TPS = struggling
Container CPU throttling in Proxmox
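
If the server exposes no TPS readout, a bot can approximate it from the world age it receives. The sketch below assumes two samples of world age (in ticks) with the wall-clock time each was taken; estimateTps and isStruggling are hypothetical helpers, and the 15 TPS cutoff is the one stated above. Network jitter affects the estimate, so longer sampling windows give steadier numbers.

```javascript
// Sketch: estimate server TPS from two samples of world age (in ticks) and
// wall-clock time (in ms). A healthy Minecraft server runs at 20 TPS.
function estimateTps(prev, curr) {
  const elapsedSec = (curr.wallMs - prev.wallMs) / 1000;
  if (elapsedSec <= 0) return null; // no elapsed time, nothing to estimate
  return (curr.ageTicks - prev.ageTicks) / elapsedSec;
}

// Threshold from this guide: below 15 TPS the server is struggling.
function isStruggling(tps) {
  return tps !== null && tps < 15;
}

// Example with made-up samples: 100 ticks over 10 seconds.
const tps = estimateTps({ ageTicks: 0, wallMs: 0 }, { ageTicks: 100, wallMs: 10000 });
console.log(tps, isStruggling(tps)); // prints "10 true"

// Possible wiring inside the bot script (assumes mineflayer's 'time' event
// and bot.time.age, sampled from a single designated bot):
//   let prev = null;
//   bot.on('time', () => {
//     const curr = { ageTicks: bot.time.age, wallMs: Date.now() };
//     if (prev && curr.wallMs - prev.wallMs >= 10000) {
//       console.log('TPS estimate:', estimateTps(prev, curr));
//       prev = curr;
//     } else if (!prev) {
//       prev = curr;
//     }
//   });
```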

Layer 4 — OPNsense Router Validation

See network/opnsense-checklist.md for router-specific validation.

Recommended Test Order Before Launch

1. API token endpoint under 50 concurrent users (k6)
2. 20 simultaneous IDE WebSocket sessions (k6)
3. Provision 10 dev containers simultaneously via API
4. Minecraft server with 20 bot players for 10 minutes
5. OPNsense failover test (see network doc)
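
The two k6 items use the same ramp-and-hold shape as stress/ide-session.js, only with different user counts. One way to avoid hand-editing the stages for each run is a small builder; buildStages is a hypothetical helper, and its default durations simply mirror the ones used in the IDE session script.

```javascript
// Hypothetical helper: build a k6 `stages` array that ramps to each target
// in turn, holds there, and finally ramps back down to zero.
function buildStages(targets, ramp = '30s', hold = '2m') {
  const stages = [];
  for (const target of targets) {
    stages.push({ duration: ramp, target }); // ramp to this level
    stages.push({ duration: hold, target }); // hold at this level
  }
  stages.push({ duration: ramp, target: 0 }); // ramp down
  return stages;
}

// Same profile as stress/ide-session.js: 10 users, then 30.
const ideStages = buildStages([10, 30]);
// Profile for the 50-user token endpoint test.
const tokenStages = buildStages([50]);
console.log(JSON.stringify(tokenStages));
```

Inside a k6 script the result would be used as `export const options = { stages: buildStages([50]) };`, with the thresholds block kept as it is.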

Key Files

stress/ide-session.js — IDE proxy load test (create in zpack-api repo)
stress/bot-load.js — Minecraft bot load test

Failure Mode Reference

Symptom	Likely Cause
API 502s under load	Proxy running out of connections or file descriptors
WS sessions dropping after ~2min	Traefik keepalive timeout — increase `respondingTimeouts`
Memory growth on API process	Connection leak in `http-proxy-middleware` under concurrent load
Container provision failures under load	Proxmox API rate limiting or LXC storage I/O bottleneck
Minecraft TPS drop with bots	Container CPU limit too low — check LXC resource allocation
