Launch architecture: move provisioning into BullMQ worker with idempotency safeguards #11

Open
opened 2026-05-03 11:07:17 +00:00 by jester · 7 comments
Owner

Decision / launch architecture item

Move server provisioning out of the HTTP request lifecycle and into a BullMQ worker process.

Confirmed current state

zpack-api already has queue infrastructure:

  • bullmq
  • ioredis
  • existing src/queues/postProvision.js

Current expensive provisioning still lives in provisionAgentInstance() and runs as a synchronous pipeline from the API path. It performs Proxmox clone/config/start, waits for IP, sends Agent config, waits for Agent terminal/readiness state with a polling loop, persists the instance, and then queues/publishes follow-on edge work.

Chosen first implementation

Use the same zpack-api repo and add a separate worker process first.

Recommended first deployment:

zpack-api VM
  - zpack-api.service            # HTTP API
  - zpack-provision-worker.service # BullMQ provisioning worker

Initial shape:

  • One queue: provisioning
  • One worker command: npm run worker:provision
  • Concurrency: 1 (a minimal worker sketch follows this list)
  • Keep game/dev in the same queue for now; ctype branching already exists inside the provisioning function
  • Split into game/dev queues only later if there is real concurrency pressure
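
A minimal sketch of what this initial worker process could look like, assuming a CommonJS entry point started by npm run worker:provision, a REDIS_URL environment variable, and a hypothetical runProvisioningJob() wrapper around the existing provisioning function; none of these names are confirmed from the repo.

```js
// Provisioning worker entry point (illustrative)
const { Worker } = require('bullmq');
const IORedis = require('ioredis');

const connection = new IORedis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null, // BullMQ workers require this ioredis setting
});

const worker = new Worker(
  'provisioning',
  async (job) => {
    // job.data carries the validated create request; ctype branching
    // (game vs dev) stays inside the provisioning function for now.
    return runProvisioningJob(job.data); // hypothetical wrapper
  },
  { connection, concurrency: 1 }
);

worker.on('failed', (job, err) => {
  console.error(`[provisioning] job ${job && job.id} failed: ${err.message}`);
});
```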

Required worker safety additions

1. Error cleanup must survive worker context

The current catch block deletes the container and releases the VMID. When this moves into BullMQ, avoid blind automatic retries that can re-run a partially successful provisioning job.

Requirements:

  • Start with attempts: 1 or otherwise explicitly disable unsafe auto-retry (see the queue options sketch after this list).
  • Make failure states visible in DB/Portal.
  • Do not retry the whole pipeline unless phases are made idempotent.
  • Ensure cleanup behavior is deterministic and safe when clone/config/start partially succeeds.
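
A minimal sketch of queue-level options that keep automatic retries off by default, using real BullMQ option names; the module layout is an assumption.

```js
const { Queue } = require('bullmq');
const IORedis = require('ioredis');

const connection = new IORedis(process.env.REDIS_URL);

// Disable automatic retries at the queue level so a crash or thrown error
// never silently re-runs a partially provisioned job.
const provisioningQueue = new Queue('provisioning', {
  connection,
  defaultJobOptions: {
    attempts: 1,         // single attempt; retries only once phases are idempotent
    removeOnFail: false, // keep failed jobs visible for inspection
  },
});

module.exports = { provisioningQueue };
```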

2. Resolve confirmVmidAllocated / DB persist atomicity

The current ordering appears to persist the ContainerInstance record and then call confirmVmidAllocated near the end of the pipeline.

Risk:

  • If the worker dies after DB persist but before confirmVmidAllocated, a persisted instance can exist with an unconfirmed VMID allocation.

Required decision/fix:

  • Make DB persist and VMID confirmation atomic, or
  • Treat successful DB persist as the VMID confirmation source of truth, or
  • Move confirmation into the same transaction/state transition used to create the instance.
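
Whichever option is chosen, the transaction variant could look roughly like this sketch, assuming a Prisma client and a hypothetical VmidAllocation model with a confirmed flag; the actual zpack-api schema may differ.

```js
const { PrismaClient } = require('@prisma/client');
const prisma = new PrismaClient();

async function persistInstanceAndConfirmVmid(instanceData) {
  // Creating the instance and confirming the VMID either both succeed or
  // both roll back, so a worker crash cannot leave a persisted instance
  // with an unconfirmed VMID allocation.
  return prisma.$transaction(async (tx) => {
    const instance = await tx.containerInstance.create({ data: instanceData });
    await tx.vmidAllocation.update({
      where: { vmid: instanceData.vmid }, // hypothetical model/field
      data: { confirmed: true },
    });
    return instance;
  });
}

module.exports = { persistInstanceAndConfirmVmid };
```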

3. Queue deduplication / duplicate create protection

The Portal multi-create modal issue is downstream, but the backend also needs duplicate enqueue protection.

Requirements:

  • Use a deterministic BullMQ jobId / idempotency key (see the enqueue sketch after this list).
  • Key should prevent duplicate provisioning if Portal double-submits or retries.
  • Candidate key: request nonce/idempotency key from Portal, or a server/request row created before enqueue.
  • Avoid keying only on customerId + ctype if multiple intentional creates of the same type should be allowed later.
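
A minimal sketch of the deterministic-jobId idea, assuming a durable operation row is created before enqueue; createOperationRow and the payload shape are hypothetical. BullMQ will not add a second job with an existing jobId while the first is still present, so a Portal double-submit or retry cannot enqueue duplicate work.

```js
async function enqueueProvisioning(request) {
  // Durable row first, then enqueue keyed on its id (kept colon-free, since
  // BullMQ uses ':' as its Redis key delimiter).
  const operation = await createOperationRow(request); // hypothetical helper
  await provisioningQueue.add('provision-instance', operation.payload, {
    jobId: operation.id, // deterministic: the same operation never enqueues twice
  });
  return operation;
}
```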

Desired API flow

Portal
  -> POST /api/servers
  -> API validates auth/billing/quota/overdue state
  -> API creates durable provisioning/server request state
  -> API enqueues BullMQ job with deterministic jobId
  -> API returns 202 Accepted + serverId/operationId/statusUrl

Worker:

worker:provision
  -> consumes provisioning job
  -> runs provisioning pipeline
  -> updates phase/status as it progresses
  -> persists final instance state
  -> queues/publishes edge work as appropriate
  -> records failed phase/error if failed

Portal:

/servers renders provisioning_queued/provisioning_running/provisioning_failed state from API status/server list
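
A minimal sketch of the API side of this flow, assuming an Express router, the provisioningQueue from the queue module, and hypothetical validateCreateRequest / createOperationRow helpers; the path and response fields mirror the flow above.

```js
const express = require('express');
const router = express.Router();

router.post('/api/servers', async (req, res) => {
  const check = await validateCreateRequest(req); // auth/billing/quota/overdue
  if (!check.ok) {
    return res.status(check.statusCode).json({ error: check.error });
  }

  // Durable request state is created before enqueue so the operation is
  // observable even if the enqueue or the worker later fails.
  const operation = await createOperationRow(check.request);
  await provisioningQueue.add('provision-instance', operation.payload, {
    jobId: operation.id,
  });

  return res.status(202).json({
    operationId: operation.id,
    statusUrl: `/api/servers/operations/${operation.id}`,
  });
});

module.exports = router;
```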

Launch expectation

This is the preferred launch-safe fix for long-running provisioning and the multi-create modal confusion.

Do not move API/Portal to LXC as part of this. Keep the production serving shape stable and extract provisioning execution first.

Author
Owner

Implementation update — provisioning worker refactor

Provisioning worker refactor has been implemented in zpack-api.

Completed

  • Added durable ProvisioningOperation state in schema.prisma plus migration:
    • prisma/migrations/20260503120000_add_provisioning_operations/migration.sql
  • Added standalone BullMQ provisioning worker:
    • src/queues/provisioning.js
    • concurrency 1
    • attempts: 1
    • colon-free operation.id job IDs
    • lease heartbeats
    • structured lifecycle logs
    • cleanup tracking
  • Changed POST /api/instances to:
    • create/reuse operation
    • enqueue provisioning job
    • return 202 with operationId and statusUrl
  • Added polling route:
    • GET /api/instances/operations/:operationId
  • Added sanitized requestPayload storage and idempotency/no-key launch guard behavior (a sketch of this guard follows the list):
    • same no-key active payload returns existing operation with 202
    • differing no-key active payload returns 409
    • reused idempotency key with differing payload returns 409
  • Refactored provisionAgentInstance() to support optional safe callbacks:
    • phase
    • heartbeat
    • server persistence
    • cleanup result
  • Added VMID source-of-truth warning/TODO near confirmVmidAllocated().
  • Added package script:
    • npm run worker:provision
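
For reference, a minimal sketch of the guard behavior listed above; field names (idempotencyKey, requestPayload, status) follow this thread, but the exact schema, the payload comparison, and the per-customer scoping are assumptions.

```js
async function guardDuplicateCreate({ idempotencyKey, customerId, payload }) {
  const samePayload = (op) =>
    JSON.stringify(op.requestPayload) === JSON.stringify(payload); // illustrative comparison

  if (idempotencyKey) {
    const existing = await prisma.provisioningOperation.findFirst({
      where: { idempotencyKey },
    });
    if (!existing) return null; // no conflict: create + enqueue normally
    if (samePayload(existing)) return { status: 202, operation: existing };
    return { status: 409, error: 'idempotency key reused with a different payload' };
  }

  // No key supplied: launch guard against an already-active create.
  const active = await prisma.provisioningOperation.findFirst({
    where: { customerId, status: { in: ['queued', 'running'] } }, // assumed scoping/status values
  });
  if (!active) return null;
  if (samePayload(active)) return { status: 202, operation: active };
  return { status: 409, error: 'a different create is already in progress' };
}
```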

Validation completed

  • npm run prisma:generate succeeded
  • npx prisma validate succeeded
  • node --check passed for:
    • src/queues/provisioning.js
    • src/routes/instances.js
    • src/api/provisionAgent.js
  • Short smoke start of npm run worker:provision succeeded and idled
  • Short smoke start of npm start succeeded and listened on port 4000

Not yet validated

  • Real provisioning job against Proxmox / Agent infrastructure has not been run yet.
  • DB migration file has been created but has not been applied to the live database yet.

Notes

Existing unrelated worktree changes were already present in:

.env.example
src/routes/monitoring.js
src/routes/promSd.js
scripts/
src/utils/lifecycleDiscovery.js

They were left untouched.

Author
Owner

Live validation update

The new BullMQ provisioning worker path successfully provisioned a real vanilla Minecraft game server.

Validated

  • POST /api/instances returned 202 quickly instead of waiting on provisioning.
  • A durable provisioning operation was created and enqueued.
  • The worker consumed the job with concurrency 1.
  • Phase and heartbeat logging worked through the full provisioning path.
  • The job completed successfully.
  • A real ContainerInstance was created with VMID 5208.
  • Post-provision edge publish still ran.
  • DNS / Velocity registration completed.
  • API status reported the new server online/connectable.

Operation details

operationId: 22962aab-c618-4cb7-a045-d82e0e283aef
jobId:       22962aab-c618-4cb7-a045-d82e0e283aef
customerId:  u000
ctype:       game
variant:     vanilla
version:     1.21.7
vmid:        5208
hostname:    mc-vanilla-5208

Phases observed

allocate-vmid
clone-container
configure-container
start-container
wait-for-ip
build-agent-payload
send-agent-config
wait-for-agent-terminal-state
persist-instance
publish-edge
confirm-vmid
complete

Portal follow-up

Portal now needs to be updated for the async create response. The backend provisioned correctly, but Portal needs to poll the returned operation/status URL and attach the placeholder create card to the real server record once serverId is present or the operation is complete.
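
A minimal sketch of the Portal-side polling described here, using plain fetch; the status/phase/serverId field names follow this thread, while the poll interval and status values are assumptions.

```js
async function pollOperation(statusUrl, onUpdate) {
  for (;;) {
    const res = await fetch(statusUrl, { credentials: 'include' });
    const op = await res.json();
    onUpdate(op); // e.g. update the placeholder card with op.status / op.phase

    // Stop once the operation is terminal or the real server record exists,
    // then attach/replace the placeholder card with the real server card.
    if (op.serverId || op.status === 'completed' || op.status === 'failed') {
      return op;
    }
    await new Promise((resolve) => setTimeout(resolve, 3000)); // assumed interval
  }
}
```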

Remaining before closing this item

- Update Portal for async operation polling / operationId handling
- Install worker as systemd service
- Run a dev container provisioning test through the worker path
- Run a duplicate/idempotency test
- Run a controlled failure test to verify failed phase + cleanup metadata

Status: first live game-server worker provisioning validation passed.

Author
Owner

Live validation update — teardown still works through API

After the successful worker-provisioned vanilla Minecraft server test, teardown was run through the existing API path and completed.

Server under test

vmid: 5208
hostname: mc-vanilla-5208
fqdn: mc-vanilla-5208.zerolaghub.quest
ctype: game

Observed before teardown

Velocity and server status were healthy before delete:

[internalVelocity] proxy status vmid=5208 hostname=mc-vanilla-5208 status=registered_with_proxy
[internalVelocity] proxy status vmid=5208 hostname=mc-vanilla-5208 status=proxy_ping_ok
[serverStatus] vmid=5208 online=true

Console websocket also connected successfully before teardown:

[consoleProxy] ws upgrade vmid=5208 type=game target=ws://10.200.0.81:18888/console/stream?serverId=5208
[consoleProxy] ws vmid=5208 connected -> ws://10.200.0.81:18888/console/stream?serverId=5208

Teardown behavior validated

Existing API teardown path still handled the worker-provisioned server:

[teardown] DELETE request received for VMID 5208
[teardown] Archived vmid=5208 into DeletedInstance (id=172)
[teardown] Deleted Proxmox container vmid=5208

Depublish / cleanup completed:

Velocity unregister: success
Technitium A cleanup: success
Technitium SRV cleanup: success
Cloudflare A cleanup: success
Cloudflare SRV cleanup: success
Teardown complete for mc-vanilla-5208

Notes / follow-up observations

A transient Proxmox status error was observed before teardown:

[Proxmox lxc/status/current] HTTP 500 failed to read from command socket: Connection reset by peer

This appears to have been a host/status read issue and did not block teardown, but it is relevant for the future controller/reconciler. The controller should treat isolated Proxmox status read failures as retry/degraded signals, not immediate destructive conditions.

Cloudflare cleanup attempted a second delete path for _minecraft._tcp.mc-vanilla-5208 after the real SRV was already removed, producing no-match logs. This is noisy but not currently blocking.

Current validation status

Worker-provisioned game create: passed
Worker phase/heartbeat tracking: passed
Post-provision edge publish: passed
Console websocket attach: passed
Existing API teardown: passed
Velocity/DNS/Cloudflare cleanup: passed

Remaining before closing issue:

- Portal async operation polling update
- Install worker as systemd service
- Dev container provisioning test through worker path
- Duplicate/idempotency test
- Controlled failure test for failed phase + cleanup metadata
Author
Owner

Follow-up patch — host status handles transient Proxmox read failures

The transient Proxmox status error observed during validation has been patched in zpack-api.

Original issue

GET /api/servers/:id/host/status called getContainerStatus() once. If Proxmox returned a transient LXC status error, the route surfaced it as an API 500.

Observed example:

[Proxmox lxc/status/current] HTTP 500 failed to read from command socket: Connection reset by peer

This typically indicates Proxmox/LXC could not read the container status socket at that instant.

Patch

src/routes/servers.js was updated so host-status polling catches this Proxmox read failure, logs a warning, and still returns JSON.

Fallback response includes:

powerState: "unknown"
operation state from Redis if present
proxmox.error with message/http code
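
A minimal sketch of the fallback described above, assuming getContainerStatus() throws on the transient read failure and that a hypothetical readOperationStateFromRedis() helper exists; the error fields are illustrative.

```js
router.get('/api/servers/:id/host/status', async (req, res) => {
  try {
    const status = await getContainerStatus(req.params.id);
    return res.json(status);
  } catch (err) {
    // Transient Proxmox/LXC status read failure: degrade instead of returning 500.
    console.warn(`[hostStatus] transient Proxmox read failure: ${err.message}`);
    return res.json({
      powerState: 'unknown',
      operation: await readOperationStateFromRedis(req.params.id), // may be null
      proxmox: { error: { message: err.message, httpCode: err.httpCode || null } },
    });
  }
});
```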

Validation

node --check src/routes/servers.js passed

Controller relevance

This matches the intended future controller behavior: isolated Proxmox status read failures should be treated as retry/degraded signals, not immediate hard failures or destructive repair conditions.

Author
Owner

Portal async create flow implemented

Portal has been updated to consume the new async POST /api/instances behavior.

Files changed

zpack-portal/src/app/(dashboard)/servers/create/page.tsx
zpack-portal/src/app/(dashboard)/servers/page.tsx

Completed behavior

  • POST /api/instances now sends a per-click Idempotency-Key.
  • 202 Accepted is treated as a successful accepted create.
  • Portal stores the returned async operation data:
    • operationId
    • requestId
    • statusUrl
    • status
    • phase
    • eventual serverId
  • /servers polls statusUrl until completion or failure while keeping existing server-list polling.
  • Pending create UI now shows:
    • Request sent
    • Queued
    • Running: <phase>
    • Waiting for server record
    • Ready
    • Failed: <phase/error>
  • When serverId appears or the operation completes, Portal refreshes the server list and attaches/replaces the pending card with the real server card once available.
  • Pending creation UI now appears as an inline server card inside the correct GAME or DEV section instead of a separate banner/modal-style block.

Validation

npm run lint passed with existing unrelated warnings
npm run build passed

Current issue status

The main backend + Portal async provisioning path is now implemented and build-validated.

Remaining before closing this item:

- Install `zpack-provision-worker.service` under systemd
- Run a dev container provisioning test through the worker path
- Run duplicate/idempotency test
- Run controlled failure test for failed phase + cleanup metadata
Author
Owner

Final validation update: the provisioning worker implementation is now validated. The systemd worker service is installed and running. Game and dev provisioning both passed through the worker path, and dev teardown passed through the existing API cleanup path. The idempotency guard, the no-key duplicate guard, and the controlled-failure behavior all passed. Two follow-up fixes were also validated: provisionAgent now records normalize-request failures correctly, and host-status polling soft-handles transient Proxmox read failures. Remaining action: visually confirm the Portal pending-card UX, after which this issue can be moved to completed.

Author
Owner

The Portal pending-card UX has been visually validated. During game provisioning, the inline GAME placeholder card showed the operation state and phase, including Running: wait-for-agent-terminal-state. After completion, the placeholder was replaced by the real server card for mc-vanilla-5209 showing Ready status, the connection hostname, and Host: Online. The existing DEV card dev-6088 remained stable throughout. This completes visual validation of the async Portal create flow.
