Launch architecture: move provisioning into BullMQ worker with idempotency safeguards #11
Decision / launch architecture item
Move server provisioning out of the HTTP request lifecycle and into a BullMQ worker process.
Confirmed current state
- `zpack-api` already has queue infrastructure: `bullmq`, `ioredis`, `src/queues/postProvision.js`.
- Current expensive provisioning still lives in `provisionAgentInstance()` and runs as a synchronous pipeline from the API path (sketched below). It performs Proxmox clone/config/start, waits for IP, sends Agent config, waits for Agent terminal/readiness state with a polling loop, persists the instance, and then queues/publishes follow-on edge work.
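For reference, a rough sketch of that synchronous pipeline as described above; every helper name below is illustrative rather than the actual `zpack-api` code:

```js
// Rough sketch of the current synchronous provisioning pipeline as it runs
// inside the API request today. All helper names are illustrative.
async function provisionAgentInstance(spec) {
  const vmid = await cloneConfigureAndStart(spec);   // Proxmox clone/config/start
  const ip = await waitForIp(vmid);                  // poll until the container has an IP
  await sendAgentConfig(ip, spec);                   // push Agent configuration
  await waitForAgentTerminalState(ip);               // polling loop for readiness
  const instance = await persistContainerInstance(spec, vmid, ip);
  await queueFollowOnEdgeWork(instance);             // queue/publish follow-on edge work
  return instance;
}
```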
Chosen first implementation

Use the same `zpack-api` repo and add a separate worker process first.

Recommended first deployment:
Initial shape:
- Queue name: `provisioning`
- Worker entry point: `npm run worker:provision`
- Concurrency: `1`
- `ctype` branching already exists inside the provisioning function, so the worker can reuse it (see the sketch below)
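A minimal sketch of that worker entry point, assuming the BullMQ/ioredis packages already in the repo; `REDIS_URL` and the exact wiring are assumptions:

```js
// Worker entry sketch (e.g. run via `npm run worker:provision`).
// Assumes provisionAgentInstance() is exported from src/api/provisionAgent.js.
const { Worker } = require('bullmq');
const IORedis = require('ioredis');
const { provisionAgentInstance } = require('./src/api/provisionAgent');

// BullMQ workers require maxRetriesPerRequest: null on the Redis connection.
const connection = new IORedis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null,
});

const worker = new Worker(
  'provisioning',
  async (job) => provisionAgentInstance(job.data), // ctype branching lives inside
  { connection, concurrency: 1 }
);

worker.on('failed', (job, err) => {
  console.error(`provisioning job ${job?.id} failed:`, err.message);
});
```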
Required worker safety additions

1. Error cleanup must survive worker context
The current catch block deletes the container and releases the VMID. When this moves into BullMQ, avoid blind automatic retries that can re-run a partially successful provisioning job.
Requirements:
- Set `attempts: 1` or otherwise explicitly disable unsafe auto-retry.

2. Resolve `confirmVmidAllocated` / DB persist atomicity

Current ordering appears to persist `ContainerInstance` and then call `confirmVmidAllocated` near the end.

Risk: if the process dies between the persist and `confirmVmidAllocated`, a persisted instance can exist with an unconfirmed VMID allocation.

Required decision/fix:
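One possible shape, consistent with the callback refactor described in the implementation update further down (`provisionAgentInstance()` accepting optional safe callbacks such as `confirmVmidAllocated()`); the ordering and helper names below are illustrative assumptions, not a settled decision:

```js
// Sketch: let the caller inject a confirmVmidAllocated hook so confirmation
// and persistence can be ordered (or wrapped transactionally) by the caller.
// Helper names are illustrative.
async function provisionAgentInstance(spec, hooks = {}) {
  const vmid = await allocateVmid(spec);
  // ... Proxmox clone/config/start, wait for IP, Agent config, readiness ...
  if (hooks.confirmVmidAllocated) {
    // One option: confirm before persisting, so a crash cannot leave a
    // persisted instance pointing at an unconfirmed VMID allocation.
    await hooks.confirmVmidAllocated(vmid);
  }
  return persistContainerInstance(spec, vmid);
}
```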
3. Queue deduplication / duplicate create protection
The Portal multi-create modal issue is downstream, but the backend also needs duplicate enqueue protection.
Requirements:
- Use an explicit `jobId` / idempotency key (see the enqueue sketch below).
- Do not dedupe purely on `customerId + ctype` if multiple intentional creates of the same type should be allowed later.
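A hedged sketch of the enqueue side combining both safety requirements (`jobId` dedup and `attempts: 1`); the operation-row-first ordering and field names are assumptions:

```js
// Sketch: create the operation row first, then enqueue with its id as jobId.
// Re-adding a job with the same jobId while that job still exists is a no-op
// in BullMQ, which gives duplicate-enqueue protection; attempts: 1 disables
// automatic retries of partially destructive work.
const { Queue } = require('bullmq');
const IORedis = require('ioredis');

const connection = new IORedis(process.env.REDIS_URL);
const provisioningQueue = new Queue('provisioning', { connection });

async function enqueueProvision(operation) {
  await provisioningQueue.add(
    'provision',
    { operationId: operation.id, payload: operation.requestPayload },
    {
      jobId: operation.id, // deterministic id = idempotent enqueue
      attempts: 1,         // never blindly re-run partial provisioning
    }
  );
}
```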
Desired API flow

Worker:
Portal:
Launch expectation
This is the preferred launch-safe fix for long-running provisioning and the multi-create modal confusion.
Do not move API/Portal to LXC as part of this. Keep the production serving shape stable and extract provisioning execution first.
Implementation update — provisioning worker refactor
Provisioning worker refactor has been implemented in `zpack-api`.

Completed
- `ProvisioningOperation` state in `schema.prisma` plus migration: `prisma/migrations/20260503120000_add_provisioning_operations/migration.sql`
- `src/queues/provisioning.js` with concurrency `1`, `attempts: 1`, and `operation.id` job IDs
- Changed `POST /api/instances` to return `202` with `operationId` and `statusUrl`
- Added `GET /api/instances/operations/:operationId`
- `requestPayload` storage and idempotency/no-key launch guard behavior: `202` on idempotent replay, `409` on key conflict, `409` on keyless duplicate
- `provisionAgentInstance()` refactored to support optional safe callbacks: `confirmVmidAllocated()`
- Added `npm run worker:provision`
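A sketch of the accepted-create route shape, assuming an Express router with `prisma` and `provisioningQueue` in scope; the replay/conflict branches (`202` on idempotent replay, `409` otherwise) are elided, and field names are illustrative:

```js
// Sketch of the 202 accepted-create shape; not the actual zpack-api route.
const express = require('express');
const router = express.Router();

router.post('/api/instances', async (req, res) => {
  // Record the operation first so its id can drive the BullMQ jobId.
  const operation = await prisma.provisioningOperation.create({
    data: {
      status: 'queued',
      idempotencyKey: req.get('Idempotency-Key') || null,
      requestPayload: req.body,
    },
  });

  await provisioningQueue.add(
    'provision',
    { operationId: operation.id },
    { jobId: operation.id, attempts: 1 }
  );

  // Respond immediately; the worker does the expensive provisioning.
  return res.status(202).json({
    operationId: operation.id,
    statusUrl: `/api/instances/operations/${operation.id}`,
  });
});
```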
Validation completed

- `npm run prisma:generate` succeeded
- `npx prisma validate` succeeded
- `node --check` passed for: `src/queues/provisioning.js`, `src/routes/instances.js`, `src/api/provisionAgent.js`
- `npm run worker:provision` succeeded and idled
- `npm start` succeeded and listened on port `4000`

Not yet validated
Notes
Existing unrelated worktree changes were already present in:
They were left untouched.
Live validation update
The new BullMQ provisioning worker path successfully provisioned a real vanilla Minecraft game server.
Validated
- `POST /api/instances` returned `202` quickly instead of waiting on provisioning.
- `ContainerInstance` was created with VMID `5208`.

Operation details
Phases observed
Portal follow-up
Portal now needs to be updated for the async create response. The backend provisioned correctly, but Portal needs to poll the returned operation/status URL and attach the placeholder create card to the real server record once `serverId` is present or the operation is complete.

Remaining before closing this item
Status: first live game-server worker provisioning validation passed.
Live validation update — teardown still works through API
After the successful worker-provisioned vanilla Minecraft server test, teardown was run through the existing API path and completed.
Server under test
Observed before teardown
Velocity and server status were healthy before delete:
Console websocket also connected successfully before teardown:
Teardown behavior validated
Existing API teardown path still handled the worker-provisioned server:
Depublish / cleanup completed:
Notes / follow-up observations
A transient Proxmox status error was observed before teardown:
This appears to have been a host/status read issue and did not block teardown, but it is relevant for the future controller/reconciler. The controller should treat isolated Proxmox status read failures as retry/degraded signals, not immediate destructive conditions.
Cloudflare cleanup attempted a second delete path for `_minecraft._tcp.mc-vanilla-5208` after the real SRV record was already removed, producing no-match logs. This is noisy but not currently blocking.

Current validation status
Remaining before closing issue:
Follow-up patch — host status handles transient Proxmox read failures
The transient Proxmox status error observed during validation has been patched in `zpack-api`.

Original issue
`GET /api/servers/:id/host/status` called `getContainerStatus()` once. If Proxmox returned a transient LXC status error, the route surfaced it as an API 500.

Observed example:
This typically indicates Proxmox/LXC could not read the container status socket at that instant.
Patch
`src/routes/servers.js` was updated so host-status polling catches this Proxmox read failure, logs a warning, and still returns JSON.

Fallback response includes:
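As a minimal sketch of the pattern, assuming an Express route; `getContainerStatus()` is the real helper named above, but the fallback field names (`ok`/`degraded`/`error`) are illustrative, not the actual response shape:

```js
// Sketch: degrade instead of 500 on a transient Proxmox/LXC status read failure.
router.get('/api/servers/:id/host/status', async (req, res) => {
  try {
    const status = await getContainerStatus(req.params.id);
    return res.json({ ok: true, status });
  } catch (err) {
    // Transient read failure: log a warning and still return JSON.
    console.warn(`host-status read failed for ${req.params.id}: ${err.message}`);
    return res.json({ ok: false, degraded: true, error: 'status-unavailable' });
  }
});
```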
Validation
Controller relevance
This matches the intended future controller behavior: isolated Proxmox status read failures should be treated as retry/degraded signals, not immediate hard failures or destructive repair conditions.
Portal async create flow implemented
Portal has been updated to consume the new async `POST /api/instances` behavior.

Files changed
Completed behavior
- `POST /api/instances` now sends a per-click `Idempotency-Key`.
- `202 Accepted` is treated as a successful accepted create.
- Portal stores the returned `operationId`, `requestId`, `statusUrl`, and eventual `serverId`, and polls `statusUrl` until completion or failure while keeping the existing `/servers` server-list polling (see the polling sketch below).
- Pending-card states: `Request sent`, `Queued`, `Running: <phase>`, `Waiting for server record`, `Ready`, `Failed: <phase/error>`.
- Once `serverId` appears or the operation completes, Portal refreshes the server list and attaches/replaces the pending card with the real server card once available.
- The pending card renders inline in the `GAME` or `DEV` section instead of a separate banner/modal-style block.
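A minimal sketch of that status polling loop, assuming `fetch` and the `statusUrl` shape returned by the backend; the 2-second interval and terminal status names are assumptions:

```js
// Sketch: poll the operation's statusUrl until a terminal state, surfacing
// intermediate states ("Queued", "Running: <phase>", ...) via onUpdate.
async function pollOperation(statusUrl, onUpdate) {
  for (;;) {
    const res = await fetch(statusUrl);
    const op = await res.json();
    onUpdate(op); // drives the pending-card state text
    if (op.status === 'succeeded' || op.status === 'failed') {
      return op; // terminal: caller refreshes the server list or shows failure
    }
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}
```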
Validation

Current issue status
The main backend + Portal async provisioning path is now implemented and build-validated.
Remaining before closing this item:
Final validation update: the provisioning worker implementation is now validated.

- Systemd worker service is installed and running.
- Game and dev provisioning both passed through the worker path.
- Dev teardown passed through API cleanup.
- Idempotency, the no-key duplicate guard, and controlled failure behavior all passed.
- Two follow-up fixes were also validated: `provisionAgent` now records normalize-request failures correctly, and host-status polling soft-handles transient Proxmox read failures.

Remaining action: visually confirm the Portal pending-card UX, then this issue can be moved to completed.
Portal pending-card UX visually validated. During game provisioning, the inline GAME placeholder card showed the operation state and phase, including `Running: wait-for-agent-terminal-state`. After completion, the placeholder was replaced by the real server card for `mc-vanilla-5209` with Ready status, connection hostname, and Host: Online. The existing DEV card `dev-6088` remained stable. This completes visual validation of the async Portal create flow.