Launch architecture: build singleton controller/reconciler with Discord notifications #13
Launch architecture requirement
Build a conservative ZLH controller/reconciler before launch. This is required because James works full time and operates ZLH solo with no secondary operator. Routine failures must be detected and repaired automatically where safe.
Purpose
The controller is not an observability dashboard. It is the platform control loop:
Build order
Minimal launch scope
Run every 60 seconds:
Singleton requirement
Controller must run as a singleton with a Redis-backed distributed lock from day one.
Reason: if two controller instances run at the same time, ZLH risks split-brain repair decisions.
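A minimal sketch of the singleton lock idea, under stated assumptions. The `client` interface (`setIfAbsent`, `compareAndDelete`), the key name, and the TTL are hypothetical and not taken from the actual implementation. Against real Redis, `setIfAbsent` would map to `SET key value NX PX ttlMs`, and `compareAndDelete` to a small script that deletes the key only if its value still matches the owner token. An in-memory stand-in is included so the sketch runs without a Redis server.

```javascript
const crypto = require("crypto");

// Hypothetical lock wrapper: one random token per process identifies the owner.
function createControllerLock(client, { key = "zlh:controller:lock", ttlMs = 90000 } = {}) {
  const token = crypto.randomUUID();
  return {
    token,
    // Succeeds only if no other live owner holds the key.
    acquire: () => client.setIfAbsent(key, token, ttlMs),
    // Deletes the key only if this process still owns it.
    release: () => client.compareAndDelete(key, token),
  };
}

// In-memory stand-in for the assumed client interface (illustration only).
function createFakeClient(now = () => Date.now()) {
  const store = new Map();
  const live = (key) => {
    const entry = store.get(key);
    if (entry && entry.expiresAt <= now()) store.delete(key); // expire lazily
    return store.get(key);
  };
  return {
    async setIfAbsent(key, value, ttlMs) {
      if (live(key)) return false;
      store.set(key, { value, expiresAt: now() + ttlMs });
      return true;
    },
    async compareAndDelete(key, value) {
      const entry = live(key);
      if (!entry || entry.value !== value) return false;
      store.delete(key);
      return true;
    },
  };
}
```

The TTL matters as much as the lock itself: if the owning process dies without releasing, the key expires and another instance can take over, so the TTL should be longer than one controller cycle.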
Recommended placement:
Queue structure
Repair policy levels
Only Level 0 and Level 1 should be enabled at launch.
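The launch rule above could be enforced with a small gate like the one below. This is only a sketch: the shape of a recommendation (`action`, `level`), the outcome names, and the `dryRun` option are assumptions made for illustration, not the controller's actual data model.

```javascript
// Launch policy from this issue: only Level 0 and Level 1 are enabled.
const ENABLED_LEVELS = new Set([0, 1]);

// Hypothetical gate: decides what to do with a recommended repair.
function decideAction(recommendation, { dryRun = false } = {}) {
  const { action, level } = recommendation;
  if (!Number.isInteger(level) || !ENABLED_LEVELS.has(level)) {
    // Anything above the enabled levels is never run automatically.
    return { action, outcome: "escalate", reason: `level ${level} is disabled` };
  }
  if (dryRun) {
    return { action, outcome: "dry_run", reason: "dry-run mode" };
  }
  return { action, outcome: "enqueue", reason: "level enabled" };
}
```

Keeping the enabled levels in one place means raising the ceiling later is a deliberate one-line change rather than something scattered across repair handlers.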
Circuit breakers required from day one
Discord notifications
James is creating a dedicated ZLH Discord server. Wire Discord webhook notifications into the controller as:
Channel structure:
Notification rules:
Implementation: Discord webhook per channel, HTTP POST only. No bot required at this stage. One function per alert level in the notification service.
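The webhook approach described above can be sketched as follows. The level names (`info`, `warning`, `critical`) and the `webhooks` map are hypothetical, since the channel list is not reproduced here. The only Discord-specific facts relied on are that a webhook accepts an HTTP POST with a JSON body containing a `content` field, and that `content` is limited to 2000 characters. The default sender assumes a Node.js version with a global `fetch` (18 or later).

```javascript
// Hypothetical notifier: one function per alert level, one webhook URL per level.
function createNotifier(webhooks, post = defaultPost) {
  const send = (level) => async (message) => {
    const url = webhooks[level];
    if (!url) return false; // no channel configured for this level
    const content = `[${level.toUpperCase()}] ${message}`.slice(0, 2000);
    await post(url, { content });
    return true;
  };
  return {
    info: send("info"),
    warning: send("warning"),
    critical: send("critical"),
  };
}

// Default sender: plain HTTP POST, no bot or gateway connection.
async function defaultPost(url, payload) {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`Discord webhook returned HTTP ${res.status}`);
}
```

Injecting `post` keeps the service testable without sending real messages, and the URLs themselves would come from environment variables rather than source code.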
Launch expectation
Controller should be conservative. Do not enable destructive actions automatically. The first version should observe, repair non-destructive drift, and escalate when safe automatic repair is not possible.
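One way to picture the observe / repair / escalate behavior described above is a single tick function driven by a timer. Everything here is a sketch under assumptions: `observe`, `repair`, and `escalate` are injected hypothetical functions, and `safeToRepair` is an invented flag standing in for whatever policy check the controller applies.

```javascript
// One controller cycle: observe, repair what is safe, escalate the rest.
async function runTick({ observe, repair, escalate }) {
  const summary = { repaired: 0, escalated: 0 };
  const findings = await observe();
  for (const finding of findings) {
    let repairError = null;
    if (finding.safeToRepair) {
      try {
        await repair(finding);
        summary.repaired += 1;
        continue;
      } catch (err) {
        repairError = err.message; // a failed repair falls through to escalation
      }
    }
    await escalate({ ...finding, repairError });
    summary.escalated += 1;
  }
  return summary;
}

// Runs a tick every 60 seconds and skips a cycle if the previous one is still running.
function startLoop(deps, intervalMs = 60000) {
  let running = false;
  return setInterval(async () => {
    if (running) return;
    running = true;
    try {
      await runTick(deps);
    } catch (err) {
      console.error("tick failed:", err.message);
    } finally {
      running = false;
    }
  }, intervalMs);
}
```

A failed repair is treated the same as an unsafe one: it is escalated rather than retried in a tight loop, which keeps the first version conservative.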
Implementation update — controller/reconciler foundation
The controller/reconciler foundation and Discord notification service have been implemented in zpack-api.
Created / changed
Added scripts
Added persistence
Added RepairEvent model and migration.
Discord notification service
Webhook-only Discord notifications were added. Notification failures are fail-soft. Secret-looking fields are redacted.
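The fail-soft and redaction behavior could look roughly like this. It is a sketch, not the code in zpack-api: the key pattern, the `[REDACTED]` placeholder, and the function names are assumptions chosen for illustration.

```javascript
// Assumed heuristic: a field is "secret-looking" if its key matches this pattern.
const SECRET_KEY_PATTERN = /(token|secret|password|webhook|api[_-]?key|authorization)/i;

// Returns a deep copy with secret-looking fields replaced by a placeholder.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key,
        SECRET_KEY_PATTERN.test(key) ? "[REDACTED]" : redact(inner),
      ])
    );
  }
  return value;
}

// Fail-soft wrapper: a notification error is logged and swallowed, never thrown.
async function notifySoft(sendFn, payload, log = console.warn) {
  try {
    await sendFn(redact(payload));
    return true;
  } catch (err) {
    log(`notification failed (ignored): ${err.message}`);
    return false;
  }
}
```

Swallowing the error is deliberate: a Discord outage should not stop the controller from repairing drift.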
Discord env vars are expected only in .env or a secret manager:
Webhook URLs should not be committed. If a real webhook was pasted into chat or any non-private surface, rotate it.
Singleton controller lock
controllerLock.js implements Redis singleton locking with:
Controller behavior
reconciler.js runs as a separate process, not inside the HTTP API. It supports:
Current observations:
Enabled repairs
Stubbed / disabled
Validation
Current status
Controller/reconciler foundation is implemented and dry-run validated. Next step is to review the implementation and then decide whether to run the controller in dry-run under systemd before enabling non-dry-run Level 1 repairs.
Validation update — stale operation repair test
A manual stale operation repair test was run against the repair worker.
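For orientation, the repair exercised by this test (turning an expired queued operation into a stale one with structured error metadata) can be sketched as a pure function. The field names (`status`, `expiresAt`, `error`) and the error code are hypothetical and do not reflect the actual schema.

```javascript
// Marks an operation stale if it is still queued past its expiry time.
// Returns the original object unchanged when no repair is needed.
function markStaleIfExpired(operation, now = new Date()) {
  const expired =
    operation.status === "queued" && new Date(operation.expiresAt) <= now;
  if (!expired) return operation;
  return {
    ...operation,
    status: "stale",
    // Structured metadata so the reason is queryable later, not just logged.
    error: {
      code: "operation_expired",
      message: "Queued operation passed its expiry without being processed",
      previousStatus: operation.status,
      detectedAt: now.toISOString(),
    },
  };
}
```

Because the function returns a new object and leaves the input alone, the same logic can be run in dry-run mode to report what would change without writing anything.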
Repair job
Result
Repair worker completed the job successfully:
DB verification confirmed the operation was marked stale:
Validation status
clear_stale_operation_lock Level 1 repair behavior passed. The repair worker successfully converted an expired queued provisioning operation into a durable stale state with structured error metadata.
Follow-up gap — controller does not yet detect live Cloudflare DNS drift
Current controller/repair split is clarified:
Impact:
If a Cloudflare A or SRV record is manually deleted, the controller probably will not notice automatically yet, because it is mostly observing persisted edge/payload state rather than live Cloudflare state.
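Closing this gap amounts to comparing the records the platform expects against what the DNS provider actually returns. The sketch below shows only that comparison. The record shape (`type`, `name`, `content`) is an assumption, and fetching the live records from Cloudflare is deliberately left out.

```javascript
// Records are matched on type plus case-insensitive name.
function recordKey(record) {
  return `${record.type}:${record.name.toLowerCase()}`;
}

// Returns one drift entry per expected record that is missing or different live.
function detectDnsDrift(expected, live) {
  const liveByKey = new Map(live.map((record) => [recordKey(record), record]));
  const drift = [];
  for (const want of expected) {
    const have = liveByKey.get(recordKey(want));
    if (!have) {
      drift.push({ kind: "missing", record: want });
    } else if (have.content !== want.content) {
      drift.push({ kind: "mismatch", record: want, liveContent: have.content });
    }
  }
  return drift;
}
```

An empty result means the live state matches the expected state for every record checked. Extra live records are ignored here, since deleting unknown records would not be a non-destructive repair.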
Manual repair test is still possible by enqueueing edge_republish or dns_republish against a safe test game server VMID.
Example manual test shape:
Replace 5000 with the real disposable test game server VMID.
Next code step for automatic repair:
This should remain Level 1 repair behavior and only run for non-suspended, non-deleting, non-maintenance-locked game servers.
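The eligibility rule above can be expressed as a small guard. The property names (`suspended`, `deleting`, `maintenanceLocked`) are hypothetical stand-ins for however the game server state is actually stored.

```javascript
// Returns whether automatic repair may run, plus the reasons when it may not.
function isRepairEligible(server) {
  const blockers = [];
  if (server.suspended) blockers.push("suspended");
  if (server.deleting) blockers.push("deleting");
  if (server.maintenanceLocked) blockers.push("maintenance_locked");
  return { eligible: blockers.length === 0, blockers };
}
```

Returning the blockers rather than a bare boolean makes it easy to record why a repair was skipped.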
Implementation update — live edge drift detection path
Live edge drift detection and repair integration have been implemented in zpack-api, but this controller item should not be marked fully complete yet because the real Cloudflare delete / dry-run / non-dry-run disposable-server validation has not been performed.
Files changed
New behavior
observeEdgeState(instance) was added and the controller now observes live edge state for game servers.
External state checked:
Policy / repair behavior
Confirmed Cloudflare / Technitium / Velocity drift now recommends enabled Level 1 repair:
Existing safe actions retained:
Level 2 remains disabled.
Repair worker behavior
edge_republish now uses the existing full edge publish path, then post-checks live edge state so it does not report success unless the edge state was actually restored.
Validation completed
Controller smoke note:
Limitations / TODOs
Required next validation
On a disposable game server:
Do not close controller issue until at least one real Cloudflare drift test passes.
Validation update — live Cloudflare SRV drift repair passed
A real live edge drift test was performed on a disposable game server by deleting the Cloudflare SRV record.
Server under test
Result
The controller/repair path detected the drift and edge_republish restored the edge state through the existing edge publish path.
Repair job completed successfully:
Edge publisher completed and Velocity remained healthy/idempotent:
Validation status
Live Cloudflare SRV drift repair: passed.
This confirms the intended Level 1 repair loop:
Remaining optional validation: