Launch architecture: build singleton controller/reconciler with Discord notifications #13

jester · 2026-05-03T12:15:19Z

jester commented

2026-05-03 12:15:19 +00:00

Launch architecture requirement

Build a conservative ZLH controller/reconciler before launch. This is required because James works full time and operates ZLH solo with no secondary operator. Routine failures must be detected and repaired automatically where safe.

Purpose

The controller is not an observability dashboard. It is the platform control loop:

What should be true?
What is actually true?
Is the drift safe to fix automatically?
If yes, enqueue repair.
If no, escalate.
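
The five questions above can be sketched as one pure planning step. This is a minimal illustration, not the reconciler's actual code: the `checks` shape, the `planReconcile` name, and the `safeActions` set are hypothetical.

```javascript
// Hypothetical sketch of one reconcile pass.
// Each check carries what should be true, what is true, and an optional repair action.
function planReconcile(checks, safeActions) {
  const plan = { repairs: [], escalations: [] };
  for (const check of checks) {
    // No drift: expected state matches actual state.
    if (check.expected === check.actual) continue;
    const safe = check.repairAction != null && safeActions.has(check.repairAction);
    if (safe) {
      // Drift that is safe to fix automatically: enqueue a repair.
      plan.repairs.push({ serverId: check.serverId, action: check.repairAction });
    } else {
      // Drift with no safe automatic repair: escalate to the operator.
      plan.escalations.push({ serverId: check.serverId, check: check.name });
    }
  }
  return plan;
}
```

Keeping the decision pure (no I/O) makes it easy to run the same logic in dry-run mode and only skip the enqueue step.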

Build order

1. Provisioning worker + operation leases
2. Minimal controller (observe + Level 1 repairs)
3. Billing webhook live
4. Expand controller to workload-level repairs

Minimal launch scope

Run every 60 seconds:

- Clear stale operation locks where heartbeat expired and no active BullMQ job exists
- Detect agent offline where LXC is running but Agent is unreachable
- Detect missing Velocity registration and re-register automatically
- Detect missing DNS record and republish automatically
- Detect stuck provisioning jobs (enqueued > 15 min with no progress), fail cleanly, release VMID
- If billing suspended and server is running, stop workload
- If backup missed, notify only; do not auto-repair
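
As one example from this list, the stuck-provisioning check could look roughly like the sketch below. The job fields (`status`, `enqueuedAt`, `lastProgressAt`, all timestamps in milliseconds) are assumptions for illustration, and "no progress" is interpreted here as no recorded activity for more than 15 minutes.

```javascript
// Hypothetical sketch: find provisioning jobs with no activity for more than 15 minutes.
const STUCK_AFTER_MS = 15 * 60 * 1000;
const ACTIVE_STATUSES = new Set(['queued', 'running']);

function findStuckProvisioningJobs(jobs, nowMs) {
  return jobs.filter((job) => {
    if (!ACTIVE_STATUSES.has(job.status)) return false;
    // Fall back to the enqueue time when no progress was ever recorded.
    const lastActivity = job.lastProgressAt ?? job.enqueuedAt;
    return nowMs - lastActivity > STUCK_AFTER_MS;
  });
}
```

The caller would then fail each returned job cleanly and release its VMID; that part depends on the existing provisioning code and is not shown.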

Singleton requirement

Controller must run as a singleton with a Redis-backed distributed lock from day one.

Reason: if two controller instances run at the same time, ZLH risks split-brain repair decisions.

Recommended placement:

zpack-api repo:
  src/controllers/reconciler.js

systemd:
  zlh-controller.service

Queue structure

provisioning queue  -> create servers
operation queue     -> start/stop/restart/backup/restore
repair queue        -> fix drift
reconcile loop      -> scheduled 60s health check

Repair policy levels

Only Level 0 and Level 1 should be enabled at launch.

Level 0: observe and log only
Level 1: stale lock clear, edge republish, DNS republish, Velocity re-register
Level 2: agent service restart, workload restart (enable after proven stable)
Level 3: restore backup, rebuild container — never automatic, admin only
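
A policy gate for these levels might be sketched as follows. The Level 1 action names match the repair actions named later in this thread; the Level 2 and Level 3 names and the `evaluateAction` function are hypothetical.

```javascript
// Hypothetical sketch of a level gate. Level 2/3 action names are illustrative only.
const ACTION_LEVELS = new Map([
  ['clear_stale_operation_lock', 1],
  ['edge_republish', 1],
  ['dns_republish', 1],
  ['velocity_reregister', 1],
  ['agent_service_restart', 2],
  ['workload_restart', 2],
  ['restore_backup', 3],
  ['rebuild_container', 3],
]);
const ADMIN_ONLY_LEVEL = 3;

function evaluateAction(action, maxEnabledLevel) {
  const level = ACTION_LEVELS.get(action);
  if (level === undefined) return { allowed: false, reason: 'unknown_action' };
  // Level 3 is never automatic, regardless of configuration.
  if (level >= ADMIN_ONLY_LEVEL) return { allowed: false, reason: 'admin_only' };
  if (level > maxEnabledLevel) return { allowed: false, reason: 'level_disabled' };
  return { allowed: true, level };
}
```

At launch the controller would call this with `maxEnabledLevel` set to 1, so Level 2 actions are rejected until they are explicitly enabled.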

Circuit breakers required from day one

Max 3 repair attempts per server per hour
Max 1 destructive candidate per server until admin review
Global repair concurrency limit
Per-host repair concurrency limit
No repair during active backup/restore/maintenance lock
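
The first breaker (max 3 repair attempts per server per hour) can be illustrated with a sliding-window counter. This sketch keeps state in memory for clarity, which is an assumption; a real controller would likely derive the count from persisted repair history so that restarts do not reset it.

```javascript
// Hypothetical in-memory sliding-window limiter: at most `maxAttempts` per server per window.
class RepairRateLimiter {
  constructor(maxAttempts = 3, windowMs = 60 * 60 * 1000) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
    this.attempts = new Map(); // serverId -> array of attempt timestamps (ms)
  }

  // Returns true and records the attempt if the server is under its limit.
  tryAcquire(serverId, nowMs = Date.now()) {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.attempts.get(serverId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.maxAttempts) {
      this.attempts.set(serverId, recent);
      return false; // breaker open: escalate instead of repairing
    }
    recent.push(nowMs);
    this.attempts.set(serverId, recent);
    return true;
  }
}
```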

Discord notifications

James is creating a dedicated ZLH Discord server. Wire Discord webhook notifications into the controller as:

src/services/discord.js

Channel structure:

#alerts-critical     -> repair failed 3x, data at risk — always ping
#alerts-repairs      -> controller acted automatically — no ping
#alerts-billing      -> payment events, webhook failures — ping
#alerts-provisioning -> server create/fail — ping on failure
#system-health       -> daily digest, backup status, queue depth — no ping
#audit-log           -> full timestamped controller action history — no ping

Notification rules:

Level 1 repair succeeded     -> #alerts-repairs, no ping
Level 2 repair succeeded     -> #alerts-repairs, ping if outside business hours
Repair failed 3x             -> #alerts-critical, always ping
Provisioning job failed      -> #alerts-provisioning, always ping
Billing webhook missed       -> #alerts-billing, always ping
Backup missed 2x consecutive -> #alerts-critical, always ping
Daily health digest          -> #system-health, no ping

Implementation: Discord webhook per channel, HTTP POST only. No bot required at this stage. One function per alert level in the notification service.
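
The notification rules above could be expressed as a small routing table plus a webhook POST. This is a sketch under assumptions: the event type keys, the function names, and the use of `@here` as the ping mechanism are illustrative, and "business hours" is left to the caller because the issue does not define it. It relies on the global `fetch` available in Node 18 and later.

```javascript
// Hypothetical routing table derived from the notification rules in this issue.
const NOTIFICATION_RULES = new Map([
  ['level1_repair_succeeded', { channel: 'alerts-repairs', ping: 'never' }],
  ['level2_repair_succeeded', { channel: 'alerts-repairs', ping: 'outside_business_hours' }],
  ['repair_failed_3x', { channel: 'alerts-critical', ping: 'always' }],
  ['provisioning_job_failed', { channel: 'alerts-provisioning', ping: 'always' }],
  ['billing_webhook_missed', { channel: 'alerts-billing', ping: 'always' }],
  ['backup_missed_2x', { channel: 'alerts-critical', ping: 'always' }],
  ['daily_health_digest', { channel: 'system-health', ping: 'never' }],
]);

function routeNotification(eventType, isBusinessHours) {
  const rule = NOTIFICATION_RULES.get(eventType);
  if (!rule) return null;
  const ping =
    rule.ping === 'always' ||
    (rule.ping === 'outside_business_hours' && !isBusinessHours);
  return { channel: rule.channel, ping };
}

// Discord webhook bodies accept `content` (max 2000 chars) and `allowed_mentions`.
function buildWebhookBody(message, ping) {
  const content = (ping ? `@here ${message}` : message).slice(0, 2000);
  return { content, allowed_mentions: { parse: ping ? ['everyone'] : [] } };
}

async function postToWebhook(webhookUrl, body) {
  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Discord webhook returned HTTP ${res.status}`);
}
```

A table keeps the rules in one reviewable place; thin per-level wrapper functions can still sit on top of it if one function per alert level is preferred.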

Launch expectation

Controller should be conservative. Do not enable destructive actions automatically. The first version should observe, repair non-destructive drift, and escalate when safe automatic repair is not possible.

jester commented

2026-05-03 14:25:35 +00:00

Implementation update — controller/reconciler foundation

Controller/reconciler foundation and Discord notification service have been implemented in zpack-api.

Created / changed

src/services/discord.js
src/locks/controllerLock.js
src/controllers/repairPolicy.js
src/queues/repair.js
src/controllers/reconciler.js
prisma/schema.prisma
prisma/migrations/20260503130000_add_repair_events
.env.example
package.json

Added scripts

npm run controller
npm run worker:repair

Added persistence

Added RepairEvent model and migration.

Discord notification service

Webhook-only Discord notifications were added. Notification failures are fail-soft. Secret-looking fields are redacted.
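
The fail-soft and redaction behavior described here might look roughly like the sketch below. It is not the code in `src/services/discord.js`: the key pattern, the `[REDACTED]` placeholder, and both function names are assumptions.

```javascript
// Hypothetical sketch: redact secret-looking keys, then send without ever throwing.
const SECRET_KEY_PATTERN = /(token|secret|password|webhook|api[_-]?key|authorization)/i;

function redactSecrets(value) {
  if (Array.isArray(value)) return value.map(redactSecrets);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key,
        SECRET_KEY_PATTERN.test(key) ? '[REDACTED]' : redactSecrets(inner),
      ]),
    );
  }
  return value;
}

// `send` is any async function that delivers the payload (for example a webhook POST).
async function notifySoft(send, payload, log = console.warn) {
  try {
    await send(redactSecrets(payload));
    return true;
  } catch (err) {
    // Fail-soft: a notification failure must not break the repair loop.
    log('discord_notify_failed', err.message);
    return false;
  }
}
```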

Discord env vars expected only in .env or secret manager:

DISCORD_ALERTS_CRITICAL_WEBHOOK
DISCORD_ALERTS_REPAIRS_WEBHOOK
DISCORD_ALERTS_BILLING_WEBHOOK
DISCORD_ALERTS_PROVISIONING_WEBHOOK
DISCORD_SYSTEM_HEALTH_WEBHOOK
DISCORD_AUDIT_LOG_WEBHOOK

Webhook URLs should not be committed. If a real webhook was pasted into chat or any non-private surface, rotate it.

Singleton controller lock

controllerLock.js implements Redis singleton locking with:

SET NX PX
token-checked renew
token-checked release
lock key: zlh:controller:lock
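
For readers unfamiliar with the pattern, a token-checked lock of this kind can be sketched as below. This is not the contents of `controllerLock.js`. It assumes an ioredis-style client (`set(key, value, 'PX', ms, 'NX')` and `eval(script, numKeys, ...)`), and the token generation is a simple illustrative choice.

```javascript
// Hypothetical sketch of a token-checked Redis lock, assuming an ioredis-style client.
const LOCK_KEY = 'zlh:controller:lock';

// Extend the TTL only if this process still owns the lock.
const RENEW_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("pexpire", KEYS[1], ARGV[2])
end
return 0`;

// Delete the lock only if this process still owns it.
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
end
return 0`;

function createControllerLock(redis, ttlMs) {
  // Unique per-process token so one instance cannot renew or release another's lock.
  const token = `${process.pid}-${Date.now()}-${Math.random().toString(36).slice(2)}`;
  return {
    token,
    async acquire() {
      // SET NX PX: succeeds only if the key does not exist yet.
      return (await redis.set(LOCK_KEY, token, 'PX', ttlMs, 'NX')) === 'OK';
    },
    async renew() {
      return (await redis.eval(RENEW_SCRIPT, 1, LOCK_KEY, token, ttlMs)) === 1;
    },
    async release() {
      return (await redis.eval(RELEASE_SCRIPT, 1, LOCK_KEY, token)) === 1;
    },
  };
}
```

The token check matters because the TTL can expire while a slow loop is still running; without it, a late `release` could delete a lock that a second instance has since acquired.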

Controller behavior

reconciler.js runs as a separate process, not inside the HTTP API. It supports:

CONTROLLER_DRY_RUN=true
CONTROLLER_INTERVAL_MS
CONTROLLER_MAX_REPAIRS_PER_LOOP
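
A sketch of how these three variables might be parsed is shown below. The defaults are assumptions, not the reconciler's documented behavior: 60000 ms follows the 60-second loop in the original requirement, the per-loop cap of 5 is arbitrary, and dry-run is treated as on unless explicitly set to `false`.

```javascript
// Hypothetical config loader. Default values are assumptions for illustration.
function loadControllerConfig(env = process.env) {
  const intervalMs = Number.parseInt(env.CONTROLLER_INTERVAL_MS ?? '60000', 10);
  const maxRepairsPerLoop = Number.parseInt(env.CONTROLLER_MAX_REPAIRS_PER_LOOP ?? '5', 10);
  if (!Number.isInteger(intervalMs) || intervalMs <= 0) {
    throw new Error('CONTROLLER_INTERVAL_MS must be a positive integer');
  }
  if (!Number.isInteger(maxRepairsPerLoop) || maxRepairsPerLoop < 0) {
    throw new Error('CONTROLLER_MAX_REPAIRS_PER_LOOP must be a non-negative integer');
  }
  return {
    // Conservative default: only an explicit "false" disables dry-run.
    dryRun: env.CONTROLLER_DRY_RUN !== 'false',
    intervalMs,
    maxRepairsPerLoop,
  };
}
```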

Current observations:

- Expired active ProvisioningOperation rows
- Active ContainerInstance rows
- Safe Proxmox status reads, with failures contained
- Agent /status if server IP exists, with failures contained
- Persisted payload edge state for game server edge/DNS/Velocity hints

Enabled repairs

clear_stale_operation_lock
  marks expired queued/running provisioning operations as stale

edge_republish
  calls existing edge publish helper

velocity_reregister
  calls existing Velocity registration helper

dns_republish
  delegates to full edge republish because DNS target selection is coupled there

Stubbed / disabled

Level 2 actions such as agent/workload restart are defined but not enabled
Level 3 restore/rebuild remains never automatic
Unsupported or unsafe repair actions return skipped results, not fake success

Validation

npm run prisma:generate: passed
npx prisma validate: passed
npx prisma migrate deploy: passed
node --check: passed for all new files
npm run worker:repair smoke: passed
CONTROLLER_DRY_RUN=true npm run controller smoke: passed
fake expired operation dry-run recommended clear_stale_operation_lock and did not enqueue
missing Discord env vars warned and did not crash
real Discord webhook URL was not written to repo files

Current status

Controller/reconciler foundation is implemented and dry-run validated. Next step is to review the implementation and then decide whether to run the controller in dry-run under systemd before enabling non-dry-run Level 1 repairs.

jester commented

2026-05-03 15:26:07 +00:00

Validation update — stale operation repair test

A manual stale operation repair test was run against the repair worker.

Repair job

repairId: manual-clear-stale-test-1777821822206
jobId: manual-clear-stale-test-1777821822206
action: clear_stale_operation_lock
operationId: 53de788a-7cfb-4299-a590-2b4bde3a2b99
dryRun: false

Result

Repair worker completed the job successfully:

ok: true
action: clear_stale_operation_lock
skipped: false
previousStatus: queued
status: stale
error: null

DB verification confirmed the operation was marked stale:

{
  "id": "53de788a-7cfb-4299-a590-2b4bde3a2b99",
  "status": "stale",
  "phase": "queued",
  "error": {
    "message": "Operation lease expired and was marked stale by repair worker",
    "source": "repair_worker"
  }
}

Validation status

clear_stale_operation_lock Level 1 repair behavior passed. The repair worker successfully converted an expired queued provisioning operation into a durable stale state with structured error metadata.

jester commented

2026-05-03 15:33:45 +00:00

Follow-up gap — controller does not yet detect live Cloudflare DNS drift

Current controller/repair split is clarified:

Repair worker:
  edge_republish / dns_republish actions exist and can call the existing edge publisher.

Controller policy:
  only recommends DNS/edge repair if DB/payload observation says edge/DNS is missing.

Current gap:
  controller does not yet query Cloudflare live DNS state and compare expected A/SRV records to actual records.

Impact:

If a Cloudflare A or SRV record is manually deleted, the controller will most likely not notice yet, because it mostly observes persisted edge/payload state rather than live Cloudflare state.

Manual repair test is still possible by enqueueing edge_republish or dns_republish against a safe test game server VMID.

Example manual test shape:

node -e 'import("./src/queues/repair.js").then(async ({ enqueueRepairJob }) => {
  const job = await enqueueRepairJob({
    repairId: "manual-edge-republish-" + Date.now(),
    serverId: 5000,
    vmid: 5000,
    action: "edge_republish",
    reason: "manual Cloudflare DNS repair test",
    level: 1,
    source: "manual",
    dryRun: false
  });
  console.log({ jobId: job.id });
  process.exit(0);
})'

Replace 5000 with the real disposable test game server VMID.

Next code step for automatic repair:

Add live Cloudflare DNS observation to the controller.
Compare expected records for each game server:
  A: <hostname>.zerolaghub.quest
  SRV: _minecraft._tcp.<hostname>.zerolaghub.quest when applicable
against actual Cloudflare records.

If missing/mismatched:
  policy recommends dns_republish or edge_republish.

This should remain Level 1 repair behavior and only run for non-suspended, non-deleting, non-maintenance-locked game servers.
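
The comparison step described above can be sketched as a pure function over the expected and actual record lists. Fetching the actual records from Cloudflare is left out. The instance fields (`hostname`, `game`, `suspended`, `deleting`, `maintenanceLocked`) and function names are assumptions, and this sketch only checks for missing records, not mismatched content.

```javascript
// Hypothetical sketch: compare expected DNS records to records read from Cloudflare.
const ZONE = 'zerolaghub.quest';

function expectedDnsRecords(hostname, withSrv) {
  const records = [{ type: 'A', name: `${hostname}.${ZONE}` }];
  if (withSrv) {
    records.push({ type: 'SRV', name: `_minecraft._tcp.${hostname}.${ZONE}` });
  }
  return records;
}

function findMissingRecords(expected, actual) {
  const present = new Set(actual.map((r) => `${r.type}:${r.name.toLowerCase()}`));
  return expected.filter((r) => !present.has(`${r.type}:${r.name.toLowerCase()}`));
}

function recommendEdgeRepair(instance, actualRecords) {
  // Only repair servers that are not suspended, deleting, or maintenance-locked.
  if (instance.suspended || instance.deleting || instance.maintenanceLocked) return null;
  const expected = expectedDnsRecords(instance.hostname, instance.game === 'minecraft');
  const missing = findMissingRecords(expected, actualRecords);
  return missing.length > 0 ? { action: 'edge_republish', level: 1, missing } : null;
}
```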

jester commented

2026-05-03 15:51:28 +00:00

Implementation update — live edge drift detection path

Live edge drift detection and repair integration have been implemented in zpack-api. This controller item should not be marked complete yet: the real validation on a disposable server (delete a Cloudflare record, run the controller in dry-run, then run the non-dry-run repair) has not been performed.

Files changed

src/services/edgeObserver.js
src/services/cloudflareClient.js
src/services/technitiumClient.js
src/services/velocityClient.js
src/controllers/repairPolicy.js
src/controllers/reconciler.js
src/queues/repair.js

New behavior

observeEdgeState(instance) was added and the controller now observes live edge state for game servers.

External state checked:

Cloudflare A exists for <hostname>.zerolaghub.quest
Cloudflare SRV exists for _minecraft._tcp.<hostname>.zerolaghub.quest
Technitium A exists
Technitium SRV exists and matches expected SRV port/target
Velocity backend exists and matches expected address/port when exposed by status API

Policy / repair behavior

Confirmed Cloudflare / Technitium / Velocity drift now recommends enabled Level 1 repair:

edge_republish

Existing safe actions retained:

clear_stale_operation_lock
dns_republish
velocity_reregister

Level 2 remains disabled.

Repair worker behavior

edge_republish now uses the existing full edge publish path, then post-checks live edge state so it does not fake success.
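
The publish-then-verify idea can be sketched as below. It is an illustration only: `publish` and `observe` are injected stand-ins for the existing edge publish path and `observeEdgeState`, and the assumption is that the observer returns a flat object of boolean checks such as `cloudflare_a` and `cloudflare_srv`.

```javascript
// Hypothetical sketch: republish, then re-observe live state before reporting success.
async function republishWithPostCheck(instance, { publish, observe }) {
  await publish(instance);
  const state = await observe(instance); // e.g. { cloudflare_a: true, cloudflare_srv: false }
  const stillMissing = Object.entries(state)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  if (stillMissing.length > 0) {
    // Publishing returned, but live state still shows drift: report failure honestly.
    return { ok: false, skipped: false, error: `post-check failed: ${stillMissing.join(', ')}` };
  }
  return { ok: true, skipped: false, error: null };
}
```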

Validation completed

npm run prisma:generate: passed
npx prisma validate: passed
node --check: passed for all requested touched JS files
npm run worker:repair: started successfully and shut down cleanly under timeout
policy smoke check: missing cloudflare_a recommends enabled Level 1 edge_republish

Controller smoke note:

CONTROLLER_DRY_RUN=true npm run controller did not run a loop because the controller lock was already held.
Observed: controller_lock_lost { reason: 'lock_not_acquired' }

Limitations / TODOs

No real Cloudflare drift test has been completed yet.
Controller smoke needs the active controller lock released/expired or existing service stopped temporarily.
Velocity lookup depends on the status endpoint exposing registered backends.
Confirm whether the expected Velocity backend port should use the allocated/payload port or always the container-internal 25565. The current observer follows the prompt and uses the first payload/allocated port.

Required next validation

On a disposable game server:

1. Confirm server healthy.
2. Delete one Cloudflare A record or SRV record.
3. Run controller in dry-run and confirm it detects missing cloudflare_a/cloudflare_srv and recommends edge_republish.
4. Run non-dry-run Level 1 repair and confirm edge_republish recreates the record.
5. Confirm RepairEvent completed and server remains connectable.

Do not close controller issue until at least one real Cloudflare drift test passes.

jester commented

2026-05-03 16:01:50 +00:00

Validation update — live Cloudflare SRV drift repair passed

A real live edge drift test was performed on a disposable game server by deleting the Cloudflare SRV record.

Server under test

vmid: 5210
hostname: mc-neoforge-5210
fqdn: mc-neoforge-5210.zerolaghub.quest
game: minecraft

Result

The controller/repair path detected the drift and edge_republish restored the edge state through the existing edge publish path.

Repair job completed successfully:

repairId: server-5210-edge_republish-1777824012291
action: edge_republish
ok: true
published: true
fqdn: mc-neoforge-5210.zerolaghub.quest
internalEdgeIp: 10.70.0.10
publicEdgeIp: 66.163.115.115
externalPort: 25565
skipped: false
error: null

Edge publisher completed and Velocity remained healthy/idempotent:

[edgePublisher] ✓ Velocity registered mc-neoforge-5210 → 10.200.0.83:25565 (Already registered mc-neoforge-5210) verified=true attempts=1
[edgePublisher] COMPLETE vmid=5210, fqdn=mc-neoforge-5210.zerolaghub.quest, game=minecraft

Validation status

Live Cloudflare SRV drift repair: passed.

This confirms the intended Level 1 repair loop:

Cloudflare SRV record deleted
→ controller detects live edge drift
→ repair worker runs edge_republish
→ existing edge publish path recreates/validates DNS and Velocity
→ repair job completes successfully

Remaining optional validation:

- Repeat with Cloudflare A record deletion if desired
- Confirm RepairEvent row shows completed result
- Confirm Discord repair notification landed in the expected channel

Reference: jester/zlh-grind#13