Update provisioning bug with confirmed root cause and add Redis/DB/worker cleanup to pre-launch checklist
parent ff06e6211d
commit 6d16e3e00e
@@ -147,18 +147,26 @@ Outstanding:
 - Headscale auth key generation
 - **Velocity resync endpoint** — `POST /api/velocity/resync` to re-register all running containers after Velocity restart
 
-**Provisioning + DB Consistency (Active Bug — needs audit before launch):**
-- Duplicate server creation occurring during single requests
-- Inconsistent / failed server termination
-- DB state drifting from actual Proxmox + agent state
-- BullMQ jobs may be executing more than once — missing deduplication / idempotency keys
-- VMID and port allocation not confirmed to be transactionally safe
-- Create endpoint not confirmed idempotent or guarded against double execution
-- Delete flow not confirmed to handle partial failures (container already gone)
-- DNS-based routing must not be used as source of truth — DB is the source of truth
-- **internal.zlh DNS may be introducing timing issues in provisioning hot path** — replace with env-based IP config for all service-to-service calls
+**Provisioning + DB Consistency (Active Bug — root cause identified, needs cleanup):**
+
+Root cause confirmed: state inconsistency from Denver → Detroit migration, not a code bug.
+- Redis likely has stale/replayed jobs from migration overlap period
+- Denver worker may have briefly run alongside Detroit
+- DB/Redis migration may have been partial or unsynchronized
+- internal.zlh DNS adding latency/timing risk in provisioning hot path
+
+Immediate remediation steps (next session):
+1. `FLUSHALL` Redis — do NOT restore from backup, start clean
+2. Validate DB — check for duplicate VMIDs, stuck "creating" states
+3. Confirm only one worker running (Detroit only)
+4. If DB inconsistent → restore MariaDB from Denver PBS snapshot
+5. Run single controlled provisioning test — verify 1 record, 1 job, 1 agent execution
+
+Architectural fix:
+- Replace internal.zlh FQDNs with env-based IPs for all service-to-service calls
+- See `SCRATCH/service-discovery.md` for spec
 - See `SCRATCH/debug-provisioning-consistency.md` for full audit plan
-- See `SCRATCH/service-discovery.md` for env-based service discovery approach
+- See `SCRATCH/session-update-migration-stability.md` for session findings
 
 ---
 
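Remediation step 2 in the hunk above (duplicate VMIDs, stuck "creating" states) could be checked with something like the sketch below. The row shape (`id`, `vmid`, `status`, `updatedAt`), the function name, and the age threshold are all assumptions for illustration; the real schema is not shown in this commit.

```typescript
// Hypothetical row shape; the actual table and column names are not in the notes.
interface ServerRow {
  id: number;
  vmid: number;
  status: string; // e.g. "creating", "running"
  updatedAt: Date;
}

interface ConsistencyReport {
  duplicateVmids: Record<string, number[]>; // vmid -> ids of rows sharing it
  stuckCreating: number[]; // ids of rows in "creating" for too long
}

// Scans rows already loaded from the DB and reports the two problems
// named in remediation step 2. It does not modify anything.
function checkConsistency(
  rows: ServerRow[],
  now: Date,
  maxCreatingMs: number
): ConsistencyReport {
  // Group row ids by VMID.
  const idsByVmid: Record<string, number[]> = {};
  for (const row of rows) {
    const key = String(row.vmid);
    (idsByVmid[key] = idsByVmid[key] || []).push(row.id);
  }

  // Keep only VMIDs claimed by more than one row.
  const duplicateVmids: Record<string, number[]> = {};
  for (const key of Object.keys(idsByVmid)) {
    if (idsByVmid[key].length > 1) {
      duplicateVmids[key] = idsByVmid[key];
    }
  }

  // A row counts as stuck if it is still "creating" past the threshold.
  const stuckCreating = rows
    .filter(
      (r) =>
        r.status === "creating" &&
        now.getTime() - r.updatedAt.getTime() > maxCreatingMs
    )
    .map((r) => r.id);

  return { duplicateVmids, stuckCreating };
}
```

An empty report from this check would correspond to the "DB consistent" branch; a non-empty one to the step 4 restore decision.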
@@ -246,7 +254,10 @@ Outstanding before launch:
 - **Billing endpoints** — add back to API
 - **Stress testing** — k6 IDE session load test + Minecraft bot test + **code-server memory baseline**
 - **OPNsense audit** — both routers need systematic validation
-- **Provisioning + DB consistency audit** — duplicate creation, state drift, BullMQ idempotency
+- **Redis FLUSHALL** — clear stale migration jobs before launch
+- **DB consistency check** — verify no duplicate VMIDs or stuck states from migration
+- **Single worker verification** — confirm only Detroit API/worker running
+- **Provisioning validation** — single controlled creation, confirm 1 record / 1 job / 1 execution
 - **Service discovery migration** — replace internal.zlh with env-based IPs in API and agent hot paths
 - Remove `testdaemon` binary from zpac-portal repo root
 
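The "Service discovery migration" item (env-based IPs instead of internal.zlh names) might look roughly like the sketch below. The `<NAME>_HOST` / `<NAME>_PORT` variable convention and the `serviceAddress` helper are assumptions, not taken from `SCRATCH/service-discovery.md`. The sketch rejects anything that is not an IPv4 literal, so a hostname cannot slip back into the hot path and trigger a DNS lookup.

```typescript
interface ServiceAddress {
  host: string;
  port: number;
}

// Matches a dotted-quad IPv4 literal with each octet in 0-255.
const IPV4 =
  /^(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}$/;

// Reads <NAME>_HOST and <NAME>_PORT from the given environment map.
// The variable naming scheme is hypothetical.
function serviceAddress(
  env: Record<string, string | undefined>,
  name: string
): ServiceAddress {
  const host = env[name + "_HOST"];
  const portText = env[name + "_PORT"];

  // Fail fast on hostnames so no DNS resolution happens at call time.
  if (!host || !IPV4.test(host)) {
    throw new Error(name + "_HOST must be an IPv4 literal, got: " + host);
  }
  if (!portText || !/^\d+$/.test(portText)) {
    throw new Error(name + "_PORT must be a number, got: " + portText);
  }
  const port = Number(portText);
  if (port < 1 || port > 65535) {
    throw new Error(name + "_PORT out of range: " + port);
  }
  return { host, port };
}
```

In a Node service this would typically be called once at startup, for example `serviceAddress(process.env, "AGENT")`, so a misconfigured address fails the boot rather than a provisioning request.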