# Open Threads — zlh-grind

This file tracks **active cross-repo and platform-level work only**.

Repo-specific work belongs in:
- `Codex/API/OPEN_ITEMS.md`
- `Codex/Portal/OPEN_ITEMS.md`
- `Codex/Agent/OPEN_ITEMS.md`
- `Codex/Monitoring/OPEN_ITEMS.md`

Keep this file short.

---

## Launch Active

### 1. Portal terminal reliability
- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
  - WebSocket connect timeout in `TerminalView`.
  - Error path closes/clears current socket refs and sender callback.
  - `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
  - Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.

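
A minimal sketch of the required console behavior, written as a pure state transition so it can be checked without a browser. The state names `connecting` and `streaming`, the event names, and the `Connected` label are assumptions for illustration; only `closed`, `error`, `idle`, `Connecting...`, `Open Console`, and `Reconnect` come from the list above. It is not the actual `TerminalView` / `ServerConsole` code.

```typescript
// Hypothetical state and event names, used only to illustrate the recovery rules.
type ConsoleState = "idle" | "connecting" | "streaming" | "closed" | "error";

type ConsoleEvent =
  | { type: "open_requested" }
  | { type: "socket_open" }
  | { type: "connect_timeout" }
  | { type: "socket_error" }
  | { type: "socket_closed" };

function nextConsoleState(state: ConsoleState, event: ConsoleEvent): ConsoleState {
  switch (event.type) {
    case "open_requested":
      // Ignore duplicate clicks while an attempt or a live stream exists.
      return state === "connecting" || state === "streaming" ? state : "connecting";
    case "socket_open":
      return state === "connecting" ? "streaming" : state;
    case "connect_timeout":
      // The timeout only matters while the socket is still connecting.
      return state === "connecting" ? "error" : state;
    case "socket_error":
      return "error";
    case "socket_closed":
      // Keep the error visible if the close follows an error.
      return state === "error" ? "error" : "closed";
  }
}

function buttonLabel(state: ConsoleState): string {
  switch (state) {
    case "idle":
      return "Open Console";
    case "connecting":
      return "Connecting...";
    case "streaming":
      return "Connected"; // assumed label, not from the notes
    case "closed":
    case "error":
      return "Reconnect";
  }
}
```

With this shape, every path out of `connecting` ends in `streaming`, `closed`, or `error`, so the button always has a way back to `Reconnect`. Clearing socket refs and the sender callback would happen in the effect that handles the `error` and `closed` transitions.
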
### 2. Monitoring / observability readiness
- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
  - restrict public exposure for Prometheus/Grafana/node_exporter
  - validate game/dev discovery sync and stale target removal
  - ensure Grafana dashboards are installed and useful
  - expose API health/app-level state
  - decide centralized logs/Loki timing or accept as launch risk
  - add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`

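
One possible shape for the queue staleness item, sketched as pure functions that turn per-queue snapshots into Prometheus text exposition lines. The queue names come from the list above; the metric name `zlh_queue_oldest_job_age_seconds`, the `QueueSnapshot` type, and the thresholds are hypothetical placeholders, not agreed values.

```typescript
// Hypothetical snapshot shape: the enqueue time of the oldest waiting job,
// or null when the queue has nothing waiting.
interface QueueSnapshot {
  queue: string;
  oldestWaitingEnqueuedAtMs: number | null;
}

// Placeholder thresholds; real values belong in the monitoring tracker.
const STALE_AFTER_SECONDS: Record<string, number> = {
  provisioning: 600,
  repair: 900,
  billing_enforcement: 3600,
};

function oldestJobAgeSeconds(s: QueueSnapshot, nowMs: number): number {
  if (s.oldestWaitingEnqueuedAtMs === null) return 0;
  return Math.max(0, Math.floor((nowMs - s.oldestWaitingEnqueuedAtMs) / 1000));
}

function isStale(s: QueueSnapshot, nowMs: number): boolean {
  const limit = STALE_AFTER_SECONDS[s.queue];
  return limit !== undefined && oldestJobAgeSeconds(s, nowMs) > limit;
}

// Renders one gauge sample per queue in Prometheus text exposition format.
function renderQueueMetrics(snapshots: QueueSnapshot[], nowMs: number): string {
  const lines: string[] = [
    "# HELP zlh_queue_oldest_job_age_seconds Age of the oldest waiting job per queue.",
    "# TYPE zlh_queue_oldest_job_age_seconds gauge",
  ];
  for (const s of snapshots) {
    const age = oldestJobAgeSeconds(s, nowMs);
    lines.push(`zlh_queue_oldest_job_age_seconds{queue="${s.queue}"} ${age}`);
  }
  return lines.join("\n") + "\n";
}
```

Exporting age rather than a boolean keeps the threshold decision in alert rules, where it can change without a deploy.
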
### 3. Patch management / maintenance window policy
- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.

### 4. Notepad / announcements / messaging retest
- Billing announcements and support ticket flow were validated live.
- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.

### 5. Final integrated smoke test
Run this after the launch blockers above are clean:
- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.

---

## Current Architecture Validated / No Longer Blocking

### Provisioning worker / async create
- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.

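
The create/reuse behavior can be illustrated with an in-memory sketch. Everything named here (`ProvisioningService`, the idempotency key parameter, the `reused` flag, the `op-N` id format) is hypothetical; the real API persists the operation durably and enqueues through BullMQ, which this sketch replaces with a plain callback.

```typescript
type OperationStatus = "queued" | "running" | "succeeded" | "failed";

interface ProvisioningOperation {
  id: string;
  idempotencyKey: string;
  status: OperationStatus;
}

interface CreateResponse {
  status: 202;
  body: { operationId: string; reused: boolean };
}

// In-memory stand-in for the durable operation store plus queue producer.
class ProvisioningService {
  private readonly byKey = new Map<string, ProvisioningOperation>();
  private readonly enqueue: (op: ProvisioningOperation) => void;
  private nextId = 1;

  constructor(enqueue: (op: ProvisioningOperation) => void) {
    this.enqueue = enqueue;
  }

  createInstance(idempotencyKey: string): CreateResponse {
    // Reuse: a repeated request with the same key must not enqueue twice.
    const existing = this.byKey.get(idempotencyKey);
    if (existing) {
      return { status: 202, body: { operationId: existing.id, reused: true } };
    }
    // Create: record the operation first, then hand it to the queue.
    const op: ProvisioningOperation = {
      id: `op-${this.nextId++}`,
      idempotencyKey,
      status: "queued",
    };
    this.byKey.set(idempotencyKey, op);
    this.enqueue(op);
    return { status: 202, body: { operationId: op.id, reused: false } };
  }
}
```

Recording the operation before enqueueing means a crash between the two steps leaves a visible stale operation for the controller to mark, rather than a queued job with no record behind it.
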
### Controller / repair foundation
- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller is currently expected to run in dry-run unless deliberately enabling Level 1 repair.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
  - stale provisioning operation → `stale`
  - live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.

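
A rough sketch of the singleton-plus-dry-run idea, using an in-memory stand-in for the Redis `SET key value NX PX <ttl>` pattern. The lock key, TTL, owner ids, and function names are assumptions; the actual controller's locking and repair dispatch may differ.

```typescript
// In-memory stand-in for a Redis lock with expiry (SET ... NX PX semantics).
class FakeLockStore {
  private readonly entries = new Map<string, { owner: string; expiresAtMs: number }>();

  // Succeeds only if the key is absent or its previous holder has expired.
  setIfAbsent(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const current = this.entries.get(key);
    if (current && current.expiresAtMs > nowMs) return false;
    this.entries.set(key, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }

  // Extends the lock only for the current, unexpired owner.
  renew(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const current = this.entries.get(key);
    if (!current || current.owner !== owner || current.expiresAtMs <= nowMs) return false;
    current.expiresAtMs = nowMs + ttlMs;
    return true;
  }
}

const LOCK_KEY = "zlh:controller:lock"; // hypothetical key name
const LOCK_TTL_MS = 10_000; // hypothetical TTL

type TickResult = "standby" | "dry_run" | "level1_repair";

// One reconcile tick: only the lock holder acts, and it only repairs
// when dry-run has been deliberately turned off.
function controllerTick(
  store: FakeLockStore,
  ownerId: string,
  nowMs: number,
  dryRun: boolean,
): TickResult {
  const isLeader =
    store.renew(LOCK_KEY, ownerId, LOCK_TTL_MS, nowMs) ||
    store.setIfAbsent(LOCK_KEY, ownerId, LOCK_TTL_MS, nowMs);
  if (!isLeader) return "standby";
  return dryRun ? "dry_run" : "level1_repair";
}
```

A second controller instance stays in `standby` until the holder stops renewing and the TTL lapses, which is what keeps the reconciler a singleton.
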
### Billing enforcement
- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
  - payment failed / warning flow
  - Stripe replay idempotency
  - final warning / backup block state
  - suspension / shutdown safety
  - API gates while suspended
  - controller does not repair suspended game servers
  - payment restored flow
  - destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.

### Support ticket path
- `POST /api/support/create` exists and creates a `SupportTicket` DB row with human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.

---

## Platform / Infrastructure Active

- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit

### Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
- PBS / platform backup strategy is the durability / disaster-recovery layer.
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.

---

## Platform Future / Phase 2

- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.

---

## Cleaning Rule

- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.