# Open Threads — zlh-grind

This file tracks **active cross-repo and platform-level work only**.

Repo-specific work belongs in:

- `Codex/API/OPEN_ITEMS.md`
- `Codex/Portal/OPEN_ITEMS.md`
- `Codex/Agent/OPEN_ITEMS.md`
- `Codex/Monitoring/OPEN_ITEMS.md`

Keep this file short.

---

## Launch Active

### 1. Portal terminal reliability

- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
  - WebSocket connect timeout in `TerminalView`.
  - Error path closes/clears current socket refs and sender callback.
  - `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
  - Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.

### 2. Monitoring / observability readiness

- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
  - restrict public exposure for Prometheus/Grafana/node_exporter
  - validate game/dev discovery sync and stale target removal
  - ensure Grafana dashboards are installed and useful
  - expose API health/app-level state
  - decide centralized logs/Loki timing or accept as launch risk
  - add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`

### 3. Patch management / maintenance window policy

- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.

### 4. Notepad / announcements / messaging retest

- Billing announcements and support ticket flow were validated live.
- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.

### 5. Final integrated smoke test

Run this after the launch blockers above are clean:

- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.

---

## Current Architecture Validated / No Longer Blocking

### Provisioning worker / async create

- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.

### Controller / repair foundation

- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller is currently expected to run in dry-run unless deliberately enabling Level 1 repair.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
  - stale provisioning operation → `stale`
  - live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.

### Billing enforcement

- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
  - payment failed / warning flow
  - Stripe replay idempotency
  - final warning / backup block state
  - suspension / shutdown safety
  - API gates while suspended
  - controller does not repair suspended game servers
  - payment restored flow
  - destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.

### Support ticket path

- `POST /api/support/create` exists and creates a `SupportTicket` DB row with human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.

---

## Platform / Infrastructure Active

- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit

### Backup boundary

- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
- PBS / platform backup strategy is the durability / disaster-recovery layer.
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.

---

## Platform Future / Phase 2

- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.

---

## Cleaning Rule

- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.