zlh-grind/OPEN_THREADS.md

# Open Threads — zlh-grind

This file tracks **active cross-repo and platform-level work only**.

Repo-specific work belongs in:
- `Codex/API/OPEN_ITEMS.md`
- `Codex/Portal/OPEN_ITEMS.md`
- `Codex/Agent/OPEN_ITEMS.md`
- `Codex/Monitoring/OPEN_ITEMS.md`

Keep this file short.

---

## Launch Active

### 1. Portal terminal reliability
- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
  - WebSocket connect timeout in `TerminalView`.
  - Error path closes/clears current socket refs and sender callback.
  - `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
  - Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.
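
The required behavior above can be sketched as a small state reducer. This is an illustration, not the Portal's actual code: the event names `connect`, `open`, and `timeout`, the `ConsoleState` shape, and the function name are assumptions. Only the `closed`, `error`, and `idle` states and the button labels come from the list above.

```typescript
// Sketch only: `connect`, `open`, and `timeout` are hypothetical event names;
// `closed`, `error`, and `idle` are the states named in the list above.
type ConsoleEvent = "connect" | "open" | "timeout" | "closed" | "error" | "idle";

interface ConsoleState {
  status: "idle" | "connecting" | "streaming" | "closed" | "error";
  streaming: boolean;
  // Label for the console button; null while a session is live.
  buttonLabel: "Open Console" | "Reconnect" | "Connecting..." | null;
}

const initialConsoleState: ConsoleState = {
  status: "idle",
  streaming: false,
  buttonLabel: "Open Console",
};

function nextConsoleState(prev: ConsoleState, event: ConsoleEvent): ConsoleState {
  switch (event) {
    case "connect":
      return { status: "connecting", streaming: false, buttonLabel: "Connecting..." };
    case "open":
      // Ignore a late `open` that arrives after a timeout or close.
      return prev.status === "connecting"
        ? { status: "streaming", streaming: true, buttonLabel: null }
        : prev;
    case "timeout":
      // A connect timeout only matters while still connecting.
      return prev.status === "connecting"
        ? { status: "error", streaming: false, buttonLabel: "Reconnect" }
        : prev;
    case "error":
      return { status: "error", streaming: false, buttonLabel: "Reconnect" };
    case "closed":
      return { status: "closed", streaming: false, buttonLabel: "Reconnect" };
    case "idle":
      return initialConsoleState;
  }
}
```

In this sketch a timer started on `connect` would dispatch `timeout`, so every path out of `connecting` ends in a state where the button is usable again.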

### 2. Monitoring / observability readiness
- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
  - restrict public exposure for Prometheus/Grafana/node_exporter
  - validate game/dev discovery sync and stale target removal
  - ensure Grafana dashboards are installed and useful
  - expose API health/app-level state
  - decide centralized logs/Loki timing or accept as launch risk
  - add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`
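
One way to express queue staleness is the age of the oldest waiting job compared against a threshold. The sketch below is hedged: the `QueueSnapshot` shape, the function name, and the threshold value are assumptions, and how the oldest-job timestamp is read from the real queues is deliberately left out.

```typescript
// Sketch only: the snapshot shape and threshold are hypothetical.
// Queue names such as `provisioning` come from the list above.
interface QueueSnapshot {
  name: string;
  // Enqueue time (epoch ms) of the oldest waiting job, or null if the queue is empty.
  oldestWaitingEnqueuedAtMs: number | null;
}

interface QueueStaleness {
  name: string;
  ageSeconds: number;
  stale: boolean;
}

function queueStaleness(
  snapshot: QueueSnapshot,
  nowMs: number,
  thresholdSeconds: number,
): QueueStaleness {
  const ageSeconds =
    snapshot.oldestWaitingEnqueuedAtMs === null
      ? 0
      : Math.max(0, Math.floor((nowMs - snapshot.oldestWaitingEnqueuedAtMs) / 1000));
  return { name: snapshot.name, ageSeconds, stale: ageSeconds > thresholdSeconds };
}
```

Exposing `ageSeconds` as a per-queue gauge would let an alert rule own the threshold instead of the application.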

### 3. Patch management / maintenance window policy
- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.

### 4. Notepad / announcements / messaging retest
- Billing announcements and support ticket flow were validated live.
- Still to retest: notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.

### 5. Final integrated smoke test
Run this after the launch blockers above are cleared:
- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.

---

## Current Architecture Validated / No Longer Blocking

### Provisioning worker / async create
- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.
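
The create/reuse-then-enqueue flow can be illustrated with an in-memory stand-in. This is a sketch under assumptions, not the API's implementation: the idempotency key, the store class, and the choice to enqueue only when the operation is newly created are all hypothetical, and the real store is durable and the real queue is BullMQ.

```typescript
// Sketch only: an in-memory stand-in for the durable operation store.
// The idempotency key and all names here are hypothetical.
interface ProvisioningOperation {
  id: string;
  idempotencyKey: string;
  status: "queued";
}

class InMemoryOperationStore {
  private readonly byKey = new Map<string, ProvisioningOperation>();

  createOrReuse(idempotencyKey: string): { operation: ProvisioningOperation; created: boolean } {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return { operation: existing, created: false };
    const operation: ProvisioningOperation = {
      id: `op-${this.byKey.size + 1}`,
      idempotencyKey,
      status: "queued",
    };
    this.byKey.set(idempotencyKey, operation);
    return { operation, created: true };
  }
}

function handleCreateInstance(
  store: InMemoryOperationStore,
  enqueue: (operationId: string) => void,
  idempotencyKey: string,
): { status: 202; operationId: string } {
  const { operation, created } = store.createOrReuse(idempotencyKey);
  // Assumption: a reused operation is not enqueued a second time.
  if (created) enqueue(operation.id);
  return { status: 202, operationId: operation.id };
}
```

A repeated request with the same key returns the same operation id, which is what lets a pending card in the Portal track one operation rather than several.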

### Controller / repair foundation
- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller is currently expected to run in dry-run mode unless Level 1 repair is deliberately enabled.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
  - stale provisioning operation → `stale`
  - live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.
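
The dry-run and level gating described above might look like the following. The config fields, decision names, and finding shape are assumptions made for illustration; only `edge_republish` and the Level 1/2/3 split come from this file.

```typescript
// Sketch only: config fields and decision names are hypothetical.
type RepairLevel = 1 | 2 | 3;

interface RepairFinding {
  kind: string; // e.g. "edge_republish"
  level: RepairLevel;
}

interface ControllerConfig {
  dryRun: boolean;
  maxEnabledLevel: 0 | RepairLevel; // 0 means no repairs are enabled
}

type RepairAction = "enqueue" | "log_only" | "skip_disabled";

function planRepairs(
  findings: RepairFinding[],
  config: ControllerConfig,
): Array<{ finding: RepairFinding; action: RepairAction }> {
  return findings.map((finding) => {
    // Disabled levels are never acted on, even outside dry-run.
    if (finding.level > config.maxEnabledLevel) {
      return { finding, action: "skip_disabled" as const };
    }
    // In dry-run the controller reports what it would do without enqueueing.
    return { finding, action: config.dryRun ? ("log_only" as const) : ("enqueue" as const) };
  });
}
```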

### Billing enforcement
- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
  - payment failed / warning flow
  - Stripe replay idempotency
  - final warning / backup block state
  - suspension / shutdown safety
  - API gates while suspended
  - controller does not repair suspended game servers
  - payment restored flow
  - destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.
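
Stripe replay idempotency generally means keying on the event `id` and applying each event at most once. The sketch below uses an in-memory set as a stand-in for the durable state; the class and method names are hypothetical and do not describe the billing worker's actual code.

```typescript
// Sketch only: an in-memory stand-in for durable processed-event state.
class ProcessedEventLog {
  private readonly seen = new Set<string>();

  // Runs `handler` once per event id; returns false for a replayed event.
  applyOnce(eventId: string, handler: () => void): boolean {
    if (this.seen.has(eventId)) return false;
    handler();
    // Recorded only after the handler succeeds, so a failed event can be retried.
    this.seen.add(eventId);
    return true;
  }
}
```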

### Support ticket path
- `POST /api/support/create` exists and creates a `SupportTicket` DB row with a human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.

---

## Platform / Infrastructure Active

- Stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit

### Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
- PBS / platform backup strategy is the durability / disaster-recovery layer.
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.

---

## Platform Future / Phase 2

- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- True interactive shell confinement / workspace-boundary decision.
- Dev-container backup ownership and user-facing restore contract.
- Likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- Artifact version promotion.
- Runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- Admin panel.
- Referral / dev pipeline reward system.
- Uptime history.
- Revisit DDoS mitigation later if needed.

---

## Cleaning Rule

- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they belong to only one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.