# Open Threads — zlh-grind
This file tracks **active cross-repo and platform-level work only**.
Repo-specific work belongs in:
- `Codex/API/OPEN_ITEMS.md`
- `Codex/Portal/OPEN_ITEMS.md`
- `Codex/Agent/OPEN_ITEMS.md`
- `Codex/Monitoring/OPEN_ITEMS.md`
Keep this file short.
---
## Launch Active
### 1. Portal terminal reliability
- Fix/harden console connection state so the UI cannot remain stuck at `Connecting...`.
- Required behavior:
  - WebSocket connect timeout in `TerminalView`.
  - Error path closes/clears current socket refs and sender callback.
  - `ServerConsole` resets streaming state on `closed`, `error`, and `idle`.
  - Button recovers to `Open Console` / `Reconnect`.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.
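
The required behavior above can be sketched as a small state reducer. This is a minimal illustration, not the Portal's actual code: `ConsoleState`, `reduceConsole`, `buttonLabel`, and the `Connected` label are hypothetical names, and `hasSocket` stands in for the real socket refs and sender callback. The assumption is that the connect timer in `TerminalView` dispatches a `timeout` event if the socket has not opened in time.

```typescript
// Hypothetical state model; status names mirror the states listed above.
type ConsoleStatus = "idle" | "connecting" | "open" | "closed" | "error";
type ConsoleEvent = "connect" | "opened" | "closed" | "error" | "timeout" | "idle";

interface ConsoleState {
  status: ConsoleStatus;
  streaming: boolean; // true only while terminal output is flowing
  hasSocket: boolean; // stands in for the socket ref + sender callback
}

const initialConsoleState: ConsoleState = {
  status: "idle",
  streaming: false,
  hasSocket: false,
};

function reduceConsole(state: ConsoleState, event: ConsoleEvent): ConsoleState {
  switch (event) {
    case "connect":
      return { status: "connecting", streaming: false, hasSocket: true };
    case "opened":
      // Ignore a late "opened" that arrives after a timeout or error teardown.
      return state.status === "connecting"
        ? { status: "open", streaming: true, hasSocket: true }
        : state;
    case "timeout":
      // A connect timeout only matters while still connecting.
      return state.status === "connecting"
        ? { status: "error", streaming: false, hasSocket: false }
        : state;
    case "error":
      return { status: "error", streaming: false, hasSocket: false };
    case "closed":
      return { status: "closed", streaming: false, hasSocket: false };
    case "idle":
      return { status: "idle", streaming: false, hasSocket: false };
  }
}

function buttonLabel(state: ConsoleState): string {
  if (state.status === "connecting") return "Connecting...";
  if (state.status === "open") return "Connected"; // assumed label, not from the tracker
  return state.status === "idle" ? "Open Console" : "Reconnect";
}
```

With this shape, every exit from `connecting` (timeout, error, close) clears the socket flag and streaming state in one place, so the button cannot stay on `Connecting...`.
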
### 2. Monitoring / observability readiness
- Finish or explicitly accept launch risk for the remaining monitoring items in `Codex/Monitoring/OPEN_ITEMS.md`.
- Current launch focus:
  - restrict public exposure for Prometheus/Grafana/node_exporter
  - validate game/dev discovery sync and stale target removal
  - ensure Grafana dashboards are installed and useful
  - expose API health/app-level state
  - decide centralized logs/Loki timing or accept as launch risk
  - add/verify queue staleness visibility for `provisioning`, `repair`, and `billing_enforcement`
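
One possible shape for the queue staleness item is a gauge reporting the age of the oldest waiting job per queue, rendered in the Prometheus text exposition format. This is a sketch under assumptions: the metric name `zlh_queue_oldest_waiting_age_seconds`, the `QueueSnapshot` type, and both function names are hypothetical, and how the oldest waiting timestamp is read from each queue is left out.

```typescript
// Queue names come from the list above; everything else here is illustrative.
type QueueName = "provisioning" | "repair" | "billing_enforcement";

interface QueueSnapshot {
  queue: QueueName;
  oldestWaitingEnqueuedAtMs: number | null; // null when nothing is waiting
}

// Hypothetical metric name; choose whatever fits the existing dashboards.
const METRIC = "zlh_queue_oldest_waiting_age_seconds";

// Age of the oldest waiting job in whole seconds; 0 for an empty queue.
function oldestWaitingAgeSeconds(snap: QueueSnapshot, nowMs: number): number {
  if (snap.oldestWaitingEnqueuedAtMs === null) return 0;
  return Math.max(0, Math.floor((nowMs - snap.oldestWaitingEnqueuedAtMs) / 1000));
}

// Render one gauge sample per queue in Prometheus text exposition format.
function renderStalenessMetrics(snaps: QueueSnapshot[], nowMs: number): string {
  const lines = [`# TYPE ${METRIC} gauge`];
  for (const snap of snaps) {
    const age = oldestWaitingAgeSeconds(snap, nowMs);
    lines.push(`${METRIC}{queue="${snap.queue}"} ${age}`);
  }
  return lines.join("\n") + "\n";
}
```

An alert on this gauge exceeding a per-queue threshold would make a stalled worker visible even when the worker process itself is still running.
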
### 3. Patch management / maintenance window policy
- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.
### 4. Notepad / announcements / messaging retest
- Billing announcements and support ticket flow were validated live.
- Still to retest: notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.
### 5. Final integrated smoke test
Run this after the launch blockers above are clean:
- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.
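
The lifecycles above are ordered: a later step means little if an earlier one failed. A minimal sketch of that sequencing follows, assuming each step can be reduced to a pass/fail check. `SmokeStep`, `SmokeResult`, and `runLifecycle` are hypothetical names, and real checks would be asynchronous calls against the API and Portal rather than the synchronous stubs assumed here.

```typescript
interface SmokeStep {
  name: string;
  check: () => boolean; // hypothetical: real checks would be async
}

interface SmokeResult {
  passed: string[];
  failedAt: string | null; // first failing step, or null if all passed
  skipped: string[]; // steps not attempted after the first failure
}

// Run steps in order; stop attempting checks after the first failure.
function runLifecycle(steps: SmokeStep[]): SmokeResult {
  const result: SmokeResult = { passed: [], failedAt: null, skipped: [] };
  for (const step of steps) {
    if (result.failedAt !== null) {
      result.skipped.push(step.name);
      continue;
    }
    let ok = false;
    try {
      ok = step.check();
    } catch {
      ok = false; // a thrown check counts as a failure
    }
    if (ok) {
      result.passed.push(step.name);
    } else {
      result.failedAt = step.name;
    }
  }
  return result;
}
```

Reporting `skipped` separately from `passed` keeps an incomplete run from being read as a clean one.
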
---
## Current Architecture Validated / No Longer Blocking
### Provisioning worker / async create
- `POST /api/instances` now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns `202 Accepted`.
- `zpack-provision-worker.service` is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.
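
The create/reuse behavior of `POST /api/instances` can be illustrated as follows. This is a sketch under assumptions, not the API's implementation: `OperationStore` is an in-memory stand-in for the durable operation table, `handleCreateInstance` and the idempotency-key parameter are hypothetical, and `enqueue` is an injected function standing in for the BullMQ producer.

```typescript
interface ProvisioningOperation {
  id: string;
  idempotencyKey: string;
  status: "queued" | "running" | "done" | "failed";
}

// Injected stand-in for the queue producer (for example a BullMQ `queue.add` call).
type Enqueue = (operationId: string) => void;

// In-memory stand-in for the durable operation store.
class OperationStore {
  private byKey = new Map<string, ProvisioningOperation>();
  private nextId = 1;

  findOrCreate(key: string): { op: ProvisioningOperation; created: boolean } {
    const existing = this.byKey.get(key);
    if (existing) return { op: existing, created: false };
    const op: ProvisioningOperation = {
      id: `op_${this.nextId++}`,
      idempotencyKey: key,
      status: "queued",
    };
    this.byKey.set(key, op);
    return { op, created: true };
  }
}

// Create or reuse the operation, enqueue only on first creation, always answer 202.
function handleCreateInstance(
  store: OperationStore,
  enqueue: Enqueue,
  idempotencyKey: string,
): { statusCode: 202; operationId: string } {
  const { op, created } = store.findOrCreate(idempotencyKey);
  if (created) enqueue(op.id);
  return { statusCode: 202, operationId: op.id };
}
```

A repeated request with the same key gets the same operation back and does not enqueue a second job, which is the property the duplicate/idempotency validation exercises.
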
### Controller / repair foundation
- `zlh-controller.service` exists as the singleton controller/reconciler with Redis lock.
- Controller is expected to run in dry-run mode unless Level 1 repair is deliberately enabled.
- `zpack-repair-worker.service` exists for Level 1 infrastructure repairs.
- Validated repairs:
  - stale provisioning operation → `stale`
  - live Cloudflare SRV drift → `edge_republish` restored edge state
- Level 2/3 repairs remain disabled.
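
The singleton-plus-dry-run behavior can be sketched like this. Everything named here is an assumption for illustration: `FakeLockStore` mimics the semantics of Redis `SET key value NX PX ttl` in memory, and the lock key, the 30-second TTL, and `controllerTick` are hypothetical. Lock renewal by the current holder is left out.

```typescript
// In-memory stand-in for Redis `SET key value NX PX ttl`.
class FakeLockStore {
  private entries = new Map<string, { owner: string; expiresAtMs: number }>();

  // Returns true only if the key is absent or its previous TTL has expired.
  setIfAbsent(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const current = this.entries.get(key);
    if (current && current.expiresAtMs > nowMs) return false;
    this.entries.set(key, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }
}

type TickResult =
  | { ran: false }
  | { ran: true; planned: string[]; executed: string[] };

const LOCK_KEY = "zlh-controller:lock"; // hypothetical key name
const LOCK_TTL_MS = 30_000; // hypothetical TTL

// One reconcile pass: only the lock holder plans repairs, and dry-run executes none.
function controllerTick(
  lock: FakeLockStore,
  owner: string,
  nowMs: number,
  detectedRepairs: string[],
  opts: { dryRun: boolean },
): TickResult {
  if (!lock.setIfAbsent(LOCK_KEY, owner, LOCK_TTL_MS, nowMs)) {
    return { ran: false };
  }
  const planned = [...detectedRepairs];
  const executed = opts.dryRun ? [] : [...detectedRepairs];
  return { ran: true, planned, executed };
}
```

In dry-run the controller still reports what it would repair (for example `edge_republish`), which keeps the plan reviewable before Level 1 repair is switched on.
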
### Billing enforcement
- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- `zpack-billing-worker.service` is installed/running under systemd.
- Validated:
  - payment failed / warning flow
  - Stripe replay idempotency
  - final warning / backup block state
  - suspension / shutdown safety
  - API gates while suspended
  - controller does not repair suspended game servers
  - payment restored flow
  - destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.
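
Stripe replay idempotency comes down to recording each event ID once and treating a repeat as a no-op. The sketch below illustrates that idea only. `BillingEventProcessor` and `StripeLikeEvent` are hypothetical names, and the in-memory `Set` stands in for durable storage, where a unique constraint on the event ID would typically do this job.

```typescript
// Minimal event shape: Stripe events carry an `id` and a `type`.
interface StripeLikeEvent {
  id: string;
  type: string;
}

class BillingEventProcessor {
  private seen = new Set<string>(); // stand-in for a durable processed-events table
  readonly applied: string[] = []; // event types applied, in order

  // Apply an event at most once; a replayed event ID is acknowledged but ignored.
  handle(event: StripeLikeEvent): "applied" | "duplicate" {
    if (this.seen.has(event.id)) return "duplicate";
    this.seen.add(event.id);
    this.applied.push(event.type);
    return "applied";
  }
}
```

For example, delivering the same `invoice.payment_failed` event twice returns `"applied"` and then `"duplicate"`, so the warning flow advances once. In a durable version, recording the ID and applying the state change would presumably need to happen in one transaction.
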
### Support ticket path
- `POST /api/support/create` exists and creates a `SupportTicket` DB row with human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord `#support` alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.
---
## Platform / Infrastructure Active
- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit
### Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
- PBS / platform backup strategy is the durability / disaster-recovery layer.
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.
---
## Platform Future / Phase 2
- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
  - likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.
---
## Cleaning Rule
- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use `CURRENT_STATE.md` for durable implemented behavior.
- Use `DECISIONS.md` for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.