Open Threads — zlh-grind

This file tracks active cross-repo and platform-level work only.

Repo-specific work belongs in:

  • Codex/API/OPEN_ITEMS.md
  • Codex/Portal/OPEN_ITEMS.md
  • Codex/Agent/OPEN_ITEMS.md
  • Codex/Monitoring/OPEN_ITEMS.md

Keep this file short.


Launch Active

1. Portal terminal reliability

  • Harden console connection state so the UI cannot remain stuck at "Connecting...".
  • Required behavior:
    • WebSocket connect timeout in TerminalView.
    • Error path closes/clears current socket refs and sender callback.
    • ServerConsole resets streaming state on closed, error, and idle.
    • Button recovers to Open Console / Reconnect.
  • Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.
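
The required behavior above can be sketched as a small state model. This is only an illustration, not the Portal code: ConsoleState, ConsoleEvent, nextState, isStreaming, buttonLabel, armConnectTimeout, and the "Connected" label are all hypothetical names. It assumes a connect timeout is treated the same as an error.

```typescript
// Hypothetical state model; names are illustrative, not taken from the Portal code.
type ConsoleState = "idle" | "connecting" | "open" | "closed" | "error";
type ConsoleEvent = "connect" | "opened" | "closed" | "error" | "timeout";

// Pure transition function. A timeout while connecting becomes an error,
// so the state can never stay at "connecting" indefinitely.
function nextState(state: ConsoleState, event: ConsoleEvent): ConsoleState {
  switch (event) {
    case "connect":
      return state === "connecting" || state === "open" ? state : "connecting";
    case "opened":
      return state === "connecting" ? "open" : state;
    case "timeout":
      return state === "connecting" ? "error" : state;
    case "closed":
      return "closed";
    case "error":
      return "error";
  }
}

// Streaming is only active while the socket is open; closed, error, and idle reset it.
function isStreaming(state: ConsoleState): boolean {
  return state === "open";
}

// The button always recovers to an actionable label after a failure.
function buttonLabel(state: ConsoleState): string {
  switch (state) {
    case "idle":
      return "Open Console";
    case "connecting":
      return "Connecting...";
    case "open":
      return "Connected"; // assumed label, not from the source
    case "closed":
    case "error":
      return "Reconnect";
  }
}

// Arms a connect timeout. The returned function cancels it and should be
// called on open, close, and error so a late timer cannot fire.
function armConnectTimeout(onTimeout: () => void, ms: number): () => void {
  const timer = setTimeout(onTimeout, ms);
  return () => clearTimeout(timer);
}
```

In a component, the error and timeout paths would also close the socket and clear the socket refs and sender callback before applying the transition.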

2. Monitoring / observability readiness

  • Finish the remaining monitoring items in Codex/Monitoring/OPEN_ITEMS.md, or explicitly accept each unfinished item as launch risk.
  • Current launch focus:
    • restrict public exposure for Prometheus/Grafana/node_exporter
    • validate game/dev discovery sync and stale target removal
    • ensure Grafana dashboards are installed and useful
    • expose API health/app-level state
    • decide centralized logs/Loki timing or accept as launch risk
    • add/verify queue staleness visibility for provisioning, repair, and billing_enforcement
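
One way to express queue staleness is the age of the oldest waiting job compared against a threshold. The sketch below is a minimal illustration of that idea. QueueSnapshot, QueueStaleness, and queueStaleness are hypothetical names, and how the oldest-job timestamp is read from the queue is left out.

```typescript
// Hypothetical shapes; field names are assumptions, not BullMQ or project types.
interface QueueSnapshot {
  queueName: string;
  // Enqueue time (epoch ms) of the oldest waiting job, or null when nothing is waiting.
  oldestWaitingEnqueuedAtMs: number | null;
}

interface QueueStaleness {
  queueName: string;
  oldestWaitingAgeSeconds: number;
  stale: boolean;
}

// Computes the age of the oldest waiting job and flags the queue as stale
// once that age exceeds the given threshold.
function queueStaleness(
  snapshot: QueueSnapshot,
  nowMs: number,
  staleAfterSeconds: number,
): QueueStaleness {
  const ageSeconds =
    snapshot.oldestWaitingEnqueuedAtMs === null
      ? 0
      : Math.max(0, (nowMs - snapshot.oldestWaitingEnqueuedAtMs) / 1000);
  return {
    queueName: snapshot.queueName,
    oldestWaitingAgeSeconds: ageSeconds,
    stale: ageSeconds > staleAfterSeconds,
  };
}
```

The resulting age could be exported as a per-queue gauge for provisioning, repair, and billing_enforcement, with the threshold chosen per queue.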

3. Patch management / maintenance window policy

  • Define normal maintenance window cadence and timezone.
  • Define emergency maintenance behavior and customer notification expectations.
  • Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.

4. Notepad / announcements / messaging retest

  • Billing announcements and support ticket flow were validated live.
  • Still to retest: notepad/notes load and save, persistence after reload/login, permissions, empty states, and error states.

5. Final integrated smoke test

Run this after the launch blockers above are clean:

  • Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
  • Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
  • Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.

Current Architecture Validated / No Longer Blocking

Provisioning worker / async create

  • POST /api/instances now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns 202 Accepted.
  • zpack-provision-worker.service is installed/running under systemd.
  • Game and dev provisioning through the worker were live-validated.
  • Portal async pending cards were visually validated and replace themselves with real server cards.
  • Duplicate/idempotency guards and controlled failure behavior were validated.
  • API teardown still works for worker-created servers.
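
The create-or-reuse behavior can be illustrated with in-memory stand-ins. This is a sketch under assumptions, not the API implementation: OperationStore, handleCreateInstance, the idempotency key, and the status values are hypothetical, and the real system uses a durable store and a BullMQ queue instead of a Map and a callback.

```typescript
// In-memory stand-ins only; names and statuses are assumptions for illustration.
interface ProvisioningOperation {
  id: string;
  idempotencyKey: string;
  status: "queued" | "running" | "succeeded" | "failed";
}

class OperationStore {
  private readonly byKey = new Map<string, ProvisioningOperation>();
  private nextId = 1;

  // Returns the existing operation for a key, or creates a new queued one.
  createOrReuse(idempotencyKey: string): {
    operation: ProvisioningOperation;
    created: boolean;
  } {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return { operation: existing, created: false };
    const operation: ProvisioningOperation = {
      id: `op-${this.nextId++}`,
      idempotencyKey,
      status: "queued",
    };
    this.byKey.set(idempotencyKey, operation);
    return { operation, created: true };
  }
}

interface AcceptedResponse {
  statusCode: 202;
  operationId: string;
}

// Handles a create request: work is enqueued only for a newly created
// operation, and the caller always gets 202 with the operation id.
function handleCreateInstance(
  store: OperationStore,
  enqueue: (operationId: string) => void,
  idempotencyKey: string,
): AcceptedResponse {
  const { operation, created } = store.createOrReuse(idempotencyKey);
  if (created) enqueue(operation.id);
  return { statusCode: 202, operationId: operation.id };
}
```

A repeated request with the same key returns the same operation id and does not enqueue a second job, which is the duplicate guard in miniature.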

Controller / repair foundation

  • zlh-controller.service exists as the singleton controller/reconciler with Redis lock.
  • Controller is currently expected to run in dry-run unless deliberately enabling Level 1 repair.
  • zpack-repair-worker.service exists for Level 1 infrastructure repairs.
  • Validated repairs:
    • stale provisioning operation → marked stale
    • live Cloudflare SRV drift → edge_republish restored edge state
  • Level 2/3 repairs remain disabled.
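
The dry-run and repair-level rules can be summarized as a single decision function. The sketch below is illustrative only: ControllerConfig, RepairProposal, RepairDecision, and decideRepair are hypothetical names, and the singleton Redis lock is not modeled.

```typescript
type RepairLevel = 1 | 2 | 3;

// Hypothetical config shape; the real controller settings may differ.
interface ControllerConfig {
  dryRun: boolean;
  enabledLevels: ReadonlySet<RepairLevel>;
}

interface RepairProposal {
  kind: string; // for example "edge_republish"
  level: RepairLevel;
  target: string;
}

type RepairDecision = "enqueue" | "log_only" | "skip_disabled";

// Disabled levels are never acted on. Enabled levels are only logged while
// the controller runs in dry-run, and enqueued otherwise.
function decideRepair(
  config: ControllerConfig,
  proposal: RepairProposal,
): RepairDecision {
  if (!config.enabledLevels.has(proposal.level)) return "skip_disabled";
  return config.dryRun ? "log_only" : "enqueue";
}
```

With dryRun set to true and only Level 1 enabled, an edge_republish proposal is logged but not enqueued, and any Level 2 or Level 3 proposal is skipped.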

Billing enforcement

  • Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
  • zpack-billing-worker.service is installed/running under systemd.
  • Validated:
    • payment failed / warning flow
    • Stripe replay idempotency
    • final warning / backup block state
    • suspension / shutdown safety
    • API gates while suspended
    • controller does not repair suspended game servers
    • payment restored flow
    • destructive billing actions rejected/audited
  • Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.
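
Two of the validated behaviors, Stripe replay idempotency and API gating by billing state, can be sketched as follows. Everything here is an assumption made for illustration: the BillingState and GatedAction values, the StripeEventLedger class, and the exact set of actions blocked in each state are not taken from the API code.

```typescript
// Hypothetical state and action names.
type BillingState = "active" | "warning" | "final_warning" | "suspended";
type GatedAction = "start" | "console" | "backup";

// Assumed gate: suspension blocks every gated action, and the final warning
// state blocks backups only.
function isActionAllowed(state: BillingState, action: GatedAction): boolean {
  if (state === "suspended") return false;
  if (state === "final_warning" && action === "backup") return false;
  return true;
}

// In-memory stand-in for a durable record of processed Stripe event ids.
class StripeEventLedger {
  private readonly seen = new Set<string>();

  // Runs `apply` only the first time an event id is seen. The id is recorded
  // after `apply` returns, so a handler that throws can be retried.
  applyOnce(eventId: string, apply: () => void): boolean {
    if (this.seen.has(eventId)) return false;
    apply();
    this.seen.add(eventId);
    return true;
  }
}
```

A replayed event id returns false without running the handler again, so the enforcement state is advanced at most once per Stripe event.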

Support ticket path

  • POST /api/support/create exists and creates a SupportTicket DB row with human-readable ticket number.
  • Portal form submit was validated.
  • Customer acknowledgement email was received.
  • Discord #support alert was received.
  • Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.

Platform / Infrastructure Active

  • Stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline.
  • OPNsense / public exposure audit.

Backup boundary

  • Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
  • PBS / platform backup strategy is the durability / disaster-recovery layer.
  • PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
  • Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.

Platform Future / Phase 2

  • Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
  • SSH / Cloudflare tunnel power-user access.
  • Portal SSH config snippets.
  • True interactive shell confinement / workspace-boundary decision.
  • Dev-container backup ownership and user-facing restore contract.
  • Likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
  • Artifact version promotion.
  • Runtime rollback support.
  • Cloudflare R2 for large artifact/mod delivery.
  • Admin panel.
  • Referral / dev pipeline reward system.
  • Uptime history.
  • Revisit DDoS mitigation later if needed.

Cleaning Rule

  • Root keeps only cross-repo/platform work.
  • Repo-specific items must be removed from root once they live only in one Codex tracker.
  • Completed items should be removed from active sections, not left in place as historical clutter.
  • Use CURRENT_STATE.md for durable implemented behavior.
  • Use DECISIONS.md for settled choices.
  • Re-open old items only when there is current evidence they are still unfinished or have regressed.