Open Threads — zlh-grind
This file tracks active cross-repo and platform-level work only.
Repo-specific work belongs in:
- Codex/API/OPEN_ITEMS.md
- Codex/Portal/OPEN_ITEMS.md
- Codex/Agent/OPEN_ITEMS.md
- Codex/Monitoring/OPEN_ITEMS.md
Keep this file short.
Launch Active
1. Portal terminal reliability
- Fix/harden console connection state so the UI cannot remain stuck at Connecting....
- Required behavior:
  - WebSocket connect timeout in TerminalView.
  - Error path closes/clears current socket refs and sender callback.
  - ServerConsole resets streaming state on closed, error, and idle.
  - Button recovers to Open Console/Reconnect.
- Do not rewrite Agent or switch terminal stack; xterm + API WebSocket bridge remains the model.
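The required behavior above can be sketched as a small state reducer. This is a minimal illustration, not the Portal's actual code: the state names, event names, the `nextConsoleState` and `buttonLabel` helpers, and the "Connected" label are all hypothetical. The point it shows is that every failure event, including a connect timeout, moves the state out of connecting.

```typescript
// Hypothetical state and event names; the real TerminalView/ServerConsole may differ.
type ConsoleState = "idle" | "connecting" | "open" | "closed" | "error";
type ConsoleEvent = "connect" | "opened" | "timeout" | "closed" | "error" | "idle";

// Every failure event leaves "connecting", so the UI cannot stay stuck there.
function nextConsoleState(state: ConsoleState, event: ConsoleEvent): ConsoleState {
  switch (event) {
    case "connect":
      // Ignore duplicate clicks while an attempt is in flight or already live.
      return state === "connecting" || state === "open" ? state : "connecting";
    case "opened":
      return state === "connecting" ? "open" : state;
    case "timeout":
      // A connect timeout only matters while still connecting.
      return state === "connecting" ? "error" : state;
    case "closed":
      return "closed";
    case "error":
      return "error";
    case "idle":
      return "idle";
  }
}

// Maps each state to the button text; "Connected" is an assumed label.
function buttonLabel(state: ConsoleState): string {
  switch (state) {
    case "idle":
      return "Open Console";
    case "connecting":
      return "Connecting...";
    case "open":
      return "Connected";
    case "closed":
    case "error":
      return "Reconnect";
  }
}
```

In this sketch the component would dispatch `timeout` from a timer started alongside the WebSocket, and would clear socket refs and the sender callback whenever the state becomes `closed` or `error`.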
2. Monitoring / observability readiness
- Finish or explicitly accept launch risk for the remaining monitoring items in Codex/Monitoring/OPEN_ITEMS.md.
- Current launch focus:
  - restrict public exposure for Prometheus/Grafana/node_exporter
  - validate game/dev discovery sync and stale target removal
  - ensure Grafana dashboards are installed and useful
  - expose API health/app-level state
  - decide centralized logs/Loki timing or accept as launch risk
  - add/verify queue staleness visibility for provisioning, repair, and billing_enforcement
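One way to make queue staleness visible is a per-queue gauge for the age of the oldest waiting job, rendered in the Prometheus text exposition format. The sketch below assumes the caller already knows each queue's oldest waiting enqueue time; the `QueueSnapshot` shape and the metric name `zlh_queue_oldest_waiting_age_seconds` are hypothetical.

```typescript
// Hypothetical input: enqueue time of the oldest waiting job per queue, or null when empty.
interface QueueSnapshot {
  queue: string;
  oldestWaitingEnqueuedAtMs: number | null;
}

// Age of the oldest waiting job in whole seconds; an empty queue reports 0.
function stalenessSeconds(snapshot: QueueSnapshot, nowMs: number): number {
  if (snapshot.oldestWaitingEnqueuedAtMs === null) return 0;
  return Math.max(0, Math.floor((nowMs - snapshot.oldestWaitingEnqueuedAtMs) / 1000));
}

// Renders one gauge sample per queue in Prometheus text exposition format.
function renderStalenessMetrics(snapshots: QueueSnapshot[], nowMs: number): string {
  const name = "zlh_queue_oldest_waiting_age_seconds"; // hypothetical metric name
  const lines = [
    `# HELP ${name} Age of the oldest waiting job per queue.`,
    `# TYPE ${name} gauge`,
  ];
  for (const s of snapshots) {
    lines.push(`${name}{queue="${s.queue}"} ${stalenessSeconds(s, nowMs)}`);
  }
  return lines.join("\n") + "\n";
}
```

An alert rule could then fire when the gauge for provisioning, repair, or billing_enforcement stays above a chosen threshold.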
3. Patch management / maintenance window policy
- Define normal maintenance window cadence and timezone.
- Define emergency maintenance behavior and customer notification expectations.
- Document patch order and rollback expectations for Proxmox, PBS, OPNsense, API, Portal, templates, Agent, Velocity, and monitoring.
4. Notepad / announcements / messaging retest
- Billing announcements and support ticket flow were validated live.
- Still retest notepad/notes load/save, persistence after reload/login, permissions, empty states, and error states.
5. Final integrated smoke test
Run this after the launch blockers above are clean:
- Game lifecycle: create → ready/connectable → console → files → backup → restore → delete → DNS/Velocity/Cloudflare cleanup verified.
- Dev lifecycle: create → hosted IDE → stop/restart/delete → cleanup verified.
- Security: direct Agent auth fails closed, non-owner access blocked, browser does not expose internal tokens/secrets.
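For the security checks, "fails closed" means a missing secret or missing credential results in denial rather than access. A minimal sketch of that rule, with hypothetical names (`AgentAuthConfig`, `authorizeAgentRequest`, `canAccessServer`) that are not taken from the Agent or API code:

```typescript
// Hypothetical config shape; the real Agent may load its secret differently.
interface AgentAuthConfig {
  expectedToken?: string;
}

// Fail closed: deny when the server-side secret or the presented token is missing or empty.
function authorizeAgentRequest(config: AgentAuthConfig, presentedToken: string | undefined): boolean {
  if (!config.expectedToken) return false;
  if (!presentedToken) return false;
  return presentedToken === config.expectedToken;
}

// Non-owner access is blocked; an unauthenticated requester is treated as a non-owner.
function canAccessServer(ownerId: string, requesterId: string | undefined): boolean {
  return requesterId !== undefined && requesterId === ownerId;
}
```

A smoke test in this spirit would assert the denial cases first, since those are the ones a misconfiguration tends to break silently.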
Current Architecture Validated / No Longer Blocking
Provisioning worker / async create
- POST /api/instances now creates/reuses a durable provisioning operation, enqueues BullMQ provisioning work, and returns 202 Accepted.
- zpack-provision-worker.service is installed/running under systemd.
- Game and dev provisioning through the worker were live-validated.
- Portal async pending cards were visually validated and replace themselves with real server cards.
- Duplicate/idempotency guards and controlled failure behavior were validated.
- API teardown still works for worker-created servers.
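The create-or-reuse behavior can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: the `Map` replaces the durable store, the `enqueue` callback replaces the BullMQ queue, and the `ProvisioningService` class and idempotency-key parameter are hypothetical names.

```typescript
interface ProvisioningOperation {
  id: string;
  idempotencyKey: string;
  status: "queued" | "running" | "done" | "failed";
}

interface CreateResponse {
  httpStatus: number;
  operationId: string;
  reused: boolean;
}

// In-memory stand-in: a real implementation would persist operations durably.
class ProvisioningService {
  private operations = new Map<string, ProvisioningOperation>();
  private nextId = 1;
  private enqueue: (op: ProvisioningOperation) => void;

  constructor(enqueue: (op: ProvisioningOperation) => void) {
    this.enqueue = enqueue;
  }

  // A repeated request with the same key reuses the operation and does not enqueue again.
  createInstance(idempotencyKey: string): CreateResponse {
    const existing = this.operations.get(idempotencyKey);
    if (existing) {
      return { httpStatus: 202, operationId: existing.id, reused: true };
    }
    const op: ProvisioningOperation = {
      id: `op-${this.nextId++}`,
      idempotencyKey,
      status: "queued",
    };
    this.operations.set(idempotencyKey, op);
    this.enqueue(op);
    return { httpStatus: 202, operationId: op.id, reused: false };
  }
}
```

Both the first call and the duplicate return 202 Accepted with the same operation ID, which is what lets the Portal show one pending card per request.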
Controller / repair foundation
- zlh-controller.service exists as the singleton controller/reconciler with Redis lock.
- Controller is currently expected to run in dry-run unless deliberately enabling Level 1 repair.
- zpack-repair-worker.service exists for Level 1 infrastructure repairs.
- Validated repairs:
  - stale provisioning operation → stale
  - live Cloudflare SRV drift → edge_republish restored edge state
- Level 2/3 repairs remain disabled.
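The dry-run and level gating described above amounts to a small decision function. The sketch below is illustrative only: the `Finding` and `ControllerConfig` shapes, the decision labels, and the `decideRepair` name are hypothetical, and the Redis lock is out of scope here.

```typescript
type RepairLevel = 1 | 2 | 3;

interface Finding {
  kind: string; // e.g. "stale_provisioning_operation" (hypothetical label)
  level: RepairLevel;
  target: string;
}

interface ControllerConfig {
  dryRun: boolean;
  maxEnabledLevel: 0 | RepairLevel; // 1 means only Level 1 repairs may run
}

type Decision = "enqueue" | "log_only" | "skip_disabled";

// Disabled levels are never repaired; enabled levels are only logged while in dry-run.
function decideRepair(finding: Finding, config: ControllerConfig): Decision {
  if (finding.level > config.maxEnabledLevel) return "skip_disabled";
  return config.dryRun ? "log_only" : "enqueue";
}
```

With `maxEnabledLevel` set to 1, a Level 2 or Level 3 finding is skipped regardless of the dry-run flag, matching the current stance that those repairs remain disabled.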
Billing enforcement
- Durable billing enforcement state, Stripe event idempotency, API gates, billing worker, and controller billing guards are implemented.
- zpack-billing-worker.service is installed/running under systemd.
- Validated:
  - payment failed / warning flow
  - Stripe replay idempotency
  - final warning / backup block state
  - suspension / shutdown safety
  - API gates while suspended
  - controller does not repair suspended game servers
  - payment restored flow
  - destructive billing actions rejected/audited
- Remaining API-owned follow-ups are fixture validation only: game backup mutation route validation with an available backup fixture, and file read/list validation against a responsive Agent.
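Two of the validated behaviors, Stripe replay idempotency and API gates by billing state, can be sketched as follows. Everything named here is an assumption for illustration: the state names, the action names, the blocked-action table, and the in-memory `Set` standing in for durable event storage.

```typescript
type BillingState = "active" | "warning" | "final_warning" | "suspended";
type Action = "start_server" | "create_backup" | "read_files";

// In-memory stand-in for a durable table of processed Stripe event IDs.
class StripeEventLog {
  private seen = new Set<string>();

  // Returns true only the first time an event ID is seen; replays return false.
  markProcessed(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}

// Hypothetical gate table: which actions each billing state blocks.
const BLOCKED_ACTIONS: Record<BillingState, Action[]> = {
  active: [],
  warning: [],
  final_warning: ["create_backup"],
  suspended: ["start_server", "create_backup"],
};

function isAllowed(state: BillingState, action: Action): boolean {
  return !BLOCKED_ACTIONS[state].includes(action);
}
```

A webhook handler in this style would call `markProcessed` first and return early on a replay, so the enforcement state changes at most once per event.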
Support ticket path
- POST /api/support/create exists and creates a SupportTicket DB row with human-readable ticket number.
- Portal form submit was validated.
- Customer acknowledgement email was received.
- Discord #support alert was received.
- Post-launch enhancements only: admin ticket list/view, support triage diagnostics, self-hosted helpdesk integration, inbound reply parsing, attachments.
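As an illustration of a human-readable ticket number, the sketch below derives one from a numeric sequence. The `ZLH-` prefix, the six-digit padding, and the `formatTicketNumber` name are assumptions; the actual format used by the API is not stated here.

```typescript
// Hypothetical format: PREFIX-000042. The real SupportTicket numbering may differ.
function formatTicketNumber(sequence: number, prefix: string = "ZLH"): string {
  if (!Number.isInteger(sequence) || sequence < 1) {
    throw new RangeError("sequence must be a positive integer");
  }
  // Pad to six digits; longer sequences are kept whole rather than truncated.
  return `${prefix}-${String(sequence).padStart(6, "0")}`;
}
```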
Platform / Infrastructure Active
- stress testing: k6 IDE load, Minecraft bot load, code-server memory baseline
- OPNsense / public exposure audit
Backup boundary
- Agent-owned backups are local, app-aware rollback backups for Minecraft worlds/config.
- PBS / platform backup strategy is the durability / disaster-recovery layer.
- PBS/R2 offsite copy currently uses rclone against Cloudflare R2; PBS does not natively use an S3 datastore for this workflow.
- Do not track PBS/offsite durability work as agent implementation work unless that ownership changes.
Platform Future / Phase 2
- Support admin ticket view, support triage diagnostics, and optional self-hosted helpdesk integration.
- SSH / CF tunnel power-user access.
- Portal SSH config snippets.
- true interactive shell confinement / workspace-boundary decision.
- dev-container backup ownership and user-facing restore contract.
- likely direction for dev backups: LXC snapshot-based backup/restore instead of agent-managed dev backups.
- artifact version promotion.
- runtime rollback support.
- Cloudflare R2 for large artifact/mod delivery.
- admin panel.
- referral / dev pipeline reward system.
- uptime history.
- revisit DDoS mitigation later if needed.
Cleaning Rule
- Root keeps only cross-repo/platform work.
- Repo-specific items must be removed from root once they live only in one Codex tracker.
- Completed items should be removed from active sections, not left in place as historical clutter.
- Use CURRENT_STATE.md for durable implemented behavior.
- Use DECISIONS.md for settled choices.
- Re-open old items only when there is current evidence they are still unfinished or have regressed.