# Session Log

---

## 2026-03-10

- Closed: Upload transport timeout tuning. The upload route now logs explicit error categories that distinguish client abort, upstream timeout, and socket reset.
- Research: Investigated external agent standards applicable to zlh-agent. No formal standard maps cleanly, because the agent is purpose-built (embedded process manager + filesystem authority inside LXC, internal-only caller). No gaps requiring action were found, so no new threads were opened. Key findings:
  - the health probe split (`/healthz` liveness vs `/readyz` readiness) is a common convention but not required given the single-caller architecture
  - graceful shutdown (SIGINT/SIGTERM, 10s timeout) is correct
  - structured lifecycle logging is already solid
  - Go 1.21 `slog` exists if log unification is ever wanted
- Grind repo updated to reflect current platform state. Misplaced architecture docs were later removed; canonical docs belong in knowledge-base, not zlh-grind.

---

## 2026-03-14

Goal: Stabilize dev container provisioning across agent, portal, and artifact server.

Work completed:

Agent:
- catalog-driven runtime validation via `devcontainer/_catalog.json` (`ValidateRuntimeSelection` in `common.go`)
- `EnsureDevUserEnvironment`: dev user, `/home/dev/workspace`, correct ownership
- dotnet runtime provisioning added
- optional code-server addon provisioning added
- `WriteReadyMarker` after successful provision
- `RuntimeInstalled()` filesystem-based install guard in `common.go`
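
Catalog-driven validation of this kind can be sketched as below. The catalog schema (a map from runtime name to allowed versions), the `catalog` type, and the `validateSelection` method are assumptions for illustration; the real layout of `devcontainer/_catalog.json` and the signature of `ValidateRuntimeSelection` are not shown in this log.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// catalog is a hypothetical schema: runtime name -> allowed versions.
type catalog struct {
	Runtimes map[string][]string `json:"runtimes"`
}

// parseCatalog decodes catalog JSON into the hypothetical schema above.
func parseCatalog(data []byte) (*catalog, error) {
	var c catalog
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, fmt.Errorf("parse catalog: %w", err)
	}
	return &c, nil
}

// validateSelection rejects any runtime/version pair that is not in the catalog.
func (c *catalog) validateSelection(runtime, version string) error {
	versions, ok := c.Runtimes[runtime]
	if !ok {
		return fmt.Errorf("unknown runtime %q", runtime)
	}
	for _, v := range versions {
		if v == version {
			return nil
		}
	}
	return fmt.Errorf("runtime %q has no version %q in catalog", runtime, version)
}
```

Gating provisioning on a check like this means an unsupported runtime or version fails fast, before any install script runs.
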

Portal:
- dotnet runtime added
- "enable code-server" option added
- Files tab enabled for dev containers

API:
- `enable_code_server` field added to dev provisioning payload
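
As a rough illustration of the payload change, the sketch below decodes a provisioning request carrying the new flag. Only the `enable_code_server` field name comes from this log; the `devProvisionRequest` type and the `runtime` and `version` fields are hypothetical.

```go
package main

import "encoding/json"

// devProvisionRequest is a hypothetical shape for the dev provisioning payload.
type devProvisionRequest struct {
	Runtime          string `json:"runtime"` // assumed field
	Version          string `json:"version"` // assumed field
	EnableCodeServer bool   `json:"enable_code_server"`
}

// decodeProvisionRequest parses a payload. If the flag is absent it stays
// false, so older callers do not get the addon by accident.
func decodeProvisionRequest(data []byte) (devProvisionRequest, error) {
	var req devProvisionRequest
	err := json.Unmarshal(data, &req)
	return req, err
}
```
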

Blocker:
- code-server artifact on zlh-artifacts contains the source repository, not a compiled release
- `install.sh` expects `bin/code-server`, `lib/`, and `node_modules/`, so a compiled release is required
- fix: replace the artifact with an official code-server release tarball (e.g. `code-server-4.x.x-linux-amd64.tar.gz`)

---

## 2026-03-15

Architecture review session. Key decisions and findings:

Dev container model:
- 1 server / 1 container / 1 world confirmed as the correct model
- Dev containers: full R/W access under `/home/dev/workspace`, no allowlist
- Multiverse/multi-world via plugins is customer-managed, not a platform concern
- Port exposure (`dev-<vmid>.zerolaghub.com`) identified as the next major dev feature; future work
- dotnet SDK covers all C# game modding (Valheim, Core Keeper, Vintage Story, Rust/Oxide)
- Code Server confirmed as the correct browser IDE approach given the single public IP constraint
- Traefik dynamic file provider confirmed as the correct routing approach; no plugin needed, no SRV records needed

Agent review (zlh-agent commit 6019d0bc, 2026-03-15):
- Catalog transition confirmed correct: `ValidateRuntimeSelection` gates all dev provisioning
- Scripts unchanged: embedded scripts still execute via bash stdin pipe, so runtime installs carry no risk of exit code 126 (script not executable)
- `devcontainer/common.go` is clean and complete
- `node/verify.go` hardcodes `/opt/zlh/runtime/node/bin/node`, which is the wrong path (pre-existing issue)
- node/python/go/java install packages still use the old version-unaware marker pattern (pre-existing, not a regression)

Agent future work (priority order):
1. Unified structured logging (slog): Promtail/Loki integration needs structured fields
2. Dev container /status: `provisioningComplete` + `provisioningError` fields
3. Crash recovery with backoff: 30s/60s/120s, max 3 attempts, then error state
4. Graceful shutdown verification: SIGTERM + wait before SIGKILL for Minecraft world save safety
5. Agent restart/process reattachment: detect existing process on restart

Code-server routing:
- Artifact fix confirmed working 2026-03-15
- Binary confirmed present at `/opt/zlh/services/code-server/bin/code-server`
- Root cause of ERR_CONNECTION_CLOSED identified: code-server is installed but never launched
- Port conflict: the Node runtime binds 6000, so code-server cannot share that port
- Two fixes needed:
  1. Assign code-server a port that won't conflict with Node (6000 is taken)
  2. Add a launch step to the addon install script. Install != start: the binary must be daemonized after provisioning
- Suggested launch: `nohup /opt/zlh/services/code-server/bin/code-server --bind-addr 0.0.0.0:<port> --auth none /home/dev/workspace > /opt/zlh-agent/logs/code-server.log 2>&1 &`
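
If the launch step lived in the agent rather than in the install script, it could look roughly like the sketch below. The binary path, flags, workspace, and log path are taken from the suggested launch command; the `codeServerCommand` and `startCodeServer` helpers, and the idea of launching from Go at all, are assumptions. The port is a parameter because no port has been chosen yet.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const (
	codeServerBin = "/opt/zlh/services/code-server/bin/code-server"
	codeServerLog = "/opt/zlh-agent/logs/code-server.log"
	workspaceDir  = "/home/dev/workspace"
)

// codeServerCommand builds the launch command for the given port (hypothetical helper).
func codeServerCommand(port int) *exec.Cmd {
	return exec.Command(codeServerBin,
		"--bind-addr", fmt.Sprintf("0.0.0.0:%d", port),
		"--auth", "none",
		workspaceDir,
	)
}

// startCodeServer starts code-server in the background, sending stdout and
// stderr to the log file (truncated on each start, like the shell's ">").
// It returns without waiting for the process to exit.
func startCodeServer(port int) (*exec.Cmd, error) {
	logFile, err := os.OpenFile(codeServerLog, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return nil, fmt.Errorf("open log: %w", err)
	}
	defer logFile.Close() // the child process keeps its own copy of the descriptor

	cmd := codeServerCommand(port)
	cmd.Stdout = logFile
	cmd.Stderr = logFile
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start code-server: %w", err)
	}
	return cmd, nil
}
```

This only covers the "start" half of the fix. A production version would also need to detach the process from the agent (for example via `SysProcAttr`), track its PID, and pick a port that does not collide with 6000.
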

---

## 2026-04-12

Goal: Close the billing loop, ship first-run onboarding, and refresh the dashboard home surface.

Work completed:

Infra / Billing path:
- public billing hostname and reverse proxy path fixed at `billing.zerolaghub.com`
- Caddy TLS issuance succeeded for `billing.zerolaghub.com` and `portal.zerolaghub.com`
- Stripe webhook delivery validated live against the public billing endpoint
- local billing state persistence verified via Prisma after webhook delivery

API:
- billing webhook now persists live billing state:
  - `subscriptionStatus`
  - `plan`
  - `currentPeriodEnd`
  - `lastInvoicePaidAt`
  - `billingSyncedAt`
- direct upgrade flow implemented via `POST /api/billing/upgrade`
- period-end scheduled downgrade flow implemented via `POST /api/billing/downgrade`
- scheduled downgrade persistence added:
  - `scheduledPlan`
  - `scheduledPlanEffectiveAt`
- centralized plan limits added and enforced in `POST /api/instances`:
  - basic: 1 game / 1 dev
  - pro: 3 game / 3 dev
  - admin exempt
- password reset API flow implemented:
  - `POST /api/auth/password-reset/request`
  - `POST /api/auth/password-reset/confirm`
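
The plan limits listed above reduce to a small table plus one check at instance-creation time. The sketch below is written in Go purely for illustration; the API's actual implementation language is not stated in this log, and `planLimit`, `planLimits`, and `checkInstanceQuota` are hypothetical names. Only the limit values and the admin exemption come from the log.

```go
package main

import "fmt"

// planLimit caps how many instances of each kind a plan may run.
type planLimit struct {
	Game int
	Dev  int
}

// planLimits mirrors the limits listed in the log.
var planLimits = map[string]planLimit{
	"basic": {Game: 1, Dev: 1},
	"pro":   {Game: 3, Dev: 3},
}

// checkInstanceQuota returns an error if creating one more instance of the
// given kind ("game" or "dev") would exceed the plan limit. Admins are exempt.
func checkInstanceQuota(plan string, isAdmin bool, kind string, current int) error {
	if isAdmin {
		return nil
	}
	limit, ok := planLimits[plan]
	if !ok {
		return fmt.Errorf("unknown plan %q", plan)
	}
	var allowed int
	switch kind {
	case "game":
		allowed = limit.Game
	case "dev":
		allowed = limit.Dev
	default:
		return fmt.Errorf("unknown instance kind %q", kind)
	}
	if current >= allowed {
		return fmt.Errorf("%s plan allows %d %s instance(s); upgrade to add more", plan, allowed, kind)
	}
	return nil
}
```

Keeping the limits in one table means the create endpoint and any quota messaging in the portal can read from the same source.
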

Portal:
- billing page aligned to live API billing state
- Stripe portal section made honest: reduced to the one real portal entry point
- direct in-app Basic → Pro upgrade flow wired
- direct in-app Pro → Basic scheduled downgrade flow wired
- quota/plan-limit messaging added to server create flow with billing upgrade guidance
- forgot-password and reset-password pages added and linked from login
- first-login onboarding shipped on dashboard:
  - welcome modal
  - quick tour
  - full tour
  - skip / completion persistence via localStorage
- dashboard refreshed from mini-listing to home surface:
  - duplicate resource overview removed
  - spotlight server card/carousel added
  - primary actions + notices retained

Outcome:
- billing loop is now functional end-to-end
- auth reset flow is present end-to-end
- onboarding is now in-product
- dashboard now feels like a dashboard instead of a duplicate servers page

Confirmed remaining follow-ups:
- game server world backup / restore
- email notifications
- Open IDE production-path confirmation
- SSH config snippet for power users
- service discovery cleanup
- upload, stress, and provisioning validation