# Session Log

---

## 2026-03-10

- Closed: Upload transport timeout tuning — upload route now logs explicit error categories distinguishing client abort, upstream timeout, and socket reset.
- Research: Investigated external agent standards applicable to zlh-agent. No formal standard maps cleanly — the agent is purpose-built (embedded process manager + filesystem authority inside LXC, internal-only caller). Key findings: the health probe split (`/healthz` liveness vs `/readyz` readiness) is a common convention but not required given the single-caller architecture; graceful shutdown (SIGINT/SIGTERM, 10s timeout) is correct; structured lifecycle logging is already solid; Go 1.21 `slog` exists if log unification is ever wanted. No new open threads came out of this — no gaps requiring action.
- Grind repo updated to reflect current platform state. Misplaced architecture docs later removed — canonical docs belong in knowledge-base, not zlh-grind.

---

## 2026-03-14

Goal: Stabilize dev container provisioning across agent, portal, and artifact server.

Work completed:

Agent:
- catalog-driven runtime validation via `devcontainer/_catalog.json` (`ValidateRuntimeSelection` in `common.go`)
- `EnsureDevUserEnvironment` — dev user, `/home/dev/workspace`, correct ownership
- dotnet runtime provisioning added
- optional code-server addon provisioning added
- `WriteReadyMarker` after successful provision
- `RuntimeInstalled()` filesystem-based install guard in `common.go`

Portal:
- dotnet runtime added
- enable code-server option added
- Files tab enabled for dev containers

API:
- `enable_code_server` field added to dev provisioning payload

Blocker:
- code-server artifact on zlh-artifacts contains the source repository, not a compiled release
- `install.sh` expects `bin/code-server`, `lib/`, `node_modules/` — a compiled release is required
- fix: replace the artifact with an official code-server release tarball (e.g. `code-server-4.x.x-linux-amd64.tar.gz`)

---

## 2026-03-15

Architecture review session.
Key decisions and findings:

Dev container model:
- 1 server / 1 container / 1 world confirmed as the correct model
- Dev containers: full R/W access under `/home/dev/workspace`, no allowlist
- Multiverse/multi-world via plugins is customer-managed, not a platform concern
- Port exposure (dev-.zerolaghub.com) identified as the next major dev feature — future work
- dotnet SDK covers all C# game modding (Valheim, Core Keeper, Vintage Story, Rust/Oxide)
- Code Server confirmed as the correct browser IDE approach given the single public IP constraint
- Traefik dynamic file provider confirmed as the correct routing approach — no plugin needed, no SRV records needed

Agent review (zlh-agent commit 6019d0bc — 2026-03-15):
- Catalog transition confirmed correct — `ValidateRuntimeSelection` gates all dev provisioning
- Scripts unchanged — embedded script execution via bash stdin pipe, no exit-126 risk from runtime installs
- `devcontainer/common.go` is clean and complete
- `node/verify.go` has a hardcoded `/opt/zlh/runtime/node/bin/node` — wrong path, pre-existing issue
- node/python/go/java install packages still use the old version-unaware marker pattern — pre-existing, not a regression

Agent future work (priority order):
1. Unified structured logging (slog) — Promtail/Loki integration needs structured fields
2. Dev container /status — provisioningComplete + provisioningError fields
3. Crash recovery with backoff — 30s/60s/120s, max 3 attempts, then error state
4. Graceful shutdown verification — SIGTERM + wait before SIGKILL for Minecraft world save safety
5. Agent restart/process reattachment — detect existing process on restart

Code-server routing:
- Artifact fix confirmed working 2026-03-15
- Binary confirmed present at `/opt/zlh/services/code-server/bin/code-server`
- Root cause of ERR_CONNECTION_CLOSED identified: code-server is installed but never launched
- Port conflict: Node runtime is binding 6000, code-server cannot share the port
- Two fixes needed:
  1. Assign code-server a port that won't conflict with Node (6000 taken)
  2. Add a launch step to the addon install script — install != start, the binary must be daemonized after provisioning
- Suggested launch: `nohup /opt/zlh/services/code-server/bin/code-server --bind-addr 0.0.0.0: --auth none /home/dev/workspace > /opt/zlh-agent/logs/code-server.log 2>&1 &`

---

## 2026-04-12

Goal: Close the billing loop, ship first-run onboarding, and refresh the dashboard home surface.

Work completed:

Infra / Billing path:
- public billing hostname and reverse proxy path fixed at `billing.zerolaghub.com`
- Caddy TLS issuance succeeded for `billing.zerolaghub.com` and `portal.zerolaghub.com`
- Stripe webhook delivery validated live against the public billing endpoint
- local billing state persistence after webhook delivery verified via Prisma

API:
- billing webhook now persists live billing state:
  - `subscriptionStatus`
  - `plan`
  - `currentPeriodEnd`
  - `lastInvoicePaidAt`
  - `billingSyncedAt`
- direct upgrade flow implemented via `POST /api/billing/upgrade`
- period-end scheduled downgrade flow implemented via `POST /api/billing/downgrade`
- scheduled downgrade persistence added:
  - `scheduledPlan`
  - `scheduledPlanEffectiveAt`
- centralized plan limits added and enforced in `POST /api/instances`:
  - basic: 1 game / 1 dev
  - pro: 3 game / 3 dev
  - admin exempt
- password reset API flow implemented:
  - `POST /api/auth/password-reset/request`
  - `POST /api/auth/password-reset/confirm`

Portal:
- billing page aligned to live API billing state
- Stripe portal section made honest: reduced to one real portal entry point
- direct in-app Basic → Pro upgrade flow wired
- direct in-app Pro → Basic scheduled downgrade flow wired
- quota/plan-limit messaging added to server create flow with billing upgrade guidance
- forgot-password and reset-password pages added and linked from login
- first-login onboarding shipped on dashboard:
  - welcome modal
  - quick tour
  - full tour
  - skip / completion persistence via localStorage
- dashboard refreshed from mini-listing to home surface:
  - duplicate resource overview removed
  - spotlight server card/carousel added
  - primary actions + notices retained

Outcome:
- billing loop is now functional end-to-end
- auth reset flow is present end-to-end
- onboarding is now in-product
- dashboard now feels like a dashboard instead of a duplicate servers page

Confirmed remaining follow-ups:
- game server world backup / restore
- email notifications
- Open IDE production-path confirmation
- SSH config snippet for power users
- service discovery cleanup
- upload, stress, and provisioning validation