Session Log


2026-03-10

  • Closed: Upload transport timeout tuning — upload route now logs explicit error categories distinguishing client abort, upstream timeout, and socket reset.
  • Research: Investigated external agent standards applicable to zlh-agent. No formal standard maps cleanly, because the agent is purpose-built (embedded process manager + filesystem authority inside LXC, internal-only caller). Key findings: the health probe split (/healthz liveness vs /readyz readiness) is a common convention but not required given the single-caller architecture; graceful shutdown (SIGINT/SIGTERM, 10s timeout) is correct; structured lifecycle logging is already solid; Go 1.21's log/slog package is available if log unification is ever wanted. No new threads opened from this; no gaps requiring action.
  • Grind repo updated to reflect current platform state. Misplaced architecture docs later removed — canonical docs belong in knowledge-base not zlh-grind.
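
For reference on the slog option noted in the research item above, here is a minimal sketch of structured lifecycle logging with Go's log/slog. The helper names (`newAgentLogger`, `renderEvent`) and the field names (`component`, `event`, `signal`, `timeout_s`) are illustrative assumptions, not the agent's existing logging code.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
)

// newAgentLogger returns a JSON logger whose records all carry a
// "component" field. The field name is an assumption for illustration.
func newAgentLogger(w io.Writer, component string) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("component", component)
}

// renderEvent logs one lifecycle event into memory and returns the
// resulting JSON line, which is handy for inspecting the output shape.
func renderEvent(component, event string, attrs ...any) string {
	var buf bytes.Buffer
	args := append([]any{"event", event}, attrs...)
	newAgentLogger(&buf, component).Info("lifecycle", args...)
	return buf.String()
}

func main() {
	// Prints a single JSON object with time, level, msg, and the fields below.
	fmt.Print(renderEvent("zlh-agent", "shutdown", "signal", "SIGTERM", "timeout_s", 10))
}
```

In a real process the logger would write to os.Stdout or a log file rather than a buffer; the buffer only keeps the sketch self-contained.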

2026-03-14

Goal: Stabilize dev container provisioning across agent, portal, and artifact server.

Work completed:

Agent:

  • catalog-driven runtime validation via devcontainer/_catalog.json (ValidateRuntimeSelection in common.go)
  • EnsureDevUserEnvironment — dev user, /home/dev/workspace, correct ownership
  • dotnet runtime provisioning added
  • optional code-server addon provisioning added
  • WriteReadyMarker after successful provision
  • RuntimeInstalled() filesystem-based install guard in common.go
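
The catalog gate and the filesystem install guard listed above could look roughly like the sketch below. The catalog schema (runtime name mapped to a list of versions), the sample versions, the lowercase function names, and the `<root>/<runtime>/<version>` directory layout are all assumptions made for illustration; the real `ValidateRuntimeSelection` and `RuntimeInstalled()` in common.go may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// catalog is a hypothetical shape for devcontainer/_catalog.json:
// runtime name -> supported versions.
type catalog struct {
	Runtimes map[string][]string `json:"runtimes"`
}

// parseCatalog decodes catalog JSON.
func parseCatalog(data []byte) (catalog, error) {
	var c catalog
	if err := json.Unmarshal(data, &c); err != nil {
		return catalog{}, fmt.Errorf("parse catalog: %w", err)
	}
	return c, nil
}

// validateSelection rejects any runtime/version pair the catalog does not list.
func validateSelection(c catalog, runtime, version string) error {
	versions, ok := c.Runtimes[runtime]
	if !ok {
		return fmt.Errorf("unknown runtime %q", runtime)
	}
	for _, v := range versions {
		if v == version {
			return nil
		}
	}
	return fmt.Errorf("runtime %q has no version %q in catalog", runtime, version)
}

// runtimeInstalled is a filesystem-based guard: the runtime counts as
// installed only if its (assumed) version directory exists under root.
func runtimeInstalled(root, runtime, version string) bool {
	info, err := os.Stat(filepath.Join(root, runtime, version))
	return err == nil && info.IsDir()
}

func main() {
	// Inline sample data standing in for devcontainer/_catalog.json.
	c, err := parseCatalog([]byte(`{"runtimes":{"node":["20"],"dotnet":["8.0"]}}`))
	if err != nil {
		fmt.Println("catalog error:", err)
		return
	}
	fmt.Println(validateSelection(c, "dotnet", "8.0")) // listed in the sample catalog
	fmt.Println(validateSelection(c, "ruby", "3.3"))   // not listed, so an error
}
```

Checking the catalog before touching the filesystem keeps unsupported selections from ever reaching an install script.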

Portal:

  • dotnet runtime added
  • enable code-server option added
  • Files tab enabled for dev containers

API:

  • enable_code_server field added to dev provisioning payload

Blocker:

  • code-server artifact on zlh-artifacts contains source repository, not compiled release
  • install.sh expects bin/code-server, lib/, node_modules/ — compiled release required
  • fix: replace artifact with official code-server release tarball (e.g. code-server-4.x.x-linux-amd64.tar.gz)
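
A pre-install check for the blocker above might be sketched as follows. It only tests for the three entries the log says install.sh expects (bin/code-server, lib/, node_modules/); the function name and the idea of running this before install are assumptions, not existing agent behavior.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requiredEntries lists what install.sh expects at the top of the
// extracted artifact, per the session log. isDir marks directories.
var requiredEntries = []struct {
	path  string
	isDir bool
}{
	{"bin/code-server", false},
	{"lib", true},
	{"node_modules", true},
}

// missingReleaseEntries returns the expected entries that are absent
// (or of the wrong kind) under root. An empty result means the layout
// looks like a compiled release rather than a source checkout.
func missingReleaseEntries(root string) []string {
	var missing []string
	for _, e := range requiredEntries {
		info, err := os.Stat(filepath.Join(root, filepath.FromSlash(e.path)))
		if err != nil || info.IsDir() != e.isDir {
			missing = append(missing, e.path)
		}
	}
	return missing
}

func main() {
	root := "." // directory of the extracted artifact; override via argv
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	if missing := missingReleaseEntries(root); len(missing) > 0 {
		fmt.Println("artifact does not look like a compiled release, missing:", missing)
		return
	}
	fmt.Println("artifact layout looks like a compiled release")
}
```

Failing fast on layout gives a clearer error than letting install.sh fail partway through on a source repository.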

2026-03-15

Architecture review session. Key decisions and findings:

Dev container model:

  • 1 server / 1 container / 1 world confirmed as correct model
  • Dev containers: full R/W access under /home/dev/workspace, no allowlist
  • Multiverse/multi-world via plugins is customer-managed, not a platform concern
  • Port exposure (dev-.zerolaghub.com) identified as next major dev feature — future work
  • dotnet SDK covers all C# game modding (Valheim, Core Keeper, Vintage Story, Rust/Oxide)
  • Code Server confirmed as correct browser IDE approach given single public IP constraint
  • Traefik dynamic file provider confirmed as correct routing approach — no plugin needed, no SRV records needed

Agent review (zlh-agent commit 6019d0bc — 2026-03-15):

  • Catalog transition confirmed correct — ValidateRuntimeSelection gates all dev provisioning
  • Scripts unchanged: embedded scripts still run via bash stdin pipe, so runtime installs carry no exit-126 (not executable) risk
  • devcontainer/common.go is clean and complete
  • node/verify.go has hardcoded /opt/zlh/runtime/node/bin/node — wrong path, pre-existing issue
  • node/python/go/java install packages still use old version-unaware marker pattern — pre-existing, not a regression

Agent future work (priority order):

  1. Unified structured logging (slog) — Promtail/Loki integration needs structured fields
  2. Dev container /status — provisioningComplete + provisioningError fields
  3. Crash recovery with backoff — 30s/60s/120s, max 3 attempts, then error state
  4. Graceful shutdown verification — SIGTERM + wait before SIGKILL for Minecraft world save safety
  5. Agent restart/process reattachment — detect existing process on restart
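
Item 3 in the list above (30s/60s/120s backoff, max 3 attempts, then error state) can be sketched as below. This is one possible reading: it waits before every restart attempt, including the first. The names `nextDelay`, `recoverProcess`, and `noSleep` are hypothetical, and the sleep function is injected so the logic can be exercised without real waits.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffSchedule holds the delays from the session log: 30s/60s/120s.
var backoffSchedule = []time.Duration{30 * time.Second, 60 * time.Second, 120 * time.Second}

// errRetriesExhausted signals that the process should enter the error state.
var errRetriesExhausted = errors.New("restart attempts exhausted")

// nextDelay returns the wait before restart attempt n (1-based).
// ok is false once the schedule is used up.
func nextDelay(attempt int) (time.Duration, bool) {
	if attempt < 1 || attempt > len(backoffSchedule) {
		return 0, false
	}
	return backoffSchedule[attempt-1], true
}

// noSleep skips waiting; useful for dry runs.
func noSleep(time.Duration) {}

// recoverProcess waits, then calls start, until start succeeds or the
// schedule is exhausted. On exhaustion it wraps the last start error.
func recoverProcess(start func() error, sleep func(time.Duration)) error {
	var lastErr error
	for attempt := 1; ; attempt++ {
		delay, ok := nextDelay(attempt)
		if !ok {
			return fmt.Errorf("%w: %v", errRetriesExhausted, lastErr)
		}
		sleep(delay)
		if lastErr = start(); lastErr == nil {
			return nil
		}
	}
}

func main() {
	failures := 2 // simulate two failed restarts before a successful one
	start := func() error {
		if failures > 0 {
			failures--
			return errors.New("process exited immediately")
		}
		return nil
	}
	logSleep := func(d time.Duration) { fmt.Println("would wait", d) }
	fmt.Println("result:", recoverProcess(start, logSleep))
}
```

Callers can test for the error state with `errors.Is(err, errRetriesExhausted)` and surface it through /status.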

Code-server routing:

  • Artifact fix confirmed working 2026-03-15
  • Binary confirmed present at /opt/zlh/services/code-server/bin/code-server
  • Root cause of ERR_CONNECTION_CLOSED identified: code-server is installed but never launched
  • Port conflict: Node runtime is binding 6000, code-server cannot share the port
  • Two fixes needed:
    1. Assign code-server a port that won't conflict with Node (6000 taken)
    2. Add launch step to addon install script — install != start, binary must be daemonized after provisioning
  • Suggested launch: nohup /opt/zlh/services/code-server/bin/code-server --bind-addr 0.0.0.0: --auth none /home/dev/workspace > /opt/zlh-agent/logs/code-server.log 2>&1 &
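
If the launch step ends up in the agent rather than the addon script, a Go equivalent of the suggested command might look like this sketch. The port is deliberately a parameter because the log has not settled on one; the function names are hypothetical; `Setsid` stands in for `nohup` and is Unix-only. The `--bind-addr` and `--auth none` flags are taken from the suggested launch line above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"syscall"
)

const nodePort = 6000 // per the log, the Node runtime already binds this port

// codeServerArgs builds the argument list from the suggested launch line.
// It refuses the port the Node runtime is known to hold.
func codeServerArgs(port int, workspace string) ([]string, error) {
	if port <= 0 || port > 65535 {
		return nil, fmt.Errorf("invalid port %d", port)
	}
	if port == nodePort {
		return nil, fmt.Errorf("port %d is taken by the Node runtime", port)
	}
	return []string{"--bind-addr", fmt.Sprintf("0.0.0.0:%d", port), "--auth", "none", workspace}, nil
}

// launchCodeServer starts the binary detached from the caller's session,
// with stdout and stderr appended to logPath. The caller should Wait on
// or Release the returned command's process to avoid a zombie.
func launchCodeServer(bin, logPath string, args []string) (*exec.Cmd, error) {
	logFile, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, fmt.Errorf("open log: %w", err)
	}
	defer logFile.Close() // the child keeps its own copy of the descriptor

	cmd := exec.Command(bin, args...)
	cmd.Stdout = logFile
	cmd.Stderr = logFile
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true} // new session, like nohup
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start code-server: %w", err)
	}
	return cmd, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: launch <port>")
		return
	}
	port, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Println("port must be a number:", err)
		return
	}
	args, err := codeServerArgs(port, "/home/dev/workspace")
	if err != nil {
		fmt.Println(err)
		return
	}
	cmd, err := launchCodeServer("/opt/zlh/services/code-server/bin/code-server",
		"/opt/zlh-agent/logs/code-server.log", args)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("code-server started, pid", cmd.Process.Pid)
}
```

Keeping argument construction separate from process start makes the port rule easy to check without launching anything.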

2026-04-12

Goal: Close the billing loop, ship first-run onboarding, and refresh the dashboard home surface.

Work completed:

Infra / Billing path:

  • public billing hostname and reverse proxy path fixed at billing.zerolaghub.com
  • Caddy TLS issuance succeeded for billing.zerolaghub.com and portal.zerolaghub.com
  • Stripe webhook delivery validated live against the public billing endpoint
  • local billing state persistence verified through Prisma after webhook delivery

API:

  • billing webhook now persists live billing state
    • subscriptionStatus
    • plan
    • currentPeriodEnd
    • lastInvoicePaidAt
    • billingSyncedAt
  • direct upgrade flow implemented via POST /api/billing/upgrade
  • period-end scheduled downgrade flow implemented via POST /api/billing/downgrade
  • scheduled downgrade persistence added:
    • scheduledPlan
    • scheduledPlanEffectiveAt
  • centralized plan limits added and enforced in POST /api/instances
    • basic: 1 game / 1 dev
    • pro: 3 game / 3 dev
    • admin exempt
  • password reset API flow implemented:
    • POST /api/auth/password-reset/request
    • POST /api/auth/password-reset/confirm
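
The centralized plan limits listed above reduce to a small lookup plus one comparison. The sketch below is written in Go only to match the other sketches in this log; the API uses Prisma, so treat it as language-neutral logic rather than the API's code. The names `planLimits`, `checkQuota`, and `errQuotaExceeded` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// limits holds the per-plan instance caps from the session log.
type limits struct{ game, dev int }

// planLimits centralizes the caps: basic 1 game / 1 dev, pro 3 game / 3 dev.
var planLimits = map[string]limits{
	"basic": {game: 1, dev: 1},
	"pro":   {game: 3, dev: 3},
}

// errQuotaExceeded lets callers map the failure to an upgrade prompt.
var errQuotaExceeded = errors.New("plan limit reached")

// checkQuota decides whether a user who already has `current` instances
// of the given kind ("game" or "dev") may create one more. Admins are exempt.
func checkQuota(plan string, isAdmin bool, kind string, current int) error {
	if isAdmin {
		return nil
	}
	l, ok := planLimits[plan]
	if !ok {
		return fmt.Errorf("unknown plan %q", plan)
	}
	var limit int
	switch kind {
	case "game":
		limit = l.game
	case "dev":
		limit = l.dev
	default:
		return fmt.Errorf("unknown instance kind %q", kind)
	}
	if current >= limit {
		return fmt.Errorf("%w: %s allows %d %s instance(s)", errQuotaExceeded, plan, limit, kind)
	}
	return nil
}

func main() {
	fmt.Println(checkQuota("basic", false, "game", 0)) // under the cap
	fmt.Println(checkQuota("basic", false, "game", 1)) // at the cap
	fmt.Println(checkQuota("basic", true, "game", 99)) // admin exempt
}
```

Returning a sentinel error gives the create flow one condition to match when showing billing upgrade guidance.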

Portal:

  • billing page aligned to live API billing state
  • Stripe portal section made honest: reduced to the one real portal entry point
  • direct in-app Basic → Pro upgrade flow wired
  • direct in-app Pro → Basic scheduled downgrade flow wired
  • quota/plan-limit messaging added to server create flow with billing upgrade guidance
  • forgot-password and reset-password pages added and linked from login
  • first-login onboarding shipped on dashboard:
    • welcome modal
    • quick tour
    • full tour
    • skip / completion persistence via localStorage
  • dashboard refreshed from mini-listing to home surface:
    • duplicate resource overview removed
    • spotlight server card/carousel added
    • primary actions + notices retained

Outcome:

  • billing loop is now functional end-to-end
  • auth reset flow is present end-to-end
  • onboarding is now in-product
  • dashboard now feels like a dashboard instead of a duplicate servers page

Confirmed remaining follow-ups:

  • game server world backup / restore
  • email notifications
  • Open IDE production-path confirmation
  • SSH config snippet for power users
  • service discovery cleanup
  • upload, stress, and provisioning validation