Session Log — zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.


2025-12-20

  • Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
  • Observed repeated failures during devcontainer runtime installation (node, python, go, java).
  • Initial assumption was installer regression; investigation showed installers were enforcing contract correctly.
  • Root cause identified: agent was not exporting required runtime environment variables (notably RUNTIME_VERSION).

2025-12-21

  • Deep dive on zlh-agent devcontainer provisioning flow.
  • Confirmed that all devcontainer installers intentionally require RUNTIME_VERSION and fail fast if missing.
  • Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
  • Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.

Embedded installer execution findings

  • Agent executes installers as embedded scripts, not filesystem paths.
  • Identified critical requirement: shared installer logic (common.sh) and runtime installer must execute in the same shell session.
  • Failure mode observed: install_runtime: command not found → caused by running runtime installer without common.sh loaded.
  • Confirmed this explains missing runtime directories and lack of artifact downloads.

Installer architecture changes

  • Refactored installer model to:
    • common.sh: shared, strict, embedded-safe installation logic
    • per-runtime installers (node, python, go, java) as declarative descriptors only
  • Established that runtime installers are minimal and declarative by design.
  • Confirmed that this preserves existing runtime layout: /opt/zlh/runtime/<language>/<version>/current

Artifact layout update

  • Artifact naming and layout changed to simplified form:
    • node-24.tar.xz
    • python-3.12.tar.xz
    • go-1.22.tar.gz
    • jdk-21.tar.gz
  • Identified mismatch between runtime name and archive prefix (notably Java).
  • Introduced ARCHIVE_PREFIX as a runtime-level variable to resolve naming cleanly.
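
For illustration, a minimal Go sketch of the naming rule ARCHIVE_PREFIX enables; the function name is hypothetical:

    package main

    import "fmt"

    // artifactName builds an archive filename from descriptor values.
    // The java runtime sets ARCHIVE_PREFIX=jdk, so version 21 resolves to
    // jdk-21.tar.gz even though the runtime directory is still named "java".
    func artifactName(prefix, version, ext string) string {
        return fmt.Sprintf("%s-%s.%s", prefix, version, ext)
    }

    func main() {
        fmt.Println(artifactName("jdk", "21", "tar.gz"))  // jdk-21.tar.gz
        fmt.Println(artifactName("node", "24", "tar.xz")) // node-24.tar.xz
    }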

Final conclusions

  • No regression in installer logic; failures were execution-order and environment-projection issues.
  • Correct fix is agent-side:
    • concatenate common.sh + runtime installer into one bash invocation
    • inject RUNTIME_VERSION (and related vars) into environment
  • Architecture now supports deterministic, artifact-driven, embedded-safe installs.
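
A minimal Go sketch of the agent-side fix described above; the helper name and the embedded-script stand-ins are hypothetical:

    package main

    import (
        "os"
        "os/exec"
        "strings"
    )

    // Stand-ins for the embedded installer sources; the real scripts live
    // inside the agent binary, not on the filesystem.
    const commonSH = `install_runtime() { echo "installing ${ARCHIVE_PREFIX}-${RUNTIME_VERSION}"; }`
    const nodeSH = `install_runtime`

    // runInstaller concatenates common.sh with the runtime descriptor so both
    // execute in one shell session (avoiding "install_runtime: command not
    // found") and projects intent through environment variables.
    func runInstaller(common, runtime string, env map[string]string) error {
        cmd := exec.Command("bash", "-s")
        cmd.Stdin = strings.NewReader(common + "\n" + runtime + "\n")
        cmd.Env = os.Environ()
        for k, v := range env {
            cmd.Env = append(cmd.Env, k+"="+v)
        }
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        return cmd.Run()
    }

    func main() {
        // RUNTIME_VERSION is required; installers fail fast without it.
        _ = runInstaller(commonSH, nodeSH, map[string]string{
            "RUNTIME_VERSION": "24",
            "ARCHIVE_PREFIX":  "node",
        })
    }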

Status: Root cause resolved; implementation pending agent patch & installer updates.


2025-12-26

  • Goal: Enable SSH access to dev containers via bastion for external users.
  • Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

Dev container SSH (internal success)

  • Root cause of initial SSH failures: missing SSH host keys in unprivileged LXC containers.
  • Fixed by ensuring host keys exist before sshd starts.
  • Verified internal SSH access works from WireGuard/LAN:
    ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
    
  • Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

Agent SSH provisioning requirements identified

  • Agent must automate SSH setup for new containers:
    • Install and enable sshd
    • Generate SSH host keys if missing (add to common.sh or bootstrap)
    • Create devuser with sudo access
    • Configure authorized_keys for key-based auth
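
A rough Go sketch of those four steps; the username, service name, and how the agent executes inside the container (pct exec, SSH, etc.) are assumptions:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "path/filepath"
    )

    // run executes one provisioning step; the transport into the container
    // is elided here.
    func run(name string, args ...string) error {
        cmd := exec.Command(name, args...)
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        return cmd.Run()
    }

    // provisionSSH mirrors the four requirements listed above.
    func provisionSSH(pubKey string) error {
        if err := run("ssh-keygen", "-A"); err != nil { // generate any missing host keys
            return err
        }
        if err := run("useradd", "-m", "-s", "/bin/bash", "-G", "sudo", "devuser"); err != nil {
            return err
        }
        dir := "/home/devuser/.ssh"
        if err := os.MkdirAll(dir, 0700); err != nil {
            return err
        }
        // A real implementation would also chown dir to devuser.
        if err := os.WriteFile(filepath.Join(dir, "authorized_keys"), []byte(pubKey+"\n"), 0600); err != nil {
            return err
        }
        // Service name varies by distro (sshd vs ssh).
        return run("systemctl", "enable", "--now", "sshd")
    }

    func main() {
        if err := provisionSSH("ssh-ed25519 AAAA... user@host"); err != nil {
            fmt.Fprintln(os.Stderr, "provision failed:", err)
        }
    }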

zlh-cli progress

  • Built successfully in Go, deployed to bastion at /usr/local/bin/zlh.
  • Known bugs when running ON bastion:
    • Incorrectly attempts to jump via zlh-bastion.zerolaghub.dev (should use localhost/direct)
    • User/host targeting logic needs fixes (was targeting bastion instead of dev container)
  • Goal: reduce ssh -J complexity to simple zlh ssh 6038 command.
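
A hedged sketch of the intended targeting fix; the bastion-detection heuristic and hostname patterns are assumptions inferred from the entries above:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    const bastion = "zlh@zlh-bastion.zerolaghub.dev"

    // sshArgs builds the ssh invocation for `zlh ssh <vmid>`. When the CLI
    // already runs on the bastion it must connect directly instead of
    // jumping through itself.
    func sshArgs(vmid string, onBastion bool) []string {
        target := fmt.Sprintf("root@dev-%s.zerolaghub.dev", vmid)
        if onBastion {
            return []string{target}
        }
        return []string{"-J", bastion, target}
    }

    func main() {
        hostname, _ := os.Hostname()
        onBastion := hostname == "zlh-bastion" // detection heuristic, assumed
        cmd := exec.Command("ssh", sshArgs("6038", onBastion)...)
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        _ = cmd.Run()
    }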

Bastion public SSH blocker (ACTIVE)

  • Critical issue: Public SSH to bastion fails immediately.
  • Symptoms:
    • TCP connection succeeds (SYN/ACK completes)
    • SSH handshake never proceeds: kex_exchange_identification: Connection closed by remote host
    • Not an authentication failure: connection closes before banner exchange
  • Verified:
    • Cloudflare DNS: zlh-bastion.zerolaghub.dev → 139.64.165.248 (DNS-only)
    • OPNsense NAT rule forwards :22 to 10.100.0.48
    • Internal SSH to bastion works perfectly
    • Fail2ban shows no blocks
    • Tried disabling OPNsense SSH, changing ports - same failure
  • This is NOT the container host-key issue (that was internal and fixed).

Next debug steps for bastion blocker

  1. tcpdump on bastion during external connection attempt
  2. OPNsense live firewall log during external attempt
  3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
  4. Check for ISP/modem interference or hairpin NAT issues

Status: Dev container SSH working internally; bastion public access blocked at network layer.


2025-12-28 — APIv2 Auth + Portal Alignment Session

Work Completed

  • APIv2 auth route verified functional (JWT-based)
  • bcrypt password verification confirmed
  • /api/instances endpoint verified working without auth
  • Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
  • Confirmed no CSRF or cookie-based auth required (stateless JWT)
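
The API's implementation details aren't recorded here; as a sketch of the stateless flow, assuming Go with golang-jwt and x/crypto/bcrypt for illustration:

    package main

    import (
        "fmt"
        "time"

        "github.com/golang-jwt/jwt/v5"
        "golang.org/x/crypto/bcrypt"
    )

    var secret = []byte("replace-me") // HMAC signing key, illustrative only

    // login verifies the stored bcrypt hash and issues a stateless JWT. No
    // session or cookie state is kept server-side, which is why no CSRF
    // protection is needed.
    func login(hash []byte, password string) (string, error) {
        if err := bcrypt.CompareHashAndPassword(hash, []byte(password)); err != nil {
            return "", err
        }
        claims := jwt.MapClaims{
            "sub": "user-id",
            "exp": time.Now().Add(time.Hour).Unix(),
        }
        return jwt.NewWithClaims(jwt.SigningMethodHS256, claims).SignedString(secret)
    }

    func main() {
        hash, _ := bcrypt.GenerateFromPassword([]byte("hunter2"), bcrypt.DefaultCost)
        tok, err := login(hash, "hunter2")
        fmt.Println(tok, err)
    }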

Key Findings

  • Portal still contains APIv1 / Pterodactyl assumptions
  • zlh-grind is documentation + constraint repo only (no code)
  • Instances endpoint behavior was correct; earlier failures were route misuse

Decisions

  • APIv2 auth will remain stateless (JWT only)
  • No CSRF protection will be implemented
  • Portal must fully remove APIv1 and Pterodactyl patterns

Next Actions

  • Enforce requireAuth selectively in APIv2
  • Update portal login to match APIv2 contract
  • Track portal migration progress in OPEN_THREADS

2026-01 — Portal UX & Control Plane Alignment

Completed

  • Finalized Dashboard as awareness-only surface
  • Implemented System Health indicator (non-technical)
  • Reworked Notices UX with expansion and scrolling
  • Removed all bulk lifecycle controls
  • Defined Servers page grouping by type
  • Introduced expandable server cards
  • Selected "System View" as escalation action
  • Explicitly avoided AWS-style UI patterns

Key Decisions

  • Agent is authoritative for runtime state
  • API v2 brokers access and permissions only
  • Console / runtime access is observational first
  • Control surfaces are separated from awareness surfaces

Outcome

Portal is no longer Pterodactyl-shaped. It now supports mixed workloads cleanly.


2026-01-11 — Live Status & Agent Integration

Summary

Finalized live server status architecture between:

  • ZPack API
  • zlh-agent
  • Redis
  • Portal UI

This session resolved incorrect / misleading server states and clarified the separation between host health, agent availability, and service runtime.

Key Decisions

1. Status Sources Are Explicit

  • Agent /health
    • Authoritative for host / container availability
    • Used for DEV containers and host presence
  • Agent /status
    • Authoritative for game server runtime only
    • Idle ≠ Offline (idle means server stopped but host up)

2. DEV vs GAME Semantics

  • DEV containers
    • No runtime process to track
    • Status = running if host online
  • GAME containers
    • Runtime state controlled by agent
    • Status reflects actual game process
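
These two rules reduce to a small derivation; a sketch with assumed type and state names:

    package main

    import "fmt"

    type Kind int

    const (
        Dev Kind = iota
        Game
    )

    // deriveStatus folds agent /health and /status into one display state.
    // DEV containers have no runtime process, so host-up means running.
    // GAME containers with a live host but stopped process are Idle, not Offline.
    func deriveStatus(kind Kind, hostUp, procRunning bool) string {
        if !hostUp {
            return "offline"
        }
        if kind == Dev || procRunning {
            return "running"
        }
        return "idle"
    }

    func main() {
        fmt.Println(deriveStatus(Game, true, false)) // idle
        fmt.Println(deriveStatus(Dev, true, false))  // running
    }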

3. Redis Is the Live State Cache

  • Agent poller runs every 5s
  • Redis TTL = 15s
  • UI reads Redis, not agents directly
  • Eventual consistency is acceptable

4. No Push Model (Yet)

  • Agent does not push status
  • API polls agents
  • WebSocket / push deferred until console phase
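
A minimal sketch of the poll-and-cache loop from decisions 3 and 4, assuming go-redis and a hypothetical fetchStatus helper:

    package main

    import (
        "context"
        "time"

        "github.com/redis/go-redis/v9"
    )

    // fetchStatus polls one agent's /status endpoint; implementation elided.
    func fetchStatus(vmid string) (string, error) { return "running", nil }

    func main() {
        ctx := context.Background()
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
        vmids := []string{"6038"}

        // Poll every 5s; cache with a 15s TTL so stale entries expire on
        // their own if an agent stops answering. The UI reads only Redis.
        for range time.Tick(5 * time.Second) {
            for _, id := range vmids {
                status, err := fetchStatus(id)
                if err != nil {
                    continue // key expires naturally → shows as offline
                }
                rdb.Set(ctx, "status:"+id, status, 15*time.Second)
            }
        }
    }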

Result

  • Portal now reflects correct, real-world state
  • "Offline / Idle / Running" distinctions are accurate
  • Architecture aligns with Pterodactyl-style behavior

Next Focus

  • Console (read-only output first)

2026-01-14 → 2026-01-18 — Agent PTY Console Breakthrough

Objective

Replace log-only console streaming with a true interactive PTY-backed console supporting:

  • Dev server shell access
  • Game server (Minecraft) live console
  • Full duplex WebSocket I/O
  • Session persistence across reconnects

What Changed

  • Agent console migrated from log tailing → PTY-backed processes
  • PTY ownership moved to agent (not frontend, not Proxmox)
  • WebSocket implementation replaced with gorilla/websocket
  • Enforced single writer per WS connection
  • Introduced console session manager keyed by {vmid, container_type}
  • Sessions survive WS disconnects (dev servers only, TTL = 60s)
  • PTY not killed on client close
  • Added keepalive frames (5s) to prevent browser timeouts
  • Added detailed WS + PTY logging for attach, read, write, close
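
A component-level sketch of the single-writer and keepalive rules with gorilla/websocket; type names are hypothetical and session wiring is elided:

    package console

    import (
        "sync"
        "time"

        "github.com/gorilla/websocket"
    )

    // wsWriter serializes all writes to one connection: gorilla/websocket
    // allows at most one concurrent writer per conn, so PTY output and
    // keepalives both go through this guard.
    type wsWriter struct {
        mu   sync.Mutex
        conn *websocket.Conn
    }

    func (w *wsWriter) Write(p []byte) (int, error) {
        w.mu.Lock()
        defer w.mu.Unlock()
        if err := w.conn.WriteMessage(websocket.BinaryMessage, p); err != nil {
            return 0, err
        }
        return len(p), nil
    }

    // keepalive pings every 5s so browsers and proxies don't drop the WS.
    // (gorilla permits concurrent WriteControl, but sharing the guard keeps
    // the single-writer invariant simple to reason about.)
    func (w *wsWriter) keepalive(done <-chan struct{}) {
        t := time.NewTicker(5 * time.Second)
        defer t.Stop()
        for {
            select {
            case <-t.C:
                w.mu.Lock()
                _ = w.conn.WriteControl(websocket.PingMessage, nil,
                    time.Now().Add(2*time.Second))
                w.mu.Unlock()
            case <-done:
                return
            }
        }
    }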

Architecture Summary

  • Dev containers attach to /bin/bash (fallback /bin/sh)
  • Game containers attach to Minecraft server PTY
  • WS input → PTY stdin
  • PTY stdout → WS broadcast
  • One PTY reader loop per session
  • Multiple WS connections can attach/detach safely
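
A sketch of the session manager's core pieces using creack/pty; names are hypothetical, and the 60s TTL reaper, WS attach/detach, and stdin path are elided:

    package console

    import (
        "os"
        "os/exec"
        "sync"

        "github.com/creack/pty"
    )

    // Key matches the session-manager keying above: {vmid, container_type}.
    type Key struct {
        VMID int
        Type string // "dev" or "game"
    }

    // Session owns one PTY plus its attached WebSocket writers. The PTY is
    // not killed when the last client detaches; dev sessions are reaped by
    // a separate 60s TTL sweep (not shown).
    type Session struct {
        mu      sync.Mutex
        pty     *os.File
        writers map[chan []byte]struct{}
    }

    // start launches /bin/bash on a fresh PTY (dev containers; game sessions
    // would wrap the Minecraft server process instead).
    func start() (*Session, error) {
        f, err := pty.Start(exec.Command("/bin/bash"))
        if err != nil {
            // fallback for minimal images
            if f, err = pty.Start(exec.Command("/bin/sh")); err != nil {
                return nil, err
            }
        }
        s := &Session{pty: f, writers: map[chan []byte]struct{}{}}
        go s.readLoop() // exactly one PTY reader loop per session
        return s, nil
    }

    // readLoop is the single reader: PTY stdout fans out to every attached WS.
    func (s *Session) readLoop() {
        buf := make([]byte, 4096)
        for {
            n, err := s.pty.Read(buf)
            if err != nil {
                return
            }
            out := append([]byte(nil), buf[:n]...)
            s.mu.Lock()
            for ch := range s.writers {
                select {
                case ch <- out:
                default: // drop for a slow consumer rather than block the loop
                }
            }
            s.mu.Unlock()
        }
    }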

Final Status

  • Interactive console confirmed working
  • Commands execute correctly
  • Output streamed live
  • Reconnect behavior stable

This closes the original "console v2" milestone.