# Session Log — zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.

---

## 2025-12-20
- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was installer regression; investigation showed installers were enforcing contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably `RUNTIME_VERSION`).

---

## 2025-12-21
- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require `RUNTIME_VERSION` and fail fast if missing.
- Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.

### Embedded installer execution findings
- Agent executes installers as **embedded scripts**, not filesystem paths.
- Identified critical requirement: shared installer logic (`common.sh`) and runtime installer must execute in the **same shell session**.
- Failure mode observed: `install_runtime: command not found` → caused by running runtime installer without `common.sh` loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.

### Installer architecture changes
- Refactored installer model to:
  - `common.sh`: shared, strict, embedded-safe installation logic
  - per-runtime installers (`node`, `python`, `go`, `java`) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves existing runtime layout:
  `/opt/zlh/runtime/<runtime>/<version>/current`

### Artifact layout update
- Artifact naming and layout changed to simplified form:
  - `node-24.tar.xz`
  - `python-3.12.tar.xz`
  - `go-1.22.tar.gz`
  - `jdk-21.tar.gz`
- Identified mismatch between runtime name and archive prefix (notably Java: runtime `java`, archive `jdk-21.tar.gz`).
- Introduced `ARCHIVE_PREFIX` as a runtime-level variable to resolve naming cleanly.

### Final conclusions
- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side (see the sketch below):
  - concatenate `common.sh` + runtime installer into one bash invocation
  - inject `RUNTIME_VERSION` (and related vars) into environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.

Status: **Root cause resolved; implementation pending agent patch & installer updates.**
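
The agent-side fix lends itself to a short sketch. Below is a minimal illustration in Go, assuming the agent shells the embedded script out to bash; the function name, the `RUNTIME_NAME` variable, and reading the script over stdin are assumptions, not the real agent code.

```go
// Illustrative sketch only: concatenate common.sh + the runtime installer so
// install_runtime is defined in the same shell session, and project intent
// through environment variables instead of payload JSON. Names are assumed.
package installer

import (
	"os"
	"os/exec"
	"strings"
)

func runEmbeddedInstaller(commonSH, runtimeSH, runtime, version, archivePrefix string) error {
	script := commonSH + "\n" + runtimeSH // one bash invocation, shared functions in scope

	cmd := exec.Command("/bin/bash", "-s") // read the embedded script from stdin
	cmd.Stdin = strings.NewReader(script)
	cmd.Env = append(os.Environ(),
		"RUNTIME_NAME="+runtime,         // assumed variable name
		"RUNTIME_VERSION="+version,      // installers fail fast without this
		"ARCHIVE_PREFIX="+archivePrefix, // e.g. "jdk" when the runtime is "java"
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```

With the artifact layout above, a Java install would run with runtime `java`, version `21`, and archive prefix `jdk`.
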
---

## 2025-12-26
- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

### Dev container SSH (internal success)
- Root cause of initial SSH failures: **missing SSH host keys** in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts.
- Verified internal SSH access works from WireGuard/LAN:
  ```
  ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
  ```
- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

### Agent SSH provisioning requirements identified
- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to `common.sh` or bootstrap)
  - Create `devuser` with sudo access
  - Configure authorized_keys for key-based auth

### zlh-cli progress
- Built successfully in Go, deployed to bastion at `/usr/local/bin/zlh`.
- **Known bugs when running ON bastion**:
  - Incorrectly attempts to jump via `zlh-bastion.zerolaghub.dev` (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of dev container)
- Goal: reduce `ssh -J` complexity to simple `zlh ssh 6038` command (see the sketch below).
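
A rough sketch of the intended `zlh ssh <vmid>` flow, covering the on-bastion case that is currently mis-targeted; the hostname-based bastion detection and the `dev-<vmid>.zerolaghub.dev` target pattern are assumptions for illustration, not the shipped CLI.

```go
// Illustrative sketch of `zlh ssh 6038`: connect directly when already on the
// bastion, otherwise use the bastion as a jump host. The hostname check and
// target naming pattern are assumptions for illustration only.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const jumpHost = "zlh@10.100.0.48" // bastion, as used in the manual ssh -J form

func sshArgs(vmid string, onBastion bool) []string {
	target := fmt.Sprintf("root@dev-%s.zerolaghub.dev", vmid)
	if onBastion {
		return []string{target} // direct: never loop back through the bastion
	}
	return []string{"-J", jumpHost, target}
}

func main() {
	if len(os.Args) < 3 || os.Args[1] != "ssh" {
		fmt.Fprintln(os.Stderr, "usage: zlh ssh <vmid>")
		os.Exit(1)
	}
	hostname, _ := os.Hostname()
	onBastion := hostname == "zlh-bastion" // assumed hostname; real detection may differ

	cmd := exec.Command("ssh", sshArgs(os.Args[2], onBastion)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```
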
### Bastion public SSH blocker (ACTIVE)
- **Critical issue**: Public SSH to bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: `kex_exchange_identification: Connection closed by remote host`
  - Not an authentication failure - connection closes before banner exchange
- Verified:
  - Cloudflare DNS: `zlh-bastion.zerolaghub.dev` → `139.64.165.248` (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH, changing ports - same failure
- **This is NOT the container host-key issue** (that was internal and fixed).

### Next debug steps for bastion blocker
1. tcpdump on bastion during external connection attempt
2. OPNsense live firewall log during external attempt
3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
4. Check for ISP/modem interference or hairpin NAT issues

Status: **Dev container SSH working internally; bastion public access blocked at network layer.**

---

## 2025-12-28 — APIv2 Auth + Portal Alignment Session

### Work Completed
- APIv2 auth route verified functional (JWT-based)
- bcrypt password verification confirmed
- `/api/instances` endpoint verified working without auth
- Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
- Confirmed no CSRF or cookie-based auth required (stateless JWT)

### Key Findings
- Portal still contains APIv1 / Pterodactyl assumptions
- `zlh-grind` is documentation + constraint repo only (no code)
- Instances endpoint behavior was correct; earlier failures were route misuse

### Decisions
- APIv2 auth will remain stateless (JWT only)
- No CSRF protection will be implemented
- Portal must fully remove APIv1 and Pterodactyl patterns

### Next Actions
- Enforce `requireAuth` selectively in APIv2
- Update portal login to match APIv2 contract
- Track portal migration progress in OPEN_THREADS

---

## Jan 2026 — Portal UX & Control Plane Alignment

### Completed
- Finalized Dashboard as awareness-only surface
- Implemented System Health indicator (non-technical)
- Reworked Notices UX with expansion and scrolling
- Removed all bulk lifecycle controls
- Defined Servers page grouping by type
- Introduced expandable server cards
- Selected "System View" as escalation action
- Explicitly avoided AWS-style UI patterns

### Key Decisions
- Agent is authoritative for runtime state
- API v2 brokers access and permissions only
- Console / runtime access is observational first
- Control surfaces are separated from awareness surfaces

### Outcome
Portal is no longer Pterodactyl-shaped.
It now supports mixed workloads cleanly.

---

## 2026-01-11 — Live Status & Agent Integration

### Summary
Finalized live server status architecture between:
- ZPack API
- zlh-agent
- Redis
- Portal UI

This session resolved incorrect / misleading server states and clarified the separation between **host health**, **agent availability**, and **service runtime**.

### Key Decisions

#### 1. Status Sources Are Explicit
- **Agent `/health`**
  - Authoritative for *host / container availability*
  - Used for DEV containers and host presence
- **Agent `/status`**
  - Authoritative for *game server runtime only*
  - Idle ≠ Offline (idle means server stopped but host up)

#### 2. DEV vs GAME Semantics
- **DEV containers**
  - No runtime process to track
  - Status = `running` if host online
- **GAME containers**
  - Runtime state controlled by agent
  - Status reflects actual game process

#### 3. Redis Is the Live State Cache
- Agent poller runs every 5s
- Redis TTL = 15s
- UI reads Redis, not agents directly (see the sketch below)
- Eventual consistency is acceptable
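
A minimal sketch of that polling loop under the numbers above (5 s poll interval, 15 s TTL, one key per vmid); the key layout and the go-redis client choice are assumptions.

```go
// Illustrative poller: every 5s, fetch status from each agent and cache it in
// Redis with a 15s TTL so the UI reads Redis, never the agents directly.
// Key layout and client library (go-redis) are assumptions.
package statuscache

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func pollLoop(ctx context.Context, rdb *redis.Client, vmids []string, fetch func(vmid string) (string, error)) {
	ticker := time.NewTicker(5 * time.Second) // agent poll interval
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, vmid := range vmids {
				status, err := fetch(vmid) // e.g. call the agent's /status or /health endpoint
				if err != nil {
					continue // missed poll: the key simply expires after 15s
				}
				rdb.Set(ctx, fmt.Sprintf("status:%s", vmid), status, 15*time.Second) // TTL = 15s
			}
		}
	}
}
```
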
#### 4. No Push Model (Yet)
- Agent does not push status
- API polls agents
- WebSocket / push deferred until console phase

### Result
- Portal now reflects correct, real-world state
- "Offline / Idle / Running" distinctions are accurate
- Architecture aligns with Pterodactyl-style behavior

### Next Focus
- Console (read-only output first)

---

## 2026-01-14 → 2026-01-18 — Agent PTY Console Breakthrough

### Objective
Replace log-only console streaming with a **true interactive PTY-backed console** supporting:
- Dev server shell access
- Game server (Minecraft) live console
- Full duplex WebSocket I/O
- Session persistence across reconnects

### What Changed
- Agent console migrated from log tailing → **PTY-backed processes**
- PTY ownership moved to agent (not frontend, not Proxmox)
- WebSocket implementation replaced with `gorilla/websocket`
- Enforced **single writer per WS connection**
- Introduced **console session manager** keyed by `{vmid, container_type}`
- Sessions survive WS disconnects (dev servers only, TTL = 60s)
- PTY not killed on client close
- Added keepalive frames (5s) to prevent browser timeouts
- Added detailed WS + PTY logging for attach, read, write, close

### Architecture Summary
- Dev containers attach to `/bin/bash` (fallback `/bin/sh`)
- Game containers attach to Minecraft server PTY
- WS input → PTY stdin
- PTY stdout → WS broadcast
- One PTY reader loop per session
- Multiple WS connections can attach/detach safely (see the sketch at the end of this entry)

### Final Status
✅ Interactive console confirmed working
✅ Commands execute correctly
✅ Output streamed live
✅ Reconnect behavior stable

This closes the original "console v2" milestone.
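
For reference, a minimal sketch of the session keying, single-writer rule, and 5 s keepalive described in this entry; the struct names and the channel-per-client fan-out are assumptions rather than the agent's actual implementation.

```go
// Illustrative sketch: sessions keyed by {vmid, container_type}, one writer
// goroutine per WebSocket connection, 5s keepalive pings, PTY output fanned
// out to every attached client. Names and structure are assumed.
package console

import (
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

type SessionKey struct {
	VMID          int
	ContainerType string // "dev" or "game"
}

type Session struct {
	mu      sync.Mutex
	clients map[*websocket.Conn]chan []byte // one outbound channel per connection
}

type Manager struct {
	mu       sync.Mutex
	sessions map[SessionKey]*Session // dev sessions linger 60s after the last detach
}

// Attach registers a connection and starts its single writer goroutine, which
// also sends a keepalive ping every 5s to prevent browser timeouts.
func (s *Session) Attach(conn *websocket.Conn) chan []byte {
	out := make(chan []byte, 64)
	s.mu.Lock()
	s.clients[conn] = out
	s.mu.Unlock()

	go func() {
		ticker := time.NewTicker(5 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case msg, ok := <-out:
				if !ok {
					return
				}
				if err := conn.WriteMessage(websocket.BinaryMessage, msg); err != nil {
					return
				}
			case <-ticker.C:
				if err := conn.WriteMessage(websocket.PingMessage, nil); err != nil {
					return
				}
			}
		}
	}()
	return out
}

// Broadcast fans PTY output out to all attached clients; the PTY reader loop
// never writes to a connection directly, preserving the single-writer rule.
func (s *Session) Broadcast(data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, out := range s.clients {
		select {
		case out <- data:
		default: // drop for a slow client rather than blocking the PTY reader
		}
	}
}
```

Under this shape, a dev session whose last client detaches would be reaped by the manager only after the 60 s TTL, while a game session stays bound to the running server's PTY.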