# Session Log — zlh-grind

Append-only execution log for GPT-assisted development work. Do not rewrite or reorder past entries.

---

## 2025-12-20

- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was an installer regression; investigation showed the installers were enforcing their contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably `RUNTIME_VERSION`).

---

## 2025-12-21

- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require `RUNTIME_VERSION` and fail fast if missing.
- Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.

### Embedded installer execution findings

- Agent executes installers as **embedded scripts**, not filesystem paths.
- Identified critical requirement: shared installer logic (`common.sh`) and the runtime installer must execute in the **same shell session**.
- Failure mode observed: `install_runtime: command not found` → caused by running the runtime installer without `common.sh` loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.

### Installer architecture changes

- Refactored installer model to:
  - `common.sh`: shared, strict, embedded-safe installation logic
  - per-runtime installers (`node`, `python`, `go`, `java`) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves the existing runtime layout: `/opt/zlh/runtime/<runtime>/<version>/current`

### Artifact layout update

- Artifact naming and layout changed to a simplified form:
  - `node-24.tar.xz`
  - `python-3.12.tar.xz`
  - `go-1.22.tar.gz`
  - `jdk-21.tar.gz`
- Identified a mismatch between runtime name and archive prefix (notably Java).
- Introduced `ARCHIVE_PREFIX` as a runtime-level variable to resolve naming cleanly.

### Final conclusions

- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side:
  - concatenate `common.sh` + runtime installer into one bash invocation
  - inject `RUNTIME_VERSION` (and related vars) into the environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.

Status: **Root cause resolved; implementation pending agent patch & installer updates.**

---

## 2025-12-26

- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

### Dev container SSH (internal success)

- Root cause of initial SSH failures: **missing SSH host keys** in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts (sketched below).
- Verified internal SSH access works from WireGuard/LAN:

  ```
  ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
  ```

- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).
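A minimal sketch of that host-key fix, as the agent-side bootstrap might express it in Go: generate any missing keys with `ssh-keygen -A` before starting sshd. `ensureSSHReady` is a hypothetical helper, the service name varies by distro (`sshd` vs `ssh`), and how the agent executes this inside the container (e.g. via `pct exec`) is omitted.

```go
// hostkeys.go — sketch of the fix above: make sure host keys exist BEFORE
// sshd starts. In an unprivileged LXC with no host keys, sshd drops every
// connection, which is exactly the failure this session diagnosed.
package main

import (
	"fmt"
	"os/exec"
)

// ensureSSHReady generates missing host keys (ssh-keygen -A is a no-op for
// keys that already exist), then enables and starts sshd. The ordering of
// the two steps is the whole fix.
func ensureSSHReady() error {
	steps := [][]string{
		{"ssh-keygen", "-A"},                     // create missing host keys in /etc/ssh
		{"systemctl", "enable", "--now", "sshd"}, // start sshd only after keys exist
	}
	for _, step := range steps {
		if out, err := exec.Command(step[0], step[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v: %w: %s", step, err, out)
		}
	}
	return nil
}

func main() {
	if err := ensureSSHReady(); err != nil {
		panic(err)
	}
}
```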
### Agent SSH provisioning requirements identified

- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to `common.sh` or bootstrap)
  - Create `devuser` with sudo access
  - Configure `authorized_keys` for key-based auth

### zlh-cli progress

- Built successfully in Go, deployed to bastion at `/usr/local/bin/zlh`.
- **Known bugs when running ON bastion**:
  - Incorrectly attempts to jump via `zlh-bastion.zerolaghub.dev` (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of the dev container)
- Goal: reduce `ssh -J` complexity to a simple `zlh ssh 6038` command.

### Bastion public SSH blocker (ACTIVE)

- **Critical issue**: Public SSH to bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: `kex_exchange_identification: Connection closed by remote host`
  - Not an authentication failure; connection closes before banner exchange
- Verified:
  - Cloudflare DNS: `zlh-bastion.zerolaghub.dev` → `139.64.165.248` (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH and changing ports; same failure
- **This is NOT the container host-key issue** (that was internal and fixed).

### Next debug steps for bastion blocker

1. tcpdump on bastion during an external connection attempt
2. OPNsense live firewall log during an external attempt
3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
4. Check for ISP/modem interference or hairpin NAT issues

Status: **Dev container SSH working internally; bastion public access blocked at network layer.**

---

## 2025-12-28 — APIv2 Auth + Portal Alignment Session

### Work Completed

- APIv2 auth route verified functional (JWT-based)
- bcrypt password verification confirmed
- `/api/instances` endpoint verified working without auth
- Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
- Confirmed no CSRF or cookie-based auth required (stateless JWT)

### Key Findings

- Portal still contains APIv1 / Pterodactyl assumptions
- `zlh-grind` is a documentation + constraint repo only (no code)
- Instances endpoint behavior was correct; earlier failures were route misuse

### Decisions

- APIv2 auth will remain stateless (JWT only)
- No CSRF protection will be implemented
- Portal must fully remove APIv1 and Pterodactyl patterns

### Next Actions

- Enforce `requireAuth` selectively in APIv2
- Update portal login to match the APIv2 contract
- Track portal migration progress in OPEN_THREADS

---

## Jan 2026 — Portal UX & Control Plane Alignment

### Completed

- Finalized Dashboard as an awareness-only surface
- Implemented System Health indicator (non-technical)
- Reworked Notices UX with expansion and scrolling
- Removed all bulk lifecycle controls
- Defined Servers page grouping by type
- Introduced expandable server cards
- Selected "System View" as the escalation action
- Explicitly avoided AWS-style UI patterns

### Key Decisions

- Agent is authoritative for runtime state
- API v2 brokers access and permissions only
- Console / runtime access is observational first
- Control surfaces are separated from awareness surfaces

### Outcome

Portal is no longer Pterodactyl-shaped. It now supports mixed workloads cleanly.

---

## 2026-01-11 — Live Status & Agent Integration

### Summary

Finalized live server status architecture between:

- ZPack API
- zlh-agent
- Redis
- Portal UI

This session resolved incorrect / misleading server states and clarified the separation between **host health**, **agent availability**, and **service runtime**.

### Key Decisions

#### 1. Status Sources Are Explicit

- **Agent `/health`**
  - Authoritative for *host / container availability*
  - Used for DEV containers and host presence
- **Agent `/status`**
  - Authoritative for *game server runtime only*
  - Idle ≠ Offline (idle means server stopped but host up)

#### 2. DEV vs GAME Semantics

- **DEV containers**
  - No runtime process to track
  - Status = `running` if host online
- **GAME containers**
  - Runtime state controlled by agent
  - Status reflects the actual game process

#### 3. Redis Is the Live State Cache

- Agent poller runs every 5s (sketched below)
- Redis TTL = 15s
- UI reads Redis, not agents directly
- Eventual consistency is acceptable
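A minimal sketch of that poll-and-cache loop, assuming the `go-redis` client (the log does not name one); the `status:<vmid>` key format and the `fetchStatus` stub are hypothetical stand-ins for the real `/health` and `/status` calls.

```go
// statuspoller.go — sketch of the poll→cache flow: poll agents on a 5s tick,
// write each state into Redis with a 15s TTL, and let the UI read Redis only.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	pollInterval = 5 * time.Second  // agent poller cadence
	statusTTL    = 15 * time.Second // stale entries expire on their own
)

// fetchStatus stands in for calling the agent: /health for DEV containers
// (host up ⇒ "running"), /status for GAME containers (actual process state).
func fetchStatus(ctx context.Context, vmid int) (string, error) {
	return "running", nil // stubbed
}

func pollLoop(ctx context.Context, rdb *redis.Client, vmids []int) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, vmid := range vmids {
				state, err := fetchStatus(ctx, vmid)
				if err != nil {
					continue // agent unreachable: the TTL will age the key out
				}
				// The UI reads these keys instead of hitting agents directly.
				rdb.Set(ctx, fmt.Sprintf("status:%d", vmid), state, statusTTL)
			}
		}
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	pollLoop(context.Background(), rdb, []int{6038}) // e.g. dev-6038
}
```

Letting the TTL do the expiry (rather than explicit deletes) is what makes a dead agent degrade to "unknown" instead of freezing at its last reported state.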
#### 4. No Push Model (Yet)

- Agent does not push status
- API polls agents
- WebSocket / push deferred until the console phase

### Result

- Portal now reflects correct, real-world state
- "Offline / Idle / Running" distinctions are accurate
- Architecture aligns with Pterodactyl-style behavior

### Next Focus

- Console (read-only output first)

---

## 2026-01-14 → 2026-01-18 — Agent PTY Console Breakthrough

### Objective

Replace log-only console streaming with a **true interactive PTY-backed console** supporting:

- Dev server shell access
- Game server (Minecraft) live console
- Full duplex WebSocket I/O
- Session persistence across reconnects

### What Changed

- Agent console migrated from log tailing → **PTY-backed processes**
- PTY ownership moved to agent (not frontend, not Proxmox)
- WebSocket implementation replaced with `gorilla/websocket`
- Enforced **single writer per WS connection**
- Introduced **console session manager** keyed by `{vmid, container_type}` (see the sketch at the end of this entry)
- Sessions survive WS disconnects (dev servers only, TTL = 60s)
- PTY not killed on client close
- Added keepalive frames (5s) to prevent browser timeouts
- Added detailed WS + PTY logging for attach, read, write, close

### Architecture Summary

- Dev containers attach to `/bin/bash` (fallback `/bin/sh`)
- Game containers attach to the Minecraft server PTY
- WS input → PTY stdin
- PTY stdout → WS broadcast
- One PTY reader loop per session
- Multiple WS connections can attach/detach safely

### Final Status

✅ Interactive console confirmed working
✅ Commands execute correctly
✅ Output streamed live
✅ Reconnect behavior stable

This closes the original "console v2" milestone.
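To pin the model down concretely, a condensed Go sketch of the session manager: `gorilla/websocket` comes from this log, but `github.com/creack/pty` and every type and name below are illustrative assumptions, not the agent's actual code. Session reaping (the 60s dev TTL) and the `/bin/sh` fallback are omitted for brevity.

```go
// Package console sketches the PTY session model: one PTY reader loop per
// session, fan-out to any number of attached WebSocket clients, and exactly
// one goroutine writing to each websocket.Conn (the "single writer" rule).
package console

import (
	"os"
	"os/exec"
	"sync"
	"time"

	"github.com/creack/pty"
	"github.com/gorilla/websocket"
)

// Sessions are keyed by {vmid, container_type}; the PTY outlives any single
// WS connection, which is what makes reconnects seamless.
type key struct {
	VMID          int
	ContainerType string // "dev" or "game"
}

type session struct {
	mu      sync.Mutex
	ptmx    *os.File                        // PTY master: Write → stdin, Read → stdout
	clients map[*websocket.Conn]chan []byte // one outbound queue per attached WS
}

var sessions = map[key]*session{} // manager; locking + 60s TTL reaping omitted

// start launches a shell (dev containers use /bin/bash) on a fresh PTY and
// begins the single reader loop that broadcasts output.
func start() (*session, error) {
	ptmx, err := pty.Start(exec.Command("/bin/bash"))
	if err != nil {
		return nil, err
	}
	s := &session{ptmx: ptmx, clients: map[*websocket.Conn]chan []byte{}}
	go s.readLoop()
	return s, nil
}

// readLoop is the ONE reader of the PTY. It never writes to a websocket
// directly; it only feeds per-client channels, so each connection keeps a
// single writer goroutine.
func (s *session) readLoop() {
	buf := make([]byte, 4096)
	for {
		n, err := s.ptmx.Read(buf)
		if err != nil {
			return // PTY closed (process exited), not a client disconnect
		}
		out := append([]byte(nil), buf[:n]...)
		s.mu.Lock()
		for _, ch := range s.clients {
			select {
			case ch <- out:
			default: // drop output for a slow client rather than stall the PTY
			}
		}
		s.mu.Unlock()
	}
}

// attach wires a WS connection into the session: one goroutine drains the
// outbound queue and sends 5s keepalive pings; the other relays keystrokes
// to PTY stdin. On disconnect the PTY is left running.
func (s *session) attach(ws *websocket.Conn) {
	ch := make(chan []byte, 64)
	s.mu.Lock()
	s.clients[ws] = ch
	s.mu.Unlock()

	go func() { // sole writer for this connection
		ticker := time.NewTicker(5 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case out, ok := <-ch:
				if !ok {
					return
				}
				ws.WriteMessage(websocket.BinaryMessage, out)
			case <-ticker.C:
				ws.WriteMessage(websocket.PingMessage, nil)
			}
		}
	}()

	go func() { // reader: WS input → PTY stdin
		for {
			_, data, err := ws.ReadMessage()
			if err != nil { // client gone: detach, but keep the PTY alive
				s.mu.Lock()
				delete(s.clients, ws)
				close(ch)
				s.mu.Unlock()
				return
			}
			s.ptmx.Write(data)
		}
	}()
}
```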