Session Log — zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.


2025-12-20

  • Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
  • Observed repeated failures during devcontainer runtime installation (node, python, go, java).
  • Initial assumption was installer regression; investigation showed installers were enforcing contract correctly.
  • Root cause identified: agent was not exporting required runtime environment variables (notably RUNTIME_VERSION).

2025-12-21

  • Deep dive on zlh-agent devcontainer provisioning flow.
  • Confirmed that all devcontainer installers intentionally require RUNTIME_VERSION and fail fast if missing.
  • Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
  • Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.
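
The fail-fast contract noted above could look roughly like the sketch below. It is an illustration, not the actual installer code: `require_env` is a hypothetical helper name, and only RUNTIME_VERSION comes from this log.

```shell
# Hypothetical helper: fail fast when a required variable is absent from the
# environment (or empty). printenv only sees exported variables, which matches
# the requirement that the agent export them.
require_env() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "installer: required environment variable $name is not set" >&2
      return 1
    fi
  done
}

# An installer would call this before doing any work, for example:
#   require_env RUNTIME_VERSION || exit 1
```

Checking the exported environment rather than shell-local variables mirrors the finding that installers never read the payload JSON.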

Embedded installer execution findings

  • Agent executes installers as embedded scripts, not filesystem paths.
  • Identified critical requirement: shared installer logic (common.sh) and runtime installer must execute in the same shell session.
  • Failure mode observed: install_runtime: command not found → caused by running runtime installer without common.sh loaded.
  • Confirmed this explains missing runtime directories and lack of artifact downloads.

Installer architecture changes

  • Refactored installer model to:
    • common.sh: shared, strict, embedded-safe installation logic
    • per-runtime installers (node, python, go, java) as declarative descriptors only
  • Established that runtime installers are intentionally minimal and declarative by design.
  • Confirmed that this preserves existing runtime layout: /opt/zlh/runtime/<language>/<version>/current
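
As a rough illustration of the split, a descriptor would only set variables while the shared logic derives paths from them. The variable name RUNTIME and the helper `runtime_dir` are assumptions for this sketch; the path layout is the one recorded above.

```shell
# Hypothetical descriptor for one runtime: variables only, no logic.
RUNTIME="node"

# Hypothetical helper that would live in common.sh: derive the install
# directory from the descriptor plus the agent-provided RUNTIME_VERSION.
runtime_dir() {
  echo "/opt/zlh/runtime/${RUNTIME}/${RUNTIME_VERSION}/current"
}
```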

Artifact layout update

  • Artifact naming and layout changed to simplified form:
    • node-24.tar.xz
    • python-3.12.tar.xz
    • go-1.22.tar.gz
    • jdk-21.tar.gz
  • Identified mismatch between runtime name and archive prefix (notably Java).
  • Introduced ARCHIVE_PREFIX as a runtime-level variable to resolve naming cleanly.
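
One way ARCHIVE_PREFIX could resolve the naming mismatch is a fallback to the runtime name when no prefix is set. This is a sketch under assumptions: `artifact_name`, RUNTIME, and ARCHIVE_EXT are hypothetical names; only ARCHIVE_PREFIX, RUNTIME_VERSION, and the file names come from this log.

```shell
# Hypothetical helper for common.sh: build the artifact file name.
# ARCHIVE_PREFIX overrides the runtime name when the two differ.
artifact_name() {
  echo "${ARCHIVE_PREFIX:-$RUNTIME}-${RUNTIME_VERSION}.${ARCHIVE_EXT}"
}

# Example descriptor values for Java (RUNTIME and ARCHIVE_EXT are assumed names).
RUNTIME="java"
ARCHIVE_PREFIX="jdk"
ARCHIVE_EXT="tar.gz"
RUNTIME_VERSION="21"
artifact_name   # prints jdk-21.tar.gz
```

Runtimes whose archive prefix matches their name (node, python, go) would simply leave ARCHIVE_PREFIX unset.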

Final conclusions

  • No regression in installer logic; failures were execution-order and environment-projection issues.
  • Correct fix is agent-side:
    • concatenate common.sh + runtime installer into one bash invocation
    • inject RUNTIME_VERSION (and related vars) into environment
  • Architecture now supports deterministic, artifact-driven, embedded-safe installs.

Status: Root cause resolved; implementation pending agent patch & installer updates.


2025-12-26

  • Goal: Enable SSH access to dev containers via bastion for external users.
  • Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

Dev container SSH (internal success)

  • Root cause of initial SSH failures: missing SSH host keys in unprivileged LXC containers.
  • Fixed by ensuring host keys exist before sshd starts.
  • Verified internal SSH access works from WireGuard/LAN:
    ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
    
  • Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

Agent SSH provisioning requirements identified

  • Agent must automate SSH setup for new containers:
    • Install and enable sshd
    • Generate SSH host keys if missing (add to common.sh or bootstrap)
    • Create devuser with sudo access
    • Configure authorized_keys for key-based auth
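
The host-key step could be sketched as below. The function names are hypothetical and the snippet assumes OpenSSH's `ssh-keygen` is installed and the caller is root; where this lands (common.sh or bootstrap) is still undecided.

```shell
# Returns success when no host private keys exist under the given directory.
host_keys_missing() {
  dir="${1:-/etc/ssh}"
  for key in "$dir"/ssh_host_*_key; do
    [ -e "$key" ] && return 1
  done
  return 0
}

# Generate any missing default host keys before sshd starts.
# ssh-keygen -A creates a host key for each default type that does not exist yet.
ensure_host_keys() {
  if host_keys_missing /etc/ssh; then
    ssh-keygen -A
  fi
}
```

Because `ssh-keygen -A` skips key types that already exist, running it on every container start would also be safe.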

zlh-cli progress

  • Built successfully in Go, deployed to bastion at /usr/local/bin/zlh.
  • Known bugs when running ON bastion:
    • Incorrectly attempts to jump via zlh-bastion.zerolaghub.dev (should use localhost/direct)
    • User/host targeting logic needs fixes (was targeting bastion instead of dev container)
  • Goal: reduce ssh -J complexity to simple zlh ssh 6038 command.
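
The intended mapping from `zlh ssh 6038` to a full ssh command can be sketched in shell. This is not the zlh-cli implementation (which is written in Go); the function name and the `on-bastion` switch are illustrative assumptions, while the jump host, user, and hostname pattern are taken from the working command recorded above.

```shell
# Values taken from the log; everything else here is an illustrative assumption.
BASTION_JUMP="zlh@10.100.0.48"
DEV_DOMAIN="zerolaghub.dev"

# Print the ssh command for a dev container ID.
# Passing "on-bastion" as the second argument skips the jump host, since the
# bastion should connect to dev containers directly.
zlh_ssh_command() {
  id="$1"
  where="${2:-remote}"
  target="root@dev-${id}.${DEV_DOMAIN}"
  if [ "$where" = "on-bastion" ]; then
    echo "ssh ${target}"
  else
    echo "ssh -J ${BASTION_JUMP} ${target}"
  fi
}

zlh_ssh_command 6038   # prints: ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
```

Keeping the target (the dev container) separate from the jump host makes the two known bugs easy to spot: the target must never be the bastion, and the jump must be dropped when already on it.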

Bastion public SSH blocker (ACTIVE)

  • Critical issue: Public SSH to bastion fails immediately.
  • Symptoms:
    • TCP connection succeeds (SYN/ACK completes)
    • SSH handshake never proceeds: kex_exchange_identification: Connection closed by remote host
    • Not an authentication failure: the connection closes before the banner exchange
  • Verified:
    • Cloudflare DNS: zlh-bastion.zerolaghub.dev → 139.64.165.248 (DNS-only)
    • OPNsense NAT rule forwards :22 to 10.100.0.48
    • Internal SSH to bastion works perfectly
    • Fail2ban shows no blocks
    • Tried disabling OPNsense SSH and changing ports; same failure in both cases
  • This is NOT the container host-key issue (that was internal and fixed).

Next debug steps for bastion blocker

  1. tcpdump on bastion during external connection attempt
  2. OPNsense live firewall log during external attempt
  3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
  4. Check for ISP/modem interference or hairpin NAT issues
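
Step 1 could be run roughly as follows. This is a sketch, assuming tcpdump is installed on the bastion and the command is run as root; the function name is hypothetical and the filter is deliberately broad.

```shell
# Run on the bastion while an external client attempts to connect.
# -n disables name resolution; -i any listens on all interfaces.
capture_ssh_traffic() {
  tcpdump -n -i any 'tcp port 22'
}
```

If no packets from the external client's address show up during the attempt, the connection is probably being terminated upstream of the bastion (steps 3 and 4); if the handshake does reach the bastion and is then closed, the focus shifts back to sshd on the bastion itself.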

Status: Dev container SSH working internally; bastion public access blocked at network layer.


2025-12-28 — APIv2 Auth + Portal Alignment Session

Work Completed

  • APIv2 auth route verified functional (JWT-based)
  • bcrypt password verification confirmed
  • /api/instances endpoint verified working without auth
  • Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
  • Confirmed no CSRF or cookie-based auth required (stateless JWT)

Key Findings

  • Portal still contains APIv1 / Pterodactyl assumptions
  • zlh-grind is documentation + constraint repo only (no code)
  • Instances endpoint behavior was correct; earlier failures were route misuse

Decisions

  • APIv2 auth will remain stateless (JWT only)
  • No CSRF protection will be implemented
  • Portal must fully remove APIv1 and Pterodactyl patterns

Next Actions

  • Enforce requireAuth selectively in APIv2
  • Update portal login to match APIv2 contract
  • Track portal migration progress in OPEN_THREADS