Session Log — zlh-grind
Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.
2025-12-20
- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was installer regression; investigation showed installers were enforcing contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably RUNTIME_VERSION).
2025-12-21
- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require RUNTIME_VERSION and fail fast if missing.
- Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.
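A minimal sketch of the fail-fast contract described above. The helper name require_env is hypothetical; the real installers may perform this check differently.

```shell
# Hypothetical helper: returns non-zero if any named variable is unset or empty.
require_env() {
  for _name in "$@"; do
    # Look up the variable whose name is stored in $_name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "fatal: required variable $_name is not set" >&2
      return 1
    fi
  done
}

# An installer would call this before any download or extraction work:
#   require_env RUNTIME_VERSION || exit 1
```

Failing before any download starts means a missing variable surfaces as one clear error instead of a half-populated runtime directory.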
Embedded installer execution findings
- Agent executes installers as embedded scripts, not filesystem paths.
- Identified critical requirement: shared installer logic (common.sh) and runtime installer must execute in the same shell session.
- Failure mode observed: install_runtime: command not found → caused by running runtime installer without common.sh loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.
Installer architecture changes
- Refactored installer model to:
  - common.sh: shared, strict, embedded-safe installation logic
  - per-runtime installers (node, python, go, java) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves existing runtime layout:
/opt/zlh/runtime/<language>/<version>/current
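The split between shared logic and declarative descriptors could look roughly like this. Only install_runtime and the runtime path come from this log; RUNTIME_NAME, runtime_dir, and node_installer are assumptions for illustration, and the sketch reports the target path instead of downloading anything.

```shell
# --- common.sh (sketch): shared logic, defined once ---
runtime_dir() {
  # $1 = language, $2 = version; mirrors the layout recorded in this log.
  echo "/opt/zlh/runtime/$1/$2/current"
}

install_runtime() {
  if [ -z "${RUNTIME_NAME:-}" ] || [ -z "${RUNTIME_VERSION:-}" ]; then
    echo "fatal: RUNTIME_NAME and RUNTIME_VERSION are required" >&2
    return 1
  fi
  # A real implementation would download and extract here.
  echo "install $RUNTIME_NAME $RUNTIME_VERSION -> $(runtime_dir "$RUNTIME_NAME" "$RUNTIME_VERSION")"
}

# --- per-runtime installer (sketch): data plus one call, no logic ---
node_installer() {
  RUNTIME_NAME="node"
  install_runtime
}
```

Example: `RUNTIME_VERSION=24 node_installer` prints the install plan for node 24 under the layout above.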
Artifact layout update
- Artifact naming and layout changed to simplified form:
  - node-24.tar.xz
  - python-3.12.tar.xz
  - go-1.22.tar.gz
  - jdk-21.tar.gz
- Identified mismatch between runtime name and archive prefix (notably Java).
- Introduced ARCHIVE_PREFIX as a runtime-level variable to resolve naming cleanly.
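One way ARCHIVE_PREFIX might resolve the Java naming mismatch, as a sketch. The function archive_name and passing the extension as an argument are assumptions; the log does not record how the extension (tar.xz vs tar.gz) is selected.

```shell
# Hypothetical resolver: the archive prefix defaults to the runtime name
# unless the runtime's descriptor sets ARCHIVE_PREFIX.
archive_name() {
  # $1 = runtime name, $2 = version, $3 = archive extension
  echo "${ARCHIVE_PREFIX:-$1}-$2.$3"
}
```

With ARCHIVE_PREFIX=jdk, `archive_name java 21 tar.gz` yields jdk-21.tar.gz; with it unset, `archive_name node 24 tar.xz` yields node-24.tar.xz.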
Final conclusions
- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side:
  - concatenate common.sh + runtime installer into one bash invocation
  - inject RUNTIME_VERSION (and related vars) into environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.
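The agent-side fix can be sketched in shell as follows. The script bodies are hypothetical stand-ins for the embedded common.sh and node installer, and the function names are illustrative; the agent itself may assemble the scripts differently.

```shell
# Hypothetical stand-ins for the embedded scripts; the real contents differ.
COMMON_SH='install_runtime() {
  echo "installing ${RUNTIME_NAME} ${RUNTIME_VERSION}"
}'
NODE_SH='RUNTIME_NAME=node
install_runtime'

# Feed both scripts to ONE bash process on stdin, with the version injected
# through the environment rather than read from payload JSON.
run_embedded_installer() {
  printf '%s\n%s\n' "$COMMON_SH" "$NODE_SH" | RUNTIME_VERSION="$1" bash -s
}

# Failure mode from this log: the runtime script alone has no install_runtime,
# so bash reports "command not found" and exits non-zero.
run_without_common() {
  printf '%s\n' "$NODE_SH" | bash -s
}
```

Because both scripts share one process, the function defined by the shared script is visible when the runtime script calls it; two separate invocations would lose it.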
Status: Root cause resolved; implementation pending agent patch & installer updates.
2025-12-26
- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).
Dev container SSH (internal success)
- Root cause of initial SSH failures: missing SSH host keys in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts.
- Verified internal SSH access works from WireGuard/LAN: ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).
Agent SSH provisioning requirements identified
- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to common.sh or bootstrap)
  - Create dev user with sudo access
  - Configure authorized_keys for key-based auth
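A possible shape for the host-key step, assuming OpenSSH's ssh-keygen is installed in the container. The function names are hypothetical, and where this runs (common.sh or bootstrap) is still open per the list above.

```shell
# Returns 0 if at least one host private key exists in directory $1.
has_host_keys() {
  for _key in "$1"/ssh_host_*_key; do
    [ -f "$_key" ] && return 0
  done
  return 1
}

# Generate host keys only when none exist; must run before sshd starts.
ensure_host_keys() {
  if has_host_keys /etc/ssh; then
    return 0
  fi
  # ssh-keygen -A creates any missing default host key types.
  ssh-keygen -A
}
```

ssh-keygen -A already skips key types that exist, so the explicit check mainly keeps the step quiet and cheap on containers that are already provisioned.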
zlh-cli progress
- Built successfully in Go, deployed to bastion at /usr/local/bin/zlh.
- Known bugs when running ON bastion:
  - Incorrectly attempts to jump via zlh-bastion.zerolaghub.dev (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of dev container)
- Goal: reduce ssh -J complexity to simple zlh ssh 6038 command.
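The intended targeting behaviour, sketched in shell rather than the CLI's Go. The function name, the on-bastion flag, the dev-<id>.zerolaghub.dev hostname pattern (inferred from the dev-6038 example), and the root/zlh users are assumptions, not the actual zlh-cli logic.

```shell
# Hypothetical: print the ssh command that "zlh ssh <id>" should expand to.
build_ssh_command() {
  # $1 = container id, $2 = "yes" when running on the bastion itself
  _target="root@dev-$1.zerolaghub.dev"
  if [ "$2" = "yes" ]; then
    # On the bastion: connect directly, no jump host.
    echo "ssh $_target"
  else
    # Elsewhere: jump through the bastion to reach the dev container.
    echo "ssh -J zlh@zlh-bastion.zerolaghub.dev $_target"
  fi
}
```

In both branches the final target is the dev container, never the bastion, which covers the two known bugs listed above.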
Bastion public SSH blocker (ACTIVE)
- Critical issue: Public SSH to bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: kex_exchange_identification: Connection closed by remote host
  - Not authentication failure - connection closes before banner exchange
- Verified:
  - Cloudflare DNS: zlh-bastion.zerolaghub.dev → 139.64.165.248 (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH, changing ports - same failure
- This is NOT the container host-key issue (that was internal and fixed).
Next debug steps for bastion blocker
- tcpdump on bastion during external connection attempt
- OPNsense live firewall log during external attempt
- Confirm NAT truly reaches bastion sshd (not terminating upstream)
- Check for ISP/modem interference or hairpin NAT issues
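To complement the packet captures, a small banner probe run from an external host can show whether any SSH identification string arrives at all. This is a sketch: the function names are hypothetical, and it assumes bash (for /dev/tcp) and coreutils timeout are available.

```shell
# Returns 0 if $1 looks like an SSH identification string.
is_ssh_banner() {
  case "$1" in
    SSH-2.0-*|SSH-1.99-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Connect to host $1 (port $2, default 22) and read the first line sent.
probe_ssh_banner() {
  _banner="$(timeout 5 bash -c '
    exec 3<>"/dev/tcp/$0/$1" || exit 1
    IFS= read -r line <&3
    printf "%s" "$line" | tr -d "\r"
  ' "$1" "${2:-22}" 2>/dev/null)" || true
  if is_ssh_banner "$_banner"; then
    echo "banner received: $_banner"
  else
    echo "no SSH banner (connect failed, or peer closed before identification)"
    return 1
  fi
}
```

If the probe reports no banner against the public address but a banner against 10.100.0.48 internally, that points at something upstream of sshd closing the connection.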
Status: Dev container SSH working internally; bastion public access blocked at network layer.
2025-12-28 — APIv2 Auth + Portal Alignment Session
Work Completed
- APIv2 auth route verified functional (JWT-based)
- bcrypt password verification confirmed
- /api/instances endpoint verified working without auth
- Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
- Confirmed no CSRF or cookie-based auth required (stateless JWT)
Key Findings
- Portal still contains APIv1 / Pterodactyl assumptions
- zlh-grind is documentation + constraint repo only (no code)
- Instances endpoint behavior was correct; earlier failures were route misuse
Decisions
- APIv2 auth will remain stateless (JWT only)
- No CSRF protection will be implemented
- Portal must fully remove APIv1 and Pterodactyl patterns
Next Actions
- Enforce requireAuth selectively in APIv2
- Update portal login to match APIv2 contract
- Track portal migration progress in OPEN_THREADS
Jan 2026 — Portal UX & Control Plane Alignment
Completed
- Finalized Dashboard as awareness-only surface
- Implemented System Health indicator (non-technical)
- Reworked Notices UX with expansion and scrolling
- Removed all bulk lifecycle controls
- Defined Servers page grouping by type
- Introduced expandable server cards
- Selected "System View" as escalation action
- Explicitly avoided AWS-style UI patterns
Key Decisions
- Agent is authoritative for runtime state
- API v2 brokers access and permissions only
- Console / runtime access is observational first
- Control surfaces are separated from awareness surfaces
Outcome
Portal is no longer Pterodactyl-shaped. It now supports mixed workloads cleanly.