# Session Log – zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.

---

## 2025-12-20
- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was an installer regression; investigation showed the installers were enforcing their contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably `RUNTIME_VERSION`).

---

## 2025-12-21
- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require `RUNTIME_VERSION` and fail fast if missing.
- Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.

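The fail-fast contract noted above can be illustrated with a small guard. This is a minimal sketch, not the installers' actual code: `require_env` is a hypothetical helper name, and only `RUNTIME_VERSION` comes from this log.

```bash
#!/usr/bin/env bash
# require_env NAME...: hypothetical helper that fails when any named
# variable is unset or empty, before any download or extraction happens.
require_env() {
  local name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable called $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required environment variable: ${name}" >&2
      return 1
    fi
  done
}

# Possible use at the top of an installer:
# require_env RUNTIME_VERSION || exit 1
```

Because the installers do not read the payload JSON, a guard like this only passes when the agent has already exported the value.
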
### Embedded installer execution findings
- Agent executes installers as **embedded scripts**, not filesystem paths.
- Identified critical requirement: shared installer logic (`common.sh`) and runtime installer must execute in the **same shell session**.
- Failure mode observed: `install_runtime: command not found` → caused by running runtime installer without `common.sh` loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.

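The `command not found` failure can be reproduced with two stand-in snippets. A sketch, assuming bash; `common_sh` and `runtime_sh` are placeholders invented for illustration, not the real embedded scripts.

```bash
#!/usr/bin/env bash
# Placeholder script bodies (assumptions, not the real installers).
common_sh='install_runtime() { echo "installing ${RUNTIME}"; }'
runtime_sh='RUNTIME=node
install_runtime'

# Broken: each script gets its own bash process, so the function defined
# by the first one no longer exists when the second one runs.
separate_shells() {
  bash -c "$common_sh"
  bash -c "$runtime_sh"
}

# Working: both scripts are fed, in order, to a single bash process.
single_shell() {
  printf '%s\n%s\n' "$common_sh" "$runtime_sh" | bash -s
}
```

Calling `separate_shells` fails with `install_runtime: command not found`, while `single_shell` reaches the function because the definition and the call share one shell.
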
### Installer architecture changes
- Refactored installer model to:
  - `common.sh`: shared, strict, embedded-safe installation logic
  - per-runtime installers (`node`, `python`, `go`, `java`) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves existing runtime layout:
  `/opt/zlh/runtime/<language>/<version>/current`

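A rough sketch of how the split between shared logic and a declarative descriptor might look. The function bodies are assumptions made for illustration; `ZLH_RUNTIME_ROOT` and `runtime_dir` are hypothetical names, and nothing is downloaded or written here.

```bash
#!/usr/bin/env bash
# --- common.sh side (sketch): all logic lives here ---
# ZLH_RUNTIME_ROOT is a hypothetical override so the sketch can run outside /opt.
runtime_dir() {
  local root="${ZLH_RUNTIME_ROOT:-/opt/zlh/runtime}"
  printf '%s/%s/%s/current\n' "$root" "$RUNTIME" "$RUNTIME_VERSION"
}

# Dry-run stand-in for the real install routine: only reports the target path.
install_runtime() {
  echo "would install ${RUNTIME} ${RUNTIME_VERSION} into $(runtime_dir)"
}

# --- node descriptor side (sketch): declarations plus one call, no logic ---
# RUNTIME=node
# install_runtime
```

With `RUNTIME=node` and `RUNTIME_VERSION=24`, the reported target is `/opt/zlh/runtime/node/24/current`, matching the layout above.
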
### Artifact layout update
- Artifact naming and layout changed to simplified form:
  - `node-24.tar.xz`
  - `python-3.12.tar.xz`
  - `go-1.22.tar.gz`
  - `jdk-21.tar.gz`
- Identified mismatch between runtime name and archive prefix (notably Java).
- Introduced `ARCHIVE_PREFIX` as a runtime-level variable to resolve naming cleanly.

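One way `ARCHIVE_PREFIX` could resolve the Java naming mismatch is sketched below. The fallback to `RUNTIME` when the prefix is unset, the `ARCHIVE_EXT` variable, and the `archive_name` function are assumptions for illustration, not confirmed installer behavior.

```bash
#!/usr/bin/env bash
# archive_name: build the artifact file name for the current runtime.
# Assumption: ARCHIVE_PREFIX falls back to RUNTIME when a descriptor does not set it.
# ARCHIVE_EXT is a hypothetical variable holding the extension (tar.xz, tar.gz).
archive_name() {
  local prefix="${ARCHIVE_PREFIX:-$RUNTIME}"
  printf '%s-%s.%s\n' "$prefix" "$RUNTIME_VERSION" "$ARCHIVE_EXT"
}

# Java descriptor values: the runtime is "java" but the archive starts with "jdk".
# RUNTIME=java ARCHIVE_PREFIX=jdk ARCHIVE_EXT=tar.gz RUNTIME_VERSION=21
```

Under these assumptions only the Java descriptor needs to set the prefix; the other runtimes keep using their own name.
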
### Final conclusions
- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side:
  - concatenate `common.sh` + runtime installer into one bash invocation
  - inject `RUNTIME_VERSION` (and related vars) into environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.

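The agent-side fix can be approximated in shell as a single launcher. This is only a sketch of the idea: `run_embedded` is a hypothetical name, and how the real agent invokes bash is not shown in this log.

```bash
#!/usr/bin/env bash
# run_embedded VERSION SCRIPT...: hypothetical agent-side launcher.
# Joins the script fragments in the given order and runs them in one bash
# process, with RUNTIME_VERSION placed in that process's environment.
run_embedded() {
  local version=$1
  shift
  printf '%s\n' "$@" | env RUNTIME_VERSION="$version" bash -s
}

# Possible call, with the embedded script bodies held in variables:
# run_embedded 24 "$common_sh" "$node_installer_sh"
```

Passing the version through `env` keeps it scoped to the installer process instead of leaking into the launcher's own environment.
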
Status: **Root cause resolved; implementation pending agent patch & installer updates.**

---

## 2025-12-26
- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

### Dev container SSH (internal success)
- Root cause of initial SSH failures: **missing SSH host keys** in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts.
- Verified internal SSH access works from WireGuard/LAN:
  ```
  ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
  ```
- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

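A host-key guard of the kind described above might look like the following. A sketch only: `ensure_host_keys` is a hypothetical function name, and the choice of a single ed25519 key is an assumption.

```bash
#!/usr/bin/env bash
# ensure_host_keys [DIR]: create an ed25519 host key in DIR unless a host key
# is already there. Meant to run before sshd starts; DIR defaults to /etc/ssh.
ensure_host_keys() {
  local dir="${1:-/etc/ssh}"
  if ls "$dir"/ssh_host_*_key >/dev/null 2>&1; then
    echo "host keys present in ${dir}"
    return 0
  fi
  # -q quiet, -t key type, -N '' empty passphrase, -f output file.
  ssh-keygen -q -t ed25519 -N '' -f "${dir}/ssh_host_ed25519_key"
}
```

On a stock OpenSSH install, `ssh-keygen -A` is a shorter alternative that generates any missing default host key types in one call.
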
### Agent SSH provisioning requirements identified
- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to `common.sh` or bootstrap)
  - Create `devuser` with sudo access
  - Configure authorized_keys for key-based auth

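The authorized_keys step lends itself to a small idempotent helper. A sketch under assumptions: `install_authorized_key` is a hypothetical name, and the home directory and key are passed in by the caller rather than discovered.

```bash
#!/usr/bin/env bash
# install_authorized_key HOME_DIR PUBKEY: add PUBKEY to
# HOME_DIR/.ssh/authorized_keys exactly once, with owner-only permissions.
install_authorized_key() {
  local home=$1 key=$2
  local file="${home}/.ssh/authorized_keys"
  mkdir -p "${home}/.ssh"
  chmod 700 "${home}/.ssh"
  touch "$file"
  chmod 600 "$file"
  # -x whole-line match, -F fixed string: append only when the key is absent.
  grep -qxF -- "$key" "$file" || printf '%s\n' "$key" >> "$file"
}
```

When run as root for `devuser`, the `.ssh` directory and file would also need to be chowned to that user, which is left out of this sketch.
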
### zlh-cli progress
- Built successfully in Go, deployed to bastion at `/usr/local/bin/zlh`.
- **Known bugs when running ON bastion**:
  - Incorrectly attempts to jump via `zlh-bastion.zerolaghub.dev` (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of dev container)
- Goal: reduce `ssh -J` complexity to simple `zlh ssh 6038` command.

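The intended targeting behavior can be written down as a tiny shell function for reference. This is not the zlh-cli code (which is in Go); `build_ssh_cmd` and its `ON_BASTION` argument are hypothetical, and the jump user `zlh` on the public bastion name is an assumption carried over from the internal command.

```bash
#!/usr/bin/env bash
# build_ssh_cmd CONTAINER_ID [ON_BASTION]: print the ssh command a wrapper
# could run. ON_BASTION is a hypothetical "yes"/"no" flag; how the CLI
# detects that it is running on the bastion is not covered here.
build_ssh_cmd() {
  local id=$1 on_bastion=${2:-no}
  local target="root@dev-${id}.zerolaghub.dev"
  if [ "$on_bastion" = "yes" ]; then
    # Already inside the network: connect to the dev container directly.
    echo "ssh ${target}"
  else
    # Outside: jump through the bastion, then land on the dev container.
    echo "ssh -J zlh@zlh-bastion.zerolaghub.dev ${target}"
  fi
}
```

In both branches the final target is the dev container, never the bastion itself, which is the behavior the two known bugs violate.
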
### Bastion public SSH blocker (ACTIVE)
- **Critical issue**: Public SSH to bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: `kex_exchange_identification: Connection closed by remote host`
  - Not an authentication failure: connection closes before banner exchange
- Verified:
  - Cloudflare DNS: `zlh-bastion.zerolaghub.dev` → `139.64.165.248` (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH and changing ports; same failure
- **This is NOT the container host-key issue** (that was internal and fixed).

### Next debug steps for bastion blocker
1. tcpdump on bastion during external connection attempt
2. OPNsense live firewall log during external attempt
3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
4. Check for ISP/modem interference or hairpin NAT issues

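To complement the steps above, a banner probe run from an external host can show whether any SSH identification string arrives at all. A sketch assuming bash (it relies on bash's `/dev/tcp` redirection); `probe_banner` and `is_ssh_banner` are hypothetical helper names, and only the first line sent by the server is inspected.

```bash
#!/usr/bin/env bash
# is_ssh_banner LINE: succeed if LINE looks like an SSH identification
# string, which starts with "SSH-" (RFC 4253, section 4.2).
is_ssh_banner() {
  case "$1" in
    SSH-*) return 0 ;;
    *)     return 1 ;;
  esac
}

# probe_banner HOST [PORT]: connect over TCP and report the first line received.
probe_banner() {
  local host=$1 port=${2:-22} banner
  banner=$(
    exec 2>/dev/null                                  # silence connect errors
    exec 3<>"/dev/tcp/${host}/${port}" || exit 1      # bash-only TCP redirection
    IFS= read -r -t 5 line <&3 || exit 1              # wait up to 5 s for a line
    printf '%s' "${line%$'\r'}"                       # strip the trailing CR
  ) || { echo "no banner from ${host}:${port}"; return 1; }
  if is_ssh_banner "$banner"; then
    echo "banner: ${banner}"
  else
    echo "unexpected data from ${host}:${port}"
    return 1
  fi
}

# Possible use from outside the network:
# probe_banner zlh-bastion.zerolaghub.dev 22
```

A `no banner` result from outside, combined with a normal banner from inside, would point at something between the client and sshd closing the connection.
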
Status: **Dev container SSH working internally; bastion public access blocked at network layer.**

---