# Session Log — zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.

---

## 2025-12-20

- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was installer regression; investigation showed installers were enforcing contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably `RUNTIME_VERSION`).

---

## 2025-12-21

- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require `RUNTIME_VERSION` and fail fast if missing.
- Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.
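
A minimal sketch of the fail-fast contract described above, assuming bash. `require_env` is a hypothetical helper name and is not taken from the real `common.sh`; only `RUNTIME_VERSION` is named in this log.

```bash
# Hypothetical guard, sketched for illustration; the real common.sh may differ.

# require_env NAME...: return non-zero (with a message on stderr) if any
# named variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "installer: required variable ${name} is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

An installer could call `require_env RUNTIME_VERSION || exit 1` before any download or extraction, so a missing variable surfaces as one clear error rather than a half-built runtime directory.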

### Embedded installer execution findings

- Agent executes installers as **embedded scripts**, not filesystem paths.
- Identified critical requirement: shared installer logic (`common.sh`) and runtime installer must execute in the **same shell session**.
- Failure mode observed: `install_runtime: command not found` → caused by running runtime installer without `common.sh` loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.

### Installer architecture changes

- Refactored installer model to:
  - `common.sh`: shared, strict, embedded-safe installation logic
  - per-runtime installers (`node`, `python`, `go`, `java`) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves existing runtime layout:
  `/opt/zlh/runtime/<language>/<version>/current`

### Artifact layout update

- Artifact naming and layout changed to simplified form:
  - `node-24.tar.xz`
  - `python-3.12.tar.xz`
  - `go-1.22.tar.gz`
  - `jdk-21.tar.gz`
- Identified mismatch between runtime name and archive prefix (notably Java).
- Introduced `ARCHIVE_PREFIX` as a runtime-level variable to resolve naming cleanly.
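
The naming rule above could be expressed roughly as follows. This is a sketch, not the actual descriptor format: `RUNTIME`, `ARCHIVE_EXT`, and `archive_name` are assumed names; only `ARCHIVE_PREFIX` and `RUNTIME_VERSION` come from this log.

```bash
# Hypothetical Java descriptor: declares values only, no logic.
RUNTIME="java"
ARCHIVE_PREFIX="jdk"      # archive prefix differs from the runtime name
ARCHIVE_EXT="tar.gz"

# archive_name: print the artifact file name for the current descriptor.
# Falls back to RUNTIME when ARCHIVE_PREFIX is unset or empty.
archive_name() {
  printf '%s-%s.%s\n' "${ARCHIVE_PREFIX:-$RUNTIME}" "$RUNTIME_VERSION" "$ARCHIVE_EXT"
}
```

With `RUNTIME_VERSION=21` this descriptor resolves to `jdk-21.tar.gz`. A Node descriptor that leaves `ARCHIVE_PREFIX` unset and uses `tar.xz` as its extension would resolve to `node-24.tar.xz` for version 24.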

### Final conclusions

- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side:
  - concatenate `common.sh` + runtime installer into one bash invocation
  - inject `RUNTIME_VERSION` (and related vars) into environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.
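
The agent-side fix can be illustrated with a small bash sketch. The script bodies below are placeholders, since the real `common.sh` and runtime installers are not shown in this log, and `run_embedded` is a hypothetical name. The point is only that both scripts reach a single `bash -s` process on stdin, with `RUNTIME_VERSION` set in that process's environment.

```bash
# Stand-ins for the embedded scripts (placeholder bodies, not the real ones).
COMMON_SH='
install_runtime() {
  : "${RUNTIME_VERSION:?RUNTIME_VERSION is required}"
  echo "installing ${RUNTIME} ${RUNTIME_VERSION}"
}
'
NODE_INSTALLER='
RUNTIME="node"
install_runtime
'

# run_embedded VERSION SCRIPT...: join the scripts with newlines and feed them
# to one bash process on stdin, with RUNTIME_VERSION set for that process only.
run_embedded() {
  local version=$1
  shift
  printf '%s\n' "$@" | RUNTIME_VERSION="$version" bash -s
}

# Broken: the descriptor alone has no install_runtime definition, so bash
# reports "install_runtime: command not found" and exits non-zero.
#   run_embedded 24 "$NODE_INSTALLER"
# Fixed: common.sh and the descriptor share one shell session.
#   run_embedded 24 "$COMMON_SH" "$NODE_INSTALLER"
```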

Status: **Root cause resolved; implementation pending agent patch & installer updates.**

---

## 2025-12-26

- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

### Dev container SSH (internal success)

- Root cause of initial SSH failures: **missing SSH host keys** in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts.
- Verified internal SSH access works from WireGuard/LAN:
  ```
  ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
  ```
- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

### Agent SSH provisioning requirements identified

- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to `common.sh` or bootstrap)
  - Create `devuser` with sudo access
  - Configure authorized_keys for key-based auth
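
One way the host-key step might look in a bootstrap script, sketched under the assumption that OpenSSH is already installed in the container. `has_host_keys` and `ensure_host_keys` are hypothetical names. `ssh-keygen -A` is the stock OpenSSH option that creates any default host key types that do not exist yet. The other requirements (sshd install, `devuser`, authorized_keys) are not covered here.

```bash
# has_host_keys [DIR]: succeed if DIR (default /etc/ssh) already holds at
# least one sshd host private key.
has_host_keys() {
  local dir=${1:-/etc/ssh} key
  for key in "$dir"/ssh_host_*_key; do
    # If the glob matched nothing, $key is the literal pattern and -f fails.
    [ -f "$key" ] && return 0
  done
  return 1
}

# ensure_host_keys: run ssh-keygen -A when no host key is present.
# Intended to run before sshd is started.
ensure_host_keys() {
  has_host_keys /etc/ssh || ssh-keygen -A
}
```

Calling `ensure_host_keys` ahead of the sshd start keeps the step idempotent: containers that already have keys are left untouched.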

### zlh-cli progress

- Built successfully in Go, deployed to bastion at `/usr/local/bin/zlh`.
- **Known bugs when running ON bastion**:
  - Incorrectly attempts to jump via `zlh-bastion.zerolaghub.dev` (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of dev container)
- Goal: reduce `ssh -J` complexity to simple `zlh ssh 6038` command.
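
The intended targeting behaviour can be sketched in shell for illustration; zlh-cli itself is written in Go and its real logic is not shown here. The `dev-<id>.zerolaghub.dev` pattern is inferred from the single example in this log, and `ZLH_ON_BASTION` and `zlh_ssh_cmd` are hypothetical names.

```bash
# Jump host as used in the verified internal command.
BASTION="zlh@10.100.0.48"

# zlh_ssh_cmd ID: print the ssh command for dev container ID.
zlh_ssh_cmd() {
  local id=$1 target
  # Always target the dev container, never the bastion itself.
  target="root@dev-${id}.zerolaghub.dev"
  if [ "${ZLH_ON_BASTION:-0}" = "1" ]; then
    # Already on the bastion: connect directly, no jump host.
    echo "ssh ${target}"
  else
    echo "ssh -J ${BASTION} ${target}"
  fi
}
```

The function prints the command instead of running it, which makes the two cases easy to compare: `zlh_ssh_cmd 6038` yields the verified `ssh -J` form, and the same call with `ZLH_ON_BASTION=1` drops the jump.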

### Bastion public SSH blocker (ACTIVE)

- **Critical issue**: Public SSH to bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: `kex_exchange_identification: Connection closed by remote host`
  - Not authentication failure - connection closes before banner exchange
- Verified:
  - Cloudflare DNS: `zlh-bastion.zerolaghub.dev` → `139.64.165.248` (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH, changing ports - same failure
- **This is NOT the container host-key issue** (that was internal and fixed).

### Next debug steps for bastion blocker

1. tcpdump on bastion during external connection attempt
2. OPNsense live firewall log during external attempt
3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
4. Check for ISP/modem interference or hairpin NAT issues
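
For step 1, a capture filter along these lines might help separate external attempts from internal jump traffic. This is a sketch: `external_ssh_filter` is a hypothetical helper, and the `10.100.0.0/24` internal range is an assumption based on the bastion address, not a confirmed subnet. If no packets from a public source appear during an external attempt, the connection is likely being answered upstream of the bastion; note that any source NAT on the firewall would make external traffic look internal and hide it from this filter.

```bash
# external_ssh_filter NET [PORT]: print a pcap filter expression matching
# TCP traffic on PORT (default 22) whose source address is outside NET.
external_ssh_filter() {
  local net=$1 port=${2:-22}
  printf 'tcp port %s and not src net %s\n' "$port" "$net"
}

# On the bastion, as root, while an external client retries the connection
# (-n disables name resolution, -i any captures on all interfaces):
#   tcpdump -n -i any "$(external_ssh_filter 10.100.0.0/24)"
```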

Status: **Dev container SSH working internally; bastion public access blocked at network layer.**

---

## 2025-12-28 — APIv2 Auth + Portal Alignment Session

### Work Completed

- APIv2 auth route verified functional (JWT-based)
- bcrypt password verification confirmed
- `/api/instances` endpoint verified working without auth
- Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
- Confirmed no CSRF or cookie-based auth required (stateless JWT)

### Key Findings

- Portal still contains APIv1 / Pterodactyl assumptions
- `zlh-grind` is documentation + constraint repo only (no code)
- Instances endpoint behavior was correct; earlier failures were route misuse

### Decisions

- APIv2 auth will remain stateless (JWT only)
- No CSRF protection will be implemented
- Portal must fully remove APIv1 and Pterodactyl patterns

### Next Actions

- Enforce `requireAuth` selectively in APIv2
- Update portal login to match APIv2 contract
- Track portal migration progress in OPEN_THREADS

---

## Jan 2026 — Portal UX & Control Plane Alignment

### Completed

- Finalized Dashboard as awareness-only surface
- Implemented System Health indicator (non-technical)
- Reworked Notices UX with expansion and scrolling
- Removed all bulk lifecycle controls
- Defined Servers page grouping by type
- Introduced expandable server cards
- Selected "System View" as escalation action
- Explicitly avoided AWS-style UI patterns

### Key Decisions

- Agent is authoritative for runtime state
- API v2 brokers access and permissions only
- Console / runtime access is observational first
- Control surfaces are separated from awareness surfaces

### Outcome

Portal is no longer Pterodactyl-shaped.
It now supports mixed workloads cleanly.