docs: append Feb 7-8 session - metrics API, console UI, agent updates complete
# Session Log — zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.

---

## 2025-12-20

- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was an installer regression; investigation showed installers were enforcing the contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably `RUNTIME_VERSION`).

---

## 2025-12-21

- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require `RUNTIME_VERSION` and fail fast if missing.
- Clarified that payload JSON is not read by installers; the agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.

### Embedded installer execution findings

- Agent executes installers as **embedded scripts**, not filesystem paths.
- Identified critical requirement: shared installer logic (`common.sh`) and the runtime installer must execute in the **same shell session**.
- Failure mode observed: `install_runtime: command not found` → caused by running a runtime installer without `common.sh` loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.

### Installer architecture changes

- Refactored installer model to:
  - `common.sh`: shared, strict, embedded-safe installation logic
  - per-runtime installers (`node`, `python`, `go`, `java`) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves the existing runtime layout:
  `/opt/zlh/runtime/<language>/<version>/current`

### Artifact layout update

- Artifact naming and layout changed to a simplified form:
  - `node-24.tar.xz`
  - `python-3.12.tar.xz`
  - `go-1.22.tar.gz`
  - `jdk-21.tar.gz`
- Identified mismatch between runtime name and archive prefix (notably Java).
- Introduced `ARCHIVE_PREFIX` as a runtime-level variable to resolve naming cleanly.

### Final conclusions

- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side:
  - concatenate `common.sh` + runtime installer into one bash invocation
  - inject `RUNTIME_VERSION` (and related vars) into the environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.
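
The agent-side fix above can be sketched roughly as follows. This is a minimal illustration only: `buildInstallCmd` and the stand-in script contents are assumptions, not the actual agent code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// buildInstallCmd concatenates the shared installer logic (common.sh) with a
// per-runtime installer so both execute in ONE bash invocation (same shell
// session), and injects the env vars the installers require. Script contents
// here are placeholders, not the real installers.
func buildInstallCmd(commonSh, installerSh, version, archivePrefix string) *exec.Cmd {
	script := commonSh + "\n" + installerSh
	cmd := exec.Command("bash", "-s") // embedded script arrives on stdin, not as a path
	cmd.Stdin = strings.NewReader(script)
	cmd.Env = append(os.Environ(),
		"RUNTIME_VERSION="+version,
		"ARCHIVE_PREFIX="+archivePrefix,
	)
	return cmd
}

func main() {
	cmd := buildInstallCmd(
		`install_runtime() { echo "installing $RUNTIME_VERSION"; }`, // stand-in for common.sh
		`install_runtime`, // stand-in for a declarative runtime installer
		"24", "node",
	)
	fmt.Println(cmd.Args)
}
```

Because the two scripts travel as one stdin payload, `install_runtime` defined in `common.sh` is visible to the runtime installer, which is exactly the failure mode (`command not found`) this resolves.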

Status: **Root cause resolved; implementation pending agent patch & installer updates.**

---

## 2025-12-26

- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

### Dev container SSH (internal success)

- Root cause of initial SSH failures: **missing SSH host keys** in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts.
- Verified internal SSH access works from WireGuard/LAN:

```
ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
```

- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

### Agent SSH provisioning requirements identified

- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to `common.sh` or bootstrap)
  - Create `devuser` with sudo access
  - Configure `authorized_keys` for key-based auth
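
A sketch of how the agent might sequence this bootstrap. The exact package manager, service name, and user details are assumptions for illustration; only the four requirements themselves come from this log.

```go
package main

import "fmt"

// sshBootstrapSteps returns the shell steps the agent would run inside a new
// container to satisfy the SSH provisioning requirements above. Command
// strings are illustrative (Debian-flavored) and not finalized.
func sshBootstrapSteps(pubKey string) []string {
	return []string{
		"apt-get install -y openssh-server",       // install sshd
		"ssh-keygen -A",                           // generate any missing host keys
		"systemctl enable --now ssh",              // enable sshd
		"useradd -m -s /bin/bash -G sudo devuser", // devuser with sudo access
		"install -d -m 700 -o devuser /home/devuser/.ssh",
		fmt.Sprintf("echo '%s' >> /home/devuser/.ssh/authorized_keys", pubKey),
	}
}

func main() {
	for _, step := range sshBootstrapSteps("ssh-ed25519 AAAA... user@host") {
		fmt.Println(step)
	}
}
```

`ssh-keygen -A` in particular addresses the unprivileged-LXC host-key failure identified above, since it only creates keys that are missing.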

### zlh-cli progress

- Built successfully in Go, deployed to bastion at `/usr/local/bin/zlh`.
- **Known bugs when running ON bastion**:
  - Incorrectly attempts to jump via `zlh-bastion.zerolaghub.dev` (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of dev container)
- Goal: reduce `ssh -J` complexity to a simple `zlh ssh 6038` command.
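
The intended simplification, including the on-bastion fix, could look like this. The hostname pattern and jump host are taken from this log; the function shape is a sketch, not the shipped CLI.

```go
package main

import "fmt"

// sshArgs maps `zlh ssh <vmid>` to the full jump invocation. When the CLI
// already runs on the bastion it must skip the -J hop through the bastion
// itself (one of the known bugs above); onBastion models that fix.
func sshArgs(vmid int, onBastion bool) []string {
	target := fmt.Sprintf("root@dev-%d.zerolaghub.dev", vmid)
	if onBastion {
		return []string{"ssh", target} // direct, no jump via the bastion itself
	}
	return []string{"ssh", "-J", "zlh@10.100.0.48", target}
}

func main() {
	fmt.Println(sshArgs(6038, false))
}
```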

### Bastion public SSH blocker (ACTIVE)

- **Critical issue**: Public SSH to the bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: `kex_exchange_identification: Connection closed by remote host`
  - Not an authentication failure: the connection closes before banner exchange
- Verified:
  - Cloudflare DNS: `zlh-bastion.zerolaghub.dev` → `139.64.165.248` (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to the bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH and changing ports; same failure
- **This is NOT the container host-key issue** (that was internal and fixed).

### Next debug steps for bastion blocker

1. tcpdump on the bastion during an external connection attempt
2. OPNsense live firewall log during an external attempt
3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
4. Check for ISP/modem interference or hairpin NAT issues

Status: **Dev container SSH working internally; bastion public access blocked at the network layer.**

---

## 2025-12-28 — APIv2 Auth + Portal Alignment Session

### Work Completed

- APIv2 auth route verified functional (JWT-based)
- bcrypt password verification confirmed
- `/api/instances` endpoint verified working without auth
- Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
- Confirmed no CSRF or cookie-based auth required (stateless JWT)
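
The stateless property can be illustrated with a standard-library sketch of HS256-style signing and verification: no session store, only the key. This is a simplified illustration of the JWT pattern, not the actual APIv2 code; real handling would use a JWT library and validate claims such as `exp`.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// sign produces header.payload.signature with HMAC-SHA256, the shape of an
// HS256 JWT. Claims validation is intentionally omitted in this sketch.
func sign(payload string, key []byte) string {
	header := base64.RawURLEncoding.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	body := base64.RawURLEncoding.EncodeToString([]byte(payload))
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(header + "." + body))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return header + "." + body + "." + sig
}

// verify checks the signature statelessly: any API instance holding the key
// can validate a token without consulting a session database.
func verify(token string, key []byte) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	expected := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(parts[2]))
}

func main() {
	key := []byte("dev-only-secret")
	tok := sign(`{"sub":"user1"}`, key)
	fmt.Println(verify(tok, key), verify(tok+"x", key))
}
```

Because nothing server-side is stored per token, there is no session cookie to protect, which is the reasoning behind the no-CSRF decision below.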

### Key Findings

- Portal still contains APIv1 / Pterodactyl assumptions
- `zlh-grind` is a documentation + constraint repo only (no code)
- Instances endpoint behavior was correct; earlier failures were route misuse

### Decisions

- APIv2 auth will remain stateless (JWT only)
- No CSRF protection will be implemented
- Portal must fully remove APIv1 and Pterodactyl patterns

### Next Actions

- Enforce `requireAuth` selectively in APIv2
- Update portal login to match the APIv2 contract
- Track portal migration progress in OPEN_THREADS

---

## Jan 2026 — Portal UX & Control Plane Alignment

### Completed

- Finalized Dashboard as an awareness-only surface
- Implemented System Health indicator (non-technical)
- Reworked Notices UX with expansion and scrolling
- Removed all bulk lifecycle controls
- Defined Servers page grouping by type
- Introduced expandable server cards
- Selected "System View" as the escalation action
- Explicitly avoided AWS-style UI patterns

### Key Decisions

- Agent is authoritative for runtime state
- API v2 brokers access and permissions only
- Console / runtime access is observational first
- Control surfaces are separated from awareness surfaces

### Outcome

Portal is no longer Pterodactyl-shaped. It now supports mixed workloads cleanly.

---

## 2026-01-11 — Live Status & Agent Integration

### Summary

Finalized live server status architecture between:

- ZPack API
- zlh-agent
- Redis
- Portal UI

This session resolved incorrect / misleading server states and clarified the separation between **host health**, **agent availability**, and **service runtime**.

### Key Decisions

#### 1. Status Sources Are Explicit

- **Agent `/health`**
  - Authoritative for *host / container availability*
  - Used for DEV containers and host presence
- **Agent `/status`**
  - Authoritative for *game server runtime only*
  - Idle ≠ Offline (idle means the server is stopped but the host is up)

#### 2. DEV vs GAME Semantics

- **DEV containers**
  - No runtime process to track
  - Status = `running` if host online
- **GAME containers**
  - Runtime state controlled by agent
  - Status reflects the actual game process
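
The semantics above reduce to a small pure function (names are illustrative, not the actual API code):

```go
package main

import "fmt"

// deriveStatus applies the DEV/GAME rules: for DEV containers a live host
// means "running" (there is no runtime process to track); for GAME containers
// the host may be up while the game process is stopped, which is "idle",
// not "offline".
func deriveStatus(containerType string, hostUp, processRunning bool) string {
	if !hostUp {
		return "offline"
	}
	if containerType == "dev" {
		return "running"
	}
	if processRunning {
		return "running"
	}
	return "idle" // host up, game server stopped
}

func main() {
	fmt.Println(deriveStatus("game", true, false)) // idle, not offline
}
```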

#### 3. Redis Is the Live State Cache

- Agent poller runs every 5s
- Redis TTL = 15s
- UI reads Redis, not agents directly
- Eventual consistency is acceptable

#### 4. No Push Model (Yet)

- Agent does not push status
- API polls agents
- WebSocket / push deferred until console phase

### Result

- Portal now reflects correct, real-world state
- "Offline / Idle / Running" distinctions are accurate
- Architecture aligns with Pterodactyl-style behavior

### Next Focus

- Console (read-only output first)

---

## 2026-01-14 → 2026-01-18 — Agent PTY Console Breakthrough

### Objective

Replace log-only console streaming with a **true interactive PTY-backed console** supporting:

- Dev server shell access
- Game server (Minecraft) live console
- Full-duplex WebSocket I/O
- Session persistence across reconnects

### What Changed

- Agent console migrated from log tailing → **PTY-backed processes**
- PTY ownership moved to the agent (not frontend, not Proxmox)
- WebSocket implementation replaced with `gorilla/websocket`
- Enforced **single writer per WS connection**
- Introduced **console session manager** keyed by `{vmid, container_type}`
- Sessions survive WS disconnects (dev servers only, TTL = 60s)
  - PTY not killed on client close
- Added keepalive frames (5s) to prevent browser timeouts
- Added detailed WS + PTY logging for attach, read, write, close

### Architecture Summary

- Dev containers attach to `/bin/bash` (fallback `/bin/sh`)
- Game containers attach to the Minecraft server PTY
- WS input → PTY stdin
- PTY stdout → WS broadcast
- One PTY reader loop per session
- Multiple WS connections can attach/detach safely
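
The one-reader-loop / many-attachers shape can be sketched with channels. This is a simplification of the agent's session manager: real code attaches `gorilla/websocket` connections and a PTY, where plain channels and strings stand in here.

```go
package main

import (
	"fmt"
	"sync"
)

// session owns the single PTY reader loop and fans output out to every
// attached client. Each client gets its own buffered channel, drained by
// that client's write pump, which is one way to honor the
// single-writer-per-WS-connection rule.
type session struct {
	mu      sync.Mutex
	clients map[chan string]struct{}
}

func newSession() *session {
	return &session{clients: map[chan string]struct{}{}}
}

func (s *session) attach() chan string {
	ch := make(chan string, 16)
	s.mu.Lock()
	s.clients[ch] = struct{}{}
	s.mu.Unlock()
	return ch
}

func (s *session) detach(ch chan string) {
	s.mu.Lock()
	delete(s.clients, ch)
	s.mu.Unlock()
	close(ch)
}

// broadcast is called from the one PTY reader loop. The PTY keeps running
// even with zero attached clients, which is what lets sessions survive
// WS disconnects.
func (s *session) broadcast(line string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for ch := range s.clients {
		select {
		case ch <- line:
		default: // drop rather than block the reader loop on a slow client
		}
	}
}

func main() {
	s := newSession()
	a := s.attach()
	s.broadcast("[Server] output line")
	fmt.Println(<-a)
	s.detach(a)
}
```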

### Final Status

- ✅ Interactive console confirmed working
- ✅ Commands execute correctly
- ✅ Output streamed live
- ✅ Reconnect behavior stable

This closes the original "console v2" milestone.