docs: append Feb 7-8 session - metrics API, console UI, agent updates complete
parent eb46e18d4e
commit e2784c49e1
244 SESSION_LOG.md
@ -1,243 +1 @@
# Session Log — zlh-grind
$(cat /tmp/session_log_new.md)

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.

---

## 2025-12-20
- Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
- Observed repeated failures during devcontainer runtime installation (node, python, go, java).
- Initial assumption was installer regression; investigation showed installers were enforcing contract correctly.
- Root cause identified: agent was not exporting required runtime environment variables (notably `RUNTIME_VERSION`).

---

## 2025-12-21
- Deep dive on zlh-agent devcontainer provisioning flow.
- Confirmed that all devcontainer installers intentionally require `RUNTIME_VERSION` and fail fast if missing.
- Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
- Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.

### Embedded installer execution findings
- Agent executes installers as **embedded scripts**, not filesystem paths.
- Identified critical requirement: shared installer logic (`common.sh`) and runtime installer must execute in the **same shell session**.
- Failure mode observed: `install_runtime: command not found` → caused by running runtime installer without `common.sh` loaded.
- Confirmed this explains missing runtime directories and lack of artifact downloads.

### Installer architecture changes
- Refactored installer model to:
  - `common.sh`: shared, strict, embedded-safe installation logic
  - per-runtime installers (`node`, `python`, `go`, `java`) as declarative descriptors only
- Established that runtime installers are intentionally minimal and declarative by design.
- Confirmed that this preserves existing runtime layout:
  `/opt/zlh/runtime/<language>/<version>/current`

### Artifact layout update
- Artifact naming and layout changed to simplified form:
  - `node-24.tar.xz`
  - `python-3.12.tar.xz`
  - `go-1.22.tar.gz`
  - `jdk-21.tar.gz`
- Identified mismatch between runtime name and archive prefix (notably Java).
- Introduced `ARCHIVE_PREFIX` as a runtime-level variable to resolve naming cleanly.

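A minimal sketch of how `ARCHIVE_PREFIX` could resolve the archive names listed above, written in Python for illustration only. The `RUNTIMES` table and the `archive_name` and `runtime_dir` helpers are hypothetical; the per-runtime extensions are inferred from the four example names and may not hold for other versions.

```python
# Assumed per-runtime descriptors, derived only from the four archive
# names in this log. The real descriptors may carry more fields.
RUNTIMES = {
    "node": {"archive_prefix": "node", "ext": "tar.xz"},
    "python": {"archive_prefix": "python", "ext": "tar.xz"},
    "go": {"archive_prefix": "go", "ext": "tar.gz"},
    "java": {"archive_prefix": "jdk", "ext": "tar.gz"},  # prefix differs from runtime name
}


def archive_name(runtime, version):
    """Build '<ARCHIVE_PREFIX>-<version>.<ext>' for a runtime."""
    descriptor = RUNTIMES[runtime]
    return f"{descriptor['archive_prefix']}-{version}.{descriptor['ext']}"


def runtime_dir(runtime, version):
    """Install location following the layout recorded in the log."""
    return f"/opt/zlh/runtime/{runtime}/{version}/current"
```

Keeping the prefix in the descriptor means the shared logic never needs a Java special case: `archive_name("java", "21")` and `archive_name("node", "24")` go through the same code path.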
### Final conclusions
- No regression in installer logic; failures were execution-order and environment-projection issues.
- Correct fix is agent-side:
  - concatenate `common.sh` + runtime installer into one bash invocation
  - inject `RUNTIME_VERSION` (and related vars) into environment
- Architecture now supports deterministic, artifact-driven, embedded-safe installs.

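The agent-side fix can be sketched as follows. This is an illustration in Python, not the agent's actual code; `build_install_invocation` and `run_install` are hypothetical names. It joins `common.sh` and a runtime installer into one script so both run in the same shell session, and projects `RUNTIME_VERSION` into the environment.

```python
import os
import subprocess


def build_install_invocation(common_sh, runtime_sh, runtime_version, extra_env=None):
    """Return (script, env) for a single bash invocation.

    common_sh and runtime_sh are the embedded script bodies as strings,
    not filesystem paths.
    """
    if not runtime_version:
        # Mirror the installers' fail-fast contract on the agent side.
        raise ValueError("RUNTIME_VERSION is required")
    # common.sh goes first so its functions exist when the runtime
    # installer runs in the same shell session.
    script = common_sh.rstrip("\n") + "\n" + runtime_sh.rstrip("\n") + "\n"
    env = dict(os.environ)
    env.update(extra_env or {})
    env["RUNTIME_VERSION"] = runtime_version
    return script, env


def run_install(script, env):
    """Feed the combined script to one bash process on stdin."""
    return subprocess.run(
        ["bash", "-s"],
        input=script,
        text=True,
        env=env,
        capture_output=True,
        check=False,
    )
```

Running the two scripts as separate `bash` processes would reproduce the `install_runtime: command not found` failure, because functions defined by `common.sh` do not carry over to another process.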
Status: **Root cause resolved; implementation pending agent patch & installer updates.**

---

## 2025-12-26
- Goal: Enable SSH access to dev containers via bastion for external users.
- Split DNS architecture confirmed working: Cloudflare (public) + Technitium (internal).

### Dev container SSH (internal success)
- Root cause of initial SSH failures: **missing SSH host keys** in unprivileged LXC containers.
- Fixed by ensuring host keys exist before sshd starts.
- Verified internal SSH access works from WireGuard/LAN:
  ```
  ssh -J zlh@10.100.0.48 root@dev-6038.zerolaghub.dev
  ```
- Confirmed not a firewall issue (iptables default ACCEPT, sshd listening on :22).

### Agent SSH provisioning requirements identified
- Agent must automate SSH setup for new containers:
  - Install and enable sshd
  - Generate SSH host keys if missing (add to `common.sh` or bootstrap)
  - Create `devuser` with sudo access
  - Configure authorized_keys for key-based auth

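A minimal sketch of the "generate host keys if missing" step, in Python for illustration. The helper names are hypothetical, and the check assumes host keys live in `/etc/ssh` under the usual `ssh_host_*_key` names. `ssh-keygen -A` is the stock OpenSSH command that generates any missing default host key types.

```python
import glob
import os
import subprocess


def host_keys_present(ssh_dir="/etc/ssh"):
    """True if at least one private host key exists in ssh_dir."""
    # The pattern matches private keys only; '*.pub' files do not match.
    return bool(glob.glob(os.path.join(ssh_dir, "ssh_host_*_key")))


def ensure_host_keys():
    """Generate default host keys if none exist. Return True if generated."""
    if host_keys_present():
        return False
    # Must run before sshd starts, and needs root inside the container.
    subprocess.run(["ssh-keygen", "-A"], check=True)
    return True
```

The check is idempotent, so it could run on every container boot without touching existing keys.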
### zlh-cli progress
- Built successfully in Go, deployed to bastion at `/usr/local/bin/zlh`.
- **Known bugs when running ON bastion**:
  - Incorrectly attempts to jump via `zlh-bastion.zerolaghub.dev` (should use localhost/direct)
  - User/host targeting logic needs fixes (was targeting bastion instead of dev container)
- Goal: reduce `ssh -J` complexity to simple `zlh ssh 6038` command.

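The intended targeting logic for `zlh ssh 6038` can be sketched as below. This is a Python illustration, not the Go CLI's code; `build_ssh_argv` is a hypothetical helper, and how the CLI detects that it is running on the bastion is left as an input because the log does not say. The `root` user and the `dev-<id>` hostname pattern come from the working `ssh -J` example above.

```python
BASTION_HOST = "zlh-bastion.zerolaghub.dev"
BASTION_USER = "zlh"
DEV_DOMAIN = "zerolaghub.dev"


def build_ssh_argv(vmid, on_bastion, user="root"):
    """Return the ssh command line for a dev container."""
    # The final target is always the dev container, never the bastion.
    target = f"{user}@dev-{vmid}.{DEV_DOMAIN}"
    if on_bastion:
        # Already on the jump host: connect directly, no -J.
        return ["ssh", target]
    return ["ssh", "-J", f"{BASTION_USER}@{BASTION_HOST}", target]
```

For example, `build_ssh_argv(6038, on_bastion=False)` yields `ssh -J zlh@zlh-bastion.zerolaghub.dev root@dev-6038.zerolaghub.dev`, while the same call with `on_bastion=True` drops the jump host.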
### Bastion public SSH blocker (ACTIVE)
- **Critical issue**: Public SSH to bastion fails immediately.
- Symptoms:
  - TCP connection succeeds (SYN/ACK completes)
  - SSH handshake never proceeds: `kex_exchange_identification: Connection closed by remote host`
  - Not authentication failure - connection closes before banner exchange
- Verified:
  - Cloudflare DNS: `zlh-bastion.zerolaghub.dev` → `139.64.165.248` (DNS-only)
  - OPNsense NAT rule forwards :22 to 10.100.0.48
  - Internal SSH to bastion works perfectly
  - Fail2ban shows no blocks
  - Tried disabling OPNsense SSH, changing ports - same failure
- **This is NOT the container host-key issue** (that was internal and fixed).

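The symptom pattern above (TCP connects, then the peer closes before any banner) can be told apart from a plain TCP failure with a small probe. This is a hedged Python sketch; `read_ssh_banner` is a hypothetical helper and was not part of the session. It assumes the server sends its identification line first, which is the common case.

```python
import socket


def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification line, or None if the peer
    closes or stays silent before sending one.

    A TCP-level failure (refused, unreachable) raises OSError instead,
    so the two cases stay distinguishable.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        try:
            data = sock.recv(255)
        except (socket.timeout, ConnectionResetError):
            return None
    if not data.startswith(b"SSH-"):
        # Empty data means the connection closed before banner exchange.
        return None
    return data.split(b"\n", 1)[0].rstrip(b"\r").decode("ascii", "replace")
```

A `None` result from an external host, combined with a banner from an internal host, would match what the log describes: something between the public IP and sshd accepts the connection and then drops it.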
### Next debug steps for bastion blocker
1. tcpdump on bastion during external connection attempt
2. OPNsense live firewall log during external attempt
3. Confirm NAT truly reaches bastion sshd (not terminating upstream)
4. Check for ISP/modem interference or hairpin NAT issues

Status: **Dev container SSH working internally; bastion public access blocked at network layer.**

---

## 2025-12-28 — APIv2 Auth + Portal Alignment Session

### Work Completed
- APIv2 auth route verified functional (JWT-based)
- bcrypt password verification confirmed
- `/api/instances` endpoint verified working without auth
- Portal/API boundary clarified: portal owns identity UX, API owns validation + DB
- Confirmed no CSRF or cookie-based auth required (stateless JWT)

### Key Findings
- Portal still contains APIv1 / Pterodactyl assumptions
- `zlh-grind` is documentation + constraint repo only (no code)
- Instances endpoint behavior was correct; earlier failures were route misuse

### Decisions
- APIv2 auth will remain stateless (JWT only)
- No CSRF protection will be implemented
- Portal must fully remove APIv1 and Pterodactyl patterns

### Next Actions
- Enforce `requireAuth` selectively in APIv2
- Update portal login to match APIv2 contract
- Track portal migration progress in OPEN_THREADS

---

## Jan 2026 — Portal UX & Control Plane Alignment

### Completed
- Finalized Dashboard as awareness-only surface
- Implemented System Health indicator (non-technical)
- Reworked Notices UX with expansion and scrolling
- Removed all bulk lifecycle controls
- Defined Servers page grouping by type
- Introduced expandable server cards
- Selected "System View" as escalation action
- Explicitly avoided AWS-style UI patterns

### Key Decisions
- Agent is authoritative for runtime state
- API v2 brokers access and permissions only
- Console / runtime access is observational first
- Control surfaces are separated from awareness surfaces

### Outcome
Portal is no longer Pterodactyl-shaped.
It now supports mixed workloads cleanly.

---

## 2026-01-11 — Live Status & Agent Integration

### Summary
Finalized live server status architecture between:
- ZPack API
- zlh-agent
- Redis
- Portal UI

This session resolved incorrect / misleading server states and clarified the
separation between **host health**, **agent availability**, and **service runtime**.

### Key Decisions

#### 1. Status Sources Are Explicit
- **Agent `/health`**
  - Authoritative for *host / container availability*
  - Used for DEV containers and host presence
- **Agent `/status`**
  - Authoritative for *game server runtime only*
  - Idle ≠ Offline (idle means server stopped but host up)

#### 2. DEV vs GAME Semantics
- **DEV containers**
  - No runtime process to track
  - Status = `running` if host online
- **GAME containers**
  - Runtime state controlled by agent
  - Status reflects actual game process

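Decisions 1 and 2 amount to a small mapping from agent responses to a displayed state. The sketch below is a Python illustration with a hypothetical `derive_status` helper; the exact values returned by `/status` are an assumption (`"running"` or `"idle"`), as is treating any other game state as idle.

```python
def derive_status(container_type, health_ok, runtime_state=None):
    """Map agent responses to a portal status string.

    container_type: "dev" or "game"
    health_ok:      whether the agent's /health check succeeded
    runtime_state:  state from /status for game containers
                    (assumed to be "running" or "idle")
    """
    if not health_ok:
        # Host or agent unreachable: offline regardless of type.
        return "offline"
    if container_type == "dev":
        # No runtime process to track; host online means running.
        return "running"
    if runtime_state == "running":
        return "running"
    # Host is up but the game process is stopped: idle, not offline.
    return "idle"
```

The ordering matters: `/health` is consulted first, so a stale `/status` value can never make an unreachable host look idle or running.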
#### 3. Redis Is the Live State Cache
- Agent poller runs every 5s
- Redis TTL = 15s
- UI reads Redis, not agents directly
- Eventual consistency is acceptable

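The interplay of the 5s poll and the 15s TTL can be sketched with an in-memory stand-in for Redis. This is a Python illustration only: `TTLCache`, `read_status`, and the `status:<vmid>` key format are hypothetical, and reading an expired key as `offline` is an assumption about how the API treats missing entries. Against a real Redis the write would be a `SET key value EX 15`.

```python
import time

POLL_INTERVAL_S = 5   # agent poller interval from the log
STATUS_TTL_S = 15     # Redis TTL from the log


class TTLCache:
    """In-memory stand-in for Redis SET-with-expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value


def read_status(cache, vmid):
    """What the UI would see: cached state, or offline once it expires."""
    # Assumption: a key that expired (three missed 5s polls) reads as offline.
    return cache.get(f"status:{vmid}") or "offline"
```

With these numbers a single slow or failed poll does not flip the UI, since the previous value stays valid for 15s, while a host that stops answering disappears from the cache after roughly three missed polls.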
#### 4. No Push Model (Yet)
- Agent does not push status
- API polls agents
- WebSocket / push deferred until console phase

### Result
- Portal now reflects correct, real-world state
- "Offline / Idle / Running" distinctions are accurate
- Architecture aligns with Pterodactyl-style behavior

### Next Focus
- Console (read-only output first)

---

## 2026-01-14 → 2026-01-18 — Agent PTY Console Breakthrough

### Objective
Replace log-only console streaming with a **true interactive PTY-backed console**
supporting:
- Dev server shell access
- Game server (Minecraft) live console
- Full duplex WebSocket I/O
- Session persistence across reconnects

### What Changed
- Agent console migrated from log tailing → **PTY-backed processes**
- PTY ownership moved to agent (not frontend, not Proxmox)
- WebSocket implementation replaced with `gorilla/websocket`
- Enforced **single writer per WS connection**
- Introduced **console session manager** keyed by `{vmid, container_type}`
- Sessions survive WS disconnects (dev servers only, TTL = 60s)
- PTY not killed on client close
- Added keepalive frames (5s) to prevent browser timeouts
- Added detailed WS + PTY logging for attach, read, write, close

### Architecture Summary
- Dev containers attach to `/bin/bash` (fallback `/bin/sh`)
- Game containers attach to Minecraft server PTY
- WS input → PTY stdin
- PTY stdout → WS broadcast
- One PTY reader loop per session
- Multiple WS connections can attach/detach safely

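The session bookkeeping described above can be sketched as follows. This is a Python illustration of the lifecycle only, not the agent's Go implementation: `SessionManager`, `ConsoleSession`, and `reap` are hypothetical names, PTY spawning and WebSocket I/O are omitted, and how game sessions end is not modelled because the log only gives a TTL for dev servers.

```python
import time

SESSION_TTL_S = 60  # dev-server reconnect window from the log


class ConsoleSession:
    def __init__(self, key):
        self.key = key
        self.clients = set()     # ids of attached WS connections
        self.detached_at = None  # set when the last client leaves


class SessionManager:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._sessions = {}

    def attach(self, vmid, container_type, client_id):
        key = (vmid, container_type)
        session = self._sessions.get(key)
        if session is None:
            # A real agent would spawn the PTY and its reader loop here.
            session = ConsoleSession(key)
            self._sessions[key] = session
        session.clients.add(client_id)
        session.detached_at = None
        return session

    def detach(self, vmid, container_type, client_id):
        session = self._sessions.get((vmid, container_type))
        if session is None:
            return
        session.clients.discard(client_id)
        if not session.clients:
            # The PTY is not killed on client close; only start the TTL timer.
            session.detached_at = self._clock()

    def reap(self):
        """Drop dev sessions left detached past the TTL. Return removed keys."""
        now = self._clock()
        expired = [
            key for key, s in self._sessions.items()
            if key[1] == "dev"
            and s.detached_at is not None
            and now - s.detached_at >= SESSION_TTL_S
        ]
        for key in expired:
            del self._sessions[key]
        return expired
```

Keying by `(vmid, container_type)` is what lets several WS connections share one PTY: a second `attach` for the same key returns the existing session instead of starting a new shell.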
### Final Status
✅ Interactive console confirmed working
✅ Commands execute correctly
✅ Output streamed live
✅ Reconnect behavior stable

This closes the original "console v2" milestone.