Session log 2026-03-15 — architecture review, agent future work
parent 3a63662049
commit 78e8344984
100
SESSION_LOG.md
@@ -6,77 +6,71 @@
- Closed: Upload transport timeout tuning — upload route now logs explicit error categories distinguishing client abort, upstream timeout, and socket reset.
- Research: Investigated external agent standards applicable to zlh-agent. No formal standard maps cleanly — agent is purpose-built (embedded process manager + filesystem authority inside LXC, internal-only caller). Key findings: health probe split (/healthz liveness vs /readyz readiness) is a common convention but not required given single-caller architecture; graceful shutdown (SIGINT/SIGTERM, 10s timeout) is correct; structured lifecycle logging already solid; Go 1.21 slog exists if log unification is ever wanted. No open threads opened from this — no gaps requiring action.
- Grind repo updated to reflect current platform state:
  - README.md — added Current Platform Capabilities section (March 2026)
  - Agent_Endpoint_Specifications_Phase1.md — created; full route reference including updated /status crash metadata fields
  - Session_Summaries/2026-03-01_Upload-Pipeline-Filesystem-Consolidation.md — created; includes upload transport fix (502 root cause + resolution)
  - Frontend/TerminalView_Component.md — created; includes reconnect behavior (1–5s backoff, no page refresh required)
  - ZeroLagHub_Quick_Status_Feb2026.md — created; reflects March 2026 operational state
  - Session_Summaries/2026-03-10_Agent_Observability_Update.md — created; structured logging, crash metadata fields, ReadinessTimeout constant, crash semantics unchanged
- Grind repo updated to reflect current platform state. Misplaced architecture docs later removed — canonical docs belong in knowledge-base not zlh-grind.

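Sketch only, not taken from zlh-agent: if log unification with slog is ever wanted, a lifecycle record with queryable fields could look roughly like this. The field names (`component`, `event`, `vmid`) and the helper `logLifecycle` are assumptions for illustration; only `log/slog` itself (Go 1.21+) is real.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// logLifecycle emits one structured JSON record and returns it decoded,
// so the resulting fields can be inspected. Field names are hypothetical.
func logLifecycle(component, event string, vmid int) (map[string]any, error) {
	var buf bytes.Buffer
	// A JSON handler keeps every attribute as a separate key.
	logger := slog.New(slog.NewJSONHandler(&buf, nil)).With("component", component)
	logger.Info("lifecycle", "event", event, "vmid", vmid)

	var rec map[string]any
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return nil, err
	}
	return rec, nil
}

func main() {
	rec, err := logLifecycle("zlh-agent", "server_started", 101)
	if err != nil {
		panic(err)
	}
	fmt.Println(rec["level"], rec["msg"], rec["component"], rec["event"], rec["vmid"])
}
```

In a real agent the handler would write to stdout or a log file rather than a buffer; the buffer is only here to make the record easy to inspect.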
---

## 2026-03-14

Goal:
Stabilize dev container provisioning across agent, portal, and artifact server.

Scope:

- catalog-driven runtime installs
- dotnet runtime support
- dev workspace environment
- portal provisioning updates

Explicit non-goals:

- platform architecture changes
- container orchestration changes

Canonical refs used:

- ZeroLagHub knowledge base
- dev runtime provisioning spec
Goal: Stabilize dev container provisioning across agent, portal, and artifact server.

Work completed:

Agent:

- switched dev runtime validation to artifact catalog
- runtime installs moved to `/opt/zlh/runtimes`
- implemented runtime install guards
- implemented dev user environment
- updated dev shell to run as `dev`
- added dotnet runtime provisioning
- added optional code-server provisioning
- catalog-driven runtime validation via devcontainer/_catalog.json (ValidateRuntimeSelection in common.go)
- EnsureDevUserEnvironment — dev user, /home/dev/workspace, correct ownership
- dotnet runtime provisioning added
- optional code-server addon provisioning added
- WriteReadyMarker after successful provision
- RuntimeInstalled() filesystem-based install guard in common.go

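Sketch of catalog-driven validation, for orientation only. The real schema of devcontainer/_catalog.json and the real signature of ValidateRuntimeSelection are not recorded in this log, so the `catalog` type, the `runtimes` key, the sample versions, and the helper names below are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// catalog is a hypothetical shape: runtime name -> list of allowed versions.
type catalog struct {
	Runtimes map[string][]string `json:"runtimes"`
}

// parseCatalog decodes catalog JSON fetched from the artifact server.
func parseCatalog(data []byte) (catalog, error) {
	var c catalog
	if err := json.Unmarshal(data, &c); err != nil {
		return catalog{}, fmt.Errorf("parse catalog: %w", err)
	}
	return c, nil
}

// validateSelection rejects any runtime/version pair the catalog does not list.
func validateSelection(c catalog, name, version string) error {
	versions, ok := c.Runtimes[name]
	if !ok {
		return fmt.Errorf("runtime %q is not in the catalog", name)
	}
	for _, v := range versions {
		if v == version {
			return nil
		}
	}
	return fmt.Errorf("runtime %q has no version %q in the catalog", name, version)
}

func main() {
	// Sample data only; not the real catalog contents.
	raw := []byte(`{"runtimes": {"node": ["20"], "dotnet": ["8.0"]}}`)
	c, err := parseCatalog(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(validateSelection(c, "dotnet", "8.0"))
	fmt.Println(validateSelection(c, "dotnet", "6.0"))
}
```

Validating against the catalog before any install step means an unsupported selection fails early, without touching the filesystem.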
Portal:

- added dotnet runtime
- added enable code-server option
- enabled Files tab for dev containers
- dev file uploads now current-directory based
- dotnet runtime added
- enable code-server option added
- Files tab enabled for dev containers

API:
- enable_code_server field added to dev provisioning payload

- added `enable_code_server` field to dev provisioning payload
Blocker:
- code-server artifact on zlh-artifacts contains source repository, not compiled release
- install.sh expects bin/code-server, lib/, node_modules/ — compiled release required
- fix: replace artifact with official code-server release tarball (e.g. code-server-4.x.x-linux-amd64.tar.gz)

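A pre-install layout check would surface this blocker before install.sh runs. Sketch under assumptions: the three required entries come from the note above, but the helper `missingEntries`, the example directory, and the idea of checking an already-extracted directory are hypothetical, not existing agent code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requiredEntries lists what install.sh expects inside the extracted artifact.
var requiredEntries = []string{"bin/code-server", "lib", "node_modules"}

// missingEntries returns the required entries that cannot be stat'ed under dir.
// Any stat error (not only "not found") is treated as missing.
func missingEntries(dir string) []string {
	var missing []string
	for _, rel := range requiredEntries {
		if _, err := os.Stat(filepath.Join(dir, filepath.FromSlash(rel))); err != nil {
			missing = append(missing, rel)
		}
	}
	return missing
}

func main() {
	// Hypothetical extraction directory, for illustration only.
	dir := "/tmp/code-server-extracted"
	if missing := missingEntries(dir); len(missing) > 0 {
		fmt.Println("artifact is not a compiled release, missing:", missing)
		return
	}
	fmt.Println("artifact layout looks like a compiled release")
}
```

A source checkout would typically fail on bin/code-server and node_modules, which matches the symptom described above.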
Artifact Server:
---

- runtime artifacts verified
- dotnet installer aligned with artifact fetch model
## 2026-03-15

Issues discovered:
Architecture review session. No code changes. Key decisions and findings:

- code-server artifact contains repository source
- installer expects packaged release
Dev container model:
- 1 server / 1 container / 1 world confirmed as correct model
- Dev containers: full R/W access under /home/dev/workspace, no allowlist
- Multiverse/multi-world via plugins is customer-managed, not a platform concern
- Port exposure (dev-<vmid>.zerolaghub.com) identified as next major dev feature — future work
- Wildcard DNS pre-planning needed before port exposure implementation
- dotnet SDK covers all C# game modding (Valheim, Core Keeper, Vintage Story, Rust/Oxide)
- Code Server confirmed as correct browser IDE approach given single public IP constraint

Resolution:
Artifact server must provide compiled code-server release archive.
Agent review (zlh-agent commit 6019d0bc — 2026-03-15):
- Catalog transition confirmed correct — ValidateRuntimeSelection gates all dev provisions
- Scripts unchanged — embedded script execution via bash stdin pipe, no 126 risk from runtime installs
- devcontainer/common.go is clean and complete
- node/verify.go has hardcoded /opt/zlh/runtime/node/bin/node — wrong path, pre-existing issue, not a regression
- node/python/go/java install packages still use old version-unaware marker pattern — pre-existing, not a regression from catalog work
- node exporter baked into containers — disk/CPU/memory/network already covered by Prometheus
- Promtail present but status unknown

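For the two pre-existing issues above (hardcoded path, version-unaware markers), one possible direction is to derive every runtime path from a single root and include the version in it. This is a sketch, not the RuntimeInstalled() implementation in common.go: the `<root>/<name>/<version>/bin/<binary>` layout and both helper names are assumptions; only the `/opt/zlh/runtimes` root comes from the notes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// runtimesRoot is the install root recorded in the 2026-03-14 notes.
const runtimesRoot = "/opt/zlh/runtimes"

// runtimeBinary builds the expected binary path for one runtime version.
// The directory layout is hypothetical.
func runtimeBinary(root, name, version, binary string) string {
	return filepath.Join(root, name, version, "bin", binary)
}

// runtimeInstalled reports whether that binary exists as a regular file,
// which makes the guard version-aware without a separate marker file.
func runtimeInstalled(root, name, version, binary string) bool {
	info, err := os.Stat(runtimeBinary(root, name, version, binary))
	return err == nil && info.Mode().IsRegular()
}

func main() {
	fmt.Println(runtimeBinary(runtimesRoot, "node", "20", "node"))
	fmt.Println(runtimeInstalled(runtimesRoot, "node", "20", "node"))
}
```

With a shared path helper, a verify step and an install guard cannot drift apart on the root directory the way a hardcoded string can.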
Current system state:
Agent future work identified (not yet implemented, priority order):
1. Unified structured logging (slog) — Promtail/Loki integration needs structured fields to be queryable
2. Dev container status/readiness — /status needs provisioningComplete + provisioningError for dev containers
3. Crash recovery with backoff — auto-restart on crash with increasing delay (30s/60s/120s), max 3 attempts, then error state
4. Graceful shutdown verification — confirm SIGTERM + wait before SIGKILL for Minecraft world save safety
5. Agent restart/process reattachment — detect existing process on agent restart, reattach rather than double-start
6. Disk pressure warning in /status — agent-level signal before node exporter threshold alerts

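The crash-recovery schedule in item 3 can be sketched as a pure function. Not implemented anywhere yet; the name `restartDelay` and the 1-based attempt numbering are assumptions, while the 30s/60s/120s delays and the 3-attempt cap come from the list above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxRestartAttempts = 3
	baseRestartDelay   = 30 * time.Second
)

// restartDelay returns the wait before the given 1-based restart attempt.
// ok is false once the attempt budget is exhausted (or the attempt number is
// invalid), at which point the caller should enter the error state.
func restartDelay(attempt int) (delay time.Duration, ok bool) {
	if attempt < 1 || attempt > maxRestartAttempts {
		return 0, false
	}
	// Double the base delay per attempt: 30s, 60s, 120s.
	return baseRestartDelay << (attempt - 1), true
}

func main() {
	for attempt := 1; attempt <= maxRestartAttempts+1; attempt++ {
		delay, ok := restartDelay(attempt)
		if !ok {
			fmt.Printf("attempt %d: budget exhausted, enter error state\n", attempt)
			continue
		}
		fmt.Printf("attempt %d: restart after %s\n", attempt, delay)
	}
}
```

Keeping the schedule separate from the process manager makes it easy to unit test, and leaves open when the attempt counter resets (for example after a period of stable uptime), which this log does not decide.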
- dev runtime provisioning operational
- dotnet runtime operational
- portal provisioning operational
- code-server blocked by artifact packaging
Explicitly out of scope (not Pterodactyl):
- User management, permissions, billing
- Multi-container orchestration
- Plugin/extension systems
- Anything owned by API or portal