zlh-grind/SESSION_LOG.md

Session Log: zlh-grind

Append-only execution log for GPT-assisted development work.
Do not rewrite or reorder past entries.


2025-12-14

  • Goal: Stabilize zlh-agent provisioning pipeline for game + dev containers + addons without regressing game provisioning.
  • Scope: Agent routing (ctype=dev), local artifacts strategy, addon skeleton, initial end-to-end tests.
  • Adopted single base template approach (zlh-agent template) where agent installs roles at runtime.

2025-12-20

  • Goal: Restore reliable end-to-end provisioning for game + dev without regressions; preserve the explicit orchestration steps/logging.

Current Architecture Reality (IMPORTANT)

  • We are on a single base template: AGENT_TEMPLATE_VMID is the only template in .env.
  • In this workflow, any zlh-agent code change requires rebuilding + pushing the binary into the template container.
    • That effectively means: agent change = new template promotion.

Key Findings / Fixes

  • Game provisioning steps are functioning again and show clear step-by-step orchestration output (allocate VMID → clone → configure → start → IP → agent config → wait → DB save → edge publish).
  • Dev provisioning failure was traced to a wire contract mismatch between API JSON and agent state.Config struct tags:
    • Agent expects: container_type (snake_case) on the JSON wire
    • API refactors were sending: ctype (abbreviated) or containerType (camelCase) in some variants
    • Result: cfg.ContainerType == "" on decode → agent routes into game path → errors like:
      • unsupported container identity (containerType="" game="")
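
The decode behavior behind this finding can be reproduced with a minimal sketch. The Config type below is a hypothetical stand-in for the agent's state.Config: only the container_type tag comes from this log, and the Game field and decodeConfig helper are assumptions for illustration. encoding/json silently ignores keys that match no tag, which is why the mismatch surfaces as an empty field rather than a decode error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical stand-in for the agent's state.Config.
// Only the container_type tag is taken from the log; Game is illustrative.
type Config struct {
	ContainerType string `json:"container_type"`
	Game          string `json:"game"`
}

// decodeConfig unmarshals an API payload into Config.
// Unknown keys such as ctype or containerType are dropped without an error.
func decodeConfig(payload string) (Config, error) {
	var cfg Config
	err := json.Unmarshal([]byte(payload), &cfg)
	return cfg, err
}

func main() {
	payloads := []string{
		`{"container_type":"dev"}`, // matches the agent contract
		`{"ctype":"dev"}`,          // abbreviated variant, ignored
		`{"containerType":"dev"}`,  // camelCase variant, ignored
	}
	for _, p := range payloads {
		cfg, err := decodeConfig(p)
		fmt.Printf("%-28s ContainerType=%q err=%v\n", p, cfg.ContainerType, err)
	}
}
```

Only the first payload populates ContainerType; the other two decode cleanly but leave it empty, which is consistent with the agent falling through to the game path.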

Decision (to minimize churn)

  • Because agent changes force a template promotion in this pipeline, the correct short-term move is:
    • Update API to emit container_type to match the existing agent contract
    • Avoid touching agent code until we intentionally rev the template

Operational Guardrails

  • Do NOT remove or "simplify" the explicit provisioning steps/logging; they are required for debugging and operator confidence.
  • Treat container_type as the canonical wire key until the next planned template rev.

2025-12-21

  • Goal: Restore reliable end-to-end provisioning for devcontainers and agent-managed installs.
  • Observed repeated failures during devcontainer runtime installation (node, python, go, java).
  • Initial assumption was an installer regression; investigation showed the installers were enforcing their contract correctly.
  • Root cause identified: agent was not exporting required runtime environment variables (notably RUNTIME_VERSION).

Devcontainer provisioning investigation

  • Deep dive on zlh-agent devcontainer provisioning flow.
  • Confirmed that all devcontainer installers intentionally require RUNTIME_VERSION and fail fast if missing.
  • Clarified that payload JSON is not read by installers; agent must project intent via environment variables.
  • Verified that installer logic itself (artifact naming, extraction, symlink layout) was correct.
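
A rough sketch of what "projecting intent via environment variables" could look like on the agent side. DevPayload and installerEnv are hypothetical names, not the agent's actual types; only RUNTIME_VERSION is taken from this log. The helper mirrors the installers' fail-fast behavior so a missing version is caught before any script runs.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// DevPayload is a hypothetical subset of the provisioning payload.
type DevPayload struct {
	Runtime string
	Version string
}

var errNoVersion = errors.New("runtime version missing from payload")

// installerEnv copies base and appends RUNTIME_VERSION from the payload.
// It refuses to build an environment when the version is empty.
func installerEnv(base []string, p DevPayload) ([]string, error) {
	if p.Version == "" {
		return nil, errNoVersion
	}
	env := make([]string, 0, len(base)+1)
	env = append(env, base...)
	env = append(env, "RUNTIME_VERSION="+p.Version)
	return env, nil
}

func main() {
	env, err := installerEnv(os.Environ(), DevPayload{Runtime: "node", Version: "24"})
	if err != nil {
		fmt.Println("refusing to run installer:", err)
		return
	}
	// The last entry is the variable the installers require.
	fmt.Println(env[len(env)-1])
}
```

The returned slice is in the form expected by exec.Cmd.Env, so it could be assigned directly when launching an installer.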

Embedded installer execution findings

  • Agent executes installers as embedded scripts, not filesystem paths.
  • Identified critical requirement: shared installer logic (common.sh) and runtime installer must execute in the same shell session.
  • Failure mode observed: install_runtime: command not found → caused by running runtime installer without common.sh loaded.
  • Confirmed this explains missing runtime directories and lack of artifact downloads.

Installer architecture changes

  • Refactored installer model to:
    • common.sh: shared, strict, embedded-safe installation logic
    • per-runtime installers (node, python, go, java) as declarative descriptors only
  • Established that runtime installers are intentionally minimal and declarative by design.
  • Confirmed that this preserves existing runtime layout: /opt/zlh/runtime/<language>/<version>/current

Artifact layout update

  • Artifact naming and layout standardized to simplified form:
    • node-24.tar.xz
    • python-3.12.tar.xz
    • go-1.22.tar.gz
    • jdk-21.tar.gz
  • Identified mismatch between runtime name and archive prefix (notably Java, whose archive is named jdk-21.tar.gz).
  • Introduced ARCHIVE_PREFIX as a runtime-level variable to resolve naming cleanly.
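
As a sketch of how a per-runtime archive prefix resolves the naming, the table below maps each runtime to the prefix and extension implied by the artifact list above. The runtimeSpec type, the specs map, and artifactName are hypothetical; in the installers themselves the prefix is carried by the ARCHIVE_PREFIX shell variable rather than a Go map.

```go
package main

import "fmt"

// runtimeSpec is a hypothetical descriptor: the archive prefix may differ
// from the runtime name, and the extension varies per runtime.
type runtimeSpec struct {
	ArchivePrefix string
	Ext           string
}

// specs mirrors the artifact names listed in this log.
var specs = map[string]runtimeSpec{
	"node":   {ArchivePrefix: "node", Ext: "tar.xz"},
	"python": {ArchivePrefix: "python", Ext: "tar.xz"},
	"go":     {ArchivePrefix: "go", Ext: "tar.gz"},
	"java":   {ArchivePrefix: "jdk", Ext: "tar.gz"},
}

// artifactName builds "<prefix>-<version>.<ext>" for a known runtime.
func artifactName(runtime, version string) (string, error) {
	spec, ok := specs[runtime]
	if !ok {
		return "", fmt.Errorf("unknown runtime %q", runtime)
	}
	return fmt.Sprintf("%s-%s.%s", spec.ArchivePrefix, version, spec.Ext), nil
}

func main() {
	name, err := artifactName("java", "21")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(name) // jdk-21.tar.gz
}
```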

Final conclusions

  • No regression in installer logic; failures were execution-order and environment-projection issues.
  • Correct fix is agent-side:
    • concatenate common.sh + runtime installer into one bash invocation
    • inject RUNTIME_VERSION (and related vars) into environment
  • Architecture now supports deterministic, artifact-driven, embedded-safe installs.

Status: Root cause identified; fix pending agent patch and installer updates.