diff --git a/Runbooks/Agent_Release_Runbook.md b/Runbooks/Agent_Release_Runbook.md
new file mode 100644
index 0000000..8e185c3
--- /dev/null
+++ b/Runbooks/Agent_Release_Runbook.md
@@ -0,0 +1,195 @@
+# 🚀 Agent Release Runbook
+
+**Applies To**: `zlh-agent` Go binary
+**Artifact Server**: `10.60.0.251` (`/opt/zlh/agents/`)
+**Agent HTTP Port**: `18888`
+**Last Updated**: February 22, 2026
+
+---
+
+## Overview
+
+This runbook covers the full process for building, uploading, and rolling out a new `zlh-agent` release. Follow all steps in order. Do not skip the canary validation step before rolling out to remaining nodes.
+
+---
+
+## Step 1 — Choose Next Version
+
+Always bump the version. Never reuse an existing version number or folder.
+
+Use semantic versioning: `MAJOR.MINOR.PATCH`
+Example: if current is `1.0.7`, next is `1.0.8`.
+
+Check the current manifest to confirm what's already published:
+```bash
+curl -s http://10.60.0.251:8080/agents/manifest.json | jq '.latest'
+```
+
+---
+
+## Step 2 — Build the Release Artifact
+
+Always build via the release script. Do **not** use `go build -o zlh-agent` directly.
+
+```bash
+cd /opt/zlh-agent
+./scripts/build-release.sh 1.0.8
+```
+
+This produces the binary and `.sha256` checksum under `dist/1.0.8/`.
+
+---
+
+## Step 3 — Verify Built Binary Version
+
+Confirm the binary reports the correct version before uploading:
+
+```bash
+timeout 2s ./dist/1.0.8/zlh-agent-linux-amd64 2>&1 | head -n 1
+```
+
+**Expected output contains**: `starting ZeroLagHub Agent v1.0.8`
+
+If the version string is wrong, stop here. Do not upload a mismatched binary.
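
The Step 3 check can be scripted so that a mismatched binary blocks the upload instead of relying on a visual comparison. The sketch below is one possible approach, not part of the release tooling: `check_version_line` is a hypothetical helper name, and it assumes the banner text is exactly the `starting ZeroLagHub Agent v<version>` string quoted in Step 3.

```shell
# check_version_line LINE VERSION   (hypothetical helper, not part of zlh-agent)
# Succeeds only if LINE contains the startup banner for VERSION.
# ASSUMPTION: the banner format matches the string quoted in Step 3.
check_version_line() {
  banner="starting ZeroLagHub Agent v$2"
  case "$1" in
    # A digit right after the banner means a different version (e.g. v1.0.80).
    *"$banner"[0-9]*) return 1 ;;
    *"$banner"*)      return 0 ;;
    *)                return 1 ;;
  esac
}

# Example use with the Step 3 command:
#   first_line="$(timeout 2s ./dist/1.0.8/zlh-agent-linux-amd64 2>&1 | head -n 1)"
#   check_version_line "$first_line" 1.0.8 || { echo "version mismatch" >&2; exit 1; }
```

Matching on the full banner rather than on the bare version number avoids a false pass when `1.0.8` happens to appear elsewhere in the first log line.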
+
+---
+
+## Step 4 — Upload Release Folder to Artifact Server
+
+```bash
+scp -r -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
+  dist/1.0.8 root@10.60.0.251:/opt/zlh/agents/versions/
+```
+
+---
+
+## Step 5 — Verify Files on Artifact Server
+
+```bash
+ssh root@10.60.0.251 'ls -lah /opt/zlh/agents/versions/1.0.8'
+```
+
+The directory must contain **both**:
+- `zlh-agent-linux-amd64`
+- `zlh-agent-linux-amd64.sha256`
+
+If either is missing, re-upload before proceeding.
+
+---
+
+## Step 6 — Update Manifest on Artifact Server
+
+Edit `/opt/zlh/agents/manifest.json` on the artifact server:
+
+```bash
+ssh root@10.60.0.251
+nano /opt/zlh/agents/manifest.json
+```
+
+Set the following fields:
+- `"latest"` → `"1.0.8"`
+- `"channels": { "stable": "1.0.8" }`
+- Add a new entry under `"artifacts"`:
+
+```json
+"1.0.8": {
+  "linux_amd64": {
+    "binary": "versions/1.0.8/zlh-agent-linux-amd64",
+    "sha256": "versions/1.0.8/zlh-agent-linux-amd64.sha256"
+  }
+}
+```
+
+Do not remove old version entries from `artifacts` — keep the full history.
+
+---
+
+## Step 7 — Validate Manifest Remotely
+
+```bash
+curl -s http://10.60.0.251:8080/agents/manifest.json | jq
+```
+
+Check:
+- `channels.stable` matches an existing key in `artifacts`
+- `latest` matches `channels.stable`
+- The new version entry has the correct paths
+
+If anything looks wrong, fix the manifest before triggering any updates.
+
+---
+
+## Step 8 — Canary: Trigger Update on a Test Node
+
+Pick one non-production or low-traffic node as the canary.
+
+```bash
+# Trigger the update
+curl -s -X POST http://127.0.0.1:18888/agent/update | jq
+
+# Wait for the agent to restart
+sleep 3
+
+# Confirm new version is running
+curl -s http://127.0.0.1:18888/version | jq
+```
+
+**Expected**: version field shows `v1.0.8`
+
+Do not proceed to Step 9 until the canary confirms the correct version.
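
The fixed `sleep 3` in Step 8 may be too short if the agent takes longer to download and restart. A bounded polling loop is one alternative, sketched here under assumptions: `wait_for_version` and `fetch_agent_version` are hypothetical helper names, and the JSON key `.version` is assumed from the "version field" wording in Step 8, so adjust it to whatever `/version` really returns.

```shell
# wait_for_version EXPECTED ATTEMPTS DELAY CMD...   (hypothetical helper)
# Runs CMD up to ATTEMPTS times, DELAY seconds apart, and succeeds as soon
# as CMD prints exactly EXPECTED. Fails if that never happens.
wait_for_version() {
  expected="$1"; attempts="$2"; delay="$3"; shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # A failing CMD (agent still restarting) counts as "no version yet".
    current="$("$@" 2>/dev/null)" || current=""
    if [ "$current" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# ASSUMPTION: /version returns JSON with a top-level "version" key.
fetch_agent_version() {
  curl -s http://127.0.0.1:18888/version | jq -r '.version'
}

# Example use on the canary, after triggering /agent/update:
#   wait_for_version v1.0.8 10 3 fetch_agent_version \
#     || { echo "canary did not reach v1.0.8" >&2; exit 1; }
```

Passing the fetch command as an argument keeps the retry logic separate from the HTTP call, so the same loop can be reused with a different endpoint or exercised without a running agent.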
+
+---
+
+## Step 9 — Verify Service Health and Logs on Canary
+
+```bash
+systemctl status zlh-agent --no-pager
+journalctl -u zlh-agent -n 80 --no-pager
+```
+
+Look for:
+- Service shows `active (running)`
+- No crash/restart loops in logs
+- No unexpected errors in the most recent 80 log lines
+
+---
+
+## Step 10 — Roll Out to Remaining Nodes
+
+Trigger `/agent/update` from the API backend or orchestrator in batches. Do not broadcast to all nodes simultaneously — stagger to catch any issues early.
+
+The agent's `ZLH_AGENT_UPDATE_MODE` controls behavior:
+- `auto` — agent self-updates when triggered
+- `notify` — agent logs that an update is available but waits
+- `off` — agent ignores update signals
+
+---
+
+## Important Rules
+
+| Rule | Detail |
+|------|--------|
+| ✅ Always use build script | `./scripts/build-release.sh ` |
+| ✅ Always upload binary + `.sha256` as a pair | Never upload one without the other |
+| ✅ Validate manifest before triggering updates | `channels.stable` must match an `artifacts` key |
+| ✅ Canary first, fleet second | Always validate on one node before rolling out |
+| ❌ Never use `go build -o zlh-agent` for releases | Bypasses version embedding and checksum generation |
+| ❌ Never reuse a version number or path | Always bump; never overwrite an existing release folder |
+| ❌ Never remove old `artifacts` entries from manifest | Keep full history for rollback reference |
+
+---
+
+## Rollback Procedure
+
+If a release is bad, update the manifest to point `channels.stable` and `latest` back to the last known-good version. Trigger `/agent/update` on affected nodes.
+
+The old binary is still present on the artifact server under its original version folder, so no re-upload is needed for a rollback.
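
The manifest edit for a rollback could be done with `jq` rather than by hand, which also makes it possible to refuse a target version that has no `artifacts` entry. This is a minimal sketch, assuming the manifest has the `latest`, `channels.stable`, and `artifacts` fields described in Step 6; `rollback_manifest` is a hypothetical helper name, not an existing script.

```shell
# rollback_manifest VERSION   (hypothetical helper)
# Reads manifest JSON on stdin and writes the updated manifest to stdout.
# Fails (non-zero exit from jq) if VERSION is not a key under "artifacts".
# ASSUMPTION: the manifest layout matches the fields listed in Step 6.
rollback_manifest() {
  jq --arg v "$1" '
    if (.artifacts | has($v)) then
      .latest = $v | .channels.stable = $v
    else
      error("version \($v) is not listed in artifacts")
    end'
}

# Example use on the artifact server (write to a temp file, then replace):
#   cd /opt/zlh/agents
#   rollback_manifest 1.0.7 < manifest.json > manifest.json.tmp \
#     && mv manifest.json.tmp manifest.json
```

Writing to a temporary file first means a failed check leaves the published `manifest.json` untouched. Re-run the Step 7 validation afterwards before triggering `/agent/update`.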
+
+---
+
+## Related
+
+- Agent endpoint specs: `Agent_Endpoint_Specifications_Phase1.md`
+- Agent update mode env var: `ZLH_AGENT_UPDATE_MODE` (`auto|notify|off`)
+- Update check interval: `ZLH_AGENT_UPDATE_INTERVAL` (default `30m`)
+- Artifact server location: `10.60.0.251:/opt/zlh/agents/`