docs: add agent release runbook

jester 2026-02-22 19:30:23 +00:00
parent f651b2d5c1
commit 3db9416c60

@@ -0,0 +1,195 @@
# 🚀 Agent Release Runbook
**Applies To**: `zlh-agent` Go binary
**Artifact Server**: `10.60.0.251` (`/opt/zlh/agents/`)
**Agent HTTP Port**: `18888`
**Last Updated**: February 22, 2026
---
## Overview
This runbook covers the full process for building, uploading, and rolling out a new `zlh-agent` release. Follow the steps in order, and complete canary validation (Steps 8 and 9) before rolling out to the remaining nodes.
---
## Step 1 — Choose Next Version
Always bump the version. Never reuse an existing version number or folder.
Use semantic versioning: `MAJOR.MINOR.PATCH`
Example: if current is `1.0.7`, next is `1.0.8`.
Check the current manifest to confirm what's already published:
```bash
curl -s http://10.60.0.251:8080/agents/manifest.json | jq '.latest'
```
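For a routine patch release, the next version can be derived from the published one. The sketch below assumes plain `MAJOR.MINOR.PATCH` strings without a `v` prefix or pre-release suffix; `next_patch_version` is a hypothetical helper, not part of the `zlh-agent` repository.

```bash
# Sketch: compute the next PATCH version from a MAJOR.MINOR.PATCH string.
# next_patch_version is a hypothetical helper name used for illustration.
next_patch_version() {
  local current="$1" prefix patch
  # Accept exactly three numeric fields, with no leading zeros (per semver).
  if ! printf '%s\n' "$current" |
      grep -Eq '^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'; then
    echo "not a MAJOR.MINOR.PATCH version: $current" >&2
    return 1
  fi
  prefix="${current%.*}"   # MAJOR.MINOR
  patch="${current##*.}"   # PATCH
  echo "${prefix}.$((patch + 1))"
}
```

It could be fed from the manifest with `next_patch_version "$(curl -s http://10.60.0.251:8080/agents/manifest.json | jq -r '.latest')"`. Minor and major bumps remain a manual decision.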
---
## Step 2 — Build the Release Artifact
Always build via the release script. Do **not** use `go build -o zlh-agent` directly.
```bash
cd /opt/zlh-agent
./scripts/build-release.sh 1.0.8
```
This produces the binary and `.sha256` checksum under `dist/1.0.8/`.
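Before going further, it may be worth confirming that the checksum file actually matches the binary. The sketch below assumes the `.sha256` file begins with the hex digest (with or without a trailing filename), which this runbook does not specify; `verify_release_checksum` is a hypothetical helper.

```bash
# Sketch: compare a binary's SHA-256 with the digest stored next to it.
# Assumption: the first field of <binary>.sha256 is the hex digest.
verify_release_checksum() {
  local binary="$1" expected actual
  expected="$(awk 'NR == 1 { print $1 }' "${binary}.sha256")" || return 1
  actual="$(sha256sum "$binary" | awk '{ print $1 }')" || return 1
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    echo "checksum OK: $binary"
  else
    echo "checksum MISMATCH: $binary" >&2
    return 1
  fi
}
```

Example: `verify_release_checksum dist/1.0.8/zlh-agent-linux-amd64`.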
---
## Step 3 — Verify Built Binary Version
Confirm the binary reports the correct version before uploading:
```bash
timeout 2s ./dist/1.0.8/zlh-agent-linux-amd64 2>&1 | head -n 1
```
**Expected output contains**: `starting ZeroLagHub Agent v1.0.8`
If the version string is wrong, stop here. Do not upload a mismatched binary.
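The check can be made mechanical so a mismatch stops a release script instead of relying on a visual comparison. This is a sketch only: `banner_has_version` is a hypothetical helper, and the banner text is the one quoted in this step.

```bash
# Sketch: succeed only if the banner line announces exactly the expected version.
# banner_has_version is a hypothetical helper name.
banner_has_version() {
  local expected="$1" banner="$2"
  case "$banner" in
    # Reject longer versions that merely start with the expected one
    # (for example v1.0.80 when v1.0.8 is expected).
    *"starting ZeroLagHub Agent v${expected}"[0-9]*)
      echo "unexpected banner: $banner" >&2; return 1 ;;
    *"starting ZeroLagHub Agent v${expected}"*)
      return 0 ;;
    *)
      echo "unexpected banner: $banner" >&2; return 1 ;;
  esac
}
```

Example: `banner_has_version 1.0.8 "$(timeout 2s ./dist/1.0.8/zlh-agent-linux-amd64 2>&1 | head -n 1)"`.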
---
## Step 4 — Upload Release Folder to Artifact Server
```bash
scp -r -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
dist/1.0.8 root@10.60.0.251:/opt/zlh/agents/versions/
```
---
## Step 5 — Verify Files on Artifact Server
```bash
ssh root@10.60.0.251 'ls -lah /opt/zlh/agents/versions/1.0.8'
```
The directory must contain **both**:
- `zlh-agent-linux-amd64`
- `zlh-agent-linux-amd64.sha256`
If either is missing, re-upload before proceeding.
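The same check can be scripted rather than read off the `ls` output. A minimal sketch, with `check_release_dir` as a hypothetical helper; it also treats an empty file as missing, which is an added assumption about what counts as a valid upload.

```bash
# Sketch: confirm a release folder holds both required, non-empty files.
# check_release_dir is a hypothetical helper name.
check_release_dir() {
  local dir="$1" name missing=0
  for name in zlh-agent-linux-amd64 zlh-agent-linux-amd64.sha256; do
    # -s is true only when the file exists and is not empty.
    if [ ! -s "${dir}/${name}" ]; then
      echo "missing or empty: ${dir}/${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It works equally against the local `dist/1.0.8` folder before upload and, when run on the artifact server, against `/opt/zlh/agents/versions/1.0.8`.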
---
## Step 6 — Update Manifest on Artifact Server
Edit `/opt/zlh/agents/manifest.json` on the artifact server:
```bash
ssh root@10.60.0.251
nano /opt/zlh/agents/manifest.json
```
Set the following fields:
- `"latest": "1.0.8"`
- `"channels": { "stable": "1.0.8" }`
- Add a new entry under `"artifacts"`:
```json
"1.0.8": {
  "linux_amd64": {
    "binary": "versions/1.0.8/zlh-agent-linux-amd64",
    "sha256": "versions/1.0.8/zlh-agent-linux-amd64.sha256"
  }
}
```
Do not remove old version entries from `artifacts` — keep the full history.
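Hand-editing JSON is easy to get wrong, so the same change can be expressed with `jq`. This is a sketch, not the project's tooling: `add_release_to_manifest` is a hypothetical helper, and it assumes the manifest has exactly the `latest`, `channels.stable`, and `artifacts` fields described above.

```bash
# Sketch: print an updated manifest for a new version; the input file is not modified.
# add_release_to_manifest is a hypothetical helper name.
add_release_to_manifest() {
  local version="$1" manifest="$2"
  jq --arg v "$version" '
    if (.artifacts // {}) | has($v) then
      # Refuse to overwrite an existing release entry.
      error("version \($v) already exists in artifacts")
    else
      .latest = $v
      | .channels.stable = $v
      | .artifacts[$v] = {
          linux_amd64: {
            binary: "versions/\($v)/zlh-agent-linux-amd64",
            sha256: "versions/\($v)/zlh-agent-linux-amd64.sha256"
          }
        }
    end
  ' "$manifest"
}
```

Writing the output to a temporary file and moving it into place only when `jq` exits successfully avoids leaving a half-written manifest behind. Existing `artifacts` entries are preserved because only the new key is assigned.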
---
## Step 7 — Validate Manifest Remotely
```bash
curl -s http://10.60.0.251:8080/agents/manifest.json | jq
```
Check:
- `channels.stable` matches an existing key in `artifacts`
- `latest` matches `channels.stable`
- The new version entry has the correct paths
If anything looks wrong, fix the manifest before triggering any updates.
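The three checks above can be turned into a single pass/fail command. The sketch below assumes the path convention shown in Step 6 and only inspects the `linux_amd64` entry; `validate_manifest` is a hypothetical helper.

```bash
# Sketch: read manifest JSON on stdin and fail unless it is internally consistent.
# validate_manifest is a hypothetical helper name.
validate_manifest() {
  # jq -e exits non-zero when the final result is false or null.
  jq -e '
    .channels.stable as $s
    | (.artifacts // {}) as $a
    | ($s | type) == "string"
      and ($a | has($s))
      and .latest == $s
      and $a[$s].linux_amd64.binary == "versions/\($s)/zlh-agent-linux-amd64"
      and $a[$s].linux_amd64.sha256 == "versions/\($s)/zlh-agent-linux-amd64.sha256"
  ' >/dev/null
}
```

Example: `curl -s http://10.60.0.251:8080/agents/manifest.json | validate_manifest && echo "manifest OK"`.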
---
## Step 8 — Canary: Trigger Update on a Test Node
Pick one non-production or low-traffic node as the canary and run the following on that node; the commands target the agent on `127.0.0.1`.
```bash
# Trigger the update
curl -s -X POST http://127.0.0.1:18888/agent/update | jq
# Wait for the agent to restart
sleep 3
# Confirm new version is running
curl -s http://127.0.0.1:18888/version | jq
```
**Expected**: version field shows `v1.0.8`
Do not proceed to Step 9 until the canary confirms the correct version.
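A fixed `sleep 3` can race with a slow restart, so polling for the new version is one alternative. In this sketch `wait_for_version` and `fetch_agent_version` are hypothetical helpers, and the `.version` JSON field name is an assumption about the `/version` response.

```bash
# Sketch: poll until the agent reports the expected version, or give up.
# Assumption: /version returns JSON with a "version" field such as "v1.0.8".
fetch_agent_version() {
  curl -s --max-time 2 http://127.0.0.1:18888/version | jq -r '.version'
}

# Usage: wait_for_version <expected> [attempts] [delay_seconds] [fetch_command]
wait_for_version() {
  local expected="$1" attempts="${2:-10}" delay="${3:-1}" fetch="${4:-fetch_agent_version}"
  local i=1 seen=""
  while [ "$i" -le "$attempts" ]; do
    # A failed fetch (agent still restarting) counts as "not yet".
    seen="$("$fetch" 2>/dev/null)" || seen=""
    if [ "$seen" = "$expected" ]; then
      echo "agent is running $seen"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $expected (last seen: ${seen:-none})" >&2
  return 1
}
```

Example: `curl -s -X POST http://127.0.0.1:18888/agent/update | jq && wait_for_version v1.0.8`. The fetch command is a parameter so the loop can be exercised without a running agent.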
---
## Step 9 — Verify Service Health and Logs on Canary
```bash
systemctl status zlh-agent --no-pager
journalctl -u zlh-agent -n 80 --no-pager
```
Look for:
- Service shows `active (running)`
- No crash/restart loops in logs
- No unexpected errors in the most recent 80 log lines
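A quick filter can help spot problems in the log output before reading it in full. The patterns below are assumptions about what a failure looks like in `zlh-agent` logs, and `scan_agent_logs` is a hypothetical helper; a clean result does not replace reading the logs.

```bash
# Sketch: read log lines on stdin, print suspicious ones, and fail if any are found.
# The pattern list is an assumption; adjust it to the agent's real log format.
scan_agent_logs() {
  # grep exits 0 when at least one line matches, which here means "problem found".
  if grep -inE 'panic|fatal|error|segfault'; then
    return 1
  fi
  return 0
}
```

Example: `journalctl -u zlh-agent -n 80 --no-pager | scan_agent_logs`. On systemd versions that expose the property, `systemctl show zlh-agent -p NRestarts` reports how many times the unit has been automatically restarted, which is a direct signal of a restart loop.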
---
## Step 10 — Roll Out to Remaining Nodes
Trigger `/agent/update` from the API backend or orchestrator in batches. Do not broadcast to all nodes simultaneously; stagger the batches so a bad release is caught before it reaches the whole fleet.
The agent's `ZLH_AGENT_UPDATE_MODE` controls behavior:
- `auto` — agent self-updates when triggered
- `notify` — agent logs that an update is available but waits
- `off` — agent ignores update signals
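One way to stagger the rollout is a small loop that pauses between batches and stops at the first failed trigger. This is a sketch under assumptions: `rollout_in_batches` and `trigger_update` are hypothetical helpers, and it assumes port `18888` on each node is reachable from the host running the loop, which this runbook does not state.

```bash
# Sketch: trigger updates node by node, pausing after each full batch.
# Assumption: each node's agent port is reachable from this host.
trigger_update() {
  curl -fsS -X POST "http://$1:18888/agent/update" >/dev/null
}

# Usage: rollout_in_batches <batch_size> <pause_seconds> <trigger_command> <node>...
rollout_in_batches() {
  local batch_size="$1" pause="$2" trigger="$3"
  shift 3
  local count=0 node
  for node in "$@"; do
    if ! "$trigger" "$node"; then
      echo "update trigger failed on $node; halting rollout" >&2
      return 1
    fi
    count=$((count + 1))
    # Pause after every full batch, but not after the final node.
    if [ $((count % batch_size)) -eq 0 ] && [ "$count" -lt "$#" ]; then
      sleep "$pause"
    fi
  done
}
```

Example with placeholder hostnames: `rollout_in_batches 5 120 trigger_update node-a node-b node-c`. The pause is a natural point to repeat the Step 9 health checks on the batch that just updated.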
---
## Important Rules
| Rule | Detail |
|------|--------|
| ✅ Always use build script | `./scripts/build-release.sh <version>` |
| ✅ Always upload binary + `.sha256` as a pair | Never upload one without the other |
| ✅ Validate manifest before triggering updates | `channels.stable` must match an `artifacts` key |
| ✅ Canary first, fleet second | Always validate on one node before rolling out |
| ❌ Never use `go build -o zlh-agent` for releases | Bypasses version embedding and checksum generation |
| ❌ Never reuse a version number or path | Always bump; never overwrite an existing release folder |
| ❌ Never remove old `artifacts` entries from manifest | Keep full history for rollback reference |
---
## Rollback Procedure
If a release is bad, point `channels.stable` and `latest` in the manifest back to the last known-good version, then trigger `/agent/update` on the affected nodes and confirm the reported version as in Step 8.
The old binary is still present on the artifact server under its original version folder, so no re-upload is needed for a rollback.
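The manifest change for a rollback can be sketched with `jq` as well. `rollback_manifest` is a hypothetical helper; it assumes the manifest fields described in Step 6 and refuses to point at a version that has no `artifacts` entry.

```bash
# Sketch: print a manifest repointed at a previously published version.
# rollback_manifest is a hypothetical helper name; the input file is not modified.
rollback_manifest() {
  local version="$1" manifest="$2"
  jq --arg v "$version" '
    if (.artifacts // {}) | has($v) then
      .latest = $v | .channels.stable = $v
    else
      error("no artifacts entry for \($v); cannot roll back to it")
    end
  ' "$manifest"
}
```

Example: `rollback_manifest 1.0.7 /opt/zlh/agents/manifest.json > /tmp/manifest.json`, then review the result and move it into place. The entry for the bad release stays in `artifacts`, consistent with the rule about keeping full history.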
---
## Related
- Agent endpoint specs: `Agent_Endpoint_Specifications_Phase1.md`
- Agent update mode env var: `ZLH_AGENT_UPDATE_MODE` (`auto|notify|off`)
- Update check interval: `ZLH_AGENT_UPDATE_INTERVAL` (default `30m`)
- Artifact server location: `10.60.0.251:/opt/zlh/agents/`