docs: add agent release runbook
Runbooks/Agent_Release_Runbook.md
# 🚀 Agent Release Runbook

**Applies To**: `zlh-agent` Go binary
**Artifact Server**: `10.60.0.251` (`/opt/zlh/agents/`)
**Agent HTTP Port**: `18888`
**Last Updated**: February 22, 2026

---

## Overview

This runbook covers the full process for building, uploading, and rolling out a new `zlh-agent` release. Follow all steps in order. Do not skip the canary validation (Steps 8 and 9) before rolling out to the remaining nodes.

---

## Step 1 — Choose Next Version

Always bump the version. Never reuse an existing version number or folder.

Use semantic versioning: `MAJOR.MINOR.PATCH`
Example: if current is `1.0.7`, next is `1.0.8`.

Check the current manifest to confirm what's already published:

```bash
curl -s http://10.60.0.251:8080/agents/manifest.json | jq '.latest'
```
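
For routine patch releases, the bump can be computed rather than typed by hand. The following is a minimal sketch, not part of the release tooling: `next_patch_version` is a hypothetical helper name, and it assumes the manifest's `latest` value is a bare `MAJOR.MINOR.PATCH` string with no `v` prefix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the next PATCH version for a MAJOR.MINOR.PATCH string.
next_patch_version() {
  local current="$1"
  local re='^([0-9]+)\.([0-9]+)\.([0-9]+)$'
  # Reject anything that is not three dot-separated non-negative integers.
  if [[ ! "$current" =~ $re ]]; then
    echo "not a MAJOR.MINOR.PATCH version: $current" >&2
    return 1
  fi
  # 10# forces base 10 so a component such as "08" is not parsed as octal.
  echo "${BASH_REMATCH[1]}.${BASH_REMATCH[2]}.$(( 10#${BASH_REMATCH[3]} + 1 ))"
}

# Example: feed it the value reported by the manifest.
# next_patch_version "$(curl -s http://10.60.0.251:8080/agents/manifest.json | jq -r '.latest')"
```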

---

## Step 2 — Build the Release Artifact

Always build via the release script. Do **not** use `go build -o zlh-agent` directly: it bypasses version embedding and checksum generation.

```bash
cd /opt/zlh-agent
./scripts/build-release.sh 1.0.8
```

This produces the binary and its `.sha256` checksum file under `dist/1.0.8/`.
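
Before uploading, it may be worth confirming that both files exist and that the checksum matches the binary. The sketch below is illustrative only: `verify_release_dir` is a hypothetical helper, and it assumes the first whitespace-separated field of the `.sha256` file is the hex digest (the usual `sha256sum` output format), which this runbook does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: confirm a release directory holds the binary and a
# matching checksum. Assumes the first field of the .sha256 file is the hex digest.
verify_release_dir() {
  local dir="$1"
  local bin="$dir/zlh-agent-linux-amd64"
  local sum="$bin.sha256"

  # Both files must exist and be non-empty.
  if [[ ! -s "$bin" || ! -s "$sum" ]]; then
    echo "missing binary or checksum in $dir" >&2
    return 1
  fi

  local expected actual
  expected="$(awk 'NR == 1 { print $1 }' "$sum")"
  actual="$(sha256sum "$bin" | awk '{ print $1 }')"

  if [[ "$expected" != "$actual" ]]; then
    echo "checksum mismatch for $bin" >&2
    return 1
  fi
  echo "ok: $bin"
}

# Example: verify_release_dir dist/1.0.8
```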

---

## Step 3 — Verify Built Binary Version

Confirm the binary reports the correct version before uploading. The command starts the agent, keeps only its first line of output, and `timeout` stops it after two seconds:

```bash
timeout 2s ./dist/1.0.8/zlh-agent-linux-amd64 2>&1 | head -n 1
```

**Expected output contains**: `starting ZeroLagHub Agent v1.0.8`

If the version string is wrong, stop here. Do not upload a mismatched binary.
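
The visual check can be turned into a pass/fail test. This is a sketch under the assumption that the startup line has exactly the shape shown above; `banner_has_version` is a hypothetical helper, not something the agent or build script provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a startup line reports exactly the expected version.
banner_has_version() {
  local expected="$1" line="$2"
  # Capture the full MAJOR.MINOR.PATCH after "Agent v" so 1.0.8 does not match 1.0.80.
  local re='ZeroLagHub Agent v([0-9]+\.[0-9]+\.[0-9]+)'
  if [[ "$line" =~ $re ]]; then
    [[ "${BASH_REMATCH[1]}" == "$expected" ]]
  else
    return 1
  fi
}

# Example:
# line="$(timeout 2s ./dist/1.0.8/zlh-agent-linux-amd64 2>&1 | head -n 1)"
# banner_has_version 1.0.8 "$line" || { echo "version mismatch: $line" >&2; exit 1; }
```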

---

## Step 4 — Upload Release Folder to Artifact Server

Copy the whole `dist/1.0.8` directory so the binary and its checksum land together under `versions/1.0.8/`:

```bash
scp -r -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
  dist/1.0.8 root@10.60.0.251:/opt/zlh/agents/versions/
```

---

## Step 5 — Verify Files on Artifact Server

```bash
ssh root@10.60.0.251 'ls -lah /opt/zlh/agents/versions/1.0.8'
```

The directory must contain **both**:
- `zlh-agent-linux-amd64`
- `zlh-agent-linux-amd64.sha256`

If either is missing, re-upload before proceeding.

---

## Step 6 — Update Manifest on Artifact Server

Edit `/opt/zlh/agents/manifest.json` on the artifact server:

```bash
# Open a shell on the artifact server, then edit the manifest there
ssh root@10.60.0.251
nano /opt/zlh/agents/manifest.json
```

Set the following fields:
- `"latest"` → `"1.0.8"`
- `"channels": { "stable": "1.0.8" }`
- Add a new entry under `"artifacts"`:

```json
"1.0.8": {
  "linux_amd64": {
    "binary": "versions/1.0.8/zlh-agent-linux-amd64",
    "sha256": "versions/1.0.8/zlh-agent-linux-amd64.sha256"
  }
}
```

Do not remove old version entries from `artifacts`; keep the full history.
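
Hand edits to JSON are easy to get wrong, so the same three changes could be applied with `jq` instead. This is only a sketch: `manifest_with_release` is a hypothetical helper, and it assumes the manifest has exactly the top-level keys named above (`latest`, `channels.stable`, `artifacts`). It refuses to overwrite a version that is already listed, in keeping with the rule against reusing version numbers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print an updated manifest on stdout.
# Assumes the top-level keys latest, channels.stable, and artifacts.
manifest_with_release() {
  local manifest="$1" version="$2"
  jq --arg v "$version" '
    if ((.artifacts // {}) | has($v))
    then error("version \($v) is already listed in artifacts")
    else .
    end
    | .latest = $v
    | .channels.stable = $v
    | .artifacts[$v] = {
        linux_amd64: {
          binary: "versions/\($v)/zlh-agent-linux-amd64",
          sha256: "versions/\($v)/zlh-agent-linux-amd64.sha256"
        }
      }
  ' "$manifest"
}

# Example (on the artifact server): write beside the original, then swap it in.
# manifest_with_release /opt/zlh/agents/manifest.json 1.0.8 > /opt/zlh/agents/manifest.json.new \
#   && mv /opt/zlh/agents/manifest.json.new /opt/zlh/agents/manifest.json
```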

---

## Step 7 — Validate Manifest Remotely

```bash
curl -s http://10.60.0.251:8080/agents/manifest.json | jq .
```

Check:
- `channels.stable` matches an existing key in `artifacts`
- `latest` matches `channels.stable`
- The new version entry has the correct paths

If anything looks wrong, fix the manifest before triggering any updates.
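
The three checks above can also be expressed as a single `jq -e` expression, which exits non-zero when any of them fails. This is a sketch under assumptions: `validate_manifest` is a hypothetical helper, and the expected paths follow the `versions/<version>/...` layout used in Step 6.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit non-zero unless the manifest on stdin is self-consistent.
validate_manifest() {
  jq -e '
    .channels.stable as $s
    | (.latest == $s)
      and ((.artifacts // {}) | has($s))
      and (.artifacts[$s].linux_amd64.binary == "versions/\($s)/zlh-agent-linux-amd64")
      and (.artifacts[$s].linux_amd64.sha256 == "versions/\($s)/zlh-agent-linux-amd64.sha256")
  ' > /dev/null
}

# Example:
# curl -s http://10.60.0.251:8080/agents/manifest.json | validate_manifest \
#   && echo "manifest OK" || echo "manifest INVALID" >&2
```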

---

## Step 8 — Canary: Trigger Update on a Test Node

Pick one non-production or low-traffic node as the canary. Run these commands on that node; they target the local agent at `127.0.0.1:18888`.

```bash
# Trigger the update
curl -s -X POST http://127.0.0.1:18888/agent/update | jq .

# Wait for the agent to restart
sleep 3

# Confirm new version is running
curl -s http://127.0.0.1:18888/version | jq .
```

**Expected**: version field shows `v1.0.8`

If the version request cannot connect, the agent may still be restarting; wait a few seconds and retry. Do not proceed to Step 9 until the canary confirms the correct version.
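
A fixed `sleep 3` can be too short on a slow node, so polling is one alternative. The sketch below is illustrative: `fetch_agent_version` and `wait_for_version` are hypothetical helpers, and the `.version` field name is an assumption based on the "version field" wording above rather than on the endpoint specification.

```bash
#!/usr/bin/env bash
# Assumption: /version returns JSON with a top-level "version" string such as "v1.0.8".
fetch_agent_version() {
  curl -s --max-time 5 http://127.0.0.1:18888/version | jq -r '.version'
}

# Hypothetical helper: poll until the agent reports the expected version
# or the attempts run out.
wait_for_version() {
  local expected="$1" attempts="${2:-10}" interval="${3:-2}"
  local i reported
  for (( i = 1; i <= attempts; i++ )); do
    # A failed fetch (agent still restarting) is treated as "not yet".
    reported="$(fetch_agent_version 2>/dev/null)" || reported=""
    if [[ "$reported" == "v$expected" ]]; then
      echo "agent reports $reported after $i attempt(s)"
      return 0
    fi
    if (( i < attempts )); then
      sleep "$interval"
    fi
  done
  echo "agent did not report v$expected (last seen: ${reported:-nothing})" >&2
  return 1
}

# Example: wait_for_version 1.0.8
```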

---

## Step 9 — Verify Service Health and Logs on Canary

```bash
systemctl status zlh-agent --no-pager
journalctl -u zlh-agent -n 80 --no-pager
```

Look for:
- Service shows `active (running)`
- No crash/restart loops in logs
- No unexpected errors in the last 80 log lines
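
As a rough first pass before reading the logs by eye, the recent lines can be scanned for failure keywords. This is a sketch only: `count_suspect_lines` is a hypothetical helper, and the keyword list is a guess, since the agent's log format is not described here. A count of zero does not replace reading the log.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count log lines on stdin that mention a failure keyword.
# The keyword list is an assumption; adjust it to the agent's real log format.
count_suspect_lines() {
  # grep -c prints the count but exits 1 when it is zero, so mask the status.
  grep -ciE 'panic|fatal|error|segfault' || true
}

# Example (on the canary):
# systemctl is-active --quiet zlh-agent || echo "zlh-agent is not active" >&2
# journalctl -u zlh-agent -n 80 --no-pager | count_suspect_lines
```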

---

## Step 10 — Roll Out to Remaining Nodes

Trigger `/agent/update` from the API backend or orchestrator in batches. Do not broadcast to all nodes simultaneously; stagger the batches so any issue surfaces early.

The agent's `ZLH_AGENT_UPDATE_MODE` environment variable controls how it responds:
- `auto` — agent self-updates when triggered
- `notify` — agent logs that an update is available but waits
- `off` — agent ignores update signals
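
The batching logic might look like the sketch below if driven from a shell rather than from the API backend or orchestrator. Everything here is an assumption made for illustration: `trigger_update` and `rollout_in_batches` are hypothetical helpers, the hostnames are placeholders, and it presumes the operator host can reach each agent directly on port `18888`.

```bash
#!/usr/bin/env bash
# Assumption: the operator host can reach each agent directly on port 18888.
trigger_update() {
  curl -sf --max-time 10 -X POST "http://$1:18888/agent/update" > /dev/null
}

# Hypothetical helper: update nodes in fixed-size batches, pausing between
# batches and stopping at the first failed trigger.
rollout_in_batches() {
  local batch_size="$1" pause="$2"
  shift 2
  local nodes=("$@") i node
  if (( batch_size < 1 )); then
    echo "batch size must be at least 1" >&2
    return 2
  fi
  for (( i = 0; i < ${#nodes[@]}; i += batch_size )); do
    for node in "${nodes[@]:i:batch_size}"; do
      echo "triggering update on $node"
      if ! trigger_update "$node"; then
        echo "update failed on $node; halting rollout" >&2
        return 1
      fi
    done
    # Pause only if another batch follows, leaving time to check health.
    if (( i + batch_size < ${#nodes[@]} )); then
      sleep "$pause"
    fi
  done
}

# Example with placeholder hostnames: batches of 2, five minutes apart.
# rollout_in_batches 2 300 node-a node-b node-c node-d node-e
```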

---

## Important Rules

| Rule | Detail |
|------|--------|
| ✅ Always use build script | `./scripts/build-release.sh <version>` |
| ✅ Always upload binary + `.sha256` as a pair | Never upload one without the other |
| ✅ Validate manifest before triggering updates | `channels.stable` must match an `artifacts` key |
| ✅ Canary first, fleet second | Always validate on one node before rolling out |
| ❌ Never use `go build -o zlh-agent` for releases | Bypasses version embedding and checksum generation |
| ❌ Never reuse a version number or path | Always bump; never overwrite an existing release folder |
| ❌ Never remove old `artifacts` entries from manifest | Keep full history for rollback reference |

---

## Rollback Procedure

If a release is bad, update the manifest to point `channels.stable` and `latest` back to the last known-good version, then trigger `/agent/update` on the affected nodes. Leave the bad version's entry in `artifacts`; only the two pointers change.

The old binary is still present on the artifact server under its original version folder, so no re-upload is needed for a rollback.
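
Repointing the manifest can be sketched with `jq` as well. `manifest_rolled_back_to` is a hypothetical helper, and it assumes the manifest layout described in Step 6. It changes only `latest` and `channels.stable`, and fails if the target version was never published.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a manifest whose pointers target an
# already-published version. The artifacts map is left untouched.
manifest_rolled_back_to() {
  local manifest="$1" version="$2"
  jq --arg v "$version" '
    if ((.artifacts // {}) | has($v))
    then .latest = $v | .channels.stable = $v
    else error("version \($v) is not listed in artifacts")
    end
  ' "$manifest"
}

# Example (on the artifact server), rolling back from 1.0.8 to 1.0.7:
# manifest_rolled_back_to /opt/zlh/agents/manifest.json 1.0.7 > /opt/zlh/agents/manifest.json.new \
#   && mv /opt/zlh/agents/manifest.json.new /opt/zlh/agents/manifest.json
```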

---

## Related

- Agent endpoint specs: `Agent_Endpoint_Specifications_Phase1.md`
- Agent update mode env var: `ZLH_AGENT_UPDATE_MODE` (`auto|notify|off`)
- Update check interval: `ZLH_AGENT_UPDATE_INTERVAL` (default `30m`)
- Artifact server location: `10.60.0.251:/opt/zlh/agents/`