# Session Handoff — PBS to Cloudflare R2 Offsite Backup
Date: 2026-05-02
## Summary
Worked through the Proxmox Backup Server offsite backup path for ZeroLagHub.
Current decision: use PBS local datastore as the primary restore source, and use `rclone copy` to push a clean datastore baseline to Cloudflare R2 for offsite disaster recovery.
Native PBS S3 datastore support was investigated but is not the chosen path for now: the PBS-to-R2 S3 endpoint did not behave cleanly, and PBS S3 datastore support should not yet be treated as a stable production path.
## Confirmed architecture
```text
PVE / production Proxmox
-> PBS local datastore: z-back
-> rclone copy
-> Cloudflare R2 bucket: z-back-remote
```
Roles:
- PBS local datastore = primary restore-ready infrastructure backup layer
- Cloudflare R2 = offsite disaster recovery copy
- Agent backups = local app-aware rollback only, not platform DR
## R2 / rclone details
Configured rclone remote:
```text
remote name: zback-remote
provider: Cloudflare R2 / S3
endpoint: https://526f4df41bcce7267d5d4a39883cdd21.r2.cloudflarestorage.com
region: auto
bucket: z-back-remote
working path: zback-remote:z-back-remote
```
Important naming distinction:
- `zback-remote` = rclone remote name
- `z-back-remote` = Cloudflare R2 bucket name
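Because the two names differ by a single hyphen, it may help to define each once and build every path from variables. A minimal sketch; the variable names are illustrative and were not used in the session:

```bash
#!/usr/bin/env bash
# Illustrative variable names (not from the session); values are the ones recorded above.
RCLONE_REMOTE="zback-remote"   # rclone remote name (no hyphen after "z")
R2_BUCKET="z-back-remote"      # Cloudflare R2 bucket name (hyphen after "z")
R2_BASE="${RCLONE_REMOTE}:${R2_BUCKET}"

# Destination prefix used for the PBS datastore copy.
R2_DATASTORE_DEST="${R2_BASE}/pbs/z-back"
echo "$R2_DATASTORE_DEST"
```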
Connectivity was validated with a test write/list:
```bash
echo "r2 test from zlh-pbs $(date)" > /tmp/r2-test.txt
rclone copy /tmp/r2-test.txt zback-remote:z-back-remote/test/ \
--s3-no-check-bucket \
--progress
rclone lsf zback-remote:z-back-remote/test/ \
--s3-no-check-bucket
```
Expected/observed result:
```text
r2-test.txt
```
## PBS datastore dry run
Ran a dry run against the current datastore:
```bash
rclone copy /mnt/datastore/z-back zback-remote:z-back-remote/pbs/z-back \
--dry-run \
--s3-no-check-bucket \
--progress \
--log-file=/var/log/zlh-pbs-r2-copy.log \
--log-level=INFO
```
Dry run result:
```text
Transferred: 131.169 GiB / 131.169 GiB, 100%
Transferred: 97513 / 97513, 100%
Elapsed time: 6.8s
```
This only proved that rclone could read the datastore and would copy it. Nothing was uploaded because `--dry-run` was set.
## Important current blocker / decision
The current PBS datastore contents are old migration-era backups from March.
The user stated that these backups are not useful for current production recovery and can likely be removed, because the environment is well past the migration.
Decision:
- Do not upload the current migration-era datastore to R2.
- First clean PBS by removing old March backups.
- Then create fresh production backups.
- Then copy the clean baseline to R2.
## Next steps
1. In PBS, remove the old March backup snapshots/groups from datastore `z-back`.
   - Prefer PBS UI: Datastore -> z-back -> Content -> remove/forget old snapshots.
   - Be careful to delete only migration-era backups that are not needed.
2. Run garbage collection on `z-back` after old snapshots are forgotten.
```bash
proxmox-backup-manager garbage-collection start z-back
```
3. From Proxmox VE, run fresh backups of current production VMs/LXCs to PBS datastore `z-back`.
4. Verify the fresh PBS backups.
5. Dry-run the R2 copy again and confirm it reflects only the clean baseline.
```bash
rclone copy /mnt/datastore/z-back zback-remote:z-back-remote/pbs/z-back \
--dry-run \
--s3-no-check-bucket \
--progress \
--log-file=/var/log/zlh-backups/pbs-r2-z-back-dryrun.log \
--log-level=INFO
```
6. Run the real offsite copy once PBS is quiet.
```bash
rclone copy /mnt/datastore/z-back zback-remote:z-back-remote/pbs/z-back \
--s3-no-check-bucket \
--progress \
--transfers=8 \
--checkers=16 \
--log-file=/var/log/zlh-backups/pbs-r2-z-back-$(date +%F-%H%M).log \
--log-level=INFO
```
7. Perform a restore test from R2 before considering offsite DR proven.
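As one possible input to step 7, the local datastore can be compared against the R2 copy once the real upload has finished. The sketch below uses `rclone check`; `verify_r2_copy` is a hypothetical helper name, and the default paths are the ones used in the steps above. A clean comparison only shows that the files arrived intact. It is not a substitute for actually restoring a backup from the R2 copy.

```bash
#!/usr/bin/env bash
# Sketch: compare the local datastore against the R2 copy after step 6.
# verify_r2_copy is a hypothetical helper name; default paths are the ones used above.
verify_r2_copy() {
  local src="${1:-/mnt/datastore/z-back}"
  local dest="${2:-zback-remote:z-back-remote/pbs/z-back}"
  # --one-way: only report files that are missing or different on the remote.
  # `rclone copy` never deletes, so the remote may legitimately hold extra files.
  rclone check "$src" "$dest" \
    --one-way \
    --s3-no-check-bucket
}
```

Possible usage: `verify_r2_copy && echo "R2 copy matches local datastore"`. Like the copy itself, this should only be run while PBS is quiet.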
## Operational constraints
Do not run rclone while PBS is:
- writing backups
- pruning
- garbage collecting
- verifying
Use `rclone copy`, not `rclone sync`, until restore-from-R2 has been proven. `copy` avoids remote deletions and is safer while establishing the first offsite baseline.
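The quiet-PBS constraint above could be enforced with a small pre-flight guard around the copy. This is a sketch under stated assumptions, not something that was run in the session: the function names are hypothetical, and it assumes that `proxmox-backup-manager task list --output-format json` lists only running tasks by default and prints `[]` when there are none. That behavior should be confirmed on the PBS host before relying on it.

```bash
#!/usr/bin/env bash
# Sketch of a pre-flight guard for the offsite copy. Function names are hypothetical.

# Returns 0 when the given JSON text is an empty array, i.e. no tasks are listed.
task_list_is_empty() {
  local compact
  compact="$(printf '%s' "$1" | tr -d '[:space:]')"
  [ "$compact" = "[]" ]
}

# Assumption: without --all, `task list` shows only running tasks,
# and prints [] in JSON mode when PBS is idle.
pbs_is_quiet() {
  local tasks
  tasks="$(proxmox-backup-manager task list --output-format json)" || return 1
  task_list_is_empty "$tasks"
}

run_offsite_copy() {
  local log_dir="/var/log/zlh-backups"
  if ! pbs_is_quiet; then
    echo "PBS has running tasks; skipping rclone copy" >&2
    return 1
  fi
  # Make sure the log directory exists before rclone opens the log file.
  mkdir -p "$log_dir"
  rclone copy /mnt/datastore/z-back zback-remote:z-back-remote/pbs/z-back \
    --s3-no-check-bucket \
    --progress \
    --transfers=8 \
    --checkers=16 \
    --log-file="${log_dir}/pbs-r2-z-back-$(date +%F-%H%M).log" \
    --log-level=INFO
}
```

The guard is deliberately conservative: any running PBS task blocks the copy, including ones unrelated to `z-back`. It only checks at start time, so a backup or prune job that begins during a long upload would not be detected. Scheduling the copy outside the PBS job windows is still needed.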
## Security note
The R2 access key and secret were pasted during the session. Treat them as compromised.
Before real backup upload:
- rotate/recreate the Cloudflare R2 access key and secret
- update `/root/.config/rclone/rclone.conf`
- verify test upload/list still works
Recommended rclone config additions:
```ini
acl = private
no_check_bucket = true
```
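After rotating the key and editing the config, the two settings above can be confirmed without opening the file by hand. A minimal sketch, assuming the remote is still named `zback-remote`; the helper names are hypothetical. `rclone config show` prints the remote's settings including the secret, so the sketch keeps the output in a variable and never echoes it.

```bash
#!/usr/bin/env bash
# Sketch: confirm the recommended settings are present on the rclone remote.
# Helper names are hypothetical.

# $1 = config text, $2 = key, $3 = expected value
remote_has_setting() {
  printf '%s\n' "$1" |
    grep -Eq "^[[:space:]]*$2[[:space:]]*=[[:space:]]*$3[[:space:]]*$"
}

check_remote_config() {
  local cfg
  # The output contains the secret key; it is captured, not printed.
  cfg="$(rclone config show zback-remote)" || return 1
  remote_has_setting "$cfg" acl private &&
    remote_has_setting "$cfg" no_check_bucket true
}
```

Possible usage: `check_remote_config && echo "remote settings OK"`, followed by the same test upload/list used earlier to confirm that the rotated key works.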
## Session stopping point
rclone transport to R2 is working. The remaining work is PBS cleanup, fresh baseline backup, R2 copy, and restore validation.