14.2.3.3 Production Volume Backup

A focused guide to Production Volume Backup, connecting core concepts with practical Docker and container operations.

Production volume backup is the recurring process of capturing the contents of a Docker volume into a separate, durable storage location on a schedule, with enough retention and verification built in that a host failure, accidental deletion, or corruption event does not result in permanent data loss.

Snapshot-based backup versus archive-based backup

Two broad approaches exist for backing up a volume: archiving its contents into a portable file (typically a tar archive), or taking a storage-layer snapshot of the underlying block device or filesystem. Archive-based backups are portable and work identically regardless of the storage backend; snapshot-based backups are typically faster and more space-efficient but depend on the specific storage technology supporting snapshots.

docker run --rm -v pgdata:/data:ro -v "$(pwd)":/backup alpine \
  tar czf /backup/pgdata-$(date +%Y%m%d-%H%M).tar.gz -C /data .

lvcreate --size 5G --snapshot --name pgdata-snap /dev/vg0/pgdata

Where the volume's backing storage is an LVM logical volume, a ZFS dataset, or a cloud block storage device, the snapshot mechanism native to that layer is usually preferable for production use, since it captures the volume's state at a single point in time, nearly instantaneously, instead of reading files one by one over several minutes while an active workload keeps changing them.

Cloud block storage snapshots

For volumes backed by cloud block storage, the cloud provider's own snapshot mechanism is typically the most production-appropriate option, since it operates outside the container and host entirely:

aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pgdata backup $(date +%Y%m%d)"

aws ec2 describe-snapshots --filters "Name=volume-id,Values=vol-0123456789abcdef0" --query 'Snapshots[*].[SnapshotId,StartTime,State]'

Cloud snapshots are usually incremental after the first full snapshot, which keeps both the storage cost and the time required for each subsequent backup low compared to repeatedly archiving the full volume contents from scratch.
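
One detail worth handling in automation is that create-snapshot returns while the snapshot is still pending, so a job that stops there can report success for a snapshot that later fails. The following is a minimal sketch, assuming the AWS CLI is installed and has credentials; the function name snapshot_and_wait is hypothetical, not part of any tool.

```shell
# snapshot_and_wait VOLUME_ID DESCRIPTION
# Hypothetical helper: starts an EBS snapshot, blocks until it completes,
# and prints the snapshot ID. Returns non-zero if either step fails.
snapshot_and_wait() {
  snap_id=$(aws ec2 create-snapshot \
    --volume-id "$1" \
    --description "$2" \
    --query SnapshotId --output text) || return 1

  # create-snapshot returns while the snapshot is still "pending";
  # the waiter polls until it reaches "completed" or gives up.
  aws ec2 wait snapshot-completed --snapshot-ids "$snap_id" || return 1

  printf '%s\n' "$snap_id"
}

# Example:
#   snapshot_and_wait vol-0123456789abcdef0 "pgdata backup $(date +%Y%m%d)"
```

The waiter stops polling after a fixed number of attempts, so a large first snapshot may need the wait repeated before it is treated as failed.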

Consistency during backup

A backup taken while a volume is being actively written to risks capturing data mid-write, which is a particular concern for databases and other applications that maintain invariants across multiple files. Filesystem-level and storage-level snapshots are generally crash-consistent (equivalent to what would be on disk after a sudden power loss), which is sufficient for applications designed to recover cleanly from an unclean shutdown, but applications without that resilience may need a brief pause or flush before the snapshot is taken:

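# PostgreSQL 14 and earlier. From PostgreSQL 15 the functions are pg_backup_start()
# and pg_backup_stop(), and both must be called from one session that stays open
# while the snapshot is taken, so separate psql invocations no longer work.
docker exec my-db psql -U postgres -c "SELECT pg_start_backup('snapshot');"
lvcreate --size 5G --snapshot --name pgdata-snap /dev/vg0/pgdata
docker exec my-db psql -U postgres -c "SELECT pg_stop_backup();"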

Coordinating the snapshot with the application's own backup-mode hooks, where the application provides one, produces a more reliably consistent result than relying on crash consistency alone.
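
For an application that offers no backup hook at all, the bluntest form of coordination is to stop its container for the few seconds the snapshot takes. The sketch below illustrates that pattern; with_container_stopped is a hypothetical helper name, and it assumes a short outage is acceptable for the workload.

```shell
# with_container_stopped CONTAINER COMMAND [ARGS...]
# Hypothetical helper: stops CONTAINER, runs COMMAND, then restarts the
# container whether or not COMMAND succeeded. Returns COMMAND's exit status.
with_container_stopped() {
  container=$1
  shift
  docker stop "$container" >/dev/null || return 1
  "$@"
  status=$?
  # Restart unconditionally so a failed capture does not leave the service down.
  docker start "$container" >/dev/null || return 1
  return "$status"
}

# Example:
#   with_container_stopped my-db \
#     lvcreate --size 5G --snapshot --name pgdata-snap /dev/vg0/pgdata
```

Because the container is restarted even when the capture fails, a broken backup step degrades into a missed backup rather than an outage.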

Scheduling and retention policy

A production volume backup process needs an explicit schedule and retention policy, balancing how much historical recovery capability is kept against the storage cost of keeping it:

0 2 * * * /usr/local/sbin/backup-volume.sh pgdata >> /var/log/volume-backup.log 2>&1

find /backups -name "pgdata-*.tar.gz" -mtime +30 -delete
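
The cron entry above calls backup-volume.sh, whose contents are not shown in this section. As a rough sketch of what such a script might contain, the function below archives a volume under a temporary name and renames it only after tar succeeds, so that pruning and shipping steps never pick up a truncated archive. The function name and the .partial suffix are assumptions made for illustration.

```shell
# backup_volume VOLUME DEST_DIR
# Sketch only. DEST_DIR must be an absolute host path, because Docker treats
# a relative -v source as a named volume rather than a bind mount.
backup_volume() {
  volume=$1
  dest=$2
  name="$volume-$(date +%Y%m%d-%H%M).tar.gz"

  if docker run --rm \
       -v "$volume":/data:ro \
       -v "$dest":/backup \
       alpine tar czf "/backup/$name.partial" -C /data .
  then
    # Publish under the final name only once the archive is complete.
    mv "$dest/$name.partial" "$dest/$name"
  else
    rm -f "$dest/$name.partial"
    return 1
  fi
}

# Example:
#   backup_volume pgdata /backups
```

A non-zero return on failure is what lets cron wrappers and alerting hooks tell a failed run from a successful one.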

A common retention pattern keeps daily backups for a short window, weekly backups for a longer window, and monthly backups for long-term retention, which provides fine-grained recent recovery points without retaining every daily backup indefinitely.

find /backups/daily -type f -mtime +7 -delete
find /backups/weekly -type f -mtime +60 -delete
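
Pruning by tier only works if something files each new archive into the right tiers in the first place. One possible way to do that is sketched below, assuming the daily, weekly, and monthly directories share a filesystem; promote_backup is a hypothetical helper, and the choice of Sunday and the 1st of the month as promotion days is an assumption, not something the retention pattern requires.

```shell
# promote_backup FILE ROOT [DAY_OF_WEEK] [DAY_OF_MONTH]
# Hypothetical helper: files every new archive under ROOT/daily, and also
# under ROOT/weekly on Sundays and ROOT/monthly on the 1st. Hard links keep
# the extra copies free of additional disk usage; cp is the fallback when
# linking is not possible (for example across filesystems).
promote_backup() {
  file=$1
  root=$2
  dow=${3:-$(date +%u)}   # 1 = Monday ... 7 = Sunday
  dom=${4:-$(date +%d)}   # 01 ... 31

  tiers=daily
  [ "$dow" = 7 ] && tiers="$tiers weekly"
  [ "$dom" = 01 ] && tiers="$tiers monthly"

  for tier in $tiers; do
    mkdir -p "$root/$tier" || return 1
    # Archive names are timestamped, so an existing entry is already filed.
    [ -e "$root/$tier/$(basename "$file")" ] && continue
    ln "$file" "$root/$tier/" 2>/dev/null || cp "$file" "$root/$tier/" || return 1
  done
  return 0
}
```

Deleting a hard link from the daily tier leaves the weekly and monthly entries intact, so each tier can be pruned on its own schedule.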

Shipping backups off the host

A backup stored only on the same host and disk as the volume it protects offers no protection when that host or disk fails outright, which is one of the most common ways data is lost entirely. Every backup should be copied to a separate location as part of the same automated process that creates it:

aws s3 cp /backups/ s3://my-backup-bucket/docker-volumes/ --recursive --exclude "*" --include "pgdata-$(date +%Y%m%d)-*.tar.gz" --storage-class STANDARD_IA

rsync -a /backups/ backup-host.internal:/mnt/backup-storage/docker-volumes/
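
A copy is only useful if it arrived intact, so it can help to ship a checksum alongside each archive and check it at the destination. The sketch below assumes sha256sum is available on both ends (it is part of GNU coreutils and BusyBox); the function names are hypothetical.

```shell
# write_checksum FILE: records FILE's SHA-256 in FILE.sha256 next to it.
# The checksum file stores only the base name, so the pair can be moved
# to another directory or host and still be verified there.
write_checksum() {
  ( cd "$(dirname "$1")" &&
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum FILE: succeeds only if FILE still matches FILE.sha256.
verify_checksum() {
  ( cd "$(dirname "$1")" &&
    sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

Running write_checksum before the transfer and verify_checksum on the receiving side turns a silently truncated or corrupted copy into an explicit failure.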

Encrypting backups at rest

Because a volume backup often contains the same sensitive data the production system itself holds, it should be encrypted both in transit to its off-host destination and at rest once it arrives there:

docker run --rm -v pgdata:/data:ro alpine tar czf - -C /data . \
  | gpg --symmetric --cipher-algo AES256 -o /backups/pgdata-$(date +%Y%m%d).tar.gz.gpg

A backup that is easier for an attacker to exfiltrate and read than the production system itself is a meaningful security gap, even if the production system is otherwise well protected.
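
Interactive passphrase prompts do not work from cron, so an unattended job needs gpg to read the passphrase from somewhere else. The following sketch assumes GnuPG 2.1 or later and a passphrase file readable only by the backup user; the function names and the idea of a passphrase file are assumptions for illustration, and a key management service or public-key encryption may suit some environments better.

```shell
# encrypt_stream PASSFILE OUTFILE: encrypts stdin to OUTFILE without prompting.
# PASSFILE is a hypothetical file holding the passphrase on its first line.
encrypt_stream() {
  gpg --batch --yes --pinentry-mode loopback --passphrase-file "$1" \
      --symmetric --cipher-algo AES256 -o "$2"
}

# decrypt_stream PASSFILE INFILE: writes the decrypted contents to stdout.
decrypt_stream() {
  gpg --batch --quiet --pinentry-mode loopback --passphrase-file "$1" \
      --decrypt "$2"
}

# Example (paths are placeholders):
#   tar czf - -C /some/dir . | encrypt_stream /path/to/passfile out.tar.gz.gpg
#   decrypt_stream /path/to/passfile out.tar.gz.gpg | tar tzf - >/dev/null
```

Listing the decrypted archive with tar, as in the second example line, is a cheap check that the passphrase on file can actually open the backup.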

Verifying backups through actual restores

A backup that has never been restored is unverified. A scheduled restore drill, ideally automated, that provisions a fresh volume from the latest backup and confirms the application starts correctly against it, is the only reliable way to know the backup process is actually producing usable output:

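docker volume create pgdata-restore-test
docker run --rm -v pgdata-restore-test:/data -v "$(pwd)":/backup alpine \
  tar xzf /backup/pgdata-latest.tar.gz -C /data
# Start PostgreSQL on the restored data first; pg_isready only probes a running
# server, so using it as the container command would never start one.
docker run -d --name pgdata-restore-test -v pgdata-restore-test:/var/lib/postgresql/data postgres:16
sleep 10; docker exec pgdata-restore-test pg_isready -U postgres
docker rm -f pgdata-restore-test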
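
A restored database can take an unpredictable time to finish recovery, so an automated drill usually polls for readiness with an upper bound instead of checking once. The helper below is a minimal sketch of such a bounded retry; wait_until is a hypothetical name, and the attempt count and delay in the example are arbitrary.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds, sleeping
# DELAY_SECONDS between tries. Returns 0 on the first success and 1 if
# every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    attempts=$((attempts - 1))
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# Example, assuming a server container named restore-test is already running:
#   wait_until 30 2 docker exec restore-test pg_isready -U postgres
```

A readiness probe confirms only that the server accepts connections; a stricter drill would follow it with a query against data known to exist in the backup.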

Monitoring backup job health

The backup process itself needs monitoring, since a silently failing backup job is functionally indistinguishable from having no backup at all until the moment a restore is actually needed:

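#!/bin/sh
# Cron wrapper: send an alert and exit non-zero when the backup script fails.
if ! /usr/local/sbin/backup-volume.sh pgdata; then
  curl -fsS -X POST https://alerts.example.com/webhook \
    -d "Backup failed for pgdata on $(hostname)"
  exit 1
fi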

Alerting on backup job failure, rather than only checking backup status manually and occasionally, ensures a missed or failed backup is noticed within the same operational cycle it occurred in, not weeks later during an unrelated incident.
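
A failure hook cannot fire if the job never runs at all, for example when cron is stopped or the host is down. A complementary check, sketched below under the assumption that archives land in a known directory, asks whether a sufficiently recent backup exists; backup_is_fresh is a hypothetical name, and the 26-hour threshold in the example is an arbitrary allowance for a daily job.

```shell
# backup_is_fresh DIR PATTERN MAX_AGE_MINUTES
# Succeeds if DIR contains at least one file matching PATTERN that was
# modified within the last MAX_AGE_MINUTES minutes. Relies on find's -mmin
# test, which GNU, BSD, and BusyBox find provide but POSIX does not require.
backup_is_fresh() {
  [ -n "$(find "$1" -type f -name "$2" -mmin "-$3" 2>/dev/null | head -n 1)" ]
}

# Example: alert if no pgdata archive is newer than 26 hours (1560 minutes).
#   backup_is_fresh /backups 'pgdata-*.tar.gz' 1560 ||
#     curl -fsS -X POST https://alerts.example.com/webhook \
#       -d "No recent pgdata backup on $(hostname)"
```

Running this check from a different scheduler or host than the backup job itself avoids both being silenced by the same outage.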

Common mistakes

Running backups without verifying they complete successfully, only discovering a silent failure when a restore is actually needed.
Storing backups exclusively on the same host as the original volume, leaving no protection against a full host or disk failure.
Taking backups of an actively writing application without any consistency coordination, risking a backup that captures an internally inconsistent state.
Retaining every backup indefinitely without a retention policy, or conversely retaining too few, leaving an inadequate recovery window when an issue is discovered well after it occurred.
Never test-restoring a backup, leaving its actual usability unverified until an incident forces the first real attempt.

A production volume backup process is built from a consistency-aware capture method, an explicit retention policy, off-host and encrypted storage of the result, active monitoring of the backup job itself, and periodic, real restore drills that prove the entire chain actually works end to end.