14.2.3.4 Production Restore Testing

A focused guide to Production Restore Testing, connecting core concepts with practical Docker and container operations.

Production restore testing is the practice of regularly proving that a backup can actually be turned back into a working system, rather than trusting that a backup job completing without error means the data inside it is usable, complete, and correctly structured for recovery.

Why a successful backup job is not proof of a usable backup

A backup script can exit with status zero, write a file of the expected size, and still produce something unusable: a database dump taken mid-transaction, an archive missing a file due to a permissions error that was silently swallowed, or a snapshot of a volume that was already corrupted before the backup ran. None of these failure modes are visible from the backup job's own success signal; they only become visible when someone actually tries to restore from the result.

ls -lh /backups/pgdata-20240101.tar.gz

A file of a plausible size sitting in the expected location confirms almost nothing about whether that archive will actually restore into a working database.
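
A cheap step beyond looking at the file size is to confirm that the archive is at least readable and contains a file the restore depends on. The following is a minimal sketch, assuming a gzip-compressed tar of a PostgreSQL data directory with PG_VERSION at the archive root; the function name check_archive is hypothetical. Passing this check still does not prove the database will start, which only a real restore can show.

```shell
#!/bin/sh
# check_archive ARCHIVE [MARKER]
# Returns 0 only if the gzip stream is intact, the tar listing is readable,
# and MARKER (default: PG_VERSION, an assumption about the archive layout)
# appears at the archive root.
check_archive() {
    archive=$1
    marker=${2:-PG_VERSION}

    # gzip -t reads the whole stream and fails on truncation or corruption.
    gzip -t "$archive" 2>/dev/null || { echo "corrupt gzip: $archive" >&2; return 1; }

    # Listing the archive exercises the tar structure without extracting it.
    listing=$(tar tzf "$archive" 2>/dev/null) || { echo "unreadable tar: $archive" >&2; return 1; }

    # Accept both "PG_VERSION" and "./PG_VERSION" entry styles.
    printf '%s\n' "$listing" | grep -qxF -e "$marker" -e "./$marker" \
        || { echo "missing $marker in $archive" >&2; return 1; }
}

# Example: check_archive /backups/pgdata-20240101.tar.gz
```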

Designing a restore test that mirrors a real recovery

A meaningful restore test reproduces the actual recovery path as closely as possible: a fresh volume, a fresh container, and the backup as the only source of data, rather than restoring on top of an environment that still has some of the original state present.

docker volume create pgdata-restore-test
docker run --rm -v pgdata-restore-test:/data -v "$(pwd)":/backup alpine \
  tar xzf /backup/pgdata-latest.tar.gz -C /data
docker run -d --name restore-test -v pgdata-restore-test:/var/lib/postgresql/data postgres:16
# The server needs a moment to start, so poll instead of checking once.
timeout 60 sh -c 'until docker exec restore-test pg_isready -q; do sleep 1; done'
docker exec restore-test psql -U postgres -d app -c "SELECT count(*) FROM users;"

Checking that the database process starts is a necessary first step, but confirming that expected data is actually present and queryable is what distinguishes a real restore test from a superficial one.
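
A bare row count prints a number but does not fail the test when the table is empty or the query errors out. One way to turn it into a pass/fail check is sketched below; the helper name assert_min_rows and the threshold of 1000 rows are illustrative assumptions, not values from the setup above.

```shell
#!/bin/sh
# assert_min_rows LABEL ACTUAL MINIMUM
# Fails unless ACTUAL is a non-negative integer that is at least MINIMUM.
assert_min_rows() {
    label=$1 actual=$2 minimum=$3

    # Reject empty or non-numeric output, e.g. an error message from psql.
    case $actual in
        ''|*[!0-9]*) echo "FAIL $label: not a row count: '$actual'" >&2; return 1 ;;
    esac

    if [ "$actual" -lt "$minimum" ]; then
        echo "FAIL $label: $actual rows, expected at least $minimum" >&2
        return 1
    fi
    echo "OK $label: $actual rows"
}

# Example wiring (assumes the restore-test container started earlier):
#   count=$(docker exec restore-test psql -U postgres -d app -At -c "SELECT count(*) FROM users;")
#   assert_min_rows users "$count" 1000
```

The -At flags make psql print only the bare value, so the result can be compared numerically.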

Automating restore tests on a schedule

A restore test performed once, manually, after the backup process was first set up, answers whether the process worked at that moment, not whether it still works months later after the application, schema, or backup script has changed. Scheduling restore tests to run automatically, on the same cadence as backups or slightly less frequently, catches regressions introduced by unrelated changes:

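# crontab entry (Sundays at 04:00):
0 4 * * 0 /usr/local/sbin/restore-test.sh pgdata >> /var/log/restore-test.log 2>&1

#!/bin/sh
set -e
cleanup() {
  docker rm -f restore-test >/dev/null 2>&1 || true
  docker volume rm -f pgdata-restore-test >/dev/null 2>&1 || true
}
trap cleanup EXIT
cleanup
docker volume create pgdata-restore-test
docker run --rm -v pgdata-restore-test:/data -v /backups:/backup alpine \
  tar xzf /backup/pgdata-latest.tar.gz -C /data
# Start a real server on the restored volume; pg_isready as the container's only command has no server to probe.
docker run -d --name restore-test -v pgdata-restore-test:/var/lib/postgresql/data postgres:16
timeout 60 sh -c 'until docker exec restore-test pg_isready -q; do sleep 1; done'
docker exec restore-test psql -U postgres -d app -c "SELECT count(*) FROM users;"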

A failing exit code from this script should be treated the same as any other production alert, since a failed restore test means the backup currently in place would not have worked had it been needed for a real recovery.
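
Redirecting output to a log file only helps if someone reads the log. A minimal sketch of turning the exit code into an alert follows; notify_failure is a hypothetical placeholder for whatever paging or messaging mechanism is already in use, and run_with_alert is likewise an illustrative name.

```shell
#!/bin/sh
# notify_failure MESSAGE
# Placeholder alert hook: replace the body with whatever notifies the on-call
# rotation (the channel is an assumption, not part of the restore script).
notify_failure() {
    echo "ALERT: $1" >&2
}

# run_with_alert COMMAND [ARGS...]
# Runs the command, raises an alert on a non-zero exit, and passes the
# original exit status through so cron or a wrapper can also see it.
run_with_alert() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        notify_failure "restore test failed (exit $status): $*"
    fi
    return "$status"
}

# Example: run_with_alert /usr/local/sbin/restore-test.sh pgdata
```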

Measuring restore time, not just restore success

Knowing that a restore succeeds is necessary but incomplete; knowing how long it takes is equally important, since an organization's actual recovery time objective is only meaningful if it has been measured against a real restore rather than assumed:

time (docker run --rm -v pgdata-restore-test:/data -v /backups:/backup alpine \
  tar xzf /backup/pgdata-latest.tar.gz -C /data)

If a measured restore takes considerably longer than the recovery time objective the business expects during an incident, that gap needs to be addressed deliberately, whether by changing the backup format, the storage location, or the restore procedure itself, rather than discovered for the first time during an actual outage.
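
To make that comparison automatic rather than something read off a terminal, the measured duration can be checked against the objective directly. The sketch below assumes the recovery time objective is expressed in whole seconds; the function name timed_within and the 900-second figure in the example are hypothetical.

```shell
#!/bin/sh
# timed_within RTO_SECONDS COMMAND [ARGS...]
# Runs the command, prints the elapsed wall-clock seconds, and fails if the
# command failed or took longer than RTO_SECONDS.
timed_within() {
    rto=$1; shift
    start=$(date +%s)
    status=0
    "$@" || status=$?
    elapsed=$(( $(date +%s) - start ))
    echo "elapsed=${elapsed}s rto=${rto}s"

    # A failed restore is reported with its own exit status.
    [ "$status" -eq 0 ] || return "$status"
    if [ "$elapsed" -gt "$rto" ]; then
        echo "restore exceeded RTO by $(( elapsed - rto ))s" >&2
        return 1
    fi
}

# Example: timed_within 900 /usr/local/sbin/restore-test.sh pgdata
```

Timing the whole restore script rather than only the tar extraction gives a figure closer to a real recovery, since it includes server startup and the first successful query.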

Testing application-level correctness, not just data presence

For applications with meaningful internal consistency requirements, such as referential integrity across tables or invariants enforced only at the application layer, a restore test should ideally exercise the application against the restored data, not just confirm the database process starts:

docker network create restore-test-net
docker network connect restore-test-net restore-test
docker run --rm --network restore-test-net \
  -e DATABASE_URL=postgres://postgres@restore-test:5432/app \
  my-api npm run smoke-test

A smoke test suite run against the restored environment catches the class of corruption that a simple row count or health check would miss, such as data that is present but logically inconsistent in a way only the application's own validation would detect.
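
Where a relationship is enforced only by the application and not by a database constraint, a direct query for orphaned rows can complement the smoke test. The sketch below only builds the query text; the function name orphan_count_sql and the orders/users tables in the example are hypothetical, and identifiers are interpolated unquoted, so it should only be given trusted names.

```shell
#!/bin/sh
# orphan_count_sql CHILD_TABLE FK_COLUMN PARENT_TABLE PK_COLUMN
# Prints a query that counts child rows whose foreign key value matches no
# parent row. Identifiers are not quoted or escaped: pass trusted names only.
orphan_count_sql() {
    printf 'SELECT count(*) FROM %s c LEFT JOIN %s p ON c.%s = p.%s WHERE c.%s IS NOT NULL AND p.%s IS NULL;\n' \
        "$1" "$3" "$2" "$4" "$2" "$4"
}

# Example wiring (assumes the restore-test container and hypothetical tables):
#   orphans=$(docker exec restore-test psql -U postgres -d app -At \
#       -c "$(orphan_count_sql orders user_id users id)")
#   [ "$orphans" -eq 0 ] || { echo "orphaned orders: $orphans" >&2; exit 1; }
```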

Testing partial and point-in-time recovery scenarios

Beyond restoring the most recent backup, restore testing should occasionally exercise less common but realistic scenarios: restoring an older backup to recover from a problem introduced gradually and only discovered later, or restoring to a specific point in time if the backup strategy supports it:

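docker run --rm -v pgdata-restore-test:/data -v /backups:/backup alpine \
  sh -c 'tar xzf /backup/pgdata-20240601.tar.gz -C /data && touch /data/recovery.signal'
# Assumes the tar is a valid base backup and pgdata-wal-archive holds the archived WAL segments.
docker run -d --name restore-test -v pgdata-restore-test:/var/lib/postgresql/data \
  -v /backups/pgdata-wal-archive:/wal-archive:ro postgres:16 \
  -c restore_command='cp /wal-archive/%f %p' -c recovery_target_time='2024-06-01 14:30:00'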

A restore process that has only ever been exercised against the most recent backup may behave unexpectedly when an older one is needed, particularly if backup format or schema has changed in the interim.
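
Choosing which older backup to test can itself be scripted. The following sketch assumes the dated naming pattern pgdata-YYYYMMDD.tar.gz used in the commands above and picks the newest archive taken on or before a given day; the function name pick_backup is hypothetical.

```shell
#!/bin/sh
# pick_backup DIR YYYYMMDD
# Prints the path of the newest pgdata-YYYYMMDD.tar.gz in DIR dated on or
# before the given day, or fails if none exists.
pick_backup() {
    dir=$1 cutoff=$2 best=
    for f in "$dir"/pgdata-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].tar.gz; do
        [ -e "$f" ] || continue            # glob matched nothing
        stamp=${f##*/pgdata-}              # strip directory and prefix
        stamp=${stamp%.tar.gz}             # strip suffix, leaving YYYYMMDD
        # Globs expand in sorted order, so the last match that passes wins.
        [ "$stamp" -le "$cutoff" ] && best=$f
    done
    [ -n "$best" ] || { echo "no backup on or before $cutoff in $dir" >&2; return 1; }
    printf '%s\n' "$best"
}

# Example: archive=$(pick_backup /backups 20240615)
```

Non-dated files such as pgdata-latest.tar.gz are ignored by the pattern, so the result always names a specific day's archive.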

Documenting and rehearsing the human procedure, not just the script

An automated restore test validates the technical mechanism, but a real incident also involves people following a procedure under pressure. Periodically having an operator who did not write the restore script actually perform a manual recovery, following only the written runbook, surfaces gaps in documentation that an automated test alone would not catch:

cat runbooks/database-restore.md

A runbook that only the original author can successfully follow is a single point of failure during an incident if that person is unavailable.

Common mistakes

  • Treating a backup job's successful exit code as sufficient evidence that the backup is restorable, without ever performing an actual restore.
  • Testing restores manually and infrequently rather than on an automated, regular schedule that catches regressions introduced by later changes.
  • Measuring only whether a restore succeeds, without measuring how long it takes, leaving the organization's recovery time objective unverified against reality.
  • Restoring only the most recent backup during tests, leaving older backups and point-in-time recovery paths unverified.
  • Never rehearsing the restore procedure as a human-followed runbook, relying entirely on the knowledge of whoever originally built the backup and restore automation.

Production restore testing closes the gap between believing a backup works and knowing it works, by regularly and automatically exercising the full path from backup file to a running, data-correct, application-validated system, and by measuring how long that path actually takes against the recovery expectations the business depends on.