14.3.1.4 Single Manual Rollback

A focused guide to Single Manual Rollback, connecting core concepts with practical Docker and container operations.

A single manual rollback reverts one container or one service to its previous known-good image and configuration by hand. It is typically performed during an incident when an automated rollback mechanism either does not exist or cannot be trusted to handle the specific failure being observed.

Why manual rollback still matters even with automation

Automated rollback systems handle the common case well: a new deployment fails its health checks, and the orchestrator reverts automatically. But automated rollback is not always triggered correctly, especially for failures that pass health checks but still cause incorrect behavior (a subtle data bug, a misconfigured feature flag, a dependency timing out only under specific load patterns). A manual rollback procedure that an operator can execute confidently and quickly is the backstop for exactly the failures automation was not designed to catch.

Identifying the last known-good image

The first step of any manual rollback is confirming, with certainty, which previously deployed image was actually working correctly, rather than assuming it was simply "the one before this one":

docker image ls my-api --format "{{.Tag}}\t{{.CreatedAt}}"
git log --oneline -- docker-compose.production.yml

Pinning deployments by immutable digest, rather than by a mutable tag, makes this step unambiguous: the previous deployment's exact digest is recorded in the deployment history, removing any guesswork about which bytes were actually running before the problematic release.

docker pull registry.example.com/my-api@sha256:8a1f3c9b2e7d...
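If deployments are logged as they happen, the lookup can be made mechanical. The sketch below assumes a deployment-history.log file with one deployment per line, oldest first, and the image reference as the last field on each line; the function name previous_image is hypothetical. It only answers what was running before the current release; whether that release was actually healthy still has to be confirmed separately.

```shell
# Print the image reference deployed immediately before the current one.
# Assumes one deployment per line, oldest first, with the image reference
# as the last whitespace-separated field. Blank lines are ignored.
previous_image() {
  local history_file="$1"
  awk 'NF { prev = last; last = $NF; n++ }
       END { if (n < 2) exit 1; print prev }' "$history_file" || {
    echo "fewer than two deployments recorded in $history_file" >&2
    return 1
  }
}
```

A possible use during an incident would be: docker pull "$(previous_image deployment-history.log)"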

Performing the rollback

For a single container, the rollback itself is a straightforward replacement: stop the current container, start a new one from the previous image, and confirm it is healthy before considering the rollback complete:

docker run -d --name my-api-rollback registry.example.com/my-api@sha256:8a1f3c9b2e7d...
docker exec my-api-rollback curl -f http://localhost:3000/healthz
docker stop my-api
docker rm my-api
docker rename my-api-rollback my-api

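Starting the rollback container under a temporary name and verifying its health before removing the failing container keeps the window during which neither version is confirmed working shorter than stopping the failing container first and only then starting the replacement. This ordering assumes traffic reaches the container through a reverse proxy or a shared Docker network rather than a published host port: two containers cannot publish the same host port at once, and port mappings cannot be added to a container after it is created. The health check as written also assumes curl is installed inside the image.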
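The health check in the sequence above runs once, immediately after the container starts, when the application may not be listening yet. A minimal sketch of a retry loop follows, assuming the image defines a HEALTHCHECK so that Docker reports a health status. The function name wait_healthy and its default retry counts are illustrative, not part of Docker.

```shell
# Poll a container's health status until Docker reports "healthy".
# Assumes the image defines a HEALTHCHECK; without one the
# .State.Health field is absent and the status stays empty.
wait_healthy() {
  local container="$1"
  local attempts="${2:-30}"   # hypothetical default: 30 tries
  local delay="${3:-2}"       # hypothetical default: 2 seconds apart
  local i=1 status=""
  while [ "$i" -le "$attempts" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status=""
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$container not healthy after $attempts attempts (last status: ${status:-unknown})" >&2
  return 1
}
```

Gating the swap on the result, for example wait_healthy my-api-rollback && docker stop my-api, keeps the failing container running if the replacement never becomes healthy.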

Rolling back configuration alongside the image

A failure is sometimes caused by a configuration change deployed alongside the new image rather than by the image itself, which means a complete manual rollback needs to revert both together rather than assuming only the image changed:

git diff HEAD~1 -- production.env
docker run -d --name my-api --env-file production.env.previous registry.example.com/my-api@sha256:8a1f3c9b2e7d...

Keeping the previous environment file or Compose override readily available, rather than only the previous image reference, is what makes this step possible quickly during an incident rather than requiring it to be reconstructed under pressure.
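One way to guarantee the previous file exists is to snapshot it as part of every deployment. The sketch below assumes the live configuration sits in a single env file and that the .previous suffix matches the production.env.previous file used above; the function name promote_env is hypothetical.

```shell
# Replace the live env file with a new one, keeping a copy of the
# outgoing version next to it as "<live file>.previous".
promote_env() {
  local new_env="$1"
  local live_env="$2"
  if [ -f "$live_env" ]; then
    # -p preserves the timestamp, which shows when that config went live
    cp -p "$live_env" "$live_env.previous"
  fi
  cp "$new_env" "$live_env"
}
```

This keeps only one generation of history. Committing the env file to version control, as the git diff command above implies, covers older versions.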

Database compatibility during rollback

If the failed deployment included a database migration, rolling back the application image without also considering the schema can leave the previous version running against a schema it was never designed to handle:

docker exec my-api npx node-pg-migrate down

A rollback plan should specify, in advance, whether a given migration is safe to leave applied while rolling back the application code (because it was written to be backward-compatible) or whether it genuinely needs to be reversed as part of the rollback; discovering this distinction for the first time during the incident itself is considerably riskier than having decided it ahead of time when the migration was first written.
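The decision can be recorded in a form that is trivial to query during an incident. The sketch below assumes a hypothetical plain-text plan file, maintained alongside the migrations, with one "migration-name keep" or "migration-name reverse" pair per line. Neither the file format nor the function name migration_action comes from node-pg-migrate or any other tool.

```shell
# Look up the pre-decided rollback action for a migration.
# Assumes a hypothetical plan file with one
# "<migration-name> <keep|reverse>" pair per line.
migration_action() {
  local plan_file="$1"
  local migration="$2"
  local action
  action=$(awk -v name="$migration" '$1 == name { print $2; exit }' "$plan_file")
  case "$action" in
    keep|reverse)
      echo "$action"
      ;;
    *)
      echo "no rollback decision recorded for $migration" >&2
      return 1
      ;;
  esac
}
```

Failing loudly when a migration has no recorded decision is deliberate: a missing entry is the situation the plan exists to prevent, so it should not silently default to either action.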

Communicating during a manual rollback

Because a manual rollback is not automated, no system-generated record of it exists unless someone creates one deliberately. The steps taken and their outcome should therefore be written down and communicated clearly to everyone involved in or affected by the incident:

echo "$(date -u): Rolled back my-api from sha256:3f29a8c1 to sha256:8a1f3c9b due to elevated 500 errors" >> /var/log/incident-notes.log

Verifying the rollback actually resolved the issue

A rollback is not complete once the previous version is running; it is complete once the original symptom that triggered the rollback has actually stopped occurring, which requires watching the relevant metrics or logs for long enough to be confident the issue is gone rather than merely paused:

docker logs --since 5m my-api | grep -c "ERROR"
curl -s "https://metrics.example.com/api/error_rate?service=my-api&window=10m"

A rollback performed correctly but not actually verified against the original symptom risks a false sense of resolution, particularly if the underlying cause was not the deployment at all but something coincidental.
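A single sample cannot distinguish "gone" from "paused", so the check is worth repeating over a period of time. The sketch below counts matching log lines in several consecutive windows and succeeds only if every window stays at or below a threshold. The function name errors_stay_low, the defaults, and the use of the literal string ERROR as the failure marker are all assumptions to adapt to the actual symptom.

```shell
# Sample recent container logs several times and succeed only if the
# number of ERROR lines stays at or below the threshold in every sample.
errors_stay_low() {
  local container="$1"
  local threshold="${2:-0}"   # maximum ERROR lines tolerated per sample
  local samples="${3:-5}"     # number of consecutive samples required
  local interval="${4:-60}"   # seconds between samples, also the log window
  local i=1 count
  while [ "$i" -le "$samples" ]; do
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    count=$(docker logs --since "${interval}s" "$container" 2>&1 | grep -c "ERROR" || true)
    if [ "$count" -gt "$threshold" ]; then
      echo "sample $i: $count errors in the last ${interval}s" >&2
      return 1
    fi
    i=$((i + 1))
    if [ "$i" -le "$samples" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

With the defaults, errors_stay_low my-api watches roughly five minutes of logs before reporting success.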

Preparing for manual rollback before it is needed

The practices that make a manual rollback fast and low-risk during a real incident are the ones put in place beforehand: pinning deployments by digest, retaining previous environment files and Compose overrides, and documenting which migrations are safe to leave applied. None of these can be assembled quickly under the pressure of an active incident if they were not already in place.

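# .Image is the local image ID; RepoDigests holds the pullable registry digest (empty if the image was never pulled or pushed)
docker image inspect --format '{{index .RepoDigests 0}}' "$(docker inspect --format '{{.Image}}' my-api)" >> deployment-history.log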

Logging the deployed image digest on every deployment, even when nothing has gone wrong, builds exactly the historical record a manual rollback depends on having available when something eventually does.

Common mistakes

  • Rolling back to "the previous tag" without confirming what image digest that tag actually pointed to at deployment time, especially if the tag is mutable and has since been overwritten.
  • Reverting the application image without also reverting an incompatible configuration change deployed alongside it.
  • Treating the rollback as complete the moment the previous version starts, without confirming the original symptom has actually stopped.
  • Having no documented plan for whether a deployment's database migration needs to be reversed alongside an application rollback, leaving that decision to be made for the first time under incident pressure.
  • Not recording what was rolled back, from what, and why, leaving no record for a postmortem or for the next person who needs to understand what happened.

A single manual rollback is most reliable when its hard decisions (which image is known-good, whether configuration needs to revert too, whether a migration needs reversing) were made and documented before the incident, leaving the rollback itself as a fast, mechanical execution rather than a series of judgment calls made under pressure.