14.3.3.5 Blue Green Fast Rollback
A focused guide to Blue Green Fast Rollback, connecting core concepts with practical Docker and container operations.
Blue-green fast rollback is the property, central to why teams adopt blue-green deployment in the first place, that reverting to the previous release after a problem is detected takes no longer than the original cutover did, because the previous environment was never stopped, never modified, and remains fully capable of receiving traffic again the instant the proxy is pointed back at it.
Why rollback speed is the entire point
Many deployment patterns can technically be rolled back, but the rollback often involves redeploying the previous version from scratch: pulling an image, starting containers, waiting for them to become healthy. Blue-green deployment's distinguishing advantage is that this entire sequence already happened, for the previous release, before the current deployment even started, and that previous, still-running environment is simply waiting, untouched, as a rollback target.
docker compose -p myapp-blue ps
Confirming the previous environment is still running and healthy, as a check performed routinely rather than only during an actual incident, is what keeps the fast rollback property genuinely available rather than theoretical.
The rollback operation itself
Because rollback is the same mechanism as the forward cutover, applied in the opposite direction, it requires no new tooling or unfamiliar procedure during an incident, only the same switch script pointed at the other environment:
./scripts/switch-environment.sh blue
#!/bin/sh
set -e
TARGET="$1"
cp "nginx-${TARGET}.conf" /etc/nginx/conf.d/upstream.conf
docker exec proxy nginx -t
docker exec proxy nginx -s reload
echo "Switched to ${TARGET}"
The fact that an operator under incident pressure runs the exact same script they would for a routine deployment, just with a different argument, removes the need to recall a separate, less-frequently-exercised rollback procedure at the worst possible time.
Defining the rollback decision criteria in advance
Speed of execution matters less if the decision to roll back is delayed by uncertainty about whether a rollback is actually warranted. Defining, before a deployment even starts, the specific signals that would trigger an immediate rollback removes that hesitation during a real incident:
ERROR_RATE=$(curl -s https://metrics.example.com/api/error_rate?service=my-api&window=5m)
if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
./scripts/switch-environment.sh blue
echo "Automatic rollback triggered: error rate ${ERROR_RATE}"
fi
An automated rollback trigger based on a clearly defined threshold, rather than requiring a human to notice and decide, captures the fast-rollback advantage of blue-green deployment fully, reacting in seconds to a degradation that a human watching a dashboard might take several minutes to notice and act on.
Keeping the previous environment genuinely ready
The fast rollback property depends on the previous environment actually being in a state where it can immediately receive traffic again, not merely existing as stopped containers that would need to be started and warmed up:
docker compose -p myapp-blue ps --filter "status=running"
Some teams keep the previous environment running at a reduced scale rather than fully stopped, balancing infrastructure cost against rollback speed; a fully stopped previous environment trades some of the speed advantage for lower cost, since containers need to start, and any connection pools or caches need to warm, before that environment is genuinely ready to handle full production load again.
docker compose -p myapp-blue up -d --scale api=1
Database state during a rollback
If a database migration was applied alongside the now-problematic release, rolling back the application traffic does not automatically reverse that migration, and the previous environment's code needs to remain compatible with whatever the database currently looks like for the fast rollback to actually work cleanly:
-- Verify the previous version's code can still operate against the current schema
SELECT column_name FROM information_schema.columns WHERE table_name = 'users';
This is the direct consequence of the backward-compatibility discipline that should already have been applied when the migration was written; a rollback that requires also reversing a database migration is no longer fast, since reversing a migration safely against live data takes considerably longer than flipping a proxy configuration.
Verifying the rollback resolved the issue
A fast rollback is not complete until the original symptom that triggered it has actually stopped, which should be confirmed by watching the relevant metrics return to their expected baseline after the switch back, not assumed the moment the switch command completes:
watch -n 5 'curl -s https://metrics.example.com/api/error_rate?service=my-api&window=1m'
docker logs --since 2m myapp-blue-api-1 | grep -c ERROR
Practicing the rollback before it is needed
The fast rollback property should be exercised periodically as a deliberate drill, not only discovered to work (or not) during a real incident; running the rollback script against a non-critical deployment, or as part of a scheduled chaos exercise, builds and maintains confidence that it will work exactly as expected when it actually matters.
./scripts/switch-environment.sh green
sleep 60
./scripts/switch-environment.sh blue
Common mistakes
- Stopping the previous environment immediately after a successful-looking cutover, removing the fast rollback target before enough time has passed to be confident no rollback will be needed.
- Defining no automated or clearly documented threshold for triggering a rollback, leaving the decision to ad hoc judgment during an actual incident.
- Applying a database migration that breaks backward compatibility with the previous release's code, turning what should be a fast rollback into a slower, more complex recovery involving migration reversal.
- Letting the previous environment's connection pools, caches, or warm state degrade while it sits idle, so that even though it is technically running, it is not actually ready to absorb full production load instantly.
- Never practicing the rollback procedure outside of a real incident, leaving its actual speed and reliability unverified until the moment it matters most.
Blue-green fast rollback delivers on its promise only when the previous environment is kept genuinely ready to receive traffic, the rollback decision criteria are defined in advance rather than improvised, and the rollback mechanism itself is exercised routinely enough that using it during a real incident is unremarkable rather than novel.