18.2.2.3 Swarm Rolling Updates

A focused guide to Swarm Rolling Updates, connecting core concepts with practical Docker and container operations.

Swarm rolling updates give built-in, configurable control over how a service transitions from one version to another: parallelism, ordering, monitoring duration, and failure tolerance. All of these are expressed as part of the service definition, so no external deployment tool or custom scripting is needed to get equivalent behavior.

Update order: stop-first versus start-first

The update order determines whether an old task is stopped before its replacement starts, or the new task starts first and the old one is only stopped once the replacement is confirmed running:

docker service update --update-order stop-first --image my-api:1.5.0 my-api

docker service update --update-order start-first --image my-api:1.5.0 my-api

stop-first is the default. It briefly reduces available capacity during each batch's transition, since the old task is gone before its replacement is ready. start-first avoids this dip by ensuring the replacement is running before the old task stops, at the cost of briefly running both versions side by side (and temporarily needing resources for the extra tasks), which must be acceptable for the service being updated.
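The capacity trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, not part of Docker: capacity_bounds is a hypothetical helper, and it assumes the worst case in which every task of a batch is mid-transition at the same moment, with parallelism between 1 and the replica count.

```shell
# Hypothetical helper (not a Docker command): worst-case running-task
# bounds while one batch is in transition.
# Usage: capacity_bounds ORDER REPLICAS PARALLELISM
capacity_bounds() {
  order=$1 replicas=$2 parallelism=$3
  case "$order" in
    # Old tasks stop before replacements start: capacity dips.
    stop-first)  echo "min=$((replicas - parallelism)) max=$replicas" ;;
    # Replacements start before old tasks stop: capacity temporarily grows.
    start-first) echo "min=$replicas max=$((replicas + parallelism))" ;;
    *) echo "unknown order: $order" >&2; return 1 ;;
  esac
}

capacity_bounds stop-first 3 1    # prints: min=2 max=3
capacity_bounds start-first 3 1   # prints: min=3 max=4
```

For a three-replica service updated one task at a time, stop-first may drop to two running tasks, while start-first may need room for a fourth.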

Controlling update parallelism

The parallelism setting controls how many tasks update simultaneously in each batch, directly trading overall update speed against the blast radius of a problem discovered partway through the rollout:

docker service update --update-parallelism 1 --image my-api:1.5.0 my-api

docker service update --update-parallelism 3 --image my-api:1.5.0 my-api

A parallelism of 1 is the safest and slowest choice: tasks update one at a time, which limits how much of the service is affected if the new version turns out to have a problem. A higher parallelism completes the rollout faster but exposes more replicas to a potential issue before it is caught and the rollout paused.
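The speed versus blast-radius trade-off can be estimated before choosing a value. This is a rough sketch using two hypothetical helpers (batch_count and first_batch_percent are illustrative names, not Docker commands); it assumes a parallelism of at least 1 and uses integer arithmetic, so percentages are rounded down.

```shell
# Hypothetical helpers for reasoning about --update-parallelism.

# Number of batches needed: ceiling of REPLICAS / PARALLELISM.
batch_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Share of replicas (in percent, rounded down) that run the new version
# after the first batch, i.e. before any later batch could be stopped.
first_batch_percent() {
  echo $(( $2 * 100 / $1 ))
}

batch_count 6 1           # prints: 6
batch_count 6 3           # prints: 2
first_batch_percent 6 1   # prints: 16
first_batch_percent 6 3   # prints: 50
```

With six replicas, raising parallelism from 1 to 3 cuts the rollout from six batches to two, but half the service is already on the new version after the first batch.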

The monitor duration before proceeding

The update monitor setting controls how long Swarm watches each updated task for failure after it starts (5 seconds by default). A task that fails within this window counts against the update, so the window needs to give a newly started task's health check enough time to run and report. The pause between batches is set separately with --update-delay.

docker service update --update-monitor 30s --image my-api:1.5.0 my-api

Set this duration with the service's actual health check timing in mind. It should be long enough for the health check's interval and retry count to confirm health or report a failure; a window that is too short lets the rollout move on before a real problem has had a chance to surface.
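One way to pick a monitor duration is to estimate how long a broken task could take to be marked unhealthy. The sketch below is a rough upper bound under stated assumptions, not an official formula: it assumes every failing check runs for its full timeout and that checks are spaced one interval apart. worst_case_unhealthy_seconds is a hypothetical helper name.

```shell
# Hypothetical helper: rough upper bound, in seconds, on the time a
# failing container needs before Docker marks it unhealthy.
# Usage: worst_case_unhealthy_seconds INTERVAL TIMEOUT RETRIES [START_PERIOD]
worst_case_unhealthy_seconds() {
  interval=$1 timeout=$2 retries=$3 start_period=${4:-0}
  # Each of the RETRIES failing checks may wait one interval and then
  # run until its timeout; failures in the start period do not count.
  echo $(( start_period + retries * (interval + timeout) ))
}

worst_case_unhealthy_seconds 10 5 3      # prints: 45
worst_case_unhealthy_seconds 10 5 3 20   # prints: 65

# The result could then be passed to the update (not executed here):
#   docker service update --update-monitor 45s --image my-api:1.5.0 my-api
```

For a health check with a 10 second interval, 5 second timeout, and 3 retries, a 30 second monitor window could expire before the third failed check, so 45 seconds or more is the safer choice under these assumptions.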

Failure tolerance and automatic pause or rollback

The maximum failure ratio sets what fraction of updated tasks may fail before Swarm invokes the failure action, which by default pauses the rollout. Setting the failure action to rollback instead converts a detected problem into an immediate, automatic recovery rather than requiring manual intervention:

docker service update \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.2 \
  --image my-api:1.5.0 my-api

With this configuration, if more than 20% of the updated tasks fail within their monitor window, Swarm automatically reverts the service to its previous, known-good configuration rather than continuing to roll out a version that has already demonstrated a meaningful failure rate.
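The threshold check itself is simple to reason about. The following sketch mirrors the comparison described above, assuming the ratio is evaluated as failed tasks divided by tasks updated and that the action fires only when the ratio is strictly exceeded. failure_action_triggered is a hypothetical helper, not something Docker ships.

```shell
# Hypothetical helper: succeeds (exit status 0) when FAILED / UPDATED
# is strictly greater than MAX_RATIO.
# Usage: failure_action_triggered FAILED UPDATED MAX_RATIO
failure_action_triggered() {
  awk -v f="$1" -v n="$2" -v r="$3" \
    'BEGIN { exit !(n > 0 && f / n > r) }'
}

# 1 failure out of 10 updated tasks stays under a 0.2 ratio.
if failure_action_triggered 1 10 0.2; then echo rollback; else echo continue; fi
# prints: continue

# 3 failures out of 10 updated tasks exceed it.
if failure_action_triggered 3 10 0.2; then echo rollback; else echo continue; fi
# prints: rollback
```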

Combining rolling update settings with meaningful health checks

These rolling update mechanisms only function as intended if the service's own health check is genuinely meaningful, since Swarm relies on health status to determine whether a newly started task should be considered a successful part of the rollout; a shallow health check that always reports healthy regardless of actual application condition would let a rollout proceed and complete successfully even with a genuinely broken new version.

HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

Investing in a meaningful health check, exercising real dependencies rather than just confirming the process is running, is the prerequisite that makes every rolling update safety mechanism described here actually function as intended rather than providing only the appearance of safety.
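The settings from the previous sections can be combined in one invocation. The wrapper below is a sketch: run and rolling_update are hypothetical helper names, the DRY_RUN variable is an assumption of this example, and the 45s monitor value is chosen as roughly retries times (interval plus timeout) for the HEALTHCHECK above. By default it only prints the command; setting DRY_RUN=0 would execute it against a real Swarm.

```shell
# Print the command when DRY_RUN is unset or 1; execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper combining order, parallelism, monitor window,
# and failure handling in a single update.
# Usage: rolling_update IMAGE SERVICE
rolling_update() {
  image=$1 service=$2
  run docker service update \
    --update-order start-first \
    --update-parallelism 1 \
    --update-monitor 45s \
    --update-failure-action rollback \
    --update-max-failure-ratio 0.2 \
    --image "$image" "$service"
}

rolling_update my-api:1.5.0 my-api
```

Printing the command first makes it easy to review the exact flags before they are applied to a production service.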

Monitoring an update in progress

docker service ps shows the progress of an in-flight rolling update: which task instances are still running the previous version and which are running the new one. This is useful for confirming that a rollout is proceeding as expected, or for understanding exactly how far it has progressed if it needs to be rolled back manually:

docker service ps my-api

NAME        IMAGE          CURRENT STATE
my-api.1    my-api:1.5.0   Running 2 minutes ago
my-api.2    my-api:1.4.2   Running 10 minutes ago
my-api.3    my-api:1.4.2   Running 10 minutes ago

This output, showing a mix of old and new image versions across different replicas, confirms a rolling update is genuinely in progress rather than having completed or failed entirely.
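For services with many replicas, a per-image tally is easier to read than the full task list. The sketch below assumes one image name per line on standard input, which is what docker service ps produces with a --format template of {{.Image}}; count_by_image is a hypothetical helper name, and the sample input stands in for live output.

```shell
# Hypothetical helper: read one image name per line, print
# "<image> <count>" sorted by image name.
count_by_image() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Against a live Swarm (not executed here):
#   docker service ps --filter desired-state=running \
#     --format '{{.Image}}' my-api | count_by_image

# Sample input matching the listing above:
printf '%s\n' my-api:1.5.0 my-api:1.4.2 my-api:1.4.2 | count_by_image
# prints:
# my-api:1.4.2 2
# my-api:1.5.0 1
```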

Resuming or rolling back a paused update

docker service update has no dedicated flag for pausing an in-flight update by hand; an update enters the paused state when the failure action is pause and the failure threshold is exceeded. Once the cause has been addressed or determined not to be an actual problem, running docker service update against the service again resumes the rollout. An operator who notices something concerning that the automated failure detection has not caught can instead revert to the previous version with docker service rollback:

docker service update my-api
docker service rollback my-api
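Before intervening, it helps to know which state the update is in. Swarm records this in the service's UpdateStatus, which can be read with docker service inspect. The sketch below maps each state to a suggested next step; next_step and its messages are hypothetical, and a service that has never been updated may have no UpdateStatus at all, which this example treats as the fallback case.

```shell
# Hypothetical helper: suggest an operator action for a given
# UpdateStatus.State value.
next_step() {
  case "$1" in
    updating)           echo "wait: update in progress" ;;
    paused)             echo "investigate, then resume or roll back" ;;
    completed)          echo "done: update finished" ;;
    rollback_started)   echo "wait: rollback in progress" ;;
    rollback_paused)    echo "investigate: rollback is paused" ;;
    rollback_completed) echo "done: previous version restored" ;;
    *)                  echo "no update status recorded" ;;
  esac
}

# Against a live Swarm (not executed here):
#   state=$(docker service inspect --format '{{.UpdateStatus.State}}' my-api)
#   next_step "$state"

next_step paused   # prints: investigate, then resume or roll back
```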

Common mistakes

Using stop-first update order for a service where the brief capacity reduction during each transition is genuinely unacceptable, rather than start-first.
Setting update parallelism too high for a service where limiting blast radius during a problematic rollout matters more than rollout speed.
Configuring an update monitor duration too short relative to the service's own health check interval and retry count, allowing the rollout to proceed before a real problem has had a chance to surface.
Not configuring an automatic rollback failure action, requiring manual intervention to recover from a rollout that has already demonstrated a meaningful failure rate.
Relying on rolling update safety mechanisms without a genuinely meaningful health check, which is the actual prerequisite that makes every other safety setting function as intended.

Swarm rolling updates provide a capable, declaratively configured deployment mechanism: update order, parallelism, monitor duration, and failure tolerance can each be tuned to match a specific service's risk tolerance. Every one of these mechanisms, however, depends on a meaningful health check to deliver real safety rather than only the appearance of it.