18.2.1 Swarm Mode

A focused guide to Swarm Mode, connecting core concepts with practical Docker and container operations.

Swarm mode is the operating state a Docker Engine enters when it initializes or joins a cluster. The engine stops acting as a standalone host that manages only its own containers and becomes a member of a distributed, consensus-based cluster. Manager and worker roles, quorum requirements, and node availability states govern how that cluster operates.

The standalone-to-swarm transition

A Docker Engine starts in standalone mode, where every command operates only on that host's containers. It enters swarm mode the moment it initializes a new cluster or joins an existing one:

docker info --format '{{.Swarm.LocalNodeState}}'
inactive
docker swarm init
docker info --format '{{.Swarm.LocalNodeState}}'
active

This state change unlocks the Swarm-specific commands docker service, docker stack, and docker node. They return an error on an engine still in standalone mode because they depend on cluster-aware behavior that exists only once swarm mode is active. They must also be run on a manager node; a worker in an active swarm rejects them as well.
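The state check shown above can be wrapped in a small guard so that a script fails early with a useful hint. The following is a minimal sketch, not an official Docker helper. The function names swarm_state_hint and swarm_ready are hypothetical, and the values handled are the ones the Engine is expected to report in Swarm.LocalNodeState (inactive, pending, active, locked, error).

```shell
# Hypothetical helper: translate a Swarm.LocalNodeState value into a hint.
swarm_state_hint() {
  case "$1" in
    active)        echo "swarm mode is active" ;;
    inactive)      echo "standalone: run 'docker swarm init' or 'docker swarm join' first" ;;
    locked)        echo "locked: run 'docker swarm unlock' and supply the unlock key" ;;
    pending|error) echo "state '$1': check 'docker info' for details" ;;
    *)             echo "unrecognized state: $1" ;;
  esac
}

# Hypothetical helper: succeeds only when the engine reports an active swarm.
swarm_ready() {
  [ "$1" = "active" ]
}

# Example use on a host with Docker installed:
#   state=$(docker info --format '{{.Swarm.LocalNodeState}}')
#   swarm_ready "$state" || { swarm_state_hint "$state" >&2; exit 1; }
```

Keeping the Docker call outside the functions means the logic can be exercised without a running daemon.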

Manager and worker node roles

Manager nodes maintain the cluster state and make scheduling decisions, while worker nodes run the containers that managers assign to them. A node's role is set when it joins, by the join token it presents, and can be changed afterward through explicit promotion or demotion:

docker node ls
ID             HOSTNAME   STATUS    AVAILABILITY   MANAGER STATUS
abc123 *       node-1     Ready     Active         Leader
def456         node-2     Ready     Active
docker node promote node-2

A worker node has no view of, or control over, the cluster's scheduling decisions; only managers take part in them. This is why the number of managers and their availability matter far more to cluster health than the number of workers.
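To see how a cluster's nodes split between the two roles, the node listing can be summarized with a short filter. This is a sketch under stated assumptions: count_roles is a hypothetical function name, and it expects one "hostname|manager status" pair per line, where the manager status field is empty for workers.

```shell
# Hypothetical helper: count managers and workers from lines shaped like
# "hostname|manager-status". An empty second field is treated as a worker.
count_roles() {
  awk -F'|' '
    NF == 0  { next }                 # skip blank lines
    $2 == "" { workers++; next }      # no manager status: worker
             { managers++ }           # Leader, Reachable, Unreachable: manager
    END      { printf "managers=%d workers=%d\n", managers, workers }
  '
}

# Example use on a manager node:
#   docker node ls --format '{{.Hostname}}|{{.ManagerStatus}}' | count_roles
```

The count includes managers reported as Unreachable, so it reflects configured roles rather than current health.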

Raft consensus and manager quorum

Manager nodes use the Raft consensus algorithm to keep a consistent view of cluster state. Raft requires a quorum, a majority of manager nodes, to remain available before the cluster can make any scheduling decision or accept any change. Tasks that are already running continue to run when quorum is lost, but the cluster cannot be managed until quorum is restored:

docker node ls --filter role=manager

A cluster with three managers tolerates the loss of one and still holds quorum (two of three). A cluster with two managers tolerates none, since losing either one drops it below a majority. In general, N managers need floor(N/2) + 1 for quorum and tolerate floor((N - 1)/2) failures, so growing to an even count raises the quorum size without raising fault tolerance. This is why an odd number of managers, typically three or five where manager availability matters, is the standard recommendation.
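The majority arithmetic is easy to check with integer division. The sketch below uses hypothetical function names (quorum, fault_tolerance, quorum_table) and only shell arithmetic, so it makes no calls to Docker.

```shell
# Smallest majority of N managers: floor(N / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Managers that can fail while a majority remains: floor((N - 1) / 2).
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Print quorum size and tolerated failures for 1 to 7 managers.
quorum_table() {
  printf '%-9s %-7s %s\n' managers quorum tolerated
  for n in 1 2 3 4 5 6 7; do
    printf '%-9s %-7s %s\n' "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
  done
}

# Example use:
#   quorum_table
```

Running quorum_table shows each even count tolerating the same number of failures as the odd count just below it, for example four managers tolerate one failure, as three do.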

Node availability states

Beyond a node's Ready status, its availability setting (active, pause, or drain) controls whether the scheduler may assign it new work. Active accepts new tasks, pause accepts none but keeps existing tasks running, and drain accepts none and moves existing tasks elsewhere. This makes it possible to perform maintenance on a node without removing it from the cluster:

docker node update --availability drain node-2
docker node update --availability active node-2

Draining a node marks it ineligible for new work, shuts down the tasks running on it, and schedules replacement tasks on other available nodes. The node stays drained until it is explicitly set back to active, and Swarm does not automatically move tasks back onto it afterward. This is the graceful way to take a node offline for planned maintenance: services with replicas on other nodes stay available throughout, though a single-replica service is briefly interrupted while its task restarts elsewhere.
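A maintenance script usually wants to wait until the drained node is actually idle before proceeding. The following is one possible sketch, not a prescribed procedure. The function names drain_and_wait and reactivate and the DOCKER override variable are hypothetical, and the polling relies on the assumption that docker node ps reports a running task's current state as text beginning with "Running".

```shell
# DOCKER can be overridden, for example to point at a wrapper script.
DOCKER="${DOCKER:-docker}"

# Drain a node, then poll until none of its tasks report a Running state.
# Arguments: node name, number of polls (default 30), seconds between
# polls (default 2). Returns 1 on a Docker error or if tasks remain.
drain_and_wait() {
  node="$1"
  tries="${2:-30}"
  pause="${3:-2}"
  "$DOCKER" node update --availability drain "$node" >/dev/null || return 1
  while [ "$tries" -gt 0 ]; do
    # Assumption: CurrentState reads like "Running 5 minutes ago".
    states=$("$DOCKER" node ps "$node" --format '{{.CurrentState}}') || return 1
    if ! printf '%s\n' "$states" | grep -q '^Running'; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$pause"
  done
  return 1
}

# Return a drained node to service.
reactivate() {
  "$DOCKER" node update --availability active "$1" >/dev/null
}

# Example use on a manager node:
#   drain_and_wait node-2 && echo "node-2 is idle, safe to start maintenance"
#   reactivate node-2
```

Bounding the number of polls keeps the script from hanging indefinitely if a task refuses to stop.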

Locking the swarm for additional security

Swarm mode supports an optional autolock feature that requires an unlock key whenever a manager's Docker daemon restarts. The key protects the TLS key used for communication between nodes and the key used to encrypt the Raft logs on disk, so gaining access to an offline manager's disk is not enough to read them:

docker swarm update --autolock=true
docker swarm unlock

This adds a meaningful layer of security where physical or storage-level access to a manager's disk cannot be fully trusted. The cost is that the unlock key must be supplied, manually or through an automated and securely stored mechanism, every time a manager restarts. Enabling autolock prints the key, so record it at that point; a restarted manager stays locked until the key is supplied.
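An automated unlock could look roughly like the sketch below. It rests on several assumptions that should be verified before use: that a locked manager reports the state "locked" in Swarm.LocalNodeState, that docker swarm unlock reads the key from standard input when it is not attached to a terminal, and that the key has been stored in a file readable only by the operator. The function name unlock_if_locked, the DOCKER override variable, and the key file path are all hypothetical.

```shell
# DOCKER can be overridden, for example to point at a wrapper script.
DOCKER="${DOCKER:-docker}"

# Unlock the local manager from a key file, but only if it is locked.
# Argument: path to a file containing the unlock key (hypothetical location).
unlock_if_locked() {
  key_file="$1"
  state=$("$DOCKER" info --format '{{.Swarm.LocalNodeState}}') || return 1
  if [ "$state" != "locked" ]; then
    echo "swarm state is '$state'; nothing to unlock"
    return 0
  fi
  if [ ! -r "$key_file" ]; then
    echo "cannot read key file: $key_file" >&2
    return 1
  fi
  # Assumption: the key is accepted on standard input.
  "$DOCKER" swarm unlock < "$key_file"
}

# Example use after a manager restart:
#   unlock_if_locked /path/to/unlock.key
```

A key file on the same disk as the manager would undo the protection autolock provides, so in practice the key would come from a separate secrets store.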

Leaving and reforming the cluster

A node can leave a swarm explicitly and return to standalone mode. If managers are lost beyond quorum recovery, the manager set can be rebuilt through a forced re-initialization, which is a disaster recovery action rather than a routine operation:

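# Run on the node that is leaving (a manager should be demoted first, or it needs --force)
docker swarm leave
# Disaster recovery only: run on one surviving manager after quorum is lost
docker swarm init --force-new-cluster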

Reserve the forced re-initialization for genuine quorum loss. It discards the previous manager membership and starts a single-manager cluster from the node it is run on, using that node's local copy of the cluster state, so existing services and worker nodes are retained. New managers must then be added to restore fault tolerance.
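Whether a forced re-initialization is even on the table depends on how many managers are still reachable. The sketch below encodes that decision as a hypothetical function, recovery_action, which takes the configured manager count and the number currently reachable. It only illustrates the majority rule and does not replace a proper recovery runbook.

```shell
# Hypothetical helper: suggest an action from manager counts.
# Arguments: total configured managers, managers currently reachable.
recovery_action() {
  total="$1"
  reachable="$2"
  needed=$(( total / 2 + 1 ))   # majority required for quorum
  if [ "$reachable" -ge "$total" ]; then
    echo "healthy: no action needed"
  elif [ "$reachable" -ge "$needed" ]; then
    echo "degraded: quorum holds, restore or replace the failed managers"
  else
    echo "quorum lost: bring failed managers back if possible, otherwise force a new cluster"
  fi
}

# Example use:
#   recovery_action 3 1
```

With three managers and one reachable, the function reports quorum loss, the only case in which docker swarm init --force-new-cluster is worth considering.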

Common mistakes

  • Running Swarm-specific commands against an engine still in standalone mode and stopping at the error, without recognizing that the engine must first initialize or join a swarm.
  • Running an even number of manager nodes, which provides no more failure tolerance than the next smaller odd number.
  • Removing or rebooting a node abruptly for maintenance instead of draining it first so its tasks can be rescheduled elsewhere.
  • Overlooking the manager-versus-worker distinction and assuming every node's availability matters equally to scheduling health.
  • Reaching for a forced cluster re-initialization as a routine fix instead of reserving it for otherwise unrecoverable quorum loss.

Swarm mode is an engine state transition that unlocks cluster-aware behavior built on Raft consensus among manager nodes. Understanding manager and worker roles, quorum requirements, and node availability states, and draining nodes for maintenance rather than removing them abruptly, keeps a cluster healthy and predictable as nodes join, leave, and undergo maintenance.
