18.2.1.4 Swarm Service Scheduling

A focused guide to Swarm Service Scheduling, connecting core concepts with practical Docker and container operations.

Swarm service scheduling is the decision process by which the cluster determines where each task, a single running instance of a service, gets placed. It covers the distinction between replicated and global service modes, the states a task passes through during its lifecycle, and the restart and rescheduling policies that govern what happens when a task fails.

Replicated versus global service modes

A replicated service runs a specified, fixed number of task instances distributed across available nodes according to the scheduler's own placement logic, while a global service runs exactly one instance on every node in the cluster automatically, including any new node that joins afterward:

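# --name sets the service name used by later commands; the last argument is the image (tag assumed here)
docker service create --name my-api --mode replicated --replicas 5 my-api:latest

docker service create --name my-log-collector --mode global my-log-collector:latest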

Global mode is the natural fit for infrastructure-level services that need a presence on every node, such as a log collector or a metrics agent. Replicated mode fits ordinary application services, where the replica count is a deliberate capacity decision independent of how many nodes happen to exist in the cluster.
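To check which mode each service is using, and to adjust capacity for a replicated one, a pair of small shell helpers along these lines may be convenient. This is a minimal sketch: the function names list_service_modes and scale_service are hypothetical, and the commands are assumed to run on a manager node.

```shell
# list_service_modes: print each service's name, mode, and running/desired task count.
list_service_modes() {
  docker service ls --format '{{.Name}} {{.Mode}} {{.Replicas}}'
}

# scale_service SERVICE COUNT: change the replica count of a replicated service.
# A global service has no replica count, so Docker rejects this command for it.
scale_service() {
  docker service scale "$1=$2"
}
```

For example, scale_service my-api 8 would ask Swarm to run eight replicas of my-api, while the task count of a global service changes only as nodes join or leave.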

The scheduler's default placement strategy

Without explicit constraints or preferences, Swarm's scheduler uses a spread strategy. It first filters out nodes that cannot run the task, such as unavailable nodes or nodes lacking the CPU or memory the service reserves, and then favors the eligible node currently running the fewest tasks of that service. This produces a reasonably even distribution without any explicit configuration for the common case:

docker service ps my-api

NODE      DESIRED STATE   CURRENT STATE
node-1    Running         Running 2 minutes ago
node-2    Running         Running 2 minutes ago
node-3    Running         Running 2 minutes ago

This default behavior is generally sufficient for most services; explicit constraints and placement preferences, covered separately, are needed only when a service has genuine, specific requirements beyond this reasonable, automatic default distribution.
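One way to see how evenly the scheduler has spread a service is to count its running tasks per node. The helper below is a sketch under the assumption that it runs on a manager node; the name tasks_per_node is hypothetical.

```shell
# tasks_per_node SERVICE: print "<node> <count>" for each node running tasks of SERVICE.
tasks_per_node() {
  docker service ps --filter desired-state=running --format '{{.Node}}' "$1" \
    | sort | uniq -c | awk '{print $2, $1}'
}
```

Calling tasks_per_node my-api prints one line per node, which makes a lopsided placement easy to spot before reaching for constraints or preferences.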

The task lifecycle

A task moves through several distinct states from creation to running, and understanding this lifecycle clarifies what docker service ps output actually represents at any given moment during a deployment or recovery event:

NEW       → task created, not yet assigned to a node
PENDING   → not yet assigned, waiting for the scheduler to find a suitable node
ASSIGNED  → scheduled to a specific node
PREPARING → node is pulling the image and preparing to start
STARTING  → container starting
RUNNING   → task is actively running

A task that stays in PENDING for an extended period, rather than progressing toward RUNNING, typically indicates that the scheduler cannot find a node that satisfies the service's resource reservations or placement constraints. Check this directly rather than assuming the task will eventually resolve on its own.
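A stuck task can be investigated with a couple of read-only commands. The sketch below assumes a manager node; the function names pending_tasks and node_capacity are hypothetical. Note that node_capacity reports a node's total CPU and memory, not what is currently free, so it is a starting point for comparison against the service's reservations rather than a full answer.

```shell
# pending_tasks SERVICE: list tasks still Pending, with the scheduler's reported reason if any.
pending_tasks() {
  docker service ps --no-trunc --filter desired-state=running \
    --format '{{.Name}} | {{.CurrentState}} | {{.Error}}' "$1" \
    | grep -i '| pending'
}

# node_capacity NODE: print the node's total CPU (in nano-CPUs) and memory (in bytes).
node_capacity() {
  docker node inspect \
    --format '{{.Description.Resources.NanoCPUs}} {{.Description.Resources.MemoryBytes}}' "$1"
}
```

If pending_tasks my-api prints nothing, no task of that service is currently waiting on the scheduler.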

Restart policies at the service level

Swarm services have their own restart policy, distinct from a standalone container's --restart flag, controlling whether and how aggressively a failed task is restarted, including a maximum attempt count and delay between attempts:

docker service create --name my-api --restart-condition on-failure --restart-max-attempts 5 --restart-delay 10s my-api:latest

A task that exhausts its maximum restart attempts is left in a failed state rather than retried indefinitely. This surfaces a persistent problem as a visible failure that can be investigated, rather than masking it behind an endless, silent restart loop.
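The same policy can be applied to an existing service and then read back to confirm it took effect. This is a sketch, not a recommended configuration: the function names are hypothetical, and the 2m value for --restart-window (the period over which restart attempts are evaluated) is an assumption chosen only for illustration.

```shell
# show_restart_policy SERVICE: print the service's current restart policy as JSON.
show_restart_policy() {
  docker service inspect --format '{{json .Spec.TaskTemplate.RestartPolicy}}' "$1"
}

# set_restart_policy SERVICE: apply an on-failure restart policy to an existing service.
# The 2m window is an assumed value; attempts are evaluated within that window.
set_restart_policy() {
  docker service update \
    --restart-condition on-failure \
    --restart-max-attempts 5 \
    --restart-delay 10s \
    --restart-window 2m \
    "$1"
}
```

Running set_restart_policy my-api followed by show_restart_policy my-api should show the condition, delay, attempt limit, and window stored in the service spec.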

Rescheduling versus restarting

It is worth distinguishing a task restart from rescheduling. In a restart, the orchestrator replaces a failed task while its node is still healthy; Swarm does not revive the old task but creates a new one in its place, often, though not necessarily, on the same node. Rescheduling happens when an entire node becomes unavailable and the scheduler must place replacement tasks on different, available nodes:

docker node ls

node-2    Down

docker service ps my-api

If node-2 goes down entirely, any tasks that were running on it are rescheduled onto other available nodes automatically, which is a distinct mechanism from the restart policy governing a task that fails while its node remains otherwise healthy and available.
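Rescheduling can be rehearsed without an actual outage by draining a node. Draining is a deliberate operator action rather than a failure, so this is only an approximation of the node-down case, but it exercises the same placement of replacement tasks on other nodes. The function names below are hypothetical, and a manager node is assumed.

```shell
# drain_node NODE: stop scheduling onto NODE and move its tasks elsewhere.
drain_node() {
  docker node update --availability drain "$1"
}

# activate_node NODE: make NODE eligible for new tasks again.
activate_node() {
  docker node update --availability active "$1"
}

# task_history SERVICE: show every task of SERVICE, including shut-down ones, with its node.
task_history() {
  docker service ps --format '{{.Name}} {{.Node}} {{.DesiredState}} {{.CurrentState}}' "$1"
}
```

After drain_node node-2, task_history my-api should list the old tasks on node-2 as shut down alongside their replacements on other nodes. Reactivating the node does not move tasks back on its own; running docker service update --force my-api is one way to trigger a rebalance.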

Limiting replicas per node

For services where running more than one replica on the same physical node provides no genuine benefit, or could even be counterproductive (competing for the same underlying resource a single replica per node would otherwise have exclusive access to), an explicit constraint on maximum replicas per node prevents the scheduler from concentrating too many instances of the same service onto one node:

services:
  api:
    deploy:
      replicas: 6
      placement:
        max_replicas_per_node: 2

This forces the six replicas onto at least three distinct nodes, rather than letting, for instance, four of them land on a single node. If fewer than three eligible nodes are available, the replicas that cannot be placed within the limit stay pending instead of exceeding it.
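The minimum node count implied by a per-node limit is the replica count divided by the limit, rounded up. The sketch below works that out in shell arithmetic and also shows the CLI counterpart of the Compose setting, --replicas-max-per-node; the function names min_nodes and limit_per_node are hypothetical.

```shell
# min_nodes REPLICAS MAX_PER_NODE: smallest number of nodes that can hold all replicas
# (integer ceiling of REPLICAS / MAX_PER_NODE).
min_nodes() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# limit_per_node SERVICE MAX: cap how many replicas of SERVICE may run on one node.
limit_per_node() {
  docker service update --replicas-max-per-node "$2" "$1"
}

min_nodes 6 2   # prints 3
min_nodes 7 2   # prints 4
```

Comparing the result of min_nodes against the number of eligible nodes before deploying helps avoid replicas that can never be placed.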

Common mistakes

Using replicated mode for an infrastructure-level service that genuinely needs a presence on every node, rather than global mode, which would handle this requirement automatically without manual replica count management.
Not investigating a task stuck in a pending state for an extended period, missing a resource availability issue the scheduler cannot resolve on its own.
Conflating a standalone container's restart policy with a Swarm service's own, distinct restart condition and attempt limit configuration.
Assuming task rescheduling and task restart are the same mechanism, when rescheduling specifically applies to entire node failures rather than individual task failures on an otherwise healthy node.
Not constraining maximum replicas per node for a service where concentrating multiple instances on the same physical node provides no benefit or could even be counterproductive.

Swarm service scheduling combines mode selection (replicated versus global), a sensible default spread-based placement strategy, an explicit task lifecycle worth understanding when diagnosing a stuck deployment, and distinct restart-versus-reschedule mechanisms. Together, these determine where and how reliably a service's workload runs across the cluster's available capacity.