1.3.3.5 Scheduling Recovery Contrast

A focused guide to Scheduling Recovery Contrast, connecting core concepts with practical Docker and container operations.

Scheduling recovery contrast compares how a failed container is handled on a single host versus in an orchestrated cluster, focusing specifically on whether anything actively notices a failure and decides what to do about it.

Recovery on a Single Host Without an Orchestrator

A container started directly with docker run can be given a restart policy, but that policy is enforced by the Docker daemon on the same host, and nothing decides where else the container could run if the host becomes unavailable. If the host fails entirely, there is nothing left to perform recovery, because the recovery logic runs on the same host as the workload it is meant to protect.

docker run -d --restart=on-failure --name myapp myapp:1.0

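With --restart=on-failure, the Docker daemon restarts the container when its main process exits with a non-zero status. It does nothing if the underlying host itself goes offline, because the daemon goes down with the host.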
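The decision a local restart policy makes can be summarized as a small truth table. The following Python sketch is a toy model, not Docker's implementation: the function name should_restart and the host_up flag are hypothetical, and retry limits and manual stops are deliberately ignored.

```python
def should_restart(policy: str, exit_code: int, host_up: bool = True) -> bool:
    """Toy model of a local restart decision (hypothetical helper, not a Docker API)."""
    if not host_up:
        # The daemon that enforces the policy went down with the host.
        return False
    if policy == "no":
        return False
    if policy == "on-failure":
        # Only a non-zero exit status counts as a failure.
        return exit_code != 0
    if policy == "always":
        return True
    raise ValueError(f"unsupported policy in this sketch: {policy}")


print(should_restart("on-failure", exit_code=1))                 # True
print(should_restart("on-failure", exit_code=0))                 # False
print(should_restart("on-failure", exit_code=1, host_up=False))  # False
```

The last call is the case this section is about: no policy setting changes the outcome once the host is down.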

Scheduling Recovery in an Orchestrated Cluster

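An orchestrator runs its management processes separately from the workloads it schedules, typically across multiple manager nodes. It can therefore detect a node failure and reschedule the affected containers onto healthy nodes without depending on the failed node for anything. The commands below assume Docker Swarm mode: the first lists the nodes and their status, and the second shows where each task of the myapp service is running.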

docker node ls
docker service ps myapp

If a node hosting a replica goes down, the orchestrator notices the missing replica and starts a new one on a different available node to bring the actual state back in line with the desired replica count.
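The rescheduling step described above amounts to comparing desired state with actual state and filling the gap. The Python sketch below is a minimal illustration under stated assumptions: reconcile is a hypothetical function, node names are invented, and the "fewest replicas first" placement rule is chosen for simplicity rather than taken from any particular orchestrator.

```python
def reconcile(desired: int, placements: list[str], healthy_nodes: set[str]) -> list[str]:
    """Return replica placements that satisfy the desired count on healthy nodes."""
    # Replicas on unhealthy nodes are considered lost.
    surviving = [node for node in placements if node in healthy_nodes]
    missing = desired - len(surviving)
    if missing > 0 and not healthy_nodes:
        raise RuntimeError("no healthy node left to schedule on")
    for _ in range(missing):
        # Hypothetical placement rule: pick the healthy node with the fewest replicas.
        target = min(sorted(healthy_nodes), key=surviving.count)
        surviving.append(target)
    return surviving


before = ["node-a", "node-b", "node-c"]
after = reconcile(desired=3, placements=before, healthy_nodes={"node-a", "node-b"})
print(after)  # ['node-a', 'node-b', 'node-a']
```

The replica that ran on node-c is not restarted there. A new one is placed on a node that is still healthy, which restores the count of three.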

Why a Single Host Cannot Self-Recover From Its Own Failure

A fundamental limitation of single-host recovery is that the recovery mechanism and the workload share the same point of failure — if the host that would restart a container is the same host that has failed, no restart can happen. Multi-node orchestration solves this by separating the decision-making (the orchestrator's control plane) from any individual node running workloads.

docker service create --name myapp --replicas 3 --update-delay 10s myapp:1.0
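The shared point of failure can be made concrete by asking which hosts run the recovery logic and whether enough of them survive a failure. The sketch below is a toy model under stated assumptions: control_plane_available is a hypothetical helper, host names are invented, and the majority rule reflects how Raft-based managers (such as Docker Swarm managers) keep operating only while more than half of them are reachable.

```python
def control_plane_available(controllers: set[str], failed: set[str]) -> bool:
    """True if a strict majority of the hosts running recovery logic is still up."""
    remaining = controllers - failed
    return len(remaining) > len(controllers) // 2


# Single host: the only place recovery logic runs is the host that failed.
print(control_plane_available({"host-1"}, failed={"host-1"}))        # False

# Three managers: losing one still leaves a majority to reschedule work.
managers = {"mgr-1", "mgr-2", "mgr-3"}
print(control_plane_available(managers, failed={"mgr-1"}))           # True
print(control_plane_available(managers, failed={"mgr-1", "mgr-2"}))  # False
```

The third call shows that an orchestrator is not immune either: it tolerates the loss of a minority of managers, not an arbitrary number.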

Defining What "Recovered" Means

Recovery in an orchestrated system means returning to the desired state — a defined number of healthy replicas — rather than necessarily restoring the exact same container instance. The orchestrator does not try to revive the specific failed container; it creates a fresh one elsewhere that fulfills the same role.
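This definition of recovery can be expressed as a check on the replica count rather than on container identity. The following Python sketch is illustrative only: Task, start_task, and is_recovered are hypothetical names, and the integer IDs stand in for whatever identifiers a real orchestrator assigns.

```python
from dataclasses import dataclass, replace
from itertools import count

_task_ids = count(1)  # hypothetical ID source for this sketch


@dataclass(frozen=True)
class Task:
    task_id: int
    node: str
    healthy: bool = True


def start_task(node: str) -> Task:
    """Create a fresh task with a new identity."""
    return Task(task_id=next(_task_ids), node=node)


def is_recovered(desired: int, tasks: list[Task]) -> bool:
    """Recovered means the healthy replica count matches the desired count."""
    return sum(task.healthy for task in tasks) == desired


tasks = [start_task("node-a"), start_task("node-b"), start_task("node-c")]

# node-c fails: its task is marked unhealthy and the service is degraded.
failed = replace(tasks[2], healthy=False)
tasks[2] = failed
print(is_recovered(3, tasks))  # False

# The failed task is not revived; a new task with a new ID takes over its role.
replacement = start_task("node-a")
tasks = [task for task in tasks if task.healthy] + [replacement]
print(is_recovered(3, tasks))                 # True
print(replacement.task_id != failed.task_id)  # True
```

The service counts as recovered even though the task that failed never comes back, which is why anything that must outlive a single container (data, identity, configuration) has to be kept outside it.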

Why This Distinction Matters for Availability

Applications with real availability requirements need scheduling recovery that survives the failure of any single machine, which is only achievable with an orchestrator coordinating across multiple nodes — a single host, no matter how reliable its restart policy, cannot provide that guarantee on its own.