15.3.1.5 Health Failure Threshold

A focused guide to Health Failure Threshold, connecting core concepts with practical Docker and container operations.

A health failure threshold is the number of consecutive failed checks required before a container is marked unhealthy. In Docker it is controlled by the --retries setting, which defaults to 3. It exists to distinguish a single, transient blip from a genuine, sustained problem, and the right value depends on how much tolerance for brief failures a given service should reasonably have.

What the retries setting actually controls

The --retries value is a count of consecutive failures, not a count out of a rolling window; a single successful check resets the counter back to zero, which means the container only transitions to unhealthy after that many failures occur back to back with no intervening success:

HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

With this configuration, three consecutive failed checks are required before the container's health status flips to unhealthy. Given the 10-second interval, that takes roughly 20 to 30 seconds from the moment the service starts failing, and somewhat longer if the checks fail by running into the timeout. A single isolated failure followed by a success leaves the container reported as healthy throughout.
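The counting rule can be sketched as a small simulation. This is an illustrative model only, not Docker's implementation: the function name simulate_health and the "ok"/"fail" inputs are made up for the example, and the "starting" state is ignored.

```shell
# Model of consecutive-failure counting (hypothetical helper, not a Docker command).
# Arguments: the retries threshold, then one result per check ("ok" or "fail").
# Prints the status that would be reported after each check.
simulate_health() (
  retries=$1; shift
  streak=0
  status=healthy
  for result in "$@"; do
    if [ "$result" = "ok" ]; then
      streak=0            # any success resets the failing streak
      status=healthy
    else
      streak=$((streak + 1))
      if [ "$streak" -ge "$retries" ]; then
        status=unhealthy  # threshold reached with no intervening success
      fi
    fi
    echo "$result -> $status (streak=$streak)"
  done
)

simulate_health 3 fail fail ok fail fail fail
# fail -> healthy (streak=1)
# fail -> healthy (streak=2)
# ok -> healthy (streak=0)
# fail -> healthy (streak=1)
# fail -> healthy (streak=2)
# fail -> unhealthy (streak=3)
```

Five of the six checks fail in this run, yet the status flips only on the last one, because the success in the middle resets the streak.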

Setting the threshold too low

A threshold of one means a single brief, transient issue (a momentary network blip, a garbage collection pause, a database query that is slow but would have succeeded) immediately flips the container to unhealthy, and a threshold of two leaves only one check of margin. Where an orchestrator or other automation acts on health status, this can trigger unnecessary restarts or premature removal from traffic routing for a condition that would have resolved on its own within the next check cycle:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 \
  CMD curl -f http://localhost:3000/healthz || exit 1

A container restarted or pulled from rotation for a single transient blip, only to immediately report healthy again afterward, suggests the threshold is tuned too aggressively relative to the service's actual, normal variance in health check response.

Setting the threshold too high

Conversely, a high threshold delays detecting a genuine, sustained problem, since the container continues to be reported as healthy through several consecutive real failures before the threshold is finally reached:

HEALTHCHECK --interval=10s --timeout=5s --retries=10 \
  CMD curl -f http://localhost:3000/healthz || exit 1

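With a 10-second interval and 10 retries, a genuinely failed container would continue to be marked healthy, and continue receiving traffic it cannot correctly serve, for roughly 100 seconds before the threshold is reached. If each failing check runs to its 5-second timeout, the window grows to about 150 seconds, because the next interval starts only after the previous check completes. That is a significant window for a service with meaningful availability requirements.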

Balancing threshold against interval and overall detection time

The threshold should be considered together with the check interval, since the two combine to determine total detection time, and the right balance depends on the specific service's tolerance for both false positives and detection delay:

detection_time ≈ interval × retries

HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

A shorter interval combined with a moderate retry count can achieve fast genuine-failure detection while still requiring multiple consecutive failures to avoid reacting to a single transient blip, which is often a better balance than a long interval with a low retry count or a short interval with a very high retry count.
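The estimate above can be turned into a small helper for comparing configurations. This is a rough sketch: detection_window is a hypothetical name, all values are whole seconds, and the upper figure assumes every failing check runs to its full timeout before the next interval starts.

```shell
# Estimate how long a sustained failure goes unreported (hypothetical helper).
# Arguments: interval, timeout, retries, all in whole seconds.
detection_window() (
  interval=$1 timeout=$2 retries=$3
  fast=$((interval * retries))              # checks fail immediately
  slow=$(((interval + timeout) * retries))  # every check runs into the timeout
  echo "approx ${fast}s, up to ${slow}s if every check times out"
)

detection_window 10 5 3    # approx 30s, up to 45s if every check times out
detection_window 10 5 10   # approx 100s, up to 150s if every check times out
detection_window 5 3 3     # approx 15s, up to 24s if every check times out
```

Comparing the last two lines shows how much a shorter interval with a moderate retry count narrows the window while still requiring three consecutive failures.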

Matching the threshold to the service's actual failure characteristics

A service known to experience occasional, brief, self-resolving issues (a downstream dependency with its own retry logic that occasionally takes a moment longer, for instance) reasonably warrants a slightly higher threshold than a service expected to be either consistently healthy or genuinely, persistently broken with no in-between state:

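docker events --filter event=health_status --since 168h --until "$(date +%s)" | grep unhealthy

The --since value uses hours because relative durations do not accept a "d" suffix, and --until makes the command exit instead of continuing to stream new events. The daemon keeps only a bounded buffer of recent events, so a full week of history may have to come from an external log or monitoring system. Reviewing how often a service has actually transitioned to unhealthy over a representative historical period, and whether those transitions correlated with genuine incidents or with apparent false positives, is a more reliable basis for tuning the threshold than a value chosen without reference to the service's observed behavior.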
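A per-container tally makes that review easier. The sketch below assumes the events are printed as "name action" lines, which a format string such as '{{.Actor.Attributes.name}} {{.Action}}' should produce. The helper name count_unhealthy and the sample container names are made up for illustration.

```shell
# Count transitions to unhealthy per container (hypothetical helper).
# Expects lines such as "my-api health_status: unhealthy" on stdin.
count_unhealthy() {
  awk '/health_status: unhealthy/ { n[$1]++ }
       END { for (c in n) print c, n[c] }' | sort
}

# Made-up sample input standing in for real event output:
printf '%s\n' \
  'my-api health_status: unhealthy' \
  'my-api health_status: healthy' \
  'my-db health_status: unhealthy' \
  'my-api health_status: unhealthy' | count_unhealthy
# my-api 2
# my-db 1

# Against a live daemon, the input would come from something like:
#   docker events --filter event=health_status --since 168h --until "$(date +%s)" \
#     --format '{{.Actor.Attributes.name}} {{.Action}}' | count_unhealthy
```

Each count still has to be compared by hand against known incidents to separate genuine failures from false positives.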

Threshold tuning and flapping

A threshold tuned too aggressively relative to a service's normal variance can produce flapping, repeated transitions between healthy and unhealthy as the container hovers right at the edge of the threshold, which is disruptive both for any automated response (repeated restarts) and for anyone trying to interpret the health signal during an investigation:

docker events --filter event=health_status --since 1h

A health status history showing frequent, rapid transitions back and forth, rather than a single, clear transition into a sustained unhealthy state, is a clear sign the threshold (or the underlying check itself) needs adjustment, since a reliable health signal should be stable during genuinely healthy periods, not oscillating.
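A simple flapping check counts the health transitions recorded for one container in a window and compares the count with a limit. In this sketch is_flapping and the limit are hypothetical, and each input line is assumed to represent one status change.

```shell
# Flag likely flapping for a single container (hypothetical helper).
# Argument: the maximum number of transitions considered normal for the window.
# Stdin: one event action per line, e.g. "health_status: unhealthy".
# Exit status 0 means the limit was exceeded.
is_flapping() (
  max_flips=$1
  flips=$(grep -c 'health_status:' || true)  # grep -c prints 0 when nothing matches
  echo "$flips transitions"
  [ "$flips" -gt "$max_flips" ]
)

# Made-up sample input standing in for one hour of events:
printf '%s\n' \
  'health_status: unhealthy' \
  'health_status: healthy' \
  'health_status: unhealthy' \
  'health_status: healthy' | is_flapping 2 && echo "flapping suspected"
# 4 transitions
# flapping suspected
```

What counts as too many transitions depends on the service, so the limit of 2 here is only a placeholder.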

Threshold considerations during deployment

The failure threshold also affects how quickly a newly deployed, broken container is detected and removed from rotation during a rollout; a deployment pipeline relying on health status to gate progression benefits from a threshold tuned for relatively fast detection specifically during this window, even if a slightly more tolerant threshold is acceptable during steady-state operation:

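# Poll until the status leaves "starting", meaning a definitive healthy or unhealthy result is reported.
until [ "$(docker inspect --format='{{.State.Health.Status}}' my-api-new)" != "starting" ]; do
  sleep 2
done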

Some deployment automation explicitly waits for a definitive healthy or unhealthy result, rather than proceeding the moment the container starts, which makes the combination of interval and retries directly relevant to how long a deployment pipeline needs to wait before it can safely proceed or roll back.
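A pipeline usually also needs an upper bound on that wait. The sketch below is one possible shape, not a standard tool: wait_for_health, its exit codes, and the 2-second polling step are all assumptions made for the example.

```shell
# Wait for a definitive health result, giving up after max_wait seconds
# (hypothetical helper). Exit status: 0 healthy, 1 unhealthy, 2 timed out.
wait_for_health() (
  name=$1 max_wait=$2
  waited=0
  while [ "$waited" -lt "$max_wait" ]; do
    # An inspect error (for example, the container does not exist yet) is
    # treated like "starting" and retried until the deadline.
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || true)
    case "$status" in
      healthy)   echo healthy;   exit 0 ;;
      unhealthy) echo unhealthy; exit 1 ;;
    esac
    sleep 2
    waited=$((waited + 2))
  done
  echo "timed out after ${max_wait}s"
  exit 2
)

# Example call in a deploy script (container name taken from the loop above):
#   wait_for_health my-api-new 60 || echo "rolling back"
```

A reasonable max_wait is somewhat more than retries × (interval + timeout), so that an unhealthy verdict has time to arrive before the pipeline gives up.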

Common mistakes

  • Setting the retries threshold to one or two without considering the service's normal variance, causing brief, transient issues to trigger unnecessary restarts or traffic removal.
  • Setting the threshold too high, delaying genuine failure detection well beyond what the service's actual availability requirements can tolerate.
  • Tuning the threshold without reviewing historical health transition data, relying on intuition rather than the service's actual observed failure pattern.
  • Allowing a threshold tuned too tightly relative to normal variance to produce flapping, repeated oscillation between healthy and unhealthy states.
  • Not considering how the threshold affects deployment pipeline timing specifically, when a different balance might be appropriate during a rollout than during steady-state operation.

A health failure threshold should be set deliberately, based on the service's observed variance and failure patterns and balanced against the check interval to achieve a reasonable total detection time. It should be revisited whenever flapping, false positives, or delayed detection show up in practice, rather than left at a default value chosen without reference to how the specific service actually behaves.