15.3.1.4 Health Timeout Choice

A focused guide to Health Timeout Choice, connecting core concepts with practical Docker and container operations.

Health timeout choice is the decision of how long a health check is allowed to run before being considered failed, and getting it wrong in either direction produces a specific, recognizable problem: too short a timeout causes false failures during normal, brief slowness, while too long a timeout delays detection of a genuine problem and can let a struggling container linger in rotation far longer than it should.

The timeout parameter and what it actually bounds

Docker's HEALTHCHECK instruction includes a --timeout value that bounds how long a single check invocation is allowed to run before the daemon considers that specific attempt failed, separate from --interval (how often checks run) and --retries (how many consecutive failures before the container is marked unhealthy):

HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

A check that takes longer than the configured timeout is treated identically to one that returned a failure status, which means the timeout value directly determines how much slowness in the check itself is tolerated before that slowness is interpreted as a failure.

Setting the timeout too short

A timeout set without reference to the check's actual typical response time risks triggering false failures during ordinary, brief variance in response latency, particularly under load or during garbage collection pauses that briefly slow down request handling without indicating any genuine problem:

HEALTHCHECK --interval=10s --timeout=1s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

A one-second timeout for a check that legitimately takes 800 milliseconds under typical, healthy load leaves only 200 milliseconds of margin for normal variance. A brief burst of load that pushes three consecutive checks past one second is then enough to mark an entirely healthy container unhealthy purely due to timing noise, which can lead to unnecessary restarts or traffic removal wherever an orchestrator acts on health status.

Setting the timeout too long

Conversely, an overly generous timeout delays detecting a genuinely struggling container, since the check has to actually time out, taking the full configured duration, before that attempt counts as a failure at all:

HEALTHCHECK --interval=10s --timeout=30s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

With a 10-second interval, a 30-second timeout, and 3 retries, a container that has genuinely stopped responding can take up to roughly 3 × (10 + 30) = 120 seconds to be marked unhealthy, because each check starts only after the previous one has completed and every hung attempt runs for the full timeout. During that window the container may continue receiving traffic it cannot serve correctly, which is a meaningful delay for a service with tight availability requirements.
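
The arithmetic behind that delay can be sketched in a few lines. This is a rough model, assuming each check starts one interval after the previous check completes and that every failing attempt hangs for the full timeout; the function name is hypothetical.

```javascript
// Hypothetical helper: worst-case seconds between a container starting to hang
// and Docker marking it unhealthy. Assumes each check starts `interval` seconds
// after the previous one completes, and that every failing check runs for the
// full `timeout` before it is cut off.
function worstCaseDetectionSeconds({ interval, timeout, retries }) {
  // Worst case: the hang begins just after a successful check finishes, so a
  // full interval passes before the first failing attempt even starts.
  return retries * (interval + timeout);
}

// Settings from the example above: 3 * (10 + 30).
console.log(worstCaseDetectionSeconds({ interval: 10, timeout: 30, retries: 3 }));
// The same service with --timeout=5s: 3 * (10 + 5).
console.log(worstCaseDetectionSeconds({ interval: 10, timeout: 5, retries: 3 }));
```

Under these assumptions, lowering the timeout from 30 to 5 seconds cuts the worst-case detection delay from 120 to 45 seconds without touching the interval or retry count.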

Measuring actual response time as a starting point

The right timeout value should be derived from the health check's actual, measured response time under normal conditions, with enough margin to tolerate expected variance without being so generous that it delays genuine failure detection:

for i in $(seq 1 20); do
  curl -o /dev/null -s -w "%{time_total}\n" http://localhost:3000/healthz
done

Measuring a representative sample of actual response times, then setting the timeout meaningfully above the observed maximum (rather than the average) gives a reasonable starting point that accounts for normal variance without being arbitrarily generous.
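
One way to turn such measurements into a starting value is sketched below. The function name, the sample values, and the 2x margin are all assumptions made for illustration; the right margin depends on how much variance the service actually shows.

```javascript
// Hypothetical helper: turn measured response times (in seconds) into a
// starting --timeout value. The default 2x margin is an assumption, not a
// Docker rule.
function suggestTimeoutSeconds(samples, marginFactor = 2) {
  if (samples.length === 0) {
    throw new Error('need at least one measurement');
  }
  // Base the value on the slowest observation, not the average.
  const slowest = Math.max(...samples);
  // Round up to a whole second and never suggest less than 1s.
  return Math.max(1, Math.ceil(slowest * marginFactor));
}

// Made-up sample values in the format printed by curl's %{time_total}.
const samples = [0.112, 0.098, 0.310, 0.127, 0.805];
console.log(`--timeout=${suggestTimeoutSeconds(samples)}s`);
```

With these made-up samples the slowest response is 0.805 seconds, so the sketch suggests a 2-second timeout as a first value to validate under real load.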

The relationship between timeout and the check's own dependencies

Since the health check's response time depends on how long its own dependency checks take, the health check implementation itself should impose an internal timeout shorter than Docker's configured timeout, so that the check fails fast and predictably from within rather than being cut off externally at an unpredictable point:

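app.get('/healthz', async (req, res) => {
  let timer;
  // Reject after 3 s so the handler answers well inside Docker's 5 s timeout.
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('internal timeout')), 3000);
  });
  try {
    await Promise.race([checkDependencies(), timeout]);
    res.status(200).send('ok');
  } catch {
    res.status(503).send('degraded');
  } finally {
    clearTimeout(timer); // do not leave the timer pending after a fast check
  }
});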

HEALTHCHECK --timeout=5s CMD curl -f http://localhost:3000/healthz || exit 1

Setting the internal timeout (3 seconds in this example) shorter than Docker's external timeout (5 seconds) ensures that the application's own check logic, not the daemon cutting off the check command, determines the failure. The endpoint then returns an explicit 503 that curl -f turns into a non-zero exit code, which is a more predictable and diagnosable failure mode than relying on the external timeout to catch a slow dependency.
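
The same idea can be applied to each dependency call individually, so that a failure names the dependency that was slow. The following is a minimal sketch; withTimeout, checkAllDependencies, checkDatabase, and checkCache are hypothetical names, and the 1500 ms budgets are placeholders chosen to stay under the 3-second internal limit.

```javascript
// Hypothetical helper: race a dependency call against a timer, and clear the
// timer once the race settles so nothing is left pending.
function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage sketch: checkDatabase and checkCache stand in for real probes that
// each return a promise. The probes run in parallel, so the overall budget is
// about 1500 ms rather than the sum of the two.
async function checkAllDependencies({ checkDatabase, checkCache }) {
  await Promise.all([
    withTimeout(checkDatabase(), 1500, 'database'),
    withTimeout(checkCache(), 1500, 'cache'),
  ]);
}
```

Because the error message carries the label, a log line written from the health endpoint's catch block can show which dependency exceeded its budget.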

Adjusting timeout for known, expected slow conditions

Some services have legitimately variable response times under specific, known conditions, such as a cold cache immediately after startup or a periodic batch job that briefly consumes resources. The timeout (along with --start-period for the startup case specifically) should account for these known patterns rather than being tuned only against typical steady-state behavior:

HEALTHCHECK --start-period=60s --interval=15s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

--start-period exempts failures during the initial warm-up window from counting toward the retry threshold, and a check that succeeds inside that window ends the exemption early. That makes it the more appropriate tool for handling startup-specific slowness than inflating the steady-state timeout to accommodate a condition that only exists briefly after the container starts.
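
The counting rule can be illustrated with a small model. This is a simplified sketch of the documented behavior, not the daemon's actual logic; the function name and the check timings are made up for illustration.

```javascript
// Simplified model of the start-period rule; not daemon code.
// `checks` is a time-ordered list of { at: secondsSinceStart, ok: boolean }.
function simulateHealth(checks, { startPeriod, retries }) {
  let started = false; // true after the first success or once startPeriod has elapsed
  let failures = 0;    // consecutive failures that count toward `retries`
  let status = 'starting';
  for (const { at, ok } of checks) {
    if (at >= startPeriod) {
      started = true;
    }
    if (ok) {
      started = true;
      failures = 0;
      status = 'healthy';
    } else if (started) {
      failures += 1;
      if (failures >= retries) {
        status = 'unhealthy';
      }
    }
    // A failure before `started` is ignored entirely.
  }
  return status;
}

// Three slow-start failures at 15 s, 30 s, and 45 s.
const warmUp = [
  { at: 15, ok: false },
  { at: 30, ok: false },
  { at: 45, ok: false },
];
console.log(simulateHealth(warmUp, { startPeriod: 60, retries: 3 }));
console.log(simulateHealth(warmUp, { startPeriod: 0, retries: 3 }));
```

In this model the same three warm-up failures leave the container in the starting state with a 60-second start period, but mark it unhealthy when no start period is configured.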

Reviewing timeout settings after incidents

A health check timeout that contributed to either a false failure or a delayed detection during a real incident is worth revisiting explicitly afterward, since the actual, observed behavior during that incident is more informative than the assumptions the timeout was originally set with:

docker events --filter event=health_status --since 24h

Reviewing health status transition timestamps against the actual timeline of a recent incident reveals whether the configured timeout behaved as intended, detecting the real problem promptly without excessive false positives along the way, or whether it needs adjustment based on what was actually observed.
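
For a longer incident window it can help to reduce the event stream to a list of transitions. The sketch below assumes the events were captured with docker events --format '{{json .}}', one JSON object per line, and that each object carries time, Action or status, and Actor.Attributes.name fields; those field names should be verified against real output, and the sample input is made up.

```javascript
// Sketch: extract health transitions from JSON-formatted docker events output.
// Field names are assumptions to verify against the Docker version in use.
function healthTransitions(jsonLines) {
  return jsonLines
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line))
    .map((event) => ({ event, action: event.Action || event.status || '' }))
    .filter(({ action }) => action.startsWith('health_status'))
    .map(({ event, action }) => ({
      at: new Date(event.time * 1000).toISOString(),
      container:
        (event.Actor && event.Actor.Attributes && event.Actor.Attributes.name) ||
        event.id,
      state: action.replace('health_status: ', ''),
    }));
}

// Made-up sample input for illustration only.
const sample = [
  '{"status":"health_status: unhealthy","id":"abc123","time":1700000000,"Actor":{"Attributes":{"name":"web"}}}',
  '{"status":"start","id":"def456","time":1700000030,"Actor":{"Attributes":{"name":"worker"}}}',
  '{"status":"health_status: healthy","id":"abc123","time":1700000090,"Actor":{"Attributes":{"name":"web"}}}',
].join('\n');
console.table(healthTransitions(sample));
```

Comparing the at values of each unhealthy transition with the time the incident actually began shows how long detection took in practice.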

Common mistakes

Setting a timeout without measuring the check's actual typical response time, leading to either frequent false failures or unnecessarily slow genuine failure detection.
Configuring an external timeout shorter than what the health check's own internal logic needs to complete under normal conditions, causing every check to fail purely due to timing.
Not using --start-period to account for legitimate startup slowness, instead inflating the steady-state timeout to compensate for a condition that only exists briefly after the container starts.
Leaving the health check implementation itself without an internal timeout on its dependency calls, making the external timeout the only thing preventing an indefinitely hanging check.
Never revisiting timeout settings after an incident where they contributed to either a false failure or a detection delay, leaving the same misconfiguration in place for the next occurrence.

Health timeout choice should be grounded in actual, measured response time data rather than an arbitrary default, paired with an internal timeout inside the check's own implementation shorter than the external configured value, and revisited explicitly whenever real incident behavior reveals the current setting is either too tight or too loose for the service's actual needs.