17.2.3.5 Shutdown Timeout Awareness

A focused guide to Shutdown Timeout Awareness, connecting core concepts with practical Docker and container operations.

Shutdown timeout awareness is the practice of treating a container's configured stop timeout as a known, visible, and monitored quantity, rather than an implicit detail that the application's shutdown logic knows nothing about and that no one watches for violations in production.

The default timeout trap

Docker's default stop timeout of 10 seconds applies silently to any container that has not been configured otherwise. A service deployed without anyone considering its shutdown needs inherits this default whether or not 10 seconds is enough for its in-flight work. The first command below stops a container with the default; the second overrides it with a 30-second timeout:

docker stop my-api
docker stop --time=30 my-api

A service whose requests can legitimately take 15 or 20 seconds, left on the unconfigured 10-second default, will have many of those requests cut off by SIGKILL on every routine deployment. Configuring an explicit, appropriate timeout avoids the problem entirely.

Making the timeout visible to the application itself

An application's shutdown handler generally has no built-in knowledge of how much time it has before SIGKILL arrives, so it cannot make informed decisions such as how long to wait for in-flight requests. Passing the configured timeout into the application as an environment variable closes this gap:

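# Compose file
services:
  api:
    stop_grace_period: 30s
    environment:
      - SHUTDOWN_TIMEOUT_MS=25000

// Application shutdown handler (Node.js)
// Budget passed in by the deployment; falls back to 10 s if unset or invalid.
const shutdownBudget = parseInt(process.env.SHUTDOWN_TIMEOUT_MS, 10) || 10000;
process.on('SIGTERM', async () => {
  // Wait for in-flight requests, but no longer than 80% of the budget.
  // waitForInFlightRequests() is defined elsewhere in the application.
  await Promise.race([
    waitForInFlightRequests(),
    new Promise((resolve) => setTimeout(resolve, shutdownBudget * 0.8)),
  ]);
  process.exit(0);
});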

Keeping the application's internal budget below the configured stop timeout leaves margin for the remaining shutdown steps, such as closing database connections, so that the first step does not consume the whole window. In this example the margin is applied twice: SHUTDOWN_TIMEOUT_MS is set to 25 seconds against a 30-second stop_grace_period, and the 0.8 multiplier then caps the in-flight request wait at 20 seconds.
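The margin described above can be made explicit by giving each shutdown step its own share of the budget. The following is a minimal sketch under stated assumptions, not a definitive implementation: runWithDeadline, shutdownInPhases, and the phase objects are hypothetical names introduced for illustration, and how the budget is split between phases is a choice left to the service.

```javascript
// Marker value used to tell "the deadline fired" apart from a normal result.
const TIMED_OUT = Symbol('timed-out');

// Hypothetical helper: settles with the step's result, or with TIMED_OUT once
// `ms` milliseconds have passed, whichever happens first. If the step
// rejects, the rejection propagates to the caller.
function runWithDeadline(step, ms) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  return Promise.race([Promise.resolve().then(() => step()), deadline])
    .finally(() => clearTimeout(timer));
}

// Runs the phases in order, giving each one a fixed share of the total budget.
// `phases` is an array of { name, share, run }; shares are assumed to sum to
// at most 1 so the whole sequence fits inside totalBudgetMs.
async function shutdownInPhases(totalBudgetMs, phases) {
  const outcomes = [];
  for (const { name, share, run } of phases) {
    const result = await runWithDeadline(run, totalBudgetMs * share);
    outcomes.push({ name, timedOut: result === TIMED_OUT });
  }
  return outcomes;
}
```

Inside a SIGTERM handler this could be called with, for example, a request-draining phase at a share of 0.8 and a connection-closing phase at 0.2, logging any phase whose timedOut flag is true before the process exits.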

Layered timeout configuration across orchestration platforms

Where Docker sits beneath a higher-level orchestrator, both the orchestrator's grace period and Docker's stop timeout may be configured at the same time. Confirming which one governs, or that both are configured consistently, avoids a confusing mismatch. For example, a Kubernetes pod spec sets the grace period while the image's Dockerfile sets the stop signal:

# Pod spec (orchestrator level)
terminationGracePeriodSeconds: 30

# Dockerfile (image level)
STOPSIGNAL SIGTERM

In this layered scenario the orchestrator's terminationGracePeriodSeconds is generally the value that matters, because the orchestrator manages the container's lifecycle directly. Any separately configured Docker-level value beneath it should be reviewed for consistency rather than left as a potentially contradictory setting that no one has reconciled.
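One way to keep the layers reconciled is a small check, run in CI or at startup, that the application's internal budget fits inside the grace period the platform enforces. This is only a sketch: checkShutdownBudget and its parameter names are hypothetical, and the 2-second minimum margin is an arbitrary assumption.

```javascript
// Hypothetical consistency check: does the application's internal shutdown
// budget fit inside the enforced grace period with some margin to spare?
function checkShutdownBudget({ gracePeriodSeconds, appBudgetMs, minMarginMs = 2000 }) {
  const graceMs = gracePeriodSeconds * 1000;
  // Time left between the app giving up and the platform sending SIGKILL.
  const marginMs = graceMs - appBudgetMs;
  return { ok: marginMs >= minMarginMs, marginMs };
}

// A 30-second grace period with a 25-second application budget leaves 5 s.
console.log(checkShutdownBudget({ gracePeriodSeconds: 30, appBudgetMs: 25000 }));
// → { ok: true, marginMs: 5000 }
```

A failing result (ok: false) indicates that the application could still be waiting on in-flight work when the platform's grace period expires.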

Monitoring for SIGKILL occurrences as a timeout signal

A container that frequently exits with code 137 (128 + 9, meaning it was terminated by SIGKILL rather than exiting cleanly) is a measurable signal that its stop timeout may be too short for its shutdown needs. The same code also appears when a container is killed for other reasons, such as exceeding its memory limit, so it is worth correlating these exits with deployments or stop operations. Tracking this across a fleet of services surfaces which ones need their timeout reconsidered:

docker events --filter event=die --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' | grep ' 137$'
exit_code{service="my-api"} == 137

A recurring check or dashboard panel that tracks how often this exit code appears per service turns timeout awareness from a one-time configuration decision into an ongoing signal, flagging when a previously adequate timeout has become insufficient as the service's behavior or load has changed.
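As a starting point for such a check, the sketch below counts exit-code-137 die events per container. It assumes the input is the JSON that docker events emits with --format '{{json .}}', one event per line, with the container name and exit code carried in Actor.Attributes; countSigkillExits is a hypothetical name.

```javascript
// Counts `die` events with exit code 137 per container name.
// `lines` is an array of strings, each expected to hold one JSON event.
function countSigkillExits(lines) {
  const counts = new Map();
  for (const line of lines) {
    if (!line.trim()) continue; // skip blank lines
    let event;
    try {
      event = JSON.parse(line);
    } catch {
      continue; // skip anything that is not valid JSON
    }
    const attrs = (event.Actor && event.Actor.Attributes) || {};
    // Compare as a string so a numeric or string exit code both match.
    if (event.Action === 'die' && String(attrs.exitCode) === '137') {
      const name = attrs.name || 'unknown';
      counts.set(name, (counts.get(name) || 0) + 1);
    }
  }
  return counts;
}
```

The resulting map can be exported to whatever metrics system is already in use. Because code 137 is not exclusive to stop timeouts, a rising count is a prompt to investigate rather than proof of a timeout problem.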

Reviewing timeout configuration as part of routine service review

The configured stop timeout is worth revisiting periodically, alongside resource limits and other runtime parameters covered elsewhere, rather than being decided once at first deployment and never reconsidered. Doing so keeps it aligned with the service's current behavior as that behavior changes:

docker inspect my-api --format '{{.Config.StopTimeout}}'

Comparing this value against recent shutdown duration measurements, rather than relying on memory of why it was originally chosen, confirms whether the current configuration still makes sense. If the command reports no value, no per-container timeout has been set and the default applies.

Measuring actual shutdown duration directly

Beyond configuring a timeout, measuring how long a clean shutdown takes under realistic conditions provides the data needed to set the value deliberately, with appropriate margin, rather than by guessing:

time docker stop --time=60 my-api
real 0m12.453s

A measured shutdown duration of around 12 seconds under realistic load suggests a timeout somewhat above that, perhaps 20 to 25 seconds to allow for variance and margin. That is more appropriate than either the unconsidered 10-second default or a generous value chosen without any measurement behind it.
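The reasoning above can be turned into a small calculation over several measurements. This is a sketch, not a rule: recommendStopTimeout is a hypothetical helper, and the default margin factor of 1.75 is an assumption chosen only because it lands in the 20 to 25 second range for a 12-second measurement.

```javascript
// Suggests a stop timeout in whole seconds from measured shutdown durations.
// Takes the slowest observed shutdown and scales it by a margin factor.
function recommendStopTimeout(durationsSeconds, marginFactor = 1.75) {
  if (durationsSeconds.length === 0) {
    throw new Error('at least one measured duration is required');
  }
  const slowest = Math.max(...durationsSeconds);
  return Math.ceil(slowest * marginFactor);
}

// Using the 12.453 s measurement from the example above:
console.log(recommendStopTimeout([12.453])); // → 22
```

Taking the slowest sample rather than the average keeps the suggestion conservative; with only a handful of measurements, a larger margin factor may be the safer choice.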

Common mistakes

  • Deploying a service without explicitly considering or configuring its stop timeout, leaving it silently governed by Docker's default 10-second value regardless of whether that is actually sufficient.
  • Not exposing the configured timeout value to the application itself, leaving its shutdown handler unable to make informed decisions about how to budget its own available time.
  • Configuring a Docker-level stop timeout inconsistently with an orchestrator's own grace period setting in a layered deployment, without reconciling which value actually governs.
  • Not monitoring for SIGKILL-driven (exit code 137) terminations across a fleet, missing a direct, measurable signal that a specific service's timeout has become insufficient.
  • Setting a timeout once at initial deployment and never revisiting it as the service's actual shutdown behavior and load characteristics evolve over time.

Shutdown timeout awareness turns a value that is easy to leave at an unconsidered default into something that is measured, exposed to the application's shutdown logic, monitored for violations across a fleet, and periodically revisited. That is what keeps the configured timeout aligned with a service's current shutdown needs rather than with an outdated or never-examined assumption.