17.2.3 Graceful Shutdown Design

A focused guide to Graceful Shutdown Design, connecting core concepts with practical Docker and container operations.

Handling SIGTERM is not enough on its own. Graceful shutdown design means deliberately ordering the steps a shutdown handler performs and budgeting the container's stop timeout across them, so that each step gets a realistic share of the time available.

The correct general sequence

A well-designed shutdown handler follows a specific order: stop accepting new work, let in-flight work finish, close downstream connections, and only then exit. Closing connections first risks in-flight work failing partway through:

process.on('SIGTERM', async () => {
  server.close(); // stop accepting new connections immediately
  await waitForInFlightRequests(); // let existing requests finish
  await db.end(); // close database connections only after requests are done
  await logger.flush(); // ensure final logs are written
  process.exit(0);
});

Reversing this order, for instance by closing the database connection while in-flight requests are still using it, causes those requests to fail unnecessarily and defeats much of the purpose of handling the signal gracefully.
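The handler above relies on a waitForInFlightRequests() helper that is not defined in this guide. One possible sketch, assuming an Express-style middleware signature, is a simple counter that resolves its waiters when the last response closes. The names createInFlightTracker and activeCount are hypothetical.

```javascript
// Hypothetical helper: counts requests that have started but not yet finished.
function createInFlightTracker() {
  let active = 0;
  let waiters = [];

  // Express-style middleware: count the request until its response closes.
  function middleware(req, res, next) {
    active += 1;
    res.once('close', () => {
      active -= 1;
      if (active === 0) {
        // Last in-flight request ended: release everyone waiting on the drain.
        waiters.forEach((resolve) => resolve());
        waiters = [];
      }
    });
    next();
  }

  // Resolves immediately when idle, otherwise once the last request ends.
  function waitForInFlightRequests() {
    if (active === 0) return Promise.resolve();
    return new Promise((resolve) => waiters.push(resolve));
  }

  return { middleware, waitForInFlightRequests, activeCount: () => active };
}
```

The middleware would be registered before any routes, so that every request is counted, and the shutdown handler would await the tracker's waitForInFlightRequests() after calling server.close().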

Budgeting time across shutdown steps

The container's configured stop timeout is a fixed total budget, and a well-designed handler should account for how much of that budget each step might reasonably consume, rather than allowing any single step to potentially consume the entire budget and leave none for the steps that follow:

process.on('SIGTERM', async () => {
  server.close();
  await Promise.race([
    waitForInFlightRequests(),
    new Promise((resolve) => setTimeout(resolve, 15000)), // cap in-flight wait at 15s
  ]);
  await db.end();
  process.exit(0);
});

The container is started with a matching 30-second stop timeout:

docker run --stop-timeout=30 my-api

Capping the in-flight wait at 15 seconds of the 30-second budget, rather than letting it consume the entire window, leaves time for the equally important step of closing database connections cleanly before the stop timeout expires and Docker sends SIGKILL.
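The same cap can be applied to every step, not only the in-flight wait. The sketch below is one way to do that, assuming a hypothetical withTimeout helper and an illustrative split of the 30-second budget. The specific numbers are assumptions, not recommendations.

```javascript
// Hypothetical helper: wait for `promise`, but give up after `ms` milliseconds.
// Resolves to true if the step finished in time, false if the cap was hit.
function withTimeout(promise, ms) {
  let timer;
  const cap = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  return Promise.race([promise.then(() => true), cap])
    .finally(() => clearTimeout(timer)); // do not leave the timer pending
}

// Illustrative split of a 30-second stop timeout (values are assumptions).
// The total stays below 30000 ms to leave headroom before SIGKILL.
const BUDGET_MS = { drain: 15000, database: 5000, logs: 2000 };
```

Inside the handler, a step would then look like `await withTimeout(waitForInFlightRequests(), BUDGET_MS.drain)`. The boolean result makes it possible to log which step hit its cap.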

Stopping new work before draining existing work

The very first action in any shutdown sequence should be ensuring no new work is accepted, since continuing to accept new requests or jobs during the shutdown window only adds more work that then also needs to be drained, potentially extending the shutdown process indefinitely rather than converging toward completion:

let shuttingDown = false;

app.use((req, res, next) => {
  if (shuttingDown) {
    return res.status(503).set('Connection', 'close').send('Server shutting down');
  }
  next();
});

process.on('SIGTERM', () => {
  shuttingDown = true;
  server.close();
});

Calling server.close() stops new connections, but requests can still arrive on keep-alive connections that are already open. The flag, set as the first action of the handler and checked on every incoming request, rejects those as well, so the set of work to drain is fixed at the moment the signal arrives rather than continuing to grow.

Designing for queue and background job consumers

For a service consuming from a job queue rather than handling synchronous HTTP requests, the equivalent sequence stops pulling new jobs from the queue first, finishes processing whatever job is currently in progress, and only then disconnects from the queue entirely:

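let shuttingDown = false;
let currentJobPromise = Promise.resolve(); // already resolved while no job is running

async function consumeJobs() {
  while (!shuttingDown) {
    const job = await queue.dequeue();
    if (job) {
      currentJobPromise = processJob(job); // track the active job so the handler can await it
      await currentJobPromise;
    }
  }
}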

process.on('SIGTERM', async () => {
  shuttingDown = true;
  await currentJobPromise; // wait for whatever job is actively in progress
  await queue.disconnect();
  process.exit(0);
});
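An alternative to tracking the current job is to await the consumer loop itself, which also covers a job that is dequeued just as the signal arrives. The sketch below assumes a hypothetical queue whose dequeue() resolves with a job, or with null when nothing arrives within its own poll interval. startConsumer and stop are illustrative names.

```javascript
// Sketch: a consumer whose loop promise can be awaited during shutdown.
// `queue` and `processJob` are hypothetical and supplied by the caller.
function startConsumer(queue, processJob) {
  let stopping = false;

  async function loop() {
    while (!stopping) {
      const job = await queue.dequeue(); // assumed to resolve with null when idle
      if (job) await processJob(job);    // the current job always runs to completion
    }
  }

  const finished = loop();

  return {
    // Ask the loop to stop, then wait until the current iteration has ended.
    async stop() {
      stopping = true;
      await finished;
    },
  };
}
```

A SIGTERM handler would then call `await consumer.stop()` before disconnecting from the queue. If dequeue() can block indefinitely, the stop would need its own time cap.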

Testing the shutdown sequence deliberately

A shutdown handler that reads correctly is not necessarily one that works correctly. Sending SIGTERM while genuine in-flight work exists confirms that the observed behavior matches the intended sequence:

curl http://localhost:3000/slow-endpoint &
docker stop --time=30 my-api
docker logs my-api | tail -20

Confirming through this kind of test that the slow, in-flight request genuinely completes successfully before the container actually stops, rather than being abruptly cut off, is the concrete verification that the shutdown sequence is working as designed, not just as written.
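The drain behavior can also be checked in-process, without a container, as a complement to the docker stop test. The sketch below uses only Node's built-in http module. It starts a server with a simulated slow route, closes the server mid-request, and reports whether the response still arrived. The function name and delay values are assumptions.

```javascript
const http = require('node:http');

// Sketch: begin a slow request, close the server while it is in flight,
// and return the response that the client actually received.
async function checkInFlightSurvivesClose(delayMs = 200) {
  const server = http.createServer((req, res) => {
    setTimeout(() => res.end('done'), delayMs); // simulated slow endpoint
  });
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  const { port } = server.address();

  const response = new Promise((resolve, reject) => {
    http.get({ host: '127.0.0.1', port, agent: false }, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ status: res.statusCode, body }));
    }).on('error', reject);
  });

  // Give the request time to reach the handler, then stop accepting connections.
  await new Promise((resolve) => setTimeout(resolve, delayMs / 4));
  const closed = new Promise((resolve) => server.close(resolve));

  const result = await response; // should still complete despite close()
  await closed;                  // resolves once all connections have ended
  return result;
}
```

This only exercises server.close(). It does not replace the container-level test, which also covers signal delivery and the stop timeout.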

Logging the shutdown sequence's own progress

Emitting a log line at each major step of the shutdown sequence provides direct, observable confirmation of exactly how far the sequence progressed before either completing successfully or being forcibly terminated by an expiring timeout:

process.on('SIGTERM', async () => {
  console.log('Shutdown: stopping new connections');
  server.close();
  console.log('Shutdown: waiting for in-flight requests');
  await waitForInFlightRequests();
  console.log('Shutdown: closing database connections');
  await db.end();
  console.log('Shutdown: complete');
  process.exit(0);
});

If a container is observed exiting via SIGKILL rather than completing this sequence cleanly, these log lines reveal exactly which step the shutdown sequence had reached, or gotten stuck on, before the timeout expired, which is considerably more diagnostic than a shutdown handler with no progress visibility at all.
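Adding elapsed time to each log line also shows how much of the stop timeout each step consumed. A minimal sketch, assuming a hypothetical runShutdownSteps helper that takes an ordered list of [name, function] pairs:

```javascript
// Hypothetical helper: run named shutdown steps in order, logging progress
// with the time elapsed since the shutdown began.
async function runShutdownSteps(steps, log = console.log) {
  const startedAt = Date.now();
  for (const [name, step] of steps) {
    log(`Shutdown: ${name} started (+${Date.now() - startedAt}ms)`);
    await step();
    log(`Shutdown: ${name} finished (+${Date.now() - startedAt}ms)`);
  }
  log(`Shutdown: complete (+${Date.now() - startedAt}ms)`);
}
```

A step whose "started" line has no matching "finished" line is the one that was still running when the timeout expired.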

Common mistakes

  • Closing downstream connections like a database before in-flight requests that depend on them have actually finished.
  • Not budgeting the available stop timeout across the shutdown sequence's individual steps, allowing one step to potentially consume the entire window.
  • Continuing to accept new work during the shutdown window, causing the set of work needing to drain to keep growing rather than converging.
  • Assuming a shutdown handler works correctly based on its code without ever testing it under realistic conditions with genuine in-flight work present.
  • Providing no logging or progress visibility within the shutdown sequence itself, making it difficult to diagnose exactly where a shutdown that ended in a forced SIGKILL actually got stuck.

Graceful shutdown design is as much about sequencing and time-budgeting the handler's individual steps as it is about handling the signal at all. Realistic testing, backed by clear progress logging, is what confirms the handler behaves correctly under the conditions it was designed for.
