17.2.1.4 Healthcheck Simplicity

A focused guide to Healthcheck Simplicity, connecting core concepts with practical Docker and container operations.

Healthcheck simplicity is the discipline of keeping a container's health check fast, cheap, and narrowly scoped, and of resisting the temptation to build it into an elaborate diagnostic system. A health check's job is to answer one binary question quickly and reliably, not to audit every aspect of the application's condition on every invocation.

The health check's actual job is narrow

A health check exists to answer one question: can this container currently do the basic thing it exists to do? It is not the place to perform a comprehensive system audit, run a full test suite, or check every dependency the application might ever touch. Compare a narrowly scoped check with an over-scoped one:

app.get('/healthz', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.status(200).send('ok');
  } catch {
    res.status(503).send('degraded');
  }
});

app.get('/healthz', async (req, res) => {
  const results = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkSearchService(),
    checkPaymentGateway(),
    checkEmailService(),
    checkAnalyticsService(),
    runDiagnosticSuite(),
  ]);
  res.json(results);
});

The first example scopes the check to the dependency that determines whether the application can function at all (here, the database). The second, while thorough, does considerably more work on every check interval than is warranted, and it checks dependencies whose unavailability might degrade a specific feature without making the entire service unhealthy.

Fast and cheap by design

A health check should complete quickly and consume minimal resources. It runs repeatedly, often every few seconds, for as long as the container exists, so an expensive check multiplies its cost across every check cycle and every replica:

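app.get('/healthz', async (req, res) => {
  try {
    await pool.query('SELECT 1'); // a few milliseconds against an existing connection pool
    res.status(200).send('ok');
  } catch {
    res.status(503).send('unavailable'); // report failure instead of leaving the rejection unhandled
  }
});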

app.get('/healthz', async (req, res) => {
  const fullReport = await generateComprehensiveSystemReport(); // potentially seconds of work
  res.json(fullReport);
});

Running continuously across many replicas for the entire lifetime of a service, an expensive check becomes a real, recurring resource cost that a simple check avoids.
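To make the multiplication concrete, the following is a rough, illustrative estimate rather than a measurement. The function name `dailyCheckCost` and all of the input figures (a 15-second interval, 20 replicas, 3 ms versus 2 s per check) are assumptions chosen for the example, and the count is an upper bound because Docker starts each interval only after the previous check completes.

```javascript
// Rough estimate of how much time a fleet spends running health checks per day.
// All inputs are hypothetical; substitute values measured for a real service.
function dailyCheckCost({ intervalSeconds, checkMillis, replicas }) {
  const secondsPerDay = 24 * 60 * 60;
  // Upper bound: ignores the time each check itself takes.
  const checksPerReplica = Math.floor(secondsPerDay / intervalSeconds);
  const totalChecks = checksPerReplica * replicas;
  const busySeconds = (totalChecks * checkMillis) / 1000;
  return { checksPerReplica, totalChecks, busySeconds };
}

// A cheap check (about 3 ms) versus an expensive one (about 2 s), 20 replicas each.
console.log(dailyCheckCost({ intervalSeconds: 15, checkMillis: 3, replicas: 20 }));
console.log(dailyCheckCost({ intervalSeconds: 15, checkMillis: 2000, replicas: 20 }));
```

Under these assumed numbers, both configurations run 115,200 checks per day, but the cheap check accounts for roughly 346 seconds of work while the expensive one accounts for 230,400 seconds, or 64 hours, every day.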

Avoiding building a custom health-check framework

It is tempting, especially in a larger application with many components, to build an elaborate internal framework for health checking: a registry of checks, a scoring system, a dashboard. This is almost always more complexity than the need warrants, since the underlying binary question, "can this container do its job," rarely benefits from such machinery:

class HealthCheckRegistry {
  register(name, checkFn, options) { /* ... */ }
  runAll() { /* ... */ }
  getScore() { /* ... */ }
}

A simple, direct function checking the one or two things that genuinely matter is almost always sufficient, and the apparent sophistication of a registry-and-scoring system rarely justifies its added maintenance burden for what is, at its core, a simple binary signal.
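As a contrast to the registry, here is a minimal sketch of what such a direct function might look like. The names `isHealthy` and `pingDatabase` and the 2-second default timeout are assumptions for illustration; `pingDatabase` stands in for whatever single call proves the core dependency is reachable, such as a wrapper around `pool.query('SELECT 1')`.

```javascript
// Minimal sketch: one direct function, no registry and no scoring.
// `pingDatabase` is a hypothetical async function supplied by the application.
async function isHealthy(pingDatabase, timeoutMs = 2000) {
  let timer;
  // Reject if the dependency does not answer within the time budget.
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('health check timed out')), timeoutMs);
  });
  try {
    await Promise.race([pingDatabase(), timeout]);
    return true;
  } catch {
    // Any failure, including the timeout, collapses to the same binary answer.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A route handler could then respond with 200 when `isHealthy(...)` resolves to `true` and 503 otherwise, which keeps the whole health check to one function and one branch.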

Separating health from detailed diagnostics

The desire for comprehensive system visibility is legitimate, but it belongs in a separate, more detailed status endpoint or in the application's metrics and logging. The health check itself must stay fast and narrowly scoped because of how frequently it runs:

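app.get('/healthz', async (req, res) => {
  try {
    await pool.query('SELECT 1'); // cheap probe of the one critical dependency
    res.status(200).send('ok');
  } catch {
    res.status(503).send('unavailable');
  }
});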

app.get('/internal/status', requireAuth, async (req, res) => {
  res.json(await generateComprehensiveSystemReport());
});

This separation keeps the frequently run health check cheap while giving the more comprehensive diagnostic information somewhere to live, computed on demand rather than on every health check interval whether or not anyone is looking at it.

Resisting unnecessary configuration complexity

The timing parameters (interval, timeout, retries, and start period) should be set deliberately, based on the application's measured behavior. Resist the temptation to build dynamically adjusting logic around these values; a simple, fixed, well-reasoned configuration is usually sufficient and considerably easier to reason about than a system that adapts its own health check behavior:

HEALTHCHECK --interval=15s --timeout=5s --retries=3 --start-period=30s \
  CMD curl -f http://localhost:3000/healthz || exit 1

A fixed configuration like this, set once from measured application behavior and revisited only when that behavior changes, avoids the maintenance burden and unpredictability that an adaptive health checking system would introduce for little benefit in most cases.
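One benefit of fixed values is that their consequences can be worked out by hand. The sketch below estimates how long a running container might take to be marked unhealthy after the application stops responding. It is an approximation built on two assumptions about Docker's behavior: a container is marked unhealthy after `retries` consecutive failures, and each interval starts when the previous check completes. The function name `detectionWindowSeconds` is hypothetical, and the start period is ignored.

```javascript
// Approximate bounds, in seconds, on how long a failure takes to surface
// as an "unhealthy" status. An estimate, not an exact model of the scheduler.
function detectionWindowSeconds({ interval, timeout, retries }) {
  return {
    // Failure begins just before a check, and every check fails instantly.
    fastest: (retries - 1) * interval,
    // Failure begins just after a check, and every check runs to its timeout.
    slowest: retries * (interval + timeout),
  };
}

// Values from the HEALTHCHECK instruction above.
console.log(detectionWindowSeconds({ interval: 15, timeout: 5, retries: 3 }));
// { fastest: 30, slowest: 60 }
```

If a window of roughly 30 to 60 seconds is too slow or too twitchy for the service, that is a reason to change the fixed numbers, not to make them adaptive.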

The cost of over-engineering becomes visible during an incident

An elaborate health check that itself depends on many things going right can become a source of confusion during an incident. The question "is the health check itself broken, or is the application genuinely unhealthy?" becomes a time-consuming distraction precisely when clarity matters most. With a simple check, the status Docker reports can be read at face value:

docker inspect --format='{{.State.Health.Status}}' my-api

A simple, narrowly scoped health check is far less likely to become a point of confusion or failure during an incident, since it has far less surface area in which something can go wrong independently of the application condition it is meant to reflect.
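When the status alone is not enough, the most recent probe result usually settles whether the probe or the application is at fault. The sketch below assumes the full health object has been captured with `docker inspect --format='{{json .State.Health}}' my-api` and parsed as JSON; the helper name `summarizeHealth` and the sample data are hypothetical.

```javascript
// Reduce a parsed .State.Health object to the few fields worth reading mid-incident.
function summarizeHealth(health) {
  const log = health.Log || [];
  const last = log[log.length - 1]; // most recent probe, if any were recorded
  return {
    status: health.Status,
    failingStreak: health.FailingStreak,
    lastExitCode: last ? last.ExitCode : null,
    lastOutput: last ? (last.Output || '').trim() : null,
  };
}

// Hypothetical sample, shaped like the JSON that docker inspect emits.
const sampleHealth = {
  Status: 'unhealthy',
  FailingStreak: 3,
  Log: [{ ExitCode: 1, Output: 'curl: (22) The requested URL returned error: 503\n' }],
};
console.log(summarizeHealth(sampleHealth));
```

With a narrow check, an exit code and one line of output are typically all there is to interpret, which is the point.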

Common mistakes

Building a health check that attempts a comprehensive audit of every dependency rather than narrowly answering whether the application can perform its core function.
Allowing a health check to perform expensive, slow operations that accumulate meaningful resource cost when multiplied across frequent, continuous execution and many replicas.
Constructing an elaborate internal health-check framework (registries, scoring systems, dashboards) for what is fundamentally a simple, binary signal.
Folding detailed, comprehensive diagnostic reporting directly into the health check rather than exposing it through a separate, on-demand status endpoint.
Building dynamically adaptive health check timing logic rather than a simple, fixed configuration based on real, measured application behavior.

Healthcheck simplicity means resisting the urge to make a health check comprehensive, sophisticated, or elaborate. Keep it fast, cheap, and narrowly focused on the one binary question it exists to answer, and serve any genuine need for deeper diagnostic visibility through a separate mechanism.