15.3.1 Healthcheck Design
A focused guide to Healthcheck Design, connecting core concepts with practical Docker and container operations.
Healthcheck design is the deliberate process of deciding what a container's health check should actually verify, how often it runs, how much transient failure it tolerates, and what cost it imposes. These choices determine whether the resulting health status is a genuinely useful signal or a check that technically exists but fails to detect the problems that matter.
Starting from what failure actually looks like
A well-designed health check begins with the question of what it would actually mean for this specific service to be unable to serve requests correctly, rather than defaulting to a generic, content-free check applied uniformly to every service regardless of what it does:
HEALTHCHECK CMD curl -f http://localhost:3000/ || exit 1
A bare request to the root path frequently succeeds even when a critical dependency is unreachable, since many frameworks return a generic response for any request regardless of the application's actual operational state, which is precisely the gap a deliberately designed check needs to close.
Exercising real dependencies
A meaningful health check actively verifies the dependencies the application genuinely needs in order to function, rather than only confirming the HTTP server itself is listening:
app.get('/healthz', async (req, res) => {
  try {
    await db.query('SELECT 1');
    await cache.ping();
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'degraded', reason: err.message });
  }
});
The specific dependencies checked should reflect what would actually make the service unable to do its job if unavailable; a service that can degrade gracefully without its cache, for instance, arguably should not report unhealthy purely because the cache is unreachable, since doing so could trigger unnecessary restarts or routing changes for a condition the application is already designed to tolerate.
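One way to encode that distinction is to tag each dependency as critical or optional and let only critical failures affect the reported status. The sketch below is illustrative only: `evaluateHealth` and the `{ name, critical, check }` shape are hypothetical names introduced here, not part of Express or any library, and each `check` is assumed to resolve on success and reject on failure.

```javascript
// Hypothetical helper: the names and data shape are assumptions for illustration.
// Each dependency is { name, critical, check }, where check() resolves on
// success and rejects (or throws) on failure.
async function evaluateHealth(dependencies) {
  const results = await Promise.all(
    dependencies.map(async ({ name, critical, check }) => {
      try {
        await check();
        return { name, critical, ok: true };
      } catch (err) {
        return { name, critical, ok: false };
      }
    })
  );

  const failed = results.filter((r) => !r.ok);
  // Only a failed critical dependency makes the service unhealthy.
  const healthy = failed.every((r) => !r.critical);

  return {
    httpStatus: healthy ? 200 : 503,
    failedCritical: failed.filter((r) => r.critical).map((r) => r.name),
    failedOptional: failed.filter((r) => !r.critical).map((r) => r.name),
  };
}
```

A handler could then respond with `httpStatus` and log or export `failedOptional` separately, so an unreachable cache is visible to operators without flipping the container to unhealthy.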
Distinguishing liveness from readiness
Two related but distinct questions are often conflated in a single health check: is the process alive and not deadlocked (liveness), versus is it currently capable of correctly serving traffic (readiness). Some orchestration layers support separate checks for each, and even where Docker's own native health check supports only one combined notion, designing the check with this distinction in mind clarifies intent:
app.get('/livez', (req, res) => res.status(200).send('alive'));

app.get('/readyz', async (req, res) => {
  const ready = await checkDependencies();
  res.status(ready ? 200 : 503).send(ready ? 'ready' : 'not ready');
});
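A failing liveness check should generally trigger a restart, since it implies the process itself is broken; a failing readiness check should generally trigger removal from traffic routing without necessarily restarting a process that may recover on its own once a dependency becomes available again. Standalone Docker only records the health status and does not itself restart an unhealthy container; acting on that status is left to an orchestrator such as Swarm, or to external tooling.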
Setting interval, timeout, and retries deliberately
The three timing parameters of a health check directly determine how quickly a real problem is detected versus how tolerant the check is of brief, expected blips, and each should be set based on the specific service's actual behavior rather than copied from an unrelated example:
HEALTHCHECK --interval=15s --timeout=5s --retries=3 --start-period=30s \
  CMD curl -f http://localhost:3000/healthz || exit 1
interval controls how often the check runs; timeout bounds how long a single check attempt is allowed to take before being considered failed; retries is how many consecutive failures are required before the container is actually marked unhealthy; and start-period provides a grace window after container startup during which failures do not count toward the retry threshold, which matters for any application with a meaningful warm-up time before it can correctly respond to its own health check.
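The interaction between these parameters can be made concrete with some rough arithmetic. The sketch below estimates how long a persistent failure takes to mark the container unhealthy, assuming the next check is scheduled one interval after the previous check completes and that the start period has already elapsed. `detectionWindow` is a hypothetical helper, and the numbers are approximations rather than guarantees.

```javascript
// Rough bounds, in seconds, on how long a persistent failure takes to flip the
// container to unhealthy. Assumes each check is scheduled `interval` seconds
// after the previous one completes and that the start period is over.
function detectionWindow({ interval, timeout, retries }) {
  return {
    // Failure begins just before a check, and every failing check returns at once.
    bestCase: (retries - 1) * interval,
    // Failure begins just after a passing check, and every failing check hangs
    // until the timeout.
    worstCase: retries * (interval + timeout),
  };
}

console.log(detectionWindow({ interval: 15, timeout: 5, retries: 3 }));
// → { bestCase: 30, worstCase: 60 }
```

With the settings shown above, a broken dependency would therefore take roughly 30 to 60 seconds to surface as unhealthy, which is the figure to weigh against how long the service can acceptably keep receiving traffic while broken.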
Avoiding a health check that is more expensive than the work it protects
A health check that performs an expensive operation, such as a database query that scans a large table or an external API call with significant latency, can itself become a meaningful source of load or latency, particularly when run frequently across many replicas:
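// Preferred: a trivial query is enough to confirm connectivity.
app.get('/healthz', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.status(200).send('ok');
  } catch (err) {
    res.status(503).send('unavailable'); // report the failure instead of leaving the rejection unhandled
  }
});

// Avoid: a full-table aggregate is unnecessarily expensive for a connectivity check.
app.get('/healthz', async (req, res) => {
  await db.query('SELECT COUNT(*) FROM orders');
  res.status(200).send('ok');
});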
A lightweight query that confirms connectivity is generally sufficient for health check purposes; the goal is confirming the dependency is reachable and responsive, not performing a comprehensive functional test on every single check cycle.
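Where several probers hit the same endpoint (for example a Docker health check, a load balancer, and a monitoring agent), one further option is to cache the probe result for a short time so the dependency sees at most one check per window. The following is a minimal sketch under that assumption; `cachedCheck`, `ttlMs`, and the injectable `now` clock are hypothetical names introduced for illustration.

```javascript
// Hypothetical wrapper: caches the outcome of an async check for ttlMs.
// `now` is injectable only so the behavior can be exercised without waiting.
function cachedCheck(check, ttlMs, now = Date.now) {
  let lastRun = -Infinity;
  let lastResult = null;
  let inFlight = null;

  return function run() {
    if (now() - lastRun < ttlMs) return Promise.resolve(lastResult); // still fresh
    if (inFlight) return inFlight; // share one probe among concurrent callers

    inFlight = Promise.resolve()
      .then(() => check())
      .then(() => true, () => false) // a failure is a result too, not an exception
      .then((ok) => {
        lastResult = ok;
        lastRun = now();
        inFlight = null;
        return ok;
      });
    return inFlight;
  };
}
```

The trade-off is staleness: a cached result can lag the real state by up to `ttlMs`, so the window should stay well below the check interval or the detection delay grows accordingly.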
Designing for partial degradation
Not every dependency failure should necessarily flip a service to fully unhealthy; a service with several independent capabilities might reasonably report healthy as long as its core function still works, while surfacing any degraded non-critical capability through a separate signal, such as a metric or a more detailed status endpoint:
app.get('/healthz', async (req, res) => {
  const coreOk = await checkCoreDependency();
  res.status(coreOk ? 200 : 503).send(coreOk ? 'ok' : 'unhealthy');
});

app.get('/status', async (req, res) => {
  // Run the independent checks concurrently so the endpoint stays fast.
  const [core, cache, search] = await Promise.all([
    checkCoreDependency(), checkCache(), checkSearch(),
  ]);
  res.json({ core, cache, search });
});
This avoids a binary health check causing unnecessary restarts or traffic removal for a degraded but still partially functional service, while still surfacing the more nuanced state for anyone actually investigating.
Testing the health check's failure path deliberately
A health check is only as trustworthy as its tested behavior under actual failure, which means deliberately breaking a dependency in a test environment and confirming the check correctly reports unhealthy, rather than assuming the implementation is correct because it looks reasonable in code review:
docker network disconnect my-network my-db
# Wait at least retries x (interval + timeout); 60s covers the 15s/5s/3 settings shown earlier.
sleep 60
docker inspect --format='{{.State.Health.Status}}' my-api
Verifying this end to end, including confirming the container actually transitions to unhealthy within the expected number of retry cycles, catches a health check that looks correct but has a subtle bug, such as swallowing the dependency's error instead of propagating it into the check's own failure response.
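A fixed sleep is easy to get wrong, so a test script may prefer to poll until the expected status appears or a deadline passes. The sketch below shows one way to do that in Node. `waitForStatus` and `dockerHealth` are hypothetical helpers introduced here, and the container name `my-api` is taken from the commands above; the status source is injected so the polling logic can be exercised without Docker.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical wrapper around the same `docker inspect` call shown above.
function dockerHealth(container) {
  return execFileSync(
    'docker',
    ['inspect', '--format', '{{.State.Health.Status}}', container],
    { encoding: 'utf8' }
  ).trim();
}

// Polls `getStatus` until it returns `expected` or the deadline passes.
// Returns whether the status was reached and the last value observed.
async function waitForStatus(getStatus, expected, { timeoutMs, pollMs }) {
  const deadline = Date.now() + timeoutMs;
  let last;
  do {
    last = await getStatus();
    if (last === expected) return { reached: true, last };
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  } while (Date.now() < deadline);
  return { reached: false, last };
}
```

In a real test this might be called as `waitForStatus(() => dockerHealth('my-api'), 'unhealthy', { timeoutMs: 90000, pollMs: 2000 })` after disconnecting the database, failing the test if `reached` is false.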
Common mistakes
- Implementing a health check that only confirms the server process is listening, without exercising any of the dependencies that actually determine whether the service can function.
- Conflating liveness and readiness into a single check without considering whether a failure should trigger a restart, a routing change, or both.
- Setting timing parameters by copying an unrelated example rather than basing them on the specific service's actual startup time and acceptable failure tolerance.
- Making the health check itself expensive enough to contribute meaningfully to load or latency, especially when multiplied across many replicas checking frequently.
- Never deliberately testing the health check's failure path, leaving its actual correctness under real failure conditions unverified.
Good healthcheck design starts from a clear understanding of what failure actually looks like for the specific service, exercises real dependencies cheaply, distinguishes liveness from readiness where that distinction matters, and is verified under deliberately induced failure rather than trusted purely because the implementation reads correctly.