15.3.2.2 Health Healthy State

A focused guide to Health Healthy State, connecting core concepts with practical Docker and container operations.

The healthy state is Docker's signal that a container is currently passing its health check. It is the status every operator wants to see, but it is a point-in-time result rather than a continuous guarantee. Understanding precisely what it does and does not guarantee prevents over-trusting it in ways that leave a genuinely degraded container appearing fine until a later check catches the problem.

What healthy actually represents

A healthy status means one thing: at least one check has succeeded, and fewer consecutive checks have failed since then than the configured number of retries (three by default). It is not a continuous, real-time guarantee of correctness. It is a snapshot based on checks that have already run, and it carries no information about what has happened since the last one:

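# Dockerfile: run the check every 30 seconds
HEALTHCHECK --interval=30s CMD curl -f http://localhost:3000/healthz || exit 1
# Shell: read the status Docker currently reports for the container
docker inspect --format='{{.State.Health.Status}}' my-api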

With a 30-second interval, a container could break at any point within that 30-second window and would continue reporting healthy, accurately reflecting its last successful check. The next scheduled check is the first chance to catch the problem, and with Docker's default of three retries the status changes to unhealthy only after three consecutive failed checks.
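
How long a broken container can keep reporting healthy depends on more than the interval. The sketch below estimates a rough upper bound, assuming Docker's documented behavior that each check starts one interval after the previous check completes and that the status changes only after the configured number of consecutive failures. The function name and parameter names are illustrative, not part of Docker's API.

```javascript
// Rough upper bound on how long a container can stay "healthy" after it breaks.
// Assumptions: the failure happens right after a passing check completes,
// every later check runs until its timeout, and Docker marks the container
// unhealthy only after `retries` consecutive failures.
function worstCaseStaleSeconds({ intervalSeconds, timeoutSeconds, retries }) {
  return retries * (intervalSeconds + timeoutSeconds);
}

// Docker's defaults: --interval=30s --timeout=30s --retries=3
const defaults = { intervalSeconds: 30, timeoutSeconds: 30, retries: 3 };
console.log(worstCaseStaleSeconds(defaults)); // 180
```

If the check fails fast instead of hanging, the timeout term shrinks toward zero and the bound approaches the retry count multiplied by the interval (90 seconds with the defaults).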

The gap between checks

The interval between checks is a real, unavoidable blind spot, and its size trades off directly against the load the check itself imposes. A very frequent check shrinks the blind spot but adds proportionally more overhead. A longer interval reduces overhead at the cost of a larger window during which a real problem can exist without being reflected in the reported status:

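# Option A: small blind spot, 12 times as many check executions as option B
HEALTHCHECK --interval=5s CMD curl -f http://localhost:3000/healthz || exit 1
# Option B: lower overhead, larger blind spot (pick one; only the last HEALTHCHECK in a Dockerfile takes effect)
HEALTHCHECK --interval=60s CMD curl -f http://localhost:3000/healthz || exit 1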

Choosing an interval appropriate to how quickly a real problem needs to be detected, balanced against how much overhead the check can reasonably impose at that frequency, is a deliberate trade-off rather than a setting with one universally correct value.
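
To make the trade-off concrete, the following sketch compares candidate intervals by the number of check executions per day and by the delay before Docker would report unhealthy. It assumes fast-failing checks and Docker's default of three retries. `compareIntervals` is a hypothetical helper, and the numbers are plain arithmetic, not measurements.

```javascript
// Compare health check intervals: overhead (checks per day) against the
// delay before Docker reports "unhealthy" once checks start failing.
// Assumes each failing check returns almost immediately.
function compareIntervals(intervalsSeconds, retries = 3) {
  return intervalsSeconds.map((interval) => ({
    intervalSeconds: interval,
    checksPerDay: Math.floor(86400 / interval),
    secondsUntilUnhealthy: interval * retries,
  }));
}

console.table(compareIntervals([5, 30, 60]));
```

A 5-second interval runs 17,280 checks per day against 1,440 for a 60-second interval, which is the overhead side of the decision described above.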

Healthy does not mean fully functional

A health check, even a well-designed one, typically verifies a specific, bounded set of conditions (core dependency connectivity, basic responsiveness), not the full correctness of every feature the application provides. A container can legitimately report healthy while a narrower feature is broken for a reason the health check was never designed to detect:

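app.get('/healthz', async (req, res) => {
  try {
    await pool.query('SELECT 1'); // confirms database connectivity only
    res.status(200).send('ok');
  } catch (err) {
    res.status(503).send('database unreachable'); // curl -f exits non-zero on this
  }
});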

A broken search feature caused by a misconfigured external search service, for instance, would not be caught by a health check that only verifies database connectivity, which means a healthy status should be read as "core dependencies are reachable," not as a blanket assurance that every feature is working correctly.
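
One way to keep the scope of a check visible is to have the endpoint report each verified condition by name, so anyone reading the response can see what healthy covers and what it does not. The sketch below is a hypothetical helper, not an Express or Docker API: `buildHealthReport`, the check names, and the `required` list are all assumptions made for illustration.

```javascript
// Build a health response from named check results.
// Only checks listed in `required` affect the HTTP status; the others are
// reported so their state stays visible without failing the container.
function buildHealthReport(results, required) {
  const failedRequired = required.filter((name) => results[name] !== true);
  return {
    httpStatus: failedRequired.length === 0 ? 200 : 503,
    body: {
      status: failedRequired.length === 0 ? 'ok' : 'fail',
      checks: results,
      failedRequired,
    },
  };
}

// Database is reachable, the (hypothetical) search service is not.
const report = buildHealthReport({ database: true, search: false }, ['database']);
console.log(report.httpStatus, report.body.checks.search); // 200 false
```

In this example the container would still pass its check: the required condition holds even though the search dependency is down, which is exactly the bounded meaning of healthy described above.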

Trusting healthy status for traffic routing decisions

Load balancers and service meshes that route traffic based on health status place real trust in the accuracy and recency of that signal. Both the check's design (what it actually verifies) and its interval (how current the signal is) therefore shape the real-world consequences of an inaccurate or stale healthy report:

http:
  services:
    api:
      loadBalancer:
        healthCheck:
          path: /healthz
          interval: "10s"

A proxy configured to check health independently, at its own interval, rather than relying solely on Docker's own internally tracked status, adds a layer of redundancy, since the proxy's own check provides an additional, independently timed data point rather than depending entirely on however frequently Docker's internal check happens to run.
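
The value of a second, independently timed check can be expressed as a routing rule that considers both signals and how old the proxy's own result is. This is a minimal sketch with hypothetical names (`shouldRoute`, `maxProbeAgeMs`); real proxies implement their own logic, and nothing here reflects a specific product's behavior.

```javascript
// Decide whether to send traffic to a backend using two independent signals:
// Docker's reported status and the proxy's own most recent probe.
function shouldRoute({ dockerStatus, probeOk, probeAtMs, nowMs, maxProbeAgeMs }) {
  const probeIsFresh = nowMs - probeAtMs <= maxProbeAgeMs;
  return dockerStatus === 'healthy' && probeOk && probeIsFresh;
}

const now = Date.now();
// Docker still says healthy, but the proxy's probe 4 seconds ago failed.
console.log(shouldRoute({
  dockerStatus: 'healthy',
  probeOk: false,
  probeAtMs: now - 4000,
  nowMs: now,
  maxProbeAgeMs: 15000,
})); // false
```

Requiring the probe to be fresh means a stale success is treated the same as no information, so neither signal is trusted beyond its own age.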

A healthy container that becomes unresponsive between checks

A particularly important edge case is a container that becomes completely unresponsive (hung or deadlocked) immediately after a successful check. It remains in that state until the next check catches it and begins counting toward the failure threshold. During this entire window, the container continues to show as healthy while being unable to serve any real traffic:

docker inspect --format='{{.State.Health.Status}}' my-api

This is precisely why health status should be combined with other signals, such as application metrics showing actual request success or failure, or an external monitor that is independent of the container's self-reported status. Together they close a gap that health status alone leaves open for as long as it takes the configured number of consecutive checks to fail, which is roughly the check interval multiplied by the retry count.
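
A sketch of that combination: compare the reported status with what recent requests actually show. The function and parameter names (`assessService`, `recentRequests`, `maxErrorRate`) are hypothetical, and the 50% threshold is an arbitrary example value.

```javascript
// Cross-check Docker's reported status against recent request outcomes.
// `recentRequests` is an array of booleans: true for success, false for failure.
function assessService(healthStatus, recentRequests, maxErrorRate = 0.5) {
  if (healthStatus !== 'healthy') return 'unhealthy';
  if (recentRequests.length === 0) return 'healthy-unverified'; // no traffic to corroborate
  const failures = recentRequests.filter((ok) => !ok).length;
  const errorRate = failures / recentRequests.length;
  return errorRate > maxErrorRate ? 'healthy-but-failing' : 'healthy';
}

console.log(assessService('healthy', [false, false, false, true])); // healthy-but-failing
```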

Healthy status persisting through a slow degradation

A service experiencing gradual performance degradation, rather than a sudden, binary failure, may continue passing its health check throughout the entire degradation if the check itself does not measure latency or response quality, only binary success or failure of a basic connectivity test:

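app.get('/healthz', async (req, res) => {
  try {
    await pool.query('SELECT 1'); // succeeds even if normal request latency has degraded significantly
    res.status(200).send('ok');
  } catch (err) {
    res.status(503).send('database unreachable'); // only a hard failure reaches this branch
  }
});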

A health check reporting healthy throughout a period of significantly degraded latency, visible clearly in application-level latency metrics, illustrates why health status and performance metrics answer different questions and neither alone is sufficient for a complete operational picture.
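
One way to let latency inform the picture is to compute a high percentile over recent request durations and compare it with a budget. The sketch below uses the nearest-rank method; `percentile`, `isDegraded`, and the 1000 ms budget are illustrative assumptions, not values from any particular service.

```javascript
// Nearest-rank percentile over a list of durations in milliseconds.
function percentile(samplesMs, q) {
  if (samplesMs.length === 0) return NaN;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// A service is treated as degraded when its p99 exceeds the latency budget.
function isDegraded(samplesMs, budgetMs = 1000) {
  return percentile(samplesMs, 0.99) > budgetMs;
}

const recent = [100, 120, 110, 2400, 105];
console.log(percentile(recent, 0.5), isDegraded(recent)); // 110 true
```

The median here looks fine while the tail does not, which is the kind of degradation a binary connectivity check passes straight through.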

Avoiding overreliance on the healthy signal

The practical takeaway is to treat a healthy status as a necessary but not sufficient condition for confidence in a service's actual operational quality, pairing it with application-level metrics, latency tracking, and error rate monitoring that can catch the categories of problem a binary health check, by its nature, is not designed to detect.

container_health_status{name="my-api"} == 1
http_request_duration_seconds{service="my-api", quantile="0.99"} < 1.0

Common mistakes

  • Treating a healthy status as a continuous, real-time guarantee rather than a snapshot valid only as of the most recent check.
  • Assuming a healthy result means full functional correctness across every feature, rather than recognizing it as confirmation of whatever specific, bounded set of conditions the check actually verifies.
  • Relying solely on health status for traffic routing decisions without an independent, redundant check or corroborating application-level signal.
  • Not accounting for the window between checks during which a container could become unresponsive while still showing its last successful, now-stale result.
  • Designing a health check that only measures binary connectivity, missing gradual performance degradation that application-level latency metrics would have caught clearly.

The healthy state is a useful, necessary signal, but it is bounded in both time (a snapshot, not a continuous guarantee) and scope (whatever the specific check verifies, not full functional correctness). Treating it as more than that, without pairing it with metrics and independent checks that cover its blind spots, leaves real gaps in operational visibility.