15.3 Health Observability
A focused guide to Health Observability, connecting core concepts with practical Docker and container operations.
Health observability is the practice of treating a container's health check status as an active observability signal in its own right, feeding it into monitoring, alerting, and dashboards alongside logs and metrics, rather than treating it merely as an internal mechanism the container runtime uses for restart and routing decisions.
Health status as a distinct signal
A container's health check produces a simple, well-defined state, healthy, unhealthy, or starting, that summarizes whether the application considers itself capable of correctly serving requests, distinct from whether the process is merely running and distinct from any specific metric value:
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
CMD curl -f http://localhost:3000/healthz || exit 1
docker inspect --format='{{.State.Health.Status}}' my-api
This binary-ish signal is coarser than a full metrics dashboard, but it is also simpler to reason about and alert on directly: "is this container healthy right now" is a question every other observability signal eventually needs to answer indirectly through more complex correlation.
Exporting health status to a metrics pipeline
Health status by itself is only visible through docker inspect unless it is actively collected and exported, which means a deliberate step is needed to turn this internal signal into something a dashboard or alerting system can act on:
docker inspect --format='{{.State.Health.Status}}' my-api | \
awk '{print ($1=="healthy")?1:0}'
container_health_status{name="my-api"} 1
cAdvisor and several other Docker-aware metrics exporters surface health status alongside resource metrics automatically, which means in many setups this signal is already available in the same metrics pipeline as CPU and memory data without needing a separate, custom collection step.
Alerting on health transitions, not just current state
The most actionable health observability signal is often the transition itself, a container moving from healthy to unhealthy, rather than the instantaneous current state, since a brief, single failed check (which the retries setting is designed to tolerate before marking a container unhealthy) is expected and not alert-worthy on its own:
changes(container_health_status{name="my-api"}[5m]) > 0
Alerting specifically on a transition into the unhealthy state, rather than on every individual failed check attempt, respects the retry tolerance already built into the health check configuration and avoids generating noise for conditions the container runtime is already designed to tolerate gracefully.
Correlating health transitions with other signals
A health status transition is most useful for investigation when correlated against logs, metrics, and recent deployment events from the same time window, since the health check alone tells you something changed but not why:
docker events --filter event=health_status --since 1h
docker logs --since 5m my-api | tail -50
Docker's own event stream includes health status change events directly, which provides a precise timestamp to anchor a subsequent log and metrics correlation around, rather than needing to infer roughly when a transition occurred from periodic polling alone.
Health checks as a deployment safety signal
Beyond steady-state monitoring, health observability is directly tied to deployment safety: a deployment pipeline that actively checks health status before considering a rollout successful catches a broken release before it receives full traffic, rather than relying on a human noticing degraded behavior afterward:
until [ "$(docker inspect --format='{{.State.Health.Status}}' my-api-new)" = "healthy" ]; do
sleep 2
done
Wiring this check directly into deployment automation, rather than treating health status as something only consulted manually after a deployment appears to be going wrong, turns it into an active gate rather than a passive signal discovered after the fact.
Health check quality determines observability value
A shallow health check, one that only confirms the process is listening without exercising any real dependency, provides a correspondingly shallow observability signal: it can report healthy right up until the moment a genuine failure occurs that the check was never designed to detect:
app.get('/healthz', async (req, res) => {
try {
await db.query('SELECT 1');
res.status(200).json({ status: 'ok' });
} catch (err) {
res.status(503).json({ status: 'degraded' });
}
});
Investing in a health check that actually verifies the dependencies the application needs to function correctly is what makes the resulting observability signal worth building dashboards and alerts around in the first place; a meaningless health check produces a meaningless observability signal no matter how well it is collected and displayed.
Dashboards combining health status with resource and request metrics
A useful operational dashboard places health status alongside resource usage and request-level metrics for the same service, since the combination, rather than any single signal alone, usually tells the more complete story during an investigation:
container_health_status{service="api"}
container_cpu_usage_seconds_total{service="api"}
http_requests_total{service="api", status=~"5.."}
Seeing a health status transition occur at the same moment as a CPU spike and an increase in error rate paints a far more useful picture for rapid diagnosis than any one of those three signals would on its own.
Common mistakes
- Treating health status as purely an internal mechanism for restart and routing decisions, never exporting it to a dashboard or alerting system at all.
- Alerting on every individual failed health check attempt rather than on the actual transition into an unhealthy state, generating noise the retry tolerance was already designed to absorb.
- Investigating a health transition without correlating it against logs and other metrics from the same time window, missing the explanation for why the transition occurred.
- Relying on a shallow health check that does not exercise real dependencies, producing a health observability signal that fails to detect the actual class of problem most likely to occur.
- Treating health status as relevant only after a deployment, rather than wiring it directly into deployment automation as an active safety gate.
Health observability treats a container's health check status as a first-class signal worth exporting, alerting on transitions for, and correlating against logs and metrics, which only delivers real value when the underlying health check itself is meaningful enough to actually detect the failures an operator cares about.