20.3.2.4 Health Monitoring Practice

A focused guide to Health Monitoring Practice, connecting core concepts with practical Docker and container operations.

Health monitoring practice means defining meaningful healthchecks for containers, tracking health state over time, and deciding how to respond when health degrades. A running container is not necessarily a healthy one: the process may be alive while the application is deadlocked, out of database connections, or serving only error responses. Healthchecks provide the signal that distinguishes these states, so that automated or manual responses can happen before users are affected.

What a Healthcheck Does

A healthcheck is a command that Docker runs inside the container at a configured interval. The command's exit code determines health:

Exit code 0: healthy
Exit code 1: unhealthy
Exit code 2: reserved; do not use

Docker counts consecutive failures. When the count reaches the threshold set by --retries, the container transitions to the unhealthy state; a single successful check resets the count and marks the container healthy again. Docker records the current status and a short history of recent check results, both visible through docker inspect; docker ps shows the current status.
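The transition rule can be sketched as a small state machine. This is a simplified JavaScript model for illustration, not Docker's implementation: the function name nextHealth and the state shape are hypothetical, and the start period is ignored.

```javascript
// Simplified model of how consecutive failures drive the health status.
// `nextHealth` and the state shape are hypothetical, for illustration only.
function nextHealth(state, exitCode, retries) {
  if (exitCode === 0) {
    // Any success resets the streak and marks the container healthy.
    return { status: 'healthy', failingStreak: 0 };
  }
  const failingStreak = state.failingStreak + 1;
  // The status flips only once the streak reaches the retries threshold.
  const status = failingStreak >= retries ? 'unhealthy' : state.status;
  return { status, failingStreak };
}

// Replay a sequence of healthcheck exit codes with --retries=3.
let state = { status: 'starting', failingStreak: 0 };
const history = [1, 0, 1, 1, 0, 1, 1, 1].map((code) => {
  state = nextHealth(state, code, 3);
  return state.status;
});
console.log(history.join(' -> '));
// starting -> healthy -> healthy -> healthy -> healthy -> healthy -> healthy -> unhealthy
```

Two failures followed by a success never reach the threshold; only the final run of three consecutive failures flips the status.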

Defining a Healthcheck in the Dockerfile

FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci --omit=dev
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "server.js"]

Parameters:

--interval=30s — Run the healthcheck every 30 seconds.
--timeout=5s — A healthcheck that takes longer than 5 seconds is treated as a failure.
--start-period=15s — During the first 15 seconds after container startup, failures do not count toward the retry threshold. This allows slow-starting applications time to initialize without immediately going unhealthy.
--retries=3 — 3 consecutive failures are required to transition to unhealthy.
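Taken together, these parameters bound how long a failure can go unnoticed. The sketch below estimates a rough upper bound, assuming the start period has elapsed, every failing check runs for the full timeout, and each check starts one interval after the previous check completes. The function name worstCaseDetectionSeconds is hypothetical.

```javascript
// Rough upper bound on the time between an application failing and Docker
// reporting the container as unhealthy. Assumes the start period is over,
// each failing check uses its full timeout, and the next check starts
// `interval` seconds after the previous one completes.
function worstCaseDetectionSeconds({ interval, timeout, retries }) {
  return retries * (interval + timeout);
}

// Values from the HEALTHCHECK instruction above.
const apiCheck = { interval: 30, timeout: 5, retries: 3 };
console.log(worstCaseDetectionSeconds(apiCheck)); // 105
```

If nearly two minutes is too slow for a given service, shortening the interval is usually the first parameter to adjust; lowering retries makes the check more sensitive to transient failures.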

The wget -qO- command fetches the health endpoint. wget exits with code 0 on a successful (2xx) response and with a non-zero code on a failed connection or an HTTP error status; the || exit 1 normalizes any failure to exit code 1 (unhealthy). Alpine-based images ship BusyBox wget but do not install curl by default, which is why wget is used here.

Healthcheck via curl

For Debian/Ubuntu-based images:

HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

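curl -f (--fail) makes curl exit with code 22 when the server returns an HTTP status of 400 or higher, instead of printing the error page and exiting with 0. On connection failure, curl exits with code 7. The || exit 1 maps either failure to exit code 1, which Docker treats as unhealthy.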

Database Healthchecks

For PostgreSQL:

HEALTHCHECK --interval=10s --timeout=5s --retries=5 \
  CMD pg_isready -U postgres -d mydb || exit 1

pg_isready is included in PostgreSQL images. It tests whether the database server is accepting connections without executing a query.

For Redis:

HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD redis-cli ping | grep PONG || exit 1

redis-cli ping returns PONG when the server is responsive.

Checking Health Status

docker ps

CONTAINER ID   IMAGE     STATUS                    PORTS
a1b2c3d4e5f6   my-api    Up 2 hours (healthy)      0.0.0.0:3000->3000/tcp
b2c3d4e5f6a1   my-db     Up 1 hour (healthy)       5432/tcp
c3d4e5f6a1b2   my-cache  Up 30 min (unhealthy)     6379/tcp

The status column shows (healthy), (unhealthy), or (health: starting). An (unhealthy) container requires investigation.

For detailed health history:

docker inspect my-cache --format '{{json .State.Health}}'

{
  "Status": "unhealthy",
  "FailingStreak": 5,
  "Log": [
    {
      "Start": "2024-03-15T14:22:00Z",
      "End": "2024-03-15T14:22:05Z",
      "ExitCode": 1,
      "Output": "redis-cli: Could not connect to Redis"
    },
    ...
  ]
}

The Log array shows the most recent healthcheck results with their output (Docker retains only the last five, not the full history). FailingStreak shows how many consecutive failures have occurred.
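For scripted monitoring, the same JSON can be reduced to the fields that matter for an alert. The following is a minimal sketch; summarizeHealth is a hypothetical helper, and it assumes the most recent result is the last element of Log.

```javascript
// Summarize the object printed by:
//   docker inspect <container> --format '{{json .State.Health}}'
// Assumes the newest healthcheck result is the last entry in Log.
function summarizeHealth(healthJson) {
  const health = JSON.parse(healthJson);
  const log = health.Log || [];
  const last = log.length > 0 ? log[log.length - 1] : null;
  return {
    status: health.Status,
    failingStreak: health.FailingStreak,
    lastExitCode: last ? last.ExitCode : null,
    lastOutput: last ? String(last.Output).trim() : '',
  };
}

// Sample input: the single log entry shown above, without the elided ones.
const sample = JSON.stringify({
  Status: 'unhealthy',
  FailingStreak: 5,
  Log: [
    {
      Start: '2024-03-15T14:22:00Z',
      End: '2024-03-15T14:22:05Z',
      ExitCode: 1,
      Output: 'redis-cli: Could not connect to Redis',
    },
  ],
});
console.log(summarizeHealth(sample));
```

The last output line is often enough to tell a dependency failure from a timeout without opening the full inspect output.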

The /health Endpoint

The healthcheck command tests an endpoint. That endpoint must reflect the application's actual health, not just process liveness. A useful /health endpoint checks:

Whether the application can connect to its database.
Whether the application can connect to its cache.
Whether background queues are processing.
Application-specific indicators (e.g., whether config loaded successfully).

An endpoint that just returns 200 OK without checking dependencies is a liveness probe, not a true health probe. It tells you the process is running but not whether the application is functional.

Example Node.js health endpoint:

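// Assumes `app` is an Express application, `db` exposes a promise-based
// query() (for example a pg Pool), and `redis` is a connected client with ping().
app.get('/health', async (req, res) => {
  try {
    // Cheap round-trips that prove each dependency is reachable.
    await db.query('SELECT 1');
    await redis.ping();
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    // 503 makes curl -f and wget exit non-zero, failing the healthcheck.
    res.status(503).json({ status: 'error', message: err.message });
  }
});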

This endpoint tests database and cache connectivity on every request. If either dependency is unavailable, the endpoint returns 503, which curl -f treats as a failure.
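One limitation of a handler like the one above is that a hung dependency call blocks the response until Docker's --timeout expires, and the recorded output does not say which dependency was at fault. The sketch below adds a per-dependency timeout and reports each result by name. The helpers withTimeout and checkDependencies are hypothetical, not part of any library.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${name} timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to a function that returns a promise.
// Resolves to the HTTP status code and JSON body the endpoint should send.
async function checkDependencies(checks, ms) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(Promise.resolve().then(() => check()), ms, name);
        return [name, 'ok'];
      } catch (err) {
        return [name, `error: ${err.message}`];
      }
    })
  );
  const healthy = entries.every(([, result]) => result === 'ok');
  return {
    statusCode: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'error', checks: Object.fromEntries(entries) },
  };
}

// Possible use inside the /health handler (db and redis as assumed above):
//   const { statusCode, body } = await checkDependencies(
//     { db: () => db.query('SELECT 1'), redis: () => redis.ping() }, 2000);
//   res.status(statusCode).json(body);
```

The per-dependency timeout should be shorter than the healthcheck's own --timeout, so the endpoint can answer with a 503 and a named culprit before Docker gives up on the check.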

Healthchecks in Docker Compose

services:
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  db:
    image: postgres:15-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

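The CMD form runs the command directly (exec form) without a shell, so the executable, here curl, must exist in the image and shell syntax is not interpreted. CMD-SHELL passes the command to /bin/sh -c, enabling shell features like pipes and variable expansion.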

Disabling a Healthcheck

To disable a healthcheck defined in the base image (useful during debugging):

docker run --no-healthcheck my-image

In Compose:

services:
  api:
    healthcheck:
      disable: true

Healthcheck vs Orchestrator Liveness/Readiness

Docker healthchecks correspond loosely to the probes that orchestrators such as Kubernetes use:

Liveness probe (Kubernetes): Is the application still working? If the probe fails, the container is restarted.
Readiness probe (Kubernetes): Is the container ready to receive traffic? If not, traffic is withheld from it, but it is not restarted.

In standalone Docker, the single healthcheck stands in for both probes: it reports health state, but Docker does not automatically restart an unhealthy container (restart policies act on container exit, not on health status). In Docker Swarm, an unhealthy container causes Swarm to schedule a replacement.

For standalone Docker, monitoring the health state and alerting on unhealthy transitions enables manual or scripted intervention:

# List containers currently reported as unhealthy (input for an alerting script)
docker ps --filter health=unhealthy --format '{{.Names}}'

# Event stream for health state transitions
docker events --filter type=container --filter event=health_status

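Abbreviated example output (real events also include the container ID and attributes such as the image name):

2024-03-15T14:22:31.123Z container health_status my-cache (health_status=unhealthy)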

Feeding Docker events into a monitoring pipeline (PagerDuty, Opsgenie, Alertmanager) enables automated alerting when containers become unhealthy.
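As a starting point for such a pipeline, the sketch below parses one line of event output and extracts the health transition. It assumes the events are emitted as JSON with docker events --format '{{json .}}' and that the field names (Type, Action, Actor.Attributes.name, time) match your Docker version; parseHealthEvent is a hypothetical helper, and the sample line is hand-written for illustration, not captured output.

```javascript
// Parse one line of `docker events --format '{{json .}}'` output.
// Returns a summary for health transitions, or null for anything else.
function parseHealthEvent(line) {
  let event;
  try {
    event = JSON.parse(line);
  } catch {
    return null; // not JSON, ignore
  }
  if (!event || typeof event !== 'object') return null;
  const action = event.Action || '';
  const prefix = 'health_status: ';
  if (event.Type !== 'container' || !action.startsWith(prefix)) return null;
  const actor = event.Actor || {};
  const attributes = actor.Attributes || {};
  return {
    container: attributes.name || actor.ID || 'unknown',
    health: action.slice(prefix.length),
    // `time` is assumed to be a Unix timestamp in seconds.
    time: typeof event.time === 'number' ? new Date(event.time * 1000).toISOString() : null,
  };
}

// Hand-written sample event, for illustration only.
const sampleEvent = JSON.stringify({
  Type: 'container',
  Action: 'health_status: unhealthy',
  Actor: { ID: 'c3d4e5f6a1b2', Attributes: { name: 'my-cache' } },
  time: 0,
});
const parsed = parseHealthEvent(sampleEvent);
if (parsed && parsed.health === 'unhealthy') {
  console.log(`ALERT: ${parsed.container} became unhealthy`);
}
```

A long-running script could read the event stream line by line, pass each line through this function, and forward unhealthy transitions to the alerting system of choice.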