15.3.1.1 Process Health Signal

A focused guide to Process Health Signal, connecting core concepts with practical Docker and container operations.

A process health signal is a liveness indicator derived directly from the operating system's view of a container's main process: whether it is running, what exit code it produced, and whether it is consuming CPU or appears stuck. It is distinct from an application-level health check endpoint, which requires the application itself to implement and expose logic about its own readiness.

Why process-level signals matter even with application health checks

An HTTP-based health check depends on the application's own code running correctly enough to respond at all. A process that has crashed, deadlocked below the HTTP server's request handling, or never started successfully cannot answer any health endpoint, and in exactly those cases the lower-level, OS-derived process signal is the only indicator available:

docker inspect --format='{{.State.Status}}' my-api

docker inspect --format='{{.State.ExitCode}}' my-api

These two fields, sourced directly from the container runtime's own tracking of the process rather than anything the application itself reports, remain available and accurate even in scenarios where an application-level health check would simply time out or fail to respond at all.
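As a minimal sketch of reading these fields programmatically, the following Python parses the JSON that docker inspect prints and pulls out the runtime-reported state. The function names and the returned dictionary layout are illustrative assumptions, not part of Docker; the sketch also reads State.OOMKilled, a field in the same State object, because it helps interpret a kill.

```python
import json
import subprocess


def parse_process_state(inspect_json: str) -> dict:
    """Extract runtime-reported process fields from `docker inspect` JSON output."""
    # `docker inspect <name>` prints a JSON array with one object per container.
    container = json.loads(inspect_json)[0]
    state = container["State"]
    return {
        "status": state["Status"],       # e.g. "running", "exited", "restarting"
        "exit_code": state["ExitCode"],  # only meaningful once the process has stopped
        "oom_killed": state.get("OOMKilled", False),
    }


def read_process_state(name: str) -> dict:
    """Run `docker inspect` for one container (assumes the Docker CLI is on PATH)."""
    result = subprocess.run(
        ["docker", "inspect", name],
        capture_output=True, text=True, check=True,
    )
    return parse_process_state(result.stdout)
```

Calling read_process_state("my-api") would return the same information as the two commands above in a single call; parse_process_state is kept separate so it can be exercised on captured output without a running daemon.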

Exit codes as a health signal

The specific exit code a container's main process produces when it stops is itself a meaningful health signal, distinguishing between a clean, intentional shutdown and various categories of failure:

docker inspect --format='{{.State.ExitCode}}' my-api

0   - clean exit
1   - generic application error
137 - killed by SIGKILL (128 + 9), often an OOM kill or an exceeded docker stop timeout
139 - segmentation fault, SIGSEGV (128 + 11)
143 - terminated by SIGTERM (128 + 15)

A pattern of repeated non-zero exit codes across restarts is a different and more specific signal than a single isolated restart, pointing toward a systematic problem (a bad deployment, a persistent resource constraint) rather than a one-off, possibly transient failure.
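The interpretation above can be sketched in Python as follows. Exit codes above 128 are read as 128 plus a signal number, which is a convention rather than a guarantee, since an application can also choose such a code itself. The function names and the threshold of three identical failures are illustrative assumptions.

```python
# Signal numbers as used on Linux; only the signals from the table above are named.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Map a container exit code to a short description."""
    if code == 0:
        return "clean exit"
    if code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return f"application error (exit code {code})"


def is_repeating_failure(exit_codes, threshold=3):
    """True if the last `threshold` exits were all the same non-zero code."""
    recent = exit_codes[-threshold:]
    return (
        len(recent) == threshold
        and recent[0] != 0
        and all(code == recent[0] for code in recent)
    )
```

Feeding is_repeating_failure the exit codes recorded across successive restarts separates a one-off failure from a run of identical ones, which is the distinction the paragraph above draws.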

Restart count as a health signal

Docker tracks how many times a container has been automatically restarted under its configured restart policy, and a rapidly increasing restart count is one of the clearest possible process-level health signals, indicating a crash loop that needs investigation regardless of what any application-level health check might separately report:

docker inspect --format='{{.RestartCount}}' my-api

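docker events --filter container=my-api --filter event=die --filter event=restart

The die filter is included because a restart performed by the restart policy is recorded as a die event followed by a start event, while the restart event is emitted for an explicit docker restart; repeating the event filter matches either type.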

A container restarting every few seconds is fundamentally unhealthy at the process level even if, in the brief windows in which it manages to start, an application health check would report it as healthy. The restart count catches this pattern, which point-in-time health check sampling could miss entirely if its samples happened to land only in those brief healthy windows.
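One way to turn the restart count into an alertable signal is to sample it periodically and look at its rate of change rather than its absolute value. The sketch below assumes a hypothetical list of (timestamp in seconds, RestartCount) samples collected by some external poller; the function names and the one-restart-per-minute threshold are illustrative, not Docker defaults.

```python
def restarts_per_minute(samples):
    """Average restart rate over the sampled window.

    `samples` is a list of (timestamp_seconds, restart_count) tuples, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    # A lower count than before means the counter started over (for example the
    # container was recreated); treat that as no restarts in the window.
    return max(c1 - c0, 0) * 60.0 / elapsed


def is_crash_looping(samples, max_per_minute=1.0):
    """True if the restart rate exceeds the (assumed) acceptable threshold."""
    return restarts_per_minute(samples) > max_per_minute
```

Because the rate is computed over a window, a container that is briefly healthy between crashes still registers as crash looping.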

CPU activity as an indirect liveness signal

A process consuming zero CPU for an extended period, despite the container reporting as "running," can indicate the process is blocked or stuck rather than genuinely idle, particularly for a service expected to have continuous background activity:

docker stats --no-stream --format "{{.CPUPerc}}" my-api

This is an imperfect signal on its own, since genuine idleness under low traffic looks identical to a stuck process from this view alone, but combined with an expectation of what normal idle CPU usage looks like for a specific service, a sudden, sustained drop to exactly zero can be a useful corroborating signal alongside other checks.
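A small sketch of how this corroborating check might look, assuming the CPUPerc column is printed in the form "0.00%" and that readings are collected at regular intervals by some external poller. The function names and the five-sample window are illustrative assumptions.

```python
def parse_cpu_percent(text: str) -> float:
    """Convert a `docker stats` CPUPerc value such as '0.07%' to a float."""
    return float(text.strip().rstrip("%"))


def sustained_zero(samples, min_samples=5):
    """True if the most recent `min_samples` CPU readings are all exactly zero.

    On its own this cannot tell a stuck process from an idle one; it is meant
    to be combined with knowledge of the service's normal idle CPU usage.
    """
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(s == 0.0 for s in recent)
```

Requiring several consecutive zero readings, rather than one, reduces the chance of flagging a service that simply happened to be idle at a single sampling instant.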

Watchdog patterns built on process signals

Some applications implement an internal watchdog that monitors their own critical threads or loops and deliberately exits the entire process if a fatal internal condition is detected, converting an internal hang into a clean, externally visible process-level failure that the container runtime's restart policy can then act on directly:

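// Timestamp of the last completed iteration; the worker loop must call heartbeat().
let lastHeartbeat = Date.now();
const heartbeat = () => { lastHeartbeat = Date.now(); };

// Runs on the same event loop, so it detects a stalled async loop, not a blocked event loop.
const watchdogTimer = setInterval(() => {
  if (Date.now() - lastHeartbeat > 30000) {
    console.error('Watchdog: no heartbeat for 30s, exiting');
    process.exit(1);
  }
}, 5000);
watchdogTimer.unref(); // do not keep the process alive just for the watchdog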

This pattern is valuable specifically because it converts an otherwise invisible internal hang (which an external health check might not catch quickly, depending on what exactly it tests) into an immediate, unambiguous process exit that the restart policy and the resulting exit code and restart count signals can respond to right away.
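A comparable sketch in Python, where the watchdog runs on its own thread so it can still fire when the monitored loop is stuck. The class name, the 30-second timeout, and the injectable clock and timeout callback are assumptions made for illustration and testability; the default action uses os._exit because a normal sys.exit raised in a background thread would end only that thread.

```python
import os
import threading
import time


class Watchdog:
    """Exits the process if beat() has not been called within `timeout_s`."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic, on_timeout=None):
        self.timeout_s = timeout_s
        self._clock = clock
        # Default action: terminate the whole process immediately with exit code 1.
        self._on_timeout = on_timeout or (lambda: os._exit(1))
        self._last_beat = clock()
        self._lock = threading.Lock()

    def beat(self):
        """Called by the monitored loop after each successful iteration."""
        with self._lock:
            self._last_beat = self._clock()

    def check(self):
        """Run the timeout action if the heartbeat is stale; return staleness."""
        with self._lock:
            stale = self._clock() - self._last_beat > self.timeout_s
        if stale:
            self._on_timeout()
        return stale

    def start(self, interval_s=5.0):
        """Check the heartbeat every `interval_s` seconds on a daemon thread."""
        def run():
            while True:
                time.sleep(interval_s)
                self.check()
        threading.Thread(target=run, daemon=True).start()
```

The monitored loop would call beat() once per iteration and start() once at startup; the resulting exit code 1 then shows up in State.ExitCode and triggers the restart policy like any other failure.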

Signal handling and its effect on process health signals

How a process responds to signals directly affects what process-level health signals look like during a shutdown or restart: a process that correctly handles SIGTERM and exits cleanly produces a different, more informative signal than one that has to be forcibly killed with SIGKILL after the stop timeout expires:

docker inspect --format='{{.State.ExitCode}}' my-api

0   - SIGTERM handled, clean exit
143 - terminated by SIGTERM before the stop timeout, no forced kill needed
137 - SIGKILL required, graceful shutdown did not complete in time

A pattern of consistently seeing exit code 137 during routine deployments, rather than 0 or 143, is itself a process-level health signal: it indicates that graceful shutdown handling is not working correctly and is worth investigating independently of whatever the deployment's overall success or failure status reports.
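For reference, a minimal sketch of SIGTERM handling that produces exit code 0 instead of 137, written in Python. The names work_step and cleanup are hypothetical placeholders for the application's own logic. An explicit handler matters in containers because a process running as PID 1 gets no default action for SIGTERM, so without one the signal is ignored until the stop timeout expires.

```python
import signal
import threading

# Set by the signal handler; polled by the main loop between units of work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only records the request, the main loop does the rest."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(work_step, cleanup):
    """Process work until SIGTERM arrives, then clean up and return exit code 0."""
    while not shutdown_requested.is_set():
        work_step()
    cleanup()
    return 0
```

A program would call install_handler() at startup and finish with sys.exit(run(work_step, cleanup)). Each work step has to be short relative to the stop timeout, otherwise the loop never reaches the check in time and the container is still killed with SIGKILL.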

Combining process signals with application-level signals

The most reliable overall health picture comes from combining the OS-derived process signal with an application-level health check, since each catches failure modes the other cannot: a process signal catches crashes and restart loops invisible to an application check that never gets a chance to run, while an application check catches a process that is technically running but functionally broken in a way the OS has no visibility into:

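docker inspect --format='{{.State.Status}} {{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}} {{.RestartCount}}' my-api

The guard around .State.Health is needed because that field is only present when the container defines a HEALTHCHECK; without the guard the template can fail for containers that have none.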
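How the three values are combined is a policy decision; the sketch below shows one possible ordering in Python, checking process-level failures first because they make the application-level result meaningless. The function name, the verdict strings, and the restart threshold are illustrative assumptions, and health is passed as None when the container has no health check configured.

```python
def overall_health(status, health, restart_count, max_restarts=3):
    """Combine process-level and application-level signals into one verdict."""
    # Process-level signals come first: if the process is not running or keeps
    # restarting, any application-level health result is stale or misleading.
    if status != "running":
        return "down"
    if restart_count > max_restarts:
        return "crash-looping"
    # Application-level signal, only meaningful for a stable running process.
    if health == "unhealthy":
        return "unhealthy"
    if health == "starting":
        return "starting"
    if health == "healthy":
        return "ok"
    # No health check configured: only the process-level view is available.
    return "ok-process-only"
```

Comparing the raw restart count against a fixed threshold is crude, since the counter is cumulative; a rate over a time window is usually a better input where one is available.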

Common mistakes

Relying exclusively on an application-level health check, missing failures (crash loops, segfaults, OOM kills) that occur before or instead of the application ever getting a chance to respond to a health endpoint.
Not monitoring restart count as its own signal, missing a crash loop that happens to recover briefly between restarts in a way that intermittent health check sampling could fail to catch.
Ignoring exit code patterns across restarts, missing the distinction between a one-off transient failure and a systematic, repeating problem.
Omitting an internal watchdog from an application with critical internal loops or threads, leaving an internal hang invisible to the container runtime until an external health check eventually times out, if one exists at all.
Treating zero CPU usage as definitive proof of a stuck process without considering genuine, expected idleness as an equally likely explanation.

Process health signals (exit codes, restart counts, and raw running status, all sourced directly from the container runtime) provide a layer of liveness visibility that remains reliable even when an application is too broken to respond to its own health check. Combining these OS-level signals with application-level health checks closes gaps that either signal alone would leave open.