15.2.2.5 Stats Quick Diagnostics

A focused guide to Stats Quick Diagnostics, connecting core concepts with practical Docker and container operations.

Stats quick diagnostics means using docker stats and a handful of closely related commands as a fast, first-pass triage tool during an active incident, before reaching for a full metrics dashboard or log aggregation system. These built-in commands require no setup and can often confirm or rule out the most common, simple causes of a problem right away.

Starting with an unfiltered snapshot

The first useful step during an incident affecting an unknown container is a single, unfiltered snapshot across every running container, which often immediately reveals an obvious outlier:

docker stats --no-stream

Scanning this output by eye for a container with conspicuously higher CPU, memory, or PID count than its neighbors is frequently enough to identify which specific container deserves focused attention, without needing to know in advance which service was actually responsible for the symptom being investigated.
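When many containers are running, sorting the snapshot can make the outlier easier to spot than scanning by eye. The following is a minimal sketch rather than anything built into docker: rank_by_cpu is a hypothetical helper name, and it assumes the snapshot was produced with a --format of '{{.Name}} {{.CPUPerc}}', so each input line is a container name followed by a percentage.

```shell
# rank_by_cpu [N]: read "NAME CPU%" lines on stdin and print the N highest
# CPU consumers first (default 5). Hypothetical helper, not a docker command.
rank_by_cpu() {
  # sort -n reads the leading number of field 2 and ignores the trailing "%".
  LC_ALL=C sort -k 2,2nr | head -n "${1:-5}"
}

# Intended use (requires a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | rank_by_cpu 3
```

The same approach works for memory by substituting {{.MemPerc}} in the format string.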

Confirming resource exhaustion as the cause

When an application is behaving sluggishly or erroring, quickly confirming or ruling out resource exhaustion narrows the investigation significantly:

docker stats --no-stream my-api

MEM USAGE / LIMIT: 510MiB / 512MiB

A container sitting at or very near its configured memory limit is a strong, immediate signal that resource exhaustion is at least a contributing factor, which redirects the investigation toward right-sizing the limit or addressing whatever is driving usage that high, rather than continuing to look for an unrelated application-level bug. The CPU column needs slightly more care: on Linux, CPU % is reported relative to a single core, so a container limited with --cpus=2 reaches its limit near 200%, not 100%.
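In the sample reading above, 510MiB out of 512MiB is about 99.6% of the limit, which is the figure docker stats reports in its MEM % column. The check can be sketched as a small filter; flag_near_mem_limit is a hypothetical name, the 90% default threshold is an arbitrary assumption, and the input is assumed to come from a --format of '{{.Name}} {{.MemPerc}}'.

```shell
# flag_near_mem_limit [THRESHOLD]: read "NAME MEM%" lines on stdin and print
# the containers at or above THRESHOLD percent of their limit (default 90).
# Hypothetical helper; the threshold is an assumption, not a Docker default.
flag_near_mem_limit() {
  awk -v t="${1:-90}" '{
    p = $2
    sub(/%$/, "", p)              # strip the trailing percent sign
    if (p + 0 >= t + 0) print $1, $2
  }'
}

# Intended use (requires a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | flag_near_mem_limit 90
```

One caveat: when a container has no memory limit configured, docker stats shows the host's total memory as the limit, so a low percentage there says little about whether the application itself is short of memory.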

Checking for OOM kills directly

A container that restarted unexpectedly is worth checking immediately for an OOM kill before investigating further, since this single flag can save significant time compared to chasing an application-level explanation for a restart that was actually caused by memory exhaustion:

docker inspect my-api --format '{{.State.OOMKilled}}'
docker inspect my-api --format '{{.State.ExitCode}}'

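A true result for OOMKilled immediately redirects the entire investigation toward memory usage and limits rather than application logic. An exit code of 137 (128 plus 9, the signal number of SIGKILL) is consistent with an OOM kill, but any other SIGKILL produces the same code, so the flag is what confirms it.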
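The two inspect results can be combined into a single verdict. This is a sketch only: classify_exit and its output labels are hypothetical, and it assumes both values were read from the same container state.

```shell
# classify_exit OOM_FLAG EXIT_CODE: turn the two inspect fields into one label.
# Hypothetical helper; the labels are illustrative, not Docker terminology.
classify_exit() {
  oom=$1
  code=$2
  if [ "$oom" = "true" ]; then
    echo "oom-kill"
  elif [ "$code" -eq 137 ]; then
    # 137 = 128 + 9 (SIGKILL), but the kernel OOM killer was not recorded.
    echo "sigkill-without-oom-flag"
  elif [ "$code" -eq 0 ]; then
    echo "clean-exit"
  else
    echo "exit-code-$code"
  fi
}

# Intended use (requires a running Docker daemon):
#   classify_exit $(docker inspect my-api --format '{{.State.OOMKilled}} {{.State.ExitCode}}')
```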

Checking process and thread count for a runaway pattern

For an application that seems to be degrading progressively rather than failing all at once, a quick check of process or thread count can reveal a leak before diving into application-level profiling:

docker stats --no-stream --format "{{.Name}}: {{.PIDs}}" my-api
docker top my-api

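A process count far higher than expected for the application's normal operating pattern points toward a process management bug well before more involved profiling tools would be needed to reach the same conclusion. This is especially true when many entries are zombies, which appear as <defunct> in the docker top output. Note that the PIDs column in docker stats counts threads as well as processes, so a heavily threaded runtime will legitimately show a larger number than docker top lists.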
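Counting zombie entries gives a number that can be compared between checks. A minimal sketch, assuming the default docker top output, in which ps marks zombie processes with <defunct> in the command column; count_defunct is a hypothetical helper name.

```shell
# count_defunct: read ps-style output on stdin and print how many lines are
# marked <defunct> (zombie processes). Hypothetical helper.
count_defunct() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps the helper's exit status clean in that case.
  grep -cF '<defunct>' || true
}

# Intended use (requires a running Docker daemon):
#   docker top my-api | count_defunct
```

A count that keeps growing across repeated checks suggests the parent process is not reaping its children.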

Comparing against a healthy baseline

Quick diagnostics are most effective when there is something to compare against; checking a healthy replica or a known-good time period's typical resource usage alongside the currently struggling one quickly highlights what is actually different:

docker stats --no-stream my-api-1 my-api-2 my-api-3

If one replica out of several shows dramatically different resource usage from its otherwise identical siblings, the problem is likely isolated to that specific instance rather than systemic across the service, which meaningfully narrows the scope of further investigation.
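The sibling comparison can be sketched as a filter that flags any replica whose reading is well above the median of the group. Everything here is an assumption for illustration: flag_outliers is a hypothetical name, the factor of 2 is an arbitrary default, and the input is assumed to be "NAME VALUE%" lines such as those produced by --format '{{.Name}} {{.MemPerc}}'.

```shell
# flag_outliers [FACTOR]: read "NAME VALUE%" lines on stdin and print the
# entries whose value exceeds FACTOR times the group median (default 2).
# Hypothetical helper; the factor is an illustrative assumption.
flag_outliers() {
  factor=${1:-2}
  LC_ALL=C sort -k 2,2n | awk -v f="$factor" '
    {
      name[NR] = $1
      raw[NR] = $2
      v = $2
      sub(/%$/, "", v)            # strip the trailing percent sign
      val[NR] = v + 0
    }
    END {
      if (NR == 0) exit
      med = val[int((NR + 1) / 2)]   # lower median of the sorted values
      for (i = 1; i <= NR; i++)
        if (val[i] > f * med) print name[i], raw[i]
    }'
}

# Intended use (requires a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' \
#     my-api-1 my-api-2 my-api-3 | flag_outliers 2
```

Using the median rather than the mean keeps a single extreme replica from dragging the baseline up and hiding itself.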

Cross-referencing with recent log output

Pairing a resource snapshot with a quick look at the same container's recent logs often confirms whether the resource symptom and an observed application-level symptom (errors, timeouts) are actually correlated in time, rather than coincidental:

docker stats --no-stream my-api
docker logs --since 5m my-api 2>&1 | tail -n 50

Seeing elevated CPU or memory alongside a burst of error logs in the same recent window is a reasonable basis for treating them as related; seeing one without the other is a useful signal that they may be unrelated symptoms requiring separate investigation.
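To put a rough number on "a burst of error logs", the recent log window can be summarized as matching lines over total lines. This is a sketch under stated assumptions: summarize_errors is a hypothetical helper, and the default pattern of error or timeout is a guess that would need adjusting to the application's actual log format.

```shell
# summarize_errors [PATTERN]: read log lines on stdin and print
# "MATCHING/TOTAL", matching case-insensitively against PATTERN
# (an extended regex, default "error|timeout"). Hypothetical helper.
summarize_errors() {
  awk -v pat="${1:-error|timeout}" '
    tolower($0) ~ pat { hits++ }
    END { printf "%d/%d\n", hits, NR }'
}

# Intended use (requires a running Docker daemon). docker logs replays the
# container's stderr on stderr, so 2>&1 is needed for the pipe to see it:
#   docker logs --since 5m my-api 2>&1 | summarize_errors
```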

Knowing when quick diagnostics are not enough

Quick diagnostics through docker stats and related commands are valuable for ruling things in or out fast, but they are not a substitute for a proper investigation once the immediate, simple causes have been excluded. A problem that persists despite normal-looking resource usage, healthy process counts, and no recent OOM kill needs deeper tooling (distributed tracing, application-level metrics, detailed log correlation) that quick diagnostics are not designed to provide.

docker stats --no-stream my-api
docker logs --since 10m my-api
docker inspect my-api --format '{{.State.OOMKilled}}'

If all three of these checks come back unremarkable, the investigation needs to move beyond what docker stats and its immediate neighbors can offer, rather than repeating the same quick checks expecting a different result.

Building a quick diagnostics checklist

Because these checks are fast and require no setup, it is worth codifying them as a short, repeatable checklist that runs automatically as the very first step whenever an incident begins. This ensures the simple, common causes are always ruled out quickly and consistently, rather than depending on whether the responding operator remembers to check them:

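#!/bin/sh
# First-pass triage: one resource snapshot, then OOM status per container.
echo "=== Resource snapshot ==="
docker stats --no-stream
echo "=== OOM check ==="
# -a includes stopped containers, so a container that was OOM-killed and
# never restarted still shows up in the check.
for c in $(docker ps -a --format '{{.Names}}'); do
  echo "$c: $(docker inspect "$c" --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}')"
done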

Common mistakes

Skipping the quick, no-setup diagnostic checks and going straight to a more involved investigation, missing a simple cause that would have been immediately visible.
Not checking the OOMKilled flag directly after an unexpected restart, spending time on an application-level explanation for what was actually a memory-related kill.
Treating a single container's resource reading in isolation without comparing it against a healthy sibling replica running the same workload.
Continuing to rely on quick diagnostics alone once they have failed to surface an obvious cause, rather than escalating to deeper tooling.
Not having a consistent, repeatable checklist for these quick checks, leading to inconsistent triage depending on who happens to be responding to a given incident.

Stats quick diagnostics earn their place as the very first step of an incident response because they require no setup and can immediately confirm or rule out the most common causes (resource exhaustion, OOM kills, runaway process counts). They are deliberately shallow tools meant to triage quickly, not a replacement for deeper investigation once the obvious, simple explanations have been excluded.