16 Docker Troubleshooting
A focused guide to Docker Troubleshooting, connecting core concepts with practical Docker and container operations.
Docker troubleshooting is the systematic process of diagnosing why a container, image, network, or the daemon itself is not behaving as expected. Doing it effectively means narrowing down which layer of the system is actually responsible (the application inside the container, the container's own configuration, the Docker daemon, or the host) rather than acting on whichever explanation comes to mind first.
Starting with the right layer
A Docker-related problem can originate in several distinct layers, and the diagnostic approach differs considerably between them, so the most efficient troubleshooting starts by identifying which layer is implicated:
docker ps -a
docker inspect my-api --format '{{.State.Status}} {{.State.ExitCode}}'
A container that never started successfully points toward image, configuration, or daemon-level issues, while a container that started but is misbehaving points toward application-level or runtime-resource issues. Making this distinction first is usually the fastest way to avoid spending time on the wrong layer.
Containers that fail to start
For a container that exits immediately or fails to start at all, the exit code and any daemon-recorded error are the first and most direct sources of information:
docker logs my-api
docker inspect my-api --format '{{.State.Error}}'
exit code 127: command not found
exit code 1: application-level startup error
exit code 137: killed (often OOM or stop timeout exceeded)
A recognizable exit code, combined with whatever the container managed to log before exiting, frequently narrows the cause considerably before any deeper investigation is needed. An exit code of 127, for instance, points to an incorrect or missing command, which is an entirely different category of problem from the application-level crash reflected by exit code 1.
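The exit-code reasoning above can be sketched as a small lookup helper. This is an illustrative sketch, not part of Docker: the function name explain_exit_code is hypothetical, the entries for 126 and 143 are additions beyond the list above, and the "128 + N" rule is the usual shell convention for a process ended by signal N.

```shell
# Map a container exit code to a likely category of cause.
# Hypothetical helper: the name and the wording of each category are assumptions.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "application-level error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (often OOM or stop timeout exceeded)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      # Codes above 128 conventionally mean "ended by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code $code"
      fi
      ;;
  esac
}

# Example use against a real container (requires Docker):
#   explain_exit_code "$(docker inspect my-api --format '{{.State.ExitCode}}')"
```

The mapping only suggests where to look first; the container's logs remain the authority on what actually happened.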
Containers that run but misbehave
For a container that is running but not behaving correctly, the investigation typically needs to combine several signals (resource usage, logs, and health status) rather than rely on any single one in isolation:
docker stats --no-stream my-api
docker logs --since 10m my-api | tail -50
docker inspect --format='{{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}' my-api
Combining a resource snapshot, recent logs, and health status gives a reasonably complete picture quickly. The specific combination of symptoms matters: high memory usage with no corresponding error logs points toward a different category of underlying cause than normal resource usage with frequent error logs. Note that health status is only reported for containers whose image or run configuration defines a health check.
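One way to make the symptom combinations above explicit is a small classifier that takes a memory percentage and an error-line count. This is only a sketch under stated assumptions: the function name classify_symptoms, the default 80% threshold, and the idea of counting log lines that contain "error" are all illustrative choices, not Docker features.

```shell
# Classify a (memory usage, error count) pair into a rough symptom category.
# $1: memory usage as a percentage, with or without a trailing "%" (e.g. "91.5%")
# $2: number of error lines seen in recent logs
# $3: optional threshold for "high" memory; the default of 80 is an arbitrary assumption
classify_symptoms() {
  mem=$(printf '%s' "$1" | tr -d '%')
  errors=$2
  limit=${3:-80}
  # awk handles the fractional comparison that plain sh arithmetic cannot.
  mem_high=$(awk -v m="$mem" -v l="$limit" 'BEGIN { print ((m + 0 >= l + 0) ? 1 : 0) }')
  if [ "$mem_high" -eq 1 ] && [ "$errors" -eq 0 ]; then
    echo "memory pressure, no logged errors"
  elif [ "$mem_high" -eq 0 ] && [ "$errors" -gt 0 ]; then
    echo "normal memory, logged errors"
  elif [ "$mem_high" -eq 1 ]; then
    echo "memory pressure and logged errors"
  else
    echo "no obvious signal"
  fi
}

# Example use against a real container (requires Docker):
#   mem=$(docker stats --no-stream --format '{{.MemPerc}}' my-api)
#   errors=$(docker logs --since 10m my-api 2>&1 | grep -ci error)
#   classify_symptoms "$mem" "$errors"
```

The output is a label for the symptom pattern, not a diagnosis; it is meant to decide which signal to dig into next.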
Networking problems
Container networking issues (services that cannot reach each other, or that cannot be reached from outside) benefit from working through the network topology systematically rather than guessing at a specific misconfiguration:
docker network inspect my-network
docker exec my-api ping -c 1 my-db
docker exec my-api nslookup my-db
Confirming, in order, that two containers are on the same network, that DNS resolution between them works, and that the expected ports are listening isolates whether a connectivity problem sits at the network attachment level, the DNS resolution level, or the application listening level. Tools such as ping and nslookup are only available if the image ships them, which minimal images often do not.
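The ordered checks described above can be sketched as a helper that runs labelled commands in sequence and stops at the first failure, so the failing level is reported by name. The function name first_failing_check is hypothetical, and the example commands in the comments assume that nc is installed in the image and that my-db listens on port 5432; neither is stated in this guide.

```shell
# Run labelled checks in order and stop at the first one that fails.
# Usage: first_failing_check "label" "command" ["label" "command" ...]
first_failing_check() {
  while [ "$#" -ge 2 ]; do
    label=$1
    check=$2
    shift 2
    # Each check is a shell command string; only its exit status matters.
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "FAILED: $label"
      return 1
    fi
    echo "ok: $label"
  done
  echo "all checks passed"
}

# Example use (requires Docker; port 5432 and nc are assumptions):
#   first_failing_check \
#     "network attachment" "docker network inspect my-network --format '{{range .Containers}}{{.Name}} {{end}}' | grep -q my-db" \
#     "DNS resolution"     "docker exec my-api nslookup my-db" \
#     "listening port"     "docker exec my-api nc -z my-db 5432"
```

Stopping at the first failure is deliberate: a failed attachment check makes the DNS and port results meaningless, so running them would only add noise.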
Image and build problems
Issues that manifest during docker build rather than at runtime are generally easier to isolate, since the build process reports the specific instruction and step where it failed directly:
docker build --progress=plain -t my-api .
docker build --no-cache -t my-api .
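Building with --no-cache rules out a stale, incorrectly cached layer as the cause of an unexpected build result, which is a common and easily overlooked source of seemingly inconsistent build behavior. The --progress=plain option prints the full output of every step instead of the condensed view, which makes the failing instruction easier to read.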
Daemon-level problems
When the issue appears to affect every container on a host, or basic Docker commands themselves are failing or behaving unexpectedly, the investigation should shift to the daemon itself rather than any individual container:
systemctl status docker
journalctl -u docker.service --since "10 minutes ago"
docker info
A daemon that is unhealthy, low on disk space, or running with an unexpected configuration affects every container on the host at once, which is a useful distinguishing signal: a problem affecting only one container is unlikely to be a daemon-level issue, while a problem affecting many or all containers usually is. The systemctl and journalctl commands assume a systemd-based Linux host.
Resource exhaustion at the host level
A host that is running low on disk space or memory, or hitting other system-level limits, can produce confusing symptoms across multiple containers at once. Because it can masquerade as many unrelated-looking problems, it is worth ruling out early, before investigating individual containers in depth:
df -h
free -h
docker system df
docker system df breaks down how much disk space Docker itself is consuming across images, containers, local volumes, and build cache, which helps identify whether accumulated, unused Docker resources are the actual cause of a host running low on space.
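To turn the disk check into a yes-or-no answer, the output of df can be filtered for filesystems above a usage threshold. The sketch below is an assumption-laden example: the function name full_filesystems and the 90% default are illustrative, it reads the POSIX-format output of df -P rather than df -h, and it does not handle mount points that contain spaces.

```shell
# Print "mountpoint usage%" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin. $1 is the threshold in percent
# (default 90, an arbitrary assumption).
full_filesystems() {
  threshold=${1:-90}
  awk -v limit="$threshold" 'NR > 1 {
    use = $5            # Capacity column, e.g. "95%"
    sub(/%/, "", use)   # strip the percent sign for numeric comparison
    if (use + 0 >= limit + 0) print $6 " " $5
  }'
}

# Example use on a real host:
#   df -P | full_filesystems 85
```

An empty result means no filesystem crossed the threshold, at which point docker system df is the next place to look for Docker-specific growth.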
Building a repeatable troubleshooting habit
A consistent sequence of initial checks makes troubleshooting faster and more reliable over time than an ad hoc investigation that varies with whoever happens to be responding:
docker ps -a
docker stats --no-stream
docker logs --since 10m <container>
docker inspect <container> --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}} {{.State.ExitCode}}'
Running through a short checklist like this at the very start of any investigation, before diving into a specific hypothesis, ensures the most commonly responsible causes are ruled in or out quickly, regardless of who is doing the troubleshooting.
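A checklist is easier to follow consistently when it lives in a function rather than in memory. The following is a minimal sketch, assuming a hypothetical function name triage and a hypothetical DOCKER environment variable that lets the commands be previewed without running them; neither is a Docker feature.

```shell
# Run the initial checklist for one container.
# Setting DOCKER=echo prints the commands' arguments instead of running them
# (the DOCKER override is an assumption of this sketch, not a Docker convention).
triage() {
  container=$1
  docker_cmd=${DOCKER:-docker}
  if [ -z "$container" ]; then
    echo "usage: triage <container>" >&2
    return 2
  fi
  "$docker_cmd" ps -a --filter "name=$container"
  "$docker_cmd" stats --no-stream "$container"
  "$docker_cmd" logs --since 10m "$container"
  "$docker_cmd" inspect "$container" --format '{{.State.Status}} {{.State.ExitCode}}'
}

# Example use:
#   triage my-api               # run the checks (requires Docker)
#   DOCKER=echo triage my-api   # preview what would be run
```

The final inspect call reports state and exit code only, so it works for containers with or without a health check defined.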
Common mistakes
- Jumping immediately to investigating application code for a problem that is actually a daemon-level or host-level resource issue affecting every container simultaneously.
- Not checking exit codes and daemon-recorded errors first for a container that failed to start, instead immediately attempting a deeper, more time-consuming investigation.
- Treating a network connectivity problem as a single, undifferentiated issue rather than working through attachment, DNS resolution, and listening port checks separately.
- Chasing unexpected, seemingly inconsistent build behavior without first rebuilding with --no-cache to rule out a stale build cache.
- Investigating individual containers in depth before first ruling out host-level resource exhaustion as a potential explanation for symptoms appearing across multiple containers simultaneously.
Effective Docker troubleshooting comes from quickly identifying which layer is actually responsible for a given symptom (application, container configuration, networking, image and build, daemon, or host) and working through a consistent set of initial checks at that layer before committing to a deeper, more time-consuming investigation in any single direction.