16.2 Runtime Troubleshooting
A focused guide to Runtime Troubleshooting, connecting core concepts with practical Docker and container operations.
Runtime troubleshooting covers problems that appear after a container has started successfully, as distinct from build-time problems. The diagnostic approach shifts accordingly: rather than reading build output for a specific failed instruction, it combines logs, resource metrics, process inspection, and health status to understand a system that is still running and may be changing while it is investigated.
Starting with a current-state snapshot
The first useful step for any runtime issue is capturing a snapshot of the container's current state across several dimensions simultaneously, since the combination of signals usually narrows the investigation faster than any single one alone:
docker ps -a
docker stats --no-stream <container>
docker inspect <container> --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}} {{.State.ExitCode}}'
docker logs --since 10m --tail 50 <container>
Running through this sequence quickly, before forming any specific hypothesis, often reveals an obvious outlier (elevated resource usage, a recent restart, a burst of error logs) that points the rest of the investigation at a narrower area instead of leaving it open-ended.
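The snapshot commands above can be bundled into a small helper so that all four signals are captured in one step. The following is only a sketch: the function name snapshot is hypothetical, and the health template assumes that a container without a health check should print a placeholder instead of failing.

```shell
# snapshot NAME: print state, resource usage, health, and recent logs
# for one container. The function name is illustrative only.
snapshot() {
  c="$1"
  echo "== ps =="
  docker ps -a --filter "name=$c"
  echo "== stats =="
  docker stats --no-stream "$c"
  echo "== health / exit code =="
  # Containers without a health check have no Health object, so guard the lookup.
  docker inspect "$c" --format \
    '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}} {{.State.ExitCode}}'
  echo "== logs (last 10m) =="
  # docker logs relays the container's stderr on stderr; merge it so nothing
  # is lost when the snapshot is piped or redirected to a file.
  docker logs --since 10m --tail 50 "$c" 2>&1
}
```

Calling snapshot my-api > snapshot.txt keeps a copy of the state as it was at the start of the investigation, which is useful because the container may change while it is being examined.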
Distinguishing a crashed container from a hung one
A container that has stopped entirely behaves very differently from one that is still running but unresponsive, and confirming early which situation is actually occurring avoids wasted effort applying the wrong diagnostic approach:
docker ps -a --filter "name=my-api"
STATUS: Exited (1) 5 minutes ago
STATUS: Up 2 hours (unhealthy)
A crashed container's evidence is mostly retrospective (logs, the exit code, and the health check log leading up to the crash), since the process itself no longer exists to inspect directly. A hung but still-running container offers live diagnostic options that a crashed one does not, such as running commands inside it with docker exec to inspect its current state.
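The crashed-versus-hung decision can be made mechanically from the STATUS text. The sketch below is a minimal illustration: classify_status is a hypothetical helper, and its patterns assume the status wording shown above, which may vary between Docker versions.

```shell
# classify_status STATUS_TEXT: map a docker ps STATUS string to a category.
# The helper name and category labels are illustrative only.
classify_status() {
  case "$1" in
    Exited*|Dead*)      echo crashed ;;
    Restarting*)        echo restarting ;;
    Up*"(unhealthy)"*)  echo unhealthy ;;   # must come before the generic Up* case
    Up*)                echo running ;;
    *)                  echo unknown ;;
  esac
}

# Example (not executed here):
#   classify_status "$(docker ps -a --filter name=my-api --format '{{.Status}}')"
```

A result of crashed points toward logs and exit codes, while unhealthy or running points toward the live inspection techniques.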
Investigating a hung or unresponsive container
For a container that is still running but not behaving correctly, inspecting its live process state directly often reveals more than logs alone, particularly for a deadlock or stuck state that never produces any log output describing it:
docker exec my-api ps aux
docker top my-api
docker exec my-api cat /proc/1/status
A process showing a state of D (uninterruptible sleep, often waiting on I/O) or one that has not advanced in CPU time despite appearing to be running can indicate it is genuinely stuck on a blocking operation rather than performing normal work, which is a useful distinction when application logs alone give no indication of what the process is currently doing.
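Both signals described above, the state letter and the accumulated CPU time, can be read from /proc. The helpers below are a sketch under stated assumptions: the names proc_state and cpu_ticks are hypothetical, and the field positions follow the standard Linux /proc/<pid>/stat layout (utime and stime are fields 14 and 15).

```shell
# proc_state: read /proc/<pid>/status text on stdin, print the one-letter state.
proc_state() {
  awk '/^State:/ { print $2; exit }'
}

# cpu_ticks: read one /proc/<pid>/stat line on stdin, print utime + stime.
# Everything up to the last ") " is dropped first so that a process name
# containing spaces cannot shift the field positions. After that cut,
# original fields 14 and 15 become fields 12 and 13.
cpu_ticks() {
  sed 's/^.*) //' | awk '{ print $12 + $13 }'
}

# Example against a running container (not executed here):
#   docker exec my-api cat /proc/1/status | proc_state
#   docker exec my-api cat /proc/1/stat | cpu_ticks; sleep 5
#   docker exec my-api cat /proc/1/stat | cpu_ticks
```

If the two cpu_ticks readings are identical and the state is D, the process is likely blocked rather than working; identical readings with state S may simply mean it is idle and waiting for requests.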
Container restarting repeatedly
A container caught in a restart loop benefits from checking the restart count and exit code pattern across the most recent restarts, since a consistent exit code across many restarts points toward a persistent, repeatable cause, while varying exit codes might suggest a less deterministic, possibly resource-related issue:
docker inspect my-api --format '{{.RestartCount}}'
docker events --filter event=die --filter container=my-api --since 30m --until "$(date +%s)"
docker logs my-api --tail 100
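Because docker logs typically still includes output from earlier runs of the same container, the lines just before the most recent crash are usually available. Examining them, rather than only the very latest output (which may be nothing more than the current attempt's startup messages), often captures the actual error that triggered the exit, since a process frequently emits its most informative output immediately before it exits.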
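The consistent-versus-varying distinction can be checked directly once the recent exit codes are collected. The following sketch assumes one exit code per line on standard input; exit_code_pattern is a hypothetical helper, and the commented input source assumes that container die events carry an exitCode attribute.

```shell
# exit_code_pattern: read one exit code per line on stdin and report whether
# they are all the same ("consistent") or differ ("varying").
exit_code_pattern() {
  # Count distinct non-empty lines; "|| true" keeps an empty input from
  # being treated as a failure, since grep -c exits non-zero on zero matches.
  distinct=$(sort -u | grep -c . || true)
  case "$distinct" in
    0) echo none ;;
    1) echo consistent ;;
    *) echo varying ;;
  esac
}

# Possible input source (not executed here; the attribute name is an assumption):
#   docker events --filter event=die --filter container=my-api \
#     --since 30m --until "$(date +%s)" \
#     --format '{{.Actor.Attributes.exitCode}}' | exit_code_pattern
```

A consistent result suggests a repeatable cause worth reproducing, while varying suggests looking at resources or timing first.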
Resource-related runtime issues
When resource exhaustion is suspected, checking both current usage and whether the container has been forcibly killed for exceeding a limit confirms or rules this out directly rather than relying on inference from indirect symptoms:
docker stats --no-stream my-api
docker inspect my-api --format '{{.State.OOMKilled}}'
A container that is not near its resource limits and shows no OOM kill flag is unlikely to be experiencing resource exhaustion, which redirects the investigation toward application logic, external dependencies, or networking instead. Keep in mind that a single --no-stream sample is a point-in-time reading and can miss a short-lived spike.
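Whether a container counts as "near its limit" can be turned into an explicit check on the memory percentage that docker stats reports. This is a minimal sketch: mem_near_limit is a hypothetical helper, and the default threshold of 90 percent is an arbitrary assumption, not a Docker recommendation.

```shell
# mem_near_limit PERCENT [THRESHOLD]: succeed when PERCENT (e.g. "91.50%")
# is at or above THRESHOLD (default 90, an arbitrary choice).
mem_near_limit() {
  awk -v p="$1" -v t="${2:-90}" 'BEGIN {
    sub(/%$/, "", p)            # strip the trailing percent sign
    exit !(p + 0 >= t + 0)      # exit 0 (success) when at or above threshold
  }'
}

# Example (not executed here):
#   mem_near_limit "$(docker stats --no-stream --format '{{.MemPerc}}' my-api)" \
#     && echo "memory usage is close to the limit"
```

Running the check several times over a few minutes gives a better picture than one reading.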
Networking and connectivity issues at runtime
For a running container unable to reach a dependency it was previously communicating with successfully, working through the network path systematically, from the container's own network attachment outward, isolates where connectivity actually breaks down:
docker exec my-api ping -c 1 my-db
docker exec my-api nc -zv my-db 5432
docker network inspect my-network
A failure at the DNS resolution step points toward a different cause than a failure at the actual port connection step, even though both might initially present as "my-api can't reach my-db," which is why working through the path in order rather than assuming the cause matters for efficient diagnosis.
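Working through the path in order can be expressed as a check that names the first failing step. The sketch below is illustrative only: network_stage is a hypothetical helper, and the commented commands assume that getent and nc are available inside the container image.

```shell
# network_stage DNS_STATUS PORT_STATUS: given the exit statuses of a name
# lookup and a port check, name the first step that failed.
network_stage() {
  if [ "$1" -ne 0 ]; then
    echo dns      # name did not resolve; the port check result is not meaningful
  elif [ "$2" -ne 0 ]; then
    echo port     # name resolved, but the TCP connection failed
  else
    echo ok
  fi
}

# Example (not executed here):
#   docker exec my-api getent hosts my-db >/dev/null; dns=$?
#   docker exec my-api nc -z my-db 5432;              port=$?
#   network_stage "$dns" "$port"
```

A dns result points toward network attachment or naming, while a port result points toward the dependency itself or something filtering traffic in between.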
Comparing against a known-healthy instance
For a service running multiple replicas, comparing the struggling instance directly against a healthy sibling running identical code and configuration is one of the most effective runtime troubleshooting techniques, since it immediately highlights what is actually different about the affected instance rather than requiring a theory about the cause to be formed first:
docker stats --no-stream my-api-1 my-api-2 my-api-3
docker inspect my-api-1 --format '{{range .Config.Env}}{{println .}}{{end}}' | sort > env-1.txt
docker inspect my-api-2 --format '{{range .Config.Env}}{{println .}}{{end}}' | sort > env-2.txt
diff env-1.txt env-2.txt
A configuration difference (an environment variable, a resource limit, a mounted volume) surfaced through this kind of direct comparison often explains an otherwise puzzling situation in which supposedly identical replicas behave differently.
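A comparison like this is easiest to read when both sides are sorted first, so that ordering differences do not show up as noise. The helper below is a sketch: env_diff is a hypothetical name, and it assumes each input file holds one VAR=value entry per line.

```shell
# env_diff FILE_A FILE_B: diff two environment dumps after sorting them.
# The body runs in a subshell so temporary names do not leak into the caller.
env_diff() (
  a=$(mktemp); b=$(mktemp)
  trap 'rm -f "$a" "$b"' EXIT     # clean up the temporary files on exit
  sort "$1" > "$a"
  sort "$2" > "$b"
  diff "$a" "$b"                  # exit status 0 means no differences
)

# Example (not executed here):
#   env_diff env-1.txt env-2.txt
```

Lines prefixed with < exist only in the first file and lines prefixed with > only in the second, which makes a single differing variable stand out immediately.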
Reproducing the issue interactively
When the available signals are insufficient to pin down the cause, reproducing the problem in a controlled, interactive session, ideally against a copy of the affected container rather than the live, production one, allows more invasive investigation than logs and metrics alone provide:
docker exec -it my-api sh
docker run -it --rm --network container:my-api nicolaka/netshoot sh
The second approach, running a fresh debugging container that shares the network namespace of the affected one, is useful when the affected container's own image lacks the diagnostic tools needed: it allows a different, more fully equipped image to be used for investigation (nicolaka/netshoot is shown as one commonly used example; any image with the required tools works) while still observing the network conditions the original container experiences.
Common mistakes
- Forming a specific hypothesis before capturing a broad, current-state snapshot across logs, metrics, and health status, missing an obvious signal that a quick initial survey would have surfaced immediately.
- Treating a hung, still-running container the same as a crashed one, missing the additional live diagnostic options available specifically because the process has not actually exited.
- Investigating only the most recent log output after a crash, missing the more informative output that often appears just before the actual exit.
- Not comparing a struggling instance against a healthy sibling replica when one is available, missing a configuration difference that direct comparison would reveal quickly.
- Jumping to application-level explanations before ruling out resource exhaustion or basic network connectivity, which are faster to check and rule out definitively.
Runtime troubleshooting is most effective when it starts broad, with a quick snapshot across multiple signals, and then narrows based on what that snapshot reveals. Whenever a healthy reference point is available (a sibling replica, a known-good prior state), compare against it rather than relying solely on inspection of the single affected instance.