16.3.2 Disk Exhaustion Problems
A focused guide to recovering a Docker host that has run out of disk space, connecting core concepts with practical Docker and container operations.
This section addresses the emergency in which a host has already run completely out of disk space, as distinct from monitoring and preventing that condition beforehand. The practical challenge is that Docker's own commands, and sometimes even basic shell operations, can begin failing once free space reaches zero. Recovery therefore requires a careful, specific sequence of steps rather than the more leisurely cleanup that is appropriate before the crisis point is reached.
Confirming the host has actually run out of space
The first step, even in an apparent emergency, is confirming disk exhaustion specifically, rather than assuming it based on symptoms that could have a different cause:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 50G 0 100% /
A filesystem at or essentially at 100% utilization confirms genuine exhaustion; this single check should happen before any other troubleshooting step, since many other commands, including some of Docker's own, may behave unpredictably or fail outright once disk space is fully exhausted.
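The same check can be scripted so that it also covers inode exhaustion, which produces the same "no space left on device" error even when df -h still shows free blocks. The following is a minimal sketch rather than anything Docker provides: the function name full_filesystems and the 95 percent default are assumptions chosen for illustration, and the function expects df -P style output on standard input.

```shell
# Print "<mount point> <use%>" for every filesystem at or above a threshold.
# Reads `df -P` style output on stdin; assumes mount points contain no spaces.
full_filesystems() {
  threshold="${1:-95}"   # percent; hypothetical default
  awk -v t="$threshold" '
    NR > 1 {
      use = $5              # fifth column is Capacity (or IUse% with -i)
      sub(/%/, "", use)
      if (use + 0 >= t) print $6, use "%"
    }'
}
```

On a live host this could be run as `df -P | full_filesystems 95` for block usage and `df -Pi | full_filesystems 95` for inode usage.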
Why Docker commands themselves can fail under full disk conditions
The Docker daemon needs to write its own metadata, logs, and temporary files during normal operation, which means a fully exhausted disk can cause even basic commands like docker ps or docker rm to fail or hang, creating a frustrating situation where the tools needed to fix the problem are themselves impaired by the problem:
docker ps
Error response from daemon: write /var/lib/docker/...: no space left on device
This is precisely why disk exhaustion deserves a different, more careful approach than routine cleanup: every command attempted needs to actually free space, immediately and reliably, rather than risk failing or making the situation marginally worse.
Finding space outside of Docker's own data directory first
Before touching anything Docker-managed, check whether space can be freed elsewhere on the host: log files, old kernel versions, unrelated temporary files. This provides breathing room to then perform Docker-specific cleanup more safely, with commands that are more likely to succeed. The commands below assume a systemd host with a Debian-style package manager and need to be run as root:
du -sh /var/log/* 2>/dev/null | sort -rh | head -10
journalctl --vacuum-size=200M
apt-get clean
Freeing even a modest amount of space through these host-level steps can be enough to get Docker's own commands working reliably again, which allows a more thorough, Docker-specific cleanup to proceed without commands failing from the same exhaustion.
Targeted cleanup once commands are functional again
Once enough space has been freed to restore basic command functionality, working through Docker's own disk usage breakdown identifies where the bulk of reclaimable space actually sits:
docker system df
docker builder prune -af
docker image prune -af
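Build cache and unused images are frequently the largest reclaimable categories, and removing them first, before considering anything that might affect running containers or volumes, is generally the safest order of operations during an active emergency. Note that the -a flag removes every image not used by at least one container, not only dangling ones, so those images will need to be pulled or rebuilt later.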
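The ordering described here can be sketched as a small script that runs the low-risk prune steps one at a time and stops as soon as a target amount of free space is reached. This is an illustrative sketch under stated assumptions: the function names free_mb and reclaim_until, and the target value passed in, are hypothetical, and /var/lib/docker is assumed to be the Docker data directory.

```shell
# Free space in MiB on the filesystem holding the given path (0 if unknown).
free_mb() {
  df -Pk "$1" 2>/dev/null |
    awk 'NR == 2 { print int($4 / 1024); found = 1 } END { if (!found) print 0 }'
}

# Run the low-risk prune steps in order, stopping once TARGET_MB is available.
# Returns 0 if the target was reached, non-zero otherwise.
reclaim_until() {
  target_mb="$1"
  data_root="${2:-/var/lib/docker}"   # assumed default data directory
  for step in "builder prune -af" "image prune -af"; do
    if [ "$(free_mb "$data_root")" -ge "$target_mb" ]; then
      return 0
    fi
    # Word splitting of $step is intentional: it holds a subcommand plus flags.
    docker $step
  done
  [ "$(free_mb "$data_root")" -ge "$target_mb" ]
}
```

For example, `reclaim_until 2048` would prune build cache, then unused images, and report failure through its exit status if less than 2 GiB is free afterward, which signals that more disruptive measures may need to be considered.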
Identifying what is actively consuming space right now
If general cleanup does not free enough space, or to understand what caused the exhaustion in the first place, identifying the specific largest and most recently growing files within Docker's data directory points directly at the actual cause:
du -ah /var/lib/docker/containers/*/*.log 2>/dev/null | sort -rh | head -5
An unexpectedly massive container log file is one of the most common single causes of sudden, severe disk exhaustion, particularly for a container stuck in some kind of error loop producing continuous, repetitive log output far beyond what would occur under normal operation.
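To tie an oversized log back to the container that owns it, the directory name under /var/lib/docker/containers can be used, since it is the full container ID. The sketch below lists logs above a size threshold together with that ID. It assumes the json-file logging driver's `<container-id>-json.log` naming and GNU find's `M` size suffix; the function name large_container_logs and the 100 MiB default are hypothetical.

```shell
# List container log files larger than MIN_MB under a containers directory,
# printing "<size-in-MiB> <container-id> <path>", largest first.
large_container_logs() {
  dir="${1:-/var/lib/docker/containers}"
  min_mb="${2:-100}"   # hypothetical default threshold
  find "$dir" -type f -name '*-json.log' -size +"${min_mb}"M 2>/dev/null |
    while IFS= read -r log; do
      size_mb=$(( $(wc -c < "$log") / 1048576 ))
      id=$(basename "$(dirname "$log")")   # parent directory name is the container ID
      printf '%s %s %s\n' "$size_mb" "$id" "$log"
    done | sort -rn
}
```

Once Docker commands are responding, `docker inspect --format '{{.Name}}' <container-id>` translates a reported ID into the container's name.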
truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log
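Truncating a specific, identified runaway log file is a valid, if blunt, emergency measure that recovers space immediately. Truncation is used instead of deletion because the space held by a deleted file is not released while the daemon still has it open. Docker's documentation warns that these files are meant to be accessed only by the daemon, so treat this as a last resort, and address the underlying cause afterward (missing log rotation configuration, or whatever is producing the runaway output) rather than leaving it to recur.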
Stopping non-essential containers temporarily
If cleanup alone is insufficient, temporarily stopping non-essential containers, particularly any that are actively writing significant data to their own writable layer, halts further consumption while a more thorough cleanup or capacity expansion is arranged. The name filter below is a placeholder for however non-essential containers are identified on the host:
docker stop $(docker ps -q --filter "name=non-essential")
This is a triage decision, prioritizing stopping the bleeding over maintaining full service availability during an active disk emergency, appropriate when the alternative is the entire host becoming unresponsive or every container failing simultaneously due to a complete inability to write anything at all.
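One practical wrinkle is that `docker stop` exits with an error when the filter matches nothing and it receives no arguments. A minimal sketch that guards against this is shown below; the function name stop_matching is hypothetical, and the filter string is whatever convention the host actually uses.

```shell
# Stop every running container matching a `docker ps` filter expression.
# Does nothing (and succeeds) when no running container matches.
stop_matching() {
  filter="$1"
  ids=$(docker ps -q --filter "$filter")
  if [ -z "$ids" ]; then
    echo "no running containers match $filter" >&2
    return 0
  fi
  # Word splitting of $ids is intentional: one container ID per line.
  docker stop $ids
}
```

For example, `stop_matching "name=non-essential"` behaves like the one-line command shown earlier when containers match, and exits cleanly otherwise.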
Verifying recovery and restoring normal operation
After freeing sufficient space, confirm that the host is genuinely stable again, rather than barely above zero and at immediate risk of recurrence, before considering the emergency resolved:
df -h
docker ps
docker system df
A host that has only just recovered above zero free space remains at acute risk of recurrence from even routine activity. Freeing a more comfortable margin and addressing the underlying cause (missing log rotation, a lax retention policy, unexpected data growth) matters more than simply restoring functionality to a precarious, marginal state.
Preventing recurrence after the immediate crisis
Once the host is stable, implement the preventive measures that should have caught the problem before it became an emergency: log rotation limits, scheduled pruning, and disk usage monitoring that alerts before a critical threshold is reached. This converts a one-time crisis response into a permanent fix rather than a temporary reprieve before the same situation recurs. The examples below show a logging default for /etc/docker/daemon.json followed by a nightly crontab entry:
{
  "log-driver": "local",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
0 2 * * * docker system prune -af --filter "until=72h"
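Logging defaults in daemon.json take effect only after the daemon is restarted, and only for containers created afterward; existing containers keep the logging options they were created with. A rough way to find running containers that still lack a size limit is sketched below. It assumes that a container's effective options are visible under HostConfig.LogConfig.Config in `docker inspect` output, and the function name containers_without_log_limit is hypothetical.

```shell
# Print the ID of every running container whose logging options
# do not include a "max-size" entry.
containers_without_log_limit() {
  for id in $(docker ps -q); do
    max_size=$(docker inspect \
      --format '{{index .HostConfig.LogConfig.Config "max-size"}}' "$id")
    if [ -z "$max_size" ]; then
      echo "$id"
    fi
  done
}
```

Any container this reports would need to be recreated to pick up the new defaults.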
Common mistakes
- Attempting extensive Docker-specific troubleshooting before confirming and addressing the disk exhaustion itself, when many commands may fail or behave unpredictably until at least some space is freed.
- Not checking for non-Docker sources of disk usage first, missing an easier, faster path to restoring enough functionality to then perform more thorough Docker cleanup.
- Removing volumes or running containers' data indiscriminately during a panic-driven cleanup, rather than prioritizing safer categories like build cache and unused images first.
- Restoring the host to barely-above-zero free space and considering the emergency resolved, without addressing the underlying cause or establishing a safer margin.
- Not implementing preventive measures (log rotation, scheduled pruning, monitoring) after the immediate crisis is resolved, leaving the host vulnerable to an identical recurrence.
Disk exhaustion problems require a careful, ordered emergency response precisely because the tools needed to fix the problem can themselves be impaired by it. Working through host-level space first, then Docker's own safest reclaimable categories, and only then considering more disruptive measures restores stability fastest while minimizing the risk of further complications during an already precarious situation.