You are in the middle of a critical CI/CD build or pulling a heavy production image, and suddenly the process halts with a No space left on device error. This is a rite of passage for every DevOps engineer. Docker is notorious for its "disk-hungry" nature, accumulating layers, build caches, and logs that eventually choke your root partition. However, blindly running rm -rf on your data folders is a recipe for disaster. You need a strategy that recovers space while keeping your persistent volumes intact.
The TL;DR is simple: start by running docker system prune to remove dangling data. If that isn't enough, address the hidden buildx cache or relocate the /var/lib/docker directory to a larger partition. By the end of this guide, you will have a clean environment and a permanent fix that prevents this outage from recurring.
📋 Tested with: Docker Engine v26.0.0 on Ubuntu 22.04 LTS (Jammy Jellyfish).
Result: Recovered 45GB of disk space by migrating the overlay2 storage to a secondary 100GB EBS volume.
The standard documentation often fails to explain the critical rsync flags needed to preserve file permissions during a directory relocation, which we cover in detail below.
Table of Contents
- Symptoms of Docker Disk Exhaustion
- The Three Main Causes of Space Leaks
- How to Fix Docker No Space Left on Device
- Verifying Disk Space Recovery
- Preventing Future Disk Outages
- Frequently Asked Questions

Symptoms of Docker Disk Exhaustion
💡 Analogy: Think of Docker like a professional kitchen. The containers are the active meals being cooked, the images are the recipes, and the build cache is the prep station. If you never clear the prep station or throw away the scraps, you eventually run out of counter space to cook anything new.
Disk exhaustion in Docker rarely happens instantly. It is usually a slow creep that manifests through various error messages. If you see any of the following in your terminal or logs, your disk is likely at 100% capacity:
Error response from daemon: failed to create task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write /proc/self/attr/keycreate: No space left on device\""

failed to register layer: ApplyLayer exit status 1 stdout: stderr: write /usr/bin/python3.10: no space left on device

docker: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob987654321: no space left on device
When these errors occur, your storage driver (most likely overlay2) cannot write new metadata or filesystem layers. This doesn't just stop new builds; it can cause existing databases running in containers to crash because they cannot write to their logs or temporary tables. Identifying these symptoms early is key to maintaining high availability.
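Before pruning anything, confirm that disk space is actually the problem. A minimal check, assuming the default data-root of /var/lib/docker (the fallback to / is only a convenience for hosts where that path does not exist):

```shell
# Check how full the partition that holds Docker's data is
df -h /var/lib/docker 2>/dev/null || df -h /

# If the daemon is still responsive, ask Docker for its own accounting
if command -v docker >/dev/null 2>&1; then
  docker system df
fi
```

If df reports the partition at or near 100% while docker system df shows large "RECLAIMABLE" numbers, pruning will help; if df is full but Docker's own totals are small, something outside Docker is eating the disk.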
The Three Main Causes of Space Leaks
1. The Persistent Buildx Cache
Modern Docker versions use BuildKit by default. While BuildKit makes builds significantly faster, it stores an aggressive cache of every instruction in your Dockerfile. Unlike standard images, these cache blobs are often "invisible" when you run docker images. On a busy CI/CD runner, this cache can grow to hundreds of gigabytes in a single week. This is especially prevalent in Docker Engine v23.0 and above, where BuildKit is the primary engine.
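You can see how large this hidden cache has grown with docker buildx du, which is available wherever the buildx plugin is installed. A sketch (guarded so it degrades gracefully on hosts without the Docker CLI):

```shell
# Summarize BuildKit cache usage; the trailing lines include the total
if command -v docker >/dev/null 2>&1; then
  docker buildx du | tail -5
else
  echo "docker CLI not found"
fi
```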
2. Orphaned or Dangling Volumes
When you run docker-compose down or docker rm -f, Docker removes the container but leaves the volumes behind unless you explicitly use the -v flag. Over time, these orphaned volumes accumulate. Since volumes often hold database files or heavy application data, they are the primary reason users run out of space while believing they have deleted their containers. This is a common pitfall in local development environments where projects are started and stopped frequently.
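You can audit these orphans before deleting anything. A sketch (the prune step is left commented out so you review the list first):

```shell
# List volumes that are no longer referenced by any container
if command -v docker >/dev/null 2>&1; then
  docker volume ls --filter dangling=true
  # After reviewing the list, reclaim the space (prompts unless you pass -f):
  # docker volume prune
else
  echo "docker CLI not found"
fi
```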
3. The JSON-File Log Bloat
By default, Docker uses the json-file logging driver with no maximum file size. If you have a chatty application—such as a Java app in debug mode—the log file located in /var/lib/docker/containers/<id>/<id>-json.log can grow indefinitely. I have seen single log files reach 80GB on production nodes, effectively bricking the server.
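To find out whether log bloat is your problem, hunt for oversized JSON logs directly. A sketch; the LOG_DIR variable is an illustration so the command can be pointed at a custom data-root, and on a real host you will need sudo:

```shell
# Directory holding per-container state; override when you have moved data-root
LOG_DIR="${LOG_DIR:-/var/lib/docker/containers}"

# Show JSON log files over 100 MB, largest first (prefix with sudo on a real host)
find "$LOG_DIR" -name '*-json.log' -size +100M -exec ls -lhS {} + 2>/dev/null || true
```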
How to Fix Docker No Space Left on Device
Step 1: The Safe Cleanup (The Surgical Prune)
Most users reach for docker system prune -a, but this is a blunt instrument. It deletes all unused images, not just dangling ones. If you have a slow internet connection, re-pulling 5GB of base images is a nightmare. Instead, use a filtered approach.
# Remove only dangling images and stopped containers
docker system prune
# To remove unused volumes as well (WARNING: This can delete persistent data)
docker system prune --volumes
# To remove the BuildKit cache specifically
docker builder prune -a
The docker builder prune command is the most effective way to recover space on a build server without affecting running services. On my local machine, running this command cleared 12GB of cache that system prune ignored.
Step 2: Identifying the Largest Directories
If pruning doesn't solve the issue, you need to find the culprit. Use the du (disk usage) command to inspect the Docker root directory. Note: You will need sudo permissions for this.
sudo du -sh /var/lib/docker/* | sort -h
Expected output:
1.2G /var/lib/docker/containers
4.5G /var/lib/docker/volumes
28G /var/lib/docker/overlay2
If overlay2 is the largest, you have too many images or heavy build layers. If containers is the largest, check your log files. If volumes is the winner, you are storing massive amounts of persistent data.
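Once you know which top-level directory is the winner, drill one level deeper to find the individual offenders. A sketch; the DOCKER_ROOT variable is an illustration for custom data-roots, and sudo is needed on a real host:

```shell
# Docker's data directory; override when using a custom data-root
DOCKER_ROOT="${DOCKER_ROOT:-/var/lib/docker}"

# Top five space consumers one level down (prefix with sudo on a real host)
du -sh "$DOCKER_ROOT"/*/* 2>/dev/null | sort -rh | head -5
```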
Step 3: Rotating Container Logs
To fix log bloat without stopping your containers, you can truncate the JSON logs. This is a temporary fix; the permanent fix is in the "Prevention" section below.
# Find and truncate all Docker container logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
This command instantly frees up the space occupied by logs without requiring a service restart. It is the fastest "quick fix" for a production outage caused by disk space.
Step 4: Relocating the Docker Root Directory
When your root partition (/) is simply too small (e.g., an 8GB cloud instance), no amount of pruning will save you. You must move Docker's data to a larger volume, such as a secondary SSD or an EBS volume mounted at /mnt/docker-data.
⚠️ Common Mistake: Do not use a simple mv command. This will break file permissions and symbolic links that Docker relies on for the overlay2 filesystem. Always use rsync -aqxP.
The Relocation Procedure:
1. Stop the Docker service:
sudo systemctl stop docker.socket
sudo systemctl stop docker
2. Create the configuration: edit (or create) /etc/docker/daemon.json:
{
  "data-root": "/mnt/docker-data"
}
3. Copy the data safely: use rsync to preserve the complex layer structure.
sudo rsync -aqxP /var/lib/docker/ /mnt/docker-data
4. Restart Docker:
sudo systemctl start docker
By changing the data-root, you move images, containers, and volumes. This is the most "pro" way to handle storage issues on cloud providers like AWS or GCP where you can easily attach larger volumes but cannot easily expand the root partition.
Verifying Disk Space Recovery
After performing the cleanup or relocation, you must verify that the system recognizes the new space. Use the Docker-native inspection tool instead of just relying on df -h.
# Check Docker's internal storage perception
docker system df
The output will show a breakdown of Images, Containers, Local Volumes, and Build Cache. Look for the "RECLAIMABLE" column. If that number is high, you still have room to prune. Finally, verify that your containers are actually writing to the new location (if you moved the root):
docker info | grep "Docker Root Dir"
# Output should be: Docker Root Dir: /mnt/docker-data
Preventing Future Disk Outages
The best way to fix "no space left on device" is to ensure it never happens again. Implement these three production-grade configurations.
1. Set Log Rotation Globally
Update your /etc/docker/daemon.json to include log limits. This ensures no single container can ever consume more than 100MB of log space.
{
"log-driver": "json-file",
"log-opts": {
"max-size": "20m",
"max-file": "5"
}
}
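A syntax error in daemon.json will prevent dockerd from starting at all, so validate the file before restarting. A sketch; the DAEMON_JSON variable is an illustration so the check can be pointed at any path:

```shell
# Path to the daemon config; override to validate a copy elsewhere
DAEMON_JSON="${DAEMON_JSON:-/etc/docker/daemon.json}"

# Validate before restarting: a typo here stops dockerd from booting
if [ -f "$DAEMON_JSON" ]; then
  python3 -m json.tool "$DAEMON_JSON" >/dev/null && echo "daemon.json is valid JSON"
else
  echo "no file at $DAEMON_JSON"
fi

# Then apply the change: sudo systemctl restart docker
```

Note that log-opts only apply to containers created after the restart; existing containers keep their old logging settings until they are recreated.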
2. Automate Cleanup with Cron
On build servers, images accumulate daily. Set a cron job to run a prune once a week. Create a file in /etc/cron.weekly/docker-clean:
#!/bin/sh
docker system prune -f --filter "until=168h"
This command deletes any unused data older than 7 days (168 hours), keeping your storage lean without affecting active development. Remember to make the script executable with sudo chmod +x /etc/cron.weekly/docker-clean, or cron will silently skip it.
3. Use an External Storage Driver
For high-performance environments, consider using a dedicated partition for /var/lib/docker from the start. On managed Kubernetes services like EKS or GKE, node pressure evictions handle this automatically, but for standalone VMs, disk monitoring is your responsibility.
📌 Key Takeaways:
- Identify the source using sudo du -sh /var/lib/docker/*.
- Use docker builder prune to target the often-ignored buildx cache.
- Truncate logs for an instant fix, but set max-size for a permanent one.
- Migrate the data-root to a larger volume if your OS partition is too small.
Frequently Asked Questions
Q. Will running docker system prune -a delete my databases?
A. No, as long as your databases are stored in volumes. docker system prune -a deletes stopped containers and all unused images; it does not touch volumes unless you explicitly add the --volumes flag. Always back up your data before running a prune with that flag.
Q. Why does df -h show disk full but docker system df says space is available?
A. This usually means non-Docker files are consuming your disk, or deleted log files are still being held open by a process. Try restarting the Docker daemon or using lsof +L1 to find unlinked files that are still consuming space.
Q. Can I move /var/lib/docker while containers are running?
A. Absolutely not. You must stop the Docker service. Moving files while the overlay2 driver is active will result in data corruption and broken container filesystems. Always follow the stop-rsync-start sequence.