20.3.2 Production Operation Step
A focused guide to the production operation step, connecting core concepts with practical Docker and container operations.
Production operation with Docker requires operational practices that go beyond development usage: reliable container startup, graceful shutdown, healthcheck-driven availability management, resource enforcement, log collection, and the discipline of never mutating running containers in place — always deploying new images. The production operation step covers the runtime operational patterns that keep containerized applications running stably and make them diagnosable when they do not.
Restart Policies
Production containers must restart automatically after failures, host reboots, or daemon restarts. Restart policies define when and how Docker restarts a stopped container:
docker run -d \
--restart unless-stopped \
--name my-api \
my-api:v1.2.3
Available policies:
| Policy | Behavior |
|---|---|
| no | Never restart (default) |
| always | Restart whenever the container stops. A manually stopped container stays stopped until the daemon restarts or the container is started manually |
| on-failure | Restart only on non-zero exit code; does not restart if manually stopped |
| on-failure:5 | Restart on failure, at most 5 times |
| unless-stopped | Like always, except that a manually stopped container stays stopped even after the daemon restarts |
unless-stopped is the standard choice for long-running production services. It recovers from crashes and host reboots without restarting services that were intentionally stopped for maintenance.
In Compose:
services:
api:
restart: unless-stopped
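To confirm that the policy is actually applied on a host, the recorded policy can be read back with docker inspect. The following is a minimal sketch, not a definitive tool: the function name audit_restart_policies is hypothetical, and it assumes the .HostConfig.RestartPolicy.Name field reported by docker inspect.

```shell
# Print each running container with its restart policy and flag those
# that would stay down after a crash or reboot.
# audit_restart_policies is a hypothetical helper name.
audit_restart_policies() {
  for name in $(docker ps --format '{{.Names}}'); do
    policy=$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$name")
    case "$policy" in
      no|"") echo "WARN $name has no restart policy" ;;
      *)     echo "OK   $name $policy" ;;
    esac
  done
}
```

Calling audit_restart_policies after a deployment gives a quick list of containers that were started without a --restart flag.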
Healthchecks
A healthcheck defines how Docker determines whether a container is ready and operational. Without a healthcheck, Docker only knows whether the container's main process is running and reports no health status at all. An application that is broken but still running (unresponsive web server, exhausted database connection pool, deadlock) looks fine to Docker unless a configured healthcheck detects the failure.
docker run -d \
--name my-api \
--health-cmd "curl -f http://localhost:3000/health || exit 1" \
--health-interval 30s \
--health-timeout 5s \
--health-retries 3 \
my-api:v1.2.3
In the Dockerfile (so the healthcheck is part of the image definition):
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
Health states:
- starting: The container has started, but no healthcheck has passed yet.
- healthy: The most recent healthcheck passed.
- unhealthy: --retries consecutive healthchecks failed.
Checking health status:
docker inspect my-api --format '{{.State.Health.Status}}'
docker ps
NAMES STATUS
my-api Up 5 minutes (healthy)
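Deployment scripts often need to block until a container reports healthy before continuing. The sketch below polls the same .State.Health.Status field shown above. The function name wait_healthy and its default attempt count and delay are assumptions chosen for illustration.

```shell
# Poll a container's health status until it is healthy, unhealthy, or the
# attempt budget runs out. Returns 0 only when the container is healthy.
# wait_healthy and its defaults are hypothetical, not part of Docker.
wait_healthy() {
  name=$1
  attempts=${2:-30}   # assumed default: 30 polls
  delay=${3:-2}       # assumed default: 2 seconds between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    case "$status" in
      healthy)   return 0 ;;
      unhealthy) echo "$name is unhealthy" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$name did not become healthy after $attempts checks" >&2
  return 2
}
```

A script might run wait_healthy my-api 20 3 and abort the rollout on a non-zero return value.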
Resource Limits in Production
Production containers must have memory and CPU limits to prevent a single container from starving other containers or the host:
docker run -d \
--name my-api \
--memory 512m \
--memory-swap 512m \
--cpus 1.0 \
--pids-limit 200 \
my-api:v1.2.3
--memory-swap 512m set to the same value as --memory disables swap usage, preventing the container from swapping to disk when under memory pressure. Swap usage causes latency spikes in latency-sensitive services.
--pids-limit prevents fork bombs and limits the blast radius of a runaway process spawning subprocesses.
Without memory limits, a container that develops a memory leak can consume all available host RAM. The kernel's OOM killer then picks victims across the whole host, which can terminate unrelated containers or host processes.
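Whether a limit is in effect, and whether a container has already been killed for exceeding it, can be read from docker inspect. This is a small sketch under stated assumptions: memory_report is a hypothetical helper name, and it relies on .HostConfig.Memory (limit in bytes, 0 meaning unlimited) and .State.OOMKilled.

```shell
# Report a container's memory limit and whether it was OOM-killed.
# memory_report is a hypothetical helper name.
memory_report() {
  name=$1
  info=$(docker inspect --format '{{.HostConfig.Memory}} {{.State.OOMKilled}}' "$name") || return 1
  limit=${info%% *}   # first field: limit in bytes, 0 = unlimited
  oom=${info##* }     # second field: true or false
  if [ "$limit" -eq 0 ]; then
    echo "$name: no memory limit set"
  else
    echo "$name: limit $((limit / 1024 / 1024)) MiB, OOM-killed: $oom"
  fi
}
```

For a container started with --memory 512m, the reported limit is 512 MiB, since Docker stores the value as 536870912 bytes.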
Log Management
By default, Docker logs container stdout/stderr as JSON files on the host. These files grow without bound unless configured. For production:
docker run -d \
--name my-api \
--log-driver json-file \
--log-opt max-size=10m \
--log-opt max-file=5 \
my-api:v1.2.3
max-size=10m rotates the log file when it reaches 10MB. max-file=5 keeps at most 5 log files per container, counting the active file and the rotated ones. Combined, this caps log storage at roughly 50MB per container.
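The same arithmetic helps when sizing host disks for many containers. The helper below is a hypothetical illustration (log_cap_mb is not a Docker command) that multiplies max-size in MB by max-file and by the number of containers.

```shell
# Worst-case json-file log footprint in MB:
# max-size (MB) x max-file x number of containers.
# log_cap_mb is a hypothetical helper name.
log_cap_mb() {
  max_size_mb=$1
  max_file=$2
  containers=${3:-1}   # defaults to a single container
  echo $((max_size_mb * max_file * containers))
}
```

With the settings above, log_cap_mb 10 5 gives the per-container cap, and log_cap_mb 10 5 20 gives the budget for a host running 20 such containers.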
For centralized log aggregation (ELK stack, Datadog, Splunk):
docker run -d \
--log-driver syslog \
--log-opt syslog-address=udp://log-server:514 \
--log-opt tag="{{.Name}}" \
my-api:v1.2.3
Or use the fluentd log driver for Fluent Bit/Fluentd-based pipelines.
Immutable Deployment Pattern
The production container lifecycle follows a strict rule: never update a running container in place. When new code needs to be deployed, build a new image, push it to the registry, then replace the old container with a new one running the new image:
# Build new version
docker build -t registry.example.com/my-api:v1.2.4 .
docker push registry.example.com/my-api:v1.2.4
# Stop old container and start new one
docker stop my-api
docker rm my-api
docker run -d \
--name my-api \
--restart unless-stopped \
registry.example.com/my-api:v1.2.4
This pattern makes deployments reproducible: as long as a tag is never reused for a different build, the same image tag produces the same container state. Patching a running container's filesystem (for example, using docker exec to modify files) creates invisible state drift that is impossible to reproduce or audit.
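The stop, remove, and run sequence is usually wrapped in a script so it runs the same way every time. The following is one possible sketch, not a prescribed deployment tool: redeploy is a hypothetical function name, and pulling the image before stopping the old container is a design choice added here so that a mistyped tag fails early.

```shell
# Replace a running container with one created from a new image.
# Usage: redeploy NAME IMAGE [extra docker run options...]
# redeploy is a hypothetical helper name.
redeploy() {
  name=$1
  image=$2
  shift 2
  # Pull first so a missing or mistyped tag fails before the old container stops.
  docker pull "$image" || return 1
  # On a first deploy the old container does not exist, so ignore failures here.
  docker stop "$name" >/dev/null 2>&1 || true
  docker rm "$name" >/dev/null 2>&1 || true
  docker run -d --name "$name" --restart unless-stopped "$@" "$image"
}
```

For example, redeploy my-api registry.example.com/my-api:v1.2.4 --memory 512m passes the extra option through to docker run. This simple form still has a brief gap between stopping the old container and starting the new one.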
Monitoring Container Health
docker stats
Live resource usage for all running containers:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
a1b2c3d4e5f6 my-api 0.12% 82.4MiB / 512MiB 16.1% 14.2kB / 8.1kB 0B / 0B
b2c3d4e5f6a1 db 1.5% 198MiB / 1GiB 19.3% 2.1kB / 1.4kB 44MB / 12MB
For a specific container:
docker stats my-api --no-stream
--no-stream outputs a single snapshot instead of live updates, which makes it suitable for scripts and scheduled monitoring checks.
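A snapshot can be filtered to find containers approaching their memory limit. This sketch assumes the .Name and .MemPerc placeholders of docker stats --format. The function name high_memory_containers and the 80 percent default threshold are hypothetical.

```shell
# List containers whose memory usage is at or above a percentage threshold.
# high_memory_containers and its default threshold are hypothetical.
high_memory_containers() {
  threshold=${1:-80}   # assumed default, in percent
  docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' |
    awk -v limit="$threshold" '{
      pct = $2
      sub(/%$/, "", pct)                      # "16.10%" -> "16.10"
      if (pct + 0 >= limit + 0) print $1, $2  # numeric comparison
    }'
}
```

Run from cron or a monitoring agent, high_memory_containers 90 prints one line per container at or above 90 percent and nothing when all containers are below it.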
Graceful Shutdown
When docker stop is called, Docker sends SIGTERM to the container's PID 1 and waits for the container to exit. If the container does not exit within the stop timeout (default 10 seconds), Docker sends SIGKILL, which terminates the process immediately.
Applications must handle SIGTERM to shut down gracefully: finishing in-flight requests, closing database connections, and flushing buffers. With the shell form of CMD, the command runs under /bin/sh -c, so the shell can end up as PID 1 and may not forward signals to the application. The exec form of CMD ensures PID 1 is the application process itself:
# Correct — application receives SIGTERM directly
CMD ["node", "server.js"]
# Incorrect — shell is PID 1 and may not forward SIGTERM
CMD node server.js
Extending the stop timeout for containers that need more time to drain connections:
docker stop -t 30 my-api
In Compose:
services:
api:
stop_grace_period: 30s
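Whether a shutdown was actually graceful can be checked from the container's exit code afterwards. The sketch below relies on the common convention that exit codes above 128 mean the process was ended by signal number (code - 128). The function names describe_exit_code and stop_and_report are hypothetical.

```shell
# Interpret a container's exit code after docker stop.
# Codes above 128 conventionally mean "ended by signal (code - 128)".
describe_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    137) echo "killed by SIGKILL (128 + 9): stop timeout elapsed or OOM kill" ;;
    143) echo "ended on SIGTERM (128 + 15)" ;;
    *)   echo "exited with code $1" ;;
  esac
}

# Stop a container with a grace period in seconds and report how it ended.
stop_and_report() {
  name=$1
  grace=${2:-10}   # matches Docker's default stop timeout
  docker stop -t "$grace" "$name" >/dev/null || return 1
  code=$(docker inspect --format '{{.State.ExitCode}}' "$name")
  echo "$name: $(describe_exit_code "$code")"
}
```

If stop_and_report my-api 30 regularly reports SIGKILL, the application is probably not handling SIGTERM or needs a longer grace period.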
Labeling for Operational Clarity
Labels on containers and images provide metadata useful for operational tooling, log correlation, and deployment tracking:
docker run -d \
--name my-api \
--label app=my-api \
--label version=v1.2.4 \
--label environment=production \
--label deployed-by=ci-pipeline \
my-api:v1.2.4
In Compose:
services:
api:
labels:
app: my-api
version: v1.2.4
environment: production
Labels can be queried with docker ps --filter label=app=my-api and are included in Docker events for monitoring systems.
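Combining the label filter with docker inspect gives a quick view of which version of a service is running. This is a minimal sketch that assumes containers carry the app and version labels used in the examples above. The function name versions_for_app is hypothetical.

```shell
# Print "name version" for every running container with a given app label.
# Assumes the label keys app and version from the examples above.
# versions_for_app is a hypothetical helper name.
versions_for_app() {
  app=$1
  for name in $(docker ps --filter "label=app=$app" --format '{{.Names}}'); do
    version=$(docker inspect --format '{{index .Config.Labels "version"}}' "$name")
    echo "$name $version"
  done
}
```

For instance, versions_for_app my-api can be run after a deployment to verify that no container from the previous release is still up.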