20.3.2 Production Operation Step

A focused guide to Production Operation Step, connecting core concepts with practical Docker and container operations.

Running Docker in production requires practices that go beyond development usage: reliable container startup, graceful shutdown, healthcheck-driven availability management, resource enforcement, log collection, and the discipline of never mutating running containers in place but always deploying new images. The production operation step covers the runtime patterns that keep containerized applications running stably and make them diagnosable when they do not.

Restart Policies

Production containers must restart automatically after failures, host reboots, or daemon restarts. Restart policies define when and how Docker restarts a stopped container:

docker run -d \
  --restart unless-stopped \
  --name my-api \
  my-api:v1.2.3

Available policies:

Policy	Behavior
`no`	Never restart (default)
`always`	Always restart when the container exits; a manually stopped container stays stopped until the daemon restarts or the container is started again by hand
`on-failure`	Restart only on non-zero exit code; does not restart if manually stopped
`on-failure:5`	Restart on non-zero exit code, at most 5 times
`unless-stopped`	Like `always`, except that a manually stopped container stays stopped even after the daemon restarts

unless-stopped is the standard choice for long-running production services. It recovers from crashes and host reboots without restarting services that were intentionally stopped for maintenance.

In Compose:

services:
  api:
    restart: unless-stopped
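
To confirm that the containers on a host actually carry a restart policy, the policy can be read back with docker inspect. The following is a minimal sketch rather than an official tool: the function name missing_restart_policy is made up for illustration, and it assumes input lines of the form "/name policy", as produced by the format string shown in the comment.

```shell
# Hypothetical helper: reads "name policy" pairs on stdin and prints the
# names of containers whose restart policy is "no" (the default) or empty.
missing_restart_policy() {
  awk '$2 == "" || $2 == "no" { sub(/^\//, "", $1); print $1 }'
}

# Intended use on a Docker host (assumes at least one container exists):
#   docker ps -aq | xargs docker inspect \
#     --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' \
#     | missing_restart_policy
```

Any name it prints is a container that will stay down after a crash or reboot until someone starts it manually.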

Healthchecks

A healthcheck defines how Docker determines whether a container is ready and operational. Without one, Docker only knows whether the container's main process is running, so an application that is up but broken (a wedged web server, an exhausted database connection pool, a deadlock) is indistinguishable from a working one. With a healthcheck, Docker runs a command inside the container at a fixed interval and records the result as the container's health status.

docker run -d \
  --name my-api \
  --health-cmd "curl -f http://localhost:3000/health || exit 1" \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  my-api:v1.2.3

In the Dockerfile (so the healthcheck is part of the image definition):

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

Health states:

starting: The container has started, but no healthcheck has succeeded yet and the failure threshold has not been reached.
healthy: The most recent healthcheck passed. A single success is enough.
unhealthy: --retries consecutive healthchecks failed.

Checking health status:

docker inspect my-api --format '{{.State.Health.Status}}'

docker ps --format 'table {{.Names}}\t{{.Status}}'

NAMES     STATUS
my-api    Up 5 minutes (healthy)
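
Deployment scripts often need to block until a container reports healthy before sending it traffic. The sketch below polls the same inspect field in a loop. The function name wait_healthy, its default timeout and interval, and the DOCKER override variable are assumptions made for this example, not part of Docker itself.

```shell
# Hypothetical helper: poll a container's health status until it is
# "healthy", it turns "unhealthy", or a deadline passes.
# DOCKER may point at another compatible CLI; it defaults to "docker".
wait_healthy() {
  name=$1
  timeout=${2:-60}    # total seconds to wait
  interval=${3:-2}    # seconds between polls
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    status=$(${DOCKER:-docker} inspect \
      --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    case $status in
      healthy)   return 0 ;;
      unhealthy) echo "$name is unhealthy" >&2; return 1 ;;
    esac
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out waiting for $name (last status: ${status:-unknown})" >&2
  return 2
}

# Example: wait_healthy my-api 120 && echo "my-api is ready"
```

A container without a healthcheck never reports a status, so for such a container the loop simply runs until the timeout.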

Resource Limits in Production

Production containers must have memory and CPU limits to prevent a single container from starving other containers or the host:

docker run -d \
  --name my-api \
  --memory 512m \
  --memory-swap 512m \
  --cpus 1.0 \
  --pids-limit 200 \
  my-api:v1.2.3

Setting --memory-swap to the same value as --memory (512m here) disables swap for the container, so it cannot swap to disk under memory pressure. Swapping causes latency spikes in latency-sensitive services.

--pids-limit prevents fork bombs and limits the blast radius of a runaway process spawning subprocesses.

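Without memory limits, a container with a memory leak can consume all available host RAM. The kernel's OOM killer then picks its victim from the whole host, and the process it terminates may belong to an unrelated container or to a system service. With a limit in place, the kill is confined to the offending container.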
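
A quick way to find containers that were started without a memory limit is to read the configured limit back from docker inspect, which reports 0 when no limit is set. The following is a sketch under that assumption; the function name unlimited_memory is hypothetical.

```shell
# Hypothetical helper: reads "name memory_bytes" pairs on stdin and prints
# the containers with no memory limit (Docker reports 0 for "unlimited").
unlimited_memory() {
  awk 'NF == 2 && $2 == 0 { sub(/^\//, "", $1); print $1 }'
}

# Intended use on a Docker host (assumes at least one running container):
#   docker ps -q | xargs docker inspect \
#     --format '{{.Name}} {{.HostConfig.Memory}}' \
#     | unlimited_memory
```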

Log Management

By default, Docker writes container stdout/stderr to JSON files on the host. These files grow without bound unless rotation is configured. For production:

docker run -d \
  --name my-api \
  --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=5 \
  my-api:v1.2.3

max-size=10m rotates the log file when it reaches 10 MB. max-file=5 caps the number of log files kept, counting the active one; the oldest file is deleted on rotation. Together they cap log storage at roughly 50 MB per container.
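
The same arithmetic extends to a whole host when sizing its disk. This is a back-of-the-envelope sketch: the function name log_cap_mb is made up, and it assumes every container uses identical rotation settings.

```shell
# Worst-case json-file log usage in MB: max-size (MB) x max-file x containers.
log_cap_mb() {
  echo $(( $1 * $2 * ${3:-1} ))
}

log_cap_mb 10 5       # one container with the settings above: prints 50
log_cap_mb 10 5 20    # twenty such containers on one host: prints 1000
```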

For centralized log aggregation (ELK stack, Datadog, Splunk):

docker run -d \
  --log-driver syslog \
  --log-opt syslog-address=udp://log-server:514 \
  --log-opt tag="{{.Name}}" \
  my-api:v1.2.3

Or use the fluentd log driver for Fluent Bit/Fluentd-based pipelines.

Immutable Deployment Pattern

The production container lifecycle follows a strict rule: never update a running container in place. When new code needs to be deployed, build a new image, push it to the registry, then replace the old container with a new one running the new image:

# Build new version
docker build -t registry.example.com/my-api:v1.2.4 .
docker push registry.example.com/my-api:v1.2.4

# Stop old container and start new one
docker stop my-api
docker rm my-api
docker run -d \
  --name my-api \
  --restart unless-stopped \
  registry.example.com/my-api:v1.2.4

This pattern keeps deployments reproducible: as long as a tag is never reused for a different build, the same image tag produces the same container state. Patching a running container's filesystem (using docker exec to modify files) creates invisible state drift that is hard to reproduce or audit.
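
The replace sequence is usually wrapped in a script so that every deployment runs the same steps. The following is a minimal sketch, assuming a single host and a brief outage between stop and run; the function name deploy and the DOCKER override variable are hypothetical.

```shell
# Hypothetical helper: replace a running container with one from a new image.
# DOCKER may point at another compatible CLI; it defaults to "docker".
deploy() {
  name=$1
  image=$2
  d=${DOCKER:-docker}

  # Pull first, so a missing or mistyped tag fails before the old container stops.
  $d pull "$image" || return 1

  # The old container may not exist on a first deploy, so ignore failures here.
  $d stop "$name" >/dev/null 2>&1 || :
  $d rm "$name" >/dev/null 2>&1 || :

  $d run -d --name "$name" --restart unless-stopped "$image"
}

# Example: deploy my-api registry.example.com/my-api:v1.2.4
```

Pulling before stopping is a deliberate ordering choice: if the image cannot be fetched, the old container keeps serving.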

Monitoring Container Health

docker stats shows live resource usage for all running containers:

docker stats

CONTAINER ID   NAME       CPU %   MEM USAGE / LIMIT   MEM %   NET I/O         BLOCK I/O
a1b2c3d4e5f6   my-api     0.12%   82.4MiB / 512MiB    16.1%   14.2kB / 8.1kB  0B / 0B
b2c3d4e5f6a1   db         1.5%    198MiB / 1GiB       19.3%   2.1kB / 1.4kB   44MB / 12MB

For a specific container:

docker stats my-api --no-stream

--no-stream outputs a single snapshot instead of live updates, which makes it suitable for scripts and monitoring jobs.
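
A snapshot can be turned into a simple alert by filtering on the memory percentage. The sketch below assumes input lines of the form "name 16.10%", as produced by the format string in the comment; the function name over_mem_threshold and the default threshold of 80 are illustrative choices.

```shell
# Hypothetical helper: reads "name mem%" pairs on stdin and prints the
# containers whose memory use exceeds the given percentage of their limit.
over_mem_threshold() {
  awk -v limit="${1:-80}" '{
    pct = $2
    sub(/%$/, "", pct)          # "91.30%" -> "91.30"
    if (pct + 0 > limit + 0) print $1, $2
  }'
}

# Intended use on a Docker host:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' \
#     | over_mem_threshold 80
```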

Graceful Shutdown

When docker stop is called, Docker sends SIGTERM to the container's PID 1 and waits for the container to exit. If the container does not exit within the stop timeout (default 10 seconds), Docker sends SIGKILL, which terminates the process immediately.

Applications must handle SIGTERM to shut down gracefully: finishing in-flight requests, closing database connections, flushing buffers. With the shell form of CMD, Docker runs the command through /bin/sh -c, so the shell becomes PID 1 and the application runs as its child. The shell may not forward SIGTERM, in which case the application never sees the signal and is killed when the timeout expires. The exec form of CMD ensures PID 1 is the application process itself:

# Correct — application receives SIGTERM directly
CMD ["node", "server.js"]

# Incorrect — shell is PID 1 and may not forward SIGTERM
CMD node server.js
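
What handling SIGTERM looks like depends on the application's language. As a language-neutral illustration, the sketch below does it in plain shell: a hypothetical worker function traps TERM, reports that it is draining, and exits with status 0. The function name and the one-second work loop are assumptions made for the example.

```shell
# Hypothetical long-running worker that shuts down cleanly on SIGTERM.
worker() {
  trap 'echo "draining and exiting"; exit 0' TERM
  while :; do
    # Sleep in the background and wait on it: a trapped signal interrupts
    # "wait" at once, whereas a foreground "sleep" would delay the handler.
    sleep 1 &
    wait $! || :
  done
}

# Simulate what "docker stop" does to PID 1: start, then send SIGTERM.
worker &
pid=$!
sleep 1
kill -TERM "$pid"
wait "$pid"
echo "worker exit status: $?"
```

Run as written, the script prints the drain message followed by an exit status of 0, which is the behavior docker stop expects from PID 1 within the stop timeout.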

Extending the stop timeout for containers that need more time to drain connections:

docker stop --time 30 my-api

In Compose:

services:
  api:
    stop_grace_period: 30s

Labeling for Operational Clarity

Labels on containers and images provide metadata useful for operational tooling, log correlation, and deployment tracking:

docker run -d \
  --name my-api \
  --label app=my-api \
  --label version=v1.2.4 \
  --label environment=production \
  --label deployed-by=ci-pipeline \
  my-api:v1.2.4

In Compose:

services:
  api:
    labels:
      app: my-api
      version: v1.2.4
      environment: production

Labels can be queried with docker ps --filter label=app=my-api and are included in Docker events for monitoring systems.
