14.2.1.3 Prod Environment Config

A focused guide to Prod Environment Config, connecting core concepts with practical Docker and container operations.

Production environment configuration for Docker is the set of values, secrets, and runtime settings supplied to the exact same image that already passed through development and staging, tuned for the durability, security, and operational discipline that a live, customer-facing system requires.

Promotion, not rebuild

The production configuration step should never trigger a new build. The artifact deployed to production is the one already validated upstream, identified by an immutable reference rather than a mutable tag:

docker pull registry.example.com/my-api@sha256:3f29a8c1d8e2...
docker run -d --env-file production.env registry.example.com/my-api@sha256:3f29a8c1d8e2...

Pinning by digest rather than by a tag like latest or even a semantic version tag removes any ambiguity about which exact bytes are running, which matters once an incident requires confirming precisely what was deployed at a given moment.
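
A deploy script can enforce this rule before it pulls anything. The following is a minimal sketch, assuming a POSIX shell. The function name is_digest_pinned is illustrative and not part of the Docker CLI, and it only checks the shape of the reference (an @sha256: suffix followed by 64 hexadecimal characters), not whether that digest exists in the registry:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if an image reference is pinned by a full sha256 digest.
is_digest_pinned() {
  ref=$1
  # Remove everything up to and including "@sha256:"; if nothing was removed, no digest is present.
  digest=${ref##*@sha256:}
  [ "$digest" != "$ref" ] || return 1
  # A sha256 digest is 64 lowercase hexadecimal characters.
  [ "${#digest}" -eq 64 ] || return 1
  case $digest in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}
```

A pipeline step might call it as is_digest_pinned "$IMAGE" || exit 1 ahead of docker pull (IMAGE being an assumed variable name), so that a reference such as my-api:latest stops the deployment instead of reaching production.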

Sourcing secrets correctly

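Production is where credential handling discipline matters most. Secrets should come from a secrets manager or an orchestrator-native secret store (for example, Swarm secrets created ahead of time with docker secret create), not from a plaintext file sitting on the host. An env file such as production.env should therefore carry only non-secret settings: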

services:
  api:
    image: registry.example.com/my-api@sha256:3f29a8c1d8e2...
    secrets:
      - db_password
      - jwt_signing_key

secrets:
  db_password:
    external: true
  jwt_signing_key:
    external: true

cat /run/secrets/db_password

The application reads secret values from their mounted file paths at startup rather than from environment variables, keeping them out of docker inspect output, shell history, and process environment dumps that an environment variable approach would expose.
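
Because the application depends on these files at startup, an entrypoint script can fail fast when one is absent instead of letting the service start half-configured. This is a sketch under stated assumptions: require_secrets is an illustrative name, and SECRETS_DIR is an assumed override added here only so the check can be exercised outside a container. The default path /run/secrets is the mount point shown above.

```shell
#!/bin/sh
# Hypothetical entrypoint helper: fail fast if any expected secret file is missing or empty.
# SECRETS_DIR is an assumed override for local testing; /run/secrets is the default mount point.
require_secrets() {
  secrets_dir=${SECRETS_DIR:-/run/secrets}
  missing=0
  for name in "$@"; do
    # -s is true only when the file exists and is non-empty; -r checks readability.
    if [ ! -s "$secrets_dir/$name" ] || [ ! -r "$secrets_dir/$name" ]; then
      echo "missing or unreadable secret: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In an entrypoint this could run as require_secrets db_password jwt_signing_key || exit 1 before exec "$@" hands control to the application. The check never copies a secret value into a variable, so the values stay out of the process environment.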

Resource limits and restart policy

Production containers should run with explicit resource boundaries and a restart policy appropriate for unattended operation, since no one is watching to manually restart a crashed process at 3 a.m.:

docker run -d \
  --memory=1024m --cpus=2 \
  --restart=unless-stopped \
  --env-file production.env \
  registry.example.com/my-api@sha256:3f29a8c1d8e2...

services:
  api:
    deploy:
      resources:
        limits:
          memory: 1024M
          cpus: "2"
      restart_policy:
        condition: on-failure
        max_attempts: 5

unless-stopped (or an orchestrator's equivalent restart policy) brings the container back automatically after a crash or after the Docker daemon or host restarts, but leaves it down if an operator deliberately stopped it, for example for maintenance.
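
It is worth confirming after deployment that the limits were actually applied. The sketch below assumes a POSIX shell and a container named my-api. The helper check_runtime_limits is an illustrative name. It relies on docker inspect reporting HostConfig.Memory in bytes and HostConfig.NanoCpus in billionths of a CPU, with 0 meaning no limit was set through --memory or --cpus. A CPU limit applied through other flags would not be detected by this check.

```shell
#!/bin/sh
# Hypothetical check: reads "memory_bytes nano_cpus restart_policy" on stdin and flags unset limits.
check_runtime_limits() {
  read -r mem nano policy || true
  status=0
  # Docker reports 0 when no limit was applied.
  [ "${mem:-0}" -gt 0 ] 2>/dev/null || { echo "no memory limit set" >&2; status=1; }
  [ "${nano:-0}" -gt 0 ] 2>/dev/null || { echo "no CPU limit set" >&2; status=1; }
  case ${policy:-no} in
    no) echo "no restart policy set" >&2; status=1 ;;
  esac
  return "$status"
}

# Intended use against a running container (requires Docker, so left commented out):
# docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}} {{.HostConfig.RestartPolicy.Name}}' my-api | check_runtime_limits
```

For the docker run command above, the inspected values would be 1073741824 bytes of memory and 2000000000 nano-CPUs, so the check passes.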

Logging configuration

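The default json-file logging driver performs no rotation unless it is configured to, so container stdout/stderr accumulates on the host's disk indefinitely and can eventually fill it. Production logging configuration should bound log size, for example in the daemon configuration file (/etc/docker/daemon.json on Linux), and either ship logs off the host or rotate them locally: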

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "20m",
    "max-file": "5"
  }
}

docker run -d --log-driver=syslog --log-opt syslog-address=udp://logs.example.com:514 registry.example.com/my-api@sha256:3f29a8c1d8e2...

Shipping logs to a centralized aggregator is generally preferable to relying on local rotation alone, since local rotation discards history precisely when an incident investigation might need it.
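
The bound that local rotation provides is easy to estimate: each container keeps at most max-file files of roughly max-size each. The helper below is a small sketch with an illustrative name, and the figure of 30 containers per host is an assumption chosen only for the example. The result is approximate, since a file is rotated once it reaches the size limit.

```shell
#!/bin/sh
# Hypothetical helper: approximate upper bound, in MB, of json-file log usage on one host.
# Arguments: max-size in MB, max-file count, number of containers on the host.
max_log_disk_mb() {
  echo $(( $1 * $2 * $3 ))
}

max_log_disk_mb 20 5 1    # one container with the settings above
max_log_disk_mb 20 5 30   # an assumed host running 30 such containers
```

With the settings shown, one container retains about 100 MB of logs and 30 containers retain about 3000 MB, which is the disk budget to reserve and also the limit on how much history survives locally.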

Health checks driving real traffic decisions

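Production configuration should wire the container's health check into whatever is making routing decisions. On its own, the Docker Engine only records the health status (an orchestrator such as Swarm goes further and replaces unhealthy tasks), so the status has to be consumed by the component that actually directs traffic: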

HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

docker inspect --format='{{.State.Health.Status}}' my-api

A load balancer or service mesh that is not consuming this health status is effectively routing traffic blind to whether the container can actually serve a request, which defeats much of the purpose of defining a health check at all.
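
One simple consumer of this status is a deployment gate: a script that refuses to put a new container into rotation until Docker reports it healthy. The sketch below assumes a POSIX shell. wait_until_healthy and the POLL_INTERVAL variable are illustrative names, and the step that registers the container with a load balancer is deliberately omitted because it depends on the product in use.

```shell
#!/bin/sh
# Hypothetical gate: poll a container's health status until it is healthy or attempts run out.
# Arguments: container name, maximum number of checks (default 20).
wait_until_healthy() {
  container=$1
  attempts=${2:-20}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Docker reports one of: starting, healthy, unhealthy.
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status=unknown
    case $status in
      healthy) return 0 ;;
      unhealthy) echo "$container reported unhealthy" >&2; return 1 ;;
    esac
    i=$((i + 1))
    # POLL_INTERVAL is an assumed tuning knob, in seconds.
    sleep "${POLL_INTERVAL:-3}"
  done
  echo "$container did not become healthy after $attempts checks" >&2
  return 1
}
```

A deploy script might then run wait_until_healthy my-api 20 || exit 1 and only add the container to the routing pool when the call succeeds.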

High availability and replica configuration

Production configuration typically specifies multiple replicas and a placement strategy that avoids a single host failure taking down every instance of a service:

services:
  api:
    deploy:
      replicas: 4
      placement:
        max_replicas_per_node: 2
      update_config:
        parallelism: 1
        delay: 10s

The update_config settings control how a rolling update proceeds, updating one replica at a time with a pause in between, rather than replacing every replica simultaneously and risking a full outage if the new version has a problem.
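
The cost of this caution is rollout time, and it can be estimated from the same three numbers. The helper below is a sketch with an illustrative name. It counts only the configured delay between batches, so it gives a lower bound: the time each new task takes to start and become healthy comes on top.

```shell
#!/bin/sh
# Hypothetical helper: total update_config delay, in seconds, across a rolling update.
# Arguments: replicas, parallelism, delay between batches in seconds.
min_rollout_delay_s() {
  # Number of batches, rounded up.
  batches=$(( ($1 + $2 - 1) / $2 ))
  # The delay applies between batches, so there is one fewer delay than batches.
  echo $(( (batches - 1) * $3 ))
}

min_rollout_delay_s 4 1 10   # the configuration above
min_rollout_delay_s 4 2 10   # assumed alternative: two replicas per batch
```

The configuration above spends at least 30 seconds waiting between batches. Raising parallelism to 2 cuts that to 10 seconds, at the price of exposing half the replicas to a bad release at once.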

Network exposure

Production network configuration should expose only what is genuinely intended to be reachable from outside the host, binding internal-only services to internal networks rather than to all interfaces:

services:
  api:
    ports:
      - "443:3000"
    networks:
      - public      # carries the published port
      - internal    # private path to db
  db:
    networks:
      - internal
    # no ports published; reachable only from other services on the internal network

networks:
  public: {}
  internal:
    internal: true
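
A periodic check of what a host actually publishes helps catch drift from this intent. The sketch below assumes a POSIX shell. published_containers is an illustrative name, and it relies on docker ps showing a published port as a host-to-container mapping containing "->", while a port that is only exposed appears without one.

```shell
#!/bin/sh
# Hypothetical audit helper: reads "name|ports" lines and prints the names of containers
# that publish at least one port on the host (shown by docker ps as "host->container").
published_containers() {
  while IFS='|' read -r name ports; do
    case $ports in
      *'->'*) printf '%s\n' "$name" ;;
    esac
  done
}

# Intended use (requires Docker, so left commented out):
# docker ps --format '{{.Names}}|{{.Ports}}' | published_containers
```

Any name in the output that is not an intended entry point, such as a database container, indicates a port that should be unpublished.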

Auditability of what is actually running

Production configuration should be reconstructable after the fact: given an incident, an operator should be able to determine exactly which image digest, environment values (secrets excluded), and resource limits were in effect at a given time:

docker inspect my-api --format '{{json .Config}}' | jq .
docker inspect my-api --format '{{json .HostConfig}}' | jq '{Memory, NanoCpus, RestartPolicy}'

git log -p -- docker-compose.production.yml

Version-controlling the production Compose or stack definition, with secret values referenced rather than embedded, gives this auditability essentially for free.
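
When a snapshot of the running environment is captured for an incident record, anything secret-looking should be masked before it is stored. The filter below is a sketch, assuming a POSIX shell. redact_env is an illustrative name, and the name patterns it matches are an assumption to be adapted to local conventions. It is a safety net for values that should not have been in the environment in the first place.

```shell
#!/bin/sh
# Hypothetical filter: masks values of environment variables whose names look secret-bearing.
# The name patterns are an assumption; adjust them to local naming conventions.
redact_env() {
  while IFS= read -r line; do
    # Everything before the first "=" is the variable name.
    key=${line%%=*}
    case $key in
      *PASSWORD*|*SECRET*|*TOKEN*|*KEY*) printf '%s=[redacted]\n' "$key" ;;
      *) printf '%s\n' "$line" ;;
    esac
  done
}

# Intended use (requires Docker, so left commented out):
# docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' my-api | redact_env
```

The filtered output can be attached to an incident ticket alongside the image digest and the relevant git log entry.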

Common mistakes

Deploying by a mutable tag such as latest, losing the ability to know with certainty which build is actually running in production at a given time.
Storing production credentials directly as environment variables in a Compose file checked into a shared repository, rather than referencing an external secrets store.
Leaving the default, unbounded logging driver in place, risking a host disk fill during an incident that produces unusually verbose error logging.
Defining a health check that the container runtime checks but that nothing in the traffic-routing path actually consumes, leaving unhealthy replicas in rotation.
Updating every replica simultaneously during a deployment instead of rolling through them, turning a bad release into a full outage instead of a partial, quickly-rolled-back one.

Production environment configuration is where every configuration decision has the least margin for error: it should promote a known artifact unchanged, source secrets from a dedicated store, bound resource and log usage explicitly, and wire health status into real traffic decisions rather than leaving it as an unused signal.