14.2.1.5 Production Drift Control

A focused guide to Production Drift Control, connecting core concepts with practical Docker and container operations.

Production drift control addresses the gap that opens over time between the deployment manifest an operator believes is running and the actual state of the containers, networks, and volumes on a production host. That gap grows through manual fixes, partial rollouts, and one-off changes made directly against a live system instead of through the deployment pipeline.

How drift happens

Drift rarely arrives through a single dramatic event. It accumulates through small, individually reasonable actions: an operator restarts a container with an extra flag to work around an incident, a manual docker exec changes a running config file to unblock something urgent, or a rollback is performed by hand instead of by redeploying a previous manifest version. Each action solves an immediate problem while quietly invalidating the assumption that the manifest in version control describes what is actually running.

docker run -d --name my-api -p 8080:80 --memory=2g my-api:1.4.0

If the manifest in version control still says --memory=512m, the host now has a real, running difference from its documented configuration, and nothing forces that difference to be reconciled or even recorded.

Detecting drift directly

Comparing a running container's actual configuration against its source manifest is the most direct way to surface drift:

docker inspect my-api --format '{{json .HostConfig}}' | jq '{Memory: .Memory, CpuShares: .CpuShares}'
docker inspect my-api --format '{{json .Config.Env}}' | jq .

Running this against every production container on a schedule, and diffing the result against the expected values derived from the Compose or stack file, turns drift detection from an occasional manual audit into a continuous, automated check.

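docker compose -f docker-compose.yml -f docker-compose.production.yml config > expected.yml
docker inspect $(docker compose -f docker-compose.yml -f docker-compose.production.yml ps -q) > actual.json
# expected.yml is YAML and actual.json is JSON: compare selected fields rather than diffing the two files directly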
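
A scheduled check of this kind could be sketched in shell roughly as follows. The function names to_bytes and report_memory_drift are hypothetical, the expected value 512m is taken from the earlier example, and the sketch covers only the memory limit; a real check would extend to whichever fields the manifest pins.

```shell
#!/bin/sh
# Convert a Docker-style size such as 512m or 2G to bytes.
to_bytes() {
  value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  number=${value%[bkmg]}   # strip one trailing unit letter, if present
  case "$value" in
    *k) echo $((number * 1024)) ;;
    *m) echo $((number * 1024 * 1024)) ;;
    *g) echo $((number * 1024 * 1024 * 1024)) ;;
    *b) echo "$number" ;;
    *)  echo "$value" ;;   # already a plain byte count
  esac
}

# Report drift when the actual memory limit differs from the expected one.
# $1: container name, $2: expected limit (e.g. 512m), $3: actual limit in bytes
report_memory_drift() {
  expected_bytes=$(to_bytes "$2")
  if [ "$expected_bytes" = "$3" ]; then
    echo "ok $1"
  else
    echo "DRIFT $1 expected=$expected_bytes actual=$3"
    return 1
  fi
}

# Usage on a Docker host (container name and expected value are assumptions):
# actual=$(docker inspect my-api --format '{{.HostConfig.Memory}}')
# report_memory_drift my-api 512m "$actual"
```

Because the comparison is kept separate from the docker call, the same function can be fed values extracted from the rendered Compose configuration once a parser for it is chosen.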

Image drift

Drift also occurs at the image level, when a running container's image no longer matches the digest recorded in the manifest, often because someone ran docker pull and restarted a container manually rather than going through the deployment pipeline:

docker inspect my-api --format '{{.Image}}'
docker inspect registry.example.com/my-api:1.4.0 --format '{{.Id}}'

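If these two values diverge, the container running in production is not the artifact the manifest claims it is, which undermines any confidence that what was tested is what is serving traffic. Both values are local image IDs, however: if someone has pulled a newer image under the same tag, the tag's ID moves with it and the two can still agree. Comparing against a digest pinned in the manifest, visible through the image's RepoDigests field, is the more reliable check.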
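
A digest-based version of this check might look like the sketch below. It assumes the manifest pins the image as a repository@sha256 reference; the helper names has_digest and check_image_drift are hypothetical, and the digest is deliberately left as a placeholder.

```shell
#!/bin/sh
# Succeed when the expected reference appears on stdin (one reference per line).
has_digest() {
  grep -Fxq -- "$1"
}

# Compare the image behind a running container with the digest pinned in the manifest.
# $1: container name
# $2: expected reference, e.g. registry.example.com/my-api@sha256:<digest>
check_image_drift() {
  image_id=$(docker inspect "$1" --format '{{.Image}}') || return 2
  if docker image inspect "$image_id" \
       --format '{{range .RepoDigests}}{{println .}}{{end}}' | has_digest "$2"; then
    echo "ok $1"
  else
    echo "DRIFT $1 is not running $2"
    return 1
  fi
}

# Usage on a Docker host:
# check_image_drift my-api 'registry.example.com/my-api@sha256:<digest>'
```

Images built locally and never pushed have no RepoDigests entries, so this sketch reports them as drift, which is usually the desired outcome on a production host.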

Declarative deployment as the primary defense

The most effective way to prevent drift is to make the declarative manifest the only path by which production state is ever changed, rather than treating it as documentation that may or may not reflect reality:

docker compose -f docker-compose.yml -f docker-compose.production.yml up -d --remove-orphans

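The --remove-orphans flag is itself a small drift-control mechanism: it removes containers that belong to the same Compose project but whose services are no longer defined in the current manifest, rather than leaving them running indefinitely as undocumented leftovers. It does not touch containers started by hand with docker run, which have to be found separately.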
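
Containers that were started outside Compose altogether can be listed with a sketch like the following. It relies on the com.docker.compose.project label that Compose sets on the containers it creates; the function name unmanaged_containers and the "|" separator are choices made for this example.

```shell
#!/bin/sh
# Print the names of containers that carry no Compose project label.
# Reads "name|project" lines on stdin; an empty project means the
# container was not created by Compose.
unmanaged_containers() {
  awk -F '|' '$2 == "" { print $1 }'
}

# Usage on a Docker host:
# docker ps --format '{{.Names}}|{{.Label "com.docker.compose.project"}}' | unmanaged_containers
```

Anything this prints on a host that is supposed to be managed entirely through the manifest is a candidate for either removal or a proper service definition.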

Treating manual intervention as temporary by construction

Incidents will sometimes require a fast manual fix that cannot wait for a full deployment pipeline run. Drift control does not mean forbidding this; it means treating the manual fix as inherently temporary and immediately following it with the equivalent permanent change pushed through the normal pipeline:

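# 1. Immediate manual fix on the host (if a swap limit is already set, raise --memory-swap in the same command)
docker update --memory=2g --memory-swap=4g my-api
# 2. Equivalent permanent change in the production Compose file
services:
  api:
    deploy:
      resources:
        limits:
          memory: 2G
# 3. Commit and redeploy through the normal pipeline
git commit -am "Increase production memory limit after incident #482"
docker compose -f docker-compose.yml -f docker-compose.production.yml up -d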

The manual docker update resolves the immediate incident; the subsequent commit and redeploy through the pipeline is what actually closes the drift gap it would otherwise have left open.

Immutable infrastructure as a stronger guarantee

Where feasible, replacing containers entirely rather than mutating them in place removes an entire category of drift, since a container that is never modified after creation cannot accumulate undocumented changes:

docker compose -f docker-compose.yml -f docker-compose.production.yml up -d --force-recreate

Forcing recreation on every deployment, rather than letting Compose leave an existing container running when its definition appears unchanged, ensures that whatever the container's writable layer and runtime settings have accumulated since creation is discarded and replaced with a fresh instance built from the current manifest. State kept in named volumes or bind mounts survives recreation, so it still needs its own review.
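
Before a container is recreated, docker diff shows what has changed in its filesystem since creation, which gives a rough measure of how much in-place mutation has occurred. The sketch below filters that output; the function name unexpected_changes and the choice of /tmp and /run as paths allowed to change are assumptions for illustration.

```shell
#!/bin/sh
# Filter `docker diff` output (lines such as "C /etc/app.conf") down to
# changes outside paths that are expected to change at runtime.
unexpected_changes() {
  awk '{
    path = $2
    if (path == "/tmp" || path ~ /^\/tmp\// || path == "/run" || path ~ /^\/run\//) next
    print
  }'
}

# Usage on a Docker host:
# docker diff my-api | unexpected_changes
```

An empty result suggests the container has stayed close to its image; any remaining lines point at files that were added, changed, or deleted by hand or by the application itself.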

Auditing changes outside the pipeline

Drift control benefits from visibility into what changed on a production host and when, independent of whether that change went through the deployment pipeline:

docker events --filter event=start --filter event=update --since 24h

Without --until, this command replays the buffered events from the last 24 hours and then keeps streaming new ones until interrupted. The event stream records the action and the container, not the person who ran the command. Capturing it and correlating it against deployment pipeline logs surfaces any action that touched production outside the expected path, which is the first step toward either preventing it in the future or formalizing it as a legitimate, pipeline-driven change.
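
The correlation step could be sketched as below, assuming the pipeline can export the start and end of each deployment as a pair of Unix timestamps. The function name events_outside_windows and the file deploy-windows.txt are hypothetical, and the --format fields in the usage comment are an assumption about the event template that should be checked against the installed Docker version.

```shell
#!/bin/sh
# Print events that fall outside every deployment window.
# $1: file with one "start end" pair of Unix timestamps per line
# stdin: events as "unix_time action container_name"
events_outside_windows() {
  awk -v windows="$1" '
    BEGIN {
      # Load the deployment windows once, forcing numeric values.
      while ((getline line < windows) > 0) {
        split(line, w, " ")
        n++
        s[n] = w[1] + 0
        e[n] = w[2] + 0
      }
    }
    {
      t = $1 + 0
      for (i = 1; i <= n; i++) if (t >= s[i] && t <= e[i]) next
      print
    }'
}

# Usage on a Docker host (bounded with --until so the command terminates):
# docker events --since 24h --until "$(date +%s)" \
#   --filter type=container --filter event=start --filter event=update \
#   --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
#   | events_outside_windows deploy-windows.txt
```

Every line that survives the filter is a start or update that no recorded deployment accounts for.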

Periodic full reconciliation

Beyond continuous spot checks, a scheduled full reconciliation pass is a strong way to confirm that the manifest genuinely describes everything needed to reconstruct the running system from scratch. During a planned maintenance window, tear down and recreate the entire production stack from its current manifest:

docker compose -f docker-compose.yml -f docker-compose.production.yml down
docker compose -f docker-compose.yml -f docker-compose.production.yml up -d

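If this sequence fails to fully restore expected behavior, the manifest itself has gaps that incremental manual fixes had been silently compensating for. Note that down keeps named volumes unless -v is passed, so this pass exercises the configuration, not data restoration.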
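
A first automated signal after the recreate is whether every service defined in the manifest is actually running again. A minimal sketch, with the hypothetical helper name missing_services:

```shell
#!/bin/sh
# Print every service that is expected but not running.
# $1: newline-separated list of expected services
# $2: newline-separated list of running services
missing_services() {
  printf '%s\n' "$1" | while IFS= read -r svc; do
    [ -n "$svc" ] || continue
    printf '%s\n' "$2" | grep -Fxq -- "$svc" || printf '%s\n' "$svc"
  done
}

# Usage on a Docker host:
# files="-f docker-compose.yml -f docker-compose.production.yml"
# expected=$(docker compose $files config --services)
# running=$(docker compose $files ps --services --filter status=running)
# missing_services "$expected" "$running"
```

An empty result only shows that the containers came back up; whether they behave as expected still has to be confirmed by the application's own health checks.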

Common mistakes

  • Allowing docker update, docker exec, or manual restarts with extra flags to become a permanent fix rather than a temporary bridge to a proper manifest change.
  • Trusting the deployment manifest as accurate without ever comparing it against the actual running configuration on the host.
  • Restarting containers manually after a docker pull instead of going through the deployment pipeline, silently decoupling the running image from the one the manifest specifies.
  • Never performing a full teardown-and-recreate reconciliation, leaving latent gaps in the manifest undiscovered until an unrelated failure forces a full rebuild under pressure.
  • Treating drift control as a one-time cleanup exercise instead of a continuous, automated check, allowing the same gap to reopen shortly after it was closed.

Production drift control works by making the declarative manifest the sole legitimate path to changing production state, detecting and closing the gap quickly when an exception is unavoidable, and periodically proving the manifest's completeness through full reconciliation rather than assuming it remains accurate indefinitely.