17.3.2.5 Production Resource Discipline

A focused guide to Production Resource Discipline, connecting core concepts with practical Docker and container operations.

Production resource discipline is the practice of periodically reviewing configured resource limits against observed usage and adjusting them when the two diverge. It is distinct from the one-time decision to set explicit limits in the first place: a limit that was correctly sized when first configured can become poorly matched to reality as an application's behavior, traffic, and dependencies evolve.

Why initial sizing alone is insufficient

A resource limit set once reflects only the usage pattern observed at initial deployment. An application's resource needs typically change as features are added, traffic grows, or dependencies are upgraded, so a limit that is never revisited gradually drifts away from what the application needs:

docker run -d --name my-api --memory=512m --cpus=1 my-api

A limit set this way six months ago and never reviewed may now be significantly mismatched to the application's current behavior. It may be too conservative, causing avoidable CPU throttling or OOM kills, or too generous, wasting capacity that could otherwise be allocated to other workloads sharing the same host.

Establishing a regular review cadence

Scheduling a routine review of actual resource usage against configured limits keeps configuration aligned with reality on a predictable basis, instead of revisiting limits only after an incident draws attention to a mismatch:

docker stats --no-stream --format "{{.Name}}: {{.CPUPerc}} {{.MemPerc}}"

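# PromQL, assuming cAdvisor metrics are scraped by Prometheus: average CPU cores used over the past 7 days
rate(container_cpu_usage_seconds_total{name="my-api"}[7d])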

Reviewing usage trends over a representative recent period, and adjusting configured limits accordingly, treats resource sizing as ongoing maintenance rather than a decision made once and never reconsidered. A weekly or monthly cadence is typical, depending on how quickly a given service's behavior tends to change.
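One way to make such a review repeatable is a small helper that compares a measured peak against the configured limit and flags services worth a closer look. The following is a minimal sketch under stated assumptions: the function name review_limit and the 85% and 50% thresholds are hypothetical illustrations, not values from this guide, and the peak is expected to come from your own monitoring history.

```shell
# review_limit PEAK LIMIT
# Both arguments use the same unit (for example MiB).
# Prints the peak as a percentage of the limit plus a suggested action.
# The 85 and 50 thresholds are illustrative assumptions; tune them per service.
review_limit() {
  awk -v peak="$1" -v limit="$2" 'BEGIN {
    if (limit <= 0) { print "no limit configured"; exit }
    pct = peak * 100 / limit
    if (pct >= 85)      verdict = "tight: consider raising"
    else if (pct <= 50) verdict = "generous: consider lowering"
    else                verdict = "ok"
    printf "%.0f%% of limit, %s\n", pct, verdict
  }'
}

# Example: a 380 MiB peak against a 512 MiB limit.
review_limit 380 512   # prints: 74% of limit, ok
```

The verdict is only a prompt for a human decision; whether a "tight" service gets a higher limit or a code fix still depends on why its usage grew.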

Avoiding both over-provisioning and under-provisioning

Excessively generous limits waste capacity that could otherwise support additional workloads on the same host, while excessively conservative limits cause unnecessary throttling or restarts during legitimate, expected load. Sizing limits with a deliberate, reasonable margin above observed peak usage balances headroom for genuine spikes against efficient use of shared host capacity:

docker stats --no-stream --format "{{.MemUsage}}" my-api

peak observed over the review period: 380 MiB
configured limit: 512 MiB (about 35% headroom above peak)

A documented, deliberate margin like this, derived from actual measured peak usage rather than an arbitrary round number, gives a clear, revisitable basis for the configured limit that can be checked against future measurements as the service evolves.
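The arithmetic behind a margin like this is simple enough to script, which makes it easy to repeat at each review. The sketch below is illustrative only: headroom_pct and limit_for are hypothetical helper names, and limit_for assumes whole-number inputs in a single unit such as MiB.

```shell
# headroom_pct PEAK LIMIT
# Prints how far the limit sits above the measured peak, as a percentage of the peak.
headroom_pct() {
  awk -v peak="$1" -v limit="$2" 'BEGIN { printf "%.1f\n", (limit - peak) * 100 / peak }'
}

# limit_for PEAK MARGIN_PCT
# Prints the smallest whole-unit limit that leaves MARGIN_PCT headroom above PEAK.
# Integer arithmetic only, so both arguments must be whole numbers.
limit_for() {
  echo $(( ($1 * (100 + $2) + 99) / 100 ))
}

headroom_pct 380 512   # prints 34.7
limit_for 380 35       # prints 513
```

A strict 35% margin above 380 works out to 513, so a 512 limit corresponds to roughly 35% headroom, not exactly 35%.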

Accounting for host-level oversubscription

When many containers share one host, the sum of their configured limits can reasonably exceed the host's total capacity. This is oversubscription, and it works because containers rarely all hit peak usage at the same moment. Managing it deliberately, with an understanding of the risk being accepted, is different from oversubscribing by accident without having considered the consequence:

docker stats --no-stream --format "{{.Name}}: {{.MemPerc}}"

Reviewing whether containers on a shared host tend to peak simultaneously or at different times informs how much oversubscription is genuinely safe; a host where every container's peak usage correlates with the same traffic pattern carries more risk from oversubscription than one where different containers peak at genuinely independent times.
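To put a number on how oversubscribed a host is, the configured memory limits can be summed and divided by host memory. The following is a rough sketch, not a definitive tool: oversubscription_ratio is a hypothetical helper name, and the commented pipeline assumes the .HostConfig.Memory field of docker inspect and the .MemTotal field of docker info, both reported in bytes.

```shell
# oversubscription_ratio HOST_BYTES
# Reads one per-container memory limit (in bytes) per line on stdin and
# prints the sum of those limits divided by HOST_BYTES.
# docker inspect reports 0 for a container with no memory limit, so those
# containers are counted separately and not added to the sum.
oversubscription_ratio() {
  awk -v host="$1" '
    $1 > 0  { sum += $1 }
    $1 == 0 { unlimited++ }
    END { printf "ratio=%.2f unlimited=%d\n", sum / host, unlimited }
  '
}

# Against a live host (not run here):
#   docker ps -q | xargs docker inspect --format '{{.HostConfig.Memory}}' \
#     | oversubscription_ratio "$(docker info --format '{{.MemTotal}}')"

# Self-contained example: three 512 MiB limits and one unlimited container
# on a host with 1 GiB of memory.
printf '%s\n' 536870912 536870912 536870912 0 | oversubscription_ratio 1073741824
# prints: ratio=1.50 unlimited=1
```

A ratio above 1.00 means the host is oversubscribed on memory; any unlimited container makes the ratio an underestimate of the real exposure.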

Detecting and addressing noisy neighbor patterns

A container with disproportionately high or unpredictable resource usage relative to its importance can degrade every other workload sharing the same host. Identifying this pattern specifically, instead of reacting to a generic "the host feels slow" complaint, directs remediation toward the container that is actually responsible:

docker stats --no-stream

Comparing usage across every container on a host periodically, not just when a problem is already suspected, surfaces a developing noisy neighbor pattern before it escalates into a more serious, host-wide degradation affecting multiple unrelated services simultaneously.
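Ranking containers by usage makes the comparison quicker to read than the raw table. The sketch below assumes input lines of the form "NAME CPU%"; rank_by_cpu is a hypothetical helper name, and the sample container names and figures are made up for illustration.

```shell
# rank_by_cpu [N]
# Reads "NAME CPU%" lines on stdin and prints the N highest CPU consumers
# (default 3). The numeric sort reads the leading number and ignores the
# trailing percent sign.
rank_by_cpu() {
  sort -k 2,2nr | head -n "${1:-3}"
}

# Against a live host (not run here):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | rank_by_cpu 5

# Self-contained example with made-up readings:
printf '%s\n' 'api 12.50%' 'worker 250.10%' 'cache 3.20%' | rank_by_cpu 2
# prints:
#   worker 250.10%
#   api 12.50%
```

A single snapshot can mislead, so a container that tops this ranking across several reviews is a stronger noisy neighbor candidate than one that appears once.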

Documenting the reasoning behind a configured limit

Recording why a specific limit was chosen, not just what the value is, preserves the context needed for a future review to judge whether that reasoning still holds, rather than requiring the next reviewer to reconstruct the original justification from scratch:

services:
  api:
    deploy:
      resources:
        limits:
          memory: 512M  # sized for about 35% headroom above 380 MiB measured peak, 2024-06

A brief, dated comment like this gives a future reviewer immediate context for whether the current limit's underlying justification remains accurate, or whether enough time and change have passed that a fresh measurement and reconsideration are warranted.

Common mistakes

Setting resource limits once at initial deployment and never revisiting them as the application's actual behavior and load characteristics evolve.
Configuring limits with no deliberate, documented margin above observed peak usage, leaving the actual reasoning behind a specific value unclear to anyone reviewing it later.
Oversubscribing host capacity across many containers without having considered or deliberately accepted the actual risk that decision carries.
Reacting to noisy neighbor problems only after they have already caused a noticeable, host-wide degradation rather than monitoring for the pattern proactively.
Not documenting the reasoning behind a specific configured limit, leaving a future reviewer to reconstruct the original justification from scratch or simply leave a potentially outdated value unexamined.

Production resource discipline treats configured limits as living configuration that requires periodic, deliberate review against measured usage, not as a one-time decision. It balances headroom against efficient use of shared capacity, and it documents the reasoning behind each value clearly enough that a future review can judge whether that reasoning still holds.