On the Usage of Differential Evolution for Function
Optimization
                                              by Rainer Storn
    Siemens AG, ZFE T SN2, Otto-Hahn Ring 6, D-81739 Muenchen, Germany, currently on leave at ICSI,
                    1947 Center Street, Berkeley, CA 94704, storn@icsi.berkeley.edu
                      Abstract                          assumed unless otherwise stated. Basically, DE
Differential Evolution (DE) has recently proven to      generates new parameter vectors by adding the
be an efficient method for optimizing real-valued       weighted difference between two population
multi-modal objective functions. Besides its good       vectors to a third vector. If the resulting vector
convergence properties and suitability for              yields a lower objective function value than a
parallelization, DE's main assets are its               predetermined population member, the newly
conceptual simplicity and ease of use. This             generated vector replaces the vector, with which
paper describes several variants of DE and              it was compared, in the next generation;
elaborates on the choice of DE's control                otherwise, the old vector is retained. This basic
parameters which corresponds to the application         principle, however, is extended when it comes to
of fuzzy rules. Finally the design of a howling         the practical variants of DE. For example an
removal unit with DE is described to provide a          existing vector can be perturbed by adding more
real-world example for DE's applicability.              than one weighted difference vector to it. In most
                                                        cases, it is also worthwhile to mix the
1 Introduction                                          parameters of the old vector with those of the
Differential Evolution (DE) [1], [2] has proven to      perturbed one before comparing the objective
be a promising candidate for minimizing real-           function values. Several variants of DE which
valued, multi-modal objective functions. Besides        have proven to be useful will be described in the
its good convergence properties DE is very              following.
simple to understand and to implement. DE is
also particularly easy to work with, having only a
                                                        2 Scheme DE/rand/1
few control variables which remain fixed
                                                        For each vector x i ,G , i = 0,1,2,...,NP-1, a
throughout the entire minimization procedure.
                                                        perturbed vector v i ,G +1 is generated according
DE is a parallel direct search method which
                                                        to         v i,G +1 = x r1 ,G + F ⋅ ( x r2 ,G − x r 3,G )   (2)
utilizes NP D-dimensional parameter vectors
         xi,G, i = 0, 1, 2, ... , NP-1,      (1)        with    r1 , r2 , r3 ∈ 0, NP − 1 , integer and mutually
                                                        different, and F > 0.
as a population for each generation G, i.e. for
each iteration of the minimization. NP doesn't          The randomly chosen integers r1, r2 and r3 are
change during the minimization process. The             also chosen to be different from the running
initial population is chosen randomly and should        index i. F is a real and constant factor ∈ [0, 2]
try to cover the entire parameter space                 which controls the amplification of the differential
uniformly. As a rule, a uniform probability             variation ( x r2 ,G − x r3 ,G ) . Note that the vector
distribution for all random decisions will be
1
x   r1 ,G     which is perturbed to yield                                                 v    i ,G +1 has         no       where rand() is supposed to generate a random
relation to                      x   i ,G          but is a randomly chosen                                                 number ∈ [0,1):
population member.       Fig. 1 shows a two-                                                                                          L = 0;
dimensional example that illustrates the different
                                                                                                                                       do {
vectors which play a part in the vector-generation
                                                                                                                                             L = L + 1;
scheme. The notation: DE/rand/1 specifies that
the vector to be perturbed is randomly chosen,                                                                                         }while(rand()< CR)             AND     (L < D));
and that the perturbation consists of one
                                                                                                                            Hence the probability Pr(L>=ν) = (CR)ν-1, ν > 0.
weighted difference vector.
                                                                                                                            CR is taken from the interval [0, 1] and
 X1
                                                                                                                            constitutes a control variable in the design
              x N P Param e ter vectors from g eneration G
                N ewly generated param eter vector v                                                                        process. The random decisions for both n and L
                                     F( x r2 ,G - x r3 ,G )                                                                 are made anew for each newly generated vector
                                                                                              M INIM UM                     ui,G+1.
                                            x                         x         x
                                                          x
                x
                             x                                                                                              To decide whether or not it should become a
                                x
        x i,G                                                     x                                                         member of generation G+1, the new vector
             x r3 ,G x r ,G
                        2
                                     x
                                         x r1 ,G      x                                                                     ui,G+1 is compared to x i ,G . If vector ui,G+1
                                                                                x
                                                                                                                            yields a smaller objective function value than
                                                              x
                                                              v i,G + 1 = x r1 ,G
                                                                                                                            x i,G , then x i,G +1 is set to ui,G+1; otherwise, the
                                                                                    +   F ( x r2 ,G - x r3 ,G )
                                                                                                                    X0      old value    x i,G is retained.
Fig.1:          An example of a two-dimensional                                                                             3 Scheme DE/best/1
                objective function showing its contour
                                                                                                                            Basically, scheme DE/best/1 works the same
                lines and the process for generating
                                                                                                                            way as DE/rand/1 except that it generates the
                vi,G+1 in scheme DE/rand/1.
                                                                                                                            vector vi,G+1 according to:
                                                                                                                            v i,G +1 = x best ,G + F ⋅ ( x r1 ,G − x r2 ,G ) . (5)
In order to increase the potential diversity of the
                                                                                                                            This time, the vector to be perturbed is the best
perturbed parameter vectors, crossover is
                                                                                                                            performing vector of the current generation.
introduced. To this end, the vector:
                                                                                                                            Again, the computation of ui,G+1 is defined by
u i,G +1 = (u0i,G +1 , u1i,G +1 ,..., u( D −1)i,G +1 )                                                            (3)       eq. (4). This will be also be the case for the
                                                                                                                            remaining variants.
                                                                                                                            4 Scheme DE/best/2
with
             %Kv                                          j = n D , n + 1 D ,..., n + L − 1
                                                                                                                            Scheme DE/best/2 uses two difference vectors
            =&
                     ji, G +1               for                                                                         D
                                                                                                                            as a perturbation:
              K' x
u ji,G +1
                     ji, G                      for           all      other             j ∈[ 0, D − 1]                     v i,G +1 = x best ,G + F ⋅ ( x r1 ,G + x r2 ,G − x r3 ,G − x r4 ,G ) .   (6)
                                                                                                                  (4)
                                                                                                                            Due to the central limit theorem the random
is formed. The acute brackets                                                                     denote the
                                                                                              D                             variation is shifted slightly into gaussian direction
modulo function with modulus D. The starting                                                                                which seems to be beneficial for many functions.
index, n, in (4) is a randomly chosen integer from
                                                                                                                            5 Scheme DE/rand-to best/1
the interval [0,D-1]. The integer L, which denotes
the number of parameters that are going to be                                                                               Scheme       DE/rand-to-best/1   places   the
exchanged, is drawn from the interval [1, D].                                                                               perturbation at a location between a randomly
The algorithm which determines L works                                                                                      chosen population member and the best
according to the following lines of pseudo code                                                                             population member:
2
vi,G+1 = xi,G + λ ⋅ ( xbest,G − xi,G ) + F ⋅ ( xr2 ,G − xr3 ,G ) . (7)   crucial. The more knowledge one includes, the
                                                                         more likely the minimization is going to converge.
λ controls the greediness of the scheme. To
                                                                         The sum of error squares is not always a good
reduce the number of control variables we
                                                                         choice as it has the potential to hide the path to
usually set λ =  .
                                                                         the global minimum. To minimize the maximum
                                                                         error is often a better objective but seems to
6 Rules for the usage of DE                                              yield more local minima.
Since it's invention [1], DE's has been tested
extensively against artificial and real-world
                                                                         7 Design of a howling remover
minimization problems. So far, the following set
                                                                         In order to demonstrate DE's applicability to real-
of linguistic rules has emerged to be useful when
                                                                         world problems a howling removal unit has been
it comes to choose the control variables F, CR
                                                                         designed     with    DE.    In    modern      audio
and NP:
                                                                         communication       applications    hands      free
# At initialization the population should be spread                      environments are the current trend where
as much as possible over the objective function                          headsets are replaced by loudspeakers and
surface.                                                                 microphones. The preferred way of audio
                                                                         communication is full          duplex,   i.e.    all
 # Most often the crossover probability CR ∈ [0,
                                                                         loudspeakers and microphones are active as
1] must be considerably lower than one (e.g.
                                                                         opposed to half duplex or "walkie talkie" mode
0.3). If no convergence can be achieved,
                                                                         where only one party is allowed to talk at a time.
however, CR ∈ [0.8, 1] often helps.
                                                                         Howling is one of the problems in full duplex
# For many applications NP=10*D is a good                                communication and builds up due to the acoustic
choice. F is usually chosen ∈ [0.5, 1].                                  feedback path. One way to reduce howling is to
 # The higher the population size NP is chosen,                          frequency-shift the signal that is picked up by a
the lower one should choose the weighting factor                         microphone by 10Hz to 20Hz before it is sent to
F.                                                                       the other communicating parties. This shift is
                                                                         usually not perceived as unnatural by the human
# watching the parameters: it's a good                                   ear. The shifted signal appears at the destination
convergence sign if the parameters of the best                           loudspeakers and travels back to the originator,
population member change a lot from generation                           shifted by another 10Hz to 20Hz. The signal
to generation, especially at the beginning of the                        travels many times through this acoustic path
minimization and even if the objective function                          and is quickly shifted out of band, thus reducing
value of the best population member decreases                            the feedback problems. Fig. 2 shows the block
slowly.                                                                  diagram of the howling removal unit.
 # watching the objective function: it is not                                                 B and pa ss                             D ow nsam p ler
necessarily bad, if the objective function value of                      xk         4             BP                        LP               4          yk
the best population member exhibits plateaus
                                                                              U p sam p ler                      Δω π n   L ow pass
during the minimization process. However, it is                                                         cos(1-     )
                                                                                                                 ωA 2
an indication that the minimization might take a                         Fig. 2: Howling removal unit.
long time or that the increase of the population
                                                                         The upsampler fills in three zero samples
size NP might be beneficial for convergence.
                                                                         between the adjacent signal samples xk and
# The objective function value of the best                               xk+1. The bandpass, which operates at four
population member shouldn't drop too fast,                               times the sampling frequency ωA, retains the
otherwise the optimization might get stuck in a                          components of the spectrum which are
local minimum.                                                           sopposed to be frequency-shifted by Δω. The
# The proper choice of the objective function is                         actual shift is performed via multiplication with an
3
appropriate cosine signal. The lowpass removes            scheme the magnitude response of the
some artifacts which appear due to the shifting           corresponding filter   was sampled in the
operation, and the downsampler takes every                frequency domain. The number of samples that
fourth sample of the lowpass result to yield the          were used are indicated in figs. 3 and 4 which
output signal yk at the original sampling                 also show the final results of the design
frequency.                                                procedure.
                                                                        1.2
One of the most important features of the                                                        10 sam ples            4 0 s am p les      10 sam ples
                                                                                                 tra ns itio n b an d   p as s ba nd     transition b and
howling remover is low computational complexity
                                                                         1
                                                                                                            1.0 1
as the unit has to operate in a real-time
                                                                                                            0.99
environment. Therefore it was crucial to design a                       0.8
                                                                                                                                           10 sam ples
                                                          M agnitude
                                                                                           20 sam ples
bandpass as well as a lowpass with minimum
                                                                                           stop b and                                      sto p b an d
degree. To this end the bandpass was chosen to                          0.6
be a recursive digital filter (IIR-filter) the transfer
                                                                        0.4
function of which had to meet specific magnitude
constraints defined by a tolerance scheme. The                          0.2
lowpass was designed as a transversal filter                                        0.01                                                        0.01
(FIR-filter) without the usual linear phase                              0
                                                                              0             0.1            0.2       0.3      0.4                         0.5
requirement. Also the lowpass had to meet                                                                Normalized Frequency
magnitude specifications defined by a tolerance
                                                          Fig. 3: Magnitude response of the bandpass after
scheme.
                                                          the design process.
The objective function in both cases was defined
to be the maximum deviation from the
                                                                        1.2
corresponding tolerance scheme or to be one of                                    1.005
the following penalty terms pk, whichever value                          1
was greater:                                                                      0.995
                                                          M agnitud e
                                                                        0.8
a) For all parameters par[i]:
    p1 = 20, 000 + par[i]*100 if           par[i] < 0                   0.6
                                                                                  40 sam ples                                   20 sam ples
                                                                                  pass band         10 sam ples                  stop band
b) For all par[i] denoting the radius of a zero [3]                     0.4
                                                                                                      transition
of the transfer function:                                                                             band
                                                                        0.2
                100                                                                                                     0.01                     0.001
p2 = 2 +                         if   par[i ] < 1
           par[i ] + 1. e − 7                                            0
                                                                             0             0.1              0.2           0.3             0.4             0.5
                                                                                                  Normalized Frequency
c) For all p[i] denoting a zero angle:
                                                          Fig. 4: Magnitude response of the lowpass after
p3 = 2 + par[i ]* 100      if    par[i ] < 0. 5           the design process.
d) For all p[i] denoting a pole radius:
p4 = 2 + par[i ]* 100      if    par[i ] >= 1             The bandpass filter result of fig. 3 was obtained
                                                          using strategy DE/best/2 with NP=300, F=0.5
e) For all p[i] denoting a pole angle:                    and CR=1. The entire design took 2,210,400
                                                          evaluations of the bandpass transfer function. A
p5 = 2 + par[i ]* 100      if
                                                          total of 24 parameters was used, 6 zero radii, 6
par[i ] ∈" stopband      of     filter "                  zero angles, 6 pole radii and 6 pole angles in the
                                                          complex z-plane [3]. The overall gain constant a0
To determine the deviation from the tolerance
4
was set to 0.005. The magnitude response of the                                                            π "# .
filter still violates some parts of the tolerance
                                                                                            interval
                                                                                                          ! 2$
                                                                                                       x ∈ 0,        The strategy used was
scheme slightly, yet the design was satisfactory                                            DE/best/1 with NP=20, F=0.9 and CR=1. It took
for the howling remover.                                                                    30,020 function evaluations to get the result of
The lowpass filter result of fig. 4 was obtained                                            fig. 5. The final speed increase of the cosine
using strategy DE/best/2 with NP=200, F=0.5                                                 function computed by opti(x) was 17% compared
and CR=1. The entire design took 83,800                                                     to the library function cos(x).
evaluations of the lowpass transfer function. A
total of 16 parameters was used, 8 zero radii and                                           Conclusion
8 zero angles in the complex z-plane [3]. The                                               Several variants of Differential Evolution (DE)
overall gain constant a0 was set to 0.005.                                                  have been introduced and general hints about
The third optimization for the howling remover                                              their usage have been provided. Three real-world
was concerned with the cosine function the                                                  design tasks appearing in the development of a
evaluation of which takes up a non-neglectable                                              howling remover for audio communications have
                                                                                            been solved successfully by applying DE. All
                                                                                 0, π "#
amount of computing time. Hence an
                                                                                            three design tasks could have been performed
approximation of cos(x) in the interval
                                                                                 ! 2$       with specialized design tools; the advantage of
                                                                                            using DE, however, was that neither specialized
was performed using a polynomial opti(x) of third
                                                                                            and most probably expensive tools nor expert
degree. Fig. 5 shows that opti(x) yields an
                                                                                            knowledge concerning the design tasks
improved approximation compared to the taylor
                                                                                            themselves was necessary.
polynomial taylor(x) of third degree.
                            1
                                                                                            References
                          0.8                       opti(x)=0.9975575805                    [1] Storn, R. and Price, K., Differential Evolution
cos(x), taylor(x), p(x)
                                                            +0.03400468081*x
                          0.6                               - 0.6044035554*x^2
                                                                                                - a simple and efficient adaptive scheme for
                          0.4                               +0.1129638031*x^3                   global optimization over continuous spaces,
                          0.2                                                                   Technical Report TR-95-012, ICSI,
                            0                                                                   http://http.icsi.berkeley.edu/~storn/litera.html
                          -0.2
                          -0.4                                                              [2] Storn, R. and Price, K., Minimizing the real
                          -0.6                                         cos(x)                   functions of the ICEC'96 contest by
                                    taylor(x) = 1 + 0.5*x^2
                          -0.8                                                                  Differential Evolution, Int. Conf. on
                           -1
                                0       0.5     1      1.5     2     2.5         π              Evolutionary Computation, Nagoya, Japan.
                                                         x
                                                                                            [2] Mitra, S.K. and Kaiser, J.F., Handbook for
Fig. 5: Approximation of cos(x) by means of a                                                   digital signal processing, John Wiley, 1993.
polynomial of third order.
The optimization of the coefficients in opti(x) was
performed by minimization of the sum of error
squares obtained by 100 sampling points in the