Statistical Computing
Prof. Dr. Matei Demetrescu
                                          Summer 2019
Today’s outline
                             Combinatorial optimization
 1    Heuristic searches
 2    Simulated annealing and genetic algorithms
 3    Tabu algorithms
 4    Up next
Heuristic searches
The basic setup
 We have the same goal:
         maximize f (θ) w.r.t. θ = (θ1 , . . . , θp ), where θ ∈ Ω, but
         Ω consists of a finite number N of possible solutions.
 Sounds simple, but N may be large.
Example: Model selection
 Given a dependent variable Y and a set of predictors Xj ,
     we search for the best model of the form Y = β0 + βi1 Xi1 + . . . + βis Xis + ε,
         where {i1 , . . . , is } is a subset of {1, . . . , p}; always with intercept.
         Of course, each θi is coded suitably (here, 0/1 for exclusion/inclusion).
         What “best” means may be defined in various ways, say by the
         Akaike information criterion (AIC). (Recall, R2 is not suitable.)
 Ideally, regressor selection requires optimization over N = 2^p candidate
 models!
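
 To make the target concrete, here is a minimal sketch (in Python, which these
 slides do not prescribe) of evaluating one candidate model: the Gaussian AIC,
 up to an additive constant, of an OLS fit on a chosen subset of regressors.
 All function and variable names here are illustrative assumptions, not part of
 the course material.

    import numpy as np

    def aic_of_subset(y, X, subset):
        """Gaussian AIC (up to an additive constant) of an OLS fit of y on the
        columns of X indexed by `subset`, always including an intercept."""
        n = len(y)
        Z = np.column_stack([np.ones(n), X[:, list(subset)]])   # design matrix with intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)             # OLS coefficients
        rss = float(np.sum((y - Z @ beta) ** 2))                 # residual sum of squares
        k = Z.shape[1] + 1                                       # coefficients + error variance
        return n * np.log(rss / n) + 2 * k

 A candidate θ then corresponds to the index set {j : θj = 1}, and smaller AIC
 is better (so one may equivalently maximize −AIC).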
More generally
 Practitioners frequently have to deal with discrete optimization problems
 where Ω has 2^p or even p! elements.
 E.g. a likelihood function may depend on
           so-called configuration parameters θ
                 which describe the form of a statistical model and
                 for which there are many discrete choices
           and other parameters which are easily optimized for each
           configuration.
 In such cases, we may view f (θ) as the (log) profile likelihood of a given
 configuration, θ.¹
 The solution space Ω is obtained by combining p different features;
 optimizing w.r.t. θ is demanding...
      ¹ I.e., the highest likelihood attainable using that configuration.
Just to get an idea
 Parsing many possible solutions is not always feasible:
                     Time to solve a problem of order . . .
          p        O(p^2)           O(2^p)             O(p!)
         20        1 minute         1 minute           1 minute
         21        1.10 minutes     2 minutes          21 minutes
         25        1.57 minutes     32 minutes         12.1 years
         30        2.25 minutes     17.1 hours         207 million years
         50        6.25 minutes     2042.9 years       2.4×10^40 years
 Bottom line: Many combinatorial optimization problems are inherently too
 difficult to solve exactly by traditional means.
A trade-off
 So let’s avoid algorithms
         that will find the global maximum, but
         will take forever to do so.
 An algorithm that finds a good local maximum in reasonable time is
 clearly to be preferred.
 Heuristic algorithms explicitly trade off global optimality for speed.
Local neighbourhood search
 Let’s stay iterative (what else?):
         we improve a current candidate solution, but
         do so in a neighborhood of the current candidate, say N (θ (t) ).
         The update θ (t) ↦ θ (t+1) ∈ N (θ (t) ) is a local move (step).
 Don’t forget to add a stopping rule.
 Note: Since the neighbourhood changes with each move, we do, over time,
 search larger subsets of Ω.
Choice of neighborhood
 The choice of neighbourhood influences performance:
           So keep it simple: e.g., define N (θ (t) ) by allowing k changes to the
           current candidate solution.²
           Large k is time-consuming, but small k may miss improvements.
           There is no universal answer on how to pick it.
 In the model selection problem,
           θ (t) is a binary vector coding the inclusion/exclusion of each X;
           a 1-neighborhood is the set of models that either add or eliminate one
           regressor from θ (t) (i.e. flip one element of θ); see the sketch below.
      ² This defines a so-called k-neighborhood, and any move is called a k-change.
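
 As a small illustration of the 1-neighbourhood under the binary coding above
 (representing θ as a 0/1 tuple is an assumption of this sketch, not a
 requirement of the method):

    def one_neighbourhood(theta):
        """All 1-changes of a 0/1 inclusion vector: flip exactly one element."""
        neighbours = []
        for i in range(len(theta)):
            flipped = list(theta)
            flipped[i] = 1 - flipped[i]          # add or drop the i-th regressor
            neighbours.append(tuple(flipped))
        return neighbours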
The moves
 You may come up with many ways of updating the current candidate; you
 typically want an upward move:
         Simple ascent: θ (t+1) is the first better θ in N (θ (t) ).
         Steepest ascent: θ (t+1) is the best θ in N (θ (t) ).
 Then, update the neighbourhood – and repeat.
 Different pros and cons; again, no universally best choice...
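
 A minimal sketch of steepest ascent built on the neighbourhood above (for
 simple ascent one would accept the first improving neighbour instead); the
 names and the iteration cap are illustrative choices:

    def steepest_ascent(theta0, objective, neighbourhood, max_iter=50):
        """Move to the best neighbour as long as it improves the objective
        (higher is better); stop at a local maximum or after max_iter moves."""
        theta, value = theta0, objective(theta0)
        for _ in range(max_iter):
            best = max(neighbourhood(theta), key=objective)
            if objective(best) <= value:         # no uphill move left -> local maximum
                break
            theta, value = best, objective(best)
        return theta, value

 For the model selection example one could take, say,
 objective = lambda theta: -aic_of_subset(y, X, [j for j, b in enumerate(theta) if b]).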
Random starts
 Be smart (for a given computational budget):
         Select many starting points at random,
         run simple/steepest ascent from each point, and
         choose the best solution as the final answer.
 Works quite well in practice, actually.
 Worth emphasizing: The random starts strategy can be combined with any
 optimization method!
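
 A random-starts wrapper around the ascent sketch above might look as follows
 (the number of starts and the seed are arbitrary illustrations):

    import random

    def random_starts(objective, neighbourhood, p, n_starts=25, seed=1):
        """Run steepest ascent from several random 0/1 starting vectors of
        length p and keep the best local maximum found."""
        rng = random.Random(seed)
        best_theta, best_value = None, float("-inf")
        for _ in range(n_starts):
            start = tuple(rng.randint(0, 1) for _ in range(p))   # random candidate model
            theta, value = steepest_ascent(start, objective, neighbourhood)
            if value > best_value:
                best_theta, best_value = theta, value
        return best_theta, best_value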
Example: Model selection, p = 27 predictors
 [Figure]
 Results of random-starts local search, 1-change steepest ascent, 15
 iterations max per start. (AIC ranges between −300 and −420.)
Example: Regression with 27 predictors cont’d
                              Predictors selected by various searches
  Method          1 2 3 6 7 8 9 10 12 13 14 15 16 18 19 20 21 22 24 25 26              AIC
  LS   (2,4)          •••         •      •   • • • •                  • • • −416.95
  LS   (1)             ••         •      • • • •                • • •       −414.60
  LS   (3)             ••         •      •   • •                • • • • •   −414.16
  LS   (5)             •          ••     •   • •              • • • •       −413.52
  S-Plus    •            •• •            •        • • • •               • • • −416.94
  Efroymson              • ••            •        • •     •             •   • −400.16
 Random starts local search [LS] model selection results using 1-change
 steepest ascent, as well as two standard procedures (greedy stepwise
 ascent). (Bullets indicate inclusion of predictor j.)
Local search with variable neighborhoods
 For standard ascent algorithms,
         entrapment in a globally uncompetitive local maximum is inevitable
         because no downhill moves are allowed, and
         (many) random starts are mandatory.
 We may want to
         break out of a bad neighbourhood, or
         move towards other (more promising) neighbourhoods.
 Of course, while ensuring that you don’t slide back immediately...
Simulated annealing and genetic algorithms
Annealing
 Wikipedia says:
         Annealing, in metallurgy and materials science, is a heat
         treatment that alters the physical and sometimes chemical
         properties of a material [...].
         In annealing, atoms migrate in the crystal lattice [grid] and the
         number of dislocations decreases, leading to a change in ductility
         and hardness. As the material cools it recrystallizes.
An analogy
         A crystal has the lowest level of potential energy among all
         configurations.
         Movement of atoms is random and depends on temperature.
         Annealing “pushes” atoms to reach their optimal position in the
         lattice. This involves:
                heating to a predetermined temperature
                holding the material for some time at that temperature
                slowly cooling and repeating the procedure.
         In real-life physics, atoms may break out of suboptimal positions
         when heated – but we want such behavior for θ (t) as well!
 We treat θ (t) as an atom that should reach the optimal θ ∗ .
Simulated annealing
 To maintain the analogy, annealing problems are posed as minimizations.
 Of course, we may always minimize −f if we need maximization.
 The simulated annealing process
         Initializes with a candidate θ (0) and a “temperature” τ0 .
         Consists of several stages of constant temperature, τ0 , τ1 , τ2 . . .,
         where
         each stage j consists of mj iterations, where updates may (but need
         not) take place.
         The “temperature” parameterizes how often the algorithm could take
         non-improving steps,
         ... thus allowing for break-out attempts.
The algorithm
    1    Pick (randomly) a candidate solution θ̃ from the current
         neighborhood of θ (t) , according to a proposal distribution g (t) .
         (Often the discrete uniform over N (θ (t) ).)
    2    If the proposal leads to an improvement, update θ (t+1) = θ̃.
    3    If the proposal does not lead to an improvement,
                randomly choose whether to update using θ̃ or not;
                the current temperature controls the probability of updating!
         So let θ (t+1) = θ̃ with probability equal to
                              min{ 1, exp[ (f (θ (t) ) − f (θ̃)) / τj ] }.
         Otherwise, let θ (t+1) = θ (t) .
    4    Repeat steps 1 – 3 a total of mj times.
    5    Increment j. Update temperature and stage length. Go to step 1.
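
 A compact sketch of the whole loop, for minimizing f (e.g. the AIC directly),
 using a geometric cooling schedule (see below) as one concrete choice; all
 parameter names and defaults are illustrative assumptions:

    import math, random

    def simulated_annealing(theta0, f, neighbourhood, tau0=10.0, alpha=0.9,
                            m0=60, beta=1.0, n_stages=15, seed=1):
        """Simulated annealing for minimizing f: stage j runs m_j iterations at
        constant temperature tau_j, then tau_{j+1} = alpha*tau_j, m_{j+1} = beta*m_j."""
        rng = random.Random(seed)
        theta, best = theta0, theta0
        tau, m = tau0, float(m0)
        for _ in range(n_stages):
            for _ in range(int(m)):
                proposal = rng.choice(neighbourhood(theta))   # uniform proposal over N(theta)
                delta = f(theta) - f(proposal)                # > 0 means the proposal is better
                if delta >= 0 or rng.random() < math.exp(delta / tau):
                    theta = proposal                          # accept (possibly uphill in f)
                if f(theta) < f(best):
                    best = theta
            tau, m = alpha * tau, beta * m                    # cooling schedule
        return best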
Some details
 The neighbourhoods:
         should be small and easily computed.
         any two candidates must be connectable via a sequence of
         neighborhoods.
 Let τj = α(τj−1 ) and mj = β(mj−1 ) (cooling schedule):
         α(·) should slowly decrease temperatures to zero.
         β(·) should increase stage lengths as temperatures drop.
 Popular choices:
         Set mj = 1 ∀j and reduce the temperature slowly,
         α(τj−1 ) = τj−1 / (1 + a · τj−1 ) for some a ≪ 1. Or
         Set α(τj−1 ) = a · τj−1 for some a < 1 (usually a ≥ 0.9). Increase stage
         lengths as you go, say β(mj−1 ) = b · mj−1 for b > 1.
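
 The two schedules written out as code (the constants a and b below are
 placeholder values one would tune for the problem at hand):

    def cool_slow(tau, a=0.001):
        """tau_j = tau_{j-1} / (1 + a*tau_{j-1}), typically combined with m_j = 1."""
        return tau / (1.0 + a * tau)

    def cool_geometric(tau, m, a=0.9, b=1.1):
        """tau_j = a*tau_{j-1} with growing stage lengths m_j = b*m_{j-1}."""
        return a * tau, b * m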
More practical details
         Pick τ0 > 0 so that exp ((f (θ i ) − f (θ j ))/τ0 ) is close to one ∀ θ i and
         θ j in Ω.
                This is problem-dependent and may take some effort, but
                allows any point to be visited (with a reasonable chance) early on.
         Larger temperature reductions typically require longer stages.
         Long stages at high temperatures are usually useless; decrease
         temperature rapidly at first but slowly thereafter.
         Stop when there has been no improvement for a couple of stages, you run
         out of patience, you exceed the budget, or the current solution is acceptable.
Example: Regression model selection
 [Figure]
 Discrete uniform proposal, 1-neighbourhoods, 15-stage cooling schedule
 with stage lengths of 60, 120, and 220, α(τj−1 ) = 0.9τj−1 , τ0 is 1 (bottom
 solid) or 10 (top solid). (Dashed: baseline temperature at each step.)
Genetic algorithms
 Rather than physics we now mimic natural evolutionary processes to
 improve candidate solutions.
 The analogy is taken seriously:
    1    One works with a so-called population of candidate solutions;
         the best individual (according to a so-called fitness function) is the
         current approximation to the solution.
    2    Each iteration leads to a change of generations (population renewal).
    3    Characteristics of individuals are inherited and modified from
         individuals of the previous generation, usually in a way that involves
         randomness.
    4    The key aspect is that individuals with high fitness are more likely to
         pass on their characteristics to the next generation.
The general structure
 Genetic algorithms consist of
    1    A genetic representation of the solution, typically an array of 0/1
         values (some encoding may be necessary); the exact configuration of
         the genetic code of an individual is called a chromosome. (Its elements
         are the genes, and the possible values of the genes are called alleles.)
    2    The fitness function to evaluate individuals (this is usually the target
         function f of interest)
    3    Initial population (should be large enough; either chosen at random or
         clustered in subsets of the solution space)
    4    The population renewal algorithm (the most important item here); to
         keep the analogy, it consists of applying a set of genetic operators
         (i.e. rules that change the genetic representation of individuals) to
         selected individuals of the current population.
    5    A stopping rule (well, duh...)
Model selection again
 In this example,
    1    The genetic representation of θ is simply inclusion/exclusion of the
         respective regressor (obvious codes 0/1).
    2    The AIC may play the role of the fitness function.
    3    The initial population consists of several candidate models, say picked
         at random.
    4    To renew the population, we look at the best performing models from
         the current population, and change them suitably to yield a new
         generation (“breeding”).
    5    We stop if no improvement has been observed for, say, the last 15
         generations, or once a maximum number of generations has been reached (etc.).
Parent selection and genetic operators
 We need current individuals to breed the next generation, so
         Select a pair of parent solutions at random, with the probability of being
         picked proportional to individual fitness or to relative rank in the current
         population (sometimes more than two parents are better);
         create a new individual (offspring) based on characteristics of the parents;
         repeat until you have enough new individuals.
 Two basic flavours of genetic operators generate new individuals:
         Cross-over: select a random position and split the parent chromosomes
         there; then recombine the pieces.
         Mutation: randomly change one or several elements of the
         chromosome(s).
 There are many variations (rule of thumb: small mutation probabilities); a
 sketch follows below.
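
 A bare-bones sketch combining these pieces for 0/1 chromosomes, with
 rank-based parent selection, single-point crossover and bit-flip mutation; the
 settings echo the example below, but every name and default here is an
 illustrative assumption:

    import random

    def genetic_algorithm(fitness, p, pop_size=20, n_generations=100,
                          mutation_rate=0.01, seed=1):
        """Maximize `fitness` over 0/1 chromosomes of length p."""
        rng = random.Random(seed)
        population = [tuple(rng.randint(0, 1) for _ in range(p)) for _ in range(pop_size)]
        best = max(population, key=fitness)

        def breed(mother, father):
            cut = rng.randrange(1, p)                      # single-point crossover
            child = list(mother[:cut] + father[cut:])
            for i in range(p):                             # rare, independent mutations
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            return tuple(child)

        for _ in range(n_generations):
            ranked = sorted(population, key=fitness)       # worst ... best
            weights = list(range(1, pop_size + 1))         # selection prob. proportional to rank
            population = [breed(*rng.choices(ranked, weights=weights, k=2))
                          for _ in range(pop_size)]
            best = max(population + [best], key=fitness)   # keep the best ever seen
        return best

 For the regression example, fitness could again be the negative AIC of the
 coded model.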
Model selection again
 [Figure]
 Population size P = 20, 100 generations breeding with fitness-rank based
 selection probability, simple crossover, and 1% mutation.
Tabu algorithms
Potentially escaping entrapment
 Let’s now consider downhill moves explicitly. We allow them
         if we can’t find uphill ones, but with
         care to avoid reversal in the next (or 2nd next) step,
 since this may eliminate potential long-term benefits of the downhill move.
 SANN or GA deal with this to some extent through built-in randomness, but
 local searches don’t. (And even SANN/GA may use some help.)
 So, after a downhill step, let’s (temporarily) forbid (or make tabu/taboo)
 moves that have the potential of taking us back. We need to
         characterize moves to this end, and
         keep an eye on the recent history (say H (t) ) of moves.
Move attributes
 We use so-called attributes to characterize moves:
         in each step, we forbid attributes that would lead to reversing moves.
         ... but only do so for a couple of steps.
 Several attributes may be relevant for a given problem:
         call the ath attribute Aa ;
         a move from θ (t) to θ (t+1) can be characterized by different
         attributes, depending on t.
 The tabu list is a list of attributes, not moves, so a single tabu attribute
 may forbid entire classes of moves.
Examples of attributes
 Generic attributes and their model selection counterparts (2-change neighbourhood):

         A change in the value of θi (t); the attribute may be the value from
         which the move began, or the value to which it arrived.
            → A1: Whether the ith predictor is added to (or deleted from) the model.

         A swap in the values of θi (t) and θj (t) when θi (t) ≠ θj (t).
            → A2: Whether the absent variable is exchanged for the included variable.

         A change in the value of f resulting from the step, f (θ (t+1) ) − f (θ (t) ).
            → A3: The reduction in AIC achieved by the move.

         The value of some other strategically chosen function g, i.e., g(θ (t+1) ).
            → A4: The number of predictors in the new model.

         A change in the value of g resulting from the step, g(θ (t+1) ) − g(θ (t) ).
            → A5: The change in a different variable selection criterion such as adjusted R².
Relevance at a given iteration
 Recall, the presence of attributes on the tabu list is only temporary.
         The recency R(Aa , H (t) ) of an attribute Aa is the number of steps that
         have passed since a move last exhibited Aa .
         Let R(Aa , H (t) ) = 0 if the move yielding θ (t) exhibits attribute Aa ,
         R(Aa , H (t) ) = 1 if the last appearance of Aa was at time t − 1, etc.
 History matters:
         Current neighbourhoods may depend on search history; denote this
         N (θ (t) , H (t) ).
         The choice of θ (t+1) in N (θ (t) , H (t) ) may depend on the (recent)
         search history too.
         This effectively leads to using an augmented target function, fH (t) .
Workflow
 Say we make a move exhibiting Aa
         Then we put its complement on the tabu list (want to prevent
         undoing this move),
         ... and leave it there for τ iterations. (This is a tuning parameter
         controlling the way the algorithm works.)
 The neighbourhood where we look for updates becomes

      N (θ (t) , H (t) ) = { θ ∈ N (θ (t) ) : no attribute of the move to θ is tabu at time t }.

 When R(Aa , H (t) ) first equals τ , we remove Aa from the tabu list.
 Typical tabu tenure in practice: 7 ≤ τ ≤ 20, or 0.5·√p ≤ τ ≤ 2·√p. If τ is
 large enough, it will prevent cycling.
Refinements: Aspiration
 We may want to override the tabu when it is wise to do so.
 Most common:
         permit a tabu move ...
         if it provides a higher value of the objective function than has been
         found in any iteration so far.
 Aspiration by influence:
         A very recent large increase in the target may imply that the algorithm
         is exploring a new (promising) region of Ω.
         So cancel the tabu on reversing the last downhill move: further
         exploration of the new region is worthwhile.
Refinements: Diversification
         Bad candidates that appear frequently may indicate entrapment.
         So construct fH (t) to penalize such (bad) moves/attributes.
 This of course requires keeping track of the residence frequency F (Aa , H (t) ).
 One way of imposing the penalty:

      fH (t) (θ (t+1) ) = f (θ (t+1) )                        if f (θ (t+1) ) ≥ f (θ (t) )
      fH (t) (θ (t+1) ) = f (θ (t+1) ) − c · F (Aa , H (t) )  if f (θ (t+1) ) < f (θ (t) )
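
 The penalty as a small sketch (the constant c and the bookkeeping of the
 residence frequency are left unspecified and would be problem-dependent):

    def penalized_objective(f_new, f_old, residence_freq, c=1.0):
        """Diversification: downhill moves are additionally penalized in
        proportion to how often the corresponding attribute has been visited."""
        return f_new if f_new >= f_old else f_new - c * residence_freq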
Refinements: Intensification
 Construct fH (t) to extend search near good solutions with high residency.
         Reward retention of attributes that correlate with high-quality,
         high-frequency solutions;
         penalize removal of such attributes.
 Easier said than done.
To sum up
 Identify a list of problem-specific attributes, initialize tabu list. Then
    1    Construct fH (t) (θ) based on f (θ), and perhaps
         diversification/intensification penalties or incentives.
    2    Identify neighbors of θ (t) , namely the members of N (θ (t) ).
    3    Rank the neighbors in decreasing order of improvement, as evaluated
         by fH (t) .
    4    Select the highest ranking neighbor.
    5    Is this neighbor currently on the tabu list? If not, go to step 8.
    6    Does this neighbor pass an aspiration criterion? If so, go to step 8.
    7    If all neighbors of θ (t) have been considered and none have been
         adopted as θ (t+1) , then stop. Otherwise, select the next best neighbor
         and go to step 5.
    8    Adopt this solution as θ (t+1) and update the tabu list.
    9    Has a stopping criterion been met? If so, stop. Otherwise, increment t
         and go to step 1.
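
 A compact sketch of the loop above for the model-selection coding, maximizing
 f, with the flipped index as the move attribute, a fixed tabu tenure, and the
 most common aspiration criterion; everything named here is an illustrative
 assumption rather than a prescribed implementation:

    def tabu_search(theta0, f, neighbourhood, tenure=10, max_iter=75):
        """Tabu search over 0/1 vectors with 1-change moves: an index that was
        flipped recently may not be flipped again for `tenure` iterations,
        unless doing so beats the best value found so far (aspiration)."""
        theta = best = theta0
        tabu_until = {}                                       # flipped index -> iteration until tabu
        for t in range(max_iter):
            admissible = []
            for nb in neighbourhood(theta):
                i = next(k for k in range(len(theta)) if nb[k] != theta[k])   # move attribute
                if tabu_until.get(i, -1) < t or f(nb) > f(best):              # not tabu, or aspiration
                    admissible.append((nb, i))
            if not admissible:
                break                                         # every neighbour is tabu
            theta, attr = max(admissible, key=lambda c: f(c[0]))  # best admissible (may be downhill)
            tabu_until[attr] = t + tenure                     # forbid undoing this move for a while
            if f(theta) > f(best):
                best = theta
        return best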
Results of tabu search for model selection example
 [Figure]
 Best solution was hit twice: iterations 44 and 66.
Up next
Coming up
       The expectation-maximization algorithm