Optimization
Rowel Atienza
        rowel@eee.upd.edu.ph
University of the Philippines
Optimization
Finding the parameters θ of a neural network that significantly reduce the cost function J(θ)
    J(θ) is measured in terms of a performance measure P, evaluated on the entire training set, plus some regularization terms
    P is what makes optimization in machine learning different from pure optimization: J(θ) is reduced only indirectly, in the hope of improving P, rather than being the end goal itself
Optimization
Loss function over the empirical distribution p̂data (over the training set):
    J(θ) = E_(x,y)~p̂data L(f(x; θ), y)
    f(x; θ) is the per-sample prediction
    y is the label
Usually, we really want to minimize J* over the true data-generating distribution pdata:
    J*(θ) = E_(x,y)~pdata L(f(x; θ), y)
Empirical Risk Minimization
Empirical Risk Minimization (a short numerical sketch follows below):
    J(θ) = E_(x,y)~p̂data L(f(x; θ), y) = (1/m) Σ_{i=1..m} L(f(x^(i); θ), y^(i))
    m is the number of samples
    prone to overfitting
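As a minimal numerical sketch (not from the slides), empirical risk is just the mean per-sample loss over the m training examples; the squared-error loss and linear model below are illustrative assumptions.

    import numpy as np

    def empirical_risk(loss_fn, predict_fn, X, y):
        """Mean per-sample loss over the m training examples."""
        m = X.shape[0]
        return sum(loss_fn(predict_fn(X[i]), y[i]) for i in range(m)) / m

    # Illustrative setup: a linear model with squared-error loss.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    theta = rng.normal(size=3)
    y = X @ theta + 0.1 * rng.normal(size=100)

    squared_error = lambda y_hat, y_true: (y_hat - y_true) ** 2
    linear_model = lambda x: x @ theta

    print(empirical_risk(squared_error, linear_model, X, y))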
Minibatch Stochastic
Using the entire training set (known as deterministic or batch gradient descent) is expensive and has less-than-linear returns: the standard error of the gradient estimate falls only as 1/√m, so doubling the data does not double the benefit
Using minibatch stochastic methods (a small random subset of the entire training set) offers many advantages:
    Suitable for parallelization
    GPUs perform better on power-of-2 batch sizes, roughly 32 to 256
    Small batches offer a regularizing effect and can improve generalization error
    Shuffling, so that minibatches are drawn (approximately) independently, improves training; see the sampling sketch below
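A minimal sketch of minibatch sampling with per-epoch shuffling (NumPy arrays assumed; the function name minibatches is just illustrative):

    import numpy as np

    def minibatches(X, y, batch_size=64, rng=None):
        """Yield shuffled minibatches so consecutive batches are (roughly) independent."""
        rng = rng or np.random.default_rng()
        idx = rng.permutation(len(X))              # reshuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

    # Usage: one pass over the data per epoch; batch_size is typically a power of 2 (32-256).
    X, y = np.random.randn(1000, 10), np.random.randn(1000)
    for xb, yb in minibatches(X, y, batch_size=64):
        pass  # compute the gradient on (xb, yb) and update the parameters here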
Challenges
Design of the loss function: even a convex loss is not free of difficulties
Ill-conditioning of the Hessian matrix H
    The second-order Taylor expansion predicts that a gradient step −εg changes the loss by (1/2)ε^2 g^T H g − ε g^T g
    If the first term exceeds ε g^T g, learning slows down (the learning rate must shrink to keep making progress)
Problem of multiple local minima
    If the loss function can be reduced to an acceptably low level, parameters at a local minimum are acceptable in practice
Challenges
Saddle points: common in high-dimensional models
   They can have high cost, but are easily escaped by SGD; SGD is designed to move downhill, not to seek out critical points
   Newton's method has difficulty here: it jumps toward critical points, including saddle points and maxima
   Saddle-free Newton methods can overcome saddle points; research in progress
Challenges
Cliff: a very steep region of the loss surface
    Gradient descent proposes a very large parameter change, overshooting and missing the minimum (exploding gradient)
    Solution: gradient clipping, i.e. capping the gradient norm (see the sketch below)
    A common problem in recurrent neural networks
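A minimal sketch of norm-based gradient clipping (the threshold value is an arbitrary assumption):

    import numpy as np

    def clip_gradient(g, threshold=1.0):
        """Clip by norm: rescale g when ||g|| exceeds the threshold, keeping its direction."""
        norm = np.linalg.norm(g)
        if norm > threshold:
            g = g * (threshold / norm)
        return g

    g = np.array([30.0, -40.0])                # an exploding gradient near a cliff
    print(clip_gradient(g, threshold=5.0))     # [ 3. -4.]: same direction, norm capped at 5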
Challenges
Long-term dependencies (e.g., RNN, LSTM)
   The same computation is performed many times
   Applying the same W t times:
   W^t = (V diag(λ) V^(-1))^t = V diag(λ)^t V^(-1)
   If |λ| < 1, the corresponding term vanishes as t increases
   If |λ| > 1, the corresponding term explodes as t increases
   Gradients are scaled by diag(λ)^t (see the numerical sketch below)
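A small numerical sketch of W^t = V diag(λ)^t V^(-1); the matrix V and the eigenvalues 0.5 and 1.1 are assumed purely for illustration:

    import numpy as np

    # A diagonalizable W with eigenvalues 0.5 and 1.1 (illustrative choice).
    V = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
    lam = np.array([0.5, 1.1])
    W = V @ np.diag(lam) @ np.linalg.inv(V)

    for t in (1, 10, 50):
        Wt = np.linalg.matrix_power(W, t)
        # Recover diag(lam)^t from W^t: one eigenvalue vanishes, the other explodes.
        print(t, np.diag(np.linalg.inv(V) @ Wt @ V))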
Challenges
Inexact gradients due to noisy or biased minibatch estimates
Local and global structure
    Optimization does not necessarily arrive at a critical point of any kind (global minimum, local minimum, or saddle)
    Most of the time, it only reaches a point of near-zero gradient that gives acceptable performance
Challenges
Wrong side of the mountain: gradient descent will not find the minimum
    Bad initial points send the objective function to the wrong side of the mountain
    Solution: an algorithm for choosing good initial points
Parameter Initialization
The initial point determines whether the objective function converges at all
Modern initialization strategies are simple and heuristic
    Optimization for neural networks is not yet well understood
Initialize weights (and biases) with different random values to break symmetry between units; see the sketch below
Large weights are good for optimization; small weights are good for regularization
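A minimal sketch of random weight initialization for one fully connected layer; the function name and the 0.01 scale are illustrative assumptions, with the scale reflecting the optimization-vs-regularization trade-off above:

    import numpy as np

    def init_layer(n_in, n_out, scale=0.01, rng=None):
        """Small random weights break symmetry: if all weights started equal,
        every hidden unit would compute (and keep computing) the same function."""
        rng = rng or np.random.default_rng()
        W = scale * rng.standard_normal((n_in, n_out))  # larger scale aids optimization, smaller acts as regularization
        b = np.zeros(n_out)                             # the next slide lists common constant heuristics for biases
        return W, b

    W, b = init_layer(784, 256)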
Parameter Initialization
Biases:
    Small positive values (e.g., 0.1) for ReLU activations, to keep units active at the start
    1 for the LSTM forget gate, so the cell initially remembers
    For an output layer whose targets have highly skewed marginals c, solve softmax(b) = c, which gives b = log c (see the sketch below)
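A quick check of the softmax(b) = c heuristic (the class marginals c below are an assumed example); b = log c reproduces c because softmax ignores any additive constant:

    import numpy as np

    def softmax(z):
        z = z - z.max()          # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    c = np.array([0.90, 0.07, 0.03])   # assumed skewed class marginals
    b = np.log(c)                      # output-layer bias initialization
    print(softmax(b))                  # ~ [0.90, 0.07, 0.03]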
Stochastic Gradient Descent
Instead of using the whole training set, we use a minibatch of m i.i.d. samples
Learning rate ε:
    Gradually decrease the learning rate during training, since after some time the gradient noise from minibatch sampling becomes more significant than the true gradient
    Apply learning rate decay until iteration τ, after which ε is held constant:
        ε_k = (1 − k/τ) ε_0 + (k/τ) ε_τ
    ε_0, ε_τ, and τ are usually chosen by trial and error while monitoring the error curves (see the sketch below)
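A minimal sketch of SGD with the linear decay schedule above; the quadratic objective and the specific ε_0, ε_τ, τ values are assumed for illustration:

    import numpy as np

    def lr_schedule(k, eps0=0.1, eps_tau=0.001, tau=1000):
        """eps_k = (1 - k/tau)*eps0 + (k/tau)*eps_tau for k < tau, then held constant."""
        if k >= tau:
            return eps_tau
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau

    def sgd_step(theta, grad, k):
        return theta - lr_schedule(k) * grad

    # Illustration: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
    theta = np.array([5.0, -3.0])
    for k in range(2000):
        theta = sgd_step(theta, theta, k)
    print(theta)    # close to the minimum at the origin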
Stochastic Gradient Descent
Theoretically, the excess error J(θ) − min_θ J(θ) decreases no faster than O(1/√k) for convex problems (O(1/k) if strongly convex), and the generalization error itself cannot decrease faster than O(1/k). Converging faster than O(1/k) therefore does not improve the generalization error and presumably corresponds to overfitting.
Generally, batch gradient descent has better convergence properties than SGD. A useful technique is to gradually increase the minibatch size during training.
Momentum on SGD for Speed Improvement
v ← αv − εg
θ ← θ + v
    where v accumulates the gradients g; v includes the influence of past gradients
    α is the momentum coefficient in [0, 1); typical values are 0.5, 0.9, and 0.99; the larger α is compared to ε, the bigger the influence of past gradients, similar to a snowballing effect
Nesterov momentum: the loss (and hence the gradient) is evaluated after the momentum is applied, g ← ∇L(θ + αv); see the sketch below
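A minimal sketch of SGD with (optionally Nesterov) momentum on an assumed quadratic objective; the function names are illustrative:

    import numpy as np

    def momentum_step(theta, v, grad_fn, eps=0.01, alpha=0.9, nesterov=False):
        """One update: v accumulates past gradients; alpha in [0, 1) controls the snowballing."""
        lookahead = theta + alpha * v if nesterov else theta   # Nesterov: gradient after the momentum jump
        g = grad_fn(lookahead)
        v = alpha * v - eps * g
        return theta + v, v

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(500):
        theta, v = momentum_step(theta, v, lambda t: t, nesterov=True)
    print(theta)    # close to the minimum at the origin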
Adaptive Learning Rates
AdaGrad (Adaptive Gradient): each parameter's learning rate is scaled inversely with the accumulated squared partial derivatives of the loss
r ← r + g ⊙ g
θ ← θ − ε g / (δ + √r)    (element-wise)
     where δ is a small constant (e.g., 10^-7)
     Effective for some deep learning models but not all; accumulating squared gradients from the very beginning of training can cause an excessive and premature decrease in the effective learning rate (see the sketch below)
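A minimal sketch of the AdaGrad update; the small loop prints the effective per-parameter learning rate ε/(δ + √r) to show it shrinking as history accumulates (the quadratic objective is assumed for illustration):

    import numpy as np

    def adagrad_step(theta, r, g, eps=0.01, delta=1e-7):
        """AdaGrad: per-parameter step scaled by the accumulated squared gradients."""
        r = r + g * g
        theta = theta - eps * g / (delta + np.sqrt(r))
        return theta, r

    theta, r = np.array([1.0, -1.0]), np.zeros(2)
    for k in range(1, 101):
        g = theta                      # gradient of ||theta||^2 / 2
        theta, r = adagrad_step(theta, r, g)
        if k in (1, 10, 100):
            print(k, 0.01 / (1e-7 + np.sqrt(r)))   # effective learning rate keeps shrinking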
Adaptive Learning Rates
RMSProp
r ← ρr + (1 − ρ) g ⊙ g
θ ← θ − ε g / √(δ + r)    (element-wise)
     where δ is a small constant (e.g., 10^-7); ρ is the decay rate
     Discards history from the extreme past
     Effective and practical for deep neural nets (see the sketch below)
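A minimal sketch of the RMSProp update on an assumed quadratic objective:

    import numpy as np

    def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-7):
        """RMSProp: an exponentially decaying average of squared gradients discards old history."""
        r = rho * r + (1 - rho) * g * g
        theta = theta - eps * g / np.sqrt(delta + r)
        return theta, r

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, r = np.array([1.0, -1.0]), np.zeros(2)
    for _ in range(3000):
        theta, r = rmsprop_step(theta, r, theta)
    print(theta)    # close to the minimum at the origin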
Adaptive Learning Rates
RMSProp with Nesterov Momentum
Compute the gradient g at the interim point θ + αv
r ← ρr + (1 − ρ) g ⊙ g
v ← αv − ε g / √(δ + r)    (element-wise)
θ ← θ + v
     where δ is a small constant (e.g., 10^-7)
     ρ is the decay rate
     α is the momentum coefficient (see the sketch below)
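A minimal sketch combining the two pieces above; the quadratic objective and hyperparameter values are assumed:

    import numpy as np

    def rmsprop_nesterov_step(theta, v, r, grad_fn, eps=0.001, rho=0.9, alpha=0.9, delta=1e-7):
        """Gradient taken at the interim (look-ahead) point, then rescaled by the running RMS."""
        g = grad_fn(theta + alpha * v)             # Nesterov: evaluate after the momentum jump
        r = rho * r + (1 - rho) * g * g            # decayed accumulation of squared gradients
        v = alpha * v - eps * g / np.sqrt(delta + r)
        return theta + v, v, r

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, v, r = np.array([1.0, -1.0]), np.zeros(2), np.zeros(2)
    for _ in range(2000):
        theta, v, r = rmsprop_nesterov_step(theta, v, r, lambda t: t)
    print(theta)    # near the origin, with a small residual oscillation from the momentum term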
Adaptive Learning Rates
Adam (Adaptive Moments)
t ← t + 1
first moment:  s ← ρ1 s + (1 − ρ1) g,    ŝ ← s / (1 − ρ1^t)
second moment: r ← ρ2 r + (1 − ρ2) g ⊙ g,    r̂ ← r / (1 − ρ2^t)
θ ← θ − ε ŝ / (√r̂ + δ)
where δ is a small constant for numerical stabilization (e.g., 10^-8), ρ1 and ρ2 ∈ [0, 1) (suggested: ρ1 = 0.9, ρ2 = 0.999), t is the time step, and the suggested ε is 0.001 (a sketch follows below)
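A minimal sketch of the Adam update on an assumed quadratic objective:

    import numpy as np

    def adam_step(theta, s, r, g, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        """One Adam update: decayed first/second moments, bias correction, then the step."""
        t = t + 1
        s = rho1 * s + (1 - rho1) * g          # first moment (mean of gradients)
        r = rho2 * r + (1 - rho2) * g * g      # second moment (uncentered variance)
        s_hat = s / (1 - rho1 ** t)            # bias correction for the zero initialization
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
        return theta, s, r, t

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, s, r, t = np.array([1.0, -0.5]), np.zeros(2), np.zeros(2), 0
    for _ in range(3000):
        theta, s, r, t = adam_step(theta, s, r, theta, t)
    print(theta)    # close to the minimum at the origin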
Reference
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
End