Optimization
Rowel Atienza
        rowel@eee.upd.edu.ph
University of the Philippines
Optimization
Finding the parameters θ of a neural network that significantly reduce the cost function J(θ)
    J(θ) is measured in terms of a performance measure P, evaluated on the entire training set, plus some regularization terms
    P is what makes optimization in machine learning different from pure optimization: J(θ) is reduced only indirectly, in the hope of improving P, rather than being the end goal itself
Optimization
Loss function over the empirical distribution p̂data (over the training set):
    J(θ) = E_(x,y)~p̂data L(f(x; θ), y)
    f(x; θ) is the per-sample prediction
    y is the label
Usually, we really want to minimize J* over the true data-generating distribution pdata:
    J*(θ) = E_(x,y)~pdata L(f(x; θ), y)
Empirical Risk Minimization
Empirical Risk Minimization (a short numerical sketch follows below):
    J(θ) = E_(x,y)~p̂data L(f(x; θ), y) = (1/m) Σ_{i=1..m} L(f(x^(i); θ), y^(i))
    m is the number of samples
    prone to overfitting
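As a minimal numerical sketch (not from the slides), empirical risk is just the mean per-sample loss over the m training examples; the squared-error loss and linear model below are illustrative assumptions.

    import numpy as np

    def empirical_risk(loss_fn, predict_fn, X, y):
        """Mean per-sample loss over the m training examples."""
        m = X.shape[0]
        return sum(loss_fn(predict_fn(X[i]), y[i]) for i in range(m)) / m

    # Illustrative setup: a linear model with squared-error loss.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    theta = rng.normal(size=3)
    y = X @ theta + 0.1 * rng.normal(size=100)

    squared_error = lambda y_hat, y_true: (y_hat - y_true) ** 2
    linear_model = lambda x: x @ theta

    print(empirical_risk(squared_error, linear_model, X, y))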
Minibatch Stochastic
Using the entire training set (known as deterministic or batch gradient descent) is expensive and has less-than-linear returns: the standard error of the gradient estimate falls only as 1/√m, so doubling the data does not double the benefit
Using minibatch stochastic methods (a small random subset of the entire training set) offers many advantages:
    Suitable for parallelization
    GPUs perform better on power-of-2 batch sizes, roughly 32 to 256
    Small batches offer a regularizing effect and can improve generalization error
    Shuffling, so that minibatches are drawn (approximately) independently, improves training; see the sampling sketch below
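A minimal sketch of minibatch sampling with per-epoch shuffling (NumPy arrays assumed; the function name minibatches is just illustrative):

    import numpy as np

    def minibatches(X, y, batch_size=64, rng=None):
        """Yield shuffled minibatches so consecutive batches are (roughly) independent."""
        rng = rng or np.random.default_rng()
        idx = rng.permutation(len(X))              # reshuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

    # Usage: one pass over the data per epoch; batch_size is typically a power of 2 (32-256).
    X, y = np.random.randn(1000, 10), np.random.randn(1000)
    for xb, yb in minibatches(X, y, batch_size=64):
        pass  # compute the gradient on (xb, yb) and update the parameters here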
Challenges
Design of the loss function: even a convex loss is not free of difficulties
Ill-conditioning of the Hessian matrix H
    The second-order Taylor expansion predicts that a gradient step −εg changes the loss by (1/2)ε^2 g^T H g − ε g^T g
    If the first term exceeds ε g^T g, learning slows down (the learning rate must shrink to keep making progress)
Problem of multiple local minima
    If the loss function can be reduced to an acceptably low level, parameters at a local minimum are acceptable in practice
Challenges
Saddle points: common in high-dimensional models
   They can have high cost, but are easily escaped by SGD; SGD is designed to move downhill, not to seek out critical points
   Newton's method has difficulty here: it jumps toward critical points, including saddle points and maxima
   Saddle-free Newton methods can overcome saddle points; research in progress
Challenges
Cliff: a very steep region of the loss surface
    Gradient descent proposes a very large parameter change, overshooting and missing the minimum (exploding gradient)
    Solution: gradient clipping, i.e. capping the gradient norm (see the sketch below)
    A common problem in recurrent neural networks
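A minimal sketch of norm-based gradient clipping (the threshold value is an arbitrary assumption):

    import numpy as np

    def clip_gradient(g, threshold=1.0):
        """Clip by norm: rescale g when ||g|| exceeds the threshold, keeping its direction."""
        norm = np.linalg.norm(g)
        if norm > threshold:
            g = g * (threshold / norm)
        return g

    g = np.array([30.0, -40.0])                # an exploding gradient near a cliff
    print(clip_gradient(g, threshold=5.0))     # [ 3. -4.]: same direction, norm capped at 5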
Challenges
Long-term dependencies (e.g., RNN, LSTM)
   The same computation is performed many times
   Applying the same W t times:
   W^t = (V diag(λ) V^(-1))^t = V diag(λ)^t V^(-1)
   If |λ| < 1, the corresponding term vanishes as t increases
   If |λ| > 1, the corresponding term explodes as t increases
   Gradients are scaled by diag(λ)^t (see the numerical sketch below)
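A small numerical sketch of W^t = V diag(λ)^t V^(-1); the matrix V and the eigenvalues 0.5 and 1.1 are assumed purely for illustration:

    import numpy as np

    # A diagonalizable W with eigenvalues 0.5 and 1.1 (illustrative choice).
    V = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
    lam = np.array([0.5, 1.1])
    W = V @ np.diag(lam) @ np.linalg.inv(V)

    for t in (1, 10, 50):
        Wt = np.linalg.matrix_power(W, t)
        # Recover diag(lam)^t from W^t: one eigenvalue vanishes, the other explodes.
        print(t, np.diag(np.linalg.inv(V) @ Wt @ V))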
Challenges
Inexact gradients due to noisy or biased minibatch estimates
Local and global structure
    Optimization does not necessarily arrive at a critical point of any kind (global minimum, local minimum, or saddle)
    Most of the time, it only reaches a point of near-zero gradient that gives acceptable performance
Challenges
Wrong side of the mountain: gradient descent will not find the minimum
    Bad initial points send the objective function to the wrong side of the mountain
    Solution: an algorithm for choosing good initial points
Parameter Initialization
The initial point determines whether the objective function converges at all
Modern initialization strategies are simple and heuristic
    Optimization for neural networks is not yet well understood
Initialize weights (and biases) with different random values to break symmetry between units; see the sketch below
Large weights are good for optimization; small weights are good for regularization
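A minimal sketch of random weight initialization for one fully connected layer; the function name and the 0.01 scale are illustrative assumptions, with the scale reflecting the optimization-vs-regularization trade-off above:

    import numpy as np

    def init_layer(n_in, n_out, scale=0.01, rng=None):
        """Small random weights break symmetry: if all weights started equal,
        every hidden unit would compute (and keep computing) the same function."""
        rng = rng or np.random.default_rng()
        W = scale * rng.standard_normal((n_in, n_out))  # larger scale aids optimization, smaller acts as regularization
        b = np.zeros(n_out)                             # the next slide lists common constant heuristics for biases
        return W, b

    W, b = init_layer(784, 256)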
Parameter Initialization
Biases:
    Small positive values (e.g., 0.1) for ReLU activations, to keep units active at the start
    1 for the LSTM forget gate, so the cell initially remembers
    For an output layer whose targets have highly skewed marginals c, solve softmax(b) = c, which gives b = log c (see the sketch below)
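A quick check of the softmax(b) = c heuristic (the class marginals c below are an assumed example); b = log c reproduces c because softmax ignores any additive constant:

    import numpy as np

    def softmax(z):
        z = z - z.max()          # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    c = np.array([0.90, 0.07, 0.03])   # assumed skewed class marginals
    b = np.log(c)                      # output-layer bias initialization
    print(softmax(b))                  # ~ [0.90, 0.07, 0.03]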
Stochastic Gradient Descent
Instead of using the whole training set, we use a minibatch of m i.i.d. samples
Learning rate ε:
    Gradually decrease the learning rate during training, since after some time the gradient noise from minibatch sampling becomes more significant than the true gradient
    Apply learning rate decay until iteration τ, after which ε is held constant:
        ε_k = (1 − k/τ) ε_0 + (k/τ) ε_τ
    ε_0, ε_τ, and τ are usually chosen by trial and error while monitoring the error curves (see the sketch below)
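A minimal sketch of SGD with the linear decay schedule above; the quadratic objective and the specific ε_0, ε_τ, τ values are assumed for illustration:

    import numpy as np

    def lr_schedule(k, eps0=0.1, eps_tau=0.001, tau=1000):
        """eps_k = (1 - k/tau)*eps0 + (k/tau)*eps_tau for k < tau, then held constant."""
        if k >= tau:
            return eps_tau
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau

    def sgd_step(theta, grad, k):
        return theta - lr_schedule(k) * grad

    # Illustration: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
    theta = np.array([5.0, -3.0])
    for k in range(2000):
        theta = sgd_step(theta, theta, k)
    print(theta)    # close to the minimum at the origin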
Stochastic Gradient Descent
Theoretically, the excess error J(θ) − min_θ J(θ) decreases no faster than O(1/√k) for convex problems (O(1/k) if strongly convex), and the generalization error itself cannot decrease faster than O(1/k). Converging faster than O(1/k) therefore does not improve the generalization error and presumably corresponds to overfitting.
Generally, batch gradient descent has better convergence properties than SGD. A useful technique is to gradually increase the minibatch size during training.
Momentum on SGD for Speed Improvement
v ← αv − εg
θ ← θ + v
    where v accumulates the gradients g; v includes the influence of past gradients
    α is the momentum coefficient in [0, 1); typical values are 0.5, 0.9, and 0.99; the larger α is compared to ε, the bigger the influence of past gradients, similar to a snowballing effect
Nesterov momentum: the loss (and hence the gradient) is evaluated after the momentum is applied, g ← ∇L(θ + αv); see the sketch below
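A minimal sketch of SGD with (optionally Nesterov) momentum on an assumed quadratic objective; the function names are illustrative:

    import numpy as np

    def momentum_step(theta, v, grad_fn, eps=0.01, alpha=0.9, nesterov=False):
        """One update: v accumulates past gradients; alpha in [0, 1) controls the snowballing."""
        lookahead = theta + alpha * v if nesterov else theta   # Nesterov: gradient after the momentum jump
        g = grad_fn(lookahead)
        v = alpha * v - eps * g
        return theta + v, v

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(500):
        theta, v = momentum_step(theta, v, lambda t: t, nesterov=True)
    print(theta)    # close to the minimum at the origin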
Adaptive Learning Rates
AdaGrad (Adaptive Gradient): each parameter's learning rate is scaled inversely with the accumulated squared partial derivatives of the loss
r ← r + g ⊙ g
θ ← θ − ε g / (δ + √r)    (element-wise)
     where δ is a small constant (e.g., 10^-7)
     Effective for some deep learning models but not all; accumulating squared gradients from the very beginning of training can cause an excessive and premature decrease in the effective learning rate (see the sketch below)
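A minimal sketch of the AdaGrad update; the small loop prints the effective per-parameter learning rate ε/(δ + √r) to show it shrinking as history accumulates (the quadratic objective is assumed for illustration):

    import numpy as np

    def adagrad_step(theta, r, g, eps=0.01, delta=1e-7):
        """AdaGrad: per-parameter step scaled by the accumulated squared gradients."""
        r = r + g * g
        theta = theta - eps * g / (delta + np.sqrt(r))
        return theta, r

    theta, r = np.array([1.0, -1.0]), np.zeros(2)
    for k in range(1, 101):
        g = theta                      # gradient of ||theta||^2 / 2
        theta, r = adagrad_step(theta, r, g)
        if k in (1, 10, 100):
            print(k, 0.01 / (1e-7 + np.sqrt(r)))   # effective learning rate keeps shrinking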
Adaptive Learning Rates
RMSProp
r ← ρr + (1 − ρ) g ⊙ g
θ ← θ − ε g / √(δ + r)    (element-wise)
     where δ is a small constant (e.g., 10^-7); ρ is the decay rate
     Discards history from the extreme past
     Effective and practical for deep neural nets (see the sketch below)
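A minimal sketch of the RMSProp update on an assumed quadratic objective:

    import numpy as np

    def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-7):
        """RMSProp: an exponentially decaying average of squared gradients discards old history."""
        r = rho * r + (1 - rho) * g * g
        theta = theta - eps * g / np.sqrt(delta + r)
        return theta, r

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, r = np.array([1.0, -1.0]), np.zeros(2)
    for _ in range(3000):
        theta, r = rmsprop_step(theta, r, theta)
    print(theta)    # close to the minimum at the origin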
Adaptive Learning Rates
RMSProp with Nesterov Momentum
Compute the gradient g at the interim point θ + αv
r ← ρr + (1 − ρ) g ⊙ g
v ← αv − ε g / √(δ + r)    (element-wise)
θ ← θ + v
     where δ is a small constant (e.g., 10^-7)
     ρ is the decay rate
     α is the momentum coefficient (see the sketch below)
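A minimal sketch combining the two pieces above; the quadratic objective and hyperparameter values are assumed:

    import numpy as np

    def rmsprop_nesterov_step(theta, v, r, grad_fn, eps=0.001, rho=0.9, alpha=0.9, delta=1e-7):
        """Gradient taken at the interim (look-ahead) point, then rescaled by the running RMS."""
        g = grad_fn(theta + alpha * v)             # Nesterov: evaluate after the momentum jump
        r = rho * r + (1 - rho) * g * g            # decayed accumulation of squared gradients
        v = alpha * v - eps * g / np.sqrt(delta + r)
        return theta + v, v, r

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, v, r = np.array([1.0, -1.0]), np.zeros(2), np.zeros(2)
    for _ in range(2000):
        theta, v, r = rmsprop_nesterov_step(theta, v, r, lambda t: t)
    print(theta)    # near the origin, with a small residual oscillation from the momentum term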
Adaptive Learning Rates
Adam (Adaptive Moments)
t ← t + 1
first moment:  s ← ρ1 s + (1 − ρ1) g,    ŝ ← s / (1 − ρ1^t)
second moment: r ← ρ2 r + (1 − ρ2) g ⊙ g,    r̂ ← r / (1 − ρ2^t)
θ ← θ − ε ŝ / (√r̂ + δ)
where δ is a small constant for numerical stabilization (e.g., 10^-8), ρ1 and ρ2 ∈ [0, 1) (suggested: ρ1 = 0.9, ρ2 = 0.999), t is the time step, and the suggested ε is 0.001 (a sketch follows below)
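A minimal sketch of the Adam update on an assumed quadratic objective:

    import numpy as np

    def adam_step(theta, s, r, g, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        """One Adam update: decayed first/second moments, bias correction, then the step."""
        t = t + 1
        s = rho1 * s + (1 - rho1) * g          # first moment (mean of gradients)
        r = rho2 * r + (1 - rho2) * g * g      # second moment (uncentered variance)
        s_hat = s / (1 - rho1 ** t)            # bias correction for the zero initialization
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
        return theta, s, r, t

    # Illustration on f(theta) = ||theta||^2 / 2 (gradient = theta).
    theta, s, r, t = np.array([1.0, -0.5]), np.zeros(2), np.zeros(2), 0
    for _ in range(3000):
        theta, s, r, t = adam_step(theta, s, r, theta, t)
    print(theta)    # close to the minimum at the origin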
Reference
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
End