Introduction to Deep Learning
Biological Neurons
AI Development
The Deep Revival
From CAT to CNN
DL : Faster, Higher, Stronger
DL : Sequencing Models
Gaming
The rise of Transformer
From Language to Vision
Discrimination to Generalization
Questions??
Artificial Neuron
Guess the personality??
Questions??
Decision
Boolean Function
McCulloch Pitts Model
OR Using MP Model
Non-Boolean Function
Example
OR Using Perceptron Model
Errors
Questions?
Perceptron Learning Algorithm
Questions??
Linearly separable functions
OR/XOR Using Perceptron Model
What is the solution for points which are not linearly separable??
Network of Perceptron (MLP)
Multilayer Network of Perceptron (MLP)
XOR Using MLP
Three Input MLP
What if you have more than 3 Input??
MLP
Sigmoid Neuron
Supervised Learning
Machine Learning SL Setup
• Data?
• Model?
• Parameter?
• Learning Algorithm?
• Objective Function / Loss function?
Learning Parameter
Learning Algorithm
Example:
Calculation
Questions??
Feed Forward Network
Multilayer Network of neuron
feed forward neural network
Questions??
Learning parameters: Gradient Descent
Calculate Grad(θ): Grad(W) and Grad(b)
Example:
Calculation
Problem Type-1
Problem Type-2
Problem Type : Regression / Classification
Questions??
Activation Function
• Nonlinear: When the activation function is non-linear, a two-layer neural
network can be proven to be a universal function approximator. The identity
activation function does not satisfy this property: when every layer uses the
identity activation function, the entire network is equivalent to a single-layer
linear model.
• Range: When the range of the activation function is finite, gradient-based
training methods tend to be more stable, because pattern presentations
significantly affect only a limited number of weights. When the range is infinite,
training is generally more efficient, because pattern presentations significantly
affect most of the weights; in that case, smaller learning rates are typically
necessary.
• Continuously differentiable: This property is desirable for enabling
gradient-based optimization methods. ReLU is not continuously differentiable
(its derivative jumps at 0), which causes some issues for gradient-based
optimization, but training with it is still possible. The binary step activation
function is not differentiable at 0 and its derivative is 0 for all other values,
so gradient-based methods can make no progress with it.
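The claim that stacked identity-activation layers collapse into a single layer can be checked numerically. The sketch below is only an illustration: the weight matrices W1 and W2, the input x, and the helper matmul are made-up values and names, not taken from the slides.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical two-layer network with identity activation: y = (x W1) W2
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[3.0], [-2.0]]
x = [[1.0, 4.0]]

# Pass the input through both layers in turn
two_layer = matmul(matmul(x, W1), W2)

# Fold the two layers into one weight matrix and apply it once
W_single = matmul(W1, W2)
one_layer = matmul(x, W_single)

print(two_layer, one_layer)  # both give the same output
```

Because matrix multiplication is associative, no amount of extra identity layers adds expressive power; a non-linear activation between the layers is what breaks this equivalence.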
Sigmoid Function
• Sigmoid functions are used in machine learning for logistic regression
and basic neural network implementations, where they serve as the
introductory activation unit. For deeper networks, sigmoid functions
are usually not preferred because of drawbacks such as the vanishing
gradient problem: the derivative is at most 0.25 and approaches 0 for
inputs of large magnitude.
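To make the vanishing gradient remark concrete, here is a minimal sketch of the sigmoid and its derivative evaluated at a few inputs. The function names sigmoid and sigmoid_grad are chosen for this illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

Since each sigmoid layer contributes a factor of at most 0.25 to the backpropagated gradient, a chain of n such layers scales it by at most 0.25 to the power n, which is one way to see why deep sigmoid networks train slowly.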
Tanh Function
• The tanh function addresses, though not entirely, the drawback seen in
the sigmoid function. The main difference from the sigmoid function is
that the curve is symmetric about the origin, with values ranging from
-1 to 1.
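One way to see the relationship is that tanh is a rescaled, recentred sigmoid: tanh(x) = 2·sigmoid(2x) - 1. The sketch below checks this identity numerically and shows the larger slope at the origin; the helper names are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a shifted and scaled sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x))

# Slope at the origin is 1 for tanh, compared with 0.25 for the sigmoid
print(tanh_grad(0.0))
```

The zero-centred output and larger maximum slope help, but tanh still saturates for large |x|, which is why the vanishing gradient issue is reduced rather than removed.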
ReLU
• A Rectified Linear Unit (ReLU), i.e. a unit employing the rectifier,
outputs 0 if the input is less than 0 and the raw input otherwise:
f(x) = max(0, x). That is, if the input is greater than 0, the output is
equal to the input. The operation of ReLU is often described as closer
to the way biological neurons work.
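A minimal sketch of ReLU and the derivative typically used for it in training follows. Treating the derivative at exactly 0 as 0 is a convention assumed here, since the true derivative is undefined at that point.

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, identity otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

inputs = (-2.0, -0.5, 0.0, 0.5, 2.0)
outputs = [relu(v) for v in inputs]
grads = [relu_grad(v) for v in inputs]
print(outputs)
print(grads)
```

For positive inputs the gradient is exactly 1, so it does not shrink as it passes through the unit; for negative inputs it is 0, so the unit passes no gradient at all.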
Softmax Function
• Softmax is a very interesting activation function because it not
only maps each output to the [0, 1] range but also scales the outputs
so that their total sum is 1. The output of Softmax can therefore be
read as a probability distribution.
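The two properties above (each value in [0, 1], total equal to 1) can be seen in a short sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick assumed here; it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Very large scores would overflow math.exp without the shift
print(softmax([1000.0, 1000.0]))
```

Larger scores receive larger probabilities, and the ordering of the inputs is preserved in the outputs.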
Forms of GD
Training of Feedforward Neural Network with
Gradient Descent
• Training FNNs involves adjusting their weights to minimize the loss
function, which measures the difference between the network's
predictions and the actual targets. Gradient Descent (GD) is a
fundamental method used for this optimization.
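Before the network example, the idea can be sketched on the smallest possible model: a single weight w fitted with plain gradient descent on a mean squared error loss. The toy data (generated from y = 2x), the learning rate, and the step count are assumptions chosen for this illustration.

```python
# Toy data generated by y = 2x (an assumption for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss(w):
    """Mean squared error between predictions w * x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the loss with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0
learning_rate = 0.01
initial_loss = loss(w)
for step in range(500):
    w -= learning_rate * grad(w)   # move against the gradient

print(w, loss(w))
```

Each step moves w a little in the direction that reduces the loss, which is the same rule the network code applies to every weight and bias at once.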
Step-1
import numpy as np
# Simple example: training a single sigmoid neuron to learn the AND function
# Inputs and corresponding targets for AND (the output is 1 only for input [1, 1])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [0], [0], [1]])
# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Derivative of the sigmoid, written in terms of the sigmoid output s = sigmoid(x)
def sigmoid_derivative(s):
    return s * (1 - s)
Step-2
# Initialize weights randomly
weights = np.random.uniform(size=(2, 1))
bias = np.random.uniform(size=(1,))
learning_rate = 0.1
# Training loop
for epoch in range(10000):
    inputs = X
    # Forward propagation
    z = np.dot(inputs, weights) + bias
    output = sigmoid(z)
    # Calculate the error
    error = y - output
Step-3
    # Backpropagation (still inside the training loop)
    adjustment = error * sigmoid_derivative(output)
    weights += np.dot(inputs.T, adjustment) * learning_rate
    bias += np.sum(adjustment, axis=0) * learning_rate
# Predictions after training
print("Output after training")
print(output)
Momentum Based Gradient Descent
• Momentum helps accelerate GD in the relevant direction and dampens
oscillations by adding a fraction of the previous update to the
current one.
# compute_gradients, data, weights, learning_rate and num_epochs are placeholders
momentum = 0.9
v = 0  # Initialize velocity
for epoch in range(num_epochs):
    gradients = compute_gradients(data, weights)
    v = momentum * v + learning_rate * gradients
    weights = weights - v
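The momentum update can be tried on a toy problem to see the acceleration. The sketch below compares it with plain gradient descent on a shallow one-dimensional quadratic; the loss f(w) = 0.05 * w**2, the starting point, and the hyperparameters are all assumptions picked for illustration.

```python
def grad(w):
    """Gradient of the toy loss f(w) = 0.05 * w**2 (assumed for illustration)."""
    return 0.1 * w

learning_rate = 0.1
momentum = 0.9
num_epochs = 100

# Plain gradient descent
w_plain = 5.0
for _ in range(num_epochs):
    w_plain -= learning_rate * grad(w_plain)

# Momentum-based gradient descent
w_mom, v = 5.0, 0.0
for _ in range(num_epochs):
    v = momentum * v + learning_rate * grad(w_mom)  # accumulate velocity
    w_mom -= v

print(w_plain, w_mom)
```

On this gently sloped loss, plain gradient descent is still far from the minimum at 0 after 100 steps, while the accumulated velocity carries the momentum variant much closer.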
Nesterov Accelerated Gradient Descent
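• Nesterov Accelerated Gradient (NAG) is a slight variation on the
momentum idea, where the gradient is calculated at a look-ahead point
(the position the current velocity would carry the weights to) rather
than at the current position.
# compute_gradients, data, weights, learning_rate and num_epochs are placeholders
momentum = 0.9
v = 0  # Initialize velocity
for epoch in range(num_epochs):
    temp_weights = weights - momentum * v  # Look-ahead position
    gradients = compute_gradients(data, temp_weights)
    v = momentum * v + learning_rate * gradients
    weights = weights - v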
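As a runnable counterpart, here is a minimal sketch of the NAG update on a one-dimensional quadratic. The loss f(w) = 0.05 * w**2, the starting point, and the hyperparameters are assumptions for illustration only.

```python
def grad(w):
    """Gradient of the toy loss f(w) = 0.05 * w**2 (assumed for illustration)."""
    return 0.1 * w

learning_rate = 0.1
momentum = 0.9
num_epochs = 200

w, v = 5.0, 0.0
for _ in range(num_epochs):
    lookahead = w - momentum * v       # where the velocity alone would take us
    g = grad(lookahead)                # gradient evaluated at the look-ahead point
    v = momentum * v + learning_rate * g
    w -= v

print(w)
```

The only difference from classical momentum is where the gradient is evaluated; looking ahead lets the update begin correcting before the velocity overshoots.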
Stochastic Gradient Descent (SGD)
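• SGD updates the weights using a gradient computed on a single example
or a small batch of the data rather than the full dataset, so each
update is cheaper and training is typically faster.
# compute_gradients, data, weights, learning_rate and num_epochs are placeholders
for epoch in range(num_epochs):
    for batch in data:  # data is assumed to be split into batches
        gradients = compute_gradients(batch, weights)
        weights = weights - learning_rate * gradients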
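A self-contained sketch of the mini-batch loop follows, fitting a line to noiseless toy data. The data (generated from y = 3x + 1), the batch size, the learning rate, and the helper compute_gradients as written here are assumptions for this example.

```python
import random

random.seed(0)  # fixed seed so the shuffling is repeatable

# Toy data generated by y = 3x + 1 (an assumption for illustration)
data = [(i / 10, 3.0 * (i / 10) + 1.0) for i in range(20)]

def compute_gradients(batch, w, b):
    """Gradients of the mean squared error over one batch."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 4

for epoch in range(200):
    random.shuffle(data)                          # new batch composition each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w, grad_b = compute_gradients(batch, w, b)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Each update sees only four examples, so individual steps are noisy, but over many epochs the parameters settle near the values that generated the data.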
AdaGrad, RMSProp, and Adam
• These are adaptive learning rate optimization algorithms. AdaGrad
scales the learning rate of each parameter by its accumulated squared
gradients, RMSProp replaces that running sum with an exponentially
decaying average so the step size does not keep shrinking over long
runs, and Adam combines the ideas of momentum and RMSProp for
efficient and effective optimization.
# Adam update; compute_gradients, data, weights, learning_rate and num_epochs are placeholders
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
m = 0
v = 0
for t in range(1, num_epochs + 1):  # t starts at 1 so the bias correction never divides by zero
    gradients = compute_gradients(data, weights)
    m = beta1 * m + (1 - beta1) * gradients
    v = beta2 * v + (1 - beta2) * (gradients ** 2)
    m_hat = m / (1 - beta1 ** t)  # Correct bias
    v_hat = v / (1 - beta2 ** t)
    weights = weights - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
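The difference between AdaGrad and RMSProp is easiest to see in the step size each one takes when the gradient stays constant. The sketch below feeds both a constant gradient of 1.0 for 100 steps; that constant gradient, the decay rate of 0.9, and the variable names are assumptions made for this illustration.

```python
import math

learning_rate = 0.1
epsilon = 1e-8
decay = 0.9          # RMSProp decay rate (assumed value)
g = 1.0              # constant gradient, purely for illustration

adagrad_cache = 0.0
rmsprop_cache = 0.0
adagrad_steps = []
rmsprop_steps = []

for t in range(100):
    # AdaGrad: running SUM of squared gradients, so the cache only grows
    adagrad_cache += g ** 2
    adagrad_steps.append(learning_rate * g / (math.sqrt(adagrad_cache) + epsilon))

    # RMSProp: exponentially decaying AVERAGE of squared gradients
    rmsprop_cache = decay * rmsprop_cache + (1 - decay) * g ** 2
    rmsprop_steps.append(learning_rate * g / (math.sqrt(rmsprop_cache) + epsilon))

print(adagrad_steps[0], adagrad_steps[-1])
print(rmsprop_steps[0], rmsprop_steps[-1])
```

AdaGrad's step shrinks like 1/sqrt(t) even though the gradient never changes, whereas RMSProp's step levels off near the base learning rate. Adam keeps this decaying average and adds a momentum term on the gradient itself.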
Forms of GD
Gradient Descent Variant | Update Strategy | Key Feature | Computational Cost | Convergence Speed | Application
Batch Gradient Descent (BGD) | Full Dataset | Global Minimum (Convex Functions) | High | Slow | Simple regression, small datasets
Stochastic Gradient Descent (SGD) | Single Example | Escapes Local Minima | Low | Fast | Online learning, real-time applications
Mini-Batch Gradient Descent | Small Batch | Balances Efficiency & Stability | Medium | Medium | Deep learning, large-scale classification problems
Momentum-Based Gradient Descent | Full/Batch | Faster Convergence | Medium | Fast | Image recognition, deep learning frameworks
Nesterov Accelerated Gradient (NAG) | Full/Batch | Smooth Convergence | Medium | Faster than Momentum | Speech recognition, NLP tasks
Adagrad | Adaptive | Good for Sparse Data | Low | Slows Over Time | Text processing, NLP applications
RMSprop | Adaptive | Prevents Learning Rate Decay | Medium | Fast | Recurrent Neural Networks (RNNs), speech analysis
Adam | Adaptive | Combines Momentum & RMSprop | Medium | Very Fast | General deep learning, CNNs, NLP, reinforcement learning
Nadam | Adaptive | Adds Nesterov Momentum | Medium | Faster than Adam | Computer vision, sequence modeling
AdaMax | Adaptive | Stable Updates | Medium | Fast | Training GANs, complex neural networks
AMSGrad | Adaptive | Prevents Learning Rate Decay | Medium | Stable | Financial modeling, advanced AI applications
Bias and Variance
Train error vs Test error
Regularization
L2 Regularization
• L2 regularization, known as weight decay in the context of neural
networks, is commonly applied to the weights of the neural network
layers.
• It helps prevent overfitting by shrinking the weights, making the
network less sensitive to small changes in input data.
• L2 regularization encourages smaller, more evenly distributed weights
by adding a penalty based on the square of the coefficients.
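The shrinking effect can be sketched directly. The example below assumes the common form of the penalty, (lam / 2) times the sum of squared weights, whose gradient is lam * w; the weights, lam, learning rate, and the zero data gradient are illustrative assumptions used to isolate the decay term.

```python
def l2_penalty(weights, lam):
    """Penalty added to the loss: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)

def l2_gradient(weights, lam):
    """Gradient of the penalty with respect to each weight: lam * w."""
    return [lam * w for w in weights]

weights = [3.0, -2.0, 0.5]
lam = 0.1
learning_rate = 0.5

# Assume the data loss contributes zero gradient, to isolate the decay effect
data_grad = [0.0, 0.0, 0.0]
reg_grad = l2_gradient(weights, lam)

updated = [w - learning_rate * (g + r)
           for w, g, r in zip(weights, data_grad, reg_grad)]

print(l2_penalty(weights, lam))
print(updated)
```

With this penalty, every gradient step multiplies each weight by (1 - learning_rate * lam) before the data gradient is applied, which is where the name weight decay comes from. Large weights are pulled towards zero more strongly than small ones.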
Data Augmentation
• Typically, more data = better learning
• Works well for image classification / object recognition tasks
• Also shown to work well for speech
• For some tasks it may not be clear how to generate such data
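For images, label-preserving transformations are a simple way to generate such extra data. The sketch below treats an image as a nested list of pixel values and produces a mirrored and a shifted copy; the helper names, the tiny 2x3 image, and the choice of transformations are assumptions for illustration.

```python
def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels` columns, padding with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

# A tiny 2x3 "image" standing in for real pixel data
image = [[1, 2, 3],
         [4, 5, 6]]

flipped = horizontal_flip(image)
shifted = shift_right(image, 1)

# One original example becomes three training examples with the same label
augmented = [image, flipped, shifted]
print(augmented)
```

Whether a transformation preserves the label depends on the task: a horizontal flip is usually harmless for object recognition but would change the meaning of handwritten characters such as "b" and "d".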
Questions??