
🏪

Advancement in Deep Learning

Unit 1:

Reviewing Deep Learning Concepts, NN

Regularization



Batch Normalization

Batch Normalization (BN) is a technique used to address the problem of internal covariate shift in deep neural networks. Internal covariate shift refers to the change in the distribution of the network activations as the parameters of the preceding layers change during training. BN normalizes the activations of each layer by adjusting and scaling them.

Batch Normalization offers several benefits:

Improved Training Speed: BN can accelerate the training process by reducing the internal covariate shift, allowing for higher learning rates and faster convergence.

Stabilized Gradients: BN helps stabilize the gradients, making optimization more robust and less sensitive to weight initialization.

Regularization: BN acts as a form of regularization, reducing the need for other regularization techniques like dropout.

Allows for Deeper Networks: BN enables the training of deeper networks by mitigating the vanishing or exploding gradient problems.

Batch Normalization is typically applied before the activation function in each layer of the network, although it is sometimes applied after the activation, and variants such as Batch Renormalization refine how the normalization statistics are computed. It has become a standard component in many modern deep learning architectures and is widely used in practice to improve training stability and performance.
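A minimal sketch of how BN is typically placed before the activation in a PyTorch layer (the layer sizes here are illustrative assumptions):

import torch
from torch import nn

# A small conv block with Batch Normalization applied before the activation.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(16),   # normalize each of the 16 channels over the batch
    nn.ReLU(inplace=True),
)

x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB images, 32x32
print(block(x).shape)           # torch.Size([8, 16, 32, 32])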

Layer Normalization (out of syllabus)


Layer Normalization (LN) is another technique used to address the problem of
internal covariate shift in deep neural networks, similar to Batch Normalization
(BN). However, unlike BN, which normalizes activations across the mini-batch
dimension, LN normalizes activations across the feature dimension (or layer
dimension) independently for each training example.



Layer Normalization offers benefits similar to Batch Normalization, such as
improved training speed, stabilized gradients, and regularization. However, it
operates independently for each training example rather than across mini-
batches, which can be advantageous in certain scenarios, especially when the
size of the mini-batch is small or when dealing with recurrent neural networks
(RNNs) where the concept of mini-batches is less applicable.
Layer Normalization has found applications in various deep learning
architectures, particularly in scenarios where Batch Normalization may not be
suitable due to constraints on mini-batch size or network architecture.
Additionally, Layer Normalization has been shown to be effective in stabilizing
the training of transformers and recurrent neural networks.
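A minimal PyTorch sketch contrasting the two (the feature size of 10 is an illustrative assumption):

import torch
from torch import nn

x = torch.randn(4, 10)           # batch of 4 examples, 10 features each

bn = nn.BatchNorm1d(10)          # normalizes each feature across the batch dimension
ln = nn.LayerNorm(10)            # normalizes each example across its own 10 features

print(bn(x).shape, ln(x).shape)  # both torch.Size([4, 10]), but the statistics differ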



Weight Initialization Strategies

Learning vs Optimization



In the context of deep learning, "learning" and "optimization" are closely related but distinct concepts.

1. Learning:

In deep learning, "learning" refers to the process by which a model acquires knowledge or understanding from data through training.

This process involves adjusting the parameters (weights and biases) of the model based on the input data and the desired outputs, with the goal of minimizing the difference between the model's predictions and the actual targets.

Learning in deep learning often involves iterative updates to the model parameters using a training algorithm such as stochastic gradient descent (SGD) or one of its variants.

2. Optimization:

Optimization, on the other hand, specifically refers to the process of finding the best set of parameters for a given model with respect to a certain objective function.

In the context of deep learning, this typically involves minimizing a loss function that quantifies the difference between the model's predictions and the actual targets.

Optimization algorithms are used to iteratively adjust the parameters of the model in order to minimize this loss function.

Common optimization algorithms in deep learning include gradient descent, stochastic gradient descent (SGD), Adam, RMSprop, and others.

In summary, learning in deep learning encompasses the broader process of acquiring knowledge from data through training, while optimization refers specifically to the process of finding the optimal set of parameters for a given model by minimizing a defined objective function. Learning involves optimization as a crucial step, but it also includes other components such as data preprocessing, model architecture design, and evaluation.
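A minimal PyTorch sketch of this relationship: the training loop as a whole is the "learning" process, and the optimizer call inside it is the "optimization" step (the toy model and data are illustrative assumptions):

import torch
from torch import nn

# Toy regression data (illustrative assumption)
X = torch.randn(64, 3)
y = X.sum(dim=1, keepdim=True)

model = nn.Linear(3, 1)                                   # model whose parameters are learned
loss_fn = nn.MSELoss()                                    # objective function to minimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # optimization algorithm

for epoch in range(20):                                   # "learning": the overall training process
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                                       # gradients of the loss w.r.t. the parameters
    optimizer.step()                                      # "optimization": one parameter update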

Effective training in Deep Net



Early Stopping,



Normalization (Batch, Instance, Group)
Normalization is a data pre-processing tool used to bring numerical data to a common scale without distorting its shape. Generally, when we feed data to a machine learning or deep learning algorithm, we tend to rescale the values to a balanced scale. Normalization also acts as a mild regularizer, which helps reduce overfitting.

Batch:

Instance:

Group:
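The three variants differ only in which dimensions the statistics are computed over. A minimal PyTorch sketch (the channel count and group count are illustrative assumptions):

import torch
from torch import nn

x = torch.randn(8, 32, 28, 28)         # (batch, channels, height, width)

batch_norm = nn.BatchNorm2d(32)        # statistics per channel, across the whole batch
instance_norm = nn.InstanceNorm2d(32)  # statistics per channel, per individual sample
group_norm = nn.GroupNorm(num_groups=4, num_channels=32)  # per sample, over groups of 8 channels

for norm in (batch_norm, instance_norm, group_norm):
    print(type(norm).__name__, norm(x).shape)  # shape is unchanged: (8, 32, 28, 28)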



Batch Gradient Descent (GD)



GD with momentum
Momentum is an extension to the gradient descent optimization algorithm that
allows the search to build inertia in a direction in the search space and
overcome the oscillations of noisy gradients and coast across flat spots of the
search space.

The problem with gradient descent is that the weight update at a moment (t) is
governed by the learning rate and gradient at that moment only. It doesn’t take
into account the past steps taken while traversing the cost space.

It leads to the following problems:

1. The gradient of the cost function at saddle points (plateaus) is negligible or zero, which in turn leads to small or no weight updates. Hence, the network becomes stagnant and learning stops.

2. The path followed by Gradient Descent is very jittery even when operating in mini-batch mode.

How can momentum be applied to Gradient Descent?

To account for momentum, we use a moving average over the past gradients. In regions where the gradient is consistently large, the weight updates will be large; in this way we gather momentum by taking a moving average over these gradients. But there is a problem with this method: it considers all the gradients over the iterations with equal weightage. The gradient at t = 0 has the same weightage as the gradient at the current iteration t. We need some sort of weighted average of the past gradients in which the recent gradients are given more weightage.
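A minimal sketch of the resulting update rule, using an exponentially weighted moving average of the gradients (the quadratic toy objective and the hyperparameter values are illustrative assumptions):

import numpy as np

def grad(w):
    """Gradient of a toy quadratic objective f(w) = w^2 (illustrative assumption)."""
    return 2 * w

w = 5.0       # initial weight
v = 0.0       # velocity: exponentially weighted moving average of past gradients
lr = 0.1      # learning rate
beta = 0.9    # momentum coefficient: recent gradients get more weight

for t in range(200):
    g = grad(w)
    v = beta * v + (1 - beta) * g   # weighted average, recent gradients weighted more
    w = w - lr * v                  # the update uses the velocity, not the raw gradient

print(round(w, 6))  # w has moved close to the minimum at w = 0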

Unit 2:
Recent Trends in Deep Learning Architectures,
GANs:



VGG:

Inception Net:



Residual Network



Pros:

Enables training of extremely deep networks (100+ layers).

Alleviates vanishing gradient problem via skip connections.

Facilitates feature reuse and learning of residual functions.

Improves optimization by allowing gradients to flow more easily.

Cons:

Increased model complexity compared to shallower networks.

Requires careful initialization and regularization to prevent overfitting.

May suffer from degradation problem if not properly tuned.

Training can still be time-consuming and computationally intensive for very deep architectures.

Understanding ResNet and analyzing various models on the CIFAR-10 dataset

Introduction



Deep neural networks are fascinating and can seem to work like magic when we use them to predict something, whether with images or text. In the past 10 years there has been major improvement in deep learning, especially when it comes to image recognition, and researchers keep developing newer models to improve the accuracy of existing systems.

Challenges in building Neural Networks

One of the major challenges is how deep networks can be built. Theoretically, it sounds appealing to build deeper networks, but in reality we encounter a problem called degradation: the training error increases as deeper layers are added, which hurts accuracy a lot. Another problem with building deeper networks is the vanishing gradient problem. This happens in the backpropagation step; as we know, in neural networks we need to adjust the weights after calculating the loss function.

While backpropagating, we follow the chain rule: the derivatives of each layer are multiplied down the network. When we use many layers with activations such as the sigmoid, the derivative of each layer is at most 0.25. So when the derivatives of n such layers are multiplied, the gradient decreases exponentially as we propagate back to the initial layers.

As mentioned earlier, when we go very deep into a network the earlier blocks have already learned a lot, and the additional deeper blocks ideally only need to be an identity mapping of the earlier blocks, i.e. produce the same output. The degradation results suggest that plain networks have difficulty learning even this identity mapping. To solve these problems the ResNet paper introduced residual blocks, which are stacked together and allow us to build deep networks without degradation or vanishing gradients.



Train and Test error visualized on 56 and 34 layers plain model
(https://arxiv.org/pdf/1512.03385.pdf)

We may think that it could be a result of overfitting too, but here the error of the 56-layer network is worse on both the training and the test data, which does not happen when a model is overfitting.

Derivative of sigmoid layers (https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)

We can see that the derivative of the sigmoid function ranges from 0 to 0.25. When we multiply a chain of such values as we go deeper, we end up with a very small gradient, which hampers the weight updates driven by our loss function.

How does ResNet work?



Let us now understand how ResNet works. Here we have something called residual blocks; many residual blocks are stacked together to form a ResNet. The "skip connections" are the major part of ResNet. The following figure from the original paper shows how a residual block works: the idea is to connect the input of a block directly to its output, skipping a few layers in between. Here x is the input to the block, which is carried unchanged by the identity (skip) connection, and F(x) is the output of the stacked layers. The output of the block is then F(x) + x.

Comparison of ResNet with Plain Networks

Now let us compare ResNets with plain networks. In his deep learning course, Andrew Ng notes that one of the main benefits of ResNets is how they behave in terms of training error. In plain networks, as we increase the number of layers the training error first decreases, but after a certain depth it starts increasing again; this is why we deploy methods like early stopping. This behaviour is resolved with ResNets: as layers are added, the error only tends to decrease rather than increase. The authors of the ResNet paper also compared plain networks and ResNets at 18 and 34 layers, and the ResNets give lower error than the plain networks.

Skip Connection Network

Pros:

Facilitates training of very deep networks.

Helps alleviate vanishing gradient problem.

Allows for better information flow through layers.

Enables reuse of features from earlier layers.

Cons:

Increased model complexity.

Requires careful design to optimize performance.

May lead to increased memory and computational requirements.

Training can still be challenging with extremely deep architectures.

What are Skip Connections in Deep Learning?



Introduction
The need for deeper networks emerges while handling complex tasks. However, training a deep neural net brings many complications, not only limited to overfitting and high computation costs, but also some non-trivial problems. In this article, we will address some of these problems using skip connections.

Why Skip Connections?


The beauty of deep neural networks is that they can learn complex functions
more efficiently than their shallow counterparts. While training deep neural
nets, the performance of the model drops down with the increase in depth of
the architecture. This is known as the degradation problem. But what could be the reasons for this saturation in accuracy with increasing network depth? Let us try to understand the reasons behind the degradation problem.

Deeper Network Performance Analysis: Overfitting Discarded


One of the possible reasons could be overfitting. The model tends to overfit
with the increase in depth but that’s not the case here. As you can infer from
the below figure, the deeper network with 56 layers has more training error
than the shallow one with 20 layers. The deeper model doesn’t perform as
well as the shallow one. Clearly, overfitting is not the problem here.
Train and test error for 20-layer and 56-layer NN



Gradient Issues in ResNet Construction
Another possible reason can be vanishing gradient and/or exploding gradient
problems. However, the authors of ResNet (He et al.) argued that the use of
Batch Normalization and proper initialization of weights through normalization
ensures that the gradients have healthy norms. But, what went wrong here?
Let’s understand this by construction.
Consider a shallow neural network that was trained on a dataset. Also consider a deeper one in which the initial layers have the same weight matrices as the shallow network (the blue-colored layers in the diagram below) with some extra layers added (the green-colored layers). We set the weight matrices of the added layers to identity matrices (identity mappings).

Diagram explaining the construction

From this construction, the deeper network should not produce any higher training error than its shallow counterpart, because we are actually using the shallow model's weights in the deeper network with added identity layers. But experiments show that the deeper network produces higher training error compared to the shallow one. This demonstrates the inability of the deeper layers to learn even identity mappings.

The degradation of training accuracy indicates that not all systems are similarly easy to optimize.

One of the primary reasons is the random initialization of weights with a mean around zero, together with L1 and L2 regularization. As a result, the weights in the model stay around zero, and thus the deeper layers cannot learn identity mappings. Here comes the concept of skip connections, which enables us to train very deep neural networks. Let's learn this concept now.

What are Skip Connections?

Skip Connections (or Shortcut Connections), as the name suggests, skip some of the layers in the neural network and feed the output of one layer as the input to later layers.

Skip connections were introduced to solve different problems in different architectures. In the case of ResNets, skip connections solved the degradation problem that we addressed earlier, whereas in the case of DenseNets they ensure feature reusability. We'll discuss them in detail in the following sections.



How do Skip Connections Work?
Skip connections were introduced in literature even before residual networks.
For example, Highway Networks (Srivastava et al.) had skip connections
with gates that controlled and learned the flow of information to deeper layers.
This concept is similar to the gating mechanism in LSTMs. Although a ResNet is effectively a special case of a Highway network, Highway networks do not perform as well as ResNets. This suggests that it's better to keep the gradient highways clear than to go for gates – simplicity wins here!
Neural networks can learn any functions of arbitrary complexity, which could
be high-dimensional and non-convex. Visualizations have the potential to help
us answer several important questions about why neural networks work. And
there is actually some nice work done by Li et al. which enables us to visualize
the complex loss surfaces. The results from the networks with skip
connections are even more surprising! Take a look at them.
The loss surfaces of ResNet-56 with and without skip connections

As you can see, the loss surface of the neural network with skip connections is smoother, leading to faster convergence than the network without any skip connections. Let's see the variants of skip connections in the next section.



Variants of Skip Connections
In this section, we will see the variants of skip connections in different
architectures. Skip Connections can be used in 2 fundamental ways in Neural
Networks: Addition and Concatenation.

Residual Networks (ResNets)


Residual Networks were proposed by He et al. in 2015 to solve the image
classification problem. In ResNets, the information from the initial layers is
passed to deeper layers by matrix addition. This operation doesn’t have any
additional parameters as the output from the previous layer is added to the
layer ahead. A single residual block with skip connection looks like this:
A residual block

Thanks to the deep-layer representations of ResNets, pre-trained weights from this network can be used to solve multiple tasks. It's not limited to image classification; it can also solve a wide range of problems in image segmentation, keypoint detection and object detection. Hence, ResNet is one of the most influential architectures in the deep learning community.



Next, we’ll learn about another variant of skip connections in DenseNets which
is inspired by ResNets.

I would recommend going through the resource below for a more detailed understanding of ResNets:

Understanding ResNet and analyzing various models on the CIFAR-10 dataset

Densely Connected Convolutional Networks (DenseNets)


DenseNets were proposed by Huang et al. in 2017. The primary difference
between ResNets and DenseNets is that DenseNets concatenates the output
feature maps of the layer with the next layer rather than a summation.

Coming to skip connections, DenseNets use concatenation whereas ResNets use summation.

A 5-layer dense block



The idea behind the concatenation is to use features that are learned from
earlier layers in deeper layers as well. This concept is known as Feature
Reusability. So, DenseNets can learn mapping with fewer parameters than a
traditional CNN as there is no need to learn redundant maps.

U-Net: Convolutional Networks for Biomedical Image Segmentation

The use of skip connections has influenced the biomedical field too. U-Nets were proposed by Ronneberger et al. for biomedical image segmentation. The architecture has an encoder part and a decoder part connected by skip connections, and its overall shape looks like the English letter "U", hence the name U-Net.
U-Net architecture



The layers in the encoder part are skip connected and concatenated with
layers in the decoder part (those are mentioned as grey lines in the above
diagram). This makes the U-Nets use fine-grained details learned in the
encoder part to construct an image in the decoder part.
These kinds of connections are long skip connections whereas the ones we
saw in ResNets were short skip connections. More about U-Nets here.
Okay! Enough of theory, let’s implement a block of the discussed architectures
and how to load and use them in PyTorch!

Implementation of Skip Connections


In this section, we will build ResNet and DenseNet blocks using skip connections from scratch. Are you excited? Let's go!

ResNet – A Residual Block



First, we will implement a residual block using skip connections. PyTorch is
preferred because of its super cool feature – object-oriented structure.

# import required libraries
import torch
from torch import nn
import torch.nn.functional as F
import torchvision

# Basic residual block of ResNet.
# This is generic in the sense that it can also be used for downsampling of features.
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=[1, 1], downsample=None):
        """
        A basic residual block of ResNet

        Parameters
        ----------
        in_channels: number of channels of the input
        out_channels: number of channels of the output
        stride: strides used in the two convolutional layers
        downsample: a callable applied to the residual before the addition
        """
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3, stride=stride[0],
            padding=1, bias=False
        )
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3, stride=stride[1],
            padding=1, bias=False
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        # apply the downsample function (if any) before adding the residual to the output
        if self.downsample is not None:
            residual = self.downsample(residual)
        out = F.relu(self.bn(self.conv1(x)))
        out = self.bn(self.conv2(out))
        # note that the residual is added before the final activation
        out = out + residual
        out = F.relu(out)
        return out


As we have a Residual block in our hand, we can build a ResNet model of
arbitrary depth! Let’s quickly build the first five layers of ResNet-34 to get an
idea of how to connect the residual blocks.

# downsample using a 1 x 1 convolution
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128)
)

# first five layers of ResNet-34
resnet_blocks = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ResidualBlock(64, 64),
    ResidualBlock(64, 64),
    ResidualBlock(64, 128, stride=[2, 1], downsample=downsample)
)

# checking the output shape
inputs = torch.rand(1, 3, 100, 100)  # a single 100 x 100 colour image
outputs = resnet_blocks(inputs)
print(outputs.shape)  # shape would be (1, 128, 13, 13)


PyTorch provides us an easy way to load ResNet models with pretrained
weights trained on the ImageNet dataset.

# one could also use pretrained weights of ResNet trained on ImageNet
resnet34 = torchvision.models.resnet34(pretrained=True)

DenseNet – A Dense Block


Implementing the complete DenseNet would be a little complex, so let's take it step by step.

1. Implement a DenseNet layer

2. Build a dense block

3. Connect multiple dense blocks to obtain a densenet model



class Dense_Layer(nn.Module):
    def __init__(self, in_channels, growthrate, bn_size):
        super(Dense_Layer, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(
            in_channels, bn_size * growthrate, kernel_size=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(bn_size * growthrate)
        self.conv2 = nn.Conv2d(
            bn_size * growthrate, growthrate, kernel_size=3, padding=1,
            bias=False
        )

    def forward(self, prev_features):
        # concatenate all previously produced feature maps along the channel dimension
        out1 = torch.cat(prev_features, dim=1)
        out1 = self.conv1(F.relu(self.bn1(out1)))
        out2 = self.conv2(F.relu(self.bn2(out1)))
        return out2


Next, we’ll implement a dense block that consists of an arbitrary number of
DenseNet layers.

class Dense_Block(nn.ModuleDict):
    def __init__(self, n_layers, in_channels, growthrate, bn_size):
        """
        A dense block consists of `n_layers` of `Dense_Layer`

        Parameters
        ----------
        n_layers: number of dense layers to be stacked
        in_channels: number of input channels for the first layer in the block
        growthrate: growth rate (k) as mentioned in the DenseNet paper
        bn_size: multiplicative factor for the number of bottleneck layers
        """
        super(Dense_Block, self).__init__()
        layers = dict()
        for i in range(n_layers):
            layer = Dense_Layer(in_channels + i * growthrate, growthrate, bn_size)
            layers['dense{}'.format(i)] = layer
        self.block = nn.ModuleDict(layers)

    def forward(self, features):
        if isinstance(features, torch.Tensor):
            features = [features]
        for _, layer in self.block.items():
            new_features = layer(features)
            features.append(new_features)
        return torch.cat(features, dim=1)


From the dense block, let’s build DenseNet. Here, I’ve omitted the transition
layers of DenseNet architecture (which acts as downsampling) for simplicity.

# a block consisting of initial conv layers followed by 6 dense layers
dense_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, padding=3, stride=2, bias=False),
    nn.BatchNorm2d(64),
    nn.MaxPool2d(3, 2),
    Dense_Block(6, 64, growthrate=32, bn_size=4),
)

inputs = torch.rand(1, 3, 100, 100)
outputs = dense_block(inputs)
print(outputs.shape)  # shape would be (1, 256, 24, 24)

# one could also use pretrained weights of DenseNet trained on ImageNet
densenet121 = torchvision.models.densenet121(pretrained=True)

Conclusion
In this article, we’ve discussed the importance of skip connections for the
training of deep neural nets and how skip connections were used in ResNet,
DenseNet, and U-Net with its implementation. I know, this article covers many
theoretical aspects which are not easy to grasp in one go. So, feel free to
leave comments if you have any.


Frequently Asked Questions

Q1. Why skip connections in ResNet?



A. Skip connections in ResNet prevent the vanishing gradient problem during
deep neural network training. These connections enable the direct flow of
information from earlier layers to later layers, aiding in preserving gradient and
promoting better convergence.



Image Denoising
(a) Gaussian Noise – noise having a PDF equal to the normal distribution, i.e. the values that this noise adds to pixels are Gaussian distributed.
(b) Impulse Noise – caused by sharp and sudden disturbances in the image signal. It usually appears as white and black pixels in the image.
Real-world noise (also known as blind noise) is more sophisticated and diverse.

AutoEncoders:
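A minimal sketch of a convolutional denoising autoencoder in PyTorch (the layer sizes and noise level are illustrative assumptions):

import torch
from torch import nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder compresses the noisy image into a smaller representation
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # decoder reconstructs the clean image from that representation
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
clean = torch.rand(8, 1, 28, 28)                              # batch of clean grayscale images
noisy = (clean + 0.2 * torch.randn_like(clean)).clamp(0, 1)   # add Gaussian noise
loss = nn.MSELoss()(model(noisy), clean)                      # train to reconstruct the clean image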



CBDNet (Convolutional Blind Denoising Network), PRIDNet (Perceptual
Residual-Injective Denoising Network), and RIDNet (Residual-in-Residual
Dense Network) are all state-of-the-art deep learning models designed for
image denoising. Let's briefly discuss each of them:

1. CBDNet:

CBDNet was proposed in the paper "Toward Convolutional Blind Denoising of Real Photographs" by Guo et al., published in 2019.

It's designed to denoise real-world photographs without assuming any prior knowledge about the noise characteristics.

CBDNet utilizes a blind denoising approach, meaning it doesn't require any explicit noise level estimation.

The network architecture is composed of multiple convolutional layers along with residual connections to effectively learn the denoising task.

2. PRIDNet:

PRIDNet was introduced in the paper "Perceptual Residual-Injective Denoising Network for Real Image Denoising" by Wang et al., presented in 2019.

It focuses on real image denoising and aims to achieve perceptually superior denoising results.

PRIDNet incorporates a residual-injective structure that leverages both local and global residual learning for better denoising performance.

Additionally, it employs a perceptual loss function, which takes into account the perceptual difference between the denoised and clean images, leading to visually pleasing results.

3. RIDNet:

RIDNet, proposed in the paper "RIDNet: Residual-in-Residual Dense Network for Image Denoising" by Ahn et al. in 2018, focuses on learning hierarchical representations for image denoising.

It adopts a residual-in-residual dense block architecture, which facilitates the learning of highly non-linear mappings between noisy and clean images.

RIDNet is capable of capturing both local and global features effectively through its dense connections and residual learning.

The network architecture allows for efficient information flow across multiple layers, enabling better exploitation of the image's contextual information for denoising.

Overall, CBDNet, PRIDNet, and RIDNet are among the top-performing deep
learning models for image denoising, each offering unique architectural
designs and learning strategies to address the challenges associated with
real-world image denoising tasks.

Semantic Segmentation



1. UNet:

UNet, proposed by Ronneberger et al. in 2015, is a widely used architecture particularly suited for biomedical image segmentation.

It features a symmetric encoder-decoder structure with skip connections between corresponding encoder and decoder layers.

UNet's skip connections help preserve spatial information and enable precise localization of objects in the segmentation masks.



2. ENet (Efficient Neural Network):

ENet, proposed by Paszke et al. in 2016, is designed for efficient real-time semantic segmentation.

It features a compact architecture with lightweight operations, making it suitable for deployment on embedded systems or mobile devices.

ENet utilizes a combination of regular and asymmetric convolutions to reduce computational complexity while maintaining performance.



Object Detection etc
Object detection is a computer vision task that involves identifying and
localizing objects within an image. Deep learning architectures have
revolutionized object detection, enabling high accuracy and real-time
performance. Some of the most popular deep learning architectures for object
detection include:

1. Faster R-CNN:

Faster R-CNN, introduced by Ren et al. in 2015, is a milestone in object detection.

It combines a Region Proposal Network (RPN) with a Fast R-CNN detector, allowing for end-to-end training.

The RPN generates region proposals (bounding boxes) from the input image, and Fast R-CNN uses these proposals to classify and refine object detections.

2. YOLO:
YOLO (You Only Look Once) is a popular deep learning architecture for real-time object detection. YOLO processes images in a single forward pass through a neural network to predict bounding boxes and class probabilities directly. This approach makes YOLO extremely fast and suitable for real-time applications. The original YOLO architecture, as introduced by Joseph Redmon et al. in 2015, has undergone several iterations, including YOLOv2, YOLOv3, and YOLOv4, each with improvements in accuracy and efficiency. Here's an overview of the original YOLO architecture:

1. Input Processing:

YOLO takes an input image of fixed size (e.g., 416x416 pixels) and divides it into a grid of cells.

Each grid cell is responsible for predicting bounding boxes and class probabilities for objects present in that cell.

2. Feature Extraction:

The input image is passed through a convolutional neural network (CNN) to extract features.

The CNN architecture typically consists of convolutional layers followed by max-pooling layers, which progressively reduce the spatial dimensions of the feature maps while increasing the depth.

3. Grid Cell Prediction:

For each grid cell, YOLO predicts multiple bounding boxes.

Each bounding box is represented by a set of coordinates (x, y, width, height) relative to the grid cell's location.

Additionally, YOLO predicts the confidence score for each bounding box, indicating the probability that the box contains an object, as well as the class probabilities for the detected objects.

4. Non-Maximum Suppression (NMS):

YOLO applies non-maximum suppression to remove redundant bounding boxes.

It keeps the bounding box with the highest confidence score for each detected object and suppresses overlapping boxes with lower scores (see the sketch after this list).

5. Output:

The final output of YOLO is a set of bounding boxes along with their associated class probabilities.

YOLO provides real-time object detection by efficiently processing the input image in a single pass through the network.
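A minimal sketch of the non-maximum suppression step described above, using IoU (intersection over union) between axis-aligned boxes; the threshold value and example boxes are illustrative assumptions:

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, each given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, suppressing overlapping lower-scoring ones."""
    order = np.argsort(scores)[::-1]   # indices sorted by confidence, highest first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]   # drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # [0, 2]: the second box is suppressed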

Neural Attention Models,


Neural attention models are a class of deep learning architectures that mimic
the human cognitive mechanism of selectively focusing on specific parts of
input data while processing it. These models have gained prominence across
various tasks in natural language processing (NLP), computer vision, and
other domains. Attention mechanisms allow neural networks to dynamically
weigh the importance of different parts of the input during computation,
enabling more effective and context-aware processing. Here are some key
types of neural attention models and their applications:

1. Sequence-to-Sequence with Attention:

This model was introduced by Bahdanau et al. in 2014 and is commonly used for tasks such as machine translation and text summarization.

In sequence-to-sequence tasks, the model encodes an input sequence into a fixed-length context vector using a recurrent neural network (RNN) encoder.

During decoding, an attention mechanism is applied to the encoder's hidden states, allowing the decoder to attend to different parts of the input sequence while generating the output sequence.

2. Transformer:

The Transformer architecture, introduced by Vaswani et al. in 2017, revolutionized NLP tasks by eliminating recurrent connections and replacing them with self-attention mechanisms.

Transformers consist of multiple self-attention layers that allow each word/token in the input sequence to attend to all other words/tokens in the sequence.

Self-attention enables the model to capture long-range dependencies and contextual information more efficiently than traditional recurrent architectures.

Transformers have been widely adopted for tasks such as machine translation, text classification, and language modeling.

3. Spatial Attention in Convolutional Neural Networks (CNNs):

In computer vision, spatial attention mechanisms are used to focus on relevant regions of an input image while suppressing irrelevant or distracting regions.

These mechanisms typically involve learning attention maps that indicate the importance of different spatial locations in the input image.

Spatial attention has been integrated into CNN architectures for tasks such as image classification, object detection, and image captioning, improving performance by allowing the model to focus on salient features.

4. Multi-Head Attention:

Multi-head attention, introduced in the Transformer architecture, enables the model to attend to different parts of the input simultaneously.

In multi-head attention, the input is projected into multiple subspaces, and attention is computed independently in each subspace.

This allows the model to capture diverse representations and attend to different aspects of the input data effectively.

5. Cross-Modal Attention:

Cross-modal attention mechanisms enable models to attend to information from multiple modalities (e.g., text, image, audio) simultaneously.

These mechanisms are used in tasks such as image captioning, visual question answering (VQA), and multimodal translation, where the input may consist of data from different modalities.

Neural attention models have demonstrated significant improvements in various tasks by enabling more flexible and context-aware processing of input data. They continue to be an active area of research, with ongoing efforts to develop more advanced attention mechanisms and integrate them into diverse architectures and applications.
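A minimal sketch of the scaled dot-product attention at the core of these models (the sequence length and embedding size are illustrative assumptions):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)                # attention weights sum to 1 per query
    return weights @ V                                 # weighted sum of the values

# a toy sequence of 5 tokens with 16-dimensional embeddings (illustrative assumption)
x = torch.randn(1, 5, 16)
out = scaled_dot_product_attention(x, x, x)            # self-attention: Q, K, V come from the same sequence
print(out.shape)                                       # torch.Size([1, 5, 16])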

Neural Machine Translation.



Neural Machine Translation (NMT) is an approach to machine translation that
uses neural networks to translate text from one language to another. NMT has
largely replaced traditional statistical machine translation (SMT) approaches
due to its superior performance, especially in capturing long-range
dependencies and handling context.
Here's how Neural Machine Translation generally works:

1. Sequence-to-Sequence Model:

NMT is typically based on the sequence-to-sequence (seq2seq) model architecture, introduced by Sutskever et al. in 2014.

In seq2seq models, an encoder-decoder architecture is used where the encoder processes the input sequence (source language) and generates a fixed-length context vector that represents the input.

The decoder then takes this context vector and generates the output sequence (target language) word by word.

2. Recurrent Neural Networks (RNNs) and Transformers:

Initially, NMT systems were built using Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells.

However, with the introduction of the Transformer architecture by Vaswani et al. in 2017, the landscape of NMT changed significantly. Transformers have since become the dominant architecture for NMT due to their ability to capture long-range dependencies more effectively through self-attention mechanisms.

3. Training Data and Loss Function:

NMT models are trained on parallel corpora, which are collections of sentences in both the source and target languages.

During training, the model learns to minimize a loss function that measures the difference between the predicted translations and the ground truth translations.

Common loss functions used in NMT include cross-entropy loss and sequence-to-sequence loss.



4. Attention Mechanism:

Attention mechanisms play a crucial role in NMT by allowing the model to focus on relevant parts of the input sentence while generating the output translation.

They enable the model to align words in the source and target languages and alleviate the bottleneck of fixed-length context vectors.

The attention mechanism can be implemented using different variants, such as global attention, local attention, or multi-head attention.

5. Evaluation:

NMT systems are evaluated based on metrics such as BLEU (Bilingual Evaluation Understudy), which measures the similarity between the predicted translations and human-generated translations.

Other evaluation metrics include METEOR, TER, and human evaluation.

NMT has made significant advancements in recent years and is widely used in
commercial translation systems and research laboratories. While it has
achieved impressive results, there are still challenges such as handling low-
resource languages, domain adaptation, and capturing subtle linguistic
nuances. Ongoing research in NMT aims to address these challenges and
further improve the quality and efficiency of machine translation systems.

Performance Metrics,



Neural Machine Translation Performance Metrics:

NMT systems are evaluated based on metrics such as BLEU (Bilingual Evaluation Understudy), which measures the similarity between the predicted translations and human-generated translations.

Other evaluation metrics include METEOR, TER, and human evaluation.
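A minimal sketch of computing a corpus-level BLEU score, assuming the sacrebleu package is installed; the example sentences are illustrative assumptions:

import sacrebleu

# system outputs and reference translations (illustrative assumptions)
hypotheses = ["the cat is on the mat", "there is a dog in the park"]
references = [["the cat is on the mat", "a dog is in the park"]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # BLEU score between 0 and 100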

Baseline Methods,
Baseline models wield immense influence in machine learning practice. Though intentionally simple, they serve as the basis for evaluating the performance of more complex models. Baseline models have a dual purpose:

first, they set a performance baseline against which advancements can be measured, and

second, they provide a benchmark for gauging the efficiency of intricate models.



1. Feedforward Neural Networks (FNNs):
Pros: Simple architecture, suitable for basic tasks (a minimal baseline sketch follows this list).
Cons: Limited complexity, prone to overfitting.

2. Convolutional Neural Networks (CNNs):
Pros: Excellent for image tasks, reduces computational load.
Cons: Needs lots of data, vanishing gradients in deep networks.

3. Recurrent Neural Networks (RNNs):
Pros: Great for sequential data, captures temporal dependencies.
Cons: Vanishing/exploding gradients, struggles with long-term dependencies.

4. Autoencoders:
Pros: Unsupervised learning, feature learning, dimensionality reduction.
Cons: Slow training, potential information loss during compression.

5. Generative Adversarial Networks (GANs):
Pros: Generates realistic data, used in image generation.
Cons: Training instability, mode collapse, hyperparameter sensitivity.

6. Reinforcement Learning (RL) Models:
Pros: Learns decision-making through interaction.
Cons: High computational requirements, reward design sensitivity.

7. Transfer Learning:
Pros: Saves time/resources, useful for limited data domains.
Cons: Task misalignment, requires careful fine-tuning.

8. Ensemble Methods:
Pros: Combines models for improved accuracy.
Cons: Increased complexity, potential overfitting.

9. Attention Mechanisms:
Pros: Improves model interpretability, focuses on relevant inputs.
Cons: Adds computational overhead, tuning required.

10. Meta-Learning Approaches:
Pros: Learns quickly for new tasks/domains.
Cons: Requires careful algorithm design, sensitive to task similarities.
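A minimal sketch of a feedforward baseline in PyTorch, of the kind one might compare more complex models against (the feature size and class count are illustrative assumptions):

import torch
from torch import nn

# A small feedforward baseline classifier (illustrative assumption: 20 features, 3 classes)
baseline = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
)

x = torch.randn(16, 20)   # a batch of 16 examples
logits = baseline(x)
print(logits.shape)       # torch.Size([16, 3])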

Data Requirements,
1. Quantity: Large-scale and diverse datasets are crucial for effective model training and generalization.
2. Quality: Clean, accurately labeled data with balanced class distributions improves model performance.
3. Preprocessing: Normalize, standardize, and augment data to aid model convergence and reduce overfitting.
4. Representative Features: Ensure input features capture relevant information for the task.
5. Data Splitting: Divide data into training, validation, and test sets for evaluation and hyperparameter tuning (a minimal sketch follows this list).
6. Imbalance Handling: Address class imbalance using oversampling, undersampling, or class weights.
7. Transfer Learning: Utilize pre-trained models and domain adaptation techniques for limited data scenarios.
8. Privacy and Security: Comply with regulations and implement data protection measures for sensitive data.
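A minimal sketch of splitting data into training, validation, and test sets, assuming scikit-learn is available; the split proportions and synthetic data are illustrative assumptions:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)             # illustrative feature matrix
y = np.random.randint(0, 2, size=1000)   # illustrative binary labels

# 70% train, 15% validation, 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150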

Hyperparameter Tuning:
Hyperparameters are external configuration variables that data scientists use to manage machine learning model training, e.g. the number of nodes and layers in a neural network or the number of branches in a decision tree.

Parameters allow the model to learn the rules from the data, while hyperparameters control how the model is trained.

Hyperparameter tuning is a critical step in optimizing the performance of machine learning and deep learning models. It involves adjusting the hyperparameters of a model to find the configuration that results in improved performance metrics such as accuracy, precision, recall, or F1-score. Here are key points regarding hyperparameter tuning:

1. Hyperparameter Examples:

Learning rate in optimization algorithms (e.g., gradient descent)

Number of layers and neurons in a neural network

Regularization parameters (e.g., L1/L2 regularization strength, dropout rate)

Batch size, epochs, and optimizer choice (e.g., Adam, SGD)

2. Cross-Validation: Utilize cross-validation techniques (e.g., k-fold cross-validation) during hyperparameter tuning to evaluate model performance across different subsets of data and reduce overfitting.

3. Objective Function: Define an objective function (e.g., accuracy, loss) that the hyperparameter tuning process aims to optimize. It guides the search for optimal hyperparameters.

4. Early Stopping: Implement early stopping based on validation metrics to prevent overfitting during hyperparameter tuning iterations.

5. Parallelization: Leverage parallel computing or distributed systems to speed up hyperparameter tuning processes, especially for computationally intensive models or large datasets.

6. Domain Knowledge: Incorporate domain knowledge and insights into the hyperparameter tuning process to guide the search space and prioritize relevant hyperparameters.

Manual vs Automatic,



Grid vs Random.
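As a rough illustration of the two search strategies: grid search exhaustively tries every combination in a predefined grid, while random search samples a fixed number of random combinations. A minimal sketch using scikit-learn, assuming it is available (the estimator and parameter ranges are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: tries all 3 x 3 = 9 combinations
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
grid.fit(X, y)

# Random search: samples only 4 of the combinations at random
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params, n_iter=4, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)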
