UNIT V RECURRENT NEURAL NETWORKS
Recurrent Neural Networks: Introduction – Recursive Neural Networks – Bidirectional RNNs – Deep Recurrent Networks – Applications: Image Generation, Image Compression, Natural Language Processing – Complete Autoencoder, Regularized Autoencoder, Stochastic Encoders and Decoders, Contractive Encoders.
RECURRENT NEURAL NETWORKS: INTRODUCTION
What is a Recurrent Neural Network (RNN)?
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases where it is required to predict the next word of a sentence, the previous words are required, and hence there is a need to remember the previous words. Thus the RNN came into existence, which solved this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence. This state is also referred to as the Memory State since it remembers the previous input to the network. The RNN uses the same parameters for each input, performing the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of parameters, unlike other neural networks.

Architecture Of Recurrent Neural Network


RNNs have the same input and output architecture as any other deep neural architecture. However, differences arise in the way information flows from input to output. Unlike deep neural networks, where we have a different weight matrix for each dense layer, in an RNN the weights across the network remain the same. The network calculates a hidden state h_t for every input x_t, using the following formulas:
h_t = σ(U·x_t + W·h_{t-1} + B)
Y_t = O(V·h_t + C)
Hence Y = f(X, h, W, U, V, B, C)
Here S is the state matrix, whose element s_i is the state of the network at timestep i. The parameters of the network are W, U, V, b, c, which are shared across timesteps.
How RNN works
The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state, which is called the hidden state of the unit. This hidden state signifies the knowledge that the network currently holds at a given time step. The hidden state is updated at every time step to signify the change in the knowledge of the network about the past, using the following recurrence relation:

The formula for calculating the current state:
h_t = f(h_{t-1}, x_t)
where:
h_t -> current state
h_{t-1} -> previous state
x_t -> input state
Formula for applying the activation function (tanh):
h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
where:
W_hh -> weight at the recurrent neuron
W_xh -> weight at the input neuron
The formula for calculating the output:
y_t = W_hy·h_t
where:
y_t -> output
W_hy -> weight at the output layer
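The recurrence above can be written out directly. The following is a minimal NumPy sketch of a forward pass through a vanilla RNN; the weight names mirror W_hh, W_xh and W_hy in the formulas, while the sizes and random values are illustrative assumptions rather than anything from the text.

import numpy as np

# Illustrative sizes (assumptions, not from the text)
input_size, hidden_size, output_size = 8, 16, 4

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # weight at the input neuron
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # weight at the recurrent neuron
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # weight at the output layer

def rnn_forward(xs):
    """Run the recurrence h_t = tanh(W_hh h_{t-1} + W_xh x_t); y_t = W_hy h_t."""
    h = np.zeros(hidden_size)          # h_0: constant starting state
    ys = []
    for x in xs:                       # the same parameters are reused at every time step
        h = np.tanh(W_hh @ h + W_xh @ x)
        ys.append(W_hy @ h)
    return ys, h                       # outputs at each step and the final hidden state

sequence = [rng.normal(size=input_size) for _ in range(5)]
outputs, final_state = rnn_forward(sequence)
print(len(outputs), final_state.shape)   # 5 outputs, hidden state of size 16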
These parameters are updated using Backpropagation. However, since RNN works on sequential data
we use an updated backpropagation which is known as Backpropagation through time.
Backpropagation Through Time (BPTT)
In an RNN the computation is ordered: each variable is computed one at a time, in a specified order (first h1, then h2, then h3, and so on). Hence we apply backpropagation through all of these hidden time states sequentially.

L(θ) (the loss function) depends on h3; h3 in turn depends on h2 and W; h2 in turn depends on h1 and W; and h1 in turn depends on h0 and W, where h0 is a constant starting state.
Training through RNN
A single time step of the input is provided to the network.
Its current state is calculated using the current input and the previous state.
The current state h_t becomes h_{t-1} for the next time step.
One can go as many time steps as the problem requires and join the information from all the previous states.
Once all the time steps are completed, the final current state is used to calculate the output.
The output is then compared to the actual (target) output and the error is generated.
The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained using Backpropagation Through Time (a short Keras sketch of this loop follows below).
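As a rough sketch of how this training procedure looks in practice, the snippet below fits a small Keras SimpleRNN on random many-to-one data; calling fit unrolls the recurrence over the time steps and applies backpropagation through time automatically. The shapes and hyperparameters are illustrative assumptions.

import numpy as np
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

# Toy many-to-one data: 100 sequences of 10 time steps, 8 features each (assumed shapes)
X = np.random.rand(100, 10, 8)
y = np.random.randint(0, 2, size=(100, 1))

model = Sequential([
    SimpleRNN(16, input_shape=(10, 8)),   # final hidden state after the last time step
    Dense(1, activation='sigmoid')        # output computed from that final state
])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

# fit() unrolls the recurrence over the 10 steps and backpropagates through time
model.fit(X, y, epochs=2, batch_size=16, verbose=0)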
Advantages of Recurrent Neural Network
An RNN remembers each and every piece of information through time. It is useful in time series prediction because of its ability to remember previous inputs; this memory is what Long Short-Term Memory (LSTM) networks extend.
Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighbourhood.
Disadvantages of Recurrent Neural Network
Gradient vanishing and exploding problems.
Training an RNN is a very difficult task.
It cannot process very long sequences if using tanh or relu as an activation function.
Applications of Recurrent Neural Network
Language Modelling and Generating Text
Speech Recognition
Machine Translation
Image Recognition, Face detection
Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
One to One
One to Many
Many to One
Many to Many
One to One
This type of RNN behaves the same as any simple neural network; it is also known as a Vanilla Neural Network. In this neural network, there is only one input and one output.

One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most used examples of this network is image captioning, where given an image we predict a sentence having multiple words.



Many to One
In this type of network, many inputs are fed to the network at several states of the network, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.

Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this problem is language translation. In language translation, we provide multiple words from one language as input and predict multiple words from the second language as output.
Variation Of Recurrent Neural Network (RNN)
To overcome problems like vanishing and exploding gradients, several new advanced versions of RNNs have been formed; some of these are:
Bidirectional Neural Network (BiNN)
Long Short-Term Memory (LSTM)
Bidirectional Neural Network (BiNN)
A BiNN is a variation of a Recurrent Neural Network in which the input information flows in both directions and the outputs of both directions are combined to produce the output. BiNNs are useful in situations where the context of the input is more important, such as NLP tasks and time-series analysis problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on the read-write-and-forget principle: given the input information, the network reads and writes the most useful information from the data and forgets the information which is not important in predicting the output. To do this, three new gates are introduced into the RNN. In this way, only the selected information is passed through the network.
Difference between RNN and Simple Neural Network
RNN is considered to be the better version of a deep neural network when the data is sequential. There are significant differences between RNNs and deep neural networks; they are listed below:

Recurrent Neural Network: Weights are the same across all the layers (shared across time steps).
Deep Neural Network: Weights are different for each layer of the network.

Recurrent Neural Network: Used when the data is sequential and the number of inputs is not predefined.
Deep Neural Network: Has no special method for sequential data; the number of inputs is fixed.

Recurrent Neural Network: The number of parameters is higher than in a simple DNN.
Deep Neural Network: The number of parameters is lower than in an RNN.

Recurrent Neural Network: Exploding and vanishing gradients are the major drawback.
Deep Neural Network: These problems also occur, but they are not the major problem with DNNs.

RECURSIVE NEURAL NETWORKS:


Deep Learning is a subfield of machine learning and artificial intelligence (AI) that attempts to imitate how the human brain processes data and gains knowledge. Neural networks form the backbone of Deep Learning. These are loosely modeled after the human brain and designed to accurately recognize underlying patterns in a data set. If you want to predict the unpredictable, Deep Learning is the solution.
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed and structured information. With an RvNN, you can get a structured prediction by recursively applying the same set of weights on structured inputs. The word recursive indicates that the neural network is applied to its own output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The tree structure means combining child nodes to produce parent nodes. Each child-parent bond has a weight matrix, and similar children share the same weights. The number of children for every node in the tree is fixed to enable it to perform recursive operations and use the same weights. RvNNs are used when there is a need to parse an entire sentence.
To calculate the parent node's representation, we add the products of the weight matrices (W_i) and the children's representations (C_i) and apply the transformation f:
\[h = f \left( \sum_{i=1}^{i=c} W_i C_i \right) \], where c is the number of children.
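The parent composition above can be sketched in a few lines of NumPy. The snippet below assumes a binary tree (c = 2), one shared weight matrix per child position, and made-up leaf vectors; it is an illustration of the recursive weight sharing, not a full RvNN.

import numpy as np

dim = 4                                    # size of every node representation (assumed)
rng = np.random.default_rng(1)
W = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(2)]   # W_1, W_2: shared weights

def compose(children):
    """Parent representation h = f(sum_i W_i C_i), with f = tanh."""
    return np.tanh(sum(W_i @ C_i for W_i, C_i in zip(W, children)))

def encode(tree):
    """Recursively apply the same weights over a nested (left, right) tuple of leaf vectors."""
    if isinstance(tree, np.ndarray):       # a leaf: already a vector
        return tree
    left, right = tree
    return compose([encode(left), encode(right)])

# e.g. a phrase bracketed as ((a, lot), (of, fun)) with random leaf embeddings
leaf = lambda: rng.normal(size=dim)
phrase = ((leaf(), leaf()), (leaf(), leaf()))
print(encode(phrase))                      # one vector for the whole subtree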
Recursive Neural Network Implementation
A Recursive Neural Network can be used for sentiment analysis of natural language sentences. This is one of the most important tasks of Natural Language Processing (NLP), which identifies the writing tone and sentiments of the writer in a particular sentence. If a writer expresses any sentiment, basic labels about the writing tone are recognized. We want to identify the smaller components like noun or verb phrases and order them in a syntactic hierarchy. For example, it identifies whether the sentence showcases a constructive form of writing or negative word choices.

A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases and words we must combine to form the perfect syntactic tree for a given sentence.

Let us consider the representation of the phrase -- "a lot of fun" in the following sentence.

Programming is a lot of fun.

An RNN representation of this phrase would not be suitable because it considers only sequential relations. Each state varies with the representation of the preceding words, so a subsequence that doesn't occur at the beginning of the sentence can't be represented. With an RNN, by the time we process the word 'fun,' the hidden state will represent the whole sentence.

However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the representation of the exact phrase: it lies in the hidden state of the node R_{a\ lot\ of\ fun}. Thus, parsing is implemented with the help of Recursive Neural Networks.

Benefits of RvNNs for Natural Language Processing

The two significant advantages of Recursive Neural Networks for Natural Language Processing are their tree structure and the reduction in network depth.
As already explained, the tree structure of Recursive Neural Networks can manage hierarchical data, as in parsing problems.
Another benefit of RvNNs is that the trees can have a logarithmic height. When there are O(n) input words, a Recursive Neural Network can represent a binary tree with height O(log n). This lessens the distance between the first and last input elements. Hence, the long-term dependency becomes shorter and easier to capture.
Disadvantages of RvNNs for Natural Language Processing
The main disadvantage of recursive neural networks is the tree structure itself. Using the tree structure introduces a particular inductive bias into our model: the assumption that the data follow a tree hierarchy. When that is not the case, the network may not be able to learn the existing patterns.
Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow and ambiguous. Interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for recursive neural networks than for recurrent neural networks. Manually parsing a sentence into short components is more time-consuming and tedious than assigning a label to a sentence.

BIDIRECTIONAL RNNS:
A bidirectional recurrent neural network (BRNN) is a neural network architecture made to process sequential data. BRNNs process input sequences in both the forward and backward directions, so that the network can use information from both the past and the future context in its predictions. This is the main distinction between BRNNs and conventional recurrent neural networks.
A BRNN has two distinct recurrent hidden layers, one of which processes the input sequence forward and the other of which processes it backward. The results from these hidden layers are then collected and fed into a final prediction-making layer. Any recurrent neural network cell, such as Long Short-Term Memory (LSTM) or the Gated Recurrent Unit (GRU), can be used to create the recurrent hidden layers.
The BRNN functions similarly to conventional recurrent neural networks in the forward direction, updating the hidden state at each time step based on the current input and the prior hidden state. The backward hidden layer, on the other hand, analyses the input sequence in the opposite direction, updating the hidden state based on the current input and the hidden state of the next time step.
Compared to conventional unidirectional recurrent neural networks, the accuracy of the BRNN is improved since it can process information in both directions and account for both past and future contexts. Because the two hidden layers can complement one another and give the final prediction layer more data, using two distinct hidden layers also offers a type of model regularisation.
To update the model parameters, gradients are computed for both the forward and backward passes of the backpropagation-through-time technique that is typically used to train BRNNs. At inference time, the input sequence is processed by the BRNN in a single forward pass, and predictions are made based on the combined outputs of the two hidden layers.
Bi-directional Recurrent Neural Network
Working of Bidirectional Recurrent Neural Network
1. Inputting a sequence: A sequence of data points, each represented as a vector with the same dimensionality, is fed into the BRNN. The sequences might have different lengths.
2. Dual processing: The data is processed in both the forward and the backward direction. In the forward direction, the hidden state at time step t is determined on the basis of the input at that step and the hidden state at step t-1. In the backward direction, the hidden state at step t is calculated from the input at step t and the hidden state at step t+1.
3. Computing the hidden state: A non-linear activation function applied to the weighted sum of the input and the previous hidden state is used to calculate the hidden state at each step. This creates a memory mechanism that enables the network to remember data from earlier steps in the process.
4. Determining the output: A non-linear activation function applied to the weighted sum of the hidden state and a set of output weights is used to determine the output at each step. This output can either be the final output or the input to another layer in the network.
5. Training: The network is trained through a supervised learning approach, where the goal is to minimize the discrepancy between the predicted output and the actual output. The network adjusts the weights in the input-to-hidden and hidden-to-output connections during training through backpropagation.
To calculate the output from an RNN unit, we use the following formulas:
H_t(Forward) = A(X_t · W_XH(Forward) + H_{t-1}(Forward) · W_HH(Forward) + b_H(Forward))
H_t(Backward) = A(X_t · W_XH(Backward) + H_{t+1}(Backward) · W_HH(Backward) + b_H(Backward))
where:
A = activation function,
W = weight matrix,
b = bias.
The hidden state at time t is given by a combination of H_t(Forward) and H_t(Backward). The output at any given hidden state is:
Y_t = H_t · W_HY + b_Y
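A minimal NumPy sketch of these two passes is shown below: the same tanh recurrence is run forward and backward over the sequence and the two hidden states are concatenated at each step. The sizes and the choice of concatenation (rather than summation) are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 6, 8            # assumed sizes

def make_params():
    return (rng.normal(scale=0.1, size=(hidden_size, input_size)),   # W_XH
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)),  # W_HH
            np.zeros(hidden_size))                                   # b_H

fwd, bwd = make_params(), make_params()   # separate parameters for each direction

def run(xs, params):
    W_xh, W_hh, b = params
    h = np.zeros(hidden_size)
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return states

xs = [rng.normal(size=input_size) for _ in range(5)]
h_forward = run(xs, fwd)                   # processes x_1 ... x_T
h_backward = run(xs[::-1], bwd)[::-1]      # processes x_T ... x_1, then re-aligned

# The hidden state at time t combines both directions (here: concatenation)
H = [np.concatenate([f, b]) for f, b in zip(h_forward, h_backward)]
print(len(H), H[0].shape)                  # 5 steps, each of size 2 * hidden_size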
The training of a BRNN is similar to the backpropagation-through-time algorithm. The BPTT algorithm works as follows:
Roll out the network and calculate errors at each iteration.
Update weights and roll up the network.
However, because forward and backward passes in a BRNN occur simultaneously, updating the weights for the two processes could occur at the same time, which produces inaccurate outcomes. Thus, the following approach is used to train a BRNN: the forward and backward passes are accommodated individually.
Applications of Bidirectional Recurrent Neural Network
Bi-RNNs have been applied to various natural language processing (NLP) tasks, including:
1. Sentiment Analysis : By taking into account both the prior and subsequent context, BRNNs can be
utilized to categorize the sentiment of a particular sentence.
2. Named Entity Recognition : By considering the context both before and after the stated thing,
BRNNs can be utilized to identify those entities in a sentence.
3. Part-of-Speech Tagging : The classification of words in a phrase into their corresponding parts of
speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
4. Machine Translation : BRNNs can be used in encoder-decoder models for machine translation, where the encoder analyses the source sentence in both directions to capture its context and the decoder creates the target sentence.
5. Speech Recognition : When the input voice signal is processed in both directions to capture the
contextual information, BRNNs can be used in automatic speech recognition systems.
Advantages of Bidirectional RNN
Context from both past and future: With the ability to process sequential input both forward and backward, BRNNs provide a thorough grasp of the full context of a sequence. Because of this, BRNNs are effective at tasks like sentiment analysis and speech recognition.
Enhanced accuracy: BRNNs frequently yield more precise answers since they take both historical and upcoming data into account.
Efficient handling of variable-length sequences: When compared to conventional RNNs, which
require padding to have a constant length, BRNNs are better equipped to handle variable-length
sequences.
Resilience to noise and irrelevant information: BRNNs may be resistant to noise and irrelevant data
that are present in the data. This is so because both the forward and backward paths offer useful
information that supports the predictions made by the network.
Ability to handle sequential dependencies: BRNNs can capture long-term links between sequence
pieces, making them extremely adept at handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
Computational complexity: Given that they analyze data both forward and backward, BRNNs can
be computationally expensive due to the increased amount of calculations needed.
Long training time: BRNNs can also take a while to train because there are many parameters to
optimize, especially when using huge datasets.
Difficulty in parallelization: Due to the requirement for sequential processing in both the forward
and backward directions, BRNNs can be challenging to parallelize.
Overfitting : BRNNs are prone to overfitting since they include many parameters that might result in
too complicated models, especially when trained on short datasets.
Interpretability: Due to the processing of data in both forward and backward directions, BRNNs can
be tricky to interpret since it can be difficult to comprehend what the model is doing and how it is
producing predictions.
Implementation of Bi-directional Recurrent Neural Network on NLP dataset
There are multiple processes involved in training a bidirectional RNN on an NLP dataset, including data preprocessing, model development, and model training. Here is an illustration of a Python implementation using Keras and TensorFlow. We'll utilize the IMDb movie review sentiment classification dataset from Keras in this example. The data must first be loaded and preprocessed.

import warnings
warnings.filterwarnings('ignore')

from keras.datasets import imdb
from keras_preprocessing.sequence import pad_sequences

# Load the dataset and then split it into training and testing sets
features = 2000   # vocabulary size
maxlen = 50       # fixed sequence length

(X_train, y_train), \
    (X_test, y_test) = imdb.load_data(num_words=features)

# Pad the sequences to a fixed length
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

Model Architecture

Using the high-level Keras API, we will implement a Bidirectional Recurrent Neural Network model. This model will have 64 hidden units and an embedding layer of size 128. While compiling the model we provide three essential parameters:
optimizer – the method that optimizes the cost function using gradient descent.
loss – the loss function by which we monitor whether the model is improving with training or not.
metrics – helps to evaluate the model on the training and the validation data.
# Import the necessary modules from Keras
from keras.models import Sequential
from keras.layers import Embedding, \
    Bidirectional, SimpleRNN, Dense

# Set the embedding size and the number
# of hidden units in the recurrent layer
embedding = 128
hidden = 64

# Create a Sequential model object
model = Sequential()
model.add(Embedding(features, embedding,
                    input_length=maxlen))
model.add(Bidirectional(SimpleRNN(hidden)))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy',
              metrics=['accuracy'])
Model Training
As we have compiled our model successfully and the data pipeline is also ready, we can move forward toward the process of training our BRNN.

# Set the batch size and number of epochs
batch_size = 32
epochs = 5

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test))
Output:

Training progress of the BRNN epoch-by-epoch

Evaluate the Model


Now that we have our model ready, let's evaluate its performance on the validation data using different evaluation metrics. For this purpose, we will first predict the class for the validation data using the model and then compare the output with the true labels.

loss, accuracy = model.evaluate(X_test, y_test)


print('Test accuracy:', accuracy)
Output :

Validation Accuracy of the model on the holdout dataset


Here we are using a simple bidirectional RNN, but we can also use an LSTM inside the bidirectional wrapper for better accuracy and results.

DEEP RECURRENT NETWORKS:


APPLICATIONS
IMAGE GENERATION:
Deep Recurrent Attentive Writer (DRAW) is a neural network architecture for image generation. DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye with a sequential variational auto-encoding framework that allows for the iterative construction of complex images. The system substantially improves on the state of the art for generative models on MNIST and, when trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from real data with the naked eye.
The core of the DRAW architecture is a pair of recurrent neural networks: an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. The combined system is trained end-to-end with stochastic gradient descent, where the loss function is a variational upper bound on the log-likelihood of the data.
DRAW Architecture
The DRAW network is similar to other variational auto-encoders: it contains an encoder network that determines a distribution over latent codes that capture salient information about the input data, and a decoder network that receives samples from the code distribution and uses them to condition its own distribution over images. In simple terms, the network decides at each time-step "where to read" and "where to write" as well as "what to write".

3 Key Differences Between DRAW and Auto-Encoders


Both the encoder and decoder are recurrent networks in DRAW.
The decoder's outputs are added successively to the distribution used to generate the data, instead of generating this distribution in a single step.
A dynamically updated attention mechanism is used to restrict both the input region observed by the encoder and the output region modified by the decoder.
Left: Conventional Variational Auto-Encoder.
During generation, a sample z is drawn from a prior P(z) and passed through the feedforward decoder network to compute the probability of the input P(x|z) given the sample.
During inference the input x is passed to the encoder network, producing an approximate posterior Q(z|x) over latent variables. During training, z is sampled from Q(z|x) and then used to compute the total description length KL(Q(z|x) || P(z)) − log(P(x|z)), which is minimized with stochastic gradient descent.
Right: DRAW Network.
At each time-step a sample z_t from the prior P(z_t) is passed to the recurrent decoder network, which modifies part of the canvas matrix. The final canvas matrix c_T is used to compute P(x|z_1:T).

During inference the input is read at every time-step and the result is passed to the encoder RNN. The RNNs at the previous time-step specify where to read. The output of the encoder RNN is used to compute the approximate posterior over the latent variables at that time-step.

Loss Function
The final canvas matrix c_T is used to parametrize a model D(X | c_T) of the input data. If the input is binary, the natural choice for D is a Bernoulli distribution with means given by σ(c_T). The reconstruction loss L^x is defined as the negative log probability of x under D:
L^x = −log D(x | c_T)
The latent loss L^z for a sequence of latent distributions Q(Z_t | h_t^enc) is defined as the summed Kullback-Leibler divergence of some latent prior P(Z_t) from Q(Z_t | h_t^enc):
L^z = Σ_{t=1}^{T} KL( Q(Z_t | h_t^enc) || P(Z_t) )
Note that this loss depends upon the latent samples z_t drawn from Q(Z_t | h_t^enc), which depend in turn on the input x. If the latent distribution is a diagonal Gaussian with means μ_t and standard deviations σ_t, a simple choice for P(Z_t) is a standard Gaussian with mean zero and standard deviation one, in which case the equation becomes:
L^z = (1/2) ( Σ_{t=1}^{T} μ_t² + σ_t² − log σ_t² ) − T/2
The total loss L for the network is the expectation of the sum of the reconstruction and latent losses:
L = ⟨ L^x + L^z ⟩
which we optimize using a single sample of z for each stochastic gradient descent step.
L^z can be interpreted as the number of nats required to transmit the latent sample sequence z_1:T to the decoder from the prior, and (if x is discrete) L^x is the number of nats required for the decoder to reconstruct x given z_1:T. The total loss is therefore equivalent to the expected compression of the data by the decoder and prior.
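As a rough illustration of the two terms, the snippet below computes a Bernoulli reconstruction loss from a final canvas and the closed-form KL of diagonal Gaussian posteriors against a standard normal prior. It is a minimal NumPy sketch of the loss only, not of the DRAW network itself, and every array shape in it is an assumption.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruction_loss(x, canvas_T):
    # L^x = -log D(x | c_T), with D a Bernoulli whose means are sigmoid(c_T)
    p = sigmoid(canvas_T)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def latent_loss(mus, sigmas):
    # Summed KL(Q(Z_t|x) || N(0, I)) for diagonal Gaussians, in the standard
    # per-dimension closed form 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1)
    return sum(0.5 * np.sum(m**2 + s**2 - np.log(s**2) - 1.0)
               for m, s in zip(mus, sigmas))

rng = np.random.default_rng(0)
x = (rng.random(28 * 28) > 0.5).astype(float)      # a binary "image" (assumed size)
canvas_T = rng.normal(size=28 * 28)                 # final canvas c_T (logits)
mus = [rng.normal(size=10) for _ in range(8)]       # T = 8 steps, 10 latent dims (assumed)
sigmas = [np.exp(0.1 * rng.normal(size=10)) for _ in range(8)]

total_loss = reconstruction_loss(x, canvas_T) + latent_loss(mus, sigmas)
print(total_loss)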

Improving Images
As Eric Jang mentions in his post, it's easier to ask our neural network to merely "improve the image" rather than "finish the image in one shot". Human artists work by iterating on their canvas, inferring from their drawing what to fix and what to paint next.
Improving an image, or progressive refinement, is simply breaking up our joint distribution P(C) over and over again, resulting in a chain of latent variables C1, C2, …, CT−1 leading to a new observed variable distribution P(CT).

The trick is to sample from the iterative refinement distribution P(Ct|Ct−1) several times rather than sampling straight from P(C).
In the DRAW model, P(Ct|Ct−1) is the same distribution for all t, so we can compactly represent this as the following recurrence relation (if not, we would have a Markov chain instead of a recurrent network).
The DRAW model applied
Imagine you are trying to encode an image of the number 8. Every handwritten number is drawn differently: while some portions may be thicker, others can be longer. Without attention, the encoder would be forced to try and capture all these small variations at the same time.
But what if the encoder could choose a small crop of the image on every frame and examine each portion of the number one at a time? That would make the work easier, right?
The same logic applies to generating the number. The attention unit will determine where to draw the next portion of the number 8 (or any other), while the latent vector passed will determine whether the decoder generates a thicker area or a thinner area.

Basically, if we think of the latent code in a VAE (variational auto-encoder) as a vector that represents the entire image, the latent codes in DRAW can be thought of as vectors that represent a pen stroke. Eventually, a sequence of these vectors recreates the original image.

Ok, But how does it really work?


In a recurrent VAE model, the encoder takes in the entire input image at every single timestep. In DRAW, we need to focus on the attention gate between the two of them, so the encoder only receives the portion of our image that the network deems important at that timestep. That first attention gate is called the "read" attention.

The "read" attention consists of two parts:

Choosing the important portion, and cropping the image (forgetting about other parts).

Choosing the important portion of an image

In order to determine which part of the image to focus on, we need some sort of observation on which to base the decision. In DRAW, we use the previous timestep's decoder hidden state. Using a simple fully connected layer, we can map the hidden state to three parameters that represent our square crop: center x, center y, and the scale.
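A small hedged sketch of that mapping is shown below: one fully connected layer turns the previous decoder hidden state into the three crop parameters. The sizes and the squashing functions are assumptions made for illustration, not the exact parametrisation used in DRAW.

import numpy as np

rng = np.random.default_rng(0)
decoder_hidden = rng.normal(size=256)            # previous time step's decoder state (assumed size)

W_attn = rng.normal(scale=0.05, size=(3, 256))   # fully connected layer: hidden state -> 3 params
b_attn = np.zeros(3)

raw = W_attn @ decoder_hidden + b_attn
center_x = np.tanh(raw[0])        # in [-1, 1], relative to image width (assumed convention)
center_y = np.tanh(raw[1])        # in [-1, 1], relative to image height
scale = np.exp(raw[2])            # positive zoom factor for the square crop
print(center_x, center_y, scale)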
Cropping the image
Now, instead of encoding the entire image, we crop it so only a small part of the image is encoded. This crop is then passed through the system and decoded back into a small patch.
We now arrive at the second part of our attention gate, the "write" attention, which has the same setup as the "read" section, except that the "write" attention gate uses the current decoder state instead of the previous timestep's decoder state.
Wait…is that really done in practice?
While describing the attention mechanism as a crop makes sense intuitively, in practice a different method is used. The model structure described above is still accurate, but a matrix of Gaussian filters is used instead of a crop.
In DRAW, we take an array of Gaussian filters, each with their centers spaced apart evenly.

IMAGE COMPRESSION:
Introduction:
The development of and demand for multimedia goods have risen in recent years, resulting in network bandwidth and storage device limitations. As a result, image compression theory is becoming more significant for reducing data redundancy and boosting device space and transmission bandwidth savings. In computer science and information theory, data compression, also known as source coding, is the process of encoding information using fewer bits or other information-bearing units than an unencoded version. Compression is advantageous because it saves money by reducing the use of expensive resources such as hard disc space and transmission bandwidth.
Image Compression:
Image compression is a type of data compression in which the original image is encoded with a small number of bits. Compression focuses on reducing image size without sacrificing the uniqueness and information included in the original. The purpose of image compression is to eliminate image redundancy while also increasing storage capacity for well-organized communication.

There are two major types of image compression techniques:


1. Lossless Compression:
This method is commonly used for archival purposes. Lossless compression is suggested for images with relatively basic geometric forms. It is used for medical, technical, and clip art graphics, among other things.
2. Lossy Compression:
Lossy compression algorithms are very useful for compressing natural pictures such as photographs, where a small loss in fidelity is acceptable to achieve a significant decrease in bit rate. This is the most common method for compressing multimedia data, and some data may be lost as a result.
RNN Based Encoder and Decoders:
Two convolutional kernels are employed in the recurrent units used to build the encoder and decoder: one on the input vector that enters the unit from the previous layer, and the other on the state vector, which gives the unit its recurrent character. The convolution on the state vector and its kernel are referred to as the "hidden convolution" and the "hidden kernel", respectively.
The input-vector convolutional kernel's spatial extent and output depth are shown in the figure. All convolutional kernels support full depth mixing. For example, the unit D-RNN#3 operates on the input vector with 256 convolutional kernels, each with 3×3 spatial extent and full input-depth extent (128 in this case, because D-RNN#2's depth is decreased by a factor of four as it passes through the "Depth-to-Space" unit).
Except in units D-RNN#3 and D-RNN#4, where the hidden kernels are 3×3, the spatial extents of the hidden kernels are all 1×1. When compared to the 1×1 hidden kernels, the larger hidden kernels consistently produced better compression curves.

Types of Recurrent Units:


1. LSTM:
The long short-term memory (LSTM) architecture is a deep learning architecture that employs a recurrent neural network (RNN). LSTM features feedback connections, unlike standard feedforward neural networks. It is capable of handling not just single data points (such as images), but also whole data streams (such as speech or video). Tasks like unsegmented, connected handwriting recognition, speech recognition, and anomaly detection in network traffic or IDSs (intrusion detection systems) can all benefit from LSTM.

Let x_t, c_t, and h_t denote the input, cell, and hidden states at iteration t, respectively. The new cell state c_t and the new hidden state h_t are computed using the current input x_t, the prior cell state c_{t−1}, and the previous hidden state h_{t−1}.
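The update equations themselves are omitted in the source; for reference, the standard LSTM formulation with input, forget and output gates is reproduced below. Treat it as the conventional definition rather than the exact variant used in the compression paper.

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]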

2. Associative LSTM:
To enable key-value storage of data, an Associative LSTM combines an LSTM with principles from Holographic Reduced Representations (HRRs). To achieve key-value binding between two vectors (the key and its associated content), HRRs employ a "binding" operator. Associative arrays are natively implemented as a byproduct; stacks, queues, or lists can also be easily implemented.
Associative LSTM extends LSTM using holographic representation. Its new states are computed as:

Associative LSTMs were effective only when employed in the decoder.

3. Gated Recurrent Units:

Kyunghyun Cho et al. introduced gated recurrent units (GRUs) as a gating mechanism in recurrent neural networks in 2014. The GRU is similar to a long short-term memory (LSTM) with a forget gate, but it lacks an output gate, hence it has fewer parameters. The GRU's performance on polyphonic music modelling, speech signal modelling, and natural language processing tasks was found to be comparable to that of the LSTM in some cases. On some smaller and less frequent datasets, GRUs have been found to perform better.
The GRU update, which has an input x_t and a hidden state/output h_t, is as follows:
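The formula is missing from the source; for reference, the standard GRU update with an update gate z_t and a reset gate r_t is given below, again as the conventional definition rather than the paper's exact variant.

\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]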

Reconstruction Framework:
In addition to employing different types of recurrent units, three distinct ways of constructing the final image reconstruction from the decoder outputs are explored.
One-shot Reconstruction:
As was done in Toderici et al. [2016], after each iteration of the decoder we predict the whole picture. Each cycle has access to more of the bits produced by the encoder, allowing for a better reconstruction. This method is known as "one-shot reconstruction." We merely transfer the previous iteration's residual to the next iteration, despite trying to rebuild the original picture at each iteration. The number of weights is reduced as a result, and trials demonstrate that sending both the original picture and the residual does not enhance the reconstructions.
Additive Reconstruction:
In additive reconstruction, which is more widely used in traditional image coding, each iteration only tries to reconstruct the residual from the previous iterations. The final image reconstruction is then the sum of the outputs of all iterations (γ = 1).
Residual Scaling:
The residual starts large in both additive and "one-shot" reconstruction, and we anticipate it to diminish with each iteration. However, operating the encoder and decoder effectively across a large range of values may be problematic. In addition, the pace at which the residual diminishes depends on the content: the drop-off will be significantly more apparent in certain areas (for example, uniform regions) than in others (e.g., highly textured patches).
The additive reconstruction architecture is extended to incorporate a content-dependent, iteration-dependent gain factor to address these variances.
The following is a diagram of the extension that is used:
Entropy Encoding:
Because the network is not deliberately designed to maximise the entropy of its codes, and the model does not always utilise visual redundancy over a large spatial extent, the entropy of the codes created during inference is not maximal. As is usual in regular image compression codecs, adding an entropy coding layer can boost the compression ratio even more.
The lossless entropy coding techniques addressed here are fully convolutional, process binary codes in progressive order, and process them in raster-scan order within a particular encoding iteration. All of our image encoder designs produce binary codes of the form c(y, x, d) with dimensions H × W × D, where H and W are integer fractions of the picture height and width, and D is m times the number of iterations. A conventional lossless encoding system is considered, which combines a conditional probabilistic model of the current binary code c(y, x, d) with an arithmetic coder to do the actual compression. More formally, given a context T(y, x, d) which depends only on previous bits in stream order, we estimate P(c(y, x, d) | T(y, x, d)) so that the expected ideal encoded length of c(y, x, d) is the cross entropy between P(c | T) and P̂(c | T). We do not consider the small penalty incurred by using a practical arithmetic coder that requires a quantized version of P̂(c | T).
Single Iteration Entropy Coder:
We employ the PixelRNN architecture for single-layer binary code compression and a related design (BinaryRNN) for multi-layer binary code compression. In this architecture, the estimate of the conditional code probabilities for line y depends directly on certain neighbouring codes, but it also depends indirectly on the previously decoded binary codes via a line of states S of size 1 × W × k that captures both short- and long-term dependencies. All of the previous lines are summarised in the state line; a fixed value of k is used in practice. The probabilities are calculated and the state is updated line by line using a 1×3 LSTM convolution. There are three stages to the end-to-end probability estimation.
First, a 7×7 convolution is used to enlarge the receptive field of the LSTM state, the receptive field being the set of codes c(i, j, ·) that can potentially impact the probability estimate of the codes c(y, x, ·). To prevent dependence on subsequent codes, this first convolution is a masked convolution.
In the second stage, the line LSTM takes the output z0 of the initial convolution as input and processes one scan line at a time. Since LSTM hidden states are created by processing the preceding scan lines, the line LSTM captures both short- and long-term dependencies. For the same reason, the input-to-state LSTM transform is likewise a masked convolution. Finally, two 1×1 convolutions are added to the network to increase its capacity to remember additional binary code patterns. Because we are attempting to predict binary codes, the Bernoulli-distribution parameter can be easily calculated using a sigmoid activation in the final convolution.
Above Image: Binary recurrent network (BinaryRNN) architecture for a single iteration. The gray area
denotes the context that is available at decode time.
Progressive Entropy Encoding:
To cope with multiple iterations, a simple entropy coder would replicate the single-iteration entropy coder many times, with each iteration having its own line LSTM. However, such a structure would fail to account for the redundancy that exists between iterations. We can add some information from the previous layers to the data that is provided to the line LSTM of iteration #k.

Description of the neural network used to compute additional line LSTM inputs for progressive entropy coding. This allows propagation of information from the previous iterations to the current one.
Evaluation Metrics
For evaluation purposes we use Multi-Scale Structural Similarity (MS-SSIM), a well-established metric for comparing lossy image compression algorithms, and the more recent Peak Signal to Noise Ratio – Human Visual System (PSNR-HVS). While PSNR-HVS already incorporates colour information, we apply MS-SSIM to each of the RGB channels separately and average the results. The MS-SSIM score ranges from 0 to 1, whereas PSNR-HVS is measured in decibels. In both cases, higher scores indicate a closer match between the test and reference images. After each iteration, both metrics are computed for all models across the reconstructed images. To rank models, we use an aggregate metric computed as the area under the rate-distortion curve (AUC).
NATURAL LANGUAGE PROCESSING:
RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop. That loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear function.

Among the text usages, the following tasks are among those RNNs perform well at:

• Sequence labelling

• Natural Language Processing (NLP) text classification

• Natural Language Processing (NLP) text generation

Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that aren't image or tabular based.
There have been several highlighted and controversial reports in the media over the advances in text generation, in particular OpenAI's GPT-2 algorithm. In many cases the generated text is often indistinguishable from text written by humans.

I found that learning how RNNs function and how to construct them and their variants has been among the more difficult topics I have had to learn. I would like to thank the Fastai team and Jeremy Howard for their courses explaining the concepts in a more understandable order, which I've followed in this article's explanation.
RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent predictions. It's much easier to predict the next word in a sentence with more accuracy if you know what the previous words were.

Often with tasks well suited to RNNs, the sequence of the items is as important as, or more important than, the previous item in the sequence.
As I'm typing the draft for this on my smartphone, the next word suggested by my phone's keyboard will be predicted by an RNN. For example, the SwiftKey keyboard software uses RNNs to predict what you are typing.
Natural Language Processing:

Natural Language Processing (NLP) is a sub-field of computer science and artificial intelligence dealing with processing and generating natural language data. Although there is still research that sits outside of machine learning, most NLP is now based on language models produced by machine learning.
NLP is a good use case for RNNs and is used in the article to explain how RNNs can be constructed.

Language models

The aim of a language model is to minimise how confused the model is after having seen a given sequence of text.
It is only necessary to train one language model per domain, as the language model encoder can be used for different purposes, such as text generation and multiple different classifiers within that domain.
As the longest part of training is usually creating the language model encoder, reusing the encoder can save significant training time.
Comparing an RNN to a fully connected neural network:
Take a sequence of three words of text and a network that predicts the fourth word.

The network has three hidden layers, each of which is an affine function (for example a matrix dot product multiplication) followed by a non-linear function, and the last hidden layer is followed by an output from the last-layer activation function.

The input vectors representing each word in the sequence are lookups in a word embedding matrix, based on a one-hot encoded vector representing the word in the vocabulary. Note that all inputted words use the same word embedding. In this context a word is actually a token that could represent a word or a punctuation mark.
The output will be a one-hot encoded vector representing the predicted fourth word in the sequence.
The first hidden layer takes a vector representing the first word in the sequence as an input, and its output activations serve as one of the inputs into the second hidden layer.
The second hidden layer takes the input from the activations of the first hidden layer and also an input of the second word represented as a vector. These two inputs could be either added or concatenated together.
The third hidden layer follows the same structure as the second hidden layer, taking the activation from the second hidden layer combined with the vector representing the third word in the sequence. Again, these inputs are added or concatenated together.
The output from the last hidden layer goes through an activation function that produces an output representing a word from the vocabulary, as a one-hot encoded vector.
The second and third hidden layers could both use the same weight matrix, opening the opportunity of refactoring this into a loop to become recurrent.
A fully connected network for text generation/prediction. Source: Fastai deep learning course V3 by Jeremy Howard.
Vocabulary:
The vocabulary is a vector of numbers, called tokens, where each token represents one of the unique words or punctuation symbols in our corpus.
Usually words that don't occur at least twice in the texts making up the corpus aren't included, otherwise the vocabulary would be too large. I wonder if this could be used as a factor for detecting generated text, looking for the presence of words not common in the given domain.
Word embedding:
A word embedding is a matrix of weights, with a row for each word/token in the vocabulary.
A matrix dot product multiplication with a one-hot encoded vector outputs the row of the matrix representing the activations for that word. It is essentially a row lookup in the matrix, and it is computationally more efficient to do the lookup directly; this is called an embedding lookup.
Using the vector from the word embedding helps prevent the resulting activations from being very sparse. If the input was the one-hot encoded vector, which is all zeros apart from one element, the majority of the activations would also be zero. This would then be difficult to train.
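A quick NumPy illustration of why the one-hot matrix product and the row lookup give the same result (the vocabulary size and embedding dimension are made up):

import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))   # one row per token in the vocabulary

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ embedding        # dot product with the one-hot vector
via_lookup = embedding[token_id]        # direct row lookup: same result, far cheaper

assert np.allclose(via_matmul, via_lookup)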
Refactored with a loop, an RNN:
For the network to be recurrent, a loop needs to be factored into the network's model. It makes sense to use the same embedding weight matrix for every word input. This means we can replace the second and third layers with iterations within a loop.
Each iteration of the loop takes as input a vector representing the next word in the sequence, together with the output activations from the last iteration. These inputs are added or concatenated together.
The output from the last iteration is a representation of the next word in the sentence, which is put through the last-layer activation function to convert it to a one-hot encoded vector representing a word in the vocabulary.

A basic RNN. Source: Fastai deep learning course V3 by Jeremy Howard.


This allows the network to predict a word at the end of a sequence of any arbitrary length.
Retaining the output throughout the loop, an improved RNN:
Once at the end of the sequence of words, the predicted output of the next word could be stored, appended to an array, to be used as additional information in the next iteration. Each iteration then has access to the previous predictions.
For a given number of inputs, the same number of outputs is created.

An improved RNN retaining its output. Source: Fastai deep learning course V3 by Jeremy Howard.
In theory the sequence of predicted text could be infinite in length, with each predicted word following the last predicted word in the loop.
Retaining the history, a further improved RNN:
With each new batch, the history of the previous batch's sequence, the state, is often lost. Assuming the sentences are related, this may lose important insights.
To aid the prediction when we start each batch, it is helpful to know the history of the last batch rather than reset it. This retains the state, and hence the context, resulting in an understanding of the words that is a better approximation.
Note that with some datasets, such as one-billion-words, each sentence isn't related to the previous one; in this case this may not help, as there is no context between sentences.
Backpropagation through time:
Backpropagation through time (BPTT) here refers to the sequence length used during training. If we were trying to train on sequences of 50 words, the BPTT would be 50.
Usually the document is split into 64 equal sections. In this case the BPTT is the document length in words divided by 64. If the document length in words is 3200, then dividing by 64 gives a BPTT of 50.
It's beneficial to slightly randomise the BPTT value for each sequence to help improve the model.
Layered RNNs:
To get more layers of computation, to be able to solve or approximate more complex tasks, the output of the RNN could be fed into another RNN, or any number of layers of RNNs. The next section explains how this can be done.
Extending RNNs to avoid the vanishing gradient:
As the number of layers of RNNs increases, the loss landscape can become impossible to train on; this is the vanishing gradient problem. To solve this problem, a Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) network is used.
As part of this computation, the sigmoid function squashes the values of these vectors between 0 and 1, and by multiplying them elementwise with another vector you define how much of that other vector you want to "let through".
Long Short-Term Memory (LSTM):
An RNN has short-term memory. When used in combination with Long Short-Term Memory (LSTM) gates, the network can have long-term memory.
Instead of the recurring section of an RNN, an LSTM is a small neural network consisting of four neural network layers: the recurring layer from the RNN, with three networks acting as gates.
An LSTM also has a cell state alongside the hidden state. This cell state is the long-term memory. Rather than just returning the hidden state at each iteration, a tuple of states comprised of the cell state and the hidden state is returned.
Long Short-Term Memory (LSTM) has three gates:
1. An Input gate: this controls the information input at each time step.
2. An Output gate: this controls how much information is outputted to the next cell or upward layer.
3. A Forget gate: this controls how much data to lose at each time step.
Gated recurrent unit (GRU):
A gated recurrent unit is sometimes referred to as a gated recurrent network.
At the output of each iteration there is a small neural network with three neural network layers implemented, consisting of the recurring layer from the RNN, a reset gate and an update gate. The update gate acts as a forget and input gate; the coupling of these two gates performs a similar function to the three gates (forget, input and output) in an LSTM.
Compared to an LSTM, a GRU has a merged cell state and hidden state, whereas in an LSTM these are separate.
Reset gate:
The reset gate takes the input activations from the last layer; these are multiplied by a reset factor between 0 and 1. The reset factor is calculated by a neural network with no hidden layer (like a logistic regression): it performs a dot product matrix multiplication between a weight matrix and the addition/concatenation of the previous hidden state and our new input. This is then all put through the sigmoid function e^x / (1 + e^x).
This can learn to do different things in different situations, for example to forget more information if there is a full stop token.
Update gate:
The update gate controls how much of the new input to take and how much of the hidden state to keep. It is a linear interpolation: (1 − Z) multiplied by the previous hidden state plus Z multiplied by the new hidden state. This controls to what degree we keep information from the previous states and to what degree we use information from the new state.
The update gate is often represented as a switch in diagrams, although the gate can be in any position to create a linear interpolation between the two hidden states.

A RNN with a GRU. Source: Fastai deep learning course V3 by Jeremy Howard.

Which is better, a GRU or an LSTM:

This depends entirely on the task in question; it is often worth trying both to see which performs better.

Text classification:

In text classification, the prediction of the network is to classify which group or groups the text belongs to. A common use is classifying whether the sentiment of a piece of text is positive or negative.

If an RNN is trained to predict text from a corpus within a given domain, as in the RNN explanation earlier in this article, it is close to ideal to be re-purposed for text classification within that domain. The text-generation 'head' of the network is removed, leaving the 'backbone' of the network. The weights within the backbone can then be frozen. A new classification head can then be attached to the backbone and trained to predict the required classifications.

It can be a very effective method to speed up training to gradually unfreeze the weights within the layers: starting with the weights of the last two layers, then the weights of the last three layers, and finally unfreezing all of the layers' weights.
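A hedged Keras sketch of that reuse pattern is shown below: the recurrent 'backbone' layers are frozen, a new classification head is attached and trained, and a backbone layer is later unfrozen for fine-tuning. The layer choices and sizes are assumptions, not the fastai implementation the article refers to.

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

# Backbone: layers that would have been trained as part of a language model
backbone = [Embedding(2000, 128), SimpleRNN(64)]

model = Sequential(backbone + [Dense(1, activation='sigmoid')])   # new classification head

# Freeze the backbone so only the new head is trained at first
for layer in backbone:
    layer.trainable = False
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

# ... train the head on the classification data, then gradually unfreeze and re-compile:
for layer in backbone[-1:]:        # unfreeze the last backbone layer first
    layer.trainable = True
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])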

COMPLETE AUTO ENCODER:

What are AutoEncoders?

An AutoEncoder is an artificial neural network model that seeks to learn a compressed representation of the input.

There are various types of autoencoders available suited for different types of scenarios, however, the
commonly used autoencoder is for feature extraction.
Combining feature extraction models with different types of models has a wide variety of applications.

Feature extraction autoencoder models for sequence prediction problems are quite challenging, not only
because the length of the input can vary, but also because machine learning algorithms and neural networks are
designed to work with fixed-length inputs.

Another problem with sequence prediction is that the temporal ordering of the observations can make it
challenging to extract features. Therefore, special predictive models were developed to overcome such
challenges. These are called sequence-to-sequence, or seq2seq, models; the most widely used ones we already
know of are the LSTM models.
LSTM:
Recurrent neural networks such as the LSTM or Long Short-Term Memory network are specially
designed to support sequential data.
They are capable of learning the complex dynamics within the temporal ordering of input sequences as well
as using an internal memory to remember or use information across long input sequences.
Now, combining autoencoders with LSTMs will allow us to understand the pattern of sequential data with the
LSTM and then extract the features with autoencoders to recreate the input sequence.
In other words, for a given dataset of sequences, an encoder-decoder LSTM is configured to read the input
sequence, encode it and recreate it. The performance of the model is evaluated based on the model's ability
to recreate the input sequence.
Once the model achieves a desired level of performance in recreating the sequence, the decoder part of the
model can be removed, leaving just the encoder model. This encoder model can then be used to encode
input sequences.

The workflow of the composite encoder will be something like this.
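A minimal Keras sketch of such an encoder-decoder LSTM follows; the sequence length, feature count, latent size and the toy data are assumptions chosen for illustration only.

import numpy as np
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

timesteps, n_features, latent_dim = 9, 1, 100   # assumed toy dimensions

# Encoder: compress the whole input sequence into a fixed-length vector
inputs = Input(shape=(timesteps, n_features))
encoded = LSTM(latent_dim)(inputs)

# Decoder: repeat the encoding and try to recreate the input sequence
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(latent_dim, return_sequences=True)(decoded)
decoded = TimeDistributed(Dense(n_features))(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Train to recreate the input, then keep only the encoder for feature extraction
X = np.random.rand(32, timesteps, n_features)
autoencoder.fit(X, X, epochs=2, verbose=0)
encoder = Model(inputs, encoded)
features = encoder.predict(X)   # fixed-length encodings of the sequences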

REGULARIZED AUTOENCODER:
Introduction:
As we know, regularization and autoencoders are two different terminologies. First, we will briefly discuss
each topic, i.e., autoencoders and regularization, separately, and then we will see different ways to do
regularization of autoencoders.
Autoencoders:
Autoencoders are a variant of feed-forward neural networks that have an extra bias for calculating the error
of reconstructing the original input. After training, autoencoders are then used as a normal feed-forward
neural network for activations. This is an unsupervised form of feature extraction because the neural
network uses only the original input for learning weights, rather than backpropagation, which requires labels.
Deep networks can use either RBMs or autoencoders as building blocks for larger networks (a single
network rarely uses both).
Use of autoencoders:
Autoencoders are used to learn compressed representations of datasets. Commonly, we use them for reducing
the dimensions of the dataset. The output of the autoencoder is a reformation of the input data in the most
efficient form.
Similarities of autoencoders to multilayer perceptrons:
Autoencoders are similar to multilayer perceptron neural networks because, like multilayer perceptrons,
autoencoders have an input layer, some hidden layers, and an output layer. The key difference between a
multilayer perceptron network and an autoencoder is that the output layer of an autoencoder has the same
number of neurons as that of the input layer.

Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to minimize
parameter size over time.
In mathematical notation, we see regularization represented by the coefficient lambda, controlling the trade-
off between finding a good fit and keeping the value of certain feature weights low as the exponents on the
features increase.
Regularization coefficients (L1 and L2) help fight overfitting by making certain weights smaller. Smaller-
valued weights lead to simpler hypotheses, which are the most generalizable. Unregularized weights with
several higher-order polynomials in the feature sets tend to overfit the training set.
As the input training set size grows, the effect of regularization decreases, and the parameters tend to
increase in magnitude. This is appropriate because an excess of features relative to training set examples
leads to overfitting in the first place. Bigger data is the ultimate regularizer.
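As a small illustration of the role of the coefficient lambda, a regularized loss can be written as the data-fit term plus lambda times a weight penalty; the numbers in this NumPy sketch are made up and serve only to show the trade-off.

import numpy as np

def regularized_loss(y_true, y_pred, weights, lam):
    # Data-fit term (mean squared error) plus an L2 penalty scaled by lambda
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = np.sum(weights ** 2)
    return mse + lam * l2_penalty

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
weights = np.array([0.5, -1.2, 3.0])
print(regularized_loss(y_true, y_pred, weights, lam=0.01))
# A larger lambda puts stronger pressure toward small weights (simpler hypotheses)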
Regularized autoencoders
There are other ways to constrain the reconstruction of an autoencoder than to impose a hidden layer of
smaller dimensions than the input. The regularized autoencoders use a loss function that helps the model
have other properties besides copying the input to the output. We can generally find two types of regularized
autoencoder: the denoising autoencoder and the sparse autoencoder.
Denoising autoencoder
One way we can modify the autoencoder to learn useful features is by changing the inputs: we can add random noise
to the input and recover the original form by removing the noise from the input data. This prevents the
autoencoder from simply copying the data from input to output because the input contains random noise. We ask it to
subtract the noise and produce the meaningful underlying data. This is called a denoising autoencoder.

In the above diagram, the first row contains original images. We can see in the second row that random
noise is added to the original images; this noise is called Gaussian noise. The input of the autoencoder does
not get the original images, but autoencoders are trained in such a way that they will remove the noise and
generate the original images.

The only difference between implementing the denoising autoencoder and the normal autoencoder is the
change in input data. The rest of the implementation is the same for both autoencoders. Below is the
difference between training the two autoencoders.

Training simple autoencoder:


autoencoder.fit(x_train, x_train)
Training denoising autoencoder:
autoencoder.fit(x_train_noisy, x_train)
Simple as that; everything else is exactly the same. The input to the autoencoder is the noisy image, and the
expected target is the original noise-free one.
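One common way to build x_train_noisy (the exact recipe is an assumption here, not taken from the text above) is to add Gaussian noise and clip the result back to the valid pixel range:

import numpy as np

x_train = np.random.rand(100, 784)  # stand-in for the real training images, scaled to [0, 1]

noise_factor = 0.5  # assumed noise strength
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)  # keep pixel values in [0, 1]

# The denoising autoencoder is then trained to map noisy inputs back to the clean originals:
# autoencoder.fit(x_train_noisy, x_train)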
Sparse autoencoders
Another way of regularizing the autoencoder is by using a sparsity constraint. In this way of regularization,
only a fraction of the nodes are allowed to do forward and backward propagation. These nodes have non-zero values
and are called active nodes.
To do so, we add a penalty term to the loss function, which helps to activate only a fraction of the nodes. This
forces the autoencoder to represent each input as a combination of a small number of nodes and demands that it
discover interesting structures in the data. This method is efficient even if the code size is large because
only a small subset of the nodes will be active.
For example, we can add a regularization term to the loss function. Doing this will make our autoencoder learn a
sparse representation of the data.

from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

input_size = 256
hidden_size = 32
output_size = 256

l1 = Input(shape=(input_size,))
# Encoder: the L1 activity regularizer adds a sparsity penalty on the hidden activations
h1 = Dense(hidden_size, activity_regularizer=regularizers.l1(10e-6), activation='relu')(l1)
# Decoder
l2 = Dense(output_size, activation='sigmoid')(h1)

autoencoder = Model(inputs=l1, outputs=l2)
autoencoder.compile(loss='mse', optimizer='adam')
In the above code, we have added L1 regularization to the hidden layer of the encoder, which adds the sparsity
penalty to the loss function.

STOCHASTIC ENCODERS AND DECODERS:


Variational Autoencoders (VAEs):
Variational Autoencoders are a type of generative model used for tasks like image generation, data
compression, and feature learning. They consist of two main components: an encoder and a decoder. The
goal of a VAE is to learn a probabilistic model of the data, which allows it to generate new data samples that
are similar to the ones it was trained on.
Stochastic Encoder:
The encoder in a VAE is responsible for mapping an input data point (e.g., an image) into a probability
distribution in a lower-dimensional latent space. A deterministic encoder would produce a single point in
this latent space for each input.
In contrast, a stochastic encoder generates a probability distribution over the latent space. This distribution is
typically represented as a Gaussian distribution parameterized by two values: a mean (μ) and a variance (σ²),
which are outputs of the encoder neural network.
The mean (μ) represents the expected position of the encoded data point in the latent space, and the variance
(σ²) represents the uncertainty or spread of the encoded data point in the latent space.
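As a sketch, assuming Keras and hypothetical sizes, such a stochastic encoder is often implemented as a shared hidden layer with two output heads, one for the mean and one for the log-variance:

from keras.layers import Input, Dense
from keras.models import Model

original_dim, latent_dim = 784, 2   # assumed sizes

x = Input(shape=(original_dim,))
h = Dense(256, activation='relu')(x)
z_mean = Dense(latent_dim)(h)       # mu: expected position in the latent space
z_log_var = Dense(latent_dim)(h)    # log variance: uncertainty / spread

encoder = Model(x, [z_mean, z_log_var])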
By sampling from this Gaussian distribution, you obtain different points in the latent space for the same
input data. This introduces a source of randomness and allows for the generation of diverse latent
representations for similar input data. This diversity is essential for the generative aspect of VAEs.
The process of sampling from this distribution during encoding is known as the "reparameterization trick".
It allows for backpropagation during training and makes it possible to optimize the model using techniques
like stochastic gradient descent.
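A minimal NumPy sketch of the reparameterization trick, assuming the encoder has already produced a mean vector mu and a log-variance vector log_var (in a real VAE this sampling sits inside the model as a layer so that gradients can flow through mu and log_var):

import numpy as np

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * epsilon, with epsilon ~ N(0, I).
    # The randomness lives in epsilon, so mu and log_var stay differentiable.
    sigma = np.exp(0.5 * log_var)
    epsilon = np.random.normal(size=mu.shape)
    return mu + sigma * epsilon

mu = np.array([0.2, -1.0])        # hypothetical encoder outputs
log_var = np.array([-0.5, 0.1])
z1 = reparameterize(mu, log_var)  # two different latent samples
z2 = reparameterize(mu, log_var)  # for the same input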
Stochastic Decoder:
The decoder in a VAE takes a point in the latent space and maps it back to the data space, attempting to
reconstruct the original input. In the case of image generation, it might generate a probability distribution
over pixel values for each location in the image.

The stochastic decoder acknowledges the uncertainty introduced by the stochastic encoder. It
also produces a probability distribution over the data space, which can be thought of as the
likelihood of generating a particular data point given a point in the latent space.
By sampling from this distribution, you can produce different reconstructions of the same input data. This
is crucial for the generative aspect of VAEs, as it allows the model to generate diverse outputs that capture
the inherent uncertainty in the data.
CONTRACTIVE ENCODERS:
Contractive Autoencoder was proposed by researchers at the University of Toronto in 2011 in the paper
"Contractive auto-encoders: Explicit invariance during feature extraction". The idea behind it is to make the
autoencoder robust to small changes in the training dataset. To deal with the above challenge that is posed
in basic autoencoders, the authors proposed to add another penalty term to the loss function of
autoencoders. We will discuss this loss function in detail.
The Loss function:
Contractive autoencoder adds an extra term to the loss function of the autoencoder. The added penalty term
is the squared Frobenius norm of the Jacobian of the encoder with respect to the input; the Frobenius norm
is just a generalization of the Euclidean norm to matrices.
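A sketch of the loss in LaTeX, following the standard contractive autoencoder formulation, with X the input, \hat{X} the reconstruction, h(X) the hidden representation and \lambda the weight of the penalty:

\mathcal{L}(X, \hat{X}) = \| X - \hat{X} \|^2 + \lambda \, \| J_h(X) \|_F^2,
\qquad
\| J_h(X) \|_F^2 = \sum_{i,j} \left( \frac{\partial h_j(X)}{\partial X_i} \right)^2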
In the above penalty term, we first need to calculate the Jacobian matrix of the hidden layer; calculating the
Jacobian of the hidden layer with respect to the input is similar to a gradient calculation. Writing the hidden
units as h_j(X) = \phi(Z_j), with Z_j = \sum_i W_{ji} X_i + b_j and \phi the non-linearity, the entries of the
Jacobian are \partial h_j(X) / \partial X_i. To get the jth hidden unit, we take the dot product of the input
feature vector and the corresponding weight vector; to differentiate it with respect to X_i, we apply the chain
rule, which gives \partial h_j(X) / \partial X_i = \phi'(Z_j) W_{ji}.

This is similar to how we calculate gradients in gradient descent, but there is one major difference: here we
treat h(X) as a vector-valued function, with each hidden unit as a separate output. Intuitively, if we have 64
hidden units, then we have 64 function outputs, and so we will have a gradient vector for each of those 64
hidden units.
Let diag(x) denote the diagonal matrix built from a vector x. Writing the above derivative in matrix form
using diag(.) and then simplifying gives the expressions sketched below:
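A sketch of the resulting matrix form and simplified penalty, assuming a sigmoid non-linearity \phi (so that \phi'(Z_j) = h_j(1 - h_j)) and the notation introduced above, with \odot denoting element-wise multiplication:

\frac{\partial h(X)}{\partial X} = \mathrm{diag}\big(h \odot (1 - h)\big)\, W,
\qquad
\| J_h(X) \|_F^2 = \sum_{j} \big( h_j (1 - h_j) \big)^2 \sum_{i} W_{ji}^2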

Relationship with Sparse Autoencoder
In a sparse autoencoder, our goal is to have the majority of the components of the representation close to 0; for
this to happen, they must lie in the left saturated part of the sigmoid function, where their corresponding
sigmoid value is close to 0 with a very small first derivative, which in turn leads to very small entries in
the Jacobian matrix. This leads to a highly contractive mapping in the sparse autoencoder, even though that is
not the goal in the sparse autoencoder.
Relationship with Denoising Autoencoder
The idea behind the denoising autoencoder is to increase the robustness of the encoder to small changes
in the training data, which is quite similar to the motivation of the contractive autoencoder. However, there are
some differences:
CAEs encourage robustness of the representation f(x), whereas DAEs encourage robustness of the reconstruction,
which only partially increases the robustness of the representation.
A DAE increases robustness by stochastically training the model to reconstruct clean inputs from corrupted
ones, whereas a CAE increases robustness by penalizing the first derivative (the Jacobian) of the representation.

