UNIT V – RECURRENT NEURAL NETWORKS
Recurrent Neural Networks: Introduction – Recursive Neural Networks – Bidirectional RNNs – Deep Recurrent Networks – Applications: Image Generation, Image Compression, Natural Language Processing – Complete Autoencoder, Regularized Autoencoder, Stochastic Encoders and Decoders, Contractive Encoders.
RECURRENT NEURAL NETWORKS: INTRODUCTION
What is Recurrent Neural Network (RNN)?
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases where it is required to predict the next word of a sentence, the previous words are required, and hence there is a need to remember the previous words. Thus RNNs came into existence, which solved this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence. The state is also referred to as the Memory State since it remembers the previous input to the network. It uses the same parameters for each input as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of parameters, unlike other neural networks.
The formula for calculating the current state is:
ht = f(ht-1, xt)
where ht is the current state, ht-1 is the previous state and xt is the input state.
Applying the activation function (tanh):
ht = tanh(Whh * ht-1 + Wxh * xt)
where Whh is the weight at the recurrent neuron and Wxh is the weight at the input neuron; the same weights are applied to the current input and the previous states at every time step.
The formula for calculating the output is:
Yt = Why * ht
Once all the time steps are completed, the final current state is used to calculate the output.
The output is then compared to the actual output, i.e. the target output, and the error is generated.
The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained using Backpropagation Through Time.
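To make these steps concrete, here is a minimal NumPy sketch of a single-layer RNN forward pass; the function and weight names are illustrative assumptions, not taken from these notes. The same parameters are applied at every time step and the final state produces the output:

import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    # xs: a list of input vectors, one per time step
    h = np.zeros(Whh.shape[0])               # initial hidden (memory) state
    for x in xs:                             # same weights reused at every step
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # ht = tanh(Wxh*xt + Whh*ht-1 + bh)
    return Why @ h + by                      # output computed from the final state

During training, the error between this output and the target is back-propagated through every time step (backpropagation through time) to update Wxh, Whh and Why.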
Advantages of Recurrent Neural Network
An RNN remembers every piece of information through time. This ability to remember previous inputs is what makes it useful for time series prediction; Long Short-Term Memory (LSTM) units extend this memory to longer sequences.
Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighbourhood.
Disadvantages of Recurrent Neural Network
Gradient vanishing and exploding problems.
Training an RNN is a very difficult task.
It cannot process very long sequences if using tanh or relu as an activation function.
Applications of Recurrent Neural Network
Language Modelling and Generating Text
Speech Recognition
Machine Translation
Image Recognition, Face detection
Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
One to One
One to Many
Many to One
Many to Many
One to One
This type of RNN behaves the same as a simple neural network; it is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.
Many to One
In this type of RNN, multiple inputs are mapped to a single output. A typical example is sentiment analysis, where a sequence of words is taken as input and a single sentiment label is produced as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this problem is language translation. In language translation, we provide multiple words from one language as input and predict multiple words in the second language as output.
Variation Of Recurrent Neural Network (RNN)
To overcome problems like the vanishing gradient and exploding gradient, several advanced versions of RNNs have been developed; some of these are:
Bidirectional Neural Network (BiNN)
Long Short-Term Memory (LSTM)
Bidirectional Neural Network (BiNN)
A BiNN is a variation of a recurrent neural network in which the input information flows in both directions and the outputs of both directions are then combined to produce the final output. BiNNs are useful in situations where the context of the input is important, such as NLP tasks and time-series analysis problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on a read-write-and-forget principle: given the input information, the network reads and writes the most useful information from the data and forgets the information that is not important for predicting the output. To do this, three new gates are introduced into the RNN. In this way, only the selected information is passed through the network.
Difference between RNN and Simple Neural Network
RNN is considered to be a better version of a deep neural network when the data is sequential. There are significant differences between RNNs and deep neural networks; they are listed below:
Recurrent Neural Network: used when the data is sequential and the number of inputs is not predefined; exploding and vanishing gradients are its major drawback.
Simple Deep Neural Network: has no special mechanism for sequential data and the number of inputs is fixed; exploding and vanishing gradients can also occur, but they are not the major problem with DNNs.
RECURSIVE NEURAL NETWORKS:
A Recursive Neural Network (RvNN) applies the same set of weights recursively over a structured input, such as a parse tree, rather than over a flat sequence. A variable called 'score' is calculated at each traversal of the nodes, telling us which pair of phrases and words we must combine to form the perfect syntactic tree for a given sentence.
Let us consider the representation of the phrase "a lot of fun" within a sentence.
An RNN representation of this phrase would not be suitable because it considers only sequential relations. Each state varies with the preceding words' representation, so a subsequence that doesn't occur at the beginning of the sentence cannot be represented on its own. With an RNN, when processing the word 'fun', the hidden state will represent the whole sentence.
However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the representation of the exact phrase. It lies in the hidden state of the node R_{a\ lot\ of\ fun}. Thus, syntactic parsing can be implemented with the help of Recursive Neural Networks.
Natural language is often ambiguous; interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for recursive neural networks than for recurrent neural networks: manually parsing a sentence into short components is more time-consuming and tedious than assigning a label to a sentence.
BIDIRECTIONAL RNNS:
A bidirectional recurrent neural network (BRNN) is a neural network architecture designed to process sequential data. BRNNs process input sequences in both the forward and backward directions so that the network can use information from both the past and the future context in its predictions. This is the main distinction between BRNNs and conventional recurrent neural networks.
A BRNN has two distinct recurrent hidden layers, one of which processes the input sequence forward and the other of which processes it backward. The outputs from these hidden layers are then collected and fed into a final prediction-making layer. Any recurrent neural network cell, such as Long Short-Term Memory (LSTM) or the Gated Recurrent Unit (GRU), can be used to create the recurrent hidden layers.
The BRNN functions similarly to conventional recurrent neural networks in the forward direction, updating the hidden state depending on the current input and the prior hidden state at each time step. The backward hidden layer, on the other hand, analyses the input sequence in the opposite manner, updating the hidden state based on the current input and the hidden state of the next time step.
Compared to conventional unidirectional recurrent neural networks, the accuracy of the BRNN is improved since it can process information in both directions and account for both past and future contexts. Because the two hidden layers can complement one another and give the final prediction layer more data, using two distinct hidden layers also offers a type of model regularisation.
To update the model parameters, gradients are computed for both the forward and backward passes of the backpropagation through time technique that is typically used to train BRNNs. At inference time, the input sequence is processed by the BRNN in a single forward pass, and predictions are made based on the combined outputs of the two hidden layers.
Figure: Bi-directional Recurrent Neural Network.
Working of Bidirectional Recurrent Neural Network
1. Inputting a sequence: A sequence of data points, each represented as a vector with the same dimensionality, is fed into the BRNN. The sequences may have different lengths.
2. Dual processing: The data is processed in both the forward and backward directions. In the forward direction, the hidden state at time step t is determined on the basis of the input at that step and the hidden state at step t-1. In the reverse direction, the hidden state at step t is calculated from the input at step t and the hidden state at step t+1.
3. Computing the hidden state: A non-linear activation function applied to the weighted sum of the input and the previous hidden state is used to calculate the hidden state at each step. This creates a memory mechanism that enables the network to remember data from earlier steps in the process.
4. Determining the output: A non-linear activation function applied to the weighted sum of the hidden state and a set of output weights is used to determine the output at each step. This output can either be the final output or the input to another layer of the network.
5. Training: The network is trained through a supervised learning approach where the goal is to minimize the discrepancy between the predicted output and the actual output. The network adjusts its weights in the input-to-hidden and hidden-to-output connections during training through backpropagation.
To calculate the hidden states of a BRNN unit, we use the following formulas:
Ht (Forward) = A(Xt * WXH (Forward) + Ht-1 (Forward) * WHH (Forward) + bH (Forward))
Ht (Backward) = A(Xt * WXH (Backward) + Ht+1 (Backward) * WHH (Backward) + bH (Backward))
where:
A = activation function,
W = weight matrix,
b = bias.
The hidden state at time t is given by a combination of Ht (Forward) and Ht (Backward). The output at any given hidden state is:
Yt = Ht * WAY + by
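As an illustration of these formulas, here is a hedged NumPy sketch of a BRNN forward pass; the weight names follow the notation above, but the shapes and the concatenation of the two directions are assumptions:

import numpy as np

def brnn_forward(xs, WXH_f, WHH_f, bH_f, WXH_b, WHH_b, bH_b, WY, bY):
    T, hidden = len(xs), WHH_f.shape[0]
    Hf = np.zeros((T, hidden))
    Hb = np.zeros((T, hidden))
    for t in range(T):                                   # forward pass: uses Ht-1
        prev = Hf[t - 1] if t > 0 else np.zeros(hidden)
        Hf[t] = np.tanh(WXH_f @ xs[t] + WHH_f @ prev + bH_f)
    for t in reversed(range(T)):                         # backward pass: uses Ht+1
        nxt = Hb[t + 1] if t < T - 1 else np.zeros(hidden)
        Hb[t] = np.tanh(WXH_b @ xs[t] + WHH_b @ nxt + bH_b)
    H = np.concatenate([Hf, Hb], axis=1)                 # combine both directions
    return H @ WY + bY                                   # Yt = Ht * W + b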
The training of a BRNN is similar to the backpropagation through time (BPTT) algorithm. The BPTT algorithm works as follows:
Roll out the network and calculate the errors at each time step.
Update the weights and roll up the network.
However, because forward and backward passes in a BRNN occur simultaneously, updating the weights for the two processes could occur at the same time, which produces inaccurate outcomes. Thus, to accommodate the forward and backward passes individually, the following approach is used to train a BRNN.
Applications of Bidirectional Recurrent Neural Network
Bi-RNNs have been applied to various natural language processing (NLP) tasks, including:
1. Sentiment Analysis : By taking into account both the prior and subsequent context, BRNNs can be
    utilized to categorize the sentiment of a particular sentence.
2. Named Entity Recognition : By considering the context both before and after the stated thing,
    BRNNs can be utilized to identify those entities in a sentence.
3. Part-of-Speech Tagging : The classification of words in a phrase into their corresponding parts of
    speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
4. Machine Translation: BRNNs can be used in encoder-decoder models for machine translation, where the encoder analyses the source sentence in both directions to capture its context and the decoder generates the target sentence.
5. Speech Recognition : When the input voice signal is processed in both directions to capture the
    contextual information, BRNNs can be used in automatic speech recognition systems.
 Advantages of Bidirectional RNN
Context from both past and future: With the ability to process sequential input both forward and backward, BRNNs provide a thorough grasp of the full context of a sequence. Because of this, BRNNs are effective at tasks like sentiment analysis and speech recognition.
Enhanced accuracy: BRNNs frequently yield more precise answers since they take both historical and upcoming data into account.
      Efficient handling of variable-length sequences: When compared to conventional RNNs, which
      require padding to have a constant length, BRNNs are better equipped to handle variable-length
      sequences.
      Resilience to noise and irrelevant information: BRNNs may be resistant to noise and irrelevant data
      that are present in the data. This is so because both the forward and backward paths offer useful
      information that supports the predictions made by the network.
      Ability to handle sequential dependencies: BRNNs can capture long-term links between sequence
      pieces, making them extremely adept at handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
       Computational complexity: Given that they analyze data both forward and backward, BRNNs can
      be computationally expensive due to the increased amount of calculations needed.
      Long training time: BRNNs can also take a while to train because there are many parameters to
      optimize, especially when using huge datasets.
      Difficulty in parallelization: Due to the requirement for sequential processing in both the forward
      and backward directions, BRNNs can be challenging to parallelize.
Overfitting: BRNNs are prone to overfitting since they include many parameters, which can result in overly complicated models, especially when trained on small datasets.
      Interpretability: Due to the processing of data in both forward and backward directions, BRNNs can
      be tricky to interpret since it can be difficult to comprehend what the model is doing and how it is
      producing predictions.
Implementation of a Bi-directional Recurrent Neural Network on an NLP dataset
There are multiple steps involved in training a bidirectional RNN on an NLP dataset, including data preprocessing, model development, and model training. Here is an illustration of a Python implementation using Keras and TensorFlow. We'll utilize the IMDb movie review sentiment classification dataset from Keras in this example. The data must first be loaded and preprocessed.
import warnings
warnings.filterwarnings('ignore')
from keras.datasets import imdb
from keras_preprocessing.sequence import pad_sequences
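The data-loading step itself did not survive in these notes; the following is a minimal sketch of how the IMDb reviews could be loaded and padded using the imports above. The vocabulary size of 10,000 words and the sequence length of 200 are assumptions:

features = 10000   # keep only the 10,000 most frequent words (assumed)
maxlen = 200       # pad/truncate every review to 200 tokens (assumed)

# Reviews are loaded as sequences of integer word indices
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=features)

# Pad the variable-length reviews to a fixed length
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)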
Model Architecture
Using the high-level Keras API, we will implement a Bidirectional Recurrent Neural Network model. This model will have 64 hidden units and an embedding size of 128. While compiling the model we provide three essential parameters:
optimizer – the method that helps to optimize the cost function by using gradient descent.
loss – the loss function by which we monitor whether the model is improving with training or not.
metrics – helps to evaluate the model by scoring its predictions on the training and the validation data.
# Import the necessary modules from Keras
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, SimpleRNN, Dense
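The model definition itself is missing from these notes; a minimal sketch consistent with the description above (an embedding size of 128, a Bidirectional SimpleRNN with 64 hidden units, and a sigmoid output for the binary sentiment label) might look as follows. The batch size and number of epochs are assumptions:

embedding_dim = 128
hidden_units = 64
batch_size = 128   # assumed
epochs = 5         # assumed

# Embedding -> bidirectional recurrent layer -> binary sentiment output
model = Sequential([
    Embedding(features, embedding_dim, input_length=maxlen),
    Bidirectional(SimpleRNN(hidden_units)),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])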
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test))
   Output:
IMAGE GENERATION:
The DRAW (Deep Recurrent Attentive Writer) network generates images step by step: a recurrent encoder and a recurrent decoder, connected through latent variables, successively add to a canvas that ultimately forms the output image.
During inference the input is read at every time-step and the result is passed to the encoder RNN. The outputs from the previous time-step specify where to read. The output of the encoder RNN is used to compute the approximate posterior over the latent variables at that time-step.
Loss Function
The final canvas matrix cT is used to parametrize a model D(X | cT) of the input data. If the input is binary, the natural choice for D is a Bernoulli distribution with means given by σ(cT). The reconstruction loss Lx is defined as the negative log probability of x under D:
Lx = -log D(x | cT)
The latent loss Lz is the summed KL divergence of the latent distributions produced by the encoder from the prior over all time steps. Lz can be interpreted as the number of nats required to transmit the latent sample sequence z1:T to the decoder from the prior, and (if x is discrete) Lx is the number of nats required for the decoder to reconstruct x given z1:T. The total loss is therefore equivalent to the expected compression of the data by the decoder and prior.
The total loss L for the network is the expectation of the sum of the reconstruction and latent losses:
L = ⟨Lx + Lz⟩z
which we optimize using a single sample of z for each stochastic gradient descent step.
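As a concrete illustration, here is a small NumPy sketch of this loss under the usual assumptions of a Bernoulli reconstruction model and diagonal Gaussian latents with a standard normal prior; the variable names are illustrative, not from the notes:

import numpy as np

def draw_loss(x, canvas_T, mus, logvars):
    x_hat = 1.0 / (1.0 + np.exp(-canvas_T))             # sigma(cT), Bernoulli means
    # Lx = -log D(x | cT): Bernoulli negative log-likelihood of the input
    Lx = -np.sum(x * np.log(x_hat + 1e-8) + (1 - x) * np.log(1 - x_hat + 1e-8))
    # Lz = sum over time steps of KL( N(mu_t, sigma_t^2) || N(0, 1) )
    Lz = sum(-0.5 * np.sum(1 + lv - mu ** 2 - np.exp(lv))
             for mu, lv in zip(mus, logvars))
    return Lx + Lz                                       # total loss L = Lx + Lz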
Improving Images
As Eric Jang mentions in his post, it's easier to ask our neural network to merely "improve the image" rather than "finish the image in one shot". Human artists work by iterating on their canvas, inferring from their drawing what to fix and what to paint next.
Improving an image, or progressive refinement, is simply breaking up our joint distribution P(C) over and over again, resulting in a chain of latent variables C1, C2, …, CT−1 leading to a new observed variable distribution P(CT).
The trick is to sample from the iterative refinement distribution P(Ct|Ct−1) several times rather than sampling straight from P(C).
In the DRAW model, P(Ct|Ct−1) is the same distribution for all t, so we can compactly represent this as a recurrence relation (if it were not, we would have a Markov chain instead of a recurrent network).
The DRAW model applied
Imagine you are trying to encode an image of the number 8. Every handwritten number is drawn differently; while some portions may be thicker, others can be longer. Without attention, the encoder would be forced to try and capture all these small variations at the same time.
But what if the encoder could choose a small crop of the image on every frame and examine each portion of the number one at a time? That would make the work easier, right?
The same logic applies to generating the number. The attention unit will determine where to draw the next portion of the number 8 (or any other), while the latent vector passed in will determine whether the decoder generates a thicker area or a thinner area.
Basically, if we think of the latent code in a VAE (variational auto-encoder) as a vector that represents the entire image, the latent codes in DRAW can be thought of as vectors that represent pen strokes. Eventually, a sequence of these vectors creates a recreation of the original image.
Image captions: choosing the important portion; cropping the image and forgetting about the other parts.
IMAGE COMPRESSION:
Introduction:
The development of and demand for multimedia goods has risen in recent years, resulting in network bandwidth and storage device limitations. As a result, image compression theory is becoming more significant for reducing data redundancy and boosting savings in device space and transmission bandwidth. In computer science and information theory, data compression, also known as source coding, is the process of encoding information using fewer bits or other information-bearing units than an unencoded version. Compression is advantageous because it saves money by reducing the use of expensive resources such as hard disc space and transmission bandwidth.
Image Compression:
Image compression is a type of data compression in which the original image is encoded with a small number of bits. Compression focuses on reducing image size without sacrificing the uniqueness and information included in the original. The purpose of image compression is to eliminate image redundancy while also increasing storage capacity for well-organized communication.
1. LSTM:
Let xt, ct, and ht represent the input, cell, and hidden states at iteration t, respectively. The new cell state ct and the new hidden state ht are computed using the current input xt, the prior cell state ct−1, and the previous hidden state ht−1.
2. Associative LSTM:
To enable key-value storage of data, an Associative LSTM combines an LSTM with principles from Holographic Reduced Representations (HRRs). To achieve key-value binding between two vectors (the key and its associated content), HRRs employ a "binding" operator. Associative arrays are natively implemented as a byproduct; stacks, queues, or lists can also be easily implemented.
Associative LSTM extends LSTM using the holographic representation, and its new states are computed using this binding operation.
Reconstruction Framework:
In addition to employing different types of recurrent units, three distinct ways of constructing the final image reconstruction from the decoder outputs are explored.
One-shot Reconstruction:
As was done in Toderici et al. [2016], after each iteration of the decoder we predict the whole picture (γ = 0). Each cycle has access to more of the encoder's produced bits, allowing for a better reconstruction. This method is known as "one-shot reconstruction". We merely transfer the previous iteration's residual to the next iteration, despite trying to rebuild the original picture at each iteration. The number of weights is reduced as a result, and trials demonstrate that sending both the original picture and the residual does not enhance the reconstructions.
Additive Reconstruction:
In additive reconstruction, which is more widely used in traditional image coding, each iteration only tries to reconstruct the residual from the previous iterations. The final image reconstruction is then the sum of the outputs of all iterations (γ = 1).
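To make the two reconstruction schemes concrete, here is a hedged Python sketch of the iterative compression loop; encode_fn and decode_fn are placeholders standing in for the recurrent encoder and decoder, not functions from the paper:

import numpy as np

def iterative_reconstruction(image, encode_fn, decode_fn, iterations, additive=True):
    residual = image
    reconstruction = np.zeros_like(image)
    for _ in range(iterations):
        bits = encode_fn(residual)                    # binary codes for this iteration
        output = decode_fn(bits)                      # decoder output
        if additive:
            reconstruction = reconstruction + output  # additive: sum of all outputs (gamma = 1)
        else:
            reconstruction = output                   # one-shot: predict the whole image (gamma = 0)
        residual = image - reconstruction             # only the residual is passed forward
    return reconstruction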
Residual Scaling:
The residual starts large in both additive and "one-shot" reconstruction, and we anticipate it to diminish with each iteration. However, operating the encoder and decoder effectively across a large range of values may be problematic. In addition, the pace at which the residual diminishes is determined by the content. The drop-off will be significantly more apparent in certain areas (for example, uniform regions) than in others (e.g., highly textured patches).
The additive reconstruction architecture is enhanced to incorporate a content-dependent, iteration-dependent gain factor to address these variances.
The following is a diagram of the extension that is used:
Entropy Encoding:
Because the network is not deliberately intended to maximise entropy in its codes, and the model does not always utilise visual redundancy across a vast spatial extent, the entropy of the codes created during inference is not maximal. As is usual in regular image compression codecs, adding an entropy coding layer can boost the compression ratio even more.
The lossless entropy coding techniques addressed here are fully convolutional, process binary codes in progressive order, and process a given encoding iteration in raster-scan order. All of our image encoder designs produce binary codes of the form c(y, x, d) with dimensions H × W × D, where H and W are integer fractions of the picture height and width, and D is m times the number of iterations. A conventional lossless encoding system is considered, which combines a conditional probabilistic model of the current binary code c(y, x, d) with an arithmetic coder to do the actual compression. More formally, given a context T(y, x, d) which depends only on previous bits in stream order, we estimate P(c(y, x, d) | T(y, x, d)) so that the expected ideal encoded length of c(y, x, d) is the cross entropy between P(c | T) and P̂(c | T). We do not consider the small penalty incurred by using a practical arithmetic coder that requires a quantized version of P̂(c | T).
Single Iteration Entropy Coder:
We employ the PixelRNN architecture for single-layer binary code compression and a related design (BinaryRNN) for multi-layer binary code compression. In this architecture, the estimate of the conditional code probabilities for line y depends directly on certain neighbouring codes, but it also depends indirectly on the previously decoded binary codes through a line of states S of size 1 × W × k that captures both short- and long-term dependencies. All of the previous lines are summarised in the state line; a fixed k is used in practice. The probabilities are calculated and the state is updated line by line using a 1×3 LSTM convolution. The end-to-end probability estimation has three stages.
First, a 7×7 convolution is used to enlarge the receptive field of the LSTM state, the receptive field being the set of codes c(i, j, ·) that potentially impact the probability estimate of codes c(y, x, ·). To avoid dependence on future codes, this first convolution is a masked convolution. In the second stage, the line LSTM takes the output z0 of the initial convolution as input and processes one scan line at a time. The line LSTM captures both short- and long-term dependencies since its hidden states are created by processing preceding scan lines; the input-to-state LSTM transform is likewise a masked convolution for the same reason. Finally, two 1×1 convolutions are added to the network to boost its capacity to remember additional binary code patterns. Because we are attempting to predict binary codes, the Bernoulli-distribution parameter may be easily calculated using a sigmoid activation in the final convolution.
Above Image: Binary recurrent network (BinaryRNN) architecture for a single iteration. The gray area
denotes the context that is available at decode time.
Progressive Entropy Encoding:
To cope with multiple iterations, a simple entropy coder would replicate the single-iteration entropy coder several times, with each iteration having its own line LSTM. However, such a structure would fail to account for the redundancy that exists between iterations. Instead, we can add some information from the previous layers to the data that is provided to the line LSTM of iteration #k.
Figure: the neural network used to compute additional line LSTM inputs for the progressive entropy coder. This allows propagation of information from the previous iterations to the current one.
Evaluation Metrics
For evaluation purposes we use Multi-Scale Structural Similarity (MS-SSIM), a well-established metric for comparing lossy image compression algorithms, and the more recent Peak Signal to Noise Ratio – Human Visual System (PSNR-HVS). While PSNR-HVS already incorporates colour information, we apply MS-SSIM to each of the RGB channels separately and average the results. The MS-SSIM score ranges from 0 to 1, whereas PSNR-HVS is measured in decibels. In both cases, higher scores indicate a closer match between the test and reference images. Both metrics are computed for all models across the reconstructed images after each iteration. To rank models, we use an aggregate metric computed as the area under the rate-distortion curve (AUC).
NATURAL LANGUAGE PROCESSING:
RNNs are ideal for solving problems where the sequence is more important than the individual items
themselves.
An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop. That loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear function.
Among text-based uses, the following tasks are among those RNNs perform well at:
•     Sequence labelling
Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that aren't image or tabular based.
There have been several highlighted and controversial reports in the media over the advances in text generation, in particular OpenAI's GPT-2 algorithm. In many cases the generated text is often indistinguishable from text written by humans.
I found learning how RNNs function, and how to construct them and their variants, to be among the more difficult topics I have had to learn. I would like to thank the Fastai team and Jeremy Howard for their courses explaining the concepts in a more understandable order, which I've followed in this article's explanation.
RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent predictions. It's much easier to predict the next word in a sentence with more accuracy if you know what the previous words were.
Often with tasks well suited to RNNs, the sequence of the items is as important as, or more important than, the previous item in the sequence.
As I'm typing the draft for this on my smartphone, the next word suggested by my phone's keyboard will be predicted by an RNN. For example, the SwiftKey keyboard software uses RNNs to predict what you are typing.
Natural Language Processing:
Natural Language Processing (NLP) is a sub-field of computer science and artificial intelligence dealing with processing and generating natural language data. Although there is still research that is outside of machine learning, most NLP is now based on language models produced by machine learning.
NLP is a good use case for RNNs and is used in this article to explain how RNNs can be constructed.
Language models
The aim of a language model is to minimise how confused the model is having seen a given sequence of text.
It is only necessary to train one language model per domain, as the language model encoder can be used for different purposes, such as text generation and multiple different classifiers within that domain.
As the longest part of training is usually creating the language model encoder, reusing the encoder can save significant training time.
Comparing an RNN to a fully connected neural network:
Take a sequence of three words of text and a network that predicts the fourth word.
The network has three hidden layers, each of which is an affine function (for example a matrix dot product multiplication) followed by a non-linear function; the last hidden layer is followed by an output from the last-layer activation function.
The input vectors representing each word in the sequence are lookups in a word embedding
matrix, based on a one hot encoded vector representing the word in the vocabulary. Note that all
inputted words use the same word embedding. In this context a word is actually a token that
could represent a word or a punctuation mark.
The output will be a one hot encoded vector representing the predicted fourth word in the
sequence.
The first hidden layer takes a vector representing the first word in the sequence as an input, and its output activations serve as one of the inputs into the second hidden layer.
The second hidden layer takes the input from the activations of the first hidden layer and also an input of the second word represented as a vector. These two inputs could be either added or concatenated together.
The third hidden layer follows the same structure as the second hidden layer, taking the activations of the second hidden layer together with a vector representing the third word.
The output from the last hidden layer goes through an activation function that produces an output representing a word from the vocabulary, as a one hot encoded vector.
The second and third hidden layers could both use the same weight matrix, opening the opportunity of refactoring this into a loop to become recurrent.
A fully connected network for text generation/prediction. Source: Fastai deep learning course V3 by Jeremy Howard.
Vocabulary:
The vocabulary is a vector of numbers, called tokens, where each token represents one of the unique words or punctuation symbols in our corpus.
Usually, words that don't occur at least twice in the texts making up the corpus aren't included, otherwise the vocabulary would be too large. I wonder if this could be used as a factor for detecting generated text, by looking for the presence of words not common in the given domain.
Word embedding:
A word embedding is a matrix of weights, with a row for each word/token in the vocabulary.
A matrix dot product multiplication with a one hot encoded vector outputs the row of the matrix representing the activations for that word. It is essentially a row lookup in the matrix, and it is computationally more efficient to do it that way; this is called an embedding lookup.
Using the vector from the word embedding helps prevent the resulting activations from being very sparse. If the input were the one hot encoded vector, which is all zeros apart from one element, the majority of the activations would also be zero. This would then be difficult to train.
Refactored with a loop, an RNN:
For the network to be recurrent, a loop needs to be factored into the network's model. It makes sense to use the same embedded weight matrix for every word input. This means we can replace the second and third layers with iterations within a loop.
Each iteration of the loop takes as input a vector representing the next word in the sequence together with the output activations from the last iteration. These inputs are added or concatenated together.
The output from the last iteration is a representation of the next word in the sentence, which is put through the last-layer activation function to convert it into a one hot encoded vector representing a word in the vocabulary.
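A minimal NumPy sketch of this refactored loop is shown below; every word shares the same embedding matrix and the same hidden-layer weights are reused at each iteration. The names (embedding, W_in, W_h, W_out) are illustrative assumptions:

import numpy as np

def predict_next_word(token_ids, embedding, W_in, W_h, W_out):
    hidden = np.zeros(W_h.shape[0])
    for token in token_ids:
        x = embedding[token]                        # embedding lookup: one row of the matrix
        hidden = np.tanh(W_in @ x + W_h @ hidden)   # combine new word with previous activations
    logits = W_out @ hidden                         # last-layer scores over the vocabulary
    return np.argmax(logits)                        # index of the predicted next token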
An improved RNN retaining its output. Source: Fastai deep learning course V3 by Jeremy Howard.
In theory the sequence of predicted text could be infinite in length, with a predicted word following the last predicted word in the loop.
Retaining the history, a further improved RNN:
With each new batch, the history of the previous batch's sequence, the state, is often lost. Assuming the sentences are related, this may lose important insights.
To aid the prediction when we start each batch, it is helpful to know the history of the last batch rather than resetting it. This retains the state, and hence the context, resulting in an understanding of the words that is a better approximation. Note that with some datasets, such as one-billion-words, each sentence isn't related to the previous one; in this case retaining the state may not help, as there is no context between sentences.
Backpropagation through time:
Backpropagation through time (BPTT) here refers to the sequence length used during training. If we were trying to train on sequences of 50 words, the BPTT would be 50.
Usually the document is split into 64 equal sections. In this case the BPTT is the document length in words divided by 64. If the document length in words is 3200, then dividing by 64 gives a BPTT of 50.
It's beneficial to slightly randomise the BPTT value for each sequence to help improve the model.
Layered RNNs:
To get more layers of computation, so as to be able to solve or approximate more complex tasks, the output of the RNN could be fed into another RNN, or any number of layers of RNNs. The next section explains how this can be done.
Extending RNNs to avoid the vanishing gradient:
As the number of layers of RNNs increases, the loss landscape can become impossible to train on; this is the vanishing gradient problem. To solve this problem, a Gated Recurrent Unit (GRU) or a Long Short Term Memory (LSTM) unit is used in place of the plain recurrent section, both of which use gates to control the flow of information.
As part of this computation, the sigmoid function squashes the values of these vectors between 0 and 1, and by multiplying them elementwise with another vector you define how much of that other vector you want to "let through".
Long Short Term Memory (LSTM):
An RNN has short term memory. When used in combination with Long Short Term Memory (LSTM) gates, the network can have long term memory.
Instead of the recurring section of an RNN, an LSTM is a small neural network consisting of four neural network layers: the recurring layer from the RNN with three networks acting as gates.
An LSTM also has a cell state alongside the hidden state. This cell state is the long term memory.
Rather than just returning the hidden state at each iteration, a tuple of hidden states is returned, comprised of the cell state and the hidden state. Long Short Term Memory (LSTM) has three gates:
1. An Input gate, this controls the information input at each time step.
2. An Output gate, this controls how much information is outputted to the next cell or upward layer.
3. A Forget gate, this controls how much data to lose at each time step.
Gated recurrent unit (GRU):
A gated recurrent unit is sometimes referred to as a gated recurrent network.
At the output of each iteration there is a small neural network with three neural network layers implemented, consisting of the recurring layer from the RNN, a reset gate and an update gate. The update gate acts as a forget and input gate; the coupling of these two gates performs a similar function to the three gates (forget, input and output) in an LSTM.
Compared to an LSTM, a GRU has a merged cell state and hidden state, whereas in an LSTM these are
separate.
Reset gate:
The reset gate takes the input activations from the last layer; these are multiplied by a reset factor between 0 and 1. The reset factor is calculated by a neural network with no hidden layer (like a logistic regression): it performs a dot product matrix multiplication between a weight matrix and the addition/concatenation of the previous hidden state and our new input, and this is then all put through the sigmoid function e^x / (1 + e^x).
This can learn to do different things in different situations, for example to forget more information if there is a full stop token.
Update gate:
The update gate controls how much of the new input to take and how much of the hidden state to keep. This is a linear interpolation: (1 − Z) multiplied by the previous hidden state plus Z multiplied by the new hidden state. It controls to what degree we keep information from the previous states and to what degree we use information from the new state.
The update gate is often represented as a switch in diagrams, although the gate can be in any position to create a linear interpolation between the two hidden states.
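A small NumPy sketch of a single GRU step implementing the reset and update gates described above is given below; the weight names (W_r, W_z, W_h) and the concatenation of the input with the hidden state are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_r, W_z, W_h):
    xh = np.concatenate([x, h_prev])             # concatenation of new input and previous state
    r = sigmoid(W_r @ xh)                        # reset gate, a factor between 0 and 1
    z = sigmoid(W_z @ xh)                        # update gate, a factor between 0 and 1
    h_new = np.tanh(W_h @ np.concatenate([x, r * h_prev]))  # candidate (new) hidden state
    return (1 - z) * h_prev + z * h_new          # linear interpolation between old and new states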
An RNN with a GRU. Source: Fastai deep learning course V3 by Jeremy Howard.
Whether a GRU or an LSTM works better depends entirely on the task in question; it is often worth trying both to see which performs better.
Text classification:
In text classification the prediction of the network is to classify which group or groups a piece of text belongs to. A common use is classifying whether the sentiment of a piece of text is positive or negative.
If an RNN is trained to predict text from a corpus within a given domain, as in the RNN explanation earlier in this article, it is close to ideal to be re-purposed for text classification within that domain. The generative 'head' of the network is removed, leaving the 'backbone' of the network. The weights within the backbone can then be frozen. A new classification head can then be attached to the backbone and trained to predict the required classifications.
It can be a very effective method to speed up training to gradually unfreeze the weights within the layers: starting with the weights of the last two layers, then the weights of the last three layers, and finally unfreezing all of the layers' weights.
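As a rough Keras illustration of this re-purposing workflow (the language_model variable, the layer split and the head sizes below are hypothetical, not taken from the course):

from keras.models import Sequential
from keras.layers import Dense

backbone = Sequential(language_model.layers[:-1])    # drop the generative 'head'
backbone.trainable = False                           # freeze the backbone weights

classifier = Sequential([
    backbone,
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')                   # e.g. positive/negative sentiment
])
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Layers can later be unfrozen gradually (the last two, then three, then all)
# and training continued.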
AUTOENCODERS:
An autoencoder is an artificial neural network model that seeks to learn a compressed representation of the input.
There are various types of autoencoders available, suited to different types of scenarios; however, the most commonly used autoencoder is for feature extraction.
Combining feature extraction models with different types of models has a wide variety of applications.
Feature-extraction autoencoder models for sequence prediction problems are quite challenging, not only because the length of the input can vary, but because machine learning algorithms and neural networks are designed to work with fixed-length inputs.
Another problem with sequence prediction is that the temporal ordering of the observations can make it challenging to extract features. Therefore special predictive models were developed to overcome such challenges. These are called sequence-to-sequence, or seq2seq, models, and the most widely used ones we already know of are the LSTM models.
LSTM:
Recurrent neural networks such as the LSTM, or Long Short-Term Memory network, are specially designed to support sequential data.
They are capable of learning the complex dynamics within the temporal ordering of input sequences, as well as using an internal memory to remember or use information across long input sequences.
Now, combining autoencoders with LSTMs allows us to capture the pattern of sequential data with the LSTM and then extract the features with the autoencoder to recreate the input sequence.
In other words, for a given dataset of sequences, an encoder-decoder LSTM is configured to read the input sequence, encode it and recreate it. The performance of the model is evaluated based on the model's ability to recreate the input sequence.
Once the model achieves a desired level of performance in recreating the sequence, the decoder part of the model can be removed, leaving just the encoder model. This encoder can then be used to encode input sequences.
The workflow of the composite encoder will be something like this.
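Since the workflow diagram did not survive in these notes, here is a hedged Keras sketch of such a composite reconstruction LSTM autoencoder; the sequence length, feature count and layer sizes are assumptions:

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

timesteps, n_features = 9, 1                     # assumed input sequence shape
model = Sequential([
    LSTM(100, activation='relu', input_shape=(timesteps, n_features)),  # encoder
    RepeatVector(timesteps),                     # repeat the encoding for each output step
    LSTM(100, activation='relu', return_sequences=True),                # decoder
    TimeDistributed(Dense(n_features))           # reconstruct one value per time step
])
model.compile(optimizer='adam', loss='mse')
# After training, the decoder half can be removed and the first LSTM layer
# kept as a standalone feature-extracting encoder.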
REGULARIZED AUTOENCODER:
Introduction:
As we know, regularization and autoencoders are two different terminologies. First, we will briefly discuss each topic, i.e., autoencoders and regularization, separately, and then we will see different ways to do regularization of autoencoders.
Autoencoders:
Autoencoders are a variant of feed-forward neural networks that have an extra bias for calculating the error of reconstructing the original input. After training, autoencoders are then used as a normal feed-forward neural network for activations. This is an unsupervised form of feature extraction because the neural network uses only the original input for learning weights, rather than backpropagation, which has labels. Deep networks can use either RBMs or autoencoders as building blocks for larger networks (a single network rarely uses both).
Use of autoencoders:
Autoencoders are used to learn compressed representations of datasets. Commonly, we use them to reduce the dimensions of a dataset. The output of the autoencoder is a reformation of the input data in its most efficient form.
Similarities of autoencoders to the multilayer perceptron
Autoencoders are similar to multilayer perceptron neural networks because, like multilayer perceptrons, autoencoders have an input layer, some hidden layers, and an output layer. The key difference between a multilayer perceptron network and an autoencoder is that the output layer of an autoencoder has the same number of neurons as the input layer.
Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to minimize parameter size over time.
In mathematical notation, we see regularization represented by the coefficient lambda, which controls the trade-off between finding a good fit and keeping the value of certain feature weights low as the exponents on the features increase.
L1 and L2 regularization help fight overfitting by making certain weights smaller. Smaller-valued weights lead to simpler hypotheses, which are the most generalizable. Unregularized weights with several higher-order polynomials in the feature sets tend to overfit the training set.
As the input training set size grows, the effect of regularization decreases and the parameters tend to increase in magnitude. This is appropriate because an excess of features relative to training set examples leads to overfitting in the first place. Bigger data is the ultimate regularizer.
Regularized autoencoders
There are other ways to constrain the reconstruction of an autoencoder than to impose a hidden layer of smaller dimension than the input. Regularized autoencoders use a loss function that encourages the model to have other properties besides copying the input to the output. We can generally find two types of regularized autoencoder: the denoising autoencoder and the sparse autoencoder.
Denoising autoencoder
One way we can modify the autoencoder to learn useful features is by changing the inputs: we add random noise to the input and ask the network to recover the original form by removing the noise. This prevents the autoencoder from simply copying the data from input to output, because the input contains random noise; we ask it to subtract the noise and produce the meaningful underlying data. This is called a denoising autoencoder.
In the above diagram, the first row contains original images. We can see in the second row that random noise has been added to the original images; this noise is called Gaussian noise. The autoencoder does not get the original images as input, but it is trained in such a way that it removes the noise and generates the original images.
The only difference between implementing the denoising autoencoder and the normal autoencoder is the change in input data; the rest of the implementation is the same for both autoencoders. Below is the difference in how the autoencoder is trained.
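The training snippet itself is missing from these notes; a hedged sketch of the difference might look as follows, assuming x_train and a compiled autoencoder model already exist (the noise factor is also an assumption):

import numpy as np

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)

# Normal autoencoder:    autoencoder.fit(x_train, x_train, ...)
# Denoising autoencoder: noisy inputs, clean targets
autoencoder.fit(x_train_noisy, x_train, epochs=10, batch_size=256)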
Sparse autoencoder
In a sparse autoencoder, a sparsity penalty (here an L1 penalty on the hidden activations) is added to the loss so that only a few hidden units are strongly active for any given input.
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

input_size = 256
hidden_size = 32
output_size = 256

l1 = Input(shape=(input_size,))
# Encoder with an L1 activity regularizer (the sparsity penalty)
h1 = Dense(hidden_size, activity_regularizer=regularizers.l1(10e-6), activation='relu')(l1)
# Decoder
l2 = Dense(output_size, activation='sigmoid')(h1)

autoencoder = Model(inputs=l1, outputs=l2)
autoencoder.compile(loss='mse', optimizer='adam')
In the above code, we have added L1 regularization to the hidden layer of the encoder, which adds a sparsity penalty to the loss function.
STOCHASTIC ENCODERS AND DECODERS:
A stochastic encoder maps an input to a probability distribution over the latent space rather than to a single point; a latent code is then obtained by sampling from that distribution.
The stochastic decoder acknowledges the uncertainty introduced by the stochastic encoder. It also produces a probability distribution, this time over the data space, which can be thought of as the likelihood of generating a particular data point given a point in the latent space.
By sampling from this distribution, you can produce different reconstructions of the same input.
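A minimal sketch of this sampling step (using the reparameterization trick commonly paired with stochastic encoders) is given below; encode_fn and decode_fn are placeholders for trained networks and are not from these notes:

import numpy as np

def sample_reconstruction(x, encode_fn, decode_fn):
    mu, logvar = encode_fn(x)                   # encoder outputs a distribution over latents
    eps = np.random.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps         # sample z ~ N(mu, sigma^2)
    return decode_fn(z)                         # one possible reconstruction of x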
CONTRACTIVE AUTOENCODER:
A contractive autoencoder (CAE) adds a penalty term to the reconstruction loss: λ‖J_f(x)‖²_F, the squared Frobenius norm of the Jacobian of the encoder's hidden activations with respect to the input. The Frobenius norm is just a generalization of the Euclidean norm to matrices.
To evaluate this penalty term, we first need to calculate the Jacobian matrix of the hidden layer. Calculating the Jacobian of the hidden layer with respect to the input is similar to a gradient calculation. Let's first calculate the Jacobian of the hidden layer:
hj = φ(Wj · x + bj), so ∂hj/∂xi = φ'(Wj · x + bj) · Wji
where φ is the non-linearity. To get the jth hidden unit, we take the dot product of the input feature vector and the corresponding weights, and to differentiate it we apply the chain rule.
The above method is similar to how we calculate a gradient, but there is one major difference: we take h(X) as a vector-valued function, with each hidden unit as a separate output. Intuitively, if for example we have 64 hidden units, then we have 64 function outputs, and so we will have a gradient vector for each of those 64 hidden units; stacked together they form the Jacobian.
Let diag(x) denote a diagonal matrix. Written in matrix form, the above derivative becomes J = diag(φ'(Wx + b)) · W, which for a sigmoid non-linearity is J = diag(h(1 − h)) · W.
Now, we substitute this diag form into the penalty term and simplify:
‖J_f(x)‖²_F = Σj (hj(1 − hj))² Σi Wji²
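A small NumPy sketch of the resulting penalty for a sigmoid hidden layer is shown below; the weight shapes and the value of lambda are illustrative:

import numpy as np

def contractive_penalty(x, W, b, lam=1e-4):
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))             # sigmoid hidden activations
    dh2 = (h * (1 - h)) ** 2                           # squared diagonal of diag(h(1-h))
    return lam * np.sum(dh2 * np.sum(W ** 2, axis=1))  # lambda * ||J_f(x)||_F^2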
Relationship with Sparse Autoencoder
In a sparse autoencoder, our goal is to have the majority of the components of the representation close to 0. For this to happen, they must lie in the left saturated part of the sigmoid function, where their corresponding sigmoid value is close to 0 with a very small first derivative, which in turn leads to very small entries in the Jacobian matrix. This leads to a highly contractive mapping in the sparse autoencoder, even though this is not the goal in the sparse autoencoder.
Relationship with Denoising Autoencoder
The idea behind the denoising autoencoder is to increase the robustness of the encoder to small changes in the training data, which is quite similar to the motivation of the contractive autoencoder. However, there are some differences:
CAEs encourage robustness of the representation f(x), whereas DAEs encourage robustness of the reconstruction, which only partially increases the robustness of the representation.
A DAE increases its robustness stochastically, by training the model on corrupted inputs for reconstruction, whereas a CAE increases its robustness analytically, by penalising the first derivative, i.e. the Jacobian matrix.