NNDL-unit 3
Third-Generation Neural Networks
Syllabus
Spiking Neural Networks - Convolutional Neural Networks - Deep Learning Neural Networks - Extreme Learning Machine Model - Convolutional Neural Networks : The Convolution Operation - Motivation - Pooling - Variants of the Basic Convolution Function - Structured Outputs - Data Types - Efficient Convolution Algorithms - Neuroscientific Basis - Applications : Computer Vision, Image Generation, Image Compression.
Contents
3.1 Spiking Neural Networks
3.2 Convolutional Neural Networks
3.3 Extreme Learning Machine Model
3.4 The Convolution Operation
3.5 Pooling
3.6 Variants of the Basic Convolution Function
3.7 Structured Outputs
3.8 Data Types
3.9 Efficient Convolution Algorithms
3.10 Neuroscientific Basis
3.11 Applications : Computer Vision
3.12 Two Marks Questions with Answers
• Spiking neural networks are a type of neural network that can simulate the firing of neurons in the brain. These networks are designed to better model how the brain works, and they have the potential to be more efficient and powerful than traditional neural networks.
• Spiking neural networks are made up of neurons that fire in response to input. The strength of the input determines the rate at which the neuron fires, and the pattern of firing can be used to encode information.
2. Supervised Learning
• In supervised learning, data (the input) is accompanied by labels (the targets) and the
learning device's purpose is to correlate (classes of) inputs with the target outputs.
• An error signal is computed between the target and the actual output and utilized to update
the network's weights.
• Supervised learning allows us to use the targets to directly update parameters, whereas reinforcement learning just provides us with a generic error signal ("reward") that reflects how well the system is functioning. In practice, the line between supervised and reinforcement learning is blurred.
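The contrast above can be sketched with a toy model. In the supervised case, the label-derived error signal drives the weight update directly; all names below are illustrative and not from the text.

```python
import numpy as np

# Hypothetical single-layer model: y_hat = w . x
# Supervised learning: an explicit error signal (output - target)
# directly drives the weight update.
rng = np.random.default_rng(0)
w = rng.normal(size=3)

def supervised_step(w, x, target, lr=0.1):
    """One gradient step on the squared error between output and label."""
    y_hat = w @ x                 # actual output
    error = y_hat - target        # error signal derived from the label
    grad = error * x              # gradient of 0.5 * error**2 w.r.t. w
    return w - lr * grad

x = np.array([1.0, 2.0, -1.0])
target = 0.5
for _ in range(100):
    w = supervised_step(w, x, target)

# The output converges toward the label the error signal pointed at
assert abs(w @ x - target) < 1e-3
```

In reinforcement learning there would be no `target` to subtract; only a scalar reward would indicate how well the system is doing.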
• The goal of CNN is to reduce the images so that they are easier to process, without losing features that are valuable for accurate prediction.
2. Disadvantages :
• Adversarial attacks are cases of feeding the network 'bad' examples to cause misclassification.
• CNN requires a lot of training data.
• CNNs tend to be much slower because of operations like maxpool.
• Object detection : Self-driving cars, AI-powered surveillance systems and smart homes often use CNN to be able to identify and mark objects. CNN can identify objects in photos and in real time, classify and label them.
• Voice synthesis : Google Assistant's voice synthesizer uses DeepMind's WaveNet ConvNet.
• A convolutional neural network, as discussed above, has the following layers that are useful
for various deep learning algorithms. Let us see the working of these layers taking an
example of the image having dimension of 12 x 12 x 4. These are:
1. Input layer : This layer will accept the image of width 12, height 12 and depth 4.
2. Convolution layer : It computes the output volume by taking the dot product between each filter and the corresponding image patch. For example, if there are 10 filters, the volume will be computed as 12 x 12 x 10.
3. Activation function layer : This layer applies activation function to each element in the
output of the convolutional layer. Some of the well accepted activation functions are
ReLU, Sigmoid, Tanh, Leaky ReLU, etc. These functions will not change the volume obtained at the convolutional layer, and hence it will remain equal to 12 x 12 x 10.
4. Pool layer : This layer mainly reduces the volume of the intermediate outputs, which speeds up computation of the network model and helps prevent overfitting.
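The volumes in the 12 x 12 x 4 example can be tracked with a small shape-bookkeeping sketch. It assumes 'same' padding in the convolution layer (so the spatial extent stays 12 x 12, matching the 12 x 12 x 10 volume above) and a 2 x 2 pool with stride 2.

```python
import numpy as np

# Shape bookkeeping for the 12 x 12 x 4 example in the text
# (assumes 'same' padding so convolution keeps the 12 x 12 extent,
# and a 2 x 2 pool with stride 2).
image = np.zeros((12, 12, 4))              # input layer: width 12, height 12, depth 4
n_filters = 10

conv_out = np.zeros((12, 12, n_filters))   # convolution layer: 12 x 12 x 10
relu_out = np.maximum(conv_out, 0)         # activation layer: volume unchanged

# 2 x 2 max pooling halves each spatial dimension
pooled = relu_out.reshape(6, 2, 6, 2, n_filters).max(axis=(1, 3))

assert relu_out.shape == (12, 12, 10)      # activation keeps the volume
assert pooled.shape == (6, 6, 10)          # pooling shrinks only the spatial extent
```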
• ELM is a Single-hidden Layer Feedforward Neural Network (SLFN), which converges much faster than traditional methods.
• ELM converges much faster than traditional algorithms because it learns without iteration.
• ELM assigns random values to the weights between the input and hidden layer and the biases in the hidden layer, and these parameters are frozen during training. Fig. 3.3.1 shows the architecture of the ELM.
[Fig. 3.3.1 : Architecture of the ELM]
• ELM is a single-hidden layer feed-forward network with three parts : input neurons, hidden neurons and output neurons.
• In particular, h(x) = [h1(x), ..., hL(x)] is the nonlinear feature mapping of ELM, with the form hj(x) = g(wj . x + bj), and βj = [βj1, ..., βjm]T, j = 1, ..., L are the output weights between the jth hidden node and the output nodes.
• The basic training of ELM can be regarded as two steps : random initialization and linear parameter solution.
1. Firstly, ELM uses random parameters wi and bi in its hidden layer, and they are frozen during the whole training process. The input vector is mapped into a random feature space with random settings and nonlinear activation functions, which is more efficient than tuning these parameters iteratively.
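The two training steps above can be sketched in NumPy: a random frozen hidden layer, then a linear least-squares solve for the output weights β. The function names and the choice of tanh as the activation g are illustrative assumptions.

```python
import numpy as np

# Minimal ELM sketch (illustrative names): random frozen hidden layer,
# then a linear least-squares solve for the output weights beta.
rng = np.random.default_rng(0)

def elm_fit(X, T, L=50):
    """X: (N, d) inputs, T: (N, m) targets, L: number of hidden nodes."""
    d = X.shape[1]
    W = rng.normal(size=(d, L))       # random input-to-hidden weights (frozen)
    b = rng.normal(size=L)            # random hidden biases (frozen)
    H = np.tanh(X @ W + b)            # nonlinear random feature map h(x)
    beta = np.linalg.pinv(H) @ T      # output weights via Moore-Penrose pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy regression: fit y = sin(x) with no iterative training at all
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_fit(X, T, L=50)
err = np.max(np.abs(elm_predict(X, W, b, beta) - T))
assert err < 0.1
```

The only learned quantity is `beta`; everything before it is random and frozen, which is why ELM training reduces to a single linear solve.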
• The input to a convolution can be raw data or the feature map output from another convolution. The kernel is often interpreted as a filter that filters the input data for certain kinds of information.
• Sometimes a 5 x 5 or a 7 x 7 matrix is used as a feature detector. The feature detector is often referred to as a "kernel" or a "filter". At each step, the kernel is multiplied by the input data values within its bounds, creating a single entry in the output feature map.
[Figure : A 3 x 3 kernel sliding over the input data; at each position the kernel produces one entry of the convoluted feature map]
• Generally, an image can be considered as a matrix whose elements are numbers between 0 and 255. The size of the image matrix is : image height * image width * number of image channels.
• A grayscale image has 1 channel, while a colour image has 3 channels.
• Kernel : A kernel is a small matrix of numbers that is used in image convolutions. Differently sized kernels containing different patterns of numbers produce different results under convolution. The size of a kernel is arbitrary, but 3 x 3 is often used. Fig. 3.4.2 shows an example of a kernel :
0 1 0
1 1 1
0 1 0
• Convolutional layers perform transformations on the input data volume that are a function of
the activations in the input volume and the parameters.
• In reality, convolutional neural networks develop multiple feature detectors and use them to
develop several feature maps which are referred to as convolutional layers and it is _shown in
Fig. 3.4.3.
[Fig. 3.4.3 : Feature detectors applied across the input image produce the feature maps of a convolutional layer]
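The sliding-kernel computation described above can be written out as a plain (valid) 2-D convolution, here in the cross-correlation form CNN libraries use. The 3 x 3 kernel is the example from Fig. 3.4.2; the 5 x 5 image is an assumed toy input.

```python
import numpy as np

# Sliding a 3 x 3 kernel over a small grayscale image (valid convolution,
# i.e. the cross-correlation form used in CNN libraries).
def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output entry is the kernel times the patch under it
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

kernel = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]])          # the example kernel from Fig. 3.4.2
image = np.arange(25).reshape(5, 5)     # toy 5 x 5 "image"
feature_map = conv2d(image, kernel)

assert feature_map.shape == (3, 3)      # 5 - 3 + 1 in each dimension
assert feature_map[0, 0] == 30          # 1 + (5 + 6 + 7) + 11 for the first patch
```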
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
• The user can reduce the number of parameters by making an assumption that if one feature is useful to compute at some spatial position (x, y), then it is also useful to compute at a different position (x2, y2).
• In other words, denoting a single 2-dimensional slice of depth as a depth slice : during back-propagation, every neuron in the network will compute the gradient for its weights, but these gradients will be added up across each depth slice, and only a single set of weights per slice is updated.
• If all neurons in a single depth slice are using the same weight vector, then the forward pass of the convolutional layer can be computed in each depth slice as a convolution of the neuron's weights with the input volume. This is the reason why it is common to refer to the sets of weights as a filter (or a kernel) that is convolved with the input.
• Fig. 3.4.4 shows that convolution shares the same parameters across all spatial locations.
• Example of equivariance : With 2D images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, the representation will move the same amount in the output. This is useful to detect edges in the first layer of a convolutional network. The same edges appear everywhere in the image, so it is practical to share parameters across the entire image.
Padding
• Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size. Zero padding helps to make output dimensions and kernel size independent.
• One observation is that the convolution operation reduces the size of the (q + 1)th layer in comparison with the size of the qth layer. This type of reduction in size is not desirable in general, because it tends to lose some information along the borders of the image. This problem can be resolved by using padding.
• Three common zero padding strategies are :
a) Valid convolution : The extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the entire kernel is contained entirely within the input. For a kernel of size k in any dimension, an input of size m shrinks to m - k + 1 in that dimension of the output. This shrinkage restricts architecture depth.
b) Same convolution : Just enough zero-padding is added to keep the size of the output equal to the size of the input. Essentially, for a dimension where the kernel size is k, the input is padded by k - 1 zeros in that dimension.
c) Full convolution : The other extreme case, where enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k - 1.
• The 1D block is composed of a configurable number of filters, where the filter has a set size; a convolution operation is performed between the vector and the filter, producing as output a new vector with as many channels as the number of filters. Every value in the tensor is then fed through an activation function to introduce nonlinearity.
• When padding is not used, the resulting operation is also referred to as valid padding. Valid padding generally does not work well from an experimental point of view. In the case of valid padding, the contributions of the pixels on the borders of the layer will be under-represented compared to the central pixels in the next hidden layer, which is undesirable.
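The three strategies give three different output widths. A small sketch (stride 1 assumed) that can be cross-checked against NumPy's 1-D convolution modes:

```python
import numpy as np

# Output width for the three padding strategies, for input size m
# and kernel size k (stride 1), as described above.
def output_size(m, k, mode):
    if mode == "valid":   # no padding: shrink to m - k + 1
        return m - k + 1
    if mode == "same":    # pad with k - 1 zeros in total: output equals input
        return m
    if mode == "full":    # every pixel visited k times: m + k - 1
        return m + k - 1
    raise ValueError(mode)

assert output_size(12, 3, "valid") == 10
assert output_size(12, 3, "same") == 12
assert output_size(12, 3, "full") == 14

# Cross-check against numpy's 1-D convolution modes
x = np.ones(12)
w = np.ones(3)
assert len(np.convolve(x, w, mode="valid")) == 10
assert len(np.convolve(x, w, mode="same")) == 12
assert len(np.convolve(x, w, mode="full")) == 14
```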
Stride
• Convolution functions used in practice differ slightly compared to the convolution operation as it is usually understood in the mathematical literature.
• In general, a convolution layer consists of the application of several different kernels to the input. This allows the extraction of several different features at all locations in the input. This means that in each layer a single kernel is not applied; multiple kernels are used as different feature detectors.
• The input is generally not real-valued but instead vector valued. Multi-channel convolutions are commutative only if the number of output and input channels is the same.
• In order to allow for calculation of features at a coarser level, strided convolutions can be used. The effect of a strided convolution is the same as that of a convolution followed by a downsampling stage. This can be used to reduce the representation size.
,I
' r.
and vertically over the . '/:
• The stride indicates the pace by which the filter move·s horizontally
convolution.
pixels of the input image during convolution . Fig. 3.4.5 shows stride during
I
Stride= 1
i
I.
# =
amount of movement
• Stride is a parameter of the neural network's filter that modifies the
images and video
over the image or video. Stride is a component for the compression of
move one pixel or
data. For example, if a neural network's stride is set to 1, the filter will
unit, at a time. If stride= I, the filter will move one pixel.
smaller stride size if we
• Stride depends on what we expect in our output image. We prefer a
hand, if we are only
expect several fine-grained features to reflect in our output. On the other
interested in the macro-level of features, we choose a larger stride size.
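The equivalence noted earlier, that a strided convolution gives the same result as a dense convolution followed by downsampling, can be checked directly in 1-D (toy data, illustrative names):

```python
import numpy as np

# A strided convolution is equivalent to a stride-1 convolution
# followed by downsampling (1-D sketch).
def conv1d_strided(x, w, stride):
    k = len(w)
    return np.array([x[i:i+k] @ w
                     for i in range(0, len(x) - k + 1, stride)])

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w = rng.normal(size=3)

strided = conv1d_strided(x, w, stride=2)   # compute every other position
full = conv1d_strided(x, w, stride=1)      # compute all positions...
assert np.allclose(strided, full[::2])     # ...then keep every other one
```

The strided version simply never computes the outputs that downsampling would throw away, which is where the efficiency gain comes from.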
• The pooling layer operates upon each feature map separately to create a new set of the same
number of pooled feature maps. Pooling involves selecting a pooling operation, much like a
filter to be applied to feature maps.
• The size of the pooling operation or filter is smaller than the size of the feature map. In the common case of 2 x 2 pooling applied with a stride of 2, the pooling layer reduces each dimension of the feature map by a factor of 2, reducing the number of pixels or values in each feature map to one quarter the size.
• For example, a pooling layer applied to a feature map of 6 x 6 (36 pixels) will result in an output pooled feature map of 3 x 3 (9 pixels). The pooling operation is specified, rather than learned.
• The pooling operation, also called subsampling, is used to reduce the dimensionality of
feature maps from the convolution operation. Max pooling and average pooling are the most
common pooling operations used in the CNN.
• Pooling units are obtained using functions like max-pooling, average pooling and even L2-norm pooling. At the pooling layer, forward propagation results in an N x N pooling block being reduced to a single value - the value of the "winning unit". Back-propagation of the pooling layer then computes the error which is acquired by this single value "winning unit".
• Pooling layers, also known as downsampling, conduct dimensionality reduction, reducing the number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel applies an aggregation function to the values within the receptive field, populating the output array. There are two main types of pooling :
• Max pooling : As the filter moves across the input, it selects the pixel with the maximum value to send to the output array. As an aside, this approach tends to be used more often compared to average pooling.
• Average pooling : As the filter moves across the input, it calculates the average value within the receptive field to send to the output array.
• Invariance to local translation can be useful if we care more about whether a certain feature is present rather than exactly where it is.
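Max and average pooling over 2 x 2 blocks of a 6 x 6 feature map, matching the 36-to-9 example above (a sketch, assuming stride 2 and non-overlapping blocks):

```python
import numpy as np

# 2 x 2 max and average pooling with stride 2 over a 6 x 6 feature map:
# 36 values reduce to 9, as in the example in the text.
def pool2x2(fmap, op):
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)   # group into 2 x 2 blocks
    return op(blocks, axis=(1, 3))                # aggregate each block

fmap = np.arange(36, dtype=float).reshape(6, 6)
max_pooled = pool2x2(fmap, np.max)
avg_pooled = pool2x2(fmap, np.mean)

assert max_pooled.shape == (3, 3)
assert max_pooled[0, 0] == 7.0    # max of the top-left block {0, 1, 6, 7}
assert avg_pooled[0, 0] == 3.5    # average of the same block
```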
Variants of the Basic Convolution Function
• Various types of convolution functions are as follows :
1. Stride
2. Zero padding
3. Unshared convolution
4. Partial connectivity between channels
5. Tiled convolution
• Transposed convolutions : These types of convolutions are also known as deconvolutions or fractionally strided convolutions. A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.
• Fig. 3.6.1 shows how a transposed convolution with a 2 x 2 kernel is computed for a 2 x 2 input tensor.
[Fig. 3.6.1 : Transposed convolution of a 2 x 2 input with a 2 x 2 kernel, producing the 3 x 3 output
0 0 1
0 4 6
4 12 9]
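The Fig. 3.6.1 computation can be sketched directly: each input element scales the kernel, and the scaled copies are summed into the output at shifted positions. The specific input and kernel values [[0, 1], [2, 3]] are an assumption, chosen so that the 3 x 3 output matches the values shown in the figure.

```python
import numpy as np

# Transposed convolution of a 2 x 2 input with a 2 x 2 kernel: each input
# element scales the kernel, and the scaled copies are summed into a
# 3 x 3 output at shifted positions.
def trans_conv(X, K):
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i+h, j:j+w] += X[i, j] * K   # paste a scaled kernel copy
    return Y

X = np.array([[0.0, 1.0],
              [2.0, 3.0]])      # assumed input values
K = np.array([[0.0, 1.0],
              [2.0, 3.0]])      # assumed kernel values
Y = trans_conv(X, K)

# Output matches the matrix shown in Fig. 3.6.1
assert Y.tolist() == [[0, 0, 1], [0, 4, 6], [4, 12, 9]]
```

Note how the 2 x 2 input grows to a 3 x 3 output (m + k - 1), the reverse of the shrinking a regular valid convolution performs.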
3 - 18 Thir d-G ene ratio n Neu ral Networks
g
Neu ral Net work s and Dee p LeR rnin --- ---
[Fig. 3.6.2 : Convolution with a dilated filter where the dilation factor is d = 2 (panels a - d)]
Structured Outputs
• The kernel tensor V is used to produce an estimate of the labels given the hidden values.
• On all but the first step, the kernels W are convolved over Y to provide input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network.
Data Types
• The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time.
• One advantage of convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix multiplication-based neural networks. This provides a compelling reason to use convolutional networks even when computational cost and overfitting are not significant issues.
• For example, consider a collection of images, where each image has a different width and height. It is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply; the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution operation scales accordingly.
• Convolution may be viewed as matrix multiplication; the same convolution kernel induces a different size of doubly block circulant matrix for each size of input. Sometimes the output of the network is allowed to have variable size as well as the input, for example if we want to assign a class label to each pixel of the input.
• Data sizes and types - number vs. volume : for example, 1 billion tweets at about 140 bytes per tweet is roughly 130 GB.
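The variable-size point can be demonstrated directly: the same kernel applies to images of different sizes, and only the output shape changes (toy zero images and an illustrative helper name):

```python
import numpy as np

# The same 3 x 3 kernel applies to images of any size; the output
# simply scales with the input, unlike a fixed-size weight matrix.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(ow)] for i in range(oh)])

kernel = np.ones((3, 3))        # one set of parameters...
small = np.zeros((8, 10))       # ...applied to two images
large = np.zeros((20, 15))      # with different widths and heights

assert conv2d(small, kernel).shape == (6, 8)     # m - k + 1 per dimension
assert conv2d(large, kernel).shape == (18, 13)
```

A fully connected layer would need a separate weight matrix per input size; here the parameter count is fixed at 3 x 3 regardless of the image.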
• Neurophysiologists David Hubel and Torsten Wiesel collaborated for several years to determine many of the most basic facts about how the mammalian vision system works. Their accomplishments were eventually recognized with a Nobel prize. Their work helped to characterize many aspects of brain function.
• In this simplified view, we focus on a part of the brain called V1, also known as the primary visual cortex. V1 is the first area of the brain that begins to perform significantly advanced processing of visual input.
• In the cartoon view, images are formed by light arriving in the eye and stimulating the retina, the light-sensitive tissue in the back of the eye. The neurons in the retina perform some simple preprocessing of the image but do not substantially alter the way it is represented.
• The image then passes through the optic nerve and a brain region called the lateral geniculate nucleus. A convolutional network layer is designed to capture three properties of V1 :
1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the structure of the image in the retina.
2. V1 contains many simple cells. A simple cell's activity can to some extent be characterized by a linear function of the image in a small, spatially localized receptive field. The detector units of a convolutional network are designed to emulate these properties of simple cells.
3. V1 also contains many complex cells. These cells respond to features that are similar to those detected by simple cells, but complex cells are invariant to small shifts in the position of the feature. This inspires the pooling units of convolutional networks.
[Fig. 3.11.1 : Image compression framework - the original image is encoded by ComCNN (image encoder) into a compact representation, transmitted through the channel Co(·), and reconstructed by RecCNN (image decoder) to produce the decoded image]
• Since the first convolutional layer uses a stride of two, the image size is reduced by half.
• RecCNN uses twenty neural network layers. Apart from the first and the last layer, each layer in this formation carries out convolutional and batch normalization operations.
• The network was trained using 400 grayscale images and 50 epochs. The SSIM and PSNR metrics of these images are better than JPEG images.
• In lossy image compression techniques, artifacts of image compression algorithms are visible in images. An example of such artifacts is visible on images for which tiling was used for quantization. In such images, these tile boundaries continue to remain in the images.
Ans. :
• Fewer parameters : Reduces the memory requirements and improves its statistical efficiency.
• Computing the output requires fewer operations.
Q.5 Would sparse interactions cause reduction in performance of convolutional networks ?
Ans. : Not really, since we have deep layers. Even though direct connections in a convolutional network are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
Ans. :
• It offers a compromise between a convolutional layer and a locally connected layer.
• Memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels.
Q.9 What is a convolution ?
Ans. : Convolution is an orderly procedure where two sources of information are intertwined;
it's an operation that changes a function into something else.
Q.10 Which are four main operations in a CNN ?
Ans. : Four main operations in a CNN are Convolution, Non-Linearity (ReLU), Pooling or Sub-sampling, and Classification (Fully connected layer).
Q.11 What is full convolution ?
Ans. : Full convolution applies the maximum possible padding to the input feature maps before convolution. The maximum possible padding is the one where at least one valid input value is involved in all convolution cases.
Q.12 What is an autoencoder ?
Ans. : Autoencoders are neural networks that can learn to compress and reconstruct input data,
such as images, using a hidden layer of neurons. An autoencoder model consists of two parts : an
encoder and a decoder.
Q.13 What is the aim of an autoencoder ?
□□□