CH 4

Supervised learning: learn from a labelled training dataset (n × m: n samples, m features; e.g. n = 10 samples, D = 4 dimensions).
- Classification: discrete value output
- Regression: predict a real value output

Unsupervised learning: no labels.
- Clustering: find the structure of the dataset

The training dataset is fed to a learning algorithm, which distils it into a smaller model (the sketch contrasts a 10 GB training set with a 5 GB learned model).

The learning algorithm outputs a hypothesis h: a function mapping input X to output Y.

Example (house prices in Thane; prices in lakhs):

Size (X)   Price (Y)
525        25 L
1000       40 L
750        ?

Plotting price (Y) against size (X), many different hypotheses h1, h2, h3 can fit the same data, from a straight line to a multivariate fit such as 3x1 + 2x2 = Y.

• Y = h(x) = Θ0 + Θ1*x: the linear regression equation for the univariate case (multivariate regression adds more features).
• Θ0, Θ1 are the parameters; e.g. (Θ0, Θ1) = (2, 0), (0, 1), (2, 1) give three different lines.
• Choose Θ0, Θ1 to minimize J(Θ0, Θ1), the cost function (squared error function):
  J(Θ0, Θ1) = 1/(2m) Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Why squared error? Signed errors cancel out: 2 + (−2) + 4 + (−4) = 0, whereas the squared errors 4 + 4 + 16 + 16 = 40 expose the misfit.
Minimize J(Θ1): fix Θ0 = 0, so h(x) = Θ1*x, and try different Θ1 on the data:

X   Y
1   1
2   2
3   3

Case 1: Θ1 = 1 → predictions 1, 2, 3.
J(Θ1) = 1/(2m) * (0² + 0² + 0²) = 0

Case 2: Θ1 = 0.5 → predictions 0.5, 1, 1.5.
J(Θ1) = 1/(2*3) * ((0.5−1)² + (1−2)² + (1.5−3)²) = 3.5/6 ≈ 0.58

Case 3: Θ1 = −0.5 → predictions −0.5, −1, −1.5.
J(Θ1) = 1/(2*3) * ((−0.5−1)² + (−1−2)² + (−1.5−3)²) = 31.5/6 = 5.25
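A minimal sketch in plain Python (data and function names are mine) that reproduces the three cases:

```python
# Cost J(theta1) for h(x) = theta1 * x on the toy data above.
X = [1, 2, 3]
Y = [1, 2, 3]
m = len(X)

def cost(theta1):
    # J(theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)
    return sum((theta1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)

print(cost(1.0))   # Case 1: 0.0
print(cost(0.5))   # Case 2: ~0.583
print(cost(-0.5))  # Case 3: 5.25
```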


Plotting J(Θ1) against Θ1 gives a bowl-shaped (convex) function with a single minimum. Minimizing J(Θ1) is an optimization problem with J as the objective function; the derivative dJ/dΘ1 (the slope over a small step dx) tells which direction reduces J, and stepping in that direction gives faster convergence. With two parameters, J(Θ0, Θ1) is a bowl-shaped surface over the (Θ0, Θ1) plane. Gradient descent is the algorithm that minimizes J.

1. Start with some values of Θ0, Θ1 (random initialization of the variables).

2. Update Θ to reduce J; repeat until convergence.

3. Update rule, for j = 0 and 1: Θj := Θj − α * ∂J/∂Θj, where α is the learning rate.

The partial derivative plays the role of a slope: for a line Y = mx + c, m = (y2 − y1)/(x2 − x1); the gradient tells how steeply J changes with Θ, so we step the opposite way.
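A small sketch of these three steps on the toy data (plain Python; α and the iteration count are illustrative choices):

```python
# Gradient descent for univariate linear regression h(x) = theta0 + theta1*x.
X = [1, 2, 3]
Y = [1, 2, 3]
m = len(X)
theta0, theta1 = 0.0, 0.0   # step 1: initialization
alpha = 0.1                 # learning rate

for _ in range(1000):       # step 2: update until (approximate) convergence
    # step 3: partial derivatives of J = 1/(2m) * sum((h(x) - y)^2)
    d0 = sum(theta0 + theta1 * x - y for x, y in zip(X, Y)) / m
    d1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(X, Y)) / m
    theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1

print(theta0, theta1)       # approaches (0, 1) for this data
```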
Research notes:
• ML vs DL/RL literature.
• Sources: IEEE, ScienceDirect (Elsevier), ACM, Springer.
• Read research papers from 2016 onward for the state of the art.
• Find the disadvantage of an existing method X; fixing it gives an improved X1.
• Combining improved components X1 + Y1 yields a hybrid method X1Y1.
h(x) = Θ0 + Θ1*x is linear; polynomial (e.g. quadratic) regression extends it with higher-order terms:

h(x) = Θ0 + Θ1*x + Θ2*x² + Θ3*x³ + …
Classical ML pipeline:

Raw data → Feature extraction → Feature space → Classifier → Output

- Normalization / feature scaling is applied to the extracted features.
- Redundant and non-sensitive features are removed by feature selection / transformation (PCA); selection methods (FR, GA, AN).

Example feature space:

Color    Size       Weight   Class
Orange   15-25 cm   100 gm   Mango
Green    100 cm     1 kg     Watermelon

Activity-recognition example: a sensor yields 100 values per minute, so 1 hr of data = 100 * 60 = 6000 values (6 min = 100 * 6 = 600). Splitting into 1-minute windows gives 60 samples; from each window extract features (Mean, SD, Var, MG3, MG4, RMS) and label the class: Bus, Bicycle, Train, Walking. Frequency-domain and time-frequency features can be added alongside these time-domain ones.

Deep learning pipeline:

Raw data → Deep learning method → Output

DL has the capability to extract the features on its own, learning the distinguishing features directly from raw data.
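A sketch of the windowed feature-extraction step (numpy; the signal is synthetic, and reading MG3/MG4 as third/fourth central moments is my assumption):

```python
import numpy as np

# 1 hr of a signal at 100 values/min -> 6000 values -> 60 one-minute windows.
rng = np.random.default_rng(0)
signal = rng.normal(size=100 * 60)
windows = signal.reshape(60, 100)

def features(w):
    mean, sd, var = w.mean(), w.std(), w.var()
    mg3 = np.mean((w - mean) ** 3)      # 3rd central moment (assumed MG3)
    mg4 = np.mean((w - mean) ** 4)      # 4th central moment (assumed MG4)
    rms = np.sqrt(np.mean(w ** 2))
    return [mean, sd, var, mg3, mg4, rms]

feature_space = np.array([features(w) for w in windows])
print(feature_space.shape)              # (60, 6): 60 samples x 6 features
```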
Feature scaling (normalization) helps minimize J(Θ): when features live on very different scales (e.g. 0 < x1 < 1 while image pixels x2 run 0-255), gradient descent needs many more iterations.

Min-max normalization (MinMaxScaler): x′ = (x − min(x)) / (max(x) − min(x)) maps each column to 0-1. A simpler variant divides by the column max: x′ = xi / max(x). For images, divide the 0-255 pixel values by 255.

(For linear regression, the parameters can also be found in closed form by ordinary least squares.)
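A short numpy sketch of both variants (the data is illustrative; sklearn's MinMaxScaler performs the same column-wise mapping):

```python
import numpy as np

# Min-max normalization: each column mapped to [0, 1].
X = np.array([[525.0, 2.0],
              [1000.0, 5.0],
              [750.0, 3.0]])
print((X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)))

# Simpler variant from the notes: divide by the column max.
print(X / X.max(axis=0))

# Images: pixels 0-255 -> 0-1 via pixels / 255.0
```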

Multivariate linear regression (features x1, x2, …):

h(x) = Θ0 + Θ1*x1 + Θ2*x2
h(x) = Θ0*x0 + Θ1*x1 + Θ2*x2 + … + Θn*xn, with x0 = 1

In vector form, Θ = [Θ0, Θ1, …, Θn]ᵀ and X = [1, x1, …, xn]ᵀ, so:

h(x) = ΘᵀX
Logistic regression: classification.

2-class classification (Y = 0 or 1): email spam / not spam, document classified / not classified, cancer yes / no, win / lose.

Thresholding a score at 0.5:
h(x) > 0.5 → Y = 1
h(x) ≤ 0.5 → Y = 0

Linear regression gives h(x) = ΘᵀX, which is unbounded, but we want 0 ≤ h(x) ≤ 1. Logistic regression therefore wraps it in the sigmoid function g:

h(x) = g(ΘᵀX), g(z) = 1 / (1 + e^(−z))

h(x) is the probability that Y = 1 for input x: p(y=1 | X, Θ), with p(y=0 | X, Θ) + p(y=1 | X, Θ) = 1.

Decision boundary example (plotting x2 against x1, both axes crossing at 5): predict Y = 1 when x1 + x2 ≥ 5 and Y = 0 when x1 + x2 < 5; the line x1 + x2 = 5 separates the two classes.
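A tiny sketch of the sigmoid and that boundary (numpy; the Θ values are my illustrative choice making ΘᵀX = x1 + x2 − 5):

```python
import numpy as np

def g(z):                             # sigmoid
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-5.0, 1.0, 1.0])    # [theta0, theta1, theta2]

for x1, x2 in [(1, 2), (3, 3), (4, 2)]:
    h = g(theta @ np.array([1.0, x1, x2]))   # h(x) = g(theta^T x)
    print((x1, x2), round(h, 3), "-> Y =", int(h > 0.5))
# (1, 2): h ~ 0.12 -> Y = 0; (3, 3) and (4, 2): h ~ 0.73 -> Y = 1
```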
Cost(h(x), y) = −log(h(x))      if y = 1
              = −log(1 − h(x))  if y = 0

If y = 1: cost = 0 when h(x) = 1, but as h(x) → 0, cost → ∞.
If y = 0: cost = 0 when h(x) = 0, but as h(x) → 1, cost → ∞.

Minimize J(Θ) with gradient descent, as before. (Plot the two cost curves at https://www.desmos.com/calculator to see their shapes.)
Multi-class (Y = 1, Y = 2, Y = 3): One-vs-All (OVA) trains one binary classifier per class, using logistic regression, NN (OVA), or SVM (OVA), and predicts the class with the max score; decision trees (DT) handle multiple classes directly.

Fit quality (for both linear and logistic regression):
- Underfit: high bias (model too simple).
- Just right.
- Overfit: high variance (model too complex).
Subset Selection: reduce the number of features
1) Manually
2) By algorithm: best subset selection, forward and backward selection

Shrinkage methods (regularization):
- Ridge (L2 norm)
- Lasso (L1 norm): least absolute shrinkage and selection operator
Both are special cases of a p-norm penalty.

Regularization example:

A quadratic fit Θ0 + Θ1x + Θ2x² may be just right, while Θ0 + Θ1x + Θ2x² + Θ3x³ + Θ4x⁴ overfits. Adding penalty terms + 1000Θ3² + 1000Θ4² to the cost drives Θ3 and Θ4 ≈ 0, effectively recovering the quadratic.

Related penalties: Elastic net (L1 and L2 combined), L21 norm. The regularization parameter (λ) controls how strongly the penalty shrinks the parameters.
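A closed-form ridge sketch (numpy; the data, λ, and the convention of leaving Θ0 unpenalized are my illustrative choices):

```python
import numpy as np

# Ridge: theta = (X^T X + lam*I)^(-1) X^T y, on noisy quadratic data
# fitted with a quartic model, as in the example above.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

X = np.vander(x, 5, increasing=True)   # columns: 1, x, x^2, x^3, x^4
lam = 10.0
I = np.eye(X.shape[1])
I[0, 0] = 0.0                          # leave theta0 unpenalized
theta = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
print(theta)                           # theta3, theta4 shrink toward 0
```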
Reference: Neural Networks and Learning Machines by Simon Haykin.

A neuron models input-output behaviour. In the McCulloch-Pitts model, a weighted sum of the inputs (plus a bias input x0 = 1) passes through an activation. With sigmoid g and weights of 10 on x1 and x2, the bias selects the logic function: bias −15 gives AND, bias −5 gives OR (plotting x2 vs x1, the Y = 1 and Y = 0 points of OR are linearly separable):

AND: h(x) = g(−15 + 10x1 + 10x2)
OR:  h(x) = g(−5 + 10x1 + 10x2)

x1  x2  AND            OR
0   0   g(−15) ≈ 0     g(−5) ≈ 0
0   1   g(−5) ≈ 0      g(5) ≈ 1
1   0   g(−5) ≈ 0      g(5) ≈ 1
1   1   g(5) ≈ 1       g(15) ≈ 1
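The same table computed in Python (a direct transcription of the weights above):

```python
import math

def g(z):                                  # sigmoid
    return 1.0 / (1.0 + math.exp(-z))

def AND(x1, x2):
    return g(-15 + 10 * x1 + 10 * x2)

def OR(x1, x2):
    return g(-5 + 10 * x1 + 10 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(AND(x1, x2), 3), round(OR(x1, x2), 3))
# g(-15) ~ 0.000, g(-5) ~ 0.007, g(5) ~ 0.993, g(15) ~ 1.000
```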
NN as a Directed Graph:
- Synaptic link: x → y = wx (linear).
- Activation link: x → y = g(wx) (non-linear).
- Summing junction: incoming signals yi (from xi) and yj (from xj) add: Y = yi + yj.
Computation graph and slopes (slope = height/width):

Y = 2x: x = 1 → y = 2; x = 1.001 → y = 2.002; x = 4 → y = 8; x = 4.001 → y = 8.002. Slope = 2 everywhere.

Y = x²: x = 1 → y = 1; x = 1.001 → y ≈ 1.002, so the slope at x = 1 is 2. x = 4 → y = 16; x = 4.001 → y ≈ 16.008, so the slope at x = 4 is 8. In general the slope is 2x.
J(a, b, c) = 3(a + bc). As a computation graph: u = bc, v = a + u, J = 3v.

dJ/dv = 3   (check: v = 11 → J = 33; v = 11.001 → J = 33.003)
dJ/du = dJ/dv * dv/du = 3 * 1 = 3
dJ/da = dJ/dv * dv/da = 3 * d(a + u)/da = 3 * 1 = 3
dJ/db = dJ/dv * dv/du * du/db = 3 * c
dJ/dc = 3 * b
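The same graph as code, forward pass then chain-rule backward pass:

```python
# J(a, b, c) = 3(a + bc) via the graph u = bc, v = a + u, J = 3v.
def forward(a, b, c):
    u = b * c
    v = a + u
    return 3 * v

def backward(a, b, c):
    dJ_dv = 3
    dJ_du = dJ_dv * 1      # dv/du = 1
    dJ_da = dJ_dv * 1      # dv/da = 1
    dJ_db = dJ_du * c      # du/db = c
    dJ_dc = dJ_du * b      # du/dc = b
    return dJ_da, dJ_db, dJ_dc

print(forward(5, 3, 2))    # v = 11, J = 33
print(backward(5, 3, 2))   # (3, 6, 9) = (3, 3c, 3b)
```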
A single logistic neuron: inputs x with weights w and bias b → sum → sigmoid → a = h(x) = g(ΘᵀX) → cost/loss:

z = wᵀx + b,  a = g(z),  L(a, y) = −y*log(a) − (1 − y)*log(1 − a)

da/dz = a(1 − a)    (sigmoid derivative: g′(z) = g(z)(1 − g(z)))
dL/da = −y/a + (1 − y)/(1 − a)
dL/dz = dL/da * da/dz = a − y
dL/dw1 = dL/da * da/dz * dz/dw1 = (a − y)*x1
dL/db = a − y

Activation functions: Sigmoid, Tanh, Threshold, ReLU (a = max(0, z)), Leaky ReLU (a = max(0.001z, z)), RBF.

A 2-layer network: input layer x1, x2, x3; hidden layer-1 computing Z1 and a1; output layer computing Z2 and a2 = a, with loss L(a2, y). Shapes:

X = a0: (3, 1)
W1: (3, 3), b1: (3, 1) → Z1: (3, 1), a1: (3, 1)
W2: (1, 3), b2: (1, 1) → Z2: (1, 1), a2: (1, 1)

Parameters: W1, b1, W2, b2
Cost function: J(W1, b1, W2, b2) = 1/m Σ L(a2, y)
Initialization: W = random(−1, 1) * 0.0001 (small random values), b = 0
Forward propagation:
Z1 = W1*X + b1
A1 = g(Z1)
Z2 = W2*A1 + b2
A2 = g(Z2)  (sigmoid)

Backward propagation:
dZ2 = dL/dZ2 = dL/dA2 * dA2/dZ2 = A2 − y
dW2 = dZ2 * A1ᵀ
db2 = dZ2
dZ1 = dL/dZ1 = (W2ᵀ*dZ2) * A1(1 − A1)
dW1 = dZ1 * Xᵀ
db1 = dZ1

Gradient descent updates:
W2 := W2 − α*dW2
b2 := b2 − α*db2
W1 := W1 − α*dW1
b1 := b1 − α*db1
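These equations as a runnable numpy sketch (single training example; the random input is illustrative):

```python
import numpy as np

def g(z):                                        # sigmoid
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X, y = rng.normal(size=(3, 1)), 1.0              # input a0: (3, 1)
W1 = rng.uniform(-1, 1, (3, 3)) * 0.0001         # small random init
b1 = np.zeros((3, 1))
W2 = rng.uniform(-1, 1, (1, 3)) * 0.0001
b2 = np.zeros((1, 1))
alpha = 0.1

# forward propagation
Z1 = W1 @ X + b1; A1 = g(Z1)
Z2 = W2 @ A1 + b2; A2 = g(Z2)

# backward propagation
dZ2 = A2 - y
dW2, db2 = dZ2 @ A1.T, dZ2
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
dW1, db1 = dZ1 @ X.T, dZ1

# gradient descent update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
print(float(A2))                                 # prediction before the update
```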
GA (genetic algorithm) for optimizing ELMs:

Genetic representation: each individual is an ELM(L, w, b): hidden-node count L, weights w, biases b (multi-dimensional).
Initial population: ELM1(L, w, b), ELM2(L, w, b), …, ELMn; population size typically 20, 50, or 100.
Fitness function: test RMSE (e.g. must beat 0.3), or ELM accuracy / F1-score. (An exhaustive grid search over the same space is far more time-consuming.)

For i = 0 to T, where T is preset (100, 50, or 20) or early stopping applies (e.g. after 5 generations without improvement):
- Selection function: tournament or roulette-wheel selection picks parents, e.g. A = ELM1, B = ELM5.
- Crossover operators:
  - Node-based: randomly pick nodes (index < L) from A and B; Child 1 = A + b(B), Child 2 = B + b(A).
  - Link-based: randomly choose weights w from A and B.
  - Arithmetic crossover.
- Mutation operators:
  - Node-based: delete a node or add a node.
  - Link-based: exchange a link.
- Evaluate each child's fitness (child RMSE); the new population is built from the initial population and the children.
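A generic GA skeleton matching this outline (the flat-vector individual, the stand-in fitness, and the operator details below are placeholders, not the ELM-specific operators):

```python
import random

DIM, POP, T = 8, 20, 50
random.seed(0)

def fitness(ind):                       # stand-in: replace with ELM test RMSE
    return sum(x * x for x in ind)      # smaller is better

def tournament(pop, k=3):               # tournament selection
    return min(random.sample(pop, k), key=fitness)

def crossover(a, b):                    # link-based: each weight from A or B
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(ind, rate=0.1):              # perturb a few links
    return [x + random.gauss(0, 0.1) if random.random() < rate else x
            for x in ind]

pop = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(POP)]
for _ in range(T):
    children = [mutate(crossover(tournament(pop), tournament(pop)))
                for _ in range(POP)]
    pop = sorted(pop + children, key=fitness)[:POP]   # keep the fittest

print(round(fitness(pop[0]), 4))        # best fitness after T generations
```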
One-hot encoding and the Softmax function.

Why? Encoding categorical data (B, C, T) as 1, 2, 3 imposes a false ordering: a prediction of 1.5, "between B and C", is meaningless.

One-hot encoding uses one binary variable per category:
B: 1, 0, 0
C: 0, 1, 0
T: 0, 0, 1

Softmax turns a score vector into probabilities summing to 1.0. Intuition with scores [1, 3, 2]: divide each by the total, i.e. 1/(1+3+2), 3/(1+3+2), 2/(1+3+2). The actual softmax does this with exponentials: softmax(zi) = e^zi / Σj e^zj.
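Both ideas in a few lines of numpy (category names and scores taken from the example above):

```python
import numpy as np

categories = ["B", "C", "T"]
one_hot = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}
print(one_hot["C"])                    # [0. 1. 0.]

z = np.array([1.0, 3.0, 2.0])
print(z / z.sum())                     # plain normalization: [0.167 0.5 0.333]

softmax = np.exp(z) / np.exp(z).sum()
print(softmax, softmax.sum())          # exponential version; sums to 1.0
```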
Deep Learning.

Why DNN? Forward and back propagation work exactly as in the shallow network; depth just stacks more layers.

Parameters: W, b for every layer (W1, b1, W2, b2, …).

Hyperparameters:
- Learning rate
- Epochs
- L: number of layers
- Neurons per layer
- Activation function

(Broad learning, 2018/2019, widens the network instead of deepening it.)
Convolutional Neural Network (CNN).

Edge detection uses a filter (kernel, mask), e.g. vertical and horizontal edge filters, convolved across the 2D image. Related notions: padding; stride = 1 or 2; max pooling to downsample.

Output size: convolving an m × n image with an a × b filter gives (m − a + 1) × (n − b + 1). For a colour input X: m × n × 3, each filter is a × b × 3, and each filter's response gets its own bias and activation: relu(WᵀX + b1), relu(WᵀX + b2), …

Cost: J = 1/m Σ L(y′, y).

Typical architecture: alternating Conv and Max-pool stages (each Conv producing a stack of feature maps, e.g. 3 × 3 × 6), then fully connected layers:

Input → Conv → Max → Conv → Max → Conv → Max → FC → FC → FC → Output

Example: a 28 × 28 × 3 input has 2352 values; six 5 × 5 filters give a 24 × 24 × 6 output, i.e. 3456 values. A fully connected layer between volumes of these sizes would need 2352 × 3456 ≈ 81 lakh weights, whereas the conv layer shares only about 5 × 5 × 6 = 150 weights. Convolution wins through:

- Parameter sharing: each filter's weights are reused at every image position.
- Sparsity of connections: each output value depends only on a small patch of the input.
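The size and weight counts above, checked in a few lines:

```python
# Conv output size and weight counts for the 28x28x3 example.
m, n, c = 28, 28, 3                 # input image
a, b, filters = 5, 5, 6             # six 5x5 filters

out_h, out_w = m - a + 1, n - b + 1
print(out_h, out_w, filters)        # 24 24 6

fc_weights = (m * n * c) * (out_h * out_w * filters)
print(fc_weights)                   # 2352 * 3456 = 8,128,512 (~81 lakh)

print(a * b * filters)              # shared conv weights: 150
```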

Pre-trained networks (VGG-16, VGG-19, ResNet) can be reused on new tasks via transfer learning instead of training from scratch.
Sequence models: NLP and other sequence tasks.

X → Y examples:
- Audio → person (speaker identification)
- Music → pattern; music generation
- Sentiment analysis; hate speech analysis; trends in shares
- Book/text → suicide-risk detection
- Paragraph → summary
- Syntax checking: "Mangoes am in VJTI…" → "are"

RNN input/output configurations: many-to-many (input length = output length), many-to-one, one-to-many, one-to-one.

Language model example: each vocabulary word (I, am, in, VJTI, …) becomes a one-hot vector, e.g. x0 = I → [1, 0, 0, …], x1 = am → [0, 1, 0, …]. The RNN predicts the next word step by step: P(? | I), then P(in | I, am), and so on.
The sequence loss sums the per-step losses, L(y′, y) = Σt L(y′⟨t⟩, y⟨t⟩), with cross-entropy L = −y*log(y′) − (1 − y)*log(1 − y′) at each step.

Vanishing gradient problem: across many unrolled steps the gradients shrink toward zero (or explode, producing NaN). Replacing the plain RNN cells (A → A → A → A) with LSTM/GRU cells addresses the vanishing gradient problem.
Performance measures:

Error rate: fraction of samples with y ≠ y′. Accuracy = 1 − error rate (e.g. accuracy 96.0% corresponds to error rate 4.0%).
Also: ROC analysis, kappa constant, R².

Confusion matrix (example counts):

                   Actual Y = 1        Actual Y = 0
Predicted Y′ = 1   True +ve (TP = 7)   False +ve (FP = 0)
Predicted Y′ = 0   False −ve (FN = 1)  True −ve (TN = 10)

Recall (= Sensitivity): of the actual positive class, what fraction did we correctly classify? TP / (TP + FN), range 0 to 1.
Precision: of the predicted positives, what fraction are actually positive? TP / (TP + FP), range 0 to 1.
Specificity = TN / (TN + FP).
F-Score = 2*P*R / (P + R).
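The metrics computed for the example matrix above:

```python
# Metrics for the confusion matrix above (TP=7, FP=0, FN=1, TN=10).
TP, FP, FN, TN = 7, 0, 1, 10

accuracy = (TP + TN) / (TP + FP + FN + TN)
error_rate = 1 - accuracy
recall = TP / (TP + FN)                   # sensitivity
precision = TP / (TP + FP)
specificity = TN / (TN + FP)
f_score = 2 * precision * recall / (precision + recall)

print(accuracy, error_rate)               # 0.944..., 0.055...
print(recall, precision, specificity)     # 0.875, 1.0, 1.0
print(f_score)                            # 0.933...
```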
