Classification (Discrete value output)
Regression (Predict real value output)
Clustering (Structure of the dataset)
Supervised Learning: training dataset (n x m) -> learning algorithm -> hypothesis h: X -> Y
Unsupervised Learning
Example: house prices in Thane
Size (X) | Price (Y)
525      | 25 L
1000     | 40 L
750      | ? (to be predicted)

Plot of price (Y) vs size (X): candidate hypotheses h1, h2, h3 fit the (X, Y) data, e.g. a multivariate form such as Y = 3X1 + 2X2.
Example parameter choices: (Θ0, Θ1) = (2, 0), (0, 1), (2, 1)
• h(x) = Θ0 + Θ1*x : linear regression equation (univariate); the multivariate form adds one term per feature
• Θ = parameters; choose Θ0, Θ1 to minimize J(Θ0, Θ1)
• Cost function J (squared error function): J(Θ0, Θ1) = 1/(2m) * Σ (h(x_i) - y_i)²
Why squared error? Raw errors can cancel out: 2 + (-2) + 4 + (-4) = 0, but the squared errors do not: 4 + 4 + 16 + 16 = 40.
Minimize J(Θ1)
Simplified hypothesis: h(x) = Θ1*x (i.e. Θ0 = 0)

Training data:
X  Y
1  1
2  2
3  3

Case 1: Θ1 = 1     J(Θ1) = 1/(2m) * (0² + 0² + 0²) = 0
Case 2: Θ1 = 0.5   J(Θ1) = 1/(2m) * ((0.5-1)² + (1-2)² + (1.5-3)²) = 3.5/6 ≈ 0.58
Case 3: Θ1 = -0.5  J(Θ1) = 1/(2m) * ((-0.5-1)² + (-1-2)² + (-1.5-3)²) = 31.5/6 = 5.25
Plot of J(Θ1) against Θ1: a bowl-shaped (convex) function with its minimum at Θ1 = 1.
Minimize J(Θ1), the objective function: this is an optimization problem, and the bowl shape gives faster convergence. (See the sketch below.)
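A minimal Python sketch (NumPy assumed) that recomputes J(Θ1) for the three cases above:

```python
# Recompute J(Θ1) for the simplified hypothesis h(x) = Θ1*x and the
# squared-error cost J = 1/(2m) Σ (h(x)-y)², using the toy data above.
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.0, 2.0, 3.0])
m = len(X)

def cost(theta1):
    # J(Θ1) = 1/(2m) * Σ (Θ1*x - y)²
    return np.sum((theta1 * X - Y) ** 2) / (2 * m)

for theta1 in (1.0, 0.5, -0.5):
    print(f"Θ1 = {theta1:5.2f}  ->  J(Θ1) = {cost(theta1):.4f}")
# Expected: 0.0, ~0.583, 5.25; the bowl-shaped curve has its minimum at Θ1 = 1.
```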
For two parameters, J(Θ0, Θ1) is a surface over the (Θ0, Θ1) plane.

Gradient descent (to minimize J):
1. Start with some values of Θ0, Θ1 (random initialization, or e.g. Θ0 = Θ1 = 0)
2. Repeatedly update Θ to reduce J until convergence:
   Θj := Θj - α * ∂J/∂Θj, where α is the learning rate
3. The derivative plays the role of a slope: for a line Y = mx + c, slope m = (y2 - y1)/(x2 - x1)
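A hedged sketch of batch gradient descent on the toy data above; the update rule Θj := Θj - α * ∂J/∂Θj is standard, while the specific α and iteration count are illustrative choices, not from the notes:

```python
# Batch gradient descent for h(x) = Θ0 + Θ1*x on the toy (X, Y) data.
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.0, 2.0, 3.0])
m = len(X)

theta0, theta1 = 0.0, 0.0   # step 1: initialize (here to 0; could be random)
alpha = 0.1                 # learning rate (illustrative)

for _ in range(500):        # step 2: update until (approximate) convergence
    h = theta0 + theta1 * X
    d_theta0 = np.sum(h - Y) / m          # ∂J/∂Θ0
    d_theta1 = np.sum((h - Y) * X) / m    # ∂J/∂Θ1
    theta0 -= alpha * d_theta0
    theta1 -= alpha * d_theta1

print(theta0, theta1)       # should approach Θ0 ≈ 0, Θ1 ≈ 1
```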
• Research directions: ML, DL/RL
• Literature sources: IEEE, ScienceDirect (Elsevier), ACM, Springer
• Research papers from 2016 onward represent the state of the art
• Approach 1: take an existing method X, identify a disadvantage, and propose an improved X1
• Approach 2: combine two methods, X1 + Y1 = hybrid X1Y1
h(x) = Θ0 + Θ1*x (linear)
Polynomial / quadratic regression adds higher-order terms, e.g. h(x) = Θ0 + Θ1x1 + Θ2x1² (quadratic) or up to Θ3x1³ (cubic).
Normalization / feature scaling
Classical pipeline: Raw data -> Feature extraction -> Feature space -> Feature selection / reduction (FR, GA, AN) -> classifier

Example feature table:
Color  | Size     | Weight | Class
Orange | 15-25 cm | 100 gm | Mango
Green  | 100 cm   | 1 Kg   | Watermelon
Example (activity recognition): 1 hr of raw sensor data, split into 1-minute windows -> 60 samples; at 100 values per minute, 100*60 = 6000 raw values. For each window compute features such as Mean, SD, Var, MG3, MG4, RMS (100*6 = 600), labelled with a class: Bus, Bicycle, Train, Walking. Features can also come from the frequency domain or the time-frequency domain.

Classical ML pipeline: Raw data -> Feature extraction -> Feature space -> remove redundant and non-sensitive features via Feature selection / Transformation (PCA) (FR, GA, AN) -> classifier -> Output

Deep learning pipeline: Raw data -> Deep learning method -> Output
Distinguishing feature of DL: the capability to extract features on its own.
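A hedged sketch of the per-window feature extraction described above; the window length and the random data are illustrative assumptions:

```python
# Per-window features (mean, SD, variance, RMS) from raw 1-D sensor data,
# following the "1 hr split into 1-minute windows" idea above.
import numpy as np

raw = np.random.randn(6000)            # e.g. 60 minutes * 100 values per minute
windows = raw.reshape(60, 100)         # one row per 1-minute window

features = np.column_stack([
    windows.mean(axis=1),                  # Mean
    windows.std(axis=1),                   # SD
    windows.var(axis=1),                   # Var
    np.sqrt((windows ** 2).mean(axis=1)),  # RMS
])
print(features.shape)                  # (60, 4): 60 samples, 4 features each
# These feature vectors, plus a class label such as Bus/Bicycle/Train/Walking,
# form the feature space fed to the classifier.
```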
Feature scaling and Min J(Θ): with features x1, x2 on very different scales, the contours of J are elongated and gradient descent needs many more iterations; with scaling, convergence is faster.
Min-max normalization (MinMaxScaler): scale each column to the 0-1 range, e.g. xi' = xi / max(x) per column, so 0 <= x1 <= 1 and 0 <= x2 <= 1.
Example: image pixel values in 0-255 are divided by 255.
(Ordinary least squares is the closed-form alternative to iterative minimization.)
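A small sketch of min-max normalization; the (x - min)/(max - min) form is used here, with division by max(x) alone being the special case where min(x) = 0 (e.g. pixels 0-255 divided by 255):

```python
# Min-max normalization: scale each column of X to the 0-1 range.
import numpy as np

X = np.array([[500.0, 2.0],
              [1000.0, 3.0],
              [750.0, 4.0]])

X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)   # every column now lies in [0, 1]
```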
h(x) = Θ0 + Θ1x1 + Θ2x2
h(x) = Θ0x0 + Θ1x1 + Θ2x2 + ... + Θnxn
Θ = [Θ0, Θ1, ..., Θn]ᵀ, X = [x0, x1, ..., xn]ᵀ with x0 = 1
h(x) = ΘᵀX
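A tiny sketch of the vectorized hypothesis h(x) = ΘᵀX with x0 = 1 prepended; the Θ values are illustrative:

```python
# Vectorized multivariate hypothesis h(x) = ΘᵀX.
import numpy as np

theta = np.array([1.0, 2.0, 3.0])      # Θ0, Θ1, Θ2 (illustrative values)
x = np.array([0.5, -1.0])              # raw features x1, x2
x = np.insert(x, 0, 1.0)               # prepend x0 = 1
h = theta @ x                          # ΘᵀX = Θ0 + Θ1*x1 + Θ2*x2
print(h)                               # 1 + 2*0.5 + 3*(-1) = -1.0
```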
Logistic regression: classification
2-class classification; examples: email -> spam / not spam, disease -> cancer / not cancer, yes / no, win / lose
Y = 0 or 1
Threshold = 0.5: h(x) > 0.5 -> Y = 1; h(x) <= 0.5 -> Y = 0
We want 0 <= h(x) <= 1.
Linear regression: h(x) = ΘᵀX; logistic regression: h(x) = g(ΘᵀX), where g(z) = 1/(1 + e^(-z)) is the sigmoid function.
h(x) is the probability that Y = 1 given x: p(y=1 | X, Θ), with p(y=0 | X, Θ) + p(y=1 | X, Θ) = 1.
Decision boundary example with features x1, x2: predict Y = 1 if x1 + x2 >= 5 and Y = 0 if x1 + x2 < 5, i.e. the straight line x1 + x2 = 5 in the (x1, x2) plane.
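A hedged sketch of the sigmoid hypothesis and the x1 + x2 = 5 decision boundary above; the weights Θ = [-5, 1, 1] are an illustrative choice that reproduces that boundary:

```python
# Sigmoid hypothesis h(x) = g(ΘᵀX) with a linear decision boundary x1 + x2 = 5.
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid

theta = np.array([-5.0, 1.0, 1.0])           # Θ0, Θ1, Θ2 (assumed for illustration)

def predict(x1, x2):
    h = g(theta @ np.array([1.0, x1, x2]))   # 0 <= h(x) <= 1
    return 1 if h > 0.5 else 0               # threshold 0.5

print(predict(3, 3))   # x1 + x2 = 6 >= 5  -> Y = 1
print(predict(1, 2))   # x1 + x2 = 3 <  5  -> Y = 0
```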
Cost(h(x), y) = -log(h(x))    if y = 1
              = -log(1-h(x))  if y = 0
If y = 1: cost = 0 when h(x) = 1, but as h(x) -> 0 the cost -> infinity.
If y = 0: cost = 0 when h(x) = 0, but as h(x) -> 1 the cost -> infinity.
Minimize J(Θ), the average of this cost over the training set.
(Plot the two curves at https://www.desmos.com/calculator to see this.)
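A small numerical check of the two cost branches described above:

```python
# Logistic cost: near 0 when the prediction matches the label,
# very large when the prediction is confidently wrong.
import numpy as np

def cost(h, y):
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost(0.99, 1))   # y = 1, h(x) near 1 -> cost near 0
print(cost(0.01, 1))   # y = 1, h(x) near 0 -> cost large (-> infinity as h -> 0)
print(cost(0.01, 0))   # y = 0, h(x) near 0 -> cost near 0
print(cost(0.99, 0))   # y = 0, h(x) near 1 -> cost large
```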
Multi-class case (Y = 1, 2, 3): One vs All (OVA) trains one binary classifier per class and predicts the class with the maximum h(x); the same OVA scheme works with NN and SVM, while decision trees (DT) handle multiple classes directly. (Sketch below.)
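A hedged sketch of One-vs-All prediction; the three Θ vectors are illustrative placeholders, not trained values:

```python
# One-vs-All: one sigmoid classifier per class, pick the class with the largest h(x).
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

thetas = {1: np.array([-1.0,  2.0, -1.0]),   # classifier for Y = 1 vs rest
          2: np.array([ 0.0, -1.0,  2.0]),   # classifier for Y = 2 vs rest
          3: np.array([ 1.0, -1.0, -1.0])}   # classifier for Y = 3 vs rest

def predict(x1, x2):
    x = np.array([1.0, x1, x2])
    scores = {c: g(theta @ x) for c, theta in thetas.items()}
    return max(scores, key=scores.get)       # class with the maximum h(x)

print(predict(2.0, 0.5))
```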
Model fit (for both linear and logistic regression):
Underfit -> high bias
Just right
Overfit -> high variance
Subset Selection
Reduce the number of features
1) Manually
2) Algorithm
Best subset selection
Forward and backward selection
Shrinkage method (Regularization)
Ridge (L2 Norm)
Lasso (L1 Norm) least absolute shrinkage and selection operator
P-norm
Regularization example:
Θ0 + Θ1x + Θ2x² (fits well)
Θ0 + Θ1x + Θ2x² + Θ3x³ + Θ4x⁴ (overfits)
Adding a large penalty such as + 1000*Θ3² + 1000*Θ4² to the cost drives Θ3 and Θ4 toward 0, effectively removing the higher-order terms.
Elastic net combines the L1 and L2 penalties; L21 norm.
Regularization parameter: λ.
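A hedged sketch of a ridge (L2) regularized cost; swapping the penalty for λ * Σ|Θj| gives lasso (L1). The value of λ and the data are illustrative:

```python
# Ridge (L2) regularized squared-error cost for linear regression.
import numpy as np

def ridge_cost(theta, X, y, lam):
    m = len(y)
    h = X @ theta
    mse = np.sum((h - y) ** 2) / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)   # Θ0 is not penalized
    return mse + penalty

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])     # first column is x0 = 1
y = np.array([1.0, 2.0, 3.0])
print(ridge_cost(np.array([0.0, 1.0]), X, y, lam=10.0))
```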
Reference: Neural Networks and Learning Machines by Simon Haykin
McCulloch-Pitts model: a neuron characterized by its input/output behaviour.
Example: logic gates from a single unit with sigmoid g and bias input x0 = 1:
AND: h(x) = g(-15 + 10x1 + 10x2)
OR:  h(x) = g(-5 + 10x1 + 10x2)

X1 X2 | AND h(x)   | OR h(x)
0  0  | g(-15) ≈ 0 | g(-5) ≈ 0
0  1  | g(-5) ≈ 0  | g(5) ≈ 1
1  0  | g(-5) ≈ 0  | g(5) ≈ 1
1  1  | g(5) ≈ 1   | g(15) ≈ 1
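A short sketch reproducing the AND and OR units from the table above, with sigmoid as g:

```python
# Single-unit AND and OR gates, as in the truth table above.
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def AND(x1, x2):
    return g(-15 + 10 * x1 + 10 * x2)   # h(x) = g(-15 + 10x1 + 10x2)

def OR(x1, x2):
    return g(-5 + 10 * x1 + 10 * x2)    # h(x) = g(-5 + 10x1 + 10x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(AND(x1, x2)), round(OR(x1, x2)))
# Prints the AND and OR truth tables (0/1 after rounding the sigmoid output).
```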
NN as a directed graph:
Synaptic link: x -> y = wx (linear)
Activation link: x -> y = g(wx) (non-linear)
Summing junction: inputs xi and xj contribute yi and yj, output Y = yi + yj
Computation graph: derivative intuition (slope = height/width):
Y = 2x:  x = 1 -> y = 2;  x = 1.001 -> y = 2.002   => slope = 2
         x = 4 -> y = 8;  x = 4.001 -> y = 8.002   => slope = 2
Y = x²:  x = 1 -> y = 1;  x = 1.001 -> y ≈ 1.002   => slope ≈ 2 (2x at x = 1)
         x = 4 -> y = 16; x = 4.001 -> y ≈ 16.008  => slope ≈ 8 (2x at x = 4)
Computation graph example: J(a, b, c) = 3(a + bc)
Forward pass: u = bc, v = a + u, J = 3v
(e.g. v = 11 gives J = 33; nudging v to 11.001 gives J = 33.003, so dJ/dv = 3)
Backward pass (chain rule):
dJ/dv = 3
dJ/du = dJ/dv * dv/du = 3 * 1 = 3
dJ/da = dJ/dv * dv/da = 3 * d(a+u)/da = 3 * 1 = 3
dJ/db = dJ/dv * dv/du * du/db = 3 * c
dJ/dc = dJ/dv * dv/du * du/dc = 3 * b
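A small numerical check of these chain-rule results; the values a = 5, b = 3, c = 2 are chosen so that v = 11 and J = 33 as above:

```python
# Forward pass through J(a, b, c) = 3(a + bc), with finite-difference derivatives.
def forward(a, b, c):
    u = b * c       # u = bc
    v = a + u       # v = a + u
    return 3 * v    # J = 3v

a, b, c = 5.0, 3.0, 2.0          # illustrative values (v = 11, J = 33)
eps = 0.001
print((forward(a + eps, b, c) - forward(a, b, c)) / eps)   # dJ/da ≈ 3
print((forward(a, b + eps, c) - forward(a, b, c)) / eps)   # dJ/db ≈ 3*c = 6
print((forward(a, b, c + eps) - forward(a, b, c)) / eps)   # dJ/dc ≈ 3*b = 9
```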
Logistic regression as one neuron: inputs X with weights W and bias b -> sum -> sigmoid -> a = h(x) = g(ΘᵀX) -> cost/loss
Loss: L(y, a) = -y*log(a) - (1-y)*log(1-a)
da/dz = a(1-a)   (since g'(z) = g(z)(1-g(z)))
dL/da = -y/a + (1-y)/(1-a)
dL/dz = dL/da * da/dz = a - y
dL/dw1 = dL/da * da/dz * dz/dw1 = (a - y)*x1
dL/db = a - y
Forward for one neuron: z = wᵀx + b, a = g(z), loss L(a, y)
Activation function choices: sigmoid, tanh, threshold, ReLU (a = max(0, z)), leaky ReLU (a = max(0.001z, z)), RBF
2-layer network: input layer (x1, x2, x3) -> hidden layer-1 (W1, b1 -> Z1 and a1) -> output layer (W2, b2 -> Z2 and a2) -> L(a2, y)
Shapes: X = a0: (3,1); W1: (3,3), b1: (3,1), Z1: (3,1), a1: (3,1); W2: (1,3), b2: (1,1), Z2: (1,1), A2: (1,1)
Parameters: W1, b1, W2, b2
Cost function: J(W1, b1, W2, b2) = 1/m Σ L(a2, y)
Initialization: W = random(-1, 1) * 0.0001 (small random values), b = 0
Forward propagation:
Z1 = W1·X + b1
A1 = g(Z1)
Z2 = W2·A1 + b2
A2 = g(Z2)   (sigmoid)

Backward propagation:
dZ2 = dL/dZ2 = dL/dA2 * dA2/dZ2 = A2 - Y
dW2 = dZ2 · A1ᵀ
db2 = dZ2
dZ1 = dL/dZ1 = (W2ᵀ · dZ2) * A1(1 - A1)
dW1 = dZ1 · Xᵀ
db1 = dZ1

Parameter update:
W2 = W2 - α·dW2
b2 = b2 - α·db2
W1 = W1 - α·dW1
b1 = b1 - α·db1
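A hedged sketch of one forward/backward pass for the 3-3-1 network above, with sigmoid activations in both layers and the shapes listed in the notes; the input, label, and learning rate are illustrative:

```python
# One forward/backward pass and parameter update for a 3-3-1 network.
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
X = np.random.randn(3, 1)                                   # a0, shape (3, 1)
Y = np.array([[1.0]])
W1 = np.random.uniform(-1, 1, (3, 3)) * 0.0001; b1 = np.zeros((3, 1))
W2 = np.random.uniform(-1, 1, (1, 3)) * 0.0001; b2 = np.zeros((1, 1))
alpha = 0.1

# Forward propagation
Z1 = W1 @ X + b1;  A1 = g(Z1)
Z2 = W2 @ A1 + b2; A2 = g(Z2)

# Backward propagation
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T;  db2 = dZ2
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
dW1 = dZ1 @ X.T;   db1 = dZ1

# Parameter update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
print(Z1.shape, A2.shape)                                   # (3, 1) (1, 1)
```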
GA (Genetic Algorithm), e.g. for optimizing ELMs:
• Genetic representation / initial population: ELM1(L, w, b), ELM2(L, w, b), ELM3, ..., ELMn; population size e.g. 20, 50, 100 (multi-dimensional)
• Fitness function: RMSE on test data (e.g. keep candidates with RMSE < 0.3), or accuracy / F1-score
• New population is built from the initial population
• For i = 0 to T, with T preset (e.g. 100, 50, 20) or early stopping (e.g. after 5 generations without improvement)
• Selection function: tournament or roulette-wheel selection, giving parents e.g. A: ELM1, B: ELM5
• Crossover operator:
  - Node-based: randomly pick nodes (< L) from A and B; Child 1 = A + b(B), Child 2 = B + b(A)
  - Link-based: randomly choose weights w from A and B, or arithmetic crossover
• Mutation operator:
  - Node-based: delete a node / add a node
  - Link-based: exchange links
• Evaluate the fitness function (RMSE) on the children
(Grid search, by contrast, is time-consuming. A rough sketch follows this list.)
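A heavily hedged sketch of the GA loop above, evolving candidate weight vectors with tournament selection, link-based crossover (randomly choosing w from A and B), and a simple mutation; the fitness function is a placeholder standing in for "test RMSE of a trained ELM":

```python
# Minimal GA loop: selection, crossover, mutation over weight vectors.
import numpy as np

rng = np.random.default_rng(0)

def fitness(w):                      # placeholder for test RMSE of ELM(L, w, b)
    return np.sum((w - 0.5) ** 2)    # lower is better

pop = [rng.uniform(-1, 1, 10) for _ in range(20)]     # initial population, size 20

for generation in range(50):                          # T preset (or early stopping)
    def tournament():
        i, j = rng.choice(len(pop), 2, replace=False)
        return pop[i] if fitness(pop[i]) < fitness(pop[j]) else pop[j]
    children = []
    for _ in range(len(pop)):
        A, B = tournament(), tournament()             # selection
        mask = rng.random(10) < 0.5
        child = np.where(mask, A, B)                  # randomly choose w from A and B
        if rng.random() < 0.1:                        # mutation: perturb one link
            child[rng.integers(10)] += rng.normal(0, 0.1)
        children.append(child)
    pop = children

print(min(fitness(w) for w in pop))                   # best child fitness
```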
One-hot encoding and the softmax function
Why? Mapping categorical data (B, C, T) to 1, 2, 3 imposes an artificial order, and an output like 1.5 has no meaning.
One-hot encoding: one binary variable per category
B: 1, 0, 0
C: 0, 1, 0
T: 0, 0, 1
Softmax function: turns a score vector into probabilities that sum to 1.0; e.g. for [1, 3, 2] the normalized shares are 1/(1+3+2), 3/(1+3+2), 2/(1+3+2) (true softmax first exponentiates the scores: e^zi / Σj e^zj).
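A small sketch of one-hot encoding for (B, C, T) and of the softmax function:

```python
# One-hot encoding of categories and a numerically stable softmax.
import numpy as np

categories = ["B", "C", "T"]
one_hot = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}
print(one_hot["C"])            # [0. 1. 0.]

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 3.0, 2.0])))   # probabilities that sum to 1.0
```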
Deep Learning
Why DNN? Same forward and back propagation, with more layers.
Parameters: W, b for each layer
Hyperparameters: learning rate, number of epochs, number of layers L, neurons per layer, activation function
(2018/2019 direction: broad learning.)
Convolutional Neural Network (CNN)
Edge detection with a filter / kernel / mask: vertical and horizontal edge filters; padding; stride = 1 or 2; max pooling; the image is treated as 2-D.
Convolving an m × n (× 3) image X with an a × b (× 3) filter gives an output of size (m - a + 1) × (n - b + 1); add a bias and apply ReLU: relu(WᵀX + b). With several filters (each with its own bias b1, b2, ...) the outputs are stacked.
Cost: J = 1/m Σ L(y', y)
Typical architecture: Input -> Conv -> Max pool -> Conv -> Max pool -> Conv -> Max pool -> FC -> FC -> FC -> output
Why convolution? Example: a 28 × 28 × 3 input has 2352 values; six 5 × 5 filters give a 24 × 24 × 6 output with 3456 values. A fully connected layer between them would need 2352 × 3456 ≈ 81 lakh weights, whereas the conv layer has only about 5 × 5 × 6 = 150 weights (456 parameters counting the 3 input channels and the biases).
Key properties: parameter sharing and sparsity of connections.
Pre-trained models: VGG-16, VGG-19, ResNet; transfer learning.
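A hedged sketch of a single-channel vertical-edge convolution with stride 1 and no padding, showing the (m - a + 1) × (n - b + 1) output size; the image and kernel are illustrative:

```python
# Vertical-edge detection by 2-D convolution (valid, stride 1).
import numpy as np

image = np.zeros((6, 6))
image[:, :3] = 10.0                           # bright left half, dark right half

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # vertical edge detector

m, n = image.shape
a, b = kernel.shape
out = np.zeros((m - a + 1, n - b + 1))        # (6-3+1) x (6-3+1) = 4 x 4
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i+a, j:j+b] * kernel)

print(out)   # large values down the middle columns, where the vertical edge sits
```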
Sequence models (NLP)
X -> Y examples:
Audio -> person (speaker identification)
Music -> pattern / music generation
Sentiment analysis; hate speech analysis; trend in share market
Book / text -> suicide-risk detection
Syntax checking
Paragraph -> summary
Language-model example: choosing between "am" and "are" in a sentence such as "Mangoes ___ in VJTI ...".
RNN input/output configurations (inputs x0, x1, x2, x3 -> outputs y0, y1, y2, y3):
one to one, one to many, many to one, many to many (input length equal to the output length, or different).
Word-level language model with vocabulary {I, am, in, vjti}, one-hot encoded:
X0 = "I"  -> [1, 0, 0, 0]
X1 = "am" -> [0, 1, 0, 0]
At each step the network predicts the next word: P(? | I), then P(in | I, am), and so on.
Total sequence loss: L(y', y) = Σt L(y't, yt), with per-step loss L = -y*log(y') - (1-y)*log(1-y').
Unrolling the RNN (the same cell A repeated over time) leads to the vanishing gradient problem, and sometimes exploding gradients (NaN values).
Solution: LSTM / GRU cells.
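A hedged sketch of a single vanilla RNN step h_t = tanh(Wx·x_t + Wh·h_{t-1} + b); the repeated multiplication by Wh across many steps is what makes gradients vanish or explode, motivating LSTM/GRU. Sizes, weights, and the word indices are illustrative:

```python
# Vanilla RNN step applied along a one-hot encoded word sequence.
import numpy as np

vocab_size, hidden_size = 4, 3
rng = np.random.default_rng(1)
Wx = rng.normal(0, 0.1, (hidden_size, vocab_size))
Wh = rng.normal(0, 0.1, (hidden_size, hidden_size))
b = np.zeros((hidden_size, 1))

h = np.zeros((hidden_size, 1))
sentence = [0, 1, 2, 3]                            # indices of "I am in vjti"
for idx in sentence:
    x = np.zeros((vocab_size, 1)); x[idx] = 1.0    # one-hot input word
    h = np.tanh(Wx @ x + Wh @ h + b)               # hidden state carries the context
print(h.ravel())
```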
Error rate: fraction of samples with y ≠ y'
Accuracy: 1 - error rate (e.g. out of 100.0% of samples, 96.0% correct -> 4.0% error rate)
F-score = 2*P*R / (P + R)
Other measures: ROC analysis, kappa coefficient, R²
Confusion matrix (sensitivity = recall; specificity = TN / (TN + FP)):

                  Actual Y = 1      Actual Y = 0
Predicted Y' = 1  True +ve (TP) 7   False +ve (FP) 0
Predicted Y' = 0  False -ve (FN) 1  True -ve (TN) 10

Recall: of the actual positive class, what fraction did we correctly classify = TP / (TP + FN), range 0 to 1
Precision: of the predicted positives, what fraction are actually positive = TP / (TP + FP), range 0 to 1
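A short sketch computing these metrics from the confusion matrix above (TP = 7, FP = 0, FN = 1, TN = 10):

```python
# Classification metrics from the example confusion matrix.
TP, FP, FN, TN = 7, 0, 1, 10

accuracy    = (TP + TN) / (TP + FP + FN + TN)
error_rate  = 1 - accuracy
recall      = TP / (TP + FN)          # sensitivity
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_score     = 2 * precision * recall / (precision + recall)

print(accuracy, error_rate, recall, specificity, precision, round(f_score, 3))
# ≈ 0.944, 0.056, 0.875, 1.0, 1.0, 0.933
```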