
UNIT II

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Table of contents:
• Naive Bayes Classifiers
• Perceptron Neural Network
• Support Vector Machines
• Generative Learning
• Combining classifiers: Bagging & boosting

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Overview

Sample data set with frequencies and probabilities


Classification based on Bayes rule
Maximum a posteriori and maximum likelihood
Properties of Bayes classifiers
Naive Bayes classifiers

Parameter estimation, properties, example


Dealing with sparse data
Application: email classification

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
A Sample Data Set

Fictional data set that describes the weather conditions for


playing some unspecified game.

outlook   temp.  humidity  windy  play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Frequencies and Probabilities

Frequencies and probabilities for the weather data:

Counts (yes / no):
outlook:      sunny 2/3, overcast 4/0, rainy 3/2
temperature:  hot 2/2, mild 4/2, cool 3/1
humidity:     high 3/4, normal 6/1
windy:        false 6/2, true 3/3
play:         yes 9, no 5

Relative frequencies (yes / no):
outlook:      sunny 2/9 | 3/5, overcast 4/9 | 0/5, rainy 3/9 | 2/5
temperature:  hot 2/9 | 2/5, mild 4/9 | 2/5, cool 3/9 | 1/5
humidity:     high 3/9 | 4/5, normal 6/9 | 1/5
windy:        false 6/9 | 2/5, true 3/9 | 3/5
play:         9/14 | 5/14
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Classifying an Unseen Example

Now assume that we have to classify the following new


instance:
outlook temp. humidity windy play
sunny cool high true ?

Key idea: compute a probability for each class based on the


probability distribution in the training data.
First take into account the probability of each attribute. Treat
all attributes as equally important, i.e., multiply the probabilities:
P(yes) = 2/9 · 3/9 · 3/9 · 3/9 = 0.0082
P(no) = 3/5 · 1/5 · 4/5 · 3/5 = 0.0577
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Classifying an Unseen Example

Now take into account the overall probability of a given class.


Multiply it with the probabilities of the attributes:
P(yes) = 0.0082 · 9/14 = 0.0053
P(no) = 0.0577 · 5/14 = 0.0206
Now choose the class so that it maximizes this probability. This
means that the new instance will be classified as no.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Bayes Rule

This procedure is based on Bayes Rule, which says: if you have


a hypothesis h and data D which bears on the hypothesis, then:

(1) P(h|D) = P(D|h) P(h) / P(D)

P(h): independent probability of h: prior probability
P(D): independent probability of D
P(D|h): conditional probability of D given h: likelihood
P(h|D): conditional probability of h given D: posterior probability

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Maximum A Posteriori

Based on Bayes Rule, we can compute the maximum a


posteriori hypothesis for the data:
(2) h_MAP = argmax_{h ∈ H} P(h|D)
          = argmax_{h ∈ H} P(D|h) P(h) / P(D)
          = argmax_{h ∈ H} P(D|h) P(h)

H : set of all hypotheses


Note that we can drop P(D), as the probability of the data is
constant (and independent of the hypothesis).
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Maximum Likelihood

Now assume that all hypotheses are equally probable a priori,
i.e., P(h_i) = P(h_j) for all h_i, h_j ∈ H.
This is called assuming a uniform prior. It simplifies computing
the posterior:

(3) h_ML = argmax_{h ∈ H} P(D|h)

This hypothesis is called the maximum likelihood hypothesis.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Properties of Bayes Classifiers

Incrementality: with each training example, the prior and


the likelihood can be updated dynamically: flexible and
robust to errors.
Combines prior knowledge and observed data: prior
probability of a hypothesis multiplied with probability of the
hypothesis given the training data.
Probabilistic hypotheses: outputs not only a
classification, but a probability distribution over all classes.
Meta-classification: the outputs of several classifiers can
be combined, e.g., by multiplying the probabilities that all
classifiers predict for a given class.
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Naive Bayes Classifier
Assumption: the training set consists of instances described as
conjunctions of attribute values; the target classification is based
on a finite set of classes V.
The task of the learner is to predict the correct class for a new
instance ⟨a1, a2, ..., an⟩.
Key idea: assign the most probable class v_MAP using Bayes Rule.

(4) v_MAP = argmax_{v_j ∈ V} P(v_j | a1, a2, ..., an)
          = argmax_{v_j ∈ V} P(a1, a2, ..., an | v_j) P(v_j) / P(a1, a2, ..., an)
          = argmax_{v_j ∈ V} P(a1, a2, ..., an | v_j) P(v_j)
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Naive Bayes: Parameter Estimation
Estimating P(v_j) is simple: compute the relative frequency of
each target class in the training set.
Estimating P(a1, a2, ..., an | v_j) is difficult: typically not enough
instances for each attribute combination in the training set:
sparse data problem.
Independence assumption: attribute values are conditionally
independent given the target value: naive Bayes.

(5) P(a1, a2, ..., an | v_j) = ∏_i P(a_i | v_j)

Hence we get the following classifier:

(6) v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Naive Bayes: Properties

Estimating P(a_i | v_j) instead of P(a1, a2, ..., an | v_j) greatly
reduces the number of parameters (and data sparseness).
The learning step in Naive Bayes consists of estimating
P(a_i | v_j) and P(v_j) based on the frequencies in the
training data.
There is no explicit search during training (as opposed to
decision trees).
An unseen instance is classified by computing the class
that maximizes the posterior.
When conditional independence is satisfied, Naive Bayes
corresponds to MAP classification.
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Naive Bayes: Example
Apply Naive Bayes to the weather training data. The hypothesis
space is V = {yes, no}. Classify the following new instance:
outlook temp. humidity windy play
sunny cool high true ?

v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) ∏_i P(a_i | v_j)
     = argmax_{v_j ∈ {yes, no}} P(v_j) P(outlook = sunny | v_j) P(temp = cool | v_j)
                                P(humidity = high | v_j) P(windy = true | v_j)

Compute priors:
P(play = yes) = 9/14    P(play = no) = 5/14
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Naive Bayes: Example
Compute conditionals (examples):
P(windy = true | play = yes) = 3/9
P(windy = true | play = no) = 3/5
Then compute the best class:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(true|yes)
  = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(true|no)
  = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 = 0.0206
Now classify the unseen instance:
v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) P(sunny|v_j) P(cool|v_j) P(high|v_j) P(true|v_j) = no
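To make the arithmetic concrete, here is a minimal Python sketch (not part of the slides) that reproduces the computation above from the hand-counted probabilities:

```python
# Minimal sketch: Naive Bayes scoring of the new weather instance,
# using the probabilities read off the frequency table above.
priors = {"yes": 9/14, "no": 5/14}
cond = {  # P(attribute value | class)
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}

def nb_score(cls, attrs):
    """Unnormalized posterior: P(class) times the product of P(a_i | class)."""
    score = priors[cls]
    for a in attrs:
        score *= cond[cls][a]
    return score

instance = ["sunny", "cool", "high", "true"]
scores = {c: nb_score(c, instance) for c in ("yes", "no")}
print(scores)                       # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))  # 'no'
```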
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Naive Bayes: Sparse Data

Conditional probabilities can be estimated directly as relative


frequencies:
P(a_i | v_j) = n_c / n

where n is the total number of training instances with class v_j,
and n_c is the number of instances with attribute value a_i and class v_j.
Problem: this provides a poor estimate if n_c is very small.

Extreme case: if n_c = 0, then the whole posterior will be zero.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Naive Bayes: Sparse Data

Solution: use the m-estimate of probabilities:


P(a_i | v_j) = (n_c + m·p) / (n + m)

p: prior estimate of the probability
m: equivalent sample size (constant)
In the absence of other information, assume a uniform prior:

p = 1/k

where k is the number of values that the attribute a_i can take.
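A minimal sketch of the m-estimate as a Python function (the numbers in the example call are illustrative assumptions, not values from the slides):

```python
# Minimal sketch of the m-estimate of a conditional probability.
def m_estimate(n_c, n, p, m):
    """P(a_i | v_j) = (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Illustrative values: attribute with k = 3 values, uniform prior p = 1/3,
# equivalent sample size m = 3, and n_c = 0 co-occurrences out of n = 5.
print(m_estimate(0, 5, 1/3, 3))  # 0.125 instead of 0, so the posterior is not zeroed out
```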

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Application: Email Classification
Training data: a corpus of email messages, each message
annotated as spam or no spam.
Task: classify new email messages as spam/no spam.
To use a naive Bayes classifier for this task, we have to first find
an attribute representation of the data.
Treat each text position as an attribute whose value is the
word at that position. Example: the email starts with "get rich".
The naive Bayes classifier is then:

v_NB = argmax_{v_j ∈ {spam, nospam}} P(v_j) ∏_i P(a_i | v_j)
     = argmax_{v_j ∈ {spam, nospam}} P(v_j) P(a1 = get | v_j) P(a2 = rich | v_j) ...
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Application: Email Classification

Using naive Bayes means we assume that words are


independent of each other. Clearly incorrect, but doesn’t hurt
a lot for our task.
The classifier uses P(a_i = w_k | v_j), i.e., the probability that the
i-th word in the email is the k-th word in our vocabulary, given that the
email has been classified as v_j.
Simplify by assuming that position is irrelevant: estimate
P(w_k | v_j), i.e., the probability that word w_k occurs in the email,
given class v_j.
Create a vocabulary: make a list of all words in the training
corpus, discard words with very high or very low frequency.
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Application: Email Classification

Training: estimate priors:


P(v_j) = n / N

Estimate likelihoods using the m-estimate:

P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)

N: total number of words in all emails
n: number of words in emails with class v_j
n_k: number of times word w_k occurs in emails with class v_j
|Vocabulary|: size of the vocabulary
Testing: to classify a new email, assign it the class with the
highest posterior probability. Ignore unknown words.
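A rough Python sketch of this training and testing procedure, under the estimates defined above (the helper names train_nb and classify are mine, not from the slides):

```python
# Sketch of the word-based Naive Bayes email classifier described above.
from collections import Counter

def train_nb(emails, labels, vocabulary):
    """emails: list of word lists; labels: 'spam'/'nospam'; vocabulary: set of words."""
    all_words = [w for e in emails for w in e if w in vocabulary]
    N = len(all_words)                        # total number of words in all emails
    priors, likelihoods = {}, {}
    for c in set(labels):
        class_words = [w for e, l in zip(emails, labels) if l == c
                       for w in e if w in vocabulary]
        n = len(class_words)                  # number of words in emails of class c
        counts = Counter(class_words)         # n_k: occurrences of word w_k in class c
        priors[c] = n / N                     # P(v_j) = n / N, as defined above
        likelihoods[c] = {w: (counts[w] + 1) / (n + len(vocabulary))
                          for w in vocabulary}
    return priors, likelihoods

def classify(email, priors, likelihoods):
    def score(c):                             # unnormalized posterior for class c
        s = priors[c]
        for w in email:
            if w in likelihoods[c]:           # unknown words are ignored
                s *= likelihoods[c][w]
        return s
    return max(priors, key=score)
```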
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Summary

Bayes classifier combines prior knowledge with observed


data: assigns a posterior probability to a class based on its
prior probability and its likelihood given the training data.
Computes the maximum a posteriori (MAP) hypothesis or
the maximum likelihood (ML) hypothesis.
Naive Bayes classifier assumes conditional independence
between attributes and assigns the MAP class to new
instances.
Likelihoods can be estimated based on frequencies.
Problem: sparse data. Solution: using the m-estimate
(adding a constant).
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Introduction to Perceptron
The perceptron, first proposed by Rosenblatt (1958), is a simple
neuron that is used to classify its input into one of two
categories.
A perceptron is a single processing unit of a neural network.
[Figure: a perceptron with inputs x1, ..., xn, weights w1, ..., wn, bias b,
net input v and output y; the perceptron in terms of a biological neuron]

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
While in actual neurons the dendrite receives electrical signals
from the
axons of other neurons, in the perceptron these electrical signals
are represented as numerical values. At the synapses between
the dendrite and axons, electrical signals are modulated in
various amounts. This is also modeled in the perceptron by
multiplying each input value by a value called the weight.
An actual neuron fires an output signal only when the total
strength of the input signals exceeds a certain threshold. We
model this phenomenon in a perceptron by calculating the
weighted sum of the inputs to represent the total strength of the
input signals, and applying a step function on the sum to
determine its output. As in biological neural networks, this output
is fed to other perceptrons.
A perceptron can be defined as a single artificial neuron that
computes its weighted input with the help of the threshold
activation function or step function.
It is also called a TLU (Threshold Logic Unit).

[Figure: perceptron with inputs x1, ..., xn, weights w1, ..., wn and bias weight w0]

o = f(Σ_{i=0}^{n} w_i x_i),   where f = 1 if Σ_{i=0}^{n} w_i x_i > 0, and −1 otherwise
Definition of supervised learning network
Supervised learning is used when we have a set of training data. This
training data consists of some input data that is connected with the correct
output values. The output values are often referred to as target values. This training
data is used by learning algorithms like back-propagation or genetic algorithms.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
What can a perceptron do?
In machine learning, the perceptron is an algorithm
for supervised classification of an input into one of several possible non-binary
outputs.
Perceptron can be defined as a single artificial neuron that computes its weighted input with
the help of the threshold activation function or step function.
The Perceptron is used for binary Classification.
The Perceptron can only model linearly separable classes.
First train a perceptron for a classification task

Find suitable weights in such a way that the training examples are
correctly classified.
Geometrically try to find a hyper-plane that separates the examples of the two classes

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Linear separability
Linear separability is the concept wherein the separation of the input
space into regions is based on whether the network response is positive or negative.
When the two classes are not linearly separable, it may be desirable to obtain a linear
separator that minimizes the mean squared error.
Definition : Sets of points in 2-D space are linearly separable if the sets
can be separated by a straight line.
Generalizing, a set of points in n-dimensional space is linearly separable if there is
a hyperplane of (n−1) dimensions that separates the sets.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Consider a network having a positive response in the first quadrant
and a negative response in all other quadrants (the AND function), with
either binary or bipolar data. The decision line is then drawn
separating the positive response region from the negative
response region.
The AND gate is linearly separable:

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
The net input to the output neuron is:

Y_in = w0 + Σ_i x_i w_i

where Y_in is the net input to the output neuron, i is any integer,
and w0 is the initial (bias) weight.
The following relation gives the boundary of the decision region:

b + Σ_i x_i w_i = 0

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
This equation can be used to determine the decision
boundary between the regions where Y_in > 0 and Y_in < 0.
Depending on the number of input neurons in the
network, this equation represents a line, a plane or
a hyper-plane.
If it is possible to find weights so that all of the
training input vectors for which the correct response
is +1 lie on one side of the boundary, and all those for which the
correct response is −1 lie on the other side, then the
problem is called linearly separable.

Otherwise, if the above criterion is not met, the problem is
called linearly non-separable.
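As a small illustration, the sketch below checks a hand-chosen weight setting for the AND gate (the weights w1 = w2 = 1 and bias b = −1.5 are my assumption, not values given on the slides):

```python
# Small check: with binary inputs, w1 = w2 = 1 and b = -1.5 place the
# decision line x1 + x2 - 1.5 = 0 so that only (1, 1) gives a positive
# net input, i.e. the AND function is linearly separable.
def perceptron_and(x1, x2, w=(1.0, 1.0), b=-1.5):
    y_in = b + w[0] * x1 + w[1] * x2      # net input b + sum_i x_i w_i
    return 1 if y_in > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_and(x1, x2))   # output is 1 only for (1, 1)
```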
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Example

The exclusive-or (XOR) problem:

Even parity means even number of 1 bits in the


input
Odd parity means odd number of 1 bits in the
input
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
There is no way to draw a single straight line so that the circles
are on
one side of the line and the dots on the other side.
Perceptron is unable to find a line separating even parity input
patterns from odd parity input patterns.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Single layer perceptron

A Single Layer Perceptron consists of an input and an output layer.
The activation function employed is a hard-limiting function.
Definition: An arrangement of one input layer of neurons feeding
forward to one output layer of neurons is known as a Single Layer
Perceptron.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Single layer perceptron learning algorithm:
Step 1: Create a perceptron with (n+1) input neurons x0, x1, ..., xn,
where x0 = 1 is the bias input. Let O be the output neuron.
Step 2: Initialize the weights W = (w0, w1, ..., wn) to random values.
Step 3: Iterate through the input patterns x_j of the training set using the
current weights; i.e. compute the weighted sum of inputs
net_j = Σ_{i=1}^{n} x_i w_i for each input pattern j.
Step 4: Compute the output y_j using the step function.
Step 5: Compare the computed output y_j with the target output for each
input pattern j.
If all the input patterns have been classified correctly, then output (read)
the weights and exit.
Step 6: Otherwise, update the weights as given below:
If the computed output y_j is 1 but should have been 0, then
w_i = w_i − α x_i, i = 0, 1, 2, ..., n.
If the computed output y_j is 0 but should have been 1, then
w_i = w_i + α x_i, i = 0, 1, 2, ..., n,
where α is the learning rate parameter.
Step 7: Go to Step 3.
END
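A minimal Python sketch of Steps 1-7 above (variable names, the random initialization range and the epoch limit are my own choices):

```python
# Minimal single-layer perceptron training loop following the steps above.
import random

def step(net):
    return 1 if net > 0 else 0

def train_perceptron(patterns, targets, alpha=0.1, max_epochs=100):
    n = len(patterns[0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]   # Step 2: random weights
    for _ in range(max_epochs):
        all_correct = True
        for x, t in zip(patterns, targets):
            xb = [1] + list(x)                               # bias input x0 = 1
            y = step(sum(wi * xi for wi, xi in zip(w, xb)))  # Steps 3-4
            if y != t:                                       # Steps 5-6: update on error
                all_correct = False
                sign = -1 if y == 1 else +1                  # -alpha*x or +alpha*x
                w = [wi + sign * alpha * xi for wi, xi in zip(w, xb)]
        if all_correct:                                      # Step 5: stop when done
            break
    return w

# Example: the (linearly separable) AND function
weights = train_perceptron([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 0, 1])
```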
Multi layer perceptron

Multilayer perceptrons (MLP) are the most popular type of neural


networks in use today. They belong to a general class of structures called
feedforward neural networks, a basic type of neural network capable of
approximating generic classes of functions, including continuous and
integrable functions.
A multilayer perceptron:
has one or more hidden layers with any number of units.
uses linear combination functions in the input layers.
uses generally sigmoid activation functions in the hidden layers.
has any number of outputs with any activation function.
has connections between the input layer and the first hidden layer, between the
hidden layers, and between the last hidden layer and the output layer.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Three layer network
[Figure: a three-layer network with inputs x1, x2, ..., xn, one hidden layer,
and an output layer]

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Layer in neural network:
The input layer:
Introduces input values into the network.
No activation function or other processing.

The hidden layer(s):


Performs classification of features.
Two hidden layers are sufficient to solve any problem.
More features imply that more layers may be better.
The output layer:

Functionally it is just like the hidden layers.

Outputs are passed on to the world outside the neural network.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Adaptive linear neuron (Adaline):
In 1959, Bernard Widrow and Marcian Hoff of Stanford
developed models they called ADALINE (Adaptive Linear
Neuron) and MADALINE (Multilayer ADALINE).
These models were named for their use of Multiple
ADAptive LINear Elements. MADALINE was the
first neural network to be applied to a real-world
problem. It is an adaptive filter which eliminates
echoes on phone lines.
Using an Adaline network (a minimal sketch follows below):

Initialize
• Assign random weights to all links.
Training
• Feed in known inputs in random sequence.
• Simulate the network.
• Compute the error between the input and the output (error function).
• Adjust the weights (learning function).
• Repeat until the total error < ε.
Thinking
• Simulate the network, even for inputs it was not trained on.
• The network will respond to any input.
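A minimal sketch of this training loop using the LMS/delta rule commonly associated with Adaline (the learning rate and ε values are illustrative assumptions, not from the slides):

```python
# Minimal Adaline (LMS / delta rule) sketch following the steps above.
import random

def train_adaline(patterns, targets, lr=0.05, eps=0.05, max_epochs=1000):
    n = len(patterns[0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]       # random weights incl. bias
    for _ in range(max_epochs):
        total_error = 0.0
        for x, t in zip(patterns, targets):
            xb = [1] + list(x)
            net = sum(wi * xi for wi, xi in zip(w, xb))          # simulate: linear output
            err = t - net                                        # error function
            w = [wi + lr * err * xi for wi, xi in zip(w, xb)]    # adjust weights
            total_error += err ** 2
        if total_error < eps:                                    # repeat until error < eps
            break
    return w
```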
Perceptron learning algorithm:
Training patterns are presented to the network's inputs and the output
is computed. Then the connection weights w_j are modified by an amount
that is proportional to the product of the difference between the actual
output, y, and the desired output, d, and the input pattern, x.
The algorithm is as follows:
1. Initialize the weights and threshold to small random numbers.
2. Present a vector x to the neuron inputs and calculate the output.
3. Update the weights according to:

   w_j(t + 1) = w_j(t) + η (d − y) x_j

   where
   d is the desired output,
   t is the iteration number, and
   η (eta) is the gain or step size, where 0.0 < η < 1.0.
4. Repeat steps 2 and 3 until:
   the iteration error is less than a user-specified error threshold, or
   a predetermined number of iterations have been completed.
Perceptron training algorithm:

Classification of patterns:
Training of the network: Given a set of inputs 'x' and output/target
values 'y', the network finds the best linear mapping from x to y.
Given an unseen 'x' value, we train our network to predict what the
most likely 'y' value will be.
Classification of patterns is also a technique of training the network,
in which we assign a physical object, event or phenomenon to one of a
set of pre-specified classes (or categories).
Let us consider an example to illustrate the concept, with 2 inputs
(x1 and x2) and 1 output node, classifying the input into 2 classes
(class 0 and class 1).

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Support Vector Machines
Unit 2
Eighth Semester
Subject code: ETCS102
Department of Computer Science & Engineering,
Bharati Vidyapeeth’s College of Engineering,
New Delhi
Subject: Machine learning

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Outline
Introduction
Towards SVM
Basic Concept
Implementations
Issues
Conclusion & References

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,


New Delhi, Subject : Machine Learning
Introduction:
o SVMs – supervised learning methods for
classification and regression
o Base: Vapnik-Chervonenkis theory
o First practical implementation: early nineties
o Satisfying from a theoretical point of view
o Can lead to high performance in practical
applications
o Currently considered one of the most efficient
families of algorithms in Machine Learning

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of


Engineering, New Delhi, Subject : Machine Learning
Towards SVM
A: I found a really good function describing the training
examples using an ANN but couldn't classify test examples that
efficiently; what could be the problem?
B: It didn't generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM
1) Generalises well
And what's more…
2) Computationally efficient (just a convex optimization
problem)
3)Robust in high dimensions also (no overfitting)
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
A: Why is it so?
B: So many questions…!
o Vapnik & Chervonenkis' Statistical Learning Theory result:
relates the ability to learn a rule for classifying training data to the
ability of the resulting rule to classify unseen examples
(generalization)
o Let f be a rule, f ∈ F
o Empirical risk of f: a measure of the quality of classification on the
training data
R_emp(f) = 0: best performance

R_emp(f) = 1: worst performance

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of


Engineering, New Delhi, Subject : Machine Learning
What about the Generalization?
o Risk of a classifier: the probability that rule f makes a
mistake on a new sample randomly generated by a
random machine
R(f) = P(f(x) ≠ y)
R(f) = 0: best generalization

R(f) = 1: worst generalization

o Many times a small empirical risk implies a small risk

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Is the problem solved? …….. NO!
o Is the risk of f_t selected by Empirical Risk Minimization
(ERM) near to that of the ideal f_i?
o No, not in the case of overfitting
o An important result of Statistical Learning Theory:

ER(f_t) ≤ R(f_i) + C · V(F)/N

where V(F) is the VC dimension of the class F,
N is the number of observations for training, and
C is a universal constant
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
What it says:

o The risk of the rule selected by ERM is not far from the risk of
the ideal rule if:
1) N is large enough
2) the VC dimension of F is small enough
[VC dimension? In short, the larger a class F, the larger its VC dimension.]

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Structural Risk Minimization (SRM)
o Consider family of F =>
F0 ⊂ F1 ⊂ ..........Fn ⊂ ...... ⊂ F
s.t.
V (F0 ) ≤ V (F1 ) ≤ .......... ≤ V (Fn ) ≤ ...... ≤ V (F )

o Find the minimum Empirical Risk for each subclass


and its VC dimension
o Select a subclass with minimum bound on the Risk
(i.e. sum of the VC dimension and empirical risk)

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
SRM graphically:  ER(f_t) ≤ R(f_i) + C · V(F)/N

[Figure: SRM shown graphically]

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
A: What it has to do with SVM….?
B:SVM is an approximate implementation of SRM!
A: How?
B: Just in simple way for now:
Just import a result:
Maximizing the distance of the decision boundary from the
training points minimizes the VC dimension,
resulting in good generalization!

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
A: Means Now onwards our target is Maximizing Distance
between decision boundary and the Training points!

B: Yeah, Right!
A: Ok, I am convinced that SVM will generalize well,
but can you please explain what is the concept of
SVM and how to implement it, are there any
packages available?
B: Yeah, don’t worry, there are many implementations
available, just use them for your application, now the
next part of the presentation will give a basic idea
about the SVM, so be with me!
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Basic Concept of SVM:
o Which line will classify the unseen data well?
o The dotted line! It's the line with the maximum margin!
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont…

[Figure: separating hyperplane W^T X + b = 0 with margin hyperplanes
W^T X + b = +1 and W^T X + b = −1; the points lying on the margins are the support vectors]

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Some definitions:
o Functional margin:
1) w.r.t. individual examples: γ̂^(i) = y^(i) (W^T x^(i) + b)
2) w.r.t. an example set S = {(x^(i), y^(i)); i = 1, ..., m}: γ̂ = min_{i=1,...,m} γ̂^(i)

o Geometric margin:
1) w.r.t. individual examples: γ^(i) = y^(i) ( (W/||W||)^T x^(i) + b/||W|| )
2) w.r.t. an example set S: γ = min_{i=1,...,m} γ^(i)

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Problem Formulation:

[Figure: the separating hyperplane W^T X + b = 0 and the two margin
hyperplanes W^T X + b = +1 and W^T X + b = −1]

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont..
o The distance of a point (u, v) from the line Ax + By + C = 0 is given by
|Au + Bv + C| / ||n||, where ||n|| is the norm of the vector n = (A, B)
o Distance of the hyperplane from the origin = |b| / ||W||
o Distance of point A from the origin = |b + 1| / ||W||
o Distance of point B from the origin = |b − 1| / ||W||
o Distance between points A and B (the margin) = 2 / ||W||

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont…
We have a data set {X^(i), Y^(i)}, i = 1, ..., m, with X ∈ R^d and Y ∈ R^1,
and a separating hyperplane

W^T X + b = 0

s.t.
W^T X^(i) + b > 0 if Y^(i) = +1
W^T X^(i) + b < 0 if Y^(i) = −1

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont…
o Suppose the training data also satisfy the following constraints:
W^T X^(i) + b ≥ +1 for Y^(i) = +1
W^T X^(i) + b ≤ −1 for Y^(i) = −1
Combining these into one:
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i

o Our objective is to find the hyperplane (W, b) with maximal
separation between it and the closest data points while satisfying the
above constraints

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
THE PROBLEM:
max_{W,b} 2/||W||
such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
Also we know
||W|| = √(W^T W)

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont..
So the problem can be written as:

min_{W,b} (1/2) W^T W
such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i

Notice: W^T W = ||W||²

It is just a convex quadratic optimization problem!
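Since the primal is a convex quadratic program, it can be handed to a generic QP solver. The sketch below uses cvxpy, which is not mentioned on the slides, on a tiny toy data set of my own:

```python
# Sketch: solving the hard-margin primal directly with a convex-QP solver.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # toy data
Y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))          # (1/2) W^T W
constraints = [cp.multiply(Y, X @ w + b) >= 1]            # Y^(i)(W^T X^(i) + b) >= 1
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```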


Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
DUAL
o Solving the dual of our problem will let us apply the SVM to
nonlinearly separable data efficiently
o It can be shown that

min(primal) = max_{α ≥ 0} ( min_{W,b} L(W, b, α) )

o Primal problem:

min_{W,b} (1/2) W^T W

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Constructing Lagrangian
o The Lagrangian for our problem:

L(W, b, α) = (1/2) ||W||² − Σ_{i=1}^{m} α_i [ Y^(i) (W^T X^(i) + b) − 1 ]

where the α_i are Lagrange multipliers and α_i ≥ 0

o Now minimizing it w.r.t. W and b:

we set the derivatives of the Lagrangian w.r.t. W and b to zero

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont…
o Setting the derivative w.r.t. W to zero gives:

W − Σ_{i=1}^{m} α_i Y^(i) X^(i) = 0

i.e.

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o Setting the derivative w.r.t. b to zero gives:

Σ_{i=1}^{m} α_i Y^(i) = 0

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont…
o Plugging these results into the Lagrangian gives

L(W, b, α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T (X^(j))

o Call it

D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T (X^(j))

o This is the result of our minimization w.r.t. W and b

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
So The DUAL:
o Now the dual becomes:

max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨X^(i), X^(j)⟩

s.t.
α_i ≥ 0, i = 1, ..., m
Σ_{i=1}^{m} α_i Y^(i) = 0

o Solving this optimization problem gives us the α_i

o Also the Karush-Kuhn-Tucker (KKT) condition is
satisfied at this solution, i.e.

α_i [ Y^(i) (W^T X^(i) + b) − 1 ] = 0, for i = 1, ..., m
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Values of W and b:
o W can be found using

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o b can be found using:

b* = − ( max_{i: Y^(i) = −1} W*^T X^(i) + min_{i: Y^(i) = +1} W*^T X^(i) ) / 2

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
What if data is nonlinearly separable?
o The maximal margin
hyperplane can classify
only linearly separable
data
o What if the data is linearly
non-separable?
o Map your data into a higher dimensional space where it is
linearly separable and use the maximal margin hyperplane there!

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Taking it to higher dimension works!
Ex. XOR

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Doing it in higher dimensional space
o Let Φ: X → F be a non-linear mapping from the input
space X (original space) to a feature space (higher
dimensional) F
o Then our inner (dot) product ⟨X^(i), X^(j)⟩ in the higher
dimensional space is ⟨φ(X^(i)), φ(X^(j))⟩

o Now, the problem becomes:

max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨φ(X^(i)), φ(X^(j))⟩

s.t.
α_i ≥ 0, i = 1, ..., m
Σ_{i=1}^{m} α_i Y^(i) = 0

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Kernel function:
o There exists a way to compute the inner product in the feature
space as a function of the original input points – it's the kernel
function!
o Kernel function:

K(x, z) = ⟨φ(x), φ(z)⟩
o We need not know φ to compute K(x, z)

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
An example:
Let x, z ∈ R^n and K(x, z) = (x^T z)².

i.e. K(x, z) = (Σ_{i=1}^{n} x_i z_i)(Σ_{j=1}^{n} x_j z_j)
            = Σ_{i=1}^{n} Σ_{j=1}^{n} x_i x_j z_i z_j
            = Σ_{i,j=1}^{n} (x_i x_j)(z_i z_j)

For n = 3, the feature mapping φ is given by:

φ(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)^T

so that K(x, z) = ⟨φ(x), φ(z)⟩

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
example cont…
o Here, for K(x, z) = (x^T z)² with n = 2,

φ(x) = (x1x1, x1x2, x2x1, x2x2)^T

For x = (1, 2)^T and z = (3, 4)^T:

φ(x) = (1, 2, 2, 4)^T
φ(z) = (9, 12, 12, 16)^T

x^T z = [1 2] (3, 4)^T = 11
K(x, z) = (x^T z)² = 121

φ(x)^T φ(z) = [1 2 2 4] (9, 12, 12, 16)^T = 121
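A quick numerical check of this example (a sketch, not from the slides):

```python
# Verify that the kernel value (x^T z)^2 equals the inner product of the
# explicit feature maps phi(x) and phi(z) for the example above.
import numpy as np

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
phi = lambda v: np.outer(v, v).ravel()        # [v1v1, v1v2, v2v1, v2v2]
print((x @ z) ** 2)                           # 121.0
print(phi(x) @ phi(z))                        # 121.0
```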

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
So our SVM for the non-linearly
separable data:
o Optimization problem:

max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j K(X^(i), X^(j))

s.t.
α_i ≥ 0, i = 1, ..., m
Σ_{i=1}^{m} α_i Y^(i) = 0

o Decision function

F(X) = sign( Σ_{i=1}^{m} α_i Y^(i) K(X^(i), X) + b )

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Some commonly used Kernel functions:

o Linear: K(X, Y) = X^T Y
o Polynomial of degree d: K(X, Y) = (X^T Y + 1)^d
o Gaussian Radial Basis Function (RBF): K(X, Y) = exp( −||X − Y||² / (2σ²) )
o Tanh kernel: K(X, Y) = tanh( ρ (X^T Y) − δ )
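The same kernels written out as small numpy functions (a sketch; the parameter defaults are arbitrary assumptions):

```python
# The kernels listed above, as plain numpy functions of two vectors.
import numpy as np

def linear(x, y):              return x @ y
def polynomial(x, y, d=3):     return (x @ y + 1) ** d
def rbf(x, y, sigma=1.0):      return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
def tanh_kernel(x, y, rho=1.0, delta=0.0):
    return np.tanh(rho * (x @ y) - delta)
```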

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Implementations:
Some Ready to use available SVM implementations:
1)LIBSVM:A library for SVM by Chih-Chung Chang and
chih-Jen Lin
(at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2)SVM light : An implementation in C by Thorsten
Joachims
(at: http://svmlight.joachims.org/ )
3)Weka: A Data Mining Software in Java by University
of Waikato
(at: http://www.cs.waikato.ac.nz/ml/weka/ )
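As a usage sketch, scikit-learn (which wraps LIBSVM internally, though it is not on the list above) exposes these kernels through its SVC class; the toy data and parameter values below are my own:

```python
# Usage sketch: training an RBF-kernel SVM with scikit-learn's SVC.
from sklearn import svm

X = [[0, 0], [1, 1], [1, 0], [0, 1]]            # tiny XOR data set
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="rbf", gamma=1.0, C=1.0)   # Gaussian RBF kernel
clf.fit(X, y)
print(clf.predict([[0.9, 0.1]]))
```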
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Issues:
o Selecting a suitable kernel: most of the time it is trial
and error
o Multiclass classification: one decision function for
each class (1 vs. the other l−1) and then finding the one with the
maximum value, i.e. if X belongs to class 1, then for this and the
other (l−1) classes the values of the decision functions are:
F1(X) ≥ +1
F2(X) ≤ −1
.
.
Fl(X) ≤ −1

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Cont….
o Sensitive to noise: Mislabeled data can badly affect
the performance
o Good performance for the applications like-
1)computational biology and medical applications
(protein, cancer classification problems)
2)Image classification
3)hand-written character recognition
And many others…..
o Use SVM for high-dimensional, linearly separable data (its
strength); for nonlinearly separable data, performance depends
on the choice of kernel
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering,
New Delhi, Subject : Machine Learning
Conclusion:
Support Vector Machines provide a very
simple method for linear classification. But
performance, in the case of nonlinearly separable
data, largely depends on the choice of kernel!

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Generative Learning algorithms
So far, we’ve mainly been talking about learning algorithms that model p(y|x; θ),
the conditional distribution of y given x. For instance, logistic regression
modeled p(y|x; θ) as hθ(x) = g(θT x) where g is the sigmoid function. In these
notes, we'll talk about a different type of learning algorithm.
Consider a classification problem in which we want to learn to distinguish
between elephants (y = 1) and dogs (y = 0), based on some features of an
animal. Given a training set, an algorithm like logistic regression or the
perceptron algorithm (basically) tries to find a straight line—that is, a decision
boundary—that separates the elephants and dogs. Then, to classify a new animal
as either an elephant or a dog, it checks on which side of the decision boundary it
falls, and makes its prediction accordingly.
Here’s a different approach. First, looking at elephants, we can build a model
of what elephants look like. Then, looking at dogs, we can build a separate model
of what dogs look like. Finally, to classify a new animal, we can match the new
animal against the elephant model, and match it against the dog model, to see
whether the new animal looks more like the elephants or more like the dogs we
had seen in the training set.
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try
to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron
algorithm), are called discriminative learning algorithms. Here, we'll talk about algorithms that
instead try to model p(x|y) (and p(y)). These algorithms are called generative learning
algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then
p(x|y = 0) models the distribution of dogs’ features, and p(x|y = 1) models the distribution of
elephants’ features.
After modeling p(y) (called the class priors) and p(x|y), our algorithm
can then use Bayes rule to derive the posterior distribution on y given x:
p(y|x) = p(x|y) p(y) / p(x)
Here, the denominator is given by p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0) (you should
be able to verify that this is true from the standard properties of probabilities), and thus can also
be expressed in terms of the quantities p(x|y) and p(y) that we've learned. Actually, if we are
calculating p(y|x) in order to make a prediction, then we don't actually need to calculate the
denominator, since

arg max_y p(y|x) = arg max_y p(x|y)p(y)/p(x) = arg max_y p(x|y)p(y).

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
1 Gaussian discriminant analysis
The first generative learning algorithm that we'll look at is Gaussian discriminant
analysis (GDA). In this model, we'll assume that p(x|y) is distributed
according to a multivariate normal distribution. Let's talk briefly about the
properties of multivariate normal distributions before moving on to the GDA
model itself.

1. The multivariate normal distribution


The multivariate normal distribution in n dimensions, also called the multivariate
Gaussian distribution, is parameterized by a mean vector µ ∈ Rn and a
covariance matrix Σ ∈ Rn×n, where Σ ≥ 0 is symmetric and positive semi-definite.
Also written "N(µ, Σ)", its density is given by:

p(x; µ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ)T Σ−1 (x − µ) )

In the equation above, "|Σ|" denotes the determinant of the matrix Σ.
For a random variable X distributed N(µ, Σ), the mean is (unsurprisingly) given by µ:

E[X] = ∫_x x p(x; µ, Σ) dx = µ

The covariance of a vector-valued random variable Z is defined as Cov(Z) =
E[(Z − E[Z])(Z − E[Z])T]. This generalizes the notion of the variance of a
real-valued random variable. The covariance can also be defined as Cov(Z) = E[ZZT] −
(E[Z])(E[Z])T. (You should be able to prove to yourself that these two definitions are
equivalent.) If X ∼ N(µ, Σ), then
Cov(X) = Σ.
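A small numerical sketch of this density (scipy and the particular µ, Σ and x values are my assumptions, not from the notes):

```python
# Evaluate the N(mu, Sigma) density: closed form vs. scipy, at one test point.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])     # identity: the standard normal
x = np.array([0.5, -0.5])

n = len(mu)
closed_form = (1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
               * np.exp(-0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)))
print(closed_form, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # same value
```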
Here are some examples of what the density of a Gaussian distribution looks like:

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and
covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and
identity covariance is also called the standard normal distribution. The middle figure
shows the density of a Gaussian with zero mean and Σ = 0.6I; and the rightmost figure
shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian becomes more
"spread-out," and as it becomes smaller, the distribution becomes more "compressed."
Let's look at some more examples.

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
The figures above show Gaussians with mean 0, and with covariance matrices
respectively
Σ=

The leftmost figure shows the familiar standard normal distribution, and we see that as
we increase the off-diagonal entry in Σ, the density becomes more “compressed” towards
the 45◦ line (given by x1 = x2). We can see this more clearly when we look at the contours
of the same three densities

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New Delhi,
Here’s one last set of examples generated by varying Σ:

The plots above used, respectively,


From the leftmost and middle figures, we see that by decreasing the off-diagonal elements
of the covariance matrix, the density now becomes "compressed" again, but in the
opposite direction. Lastly, as we vary the parameters, more generally the contours will
form ellipses (the rightmost figure showing an example).
As our last set of examples, fixing Σ = I, by varying µ, we can also move the mean of the
density around.

The figures above were generated using Σ = I and, respectively, different values of µ.
Department of Computer Science & Engineering , Bharati Vidyapeeth's College of Engineering, New Delhi, Subject : Machine Learning

1.2 The Gaussian Discriminant Analysis model

When we have a classification problem in which the input features x are continuous-valued random
variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models
p(x|y) using a multivariate normal distribution. The model is:

y ∼ Bernoulli(φ)
x|y = 0 ∼ N(µ0, Σ)
x|y = 1 ∼ N(µ1, Σ)

Writing out the distributions, this is:

p(y) = φ^y (1 − φ)^(1−y)
p(x|y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ0)T Σ−1 (x − µ0) )
p(x|y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ1)T Σ−1 (x − µ1) )

Here, the parameters of our model are φ, Σ, µ0 and µ1. (Note that while there are two different mean
vectors µ0 and µ1, this model is usually applied using only one covariance matrix Σ.) The log-likelihood
of the data is given by

ℓ(φ, µ0, µ1, Σ) = log ∏_{i=1}^{m} p(x^(i), y^(i); φ, µ0, µ1, Σ)
              = log ∏_{i=1}^{m} p(x^(i)|y^(i); µ0, µ1, Σ) p(y^(i); φ).

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New Delhi,
Subject : Machine Learning
By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimate of the
parameters (see problem set 1) to be:

φ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}
µ0 = Σ_{i=1}^{m} 1{y^(i) = 0} x^(i) / Σ_{i=1}^{m} 1{y^(i) = 0}
µ1 = Σ_{i=1}^{m} 1{y^(i) = 1} x^(i) / Σ_{i=1}^{m} 1{y^(i) = 1}
Σ = (1/m) Σ_{i=1}^{m} (x^(i) − µ_{y^(i)}) (x^(i) − µ_{y^(i)})T
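A numpy sketch of these closed-form estimates (written from the formulas above; the function name is mine):

```python
# Closed-form GDA parameter estimates (X: m x n data matrix, y: 0/1 labels).
import numpy as np

def fit_gda(X, y):
    m, n = X.shape
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where((y == 1)[:, None], mu1, mu0)   # x^(i) - mu_{y^(i)}
    Sigma = centered.T @ centered / m                      # shared covariance
    return phi, mu0, mu1, Sigma
```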
Pictorially, what the algorithm is doing can be seen as follows:

[Figure: training points of the two classes, the contours of the two fitted
Gaussians, and the straight-line decision boundary]

Shown in the figure are the training set, as well as the contours of the two Gaussian distributions
that have been fit to the data in each of the two classes. Note that the two Gaussians have contours
that are the same shape and orientation, since they share a covariance matrix Σ, but they have
different means µ0 and µ1. Also shown in the figure is the straight line giving the decision boundary
at which p(y = 1|x) = 0.5. On one side of the boundary, we’ll predict y = 1 to be the most likely
outcome, and on the other side, we’ll predict y= 0.

1.3 Discussion: GDA and logistic regression


The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y =
1|x; φ, µ0, µ1, Σ) as a function of x, we’ll find that it
Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
can be expressed in the form

p(y = 1|x; φ, Σ, µ0, µ1) = 1 / (1 + exp(−θT x)),

where θ is some appropriate function of φ, Σ, µ0, µ1.¹ This is exactly the form that logistic
regression, a discriminative algorithm, uses to model p(y = 1|x).
When would we prefer one model over another? GDA and logistic regression will, in general,
give different decision boundaries when trained on the same dataset. Which is better?
We just argued that if p(x|y) is multivariate gaussian (with shared Σ), then p(y|x) necessarily
follows a logistic function. The converse, however, is not true; i.e., p(y|x) being a logistic function
does not imply p(x|y) is multivariate gaussian. This shows that GDA makes stronger modeling
assumptions about the data than does logistic regression. It turns out that when these modeling
assumptions are correct, then GDA will find better fits to the data, and is a better model.
Specifically, when p(x|y) is indeed gaussian (with shared Σ), then GDA is asymptotically
efficient. Informally, this means that in the limit of very large training sets (large m), there is no
algorithm that is strictly better than GDA (in terms of, say, how accurately they estimate p(y|x)).
In particular, it can be shown that in this setting, GDA will be a better algorithm than logistic
regression; and more generally, even for small training set sizes, we would generally expect GDA to
do better.
In contrast, by making significantly weaker assumptions, logistic regression is also more robust
and less sensitive to incorrect modeling assumptions. There are many different sets of assumptions
that would lead to p(y|x) taking the form of a logistic function. For example, if x|y = 0 ∼
Poisson(λ0), and x|y = 1 ∼ Poisson(λ1), then p(y|x) will be logistic. Logistic regression will also
work well on Poisson data like this. But if we were to use GDA on such data, and fit Gaussian
distributions to such non-Gaussian data, then the results will be less predictable, and GDA may
(or may not) do well.
To summarize: GDA makes stronger modeling assumptions, and is more data efficient (i.e.,
requires less training data to learn "well") when the modeling assumptions are correct or at least
approximately correct. Logistic regression makes weaker assumptions, and is significantly more
robust to deviations from modeling assumptions. Specifically, when the data is indeed non-Gaussian,
then in the limit of large datasets, logistic regression will almost always do better than GDA. For
this reason, in practice logistic regression is used more often than GDA. (Some related considerations
about discriminative vs. generative models also apply for the Naive Bayes algorithm that we discuss
next, but the Naive Bayes algorithm is still considered a very good, and is certainly also a very
popular, classification algorithm.)

¹This uses the convention of redefining the x^(i)'s on the right-hand side to be (n + 1)-dimensional
vectors by adding the extra coordinate x0^(i) = 1; see problem set 1.

2 Naive Bayes
In GDA, the feature vectors x were continuous, real-valued vectors. Let's now talk about a different
learning algorithm in which the x_j's are discrete-valued.
For our motivating example, consider building an email spam filter using machine learning.
Here, we wish to classify messages according to whether they are unsolicited commercial (spam)
email, or non-spam email. After learning to do this, we can then have our mail reader automatically
filter out the spam messages and perhaps place them in a separate mail folder. Classifying emails is
one example of a broader set of problems called text classification.
Let’s say we have a training set (a set of emails labeled as spam or non- spam). We’ll begin our
construction of our spam filter by specifying the features x j used to represent an email.
We will represent an email via a feature vector whose length is equal to the number of words in
the dictionary. Specifically, if an email contains
1 the j-th word of the dictionary, then we will set x j =
a aardvark
1; otherwise, we let x j = 0. For instance, 0
the vector aardwolf
0
x= . . . .
1 buy .
.. .
0 zygmurgy
is used to represent an email that contains the words “a” and “buy,” but not “aardvark,”
“aardwolf” or “zygmurgy.”2 The set of words encoded intothe

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
9

feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
Having chosen our feature vector, we now want to build a generative model. So, we have to model
p(x|y). But if we have, say, a vocabulary of 50000 words, then x ∈ {0, 1}^50000 (x is a 50000-
dimensional vector of 0's and 1's), and if we were to model x explicitly with a multinomial
distribution over the 2^50000 possible outcomes, then we'd end up with a (2^50000 − 1)-dimensional
parameter vector. This is clearly too many parameters.
To model p(x|y), we will therefore make a very strong assumption. We will assume that the xi ’s
are conditionally independent given y. This assumption is called the Naive Bayes (NB)
assumption, and the resulting algorithm is called the Naive Bayes classifier. For instance, if y =
1 means spam email; “buy” is word 2087 and “price” is word 39831; then we are assuming that if I
tell you y = 1 (that a particular piece of email is spam), then knowledge of x2087 (knowledge of
whether “buy” appears in the message) will have no effect on your beliefs about the value of x39831
(whether “price” appears). More formally, this can be written p(x2087|y) = p(x2087|y, x39831). (Note
that this is not the same as saying that x2087 and x39831 are independent, which would have been
written “p(x2087) = p(x2087|x39831)”; rather, we are only assuming that x2087 and x39831 are
conditionally independent given y.)
We now have:

p(x1, . . . , x50000|y)
  = p(x1|y) p(x2|y, x1) p(x3|y, x1, x2) ··· p(x50000|y, x1, . . . , x49999)
  = p(x1|y) p(x2|y) p(x3|y) ··· p(x50000|y)
  = ∏_{j=1}^{n} p(xj|y)

The first equality simply follows from the usual properties of probabilities, and the second equality
used the NB assumption. We note that even though the Naive Bayes assumption is an extremely
strong assumption, the resulting algorithm works well on many problems.

Our model is parameterized by φ_{j|y=1} = p(x_j = 1|y = 1), φ_{j|y=0} = p(x_j = 1|y = 0), and
φ_y = p(y = 1). As usual, given a training set {(x^(i), y^(i)); i = 1, . . . , m}, we can write down
the joint likelihood of the data:

L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = ∏_{i=1}^{m} p(x^(i), y^(i)).

Maximizing this with respect to φ_y, φ_{j|y=0} and φ_{j|y=1} gives the maximum likelihood estimates:

φ_{j|y=1} = Σ_{i=1}^{m} 1{x_j^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^{m} 1{y^(i) = 1}
φ_{j|y=0} = Σ_{i=1}^{m} 1{x_j^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^{m} 1{y^(i) = 0}
φ_y = Σ_{i=1}^{m} 1{y^(i) = 1} / m

In the equations above, the "∧" symbol means "and." The parameters have a very natural
interpretation. For instance, φ_{j|y=1} is just the fraction of the spam (y = 1) emails in which word j
does appear.
Having fit all these parameters, to make a prediction on a new example with features x, we then
simply calculate

p(y = 1|x) = p(x|y = 1) p(y = 1) / p(x)
           = ( ∏_{j=1}^{n} p(x_j|y = 1) ) p(y = 1) /
             [ ( ∏_{j=1}^{n} p(x_j|y = 1) ) p(y = 1) + ( ∏_{j=1}^{n} p(x_j|y = 0) ) p(y = 0) ],

and pick whichever class has the higher posterior probability.


Lastly, we note that while we have developed the Naive Bayes algorithm mainly for the case of
problems where the features x_j are binary-valued, the generalization to where x_j can take values in
{1, 2, . . . , k_j} is straightforward. Here, we would simply model p(x_j|y) as multinomial rather than
as Bernoulli. Indeed, even if some original input attribute (say, the living area of a house, as in our
earlier example) were continuous valued, it is quite common to discretize it, that is, turn it into a
small set of discrete values, and apply Naive Bayes. For instance, if we use some feature x_j to
represent living area, we might discretize the continuous values as follows:

Living area (sq. feet):  < 400 | 400-800 | 800-1200 | 1200-1600 | > 1600
x_j:                       1   |    2    |    3     |     4     |    5

Department of Computer Science & Engineering , Bharati Vidyapeeth’s College of Engineering, New
Delhi, Subject : Machine Learning
Thus, for a house with living area 890 square feet, we would set the value of the corresponding
feature x_j to 3. We can then apply the Naive Bayes algorithm, and model p(x_j|y) with a
multinomial distribution, as described previously. When the original, continuous-valued attributes
are not well-modeled by a multivariate normal distribution, discretizing the features and using
Naive Bayes (instead of GDA) will often result in a better classifier.

2.1 Laplace smoothing


The Naive Bayes algorithm as we have described it will work fairly well for many problems, but
there is a simple change that makes it work much better, especially for text classification. Let’s
briefly discuss a problem with the algorithm in its current form, and then talk about how we can fix
it.
Consider spam/email classification, and let's suppose that, after completing CS229 and having done excellent work on the project, you decide around June 2003 to submit the work you did to the NIPS conference for publication. (NIPS is one of the top machine learning conferences, and the deadline for submitting a paper is typically in late June or early July.) Because you end up discussing the conference in your emails, you also start getting messages with the word "nips" in it. But this is your first NIPS paper, and until this time, you had not previously seen any emails containing the word "nips"; in particular, "nips" did not ever appear in your training set of spam/non-spam emails. Assuming that "nips" was the 35000th word in the dictionary, your Naive Bayes spam filter therefore had picked its maximum likelihood estimates of the parameters φ_35000|y to be

φ_35000|y=1 = Σ_{i=1}^m 1{x_35000^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1} = 0

φ_35000|y=0 = Σ_{i=1}^m 1{x_35000^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0} = 0

I.e., because it has never seen "nips" before in either spam or non-spam training examples, it thinks the probability of seeing it in either type of email is zero. Hence, when trying to decide if one of
these messages containing “nips” is spam, it calculates the class posterior probabilities, and
obtains

p(y = 1|x) = [ ∏_{j=1}^n p(x_j|y = 1) ] p(y = 1) / ( [ ∏_{j=1}^n p(x_j|y = 1) ] p(y = 1) + [ ∏_{j=1}^n p(x_j|y = 0) ] p(y = 0) )
           = 0/0.

This is because each of the terms ∏_{j=1}^n p(x_j|y) includes a term p(x_35000|y) = 0 that is multiplied into it. Hence, our algorithm obtains 0/0, and doesn't know how to make a prediction.
Stating the problem more broadly, it is statistically a bad idea to estimate the probability of some event to be zero just because you haven't seen it before in your finite training set. Take the problem of estimating the mean of a multinomial random variable z taking values in {1, . . . , k}. We can parameterize our multinomial with φ_j = p(z = j). Given a set of m independent observations {z^(1), . . . , z^(m)}, the maximum likelihood estimates are given by

φ_j = Σ_{i=1}^m 1{z^(i) = j} / m.

As we saw previously, if we were to use these maximum likelihood estimates, then some of the φ_j's might end up as zero, which was a problem. To avoid this, we can use Laplace smoothing, which replaces the above estimate with

φ_j = ( Σ_{i=1}^m 1{z^(i) = j} + 1 ) / (m + k).

Here, we've added 1 to the numerator, and k to the denominator. Note that Σ_{j=1}^k φ_j = 1 still holds, which is a desirable property since the φ_j's are estimates for probabilities that we know must sum to 1. Also, φ_j ≠ 0 for all values of j, solving our problem of probabilities being estimated as zero. Under certain (arguably quite strong) conditions, it can be shown that Laplace smoothing actually gives the optimal estimator of the φ_j's.
Returning to our Naive Bayes classifier, with Laplace smoothing, we therefore obtain the following estimates of the parameters:

φ_j|y=1 = ( Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 1} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 1} + 2 )

φ_j|y=0 = ( Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 0} + 2 )

(In practice, it usually doesn't matter much whether we apply Laplace smoothing to φ_y or not, since we will typically have a fair fraction each of spam and non-spam messages, so φ_y will be a reasonable estimate of p(y = 1) and will be quite far from 0 anyway.)
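In code, Laplace smoothing is a small change to the earlier sketch: add 1 to each word count and 2 to each class count (again, the variable names are ours):

import numpy as np

def fit_bernoulli_nb_smoothed(X, y):
    m1, m0 = np.sum(y == 1), np.sum(y == 0)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / (m1 + 2)   # +1 in the numerator, +2 in the denominator
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / (m0 + 2)
    phi_y = m1 / len(y)      # left unsmoothed, as discussed above
    return phi_y, phi_j_y1, phi_j_y0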

2.2 Event models for text classification
To close off our discussion of generative learning algorithms, let’s talk about one more model that is
specifically for text classification. While Naive Bayes as we’ve presented it will work well for many
classification problems, for text classification, there is a related model that does even better.
In the specific context of text classification, Naive Bayes as presented uses what's called the multivariate Bernoulli event model. In this model, we assumed that the way an email is generated is that first it is randomly determined (according to the class priors p(y)) whether a spammer or non-spammer will send you your next message. Then, the person sending the email runs through the dictionary, deciding whether to include each word j in that email independently and according to the probabilities p(x_j = 1|y) = φ_j|y. Thus, the probability of a message was given by p(y) ∏_{j=1}^n p(x_j|y).
Here’s a different model, called the multinomial event model. To de- scribe this model, we will use a
different notation and set of features for representing emails. We let x j denote the identity of the j-th word
in the email. Thus, x j is now an integer taking values in {1, . . . , |V |}, where |V | is the size of our vocabulary
(dictionary). An email of n words is now represent- ed by a vector (x1, x2, . . . , x n ) of length n; note that n
can vary for different documents. For instance, if an email starts with “A NIPS . . . ,” then x1 = 1 (“a” is
the first word in the dictionary), and x2 = 35000(if “nips” is the 35000th word in the dictionary).
In the multinomial event model, we assume that the way an email is generated is via a random process in which spam/non-spam is first determined (according to p(y)) as before. Then, the sender of the email writes the email by first generating x_1 from some multinomial distribution over words (p(x_1|y)). Next, the second word x_2 is chosen independently of x_1 but from the same multinomial distribution, and similarly for x_3, x_4, and so on, until all n words of the email have been generated. Thus, the overall probability of a message is given by p(y) ∏_{j=1}^n p(x_j|y). Note that this formula looks like the one we had earlier for the probability of a message under the multivariate Bernoulli event model, but that the terms in the formula now mean very different things. In particular, x_j|y is now a multinomial, rather than a Bernoulli, distribution.
The parameters for our new model are φy = p(y) as before, φk|y=1 = p(xj = k|y = 1) (for any j) and
φk|y=0 = p(xj = k|y = 0). Note that we have assumed that p(xj |y) is the same for all values of j (i.e., that
the distribution according to which a word is generated does not depend on its position j within the
email).
If we are given a training set {(x^(i), y^(i)); i = 1, . . . , m} where x^(i) = (x_1^(i), x_2^(i), . . . , x_{n_i}^(i)) (here, n_i is the number of words in the i-th training example), the likelihood of the data is given by

L(φ_y, φ_k|y=0, φ_k|y=1) = ∏_{i=1}^m p(x^(i), y^(i))
                         = ∏_{i=1}^m ( ∏_{j=1}^{n_i} p(x_j^(i)|y^(i); φ_k|y=0, φ_k|y=1) ) p(y^(i); φ_y).

Maximizing this yields the maximum likelihood estimates of the parameters:

φ_k|y=1 = Σ_{i=1}^m Σ_{j=1}^{n_i} 1{x_j^(i) = k ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1} n_i

φ_k|y=0 = Σ_{i=1}^m Σ_{j=1}^{n_i} 1{x_j^(i) = k ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0} n_i

φ_y = Σ_{i=1}^m 1{y^(i) = 1} / m.
If we were to apply Laplace smoothing (which is needed in practice for good performance) when estimating φ_k|y=0 and φ_k|y=1, we add 1 to the numerators and |V| to the denominators, and obtain:

φ_k|y=1 = ( Σ_{i=1}^m Σ_{j=1}^{n_i} 1{x_j^(i) = k ∧ y^(i) = 1} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 1} n_i + |V| )

φ_k|y=0 = ( Σ_{i=1}^m Σ_{j=1}^{n_i} 1{x_j^(i) = k ∧ y^(i) = 0} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 0} n_i + |V| ).

While not necessarily the very best classification algorithm, the Naive Bayes classifier often works
surprisingly well. It is often also a very good “first thing to try,” given its simplicity and ease of
implementation.
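Here is a short sketch of the multinomial event model with the Laplace-smoothed estimates above. The data representation is our assumption, not from the notes: docs is a list of documents, each a list of word indices in {0, . . . , |V|-1}, and y holds the 0/1 labels.

import numpy as np

def fit_multinomial_nb(docs, y, V):
    y = np.asarray(y)
    counts = np.ones((2, V))                 # Laplace: +1 in every numerator
    totals = np.full(2, float(V))            # Laplace: +|V| in every denominator
    for words, label in zip(docs, y):
        for w in words:
            counts[label, w] += 1
        totals[label] += len(words)
    log_phi_k_y = np.log(counts / totals[:, None])   # log p(word = k | y)
    log_phi_y = np.log([np.mean(y == 0), np.mean(y == 1)])
    return log_phi_k_y, log_phi_y

def predict(words, log_phi_k_y, log_phi_y):
    # log p(y) + sum_j log p(x_j | y), evaluated for y = 0 and y = 1
    scores = log_phi_y + np.array([log_phi_k_y[c, words].sum() for c in (0, 1)])
    return int(np.argmax(scores))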

Ensemble Methods: Bagging and Boosting
Bagging

• Bagging stands for Bootstrap Aggregation.
• Take the original data set D with N training examples.
• Create M copies {D̃_m}_{m=1}^M.
• Each D̃_m is generated from D by sampling with replacement.
• Each data set D̃_m has the same number of examples as the original data set D.
• These data sets are reasonably different from each other, since only about 63% of the original examples appear in any one of them (the chance that a given example appears in a bootstrap sample is 1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632 for large N).
• Train models h_1, . . . , h_M using D̃_1, . . . , D̃_M, respectively.
• Use the averaged model h = (1/M) Σ_{m=1}^M h_m as the final model (a code sketch follows below).
• Useful for models with high variance and noisy data.
Illustration (figure omitted). Top: original data. Middle: 3 models (from some model class) learned using three data sets chosen via bootstrapping. Bottom: the averaged model.
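A minimal sketch of this procedure, using bootstrap resampling and scikit-learn decision trees as the base models (the choice of base learner and the hyperparameters are ours, purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, M=25, seed=0):
    # X: (N, d) NumPy array, y: (N,) NumPy array
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)         # sample N examples with replacement
        model = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    # averaged model: h(x) = (1/M) * sum_m h_m(x)
    return np.mean([m.predict(X) for m in models], axis=0)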
Random Forests

• A random forest is an ensemble of decision tree (DT) classifiers.
• Uses bagging on features: each DT is grown using a random subset of the features.
• Given a total of D features, each DT considers only a small randomly chosen subset of them (typically around √D features).
• The randomly chosen features make the different trees uncorrelated.
• All DTs usually have the same depth.
• Each DT will split the training data differently at the leaves.
• Prediction for a test example: vote over (or average) the predictions from all the DTs (see the example below).
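In practice one usually reaches for a library implementation; a small scikit-learn example (the hyperparameter values are arbitrary, and note that scikit-learn samples a random feature subset at every split rather than once per tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=100,     # number of trees
                                max_features="sqrt",  # ~sqrt(D) features considered per split
                                max_depth=5,
                                random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))   # each prediction is a majority vote over the trees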
Boosting

The basic idea:
• Take a weak learning algorithm. The only requirement: it should be slightly better than random.
• Turn it into an awesome one by making it focus on the difficult cases.

Most boosting algorithms follow these steps:
1. Train a weak model on some training data.
2. Compute the error of the model on each training example.
3. Give higher importance to the examples on which the model made mistakes.
4. Re-train the model using the "importance weighted" training examples.
5. Go back to step 2.
AdaBoost: Algorithm

Given: training data (x_1, y_1), . . . , (x_N, y_N) with y_n ∈ {−1, +1}, ∀n
Initialize the weight of each example (x_n, y_n): D_1(n) = 1/N, ∀n
For round t = 1 : T
  • Learn a weak classifier h_t(x) → {−1, +1} using the training data weighted as per D_t.
  • Compute the weighted fraction of errors of h_t on this training data:
      ε_t = Σ_{n=1}^N D_t(n) 1[h_t(x_n) ≠ y_n]
  • Set the "importance" of h_t: α_t = (1/2) log((1 − ε_t)/ε_t) (gets larger as ε_t gets smaller).
  • Update the weight of each example:
      D_{t+1}(n) ∝ D_t(n) × exp(−α_t) if h_t(x_n) = y_n (correct prediction: decrease weight)
      D_{t+1}(n) ∝ D_t(n) × exp(α_t)  if h_t(x_n) ≠ y_n (incorrect prediction: increase weight)
    which can be written compactly as D_{t+1}(n) ∝ D_t(n) exp(−α_t y_n h_t(x_n)).
  • Normalize D_{t+1} so that it sums to 1: D_{t+1}(n) = D_{t+1}(n) / Σ_{m=1}^N D_{t+1}(m)
Output the "boosted" final hypothesis H(x) = sign(Σ_{t=1}^T α_t h_t(x)) (a code sketch follows below).
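The following NumPy sketch mirrors the pseudocode above, using simple threshold decision stumps as the weak learners (the stump search is brute force and written for clarity, not efficiency):

import numpy as np

def train_stump(X, y, D):
    # weak learner: exhaustively pick the (feature, threshold, sign) with lowest weighted error
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] >= thr, s, -s)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best

def stump_predict(stump, X):
    j, thr, s = stump
    return np.where(X[:, j] >= thr, s, -s)

def adaboost(X, y, T=10):
    # y must be a NumPy array with values in {-1, +1}
    N = len(y)
    D = np.full(N, 1.0 / N)                       # D_1(n) = 1/N
    stumps, alphas = [], []
    for t in range(T):
        h = train_stump(X, y, D)
        pred = stump_predict(h, X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)     # importance of h_t
        D = D * np.exp(-alpha * y * pred)         # decrease weight if correct, increase if wrong
        D = D / D.sum()                           # normalize to sum to 1
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * stump_predict(h, X) for h, a in zip(stumps, alphas)))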
AdaBoost: Example

• Consider binary classification with 10 training examples.
• The initial weight distribution D_1 is uniform (each point has equal weight = 1/10).
• Each of our weak classifiers will be an axis-parallel linear classifier.

After Round 1
• Error rate of h_1: ε_1 = 0.3; weight of h_1: α_1 = (1/2) ln((1 − ε_1)/ε_1) = 0.42.
• Each misclassified point is upweighted (weight multiplied by exp(α_1)).
• Each correctly classified point is downweighted (weight multiplied by exp(−α_1)).

After Round 2
• Error rate of h_2: ε_2 = 0.21; weight of h_2: α_2 = (1/2) ln((1 − ε_2)/ε_2) = 0.65.
• Each misclassified point is upweighted (weight multiplied by exp(α_2)).
• Each correctly classified point is downweighted (weight multiplied by exp(−α_2)).

After Round 3
• Error rate of h_3: ε_3 = 0.14; weight of h_3: α_3 = (1/2) ln((1 − ε_3)/ε_3) = 0.92.
• Suppose we decide to stop after round 3.
• Our ensemble now consists of 3 classifiers: h_1, h_2, h_3 (the α values are checked numerically below).
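The α values above come straight from the formula α_t = (1/2) ln((1 − ε_t)/ε_t); a two-line check (the quoted 0.65 and 0.92 were presumably computed from unrounded error rates, hence the small differences):

import numpy as np

for eps in (0.30, 0.21, 0.14):
    print(eps, round(0.5 * np.log((1 - eps) / eps), 2))
# prints: 0.3 0.42,  then 0.21 0.66,  then 0.14 0.91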
Final Classifier

• The final classifier is a weighted linear combination of all the classifiers.
• Classifier h_i gets a weight α_i.
• Multiple weak, linear classifiers are combined to give a strong, nonlinear classifier.
Another Example

• Given: a nonlinearly separable data set.
• We want to use the Perceptron (a linear classifier) on this data.
Round 1: after round 1, our ensemble has 1 linear classifier (Perceptron).
Round 2: after round 2, our ensemble has 2 linear classifiers (Perceptrons).
Round 3: after round 3, our ensemble has 3 linear classifiers (Perceptrons).
Round 4: after round 4, our ensemble has 4 linear classifiers (Perceptrons).
Round 5: after round 5, our ensemble has 5 linear classifiers (Perceptrons).
(In each round's figure, the X axis is the number of rounds and the Y axis is the training error.)

Round 6: after round 6, our ensemble has 6 linear classifiers (Perceptrons).
Round 7: after round 7, our ensemble has 7 linear classifiers (Perceptrons).
Round 40: after round 40, our ensemble has 40 linear classifiers (Perceptrons), and the training error has kept decreasing.
Boosting a Decision Stump Classifier

• A decision stump (DS) is a tree with a single node, testing the value of a single feature, say the d-th feature.
• Suppose each example x has D binary features {x_d}_{d=1}^D, with x_d ∈ {0, 1}, and the label y is also binary, i.e., y ∈ {−1, +1}.
• The DS (assuming it tests the d-th feature) will predict the label as h(x) = s_d(2x_d − 1), where s_d ∈ {−1, +1}.
• Suppose we have T such decision stumps h_1, . . . , h_T, testing feature numbers i_1, . . . , i_T with signs s_1, . . . , s_T, respectively, i.e., h_t(x) = s_t(2x_{i_t} − 1).
• The boosted hypothesis H(x) = sign(Σ_{t=1}^T α_t h_t(x)) can then be written as a linear classifier:

  H(x) = sign( Σ_{t=1}^T α_t s_t (2x_{i_t} − 1) ) = sign( Σ_{t=1}^T 2 α_t s_t x_{i_t} − Σ_{t=1}^T α_t s_t ) = sign(w^T x + b),

  where w_d = Σ_{t : i_t = d} 2 α_t s_t and b = −Σ_t α_t s_t (a small numerical check follows below).
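A tiny numerical check of this identity, with made-up stump parameters (i_t, s_t, α_t):

import numpy as np

D = 4                                                    # number of binary features
stumps = [(0, +1, 0.42), (2, -1, 0.65), (0, +1, 0.92)]   # (i_t, s_t, alpha_t), made up

w = np.zeros(D)
b = 0.0
for i_t, s_t, alpha_t in stumps:
    w[i_t] += 2 * alpha_t * s_t        # w_d = sum over stumps testing feature d of 2*alpha_t*s_t
    b -= alpha_t * s_t                 # b = -sum_t alpha_t * s_t

x = np.array([1, 0, 1, 0])             # an arbitrary binary example
H_boosted = np.sign(sum(a * s * (2 * x[i] - 1) for i, s, a in stumps))
H_linear = np.sign(w @ x + b)
print(H_boosted == H_linear)           # True: the two forms agree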
Some Comments

• For AdaBoost, if each weak model's error is ε_t = 1/2 − γ_t, the training error consistently gets better with the number of rounds T:
    train-error(H_final) ≤ exp(−2 Σ_{t=1}^T γ_t²)
• Boosting algorithms can be shown to be minimizing a loss function. E.g., AdaBoost has been shown to be minimizing an exponential loss
    L = Σ_{n=1}^N exp{−y_n f(x_n)},
  where f(x) = Σ_{t=1}^T α_t h_t(x) is the unthresholded weighted combination (and H(x) = sign(f(x))), given weak base classifiers h_1, . . . , h_T (a small sketch of this loss follows below).
• Boosting in general can perform badly if some examples are outliers.
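For concreteness, the exponential loss can be evaluated as follows, given the per-round predictions of the weak classifiers on the training set (the array shapes are our assumption):

import numpy as np

def exponential_loss(alphas, weak_preds, y):
    # weak_preds: (T, N) array with weak_preds[t, n] = h_t(x_n) in {-1, +1}; y: (N,) in {-1, +1}
    f = np.asarray(alphas) @ np.asarray(weak_preds)   # f(x_n) = sum_t alpha_t * h_t(x_n)
    return np.sum(np.exp(-y * f))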
Bagging vs Boosting

• No clear winner; it usually depends on the data.
• Bagging is computationally more efficient than boosting (bagging can train the M models in parallel; boosting can't, since each round depends on the previous one).
• Both reduce variance (and overfitting) by combining different models; the resulting model has higher stability compared to the individual ones.
• Bagging usually can't reduce the bias, but boosting can (note that in boosting, the training error steadily decreases).
• Bagging usually performs better than boosting if we don't have high bias and only want to reduce variance (i.e., if we are overfitting).
