Unit 2
outlook   temp.  humidity  windy  play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no
Bayes' theorem:

  P(h|D) = P(D|h) P(h) / P(D)    (1)

Under the naive Bayes assumption, the attributes a1, ..., an are conditionally independent given the class vj:

  P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)    (5)

Hence we get the following classifier:

  vNB = arg max_{vj ∈ {yes, no}} P(vj) ∏_i P(ai | vj)
      = arg max_{vj ∈ {yes, no}} P(vj) P(outlook = sunny | vj) P(temp. = cool | vj) ...
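As a concrete illustration (not part of the original slides), here is a minimal Python sketch that estimates these probabilities by counting over the weather table above and classifies a new day; the query day (sunny, cool, high, true) is a hypothetical example.

```python
# Minimal sketch: Naive Bayes by counting on the weather data above.
from collections import Counter, defaultdict

# (outlook, temp, humidity, windy) -> play, copied from the table above
data = [
    ("sunny", "hot", "high", "false", "no"),     ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]

classes = ["yes", "no"]
prior = Counter(row[-1] for row in data)          # class counts
cond = defaultdict(Counter)                       # counts of (attribute index, value) per class
for *attrs, label in data:
    for i, a in enumerate(attrs):
        cond[(i, a)][label] += 1

def p_naive_bayes(attrs, v):
    """P(v) * prod_i P(a_i | v), estimated by relative frequencies."""
    p = prior[v] / len(data)
    for i, a in enumerate(attrs):
        p *= cond[(i, a)][v] / prior[v]
    return p

# Classify a new day: outlook=sunny, temp=cool, humidity=high, windy=true
x = ("sunny", "cool", "high", "true")
scores = {v: p_naive_bayes(x, v) for v in classes}
print(scores, "->", max(scores, key=scores.get))
```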
[Figure: a perceptron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias b, net input v, and output y. Caption: the perceptron in terms of a biological neuron.]
Find suitable weights such that the training examples are correctly classified.
Geometrically, try to find a hyperplane that separates the examples of the two classes.
[Figure: a single-layer network mapping inputs x1, ..., xn to an output.]
Initialize
• Assign random weights to all links
Training
• Feed in known inputs in random sequence
• Simulate the network
• Compute the error between the desired and the actual output (error function)
• Adjust the weights (learning function)
• Repeat until the total error < ε
Thinking
• Simulate the network, even for inputs it was not trained on
• The network will respond to any input
Perceptron learning algorithm:
Training patterns are presented to the network's inputs and the output is computed. Then the connection weights wj are modified by an amount that is proportional to the product of the difference between the actual output, y, and the desired output, d, and the input pattern, x.
The algorithm is as follows:
1. Initialize the weights and threshold to small random numbers.
2. Present a vector x to the neuron inputs and calculate the output.
3. Update the weights according to:

   wj(t + 1) = wj(t) + η (d − y) xj

   where
   d is the desired output,
   t is the iteration number, and
   η is the gain or step size, where 0.0 < η < 1.0.
4. Repeat steps 2 and 3 until:
   the iteration error is less than a user-specified error threshold, or
   a predetermined number of iterations have been completed.
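A minimal sketch (not from the slides) of this training loop in Python, assuming a 0/1 threshold unit and the update rule above; the AND data set at the end is just an illustrative choice.

```python
# Perceptron learning rule: w_j(t+1) = w_j(t) + eta * (d - y) * x_j,
# with the bias/threshold handled as an extra weight on a constant input of 1.
import numpy as np

def train_perceptron(X, d, eta=0.1, max_iter=100):
    """X: (N, n) inputs, d: (N,) desired outputs in {0, 1}."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])      # append 1 for the bias
    w = np.random.uniform(-0.05, 0.05, X.shape[1])    # small random initial weights
    for t in range(max_iter):
        errors = 0
        for x, target in zip(X, d):                   # present patterns one by one
            y = 1 if np.dot(w, x) > 0 else 0          # threshold activation
            w += eta * (target - y) * x               # perceptron update rule
            errors += int(y != target)
        if errors == 0:                               # stop once all patterns are correct
            break
    return w

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```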
Perceptron Training
Support Vector Machines
Unit 2
Eighth Semester
Subject code: ETCS102
Department of Computer Science & Engineering,
Bharati Vidyapeeth’s College of Engineering,
New Delhi
Subject: Machine learning
Outline
Introduction
Towards SVM
Basic Concept
Implementations
Issues
Conclusion & References
Is the problem solved? ... NO!
o Is the risk of ft selected by Empirical Risk Minimization (ERM) near to that of the ideal fi?
o No, not in the case of overfitting.
o An important result of statistical learning theory:

  E[R(ft)] ≤ R(fi) + C √( V(F) / N )
Structural Risk Minimization (SRM)
o Consider a nested family of function classes F:

  F0 ⊂ F1 ⊂ ... ⊂ Fn ⊂ ... ⊂ F

  s.t.

  V(F0) ≤ V(F1) ≤ ... ≤ V(Fn) ≤ ... ≤ V(F)
SRM graphically: E[R(ft)] ≤ R(fi) + C √( V(F) / N )
A: What does this have to do with SVM?
B: SVM is an approximate implementation of SRM!
A: How?
B: In simple terms for now, we just import one result:
Maximizing the distance of the decision boundary from the training points minimizes the VC dimension, resulting in good generalization!
A: So from now on our target is maximizing the distance between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well, but can you please explain the concept of SVM and how to implement it? Are there any packages available?
B: Yeah, don't worry, there are many implementations available; just use them for your application. The next part of the presentation will give a basic idea of SVM, so stay with me!
Basic Concept of SVM:
o Which line will classify the unseen data well?
o The dotted line! It is the line with maximum margin!
Cont...
[Figure: the separating hyperplane W^T X + b = 0 together with the margin hyperplanes W^T X + b = −1 and W^T X + b = +1.]
Some definitions:
o Functional margin:
  1) w.r.t. individual examples: γ̂(i) = y(i) (W^T x(i) + b)
o Geometric margin:
  1) w.r.t. individual examples: γ(i) = y(i) ( (W / ||W||)^T x(i) + b / ||W|| )
  2) w.r.t. the example set S: γ = min_{i = 1, ..., m} γ(i)
Problem Formulation:
[Figure: the separating hyperplane W^T X + b = 0 with the margin hyperplanes W^T X + b = −1 and W^T X + b = +1, and the margin between them.]
Cont..
o The distance of a point (u, v) from the line Ax + By + C = 0 is given by |Au + Bv + C| / ||n||, where ||n|| is the norm of the vector n = (A, B).
o Distance of the hyperplane from the origin = b / ||W||
o Distance of point A from the origin = (b + 1) / ||W||
o Distance of point B from the origin = (b − 1) / ||W||
o Distance between points A and B (the margin) = 2 / ||W||
Cont...
We have a data set {X(i), Y(i)}, i = 1, ..., m, with X ∈ R^d and Y ∈ R^1, and a separating hyperplane

  W^T X + b = 0

s.t.

  W^T X(i) + b > 0 if Y(i) = +1
  W^T X(i) + b < 0 if Y(i) = −1
Cont...
o Suppose the training data also satisfy the following constraints:

  W^T X(i) + b ≥ +1 for Y(i) = +1
  W^T X(i) + b ≤ −1 for Y(i) = −1

Combining these into one:

  Y(i) (W^T X(i) + b) ≥ 1 for all i
THE PROBLEM:

  max_{W, b}  2 / ||W||

such that

  Y(i) (W^T X(i) + b) ≥ 1 for all i

Also we know that ||W|| = √(W^T W).
Cont..
So the problem can be written as:

  min_{W, b}  (1/2) W^T W

such that

  Y(i) (W^T X(i) + b) ≥ 1 for all i
Constructing the Lagrangian
o The Lagrangian for our problem:

  L(W, b, α) = (1/2) ||W||² − Σ_{i=1}^{m} α_i [ Y(i) (W^T X(i) + b) − 1 ]
Cont...
o Setting the derivative w.r.t. W to zero gives:

  W − Σ_{i=1}^{m} α_i Y(i) X(i) = 0,  i.e.  W = Σ_{i=1}^{m} α_i Y(i) X(i)

and setting the derivative w.r.t. b to zero gives:

  Σ_{i=1}^{m} α_i Y(i) = 0
Cont...
o Plugging these results into the Lagrangian gives

  L(W, b, α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y(i) Y(j) α_i α_j (X(i))^T (X(j))

o Call it

  D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y(i) Y(j) α_i α_j (X(i))^T (X(j))

o This is the result of our minimization w.r.t. W and b.
So the DUAL:
o Now the dual becomes:

  max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y(i) Y(j) α_i α_j ⟨X(i), X(j)⟩

  s.t.
    α_i ≥ 0, i = 1, ..., m
    Σ_{i=1}^{m} α_i Y(i) = 0
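As an illustration (not part of the slides), the dual above can be solved numerically for a tiny hand-made 2-D data set; the choice of scipy's SLSQP solver and the particular points are assumptions of this sketch. For anything beyond a toy example, a dedicated QP or SMO-based solver (such as the implementations listed later) is the usual choice.

```python
# Solve the SVM dual for a small linearly separable 2-D data set.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(Y)
G = (Y[:, None] * Y[None, :]) * (X @ X.T)          # G_ij = Y_i Y_j <X_i, X_j>

def neg_dual(alpha):                                # minimize the negative of D(alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])  # sum_i alpha_i Y_i = 0
alpha = res.x

W = (alpha * Y) @ X                                 # W = sum_i alpha_i Y_i X_i
sv = np.argmax(alpha)                               # pick a support vector (alpha_i > 0)
b = Y[sv] - W @ X[sv]                               # from Y_sv (W^T X_sv + b) = 1
print(alpha.round(3), W.round(3), float(b))
```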
What if the data is not linearly separable?
o The maximal margin hyperplane can classify only linearly separable data.
o What if the data is linearly non-separable?
o Map your data to a space where it is linearly separable (a higher dimensional space) and use the maximal margin hyperplane there!
Taking it to a higher dimension works!
Example: XOR
Doing it in a higher dimensional space
o Let Φ : X → F be a non-linear mapping from the input space X (the original space) to the feature space F (higher dimensional).
o Then our inner (dot) product ⟨X(i), X(j)⟩ in the higher dimensional space is ⟨φ(X(i)), φ(X(j))⟩, and the dual is the same as before with this inner product substituted,
  s.t.
    α_i ≥ 0, i = 1, ..., m
    Σ_{i=1}^{m} α_i Y(i) = 0
Kernel function:
o There exists a way to compute the inner product in the feature space as a function of the original input points: the kernel function!
o Kernel function: K(x, z) = ⟨φ(x), φ(z)⟩
o We need not know φ to compute K(x, z).
An example:
Let x, z ∈ R^n and K(x, z) = (x^T z)^2, i.e.

  K(x, z) = ( Σ_{i=1}^{n} xi zi ) ( Σ_{j=1}^{n} xj zj )
          = Σ_{i=1}^{n} Σ_{j=1}^{n} xi xj zi zj
          = Σ_{i,j=1}^{n} (xi xj)(zi zj)

For n = 3, the feature mapping φ is given by

  φ(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)^T

so that K(x, z) = ⟨φ(x), φ(z)⟩.
Example cont...
o Here, for K(x, z) = (x^T z)^2 with x = (1, 2)^T and z = (3, 4)^T:

  φ(x) = (x1x1, x1x2, x2x1, x2x2)^T = (1, 2, 2, 4)^T
  φ(z) = (9, 12, 12, 16)^T

  x^T z = 1·3 + 2·4 = 11, so K(x, z) = (x^T z)^2 = 121
  φ(x)^T φ(z) = 1·9 + 2·12 + 2·12 + 4·16 = 121
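A quick numerical check of this example (not in the slides), using numpy:

```python
# Verify that K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for x = (1, 2), z = (3, 4).
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

K = (x @ z) ** 2                           # kernel value: 11^2 = 121
phi = lambda v: np.outer(v, v).ravel()     # phi(v) lists all products v_i v_j
print(K, phi(x) @ phi(z))                  # both print 121.0
```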
So our SVM for the non-linearly separable data:
o Optimization problem:

  max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y(i) Y(j) α_i α_j K(X(i), X(j))

  s.t.
    α_i ≥ 0, i = 1, ..., m
    Σ_{i=1}^{m} α_i Y(i) = 0

o Decision function:

  F(X) = sign( Σ_{i=1}^{m} α_i Y(i) K(X(i), X) + b )
Some commonly used kernel functions:
o Linear: K(X, Y) = X^T Y
o Polynomial of degree d: K(X, Y) = (X^T Y + 1)^d
o Gaussian Radial Basis Function (RBF): K(X, Y) = exp( −||X − Y||² / (2σ²) )
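For concreteness (not part of the slides), the three kernels written as plain Python functions; the parameter defaults d = 3 and σ = 1 are arbitrary choices.

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                            # X^T Y

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1.0) ** d                               # (X^T Y + 1)^d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))  # exp(-||X - Y||^2 / (2 sigma^2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```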
Implementations:
Some ready-to-use SVM implementations:
1) LIBSVM: a library for SVM by Chih-Chung Chang and Chih-Jen Lin
   (at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: an implementation in C by Thorsten Joachims
   (at: http://svmlight.joachims.org/)
3) Weka: a data mining software in Java by the University of Waikato
   (at: http://www.cs.waikato.ac.nz/ml/weka/)
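As a hedged usage sketch (not in the slides), scikit-learn's SVC, which is built on top of LIBSVM (listed above), can be used as follows; the Iris data set and the parameter values are illustrative choices.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian RBF kernel, as on the previous slide
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```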
Issues:
o Selecting a suitable kernel: most of the time it is trial and error.
o Multiclass classification: one decision function for each class (one class vs. the other l − 1) and then pick the one with the maximum value, i.e. if X belongs to class 1, then for this and the other (l − 1) classes the values of the decision functions are:

  F1(X) ≥ +1
  F2(X) ≤ −1
  ...
  Fl(X) ≤ −1
Cont....
o Sensitive to noise: mislabeled data can badly affect the performance.
o Good performance for applications like:
  1) computational biology and medical applications (protein and cancer classification problems)
  2) image classification
  3) hand-written character recognition
  and many others...
o Use SVM for high dimensional, linearly separable data (its strength); for non-linearly separable data the performance depends on the choice of kernel.
Conclusion:
Support Vector Machines provide a very simple method for linear classification. But the performance, in the case of non-linearly separable data, largely depends on the choice of kernel!
Generative Learning algorithms
So far, we’ve mainly been talking about learning algorithms that model p(y|x; θ),
the conditional distribution of y given x. For instance, logistic regression
modeled p(y|x; θ) as hθ(x) = g(θ^T x) where g is the sigmoid function. In these
notes, we’ll talk about a different type of learning algorithm.
Consider a classification problem in which we want to learn to distinguish
between elephants (y = 1) and dogs (y = 0), based on some features of an
animal. Given a training set, an algorithm like logistic regression or the
perceptron algorithm (basically) tries to find a straight line—that is, a decision
boundary—that separates the elephants and dogs. Then, to classify a new animal
as either an elephant or a dog, it checks on which side of the decision boundary it
falls, and makes its prediction accordingly.
Here’s a different approach. First, looking at elephants, we can build a model
of what elephants look like. Then, looking at dogs, we can build a separate model
of what dogs look like. Finally, to classify a new animal, we can match the new
animal against the elephant model, and match it against the dog model, to see
whether the new animal looks more like the elephants or more like the dogs we
had seen in the training set.
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try
to learn mappings directly from the space of inputs X to the labels {0, 1}, (such as the perceptron
algorithm) are called discrim- inative learning algorithms. Here, we’ll talk about algorithms that
instead try to model p(x|y) (and p(y)). These algorithms are called generative learning
algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then
p(x|y = 0) models the distribution of dogs’ features, and p(x|y = 1) models the distribution of
elephants’ features.
After modeling p(y) (called the class priors) and p(x|y), our algorithm
can then use Bayes rule to derive the posterior distribution on y given x:
p(y|x) = p(x|y) p(y) / p(x)
Here, the denominator is given by p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0) (you should
be able to verify that this is true from the standard properties of probabilities), and thus can also
be expressed in terms of the quantities p(x|y) and p(y) that we've learned. Actually, if we're calculating p(y|x) in order to make a prediction, then we don't actually need to calculate the denominator, since

  arg max_y p(y|x) = arg max_y p(x|y) p(y) / p(x) = arg max_y p(x|y) p(y).
1 Gaussian discriminant analysis
The first generative learning algorithm that we'll look at is Gaussian discriminant analysis (GDA). In this model, we'll assume that p(x|y) is distributed
according to a multivariate normal distribution. Let’s talk briefly about the
properties of multivariate normal distributions before moving on to the GDA
model itself.
real-valued random variable. The covariance can also be defined as Cov(Z) = E[ZZT ] −
(E[Z])(E[Z])T . (You should be able to prove to yourself that these two definitions are
equivalent.) If X ∼ N (µ, Σ), then
Cov(X) = Σ.
Here are some examples of what the density of a Gaussian distribution looks like:
The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6I; and the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian becomes more "spread-out," and as it becomes smaller, the distribution becomes more "compressed."
Let’s look at some more examples.
The figures above show Gaussians with mean 0 and with three different covariance matrices Σ.
The leftmost figure shows the familiar standard normal distribution, and we see that as
we increase the off-diagonal entry in Σ, the density becomes more “compressed” towards
the 45◦ line (given by x1 = x2). We can see this more clearly when we look at the contours
of the same three densities
Here’s one last set of examples generated by varying Σ:
The figures above were generated using Σ = I and, respectively, different values of the mean µ.
By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates of the parameters (see problem set 1) to be:

  φ = (1/m) Σ_{i=1}^{m} 1{y(i) = 1}

  µ0 = Σ_{i=1}^{m} 1{y(i) = 0} x(i) / Σ_{i=1}^{m} 1{y(i) = 0}

  µ1 = Σ_{i=1}^{m} 1{y(i) = 1} x(i) / Σ_{i=1}^{m} 1{y(i) = 1}

  Σ = (1/m) Σ_{i=1}^{m} (x(i) − µ_{y(i)}) (x(i) − µ_{y(i)})^T
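A sketch (not in the notes) of these maximum likelihood estimates in numpy; the toy two-blob data set is an assumption of the example.

```python
# Fit the GDA parameters phi, mu0, mu1, and the shared covariance Sigma by MLE.
import numpy as np

def fit_gda(X, y):
    """X: (m, d) features, y: (m,) labels in {0, 1}."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mu = np.where((y == 1)[:, None], mu1, mu0)    # mu_{y(i)} for each example
    diff = X - mu
    Sigma = diff.T @ diff / m                     # shared covariance matrix
    return phi, mu0, mu1, Sigma

# toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fit_gda(X, y))
```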
Pictorially, what the algorithm is doing can be seen as follows:
[Figure: the training set with the contours of the two fitted Gaussian distributions and the straight-line decision boundary.]
Shown in the figure are the training set, as well as the contours of the two Gaussian distributions
that have been fit to the data in each of the two classes. Note that the two Gaussians have contours
that are the same shape and orientation, since they share a covariance matrix Σ, but they have
different means µ0 and µ1. Also shown in the figure is the straight line giving the decision boundary
at which p(y = 1|x) = 0.5. On one side of the boundary, we’ll predict y = 1 to be the most likely
outcome, and on the other side, we’ll predict y= 0.
almost always do better than GDA. For this reason, in practice logistic regression is used more often than GDA. (Some related considerations about discriminative vs. generative models also apply for the Naive Bayes algorithm that we discuss next, but the Naive Bayes algorithm is still considered a very good, and is certainly also a very popular, classification algorithm.)
2 Naive Bayes
In GDA, the feature vectors x were continuous, real-valued vectors. Let’s now talk about a different
learning algorithm in which the xj's are discrete-valued.
For our motivating example, consider building an email spam filter using machine learning.
Here, we wish to classify messages according to whether they are unsolicited commercial (spam)
email, or non-spam email. After learning to do this, we can then have our mail reader automatically
filter out the spam messages and perhaps place them in a separate mail folder. Classifying emails is
one example of a broader set of problems called text classification.
Let's say we have a training set (a set of emails labeled as spam or non-spam). We'll begin our
construction of our spam filter by specifying the features x j used to represent an email.
We will represent an email via a feature vector whose length is equal to the number of words in
the dictionary. Specifically, if an email contains
the j-th word of the dictionary, then we will set xj = 1; otherwise, we let xj = 0. For instance, the vector

  x = ( 1 (a), 0 (aardvark), 0 (aardwolf), ..., 1 (buy), ..., 0 (zygmurgy) )^T

is used to represent an email that contains the words "a" and "buy," but not "aardvark," "aardwolf" or "zygmurgy." The set of words encoded into the
feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
Having chosen our feature vector, we now want to build a generative model. So, we have to model
p(x|y). But if we have, say, a vocabulary of 50000 words, then x ∈ {0, 1}^50000 (x is a 50000-dimensional vector of 0's and 1's), and if we were to model x explicitly with a multinomial distribution over the 2^50000 possible outcomes, then we'd end up with a (2^50000 − 1)-dimensional parameter vector. This is clearly too many parameters.
To model p(x|y), we will therefore make a very strong assumption. We will assume that the xi ’s
are conditionally independent given y. This assumption is called the Naive Bayes (NB)
assumption, and the resulting algorithm is called the Naive Bayes classifier. For instance, if y =
1 means spam email; “buy” is word 2087 and “price” is word 39831; then we are assuming that if I
tell you y = 1 (that a particular piece of email is spam), then knowledge of x2087 (knowledge of
whether “buy” appears in the message) will have no effect on your beliefs about the value of x39831
(whether “price” appears). More formally, this can be written p(x2087|y) = p(x2087|y, x39831). (Note
that this is not the same as saying that x2087 and x39831 are independent, which would have been
written “p(x2087) = p(x2087|x39831)”; rather, we are only assuming that x2087 and x39831 are
conditionally independent given y.)
We now have:
p(x1, . . . , x50000|y)
= p(x1|y)p(x2|y, x1)p(x3|y, x1, x2) ···p(x50000|y, x1, . . . , x49999)
= p(x1|y)p(x2|y)p(x3|y) ···p(x50000|y)
= ∏_{j=1}^{n} p(xj|y)
The first equality simply follows from the usual properties of probabilities, and the second equality used the NB assumption. We note that even though the Naive Bayes assumption is an extremely strong assumption, the resulting algorithm works well on many problems.
Our model is parameterized by φj|y=1 = p(xj = 1|y = 1), φj|y=0 = p(xj = 1|y = 0), and φy = p(y = 1). As usual, given a training set {(x(i), y(i)); i = 1, . . . , m}, we can write down the joint likelihood of the data:

  L(φy, φj|y=0, φj|y=1) = ∏_{i=1}^{m} p(x(i), y(i)).

Maximizing this with respect to φy, φj|y=0 and φj|y=1 gives the maximum likelihood estimates:

  φj|y=1 = Σ_{i=1}^{m} 1{xj(i) = 1 ∧ y(i) = 1} / Σ_{i=1}^{m} 1{y(i) = 1}

  φj|y=0 = Σ_{i=1}^{m} 1{xj(i) = 1 ∧ y(i) = 0} / Σ_{i=1}^{m} 1{y(i) = 0}

  φy = Σ_{i=1}^{m} 1{y(i) = 1} / m
In the equations above, the “∧” symbol means “and.” The parameters have a very natural
interpretation. For instance, φj|y=1 is just the fraction of the spam (y = 1) emails in which word j
does appear.
Having fit all these parameters, to make a prediction on a new example with features x, we then simply calculate

  p(y = 1|x) = p(x|y = 1) p(y = 1) / p(x)
             = ( ∏_{j=1}^{n} p(xj|y = 1) ) p(y = 1) / [ ( ∏_{j=1}^{n} p(xj|y = 1) ) p(y = 1) + ( ∏_{j=1}^{n} p(xj|y = 0) ) p(y = 0) ],

and pick whichever class has the higher posterior probability.
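A sketch (not in the notes) of fitting these parameters and computing p(y = 1|x) for binary feature vectors; the tiny 4-word vocabulary is made up for illustration, and the raw maximum likelihood estimates used here still suffer from the zero-probability issue discussed below.

```python
# Bernoulli Naive Bayes: fit the phi parameters by counting, then compute p(y = 1 | x).
import numpy as np

def fit_nb(X, y):
    """X: (m, n) binary feature vectors, y: (m,) labels in {0, 1}."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)        # fraction of spam emails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)        # fraction of non-spam emails containing word j
    return phi_y, phi_j_y1, phi_j_y0

def predict_nb(x, phi_y, phi_j_y1, phi_j_y0):
    """Return p(y = 1 | x) for a binary feature vector x."""
    # p(x | y) = prod_j p(x_j | y) under the NB assumption
    p_x_y1 = np.prod(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))
    p_x_y0 = np.prod(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))
    num = p_x_y1 * phi_y
    return num / (num + p_x_y0 * (1 - phi_y))

# tiny made-up example with a 4-word vocabulary
X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]])
y = np.array([1, 1, 0, 0])
params = fit_nb(X, y)
print(predict_nb(np.array([1, 1, 1, 0]), *params))
```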
Thus, for a house with living area 890 square feet, we would set the value of the corresponding
feature x j to 3. We can then apply the Naive Bayes algorithm, and model p(xj |y) with a
multinomial distribution, as described previously. When the original, continuous-valued attributes
are not well- modeled by a multivariate normal distribution, discretizing the features and using
Naive Bayes (instead of GDA) will often result in a better classifier.
If the new email contains a word (say, word 35000) that never appeared in the training set, then both φ35000|y=1 and φ35000|y=0 are estimated as zero, and the posterior evaluates to p(y = 1|x) = 0/0.
This is because each of the terms ∏_{j=1}^{n} p(xj|y) includes a term p(x35000|y) = 0 that is multiplied into it. Hence, our algorithm obtains 0/0, and doesn't know how to make a prediction.
Stating the problem more broadly, it is statistically a bad idea to estimate the probability of some event to be zero just because you haven't seen it before in your finite training set. Take the problem of estimating the mean of a multinomial random variable z taking values in {1, . . . , k}. We can parameterize our multinomial with φj = p(z = j). Given a set of m independent observations {z(1), . . . , z(m)}, the maximum likelihood estimates are given by

  φj = (1/m) Σ_{i=1}^{m} 1{z(i) = j}.
As we saw previously, if we were to use these maximum likelihood estimates, then some of the φj's might end up as zero, which was a problem. To avoid this, we can use Laplace smoothing, which replaces the above estimate with

  φj = ( Σ_{i=1}^{m} 1{z(i) = j} + 1 ) / (m + k).

Here, we've added 1 to the numerator, and k to the denominator. Note that Σ_{j=1}^{k} φj = 1 still holds, which is a desirable property since the φj's are estimates for probabilities that we know must sum to 1. Also, φj ≠ 0 for all values of j, solving our problem of probabilities being estimated as zero. Under certain (arguably quite strong) conditions, it can be shown that the Laplace smoothing actually gives the optimal estimator of the φj's.
Returning to our Naive Bayes classifier, with Laplace smoothing, we therefore obtain the following estimates of the parameters:

  φj|y=1 = ( Σ_{i=1}^{m} 1{xj(i) = 1 ∧ y(i) = 1} + 1 ) / ( Σ_{i=1}^{m} 1{y(i) = 1} + 2 )

  φj|y=0 = ( Σ_{i=1}^{m} 1{xj(i) = 1 ∧ y(i) = 0} + 1 ) / ( Σ_{i=1}^{m} 1{y(i) = 0} + 2 )
(In practice, it usually doesn't matter much whether we apply Laplace smoothing to φy or not, since we will typically have a fair fraction each of spam and non-spam messages, so φy will be a reasonable estimate of p(y = 1) and will be quite far from 0 anyway.)
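A sketch (not in the notes) of the same fit with Laplace smoothing, which could replace fit_nb in the earlier sketch:

```python
import numpy as np

def fit_nb_laplace(X, y):
    """X: (m, n) binary feature vectors, y: (m,) labels in {0, 1}."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)   # +1 / +2 Laplace smoothing
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y1, phi_j_y0
```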
2.2 Event models for text classification
To close off our discussion of generative learning algorithms, let’s talk about one more model that is
specifically for text classification. While Naive Bayes as we’ve presented it will work well for many
classification problems, for text classification, there is a related model that does even better.
In the specific context of text classification, Naive Bayes as presented uses what's called the multi-variate Bernoulli event model. In this model, we assumed that the way an email is generated is that first it is randomly determined (according to the class priors p(y)) whether a spammer or non-spammer will send you your next message. Then, the person sending the email runs through the dictionary, deciding whether to include each word j in that email independently and according to the probabilities p(xj = 1|y) = φj|y. Thus, the probability of a message was given by p(y) ∏_{j=1}^{n} p(xj|y).
Here's a different model, called the multinomial event model. To describe this model, we will use a different notation and set of features for representing emails. We let xj denote the identity of the j-th word in the email. Thus, xj is now an integer taking values in {1, . . . , |V|}, where |V| is the size of our vocabulary (dictionary). An email of n words is now represented by a vector (x1, x2, . . . , xn) of length n; note that n can vary for different documents. For instance, if an email starts with "A NIPS . . . ," then x1 = 1 ("a" is the first word in the dictionary), and x2 = 35000 (if "nips" is the 35000th word in the dictionary).
In the multinomial event model, we assume that the way an email is generated is via a random process in which spam/non-spam is first determined (according to p(y)) as before. Then, the sender of the email writes the email by first generating x1 from some multinomial distribution over words (p(x1|y)). Next, the second word x2 is chosen independently of x1 but from the same multinomial distribution, and similarly for x3, x4, and so on, until all n words of the email have been generated. Thus, the overall probability of a message is given by p(y) ∏_{j=1}^{n} p(xj|y). Note that this formula looks like the one we had earlier for the probability of a message under the multi-variate Bernoulli event model, but that the terms in the formula now mean very different things. In particular, xj|y is now a multinomial, rather than a Bernoulli, distribution.
The parameters for our new model are φy = p(y) as before, φk|y=1 = p(xj = k|y = 1) (for any j) and
φk|y=0 = p(xj = k|y = 0). Note that we have assumed that p(xj |y) is the same for all values of j (i.e., that
the distribution according to which a word is generated does not depend on its position j within the
email).
(x1(i), x2(i), . . . , xni(i)) (here, ni is the number of words in the i-th training example),
While not necessarily the very best classification algorithm, the Naive Bayes classifier often works
surprisingly well. It is often also a very good “first thing to try,” given its simplicity and ease of
implementation.
Ensemble Methods: Bagging and Boosting
Unit 2
Eighth Semester
Subject code: ETCS102
Department of Computer Science & Engineering,
Bharati Vidyapeeth's College of Engineering,
New Delhi
Subject: Machine learning
Bagging stands for Bootstrap Aggregation.
It takes the original data set D with N training examples and creates M copies {D̃m}, m = 1, ..., M, each obtained by sampling N examples from D with replacement.
A model hm is trained on each copy D̃m.
Use the averaged model h = (1/M) Σ_{m=1}^{M} hm as the final model.
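A sketch (not from the slides) of bagging with decision trees; the use of scikit-learn's DecisionTreeClassifier, the synthetic data set, and M = 25 are assumptions of the example.

```python
# Bagging: train M trees on bootstrap copies of the data and average their predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
N, M = len(y), 25                                    # N training examples, M bootstrap copies
rng = np.random.default_rng(0)

models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)                 # D~_m: sample N examples with replacement
    h = DecisionTreeClassifier().fit(X[idx], y[idx]) # train h_m on the m-th copy
    models.append(h)

# averaged model: average the h_m and threshold (a majority vote for 0/1 labels)
preds = np.mean([h.predict(X) for h in models], axis=0)
y_hat = (preds >= 0.5).astype(int)
print("training accuracy of the bagged model:", np.mean(y_hat == y))
```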
In AdaBoost, after each round t the example weight distribution Dt+1 is updated and then normalized so that it sums to 1:

  Dt+1(n) = Dt+1(n) / Σ_{m=1}^{N} Dt+1(m)

Output the "boosted" final hypothesis H(x) = sign( Σ_{t=1}^{T} αt ht(x) ).
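A sketch (not from the slides) of AdaBoost with depth-1 decision stumps; the scikit-learn base learner, the synthetic data, and T = 20 rounds are assumptions of the example. Labels are mapped to {−1, +1} so that the sign of the weighted vote gives the final hypothesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, random_state=1)
y = 2 * y01 - 1                                        # map {0, 1} -> {-1, +1}
N, T = len(y), 20
D = np.full(N, 1.0 / N)                                # initial example weights
alphas, stumps = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)   # weak learner h_t
    pred = h.predict(X)
    eps = np.sum(D[pred != y])                         # weighted training error of h_t
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))    # alpha_t
    D = D * np.exp(-alpha * y * pred)                  # reweight the examples
    D = D / D.sum()                                    # normalize D_{t+1} so that it sums to 1
    alphas.append(alpha); stumps.append(h)

# boosted final hypothesis H(x) = sign(sum_t alpha_t h_t(x))
H = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, stumps)))
print("training accuracy:", np.mean(H == y))
```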
AdaBoost: Example
With decision stumps on binary features, ht(x) = st (2 x_{i_t} − 1), the boosted hypothesis H(x) = sgn( Σ_{t=1}^{T} αt ht(x) ) can be written as

  H(x) = sgn( Σ_{t=1}^{T} αt st (2 x_{i_t} − 1) )
       = sgn( Σ_{t=1}^{T} 2 αt st x_{i_t} − Σ_{t=1}^{T} αt st )
       = sign(w^T x + b)

where wd = Σ_{t : i_t = d} 2 αt st and b = − Σ_t αt st.
Comments
For AdaBoost, given each model's error εt = 1/2 − γt, the training error consistently gets better with rounds T:

  train-error(Hfinal) ≤ exp( −2 Σ_{t=1}^{T} γt² )
Boosting algorithms can be shown to be minimizing a loss function. E.g., AdaBoost has been shown to be minimizing the exponential loss

  L = Σ_{n=1}^{N} exp{ −yn H(xn) }

where H(x) = Σ_{t=1}^{T} αt ht(x) (the weighted sum before taking the sign), given weak base classifiers h1, . . . , hT.
Boosting in general can perform badly if some examples are
outliers
Bagging vs. Boosting
No clear winner; it usually depends on the data.
Bagging is computationally more efficient than boosting (note that bagging can train the M models in parallel, boosting can't).
Bagging usually can't reduce the bias, but boosting can (note that in boosting, the training error steadily decreases).
Bagging usually performs better than boosting if we don't have high bias and only want to reduce variance (i.e., if we are overfitting).