MCSE0007: Machine Learning
Naïve Bayes Classification
Classification Problem
Classification consists of assigning a class label
to each tuple in a set of unclassified data.
More precisely, the classification problem can be
stated as follows:
Definition: Classification Problem
Given a database D = {t1, t2, …, tm} of tuples and a set of classes C =
{c1, c2, …, ck}, the classification problem is to define a mapping f : D → C,
where each ti is assigned to one class.
Note that each tuple ti ∈ D is defined by a set of attributes A = {A1, A2, …, An}.
Classification Techniques
⚫ A number of classification techniques are known; they can be broadly
grouped into the following categories:
1. Statistical-Based Methods
   • Regression
   • Bayesian Classifier
2. Distance-Based Classification
   • K-Nearest Neighbours
3. Decision Tree-Based Classification
   • ID3, C4.5, CART
4. Classification using Support Vector Machines (SVM)
5. Classification using Neural Networks (ANN)
Bayesian Classifier
⚫ Principle
⚫ If it walks like a duck, quacks like a duck, then it is probably a duck
Bayesian Classifier …
⚫ A statistical classifier
⚫ Performs probabilistic prediction, i.e., predicts class
membership probabilities
⚫ Foundation
⚫ Based on Bayes’ Theorem.
⚫ Assumptions
1. The classes are mutually exclusive and exhaustive.
2. The attributes are independent given the class.
⚫ Called “Naïve” classifier because of these assumptions.
⚫ Empirically proven to be useful.
⚫ Scales very well.
Bayesian Classifier …
⚫ In many applications, the relationship between the
attribute set and the class variable is non-deterministic.
o In other words, a test instance cannot be assigned a class label
with certainty.
o In such situations, classification can be carried out
probabilistically.
⚫ The Bayesian classifier is an approach to modelling the
probabilistic relationship between the attribute set and the
class variable.
⚫ More precisely, the Bayesian classifier uses Bayes’ Theorem of
probability for classification.
Theory of Probability
Simple Probability
Definition: Simple Probability
If there are n elementary events associated with a random experiment and m of
them are favourable to an event A, then the probability of occurrence of A is
P(A) = m / n
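As a quick illustration (not from the slides), the m/n definition can be checked by enumerating a small sample space in Python; the die example below is assumed.

```python
# Minimal sketch of P(A) = m / n, assuming a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]                     # n = 6 elementary events
event_A = [o for o in outcomes if o % 2 == 0]     # m = 3 events favourable to A = "even"

p_A = len(event_A) / len(outcomes)                # P(A) = m / n
print(p_A)                                        # 0.5
```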
⚫ Suppose, A and B are any two events and P(A), P(B) denote the
probabilities that the events A and B will occur, respectively.
⚫ Mutually Exclusive Events:
⚫ Two events are mutually exclusive, if the occurrence of one precludes the
occurrence of the other.
Examples: tossing a coin (two events: head, tail);
rolling a ludo cube, i.e., a die (six events)
Simple Probability …
⚫ Independent events: Two events are independent if the occurrence of one
does not affect the occurrence of the other.
Example: Tossing a coin and rolling a ludo cube together.
Definition: Joint Probability
If P(A) and P(B) are the probabilities of two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
If A and B are mutually exclusive, then P(A ∩ B) = 0
If A and B are independent events, then P(A ∩ B) = P(A) · P(B)
Thus, for mutually exclusive events:
P(A ∪ B) = P(A) + P(B)
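A small sketch (assumed die events, not from the slides) that verifies the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) by direct enumeration:

```python
# Check the addition rule on one roll of a fair die.
sample_space = set(range(1, 7))
A = {o for o in sample_space if o % 2 == 0}       # "even": {2, 4, 6}
B = {o for o in sample_space if o > 4}            # "greater than 4": {5, 6}

p = lambda E: len(E) / len(sample_space)
print(p(A | B))                                   # P(A ∪ B) = 4/6
print(p(A) + p(B) - p(A & B))                     # also 4/6, by the addition rule
```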
Conditional Probability
Definition: Conditional Probability
If events are dependent, then their probability is expressed by conditional
probability. The probability that A occurs given that B has occurred is denoted
by P(A|B).
Suppose A and B are two events associated with a random experiment. The
probability of A under the condition that B has already occurred, where P(B) ≠ 0,
is given by
P(A|B) = (number of events in B which are favourable to A) / (number of events in B)
       = (number of events favourable to A ∩ B) / (number of events favourable to B)
       = P(A ∩ B) / P(B)
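The two ways of reading this definition (counting within B, or dividing probabilities) can be compared directly; the die events below are assumed for illustration.

```python
# P(A|B) computed from counts within B, and as P(A ∩ B) / P(B).
sample_space = set(range(1, 7))
A = {o for o in sample_space if o % 2 == 0}       # "even": {2, 4, 6}
B = {o for o in sample_space if o > 3}            # "greater than 3": {4, 5, 6}

p = lambda E: len(E) / len(sample_space)
print(len(A & B) / len(B))                        # favourable events in B / events in B = 2/3
print(p(A & B) / p(B))                            # P(A ∩ B) / P(B) = 2/3
```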
Conditional Probability …
Corollary: Conditional Probability
P(A ∩ B) = P(A) · P(B|A), if P(A) ≠ 0
or P(A ∩ B) = P(B) · P(A|B), if P(B) ≠ 0
For three events A, B and C:
P(A ∩ B ∩ C) = P(A) · P(B|A) · P(C|A ∩ B)
For n events A1, A2, …, An, if all events are mutually independent of each other:
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) · P(A2) ⋯ P(An)
Note:
P(A|B) = 0 if the events are mutually exclusive
P(A|B) = P(A) if A and B are independent
In general, P(A|B) · P(B) = P(B|A) · P(A), since P(A ∩ B) = P(B ∩ A)
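A quick numerical check (same assumed die events as above) of the corollary P(A ∩ B) = P(A)·P(B|A) = P(B)·P(A|B), which is exactly the symmetry used by Bayes’ theorem later:

```python
# Verify P(A ∩ B) = P(A)·P(B|A) = P(B)·P(A|B) on the die example.
sample_space = set(range(1, 7))
A, B = {2, 4, 6}, {4, 5, 6}

p = lambda E: len(E) / len(sample_space)
print(p(A & B))                                   # 1/3
print(p(A) * (len(A & B) / len(A)))               # P(A)·P(B|A) = 1/3
print(p(B) * (len(A & B) / len(B)))               # P(B)·P(A|B) = 1/3
```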
Prior and Posterior Probabilities
⚫ P(A) and P(B) are called prior probabilities.
⚫ P(A|B) and P(B|A) are called posterior probabilities.
Example 8.6: Prior versus Posterior Probabilities
    X     Y
    x1    A
    x2    A
    x3    B
    x3    A
    x2    B
    x1    A
    x1    B
    x3    B
    x2    B
    x2    A
⚫ The table shows that the event Y has two outcomes, namely A and B, which
depend on another event X with outcomes x1, x2 and x3.
⚫ Case 1: Suppose we have no information about the event X. Then, from the
given sample space, we can calculate P(Y = A) = 5/10 = 0.5.
⚫ Case 2: Now, suppose we want to calculate P(X = x2 | Y = A) = 2/5 = 0.4.
The latter is the conditional or posterior probability, whereas the former is
the prior probability.
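The two probabilities of the example can be recomputed from the table rows; the sketch below simply counts, as one would by hand.

```python
# Prior P(Y = A) and posterior P(X = x2 | Y = A) from the ten example rows.
data = [("x1", "A"), ("x2", "A"), ("x3", "B"), ("x3", "A"), ("x2", "B"),
        ("x1", "A"), ("x1", "B"), ("x3", "B"), ("x2", "B"), ("x2", "A")]

n_A = sum(1 for _, y in data if y == "A")
print(n_A / len(data))                                          # P(Y = A) = 5/10 = 0.5
print(sum(1 for x, y in data if x == "x2" and y == "A") / n_A)  # P(X = x2 | Y = A) = 2/5 = 0.4
```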
Naïve Bayesian Classifier
⚫ Suppose Y is a class variable and X = {X1, X2, …, Xn} is a set of attributes;
each training tuple pairs an instance of X with a value of Y.
    INPUT (X)              CLASS (Y)
    …                      …
    (x1, x2, …, xn)        yi
    …                      …
⚫ The classification problem can then be expressed as the class-conditional
probability
P(Y = yi | X1 = x1 AND X2 = x2 AND … AND Xn = xn)
Naïve Bayesian Classifier …
⚫ The naïve Bayesian classifier calculates this posterior probability using Bayes’
theorem, as follows.
⚫ From Bayes’ theorem on conditional probability, we have
P(Y|X) = P(X|Y) · P(Y) / P(X)
       = P(X|Y) · P(Y) / [P(X|Y = y1) · P(Y = y1) + … + P(X|Y = yk) · P(Y = yk)]
where
P(X) = Σ_{i=1..k} P(X|Y = yi) · P(Y = yi)
Note:
P(X) is called the evidence (also the total probability), and it is a constant for a given X.
The probability P(Y|X) (also called the class-conditional probability) is therefore
proportional to P(X|Y) · P(Y).
Thus, P(Y|X) can be taken as a measure of Y given X:
P(Y|X) ∝ P(X|Y) · P(Y)
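A hedged sketch of this computation with made-up numbers: the evidence P(X) only rescales the numerators, so the posteriors stay proportional to P(X|Y)·P(Y).

```python
# Posteriors from Bayes' theorem for two classes; priors/likelihoods are assumed values.
priors      = {"y1": 0.6, "y2": 0.4}              # P(Y = yi)
likelihoods = {"y1": 0.2, "y2": 0.5}              # P(X | Y = yi) for the observed X

evidence = sum(likelihoods[y] * priors[y] for y in priors)      # P(X), a constant
posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}
print(posteriors)                                 # {'y1': 0.375, 'y2': 0.625}
```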
Naïve Bayesian Classifier …
⚫ Suppose, for a given instance of X, say x = (X1 = x1 AND … AND Xn = xn),
we consider any two class-conditional probabilities, namely P(Y = yi | X = x)
and P(Y = yj | X = x).
⚫ If P(Y = yi | X = x) > P(Y = yj | X = x), then we say that yi is
stronger than yj for the instance X = x.
⚫ The strongest yi is the classification for the instance X = x.
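Because P(X) is common to all classes, the comparison can use the unnormalized products P(X|Y = yi) · P(Y = yi); a one-line version of the decision rule, reusing the assumed numbers above:

```python
# Pick the class with the largest unnormalized posterior P(X|Y = yi) · P(Y = yi).
scores = {"y1": 0.2 * 0.6, "y2": 0.5 * 0.4}
print(max(scores, key=scores.get))                # 'y2'
```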
Naïve Bayesian Classifier …
Naïve Bayes Algorithm (for discrete input attributes) has two phases:
– 1. Learning Phase: Given a training set S,
     For each target value ci (ci = c1, …, cL):
         P̂(C = ci) ← estimate P(C = ci) with examples in S;
         For every attribute value xjk of each attribute Xj (j = 1, …, n; k = 1, …, Nj):
             P̂(Xj = xjk | C = ci) ← estimate P(Xj = xjk | C = ci) with examples in S;
     Output: conditional probability tables; for each Xj, Nj × L elements.
     (Learning is easy: just create the probability tables.)
– 2. Test Phase: Given an unknown instance X′ = (a′1, …, a′n),
     look up the tables to assign the label c* to X′ if
         [P̂(a′1|c*) ⋯ P̂(a′n|c*)] · P̂(c*) > [P̂(a′1|c) ⋯ P̂(a′n|c)] · P̂(c),
         for all c ≠ c*, c = c1, …, cL.
     (Classification is easy: just multiply the probabilities.)
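A self-contained sketch of both phases for discrete attributes. The function names and the toy weather-style training set are illustrative assumptions, not part of the slides; unseen attribute values are handled naively (probability 0).

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Learning phase: estimate P(C = ci) and P(Xj = xjk | C = ci) from S."""
    n = len(examples)
    class_counts = Counter(c for _, c in examples)
    priors = {c: class_counts[c] / n for c in class_counts}

    value_counts = defaultdict(Counter)            # (j, ci) -> counts of values xjk
    for x, c in examples:
        for j, v in enumerate(x):
            value_counts[(j, c)][v] += 1
    cond = {key: {v: cnt / sum(counter.values()) for v, cnt in counter.items()}
            for key, counter in value_counts.items()}
    return priors, cond                            # the probability tables

def classify(x, priors, cond):
    """Test phase: return the class maximizing P(c) * prod_j P(xj | c)."""
    best, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for j, v in enumerate(x):
            score *= cond[(j, c)].get(v, 0.0)      # unseen value -> 0 in this sketch
        if score > best_score:
            best, best_score = c, score
    return best

# Toy training set S: (attribute tuple, class label) -- assumed for illustration.
S = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
     (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
     (("rainy", "mild"), "yes")]
priors, cond = train_naive_bayes(S)
print(classify(("rainy", "hot"), priors, cond))    # 'yes'
```

In practice one would add Laplace smoothing so that an unseen attribute value does not zero out the whole product.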
Naïve Bayesian Classifier …
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast: it just requires considering each attribute in
each class separately
– Testing is straightforward: just look up the tables, or calculate conditional
probabilities with normal distributions (see the sketch after this list)
• Naïve Bayes is a popular generative classifier model
1. The performance of naïve Bayes is competitive with most state-of-the-art
classifiers, even when the independence assumption is violated
2. It has many successful applications, e.g., spam mail filtering
3. It is a good candidate for a base learner in ensemble learning
4. Apart from classification, naïve Bayes can do more…
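For continuous attributes, the per-class conditionals are typically modelled with one normal distribution per attribute and class, as mentioned above. A usage sketch, assuming scikit-learn is available (the toy data are made up):

```python
from sklearn.naive_bayes import GaussianNB

X = [[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.3]]   # toy continuous attributes
y = ["a", "a", "b", "b"]

clf = GaussianNB().fit(X, y)                            # fits a normal per attribute per class
print(clf.predict([[1.1, 2.0], [3.1, 4.1]]))            # ['a' 'b']
```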
Any Questions ?