07 Naive Bayes

Bayesian Classifiers

Rubén Sánchez Corcuera


ruben.sanchez@deusto.es
■ Bayesian classifiers are statistical classifiers based on Bayes’ theorem
● They can predict class membership probabilities such as the probability
that a given tuple belongs to a particular class
■ Class conditional independence

Naïve
● Naïve Bayes classifiers assume that the effect of an attribute value on a
given class is independent of the values of the other attributes.
● This assumption is made to simplify the computations involved, which is why the classifier is considered naïve.

Bayes
■ Studies have found that a simple Bayesian classifier (Naïve Bayes) can compete
with decision trees and some neural networks
■ Bayesian classifiers have also exhibited high accuracy and speed when applied
to large databases.

Bayes Theorem

■ Let B be a data tuple. In Bayesian terms, B is considered “evidence.”


● B is described by measurements on n attributes
■ Let A be some hypothesis:
● The data tuple B belongs to a specified class C.
■ For classification problems, we want to determine P(A|B), the probability
that the hypothesis A holds given the “evidence” or observed data tuple B.
■ In other words, we are looking for the probability that tuple B belongs to
class C, given that we know the attribute description of B.

Bayes Theorem - Posterior probability

■ P(A|B) is the posterior probability, or a posteriori probability, of A conditioned on B.
● For example, suppose our world of data tuples is confined to patients described by the attributes age and blood pressure level, and that B is a 50-year-old patient with high blood pressure. Suppose that A is the hypothesis that our patient will develop heart disease. Then P(A|B) reflects the probability that patient B will develop heart disease given that we know the patient's age and blood pressure level.

Bayes Theorem - Prior probability

■ P(A) is the prior probability, or a priori probability, of A.
● For our example, this is the probability that any given patient develops heart disease, without considering specific factors such as age or blood pressure.
● The posterior probability, P(A|B), is based on more information (e.g., patient information) than the prior probability, P(A), which is independent of B.


Bayes Theorem

■ Similarly, P(B|A) is the posterior probability of B conditioned on A.
● That is, it is the probability that a patient, B, is 50 years old and has high blood pressure, given that we know the patient has heart disease.
■ P(B) is the prior probability of B.
● The probability that a person from our set of patients is 50 years old and has high blood pressure.
■ “How are these probabilities estimated?” P(A), P(B|A), and P(B) may be estimated from the given data, as we shall see next.
■ Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(A|B), from P(A), P(B|A), and P(B):

P(A|B) = P(B|A) P(A) / P(B)

Steps for Naïve Bayes

Naïve Bayes: Step 1

■ Let D be a training set of tuples and their associated class labels.
● As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, . . . , xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, . . . , An.

Naïve Bayes: Step 2

■ Suppose that there are m classes, C1, C2, . . . , Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

■ Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)


Naïve Bayes: Step 3

■ As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized.
● If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
■ Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D and |D| is the total number of training tuples.

Naïve Bayes: Step 4

■ Given datasets with many attributes, it would be extremely computationally expensive to compute P(X|Ci).
■ To reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional independence is made. This presumes that the attributes’ values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
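As a small illustration of the prior estimate above, the following sketch (my own, not from the slides; the label values are invented) computes P(Ci) = |Ci,D| / |D| from a list of training labels.

    from collections import Counter

    # Hypothetical class labels of the training tuples in D
    labels = ["yes", "yes", "no", "yes", "no"]

    # P(Ci) = |Ci,D| / |D|: fraction of training tuples belonging to class Ci
    priors = {ci: n / len(labels) for ci, n in Counter(labels).items()}
    print(priors)  # {'yes': 0.6, 'no': 0.4}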

Naïve Bayes: Step 4

■ We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples.
● Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is categorical or continuous-valued.

Naïve Bayes: Step 4 - Categorical

■ If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
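A minimal sketch of this count-based estimate (not part of the slides; the attribute values and labels are invented for illustration):

    # Hypothetical training data: one categorical attribute Ak and the class label
    ak_values = ["high", "low", "high", "medium", "high"]
    labels    = ["yes",  "no",  "yes",  "yes",    "no"]

    def p_xk_given_ci(xk, ci):
        # Number of class-ci tuples with Ak = xk, divided by |Ci,D|
        in_class = [a for a, c in zip(ak_values, labels) if c == ci]
        return in_class.count(xk) / len(in_class)

    print(p_xk_given_ci("high", "yes"))  # 2/3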


Naïve Bayes: Step 4 - Continuous

■ If Ak is continuous-valued, then we need to do a bit more work, but the calculation is fairly straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  so that  P(xk|Ci) = g(xk, μCi, σCi)

■ For this formula we only need to compute μCi and σCi, which are the mean and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these two quantities, together with xk, into the equation to estimate P(xk|Ci).

Naïve Bayes: Step 5

■ To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i

■ In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
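To make Steps 4 and 5 concrete, here is a short sketch (my own, not from the slides; the per-class statistics and priors are invented) that evaluates the Gaussian P(xk|Ci) for one continuous attribute and then picks the class maximizing P(X|Ci)P(Ci).

    import math

    def gaussian(x, mu, sigma):
        # g(x, mu, sigma) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # Hypothetical per-class statistics for one continuous attribute (e.g., age)
    stats  = {"yes": (52.0, 6.0), "no": (40.0, 8.0)}   # class -> (mu_Ci, sigma_Ci)
    priors = {"yes": 0.4, "no": 0.6}                    # P(Ci)

    x_k = 50.0
    scores = {ci: gaussian(x_k, mu, sigma) * priors[ci] for ci, (mu, sigma) in stats.items()}
    print(max(scores, key=scores.get))  # the class Ci maximizing P(X|Ci) P(Ci)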
Naïve Bayes: Final Thoughts

■ In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers.
■ In practice this is not always the case, owing to inaccuracies in the assumptions made for its use, such as class-conditional independence, and the lack of available probability data.
■ Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes’ theorem.
○ For example, under certain assumptions, it can be shown that many neural network and curve-fitting algorithms output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.

Further reading

■ Section 8.3 in [Han and Kamber, 2006]
■ Extra material: https://scikit-learn.org/stable/modules/naive_bayes.html


Exercise 1

■ Write a script that does the following (a possible solution sketch is given after Exercise 2):
1. loads the iris dataset using sklearn (sklearn.datasets.load_iris)
2. splits the data into training and testing parts using the train_test_split function so that the training set size is 80% of the whole data (also pass the random_state=0 argument to make the result deterministic)
3. uses Gaussian naive Bayes to fit the training data (sklearn.naive_bayes.GaussianNB)
4. predicts the labels of the test data
5. the function should return the accuracy score of the prediction performance (sklearn.metrics.accuracy_score)

Exercise 2

■ We are going to try to classify previously unseen words into their proper language. We are going to work with Spanish and Finnish.
■ Having only the words in each language, how would you do it? Remember, we are going to classify unseen words!
■ Let's prepare what we need:
○ Download the Spanish.txt and Suomi.txt datasets from ALUD.
○ Open the Naïve Bayes colab in ALUD.
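One possible solution sketch for Exercise 1 (my own, using only the sklearn functions named in the exercise):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    def iris_naive_bayes():
        # 1. load the iris dataset
        X, y = load_iris(return_X_y=True)
        # 2. 80/20 train/test split, deterministic thanks to random_state=0
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.8, random_state=0)
        # 3. fit Gaussian naive Bayes on the training data
        model = GaussianNB().fit(X_train, y_train)
        # 4. predict labels of the test data
        y_pred = model.predict(X_test)
        # 5. return the accuracy of the prediction
        return accuracy_score(y_test, y_pred)

    print(iris_naive_bayes())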
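One possible approach for Exercise 2 (a hedged sketch, not the official solution): represent each word by its character n-grams and train a multinomial naïve Bayes model on them. The code assumes Spanish.txt and Suomi.txt contain one word per line, which may not match the actual file format.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Assumption: one word per line in each file
    def load_words(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    spanish = load_words("Spanish.txt")
    finnish = load_words("Suomi.txt")
    words  = spanish + finnish
    labels = ["spanish"] * len(spanish) + ["finnish"] * len(finnish)

    # Character 1-2-grams capture letter patterns typical of each language
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2)),
        MultinomialNB())
    model.fit(words, labels)

    print(model.predict(["perro", "järvi"]))  # classify unseen words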
Do you have any questions?
ruben.sanchez@deusto.es

Thanks!
