BAYES OPTIMAL CLASSIFIER
• Focus so far: "What is the most probable hypothesis given the training data?"
• More significant in practice:
"What is the most probable classification of a new instance given the training
data?"
• At first glance:
It might seem sufficient to apply the MAP (Maximum A Posteriori) hypothesis to
classify new instances.
• However:
There are methods that can outperform simply using the MAP hypothesis for
classification.
Example to Build Intuition:
• Hypothesis space: Contains three hypotheses — h₁, h₂, h₃
• Posterior probabilities:
– h₁: 0.4 → MAP hypothesis
– h₂: 0.3
– h₃: 0.3
• New instance x:
– Classified positive by h₁
– Classified negative by h₂ and h₃
• Overall classification probabilities:
– Positive: ∑_{hi ∈ H} P(+ | hi) P(hi | D) = 0.4
– Negative: ∑_{hi ∈ H} P(− | hi) P(hi | D) = 0.6
Observation:
Although h₁ is the MAP hypothesis and classifies x as positive, the most
probable classification considering all hypotheses is negative
• In general, the most probable classification of the new instance
is obtained by combining the predictions of all hypotheses,
weighted by their posterior probabilities.
• If the possible classification of the new example can take on
any value vj from some set V, then the probability P(vj|D) that
the correct classification for the new instance is vj, is just
P(vj | D) = ∑_{hi ∈ H} P(vj | hi) P(hi | D)
• The optimal classification of the new instance is the value vj,
for which P(vj|D) is maximum.
arg max_{vj ∈ V} ∑_{hi ∈ H} P(vj | hi) P(hi | D)
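The rule above can be sketched in a few lines of Python (an illustrative sketch, not part of the original slides; the function name bayes_optimal_classify is hypothetical). It reproduces the three-hypothesis example: although the MAP hypothesis h₁ votes positive, the posterior-weighted vote selects negative.

def bayes_optimal_classify(posteriors, predictions, values):
    """Return the value v maximizing sum over h of P(v | h) * P(h | D).

    posteriors  : dict mapping hypothesis -> P(h | D)
    predictions : dict mapping hypothesis -> {value: P(v | h)}
    values      : candidate classifications V
    """
    scores = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
              for v in values}
    return max(scores, key=scores.get), scores

# Three-hypothesis example: h1 predicts +, h2 and h3 predict -.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": {"+": 1.0, "-": 0.0},
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}
best, scores = bayes_optimal_classify(posteriors, predictions, ["+", "-"])
print(best, scores)   # '-' {'+': 0.4, '-': 0.6}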
• Any system that classifies new instances according to equation
above is called a Bayes optimal classifier, or Bayes optimal
learner.
• No other classification method using the same hypothesis
space and same prior knowledge can outperform this method
on average.
• This method maximizes the probability that the new instance is
classified correctly, given the available data, hypothesis space,
and prior probabilities over the hypotheses.
• The Bayes optimal classifier can make predictions that do not match
any single hypothesis in the hypothesis space H
• When the equation above is used to classify every instance in X, the resulting
labeling may not correspond to any individual hypothesis h ∈ H
Interpretation:
• The Bayes optimal classifier behaves as if it’s using a different
hypothesis space H′
Where H′ includes:
• Hypotheses that combine predictions from multiple hypotheses in
H
• Often through linear combinations or comparisons of these
predictions
The Bayes optimal classifier provides the best result on average, but it can be
expensive when the hypothesis space contains many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis h at random, according to P(h|D)
2. Use this h to classify new instance
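A minimal Python sketch of these two steps (illustrative; the helper name gibbs_classify and the predict callback are hypothetical):

import random

def gibbs_classify(posteriors, predict, x):
    """Gibbs algorithm: sample h ~ P(h | D), then classify x with that h."""
    hypotheses = list(posteriors)
    weights = [posteriors[h] for h in hypotheses]
    h = random.choices(hypotheses, weights=weights, k=1)[0]   # step 1
    return predict(h, x)                                      # step 2

# Reusing the three-hypothesis example: each call returns '+' with
# probability 0.4 and '-' with probability 0.6.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
labels = {"h1": "+", "h2": "-", "h3": "-"}
print(gibbs_classify(posteriors, lambda h, x: labels[h], x=None))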
Surprising fact: assume target concepts are drawn at random from H
according to priors on H. Then:
E[errorGibbs] ≤ 2E[errorBayesOptimal]
Suppose we have a correct, uniform prior distribution over H. Then:
• Pick any hypothesis from the version space VS, with uniform probability
• Its expected error is no worse than twice that of the Bayes optimal classifier
Classification approach:
• Chooses a hypothesis at random based on the posterior
probability distribution
Key insight:
• Despite its simplicity, it has provable performance guarantees
• Theoretical result (Haussler et al., 1994):
Under certain conditions:
– The expected misclassification error of the Gibbs algorithm
is at most twice that of the Bayes optimal classifier
Conditions:
• Target concepts are assumed to be drawn from the prior probability
distribution
• The expected value is taken over this distribution
Naive Bayes Classifier:
• A practical and widely-used Bayesian learning method
• Also known as the naive Bayes learner
Strengths:
– Performs well in many real-world domains
When to use:
– A moderate or large training set is available
– The attributes that describe instances are conditionally independent given the
classification
• Performance:
– Often comparable to that of neural networks and
decision tree learning methods
Assume a target function f: X→V, where each instance x is described by attributes
(a1, a2, …, an).
The most probable value of f(x) is:
vMAP = arg max_{vj ∈ V} P(vj | a1, a2, ..., an)
     = arg max_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
     = arg max_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)
Naïve Bayes assumption:
• The naive Bayes classifier is based on the simplifying assumption that the
attribute values are conditionally independent given the target value. In
other words, the assumption is that given the target value of the instance,
the probability of observing the conjunction a1, a2, ..., an is just the
product of the probabilities for the individual attributes:
P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)
which gives the Naïve Bayes classifier:
vNB = arg max_{vj ∈ V} P(vj) ∏_i P(ai | vj)
Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai | vj) ← estimate P(ai | vj)

Classify_New_Instance(x)
  vNB = arg max_{vj ∈ V} P̂(vj) ∏_{ai ∈ x} P̂(ai | vj)
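A runnable Python sketch of the two procedures above (the class name NaiveBayes and its method names are illustrative; probabilities are estimated by simple relative frequencies, with no smoothing):

from collections import Counter, defaultdict

class NaiveBayes:
    """Sketch of Naive_Bayes_Learn / Classify_New_Instance."""

    def fit(self, examples, targets):
        n = len(targets)
        class_totals = Counter(targets)
        self.priors = {v: c / n for v, c in class_totals.items()}   # P^(vj)
        counts = defaultdict(lambda: defaultdict(Counter))          # v -> i -> ai -> count
        for x, v in zip(examples, targets):
            for i, a in enumerate(x):
                counts[v][i][a] += 1
        self.likelihoods = {                                        # P^(ai | vj)
            v: {i: {a: c / class_totals[v] for a, c in vals.items()}
                for i, vals in attrs.items()}
            for v, attrs in counts.items()}
        return self

    def predict(self, x):
        def score(v):
            s = self.priors[v]
            for i, a in enumerate(x):
                s *= self.likelihoods[v][i].get(a, 0.0)
            return s
        return max(self.priors, key=score)                          # arg max

Calling NaiveBayes().fit(examples, targets).predict(x) performs exactly the two procedures above on a list of attribute tuples and their target values.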
Consider CoolCar and new instance
(Color=Blue, Type=SUV, Doors=2, Tires=White)
Want to compute
vNB = arg max_{vj ∈ V} P(vj) ∏_i P(ai | vj)
P(+) * P(Blue|+) * P(SUV|+) * P(2|+) * P(White|+) =
5/14 * 1/5 * 2/5 * 4/5 * 3/5 = 0.0137
P(−) * P(Blue|−) * P(SUV|−) * P(2|−) * P(White|−) =
9/14 * 3/9 * 4/9 * 3/9 * 3/9 = 0.0106
Since 0.0137 > 0.0106, the naïve Bayes classification of the new instance is positive.
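A quick numeric check of the two products (plain Python, using the probability estimates shown above):

# Scores proportional to the posterior for (Color=Blue, Type=SUV, Doors=2, Tires=White)
score_pos = 5/14 * 1/5 * 2/5 * 4/5 * 3/5   # ≈ 0.0137
score_neg = 9/14 * 3/9 * 4/9 * 3/9 * 3/9   # ≈ 0.0106
print("+" if score_pos > score_neg else "-")   # prints "+"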
1. Conditional independence assumption is often violated
P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)
• … but it works surprisingly well anyway. Note that you do not need the
estimated posteriors to be correct; you need only that
arg max_{vj ∈ V} P̂(vj) ∏_i P̂(ai | vj) = arg max_{vj ∈ V} P(vj) P(a1, ..., an | vj)
• Naïve Bayes posteriors often unrealistically close to 1 or 0
2. What if none of the training instances with target value vj have
attribute value ai? Then
P̂(ai | vj) = 0, and therefore
P̂(vj) ∏_i P̂(ai | vj) = 0
Typical solution is the Bayesian m-estimate:
P̂(ai | vj) ← (nc + m·p) / (n + m)
where
• n is the number of training examples for which v = vj
• nc is the number of examples for which v = vj and a = ai
• p is the prior estimate for P̂(ai | vj)
• m is the weight given to the prior (i.e., the number of “virtual” examples)
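A minimal Python sketch of this estimate (the function name m_estimate is illustrative; p is often taken to be the uniform prior 1/k for an attribute with k values):

def m_estimate(nc, n, m=1.0, p=0.5):
    """Smoothed estimate of P(ai | vj): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# Hypothetical case: no training example with v = vj has value ai (nc = 0).
# The raw estimate would be 0/5 = 0; the m-estimate stays strictly positive.
print(m_estimate(nc=0, n=5, m=3.0, p=1/3))   # 0.125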
• A probabilistic classifier based on Bayes' Theorem
• Assumes strong (naive) independence between features
• Simple, fast, and effective for high-dimensional data
• All features are conditionally independent given the class label
• Despite the unrealistic assumption, it often performs surprisingly
well
• Fast and scalable – works well with large datasets
• Handles both binary and multiclass classification
• Requires less training data than many other models
• Performs well even with noisy or missing data
1. Training Phase:
1. Learn prior probabilities: P(y)
2. Learn likelihoods: P(xi∣y)
2. Prediction Phase:
1. Use Bayes' Theorem to compute the posterior: P(y | x1, …, xn) ∝ P(y) ∏_i P(xi | y)
2. Choose the class y with the highest posterior probability
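For completeness, the same two phases with an off-the-shelf implementation (a sketch assuming scikit-learn is installed; CategoricalNB is its estimator for integer-encoded discrete attributes):

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data: rows are instances with integer-encoded categorical attributes.
X = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
y = np.array([1, 1, 0, 0])

clf = CategoricalNB()                    # training phase: learns P(y) and P(xi | y)
clf.fit(X, y)
print(clf.predict([[0, 1, 1]]))          # prediction phase: arg max posterior
print(clf.predict_proba([[0, 1, 1]]))    # posterior probabilities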