MACHINE LEARNING
UNIT 3
BAYESIAN LEARNING
Bayesian learning methods are relevant to our study of machine learning
for two different reasons
First, Bayesian learning algorithms that calculate explicit probabilities for
hypotheses, such as the naive Bayes classifier, are among the most
practical approaches to certain types of learning problems.
The second reason that Bayesian methods are important to our study of
machine learning is that they provide a useful perspective for
understanding many learning algorithms that do not explicitly manipulate
probabilities.
Features of Bayesian learning methods
Each observed training example can incrementally decrease or increase
the estimated probability that a hypothesis is correct
Prior knowledge can be combined with observed data to determine the
final probability of a hypothesis
Bayesian methods can accommodate hypotheses that make
probabilistic predictions
New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
NAIVE BAYES CLASSIFIER
One highly practical Bayesian learning method is the naive Bayes
learner, often called the naive Bayes classifier.
The naive Bayes classifier applies to learning tasks where each
instance x is described by a conjunction of attribute values and
where the target function f(x) can take on any value from some
finite set V.
A set of training examples of the target function is provided, and a
new instance is presented, described by the tuple of attribute values
(a1, a2, ..., an). The learner is asked to predict the target value, or
classification, for this new instance.
Under the naive Bayes assumption that the attribute values are
conditionally independent given the target value, the classification is
v_NB = argmax over vj in V of  P(vj) * product over i of P(ai | vj)
where v_NB denotes the target value output by the naive Bayes
classifier.
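To make the decision rule concrete, here is a minimal Python sketch of a naive Bayes learner for discrete attributes, assuming simple frequency counts and no smoothing; the function and variable names (train_naive_bayes, classify, train_X, train_y) are illustrative, not from the notes.

from collections import Counter, defaultdict

def train_naive_bayes(train_X, train_y):
    # Estimate the prior P(vj) and the conditionals P(ai | vj) by frequency counts.
    class_counts = Counter(train_y)
    n = len(train_y)
    # cond_counts[vj][i][ai] = number of class-vj examples whose i-th attribute equals ai
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for x, v in zip(train_X, train_y):
        for i, a in enumerate(x):
            cond_counts[v][i][a] += 1
    priors = {v: c / n for v, c in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(x, priors, cond_counts, class_counts):
    # v_NB = argmax over vj of P(vj) * product over i of P(ai | vj)
    best_v, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior
        for i, a in enumerate(x):
            score *= cond_counts[v][i][a] / class_counts[v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v

In practice a smoothed estimate (such as the m-estimate) would be used so that an attribute value never seen with a given class does not force the whole product to zero.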
GIBBS ALGORITHM
Although the Bayes optimal classifier obtains the best performance that
can be achieved from the given training data, it can be quite
costly to apply.
The expense is due to the fact that it computes the posterior
probability for every hypothesis in H and then combines the
predictions of each hypothesis to classify each new instance.
An alternative, less optimal method is the Gibbs algorithm
(see Opper and Haussler 1991), defined as follows:
1. Choose a hypothesis h from H at random, according to the posterior
probability distribution over H.
2. Use h to predict the classification of the next instance x.
In other words, the Gibbs algorithm simply applies a hypothesis drawn at
random according to the current posterior probability distribution.
Surprisingly, it can be shown that under certain conditions the
expected misclassification error for the Gibbs algorithm is at
most twice the expected error of the Bayes optimal classifier.
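As a minimal sketch, assuming the hypotheses are available as Python callables together with their posterior probabilities P(h | D) (the names hypotheses and posteriors are illustrative), the Gibbs algorithm can be written as:

import random

def gibbs_classify(x, hypotheses, posteriors):
    # Draw one hypothesis at random according to the posterior distribution P(h | D)
    # and use that single hypothesis to classify the new instance x.
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)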
AN EXAMPLE: LEARNING TO
CLASSIFY TEXT
K-NEAREST NEIGHBOR
CLASSIFICATION
K-Nearest Neighbors
K-Nearest Neighbors is one of the most basic yet essential classification
algorithms in Machine Learning
It belongs to the supervised learning domain
It has applications in pattern recognition, data mining, and intrusion detection
We are given some prior data (also called training data), which classifies
coordinates into groups identified by an attribute.
Given another set of data points (also called testing data), the task is to assign
each of these points to a group by analyzing the training set.
Given an unclassified point, we can assign it to a group by observing which
group its nearest neighbors belong to.
This means a point close to a cluster of points classified as ‘Red’ has a higher
probability of being classified as ‘Red’.
For example, given two test points (2.5, 7) and (5.5, 4.5) plotted among the
training points, intuitively the first point should be classified as ‘Green’ and
the second point as ‘Red’, as in the sketch below.
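Here is a minimal sketch of the k-nearest-neighbor rule for 2-D points, assuming Euclidean distance and an unweighted majority vote; the training points in the usage lines are made-up illustrative values, while the test points reuse (2.5, 7) and (5.5, 4.5) from above.

from collections import Counter
import math

def knn_classify(test_point, train_points, train_labels, k=3):
    # Sort training points by Euclidean distance to the test point
    # and take a majority vote among the k nearest labels.
    distances = sorted(
        (math.dist(test_point, p), label)
        for p, label in zip(train_points, train_labels)
    )
    k_nearest = [label for _, label in distances[:k]]
    return Counter(k_nearest).most_common(1)[0][0]

# Illustrative training data: a 'Green' cluster near (2, 7) and a 'Red' cluster near (6, 4.5).
train_points = [(1.5, 7.0), (2.0, 7.5), (3.0, 6.5), (5.0, 4.0), (6.0, 4.5), (6.5, 5.0)]
train_labels = ['Green', 'Green', 'Green', 'Red', 'Red', 'Red']
print(knn_classify((2.5, 7.0), train_points, train_labels, k=3))  # 'Green'
print(knn_classify((5.5, 4.5), train_points, train_labels, k=3))  # 'Red'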
Weighted K-Nearest Neighbor
Example
Minimum Description Length
Principle
BAYES OPTIMAL CLASSIFIER
The Bayes Optimal Classifier is a probabilistic model that
predicts the most likely outcome for a new situation.
Bayes' theorem is a method for calculating a hypothesis's
probability based on its prior probability, the probabilities of
observing specific data given the hypothesis, and the observed
data itself.
Maximum a Posteriori (MAP) is a probabilistic framework for
determining the most likely hypothesis given a training
dataset.
Take a hypothesis space that has 3 hypotheses h1, h2, and
h3.
The posterior probabilities of the hypotheses are as
follows:
h1 -> 0.4
h2 -> 0.3
h3 -> 0.3
Hence, h1 is the MAP (maximum a posteriori) hypothesis, since it
has the maximum posterior probability.
Suppose a new instance x is encountered, which is
classified negative by h2 and h3 but positive by h1.
Taking all hypotheses into account, the probability that x is
positive is .4 and the probability that it is negative is
therefore .6.
In this case, the classification generated by the MAP hypothesis
(positive) differs from the most probable classification,
which is negative.
The most probable classification of the new instance is
obtained by combining the predictions of all hypotheses,
weighted by their posterior probabilities.
If the new example's classification can be any value vj from a set V,
the probability P(vj|D) that the correct classification for the new
instance is vj is merely
P(vj|D) = sum over hi in H of  P(vj|hi) P(hi|D)
The denominator is omitted since we are only using this
for comparison and all the values of P(vj|D) would have the
same denominator.
The value vj for which P(vj|D) is maximum is the
best classification for the new instance.
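The h1/h2/h3 example can be checked with a short sketch; the dictionaries below simply encode the posteriors (0.4, 0.3, 0.3) and each hypothesis's prediction from the example above.

# Posterior probabilities P(hi | D) from the example.
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
# Prediction of each hypothesis for the new instance x.
predictions = {'h1': 'positive', 'h2': 'negative', 'h3': 'negative'}

# P(vj | D) = sum over hi of P(vj | hi) * P(hi | D); here P(vj | hi) is 1
# when hi predicts vj and 0 otherwise.
scores = {}
for h, post in posteriors.items():
    v = predictions[h]
    scores[v] = scores.get(v, 0.0) + post

print(scores)                       # {'positive': 0.4, 'negative': 0.6}
print(max(scores, key=scores.get))  # 'negative' -- the Bayes optimal classification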
Computational Learning Theory
PAC MODEL (PROBABLY
APPROXIMATELY CORRECT)
PAC (Probably Approximately Correct) learning is a framework used for the
mathematical analysis of machine learning. A PAC learner tries to learn a
concept approximately correctly by selecting, from a set of hypotheses, a
hypothesis that has low generalization error.
A hypothesis h is approximately correct if the error region h ⊕ c (the set of
instances on which h and the target concept c disagree) has probability at
most ε, i.e. Pr(h ⊕ c) <= ε, where 0 < ε < 0.5.
The learner is probably approximately correct if it outputs such a hypothesis
with probability at least 1 - δ, i.e. Pr[ Pr(h ⊕ c) <= ε ] >= 1 - δ,
where δ is the confidence parameter and 0 < δ < 0.5.
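The same definition can be stated compactly in the standard textbook form; here n denotes the instance length and size(c) the encoding length of the concept, both symbols being assumptions of this formal restatement:

C \text{ is PAC-learnable by } L \text{ using } H \iff
\forall c \in C,\ \forall \mathcal{D} \text{ over } X,\ \forall \varepsilon, \delta \in (0, \tfrac{1}{2}):\
\Pr\big[\mathrm{error}_{\mathcal{D}}(h_L) \le \varepsilon\big] \ge 1 - \delta,
\text{ with } L \text{ running in time polynomial in } 1/\varepsilon,\ 1/\delta,\ n,\ \mathrm{size}(c).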
SAMPLE COMPLEXITY FOR
FINITE HYPOTHESIS SPACES
PAC-learnability is largely determined by the number of training examples
required by the learner.
The growth in the number of required training examples with problem size is
called the sample complexity of the learning problem.
A learner is consistent if it outputs hypotheses that perfectly fit the training
data, whenever possible.
We derive a bound on the number of training examples required by
any consistent learner, independent of the specific algorithm it uses; see the
worked example below.
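As a worked example of such a bound, the standard result for a consistent learner over a finite hypothesis space is m >= (1/ε)(ln|H| + ln(1/δ)); the short sketch below just evaluates it, and the particular values |H| = 1000, ε = 0.1, δ = 0.05 are illustrative choices, not from the notes.

import math

def sample_complexity(H_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|H| + ln(1/delta)): enough examples so that, with
    # probability at least 1 - delta, every consistent hypothesis has true error <= epsilon.
    return math.ceil((1.0 / epsilon) * (math.log(H_size) + math.log(1.0 / delta)))

print(sample_complexity(1000, 0.1, 0.05))  # 100 training examples suffice under this bound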
SAMPLE COMPLEXITY FOR
INFINITE HYPOTHESIS SPACES
We consider a second measure of the complexity of H, called the Vapnik-
Chervonenkis dimension of H (VC dimension). Using it, we can state bounds on
sample complexity in terms of VC(H) rather than |H|.
The VC dimension measures the complexity of the hypothesis space H not
by the number of distinct hypotheses |H|, but instead by the number of
distinct instances from X that can be completely discriminated using H.
It uses a notion called shattering, illustrated below.
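To make shattering concrete, here is a small sketch assuming H is the class of one-sided threshold hypotheses h_t(x) = (x >= t) on the real line: this H can shatter any single point but no set of two points (the labeling "left point positive, right point negative" can never be produced), so VC(H) = 1. The helper below simply enumerates all labelings and checks whether some threshold realizes each one.

from itertools import product

def shatters(points, thresholds):
    # H shatters the points if every +/- labeling (dichotomy) of them
    # is realized by at least one threshold hypothesis h_t(x) = (x >= t).
    for labeling in product([True, False], repeat=len(points)):
        realized = any(
            all((x >= t) == y for x, y in zip(points, labeling))
            for t in thresholds
        )
        if not realized:
            return False
    return True

# Candidate thresholds placed between and around the points are enough for this check.
print(shatters([3.0], thresholds=[2.0, 4.0]))            # True  -> a single point is shattered
print(shatters([3.0, 5.0], thresholds=[2.0, 4.0, 6.0]))  # False -> the (+, -) labeling is never realized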