MODULE 4
CHAPTER 8
BAYESIAN LEARNING
 8.1 INTRODUCTION TO PROBABILITY-BASED
 LEARNING
• Probability-based learning is one of the most important practical learning methods; it combines prior knowledge (prior probabilities) with observed data.
• Probabilistic learning uses the concept of probability theory that
  describes how to model randomness, uncertainty, and noise to predict
  future events.
• It is a tool for modelling large datasets and uses Bayes rule to infer
  unknown quantities, predict and learn from data.
• In a probabilistic model, randomness plays a major role, so the solution is a probability distribution over possible outcomes; in a deterministic model there is no randomness, so the same initial conditions produce the same single outcome every time the model is run.
• Bayesian learning differs from probabilistic learning in that it uses subjective probabilities (i.e., probabilities based on an individual’s belief or interpretation about the outcome of an event, which can change over time) to infer the parameters of a model.
• Two practical learning algorithms called Naïve Bayes learning and
  Bayesian Belief Network (BBN) form the major part of Bayesian
  learning. These algorithms use prior probabilities and apply Bayes rule
  to infer useful information.
• Bayesian Learning is a learning method that describes and represents knowledge in an uncertain domain and provides a way to reason about this knowledge using probability measures.
• It uses Bayes theorem to infer the unknown parameters of a model.
• Bayesian inference is useful in many applications that involve reasoning and diagnosis, such as game theory, medicine, etc. Bayesian inference is particularly powerful for handling missing data and for estimating the uncertainty in predictions.
For Understanding
 • The prior probability is the probability assigned to an event before
   the arrival of some information that makes it necessary to revise
   the assigned probability.
 • The revision of the prior is carried out using Bayes' rule. The new
   probability assigned to the event after the revision is called posterior
   probability.
 What is prior probability in Naive Bayes?
 • The probability of each class before any features are observed is known as the prior probability in the Naive Bayes method.
 • Posterior probability = prior probability + new data, i.e., the prior revised in light of the newly observed evidence.
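 • As a small illustration of this revision, the sketch below applies Bayes' rule with hypothetical numbers (the values are made up purely for illustration, not taken from the slides).

```python
# Hypothetical numbers, chosen only to illustrate how a prior is revised.
prior = 0.3        # P(h): probability assigned to the event before new information
likelihood = 0.8   # P(E | h): probability of observing the evidence if h holds
evidence = 0.5     # P(E): overall probability of the evidence

posterior = likelihood * prior / evidence   # Bayes' rule
print(posterior)   # 0.48 -- the revised (posterior) probability of h
```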
For Understanding
 What is likelihood probability in Machine Learning with example?
 • In simple words, as the name suggests, the likelihood is a function that tells us how likely a specific data point is under the assumed data distribution.
 For example,
 • Suppose there are two data points in the dataset. If the likelihood of the first data point is greater than that of the second, the first point fits the assumed distribution better (see the sketch below).
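 • A minimal sketch, assuming a hypothetical normal distribution N(5, 1) and two made-up data points, showing that a point near the mean has a higher likelihood than one far from it:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed (hypothetical) data distribution: N(5, 1)
mu, sigma = 5.0, 1.0

print(gaussian_pdf(5.2, mu, sigma))   # ~0.391  -- near the mean, higher likelihood
print(gaussian_pdf(8.0, mu, sigma))   # ~0.0044 -- far from the mean, lower likelihood
```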
For Understanding
 Examples of Probability and Likelihood
 Example 1 – Coin Toss
 • In the context of coin tosses, likelihood and probability represent
   different aspects of the same experiment.
 • The likelihood refers to the probability of observing a specific
   outcome given a particular model or hypothesis.
 • On the other hand, probability represents the long-term frequency of
   an event occurring over multiple trials.
For Understanding
 • To recap: probability is generally something we consider when
   we have a model with a fixed set of parameters and we are
   interested in the types of data that might be generated.
 • Conversely, likelihood comes into play when we have already
   observed data and we want to examine how likely certain model
   parameters are.
 • The distinction between probability and likelihood is
   fundamentally important: Probability attaches to possible
   results; likelihood attaches to hypotheses.
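 • The contrast can be made concrete with a small sketch; the counts used here (7 heads in 10 tosses) are illustrative assumptions:

```python
from math import comb

# Probability: the model is fixed (a fair coin, p = 0.5); we ask about possible data.
p = 0.5
print(comb(10, 7) * p**7 * (1 - p)**3)   # ~0.117: P(7 heads in 10 tosses | p = 0.5)

# Likelihood: the data are fixed (7 heads observed in 10 tosses); we ask about parameters.
def likelihood(p, heads=7, tosses=10):
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

print(likelihood(0.5))   # ~0.117
print(likelihood(0.7))   # ~0.267 -- the hypothesis p = 0.7 explains this data better
```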
For Understanding
  What is Probability?
  Probability is a measure of the likelihood that an event will actually
  occur based on information or assumptions that are currently known.
  The probability of the event is commonly stated as a number between 0
  and 1, where 0 indicates impossibility and 1 indicates inevitability.
   To determine probability, use the following formula:
   Probability = Number of favorable outcomes / Total number of outcomes
  For instance, the probability of getting heads when flipping a fair coin is
  0.5 because there are two possible outcomes (heads or tails), and each
  outcome has an equal likelihood of occurring.
  Probability is used to describe the likelihood of events based on
  assumptions or to make predictions about the future.
For Understanding
  • Probability is used to make predictions about future
    events, whereas likelihood is used to estimate unknown
    parameters based on seen evidence.
8.2 FUNDAMENTALS OF BAYES THEOREM
• Naïve Bayes Model relies on Bayes theorem that works on the principle of three
  kinds of probabilities called prior probability, likelihood probability, and posterior
  probability.
• Prior Probability: It is the general probability of an uncertain event before an observation is seen or any evidence is collected. It is the initial probability that is believed before any new information arrives.
• Likelihood Probability: It is the relative probability of the observation occurring for each class, i.e., the sampling density of the evidence given the hypothesis. It is stated as P (Evidence | Hypothesis), which denotes how likely the evidence is given the hypothesis.
• Posterior Probability: It is the updated or revised probability of an event taking into account the observations from the training data. P (Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given the evidence from the training data. Therefore,
• Posterior probability = prior probability + new evidence
8.3 CLASSIFICATION USING BAYES MODEL
  • Naïve Bayes Classification models work on the principle of Bayes
    theorem.
  • Bayes’ rule is a mathematical formula used to determine the
    posterior probability, given prior probabilities of events.
   • Generally, Bayes theorem is used to select the most probable hypothesis from the data, combining prior knowledge about the hypothesis with the observed evidence. It is based on the calculation of the posterior probability and is stated as:
             P (Hypothesis h | Evidence E)
  • where, Hypothesis h is the target class to be classified and Evidence E
    is the given test instance.
• P (Hypothesis h| Evidence E) is calculated from the prior probability P
  (Hypothesis h), the likelihood probability P (Evidence E |Hypothesis h)
  and the marginal probability P (Evidence E).
• It can be written as Eq. (8.1):
     P (Hypothesis h | Evidence E) = [P (Evidence E | Hypothesis h) × P (Hypothesis h)] / P (Evidence E)      (8.1)
• where, P (Hypothesis h) is the prior probability of the hypothesis h
  without observing the training data or considering any evidence.
• It denotes the prior belief or the initial probability that the hypothesis h
  is correct. P (Evidence E) is the prior probability of the evidence E from
  the training dataset without any knowledge of which hypothesis holds.
  It is also called the marginal probability.
• P (Evidence E | Hypothesis h) is the conditional probability of Evidence E given Hypothesis h.
• It is the likelihood of observing the Evidence E in the training data when the hypothesis h is correct.
• P (Hypothesis h | Evidence E) is the posterior probability of
  Hypothesis h given Evidence E.
• It is the probability that the hypothesis h is correct after observing the training data, i.e., the evidence E.
• In other words, from the Bayes equation Eq. (8.1), one can observe that:
     Posterior Probability ∝ Prior Probability × Likelihood Probability
• Bayes theorem helps in calculating the posterior probability for a
  number of hypotheses, from which the hypothesis with the highest
  probability can be selected.
• This selection of the most probable hypothesis from a set of
  hypotheses is formally defined as Maximum A Posteriori (MAP)
  Hypothesis
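• In symbols, the MAP hypothesis is the one with the highest posterior probability; because P (Evidence E) is the same for every hypothesis, it can be dropped from the comparison:
     h_MAP = argmax over h in H of P (Evidence E | Hypothesis h) × P (Hypothesis h)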
• What is Naive Bayes Classifier?
• Naive Bayes classifier is a probabilistic machine learning model based
  on Bayes’ theorem. It assumes independence between features and
  calculates the probability of a given input belonging to a particular
  class. It’s widely used in text classification, spam filtering, and
  recommendation systems.
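• As an illustration of the text-classification use case, a minimal sketch with scikit-learn is given below; the messages, labels, and the test sentence are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical spam-filtering example; the messages and labels are made up.
messages = ["win a free prize now", "meeting at 10 am",
            "free offer click now", "lunch tomorrow at noon"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # bag-of-words count features

model = MultinomialNB()                  # Naive Bayes for count features
model.fit(X, labels)

test = vectorizer.transform(["free prize offer"])
print(model.predict(test))               # expected to print ['spam']
```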
8.3.1 NAÏVE BAYES ALGORITHM
 • It is a supervised binary-class or multi-class classification algorithm that works on the principle of Bayes theorem.
 • There is a family of Naïve Bayes classifiers based on a common principle.
 • These algorithms assume that the features of the dataset are independent and that each feature carries equal weight.
 • It works particularly well for large datasets and is very fast. It is one of the most effective and simple classification algorithms.
 • The algorithm treats all features as independent of each other, even though each of them individually depends on the class of the object being classified.
 • Each feature contributes a probability value independently during classification, and hence the algorithm is called 'naïve' (the scoring rule is summarized below). Some important applications of these algorithms are text classification, recommendation systems and face recognition.
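 • Concretely, for a test instance with feature values x1, x2, …, xn, each class c is scored as:
      P (c | x1, x2, …, xn) ∝ P (c) × P (x1 | c) × P (x2 | c) × … × P (xn | c)
   and the class with the highest score is predicted.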
• Solution: The training dataset T consists of 10 data instances with
  attributes such as ‘CGPA’, ‘Interactiveness’, ‘Practical Knowledge’
  and ‘Communication Skills’ as shown in Table 8.1.
• The target variable is Job Offer which is classified as Yes or No for a
  candidate student.
• Step 1: Compute the prior probability for the target feature ‘Job
  Offer’. The target feature ‘Job Offer’ has two classes, ‘Yes’ and ‘No’.
• It is a binary classification problem.
• Given a student instance, we need to classify whether ‘Job Offer =
  Yes’ or ‘Job Offer = No’.
• From the training dataset, we observe that the frequency or the
  number of instances with ‘Job Offer = Yes’ is 7 and ‘Job Offer = No’ is
  3.
• The prior probability for the target feature is calculated by dividing
  the number of instances belonging to a particular target class by the
  total number of instances.
• Hence, the prior probability for ‘Job Offer = Yes’ is 7/10 and ‘Job Offer
  = No’ is 3/10 as shown in Table 8.2.
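• A minimal sketch of this step, using only the class counts quoted above (7 Yes and 3 No out of 10 instances):

```python
# Class counts quoted in Step 1: 10 instances, 7 with Job Offer = Yes, 3 with No.
n_total, n_yes, n_no = 10, 7, 3

prior_yes = n_yes / n_total   # P(Job Offer = Yes) = 0.7
prior_no = n_no / n_total     # P(Job Offer = No)  = 0.3
print(prior_yes, prior_no)
```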
• Step 2: Compute the frequency matrix and the likelihood probability for each feature.
• Step 2(a): Feature – CGPA. Table 8.3 shows the frequency matrix for the feature CGPA.
• Table 8.4 shows how the likelihood probability is calculated for CGPA
  using conditional probability.
• As explained earlier, the likelihood probability is the sampling density of the evidence given the hypothesis.
• It is denoted as P (Evidence | Hypothesis), which says how likely the occurrence of the evidence is given the hypothesis.
• It is calculated as the number of instances with a particular attribute value and a given class value, divided by the total number of instances with that class value.
• For example P (CGPA ≥9 | Job Offer = Yes) denotes the number of
  instances with ‘CGPA ≥9’ and ‘Job Offer = Yes’ divided by the total number
  of instances with ‘Job Offer = Yes’.
• From the Table 8.3 Frequency Matrix of CGPA, number of instances with
  ‘CGPA ≥9’ and ‘Job Offer = Yes’ is 3. The total number of instances with
  ‘Job Offer = Yes’ is 7. Hence, P (CGPA ≥9 | Job Offer = Yes) = 3/7.
• Similarly, the Likelihood probability is calculated for all attribute values of
  feature CGPA.
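• The same calculation as a minimal sketch, using only the counts quoted above (3 instances with CGPA ≥ 9 and Job Offer = Yes, out of 7 Yes instances):

```python
# Counts quoted from Table 8.3: 3 instances have CGPA >= 9 together with
# Job Offer = Yes, out of 7 instances with Job Offer = Yes.
n_yes = 7
n_cgpa_ge9_and_yes = 3

p_cgpa_ge9_given_yes = n_cgpa_ge9_and_yes / n_yes
print(p_cgpa_ge9_given_yes)   # 3/7 ≈ 0.4286
```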
• Step 2(b): Feature – Interactiveness Table 8.5 shows the frequency
  matrix for the feature Interactiveness.
8.3.4 Gibbs Algorithm
  • The main drawback of Bayes optimal classifier is that it computes the
    posterior probability for all hypotheses in the hypothesis space and
    then combines the predictions to classify a new instance.
  • Gibbs algorithm is a sampling technique which randomly selects a
    hypothesis from the hypothesis space according to the posterior
    probability distribution and classifies a new instance.
  • It has been shown that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes optimal classifier.
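  • A minimal sketch of the Gibbs classification step, assuming a hypothetical hypothesis space of three simple rules and an already-computed posterior distribution over them (all values are made up for illustration):

```python
import random

# Hypothetical hypothesis space: three simple rules that map a CGPA value to a class.
hypotheses = [
    lambda cgpa: "Yes",                          # h1: always predict Yes
    lambda cgpa: "Yes" if cgpa >= 8 else "No",   # h2: a threshold rule
    lambda cgpa: "No",                           # h3: always predict No
]
posterior = [0.5, 0.3, 0.2]   # assumed P(h | training data) for h1, h2, h3

# Gibbs algorithm: draw ONE hypothesis according to the posterior distribution
# and use it alone to classify the new instance (no averaging over all hypotheses).
h = random.choices(hypotheses, weights=posterior, k=1)[0]
print(h(9.1))   # classification of a new instance with CGPA = 9.1
```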
8.4 NAÏVE BAYES ALGORITHM FOR
CONTINUOUS ATTRIBUTES
• There are two ways to apply the Naive Bayes algorithm to continuous attributes:
• 1. Discretize the continuous feature into a discrete feature.
• 2. Apply a Normal (Gaussian) distribution for the continuous feature.
Gaussian Naive Bayes Algorithm
In Gaussian Naive Bayes, the values of continuous features are assumed to be sampled from a Gaussian distribution.
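A minimal sketch of the Gaussian variant under assumed class-conditional means and standard deviations; the Gaussian parameters below are hypothetical, not estimated from Table 8.1, while the priors 0.7 and 0.3 come from the worked example:

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(x | class) when the feature is assumed to follow N(mu, sigma^2) within the class."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed class-conditional statistics (mean, std) of CGPA within each class.
stats = {"Yes": (8.9, 0.5), "No": (7.0, 0.6)}
priors = {"Yes": 0.7, "No": 0.3}

x = 8.5   # CGPA of a new (continuous-valued) test instance
scores = {c: priors[c] * gaussian_likelihood(x, mu, sigma)
          for c, (mu, sigma) in stats.items()}
print(max(scores, key=scores.get))   # class with the highest posterior score
```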
Thank You