Matters of Discussion
- Theory of generalization: generalization bound, approximation-generalization tradeoff, bias and variance, learning curve
- Finite and Infinite Hypothesis Spaces
- Probably Approximately Correct (PAC) Learning
- Bayes theorem
- MDL principle
Compiled By: Dr. Nilamadhab Mishra [(PhD- CSIE) Taiwan]
Generalization in Machine Learning:
❖ Definition:
Generalization is the core goal of machine learning – to
build models that can accurately predict outcomes on data
they haven't encountered during training.
❖ Importance:
Effective generalization ensures that models are useful in
real-world applications, not just in the specific training
environment.
❖ Factors Influencing Generalization:
Overfitting (model learns noise in training data) and
underfitting (model is too simple to capture underlying
patterns) can both hamper generalization.
Theory of generalization
❖ In machine learning, generalization refers to a model's ability to
  perform well on unseen data after being trained on a specific
  dataset.
❖ Generalization bounds are mathematical expressions that provide
  upper limits on this generalization ability, essentially quantifying
  how well a trained model will perform on new, unseen data.
❖ The approximation-generalization tradeoff highlights the inherent
  tension between a model's ability to fit the training data (low
  approximation error) and its ability to generalize to new data (low
  generalization error).
Generalization Bounds:
❖ These bounds provide a theoretical way to assess how well a
  learning algorithm will generalize.
❖ They typically relate the training error to the generalization
  error (error on unseen data).
❖ Common Approaches:
➢ Rademacher Complexity: Measures the complexity of a
  hypothesis space by how well it can fit random noise.
➢ VC Dimension: A measure of the capacity of a hypothesis
  space, given by the largest number of points it can shatter
  (label in every possible way).
➢ Concentration Inequalities: Statistical tools used to bound
  the difference between the empirical (training) error and the
  true (generalization) error.
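As a rough illustration (an added sketch, not from the original slides), the code below applies Hoeffding's inequality with a union bound over a finite hypothesis class, assuming bounded 0-1 loss, to bound the gap between training error and true error:

import math

def hoeffding_generalization_gap(n_samples, hypothesis_count, delta):
    # Bound on |training error - true error| holding with probability >= 1 - delta,
    # via Hoeffding's inequality plus a union bound over a finite hypothesis class.
    return math.sqrt(math.log(2 * hypothesis_count / delta) / (2 * n_samples))

# Example: 10,000 training samples, 1,000 candidate hypotheses, 95% confidence.
gap = hoeffding_generalization_gap(n_samples=10_000, hypothesis_count=1_000, delta=0.05)
print(f"With probability >= 0.95, true error <= training error + {gap:.3f}")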
Key insights of VC dimension
❖ The VC dimension is defined for spaces of binary
    functions (functions to {0,1}).
❖ Several generalizations have been suggested for
    spaces of non-binary functions.
❖ For multi-valued functions (functions to {0,...,n}),
    the Natarajan dimension can be used.
❖ VC dimension: binary classification; Natarajan dimension: multi-class classification.
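As an added illustrative sketch (not from the slides), the code below checks shattering for 1-D threshold classifiers h_t(x) = 1 if x >= t, else 0: any single point can be shattered, but no pair of points can, so the VC dimension of this class is 1.

from itertools import product

def thresholds_can_shatter(points):
    # Can classifiers h_t(x) = 1 if x >= t else 0 realise every labelling of the points?
    # Only thresholds at the points themselves (plus one below and one above) matter.
    candidates = [min(points) - 1] + list(points) + [max(points) + 1]
    for labels in product([0, 1], repeat=len(points)):
        if not any(all((1 if x >= t else 0) == y for x, y in zip(points, labels))
                   for t in candidates):
            return False  # this labelling cannot be produced by any threshold
    return True

print(thresholds_can_shatter([0.5]))       # True  -> one point is shattered
print(thresholds_can_shatter([0.2, 0.8]))  # False -> two points are not, so VC dimension = 1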
Natarajan dimension
For example, let’s say your visual system can recognize N
different Boolean features
(e.g., has-tail, has-fur, can-bark, has-four-legs, can-fly,
and so on).
We might describe a dog as an animal that
has-fur (1), can-fly (0), has-four-legs (1), can-bark (1),
where 1 = yes and 0 = no.
(Fur: soft, thick hair that covers the bodies of some animals.)
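A minimal coding sketch of this Boolean-feature encoding (with hypothetical feature names) is shown below; each animal becomes a point in {0,1}^N, and a multi-class hypothesis maps such vectors to labels like "dog" or "bird", which is the setting where the Natarajan dimension applies.

# Hypothetical Boolean features (1 = yes, 0 = no), following the slide's example.
FEATURES = ["has_fur", "can_fly", "has_four_legs", "can_bark"]

dog = {"has_fur": 1, "can_fly": 0, "has_four_legs": 1, "can_bark": 1}
bird = {"has_fur": 0, "can_fly": 1, "has_four_legs": 0, "can_bark": 0}

def to_vector(animal):
    # Encode an animal as a point in {0,1}^N.
    return [animal[f] for f in FEATURES]

print("dog  ->", to_vector(dog))   # [1, 0, 1, 1]
print("bird ->", to_vector(bird))  # [0, 1, 0, 0]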
Approximation-Generalization Tradeoff:
❖ This tradeoff highlights the tension between a model's
  ability to fit the training data (approximation) and its ability
  to generalize to new data (generalization).
❖ Complex Models: Have more parameters and can fit the
  training data very well (low approximation error), but they
  are more prone to overfitting, leading to high generalization
  error.
❖ Simple Models: Have fewer parameters, may not fit the
  training data perfectly (high approximation error), but can
  be more robust and generalize better to new data (low
  generalization error).
Cont..
Finding the Sweet Spot:
The goal is to find a model complexity that balances
approximation and generalization, minimizing the overall error.
Example:
➢ If the model is too simple (underfitting), it will have high bias
   (systematic error) and high training and generalization error.
➢ If the model is too complex (overfitting), it will have low
   bias, high variance (sensitivity to training data), and high
   generalization error.
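A small sketch of this behaviour (an added example assuming NumPy, with made-up noisy data): low-degree polynomial fits underfit (high bias), very high-degree fits overfit (high variance), and the gap between training and test error makes the tradeoff visible.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)          # fit a polynomial model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typically: degree 1 underfits (both errors high), degree 12 overfits
# (low training error, higher test error), and degree 3 sits near the sweet spot.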
Summary
❖ In essence, the theory of generalization and generalization
  bounds provide a framework for understanding the
  limitations and potential of machine learning models.
❖ The approximation-generalization tradeoff highlights the
  practical challenges of balancing model complexity and
  predictive performance on unseen data.
Finite and Infinite Hypothesis Spaces
❖ ML professionals and data scientists make an initial assumption for the
    solution of the problem.
❖ In Machine Learning, this assumption is known as a Hypothesis.
    The terms Hypothesis and Model are often used
    interchangeably.
Cont.. Finite Hypothesis Space:
- In a finite hypothesis space, the learning algorithm is restricted to
choose from a limited, finite set of hypothesis functions. This means
that the algorithm can only consider a fixed number of possible
solutions during the learning process.
- Examples of algorithms that operate within a finite hypothesis space
include decision trees with a limited depth, linear regression with a
fixed set of features, and finite sets of rules in rule-based systems.
- Advantages of a finite hypothesis space include computational
efficiency, simplicity, and potentially improved generalization to new,
unseen data if the true function can be accurately represented within
the limited set of hypotheses.
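As a toy illustration (hypothetical data and names, not from the slides), the sketch below enumerates a finite hypothesis space of decision stumps over a fixed threshold grid and picks the one with the lowest training error; exhaustive search is feasible precisely because the space is finite.

# Finite hypothesis space: stumps "predict 1 if x >= t else 0" for t on a fixed grid.
thresholds = [k / 10 for k in range(11)]           # 11 candidate hypotheses
train = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # (feature value, label) pairs

def training_error(t):
    return sum((1 if x >= t else 0) != y for x, y in train) / len(train)

best_t = min(thresholds, key=training_error)       # enumerate every hypothesis
print(f"best threshold = {best_t}, training error = {training_error(best_t)}")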
Cont.. Infinite Hypothesis Space:
- In an infinite hypothesis space, the learning algorithm can potentially
consider an unlimited number of hypothesis functions. This allows for a more
flexible and expressive representation of the true function, as the algorithm
is not constrained by a fixed set of hypotheses.
- Examples of algorithms that operate within an infinite hypothesis space
include neural networks with a large number of parameters, kernel methods
such as support vector machines with infinite-dimensional feature spaces,
and non-parametric models like Gaussian processes.
- Advantages of an infinite hypothesis space include the ability to
approximate complex, nonlinear functions more accurately, potentially
leading to better performance on complex tasks where the true function is
highly intricate or not easily captured by a simpler model.
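For contrast, a sketch of the infinite case (again an illustrative assumption rather than the slide's own examples): the same stumps but with any real-valued threshold. The hypotheses cannot be enumerated, yet the learner searches the space implicitly, since only midpoints between sorted training values can change the training error.

# Infinite hypothesis space: the threshold t may be any real number.
train = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]         # (feature value, label) pairs
xs = sorted(x for x, _ in train)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]   # finitely many distinct behaviours

def training_error(t):
    return sum((1 if x >= t else 0) != y for x, y in train) / len(train)

best_t = min(candidates, key=training_error)
print(f"best real-valued threshold = {best_t}, training error = {training_error(best_t)}")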
PAC Learning in ML
❖ Probably approximately correct learning
❖ It is a framework for mathematical analysis of
    machine learning.
❖ It defines a mathematical relationship between the
    number of training samples, the error rate, and the
    probability that the available training data are large
    enough to attain the desired error rate.
Why PAC Learning?
❖ We use a collection of training samples to train a machine
   learning algorithm to learn classification or regression patterns.
    Usually, we calculate various metrics to estimate its
   performance. One of these metrics is the error rate.
❖ However, the problem is that we can only estimate it using a test
   set. The estimates differ from the actual value due to their
   statistical nature. So, we may ask how many training samples we
   need to be confident about our estimates.
❖ The PAC theory is an answer to that. It’s about finding the
   relationship between the true error rate and the number of
   training samples.
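One classical PAC result for a finite hypothesis class in the realizable setting states that m >= (1/epsilon) * (ln|H| + ln(1/delta)) training samples suffice to reach true error at most epsilon with probability at least 1 - delta. A small sketch of that sample-complexity bound:

import math

def pac_sample_complexity(hypothesis_count, epsilon, delta):
    # Samples sufficient for a consistent learner over a finite hypothesis class
    # to reach true error <= epsilon with probability >= 1 - delta.
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Example: one million hypotheses, 5% error, 99% confidence.
print(pac_sample_complexity(1_000_000, epsilon=0.05, delta=0.01))  # prints 369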
Cont.. Example
Cat & Dog Classification
Cont..
❖ Imagine you have a teacher who wants to teach you how to recognize
    cats from dogs based on pictures.
❖ In the PAC learning framework, the teacher's goal is to help you learn
    to classify new images correctly with high probability and a small
    margin of error.
❖ "Probably" refers to the idea that the learner should be able to classify
    new examples with high probability.
❖ This means that the learner should be correct most of the time when
    making predictions based on the training data.
❖ "Approximately Correct" means that the learner is not expected to be
    perfect but should classify most examples correctly.
❖ There is an allowance for some errors, but the errors should be limited
    within a small margin.
Cont..
❖ In PAC learning, the learner aims to find a hypothesis (a
    model) that approximates the true relationship between
    inputs and outputs in the data.
❖ The goal is to minimize the error between the hypothesis
    and the true function that generated the data.
❖ To achieve PAC learning, the learner needs to find a
    balance between fitting the training data well (achieving
    low error) and generalizing to new, unseen data.
Key insight on PAC learning
PAC learning is a theoretical framework that aims to
address the fundamental challenge in ML:
▪ building models that can generalize well to
  unseen data.
▪ PAC learning aims to find a hypothesis, or a
  model, that approximates the target concept as
  accurately as possible.
Bayes theorem
❖ Bayes' Theorem is a fundamental concept in
  probability theory and machine learning that helps
  update the probability of a hypothesis based on new
  evidence.
❖ In machine learning, it's crucial for tasks like
  classification, where it's used to calculate the
  probability of a data point belonging to a specific
  class, given its features.
❖ How it's used in Machine Learning:
❖ Classification: Bayes' Theorem is the foundation of
  the Naive Bayes classifier, a simple yet effective
  algorithm for classifying data points into different
  categories.
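A tiny Naive Bayes sketch with hypothetical word likelihoods (an added example, not the slides' own): multiply the class prior by the per-word likelihoods, here in log space, and choose the class with the larger score.

import math

prior = {"spam": 0.3, "ham": 0.7}                       # hypothetical class priors
likelihood = {                                          # hypothetical P(word | class)
    "spam": {"free": 0.30, "meeting": 0.02, "offer": 0.25},
    "ham":  {"free": 0.03, "meeting": 0.20, "offer": 0.04},
}

def classify(words):
    # "Naive" assumption: words are independent given the class; log space avoids underflow.
    scores = {c: math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
              for c in prior}
    return max(scores, key=scores.get)

print(classify(["free", "offer"]))  # -> spam
print(classify(["meeting"]))        # -> ham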
Cont.. Key Concepts:
Prior Probability:
The initial belief about the probability of an event before considering any
new evidence (e.g., the overall probability of spam emails).
Likelihood:
The probability of observing the new evidence given a specific hypothesis
(e.g., the probability of seeing certain words in an email given that it is
spam).
Posterior Probability:
The updated probability of a hypothesis after considering the new evidence
(e.g., the probability that an email is spam after analyzing its content).
Evidence:
The overall probability of observing the new evidence.
Cont..
Bayes' Theorem Formula:
P(A|B) = [P(B|A) * P(A)] / P(B)
Where:
- P(A|B) is the posterior probability of event A given evidence B.
- P(B|A) is the likelihood of evidence B given event A.
- P(A) is the prior probability of event A.
- P(B) is the evidence, or the probability of observing evidence B.
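A small numeric sketch of the formula applied to the spam example above, with made-up probabilities:

# Hypothetical numbers: P(spam), P(word | spam), P(word | not spam).
p_spam = 0.30                 # prior P(A)
p_word_given_spam = 0.60      # likelihood P(B|A)
p_word_given_ham = 0.05       # P(B | not A)

# Evidence P(B) by total probability, then the posterior P(A|B) via Bayes' theorem.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.3f}")  # ~0.837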
MDL principle
❖     The Minimum Description Length (MDL) principle in machine
    learning is a model selection method that favors the model that
    provides the shortest description of the data.
❖ It's based on the idea that the best model for a given dataset is
    the one that allows for the most efficient compression of that
    data.
❖ MDL balances model complexity with its ability to fit the data,
    selecting the model that achieves the best trade-off between
    these two factors.
Cont..
❖ MDL is rooted in information theory and
    aims to find the simplest explanation for
    observed data.
❖ The principle suggests that the best model is
    the one that allows you to describe the data
    using the fewest number of bits.
How MDL Works:
1. Model Selection:
MDL compares different models (hypotheses) by calculating how many bits are needed to
describe both the model and the data using that model.
2. Description Length:
This involves calculating the length of the code required to represent the model itself (model
description length) and the length of the code needed to describe the data given that model
(data description length).
3. Trade-off:
A more complex model will have a shorter data description length (because it fits the data
better), but a longer model description length (because it is more complex). Conversely, a
simpler model will have a longer data description length, but a shorter model description length.
4. Optimal Model:
The MDL principle selects the model for which the sum of the model description length and the
data description length is minimized.
How MDL Works: cont..example
➢ Imagine you have a dataset of coin flips.
➢ You could model this data with a simple model that assumes a fair
   coin (50/50 chance of heads or tails), or with a more complex
   model that allows for a different probability of heads and tails.
➢ MDL would compare the description length of the data using each
   model.
➢ If the observed data is close to 50/50, the simpler
   fair coin model might be preferred, as it would require fewer bits
   to describe the data given the model.
➢ If the observed data deviates significantly from 50/50, a more
   complex model allowing for a different probability might be
    chosen.
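A minimal sketch of this coin-flip comparison: total description length = model cost + data cost, with the data cost taken as the negative log2-likelihood in bits; the 0.5 * log2(n) parameter cost for the biased model is a common MDL convention assumed here, not something fixed by the slides.

import math

def data_bits(n_heads, n_tails, p_heads):
    # Bits needed to encode the flip sequence under a Bernoulli(p_heads) model.
    return -(n_heads * math.log2(p_heads) + n_tails * math.log2(1 - p_heads))

def description_lengths(n_heads, n_tails):
    n = n_heads + n_tails
    fair = data_bits(n_heads, n_tails, 0.5)         # no free parameter to encode
    p_hat = n_heads / n                             # fitted bias (assume 0 < p_hat < 1)
    biased = 0.5 * math.log2(n) + data_bits(n_heads, n_tails, p_hat)
    return fair, biased

print(description_lengths(52, 48))  # near-fair data: the fair-coin model wins
print(description_lengths(80, 20))  # skewed data: the biased-coin model wins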
Explore on your end [A-4]
Investigate VC and Natarajan dimensions with
some relevant problem scenarios.
Vapnik-Chervonenkis dimension—VC
   Cheers for your great patience!
                     Queries, please?