
Matters of Discussion

Theory of generalization
Generalization bounds – approximation-generalization tradeoff – bias and variance – learning curve
Finite and infinite hypothesis spaces
Probably Approximately Correct (PAC) learning
Bayes' theorem
MDL principle

Compiled By: Dr. Nilamadhab Mishra [(PhD- CSIE) Taiwan]


Generalization in Machine Learning:
❖ Definition:
Generalization is the core goal of machine learning – to
build models that can accurately predict outcomes on data
they haven't encountered during training.
❖ Importance:
Effective generalization ensures that models are useful in
real-world applications, not just in the specific training
environment.
❖ Factors Influencing Generalization:
Overfitting (model learns noise in training data) and
underfitting (model is too simple to capture underlying
patterns) can both hamper generalization.
Theory of generalization
❖ In machine learning, generalization refers to a model's ability to
perform well on unseen data after being trained on a specific
dataset.

❖ Generalization bounds are mathematical expressions that provide upper limits on this generalization ability, essentially quantifying how well a trained model will perform on new, unseen data.

❖ The approximation-generalization tradeoff highlights the inherent tension between a model's ability to fit the training data (low approximation error) and its ability to generalize to new data (low generalization error).
Generalization Bounds:
❖ These bounds provide a theoretical way to assess how well a
learning algorithm will generalize.
❖ They typically relate the training error to the generalization
error (error on unseen data).
❖ Common Approaches:
➢ Rademacher Complexity: Measures the complexity of a
hypothesis space by how well it can fit random noise.
➢ VC Dimension: A measure of the capacity of a hypothesis
space, indicating how well it can classify data points.
➢ Concentration Inequalities: Statistical tools used to bound
the difference between the empirical (training) error and the
true (generalization) error.
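To make the relationship concrete, here is a minimal sketch (assuming a finite hypothesis space and the standard Hoeffding-plus-union-bound argument; the sample and hypothesis counts are illustrative) of how such a bound ties training error to generalization error:

```python
import math

def hoeffding_generalization_gap(n_samples, n_hypotheses, delta=0.05):
    """With probability at least 1 - delta, every hypothesis in a finite class of
    size n_hypotheses satisfies |training error - true error| <= this value."""
    return math.sqrt(math.log(2 * n_hypotheses / delta) / (2 * n_samples))

# More data tightens the bound; a richer hypothesis space loosens it.
print(hoeffding_generalization_gap(n_samples=1000, n_hypotheses=100))    # ~0.064
print(hoeffding_generalization_gap(n_samples=1000, n_hypotheses=10**6))  # ~0.094
```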
Key insights of VC dimension
❖ The VC dimension is defined for spaces of binary
functions (functions to {0,1}).
❖ Several generalizations have been suggested for
spaces of non-binary functions.
❖ For multi-valued functions (functions to {0,...,n}),
the Natarajan dimension can be used.
❖ In short: the VC dimension applies to binary classification, while the Natarajan dimension applies to multi-class classification.
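As a tiny illustration (the helper function and the finite grid of candidate thresholds below are hypothetical stand-ins for the full class of threshold classifiers h_t(x) = 1 if x >= t), a brute-force shattering check shows why that class can shatter one point but not two, i.e. its VC dimension is 1:

```python
from itertools import product

def can_shatter(points, hypotheses):
    """True if every possible 0/1 labeling of the points is realized by some hypothesis."""
    for labels in product([0, 1], repeat=len(points)):
        if not any(tuple(h(x) for x in points) == labels for h in hypotheses):
            return False
    return True

# A grid of threshold classifiers h_t(x) = 1 if x >= t, standing in for the continuum of thresholds.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 1.5, 2.5)]

print(can_shatter([1.0], thresholds))       # True: a single point can be labeled both ways
print(can_shatter([1.0, 2.0], thresholds))  # False: the labeling (1, 0) is unreachable
```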
Natarajan dimension
For example, suppose your visual system can recognize N different Boolean features
(e.g., has-tail, has-fur, can-bark, has-four-legs, can-fly, and so on).
We might describe a dog as an animal that
has-fur (1), can-fly (0), has-four-legs (1), and can-bark (1),
where 1 means yes and 0 means no.

Fur– soft thick hair that covers the bodies of some animals
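To make the Boolean-feature description above concrete, a minimal sketch (the feature names, animals, and the hand-written rule are purely illustrative) of a multi-valued hypothesis mapping Boolean feature vectors to classes {0, ..., n}, which is the setting the Natarajan dimension addresses:

```python
# Feature order: (has_fur, can_fly, has_four_legs, can_bark), with 1 = yes and 0 = no.
examples = {
    "dog":    (1, 0, 1, 1),
    "cat":    (1, 0, 1, 0),
    "parrot": (0, 1, 0, 0),
}
classes = {"dog": 0, "cat": 1, "parrot": 2}

def hypothesis(features):
    """A multi-class hypothesis: maps a Boolean feature vector to a label in {0, 1, 2}."""
    has_fur, can_fly, has_four_legs, can_bark = features
    if can_bark:
        return 0      # dog
    if has_fur:
        return 1      # cat
    return 2          # parrot

for name, feats in examples.items():
    assert hypothesis(feats) == classes[name]
```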
Approximation-Generalization Tradeoff:
❖ This tradeoff highlights the tension between a model's
ability to fit the training data (approximation) and its ability
to generalize to new data (generalization).
❖ Complex Models: Have more parameters and can fit the
training data very well (low approximation error), but they
are more prone to overfitting, leading to high generalization
error.
❖ Simple Models: Have fewer parameters, may not fit the
training data perfectly (high approximation error), but can
be more robust and generalize better to new data (low
generalization error).

Cont..
Finding the Sweet Spot:
The goal is to find a model complexity that balances
approximation and generalization, minimizing the overall error.
Example:
➢ If the model is too simple (underfitting), it will have high bias
(systematic error) and high training and generalization error.
➢ If the model is too complex (overfitting), it will have low
bias, high variance (sensitivity to training data), and high
generalization error.
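A minimal sketch of this sweet spot (using numpy polynomial fits; the target function, noise level, and degrees are arbitrary illustrative choices): training error keeps falling as the degree grows, while test error eventually rises again.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 20)
y_train = true_f(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)
y_test = true_f(x_test) + rng.normal(scale=0.2, size=x_test.shape)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)  # higher degree = more complex model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typically: degree 1 underfits (both errors high), degree 3 balances the tradeoff,
# and degree 15 overfits (near-zero train error, larger test error).
```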

Summary
❖ In essence, the theory of generalization and generalization bounds provides a framework for understanding the limitations and potential of machine learning models.

❖ The approximation-generalization tradeoff highlights the practical challenges of balancing model complexity and predictive performance on unseen data.

Finite and Infinite Hypothesis Spaces
❖ ML professionals and data scientists make an initial assumption about the solution to a problem.

❖ In machine learning, this assumption is known as a hypothesis. At various times, "hypothesis" and "model" are used interchangeably.

Cont.. Finite Hypothesis Space:
- In a finite hypothesis space, the learning algorithm is restricted to
choose from a limited, finite set of hypothesis functions. This means
that the algorithm can only consider a fixed number of possible
solutions during the learning process.
- Examples of algorithms that operate within a finite hypothesis space
include decision trees with a limited depth, linear regression with a
fixed set of features, and finite sets of rules in rule-based systems.
- Advantages of a finite hypothesis space include computational
efficiency, simplicity, and potentially improved generalization to new,
unseen data if the true function can be accurately represented within
the limited set of hypotheses.

Cont.. Infinite Hypothesis Space:
In an infinite hypothesis space, the learning algorithm can potentially
consider an unlimited number of hypothesis functions. This allows for a more
flexible and expressive representation of the true function, as the algorithm
is not constrained by a fixed set of hypotheses.
- Examples of algorithms that operate within an infinite hypothesis space
include neural networks with a large number of parameters, kernel methods
such as support vector machines with infinite-dimensional feature spaces,
and non-parametric models like Gaussian processes.
- Advantages of an infinite hypothesis space include the ability to
approximate complex, nonlinear functions more accurately, potentially
leading to better performance on complex tasks where the true function is
highly intricate or not easily captured by a simpler model.
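A minimal sketch contrasting the two cases (one-dimensional threshold "stumps"; the data and the grid of candidate thresholds are illustrative assumptions): a finite hypothesis space can be searched exhaustively, while its infinite counterpart cannot be enumerated at all.

```python
import numpy as np

x = np.array([0.2, 0.4, 0.6, 0.9])
y = np.array([0, 0, 1, 1])

# Finite hypothesis space: threshold classifiers restricted to 11 grid points,
# so the best hypothesis can be found by exhaustive search.
finite_H = [lambda v, t=t: (v >= t).astype(int) for t in np.linspace(0.0, 1.0, 11)]
best = min(finite_H, key=lambda h: int(np.sum(h(x) != y)))
print("training errors of best finite-space hypothesis:", int(np.sum(best(x) != y)))

# Infinite hypothesis space: allowing any real-valued threshold (or any neural-network
# weight vector) means the space cannot be enumerated; learning must instead rely on
# optimization or additional structure rather than exhaustive search.
```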

PAC Learning in ML
❖ Probably approximately correct learning
❖ It is a framework for mathematical analysis of
machine learning.
❖ It defines a mathematical relationship between the number of training samples, the desired error rate, and the probability (confidence) that a model trained on that data attains the desired error rate.

Why PAC Learning?
❖ We use a collection of training samples to train a machine
learning algorithm to learn classification or regression patterns.
Usually, we calculate various metrics to estimate its
performance. One of these metrics is the error rate.
❖ However, the problem is that we can only estimate it using a test
set. The estimates differ from the actual value due to their
statistical nature. So, we may ask how many training samples we
need to be confident about our estimates.
❖ The PAC theory is an answer to that. It’s about finding the
relationship between the true error rate and the number of
training samples.
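As a rough sketch of that relationship (assuming a finite hypothesis space and a learner that outputs a hypothesis consistent with the training data), the classic bound m >= (1/epsilon) * (ln|H| + ln(1/delta)) can be evaluated directly; the numbers below are illustrative:

```python
import math

def pac_sample_complexity(n_hypotheses, epsilon, delta):
    """Number of samples sufficient so that, with probability >= 1 - delta, any
    hypothesis consistent with the training data has true error <= epsilon."""
    return math.ceil((math.log(n_hypotheses) + math.log(1 / delta)) / epsilon)

# 5% target error with 95% confidence over a space of 1,000 hypotheses:
print(pac_sample_complexity(n_hypotheses=1000, epsilon=0.05, delta=0.05))  # 199
```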

Cont.. Example

Cat & Dog Classification


Cont..
❖ Imagine you have a teacher who wants to teach you how to recognize
cats from dogs based on pictures.
❖ In the PAC learning framework, the teacher's goal is to help you learn
to classify new images correctly with high probability and a small
margin of error.
❖ "Probably" refers to the idea that the learner should be able to classify
new examples with high probability.
❖ This means that the learner should be correct most of the time when
making predictions based on the training data.
❖ "Approximately Correct" means that the learner is not expected to be
perfect but should classify most examples correctly.
❖ There is an allowance for some errors, but the errors should be limited
within a small margin.
Cont..
❖ In PAC learning, the learner aims to find a hypothesis (a
model) that approximates the true relationship between
inputs and outputs in the data.
❖ The goal is to minimize the error between the hypothesis
and the true function that generated the data.
❖ To achieve PAC learning, the learner needs to find a
balance between fitting the training data well (achieving
low error) and generalizing to new, unseen data.

Key insight on PAC learning
PAC learning is a theoretical framework that aims to
address the fundamental challenge in ML:
▪ building models that can generalize well to
unseen data.
▪ PAC learning aims to find a hypothesis, or a
model, that approximates the target concept as
accurately as possible.

Bayes theorem
❖ Bayes' Theorem is a fundamental concept in
probability theory and machine learning that helps
update the probability of a hypothesis based on new
evidence.
❖ In machine learning, it's crucial for tasks like
classification, where it's used to calculate the
probability of a data point belonging to a specific
class, given its features.
❖ How it's used in Machine Learning:
❖ Classification: Bayes' Theorem is the foundation of
the Naive Bayes classifier, a simple yet effective
algorithm for classifying data points into different
categories.
Cont.. Key Concepts:
Prior Probability:
The initial belief about the probability of an event before considering any
new evidence (e.g., the overall probability of spam emails).
Likelihood:
The probability of observing the new evidence given a specific hypothesis
(e.g., the probability of seeing certain words in an email given that it is
spam).
Posterior Probability:
The updated probability of a hypothesis after considering the new evidence
(e.g., the probability that an email is spam after analyzing its content).
Evidence:
The overall probability of observing the new evidence.
Cont..
Bayes' Theorem Formula:
P(A|B) = [P(B|A) * P(A)] / P(B)
Where:
P(A|B) is the posterior probability of event A given evidence
B.
P(B|A) is the likelihood of evidence B given event A.
P(A) is the prior probability of event A.
P(B) is the evidence, or the probability of observing evidence
B.
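A minimal worked example applying the formula to the spam scenario above (all probabilities are made-up illustrative numbers):

```python
# Hypothetical numbers: 20% of mail is spam; the word "winner" appears in
# 30% of spam messages and in 1% of legitimate messages.
p_spam = 0.20
p_word_given_spam = 0.30
p_word_given_ham = 0.01

# Evidence P(B): total probability of observing the word at all.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # ~0.882: seeing the word sharply raises the spam probability
```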

MDL principle
❖ The Minimum Description Length (MDL) principle in machine
learning is a model selection method that favors the model that
provides the shortest description of the data.
❖ It's based on the idea that the best model for a given dataset is
the one that allows for the most efficient compression of that
data.
❖ MDL balances model complexity with its ability to fit the data,
selecting the model that achieves the best trade-off between
these two factors.

Cont..
❖ MDL is rooted in information theory and
aims to find the simplest explanation for
observed data.

❖ The principle suggests that the best model is the one that allows you to describe the data using the fewest number of bits.

How MDL Works:
1. Model Selection:
MDL compares different models (hypotheses) by calculating how much data is needed to
describe both the model and the data using that model.
2. Description Length:
This involves calculating the length of the code required to represent the model itself (model
description length) and the length of the code needed to describe the data given that model
(data description length).
3. Trade-off:
A more complex model will have a shorter data description length (because it fits the data
better), but a longer model description length (because it is more complex). Conversely, a
simpler model will have a longer data description length, but a shorter model description length.
4. Optimal Model:
The MDL principle selects the model for which the sum of the model description length and the
data description length is minimized.

How MDL Works: cont..example
➢ Imagine you have a dataset of coin flips.
➢ You could model this data with a simple model that assumes a fair
coin (50/50 chance of heads or tails), or with a more complex
model that allows for a different probability of heads and tails.
➢ MDL would compare the description length of the data using each
model.
➢ If the observed data is close to 50/50, the simpler fair-coin model might be preferred, as describing the model and the data together would require fewer bits.
➢ If the observed data deviates significantly from 50/50, a more complex model allowing for a different probability of heads might be chosen.
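A minimal sketch of that comparison using two-part code lengths (the 8-bit cost charged for the biased model's extra parameter is an arbitrary illustrative choice):

```python
import math

def data_bits(n_heads, n_tails, p_heads):
    """Bits needed to encode the flip sequence under a Bernoulli(p_heads) model."""
    return -(n_heads * math.log2(p_heads) + n_tails * math.log2(1 - p_heads))

def mdl_score(n_heads, n_tails, p_heads, model_bits):
    """Total description length = model description length + data description length."""
    return model_bits + data_bits(n_heads, n_tails, p_heads)

n_heads, n_tails = 70, 30
fair = mdl_score(n_heads, n_tails, 0.5, model_bits=0)              # fair coin: no parameter to encode
biased = mdl_score(n_heads, n_tails, n_heads / 100, model_bits=8)  # pay ~8 bits for the fitted bias
print(f"fair coin: {fair:.1f} bits, biased coin: {biased:.1f} bits")
# On this skewed sample the biased model wins (~96 bits vs. 100 bits);
# on data near 50/50 the fair model's zero parameter cost would win instead.
```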
Explore on your end [A-4]

Investigate the VC and Natarajan dimensions with some relevant problem scenarios.

VC: Vapnik-Chervonenkis dimension

Cheers for your great patience!
Any queries?

