Chapter 6: Classification and Prediction
This chapter focuses on techniques used to classify data and make predictions based on learned models.
The main topics include:
● Bayesian Classification
● Instance-Based Methods
● Classification Accuracy
Bayesian Classification: Why?
Bayesian classification is a probabilistic approach used in machine learning and statistics. Key points are:
1. Probabilistic learning:
○ It calculates explicit probabilities for a hypothesis.
○ Useful for practical learning problems where probabilistic approaches provide reliable results.
2. Incremental:
○ Bayesian methods allow incremental learning.
○ With each training sample, the probability of a hypothesis being correct can be updated.
3. Probabilistic prediction:
○ Allows multiple hypotheses, weighted based on their probabilities.
4. Standard:
○ Even when Bayesian methods are computationally intractable, they provide a standard of optimal decision-making against which other methods can be measured.
Key Takeaways:
● Bayesian classification assigns a sample to the class with the highest posterior probability.
● It simplifies to maximizing the product of likelihood P(X∣Ci) and prior P(Ci).
● Computing likelihood directly can be difficult in practice.
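To make the posterior-maximization rule concrete, below is a minimal Python sketch of a naïve Bayesian classifier for categorical attributes (it uses the class-conditional independence assumption discussed in the next section). The toy data, attribute values, and function names are hypothetical, and no smoothing is applied for unseen attribute values.

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # Estimate priors P(Ci) and likelihoods P(attribute = value | Ci) by counting.
    priors = Counter(labels)
    likelihoods = defaultdict(Counter)          # (class, attribute index) -> value counts
    for sample, c in zip(samples, labels):
        for i, value in enumerate(sample):
            likelihoods[(c, i)][value] += 1
    return priors, likelihoods, len(labels)

def classify(sample, priors, likelihoods, n):
    # Return the class Ci that maximizes P(Ci) * product over attributes of P(x_k | Ci).
    best_class, best_score = None, -1.0
    for c, class_count in priors.items():
        score = class_count / n                                   # prior P(Ci)
        for i, value in enumerate(sample):
            score *= likelihoods[(c, i)][value] / class_count     # likelihood P(x_k | Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: (outlook, temperature) -> play
samples = [("sunny", "hot"), ("rain", "cool"), ("sunny", "cool"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
priors, likelihoods, n = train_naive_bayes(samples, labels)
print(classify(("sunny", "cool"), priors, likelihoods, n))   # prints "yes"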
Naïve Bayesian Classifier: Comments
Advantages
1. Easy to implement:
○ The algorithm is straightforward and simple to apply.
2. Good results in most cases:
○ Despite its assumptions, the Naïve Bayesian Classifier performs well in many practical
applications.
Disadvantages
1. Class Conditional Independence Assumption:
○ Assumes that the features (attributes) are conditionally independent given the class.
○ In real-world scenarios, this assumption often does not hold, reducing accuracy.
2. Dependencies Between Variables:
○ Practically, dependencies exist among variables.
○ Example:
■ Hospitals: Patient profiles depend on age, family history, symptoms, and diseases.
■ Dependencies such as fever and cough indicating diseases like lung cancer cannot be captured by the Naïve Bayesian Classifier.
3. Inability to Model Dependencies:
○ The classifier fails to account for interdependencies among variables.
Solution: Bayesian Belief Networks
To address dependencies:
● Use Bayesian Belief Networks.
Bayesian Networks
Overview:
● A Bayesian Belief Network allows a subset of variables to be conditionally independent.
Key Points:
1. Graphical Model:
○ Represents causal relationships among variables.
○ Nodes: Random variables.
○ Links: Dependencies (arrows indicate relationships).
2. Features:
○ Specifies joint probability distributions.
○ Handles dependencies between variables.
○ Has no loops or cycles (acyclic graph).
3. Example:
○ X and Y are parents of Z.
○ Y is the parent of P.
○ There is no direct dependency between Z and P.
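For this structure, the network factorizes the joint distribution as P(X, Y, Z, P) = P(X) · P(Y) · P(Z | X, Y) · P(P | Y). The short Python sketch below evaluates that product from hand-written conditional probability tables; all probability values are hypothetical placeholders for binary variables.

# Hypothetical CPT entries for binary variables X, Y, Z, P.
P_X = {True: 0.3, False: 0.7}                                # P(X)
P_Y = {True: 0.6, False: 0.4}                                # P(Y)
P_Z_given_XY = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.4, (False, False): 0.1}     # P(Z = True | X, Y)
P_P_given_Y = {True: 0.8, False: 0.2}                        # P(P = True | Y)

def joint(x, y, z, p):
    # P(X=x, Y=y, Z=z, P=p) = P(x) * P(y) * P(z | x, y) * P(p | y)
    pz = P_Z_given_XY[(x, y)] if z else 1 - P_Z_given_XY[(x, y)]
    pp = P_P_given_Y[y] if p else 1 - P_P_given_Y[y]
    return P_X[x] * P_Y[y] * pz * pp

print(joint(True, True, True, False))   # one cell of the joint distribution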
Learning Bayesian Networks
Several Cases for Learning Bayesian Networks
1. Given Network Structure and All Variables Observable:
○ Only the Conditional Probability Tables (CPTs) need to be learned.
○ This is the simplest case because the structure (relationships) is already provided.
2. Network Structure Known, Some Hidden Variables:
○ Use methods like gradient descent to learn parameters.
○ This approach is similar to training neural networks.
3. Network Structure Unknown, All Variables Observable:
○ Requires searching through the model space to reconstruct the network structure (graph
topology).
○ Algorithms optimize based on observed data to determine the best graph structure.
4. Unknown Structure and All Hidden Variables:
○ This is the most complex case.
○ Currently, no good algorithms exist to effectively solve this problem due to its complexity.
Instance-Based Methods
What are Instance-Based Methods?
● Definition:
Instance-based learning stores training examples and delays processing (lazy evaluation) until a
new instance needs to be classified.
● Key Feature:
The method compares a new instance to existing training examples to make a classification decision.
Key Approach: k-Nearest Neighbor (k-NN)
1. Representation:
○ Instances are treated as points in a Euclidean space (geometric space).
2. How it Works:
○ Given a new instance, calculate the distance between the new instance and all training
examples.
○ Find the k closest neighbors (based on Euclidean distance).
○ Assign the class based on the majority class of the nearest neighbors.
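A minimal Python sketch of this procedure, assuming numeric attributes, plain Euclidean distance, and an unweighted majority vote; the data points and the choice of k below are illustrative.

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, points, labels, k=3):
    # Sort training points by distance to the query and vote among the k closest.
    ranked = sorted(zip(points, labels), key=lambda pair: euclidean(query, pair[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Illustrative call: two numeric attributes, binary class.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels = ["A", "A", "B", "B"]
print(knn_classify((1.1, 1.1), points, labels, k=3))   # prints "A"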
Memory-Based Reasoning (MBR)
Concept:
MBR mirrors human reasoning by identifying similar examples from the past and applying what is
known/learned to solve a new problem.
1. Examples in Daily Life:
○ Recognizing traffic patterns/routes.
○ Meeting new people by recalling past experiences.
○ Trying new food based on past preferences.
2. Terminology:
In MBR, the similar examples are referred to as neighbors.
Applications of MBR
1. Fraud Detection:
○ Identify fraudulent activities by comparing with known cases.
2. Customer Response Prediction:
○ Predict customer behavior based on historical patterns.
3. Medical Treatments:
○ Suggest treatments by matching symptoms with similar past cases.
4. Classifying Responses:
○ Process free-text responses and assign codes, often used in natural language processing
tasks.
The k-Nearest Neighbor (k-NN) Algorithm
Key Concept
● k-NN is an instance-based learning algorithm where:
○ All training instances are represented as points in an n-dimensional space.
○ The classification or prediction of a new instance is based on the distance to its nearest
neighbors.
Target Functions in k-NN
1. Discrete Target Function (Classification):
○ k-NN returns the most common class (majority vote) among the k nearest training examples.
○ Example: Predict if an object is "yes" or "no" based on the majority class.
2. Continuous Target Function (Regression/Numerical Prediction):
○ k-NN returns the mean value of the k nearest neighbors.
○ Example: Predicting a numerical value like temperature, house price, etc.
Robustness to Noise
● k-NN is robust to noisy data because averaging over the k nearest neighbors smooths out the impact of outliers.
Summary
The k-NN algorithm is:
● Non-parametric (no assumptions about the data distribution).
● Lazy learning (stores the training data and delays computation until prediction).
● Used for both classification and regression tasks.
MBR Challenges
1. Choosing Appropriate Historical Data
○ Ensuring relevant and high-quality data is used for training the system.
2. Efficient Representation of Training Data
○ Lazy learning delays computation until prediction.
○ Efficient indexing methods are critical for quick retrieval of neighbors.
3. Choosing the Number of Neighbors (k)
○ Small k:
■ Too noisy; sensitive to outliers.
○ Large k:
■ Approaches the size of the dataset, reducing specificity and leading to the same result
for all cases.
Discussion on the k-NN Algorithm
1. Choosing the Distance Function
○ Common measures:
■ Euclidean Distance: Most widely used.
■ Manhattan Distance: Sum of absolute differences.
2. Impact of Variable Units
○ Units of variables can distort distance calculations.
○ Solution:
■ Scaling/Normalization: Standardize variables to eliminate unit dependency.
■ Methods:
■ z-scores: Scale values to have mean 0 and standard deviation 1.
■ Min-max scaling: Normalize values between 0 and 1.
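A minimal sketch of the two scaling methods for a single numeric column, using only the standard library; the sample values are illustrative and both functions assume the column is not constant.

def z_score(values):
    # Rescale so the column has mean 0 and standard deviation 1.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def min_max(values):
    # Rescale the column to the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30000, 45000, 52000, 110000]    # illustrative raw values
print(z_score(incomes))
print(min_max(incomes))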
Choosing the Combination Function
1. Categorical Target Variables (Classification)
○ Majority Rule: Class with the highest frequency among neighbors.
○ Weighted Voting:
■ Neighbors closer to the query point have higher weights.
2. Numerical Target Variables (Regression)
○ Average: Mean of the values of the k-nearest neighbors.
○ Weighted Average:
■ Assign higher weights to closer neighbors (inverse of the distance).
Weighted k-NN
1. Why Weight Neighbors?
○ Not all neighbors contribute equally.
○ Neighbors closer to the query point should have a greater impact.
2. Weight Formula:
w = 1 / d(xq, xi)²
Here:
○ d(xq, xi) is the distance between the query point xq and the training instance xi.
3. Distance-Weighted Nearest Neighbor
○ Assign weights inversely proportional to distance.
○ Helps improve prediction accuracy for both:
■ Classification (categorical).
■ Regression (numerical).
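A minimal sketch of distance-weighted k-NN for a numeric target, applying the weight w = 1 / d(xq, xi)² from above; the input layout (tuples of numeric attributes and numeric targets) is assumed for illustration.

import math

def weighted_knn_predict(query, points, targets, k=3):
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # Keep the k training points closest to the query.
    nearest = sorted(zip(points, targets), key=lambda pair: dist(query, pair[0]))[:k]
    total_weight = weighted_sum = 0.0
    for point, target in nearest:
        d = dist(query, point)
        if d == 0:
            return target                  # an exact match dominates the prediction
        w = 1.0 / d ** 2                   # weight inversely proportional to squared distance
        total_weight += w
        weighted_sum += w * target
    return weighted_sum / total_weight

For classification, the same weights would be summed per class and the class with the largest total weight returned, instead of computing a weighted average.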
Weighing Variables
1. Purpose:
○ As in clustering, weighting variables during scaling or standardization helps reflect how relevant each variable is.
2. Need:
○ Some variables require higher weights (e.g., standardized income vs. standardized age).
3. Outcome:
○ Balances the impact of variables during distance-based methods.
Curse of Dimensionality
1. Problem:
○ High-dimensional data can cause distances to be dominated by irrelevant attributes.
2. Solutions:
○ Stretch axes or eliminate irrelevant attributes.
○ Use different scaling factors.
○ Cross-validation to determine the best factor.
3. Key Strategy:
○ Assign smaller weights (scaling factors) to less relevant variables, eliminating them entirely (weight zero) if necessary.
Lazy Learning
1. Concept:
○ No explicit training phase; all historical data acts as the training set.
○ Minimal training cost.
2. Drawbacks:
○ Classifying a new instance requires real-time computation, which can be time-consuming.
3. Key Operation:
○ Finding the k-nearest neighbors for new observations.
MBR Strengths
1. Advantages:
○ Uses data "as is" with distance functions and combination functions.
○ Adapts easily with new data.
○ Produces results without lengthy training.
2. Real-Time Utility:
○ Efficient for applications requiring constant updates.
Classification Accuracy
Classifier accuracy is a measure of how accurately the model predicts the output label.
Holdout Estimation
● Concept:
○ When data is limited, the holdout method splits the dataset into:
■ Training Set: Builds the model.
■ Testing Set: Evaluates model performance.
○ Typical split: 1/3 for testing and the rest for training.
● Problem:
○ Samples may not be representative.
○ Example: A class might be missing in the test set.
● Advanced Method:
○ Stratification: Ensures balanced representation of each class in both training and test sets.
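A minimal sketch of a stratified holdout split using only the standard library, keeping roughly 1/3 of each class for testing as described above; the record/label representation is assumed for illustration.

import random
from collections import defaultdict

def stratified_holdout(records, labels, test_fraction=1/3, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append((record, label))
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * test_fraction))
        test.extend(items[:cut])       # each class contributes ~1/3 of its records to the test set
        train.extend(items[cut:])
    return train, test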
Repeated Holdout Method
● Definition: The holdout estimate becomes more reliable by repeating the process multiple times
using different subsamples of data.
● Steps:
○ In each iteration, a portion of data is randomly selected for training.
■ Stratification: Ensures equal representation of each class in samples.
○ Error rates from different iterations are averaged to calculate an overall error rate.
● Issues:
○ Test Set Overlap: Repeated random selection may lead to overlapping test sets.
○ Question: Can overlapping be avoided?
Cross-Validation
1. Definition:
○ Cross-validation avoids overlapping test sets by systematically splitting the data into subsets.
○ It is an improvement over the simple holdout method.
2. Steps:
○ Step 1: Split data into k subsets of equal size.
○ Step 2: Use each subset in turn for testing, while the remaining subsets are used for training.
○ This is called k-fold cross-validation.
3. Key Points:
○ Stratification ensures equal class proportions in each fold.
○ The error estimates from each iteration are averaged to produce the overall error estimate.
More on Cross-Validation
1. Stratified Ten-Fold Cross-Validation:
○ Why 10 folds?:
■ Extensive experiments show that 10 folds provide the most accurate estimate.
■ Supported by theoretical evidence.
○ Stratification:
■ Reduces variance by ensuring balanced class representation in each fold.
2. Improvement:
○ Repeated Stratified Cross-Validation:
■ Ten-fold cross-validation is repeated multiple times.
■ Results are averaged to further reduce variance.
Cross-Validation Example
1. Scenario:
○ Data size: 1000 records.
○ k = 10 folds:
■ Randomize data to eliminate biases.
■ Split data into 10 equal subsets (folds) of 100 records each.
2. Process:
○ Fold 1: Used as test set; remaining 9 folds are for training.
○ Fold 2: Used as test set; remaining 9 folds for training.
○ Repeat this process until all folds have been used for testing.
3. Result:
○ Each record gets tested once.
○ Final error estimate is the average of all fold results.
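A minimal sketch of this 10-fold procedure; train_and_evaluate is a hypothetical placeholder for any routine that builds a model on the training folds and returns its error rate on the test fold.

import random

def k_fold_cross_validation(records, train_and_evaluate, k=10, seed=0):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)          # randomize to eliminate ordering biases
    folds = [shuffled[i::k] for i in range(k)]     # k roughly equal subsets
    errors = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [r for j in range(k) if j != i for r in folds[j]]
        errors.append(train_and_evaluate(train_folds, test_fold))
    return sum(errors) / k                         # overall error = average over the folds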
Leave-One-Out Cross-Validation (LOOCV)
1. Definition:
○ A specific form of cross-validation where the number of folds equals the number of training
instances.
○ For n training instances, the classifier is trained n times.
2. Key Points:
○ Makes the best use of data.
○ No random subsampling is involved.
○ It is computationally expensive because the model is trained multiple times.
■ Exception: Nearest Neighbor (NN) methods.
The Bootstrap
1. Definition:
○ A resampling technique where instances are selected with replacement.
○ Bootstrap creates multiple datasets from the original data for training/testing.
2. Process:
○ Sample n times with replacement from the original dataset of n instances to create a new training set of size n.
○ Instances not selected form the test set.
3. Difference from Cross-Validation:
○ Cross-validation uses without replacement sampling, while bootstrap uses with replacement.
The 0.632 Bootstrap
1. Concept:
○ A particular instance has a probability of 1 − 1/n of not being picked in a single draw.
○ Over n draws with replacement, the probability that it is never picked is (1 − 1/n)^n, which for large n approaches e⁻¹ ≈ 0.368.
2. Implications:
○ Around 63.2% of the instances are included in the training data.
○ The remaining 36.8% of the instances appear in the test data.
Example of Bootstrap
1. Dataset Size:
○ The original dataset has 1000 observations.
2. Process:
○ Create a training set by sampling with replacement 1000 times.
○ The size of the training set remains 1000, but:
■ Some observations appear multiple times.
■ Some observations do not appear at all.
3. Test Set:
○ Observations not appearing in the training set form the test set.
○ The size of the test set is variable.
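A minimal sketch of one bootstrap split matching the example above: sample with replacement until the training set reaches the original size, then collect the never-selected observations as the test set.

import random

def bootstrap_split(records, seed=0):
    rng = random.Random(seed)
    n = len(records)
    chosen = [rng.randrange(n) for _ in range(n)]   # indices sampled with replacement
    train = [records[i] for i in chosen]            # size n; duplicates are expected
    unused = set(range(n)) - set(chosen)            # observations never selected
    test = [records[i] for i in sorted(unused)]
    return train, test                              # test size varies, roughly 0.368 * n for large n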
Bagging and Boosting
Bagging
1. Definition:
○ Bagging stands for Bootstrap Aggregating.
○ It improves accuracy by reducing variance in model predictions.
2. Steps:
○ Generate t bootstrap samples (with replacement) from the original dataset.
○ Train a new classifier Ct for each sample St.
○ For classification problems:
■ Combine predictions using the majority vote rule.
○ For regression problems:
■ Combine results by averaging predictions.
3. Outcome:
○ The final classifier C* is an aggregated version of all the individual classifiers.
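A minimal sketch of bagging for classification; train_classifier is a hypothetical function that fits a model on one bootstrap sample and returns it as a callable, so the aggregated classifier can take a majority vote.

import random
from collections import Counter

def bagging(records, labels, train_classifier, t=10, seed=0):
    rng = random.Random(seed)
    n = len(records)
    classifiers = []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]                # bootstrap sample St
        sample_x = [records[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        classifiers.append(train_classifier(sample_x, sample_y))  # classifier Ct
    def aggregated(x):                                            # final classifier C*
        votes = Counter(clf(x) for clf in classifiers)
        return votes.most_common(1)[0][0]                         # majority vote
    return aggregated

For regression, the majority vote would simply be replaced by the average of the individual predictions.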
Boosting
1. Definition:
○ Boosting builds multiple classifiers sequentially, where each classifier pays more attention to
misclassified examples.
2. Steps:
○ Learn a series of classifiers.
○ Misclassified examples in each iteration receive higher weight for the next classifier.
3. Characteristics:
○ Boosting works well with decision trees or Bayesian classifiers.
○ Requires linear time and constant space.
Key Points
1. Boosting improves classification accuracy by:
○ Focusing on misclassified examples in subsequent iterations.
○ Combining multiple weak classifiers into a strong ensemble model.
2. The final hypothesis is a weighted combination of all classifiers, with more accurate classifiers
receiving higher weights.
3. Boosting is iterative and adapts the model during each step.
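Below is a minimal AdaBoost-style sketch of these ideas; train_weighted_classifier is a hypothetical learner that accepts per-example weights and returns a callable model, and the exact weighting scheme shown (error-based alpha, exponential reweighting) is one common choice rather than the only one.

import math

def boost(records, labels, train_weighted_classifier, rounds=5):
    n = len(records)
    weights = [1.0 / n] * n
    ensemble = []                                           # list of (alpha, classifier)
    for _ in range(rounds):
        clf = train_weighted_classifier(records, labels, weights)
        wrong = [clf(x) != y for x, y in zip(records, labels)]
        err = sum(w for w, bad in zip(weights, wrong) if bad)
        if err >= 0.5:
            break
        if err == 0:
            ensemble.append((1.0, clf))                     # perfect on the weighted sample; stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)             # more accurate classifiers get higher weight
        ensemble.append((alpha, clf))
        # Increase the weight of misclassified examples, decrease the rest, then renormalize.
        weights = [w * math.exp(alpha if bad else -alpha) for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    votes = {}
    for alpha, clf in ensemble:
        label = clf(x)
        votes[label] = votes.get(label, 0.0) + alpha        # weighted combination of classifiers
    return max(votes, key=votes.get)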
Evaluating Numeric Prediction
1. Mean Squared Error (MSE):
● Definition: The average of the squared differences between predicted and actual values.
Characteristics:
● Penalizes larger errors more heavily due to the squaring of differences.
● Easy to manipulate mathematically.
● Widely used for regression problems.
2. Root Mean Squared Error (RMSE):
● Definition: The square root of the MSE, bringing the error back to the original units.
Characteristics:
● Gives an interpretable error on the same scale as the original values.
● Sensitive to outliers (like MSE).
3. Mean Absolute Error (MAE):
● Definition: The average of the absolute differences between predicted and actual values.
Characteristics:
● Less sensitive to outliers compared to MSE.
● Represents the average "absolute" error.
4. Relative Error:
● Definition: Expresses errors as a percentage of the target value, useful for understanding errors in
relative terms.
● Example:
○ If an error of 50 occurs while predicting a value of 500, the relative error is 10%.
Key Differences:
1. MSE and RMSE: Both emphasize large errors due to squaring, but RMSE makes the error more
interpretable.
2. MAE: Less sensitive to large outliers as it does not square the errors.
3. Relative Error: Provides a percentage-based understanding of error magnitude, useful for scaled
evaluation.
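A minimal sketch computing the four measures discussed above for paired lists of actual and predicted values; the relative-error line assumes non-zero actual values.

import math

def error_metrics(actual, predicted):
    n = len(actual)
    diffs = [a - p for a, p in zip(actual, predicted)]
    mse = sum(d ** 2 for d in diffs) / n                            # Mean Squared Error
    rmse = math.sqrt(mse)                                           # Root Mean Squared Error
    mae = sum(abs(d) for d in diffs) / n                            # Mean Absolute Error
    rel = sum(abs(d) / abs(a) for d, a in zip(diffs, actual)) / n   # mean relative error
    return mse, rmse, mae, rel

print(error_metrics([500, 200], [450, 210]))   # illustrative: relative errors of 10% and 5%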
Lift Charts: Explanation and Practical Use
Definition
A lift chart is a visual tool used to compare the performance of predictive models or decisions. It shows the
"lift," or improvement, of targeting a subset of the population (using a model or strategy) versus random
selection.
Why Use Lift Charts?
1. Costs are unknown: In practice, you may not always have exact cost figures for decision-making.
2. Comparing Scenarios: Instead of relying solely on cost-based analysis, decisions are compared
based on how effective they are relative to the baseline.
How a Lift Chart Works
1. X-Axis: Proportion of the population targeted (e.g., 10%, 20%, etc.).
2. Y-Axis: Proportion of positive responses achieved (e.g., purchases, clicks, or any positive outcome).
3. Baseline (Random): A straight line representing expected results if no model is used (random
selection).
4. Lift Curve: The curve showing the actual performance of the model.
○ A steeper lift curve means the model is better at identifying promising subsets of the
population.
Practical Interpretation
● The higher the curve above the baseline, the better the predictive model or decision strategy.
● Decision-makers can visually identify the trade-offs:
○ How much of the population to target.
○ What percentage of responses to expect.
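A minimal sketch of the numbers behind a lift chart: records are ranked by a model score, and for each targeted fraction of the population the captured positives are compared with what random selection would be expected to capture. The inputs are assumed to be a per-record model score and a 0/1 response flag.

def lift_table(scores, outcomes, steps=10):
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_positives = sum(outcomes)
    rows = []
    for step in range(1, steps + 1):
        cutoff = int(n * step / steps)                        # target the top step/steps of the population
        captured = sum(outcome for _, outcome in ranked[:cutoff])
        expected_random = total_positives * step / steps      # baseline: random selection
        rows.append((step / steps, captured / total_positives, captured / expected_random))
    return rows   # (fraction targeted, fraction of positives captured, lift over baseline)

Plotting the second column against the first gives the lift curve; the diagonal where both are equal is the random-selection baseline.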