1. What is Classification? What is Prediction?
● Classification:
○ Classification is a data mining or machine learning technique used to categorize data into
predefined groups or classes.
○ It works with categorical variables (e.g., “yes” or “no”, “fraudulent” or “not fraudulent”).
○ The goal is to use a training dataset to create a model that can assign class labels to new,
unseen data.
○ Key Concept:
■ In classification, the outputs are discrete (limited, specific classes).
○ Examples:
■ Email classification: Spam vs. not spam.
■ Disease diagnosis: Determining whether a tumor is cancerous or benign.
■ Credit scoring: Deciding if a customer is high-risk or low-risk for loans.
● Prediction:
○ Prediction involves forecasting unknown or future outcomes based on patterns in historical
data.
○ Prediction often works with numerical values (continuous outputs), unlike classification.
○ Key Concept:
■ In prediction, the output can take any value within a range (continuous values).
○ Examples:
■ Predicting housing prices based on size and location.
■ Forecasting stock prices.
■ Estimating monthly energy consumption for a household.
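A minimal sketch of the distinction, assuming scikit-learn is available; the toy arrays, labels, and feature values are invented for illustration, not taken from these notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1, 2], [2, 1], [8, 9], [9, 8]])           # input features (hypothetical)

# Classification: the target is a discrete class label ("spam" / "not spam").
y_class = np.array(["spam", "spam", "not spam", "not spam"])
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[1.5, 1.5]]))     # -> a class label, e.g. ['spam']

# Prediction (regression): the target is a continuous number (e.g. a price).
y_value = np.array([100.0, 120.0, 480.0, 510.0])
reg = DecisionTreeRegressor().fit(X, y_value)
print(reg.predict([[1.5, 1.5]]))     # -> a numeric value within the observed range
```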
2. Supervised Learning vs. Unsupervised Learning
Supervised Learning (Classification)
● Definition:
○ Supervised learning uses labeled data (data where the target outcome, or class label, is
known).
○ The model is trained using a training dataset where both inputs (features) and outputs (labels)
are available.
● How It Works:
○ The algorithm identifies a relationship between input variables and the output label.
○ Once the model is trained, it can classify new input data based on what it has learned.
● Key Features:
○ Requires supervision (labeling of data).
○ Solves classification problems (categorical outputs).
● Examples:
○ Spam Detection: Emails are labeled as spam or not spam based on text content.
○ Medical Diagnosis: Patients' medical data are labeled as "healthy" or "diseased."
○ Image Recognition: Classifying photos as "cat" or "dog" based on labeled datasets.
Unsupervised Learning (Clustering)
● Definition:
○ Unsupervised learning uses unlabeled data (data where class labels are unknown).
○ The goal is to find patterns, structures, or groupings within the data.
● How It Works:
○ The algorithm identifies relationships and creates clusters based on similarities or differences.
○ These clusters are not predefined but discovered by the model.
● Key Features:
○ No supervision (class labels are unavailable).
○ Focuses on clustering and discovering hidden patterns.
● Examples:
○ Market Segmentation: Grouping customers based on purchasing habits.
○ Anomaly Detection: Detecting unusual patterns in a network (e.g., fraud).
○ Customer Clustering: Segmenting users into categories like “frequent buyers” or “occasional
visitors.”
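A small sketch of the contrast, assuming scikit-learn; the toy points and the "frequent"/"occasional" labels are invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# Supervised: class labels are known up front and the model learns to reproduce them.
y = np.array(["occasional", "occasional", "occasional",
              "frequent", "frequent", "frequent"])
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[2, 2], [9, 9]]))           # predicted class labels

# Unsupervised: no labels; KMeans discovers two groups purely from similarity.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                               # discovered cluster ids, e.g. [0 0 0 1 1 1]
```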
3. Prediction Problems: Classification vs. Numeric Prediction
Classification
● Definition:
○ Classification predicts categorical class labels (e.g., yes/no, fraud/non-fraud).
○ The model is built using training data that contains known class labels.
● How It Works:
○ The model learns the relationship between features (independent variables) and class labels
(dependent variable).
○ Once trained, the model predicts class labels for new data points.
● Key Points:
○ Output is categorical (discrete classes).
○ Requires labeled data for training.
● Applications:
○ Credit/Loan Approval:
■ Classify applicants as “eligible” or “not eligible” based on income, credit score, etc.
○ Medical Diagnosis:
■ Predict whether a tumor is cancerous or benign using medical reports.
○ Fraud Detection:
■ Identify transactions as fraudulent or non-fraudulent.
○ Web Page Categorization:
■ Classify web pages into predefined categories (news, entertainment, sports, etc.).
Numeric Prediction
● Definition:
○ Numeric prediction predicts continuous numerical values based on input features.
○ Instead of assigning classes, it forecasts a specific value.
● How It Works:
○ The model learns the relationship between features (inputs) and a numeric output.
○ Predictions are made using regression algorithms.
● Key Points:
○ Output is continuous (numeric value).
○ Focuses on predicting unknown/missing values.
● Applications:
○ Loan Default Prediction:
■ Predict the likelihood or percentage chance of a customer defaulting on a loan.
○ Medical Diagnosis:
■ Predict cholesterol level based on health records.
○ Fraud Risk:
■ Predict the fraud probability score for financial transactions.
○ Weather Forecasting:
■ Predict future temperature, rainfall, or humidity levels.
○ House Price Prediction:
■ Forecast house prices based on size, location, and amenities.
Comparison Summary Table
● Output: Classification → categorical (e.g., yes/no, spam); Numeric Prediction → continuous (e.g., 0–100).
● Data labels: Classification → requires labeled training data; Numeric Prediction → requires numeric target values in the training data.
● Goal: Classification → assign data to specific categories; Numeric Prediction → predict a continuous value.
● Algorithms used: Classification → Decision Trees, Naive Bayes, SVM; Numeric Prediction → regression and linear models.
● Examples: Classification → spam detection, fraud detection; Numeric Prediction → house price prediction, weather forecasting.
Classification – A Two-Step Process
1. Model Construction
● Definition:
Model construction is the first step where the system learns a relationship between input attributes
and class labels using training data.
● Key Components:
○ Training Set:
■ A labeled dataset used to build the model.
■ Each tuple/sample is assumed to belong to a predefined class.
■ Example: In the provided data, “rank” and “years” determine whether someone is
tenured (“yes” or “no”).
○ Model Representation:
■ The constructed model can be represented in several ways:
■ Classification rules: e.g., “IF rank = Professor OR years > 6 THEN tenured =
yes.”
■ Decision Trees: A tree structure where each node represents a test condition.
■ Mathematical Formulae: Represent relationships using algebraic or statistical
models.
2. Model Usage
● Definition:
After construction, the model is used to classify new or unseen data.
● Steps:
○ Accuracy Estimation:
■ Use a test set (independent from the training set) to evaluate the model.
■ Compare the known labels of test samples with the labels predicted by the model.
■ Calculate the accuracy rate:
■ Accuracy = (Correctly Classified Samples / Total Samples in Test Set) × 100.
○ Overfitting:
■ Ensure the test set is independent to avoid overfitting, where the model memorizes
training data instead of generalizing.
● Final Usage:
○ If the model meets accuracy criteria, it is used to classify new, unseen data.
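A minimal sketch of the two-step process, assuming scikit-learn; the rank/years tuples below are invented in the spirit of the tenure example above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical tuples following "IF rank = Professor OR years > 6 THEN tenured = yes".
df = pd.DataFrame({
    "rank":    ["Assistant", "Assistant", "Associate", "Professor",
                "Professor", "Associate", "Assistant", "Professor"],
    "years":   [2, 7, 3, 8, 2, 7, 5, 10],
    "tenured": ["no", "yes", "no", "yes", "yes", "yes", "no", "yes"],
})
X = pd.get_dummies(df[["rank", "years"]])       # encode the categorical attribute
y = df["tenured"]

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy = {acc:.2%}")                  # correctly classified / total test samples
```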
Issues Regarding Classification and Prediction
1. Data Preparation
● Data Cleaning:
○ Preprocess the data to reduce noise and handle missing values.
● Relevance Analysis (Feature Selection):
○ Remove irrelevant or redundant features to improve model accuracy.
● Data Transformation:
○ Normalize or generalize data to ensure consistency across features.
2. Evaluating Classification Methods
● Predictive Accuracy:
○ The percentage of correct predictions on a test set.
● Speed:
○ Time needed to construct the model and to apply it to new data.
● Robustness:
○ Ability to handle noise and missing values in data.
● Scalability:
○ Efficiency when the dataset is large or disk-resident.
● Interpretability:
○ Understanding and insights provided by the model (e.g., clear decision rules).
● Goodness of Rules:
○ Compactness of decision trees and classification rules.
Summary
1. Classification involves building a model using training data to predict class labels for future or
unseen data.
2. Model Construction focuses on learning from labeled data.
3. Model Usage evaluates accuracy and applies the model for predictions.
4. Challenges include data preparation, accuracy evaluation, speed, and robustness.
Key Components of a Decision Tree:
● Internal Node: Represents a test on an attribute (e.g., "Is age > 30?").
● Branch: Represents the outcome of the test (e.g., "Yes" or "No").
● Leaf Node: Represents a class label or the final decision for a subset of data.
● Tree Structure:
○ Root Node: Starting point where all data resides.
○ Branches and nodes split the data based on attributes.
○ Leaf Nodes: Final classification outputs.
2. Two Phases of Decision Tree Generation
a. Tree Construction
● Steps:
○ Start with all training examples at the root node.
○ Recursively partition the data into subsets based on the values of selected attributes.
○ Each partition creates new internal nodes and branches based on attribute conditions.
○ The process continues until a stopping condition is met (e.g., all samples belong to the same
class).
● Attribute Selection:
○ Selecting the best attribute to split the data is key. Common criteria include:
■ Information Gain (Entropy reduction – used in ID3 algorithm).
■ Gini Index (used in CART algorithm).
b. Tree Pruning
● Definition:
Pruning simplifies the tree by removing branches that do not improve accuracy.
● Purpose:
○ To avoid overfitting, where the model memorizes noise or irrelevant patterns.
○ Pruned trees are more generalizable to unseen data.
● Steps:
○ Identify branches or nodes that contribute minimal value.
○ Remove branches that handle noise or outliers in the training data.
ID3 Algorithm for Decision Tree Induction
1. Overview of ID3 Algorithm
● Invented by: Quinlan (1986)
● Approach: ID3 uses a top-down recursive divide-and-conquer method to construct decision
trees.
Steps in ID3 Algorithm
Step 1: Start at the Root
● All training examples are placed at the root of the tree.
Step 2: Attribute Selection
● Goal: Identify the attribute that best splits the data.
● Heuristic Used: Information Gain
○ Information gain measures how well an attribute separates the data into classes.
○ Definition of Information Gain:
■ Gain = Entropy(Parent Node) − Weighted Sum of Entropy(Child Nodes)
2. Entropy
● Definition: Entropy is a measure of the randomness or uncertainty in the data.
● Formula:
$I(s) = -\log_2 P_s = \log_2 (1/P_s)$
○ $P_s$: the probability of the event $s$.
● Key Properties:
○ Entropy is high when events are less probable (more uncertainty/randomness).
○ Entropy is low when events are more probable (less uncertainty).
○ Cases:
■ When $P_s \to 0$: entropy approaches infinity (high uncertainty).
■ When $P_s = 1$: entropy approaches zero (no uncertainty).
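To make the two limiting cases concrete, a few direct evaluations of $I(s) = \log_2(1/P_s)$ (plain Python, no assumptions beyond the formula above):

```python
import math

for p in (0.01, 0.25, 0.5, 0.99, 1.0):
    info = math.log2(1 / p)    # I(s) = log2(1/P_s): information content of an event with probability p
    print(f"P_s = {p:<4} -> I(s) = {info:.3f} bits")
# Rare events (small P_s) carry many bits of information; a certain event (P_s = 1) carries none.
```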
Step 3: Recursive Partitioning
● The selected attribute is used to split the data into subsets.
● Each subset forms a child node.
● The process is repeated recursively for each child node until:
○ All samples at a node belong to the same class.
○ There are no attributes left for further splitting (in that case, majority voting is used to label the node).
Stopping Conditions
The partitioning stops when:
1. All samples for a given node belong to the same class.
2. There are no remaining attributes for further partitioning (majority voting is applied).
3. There are no samples left.
Key Points about ID3
● Works only with categorical attributes.
● Continuous-valued attributes must be discretized in advance.
● Uses information gain to determine the attribute with the most discriminative power.
Entropy and Information Gain Summary
1. Entropy:
○ Measures randomness or disorder in the data.
○ High entropy = more uncertainty.
○ Low entropy = less uncertainty.
2. Information Gain:
○ Reduction in entropy after splitting data on a specific attribute.
○ Attributes with higher information gain are preferred for splitting.
1. Information Gain
Definition:
Information Gain is a measure used to select the attribute that best splits the dataset into subsets in a
decision tree. It is based on the concept of entropy (measure of randomness/uncertainty).
How Information Gain Works
Information Gain calculates the reduction in entropy achieved by splitting the data using a particular
attribute.
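A minimal sketch of this computation in Python; the helper names and the example partition are mine, and the class counts mirror the 9-"yes"/5-"no" example used later in these notes:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """Entropy of the parent minus the size-weighted entropy of the partitions."""
    n = len(parent_labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted

# Hypothetical split of 14 yes/no labels by an attribute into three partitions:
parent = ["yes"] * 9 + ["no"] * 5
parts = [["yes", "yes", "no", "no", "no"],            # e.g. age <= 30
         ["yes", "yes", "yes", "yes"],                # e.g. age 31..40
         ["yes", "yes", "yes", "no", "no"]]           # e.g. age > 40
print(f"Gain = {information_gain(parent, parts):.3f}")
# -> 0.247 (the 0.246 quoted in these notes comes from rounding 0.940 - 0.694)
```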
2. Majority Voting
Definition:
Majority voting is a method used to classify data at a decision tree node when further splitting is not
possible or no attributes remain.
When is Majority Voting Used?
1. Stopping Condition:
○ All attributes have been exhausted, but the node still contains a mix of class labels.
2. How it Works:
○ The class with the largest number of samples (majority class) is assigned as the class label
for that node.
○ Example: If a node has 7 "Yes" and 3 "No," the node is labeled as "Yes" since it is the majority
class.
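The same rule expressed with only the Python standard library (the label list is the 7-"Yes"/3-"No" example above):

```python
from collections import Counter

labels_at_node = ["Yes"] * 7 + ["No"] * 3
majority_class, count = Counter(labels_at_node).most_common(1)[0]
print(majority_class, count)    # -> Yes 7: the node is labeled with the majority class
```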
Entropy of Composite Events and Examples
1. Composite Events (Examples A, B, C, D)
● Event A: Tossing a fair coin: $P_H = 0.5$, $P_T = 0.5$
○ Entropy is maximum (1 bit) because the uncertainty is highest (equal chance of head/tail).
● Event B: Tossing an unfair coin: $P_H = 0.25$, $P_T = 0.75$
○ Entropy decreases because the probability of tail is higher, reducing uncertainty.
● Event C: Tossing an unfair coin: $P_H = 0.001$, $P_T = 0.999$
○ Entropy is very low because the outcome is almost certain (tail).
● Event D: Tossing an unfair coin: $P_H = 0$, $P_T = 1$
○ Entropy = 0 because the outcome is completely certain.
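The four events can be checked numerically with the binary entropy $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$; the 0.811 and 0.011 values in the comment below are computed, not taken from the notes:

```python
from math import log2

def binary_entropy(p_head):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) taken as 0."""
    probs = [p for p in (p_head, 1 - p_head) if 0 < p < 1]
    return -sum(p * log2(p) for p in probs) if probs else 0.0

for name, p in [("A (fair)", 0.5), ("B", 0.25), ("C", 0.001), ("D", 0.0)]:
    print(f"Event {name}: P_H = {p} -> entropy = {binary_entropy(p):.3f} bits")
# A -> 1.000 (maximum uncertainty), B -> 0.811, C -> 0.011, D -> 0.000 (certain outcome)
```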
2. Expected Information and Entropy Formula
● Entropy is the expected information over all possible outcomes: $H = \sum_s P_s \, I(s) = -\sum_s P_s \log_2 P_s$.
3. Key Insights
● Entropy = 0 → Outcome is certain (e.g., Event D).
● Entropy = 1 → Maximum uncertainty (e.g., Event A).
● For unfair events (B, C), entropy is between 0 and 1, reflecting partial certainty.
3. Weighted Average of Events
Entropy is the weighted average of the information content of all possible outcomes, weighted by their
probabilities.
● This means outcomes with higher probabilities contribute more to the final entropy.
4. Key Insight
● Low Entropy → Higher certainty, less information required.
● High Entropy → Higher uncertainty, more information required.
Weighted Average of Events Weighted by Probabilities – Explanation
Concept:
The "weighted average" means that we are combining multiple outcomes (or events), but each outcome
contributes based on its probability of occurring. The higher the probability of an outcome, the greater its
impact on the average.
In the context of entropy, the information content of each outcome is weighted by its probability.
Why is it Weighted?
In probability theory:
● Rare events (low probability) contribute less to the average.
● Frequent events (high probability) contribute more.
For example:
● If $p = 0.5$ (fair coin), both "head" and "tail" contribute equally to entropy.
● If $p = 0.9$ (biased coin), the likely outcome contributes more, reducing overall entropy.
Key Insight
The expected information to classify using an attribute is the weighted average of the entropies of its
partitions. Partitions with more objects contribute more to the total entropy.
4. Key Points
● Information Gain is always non-negative.
● Higher Information Gain means the attribute provides more information for classification.
● In decision trees, the attribute with the highest Information Gain is chosen for splitting.
Explanation of Expected Information for Age <= 30
Summary
● Entropy for Age <= 30: 0.971 bits
● Weight of this group: 5/14
● Weighted Contribution: 0.346 bits
This contributes to the overall weighted information for the attribute "age" in the decision tree.
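For reference, the full computation these figures come from, assuming the classic 14-example training set (9 "Yes", 5 "No") that the 0.971, 5/14, and 0.246 values above imply:

$\mathrm{Info}(D) = I(9,5) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$ bits

$\mathrm{Info}_{\text{age}}(D) = \tfrac{5}{14}I(2,3) + \tfrac{4}{14}I(4,0) + \tfrac{5}{14}I(3,2) \approx \tfrac{5}{14}(0.971) + 0 + \tfrac{5}{14}(0.971) \approx 0.694$ bits

$\mathrm{Gain}(\text{age}) = \mathrm{Info}(D) - \mathrm{Info}_{\text{age}}(D) \approx 0.940 - 0.694 = 0.246$ bits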
1. Interpretation:
○ The 0.246 is the reduction in entropy due to splitting the dataset based on "age."
○ This means that knowing the age of a customer reduces uncertainty about whether they will
buy (Yes) or not (No).
2. Key Insight:
○ Attributes with higher information gain are more effective in splitting the data and reducing
uncertainty.
○ This helps improve the ability of the model to classify new objects, such as predicting a
customer's buying behavior.
Summary:
Information Gain = 0.246 shows that "age" as an attribute reduces the entropy significantly and
helps in predicting the classification outcome more effectively.
Advantages and Disadvantages of Decision Trees
Advantages:
1. Easy to understand and interpret:
○ Decision trees map directly to production rules, making them highly intuitive.
2. Works with both types of data:
○ Supports categorical (e.g., "yes/no") and numerical (e.g., age, income) inputs.
3. No statistical assumptions:
○ Decision trees do not assume any specific distribution of attributes.
4. Fast generation and application:
○ Building decision trees and using them for classification is computationally efficient.
Disadvantages:
1. Output attributes must be categorical:
○ Decision trees typically work well when outputs are categorical rather than continuous.
2. Instability:
○ Minor changes in the training data can lead to different attribute splits and tree
structures.
3. Complexity with numerical attributes:
○ Numerical input attributes often result in binary splits, making the tree more complex.
4. Inefficient for non-rectangular regions:
○ Decision trees struggle to represent regions that are not rectangular, especially in high
dimensions:
■ Each split corresponds to a line in 2D, a plane in 3D, and an axis-aligned hyperplane in higher dimensions.
■ Regions whose boundaries are not parallel to the axes must therefore be approximated with many splits.
Why Decision Trees Are Not Suitable for This Problem
1. Non-Linear Decision Boundary:
○ The two classes (X and O) are separated by a diagonal line (labeled A in the original figure), a boundary that is not axis-aligned and therefore cannot be captured by a single decision tree split.
○ Decision trees can only create axis-aligned splits (vertical or horizontal lines) to separate
classes.
2. Limitations of Axis-Aligned Splits:
○ Decision trees struggle to handle datasets where the classification boundary cannot be
represented by simple splits along the feature axes (income and age here).
○ To approximate the diagonal separation, the tree would require multiple splits, leading to an
unnecessarily complex model.
3. Class Overlap:
○ In this example, class X and class O points are intertwined, further complicating the decision
tree's ability to split effectively.
Key Takeaways:
● Split Information accounts for how "evenly" the data is split into child nodes.
● Gain Ratio adjusts the Information Gain by normalizing it using Split Information.
● Attributes with high splits (e.g., TID) are penalized by Gain Ratio, ensuring more meaningful splits are
chosen.
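The standard C4.5 definitions behind these takeaways, stated here for reference ($D_j$ denotes the $j$-th partition produced by splitting on attribute $A$):

$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}, \qquad \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$

An attribute like TID, which splits the data into many tiny partitions, has a very large SplitInfo, which drives its Gain Ratio down.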
Gini Index Explained
The Gini Index is used as an attribute selection measure in decision trees, particularly in CART (Classification
and Regression Trees). It measures the impurity of a dataset by calculating how often a randomly chosen
element would be misclassified.
3. Attribute Selection Using Gini Index:
● For each possible attribute and splitting point, calculate the Gini split.
● Choose the attribute and split point with the smallest Gini Index as it indicates the lowest
impurity.
Key Notes:
● Gini Index focuses on impurity and favors attributes that create pure splits.
● All possible splitting points must be evaluated for each attribute, which can be computationally
expensive.
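For reference, the standard CART definitions (for a dataset $D$ with $m$ classes and a binary split of $D$ into $D_1$ and $D_2$ on attribute $A$):

$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2, \qquad \mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$

The attribute and split point that minimize $\mathrm{Gini}_A(D)$ are chosen.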
Key Idea:
● Simple Model: Always predicts the most probable outcome.
● Sophisticated Model: Incorporates randomness based on the probability distribution $\mu_o$.
Both models result in the same error, but the second is more general for probabilistic decisions.
Explanation of the Risk Matrix (Confusion Matrix)
This slide discusses the confusion matrix for a binary classification problem (predicting heads or tails)
and evaluates accuracy and error based on probabilities.
Key Takeaways:
1. Accuracy is highest when predictions align with the majority class.
2. Error is minimized when one class is dominant.
3. For a fair coin ($\mu_o = 0.5$):
○ Accuracy = 0.5
○ Error = 0.5, since heads and tails are equally likely.
Extracting Classification Rules from Trees
Classification rules are extracted from decision trees to simplify the results into human-readable IF-THEN
statements. Here's a breakdown of the process:
Key Concepts
1. IF-THEN Rules:
○ Each rule represents a unique path from the root to a leaf node in the decision tree.
○ Conditions at each node form a conjunction (logical AND).
2. Leaf Nodes:
○ Leaf nodes represent the predicted class or outcome.
3. Interpretability:
○ Rules make the decision tree easier for humans to understand and apply.
General Process for Extracting Rules
1. Start at the root node of the decision tree.
2. Follow each branch until reaching a leaf node.
3. Combine the conditions encountered along the path using logical AND.
4. Assign the class label (or prediction) at the leaf node as the THEN part of the rule.
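A hedged sketch of the same idea with scikit-learn, whose export_text helper prints each root-to-leaf path as nested conditions; the feature names and data are invented:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30_000], [45, 80_000], [35, 40_000], [50, 95_000], [23, 20_000], [40, 70_000]]
y = ["no", "yes", "no", "yes", "no", "yes"]            # e.g. buys_computer

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
# Each printed path, read from root to leaf, is one IF ... AND ... THEN <class> rule.
```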
"Enhancements to Basic Decision Tree Induction" Explanation
1. Allow for Continuous-Valued Attributes:
○ Decision trees can dynamically discretize continuous attributes by partitioning their values into
a discrete set of intervals.
○ For example, a continuous value like "age" can be converted into "young", "middle-aged", and
"old" categories.
2. Handle Missing Attribute Values:
○ Missing values in the data are handled by:
■ Assigning the most common value of the attribute for the missing entries.
■ Assigning probabilities to possible values based on existing data distribution.
3. Attribute Construction:
○ New attributes can be created using existing ones to improve the tree's structure.
○ This is especially useful when certain features are sparsely represented.
○ Reduces redundancy and avoids fragmentation of data.
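A minimal sketch of the first two enhancements using pandas; the column names, bin edges, and category labels are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 35, None, 51, 44, 29],
                   "income": ["low", "high", "medium", None, "high", "low"]})

# Continuous attribute -> a discrete set of intervals ("dynamic discretization").
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 120],
                         labels=["young", "middle-aged", "old"])

# Missing categorical values -> the most common value of that attribute.
df["income"] = df["income"].fillna(df["income"].mode()[0])
print(df)
```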
"Avoid Overfitting in Classification" Explanation
1. Overfitting:
○ Decision trees may overfit the training data by adding unnecessary splits, which can cause:
■ Too many branches reflecting noise or outliers.
■ Poor accuracy when predicting unseen samples.
2. Approaches to Avoid Overfitting:
○ Prepruning:
■ Stop building the tree early if further splitting does not improve performance
significantly.
■ Challenge: Choosing an appropriate stopping threshold is difficult.
○ Postpruning:
■ Fully grow the tree first, then remove unnecessary branches.
■ Use a separate validation set to identify the best pruned version of the tree.
■ Results in a more generalized and robust model.
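One concrete way to realize postpruning, assuming scikit-learn's cost-complexity pruning is an acceptable stand-in for the generic procedure described above (the notes do not prescribe a specific library):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree, then obtain the candidate pruning levels (ccp_alpha values).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Postpruning: keep the pruned tree that scores best on a separate validation set.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))   # pruned size and validation accuracy
```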
Tree-Growing Methods
1. CHAID (Chi-squared Automatic Interaction Detector)
○ Developed by Kass (1980).
○ It uses the Chi-squared statistical test to determine the best attribute for splitting the data.
○ Primarily works with categorical data.
○ Advantage: Handles multi-way splits efficiently (not just binary splits).
2. Exhaustive CHAID
○ An extended version of CHAID by Biggs, de Ville, and Suen (1991).
○ Difference: It evaluates all possible splits for each attribute to find the most significant split.
○ This method is computationally more intensive but provides more accurate splits for
classification.
3. CART (Classification and Regression Trees)
○ Introduced by Breiman, Friedman, Olshen, and Stone (1984).
○ Works with both numerical and categorical data.
○ Key Features:
■ Uses Gini index for classification (to measure impurity).
■ Supports binary splits at each node.
○ Use Case:
■ Classification: Predict categorical labels.
■ Regression: Predict continuous values.
4. QUEST (Quick, Unbiased, Efficient Statistical Tree)
○ Developed by Loh and Shih (1997).
○ Designed for speed and unbiased attribute selection.
○ Key Benefits:
■ Handles both categorical and continuous data.
■ Efficient for large datasets.
■ Reduces bias in selecting splitting attributes compared to other methods.
Classification in Large Databases
1. Classification Problem:
○ A classical problem that has been extensively researched by both statisticians and
machine learning experts.
○ Involves assigning a label (or class) to a new instance based on its attributes.
2. Scalability:
○ In large databases, data sets often contain millions of examples and hundreds of
attributes.
○ A good classification method must efficiently handle these large-scale data sets and still
maintain reasonable speed.
3. Why Use Decision Tree Induction in Data Mining?
○ Faster Learning Speed:
■ Compared to other classification methods, decision trees are relatively faster to train.
○ Simple and Understandable Rules:
■ Decision trees generate easily interpretable rules that can be written as IF-THEN
statements.
■ This simplicity helps users understand how classification decisions are made.
○ SQL Query Integration:
■ Decision trees can be represented in a form that allows users to query large databases
using SQL.
■ This makes decision trees practical for database systems.
○ Comparable Accuracy:
■ Despite their simplicity, decision trees offer classification accuracy that is comparable
to other, more complex methods.
Scalable Decision Tree Induction Methods
Decision tree methods must handle large-scale data sets efficiently. The following techniques aim to
improve scalability in data mining:
1. SLIQ (EDBT’96 — Mehta et al.)
● What it does:
○ Builds an index for each attribute.
○ Keeps only the class list and current attribute list in memory to minimize resource usage.
● Why it matters:
○ Reduces memory consumption, making decision tree induction more efficient for large datasets.
2. SPRINT (VLDB’96 — J. Shafer et al.)
● What it does:
○ Constructs an attribute list data structure for splitting nodes in the decision tree.
● Why it matters:
○ Ensures scalability by avoiding memory limitations.
○ Works well for distributed environments and large data sets.
3. PUBLIC (VLDB’98 — Rastogi & Shim)
● What it does:
○ Integrates tree splitting and tree pruning processes.
○ Stops growing the tree earlier if a threshold is met.
● Why it matters:
○ Reduces unnecessary splits and minimizes overfitting.
○ Efficiently manages large-scale data by stopping growth when further splits are less valuable.
4. RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
● What it does:
○ Separates the scalability aspects (performance optimization) from the quality criteria
(accuracy).
○ Builds an AVC-list: Attribute-Value-Class list for efficient splits.
● Why it matters:
○ Improves scalability for large datasets without sacrificing accuracy.
○ Allows better data organization for decision tree construction.
Summary:
These methods (SLIQ, SPRINT, PUBLIC, and RainForest) aim to address the challenges of large-scale data
mining. They focus on efficient memory usage, optimized tree construction, and pruning strategies
to ensure decision trees can handle large datasets effectively.