DECISION TREES
• Decision Tree is a supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• Decision nodes are used to make decisions and have multiple branches, whereas Leaf
nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or the tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands into further branches and constructs a tree-like structure.
Example:
• One of the reasons that decision trees are popular is that we can turn them into a set of
logical disjunctions (if ... then rules) that then go into program code very simply.
Ex: if there is a party, then go to it; if there is not a party and you have an urgent deadline,
then study.
Constructing Decision Trees:
1. Start with the Entire Dataset
Consider all training data as the root.
2. Choose the Best Feature to Split
• Evaluate each feature using a splitting criterion (explained below).
• Select the feature that best separates the data.
3. Split the Dataset
Based on the selected feature, divide the dataset into subsets.
4. Recurse on Each Subset
Repeat the process recursively on each subset, treating it as a new node.
5. Stopping Criteria
Stop if:
o All instances in a node belong to the same class.
o Maximum tree depth is reached.
o Minimum number of samples in a node is reached.
o No further gain from splitting.
6. Assign Output
• For classification: Assign the most common class in the node.
• For regression: Assign the average of the target values in the node (a minimal code
sketch of these steps follows this list).
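As a rough illustration of the steps above, a decision tree can be fitted in a few lines with scikit-learn (assuming scikit-learn is available; the tiny party/deadline dataset below is invented for this example, and the hyperparameters map onto the stopping criteria above):

# Minimal sketch: fitting and querying a decision tree classifier with scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: [party? (0/1), deadline urgency (0=none, 1=near, 2=urgent)]
X = [[1, 0], [1, 2], [1, 1], [0, 2], [0, 1], [0, 0]]
y = ["party", "party", "party", "study", "study", "relax"]

# max_depth and min_samples_leaf correspond to the stopping criteria above
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=1)
clf.fit(X, y)

print(clf.predict([[0, 2]]))  # no party, urgent deadline -> most likely "study"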
Main Criteria Used for Splitting Nodes
1. Gini Impurity
Gini Impurity measures how mixed the classes are in a node. A lower value means a purer node.
It is used in the CART algorithm.
2. Entropy / Information Gain
Information Gain tells us how much useful information a feature gives. A good feature split
reduces entropy. It is used in the ID3 algorithm.
3. Chi-Square
Chi-Square checks if the split between data groups is meaningful or just due to chance.
4. Reduction in Variance
This is used in regression trees. It chooses splits that reduce the difference (variance) within the
data.
5. Gain Ratio
Gain Ratio improves Information Gain by adjusting for features that have many different values.
It is used in the C4.5 algorithm.
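For concreteness, the first two criteria can be computed directly from their standard definitions; a small numpy sketch (the label array is made up for illustration):

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p * log2(p)) over class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent node minus the weighted entropy of the two children
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

labels = np.array(["yes", "yes", "yes", "no", "no"])
print(gini(labels), entropy(labels))
print(information_gain(labels, labels[:3], labels[3:]))  # a perfectly pure split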
CLASSIFICATION AND REGRESSION TREES (CART)
CART (Classification and Regression Trees) is a machine learning algorithm that builds decision trees.
It is used for two types of problems:
• Classification – when we want to predict a category or label
• Regression – when we want to predict a numerical value
CART uses a binary tree structure, meaning each node splits into two branches.
Classification in Trees
This is used when the output we want is a class label (like “yes” or “no”).
Example:
• Predicting whether an email is spam or not spam
• Diagnosing whether a person has disease A or not
Working
• At each node, CART checks all features and chooses the one that best separates the data.
• The goal is to group similar items together in each branch.
• This process continues until each branch (called a leaf node) has mostly one class.
Gini Impurity (for classification)
Gini Impurity is a measure used to find the best feature to split the data.
What it means:
• If all data at a node belongs to one class → Gini = 0 (perfectly pure)
• If data is equally mixed among classes → Gini is higher
• Formula: Gini = 1 − Σ p_i²
where p_i is the probability (proportion) of class i at that node.
CART chooses the split that has the lowest Gini impurity, meaning it creates purer
groups.
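A hypothetical sketch of how a CART-style classifier could score candidate splits with Gini impurity (written from the definition above, not from any library's internals; the data is invented):

import numpy as np

def gini(labels):
    # Gini impurity of one node: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_split_gini(x, y, threshold):
    # Weighted Gini impurity of the two children of the binary split x <= threshold
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array(["not spam", "not spam", "spam", "spam", "spam"])
# CART keeps the threshold with the lowest weighted Gini (here 2.5 gives 0.0)
for t in [1.5, 2.5, 3.5]:
    print(t, weighted_split_gini(x, y, t))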
Regression in Trees
This is used when the output is a numerical value (like marks, price, or temperature).
Example:
• Predicting the price of a house based on area, location, etc.
• Estimating the rainfall for next week
Working
• CART splits the data again and again to form smaller and smaller groups.
• It finds the best split by checking how much the error reduces after the split.
Residual Sum of Squares (for regression)
• This is the error measure used for regression trees.
• It calculates how far the predicted values are from the actual values.
• Formula: SSE = Σ (y_i − ŷ_i)²
Where:
• y_i is the actual value
• ŷ_i is the predicted value
CART tries to find a split that gives the lowest error (minimum SSE).
Residual Reduction
Residual Reduction is how much the error decreases after a split.
What it means:
• If a split greatly reduces the error → it is a good split
• If it doesn't help much → CART won’t split that way
CART keeps splitting until:
• The tree is too deep
• Or a node has very few data points
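As a sketch of the regression case, the SSE and the residual reduction of one candidate split can be computed directly from the definitions above (the toy numbers are invented):

import numpy as np

def sse(values):
    # Residual sum of squares around the node's prediction (the mean of the node)
    return np.sum((values - values.mean()) ** 2)

def residual_reduction(x, y, threshold):
    # How much the total error drops if the node is split at this threshold
    left, right = y[x <= threshold], y[x > threshold]
    return sse(y) - (sse(left) + sse(right))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # e.g. house area
y = np.array([10.0, 12.0, 30.0, 32.0, 35.0])  # e.g. house price
for t in [1.5, 2.5, 3.5]:
    print(t, residual_reduction(x, y, t))     # the split with the largest reduction is kept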
ENSEMBLE LEARNING
• Ensemble learning refers to the approach of combining multiple ML models to produce a
more accurate and robust prediction compared to any individual model.
• The conventional ensemble methods include bagging, boosting, and stacking-based
methods.
Boosting:
• Boosting is an ensemble technique that combines multiple weak learners to create a
strong learner.
• The weak models are trained in sequence, so that each successive model tries to correct
the errors of the previous one until the training dataset is predicted as accurately as
possible.
• One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
AdaBoost (Adaptive Boosting)
• AdaBoost (Adaptive Boosting) is an ensemble learning method used for classification and
regression tasks.
• It works by training weak classifiers one after another, giving more focus (weight) to the
misclassified data in each round.
• The final model is a weighted combination of all weak classifiers, where more accurate
models get higher weight.
Steps in AdaBoost:
1. Weight Initialization
Start by giving equal importance to all training examples.
2. Model Training
Train a simple model on the data, using the current importance (weights) of each example.
3. Weighted Error Calculation
Compute the weighted error: count how many examples were classified wrongly, with each
example weighted by its current importance.
4. Model Weight Calculation
If the model does well, give it more power (weight) in the final decision; if it does badly, give
it less.
5. Update Instance Weights
Increase the importance (weight) of wrongly classified examples, so the next model tries
harder on them.
6. Repeat
Repeat steps 2 to 5 multiple times, each time focusing more on the hard examples.
7. Final Model Creation
Combine all the models into one strong model, giving more say to the better ones.
8. Classification
For new data, the final model makes a prediction by taking a vote from all the simple
models.
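These steps are what scikit-learn's AdaBoostClassifier implements; a minimal usage sketch (the toy data is invented; by default the weak learners are depth-1 decision trees):

from sklearn.ensemble import AdaBoostClassifier

# Invented toy data
X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 1]]
y = [0, 1, 0, 1, 1, 1]

# n_estimators is the number of boosting rounds (steps 2-5 repeated)
model = AdaBoostClassifier(n_estimators=50)
model.fit(X, y)
print(model.predict([[1, 2]]))  # weighted vote of all the weak learners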
Bagging:
• Bagging (Bootstrap Aggregating) is an ensemble learning technique that builds multiple
models on different random subsets of the training data (with replacement).
• It helps to reduce variance and prevent overfitting, especially in high-variance models
like decision trees.
• Bagging is used for both classification and regression tasks and makes the final prediction
by combining outputs from all base models.
Steps:
1. Bootstrap Sampling
Randomly create multiple subsets of the training data with replacement.
2. Train Base Models
Train a separate model (usually the same type) on each subset independently.
3. Aggregate Predictions
Combine the results:
Use majority vote for classification
Use average for regression
4. Out-of-Bag Evaluation
Use the samples not included in the training subset (OOB samples) to estimate model
accuracy.
5. Final Prediction
Output the aggregated prediction from all models for the final result.
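A brief scikit-learn sketch of these steps (invented toy data; by default the base models are decision trees, and oob_score=True performs the out-of-bag evaluation of step 4):

from sklearn.ensemble import BaggingClassifier

# Invented toy data
X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 1], [3, 3], [3, 2]]
y = [0, 1, 0, 1, 1, 1, 1, 0]

# 25 base models, each trained on a bootstrap sample (drawn with replacement)
bag = BaggingClassifier(n_estimators=25, oob_score=True, random_state=0)
bag.fit(X, y)
print(bag.oob_score_)         # accuracy estimated on the out-of-bag samples
print(bag.predict([[2, 3]]))  # majority vote of the 25 base models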
Random Forest
Definition:
Random Forest is an ensemble method based on Bagging that builds a collection of Decision
Trees, using random subsets of data and features, to improve prediction and reduce overfitting.
Steps:
1. Bootstrap Sampling
Create random subsets of training data with replacement, like in Bagging.
2. Random Feature Selection
When building each Decision Tree, at each split, use a random subset of features (not all
features).
3. Train Multiple Decision Trees
Build many Decision Trees, each trained on different data and features.
4. Aggregate Predictions
Combine the predictions:
Use majority vote for classification
Use average for regression
5. Final Prediction
Predict the final output by combining the outputs of all the trees.
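A minimal scikit-learn sketch of these steps (toy data invented; max_features controls the random feature subset considered at each split):

from sklearn.ensemble import RandomForestClassifier

# Invented toy data with three features
X = [[0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0], [2, 2, 1], [2, 1, 0]]
y = [0, 1, 0, 1, 1, 1]

# 100 trees, each grown on a bootstrap sample; at every split only a random
# subset of the features (here sqrt of the feature count) is considered
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 2, 0]]))  # majority vote over the 100 trees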
Different Ways to Combine Classifiers
1. Majority Voting (for Classification)
If the number of classifiers is odd and they are independent, the label chosen by most
classifiers is used as the final output.
2. Mean (for Regression)
For regression tasks, we usually take the average (mean) of all model outputs.
3. Median (for Robust Regression)
Since the mean can be affected by outliers, using the median (middle value) is often
better for more stable predictions.
4. Bagging Uses Median
When the median is used to combine predictions, it leads to a more robust version of
bagging, sometimes called "robust bagging".
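A short numpy sketch of these combination rules, assuming the individual model predictions are already available (all numbers invented):

import numpy as np

# Hypothetical class predictions from three classifiers for four test points
class_preds = np.array([[0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 1, 0]])
# 1. Majority vote: most frequent label in each column
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)  # -> [0 1 1 0]

# Hypothetical regression outputs from three models for four test points
reg_preds = np.array([[2.0, 3.0, 10.0, 4.0],
                      [2.2, 2.8,  9.5, 4.1],
                      [2.1, 3.1, 50.0, 3.9]])  # note the outlier 50.0
print(reg_preds.mean(axis=0))        # 2. mean: pulled towards the outlier
print(np.median(reg_preds, axis=0))  # 3./4. median: robust to the outlier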
BASIC STATISTICS
1. Mean
The mean is the average value of a dataset.
It is calculated by adding all the values in the dataset and then dividing the total by the
number of values.
The mean is a useful measure of central tendency, but it is sensitive to outliers, meaning
extreme values can affect it significantly.
2. Median
The median is the middle value of a dataset when all the values are arranged in order.
If the number of values is odd, the median is the exact middle value.
If the number of values is even, the median is calculated by taking the average of the two
middle values.
The median is a reliable measure of central tendency because it is not affected by outliers or
extreme values.
3. Mode
The mode is the value that occurs most frequently in a dataset.
It is useful for identifying the most common value in the data.
If two values share the highest frequency, the dataset is called bimodal; with three it is called
trimodal, and with more than three it is called multimodal.
The mode may not be a helpful measure if all values are different or if the dataset has a wide
range of values.
4. Variance
Variance measures how much the values in a dataset differ from the mean.
A high variance indicates that the data points are spread out widely, while a low variance
means they are closely clustered around the mean.
5. Standard Deviation
The standard deviation is the square root of the variance.
It tells us how much the values in the dataset typically differ from the mean, and it is
expressed in the same units as the data.
A small standard deviation means the data is closer to the mean, while a large one means
the data is more spread out.
6. Covariance
Covariance shows the relationship between two variables, indicating how one variable
changes when the other changes.
A positive covariance means both variables tend to increase or decrease together.
A negative covariance means that as one variable increases, the other tends to decrease.
Covariance is scale-dependent, so its magnitude depends on the units of the variables and
cannot be compared directly across datasets.
7. Mahalanobis Distance
The Mahalanobis Distance is a statistical measure that shows how far a point is from the
mean of a distribution, considering the correlations between variables.
It is used in applications like outlier detection, clustering, and classification.
The formula is: D_M(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ)),
where x is the data point, μ is the mean vector of the distribution, and Σ is its covariance matrix.
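A small numpy sketch of these statistics (the data is made up; the Mahalanobis distance follows the formula above):

import numpy as np

data = np.array([[2.0, 4.0], [3.0, 6.1], [4.0, 8.2], [5.0, 9.9], [100.0, 10.0]])
x = data[:, 0]

print(np.mean(x))             # mean: pulled up strongly by the outlier 100
print(np.median(x))           # median: unaffected by the outlier
print(np.var(x), np.std(x))   # variance and its square root, the standard deviation

vals, counts = np.unique([1, 2, 2, 3, 2], return_counts=True)
print(vals[counts.argmax()])  # mode: the most frequent value (2)

cov = np.cov(data, rowvar=False)  # 2x2 covariance matrix of the two columns
print(cov)

# Mahalanobis distance of the first point from the mean of the data
mu = data.mean(axis=0)
diff = data[0] - mu
print(np.sqrt(diff @ np.linalg.inv(cov) @ diff))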
GAUSSIAN MIXTURE MODELS (GMM)
Gaussian Mixture Model (GMM) is a probabilistic model that assumes data is generated from a
mixture of several Gaussian (normal) distributions, each with its own mean and variance.
Purpose:
GMM is mainly used for unsupervised learning, especially in clustering, where we don’t have
labelled data. It helps to group similar data points based on the likelihood that they belong to
the same Gaussian distribution.
How GMM Works:
1. Initialization
Choose the number of Gaussians (clusters) K and give each one an initial mean, variance, and weight.
2. Expectation Step (E-Step)
For each data point, calculate the probability that it belongs to each Gaussian.
3. Maximization Step (M-Step)
Update the parameters (mean, variance, and weight) of each Gaussian to maximize the
likelihood of the data.
4. Repeat
Repeat E-Step and M-Step until the model converges, i.e., changes are very small.
Notation:
• K: Number of Gaussian components
• N: Number of data points
• D: Dimensionality of the data
GMM Parameters
Means (μ) → The center of each Gaussian component
Covariance (Σ) → The shape and spread of each component
Weights (π) → The probability (importance) of each component
Model Training
• Training a GMM involves estimating its parameters (means, covariances, and weights) from the available data.
• The Expectation-Maximization (EM) technique is often employed, alternating between the
Expectation (E) and Maximization (M) steps until convergence.
Expectation-Maximization:
• During the E step, the model calculates the probability of each data point belonging to each
Gaussian component.
• The M step then adjusts the model’s parameters based on these probabilities.
Clustering and Density Estimation:
• Post-training, GMMs cluster data points based on the highest posterior probability.
• They are also used for density estimation, assessing the probability density at any point in the
feature space.
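A minimal scikit-learn sketch of fitting a GMM with EM and then using it for clustering and density estimation (the two blobs of points are generated artificially for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two artificial blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])

# K = 2 Gaussian components, fitted with the EM algorithm described above
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.means_)    # the means (mu) of the two components
print(gmm.weights_)  # the mixing weights (pi)
labels = gmm.predict(X)                    # cluster by highest posterior probability
density = gmm.score_samples([[1.5, 1.5]])  # log probability density at a point
print(labels[:5], density)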
Applications:
• Customer segmentation
• Anomaly detection
• Image segmentation
• Speaker identification
Nearest Neighbour Methods:
K-Nearest Neighbors Algorithm:
What is KNN? – Simple Points
• K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification
and regression.
• It is a lazy learning algorithm, meaning it does not build a model during training — it
simply stores the data.
• KNN makes predictions only when a new input is given.
• The algorithm assumes that similar data points are located close together in the feature
space.
To make a prediction:
• It finds the 'K' nearest neighbors (closest data points) to the input.
• For classification, it uses majority voting of the neighbors' labels.
• For regression, it takes the average of the neighbors' values.
Steps in KNN Algorithm:
Step 1: Choose the value of K
K is the number of nearest neighbors to look at when making a prediction.
A good value of K is chosen by testing for the best accuracy.
Step 2: Calculate Distance
Use Euclidean distance (or other distance measures) to find how close each training point is
to the new (target) point.
Step 3: Find Nearest Neighbors
Pick the K training points that are closest to the target point based on the calculated
distances.
Step 4: Make Prediction
For Classification:
→ Use majority voting — the class that appears the most among the K neighbors is selected.
For Regression:
→ Take the average of the values of the K nearest neighbors to predict the result.
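The four steps can be written almost directly in numpy; a minimal classification sketch (toy data invented):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4 (classification): majority vote among the K neighbours
    votes = y_train[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[counts.argmax()]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # expected "A"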
K MEANS ALGORITHM
• K-Means is an unsupervised learning algorithm used to group unlabelled data into K
clusters based on similarity.
• The value of K is pre-defined and represents the number of clusters the algorithm should
form.
• It works by finding centroids (center points) and assigning data points to the nearest
centroid.
• The algorithm is iterative, repeatedly adjusting the centroids and cluster assignments until
the results stabilize.
• K-Means is widely used in tasks like customer segmentation, image compression, and
pattern recognition because it is simple, fast, and effective.
Main Purpose:
The K-Means algorithm tries to group similar data points together by minimizing the distance
between the points and their cluster center (called a centroid).
Working
1. Choose the number of clusters (K)
Decide how many groups (clusters) you want to divide the data into.
2. Initialize K centroids randomly
Select K random points from the dataset to serve as the initial centers of the clusters.
3. Assign each data point to the nearest centroid
Measure the distance (usually Euclidean distance) from each point to each centroid and
assign it to the closest one.
4. Update the centroids
Recalculate the centroid of each cluster by taking the average position of all the points in
that cluster.
5. Repeat steps 3 and 4
Continue reassigning points and updating centroids until the centroids do not change
much, or a maximum number of iterations is reached.
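The working steps above can be sketched in a few lines of numpy (a simplified illustration; a library implementation such as scikit-learn's KMeans adds better initialization and other refinements):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)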