
Unit 2:

Using Clustering to
Subdivide Data
Introducing Clustering Basics

• Clustering is a form of machine learning: the machine in this case is your
computer, and learning refers to an algorithm that is repeated over and over until a
certain set of predetermined conditions is met.
• Learning algorithms are generally run until the final analysis results will not
change, no matter how many additional times the algorithm is passed over the data.
• Clustering belongs to unsupervised learning, one of the two main types of machine
learning. In unsupervised machine learning, the data in the dataset is unlabelled.
Because the data is unlabelled, the algorithms must use inferential methods to
discover patterns, relationships, and correlations within the raw dataset.
• Clustering, or cluster analysis, is a machine learning technique that
groups an unlabelled dataset.
• It can be defined as "a way of grouping the data points into different
clusters consisting of similar data points. Objects with possible
similarities remain in a group that has few or no similarities with
another group."
• It does this by finding similar patterns in the unlabelled dataset,
such as shape, size, colour, and behaviour, and dividing the data points
according to the presence or absence of those patterns.
• Because it is an unsupervised learning method, no supervision is
provided to the algorithm, and it deals only with unlabelled data.
Getting to know clustering algorithms

• You use clustering algorithms to subdivide unlabelled datasets into clusters of
observations that are most similar for a predefined feature.
• If you have a dataset that describes multiple features about a set of observations
and you want to group your observations by their feature similarities, use
clustering algorithms.
• There are different clustering methods, depending on how you want your dataset
to be divided.
• The two main types of clustering algorithms are:
• Partitional: Algorithms that create only a single set of clusters
• Hierarchical: Algorithms that create separate sets of nested clusters, each in its
own hierarchical level
• In unsupervised clustering, you start with this data and then proceed to
divide it into subsets.
• These subsets, called clusters, are composed of observations that are
most similar to one another.
• In the World Bank Income and Education scatter plot referenced here, it appears
that there are at least two clusters, probably three: one at the bottom with low
income and education, while the high-education countries look like they might be
split between low and high incomes.
• Although you can generate visual estimates of clusters, you can
achieve much more accurate results when dealing with much larger
datasets by using algorithms to generate clusters for you.
• Visual estimation is a rough method that’s useful only on smaller
datasets of minimal complexity.
• Algorithms produce exact, repeatable results, and you can use
algorithms to generate clusters from multiple dimensions of data
within your dataset.
• Clustering algorithms are appropriate in situations where the following
characteristics are true:
• You know and understand the dataset you’re analysing.
• Before running the clustering algorithm, you don’t have an exact idea of
the nature of the subsets (clusters). Often, you don’t even know how many
subsets are in the dataset before you run the algorithm.
• The subsets (clusters) are determined by only the single dataset you’re
analysing.
• Your goal is to determine a model that describes the subsets in a single
dataset and only this dataset.
• If you add more data to your dataset after you’ve already built the model, be sure
to rebuild the model from scratch to get more complete and accurate model
results.
Looking at clustering similarity metrics
• Clustering methods are based on calculating the similarity or difference between
two observations.
• If your dataset is numeric — composed of only numerical features — and can be
portrayed on an n-dimensional plot, you can use various geometric metrics to scale
your multidimensional data.
• An n-dimensional plot is a multidimensional scatter plot that you can use to plot n
number of dimensions of data.
• Some popular geometric metrics used for calculating distances between
observations are simply different geometric functions that are useful for modelling
distances between points:
• Euclidean metric: A measure of the distance between points plotted on a
Euclidean plane.
• Manhattan metric: A measure of the distance between points where distance is
calculated as the sum of the absolute value of the differences between two points’
Cartesian coordinates.
• Minkowski distance metric: A generalization of the Euclidean and Manhattan
distance metrics. Quite often, these metrics can be used interchangeably.
• Cosine similarity metric: The cosine metric measures the similarity of two data
points based on their orientation, as determined by taking the cosine of the angle
between them.
Euclidean Distance
• Euclidean Distance represents the shortest distance between two points.
• Most machine learning algorithms, including K-Means, use this distance metric to measure the
similarity between observations. For two points p and q, the Euclidean distance is:
d(p, q) = sqrt( (p1 - q1)² + (p2 - q2)² + ... + (pn - qn)² )
Where,
• n = number of dimensions
• pi, qi = data points
Manhattan Distance
• Manhattan Distance is the sum of absolute differences between points across all the
dimensions.
• In a 2-dimensional space, the Manhattan distance is the sum of the absolute
differences in the x and y directions:
d(p, q) = |p1 - q1| + |p2 - q2|
• In general, for n dimensions:
d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|
Where,
• n = number of dimensions
• pi, qi = data points
Minkowski Distance
• Minkowski Distance is the generalized form of Euclidean and Manhattan Distance:
d(x, y) = ( |x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p )^(1/p)
where p is the order of the metric.
• When p = 2, Minkowski distance is the same as Euclidean distance.
• When p = 1, Minkowski distance is the same as Manhattan distance.
Cosine similarity
• Cosine similarity is a metric that is helpful in determining how similar two
data objects are, irrespective of their size.
• In cosine similarity, data objects in a dataset are treated as vectors.
The formula to find the cosine similarity between two vectors is:
Cos(x, y) = x . y / (||x|| * ||y||)
• x . y = dot product of the vectors 'x' and 'y'.
• ||x|| and ||y|| = lengths (magnitudes) of the two vectors 'x' and 'y'.
• ||x|| * ||y|| = product of the lengths of the two vectors 'x' and 'y'.
• Lastly, for non-numeric data, you can use metrics like the Jaccard distance
metric, an index based on the number of features that two observations have
in common.
• Jaccard distance is a metric used to measure the dissimilarity between two sets. It is
particularly useful when dealing with sets of elements, like words in documents,
items in a shopping cart, or users who liked certain items. The Jaccard similarity is
the ratio of the size of the intersection of the sets to the size of their union, and the
Jaccard distance is its complement (one minus the similarity).
• Suppose Set A = {apple, banana, orange} and Set B = {banana, cherry, grape}.
• Intersection: {banana} (common elements)
• Union: {apple, banana, orange, cherry, grape} (all unique elements)
• Jaccard similarity: J(A,B) = 1/5 (since the intersection has one element and the union
has five elements)
• Jaccard distance: D(A,B) = 1 − 1/5 = 4/5
• A short code sketch of these metrics follows.
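A minimal sketch of computing these metrics, assuming SciPy's distance module; the sample points, the order p = 3, and the fruit sets are made up or taken from the example above purely for illustration.

```python
# Illustrative sketch: the similarity/distance metrics above, computed with SciPy.
from scipy.spatial import distance

x = [1, 2, 3]
y = [4, 0, 3]

print(distance.euclidean(x, y))        # sqrt((1-4)^2 + (2-0)^2 + (3-3)^2)
print(distance.cityblock(x, y))        # Manhattan: |1-4| + |2-0| + |3-3|
print(distance.minkowski(x, y, p=3))   # Minkowski with order p = 3
print(distance.cosine(x, y))           # cosine *distance* = 1 - cosine similarity

# Jaccard distance between two sets, following the fruit example above.
A = {"apple", "banana", "orange"}
B = {"banana", "cherry", "grape"}
jaccard_similarity = len(A & B) / len(A | B)   # 1/5
jaccard_distance = 1 - jaccard_similarity      # 4/5
print(jaccard_similarity, jaccard_distance)
```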
Identifying Clusters in Your Data

• You can use many different algorithms for clustering, but the speed
and robustness of the k-means algorithm makes it a popular choice
among experienced data scientists.
• As alternatives, kernel density estimation methods, hierarchical
algorithms, and neighbourhood algorithms are also available to help
you identify clusters in your dataset.
Clustering with the k-means algorithm

• The k-means clustering algorithm is a simple, fast, unsupervised learning
algorithm that you can use to predict groupings within a dataset.
• The model makes its prediction based on the number of centroids present —
represented by k, a model parameter that you must define — and the nearest mean
values, measured by the Euclidean distance between observations.
• Because the features of a dataset are usually on different scales, the difference of
scales can distort the results of this distance calculation. Normalizing the data is
important to ensure that the distance measure accords equal weight to each
variable.
• To avoid this problem, scale your variables before using k-means to predict data
groupings
• The quality of the clusters is heavily dependent on the correctness of the k value
you specify.
• If your data is 2- or 3-dimensional, a plausible range of k values may be
visually determinable.
• In the eyeballed approximation of clustering from the World Bank Income
and Education data scatter plot, a visual estimation of the k value would
equate to three clusters, or k = 3.
• When defining the k value, it may be possible to choose the number of
centroids by looking at a scatter plot (if your dataset is 2- or 3-dimensional)
or by looking for obvious, significant groupings within your dataset’s
variables.
• If the data has more than two or three dimensions, k can be determined using
the elbow method or the silhouette method.
• You can pick the number of centroids based on the number of
groupings that you know exist in the dataset, or by the number of
groupings that you want to exist in the dataset.
• Whatever the case, use your subjective knowledge about the dataset
when choosing the number of clusters to be modelled.
• The working of the K-Means algorithm is explained in the steps below; a small
code sketch follows the list.
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points as centroids. (They can be points other than those
in the input dataset.)
• Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
• Step-4: Calculate the variance and place a new centroid for each cluster.
• Step-5: Repeat Step 3, which means reassign each data point to the new
closest centroid of its cluster.
• Step-6: If any reassignment occurs, then go to Step 4; otherwise, go to FINISH.
• Step-7: The model is ready.
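A minimal sketch of these steps, assuming scikit-learn's KMeans on made-up synthetic blob data; the blob locations, the scaling step, and k = 3 are illustrative choices, not part of the original material.

```python
# Minimal k-means sketch with scikit-learn on synthetic data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three made-up blobs; scaling matters because features are often on different scales.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

X_scaled = StandardScaler().fit_transform(X)   # normalise the features first

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # k = 3, as eyeballed
labels = kmeans.fit_predict(X_scaled)

print(kmeans.cluster_centers_)   # final centroid positions
print(labels[:10])               # cluster assignment of the first 10 points
```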
Elbow Method
• Elbow Method: This is one of the most popular methods used to find K for K-Means.
• For this, we have to understand something known as WSS (the within-cluster sum of squares).
• WSS: The WSS is defined as the sum of squared distances between each member of a cluster
and its centroid:
WSS = Σ d(p(i), q(i))², summed over all data points i
• Where
• p(i) = data point
• q(i) = closest centroid to the data point
• In the elbow method, the K value is chosen at the point after which the decrease in WSS
becomes almost constant, as illustrated in the sketch below.
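A minimal sketch of the elbow method, assuming scikit-learn (whose inertia_ attribute is exactly this within-cluster sum of squares) and the X_scaled matrix from the k-means sketch above.

```python
# Sketch of the elbow method: plot WSS (sklearn calls it inertia_) for a range of k.
# X_scaled is assumed to be a pre-scaled feature matrix, as in the k-means sketch above.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wss.append(model.inertia_)   # sum of squared distances to the closest centroid

plt.plot(k_values, wss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WSS / inertia")
plt.title("Pick k where the curve bends (the 'elbow')")
plt.show()
```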
Silhouette Method

• Silhouette Method: Here in the silhouette method, we will compute
the silhouette score for every point.
• Silhouette Coefficient for the point= (b-a)/max(a, b)
• Where
• a=mean intra-cluster distance
• b=mean nearest cluster distance
• Silhouette coefficient for dataset = Mean Silhouette Coefficient over
points.
• Points to remember while calculating the silhouette coefficient (a short code
sketch follows this list):
• The value of the silhouette coefficient lies in the range [-1, 1].
• A score of 1 is the best; it means that the data point is very compact
within the cluster to which it belongs and far away from the other
clusters.
• The worst value is -1. Values near 0 denote overlapping clusters.
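A minimal sketch of the silhouette method, assuming scikit-learn's silhouette_score and the same assumed X_scaled matrix used in the earlier sketches.

```python
# Sketch of the silhouette method: mean silhouette coefficient for a range of k.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):                      # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)   # mean of (b - a) / max(a, b) over points
    print(f"k={k}: mean silhouette coefficient = {score:.3f}")
```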
• The centers are moved from regions of lower density to regions of
higher density until all centers are within a region of local maximum
density — a true center of the cluster, where each cluster gets a
maximum number of points closest to its cluster center.
• Whenever possible, you should try to place the centers yourself,
manually. If that’s not possible, simply place the centers randomly and
run the algorithm several times to see how often you end up with the
same clusters.
Problem with k-means
• One weakness of the k-means algorithm is that it may produce
incorrect results by placing cluster centers in areas of local minimum
density.
• This happens when centers get lost in low-density regions (in other
words, regions of the plot that have relatively few points plotted in
them) and the algorithm-driven directional movement (the movement
that’s meant to increase point density) starts to bounce and oscillate
between faraway clusters.
• In these cases, the center gets caught in a low-density space that’s
located between two high-point density zones.
• This results in erroneous clusters based around centers that converge
in areas of low, local minimum density.
• Ironically, this happens most often when the underlying data is very
well-clustered, with tight, dense regions that are separated by wide,
sparse areas.
Estimating clusters with kernel density estimation
(KDE)
• If the k-means algorithm doesn't appeal to you, one alternative way to identify clusters in your data
is to use a density smoothing function instead.
• Kernel density estimation (KDE) is that smoothing method; it works by placing a kernel (a
weighting function that is useful for quantifying density) on each data point in the dataset and then
summing the kernels to generate a kernel density estimate for the overall region.
• A kernel is simply a function that satisfies three properties: it is non-negative, it is symmetric about
zero, and it integrates to 1 (so it behaves like a probability density).
• Kernel functions are used to estimate the density of random variables and as weighting functions in
non-parametric regression.
• This kind of function is also used in machine learning as the kernel method to perform classification
and clustering.
• Areas of greater point density will sum out with greater kernel density, and areas
of lower point density will sum out with less kernel density.
• Because kernel smoothing methods don’t rely on cluster center placement and
clustering techniques to estimate clusters, they don’t exhibit a risk of generating
erroneous clusters by placing centers in areas of local minimum density.
• Where k-means algorithms generate hard-lined definitions between points in
different clusters, KDE generates a plot of gradual density change between
observations.
• For this reason, it’s a helpful aid when eyeballing clusters.
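A minimal sketch of kernel density estimation, assuming scikit-learn's KernelDensity with a Gaussian kernel and made-up one-dimensional data; the bandwidth of 0.5 is an illustrative choice.

```python
# Sketch of kernel density estimation with scikit-learn (Gaussian kernel assumed).
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# One-dimensional toy data drawn from two made-up groups.
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)]).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(data)

grid = np.linspace(-4, 10, 200).reshape(-1, 1)
log_density = kde.score_samples(grid)   # log of the summed kernel density at each grid point
density = np.exp(log_density)

# Peaks in `density` correspond to high-density regions (candidate clusters);
# the dip between the peaks marks the boundary between them.
print(grid[np.argmax(density)])
```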
Clustering with hierarchical algorithms

• A hierarchical clustering algorithm is yet another alternative to k-means clustering.


• In comparison to k-means clustering, the hierarchical clustering algorithm is a
slower, clunkier unsupervised clustering algorithm.
• It predicts groupings within a dataset by calculating the distance and generating a
link between each singular observation and its nearest neighbour.
• It then uses those distances to predict subgroups within a dataset.
• If you’re carrying out a statistical study or analysing biological or environmental
data, hierarchical clustering might be your ideal machine learning solution.
• The hierarchical clustering technique has two approaches:
1.Agglomerative: Agglomerative is a bottom-up approach, in
which the algorithm starts with taking all data points as single
clusters and merging them until one cluster is left.
2.Divisive: Divisive algorithm is the reverse of the
agglomerative algorithm as it is a top-down approach.
Why hierarchical clustering?

• As we already have other clustering algorithms such as K-Means
Clustering, why do we need hierarchical clustering?
• As we have seen, K-means clustering comes with some challenges: it needs a
predetermined number of clusters, and it always tries to create clusters of
roughly the same size.
• To solve these two challenges, we can opt for the hierarchical clustering
algorithm, because in this algorithm we don't need prior knowledge of the
number of clusters.
• Hierarchical clustering starts by treating each observation as a separate
cluster.
• Then, it repeatedly executes the following two steps:
• (1) identify the two clusters that are closest together, and
• (2) merge the two most similar clusters.
• This iterative process continues until all the clusters are merged
together.
Agglomerative Hierarchical clustering

• Step-1: Treat each data point as a single cluster. If there are N data points,
the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to
form one cluster. There will now be N-1 clusters.
• Step-3: Again, take the two closest clusters and merge them together
to form one cluster. There will be N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster is left.
• Step-5: Once all the clusters are combined into one big cluster,
develop the dendrogram and cut it to divide the data into clusters as the
problem requires.
• As we have seen, the distance between the two closest clusters is
crucial for hierarchical clustering. There are various ways to
calculate the distance between two clusters, and these ways decide the
rule for clustering. These measures are called linkage methods.
Some of the popular linkage methods are given below, and a short code sketch
follows the list:
• Single Linkage: The shortest distance between the closest points
of the two clusters.
• Complete Linkage: The farthest distance between two points
of two different clusters. It is one of the popular linkage methods, as it
forms tighter clusters than single linkage.
• Average Linkage: The linkage method in which the distances between all
pairs of points (one point from each cluster) are added up and then divided
by the number of pairs to calculate the average distance between two
clusters. It is also one of the most popular linkage methods.
• Centroid Linkage: The linkage method in which the distance
between the centroids of the clusters is calculated.
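A minimal sketch showing that the linkage method is simply a parameter choice, assuming scikit-learn's AgglomerativeClustering and the X_scaled matrix from the earlier k-means sketch; scikit-learn offers single, complete, average, and Ward linkage, while centroid linkage is available through SciPy's linkage function instead.

```python
# Sketch: the linkage choice as a parameter of agglomerative clustering.
# X_scaled is an assumed, pre-scaled feature matrix, as in the earlier sketches.
from sklearn.cluster import AgglomerativeClustering

for linkage in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X_scaled)
    print(linkage, labels[:10])   # cluster labels of the first 10 points per linkage
```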
Working of Dendrogram in Hierarchical
clustering
• The dendrogram is a tree-like structure that is mainly used to record
each merge step that the hierarchical clustering algorithm performs.
• In the dendrogram plot, the Y-axis shows the Euclidean distances
between the data points, and the X-axis shows all the data points of the
given dataset.
• In the worked example, the data points P2 and P3 combine first to form a cluster;
correspondingly, a dendrogram link is created, connecting P2 and P3 with a
rectangular shape. Its height is decided by the Euclidean distance
between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding
dendrogram link is created. It is higher than the previous one, as the Euclidean
distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrogram links are created: one combining P1, P2, and P3,
and another combining P4, P5, and P6.
• At last, the final dendrogram is created that combines all the data points
together.
• We can cut the dendrogram tree structure at any level as per our
requirement, as shown in the sketch below.
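A minimal sketch of building and cutting a dendrogram, assuming SciPy's hierarchy module and a made-up two-group dataset; the average linkage and the cut distance of 1.5 are illustrative choices.

```python
# Sketch: building and cutting a dendrogram with SciPy (toy data, Euclidean distance).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

Z = linkage(points, method="average", metric="euclidean")  # the merge history

dendrogram(Z)              # y-axis: merge distances, x-axis: the data points
plt.show()

# "Cut" the tree at a chosen distance to obtain flat cluster labels.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)
```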
• Hierarchical clustering algorithms are more computationally expensive than k-means algorithms
because with each iteration of hierarchical clustering, many observations must be compared to
many other observations.
• The benefit, however, is that hierarchical clustering algorithms are not subject to errors caused by
center convergence at areas of local minimum density (as exhibited with the k-means clustering
algorithms).
• Neither k-means nor hierarchical clustering algorithms perform well when clusters
are nonglobular — a configuration where some points in a cluster are closer to
points in a different cluster than they are to points in the center of their own
cluster.
• If your dataset shows nonglobular clustering, you can use neighbourhood
clustering algorithms, like DBScan, to determine whether each point is closer to
its neighbours in the same cluster or closer to its neighbouring observations in
other clusters.
Dabbling in the DBScan neighbourhood
• Partition-based and hierarchical clustering techniques are highly efficient with
normal shaped clusters. However, when it comes to arbitrary shaped clusters or
detecting outliers, density-based techniques are more efficient.
• For example, a dataset made up of compact, roughly spherical groups can easily be
divided into clusters using the k-means algorithm, whereas arbitrarily shaped
clusters call for a density-based method.
• Density-based spatial clustering of applications with noise (DBScan) is an unsupervised learning
method that works by clustering core samples (dense areas of a dataset) while simultaneously
demarking non-core samples (portions of the dataset that are comparatively sparse).
• It’s a neighbourhood clustering algorithm that’s ideal for examining two or more variables together
to identify outliers.
• It’s particularly useful for identifying collective outliers — outliers that appear nearby to one
another, all having similar values that are anomalous to most values in the variable.
• With DBScan, take an iterative, trial-and-error approach to find the ideal number of outliers for
inspection.
• When experimenting with the DBScan model, outliers should comprise 5 percent or less of the
dataset’s observations.
• You must adjust the model parameters until you’ve isolated this small select group of observations.
• The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”.
• The key idea is that for each point of a cluster, the neighbourhood of a given radius has to
contain at least a minimum number of points.
• Partitioning methods (K-means) and hierarchical clustering work for finding spherical-
shaped clusters or convex clusters.
• In other words, they are suitable only for compact and well-separated clusters.
• Moreover, they are also severely affected by the presence of noise and outliers in the data.
• Real life data may contain irregularities, like:
1.Clusters can be of arbitrary shape
2.Data may contain noise.
• Given such data, the k-means algorithm has difficulty identifying clusters
with arbitrary shapes.
• Density-based clustering algorithms are very efficient at finding high-
density regions and outliers. Detecting outliers is very important for
some tasks, e.g. anomaly detection.
• DBSCAN stands for density-based spatial clustering of applications
with noise. It is able to find arbitrarily shaped clusters and clusters with
noise (i.e. outliers).
• The main idea behind DBSCAN is that a point belongs to a cluster if it
is close to many points from that cluster.
• There are two key parameters of DBSCAN:
• eps: The distance that specifies the neighbourhoods. Two points are
considered to be neighbours if the distance between them is less than
or equal to eps.
• minPts: The minimum number of data points required to define a cluster.
• Based on these two parameters, points are classified as core point,
border point, or outlier:
• Core point: A point is a core point if there are at least minPts number
of points (including the point itself) in its surrounding area with radius
eps.
• Border point: A point is a border point if it is reachable from a core
point and there are less than minPts number of points within its
surrounding area.
• Outlier: A point is an outlier if it is not a core point and not reachable
from any core points.
• In the standard illustration of DBSCAN, minPts is 4. The red points are core points
because there are at least 4 points within their surrounding area with radius eps
(this area is shown with circles in the figure).
• The yellow points are border points because they are reachable from a
core point and have fewer than 4 points within their neighbourhood.
• Reachable means being in the surrounding area of a core point.
• The points B and C each have only two points (including themselves) within
their neighbourhood (i.e. the surrounding area with a radius of eps).
• Finally, N is an outlier because it is not a core point and cannot be
reached from a core point.
• A short code sketch of DBSCAN follows.
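A minimal sketch, assuming scikit-learn's DBSCAN and a synthetic crescent-shaped dataset from make_moons; the eps and min_samples (minPts) values are illustrative guesses that would normally be tuned by trial and error.

```python
# Sketch of DBSCAN on nonglobular (crescent-shaped) synthetic data.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus a little noise.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=4).fit(X)

labels = db.labels_                 # label -1 marks outliers (noise points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, outliers: {n_outliers}")
```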
• Neighbourhood clustering algorithms are generally effective, but they are subject
to these two weaknesses:
• Neighbourhood clustering can be computationally expensive: with every
iteration of this method, every data point might have to be compared to every
other data point in the dataset.
• With neighbourhood clustering, you might have to provide the model with
empirical parameter values for expected cluster size and cluster density: if
you guess either of these parameters incorrectly, the algorithm misidentifies
clusters, and you must start the whole long process over again to fix the problem.
If you choose to use the DBScan method, you're required to specify these
parameters.
• To avoid making poor guesses for the cluster size and cluster density parameters,
you can first use a quick k-means algorithm to determine plausible values.
Categorizing Data with Decision Tree and Random Forest Algorithms

• At certain times, you can get stuck trying to cluster and classify data from a
nonnumerical dataset.
• It’s times like these that you can use a decision tree model to help cluster and
classify your data correctly.
• A decision tree is a supervised machine learning algorithm that can be used for
both classification and regression problems. A decision tree is simply a series of
sequential decisions made to reach a specific result.
• A decision tree algorithm works by developing a set of yes-or-no rules that you
can follow for new data to see exactly how it will be characterized by the model.
• But you must be careful when using decision tree models, because
they run the high risk of error propagation, which occurs whenever
one of the model rules is incorrect.
• Errors are generated in the results of decisions that are made based on
that incorrect rule, and then propagated through every other
subsequent decision made along that branch of the tree.
• It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the
given dataset.
• It is a graphical representation for getting all the possible solutions
to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the
root node, which expands on further branches and constructs a tree-
like structure.
• In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer
(Yes/No), further splits the tree into subtrees.
Decision Tree Terminologies

•Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
•Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
•Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
•Branch/Sub Tree: A tree formed by splitting the tree.
•Pruning: Pruning is the process of removing the unwanted branches from the tree.
•Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
• In a decision tree, to predict the class of a given record, the
algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the record's
(real dataset's) attribute and, based on the comparison, follows the
branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value
with the other sub-nodes and moves further.
• It continues the process until it reaches a leaf node of the tree.
• Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
• Step-3: Divide S into subsets that contain possible values for the best
attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step 3. Continue this process until a stage is reached
where you cannot further classify the nodes; call the final node a leaf node.
A short code sketch of fitting a decision tree follows.
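A minimal sketch, assuming scikit-learn's DecisionTreeClassifier and the built-in Iris dataset; max_depth=3 is an illustrative choice to keep the learned rules short.

```python
# Sketch: fitting a decision tree classifier and printing its yes/no rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" uses information gain; criterion="gini" uses the Gini index.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on unseen data
print(export_text(tree))            # the learned sequence of yes/no rules as text
```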
Attribute Selection Measures

• While implementing a decision tree, the main issue that arises is how to
select the best attribute for the root node and for the sub-nodes.
• To solve such problems, there is a technique called the
Attribute Selection Measure, or ASM.
• By this measurement, we can easily select the best attribute for the
nodes of the tree. There are two popular techniques for ASM, which
are:
• Information Gain
• Gini Index
• 1. Information Gain:
• Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the
decision tree.
• A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split first. It
can be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It
specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
• S = the set of samples
• P(yes) = probability of yes
• P(no) = probability of no
Gini Index:
• The Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high
Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create
those binary splits.
• The Gini index can be calculated using the formula below; a small code sketch for
both measures follows.
Gini Index = 1 - Σj (Pj)²
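A minimal sketch of computing entropy, the Gini index, and information gain in plain Python; the "yes"/"no" label counts and the example split are made up for illustration.

```python
# Sketch: entropy, Gini index, and information gain for a toy binary split.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    # Entropy of the parent minus the weighted average entropy of the child subsets.
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5                       # made-up class labels
left = ["yes"] * 6 + ["no"] * 1                         # one branch of a candidate split
right = ["yes"] * 3 + ["no"] * 4                        # the other branch
print(entropy(parent), gini(parent))
print(information_gain(parent, [left, right]))
```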
Random forest algorithm
• The decision tree algorithm is quite easy to understand and interpret. But
often, a single tree is not sufficient for producing effective results. This is
where the Random Forest algorithm comes into the picture.
• Random Forest is a tree-based machine learning algorithm that leverages
the power of multiple decision trees for making decisions. As the name
suggests, it is a "forest" of trees; more precisely, a forest of randomly
created decision trees.
• Each node in a decision tree works on a random subset of features to
calculate the output.
• The random forest then combines the outputs of the individual decision trees to
generate the final output.
• This process of combining the output of multiple individual models (also known
as weak learners) is called ensemble learning.
• Lastly, random forest algorithms are a slower but more powerful alternative.
• Instead of building a tree from the data, the algorithm creates random trees and
then determines which one best classifies the testing data.
• This method eliminates the risk of error propagation that is inherent in decision
tree models.
Working of Random Forest
• The working process can be explained in the steps below; a short code sketch
follows the list.
• Step-1: Select K random data points from the training set.
• Step-2: Build the decision trees associated with the selected data
points (subsets).
• Step-3: Choose the number N of decision trees that you want to
build.
• Step-4: Repeat Steps 1 and 2 until N trees have been built.
• Step-5: For new data points, find the prediction of each decision tree,
and assign the new data points to the category that wins the majority
vote.
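A minimal sketch, assuming scikit-learn's RandomForestClassifier and the built-in Iris dataset; n_estimators=100 is an illustrative choice for the number N of trees.

```python
# Sketch: a random forest classifier that ensembles many decision trees by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample of the data and considers a random
# subset of features at each split; predictions are combined by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy of the ensemble on unseen data
```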
