Unit 2
Using Clustering to Subdivide Data
Introducing Clustering Basics
Minkowski Distance
• The Minkowski distance between two n-dimensional points p and q is D(p, q) = (∑ᵢ |pᵢ - qᵢ|^r)^(1/r)
• Where
• n = number of dimensions
• pᵢ, qᵢ = the i-th coordinates of data points p and q
• r = order of the distance (r = 1 gives the Manhattan distance, r = 2 gives the Euclidean distance)
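• A minimal sketch of this formula in Python (assuming NumPy; the function name and sample points are illustrative):

import numpy as np

def minkowski_distance(p, q, r):
    # Minkowski distance of order r; r = 1 is Manhattan, r = 2 is Euclidean
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q) ** r) ** (1.0 / r)

# Euclidean distance between two 3-dimensional points
print(minkowski_distance([1, 2, 3], [4, 6, 3], r=2))  # 5.0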
• You can use many different algorithms for clustering, but the speed and robustness of the k-means algorithm make it a popular choice among experienced data scientists.
• As alternatives, kernel density estimation methods, hierarchical
algorithms, and neighbourhood algorithms are also available to help
you identify clusters in your dataset.
Clustering with the k-means algorithm
• The elbow method is based on the within-cluster sum of squares (WSS): WSS = ∑ᵢ distance(p(i), q(i))²
• Where
• p(i) = data point
• q(i) = closest centroid to the data point
• In the elbow method, the value of K is chosen at the point after which the decrease in WSS becomes almost constant.
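• A minimal sketch of the elbow method, assuming scikit-learn (the dataset is generated just for illustration; scikit-learn exposes the WSS as the inertia_ attribute):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# WSS for K = 1..8; the "elbow" is where WSS stops decreasing sharply
for k in range(1, 9):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wss, 1))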
Silhouette Method
• The silhouette method is an alternative to the elbow method: it measures how similar each data point is to its own cluster compared with the other clusters, and the K that gives the highest average silhouette score is chosen.
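• A minimal sketch of the silhouette method under the same assumptions (scikit-learn, illustrative data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Average silhouette score for K = 2..8; pick the K with the highest score
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))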
Hierarchical Clustering (Agglomerative Approach)
• Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
• Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster is left.
• Step-5: Once all the clusters are combined into one big cluster, build the dendrogram and cut it to divide the clusters as the problem requires.
• As we have seen, how the distance between two clusters is measured is crucial for hierarchical clustering. There are various ways to calculate this distance, and the chosen way decides the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below (a sketch using them follows the list):
• Single Linkage: the shortest distance between the closest points of the two clusters.
• Complete Linkage: the farthest distance between two points in two different clusters. It is one of the popular linkage methods because it forms tighter clusters than single linkage.
• Average Linkage: the distances between every pair of points (one from each cluster) are added up and divided by the total number of pairs to give the average distance between the two clusters. It is also one of the most popular linkage methods.
• Centroid Linkage: the distance between the centroids of the two clusters.
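• A minimal sketch of agglomerative clustering with different linkage options, assuming scikit-learn (which offers single, complete, average, and Ward linkage; centroid linkage is available in SciPy instead):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Compare cluster assignments under different linkage rules
for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels[:10])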
Working of the Dendrogram in Hierarchical Clustering
• The dendrogram is a tree-like structure that is mainly used to record each merge step that the hierarchical clustering (HC) algorithm performs.
• In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points, and the X-axis shows all the data points of the given dataset.
• Firstly, the data points P2 and P3 combine to form a cluster, and correspondingly a dendrogram link is created that connects P2 and P3 with a rectangular shape. The height of the link is decided by the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in
one dendrogram, and P4, P5, and P6, in another dendrogram.
• At last, the final dendrogram is created that combines all the data points
together.
• We can cut the dendrogram tree structure at any level as per our
requirement.
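• A minimal sketch of building and cutting a dendrogram, assuming SciPy (the six points are illustrative stand-ins for P1..P6):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

points = np.array([[1.0, 1.0], [1.5, 1.2], [1.3, 1.4],
                   [5.0, 5.0], [5.5, 5.3], [5.4, 5.6]])

# Agglomerative clustering with average linkage
Z = linkage(points, method="average")

# Cut the tree so that at most two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree itself (plotting requires matplotlib)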
• Hierarchical clustering algorithms are more computationally expensive than k-means algorithms
because with each iteration of hierarchical clustering, many observations must be compared to
many other observations.
• The benefit, however, is that hierarchical clustering algorithms are not subject to errors caused by
center convergence at areas of local minimum density (as exhibited with the k-means clustering
algorithms).
• Neither k-means nor hierarchical clustering algorithms perform well when clusters
are nonglobular — a configuration where some points in a cluster are closer to
points in a different cluster than they are to points in the center of their own
cluster.
• If your dataset shows nonglobular clustering, you can use neighbourhood
clustering algorithms, like DBScan, to determine whether each point is closer to
its neighbours in the same cluster or closer to its neighbouring observations in
other clusters.
Dabbling in the DBScan neighbourhood
• Partition-based and hierarchical clustering techniques are highly efficient with
normal shaped clusters. However, when it comes to arbitrary shaped clusters or
detecting outliers, density-based techniques are more efficient.
• For example, a dataset made up of compact, well-separated groups can easily be divided into three clusters using the k-means algorithm.
• Density-based spatial clustering of applications with noise (DBScan) is an unsupervised learning method that works by clustering core samples (dense areas of a dataset) while simultaneously demarcating non-core samples (portions of the dataset that are comparatively sparse).
• It’s a neighbourhood clustering algorithm that’s ideal for examining two or more variables together
to identify outliers.
• It’s particularly useful for identifying collective outliers — outliers that appear nearby to one
another, all having similar values that are anomalous to most values in the variable.
• With DBScan, take an iterative, trial-and-error approach to find the ideal number of outliers for
inspection.
• When experimenting with the DBScan model, outliers should comprise 5 percent or less of the
dataset’s observations.
• You must adjust the model parameters until you’ve isolated this small select group of observations.
• The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”.
• The key idea is that for each point of a cluster, the neighbourhood of a given radius has to
contain at least a minimum number of points.
• Partitioning methods (K-means) and hierarchical clustering work for finding spherical-
shaped clusters or convex clusters.
• In other words, they are suitable only for compact and well-separated clusters.
• Moreover, they are also severely affected by the presence of noise and outliers in the data.
• Real life data may contain irregularities, like:
1.Clusters can be of arbitrary shape
2.Data may contain noise.
• Given such data, the k-means algorithm has difficulty identifying clusters with these arbitrary shapes.
• Density-based clustering algorithms are very efficient at finding high-density regions and outliers. Detecting outliers is very important for some tasks, e.g. anomaly detection.
• DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• The main idea behind DBSCAN is that a point belongs to a cluster if it
is close to many points from that cluster.
• There are two key parameters of DBSCAN:
• eps: The distance that specifies the neighbourhoods. Two points are considered to be neighbours if the distance between them is less than or equal to eps.
• minPts: Minimum number of data points to define a cluster.
• Based on these two parameters, points are classified as core point,
border point, or outlier:
• Core point: A point is a core point if there are at least minPts number
of points (including the point itself) in its surrounding area with radius
eps.
• Border point: A point is a border point if it is reachable from a core
point and there are less than minPts number of points within its
surrounding area.
• Outlier: A point is an outlier if it is not a core point and not reachable
from any core points.
• In this case, minPts is 4. Red points are core points because there
are at least 4 points within their surrounding area with radius eps.
• This area is shown with the circles in the figure.
• The yellow points are border points because they are reachable from a
core point and have less than 4 points within their neighbourhood.
• Reachable means being in the surrounding area of a core point.
• The points B and C have two points (including the point itself) within
their neighbourhood (i.e. the surrounding area with a radius of eps).
• Finally, N is an outlier because it is not a core point and cannot be reached from a core point.
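• A minimal sketch of DBSCAN, assuming scikit-learn; the eps and min_samples values are illustrative and would normally be tuned:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two arbitrarily shaped (non-globular) clusters plus a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples corresponds to minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points labelled -1 are outliers (noise); other labels are cluster ids
labels = db.labels_
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("outliers:", int(np.sum(labels == -1)))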
• Neighbourhood clustering algorithms are generally effective, but they are subject
to these two weaknesses:
• Neighbourhood clustering can be computationally expensive: with every iteration of this method, every data point might have to be compared to every other data point in the dataset.
• With neighbourhood clustering, you might have to provide the model with empirical parameter values for expected cluster size and cluster density: if you guess either of these parameters incorrectly, the algorithm misidentifies clusters, and you must start the whole long process over again to fix the problem. If you choose to use the DBScan method, you're required to specify these parameters.
• To avoid making poor guesses for the cluster size and cluster density parameters,
you can first use a quick k-means algorithm to determine plausible values.
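• A minimal sketch of that idea, assuming scikit-learn; using a quick k-means pass to get a rough distance scale is only a heuristic, and the n_clusters value and the 0.5 factor here are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Quick k-means pass to estimate a plausible neighbourhood scale
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Use a fraction of the typical within-cluster distance as a starting eps
eps_guess = float(np.median(dists)) * 0.5
db = DBSCAN(eps=eps_guess, min_samples=5).fit(X)
print("starting eps:", round(eps_guess, 3), "outliers:", int(np.sum(db.labels_ == -1)))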
Categorizing Data with Decision Tree and Random Forest Algorithms
• At certain times, you can get stuck trying to cluster and classify data from a
nonnumerical dataset.
• It’s times like these that you can use a decision tree model to help cluster and
classify your data correctly.
• A decision tree is a supervised machine learning algorithm that can be used for
both classification and regression problems. A decision tree is simply a series of
sequential decisions made to reach a specific result
• A decision tree algorithm works by developing a set of yes-or-no rules that you
can follow for new data to see exactly how it will be characterized by the model.
• But you must be careful when using decision tree models, because
they run the high risk of error propagation, which occurs whenever
one of the model rules is incorrect.
• Errors are generated in the results of decisions that are made based on
that incorrect rule, and then propagated through every other
subsequent decision made along that branch of the tree.
• It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the
given dataset.
• It is a graphical representation for getting all the possible solutions
to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the
root node, which expands on further branches and constructs a tree-
like structure.
• In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
Decision Tree Terminologies
•Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
•Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
•Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
•Branch/Sub Tree: A subtree formed by splitting the tree.
•Pruning: Pruning is the process of removing unwanted branches from the tree.
•Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called child nodes.
• In a decision tree, for predicting the class of the given dataset, the
algorithm starts from the root node of the tree.
• This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
• It continues the process until it reaches the leaf node of the tree.
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes. (A sketch of fitting such a tree follows below.)
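• A minimal sketch of these steps using scikit-learn's CART-based decision tree (the Iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" (the default) or "entropy"
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # prints the learned yes/no rules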
Attribute Selection Measures
• While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection Measure (ASM).
• Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
• 1. Information Gain:
• Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the
decision tree.
• A decision tree algorithm always tries to maximize the value of information
gain, and a node/attribute having the highest information gain is split first. It
can be calculated using the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = -P(yes) log₂ P(yes) - P(no) log₂ P(no)
Where,
•S = the total set of samples
•P(yes) = probability of yes
•P(no) = probability of no
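• A minimal worked sketch of these formulas in plain Python (the parent node and the split used here are purely illustrative):

import math

def entropy(labels):
    # Entropy of a list of class labels such as "yes"/"no"
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# Parent node with 9 "yes" and 5 "no" samples: entropy is about 0.940
parent = ["yes"] * 9 + ["no"] * 5

# An illustrative split of the parent into two child nodes by some feature
left = ["yes"] * 6 + ["no"] * 2
right = ["yes"] * 3 + ["no"] * 3

weighted = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
print("information gain:", round(entropy(parent) - weighted, 3))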
Gini Index:
•Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
•An attribute with a low Gini index should be preferred over one with a high Gini index.
•It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
•Gini index can be calculated using the below formula:
Gini Index = 1 - ∑ⱼ Pⱼ²
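• And a minimal sketch of the Gini index for the same illustrative labels:

def gini_index(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(round(gini_index(["yes"] * 9 + ["no"] * 5), 3))  # about 0.459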
Random forest algorithm
• The decision tree algorithm is quite easy to understand and interpret. But
often, a single tree is not sufficient for producing effective results. This is
where the Random Forest algorithm comes into the picture.
• Random Forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees for making decisions. As the name suggests, it is a "forest" of randomly created decision trees.
• Each node in the decision tree works on a random subset of features to
calculate the output.
• The random forest then combines the output of individual decision trees to
generate the final output.
• This process of combining the output of multiple individual models (also known
as weak learners) is called Ensemble Learning
• Lastly, random forest algorithms are a slower but more powerful alternative.
• Instead of relying on a single tree built from the data, the algorithm creates many randomized trees and combines their outputs to classify the testing data.
• This method greatly reduces the risk of error propagation that is inherent in single decision tree models.
Working of Random Forest
• The Working process can be explained in the below steps and diagram:
• Step-1: Select random K data points from the training set.
• Step-2: Build the decision trees associated with the selected data
points (Subsets).
• Step-3: Choose the number N of decision trees that you want to build.
• Step-4: Repeat Steps 1 & 2 until N trees have been built.
• Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
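• A minimal sketch of a random forest, assuming scikit-learn and the same Iris data as in the decision tree sketch above (n_estimators plays the role of N):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 trees; each tree is trained on a bootstrap sample and considers
# a random subset of features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# New points are assigned to the class that wins the majority of the trees' votes
print("test accuracy:", forest.score(X_test, y_test))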