
Task 02: Cluster Analysis

2.a) Businesses can use “Clustering algorithms” for market segmentation. Explain the
concept of clustering in data analytics and identify two more potential business use
cases of clustering. (3 marks)
- self descriptive

Introduction
When a population or set of data points is clustered, the data points in the same group are
more similar to one another than to the data points in other groups. In other words, the goal is
to sort objects into groups based on shared characteristics.
Let's use an example to make this concrete. If you are the manager of a retail store, you may
want to learn more about your customers' preferences in order to grow your business. Could
you look at the specifics of every single customer and devise a customized business plan for
each one? Clearly not. What you can do instead is group your customers into, say, 10
categories and design a different strategy for the customers in each category. This is what we
call "clustering".
2. Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or
not at all. In the example above, each customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a
probability or likelihood of that data point belonging to each cluster is assigned. In the
example above, each customer is assigned a probability of belonging to each of the 10
clusters of the retail store.
3. Types of clustering algorithms
Since the task of clustering is subjective, there are many possible means of achieving it. Every
methodology follows a different set of rules for defining the 'similarity' among data points. In
fact, more than 100 clustering algorithms are known, but only a few are used widely. Let's look
at them in detail:
Connectivity models: These are built on the idea that data points closer together in data space
are more similar to each other than data points lying further apart. There are two ways to
approach these models. In the first, all data points start in separate clusters, which are then
aggregated as the distance between them decreases. In the second, all data points start in a
single cluster, which is then partitioned as the distance between them grows. The choice of
distance function is left to the analyst. While these models are easy to interpret, they do not
scale well to large datasets. Hierarchical clustering and its variants are examples of these models.
Centroid models: These are iterative clustering algorithms in which similarity is defined by the
closeness of a data point to the centroid of a cluster. The K-Means clustering algorithm is the
best-known example. Because the number of clusters must be specified in advance, prior
knowledge of the dataset is important. These models run iteratively to find a locally optimal
assignment.
Distribution models: These clustering models are based on how probable it is that all data
points in a cluster belong to the same distribution (for example, a normal/Gaussian
distribution). Overfitting is a common problem with these models. A well-known example is
the Expectation-Maximization algorithm, which uses multivariate normal distributions.
Density models: These models search the data space for regions with varying density of data
points. They isolate the different density regions and assign the data points within each region
to the same cluster. Popular examples of density models are DBSCAN and OPTICS. A short
illustration of the distribution and density families is sketched below.
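As a rough, hedged illustration of the distribution and density families in code, the following minimal sketch uses scikit-learn's GaussianMixture (an EM-based distribution model) and DBSCAN (a density model) on synthetic data; the synthetic blobs and all parameter values are assumptions chosen only for demonstration.

# Illustrative sketch: distribution-based vs density-based clustering on toy data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture   # distribution (EM) model
from sklearn.cluster import DBSCAN            # density model

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Distribution model: fit 3 Gaussian components with the EM algorithm
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

# Density model: group points lying in dense regions; the label -1 marks noise points
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print("GMM cluster sizes  :", np.bincount(gmm_labels))
print("DBSCAN labels found:", sorted(set(db_labels)))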
4. K Means Clustering
K-means is an iterative clustering algorithm that refines the cluster assignment in each
iteration until it converges to a locally optimal solution. The algorithm works in these 5 steps:
1. Specify the desired number of clusters K: let us choose K=2 for these 5 data points in 2-D
space.
2. Randomly assign each data point to a cluster: let's assign three points to cluster 1, shown in
red, and two points to cluster 2, shown in grey.
3. Compute the cluster centroids: the centroid of the data points in the red cluster is shown as
a red cross and that of the grey cluster as a grey cross.
4. Re-assign each point to the closest cluster centroid: note that only the data point at the
bottom changes membership; although it was assigned to the red cluster, it is closer to the
centroid of the grey cluster, so we re-assign it to the grey cluster.
5. Re-compute the cluster centroids: now, re-compute the centroids for both clusters.
Steps 4 and 5 are repeated until the assignments stop changing. A bare-bones sketch of these
steps in code is given below, followed by the scikit-learn version used here.
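To make the five steps concrete, here is a minimal from-scratch sketch in NumPy; the random 2-D points, the choice of K=2 and the fixed iteration count are assumptions used only for illustration, and the scikit-learn code that follows is what is actually used for the task.

# Minimal from-scratch K-means sketch (illustrative only, not the assignment solution).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 2))                     # 5 toy data points in 2-D space (assumed)
k = 2                                      # step 1: choose the number of clusters

labels = np.array([0, 0, 0, 1, 1])         # step 2: initial assignment (3 red, 2 grey)
for _ in range(10):                        # repeat steps 3-5 until assignments settle
    # steps 3 and 5: compute the centroid of each (non-empty) cluster
    centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else X[j]
        for j in range(k)
    ])
    # step 4: re-assign each point to its closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

print("Final cluster labels:", labels)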

'''

The following code is for the K-Means


Created by - zara
'''
# importing required libraries
import pandas as pd
from sklearn.cluster import KMeans
# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)
# Now, we need to divide the training data into different clusters
# and predict in which cluster a particular data point belongs.
'''
Create the object of the K-Means model
You can also add other parameters and test your code here
Some parameters are : n_clusters and max_iter
Documentation of sklearn KMeans:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
'''
model = KMeans()
# fit the model with the training data
model.fit(train_data)
# Number of Clusters
print('\nDefault number of Clusters : ',model.n_clusters)
# predict the clusters on the train dataset
predict_train = model.predict(train_data)
print('\nClusters on train data',predict_train)
# predict the target on the test dataset
predict_test = model.predict(test_data)
print('Clusters on test data',predict_test)
# Now, we will train a model with n_cluster = 3
model_n3 = KMeans(n_clusters=3)
# fit the model with the training data
model_n3.fit(train_data)
# Number of Clusters
print('\nNumber of Clusters : ',model_n3.n_clusters)
# predict the clusters on the train dataset
predict_train_3 = model_n3.predict(train_data)
print('\nClusters on train data',predict_train_3)
# predict the target on the test dataset
predict_test_3 = model_n3.predict(test_data)
print('Clusters on test data',predict_test_3)

5. Hierarchical Clustering
As the name suggests, hierarchical clustering is an algorithm that creates a hierarchy of
clusters. To begin, each data point is assigned to its own cluster in this algorithm. The two
clusters that are the closest to each other are then combined into a single cluster. When there
is only one cluster left, this method comes to an end.

A dendrogram can be used to display the results of hierarchical clustering. The dendrogram
can be read in this way:
Initially, we have 25 data points, each of which is assigned to its own cluster. The two clusters
that are closest to each other are then merged, and this continues until a single cluster remains
at the top. The height at which two clusters are merged in the dendrogram represents the
distance between those two clusters in the data space.
The dendrogram can be used to determine the number of clusters that best represents the
various groups. The optimal number of clusters is the number of vertical lines cut by a
horizontal line that can cross the maximum vertical distance without intersecting a merge
point. In the example dendrogram, the red horizontal line covers the maximum vertical
distance AB, hence the optimal number of clusters is 4.
Two important things that you should know about hierarchical clustering are:
The algorithm described above follows the bottom-up (agglomerative) approach. It is also
possible to follow a top-down (divisive) approach, starting with all data points assigned to the
same cluster and recursively performing splits until each data point forms its own cluster.
The decision to merge two clusters is taken on the basis of the closeness of those clusters.
There are multiple metrics for deciding the closeness of two clusters (a short computational
sketch follows this list):
Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|
Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
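As a small, hedged illustration, these closeness metrics can be computed directly with SciPy; the two toy vectors and the random sample used to estimate the covariance matrix are assumptions for demonstration only.

# Illustrative sketch: the closeness metrics listed above, computed with SciPy/NumPy.
import numpy as np
from scipy.spatial import distance

a = np.array([2.0, 4.0, 3.0])
b = np.array([1.0, 0.0, 5.0])

rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 3))                       # toy sample used to estimate S
S_inv = np.linalg.inv(np.cov(sample, rowvar=False))     # inverse covariance matrix

print("Euclidean          :", distance.euclidean(a, b))
print("Squared Euclidean  :", distance.sqeuclidean(a, b))
print("Manhattan          :", distance.cityblock(a, b))
print("Maximum (Chebyshev):", distance.chebyshev(a, b))
print("Mahalanobis        :", distance.mahalanobis(a, b, S_inv))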
6. Difference between K Means and Hierarchical clustering
Hierarchical clustering cannot handle big data well, but K-Means clustering can. This is
because the time complexity of K-Means is linear in the number of points, i.e. O(n) per
iteration, while that of hierarchical clustering is quadratic, i.e. O(n²).
In K-Means clustering, since we start with a random choice of centroids, the results produced
by running the algorithm multiple times may differ. In contrast, results are reproducible in
hierarchical clustering.
K-Means is found to work well when the shape of the clusters is hyperspherical (like a circle
in 2D or a sphere in 3D).
K-Means clustering requires prior knowledge of K, i.e. the number of clusters you want to
divide your data into. In hierarchical clustering, you can instead stop at whatever number of
clusters you find appropriate by interpreting the dendrogram.
7. Applications of Clustering
Clustering has a large no. of applications spread across various domains. Some of the most
popular applications of clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection

8. Improving Supervised Learning Algorithms with Clustering


Is it possible to use clustering to increase the accuracy of supervised machine learning
algorithms, by grouping data points into similar clusters and using these cluster labels as
additional independent variables in the supervised algorithm? Let's find out.
We will examine the impact of clustering on the accuracy of a model for a classification
problem with 3,000 observations and 100 stock-data predictors. The 100 independent
variables (X1 to X100) describe the stock's profile, and the outcome variable (Y) has two
levels: 1 for an increase in the stock price and -1 for a decrease in the stock price.

#loading required libraries

library('randomForest')
library('Metrics')
#set random seed
set.seed(101)
#loading dataset
data<-read.csv("train.csv",stringsAsFactors= T)
#checking dimensions of data
dim(data)
## [1] 3000 101
#specifying outcome variable as factor
data$Y<-as.factor(data$Y)
#dividing the dataset into train and test
train<-data[1:2000,]
test<-data[2001:3000,]
#applying randomForest
model_rf<-randomForest(Y~.,data=train)
preds<-predict(object=model_rf,test[,-101])
table(preds)
## preds
## -1 1
## 453 547
#checking AUC on the test set
auc(preds,test$Y)
## [1] 0.4522703
So, the AUC we get is about 0.45. Now let's create five clusters based on the values of the
independent variables using k-means clustering and reapply random forest.
all<-rbind(train,test)

#creating 5 clusters using K- means clustering

Cluster <- kmeans(all[,-101], 5)

#adding clusters as independent variable to the dataset.


all$cluster<-as.factor(Cluster$cluster)

#dividing the dataset into train and test


train<-all[1:2000,]
test<-all[2001:3000,]
#applying randomforest
model_rf<-randomForest(Y~.,data=train)

preds2<-predict(object=model_rf,test[,-101])

table(preds2)

## preds2

## -1 1

## 548 452

auc(preds2,test$Y)

## [1] 0.5345908
In the above example, even though the final AUC is still modest, clustering has given our
model a significant boost, from about 0.45 to slightly above 0.53.
This shows that clustering can indeed be helpful for supervised machine learning tasks.
End Notes
In this section, we have discussed the various ways of performing clustering. As an
unsupervised learning technique, it finds applications in a large number of domains. You also
saw how you can improve the accuracy of a supervised machine learning algorithm by using
clustering.
Although clustering is easy to implement, you need to take care of some important aspects,
such as treating outliers in your data and making sure each cluster has a sufficient population.
2.b) Compare and contrast hierarchical clustering with k-means clustering
- self descriptive
K-means is a cluster analysis approach that uses a predetermined number of clusters: the
value of 'K' must be supplied in advance.
Hierarchical clustering, or hierarchical cluster analysis (HCA), is a cluster analysis method
that attempts to build a hierarchy of clusters without a predefined number of clusters.
Main differences between K means and Hierarchical Clustering are:

k-means Clustering vs Hierarchical Clustering

k-means: To locate mutually exclusive clusters of spherical shape, the approach uses a pre-specified number of clusters.
Hierarchical: Hierarchical methods can be either divisive or agglomerative.

k-means: It is necessary to have an idea of the number of clusters you want to divide your data into in order to perform the clustering.
Hierarchical: Using the dendrogram to analyse the clusters, one can stop the hierarchical clustering process at any number of clusters that seems acceptable.

k-means: One can use the median or the mean as a cluster centre to represent each cluster.
Hierarchical: Agglomerative methods begin with 'n' clusters and merge them gradually until only one cluster remains, while a divisive method works in the other direction, starting from one large cluster and splitting it.

k-means: For really large datasets, the method is more efficient in terms of computation.
Hierarchical: Hierarchical approaches are extremely beneficial when the goal is to arrange the clusters in a natural order.

k-means: Since one starts with a random choice of centroids, the results produced by running the algorithm many times may differ.
Hierarchical: Results are reproducible.

k-means: K-means clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical: A hierarchical clustering is a set of nested clusters that are arranged as a tree.

k-means: Works best when the clusters are hyperspherical in structure (like a circle in 2D or a sphere in 3D).
Hierarchical: Does not work as well as k-means when the shape of the clusters is hyperspherical.

k-means advantages: 1. Convergence is guaranteed. 2. Specialised to clusters of different sizes and shapes.
Hierarchical advantages: 1. Ease of handling any form of similarity or distance. 2. Consequently, applicability to any attribute type.

k-means disadvantages: 1. The K value is difficult to predict. 2. Does not work well with global clusters.
Hierarchical disadvantage: 1. Hierarchical clustering requires the computation and storage of an n×n distance matrix. For very large datasets, this can be expensive and slow.
2.c) Explore the given dataset and perform any necessary pre-processing of the data before
carrying out tasks 2.d, 2.e and 2.f.
- Provide relevant well commented code snippets, details of used package(s) and method(s)
along with appropriate outputs in your report.

Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format. It
is an important step in data mining, as we cannot work directly with raw data. The quality of
the data should be checked before applying machine learning or data mining algorithms.
Why is data preprocessing important?
Preprocessing of data is mainly about checking data quality. The quality can be checked along
the following dimensions:

Accuracy: whether the data entered is correct.
Completeness: whether all required data is available and recorded.
Consistency: whether the same data stored in different places matches.
Timeliness: whether the data is updated in a timely manner.
Believability: whether the data can be trusted.
Interpretability: how understandable the data is.
Major Tasks in Data Preprocessing:
Data cleaning
Data integration
Data reduction
Data transformation
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete and inaccurate data from the
dataset; it also replaces missing values. Some common techniques in data cleaning are:
Handling missing values: Missing values can be replaced with standard placeholders like
"Not Available" or "NA". When a dataset is large, manually filling in missing values is not
practical. When the data is normally distributed, the mean of the attribute can be used to
replace a missing value; if the data is not normally distributed, the attribute's median can be
used instead.
Binning: This method is used to smooth noisy data. First the data is sorted, then the sorted
values are separated and stored in bins. There are three ways to smooth the data in a bin.
Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: the values in the bin are replaced by the median value. Smoothing
by bin boundary: the minimum and maximum values of the bin are taken as the boundaries
and each value is replaced by the closest boundary value.
Regression: This is used to smooth the data and helps handle unnecessary data. For analysis
purposes, regression also helps to decide which variables are suitable for the analysis.
Clustering: This is used for finding outliers and also for grouping data; clustering is generally
used in unsupervised learning. A short pandas sketch of missing-value handling and binning
follows.
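The following is a minimal sketch of missing-value handling and bin-mean smoothing, assuming a small hypothetical numeric column; the 'income' values are made up for illustration and are not from the assignment dataset.

# Sketch of missing-value handling and bin-mean smoothing with pandas (toy data).
import pandas as pd
import numpy as np

toy = pd.DataFrame({"income": [12.0, 15.0, np.nan, 22.0, 25.0, np.nan, 40.0, 95.0]})

# Replace missing values with the mean (or the median for skewed data)
toy["income_mean_filled"] = toy["income"].fillna(toy["income"].mean())
toy["income_median_filled"] = toy["income"].fillna(toy["income"].median())

# Simple equal-frequency binning, then smoothing by bin mean
toy["bin"] = pd.qcut(toy["income_mean_filled"], q=3, labels=False)
toy["income_bin_smoothed"] = toy.groupby("bin")["income_mean_filled"].transform("mean")

print(toy)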
Data integration:
Data integration is the process of combining data from multiple sources into a single dataset.
It is one of the main components of data management. Some problems need to be considered
during data integration:
Schema integration: integrating metadata (data that describes other data) from different
sources.
Entity identification problem: identifying entities across multiple databases. For example,
the system or the user should know that student_id in one database and student_name in
another database belong to the same entity.
Detecting and resolving data value conflicts: the data taken from different databases may
differ when merged; attribute values from one database may differ from another. For
example, the date format may differ, like "MM/DD/YYYY" versus "DD/MM/YYYY".
Data reduction:
This process reduces the volume of the data, which makes the analysis easier while producing
the same or almost the same result. The reduction also helps to save storage space. Some
techniques for data reduction are dimensionality reduction, numerosity reduction and data
compression.

Dimensionality reduction: This process is necessary for real-world applications where the
data size is large. Here the number of random variables or attributes is reduced so that the
dimensionality of the dataset can be lowered, by combining and merging attributes without
losing their essential characteristics. This also reduces storage space and computation time.
When the data is very high-dimensional, the problem known as the "curse of dimensionality"
arises.
Numerosity reduction: In this method, the representation of the data is made smaller by
reducing its volume, without any loss of information.
Data compression: Storing the data in a compressed form is called data compression. The
compression can be lossless or lossy: when no information is lost during compression it is
called lossless compression, whereas lossy compression reduces the information but removes
only unnecessary parts of it.
Data Transformation:
A change made to the format or the structure of the data is called data transformation. This
step can be simple or complex depending on the requirements. Some methods of data
transformation are:
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps
to bring out the important features of the dataset. Smoothing makes it possible to detect even
small changes that help in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data
integrated from multiple sources is described and summarised for analysis. This is an
important step, since the accuracy of the analysis depends on the quantity and quality of the
data: when both are good, the results are more relevant.
Discretization: Continuous data is split into intervals, which reduces the data size. For
example, rather than specifying the exact class time, we can use intervals like (3 pm-5 pm,
6 pm-8 pm).
Normalization: This is the method of scaling the data so that it can be represented in a smaller
range, for example from -1.0 to 1.0. A short sketch of discretization and normalization is
given below.
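The following minimal sketch illustrates discretization and normalization; the toy 'age' column and the interval edges are assumptions chosen only for demonstration.

# Sketch of discretization (pd.cut) and normalization to [-1, 1] (MinMaxScaler), toy data.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({"age": [15, 22, 27, 34, 41, 58, 63]})

# Discretization: split the continuous values into labelled intervals
toy["age_group"] = pd.cut(toy["age"], bins=[0, 25, 45, 100], labels=["young", "middle", "senior"])

# Normalization: rescale the numeric column into the range [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
toy["age_scaled"] = scaler.fit_transform(toy[["age"]]).ravel()

print(toy)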

import pandas as pd
import numpy as np
dataset = pd.read_csv('Datasets.csv')
print(dataset)
# separating the matrix of independent variables (x) and the dependent variable (y)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Imputer was removed from recent scikit-learn releases; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
x

Encoding the country variable

Machine learning models use mathematical equations, so categorical data is not accepted
directly and must be converted into numerical form.
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Dummy encoding
Dummy variables replace a categorical column with 0/1 columns indicating the absence or
presence of each specific category, as sketched after this block.
Encoding for the Purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
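A minimal sketch of dummy encoding follows, assuming a hypothetical 'Country' column; pandas' get_dummies is one common way to do it, and scikit-learn's OneHotEncoder is an equivalent alternative.

# Sketch of dummy (one-hot) encoding with pandas; the 'Country' values are made up.
import pandas as pd

toy = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# Each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(toy["Country"], prefix="Country")
print(dummies)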



Splitting the dataset into training and test set:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

x_test= st_x.transform(x_test)

2.d) Perform hierarchical clustering on the given dataset and produce a dendrogram
- Provide relevant well commented code snippets, details of used package(s) and method(s)
along with appropriate visualisation in your report.

Introduction
In any business, it is critical to have a grasp on how customers behave. Last year, my chief
marketing officer approached me and asked, "Can you tell me which of our current customers
we should target for our new product?"

For me, that was a steep learning curve. As a data scientist, I rapidly learned the importance
of segmenting clients in order to adapt and design targeted strategies for my firm. Using
clustering proved to be a lifesaver here!

Problems like client segmentation can be surprisingly difficult to solve because we don't have
a specific goal in mind. A new era of learning has arrived in which we must discover patterns
and structures without any predetermined goal in mind. As a data scientist, it's both tough and
exhilarating.
Table of Contents
Supervised vs Unsupervised Learning
Why Hierarchical Clustering?
What is Hierarchical Clustering?
Types of Hierarchical Clustering
Agglomerative Hierarchical Clustering
Divisive Hierarchical Clustering
Steps to perform Hierarchical Clustering
How to Choose the Number of Clusters in Hierarchical Clustering?
Solving a Wholesale Customer Segmentation Problem using Hierarchical Clustering

The dependent or target variable is y, and the independent variables are represented by X.
The target variable is called dependent because its values depend on X. The term "supervised
learning" refers to the fact that we train our model on the independent variables under the
supervision of the target variable.

When we train the model, the goal is to learn a function that maps the independent variables
to the target. Once a model has been trained, new sets of observations can be fed into it and
the model will predict the target. In a nutshell, this is supervised learning.
There may be times when we don't have a target to predict. The term "unsupervised learning"
refers to such situations, in which there is no explicit target variable: these problems deal only
with the independent variables.

In these cases we try to divide the entire data into a set of groups. These groups are known as
clusters and the process of making them is known as clustering.

Clustering a population into distinct subgroups is a common use of this approach. Other
typical examples include grouping similar documents together and recommending similar
songs or movies. Unsupervised learning has a much wider range of applications beyond these.
These clusters can be created using a variety of techniques. K-means and hierarchical
clustering are the two most often used clustering techniques.
Brief overview of how K-means works:
Decide the number of clusters (k)
Select k random points from the data as centroids
Assign all the points to the nearest cluster centroid
Calculate the centroid of newly formed clusters
Repeat steps 3 and 4
It is an iterative process: it keeps running until the centroids of the newly formed clusters stop
changing or the maximum number of iterations is reached.

But there are certain challenges with K-means. It always tries to make clusters of roughly the
same size, and we have to decide the number of clusters at the beginning of the algorithm.
Ideally, we would not know how many clusters we should have at the start, and this is a
challenge with K-means.
Suppose we have the points below and we want to cluster them into groups:

We can assign each of these points to a separate cluster:

Now, based on the similarity of these clusters, we can combine the most similar clusters
together and repeat this process until only a single cluster is left:
We are essentially building a hierarchy of clusters. That’s why this algorithm is called
hierarchical clustering. I will discuss how to decide the number of clusters in a later section.
For now, let’s look at the different types of hierarchical clustering.

Types of Hierarchical Clustering


There are mainly two types of hierarchical clustering:

Agglomerative hierarchical clustering


Divisive Hierarchical clustering
Let’s understand each type in detail.
Agglomerative Hierarchical Clustering
In this approach, each point initially forms its own cluster. Let's say we have four data points.
In the beginning, each of these points is a separate cluster:

Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a
single cluster is left:
We are merging (or adding) the clusters at each step, right? Hence, this type of clustering is
also known as additive hierarchical clustering.
Divisive Hierarchical Clustering
Divisive hierarchical clustering works the other way around. Instead of starting with n
separate clusters, we start with a single cluster and assign all the points to that cluster.

Then, at each iteration, we split off the farthest point from the cluster and repeat this process
until each cluster contains only a single point:
Steps to Perform Hierarchical Clustering
We know that hierarchical clustering repeatedly merges the most similar points or clusters.
Now, how do we determine which points are similar and which are not? It is one of the
central questions in clustering.
One option is to take the distance between the clusters' centroids as the measure of similarity:
the points with the smallest distance between them are considered the most similar and are
merged first. This is why the algorithm can be described as distance-based (we are calculating
distances between the clusters).
A key concept in hierarchical clustering is the proximity matrix, which stores the distances
between the points. This matrix, and the steps used to execute hierarchical clustering, can be
better understood with an example.
Let's say a teacher wishes to split her students into separate groups. Each student's grade on
an assignment has been tabulated, and she would like to divide the students into groups
depending on their grades. There is no set number of groups to form. This cannot be treated as
a supervised learning problem, since the teacher does not know in advance which pupils
should be assigned to which group. So we will perform hierarchical clustering to divide the
pupils into subgroups.
Let’s take a sample of 5 students:



Step 1: First, we assign all the points to an individual cluster:
proximity matrix
Different colors here represent different clusters. You can see that we have 5 different
clusters for the 5 points in our data.
Step 2: Next, we look at the smallest distance in the proximity matrix and merge the points
with the smallest distance. We then update the proximity matrix.
Here, the smallest distance is 3, and hence we merge points 1 and 2:

Step 3: We will repeat step 2 until only a single cluster is left.


So, we will first look at the minimum distance in the proximity matrix and then merge the
closest pair of clusters. We will get the merged clusters as shown below after repeating these
steps:
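To make the proximity matrix concrete, here is a minimal sketch that computes one for five hypothetical student marks and lets SciPy perform the merges; the marks are assumed values chosen so that the closest pair (students 1 and 2, at distance 3) matches the narrative above, since the original figures are not reproduced here.

# Sketch: proximity (distance) matrix for 5 hypothetical student marks, plus the merges.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

marks = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])   # one assumed mark per student

# Pairwise Euclidean distances between all students, as a square proximity matrix
proximity = squareform(pdist(marks, metric="euclidean"))
print(np.round(proximity, 1))

# Agglomerative clustering merges the closest pair first (students 1 and 2 here)
Z = linkage(marks, method="single")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g. cut the tree into 2 groups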
How should we Choose the Number of Clusters in Hierarchical Clustering?
Now we are ready to address the question that has been hanging over this section from the
beginning. Dendrograms are a great tool for determining the number of clusters in
hierarchical clustering: a dendrogram represents the history of the merges (and splits)
performed by the algorithm.
Returning to our teacher-student example, a dendrogram is a diagram that records, for every
merge, the distance between the two groups being combined. Let's have a look at one:

On the x-axis we have the dataset's samples, and on the y-axis we have the distance. Every
time two clusters are merged, a join is drawn in the dendrogram, and the height of that join is
proportional to the distance between the merged clusters. Let's build the dendrogram for our
example step by step.
We started by merging samples 1 and 2, which were at a distance of 3 (refer to the first
proximity matrix in the previous section). Plotting this first merge, sample 1 and sample 2 are
joined, and the height of the vertical line joining them represents the distance between these
samples. The complete dendrogram is obtained by plotting every step in which groups of data
are combined in this way.

The stages of hierarchical clustering are easy to read off the dendrogram: the longer the
vertical line, the further apart the clusters it joins.
We can now set a threshold distance and draw a horizontal line at that height (generally, we
try to set the threshold so that it cuts the tallest vertical line). For example, set the threshold at
12 and draw a horizontal line there:
The number of clusters is the number of vertical lines intersected by this horizontal line. In
the example above, the red line intersects two vertical lines, so we get two clusters: one
containing the samples (1, 2, 4) and the other containing the samples (3, 5).
So, in hierarchical clustering the dendrogram is used to decide the number of clusters. In the
following section, we apply hierarchical clustering end-to-end to make these ideas concrete.

Solving the Wholesale Customer Segmentation problem using Hierarchical Clustering


We are going to work on a wholesale customer segmentation problem. The data comes from
the UCI Machine Learning repository. The aim is to segment the clients of a wholesale
distributor on the basis of their annual spending on several product categories, such as fresh
produce, milk and grocery items.
Before applying hierarchical clustering, we'll first look at the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load the data and look at the first few rows:


data = pd.read_csv('Assignment data.csv')
data.head()
Product categories include Fresh, Milk, Grocery, and more. Each customer buys a specific number of
units of each product. Using this data, we hope to identify groups of customers who share similar
characteristics. The Hierarchical Clustering method will be used to solve this problem.
Before applying hierarchical clustering, the data first needs to be rescaled so that the variables are on a comparable
scale. Why does this matter? If the scales of the variables differ, the model may be biased towards variables with a
bigger magnitude, like Fresh or Milk (refer to the table above). Here we use sklearn's normalize, which rescales each
observation (row) to unit norm; standardising each feature (e.g. with StandardScaler) would be an alternative.
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))

The x-axis contains the samples and y-axis represents the distance between these samples. The
vertical line with maximum distance is the blue line and hence we can decide a threshold of 6 and cut
the dendrogram:

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled)

We can see the values of 0s and 1s in the output since we defined 2 clusters. 0 represents the
points that belong to the first cluster and 1 represents points in the second cluster. Let’s now
visualize the two clusters:
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
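As a cross-check, the same cut of the dendrogram can be reproduced with SciPy by cutting the Ward linkage at the chosen distance threshold of 6; this is a hedged sketch reusing the data_scaled frame defined above.

# Sketch: cutting the Ward dendrogram at distance 6 with SciPy instead of sklearn.
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(data_scaled, method='ward')
labels_scipy = fcluster(Z, t=6, criterion='distance')   # cluster ids start at 1
print(pd.Series(labels_scipy).value_counts())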
2.e) Using the cluster dendrogram output of task 2.c and any other suitable algorithmic
procedure(s), argue what is the ideal number of clusters to be made.
- Provide relevant well commented code snippets, details of used package(s) and method(s)
along with appropriate visualisation and/or output in your report.
A fundamental step for any unsupervised algorithm is to determine the optimal number of
clusters into which the data may be clustered. The Elbow Method is one of the most popular
methods to determine this optimal value of k.
We now demonstrate this method with the K-Means clustering technique, using the
scikit-learn library in Python.
Step 1: Importing the required libraries

from sklearn.cluster import KMeans


from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt
Step 2: Creating and Visualizing the data
# Creating the data
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

# Visualizing the data


plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()
From the above visualization, we can see that the optimal number of clusters should be
around 3. But visualizing the data alone cannot always give the right answer. Hence we
demonstrate the following steps.
We now define the following:
Distortion: the average distance (typically Euclidean) from each point to its closest cluster
centre.
Inertia: the sum of squared distances of the samples to their closest cluster centre.
We iterate the value of k from 1 to 9 and calculate the distortion and inertia for each value of
k in that range.
Step 3: Building the clustering model and calculating the values of the Distortion and Inertia:

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance of each point to its closest centre
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    # Inertia: sum of squared distances to the closest centre (computed by sklearn)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortions[-1]
    mapping2[k] = kmeanModel.inertia_
Step 4: Tabulating and Visualizing the results
a) Using the different values of Distortion:

for key, val in mapping1.items():
    print(f'{key} : {val}')

plt.plot(K, distortions, 'bx-')


plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()
b) Using the different values of Inertia:
for key, val in mapping2.items():
    print(f'{key} : {val}')

plt.plot(K, inertias, 'bx-')


plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()

To identify the best number of clusters, we choose the "elbow" value of k, i.e. the point after
which distortion and inertia begin to decrease in a roughly linear fashion. Based on the plots
for the given data, we can say that the ideal number of clusters here is 3.
The clustered data for the different values of k (k = 1, 2, 3 and 4) is shown in the corresponding figures.

2.f) Suppose your marketing department requires you to segment the given dataset into 4
clusters; perform the clustering as required and produce a segment profile for each cluster
with appropriate visualisations and descriptions.
- Provide all important and well commented code snippets, column averages for each
segment, appropriate visualisation(s) in your report and write a segment profile (for all 4
segments) in clear business language. State any underlying assumptions made towards
writing these segment profiles.

The separation of a market into distinct customer groups with comparable characteristics is
known as customer segmentation. Customer segmentation can reveal unmet customer needs,
and with this information businesses can create products and services that stand out from the
crowd.
Customer groups can be divided in a variety of ways, with the most prevalent being:

Demographic information, such as gender, age, familial and marital status, income,
education, and occupation.
Geographical information, which differs depending on the scope of the company. For
localized businesses, this info might pertain to specific towns or counties. For larger
companies, it might mean a customer’s city, state, or even country of residence.
Psychographics, such as social class, lifestyle, and personality traits.
Behavioral data, such as spending and consumption habits, product/service usage, and
desired benefits.
Advantages of Customer Segmentation
Determine appropriate product pricing.
Develop customized marketing campaigns.
Design an optimal distribution strategy.
Choose specific product features for deployment.
Prioritize new product development efforts.
The Challenge
As the owner of a grocery mall, you have access to basic information on your clients, such as
their Customer ID, gender, annual income, and spending score. For the marketing team, you
want to get a sense of who your target clients are so that they can plan accordingly.
K Means Clustering Algorithm
Specify the number of clusters K.
Initialize the centroids by first shuffling the dataset and then randomly selecting K data points
for the centroids without replacement.
Assign each data point to its closest centroid and recompute each centroid as the mean of the
points assigned to it.
Keep iterating until there is no change to the centroids, i.e. the assignment of data points to
clusters stops changing.
Data
This project is part of the Mall Customer Segmentation Data competition held on Kaggle; the
dataset can be downloaded from the Kaggle website.

Environment and tools


scikit-learn
seaborn
numpy
pandas
matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")
df.head()

We dropped the CustomerID column, as it is not relevant in this context, and also plotted the
age distribution of the customers.
df.drop(["CustomerID"], axis = 1, inplace=True)

plt.figure(figsize=(10,6))
plt.title("Ages Frequency")
sns.axes_style("dark")
sns.violinplot(y=df["Age"])
plt.show()

We drew box plots of the spending score and the annual income to better visualise their
distribution ranges. The range of the spending score is clearly wider than the annual income range.
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
sns.boxplot(y=df["Spending Score (1-100)"], color="red")
plt.subplot(1,2,2)
sns.boxplot(y=df["Annual Income (k$)"])
plt.show()
We checked the male and female customer distribution by drawing a bar graph. There are
noticeably more female customers than male customers in the dataset.

genders = df.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()

We made a bar plot to check the distribution of the number of customers in each age group.
Clearly the 26-35 age group outweighs every other age group.
age18_25 = df.Age[(df.Age <= 25) & (df.Age >= 18)]
age26_35 = df.Age[(df.Age <= 35) & (df.Age >= 26)]
age36_45 = df.Age[(df.Age <= 45) & (df.Age >= 36)]
age46_55 = df.Age[(df.Age <= 55) & (df.Age >= 46)]
age55above = df.Age[df.Age >= 56]
x = ["18-25","26-35","36-45","46-55","55+"]

y = [len(age18_25.values), len(age26_35.values), len(age36_45.values), len(age46_55.values), len(age55above.values)]

plt.figure(figsize=(15,6))
sns.barplot(x=x, y=y, palette="rocket")
plt.title("Number of Customer and Ages")
plt.xlabel("Age")
plt.ylabel("Number of Customer")
plt.show()

We made a bar graph to show how many clients there are in each spending-score band. The
largest share of our clients have a spending score between 41 and 60.
ss1_20 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 1) &
(df["Spending Score (1-100)"] <= 20)]
ss21_40 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 21) &
(df["Spending Score (1-100)"] <= 40)]
ss41_60 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 41) &
(df["Spending Score (1-100)"] <= 60)]
ss61_80 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 61) &
(df["Spending Score (1-100)"] <= 80)]
ss81_100 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 81) &
(df["Spending Score (1-100)"] <= 100)]
ssx = ["1-20", "21-40", "41-60", "61-80", "81-100"]
ssy = [len(ss1_20.values), len(ss21_40.values), len(ss41_60.values), len(ss61_80.values),
len(ss81_100.values)]

plt.figure(figsize=(15,6))
sns.barplot(x=ssx, y=ssy, palette="nipy_spectral_r")
plt.title("Spending Scores")
plt.xlabel("Score")
plt.ylabel("Number of Customer Having the Score")
plt.show()

A similar bar graph shows the distribution of customers by annual income band.
ai0_30 = df["Annual Income (k$)"][(df["Annual Income (k$)"] >= 0) & (df["Annual Income (k$)"] <= 30)]
ai31_60 = df["Annual Income (k$)"][(df["Annual Income (k$)"] >= 31) & (df["Annual Income (k$)"] <= 60)]
ai61_90 = df["Annual Income (k$)"][(df["Annual Income (k$)"] >= 61) & (df["Annual Income (k$)"] <= 90)]
ai91_120 = df["Annual Income (k$)"][(df["Annual Income (k$)"] >= 91) & (df["Annual Income (k$)"] <= 120)]
ai121_150 = df["Annual Income (k$)"][(df["Annual Income (k$)"] >= 121) & (df["Annual Income (k$)"] <= 150)]
aix = ["$ 0 - 30,000", "$ 30,001 - 60,000", "$ 60,001 - 90,000", "$ 90,001 - 120,000", "$ 120,001 - 150,000"]
aiy = [len(ai0_30.values), len(ai31_60.values), len(ai61_90.values), len(ai91_120.values), len(ai121_150.values)]

plt.figure(figsize=(15,6))
sns.barplot(x=aix, y=aiy, palette="Set2")
plt.title("Annual Incomes")
plt.xlabel("Income")
plt.ylabel("Number of Customer")
plt.show()

We plotted the Within-Cluster Sum of Squares (WCSS) against the number of clusters (the K
value) to figure out the optimal number of clusters. WCSS measures the sum of squared
distances of the observations from their cluster centroids:

WCSS = Σᵢ (Xᵢ − Yᵢ)², where Yᵢ is the centroid of the cluster to which observation Xᵢ belongs.

WCSS always decreases as the number of clusters grows; in the limiting case, each data point
becomes its own cluster centroid and WCSS drops to zero, so simply minimising WCSS
would push us towards the maximum number of clusters.
The Elbow Method
The method can be summarised in the following steps:

Compute K-Means clustering for different values of K, varying K from 1 to 10 clusters.
For each K, calculate the total within-cluster sum of squares (WCSS).
Plot the curve of WCSS against the number of clusters K.
The location of a bend (knee) in the plot is generally considered an indicator of the
appropriate number of clusters. A short sketch of this computation on the mall dataset is
given below.
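This is a minimal sketch of that computation, assuming the df frame loaded earlier (with CustomerID already dropped) and using only its numeric columns; the range of K values tried is an assumption.

# Sketch: elbow curve (WCSS vs K) for the mall customer data loaded above.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

features = df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]]

wcss = []
for k in range(1, 11):
    km_k = KMeans(n_clusters=k, random_state=0).fit(features)
    wcss.append(km_k.inertia_)          # inertia_ is the within-cluster sum of squares

plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), wcss, "bx-")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Method for the mall customer data")
plt.show()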

We produced a 3D plot of each customer's age, annual income and spending score to show
how the segments separate. In the 3D plot, the data points are divided into five clusters, each
of which is represented by a different color.

km = KMeans(n_clusters=5)
clusters = km.fit_predict(df.iloc[:,1:])
df["label"] = clusters

from mpl_toolkits.mplot3d import Axes3D


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df.Age[df.label == 0], df["Annual Income (k$)"][df.label == 0], df["Spending Score (1-100)"][df.label == 0], c='blue', s=60)
ax.scatter(df.Age[df.label == 1], df["Annual Income (k$)"][df.label == 1], df["Spending Score (1-100)"][df.label == 1], c='red', s=60)
ax.scatter(df.Age[df.label == 2], df["Annual Income (k$)"][df.label == 2], df["Spending Score (1-100)"][df.label == 2], c='green', s=60)
ax.scatter(df.Age[df.label == 3], df["Annual Income (k$)"][df.label == 3], df["Spending Score (1-100)"][df.label == 3], c='orange', s=60)
ax.scatter(df.Age[df.label == 4], df["Annual Income (k$)"][df.label == 4], df["Spending Score (1-100)"][df.label == 4], c='purple', s=60)
ax.view_init(30, 185)
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
ax.set_zlabel('Spending Score (1-100)')
plt.show()
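To produce the column averages for each segment, as the task requires, the following is a minimal sketch assuming the 'label' column created by the K-means step above; only the numeric columns are averaged.

# Sketch: per-segment column averages as the starting point for the segment profiles.
segment_profile = df.groupby("label")[["Age", "Annual Income (k$)", "Spending Score (1-100)"]].mean()
segment_profile["customers"] = df["label"].value_counts().sort_index()
print(segment_profile.round(1))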

Results

Conclusions
K-means clustering is one of the most commonly used clustering methods and is frequently
the first step in addressing a clustering task. The purpose of K-means is to separate the data
points into discrete, non-overlapping clusters for subsequent analysis and interpretation. One
of the most common uses of K-means clustering is customer segmentation, which can be used
to boost a company's profits by better understanding its customers.
