Customer Segmentation Analysis
USING CLUSTERING
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
SUBMITTED
By
A. SUMA SRI 16671A0561
K.S. UDIT 16671A0586
A. NIYATHI 16671A0598
N. SAI TEJA 15671A05A0
J.B.INSTITUTE OF ENGINEERING & TECHNOLOGY
(UGC AUTONOMOUS)
(Accredited by NAAC, Permanently Affiliated to JNTUH)
Yenkapally, Moinabad Mandal, R.R. Dist. -500 075
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
External Examiner
DECLARATION
Date: 21/05/2020
A. SUMA SRI 16671A0561
K.S. UDIT 16671A0586
A. NIYATHI 16671A0598
N. SAI TEJA 15671A05A0
ACKNOWLEDGEMENT
At the outset, we express our gratitude to the Almighty for showering His grace and blessings upon us to complete this Main Project. Although our names appear on the cover of this book, many people have contributed in some form or the other to this project's development. We could not have done this project without the assistance and support of each of the following.
First of all, we are highly indebted to Dr. S. SUDHAKARA REDDY, Principal, for giving us the permission to carry out this Main Project.
We would like to thank Dr. P. SRINIVASA RAO, Professor & Head of the Department of COMPUTER SCIENCE AND ENGINEERING, for being a moral support throughout the period of the study in the Department.
We would like to thank the Teaching and Non-Teaching Staff of the Department of Computer Science & Engineering for sharing their knowledge with us.
ABSTRACT
Customer segmentation is the practice of dividing a company’s customers into groups that
reflect similarity among customers in each group. The goal of segmenting customers is to decide
how to relate to customers in each segment in order to maximize the value of each customer to the
business. Customer segmentation has the potential to allow marketers to address each customer in
the most effective way. Using the large amount of data available on customers and potential
customers, a customer segmentation analysis allows marketers to identify discrete groups of
customers with a high degree of accuracy based on demographic, behavioral and other indicators.
To scale efficiently and effectively, expansion stage companies need to focus their efforts not on
a broad universe of potential customers, but rather on a specific subset of customers who are most
similar to their best current customers.
TABLE OF CONTENTS
1. INTRODUCTION 1
2. LITERATURE SURVEY 3
3. SYSTEM ANALYSIS 5
3.1 Aim 5
3.2 Existing System 5
3.3 Proposed System 5
3.4 Software Requirements 6
4. SYSTEM DESIGN 28
5. IMPLEMENTATION 30
6.1 Testing Strategies 36
6.2 Functional Testing 39
9. FUTURE ENHANCEMENT 53
10. BIBLIOGRAPHY 54
1. INTRODUCTION
In the contemporary day and age, the importance of treating customers as the principal asset
of an organization is increasing in value. Organizations are rapidly investing in developing
strategies for better customer acquisition, maintenance and development. The concept of business
intelligence has a crucial role to play in making it possible for organizations to use technical
expertise for acquiring better customer insight for outreach programs. In this scenario, the concept
of CRM garners much attention since it is a comprehensive process of acquiring and retaining
customers, using business intelligence, to maximize the customer value for a business enterprise.
One of the two most important objectives of CRM is customer development through
customer insight. This objective of CRM entails the usage of an analytical approach in order to
correctly assess customer information and analysis of the value of customers for better customer
insight. Keeping up with the changing times, organizations are modifying their business flow
models by employing systems engineering as well as change management, and designing
information technology (IT) solutions that aid them in acquiring new customers, retaining the
present customer base and boosting the customers' lifetime value.
Due to the diverse range of products and services available in the market as well as the
intense competition among organizations, customer relationship management has come to play a
significant role in the identification and analysis of a company’s best customers and the adoption
of best marketing strategies to achieve and sustain competitive advantage. One of the most useful
techniques in business analytics for the analysis of consumer behavior and categorization is
customer segmentation. By using clustering techniques, customers with similar means, end and
behavior are grouped together into homogeneous clusters.
Customer Segmentation helps organizations in identifying or revealing distinct groups of
customers who think and function differently and follow varied approaches in their spending and
purchasing habits. Clustering techniques reveal internally homogeneous and externally
heterogeneous groups. Customers vary in terms of behavior, needs, wants and characteristics and
the main goal of clustering techniques is to identify different customer types and segment
the customer base into clusters of similar profiles so that the process of target marketing can be
executed more efficiently.
This study aims to explore the avenues of using customer segmentation as a business
intelligence tool within the CRM framework, as well as the use of clustering techniques for helping
organizations obtain a clearer picture of their valuable customer base. The concepts of customer
relationship management, customer segmentation as a core function of CRM, and the approach of
segmenting customers using clustering techniques are discussed.
The available clustering models for business analysis in the context of customer
segmentation, the advantages and disadvantages of the two main models chosen for our study
(K-Means and Hierarchical Clustering), as well as the possibility of developing a hybrid model
which can outperform the individual models, are surveyed.
2. LITERATURE SURVEY
Research dealing with the attributes of shopping malls and/or hypermarkets, especially in the
Indian context, is scarce. Few studies have empirically analyzed the influence of an assortment of
attributes on buying behaviour in shopping arcades and malls, or on customers' shopping
experiences. Most of the research undertaken so far draws on foreign experience, as such studies
have come of age in the US, UK and European markets. An earnest attempt has been made to delve
into the relevant research done on the theme, presented as follows:
Brunner and Mason (1968) investigated the importance of driving time upon the
preferences of consumers towards regional shopping centers. They expressed that although it is
recognized that population, purchasing power, population density, newspaper circulation, and other
factors are influential in determining the shopping habits of consumers, a factor which is generally
overlooked is the driving time required to reach the center. In this study, it was established that the
driving time required to reach a center is highly influential in determining consumer shopping center
preferences. The most consistent and significant driving time dimension in delineating shopping
center trade areas was found at the 15-minute driving points, as three-fourths of each center's
shoppers resided within this range.
Huff (1964 and 1966) concluded that the comparative size of the centers and the
convenience of access were the primary characteristics that consumers sought when choosing a
shopping center to visit.
Cox and Cooke (1970) determined customer preference for shopping centers and the
importance of driving time. The authors concluded that location and attractiveness are important
determinants of consumer shopping center preferences.
Mehrabian and Russell (1974) noted that the response that store atmosphere elicits from
consumers, varies along three dimensions of pleasantness, arousal and dominance.
Bellenger et al. (1977) found that some consumers placed the greatest value on convenience
and economic attributes including convenience to home, accessibility, and the
presence of services such as banks and restaurants. Others, however, emphasized recreational
attributes including atmosphere, fashionability, variety of stores and merchandise.
Vaughn and Hansotia (1977) opined that merchandise and convenience seem to be the two
underlying dimensions which consistently appear every time. Merchandise quality, merchandise
variety, atmosphere of the shopping area, availability of sale items and ease of shopping comparisons
are all component parts of this underlying dimension.
McCarthy (1980) attempted to include transport mode / travel attributes in studying the
role of the qualitative characteristics that influence the choice in shopping destination. Using the
factor analytical technique, five sets of qualitative generalized attributes were generated. These
generalized attributes include trip convenience, trip comfort, trip safety, shopping area attraction
and shopping area mobility. He found that these generalized attributes, which were obtained from
attitudinal information, are significant in an individual's choice of shopping area.
3. SYSTEM ANALYSIS
3.1 AIM
Customer Segmentation is the subdivision of a market into discrete customer groups that
share similar characteristics. Customer Segmentation can be a powerful means to identify
unsatisfied customer needs. Using the above data companies can then outperform the competition
by developing uniquely appealing products and services.
SOFTWARE REQUIREMENTS:
• Anaconda
• Jupyter
• Kaggle
• Operating system (Windows 10)
ALGORITHM USED
INTRODUCTION:
The most common ways in which businesses segment their customer base are:
Demographic segmentation: Clustering demographic information such as gender, age, familial
and marital status, income, education, and occupation.
Typically, demographic data contains many categorical variables. The mining function
works well with data sets that consist of these types of variables.
You can also use numerical variables. The Demographic Clustering algorithm treats
numerical variables by assigning similarities according to the numeric difference of the values.
Demographic Clustering is an iterative process over the input data. Each input record is read
in succession. The similarity of each record with each of the currently existing clusters is calculated.
If the biggest calculated similarity is above a given threshold, the record is added to the relevant
cluster. This cluster's characteristics change accordingly. If the calculated similarity is not above the
threshold, or if there is no cluster (which is initially the case) a new cluster is created that contains
the record alone. You can specify the maximum number of clusters, as well as the similarity
threshold.
Demographic Clustering uses the statistical Condorcet criterion to manage the assignment of
records to clusters and the creation of new clusters. The Condorcet criterion evaluates how
homogeneous each discovered cluster is (in that the records it contains are similar) and how
heterogeneous the discovered clusters are among each other. The iterative process of discovering
clusters stops after two or more passes over the input data if the improvement of the clustering result
according to the Condorcet criterion does not justify a new pass.
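The iterative, threshold-based assignment described above can be sketched as follows. This is only a simplified illustration, not the actual mining function: the similarity here is a made-up mapping of Euclidean distance into (0, 1], standing in for the Condorcet-based similarity, and the sample data is invented.

```python
import numpy as np

def threshold_cluster(records, threshold, max_clusters=10):
    """Assign each record to its most similar cluster, or open a new one.

    Similarity is 1 / (1 + distance to the cluster centroid), a simplified
    stand-in for the Condorcet-based similarity used by Demographic Clustering.
    """
    centroids, members = [], []
    for rec in records:
        if centroids:
            dists = [np.linalg.norm(rec - c) for c in centroids]
            best = int(np.argmin(dists))
            sim = 1.0 / (1.0 + dists[best])
            # Join the best cluster if similar enough (or the cap is reached)
            if sim >= threshold or len(centroids) >= max_clusters:
                members[best].append(np.array(rec, dtype=float))
                centroids[best] = np.mean(members[best], axis=0)
                continue
        # Otherwise (or initially, when no cluster exists) open a new cluster
        centroids.append(np.array(rec, dtype=float))
        members.append([np.array(rec, dtype=float)])
    return centroids, members

data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
cents, mems = threshold_cluster(data, threshold=0.5)
print(len(cents))  # two well-separated groups -> 2 clusters
```

Note how the cluster's characteristics (its centroid) change as soon as a record is added, exactly as described above.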
Geographical segmentation: It differs depending on the scope of the company. For localized
businesses, this info might pertain to specific towns or counties. For larger companies, it might
mean a customer’s city, state, or even country of residence.
• ZIP code
• City
• Country
• Radius around a certain location
• Climate
• Urban or rural
Geographic segmentation can refer to a defined geographic boundary (such as a city or ZIP
code) or type of area (such as the size of city or type of climate).
An example of geographic segmentation may be the luxury car company choosing to target
customers who live in warm climates where vehicles don’t need to be equipped for snowy
weather. The marketing platform might focus their marketing efforts around urban, city centers
where their target customer is likely to work.
We can get details for geographic segmentation and find out where the audience lives using
Alexa's Site Overview tool. Enter your site URL, and the report shows where your website
visitors are located across the world.
Psychographic Market Segmentation Examples
• Personality traits
• Values
• Attitudes
• Interests
• Lifestyles
• Psychological influences
• Subconscious and conscious beliefs
• Motivations
• Priorities
Psychographic segmentation factors are slightly more difficult to identify than demographics
because they are subjective. They are not data-focused and require research to uncover and
understand.
For example, the luxury car brand may choose to focus on customers who value quality
and status. While the B2B enterprise marketing platform may target marketing managers who
are motivated to increase productivity and show value to their executive team.
When your obvious groupings of target segments seem to have radically different needs and
responses to your offerings and messaging, this is a major indicator it is a good time to look at
psychographic segmentation. This method is a powerful way to market the same product to
individuals who otherwise seem very heterogeneous. Many expert marketers say this approach
will ultimately yield the greatest payoff, in many ways: purchase amount and frequency, lifetime
value, loyalty, and more.
Behavioral segmentation: It collects behavioural data, such as spending and consumption habits,
product/service usage, and desired benefits.
Behavioral Market Segmentation Examples
• Purchasing habits
• Spending habits
• User status
• Brand interactions
Behavioral segmentation requires you to know about your customer’s actions. These
activities may relate to how a customer interacts with your brand or to other activities that happen
away from your brand.
A B2C example in this segment may be the luxury car brand choosing to target customers
who have purchased a high-end vehicle in the past three years. The B2B marketing platform
may focus on leads who have signed up for one of their free webinars.
Behavioral segmentation isn’t about just recognizing that people have different habits, it’s
about optimizing marketing campaigns to match these behavioral patterns with a particular
message.
Behavioral segmentation is the process of sorting and grouping customers based on the
behaviors they exhibit. These behaviors include the types of products and content they consume,
and the cadence of their interactions with an app, website, or business.
Acquisition, engagement, and retention are all important factors to keep in mind when
analyzing customer behavior. Understanding the following ways your users can interact with your
product will help you accomplish a sustainable and constructive behavioral segmentation strategy.
CLUSTERING
Clustering is one of the most common exploratory data analysis techniques, used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups in the
data such that data points in the same subgroup (cluster) are very similar while data points in
different clusters are very different. In other words, we try to find homogeneous subgroups within
the data such that data points in each cluster are as similar as possible according to a similarity
measure such as Euclidean distance or correlation-based distance. The decision of which
similarity measure to use is application-specific.
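To see why this decision matters, the two measures mentioned above can disagree about which points are "close". A small NumPy sketch (the vectors are invented for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, but farther away
c = np.array([1.1, 1.9, 3.2])   # close to a in space

# Euclidean distance: c is much closer to a than b is
print(np.linalg.norm(a - b))        # sqrt(14), about 3.74
print(np.linalg.norm(a - c))        # about 0.24

# Correlation-based distance (1 - Pearson r): b is perfectly
# correlated with a, so its correlation distance is ~0
print(1 - np.corrcoef(a, b)[0, 1])
print(1 - np.corrcoef(a, c)[0, 1])
```

Under the Euclidean measure, a and c would cluster together; under the correlation measure, a and b would.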
Clustering analysis can be done on the basis of features where we try to find subgroups of
samples based on features or on the basis of samples where we try to find subgroups of features
based on samples. We cover here clustering based on features. Clustering is used in market
segmentation, where we try to find customers that are similar to each other in terms of
behaviors or attributes; image segmentation/compression, where we try to group similar regions
together; document clustering based on topics; etc.
Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to each other and dissimilar to the data
points in other groups. It is basically a collection of objects grouped on the basis of
similarity and dissimilarity between them.
Why Clustering ?
Clustering is important as it determines the intrinsic grouping among the
unlabeled data present. There are no universal criteria for a good clustering; it depends on the
user and the criteria that satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable groupings
(“useful” data classes) or in finding unusual data objects (outlier detection). A clustering algorithm
must make some assumptions about what constitutes the similarity of points, and each assumption
makes for different, equally valid clusters.
Clustering Methods :
• Density-Based Methods: These methods consider the clusters as the dense regions
having some similarity, different from the lower-density regions of the space. These
methods have good accuracy and the ability to merge two clusters. Example: DBSCAN.
• Hierarchical-Based Methods: The clusters formed in this method form a tree-type
structure based on the hierarchy. New clusters are formed using the previously formed
ones. It is divided into two categories:
o Agglomerative
o Divisive
• Partitioning Methods: These methods partition the objects into k clusters and each
partition forms one cluster. This method is used to optimize an objective criterion
similarity function, such as when distance is a major parameter.
• Grid-Based Methods: In this method the data space is formulated into a finite number
of cells that form a grid-like structure. All the clustering operations done on these grids are
fast and independent of the number of data objects.
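Three of the method families above can be tried side by side on the same toy data. A minimal sketch, assuming scikit-learn is available (the six points are invented so that two obvious groups exist):

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Two obvious blobs of points
X = np.array([[0, 0], [0, 1], [1, 0],
              [8, 8], [8, 9], [9, 8]], dtype=float)

# Density-based: clusters are dense regions; isolated points
# would be labelled -1 (noise)
db = DBSCAN(eps=1.5, min_samples=2).fit(X)

# Hierarchical (agglomerative): repeatedly merges the two
# closest clusters until only n_clusters remain
ag = AgglomerativeClustering(n_clusters=2).fit(X)

# Partitioning: k-means splits the data into k partitions
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for name, labels in [("DBSCAN", db.labels_),
                     ("Agglomerative", ag.labels_),
                     ("KMeans", km.labels_)]:
    print(name, labels)
```

On such well-separated data all three families agree; they differ on noisy, oddly-shaped, or overlapping data.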
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how to decide what constitutes a good clustering? It can be shown that there is no absolute “best”
criterion which would be independent of the final aim of the clustering. Consequently, it is the
user who must supply this criterion, in such a way that the result of the clustering will suit their
needs.
• Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;
• Biology: classification of plants and animals given their features;
• Libraries: book ordering;
• Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
• City-planning: identifying groups of houses according to their house type, value and
geographical location;
• Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
• WWW: document classification; clustering weblog data to discover groups of similar
access patterns.
Requirements
The main requirements that a clustering algorithm should satisfy are:
• scalability;
• dealing with different types of attributes;
• discovering clusters with arbitrary shape;
• minimal requirements for domain knowledge to determine input parameters;
• ability to deal with noise and outliers;
• insensitivity to order of input records;
• high dimensionality;
• interpretability and usability.
Problems
• current clustering techniques do not address all the requirements adequately (and
concurrently);
• dealing with a large number of dimensions and a large number of data items can be
problematic because of time complexity;
• the effectiveness of the method depends on the definition of “distance” (for distance-
based clustering);
• if an obvious distance measure doesn’t exist we must “define” it, which is not always
easy, especially in multi-dimensional spaces;
• the result of the clustering algorithm (that in many cases can be arbitrary itself) can be
interpreted in different ways.
Clustering algorithms may be classified as listed below:
• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering
In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a
definite cluster then it could not be included in another cluster. A simple example of that is
shown in the figure below, where the separation of points is achieved by a straight line.
On the contrary the second type, the overlapping clustering, uses fuzzy sets to cluster data,
so that each point may belong to two or more clusters with different degrees of membership. In
this case, data will be associated to an appropriate membership value.
A hierarchical clustering algorithm, instead, is based on the union between the two
nearest clusters. The initial condition is realized by setting every datum as a cluster; after a few
iterations, the desired final clusters are reached.
Finally, the last kind of clustering uses a completely probabilistic approach.
Four of the most commonly used clustering algorithms are:
• K-Means
• Fuzzy C-Means
• Hierarchical Clustering
• Mixture of Gaussians
Each of these algorithms belongs to one of the clustering types listed above: K-Means is an
exclusive clustering algorithm, Fuzzy C-Means is an overlapping clustering algorithm,
Hierarchical Clustering is obviously hierarchical, and lastly Mixture of Gaussians is a probabilistic
clustering algorithm. We discuss each clustering method in the following paragraphs.
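The exclusive-vs-probabilistic distinction is easy to see in code. A small sketch, assuming scikit-learn is installed (it provides K-Means and Mixture of Gaussians; Fuzzy C-Means is not in scikit-learn and would need a package such as scikit-fuzzy; the 1-D data below is invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0], [0.5], [1.0], [9.0], [9.5], [10.0]])

# Exclusive: every point gets exactly one hard cluster label
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Probabilistic: every point gets a membership probability per cluster
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)
print(probs.round(3))   # each row sums to 1
```

A point midway between the two groups would receive probabilities near 0.5/0.5 from the mixture model, while k-means would still force a single hard label on it.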
Distance Measure
Fig 3.2 Formation of clusters
Notice however that this is not only a graphic issue: the problem arises from the
mathematical formula used to combine the distances between the single components of the data
feature vectors into a unique distance measure that can be used for clustering purposes: different
formulas lead to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance measure
for each particular application.
Minkowski Metric
For higher dimensional data, a popular measure is the Minkowski metric:

d_p(x_i, x_j) = (Σ_{k=1..d} |x_{i,k} - x_{j,k}|^p)^(1/p)

where d is the dimensionality of the data. The Euclidean distance is a special case where p = 2,
while the Manhattan metric has p = 1. However, there are no general theoretical guidelines for
selecting a measure for any given application.
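For example, with SciPy the Minkowski metric at p = 1 and p = 2 reduces to the Manhattan and Euclidean distances respectively. A quick check, assuming SciPy is available (the two vectors are invented):

```python
import numpy as np
from scipy.spatial.distance import minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

d1 = minkowski(x, y, p=1)   # Manhattan: |3| + |4| + |0| = 7
d2 = minkowski(x, y, p=2)   # Euclidean: sqrt(9 + 16 + 0) = 5
print(d1, d2)
```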
It is often the case that the components of the data feature vectors are not immediately
comparable. It can be that the components are not continuous variables, like length, but nominal
categories, such as the days of the week. In these cases again, domain knowledge must be used to
formulate an appropriate measure.
One of the most important considerations regarding the ML model is assessing its
performance, or you can say the model’s quality. In the case of supervised learning algorithms,
evaluating the quality of our model is easy because we already have labels for every example.
On the other hand, in the case of unsupervised learning algorithms, we are not so
fortunate because we deal with unlabeled data. But still, we have some metrics that give the
practitioner insight into how the clusters change depending on the algorithm.
• The intra-cluster similarity is high (the data points present inside a cluster are similar to
one another)
• The inter-cluster similarity is low (each cluster holds information that isn't similar to the
others)
Before we dive deep into such metrics, we must understand that these metrics only evaluate
the comparative performance of models against each other rather than measuring the validity of the
model's prediction.
You still don't know which cluster is which class, and whether they make any sense at all. In this
case, you can validate your results by simply sampling from the clusters and looking at the quality
of classification. If the clusters are split reasonably, you can register a label for every cluster and
either label the whole dataset and train a supervised model, or continue to use the k-means
clusters, keeping the information about which cluster corresponds to which class.
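The "register a label for every cluster" step described above can be sketched as a majority vote over a manually inspected sample. A minimal illustration (the cluster ids, sample labels, and segment names are all invented):

```python
import numpy as np
from collections import Counter

# Suppose k-means produced these cluster ids for 8 customers,
# and we manually inspected a labeled sample of them
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2])
sampled_labels = {0: "budget", 1: "budget", 3: "premium",
                  4: "premium", 6: "occasional"}   # index -> label

# Register a label for each cluster by majority vote over the sample
cluster_to_label = {}
for cid in np.unique(cluster_ids):
    votes = [lab for i, lab in sampled_labels.items()
             if cluster_ids[i] == cid]
    if votes:
        cluster_to_label[cid] = Counter(votes).most_common(1)[0][0]

# Propagate: label the whole dataset from its cluster id
full_labels = [cluster_to_label.get(c) for c in cluster_ids]
print(full_labels)
```

The resulting labels could then be used directly, or as training data for a supervised model.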
Applications of Clustering
Creating NewsFeeds: K-Means can be used to cluster articles by their similarity — it can
separate documents into disjoint clusters.
Pattern Recognition in images: For example, to automatically detect infected fruits or for
segmentation of blood cells for leukaemia detection.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their understanding of what constitutes
a cluster and how to efficiently find them. Popular notions of clusters include groups with small
distances between cluster members, dense areas of the data space, intervals or particular statistical
distributions. Clustering can therefore be formulated as a multi-objective optimization problem.
The appropriate clustering algorithm and parameter settings depend on the individual data set and
the intended use of the results.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-
defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one
group.
It tries to make the intra-cluster data points as similar as possible while also keeping the
clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the
squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data
points that belong to that cluster) is at the minimum. The less variation we have within clusters, the
more homogeneous (similar) the data points are within the same cluster.
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to
clusters isn't changing:
• Compute the sum of the squared distance between data points and all centroids.
• Assign each data point to the closest cluster (centroid).
• Compute the centroids for the clusters by taking the average of all data points that belong
to each cluster.
The approach k-means follows to solve the problem is called Expectation-Maximization. The E-
step is assigning the data points to the closest cluster. The M-step is computing the centroid of each
cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).
The objective function is:

J = Σ_i Σ_k w_ik ||x_i - μ_k||^2

where w_ik = 1 for data point x_i if it belongs to cluster k; otherwise, w_ik = 0. Also, μ_k is the
centroid of x_i's cluster.

It is a minimization problem of two parts. We first minimize J w.r.t. w_ik and treat μ_k as fixed.
Then we minimize J w.r.t. μ_k and treat w_ik as fixed. Technically speaking, we differentiate J
w.r.t. w_ik first and update the cluster assignments (E-step). Then we differentiate J w.r.t. μ_k and
recompute the centroids after the cluster assignments from the previous step (M-step). Therefore,
the E-step is:

w_ik = 1 if k = argmin_j ||x_i - μ_j||^2, and w_ik = 0 otherwise.

In other words, assign the data point x_i to the closest cluster as judged by its sum of squared
distance from the cluster's centroid. The M-step is:

μ_k = (Σ_i w_ik x_i) / (Σ_i w_ik)

which translates to recomputing the centroid of each cluster to reflect the new assignments.
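The E-step/M-step loop can be written out directly in NumPy. This is a bare-bones sketch for illustration, without the refinements (restarts, empty-cluster handling, smarter initialization) of library implementations, and the four test points are invented:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k random data points (no replacement)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break   # converged: assignments no longer change
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, cents = kmeans(X, k=2)
print(labels)
```

Each pass alternates exactly the two updates derived above: assignments with centroids held fixed, then centroids with assignments held fixed.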
There are countless examples of where this automated grouping of data can be extremely
useful. For example, consider the case of creating an online advertising campaign for a brand new
range of products being released to the market. While we could display a single generic
advertisement to the entire population, a far better approach would be to divide the population
into clusters of people who hold shared characteristics and interests displaying customised
advertisements to each group. K-means is an algorithm that finds these groupings in big datasets
where it is not feasible to be done by hand.
The intuition behind the algorithm is actually pretty straightforward. To begin, we choose a
value for k (the number of clusters) and randomly choose an initial centroid (centre coordinates)
for each cluster. We then apply a two-step process:
1. Assignment step — assign each observation to its nearest centre.
2. Update step — update each centroid to be the centre of its respective observations.
We repeat these two steps over and over until there is no further change in the clusters. At
this point the algorithm has converged and we may retrieve our final clusterings.
One final key aspect of k-means returns to this concept of convergence. We previously
mentioned that the k-means algorithm doesn't necessarily converge to the global minimum and
instead may converge to a local minimum (i.e. k-means is not guaranteed to find the best solution).
In fact, depending on which values we choose for our initial centroids, we may obtain differing
results.
As we are only interested in the best clustering solution for a given choice of k, a common
solution to this problem is to run k-means multiple times, each time with different randomised
initial centroids, and use only the best solution. In other words, always run k-means multiple
times to ensure we find a solution close to the global minimum.
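In scikit-learn this restarting behaviour is built in: the n_init parameter reruns k-means with different centroid seeds and keeps the run with the lowest SSE (exposed as inertia_). A minimal example, assuming scikit-learn is available and using synthetic customer-like data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic customer blobs (e.g. income vs. spending score)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

# n_init=10: run 10 random initialisations, keep the best (lowest SSE)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # sum of squared distances to closest centroid
```

With n_init > 1 the chance of reporting a poor local minimum drops sharply, at the cost of proportionally more computation.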
Advantages
3) Gives best result when data set are distinct or well separated from each other.
Disadvantages
1) The learning algorithm requires apriori specification of the number of cluster centers.
2) The use of Exclusive Assignment - If there are two highly overlapping data then k-means
will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations, i.e. with different
representations of the data we get different results.
5) The learning algorithm provides the local optima of the squared error function.
6) Randomly choosing the cluster centers may not lead to a fruitful result (please refer to the figure).
7) Applicable only when mean is defined i.e. fails for categorical data.
The main drawback of this technique is the ambiguity about the number of clusters K that
should be initialized. To overcome this issue, the performance of the algorithm is calculated for
different numbers of centroids.
Conclusion
K-means is one of the most common and intuitive clustering algorithms in Machine
Learning. The name ‘k-means’ almost explains the theory itself.
2. The mean of the corresponding features of the nearest data points is calculated and set as the
new coordinate of the pre-initialized centroid.
EVALUATION METHODS:
Contrary to supervised learning where we have the ground truth to evaluate the model’s
performance, clustering analysis doesn’t have a solid evaluation metric that we can use to evaluate
the outcomes of different clustering algorithms. Moreover, since k-means requires k as an input and
doesn't learn it from the data, there is no right answer in terms of the number of clusters that we should
have in any problem. Sometimes domain knowledge and intuition may help, but usually that is not
the case. In the cluster-predict methodology, we can evaluate how well the models are performing
based on different k clusters, since the clusters are used in the downstream modeling.
• Elbow method
• Quick method
ELBOW METHOD:
The elbow method gives us an idea of what a good number of clusters k would be, based on the
sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick
k at the spot where the SSE starts to flatten out, forming an elbow. We'll use the geyser dataset,
evaluate the SSE for different values of k, and see where the curve might form an elbow and flatten out.
Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then
the "elbow" on the arm is the value of k that is the best. The idea is that we want a small SSE, but
that the SSE tends to decrease toward 0 as we increase k (the SSE is 0 when k is equal to the number
of data points in the dataset, because then each data point is its own cluster, and there is no error
between it and the center of its cluster). So our goal is to choose a small value of k that still has a
low SSE, and the elbow usually represents where we start to have diminishing returns by increasing
k.
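The procedure can be sketched as follows (using synthetic blobs from scikit-learn as a stand-in for the geyser data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (a stand-in for a real dataset)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8]],
                  cluster_std=0.6, random_state=0)

# Compute SSE (inertia) for a range of k values
sse = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

for k, v in sse.items():
    print(k, round(v, 1))  # SSE drops sharply until k=3, then flattens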
Fig 3.4 Example of the Elbow method
The graph above shows that k=2 is not a bad choice. Sometimes it's still hard to figure out a good number of clusters to use, because the curve is monotonically decreasing and may not show any elbow or obvious point where it starts flattening out.
Quick Method
The same functionality above can be achieved with the associated quick method
kelbow_visualizer. This method will build the KElbowVisualizer object with the associated
arguments, fit it, then (optionally) immediately show the visualization.
The K-Elbow Visualizer implements the “elbow” method of selecting the optimal number
of clusters for K-means clustering. K-means is a simple unsupervised machine learning algorithm
that groups data into a specified number (k) of clusters. Because the user must specify in advance
what k to choose, the algorithm is somewhat naive – it assigns all members to k clusters even if
that is not the right k for the dataset.
The elbow method runs k-means clustering on the dataset for a range of values of k, and then for each value of k computes an average score for all clusters. By default, the distortion score is computed: the sum of squared distances from each point to its assigned center. Other metrics can also be used, such as the silhouette score (the mean silhouette coefficient for all samples) or the calinski_harabasz score, which computes the ratio of dispersion between and within clusters.
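Both alternative metrics are also available directly in scikit-learn, so the scores can be computed without the visualizer; a minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Four well-separated synthetic groups (a stand-in for real customer data)
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.7, random_state=42)

sil, ch = {}, {}
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil[k] = silhouette_score(X, labels)        # in [-1, 1], higher is better
    ch[k] = calinski_harabasz_score(X, labels)  # higher is better
    print(k, round(sil[k], 3), round(ch[k], 1))
```

For this data both scores peak at k=4, the true number of generated groups.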
When these overall metrics for each model are plotted, it is possible to visually
determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of
inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there
is a strong inflection point, it is a good indication that the underlying model fits best at that
point.
Parameters

model : a scikit-learn clusterer

ax : matplotlib Axes, default: None
The axes to plot the figure on. If None is passed in, the current axes will be used (or generated if required).

k : integer, tuple, or iterable
The k values to compute silhouette scores for. If a single integer is specified, the range (2, k) will be computed. If a tuple of 2 integers is specified, then k will be in np.arange(k[0], k[1]). Otherwise, specify an iterable of integers to use as values for k.
metric : string, default: "distortion"
Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. Other metrics include:
• distortion: mean sum of squared distances to centers
• silhouette: mean ratio of intra-cluster and nearest-cluster distance
• calinski_harabasz: ratio of within- to between-cluster dispersion
timings : bool, default: True
Display the fitting time per k to evaluate the amount of time required to train the clustering model.
locate_elbow : bool, default: True
Automatically find the "elbow" or "knee" which likely corresponds to the optimal value of k, using the "knee point detection algorithm". The knee point detection algorithm finds the point of maximum curvature, which in a well-behaved clustering problem also represents the pivot of the elbow curve. The point is labeled with a dashed line and annotated with the score and k values.
kwargs : dict
Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.
4.SYSTEM DESIGN
SYSTEM ARCHITECTURE
Fig 5.1 System architecture of a machine learning algorithm and how it flows
The machine learning architecture defines the various layers involved in the machine learning cycle and covers the major steps carried out in the transformation of raw data into training data sets capable of enabling the decision making of a system.
DATAFLOW DIAGRAM
Fig 5.2 Dataflow diagram of how a machine learning algorithm works
CLUSTERING ALGORITHM
The algorithm splits a given dataset into k clusters on the basis of distance. It considers each data point and checks the proximity of that data point to all the cluster centers. k-means then allocates the data point to the cluster whose cluster center (centroid) is closest to it.
5. IMPLEMENTATION
NO.OF MODULES
• Administrator
• Customer
MODULE DESCRIPTION
INPUTS:
• Customer gets his targeted expenditure score
OUTPUTS:
CODING
The dataset consists of the annual income of customers and their total expenditure score (in $) for a period of one year. This dataset is taken from Kaggle, which hosts various types of datasets. Let us explore the data using the numpy and pandas libraries in Python.
This dataset contains the basic information (ID, age, gender, income, spending score) about the customers.
5.3 CODE
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans  # needed for the clustering code below
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
#Loading Data
data=pd.read_csv('Mall_Customers.csv')
data.head()
data.describe()
sns.countplot(x='Gender',data=data);
plt.title('Distribution of Gender');
data.hist('Age',bins=35);
plt.title('Distribution of Age');
plt.xlabel('Age');
data.hist('Annual Income (k$)')
plt.xlabel('Thousands of Dollars');
male_customers = data[data['Gender'] == 'Male']
female_customers = data[data['Gender'] == 'Female']
plt.hist('Annual Income (k$)', data=male_customers, alpha=0.5, label='Male')
plt.hist('Annual Income (k$)', data=female_customers, alpha=0.5, label='Female')
plt.xlabel('Income (Thousands of Dollars)');
plt.legend();
sns.pairplot(data)
plt.show()
# Clustering algorithm: elbow method to choose k
x = data.iloc[:, [3, 4]].values  # annual income and spending score columns
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(x)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss)
plt.xlabel('No. of Clusters')
plt.ylabel('WCSS')
plt.show()
# K-Means algorithm with k = 5 chosen from the elbow curve above
km = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km.fit_predict(x)
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s=100, c='pink', label='miser')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s=100, c='yellow', label='average')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s=100, c='green', label='buyer')
plt.scatter(x[y_means == 3, 0], x[y_means == 3, 1], s=100, c='red', label='spender')
plt.scatter(x[y_means == 4, 0], x[y_means == 4, 1], s=100, c='black', label='target')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=50, c='blue', label='centroid')
plt.title('K-Means', fontsize=20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
6.TESTING
INTRODUCTION TO TESTING
Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. The increasing visibility of software as a system element and the attendant costs associated with a software failure are motivating factors for well-planned, thorough testing. Testing is the process of executing a program with the intent of finding an error. The design of tests for software and other engineered products can be as challenging as the initial design of the product itself.
One is Black-Box testing – knowing the specified function that a product has been designed to perform, tests can be conducted that demonstrate each function is fully operational.
The other is White-Box testing – knowing the internal workings of the product, tests can be conducted to ensure that the internal operation of the product performs according to specifications and all internal components have been adequately exercised.
White box and Black box testing methods have been used to test this package. The entire
loop constructs have been tested for their boundary and intermediate conditions. The test data was
designed with a view to check for all the conditions and logical decisions. Error handling has been
taken care of by the use of exception handlers.
Testing is a set of activities that can be planned in advance and conducted systematically. A strategy for software testing must accommodate low-level tests that are necessary to verify that a small source code segment has been correctly implemented, as well as high-level tests that validate major system functions against customer requirements.
Software testing is one element of verification and validation. Verification refers to the set of activities that ensure that software correctly implements a specific function. Validation refers to a different set of activities that ensure that the software that has been built is traceable to customer requirements.
The main objective of software testing is to uncover errors. To fulfill this objective, a series of test steps (unit, integration, validation and system tests) are planned and executed. Each test step is accomplished through a series of systematic test techniques that assist in the design of test cases. With each testing step, the level of abstraction with which software is considered is broadened. Testing is the only way to assure the quality of software, and it is an umbrella activity rather than a separate phase. It is an activity to be performed in parallel with the software effort, one that consists of its own phases of analysis, design, implementation, execution and maintenance.
UNIT TESTING:
This testing method considers a module as a single unit and checks how the unit interfaces and communicates with other modules, rather than getting into details at the statement level. Here the module is treated as a black box, which takes some input and generates output. Outputs for a given set of input combinations are pre-calculated and compared with those generated by the module.
SYSTEM TESTING:
Here all the pre-tested individual modules are assembled to create the larger system, and tests are carried out at the system level to make sure that all modules work in sync with each other. This testing methodology helps in making sure that all modules which run perfectly when checked individually also run in cohesion with the other modules. For this testing we create test cases to check all modules once, and then generate combinations of test paths throughout the system to make sure that no path leads to failure.
INTEGRATED TESTING:
Testing is a major quality control measure employed during software development. Its basic function is to detect errors. Sub-functions, when combined, may not produce the desired result, and global data structures can present problems. Integration testing is a systematic technique for constructing the program structure while conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested modules and build a program structure that has been dictated by design. In non-incremental integration all the modules are combined in advance and the program is tested as a whole; errors that appear are then hard to isolate. In incremental testing the program is constructed and tested in small segments, where errors are more easily isolated and corrected. Different incremental integration strategies are top-down integration, bottom-up integration and regression testing.
In top-down integration, modules are integrated by moving downward through the control hierarchy, beginning with the main program. The subordinate modules are incorporated into the structure in either a breadth-first or depth-first manner. This process is done in five steps:
• The main control module is used as a test driver, and stubs are substituted for all modules directly subordinate to the main program.
• Depending on the integration approach selected, subordinate stubs are replaced one at a time with actual modules.
• Tests are conducted as each module is integrated.
• On completion of each set of tests, another stub is replaced with the real module.
• Regression testing may be conducted to ensure that new errors have not been introduced.
This process continues from step 2 until the entire program structure is built. In the top-down integration strategy, decision making occurs at upper levels in the hierarchy and is therefore encountered first. If major control problems do exist, early recognition is essential. If depth-first integration is selected, a complete function of the software may be implemented and demonstrated.
Some problems occur when processing at low levels in the hierarchy is required to adequately test the upper levels, since stubs replace the low-level modules at the beginning of top-down testing and no significant data flows upward in the program structure.
Bottom-up integration begins construction and testing with atomic modules. As modules are integrated from the bottom up, the processing required for modules subordinate to a given level is always available, and the need for stubs is eliminated. The following steps implement this strategy:
• Low-level modules are combined into clusters that perform a specific software sub-function.
• A driver is written to coordinate test case input and output.
• Cluster is tested.
• Drivers are removed and moving upward in program structure combines clusters.
As integration moves upward, the need for separate test drivers lessens. If the top levels of the program structure are integrated top-down, the number of drivers can be reduced substantially and the integration of clusters is greatly simplified.
REGRESSION TESTING:
Each time a new module is added as part of integration, the software changes. Regression testing is an activity that helps to ensure that these changes do not introduce unintended behavior or additional errors.
Regression testing may be conducted manually, by re-executing a subset of all test cases, or by using automated capture-playback tools that enable the software engineer to capture test cases and results for subsequent playback and comparison. The regression suite contains different classes of test cases:
A representative sample of tests that will exercise all software functions.
Additional tests that focus on software functions that are likely to be affected by the change.
6.2 NON FUNCTIONAL TESTING:
Non-functional testing is the testing of a software application or system for its non-
functional requirements: the way a system operates, rather than specific behaviours of that
system. This is in contrast to functional testing, which tests against functional requirements that
describe the functions of a system and its components. The names of many non-functional tests
are often used interchangeably because of the overlap in scope between various non-functional
requirements. For example, software performance is a broad term that includes many specific
requirements like reliability and scalability.
• Smoke testing
• Sanity testing
• Regression testing
TESTCASE 1
Fig.6.1 Testcase 1
The pandas library has not been imported in the code shown above; therefore an error occurs. The pandas library is used for data manipulation and analysis.
TESTCASE 2
In this case, the km.fit() command is missing; it must be run before the model's inertia can be computed. Inertia is the sum of squared distances from each data point to the centroid of the cluster it belongs to.
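This relationship can be checked directly with scikit-learn (the points below are made-up values for illustration): after fit() is called, `inertia_` equals the within-cluster sum of squared distances computed by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tiny, obvious clusters (hypothetical points)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # fit must run before inertia_

# inertia_ = sum of squared distances from each point to its assigned centroid
manual = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
             for k in range(2))
print(km.inertia_, manual)  # the two values agree
```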
6.5 GOALS OF TESTING:
• The main goal of the project is to obtain an accurate, error-free model.
• The purpose of this testing is to train the model such that, when a new data point is added, there is no difficulty in the process of execution.
• The value of k should be predicted correctly when the elbow method is in use.
• The model is trained in a manner that allows the dataset to be changed as needed, by exposing the system to different samples of a population.
7.OUTPUT SCREENS
Step 1: Import packages and libraries
Fig 7.1 describes the packages that are used in our project. numpy is used for arithmetic operations, pandas for loading the dataset, seaborn for styling the graphs, matplotlib for plotting the graphs and sklearn for the different algorithms.
Step 2: collect the dataset
Step 3: Describe the dataset
The figure above describes the number of male and female members in the given dataset.
Step 5: Plotting Histograms
Step 6:Plotting Scatterplots
Figure 7.6.2: Age to Spending Score by Female scatterplot
This figure shows the spending score of female customers according to their age intervals in the form of a scatterplot.
Step 7:Plotting pairplots
Figure 7.7:Pairplot
This figure shows a pairplot that compares 2 different features at a time.
Step 8: Using the Elbow curve method
As the K-Means algorithm requires the number of clusters as input, below we use the elbow method to get the optimal number of clusters that can be formed [32]. It works on the principle that after a certain number 'K' of clusters, the difference in SSE (Sum of Squared Errors) starts to decrease and diminishes gradually. Here, the WCSS (Within-Cluster Sum of Squared errors) metric is used as an indicator of the same. Hence, the 'K' value specifies the number of clusters. In the figure, it can be observed that an elbow point occurs at K=5. After K=5, the difference in WCSS is not so visible. Hence, we choose to have 5 clusters and provide the same as input to the K-Means algorithm.
Step 9: Using K-Means Clustering (final output)
We can see that the mall customers can be broadly grouped into 5 groups based on their annual income and spending score.
In cluster 4 (yellow colored) we can see people with low annual income and low spending scores. This is quite reasonable, as people having low salaries prefer to buy less; in fact, these are the wise people who know how to spend and save money. The shops/mall will be least interested in these people.
In cluster 2 (blue colored) we can see that people have low income but higher spending scores. These are the people who, for some reason, love to buy products often even though they have a low income. Maybe it's because these people are more than satisfied with the mall services. The shops/malls might not target these people that effectively, but still will not lose them.
In cluster 5(pink colored) we see that people have average income and an average spending
score, these people again will not be the prime targets of the shops or mall, but again they will be
considered and other data analysis techniques may be used to increase their spending score.
In cluster 1(red-colored) we see that people have high income and high spending scores, this
is the ideal case for the mall or shops as these people are the prime sources of profit. These people
might be the regular customers of the mall and are convinced by the mall’s facilities.
In cluster 3(green colored) we see that people have high income but low spending scores,
this is interesting. Maybe these are the people who are unsatisfied or unhappy by the mall’s services.
These can be the prime targets of the mall, as they have the potential to spend money. So, the mall
authorities will try to add new facilities so that they can attract these people and can meet their
needs.
Finally, based on our machine learning technique we may deduce that to increase the profits
of the mall, the mall authorities should target people belonging to cluster 3 and cluster 5 and should
also maintain its standards to keep the people belonging to cluster 1 and cluster 2 happy and satisfied.
To conclude, it is amazing to see how machine learning can be used in solving such real-world business problems.
8.CONCLUSION
Our project classifies various customers into different clusters so that different marketing strategies can be employed for different clusters to attain maximum profit. Due to increasing
commercialization, consumer data is increasing exponentially. When dealing with this large
magnitude of data, organizations need to make use of more efficient clustering algorithms for
customer segmentation. These clustering models need to possess the capability to process this
enormous data effectively. Each of the above-discussed clustering algorithms comes with its own
set of merits and demerits. The computational speed of the K-Means clustering algorithm is relatively
better compared to hierarchical clustering algorithms, as the latter require the calculation of
the full proximity matrix after each iteration. K-Means clustering gives better performance for a
large number of observations while hierarchical clustering has the ability to handle fewer data
points.
The major hindrance presents itself in the form of selecting the number of clusters 'K' for the K-Means process, which has to be provided as an input to this non-hierarchical clustering algorithm. This limitation does not exist in the case of hierarchical clustering, since it does not require any cluster centers as input; it is up to the user to choose the cluster groups as well as their number. Hierarchical clustering also gives better results as compared to K-Means when a random dataset is used. The results obtained when using hierarchical clustering are in the form of dendrograms, while the output of K-Means consists of flat-structured clusters, which may be more difficult to analyze. As the value of k increases, the quality (accuracy) of hierarchical clustering improves when compared to K-Means clustering. As such, partitioning algorithms like K-Means are suitable for large datasets, while hierarchical clustering algorithms are more suitable for small datasets.
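The speed difference can be observed by timing scikit-learn's KMeans against AgglomerativeClustering (used here as the hierarchical method) on the same synthetic data; the exact timings will vary by machine:

```python
import time
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# 2000 synthetic points in 5 groups
X, _ = make_blobs(n_samples=2000, centers=5, cluster_std=1.0, random_state=0)

t0 = time.perf_counter()
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
km_time = time.perf_counter() - t0

t0 = time.perf_counter()
hc_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)  # needs pairwise proximities
hc_time = time.perf_counter() - t0

print(f"k-means: {km_time:.3f}s, hierarchical: {hc_time:.3f}s")
```

Hierarchical clustering's cost grows quickly with the number of observations because of the pairwise proximity computation, which is why it suits smaller datasets.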
9.FUTURE ENHANCEMENT
We have done this project with as few flaws as possible; it can further be enhanced by identifying broader statistics of people and improving the accuracy of the output.
All the census data also can be collected to train the dataset even more to get more accurate outputs.
Based on the social data accumulated, we can conclude that mall customer segmentation system can
be used in a wide range of applications across a variety of domains including:
• identifying interests of people while they are buying items from a mall
• Identifying many more clusters to segment the products to improve the sales of
the product
In this project we have implemented the k-means algorithm; it can be further enhanced by using more complex algorithms such as neural networks.
Unlike some other applications (e.g., Amazon), this approach does not require any invasion of users' privacy.
10.BIBLIOGRAPHY
[1] E. Ngai, L. Xiu and D. Chau, “Application of data mining techniques in customer relationship
management: A literature review and classification”, Expert Systems with Applications, vol. 36,
no. 2, pp. 2592-2602, 2009.
[2] J. Peppard, “Customer Relationship Management (CRM) in financial services”, European
Management Journal, vol. 18, no. 3, pp. 312-327, 2000.
[3] A. Ansari and A. Riasi, “Taxonomy of marketing strategies using bank customers clustering”,
International Journal of Business and Management, vol. 11, no. 7, pp. 106-119, 2016.
[4] M. Ghzanfari, et al., “Customer segmentation in clothing exports based on clustering
algorithm”, Iranian Journal of Trade Studies, vol. 14, no. 56, pp. 59-86, 2010.
[5] C. Rygielski, J. Wang and D. Yen, “Data mining techniques for customer relationship
management”, Technology in Society, vol. 24, no. 4, pp. 483-502, 2002.
[6] J. Lee and S. Park, “Intelligent profitable customers segmentation system based on business
intelligence tools”, Expert Systems with Applications, vol. 29, no. 1, pp. 145-152, 2005.
[7] D. A. Kandeil, A. A. Saad and S. M. Youssef, “A two-phase clustering analysis for B2B
customer segmentation”, in International Conference on Intelligent Networking and Collaborative
Systems, Salerno, 2014, pp. 221-228.
[8] R. Swift, Accelerating Customer Relationships: Using CRM and Relationship Technologies,
1st ed. Upper Saddle River, N.J.: Prentice Hall PTR, 2000.
[9] J. Aaker, A. Brumbaugh and S. Grier, “Nontarget Markets and Viewer Distinctiveness: The
Impact of Target Marketing on Advertising Attitudes”, Journal of Consumer Psychology, vol. 9,
no. 3, pp. 127-140, 2000.
[10] T. Kanungo, et al., “An efficient k-means clustering algorithm: analysis and implementation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, 2002.
[11] Y. Chen, et al., “Identifying patients in target customer segments using a two-stage
clustering-classification approach: A hospitalbased assessment”, Computers in Biology and
Medicine, vol. 42, no. 2, pp. 213-221, 2012.
[12] G. Lefait and T. Kechadi, “Customer segmentation architecture based on clustering
techniques”, in Fourth International Conference on Digital Society, Sint Maarten, 2010, pp. 243-
248.
[13] M. Namvar, M. Gholamian and S. KhakAbi, “A two-phase clustering method for intelligent
customer segmentation”, in International Conference on Intelligent Systems, Modelling and
Simulation, Liverpool, 2010, pp. 215-219.
[14] J. MacQueen, “Some methods for classification and analysis of multivariate observations”, in Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, 1967, pp. 281-297.
[15] E. Rendon, et al., “A comparison of internal and external cluster validation indexes”, in American Conference on Applied Mathematics and The Fifth WSEAS International Conference on Computer Engineering and Applications, Puerto Morelos, 2011, pp. 158-163.
[16] H. Gucdemir and H. Selim, “Integrating multi-criteria decision making and clustering for business customer segmentation”, Industrial Management & Data Systems, vol. 115, no. 6, pp. 1022-1040, 2015.
55