
CUSTOMER SEGMENTATION ANALYSIS FOR IMPROVING SALES

USING CLUSTERING

A Project report submitted in


Partial fulfillment of the requirement for the award of the Degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

SUBMITTED
By
A. SUMA SRI 16671A0561
K.S. UDIT 16671A0586
A. NIYATHI 16671A0598
N. SAI TEJA 15671A05A0

Under the esteemed guidance of


Dr. P. SRINIVASA RAO
PROFESSOR

Department of Computer Science and Engineering


J.B. Institute of Engineering & Technology
(UGC AUTONOMOUS)
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad)

Yenkapally, Moinabad mandal, R.R. Dist-75 (TG)


2016-2020

J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
(UGC AUTONOMOUS)
(Accredited by NAAC, Permanently Affiliated to JNTUH)
Yenkapally, Moinabad Mandal, R.R. Dist. -500 075
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project report entitled “CUSTOMER SEGMENTATION


ANALYSIS FOR IMPROVING SALES USING CLUSTERING”, submitted to the
Department of Computer Science & Engineering, J.B. Institute of Engineering and Technology, in
accordance with Jawaharlal Nehru Technological University regulations in partial fulfillment of the
requirements for successful completion of the Bachelor of Technology degree, is a record of bonafide
work carried out during the academic year 2019-20 by,

A. SUMA SRI 16671A0561


K.S. UDIT 16671A0586
A. NIYATHI 16671A0598
N. SAI TEJA 15671A05A0

Internal Guide Head of the Department


Dr. P. SRINIVASA RAO Dr. P. SRINIVASA RAO
PROFESSOR PROFESSOR

External Examiner

J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
(UGC Autonomous)
(Accredited by NAAC Permanently Affiliated to JNTUH)
Yenkapally, Moinabad Mandal, R.R. Dist.-500 075
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DECLARATION

We hereby certify that the Main Project report entitled “CUSTOMER


SEGMENTATION ANALYSIS FOR IMPROVING SALES USING
CLUSTERING”, carried out under the guidance of Dr. P. SRINIVASA RAO,
Professor in Computer Science and Engineering, is submitted in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science and
Engineering. This is a record of bonafide work carried out by us, and the results embodied in this
project report have not been reproduced or copied from any source. The results embodied in this
project report have not been submitted to any other university or institute for the award of any
other degree or diploma.

Date: 21/05/2020
A. SUMA SRI 16671A0561
K.S. UDIT 16671A0586
A. NIYATHI 16671A0598
N. SAI TEJA 15671A05A0

ACKNOWLEDGEMENT
At the outset we express our gratitude to the almighty lord for showering his grace and blessings
upon us to complete this Main Project. Although our names appear on the cover of this book, many
people have contributed in some form or other to this project's development. We could not have
done this project without the assistance and support of each of the following.
First of all, we are highly indebted to Dr. S. SUDHAKARA REDDY, Principal, for giving
us permission to carry out this Main Project.
We would like to thank Dr. P. SRINIVASA RAO, Professor & Head of the Department
of COMPUTER SCIENCE AND ENGINEERING, for his moral support throughout the period
of study in the Department.
We would like to thank the Teaching and Non-Teaching Staff of the Department of Computer
Science & Engineering for sharing their knowledge with us.

A. SUMA SRI 16671A0561


K.S. UDIT 16671A0586
A. NIYATHI 16671A0598
N. SAI TEJA 15671A05A0

ABSTRACT

Customer segmentation is the practice of dividing a company’s customers into groups that
reflect similarity among customers in each group. The goal of segmenting customers is to decide
how to relate to customers in each segment in order to maximize the value of each customer to the
business. Customer segmentation has the potential to allow marketers to address each customer in
the most effective way. Using the large amount of data available on customers and potential
customers, a customer segmentation analysis allows marketers to identify discrete groups of
customers with a high degree of accuracy based on demographic, behavioral and other indicators.
To scale efficiently and effectively, expansion stage companies need to focus their efforts not on
a broad universe of potential customers, but rather on a specific subset of customers who are most
similar to their best current customers.

The key to doing so is customer segmentation. The segmentation is based on
customers having similar ‘needs’ (so that a single whole product can satisfy them) and ‘buying
characteristics’ (responses to messaging, marketing channels, and sales channels, so that a single
go-to-market approach can be used to sell to them competitively and economically). In this project
we will explore a data set on customers to try to see if there are any discernible segments and
patterns. Customer segmentation is useful in understanding what demographic and psychographic
sub-populations exist within a business's customers. By understanding this, we can
better understand how to market to and serve them. This project uses packages such as NumPy,
Pandas, Seaborn and Matplotlib, and tools such as Jupyter Notebook, to analyze and assign each
individual customer to their respective segment based on three important attributes:
‘gender’, ‘income’ and ‘expenditure’.

TABLE OF CONTENTS

1. INTRODUCTION 1
2. LITERATURE SURVEY 3
3. SYSTEM ANALYSIS 5
   3.1 Aim 5
   3.2 Existing System 5
   3.3 Proposed System 5
   3.4 Software Requirements 6
4. SYSTEM DESIGN 28
5. IMPLEMENTATION 30
   5.1 Module Description 30
   5.2 Dataset Taken 31
   5.3 Code 33
6. TESTING 36
   6.1 Testing Strategies 36
   6.2 Functional Testing 39
   6.3 Non-functional Testing 39
   6.4 Test Cases 40
   6.5 Goals of Testing 42
7. OUTPUT SCREENS 43
8. CONCLUSION 52
9. FUTURE ENHANCEMENT 53
10. BIBLIOGRAPHY 54

1. INTRODUCTION

In the contemporary day and age, the importance of treating customers as the principal asset
of an organization is increasing in value. Organizations are rapidly investing in developing
strategies for better customer acquisition, maintenance and development. The concept of business
intelligence has a crucial role to play in making it possible for organizations to use technical
expertise for acquiring better customer insight for outreach programs. In this scenario, the concept
of CRM garners much attention since it is a comprehensive process of acquiring and retaining
customers, using business intelligence, to maximize the customer value for a business enterprise.
One of the two most important objectives of CRM is customer development through
customer insight. This objective of CRM entails the usage of an analytical approach in order to
correctly assess customer information and analysis of the value of customers for better customer
insight. Keeping up with the changing times, organizations are modifying their business flow
models by employing systems engineering as well as change management, and designing
information technology (IT) solutions that aid them in acquiring new customers, help retain the
present customer base and boost customers' lifetime value.
Due to the diverse range of products and services available in the market as well as the
intense competition among organizations, customer relationship management has come to play a
significant role in the identification and analysis of a company’s best customers and the adoption
of the best marketing strategies to achieve and sustain competitive advantage. One of the most useful
techniques in business analytics for the analysis of consumer behavior and categorization is
customer segmentation. By using clustering techniques, customers with similar means, ends and
behavior are grouped together into homogeneous clusters.
Customer Segmentation helps organizations in identifying or revealing distinct groups of
customers who think and function differently and follow varied approaches in their spending and
purchasing habits. Clustering techniques reveal internally homogeneous and externally
heterogeneous groups. Customers vary in terms of behavior, needs, wants and characteristics, and
the main goal of clustering techniques is to identify different customer types and segment
the customer base into clusters of similar profiles so that the process of target marketing can be
executed more efficiently.

This study aims to explore the avenues of using customer segmentation, as a business
intelligence tool within the CRM framework as well as the use of clustering techniques for helping
organizations redeem a clearer picture of the valuable customer base. The concepts of customer
relationship management, customer segmentation as a core function of CRM as well as the approach
of segmenting customers using clustering techniques are discussed.

The available clustering models for business analysis in the context of customer
segmentation, the advantages and disadvantages of the two main models chosen for our study-
KMeans and Hierarchical Clustering, as well as the possibility of developing a hybrid model which
can outperform the individual models is surveyed.

2. LITERATURE SURVEY

Research dealing with shopping malls’ and/or hypermarkets’ attributes, especially in the
Indian context, is scarce. Not many studies have empirically analyzed the influence of
an assortment of attributes on buying behaviour in shopping arcades and malls and on customers’
shopping experiences. Most of the research undertaken so far draws on foreign experience, as
organized retail has come of age in the US, UK and European markets. An earnest attempt
has been made to delve into the relevant research done on the theme, presented as
follows:

Brunner and Mason (1968) investigated the importance of driving time upon the
preferences of consumers towards regional shopping centers. They expressed that although it is
recognized that population, purchasing power, population density, newspaper circulation, and other
factors are influential in determining the shopping habits of consumers, a factor which is generally
overlooked is the driving time required to reach the center. In this study, it was established that the
driving time required to reach a center is highly influential in determining consumer shopping center
preferences. The most consistent and significant driving time dimension in delineating shopping
center trade areas was found at the 15-minute driving points, as three-fourths of each center’s
shoppers resided within this range.

Huff (1964 and 1966) concluded that the comparative size of the centers and the
convenience of access were the primary characteristics that consumers sought when choosing a
shopping center to visit.

Cox and Cooke (1970) determined customer preference for shopping centers and the
importance of driving time. The authors concluded that location and attractiveness are important
determinants of consumer shopping center preferences.

Mehrabian and Russell (1974) noted that the response that store atmosphere elicits from
consumers, varies along three dimensions of pleasantness, arousal and dominance.

Bellenger et al. (1977) found that some consumers placed the greatest value on convenience
and economic attributes, including convenience to home, accessibility, and the
presence of services such as banks and restaurants. Others, however, emphasized recreational
attributes including atmosphere, fashionability, variety of stores and merchandise.

Vaughn and Hansotia (1977) opined that merchandise and convenience seem to be the two
underlying dimensions which consistently appear every time. Merchandise quality, merchandise
variety, atmosphere of the shopping area, availability of sale items and ease of shopping comparisons
are all component parts of this underlying dimension.

McCarthy (1980) attempted to include transport mode / travel attributes in studying the
role of the qualitative characteristics that influence the choice in shopping destination. Using the
factor analytical technique, five sets of qualitative generalized attributes were generated. These
generalized attributes include trip convenience, trip comfort, trip safety, shopping area attraction
and shopping area mobility. He found that these generalized attributes, which were obtained from
attitudinal information, are significant in an individual's choice of shopping area.

3. SYSTEM ANALYSIS

3.1 AIM

Customer Segmentation is the subdivision of a market into discrete customer groups that
share similar characteristics. Customer Segmentation can be a powerful means to identify
unsatisfied customer needs. Using the above data companies can then outperform the competition
by developing uniquely appealing products and services.

3.2 EXISTING SYSTEM

The existing system contains the following drawbacks:

• All the segmentations are search based


• Difficult to gather the data and segment them accordingly
• The results are not really accurate as the clustering is not close enough to determine accurate
centroids

3.3 PROPOSED SYSTEM

Our proposed system has the following features:

• Develop the system to get easy visualization techniques


• Increase the data set to accommodate many data points so that results will be more accurate
• Segment the products directly according to the customer group
• Use different methods to collect the customer data instead of physical forms

3.4 HARDWARE REQUIREMENTS:


• Hard disk

• System (8GB RAM and 1TB Hard Disk)

• Forms (To collect data from the customer in malls)

SOFTWARE REQUIREMENTS:

• Anaconda
• Jupyter
• Kaggle
• Operating system (Windows 10)

ALGORITHM USED

INTRODUCTION:

The most common ways in which businesses segment their customer base are:
Demographic segmentation: clustering demographic information such as gender, age, familial
and marital status, income, education, and occupation.

Demographic clustering is distribution-based. It provides fast and natural clustering of very


large databases. Clusters are characterized by the value distributions of their members. It
automatically determines the number of clusters to be generated.

Typically, demographic data contains many categorical variables. The mining function
works well with data sets that consist of variables of this type.

You can also use numerical variables. The Demographic Clustering algorithm treats
numerical variables by assigning similarities according to the numeric difference of the values.

Demographic Clustering is an iterative process over the input data. Each input record is read
in succession. The similarity of each record with each of the currently existing clusters is calculated.
If the biggest calculated similarity is above a given threshold, the record is added to the relevant
cluster. This cluster's characteristics change accordingly. If the calculated similarity is not above the
threshold, or if there is no cluster (which is initially the case) a new cluster is created that contains
the record alone. You can specify the maximum number of clusters, as well as the similarity
threshold.

Demographic Clustering uses the statistical Condorcet criterion to manage the assignment of
records to clusters and the creation of new clusters. The Condorcet criterion evaluates how
homogeneous each discovered cluster is (in that the records it contains are similar) and how
heterogeneous the discovered clusters are among each other. The iterative process of discovering
clusters stops after two or more passes over the input data if the improvement of the clustering result
according to the Condorcet criterion does not justify a new pass.
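The thresholded assignment loop described above can be sketched in a few lines. This is a simplified illustration of the idea only, not the actual Demographic Clustering implementation: the similarity function (an invented transform of Euclidean distance), the threshold value and the sample points are all assumptions, and the Condorcet criterion and repeated passes are omitted.

```python
import numpy as np

def leader_clustering(records, threshold=0.5):
    """Threshold-based incremental clustering: each record joins the most
    similar existing cluster, or founds a new one when no similarity
    exceeds the threshold."""
    centroids = []   # running mean of each cluster's members
    counts = []      # member count per cluster, for mean updates
    labels = []
    for x in records:
        if centroids:
            dists = [np.linalg.norm(x - c) for c in centroids]
            best = int(np.argmin(dists))
            # invented similarity: rescale distance into (0, 1]
            sim = 1.0 / (1.0 + dists[best])
        else:
            best, sim = -1, 0.0
        if sim > threshold:
            # add the record to the best cluster and update its centroid
            counts[best] += 1
            centroids[best] += (x - centroids[best]) / counts[best]
            labels.append(best)
        else:
            # no cluster is similar enough: create a new one
            centroids.append(x.astype(float).copy())
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids

data = np.array([[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [8.2, 7.9]])
labels, centroids = leader_clustering(data, threshold=0.5)
```

With the toy data above, the two well-separated pairs end up in two clusters.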

Geographical segmentation: It differs depending on the scope of the company. For localized
businesses, this info might pertain to specific towns or counties. For larger companies, it might
mean a customer’s city, state, or even country of residence.

Geographic segmentation is the simplest type of market segmentation. It categorizes customers
based on geographic borders.

Geographic Market Segmentation Examples

• ZIP code
• City
• Country
• Radius around a certain location
• Climate
• Urban or rural

Geographic segmentation can refer to a defined geographic boundary (such as a city or ZIP
code) or type of area (such as the size of city or type of climate).

An example of geographic segmentation may be a luxury car company choosing to target
customers who live in warm climates, where vehicles don’t need to be equipped for snowy
weather. A B2B marketing platform might focus its marketing efforts around urban city centers
where its target customers are likely to work.

You can find out where your audience lives using Alexa’s Site Overview tool: enter your site
URL, and the report shows you where your website visitors are located across the world.

Psychographic segmentation: psychographic information such as social class, lifestyle, and
personality traits.

Psychographic segmentation categorizes audiences and customers by factors that relate
to their personalities and characteristics.

Psychographic Market Segmentation Examples

• Personality traits
• Values
• Attitudes
• Interests
• Lifestyles
• Psychological influences
• Subconscious and conscious beliefs
• Motivations
• Priorities

Psychographic segmentation factors are slightly more difficult to identify than demographics
because they are subjective. They are not data-focused and require research to uncover and
understand.

For example, the luxury car brand may choose to focus on customers who value quality
and status, while the B2B enterprise marketing platform may target marketing managers who
are motivated to increase productivity and show value to their executive team.

When your obvious groupings of target segments seem to have radically different needs and
responses to your offerings and messaging, this is a major indicator it is a good time to look at
psychographic segmentation. This method is a powerful way to market the same product to
individuals who otherwise seem very heterogeneous. Many expert marketers say this approach
will ultimately yield the greatest payoff, in many ways: purchase amount and frequency, lifetime
value, loyalty, and more.

Behavioral segmentation: behavioral data such as spending and consumption habits,
product/service usage, and desired benefits.

While demographic and psychographic segmentation focus on who a customer
is, behavioral segmentation focuses on how the customer acts.
Behavioral Market Segmentation Examples

• Purchasing habits
• Spending habits
• User status
• Brand interactions

Behavioral segmentation requires you to know about your customer’s actions. These
activities may relate to how a customer interacts with your brand or to other activities that happen
away from your brand.

A B2C example in this segment may be the luxury car brand choosing to target customers
who have purchased a high-end vehicle in the past three years. The B2B marketing platform
may focus on leads who have signed up for one of its free webinars.

Behavioral segmentation isn’t just about recognizing that people have different habits; it’s
about optimizing marketing campaigns to match these behavioral patterns with a particular
message.

Behavioral segmentation is the process of sorting and grouping customers based on the
behaviors they exhibit. These behaviors include the types of products and content they consume,
and the cadence of their interactions with an app, website, or business.

Acquisition, engagement, and retention are all important factors to keep in mind when
analyzing customer behavior. Understanding the following ways your users can interact with your
product will help you accomplish a sustainable and constructive behavioral segmentation strategy.

CLUSTERING

Clustering is one of the most common exploratory data analysis techniques, used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups in the
data such that data points in the same subgroup (cluster) are very similar while data points in
different clusters are very different. In other words, we try to find homogeneous subgroups within
the data such that data points in each cluster are as similar as possible according to a similarity
measure such as Euclidean distance or correlation-based distance. The decision of which
similarity measure to use is application-specific.

Clustering analysis can be done on the basis of features, where we try to find subgroups of
samples based on features, or on the basis of samples, where we try to find subgroups of features
based on samples. We’ll cover here clustering based on features. Clustering is used in market
segmentation, where we try to find customers that are similar to each other in terms of
behaviors or attributes; image segmentation/compression, where we try to group similar regions
together; document clustering based on topics; etc.

Unlike supervised learning, clustering is considered an unsupervised learning method since


we don’t have the ground truth to compare the output of the clustering algorithm to the true labels
to evaluate its performance. We only want to try to investigate the structure of the data by grouping
the data points into distinct subgroups.

Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to one another than to the data points in
other groups. It is basically a grouping of objects on the basis of the
similarity and dissimilarity between them.

Why Clustering ?

Clustering is important because it determines the intrinsic grouping among the
unlabeled data present. There is no single criterion for a good clustering; it depends on the user and
the criteria that satisfy their needs. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable groupings
(“useful” data classes) or in finding unusual data objects (outlier detection). Every clustering
algorithm must make some assumptions about what constitutes the similarity of points, and each
assumption makes for different and equally valid clusters.

Clustering Methods:

• Density-Based Methods: These methods treat clusters as dense regions of the space,
separated from regions of lower density. These methods have good accuracy and the
ability to merge two clusters. Example: DBSCAN.
• Hierarchical Based Methods: The clusters formed in this method form a tree-type
structure based on the hierarchy. New clusters are formed using the previously formed
ones. It is divided into two categories:
o Agglomerative
o Divisive
• Partitioning Methods: These methods partition the objects into k clusters, and each
partition forms one cluster. This method optimizes an objective criterion, such as a
distance-based similarity function.
• Grid-based Methods: In these methods the data space is divided into a finite number
of cells that form a grid-like structure. All the clustering operations done on these grids are
fast and independent of the number of data objects.
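As an illustration of the agglomerative hierarchical approach, SciPy's `linkage` and `fcluster` functions can build the merge tree and cut it into flat clusters. The income/expenditure-style points below are invented for illustration and are not the project's data set:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy (income, expenditure) points forming two obvious groups
X = np.array([[15, 39], [16, 41], [17, 40],
              [78, 17], [80, 15], [79, 16]], dtype=float)

# agglomerative: start with every point as its own cluster and
# repeatedly merge the two nearest clusters (Ward linkage)
Z = linkage(X, method="ward")

# cut the merge tree to obtain two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

Ward linkage merges, at each step, the pair of clusters that least increases the total within-cluster variance; cutting the tree with `criterion="maxclust"` then yields the requested number of flat clusters.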

So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best”
criterion which would be independent of the final aim of the clustering. Consequently, it is the
user who must supply this criterion, in such a way that the result of the clustering will suit their
needs.

Clustering algorithms can be applied in many fields, for instance:

• Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;

• Biology: classification of plants and animals given their features;
• Libraries: book ordering;
• Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
• City-planning: identifying groups of houses according to their house type, value and
geographical location;
• Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
• WWW: document classification; clustering weblog data to discover groups of similar
access patterns.

Requirements
The main requirements that a clustering algorithm should satisfy are:

• scalability;
• dealing with different types of attributes;
• discovering clusters with arbitrary shape;
• minimal requirements for domain knowledge to determine input parameters;
• ability to deal with noise and outliers;
• insensitivity to order of input records;
• high dimensionality;
• interpretability and usability.

Problems

There are a number of problems with clustering. Among them:

• current clustering techniques do not address all the requirements adequately (and
concurrently);
• dealing with a large number of dimensions and a large number of data items can be
problematic because of time complexity;
• the effectiveness of the method depends on the definition of “distance” (for distance-based
clustering);

• if an obvious distance measure doesn’t exist we must “define” it, which is not always
easy, especially in multi-dimensional spaces;
• the result of the clustering algorithm (that in many cases can be arbitrary itself) can be
interpreted in different ways.

Clustering algorithms may be classified as listed below:

• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering

In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a
definite cluster then it cannot be included in another cluster. A simple example of this is
shown in the figure below, where the separation of points is achieved by a straight line.

On the contrary the second type, the overlapping clustering, uses fuzzy sets to cluster data,
so that each point may belong to two or more clusters with different degrees of membership. In
this case, data will be associated to an appropriate membership value.

Fig 3.1 Working of a clustering algorithm

Instead, a hierarchical clustering algorithm is based on the union of the two
nearest clusters. The beginning condition is realized by setting every datum as a cluster. After a few
iterations it reaches the final clusters wanted.
Finally, the last kind of clustering uses a completely probabilistic approach.

Here we present four of the most used clustering algorithms:

• K-means
• Fuzzy C-means
• Hierarchical clustering
• Mixture of Gaussians

Each of these algorithms belongs to one of the clustering types listed above: K-Means is an
exclusive clustering algorithm, Fuzzy C-Means is an overlapping clustering algorithm,
hierarchical clustering is, as its name suggests, hierarchical, and Mixture of Gaussians is a
probabilistic clustering algorithm. We discuss each clustering method in the following paragraphs.
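As a sketch of the first of these, K-Means, the scikit-learn API (available in the Anaconda distribution mentioned earlier) exhibits the exclusive-clustering behavior described above. The (income, spending) pairs here are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# toy (annual income, spending score) pairs with two obvious groups
X = np.array([[15, 39], [16, 41], [17, 40],
              [85, 75], [86, 77], [87, 76]], dtype=float)

# fit K-Means with k=2; each point is assigned to exactly one cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_            # exclusive: one cluster per point
centers = kmeans.cluster_centers_  # one centroid per cluster
```

Each point receives exactly one label, which is what makes K-Means an exclusive clustering algorithm.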

Distance Measure

An important component of a clustering algorithm is the distance measure between data
points. If the components of the data instance vectors are all in the same physical units then it is
possible that the simple Euclidean distance metric is sufficient to successfully group similar data
instances. However, even in this case the Euclidean distance can sometimes be misleading.
The figure below illustrates this with an example of the width and height measurements of an
object. Despite both measurements being taken in the same physical units, an informed decision
has to be made as to the relative scaling. As the figure shows, different scalings can lead to
different clusterings.

Fig 3.2 Formation of clusters

Notice, however, that this is not only a graphical issue: the problem arises from the
mathematical formula used to combine the distances between the single components of the data
feature vectors into a unique distance measure that can be used for clustering purposes: different
formulas lead to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance measure
for each particular application.

Minkowski Metric
For higher dimensional data, a popular measure is the Minkowski metric,

d_p(x, y) = ( Σ_{i=1..d} |x_i − y_i|^p )^(1/p)

where d is the dimensionality of the data. The Euclidean distance is a special case where p = 2,
while the Manhattan metric has p = 1. However, there are no general theoretical guidelines for
selecting a measure for any given application.
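The metric can be computed directly with NumPy; the two points below are illustrative:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

manhattan = minkowski(x, y, p=1)   # p=1: |3| + |4| = 7.0
euclidean = minkowski(x, y, p=2)   # p=2: sqrt(9 + 16) = 5.0
```

Setting p = 1 recovers the Manhattan metric and p = 2 the Euclidean distance, as noted above.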

It is often the case that the components of the data feature vectors are not immediately
comparable. It can be that the components are not continuous variables, like length, but nominal
categories, such as the days of the week. In these cases again, domain knowledge must be used to
formulate an appropriate measure.

Measuring Algorithm Performance

One of the most important considerations regarding an ML model is assessing its
performance, or the model’s quality. In the case of supervised learning algorithms,
evaluating the quality of our model is easy because we already have labels for every example.

In the case of unsupervised learning algorithms, on the other hand, we are not as fortunate
because we deal with unlabeled data. Still, there are metrics that give the practitioner insight
into how the clusters change depending on the algorithm.

What are the criteria for comparing clustering algorithms?

Now a good clustering algorithm aims to create clusters in which:

• The intra-cluster similarity is high (the data points inside a cluster are similar to
one another)
• The inter-cluster similarity is low (each cluster holds information that isn’t similar to the
other clusters)

Before we dive deep into such metrics, we must understand that these metrics only evaluate
the comparative performance of models against each other, rather than measuring the validity of a
model’s predictions.

You still don’t know which cluster corresponds to which class, or whether the clusters make any
sense at all. In this case, you can validate your results by simply sampling from the clusters and
looking at the quality of the grouping. If the data are split reasonably, you can register a label for
every cluster and either label the whole dataset, train a supervised model, or continue to use the
k-means clusters, keeping the information about which cluster corresponds to which class.
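A sketch of that validation workflow (the blob data and ground-truth labels here are hypothetical, purely for illustration): sample labelled points from each cluster, register the majority label, and reuse the mapping for the whole dataset.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated blobs with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
true_labels = np.array([0] * 50 + [1] * 50)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Register the majority ground-truth label seen in each cluster.
cluster_to_label = {
    c: Counter(true_labels[clusters == c]).most_common(1)[0][0]
    for c in np.unique(clusters)
}

# Relabel the whole dataset through the registered mapping.
mapped = np.array([cluster_to_label[c] for c in clusters])
accuracy = float((mapped == true_labels).mean())
print(accuracy)
```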

Applications of Clustering

Customer Segmentation: Subdivision of customers into groups/segments such that each
customer segment consists of customers with similar market characteristics – pricing, loyalty,
spending behaviours, etc. Some of the segmentation variables could be, e.g., the number of items
bought on sale, average transaction value, or the total number of transactions.

Creating NewsFeeds: K-Means can be used to cluster articles by their similarity — it can
separate documents into disjoint clusters.

Cloud Computing Environment: Clustered storage is used to increase performance, capacity, or
reliability – clustering distributes workloads to each server, manages the transfer of workloads
between servers, and provides access to all files from any server regardless of the physical
location of the data.

Environmental risks: K-means can be used to analyse environmental risk in an area – for example,
environmental risk zoning of a chemical industrial area.

Pattern Recognition in images: For example, to automatically detect infected fruits or for
segmentation of blood cells for leukaemia detection.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their understanding of what constitutes
a cluster and how to efficiently find them. Popular notions of clusters include groups with small
distances between cluster members, dense areas of the data space, intervals, or particular statistical
distributions. Clustering can therefore be formulated as a multi-objective optimization problem.
The appropriate clustering algorithm and parameter settings depend on the individual dataset and
the intended use of the results.

K-Means Clustering Algorithm

The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-
defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one
group.
It tries to make the intra-cluster data points as similar as possible while also keeping the
clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the
squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data
points that belong to that cluster) is at the minimum. The less variation we have within clusters, the
more homogeneous (similar) the data points are within the same cluster.

The way the k-means algorithm works is as follows:

1. Specify number of clusters K.

2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without replacement.

3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to
clusters isn’t changing.

• Compute the sum of the squared distance between data points and all centroids.

• Assign each data point to the closest cluster (centroid).

• Compute the centroids for the clusters by taking the average of all the data points that belong
to each cluster.

The approach k-means follows to solve the problem is called Expectation-Maximization. The E-
step assigns the data points to the closest cluster. The M-step computes the centroid of each
cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).

The objective function is:

    J = Σ_{i=1}^{m} Σ_{k=1}^{K} w_ik ‖x_i − μ_k‖²

where w_ik = 1 for data point x_i if it belongs to cluster k; otherwise, w_ik = 0. Also, μ_k is the centroid
of x_i’s cluster.

It’s a minimization problem in two parts. We first minimize J w.r.t. w_ik while treating μ_k as fixed;
then we minimize J w.r.t. μ_k while treating w_ik as fixed. Technically speaking, we differentiate J
w.r.t. w_ik first and update the cluster assignments (E-step); then we differentiate J w.r.t. μ_k and
recompute the centroids after the cluster assignments from the previous step (M-step). Therefore,
the E-step is:

    w_ik = 1 if k = argmin_j ‖x_i − μ_j‖², and w_ik = 0 otherwise

In other words, assign the data point x_i to the closest cluster as judged by its squared
distance from the cluster’s centroid.

And the M-step is:

    μ_k = ( Σ_{i=1}^{m} w_ik · x_i ) / ( Σ_{i=1}^{m} w_ik )

which translates to recomputing the centroid of each cluster to reflect the new assignments.
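The E-step and M-step above can be sketched directly in NumPy (a minimal illustration on toy data, assuming random initialization from the data points; this is not the scikit-learn implementation used later in the project):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: choose k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # E-step: assign each point to its nearest centroid
        # (equivalently, set w_ik = 1 for the closest centroid).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Two tiny, well-separated groups of points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```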

Fig 3.3 Example of K-Means clustering

K-means is an unsupervised clustering algorithm designed to partition unlabelled data into
a certain number (that’s the “K”) of distinct groupings. In other words, k-means finds
observations that share important characteristics and classifies them together into clusters. A good
clustering solution is one that finds clusters such that the observations within each cluster are
more similar to each other than to observations in other clusters.

There are countless examples of where this automated grouping of data can be extremely
useful. For example, consider the case of creating an online advertising campaign for a brand-new
range of products being released to the market. While we could display a single generic
advertisement to the entire population, a far better approach would be to divide the population
into clusters of people who hold shared characteristics and interests, and display customised
advertisements to each group. K-means is an algorithm that finds these groupings in big datasets
where it is not feasible to do so by hand.

The intuition behind the algorithm is actually pretty straightforward. To begin, we choose a
value for k (the number of clusters) and randomly choose an initial centroid (centre coordinates)
for each cluster. We then apply a two-step process:

1. Assignment step — assign each observation to its nearest centroid.
2. Update step — update each centroid to be the mean of its assigned observations.

We repeat these two steps over and over until there is no further change in the clusters. At
this point the algorithm has converged and we may retrieve our final clustering.

One final key aspect of k-means returns to this concept of convergence. We previously
mentioned that the k-means algorithm doesn’t necessarily converge to the global minimum and
instead may converge to a local minimum (i.e. k-means is not guaranteed to find the best solution).
In fact, depending on which values we choose for our initial centroids, we may obtain differing
results.

As we are only interested in the best clustering solution for a given choice of k, a common
solution to this problem is to run k-means multiple times, each time with different randomised
initial centroids, and use only the best solution. In other words, always run k-means multiple
times to ensure we find a solution close to the global minimum.
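scikit-learn’s KMeans exposes exactly this restart strategy through its n_init parameter: the algorithm is run n_init times with different random initializations and only the run with the lowest inertia (SSE) is kept. A small sketch on hypothetical blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three hypothetical, well-separated blobs (illustrative data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0.0, 5.0, 10.0)])

# A single random initialization may land in a poor local minimum;
# n_init=10 repeats the run 10 times and keeps the best solution.
single = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The best of many restarts can never be worse than a single restart.
print(multi.inertia_ <= single.inertia_)
```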

Advantages

1) Fast, robust and easy to understand.

2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters,
d the dimensionality of each object, and t the number of iterations. Normally, k, t, d << n.

3) Gives the best results when the data sets are distinct or well separated from each other.

Disadvantages

1) The learning algorithm requires a priori specification of the number of cluster centres.

2) Exclusive assignment: if there are two highly overlapping groups of data, k-means will
not be able to resolve that there are two clusters.

3) The learning algorithm is not invariant to non-linear transformations, i.e. with different
representations of the data we get different results.

4) Euclidean distance measures can unequally weight underlying factors.

5) The learning algorithm provides only a local optimum of the squared-error function.

6) Random choice of the cluster centres may not lead to a fruitful result (refer to the figure).

7) Applicable only when the mean is defined, i.e. it fails for categorical data.

8) Unable to handle noisy data and outliers.

9) The algorithm fails for non-linear data sets.

The main drawback of this technique relates to the ambiguity about the number K of clusters that
should be initialized. To overcome this issue, the performance of the algorithm is calculated for
different numbers of centroids.

Conclusion

K-means is one of the most common and intuitive clustering algorithms in Machine
Learning. The name ‘k-means’ almost explains the method itself:

1. ‘K’ cluster centres are initialized.

2. The mean of the features of the data points nearest to each centre is calculated and set as
that centre’s new coordinates, and the process repeats until convergence.

EVALUATION METHODS:

Contrary to supervised learning, where we have the ground truth to evaluate the model’s
performance, clustering analysis doesn’t have a solid evaluation metric that we can use to evaluate
the outcome of different clustering algorithms. Moreover, since k-means requires k as an input and
doesn’t learn it from the data, there is no right answer for the number of clusters we should
have in any problem. Sometimes domain knowledge and intuition may help, but usually that is not
the case. In the cluster-predict methodology, we can evaluate how well the models perform
for different values of K, since the clusters are used in downstream modelling.

We’ll cover two methods that may give us some intuition about k:

• Elbow method

• Quick method

ELBOW METHOD:

The elbow method gives us an idea of what a good number of clusters would be, based on the
sum of squared errors (SSE) between data points and their assigned clusters’ centroids. We pick
k at the spot where the SSE starts to flatten out, forming an elbow. We’ll use the geyser dataset,
evaluate the SSE for different values of k, and see where the curve forms an elbow and flattens out.

Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then
the "elbow" on the arm is the value of k that is the best. The idea is that we want a small SSE, but
that the SSE tends to decrease toward 0 as we increase k (the SSE is 0 when k is equal to the number
of data points in the dataset, because then each data point is its own cluster, and there is no error
between it and the center of its cluster). So our goal is to choose a small value of k that still has a
low SSE, and the elbow usually represents where we start to have diminishing returns by increasing
k.

Fig 3.4 Example of the Elbow method

The graph above shows that k=2 is not a bad choice. Sometimes it’s still hard to figure out
a good number of clusters to use because the curve is monotonically decreasing and may not show
any elbow, or any obvious point where the curve starts flattening out.

Quick Method

The same functionality above can be achieved with the associated quick method
kelbow_visualizer. This method will build the KElbowVisualizer object with the associated
arguments, fit it, then (optionally) immediately show the visualization.

The K-Elbow Visualizer implements the “elbow” method of selecting the optimal number
of clusters for K-means clustering. K-means is a simple unsupervised machine learning algorithm
that groups data into a specified number (k) of clusters. Because the user must specify in advance
what k to choose, the algorithm is somewhat naive – it assigns all members to k clusters even if
that is not the right k for the dataset.

The elbow method runs k-means clustering on the dataset for a range of values for k
and then, for each value of k, computes an average score for all clusters. By default, the distortion
score is computed: the sum of squared distances from each point to its assigned centre. Other
metrics can also be used, such as the silhouette score (the mean silhouette coefficient for all
samples) or the calinski_harabasz score, which computes the ratio of dispersion between and
within clusters.

When these overall metrics for each model are plotted, it is possible to visually
determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of
inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there
is a strong inflection point, it is a good indication that the underlying model fits best at that
point.

Parameters

model : a scikit-learn clusterer

Should be an instance of an unfitted clusterer, specifically KMeans or MiniBatchKMeans. If it is
not a clusterer, an exception is raised.

ax : matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in, the current axes will be used (or generated if
required).

k : integer, tuple, or iterable

The k values to compute silhouette scores for. If a single integer is specified, then the
range (2, k) will be computed. If a tuple of 2 integers is specified, then k will be in np.arange(k[0],
k[1]). Otherwise, specify an iterable of integers to use as values for k.

metric : string, default: "distortion"

Select the scoring metric to evaluate the clusters. The default is the mean distortion,
defined by the sum of squared distances between each observation and its closest centroid.
Other metrics include:
• distortion: mean sum of squared distances to centers
• silhouette: mean ratio of intra-cluster and nearest-cluster distance
• calinski_harabasz: ratio of within to between cluster dispersion

timings : bool, default: True

Display the fitting time per k to evaluate the amount of time required to train the
clustering model.

locate_elbow : bool, default: True

Automatically find the “elbow” or “knee” which likely corresponds to the optimal value
of k, using the “knee point detection algorithm”. The knee point detection algorithm finds the
point of maximum curvature, which in a well-behaved clustering problem also represents the
pivot of the elbow curve. The point is labelled with a dashed line and annotated with the score and
k values.

kwargs : dict

Keyword arguments that are passed to the base class and may influence the
visualization, as defined in other Visualizers.
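The three scoring metrics listed above can also be computed directly with scikit-learn (a minimal sketch on hypothetical blob data; the distortion score corresponds to KMeans’ own inertia_ attribute):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Hypothetical data: two well-separated blobs (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(6.0, 0.3, (40, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

distortion = km.inertia_                 # sum of squared distances to centers
sil = silhouette_score(X, labels)        # in [-1, 1]; higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better
print(distortion, sil, ch)
```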

4.SYSTEM DESIGN

SYSTEM ARCHITECTURE

Fig 5.1 System architecture of a machine learning algorithm and how it flows

The machine learning architecture defines the various layers involved in the machine
learning cycle and covers the major steps carried out in the transformation of raw data into
training data sets capable of enabling the decision making of a system.

DATAFLOW DIAGRAM

Fig 5.2 Dataflow diagram of how a machine learning algorithm works

CLUSTERING ALGORITHM

Fig 5.3 K-Means algorithm architecture

The algorithm splits a given dataset into different clusters on the basis of proximity to the
cluster centres. It examines each data point and checks the proximity of that point to all the
cluster centres; k-means then allocates the data point to the cluster
whose cluster centre (centroid) is closest to it.

5. IMPLEMENTATION

5.1 MODULE DESCRIPTION

NO. OF MODULES

• Administrator
• Customer

MODULE DESCRIPTION

1. Administrator: The administrator is the controller of the survey link. The admin performs
all the controlling operations of the model, designs the algorithm that estimates a
customer’s expected expenditure score after the customer fills in his details in the
form, and performs operations such as training the dataset and updating it regularly
with new information.
2. Customer: The customer is the one for whom the output is targeted. Customers give their
details such as age, income and gender by filling in the survey form. A customer cannot
make changes to the model, but can only use the already trained model.

INPUT AND OUTPUT

The following are some of the inputs and outputs:

INPUTS:

• Admin trains the model
• Admin trains the datasets
• Admin tests the model
• Admin adds various categories
• Customer enters his details
• The details entered by the customer include age, income and gender
• Customer gets his targeted expenditure score

OUTPUTS:

• Admin gets the corresponding expenditure score of a customer
• Admin is able to cluster different categories of customers
• Customers may buy their desired products effectively and quickly
• Different products can be sold to different categories of customers with improved
customer satisfaction

CODING

5.2 DATASET TAKEN

The dataset consists of the annual income of 1 lakh (100,000) customers and their total expenditure
score (in $) for a period of one year. The dataset is taken from Kaggle, which hosts various types
of datasets. Let us explore the data using the NumPy and pandas libraries in Python.

This dataset contains the basic information (ID, age, gender, income, spending score) about the
customers.

Fig 6.1 The dataset we have chosen to implement this project

5.3 CODE

import numpy as np
import pandas as pd
import sklearn

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go

#Loading Data

data=pd.read_csv('Mall_Customers.csv')

data.head()

data.describe()

#Comparing data of different customers

sns.countplot(x='Gender',data=data);

plt.title('Distribution of Gender');

data.hist('Age',bins=35);

plt.title('Distribution of Age');

plt.xlabel('Age');

data.hist('Annual Income (k$)')

plt.title('Annual Income Distribution in Thousands of Dollars');

plt.xlabel('Thousands of Dollars');

plt.hist('Annual Income (k$)',data=data[data['Gender'] == 'Male'],alpha=0.5,label='Male');

plt.hist('Annual Income (k$)',data=data[data['Gender']=='Female'],alpha=0.5,label='Female');

plt.title('Distribution of Income by Gender');

plt.xlabel('Income(Thousand of Dollars)');

plt.legend();

male_customers=data[data['Gender']=='Male']

female_customers=data[data['Gender']=='Female']

print(male_customers['Spending Score (1-100)'].mean())

print(female_customers['Spending Score (1-100)'].mean())

#Visualizing the data

sns.scatterplot('Age','Spending Score (1-100)',hue='Gender',data=data);

plt.title('Age to Spending Score,Colored by Gender');

sns.lmplot('Age','Spending Score (1-100)',data=female_customers);

plt.title('Age to Spending Score,Female only');

sns.scatterplot('Annual Income (k$)','Spending Score (1-100)',hue='Gender',data=data);

plt.title('Annual Income to Spending Score,Colored by Gender')

sns.pairplot(data)

plt.show()

#Clustering Algorithm

from sklearn.cluster import KMeans

x=data.iloc[:,[3,4]].values # Annual Income and Spending Score columns

wcss=[]

for i in range(1,11):

    km=KMeans(n_clusters=i,init='k-means++',max_iter=300,n_init=10,random_state=0)

    km.fit(x)

    wcss.append(km.inertia_)

plt.plot(range(1,11),wcss)

plt.title('The Elbow Method',fontsize=20)

plt.xlabel('No of Clusters')

plt.ylabel('wcss')

plt.show()

#K-Means Algorithm

km=KMeans(n_clusters=5,init='k-means++',max_iter=300,n_init=10,random_state=0)

y_means=km.fit_predict(x)

plt.scatter(x[y_means==0,0],x[y_means==0,1],s=100,c='pink',label='miser')

plt.scatter(x[y_means==1,0],x[y_means==1,1],s=100,c='yellow',label='average')

plt.scatter(x[y_means==2,0],x[y_means==2,1],s=100,c='green',label='buyer')

plt.scatter(x[y_means==3,0],x[y_means==3,1],s=100,c='red',label='spender')

plt.scatter(x[y_means==4,0],x[y_means==4,1],s=100,c='black',label='target')

plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],s=50,c='blue',label='centroid')

plt.title('K-Means',fontsize=20)

plt.xlabel('Annual Income')

plt.ylabel('Spending Score')

plt.legend()

plt.grid()

plt.show()

6.TESTING

INTRODUCTION TO TESTING

Software testing is a critical element of software quality assurance and represents the ultimate
review of specification, design and coding. The increasing visibility of software as a system
element, and the attendant costs associated with software failure, are motivating factors for
well-planned, thorough testing. Testing is the process of executing a program with the intent of finding an
error. The design of tests for software and other engineered products can be as challenging as the
initial design of the product itself.

There are basically two types of testing approaches.

One is Black-Box testing – given the specified functions that a product has been designed to perform,
tests can be conducted to demonstrate that each function is fully operational.

The other is White-Box testing – knowing the internal workings of the product, tests can
be conducted to ensure that the internal operation of the product performs according to
specifications and that all internal components have been adequately exercised.

White-box and black-box testing methods have been used to test this package. All
loop constructs have been tested for their boundary and intermediate conditions. The test data was
designed with a view to checking all conditions and logical decisions. Error handling has been
taken care of by the use of exception handlers.

6.1 TESTING STRATEGIES:

Testing is a set of activities that can be planned in advance and conducted systematically.
A strategy for software testing must accommodate low-level tests that are necessary to verify that
a small source-code segment has been correctly implemented, as well as high-level tests that validate
major system functions against customer requirements.

Software testing is one element of verification and validation. Verification refers to the set
of activities that ensure that software correctly implements a specific function. Validation
refers to a different set of activities that ensure that the software that has been built is traceable to
customer requirements.
The main objective of software testing is to uncover errors. To fulfil this objective, a series
of test steps – unit, integration, validation and system tests – are planned and executed. Each test step
is accomplished through a series of systematic test techniques that assist in the design of test cases.
With each testing step, the level of abstraction at which the software is considered is broadened.
Testing is the only way to assure the quality of software, and it is an umbrella activity rather than a
separate phase. It is an activity to be performed in parallel with the software effort, and one that
consists of its own phases of analysis, design, implementation, execution and maintenance.

UNIT TESTING:

This testing method considers a module as a single unit and checks the unit at its interfaces and
its communication with other modules, rather than getting into details at the statement level. Here the
module is treated as a black box, which takes some input and generates output. Outputs for
a given set of input combinations are pre-calculated and compared with those generated by the module.

SYSTEM TESTING:

Here, all the pre-tested individual modules are assembled to create the larger system, and
tests are carried out at the system level to make sure that all modules work in synchrony with
each other. This testing methodology helps ensure that all modules which run
perfectly when checked individually also run in cohesion with the other modules. For this
testing, we create test cases to check all modules once, and then generate combinations of test
paths throughout the system to make sure that no path leads to unexpected behaviour.

INTEGRATED TESTING:

Testing is a major quality control measure employed during software development. Its basic
function is to detect errors. Sub-functions, when combined, may not produce the desired results, and
global data structures can present problems. Integration testing is a systematic technique for
constructing the program structure while conducting tests to uncover errors associated
with interfacing; the objective is to take unit-tested modules and build a program structure that has
been dictated by the design. In non-incremental integration, all the modules are combined in advance
and the program is tested as a whole; here, isolating the cause of an error is difficult. In
incremental testing, the program is constructed and tested in small segments, where errors are
more easily isolated and corrected. Different incremental integration strategies are top-down integration,
bottom-up integration and regression testing.

TOP-DOWN INTEGRATION TEST:

Modules are integrated by moving downwards through the control hierarchy, beginning with
the main program. Subordinate modules are incorporated into the structure in either a breadth-first
manner or a depth-first manner. This process is done in five steps:
• The main control module is used as a test driver, and stubs are substituted for all modules
directly subordinate to the main program.
• Depending on the integration approach selected, subordinate stubs are replaced one at a time
with actual modules.
• Tests are conducted.
• On completion of each set of tests, another stub is replaced with the real module.
• Regression testing may be conducted to ensure that new errors have not been
introduced.
The process continues from step 2 until the entire program structure is built. In the top-down
integration strategy, decision making occurs at upper levels of the hierarchy and is encountered first.
If major control problems do exist, early recognition is essential.
If depth-first integration is selected, a complete function of the software may be implemented and
demonstrated.

Some problems occur when processing at low levels of the hierarchy is required to adequately
test the upper levels: stubs replace the low-level modules at the beginning of top-down testing,
so no data flows upward in the program structure.

BOTTOM-UP INTEGRATION TEST:

Begins construction and testing with atomic modules. As modules are integrated from the
bottom up, the processing required for modules subordinate to a given level is
always available, and the need for stubs is eliminated. The following steps implement this strategy:
• Low-level modules are combined into clusters that perform a specific software sub-
function.
• A driver is written to coordinate test-case input and output.
• The cluster is tested.
• Drivers are removed, and clusters are combined moving upward in the program structure.

As integration moves upward, the need for separate test drivers lessens.
If the top levels of the program structure are integrated top-down, the number of drivers can be
reduced substantially and the integration of clusters is greatly simplified.

REGRESSION TESTING:

Each time a new module is added as part of integration, the software changes. Regression
testing is an activity that helps ensure that these changes do not introduce unintended behaviour or
additional errors.
Regression testing may be conducted manually, by re-executing a subset of all test cases, or by using
automated capture-playback tools, which enable the software engineer to capture test cases and results
for subsequent playback and comparison. The regression suite contains different classes of test
cases:
A representative sample of tests that will exercise all software functions.

Additional tests that focus on software functions that are likely to be affected by the change.

6.2 NON FUNCTIONAL TESTING:
Non-functional testing is the testing of a software application or system for its non-
functional requirements: the way a system operates, rather than specific behaviours of that
system. This is in contrast to functional testing, which tests against functional requirements that
describe the functions of a system and its components. The names of many non-functional tests
are often used interchangeably because of the overlap in scope between various non-functional
requirements. For example, software performance is a broad term that includes many specific
requirements like reliability and scalability.

6.3 FUNCTIONAL TESTING:


Functional testing is a quality assurance (QA) process and a type of black-box testing that
bases its test cases on the specifications of the software component under test. Functions are
tested by feeding them input and examining the output, and internal program structure is rarely
considered (unlike white-box testing). Functional testing is conducted to evaluate the
compliance of a system or component with specified functional requirements. Functional testing
usually describes what the system does.

Functional testing has many types:

• Smoke testing
• Sanity testing
• Regression testing

6.4 TEST CASES


A test case is a set of conditions or variables under which a tester will determine whether a system
under test satisfies requirements or works correctly. The process of developing test cases can also
help find problems in the requirements or design of an application.

TESTCASE 1

Fig.6.1 Testcase 1

The pandas library has not been imported in the figure above, so an error occurs. The
pandas library is used for data manipulation and analysis.

TESTCASE 2

Fig 6.2 Testcase 2

In this case, the km.fit() command is missing, which fits the model to the data and computes its
inertia. Inertia is the sum of squared distances between each data point and the centroid of its cluster.

6.5 GOALS OF TESTING:

• The main goal of the project is to obtain an accurate, error-free model.
• The purpose of this testing is to train the model such that, when a new data point is added,
there is no difficulty in the process of execution.
• The value of k should be predicted correctly when the elbow method is used.
• The model is trained so that the dataset can be changed as required, exposing the
system to different samples from a population.

7.OUTPUT SCREENS
Step 1: Import packages and libraries

Fig 7.1 The importing of packages

Fig 7.1 describes the packages used in our project: NumPy for arithmetic
operations, pandas for loading the dataset, seaborn for styling the graphs,
matplotlib for plotting the graphs and sklearn for the different algorithms.
Step 2: Collect the dataset

Fig 7.2 Data collection

This figure describes the features of the dataset used. The five features used are Customer ID, Gender,
Age, Annual Income and Spending Score.

Step 3: Describe the dataset

Figure 7.3 Describing the dataset

This figure describes the maximum, minimum, mean and count of the data points
being used.

Step 4: Plot the count plot

Fig 7.4 Bar graph describing the gender distribution

The figure above shows the number of male and female members in the given dataset.

Step 5: Plotting Histograms

Figure 7.5: Distribution of Age histogram

This histogram shows the number of people in specific age intervals.

Figure 7.5.1: Distribution of Age by Gender histogram

This figure shows the combined distribution of age and gender within each age
interval.

Step 6: Plotting scatterplots

Figure 7.6: Age to Spending Score by Gender scatterplot

This figure shows a scatterplot of age against spending score, coloured by gender, showing how
much money people of each age group spend.

Figure 7.6.1: Age to Spending Score by Male scatterplot

The figure above shows the male expenditure score in the form of a scatterplot.

Figure 7.6.2: Age to Spending Score by Female scatterplot

This figure shows the female spending score according to their age intervals in the form
of a scatterplot.

Figure 7.6.3: Annual Income to Spending Score scatterplot

This figure shows annual income against spending score for both male and female customers, with
annual income on the x-axis and spending score on the y-axis.

Step 7: Plotting pairplots

Figure 7.7: Pairplot

This figure shows a pairplot that compares two different features at a time.

Step 8: Using the elbow curve method

Figure 7.8: The Elbow curve

This figure shows the appropriate k value, obtained by performing the elbow method.

As the K-Means algorithm requires the number of clusters as input, we use the elbow
method to determine the optimal number of clusters that can be formed [32]. It works on the principle
that after a certain number of clusters K, the difference in SSE (Sum of Squared Errors) starts to
decrease and diminishes gradually. Here, the WCSS (Within-Cluster Sum of Squared errors)
metric is used as an indicator of the same; the K value specifies the number of clusters.
In Figure 7.8, it can be observed that an elbow point occurs at K=5. After K=5, the difference in
WCSS is barely visible. Hence, we choose 5 clusters and provide this value as input to
the K-Means algorithm.
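The elbow computation described above can be sketched as follows. It uses synthetic (income, spending score) points with five visible groups, and scikit-learn's `inertia_` attribute as the WCSS value.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic stand-in points with five well-separated groups.
centers = [(25, 25), (25, 80), (55, 50), (90, 20), (90, 85)]
X = np.vstack([rng.normal(loc=c, scale=4.0, size=(40, 2)) for c in centers])

# WCSS (inertia_) for K = 1..10; the "elbow" is where the drop flattens out.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

for k, w in zip(range(1, 11), wcss):
    print(k, round(w, 1))
```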

Step 9: Using K-Means clustering (final output)

Fig 7.9 The output obtained after K-Means Clustering is done


This figure is the final output that shows the different clusters obtained as per the features
provided.
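Fitting K-Means with the K=5 found above and plotting one color per cluster could look like this sketch, again on synthetic stand-in points rather than the report's dataset:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
centers = [(25, 25), (25, 80), (55, 50), (90, 20), (90, 85)]
X = np.vstack([rng.normal(loc=c, scale=4.0, size=(40, 2)) for c in centers])

# K = 5, as chosen with the elbow method.
km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", alpha=0.7)
ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
           marker="x", s=100, c="black")  # cluster centroids
ax.set_xlabel("Annual Income (k$)")
ax.set_ylabel("Spending Score (1-100)")
print(np.bincount(labels))  # points per cluster
```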

We can see that the mall customers can be broadly grouped into 5 groups based on their
purchases made in the mall.

In cluster 4 (yellow) people have low annual income and low spending scores. This is
quite reasonable, as people with low salaries prefer to buy less; in fact, these are the wise people
who know how to spend and save money. The shops/mall will be least interested in people
belonging to this cluster.

In cluster 2 (blue) people have low income but higher spending scores. These are people
who, for some reason, love to buy products often even though they have a low income; perhaps
they are more than satisfied with the mall's services. The shops/mall might not target these people
very effectively, but will still not want to lose them.

In cluster 5 (pink) people have average income and an average spending score. These
people will not be the prime targets of the shops or mall, but they will still be considered, and
other data analysis techniques may be used to increase their spending score.

In cluster 1 (red) people have high income and high spending scores. This is the ideal
case for the mall or shops, as these people are the prime sources of profit. They might be regular
customers of the mall who are convinced by the mall's facilities.

In cluster 3 (green) people have high income but low spending scores, which is interesting.
Perhaps these people are unsatisfied or unhappy with the mall's services. They can be the prime
targets of the mall, as they have the potential to spend money; the mall authorities could add new
facilities to attract these people and meet their needs.

Finally, based on our machine learning technique, we may deduce that to increase the profits
of the mall, the mall authorities should target people belonging to cluster 3 and cluster 5, and should
also maintain their standards to keep the people belonging to cluster 1 and cluster 2 happy and satisfied.

To conclude, it is remarkable to see how machine learning can be used in businesses to
enhance profit.

8.CONCLUSION

Our project classifies customers into different clusters so that different marketing
strategies can be employed for each cluster to attain maximum profit. Due to increasing
commercialization, consumer data is growing exponentially. When dealing with data of this
magnitude, organizations need efficient clustering algorithms for customer segmentation, and
these clustering models must be able to process such enormous data effectively. Each of the
clustering algorithms discussed above comes with its own set of merits and demerits. The
computational speed of the K-Means clustering algorithm is relatively better than that of
hierarchical clustering algorithms, as the latter require the calculation of the full proximity
matrix after each iteration. K-Means clustering gives better performance for a large number of
observations, while hierarchical clustering is better suited to handling fewer data points.

The major hindrance presents itself in the form of selecting the number of clusters K for
the K-Means process, which has to be provided as an input to this non-hierarchical clustering
algorithm. This limitation does not exist in the case of hierarchical clustering, since it does not
require any cluster centers as input: it is up to the user to choose the cluster groups as well as
their number. Hierarchical clustering also gives better results than K-Means when a random
dataset is used. The results of hierarchical clustering take the form of dendrograms, while the
output of K-Means consists of flat-structured clusters, which may be more difficult to analyze.
As the value of K increases, the quality (accuracy) of hierarchical clustering improves compared
to K-Means clustering. As such, partitioning algorithms like K-Means are suitable for large
datasets, while hierarchical clustering algorithms are more suitable for small datasets.
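The K-Means vs. hierarchical comparison above can be illustrated with scikit-learn, where `AgglomerativeClustering` stands in for the hierarchical method:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(5)
# Three well-separated synthetic groups.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2))
               for c in [(0, 0), (10, 10), (20, 0)]])

# K-Means needs the number of clusters K up front; hierarchical clustering
# builds a full merge tree and can be cut at any level afterwards.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(len(set(km_labels)), len(set(hc_labels)))
```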

9.FUTURE ENHANCEMENT

We have completed this project with as few flaws as possible; it can be further enhanced
by identifying more statistics about people and by improving the accuracy of the output.

Census data could also be collected to train the model further and obtain more accurate outputs.
Based on the social data accumulated, we can conclude that the mall customer segmentation system can
be used in a wide range of applications across a variety of domains, including:

• Identifying the interests of people while they are buying items from a mall

• Grouping people efficiently

• Identifying more clusters to segment the products and improve their sales

In this project we have implemented the K-Means algorithm; it can be further enhanced by using
more complex techniques such as neural network algorithms.


Unlike other applications, such as Amazon's recommendation systems, this approach does not require any invasion of users' privacy.

10.BIBLIOGRAPHY

[1] E. Ngai, L. Xiu and D. Chau, “Application of data mining techniques in customer relationship
management: A literature review and classification”, Expert Systems with Applications, vol. 36,
no. 2, pp. 2592-2602, 2009.
[2] J. Peppard, “Customer Relationship Management (CRM) in financial services”, European
Management Journal, vol. 18, no. 3, pp. 312-327, 2000.
[3] A. Ansari and A. Riasi, “Taxonomy of marketing strategies using bank customers clustering”,
International Journal of Business and Management, vol. 11, no. 7, pp. 106-119, 2016.
[4] M. Ghzanfari, et al., “Customer segmentation in clothing exports based on clustering
algorithm”, Iranian Journal of Trade Studies, vol. 14, no. 56, pp. 59-86, 2010.
[5] C. Rygielski, J. Wang and D. Yen, “Data mining techniques for customer relationship
management”, Technology in Society, vol. 24, no. 4, pp. 483-502, 2002.
[6] J. Lee and S. Park, “Intelligent profitable customers segmentation system based on business
intelligence tools”, Expert Systems with Applications, vol. 29, no. 1, pp. 145-152, 2005.
[7] D. A. Kandeil, A. A. Saad and S. M. Youssef, “A two-phase clustering analysis for B2B
customer segmentation”, in International Conference on Intelligent Networking and Collaborative
Systems, Salerno, 2014, pp. 221-228.

[8] R. Swift, Accelerating Customer Relationships: Using CRM and Relationship Technologies,
1st ed. Upper Saddle River, N.J.: Prentice Hall PTR, 2000.

[9] J. Aaker, A. Brumbaugh and S. Grier, “Nontarget Markets and Viewer Distinctiveness: The
Impact of Target Marketing on Advertising Attitudes”, Journal of Consumer Psychology, vol. 9,
no. 3, pp. 127-140, 2000.

[10] T. Kanungo, et al., “An efficient k-means clustering algorithm: analysis and
implementation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7,
pp. 881-892, 2002.

[11] Y. Chen, et al., “Identifying patients in target customer segments using a two-stage
clustering-classification approach: A hospital-based assessment”, Computers in Biology and
Medicine, vol. 42, no. 2, pp. 213-221, 2012.

[12] G. Lefait and T. Kechadi, “Customer segmentation architecture based on clustering
techniques”, in Fourth International Conference on Digital Society, Sint Maarten, 2010, pp. 243-
248.
[13] M. Namvar, M. Gholamian and S. KhakAbi, “A two-phase clustering method for intelligent
customer segmentation”, in International Conference on Intelligent Systems, Modelling and
Simulation, Liverpool, 2010, pp. 215-219.
[14] J. MacQueen, “Some methods for classification and analysis of multivariate observations”,
in Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, 1967, pp.
281-297.

[15] E. Rendon, et al., “A comparison of internal and external cluster validation indexes”, in
American Conference on Applied Mathematics and The Fifth WSEAS International Conference
on Computer Engineering and Applications, Puerto Morelos, 2011, pp. 158-163.

[16] H. Gucdemir and H. Selim, “Integrating multi-criteria decision making and clustering for
business customer segmentation”, Industrial Management & Data Systems, vol. 115, no. 6, pp.
1022-1040, 2015.

