
UNIT 5

CLUSTER ANALYSIS

Definition:

• Clustering is defined as grouping a set of similar objects into classes or clusters. In other words, during cluster analysis the data is grouped into classes or clusters so that records within a cluster (intra-cluster) have high similarity to one another but are highly dissimilar to objects in other clusters (inter-cluster).
• The similarity of records is identified on the basis of the values of the attributes describing the objects.
• Cluster analysis is an important human activity. The first human beings, Adam and Eve, actually learned through the process of clustering.
• They did not know the name of any object; they simply observed each and every object. Based on the similarity of their properties, they identified these objects in groups or clusters.

[Figure: Characteristics of clusters]
Applications of Cluster Analysis

Cluster analysis has been widely used in various important applications such as:
• Marketing: It helps marketers find distinctive groups among their customer base, and this knowledge helps them improve their targeted marketing programs.
• Land use: Clustering is used for identifying areas of similar land use from databases of earth observations.
• Insurance: Clustering is helpful for recognizing groups of insurance policyholders with a high average claim cost.
• City planning: It also helps in identifying clusters of houses based on house type, geographical location, and value.
• Earthquake studies: Clustering is helpful for the analysis of earthquakes, as it has been observed that earthquake epicenters are clustered along continental faults.
• Biology studies: Clustering helps in defining plant and animal classifications, identifying genes with similar functionality, and gaining insights into structures inherent in populations.
• Web discovery: Clustering is helpful in categorizing documents on the web for information discovery.
• Fraud detection: Clustering is also helpful in outlier detection applications such as credit card fraud detection.
Major Clustering Methods/Algorithms:

Clustering methods/algorithms can be grouped into the following five categories:
• Partitioning method: It constructs random partitions and then iteratively refines them by some criterion.
• Hierarchical method: It creates a hierarchical decomposition of the set of data (or objects) using some criterion.
• Density-based method: It is based on connectivity and density functions.
• Grid-based method: It is based on a multiple-level granularity structure.
• Model-based method: A model is considered for each of the clusters, and the idea is to identify the best fit of the data to that model.

i) Partitioning Clustering:

Clustering is the task of splitting a group of data or a dataset into a small number of clusters. For example, the items in a grocery store are grouped into different categories (butter, milk, and cheese are clustered in dairy products). This is a qualitative kind of partitioning. A quantitative approach, on the other hand, measures certain features of the products, such as the percentage of milk; that is, products with a high percentage of milk would be clustered together.
In the partitioning method, we cluster objects into a number of partitions based on their attributes.

The k-means algorithm is an important technique which falls under partitioning clustering.
k-means clustering:

In the k-means clustering algorithm, n objects are clustered into k clusters or partitions on the basis of attributes, where k < n and k is a positive integer. In simple words, in the k-means clustering algorithm, the objects are grouped into 'k' clusters on the basis of their attributes or features.
The grouping of objects is done by minimizing the sum of squared distances, i.e., the Euclidean distance between each data point and the corresponding cluster centroid.
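Expressed formally, this minimizes the standard k-means objective (a textbook formulation added here for clarity; the symbols below are not defined elsewhere in this unit):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

where C_j denotes the j-th cluster and mu_j its centroid, i.e., the mean of the points currently assigned to it.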

Working of the k-means algorithm

The working of the k-means clustering algorithm can be illustrated in five simple steps, as given below.

Step 1: Start with a selection of the value of k, where k = number of clusters
In this step, the k centroids (or clusters) are initiated: the first k training samples out of the n samples of data are taken as single-element clusters. Each of the remaining (n-k) training samples is assigned to the cluster with the nearest centroid, and the centroid of the gaining cluster is recomputed after each assignment.

Step 2: Create a distance matrix between the centroids and each pattern
In this step, the distance is computed from each sample of data to the centroid of each of the clusters. The heavy calculation involved is the major drawback of this step: since there are k centroids and n samples, the algorithm has to compute n*k distances.
Step 3: Assign each sample to the cluster with the closest centroid (minimal distance)
Now, the samples are grouped on the basis of their distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster. When samples no longer move between clusters, the algorithm ends.

Step 4: Update the new centroids for each cluster
In this step, the locations of the centroids are updated. Update the location of each centroid of a cluster that has gained or lost a sample by computing the mean of each attribute over all samples belonging to the respective cluster.

Step 5: Repeat until no further change occurs
Return to Step 2 of the algorithm and repeat the updating of each centroid location until a convergence condition is satisfied, i.e., until a pass through the training samples causes no new assignments.

[Figure: Flowchart of k-means clustering]
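To make the five steps concrete, here is a minimal NumPy sketch of the algorithm (an illustrative implementation written for these notes; the function name k_means and its parameters are our own, and Step 1 is simplified to picking k random samples as the initial centroids rather than the incremental scheme described above):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (simplified): choose k distinct samples as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance matrix between every sample and every centroid
        # (n*k distances per pass, as noted above).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each sample to its closest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its samples;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once no centroid moves any more (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```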


ii) Hierarchical Clustering Algorithms (HCA)

Hierarchical clustering is a type of cluster analysis which seeks to build a hierarchy of clusters. It is also called Hierarchical Cluster Analysis (HCA). Hierarchical clustering methods build a nested series of clusters, in contrast to partitioning methods, which generate only a flat set of clusters.

There are two types of hierarchical clustering: agglomerative and divisive.

Agglomerative:

This is a 'bottom-up' approach. In this approach, each object is a cluster by itself at the start, and the nearest clusters are repeatedly merged, resulting in larger and larger clusters, until some stopping criterion is met. The stopping criterion may be a specified number of clusters, or the stage at which all the objects are combined into a single large cluster, which is the highest level of the hierarchy.

Divisive:

This is a 'top-down' approach. In this approach, all objects start in one cluster, and splits are performed repeatedly, resulting in smaller and smaller clusters, until some stopping criterion is met or each cluster contains only a single object. Generally, the mergers and splits are decided in a greedy manner.
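As a brief illustration, agglomerative clustering can be run with SciPy's scipy.cluster.hierarchy module (a usage sketch under assumed inputs: the toy dataset X, the single-linkage choice, and the cut into two clusters are all our own illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D dataset: two loose groups of points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Bottom-up merging: 'single' linkage repeatedly merges the two
# clusters whose closest members are nearest to each other.
Z = linkage(X, method='single')

# Cut the resulting hierarchy into a flat set of 2 clusters
# (one possible stopping criterion: a specified number of clusters).
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
```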
Density-based clustering:

• Density-based clustering algorithms perform the clustering of data by forming clusters of points on the basis of the estimated density distribution of the corresponding points in the region.
• In density-based clustering, clusters are dense regions in the data space.
• These clusters are separated by regions of lower object density. A cluster is defined as a maximal set of density-connected points.
• The major strength of this approach is that it can identify clusters of arbitrary shape.
• A well-known density-based clustering method is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
• Familiarity with the following terms is a must in order to understand the working of the DBSCAN algorithm.

Neighborhood (e):

Neighborhood is an important term used in DBSCAN. It represents the set of objects that lie within a certain radius e of a given object. A high-density neighborhood results if an object has at least MinPts (minimum points) objects in its e-neighborhood.
Core, Border, and Outlier:

A point is known as a core point if it has at least a specified number of points (MinPts) within its neighborhood (e). These points lie in the interior of a cluster. A point is a border point if it has fewer than MinPts points within its neighborhood (e) but lies in the neighborhood of a core point. A point is a noise or outlier point if it is neither a core point nor a border point. The concept of core, border, and outlier points is illustrated below for a given value of MinPts.

[Figure: Concept of core, border, and outlier points]
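These three categories can be computed directly from pairwise distances. The following is a small sketch (classify_points is a hypothetical helper written for these notes, not a library function):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Split points of X into core, border, and outlier masks (hypothetical helper)."""
    # neighbors[i, j] is True if point j lies within radius eps of point i
    # (every point counts as its own neighbor).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    # Core: at least min_pts points inside the eps-neighborhood.
    core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-neighborhood of some core point.
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    # Outlier (noise): neither core nor border.
    outlier = ~core & ~border
    return core, border, outlier
```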

DBSCAN algorithm

After understanding the concepts of neighborhood, core, border, and outlier points and density-reachability, the DBSCAN algorithm can be summarized as given below.

Algorithm for DBSCAN:

1. Find the e (epsilon) neighbors of every point, and identify the core points as those with at least MinPts neighbors.
2. Find the connected components of the core points on the neighborhood graph, ignoring all non-core points.
3. Assign each non-core point to a nearby cluster if that cluster contains an e (eps) neighbor of it; otherwise, assign it to noise.
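These three steps translate almost directly into code. Below is a minimal NumPy sketch written for these notes (the function name dbscan, the -1 noise label, and the brute-force distance matrix are our own illustrative choices):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    # Step 1: find the eps-neighbors of every point and mark core points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)  # -1 marks noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        # Step 2: grow a connected component from each unassigned
        # core point, expanding only through core points.
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        queue = [i]
        while queue:
            p = queue.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    # Step 3: non-core neighbors join as border points
                    # but do not expand the cluster any further.
                    labels[q] = cluster_id
                    if core[q]:
                        queue.append(q)
        cluster_id += 1
    return labels  # points still labeled -1 are noise
```

Production implementations such as sklearn.cluster.DBSCAN avoid the O(n^2) distance matrix by using spatial indexes, but the clustering logic is the same.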
Strengths of the DBSCAN algorithm:

The DBSCAN algorithm has the following advantages:
• The number of clusters in the data does not need to be specified at the start.
• It requires only two parameters and does not depend on the ordering of the points in the database.
• It can identify clusters of arbitrary shape. It is also able to identify clusters completely surrounded by a different cluster.
• DBSCAN is robust to outliers and has a notion of noise.

Weaknesses of the DBSCAN algorithm:

The DBSCAN algorithm has the following disadvantages:
• The DBSCAN algorithm is sensitive to its parameters, i.e., it is difficult to identify the correct set of parameters.
• The quality of the DBSCAN algorithm depends upon the distance measure used and on the parameters neighborhood radius (e) and MinPts.
• The DBSCAN algorithm cannot accurately cluster datasets with widely varying densities or large differences in density.

