
UNIT 5

CLUSTER ANALYSIS

Definition:

• Clustering is defined as grouping a set of similar objects into classes or clusters. In other words, during cluster analysis the data is grouped into classes or clusters so that records within a cluster (intra-cluster) have high similarity to one another but are highly dissimilar to objects in other clusters (inter-cluster).
• The similarity of records is identified on the basis of the values of the attributes describing the objects.
• Cluster analysis is an important human activity. The first human beings, Adam and Eve, actually learned through the process of clustering.
• They did not know the name of any object; they simply observed each and every object. Based on the similarity of their properties, they identified these objects in groups or clusters.

[Figure: Characteristics of clusters]
Applications of Cluster Analysis

Cluster analysis has been widely used in various important applications such as:
• Marketing: It helps marketers find distinctive groups among their customer base, and this knowledge helps them improve their targeted marketing programs.
• Land use: Clustering is used for identifying areas of similar land use from databases of earth observations.
• Insurance: Clustering is helpful for recognizing groups of insurance policyholders with a high average claim cost.
• City planning: It also helps in identifying clusters of houses based on house type, geographical location, and value.
• Earthquake studies: Clustering is helpful for the analysis of earthquakes, as it has been observed that earthquake epicenters are clustered along continental faults.
• Biology studies: Clustering helps in defining plant and animal classifications, identifying genes with similar functionality, and gaining insights into structures inherent in populations.
• Web discovery: Clustering is helpful in categorizing documents on the web for information discovery.
• Fraud detection: Clustering is also helpful in outlier detection applications such as credit card fraud detection.
Major Clustering Methods/Algorithms:

Clustering methods/algorithms can be grouped into the following five categories:
• Partitioning method: It constructs random partitions and then iteratively refines them by some criterion.
• Hierarchical method: It creates a hierarchical decomposition of the set of data (or objects) using some criterion.
• Density-based method: It is based on connectivity and density functions.
• Grid-based method: It is based on a multiple-level granularity structure.
• Model-based method: A model is considered for each of the clusters, and the idea is to identify the best fit of the data to that model.

i) Partitioning Clustering:

Clustering is the task of splitting a group of data or a dataset into a small number of clusters. For example, the items in a grocery store are grouped into different categories (butter, milk, and cheese are clustered in dairy products). This is a qualitative kind of partitioning. A quantitative approach, on the other hand, measures certain features of the products, such as the percentage of milk; that is, products with a high percentage of milk would be clustered together.
In the partitioning method, we cluster objects into a number of partitions based on their attributes.

The k-means algorithm is an important technique which falls under partitioning clustering.
k-means clustering:

In the k-means clustering algorithm, n objects are clustered into k clusters or partitions on the basis of attributes, where k < n and k is a positive integer. In simple words, in the k-means clustering algorithm, the objects are grouped into 'k' clusters on the basis of their attributes or features.
The grouping of objects is done by minimizing the sum of squared distances, i.e., the Euclidean distance between each data point and the corresponding cluster centroid.
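Expressed formally, this minimizes the standard k-means objective (a textbook formulation added here for clarity; the symbols below are not defined elsewhere in this unit):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

where C_j denotes the j-th cluster and mu_j its centroid, i.e., the mean of the points currently assigned to it.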

Working of the k-means algorithm

The working of the k-means clustering algorithm can be illustrated in five simple steps, as given below.

Step 1: Start with a selection of the value of k, where k = number of clusters
In this step, the k centroids (or clusters) are initiated: the first k training samples out of the n samples of data are taken as single-element clusters. Each of the remaining (n-k) training samples is assigned to the cluster with the nearest centroid, and the centroid of the gaining cluster is recomputed after each assignment.

Step 2: Create a distance matrix between the centroids and each pattern
In this step, the distance is computed from each sample of data to the centroid of each of the clusters. The heavy calculation involved is the major drawback of this step: since there are k centroids and n samples, the algorithm has to compute n*k distances.
Step 3: Assign each sample to the cluster with the closest centroid (minimal distance)
Now, the samples are grouped on the basis of their distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster. When samples no longer move between clusters, the algorithm ends.

Step 4: Update the new centroids for each cluster
In this step, the locations of the centroids are updated. Update the location of each centroid of a cluster that has gained or lost a sample by computing the mean of each attribute over all samples belonging to the respective cluster.

Step 5: Repeat until no further change occurs
Return to Step 2 of the algorithm and repeat the updating of each centroid location until a convergence condition is satisfied, i.e., until a pass through the training samples causes no new assignments.

[Figure: Flowchart of k-means clustering]
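To make the five steps concrete, here is a minimal NumPy sketch of the algorithm (an illustrative implementation written for these notes; the function name k_means and its parameters are our own, and Step 1 is simplified to picking k random samples as the initial centroids rather than the incremental scheme described above):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (simplified): choose k distinct samples as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance matrix between every sample and every centroid
        # (n*k distances per pass, as noted above).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each sample to its closest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its samples;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once no centroid moves any more (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```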


ii) Hierarchical Clustering Algorithms (HCA)

Hierarchical clustering is a type of cluster analysis which seeks to build a hierarchy of clusters. It is also called Hierarchical Cluster Analysis (HCA). Hierarchical clustering methods build a nested series of clusters, in contrast to partitioning methods, which generate only a flat set of clusters.

There are two types of hierarchical clustering: agglomerative and divisive.

Agglomerative:

This is a 'bottom-up' approach. In this approach, each object is a cluster by itself at the start, and the nearest clusters are repeatedly merged, resulting in larger and larger clusters, until some stopping criterion is met. The stopping criterion may be a specified number of clusters, or the stage at which all the objects are combined into a single large cluster, which is the highest level of the hierarchy.

Divisive:

This is a 'top-down' approach. In this approach, all objects start in one cluster, and splits are performed repeatedly, resulting in smaller and smaller clusters, until some stopping criterion is met or each cluster contains only a single object. Generally, the mergers and splits are decided in a greedy manner.
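As a brief illustration, agglomerative clustering can be run with SciPy's scipy.cluster.hierarchy module (a usage sketch under assumed inputs: the toy dataset X, the single-linkage choice, and the cut into two clusters are all our own illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D dataset: two loose groups of points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Bottom-up merging: 'single' linkage repeatedly merges the two
# clusters whose closest members are nearest to each other.
Z = linkage(X, method='single')

# Cut the resulting hierarchy into a flat set of 2 clusters
# (one possible stopping criterion: a specified number of clusters).
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
```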
Density-based clustering:

• Density-based clustering algorithms perform the clustering of data by forming clusters of points on the basis of the estimated density distribution of the corresponding points in the region.
• In density-based clustering, clusters are dense regions in the data space.
• These clusters are separated by regions of lower object density. A cluster is defined as a maximal set of density-connected points.
• The major strength of this approach is that it can identify clusters of arbitrary shape.
• A well-known density-based clustering method is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
• Familiarity with the following terms is a must in order to understand the working of the DBSCAN algorithm.

Neighborhood (e):

Neighborhood is an important term used in DBSCAN. It represents the set of objects that lie within a certain radius e of a given object. A high-density neighborhood results if an object has at least MinPts (minimum points) objects in its e-neighborhood.
Core, Border, and Outlier:

A point is known as a core point if it has at least a specified number of points (MinPts) within its neighborhood (e). These points lie in the interior of a cluster. A point is a border point if it has fewer than MinPts points within its neighborhood (e) but lies in the neighborhood of a core point. A point is a noise or outlier point if it is neither a core point nor a border point. The concept of core, border, and outlier points is illustrated below for a given value of MinPts.

[Figure: Concept of core, border, and outlier points]
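These three categories can be computed directly from pairwise distances. The following is a small sketch (classify_points is a hypothetical helper written for these notes, not a library function):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Split points of X into core, border, and outlier masks (hypothetical helper)."""
    # neighbors[i, j] is True if point j lies within radius eps of point i
    # (every point counts as its own neighbor).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    # Core: at least min_pts points inside the eps-neighborhood.
    core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-neighborhood of some core point.
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    # Outlier (noise): neither core nor border.
    outlier = ~core & ~border
    return core, border, outlier
```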

DBSCAN algorithm

After understanding the concepts of neighborhood, core, border, and outlier points and density-reachability, the DBSCAN algorithm can be summarized as given below.

Algorithm for DBSCAN:

1. Find the e (epsilon) neighbors of every point, and identify the core points as those with at least MinPts neighbors.
2. Find the connected components of the core points on the neighborhood graph, ignoring all non-core points.
3. Assign each non-core point to a nearby cluster if that cluster contains an e (eps) neighbor of it; otherwise, assign it to noise.
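These three steps translate almost directly into code. Below is a minimal NumPy sketch written for these notes (the function name dbscan, the -1 noise label, and the brute-force distance matrix are our own illustrative choices):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    # Step 1: find the eps-neighbors of every point and mark core points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)  # -1 marks noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        # Step 2: grow a connected component from each unassigned
        # core point, expanding only through core points.
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        queue = [i]
        while queue:
            p = queue.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    # Step 3: non-core neighbors join as border points
                    # but do not expand the cluster any further.
                    labels[q] = cluster_id
                    if core[q]:
                        queue.append(q)
        cluster_id += 1
    return labels  # points still labeled -1 are noise
```

Production implementations such as sklearn.cluster.DBSCAN avoid the O(n^2) distance matrix by using spatial indexes, but the clustering logic is the same.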
Strengths of the DBSCAN algorithm:

The DBSCAN algorithm has the following advantages:
• The number of clusters in the data does not need to be specified at the start.
• It requires only two parameters and does not depend on the ordering of the points in the database.
• It can identify clusters of arbitrary shape. It is also able to identify clusters completely surrounded by a different cluster.
• DBSCAN is robust to outliers and has a notion of noise.

Weaknesses of the DBSCAN algorithm:

The DBSCAN algorithm has the following disadvantages:
• The DBSCAN algorithm is sensitive to its parameters, i.e., it is difficult to identify the correct set of parameters.
• The quality of the DBSCAN algorithm depends upon the distance measure used and on the parameters neighborhood radius (e) and MinPts.
• The DBSCAN algorithm cannot accurately cluster datasets with widely varying densities or large differences in density.

