UNIT 5
Clustering
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar
objects.
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
• Clustering analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer
base and characterize those groups based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in
a city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information
discovery.
• Clustering is also used in outlier detection applications such as detection of
credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data to observe characteristics of each cluster.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partitions of the data. Each partition represents a cluster and k ≤ n. This
means that the method classifies the data into k groups, which satisfy the following
requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method will create an
initial partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to another.
K-Means Clustering
K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that
need to be created in the process; for example, if K=2, there will be two clusters, and
for K=3, there will be three clusters, and so on.
It allows us to cluster the data into different groups and provides a convenient way to
discover the categories of groups in an unlabeled dataset on its own, without the need
for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data points
and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until the best clusters are found. The value of k should be
predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative
process.
o Assigns each data point to its closest k-center. The data points which are
nearest to a particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is away from
the other clusters.
The K-means algorithm works through the following steps:
Step 1: Select the number K of clusters.
Step 2: Select K random points as the initial centroids (these need not be points from the input dataset).
Step 3: Assign each data point to its closest centroid, which forms the K initial clusters.
Step 4: Compute the new centroid (centre of gravity) of each cluster.
Step 5: Reassign each data point to the new closest centroid. If any reassignment took place, go back to step 4; otherwise the model is ready.
Let's understand these steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take the number k of clusters, i.e., K=2, so we will try to group the dataset
into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These
points can either be points from the dataset or any other points. So, here we
are selecting the below two points as K points, which are not part of our
dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have
studied to calculate the distance between two points. So, we will draw a median
between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer
to the K1 or blue centroid, and the points on the right side of the line are closer to the
yellow centroid. Let's colour them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing
new centroids. To choose the new centroids, we will compute the centre of
gravity of the points in each cluster and take these as the new centroids, as shown below:
Next, we will reassign each data point to the new centroids. For this, we will repeat the
same process of finding a median line. The median line will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line
and two blue points are to the right of the line. So, these three points will be assigned
to the new centroids.
As reassignment has taken place, we will again go to step 4, which is finding new
centroids or K-points.
o We will repeat the process by finding the centre of gravity of each cluster's
points, so the new centroids will be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and reassign
the data points. So, the image will be:
o We can see in the above image that no data points have changed sides of the
line, which means our model has converged. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
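As an illustrative sketch of the procedure above (not part of the original walkthrough), the same clustering can be run with scikit-learn, assuming it is available; the small two-variable dataset and the choice K=2 are assumptions made for the example.

# Minimal K-means sketch on two variables (M1, M2), assuming scikit-learn is installed.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data points measured on the two variables M1 and M2.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 9.1], [7.8, 8.0]])

# K=2 as in the walkthrough; KMeans iterates the assignment and centroid-update steps internally.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)                  # cluster index of each data point
print("Final centroids:", kmeans.cluster_centers_)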
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends on the quality of the
clusters that it forms, but choosing the optimal number of clusters is a challenging task.
There are different ways to find the optimal number of clusters; here we discuss the
most widely used method to find the number of clusters, i.e., the value of K.
The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of the WCSS value. WCSS stands for Within-
Cluster Sum of Squares, and it measures the total variation within the clusters. The
formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ_{Pi ∈ Cluster1} distance(Pi, C1)² + Σ_{Pi ∈ Cluster2} distance(Pi, C2)² + Σ_{Pi ∈ Cluster3} distance(Pi, C3)²
In the above formula, the term Σ_{Pi ∈ Cluster1} distance(Pi, C1)² is the sum of the squared
distances between each data point in Cluster 1 and its centroid C1, and the same holds
for the other two terms.
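The elbow procedure can be sketched as follows, assuming scikit-learn, where the fitted model's inertia_ attribute corresponds to the WCSS value above; the random dataset X is only a placeholder.

# Elbow-method sketch: compute WCSS (inertia_) for K = 1..10 and look for the "elbow".
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                        # placeholder dataset

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)                   # within-cluster sum of squares for this K

for k, value in zip(range(1, 11), wcss):
    print(k, round(value, 2))
# The K at which the WCSS curve bends sharply (the elbow) is chosen as the number of clusters.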
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keeps on doing so until all of the groups are merged into one
or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In each successive iteration, a cluster is split up into
smaller clusters. This is done until each object is in its own cluster or the termination condition
holds. This method is rigid, i.e., once a merging or splitting is done, it can never be
undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-
clustering on the micro-clusters.
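A minimal sketch of the bottom-up (agglomerative) approach described above, assuming SciPy is available: linkage() repeatedly merges the closest groups, and the resulting merge tree can be cut into a chosen number of flat clusters. The data points and the choice of average linkage are illustrative.

# Agglomerative (bottom-up) hierarchical clustering sketch using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.2, 4.8], [9.0, 9.0], [9.1, 8.9]])    # placeholder points

# Each point starts in its own cluster; the closest clusters are merged step by step.
Z = linkage(X, method="average")

# Cut the merge tree to obtain three flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)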
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
a given cluster as long as the density in the neighborhood exceeds some threshold,
i.e., for each data point within a given cluster, the neighborhood of a given radius has
to contain at least a minimum number of points.
Density-based spatial clustering of applications with noise (DBSCAN)
clustering method.
Clusters are dense regions in the data space, separated by regions of the lower
density of points. The DBSCAN algorithm is based on this intuitive notion of
“clusters” and “noise”. The key idea is that for each point of a cluster, the
neighbourhood of a given radius has to contain at least a minimum number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work
for finding spherical-shaped clusters or convex clusters. In other words, they are
suitable only for compact and well-separated clusters. Moreover, they are also
severely affected by the presence of noise and outliers in the data.
Real life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure
below.
2. Data may contain noise.
The figure below shows a data set containing nonconvex clusters and
outliers/noise. Given such data, the k-means algorithm has difficulty identifying
these clusters with arbitrary shapes.
DBSCAN algorithm requires two parameters:
1. eps : It defines the neighbourhood around a data point i.e. if the distance
between two points is lower or equal to ‘eps’ then they are considered as
neighbours. If the eps value is chosen too small, then a large part of the data
will be considered outliers. If it is chosen very large, then the clusters
will merge and the majority of the data points will be in the same cluster. One
way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbours (data points) within eps radius.
The larger the dataset, the larger the value of MinPts that should be chosen. As a
general rule, the minimum MinPts can be derived from the number of
dimensions D in the dataset as MinPts >= D+1. MinPts should be chosen to be
at least 3.
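A minimal DBSCAN sketch, assuming scikit-learn; eps and min_samples below correspond to the eps and MinPts parameters described above, and the values used are illustrative rather than recommended settings.

# DBSCAN sketch: eps is the neighbourhood radius, min_samples plays the role of MinPts.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((100, 2))                          # placeholder 2-D dataset

db = DBSCAN(eps=0.1, min_samples=3).fit(X)

# Label -1 marks noise points; all other labels identify the discovered clusters.
print("Labels:", db.labels_)
print("Number of clusters:", len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))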
Grid-based Method
In this method, the object space is quantized into a finite number of cells that together
form a grid structure, and the clustering operations are performed on the cells of this grid.
Advantages
• The major advantage of this method is fast processing time.
• The processing time depends only on the number of cells in each dimension of the
quantized space.
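The quantization idea can be illustrated with a rough sketch (not a standard library algorithm): each object is mapped to a grid cell, and further processing then depends on the cells rather than on the individual points. The cell size and any density threshold applied to the counts are assumptions.

# Grid-based quantization sketch: map points to cells and count the points in each cell.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = rng.random((500, 2))                          # placeholder objects in [0, 1) x [0, 1)
cell_size = 0.25                                  # assumed grid resolution

# Quantize each coordinate to a cell index; processing cost now depends on the number of cells.
cells = [tuple((point // cell_size).astype(int)) for point in X]
counts = Counter(cells)

# Cells whose counts exceed some threshold could be treated as dense regions.
print(counts.most_common(5))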
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for
a given model. This method locates the clusters by constructing a density function that
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
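One concrete instance of a model-based method is a Gaussian mixture, in which each cluster is hypothesized to follow a Gaussian distribution; the sketch below assumes scikit-learn and uses synthetic data generated from two groups.

# Model-based clustering sketch: fit a Gaussian mixture with one component per hypothesized cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(6, 1, (100, 2))])       # synthetic data drawn from two groups

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

# The fitted means and covariances describe the model hypothesized for each cluster.
print("Component means:", gmm.means_)
print("First ten labels:", labels[:10])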
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user
or the application requirement.
Outlier Analysis:
An outlier is a data object that deviates significantly from the rest of the data objects
and behaves in a different manner. Outliers can be caused by measurement or execution
errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be termed noise or error. Instead, outliers are suspected of not
being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers
1. Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers. If, in
a given dataset, a data point strongly deviates from all the rest of the data points, it
is known as a global outlier. Most of the outlier detection methods are aimed at
finding global outliers.
For example, in an intrusion detection system, if a large number of packets are
broadcast in a very short span of time, this may be considered a global outlier,
and we can say that the particular system has potentially been hacked.
Outlier Detection Methods
Models for Outlier Detection Analysis
There are several approaches to detecting Outliers. Outlier detection models may be
classified into the following groups:
1. Extreme Value Analysis
Extreme Value Analysis is the most basic form of outlier detection and is well suited to
1-dimensional data. In this outlier analysis approach, it is assumed that values which are
too large or too small are outliers. The Z-test and Student's t-test are classic examples.
These are good heuristics for an initial analysis of the data, but they do not have much
value in multivariate settings. Extreme Value Analysis is largely used as a final step for
interpreting the outputs of other outlier detection methods.
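A small sketch of extreme value analysis on 1-dimensional data, flagging values whose z-score magnitude exceeds a threshold; the threshold of 3 is a common rule of thumb rather than a value prescribed by the text, and the data are synthetic.

# Extreme value analysis sketch: flag values whose |z-score| exceeds a chosen threshold.
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(10, 0.5, 200), [25.0]])   # one injected extreme value

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]                       # threshold of 3 is a common heuristic

print("Flagged outliers:", outliers)                          # should flag the injected value 25.0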
2. Linear Models
In this approach, the data is modelled into a lower-dimensional sub-space with the use
of linear correlations. Then the distance of each data point to a plane that fits the sub-
space is calculated. This distance is used to find outliers. PCA (Principal
Component Analysis) is an example of linear models for anomaly detection.
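A hedged sketch of this idea using PCA, assuming scikit-learn: the data are projected onto a lower-dimensional subspace, each point's reconstruction error (its distance to that subspace) is computed, and the largest errors are treated as outlier candidates. The data and the number of components are illustrative.

# PCA-based outlier sketch: a large reconstruction error means the point lies far from the
# linear subspace that fits most of the data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)    # data lies close to a 2-D subspace
X[0] = [0.0, 0.0, 8.0]                            # one point placed far from that subspace

pca = PCA(n_components=2).fit(X)
reconstructed = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - reconstructed, axis=1)  # distance of each point to the subspace

print("Indices with the largest reconstruction errors:", np.argsort(errors)[-3:])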
3. Probabilistic and Statistical Models
In this approach, Probabilistic and Statistical Models assume specific distributions for
data. They make use of the expectation-maximization (EM) methods to estimate the
parameters of the model. Finally, they calculate the probability of membership of each
data point with respect to the calculated distribution. The points with a low probability
of membership are marked as outliers.
4. Proximity-based Models
In this method, outliers are modelled as points isolated from the rest of the
observations. Cluster analysis, density-based analysis, and nearest-neighbour analysis
are the principal approaches of this kind.
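A proximity-based sketch using the nearest-neighbour idea, assuming scikit-learn's LocalOutlierFactor; the data and the parameter choices are illustrative.

# Proximity-based outlier sketch: Local Outlier Factor compares the local density of each
# point with that of its nearest neighbours; isolated points receive the label -1.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               [[5.0, 5.0]]])                     # one isolated point far from the main cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 marks points considered outliers

print("Outlier indices:", np.where(labels == -1)[0])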
5. Information-Theoretic Models
In this method, the outliers increase the minimum code length to describe a data set.
2. Collective Outliers
As the name suggests, if in a given dataset, some of the data points, as a whole,
deviate significantly from the rest of the dataset, they may be termed as collective
outliers. Here, the individual data objects may not be outliers, but when seen as a
whole, they may behave as outliers. To detect these types of outliers, we might need
background information about the relationship between those data objects showing
the behaviour of outliers.
For example: in an intrusion detection system, a DoS (denial-of-service) packet sent
from one computer to another may be considered normal behaviour. However, if
this happens with several computers at the same time, then this may be considered
as abnormal behaviour and as a whole they can be termed as collective outliers.
The red data points as a whole are collective outliers.
3. Contextual Outliers
They are also known as Conditional Outliers. Here, a data object in a given dataset is
regarded as an outlier if it deviates significantly from the other data points only with
respect to a specific context or condition. A data point may be an outlier under a certain
condition and may show normal behaviour under another condition. Therefore, a context
has to be specified as part of the problem statement in order to identify contextual outliers.
Contextual outlier analysis provides flexibility for users where one can examine
outliers in different contexts, which can be highly desirable in many applications. The
attributes of the data objects are divided into two groups: contextual attributes, which
define the context, and behavioural attributes, which are evaluated within that context.
For example: A temperature reading of 40°C may behave as an outlier in the context
of a “winter season” but will behave like a normal data point in the context of a
“summer season”.
A low temperature value in June is a contextual outlier because the same value in
December is not an outlier.
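The temperature example can be sketched roughly with pandas, assuming it is available: the behavioural attribute (temperature) is judged within each context (the season), so the same reading can be flagged in one context and not in another. The column names, the data, and the threshold are assumptions.

# Contextual outlier sketch: z-scores of temperature are computed within each season (the context).
import pandas as pd

df = pd.DataFrame({
    "season":      ["winter", "winter", "winter", "summer", "summer", "summer"],
    "temperature": [2.0, 3.0, 40.0, 38.0, 40.0, 39.0],   # 40 degrees is unusual only in winter
})

grouped = df.groupby("season")["temperature"]
df["z_in_context"] = (df["temperature"] - grouped.transform("mean")) / grouped.transform("std")
df["contextual_outlier"] = df["z_in_context"].abs() > 1.0     # illustrative threshold

print(df)                                                     # only the winter 40-degree reading is flagged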
Outliers are generally defined as samples that are exceptionally far from the
mainstream of data. There is no rigid mathematical definition of what constitutes an
outlier; determining whether or not an observation is an outlier is ultimately a subjective
exercise.
An outlier may also be explained as a piece of data or observation that deviates
drastically from the given norm or average of the data set. An outlier may be caused
simply by chance, but it may also indicate measurement error or that the given data
set has a heavy-tailed distribution.
Therefore, Outlier Detection may be defined as the process of detecting and
subsequently excluding outliers from a given set of data. There are no standardized
Outlier identification methods as these are largely dependent upon the data set. Outlier
Detection as a branch of data mining has many applications in data stream analysis.
Our discussion will also cover standard application areas of Outlier Detection, such as
fraud detection, public health, and sports, and touch upon various approaches such as
proximity-based and angle-based approaches.