LAB Manual
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.07
A.1 Aim:
Implementation of K-means clustering using JAVA or WEKA.
A.2 Prerequisite:
Familiarity with the WEKA tool and programming languages.
A.3 Outcome:
After successful completion of this experiment students will be able to
➢ Use classification and clustering algorithms of data mining.
A.4 Theory:
K-means is one of the simplest unsupervised learning algorithms that solves the well-known
clustering problem. The procedure follows a simple and easy way to partition a given data
set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to
define k centers, one for each cluster. These centers should be chosen carefully, because
different initial locations lead to different results; a good heuristic is to place them as far
away from each other as possible. The next step is to take each point belonging to the data
set and associate it with the nearest center. When no point is pending, the first step is
completed and an early grouping is done. At this point we need to recalculate k new centroids
as the barycenters (means) of the clusters resulting from the previous step. After we have
these k new centroids, a new binding is done between the same data set points and the nearest
new center. A loop has thus been generated. As a result of this loop, the k centers change
their location step by step until no more changes occur, i.e., the centers no longer move.
Algorithmic steps for k-means clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of
centers.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and cluster centers.
3) Assign each data point to the cluster whose center is nearest.
4) Recalculate the new cluster centers using:
vi = (1/ci) * Σ xj, where the sum runs over all data points xj assigned to the ith cluster
and 'ci' represents the number of data points in the ith cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3.
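
The steps above can be expressed as a short, self-contained Java program. The following is
only a minimal sketch for illustration, not the prescribed lab solution: the toy 2-D data set,
the choice k = 2, and the fixed random seed are all assumptions made for this example.

import java.util.Arrays;
import java.util.Random;

public class KMeansDemo {

    // Squared Euclidean distance between two points.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy 2-D data set (illustrative values only).
        double[][] x = {
            {1.0, 1.0}, {1.5, 2.0}, {3.0, 4.0},
            {5.0, 7.0}, {3.5, 5.0}, {4.5, 5.0}, {3.5, 4.5}
        };
        int k = 2, n = x.length, dim = x[0].length;

        // Step 1: randomly pick k of the data points as the initial centers.
        Random rnd = new Random(42);
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = x[rnd.nextInt(n)].clone();

        int[] assign = new int[n];
        Arrays.fill(assign, -1);        // no point is assigned yet
        boolean changed = true;
        while (changed) {               // steps 2-6: loop until assignments are stable
            changed = false;
            // Steps 2-3: assign each point to the nearest center.
            for (int p = 0; p < n; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(x[p], centers[c]) < dist2(x[p], centers[best])) best = c;
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            // Step 4: recompute each center as the mean of its assigned points.
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (int p = 0; p < n; p++) {
                count[assign[p]]++;
                for (int d = 0; d < dim; d++) sum[assign[p]][d] += x[p][d];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)       // keep an empty cluster's old center unchanged
                    for (int d = 0; d < dim; d++) centers[c][d] = sum[c][d] / count[c];
        }

        for (int c = 0; c < k; c++)
            System.out.println("Center " + c + ": " + Arrays.toString(centers[c]));
        System.out.println("Assignments: " + Arrays.toString(assign));
    }
}

In WEKA, the same algorithm is available without coding through the SimpleKMeans clusterer
(Cluster tab of the Explorer, Choose -> SimpleKMeans), where k is set via the numClusters
parameter.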
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard or emailed to the concerned
lab in-charge faculty at the end of the practical in case there is no Blackboard access
available.)
Roll No.: Name:
Class: Batch:
Date of Experiment: Date of Submission:
Grade:
B.1 Software Code written by student:
(Paste the K-means clustering code, or the WEKA steps you followed, completed during the
2 hours of practical in the lab here)
B.2 Input and Output:
(Paste the input data set and the clustering output related to your case study in the following
format)
B.3 Observations and learning:
We understood the concept of clustering and how the K-means algorithm iteratively assigns
points to the nearest centroid and updates the centroids until they stabilize.
B.4 Conclusion:
K-means clustering was successfully implemented using WEKA.
B.5 Questions of Curiosity
(To be answered by student based on the practical performed and learning/observations)
Q1: What is clustering? What are the types of clustering? Explain the advantages and
disadvantages of clustering.
Clustering is a type of unsupervised learning method of machine learning.
In unsupervised learning, inferences are drawn from data sets that do not
contain a labelled output variable. It is an exploratory data analysis
technique that allows us to analyze multivariate data sets.
Types:
• Hard Clustering: In hard clustering, each data point either belongs to a
cluster completely or not. For example, if a retail store segments its customers
into 10 groups, each customer is put into exactly one of the 10 groups.
• Soft Clustering: In soft clustering, instead of putting each data point into a
single cluster, a probability or likelihood of that data point belonging to each
cluster is assigned. In the same retail-store scenario, each customer is assigned
a probability of belonging to each of the 10 clusters.
The main advantage of clustering is that it reveals natural groupings in unlabelled data,
making it useful for exploratory analysis, customer segmentation, and data summarization.
Disadvantages of clustering are that the results depend heavily on the chosen algorithm and
its parameters, that cluster quality is hard to evaluate without labels, and that many methods
are sensitive to feature scaling and outliers.
Q2: Give the advantages and disadvantages of K-means clustering.
Advantages of k-means:
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence.
• Can warm-start the positions of centroids.
• Easily adapts to new examples.
• Can be generalized (with suitable modifications) to clusters of different shapes and
sizes, such as elliptical clusters.
Disadvantages of k-means:
• Choosing k manually. Use a plot of loss (within-cluster sum of squares) versus the
number of clusters to find a good k, as discussed under Q3 below.
• Dependence on initial values. For a low k, you can mitigate this dependence by running
k-means several times with different initial values and picking the best result (see the
restart sketch after this list). As k increases, you need advanced versions of k-means to
pick better values of the initial centroids (called k-means seeding). For a full discussion
of k-means seeding see "A Comparative Study of Efficient Initialization Methods for the
K-Means Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela.
• Clustering data of varying sizes and density. k-means has trouble clustering data where
clusters are of varying sizes and density. To cluster such data, you need to generalize
k-means, for example by using a Gaussian mixture model, which allows clusters of
different sizes and shapes.
• Clustering outliers. Centroids can be dragged by outliers, or outliers might get their
own cluster instead of being ignored. Consider removing or clipping outliers before
clustering.
• Scaling with the number of dimensions. As the number of dimensions increases, a
distance-based similarity measure converges to a constant value between any given
examples. Reduce dimensionality either by using PCA on the feature data or by using
spectral clustering to modify the clustering algorithm.
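
To make the dependence on initial values concrete, the sketch below runs WEKA's
SimpleKMeans several times with different random seeds and keeps the run with the lowest
within-cluster squared error. The file name data.arff, the choice k = 3, and the 10 restarts
are assumptions for illustration only.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansRestarts {
    public static void main(String[] args) throws Exception {
        // "data.arff" is a placeholder; use any ARFF file without a class attribute.
        Instances data = new DataSource("data.arff").getDataSet();

        SimpleKMeans best = null;
        double bestError = Double.MAX_VALUE;
        for (int seed = 0; seed < 10; seed++) {      // 10 restarts with different seeds
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(3);                    // k is illustrative
            km.setSeed(seed);
            km.buildClusterer(data);
            if (km.getSquaredError() < bestError) {  // keep the lowest-error run
                bestError = km.getSquaredError();
                best = km;
            }
        }
        System.out.println("Best within-cluster squared error: " + bestError);
        System.out.println(best);
    }
}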
Q3: How is the number of clusters chosen?
The optimal number of clusters can be found as follows: run the clustering
algorithm (e.g., k-means clustering) for different values of k, for instance by
varying k from 1 to 10 clusters. For each k, calculate the total within-cluster sum of
squares (WSS). Then plot WSS against k; the location of a bend (an "elbow") in the plot
is generally taken as an indicator of the appropriate number of clusters. A sketch of this
procedure is given below.
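
As a rough illustration of this elbow-method procedure, the following sketch uses WEKA's
SimpleKMeans to print the WSS for k = 1 to 10; the file name data.arff is a placeholder
assumption.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ElbowMethod {
    public static void main(String[] args) throws Exception {
        // "data.arff" is a placeholder; use any ARFF file without a class attribute.
        Instances data = new DataSource("data.arff").getDataSet();

        // Vary k and record the total within-cluster sum of squares (WSS).
        for (int k = 1; k <= 10; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(1);                 // fixed seed so runs are comparable
            km.buildClusterer(data);
            System.out.println("k = " + k + "  WSS = " + km.getSquaredError());
        }
        // Plot WSS against k; the bend ("elbow") suggests a good value of k.
    }
}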