DWM Exp7 C49

LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.07
A.1 Aim:
Implementation of K-means clustering using JAVA or WEKA.

A.2 Prerequisite:
Familiarity with the WEKA tool and programming languages.

A.3 Outcome:
After successful completion of this experiment, students will be able to use classification and clustering algorithms of data mining.

A.4 Theory:

K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest center. When no point is pending, the first step is complete and an initial grouping is done. At this point we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. Once we have these k new centroids, a new binding is made between the same data set points and the nearest new center, and the process repeats. As a result of this loop, the k centers change their location step by step until no more changes occur, that is, until the centers no longer move.

Algorithmic steps for k-means clustering


Let X = {x1, x2, x3, ……, xn} be the set of data points and V = {v1, v2, ……, vc} be the set of cluster centers.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from that data point is the minimum over all cluster centers.
4) Recalculate the new cluster center using:
   vi = (1/ci) Σ (j = 1 to ci) xj
where ‘ci’ represents the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3).
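
As an illustration of the steps above, here is a minimal from-scratch sketch in Python/NumPy. It is our own addition, not part of the official manual; the function name, the convergence check via np.allclose, and the synthetic example data are all assumptions.

# Minimal from-scratch sketch of the algorithmic steps above (illustrative only)
import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k cluster centers from the data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: compute distances and assign each point to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recalculate each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Steps 5-6: stop once the centers (and hence the assignments) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the converged centers
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    return centers, labels

# Example usage on a small synthetic two-cluster data set
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
centers, labels = kmeans(X, k=2)
print(centers)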
PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the practical. The soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty at the end of the practical in case there is no Blackboard access available.)

B.1 Software Code written by student:


(Paste the software code related to your case study completed during the two hours of practical in the lab here)
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('diabetes_csv.csv')
x = dataset.iloc[:, [7, 5]].values

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # Initializing the list for the values of WCSS (within-cluster sum of squares)
# Using a loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()

# Training the K-means model on the dataset with the chosen number of clusters
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

# Visualizing the clusters and their centroids
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')   # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')  # second cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
mtp.title('Clusters of patients')
mtp.xlabel('Age (in years)')
mtp.ylabel('BMI (Body Mass Index)')
mtp.legend()
mtp.show()
B.2 Input and Output:
(Paste the input dataset and the clustering output related to your case study in the following format)
Jupyter Notebook:
Sample Dataset:

The Elbow Method Graph:

Python Output:

Weka Tool:
Weka Output:

B.3 Observations and learning:


(Students are expected to comment on the output obtained, with clear observations and learning for each task/sub-part assigned)
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. A cluster refers to a collection of data points aggregated together because of certain similarities. The ‘means’ in K-means refers to averaging of the data, that is, finding the centroid.
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual
outcome listed above and learning/observation noted in section B.3)
Hence, we have successfully implemented K-means clustering using Python as well as the Weka tool.
B.5 Questions of Curiosity
(To be answered by student based on the practical performed and
learning/observations)
1. What is Clustering? Types of clustering? Explain the advantages and
disadvantages of clustering.
Ans:
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In other words, clustering collects objects into groups on the basis of their similarity to one another and their dissimilarity to objects in other groups.
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some
sense) to each other than to those in other groups (clusters). It is the main task
of exploratory data mining, and a common technique for statistical data
analysis, used in many fields, including machine learning, pattern recognition,
image analysis, information retrieval, bioinformatics, data compression, and
computer graphics.

Types Of Clustering Algorithms


1. Connectivity-based Clustering (Hierarchical clustering)
2. Centroid-based Clustering (Partitioning methods)
3. Distribution-based Clustering (Model-based methods)
4. Density-based Clustering
5. Fuzzy Clustering
6. Constraint-based (Supervised) Clustering
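
As a rough, non-authoritative mapping (our own sketch, not part of the manual), several of these families correspond to readily available scikit-learn estimators; fuzzy and constraint-based clustering have no direct scikit-learn equivalent and are omitted. The toy data is an assumption for illustration.

# Sketch: common scikit-learn counterparts of the clustering families listed above
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 2)  # toy 2-D data (assumption)

hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)            # connectivity-based (hierarchical)
km_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)              # centroid-based (partitioning)
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)  # distribution-based (model-based)
db_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)                     # density-based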
Advantages and Disadvantages of Clustering
The main advantages of clustering are that it works on unlabeled data, reveals natural groupings and hidden structure, and helps summarize and explore large data sets.
The main disadvantages are that the results depend heavily on the chosen algorithm, distance measure, and parameters (such as the number of clusters), and that clustering can be sensitive to noise, outliers, and feature scaling.
2. Give the advantages and disadvantages of K-means clustering.
Ans:
Advantages of K-means clustering:
1. Relatively simple to implement.
2. Scales to large data sets.
3. Guarantees convergence.
4. Can warm-start the positions of centroids (see the sketch after the disadvantages list below).
5. Easily adapts to new examples.
6. Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of K-means clustering:
1. Choosing k manually.
2. Being dependent on initial values.
3. Clustering data of varying sizes and densities.
4. Clustering outliers.
5. Scaling with the number of dimensions.
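
A hedged sketch (our own illustration) of advantages 4 and 5 above: scikit-learn's KMeans accepts an array for init to warm-start the centroid positions, and predict assigns new examples to the learned centroids. The toy data and initial centroid guesses are assumptions.

# Sketch: warm-starting centroids and adapting to new examples in scikit-learn
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                            # toy data (assumption)
initial_centers = np.array([[0.2, 0.2], [0.8, 0.8]])  # assumed prior guesses for the centroids

# Passing an array as `init` starts K-means from these centroids; n_init=1 keeps that single start
model = KMeans(n_clusters=2, init=initial_centers, n_init=1).fit(X)

# "Easily adapts to new examples": new points are assigned to the nearest learned centroid
new_points = np.random.rand(5, 2)
print(model.predict(new_points))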

3. How is the number of clusters chosen?


Ans:
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be grouped. The Elbow Method is one of the most popular methods to determine this optimal value of k. Distortion is calculated as the average of the squared distances from the points to the cluster centres of their respective clusters; typically, the Euclidean distance metric is used. Inertia is the sum of squared distances of samples to their closest cluster centre. We iterate k over a range of values (for example, 1 to 10, as in the code above), calculate the distortion and inertia for each value of k, and choose the k at the ‘elbow’ where the decrease begins to level off.
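
A minimal sketch of this computation follows (our own illustration, assuming the same x array loaded in section B.1; the range 1 to 10 mirrors the elbow-method loop above).

# Sketch: computing distortion and inertia for each value of k
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

distortions, inertias = [], []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init='k-means++', random_state=42).fit(x)  # x as loaded in B.1
    # Distortion: average squared distance from each point to its nearest cluster centre
    distortions.append(np.mean(np.min(cdist(x, model.cluster_centers_, 'euclidean'), axis=1) ** 2))
    # Inertia: sum of squared distances of samples to their closest cluster centre
    inertias.append(model.inertia_)
print(distortions)
print(inertias)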
