Experiment - 6
Name: Ansari Mohammed Shanouf Valijan
Class: B.E. Computer Engineering, Semester - VII
UID: 2021300004
Batch: M
Aim:
To implement K-means Clustering for a particular dataset using R programming.
Objectives:
▪ To understand the basics of R programming and RStudio IDE.
▪ To use K-means clustering on a heart-disease dataset for better distinction among
groups/levels of severity of the disease.
▪ To visualize and interpret the clusters formed.
Outcomes:
▪ Familiarization with R programming as a tool to perform statistical analysis.
▪ Proper interpretation of the results.
Theory:
R programming is a powerful language widely used for statistical computing and data analysis.
It offers a vast array of packages and built-in functions that make it particularly suitable for
tasks involving data manipulation, visualization, and modeling. Among its many applications,
R excels in clustering analysis, which is crucial for uncovering patterns within datasets. One
popular clustering technique is K-means clustering, a method that partitions data into distinct
groups based on feature similarity.
K-means clustering in R can be implemented using the kmeans() function, which requires the
dataset and either the number of clusters or a set of initial centroids as input. In its basic
form, the algorithm works by iteratively assigning data points to the nearest cluster centroid
and updating the centroids based on these assignments. This process continues until the
centroids stabilize or an iteration limit is reached. The result is a local optimum that
depends on the initial centroids, so it is common to run several random starts (the nstart
argument) and keep the best one. The factoextra package, available from CRAN, offers
additional tools for visualizing clustering results, such as silhouette plots and cluster
plots, enhancing interpretability.
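The assign-and-update loop described above can be sketched by hand in a few lines. This is
only an illustration on synthetic two-dimensional data, not the code used in this experiment;
the data, the choice of starting centroids and all variable names are assumptions made for
the sketch. The built-in kmeans() is called at the end with the same starting centroids so
the two results can be compared.

```r
# Minimal Lloyd-style K-means on synthetic 2-D data (illustrative only).
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),  # rows 1-50: group near (0, 0)
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)   # rows 51-100: group near (4, 4)
)
k <- 2
# Assumed starting centroids: one observation taken from each half of the data.
start <- x[c(1, 51), , drop = FALSE]
centers <- start

for (iter in 1:100) {
  # Assignment step: squared Euclidean distance from every point to every centroid.
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")  # index of the nearest centroid
  # Update step: each centroid becomes the mean of the points assigned to it.
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))
  # Stop once the centroids no longer move.
  if (all(abs(new_centers - centers) < 1e-8)) break
  centers <- new_centers
}

# Built-in function with the same start, for comparison.
fit <- kmeans(x, centers = start, algorithm = "Lloyd")
```

On well-separated data such as this, the hand-written loop and kmeans() should agree on the
cluster assignments; on real data the outcome can depend on the starting centroids.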
In practice, using K-means clustering in R involves several key steps. First, data must be
pre-processed, which may include handling missing values, standardizing features, and
selecting relevant variables. Once the data is ready, the kmeans() function can be executed,
allowing the user to specify the desired number of clusters. It is common to use techniques
such as the elbow method to determine a suitable number of clusters: the within-cluster sum
of squares is plotted against different values of K, and the value beyond which the curve
flattens is chosen.
After fitting the K-means model, R provides various methods for analyzing and interpreting
the results. The final cluster assignments can be added back to the original dataset for further
analysis. Visualization tools, such as scatter plots with cluster color coding, help in
understanding the distribution of data points across clusters. R's rich ecosystem of libraries
and functions not only simplifies the clustering process but also empowers users to derive
meaningful insights from complex data.
Dataset Description:
For the purpose of experimenting with K-means, a heart-disease dataset from Kaggle was
utilized.
It consists of the records of various patients suffering from heart diseases. Features like age,
sex, chest pain type, blood pressure, serum cholesterol, fasting blood sugar, etc. are included
in the data. The aim of using K-means clustering here is to try to group the patients into
distinct levels of risk based on all the features available.
Implementation:
Following is the step-by-step implementation that was carried out in RStudio, using R
programming:
Importing the required libraries and the dataset
Viewing the data at a glance to get information about different types of variables used
Performing data imputation to handle missing values (Median for numerical columns and
Mode for categorical columns)
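A minimal sketch of this imputation step is given below, assuming a small made-up data frame
in place of the actual Kaggle file. The column names (age, chol, chest_pain) and the helper
stat_mode() are hypothetical and introduced only for illustration.

```r
# Hypothetical toy records standing in for the heart-disease data.
patients <- data.frame(
  age = c(63, 45, NA, 58, 51),
  chol = c(233, NA, 250, 204, 286),
  chest_pain = c("typical", "asymptomatic", NA, "asymptomatic", "non-anginal"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value; ties are resolved by the ordering of table().
stat_mode <- function(v) {
  counts <- table(v)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

for (col in names(patients)) {
  missing <- is.na(patients[[col]])
  if (!any(missing)) next
  if (is.numeric(patients[[col]])) {
    # Numerical column: fill with the median of the observed values.
    patients[[col]][missing] <- median(patients[[col]], na.rm = TRUE)
  } else {
    # Categorical column: fill with the most frequent observed level.
    patients[[col]][missing] <- stat_mode(patients[[col]])
  }
}
```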
Encoding the categorical columns and scaling the dataset
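One possible way to carry out this step is sketched below. It assumes one-hot encoding through
model.matrix() followed by scale(); the encoding scheme actually used in the experiment is not
recorded here and may differ (for example, integer codes). The toy data frame and its column
names are hypothetical.

```r
# Hypothetical toy records (already imputed) standing in for the real data.
patients <- data.frame(
  age = c(63, 45, 54, 58, 51),
  chol = c(233, 241, 250, 204, 286),
  chest_pain = c("typical", "asymptomatic", "asymptomatic",
                 "asymptomatic", "non-anginal")
)

# One-hot encode the categorical column. Dropping the intercept ("- 1")
# gives every chest-pain level its own 0/1 indicator column.
encoded <- model.matrix(~ . - 1, data = patients)

# Standardize every column to mean 0 and standard deviation 1 so that no
# single feature dominates the Euclidean distances used by K-means.
scaled <- scale(encoded)
```

Scaling matters because K-means relies on Euclidean distance: without it, a feature measured
in large units, such as cholesterol, would outweigh the 0/1 indicator columns.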
Viewing the dataset summary
Viewing the scaled dataset
Visualizing age distribution, cholesterol levels and chest pain types to better understand the
clusters obtained in the later steps
Using the elbow method to determine a suitable number of clusters based on the total
within-cluster sum of squares
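The elbow computation can be sketched as follows, assuming synthetic scaled features in place
of the real dataset. The range of K values tried (1 to 8) and nstart = 20 are arbitrary choices
made for the illustration.

```r
# Synthetic stand-in for the scaled feature matrix: three loose groups.
set.seed(123)
features <- scale(rbind(
  matrix(rnorm(120, mean = 0), ncol = 2),
  matrix(rnorm(120, mean = 6), ncol = 2),
  cbind(rnorm(60, mean = 0), rnorm(60, mean = 6))
))

# Total within-cluster sum of squares for each candidate K.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the K beyond which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```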
Performing the final K-means with K = 3 (obtained from the elbow plot) and segregating the
patients into different risk groups based on cholesterol and age
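A minimal sketch of this step is shown below. The data is synthetic and the column names age
and chol are hypothetical; only K = 3 is taken from the experiment. The seed and nstart = 25
are assumptions added to make the run reproducible and less sensitive to the random start.

```r
# Synthetic stand-in for the patient records (age and serum cholesterol only).
set.seed(2021)
patients <- data.frame(
  age = round(c(rnorm(40, 45, 4), rnorm(40, 58, 4), rnorm(40, 66, 4))),
  chol = round(c(rnorm(40, 190, 15), rnorm(40, 250, 15), rnorm(40, 320, 15)))
)

# Fit K-means on the standardized features with K = 3.
fit <- kmeans(scale(patients), centers = 3, nstart = 25)

# Attach the cluster label to every patient record.
patients$cluster <- factor(fit$cluster)

# Per-cluster averages in the original units help to interpret each group.
cluster_means <- aggregate(cbind(age, chol) ~ cluster, data = patients, FUN = mean)
print(cluster_means)
print(table(patients$cluster))
```

The cluster numbers themselves carry no order; which cluster corresponds to which risk level
has to be read off from the per-cluster averages.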
Cluster projections as obtained (Considering cholesterol and age as the major components)
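A projection of this kind can be drawn with base R graphics, as sketched below on synthetic
data. The column names, colours and plotting symbols are illustrative assumptions and not the
settings used for the original figure.

```r
# Synthetic stand-in for the patient records.
set.seed(7)
patients <- data.frame(
  age = c(rnorm(40, 45, 4), rnorm(40, 58, 4), rnorm(40, 66, 4)),
  chol = c(rnorm(40, 190, 15), rnorm(40, 250, 15), rnorm(40, 320, 15))
)
scaled <- scale(patients)
fit <- kmeans(scaled, centers = 3, nstart = 25)

# Scatter plot of age against cholesterol, coloured by cluster.
plot(patients$age, patients$chol, col = fit$cluster, pch = 19,
     xlab = "Age", ylab = "Serum cholesterol")

# Convert the centroids from scaled units back to the original units.
centers <- sweep(fit$centers, 2, attr(scaled, "scaled:scale"), "*")
centers <- sweep(centers, 2, attr(scaled, "scaled:center"), "+")
points(centers, pch = 4, cex = 2, lwd = 2)
legend("topleft", legend = paste("Cluster", 1:3), col = 1:3, pch = 19)
```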
Conclusion:
By performing this experiment, I was able to get familiar with R programming. Further, I was
able to write a program in R through RStudio that performs K-means clustering on a dataset
consisting of records of patients suffering from heart diseases. While finding the optimum
number of clusters, the elbow method showed no significant reduction in the within-cluster
sum of squares beyond K = 3. The resulting clusters, when projected on the XY plane with
cholesterol and age as the major components, show a good amount of separation. This hints
at three classes of risk among heart patients, which one may understand as mild, moderate
and severe.