Internship Report - ML
21INT68 - Innovation/Entrepreneurship/Societal Internship
Submitted by
1BI21CS140 Shreya VR
2023-2024
Certificate
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without thanking those who made it possible and whose
guidance and encouragement made my efforts successful. So, my sincere thanks to all
those who have supported me in completing this Internship successfully.
My sincere thanks to Dr. M. U. Aswath, Principal, BIT and Dr. J. Girija, HOD,
Department of CSE, BIT for their encouragement, support and guidance to the student
community in all fields of education. I am grateful to our institution for providing us with a
congenial atmosphere to carry out the Internship successfully.
I extend my sincere thanks to all the department faculty members and non-teaching staff
for supporting me directly or indirectly in the completion of this Internship.
NAME: SHREYA VR
USN: 1BI21CS140
TABLE OF CONTENTS
Chapter 1 - Introduction 1
1.1 Overview 1
1.2 Objective 1
1.3 Purpose, Scope and Applicability 2
1.3.1 Purpose 2
1.3.2 Scope 2
1.3.3 Applicability 3
1.4 Organization of Report 4
Chapter 2 - Problem Statement 5
Chapter 3 - Methodology/System Architecture/Algorithm 7
Chapter 4 - Tools/Technologies 9
Chapter 5 - Implementation 11
5.1 Source code 11
Chapter 6 - Results 17
Chapter 7 - Reflection Notes 20
Chapter 8 - References 21
Internship Certificate 22
LIST OF FIGURES
Figure No.  Description  Page No.
2.1  Sample Dataset  6
3.1  System Architecture  7
6.1  Results for Random Forest Classifier  17
6.2  Results for Decision Tree Classifier  17
6.3  Results for Support Vector Machine  18
6.4  (a) Prediction on new data_1; (b) Output for new data_1  18
6.5  (a) Prediction on new data_2; (b) Output for new data_2  19
Chapter 1
Introduction
1.1 Overview
Machine Learning models hold great promise in diabetes prediction by providing insights
into individual risk profiles and empowering both patients and healthcare providers to
take proactive steps towards better management and prevention of diabetes.
1.2 Objective
The objective of this project is to develop and evaluate machine learning models for the
prediction of diabetes based on healthcare data. The primary goal is to create predictive
models capable of accurately identifying individuals at risk of developing diabetes before
symptoms appear. Through data analysis, preprocessing, and model training, the project
aims to harness the predictive power of machine learning algorithms to improve early
detection and management of diabetes.
Furthermore, the project seeks to explore and compare the performance of different
machine learning classifiers, including Random Forest, Decision Tree, and Support Vector
Machine (SVM), in predicting diabetes. By employing techniques such as grid search for
hyperparameter tuning and cross-validation for robust evaluation, the project aims to
identify the most effective model for diabetes prediction. Ultimately, the project aims to
provide insights into the utility of machine learning approaches in healthcare and
contribute to the advancement of predictive modeling in diabetes care.
1.3 Purpose, Scope and Applicability
1.3.1 Purpose
The purpose of this project is to develop and evaluate a machine learning-based
predictive model for the early detection of diabetes. Diabetes is a prevalent chronic
disease that poses significant health risks and burdens on individuals and healthcare
systems worldwide.
Early detection plays a crucial role in preventing or delaying the onset of diabetes-related
complications, thus improving patient outcomes and reducing healthcare costs. By
leveraging machine learning algorithms and predictive modeling techniques, this project
aims to harness the power of data analytics to accurately predict an individual's risk of
developing diabetes based on various health parameters and risk factors.
1.3.2 Scope
The scope of this project encompasses the development, implementation, and evaluation
of a machine learning-based predictive model for diabetes detection. The project aims to
utilize a dataset containing relevant health parameters and risk factors to train and
evaluate various machine learning algorithms, including Random Forest, Decision Tree,
and Support Vector Machine classifiers. The project will explore different strategies for
handling missing data, outliers, and feature engineering to optimize the predictive model's
accuracy and reliability.
Additionally, the scope of the project extends to the evaluation and validation of the
predictive models using appropriate metrics such as accuracy, confusion matrices, and
classification reports.
Furthermore, the scope of the project extends to the practical application of the predictive
model in healthcare settings. This includes developing an interface for healthcare
professionals to input patient data and obtain predictions regarding their risk of
developing diabetes. Overall, the scope of this project is to provide a comprehensive
analysis of machine learning techniques for diabetes detection, with the aim of
contributing to the advancement of personalized, data-driven approaches to diabetes
management and prevention.
1.3.3 Applicability
The applicability of this project extends across various sectors of healthcare, offering
potential benefits to both patients and healthcare providers. Firstly, the developed
machine learning-based predictive model for diabetes detection can significantly improve
patient outcomes by enabling early intervention and personalized care. By accurately
identifying individuals at risk of developing diabetes before symptoms manifest,
healthcare professionals can initiate timely preventive measures, lifestyle modifications,
and medical interventions to mitigate the progression of the disease and reduce the
likelihood of complications.
The predictive model can be integrated into clinical practice to support healthcare
providers in making informed decisions regarding patient care. By incorporating the
predictive model into electronic health records or clinical decision support systems,
healthcare professionals can access real-time risk assessments and recommendations for
diabetes management during routine patient visits.
The applicability of this project extends beyond clinical settings to include wellness and
preventive healthcare programs. By empowering individuals with knowledge about their
risk of diabetes and providing them with tailored interventions and support, these
programs have the potential to improve overall population health outcomes and reduce
healthcare costs associated with diabetes-related complications.
The applicability of this project in healthcare spans across clinical practice, population
health management, and preventive healthcare initiatives, offering transformative benefits
to individuals, healthcare systems, and communities alike.
Chapter 2
Problem Statement
The binary outcome variable (Outcome) indicates whether a patient has diabetes (1) or
not (0). The goal is to create a robust predictive model that can assist in early diabetes
diagnosis and risk assessment, ultimately improving patient care and health outcomes.
Outcome: Class variable (either 0 or 1). 268 of the 768 values are 1, and the others are 0
(target variable).
Sample Input:
Number of Pregnancies: 5
Glucose Level: 130
Blood Pressure: 70 mmHg
Skin Thickness: 30 mm
Insulin Level: 80
Body Mass Index (BMI): 25
Diabetes Pedigree Function: 0.5
Age: 35
Sample Dataset: see Fig 2.1 (Sample Dataset).
Chapter 3
System Architecture
Data Collection:
The initial phase involves gathering relevant healthcare data from various sources,
including medical records, wearable devices, and health surveys. This data may include
patient demographics, medical history, biometric measurements (e.g., glucose levels,
blood pressure), and lifestyle factors. Techniques for data collection may involve
accessing electronic health records, utilizing health monitoring devices, and conducting
health assessments.
Data Preprocessing:
Raw healthcare data often requires preprocessing to ensure its quality and suitability for
model training. This crucial step involves cleaning the data to remove errors and
inconsistencies, handling missing values, and addressing outliers. Additionally, features
may be transformed or engineered to extract relevant information. For diabetes
prediction, preprocessing may include standardizing glucose levels, categorizing blood
pressure readings, and encoding categorical variables.
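As an illustration, a minimal preprocessing sketch in Python (assuming the Pima Indians Diabetes file name 'diabetes.csv' and its standard column names) could look like this:
# Minimal preprocessing sketch; the file name and column names are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('diabetes.csv')
# Zero glucose readings are physiologically implausible, so treat them as missing
data['Glucose'] = data['Glucose'].replace(0, data['Glucose'].mean())
# Standardize all feature columns (everything except the Outcome label)
features = StandardScaler().fit_transform(data.drop('Outcome', axis=1))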
Model Training:
During this phase, machine learning models are trained on the preprocessed healthcare
data to predict the likelihood of diabetes occurrence.
Decision Trees are versatile models capable of handling both classification and regression
tasks. They partition the feature space into distinct regions based on feature values,
allowing for intuitive interpretation of decision-making processes.
The ANN model consists of multiple layers with dropout regularization to prevent
overfitting.
Hidden layers: Comprise several dense layers with ReLU activation functions to
introduce non-linearity.
Output layer: Single neuron with sigmoid activation function, producing binary
predictions (0 or 1) for diabetes outcome.
Model Construction:
Build an ANN model using the Keras Sequential API. Configure dense layers with various
activation functions and dropout regularization to prevent overfitting.
Model Compilation:
Compile the model using binary cross-entropy loss and the Adam optimizer. Set accuracy
as the evaluation metric.
Model Training:
Train the model on the training data for 50 epochs with a batch size of 128. Utilize a
validation split of 10% to monitor training progress, as sketched below.
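The compilation and training settings described above correspond to Keras calls along the following lines (a sketch only; the model definition and the training arrays X_train and y_train are assumed to be prepared as in Chapter 5):
# Compile with binary cross-entropy and Adam, tracking accuracy
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train for 50 epochs, batch size 128, with a 10% validation split
history = model.fit(X_train, y_train, epochs=50, batch_size=128, validation_split=0.1)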
Prediction:
Once trained and evaluated, the predictive models can be deployed to make predictions
on new healthcare data. New patient data undergoes preprocessing similar to the training
data and is then fed into the trained models. The models leverage the learned patterns
from the training data to generate predictions regarding an individual's likelihood of
developing diabetes, aiding healthcare professionals in early detection and preventive
interventions.
These stages lay the foundation for deploying machine learning models as decision
support tools in clinical settings, empowering healthcare professionals with valuable
insights to tailor interventions, optimize treatment plans, and promote proactive strategies
for diabetes management and prevention.
Chapter 4
Tools/Technologies
To implement this project, I have used Jupyter Notebook and Python 3 (ipykernel). The
minimum requirements are:
Operating System:
Windows: The code can be executed on computers running Microsoft Windows
operating systems, including Windows 7, Windows 8, and Windows 10.
macOS: It is also compatible with macOS, the operating system used on Apple
Macintosh computers.
Linux: The code can run on various distributions of Linux, including Ubuntu,
Fedora, CentOS, and others.
Memory (RAM):
The amount of RAM required depends on the size of the dataset being processed
and the complexity of the machine learning models being trained.
For small to medium-sized datasets and relatively simple models, 4GB to 8GB of
RAM should be sufficient.
Processor (CPU):
The processor's speed and number of cores influence the code's execution time,
especially during data preprocessing and model training.
Any modern multi-core processor (e.g., Intel Core i5, i7, or i9 series for Intel
CPUs, or AMD Ryzen series for AMD CPUs) should be sufficient for running the
code.
Storage Space:
Recommended minimum: 200 MB free disk space for the VS Code installation.
Additional space will be required for your projects and dependencies.
Internet Connectivity:
Internet connectivity is not mandatory, but some features like extensions and
updates require an internet connection.
Additionally, one can use the Google Colab environment to run this code, which will
require an internet connection.
Programming Language:
Python is the programming language used in the code. Ensure that Python is installed on
the system. One can download and install Python from the official Python website.
Additionally, it's recommended to use a virtual environment management tool like
virtualenv or conda to create isolated Python environments for managing dependencies
and avoiding conflicts between different projects.
Development Environment:
A development environment is needed to write, execute, and manage the Python code. Popular
choices include Jupyter Notebooks, Google Colab, Anaconda, or any text editor/IDE
(Integrated Development Environment) like VSCode, PyCharm, or Sublime Text.
Chapter 5
Implementation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import seaborn as sns
import matplotlib.pyplot as plt
#Data Analysis
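# The loading step is not shown in the listing; the file name 'diabetes.csv'
# below is an assumption (Pima Indians Diabetes dataset).
data = pd.read_csv('diabetes.csv')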
data.head()
data.tail()
print("Number of Rows",data.shape[0])
print("Number of Columns",data.shape[1])
data.info()
data_copy = data.copy(deep=True)
data.columns
data['Glucose'] = data['Glucose'].replace(0,data['Glucose'].mean())
data['BloodPressure'] = data['BloodPressure'].replace(0,data['BloodPressure'].mean())
data['SkinThickness'] = data['SkinThickness'].replace(0,data['SkinThickness'].mean())
data['Insulin'] = data['Insulin'].replace(0,data['Insulin'].mean())
data['BMI'] = data['BMI'].replace(0,data['BMI'].mean())
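# Feature/target split and hold-out split (omitted in the listing); the 80/20
# split ratio and random_state are assumptions.
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)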
#Training Model
pipeline_rf = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
param_grid_rf = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [None, 10, 20],
'classifier__min_samples_split': [2, 5, 10],
'classifier__min_samples_leaf': [1, 2, 4],
'classifier__max_features': ['sqrt', 'log2']  # 'auto' is not accepted by recent scikit-learn
}
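# Grid search and evaluation for the Random Forest pipeline; this step is not
# shown in the listing, and 5-fold cross-validation is an assumption.
grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, n_jobs=-1)
grid_rf.fit(X_train, y_train)
y_pred_rf = grid_rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
classification_report_str_rf = classification_report(y_test, y_pred_rf)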
print("RandomForestClassifier Results:")
print(f"Accuracy: {accuracy_rf}")
print(f"Confusion Matrix:\n{conf_matrix_rf}")
print("Classification Report:\n", classification_report_str_rf)
pipeline_dt = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('classifier', DecisionTreeClassifier(random_state=42))
])
param_grid_dt = {
'classifier__max_depth': [None, 10, 20],
'classifier__min_samples_split': [2, 5, 10],
'classifier__min_samples_leaf': [1, 2, 4],
'classifier__max_features': ['sqrt', 'log2']  # 'auto' is not accepted by recent scikit-learn
}
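# Grid search and evaluation for the Decision Tree pipeline; this step is not
# shown in the listing, and 5-fold cross-validation is an assumption.
grid_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=5, n_jobs=-1)
grid_dt.fit(X_train, y_train)
y_pred_dt = grid_dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
classification_report_str_dt = classification_report(y_test, y_pred_dt)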
print("DecisionTreeClassifier Results:")
print(f"Accuracy: {accuracy_dt}")
print(f"Confusion Matrix:\n{conf_matrix_dt}")
print("Classification Report:\n", classification_report_str_dt)
pipeline_svm = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('classifier', SVC(random_state=42))
])
param_grid_svm = {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['linear', 'rbf', 'poly'],
'classifier__gamma': ['scale', 'auto']
}
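# Grid search and evaluation for the SVM pipeline; this step is not shown in
# the listing, and 5-fold cross-validation is an assumption.
grid_svm = GridSearchCV(pipeline_svm, param_grid_svm, cv=5, n_jobs=-1)
grid_svm.fit(X_train, y_train)
y_pred_svm = grid_svm.predict(X_test)
print("Support Vector Machine Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_svm)}")
print("Classification Report:\n", classification_report(y_test, y_pred_svm))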
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=2,
                                  min_samples_leaf=1, max_features='sqrt',  # 'auto' is not accepted by recent scikit-learn
                                  random_state=42)
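# Fit the fixed-hyperparameter Random Forest used for the new-data prediction
# below (the fit call is omitted in the listing).
rf_model.fit(X_train, y_train)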
# Example of new data (modify this according to your feature names and values)
new_data_example = pd.DataFrame({
'Pregnancies': [5],
'Glucose': [130],
'BloodPressure': [70],
'SkinThickness': [30],
'Insulin': [80],
'BMI': [25],
'DiabetesPedigreeFunction': [0.5],
'Age': [35]
})
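# Prediction on the new sample; the call itself is omitted in the listing,
# so reusing rf_model here is an assumption.
prediction = rf_model.predict(new_data_example)
print("Predicted Outcome for new data:", prediction[0])

# Imports for the ANN below; a TensorFlow/Keras backend is assumed.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout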
model = Sequential([
Dense(512, activation='relu', input_shape=(X_train.shape[1],)),
Dropout(0.5),
Dense(128, activation='relu'),
Dropout(0.5),
Dense(256, activation='relu'),
Dropout(0.5),
Dense(128, activation='relu'),
Dropout(0.5),
Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
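# Train the ANN as described in Chapter 3 (50 epochs, batch size 128, 10%
# validation split); this step is omitted in the listing.
history = model.fit(X_train, y_train, epochs=50, batch_size=128, validation_split=0.1)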
Chapter 6
Results
Fig 6.1: Results for Random Forest Classifier
Chapter 7
Reflection Notes
Delving into the realm of machine learning, Python emerged as a cornerstone due to its
rich ecosystem of libraries tailored for data analysis and modeling. scikit-learn, NumPy,
and Pandas stood out as indispensable tools, each playing pivotal roles in various stages
of the machine learning pipeline. NumPy provided a solid foundation for numerical
computations and efficient handling of arrays, while Pandas facilitated seamless data
manipulation and preprocessing tasks, thanks to its powerful DataFrame and Series
structures. Meanwhile, scikit-learn offered a comprehensive suite of machine learning
algorithms and utilities, simplifying the implementation and evaluation of predictive
models.
The Random Forest algorithm emerged as a particularly robust choice for predictive
modeling endeavors. Its ensemble nature, which harnesses the collective wisdom of
multiple decision trees, proved effective in enhancing predictive accuracy while
mitigating the risk of overfitting. By aggregating predictions from diverse individual
trees, Random Forests fostered robustness and resilience, making them well-suited for a
wide range of classification and regression tasks across different domains.
Chapter 8
References
6. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html