Internship Report ML'

Uploaded by

Shreya Rangachar
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“Jnana Sangama”, Belagavi - 590 018, Karnataka.

21INT68 - Innovation/Entrepreneurship/Societal Internship

“DATA ANALYSIS USING PYTHON”


Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Engineering
In
Computer Science & Engineering

Submitted by

1BI21CS140 Shreya VR

Under the Guidance of


Dr. Maya B.S
Assistant Professor
Dept. of CSE, BIT

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


BANGALORE INSTITUTE OF TECHNOLOGY
K. R. Road, V. V. Puram, Bengaluru - 560 004

2023-2024

VISVESVARAYA TECHNOLOGICAL UNIVERSITY


“Jnana Sangama”, Belagavi-590 018, Karnataka

BANGALORE INSTITUTE OF TECHNOLOGY


Bengaluru-560 004

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Certificate

Certified that the Innovation/Entrepreneurship/Societal Internship (21INT68)
work entitled “Data Analysis using Python” was carried out by Ms. Shreya VR bearing USN
1BI21CS140, a bonafide student of Bangalore Institute of Technology in partial fulfillment
for the award of Bachelor of Engineering in Computer Science & Engineering of the
Visvesvaraya Technological University, Belagavi during the academic year 2023-2024. It
is certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the departmental library.
The Internship report has been approved as it satisfies the academic requirements
in respect of Internship work prescribed for the said degree.

Guide
Dr. Maya B.S.
Assistant Professor
Department of CSE, BIT

Dr. J. Girija
Professor and Head
Department of CSE, BIT

ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without complimenting those who made it possible and whose
guidance and encouragement made my efforts successful. So, my sincere thanks to all
those who have supported me in completing this Internship successfully.

My sincere thanks to Dr. M. U. Aswath, Principal, BIT and Dr. J. Girija, HOD,
Department of CSE, BIT for their encouragement, support and guidance to the student
community in all fields of education. I am grateful to our institution for providing us a
congenial atmosphere to carry out the Internship successfully.

I would also like to thank Dr. Bhanushree K J, Associate Professor and 21INT68 -
Innovation/Entrepreneurship/Societal Internship Coordinator, Department of CSE,
BIT, for her encouragement and, moreover, for her timely support and guidance till the
completion of the Internship.

I avail this opportunity to express my profound sense of deep gratitude to my esteemed
guide Dr. Maya B.S., Assistant Professor, Department of CSE, BIT, for her moral
support, encouragement and valuable suggestions throughout the Internship.

I extend my sincere thanks to all the department faculty members and non-teaching staff
for supporting me directly or indirectly in the completion of this Internship.

NAME: SHREYA VR
USN: 1BI21CS140

TABLE OF CONTENTS
Chapter 1 - Introduction
1.1 Overview
1.2 Objective
1.3 Purpose, Scope and Applicability
1.3.1 Purpose
1.3.2 Scope
1.3.3 Applicability
1.4 Organization of Report
Chapter 2 - Problem Statement
Chapter 3 - Methodology/System Architecture/Algorithm
Chapter 4 - Tools/Technologies
Chapter 5 - Implementation
5.1 Source Code
Chapter 6 - Results
Chapter 7 - Reflection Notes
Chapter 8 - References
Internship Certificate
LIST OF FIGURES
Figure No.   Description
2.1          Sample Dataset
3.1          System Architecture
6.1          Results for Random Forest Classifier
6.2          Results for Decision Tree Classifier
6.3          Results for Support Vector Machine
6.4          (a) Prediction on new data_1, (b) Output for new data_1
6.5          (a) Prediction on new data_2, (b) Output for new data_2

Chapter 1
Introduction

1.1 Overview

Diabetes is a chronic metabolic disorder characterized by elevated blood sugar levels,
which can lead to serious health complications if left untreated. Predictive modeling using
machine learning (ML) techniques has emerged as a valuable tool in the early detection
and management of diabetes.

ML models analyze various factors such as demographics, lifestyle habits, medical
history, and biomarkers to predict an individual's risk of developing diabetes. By
leveraging large datasets and sophisticated algorithms, these models can identify patterns
and trends that may not be apparent through traditional methods.

The predictive power of ML models enables healthcare providers to intervene early,
offering personalized recommendations for lifestyle modifications, preventive measures,
and targeted screenings. This proactive approach improves patient outcomes by enabling
timely interventions to prevent or delay the onset of diabetes-related complications.

Machine Learning models hold great promise in diabetes prediction by providing insights
into individual risk profiles and empowering both patients and healthcare providers to
take proactive steps towards better management and prevention of diabetes.

1.2 Objective

The objective of this project is to develop and evaluate machine learning models for the
prediction of diabetes based on healthcare data. The primary goal is to create predictive
models capable of accurately identifying individuals at risk of developing diabetes before
symptoms appear. Through data analysis, preprocessing, and model training, the project
aims to harness the predictive power of machine learning algorithms to improve early
detection and management of diabetes.

Furthermore, the project seeks to explore and compare the performance of different
machine learning classifiers, including Random Forest, Decision Tree, and Support Vector
Machine (SVM), in predicting diabetes. By employing techniques such as grid search for
hyperparameter tuning and cross-validation for robust evaluation, the project aims to
identify the most effective model for diabetes prediction. Ultimately, the project aims to
provide insights into the utility of machine learning approaches in healthcare and
contribute to the advancement of predictive modeling in diabetes care.

1.3 Purpose, Scope and Applicability

1.3.1 Purpose
The purpose of this project is to develop and evaluate a machine learning-based
predictive model for the early detection of diabetes. Diabetes is a prevalent chronic
disease that poses significant health risks and burdens on individuals and healthcare
systems worldwide.

Early detection plays a crucial role in preventing or delaying the onset of diabetes-related
complications, thus improving patient outcomes and reducing healthcare costs. By
leveraging machine learning algorithms and predictive modeling techniques, this project
aims to harness the power of data analytics to accurately predict an individual's risk of
developing diabetes based on various health parameters and risk factors.

By evaluating different machine learning algorithms, including Random Forest, Decision
Tree, and Support Vector Machine classifiers, this project aims to identify the most
suitable approach for predicting diabetes risk. Ultimately, the purpose of this project is to
advance our understanding of the role of machine learning in healthcare and to empower
healthcare providers with tools that can improve patient outcomes and enhance
population health.

1.3.2 Scope
The scope of this project encompasses the development, implementation, and evaluation
of a machine learning-based predictive model for diabetes detection. The project aims to
utilize a dataset containing relevant health parameters and risk factors to train and
evaluate various machine learning algorithms, including Random Forest, Decision Tree,
and Support Vector Machine classifiers. The project will explore different strategies for
handling missing data, outliers, and feature engineering to optimize the predictive model's
accuracy and reliability.

Additionally, the scope of the project extends to the evaluation and validation of the
predictive models using appropriate metrics such as accuracy, confusion matrices, and
classification reports.

Furthermore, the scope of the project extends to the practical application of the predictive
model in healthcare settings. This includes developing an interface for healthcare
professionals to input patient data and obtain predictions regarding their risk of
developing diabetes. Overall, the scope of this project is to provide a comprehensive
analysis of machine learning techniques for diabetes detection, with the aim of
contributing to the advancement of personalized, data-driven approaches to diabetes
management and prevention.

1.3.3 Applicability
The applicability of this project extends across various sectors of healthcare, offering
potential benefits to both patients and healthcare providers. Firstly, the developed
machine learning-based predictive model for diabetes detection can significantly improve
patient outcomes by enabling early intervention and personalized care. By accurately
identifying individuals at risk of developing diabetes before symptoms manifest,
healthcare professionals can initiate timely preventive measures, lifestyle modifications,
and medical interventions to mitigate the progression of the disease and reduce the
likelihood of complications.

The predictive model can be integrated into clinical practice to support healthcare
providers in making informed decisions regarding patient care. By incorporating the
predictive model into electronic health records or clinical decision support systems,
healthcare professionals can access real-time risk assessments and recommendations for
diabetes management during routine patient visits.

The applicability of this project extends beyond clinical settings to include wellness and
preventive healthcare programs. By empowering individuals with knowledge about their
risk of diabetes and providing them with tailored interventions and support, these
programs have the potential to improve overall population health outcomes and reduce
healthcare costs associated with diabetes-related complications.
The applicability of this project in healthcare spans across clinical practice, population
health management, and preventive healthcare initiatives, offering transformative benefits
to individuals, healthcare systems, and communities alike.

1.4 Organization of report

The report consists of 7 chapters.

Chapter 1
It comprises the overview and objective of the project, along with its purpose, scope, and
applicability.
Chapter 2
It comprises the problem statement, along with the inputs to be given and the outputs
expected.
Chapter 3
It gives an overview of the methodology used and the system architecture that has been
followed.
Chapter 4
It describes the hardware tools, software tools, and libraries used to implement the
project.
Chapter 5
It covers the implementation of the project, including the source code.
Chapter 6
It presents the outputs obtained on running the source code, covering as many cases as
possible.
Chapter 7
It contains the reflection notes on how the project has affected me in both technical and
non-technical aspects.

Chapter 2
Problem Statement


Develop a classification algorithm to accurately predict whether patients have diabetes or
not using a dataset with medical predictor variables such as the number of pregnancies,
glucose levels, blood pressure, skin thickness, insulin levels, BMI, diabetes pedigree
function, and age.

The binary outcome variable (Outcome) indicates whether a patient has diabetes (1) or
not (0). The goal is to create a robust predictive model that can assist in early diabetes
diagnosis and risk assessment, ultimately improving patient care and health outcomes.

Key components of the problem:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration in an oral glucose tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skinfold thickness (mm)
Insulin: Two-hour serum insulin
BMI: Body Mass Index
Diabetes Pedigree Function: A numerical feature commonly found in diabetes-related
datasets. It quantifies the hereditary risk or likelihood of diabetes based on family
history
Age: Age in years

Outcome: Class variable (either 0 or 1). 268 of 768 values are 1, and the others are 0
(Target Variable)
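
The class counts quoted above imply that the dataset is imbalanced, which matters when reading accuracy figures later on. The short sketch below only works through that arithmetic; the variable names are illustrative and are not taken from the project code.

```python
# Class counts as stated in the dataset description above.
n_total = 768
n_positive = 268                      # Outcome == 1
n_negative = n_total - n_positive     # Outcome == 0

positive_rate = n_positive / n_total
# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = n_negative / n_total

print(f"Positive rate:     {positive_rate:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier is therefore only informative to the extent that it beats roughly 65% accuracy on this data.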

Sample Input:
Number of Pregnancies: 5
Glucose Level: 130
Blood Pressure: 70 mmHg
Skin Thickness: 30 mm
Insulin Level: 80
Body Mass Index (BMI): 25
Diabetes Pedigree Function: 0.5
Age of Patient: 35 years

Output for the above data:
“Not Diabetic”

Sample Dataset:

Fig 2.1 Sample Dataset

Chapter 3
System Architecture


Fig. 3.1 System Architecture

Data Collection:
The initial phase involves gathering relevant healthcare data from various sources,
including medical records, wearable devices, and health surveys. This data may include
patient demographics, medical history, biometric measurements (e.g., glucose levels,
blood pressure), and lifestyle factors. Techniques for data collection may involve
accessing electronic health records, utilizing health monitoring devices, and conducting
health assessments.

Data Preprocessing:
Raw healthcare data often requires preprocessing to ensure its quality and suitability for
model training. This crucial step involves cleaning the data to remove errors and
inconsistencies, handling missing values, and addressing outliers. Additionally, features
may be transformed or engineered to extract relevant information. For diabetes
prediction, preprocessing may include standardizing glucose levels, categorizing blood
pressure readings, and encoding categorical variables.
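
A minimal sketch of the preprocessing step described above, assuming that a recorded 0 in a measurement column means "missing". The toy values are made up, and median imputation is chosen here only for illustration (the source code in Chapter 5 uses the mean).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows (made-up values): columns are Glucose, BloodPressure, BMI.
# A recorded 0 is physiologically implausible here, so it is treated as missing.
X = np.array([
    [148.0, 72.0, 33.6],
    [85.0, 0.0, 26.6],
    [0.0, 64.0, 23.3],
    [89.0, 66.0, 0.0],
])
X_missing = np.where(X == 0, np.nan, X)

preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps with the column median
    ("scaler", StandardScaler()),                   # zero mean, unit variance per column
])
X_clean = preprocess.fit_transform(X_missing)

print(np.isnan(X_clean).any())
print(X_clean.mean(axis=0).round(6))
```

Wrapping both steps in a Pipeline means the statistics learned from the training data can later be reused unchanged on new patient records.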

Model Training:
During this phase, machine learning models are trained on the preprocessed healthcare
data to predict the likelihood of diabetes occurrence.

Decision Trees are versatile models capable of handling both classification and regression
tasks. They partition the feature space into distinct regions based on feature values,
allowing for intuitive interpretation of decision-making processes.
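
To make the idea of partitioning the feature space concrete, the sketch below scores two candidate thresholds with Gini impurity, the default split criterion of scikit-learn's DecisionTreeClassifier. The glucose values and labels are made up, and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

glucose = [85, 89, 110, 137, 148, 183]   # made-up values
outcome = [0, 0, 0, 1, 1, 1]

print(gini(outcome))                          # impurity before any split
print(split_impurity(glucose, outcome, 120))  # separates the classes cleanly
print(split_impurity(glucose, outcome, 100))  # leaves one side mixed
```

A tree learner repeats this comparison over features and thresholds and keeps the split with the lowest weighted impurity.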

Dept. of CSE, BIT 2023-2024


Support Vector Machines (SVMs) are particularly effective for binary classification tasks
but can also be extended to handle multiclass classification and regression. They identify
hyperplanes in high-dimensional feature spaces that best separate data points belonging to
different classes, maximizing the margin between classes to enhance robustness and
generalization.
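
The margin idea can be illustrated on a tiny, linearly separable toy set. This is only a sketch under assumed data: the points are made up, and a large C is used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters (made-up points).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1000.0)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                      # normal vector of the separating hyperplane
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:", clf.support_vectors_)
print("margin width:", round(float(margin_width), 3))
print("predictions:", clf.predict([[1.5, 1.5], [4.5, 4.5]]))
```

Only the support vectors determine the hyperplane; moving any other point (without crossing the margin) leaves the model unchanged.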

Random Forests leverage the power of ensemble learning by aggregating predictions
from multiple decision trees. By randomly sampling subsets of the training data and
features for each tree, Random Forests reduce the risk of overfitting and improve model
accuracy and robustness. The final prediction is determined by a majority vote or
averaging across the ensemble, leading to more reliable predictions compared to
individual decision trees.
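
The aggregation step can be sketched with plain NumPy. The per-tree votes below are hypothetical and stand in for the outputs of five fitted trees.

```python
import numpy as np

# Hypothetical class predictions from 5 trees (rows) for 4 patients (columns).
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
])

# Fraction of trees voting for class 1, per patient.
vote_share = tree_votes.mean(axis=0)
# Majority vote: class 1 when more than half of the trees agree.
forest_prediction = (vote_share > 0.5).astype(int)

print(vote_share)
print(forest_prediction)
```

Note that scikit-learn's RandomForestClassifier averages the per-tree class probabilities rather than counting hard votes, so this is an illustration of the principle, not of that library's internals.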

An Artificial Neural Network (ANN) is also considered. The ANN model consists of
multiple layers with dropout regularization to prevent overfitting.

Input layer: Accepts standardized features.

Hidden layers: Comprise several dense layers with ReLU activation functions to
introduce non-linearity.

Output layer: Single neuron with sigmoid activation function, producing binary
predictions (0 or 1) for diabetes outcome.

Model Construction:

Build an ANN model using the Sequential API. Configure dense layers with various
activation functions and dropout regularization to prevent overfitting.

Model Compilation:

Compile the model using binary cross-entropy loss and the Adam optimizer. Set accuracy
as the evaluation metric.

Model Training:

Train the model on the training data with 50 epochs and a batch size of 128. Utilize a
validation split of 10% for monitoring training progress.
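
For readers unfamiliar with the sigmoid output and the binary cross-entropy loss named above, the following NumPy sketch computes both by hand. The scores and labels are made up, and the helper functions are illustrative rather than part of any framework.

```python
import numpy as np

def sigmoid(z):
    """Map raw output-layer scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

logits = np.array([2.0, -1.0, 0.0, -3.0])   # made-up output-layer scores
y_true = np.array([1.0, 0.0, 1.0, 0.0])

y_prob = sigmoid(logits)
y_pred = (y_prob >= 0.5).astype(int)        # threshold at 0.5 for the class label

print(y_prob.round(3))
print(y_pred)
print(round(binary_cross_entropy(y_true, y_prob), 4))
```

The loss is small when confident probabilities match the true labels and grows sharply when a confident prediction is wrong, which is what the optimizer minimizes during training.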

Model Evaluation:
After training, it's essential to evaluate the performance of the predictive models on
unseen data. This is typically achieved by splitting the preprocessed healthcare data into
training and testing sets. The models are trained on the training data and then assessed on
the testing data using metrics such as accuracy, sensitivity and specificity. Evaluation
ensures that the models generalize well to new data and can effectively identify
individuals at risk of diabetes.
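
Sensitivity and specificity can be read directly off a confusion matrix. The sketch below uses made-up labels for ten test patients to show the calculation.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for 10 test patients (1 = diabetic, 0 = not diabetic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of diabetic patients correctly flagged
specificity = tn / (tn + fp)   # share of non-diabetic patients correctly cleared

print(accuracy, sensitivity, specificity)
```

In a screening setting, sensitivity is often the more important of the two, since a missed diabetic patient is usually costlier than a false alarm.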

Prediction:
Once trained and evaluated, the predictive models can be deployed to make predictions
on new healthcare data. New patient data undergoes preprocessing similar to the training
data and is then fed into the trained models. The models leverage the learned patterns
from the training data to generate predictions regarding an individual's likelihood of
developing diabetes, aiding healthcare professionals in early detection and preventive
interventions.
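
One way to guarantee that new data receives the same preprocessing as the training data is to keep the preprocessing and the classifier in a single pipeline. The sketch below assumes synthetic training data with two features; it is a stand-in, not the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the real dataset):
# two features, with the label driven by the first one.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(200, 2))
y_train = (X_train[:, 0] > 125.0).astype(int)

model = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

# New patients: the pipeline re-applies the fitted imputer and scaler,
# so a missing value (NaN) is handled the same way as during training.
X_new = np.array([[180.0, np.nan], [90.0, 28.0]])
labels = model.predict(X_new)
probabilities = model.predict_proba(X_new)[:, 1]

for label, p in zip(labels, probabilities):
    print("Diabetic" if label == 1 else "Not Diabetic", round(float(p), 2))
```

Reporting the predicted probability alongside the label gives a clinician a sense of how confident the model is.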

These stages lay the foundation for deploying machine learning models as decision
support tools in clinical settings, empowering healthcare professionals with valuable
insights to tailor interventions, optimize treatment plans, and promote proactive strategies
for diabetes management and prevention.

Chapter 4
Tools/Technologies

4.1 Hardware Tools

To implement this project, I have used Jupyter Notebook and Python 3 (ipykernel). The
minimum requirements are:

Operating System:
• Windows: The code can be executed on computers running Microsoft Windows
operating systems, including Windows 7, Windows 8, and Windows 10.
• macOS: It is also compatible with macOS, the operating system used on Apple
Macintosh computers.
• Linux: The code can run on various distributions of Linux, including Ubuntu,
Fedora, CentOS, and others.
Memory (RAM):
• The amount of RAM required depends on the size of the dataset being processed
and the complexity of the machine learning models being trained.
• For small to medium-sized datasets and relatively simple models, 4GB to 8GB of
RAM should be sufficient.
Processor (CPU):
• The processor's speed and number of cores influence the code's execution time,
especially during data preprocessing and model training.
• Any modern multi-core processor (e.g., Intel Core i5, i7, or i9 series for Intel
CPUs, or AMD Ryzen series for AMD CPUs) should be sufficient for running the
code.
Storage Space:
• Recommended minimum: 200 MB free disk space for the VS Code installation.
Additional space will be required for your projects and dependencies.
Internet Connectivity:
• Internet connectivity is not mandatory, but some features like extensions and
updates require an internet connection.
Additionally, one can use the Google Colab environment to run this code, which requires
an internet connection.

4.2 Software Tools

Programming Language:
Python is the programming language used in the code. Ensure that Python is installed on
the system. One can download and install Python from the official Python website.
Additionally, it's recommended to use a virtual environment management tool like
virtualenv or conda to create isolated Python environments for managing dependencies
and avoiding conflicts between different projects.

Development Environment:
A development environment is needed to write, execute, and manage the Python code.
Popular choices include Jupyter Notebooks, Google Colab, Anaconda, or any text
editor/IDE (Integrated Development Environment) like VSCode, PyCharm, or Sublime Text.

4.3 Libraries Used

pandas: pandas is a powerful data manipulation and analysis library in Python. It
provides data structures like DataFrame and Series, which are ideal for handling
structured data. pandas is widely used for tasks such as data cleaning, transformation,
exploration, and preparation.

numpy: numpy is a fundamental package for scientific computing in Python. It provides
support for multidimensional arrays and matrices, along with a collection of mathematical
functions to operate on these arrays efficiently. numpy is essential for numerical
computations and is extensively used in array-oriented computing tasks.

scikit-learn: scikit-learn is a versatile machine learning library in Python. It offers a wide
range of supervised and unsupervised learning algorithms, including classification,
regression, clustering, dimensionality reduction, and more. scikit-learn also provides tools
for model selection, evaluation, and preprocessing.

seaborn: seaborn is a statistical data visualization library built on top of matplotlib. It
provides a high-level interface for creating attractive and informative statistical graphics.
seaborn simplifies the process of creating complex visualizations, such as heatmaps,
violin plots, and pair plots, with minimal code.

Chapter 5
Implementation

5.1 Source Code

#Import Necessary Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import seaborn as sns
import matplotlib.pyplot as plt

#Data Analysis

# Load the dataset from a local CSV file
url = "health care diabetes.csv"
data = pd.read_csv(url)

data.head()

data.tail()

print("Number of Rows",data.shape[0])
print("Number of Columns",data.shape[1])

data.info()

data.isnull().sum()
data.describe()

data_copy = data.copy(deep=True)
data.columns

# On a copy, mark zeros in the measurement columns as missing and count them
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data_copy[zero_cols] = data_copy[zero_cols].replace(0,np.nan)
data_copy.isnull().sum()

# In the working DataFrame, replace zeros with the column mean
# (the mean is computed with the zeros still included)
data['Glucose'] = data['Glucose'].replace(0,data['Glucose'].mean())
data['BloodPressure'] = data['BloodPressure'].replace(0,data['BloodPressure'].mean())
data['SkinThickness'] = data['SkinThickness'].replace(0,data['SkinThickness'].mean())
data['Insulin'] = data['Insulin'].replace(0,data['Insulin'].mean())
data['BMI'] = data['BMI'].replace(0,data['BMI'].mean())

#Training Model

# Separate features and target variable


X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
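
Because the classes are imbalanced, a stratified split is a common variation on the call above. The sketch below is an assumption for illustration and not what the listing does: it uses synthetic labels with the stated class counts and passes `stratify=y` so that both subsets keep roughly the same share of positive cases.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (268 of 768 positive).
y = np.array([1] * 268 + [0] * 500)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(y_tr), len(y_te))
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```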

#Function to generate heatmap

def plot_confusion_matrix_heatmap(conf_matrix, title):
    # Draw the confusion matrix as an annotated heatmap with integer counts
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

#Random Forest Classifier

pipeline_rf = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was the same as 'sqrt'
    'classifier__max_features': ['sqrt', 'log2']
}

grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring='accuracy',
                              n_jobs=-1)
grid_search_rf.fit(X_train, y_train)
best_model_rf = grid_search_rf.best_estimator_
y_pred_rf = best_model_rf.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)


conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
classification_report_str_rf = classification_report(y_test, y_pred_rf)

print("RandomForestClassifier Results:")
print(f"Accuracy: {accuracy_rf}")
print(f"Confusion Matrix:\n{conf_matrix_rf}")
print("Classification Report:\n", classification_report_str_rf)

# Generate heatmap for RandomForestClassifier
plot_confusion_matrix_heatmap(conf_matrix_rf, 'RandomForestClassifier Confusion Matrix')

#Decision Tree Classifier

pipeline_dt = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

param_grid_dt = {
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was the same as 'sqrt'
    'classifier__max_features': ['sqrt', 'log2']
}

grid_search_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=5, scoring='accuracy',
                              n_jobs=-1)
grid_search_dt.fit(X_train, y_train)
best_model_dt = grid_search_dt.best_estimator_
y_pred_dt = best_model_dt.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred_dt)


conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
classification_report_str_dt = classification_report(y_test, y_pred_dt)

print("DecisionTreeClassifier Results:")
print(f"Accuracy: {accuracy_dt}")
print(f"Confusion Matrix:\n{conf_matrix_dt}")
print("Classification Report:\n", classification_report_str_dt)

# Generate heatmap for DecisionTreeClassifier
plot_confusion_matrix_heatmap(conf_matrix_dt, 'DecisionTreeClassifier Confusion Matrix')

#Support Vector Machine

pipeline_svm = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', SVC(random_state=42))
])

param_grid_svm = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf', 'poly'],
    'classifier__gamma': ['scale', 'auto']
}

grid_search_svm = GridSearchCV(pipeline_svm, param_grid_svm, cv=5,
                               scoring='accuracy', n_jobs=-1)
grid_search_svm.fit(X_train, y_train)
best_model_svm = grid_search_svm.best_estimator_
y_pred_svm = best_model_svm.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)


conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
classification_report_str_svm = classification_report(y_test, y_pred_svm)

print("SVM Classifier Results:")


print(f"Accuracy: {accuracy_svm}")
print(f"Confusion Matrix:\n{conf_matrix_svm}")
print("Classification Report:\n", classification_report_str_svm)

# Generate heatmap for SVM Classifier


plot_confusion_matrix_heatmap(conf_matrix_svm, 'SVM Classifier Confusion Matrix')
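
As a rough guide to the cost of the grid search above, the number of model fits is the number of parameter combinations times the number of cross-validation folds. The sketch below repeats the SVM grid from the listing and works through that arithmetic; `cv_folds` is an illustrative name.

```python
from math import prod

# Same grid as param_grid_svm in the listing above.
param_grid_svm = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf", "poly"],
    "classifier__gamma": ["scale", "auto"],
}
cv_folds = 5

n_candidates = prod(len(values) for values in param_grid_svm.values())
n_fits = n_candidates * cv_folds   # one fit per candidate per fold

print(n_candidates, n_fits)   # 18 90
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the full training set.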

#Prediction on New Data

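# max_features='sqrt' replaces 'auto', which was removed in scikit-learn 1.3
# (for classifiers the two were equivalent)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=2,
                                  min_samples_leaf=1, max_features='sqrt',
                                  random_state=42)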

# Fit the model to the training data


rf_model.fit(X_train, y_train)

# Example of new data (modify this according to your feature names and values)
new_data_example = pd.DataFrame({
'Pregnancies': [5],
'Glucose': [130],
'BloodPressure': [70],
'SkinThickness': [30],
'Insulin': [80],
'BMI': [25],
'DiabetesPedigreeFunction': [0.5],
'Age': [35]
})

# Make predictions using the trained RandomForestClassifier


prediction_rf = rf_model.predict(new_data_example)

# Convert the prediction to a more user-friendly result


result_rf = "Diabetic" if prediction_rf[0] == 1 else "Not Diabetic"

print("Result: ", result_rf)

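Neural network Sequential model:

# Keras imports needed by the model below (assuming the TensorFlow distribution of Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')   # single probability output for the binary outcome
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])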

Chapter 6
Results

Fig 6.1 Results for Random Forest Classifier

Fig 6.2 Results for Decision Tree Classifier

Fig 6.3 Results for Support Vector Machine

Fig 6.4(a) Prediction on new data_1

Fig 6.4(b) Output for new data_1

Fig 6.5(a) Prediction on new data_2

Fig 6.5(b) Output for new data_2

Chapter 7
Reflection Notes

Delving into the realm of machine learning, Python emerged as a cornerstone due to its
rich ecosystem of libraries tailored for data analysis and modeling. scikit-learn, NumPy,
and Pandas stood out as indispensable tools, each playing pivotal roles in various stages
of the machine learning pipeline. NumPy provided a solid foundation for numerical
computations and efficient handling of arrays, while Pandas facilitated seamless data
manipulation and preprocessing tasks, thanks to its powerful DataFrame and Series
structures. Meanwhile, scikit-learn offered a comprehensive suite of machine learning
algorithms and utilities, simplifying the implementation and evaluation of predictive
models.

The Random Forest algorithm emerged as a particularly robust choice for predictive
modeling endeavors. Its ensemble nature, which harnesses the collective wisdom of
multiple decision trees, proved effective in enhancing predictive accuracy while
mitigating the risk of overfitting. By aggregating predictions from diverse individual
trees, Random Forests fostered robustness and resilience, making them well-suited for a
wide range of classification and regression tasks across different domains.

My exploration of machine learning tools and technologies revealed a dynamic landscape,
where careful choices and integration of tools significantly impact the development and
deployment of predictive models. My internship experience provided me with invaluable
opportunities for my personal development as well. I am grateful for the challenges and
learning experiences that have contributed to my growth, and I look forward to applying
these newfound skills and insights in future endeavors.

References
1. DataCamp. (n.d.). Random Forests Classifier in Python. Retrieved from
https://www.datacamp.com/tutorial/random-forests-classifier-python

2. GeeksForGeeks. (n.d.). Decision Tree Introduction with Example. Retrieved from
https://www.geeksforgeeks.org/decision-tree-introduction-example/

3. GeeksForGeeks. (n.d.). Support Vector Machine Algorithm. Retrieved from
https://www.geeksforgeeks.org/support-vector-machine-algorithm/

4. W3Schools. (n.d.). Python Machine Learning - Getting Started. Retrieved from
https://www.w3schools.com/python/python_ml_getting_started.asp

5. Brownlee, J. (2020). Introduction to Random Forests for Classification and
Regression. Machine Learning Mastery. Retrieved from
https://machinelearningmastery.com/random-forest-ensemble-in-python/

6. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ...
& Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12(Oct), 2825-2830. Retrieved from
https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html

7. McKinney, W. (2010). Data Structures for Statistical Computing in Python.
Proceedings of the 9th Python in Science Conference, 51-56. Retrieved from
https://conference.scipy.org/proceedings/scipy2010/mckinney.html
