Article Eda

This document describes a disease prediction system that uses machine learning classifiers to predict a disease from a user's symptoms. It uses 4 classifiers (KNN, Random Forest, SVM, and Naive Bayes) and takes the mode of their predictions as the final disease. The system is trained on a dataset of real hospital data with symptom and disease labels, split 80% for training and 20% for testing. The document describes each classifier and evaluates performance with confusion matrices. The combined model classifies all test data points correctly. Future work could involve more complex datasets covering symptom severity and duration.

Uploaded by

gunda pavan

Disease prediction using machine learning

G.satyapavankumar T.mohan savendra

19mis1126 19mis1200
Vellore Institute of Technology, Chennai campus

Abstract: Our disease prediction system predicts a user's disease from the symptoms the user provides as input. The system analyzes these symptoms and outputs the most probable disease. Prediction is done with 4 classifiers: the model runs all 4 algorithms and takes the mode of their predictions as the final result.
Keywords: Classifiers, Machine Learning, Disease detection, Python

1.1. INTRODUCTION

Machine learning is the programming of computers to optimize performance using example data or past data. It is the study of computer systems that learn from data and experience. A machine learning algorithm has two phases: training and testing. Our target is to predict a disease from a patient's symptoms. The prediction is carried out with the KNN, Decision Tree (in the form of a Random Forest), SVM and Naive Bayes algorithms.

1.2. MOTIVATION

There is a need for a system that makes it easy for an end user to predict chronic diseases without visiting a physician or doctor for diagnosis. The goal is to detect various diseases by examining a patient's symptoms with different machine learning models. There is no established method for handling text data and structured data together, so the proposed system considers both structured and unstructured data. Using machine learning is expected to increase prediction accuracy.

2. DATASET DESCRIPTION:

2.1 DATASET USED:

Hospital data comes in either textual or structured form. The dataset used in this project is real-life hospital data stored in a data center. The structured data contains the patients' symptoms, while the unstructured data consists of text.

2.2 DATASET DESCRIPTION:

The complete dataset consists of 2 CSV files: one for training and one for testing the model. Each CSV file has 133 columns. 132 of these columns are symptoms that a person experiences and the last column is the prognosis. These symptoms are mapped to 42 diseases into which a set of symptoms can be classified.
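To make the layout described in Section 2.2 concrete, here is a minimal sketch on a tiny synthetic table with the same shape: binary symptom columns followed by a `prognosis` column. The three symptom names and the four rows are assumptions made up for illustration; the real files have 132 symptom columns and 42 diseases.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the training CSV: 3 symptom columns instead of 132.
# Column names and rows are illustrative, not taken from the real dataset.
toy = pd.DataFrame({
    "itching":   [1, 1, 0, 0],
    "skin_rash": [1, 0, 0, 0],
    "cough":     [0, 0, 1, 1],
    "prognosis": ["Fungal infection", "Fungal infection",
                  "Common Cold", "Common Cold"],
})

# Every column except the last is a symptom indicator; the last is the label.
X = toy.iloc[:, :-1]
y_text = toy.iloc[:, -1]

# Map disease names to integer codes (LabelEncoder sorts the class names).
encoder = LabelEncoder()
y = encoder.fit_transform(y_text)

# A balanced dataset has the same number of rows for every disease.
counts = y_text.value_counts()
is_balanced = counts.nunique() == 1
```

Selecting columns by position rather than by name keeps the sketch independent of the actual symptom names in the CSV files.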
2.3 SPLITTING THE DATA FOR TRAINING AND TESTING THE MODEL:

We split the data 80:20, i.e. 80% of the dataset is used for training the model and 20% is used to evaluate the performance of the model.

3. ALGORITHM USED

3.1 KNN

K Nearest Neighbor (KNN) is a very simple, easy to understand and versatile algorithm, and one of the most widely used in machine learning. In a healthcare system the user wants to predict a disease; the proposed system classifies the input into one of several disease classes on the basis of symptoms. KNN can be used for both classification and regression problems and is based on a feature similarity approach. A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its K nearest neighbors as measured by a distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor. Note that all three distance measures commonly used with KNN (Euclidean, Manhattan and Minkowski) are only valid for continuous variables. For categorical variables, the Hamming distance must be used. This also raises the issue of standardizing the numerical variables between zero and one when the dataset contains a mixture of numerical and categorical variables.

3.2 NAIVE BAYES

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modelling. One of the simplest ways of selecting the most probable hypothesis given the data is to use our prior knowledge about the problem. Bayes' theorem provides a way to calculate the probability of a hypothesis given that prior knowledge. The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes' theorem gives the posterior probability P(b|a) from P(b), P(a) and P(a|b): P(b|a) = P(a|b) P(b) / P(a).

3.3 SUPPORT VECTOR CLASSIFIER

Support Vector Classifier is a discriminative classifier: given labelled training data, the algorithm tries to find an optimal hyperplane that accurately separates the samples into different categories in hyperspace.

3.4 RANDOM FOREST CLASSIFIER:

Random Forest is an ensemble-learning-based supervised classification algorithm that internally uses multiple decision trees to make the classification. In a random forest classifier all the internal decision trees are weak learners, and their outputs are combined: the mode of all the predictions is taken as the final prediction.
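The KNN majority vote from Section 3.1 can be sketched in a few lines of plain Python. Because the symptom columns are binary indicators, this sketch uses the Hamming distance. The function names `hamming` and `knn_predict`, the toy rows and the labels "A" and "B" are all assumptions for illustration; the appendix program relies on scikit-learn's `KNeighborsClassifier` instead.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length symptom vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Rank training rows by distance to the query. sorted() is stable, so
    # rows at equal distance keep their original order.
    order = sorted(range(len(train_X)),
                   key=lambda i: hamming(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    # The label with the most votes among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical binary symptom vectors with made-up labels "A" and "B".
train_X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_y = ["A", "A", "B", "B"]

prediction = knn_predict(train_X, train_y, [1, 1, 1], k=3)
```

With k = 1 the same function reduces to assigning the class of the single nearest row, as the text describes.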
3.5 BLOCK DIAGRAM:

[Figure: block diagram]

3. MODEL EVALUATION:

Predictions on validation dataset by KNN Classifier

3.1 EVALUATION METHOD

To evaluate performance in the experiment, we first denote TP, TN, FP and FN as true positive (the number of results correctly predicted as required), true negative (the number of results correctly predicted as not required), false positive (the number of results incorrectly predicted as required) and false negative (the number of results incorrectly predicted as not required), respectively.

3.2 Confusion Matrices:

[Figures: confusion matrices for Naive Bayes, SVM, KNN and Random Forest]
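The four counts defined in Section 3.1 can be read directly off a binary confusion matrix. The sketch below is illustrative only: the label vectors are made up, and it assumes that "required" corresponds to the positive class 1.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth and predictions (1 = required, 0 = not required).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Accuracy is the share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

In the multi-class setting of this project the same idea applies per disease: the diagonal of the confusion matrix holds the correctly classified cases.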
4. RESULTS AND DISCUSSIONS

4.1 CONFUSION MATRIX FOR THE MODEL:

4.2 Result

FINAL PREDICTION:

We can see that our combined model has classified all the data points accurately. Next, we create a function that takes symptoms separated by commas as input and outputs the disease predicted by the combined model.

5. CONCLUSION AND FUTURE WORK

5.1 CONCLUSIONS

This project aims to predict a disease on the basis of symptoms. The system is designed to take symptoms from the user as input and produce the predicted disease as output. In conclusion, for disease risk modelling, the accuracy of risk prediction depends on the diversity of features in the hospital data.

5.2 Future Works:

The model can be trained to work with more complex datasets that capture the severity of the symptoms and how long the patient has been affected by them.
APPENDIX
Program:

# Importing libraries
import numpy as np
import pandas as pd
from scipy.stats import mode
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# Reading the train.csv by removing the
# last column since it's an empty column
DATA_PATH = "EDA/Training.csv"
data = pd.read_csv(DATA_PATH).dropna(axis = 1)

# Checking whether the dataset is balanced or not
disease_counts = data["prognosis"].value_counts()
print(disease_counts)
temp_df = pd.DataFrame({
    "Disease": disease_counts.index,
    "Counts": disease_counts.values
})
plt.figure(figsize = (18,8))
sns.barplot(x = "Disease", y = "Counts", data = temp_df)
plt.xticks(rotation=90)
plt.show()

# Encoding the target value into numerical
# value using LabelEncoder
encoder = LabelEncoder()
encoder.fit(data["prognosis"])
data["prognosis"] = encoder.transform(data["prognosis"])

# Splitting features and target, then 80:20 into train and test sets
X = data.iloc[:,:-1]
y = data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 24)

print(f"Train: {X_train.shape}, {y_train.shape}")
print(f"Test: {X_test.shape}, {y_test.shape}")
# Defining scoring metric for k-fold cross validation
def cv_scoring(estimator, X, y):
    return accuracy_score(y, estimator.predict(X))

# Initializing Models
models = {
    "SVC":SVC(),
    "Gaussian NB":GaussianNB(),
    "Random Forest":RandomForestClassifier(random_state=18),
    "KNN":KNeighborsClassifier(n_neighbors=3)
}

# Producing cross validation score for the models
for model_name in models:
    model = models[model_name]
    scores = cross_val_score(model, X, y, cv = 10, n_jobs = -1, scoring = cv_scoring)
    print("=="*30)
    print(model_name)
    print(f"Scores: {scores}")
    print(f"Mean Score: {np.mean(scores)}")

# Training and testing SVM Classifier
svm_model = SVC()
svm_model.fit(X_train, y_train)
preds = svm_model.predict(X_test)
print(f"Accuracy on train data by SVM Classifier: {accuracy_score(y_train, svm_model.predict(X_train))*100}")
print(f"Accuracy on test data by SVM Classifier: {accuracy_score(y_test, preds)*100}")
cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for SVM Classifier on Test Data")
plt.show()

# Training and testing Naive Bayes Classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
preds = nb_model.predict(X_test)
print(f"Accuracy on train data by Naive Bayes Classifier: {accuracy_score(y_train, nb_model.predict(X_train))*100}")
print(f"Accuracy on test data by Naive Bayes Classifier: {accuracy_score(y_test, preds)*100}")
cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for Naive Bayes Classifier on Test Data")
plt.show()

# Training and testing Random Forest Classifier
rf_model = RandomForestClassifier(random_state=18)
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_test)
print(f"Accuracy on train data by Random Forest Classifier: {accuracy_score(y_train, rf_model.predict(X_train))*100}")
print(f"Accuracy on test data by Random Forest Classifier: {accuracy_score(y_test, preds)*100}")
cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for Random Forest Classifier on Test Data")
plt.show()

# Training and testing KNN Classifier
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)
preds = knn_model.predict(X_test)
score = accuracy_score(y_test, preds)
print("Accuracy score for KNN is {}%".format(score*100))
cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for KNN Classifier on Test Data")
plt.show()
# Training the models on whole data
final_svm_model = SVC()
final_nb_model = GaussianNB()
final_rf_model = RandomForestClassifier(random_state=18)
# KNN is also refitted on the whole data so that all four models
# are trained on the same rows
final_knn_model = KNeighborsClassifier(n_neighbors=3)
final_svm_model.fit(X, y)
final_nb_model.fit(X, y)
final_rf_model.fit(X, y)
final_knn_model.fit(X, y)

# Reading the test data
test_data = pd.read_csv("EDA/Testing.csv").dropna(axis=1)
test_X = test_data.iloc[:, :-1]
test_Y = encoder.transform(test_data.iloc[:, -1])

# Making prediction by taking the mode of predictions
# made by all the classifiers
svm_preds = final_svm_model.predict(test_X)
nb_preds = final_nb_model.predict(test_X)
rf_preds = final_rf_model.predict(test_X)
# Predict on test_X (not the X_test hold-out split) so that the four
# prediction arrays refer to the same rows
knn_preds = final_knn_model.predict(test_X)

# np.atleast_1d handles both SciPy behaviours: mode() returns an array
# in older releases and a scalar in newer ones
final_preds = [np.atleast_1d(mode([i, j, k, l])[0])[0]
               for i, j, k, l in zip(svm_preds, nb_preds, rf_preds, knn_preds)]

print(f"Accuracy on Test dataset by the combined model: {accuracy_score(test_Y, final_preds)*100}")
cf_matrix = confusion_matrix(test_Y, final_preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot = True)
plt.title("Confusion Matrix for Combined Model on Test Dataset")
plt.show()
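Section 4.2 mentions a function that takes comma-separated symptoms and returns the disease predicted by the combined model, but the program above does not include it. The following is a minimal, self-contained sketch of such a function, not the authors' implementation. The name `predict_disease`, the four-symptom vocabulary and the toy training rows are assumptions made so the example runs without the CSV files. It also swaps `scipy.stats.mode` for `collections.Counter` to take the majority vote.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical symptom vocabulary and toy training rows (not the real dataset).
SYMPTOMS = ["itching", "skin_rash", "cough", "fever"]
X = np.array([
    [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],   # "Fungal infection"
    [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1],   # "Common Cold"
])
y = np.array(["Fungal infection"] * 3 + ["Common Cold"] * 3)

# The same four classifier types as the program above, fitted on the toy rows.
MODELS = [
    SVC(),
    GaussianNB(),
    RandomForestClassifier(random_state=18),
    KNeighborsClassifier(n_neighbors=3),
]
for model in MODELS:
    model.fit(X, y)

def predict_disease(symptoms):
    """Return the majority-vote disease for a comma-separated symptom string."""
    index = {name: i for i, name in enumerate(SYMPTOMS)}
    # Build a one-row indicator vector: 1 for each symptom the user reports.
    vector = np.zeros((1, len(SYMPTOMS)), dtype=int)
    for raw in symptoms.split(","):
        name = raw.strip().lower().replace(" ", "_")
        if name not in index:
            raise ValueError(f"Unknown symptom: {raw.strip()!r}")
        vector[0, index[name]] = 1
    # Each model casts one vote; ties go to the earliest vote in the list.
    votes = [model.predict(vector)[0] for model in MODELS]
    return Counter(votes).most_common(1)[0][0]

result = predict_disease("itching, skin rash")
```

With the real data, `SYMPTOMS` would come from the column names of the training CSV and `MODELS` would hold the four classifiers fitted on the whole dataset. Rejecting unknown symptom names is a design choice of this sketch, made so that typos do not silently produce an all-zero input.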
