
Article Eda

This document describes a disease prediction system that uses machine learning classifiers to predict a disease based on a user's symptoms. It uses 4 classifiers - KNN, Decision Tree, SVM, and Naive Bayes - and takes the mode of the predictions to determine the optimal disease. The system is trained on a dataset containing real hospital data with symptom and disease labels. It splits the data into 80% for training and 20% for testing the models. The document provides details on each classifier and evaluates their performance using metrics like confusion matrices. The combined model is able to accurately classify all data points. Future work could involve more complex datasets involving symptom severity and duration.

Uploaded by

gunda pavan

Disease prediction using machine learning

G.satyapavankumar T.mohan savendra


19mis1126 19mis1200
Vellore Institute of Technology, Chennai campus

Abstract: Our disease prediction system predicts a user's disease from the symptoms the user provides as input. The system analyzes these symptoms and outputs the most probable disease. Prediction is performed by four classifiers, and the mode of their four predictions is taken as the final result.

Keywords: Classifiers, Machine Learning, Disease detection, Python

1.1. INTRODUCTION

Machine learning is the practice of programming computers to optimize performance using example data or past data; it is the study of computer systems that learn from data and experience. A machine learning algorithm has two phases: training and testing. Our goal is to predict a disease from a patient's symptoms. Prediction is carried out with the KNN, Random Forest (an ensemble of decision trees), SVM and Naive Bayes algorithms.

1.2. MOTIVATION

There is a need for a system that lets an end user predict chronic diseases without visiting a physician or doctor for diagnosis. Such a system detects diseases by examining a patient's symptoms with different machine learning models. There is no established method for handling text data and structured data together, so the proposed system considers both structured and unstructured data. Using machine learning is expected to increase prediction accuracy.

2. DATASET DESCRIPTION:

2.1 DATASET USED:

Hospital data comes either in textual (unstructured) form or in structured form. The dataset used in this project is real-life hospital data stored in a data center. The structured data contains the patients' symptoms, while the unstructured data consists of free text.

2.2 DATASET DESCRIPTION:

The complete dataset consists of two CSV files, one for training the model and one for testing it. Each CSV file has 133 columns: 132 columns are symptoms that a person may experience, and the last column is the prognosis. Together the symptoms map to 42 diseases, which are the classes a set of symptoms can be assigned to.
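To make the layout described in Section 2.2 concrete, the following sketch builds a tiny stand-in table that follows the same convention (binary symptom columns followed by a final prognosis column) and separates the features from the label. The rows, the three symptom columns and the two disease names are invented for illustration; the real files have 132 symptom columns and 42 diseases.

```python
import pandas as pd

# Hypothetical miniature of the training CSV: symptom columns hold 0/1 flags,
# and the last column ("prognosis") holds the disease label.
toy = pd.DataFrame(
    {
        "itching": [1, 0, 0, 1],
        "skin_rash": [1, 0, 0, 1],
        "high_fever": [0, 1, 1, 0],
        "prognosis": ["Fungal infection", "Malaria", "Malaria", "Fungal infection"],
    }
)

# Features are every column except the last; the label is the last column.
X = toy.iloc[:, :-1]
y = toy.iloc[:, -1]

n_symptoms = X.shape[1]
n_diseases = y.nunique()
class_counts = y.value_counts()

print(f"{n_symptoms} symptom columns, {n_diseases} diseases")
print(class_counts)
```

Counting rows per disease in this way is also a quick check of whether the classes are balanced before any model is trained.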
2.3 SPLITTING THE DATA FOR TRAINING AND TESTING THE MODEL:

We split the data in an 80:20 ratio, i.e. 80% of the dataset is used for training the model and 20% is used to evaluate its performance.

3. ALGORITHM USED

3.1 KNN

K Nearest Neighbor (KNN) is a very simple, easy to understand and versatile algorithm, and one of the most widely used in machine learning. In the proposed healthcare system, KNN classifies a user's symptoms into one of several disease classes, indicating which disease is likely on the basis of those symptoms. KNN can be used for both classification and regression problems and is based on feature similarity. A case is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors, as measured by a distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor. Note that the three distance measures commonly used with KNN (Euclidean, Manhattan and Minkowski) are valid only for continuous variables; for categorical variables, the Hamming distance must be used. This also raises the issue of standardizing the numerical variables to the range between zero and one when the dataset mixes numerical and categorical variables.

3.2 NAIVE BAYES

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modelling. One of the simplest ways of selecting the most probable hypothesis given the data is to use that data as our prior knowledge about the problem. Bayes' theorem provides a way to calculate the probability of a hypothesis given our prior knowledge. The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes' theorem gives the posterior probability P(b|a) from P(b), P(a) and P(a|b): P(b|a) = P(a|b) P(b) / P(a).

3.3 SUPPORT VECTOR CLASSIFIER

Support Vector Classifier is a discriminative classifier: given labelled training data, the algorithm tries to find an optimal hyperplane that accurately separates the samples into different categories in hyperspace.

3.4 RANDOM FOREST CLASSIFIER:

Random Forest is an ensemble-learning-based supervised classification algorithm that internally uses multiple decision trees to make the classification. In a random forest classifier, all the internal decision trees are weak learners, and their outputs are combined: the mode of all the predictions is taken as the final prediction.
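The majority-vote rule from Section 3.1 can be illustrated with a small hand-written KNN that uses the Hamming distance, which suits binary symptom flags. This is a minimal sketch and not the scikit-learn classifier used in the appendix; the function names `hamming_distance` and `knn_predict`, the training rows and the labels "A" and "B" are all invented for illustration.

```python
from collections import Counter

import numpy as np


def hamming_distance(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    distances = [hamming_distance(row, query) for row in train_X]
    # Indices of the k smallest distances (stable sort keeps ties in row order).
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Invented binary symptom vectors and labels, for illustration only.
train_X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
])
train_y = ["A", "A", "B", "B", "A"]

query = [1, 1, 0, 1]
prediction_k3 = knn_predict(train_X, train_y, query, k=3)
prediction_k1 = knn_predict(train_X, train_y, query, k=1)
print(prediction_k3, prediction_k1)
```

With k=1 the function reduces to copying the label of the single closest row, which matches the K = 1 case described in the text.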
3.5 BLOCK DIAGRAM:

[Figure: block diagram of the system. Recoverable label: "Predictions on Validation dataset by KNN Classifier".]

3. MODEL EVALUATION:

3.1 EVALUATION METHOD

To evaluate performance in the experiment, we first denote by TP, TN, FP and FN the true positives (the number of results correctly predicted as required), true negatives (the number of results correctly predicted as not required), false positives (the number of results incorrectly predicted as required) and false negatives (the number of results incorrectly predicted as not required), respectively.

3.2 Confusion Matrices:

[Figures: confusion matrices for KNN, Naive Bayes, SVM and Random Forest.]
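As a rough illustration of the definitions in Section 3.1, the sketch below derives TP, FP, FN and TN for each class from a multi-class confusion matrix, treating one class at a time as the "required" outcome. The labels are made up, and the one-vs-rest reading is an assumption about how the binary definitions extend to a problem with many disease classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true and predicted class labels for three classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# One-vs-rest counts per class.
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp  # predicted as the class but actually another
fn = cm.sum(axis=1) - tp  # actually the class but predicted as another
tn = cm.sum() - (tp + fp + fn)

# Overall accuracy is the share of samples on the diagonal.
accuracy = tp.sum() / cm.sum()

for label in range(cm.shape[0]):
    print(f"class {label}: TP={tp[label]} FP={fp[label]} FN={fn[label]} TN={tn[label]}")
print(f"accuracy = {accuracy:.3f}")
```

A perfect classifier would leave every off-diagonal cell at zero, so FP and FN would be zero for every class.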
4. RESULTS AND DISCUSSIONS

4.1 CONFUSION MATRIX FOR THE MODEL:

[Figure: confusion matrix for the combined model.]

We can see that our combined model has classified all the data points accurately.

4.2 Result

We now create a function that takes symptoms separated by commas as input and outputs the disease predicted by the combined model for those symptoms.

FINAL PREDICTION:

[Figure: final prediction output.]

5. CONCLUSION AND FUTURE WORK

5.1 CONCLUSIONS

This project aims to predict a disease on the basis of symptoms. The system is designed to take symptoms from the user as input and produce the predicted disease as output. In conclusion, for disease risk modelling, the accuracy of risk prediction depends on the diversity of features in the hospital data.

5.2 Future Works:

The model can be trained to work with more complex datasets that capture the severity of the symptoms and how long the patient has been affected by them.

6. REFERENCES
1) Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang, “Disease Prediction by Machine Learning over Big Data from Healthcare Communities,” 2017.

2) Chala Beyene and Pooja Kamat, “Survey on Prediction and Analysis the Occurrence of Heart Disease Using Data Mining Techniques,” International Journal of Pure and Applied Mathematics, 2018.

3) P. Groves, B. Kayyali, D. Knott, and S. V. Kuiken, “The ’big data’ revolution in healthcare: Accelerating value and innovation,” 2016.

4) S.-H. Wang, T.-M. Zhan, Y. Chen, Y. Zhang, M. Yang, H.-M. Lu, H.-N. Wang, B. Liu, and P. Phillips, “Multiple sclerosis detection based on biorthogonal wavelet transform, RBF kernel principal component analysis, and logistic regression,” IEEE Access, vol. 4, pp. 7567–7576, 2016.

APPENDIX
Program:

# Importing libraries
import numpy as np
import pandas as pd
from scipy.stats import mode
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# Reading the train.csv by removing the
# last column since it's an empty column
DATA_PATH = "EDA/Training.csv"
data = pd.read_csv(DATA_PATH).dropna(axis = 1)

# Checking whether the dataset is balanced or not
disease_counts = data["prognosis"].value_counts()
print(disease_counts)
temp_df = pd.DataFrame({
    "Disease": disease_counts.index,
    "Counts": disease_counts.values
})
plt.figure(figsize = (18,8))
sns.barplot(x = "Disease", y = "Counts", data = temp_df)
plt.xticks(rotation=90)
plt.show()

# Encoding the target value into numerical
# value using LabelEncoder
encoder = LabelEncoder()
encoder.fit(data["prognosis"])
data["prognosis"] = encoder.transform(data["prognosis"])

# Features are all columns except the last; the label is the last column
X = data.iloc[:,:-1]
y = data.iloc[:, -1]

# 80:20 split into training and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 24)

print(f"Train: {X_train.shape}, {y_train.shape}")
print(f"Test: {X_test.shape}, {y_test.shape}")
# Defining scoring metric for k-fold cross validation
def cv_scoring(estimator, X, y):
    return accuracy_score(y, estimator.predict(X))

# Initializing Models
models = {
    "SVC":SVC(),
    "Gaussian NB":GaussianNB(),
    "Random Forest":RandomForestClassifier(random_state=18),
    "KNN":KNeighborsClassifier(n_neighbors=3)
}

# Producing cross validation score for the models
for model_name in models:
    model = models[model_name]
    scores = cross_val_score(model, X, y, cv = 10, n_jobs = -1, scoring = cv_scoring)
    print("=="*30)
    print(model_name)
    print(f"Scores: {scores}")
    print(f"Mean Score: {np.mean(scores)}")

# Training and testing SVM Classifier
svm_model = SVC()
svm_model.fit(X_train, y_train)
preds = svm_model.predict(X_test)

print(f"Accuracy on train data by SVM Classifier: {accuracy_score(y_train, svm_model.predict(X_train))*100}")
print(f"Accuracy on test data by SVM Classifier: {accuracy_score(y_test, preds)*100}")

cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for SVM Classifier on Test Data")
plt.show()

# Training and testing Naive Bayes Classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
preds = nb_model.predict(X_test)

print(f"Accuracy on train data by Naive Bayes Classifier: {accuracy_score(y_train, nb_model.predict(X_train))*100}")
print(f"Accuracy on test data by Naive Bayes Classifier: {accuracy_score(y_test, preds)*100}")

cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for Naive Bayes Classifier on Test Data")
plt.show()

# Training and testing Random Forest Classifier
rf_model = RandomForestClassifier(random_state=18)
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_test)

print(f"Accuracy on train data by Random Forest Classifier: {accuracy_score(y_train, rf_model.predict(X_train))*100}")
print(f"Accuracy on test data by Random Forest Classifier: {accuracy_score(y_test, preds)*100}")

cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for Random Forest Classifier on Test Data")
plt.show()

# Training and testing KNN Classifier
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)
preds = knn_model.predict(X_test)
score = accuracy_score(y_test, preds)
print("Accuracy score for KNN is {}%".format(score*100))

cf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot=True)
plt.title("Confusion Matrix for KNN Classifier on Test Data")
plt.show()
# Training the models on whole data
final_svm_model = SVC()
final_nb_model = GaussianNB()
final_rf_model = RandomForestClassifier(random_state=18)
# KNN is also refit on the whole data so that all four models are treated alike
final_knn_model = KNeighborsClassifier(n_neighbors=3)
final_svm_model.fit(X, y)
final_nb_model.fit(X, y)
final_rf_model.fit(X, y)
final_knn_model.fit(X, y)

# Reading the test data
test_data = pd.read_csv("EDA/Testing.csv").dropna(axis=1)
test_X = test_data.iloc[:, :-1]
test_Y = encoder.transform(test_data.iloc[:, -1])

# Making prediction by taking the mode of the predictions
# made by all the classifiers
svm_preds = final_svm_model.predict(test_X)
nb_preds = final_nb_model.predict(test_X)
rf_preds = final_rf_model.predict(test_X)
# Predict on test_X (not the validation split X_test) so that every
# classifier votes on the same rows
knn_preds = final_knn_model.predict(test_X)

# np.atleast_1d keeps this working whether scipy's mode returns
# an array (older versions) or a scalar (newer versions)
final_preds = [np.atleast_1d(mode([i, j, k, l])[0])[0]
               for i, j, k, l in zip(svm_preds, nb_preds, rf_preds, knn_preds)]

print(f"Accuracy on Test dataset by the combined model: {accuracy_score(test_Y, final_preds)*100}")

cf_matrix = confusion_matrix(test_Y, final_preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot = True)
plt.title("Confusion Matrix for Combined Model on Test Dataset")
plt.show()
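The program above stops at the combined confusion matrix. The function described in Section 4, which takes comma-separated symptoms and returns the disease chosen by the combined model, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' code: the toy training rows, the three symptom names and the two disease names are invented, and `predict_disease` and `symptom_index` are hypothetical names.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Invented toy data: three symptoms, two diseases (illustration only).
symptoms = ["itching", "skin_rash", "high_fever"]
X = np.array([
    [1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [1, 0, 1], [0, 0, 1],
])
diseases = ["Fungal infection"] * 4 + ["Malaria"] * 4

encoder = LabelEncoder()
y = encoder.fit_transform(diseases)

# Map each symptom name to its column position in the feature vector.
symptom_index = {name: i for i, name in enumerate(symptoms)}

# The same four classifier types as in the appendix program.
models = [
    SVC(),
    GaussianNB(),
    RandomForestClassifier(random_state=18),
    KNeighborsClassifier(n_neighbors=3),
]
for model in models:
    model.fit(X, y)


def predict_disease(symptom_text):
    """Return the majority-vote disease for comma-separated symptom names."""
    features = np.zeros((1, len(symptoms)), dtype=int)
    for name in symptom_text.split(","):
        name = name.strip().lower().replace(" ", "_")
        if name not in symptom_index:
            raise ValueError(f"unknown symptom: {name!r}")
        features[0, symptom_index[name]] = 1
    # Each classifier votes with an encoded label; the most common vote wins.
    votes = [int(model.predict(features)[0]) for model in models]
    winner = Counter(votes).most_common(1)[0][0]
    return encoder.inverse_transform([winner])[0]


result = predict_disease("Itching, skin rash")
print(result)
```

With four voters a 2-2 tie is possible; in this sketch `Counter.most_common` then returns the label that was voted first, so the order of `models` acts as the tie-breaker. An odd number of classifiers or probability-weighted voting would avoid that dependence.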
