
Pancreatic Cancer Prediction

This document is a project report submitted to APJ Abdul Kalam Technological University for the degree of Bachelor of Technology in Information Technology. It discusses the development of a computer-aided diagnosis system for pancreatic cancer using machine learning algorithms. Four students, Sreelaya Sudheer, Rabiya K, Navya K J, and Angel Anto, worked on the project under the guidance of Dr. Dhanya K M at Government Engineering College Palakkad. The project aims to build models using support vector machine, naive Bayes, and random forest classifiers to predict pancreatic cancer based on patient data.

COMPUTER AIDED DIAGNOSIS OF

PANCREATIC CANCER

A PROJECT REPORT

submitted by

SREELAYA SUDHEER (PKD17IT056)


RABIYA K (PKD17IT040)
NAVYA K J (PKD17IT036)
ANGEL ANTO (PKD17IT010)

to

the APJ Abdul Kalam Technological University


in partial fulfillment of the requirements for the award of the Degree
of
Bachelor of Technology
In
Information Technology

Department of Information Technology


Government Engineering College Palakkad
Sreekrishnapuram, Palakkad - 678633
June 2021
DECLARATION
We hereby declare that the project report entitled “Computer Aided Diagnosis Of
Pancreatic Cancer” submitted by us to APJ Abdul Kalam Technological University during
the academic year 2020 - 2021 in partial fulfillment of the requirements for the award of
Degree of Bachelor of Technology in Information Technology is a record of bonafide project
work carried out by us under the guidance and supervision of Dr. Dhanya K M. We further
declare that the work reported in this project has not been submitted and will not be
submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.

Place: Sreekrishnapuram
Date: 11 June 2021

SREELAYA SUDHEER (PKD17IT056)
RABIYA K (PKD17IT040)
NAVYA K J (PKD17IT036)
ANGEL ANTO (PKD17IT010)
DEPARTMENT OF INFORMATION TECHNOLOGY

GOVERNMENT ENGINEERING COLLEGE PALAKKAD

SREEKRISHNAPURAM, PALAKKAD – 678633

CERTIFICATE

This is to certify that the report entitled “Computer Aided Diagnosis Of Pancreatic
Cancer” submitted by SREELAYA SUDHEER (PKD17IT056), RABIYA K
(PKD17IT040), NAVYA K J (PKD17IT036) and ANGEL ANTO (PKD17IT010) to the
APJ Abdul Kalam Technological University in partial fulfillment of the requirements for the
award of the Degree of Bachelor of Technology in Information Technology is a bonafide
record of the project work carried out by them under our guidance and supervision. This
report has not been submitted in any form to any other university or institute for any
purpose.

GUIDE
Dr. DHANYA K M
Assoc. Professor
Dept. of Information Technology

HEAD OF THE DEPARTMENT
Dr. K.R.REMESH BABU
Assoc. Professor
Dept. of Information Technology
CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOTATION
Chapter 1: INTRODUCTION
1.1 SCOPE AND OBJECTIVE
1.2 PROBLEM STATEMENT
Chapter 2: LITERATURE SURVEY
2.1 SVM COMBINED WITH MAGNETIC RESONANCE IMAGING
2.2 IMAGE CLASSIFICATION USING RANDOM FOREST
2.3 TUMOR DETECTION FRAMEWORK FOR PANCREATIC CANCER
2.4 HEART DISEASE CLASSIFICATION USING MACHINE LEARNING TECHNIQUES
2.5 DIAGNOSIS OF PANCREATIC CANCER BY PATTERN RECOGNITION METHODS
2.6 CANCER PREDICTION USING NAÏVE BAYES, K-NEAREST NEIGHBOUR AND J48 ALGORITHM
Chapter 3: PROPOSED SYSTEM
3.1 PROPOSED SYSTEM
3.2 NEED FOR PROPOSED SYSTEM
3.3 FEASIBILITY STUDY
Chapter 4: SYSTEM DESIGN
4.1 SYSTEM ARCHITECTURE
4.1.1 GATHERING OF DATA
4.1.2 DATA CLEANING
4.1.3 MODEL TRAINING
4.1.4 PREDICTION MODULE
4.2 SYSTEM DESIGN
4.2.1 ACTIVITY DIAGRAM
4.2.2 USE CASE DIAGRAM
Chapter 5: SYSTEM IMPLEMENTATION
5.1 SOFTWARE REQUIREMENTS
5.1.1 JUPYTER NOTEBOOK
5.1.2 PYTHON PACKAGES
5.2 IMPLEMENTATION
5.2.1 DATA COLLECTION
5.2.2 EDA
5.2.3 DATA PREPROCESSING
5.2.4 PANCREATIC CANCER PREDICTION USING SUPPORT VECTOR MACHINE
5.2.5 PANCREATIC CANCER PREDICTION USING NAÏVE BAYES
5.2.6 PANCREATIC CANCER PREDICTION USING RANDOM FOREST
Chapter 6: RESULT ANALYSIS
6.1 CONFUSION MATRIX
6.2 EVALUATION PARAMETERS
6.3 RESULTS
Chapter 7: CONCLUSION AND FUTURE WORK
REFERENCES
APPENDICES
ACKNOWLEDGEMENT

Many noble hearts contributed immense inspiration and support for the successful completion
of the project preliminary works. We are unable to express our gratitude in words to such
individuals.

First of all, we would like to thank The Almighty God, for granting us the strength, courage
and knowledge to complete this project design successfully. We would like to express our
deep regard to Dr. P. C. Reghu Raj, Principal, Government Engineering College, Palakkad,
for providing facilities throughout our project.

We take this opportunity to express our profound gratitude to Dr. K.R. Remesh Babu, Head
of the Department, Department of Information Technology, Government Engineering
College, Palakkad, for providing permission and availing all required facilities for
undertaking the project in a systematic way. We are extremely grateful to Dr. Dhanya K M,
Internal Guide, Associate Professor, Department of Information Technology, Government
Engineering College, Palakkad, who guided us with her kind, cordial and valuable
suggestions. We pay our deep sense of gratitude to Ms. Sangeetha U. and Mr. Ebey S. Raj,
Project Coordinators, Department of Information Technology, Government Engineering
College, Palakkad, for their valuable guidance, keen interest and encouragement at various
stages of the project. We would also like to thank all the teaching and non-teaching staff of
Department of Information Technology, Government Engineering College, Palakkad, for the
sincere directions imparted and the cooperation in connection with the project.

We would be failing in our duty if we did not acknowledge, with grateful thanks, the authors
of the references and other literature referred to in this project.

We are also thankful to our parents for the overwhelming support given by them for the
project. Last, but not the least, we take pleasant privilege in expressing our heartfelt thanks to
our friends who were of precious help in completing this project.

ABSTRACT
Pancreatic cancer is a malignant tumor that seriously threatens the survival of patients.
Cancer is an abnormal growth of cell tissue, and pancreatic cancer is one of the notable
causes of death around the world. It starts in the tissues of the pancreas, an organ that
secretes enzymes that aid digestion and hormones that regulate the breakdown of sugars.
Pancreatic cancer is usually detected in the later stages, spreads rapidly and has a poor
prognosis. Biomarkers play an essential role in the management of patients with invasive
cancers. Pancreatic Ductal Adenocarcinoma (PDAC) is associated with poor prognosis due to
advanced presentation and limited therapeutic options. This is further complicated by the
absence of validated screening and predictive biomarkers for early diagnosis and precision
treatment, respectively. In this report we discuss various machine learning methods for
detecting pancreatic cancer. The values of selected urinary biomarkers are provided as
input to Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF)
classifiers. The diagnostic accuracies of the NB, SVM and RF classifiers are 71.7%, 74.5%
and 81.3% respectively. The experimental results indicate that the Random Forest classifier
is more feasible and promising for clinical applications in the diagnosis of pancreatic
cancer than NB and SVM.

LIST OF TABLES

No: Title

2.1 LITERATURE SURVEY SUMMARY
6.1 CONFUSION MATRIX
6.2 CONFUSION MATRIX ELEMENTS
6.3 PERFORMANCE ANALYSIS
LIST OF FIGURES

No: Title

4.1 ARCHITECTURE DIAGRAM
4.2 ACTIVITY DIAGRAM
4.3 USE CASE DIAGRAM
5.1 DATASET
5.2 HEAT MAP
5.3 CORRELATION BETWEEN DIAGNOSIS AND LYVE1
5.4 CORRELATION BETWEEN DIAGNOSIS AND COUNT
5.5 CORRELATION BETWEEN COUNT AND AGE
5.6 SVM CLASSIFICATION OF TWO CLASSES
5.7 SVM CLASSIFICATION OF THREE CLASSES
5.8 RANDOM FOREST
ABBREVIATIONS

PC Pancreatic Cancer
PDAC Pancreatic Ductal Adenocarcinoma
SVM Support Vector Machine
QGA-SVM Quantum Genetic Algorithm optimized Support Vector Machine
RBF Radial Basis Function
RF Random Forest
PHOG Pyramid HOG
PHOW Pyramid Histogram Of Visual Words
ROI Region Of Interest
ROC Receiver Operating Characteristic
KNN K-Nearest Neighbor
CNN Convolutional Neural Network
ANN Artificial Neural Network
DCNNS Deep Convolutional Neural Networks
CT Computed Tomography
DC Dependencies Computation
MRI Magnetic Resonance Imaging
PET Positron Emission Tomography
EUS Endoscopic Ultrasound
US Ultrasound
EDA Exploratory Data Analysis
BPTB Barts Pancreas Tissue Bank
UCL University College London
LYVE1 Lymphatic Vessel Endothelial Hyaluronan Receptor 1
TFF1 Trefoil Factor 1
LIV Liverpool University

NOTATION

TP True Positive
TN True Negative
FP False Positive
FN False Negative

CHAPTER 1
INTRODUCTION

Pancreatic cancer (PC) is a highly malignant tumor of the digestive tract that presents
considerable challenges in both early screening and later treatment. It is estimated
that approximately 57,600 people were diagnosed with PC and approximately 47,050
people died of PC in 2020; PC is therefore widely regarded as an incurable disease. In
developing countries, PC is still widely distributed [1]. Comprehensive diagnosis and staging
of PC are therefore particularly important: they help clinicians deliver the optimal
therapeutic schedule and allow patients to receive early medical intervention
before the disease becomes advanced. PC is a disease in which malignant (cancerous) cells
form in the tissues of the pancreas. The pancreas is a gland located behind the stomach and
in front of the spine. The pancreas produces digestive juices and hormones that regulate
blood sugar. Cells called exocrine pancreas cells produce the digestive juices, while cells
called endocrine pancreas cells produce the hormones. The majority of PCs start in the
exocrine cells. There are various treatments for PC, including surgery, chemotherapy, and
radiation therapy [1]. Chemotherapy uses drugs to treat cancer, while radiation therapy uses
X-rays or other kinds of radiation to kill cancer cells. Surgery can be used to remove a tumor
or to treat symptoms of PC. The American Cancer Society reports that only about 23% of
patients with cancer of the exocrine pancreas are still living one year after diagnosis. About
8.2% are still alive five years after being diagnosed. Early detection of PC is difficult, and
thus many cases are diagnosed late, when the cancer is usually well developed. Machine
learning, a branch of artificial intelligence, can help detect PC early [10].

1.1 SCOPE AND OBJECTIVE

The main objective of this project is to develop a machine learning model to predict the
possibility of PC and to analyze how machine learning is being used to support clinical
decision making in PC.

The scope of this project is to use machine learning techniques for early detection of PC and
to use the results in clinical diagnosis and cancer screening applications to support diagnosis.
1.2 PROBLEM STATEMENT

PC is becoming a leading cause of cancer-related death. Rapid and accurate
diagnosis of a pancreatic mass is crucial for improving outcomes. Early detection of PC is
challenging because cancer-specific symptoms occur only at an advanced stage, and a
reliable screening tool to identify high-risk patients is lacking. Machine learning techniques
offer a promising way to address this challenge. New diagnostic techniques are being
developed that open up the possibility of personalised cancer medicine.

CHAPTER 2

LITERATURE SURVEY

Several experiments and studies on the diagnosis of diseases using machine learning
techniques have been carried out in recent years.

2.1 SVM COMBINED WITH MAGNETIC RESONANCE IMAGING

This research used Support Vector Machine (SVM) combined with Magnetic Resonance
Imaging (MRI) for the diagnosis of PC. The traditional SVM classification model is
optimized to improve classification accuracy: the Quantum Genetic Algorithm (QGA) is
used to optimize its parameters, giving the QGA-SVM classification model. In the PC
detection method based on the SVM classification model, the parameters of the kernel
function and the penalty factor C are the key factors affecting recognition, so proper
parameter selection is important for improving the recognition rate. The study uses the RBF
kernel function. Overall, the work uses MRI images for computer-aided clinical diagnosis,
helping imaging doctors identify PC lesions and providing opinions and references for the
diagnosis of PC. The key issue of the study is selecting an appropriate method to extract the
key features of PC. The study achieved better-than-expected results for automatic
classification of PC by a clustering method [2]. The results show that the proposed detection
model has a high accuracy rate for the diagnosis of PC. Moreover, compared with the
conventional detection algorithm, the features are clearly distinguished and the
classification accuracy is the highest.
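
To make the role of the kernel parameter concrete, the following is a minimal sketch of the standard RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2). It is not code from the surveyed study; the function name rbf_kernel, the parameter name gamma and the sample points are assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical points: identical inputs give a similarity of 1.0,
# and the similarity decays as the points move apart.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = rbf_kernel([1.0, 2.0], [3.0, 2.0], gamma=0.5)
print(same, apart)
```

A larger gamma makes the similarity fall off faster with distance, which is why the kernel parameter and the penalty factor C are usually tuned together.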

2.2 IMAGE CLASSIFICATION USING RANDOM FOREST

The aim of this work is to classify an image by its object category using Random Forest (RF)
and ferns. The datasets used were Caltech-101 and Caltech-256, which consist of images
from 101 and 256 object categories respectively. The methodology is image representation
and matching based on spatial pyramid matching. The spatial pyramid representation uses
appearance and shape descriptors together with the image spatial layout to obtain two
representations: the Pyramid Histogram Of visual Words (PHOW) descriptor for appearance
and the Pyramid HOG (PHOG) descriptor for shape. In image matching, the similarity
between a pair of images is computed using a kernel function between their PHOG (or
PHOW) histogram descriptors. The first step is the selection of Regions of Interest (ROI), in
which a rectangular ROI is learned automatically in each of the training images. A node test
suitable for the representations of shape, appearance and pyramid spatial correspondence is
then designed. For classification, the test image is passed down each random tree until it
reaches a leaf node; the posterior probabilities are then averaged and the arg max is taken as
the classification of the input image. A random ferns classifier is also used in this work to
increase the speed of the RF classifier. Ferns are non-hierarchical structures, each consisting
of a set of binary tests. For the test images, a “sliding window” over a range of translations
and scales is applied, and each new sub-image is classified by considering the average of the
probabilities. The result obtained was 38% without the optimization, increasing by 5% with
the optimization.

2.3 TUMOR DETECTION FRAMEWORK FOR PANCREATIC CANCER

This research designs a novel and efficient pancreatic tumor detection framework that aims
to fully exploit context information at multiple scales in Computed Tomography (CT)
images. As Deep Convolutional Neural Networks (DCNNs) have shown robust performance
in medical image analysis, a number of deep-learning-based tumor detection methods have
been developed in recent years. The automatic detection of pancreatic tumors using
contrast-enhanced CT is now widely applied for the diagnosis and staging of PC. Traditional
hand-crafted methods extract only low-level features, while ordinary convolutional neural
networks fail to make full use of effective context information, which leads to inferior
detection results. The contribution of the proposed method consists mainly of three
components: Augmented Feature Pyramid networks, Self-adaptive Feature Fusion and a
Dependencies Computation (DC) Module. First, a bottom-up path augmentation is
established to fully extract and propagate accurate low-level localization information. Then,
the Self-adaptive Feature Fusion encodes much richer context information at multiple
scales based on the proposed regions. Finally, the DC Module is specifically designed to
capture the interaction information between proposals and surrounding tissues. The
experimental results report 94% accuracy.

2.4 HEART DISEASE CLASSIFICATION USING MACHINE LEARNING TECHNIQUES

The aim of this study is to classify heart disease using data mining tools and machine learning
techniques. The dataset was collected from the University of California and contains 13
features, one target variable, and 303 instances. The six data mining tools used in this work
are Orange, Weka, RapidMiner, Knime, Matlab and Scikit-learn, and the six machine
learning techniques used are Logistic regression, k-Nearest Neighbor, ANN, SVM, RF and
NB. Accuracy, sensitivity and specificity are estimated, and ANN is found to be the best
model for heart disease classification among the compared techniques on this dataset.

In each of the six data mining tools, the six machine learning techniques are applied and
confusion matrices are extracted to calculate the performance measures of the models. To
analyze the results, the researchers made two comparisons: a comparison between different
machine learning techniques in the same data mining tool, and a comparison of the same
machine learning technique across the data mining tools.

2.5 DIAGNOSIS OF PANCREATIC CANCER BY PATTERN RECOGNITION METHODS

In this study, the diagnosis of PC was made with ANN and k-NN classifiers using a dataset
consisting of microarray gene expression profiles. Analysis of Variance (ANOVA), a
statistical feature selection method, was used to remove unrelated and unnecessary features
from the high-dimensional PC profiles. When the precision, sensitivity and accuracy values
obtained from the two algorithms are compared, ANN gives better results. For the k-NN
algorithm, the study found that the k parameter must be chosen carefully to be optimal [11].
Classification accuracy is 82.7% with k-NN and 84.6% with ANN.

2.6 CANCER PREDICTION USING NAIVE BAYES, k-NEAREST NEIGHBOUR AND J48 ALGORITHM

The NB, k-NN and J48 algorithms are used in this work for predicting cancer. NB is easy
to build and particularly useful for very large datasets. k-NN keeps the training dataset
separated into its classes and predicts the class of a new point from the points nearest to it.
The J48 classifier builds a decision tree from the training dataset by repeatedly splitting the
data into smaller subsets until a decision can be made. The Weka tool is used to measure
accuracy on a cancer dataset covering nine types of cancer, with 10-fold cross-validation.
The accuracy is 98.2% for NB, 98.8% for k-NN and 98.5% for J48.

This research predicts cancer using three algorithms and identifies the most accurate among
them. The authors use the Windows 10 operating system and Weka version 3.6. Accuracy
measures the ability of a classifier: the greater the accuracy, the better the classifier. The
main work is therefore to find the accuracy of the three classification algorithms and select
the one with the highest accuracy as the best. For the nine types of cancer, the researchers
analyze accuracy, error rate, sensitivity, specificity, precision and F-score. Error rate is the
proportion of instances that are misclassified. Sensitivity is the proportion of actual positives
that are correctly identified, and specificity is the proportion of actual negatives that are
correctly identified. A classifier is ideal when FP = 0 and FN = 0. Using 10-fold cross-
validation with the three classification algorithms, Weka produces a confusion matrix, which
gives the TP, FP, TN and FN values.
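
The measures named above can be computed directly from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name binary_metrics and the counts passed to it are hypothetical and are not results from the surveyed paper. It assumes none of the denominators is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard evaluation measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```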

Comparison of the literature survey papers is shown in the table 2.1 given below.

Table 2.1: Literature survey summary

| Author | Title | Methodology | Remarks |
|---|---|---|---|
| Zhang, et al. | Support vector machine combined with magnetic resonance imaging for accurate diagnosis of paediatric pancreatic cancer | Classification: SVM; Multi-fold cross-validation | SVM was found to be accurate for diagnosing paediatric PC |
| Bosch, et al. | Image Classification using Random Forest and Ferns | Random Forest classifier; Random Fern classifier | Without optimization: 38.7%; With optimization: 43.7% |
| Zhang, et al. | A Novel and Efficient Tumor Detection Framework for Pancreatic Cancer via CT Images | Augmented Feature pyramid network; Self-adaptive feature fusion; Dependencies computation module | Results show slight improvements in accuracy |
| Tougui, et al. | Heart disease classification using data mining tools and machine learning techniques | Data mining tools; Machine Learning | ANN gives better results than the compared techniques: KNN, SVM, NB, RF, Logistic regression |
| Arslan, et al. | Diagnosis Of Pancreatic Cancer By Pattern Recognition Methods using Gene Fade Profiles | KNN; ANN | KNN: 82.7%; ANN: 84.6% |
| Maliha, et al. | Cancer Disease Prediction Using Naive Bayes, K Nearest Neighbor and J48 algorithm | NB; KNN; J48 | NB: 98.2%; KNN: 98.8%; J48: 98.5% |

CHAPTER 3

PROPOSED SYSTEM

3.1 PROPOSED SYSTEM

The proposed system analyzes the accuracy of prediction of PC using three machine learning
techniques: SVM, RF and NB. These classifiers come under the category of supervised
learning. Each classifier is trained with a dataset that contains the features and labels
regarding PC, and thus becomes a trained model that can predict the label. The trained
model is tested with new data or with random features from the dataset. The performance of
the SVM, NB and RF classifiers is compared to find out which of them has the best
accuracy. The system also predicts the outcome, that is, whether the chosen person has the
disease or not.

3.2 NEED FOR PROPOSED SYSTEM

The majority of patients with PC die within a few months of diagnosis and only around 1%
survive for 10 years. This is mainly because PC is usually diagnosed late. Patients diagnosed
early have a much better chance of cure. Current diagnostic methods can be time consuming and
may involve uncomfortable procedures, which has led to an increasing interest in better
diagnostic and screening tests. Urine represents an easily obtainable testing medium. The
proposed system will investigate how useful and accurate urinary biomarkers are for
detecting PC in patients.

3.3 FEASIBILITY STUDY

A feasibility study is an analysis that takes all of a project's relevant factors into account,
including economic, technical, legal, and scheduling considerations, to ascertain the
likelihood of completing the project successfully. A feasibility study assesses the practicality
of a proposed plan or project. Its goals are to understand thoroughly all aspects of a project,
concept, or plan; to become aware of any potential problems that could occur while
implementing the project; and to determine whether, after considering all significant
factors, the project is viable. This section provides the feasibility analysis for a project in
the field of PC prediction. Generally, feasibility studies precede technical development and
project implementation. Technical feasibility is the evaluation of the hardware, software, and
other technical requirements of the proposed system. The proposed methodology can be
implemented using the Python programming language in the Jupyter Notebook. All the
technologies are widely used and readily available, so the entire project is found to be
technically feasible.

CHAPTER 4

SYSTEM DESIGN

4.1 SYSTEM ARCHITECTURE

The system architecture depicting all the four modules is shown below:

Fig 4.1: Architecture diagram

Figure 4.1 shows the overall architecture for the diagnosis of PC. The system consists of
four modules that are processed consecutively:

1. Gathering of data

2. Data cleaning

3. Model Training

4. Prediction Module

4.1.1 GATHERING OF DATA

In the data gathering module, the patients’ urinary biomarker values, which can help in early
detection of PC, are entered into the proposed system and loaded as a dataset. The data used
were urinary biomarkers obtained from the Centre for Cancer Biomarkers and
Biotherapeutics, Barts Cancer Institute, Queen Mary University of London, London, United
Kingdom. The data consisted of 591 samples and 12 features. The 12 features were age, sex,
stage, plasma CA19-9, creatinine, LYVE1, REG1B, REG1A, TFF1, id, patient cohort, and
sample origin. The dataset consists of a series of biomarkers from the urine of three groups of
patients as follows:

• Healthy controls
• Patients with non-cancerous pancreatic conditions, like chronic pancreatitis
• Patients with pancreatic ductal adenocarcinoma

4.1.2 DATA CLEANING

In module 2, the data is cleaned and preprocessed by handling the missing values.
Features with only a few null values have them replaced by the mean or mode of the
remaining data, whereas features with many null values are dropped since they may
affect the performance of the proposed system. The function isna() is used to identify the
presence of null values. After this check, attributes present in non-numerical form are
converted into numerical form. Data visualization and exploratory data analysis are also
done in this step, using the Python packages pandas, seaborn and Matplotlib, to find the
correlation between the features.
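
The cleaning rules described above can be sketched in plain Python as follows. This is an illustration of the logic only, not the project's pandas code: the helper names, the 50% drop threshold and the sample values are all assumptions, and None stands in for a null value.

```python
from statistics import mean, mode


def clean_columns(columns, max_missing_fraction=0.5):
    """Impute sparse nulls with mean/mode; drop features that are mostly null.

    `columns` maps a feature name to a list of values (None = missing).
    The 0.5 threshold is an assumed cut-off, not taken from the report.
    """
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > max_missing_fraction:
            continue  # too many nulls: drop the feature
        numeric = all(isinstance(v, (int, float)) for v in present)
        fill = mean(present) if numeric else mode(present)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


def encode_categories(values):
    """Convert non-numerical values to integer codes."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


# Hypothetical sample rows, for illustration only.
columns = {
    "age": [50, None, 70, 60],
    "sex": ["F", "M", None, "F"],
    "stage": [None, None, None, "II"],
}
cleaned = clean_columns(columns)
cleaned["sex"], sex_codes = encode_categories(cleaned["sex"])
print(cleaned)
```

Here the missing age is filled with the mean, the missing sex with the mode, and stage is dropped because three of its four values are null.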

4.1.3 MODEL TRAINING

In module 3, model training is carried out. The dataset is divided into training and testing
datasets using the train_test_split function in the Sklearn package: 70% of the dataset is used
for training and 30% for testing. The classification algorithms SVM, RF and NB are then
trained in this module.
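
The 70/30 division can be illustrated with a small standard-library sketch. It mimics the idea of a shuffled split rather than reproducing the Sklearn function; the helper name split_train_test, the seed and the toy data are assumptions.

```python
import random


def split_train_test(rows, labels, test_fraction=0.3, seed=42):
    """Shuffle the row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data: ten rows whose label can be recomputed from the first feature.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 3 for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(rows, labels)
print(len(X_train), len(X_test))
```

Fixing the seed keeps the split reproducible, so that the three classifiers can be compared on exactly the same training and testing rows.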

In the proposed system, RF is used as one of the classifiers. RF is a machine learning
technique used to solve regression and classification problems. It utilizes ensemble learning,
a technique that combines many classifiers to provide solutions to complex
problems. The RF algorithm consists of many decision trees. The ‘forest’ it generates is
trained through bagging (bootstrap aggregating). Entropy is used as the split criterion in the
decision trees, so that each split is made on the most informative feature. Each decision tree
is built on a bootstrapped dataset drawn at random from the dataset of urinary biomarkers.
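
Three building blocks of this description can be sketched with the standard library: the entropy of a set of labels, a bootstrap sample, and the majority vote that combines the trees' predictions. This is an illustrative sketch, not the Sklearn implementation; the function names and toy values are assumptions.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree in bagging."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the per-tree predictions into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(8))              # stand-ins for dataset rows
sample = bootstrap_sample(rows, rng)
mixed = entropy([1, 1, 2, 2])      # evenly mixed node
vote = majority_vote([3, 1, 3, 2, 3])
print(sample, mixed, vote)
```

A split is preferred when it lowers the entropy of the resulting child nodes, and a node whose samples all share one label has zero entropy.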

SVM is used for classification as well as regression problems. It is built on statistical
learning theory and the principle of structural risk minimization, and has strong
generalization ability. It seeks an optimal separating hyperplane in a high-dimensional
feature space for sample classification. The proposed system is trained using SVM with a
linear kernel as the kernel function, because the dataset appears to be linearly separable.
The SVM thus classifies the data into three labels using hyperplanes.
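
As a rough illustration of how hyperplanes can separate three labels, the sketch below scores a sample against one hyperplane per class (score = w · x + b) and picks the class with the highest score. The weights are invented rather than learned, the one-vs-rest arrangement is an assumption made for simplicity (Sklearn's SVC combines pairwise one-vs-one classifiers instead), and the label names merely echo the three patient groups.

```python
def decision_scores(x, hyperplanes):
    """Signed score w . x + b of sample x for each class hyperplane."""
    return {
        label: sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
        for label, (weights, bias) in hyperplanes.items()
    }


def predict_label(x, hyperplanes):
    """Pick the class whose hyperplane gives the highest score."""
    scores = decision_scores(x, hyperplanes)
    return max(scores, key=scores.get)


# Hypothetical (weights, bias) pairs; a real model would learn these.
hyperplanes = {
    "control": ([-1.0, -1.0], 1.0),
    "benign": ([1.0, -1.0], 0.0),
    "PDAC": ([1.0, 1.0], -1.0),
}
for point in ([0.0, 0.0], [2.0, -1.0], [2.0, 2.0]):
    print(point, predict_label(point, hyperplanes))
```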

The NB classifier is also used in the proposed system. NB is a classification technique based
on Bayes theorem. Its name has two parts, Naïve and Bayes: it is naïve because it assumes
that all the features are independent of one another. Even when the features are in fact
interdependent, each of them is treated as contributing independently to the probability. NB
finds the probability of each class from each feature with the help of Bayes theorem, and
predicts the most probable class. In the proposed system, Gaussian NB is applied because
the system works with normalized, continuous data.
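
A compact sketch of Gaussian NB is given below to show how the per-feature probabilities combine. It is a simplified illustration, not the Sklearn GaussianNB class: the function names, the small variance floor of 1e-9 and the toy data are assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        features = list(zip(*members))
        model[label] = {
            "prior": len(members) / len(rows),
            "means": [mean(f) for f in features],
            # Small floor (assumed) avoids division by zero for constant features.
            "variances": [pvariance(f) + 1e-9 for f in features],
        }
    return model


def log_gaussian(x, mu, var):
    """Log of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior + summed log likelihoods."""
    def log_posterior(params):
        return math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["variances"]))
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy normalized features with two classes, for illustration only.
rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels = [0, 0, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [0.15, 0.15]),
      predict_gaussian_nb(model, [0.85, 0.85]))
```

Summing the per-feature log likelihoods is exactly where the independence assumption enters: each feature contributes its own term regardless of the others.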

4.1.4 PREDICTION MODULE

In module 4, the accuracy of the system is calculated by comparing the predicted results
with the actual labels of the testing data. The prediction of PC is then done using the predict
method with the features as parameters. A confusion matrix is a table that summarizes the
performance of a classification model on a set of test data for which the true values are
known; it is one of the methods used to calculate accuracy in data mining and decision
support systems. Model accuracy is the measurement used to determine which model is best
at identifying relationships and patterns between variables in a dataset based on the input, or
training, data. Accuracy is defined as the percentage of correct predictions for the test data.
It is calculated by dividing the number of correct predictions by the total number of
predictions, and gives the overall accuracy of the model.
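
The calculation described above can be sketched for three classes as follows. The labels 1, 2 and 3 and the two label lists are hypothetical, and the helper names are assumptions; rows of the matrix are actual classes and columns are predicted classes.

```python
def confusion_matrix(actual, predicted, labels):
    """Count predictions: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical test labels and predictions, for illustration only.
actual = [1, 1, 2, 2, 3, 3, 3, 1]
predicted = [1, 2, 2, 2, 3, 1, 3, 1]
matrix = confusion_matrix(actual, predicted, labels=[1, 2, 3])
print(matrix, accuracy(matrix))
```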

4.2 SYSTEM DESIGN

4.2.1 ACTIVITY DIAGRAM

Activity Diagrams are used to illustrate the flow of control in a system and refer to the steps
involved in the execution of a system. Sequential and concurrent activities are modelled using
activity diagrams. Basically workflows are visually depicted using an activity diagram. An
activity diagram focuses on condition of flow and the sequence in which it happens. The
Activity diagram of the proposed system is shown in figure 4.2.

Fig 4.2: Activity diagram

4.2.2 USE CASE DIAGRAM

Figure 4.3 shows the use case representation of the system. It describes the system by
showing the actors, the use cases and their relationships. The main aim of a use case
diagram is to provide a standard way to visualize how a system has been designed. Use case
diagrams capture the dynamics of a system from different angles.

Fig 4.3: Use case diagram

CHAPTER 5

SYSTEM IMPLEMENTATION

System implementation is the process of defining how the information system should be
built, ensuring that the information system is operational and used, and ensuring that it
meets quality standards.

5.1 SOFTWARE REQUIREMENTS

5.1.1 JUPYTER NOTEBOOK

Jupyter is a free, open-source, interactive web tool known as a computational notebook,
which researchers can use to combine software code, computational output, explanatory text
and multimedia resources in a single document.

5.1.2 PYTHON PACKAGES

Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released under the three-
clause BSD license.

NumPy is the fundamental package for scientific computing in Python. NumPy arrays
facilitate advanced mathematical and other types of operations on large amounts of data. It
has functions for working with arrays, linear algebra, Fourier transforms and matrices.

Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It is used for creating static, animated, and interactive
visualizations in Python.

Seaborn is a data visualization library built on top of Matplotlib and closely integrated with
pandas data structures in Python. Visualization is the central part of Seaborn which helps in
exploration and understanding of data. It is used to create more attractive and informative
statistical graphics.

Sklearn (scikit-learn) is probably the most useful library for machine learning in Python. It
contains many efficient tools for machine learning and statistical modeling, including
classification, regression, clustering and dimensionality reduction.

5.2 IMPLEMENTATION

5.2.1 Data collection

The data used were urinary biomarkers obtained from the Centre for Cancer Biomarkers and
Biotherapeutics, Barts Cancer Institute, Queen Mary University of London, London, United
Kingdom. The data consisted of 591 samples and 12 features. The 12 features were age, sex,
stage, plasma CA19-9, creatinine, LYVE1, REG1B, REG1A, TFF1, sample ID, patient cohort
and sample origin. The researchers gathered a series of biomarkers from the urine of three
groups of patients:

• Healthy controls

• Patients with non-cancerous pancreatic conditions, like chronic pancreatitis

• Patients with pancreatic ductal adenocarcinoma

The collected dataset is shown in figure 5.1.

Fig 5.1: Dataset

Creatinine is a metabolic waste product that is often used as an indicator of kidney function.
LYVE1 is lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role
in tumor metastasis. REG1B is a protein that may be associated with pancreas regeneration
[17]. TFF1 is trefoil factor 1, which may be related to regeneration and repair of the urinary
tract. In the patient cohort attribute, Cohort 1 denotes previously used samples and Cohort 2
denotes new samples. Plasma CA19-9 is the blood plasma level of the CA 19-9 tumor marker,
which is often elevated in patients with PC. Sample origin is the place from which the
samples were collected. Stage gives the stage of PC [11].

The key features are four urinary biomarkers: creatinine, LYVE1, REG1B, and TFF1.

• Creatinine is a metabolic waste product that is often used as an indicator of kidney
function. Creatinine is measured in milligrams per deciliter (mg/dL). The normal values are
0.9 to 1.3 mg/dL for adult males and 0.6 to 1.1 mg/dL for adult females.

• LYVE1 is lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role
in tumor metastasis.

• REG1B is a protein that may be associated with pancreas regeneration.

• TFF1 is trefoil factor 1, which may be related to regeneration and repair of the urinary tract.

Age and sex, both included in the dataset, may also play a role in who gets PC. The dataset
includes a few other biomarkers as well, but these were not measured in all patients (they
were collected partly to measure how various blood biomarkers compared to urine
biomarkers).

Features of the dataset:

• Sample ID: Unique string identifying each subject


• Patient's Cohort:

Cohort 1, previously used samples;

Cohort 2, newly added samples

• Sample Origin:

BPTB: Barts Pancreas Tissue Bank, London, UK

ESP: Spanish National Cancer Research Centre, Madrid, Spain

LIV: Liverpool University, UK

UCL: University College London, UK

• Age: Age in years


• Sex: M = male, F = female
• Diagnosis (1=Control, 2=Benign, 3=PDAC):

1 = control (no pancreatic disease)

2 = benign hepatobiliary disease (119 of which are chronic pancreatitis)

3 = pancreatic ductal adenocarcinoma

• Stage: For those with PC, what stage was it? One of I, IA, IB, II, IIA, IIB, III, IV
• Benign Samples Diagnosis: For those with a benign, non-cancerous diagnosis, what
was the diagnosis?
• Plasma CA19-9 U/ml: Blood plasma levels of the CA 19-9 tumor marker, which is
often elevated in patients with PC. Only assessed in 350 patients (one goal of the
study was to compare). The upper limit of the normal reference value for CA19-9 is 37
U/mL.
• Creatinine mg/ml: Urinary biomarker of kidney function
• LYVE1 ng/ml: Urinary levels of Lymphatic vessel endothelial hyaluronan receptor 1,
a protein that may play a role in tumor metastasis
• REG1B ng/ml: Urinary levels of a protein that may be associated with pancreas
regeneration.
• TFF1 ng/ml: Urinary levels of Trefoil Factor 1, which may be related to regeneration
and repair of the urinary tract

• REG1A ng/ml: Urinary levels of a protein that may be associated with pancreas
regeneration. Only assessed in 306 patients (one goal of the study was to assess
REG1B vs REG1A)

5.2.2 EDA

Exploratory data analysis (EDA) is an essential step in any research analysis. The primary
aim of exploratory analysis is to examine the data for distribution, outliers and anomalies
in order to direct specific testing of a hypothesis. It also provides tools for hypothesis
generation by visualizing and understanding the data, usually through graphical
representation. EDA aims to assist the analyst's natural pattern recognition, and feature
selection techniques often fall under it as well. EDA is a fundamental early step after data
collection and pre-processing, where the data is simply visualized, plotted and manipulated,
without any assumptions, in order to help assess the quality of the data and build models
[29]. Most EDA techniques are graphical in nature, with a few quantitative techniques. The
reason for the heavy reliance on graphics is that the main role of EDA is to explore, and
graphics give analysts unparalleled power to do so and to gain insight into the data. There
are many ways to categorize EDA techniques [23].

In this project a heatmap is used. A heatmap visualizes data in a 2-dimensional format in the
form of colored maps. The color maps use hue, saturation or luminance to achieve color
variation, which gives the reader visual cues about the magnitude of the numeric values.
Heatmaps replace numbers with colors because the human brain understands visuals better
than numbers, text or other written data, so they represent data in an easy-to-understand
manner and have become popular. A heatmap can describe the density or intensity of
variables and visualize patterns, variance and even anomalies. It shows relationships
between variables, which are plotted on both axes; patterns are found by noticing the color
change across the cells. A heatmap only accepts numeric data and plots it on a grid,
displaying different data values by varying the color intensity [25].

The Heat Map procedure shows the distribution of a quantitative variable over all
combinations of 2 categorical factors. If one of the 2 factors represents time, then the
evolution of the variable can be easily viewed using the map. A gradient color scale is used to
represent values of the quantitative variable.

Heatmap representation of the correlation of the features is shown in figure 5.2:

Dataset visualization through EDA helps to understand the correlation between the features
and to understand the core features that contribute more to the accuracy of the classification
system.
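A correlation heatmap of this kind is normally drawn from a correlation matrix. The sketch below shows, under stated assumptions, how such a matrix could be computed with pandas. The column names follow the dataset description, but the six rows of values are hypothetical and are not taken from the study.

```python
import pandas as pd

# Hypothetical values; only the column names follow the dataset description
df = pd.DataFrame({
    "age": [33, 81, 51, 61, 62, 74],
    "creatinine": [1.83, 0.97, 0.78, 0.70, 0.21, 0.52],
    "LYVE1": [0.89, 2.04, 0.15, 0.003, 0.001, 7.02],
    "diagnosis": [1, 1, 2, 2, 3, 3],
})

# Pearson correlation between every pair of numeric columns
corr = df.corr()
```

The resulting square matrix can be passed to seaborn, for example with sns.heatmap(corr, annot=True), to obtain a colored grid in which each cell shows the correlation between two features.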

Figure 5.3 shows the correlation between the diagnosis and LYVE1.

Fig 5.3: Correlation between diagnosis and LYVE1

Figure 5.4 shows the correlation between the diagnosis and count. The number of samples
diagnosed with benign hepatobiliary diseases is greater than the number of PC-diagnosed
samples and of normal patients.

Fig 5.4: Correlation between diagnosis and count

The correlation between the count of samples and age is represented in figure 5.5. The
most common age group in this dataset is 65-70.

[Plot: Count (vertical axis) against Age (horizontal axis)]

Fig 5.5: Correlation between count and age

5.2.3 DATA PREPROCESSING

Data preprocessing is used to remove inconsistencies and incompleteness from the data.
This step is essential because missing attributes, noise, outliers or duplicate records degrade
the quality of the results. The dataset contains the urinary biomarkers of the patients and
consists of 12 features and 1 label. First the dataset is loaded. It is then cleaned by
eliminating the null values, whose presence is identified with the isna() function. Next,
attributes in non-numerical form have to be converted into numerical form; in this dataset
the attributes sample origin and sex are converted into numerical values using the replace
function. Null values were present in stage, benign sample diagnosis, plasma_CA19_9 and
REG1A. The null values in plasma_CA19_9 and REG1A were replaced by the mean value of
the respective attribute. The features sample_id, sample origin, patient_cohort and benign
sample diagnosis were found to have many missing values, so these features were dropped,
since keeping them would reduce the accuracy of the system. A second check for null values
then showed that the data was clean. The essential features that contribute to the diagnosis
of PC are thus identified. The nine features retained were age, sex, stage, plasma_CA19_9,
creatinine, LYVE1, REG1B, REG1A and TFF1. After data preprocessing the dataset therefore
contains 9 features and 1 label.
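The cleaning steps described above can be sketched with pandas as follows. This is an illustration only, not the project's actual notebook: the toy data frame and its values are hypothetical, the column names are assumed from the names used in the text, and the numeric codes chosen for sex are arbitrary. The stage column is left out because the text does not say how its null values were treated.

```python
import pandas as pd

# Toy frame with hypothetical values; column names are assumed from the text
df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "patient_cohort": ["Cohort1", "Cohort2", "Cohort1", "Cohort2"],
    "sample_origin": ["BPTB", "LIV", "ESP", "UCL"],
    "age": [33, 81, 51, 61],
    # object dtype so that replace() behaves the same across pandas versions
    "sex": pd.Series(["F", "M", "M", "F"], dtype=object),
    "diagnosis": [1, 2, 3, 3],
    "benign_sample_diagnosis": [None, "Pancreatitis", None, None],
    "plasma_CA19_9": [11.7, None, 7.0, 8.0],
    "REG1A": [1262.0, None, 228.4, None],
})

# 1. Count the null values in every column
null_counts = df.isna().sum()

# 2. Encode the categorical attribute sex as numbers (codes are arbitrary)
df["sex"] = df["sex"].replace({"M": 0, "F": 1}).astype(int)

# 3. Replace nulls in the two partially measured biomarkers by the column mean
for column in ["plasma_CA19_9", "REG1A"]:
    df[column] = df[column].fillna(df[column].mean())

# 4. Drop the columns that the text reports as dropped
df = df.drop(columns=["sample_id", "patient_cohort", "sample_origin",
                      "benign_sample_diagnosis"])

# 5. Confirm that no null values remain
remaining_nulls = int(df.isna().sum().sum())
```

Filling with the mean keeps every record in the dataset, at the cost of slightly reducing the variance of the filled columns.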

5.2.4 PANCREATIC CANCER PREDICTION USING SVM

SVM is used for classification as well as regression problems. It is built on statistical
learning theory, is based on the principle of structural risk minimization and has strong
generalization ability. It searches for the optimal separating hyperplane in a high-dimensional
feature space for sample classification. The goal of the SVM algorithm is to create the best
line or decision boundary that segregates n-dimensional space into classes, so that a new data
point can easily be put in the correct category in the future. This best decision boundary is
called a hyperplane [2]. SVM chooses the extreme points, or vectors, that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed a support vector machine.

In the proposed methodology SVM works as follows. First, attributes with many missing
values are discarded. Where an attribute has only a few missing values, the corresponding
records are dropped instead of the whole feature, which gives a dataset with 591 records and
12 attributes. After data cleaning, the dataset is loaded into a data frame provided by the
pandas library in Python. The dataset is then divided into training data and test data in the
ratio 7:3. The model is trained with SVM, which belongs to a set of supervised learning
methods used for classification, regression and outlier detection. Finally, the test data is
provided to the trained PC model.
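The 7:3 split and the SVM training step could look roughly like the sketch below, which uses scikit-learn. Because the cleaned dataset is not reproduced here, a synthetic three-class dataset with nine numeric features stands in for it. The kernel and the random seeds are assumptions, since the text does not state the settings that were used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned dataset: 3 classes, 9 numeric features
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           n_classes=3, random_state=0)

# 70% of the samples for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = SVC(kernel="rbf")  # kernel choice is an assumption
model.fit(X_train, y_train)

# Predict the held-out samples and compare with their true labels
y_pred = model.predict(X_test)
svm_accuracy = accuracy_score(y_test, y_pred)
```

The accuracy obtained on synthetic data says nothing about the accuracy on the urinary biomarker data; the sketch only shows the order of the steps.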

Figure 5.6 shows the classification of two classes using the hyperplanes of SVM.

Fig 5.6: SVM Classification of two classes

Figure 5.7 shows the classification of three classes using the hyperplanes of SVM.

Fig 5.7: SVM Classification of three classes

In its simplest form, SVM does not support multiclass classification natively. It supports
binary classification, separating data points into two classes. For multiclass classification,
the same principle is applied after breaking the multiclass problem down into multiple binary
classification problems. The idea is to map the data points to a high-dimensional space to
gain mutual linear separation between every two classes. This is called the One-vs-One
approach, which uses one binary classifier for each pair of classes. Another approach is
One-vs-Rest, in which the breakdown uses one binary classifier for each class.
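To make the two decompositions concrete, the short sketch below lists the binary problems that the pairwise approach and the per-class approach described above (commonly called one-vs-one and one-vs-rest) would create for the three diagnosis classes. The class names are shorthand chosen for this illustration.

```python
from itertools import combinations

classes = ["control", "benign", "PDAC"]

# Pairwise approach: one binary classifier for every unordered pair of classes
ovo_pairs = list(combinations(classes, 2))

# Per-class approach: one binary classifier per class, against all other classes
ovr_tasks = [(c, [other for other in classes if other != c]) for c in classes]
```

For k classes the pairwise approach needs k(k-1)/2 classifiers and the per-class approach needs k. With three classes both give three binary problems, but the counts diverge quickly as the number of classes grows.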

5.2.5 PANCREATIC CANCER PREDICTION USING NAÏVE BAYES

NB is a classification technique based on Bayes theorem. The name consists of two parts,
Naïve and Bayes. It is called naïve because it works on the assumption that all the features
are independent of one another: even if the features are interdependent, each of them is
treated as contributing independently to the probability. In other words, the impact of an
attribute value on a class is assumed not to depend on the other attribute values [13].

$$P(h \mid D) = \frac{P(D \mid h) \times P(h)}{P(D)}$$

P(h): the probability of hypothesis h being true (regardless of the data). This is known as the
prior probability of h.

P(D): the probability of the data (regardless of the hypothesis). This is known as the
evidence, or marginal probability of the data.

P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.

P(D|h): the probability of the data D given that the hypothesis h is true. This is known as the
likelihood.
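A small numeric example may help to show how the four quantities of Bayes theorem combine. In the sketch below, P(D) is expanded with the law of total probability over the hypothesis and its complement. The function name and all of the numbers are hypothetical and are not taken from the study.

```python
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """Apply Bayes theorem for a hypothesis h and its complement."""
    # P(D) by the law of total probability
    p_d = (likelihood_d_given_h * prior_h
           + likelihood_d_given_not_h * (1.0 - prior_h))
    return likelihood_d_given_h * prior_h / p_d


# Hypothetical numbers: P(h) = 0.3, P(D|h) = 0.8, P(D|not h) = 0.2
p_h_given_d = posterior(prior_h=0.3, likelihood_d_given_h=0.8,
                        likelihood_d_given_not_h=0.2)
```

Here P(D) = 0.8 × 0.3 + 0.2 × 0.7 = 0.38, so the posterior is 0.24 / 0.38, or about 0.63: observing the data raises the probability of the hypothesis from 0.3 to roughly 0.63.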

In the proposed methodology NB works as follows:

1. Load the dataset.

2. Clean and pre-process the data. Missing values are removed or replaced using the pandas
dropna or fillna methods.

3. Perform data analysis and data visualization. Data analysis is done using the Python
package pandas, and matplotlib and seaborn are used for data visualization. The correlation
of the data is calculated.

4. Divide the data into training and testing datasets using train_test_split from
sklearn.model_selection, with 70% for training and 30% for testing.

5. Train the Gaussian NB classifier using the training data. After building the classifier, the
model is ready to make predictions, and the predict() method is used with the test set
features as its parameters.

6. Calculate the accuracy of the system by comparing the predicted values with the actual
labels of the testing dataset.
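The training and prediction steps listed above can be sketched with scikit-learn's GaussianNB class. As the project data is not reproduced here, a synthetic three-class dataset stands in for the cleaned biomarker data; the dataset parameters and the random seeds are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the cleaned dataset: 3 classes, 9 numeric features
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Class predictions and per-class posterior probabilities for the test set
y_pred = classifier.predict(X_test)
probabilities = classifier.predict_proba(X_test)

nb_accuracy = accuracy_score(y_test, y_pred)
```

Each row of the probability array holds the posterior probability of every class for one test sample, and the predicted class is the one with the largest posterior.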

5.2.6 PANCREATIC CANCER PREDICTION USING RANDOM FOREST

RF is a machine learning technique that is used to solve regression and classification
problems. It utilizes ensemble learning, a technique that combines many classifiers to
provide solutions to complex problems. The RF algorithm consists of many decision trees.
The ‘forest’ generated by the RF algorithm is trained through bagging, or bootstrap
aggregating. Bagging is an ensemble meta-algorithm that improves the accuracy of machine
learning algorithms [9]. The RF algorithm establishes the outcome based on the predictions
of the decision trees: for regression it takes the average of the outputs of the various trees,
and for classification it takes the majority vote. Increasing the number of trees generally
increases the precision of the outcome. RF works in two phases: the first is to create the
forest by combining N decision trees, and the second is to make predictions with each tree
created in the first phase [31].

In the proposed system, urinary biomarkers are used to create the dataset. The preprocessed
data, which had been divided into training and testing data, was fitted to the RF classifier.
The classifier selects K data points randomly from the training set to form the bootstrapped
datasets, and a decision tree is built for each of these bootstrapped datasets. The number of
decision trees in the proposed system was set manually to one hundred. The prediction of
each decision tree is noted, and the category that wins the majority of votes is assigned as
the predicted class.

After training the system by fitting the training set to the RF classifier, the testing data (30%
of the dataset) was given to the trained system and the predictions were recorded. The
predicted values were then cross-checked against the actual labels of the testing data. A
confusion matrix was built by comparing the predicted values with the actual labels, and the
accuracy of the RF classifier was found.
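A possible scikit-learn version of this train, predict and evaluate sequence is sketched below. The forest uses one hundred trees, as stated in the text. Everything else is an assumption: the data is a synthetic three-class stand-in for the biomarker dataset, and the random seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset: 3 classes, 9 numeric features
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           n_classes=3, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

# One hundred decision trees, each grown on a bootstrapped sample
forest = RandomForestClassifier(n_estimators=100, random_state=2)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)

# Rows of the matrix are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
rf_accuracy = accuracy_score(y_test, y_pred)
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test samples, which is a quick consistency check on the two results.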

The working process can be summarized as follows:

1. Select K random data points from the training set.
2. Build the decision trees associated with the selected data points (subsets).
3. Choose the number N of decision trees that you want to build.
4. Repeat steps 1 and 2.
5. For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority of votes.
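The two ideas that carry these steps, drawing a random sample for each tree and taking a majority vote over the tree outputs, can be illustrated without any machine learning library. In the sketch below the function names, the stand-in training rows and the five tree votes are hypothetical, and the sampling is done with replacement, as is usual for bootstrapped datasets.

```python
import random
from collections import Counter


def bootstrap_sample(rows, k, rng):
    """Draw k rows at random with replacement (a bootstrapped dataset)."""
    return [rng.choice(rows) for _ in range(k)]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
training_rows = list(range(10))  # stand-in for ten training samples
sample = bootstrap_sample(training_rows, k=10, rng=rng)

# Hypothetical outputs of five trees for one new data point
tree_votes = ["PDAC", "benign", "PDAC", "control", "PDAC"]
predicted_class = majority_vote(tree_votes)
```

Because the sample is drawn with replacement, some training rows usually appear more than once and others not at all, which is what makes the individual trees differ from one another.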

The working of the RF algorithm is depicted in figure 5.8.

Fig 5.8: Random forest

The implementation steps of the RF algorithm in the proposed system are as follows:

1) Data pre-processing
2) Fitting the RF algorithm to the training set
3) Predicting the test result
4) Testing the accuracy of the result (creation of the confusion matrix)
5) Visualizing the test set result

CHAPTER 6

RESULT ANALYSIS

6.1 Confusion matrix

A confusion matrix is a table that summarizes the performance of a classification model on a
set of test data for which the true values are known. It is one of the methods used to
calculate accuracy in data mining and decision support systems. Classification accuracy
alone can be misleading if there is an unequal number of observations in each class.
Calculating a confusion matrix gives a better idea of what a classification model is getting
right and what types of errors it is making [27].

The table 6.1 shows the cells representing positive and negative predictions in the confusion
matrix

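Table 6.1: Confusion matrix

| Actual \ Predicted | Normal State | Benign Hepatobiliary Disease | PDA |
|---|---|---|---|
| Normal State | +ve (Cell 1) | -ve (Cell 2) | -ve (Cell 3) |
| Benign Hepatobiliary Disease | -ve (Cell 4) | +ve (Cell 5) | -ve (Cell 6) |
| PDA | -ve (Cell 7) | -ve (Cell 8) | +ve (Cell 9) |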

The table 6.2 depicts the True Positive, False Negative, False Positive and True Negative for
each class

Table 6.2: Confusion matrix elements

| | Normal State | Benign Hepatobiliary Disease | PDA |
|---|---|---|---|
| TP | Cell1 | Cell5 | Cell9 |
| FP | Cell4 + Cell7 | Cell2 + Cell8 | Cell3 + Cell6 |
| TN | Cell5 + Cell6 + Cell8 + Cell9 | Cell1 + Cell3 + Cell7 + Cell9 | Cell1 + Cell2 + Cell4 + Cell5 |
| FN | Cell2 + Cell3 | Cell4 + Cell6 | Cell7 + Cell8 |
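The per-class counts can also be derived programmatically. The sketch below takes rows as actual classes and columns as predicted classes, as in Table 6.1; under that convention a false positive for a class is counted down its column and a false negative along its row. The function name and the 3x3 matrix of counts are hypothetical and are not results from the project.

```python
def per_class_counts(matrix, k):
    """TP, FP, FN, TN for class index k; rows are actual, columns are predicted."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # rest of row k
    fp = sum(row[k] for row in matrix) - tp   # rest of column k
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


# Hypothetical counts in the order Normal State, Benign Hepatobiliary Disease, PDA
cm = [[50, 5, 2],
      [6, 45, 9],
      [3, 7, 50]]

counts = [per_class_counts(cm, k) for k in range(3)]
overall_accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
```

For the first class, for example, the true negatives are the four cells that lie in neither the first row nor the first column, which gives 45 + 9 + 7 + 50 = 111.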

6.2 Evaluation Parameters

Accuracy is the measure used to determine which model is best at identifying relationships
and patterns between variables in a dataset based on the input, or training, data. It is defined
as the percentage of correct predictions for the test data, and is calculated by dividing the
number of correct predictions by the total number of predictions. It gives the overall
accuracy of the model, meaning the fraction of the total samples that were correctly
classified by the classifier [25]. Accuracy is calculated with the following formula.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

True Positive (TP): the number of predictions where the classifier correctly predicts the
positive class as positive.

True Negative (TN): the number of predictions where the classifier correctly predicts the
negative class as negative.

False Positive (FP): the number of predictions where the classifier incorrectly predicts the
negative class as positive.

False Negative (FN): the number of predictions where the classifier incorrectly predicts the
positive class as negative.

6.3 Results

In this work, three classification models were employed for detecting PC. Table 6.3 presents
a comparison of the classification models based on their performance. The performance
evaluation parameter used in the system is accuracy.

The average accuracy of NB is 71.7%, of SVM is 74.5% and of RF is 81.3%. RF therefore
gives a more accurate result than the other two classification algorithms on this dataset.

Table 6.3: Performance analysis

| Sl. No. | Technique | Description | Accuracy |
|---|---|---|---|
| 1 | Random Forest | RFs or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. | 81.3% |
| 2 | Support Vector Machine | An SVM is a supervised machine learning model that is used for classification problems. It classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. | 74.5% |
| 3 | Naïve Bayes | The NB algorithm is based on Bayes theorem. Bayesian classification represents a supervised learning method as well as a statistical method for classification. | 71.7% |

Early detection of PC is very important so that it can be treated before the cancer spreads to
other organs in the body. However, early detection of PC is difficult because this cancer has
non-specific symptoms.

After classifying PC with the SVM, NB and RF methods, several accuracy results were
obtained. Comparing the values given by these methods shows that RF generates a better
result than SVM and NB. Because of these results, RF rather than SVM or NB is suggested to
help medical staff predict or classify the disease, especially for datasets similar to the one
used in this research.

The collective and ultimate outcomes of determining and implementing early-detection
methods are focused on a better future for patients, their families, science, and medicine. The
impact of improving quality of life, treatment options, and survival for those individuals
diagnosed with PDAC will be immense. When this disease is classified as a chronic disease
rather than a devastating deadly diagnosis, it will be said that success has been achieved.

CHAPTER 7

CONCLUSION AND FUTURE WORK

Early detection of PC is very important so that it can be treated before the cancer spreads to
other organs in the body. However, early detection of PC is difficult because this cancer has
non-specific symptoms.

After classifying PC with the SVM, NB and RF methods, several accuracy results were
obtained. Comparing the values given by these methods shows that RF generates a better
result than SVM and NB. Because of these results, RF rather than SVM or NB is suggested to
help medical staff predict or classify the disease, especially for datasets similar to the one
used in this research.

The collective and ultimate outcomes of determining and implementing early-detection
methods are focused on a better future for patients, their families, science, and medicine. The
impact of improving quality of life, treatment options, and survival for those individuals
diagnosed with PDAC will be immense. When this disease is classified as a chronic disease
rather than a devastating deadly diagnosis, it will be said that success has been achieved.

It is clear that machine learning methods generally improve the performance or predictive
accuracy of most prognoses, especially when compared to conventional statistical or
expert-based systems. The use of other machine learning methods, and the combination of
other classification algorithms, could therefore improve the accuracy of the system. While
most studies are generally well constructed and reasonably well validated, greater attention
to experimental design and implementation appears to be warranted, especially with respect
to the quantity and quality of biological data. The use of more, and more relevant, data could
also improve the system performance. Improvements in experimental design along with
improved biological validation would no doubt enhance the overall quality, generality and
reproducibility of many machine-based classifiers. Overall, we believe that if the quality of
studies continues to improve, the use of machine learning classifiers is likely to become
much more commonplace in many clinical and hospital settings.

REFERENCES

[1] A. Bosch, A. Zisserman and X. Munoz 2007, “Image Classification using Random
Forests and Ferns”, IEEE 11th International Conference on Computer Vision, Rio de Janeiro,
10.1109/ICCV.2007.4409066

[2] Bhatt A, Dubey SK, Bhatt AK, Joshi M 2017, “Data Mining Approach to Predict and
Analyze the Cardiovascular Disease”, Proceedings of the 5th International Conference on
Frontiers in Intelligent Computing: Theory and Applications

[3] Bramhall, S.R., Neoptolemos, J.P., Stamp, G.W. and Lemoine, N.R. 1998, “Imbalance
of expression of matrix metalloproteinases (MMPs) and tissue inhibitors of the matrix
metalloproteinase (TIMPs) in human pancreatic carcinoma”, J. Pathol., 182

[4] D. Arslan, M. E. Özdemir and M. T. Arslan 2017, “Diagnosis of pancreatic cancer by
pattern recognition methods using gene expression profiles”, International Artificial
Intelligence and Data Processing Symposium (IDAP), 10.1109/IDAP.2017.8090327

[5] Daniele Ravi, Charence Wong 2017, “Deep Learning for Health Informatics”, IEEE
Journal of Biomedical and Health Informatics

[6] D. Delen, G. Walker and A. Kadam 2005, “Predicting breast cancer survivability: a
comparison of three data mining methods”, Artificial Intelligence in Medicine

[7] Dona Sara Jacob, Rakhi Viswan, V Manju, L Padma Suresh, Shine Raj 2018, “A
Survey on Breast Cancer Prediction Using Data Mining Techniques”, IEEE Access

[8] Dr Prof. Neeraj, Sakshi Sharma, Renuka Purohit & Pramod Singh Rathore 2017,
“Prediction of Recurrence Cancer using J48 Algorithm”, Proceedings of the 2nd International
Conference on Communication and Electronics Systems

[9] Dua D, Graff C 2019, “UCI machine learning repository”, School of Information and
Computer Science, University of California, Irvine, CA

[10] Dwivedi AK 2018, “Performance evaluation of different machine learning techniques
for prediction of heart disease”, Neural Computing and Applications

[11] Ellenrieder, V., Adler, G. and Gress, T.M. 1999, “Invasion and metastasis in pancreatic
cancer”, Ann. Oncol. 10 (Suppl. 4)

[12] Escamilla AKG, El Hassani AH, Andres E 2019, “A Comparison of Machine
Learning Techniques to Predict the Risk of Heart Failure”, Machine Learning Paradigms,
Springer

[13] Eun Sun Lee, Jeong Min Lee 2014, “pancreatic cancer: A state-of-the-art review
World”

[14] G.N. Satapathi, Dr. P. Srihari, Ch. Aruna Jyothi, S. Lavanya 2013, “Prediction of
cancer using DCP cells”, IEEE Access

[15] Ilias Tougui, Abdelilah Jilbab, Jamal El Mhamdi 2020, “Heart disease classification
using data mining tools and machine learning techniques”, Health Technol.

[16] Lola Rahib, Benjamin D Smith, Rhonda Aizenberg, Allison B Rosenzweig, Julie M
Fleshman and Lynn M Matrisian 2020, “Projecting cancer incidence and deaths to 2030:
The unexpected burden of thyroid, liver, and pancreas cancers in the United States”

[17] Mohtadi K, Msaad R, Essadik R, Lebrazi H, Kettani A 2018, “Current risk factors of
ischemic cardiovascular diseases estimated in a representative population of Casablanca”,
Endocrinol Metab Syndr

[18] Ms. Rashmi G D, Mrs. A Lekha, Dr. Neelam Bawane 2015, “Analysis of Efficiency of
Classification and Prediction Algorithms (Naïve Bayes) for Breast Cancer Dataset”, IEEE
Access

[19] Sarfaraz Hussein, Pujan Kandel, Juan E. Corral, Candice W. Bolan, Michael B.
Wallace and Ulas Bagci 2018, “Deep Multi-Modal Classification of Intraductal Papillary
Mucinous Neoplasms (IPMN) with Canonical Correlation Analysis”, IEEE

[20] Shanjida Khan Maliha, Romana Rahman Ema, Simanta Kumar Ghosh, Helal
Ahmed, Md. Rafsun Jony Mollick, Tajul Islam 2019, “Cancer Disease Prediction Using
Naive Bayes, Nearest Neighbor and J48 algorithm”

[21] Shuhao Sun, Fima Klebaner and Tianhai Tian 2017, “Mathematical model for
pancreatic cancer progression using non-constant gene mutation rate”, IEEE International
Conference on Bioinformatics and Biomedicine

[22] Tougui, I., Jilbab, A. & El Mhamdi, J. 2020, “Heart disease classification using data
mining tools and machine learning techniques”, Health Technol. 10, 1137–1144

[23] Turki Turki 2018, “An Empirical Study of Machine Learning Algorithms for Cancer
Identification”, IEEE Access

[24] Yuling Zhang, Shuchang Wang, Shuqiang Qu 2020, “Support vector machine
combined with magnetic resonance imaging for accurate diagnosis of paediatric pancreatic
cancer”, IET Image Processing

[25] Z. Zhang, S. Li, Z. Wang and Y. Lu 2020, “A Novel and Efficient Tumor Detection
Framework for Pancreatic Cancer via CT Images”, 42nd Annual International Conference of
the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada,
10.1109/EMBC44109.2020.9176172

[26] Classification algorithms: //www.javapoint.com/classification-algorithm-in-machine-learning

[27] National Survey on Population and Family Health. Ministry of Health of Morocco 2018.
http://www.sante.gov.ma/Documents/ 2019/10/ENPSF

[28] Dataset: https://www.kaggle.com/johnjdavisiv/urinary-biomarkers-for-pancreatic-cancer

[29] Data cleaning: https://www.sisense.com/glossary/data-cleaning/

[30] Feature selection: https://www.kdnuggets.com/2021/06/feature-selection-overview.html

[31] Machine learning using python: https://scikit-learn.org/stable/

[32] Matplotlib tutorial: https://www.tutorialspoint.com/matplotlib/index.htm

[33] Model training: https://elitedatascience.com/model-training

[34] NumPy introduction: https://www.w3schools.com/python/numpy/numpy_intro.asp,
https://www.python-course.eu/numpy.php

[35] Python pandas: https://www.tutorialspoint.com/python_pandas/index.html

[36] Random forest: https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn

APPENDIX A

SCREENSHOTS

A.1 LOADING DATASET

A.2 DATA PREPROCESSING

A.3 MODEL TRAINING

A.4 PREDICTION
