
CREDIT CARD FRAUD DETECTION

USING MACHINE LEARNING


A PROJECT REPORT
Submitted by,

BALA BHARATHI S (310819205015)

ESOHIN E (310819205028)

PAVAN SIDHARTH J (310819205055)

In partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY

in

INFORMATION TECHNOLOGY

JEPPIAAR ENGINEERING COLLEGE


ANNA UNIVERSITY
CHENNAI 600 025
JUNE 2023
JEPPIAAR ENGINEERING COLLEGE
DEPARTMENT OF INFORMATION TECHNOLOGY

JEPPIAAR NAGAR, RAJIV GANDHI ROAD, CHENNAI-119

BONAFIDE CERTIFICATE
This is to certify that this Project Report “CREDIT CARD FRAUD DETECTION
USING MACHINE LEARNING” is the bonafide work of “BALA BHARATHI S,
ESOHIN E and PAVAN SIDHARTH J”, who carried out the project under my
supervision.

SUPERVISOR HEAD OF DEPARTMENT

T. Anuja, M.E., Dr. S. Venkatesh, M.E., Ph.D.,


Assistant Professor, Assistant Professor,
Department of IT, Department of IT,
Jeppiaar Engineering College, Jeppiaar Engineering College,
Chennai 600 119. Chennai 600 119.

Submitted for the project viva voce examination held on ___________

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We are very much indebted to (Late) Hon’ble Colonel Dr. JEPPIAAR, M.A.,
B.L., Ph.D., our Chairman and Managing Director Dr. M. REGEENA
JEPPIAAR, B.Tech., M.B.A., Ph.D., the Principal Dr. J. FRANCIS XAVIER,
M.Tech., Ph.D., and the Dean of Academics Dr. SHALEESHA A. STANLEY,
M.Sc., M.Phil., Ph.D., for permitting us to carry out the project here.

We would like to express our deep sense of gratitude to Dr. S. Venkatesh, M.E.,
Ph.D., Head of the Department and also to our guide Mrs T.Anuja, M.E., for
giving valuable suggestions for making this project a grand success.

We take this opportunity to express our sincere gratitude to our Project
coordinator Mrs T. Anuja, M.E., for giving us the opportunity to do this project
under her esteemed guidance.

We also thank the teaching and non-teaching staff members of the Department of
Information Technology for their constant support.

ABSTRACT

Credit cards have become the most commonly used payment mode in
recent years. As technology develops, the number of fraud cases also
increases, which creates the need for a fraud detection algorithm that can
accurately find and eliminate fraudulent activities. This project applies
several machine learning based classification algorithms, together with
hyperparameter tuning and PCA, to preprocess and handle the heavily
imbalanced data set, and evaluates each model using accuracy, precision,
recall and F1 score. Credit cards are easy and attractive targets for fraud.
E-commerce and many other online sites have expanded the available
online payment modes, increasing the risk of online fraud. With the rise
in fraud rates, researchers have started using different machine learning
methods to detect and analyze fraud in online transactions. The main aim
of this work is to design and develop a novel fraud detection method for
streaming transaction data, with the objective of analyzing the past
transaction details of customers and extracting their behavioral patterns.
Cardholders are clustered into different groups based on their transaction
amounts. A sliding-window strategy is then used to aggregate the
transactions made by the cardholders in each group, so that the behavioral
pattern of each group can be extracted. Different classifiers are then
trained on each group separately, and the classifier with the best rating
score is chosen as the method for predicting frauds.

Keywords: Fraud detection, Credit card, Logistic regression, Decision
tree, Random forest, SVM, KNN, XGBoost.

TABLE OF CONTENTS

CHAPTER NO.   TITLE                                             PAGE NO.

              ABSTRACT                                          III
              LIST OF FIGURES                                   VII

1             CHAPTER 1: INTRODUCTION
              1.1 INTRODUCTION                                  1

2             CHAPTER 2: LITERATURE SURVEY
              2.1 LITERATURE SURVEY                             4

3             CHAPTER 3: SYSTEM ANALYSIS
              3.1 EXISTING SYSTEM                               9
              3.2 PROPOSED SYSTEM                               9
              3.3 BLOCK DIAGRAM                                 10
              3.3.1 DESCRIPTION OF THE SYSTEM BLOCK DIAGRAM     10
              3.4 FLOW DIAGRAM                                  11

4             CHAPTER 4: METHODOLOGIES AND ALGORITHMS
              4.1 ENSEMBLE ALGORITHM USED                       12
              4.1.1 LOGISTIC REGRESSION                         12
              4.1.2 RANDOM FOREST CLASSIFIER                    13
              4.1.3 DECISION TREE ALGORITHM                     15
              4.1.4 K NEAREST NEIGHBORS ALGORITHM               16
              4.1.5 XG BOOST                                    16
              4.1.6 SUPPORT VECTOR CLASSIFIERS ALGORITHM        17
              4.2 FINAL ALGORITHM USED                          18
              4.2.1 SUPPORT VECTOR MACHINES                     18

5             CHAPTER 5: SYSTEM DESIGN
              5.1 FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS    23
              5.1.1 FUNCTIONAL REQUIREMENTS                     23
              5.1.2 NON-FUNCTIONAL REQUIREMENTS                 23
              5.2 SYSTEM SPECIFICATIONS                         24
              5.2.1 HARDWARE SPECIFICATIONS                     24
              5.2.2 SOFTWARE SPECIFICATIONS                     24
              5.3 UML DIAGRAMS                                  25
              5.3.1 USE CASE DIAGRAM                            26
              5.3.2 CLASS DIAGRAM                               27
              5.3.3 SEQUENCE DIAGRAM                            27
              5.3.4 COLLABORATION DIAGRAM                       28
              5.3.5 DEPLOYMENT DIAGRAM                          29
              5.3.6 ACTIVITY DIAGRAM                            29
              5.3.7 COMPONENT DIAGRAM                           30
              5.3.8 ER DIAGRAM                                  30
              5.3.9 DFD DIAGRAM                                 31

6             CHAPTER 6: SOFTWARE DESIGN
              6.1 SOFTWARE DEVELOPMENT LIFE CYCLE               33
              6.2 FEASIBILITY STUDY                             34
              6.2.1 ECONOMIC FEASIBILITY                        35
              6.2.2 TECHNICAL FEASIBILITY                       35
              6.2.3 SOCIAL FEASIBILITY                          35
              6.3 MODULES                                       36
              6.3.1 DATA COLLECTION                             36
              6.3.2 DATA CLEANING                               36
              6.3.3 FEATURE EXTRACTION                          37
              6.3.4 MODEL TRAINING                              38
              6.3.5 TESTING MODEL                               38
              6.3.6 PERFORMANCE EVALUATION                      39
              6.3.7 PREDICTION                                  40

7             CHAPTER 7: SOFTWARE TESTING
              7.1 INTRODUCTION                                  41
              7.2 TESTING TRADITIONAL SOFTWARE SYSTEMS V/S
                  MACHINE LEARNING SYSTEMS                      42
              7.3 MODEL TESTING AND MODEL EVALUATION            43
              7.3.1 WRITING TEST CASES                          43
              7.4 PROJECT TESTING                               45

8             CHAPTER 8: IMPLEMENTATION
              8.1 FRONT END CODING                              47
              8.2 BACK END CODING                               50

9             CHAPTER 9: OUTPUTS AND SNAPSHOTS                  57

10            CHAPTER 10: CONCLUSION AND FUTURE WORK
              10.1 CONCLUSION                                   54
              10.2 FUTURE WORK                                  54

              REFERENCES                                        66

LIST OF FIGURES

FIGURE NO.    NAME OF THE FIGURE                      PAGE NO.

3.3           BLOCK DIAGRAM                           10
3.4           FLOW DIAGRAM                            11
4.1.1.1       LOGISTIC REGRESSION HYPOTHESIS          13
4.1.1.2       LOGISTIC REGRESSION DECISION BOUNDARY   13
4.1.2         RANDOM FOREST CLASSIFIER                15
4.2.1.1       GRAPH OF SUPPORT VECTORS                18
4.2.1.2       HYPERPLANES                             19
4.2.1.3       SUPPORT VECTORS                         19
5.3.1         USE CASE DIAGRAM                        26
5.3.2         CLASS DIAGRAM                           27
5.3.3         SEQUENCE DIAGRAM                        28
5.3.4         COLLABORATION DIAGRAM                   28
5.3.5         DEPLOYMENT DIAGRAM                      29
5.3.6         ACTIVITY DIAGRAM                        29
5.3.7         COMPONENT DIAGRAM                       30
5.3.8         ER DIAGRAM                              31
5.3.9         DFD DIAGRAM                             32
6.1           WATERFALL MODEL                         33
9.1           SNAPSHOTS                               57

CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION

The aim of this work is to identify fraudulent transactions made using
credit cards. To accomplish this, it is necessary to classify transactions as
fraudulent or non-fraudulent. The primary goal is to build a fraud
detection algorithm that finds fraudulent transactions in less time and
with high accuracy using machine learning based classification
algorithms. As technology advances rapidly, payment by cash is
decreasing and online payment is increasing, which paves the way for
fraudsters to make anonymous transactions. To commit fraud, a fraudster
only needs the card details to make purchases, and the user may never
know that his or her credit card information was leaked. A credit card
generally refers to a card assigned to a customer (the cardholder), usually
allowing them to purchase goods and services within a credit limit or
withdraw cash in advance. A credit card gives the cardholder the
advantage of time: it allows customers to repay later within a prescribed
period by carrying the balance over to the next billing cycle. Credit card
frauds are easy targets. Without any risk, a significant amount can be
withdrawn without the owner’s knowledge in a short period. Fraudsters
always try to make every fraudulent transaction look legitimate, which
makes fraud detection a very challenging and difficult task.

Credit card fraud is a wide-ranging term for theft and fraud committed
using, or involving, a payment card at the time of payment. The purpose
may be to purchase goods without paying, or to transfer unauthorized
funds from an account. Credit card fraud is also an adjunct to identity
theft. According to the United States Federal Trade Commission, the rate
of identity theft held stable during the mid-2000s but increased by 21
percent in 2008, even though credit card fraud, the crime most people
associate with identity theft, decreased as a percentage of all identity
theft complaints. In 2000, out of 13 billion transactions made annually,
approximately 10 million, or about one out of every 1,300 transactions,
turned out to be fraudulent. Also, 0.05% (5 out of every 10,000) of all
monthly active accounts were fraudulent. Today, fraud detection systems
are introduced to control one-twelfth of one percent of all transactions
processed, which still translates into billions of dollars in losses. Credit
card fraud is one of the biggest threats to business establishments today.
However, to combat fraud effectively, it is important to first understand
the mechanisms of executing a fraud. Credit card fraudsters employ a
large number of ways to commit fraud. In simple terms, credit card fraud
is defined as “when an individual uses another individual’s credit card
for personal reasons while the owner of the card and the card issuer are
not aware of the fact that the card is being used”. Card fraud begins either
with the theft of the physical card or with the compromise of important
data associated with the account, including the card account number or
other information that would necessarily be available to a merchant
during a permissible transaction. Card numbers, generally the Primary
Account Number (PAN), are often printed on the card, and a magnetic
stripe on the back contains the data in machine-readable format. It
contains the following fields:

● Name of card holder
● Card number
● Expiration date
● Verification/CVV code
● Type of card

There are many more methods of committing credit card fraud.
Fraudsters are very talented and fast-moving people. One traditional
approach identified in this report is application fraud, where a person
gives wrong information about himself to get a credit card. There is also
the unauthorized use of lost and stolen cards, which makes up a
significant share of credit card fraud. There are more sophisticated credit
card fraudsters, starting with those who produce fake and doctored cards;
there are also those who use skimming to commit fraud, copying the
information held on the magnetic strip on the back of the credit card, or
the data stored on the smart chip, from one card to another. Site cloning
and false merchant sites on the Internet are becoming a popular method
of fraud for many criminals with skilled hacking ability. Such sites are
developed to get people to hand over their credit card details without
knowing they have been swindled.

Types of Frauds:

● Online and Offline
● Card Theft
● Data Phishing
● Application Fraud
● Telecommunication

CHAPTER 2
LITERATURE SURVEY

2.1 Credit Card Fraud Detection: A Realistic Modeling and a Novel
Learning Strategy
Authors: Andrea Dal Pozzolo; Giacomo Boracchi; Olivier Caelen;
Cesare Alippi; Gianluca Bontempi
Date of Publication: 14 September 2017

Detecting frauds in credit card transactions is perhaps one of the best
testbeds for computational intelligence algorithms. In fact, this problem
involves a number of relevant challenges, namely: concept drift
(customers’ habits evolve and fraudsters change their strategies over
time), class imbalance (genuine transactions far outnumber frauds), and
verification latency (only a small set of transactions are timely checked
by investigators). However, the vast majority of learning algorithms that
have been proposed for fraud detection rely on assumptions that hardly
hold in a real-world fraud-detection system (FDS). This lack of realism
concerns two main aspects: 1) the way and timing with which supervised
information is provided and 2) the measures used to assess
fraud-detection performance. This paper has three major contributions.
First, we propose, with the help of our industrial partner, a formalization
of the fraud-detection problem that realistically describes the operating
conditions of FDSs that analyze massive streams of credit card
transactions. We also illustrate the most appropriate performance
measures to be used for fraud-detection purposes. Second, we design and
assess a novel learning strategy that effectively addresses class

imbalance, concept drift, and verification latency. Third, in our
experiments, we demonstrate the impact of class unbalance and concept
drift in a real-world data stream containing more than 75 million
transactions, authorized over a time window of three years.

2.2 Dataset Shift Quantification for Credit Card Fraud Detection
Authors: Yvan Lucas; Pierre-Edouard Portier; Léa Laporte;
Sylvie Calabretto; Liyun He-Guelton; Frederic Oblé; Michael Granitzer
Date of Publication: 08 August 2019
DOI: 10.1109/AIKE.2019.00024

Machine learning and data mining techniques have been used extensively
in order to detect credit card frauds. However, purchase behavior and
fraudster strategies may change over time. This phenomenon is named
dataset shift [1] or concept drift [2] in the domain of fraud detection. In
this paper, we present a method to quantify, day by day, the dataset shift
in our face-to-face credit card transactions dataset (card holder located in
the shop). In practice, we classify the days against each other and measure
the efficiency of the classification. The more efficient the classification,
the more different the buying behavior between two days, and vice versa.
Therefore, we obtain a distance matrix characterizing the dataset shift.
After an agglomerative clustering of the distance matrix, we observe that
the dataset shift pattern matches the calendar events for this time period
(holidays, weekends, etc.). We then incorporate this dataset shift
knowledge into the credit card fraud detection task.

2.3 A Novel Approach for Credit Card Fraud Detection
using Decision Tree and Random Forest Algorithms
Authors: M R Dileep; A V Navaneeth; M Abhishek
Date of Publication: 31 March 2021
DOI: 10.1109/ICICV50876.2021.9388431

In the world of finance, as technology grew, new systems of doing
business came into the picture. The credit card system is one among
them. But because of a number of loopholes, many problems have arisen
in this system in the form of credit card scams. Due to this, the industry
and the customers who use credit cards are facing huge losses. There is a
lack of research studies examining real credit card data owing to privacy
issues. In this paper, an attempt has been made to find frauds in the credit
card business using algorithms based on machine learning techniques. In
this regard, two algorithms are used, viz. fraud detection using Decision
Tree and fraud detection using Random Forest. The efficiency of the
models is first assessed using some public data as a sample. Then, a
real-world credit card data set from a financial institution is analyzed.
Along with this, some noise is added to the data samples to further test
the robustness of the systems. The significance of the methods used in
the paper is that the first method constructs a tree from the activities
performed by the user, and using this tree, scams are suspected. In the
second method, a forest based on user activity is constructed, and using
this forest, an attempt is made to identify the suspect. The experimental
results clearly show that the majority voting technique attains good
accuracy in detecting scam cases in credit cards.

2.4 Machine Learning For Credit Card Fraud Detection
System
Authors: Ruttala Sailusha; V. Gnaneswar; R. Ramesh;
G. Ramakoteswara Rao
Date of Publication: 13-15 May 2020
DOI: 10.1109/ICICCS48265.2020.9121114

The rapid growth of the e-commerce industry has led to an exponential
increase in the use of credit cards for online purchases, and consequently
there has been a surge in the fraud related to them. In recent years, it has
become very difficult for banks to detect fraud in the credit card system.
Machine learning plays a vital role in detecting credit card fraud in
transactions. For predicting these transactions, banks make use of various
machine learning methodologies; past data is collected and new features
are used to enhance predictive power. The performance of fraud
detection in credit card transactions is greatly affected by the sampling
approach on the data set, the selection of variables and the detection
techniques used. This paper investigates the performance of logistic
regression, decision tree and random forest for credit card fraud
detection. A dataset of credit card transactions collected from Kaggle
contains a total of 2,84,808 credit card transactions of a European bank.
It considers fraud transactions as the “positive class” and genuine ones as
the “negative class”. The data set is highly imbalanced: about 0.172% of
the transactions are fraudulent and the rest are genuine. The authors
applied oversampling to balance the data set, which resulted in 60%
fraud transactions and 40% genuine ones. The three techniques are
applied to the dataset and the work is implemented in the R language.
The performance of the techniques is evaluated for different variables
based on sensitivity, specificity, accuracy and error rate. The results show
that the accuracies for logistic regression, decision tree and random
forest classifiers are 90.0, 94.3 and 95.5 respectively. The comparative
results show that random forest performs better than the logistic
regression and decision tree techniques.

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

● The credit card fraud detection system is initiated for detecting
fraudulent transactions among the transactions made by card holders.
● The transactions made by credit card holders come from Kaggle
datasets, which are data already posted by companies and researchers
for the purpose of machine learning and data mining.
● Models are trained using this data, i.e., the transactions. This
technique is mainly used to differentiate fraudulent transactions from
the genuine transactions made by card holders.
● Initially, the transaction data is stored in a confluence form.

DISADVANTAGES
● Does not have a huge data set.
● For smaller amounts of data, the results may not be accurate.

3.2 PROPOSED SYSTEM

● The credit card fraud detection problem includes modeling past credit
card transactions with knowledge of the ones that turned out to be
fraud. This model is then used to recognize whether a new transaction
is fraudulent or not.
● The aim is to detect as many of the fraudulent transactions as possible
while minimizing the incorrect fraud classifications. Credit card fraud
detection is a typical example of classification.
● In this process, we have focused on analyzing and pre-processing the
data set as well as on the deployment of multiple anomaly detection
algorithms such as logistic regression.

ADVANTAGES
● Protection against credit card fraud.
● Free credit score information.
● No foreign transaction fees.

3.3 BLOCK DIAGRAM

Fig.No 3.3 - BLOCK DIAGRAM

3.3.1 DESCRIPTION OF THE SYSTEM BLOCK DIAGRAM

● Define the problem adequately (objective, desired outputs, etc.).
● Gather the data.
● Choose a measure of success.
● Set an evaluation protocol from the different protocols available.
● Prepare the data (dealing with missing values, categorical values, etc.).
● Correctly split the data into train and test sets.
● Apply machine learning algorithms and make the prediction.
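The steps above can be sketched as a minimal end-to-end pipeline. The snippet below is an illustrative sketch using scikit-learn with a synthetic, imbalanced stand-in dataset; the project itself uses the Kaggle credit card data, which is not reproduced here, and all names in the snippet are for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Gather data: a synthetic stand-in for the heavily imbalanced transaction data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.97, 0.03], random_state=42)

# Correctly split into train and test data (stratified to keep class ratios)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Apply a machine learning algorithm and make the prediction
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Choose measures of success suited to imbalanced data
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```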

3.4 FLOW DIAGRAM

Fig.No 3.4- FLOW DIAGRAM

CHAPTER 4

METHODOLOGY AND ALGORITHMS

4.1 ENSEMBLE ALGORITHMS USED

1. Logistic regression
2. Random forest
3. Decision tree
4. K-Nearest Neighbors Algorithm
5. XG Boost
6. SVM

4.1.1 LOGISTIC REGRESSION

Logistic regression is a machine learning classification algorithm that is
used to predict the probability of a categorical dependent variable. In
logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In
other words, the logistic regression model predicts P(Y=1) as a function
of X.

Step 1: Logistic Regression hypothesis

The logistic regression hypothesis is built on the function g(z), the
logistic function, also known as the sigmoid function. The logistic
function has asymptotes at 0 and 1, and it crosses the y-axis at 0.5.

Fig.No 4.1.1.1- LOGISTIC REGRESSION HYPOTHESIS

Step (1b): Logistic regression decision boundary

Since our data set has two features: height and weight, the logistic
regression hypothesis is the following:

Fig.No 4.1.1.2- LOGISTIC REGRESSION DECISION BOUNDARY
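As a concrete illustration of the hypothesis above, here is a small sketch with scikit-learn; the height/weight numbers below are made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy two-feature data set (height in cm, weight in kg), labels 0/1
X = np.array([[150, 50], [160, 55], [170, 75],
              [180, 85], [165, 60], [175, 80]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The model predicts P(Y=1) as the sigmoid g(z) of a linear score z = w·x + b
z = clf.decision_function(X)          # linear part w·x + b
p = 1.0 / (1.0 + np.exp(-z))          # sigmoid: asymptotes at 0 and 1

print(np.round(p, 3))                 # same values as clf.predict_proba(X)[:, 1]
```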

4.1.2 RANDOM FOREST CLASSIFIER

Random forest is a supervised learning algorithm which is used for both
classification and regression, but it is mainly used for classification
problems. As we know, a forest is made up of trees, and more trees mean
a more robust forest. Similarly, the random forest algorithm creates
decision trees on data samples, gets the prediction from each of them and
finally selects the best solution by means of voting. It is an ensemble
method which is better than a single decision tree because it reduces
over-fitting by averaging the results.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the
help of the following steps:

Step 1 − First, start with the selection of random samples from a given
dataset.

Step 2 − Next, this algorithm will construct a decision tree for every
sample. Then it will get the prediction result from every decision tree.

Step 3 − In this step, voting will be performed for every predicted result.

Step 4 − At last, select the most voted prediction result as the final
prediction result.

The following diagram illustrates its working:

Fig.No 4.1.2 - RANDOM FOREST CLASSIFIER
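The four steps can be sketched with scikit-learn's RandomForestClassifier, which performs the bootstrap sampling and voting internally; the data and parameters below are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Steps 1-2: each of the 100 trees is grown on a random bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Steps 3-4: predictions are aggregated by majority voting across the trees
print("trees in the forest:", len(forest.estimators_))
print("predictions:", forest.predict(X[:5]))
```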

4.1.3 DECISION TREE ALGORITHM

A decision tree is a non-parametric supervised learning algorithm which
is utilized for both classification and regression tasks. It has a
hierarchical tree structure which consists of a root node, branches,
internal nodes and leaf nodes.

● Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
● Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
● Step-3: Divide S into subsets that contain the possible values for the
best attribute.
● Step-4: Generate the decision tree node which contains the best
attribute.
● Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is
reached where the nodes cannot be classified further; these final
nodes are called leaf nodes.
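A minimal sketch of these steps with scikit-learn, with the Gini index serving as the Attribute Selection Measure; the data and parameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Steps 1-5: the tree is grown recursively from the root, choosing the best
# attribute at each node by the Gini impurity criterion (an ASM)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X, y)

print("depth :", tree.get_depth())
print("leaves:", tree.get_n_leaves())
```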

4.1.4 K NEAREST NEIGHBORS ALGORITHM

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a
non-parametric, supervised learning classifier which uses proximity to
make classifications or predictions about the grouping of an individual
data point.

Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new data point
and the training points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
Step-4: Among these K neighbors, count the number of data points in
each category.
Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.
Step-6: Our model is ready.
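The six steps correspond directly to a short scikit-learn sketch; the toy two-dimensional points below are chosen so the clusters are obvious.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters: class 0 near (1, 1), class 1 near (8, 8)
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Step 1: select K = 3; Steps 2-3: Euclidean distance is the default metric
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Steps 4-5: the query point takes the majority class of its 3 nearest
# neighbours; (8, 7) sits inside the class-1 cluster
print(knn.predict([[8, 7]]))   # -> [1]
```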

4.1.5 XG BOOST
Gradient boosted decision trees are implemented by the XGBoost library
of Python, which is designed for speed and performance, the most
important aspects of machine learning (ML).

XGBoost: The XGBoost (Extreme Gradient Boosting) library of Python
was introduced by scholars at the University of Washington. It is a
Python module written in C++ which helps ML model algorithms
through training for gradient boosting.

Gradient boosting: This is an AI technique utilized in classification and
regression assignments, among others. It gives a prediction model as an
ensemble of weak prediction models, commonly decision trees.

Step 1: Make an initial prediction and calculate the residuals.
Step 2: Build an XGBoost tree.
Step 3: Prune the tree.
Step 4: Calculate the output values of the leaves.
Step 5: Make new predictions.
Step 6: Calculate residuals using the new predictions.
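XGBoost ships as a separate library (installed with `pip install xgboost` and used via `xgboost.XGBClassifier`). As a self-contained stand-in, scikit-learn's GradientBoostingClassifier implements the same gradient-boosted-trees loop described in the steps above; data and parameters below are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Each round fits a small tree to the residuals of the current ensemble
# (Steps 1 and 6), then adds it with a learning-rate weight (Step 5)
gbm = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X, y)

print("boosting rounds  :", gbm.n_estimators_)
print("training accuracy:", gbm.score(X, y))
```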

4.1.6 SUPPORT VECTOR CLASSIFIERS ALGORITHM

The Support Vector Machine, or SVM, algorithm is a simple yet
powerful supervised machine learning algorithm that can be used for
building both regression and classification models. SVM algorithms can
perform really well with both linearly separable and non-linearly
separable datasets. Even with a limited amount of data, the support
vector machine algorithm does not fail to show its magic.

Step 1: Load the Pandas library and the dataset using Pandas.

Step 2: Define the features and the target.

Step 3: Split the dataset into train and test sets using sklearn before
building the SVM algorithm model.

Step 4: Import the support vector classifier function, or SVC function,
from the Sklearn SVM module. Build the support vector machine model
with the help of the SVC function.

Step 5: Predict values using the SVM algorithm model.

Step 6: Evaluate the support vector machine model.
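Steps 1-6 can be sketched end to end; the snippet below substitutes a synthetic pandas DataFrame for the project's Kaggle CSV, and all column names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1: load the data into a pandas DataFrame (synthetic stand-in here)
X_arr, y_arr = make_classification(n_samples=600, n_features=5, random_state=1)
df = pd.DataFrame(X_arr, columns=[f"V{i}" for i in range(1, 6)])
df["Class"] = y_arr

# Step 2: define the features and the target
X, y = df.drop(columns="Class"), df["Class"]

# Step 3: split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Step 4: build the model with the SVC function from sklearn.svm
model = SVC(kernel="rbf").fit(X_train, y_train)

# Steps 5-6: predict and evaluate
y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
```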

4.2 FINAL ALGORITHM USED

1. SVM

4.2.1 SUPPORT VECTOR MACHINES:

The objective of the support vector machine algorithm is to find a
hyperplane in an N-dimensional space (N being the number of features)
that distinctly classifies the data points.

Fig.No 4.2.1.1 - GRAPH OF SUPPORT VECTORS


Possible hyperplanes:
To separate the two classes of data points, there are many possible
hyperplanes that could be chosen. Our objective is to find a plane that has
the maximum margin, i.e. the maximum distance between data points of
both classes. Maximizing the margin distance provides some
reinforcement so that future data points can be classified with more
confidence.

Hyperplanes and Support Vectors

Fig.No 4.2.1.2 - HYPER PLANES

Hyperplanes in 2D and 3D feature space

Hyperplanes are decision boundaries that help classify the data points.
Data points falling on either side of the hyperplane can be attributed to
different classes. Also, the dimension of the hyperplane depends upon
the number of features. If the number of input features is 2, then the
hyperplane is just a line. If the number of input features is 3, then the
hyperplane becomes a two-dimensional plane. It becomes difficult to
imagine when the number of features exceeds 3.

Fig.No 4.2.1.3 - SUPPORT VECTORS

Support Vectors

Support vectors are data points that are closer to the hyperplane and
influence the position and orientation of the hyperplane. Using these
support vectors, we maximize the margin of the classifier. Deleting the
support vectors will change the position of the hyperplane. These are the
points that help us build our SVM.

Large Margin Intuition

In logistic regression, we take the output of the linear function and
squash the value into the range [0,1] using the sigmoid function. If the
squashed value is greater than a threshold value (0.5), we assign it the
label 1; otherwise we assign it the label 0. In SVM, we take the output of
the linear function, and if that output is greater than 1, we identify it with
one class, and if the output is less than -1, we identify it with the other
class. Since the threshold values are changed to 1 and -1 in SVM, we
obtain this reinforcement range of values ([-1, 1]) which acts as a margin.
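In symbols, this decision rule and margin can be written as follows (a standard formulation, reconstructed here because the report's equations appear only as images):

```latex
f(x) = w^{\top} x + b,
\qquad
\hat{y} =
\begin{cases}
+1, & f(x) \ge 1,\\
-1, & f(x) \le -1,
\end{cases}
\qquad
y_i \,(w^{\top} x_i + b) \ge 1 \quad \text{for all training points } (x_i, y_i).
```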

Cost Function and Gradient Updates


In the SVM algorithm, we are looking to maximize the margin between
the data points and the hyperplane. The loss function that helps
maximize the margin is hinge loss.

Hinge loss function (the function on the left can be represented as the
function on the right)
The cost is 0 if the predicted value and the actual value are of the same
sign. If they are not, we then calculate the loss value. We also add a
regularization parameter to the cost function. The objective of the
regularization parameter is to balance the margin maximization and loss.
After adding the regularization parameter, the cost function looks as
below.

Loss function for SVM
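Because the original equation images are not reproduced here, a standard form of the hinge loss and of the regularized objective described above is:

```latex
% Hinge loss: zero when the prediction and the label agree with margin >= 1
c\bigl(x, y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr)

% Regularized SVM objective: balance margin maximization against the loss
\min_{w}\;\; \lambda \,\lVert w \rVert^{2} \;+\; \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \langle x_i, w \rangle \bigr)
```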


Now that we have the loss function, we take partial derivatives with
respect to the weights to find the gradients. Using the gradients, we can
update our weights.

Gradients

When there is no misclassification, i.e. our model correctly predicts the
class of our data point, we only have to update the gradient from the
regularization parameter.

Gradient Update — No misclassification


When there is a misclassification, i.e. our model makes a mistake on the
prediction of the class of our data point, we include the loss along with
the regularization parameter to perform gradient update.
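The two update rules described above can be written out (in a standard reconstruction, with learning rate \(\alpha\) and regularization parameter \(\lambda\)):

```latex
% No misclassification (y_i <x_i, w> >= 1): only the regularizer contributes
w \leftarrow w - \alpha \cdot 2\lambda w

% Misclassification (y_i <x_i, w> < 1): add the hinge-loss gradient as well
w \leftarrow w + \alpha \,\bigl( y_i \, x_i - 2\lambda w \bigr)
```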

CHAPTER 5
SYSTEM DESIGN
5.1 FUNCTIONAL AND NONFUNCTIONAL
REQUIREMENTS:

Requirements analysis is a very critical process that enables the success
of a system or software project to be assessed. Requirements are
generally split into two types: functional and non-functional requirements.
5.1.1 FUNCTIONAL REQUIREMENTS
These are the requirements that the end user specifically demands as
basic facilities that the system should offer. All these functionalities need
to be necessarily incorporated into the system as a part of the contract.
These are represented or stated in the form of input to be given to the
system, the operation performed and the output expected. They are
basically the requirements stated by the user which one can see directly in
the final product, unlike the non-functional requirements.
Examples of functional requirements:
1) Authentication of user whenever he/she logs into the system
2) System shutdown in case of a cyber-attack
3) A verification email is sent to the user whenever he/she registers
for the first time on some software system.
5.1.2 NON-FUNCTIONAL REQUIREMENTS
These are basically the quality constraints that the system must satisfy
according to the project contract. The priority or extent to which these
factors are implemented varies from one project to another. They are also
called non-behavioral requirements.
They basically deal with issues like:
● Portability

● Security
● Maintainability
● Reliability
● Scalability
● Performance
● Reusability
● Flexibility
Examples of non-functional requirements:
1) Emails should be sent with a latency of no greater than 12 hours
from such an activity.
2) The processing of each request should be done within 10 seconds.
3) The site should load within 3 seconds whenever the number of
simultaneous users is greater than 10,000.

5.2 SYSTEM SPECIFICATIONS:


5.2.1 HARDWARE SPECIFICATIONS:
● Processor : I3/Intel Processor
● RAM : 8GB (min)
● Hard Disk : 128 GB

5.2.2 SOFTWARE SPECIFICATIONS :


• Operating System : Windows 10
• Server-side Script : Python 3.6
• IDE : PyCharm
• Framework : Flask
• Libraries Used : Numpy, pandas, Scikit-Learn.

5.3 UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a
standardized general-purpose modeling language in the field of
object-oriented software engineering. The standard is managed, and was
created by, the Object Management Group.
The goal is for UML to become a common language for creating
models of object-oriented computer software. In its current form UML
comprises two major components: a meta-model and a notation. In the
future, some form of method or process may also be added to, or
associated with, UML.
The Unified Modeling Language is a standard language for
specifying, visualizing, constructing and documenting the artifacts of
software systems, as well as for business modeling and other
non-software systems.
The UML represents a collection of best engineering practices that
have proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented
software and the software development process. The UML uses mostly
graphical notations to express the design of software projects.

GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language
so that they can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the
core concepts.
3. Be independent of particular programming languages and
development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
5.3.1 USE CASE DIAGRAM
► A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a use-case
analysis.

► Its purpose is to present a graphical overview of the functionality
provided by a system in terms of actors, their goals (represented as
use cases), and any dependencies between those use cases.

► The main purpose of a use case diagram is to show what system
functions are performed for which actor. Roles of the actors in the
system can be depicted.

Fig.No 5.3.1 - USE CASE DIAGRAM

5.3.2 CLASS DIAGRAM
In software engineering, a class diagram in the Unified Modeling
Language (UML) is a type of static structure diagram that describes the
structure of a system by showing the system's classes, their attributes,
operations (or methods), and the relationships among the classes. It also
shows which class contains which information.

Fig.No 5.3.2 - CLASS DIAGRAM

5.3.3 SEQUENCE DIAGRAM


► A sequence diagram in Unified Modeling Language (UML) is a
kind of interaction diagram that shows how processes operate with
one another and in what order.

► It is a construct of a Message Sequence Chart. Sequence diagrams
are sometimes called event diagrams, event scenarios, and timing
diagrams.

Fig.No 5.3.3 - SEQUENCE DIAGRAM

5.3.4 COLLABORATION DIAGRAM:

In the collaboration diagram, the method call sequence is indicated by a
numbering technique as shown below. The number indicates how the
methods are called one after another. We have taken the same order
management system to describe the collaboration diagram. The method
calls are similar to those of a sequence diagram. The difference is that
the sequence diagram does not describe the object organization, whereas
the collaboration diagram shows the object organization.

Fig.No 5.3.4 - COLLABORATION DIAGRAM

5.3.5 DEPLOYMENT DIAGRAM
A deployment diagram represents the deployment view of a system. It is
related to the component diagram because the components are deployed
using deployment diagrams. A deployment diagram consists of nodes,
which are the physical hardware used to deploy the application.

Fig.No 5.3.5 - DEPLOYMENT DIAGRAM

5.3.6 ACTIVITY DIAGRAM:


Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency.
In the Unified Modeling Language, activity diagrams can be used to
describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of
control.

Fig.No 5.3.6 -ACTIVITY DIAGRAM

5.3.7 COMPONENT DIAGRAM:
A component diagram, also known as a UML component diagram,
describes the organization and wiring of the physical components in a
system. Component diagrams are often drawn to help model
implementation details and double-check that every aspect of the system's
required function is covered by planned development.

Fig.No 5.3.7 - COMPONENT DIAGRAM

5.3.8 ER DIAGRAM:
An Entity–relationship model (ER model) describes the structure of a
database with the help of a diagram, which is known as Entity
Relationship Diagram (ER Diagram). An ER model is a design or
blueprint of a database that can later be implemented as a database. The
main components of the E-R model are: entity set and relationship set.
An ER diagram shows the relationship among entity sets. An entity set is
a group of similar entities and these entities can have attributes. In terms
of DBMS, an entity is a table or attribute of a table in a database, so by
showing relationship among tables and their attributes, ER diagram
shows the complete logical structure of a database. Let’s have a look at a
simple ER diagram to understand this concept.

Fig.No 5.3.8 - ER DIAGRAM

5.3.9 DFD DIAGRAM:


A Data Flow Diagram (DFD) is a traditional way to visualize the
information flows within a system. A neat and clear DFD can depict a
good amount of the system requirements graphically. It can be manual,
automated, or a combination of both. It shows how information enters
and leaves the system, what changes the information and where
information is stored. The purpose of a DFD is to show the scope and
boundaries of a system as a whole. It may be used as a communication
tool between a systems analyst and any person who plays a part in the
system, and it can act as the starting point for redesigning the system.

Fig.No 5.3.9 - DFD DIAGRAM

CHAPTER 6
SOFTWARE DESIGN

6.1 SOFTWARE DEVELOPMENT LIFE CYCLE – SDLC:

In our project we use the waterfall model as our software development
life cycle because of its step-by-step procedure during implementation.

Fig.No 6.1 - Waterfall Model

● Requirement Gathering and Analysis − All possible
requirements of the system to be developed are captured in this
phase and documented in a requirement specification document.

● System Design − The requirement specifications from the first
phase are studied in this phase and the system design is prepared.
This system design helps in specifying hardware and system
requirements and helps in defining the overall system architecture.

● Implementation − With inputs from the system design, the
system is first developed in small programs called units, which are
integrated in the next phase. Each unit is developed and tested for
its functionality, which is referred to as unit testing.

● Integration and Testing − All the units developed in the
implementation phase are integrated into a system after testing of
each unit. Post integration, the entire system is tested for any faults
and failures.

● Deployment of System − Once the functional and non-functional
testing is done, the product is deployed in the customer
environment or released into the market.

● Maintenance − Some issues come up in the client
environment. To fix those issues, patches are released. Also, to
enhance the product, better versions are released.
Maintenance is done to deliver these changes in the customer
environment.

6.2 FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase and a business
proposal is put forth with a very general plan for the project and some
cost estimates. During system analysis, the feasibility study of the
proposed system is carried out. This is to ensure that the proposed
system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.

The three key considerations involved in the feasibility analysis are:

♦ ECONOMIC FEASIBILITY
♦ TECHNICAL FEASIBILITY
♦ SOCIAL FEASIBILITY

6.2.1 ECONOMIC FEASIBILITY:
This study is carried out to check the economic impact that the system
will have on the organization. The amount of funds that the company can
pour into the research and development of the system is limited, so the
expenditures must be justified. The developed system is well within the
budget, which was achieved because most of the technologies used are
freely available; only the customized products had to be purchased.

6.2.2 TECHNICAL FEASIBILITY:


This study is carried out to check the technical feasibility, that is, the
technical requirements of the system. Any system developed must not
place a high demand on the available technical resources, as this would
lead to high demands being placed on the client. The developed system
must have modest requirements, as only minimal or no changes are
required for implementing this system.

6.2.3 SOCIAL FEASIBILITY:


This aspect of the study checks the level of acceptance of the system by
the user. This includes the process of training the user to use the system
efficiently. The user must not feel threatened by the system, but instead
must accept it as a necessity. The level of acceptance by the users solely
depends on the methods that are employed to educate the user about the
system and to make them familiar with it. Their level of confidence must
be raised so that they are also able to make constructive criticism, which
is welcomed, as they are the final users of the system.

6.3 MODULES

● Data collection
● Data pre-processing
● Feature extraction
● Model training
● Testing model
● Performance evaluation
● Prediction

6.3.1 DATA COLLECTION


● Collecting data allows you to capture a record of past events so that
you can use data analysis to find recurring patterns. From those
patterns, you build predictive models using machine
learning algorithms that look for trends and predict future changes.
● Predictive models are only as good as the data from which they are
built, so good data collection practices are crucial to developing
high-performing models.
● The data need to be error-free (garbage in, garbage out) and contain
relevant information for the task at hand. For example, a loan default
model would not benefit from tiger population sizes but could benefit
from gas prices over time.
● In this module, we collect the data from the Kaggle dataset archives. This
dataset contains credit card transactions from previous years.
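As a sketch of this module, the snippet below loads transaction data into pandas and inspects it. Since the real Kaggle `creditcard.csv` file is large and not bundled here, a tiny hand-made frame with the same `Class` labeling convention stands in for it:

```python
import pandas as pd

# miniature stand-in for the Kaggle credit card fraud dataset;
# in the project, this would be: data = pd.read_csv('creditcard.csv')
data = pd.DataFrame({
    "Time":   [0.0, 406.0, 4462.0, 7519.0],
    "Amount": [149.62, 0.0, 239.93, 1.00],
    "Class":  [0, 0, 1, 1],   # 1 = fraud, 0 = legitimate
})

print(data.shape)                    # rows x columns
print(data["Class"].value_counts())  # class distribution; fraud is rare in the real data
```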

6.3.2 DATA CLEANING


● Data cleaning is a critically important step in any machine learning
project.
● In this module data cleaning is done to prepare the data for analysis
by removing or modifying the data that may be incorrect, incomplete,
duplicated or improperly formatted.
● In tabular data, there are many different statistical analysis and data
visualization techniques you can use to explore your data in order to
identify data cleaning operations you may want to perform.
● At this stage we also apply PCA and hyperparameter tuning.
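A minimal sketch of these cleaning operations with pandas, using a small hypothetical frame that contains a duplicate row and a missing value:

```python
import numpy as np
import pandas as pd

# hypothetical raw frame with the kinds of problems cleaning must handle
raw = pd.DataFrame({
    "Time":   [0.0, 0.0, 100.0, np.nan],
    "Amount": [10.0, 10.0, 25.5, 7.2],
    "Class":  [0, 0, 1, 0],
})

cleaned = (
    raw.drop_duplicates()   # remove exact duplicate transactions
       .dropna()            # drop rows with missing values
       .reset_index(drop=True)
)
print(len(raw), "->", len(cleaned))  # 4 -> 2
```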

6.3.3 FEATURE EXTRACTION


● This is done to reduce the number of attributes in the dataset hence
providing advantages like speeding up the training and accuracy
improvements.
● In machine learning, pattern recognition, and image
processing, feature extraction starts from an initial set of measured
data and builds derived values (features) intended to be informative
and non-redundant, facilitating the subsequent learning and
generalization steps, and in some cases leading to better human
interpretations. Feature extraction is related to dimensionality
reduction.
● When the input data to an algorithm is too large to be processed and it
is suspected to be redundant (e.g. the same measurement in both feet
and meters, or the repetitiveness of images presented as pixels), then
it can be transformed into a reduced set of features (also named
a feature vector).
● Determining a subset of the initial features is called feature
selection. The selected features are expected to contain the relevant
information from the input data, so that the desired task can be
performed by using this reduced representation instead of the
complete initial data.
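The dimensionality reduction described above can be sketched with scikit-learn's PCA. The synthetic 10-feature matrix below is illustrative; the choice of 7 components matches the project's own code:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))   # 100 samples, 10 original features (synthetic)

pca = PCA(n_components=7)        # the project keeps 7 principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (100, 10) -> (100, 7)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```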

6.3.4 MODEL TRAINING
● A training model is a dataset that is used to train an ML algorithm. It
consists of the sample output data and the corresponding sets of input
data that have an influence on the output.
● The training model is used to run the input data through the
algorithm to correlate the processed output against the sample output.
The result from this correlation is used to modify the model.
● This iterative process is called “model fitting”. The accuracy of the
training dataset or the validation dataset is critical for the precision of
the model.
● Model training in machine learning is the process of feeding an ML
algorithm with data to help identify and learn good values for all
attributes involved.
● There are several types of machine learning models, of which the
most common ones are supervised and unsupervised learning.
● In this module we use supervised classification algorithms such as
logistic regression to train the model on the cleaned dataset after
dimensionality reduction.
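A sketch of this training step with scikit-learn, using synthetic data in place of the real transactions but the same 70/30 split and logistic regression model used in the project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in: two features, label depends on their sum
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# same split the project uses: 70% train / 30% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)      # "model fitting" on the training data
train_acc = model.score(X_train, y_train)
print(f"training accuracy: {train_acc:.3f}")
```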

6.3.5 TESTING MODEL


● In this module we test the trained machine learning model using the
test dataset.
● Quality assurance is required to make sure that the software system
works according to the requirements. Were all the features
implemented as agreed? Does the program behave as expected? All
the parameters that you test the program against should be stated in
the technical specification document.

● Moreover, software testing has the power to point out all the defects
and flaws during development. You don’t want your clients to
encounter bugs after the software is released and come to you waving
their fists. Different kinds of testing allow us to catch bugs that are
visible only during runtime.
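One simple way to test a trained model is to check it against a trivial baseline on the held-out test set. This sketch uses synthetic data and scikit-learn's `DummyClassifier` as the baseline; a model that cannot beat it has learned nothing useful:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

model_acc = model.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)

# the trained model must clearly beat the trivial baseline on held-out data
assert model_acc > baseline_acc
print(f"model {model_acc:.3f} vs baseline {baseline_acc:.3f}")
```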

6.3.6 PERFORMANCE EVALUATION


● In this module, we evaluate the performance of trained machine
learning models using performance evaluation criteria such as F1
score, accuracy and classification error.
● In case the model performs poorly, we optimize the machine learning
algorithms to improve the performance.
● For a highly unbalanced problem such as fraud detection, accuracy
alone can be misleading, so metrics like precision, recall and the F1
score on the fraud class give a clearer picture of model quality than
the overall accuracy.
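The evaluation criteria above can be made concrete with scikit-learn's metric functions. The hypothetical predictions below, on a small imbalanced test set, show how the F1 score on the minority (fraud) class tells a different story than accuracy:

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# hypothetical predictions on an imbalanced test set (1 = fraud)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# accuracy looks fine at 0.80, but F1 of 0.50 exposes the weak fraud detection
print(f"accuracy = {acc:.2f}, F1 (fraud class) = {f1:.2f}")
print(classification_report(y_true, y_pred, zero_division=0))
```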

6.3.7 PREDICTION
● “Prediction” refers to the output of an algorithm after it has
been trained on a historical dataset and applied to new data when
forecasting the likelihood of a particular outcome, such as whether or
not a customer will churn in 30 days.
● The algorithm will generate probable values for an unknown variable
for each record in the new data, allowing the model builder to identify
what that value will most likely be.
● The word “prediction” can be misleading. In some cases, it really
does mean that you are predicting a future outcome, such as when
you’re using machine learning to determine the next best action in a
marketing campaign.
● Other times, though, the “prediction” has to do with, for example,
whether or not a transaction that already occurred was fraudulent.
● In that case, the transaction already happened, but you’re making an
educated guess about whether or not it was legitimate, allowing you
to take the appropriate action.
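A sketch of the prediction step: a model trained on a tiny hypothetical history generates a fraud probability for each new record, which is the "educated guess" described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny synthetic training set: one feature, fraud above a threshold
X = np.array([[0.1], [0.2], [0.3], [2.5], [2.8], [3.1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# prediction on new, unseen transactions: probability of fraud per record
new_transactions = np.array([[0.15], [2.9]])
probs = model.predict_proba(new_transactions)[:, 1]
labels = model.predict(new_transactions)

for p, lab in zip(probs, labels):
    print(f"P(fraud) = {p:.3f} -> predicted class {lab}")
```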

CHAPTER 7
TESTING

7.1 INTRODUCTION
Testing forms an integral part of any software development project.
Testing helps in ensuring that the final product is by and large, free of
defects and it meets the desired requirements. Proper testing in the
development phase helps in identifying the critical errors in the design
and implementation of various functionalities thereby ensuring product
reliability. Even though it is a bit time-consuming and a costly process at
first, it helps in the long run of software development.

Although machine learning systems are not traditional software systems,
not testing them properly for their intended purposes can lead to a huge
impact in the real world. This is because machine learning systems reflect
the biases of the real world. Not accounting or testing for them will
inevitably have lasting and sometimes irreversible impacts. Examples of
such failures include Amazon's recruitment tool, which did not
evaluate people in a gender-neutral way, and Microsoft's
chatbot Tay, which responded with offensive and derogatory remarks.

In this chapter, we will understand how testing machine learning systems
is different from testing traditional software systems, the difference
between model testing and model evaluation, and the types of tests for
machine learning systems, followed by a hands-on example of writing
test cases for credit card fraud prediction.

7.2 TESTING TRADITIONAL SOFTWARE SYSTEMS V/S
MACHINE LEARNING SYSTEMS
In traditional software systems, code is written for having a desired
behavior as the outcome. Testing them involves testing the logic behind
the actual behavior and how it compares with the expected behavior. In
machine learning systems, however, data and desired behavior are the
inputs and the models learn the logic as the outcome of the training and
optimization processes.

In this case, testing involves validating the consistency of the
model's logic and our desired behavior. Because the models learn the
logic, there are some notable obstacles in the way of testing machine
learning systems. They are:

⮚ Indeterminate outcomes: on retraining, it is highly possible that the
model parameters vary significantly
⮚ Generalization: it is a huge task for machine learning models to
predict sensible outcomes for data not encountered in their training
⮚ Coverage: there is no set method of determining test coverage for a
machine learning model
⮚ Interpretability: most ML models are black boxes and don't have a
comprehensible logic for a certain decision made during prediction

These issues lead to a lower understanding of the scenarios in which
models fail and the reason for that behavior, not to mention making it
more difficult for developers to improve their behaviors.

7.3 MODEL TESTING AND MODEL EVALUATION

From the discussion above, it may feel as if model testing is the same as
model evaluation but that’s not true. Model evaluations focus on the
performance metrics of the models like accuracy, precision, the area
under the curve, f1 score, log loss, etc. These metrics are calculated on
the validation dataset and remain confined to that. Though the evaluation
metrics are necessary for assessing a model, they are not sufficient
because they don’t shed light on the specific behaviors of the model.

It is entirely possible that a model's evaluation metrics have improved but
its behavior on a core functionality has regressed. Or retraining a model
on new data might introduce a bias against marginalized sections of
society, all the while showing no particular difference in the metric
values. This is especially harmful in the case of ML systems since such
problems might not come to light easily but can have devastating impacts.

In summary, model evaluation helps in covering the performance on
validation datasets while model testing helps in explicitly validating the
nuanced behaviors of our models. During the development of ML
models, it is better to have both model testing and evaluation executed
in parallel.

7.3.1 WRITING TEST CASES

We usually write two different classes of tests for machine learning
systems:

● Pre-train tests
● Post-train tests

Pre-train tests: The intention is to write such tests which can be run
without trained parameters so that we can catch implementation errors
early on. This helps in avoiding the extra time and effort spent in a wasted
training job.

We can test the following in the pre-train tests:

● the model's predicted output shape is proper or not
● test dataset leakage, i.e. checking whether the data in the training and
testing datasets have no duplication
● temporal data leakage, which involves checking whether the
dependencies between training and test data do not lead to unrealistic
situations in the time domain, like training on a future data point and
testing on a past data point
● checks for the output ranges. In the cases where we are predicting
outputs in a certain range (for example when predicting probabilities),
we need to ensure the final prediction is not outside the expected
range of values
● ensuring that a gradient step of training on a batch of data leads to a
decrease in the loss
● data profiling assertions
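Two of the pre-train checks above, shape agreement and dataset leakage, can be sketched as simple assertions on the arrays before any expensive training job is launched. The arrays here are synthetic stand-ins for the project's real train/test split:

```python
import numpy as np

# synthetic stand-in for the project's train/test split
rng = np.random.default_rng(7)
X_train = rng.normal(size=(80, 5))
X_test = rng.normal(size=(20, 5))

def test_shapes_agree():
    # train and test must have the same number of features
    assert X_train.shape[1] == X_test.shape[1]

def test_no_dataset_leakage():
    # no row of the test set may appear verbatim in the training set
    train_rows = {tuple(row) for row in np.round(X_train, 6)}
    assert all(tuple(row) not in train_rows for row in np.round(X_test, 6))

test_shapes_agree()
test_no_dataset_leakage()
print("pre-train checks passed")
```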

Post-train tests: Post-train tests are aimed at testing the model's
behavior. We want to test the learned logic, which could be tested on the
following points and more:

● Invariance tests, which involve testing the model by tweaking only
one feature in a data point and checking for consistency in model
predictions. For example, if we are working with a loan prediction
dataset, then a change in sex should not affect an individual's eligibility
for the loan given all other features are the same; or, in the case of
Titanic survivor probability prediction data, a change in the passenger's
name should not affect their chances of survival.
● Directional expectations, wherein we test for a direct relation between
feature values and predictions. For example, in the case of a loan
prediction problem, having a higher credit score should definitely
increase a person's eligibility for a loan.
● Apart from this, you can also write tests for any other failure modes
identified for your model.
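An invariance test along these lines can be sketched as follows. The setup is synthetic and illustrative: feature 0 is informative, while feature 1 plays the role of an attribute the model should ignore (for instance a transaction id):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# feature 0 drives the label; feature 1 is irrelevant noise
rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(size=400), rng.integers(0, 2, size=400)])
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

record = np.array([[1.2, 0.0]])
tweaked = record.copy()
tweaked[0, 1] = 1.0   # change only the irrelevant feature

# invariance: the prediction must not flip when only the
# irrelevant feature changes
same = model.predict(record)[0] == model.predict(tweaked)[0]
print("invariance holds:", same)
```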

Now, let’s try a hands-on approach and write tests for our credit card
fraud dataset. Here, we are given a bunch of features and we have to
predict whether a transaction is fraudulent.

7.4 PROJECT TESTING


Let’s see the features first. The following columns are provided in the
dataset:

The dataset contains transactions made by credit cards in September. This
dataset presents transactions that occurred over two days, where we have
492 frauds out of 284,807 transactions. The dataset is highly unbalanced:
the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numeric input variables which are the result of a PCA
transformation. Unfortunately, due to confidentiality issues, we cannot
provide the original features and more background information about the
data. Features V1, V2, … V28 are the principal components obtained
with PCA, the only features which have not been transformed with PCA
are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed
between each transaction and the first transaction in the dataset. The
feature 'Amount' is the transaction Amount, this feature can be used for

45
example-dependent cost-sensitive learning. Feature 'Class' is the response
variable and it takes value 1 in case of fraud and 0 otherwise.
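The class imbalance described above is easy to verify programmatically. A miniature frame with the same `Class` convention stands in for the full dataset here; on the real data the same two lines would report the 0.172% fraud share:

```python
import pandas as pd

# miniature stand-in: 3 frauds among 1,000 transactions
data = pd.DataFrame({"Class": [0] * 997 + [1] * 3})

fraud_rate = data["Class"].mean()
print(f"fraud share: {fraud_rate:.3%}")   # tiny positive class

# sanity check mirroring the description above: the response is binary
assert set(data["Class"].unique()) <= {0, 1}
```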

Doing a little bit of analysis on the dataset will reveal the relationship
between various features. Since the main aim of this chapter is to learn
how to write tests, we will skip the analysis part and directly write basic
tests.

CHAPTER 8
IMPLEMENTATION

8.1 FRONT END CODING


HTML code

index.html

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="{{ url_for('static',
filename='css/indexstyle.css') }}">
<title>ML API</title>
</head>
<body>
<form action="{{ url_for('predict')}}" method="POST">
<input id="input-1" type="text" placeholder="Enter time" name ="time"
required autofocus />
<label for="input-1">
<span class="label-text">TIME</span>
<span class="nav-dot"></span>
<div class="signup-button-trigger">Credit Card Fraud
Prediction</div>
</label>
<input id="input-2" type="text" placeholder="Enter Amount"
name="amount" required />
<label for="input-2">
<span class="label-text">AMOUNT</span>
<span class="nav-dot"></span>
</label>
<input id="input-3" type="text" placeholder="Enter Transaction
Method" name="tm" required />
<label for="input-3">
<span class="label-text">Transaction Method</span>
<span class="nav-dot"></span>
</label>

<input id="input-4" type="text" placeholder="Transaction id" name="ti"
required />
<label for="input-4">
<span class="label-text">Transaction id</span>
<span class="nav-dot"></span>
</label>
<input id="input-5" type="text" placeholder="Enter Type Of card"
name="ct" required />
<label for="input-5">
<span class="label-text">Card Type</span>
<span class="nav-dot"></span>
</label>
<input id="input-6" type="text" placeholder="Enter Location"
name="location" required />
<label for="input-6">
<span class="label-text">Enter Location</span>
<span class="nav-dot"></span>
</label>
<input id="input-7" type="text" placeholder="Enter Bank" name="em"
required />
<label for="input-7">
<span class="label-text">Enter Bank</span>
<span class="nav-dot"></span>
</label>
<button type="submit">Predict</button>
<p class="tip">Press Tab</p>
<div class="signup-button">Credit card Fraud Detection</div>
</form>
</body>
</html>

result.html

<!--Hey! This is the original version of Simple CSS Waves-->
<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>ML API</title>
<link rel="stylesheet" href="{{ url_for('static',
filename='css/resultstyle.css') }}">

</head>

<body>
<div class="header">

<!--Content before waves-->


<div class="inner-header flex">
<!--Just the logo.. Don't mind this-->
<h1>{{ prediction }}</h1>
</div>

<!--Waves Container-->
<div>
<svg class="waves" xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
viewBox="0 24 150 28" preserveAspectRatio="none"
shape-rendering="auto">
<defs>
<path id="gentle-wave" d="M-160 44c30 0 58-18 88-18s 58 18 88 18
58-18 88-18 58 18 88 18 v44h-352z" />
</defs>
<g class="parallax">
<use xlink:href="#gentle-wave" x="48" y="0" fill="#bb1515" />
<use xlink:href="#gentle-wave" x="48" y="3" fill="#bb1515" />
<use xlink:href="#gentle-wave" x="48" y="5"
fill="rgba(255,255,255,0.3)" />
<use xlink:href="#gentle-wave" x="48" y="7" fill="#bb1515" />
</g>
</svg>
</div>
<!--Waves end-->

</div>
<!--Header ends-->

<!--Content starts-->
<div class="content flex">
</div>
<!--Content ends-->
</body>
</html>

8.2 BACK END CODING
"""# Feature Scaling"""

cols= ['V22', 'V24', 'V25', 'V26', 'V27', 'V28']

scaler = StandardScaler()

frames= ['Time', 'Amount']

x= data[frames]

d_temp = data.drop(frames, axis=1)

temp_col=scaler.fit_transform(x)

scaled_col = pd.DataFrame(temp_col, columns=frames)

scaled_col.head()

d_scaled = pd.concat([scaled_col, d_temp], axis =1)

d_scaled.head()

y = data['Class']

d_scaled.head()

"""# Dimensionality Reduction"""

from sklearn.decomposition import PCA

pca = PCA(n_components=7)

X_temp_reduced = pca.fit_transform(d_scaled)

pca.explained_variance_ratio_

pca.explained_variance_

names=['Time','Amount','Transaction Method','Transaction
Id','Location','Type of Card','Bank']

X_reduced= pd.DataFrame(X_temp_reduced,columns=names)
X_reduced.head()

Y=d_scaled['Class']

new_data=pd.concat([X_reduced,Y],axis=1)
new_data.head()
new_data.shape

new_data.to_csv('E:/ats
projects/Credit-Card-Fraud-Detection-ML-WebApp-master/finaldata.csv')

X_train, X_test, y_train, y_test = train_test_split(X_reduced,
    d_scaled['Class'], test_size=0.30, random_state=42)

X_train.shape, X_test.shape

"""# Logistic Regression"""

from sklearn.linear_model import LogisticRegression


lr=LogisticRegression()
lr.fit(X_train,y_train)
y_pred_lr=lr.predict(X_test)
y_pred_lr

from sklearn.metrics import classification_report,confusion_matrix


print(confusion_matrix(y_test,y_pred_lr))

#Hyperparamter tuning
from sklearn.model_selection import GridSearchCV
lr_model = LogisticRegression()
lr_params = {'penalty': ['l1', 'l2'],'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_lr= GridSearchCV(lr_model, param_grid = lr_params)
grid_lr.fit(X_train, y_train)

grid_lr.best_params_

y_pred_lr3=grid_lr.predict(X_test)
print(classification_report(y_test,y_pred_lr3))

"""# Support Vector Machine"""

from sklearn.svm import SVC


svc=SVC(kernel='rbf')
svc.fit(X_train,y_train)
y_pred_svc=svc.predict(X_test)
y_pred_svc

print(classification_report(y_test,y_pred_svc))

print(confusion_matrix(y_test,y_pred_svc))

from sklearn.model_selection import GridSearchCV


parameters = [ {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 1,
0.01, 0.0001 ,0.001]}]
grid_search = GridSearchCV(estimator = svc,
param_grid = parameters,
scoring = 'accuracy',
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

svc_param=SVC(kernel='rbf',gamma=0.01,C=100)
svc_param.fit(X_train,y_train)
y_pred_svc2=svc_param.predict(X_test)
print(classification_report(y_test,y_pred_svc2))

"""# Decision Tree"""

from sklearn.tree import DecisionTreeClassifier


dtree=DecisionTreeClassifier()
dtree.fit(X_train,y_train)
y_pred_dtree=dtree.predict(X_test)
print(classification_report(y_test,y_pred_dtree))

print(confusion_matrix(y_test,y_pred_dtree))

d_tree_param=DecisionTreeClassifier()
tree_parameters={'criterion':['gini','entropy'],'max_depth':list(range(2,4,1)
),
'min_samples_leaf':list(range(5,7,1))}
grid_tree=GridSearchCV(d_tree_param,tree_parameters)
grid_tree.fit(X_train,y_train)

y_pred_dtree2=grid_tree.predict(X_test)

print(classification_report(y_test,y_pred_dtree2))

"""# Random Forest"""

from sklearn.ensemble import RandomForestClassifier


randomforest=RandomForestClassifier(n_estimators=5)
randomforest.fit(X_train,y_train)
y_pred_rf=randomforest.predict(X_test)
print(confusion_matrix(y_test,y_pred_rf))

print(classification_report(y_test,y_pred_rf))

"""# K Nearest Neighbors"""

from sklearn.neighbors import KNeighborsClassifier


knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
y_pred_knn=knn.predict(X_test)
y_pred_knn

print(classification_report(y_test,y_pred_knn))

print(confusion_matrix(y_test,y_pred_knn))

knn_param=KNeighborsClassifier()
knn_params={"n_neighbors": list(range(2,5,1)), 'algorithm': ['auto',
'ball_tree', 'kd_tree', 'brute']}
grid_knn=GridSearchCV(knn_param,param_grid=knn_params)

grid_knn.fit(X_train,y_train)
grid_knn.best_params_

knn = KNeighborsClassifier(n_neighbors=2)

knn.fit(X_train,y_train)
pred_knn2 = knn.predict(X_test)

print('WITH K=2')
print('\n')
print(confusion_matrix(y_test,pred_knn2))
print('\n')
print(classification_report(y_test,pred_knn2))

"""# XGBoost"""

from xgboost import XGBClassifier


xgb=XGBClassifier()
xgb.fit(X_train,y_train)
y_pred_xg=xgb.predict(X_test)
print(classification_report(y_test,y_pred_xg))

"""# LGB"""

import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)

lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       free_raw_data=False)

parameters = {'num_leaves': 2**8,
              'learning_rate': 0.1,
              'is_unbalance': True,
              'min_split_gain': 0.1,
              'min_child_weight': 1,
              'reg_lambda': 1,
              'subsample': 1,
              'objective': 'binary',
              #'device': 'gpu',  # comment this line if you are not using GPU
              'task': 'train'
              }
num_rounds = 300

lgb_train = lgb.Dataset(X_train, y_train)

lgb_test = lgb.Dataset(X_test, y_test)

clf = lgb.train(parameters, lgb_train, num_boost_round=num_rounds)

y_prob = clf.predict(X_test)
y_pred = sklearn.preprocessing.binarize(np.reshape(y_prob, (-1,1)),
threshold= 0.5)

accuracy_score(y_test, y_pred)

print(classification_report(y_test,y_pred))

"""# ROC"""

from sklearn.metrics import roc_curve,roc_auc_score


lg_fpr,lg_tpr,lg_threshold=roc_curve(y_test,y_pred_lr3)
svc_fpr,svc_tpr,svc_threshold=roc_curve(y_test,y_pred_svc2)
dtree_fpr,dtree_tpr,dtree_threshold=roc_curve(y_test,y_pred_dtree2)
rf_fpr,rf_tpr,rf_threshold=roc_curve(y_test,y_pred_rf)
knn_fpr,knn_tpr,knn_threshold=roc_curve(y_test,pred_knn2)
xg_fpr,xg_tpr,xg_threshold=roc_curve(y_test,y_pred_xg)
lgb_fpr,lgb_tpr,lgb_threshold=roc_curve(y_test,y_pred)

plt.figure(figsize=(15, 10))
plt.title("ROC Curve")
plt.plot(lg_fpr, lg_tpr,
         label='Logistic Regression Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, y_pred_lr3)))
plt.plot(knn_fpr, knn_tpr,
         label='K-Nearest Neighbors Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, pred_knn2)))
plt.plot(svc_fpr, svc_tpr,
         label='Support Vector Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, y_pred_svc2)))
plt.plot(dtree_fpr, dtree_tpr,
         label='Decision Tree Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, y_pred_dtree2)))
plt.plot(rf_fpr, rf_tpr,
         label='Random Forest Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, y_pred_rf)))
plt.plot(xg_fpr, xg_tpr,
         label='XGBoost Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, y_pred_xg)))
plt.plot(lgb_fpr, lgb_tpr,
         label='Light Gradient Boosting Classifier Score: {:.4f}'.format(
             roc_auc_score(y_test, y_pred)))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.legend()
plt.show()
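Note that the curves above are computed from hard 0/1 predictions, which collapse each classifier to a single operating point. A hedged sketch on synthetic stand-in data of the difference when scores from `predict_proba` are used instead:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the project's data set.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels reduces the curve to one operating point;
# AUC from predicted probabilities traces the full curve.
auc_hard = roc_auc_score(y_te, clf.predict(X_te))
auc_prob = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_hard, auc_prob)
```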

CHAPTER 9
OUTPUTS AND SNAPSHOTS
HOME PAGE

Fig.No 9.1 - HOME PAGE

TIME DETAILS

Fig.No 9.2 -TIME DETAILS

TRANSACTION METHOD

Fig.No 9.3 - TRANSACTION METHOD

TRANSACTION ID

Fig.No 9.4 - TRANSACTION ID

CARD TYPE

Fig.No 9.5- CARD TYPE

LOCATION DETAILS

Fig.No 9.6 - LOCATION DETAILS

BANK TYPE

Fig.No 9.7 - BANK TYPE

PREDICTION

Fig.No 9.8 - PREDICT

FINAL OUTPUT

Fig.No 9.9 - FINAL OUTPUT

DATA HISTOGRAM

Fig.No 9.10 - DATA HISTOGRAM

MATRIX

Fig.No 9.11 - MATRIX

HEAT MAP

Fig.No 9.12 - HEAT MAP

ACCURACY COMPARISON

Fig.No 9.13 - ACCURACY COMPARISON

CHAPTER 10
CONCLUSION AND FUTURE WORK

10.1 CONCLUSION:
In this project we studied the Decision Tree, Random Forest, Logistic
Regression, SVM, KNN, and XGBoost machine learning algorithms. The
results show that the supervised algorithms perform well in terms of
accuracy, precision, recall, and F1 score, although 4 false negative
values remain. When the data are used without random under-sampling, the
reported accuracy is misleadingly high because of the heavy class
imbalance, and the models produce many incorrect outputs. After cleaning
the data and applying the algorithms, SVM emerged as the best single
algorithm; the tree-based models, by contrast, decide whether a
transaction is fraudulent by selecting a particular feature as the root
and aggregating the information gained across all trees to predict the
outcome.

10.2 FUTURE WORK :


While we couldn’t reach our goal of 100% accuracy in fraud detection,
we did end up creating a system that can, with enough time and data, get
very close to that goal. As with any such project, there is some room for
improvement here. The very nature of this project allows for multiple
algorithms to be integrated together as modules and their results can be
combined to increase the accuracy of the final result. This model can
further be improved with the addition of more algorithms into it.
However, the output of these algorithms needs to be in the same format as
the others. Once that condition is satisfied, the modules are easy to add as
done in the code. This provides a great degree of modularity and
versatility to the project. More room for improvement can be found in the
dataset. As demonstrated before, the precision of the algorithms increases
when the size of the dataset is increased. Hence, more data will surely
make the model more accurate in detecting frauds and reduce the number
of false positives. However, this requires official support from the banks
themselves.
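The modular combination of algorithms described above can be sketched with scikit-learn's VotingClassifier, which wraps several trained classifiers behind one interface and averages their probabilities. This is an illustrative sketch on synthetic data with assumed member models, not the project's final pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the credit card data set.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=7)

# Each algorithm plugs in as a named module; soft voting averages the
# predicted probabilities of the members to form the final decision.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=7)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=7))],
    voting="soft")
ensemble.fit(X_tr, y_tr)
print(ensemble.score(X_te, y_te))
```

Adding a further algorithm then reduces to appending another `(name, estimator)` pair, provided it exposes the same fit/predict interface.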

REFERENCES

1) R. R. Subramanian, R. Ramar, "Design of Offline and Online Writer
   Inference Technique", International Journal of Innovative Technology
   and Exploring Engineering, vol. 9, no. 2S2, Dec. 2019, ISSN: 2278-3075.
2) R. R. Subramanian, K. Seshadri, "Design and Evaluation of a Hybrid
   Hierarchical Feature Tree Based Authorship Inference Technique", in:
   M. Kolhe, M. Trivedi, S. Tiwari, V. Singh (eds), Advances in Data and
   Information Sciences, Lecture Notes in Networks and Systems, vol. 39,
   Springer, Singapore, 2019.
3) T. Joshva Devadas, R. Raja Subramanian, "Paradigms for Intelligent
   IoT Architecture", in: S. L. Peng, S. Pal, L. Huang (eds), Principles
   of Internet of Things (IoT) Ecosystem: Insight Paradigm, Intelligent
   Systems Reference Library, vol. 174, Springer, Cham, 2020.
4) R. R. Subramanian, B. R. Babu, K. Mamta, K. Manogna, "Design and
   Evaluation of a Hybrid Feature Descriptor based Handwritten Character
   Inference Technique", 2019 IEEE International Conference on
   Intelligent Techniques in Control, Optimization and Signal Processing
   (INCOS), Tamil Nadu, India, 2019, pp. 1-5.
5) M. J. Islam, Q. M. J. Wu, M. Ahmadi, M. A. Sid-Ahmed, "Investigating
   the Performance of Naive-Bayes Classifiers and K-Nearest-Neighbor
   Classifiers", IEEE International Conference on Convergence Information
   Technology, pp. 1541-1546, 2007.
6) R. Wheeler, S. Aitken, "Multiple algorithms for fraud detection",
   Knowledge-Based Systems, Elsevier, vol. 13, no. 2, pp. 93-99, 2000.
7) S. Patil, H. Somavanshi, J. Gaikwad, A. Deshmane, R. Badgujar,
   "Credit Card Fraud Detection Using Decision Tree Induction Algorithm",
   International Journal of Computer Science and Mobile Computing
   (IJCSMC), vol. 4, no. 4, pp. 92-95, 2015, ISSN: 2320-088X.
8) S. Maes, K. Tuyls, B. Vanschoenwinkel, B. Manderick, "Credit card
   fraud detection using Bayesian and neural networks", Proceedings of
   the 1st International NAISO Congress on Neuro Fuzzy Technologies,
   pp. 261-270, 2002.
9) S. Bhattacharyya, S. Jha, K. Tharakunnel, J. C. Westland, "Data
   mining for credit card fraud: A comparative study", Decision Support
   Systems, vol. 50, no. 3, pp. 602-613, 2011.
10) Y. Sahin, E. Duman, "Detecting credit card fraud by ANN and logistic
   regression", Innovations in Intelligent Systems and Applications
   (INISTA) 2011 International Symposium, pp. 315-319, 2011.
11) Selvani Deepthi Kavila, S.V.S.S. Lakshmi, B. Rajesh, "Automated
   Essay Scoring using Feature Extraction Method", IJCER, vol. 7,
   issue 4(L), pp. 12161-12165.
12) S.V.S.S. Lakshmi, K. S. Deepthi, Ch. Suresh, "Text Summarization
   basing on Font and Cue-phrase

CONFERENCE CERTIFICATE

CREDIT CARD FRAUD DETECTION USING
MACHINE LEARNING

Anuja T1, Pavan Sidharth J2, Esohin E3, Bala Bharathi S4

1 Assistant Professor, Department of Information Technology,
Jeppiaar Engineering College, Chennai, India

2,3,4 UG Student, Department of Information Technology,
Jeppiaar Engineering College, Chennai, India

E-mail: 1 anuanuja2013@gmail.com, 2 pavansidharth19it055@gmail.com,
3 esohin120@gmail.com, 4 balabharathi19jeit015@gmail.com

ABSTRACT:

Credit cards have become one of the most commonly used payment modes in
recent years. As technology develops, the number of fraud cases also
increases, creating the need for a fraud detection algorithm that can
accurately find and eradicate fraudulent activity. This project work
applies several machine learning based classification algorithms,
together with hyperparameter tuning and PCA, to preprocess and handle
the heavily imbalanced data set, and evaluates each model by accuracy,
precision, recall, and F1 score. Credit card frauds are easy and
attractive targets: e-commerce and many other online sites have expanded
online payment modes, increasing the risk of online fraud. With rising
fraud rates, researchers have turned to different machine learning
methods to detect and analyze fraud in online transactions. The main aim
of this paper is to design and develop a novel fraud detection method
for streaming transaction data, with the objective of analyzing the past
transaction details of customers and extracting their behavioral
patterns. Cardholders are clustered into different groups based on their
transaction amount; a sliding window strategy then aggregates the
transactions made by the cardholders in each group so that the
behavioral pattern of each group can be extracted. Different classifiers
are trained over the groups separately, and the classifier with the best
rating score is chosen as the method to predict frauds.

KEYWORDS: Streaming Transaction Data, Accuracy, Hyperparameter Tuning,
Novel Fraud Detection Method.

