Final Report
ESOHIN E (310819205028)
of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE
This is to certify that this Project Report “CREDIT CARD FRAUD DETECTION USING MACHINE LEARNING” is the bonafide work of “BALA BHARATHI S, ESOHIN E and PAVAN SIDHARTH J”, who carried out the project under my supervision.
ACKNOWLEDGEMENT
We are very much indebted to (Late) Hon’ble Colonel Dr. JEPPIAAR, M.A., B.L., Ph.D., our Chairman and Managing Director Dr. M. REGEENA JEPPIAAR, B.Tech., M.B.A., Ph.D., the Principal Dr. J. FRANCIS XAVIER, M.Tech., Ph.D., and the Dean of Academics Dr. SHALEESHA A. STANLEY, M.Sc., M.Phil., Ph.D., for the opportunity to carry out the project here.
We would like to express our deep sense of gratitude to Dr. S. Venkatesh, M.E., Ph.D., Head of the Department, and also to our guide Mrs. T. Anuja, M.E., for giving valuable suggestions that made this project a grand success.
We also thank the teaching and non-teaching staff members of the Department of Information Technology for their constant support.
ABSTRACT
Credit cards are the most commonly used payment mode in recent years. As technology develops, the number of fraud cases is also increasing, which creates the need for a fraud detection algorithm that can accurately find and eradicate fraudulent activities. This project work applies different machine-learning-based classification algorithms, together with hyperparameter tuning and PCA, to preprocess and handle the heavily imbalanced data set. Finally, this project work calculates the accuracy, precision, recall, and F1 score. Credit card frauds are easy and friendly targets. E-commerce and many other online sites have increased the online payment modes, increasing the risk of online fraud. With the increase in fraud rates, researchers started using different machine learning methods to detect and analyze frauds in online transactions. The main aim of this work is to design and develop a novel fraud detection method for streaming transaction data, with the objective of analyzing the past transaction details of the customers and extracting the behavioral patterns. Cardholders are clustered into different groups based on their transaction amounts. Then, using a sliding-window strategy, the transactions made by the cardholders in each group are aggregated so that the behavioral pattern of each group can be extracted. Later, different classifiers are trained over the groups separately, and the classifier with the better rating score is chosen as one of the best methods to predict frauds.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
CHAPTER 2: LITERATURE SURVEY
2.1 LITERATURE SURVEY
CHAPTER 3: SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
3.3 BLOCK DIAGRAM
3.3.1 DESCRIPTION OF THE SYSTEM BLOCK DIAGRAM
3.4 FLOW DIAGRAM
CHAPTER 4: METHODOLOGIES AND ALGORITHMS
CHAPTER 5: SYSTEM DESIGN
CHAPTER 6: SOFTWARE DESIGN
CHAPTER 7: TESTING
7.2 TESTING TRADITIONAL SOFTWARE SYSTEMS V/S MACHINE LEARNING SYSTEMS
7.3 MODEL TESTING AND MODEL EVALUATION
7.3.1 WRITING TEST CASES
7.4 PROJECT TESTING
CHAPTER 8: IMPLEMENTATION
CHAPTER 9: OUTPUTS AND SNAPSHOTS
CHAPTER 10: CONCLUSION AND FUTURE WORK
10.1 CONCLUSION
10.2 FUTURE WORK
REFERENCES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Credit card fraud is a wide-ranging term for theft and fraud committed using, or involving, a payment card at the time of payment. The purpose may be to purchase goods without paying, or to transfer unauthorized funds from an account. Credit card fraud is also an adjunct to identity theft. As per information from the United States Federal Trade Commission, the rate of identity theft had been holding stable during the mid-2000s, but it increased by 21 percent in 2008, even though credit card fraud, the crime most people associate with identity theft, decreased as a percentage of all identity-theft complaints. In 2000, out of 13 billion transactions made annually, approximately 10 million, or one out of every 1300 transactions, turned out to be fraudulent. Also, 0.05% (5 out of every 10,000) of all monthly active accounts were fraudulent. Today, even with fraud detection systems in place, fraud still amounts to about one-twelfth of one percent of all transactions processed, which translates into billions of dollars in losses. Credit card fraud is one of the biggest threats to business establishments today. However, to combat fraud effectively, it is important to first understand the mechanisms of executing a fraud. Credit card fraudsters employ a large number of ways to commit fraud. In simple terms, credit card fraud is defined as “when an individual uses another individual’s credit card for personal reasons while the owner of the card and the card issuer are not aware of the fact that the card is being used”. Card fraud begins either with the theft of the physical card or with the compromise of important data associated with the account, including the card account number or other information that would necessarily be available to a merchant during a permissible transaction. Card numbers, generally the Primary Account Number (PAN), are often printed on the card, and a magnetic stripe on the back contains the data in machine-readable format. The card contains the following fields:
● Name of card holder
● Card number
● Expiration date
● Verification/CVV code
● Type of card
There are many more methods of committing credit card fraud, and fraudsters are very talented and fast-moving people. The traditional approach identified in this work is application fraud, where a person gives wrong information about himself to get a credit card. There is also the unauthorized use of lost and stolen cards, which makes up a significant area of credit card fraud. There are more sophisticated credit card fraudsters, starting with those who produce fake and doctored cards; there are also those who use skimming to commit fraud, whereby the information held on either the magnetic strip on the back of the credit card or the data stored on the smart chip is copied from one card to another. Site cloning and false merchant sites on the Internet are becoming a popular method of fraud for many criminals with a skilled ability for hacking. Such sites are developed to get people to hand over their credit card details without knowing they have been swindled.
Types of Frauds:
CHAPTER 2
LITERATURE SURVEY
imbalance, concept drift, and verification latency. Third, in our
experiments, we demonstrate the impact of class unbalance and concept
drift in a real-world data stream containing more than 75 million
transactions, authorized over a time window of three years.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, purchase behavior and fraudster strategies may change over time. This phenomenon is named dataset shift [1] or concept drift in the domain of fraud detection [2]. In this paper, we present a method to quantify, day by day, the dataset shift in our face-to-face credit card transactions dataset (card holder located in the shop). In practice, we classify the days against each other and measure the efficiency of the classification. The more efficient the classification, the more different the buying behavior between two days, and vice versa. Therefore, we obtain a distance matrix characterizing the dataset shift. After an agglomerative clustering of the distance matrix, we observe that the dataset shift pattern matches the calendar events for this time period (holidays, weekends, etc.). We then incorporate this dataset shift knowledge in the credit card fraud detection task.
2.3 A Novel Approach for Credit Card Fraud Detection
using Decision Tree and Random Forest Algorithms
Authors: M R Dileep; A V Navaneeth; M Abhishek
Date of Publication: 31 March 2021
DOI: 10.1109/ICICV50876.2021.9388431
2.4 Machine Learning For Credit Card Fraud Detection
System
Authors: Ruttala Sailusha; V. Gnaneswar; R. Ramesh; G. Ramakoteswara Rao
Date of Publication: 13-15 May 2020
DOI: 10.1109/ICICCS48265.2020.9121114
variables based on sensitivity, specificity, accuracy, and error rate. The results show that the accuracies of the logistic regression, decision tree, and random forest classifiers are 90.0%, 94.3%, and 95.5% respectively. The comparative results show that the random forest performs better than the logistic regression and decision tree techniques.
CHAPTER 3
SYSTEM ANALYSIS
DISADVANTAGES
● The system does not have a huge data set.
● For smaller amounts of data, the results may not be accurate.
● of the fraudulent transactions while minimizing the incorrect fraud classifications. Credit card fraud detection is a typical example of classification.
● In this process, we have focused on analyzing and pre-processing the data set, as well as deploying multiple anomaly detection algorithms such as logistic regression.
ADVANTAGES
● Protection against credit card fraud.
● Free credit score information.
● No foreign transaction fees.
● Prepare the data (dealing with missing values, with categorical values…).
● Split the data correctly into train and test sets.
● Apply machine learning algorithms and make the prediction; a minimal sketch of this pipeline follows.
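The following is a minimal sketch of that pipeline, assuming a pandas DataFrame loaded from a CSV with a binary 'Class' label column (the file name and model choice here are illustrative, not taken from the report's code):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Prepare the data: load it and deal with missing values.
data = pd.read_csv('creditcard.csv')   # illustrative file name
data = data.dropna()

X = data.drop('Class', axis=1)
y = data['Class']

# Split the data correctly into train and test sets (stratified,
# since fraud labels are heavily imbalanced).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Apply a machine learning algorithm and make the prediction.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)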
CHAPTER 4
METHODOLOGIES AND ALGORITHMS
Fig.No 4.1.1.1 - LOGISTIC REGRESSION HYPOTHESIS
Since our data set has two features: height and weight, the logistic
regression hypothesis is the following:
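The equation in the figure is not reproduced in this extract; a standard reconstruction of the hypothesis it depicts (with parameter vector $\theta$) is the sigmoid of a linear combination of the two features:

$$h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1\,\mathrm{height} + \theta_2\,\mathrm{weight})}}$$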
problems. As we know, a forest is made up of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, then gets the prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method, which is better than a single decision tree because it reduces over-fitting by averaging the result.
Step 1 − First, start with the selection of random samples from the given dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.
Step 3 − Voting will then be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final prediction result.
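A minimal sketch of these steps with scikit-learn (the hyperparameter values are illustrative; X_train, y_train, and X_test come from the train/test split described earlier):

from sklearn.ensemble import RandomForestClassifier

# Each tree is built on a random bootstrap sample (Steps 1-2); the
# forest combines the per-tree predictions by majority vote (Steps 3-4).
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)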
Fig.No 4.1.2 - RANDOM FOREST CLASSIFIER
● Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
● Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
● Step-3: Divide S into subsets that contain possible values for the best attributes.
● Step-4: Generate the decision tree node, which contains the best attribute.
● Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is then called a leaf node. A minimal sketch of these steps follows.
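A short scikit-learn sketch of the procedure above (the criterion and depth are illustrative choices):

from sklearn.tree import DecisionTreeClassifier

# criterion='gini' is the Attribute Selection Measure (ASM) used to pick
# the best attribute at each node (Steps 1-2); splitting then recurses
# over the subsets until the nodes become leaves (Steps 3-5).
dtree = DecisionTreeClassifier(criterion='gini', max_depth=5)
dtree.fit(X_train, y_train)
y_pred_dtree = dtree.predict(X_test)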
4.1.5 XG BOOST
Gradient-boosted decision trees are implemented by the XGBoost library for Python, built for speed and performance, which are among the most important aspects of machine learning (ML).
XGBoost: the XGBoost (Extreme Gradient Boosting) library for Python was introduced by scholars at the University of Washington. It is a Python module written in C++ that speeds up the training of ML models by gradient boosting.
Gradient boosting: this is a machine learning method used in classification and regression tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, commonly decision trees. A short usage sketch follows.
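A minimal usage sketch of the library (the hyperparameter values are illustrative):

from xgboost import XGBClassifier

# Gradient boosting: an ensemble of weak decision trees, where each new
# tree is fit to the residual errors of the current ensemble.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)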
Step 3: Split the dataset into train and test sets using sklearn before building the SVM model.
Step 4: Import the support vector classifier (SVC) function from the sklearn SVM module and build the Support Vector Machine model with the help of the SVC function, as sketched below.
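A minimal sketch of Steps 3 and 4 (assuming a feature matrix X and labels y prepared earlier; the kernel choice is illustrative):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Step 3: split the dataset into train and test sets using sklearn.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Step 4: build the Support Vector Machine model with the SVC function.
svc = SVC(kernel='rbf')
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)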
classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support Vectors
Support vectors are data points that are closer to the hyperplane and
influence the position and orientation of the hyperplane. Using these
support vectors, we maximize the margin of the classifier. Deleting the
support vectors will change the position of the hyperplane. These are the
points that help us build our SVM.
Hinge loss function (function on left can be represented as a function on
the right)
The cost is 0 if the predicted value and the actual value are of the same
sign. If they are not, we then calculate the loss value. We also add a
regularization parameter to the cost function. The objective of the
regularization parameter is to balance the margin maximization and loss.
After adding the regularization parameter, the cost function looks as
below.
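The figure with the updated cost function is not reproduced in this extract; a standard reconstruction of the regularized hinge-loss objective (with weight vector $w$, regularization strength $\lambda$, and labels $y_i \in \{-1, +1\}$) is:

$$J(w) = \lambda \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \langle x_i, w \rangle\bigr)$$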
Gradients
When there is no misclassification, i.e. our model correctly predicts the
class of our data point, we only have to update the gradient from the
regularization parameter.
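The update formulas themselves fall on a lost page; under the objective above (a standard reconstruction, with learning rate $\alpha$), the two cases of the stochastic gradient update are:

$$w \leftarrow w - \alpha \cdot 2\lambda w \quad \text{(correctly classified, } y_i \langle x_i, w \rangle \ge 1\text{)}$$

$$w \leftarrow w + \alpha \,(y_i x_i - 2\lambda w) \quad \text{(misclassified)}$$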
CHAPTER 5
SYSTEM DESIGN
5.1 FUNCTIONAL AND NONFUNCTIONAL
REQUIREMENTS:
● Security
● Maintainability
● Reliability
● Scalability
● Performance
● Reusability
● Flexibility
Examples of non-functional requirements:
1) Emails should be sent with a latency of no greater than 12 hours from such an activity.
2) The processing of each request should be done within 10 seconds.
3) The site should load within 3 seconds when the number of simultaneous users is greater than 10,000.
5.3 UML DIAGRAMS
UML stands for Unified Modelling Language. UML is a
standardized general-purpose modeling language in the field of
object-oriented software engineering. The standard is managed, and was
created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form UML comprises two major components: a Meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modelling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems, as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the
core concepts.
3. Be independent of particular programming languages and
development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
5.3.1 USE CASE DIAGRAM
► A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a Use-case
analysis.
5.3.2 CLASS DIAGRAM
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among the classes. It explains which class contains information.
Fig.No 5.3.3 - SEQUENCE DIAGRAM
5.3.5 DEPLOYMENT DIAGRAM
Deployment diagram represents the deployment view of a system. It is related to the component diagram, because the components are deployed using the deployment diagrams. A deployment diagram consists of nodes. Nodes are nothing but the physical hardware used to deploy the application.
5.3.7 COMPONENT DIAGRAM:
A component diagram, also known as a UML component diagram,
describes the organization and wiring of the physical components in a
system. Component diagrams are often drawn to help model
implementation details and double-check that every aspect of the system's
required function is covered by planned development.
5.3.8 ER DIAGRAM:
An Entity–relationship model (ER model) describes the structure of a
database with the help of a diagram, which is known as Entity
Relationship Diagram (ER Diagram). An ER model is a design or
blueprint of a database that can later be implemented as a database. The
main components of the E-R model are: entity set and relationship set.
An ER diagram shows the relationship among entity sets. An entity set is
a group of similar entities and these entities can have attributes. In terms
of DBMS, an entity is a table or attribute of a table in a database, so by
showing relationship among tables and their attributes, ER diagram
shows the complete logical structure of a database. Let’s have a look at a
simple ER diagram to understand this concept.
Fig.No 5.3.8 - ER DIAGRAM
Fig.No 5.3.9 - DFD DIAGRAM
CHAPTER 6
SOFTWARE DESIGN
integrated in the next phase. Each unit is developed and tested for
its functionality, which is referred to as Unit Testing.
6.2.1 ECONOMIC FEASIBILITY:
This study is carried out to check the economic impact that the system
will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus, the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
6.3 MODULES
● Data collection
● Data pre-processing
● Feature extraction
● Model training
● Testing model
● Performance evaluation
● Prediction
● In this module data cleaning is done to prepare the data for analysis by removing or modifying the data that may be incorrect, incomplete, duplicated, or improperly formatted.
● In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify the data cleaning operations you may want to perform.
● At this stage we also apply PCA and hyperparameter tuning.
6.3.4 MODEL TRAINING
● A training model is a dataset that is used to train an ML algorithm. It
consists of the sample output data and the corresponding sets of input
data that have an influence on the output.
● The training model is used to run the input data through the
algorithm to correlate the processed output against the sample output.
The result from this correlation is used to modify the model.
● This iterative process is called “model fitting”. The accuracy of the
training dataset or the validation dataset is critical for the precision of
the model.
● Model training in machine learning is the process of feeding an ML algorithm with data to help identify and learn good values for all attributes involved.
● There are several types of machine learning models, of which the most common ones are supervised and unsupervised learning.
● In this module we use supervised classification algorithms such as logistic regression to train the model on the cleaned dataset after dimensionality reduction.
● Moreover, software testing has the power to point out all the defects
and flaws during development. You don’t want your clients to
encounter bugs after the software is released and come to you waving
their fists. Different kinds of testing allow us to catch bugs that are
visible only during runtime.
6.3.7 PREDICTION
● “Prediction” refers to the output of an algorithm after it has
been trained on a historical dataset and applied to new data when
forecasting the likelihood of a particular outcome, such as whether or
not a customer will churn in 30 days.
● The algorithm will generate probable values for an unknown variable
for each record in the new data, allowing the model builder to identify
what that value will most likely be.
● The word “prediction” can be misleading. In some cases, it really
does mean that you are predicting a future outcome, such as when
you’re using machine learning to determine the next best action in a
marketing campaign.
● Other times, though, the “prediction” has to do with, for example,
whether or not a transaction that already occurred was fraudulent.
● In that case, the transaction already happened, but you’re making an
educated guess about whether or not it was legitimate, allowing you
to take the appropriate action.
CHAPTER 7
TESTING
7.1 INTRODUCTION
Testing forms an integral part of any software development project.
Testing helps in ensuring that the final product is by and large, free of
defects and it meets the desired requirements. Proper testing in the
development phase helps in identifying the critical errors in the design
and implementation of various functionalities thereby ensuring product
reliability. Even though it is a bit time-consuming and a costly process at
first, it helps in the long run of software development.
7.2 Testing Traditional Software Systems v/s Machine
Learning Systems
In traditional software systems, code is written for having a desired
behavior as the outcome. Testing them involves testing the logic behind
the actual behavior and how it compares with the expected behavior. In
machine learning systems, however, data and desired behavior are the
inputs and the models learn the logic as the outcome of the training and
optimization processes.
7.3 MODEL TESTING AND MODEL EVALUATION
From the discussion above, it may feel as if model testing is the same as
model evaluation but that’s not true. Model evaluations focus on the
performance metrics of the models like accuracy, precision, the area
under the curve, f1 score, log loss, etc. These metrics are calculated on
the validation dataset and remain confined to that. Though the evaluation
metrics are necessary for assessing a model, they are not sufficient
because they don’t shed light on the specific behaviors of the model.
It is entirely possible that a model's evaluation metrics have improved while its behavior on a core functionality has regressed. Or retraining a model on new data might introduce a bias against marginalized sections of society while showing no particular difference in the metric values. This is especially harmful in the case of ML systems, since such problems might not come to light easily but can have devastating impacts.
Model tests are commonly divided into two categories:
● Pre-train tests
● Post-train tests
Pre-train tests: The intention is to write such tests which can be run
without trained parameters so that we can catch implementation errors
early on. This helps in avoiding the extra time and effort spent in a wasted
training job.
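A small illustration of such pre-train tests (the function and fixture names are hypothetical, not from this project's code):

# Illustrative pre-train tests: they run before any training, so they
# check only data and split integrity, never learned parameters.
def test_label_values(data):
    # The 'Class' label must be binary: 1 for fraud, 0 otherwise.
    assert set(data['Class'].unique()) <= {0, 1}

def test_no_overlap(train_idx, test_idx):
    # Train and test rows must not leak into each other.
    assert len(set(train_idx) & set(test_idx)) == 0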
● Invariance tests, wherein changes to irrelevant features should leave the prediction unchanged. For example, in the Titanic survivor probability prediction data, a change in the passenger’s name should not affect their chances of survival.
● Directional expectations wherein we test for a direct relation between
feature values and predictions. For example, in the case of a loan
prediction problem, having a higher credit score should definitely
increase a person’s eligibility for a loan.
● Apart from this, you can also write tests for any other failure modes
identified for your model.
Now, let’s try a hands-on approach and write tests for the credit card fraud dataset used in this project. Here, we are given a bunch of features and we have to predict whether a transaction is fraudulent.
It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
Doing a little bit of analysis on the dataset will reveal the relationships between various features. Since the main aim here is to learn how to write tests, we will skip the analysis part and directly write basic tests.
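Below is a hedged sketch of such basic tests for the fraud classifier (pytest-style; the model and data fixtures are assumed to be provided elsewhere):

import numpy as np

# Illustrative post-train tests for a fitted fraud model.
def test_output_shape(model, X_test):
    # One prediction per input row.
    assert model.predict(X_test).shape[0] == X_test.shape[0]

def test_prediction_labels(model, X_test):
    # Predictions must be valid class labels: 0 (genuine) or 1 (fraud).
    assert set(np.unique(model.predict(X_test))) <= {0, 1}

def test_time_shift_invariance(model, X_test):
    # Invariance expectation: shifting 'Time' by one second should not
    # flip predictions, since fraud should not depend on the dataset's
    # clock origin (an assumed, illustrative expectation).
    shifted = X_test.copy()
    shifted['Time'] = shifted['Time'] + 1.0
    assert (model.predict(X_test) == model.predict(shifted)).mean() > 0.99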
CHAPTER 8
IMPLEMENTATION
index.html
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="{{ url_for('static',
filename='css/indexstyle.css') }}">
<title>ML API</title>
</head>
<body>
<form action="{{ url_for('predict')}}" method="POST">
<input id="input-1" type="text" placeholder="Enter time" name ="time"
required autofocus />
<label for="input-1">
<span class="label-text">TIME</span>
<span class="nav-dot"></span>
<div class="signup-button-trigger">Credit Card Fraud
Prediction</div>
</label>
<input id="input-2" type="text" placeholder="Enter Amount"
name="amount" required />
<label for="input-2">
<span class="label-text">AMOUNT</span>
<span class="nav-dot"></span>
</label>
<input id="input-3" type="text" placeholder="Enter Transaction
Method" name="tm" required />
<label for="input-3">
<span class="label-text">Transaction Method</span>
<span class="nav-dot"></span>
</label>
<input id="input-4" type="text" placeholder="Transaction id" name="ti"
required />
<label for="input-4">
<span class="label-text">Transaction id</span>
<span class="nav-dot"></span>
</label>
<input id="input-5" type="text" placeholder="Enter Type Of card"
name="ct" required />
<label for="input-5">
<span class="label-text">Card Type</span>
<span class="nav-dot"></span>
</label>
<input id="input-6" type="text" placeholder="Enter Location"
name="location" required />
<label for="input-6">
<span class="label-text">Enter Location</span>
<span class="nav-dot"></span>
</label>
<input id="input-7" type="text" placeholder="Enter Bank" name="em"
required />
<label for="input-7">
<span class="label-text">Enter Bank</span>
<span class="nav-dot"></span>
</label>
<button type="submit">Predict</button>
<p class="tip">Press Tab</p>
<div class="signup-button">Credit card Fraud Detection</div>
</form>
</body>
</html>
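The Flask routing code that serves this form is not included in the report's excerpt; a minimal sketch consistent with the template's url_for('predict') call might look as follows (the model file name and the numeric encoding of the form fields are assumptions):

from flask import Flask, render_template, request
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')   # hypothetical trained-model file

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Collect the seven form fields in the order the model expects;
    # the values are assumed to be numerically encoded.
    fields = ['time', 'amount', 'tm', 'ti', 'ct', 'location', 'em']
    values = [float(request.form[f]) for f in fields]
    label = model.predict([values])[0]
    return render_template('result.html', prediction=label)

if __name__ == '__main__':
    app.run(debug=True)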
result.html
</head>
<!--Waves Container-->
<div>
<svg class="waves" xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
viewBox="0 24 150 28" preserveAspectRatio="none"
shape-rendering="auto">
<defs>
<path id="gentle-wave" d="M-160 44c30 0 58-18 88-18s 58 18 88 18
58-18 88-18 58 18 88 18 v44h-352z" />
</defs>
<g class="parallax">
<use xlink:href="#gentle-wave" x="48" y="0" fill="#bb1515" />
<use xlink:href="#gentle-wave" x="48" y="3" fill="#bb1515" />
<use xlink:href="#gentle-wave" x="48" y="5"
fill="rgba(255,255,255,0.3)" />
<use xlink:href="#gentle-wave" x="48" y="7" fill="#bb1515" />
</g>
</svg>
</div>
<!--Waves end-->
</div>
<!--Header ends-->
<!--Content starts-->
<div class="content flex">
</div>
<!--Content ends-->
</body>
8.2 BACK END CODING
"""# Feature Scaling"""
scaler = StandardScaler()
x= data[frames]
temp_col=scaler.fit_transform(x)
scaled_col.head()
d_scaled.head()
y = data['Class']
d_scaled.head()
pca = PCA(n_components=7)
X_temp_reduced = pca.fit_transform(d_scaled)
pca.explained_variance_ratio_
pca.explained_variance_
names=['Time','Amount','Transaction Method','Transaction
Id','Location','Type of Card','Bank']
X_reduced= pd.DataFrame(X_temp_reduced,columns=names)
X_reduced.head()
Y=d_scaled['Class']
new_data=pd.concat([X_reduced,Y],axis=1)
new_data.head()
new_data.shape
new_data.to_csv('E:/ats
projects/Credit-Card-Fraud-Detection-ML-WebApp-master/finaldata.csv')
# The train/test split call is omitted in the excerpt; an 80/20 split is assumed.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_reduced, Y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Grid-search the penalty type and regularization strength C;
# the liblinear solver supports both l1 and l2 penalties.
lr_model = LogisticRegression(solver='liblinear')
lr_params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_lr = GridSearchCV(lr_model, param_grid=lr_params)
grid_lr.fit(X_train, y_train)
grid_lr.best_params_

y_pred_lr3 = grid_lr.predict(X_test)
print(classification_report(y_test, y_pred_lr3))
"""# Support Vector Machine"""
print(classification_report(y_test,y_pred_svc))
print(confusion_matrix(y_test,y_pred_svc))
svc_param=SVC(kernel='rbf',gamma=0.01,C=100)
svc_param.fit(X_train,y_train)
y_pred_svc2=svc_param.predict(X_test)
print(classification_report(y_test,y_pred_svc2))
from sklearn.tree import DecisionTreeClassifier

# The baseline fit is omitted in the excerpt; reconstructed here.
d_tree = DecisionTreeClassifier().fit(X_train, y_train)
y_pred_dtree = d_tree.predict(X_test)
print(confusion_matrix(y_test, y_pred_dtree))

# Grid-search over split criterion, depth, and leaf size.
d_tree_param = DecisionTreeClassifier()
tree_parameters = {'criterion': ['gini', 'entropy'],
                   'max_depth': list(range(2, 4, 1)),
                   'min_samples_leaf': list(range(5, 7, 1))}
grid_tree = GridSearchCV(d_tree_param, tree_parameters)
grid_tree.fit(X_train, y_train)
y_pred_dtree2 = grid_tree.predict(X_test)
print(classification_report(y_test, y_pred_dtree2))
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# The random forest and baseline KNN fits are omitted in the
# excerpt; reconstructed here.
rf = RandomForestClassifier().fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(classification_report(y_test, y_pred_rf))

knn_base = KNeighborsClassifier().fit(X_train, y_train)
y_pred_knn = knn_base.predict(X_test)
print(classification_report(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))

# Grid-search the neighbor count and search algorithm.
knn_param = KNeighborsClassifier()
knn_params = {"n_neighbors": list(range(2, 5, 1)),
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knn = GridSearchCV(knn_param, param_grid=knn_params)
grid_knn.fit(X_train, y_train)
grid_knn.best_params_

# Refit with the selected neighbor count.
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
pred_knn2 = knn.predict(X_test)
print('WITH K=2')
print('\n')
print(confusion_matrix(y_test, pred_knn2))
print('\n')
print(classification_report(y_test, pred_knn2))
"""# XGBoost"""
"""# LGB"""
54
lgb_train = lgb.Dataset(X_train, y_train)
y_prob = clf.predict(X_test)
y_pred = sklearn.preprocessing.binarize(np.reshape(y_prob, (-1,1)),
threshold= 0.5)
accuracy_score(y_test, y_pred)
print(classification_report(y_test,y_pred))
"""# ROC"""
plt.figure(figsize=(15,10))
plt.title("Roc Curve")
plt.plot(lg_fpr,lg_tpr, label='Logistic Regression Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, y_pred_lr3)))
plt.plot(knn_fpr,knn_tpr, label='KNears Neighbors Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, pred_knn2)))
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, y_pred_svc2)))
plt.plot(dtree_fpr, dtree_tpr, label='Decision Tree Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, y_pred_dtree2)))
plt.plot(rf_fpr,rf_tpr, label='Random Forest Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, y_pred_rf)))
plt.plot(xg_fpr,xg_tpr, label='XGBoost Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, y_pred_xg)))
55
plt.plot(lgb_fpr,lgb_tpr, label='Light Gradient Boosting Classifier Score:
{:.4f}'.format(roc_auc_score(y_test, y_pred)))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.legend()
plt.show()
CHAPTER 9
OUTPUTS AND SNAPSHOTS
HOME PAGE
TIME DETAILS
TRANSACTION METHOD
TRANSACTION ID
CARD TYPE
LOCATION DETAILS
BANK TYPE
PREDICTION
FINAL OUTPUT
DATA HISTOGRAM
MATRIX
HEAT MAP
ACCURACY COMPARISON
CHAPTER 10
CONCLUSION AND FUTURE WORK
10.1 CONCLUSION:
In this project we studied the decision tree, random forest, logistic regression, SVM, KNN, and XGBoost machine learning algorithms. The results show that the supervised machine learning algorithms perform well in terms of accuracy, precision, recall, and F1 score, although 4 false-negative values remain; and when we use the data without random under-sampling, the heavy class imbalance produces a misleadingly high accuracy and many false outputs. After cleaning the data and applying the algorithms, SVM emerged as the best algorithm, while the tree-based models decide whether a transaction is fraudulent by making a particular feature the root and gaining information from all the trees to predict the outcome.
versatility to the project. More room for improvement can be found in the
dataset. As demonstrated before, the precision of the algorithms increases
when the size of the dataset is increased. Hence, more data will surely
make the model more accurate in detecting frauds and reduce the number
of false positives. However, this requires official support from the banks
themselves.
REFERENCES
Computing (IJCSMC), vol. 4, no. 4, pp. 92-95, 2015, ISSN: 2320-088X.
7) S. Maes, K. Tuyls, B. Vanschoenwinkel, B. Manderick, "Credit card
fraud detection using Bayesian and neural networks", Proceedings of
the 1st international naiso congress on neuro fuzzy technologies, pp.
261-270, 2002.
8) S. Bhattacharyya, S. Jha, K. Tharakunnel, J. C. Westland, "Data
mining for credit card fraud: A comparative study", Decision Support
Systems, vol. 50, no. 3, pp. 602-613, 2011.
9) Y. Sahin, E. Duman, "Detecting credit card fraud by ANN and logistic
regression", Innovations in Intelligent Systems and Applications
(INISTA) 2011 International Symposium, pp. 315-319, 2011.
10) Selvani Deepthi Kavila, S.V.S.S. Lakshmi, B. Rajesh, "Automated Essay Scoring using Feature Extraction Method", IJCER, vol. 7, issue 4(L), pp. 12161-12165.
11) S.V.S.S. Lakshmi, K.S. Deepthi, Ch. Suresh, "Text Summarization basing on Font and Cue-phrase
CONFERENCE CERTIFICATE
CREDIT CARD FRAUD DETECTION USING
MACHINE LEARNING
Anuja T1, Pavan Sidharth J2, Esohin E3, Bala Bharathi S4
1 Assistant Professor, Department of Information Technology, India
2,3,4 UG Student, Department of Information Technology
2 pavansidharth19it055@gmail.com, 3 esohin120@gmail.com, 4 balabharathi19jeit015@gmail.com
ABSTRACT:
Credit cards are the most commonly used payment mode in recent years. As technology develops, the number of fraud cases is also increasing, which creates the need for a fraud detection algorithm that can accurately find and eradicate fraudulent activities. This project work applies different machine-learning-based classification algorithms, together with hyperparameter tuning and PCA, to preprocess and handle the heavily imbalanced data set. Finally, this project work calculates the accuracy, precision, recall, and F1 score. Credit card frauds are easy and friendly targets. E-commerce and many other online sites have increased the online payment modes, increasing the risk of online fraud. With the increase in fraud rates, researchers started using different machine learning methods to detect and analyze frauds in online transactions. The main aim of the paper is to design and develop a novel fraud detection method for streaming transaction data, with the objective of analyzing the past transaction details of the customers and extracting the behavioral patterns. Cardholders are clustered into different groups based on their transaction amounts. Then, using a sliding-window strategy, the transactions made by the cardholders in each group are aggregated so that the behavioral pattern of each group can be extracted. Later, different classifiers are trained over the groups separately, and the classifier with the better rating score is chosen as one of the best methods to predict frauds.