Credit Card Fraud Detection using
Machine Learning and Data Science
In credit card transactions, fraud occurs when someone other than the account owner uses a card without authorization or permission.
Fraud detection involves monitoring the activities of populations of users in order to estimate, detect, or avoid objectionable behaviour such as fraud, intrusion, and defaulting.
This is a highly pertinent problem for fields like data science and machine learning to address, since these fields can automate its solution.
From a learning standpoint, this problem is especially difficult because it is characterized by several factors, chief among them class imbalance: there are far more legitimate transactions than fraudulent ones. Furthermore, the statistical properties of transaction patterns frequently change over time.
However, these are not the only difficulties in putting a fraud detection system into practice. In real-world scenarios, automated systems must swiftly sort through an enormous volume of payment requests to decide which ones to approve.
Machine learning algorithms are used to analyze all approved transactions and flag any that seem suspicious. Experts investigate these alerts and contact the cardholders to verify whether or not the transaction was fraudulent.
The investigators' feedback is returned to the automated system, where it is used to train and update the algorithm, gradually improving fraud-detection performance over time.
Fraud detection techniques are continuously improved to keep pace with criminals who adapt their fraudulent tactics. These frauds are commonly categorized as:
• Offline and online credit card fraud
• Card theft
• Device intrusion
• Application fraud
• Counterfeit cards
• Account bankruptcy
• Telecommunication fraud
The following are a few methods currently in use to identify this kind of fraud:
• Artificial Neural Networks
• Fuzzy Logic
• Bayesian Networks
• Decision Trees
• Genetic Algorithms
• Logistic Regression
• Support Vector Machines
• Hidden Markov Models
• K-Nearest Neighbors
II. REVIEW OF LITERATURE
Fraud is defined as the illegal or criminal use of deception for personal or financial gain. It is an intentional violation of a law, regulation, or policy, committed with the aim of obtaining unauthorized financial benefit.
A wealth of literature on anomaly and fraud detection in this field has already been published and is publicly accessible. A comprehensive survey by Clifton Phua and his colleagues covers techniques used in this area, including adversarial detection, automated fraud detection, and data mining applications. In another paper, Suman, a research scholar at GJUS&T, Hisar, described supervised and unsupervised learning methods for detecting credit card fraud. Although these techniques and algorithms achieved unexpectedly good results in certain domains, they fell short in others.
Wen-Fang Yu and Na Wang presented a related study in which they employed outlier mining, outlier detection mining, and distance-sum algorithms to accurately forecast fraudulent transactions in an emulation experiment on a credit card transaction data set from a commercial bank. Outlier mining is a branch of data mining used primarily in the financial and internet domains; its task is to identify objects, here fraudulent transactions, that are detached from the main system. Taking attributes of their customers' behaviour and computing the values of those attributes, they measured the distance between an attribute's observed value and its predefined value.
Unconventional methods, like hybrid data mining and complex network classification
algorithms, have shown promise in identifying illicit instances within real card transaction
data sets. These methods are based on network reconstruction algorithms, which enable the
creation of representations of an instance's deviation from a reference group. These
techniques have been effective, on average, with medium-sized online transactions.
Attempts have also been made to advance the field from an entirely new angle, by improving the alert-feedback interaction in the event of a fraudulent transaction. Whenever a fraudulent transaction is detected, the authorized system is notified and feedback is sent to cancel the ongoing transaction.
One method that provided fresh insight into this area was the Artificial Genetic Algorithm, which tackled fraud from a different angle.
It proved accurate in identifying fraudulent transactions and in reducing the number of false alarms, but it came with a classification problem of variable misclassification costs.
III. METHODS
The method proposed in this study detects unusual activity, i.e., outliers, using recent machine learning techniques.
The following figure represents the basic rough architecture diagram:
Upon closer inspection, with real-world components added, the complete architecture diagram can be represented as follows:
Initially, our dataset was acquired from Kaggle, a website that hosts datasets for data analysis. The dataset contains 31 columns, 28 of which are labeled V1-V28 to protect sensitive information.
The remaining columns are Time, Amount, and Class. Time indicates the seconds elapsed between the first transaction and each subsequent one. Amount is the amount of money transacted. Class 0 denotes a legitimate transaction, while Class 1 denotes a fraudulent one.
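As a brief illustration, the dataset can be loaded and inspected with pandas; the file name creditcard.csv reflects the standard Kaggle download and is an assumption here:

import pandas as pd

# Load the Kaggle credit card fraud dataset; "creditcard.csv" is the
# file name of the standard Kaggle download (an assumption).
df = pd.read_csv("creditcard.csv")

print(df.shape)                    # expected: (284807, 31)
print(df["Class"].value_counts())  # 0 = legitimate, 1 = fraudulent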
We plot several graphs to visually inspect the values in the dataset and check for inconsistencies. This ensures that the machine learning algorithms can process the dataset without any missing-value imputation.
Following this analysis, we create a heatmap to visualize the data in color and examine the relationship between the class variable and our predictor variables; the resulting heatmap is shown in the corresponding figure.
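A minimal sketch of this step, assuming the DataFrame df loaded above; seaborn is one common choice for drawing such a heatmap, not necessarily the library behind the original figure:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over all columns, including Class, drawn as a heatmap.
plt.figure(figsize=(12, 9))
sns.heatmap(df.corr(), vmax=0.8, square=True)
plt.show()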
The dataset is then processed and made into a usable format. To keep the evaluation fair, the Class column is separated out and the Time and Amount columns are standardized. The data is then processed by the algorithms of a collection of modules, whose interaction is illustrated in the module diagram that follows. Once the data has been fitted into a model, the following outlier detection modules are applied to it:
• Isolation Forest
• Local Outlier Factor
These algorithms are part of the sklearn (scikit-learn) library. Isolation Forest is found in sklearn's ensemble module, which contains ensemble-based methods for classification, regression, and outlier detection, while Local Outlier Factor is found in the neighbors module.
Built on the NumPy, SciPy, and matplotlib modules, this free and open-source Python library offers a variety of simple and efficient tools for data analysis.
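A rough sketch of this preprocessing, assuming the df from earlier; the study does not specify the exact scaler, so StandardScaler is an assumption:

from sklearn.preprocessing import StandardScaler

# Standardize Time and Amount so they are on a scale comparable to V1-V28.
scaler = StandardScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])

# Separate the predictors from the Class label.
X = df.drop(columns=["Class"])
y = df["Class"]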
This graph demonstrates how few fraudulent transactions there are compared to legitimate ones.
This graph displays the times at which transactions were completed over a span of two days. It is evident that the most transactions were made during the day and the fewest during the night.
This graph shows the amounts transacted. The majority of transactions are quite small, and fewer than 5% approach the maximum amount transacted.
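For reference, a sketch that would produce plots of this kind, assuming the raw (not yet standardized) df:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(8, 10))
df["Class"].value_counts().plot(kind="bar", ax=axes[0], title="Class distribution")
df["Time"].plot(kind="hist", bins=48, ax=axes[1], title="Transaction times")
df["Amount"].plot(kind="hist", bins=50, ax=axes[2], title="Transaction amounts")
plt.tight_layout()
plt.show()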
Once the dataset has been examined, we plot a histogram for every column. This provides a graphical representation of the dataset and is used to confirm that it contains no missing values, so the machine learning algorithms can process it without imputation.
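A minimal sketch of this check, again assuming the df loaded earlier:

import matplotlib.pyplot as plt

# One histogram per column, plus an explicit count of missing values.
df.hist(figsize=(20, 20), bins=50)
plt.show()

print(df.isnull().sum().max())  # expected 0: nothing to impute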
We have developed a Python program on the Jupyter Notebook platform to illustrate the methodology proposed in this work. This program can also be run in the cloud on the Google Colab platform, which supports all Python notebook files.
The following modules are explained in detail, along with pseudocodes for their algorithms
and output graphs:
A. Local Outlier Factor
This is an unsupervised outlier detection algorithm. The "Local Outlier Factor" is the anomaly score of each sample; it measures the local deviation of a sample with respect to its neighbors.
More precisely, locality is given by the k-nearest neighbors, whose distances are used to estimate the local density.
This algorithm can be expressed in code roughly as follows.
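The following is a minimal sketch using scikit-learn's LocalOutlierFactor, assuming the X and y prepared earlier; n_neighbors=20 and the contamination estimate are illustrative choices, not values fixed by this study.

from sklearn.neighbors import LocalOutlierFactor

# Expected fraction of outliers, estimated from the labels.
outlier_fraction = y.sum() / len(y)

lof = LocalOutlierFactor(n_neighbors=20, contamination=outlier_fraction)

# fit_predict returns +1 for inliers and -1 for outliers;
# map this to the dataset's 0/1 labels.
y_pred = (lof.fit_predict(X) == -1).astype(int)

print("Misclassified transactions:", (y_pred != y).sum())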
On plotting the results of the Local Outlier Factor algorithm, we get the following figure:
Once anomalies are detected, the system can be used to report them to the relevant authorities. For testing purposes, we compare the results of these algorithms to assess the precision and accuracy of the system.
B. Isolation Forest Algorithm
By choosing a feature at random and then selecting a split value between the maximum and
minimum values of the chosen feature, the Isolation Forest "isolates" observations.
Recursive partitioning can be illustrated by a tree, in which the path length from the root node to the terminating node represents the number of splits needed to isolate a sample.
The mean of this path length over a forest of such random trees is a measure of normality and serves as the decision function. Random partitioning produces noticeably shorter paths for anomalies; hence, when a forest of random trees collectively yields shorter path lengths for particular samples, those samples are very likely to be anomalies.
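A corresponding sketch with scikit-learn's IsolationForest, under the same assumptions (the X, y, and outlier_fraction from the sketches above; the hyperparameters are illustrative):

from sklearn.ensemble import IsolationForest

iso = IsolationForest(n_estimators=100,
                      contamination=outlier_fraction,
                      random_state=42)
iso.fit(X)

# predict returns +1 for inliers and -1 for outliers, as above.
y_pred = (iso.predict(X) == -1).astype(int)

print("Misclassified transactions:", (y_pred != y).sum())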
On plotting the results of the Isolation Forest algorithm, we get the following figure:
Results when the complete dataset is used:
IV. IMPLEMENTATION
This concept is challenging to put into practice because it requires the cooperation of banks, which are reluctant to share information owing to market competition, legal concerns, and the protection of user data. We therefore searched for reference papers that applied comparable methodologies and reported findings. According to one of these cited papers:
This technique was applied to a complete application data set supplied by a German bank in 2006. Owing to banking confidentiality, only a summary of the results is presented below. After the method was applied, the level 1 list contained a small number of cases with a high probability of being fraudulent, and the cards of every person on that list were closed.
V. RESULTS
The code prints the number of false positives it detected and compares it with the actual values, and this is used to calculate the precision and accuracy score of the algorithms.
For expedited testing, we used only 10% of the full dataset; the complete dataset is then used as well, and both sets of results are printed.
The output shows these results together with the classification report for each algorithm, where class 0 denotes a legitimate transaction and class 1 a fraudulent one. This result was checked against the class values to rule out false positives.
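This evaluation can be sketched with scikit-learn's metrics, assuming the y and y_pred produced above; a 10% subsample for the expedited test could be drawn first with df.sample(frac=0.1):

from sklearn.metrics import accuracy_score, classification_report

# Accuracy plus the per-class precision/recall report described above;
# class 0 is legitimate, class 1 fraudulent.
print("Accuracy:", accuracy_score(y, y_pred))
print(classification_report(y, y_pred))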
Results when 10% of the dataset is used:
Results when the complete dataset is used:
VII. FUTURE ENHANCEMENTS
Although we did not reach our goal of 100% accuracy in fraud detection, we did build a system that, given enough time and data, can come very close to it. As with any project of this kind, there is room for improvement.
The project is designed so that several algorithms can be integrated as modules and their results combined to increase the accuracy of the final output.
Adding more algorithms will improve the model further; these algorithms must, however, produce output in the same format as the others. Once that requirement is met, the modules are simple to add, as the code demonstrates.
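A hypothetical sketch of that modular contract, reusing the iso and lof models built earlier: each module is a name mapped to a callable that must return 0/1 labels, so a new algorithm is added simply by extending the dictionary.

# Hypothetical module registry; every entry must emit 0/1 labels.
classifiers = {
    "Isolation Forest": lambda X: (iso.fit(X).predict(X) == -1).astype(int),
    "Local Outlier Factor": lambda X: (lof.fit_predict(X) == -1).astype(int),
    # "New Algorithm": a callable returning 0/1 labels like the others.
}

for name, predict in classifiers.items():
    y_pred = predict(X)
    print(name, "- misclassified:", (y_pred != y).sum())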
REFERENCES
[1] John Richard D. Kho and Larry A. Vea, "Credit Card Fraud Detection Based on Transaction Behavior," in Proceedings of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-7, 2017.
[2] Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler, "A Comprehensive Survey of Data Mining-based Fraud Detection Research," School of Business Systems, Faculty of Information Technology, Monash University, Wellington Road, Clayton, Victoria 3800, Australia.
[3] Suman, Research Scholar, GJUS&T Hisar, HCE Sonepat, "Survey Paper on Credit Card Fraud Detection," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3, Issue 3, March 2014.
[4] Wen-Fang Yu and Na Wang, "Research on Credit Card Fraud Detection Model Based on Distance Sum," 2009.
Python
Python is a high-level, object-oriented, interpreted programming language. It was created by Guido van Rossum in the late 1980s and first released in 1991.
Python's syntax is close to the English language, allowing developers to write programs in fewer lines than many other programming languages require. Because Python is interpreter-based, code can be executed as soon as it is written, so prototyping can be done quickly.
Characteristics of Python:
The following are important characteristics of Python programming:
• Python is a dynamic, high-level, free, open-source, and interpreted programming language.
• It supports object-oriented programming as well as procedural programming.
• It can be used as a scripting language or can be compiled to byte-code for building large
applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.