Credit Card Fraud Lessons
Keywords: Incremental learning, Unbalanced data, Fraud detection

Abstract

Billions of dollars of loss are caused every year by fraudulent credit card transactions. The design of efficient fraud detection algorithms is key to reducing these losses, and a growing number of algorithms rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is however particularly challenging due to the non-stationary distribution of the data, the highly imbalanced class distributions and the continuous streams of transactions. At the same time, public data are scarcely available for confidentiality issues, leaving unanswered many questions about what is the best strategy to deal with them. In this paper we provide some answers from the practitioner's perspective by focusing on three crucial issues: unbalancedness, non-stationarity and assessment. The analysis is made possible by a real credit card dataset provided by our industrial partner.
perspective. These are just some of the potential questions that can arise during the design of a detection system. We do not claim to be able to give a definite answer to the problem, but we hope that our work serves as a guideline for others in the field. Our goal is to show what worked and what did not in a real case study. In this paper we give a formalisation of the learning problem in the context of credit card fraud detection. We present a way to create new features in the datasets that can trace the card holder spending habits. By doing this it is possible to present the transactions to the learning algorithm without providing the card holder identifier. We then argue that traditional classification metrics are not suited for a detection task and present existing alternative measures.

We propose and compare three approaches for online learning in order to identify what is important to retain or to forget in a changing and non-stationary environment. We show the impact of the rebalancing technique on the final performance when the class distribution is skewed. In doing this we merge techniques developed for unbalanced static datasets with online learning strategies. The resulting frameworks are able to deal with unbalanced and evolving data streams. All the results are obtained by experimentation on a dataset of real credit card transactions provided by our industrial partner.
3. State of the art in credit card fraud detection

Credit card fraud detection is one of the most explored domains of fraud detection (Chan, Fan, Prodromidis, & Stolfo, 1999; Bolton & Hand, 2001; Brause, Langsdorf, & Hepp, 1999) and relies on the automatic analysis of recorded transactions to detect fraudulent behaviour. Every time a credit card is used, transaction data, composed of a number of attributes (e.g. credit card identifier, transaction date, recipient, amount of the transaction), are stored in the databases of the service provider.

However, the information of a single transaction is typically not sufficient to detect a fraud occurrence (Bolton & Hand, 2001) and the analysis has to consider aggregate measures like the total amount spent per day, the number of transactions per week or the average amount of a transaction (Whitrow, Hand, Juszczak, Weston, & Adams, 2009).
3.1. Supervised versus unsupervised detection

In the fraud detection literature we encounter both supervised techniques, which make use of the class of the transaction (e.g. genuine or fraudulent), and unsupervised techniques. Supervised methods assume that the labels of past transactions are available and reliable, but they are often limited to recognising fraud patterns that have already occurred (Bolton & Hand, 2002). On the other hand, unsupervised methods do not use the class of the transactions and are capable of detecting new fraudulent behaviours (Bolton & Hand, 2001). Clustering based methods (Quah & Sriganesh, 2008; Weston, Hand, Adams, Whitrow, & Juszczak, 2008) form customer profiles to identify new hidden fraud patterns.

The focus of this paper will be on supervised methods. In the literature several supervised methods have been applied to fraud detection, such as neural networks (Dorronsoro, Ginel, Sánchez, & Cruz, 1997), rule-based methods (BAYES, Clark & Niblett, 1989; RIPPER, Cohen, 1995) and tree-based algorithms (C4.5, Quinlan, 1993; CART, Olshen & Stone, 1984). It is well known however that an open issue is how to manage unbalanced class sizes, since the legitimate transactions generally far outnumber the fraudulent ones.
3.2. Unbalanced problem

Learning from unbalanced datasets is a difficult task since most learning systems are not designed to cope with a large difference between the number of cases belonging to each class (Batista, Carvalho, & Monard, 2000). In the literature, traditional methods for classification with unbalanced datasets rely on sampling techniques to balance the dataset (Japkowicz & Stephen, 2002).

In particular we can distinguish between methods that operate at the data level and at the algorithmic level (Chawla, Japkowicz, & Kotcz, 2004). At the data level, balancing techniques are used as a pre-processing step to rebalance the dataset or to remove the noise between the two classes, before any algorithm is applied. At the algorithmic level, the classification algorithms themselves are adapted to deal with the minority class detection. In this article we focus on data level techniques, as they have the advantage of leaving the algorithms unchanged.

Sampling techniques do not take into consideration any specific information when removing or adding observations from one class, yet they are easy to implement and to understand. Undersampling (Drummond & Holte, 2003) consists in down-sizing the majority class by removing observations at random until the dataset is balanced.

SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2011) oversamples the minority class by generating synthetic minority examples in the neighbourhood of the observed ones. The idea is to form new minority examples by interpolating between examples of the same class. This has the effect of creating clusters around each minority observation.

Ensemble methods combine balancing techniques with a classifier to explore the majority and minority class distributions. EasyEnsemble is claimed in Liu, Wu, and Zhou (2009) to be a better alternative to undersampling. This method learns different aspects of the original majority class in an unsupervised manner. This is done by creating different balanced training sets by undersampling, learning a model for each dataset and then combining all predictions.
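As an illustration of these data-level techniques, the sketch below implements random undersampling and a SMOTE-like interpolation between minority neighbours. It is a minimal sketch, not the implementation used in our experiments; the column name Class (1 = fraud, 0 = genuine) and the purely numeric minority features are assumptions of the example.

```r
# Minimal sketch of the two data-level balancing techniques discussed above.
# Assumes a data frame with numeric features and a column 'Class' (1 = fraud).

undersample <- function(data, class_col = "Class") {
  frauds   <- data[data[[class_col]] == 1, ]
  genuines <- data[data[[class_col]] == 0, ]
  # keep all frauds and an equally sized random subset of genuine transactions
  rbind(frauds, genuines[sample(nrow(genuines), nrow(frauds)), ])
}

smote_like <- function(minority, k = 5, n_new = nrow(minority)) {
  # interpolate between a minority example and one of its k nearest neighbours
  X <- as.matrix(minority)
  D <- as.matrix(dist(X))                     # pairwise Euclidean distances
  synth <- t(sapply(seq_len(n_new), function(i) {
    a   <- sample(nrow(X), 1)
    nn  <- order(D[a, ])[2:(k + 1)]           # skip the point itself (rank 1)
    b   <- nn[sample(length(nn), 1)]
    gap <- runif(1)
    X[a, ] + gap * (X[b, ] - X[a, ])          # synthetic point on the segment a-b
  }))
  as.data.frame(synth)
}
```

Undersampling discards information from the majority class, while the SMOTE-like generator adds synthetic minority points; the practical impact of such choices is compared in Section 7.2.4.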
3.3. Incremental learning

Static learning is the classical learning setting where the data are processed all at once in a single learning batch. Incremental learning instead interprets the data as a continuous stream and processes each new instance "on arrival" (Oza, 2005). In this context it is important to preserve the previously acquired knowledge as well as to update it properly in front of new observations. In incremental learning the data arrive in chunks where the underlying data generation function may change, while static learning deals with a single dataset. The problem of learning in the case of unbalanced data has been widely explored in the static learning setting (Chawla et al., 2011; Drummond & Holte, 2003; Japkowicz & Stephen, 2002; Liu et al., 2009). Learning from a non-stationary data stream with a skewed class distribution is however a relatively recent domain.

In the incremental setting, when the data distribution changes, it is important to learn from new observations while retaining existing knowledge from past observations. Concepts learnt in the past may re-occur in the future, and new concepts may appear in the data stream. This is known as the stability-plasticity dilemma (Grossberg, 1988). A classifier is required to be able to respond to changes in the data distribution, while ensuring that it still retains relevant past knowledge. Many of the proposed techniques (Chen, He, Li, & Desai, 2010; Polikar, Upda, Upda, & Honavar, 2001; Street & Kim, 2001) use ensemble classifiers in order to combine what is learnt from new observations and the knowledge acquired before. As fraud evolves over time, the learning framework has to adapt to the new distribution. The classifier should be able
to learn from new fraud distributions and "forget" outdated knowledge. It then becomes critical to set the rate of forgetting in order to match the rate of change in the distribution (Kuncheva, 2004). The simplest strategy uses a constant forgetting rate, which boils down to considering a fixed window of recent observations to retrain the model. The FLORA approach (Widmer & Kubat, 1996) uses a variable forgetting rate, where the window is shrunk if a change is detected and expanded otherwise. The evolution of a class concept is called in the literature concept drift.
Gao, Fan, Han, and Philip (2007) propose to store all previous minority class examples in the current training data set to make it less unbalanced and then to combine the models into an ensemble of classifiers. SERA (Chen et al., 2009) and REA (Chen & He, 2011) selectively accumulate old minority class observations to rebalance the training chunk. They propose two different methods (Mahalanobis distance and K nearest neighbours) in order to select the most relevant minority instances to include in the current chunk from the set of old minority instances.

These methods consist in oversampling the minority class of the current chunk by retaining old positive observations. The accumulation of previous minority class examples is of limited volume due to the skewed class distribution, therefore oversampling does not increase the chunk size by much.
4. Formalization of the learning problem

In this section we formalize the credit card fraud detection task as a statistical learning problem. Let X_{ij} be transaction number j of card number i. We assume that the transactions are ordered in time, such that if X_{iv} occurs before X_{iw} then v < w. For each transaction some basic information is available, such as the amount of the expenditure, the shop where it was performed, the currency, etc. However these variables do not provide any information about the normal card usage. The normal behaviour of a card can be measured by using a set of historical transactions from the same card. For example, we can get an idea of the card holder's spending habits by looking at the average amount spent in different merchant categories (e.g. restaurant, online shopping, gas station, etc.) in the last 3 months preceding the transaction. Let X_{ik} be a new transaction and let dt(X_{ik}) be the corresponding transaction date-time in the dataset. Let T denote the time-frame of a set of historical transactions for the same card. X^H_{ik} is then the set of historical transactions occurring in the time-frame T before X_{ik}, i.e.

X^H_{ik} = \{ X_{ij} : dt(X_{ij}) \neq dt(X_{ik}) \ \text{and} \ dt(X_{ik}) > dt(X_{ij}) \geq dt(X_{ik}) - T \}.

For instance, with T = 90 days, X^H_{ik} is the set of transactions for the same card occurring in the 3 months preceding dt(X_{ik}).
The card behaviour can be summarised using classical aggregation methods (e.g. mean, max, min or count) on the set X^H_{ik}. This means that it is possible to create new aggregated variables that can be added to the original transaction variables to include information about the card. In this way we have included information about the user behaviour at the transaction level and we no longer need to consider the card ID. Transactions from card holders with similar spending habits will share similar aggregate variables.
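A minimal sketch of this aggregation step is given below: for every transaction it computes the count, mean and max of the amounts spent on the same card during the previous T = 90 days. The column names (card_id, datetime as POSIXct, amount) are assumptions of the example, and the loop is written for clarity rather than speed.

```r
# Sketch of the card-level aggregation over the historical set X^H_ik.
# Assumed columns: card_id, datetime (POSIXct), amount.

aggregate_history <- function(tx, T_days = 90) {
  feats <- lapply(seq_len(nrow(tx)), function(k) {
    cur  <- tx[k, ]
    hist <- tx[tx$card_id == cur$card_id &
               tx$datetime <  cur$datetime &
               tx$datetime >= cur$datetime - T_days * 86400, ]   # last T days
    data.frame(hist_count = nrow(hist),
               hist_mean  = if (nrow(hist) > 0) mean(hist$amount) else 0,
               hist_max   = if (nrow(hist) > 0) max(hist$amount)  else 0)
  })
  cbind(tx, do.call(rbind, feats))   # card_id can then be dropped before learning
}
```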
Let {X} be the new set of transactions with aggregated variables. Each transaction X_j is assigned a binary status Y_j, where Y_j = 1 when transaction j is fraudulent and Y_j = 0 otherwise. The goal of a detection system is to learn P(Y|X) and predict the class of a new transaction, Ŷ_N ∈ {0, 1}. Note that here the focus is not on classifying the card holder, but the transaction, as fraudulent or genuine. Changes in the cardholders' behaviour (e.g. purchases on the Internet) affect the distribution P(X). At the same time the evolution of the types of frauds affects the class conditional probability distribution P(Y|X). As a result the joint distribution P(X, Y) is not stationary: this is known as concept-drift (Hoens, Polikar, & Chawla, 2012). Note that Gao et al. (2007) suggest that even when the concept drift is not detected, there is still a benefit in updating the models.

5. Performance measure

Fraud detection must deal with the following challenges: (i) timeliness of decision (a card should be blocked as soon as it is found to be victim of fraud; a quick reaction to the appearance of the first fraud can prevent other frauds), (ii) unbalanced class sizes (the number of frauds is relatively small compared to genuine transactions) and (iii) the cost structure of the problem (the cost of a fraud is not easy to define). The cost of a fraud is often assumed to be equal to the transaction amount (Elkan, 2001). However, frauds of small and big amounts must be treated with equal importance. A fraudulent activity is usually tested with a small amount and then, if successful, replicated with bigger amounts. The cost should also include the time taken by the detection system to react. The shorter the reaction time, the larger the number of frauds that it is possible to prevent. Depending on the fraud risk assigned by the detection system to the transaction, the following can happen: (i) transaction accepted, (ii) transaction refused, (iii) card blocked. Usually the card is blocked only in the few cases where there is a high risk of fraud (well known fraudulent patterns with high accuracy, e.g. 99% correct). When a transaction is refused, the investigators make a phone call to the card holder to verify whether it is a false alert or a real fraud. The cost of a false alert can then be considered equivalent to the cost of the phone call, which is negligible compared to the loss that occurs in case of a fraud. However, when the number of false alerts is too big or the card is blocked by error, the impossibility to make transactions can translate into big losses for the customer. For all these reasons, defining a cost measure is a challenging problem in credit card fraud detection.

The fraud problem can be seen as a binary classification and detection problem.

5.1. Classification

In a classification problem an algorithm is assessed on its accuracy to predict the correct classes of new unseen observations. Let {Y_0} be the set of genuine transactions, {Y_1} the set of fraudulent transactions, {Ŷ_0} the set of transactions predicted as genuine and {Ŷ_1} the set of transactions predicted as fraudulent. For a binary classification problem it is conventional to define a confusion matrix (Table 1).

Table 1. Confusion matrix.

In an unbalanced class problem, it is well known that quantities like TPR (TP/(TP+FN)), TNR (TN/(FP+TN)) and Accuracy ((TP+TN)/(TP+FN+FP+TN)) are misleading assessment measures (Provost, 2000). The Balanced Error Rate (0.5 · FP/(TN+FP) + 0.5 · FN/(FN+TP)) may be inappropriate too because of the different misclassification costs of false negatives and false positives.

A well accepted measure for unbalanced datasets is the AUC (area under the ROC curve) (Chawla, 2005). This metric gives a measure
of how much the ROC curve is close to the point of perfect classification. Hand (2009) considers the calculation of the area under the ROC curve as inappropriate, since this translates into making an average of the misclassification costs of the two classes. An alternative way of estimating the AUC is based on the use of the Mann-Whitney statistic and consists in ranking the observations by their fraud probability and measuring the probability that a random minority class example ranks higher than a random majority class example (Bamber, 1975). By using the rank-based formulation of the AUC we can avoid setting different probability thresholds to generate the ROC curve and avoid the problem raised by Hand. Let n_0 = |Y_0| be the number of genuine transactions and n_1 = |Y_1| be the number of fraudulent transactions. Let g_i = p̂_0(x_{i0}) be the estimated probability of belonging to the genuine class for the i-th transaction in {Y_0}, for i = 1, ..., n_0. Define f_i = p̂_0(x_{i1}) similarly for the n_1 fraudulent transactions. Then {g_1, ..., g_{n_0}} and {f_1, ..., f_{n_1}} are samples from the g and f distributions. Rank the combined set of values g_1, ..., g_{n_0}, f_1, ..., f_{n_1} in increasing order and let q_i be the rank of the i-th genuine transaction. There are (q_i - i) fraudulent transactions with estimated probabilities of belonging to class 0 which are smaller than that of the i-th genuine transaction (Hand & Till, 2001). Summing over the 0 class, we see that the total number of pairs of transactions, one from class 0 and one from class 1, in which the fraudulent transaction has a smaller estimated probability of belonging to class 0 than does the genuine transaction, is

\sum_{i=1}^{n_0} (q_i - i) = \sum_{i=1}^{n_0} q_i - \sum_{i=1}^{n_0} i = R_0 - n_0 (n_0 + 1)/2

where R_0 = \sum_{i=1}^{n_0} q_i. Since there are n_0 n_1 such pairs of transactions altogether, our estimate of the probability that a randomly chosen fraudulent transaction has a lower estimated probability of belonging to class 0 than a randomly chosen genuine transaction is

\hat{A} = \frac{R_0 - n_0 (n_0 + 1)/2}{n_0 n_1}

\hat{A} gives an estimation of the AUC that avoids the errors introduced by smoothing procedures in the ROC curve and that is threshold-free (Hand & Till, 2001).
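The estimate \hat{A} is straightforward to compute from the ranks; the sketch below is a direct transcription of the formula above (the argument names p0, the estimated probability of the genuine class, and y, the 0/1 label, are assumptions of the example).

```r
# Rank-based AUC estimate (Hand & Till, 2001): A = (R0 - n0(n0+1)/2) / (n0*n1).
auc_rank <- function(p0, y) {          # p0: estimated P(class 0), y: 0/1 labels
  n0 <- sum(y == 0)
  n1 <- sum(y == 1)
  q  <- rank(p0)                       # ranks of the combined scores g and f
  R0 <- sum(q[y == 0])                 # sum of the ranks of the genuine transactions
  (R0 - n0 * (n0 + 1) / 2) / (n0 * n1)
}
```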
5.2. Detection

The performance of a detection task (like fraud detection) is not necessarily well described in terms of classification (Fan & Zhu, 2011). In a detection problem what matters most is whether the algorithm can rank the few useful items (e.g. frauds) ahead of the rest. In a scenario with limited resources, fraud investigators cannot revise all the transactions marked as fraudulent by a classification algorithm. They have to put their effort into investigating the transactions with the highest risk of fraud, which means that the detection system is asked to return the transactions ranked by their posterior fraud probability. The goal then is not only to predict each class accurately, but to return a correct ranking of the minority class.

In this context a good detection algorithm should be able to give a high rank to relevant items (frauds) and a low score to non-relevant ones. Fan and Zhu (2011) consider the average precision (AP) as the correct measure for a detection task. Let p be the number of positive (fraud) cases in the original dataset. Out of the t% top-ranked candidates, suppose h(t) are truly positive (h(t) ≤ t). We can then define recall as R(t) = h(t)/p and precision as P(t) = h(t)/t. Then P(t_r) and R(t_r) are the precision and recall of the r-th ranked observation. The formula for calculating the average precision is:

AP = \sum_{r=1}^{N} P(t_r) \, \Delta R(t_r)

where ΔR(t_r) = R(t_r) - R(t_{r-1}) and N is the total number of observations in the dataset. From the definition of R(t_r) we have:

\Delta R(t_r) = \frac{h(t_r) - h(t_{r-1})}{p} =
\begin{cases}
1/p & \text{if the } r\text{th ranked transaction is fraudulent} \\
0 & \text{if the } r\text{th ranked transaction is genuine}
\end{cases}

An algorithm "A" is superior to an algorithm "B" only if it detects the frauds before algorithm "B". The better the rank, the greater the AP. The optimal algorithm, which ranks all the frauds ahead of the legitimate transactions, has an average precision of 1.

In detection teams like the one of our industrial partner, each time a fraud alert is generated by the detection system, it has to be checked by investigators before proceeding with actions (e.g. customer contact or card stop). Given the limited number of investigators, it is possible to verify only a limited number of alerts. Therefore it is crucial to have the best ranking within the maximum number a of alerts that they can investigate. In this setting it is important to have the highest Precision within the first a alerts.

In the following we will denote as PrecisionRank the Precision within the a observations with the highest rank.
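Both AP and PrecisionRank can be obtained directly from the ranking induced by the fraud scores, as in the sketch below (the argument names score and y and the default value of alpha are assumptions of the example, not the settings of our detection system).

```r
# Average precision and PrecisionRank from a fraud-probability ranking.
detection_metrics <- function(score, y, alpha = 100) {
  ord  <- order(score, decreasing = TRUE)   # rank transactions by fraud probability
  hits <- y[ord] == 1                       # TRUE where the r-th ranked item is a fraud
  prec <- cumsum(hits) / seq_along(hits)    # P(t_r) for r = 1..N
  ap   <- sum(prec[hits]) / sum(y == 1)     # Delta R(t_r) = 1/p only at the frauds
  list(AP = ap, PrecisionRank = mean(hits[seq_len(alpha)]))  # precision in top-alpha alerts
}
```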
6. Strategies for incremental learning with unbalanced fraud data

The most conventional way to deal with sequential fraud data is to adopt a static approach (Fig. 1), which creates a classification model once in a while and uses it as a predictor over a long horizon. Though this approach reduces the learning effort, its main problem resides in its lack of adaptivity, which makes it insensitive to any change of distribution in the upcoming chunks.

On the basis of the state of the art described in Section 3.3, it is possible to conceive two alternative strategies to address both the incremental and the unbalanced nature of the fraud detection problem.

The first approach, denoted as the updating approach and illustrated in Fig. 2, is inspired by Wang, Fan, Yu, and Han (2003). It uses a set of M models and a number K of chunks to train each model. Note that for M > 1 and K > 1 the training sets of the M models are overlapping. This approach adapts to the changing environment by forgetting chunks at a constant rate. The last M models are stored and used in a weighted ensemble of models E_M. Let PrecisionRank_m denote the predictive accuracy measured in terms of PrecisionRank on the last (testing) chunk of the m-th model. The ensemble E_M is defined as the linear combination of all the M models h_m:

E_M = \sum_{m=1}^{M} w_m h_m

where

w_m = \frac{\mathrm{PrecisionRank}_m - \mathrm{PrecisionRank}_{\min}}{\mathrm{PrecisionRank}_{\max} - \mathrm{PrecisionRank}_{\min}}, \qquad
\mathrm{PrecisionRank}_{\min} = \min_{m \in M} \mathrm{PrecisionRank}_m, \qquad
\mathrm{PrecisionRank}_{\max} = \max_{m \in M} \mathrm{PrecisionRank}_m
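A sketch of this weighted ensemble is given below: each of the last M models is weighted by the min-max normalised PrecisionRank it obtained on its test chunk. The list models, the vector pr and the use of predict(..., type = "prob") with a class level named "1" are assumptions of the example (they match the randomForest interface used later in Section 7, not necessarily our exact code).

```r
# Weighted ensemble E_M = sum_m w_m h_m of the updating approach.
ensemble_score <- function(models, pr, newdata) {
  w <- (pr - min(pr)) / (max(pr) - min(pr))            # w_m in [0, 1]
  preds <- sapply(models, function(h)
    predict(h, newdata, type = "prob")[, "1"])         # fraud probability of each h_m
  as.vector(preds %*% w)                               # ensemble score used for ranking
}
```

When all M models obtain the same PrecisionRank the weights degenerate; a uniform weighting can be used as a fallback in that case.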
Fig. 1. Static approach: a model is trained on K = 3 chunks and used to predict future chunks.

The second approach, denoted as the forgetting genuine approach and illustrated in Fig. 3, is inspired by Gao et al.'s work. In order to mitigate the unbalanced effects, each time a new chunk is available, a model is learned on the genuine transactions of the previous K_gen chunks and all past fraudulent transactions. Since this approach leads to training sets which grow in size over time, a maximum training size is set to avoid overloading. Once this size
is reached, older observations are removed in favor of the more recent ones. An ensemble of models is obtained by combining the last M models as in the update approach.
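The following sketch assembles one training set of the forgetting genuine approach: all frauds observed so far plus the genuine transactions of the last K_gen chunks, capped at a maximum size. The list chunks (data frames with Class and datetime columns) and the default values are assumptions of the example.

```r
# Training set of the forgetting genuine approach: all past frauds +
# genuine transactions of the last K_gen chunks, capped at max_size rows.
build_forget_trainset <- function(chunks, k_gen = 30, max_size = 5e5) {
  frauds   <- do.call(rbind, lapply(chunks, function(ch) ch[ch$Class == 1, ]))
  genuines <- do.call(rbind, lapply(tail(chunks, k_gen),
                                    function(ch) ch[ch$Class == 0, ]))
  train <- rbind(frauds, genuines)
  train <- train[order(train$datetime), ]     # oldest observations first
  if (nrow(train) > max_size)
    train <- tail(train, max_size)            # drop the oldest rows when too large
  train                                       # rebalance afterwards (Section 3.2)
}
```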
Note that in all these approaches (including the static one), a balancing technique (Section 3.2) can be used to reduce the skewness of the training set (Fig. 4).

In Table 2 we have summarised the strengths and weaknesses of the incremental approaches presented. The Static strategy has the advantage of being fast, as the training of the model is done only once, but it does not return a model that follows the changes in the distribution of the data. The other two approaches, on the contrary, can adapt to concept drift. They differ essentially in the way the minority class is accumulated in the training chunks. The Forget strategy propagates instances between chunks, leading to bigger training sets and a higher computational burden.

7. Experimental assessment

In this section we perform an extensive experimental assessment on the basis of real data (Section 7.1) in order to address common issues that the practitioner has to solve when facing large credit card fraud datasets (Section 7.2).

7.1. Dataset

7.2. Learned lessons

Our experimental analysis allows us to provide some answers to the most common questions of credit card fraud detection. The questions and the answers based on our experimental findings are detailed below.

7.2.1. Which algorithm and which training size is recommended in case of a static approach?

The static approach (described in Section 6) is one of the most commonly used by practitioners because of its simplicity and rapidity. However, open questions remain about which learning algorithm should be used and the consequent sensitivity of the accuracy to the training size.
Fig. 3. Forgetting genuine approach: for each new chunk a model is created by keeping all previous fraudulent transactions and a small set of genuine transactions from the last 2 chunks (K_gen = 2). Single models are used to predict the following chunk or can be combined into an ensemble (M = 4).
Table 3. Fraudulent dataset.
We tested three different supervised algorithms: Random Forests (RF), Neural Networks (NNET) and Support Vector Machines (SVM), provided by the R software (R Development Core Team, 2011). We used R version 3.0.1 with the packages randomForest (Liaw & Wiener, 2002), e1071 (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2012), unbalanced (Pozzolo, 2014) and MASS (Venables & Ripley, 2002).

In order to assess the impact of the training set size (in terms of days/chunks) we carried out the predictions with different windows (K = 30, 60 and 90). All training sets were rebalanced using undersampling at first (50% fraudulent, 50% genuine).
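A minimal sketch of one static run is shown below: the training chunks are undersampled once and a Random Forest is fitted, then a test chunk is scored with the estimated fraud probability. The data frames train and test with a 0/1 Class column are assumptions of the example, and parameters such as ntree are illustrative rather than the settings used in our experiments.

```r
library(randomForest)

# One static run: undersample the training data once, fit a Random Forest,
# and return the estimated fraud probability on the test chunk.
static_rf_run <- function(train, test, ntree = 500) {
  frauds    <- train[train$Class == 1, ]
  genuines  <- train[train$Class == 0, ]
  train_bal <- rbind(frauds, genuines[sample(nrow(genuines), nrow(frauds)), ])
  train_bal$Class <- factor(train_bal$Class)
  rf <- randomForest(Class ~ ., data = train_bal, ntree = ntree)
  predict(rf, newdata = test, type = "prob")[, "1"]   # probability of class "1" (fraud)
}
```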
All experiments are replicated five times to reduce the variance caused by the sampling implicit in unbalanced techniques. Fig. 5 shows the sum of the ranks from the Friedman test (Friedman, 1937) for each strategy in terms of AP, AUC and PrecisionRank. For each chunk, we rank the strategies from the least to the best performing. Then we sum the ranks over all chunks. More formally, let r_{s,k} ∈ {1, ..., S} be the rank of strategy s on chunk k and S be the number of strategies to compare. The strategy with the highest accuracy in chunk k has r_{s,k} = S and the one with the lowest has r_{s,k} = 1. Then the sum of ranks for strategy s is defined as \sum_{k=1}^{K} r_{s,k}, where K is the total number of chunks. The higher the sum, the higher the number of times that one strategy is superior to the others. The white bars denote models which are significantly worse than the best (paired t-test based on the ranks of each chunk).
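The rank-sum statistic can be reproduced in a couple of lines; in the sketch below acc is an assumed matrix of accuracies with one row per chunk and one column per strategy (higher is better).

```r
# Sum of ranks over chunks: within each chunk the best strategy gets rank S.
rank_sum <- function(acc) {
  ranks <- t(apply(acc, 1, rank))   # rank the strategies within each chunk
  colSums(ranks)                    # sum over all K chunks, one value per strategy
}
```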
Table 2. Strengths and weaknesses of the incremental approaches.

The strategy names follow a structure built on the following options: the learning algorithm (RF, NNET, SVM), the balancing technique (Under, SMOTE, EasyEnsemble), the frequency of model creation (Daily, Weekly, 15days, One), the number of models M, the incremental strategy (Static, Update, Forget) and the number of training chunks K (or K_gen). The strategy options are concatenated using the dot as separation point (e.g. RF.Under.Daily.10M.Update.60K).

In both datasets, Random Forests clearly outperforms its competitors and, as expected, accuracy is improved by increasing the training size (Fig. 5). Because of the significant superiority of Random Forests with respect to the other algorithms, in what follows we will consider only this learning algorithm.
Fig. 5. Sum of ranks over all chunks (AP, AUC and PrecisionRank) of the static strategies (RF, NNET and SVM with undersampling; K = 30, 60, 90).

7.2.2. Is there an advantage in updating models?

Here we assess the advantage of adopting the update approach described in Section 6. Fig. 6 reports the results for different values of K and M.
The strategies are called daily if a model is built every day, weekly if once a week, or 15days if every 15 days. We compare ensemble strategies (M > 1) with models built on single chunks (K = 1) against single model strategies (M = 1) using several chunks in the training (K > 1).

For all metrics, the best strategy is RF.Under.Daily.5M.Update.90K. It creates a new model at each chunk using the previous 90 days (K = 90) for training and keeps the last 5 models created (M = 5) for predictions. In the case of AP, however, this strategy is not statistically better than the ensemble approaches ranked second.

For all metrics, the strategies that use only the current chunk to build a model (K = 1) are coherently the worst. This confirms the result of the previous analysis showing that a too short window of data
(and consequently a very small fraction of frauds) is insufficient to learn a reliable model.

When comparing the update frequency of the models using the same number of chunks for training (K = 90), daily update always ranks better than weekly and 15days updates. This confirms the intuition that the fraud distribution is always evolving and therefore it is better to update the models as soon as possible.

7.2.3. Is retaining old genuine transactions together with old frauds beneficial?

This section assesses the accuracy of the forgetting approach described in Section 6, whose rationale is to avoid discarding old fraudulent observations.

Accumulating old frauds leads to less unbalanced chunks. In order to avoid having chunks where the accumulated frauds
Fig. 6. Sum of ranks over all chunks (AP, AUC and PrecisionRank) of the update strategies (RF with undersampling, different update frequencies and values of M and K).
outnumber the genuine transactions, two options are available: (i) forgetting some of the old frauds or (ii) accumulating old genuine transactions as well. In the first case, when the accumulated frauds represent 40% of the transactions, new frauds replace old frauds as in Gao, Ding, Fan, Han, and Yu (2008). In the second case we accumulate genuine transactions from the previous K_gen chunks, where K_gen defines the number of chunks used (see Fig. 3).

Fig. 7 shows the sum of ranks for different strategies where the genuine transactions are taken from a different number of days (K_gen). The best strategy for AP and PrecisionRank uses an ensemble of 5 models for each chunk (M = 5) and 30 days of genuine transactions (K_gen = 30). The same strategy ranks third in terms of AUC and is significantly worse than the best. To create ensembles we use a time-based array of models of fixed size M, which
Fig. 7. Sum of ranks over all chunks (AP, AUC and PrecisionRank) of the forgetting strategies (RF with undersampling, different values of M and K_gen).
means that when the number of models available is greater than M, the most recent model replaces the M-th model in the array, removing the oldest model from the ensemble.

In general we see better performance when K_gen increases from 0 to 30, and only in a few cases does K_gen > 30 lead to significantly better accuracy. Note that in all our strategies, after selecting the observations to include in the training sets, we use undersampling to make sure the two classes are equally represented.

7.2.4. Do balancing techniques have an impact on accuracy?

So far we considered exclusively undersampling as the balancing technique in our experiments. In this section we assess the impact of using alternative methods like SMOTE and EasyEnsemble. Experimental results (Fig. 8) show that they both outperform undersampling.

In our datasets, the number of frauds is on average 0.4% of all transactions in the chunk. Undersampling randomly selects a
Fig. 8. Comparison of different balancing techniques and strategies using the sum of ranks over all chunks.
number of genuine transactions equal to the number of frauds, which means removing about 99.6% of the genuine transactions in the chunk. EasyEnsemble is able to reduce the variance of undersampling by using several sub-models for each chunk, while SMOTE creates new artificial fraudulent transactions. In our experiments we used 5 sub-models in EasyEnsemble. For all balancing techniques, among the three approaches presented in Section 6, the static approach is consistently the worst.
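The sketch below illustrates the EasyEnsemble idea in this setting: several independently undersampled sub-models are trained on the same chunk and their fraud probabilities are averaged. The use of Random Forest sub-models, the Class column and the default parameters are assumptions of the example; only the number of sub-models (5) is taken from our experiments.

```r
library(randomForest)

# EasyEnsemble-style variance reduction: several undersampled sub-models per chunk.
easy_ensemble_score <- function(chunk, newdata, n_sub = 5, ntree = 100) {
  frauds   <- chunk[chunk$Class == 1, ]
  genuines <- chunk[chunk$Class == 0, ]
  scores <- sapply(seq_len(n_sub), function(s) {
    sub <- rbind(frauds, genuines[sample(nrow(genuines), nrow(frauds)), ])
    sub$Class <- factor(sub$Class)
    rf <- randomForest(Class ~ ., data = sub, ntree = ntree)
    predict(rf, newdata = newdata, type = "prob")[, "1"]
  })
  rowMeans(scores)                  # average fraud probability of the sub-models
}
```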
    In Fig. 9 we compare the previous strategies in terms of average prediction time over all chunks. SMOTE is computationally heavy since it relies on oversampling, which leads to bigger chunk sizes. EasyEnsemble replicates undersampling and learns from several sub-chunks, which gives a higher computational time than plain undersampling. Among the different incremental approaches, static has the lowest time since the model is learnt once and never retrained. The forgetting strategy has the highest prediction time for all balancing methods; this is expected since it retains old transactions to deal with unbalanced chunks.
7.2.5. Overall, which is the best strategy?
    The large number of possible alternatives (in terms of learning classifier, balancing technique and incremental learning strategy) requires a joint assessment of several combinations in order to come up with a recommended approach. Fig. 10 summarises the best strategies in terms of different metrics. The combination of EasyEnsemble with forgetting emerges as best for all metrics. SMOTE with update is not significantly worse than the best for AP and PrecisionRank, but it does not rank well in terms of AUC. The fact that within the best strategies we see different balancing techniques confirms that in online learning, when the data is unbalanced, the adopted balancing strategy may play a major role. As expected, the static approach ranks low in Fig. 10 as it is not able to adapt to the changing distribution. The forgetting approach is significantly better than update for EasyEnsemble, while SMOTE gives a better ranking with update.
    It is worth noticing that strategies that combine more than one model (M > 1) together with undersampling are not superior to the predictions of a single model combined with EasyEnsemble. EasyEnsemble learns from different samples of the majority class, which means that for each chunk different concepts of the majority class are learnt.

8. Future work

    Future work will focus on the automatic selection of the best unbalanced technique in the case of online learning. Dal Pozzolo, Caelen, Waterschoot, and Bontempi (2013) recently proposed to use an F-race (Birattari, Stützle, Paquete, & Varrentrapp, 2002) algorithm to automatically select the correct unbalanced strategy for a given dataset. In their work, cross-validation is used to feed the data into the race. A natural extension of this work could be the use of racing in incremental settings, where the data fed into the race come from new chunks of the stream.
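As an illustration only, a stripped-down racing loop over candidate unbalanced strategies in a streaming setting could look like the sketch below. The original F-race relies on a Friedman test over all surviving candidates, whereas this toy version uses a paired Wilcoxon test against the current best; the strategy names and scoring function are assumptions of the example.

```python
import numpy as np
from scipy.stats import wilcoxon

def race(candidates, chunks, evaluate, alpha=0.05, min_chunks=5):
    """After each new chunk, score every surviving strategy and drop the ones
    whose chunk-wise scores are significantly worse than the current best."""
    scores = {c: [] for c in candidates}
    alive = set(candidates)
    for chunk in chunks:
        for c in list(alive):
            scores[c].append(evaluate(c, chunk))
        if len(scores[next(iter(alive))]) < min_chunks or len(alive) < 2:
            continue
        best = max(alive, key=lambda c: np.mean(scores[c]))
        for c in list(alive - {best}):
            _, p = wilcoxon(scores[best], scores[c])
            if p < alpha and np.mean(scores[c]) < np.mean(scores[best]):
                alive.remove(c)
    return alive, scores

# Toy usage: random numbers stand in for the AP obtained on each chunk.
rng = np.random.default_rng(0)
strategies = ["undersampling", "SMOTE", "EasyEnsemble"]
winners, _ = race(strategies, range(30),
                  evaluate=lambda c, _chunk: rng.normal(
                      loc={"undersampling": 0.20, "SMOTE": 0.25, "EasyEnsemble": 0.27}[c],
                      scale=0.03))
print(winners)
```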
    Throughout our paper we used only data-driven techniques to deal with the unbalanced problem. HDDT (Cieslak & Chawla, 2008) is a decision tree that uses the Hellinger distance (Hellinger, 1909) as a splitting criterion and is able to deal with skewed distributions. With HDDT, balancing methods are no longer needed before training. The use of such an algorithm could remove the need for propagating positive instances between chunks to fight the unbalanced problem.
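For reference, the Hellinger distance that HDDT uses to score a candidate binary split depends only on how each class is distributed over the two branches, not on the class priors, which is what makes it insensitive to skewness. The snippet below is a minimal illustration of that criterion (not the HDDT implementation), assuming binary labels with 1 marking frauds.

```python
import numpy as np

def hellinger_split_score(y_left, y_right):
    """Hellinger distance between the fraud and genuine distributions induced
    by a binary split (larger = better separation, maximum sqrt(2))."""
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    pos = np.sum(y_left == 1) + np.sum(y_right == 1)   # total frauds
    neg = np.sum(y_left == 0) + np.sum(y_right == 0)   # total genuine
    score = 0.0
    for branch in (y_left, y_right):
        p = np.sum(branch == 1) / pos   # fraction of all frauds in this branch
        n = np.sum(branch == 0) / neg   # fraction of all genuine in this branch
        score += (np.sqrt(p) - np.sqrt(n)) ** 2
    return np.sqrt(score)

# A split that isolates almost all frauds in one branch scores close to sqrt(2).
print(hellinger_split_score([1, 1, 1, 0], [0] * 996))
```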
Fig. 9. Comparison of different balancing techniques and strategies in terms of average prediction time (in seconds) over all chunks. Experiments were run on HP ProLiant BL465c G5 blades with two AMD Opteron 2.4 GHz CPUs (4 cores each) and 32 GB DDR3 RAM.
(Fig. 10: three panels — Metric: AP, Metric: AUC, Metric: PrecisionRank — ranking all strategies by sum of ranks; RF.EasyEnsemble.Daily.1M.Forget.90Kgen ranks first for every metric, while the static NNET and SVM strategies rank last; significantly best strategies are flagged.)
Fig. 10. Comparison of all strategies using sum of ranks in all chunks.
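The "sum of ranks" used in Figs. 8 and 10 can be reproduced from a matrix of per-chunk scores by ranking the strategies within each chunk and summing the ranks over the chunks; the sketch below is an illustrative reconstruction under that assumption (higher ranks for better scores), not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import rankdata

def sum_of_ranks(scores):
    """scores: array of shape (n_chunks, n_strategies) holding, e.g., the AP of
    every strategy on every chunk. Within each chunk the best strategy gets the
    highest rank; the ranks are then summed over all chunks."""
    ranks = np.vstack([rankdata(row) for row in scores])  # rank within each chunk
    return ranks.sum(axis=0)

# Toy example: 3 chunks (rows), 3 strategies (columns).
ap = np.array([[0.20, 0.25, 0.28],
               [0.18, 0.26, 0.27],
               [0.22, 0.24, 0.30]])
print(sum_of_ranks(ap))   # the last strategy accumulates the largest rank sum
```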
    In our work the combination of models in an ensemble is based on the performance of each model in the testing chunk. Several other methods (Kolter & Maloof, 2003; Hoens, Chawla, & Polikar, 2011; Lichtenwalter & Chawla, 2010) have been proposed to combine models in the presence of concept drift. In future work it would be interesting to test some of these methods and compare them to our framework.
    In this manuscript we assumed that there is a single concept to learn for the minority class. However, as frauds differ from each other, we could distinguish several sub-concepts within the positive class. Hoens et al. (2011) suggest using Naive Bayes to retain old positive instances that come from the same sub-concept. REA (Chen & He, 2011) and SERA (Chen & He, 2009), both proposed by Chen and He, propagate to the latest chunk only the minority-class instances that belong to the same concept, using the Mahalanobis distance and a k-nearest neighbors algorithm. Future work should take into consideration the possibility of having several minority concepts.
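As a rough sketch of that idea, in the spirit of SERA rather than a faithful reimplementation, old frauds could be propagated into the current chunk only when they lie close, in Mahalanobis distance, to the frauds already observed in that chunk; the quantile-based threshold and the regularised covariance are assumptions of the example.

```python
import numpy as np

def select_old_frauds(old_frauds, current_frauds, keep_fraction=0.5):
    """Return the previously seen frauds that lie closest, in Mahalanobis
    distance, to the frauds of the current chunk."""
    mu = current_frauds.mean(axis=0)
    # Regularised covariance so the inverse exists even with few frauds.
    cov = np.cov(current_frauds, rowvar=False) + 1e-6 * np.eye(current_frauds.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = old_frauds - mu
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared Mahalanobis distances
    threshold = np.quantile(d2, keep_fraction)
    return old_frauds[d2 <= threshold]

# Toy usage: 200 old frauds, 30 current frauds, 10 aggregated features each.
rng = np.random.default_rng(1)
old = rng.normal(size=(200, 10))
cur = rng.normal(loc=0.5, size=(30, 10))
print(select_old_frauds(old, cur).shape)   # roughly half of the old frauds kept
```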
9. Conclusion

    The need to detect fraudulent patterns in huge amounts of data demands the adoption of automatic methods. The scarcity of publicly available datasets of credit card transactions gives the community little chance to test and assess the impact of existing techniques on real data. The goal of our work is to give practitioners some guidelines on how to tackle the detection problem.
    The paper presents the fraud detection problem and proposes AP, AUC and PrecisionRank as correct performance measures for a fraud detection task. Frauds occur rarely compared to the total amount of transactions. As explained in Section 5.1, standard classification metrics such as Accuracy are not suitable for unbalanced problems. Moreover, the goal of detection is to give the investigators the transactions with the highest fraud risk. For this reason we argue that a good ranking of the transactions by their fraud probability is more important than having the transactions correctly classified.
    Credit card fraud detection relies on the analysis of recorded transactions. However, the information of a single transaction is not considered sufficient to detect a fraud occurrence (Bolton & Hand, 2001) and the analysis has to take the cardholder behaviour into consideration. In this paper we have proposed a way to include cardholder information in the transaction by computing aggregate variables on the historical transactions of the same card.
    As new credit-card transactions keep arriving, the detection system has to process them incrementally as soon as they arrive and avoid retaining too many old transactions in memory. Fraud types are in continuous evolution and detection has to adapt to fraudsters: once a fraud is detected, the fraudster may change his habits and find another way to commit fraud. Adaptive schemes are therefore required to deal with non-stationary fraud dynamics and to discover potentially new fraud mechanisms. We compare three alternative approaches (static, update and forgetting) to learn from unbalanced and non-stationary credit card data streams.
    Fraud detection is a highly unbalanced problem where the number of genuine transactions far outnumbers the fraudulent ones. In the static learning setting a wide range of techniques has been proposed to deal with unbalanced datasets. In incremental learning, however, few attempts have been made to deal with unbalanced data streams (Gao et al., 2007; Hoens et al., 2012; Ditzler, Polikar, & Chawla, 2010). In these works, the most common balancing technique consists in undersampling the majority class in order to reduce the skewness of the chunks.
    The best technique for unbalanced data may depend on several factors, such as (i) the data distribution, (ii) the classifier used and (iii) the performance measure adopted (Dal Pozzolo et al., 2013). In our work we adopted two alternatives to undersampling: SMOTE and EasyEnsemble. In particular, we show that they are both able to return higher accuracies. Our framework can easily be extended to include other data-level balancing techniques.
    The experimental part has shown that in online learning, when the data are skewed towards one class, it is important to maintain previous minority examples in order to learn a better separation of the two classes. Instance propagation from previous chunks has the effect of increasing the minority class in the current chunk, but its impact is limited given the small number of frauds. The update and forgetting approaches presented in Section 6 differ essentially in the way the minority class is oversampled in the current chunk. We tested several ensemble and single-model strategies using different numbers of chunks for training. In general we see that models trained on several chunks have better accuracy than single-chunk models. Multi-chunk models, however, learn on overlapping training sets, and when this happens single-model strategies can beat ensembles.
    Our framework addresses the problem of non-stationarity in data streams by creating a new model every time a new chunk is available. This approach has shown better results than updating the models at a lower frequency (weekly or every 15 days). Updating the models is crucial in a non-stationary environment; this intuition is confirmed by the bad results of the static approach. In our dataset, overall we saw Random Forest beating Neural Network and Support Vector Machine. The final best strategy implemented the forgetting approach together with EasyEnsemble and daily updates.

References

Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387–415.
Batista, G., Carvalho, A., & Monard, M. (2000). Applying one-sided selection to unbalanced datasets. MICAI 2000: Advances in Artificial Intelligence, 315–325.
Birattari, M., Stützle, T., Paquete, L., & Varrentrapp, K. (2002). A racing algorithm for configuring metaheuristics. In Proceedings of the genetic and evolutionary computation conference (pp. 11–18).
Bolton, R. J., & Hand, D. J. (2001). Unsupervised profiling methods for fraud detection. Credit Scoring and Credit Control, VII, 235–255.
Bolton, R., & Hand, D. (2002). Statistical fraud detection: A review. Statistical Science, 235–249.
Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural data mining for credit card fraud detection. In Tools with artificial intelligence, 1999. Proceedings (pp. 103–106). IEEE.
Chan, P., Fan, W., Prodromidis, A., & Stolfo, S. (1999). Distributed data mining in credit card fraud detection. Intelligent Systems and their Applications, 14, 67–74.
Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 853–867). Springer.
Chawla, N., Bowyer, K., Hall, L., & Kegelmeyer, W. (2011). SMOTE: Synthetic minority over-sampling technique. arXiv preprint arXiv:1106.1813.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6, 1–6.
Chen, S., & He, H. (2009). SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining. In International joint conference on neural networks, 2009. IJCNN 2009 (pp. 522–529). IEEE.
Chen, S., & He, H. (2011). Towards incremental learning of nonstationary imbalanced data stream: A multiple selectively recursive approach. Evolving Systems, 2, 35–50.
Chen, S., He, H., Li, K., & Desai, S. (2010). MuSeRA: Multiple selectively recursive approach towards imbalanced stream data mining. In The 2010 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.
Cieslak, D. A., & Chawla, N. V. (2008). Learning decision trees for unbalanced data. In Machine learning and knowledge discovery in databases (pp. 241–256). Springer.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283.
Cohen, W. W. (1995). Fast effective rule induction. In Machine learning: International workshop then conference (pp. 115–123). Morgan Kaufmann Publishers, Inc.
Dal Pozzolo, A., Caelen, O., Waterschoot, S., & Bontempi, G. (2013). Racing for unbalanced methods selection. In Proceedings of the 14th international conference on intelligent data engineering and automated learning, IDEAL.
Delamaire, L., Abdou, H., & Pointon, J. (2009). Credit card fraud and detection techniques: A review. Banks and Bank Systems, 4, 57–68.
R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. <http://www.R-project.org/>. ISBN 3-900051-07-0.
Ditzler, G., Polikar, R., & Chawla, N. (2010). An incremental learning algorithm for non-stationary environments and class imbalance. In 2010 20th international conference on pattern recognition (ICPR) (pp. 2997–3000). IEEE.
Dorronsoro, J., Ginel, F., Sánchez, C., & Cruz, C. (1997). Neural fraud detection in credit card operations. Neural Networks, 8, 827–834.
Drummond, C., & Holte, R. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II. Citeseer.
Elkan, C. (2001). The foundations of cost-sensitive learning. In International joint conference on artificial intelligence (Vol. 17, pp. 973–978). Citeseer.
Fan, G., & Zhu, M. (2011). Detection of rare items with TARGET. Statistics and Its Interface, 4, 11–17.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675–701.
Gao, J., Ding, B., Fan, W., Han, J., & Yu, P. S. (2008). Classifying data streams with skewed class distributions and concept drifts. Internet Computing, 12, 37–49.
Gao, J., Fan, W., Han, J., & Philip, S. Y. (2007). A general framework for mining concept-drifting data streams with skewed distributions. In SDM.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.
Hand, D. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77, 103–123.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–271.
Hoens, T. R., Chawla, N. V., & Polikar, R. (2011). Heuristic updatable weighted random subspaces for non-stationary environments. In 2011 IEEE 11th international conference on data mining (ICDM) (pp. 241–250). IEEE.
Hoens, T. R., Polikar, R., & Chawla, N. V. (2012). Learning from streaming data with concept drift and imbalance: An overview. Progress in Artificial Intelligence, 1, 89–101.
Holte, R. C., Acker, L. E., & Porter, B. W. (1989). Concept learning and the problem of small disjuncts. In Proceedings of the eleventh international joint conference on artificial intelligence (Vol. 1). Citeseer.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6, 429–449.
Kolter, J. Z., & Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 123–130). IEEE.
Kuncheva, L. I. (2004). Classifier ensembles for changing environments. In Multiple classifier systems (pp. 1–15). Springer.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Lichtenwalter, R. N., & Chawla, N. V. (2010). Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. In New frontiers in applied data mining (pp. 53–75). Springer.
Liu, X., Wu, J., & Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. Systems, Man, and Cybernetics, Part B: Cybernetics, 39, 539–550.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2012). e1071: Misc functions of the Department of Statistics (e1071), TU Wien. <http://CRAN.R-project.org/package=e1071>. R package version 1.6-1.
Olshen, L., & Stone, C. (1986). Classification and regression trees. Wadsworth International Group.
Oza, N. C. (2005). Online bagging and boosting. In Systems, man and cybernetics (Vol. 3, pp. 2340–2345). IEEE.
Pavía, J. M., Veres-Ferrer, E. J., & Foix-Escura, G. (2012). Credit card incidents and control systems. International Journal of Information Management, 32, 501–503.
Polikar, R., Upda, L., Upda, S., Honavar, V., et al. (2001). Learn++: An incremental learning algorithm for supervised neural networks. Systems, Man, and Cybernetics, Part C: Applications and Reviews, 31, 497–508.
Pozzolo, A. D. (2014). unbalanced: The package implements different data-driven methods for unbalanced datasets. R package version 1.0.
Provost, F. (2000). Machine learning from imbalanced data sets 101. In Proceedings of the AAAI'2000 workshop on imbalanced data sets.
Quah, J. T., & Sriganesh, M. (2008). Real-time credit card fraud detection using computational intelligence. Expert Systems with Applications, 35, 1721–1732.
Quinlan, J. (1993). C4.5: Programs for machine learning (Vol. 1). Morgan Kaufmann.
Street, W. N., & Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. 377–382). ACM.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. ISBN 0-387-95457-0. <http://www.stats.ox.ac.uk/pub/MASS4>.
Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 226–235). ACM.
Weston, D., Hand, D., Adams, N., Whitrow, C., & Juszczak, P. (2008). Plastic card fraud detection using peer group analysis. Advances in Data Analysis and Classification, 2, 45–62.
Whitrow, C., Hand, D. J., Juszczak, P., Weston, D., & Adams, N. M. (2009). Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery, 18, 30–55.
Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23, 69–101.