Credit Card Fraud Lessons
Keywords: Incremental learning, Unbalanced data, Fraud detection

Abstract

Billions of dollars of loss are caused every year by fraudulent credit card transactions. The design of efficient fraud detection algorithms is key to reducing these losses, and a growing number of algorithms rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is however particularly challenging due to the non-stationary distribution of the data, the highly imbalanced class distributions and the continuous streams of transactions. At the same time, public data are scarcely available for confidentiality issues, leaving unanswered many questions about what is the best strategy to deal with them. In this paper we provide some answers from the practitioner's perspective by focusing on three crucial issues: unbalancedness, non-stationarity and assessment. The analysis is made possible by a real credit card dataset provided by our industrial partner.
perspective. These are just some of the potential questions that can arise during the design of a detection system. We do not claim to be able to give a definite answer to the problem, but we hope that our work serves as a guideline for others in the field. Our goal is to show what worked and what did not in a real case study. In this paper we give a formalisation of the learning problem in the context of credit card fraud detection. We present a way to create new features in the datasets that can trace the card holder spending habits. By doing this it is possible to present the transactions to the learning algorithm without providing the card holder identifier. We then argue that traditional classification metrics are not suited for a detection task and present existing alternative measures.

We propose and compare three approaches for online learning in order to identify what is important to retain or to forget in a changing and non-stationary environment. We show the impact of the rebalancing technique on the final performance when the class distribution is skewed. In doing this we merge techniques developed for unbalanced static datasets with online learning strategies. The resulting frameworks are able to deal with unbalanced and evolving data streams. All the results are obtained by experimentation on a dataset of real credit card transactions provided by our industrial partner.
3. State of the art in credit card fraud detection

Credit card fraud detection is one of the most explored domains of fraud detection (Chan, Fan, Prodromidis, & Stolfo, 1999; Bolton & Hand, 2001; Brause, Langsdorf, & Hepp, 1999) and relies on the automatic analysis of recorded transactions to detect fraudulent behaviour. Every time a credit card is used, transaction data, composed of a number of attributes (e.g. credit card identifier, transaction date, recipient, amount of the transaction), are stored in the databases of the service provider.

However, the information of a single transaction is typically not sufficient to detect a fraud occurrence (Bolton & Hand, 2001) and the analysis has to consider aggregate measures like the total amount spent per day, the number of transactions per week or the average amount of a transaction (Whitrow, Hand, Juszczak, Weston, & Adams, 2009).
3.1. Supervised versus unsupervised detection

In the fraud detection literature we encounter both supervised techniques, which make use of the class of the transaction (e.g. genuine or fraudulent), and unsupervised techniques. Supervised methods assume that the labels of past transactions are available and reliable, but they are often limited to recognising fraud patterns that have already occurred (Bolton & Hand, 2002). On the other hand, unsupervised methods do not use the class of the transactions and are capable of detecting new fraudulent behaviours (Bolton & Hand, 2001). Clustering based methods (Quah & Sriganesh, 2008; Weston, Hand, Adams, Whitrow, & Juszczak, 2008) form customer profiles to identify new hidden fraud patterns.

The focus of this paper will be on supervised methods. In the literature several supervised methods have been applied to fraud detection, such as neural networks (Dorronsoro, Ginel, Sánchez, & Cruz, 1997), rule-based methods (BAYES, Clark & Niblett, 1989; RIPPER, Cohen, 1995) and tree-based algorithms (C4.5, Quinlan, 1993; CART, Olshen & Stone, 1984). It is well known however that an open issue is how to manage unbalanced class sizes, since the legitimate transactions generally far outnumber the fraudulent ones.
3.2. Unbalanced problem

Learning from unbalanced datasets is a difficult task since most learning systems are not designed to cope with a large difference between the number of cases belonging to each class (Batista, Carvalho, & Monard, 2000). In the literature, traditional methods for classification with unbalanced datasets rely on sampling techniques to balance the dataset (Japkowicz & Stephen, 2002).

In particular we can distinguish between methods that operate at the data level and at the algorithmic level (Chawla, Japkowicz, & Kotcz, 2004). At the data level, balancing techniques are used as a pre-processing step to rebalance the dataset or to remove the noise between the two classes, before any algorithm is applied. At the algorithmic level, the classification algorithms themselves are adapted to deal with the minority class detection. In this article we focus on data level techniques, as they have the advantage of leaving the algorithms unchanged.

Sampling techniques do not take into consideration any specific information when removing or adding observations from one class, yet they are easy to implement and to understand. Undersampling (Drummond & Holte, 2003) consists in down-sizing the majority class by removing observations at random until the dataset is balanced.

SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2011) oversamples the minority class by generating synthetic minority examples in the neighbourhood of the observed ones. The idea is to form new minority examples by interpolating between examples of the same class. This has the effect of creating clusters around each minority observation.

Ensemble methods combine balancing techniques with a classifier to explore the majority and minority class distributions. EasyEnsemble is claimed in Liu, Wu, and Zhou (2009) to be a better alternative to undersampling. This method learns different aspects of the original majority class in an unsupervised manner. This is done by creating different balanced training sets by undersampling, learning a model for each dataset and then combining all predictions.
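As an illustration of these data-level techniques, the sketch below implements random undersampling and a SMOTE-like interpolation between minority neighbours. It is a minimal sketch, not the implementation used in our experiments; the column name Class (1 = fraud, 0 = genuine) and the purely numeric minority features are assumptions of the example.

```r
# Minimal sketch of the two data-level balancing techniques discussed above.
# Assumes a data frame with numeric features and a column 'Class' (1 = fraud).

undersample <- function(data, class_col = "Class") {
  frauds   <- data[data[[class_col]] == 1, ]
  genuines <- data[data[[class_col]] == 0, ]
  # keep all frauds and an equally sized random subset of genuine transactions
  rbind(frauds, genuines[sample(nrow(genuines), nrow(frauds)), ])
}

smote_like <- function(minority, k = 5, n_new = nrow(minority)) {
  # interpolate between a minority example and one of its k nearest neighbours
  X <- as.matrix(minority)
  D <- as.matrix(dist(X))                     # pairwise Euclidean distances
  synth <- t(sapply(seq_len(n_new), function(i) {
    a   <- sample(nrow(X), 1)
    nn  <- order(D[a, ])[2:(k + 1)]           # skip the point itself (rank 1)
    b   <- nn[sample(length(nn), 1)]
    gap <- runif(1)
    X[a, ] + gap * (X[b, ] - X[a, ])          # synthetic point on the segment a-b
  }))
  as.data.frame(synth)
}
```

Undersampling discards information from the majority class, while the SMOTE-like generator adds synthetic minority points; the practical impact of such choices is compared in Section 7.2.4.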
3.3. Incremental learning

Static learning is the classical learning setting where the data are processed all at once in a single learning batch. Incremental learning instead interprets the data as a continuous stream and processes each new instance "on arrival" (Oza, 2005). In this context it is important to preserve the previously acquired knowledge as well as to update it properly in front of new observations. In incremental learning the data arrive in chunks where the underlying data generation function may change, while static learning deals with a single dataset. The problem of learning in the case of unbalanced data has been widely explored in the static learning setting (Chawla et al., 2011; Drummond & Holte, 2003; Japkowicz & Stephen, 2002; Liu et al., 2009). Learning from a non-stationary data stream with a skewed class distribution is however a relatively recent domain.

In the incremental setting, when the data distribution changes, it is important to learn from new observations while retaining existing knowledge from past observations. Concepts learnt in the past may re-occur in the future, and new concepts may appear in the data stream. This is known as the stability-plasticity dilemma (Grossberg, 1988). A classifier is required to be able to respond to changes in the data distribution, while ensuring that it still retains relevant past knowledge. Many of the proposed techniques (Chen, He, Li, & Desai, 2010; Polikar, Upda, Upda, & Honavar, 2001; Street & Kim, 2001) use ensemble classifiers in order to combine what is learnt from new observations and the knowledge acquired before. As fraud evolves over time, the learning framework has to adapt to the new distribution. The classifier should be able
to learn from new fraud distributions and "forget" outdated knowledge. It then becomes critical to set the rate of forgetting in order to match the rate of change in the distribution (Kuncheva, 2004). The simplest strategy uses a constant forgetting rate, which boils down to considering a fixed window of recent observations to retrain the model. The FLORA approach (Widmer & Kubat, 1996) uses a variable forgetting rate, where the window is shrunk if a change is detected and expanded otherwise. The evolution of a class concept is called in the literature concept drift.
Gao, Fan, Han, and Philip (2007) propose to store all previous minority class examples in the current training data set to make it less unbalanced and then to combine the models into an ensemble of classifiers. SERA (Chen et al., 2009) and REA (Chen & He, 2011) selectively accumulate old minority class observations to rebalance the training chunk. They propose two different methods (Mahalanobis distance and K nearest neighbours) in order to select the most relevant minority instances to include in the current chunk from the set of old minority instances.

These methods consist in oversampling the minority class of the current chunk by retaining old positive observations. The accumulation of previous minority class examples is of limited volume due to the skewed class distribution, therefore oversampling does not increase the chunk size by much.
4. Formalization of the learning problem

In this section we formalize the credit card fraud detection task as a statistical learning problem. Let X_{ij} be transaction number j of card number i. We assume that the transactions are ordered in time, such that if X_{iv} occurs before X_{iw} then v < w. For each transaction some basic information is available, such as the amount of the expenditure, the shop where it was performed, the currency, etc. However these variables do not provide any information about the normal card usage. The normal behaviour of a card can be measured by using a set of historical transactions from the same card. For example, we can get an idea of the card holder's spending habits by looking at the average amount spent in different merchant categories (e.g. restaurant, online shopping, gas station, etc.) in the last 3 months preceding the transaction. Let X_{ik} be a new transaction and let dt(X_{ik}) be the corresponding transaction date-time in the dataset. Let T denote the time-frame of a set of historical transactions for the same card. X^H_{ik} is then the set of historical transactions occurring in the time-frame T before X_{ik}, i.e.

X^H_{ik} = \{ X_{ij} : dt(X_{ij}) \neq dt(X_{ik}) \ \text{and} \ dt(X_{ik}) > dt(X_{ij}) \geq dt(X_{ik}) - T \}.

For instance, with T = 90 days, X^H_{ik} is the set of transactions for the same card occurring in the 3 months preceding dt(X_{ik}).
The card behaviour can be summarised using classical aggregation methods (e.g. mean, max, min or count) on the set X^H_{ik}. This means that it is possible to create new aggregated variables that can be added to the original transaction variables to include information about the card. In this way we have included information about the user behaviour at the transaction level and we no longer need to consider the card ID. Transactions from card holders with similar spending habits will share similar aggregate variables.
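A minimal sketch of this aggregation step is given below: for every transaction it computes the count, mean and max of the amounts spent on the same card during the previous T = 90 days. The column names (card_id, datetime as POSIXct, amount) are assumptions of the example, and the loop is written for clarity rather than speed.

```r
# Sketch of the card-level aggregation over the historical set X^H_ik.
# Assumed columns: card_id, datetime (POSIXct), amount.

aggregate_history <- function(tx, T_days = 90) {
  feats <- lapply(seq_len(nrow(tx)), function(k) {
    cur  <- tx[k, ]
    hist <- tx[tx$card_id == cur$card_id &
               tx$datetime <  cur$datetime &
               tx$datetime >= cur$datetime - T_days * 86400, ]   # last T days
    data.frame(hist_count = nrow(hist),
               hist_mean  = if (nrow(hist) > 0) mean(hist$amount) else 0,
               hist_max   = if (nrow(hist) > 0) max(hist$amount)  else 0)
  })
  cbind(tx, do.call(rbind, feats))   # card_id can then be dropped before learning
}
```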
Let {X} be the new set of transactions with aggregated variables. Each transaction X_j is assigned a binary status Y_j, where Y_j = 1 when transaction j is fraudulent and Y_j = 0 otherwise. The goal of a detection system is to learn P(Y|X) and predict the class of a new transaction, Ŷ_N ∈ {0, 1}. Note that here the focus is not on classifying the card holder, but the transaction, as fraudulent or genuine. Changes in the cardholders' behaviour (e.g. purchases on the Internet) affect the distribution P(X). At the same time the evolution of the types of frauds affects the class conditional probability distribution P(Y|X). As a result the joint distribution P(X, Y) is not stationary: this is known as concept-drift (Hoens, Polikar, & Chawla, 2012). Note that Gao et al. (2007) suggest that even when the concept drift is not detected, there is still a benefit in updating the models.

5. Performance measure

Fraud detection must deal with the following challenges: (i) timeliness of decision (a card should be blocked as soon as it is found to be victim of fraud; a quick reaction to the appearance of the first fraud can prevent other frauds), (ii) unbalanced class sizes (the number of frauds is relatively small compared to genuine transactions) and (iii) the cost structure of the problem (the cost of a fraud is not easy to define). The cost of a fraud is often assumed to be equal to the transaction amount (Elkan, 2001). However, frauds of small and big amounts must be treated with equal importance. A fraudulent activity is usually tested with a small amount and then, if successful, replicated with bigger amounts. The cost should also include the time taken by the detection system to react. The shorter the reaction time, the larger the number of frauds that it is possible to prevent. Depending on the fraud risk assigned by the detection system to the transaction, the following can happen: (i) transaction accepted, (ii) transaction refused, (iii) card blocked. Usually the card is blocked only in the few cases where there is a high risk of fraud (well known fraudulent patterns with high accuracy, e.g. 99% correct). When a transaction is refused, the investigators make a phone call to the card holder to verify whether it is a false alert or a real fraud. The cost of a false alert can then be considered equivalent to the cost of the phone call, which is negligible compared to the loss that occurs in case of a fraud. However, when the number of false alerts is too big or the card is blocked by error, the impossibility to make transactions can translate into big losses for the customer. For all these reasons, defining a cost measure is a challenging problem in credit card fraud detection.

The fraud problem can be seen as a binary classification and detection problem.

5.1. Classification

In a classification problem an algorithm is assessed on its accuracy to predict the correct classes of new unseen observations. Let {Y_0} be the set of genuine transactions, {Y_1} the set of fraudulent transactions, {Ŷ_0} the set of transactions predicted as genuine and {Ŷ_1} the set of transactions predicted as fraudulent. For a binary classification problem it is conventional to define a confusion matrix (Table 1).

Table 1. Confusion matrix.

In an unbalanced class problem, it is well known that quantities like TPR (TP/(TP+FN)), TNR (TN/(FP+TN)) and Accuracy ((TP+TN)/(TP+FN+FP+TN)) are misleading assessment measures (Provost, 2000). The Balanced Error Rate (0.5 · FP/(TN+FP) + 0.5 · FN/(FN+TP)) may be inappropriate too because of the different misclassification costs of false negatives and false positives.

A well accepted measure for unbalanced datasets is the AUC (area under the ROC curve) (Chawla, 2005). This metric gives a measure
of how much the ROC curve is close to the point of perfect classification. Hand (2009) considers the calculation of the area under the ROC curve as inappropriate, since this translates into making an average of the misclassification costs of the two classes. An alternative way of estimating the AUC is based on the use of the Mann-Whitney statistic and consists in ranking the observations by their fraud probability and measuring the probability that a random minority class example ranks higher than a random majority class example (Bamber, 1975). By using the rank-based formulation of the AUC we can avoid setting different probability thresholds to generate the ROC curve and avoid the problem raised by Hand. Let n_0 = |Y_0| be the number of genuine transactions and n_1 = |Y_1| be the number of fraudulent transactions. Let g_i = p̂_0(x_{i0}) be the estimated probability of belonging to the genuine class for the i-th transaction in {Y_0}, for i = 1, ..., n_0. Define f_i = p̂_0(x_{i1}) similarly for the n_1 fraudulent transactions. Then {g_1, ..., g_{n_0}} and {f_1, ..., f_{n_1}} are samples from the g and f distributions. Rank the combined set of values g_1, ..., g_{n_0}, f_1, ..., f_{n_1} in increasing order and let q_i be the rank of the i-th genuine transaction. There are (q_i - i) fraudulent transactions with estimated probabilities of belonging to class 0 which are smaller than that of the i-th genuine transaction (Hand & Till, 2001). Summing over the 0 class, we see that the total number of pairs of transactions, one from class 0 and one from class 1, in which the fraudulent transaction has a smaller estimated probability of belonging to class 0 than does the genuine transaction, is

\sum_{i=1}^{n_0} (q_i - i) = \sum_{i=1}^{n_0} q_i - \sum_{i=1}^{n_0} i = R_0 - n_0 (n_0 + 1)/2

where R_0 = \sum_{i=1}^{n_0} q_i. Since there are n_0 n_1 such pairs of transactions altogether, our estimate of the probability that a randomly chosen fraudulent transaction has a lower estimated probability of belonging to class 0 than a randomly chosen genuine transaction is

\hat{A} = \frac{R_0 - n_0 (n_0 + 1)/2}{n_0 n_1}

\hat{A} gives an estimation of the AUC that avoids the errors introduced by smoothing procedures in the ROC curve and that is threshold-free (Hand & Till, 2001).
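The estimate \hat{A} is straightforward to compute from the ranks; the sketch below is a direct transcription of the formula above (the argument names p0, the estimated probability of the genuine class, and y, the 0/1 label, are assumptions of the example).

```r
# Rank-based AUC estimate (Hand & Till, 2001): A = (R0 - n0(n0+1)/2) / (n0*n1).
auc_rank <- function(p0, y) {          # p0: estimated P(class 0), y: 0/1 labels
  n0 <- sum(y == 0)
  n1 <- sum(y == 1)
  q  <- rank(p0)                       # ranks of the combined scores g and f
  R0 <- sum(q[y == 0])                 # sum of the ranks of the genuine transactions
  (R0 - n0 * (n0 + 1) / 2) / (n0 * n1)
}
```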
5.2. Detection

The performance of a detection task (like fraud detection) is not necessarily well described in terms of classification (Fan & Zhu, 2011). In a detection problem what matters most is whether the algorithm can rank the few useful items (e.g. frauds) ahead of the rest. In a scenario with limited resources, fraud investigators cannot revise all the transactions marked as fraudulent by a classification algorithm. They have to put their effort into investigating the transactions with the highest risk of fraud, which means that the detection system is asked to return the transactions ranked by their posterior fraud probability. The goal then is not only to predict each class accurately, but to return a correct ranking of the minority class.

In this context a good detection algorithm should be able to give a high rank to relevant items (frauds) and a low score to non-relevant ones. Fan and Zhu (2011) consider the average precision (AP) as the correct measure for a detection task. Let p be the number of positive (fraud) cases in the original dataset. Out of the t% top-ranked candidates, suppose h(t) are truly positive (h(t) ≤ t). We can then define recall as R(t) = h(t)/p and precision as P(t) = h(t)/t. Then P(t_r) and R(t_r) are the precision and recall of the r-th ranked observation. The formula for calculating the average precision is:

AP = \sum_{r=1}^{N} P(t_r) \, \Delta R(t_r)

where ΔR(t_r) = R(t_r) - R(t_{r-1}) and N is the total number of observations in the dataset. From the definition of R(t_r) we have:

\Delta R(t_r) = \frac{h(t_r) - h(t_{r-1})}{p} =
\begin{cases}
1/p & \text{if the } r\text{th ranked transaction is fraudulent} \\
0 & \text{if the } r\text{th ranked transaction is genuine}
\end{cases}

An algorithm "A" is superior to an algorithm "B" only if it detects the frauds before algorithm "B". The better the rank, the greater the AP. The optimal algorithm, which ranks all the frauds ahead of the legitimate transactions, has an average precision of 1.

In detection teams like the one of our industrial partner, each time a fraud alert is generated by the detection system, it has to be checked by investigators before proceeding with actions (e.g. customer contact or card stop). Given the limited number of investigators, it is possible to verify only a limited number of alerts. Therefore it is crucial to have the best ranking within the maximum number a of alerts that they can investigate. In this setting it is important to have the highest Precision within the first a alerts.

In the following we will denote as PrecisionRank the Precision within the a observations with the highest rank.
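Both AP and PrecisionRank can be obtained directly from the ranking induced by the fraud scores, as in the sketch below (the argument names score and y and the default value of alpha are assumptions of the example, not the settings of our detection system).

```r
# Average precision and PrecisionRank from a fraud-probability ranking.
detection_metrics <- function(score, y, alpha = 100) {
  ord  <- order(score, decreasing = TRUE)   # rank transactions by fraud probability
  hits <- y[ord] == 1                       # TRUE where the r-th ranked item is a fraud
  prec <- cumsum(hits) / seq_along(hits)    # P(t_r) for r = 1..N
  ap   <- sum(prec[hits]) / sum(y == 1)     # Delta R(t_r) = 1/p only at the frauds
  list(AP = ap, PrecisionRank = mean(hits[seq_len(alpha)]))  # precision in top-alpha alerts
}
```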
6. Strategies for incremental learning with unbalanced fraud data

The most conventional way to deal with sequential fraud data is to adopt a static approach (Fig. 1), which creates a classification model once in a while and uses it as a predictor over a long horizon. Though this approach reduces the learning effort, its main problem resides in its lack of adaptivity, which makes it insensitive to any change of distribution in the upcoming chunks.

On the basis of the state of the art described in Section 3.3, it is possible to conceive two alternative strategies to address both the incremental and the unbalanced nature of the fraud detection problem.

The first approach, denoted as the updating approach and illustrated in Fig. 2, is inspired by Wang, Fan, Yu, and Han (2003). It uses a set of M models and a number K of chunks to train each model. Note that for M > 1 and K > 1 the training sets of the M models are overlapping. This approach adapts to the changing environment by forgetting chunks at a constant rate. The last M models are stored and used in a weighted ensemble of models E_M. Let PrecisionRank_m denote the predictive accuracy measured in terms of PrecisionRank on the last (testing) chunk of the m-th model. The ensemble E_M is defined as the linear combination of all the M models h_m:

E_M = \sum_{m=1}^{M} w_m h_m

where

w_m = \frac{\mathrm{PrecisionRank}_m - \mathrm{PrecisionRank}_{\min}}{\mathrm{PrecisionRank}_{\max} - \mathrm{PrecisionRank}_{\min}}, \qquad
\mathrm{PrecisionRank}_{\min} = \min_{m \in M} \mathrm{PrecisionRank}_m, \qquad
\mathrm{PrecisionRank}_{\max} = \max_{m \in M} \mathrm{PrecisionRank}_m
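A sketch of this weighted ensemble is given below: each of the last M models is weighted by the min-max normalised PrecisionRank it obtained on its test chunk. The list models, the vector pr and the use of predict(..., type = "prob") with a class level named "1" are assumptions of the example (they match the randomForest interface used later in Section 7, not necessarily our exact code).

```r
# Weighted ensemble E_M = sum_m w_m h_m of the updating approach.
ensemble_score <- function(models, pr, newdata) {
  w <- (pr - min(pr)) / (max(pr) - min(pr))            # w_m in [0, 1]
  preds <- sapply(models, function(h)
    predict(h, newdata, type = "prob")[, "1"])         # fraud probability of each h_m
  as.vector(preds %*% w)                               # ensemble score used for ranking
}
```

When all M models obtain the same PrecisionRank the weights degenerate; a uniform weighting can be used as a fallback in that case.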
Fig. 1. Static approach: a model is trained on K = 3 chunks and used to predict future chunks.

The second approach, denoted as the forgetting genuine approach and illustrated in Fig. 3, is inspired by Gao et al.'s work. In order to mitigate the unbalanced effects, each time a new chunk is available, a model is learned on the genuine transactions of the previous K_gen chunks and all past fraudulent transactions. Since this approach leads to training sets which grow in size over time, a maximum training size is set to avoid overloading. Once this size
is reached, older observations are removed in favor of the more recent ones. An ensemble of models is obtained by combining the last M models as in the update approach.
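The following sketch assembles one training set of the forgetting genuine approach: all frauds observed so far plus the genuine transactions of the last K_gen chunks, capped at a maximum size. The list chunks (data frames with Class and datetime columns) and the default values are assumptions of the example.

```r
# Training set of the forgetting genuine approach: all past frauds +
# genuine transactions of the last K_gen chunks, capped at max_size rows.
build_forget_trainset <- function(chunks, k_gen = 30, max_size = 5e5) {
  frauds   <- do.call(rbind, lapply(chunks, function(ch) ch[ch$Class == 1, ]))
  genuines <- do.call(rbind, lapply(tail(chunks, k_gen),
                                    function(ch) ch[ch$Class == 0, ]))
  train <- rbind(frauds, genuines)
  train <- train[order(train$datetime), ]     # oldest observations first
  if (nrow(train) > max_size)
    train <- tail(train, max_size)            # drop the oldest rows when too large
  train                                       # rebalance afterwards (Section 3.2)
}
```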
Note that in all these approaches (including the static one), a balancing technique (Section 3.2) can be used to reduce the skewness of the training set (Fig. 4).

In Table 2 we have summarised the strengths and weaknesses of the incremental approaches presented. The Static strategy has the advantage of being fast, as the training of the model is done only once, but it does not return a model that follows the changes in the distribution of the data. The other two approaches, on the contrary, can adapt to concept drift. They differ essentially in the way the minority class is accumulated in the training chunks. The Forget strategy propagates instances between chunks, leading to bigger training sets and a higher computational burden.

7. Experimental assessment

In this section we perform an extensive experimental assessment on the basis of real data (Section 7.1) in order to address common issues that the practitioner has to solve when facing large credit card fraud datasets (Section 7.2).

7.1. Dataset

7.2. Learned lessons

Our experimental analysis allows us to provide some answers to the most common questions of credit card fraud detection. The questions and the answers based on our experimental findings are detailed below.

7.2.1. Which algorithm and which training size is recommended in case of a static approach?

The static approach (described in Section 6) is one of the most commonly used by practitioners because of its simplicity and rapidity. However, open questions remain about which learning algorithm should be used and the consequent sensitivity of the accuracy to the training size.
Fig. 3. Forgetting genuine approach: for each new chunk a model is created by keeping all previous fraudulent transactions and a small set of genuine transactions from the last 2 chunks (K_gen = 2). Single models are used to predict the following chunk or can be combined into an ensemble (M = 4).
Table 3. Fraudulent dataset.
We tested three different supervised algorithms: Random Forests (RF), Neural Networks (NNET) and Support Vector Machines (SVM), provided by the R software (R Development Core Team, 2011). We used R version 3.0.1 with the packages randomForest (Liaw & Wiener, 2002), e1071 (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2012), unbalanced (Pozzolo, 2014) and MASS (Venables & Ripley, 2002).

In order to assess the impact of the training set size (in terms of days/chunks) we carried out the predictions with different windows (K = 30, 60 and 90). All training sets were rebalanced using undersampling at first (50% fraudulent, 50% genuine).
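A minimal sketch of one static run is shown below: the training chunks are undersampled once and a Random Forest is fitted, then a test chunk is scored with the estimated fraud probability. The data frames train and test with a 0/1 Class column are assumptions of the example, and parameters such as ntree are illustrative rather than the settings used in our experiments.

```r
library(randomForest)

# One static run: undersample the training data once, fit a Random Forest,
# and return the estimated fraud probability on the test chunk.
static_rf_run <- function(train, test, ntree = 500) {
  frauds    <- train[train$Class == 1, ]
  genuines  <- train[train$Class == 0, ]
  train_bal <- rbind(frauds, genuines[sample(nrow(genuines), nrow(frauds)), ])
  train_bal$Class <- factor(train_bal$Class)
  rf <- randomForest(Class ~ ., data = train_bal, ntree = ntree)
  predict(rf, newdata = test, type = "prob")[, "1"]   # probability of class "1" (fraud)
}
```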
All experiments are replicated five times to reduce the variance caused by the sampling implicit in unbalanced techniques. Fig. 5 shows the sum of the ranks from the Friedman test (Friedman, 1937) for each strategy in terms of AP, AUC and PrecisionRank. For each chunk, we rank the strategies from the least to the best performing. Then we sum the ranks over all chunks. More formally, let r_{s,k} ∈ {1, ..., S} be the rank of strategy s on chunk k and S be the number of strategies to compare. The strategy with the highest accuracy in chunk k has r_{s,k} = S and the one with the lowest has r_{s,k} = 1. Then the sum of ranks for strategy s is defined as \sum_{k=1}^{K} r_{s,k}, where K is the total number of chunks. The higher the sum, the higher the number of times that one strategy is superior to the others. The white bars denote models which are significantly worse than the best (paired t-test based on the ranks of each chunk).
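The rank-sum statistic can be reproduced in a couple of lines; in the sketch below acc is an assumed matrix of accuracies with one row per chunk and one column per strategy (higher is better).

```r
# Sum of ranks over chunks: within each chunk the best strategy gets rank S.
rank_sum <- function(acc) {
  ranks <- t(apply(acc, 1, rank))   # rank the strategies within each chunk
  colSums(ranks)                    # sum over all K chunks, one value per strategy
}
```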
Table 2. Strengths and weaknesses of the incremental approaches.

The strategy names follow a structure built on the following options: the learning algorithm (RF, NNET, SVM), the balancing technique (Under, SMOTE, EasyEnsemble), the frequency of model creation (Daily, Weekly, 15days, One), the number of models M, the incremental strategy (Static, Update, Forget) and the number of training chunks K (or K_gen). The strategy options are concatenated using the dot as separation point (e.g. RF.Under.Daily.10M.Update.60K).

In both datasets, Random Forests clearly outperforms its competitors and, as expected, accuracy is improved by increasing the training size (Fig. 5). Because of the significant superiority of Random Forests with respect to the other algorithms, in what follows we will consider only this learning algorithm.
Fig. 5. Sum of ranks over all chunks (AP, AUC and PrecisionRank) of the static strategies (RF, NNET and SVM with undersampling; K = 30, 60, 90).

7.2.2. Is there an advantage in updating models?

Here we assess the advantage of adopting the update approach described in Section 6. Fig. 6 reports the results for different values of K and M.
The strategies are called daily if a model is built every day, weekly if once a week, or 15days if every 15 days. We compare ensemble strategies (M > 1) with models built on single chunks (K = 1) against single model strategies (M = 1) using several chunks in the training (K > 1).

For all metrics, the best strategy is RF.Under.Daily.5M.Update.90K. It creates a new model at each chunk using the previous 90 days (K = 90) for training and keeps the last 5 models created (M = 5) for predictions. In the case of AP, however, this strategy is not statistically better than the ensemble approaches ranked second.

For all metrics, the strategies that use only the current chunk to build a model (K = 1) are coherently the worst. This confirms the result of the previous analysis showing that a too short window of data
(and consequently a very small fraction of frauds) is insufficient to learn a reliable model.

When comparing the update frequency of the models using the same number of chunks for training (K = 90), daily update always ranks better than weekly and 15days updates. This confirms the intuition that the fraud distribution is always evolving and therefore it is better to update the models as soon as possible.

7.2.3. Is retaining old genuine transactions together with old frauds beneficial?

This section assesses the accuracy of the forgetting approach described in Section 6, whose rationale is to avoid discarding old fraudulent observations.

Accumulating old frauds leads to less unbalanced chunks. In order to avoid having chunks where the accumulated frauds
Fig. 6. Sum of ranks over all chunks (AP, AUC and PrecisionRank) of the update strategies (RF with undersampling, different update frequencies and values of M and K).
outnumber the genuine transactions, two options are available: (i) forgetting some of the old frauds or (ii) accumulating old genuine transactions as well. In the first case, when the accumulated frauds represent 40% of the transactions, new frauds replace old frauds as in Gao, Ding, Fan, Han, and Yu (2008). In the second case we accumulate genuine transactions from the previous K_gen chunks, where K_gen defines the number of chunks used (see Fig. 3).

Fig. 7 shows the sum of ranks for different strategies where the genuine transactions are taken from a different number of days (K_gen). The best strategy for AP and PrecisionRank uses an ensemble of 5 models for each chunk (M = 5) and 30 days of genuine transactions (K_gen = 30). The same strategy ranks third in terms of AUC and is significantly worse than the best. To create ensembles we use a time-based array of models of fixed size M, which
Fig. 7. Sum of ranks over all chunks (AP, AUC and PrecisionRank) of the forgetting strategies (RF with undersampling, different values of M and K_gen).
means that when the number of models available is greater than M, the most recent model replaces the M-th model in the array, removing the oldest model from the ensemble.

In general we see better performance when K_gen increases from 0 to 30, and only in a few cases does K_gen > 30 lead to significantly better accuracy. Note that in all our strategies, after selecting the observations to include in the training sets, we use undersampling to make sure the two classes are equally represented.

7.2.4. Do balancing techniques have an impact on accuracy?

So far we considered exclusively undersampling as the balancing technique in our experiments. In this section we assess the impact of using alternative methods like SMOTE and EasyEnsemble. Experimental results (Fig. 8) show that they both outperform undersampling.

In our datasets, the number of frauds is on average 0.4% of all transactions in the chunk. Undersampling randomly selects a
Fig. 8. Comparison of different balancing techniques and strategies using the sum of ranks over all chunks.
number of genuine transactions equal to the number of frauds, which means removing about 99.6% of the genuine transactions in the chunk. EasyEnsemble is able to reduce the variance of undersampling by using several sub-models for each chunk, while SMOTE creates new artificial fraudulent transactions. In our experiments we used 5 sub-models in EasyEnsemble. For all balancing techniques, among the three approaches presented in Section 6, the static approach is consistently the worst.
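The sketch below illustrates the EasyEnsemble idea in this setting: several independently undersampled sub-models are trained on the same chunk and their fraud probabilities are averaged. The use of Random Forest sub-models, the Class column and the default parameters are assumptions of the example; only the number of sub-models (5) is taken from our experiments.

```r
library(randomForest)

# EasyEnsemble-style variance reduction: several undersampled sub-models per chunk.
easy_ensemble_score <- function(chunk, newdata, n_sub = 5, ntree = 100) {
  frauds   <- chunk[chunk$Class == 1, ]
  genuines <- chunk[chunk$Class == 0, ]
  scores <- sapply(seq_len(n_sub), function(s) {
    sub <- rbind(frauds, genuines[sample(nrow(genuines), nrow(frauds)), ])
    sub$Class <- factor(sub$Class)
    rf <- randomForest(Class ~ ., data = sub, ntree = ntree)
    predict(rf, newdata = newdata, type = "prob")[, "1"]
  })
  rowMeans(scores)                  # average fraud probability of the sub-models
}
```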
    In Fig. 9 we compare the previous strategies in terms of average prediction time over all chunks. SMOTE is computationally heavy since it relies on oversampling, which leads to bigger chunk sizes. EasyEnsemble replicates undersampling and learns from several sub-chunks, which gives a higher computational time than plain undersampling. Among the different incremental approaches, static has the lowest time since the model is learnt once and never retrained. The forgetting strategy has the highest prediction time for all balancing methods; this is expected since it retains old transactions to deal with unbalanced chunks.
7.2.5. Overall, which is the best strategy?
    The large number of possible alternatives (in terms of learning classifier, balancing technique and incremental learning strategy) requires a joint assessment of several combinations in order to come up with a recommended approach. Fig. 10 summarises the best strategies in terms of different metrics. The combination of EasyEnsemble with forgetting emerges as best for all metrics. SMOTE with update is not significantly worse than the best for AP and PrecisionRank, but it does not rank well in terms of AUC. The fact that within the best strategies we see different balancing techniques confirms that in online learning, when the data is unbalanced, the adopted balancing strategy may play a major role. As expected, the static approach ranks low in Fig. 10 as it is not able to adapt to the changing distribution. The forgetting approach is significantly better than update for EasyEnsemble, while SMOTE gives a better ranking with update.
    It is worth noticing that strategies that combine more than one model (M > 1) together with undersampling are not superior to the predictions of a single model combined with EasyEnsemble. EasyEnsemble learns from different samples of the majority class, which means that for each chunk different concepts of the majority class are learnt.

8. Future work

    Future work will focus on the automatic selection of the best unbalanced technique in the case of online learning. Dal Pozzolo, Caelen, Waterschoot, and Bontempi (2013) recently proposed to use an F-race (Birattari, Stützle, Paquete, & Varrentrapp, 2002) algorithm to automatically select the correct unbalanced strategy for a given dataset. In their work, cross-validation is used to feed the data into the race. A natural extension of this work could be the use of racing in incremental settings, where the data fed into the race come from new chunks of the stream.
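As an illustration only, a stripped-down racing loop over candidate unbalanced strategies in a streaming setting could look like the sketch below. The original F-race relies on a Friedman test over all surviving candidates, whereas this toy version uses a paired Wilcoxon test against the current best; the strategy names and scoring function are assumptions of the example.

```python
import numpy as np
from scipy.stats import wilcoxon

def race(candidates, chunks, evaluate, alpha=0.05, min_chunks=5):
    """After each new chunk, score every surviving strategy and drop the ones
    whose chunk-wise scores are significantly worse than the current best."""
    scores = {c: [] for c in candidates}
    alive = set(candidates)
    for chunk in chunks:
        for c in list(alive):
            scores[c].append(evaluate(c, chunk))
        if len(scores[next(iter(alive))]) < min_chunks or len(alive) < 2:
            continue
        best = max(alive, key=lambda c: np.mean(scores[c]))
        for c in list(alive - {best}):
            _, p = wilcoxon(scores[best], scores[c])
            if p < alpha and np.mean(scores[c]) < np.mean(scores[best]):
                alive.remove(c)
    return alive, scores

# Toy usage: random numbers stand in for the AP obtained on each chunk.
rng = np.random.default_rng(0)
strategies = ["undersampling", "SMOTE", "EasyEnsemble"]
winners, _ = race(strategies, range(30),
                  evaluate=lambda c, _chunk: rng.normal(
                      loc={"undersampling": 0.20, "SMOTE": 0.25, "EasyEnsemble": 0.27}[c],
                      scale=0.03))
print(winners)
```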
    Throughout our paper we used only data-driven techniques to deal with the unbalanced problem. HDDT (Cieslak & Chawla, 2008) is a decision tree that uses the Hellinger distance (Hellinger, 1909) as a splitting criterion and is able to deal with skewed distributions. With HDDT, balancing methods are no longer needed before training. The use of such an algorithm could remove the need for propagating positive instances between chunks to fight the unbalanced problem.
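For reference, the Hellinger distance that HDDT uses to score a candidate binary split depends only on how each class is distributed over the two branches, not on the class priors, which is what makes it insensitive to skewness. The snippet below is a minimal illustration of that criterion (not the HDDT implementation), assuming binary labels with 1 marking frauds.

```python
import numpy as np

def hellinger_split_score(y_left, y_right):
    """Hellinger distance between the fraud and genuine distributions induced
    by a binary split (larger = better separation, maximum sqrt(2))."""
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    pos = np.sum(y_left == 1) + np.sum(y_right == 1)   # total frauds
    neg = np.sum(y_left == 0) + np.sum(y_right == 0)   # total genuine
    score = 0.0
    for branch in (y_left, y_right):
        p = np.sum(branch == 1) / pos   # fraction of all frauds in this branch
        n = np.sum(branch == 0) / neg   # fraction of all genuine in this branch
        score += (np.sqrt(p) - np.sqrt(n)) ** 2
    return np.sqrt(score)

# A split that isolates almost all frauds in one branch scores close to sqrt(2).
print(hellinger_split_score([1, 1, 1, 0], [0] * 996))
```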
Fig. 9. Comparison of different balancing techniques and strategies in terms of average prediction time (in seconds) over all chunks. Experiments were run on HP ProLiant BL465c G5 blades with two AMD Opteron 2.4 GHz CPUs (4 cores each) and 32 GB DDR3 RAM.
(Fig. 10: three panels — Metric: AP, Metric: AUC, Metric: PrecisionRank — ranking all strategies by sum of ranks; RF.EasyEnsemble.Daily.1M.Forget.90Kgen ranks first for every metric, while the static NNET and SVM strategies rank last; significantly best strategies are flagged.)
Fig. 10. Comparison of all strategies using sum of ranks in all chunks.
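The "sum of ranks" used in Figs. 8 and 10 can be reproduced from a matrix of per-chunk scores by ranking the strategies within each chunk and summing the ranks over the chunks; the sketch below is an illustrative reconstruction under that assumption (higher ranks for better scores), not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import rankdata

def sum_of_ranks(scores):
    """scores: array of shape (n_chunks, n_strategies) holding, e.g., the AP of
    every strategy on every chunk. Within each chunk the best strategy gets the
    highest rank; the ranks are then summed over all chunks."""
    ranks = np.vstack([rankdata(row) for row in scores])  # rank within each chunk
    return ranks.sum(axis=0)

# Toy example: 3 chunks (rows), 3 strategies (columns).
ap = np.array([[0.20, 0.25, 0.28],
               [0.18, 0.26, 0.27],
               [0.22, 0.24, 0.30]])
print(sum_of_ranks(ap))   # the last strategy accumulates the largest rank sum
```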
    In our work the combination of models in an ensemble is based on the performance of each model in the testing chunk. Several other methods (Kolter & Maloof, 2003; Hoens, Chawla, & Polikar, 2011; Lichtenwalter & Chawla, 2010) have been proposed to combine models in the presence of concept drift. In future work it would be interesting to test some of these methods and compare them to our framework.
    In this manuscript we assumed that there is a single concept to learn for the minority class. However, as frauds differ from each other, we could distinguish several sub-concepts within the positive class. Hoens et al. (2011) suggest using Naive Bayes to retain old positive instances that come from the same sub-concept. REA (Chen & He, 2011) and SERA (Chen & He, 2009), both proposed by Chen and He, propagate to the latest chunk only the minority-class instances that belong to the same concept, using the Mahalanobis distance and a k-nearest neighbors algorithm. Future work should take into consideration the possibility of having several minority concepts.
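As a rough sketch of that idea, in the spirit of SERA rather than a faithful reimplementation, old frauds could be propagated into the current chunk only when they lie close, in Mahalanobis distance, to the frauds already observed in that chunk; the quantile-based threshold and the regularised covariance are assumptions of the example.

```python
import numpy as np

def select_old_frauds(old_frauds, current_frauds, keep_fraction=0.5):
    """Return the previously seen frauds that lie closest, in Mahalanobis
    distance, to the frauds of the current chunk."""
    mu = current_frauds.mean(axis=0)
    # Regularised covariance so the inverse exists even with few frauds.
    cov = np.cov(current_frauds, rowvar=False) + 1e-6 * np.eye(current_frauds.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = old_frauds - mu
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared Mahalanobis distances
    threshold = np.quantile(d2, keep_fraction)
    return old_frauds[d2 <= threshold]

# Toy usage: 200 old frauds, 30 current frauds, 10 aggregated features each.
rng = np.random.default_rng(1)
old = rng.normal(size=(200, 10))
cur = rng.normal(loc=0.5, size=(30, 10))
print(select_old_frauds(old, cur).shape)   # roughly half of the old frauds kept
```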
9. Conclusion

    The need to detect fraudulent patterns in huge amounts of data demands the adoption of automatic methods. The scarcity of publicly available datasets of credit card transactions gives the community little chance to test and assess the impact of existing techniques on real data. The goal of our work is to give practitioners some guidelines on how to tackle the detection problem.
    The paper presents the fraud detection problem and proposes AP, AUC and PrecisionRank as correct performance measures for a fraud detection task. Frauds occur rarely compared to the total amount of transactions. As explained in Section 5.1, standard classification metrics such as Accuracy are not suitable for unbalanced problems. Moreover, the goal of detection is to give the investigators the transactions with the highest fraud risk. For this reason we argue that a good ranking of the transactions by their fraud probability is more important than having the transactions correctly classified.
    Credit card fraud detection relies on the analysis of recorded transactions. However, the information of a single transaction is not considered sufficient to detect a fraud occurrence (Bolton & Hand, 2001) and the analysis has to take the cardholder behaviour into consideration. In this paper we have proposed a way to include cardholder information in the transaction by computing aggregate variables on the historical transactions of the same card.
    As new credit-card transactions keep arriving, the detection system has to process them incrementally as soon as they arrive and avoid retaining too many old transactions in memory. Fraud types are in continuous evolution and detection has to adapt to fraudsters: once a fraud is detected, the fraudster may change his habits and find another way to commit fraud. Adaptive schemes are therefore required to deal with non-stationary fraud dynamics and to discover potentially new fraud mechanisms. We compare three alternative approaches (static, update and forgetting) to learn from unbalanced and non-stationary credit card data streams.
    Fraud detection is a highly unbalanced problem where the number of genuine transactions far outnumbers the fraudulent ones. In the static learning setting a wide range of techniques has been proposed to deal with unbalanced datasets. In incremental learning, however, few attempts have been made to deal with unbalanced data streams (Gao et al., 2007; Hoens et al., 2012; Ditzler, Polikar, & Chawla, 2010). In these works, the most common balancing technique consists in undersampling the majority class in order to reduce the skewness of the chunks.
    The best technique for unbalanced data may depend on several factors, such as (i) the data distribution, (ii) the classifier used and (iii) the performance measure adopted (Dal Pozzolo et al., 2013). In our work we adopted two alternatives to undersampling: SMOTE and EasyEnsemble. In particular, we show that they are both able to return higher accuracies. Our framework can easily be extended to include other data-level balancing techniques.
    The experimental part has shown that in online learning, when the data are skewed towards one class, it is important to maintain previous minority examples in order to learn a better separation of the two classes. Instance propagation from previous chunks has the effect of increasing the minority class in the current chunk, but its impact is limited given the small number of frauds. The update and forgetting approaches presented in Section 6 differ essentially in the way the minority class is oversampled in the current chunk. We tested several ensemble and single-model strategies using different numbers of chunks for training. In general we see that models trained on several chunks have better accuracy than single-chunk models. Multi-chunk models, however, learn on overlapping training sets, and when this happens single-model strategies can beat ensembles.
    Our framework addresses the problem of non-stationarity in data streams by creating a new model every time a new chunk is available. This approach has shown better results than updating the models at a lower frequency (weekly or every 15 days). Updating the models is crucial in a non-stationary environment; this intuition is confirmed by the bad results of the static approach. In our dataset, overall we saw Random Forest beating Neural Network and Support Vector Machine. The final best strategy implemented the forgetting approach together with EasyEnsemble and daily updates.

References

Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387–415.
Batista, G., Carvalho, A., & Monard, M. (2000). Applying one-sided selection to unbalanced datasets. MICAI 2000: Advances in Artificial Intelligence, 315–325.
Birattari, M., Stützle, T., Paquete, L., & Varrentrapp, K. (2002). A racing algorithm for configuring metaheuristics. In Proceedings of the genetic and evolutionary computation conference (pp. 11–18).
Bolton, R. J., & Hand, D. J. (2001). Unsupervised profiling methods for fraud detection. Credit Scoring and Credit Control, VII, 235–255.
Bolton, R., & Hand, D. (2002). Statistical fraud detection: A review. Statistical Science, 235–249.
Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural data mining for credit card fraud detection. In Tools with artificial intelligence, 1999. Proceedings (pp. 103–106). IEEE.
Chan, P., Fan, W., Prodromidis, A., & Stolfo, S. (1999). Distributed data mining in credit card fraud detection. Intelligent Systems and their Applications, 14, 67–74.
Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 853–867). Springer.
Chawla, N., Bowyer, K., Hall, L., & Kegelmeyer, W. (2011). SMOTE: Synthetic minority over-sampling technique. arXiv preprint arXiv:1106.1813.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6, 1–6.
Chen, S., & He, H. (2009). SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining. In International joint conference on neural networks, 2009. IJCNN 2009 (pp. 522–529). IEEE.
Chen, S., & He, H. (2011). Towards incremental learning of nonstationary imbalanced data stream: A multiple selectively recursive approach. Evolving Systems, 2, 35–50.
Chen, S., He, H., Li, K., & Desai, S. (2010). MuSeRA: Multiple selectively recursive approach towards imbalanced stream data mining. In The 2010 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.
Cieslak, D. A., & Chawla, N. V. (2008). Learning decision trees for unbalanced data. In Machine learning and knowledge discovery in databases (pp. 241–256). Springer.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283.
Cohen, W. W. (1995). Fast effective rule induction. In Machine learning: International workshop then conference (pp. 115–123). Morgan Kaufmann Publishers, Inc.
Dal Pozzolo, A., Caelen, O., Waterschoot, S., & Bontempi, G. (2013). Racing for unbalanced methods selection. In Proceedings of the 14th international conference on intelligent data engineering and automated learning, IDEAL.
Delamaire, L., Abdou, H., & Pointon, J. (2009). Credit card fraud and detection techniques: A review. Banks and Bank Systems, 4, 57–68.
R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. <http://www.R-project.org/>. ISBN 3-900051-07-0.
Ditzler, G., Polikar, R., & Chawla, N. (2010). An incremental learning algorithm for non-stationary environments and class imbalance. In 2010 20th international conference on pattern recognition (ICPR) (pp. 2997–3000). IEEE.
Dorronsoro, J., Ginel, F., Sánchez, C., & Cruz, C. (1997). Neural fraud detection in credit card operations. Neural Networks, 8, 827–834.
Drummond, C., & Holte, R. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II. Citeseer.
Elkan, C. (2001). The foundations of cost-sensitive learning. In International joint conference on artificial intelligence (Vol. 17, pp. 973–978). Citeseer.
Fan, G., & Zhu, M. (2011). Detection of rare items with TARGET. Statistics and Its Interface, 4, 11–17.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675–701.
Gao, J., Ding, B., Fan, W., Han, J., & Yu, P. S. (2008). Classifying data streams with skewed class distributions and concept drifts. Internet Computing, 12, 37–49.
Gao, J., Fan, W., Han, J., & Philip, S. Y. (2007). A general framework for mining concept-drifting data streams with skewed distributions. In SDM.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.
Hand, D. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77, 103–123.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–271.
Hoens, T. R., Chawla, N. V., & Polikar, R. (2011). Heuristic updatable weighted random subspaces for non-stationary environments. In 2011 IEEE 11th international conference on data mining (ICDM) (pp. 241–250). IEEE.
Hoens, T. R., Polikar, R., & Chawla, N. V. (2012). Learning from streaming data with concept drift and imbalance: An overview. Progress in Artificial Intelligence, 1, 89–101.
Holte, R. C., Acker, L. E., & Porter, B. W. (1989). Concept learning and the problem of small disjuncts. In Proceedings of the eleventh international joint conference on artificial intelligence (Vol. 1). Citeseer.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6, 429–449.
Kolter, J. Z., & Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 123–130). IEEE.
Kuncheva, L. I. (2004). Classifier ensembles for changing environments. In Multiple classifier systems (pp. 1–15). Springer.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Lichtenwalter, R. N., & Chawla, N. V. (2010). Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. In New frontiers in applied data mining (pp. 53–75). Springer.
Liu, X., Wu, J., & Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. Systems, Man, and Cybernetics, Part B: Cybernetics, 39, 539–550.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2012). e1071: Misc functions of the Department of Statistics (e1071), TU Wien. <http://CRAN.R-project.org/package=e1071>. R package version 1.6-1.
Olshen, L., & Stone, C. (1986). Classification and regression trees. Wadsworth International Group.
Oza, N. C. (2005). Online bagging and boosting. In Systems, man and cybernetics (Vol. 3, pp. 2340–2345). IEEE.
Pavía, J. M., Veres-Ferrer, E. J., & Foix-Escura, G. (2012). Credit card incidents and control systems. International Journal of Information Management, 32, 501–503.
Polikar, R., Upda, L., Upda, S., Honavar, V., et al. (2001). Learn++: An incremental learning algorithm for supervised neural networks. Systems, Man, and Cybernetics, Part C: Applications and Reviews, 31, 497–508.
Pozzolo, A. D. (2014). unbalanced: The package implements different data-driven methods for unbalanced datasets. R package version 1.0.
Provost, F. (2000). Machine learning from imbalanced data sets 101. In Proceedings of the AAAI'2000 workshop on imbalanced data sets.
Quah, J. T., & Sriganesh, M. (2008). Real-time credit card fraud detection using computational intelligence. Expert Systems with Applications, 35, 1721–1732.
Quinlan, J. (1993). C4.5: Programs for machine learning (Vol. 1). Morgan Kaufmann.
Street, W. N., & Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. 377–382). ACM.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. ISBN 0-387-95457-0. <http://www.stats.ox.ac.uk/pub/MASS4>.
Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 226–235). ACM.
Weston, D., Hand, D., Adams, N., Whitrow, C., & Juszczak, P. (2008). Plastic card fraud detection using peer group analysis. Advances in Data Analysis and Classification, 2, 45–62.
Whitrow, C., Hand, D. J., Juszczak, P., Weston, D., & Adams, N. M. (2009). Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery, 18, 30–55.
Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23, 69–101.