Lec 2

The document outlines a lecture on Exploratory Data Analysis (EDA) using R, focusing on insurance fraud schemes and data visualization techniques. It discusses various analytical methods, including univariate and multivariate analysis, as well as clustering techniques for fraud detection. The content also covers the use of specific datasets and the importance of defining fraud scenarios and selecting appropriate data analytics strategies.

12/8/2022

FITE7410 Lecture 2
Lecturers: Dr. Vivien CHAN, Annie CHAN
Tutor: Ms. Yanan GONG
Department of Computer Science
The University of Hong Kong

Today’s Agenda
01 Exploratory Data Analysis (EDA) using R
02 Insurance Fraud Scheme - Cluster Analysis
03 Financial Statement Fraud Scheme - Intro

01
Exploratory Data Analysis (EDA) using R
Dr. Vivien CHAN

Data Visualization with R
Rob Kabacoff (2020)
https://rkabacoff.github.io/datavis/index.html
• What are the goals of EDA?
• How to achieve the goals?

Example of EDA using R
• Dataset : https://www.kaggle.com/mlg-ulb/creditcardfraud
• The Credit Card Fraud Detection Dataset comprises transactions that European credit card holders made in September 2013. The dataset shows transactions that occurred in two days.
• The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

Example – Loading packages and libraries
#library for correlations
#library for plotting the samples

Example - Step1: Distinguish Attributes
>str(data)
>summary(data)
What kind of initial information can you get from this preliminary exploration?

Example – Step 2: Univariate Analysis
What does this histogram tell you about the variable “Amount”?
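A minimal sketch of Step 1 and the first histogram, assuming a data frame called `data` with `Time`, `Amount` and `Class` columns as in the Kaggle credit card file. The data here is synthetic (an assumption, so the snippet runs without the download); the real file would be read with read.csv() instead.

```r
# Synthetic stand-in for the credit card data (assumption for illustration)
set.seed(1)
data <- data.frame(
  Time   = sort(runif(1000, 0, 172800)),                     # seconds over two days
  Amount = round(rlnorm(1000, meanlog = 3, sdlog = 1.5), 2), # right-skewed amounts
  Class  = rbinom(1000, 1, 0.02)                             # 1 = fraud, 0 = genuine
)

# Step 1: distinguish attributes (types, ranges, missing values)
str(data)
summary(data)
print(colSums(is.na(data)))

# Step 2: histogram of Amount.
# plot = FALSE returns the bin counts; drop it to draw the chart.
h <- hist(data$Amount, breaks = 50, plot = FALSE)
print(head(h$counts))
```

Most of the mass sits in the first few bins, which is the usual sign of a heavily right-skewed variable.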

Example – Step 2: Univariate Analysis
Now, with the “Amount” limited to under $200, what does this histogram tell you about the variable “Amount”?

Example – Step 2: Univariate Analysis
** Examples of boxplot

Example – Step 2: Univariate Analysis
** Examples of bar charts
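The three univariate views above (restricted histogram, boxplot, bar chart) could be produced roughly as follows. This is a sketch on synthetic data; the column names `Amount` and `Class` are assumed to match the credit card dataset.

```r
# Synthetic stand-in data (assumption for illustration)
set.seed(1)
data <- data.frame(
  Amount = round(rlnorm(1000, meanlog = 3, sdlog = 1.5), 2),
  Class  = rbinom(1000, 1, 0.02)
)

# Histogram of Amount restricted to values under 200.
# plot = FALSE returns the bin counts; drop it to draw the chart.
small <- subset(data, Amount < 200)
h <- hist(small$Amount, breaks = seq(0, 200, by = 10), plot = FALSE)
print(h$counts)

# Boxplot statistics (lower whisker, Q1, median, Q3, upper whisker) per class;
# boxplot(Amount ~ Class, data = data) would draw them.
print(tapply(data$Amount, data$Class, function(x) boxplot.stats(x)$stats))

# Bar chart input: number of transactions per class;
# barplot(class_counts) would draw it.
class_counts <- table(data$Class)
print(class_counts)
```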

Example – Step 3: Bi-/Multi-variate Analysis
What insights can you get from this figure?

Example – Step 3: Bi-/Multi-variate Analysis
What insights can you get from the correlations among the attributes?

02
Insurance Fraud Scheme
Dr. Vivien CHAN

Insurance Frauds
• Insurance Fraud - any type of insurance
  • Life insurance
  • Health insurance
  • Automobile insurance
  • …
• Insurance Fraud can be committed by:
  • Insurance agent/provider
  • Insurance subscriber
  • Service provider
  • Conspiracy fraud – involving more than one party
  • …

Insurance Frauds
Insurance agent/provider
• Selling policies from non-existent companies
• Failing to submit premiums
• Churning policies to create more commissions
• …
Insurance subscriber
• Exaggerated claims
• Falsified medical history
• Post-dated policies
• Faked death
• Faked damage
• …
Service provider
• Billing services that are not performed
• Overbilling, e.g. issue invoices for services with higher fees than actually performed
• Providing unnecessary services
• …
Conspiracy fraud
• Patient colluding with doctors
• Automobile owner colluding with car repair shops
• …

Overview: Fraud Data Analytics Methodology
1. Define scope of Fraud Data Analytics (starting point)
2. Fraud Scenario Identification
3. Data Analytics Strategies for Fraud Detection
4. Selection of Fraud Data Analytics Model
BUT the process is cyclical, NOT linear.

Health Insurance Fraud Case
• Background
  • The data used in this study is purchased from the Centers for Medicare & Medicaid Services (in USA).
  • Medicare provides three types of services:
    • hospital insurance (part A), medical insurance (part B) and prescription drug coverage (parts C and D).
  • Medicaid is a state administered program and each state sets its own guidelines regarding eligibility and services
    • it is available only to low-income individuals and families as determined by federal and state law

1: Define scope of Fraud Data Analytics
• Example: Health insurance claims
• What are the objectives of the fraud data analytics?
  • To identify any suspicious or fraudulent insurance claims
  • Specifically: to find out any health insurance frauds committed by either the insurance provider or the insurance subscriber

2: Fraud Scenario Identification
• Who is the committing person?
  • Subscriber of the health insurance
  • Provider of the health insurance
• What are the possible entities involved?
  • Service provider (which can be real or fake; if real, can be either complicit or non-complicit)
• What are the possible fraudulent actions?
  • Exaggerated claims
  • Falsified medical history
  • Post-dated policies
  • Faked injuries
  • …
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
  • Committing person: Insurance subscriber
  • Fraudulent action: Faked injuries; Exaggerated claims; Falsified medical history
  • Entity: Fake Service Provider; Real Complicit Service Provider; Real Non-Complicit Service Provider

3: Data Analytics Strategies for Fraud Detection
• What are the data that you need to collect? Or select as samples for further data analysis?
Q: How to select the appropriate attributes for further data analysis?

3: Data Analytics Strategies for Fraud Detection
• 10 attributes most related to the insurance claims are selected
  (Attribute table not preserved; its annotation reads: “These 3 are number values of different scales”)
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

3: Data Analytics Strategies for Fraud Detection
• Data transformation

3: Data Analytics Strategies for Fraud Detection
• 1/ Specific identification strategy
  • E.g. health insurance subscriber with claims inconsistent with the claim policy
• 2/ Internal control avoidance
  • E.g. insurance claims made during a period outside of the hospital stay period
• 3/ Data interpretation
  • E.g. excessive or questionable claim amounts
• 4/ Number anomaly
  • E.g. pattern and frequency of claims associated with an insurance subscriber
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

4: Selection of Fraud Data Analytics Model
• 2nd technique: Cluster analysis
• The aim of clustering is to group a set of observations into groups (or clusters), so that:
  • The homogeneity within the cluster is maximized, i.e. observations in the same cluster tend to be similar to each other
  • The heterogeneity between the clusters is maximized, i.e. observations in different clusters are dissimilar
• Examples include:
  • Clustering transactions in a credit card setting
  • Clustering claims in an insurance setting
  • Clustering tax statements in a tax-inspecting setting
  • Clustering cash transfers in an anti-money laundering setting

Clustering
Dr. Vivien CHAN

Cluster analysis - example
• Clustering is:
  • Unsupervised classification
  • No predefined target class
  • Number of clusters unknown
  • Meaning of clusters unknown
• Clusters can be ambiguous – how many clusters are there?
  3 clusters? 5 clusters?

Distance Metrics
• Q: How to measure the similarity and dissimilarity between observations in a group or cluster?
• A: Use a distance metric to quantify the similarity
• Types of distance metrics
  • Metrics for continuous variables
    • Minkowski distance
    • Pearson correlation
  • Metrics for categorical variables
    • Simple matching coefficient (SMC)
    • Jaccard index

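Distance Metrics – continuous variables
• When the input variables are continuous variables, use Minkowski distance between two observations $x_i$ and $x_j$ with $n$ variables:
  $D(x_i, x_j) = \left( \sum_{k=1}^{n} |x_{ik} - x_{jk}|^p \right)^{1/p}$

Manhattan (or city block) distance
• When p = 1, Manhattan (or city block) distance, for X1 = (50,20) and X2 = (30,10):
  $D(X_1, X_2) = |x_{11} - x_{21}| + |x_{12} - x_{22}| = |50 - 30| + |20 - 10| = 30$

Euclidean distance
• When p = 2, Euclidean distance, for X1 = (50,20) and X2 = (30,10):
  $D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2} = \sqrt{(50 - 30)^2 + (20 - 10)^2} = \sqrt{500} \approx 22.36$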
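The two worked distances can be checked in R. The helper `minkowski()` below is a hypothetical name written for this sketch; `dist()` is the built-in function from the stats package and gives the same values.

```r
x1 <- c(50, 20)
x2 <- c(30, 10)

# Minkowski distance of order p between two numeric vectors (helper for this sketch)
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

manhattan <- minkowski(x1, x2, p = 1)   # |50-30| + |20-10|
euclidean <- minkowski(x1, x2, p = 2)   # sqrt(20^2 + 10^2)

# The same values via the built-in dist() function
d_man <- dist(rbind(x1, x2), method = "manhattan")
d_euc <- dist(rbind(x1, x2), method = "euclidean")

print(c(manhattan = manhattan, euclidean = euclidean))
```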

Distance metrics – categorical variables
• When input variables are categorical variables, can use simple matching coefficient (SMC) or Jaccard index
• SMC
  • Calculates the number of identical matches between the variable values
  • Assumption of SMC is that “Yes” and “No” are of equal weights
• Jaccard index
  • Similar to SMC, but leaves out the “No-No” matches
  • Measures the similarity between observations across those red flags that were raised at least once
  • Especially useful in situations where many red-flag indicators are available and typically only a few are raised

SMC & Jaccard - example
• For example, binary variables (Yes or No) that are used as a series of red-flag indicators to label a claim as suspicious or not
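A small sketch of both coefficients on two claims. The five red-flag indicators and the function names `smc()` and `jaccard()` are hypothetical, chosen only for illustration.

```r
# Two claims described by five hypothetical Yes/No red-flag indicators
claim_a <- c(TRUE, FALSE, FALSE, TRUE,  FALSE)
claim_b <- c(TRUE, FALSE, FALSE, FALSE, FALSE)

# SMC: share of indicators on which the two claims agree (Yes-Yes or No-No)
smc <- function(a, b) mean(a == b)

# Jaccard index: Yes-Yes matches divided by indicators raised at least once
# (undefined when neither claim raises any flag)
jaccard <- function(a, b) sum(a & b) / sum(a | b)

print(smc(claim_a, claim_b))      # 4 of 5 indicators agree
print(jaccard(claim_a, claim_b))  # 1 shared flag out of 2 raised
```

The two No-No matches push SMC up to 0.8, while the Jaccard index ignores them and gives 0.5, which is why Jaccard is preferred when flags are rarely raised.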

Types of Clustering Algorithms
Connectivity-based
• Clusters are formed by connecting data points according to their distance.
• e.g. Hierarchical clustering
Centroid-based
• Clusters are formed by connecting data points nearest to the centroid of a cluster. The centroid need not be an existing data point.
• e.g. K-means clustering
Distribution-based
• Clusters are formed by how probable it is for a data point to belong to a certain distribution, e.g. Gaussian distribution.
• e.g. Gaussian Mixture Models (GMM) clustering
Density-based
• Clusters are defined as areas of higher density within the data space compared to other regions. Data points in sparse regions are considered to be “noise”.
• e.g. DBSCAN, OPTICS

K-means Clustering
Dr. Vivien CHAN

Some basics
• Non-hierarchical procedure, it is a partitioning method - partitions n observations into K clusters
• Given a K, find a partition of K clusters that optimizes the chosen partitioning criterion – k-means, i.e. each cluster is represented by the center (or mean) of the cluster
• The number of clusters, K, needs to be specified before the start of analysis, e.g. expert-based or result of another clustering procedure (e.g. hierarchical)

Algorithm of k-means clustering
• Step 1
  • Select k observations as initial cluster centroids (seeds)
• Step 2
  • Assign each observation to the cluster that has the closest centroid (e.g. use Euclidean distance)
• Step 3
  • When all observations have been assigned, recalculate the positions of the k centroids
• Step 4
  • Repeat Steps 2 and 3 until the cluster centroids no longer change or a fixed number of iterations is reached

Example – k=2
• K=2
• Randomly select 2 observations as the seeds
  (Figure annotation: These 2 are selected seeds)

Example – k=2
• Assign each observation to the cluster of its closest seed

Example – k=2
• Recalculate the cluster centroids
  (Figure annotation: These 2 are the new centroids of the 2 clusters)

Example – k=2
• Reassign the observations based on the new centroids
  (Figure annotation: These 2 are the new centroids of the 2 clusters)

Example – k=2
• Recalculate the centroids of the 2 new clusters
  (Figure annotation: These 2 are the new centroids of the 2 new clusters)
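The four steps can be written out by hand in a few lines of base R. This is a teaching sketch on toy data (an assumption), not the implementation used by the built-in `kmeans()` function, which uses the Hartigan-Wong algorithm by default.

```r
# Toy data: two well-separated groups in 2-D (assumption for illustration)
set.seed(42)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
k <- 2

# Step 1: select k observations as initial centroids (seeds)
centroids <- X[sample(nrow(X), k), , drop = FALSE]

for (iter in 1:100) {
  # Step 2: assign each observation to the closest centroid
  # (squared Euclidean distance gives the same ranking as Euclidean)
  d <- sapply(1:k, function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")

  # Step 3: recalculate the centroids as the cluster means
  new_centroids <- t(sapply(1:k, function(j)
    colMeans(X[cluster == j, , drop = FALSE])))

  # Step 4: stop when the centroids no longer change
  if (all(abs(new_centroids - centroids) < 1e-9)) break
  centroids <- new_centroids
}

print(centroids)
print(table(cluster))
```

In practice `kmeans(X, centers = 2, nstart = 25)` does the same job and restarts from several random seeds, which reduces the risk of ending in a poor local optimum.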

Example – k=2
• Reassign observations to the closest new centroids
• Recalculate the centroids of the clusters
• Stop when the centroids no longer change
  (Figure annotation: These 2 new centroids are the same as the previous 2 centroids)

K-means Clustering
Advantages
• Relatively computationally efficient as compared with hierarchical clustering
• Simple to implement and scales to large data sets.
Disadvantages
• Need to specify k, the number of clusters, in advance
• Often terminates at a local optimum. Need to try different initial cluster centres
• Unable to handle noisy data and outliers
We’ll talk about how to find the OPTIMAL k later today, i.e. how to decide the number of clusters.

How to interpret Clustering output?
Dr. Vivien CHAN

Cluster Interpretation – compare clusters
• Cluster C1 has observations with low recency values and high monetary values, whereas the frequency is similar to the original population

Cluster interpretation – classification tree
• Compute a decision tree by using the cluster id as the target variable
• Supervised learning techniques can be used to interpret and explain unsupervised learning models
• For example, (figure not preserved)
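The "compare clusters" idea can be sketched by putting the per-cluster means next to the population means. The recency/frequency/monetary data below is synthetic and the column names are assumptions. A classification tree on the cluster id could then be fitted with a tree package such as rpart, which is not shown here.

```r
# Hypothetical RFM-style data (assumption for illustration)
set.seed(7)
rfm <- data.frame(
  recency   = c(rnorm(30, 10, 2),   rnorm(30, 60, 5)),
  frequency = c(rnorm(30, 5, 1),    rnorm(30, 5, 1)),
  monetary  = c(rnorm(30, 900, 50), rnorm(30, 100, 20))
)

# Cluster on standardized variables
km <- kmeans(scale(rfm), centers = 2, nstart = 10)

# Mean of each variable per cluster, next to the overall population mean
cluster_means <- aggregate(rfm, by = list(cluster = km$cluster), FUN = mean)
population    <- colMeans(rfm)

print(cluster_means)
print(population)
```

A cluster whose mean differs strongly from the population mean on a variable (here recency and monetary) is characterized by that variable; a variable with similar means (here frequency) does not distinguish the clusters.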

Evaluation of Clustering Solutions
• For supervised learning models, we can use accuracy, precision, recall, F1 score, ROC-AUC to measure the performance.
• Q: How to measure the performance of clustering?
• A: There exists no universal criterion for clustering performance evaluation
• Evaluate clustering solutions from 2 perspectives
  • Statistical perspective
  • Interpretability perspective

How good is a clustering solution?
• High within-cluster similarity
• Low between-cluster similarity

How good is a clustering solution?
• Statistically, can use the sum of squared errors (SSE) as a measure of similarity:
  $SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $\mu_k$ is the centroid of cluster $C_k$
• How to use SSE to evaluate a clustering solution?
  • The lower the SSE for a particular cluster (WSS), the more homogeneous is that cluster, i.e. higher within-cluster similarity
  • The higher the SSE among different clusters (BSS), the more heterogeneous are the clusters, i.e. lower between-cluster similarity

Elbow method
• Elbow method – makes use of Within-cluster SSE (WSS)
• Steps:
  1. Compute the k-means clustering models for different k values
  2. For each k, calculate the WSS
  3. Plot WSS according to the value of k
  4. Find the point where there is a bend (or elbow) in the curve. This is an indicator of the optimal number of clusters

Source: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods
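A sketch of the elbow steps using only the built-in `kmeans()` function on toy data (an assumption). The factoextra package loaded later in the lecture offers `fviz_nbclust()` for the same plot.

```r
set.seed(123)
# Toy data with three groups (assumption for illustration)
X <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 5),  ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))

# Steps 1-2: fit k-means for k = 1..8 and record the total within-cluster SSE
k_values <- 1:8
wss <- sapply(k_values, function(k)
  kmeans(X, centers = k, nstart = 20)$tot.withinss)

# Step 3: WSS against k; plot(k_values, wss, type = "b") would draw the curve
print(data.frame(k = k_values, wss = round(wss, 1)))

# WSS and BSS for one solution: total SS = within SS + between SS
km3 <- kmeans(X, centers = 3, nstart = 20)
print(c(total = km3$totss, within = km3$tot.withinss, between = km3$betweenss))
```

Step 4 is a visual judgment: look for the k after which adding another cluster no longer reduces WSS by much.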

Average silhouette method
• Silhouette analysis estimates the average distance between clusters.
• Defined as follows:
  • Silhouette value of data point i:
    $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
    where $a(i)$ is the mean distance of data point i to other data points in the same cluster and $b(i)$ is the smallest mean distance of data point i to data points in other clusters

Average silhouette method
• Mean distance (a(i)) between observation i and all other data points in the same cluster:
  $a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j)$
  where $d(i, j)$ is the distance between data points i and j in the same cluster and $|C_i|$ is the total number of data points belonging to the same cluster
• Smallest mean distance (b(i)) of i to all data points in other clusters:
  $b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$
  where $d(i, j)$ is the distance between data point i and a data point j in a different cluster $C_k$

Average silhouette method
• Silhouette estimates the average distance between clusters
• Steps: (similar to Elbow method)
  1. Compute the k-means clustering models for different k values
  2. For each k, calculate the average silhouette of the observations (avg sil)
  3. Plot avg sil according to the value of k
  4. Find the k at which avg sil reaches its maximum. This is an indicator of the optimal number of clusters

Source: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods
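The silhouette definition can be coded directly from a(i) and b(i). The function `avg_silhouette()` below is a hypothetical helper written for this sketch on toy data; the cluster package provides a ready-made `silhouette()` function.

```r
set.seed(123)
# Toy data with three groups (assumption for illustration)
X <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 5),  ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))
D <- as.matrix(dist(X))   # pairwise Euclidean distances

# Average silhouette width for a vector of cluster labels
avg_silhouette <- function(D, cl) {
  s <- sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                  # convention for singleton clusters
    a <- sum(D[i, own]) / (sum(own) - 1)          # mean distance within own cluster
    b <- min(tapply(D[i, !own], cl[!own], mean))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Steps 1-2: average silhouette for k = 2..6
k_values <- 2:6
avg_sil <- sapply(k_values, function(k)
  avg_silhouette(D, kmeans(X, centers = k, nstart = 20)$cluster))

# Steps 3-4: inspect the values and take the k with the largest average
print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```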

EDA – example : health care insurance fraud
• A noteworthy feature in the distribution is the existence of negative payments.
• What these negative payments mean and in which situation they occur need to be identified and verified
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

EDA – example : health care insurance fraud
• The maximum value of beneficiary’s hospital stay period is 668, whereas there are only 365 days in a year.
• Therefore, a beneficiary should not stay in the hospital for more than 365 days within a year.
• In the raw data, 28 beneficiaries who have spent more than 365 days in hospital are found

Cluster Analysis – example : health care insurance fraud
• Selection of number of clusters
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).
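The two EDA findings above (negative payments, hospital stays longer than 365 days) correspond to simple data checks. The sketch below uses a made-up claims table; the column names `beneficiary_id`, `stay_days` and `payment` are assumptions, not the names in the Medicare data.

```r
# Hypothetical claims table (assumption for illustration)
claims <- data.frame(
  beneficiary_id = c("A", "A", "B", "C", "C", "C"),
  stay_days      = c(200, 250, 12, 5, 3, 7),
  payment        = c(12000, 15000, -800, 450, 300, 520)
)

# Negative payments: list them so their meaning can be identified and verified
negative <- subset(claims, payment < 0)

# Total hospital stay per beneficiary; more than 365 days within a year
# cannot be right and needs to be investigated
total_stay <- aggregate(stay_days ~ beneficiary_id, data = claims, FUN = sum)
too_long   <- subset(total_stay, stay_days > 365)

print(negative)
print(too_long)
```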

Cluster Analysis Result
• 7 cluster results

Interpretation of each cluster
• Claims in clusters 1, 5, 6, and 7 have relatively short travel distance, short hospital stay period, and small amount of payment.
• Cluster 2 relates to long hospital stay period, short travel distance, and relatively large amount of payment.

Interpretation of each cluster
• Clusters 3 and 4 contain 3,671 (0.0%) and 47 claims (0.0%), respectively.
• Claims in cluster 3 have long travel distance, short hospital stay period, and small payment amount.
• Cluster 4 contains claims with large payment amount and short hospital stay period. This is a new abnormal pattern revealed in this analysis
The seven-cluster analysis reduces suspicious claims from 195,343 to 3,718 (3,671 + 47), which are more feasible to examine.
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

Cluster Analysis using R
Dr. Vivien CHAN

Example using R
• Dataset : https://www.kaggle.com/ealaxi/banksim1
• Source:
  Lopez-Rojas, Edgar Alonso ; Axelsson, Stefan
  Banksim: A bank payments simulator for fraud detection research. In proceedings 26th European Modeling and Simulation Symposium, EMSS 2014, Bordeaux, France, pp. 144–152, Dime University of Genoa, 2014, ISBN: 9788897999324.
  https://www.researchgate.net/publication/265736405_BankSim_A_Bank_Payment_Simulation_for_Fraud_Detection_Research
• BankSim is an agent-based simulator of bank payments based on a sample of aggregated transactional data provided by a bank in Spain. The main purpose of BankSim is the generation of synthetic data that can be used for fraud detection research.
• R sample code : https://www.kaggle.com/andradaolteanu/ii-fraud-

Example using R: Step 1-EDA
What is your preliminary observation of this dataset?

Example: After EDA
• Normal Behaviour
  • transaction amounts fairly small (under $500)
  • payments for transportation and food transactions (don’t have any fraud cases)
  • there are some merchants that don’t have any cases of fraud within their transactions
• Abnormal Behaviour:
  • transactions with high amounts (above $500)
  • transactions made during travel or for leisure activities (like sports/toys expenditure, hotels etc.)
  • there are some merchants where all transactions made to them are fraud

Example using R: Step 2 -Feature Engineering
Questions to ask yourself:
• Do you need to recode character variables to numeric variables?
• Do you need to create new variables?
• Do you need to remove any of the variables which are not useful when building your data model?
• Do you need to handle missing values or erroneous values?
• Do you need to standardize or normalize your variables?
(Figures: original dataset; dataset after feature engineering)

Example using R: Step 3 -Cluster analysis
• library(tidyverse) # data manipulation
• library(cluster) # clustering algorithms
• library(factoextra) # clustering algorithms & visualization
• library(caret) # streamline model training process
To perform a cluster analysis in R, generally, the data should be prepared as follows:
1. Any missing value in the data must be removed or estimated.
2. The data must be standardized (i.e., scaled) to make variables comparable.
   • use “scale” to standardize the dataset
3. The data frame needs to be a matrix
   • use “as.matrix” to transform the data frame: rows are observations (individuals) and columns are variables

Example in R : Scale the dataset
“data[-c(5)]” means removing column 5
QUESTION : WHY DO WE NEED TO REMOVE THIS COLUMN?

Example in R: Finding the optimal number of clusters
Another function for the scree plot is also shown on the slide (code not preserved).
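The three preparation steps, followed by a k-means run, might look like this. The data frame is synthetic, and treating column 5 as the fraud label is an assumption made to mirror `data[-c(5)]`; if so, the label is dropped because an unsupervised method should not see the target it is later compared against.

```r
# Hypothetical data frame: column 5 is assumed to be the fraud label
set.seed(1)
data <- data.frame(
  age      = sample(18:80, 100, replace = TRUE),
  amount   = round(rlnorm(100, 3, 1), 2),
  category = sample(1:5, 100, replace = TRUE),
  merchant = sample(1:20, 100, replace = TRUE),
  fraud    = rbinom(100, 1, 0.05)
)

# 1. Remove (or impute) missing values
data <- na.omit(data)

# 2. Drop column 5 and standardize the rest to mean 0, sd 1
scaled <- scale(data[-c(5)])

# 3. scale() already returns a matrix; as.matrix() makes that explicit
m <- as.matrix(scaled)

print(round(colMeans(m), 10))   # approximately 0 for every column
print(apply(m, 2, sd))          # 1 for every column

# k-means on the prepared matrix, then compare clusters with the held-out label
km <- kmeans(m, centers = 3, nstart = 25)
comparison <- table(cluster = km$cluster, fraud = data$fraud)
print(comparison)
```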

Example in R : k-means clustering

Example in R : Examine the clusters
(Figures: 3 cluster results; original labelled dataset)

03
Financial Statement Fraud Scheme
Dr. Vivien CHAN

Financial Statement Frauds
• Financial Statement Fraud
  • Deliberate misrepresentation of the financial condition of a company, e.g. omission of amounts or disclosures in the financial statements, with the intention to deceive or mislead the users of the financial statements
• Top 10 accounting scandals
  • Waste Management (1998)
  • Enron (2001)
  • WorldCom (2002)
  • Tyco (2002)
  • HealthSouth (2003)
  • Freddie Mac (2003)
  • American International Group (AIG) (2005)
  • Lehman Brothers (2008)
  • Bernie Madoff (2008)
  • Satyam (2009)
Source: https://corporatefinanceinstitute.com/resources/knowledge/other/top-accounting-scandals/

Famous case – Enron Scandal
• Enron deliberately misstated profits and cash flows and understated liabilities with the use of creative, yet questionable accounting methods.
• A disguised loan in 1999 in which the proceeds from the sale of bonds were reported as cash from operations. This overstated operating cash flow by $700 million. With the use of mark-to-market accounting, Enron recognized a very significant amount of future earnings as current income. This allowed a certain business unit to report quarterly profit of $40 million when in fact, this unit was actually operating at a loss. Another loan transaction was understated by $4.85 billion.

Overview: Fraud Data Analytics Methodology
1. Define scope of Fraud Data Analytics (starting point)
2. Fraud Scenario Identification
3. Data Analytics Strategies for Fraud Detection
4. Selection of Fraud Data Analytics Model
BUT the process is cyclical, NOT linear.

Background
• Specific problems of Financial Statement Fraud detection:
  1. the ratio of fraud to nonfraud firms is small
  2. the ratio of false positive to false negative misclassification costs is small
  3. the attributes used to detect fraud are relatively noisy, where similar attribute values can signal both fraudulent and nonfraudulent activities; and
  4. fraudsters actively attempt to conceal the fraud, thereby making fraud firm attribute values look similar to nonfraud firm attribute values.

Background
• Data sample
  (table not preserved)

1: Define scope of Fraud Data Analytics
• What is the objective and scope of fraud data analytics?
  • To identify the algorithms and predictors to use when creating new models for financial statement fraud detection under specific class and cost imbalance ratios

Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.

(Re-visit) Inherent Fraud Scheme
Committing person
• The person committing the fraud
• The person can be internal or external
• The person who has direct or indirect access to the database
Entity
• Attachment of transaction in the business system
• E.g. in payroll system, ‘employee’ is the entity
• In credit card system, ‘card number’ is the entity
Fraudulent action
• Fraudulent action describes how the transaction is recorded
• Concerns whether the transaction or the entity is false or real
• Fraudulent action links committing person and entity
• E.g. payment of vendor without purchase order
Committing person, entity and fraudulent action together define the fraud scenario.

2: Fraud Scenario Identification
Inherent Fraud Scheme for Financial Statement Fraud
Committing person
• The person committing the fraud would be less critical
• Usually would be senior management or controller
Transactions in Financial statement
• Direction of misstatement, i.e. general ledger account is overstated or understated; which financial year, etc.
• General ledger account, i.e. whether the transactions recorded are real or fake
Entity
• e.g. Shell company, customers (real or fake), vendor (real or fake)

2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
  • Committing person: Controller
  • Fraudulent action: Overstate transaction/sales; Understate transaction/sales
  • Transactions in Financial statement: Real transaction/sales; False transaction/sales
  • Entity: Fake Vendor; Real Complicit Vendor; Real Non-Complicit Vendor

2: Fraud Scenario Identification
Some example predictor attributes:
• number of auditor turnovers
• total discretionary accruals
• Big 4 auditor
• accounts receivable
• allowance for doubtful accounts
• accounts receivable to total assets
• accounts receivable to sales
• whether meeting or beating forecast
• evidence of CEO change
• sales to total assets
• inventory to sales
• unexpected employee productivity
• percentage of executives on the board of directors
• total debt to total assets
• whether accounts receivable grew by more than 10 percent
• allowance for doubtful accounts to net sales
• current minus prior year inventory to sales
• gross margin to net sales
• evidence of CFO change
• holding period return in the violation period
• property plant and equipment to total assets
• value of issued securities to market value
• fixed assets to total assets
• days in receivables index
• industry ROE minus firm ROE
• positive accruals dummy
• whether gross margin grew by more than 10 percent
• allowance for doubtful accounts to accounts receivable
Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.

3: Data Analytics Strategies for Fraud Detection
• Example techniques used to overstate an asset:
  • Recording an asset that does not exist
  • Recording a real asset before the liability occurs
  • Recording a real asset that is not owned by the company
  • Improper capitalization of a false expense
  • Improper capitalization of a real expense
  • Reporting the asset in the wrong section of the balance sheet
• Example techniques used to understate an asset:
  • Failure to record a real asset
  • Failure to capitalize a real expense
  • Failure to record an asset in the proper period
  • Reporting the asset in the wrong section of the balance sheet

3: Data Analytics Strategies for Fraud Detection
(figure not preserved)

4: Selection of Fraud Data Analytics Model
Fraud Detection Model
(Diagram labels: Data; Train Fraud Detection Model; Develop the Fraud Detection Model; Use Model; New Data; Make Predictions)
Frequency of re-training the model depends on:
• Volatility of the fraud behaviour
• Detection power of the current model
• Amount of (similar) confirmed cases already available in the database
• Rate at which new cases are being confirmed
• Required effort to retrain the model

Topics to be covered later
• How to handle imbalanced datasets
• Techniques for fraud detection model
  • Machine Learning algorithm
  • Statistical technique – Benford’s law
  • Social Network Analysis
• How to evaluate performance of fraud detection model

References
• Bart Baesens, Veronique Van Vlasselaer, Wouter Verbeke (2015). Fraud Analytics using Descriptive, Predictive, and Social Network Techniques, 1st ed, John Wiley & Sons, Inc.
• Leonard W. Vona (2017). Fraud Data Analytics Methodology: The Fraud Scenario Approach to Uncovering Fraud in Core Business Systems, John Wiley & Sons, Inc.
• Delena D. Spann (2013). Fraud Analytics: Strategies and Methods for Detection and Prevention, John Wiley & Sons, Inc. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/hkuhk/detail.action?docID=1752695
