Lec 2

The document outlines a lecture on Exploratory Data Analysis (EDA) using R, focusing on insurance fraud schemes and data visualization techniques. It discusses various analytical methods, including univariate and multivariate analysis, as well as clustering techniques for fraud detection. The content also covers the use of specific datasets and the importance of defining fraud scenarios and selecting appropriate data analytics strategies.

Uploaded by

esthersweetertang


12/8/2022

FITE7410 Lecture 2
Lecturers: Dr. Vivien CHAN, Annie CHAN
Tutor: Ms. Yanan GONG
Department of Computer Science
The University of Hong Kong

Today's Agenda
01 Exploratory Data Analysis (EDA) using R
02 Insurance Fraud Scheme - Cluster Analysis
03 Financial Statement Fraud Scheme - Intro

01 Exploratory Data Analysis (EDA) using R
Dr. Vivien CHAN

Data Visualization with R
Rob Kabacoff (2020)
https://rkabacoff.github.io/datavis/index.html
• What are the goals of EDA?
• How to achieve the goals?

Example of EDA using R
• Dataset : https://www.kaggle.com/mlg-ulb/creditcardfraud
• The Credit Card Fraud Detection Dataset comprises transactions that European credit card holders made in September 2013. The dataset shows transactions that occurred in two days.
• The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

Example – Loading packages and libraries
#library for correlations
#library for plotting the samples

Example – Step 1: Distinguish Attributes
>str(data)
>summary(data)
What kind of initial information can you get from this preliminary exploration?

Example – Step 2: Univariate Analysis
What does this histogram tell you about the variable “Amount”?
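A minimal sketch of Steps 1 and 2 in base R. The real analysis would read the Kaggle CSV file; here a small synthetic data frame stands in for it, so the column names `Time`, `Amount`, and `Class` and all simulated values are assumptions for illustration only.

```r
# Synthetic stand-in for the credit card data (assumption: the real file
# has many more columns; only three illustrative ones are simulated here).
set.seed(42)
n <- 1000
data <- data.frame(
  Time   = sort(runif(n, min = 0, max = 172800)),          # seconds over two days
  Amount = round(rlnorm(n, meanlog = 3, sdlog = 1.5), 2),  # right-skewed amounts
  Class  = rbinom(n, size = 1, prob = 0.02)                # 1 = fraud (rare), 0 = normal
)

# Step 1: distinguish attributes (types, ranges, missing values)
str(data)
summary(data)

# Step 2: univariate analysis of Amount.
# plot = FALSE computes the bins without drawing, so the counts can be inspected.
amount_hist <- hist(data$Amount, breaks = 50, plot = FALSE)

# Draw the histogram only in an interactive session
if (interactive()) {
  plot(amount_hist, main = "Histogram of Amount", xlab = "Amount")
}
```

With a right-skewed variable such as a transaction amount, most observations usually fall into the first few bins, which is why the next slides zoom in on the lower range.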

Example – Step 2: Univariate Analysis
Now, with the “Amount” limited to under $200, what does this histogram tell you about the variable “Amount”?

Example – Step 2: Univariate Analysis
** Examples of boxplot

Example – Step 2: Univariate Analysis
** Examples of bar charts
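The univariate views named above (restricted histogram, boxplot, bar chart) can be sketched in base R as follows, together with a correlation matrix of the kind used in the bi-/multi-variate step. The data frame is synthetic and the column names `Amount`, `V1`, and `Class` are assumptions, not the actual dataset.

```r
set.seed(42)
n <- 1000
# Synthetic stand-in data (illustrative values only)
data <- data.frame(
  Amount = round(rlnorm(n, meanlog = 3, sdlog = 1.5), 2),
  V1     = rnorm(n),
  Class  = rbinom(n, size = 1, prob = 0.02)
)

# Histogram restricted to amounts under $200
small <- subset(data, Amount < 200)
small_hist <- hist(small$Amount, breaks = 40, plot = FALSE)

# Boxplot statistics of Amount by class (five-number summary per group)
amount_box <- boxplot(Amount ~ Class, data = data, plot = FALSE)

# Bar chart input: number of transactions per class
class_counts <- table(data$Class)

# Pairwise correlations among the numeric attributes
cor_matrix <- cor(data)

# Draw the charts only in an interactive session
if (interactive()) {
  plot(small_hist, main = "Amount under $200", xlab = "Amount")
  boxplot(Amount ~ Class, data = data, main = "Amount by Class")
  barplot(class_counts, main = "Transactions per Class")
}
print(class_counts)
print(round(cor_matrix, 2))
```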

Example – Step 3: Bi-/Multi-variate Analysis
What insights can you get from this figure?

Example – Step 3: Bi-/Multi-variate Analysis
What insights can you get from the correlations among the attributes?

02 Insurance Fraud Scheme
Dr. Vivien CHAN

Insurance Frauds
• Insurance Fraud - Any types of insurance
  • Life insurance
  • Health insurance
  • Automobile insurance
  • …
• Insurance Fraud can be committed by:
  • Insurance agent/provider
  • Insurance subscriber
  • Service provider
  • Conspiracy fraud – involving more than one party
  • …

Insurance Frauds
• Insurance agent/provider
  • Selling policies from non-existent companies
  • Failing to submit premiums
  • Churning policies to create more commissions
  • …
• Insurance subscriber
  • Exaggerated claims
  • Falsified medical history
  • Post-dated policies
  • Faked death
  • Faked damage
  • …
• Service provider
  • Billing services that are not performed
  • Overbilling, e.g. issue invoices for services with higher fees than actually performed
  • Providing unnecessary services
  • …
• Conspiracy fraud
  • Patient colluding with doctors
  • Automobile owner colluding with car repair shops
  • …

Overview: Fraud Data Analytics Methodology
• Define scope of Fraud Data Analytics (starting point)
• Fraud Scenario Identification
• Data Analytics Strategies for Fraud Detection
• Selection of Fraud Data Analytics Model
BUT the process is cyclical, NOT linear.

Health Insurance Fraud Case
• Background
  • The data used in this study is purchased from the Centers for Medicare & Medicaid Services (in USA).
  • Medicare provides three types of services:
    • hospital insurance (part A), medical insurance (part B) and prescription drug coverage (parts C and D).
  • Medicaid is a state administered program and each state sets its own guidelines regarding eligibility and services
    • it is available only to low-income individuals and families as determined by federal and state law
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

1: Define scope of Fraud Data Analytics
• Example: Health insurance claims
• What are the objectives of the fraud data analytics?
  • To identify any suspicious or fraudulent insurance claims
  • Specifically : to find out any health insurance frauds committed by either the insurance provider or the insurance subscriber
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

2: Fraud Scenario Identification
• Who is the committing person?
  • Subscriber of the health insurance
  • Provider of the health insurance
• What are the possible entities involved?
  • Service provider (which can be real or fake; if real, can be either complicit or non-complicit)
• What are the possible fraudulent actions?
  • Exaggerated claims
  • Falsified medical history
  • Post-dated policies
  • Faked injuries
  • …

2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
  • Committing person: Insurance subscriber
  • Fraudulent action: Faked injuries, Exaggerated claims, Falsified medical history
  • Entity: Fake Service Provider, Real Complicit Service Provider, Real Non-Complicit Service Provider
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

3: Data Analytics Strategies for Fraud Detection
• What are the data that you need to collect? Or select as samples for further data analysis?
• Q: How to select the appropriate attributes for further data analysis?

3: Data Analytics Strategies for Fraud Detection
• 10 attributes most related to the insurance claims are selected
• (Figure annotation: 3 of the attributes are number values of different scales)
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).
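The permutation of fraud scenarios from step 2 can be enumerated mechanically. The sketch below uses base R's `expand.grid()` with the component values from the scenario diagram; the variable names are chosen here for illustration and do not come from the lecture.

```r
# Components of a fraud scenario, taken from the scenario diagram
committing_person <- c("Insurance subscriber")
fraudulent_action <- c("Faked injuries", "Exaggerated claims",
                       "Falsified medical history")
entity <- c("Fake service provider", "Real complicit service provider",
            "Real non-complicit service provider")

# expand.grid() builds every combination of the three components
scenarios <- expand.grid(
  committing_person = committing_person,
  fraudulent_action = fraudulent_action,
  entity            = entity,
  stringsAsFactors  = FALSE
)

print(scenarios)
nrow(scenarios)  # 1 * 3 * 3 = 9 combinations
```

Each row is one candidate scenario to be matched with a detection strategy; adding a second committing person (for example the insurance provider) would double the number of rows.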

3: Data Analytics Strategies for Fraud Detection
• Data transformation

3: Data Analytics Strategies for Fraud Detection
• 1/ Specific identification strategy
  • E.g. health insurance subscriber with claims inconsistent with the claim policy
• 2/ Internal control avoidance:
  • E.g. insurance claims made during period outside of the hospital stay period
• 3/ Data interpretation
  • E.g. excessive or questionable claim amounts
• 4/ Number anomaly
  • E.g. pattern and frequency of claims associated with an insurance subscriber
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

4: Selection of Fraud Data Analytics Model
• 2nd technique: Cluster analysis
• The aim of clustering is to group a set of observations into groups (or clusters), so that:
  • The homogeneity within the cluster is maximized, i.e. observations in the same cluster tend to be similar to each other
  • The heterogeneity between the clusters is maximized, i.e. observations in different clusters are dissimilar
• Examples include:
  • Clustering transactions in a credit card setting
  • Clustering claims in an insurance setting
  • Clustering tax statements in a tax-inspecting setting
  • Clustering cash transfer in an anti-money laundering setting

Clustering
Dr. Vivien CHAN

Cluster analysis - example
• Clustering is:
  • Unsupervised classification
  • No predefined target class
  • Number of clusters unknown
  • Meaning of clusters unknown
• Clusters can be ambiguous – how many clusters are there? 3 clusters? 5 clusters?

Distance Metrics
• Q: How to measure the similarity and dissimilarity between observations in a group or cluster?
• A: Use a distance metric to quantify the similarity
• Types of distance metrics
  • Metrics for continuous variables
    • Minkowski distance
    • Pearson correlation
  • Metrics for categorical variables
    • Simple matching coefficient (SMC)
    • Jaccard index

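Distance Metrics – continuous variables
• When the input variables are continuous variables, use Minkowski distance between two observations $X_1 = (x_{11}, \dots, x_{1n})$ and $X_2 = (x_{21}, \dots, x_{2n})$:

$$D(X_1, X_2) = \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^{p} \right)^{1/p}$$

Manhattan (or city block) distance
• X1 = (50, 20), X2 = (30, 10)
• When p = 1, Manhattan (or city block) distance:

$$D(X_1, X_2) = |x_{11} - x_{21}| + |x_{12} - x_{22}| = |50 - 30| + |20 - 10| = 30$$

Euclidean distance
• X1 = (50, 20), X2 = (30, 10)
• When p = 2, Euclidean distance:

$$D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2} = \sqrt{(50 - 30)^2 + (20 - 10)^2} = \sqrt{500} \approx 22.36$$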
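The two worked distances can be checked with base R's `dist()`, which supports the Manhattan, Euclidean, and general Minkowski cases. This is a small sketch using only the two observations from the slides; the p = 3 case is added here purely to illustrate the general formula.

```r
# The two observations from the slides
X <- rbind(X1 = c(50, 20), X2 = c(30, 10))

manhattan  <- dist(X, method = "manhattan")          # p = 1
euclidean  <- dist(X, method = "euclidean")          # p = 2
minkowski3 <- dist(X, method = "minkowski", p = 3)   # general p, here p = 3

as.numeric(manhattan)   # |50 - 30| + |20 - 10| = 30
as.numeric(euclidean)   # sqrt(500), about 22.36
as.numeric(minkowski3)  # (20^3 + 10^3)^(1/3)
```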

Distance metrics – categorical variables
• When input variables are categorical variables, can use simple matching coefficient (SMC) or Jaccard index
• SMC
  • Calculates the number of identical matches between the variable values
  • Assumption of SMC is that “Yes” and “No” are of equal weights
• Jaccard index
  • Similar to SMC, but leaves out the “No-No” match
  • Measures the similarity between observations across those red flags that were raised at least once
  • Especially useful in situations where many red-flag indicators are available and typically only a few are raised

SMC & Jaccard - example
• For example, binary variables (Yes or No) that are used as a series of red-flag indicators to label a claim as suspicious or not
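As a rough illustration of the difference between the two coefficients, the sketch below scores two hypothetical claims on six binary red-flag indicators. The claim vectors and the helper names `smc()` and `jaccard()` are made up for this example.

```r
# Two hypothetical claims scored on six binary red-flag indicators
# (1 = flag raised "Yes", 0 = not raised "No"); values are made up.
claim_a <- c(1, 0, 0, 1, 0, 0)
claim_b <- c(1, 0, 0, 0, 0, 1)

smc <- function(x, y) {
  # Share of indicators on which both claims agree (Yes-Yes and No-No)
  mean(x == y)
}

jaccard <- function(x, y) {
  # Yes-Yes matches divided by indicators raised on at least one claim,
  # so No-No matches are left out
  both   <- sum(x == 1 & y == 1)
  either <- sum(x == 1 | y == 1)
  if (either == 0) return(NA_real_)  # undefined when no flag is raised at all
  both / either
}

smc(claim_a, claim_b)      # 4 of 6 indicators agree
jaccard(claim_a, claim_b)  # 1 shared flag out of 3 raised
```

The SMC is inflated by the three indicators that neither claim raised, whereas the Jaccard index only looks at the flags raised at least once, which is why it suits sparse red-flag data.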

Types of Clustering Algorithms
• Connectivity-based
  • Clusters are formed by connecting data points according to their distance.
  • e.g. Hierarchical clustering
• Centroid-based
  • Clusters are formed by connecting data points nearest to the centroid of a cluster. Centroid might not be any existing data points.
  • e.g. K-means clustering
• Distribution-based
  • Clusters are formed by how probable it is for a data point to belong to a certain distribution, e.g. Gaussian distribution.
  • e.g. Gaussian Mixture Models (GMM) clustering
• Density-based
  • Clusters are defined as areas of higher density within the data space compared to other regions. Data points in sparse regions are considered to be “noise”.
  • e.g. DBSCAN, OPTICS

K-means Clustering
Dr. Vivien CHAN

Some basics
• Non-hierarchical procedure, it is a partitioning method - partitions n observations into K clusters
• Given a K, find a partition of K clusters that optimizes the chosen partitioning criterion – k-means, i.e. each cluster is represented by the center (or mean) of the cluster
• The number of clusters, K, needs to be specified before the start of analysis, e.g. expert-based or result of another clustering procedure (e.g. hierarchical)

Algorithm of k-means clustering
• Step 1
  • Select k observations as initial cluster centroids (seeds)
• Step 2
  • Assign each observation to the cluster that has the closest centroid (e.g. use Euclidean distance)
• Step 3
  • When all observations have been assigned, recalculate the positions of the k centroids
• Step 4
  • Repeat until the cluster centroids no longer change or a fixed number of iterations is reached

Example – k=2
• K=2
• Randomly assign 2 observations as the seeds
• (Figure annotation: These 2 are selected seeds)

Example – k=2
• Assign the observations closest to seeds into one cluster

Example – k=2
• Recalculate the cluster centroids
• (Figure annotation: These 2 are the new centroids of the 2 clusters)

Example – k=2
• Reassign the observations based on the new centroids
• (Figure annotation: These 2 are the new centroids of the 2 clusters)

Example – k=2
• Recalculate the centroids of the 2 clusters
• (Figure annotation: These 2 are the new centroids of the 2 new clusters)
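The four steps of the k-means algorithm can be sketched directly in base R. This is a teaching sketch, not the implementation behind R's built-in `kmeans()`; the function name `simple_kmeans()` and the synthetic data are assumptions made for illustration.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k observations as the initial centroids (seeds)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: assign each observation to the closest centroid (Euclidean).
    # d has one row per observation and one column per centroid.
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k, drop = FALSE]
    cluster <- max.col(-d, ties.method = "first")
    # Step 3: recalculate each centroid as the mean of its members
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 4: stop when the centroids no longer change
    if (all(abs(new_centroids - centroids) < 1e-12)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids, iterations = iter)
}

set.seed(1)
# Two well-separated synthetic groups of 20 points each (illustrative only)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
fit$centers
```

Because the seeds are drawn at random, different runs can end in different local optima; in practice the built-in `kmeans()` with several random starts (`nstart`) is the more robust choice.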

Example – k=2
• Reassign observations to the closest new centroids
• Recalculate the centroids of the clusters
• Stopped when the centroids no longer change
• (Figure annotation: These 2 new centroids are same as the previous 2 centroids)

K-means Clustering
Advantages
• Relatively computationally efficient as compared with hierarchical clustering
• Simple to implement and scales to large data sets.
Disadvantages
• Need to specify k, the number of clusters, in advance
• Often terminates at a local optimum. Need to try different initial cluster centres
• Unable to handle noisy data and outliers
We’ll talk about how to find the OPTIMAL k later today, i.e. how to decide the number of clusters.

How to interpret Clustering output?
Dr. Vivien CHAN

Cluster Interpretation – compare clusters
• Cluster C1 has observations with low recency values and high monetary values, whereas the frequency is similar to original population

Cluster interpretation – classification tree
• Compute a decision tree by using the cluster id as the target variable
• Supervised learning techniques can be used to interpret and explain unsupervised learning models
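The cluster-versus-population comparison described above can be sketched with base R's `aggregate()`. The recency, frequency, and monetary values below are synthetic and the column names are assumptions; the point is only to show how per-cluster means are set next to the population means.

```r
set.seed(7)
# Synthetic recency / frequency / monetary values (made up for illustration):
# 30 observations with low recency and high monetary value, 70 without
rfm <- data.frame(
  recency   = c(rnorm(30, mean = 5, sd = 1),    rnorm(70, mean = 30, sd = 5)),
  frequency = rnorm(100, mean = 10, sd = 2),
  monetary  = c(rnorm(30, mean = 900, sd = 50), rnorm(70, mean = 200, sd = 40))
)

# Cluster on standardized values so that no variable dominates the distance
fit <- kmeans(scale(rfm), centers = 2, nstart = 10)

# Mean of each variable per cluster, next to the overall population mean
cluster_means    <- aggregate(rfm, by = list(cluster = fit$cluster), FUN = mean)
population_means <- colMeans(rfm)

print(cluster_means)
print(population_means)
```

A variable whose cluster mean sits far from the population mean (here recency and monetary) characterizes the cluster, while one that stays close (here frequency) does not.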

Evaluation of Clustering Solutions
• For supervised learning model, we can use accuracy, precision, recall, F1 score, ROC-AUC to measure the performance.
• Q: How to measure the performance of clustering?
• A: There exists no universal criterion for clustering performance evaluation
• Evaluate clustering solutions from 2 perspectives
  • Statistical perspective
  • Interpretability perspective

How good is a clustering solution?
• High within-cluster similarity
• Low between-cluster similarity

How good is a clustering solution?
• Statistically, can use sum of square error (SSE) as a measure of similarity
• How to use SSE to evaluate a clustering solution?
  • The lower the SSE for a particular cluster (WSS), the more homogeneous is that cluster, i.e. higher within-cluster similarity
  • The higher the SSE among different clusters (BSS), the more heterogeneous are the clusters, i.e. lower between-cluster similarity

Elbow method
• Elbow method – makes use of Within-cluster SSE (WSS)
• Steps:
  1. Compute the k-means clustering models for different k values
  2. For each k, calculate the WSS
  3. Plot WSS according to the value of k
  4. Find the point where there is a bend (or elbow) in the curve. This is an indicator that the number of clusters is optimal

Source: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods
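The elbow steps can be sketched in base R using the `tot.withinss` component returned by `kmeans()`, which is the total within-cluster SSE. The three-group synthetic data below is an assumption chosen so that a bend at k = 3 is plausible; it is not data from the lecture.

```r
set.seed(123)
# Three synthetic, well-separated groups in two dimensions (illustrative only)
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 5),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

k_values <- 1:8

# Steps 1 and 2: fit k-means for each k and record the total within-cluster SSE
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Step 3: plot WSS against k (drawn only in an interactive session)
if (interactive()) {
  plot(k_values, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Total within-cluster SSE (WSS)")
}

# Step 4: look for the bend; the table gives the same information as the plot
print(data.frame(k = k_values, wss = round(wss, 1)))
```

WSS generally keeps falling as k grows, so the elbow is read off by eye as the k after which further gains become small.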

Average silhouette method
• Silhouette analysis estimates the average distance between clusters.
• Defined as follows:
  • Silhouette value of data point i:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

  where a(i) is the mean distance of data point i to other data points in the same cluster, and b(i) is the smallest mean distance of data point i to data points in other clusters.

Average silhouette method
• Mean distance (a(i)) between observation i and all other data points in the same cluster:

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\ j \neq i} d(i, j)$$

  where d(i, j) is the distance between data points i and j in the same cluster, and |C_i| is the total number of data points belonging to the same cluster.
• Smallest mean distance (b(i)) of i to all data points in other clusters:

$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$

  where d(i, j) is the distance between data point i and data point j in a different cluster C_k.

Average silhouette method
• Silhouette estimates the average distance between clusters
• Steps: (similar to Elbow method)
  1. Compute the k-means clustering models for different k values
  2. For each k, calculate the average silhouette of the observations (avg sil)
  3. Plot avg sil according to the value of k
  4. Find the k at which the average silhouette reaches its maximal value. This is an indicator that the number of clusters is optimal

Source: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods
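The silhouette steps can be sketched in base R by computing a(i), b(i), and s(i) from a distance matrix. The helper `avg_silhouette()` and the synthetic three-group data are assumptions for illustration; assigning s(i) = 0 to a point that is alone in its cluster is a common convention rather than something stated in the slides.

```r
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  ids <- sort(unique(cluster))
  stopifnot(length(ids) >= 2)      # silhouette needs at least two clusters
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    same <- cluster == cluster[i]
    same[i] <- FALSE                     # exclude the point itself
    if (!any(same)) { s[i] <- 0; next }  # singleton cluster: s(i) = 0 by convention
    a <- mean(d[i, same])                # a(i): mean distance within own cluster
    other <- setdiff(ids, cluster[i])
    # b(i): smallest mean distance to the points of any other cluster
    b <- min(sapply(other, function(k) mean(d[i, cluster == k])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(123)
# Three synthetic, well-separated groups of 30 points each (illustrative only)
x <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 6),  ncol = 2),
           matrix(rnorm(60, mean = 12), ncol = 2))

k_values <- 2:6
# Steps 1 and 2: fit k-means for each k and compute the average silhouette
avg_sil <- sapply(k_values, function(k) {
  fit <- kmeans(x, centers = k, nstart = 20)
  avg_silhouette(x, fit$cluster)
})

# Steps 3 and 4: tabulate avg sil by k and pick the k with the largest value
print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))
best_k <- k_values[which.max(avg_sil)]
best_k
```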

EDA – example : health care insurance fraud
• A noteworthy feature in the distribution is the existence of negative payments.
• What these negative payments mean and in which situation they occur need to be identified and verified
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

EDA – example : health care insurance fraud
• The maximum value of beneficiary’s hospital stay period is 668, whereas there are only 365 days in a year.
• Therefore, a beneficiary should not stay in the hospital for more than 365 days within a year.
• In the raw data, 28 beneficiaries who have spent more than 365 days in hospital are found
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

Cluster Analysis – example : health care insurance fraud
• Selection of number of clusters
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

Cluster Analysis Result
• 7 cluster results
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

Interpretation of each cluster
• Claims in clusters 1, 5, 6, and 7 have relatively short travel distance, short hospital stay period, and small amount of payment.
• Cluster 2 relates to long hospital stay period, short travel distance, and relatively large amount of payment.
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

Interpretation of each cluster
• Clusters 3 and 4 contain 3,671 (0.0%) and 47 claims (0.0%), respectively.
• Claims in cluster 3 have long travel distance, short hospital stay period, and small payment amount.
• Cluster 4 contains claims with large payment amount and short hospital stay period. This is a new abnormal pattern revealed in this analysis
• The seven-cluster analysis reduces suspicious claims from 195,343 to 3,718 (3,671 + 47), which are more feasible to examine.
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

Cluster Analysis using R
Dr. Vivien CHAN

Example using R
• Dataset : https://www.kaggle.com/ealaxi/banksim1
• Source: Lopez-Rojas, Edgar Alonso; Axelsson, Stefan. BankSim: A bank payments simulator for fraud detection research. In proceedings: 26th European Modeling and Simulation Symposium, EMSS 2014, Bordeaux, France, pp. 144–152, Dime University of Genoa, 2014, ISBN: 9788897999324.
  https://www.researchgate.net/publication/265736405_BankSim_A_Bank_Payment_Simulation_for_Fraud_Detection_Research
• BankSim is an agent-based simulator of bank payments based on a sample of aggregated transactional data provided by a bank in Spain. The main purpose of BankSim is the generation of synthetic data that can be used for fraud detection research.
• R sample code : https://www.kaggle.com/andradaolteanu/ii-fraud-

Example using R: Step 1-EDA
What is your preliminary observation of this dataset?

Example: After EDA
• Normal Behaviour
  • transactions amount fairly small (under $500)
  • payments for transportation and food transactions (don’t have any fraud cases)
  • there are some merchants that don’t have any cases of fraud within their transactions
• Abnormal Behaviour:
  • transactions with high amounts (above $500)
  • transactions made during travel or for leisure activities (like sports/toys expenditure, hotels etc.)
  • there are some merchants where all transactions made to them are fraud

Example using R: Step 2 -Feature Engineering
Questions to ask yourself:
• Do you need to recode character variables to numeric variables?
• Do you need to create new variables?
• Do you need to remove any of the variables which are not useful when building your data model?
• Do you need to handle missing values or erroneous values?
• Do you need to standardize or normalize your variables?
(Figures: Original dataset; After Feature Engineering)

Example using R: Step 3 -Cluster analysis
• library(tidyverse) # data manipulation
• library(cluster) # clustering algorithms
• library(factoextra) # clustering algorithms & visualization
• library(caret) # streamline model training process

To perform a cluster analysis in R, generally, the data should be prepared as follows:
1. Any missing value in the data must be removed or estimated.
2. The data must be standardized (i.e., scaled) to make variables comparable.
   • use “scale” to standardize the dataset
3. The data frame needs to be a matrix
   • use “as.matrix” to transform data frame with : rows are observations (individuals) and columns are variables

Example in R : Scale the dataset
• “data[-c(5)]” means removing column 5
• QUESTION : WHY DO WE NEED TO REMOVE THIS COLUMN?

Example in R: Finding the optimal number of clusters
(Figure: scree plot; the slide also shows another function for producing the scree plot)
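The three preparation steps listed under Step 3, followed by a k-means fit with 3 clusters, can be sketched in base R as below. The data frame is synthetic: the column names and the choice of column 5 as the fraud label are assumptions made for this illustration and may not match the BankSim file. The packages listed on the slide are not needed for this sketch.

```r
set.seed(2022)
# Synthetic stand-in for a labelled transaction table. The column names
# (amount, age, category, gender, fraud) are illustrative assumptions;
# in this made-up table the label sits in column 5.
n <- 300
data <- data.frame(
  amount   = round(rlnorm(n, meanlog = 3, sdlog = 1), 2),
  age      = sample(1:6, n, replace = TRUE),
  category = sample(1:10, n, replace = TRUE),
  gender   = sample(0:1, n, replace = TRUE),
  fraud    = rbinom(n, size = 1, prob = 0.05)
)

# 1. Remove rows with missing values
data <- na.omit(data)

# 2. Standardize the features, leaving out column 5
scaled <- scale(data[-c(5)])

# 3. Convert to a matrix: rows are observations, columns are variables
x <- as.matrix(scaled)

# Fit k-means with 3 clusters, then compare the clusters with the known labels
fit <- kmeans(x, centers = 3, nstart = 25)
print(table(cluster = fit$cluster, fraud = data$fraud))
```

After `scale()`, every column has mean 0 and standard deviation 1, so a variable measured in large units (such as an amount) no longer dominates the Euclidean distance used by k-means.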

Example in R : k-means clustering
• 3 cluster results

Example in R : Examine the clusters
• Original labelled dataset

03 Financial Statement Fraud Scheme
Dr. Vivien CHAN

Financial Statement Frauds
• Financial Statement Fraud
  • Deliberate misrepresentation of the financial condition of a company, e.g. omission of amounts or disclosures in the financial statements, with the intention to deceive or mislead the users of the financial statements
• Top 10 accounting scandals
  • Waste Management (1998)
  • Enron (2001)
  • WorldCom (2002)
  • Tyco (2002)
  • HealthSouth (2003)
  • Freddie Mac (2003)
  • American International Group (AIG) (2005)
  • Lehman Brothers (2008)
  • Bernie Madoff (2008)
  • Satyam (2009)
Source: https://corporatefinanceinstitute.com/resources/knowledge/other/top-accounting-scandals/

Famous case – Enron Scandal
• Enron deliberately misstated profits and cash flows and understated liabilities with the use of creative, yet questionable accounting methods.
• A disguised loan in 1999, in which the proceeds from the sale of bonds were reported as cash from operations. This overstated operating cash flow by $700 million. With the use of mark-to-market accounting, Enron recognized a very significant amount of future earnings as current income. This allowed a certain business unit to report quarterly profit of $40 million when in fact this unit was actually operating at a loss. Another loan transaction was understated by $4.85 billion.

Overview: Fraud Data Analytics Methodology
• Define scope of Fraud Data Analytics (starting point)
• Fraud Scenario Identification
• Data Analytics Strategies for Fraud Detection
• Selection of Fraud Data Analytics Model
BUT the process is cyclical, NOT linear.
Background
• Specific problems of Financial Statement Fraud detections:
  1. the ratio of fraud to nonfraud firms is small
  2. the ratio of false positive to false negative misclassification costs is small
  3. the attributes used to detect fraud are relatively noisy, where similar attribute values can signal both fraudulent and nonfraudulent activities; and
  4. fraudsters actively attempt to conceal the fraud, thereby making fraud firm attribute values look similar to nonfraud firm attribute values.
Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.

Background
• Data sample
Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.

1: Define scope of Fraud Data Analytics
• What is the objective and scope of fraud data analytics?
  • To identify the algorithms and predictors to use when creating new models for financial statement fraud detection under specific class and cost imbalance ratios

2: Fraud Scenario Identification (Re-visit)
A fraud scenario combines a committing person, an entity, and a fraudulent action.
• Committing person
  • The person committing the fraud
  • The person can be from internal or external
  • The person who has direct or indirect access to the database
• Entity
  • Attachment of transaction in the business system
  • E.g. in payroll system, ‘employee’ is the entity
  • In credit card system, ‘card number’ is the entity
  • Concerns whether the transaction or the entity is false or real
• Fraudulent action
  • Fraudulent action describes how the transaction is recorded
  • Fraudulent action links committing person and entity
  • E.g. payment of vendor without purchase order

Inherent Fraud Scheme for Financial Statement Fraud
• Committing person
  • The person committing the fraud would be less critical
  • Usually would be senior management or controller
• Transactions in Financial statement
  • Direction of misstatement, i.e. general ledger account is overstated or understated; which financial year, etc.
  • General ledger account, i.e. the transactions recorded is real or fake
• Entity
  • e.g. Shell company, customers (real or fake), vendor (real or fake)

2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
  • Committing person: Controller
  • Fraudulent action: Overstate or Understate; Real transaction/sales or False transaction/sales
  • Entity: Fake Vendor, Real Complicit Vendor, Real Non-Complicit Vendor

2: Fraud Scenario Identification
Some example predictor attributes:
• number of auditor turnovers
• total discretionary accruals
• Big 4 auditor
• accounts receivable
• allowance for doubtful accounts
• accounts receivable to total assets
• accounts receivable to sales
• whether meeting or beating forecast
• evidence of CEO change
• sales to total assets
• inventory to sales
• unexpected employee productivity
• percentage of executives on the board of directors
• total debt to total assets
• whether accounts receivable grew by more than 10 percent
• allowance for doubtful accounts to net sales
• current minus prior year inventory to sales
• gross margin to net sales
• evidence of CFO change
• holding period return in the violation period
• property plant and equipment to total assets
• value of issued securities to market value
• fixed assets to total assets
• days in receivables index
• industry ROE minus firm ROE
• positive accruals dummy
• whether gross margin grew by more than 10 percent
• allowance for doubtful accounts to accounts receivable
Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.

3: Data Analytics Strategies for Fraud Detection
• Example techniques used to overstate an asset:
  • Recording an asset that does not exist
  • Recording a real asset before the liability occurs
  • Recording a real asset that is not owned by the company
  • Improper capitalization of a false expense
  • Improper capitalization of a real expense
  • Reporting the asset in the wrong section of the balance sheet
• Example techniques used to understate an asset:
  • Failure to record a real asset
  • Failure to capitalize a real expense
  • Failure to record an asset in the proper period
  • Reporting the asset in the wrong section of the balance sheet

3: Data Analytics Strategies for Fraud Detection

4: Selection of Fraud Data Analytics Model
Fraud Detection Model
• Develop the Fraud Detection Model: Data → Train Fraud Detection Model
• Use Model: New Data → Make Predictions
Frequency of re-training the model depends on:
• Volatility of the fraud behaviour
• Detection power of the current model
• Amount of (similar) confirmed cases already available in the database
• Rate at which new cases are being confirmed
• Required effort to retrain the model

Topics to be covered later
• How to handle imbalanced datasets
• Techniques for fraud detection model
  • Machine Learning algorithm
  • Statistical technique – Benford’s law
  • Social Network Analysis
• How to evaluate performance of fraud detection model

References
• Baesens, Bart; Van Vlasselaer, Veronique; Verbeke, Wouter (2015). Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques, 1st ed., John Wiley & Sons, Inc.
• Vona, Leonard W. (2017). Fraud Data Analytics Methodology: The Fraud Scenario Approach to Uncovering Fraud in Core Business Systems, John Wiley & Sons, Inc.
• Spann, Delena D. (2013). Fraud Analytics: Strategies and Methods for Detection and Prevention, John Wiley & Sons, Incorporated. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/hkuhk/detail.action?docID=1752695

4 pages
Fraud Detection for Auditors
No ratings yet
Fraud Detection for Auditors
52 pages
Fraud Detection for Banking Pros
No ratings yet
Fraud Detection for Banking Pros
50 pages
End of Chapter Questions and Cases For
14% (7)
End of Chapter Questions and Cases For
102 pages
Ata Analytics - 5 Data Analytics Software: About Jim Kaplan, CIA, CFE
No ratings yet
Ata Analytics - 5 Data Analytics Software: About Jim Kaplan, CIA, CFE
32 pages
Auto Insurance Fraud Detection
No ratings yet
Auto Insurance Fraud Detection
28 pages
EDA Credit Assignment Shakti - PDF
No ratings yet
EDA Credit Assignment Shakti - PDF
51 pages
502011ca002091xxxxmb 4
No ratings yet
502011ca002091xxxxmb 4
13 pages
Types of Cyber Crime (1) .PPTX - 20240617 - 230018 - 0000
No ratings yet
Types of Cyber Crime (1) .PPTX - 20240617 - 230018 - 0000
8 pages
Cheating Under IPC
No ratings yet
Cheating Under IPC
10 pages
How To Unfreeze Your Bank Account - A Step-by-Step Guide
No ratings yet
How To Unfreeze Your Bank Account - A Step-by-Step Guide
5 pages
Lasveg
No ratings yet
Lasveg
15 pages
Blockchain - Unit 5
No ratings yet
Blockchain - Unit 5
7 pages
NEW CAFV Letter For Debt Collector
100% (9)
NEW CAFV Letter For Debt Collector
9 pages
FRSC 104 QUESTIONED DOCUMENTS EXAMINATION
No ratings yet
FRSC 104 QUESTIONED DOCUMENTS EXAMINATION
16 pages
Tyco International SCANDAL (2002) : Presented by
No ratings yet
Tyco International SCANDAL (2002) : Presented by
34 pages
Marden v. Dorthy
No ratings yet
Marden v. Dorthy
16 pages
Assessment - Attempt Review - Cytrain
No ratings yet
Assessment - Attempt Review - Cytrain
6 pages
B2 UNIT 5 Test Higher
100% (1)
B2 UNIT 5 Test Higher
6 pages
Ethical Issues and Dilemmas in Business
No ratings yet
Ethical Issues and Dilemmas in Business
40 pages
2017 MLD 1383
No ratings yet
2017 MLD 1383
2 pages
Trusted Bank Logs Vendors
79% (19)
Trusted Bank Logs Vendors
5 pages
Akanni Sikiru Omogbolahan Complete Project
No ratings yet
Akanni Sikiru Omogbolahan Complete Project
47 pages
Document-Is Any Material Which Contains Marks, Symbols or Signs, Either Visible, Partially
No ratings yet
Document-Is Any Material Which Contains Marks, Symbols or Signs, Either Visible, Partially
16 pages
Art. 1340-1341
No ratings yet
Art. 1340-1341
3 pages
Angelos Gogo Siregar - TUGAS CYBER LAW ENGLISH - 110110170303
No ratings yet
Angelos Gogo Siregar - TUGAS CYBER LAW ENGLISH - 110110170303
7 pages
White Collar Crime Project
100% (2)
White Collar Crime Project
31 pages
Vocabulary Study and Tests
No ratings yet
Vocabulary Study and Tests
12 pages
Chapter 8 Illustrative Solutions
No ratings yet
Chapter 8 Illustrative Solutions
14 pages
Cyber Security
No ratings yet
Cyber Security
5 pages
Criminal Law - Book 2 by Ruben S. Cabardo
No ratings yet
Criminal Law - Book 2 by Ruben S. Cabardo
28 pages
Occasional Criminals
No ratings yet
Occasional Criminals
5 pages
Digital Payment Fraud Awareness
No ratings yet
Digital Payment Fraud Awareness
8 pages
Machine Learning in Signature Forgery Detection
No ratings yet
Machine Learning in Signature Forgery Detection
8 pages
Online Text Editor PDF
No ratings yet
Online Text Editor PDF
2 pages
An Examination of Au227.com: Disambiguating Digital Footprints and Associated Cyber Risks
No ratings yet
An Examination of Au227.com: Disambiguating Digital Footprints and Associated Cyber Risks
15 pages
Insider Trading Cases: Global Examples
0% (1)
Insider Trading Cases: Global Examples
4 pages


What does this histogram

Now, with the “Amount”

Example – Step 3: Bi-/Multi-variate Analysis Example – Step 3: Bi-/Multi-variate Analysis

What insights can you get from the correlations

Insurance Frauds Insurance Frauds Overview: Fraud Data Analytics Methodology

Fake Service Provider

Q: How to select the appropriate attributes

Cluster analysis - example Distance Metrics

When p = 1: Manhattan (or city block) distance. When p = 2: Euclidean distance.
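
The parameter p in this fragment appears to be the exponent of the Minkowski distance; the slide's own formula did not survive extraction, so the following is a sketch under that assumption. The symbols x, y (two observations) and n (number of numeric attributes) are notation introduced here, not taken from the slides.

```latex
% Minkowski distance between observations x and y over n numeric attributes
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}

% p = 1: Manhattan (city block) distance
d_1(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert

% p = 2: Euclidean distance
d_2(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
```

For example, with x = (1, 2) and y = (4, 6), the Manhattan distance is 3 + 4 = 7 and the Euclidean distance is sqrt(9 + 16) = 5.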

Distance metrics – categorical variables SMC & Jaccard - example

Types of Clustering Algorithms Some basics

• Clusters are formed by connecting data points nearest to the

• Clusters are defined as areas of higher density within the data

Algorithm of k-means clustering Example – k=2 Example – k=2

Example – k=2 Example – k=2 Example – k=2

Example – k=2 K-means Clustering

• Stopped when the • Need to specify k, the number of clusters,

Cluster Interpretation – compare clusters Cluster interpretation – classification tree

Evaluation of Clustering Solutions How good is a clustering solution?

How good is a clustering solution? Elbow method

Average silhouette method Average silhouette method Average silhouette method

• The maximum value of beneficiary’s hospital stay period is

Cluster Analysis Result Interpretation of each cluster Interpretation of each cluster

Cluster 2 relates to long hospital stay

Example using R Example using R: Step 1-EDA

Example in R: Finding the optimal number of clusters

To perform a cluster analysis in R, generally, the data should be prepared as

Example in R : k-means clustering Example in R : Examine the clusters

3 cluster results Original labelled

Background Background 1: Define scope of Fraud Data Analytics

• Attachment of transaction in the • Fraudulent action • Direction of misstatement, i.e.
