FITE7410
Lecture 2
Lecturers: Dr. Vivien CHAN, Annie CHAN
Tutor: Ms. Yanan GONG
Department of Computer Science
The University of Hong Kong

Today's Agenda
01 Exploratory Data Analysis (EDA) using R
02 Insurance Fraud Scheme
- Cluster Analysis
03 Financial Statement Fraud Scheme
- Intro

01 Exploratory Data Analysis (EDA) using R
Dr. Vivien CHAN
Data Visualization with R
Rob Kabacoff (2020)
https://rkabacoff.github.io/datavis/index.html
What are the goals of EDA?
How to achieve the goals?

Example of EDA using R
• Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
• The Credit Card Fraud Detection Dataset comprises transactions that European credit card holders made in September 2013. The dataset shows transactions that occurred in two days.
• The dataset was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

Example – Loading packages and libraries
#library for correlations
#library for plotting the samples
Example – Step 1: Distinguish Attributes
> str(data)
> summary(data)
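A minimal sketch of this step, under stated assumptions: a small made-up data frame stands in for the Kaggle file so the snippet runs on its own, and the file name `creditcard.csv` in the comment is a hypothetical local path. The 0/1 coding of `Class` is also an assumption of the sketch.

```r
# Hypothetical stand-in for the credit card data; the real file would be
# loaded with something like: data <- read.csv("creditcard.csv")
data <- data.frame(
  Time   = c(0, 1, 2, 5, 9),
  Amount = c(149.62, 2.69, 378.66, 123.50, 69.99),
  Class  = c(0, 0, 0, 1, 0)   # assumed coding: 1 = fraud, 0 = legitimate
)

# Class is a label, not a quantity, so treat it as a categorical attribute
data$Class <- factor(data$Class, levels = c(0, 1),
                     labels = c("legit", "fraud"))

str(data)       # type of each attribute (num, Factor, ...)
summary(data)   # quartiles for numeric columns, counts for factors

# Separate the attribute names by type for the later steps
numeric_vars     <- names(data)[sapply(data, is.numeric)]
categorical_vars <- names(data)[sapply(data, is.factor)]
```

Splitting the attribute names by type up front makes it easy to apply numeric summaries to one group and frequency tables to the other.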
Example – Step 2: Univariate Analysis
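Univariate analysis looks at one attribute at a time. The sketch below is only an illustration on made-up values (the amounts and labels are simulated, not taken from the dataset): summary statistics and a histogram for a numeric attribute, counts and proportions for a categorical one.

```r
set.seed(1)
# Simulated transaction amounts standing in for a real numeric attribute
amount <- round(rlnorm(200, meanlog = 3, sdlog = 1), 2)
# Simulated labels with very few frauds, mimicking an imbalanced target
label <- factor(c(rep("legit", 195), rep("fraud", 5)))

# Numeric attribute: location, spread and shape
summary(amount)
sd(amount)
hist(amount, breaks = 30, main = "Transaction amount", xlab = "Amount")
boxplot(amount, horizontal = TRUE, main = "Transaction amount")

# Categorical attribute: counts and proportions per level
label_counts <- table(label)
label_props  <- prop.table(label_counts)
barplot(label_counts, main = "Class distribution")
```

A skewed histogram or a heavily unbalanced bar chart at this stage already suggests later steps, such as transforming the amount or handling class imbalance.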
What insights can you get from this figure?

02 Insurance Fraud Scheme
Dr. Vivien CHAN
Health Insurance Fraud Case
• Background
  • The data used in this study was purchased from the Centers for Medicare and Medicaid Services (in the USA).
  • Medicare provides three types of services:
    • hospital insurance (part A), medical insurance (part B) and prescription drug coverage (parts C and D).
  • Medicaid is a state-administered program and each state sets its own guidelines regarding eligibility and services
    • it is available only to low-income individuals and families as determined by federal and state law

1: Define scope of Fraud Data Analytics
• Example: Health insurance claims
• What are the objectives of the fraud data analytics?
  • To identify any suspicious or fraudulent insurance claims
  • Specifically: to find out any health insurance frauds committed by either the insurance provider or the insurance subscriber

2: Fraud Scenario Identification
• Who is the committing person?
  • Subscriber of the health insurance
  • Provider of the health insurance
• What are the possible entities involved?
  • Service provider (which can be real or fake; if real, can be either complicit or non-complicit)
• What are the possible fraudulent actions?
  • Exaggerated claims
  • Falsified medical history
  • Post-dated policies
  • Faked injuries
  • …
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).
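The answers to the three questions (committing person, entity, fraudulent action) can be combined into candidate fraud scenarios. A minimal sketch, assuming that every combination is worth listing and using base R's `expand.grid`; strictly speaking this enumerates a Cartesian product, and the variable names are illustrative.

```r
committing_person <- c("Subscriber", "Provider")
entity <- c("Real complicit service provider",
            "Real non-complicit service provider",
            "Fake service provider")
fraudulent_action <- c("Exaggerated claims", "Falsified medical history",
                       "Post-dated policies", "Faked injuries")

# One row per combination of person, entity and action
scenarios <- expand.grid(person = committing_person,
                         entity = entity,
                         action = fraudulent_action,
                         stringsAsFactors = FALSE)

nrow(scenarios)   # 2 * 3 * 4 = 24 candidate scenarios
head(scenarios)
```

In practice the table would then be pruned by hand, since not every combination is plausible for a given business setting.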
2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
  • Example scenario: Insurance subscriber, Exaggerated claims, Real complicit service provider

3: Data Analytics Strategies for Fraud Detection
• What data do you need to collect, or select as samples for further data analysis?
• The 10 attributes most related to the insurance claims are selected
  • 3 of these are numeric values on different scales
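Numeric attributes on very different scales are usually rescaled before distance-based methods such as clustering, otherwise the attribute with the largest range dominates. A minimal sketch with made-up claim attributes (the column names and values are hypothetical, not the attributes used in the study), showing z-score standardization and min-max normalization.

```r
# Hypothetical claim attributes on very different scales
claims <- data.frame(
  payment_amount = c(120, 4500, 80, 15000, 640),
  length_of_stay = c(1, 7, 0, 21, 3),
  patient_age    = c(67, 81, 72, 90, 69)
)

# z-score standardization: (x - mean) / sd, column by column
claims_z <- as.data.frame(scale(claims))

# min-max normalization to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
claims_01 <- as.data.frame(lapply(claims, min_max))
```

Which rescaling to use is a modelling choice; z-scores are less sensitive to a single extreme value than min-max scaling, but neither removes outliers.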
3: Data Analytics Strategies for Fraud Detection
• Data transformation
• 1/ Specific identification strategy
  • E.g. health insurance subscriber with claims inconsistent with the claim policy
• 2/ Internal control avoidance
  • E.g. insurance claims made during a period outside of the hospital stay period
• 3/ Data interpretation
  • E.g. excessive or questionable claim amounts
• 4/ Number anomaly
  • E.g. pattern and frequency of claims associated with an insurance subscriber
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).

4: Selection of Fraud Data Analytics Model
• 2nd technique: Cluster analysis
• The aim of clustering is to group a set of observations into groups (or clusters), so that:
  • The homogeneity within each cluster is maximized, i.e. observations in the same cluster tend to be similar to each other
  • The heterogeneity between clusters is maximized, i.e. observations in different clusters are dissimilar
• Examples include:
  • Clustering transactions in a credit card setting
  • Clustering claims in an insurance setting
  • Clustering tax statements in a tax-inspecting setting
  • Clustering cash transfers in an anti-money laundering setting
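The two clustering goals (similar observations within a cluster, dissimilar observations across clusters) can be made concrete with sums of squares. The block below is a sketch assuming squared Euclidean distance, $K$ clusters $C_1, \dots, C_K$ with centroids $\mu_k$ and sizes $n_k$, and overall mean $\bar{x}$; this notation is introduced here for illustration only.

```latex
\begin{align*}
\text{within-cluster sum of squares:}\quad
  \mathrm{WSS} &= \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \\
\text{between-cluster sum of squares:}\quad
  \mathrm{BSS} &= \sum_{k=1}^{K} n_k \,\lVert \mu_k - \bar{x} \rVert^2 \\
\text{total sum of squares:}\quad
  \mathrm{TSS} &= \sum_{i} \lVert x_i - \bar{x} \rVert^2 = \mathrm{WSS} + \mathrm{BSS}
\end{align*}
```

Because the total is fixed for a given dataset, making clusters more homogeneous (smaller WSS) is the same as making them more separated (larger BSS).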
12/8/2022
3 clusters? 5 clusters?
Distance Metrics – continuous variables
• When the input variables are continuous, use the Minkowski distance between two observations $X_1$ and $X_2$

Manhattan (or city block) distance
• $X_1 = (50, 20)$, $X_2 = (30, 10)$
• $d(X_1, X_2) = |x_{11} - x_{21}| + |x_{12} - x_{22}| = |50 - 30| + |20 - 10| = 30$

Euclidean distance
• $X_1 = (50, 20)$, $X_2 = (30, 10)$
• $d(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2} = \sqrt{(50 - 30)^2 + (20 - 10)^2} = \sqrt{500} \approx 22.4$
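The two worked distances can be checked in R. This is a sketch using base R's `dist`; the second observation (30, 10) is the one implied by the arithmetic in the example, and the helper `minkowski` is a hypothetical name that writes out the standard Minkowski formula $\left(\sum_j |a_j - b_j|^p\right)^{1/p}$, of which Manhattan (p = 1) and Euclidean (p = 2) are special cases.

```r
x1 <- c(50, 20)
x2 <- c(30, 10)   # second observation implied by the worked arithmetic
obs <- rbind(x1, x2)

manhattan    <- as.numeric(dist(obs, method = "manhattan"))
euclidean    <- as.numeric(dist(obs, method = "euclidean"))
minkowski_p3 <- as.numeric(dist(obs, method = "minkowski", p = 3))

# The same distances computed directly from the Minkowski formula
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

minkowski(x1, x2, p = 1)   # Manhattan
minkowski(x1, x2, p = 2)   # Euclidean
```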
[Figures: k-means iterations; in each step the 2 marked points are the new centroids of the 2 clusters]
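How the centroids move can be sketched as one k-means iteration written by hand. The points and starting centroids below are made up, and the helper names `assign_cluster` and `update_centroids` are hypothetical; the sketch assumes squared Euclidean distance and that no cluster ends up empty.

```r
# Made-up 2-D points forming two loose groups
pts <- rbind(c(1, 1), c(1.5, 2), c(1, 0.5),
             c(8, 8), c(9, 9.5), c(8.5, 9))
centroids <- rbind(c(0, 0), c(10, 10))   # arbitrary starting centroids

# Assignment step: index of the nearest centroid for every point
assign_cluster <- function(pts, centroids) {
  apply(pts, 1, function(p) {
    which.min(colSums((t(centroids) - p)^2))
  })
}

# Update step: each new centroid is the mean of the points assigned to it
update_centroids <- function(pts, cluster, k) {
  t(sapply(seq_len(k), function(j) {
    colMeans(pts[cluster == j, , drop = FALSE])
  }))
}

cluster   <- assign_cluster(pts, centroids)
centroids <- update_centroids(pts, cluster, k = 2)
centroids   # the 2 new centroids after one iteration
```

Repeating the two steps until the assignments stop changing gives the usual k-means procedure; base R's `kmeans` does this internally.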
• How to use SSE to evaluate a clustering solution?
  • The lower the SSE within a particular cluster (WSS, within-cluster sum of squares), the more homogeneous that cluster is, i.e. higher within-cluster similarity
  • The higher the SSE among different clusters (BSS, between-cluster sum of squares), the more heterogeneous the clusters are, i.e. lower between-cluster similarity

• Steps:
1. Compute the k-means clustering models for different k values
2. For each k, calculate the WSS
3. Plot WSS according to the value of k
4. Find the point where there is a bend (or elbow) in the curve. This is an indicator that the number of clusters is optimal
Source: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods
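The four steps can be sketched with base R's `kmeans`, whose result exposes the total WSS as `tot.withinss`. The data below are simulated (three well-separated groups), so the location of the elbow is known in advance; the range of k values and `nstart = 20` are arbitrary choices for the illustration.

```r
set.seed(42)
# Simulated data: three well-separated 2-D groups of 50 points each
x <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

k_values <- 1:8
# Steps 1-2: fit k-means for each k and record the total WSS
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Step 3: plot WSS against k
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (WSS)")

# Step 4: read the bend off the plot, then refit with the chosen k
fit <- kmeans(x, centers = 3, nstart = 20)
fit$betweenss / fit$totss   # share of the total sum of squares that is BSS
```

On real data the bend is often less sharp than in this simulated case, so the elbow is a guide rather than a definitive answer.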
EDA – example: health care insurance fraud
• A noteworthy feature in the distribution is the existence of negative payments.
• What these negative payments mean, and in which situations they occur, needs to be identified and verified
Source: Liu, Q. (2014). The application of exploratory data analysis in auditing (Doctoral dissertation, Rutgers University-Graduate School-Newark).
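A check for negative payments can be sketched as follows. The claim records are made up and the column names `claim_id` and `payment` are hypothetical; the point is only to show how such rows could be isolated for follow-up.

```r
# Hypothetical claim payments, including two negative amounts
claims <- data.frame(
  claim_id = 1:6,
  payment  = c(250.00, -80.50, 1200.00, 0.00, -15.25, 430.10)
)

summary(claims$payment)   # a negative minimum is the first hint

# Isolate the rows that need to be investigated and verified
negative       <- claims[claims$payment < 0, ]
n_negative     <- nrow(negative)
share_negative <- mean(claims$payment < 0)
```

Whether such rows are errors, reversals or something else cannot be decided from the numbers alone; they have to be traced back to the source system.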
Cluster Analysis – example: health care insurance fraud
• Selection of number of clusters
Example: After EDA
• Normal Behaviour
  • transaction amounts are fairly small (under $500)
  • payments for transportation and food transactions (don't have any fraud cases)
  • there are some merchants that don't have any cases of fraud within their transactions
• Abnormal Behaviour
  • transactions with high amounts (above $500)
  • transactions made during travel or for leisure activities (like sports/toys expenditure, hotels etc.)
  • there are some merchants where all transactions made to them are fraud

Example using R: Step 2 - Feature Engineering
Questions to ask yourself:
• Do you need to recode character variables to numeric variables?
• Do you need to create new variables?
• Do you need to remove any of the variables which are not useful when building your data model?
• Do you need to handle missing values or erroneous values?
• Do you need to standardize or normalize your variables?
[Figure: original dataset vs. dataset after feature engineering]
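Each of the feature engineering questions maps to a small data-frame operation. The sketch below uses made-up transactions with illustrative column names; median imputation, integer recoding and the $500 flag are just one possible choice per question, not the method used in the lecture example.

```r
# Made-up transactions; column names are illustrative only
tx <- data.frame(
  id       = 1:5,
  category = c("food", "travel", "food", "leisure", "transport"),
  amount   = c(12.5, 820, NA, 640, 7.8),
  stringsAsFactors = FALSE
)

# Handle missing values: impute the median amount (one option among several)
tx$amount[is.na(tx$amount)] <- median(tx$amount, na.rm = TRUE)

# Create a new variable: flag amounts above a $500 threshold
tx$high_amount <- as.integer(tx$amount > 500)

# Recode a character variable to numeric codes via a factor
tx$category_code <- as.integer(factor(tx$category))

# Remove a variable that is not useful for modelling (the row id here)
tx$id <- NULL

# Standardize the numeric amount
tx$amount_z <- as.numeric(scale(tx$amount))
```

Integer codes impose an artificial order on the categories; for models that are sensitive to this, dummy variables (for example via `model.matrix`) are a common alternative.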
Financial Statement Frauds
• Financial Statement Fraud
  • Deliberate misrepresentation of the financial condition of a company, e.g. omission of amounts or disclosures in the financial statements, with the intention to deceive or mislead the users of the financial statements
• Top 10 accounting scandals
  • Waste Management (1998)
  • Enron (2001)
  • WorldCom (2002)
  • Tyco (2002)
  • HealthSouth (2003)
  • Freddie Mac (2003)
  • American International Group (AIG) (2005)
  • Lehman Brothers (2008)
  • Bernie Madoff (2008)
  • Satyam (2009)
Source: https://corporatefinanceinstitute.com/resources/knowledge/other/top-accounting-scandals/

Famous case – Enron Scandal
• Enron deliberately misstated profits and cash flows and understated liabilities with the use of creative, yet questionable, accounting methods.
• A disguised loan in 1999, in which the proceeds from the sale of bonds were reported as cash from operations, overstated operating cash flow by $700 million. With the use of mark-to-market accounting, Enron recognized a very significant amount of future earnings as current income. This allowed a certain business unit to report a quarterly profit of $40 million when in fact this unit was operating at a loss. Another loan transaction was understated by $4.85 billion.

Overview: Fraud Data Analytics Methodology
1. Define scope of Fraud Data Analytics (starting point)
2. Fraud Scenario Identification
3. Data Analytics Strategies for Fraud Detection
4. Selection of Fraud Data Analytics Model
• The process is cyclical, NOT linear.
Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing: A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.
2: Fraud Scenario Identification (Re-visit)
• The person committing the fraud
  • The person can be internal or external
  • The person has direct or indirect access to the database

Inherent Fraud Scheme for Financial Statement Fraud
• The person committing the fraud would be less critical
  • Usually would be senior management or controller

2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
  • Components: Committing person, Fraudulent action, Transactions in financial statement, Entity
2: Fraud Scenario Identification
Some example predictor attributes:
• number of auditor turnovers
• total discretionary accruals
• Big 4 auditor
• accounts receivable
• allowance for doubtful accounts
• accounts receivable to total assets
• accounts receivable to sales
• whether meeting or beating forecast
• evidence of CEO change
• sales to total assets
• inventory to sales
• unexpected employee productivity
• percentage of executives on the board of directors
• total debt to total assets
• whether accounts receivable grew by more than 10 percent
• allowance for doubtful accounts to net sales
• current minus prior year inventory to sales
• gross margin to net sales
• evidence of CFO change
• holding period return in the violation period
• property plant and equipment to total assets
• value of issued securities to market value
• fixed assets to total assets
• days in receivables index
• industry ROE minus firm ROE
• positive accruals dummy
• whether gross margin grew by more than 10 percent
• allowance for doubtful accounts to accounts receivable
Source: Perols, Johan. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing: A Journal of Practice & Theory. 30. 10.2308/ajpt-50009.

3: Data Analytics Strategies for Fraud Detection
• Example techniques used to overstate an asset:
  • Recording an asset that does not exist
  • Recording a real asset before the liability occurs
  • Recording a real asset that is not owned by the company
  • Improper capitalization of a false expense
  • Improper capitalization of a real expense
  • Reporting the asset in the wrong section of the balance sheet
• Example techniques used to understate an asset:
  • Failure to record a real asset
  • Failure to capitalize a real expense
  • Failure to record an asset in the proper period
  • Reporting the asset in the wrong section of the balance sheet
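Several of the predictor attributes listed above are simple ratios or dummy variables derived from financial statement items. A minimal sketch on made-up figures; the firms, amounts and column names are all hypothetical, and only a handful of the attributes are shown.

```r
# Made-up figures for three firms; all numbers are illustrative
fs <- data.frame(
  firm                = c("A", "B", "C"),
  sales               = c(1000, 2500, 400),
  total_assets        = c(2000, 4000, 900),
  accounts_receivable = c(150, 900, 40),
  ar_prior_year       = c(140, 600, 45),
  inventory           = c(200, 300, 120),
  total_debt          = c(800, 3000, 200)
)

# Ratio attributes
fs$ar_to_sales        <- fs$accounts_receivable / fs$sales
fs$ar_to_total_assets <- fs$accounts_receivable / fs$total_assets
fs$sales_to_assets    <- fs$sales / fs$total_assets
fs$inventory_to_sales <- fs$inventory / fs$sales
fs$debt_to_assets     <- fs$total_debt / fs$total_assets

# Dummy attribute: did accounts receivable grow by more than 10 percent?
fs$ar_growth_gt_10pct <- as.integer(
  fs$accounts_receivable / fs$ar_prior_year - 1 > 0.10
)
```

Ratios make firms of different sizes comparable, which is why most of the listed attributes are scaled by sales or total assets.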
3: Data Analytics Strategies for Fraud Detection

4: Selection of Fraud Data Analytics Model
Fraud Detection Model
• Train Data → Develop the Fraud Detection Model → Fraud Detection Model
• New Data → Use Model → Make Predictions
Frequency of re-training the model depends on:
• Volatility of the fraud behaviour
• Detection power of the current model
• Amount of (similar) confirmed cases already available in the database
• Rate at which new cases are being confirmed
• Required effort to retrain the model

Topics to be covered later
• How to handle imbalanced datasets
• Techniques for fraud detection model
  • Machine Learning algorithm
  • Statistical technique – Benford's law
  • Social Network Analysis
• How to evaluate performance of fraud detection model