Lec 1

FITE7410
Introduction to Financial Fraud Analytics
Lecturers: Dr. Vivien Chan, Annie Chan
Tutor: Yanan Gong
Date: 7 & 8 Sep 2022

Today’s Agenda
01 Introduction to Financial Fraud Analytics
02 Credit Card Fraud Scheme
    - Intro to Exploratory Data Analysis (EDA)

01 Introduction to Financial Fraud Data Analytics
Dr. Vivien CHAN

What is a Red Flag?
• Examples of anomalies:
    • Outliers
    • Inliers where they are not expected

Examples of Red Flags
[Figure: examples of red flags]

False Positive vs False Negative
Definition:
• False Positive: red flags are identified in the fraud data profile, but the transaction is NOT fraudulent.
• False Negative: red flags are NOT identified in the fraud data profile, but the transaction is fraudulent.
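
A minimal sketch of how the two error types can be counted. It assumes a hypothetical list of transactions in which `flagged` records whether a red flag was raised and `is_fraud` records the confirmed outcome. The field names and sample records are illustrative assumptions, not data from the lecture.

```python
# Hypothetical labelled transactions: "flagged" = red flag raised by the
# fraud data profile, "is_fraud" = confirmed outcome after investigation.
transactions = [
    {"id": 1, "flagged": True,  "is_fraud": True},   # true positive
    {"id": 2, "flagged": True,  "is_fraud": False},  # false positive
    {"id": 3, "flagged": False, "is_fraud": True},   # false negative
    {"id": 4, "flagged": False, "is_fraud": False},  # true negative
    {"id": 5, "flagged": True,  "is_fraud": False},  # false positive
]


def count_errors(records):
    """Return (false_positives, false_negatives) for labelled records."""
    false_positives = sum(1 for r in records if r["flagged"] and not r["is_fraud"])
    false_negatives = sum(1 for r in records if not r["flagged"] and r["is_fraud"])
    return false_positives, false_negatives


fp, fn = count_errors(transactions)
print(f"False positives: {fp}, false negatives: {fn}")
```

In this toy data a false positive costs investigation effort on a legitimate transaction, while a false negative is a fraud that goes undetected.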

1: Define the scope of fraud analysis
A few examples:
• If the fraud data analytics is used in a whistle-blower allegation,
    • then the fraud data analytics plan is designed to refute or corroborate the allegation.
• If the fraud data analytics plan is used in a control audit,
    • then the fraud data analytics would search for internal control compliance or internal control avoidance.
• If the fraud data analytics is used for fraud testing,
    • then the fraud data analytics is used to search for a specific fraud scenario that is hidden in your database.

Overview: Fraud Data Analytics Methodology
[Diagram: a cycle of four steps]
    1. Define scope of Fraud Data Analytics (starting point)
    2. Fraud Scenario Identification
    3. Data Analytics Strategies for Fraud Detection
    4. Selection of Fraud Data Analytics Model
Starting point: Define scope of Fraud Data Analytics. BUT the process is cyclical, NOT linear.

2: Fraud Scenario Identification
• Inherent Fraud Schemes
    • The theory states that the number of inherent schemes and scenarios is finite and predictable
    • Each scheme comprises a committing person, an entity structure and a fraudulent action
    • Each inherent fraud scheme has a finite and predictable list of fraud permutations
    • Each fraud scheme permutation creates a finite and predictable list of fraud scenarios
    • How the inherent scheme occurs will be influenced by the business processes and internal controls
• Fraud is predictable with regard to the schemes that occur
• The number of schemes that can occur in a given business system is finite

2: Fraud Scenario Identification
A fraud scenario combines three elements:
• Committing person
    • The person committing the fraud
    • The person can be internal or external
    • The person has direct or indirect access to the database
• Entity
    • The entity is what a transaction is attached to in the business system
    • E.g. in a payroll system, ‘employee’ is the entity
    • In a credit card system, ‘card number’ is the entity
• Fraudulent action
    • The fraudulent action links the committing person and the entity
    • E.g. payment to a vendor without a purchase order

2: Fraud Scenario Identification
• Fraud scenario - Create the permutation of fraud scenarios
    • Committing person: Company Procurement Cardholder
    • Fraudulent action: Overbilling; Multiple procurements; Conflict-of-interests
    • Entity: Fake Vendor; Real Complicit Vendor; Real Non-Complicit Vendor

Overview: Fraud Data Analytics Methodology
[Diagram: methodology cycle, repeated]
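
The permutation of fraud scenarios can be sketched as a Cartesian product of the three elements. This is an illustrative sketch only: the labels come from the procurement example on the slide, and the variable names are assumptions.

```python
from itertools import product

# Elements taken from the procurement example on the slide.
committing_persons = ["Company procurement cardholder"]
fraudulent_actions = ["Overbilling", "Multiple procurements", "Conflict-of-interests"]
entities = ["Fake vendor", "Real complicit vendor", "Real non-complicit vendor"]

# Every (person, action, entity) combination is one candidate fraud scenario.
scenarios = list(product(committing_persons, fraudulent_actions, entities))

for person, action, entity in scenarios:
    print(f"{person} | {action} | {entity}")

print(f"Total candidate scenarios: {len(scenarios)}")
```

The product enumerates every combination. The diagram on the slide may link only some of them, so the list is best read as a starting point for review rather than a final set of scenarios.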
                                                                                                                                                                                                                                 12/8/2022

3: Data Analytics Strategies for Fraud Detection
• There are 2 steps
• Step 1: Handling the raw data
    • Obtaining the raw data
    • Cleaning the data
    • Data pre-processing
    • …
  NOTE: Techniques for handling raw data will be discussed later
• Step 2: Strategies for data analytics

The world’s best auditor using the world’s best audit program cannot detect fraud unless their sample includes a fraudulent transaction.

What are “Fraud Data”?
• Which of the following are “Fraud Data”?
    • Bank statements?
    • Network logs?
    • Customer data?
    • Firewall logs?
    • Email?
    • WhatsApp/WeChat/LINE or any other IM messages?
    • FB/IG/MeWe or any other social media pages?
    • Public information, e.g. news?
    • Documentation, e.g. user guides?
    • Server room check-in and check-out logbook?
    • CCTV?
    • …
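
The remark above, that an auditor cannot detect fraud unless the sample includes a fraudulent transaction, can be quantified with a small calculation. This is a sketch under stated assumptions: a simple random sample drawn without replacement, and hypothetical population figures that do not come from the lecture.

```python
from math import comb


def prob_sample_contains_fraud(population, frauds, sample_size):
    """Probability that a simple random sample (without replacement)
    contains at least one fraudulent transaction."""
    if sample_size > population or frauds > population:
        raise ValueError("sample_size and frauds must not exceed population")
    # P(no fraud in sample) = C(N - k, n) / C(N, n)
    p_none = comb(population - frauds, sample_size) / comb(population, sample_size)
    return 1.0 - p_none


# Hypothetical figures: 10,000 transactions, 5 of them fraudulent.
for n in (25, 100, 1000):
    p = prob_sample_contains_fraud(10_000, 5, n)
    print(f"sample of {n:>4}: P(at least one fraud) = {p:.3f}")
```

When fraud is rare, small random samples are unlikely to contain any fraudulent transaction, which is one motivation for analysing the full data set.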

Example of raw data – user login access log

Example of raw data – CRM records

Example of raw data – email
• Semi-structured

Some Basic Concepts - Data Table
• Columns = variables, fields, characteristics, attributes, features, etc.
• Rows = instances, observations, lines, records, tuples, etc.
[Figure: sample data records extracted from Enron case]

Basic Concepts – Data types
Continuous data
• Defined on an interval, with limited or unlimited value
• With or without a natural zero value
• Example:
    • amount of transactions
    • balance on savings account
    • similarity index

Basic Concepts – Data types
Categorical data
• Nominal
    • limited set of values with no meaningful ordering in between
    • e.g. marital status; payment type; country of origin
• Ordinal
    • take on a limited set of values with a meaningful ordering in between
    • e.g. age coded as young, middle-age, and old
• Binary
    • can only take on two values
    • e.g. yes/no
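
A minimal sketch of the data table and data type concepts above. The records, column names and values are hypothetical and chosen only to show one variable of each type.

```python
# Hypothetical data table: each dict is one row (record); each key is one
# column (variable). The values are made up for illustration.
records = [
    {"amount": 120.50, "payment_type": "card", "age_group": "young",      "is_fraud": False},
    {"amount": 980.00, "payment_type": "cash", "age_group": "old",        "is_fraud": True},
    {"amount": 45.25,  "payment_type": "card", "age_group": "middle-age", "is_fraud": False},
]

# Continuous: arithmetic such as the mean is meaningful.
mean_amount = sum(r["amount"] for r in records) / len(records)

# Nominal: only equality and counting are meaningful, not ordering.
payment_counts = {}
for r in records:
    payment_counts[r["payment_type"]] = payment_counts.get(r["payment_type"], 0) + 1

# Ordinal: an explicit rank encodes the meaningful ordering.
AGE_RANK = {"young": 0, "middle-age": 1, "old": 2}
sorted_by_age = sorted(records, key=lambda r: AGE_RANK[r["age_group"]])

# Binary: exactly two possible values, so summing booleans counts the True ones.
fraud_count = sum(r["is_fraud"] for r in records)

print(f"mean amount: {mean_amount:.2f}")
print(f"payment type counts: {payment_counts}")
print("age groups in order:", [r["age_group"] for r in sorted_by_age])
print(f"fraud records: {fraud_count}")
```

The data type determines which operations make sense: averaging suits `amount`, but not `payment_type`, and sorting `age_group` needs the explicit rank because alphabetical order would not match the intended ordering.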

Data Analytics Strategies
• 1/ Specific identification strategy
    • Usually this is based on the fraud scenarios to identify specific issues
• 2/ Internal control avoidance
    • To check the data against any company internal control policies
• 3/ Data interpretation
    • To identify patterns of behaviours from the data
• 4/ Number anomaly
    • There are several statistical techniques, e.g. Benford’s law (to be discussed in a later lecture).
    • Benford’s law: search for anomalies in the first, second, etc. digits of an amount. The anomaly is based on Benford’s distribution table.

Overview: Fraud Data Analytics Methodology
[Diagram: methodology cycle, repeated]

4: Selection of Fraud Data Analytics Model
[Diagram: four families of analytics around four learning approaches]
• Learning approaches (centre of the diagram): Supervised Learning; Unsupervised Learning; Semi-supervised Learning; Social Network Analysis
• Predictive analytics
    • Linear/Logistic Regression
    • Decision Tree
    • Ensemble Method
    • Random Forest
    • Neural Network
    • Support Vector Machines
    • … many more
• Descriptive analytics
    • Clustering (k-means)
    • Autoencoder
    • … many more
• Network analytics
    • Social Network Analysis
• Statistical analytics
    • Outlier detection techniques
        • Break-Point Analysis
        • Peer-group Analysis
    • Benford’s Law
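
As a preview of the number anomaly strategy, the first-digit form of Benford’s law gives the expected share of amounts starting with digit d as log10(1 + 1/d). The sketch below compares observed first-digit shares with that expectation. The amounts are hypothetical, and the helper names are assumptions made for illustration; the lecture’s own treatment comes in a later session.

```python
from collections import Counter
from math import log10


def benford_expected(digit):
    """Expected proportion of amounts whose first digit is `digit` (1-9)."""
    return log10(1 + 1 / digit)


def first_digit(amount):
    """First significant digit of a non-zero amount."""
    x = abs(amount)
    while x >= 10:   # scale large values down into [1, 10)
        x /= 10
    while x < 1:     # scale small values up into [1, 10)
        x *= 10
    return int(x)


def first_digit_report(amounts):
    """Return {digit: (observed share, expected share)} for digits 1-9."""
    counts = Counter(first_digit(a) for a in amounts if a != 0)
    total = sum(counts.values())
    return {d: (counts.get(d, 0) / total, benford_expected(d)) for d in range(1, 10)}


# Hypothetical amounts, for illustration only.
amounts = [1200.0, 130.5, 18.9, 2500.0, 97.0, 1020.0, 310.0, 45.0, 1999.0, 0.72]
for digit, (observed, expected) in first_digit_report(amounts).items():
    print(f"{digit}: observed {observed:.2f}, expected {expected:.3f}")
```

A large gap between observed and expected shares is a red flag worth investigating, not proof of fraud. Ten amounts are far too few for a meaningful comparison; real use would need a much larger data set.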

1: Define scope of Fraud Data Analytics
• What is the type of credit card fraud scenario?
    • Example: Credit card renewal discount offer
• What are the objectives of the fraud data analytics?
    • To identify any non-compliant behaviours of credit card representatives that would cause revenue loss for the credit card company

2: Fraud Scenario Identification
• Example#1: Credit card renewal discount offer
• Who is the committing person?
    • Credit card representatives
• What are the possible entities involved?
    • Cardholder (can be real or fake)
• What are the possible fraudulent actions?
    • Offering higher discounts than allowed
    • Offering high discounts without making an effort to negotiate with cardholder
    • Offering discounts without negotiation with cardholder
    • Conflict-of-interests
    • …

2: Fraud Scenario Identification
• Create the permutation of fraud scenarios
    • Committing person: Credit card representatives
    • Fraudulent action: High discount outside company guideline; High discount without negotiation; High discount without making effort for negotiation
    • Entity: Real cardholder; Fake cardholder

  Source: Liu, Qi. (2019). An Application of Exploratory Data Analysis in Auditing – Credit Card Retention Case. 10.1108/978-1-78743-085-320191001.
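The permutation of fraud scenarios can be sketched as a Cartesian product of the committing person, the fraudulent action and the entity involved. This is a minimal illustration only: the list values are copied from the slide, while the variable names are hypothetical.

```python
from itertools import product

# Values taken from the slide; the variable names are illustrative only.
committing_persons = ["Credit card representative"]
fraudulent_actions = [
    "High discount outside company guideline",
    "High discount without negotiation",
    "High discount without making effort for negotiation",
]
entities = ["Real cardholder", "Fake cardholder"]

# Every (person, action, entity) combination is one candidate fraud scenario.
scenarios = list(product(committing_persons, fraudulent_actions, entities))

for number, (person, action, entity) in enumerate(scenarios, start=1):
    print(f"Scenario {number}: {person} | {action} | {entity}")
```

Each scenario can then be reviewed in turn to decide which data fields and tests are needed to detect it.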
                                                                                                                                                                                                                                                                                                                                                                                      12/8/2022
3: Data Analytics Strategies for Fraud Detection                                                                         3: Data Analytics Strategies for Fraud Detection                                                                                                                     3: Data Analytics Strategies for Fraud Detection
  • The account master data is a large dataset with 60,309,524                                                                   • Data pre-processing                                                                                                                                           4 data analytics strategies
    records and 504 fields.                                                                                                                • E.g. Data transformation                                                                                                                            • 1/ Specific identification strategy
                                                                                                                                                     • achieved by the logarithm function.
  • Description of 8 selected attributes in this credit card case                                                                                                                                                                                                                                    • E.g. credit card holder with credit card numbers inconsistent with
                                                                                                                                           • E.g. Feature re-engineering: Creation of new attributes –                                                                                                 the company’s offer
                                                                                                                                             ‘Discount’
                                                                                                                                           • 2 attributes related to ‘Discount’                                                                                                                  • 2/ Internal control avoidance:
                                                                                                                                                     • Original fee = original annual fee                                                                                                            • E.g. offer of discount that is not complied with company policy
                                                                                                                                                     • Actual fee = actual annual fee paid                                                                                                       • 3/ Data interpretation
                                                                                                                                                                                                                                                                                                     • E.g. any questionable discount offer
                                                                                                                                                            Q: What do the following values mean?
                                                                                                                                                                                                                                                                                                 • 4/ Number anomaly
                                                                                                                                                                    Case 1: Discount = 0%                                                                                                            • E.g. pattern and frequency of discount offer by credit card
                                                                                                                                                                   Case 2: Discount = 100%                                                                                                             representatives
                                                                                                                                                                 Case 3: Discount = negative value
   Source: Liu, Qi. (2019). An Application of Exploratory Data Analysis in Auditing – Credit Card Retention Case.                    Source: Liu, Qi. (2019). An Application of Exploratory Data Analysis in Auditing – Credit Card Retention Case. 10.1108/978-1-78743-085-320191001.
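A possible way to derive the 'Discount' attribute and read the three cases is sketched below. The slide names the two fee attributes but not the formula, so the definition Discount = (original fee − actual fee) / original fee is an assumption, as are the function names, the fee amounts and the wording of each interpretation.

```python
def discount_rate(original_fee, actual_fee):
    """Fraction of the original annual fee that was waived (assumed definition)."""
    if original_fee <= 0:
        raise ValueError("original_fee must be positive")
    return (original_fee - actual_fee) / original_fee


def interpret(discount):
    """Map a discount rate to the three cases raised on the slide (one plausible reading)."""
    if discount < 0:
        return "negative: cardholder paid more than the original fee"
    if discount == 0:
        return "0%: full fee paid, no discount"
    if discount == 1:
        return "100%: fee fully waived"
    return "partial discount"


# Hypothetical (original fee, actual fee) pairs.
accounts = [(100.0, 100.0), (100.0, 0.0), (100.0, 120.0), (100.0, 60.0)]
for original, actual in accounts:
    rate = discount_rate(original, actual)
    print(f"{original:>6.0f} {actual:>6.0f} {rate:>7.0%}  {interpret(rate)}")
```

Under this definition, a negative discount is worth checking as a possible data error rather than a genuine offer.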
                                                                                                                                                                                                                                             Data Analysis                                       • However, with the growing size (5Vs) of data sets
                                                                                                                                                                                                                                             (EDA)                                                 nowadays, it is not feasible to use only traditional EDA to
                                                                                                                                                                                                                                                                                                   analysis data sets
                                                                                                                                                                                                                                             Dr. Vivien CHAN
 Exploratory Data Analysis (EDA)                                                                                                                                                 Step-by-step EDA                                                                                             EDA – Step 1: Distinguish Attributes
  • Purpose of EDA:                                                                                                                                                                                                                                                                                Objective
       • to have a better understanding of the data before building the                                                                                                                                                                                                                       • To identify the attributes in a dataset in order to formulate a clearer goal of the
         fraud detection model                                                                                                                                                                                                                                                                  data analytics process
       • to detect problems in data                                                                                                                                                                                                                                                           • To understanding the meaning of each attribute before analyzing the data
                                                                                                                                                                                                                                                                                                 Explore what?
  • When performing EDA, fraud analysts need to keep the
    following in mind:                                                                                                                                                                                                                                                                        • Attribute names, datatypes, number of attributes, etc.
       • What is the purpose of this fraud analysis?                                                                                                                                                                                                                                          • Continuous vs Categorical data types
       • What are the insights or in-depth knowledge about the fraud that
         can be derived?
                                                                                                                                                                                                                                                                                                  Techniques
       • Are the objectives of EDA aligned with the problem on hand?
                                                                                                                                                                                                                                                                                              • Descriptive summary of the dataset
                                                                                                                         Ref: Ghosh et al. (2018). A comprehensive review of tools for exploratory analysis of tabular industrial datasets
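A descriptive summary for Step 1 might look like the sketch below. It uses plain Python to stay dependency-free; the records, the attribute names and the rule "numeric means continuous" are all simplifying assumptions, not fields or rules from the case study.

```python
import statistics
from collections import Counter

# Hypothetical records; attribute names are placeholders, not the case study's fields.
records = [
    {"card_type": "gold", "original_fee": 100.0, "actual_fee": 80.0},
    {"card_type": "classic", "original_fee": 50.0, "actual_fee": 50.0},
    {"card_type": "gold", "original_fee": 100.0, "actual_fee": 0.0},
    {"card_type": "platinum", "original_fee": 200.0, "actual_fee": 150.0},
]


def describe(rows):
    """Return a per-attribute summary: kind, plus basic statistics."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Simplification: every numeric attribute is treated as continuous.
            summary[name] = {
                "kind": "continuous",
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
            }
        else:
            summary[name] = {
                "kind": "categorical",
                "levels": dict(Counter(values)),
            }
    return summary


summary = describe(records)
print("Number of attributes:", len(summary))
for name, stats in summary.items():
    print(name, stats)
```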
    EDA – Step 4: Detect aberrant & missing values                                                                                                 Data Cleaning                                                        Aberrant and Missing Data
            Objective                                                                                                       • Raw data contains noise, inconsistencies and                                • Reasons for aberrant and
    • Aberrant and missing values may result in biased analysis of data. Thus,                                                incompleteness.                                                               missing data
      need to identify and detect any outliers and missing values in the dataset                                            • “Dirty” data can cause confusion for the data analytics                        • Human input error
                                                                                                                              procedure.                                                                     • Intentionally hiding some
        Explore what?                                                                                                                                                                                          information
                                                                                                                                •   Incomplete/Missing data
    • Aberrant values : Erroneous values which occur as a result of incorrect user                                                                                                                           • Not applicable values, e.g. if
                                                                                                                                •   Noisy data (outliers or errors)
      inputs or calculation errors                                                                                                                                                                             there are records without visa
    • Missing values : Occur in a dataset during data extraction and/or data                                                    •   Data inconsistencies (similar to data conflict in data integration)        card, visa card transactions will
      collection.                                                                                                               •   Duplicate records (similar to data duplication in data integration)        not be applicable
                                                                                                                            • Thus, data cleaning is an essential process in data pre-                       • Not matching search or filter
          Techniques                                                                                                                                                                                           criteria, e.g. if transaction > 1
                                                                                                                              processing.
    • Performed AFTER multivariate analysis when you have a clearer idea about                                                                                                                                 billion
      the attributes
    • Detection of abnormalities in univariate, bivariate and multivariate
      visualizations
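Besides visual inspection, missing and aberrant values can be flagged numerically. The sketch below counts missing entries and applies Tukey's 1.5 × IQR rule to find outliers; the rule is one common choice and is not prescribed by the slide, and the amounts are hypothetical.

```python
import statistics

# Hypothetical transaction amounts; None marks a missing value.
amounts = [120.0, 95.0, None, 110.0, 105.0, 5000.0, 98.0, None, 102.0]

# Positions of missing values, and the values that were actually observed.
missing_positions = [i for i, v in enumerate(amounts) if v is None]
observed = [v for v in amounts if v is not None]

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < lower or v > upper]

print("Missing at positions:", missing_positions)
print("Outlier fences:", lower, upper)
print("Outliers:", outliers)
```

A flagged value is only a candidate: it still has to be checked against the business context before it is treated as an error or as possible fraud.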
              How to handle missing data?                                                                                   Handling Missing Data - deletion                                              Handling Missing Data - deletion
                                                                                                                                Likewise Deletion                               Pairwise Deletion
                                                                                                                                                                                                                 Likewise                          Pairwise
                                                       Supervised Learning
                                                                                                                                                                                                                 Deletion                          Deletion
                                                                                                                                                                                                                                                    Omits the variables with
                                                                                                                                                                                                                   Simplest way to handle           missing value and all
                                                      Un-supervised Learning                                                                                                                                       missing data                     records are involved in the
                                     Techniques
analysis
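The two deletion approaches can be contrasted with a small sketch. Listwise deletion is read here as dropping any record with a missing value, and pairwise deletion as keeping, for each analysis, every record that has the attributes that analysis needs. The records and the helper name `pairwise_rows` are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 1000, "spend": 200},
    {"age": 40, "income": None, "spend": 350},
    {"age": None, "income": 1800, "spend": 300},
    {"age": 52, "income": 2000, "spend": None},
    {"age": 33, "income": 1400, "spend": 250},
]

# Listwise deletion: drop every record that has any missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]


def pairwise_rows(records, first, second):
    """Pairwise deletion: keep records where both attributes of interest are present."""
    return [r for r in records if r[first] is not None and r[second] is not None]


print("Listwise keeps", len(complete_rows), "of", len(rows), "records")
for first, second in [("age", "income"), ("age", "spend"), ("income", "spend")]:
    kept = pairwise_rows(rows, first, second)
    print(f"Pairwise ({first}, {second}) keeps {len(kept)} records")
```

Listwise deletion is simpler but discards more data; pairwise deletion keeps more records, at the cost of each statistic being based on a different subset.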
   • Also known as ETL (Extract, Transform, Load) – data is                                                                      differences in the ranges of attribute.
     extracted from multiple sources, transformed to a single                                                                                                                                       • Z-score Standardization
                                                                                                              Normalization    • It is required only when attributes have different
     format, and loaded into a data warehouse for data analysis                                                                  ranges. For example,
     process.                                                                                                 Discretization       • AGE range from 0 – 100; INCOME range from
                                                                                                                                                                                                    •   (natural) log or base-10 log
   • It serves several purposes:                                                                                                     10,000 – 100,000
                                                                                                                                   • Problem : INCOME might have a larger effect on the             •   Square root
       • For easy comparison among different data sets with diverse formats
                                                                                                                                     predictive power of the model due to its larger value          •   Inverse
       • For easy combination with other data sets to provide insights
                                                                                                                                     (100 times larger than AGE)                                    •   Square
       • To perform aggregation of data
                                                                                                                                                                                                    •   Exponential
                                                                                                                                                                                                    •   Centring (subtract mean)
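Some of the transformations listed above can be sketched in a few lines. The AGE and INCOME values are hypothetical numbers chosen within the ranges quoted on the slide; min-max scaling is added as a common normalization choice, and the z-score here uses the population standard deviation, which is also an assumption.

```python
import math
import statistics

# Hypothetical values within the ranges quoted on the slide.
age = [20, 35, 50, 65, 80]
income = [10_000, 25_000, 40_000, 70_000, 100_000]


def min_max(values):
    """Rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def centre(values):
    """Centring: subtract the mean."""
    mean = statistics.mean(values)
    return [v - mean for v in values]


# Base-10 log compresses the wide INCOME range.
log_income = [math.log10(v) for v in income]

print("AGE min-max:   ", [round(v, 2) for v in min_max(age)])
print("INCOME min-max:", [round(v, 2) for v in min_max(income)])
print("INCOME z-score:", [round(v, 2) for v in z_score(income)])
print("INCOME log10:  ", [round(v, 2) for v in log_income])
```

After min-max scaling or standardization, AGE and INCOME are on comparable scales, so neither dominates a model purely because of its units.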
                                                          values (discrete/categorical variable).
                                                        • Method : Binning transformation (or categorization)
                                                        • Examples
                                                            • replace AGE (numeric value) with AGEGROUP (children, youth, adult, elderly)
                                                            • group rare levels into one discrete group “OTHER”, e.g. use “OTHER” to represent those with values that occur less than a specified cutoff value (e.g. less than 5%)

                                                        • For example,

                                                              CUSTOMER        INCOME
                                                              A               1,000
                                                              B               1,200
                                                              C               1,300
                                                              D               2,000
                                                              E               1,800
                                                              F               1,400

                                                            • Create 2 bins of equal range:
                                                                • BIN 1 (range 1,000 – 1,500) : A, B, C, F
                                                                • BIN 2 (range 1,500 – 2,000) : D, E
                                                            • Create 2 bins of equal frequency:
                                                                • BIN 1 : A, B, C
                                                                • BIN 2 : D, E, F
                                                        • The above methods do not take into consideration the target variable (e.g. FRAUD)

                                     Normalization
                                     Discretization
                                     Reduction
                                     FEATURE ENGINEERING
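The two binning methods can be reproduced on the CUSTOMER / INCOME table with a short sketch. The function names are hypothetical, and because the slide's ranges share the boundary 1,500, this sketch assumes a value exactly on a boundary goes to the higher bin.

```python
# CUSTOMER -> INCOME, as given in the slide's table.
income = {"A": 1000, "B": 1200, "C": 1300, "D": 2000, "E": 1800, "F": 1400}


def equal_range_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for name, value in values.items():
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((value - lo) / width), n_bins - 1)
        bins[index].append(name)
    return bins


def equal_frequency_bins(values, n_bins):
    """Sort by value and cut into n_bins groups of (nearly) equal size."""
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    return [sorted(ordered[i * n // n_bins:(i + 1) * n // n_bins])
            for i in range(n_bins)]


print("Equal range:    ", equal_range_bins(income, 2))
print("Equal frequency:", equal_frequency_bins(income, 2))
```

Both functions look only at INCOME, which is the limitation the slide points out: neither uses the target variable when choosing the cut points.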
                                                                                                                                                                                                                                                                         Variable Creation
                                                                                                                                                                     Selection                                                                                                                 Selection        visualize the variation present in a dataset with many variables
                                                                    Techniques:
                                                                                                                                                                                     • Correlation with target variable
                                      Dimensionality                • Feature Elimination:                                                                                           • Information criteria
                                        Reduction                       • Drop some features that may be unimportant.                                                 Variable       • Clustering of variables                                                                                  Variable    • Basics of PCA are as follows:
                                                                        • While the approach is simple, it may lose useful information present                      Extraction                                                         TARGET VARIABLE                                        Extraction
                                                                                                                                                                                                                                                                                                              • A dataset with many variables
                                                                           in those dropped features.
                                                                    • Feature Extraction:                                                                                                                                                                                                                     • Simplify that dataset by turning the original variables into a
                    Variable                            Variable        • Transform the original set of features into another set of features.                                                                                                                                                                  smaller number of "Principal Components"
                    Selection                          Extraction       • To pack the most important information into as few derived features
                                                                                                                                                                                                          INPUT
                                                                           as possible
                                                                        • Reduce the number of dimensions by dropping some of the derived
                                                                           features. But don't lose complete information from the original
                                                                           features: derived features are a linear combination of the original
                                                                           features.
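Feature extraction by PCA can be illustrated on two features, where the eigen-decomposition has a closed form. This is a sketch under stated assumptions: the data are hypothetical, only two variables are used so that no linear-algebra library is needed, and in practice a library routine would normally be used for larger datasets.

```python
import math
import statistics

# Hypothetical, roughly linearly related features (already on similar scales).
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# Centre each feature, then build the 2x2 sample covariance matrix [[a, b], [b, c]].
mx, my = statistics.mean(x), statistics.mean(y)
xc = [v - mx for v in x]
yc = [v - my for v in y]
n = len(x)
a = sum(v * v for v in xc) / (n - 1)
c = sum(v * v for v in yc) / (n - 1)
b = sum(p * q for p, q in zip(xc, yc)) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lambda1, lambda2 = half_trace + radius, half_trace - radius

# Eigenvector for the largest eigenvalue (valid when b != 0), scaled to unit length.
norm = math.hypot(b, lambda1 - a)
pc1 = (b / norm, (lambda1 - a) / norm)

# First principal component scores: a linear combination of the original features.
scores = [pc1[0] * p + pc1[1] * q for p, q in zip(xc, yc)]
explained = lambda1 / (lambda1 + lambda2)

print("PC1 direction:", pc1)
print(f"Share of variance kept by PC1: {explained:.1%}")
```

Keeping only `scores` replaces two correlated features with one derived feature, and `explained` indicates how much of the original variation that single feature retains.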
                                                   Any insights you get from this initial univariate analysis?

                                                                                                                               • Ghosh, A., Nashaat, M., Miller, J., Quader, S., & Marston, C. (2018). A comprehensive review of tools for exploratory analysis of tabular industrial datasets. Visual Informatics, 2, 235–253.
                                                                                                                               • Lunardon, N., Menardi, G., & Torelli, N. (2014). ROSE: a Package for Binary Imbalanced Learning. R J., 6, 79.
                                                                                                                               • Menardi, G., & Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28(1), 92–122.