DATA ANALYTICS
5.1 INTRODUCTION, OUTCOME AND ASSESSMENT CRITERIA
Your studies in AIN1501 introduced you to the concept of data. In AIN2601, you explored data in the
computerised system process, the database environment, relational database terminology, and how
databases are utilised in the business world (this included concepts like data warehouses, data
marts, data mining and OLAP). It is important to understand these concepts very well before
attempting to perform and understand data analytics. It is therefore advisable to revise this content
before continuing with this learning unit.
We will kick off with a discussion of the world of data, in which we will also discuss an emerging
technology called big data. Once you understand these two concepts, we will focus on the data
analytics cycle.
In the current age, you will agree that data are all around us. Figure 5.1 indicates the increase of
data, from 2 zettabytes in 2010 to a forecasted 175 zettabytes in 2025. You will notice from this
graph that data are increasing exponentially, so imagine how much data will be generated by 2030.
Figure 5.1 Volume of data/information created, captured, copied, and consumed worldwide from
2010 to 2025 (Reinsel, Gantz & Rydning, 2018)
“The data volumes are exploding; more data has been created in the past two years than in the
entire previous history of the human race.”
Look again at figure 5.1, and you will agree that this is still the case. To make matters more complex,
it is estimated that 90% of data are unstructured (Drury, 2022:671). For an explanation of structured
versus unstructured data, refer to learning unit 2.
This enormous increase in data gave rise to an emerging technology called big data, which can be
analysed by making use of data analytics tools. Examples of data analytics tools are machine learning
and deep learning.
Big data is covered in detail in AIN1501. Please study this section in AIN1501 before continuing to
the next topic in this section.
In your AIN1501 learning material, you would have noticed that there are numerous benefits that
big data offers. Unfortunately, certain risks also come with these benefits (also discussed in your
AIN1501 material). To mitigate or limit these risks, it is necessary to implement internal controls.
                 For organisations to limit their risks regarding big data, it is important for them to
                 implement the following (Richard et al, 2021: 8/74):
                    •   Access management controls (covered in AIN1501) should be implemented
                        over big data.
                    •   Change management controls (covered in AIN1501) should be
                        implemented over big data.
                    •   Backup and disaster recovery controls (covered in AIN1501) should be in
                        place over big data.
                    •   Staff should be trained on how to use big data, as well as on the privacy
                        concerns regarding big data.
Activity 5.1
Ms Duma, the CEO of Ikhishi Likagogo, the company that manufactures breads, cakes, and South
African delights, has called a meeting with you, Ms Jansen (financial director), Ms Tau (product
manager of the cake division) and Mr de Villiers (CIO).
She said that it is clear from figure 5.2 that the decline of sales of Ikhishi Likagogo was mainly due to
the decline of sales in the Cakes division. She was very concerned and asked the meeting members
to brainstorm a solution for this.
Mr de Villiers went first and stated that, in his view, the problem is that Ikhishi Likagogo no longer
knows its customers and is therefore not supplying the correct products to them. He also stated
that big data can assist Ikhishi Likagogo in resolving this problem.
Ms Duma replied: “Thanks Mr de Villiers. I am not sure what this big data is and what you are talking
about. But I am willing to let the AIN3701 student investigate this matter and present at our next
meeting to explain this to me.”
End of activity
Data analytics is the process of analysing raw data in order to draw out meaningful, actionable
insights, which are then used to inform and drive smart business decisions (Stevens, 2022). It is
important to note that data analytics provides a way to search through structured and unstructured
data (Richardson et al, 2023). The benefit of data analytics is that it will enable organisations to
discover various buying patterns of their customers, investigate anomalies that were not
anticipated, and forecast future possibilities (Richardson et al, 2023). These are only a few of the
benefits; there are many more.
Now you might ask me, what does this have to do with the accounting professional of the future?
Richardson et al (2022:13-14) provide some guidance on this: “Accounting professionals do not need
to become data scientists, or build a database from scratch, or even perform real hardcore data
analytics, but they need to know how to do the following:
    •   Clearly articulate the business problem an organisation is facing.
    •   Communicate with data scientists about specific data needs and understand the underlying
        quality of data.
    •   Draw appropriate conclusions to the business problem based on the data and make
        recommendations on a timely basis.
    •   Present their results to individual members of management in an accessible manner. This
        can include CEOs, audit managers, etc.”
If one looks at the above requirements, analytically minded accountants should have skills in the
following areas (Richardson et al, 2022:14):
    •   Develop an analytics mindset – know when and how data analytics can address business
        questions.
    •   Data cleaning and preparation – comprehend the process needed to clean and prepare the
        data before analysis.
    •   Know that data quality consists of completeness, reliability, and validity.
    •   Descriptive data analysis – perform basic analysis to understand the quality of the underlying
        data and its ability to address the business question.
    •   Data analysis through manipulation – demonstrate the ability to sort, rearrange, merge,
        and reconfigure data in a manner that allows enhanced analysis. This may include
        diagnostic, predictive, or prescriptive analytics to appropriately analyse the data.
    •   Statistical data analysis competency – identify and implement an approach that will use
        statistical data analysis to draw conclusions and make recommendations on a timely basis.
    •   Data visualisation and data reporting – report the results of the analysis in a way that is
        accessible to each decision maker and his or her specific needs.
Activity 5.2
        Data analytics is expected to have dramatic effects on auditing and financial reporting as
        well as tax and managerial accounting (Richardson et al, 2023:5).
Choose one area (auditing, financial accounting, managerial accounting, or tax) and discuss the
above statement as it applies to that area.
Share your reply in the data analytics Discussion Forum with your fellow students. Please follow the
rules of netiquette we covered in learning unit 0 and adhere to the following guidelines for
participating in forums:
•   Compile your post offline and keep a record of it.
•   Use an academic writing style for referencing and citing the sources you used.
•   Post your answer on the forum.
•   Reply to the contributions of at least two of your fellow students.
End of activity
In the next section, we are going to discuss the steps in the data analytics process.
To perform data analytics, it is important to follow a structured approach. There are a number of
data analytics approaches, like the CRISP-DM (Cross-Industry Standard Process for Data Mining)
model, but for AIN3701 we are going to use the IMPACT model, which is prescribed by Richardson
et al in the textbook Data Analytics for Accounting. The IMPACT model is an acronym which can be
explained as follows:
I - Identify the questions
M - Master the data
P - Perform a test plan
A - Address and refine the results
C - Communicate insights
T - Track outcomes
[Figure: the IMPACT cycle shown as a loop – Identify the questions, Master the data, Perform a test
plan, Address and refine the results, Communicate insights, and Track outcomes]
Figure 5.3 The IMPACT model (Richardson et al, 2023:9). Illustrated by Van Staden (2022).
In the next sections, we will be discussing the steps of the IMPACT cycle in detail. It is important that
you can apply these steps.
The question identification in the data analysis process is very important. If you ask the wrong
question, you are not going to get the right answer. As an accounting professional, you will possess
a unique skill set to improve an organisation’s data analytics: the ability to ask the right questions,
because accounting professionals understand an organisation’s financial data (Richardson et al,
2023:9).
In the workplace you could be tasked within a group to solve a problem using data analytics (or in
some cases even data science). It is important that you are very active in identifying the problem by
defining the correct question. But how do you do that?
To assist you in identifying the correct question, you can use root cause analysis techniques. Root
cause analysis is an approach for identifying the underlying causes of an incident so that the most
effective solutions can be identified and implemented (ThinkReliability, 2022).
Figure 5.4 Root cause analysis (ThinkReliability, 2022)
It is important to note that the symptom is not necessarily the problem itself. In most cases there
are underlying causes, which are referred to as the root of the problem. This is illustrated in figure
5.4. Therefore, by solving the root of the problem (and note that there could be more than one root
cause of a problem), the symptoms will improve or even disappear.
There are a few techniques you can use to arrive at the root cause of a problem. The first technique
we are going to discuss is the Five Whys technique. The Five Whys technique (5W) involves asking
“why” repeatedly about a specific problem until the root cause of the problem is identified. Let me
illustrate:
Flikkers Disco (Flikkers) is a nightclub in Sunnyside, Pretoria. Flikkers also serves beverages (alcoholic
and non-alcoholic) and light meals in the evenings and is open from 18:00 to 23:00 each night. In
December 2023, the company’s year end, the directors noticed in the financial statements that net
profit had declined by almost 45% from the 2022 financial year. This is very concerning, and they
have asked you to investigate this problem.
    •    Why did profit decline from 2022 to 2023?
         Answer:
         The drop in profit was caused mainly by a decrease in revenue. Most of the expenditure
         items increased below inflation, indicating that some cost containment measures were
         introduced.
Flikkers can now action a solution to this problem. They need to determine which music customers
prefer. This is an example of a data analysis exercise, where Flikkers will need to determine their
customers’ music preferences (including but not limited to genres, styles, or music bands) to draw
them back to the nightclub.
Another technique that can be used to identify the problem is the 5W1H technique. The 5W1H
technique assists you to understand a potentially problematic situation by asking the right questions.
The questions are (Galiana, 2019):
    •    What: description of the problem, situation or task.
    •    Who: the people involved, affected or responsible.
    •    Where: the location or process where the problem occurs.
    •    When: the timing, duration and frequency of the problem.
    •    Why: reasons, cause of problems.
    •    How: the methods, processes or circumstances involved.
The problem statement in this case will be: Due to a decline in profitability in 2023, the board of
directors will launch an investigation to determine the music preferences of customers at Flikkers in
Sunnyside, Pretoria. If this investigation is not done, it could lead to possible retrenchments of staff
of Flikkers or even closing the Flikkers club down in Sunnyside.
I think you will agree that this is a nice and comprehensive problem statement.
The last technique we are going to discuss, which you may encounter in practice, is the fishbone
diagram. For a discussion on this technique, study the article by Lewis (2020) by clicking here.
[Fishbone diagram: the decrease in profit traced back to its causes – Environment: new competitor;
Clients: do not like the music; Decrease in revenue: fewer customers; Cost containment: no
increases, fewer bouncers]
Activity 5.3
After the meeting held in activity 5.1, Ms Duma asked you to assist in identifying why the Cakes
division sales are currently declining (see figure 5.2).
Apply the root cause analysis techniques covered in this section to identify the root cause of the above problem:
Prepare this activity in Word, and then convert your Word document into a PDF document. Then
upload your PDF document by clicking here.
IMPORTANT: YOU NEED TO SHOW ALL YOUR STEPS TO IDENTIFY THE ROOT CAUSE OF THE
PROBLEM.
End of activity
Mastering the data requires one to know what data are available and whether those data might
help address the problem identified in step 1 (Richardson et al, 2023:10). This step requires that you
have comprehensive knowledge of the data: how to access them, their availability and reliability,
how frequently they are updated, and the time periods the data cover, to make sure the data
coincide with the timing of the business problem. Step 2 can be divided into the following sub-steps:
    •   Step 2a: Determine the purpose and scope of the data request.
    •   Step 2b: Obtain the data.
    •   Step 2c: Validate the data for completeness and integrity.
    •   Step 2d: Clean the data.
    •   Step 2e: Load the data in preparation for data analysis.
Data scientists refer to the above process as the ETL process, which stands for extract(ing),
transform(ing) and load(ing) data. Next, we will discuss how to determine the purpose and scope of
the data request.
5.6.1    Step 2a: Determine the purpose and scope of the data request
To determine the purpose and scope of the data request, you need to answer the following
questions (Richardson et al, 2023:61):
    •    What is the purpose of the data request? What do you need the data to solve? What
         business problem will it address?
    •    What risks exist in data integrity? If there are risks, is there any mitigation plan?
    •    What other information will impact the nature, timing, and extent of the data analysis?
Make sure you document the purpose and scope of the data request. This will include any risks and
assumptions made on the data.
In data analytics, it is crucial that you understand the business domain, the data domain and the
statistical tools required to solve the problem. In defining the business problem in step 1, you
utilised your understanding of the business.
Now, to solve this problem, you will need to know what data you require. Are the data currently
available in the organisation’s databases, or do you need to utilise data-gathering techniques to
obtain the data? Maybe you need to consult external data sources? This concept will be elaborated
on in section 5.6.2.1.
Once you know which data source or sources are going to be used, you need to know where the
data are stored. Are the data stored in a relational database, a spreadsheet or even a flat file? In most
organisations, data will be stored in a relational database, like Oracle, SQL Server or Microsoft Access.
It is good practice to normalise databases to ensure data integrity. Section 5.6.2.2 will explain this
concept to you.
Once you know where to obtain your data from, you will need to investigate what data are available
in your data source. As stated before, it is very important to know your data. If you don’t understand
the data and use the wrong data fields, your analysis will be based on irrelevant data, leading to an
incorrect answer. To gain an understanding of your data, you will find that database schemas (also
called ER diagrams) and data dictionaries are very useful. We will discuss these data exploratory
tools in section 5.6.2.3.
Once you understand the data, it is time to collect the data. The data collection process will be
discussed in section 5.6.2.4.
5.6.2.1 Data sources
One must remember that data can come from many sources. These can normally be classified as
internal and external data sources. Data are internal if an organisation generates, owns, and controls
them (Bekker, 2017). This can be done through separate systems like the accounting information
system, supply chain management system, customer relationship management system and the
human resource management system, to name a few. These days more and more organisations are
making use of Enterprise Resource Planning (ERP) systems, which is business management
software that integrates applications throughout the business into one system (Richardson et al,
2023:54).
External data are public data, or data generated outside the company; therefore, the organisation
neither owns nor controls them (Bekker, 2017). Examples of external data sources are the
South African Data Portal (https://southafrica.opendataforafrica.org/), the South African Reserve
Bank (https://www.resbank.co.za/en/home) and Stats SA (https://www.statssa.gov.za/).
                It is important to ensure that the data source from which data are obtained is
                complete, accurate and valid.
                Do you have permission to obtain data from the data source you have selected?
                You need to obtain permission to use data from the data owner (whether internal
                or external). If you use an external data source, did you read the terms of usage?
                Do you comply with the terms? Are you perhaps in breach of legislation by using
                the data source? Maybe you are even hacking the data.
Is the source you are going to use based on the latest (current) data?
                Remember: The aim is to make a business decision on the data used within the
                data source.
5.6.2.2 Normalisation
A new concept I want to discuss in relational databases is normalisation. Peterson (2022) defines
normalisation as ‘a database technique that reduces data redundancy and eliminates undesirable
characteristics.’ Normalisation rules divide larger tables into smaller tables and link them using
relationships (Peterson, 2022).
Figure 5.6 Customer and Province data tables (Van Staden, 2022)
The Customer table’s primary key is Customer_ID, while the Province table’s primary key is
Province_ID. Therefore, these two tables can be linked by Province_ID (therefore Province_ID is a
foreign key in the Customer table). But let’s see how the data look in these two tables:
CUSTOMER
Customer_ID (PK)    Customer_Name           Customer_Adress Customer_VatNr Customer_City Province_ID (FK)
               1    Lulu's Bakery           321 Justice street  1122336789 Pretoria                     1
               2    Sophia's Tea Garden     45 Tau avenue       1122336800 Johannesburg                 1
               3    Bela Bela Supermarket   12 Church street    1122336811 Bela Bela                    2
               4    Capricorn Inn           51 Unicorn drive    1122336822 Polokwane                    2
               5    UNISA student café      Preller street      1122336833 Pretoria                     1
PROVINCES
Province_ID (PK)     Province_Name
                   1 Gauteng
                   2 Limpopo
Let’s look at Customer_ID “1”. Lulu’s Bakery is in Province_ID “1”. In the Provinces table we can see
that province “1” is Gauteng. The process of creating additional stand-alone (smaller) tables is
known as normalisation. There are a lot of benefits to it, but the biggest benefit is data integrity.
Imagine if one needed to type in Gauteng for each customer in Gauteng. You will get multiple
versions of it, for example Gauteng, Gateng, gteng, Guateng, Gatang etc.
Now that you understand these concepts, let me shock you with this statement: for data analytics
you will take a normalised table and convert it back to an unnormalised single sheet (in Excel for
AIN3701).
Consider again the data in figure 5.7. Which province has the most customers? With little data it is
easy to answer, but what if there were 150 customers across nine provinces? If you tell me province
“1” has the most customers, that won’t mean anything to me. So, we need to change the “1’s” back
to Gauteng in our data analysis sheet. But more on this in section 5.6.4.2.
In AIN3701 we will be using data dictionaries and ER diagrams to determine which data are needed
to solve a business problem. An ER diagram displays the name of the table, fields in the table and the
relationships between tables in a database. An ER diagram is illustrated in figure 5.8.
Figure 5.8 Entity Relationship Diagram Example (Peterson, 2022)
If you struggle to interpret ER diagrams, please read this article by Peterson (2022) by clicking here.
Note that you will be assessed on your ability to identify the correct tables and fields in a database
and not on the detail of constructing ER diagrams. Therefore, this article is there only to assist you,
and you will not be assessed on it.
Data dictionaries were covered in detail in AIN2601. Please revise this section in your AIN2601 study
material.
The following steps can assist you when you are collecting your data (Richardson et al, 2023:61-62):
First, you need to determine from whom you need to request the data. In some cases, you will be able
to retrieve the data yourself, but most of the time you will need to request the data from the
database administrator or the information system department.
Secondly, you need to determine what data are needed. Use the ER diagrams and data dictionaries
to identify which data will be applicable for your analysis. Be specific.
Thirdly, you need to determine in which format the data should be. First prize would be a flat file
with all foreign keys populated, but you will rarely be so lucky as to receive your data like this. Also
take into consideration which tool you are going to use to test and process the data received. This
will aid the database administrator (or information systems department) in providing the data to
you in the most accessible format.
Lastly, provide the database administrator (or information system department) with a deadline by
when you should receive the data. This will enable you to do your data analysis in time.
Some organisations have a pre-developed form to request data. This form is called a data request
form. Richardson et al (2023:62) provide an example of a data request form displayed in figure 5.9.
 Requestor Name:
 Requestor Contact Number:
 Requestor Email Address:
 Please provide a description of the information needed (indicate which tables and which fields you
 require):
Remember, when obtaining the data yourself, you are still required to identify the tables, the
attributes required and the relationships between the tables before extracting the data. There are
various methods to retrieve the data. One of these methods is where the database administrator
provides you with direct access to the database. Then you will be able to connect to the database by
using Microsoft Excel, Microsoft Power BI or Tableau. When your dataset has more than 1 048 576
rows, you will not be able to pull it into Microsoft Excel 365 or earlier versions. Then you should
make use of more advanced data analysis software like Microsoft Power BI, Tableau, R or Python. In
AIN3701, we are going to make use of Microsoft Excel 365.
Note that you can also use SQL (Structured Query Language) to retrieve data from a database. SQL,
however, belongs more in the field of computer science and will not be covered in this course. If you
want to learn more about SQL, please watch the video by Gary Explains (2019) by clicking here. You
will not be assessed on this video in this module.
Activity 5.4
Ms Jansen asks if you can assist her in requesting data to answer the following questions for the
2023 financial year:
   I.       What was the revenue per division? One column should indicate the divisions’ income
            through normal sales and the other column, delivery income.
  II.       What was the profit per division? One column should indicate the divisions’ profit through
            normal sales and the other column, delivery income.
 III.      What was the revenue per province? One column should indicate the provinces’ income
            through normal sales and the other column, delivery income.
 IV.       What was the profit per province? One column should indicate the provinces’ profit through
            normal sales and the other column, delivery income.
Part of Ikhishi Likagogo’s data management policy is that all data should be requested by making use
of the company’s data request form. Please download the data request form by clicking here.
To assist you in completing the data request form, Mr de Villiers (CIO) provided you with the
following:
You are required to complete the data request form to provide enough data in order for Ms Jansen
to answer questions I–IV above.
End of activity
5.6.3 Step 2c: Validate the data for completeness and integrity
When data are moved or extracted from one location to another, it is possible that some of the data
can get lost or corrupted during extraction. It is crucial to ensure that the extracted data are
complete, as you will get the wrong answer by analysing incomplete data. You also need to ensure
that the integrity of the data remains after the extract. This means that no one has manipulated,
tampered with, or duplicated data during extraction. To validate data successfully, you will need
technical skills, but you also need to know your data well. If you know what to
expect from the data, you have a higher likelihood of identifying errors or issues from extraction
(Richardson et al, 2023:64).
Richardson et al (2023:64-65) identified the following four steps that should be completed to
validate your data after extraction:
   I.    Compare the number of records that were extracted to the number of records in the source
          database. The records should match. This step will not provide information on the data itself,
          but will ensure that the number of records matches, contributing to completeness.
  II.    Compare descriptive statistics for numeric fields. Calculating minimums, maximums,
         averages and medians will help ensure that the numeric data adhere to the data limits,
         ensuring accuracy and completeness. Descriptive statistics will be discussed in detail in
         section 5.7.1.
 III.    Validate date or time fields. This can be achieved by converting the data type to numeric and
         running descriptive statistics comparisons on these fields.
 IV.     Compare string limits for text fields. Text fields are unlikely to cause concerns if you extract
          your data to Microsoft Excel 365, because Excel allows a generous maximum number of
          characters per cell. Just be aware of this if you use a tool other than Excel to analyse text fields.
                   The above four steps are data controls to ensure that the ETL (extract, transform
                   and load) process is complete, accurate and valid.
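To illustrate, here is a minimal sketch of how steps I and II could be performed with Excel formulas. The sheet name (Extract), the cell references and the numeric field in column D are all hypothetical; adjust them to your own extract:

    Records extracted:   =COUNTA(Extract!A2:A100001)     compare to the row count in the source database
    Minimum:             =MIN(Extract!D2:D100001)        compare these descriptive statistics for each
    Maximum:             =MAX(Extract!D2:D100001)        numeric field to the same statistics calculated
    Average:             =AVERAGE(Extract!D2:D100001)    on the source data
    Median:              =MEDIAN(Extract!D2:D100001)

Note that COUNTA counts non-empty cells, so this check assumes that column A is populated for every record.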
If an error is detected at this stage, you need to contact the originator (information systems
department or the database administrator) of the data to report this. Most of the time they will re-
extract the data, which will solve the problem. If you have extracted the data yourself, then my
advice will be to re-extract the data. If this does not solve the problem, you need to report it to the
information systems department or the database administrator to investigate this matter further.
5.6.4    Step 2d: Clean the data
You have now obtained the data from the information systems department or the database
administrator, or you have extracted it yourself. Now the aim is to get your data ready for analysis.
Unfortunately, extracted datasets may contain duplicates, anomalies, or missing data, which could
distort how the data is interpreted (Stevens, 2022). Therefore, it is important to remove these
discrepancies. We refer to this process as data cleaning.
Stevens (2022) states that data cleaning can be a time-consuming task, but that it is crucial for
obtaining accurate results.
Proper computer controls will, however, reduce the time spent on data cleaning. For example, if a
computerised control prevents a user from entering a negative value in the sales quantity field, then
you will not have to investigate negative sales quantities in your dataset. Another example is where
customer details are retrieved by the system and automatically entered in the customer details
fields, so that the customer address does not need to be typed in every time a customer buys an
item. The result of this control is that spelling mistakes will be limited in the data, as well as in your
dataset.
It is therefore necessary to discuss computerised controls before we take you through the data
cleaning process itself. As an accounting professional you will need to design and sometimes
evaluate these controls in practice. Computer controls will be discussed in section 5.6.4.1.
In section 5.6.4.2 we will be discussing and illustrating how to do data cleaning on a dataset.
For an organisation’s information system to operate efficiently and effectively, there should be rules
and procedures governing information systems. This can be achieved by ensuring a strong internal
control environment, effective general IT controls and effective application controls.
To understand computer controls over data and new technologies, it is important to understand
how IT controls interact with business operations. IT controls were introduced in AIN1501; please
revise this study material. Also watch the three videos by AmandaLovestoAudit (2019), which
explain the difference between general controls and application controls, by clicking here.
IT controls will be further expanded on in your auditing studies in modules AUE2601 and AUE3764.
Remember that IT controls will also be applicable to data. For data input and usage, controls should
be implemented to ensure that data are accurate, complete and valid. If these objectives are not
achieved, an organisation will make decisions upon inaccurate, incomplete and invalid data.
Note that data also expose an organisation to additional reputational and legislative risk. If
sensitive data are leaked on sensitive issues regarding the organisation (for example, employees’
salaries) then the organisation’s reputation suffers. Organisations need to ensure that all laws (like
the POPI Act) relating to data are complied with. The POPI Act was covered in AIN1501 and AIN2601.
In table 5.1 are the comprehensive steps compiled by Dzuranin, Geerts and Lenk (2023) to clean data
in a dataset.

 Step   Problem                          Detection                                Correction
 Extraction
 1      Not all the data transferred     Compare row counts                       Add missing rows
 2      Data was not correctly           Compare control amounts                  Modify incorrect values
        transferred
 Transformation
 3      Irrelevant / unreliable data     Visually scan columns for irrelevant     Remove columns with irrelevant
                                         and unreliable data                      and unreliable data
 4      Incorrect / ambiguous            Scan columns for incorrect or            Rename the column
        column names                     ambiguous names
 5      Incorrect data types             Inspect the data type                    Change the data type
 6      Columns are not single           Examine the values in the                Split composite columns
        valued                           columns
 7      Incorrect values                 Use outlier detection procedures;        Correct the incorrect values
                                         sort and filter the data
 8      Inconsistent values              Scan columns for multiple                Standardise to one
                                         representations of the same data         representation
 9      Incomplete (NULL) values         Detect empty (NULL) cells                Replace or remove the
                                                                                  missing values
 10     Data refers to foreign keys      Investigate whether there are more       Unnormalise normalised data
        in other tables                  data tables where the data are stored    (only in Excel)

Table 5.1 Data preparation patterns (Dzuranin, Geerts & Lenk, 2023:5-46). Adjusted by Van Staden.
Table 5.1 provides ten steps that should be followed when performing the extraction and
transformation process on a dataset. Steps 1 and 2 assist with the validation of data for
completeness and integrity, which was discussed in section 5.6.3. Steps 3 to 10 refer to the
transformation of data, which mainly assists in cleaning the dataset. Note that in Excel, data do not
have to be loaded again after they have been cleaned. If you are working with other data analysis
software, it may be necessary to check whether the cleaned data are loaded correctly into the final
dataset. Next, we will discuss steps 3 to 10 in table 5.1.
Data that are irrelevant for decision making enlarge the data model. It is also important to avoid
integrating unreliable data into the data model. Remember: garbage in, garbage out (Dzuranin et al,
2023).
By making use of the data dictionary and by scanning through the database, irrelevant columns can
be identified. If you want to determine revenue per province, it is advisable to keep only the
columns that contribute to your solution: in this case, the revenue column and the province column.
There is no need to keep irrelevant columns like address, city, and credit limit.
By scanning through the data, one can also determine whether a column contains unreliable data.
Unreliable data is where a column contains a mixture of null values, dates, and text. Imagine
analysing an age column containing such a mixture. If a column can’t be fixed, it is better to
remove it.
Column names become variable names during data exploration and interpretation. Imagine
interpreting data where the column names are PR1, TR2 and Di. One would not comprehend what
these column names refer to without consulting the data dictionary. To make matters easier for
yourself, you can rename the columns to ProductRevenue, TaxRate and Discount. Now the names
are understandable, and other users of the dataset can understand them too.
Dzuranin et al (2023) provide the following guidelines for naming columns:
If a data type is incorrect, it can produce inaccurate results. Imagine if the revenue column displays
numerical values, but Excel sees them as text: statistics calculated on this column will then return
zeros. If this is the case, you need to change the column to the correct data type.
Each cell should contain one value describing one characteristic, because two or more values in the
same cell make analysis more challenging. Two specific scenarios that violate the single-valued rule
and make analysis more complex are composite columns and multi-valued columns (Dzuranin et al,
2023).
The best method to detect composite or multi-valued columns is visual scanning (Dzuranin et al,
2023). Consider the following table EMPLOYEE:
Pers_Nr       EmployeeName
            1 Jacques, van Staden
            2 Keenan, Martin
            3 Rendani, Nzhinga
You will see that the EmployeeName column contains the name and the surname in one column.
Preferably we want to split this into two columns, called FirstName and Surname.
Excel can facilitate this. Select the column EmployeeName, click on Data and then Text to Columns.
Figure 5.11 Text to Columns wizard (Step 1)
Figure 5.12 Text to Columns wizard (Step 2)
The second screen of the wizard allows you to specify how Excel should recognise where to split the
data. In the case of the EmployeeName you will see the names and surnames are split by a comma.
Therefore, deselect Tab and select Comma. You will see the following:
Figure 5.13 Text to Columns wizard (Step 3)
Notice in the Data preview that the data are now split between the name and the surname. This is a
preview of how the data will look once converted.
Figure 5.14 Text to Columns wizard (Step 4)
On this screen you can adjust the column format and specify the destination of the columns. In this
case don’t change anything and click on Finish.
Pers_Nr   EmployeeName
      1   Jacques      van Staden
      2   Keenan       Martin
      3   Rendani      Nzhinga
Please change the column names to more descriptive names as already discussed in section
5.6.4.2.2. Please see the final table in figure 5.16.
Pers_Nr     FirstName        Surname
      1     Jacques          van Staden
      2     Keenan           Martin
      3     Rendani          Nzhinga
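Formulas offer an alternative to the Text to Columns wizard. The following is a minimal sketch, assuming the combined names sit in column B starting at row 2 (hypothetical references):

    =TEXTSPLIT(B2, ", ")                               Excel 365 only: splits "Jacques, van Staden" into two cells in one step
    =TRIM(LEFT(B2, FIND(",", B2) - 1))                 FirstName: everything before the comma
    =TRIM(MID(B2, FIND(",", B2) + 1, LEN(B2)))         Surname: everything after the comma

The LEFT and MID versions also work in older versions of Excel, and TRIM removes any stray spaces around the split values.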
5.6.4.2.5        Step 7: Incorrect values
Incorrect values are sometimes entered into a dataset and may be difficult to detect. An example of
an incorrect value is where the price of a product is entered as R15 instead of the correct price of
R51. In the case of numerical values, outlier detection procedures can be used to detect these types
of mistakes. Incorrect text values are harder to find; one is most likely to pick these errors up by
sorting and filtering the data.
Study the article by Bansal (n.d.), which explains what outliers are, how to find them in Excel and
how outliers should be handled. To download the article, click here.
One can also detect outliers by using charts. Examples of charts that can assist in detecting outliers
are histograms and box and whisker charts. You have learned to create charts in FAC1501, but the
following article by Microsoft (2022) will assist you if you struggle to draw up a box and whisker
chart. To download the article, click here.
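As a preview of the IQR approach discussed in the Bansal article, here is a minimal sketch in Excel, assuming the numeric values are in D2:D151 and helper cells G1 to G5 are free (hypothetical references):

    In G1 (first quartile, Q1):    =QUARTILE.INC($D$2:$D$151, 1)
    In G2 (third quartile, Q3):    =QUARTILE.INC($D$2:$D$151, 3)
    In G3 (IQR):                   =G2-G1
    In G4 (lower bound):           =G1-1.5*G3
    In G5 (upper bound):           =G2+1.5*G3

Any value below the lower bound or above the upper bound is a potential outlier. A helper column with =OR(D2<$G$4, D2>$G$5) flags such rows as TRUE.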
Once an incorrect data value has been identified, one should correct it in the Excel sheet. It is also
important to determine the error’s root cause and advise the information systems department of it,
to enable them to modify the information systems in such a way that the error does not occur again
in future.
5.6.4.2.6        Step 8: Inconsistent values
Inconsistent values occur when two or more different representations of the same data are mixed in
the same column (Dzuranin et al, 2023). An example is if the following data values are present in the
City column: Pretoria, Pta, Protoria and Pretorai. If one were to summarise sales per city, the sales
would be grouped separately under Pretoria, Pta, Protoria and Pretorai, instead of all under Pretoria.
It is therefore very important to correct these values once they are recognised. The information
systems department should also be notified in this case, so that they can develop measures to
eliminate this error in the future.
5.6.4.2.7        Step 9: Incomplete values
Incomplete values are cells in a dataset that are empty. We also refer to these as NULL values.
NULL values will distort the results of your data analysis; therefore, you need to deal with them.
First, you need to detect them. Excel has some useful ways to assist you with this. Consider the
following dataset in figure 5.17. Note that I have kept it small to illustrate the detection of blank
cells, but your datasets will normally be a lot bigger than five records.
 Nr   Student    Mark
  1   Steyn        56
  2                44
  3   Mbasa        76
  4   Ntuli
  5   Abrahams      88
By viewing this dataset, one can see that student 2 has no student name, and student 4 has no
mark. Let’s see if Excel can find them.
In Excel, click on the Home ribbon, then Find & Select, then Go To Special. You will see the following
dialog box:
Figure 5.18 Go to Special dialog box
In the dialog box, select Blanks and click OK. Excel will select all the blank cells. Now select a Shading
colour from the Home ribbon. I will select Red.
All the cells containing NULL values will now be highlighted in Red (please see figure 5.20), which
will enable you to investigate them.
Figure 5.20 Student data with highlighted empty cells
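For datasets too large to scan visually, a formula can confirm how many blanks remain. A minimal sketch, assuming the student names are in B2:B6 and the marks in C2:C6 (hypothetical references):

    =COUNTBLANK(B2:B6)     counts the empty cells in the Student column (1 for figure 5.17)
    =COUNTBLANK(C2:C6)     counts the empty cells in the Mark column (1 for figure 5.17)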
Maladkar (2018) identified the following five ways to deal with missing (NULL) values:
     •   Delete the rows (or records) containing the missing data. This method is advised only when
         there are enough samples in the dataset. One also needs to make sure that deleting the
         data does not introduce bias.
     •   Replace the missing values with the mean, the median or the mode. This is the statistical
         approach to handling missing values (see the sketch after this list).
     •   Assign a unique category. This is mostly applicable to qualitative data, for example, gender.
         A new category, for example, unknown, can be created for the missing data.
     •   Predict the missing values. An example is to use a machine learning algorithm to predict
         the missing values. An example of a technique used to predict the values is regression
         analysis. We will discuss regression analysis later in this lesson.
     •   Use algorithms which support missing values. Machine learning algorithms such as KNN
         (k-nearest neighbours) and the random forest algorithm are used to predict non-linear and
         categorical data. These two techniques are in the domain of data science and will not be
         discussed further in AIN3701.
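As a sketch of the second option applied to the data in figure 5.17, assuming the marks are in C2:C6 (hypothetical references):

    =AVERAGE(C2:C6)     AVERAGE ignores empty cells, so the existing marks give (56 + 44 + 76 + 88) / 4 = 66

The blank mark for student 4 can then be replaced with 66, and the cell highlighted to show that the value was imputed.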
In section 5.6.2.2 you were introduced to the concept of normalisation. You were then given a hint
that for data analysis you will need to unnormalise normalised data. Let me illustrate this with an
example. Look at figure 5.10.
CUSTOMER
Customer_ID (PK)    Customer_Name           Customer_Adress Customer_VatNr Customer_City Province_ID (FK)
               1    Lulu's Bakery           321 Justice street  1122336789 Pretoria                     1
               2    Sophia's Tea Garden     45 Tau avenue       1122336800 Johannesburg                 1
               3    Bela Bela Supermarket   12 Church street    1122336811 Bela Bela                    2
               4    Capricorn Inn           51 Unicorn drive    1122336822 Polokwane                    2
               5    UNISA student café      Preller street      1122336833 Pretoria                     1
PROVINCES
Province_ID (PK)     Province_Name
                   1 Gauteng
                   2 Limpopo
You will see that Province_ID in the CUSTOMER table contains a foreign key, which relates to the
primary key in the PROVINCES table. If we need to analyse customer data, we will import both tables
into Excel, but the primary analysis will be on the CUSTOMER sheet (where the CUSTOMER table is
imported). We will therefore need to replace the Province_ID column values of 1 and 2 with Gauteng
(1) and Limpopo (2).
There is no right or wrong way to do this. You can use Find & Replace, or you can type the values in
manually, but personally I feel the best solution is the VLOOKUP function. (The VLOOKUP function
was covered in AIN2601.) If you use the VLOOKUP function, insert a column where the result of the
VLOOKUP function is stored (in the above example, you will insert a column next to Province_ID,
where the Province_Name will be displayed when applying the VLOOKUP function). You will find
that the VLOOKUP function works best for big datasets.
5.6.5 Step 2e: Load the data in preparation for data analysis (load).
In some data analysis software packages, one needs to get the data ready before loading them. In
Excel, we first load the data and then perform our data cleaning on the loaded data. Work through
the article by Chauhan (2022), which provides step-by-step information on how to import data into
Excel. As a bonus, this article also covers how to export data from Excel. You can download the
article by clicking here.
Activity 5.5
You have received the data from Chip de Villiers (CIO). He has put two tables, the Invoice_2022, and
the Customer_Master into an Excel workbook. You can download the workbook by clicking here.
Chip de Villiers informed you and Ms Jansen that he did not have time to clean the data and would
only be able to start with this process after two weeks.
Ms Jansen informed him that the analysis is urgent, and that the Finance department will clean the
dataset themselves.
   I.    Clean the dataset received in the attached Excel file. You need to clean both the
          Invoice_2022 and the Customer_Master sheets. Use table 5.1 (data preparation patterns)
          and follow the steps in this table.
               a. Do the data cleaning in the Invoice_2022 and the Customer_Master sheets. In
                  addition, document (in a separate document) which step was followed in table 5.1
                  and provide examples on how the data was cleansed under each step.
                   Please see the information below on how to deal with outliers and missing values:
                          •   Ms Jansen will investigate the outliers, but she asks that you continue with
                              your analysis without adjusting the outliers.
                          •   Any missing values should be replaced. Ms Jansen also asks you to highlight
                              missing values as follows:
                               •   If you replace the value, highlight the cell in blue.
                               •   If you recalculate the value, highlight the cell in red.
                               •   If you can’t replace or recalculate the value, then you need to use
                                   the mean value (refer to DSC1630). In this case, highlight the value
                                   in yellow.
               b. Test for outliers in the Qty column in the Invoice_2022 sheet by using a box and
                  whisker chart. Create a separate sheet in the workbook where this chart is
                  displayed.
               c. Test for outliers in the Inv_Price_ex_del column in the Invoice_2022 sheet by using
                  the IQR. Clearly indicate whether there are outliers present, by shading these values
                  in grey. Show all your calculations.
  II.    Create a sheet called Analysis_Region. In this sheet, provide the answer for the following
         question:
               a. What was the revenue per province, per division?
End of activity
You have cleaned your data, and now you are ready to start your analysis. You will see that there are
different data analytics approaches to help us understand what happened, why it happened, what
we can expect to happen in the future, and what we should do based on what we expect to happen
(Richardson et al, 2023:116). In this step, we will evaluate data analytics approaches and choose the
type of analytics which will provide the most accurate, complete, and valid answer to the problem
we identified in step 1.
There are four types of data analytics approaches. Study the article by Cote (2021), which explains
the four types of data analytics. You can download the article by clicking here.
We are not going to explore all the different analytical methods one can use under each of these
four main types of data analytics. Note that only summary statistics, data reduction (filtering) and
regression are assessable in AIN3701. The other methods are only introduced and will not be
assessed in AIN3701.
There are two main approaches when using descriptive analytics, which are summary statistics and
data reduction (also known as filtering).
Summary statistics describe a set of data in terms of their location (mean, median), range (standard
deviation, minimum and maximum), shape (quartile) and size (count) (Richardson, et al, 2023:117).
Table 5.2 indicates what statistic can be used, the Excel formula and the description of the statistic.
 Statistic            Excel formula                     Description
 Mean                 =AVERAGE()                        The average value of the data.
 Median               =MEDIAN()                         The middle value that divides the top half of the
                                                        data from the bottom half.
 Minimum              =MIN()                            The smallest value in the data.
 Maximum              =MAX()                            The largest value in the data.
 Standard deviation   =STDEV.S() for a sample           The variability or spread of the data from the mean.
                      =STDEV.P() for the entire         A larger standard deviation means a wider spread
                      population                        away from the mean.
 Quartile             =QUARTILE()                       The value that divides a quarter of the data from
                                                        the rest. This indicates the skewness of the data.
 Count                =COUNT()                          The number of numeric values in the data.
Table 5.2 Summary statistics (Richardson et al, 2023:119). Adjusted by Van Staden (2024)
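Applied in a worksheet, each statistic takes a cell range as its argument. A minimal sketch, assuming the values to be summarised are in E2:E151 (hypothetical references):

    =AVERAGE(E2:E151)         mean
    =MEDIAN(E2:E151)          median
    =STDEV.S(E2:E151)         standard deviation (sample)
    =MIN(E2:E151)             minimum
    =MAX(E2:E151)             maximum
    =QUARTILE(E2:E151, 1)     first quartile (use 3 for the third quartile)
    =COUNT(E2:E151)           count of numeric values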
Data reduction is used to reduce the number of observations to focus on the most critical,
interesting, or abnormal items (Richardson et al, 2023:120). It does this by taking a large set of data
and reducing it to a smaller set that has most of the critical information of the larger dataset.
(Richardson et al, 2023:117-118). In Excel, you can use the Filter tool and Pivot tables to assist you in
doing data reduction. The Filter tool and Pivot tables in Excel were discussed in detail in AIN2601.
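In Excel 365, the FILTER function offers a formula-based way to perform data reduction. A minimal sketch, assuming the dataset occupies A2:D151 with the province names in column C (hypothetical references):

    =FILTER(A2:D151, C2:C151="Gauteng", "No records found")

This spills only the rows where the province is Gauteng into the worksheet; the third argument is returned when no rows match. The Filter tool on the Data ribbon and Pivot tables achieve the same reduction interactively.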
Activity 5.6
You have attended a meeting with Ms Duma (CEO), Ms Jansen (financial director), Ms Arendse
(manager of the delivery division), Mr Law (legal representative of Ikhishi Likagogo) and
Mr Azharuddin (union representative).
Ms Duma is concerned about the number of serious accidents caused by Ikhishi Likagogo’s drivers.
She also informed the meeting that Ikhishi Likagogo’s insurer, DriveWise Insurers, has indicated that
it is going to increase Ikhishi Likagogo’s insurance premium unless she can show them how Ikhishi
Likagogo is going to limit accidents.
Ms Arendse reports that during 2022 there were four serious accidents, in which delivery vehicles
were written off. Unfortunately, it caused a delay in deliveries, as there were not enough vehicles.
She states that Ikhishi Likagogo’s delivery policy (95% of deliveries need to be on time) has put
pressure on the drivers, causing them to, in some cases, violate traffic laws.
Ms Jansen reported that the number of traffic fines increased in 2022 by almost 60%. Furthermore,
there are several legal cases against Ikhishi Likagogo emanating from incidents where the other
parties’ drivers were seriously injured. Mr Law indicated that Ikhishi Likagogo would not win these
cases, as witnesses identified that the drivers of Ikhishi Likagogo were negligent.
Mr Azharuddin claims that Ikhishi Likagogo overworks drivers, and that five drivers are surely not
enough to cover the deliveries. He states that all five drivers are union members, and that they
informed him that they are under enormous pressure to deliver orders on time. He expresses his
disappointment that no driver received a performance bonus due to the 95% on-time delivery target
not being met. He claims that this target was not met due to circumstances outside the drivers’
control.
Ms Duma requests that you and Ms Jansen analyse the data in the attached Excel workbook
regarding vehicle deliveries. You can download the spreadsheet by clicking here.
Ms Jansen has cleaned the spreadsheet and asks that you assist in doing the following:
   I.      In the SALES worksheet, insert a column to the right of the Delivery distance column. Call this
            column Total_distance. Total_distance = Delivery distance x 2. Remember, Delivery distance
            only shows a one-way trip and disregards the return trip; therefore, Total_distance will cover
            the full trip of delivering an order.
  II.    In the Sum_Stats worksheet, perform summary statistics on the Total_distance column (all
          the statistics in table 5.2 should be calculated). Report on your findings in the Key findings
          textbox. Your findings should add value to the above meeting.
 III.    Complete the report in the Driver_report worksheet.
 IV.     Copy the SALES worksheet to the Data_reduction worksheet.
  V.     In the Data_reduction worksheet, perform data reduction to provide further insights on
         what happened in the case of the on-time delivery percentage not being met.
 VI.     Do you agree with the statements Mr Azharuddin made in the meeting? State your
          conclusion in the Conclusion worksheet. Motivate your answer.
VII.     Provide two possible solutions on how to limit driver accidents. Write your answer in the
         Possible_Solutions worksheet.
End of activity
Diagnostic analytics consists of four approaches: profiling, clustering, similarity matching and
co-occurrence grouping. All four of these approaches are studied in detail in the domain of data
science; therefore, you will not be assessed on them. We will discuss these four approaches very
briefly to provide you with some background, should you encounter them in practice.
Profiling identifies the typical behaviour of a selected group by compiling summary statistics about
its data and then comparing them to the summary statistics of the population. By understanding this
behaviour, we will be able to identify abnormal behaviour more easily. Profiling might be used in
accounting to identify transactions that might warrant additional investigation, for example, outlier
travel expenses or fraud (Richardson et al, 2023:117).
Clustering helps identify groups (clusters) of individuals (such as customers) who share common
underlying characteristics. With clustering we want to identify the similar data elements and the
underlying drivers of the identified groups (Richardson et al, 2023:117). An example of clustering is
to identify the most efficient production lines and to look at what characteristics they share. We can
then compare them with other production lines to see whether some of these identified
“best practices” can be implemented elsewhere.
5.7.2.3 Similarity matching (not assessable in AIN3701)
Similarity matching is a grouping technique used to identify similar individuals based on data known
about them. For example, companies identify seller and customer fraud based on various
characteristics known about each seller and customer to see if they are like known fraud cases
experienced in the past (Richardson et al, 2023:117).
Co-occurrence grouping discovers associations between individuals based on common events, such
as transactions they are involved in (Richardson et al, 2023:117). Takealot might use this to sell
another item to you by knowing what items are frequently bought together.
Predictive analytics consists of three approaches. They are regression, classification, and link
prediction. Regression will be assessed in this module. Classification and link prediction are studied
in detail in the domain of data science. Therefore, you will not be assessed on these two approaches.
5.7.3.1 Regression
In DSC1630, you learned about regression analysis theory and its application by making use of your
calculator. In AIN3701, we will build on this by showing you how to perform a regression analysis in
Excel. We will further expand on this by making use of multiple regression. Lastly, we will show you
how to interpret the statistics produced by Excel when you make use of the regression tool in Excel.
In Excel, there are two ways to determine the regression formula and the coefficient of
determination (R²).
The first is by means of a scatterplot chart. You have learned in AIN1501 how to create charts, but in
the following video I will illustrate how to determine the regression formula and the R². The dataset
used in the video can be downloaded by clicking here. Please watch the video by clicking here.
The second method is by making use of the regression function provided in the “Analysis Toolpak”
add-on in Excel. Let me first show you how to add the “Analysis Toolpak”:
    •   Open Excel.
    •   Click on File.
    •   Select Options. Note: if Options is not available, click on More.
    •   Click on Add-Ins.
    •   At the bottom of the dialog box, you will see a combo (drop-down) box. Select Excel
        Add-ins and click on Go …
    •   Check the Analysis Toolpak and click OK.
    •   Select Data.
    •   The following additional item (“Analysis”) will appear on the ribbon:
Figure 5.22 Data ribbon displaying the Analysis tools from the Analysis ToolPak
Please watch the video on performing regression by making use of the regression function from the
Analysis ToolPak by clicking here.
The output of the regression tool is displayed in figure 5.23 and will be discussed next.
SUMMARY OUTPUT

Regression Statistics
  Multiple R           96.48%
  R Square             93.08%
  Adjusted R Square    92.22%
  Standard Error       4,317.984172
  Observations         10

ANOVA
               df    SS               MS               F             Significance F
  Regression    1    2,006,940,102    2,006,940,102    107.6396604   0.00
  Residual      8    149,159,898.5    18,644,987.31
  Total         9    2,156,100,000

                        Coefficients   Standard Error   t Stat        P-value   Lower 95%      Upper 95%
  Intercept             93,434.01      2,708.312563     34.49897602   0.00      87,188.63018   99,679.39012
  Year of experience    3,568.53       343.9560394      10.37495351   0.00      2,775.36387    4,361.691968

Figure 5.23 Summary output of the regression tool in Excel
The most important statistical outputs in figure 5.23 are discussed below, and you should be able to
interpret them in AIN3701.
The adjusted R Square measures the extent to which the independent variable(s) can explain the
variance of the dependent variable. What distinguishes the adjusted R Square is that it does not
consider the impact of all independent variables, but only those that actually affect the variation of
the dependent variable. By looking at the adjusted R Square value, one can judge whether the
regression equation is a good fit for the data. The higher the adjusted R Square, the better the
regression equation, as it implies that the independent variables chosen to determine the dependent
variable can explain the variation in the dependent variable (Srivastav, 2022).
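Srivastav (2022) does not give the formula, but the standard statistical definition of the adjusted R Square, for n observations and k independent variables, is:

\bar{R}^{2} = 1 - (1 - R^{2})\,\frac{n - 1}{n - k - 1}

Applied to figure 5.23 (R² = 93.08%, n = 10 and k = 1), this gives 1 − 0.0692 × 9/8 ≈ 92.22%, which agrees with the Adjusted R Square shown in the output.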
To understand the “Significance F” value, one first needs to understand what a P-value is. To assist
you in understanding a P-value, read the article by Seth (2020) by clicking here.
The Significance F is a very important statistic. This is the p-value for the F-test of overall
significance. This test determines whether your regression model with all its independent variables
does a better job explaining the dependent variable’s variability than a model with no independent
variables. If this test result is statistically significant, it suggests you have a good model (Frost, 2020).
•   If the P-value is less than 0.01, there is overwhelming evidence to infer that the alternative
    hypothesis is true. The result is deemed to be highly significant.
•   If the P-value lies between 0.01 and 0.05, there is strong evidence to infer that the alternative
    hypothesis is true. The result is deemed to be significant.
•   If the P-value lies between 0.05 and 0.10, there is only weak evidence that the alternative
    hypothesis is true; once the P-value is greater than 0.05, the result is not statistically significant.
•   If the P-value exceeds 0.10, there is little to no evidence to infer that the alternative hypothesis
    is true.
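The following small Python helper (my own illustration, not prescribed material; the function name is hypothetical) simply encodes the evidence thresholds above:

# Helper (illustrative only) applying the P-value evidence thresholds above.
def interpret_p_value(p: float) -> str:
    if p < 0.01:
        return "overwhelming evidence (highly significant)"
    if p < 0.05:
        return "strong evidence (significant)"
    if p < 0.10:
        return "weak evidence (not statistically significant)"
    return "little to no evidence"

print(interpret_p_value(0.00))  # the Significance F in figure 5.23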
In figure 5.23, the Significance F value is 0.00. This indicates that this is a good model; the model as
a whole is deemed to be highly statistically significant.
The P-value of the “Year of experience” coefficient is also 0.00. This indicates that the variable “Year
of experience” is deemed to be highly significant in predicting the “Annual Sales”.
Coefficients
The Intercept indicates what the value of the dependent variable will be if the independent variable
(x) is zero. In the regression equation y = b + ax, it represents the “b” value.
“Year of experience” is the independent variable. Its coefficient indicates by how much the dependent
variable (y) will change for every one unit of change in the independent variable (in this case, one
additional year of experience). This represents the “a” value in the equation y = b + ax.
We can make predictions using the above equation. Let’s say we want to predict the annual sales of
a salesperson with 15 years’ experience, then:
y = 93 434.01 + (3 568.53 × 15) = 146 961.96
Therefore, we can see that the predicted annual sales for a salesperson with 15 years’ experience
will be approximately R146 962 (with the R Square indicating that the model explains about 93% of
the variation in annual sales).
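For interest, the same kind of simple regression can be reproduced outside Excel. The following minimal Python sketch (illustrative only; AIN3701 assesses the Excel approach) uses the statsmodels library and ten hypothetical observations standing in for the salesperson dataset:

# Simple regression sketch (illustrative only): hypothetical stand-in data
# for the salesperson dataset used in the video.
import statsmodels.api as sm

years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
annual_sales = [97000, 100500, 104100, 107700, 111200,
                114800, 118400, 121900, 125500, 129100]

X = sm.add_constant(years)          # adds the intercept ("b") term
model = sm.OLS(annual_sales, X).fit()

print(model.params)        # intercept (b) and slope (a)
print(model.rsquared_adj)  # adjusted R Square
print(model.f_pvalue)      # Significance F
print(model.predict([1, 15]))  # predicted annual sales at 15 years' experience

The printed intercept and slope correspond to the “b” and “a” values discussed above.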
Multiple regression
At this stage of your studies, you have worked with one dependent and one independent variable
when dealing with regression analysis. In practice, there could be more independent variables. Our
equation then changes to:
y = b + a₁x₁ + a₂x₂ + … + aₙxₙ
Doing this manually would be an almost impossible task, as you would need more than two axes,
one for each variable. Luckily, Excel can assist us in solving multiple regression problems.
Watch the following video by SoftTech Tutorials (2021), which illustrates how to do and interpret
multiple regression in Excel, by clicking here.
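If you are curious how the same multiple regression could be run outside Excel, here is a minimal sketch with two hypothetical independent variables and invented figures (illustrative only; AIN3701 assesses the Excel approach):

# Multiple regression sketch (illustrative only): two hypothetical
# independent variables (advertising spend and number of sales staff).
import statsmodels.api as sm

advertising = [10, 12, 9, 15, 11, 14, 13, 16]   # R'000 (hypothetical)
sales_staff = [4, 5, 4, 6, 5, 6, 5, 7]
sales       = [52, 61, 48, 79, 57, 73, 66, 83]  # R'000 (hypothetical)

X = sm.add_constant(list(zip(advertising, sales_staff)))
model = sm.OLS(sales, X).fit()

print(model.params)        # b, a1 and a2 in y = b + a1*x1 + a2*x2
print(model.rsquared_adj)  # adjusted R Square for the multiple regression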
Activity 5.7
You were invited to the budget meeting. Mrs Duma (CEO), Ms Jansen (CFO), the product managers,
the production foremen of each production line and the maintenance manager, Mr Mashaba, were
all present at the meeting.
All the product managers presented their individual budgets for the year, but on every single one
Mr Mashaba complained that the maintenance fee had not been worked out correctly. He explained
that all the product divisions used a flat rate of R130 for a maintenance service per machine, but
that services are, in effect, more expensive. He explained that maintenance on the ovens (which
make up most of the production machines) is very labour intensive. The cost also depends on how
many service staff are available to perform maintenance on the production machines. He therefore
argued that it does not reflect reality to simply charge a flat rate when maintenance is performed on
a machine.
He presented the following sheet, which reflects the output (number of machines serviced), number
of employees, direct labour hours and the total maintenance cost (figure 5.25).
He assured the meeting that his data were correct and that there were no outliers present in the
figures presented in figure 5.25. You can download the Excel sheet presented in figure 5.25 by
clicking here.
Ms Jansen told the meeting that the maintenance cost matter would be investigated. After this, the
meeting was adjourned.
Prepare your answer in PDF format. The easiest approach is to use Word and then to convert your
Word document to PDF.
End of activity
5.7.3.2 Classification (not assessable in AIN3701)
Classification predicts a class or a category for a new observation based on the manual identification
of classes from previous observations. Some examples of classification include predicting which
loans are likely to default, credit applications that are expected to be approved, the classification of
an operating or financing lease, and the identification of suspicious transactions. In each of these
cases, prior data must be manually identified as belonging to each class to build a predictive model
(Richardson et al, 2023:117).
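For background only, here is a minimal classification sketch. It assumes scikit-learn is installed, and the applicant figures and labels are hypothetical; it is not assessed in AIN3701:

# Classification sketch (illustrative only): predict whether a loan is likely
# to default, from hypothetical prior cases manually labelled as
# defaulted (1) or repaid (0).
from sklearn.tree import DecisionTreeClassifier

# Each row: [income (R'000), loan amount (R'000)]
history = [[55, 20], [62, 15], [30, 40], [28, 45], [70, 10], [25, 50]]
defaulted = [0, 0, 1, 1, 0, 1]  # labels identified from previous observations

model = DecisionTreeClassifier(random_state=0).fit(history, defaulted)
print(model.predict([[32, 42]]))  # [1]: the new application resembles past defaults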
5.7.3.3 Link prediction (not assessable in AIN3701)
Link prediction predicts a relationship between two data items, such as members of a social media
platform. For example, if two individuals have mutual friends on social media and both attended the
same university, it is likely that they know each other, and the site may make a recommendation for
them to connect. Link prediction in an accounting setting might use social media to look for
relationships between related parties that are not otherwise disclosed, in order to identify related-
party transactions (Richardson et al, 2023:117).
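A minimal link-prediction sketch follows, scoring two members of a hypothetical network by their mutual friends. It assumes the networkx library is installed; the names are invented, and this is illustration only:

# Link-prediction sketch (illustrative only): score how likely two members
# of a hypothetical social network are to know each other.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Anna", "Ben"), ("Anna", "Carla"), ("Anna", "David"),
    ("Eve", "Ben"), ("Eve", "Carla"),
])

# Jaccard coefficient: shared neighbours relative to all neighbours
for u, v, score in nx.jaccard_coefficient(G, [("Anna", "Eve")]):
    print(u, v, round(score, 2))  # a higher score suggests recommending a connection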
5.7.4 Prescriptive analytics (not assessable in AIN3701)
Prescriptive analytics consists of two approaches: decision support systems, and machine learning
and artificial intelligence. Both of these approaches are studied in detail in the domain of data
science. Therefore, you will not be assessed on these two approaches. We will discuss the two
approaches very briefly to provide you with some background, should you encounter them in
practice.
Decision support systems are rule-based systems that gather data and recommend actions based on
the input. Tax preparation software, investment advice tools, and auditing tools recommend courses
of action based on data that are input as part of an interview or interrogation process (Richardson et
al, 2023:117).
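To make the idea of a rule-based system concrete, here is a tiny sketch. The function name, rules and thresholds are entirely hypothetical; real tax software applies far more elaborate rule sets:

# Decision-support sketch (illustrative only): a rule-based system recommends
# an action from answers captured in an interview-style process.
def recommend_tax_action(taxable_income: float, has_travel_allowance: bool) -> str:
    if taxable_income == 0:
        return "No return action required"
    if has_travel_allowance:
        return "Capture a logbook to claim travel expenses"
    return "Submit a standard return"

print(recommend_tax_action(350000, True))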
Machine learning and artificial intelligence are learning models or intelligent agents that adapt to
new external data to recommend a course of action. For example, an artificial intelligence model
may observe opinions given by an audit partner and adjust the model to reflect a changing level of
risk appetite and regulation (Richardson et al, 2023:117).
5.8 STEP 4: ADDRESS AND REFINE RESULTS (A)
After the data have been analysed (in step 3), the fourth step is to address and refine the results.
Data analysis is iterative. We slice, dice, and manipulate the data, find correlations, test hypotheses,
ask ourselves further, better questions, ask colleagues what they think, and revise and rerun the
analysis, potentially multiple times. Once that is complete, we have the results ready to communicate
to interested stakeholders, which hopefully address their questions (Richardson et al, 2023:13).
5.9 STEP 5: COMMUNICATE INSIGHTS (C)
Data are important and data analytics is effective, BUT they are only as important and effective as
our ability to communicate the results and make the data understandable (Richardson et al,
2023:183).
The goal of step 5 is to share critical information with stakeholders in a clear and concise manner.
This could be done by using a written report, a chart, a callout box, or a few key statistics. It is also
important to consider the needs of the audience of your data analysis, which can influence the
choice of communication method (Richardson et al, 2023:183).
These days, data visualisation is very important to communicate results. Data visualisation is the
presentation of data in a graphical format (Sharma, 2022). The choice of a chart can play a big role in
getting your message across in your data visualisation. The type of data can influence the type of
chart that is used. Data are categorised into two main groups, namely qualitative and quantitative
data.
Qualitative data are data that can be classified or categorised, but cannot be measured (Sharma,
2022:18). Examples of qualitative data are colours, satisfaction, and rankings. Qualitative data can be
further divided into two subcategories, namely nominal and ordinal data (Richardson et al,
2023:187). Nominal data are qualitative data that cannot be put in a meaningful order (in other
words, they cannot be ranked) (Sharma, 2022:18). An example of nominal data is hair colour.
Ordinal data are qualitative data that can be put into a meaningful order (the data can be ranked),
for example, very satisfied, satisfied, unsatisfied and very unsatisfied (Sharma, 2022:18).
Quantitative data (also known as interval/ratio data) are data that can be measured, for example,
customers’ sales, weight, gross profit margin, to name a few (Sharma, 2022:18). Quantitative data
can be categorised into two different types, namely interval and ratio (Richardson et al, 2023:187).
The difference between interval and ratio data is that ratio data have a meaningful zero and interval
data do not (Richardson et al, 2023:187). To clarify, for ratio data, the zero point reflects an absolute
zero, which means that there is “nothing” of the specific variable (Jansen, 2020). With interval data,
zero is just another number (Richardson et al, 2023:187); for example, a temperature of 0 °C does
not mean that there is no temperature.
Quantitative data can be further classified as either discrete or continuous. Discrete data are
quantitative data with a finite number of values (Sharma, 2022:18), for example, the number of
pets: you can have three pets, but not 2.25 pets. Continuous data are quantitative data that can take
on a value within a range or interval (Sharma, 2022:18). Therefore, such a value can take on any
numerical value, for example 3.14159265 (Richardson et al, 2023:187).
The last data type we need to look at is dichotomous data. A dichotomous data value can take only
one of two values (Dodge, 2008). An example of a dichotomous value is “yes” or “no”. The
dichotomous data type is a qualitative data type, but Dodge (2008) reminds us that quantitative
variables can also be dichotomised, for example by reducing revenue to “low revenue” and “high
revenue”.
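As an aside, these data types can be made concrete in code. The following sketch (illustrative only; it assumes the pandas library is installed and uses invented values) shows nominal data, ordinal data as an ordered categorical, quantitative data and a dichotomised variable:

# Data-type sketch (illustrative only) using hypothetical values.
import pandas as pd

hair = pd.Series(["brown", "black", "blonde"], dtype="category")  # nominal: no order

satisfaction = pd.Series(
    ["satisfied", "very satisfied", "unsatisfied"],
    dtype=pd.CategoricalDtype(
        ["very unsatisfied", "unsatisfied", "satisfied", "very satisfied"],
        ordered=True,                     # ordinal: the categories can be ranked
    ),
)

revenue = pd.Series([120000.0, 450000.0, 90000.0])  # quantitative (continuous, ratio)
high_revenue = revenue > 200000                     # dichotomised: True/False only
print(satisfaction.min(), high_revenue.tolist())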
Now you might ask me: what does this have to do with the type of chart to be chosen?
Richardson et al (2023:195) provide the following assistance by indicating what chart type will be
suitable to use for each data type. This guidance is displayed in figure 5.26.
If you look at figure 5.26, you will see that the starting point is to identify whether the data you
want to put into your chart are qualitative or quantitative. Next, you look at the purpose of your
chart: in the case of qualitative data, is it for comparisons, geographic data, or text data? You have
already covered in AIN1501 how to generate charts in Excel; therefore, you will need to revise this
material.
If you want to explore other methods to choose charts and need assistance with charts in data
visualisations, I recommend that you work through the article by Dzuranin (2022). You can download
this article here. This article also covers a technique to present your data, called storytelling.
Storytelling is a very interesting field in data analytics and if you want to know more about it, you
can also read through the article by Dzuranin (2022). Storytelling will not be assessed in AIN3701.
Data visualisation using dashboards will be covered in detail in lesson 6. Therefore, you will be
assessed in lesson 6 on using the correct charts, applying best practices, etc.
5.10 STEP 6: TRACK OUTCOMES (T)
In doing your data analysis, you will hopefully encounter some insights and areas that can be
improved on. To show improvement or progress, it is necessary that these insights or improvement
areas are managed and monitored. In practice, digital dashboards and data visualisations are useful
in tracking outcomes. Dashboards are discussed in detail in lesson 6.
5.11 CONCLUSION
Well done on completing this unit! You have covered a lot in this unit, and I hope that you have
learned a lot about data analysis.
Data are often referred to as the oil of the modern age, and the expectation is that the importance
of data will only increase in the future. Therefore, as an accounting professional, it is very likely that
you will encounter data analytics in practice.
We have covered quite a few principles of data analytics (which are sufficient for AIN3701), but
there is much more to explore. If you want to explore data analytics further, I recommend you start
with the book by Richardson et al, called Data analytics for accounting (third edition, 2023). Data
analytics is evolving every day with the introduction and advancement of machine learning models,
so it is important that you try to stay up to date in this subject area.
For me, it is important that you realise that there is a structured process to be followed when
tackling business problems that require data analytics to resolve them. Jumping into the middle of
the process will lead to either a loss of time or an incorrect answer. Therefore, follow the process as
illustrated in this lesson or as prescribed in the workplace.
In the next lesson, we are going to focus on data visualisation, through dashboards. This is a very
exciting topic, and we look forward to teaching you the skills to create an efficient and effective
dashboard.
5.12     REFERENCES
Bansal, S. No date. How to Find Outliers in Excel (and how to handle these). Available at:
https://trumpexcel.com/find-outliers-excel/
Bekker, A. 2017. Big Data: Examples, Sources and Technologies explained. Available at:
https://www.scnsoft.com/blog/what-is-big-data
Chauhan, R. 2022. How to Import Data Into Your Excel Spreadsheets the Neat & Easy Way. Available
at:
https://www.makeuseof.com/tag/import-data-excel-spreadsheets-neat-easy-way/
Cote, C. 2021. 4 Types of data analytics to improve decision making. Available at:
https://online.hbs.edu/blog/post/types-of-data-analysis
Dzuranin, AC, Geerts, GL & Lenk, MM. 2023. Data and Analytics in Accounting: An Integrated
Approach. 1st edition. John Wiley & Sons, Inc.
Frost, J. 2020. How to Perform Regression Analysis using Excel. Available at:
https://statisticsbyjim.com/regression/regression-analysis-excel/
Galiana, D. 2019. The 5W1H Method: Project Management Defined and Applied. Available at:
https://www.wimi-teamwork.com/blog/the-5w1h-method-project-management-defined-and-
applied/
Jansen, D. 2020. Nominal, Ordinal, Interval & Ratio Data. Available at:
https://gradcoach.com/nominal-ordinal-interval-ratio/
Keller, G. 2012. Managerial Statistics. International edition. 9th edition. South-Western Cengage
Learning.
Lewis, S. 2020. Fishbone diagram (Ishikawa cause and effect). Available at:
https://www.techtarget.com/whatis/definition/fishbone-diagram
Maladkar, K. 2018. 5 Ways To Handle Missing Values In Machine Learning Datasets. Available at:
https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
Marr, B. 2015. Big Data: 20 Mind-Boggling Facts Everyone Must Read. Available at:
https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-
everyone-must-read/?sh=5c7f7e9117b1
Microsoft. 2022. Create a box and whisker chart. Available at:
https://support.microsoft.com/en-us/office/create-a-box-and-whisker-chart-62f4219f-db4b-4754-
aca8-4743f6190f0d
Peterson, R. 2022. Entity Relationship (ER) Diagram Model with DBMS Example. Available at:
https://www.guru99.com/er-diagram-tutorial-dbms.html
Peterson, R. 2022. What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with
Example. Available at:
https://www.guru99.com/database-normalization.html
Reinsel, D, Gantz, J & Rydning, J. 2018. The Digitization of the World from Edge to Core. Data Age
2025. Available at:
https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-
whitepaper.pdf
Richard, G, Roets, C, Adams, A & West, S. 2021. Auditing Notes for South African Students. 12th
edition. LexisNexis.
Richardson, VJ, Teeter, RA & Terrell, KL. 2023. Data Analytics for Accounting. 3rd edition. McGraw-Hill.
Sharma, H. 2022. Learn to Select Best Excel Charts for Data Analysis and Reporting. Optimize Smart.
Stevens, E. 2022. What is Data Analytics? A Complete Guide for Beginners. Available at:
https://careerfoundry.com/en/blog/data-analytics/what-is-data-analytics/