
LEARNING UNIT 5

DATA ANALYTICS

5.1 INTRODUCTION, OUTCOME AND ASSESSMENT CRITERIA

5.2 THE WORLD OF DATA

5.3 DATA ANALYTICS FOR ACCOUNTING AND RELATED SERVICES

5.4 THE DATA ANALYTICS PROCESS

5.5 STEP 1: IDENTIFY THE QUESTIONS (I)

5.6 STEP 2: MASTER THE DATA (M)

5.7 STEP 3: PERFORM TEST PLAN (P)

5.8 STEP 4: ADDRESS AND REFINE RESULTS (A)

5.9 STEP 5: COMMUNICATE INSIGHTS (C)

5.10 STEP 6: TRACK OUTCOMES (T)

5.11 CLOSING REMARKS

5.12 REFERENCES

5.1 INTRODUCTION, OUTCOME AND ASSESSMENT CRITERIA

Your studies in AIN1501 introduced you to the concept of data. In AIN2601, you explored data in the
computerised system process, the database environment, relational database terminology, and how
databases are utilised in the business world (this included concepts like data warehouses, data
marts, data mining and OLAP). It is important to understand these concepts very well before
attempting to perform and understand data analytics. It is therefore advisable to revise this content
before continuing with this learning unit.

We will kick off with a discussion of the world of data. Herein we will also discuss an emerging
technology called big data. Once you understand these two concepts, we will focus on the data
analytics cycle.

Let’s explore the world of data …

5.2 THE WORLD OF DATA

In the current age, you will agree that data are all around us. Figure 5.1 indicates the increase of
data, from 2 zettabytes in 2010 to a forecasted 175 zettabytes in 2025. You will notice from this
graph that data are increasing exponentially, so imagine how much data will be generated by 2030.

Figure 5.1 Volume of data/information created, captured, copied, and consumed worldwide from
2010 to 2025 (Reinsel, Gantz & Rydning, 2018)

In September 2015, Marr (2015) stated the following:

“The data volumes are exploding; more data has been created in the past two years than in the
entire previous history of the human race.”

Look again at figure 5.1, and you will agree that this is still the case. To make matters more complex, it is estimated that 90% of data are unstructured (Drury, 2022:671). For an explanation of structured vs unstructured data, refer to learning unit 2.

This enormous increase in data gave rise to an emerging technology called big data, which can be
analysed by making use of data analytics tools. Examples of data analytic tools are machine learning
and deep learning.

Big data is covered in detail in AIN1501. Please study this section in AIN1501 before continuing to
the next topic in this section.

In your AIN1501 learning material, you would have noticed that there are numerous benefits that
big data offers. Unfortunately, certain risks also come with these benefits (also discussed in your
AIN1501 material). To mitigate or limit these risks, it is necessary to implement internal controls.

*** SPECIFIC CONTROLS TO BE IMPLEMENTED REGARDING BIG DATA ***

For organisations to limit their risks regarding big data, it is important for them to
implement the following (Richard et al, 2021: 8/74):

• The organisation should have a big data strategy that should:

o be agreed and signed off by top management
o be aligned with the overall business strategy
o incorporate data security, privacy measures, the data landscape, storage, cloud computing and cyber security
• Documented business and technical requirements should align with big
data projects and with the organisation’s strategy.
• Management should ensure that the analysis of the extracted data is meaningful and that the organisation is reviewing and using the findings of the analysis.
• If big data poses a risk to the organisation, it should be addressed in the
risk management process by updating the risk register.
• Service level agreements (SLAs) entered into between the organisation and third parties should be scrutinised if third parties are analysing big data on the organisation's behalf.
• Roles, responsibilities and ownership should be defined for big data processing.

• Make sure that access management controls (covered in AIN1501) are
implemented over big data.
• Change management controls (covered in AIN1501) should be
implemented over big data.
• Backup and disaster recovery controls (covered in AIN1501) should be in
place over big data.
• Staff should be trained on how to use big data, as well as the privacy
concerns regarding big data.

Activity 5.1

Ms Duma, the CEO of Ikhishi Likagogo, the company that manufactures breads, cakes, and South
African delights, has called a meeting with you, Ms Jansen (financial director), Ms Tau (product
manager of the cake division) and Mr de Villiers (CIO).

Ms Duma displayed the following slide in her meeting:

Figure 5.2 Revenue per division 2012–2022 (Van Staden, 2022)

She said that it is clear from figure 5.2 that the decline in Ikhishi Likagogo's sales was mainly due to declining sales in the Cakes division. She was very concerned and asked the meeting members to brainstorm a solution for this.

Mr de Villiers went first, stating that he thinks the problem is that Ikhishi Likagogo no longer knows its customers and is therefore not supplying the right products to them. He also stated that big data can assist Ikhishi Likagogo in resolving this problem.

Ms Duma replied: “Thanks Mr de Villiers. I am not sure what this big data is and what you are talking
about. But I am willing to let the AIN3701 student investigate this matter and present at our next
meeting to explain this to me.”

Prepare a slide show to present to Ms Duma:

I. What is big data? Discuss this by applying it to Ikhishi Likagogo.


II. What are the sources where one could obtain the above-mentioned big data?
III. If the big data is obtained from these sources, how will the big data be analysed?
IV. What additional risks would Ikhishi Likagogo be faced with, if they were to use big data to
identify market trends?
V. What internal controls can Ikhishi Likagogo implement to minimise the effect of the risks mentioned in part IV of this activity?

Please upload your presentation by clicking here.

End of activity

5.3 DATA ANALYTICS FOR ACCOUNTING AND RELATED SERVICES

Data analytics is the process of analysing raw data in order to draw out meaningful, actionable
insights, which are then used to inform and drive smart business decisions (Stevens, 2022). It is
important to note that data analytics provides a way to search through structured and unstructured
data (Richardson et al, 2023). The benefit of data analytics is that it will enable organisations to
discover various buying patterns of their customers, investigate anomalies that were not
anticipated, and forecast future possibilities (Richardson et al, 2023). These are only a few of the
benefits, as there could be a lot more.

Now you might ask me, what does this have to do with the accounting professional of the future?

Richardson et al (2022:13-14) provide some guidance on this: "Accounting professionals do not need to become data scientists, or build a database from scratch, or even perform real hardcore data analytics, but they need to know how to do the following:

• Clearly articulate the business problem an organisation is facing.
• Communicate with data scientists about specific data needs and understand the underlying
quality of data.
• Draw appropriate conclusions to the business problem based on the data and make
recommendations on a timely basis.
• Present their results to individual members of management in an accessible manner. This can include CEOs, audit managers, etc."

If one looks at the above requirements, analytical-minded accountants should have skills in the
following areas (Richardson et al, 2022:14):

• Develop an analytics mindset – know when and how data analytics can address business
questions.
• Data cleaning and preparation – comprehend the process needed to clean and prepare the
data before analysis.
• Know that data quality consists of completeness, reliability, and validity.
• Descriptive data analysis – perform basic analysis to understand the quality of the underlying
data and its ability to address the business question.
• Data analysis through manipulation – demonstrate the ability to sort, rearrange, merge, and reconfigure data in a manner that allows enhanced analysis. This may include diagnostic, predictive, or prescriptive analytics to appropriately analyse the data.
• Statistical data analysis competency – identify and implement an approach that will use
statistical data analysis to draw conclusions and make recommendations on a timely basis.
• Data visualisation and data reporting – report the results of the analysis in a way that is accessible to each decision maker and tailored to his/her specific needs.

Activity 5.2

Consider the following statement:

Data analytics is expected to have dramatic effects on auditing and financial reporting as
well as tax and managerial accounting (Richardson et al, 2023:5).

Choose one area (auditing, financial accounting, managerial accounting, or tax) and discuss the above statement in relation to that area.

Share your reply with your fellow students in the data analytics Discussion Forum. Please follow the rules of netiquette we covered in learning unit 0 and adhere to the following guidelines for participating in forums:

• Compile your post offline and keep record of it.
• Use an academic writing style for referencing and citing the sources you used.
• Post your answer on the forum.
• Reply to the contributions of at least two of your fellow students.
End of activity

In the next section, we are going to discuss the steps in the data analytics process.

5.4 THE DATA ANALYTICS PROCESS

To perform data analytics, it is important to follow a structured approach. There are a number of data analytics approaches, like the CRISP-DM (Cross-Industry Standard Process for Data Mining) model, but for AIN3701 we are going to use the IMPACT model, which is prescribed by Richardson et al in the textbook Data Analytics for Accounting. The IMPACT model is an acronym which can be explained as follows:

I - Identify the questions

M - Master the data

P - Perform a test plan

A - Address and refine the results

C - Communicate insights

T - Track outcomes

The IMPACT cycle is illustrated in figure 5.3.

[Figure: the IMPACT cycle, with the six steps arranged in a circle: Identify the questions → Master the data → Perform a test plan → Address and refine the results → Communicate insights → Track outcomes → back to Identify the questions]
Figure 5.3 The IMPACT model (Richardson et al, 2023:9). Illustrated by Van Staden (2022).

In the next sections, we will be discussing the steps of the IMPACT cycle in detail. It is important that
you can apply these steps.

5.5 STEP 1: IDENTIFY THE QUESTIONS (I)

Identifying the question is a very important part of the data analytics process. If you ask the wrong question, you are not going to get the right answer. As an accountant (or other accounting professional), you possess a unique skill set for improving an organisation's data analytics: the ability to ask the right questions, because accountants understand an organisation's financial data (Richardson et al, 2023:9).

In the workplace you could be tasked within a group to solve a problem using data analytics (or in
some cases even data science). It is important that you are very active in identifying the problem by
defining the correct question. But how do you do that?

To assist you in identifying the correct question, you can use root cause analysis techniques. Root
cause analysis is an approach for identifying the underlying causes of an incident so that the most
effective solutions can be identified and implemented (ThinkReliability, 2022).

Figure 5.4 Root cause analysis (ThinkReliability, 2022)

It is important to note that the symptom of a problem is not necessarily the problem itself. In most cases there are underlying causes, which are referred to as the root of the problem. This is illustrated in figure 5.4. Therefore, by solving the root of the problem (and note that there could be more than one root cause of a problem), the symptoms will improve or even disappear.

There are a few techniques you can use to arrive at the root cause of a problem. The first technique we are going to discuss is the Five Whys technique. The Five Whys technique (5W) involves asking "why" repeatedly about a specific problem until the root cause of the problem is identified. Let me illustrate:

Flikkers Disco (Flikkers) is a nightclub in Sunnyside, Pretoria. Flikkers also serves beverages (alcoholic and non-alcoholic) and light meals in the evenings and is open from 18:00 to 23:00 each night. In December 2023, the company's year end, the directors noticed in the financial statements that the net profit had declined by almost 45% from the 2022 financial year. This is very concerning, and they have asked you to investigate this problem.

You decided to use the 5W technique. So, let’s start:

• Why did profit decline from 2022 to 2023?
Answer:
The drop in profit was caused mainly by a decrease in revenue. Most of the expenditure
items increased below inflation, indicating that some cost containment measures were
introduced.

• Why did revenue decrease?


Answer:
Revenue decreased because of fewer customers visiting the night club in 2023 than in 2022.

• Why did fewer customers visit the night club in 2023?


Answer:
To answer this, it may be necessary to gather data from customers of the nightclub. Let’s say
through the data you can see that customers prefer to go to a new club, across the street,
called Boemboks.

• Why do customers prefer Boemboks over Flikkers?


Answer:
Again, some investigation should be done. Through investigation, it is determined that
Boemboks is playing music that customers prefer.
This is the root cause of the problem.

Flikkers can now action a solution to this problem. They need to determine which music customers
prefer. This is an example of a data analysis exercise, where Flikkers will need to determine their
customers’ music preferences (including but not limited to genres, styles, or music bands) to draw
them back to the nightclub.

Another technique that can be used to identify the problem is the 5W1H technique. The 5W1H
technique assists you to understand a potentially problematic situation by asking the right questions.
The questions are (Galiana, 2019):

• What: description of the problem.


• Who: the responsible parties.
• Where: the location of the problem.
• When: temporal characteristics of the problem (at what point of time, how often).
• How: the effects of the problem.

• Why: reasons, cause of problems.

Let’s apply this to Flikkers.

• What: Profitability decreased by 45% from 2022 to 2023.


• Who: The parties responsible for maintaining profitability are the board of directors of
Flikkers.
• Where: The location is Flikkers nightclub in Sunnyside, Pretoria.
• When: The problem of decrease in profit was identified in December 2023.
• How: The decrease in profit could lead to possible retrenchments of staff, or even closing
down the club.
• Why: Here you could use the five why’s again, until you reach the root problem.

The problem statement in this case will be: Due to a decline in profitability in 2023, the board of
directors will launch an investigation to determine the music preferences of customers at Flikkers in
Sunnyside, Pretoria. If this investigation is not done, it could lead to possible retrenchments of staff
of Flikkers or even closing the Flikkers club down in Sunnyside.

I think you will agree that this is a nice and comprehensive problem statement.

The last technique we are going to discuss, which you may encounter in practice, is the fishbone
diagram. For a discussion on this technique, study the article by Lewis (2020) by clicking here.

An example of a fishbone diagram for Flikkers is displayed in figure 5.5.

[Figure: fishbone diagram for Flikkers, with the effect "decrease in profit" and cause branches covering the environment (new competitor), the decrease in revenue (fewer customers, fewer meals sold), clients (do not like the music, more fights in the evenings) and cost containment (no increases, fewer bouncers)]

Figure 5.5 Fishbone diagram of Flikkers (Van Staden, 2022)

Activity 5.3

After the meeting held in activity 5.1, Ms Duma asked you to assist in identifying why the Cakes
division sales are currently declining (see figure 5.2).

Apply the following root cause analysis techniques to identify the root cause of the above problem:

I. The five why’s


II. 5W1H
III. Fishbone diagram

Prepare this activity in Word, and then convert your Word document into a PDF document. Then upload your PDF document by clicking here.

IMPORTANT: YOU NEED TO SHOW ALL YOUR STEPS TO IDENTIFY THE ROOT CAUSE OF THE
PROBLEM.

End of activity

5.6 STEP 2: MASTER THE DATA (M)

Mastering the data requires one to know what data are available and whether those data might be able to help address the problem identified in step 1 (Richardson et al, 2023:10). This step requires comprehensive knowledge of the data: how to access them, their availability, reliability, frequency of updates and the time periods the data cover, so as to make sure the data coincide with the timing of our business problem. Step 2 can be divided into the sub-steps listed below:

• Determine the purpose and scope of the data request (extract).


• Obtain the data (extract).
• Validate the data for completeness and integrity (transform).
• Clean the data (transform).
• Load the data in preparation for data analysis (load).

Data scientists refer to the above process as the ETL process, which stands for extract(ing),
transform(ing) and load(ing) data. Next, we will discuss how to determine the purpose and scope of
the data request.
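Before we do, and purely for interest, here is a minimal sketch of the ETL idea in Python (pandas), one of the more advanced tools mentioned later in this unit. The file names and the InvoiceDate column are hypothetical, chosen only for illustration:

    import pandas as pd

    # Extract: read the raw data from the source system's export
    raw = pd.read_csv("sales_export.csv")

    # Transform: validate and clean, e.g. remove exact duplicate rows and
    # convert the invoice date column to a proper date type
    clean = raw.drop_duplicates()
    clean["InvoiceDate"] = pd.to_datetime(clean["InvoiceDate"])

    # Load: write the prepared data to the file the analysis will use
    clean.to_excel("sales_prepared.xlsx", index=False)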

5.6.1 Step 2a: Determine the purpose and scope of the data request

To determine the purpose and scope of the data request, you need to answer the following
questions (Richardson et al, 2023:61):

• What is the purpose of the data request? What do you need the data to solve? What
business problem will it address?
• What risks exist in data integrity? If there are risks, is there any mitigation plan?
• What other information will impact the nature, timing, and extent of the data analysis?

Make sure you document the purpose and scope of the data request. This will include any risks and
assumptions made on the data.

Now you are ready to continue to the next sub-step.

5.6.2 Step 2b: Obtain the data

In data analytics, it is crucial that you understand the business domain, the data domain and the
statistical tools required to solve the problem. In defining the business problem in step 1, you
utilised your understanding of the business.

Now, to solve this problem, you will need to know what data you require. Are the data currently available in the organisation's databases, or do you need to utilise data gathering techniques to obtain them? Maybe you need to consult external data sources? This concept will be elaborated on in section 5.6.2.1.

Once you know which data source or sources are going to be used, you need to know where the data are stored. Are the data stored in a relational database, a spreadsheet or even a flat file? In most organisations data will be stored in a relational database, like Oracle, SQL Server or Microsoft Access. It is good practice to normalise databases to ensure data integrity. Section 5.6.2.2 will explain this concept to you.

Once you know where to obtain your data from, you will need to investigate what data are available
in your data source. As stated before, it is very important to know your data. If you don’t understand
the data and use the wrong data fields, your analysis will analyse irrelevant data leading to an
incorrect answer. To gain an understanding of your data, you will find that database schemas (also
called ER diagrams) and data dictionaries are very useful. We will discuss these data exploratory
tools in section 5.6.2.3.

Once you understand the data, it is time to collect the data. The data collection process will be
discussed in section 5.6.2.4.

5.6.2.1 Data sources

One must remember that data can come from many sources, which can normally be classified as internal and external data sources. Data are internal if an organisation generates, owns and controls them (Bekker, 2017). This can be done through separate systems like the accounting information system, supply chain management system, customer relationship management system and the human resource management system, to name a few. These days more and more organisations are making use of Enterprise Resource Planning (ERP) systems: business management software that integrates applications throughout the business into one system (Richardson et al, 2023:54).

External data are public data, or data generated outside the company; the organisation therefore neither owns nor controls them (Bekker, 2017). Examples of external data sources are the
South African Data Portal (https://southafrica.opendataforafrica.org/), the South African Reserve
Bank (https://www.resbank.co.za/en/home) and Stats SA (https://www.statssa.gov.za/).

*** DATA CONTROLS REGARDING THE DATA SOURCE ***

It is important to ensure that the data source used to obtain data from is complete,
accurate and valid.

Controls to ensure the data source is valid:

Do you have permission to obtain data from the data source you have selected? You need to obtain permission to use data from the data owner (whether internal or external). If you use an external data source, did you read the terms of usage? Do you comply with those terms? Are you perhaps in breach of legislation by using the data source? Maybe you are even hacking the data.

Controls to ensure the data source is complete:

Is the source you are going to use based on the latest (current) data?

Controls to ensure the data source is accurate:

Is the owner of the data known to supply accurate data?

Remember: The aim is to make a business decision on the data used within the
data source.

Therefore, do your due diligence on the data source!

5.6.2.2 Normalisation

A new concept I want to discuss in relational databases is normalisation. Peterson (2022) defines
normalisation as ‘a database technique that reduces data redundancy and eliminates undesirable
characteristics.’ Normalisation rules divide larger tables into smaller tables and link them using
relationships (Peterson, 2022).

I want to illustrate this concept with an example:

Consider the following database tables:

Figure 5.6 Customer and Province data tables (Van Staden, 2022)

The Customer table’s primary key is Customer_ID, while the Province table’s primary key is
Province_ID. Therefore, these two tables can be linked by Province_ID (therefore Province_ID is a
foreign key in the Customer table). But let’s see how the data look in these two tables:

CUSTOMER
Customer_ID (PK) Customer_Name Customer_Adress Customer_VatNr Customer_City Province_ID (FK)
1 Lulu's Bakery 321 Justice street 1122336789 Pretoria 1
2 Sophia's Tea Garden 45 Tau avenue 1122336800 Johannesburg 1
3 Bela Bela Supermarket 12 Church street 1122336811 Bela Bela 2
4 Capricorn Inn 51 Unicorn drive 1122336822 Polokwane 2
5 UNISA student café Preller street 1122336833 Pretoria 1

PROVINCES
Province_ID (PK) Province_Name
1 Gauteng
2 Limpopo

Figure 5.7 Customer and Province data (Van Staden, 2022)

Let’s look at Customer_ID “1”. Lulu’s Bakery is in Province_ID “1”. In the Provinces table we can see
that province “1” is Gauteng. The process of creating additional stand-alone (smaller) tables is
known as normalisation. There are a lot of benefits to it, but the biggest benefit is data integrity.
Imagine if one needed to type in Gauteng for each customer in Gauteng. You will get multiple
versions of it, for example Gauteng, Gateng, gteng, Guateng, Gatang etc.

Now that you understand these concepts, let me shock you with this statement: for data analytics you will take normalised tables and convert them back to an unnormalised single sheet (in Excel for AIN3701).

Consider again the data in figure 5.7. Which province has the most customers? With little data this is easy to answer, but what if there were 150 customers across nine provinces? If you tell me province "1" has the most customers, that won't mean anything to me. So, we need to change the "1's" back to Gauteng in our data analysis sheet. But more on this in section 5.6.4.2.8.

5.6.2.3 Data exploratory tools

In AIN3701 we will be using data dictionaries and ER diagrams to determine which data are needed to solve a business problem. An ER diagram displays the names of the tables, the fields in each table and the relationships between the tables in a database. An ER diagram is illustrated in figure 5.8.

Figure 5.8 Entity Relationship Diagram Example (Peterson, 2022)

If you struggle to interpret ER diagrams, please read this article by Peterson (2022) by clicking here.

Note that you will be assessed on your ability to identify the correct tables and fields in a database
and not on the detail of constructing ER diagrams. Therefore, this article is there only to assist you,
and you will not be assessed on it.

Data dictionaries were covered in detail in AIN2601. Please revise this section in your AIN2601 study
material.

5.6.2.4 The data collection process

The following steps can assist you when you are collecting your data (Richardson et al, 2023:61-62):

First, you need to determine from whom you need to request the data. In some cases, you will be able to retrieve the data yourself, but most of the time you will need to request the data from the database administrator or the information systems department.

Secondly, you need to determine what data are needed. Use the ER diagrams and data dictionaries
to identify which data will be applicable for your analysis. Be specific.

Thirdly, you need to determine in which format the data should be. First prize would be a flat file, with all foreign keys populated, but you will rarely be so lucky as to receive your data like this. Also take into consideration which tool you are going to use to test and process the data received. This will aid the database administrator (or information systems department) in providing the data to you in the most accessible format.

Lastly, provide the database administrator (or information systems department) with a deadline by when you should receive the data. This will enable you to do your data analysis in time.

Some organisations have a pre-developed form to request data. This form is called a data request
form. Richardson et al (2023:62) provide an example of a data request form displayed in figure 5.9.
Requestor Name:
Requestor Contact Number:
Requestor Email Address:
Please provide a description of the information needed (indicate which tables and which fields you require):
What will the information be used for?
Frequency (circle one): One-Off / Annually / Termly / Other: ___________
Format you wish the data to be delivered in (circle one): Spreadsheet / Text File / Word Document / Other: ____________
Request Date:
Required Date:
Intended Audience:
Customer (if not requestor):

Figure 5.9 Example of a data request form (Richardson et al, 2023:62)

Remember, when obtaining the data yourself, you are still required to identify the tables, the
attributes required and the relationships between the tables before extracting the data. There are
various methods to retrieve the data. One of these methods is if the database administrator

provides you with direct access to the database. Then you will be able to connect to the database by
using Microsoft Excel, Microsoft PowerBI or Tableau. When your dataset has more than 1 048 576
rows, you will not be able to pull it into Microsoft Excel 365, or earlier versions. Then you should
make use of more advanced data analysis software like Microsoft PowerBI, Tableau, R or Python. In
AIN3701 we are going to make use of Microsoft Excel 365.

Note that you can also use SQL (Structured Query Language) to retrieve data from a database. SQL,
however, belongs more in the field of computer science and will not be covered in this course. If you
want to learn more about SQL, please watch the video by Gary Explains (2019) by clicking here. You
will not be assessed on this video in this module.

Activity 5.4

Ms Jansen asks if you can assist her in requesting data to answer the following questions for the
2023 financial year:

I. What was the revenue per division? One column should indicate the divisions’ income
through normal sales and the other column, delivery income.
II. What was the profit per division? One column should indicate the divisions’ profit through
normal sales and the other column, delivery income.
III. What was the revenue per province? One column should indicate the provinces' income through normal sales and the other column, delivery income.
IV. What was the profit per province? One column should indicate the provinces' profit through normal sales and the other column, delivery income.

Part of Ikhishi Likagogo’s data management policy is that all data should be requested by making use
of the company’s data request form. Please download the data request form by clicking here.

To assist you in completing the data request form, Mr de Villiers (CIO) provided you with the
following:

• ER diagram (download by clicking here)


• Data dictionary (download by clicking here)

You are required to complete the data request form to provide enough data in order for Ms Jansen
to answer questions I–IV above.

Upload your completed data request form by clicking here.

End of activity

5.6.3 Step 2c: Validate the data for completeness and integrity

When data are moved or extracted from one location to another, it is possible that some of the data get lost or corrupted during extraction. It is crucial to ensure that the extracted data are complete, as you will get the wrong answer by analysing incomplete data. You also need to ensure that the integrity of the data remains intact after the extract. This means that no one has manipulated, tampered with, or duplicated the data during extraction. To validate data successfully you will need technical skills, but you also need to know your data well. If you know what to expect from the data, you have a higher likelihood of identifying errors or issues from extraction (Richardson et al, 2023:64).

Richardson et al (2023:64-65) identified the following four steps that should be completed to
validate your data after extraction:

I. Compare the number of records that were extracted to the number of records in the source database. The records should match. This step will not provide information on the data itself but will ensure that the number of records matches, contributing to completeness.
II. Compare descriptive statistics for numeric fields. Calculating minimums, maximums, averages and medians will help ensure that the numeric data adhere to the data limits, ensuring accuracy and completeness. Descriptive statistics will be discussed in detail in section 5.7.1.
III. Validate date or time fields. This can be achieved by converting the data type to numeric and running descriptive statistics comparisons on these fields.
IV. Compare string limits for text fields. Text fields are unlikely to cause concerns if you extract your data to Microsoft Excel 365, because Excel allows a generous maximum number of characters per cell. Just be aware of this if you use a tool other than Excel to analyse text fields.

The above four steps are data controls to ensure that the ETL (extract, transform and load) process is complete, accurate and valid.
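For interest, the first three of these checks can be scripted. Below is a minimal sketch in Python (pandas); the file name, the expected record count of 5 000 and the field names Qty and InvoiceDate are all hypothetical:

    import pandas as pd

    extract = pd.read_csv("extract.csv")  # the extracted file

    # Step I: compare the number of extracted records to the source count
    source_record_count = 5000  # count reported by the database administrator
    assert len(extract) == source_record_count, "Record counts do not match"

    # Step II: descriptive statistics for the numeric fields
    print(extract["Qty"].describe())  # count, mean, min, max, quartiles

    # Step III: validate a date field by checking that its range is sensible
    dates = pd.to_datetime(extract["InvoiceDate"])
    print(dates.min(), dates.max())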

If an error is detected at this stage, you need to contact the originator of the data (the information systems department or the database administrator) to report it. Most of the time they will re-extract the data, which will solve the problem. If you extracted the data yourself, then my advice would be to re-extract the data. If this does not solve the problem, you need to report it to the information systems department or the database administrator to investigate the matter further.

5.6.4 Step 2d: Clean the data

You have now obtained the data, whether from the information systems department, the database administrator, or by extracting them yourself. Now the aim is to get your data ready for analysis.

Unfortunately, extracted datasets may contain duplicates, anomalies or missing data, which could distort how the data are interpreted (Stevens, 2022). It is therefore important to remove these discrepancies. We refer to this process as data cleaning.

Stevens (2022) states that data cleaning can be a time-consuming task, but that it is crucial for
obtaining accurate results.

Proper computer controls will, however, reduce the time spent on data cleaning. For example, if a computerised control prevents a user from entering a negative value in the sales quantity field, then you will not have to investigate negative sales quantities in your dataset. Another example is where customer details are retrieved by the system and entered automatically in the customer detail fields, so that the customer's address need not be typed in every time a customer buys an item. The result of this control is that spelling mistakes will be limited in the data, as well as in your dataset.

It is therefore necessary to discuss computerised controls before we take you through the data
cleaning process itself. As an accounting professional you will need to design and sometimes
evaluate these controls in practice. Computer controls will be discussed in section 5.6.4.1.

In section 5.6.4.2 we will be discussing and illustrating how to do data cleaning on a dataset.

5.6.4.1 Computer controls (IT controls)

For an organisation’s information system to operate efficiently and effectively, there should be rules
and procedures governing information systems. This can be achieved by ensuring a strong internal
control environment, effective general IT controls and effective application controls.

To understand computer controls over data and new technologies, it is important to understand how IT controls interact with business operations. IT controls were introduced in AIN1501; please revise this study material. Also watch the following three videos by AmandaLovestoAudit (2019), which explain the difference between general controls and application controls, by clicking here.

IT controls will be further expanded on in your auditing studies in modules AUE2601 and AUE3764.

Remember that IT controls will also be applicable to data. For data input and usage, controls should
be implemented to ensure that data are accurate, complete and valid. If these objectives are not
achieved, an organisation will make decisions upon inaccurate, incomplete and invalid data.

Note that data also expose an organisation to additional reputational and legislative risk. If sensitive data regarding the organisation (for example, employees' salaries) are leaked, the organisation's reputation suffers. Organisations also need to ensure that all laws relating to data (like the POPI Act) are complied with. The POPI Act was covered in AIN1501 and AIN2601.

5.6.4.2 Data cleaning

Table 5.1 presents the comprehensive steps compiled by Dzuranin, Geerts and Lenk (2023) to clean the data in a dataset.

Step | Issue | Detect | Correct

Extraction

1 | All the data did not transfer | Compare row counts | Add missing rows
2 | Data was not correctly transferred | Compare control amounts | Modify incorrect values

Transformation

3 | Irrelevant / unreliable data | Visually scan columns for irrelevant and unreliable data | Remove columns with irrelevant and unreliable data
4 | Incorrect / ambiguous column names | Scan columns for incorrect or ambiguous names | Rename the column
5 | Incorrect data types | Inspect the data type | Change the data type
6 | Columns are not single valued | Examine the values in the columns | Split composite columns
7 | Incorrect values | Detect incorrect values with outlier detection | Modify incorrect values
8 | Inconsistent values | Identify inconsistent values | Modify inconsistent values
9 | Incomplete values | Investigate null values | Replace the null values
10 | Data refers to foreign keys in other tables | Investigate whether there are more tables where the data is stored | Unnormalise normalised data (only in Excel)

Table 5.1 Data preparation patterns (Dzuranin, Geerts & Lenk, 2023:5-46). Adjusted by Van Staden.

Table 5.1 provides ten steps that should be followed when performing the extraction and transformation process on a dataset. Steps 1 and 2 assist with the validation of data for completeness and integrity, which was discussed in section 5.6.3. Steps 3 to 10 refer to the transformation of data, which mainly serves to clean the dataset. Note that in Excel the data do not have to be loaded again after they have been cleaned. If you are working with other data analysis software, it may be necessary to check whether the cleaned data are loaded correctly into the final dataset. Next, we will discuss steps 3 to 10 in table 5.1.

5.6.4.2.1 Step 3: Irrelevant and unreliable data

Data that are irrelevant for decision making enlarge the data model. It is also important to avoid integrating unreliable data into the data model. Remember: garbage in, garbage out (Dzuranin et al, 2023).

By making use of the data dictionary and by scanning through the database, irrelevant columns can be identified. If you want to determine revenue per province, it is advisable to keep only the columns that will contribute to your solution: in this case, the revenue column and the province column. There is no need to keep irrelevant columns like address, city and credit limit.

By scanning through the data, one can also determine whether a column contains unreliable data. Unreliable data is where a column contains a mixture of null values, dates and text. Imagine analysing an age column containing such a mixture. If a column can't be fixed, it is better to remove it.
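As an aside, removing irrelevant columns is a one-liner if you script it. Here is a minimal sketch in Python (pandas) with made-up data, keeping only the columns needed for a revenue-per-province question:

    import pandas as pd

    customers = pd.DataFrame({
        "Province": ["Gauteng", "Limpopo"],
        "Revenue": [1200, 450],
        "Address": ["321 Justice street", "12 Church street"],
        "CreditLimit": [5000, 3000],
    })

    # Keep only the columns that contribute to the business question
    reduced = customers[["Province", "Revenue"]]
    print(reduced)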

5.6.4.2.2 Step 4: Incorrect and ambiguous column names

Column names become variables during data exploration and interpretation. Imagine interpreting data where the column names are PR1, TR2 and Di. One would not comprehend what these column names refer to without consulting the data dictionary. To make matters easier for yourself, you can rename the columns to ProductRevenue, TaxRate and Discount. Now the names are understandable, and other users of the dataset can understand them too.

Dzuranin et al (2023) provide the following guidelines for naming columns:

• Names should accurately describe the column’s content.


• Names should be intuitive to businesspeople.
• Use only common abbreviations that are understood by everyone, such as YTD.
• Eliminate spaces, underscores, or other symbols. For example, use CustomerName instead of Customer_Name.
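Renaming columns is just as easy to script. A brief sketch in Python (pandas), using the hypothetical column names from the example above:

    import pandas as pd

    df = pd.DataFrame({"PR1": [100.0], "TR2": [0.15], "Di": [10.0]})

    # Rename the cryptic columns to intuitive, single-word business names
    df = df.rename(columns={"PR1": "ProductRevenue", "TR2": "TaxRate", "Di": "Discount"})
    print(df.columns.tolist())  # ['ProductRevenue', 'TaxRate', 'Discount']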

5.6.4.2.3 Step 5: Incorrect data types

If a data type is incorrect, it can produce inaccurate results. Imagine that the revenue column displays numerical values, but Excel sees them as text. Statistics calculated on this column will then give zeros. If this is the case, you need to change the column to the correct data type.
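The same problem is easy to demonstrate outside Excel. A short sketch in Python (pandas), with invented values:

    import pandas as pd

    revenue = pd.Series(["100", "250", "80"])  # numbers stored as text
    print(revenue.sum())  # concatenates to '10025080' instead of adding

    # The fix: change the column to the correct (numeric) data type
    revenue = pd.to_numeric(revenue)
    print(revenue.sum())  # 430, as expected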

5.6.4.2.4 Step 6: Composite and multi-valued columns

Each cell should contain one value describing one characteristic, because two or more values in the same cell make analysis more challenging. Two specific scenarios that violate the single-valued rule and make analysis more complex are composite columns and multi-valued columns (Dzuranin et al, 2023).

The best method to detect composite or multi-valued columns is visual scanning (Dzuranin et al,
2023). Consider the following table EMPLOYEE:

Pers_Nr EmployeeName
1 Jacques, van Staden
2 Keenan, Martin
3 Rendani, Nzhinga

Figure 5.10 Employee table

You will see that the EmployeeName column contains the name and the surname in one column.
Preferably we want to split this into two columns, called FirstName and Surname.

Excel can facilitate this. Select the column EmployeeName, click on Data and then Text to Columns.

You will see the following wizard:

Figure 5.11 Text to Columns wizard (Step 1)

Select Delimited and click Next.

Figure 5.12 Text to Columns wizard (Step 2)

The second screen of the wizard allows you to specify how Excel should recognise where to split the data. In the case of EmployeeName, you will see the names and surnames are separated by a comma. Therefore, deselect Tab and select Comma. You will see the following:

Figure 5.13 Text to Columns wizard (Step 3)

Notice in the Data preview that the data are now split between the name and the surname. This is a preview of how the data will look once converted.

Click on Next. You will see the following:

Figure 5.14 Text to Columns wizard (Step 4)

On this screen you can adjust the column format and specify the destination of the columns. In this
case don’t change anything and click on Finish.

The table has changed as follows:

Pers_Nr EmployeeName
1 Jacques van Staden
2 Keenan Martin
3 Rendani Nzhinga

Figure 5.15 Amended Employee table

Note that we are not done yet.

Please change the column names to more descriptive names as already discussed in section
5.6.4.2.2. Please see the final table in figure 5.16.

Pers_Nr FirstName Surname
1 Jacques van Staden
2 Keenan Martin
3 Rendani Nzhinga

Figure 5.16 Final Employee table
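Should you ever need to do the same split in a script rather than in the wizard, here is a minimal sketch in Python (pandas), using the employee data from figure 5.10:

    import pandas as pd

    employees = pd.DataFrame({
        "Pers_Nr": [1, 2, 3],
        "EmployeeName": ["Jacques, van Staden", "Keenan, Martin", "Rendani, Nzhinga"],
    })

    # Split on the comma delimiter into two descriptively named columns
    split = employees["EmployeeName"].str.split(",", expand=True)
    employees["FirstName"] = split[0]
    employees["Surname"] = split[1].str.strip()  # remove the leading space

    # Drop the original composite column
    employees = employees.drop(columns="EmployeeName")
    print(employees)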

5.6.4.2.5 Step 7: Incorrect values

Incorrect values are sometimes entered into a dataset and may be difficult to detect. An example of an incorrect value is where the price of a product is entered as R15 instead of the correct price of R51. In the case of numerical values, outlier detection procedures can be used to detect these types of mistakes. Incorrect text values are harder to find. One would most likely pick these errors up by sorting and filtering the data.

Study the article by Bansal (n.d.), which explains what outliers are, how to find them in Excel and
how outliers should be handled. To download the article, click here.

One can also detect outliers by using charts. Examples of charts that can assist in detecting outliers are histograms and box and whisker charts. You have learned to create charts in FAC1501, but the following article by Microsoft (2022) will assist you if you struggle to draw up a box and whisker chart. To download the article, click here.

Once an incorrect data value is identified, one should correct it in the Excel sheet. It is also important to determine the error's root cause and advise the information systems department of it, so that they can modify the information systems in such a way that the error does not occur again in future.
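For interest, the IQR rule described in the Bansal article can be expressed in a few lines of Python (pandas); the price values below are invented, with one mistyped entry:

    import pandas as pd

    prices = pd.Series([51, 49, 52, 50, 15, 53, 48, 51])  # 15 typed instead of 51

    q1 = prices.quantile(0.25)
    q3 = prices.quantile(0.75)
    iqr = q3 - q1

    # Common convention: values beyond 1.5 x IQR from the quartiles are outliers
    outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
    print(outliers)  # flags the 15 for investigation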

5.6.4.2.6 Step 8: Inconsistent values

Inconsistent values occur when two or more different representations of the same data are mixed in the same column (Dzuranin et al, 2023). An example is if the following data values are present in the City column: Pretoria, Pta, Protoria and Pretorai. If one were to summarise sales per city, the sales would be grouped separately under Pretoria, Pta, Protoria and Pretorai, instead of all under Pretoria.

It is therefore very important to correct these values once they are recognised. The information systems department should also be notified in this case, so that they can develop measures to eliminate this error in the future.
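Once the variants are known, standardising them is straightforward to script. A sketch in Python (pandas); the correction mapping below is specific to this made-up example:

    import pandas as pd

    cities = pd.Series(["Pretoria", "Pta", "Protoria", "Pretorai", "Johannesburg"])

    # Map every known variant to the single correct representation
    corrections = {"Pta": "Pretoria", "Protoria": "Pretoria", "Pretorai": "Pretoria"}
    cities = cities.replace(corrections)
    print(cities.value_counts())  # the variants now all group under Pretoria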

5.6.4.2.7 Step 9: Incomplete values

Incomplete values occur when cells in a dataset are empty. We also refer to these as NULL values. NULL values will distort the results of your data analysis; therefore, you need to deal with them.

First, you need to detect them. Excel has some nice ways to assist you with this. Consider the following dataset in figure 5.17. Note that I have kept it small to illustrate the detection of blank cells, but your datasets will normally be a lot bigger than five records.

Nr Student Mark
1 Steyn 56
2 44
3 Mbasa 76
4 Ntuli
5 Abrahams 88

Figure 5.17 Student data (Van Staden, 2022)

By viewing this dataset, one can see that student 2 has no student name, and student 4 has no mark. Let's see if Excel can find them.

In Excel, click on the Home ribbon, then Find & Select, Go To Special. You will see the following
dialog box:

Figure 5.18 Go to Special dialog box

Select Blanks and click OK. Now you will see:

Figure 5.19 Student data

Notice the NULL values are highlighted.

Select a Shading colour from the Home ribbon. I will select Red.

All the cells containing NULL values will now be highlighted in red (please see figure 5.20), which will enable you to investigate them.

Figure 5.20 Student data with highlighted empty cells

Maladkar (2018) identified the following five ways to deal with missing (NULL) values:

• Deleting the rows (or records) containing the missing data. This method is advised only when there are enough samples in the dataset. One also needs to make sure that deleting the data does not introduce bias.
• Replacing the missing values with the mean, the median or the mode. This is the statistical approach to handling missing values.
• Assigning a unique category. This is mostly applicable with qualitative data; for example,
gender. A new category, for example, unknown, can be created for the missing data.
• Predicting the missing values. An example is to use a machine learning algorithm to predict
the missing values. An example of a technique used to predict the values is regression
analysis. We will discuss regression analysis later in this lesson.
• Using algorithms which support missing values. Machine learning algorithms such as KNN (K
– Nearest Neighbor) and the RandomForest algorithm are used to predict non-linear and
categorical data. These two techniques are in the domain of data science and will not be
discussed further in AIN3701.
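The first three of these approaches are simple to script. A brief sketch in Python (pandas), using the student data from figure 5.17 (None marks the blank cells):

    import pandas as pd

    students = pd.DataFrame({
        "Nr": [1, 2, 3, 4, 5],
        "Student": ["Steyn", None, "Mbasa", "Ntuli", "Abrahams"],
        "Mark": [56, 44, 76, None, 88],
    })

    # Option 1: delete the rows containing missing values
    dropped = students.dropna()

    # Option 2: replace a missing numeric value with the mean of the column
    students["Mark"] = students["Mark"].fillna(students["Mark"].mean())

    # Option 3: assign a unique category to a missing qualitative value
    students["Student"] = students["Student"].fillna("Unknown")
    print(students)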

5.6.4.2.8 Step 10: Data refers to foreign keys in other tables

In section 5.6.2.2 you were introduced to the concept of normalisation. You were also given a hint that for data analysis you will need to unnormalise normalised data. Let me illustrate this with an example. Look at figure 5.21.

CUSTOMER
Customer_ID (PK) Customer_Name Customer_Adress Customer_VatNr Customer_City Province_ID (FK)
1 Lulu's Bakery 321 Justice street 1122336789 Pretoria 1
2 Sophia's Tea Garden 45 Tau avenue 1122336800 Johannesburg 1
3 Bela Bela Supermarket 12 Church street 1122336811 Bela Bela 2
4 Capricorn Inn 51 Unicorn drive 1122336822 Polokwane 2
5 UNISA student café Preller street 1122336833 Pretoria 1

PROVINCES
Province_ID (PK) Province_Name
1 Gauteng
2 Limpopo

Figure 5.21 Customer and Province data (Van Staden, 2022)

You will see that Province_ID in the CUSTOMER table is a foreign key, which relates to the primary key in the PROVINCES table. If we need to analyse customer data, we will import both tables into Excel, but the primary analysis will be on the CUSTOMER sheet (where the CUSTOMER table is imported). We will therefore need to replace the Province_ID column values of 1 and 2 with Gauteng (1) and Limpopo (2).

There is no single right way to do this. You can use Find & Replace, or you can type it in manually, but personally I feel the best solution is the VLOOKUP function. (The VLOOKUP function was covered in AIN2601.) If you use the VLOOKUP function, insert a column where the result of the VLOOKUP function is stored (in the above example, you will insert a column next to Province_ID, where the Province_Name will be displayed when applying the VLOOKUP function). You will find that the VLOOKUP function works best for big datasets.
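Outside Excel, this unnormalising step is simply a join. Here is a sketch in Python (pandas) that achieves what the VLOOKUP does, using the data from figure 5.21:

    import pandas as pd

    customers = pd.DataFrame({
        "Customer_ID": [1, 2, 3, 4, 5],
        "Customer_Name": ["Lulu's Bakery", "Sophia's Tea Garden",
                          "Bela Bela Supermarket", "Capricorn Inn",
                          "UNISA student café"],
        "Province_ID": [1, 1, 2, 2, 1],
    })
    provinces = pd.DataFrame({
        "Province_ID": [1, 2],
        "Province_Name": ["Gauteng", "Limpopo"],
    })

    # Left join on the foreign key: every customer row gains a Province_Name,
    # replacing the meaningless 1/2 codes for analysis purposes
    analysis = customers.merge(provinces, on="Province_ID", how="left")
    print(analysis[["Customer_Name", "Province_Name"]])

(In Excel itself the formula would look something like =VLOOKUP(F2,PROVINCES!$A$2:$B$3,2,FALSE), with the cell references depending on your own sheet layout.)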

5.6.5 Step 2e: Load the data in preparation for data analysis

In some data analysis software packages, one needs to get the data ready before loading them. In Excel we first load the data, and then perform our data cleaning on them. Work through the article by Chauhan (2022), which will provide you with step-by-step information on how to import data into Excel. As a bonus, this article also covers how to export data from Excel. You can download the article by clicking here.
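Should you ever load data with a script instead, the pandas equivalents of the import and export described in the article are one line each; the file names below are hypothetical:

    import pandas as pd

    # Import: load one sheet of an Excel workbook into a DataFrame
    invoices = pd.read_excel("ikhishi_data.xlsx", sheet_name="Invoice_2022")

    # Export: write the prepared data back out to a new workbook
    invoices.to_excel("invoice_2022_prepared.xlsx", index=False)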

Activity 5.5

Please refer to activity 5.4 for background information on this task.

You have received the data from Chip de Villiers (CIO). He has put two tables, Invoice_2022 and Customer_Master, into an Excel workbook. You can download the workbook by clicking here.

Chip de Villiers informed you and Ms Jansen that he did not have time to clean the data and would
only be able to start with this process after two weeks.

Ms Jansen informed him that the analysis is urgent, and that the Finance department will clean the
dataset themselves.

Ms Jansen asks you to complete the following:

I. Clean the dataset received in the attached Excel file. You need to clean both the Invoice_2022 and the Customer_Master sheets. Use table 5.1 (Data preparation patterns) and follow the steps in this table.
a. Do the data cleaning in the Invoice_2022 and the Customer_Master sheets. In addition, document (in a separate document) which step in table 5.1 was followed and provide examples of how the data were cleansed under each step.

Please see the information below on dealing with outliers and missing values:
 Ms Jansen will investigate the outliers, but she asks that you continue with
your analysis without adjusting the outliers.
 Any missing values should be replaced. Ms Jansen also asks you to highlight
missing values as follows:
• If you replace the value, highlight the cell in blue.
• If you recalculate the value, highlight the cell in red.
• If you can’t replace or recalculate the value, then you need to use
the mean value (refer to DSC1630). In this case, highlight the value
in yellow.
b. Test for outliers in the Qty column in the Invoice_2022 sheet by using a box and
whisker chart. Create a separate sheet in the workbook where this chart is
displayed.

c. Test for outliers in the Inv_Price_ex_del column in the Invoice_2022 sheet by using
the IQR. Clearly indicate whether there are outliers present, by shading these values
in grey. Show all your calculations.
II. Create a sheet called Analysis_Region. In this sheet, provide the answer for the following
question:
a. What was the revenue per province, per division?

Upload the updated spreadsheet by clicking here.

End of activity

5.7 STEP 3: PERFORM TEST PLAN (P)

You have cleaned your data, and now you are ready to start your analysis. You will see that there are different data analytics approaches to help us understand what happened, why it happened, what we can expect to happen in the future, and what we should do based on what we expect to happen (Richardson et al, 2023:116). In this step we will evaluate data analytics approaches and choose the type of analytics that will provide the most accurate, complete and valid answer to the problem we identified in step 1.

There are four types of data analytics approaches. Study the article by Cote (2021), which explains
the four types of data analytics. You can download the article by clicking here.

We are not going to explore all the different analytical methods one can use under each of these
four main types of data analytics. Note that only summary statistics, data reduction (filtering) and
regression are assessable in AIN3701. The other methods are only introduced and will not be
assessed in AIN3701.

5.7.1 Descriptive analytics

There are two main approaches when using descriptive analytics, which are summary statistics and
data reduction (also known as filtering).

5.7.1.1 Summary statistics

Summary statistics describe a set of data in terms of their location (mean, median), range (standard deviation, minimum and maximum), shape (quartile) and size (count) (Richardson et al, 2023:117). Table 5.2 indicates which statistics can be used, the Excel formula for each, and a description of the statistic.

Statistic | Excel formula | Description

Sum | =SUM() | The total of all numerical values.
Mean | =AVERAGE() | The centre value, calculated as the sum of all observations divided by the number of observations.
Median | =MEDIAN() | The middle value that divides the top half of the data from the bottom half.
Minimum | =MIN() | The smallest value.
Maximum | =MAX() | The largest value.
Count | =COUNT() | The number of observations.
Frequency | =FREQUENCY() | The number of observations in each of a series of numerical or categorical buckets.
Standard deviation | =STDEV.S() for a sample; =STDEV.P() for the entire population | The variability or spread of the data from the mean. A larger standard deviation means a wider spread away from the mean.
Quartile | =QUARTILE() | The value that divides a quarter of the data from the rest. This indicates the skewness of the data.

Table 5.2 Summary statistics (Richardson et al, 2023:119). Adjusted by Van Staden (2024)
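For comparison only, the same statistics can be produced in a few lines of Python (pandas); the values below are invented:

    import pandas as pd

    distances = pd.Series([12, 48, 30, 25, 60, 44, 18])  # invented values

    print(distances.sum())        # =SUM()
    print(distances.mean())       # =AVERAGE()
    print(distances.median())     # =MEDIAN()
    print(distances.min())        # =MIN()
    print(distances.max())        # =MAX()
    print(distances.count())      # =COUNT()
    print(distances.std(ddof=1))  # =STDEV.S() (sample standard deviation)
    print(distances.quantile([0.25, 0.5, 0.75]))  # =QUARTILE()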

5.7.1.2 Data reduction (filtering)

Data reduction is used to reduce the number of observations to focus on the most critical, interesting or abnormal items (Richardson et al, 2023:120). It does this by taking a large set of data and reducing it to a smaller set that retains most of the critical information of the larger dataset (Richardson et al, 2023:117-118). In Excel you can use the Filter tool and pivot tables to assist you in doing data reduction. The Filter tool and pivot tables in Excel were discussed in detail in AIN2601.
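As an illustration, the same filtering and pivoting can be scripted. A sketch in Python (pandas) with made-up sales data:

    import pandas as pd

    sales = pd.DataFrame({
        "Province": ["Gauteng", "Gauteng", "Limpopo", "Gauteng", "Limpopo"],
        "Division": ["Cakes", "Breads", "Cakes", "Cakes", "Breads"],
        "Revenue": [1200, 800, 450, 950, 600],
    })

    # Data reduction by filtering: keep only the rows of interest
    cakes_only = sales[sales["Division"] == "Cakes"]

    # The pivot-table equivalent: total revenue per province, per division
    pivot = sales.pivot_table(index="Province", columns="Division",
                              values="Revenue", aggfunc="sum")
    print(pivot)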

Activity 5.6

You have attended a meeting with Ms Duma (CEO), Ms Jansen (financial director), Ms Arendse (manager of the delivery division), Mr Law (legal representative of Ikhishi Likagogo) and Mr Azharuddin (union representative).

Ms Duma is concerned about the number of serious accidents caused by Ikhishi Likagogo's drivers. She also informed the meeting that Ikhishi Likagogo's insurers, DriveWise Insurers, have indicated that they are going to increase Ikhishi Likagogo's insurance premium unless she can show them how Ikhishi Likagogo is going to limit accidents.

Ms Arendse reports that during 2022 there were four serious accidents in which delivery vehicles were written off. Unfortunately, this caused a delay in deliveries, as there were not enough vehicles. She states that Ikhishi Likagogo's delivery policy (95% of deliveries need to be on time) has put pressure on the drivers, causing them, in some cases, to violate traffic laws.

Ms Jansen reported that the number of traffic fines increased by almost 60% in 2022. Furthermore, there are several legal cases against Ikhishi Likagogo emanating from incidents in which the other parties' drivers were seriously injured. Mr Law indicated that Ikhishi Likagogo would not win these cases, as witnesses identified that the drivers of Ikhishi Likagogo were negligent.

Mr Azharuddin claims that Ikhishi Likagogo overworks its drivers, and that five drivers are surely not enough to cover the deliveries. He states that all five drivers are union members, and that they have informed him that they are under enormous pressure to deliver orders on time. He expresses his disappointment that no driver received a performance bonus due to the 95% on-time delivery target not being met. He claims that this target was not met due to circumstances outside the drivers' control.

Mrs Duma requests that you and Ms Jansen analyse the data regarding vehicle deliveries in the attached Excel workbook. You can download the spreadsheet by clicking here.

Ms Jansen has cleaned the spreadsheet and asks that you assist with the following:

I. In the SALES worksheet, insert a column to the right of Delivery distance. Call this column Total_distance, where Total_distance = Delivery distance x 2. Remember, Delivery distance reflects only the one-way trip and disregards the return trip; Total_distance therefore covers the full round trip of delivering an order.

II. In the Sum_Stats worksheet, perform summary statistics on the Total_distance column (all the statistics in table 5.2 should be computed). Report on your findings in the Key findings textbox. Your findings should add value to the meeting described above.
III. Complete the report in the Driver_report worksheet.
IV. Copy the SALES worksheet to the Data_reduction worksheet.
V. In the Data_reduction worksheet, perform data reduction to provide further insights on
what happened in the case of the on-time delivery percentage not being met.
VI. Do you agree with the statements Mr Azharuddin made in the meeting? State your conclusion in the Conclusion worksheet and motivate your answer.
VII. Provide two possible solutions for limiting driver accidents. Write your answer in the Possible_Solutions worksheet.

Upload the updated spreadsheet by clicking here.

End of activity

5.7.2 Diagnostic analytics

Diagnostic analytics consists of four approaches: profiling, clustering, similarity matching and co-occurrence grouping. All four are studied in detail in the domain of data science, so you will not be assessed on them. We discuss them very briefly below to provide you with some background, should you encounter them in practice.

5.7.2.1 Profiling (not assessable in AIN3701)

Profiling identifies the typical behaviour of a selected group by compiling summary statistics about its data and comparing them to the summary statistics of the population. By understanding typical behaviour, we can identify abnormal behaviour more easily. In accounting, profiling might be used to identify transactions that warrant additional investigation, for example outlier travel expenses or fraud (Richardson et al, 2023:117).

5.7.2.2 Clustering (not assessable in AIN3701)

Clustering helps identify groups (clusters) of individuals (such as customers) who share common underlying characteristics. With clustering, we want to identify similar data elements and the underlying drivers of the identified groups (Richardson et al, 2023:117). An example of clustering is to identify the most efficient production lines and to examine what characteristics they share; we can then compare these with other production lines to see whether some of the identified "best practices" can be implemented elsewhere.

5.7.2.3 Similarity matching (not assessable in AIN3701)

Similarity matching is a grouping technique used to identify similar individuals based on data known about them. For example, companies identify seller and customer fraud by comparing various characteristics known about each seller and customer with known fraud cases experienced in the past (Richardson et al, 2023:117).

5.7.2.4 Co-occurrence grouping (not assessable in AIN3701)

Co-occurrence grouping discovers associations between individuals based on common events, such
as transactions they are involved in (Richardson et al, 2023:117). Takealot might use this to sell
another item to you by knowing what items are frequently bought together.

5.7.3 Predictive analytics

Predictive analytics consists of three approaches. They are regression, classification, and link
prediction. Regression will be assessed in this module. Classification and link prediction are studied
in detail in the domain of data science. Therefore, you will not be assessed on these two approaches.

5.7.3.1 Regression

Correlation, regression, scatter diagrams, the correlation coefficient and the coefficient of determination were discussed in detail in DSC1630.

Please revise this content before continuing with this section.

In DSC1630, you learned the theory of regression analysis and applied it using your calculator. In AIN3701, we build on this by showing you how to perform a regression analysis in Excel. We then expand on this by making use of multiple regression. Lastly, we show you how to interpret the statistics Excel produces when you use its regression tool.

In Excel, there are two ways to determine the regression formula and the coefficient of determination (R²).

The first is by means of a scatterplot chart. You learned how to create charts in AIN1501, but in the following video I will illustrate how to determine the regression formula and the R². The dataset used in the video can be downloaded by clicking here. Please watch the video by clicking here.
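
As a side note, Excel's worksheet functions can return the same quantities that the chart trendline displays, which is handy if you want the values in cells rather than on a chart. A minimal sketch, assuming the independent variable (x) is in A2:A11 and the dependent variable (y) in B2:B11 (hypothetical ranges):

=SLOPE(B2:B11,A2:A11)      the "a" in y = b + ax
=INTERCEPT(B2:B11,A2:A11)  the "b" in y = b + ax
=RSQ(B2:B11,A2:A11)        the coefficient of determination (R²)

Note that the y-range is given first in each of these functions.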

The second method is to use the regression function provided by the "Analysis ToolPak" add-in in Excel. Let me first show you how to enable the "Analysis ToolPak":

 Open Excel.
 Click on File.
 Select Options. Note: if Options is not available, click on More.
 Click on Add-Ins.
 At the bottom of the dialog box, you will see a combo (drop-down) box. Select Excel Add-ins and click on Go …
 Check the Analysis ToolPak and click OK.
 Select the Data tab.
 The following additional item ("Analysis") will appear on the ribbon:

Figure 5.22 Data ribbon displaying the Analysis tools from the Analysis Toolpak

Now you are ready to perform regression in Excel.

Please watch the video on performing regression by making use of the regression function from the
Analysis Toolpak by clicking here.

The output of the regression tool is displayed in figure 5.23 and will be discussed next.

SUMMARY OUTPUT

Regression Statistics
Multiple R           96.48%
R Square             93.08%
Adjusted R Square    92.22%
Standard Error       4317.984172
Observations         10

ANOVA
             df   SS            MS            F             Significance F
Regression   1    2006940102    2006940102    107.6396604   0.00
Residual     8    149159898.5   18644987.31
Total        9    2156100000

                     Coefficients   Standard Error   t Stat        P-value   Lower 95%     Upper 95%     Lower 95.0%   Upper 95.0%
Intercept            93,434.01      2708.312563      34.49897602   0.00      87188.63018   99679.39012   87188.63018   99679.39012
Year of experience   3,568.53       343.9560394      10.37495351   0.00      2775.36387    4361.691968   2775.36387    4361.691968

Figure 5.23 Regression output of the Years’ experience dataset

I have highlighted the most important statistical outputs from the dataset, and you should be able to
interpret them in AIN3701. I will discuss these outputs below:

R Square & Adjusted R square

For R Square, refer to your DSC1630 material.

The adjusted R Square measures the extent to which the variance of the dependent variable can be explained by the independent variables. What distinguishes the adjusted R Square is that it does not credit every independent variable in the model, but only those that actually help explain the variation in the dependent variable. By looking at the adjusted R Square value, one can judge whether the regression equation fits the data well. The higher the adjusted R Square, the better the regression equation, as it implies that the independent variables chosen can explain the variation in the dependent variable (Srivastav, 2022).
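
For reference, the adjusted R Square can be computed from the R Square using the standard statistical formula (not reproduced in the prescribed textbook extract):

Adjusted R² = 1 - (1 - R²) x (n - 1) / (n - k - 1)

where n is the number of observations and k is the number of independent variables. Applied to figure 5.23, with R² = 0.9308, n = 10 and k = 1: 1 - (1 - 0.9308) x 9 / 8 ≈ 0.9222, which agrees with the Adjusted R Square of 92.22% in the output.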

Significance F & the P-value

To understand the “Significance F” value, one first needs to understand what a P-value is. To assist
you in understanding a P-value, read the article by Seth (2020) by clicking here.

The Significance F is a very important statistic. This is the p-value for the F-test of overall
significance. This test determines whether your regression model with all its independent variables
does a better job explaining the dependent variable’s variability than a model with no independent
variables. If this test result is statistically significant, it suggests you have a good model (Frost, 2020).

But how do you interpret the P-value?

Keller (2012:365) provides some guidance on how to interpret P-values:

If the P-value is less than 0.01, there is overwhelming evidence to infer that the alternative hypothesis is true. The result is deemed to be highly significant.

If the P-value lies between 0.01 and 0.05, there is strong evidence to infer that the alternative hypothesis is true. The result is deemed to be significant.

If the P-value lies between 0.05 and 0.10, there is weak evidence that the alternative hypothesis is true. A result with a P-value greater than 0.05 is not statistically significant.

If the P-value exceeds 0.10, there is little to no evidence to infer that the alternative hypothesis is true.
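
If you apply these bands often, you could let Excel do the translation for you. A minimal sketch (my own labelling of Keller's bands, not a formula from the textbook), assuming the P-value sits in cell B5 and your Excel version supports IFS (Excel 2019 and later):

=IFS(B5<0.01,"highly significant", B5<0.05,"significant", B5<0.1,"weak evidence", TRUE,"not significant")

Because IFS evaluates its conditions in order, a P-value of 0.03, for example, fails the first test and passes the second, returning "significant".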

Let’s apply this to figure 5.24:

We first look at the “Significance F”:

41 | P a g e
The value is 0.00. This indicates that this is a good model. The model as a whole is statistically
deemed to be highly significant.

Now the P-value for “Year of Experience”:

The value is 0.00. This indicates that the variable “Year of Experience” is deemed to be highly
significant in predicting the “Annual Sales”.

Coefficients

The Intercept indicates what the value of the dependent variable (y) will be if the independent variable (x) is zero. In the regression equation y = b + ax, it represents the "b" value.

"Year of experience" is the independent variable. Its coefficient indicates by how much the dependent variable (y) will change for every one-unit change in the independent variable (in this case, "Year of experience"). This represents the "a" value in the equation y = b + ax.

Therefore, we can derive the regression equation as: y = 93 434 + 3 569x

We can make predictions using the above equation. Let’s say we want to predict the annual sales of
a salesperson with 15 years’ experience, then:

y = 93 434 + (3 569 x 15)

y = 146 969

Therefore, the predicted annual sales for a salesperson with 15 years' experience is R146 969. Note that the R Square of 93% describes how much of the variation in annual sales the model explains; it is not the exact accuracy of an individual prediction.
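
Instead of typing the rounded coefficients into the equation, you could also let Excel make the prediction directly from the raw data. A minimal sketch, assuming years of experience in A2:A11 and annual sales in B2:B11 (hypothetical ranges matching the dataset in the video):

=FORECAST.LINEAR(15, B2:B11, A2:A11)
=TREND(B2:B11, A2:A11, 15)

Both functions fit the same least-squares line and return the predicted annual sales for 15 years' experience. Expect a value of roughly R146 962, since they use the unrounded coefficients rather than the rounded 93 434 and 3 569 above.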

Multiple regression

At this stage of your studies, you have worked with one dependent and one independent variable when dealing with regression analysis. In practice, there could be more independent variables. Our equation then changes to:

y = b + a1x1 + a2x2 + a3x3 + … + anxn

Doing this manually would be an almost impossible task, as you would need more than two axes, one for each variable. Luckily, Excel can assist us in solving multiple regression problems.

Watch the following video by SoftTech Tutorials (2021), which illustrates how to do and interpret
multiple regression in Excel, by clicking here.
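
Besides the Analysis ToolPak's regression tool, the LINEST worksheet function also handles multiple regression. A minimal sketch, assuming the figure 5.25 data sits with Output in B2:B13, Nr employees in C2:C13, Direct labour hours in D2:D13 and Total cost in E2:E13 (an assumed layout):

=LINEST(E2:E13, B2:D13, TRUE, TRUE)

The adjacent columns B2:D13 supply all three independent variables at once. LINEST returns the coefficients in reverse order (from the last independent variable back to the intercept), plus regression statistics such as R² when the fourth argument is TRUE. In Microsoft 365 the result spills automatically; in older Excel versions it must be entered as an array formula with Ctrl+Shift+Enter.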

Activity 5.7

You were invited to the budget meeting. Mrs Duma (CEO), Ms Jansen (CFO), the product managers,
the production foremen of each production line and the maintenance manager, Mr Mashaba, were
all present at the meeting.

All the product managers presented their individual budgets for the year, but for every single one Mr Mashaba complained that the maintenance fee had not been worked out correctly. He explained that all the product divisions used a flat rate of R130 per maintenance service per machine, but that in effect services are more expensive. Maintenance on the ovens (which make up most of the production machines) is very labour intensive, and the cost also depends on how many service staff are available to perform maintenance on the production machines. He therefore argued that simply charging a flat rate when maintenance is performed on a machine does not reflect reality.

He presented the following sheet, which reflects the output (number of machines serviced), number
of employees, direct labour hours and the total maintenance cost (figure 5.25).

Month | Output | Nr employees | Direct labour hours | Total cost
January | 300 | 28 | 4,480 | 50,933
February | 320 | 30 | 4,700 | 53,973
March | 350 | 30 | 4,800 | 55,200
April | 500 | 32 | 5,120 | 68,000
May | 530 | 32 | 5,150 | 70,640
June | 640 | 35 | 5,700 | 80,507
July | 790 | 41 | 7,210 | 93,920
August | 250 | 41 | 4,100 | 43,627
September | 820 | 41 | 7,300 | 101,067
October | 780 | 39 | 7,200 | 95,893
November | 750 | 38 | 6,400 | 91,173
December | 280 | 33 | 3,960 | 44,667

Figure 5.25 Maintenance manager’s presentation

He assured the meeting that his data were correct and that there were no outliers in the figures presented in figure 5.25. You can download the Excel sheet presented in figure 5.25 by clicking here.

Ms Jansen told the meeting that the maintenance cost matter would be investigated. After this, the meeting was adjourned.

Ms Jansen asks you to assist her with the following:

I. Perform a multiple regression analysis, with the following:
a. Output, Nr employees and Direct labour hours as the independent variables and Total cost as the dependent variable. Copy and paste the Excel output into your report and explain the statistics generated by Excel.
b. If necessary, repeat the regression analysis until the statistical measures are satisfactory. Each time, indicate whether the regression analysis was reperformed, which variables were used, and interpret the statistics generated by Excel.
c. Provide a formula that can calculate maintenance cost accurately.
II. Provide an example of how the formula obtained in the regression analysis can be used to
calculate the maintenance cost if:
a. Output = 380, Nr employees = 42, Direct labour hours = 6 720
III. Provide an opinion on whether you agree with Mr Mashaba's statement that maintenance cost is labour-hour intensive or dependent on the number of maintenance staff. Substantiate your answer.

Prepare your answer in PDF format. The easiest approach is to use Word and then convert your Word document to PDF.

Upload your calculations and feedback by clicking here.

End of activity

5.7.3.2 Classification (not assessable in AIN3701)

Classification predicts a class or a category for a new observation based on the manual identification
of classes from previous observations. Some examples of classification include predicting which
loans are likely to default, credit applications that are expected to be approved, the classification of
an operating or financing lease, and the identification of suspicious transactions. In each of these
cases, prior data must be manually identified as belonging to each class to build a predictive model
(Richardson et al, 2023:117).

5.7.3.3 Link prediction (not assessable in AIN3701)

Link prediction predicts a relationship between two data items, such as members of a social media platform. For example, if two individuals have mutual friends on social media and both attended the same university, it is likely that they know each other, and the site may recommend that they connect. In an accounting setting, link prediction might use social media to look for undisclosed relationships between related parties, in order to identify related-party transactions (Richardson et al, 2023:117).

5.7.4 Prescriptive analytics

Prescriptive analytics consists out of two approaches. They are decision support systems, and
machine learning and artificial intelligence. Both of these approaches are studied in detail in the
domain of data science. Therefore, you will not be assessed on these two approaches. We will
discuss the two approaches very briefly to provide you with some background should you encounter
them in practice.

5.7.4.1 Decision support systems (not assessable in AIN3701)

Decision support systems are rule-based systems that gather data and recommend actions based on
the input. Tax preparation software, investment advice tools, and auditing tools recommend courses
of action based on data that are input as part of an interview or interrogation process (Richardson et
al, 2023:117).

5.7.4.2 Machine learning and artificial intelligence (not assessable in AIN3701)

Machine learning and artificial intelligence are learning models or intelligent agents that adapt to
new external data to recommend a course of action. For example, an artificial intelligence model
may observe opinions given by an audit partner and adjust the model to reflect a changing level of
risk appetite and regulation (Richardson et al, 2023:117).

5.8 STEP 4: ADDRESS AND REFINE RESULTS (A)

After the data have been analysed (in step 3), the fourth step is to address and refine the results. Data analysis is iterative: we slice, dice and manipulate the data, find correlations, test hypotheses, ask ourselves further and better questions, ask colleagues what they think, and revise and rerun the analysis, potentially multiple times. Once that is complete, we have results ready to communicate to interested stakeholders, which hopefully address their questions (Richardson et al, 2023:13).

5.9 STEP 5: COMMUNICATE INSIGHTS (C)

Data are important and data analytics is effective, BUT both are only as valuable as our ability to communicate them and make them understandable (Richardson et al, 2023:183).

The goal of step 5 is to share critical information with stakeholders in a clear and concise manner.
This could be done by using a written report, a chart, a callout box, or a few key statistics. It is also important to consider the needs of the audience of your data analysis, which can influence the
choice of communication method (Richardson et al, 2023:183).

These days, data visualisation is very important for communicating results. Data visualisation is the presentation of data in a graphical format (Sharma, 2022). The choice of chart can play a big role in getting your message across. The type of data can influence the type of chart that is used. Data are categorised into two main groups, namely qualitative and quantitative data.

Qualitative data are data that can be classified or categorised, but cannot be measured (Sharma,
2022:18). Examples of qualitative data are colours, satisfaction, and rankings. Qualitative data can be
further divided into two subcategories, namely nominal and ordinal data (Richardson et al,
2023:187). Nominal data are qualitative data that cannot be put in a meaningful order (in other
words, they cannot be ranked) (Sharma, 2022:18). An example of nominal data is hair colours.
Ordinal data are qualitative data that can be put into a meaningful order (the data can be ranked),
for example, very satisfied, satisfied, unsatisfied and very unsatisfied (Sharma, 2022:18).

Quantitative data (also known as interval/ratio data) are data that can be measured, for example,
customers’ sales, weight, gross profit margin, to name a few (Sharma, 2022:18). Quantitative data
can be categorised into two different types, namely interval and ratio (Richardson et al, 2023:187).
The difference between interval and ratio data is that ratio data have a meaningful zero and interval
data do not (Richardson et al, 2023:187). To clarify, for ratio data, the zero point reflects an absolute
zero, which means that there is “nothing” of the specific variable (Jansen, 2020). With interval data,
zero is just another number (Richardson et al, 2023:187).

Quantitative data can also be classified as either discrete or continuous. Discrete data are quantitative data with a finite number of values (Sharma, 2022:18), for example, the number of pets: you can have three pets, but not 2.25 pets. Continuous data are quantitative data that can take on any numerical value within a range or interval (Sharma, 2022:18), for example 3.14159265 (Richardson et al, 2023:187).

The last data type we need to look at is dichotomous data. A dichotomous data value can have only two values (Dodge, 2008), for example "yes" or "no". The dichotomous data type is a qualitative data type, but Dodge (2008) reminds us that quantitative variables can be dichotomised, for example by reducing revenue to "low revenue" and "high revenue".

Now you might ask me: what does this have to do with the type of chart to be chosen?

Richardson et al (2023:195) provide the following assistance by indicating what chart type will be
suitable to use for each data type. This guidance is displayed in figure 5.26.

Figure 5.26 Summary of chart types (Richardson et al, 2023:195)

If you look at figure 5.26, you will see that the starting point is to identify whether the data you want to put into your chart are qualitative or quantitative. Next, you look at the purpose of your chart: in the case of qualitative data, is it for comparisons, geographic data, or text data? You have already covered how to generate charts in Excel in AIN1501, so you will need to revise that material.

If you want to explore other methods to choose charts and need assistance with charts in data
visualisations, I recommend that you work through the article by Dzuranin (2022). You can download
this article here. This article also covers a technique to present your data, called storytelling.
Storytelling is a very interesting field in data analytics and if you want to know more about it, you
can also read through the article by Dzuranin (2022). Storytelling will not be assessed in AIN3701.

Data visualisation using dashboards will be covered in detail in lesson 6. Therefore, you will be
assessed in lesson 6 on using the correct charts, applying best practices, etc.

5.10 STEP 6: TRACK OUTCOMES (T)

In doing your data analysis, you will hopefully uncover insights and areas that can be improved. To show improvement or progress, these insights or improvement areas need to be managed and monitored. In practice, digital dashboards and data visualisations are useful for tracking outcomes. Dashboards are discussed in detail in lesson 6.

5.11 CLOSING REMARKS

Well done on completing this unit! You have covered a lot, and I hope that you have learned a great deal about data analysis.

Data are often referred to as the oil of the modern age, and the expectation is that the importance of data will only increase. Therefore, as an accounting professional, it is very likely that you will encounter data analytics in practice.

We have covered quite a few principles of data analytics (which are sufficient for AIN3701), but there is much more to explore. If you want to explore data analytics further, I recommend you start with the book by Richardson et al, Data Analytics for Accounting (third edition, 2023). Data analytics is evolving every day with the introduction and advancement of machine learning models, so it is important to stay up to date in this subject area.

For me, it is important that you realise that there is a structured process to be followed when tackling business problems that require data analytics to resolve. Jumping into the middle of the process will lead either to wasted time or to an incorrect answer. Therefore, follow the process as illustrated in this lesson or as prescribed in the workplace.

In the next lesson, we are going to focus on data visualisation, through dashboards. This is a very
exciting topic, and we look forward to teaching you the skills to create an efficient and effective
dashboard.

See you in the next lesson.

5.12 REFERENCES

AmandaLovestoAudit. 2020. IT controls - General vs Application Controls. Available at:
https://www.youtube.com/watch?v=bafb1IyUKUU

Bansal, S. No date. How to Find Outliers in Excel (and how to handle these). Available at:
https://trumpexcel.com/find-outliers-excel/

Bekker, A. 2017. Big Data: Examples, Sources and Technologies explained. Available at:
https://www.scnsoft.com/blog/what-is-big-data

Chauhan, R. 2022. How to Import Data Into Your Excel Spreadsheets the Neat & Easy Way. Available at:
https://www.makeuseof.com/tag/import-data-excel-spreadsheets-neat-easy-way/

Cote, C. 2021. 4 Types of data analytics to improve decision making. Available at:
https://online.hbs.edu/blog/post/types-of-data-analysis

Drury, C. 2022. Management and Cost Accounting. 11th edition. Cengage.

Dzuranin, AC. 2022. Explanatory Data Visualizations. Available at:
https://sfmagazine.com/post-entry/january-2022-explanatory-data-visualizations/

Dzuranin, AC, Geerts, GL & Lenk, MM. 2023. Data and Analytics in Accounting. An integrated approach. 1st edition. John Wiley & Sons, Inc.

Frost, J. 2020. How to Perform Regression Analysis using Excel. Available at:
https://statisticsbyjim.com/regression/regression-analysis-excel/

Galiana, D. 2019. The 5W1H Method: Project Management defined and applied. Available at:
https://www.wimi-teamwork.com/blog/the-5w1h-method-project-management-defined-and-applied/

Gary Explains. 2019. A Quick Introduction to SQL. Available at:
https://www.youtube.com/watch?v=nBmAnVzQv3o

Jansen, D. 2020. Nominal, Ordinal, Interval & Ratio Data. Available at:
https://gradcoach.com/nominal-ordinal-interval-ratio/

Keller, G. 2012. Managerial statistics. International edition. 9th edition. South-Western Cengage Learning.

Lewis, S. 2020. Fishbone diagram (Ishikawa cause and effect). Available at:
https://www.techtarget.com/whatis/definition/fishbone-diagram

Maladkar, K. 2018. 5 Ways To Handle Missing Values In Machine Learning Datasets. Available at:
https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/

Marr, B. 2015. Big Data: 20 Mind-Boggling Facts Everyone Must Read. Available at:
https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/?sh=5c7f7e9117b1

Microsoft. 2022. Create a box and whisker chart. Available at:
https://support.microsoft.com/en-us/office/create-a-box-and-whisker-chart-62f4219f-db4b-4754-aca8-4743f6190f0d

Peterson, R. 2022. Entity Relationship (ER) Diagram Model with DBMS Example. Available at:
https://www.guru99.com/er-diagram-tutorial-dbms.html

Peterson, R. 2022. What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with Example. Available at:
https://www.guru99.com/database-normalization.html

Reinsel, D, Gantz, J & Rydning, J. 2018. The Digitization of the World from Edge to Core. Data Age 2025. Available at:
https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

Richard, G, Roets, C, Adams, A & West, S. 2021. Auditing Notes for South African Students. 12th edition. LexisNexis.

Richardson, VJ, Teeter, RA & Terrell, KL. 2023. Data Analytics for Accounting. 3rd edition. McGraw-Hill.

Seth, T. 2020. How to understand p-value in layman terms? Available at:
https://towardsdatascience.com/how-to-understand-p-value-in-layman-terms-80a5cc206ec2

Sharma, H. 2022. Learn to Select Best Excel Charts for Data Analysis and Reporting. Optimize Smart.

SoftTech Tutorials. No date. Excel Multiple Regression. Available at:
https://softtechtutorials.com/microsoft-office/excel/excel-multiple-regression/

Srivastav, AK. 2022. Adjusted R Squared. Available at:
https://www.wallstreetmojo.com/adjusted-r-squared/

Stevens, E. 2022. What is Data Analytics? A Complete Guide for Beginners. Available at:
https://careerfoundry.com/en/blog/data-analytics/what-is-data-analytics/

ThinkReliability. 2022. The basics of root cause analysis. Available at:
https://www.thinkreliability.com/cause-mapping/what-is-root-cause-analysis/