
CHAPTER 6

Transforming Data
LEARNING OBJECTIVES

After studying this chapter, you should be able to:


1. Describe the principles of data structuring related to aggregating data,
data joining, and data pivoting.
2. Describe data parsing, data concatenation, cryptic data values, misfielded
data values, data formatting, and data consistency and how they relate to
data standardization.
3. Describe how to diagnose and fix the data cleaning errors related to data
duplication, data filtering, data contradiction errors, data threshold viola-
tions, violated attribute dependencies, and data entry errors.
4. List and describe four different techniques to perform data validation.

INTEGRATIVE CASE: S&S

Ashton successfully helped the organization transform its culture and adopt an analytics
mindset for dealing with data. The company successfully built a data structure that consists
of a data lake, a data warehouse, and data marts. With the new structure, employees are able
to fulfill many of their data needs, allowing data analysts more time to analyze data and not
just locate and clean data for others.
Although data handling at S&S has improved, Ashton still receives requests for extract-
ing, transforming, and loading data into other applications that either have not been added
to the current data architecture or require other special processing. Such is the current case
with which Ashton is dealing.
S&S purchased another company that used a different enterprise resource planning
(ERP) system. The two systems don’t work well together—storing data in different formats,
with different field names, and a host of other challenges. Although S&S plans to better in-
tegrate these two systems, until then Ashton has become frustrated trying to make business
decisions because he cannot seem to get a clear picture of what is happening due to incom-
patible data. Ashton has been trying to figure out how to merge and clean the data to be able
to improve S&S’s decision making.

To figure this problem out, Ashton decided to study a subset of S&S’s data that contains
information about its vendors. He had an IT specialist export data from the two systems into
a combined flat file for his review. He hopes to be able to understand all of the differences
that exist in the data so he can recommend ways to clean the data going forward.
The data extract Ashton is analyzing is shown in Figure 6-1, and an extract of the data
dictionary describing the merged data is shown in Figure 6-2. This data dictionary provides
abbreviated metadata of the data extract. Figure 6-3 contains additional information about
Product ID, one of the fields in Figure 6-1. This chapter will use this data to teach principles
related to transforming data.

Introduction
IBM estimates that poor quality data costs the United States economy $3.1 trillion per year and
that 1 in 3 business decision makers don’t trust the information they use to make decisions. A
worldwide survey conducted by Experian shows that business executives believe that 26% of
their total data may be inaccurate, and it is getting worse over time. While some of the errors
are complex, many are simple, including incomplete data, outdated data, duplicate informa-
tion, or simple errors in the data (e.g., spelling mistakes, typos, etc.).
So what is to be done? The ideal is to capture and store clean data. The internal controls
presented in Chapters 10–13 address methods of capturing accurate data. Even with strong
internal controls, data can still become “dirty” or need to be transformed for alternate uses.
Data can become dirty by such things as changing events in the world (e.g., a customer changes
a phone number), merging various data sources that store the same type of data differently, or
making mistakes in dealing with data. When this happens, the data needs to be transformed in

FIGURE 6-1
Extract of Vendor Data


FIGURE 6-2 Abbreviated Data Dictionary for Vendor Data Extract

Data Element Name | Description | Field Length | Field Type
VendorName | Vendor complete company name | 50 | Alphanumeric
FName | Vendor customer representative first name | 50 | Alphanumeric
LName | Vendor customer representative last name | 50 | Alphanumeric
PhoneNumber | Saved as 10-digit number, formatted as (###) ###-#### | 10 | Numeric (integer)
Address | Vendor street address | 100 | Alphanumeric
City | Vendor city | 50 | Alphanumeric
State | Vendor state in 2-digit abbreviated form | 2 | Alphanumeric
ZipCode | Vendor 5-digit zip code | 5 | Numeric (integer)
EarlyPayDiscount | Terms for early payment discount | 25 | Alphanumeric
ProdCat | Unique identifier for each product category: 1=kitchen appliances; 2=fans; 3=hand tools | 6 | Numeric (integer)
ProdID | Unique identifier for each product in a product category. Figure 6-3 contains full categorization | 6 | Numeric (integer)
UnitsPurch | Total number of units purchased | 15 | Numeric (integer)
TotalCosts | Total costs listed in USD | 15 | Numeric (double)
AvgCostPerUnit | TotalCosts divided by UnitsPurch listed in USD | 15 | Numeric (double)

a way that restores its quality. This chapter presents a four-step process for transforming data
that will maintain or improve data quality: (1) structure the data, (2) standardize the data, (3)
clean the data, and (4) validate the data.
The data transformation process spans the life of data. That is, while data transformation
is specifically a step of the extract, transform, and load (ETL) process, it also occurs when data
is gathered, stored, and analyzed. Understanding the transformation process is valuable for all
data stages. Before discussing the steps of the transformation process, we review the attributes
of high-quality data. These attributes form the objectives of the data transformation process.

ATTRIBUTES OF HIGH-QUALITY DATA


Chapter 1 discusses how information must have multiple attributes—such as being accurate,
available, concise, relevant, etc.—to be valuable for decision making. To create information
that has these attributes, the underlying data must also have certain attributes. While data sci-
entists debate about the various attributes of high-quality data, there is general agreement that
at a minimum, data must be accurate, complete, consistent, timely, and valid in order to create
useful information. Table 6-1 repeats the definitions for accurate, complete, consistent, and
timely from Chapter 1 and adds valid to the list. The table also gives an example of a violation
of each attribute.
These attributes are critical to business success, as the following historical example dem-
onstrates: NASA lost a $125 million Mars orbiter because of inconsistent data. It took NASA
10 months to fly the orbiter to Mars. Once the orbiter approached Mars, it was designed to
orbit the planet and make scientific measurements. The orbiter was unable to enter its designed
orbiting pattern. The problem? Two different engineering teams working on the project had inappropriately designed systems to control the orbiter. One team used the imperial measurement system (i.e., feet, miles), while the other team used the metric system (e.g., meters, kilometers). This inconsistency in the data resulted in a software failure and the loss of a very expensive Mars orbiter.

FIGURE 6-3 Product ID (ProdID) Table

ProdCat | ProdID | ProdDesc
1 | 1 | Blender
1 | 2 | Microwave
1 | 3 | Toaster
2 | 1 | Box Fan
2 | 2 | Desk Fan
3 | 1 | Hand Drill
3 | 2 | Jigsaw

TABLE 6-1 Attributes of High-Quality Data

Attribute | Definition | Example of Violation of Attribute
Accurate | Correct; free of error; accurately represents events and activities | A sale occurred on December 27 but is recorded as occurring the following year on January 4.
Complete | Does not omit aspects of events or activities; of enough breadth and depth | An annual evaluation of vendor performance only contains 7 months of data.
Consistent | Presented in same format over time | A company switches the denomination of amounts (e.g., thousands, millions, etc.) irregularly.
Timely | Provided in time for decision makers to make decisions | Customer purchasing metrics are 2 years old.
Valid | Data measures what it is intended to measure; conforms to syntax rules and to requirements | There are only 7 unique job positions at a company but 9 different positions are attributed to employees—2 answers are not valid.
In every interaction with data, it is critical to make sure the data has the attributes of high-
quality data. If any of these attributes are missing, the data will need to be remedied. While not
all data problems can be remedied by data transformation, the four-step process introduced in
this chapter will help improve data quality.

Data Structuring
Data structure describes the way data are stored, including the relationships between different fields. Data structuring is the process of changing the organization and relationships among data fields to prepare the data for analysis. Extracted data often needs to be structured in a manner that will enable analysis. This can entail aggregating the data at different levels of detail, joining different data together, and/or pivoting the data.

AGGREGATE DATA
Aggregate data is the presentation of data in a summarized form. Aggregate data has fewer details than disaggregated data. As an example, consider the typical sales process and reporting of data. When a business makes a sale, it records a credit to sales revenue in a database. This
is fully disaggregated sales data. In aggregation, the company might sum sales data around a
corresponding attribute. For example, summing sales by sales manager removes the detail of
each sale but allows the business to evaluate manager performance. Similarly, a company may
fully aggregate sales data by summing all sales to create the sales revenue line item on the
income statement.
When transforming data, it is critical to understand the level of aggregation for each data
source because combining data aggregated at different levels can create problems. For ex-
ample, S&S could analyze the vendor purchases shown in Figure 6-1 by aggregating, or sum-
ming, all individual purchases to produce the total units purchased from each vendor and the
total expenditures for each vendor. Figure 6-4 shows this higher level aggregation as well as
aggregating across all vendors.
FIGURE 6-4 Examples of Different Levels of Aggregating Data

When data is aggregated, information is lost. That is, with the data aggregated in Figure 6-4 at the individual vendor level, it is no longer possible to know to which products the UnitsPurch and TotalCosts fields relate. The only way to analyze data at the product level would be to go back to the original disaggregated data because it cannot be extracted from the aggregated data. Best practice for storing data is to store the information in as disaggregated form as possible and then aggregate the data through querying for other uses.
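In practice, this kind of aggregation is usually scripted. The short Python (pandas) sketch below shows one way it might be done; the DataFrame name vendors and the toy rows are illustrative stand-ins for the Figure 6-1 extract, while the field names follow the data dictionary in Figure 6-2.

```python
import pandas as pd

# Toy rows standing in for the Figure 6-1 extract (values are illustrative).
vendors = pd.DataFrame({
    "VendorName": ["Calphalon", "KitchenAid", "KitchenAid", "Honeywell"],
    "ProdCat":    [1, 1, 1, 2],
    "ProdID":     [3, 1, 2, 1],
    "UnitsPurch": [250, 400, 150, 120],
    "TotalCosts": [19509.75, 30000.00, 13282.53, 5516.90],
})

# Aggregate to the vendor level: sum units and costs for each vendor.
by_vendor = (vendors.groupby("VendorName", as_index=False)
                    [["UnitsPurch", "TotalCosts"]].sum())

# Fully aggregate across all vendors (one grand total).
grand_total = vendors[["UnitsPurch", "TotalCosts"]].sum()

print(by_vendor)
print(grand_total)
```

Because the grouped result no longer carries ProdCat or ProdID, any product-level analysis would have to return to the disaggregated source, which is exactly the point made above.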

DATA JOINING
As discussed in Chapter 4, a key part in querying databases is joining information from differ-
ent tables into a single table so that the joined data can be analyzed.1 Frequently in the ETL
process, data is queried from the database into a single flat file, as seen in the S&S dataset. The
table created in the S&S example required joining data that was likely contained in different da-
tabase tables such as the vendor table (e.g., all fields from VendorName to EarlyPayDiscount),
from the product table (e.g., fields ProdCat and ProdID), and from a transaction table that
recorded purchases (e.g., fields UnitsPurch and TotalCosts).
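A minimal sketch of such a join is shown below, assuming pandas is used. The table layouts, the key field VendorID, and the sample values are assumptions made for illustration; the chapter only names the source tables loosely.

```python
import pandas as pd

# Simplified stand-ins for the three source tables described in the text.
# The VendorID key and the sample values are assumptions for illustration.
vendor_tbl = pd.DataFrame({
    "VendorID": [101, 102],
    "VendorName": ["Calphalon", "Honeywell"],
    "EarlyPayDiscount": ["2/10 N30", "1/10 N30"],
})
product_tbl = pd.DataFrame({
    "ProdCat": [1, 2],
    "ProdID": [3, 1],
    "ProdDesc": ["Toaster", "Box Fan"],
})
purchase_tbl = pd.DataFrame({
    "VendorID": [101, 102],
    "ProdCat": [1, 2],
    "ProdID": [3, 1],
    "UnitsPurch": [250, 120],
    "TotalCosts": [19509.75, 5516.90],
})

# Join purchases to vendors, then to products, producing one flat file.
flat = (purchase_tbl
        .merge(vendor_tbl, on="VendorID", how="left")
        .merge(product_tbl, on=["ProdCat", "ProdID"], how="left"))
print(flat)
```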

DATA PIVOTING
Data pivoting is rotating data from rows to columns. Some software programs are designed to use pivoted data rather than unpivoted data. Understanding the design of the program into which the data will be loaded will help you determine if data needs to be pivoted or not.
The S&S data in Figure 6-1 could be pivoted in a variety of ways to make it easier to in-
terpret the data. Figure 6-5 shows a pivot of the data so that the vendor names are listed in each
row and the columns show the different product categories. Note that when data is pivoted, the
data often will be aggregated in some way. For example, in the pivoted data the total product
costs are summed for each vendor and product category.
FIGURE 6-5 Pivoting S&S Data

VendorName | ProdCat 1 | ProdCat 2 | ProdCat 3
B&D | $15,982.00 | $2,529.00 |
Black and Decker | $2,220.06 | $568.00 | $13,024.57
Calphalon | $19,509.75 | |
Honeywell | | $5,516.90 |
KitchenAid | $43,282.53 | |
Oster | $28,020.11 | |
Panasonic | $15,765.12 | $5,693.50 |

1. Data joining is also referred to as data merging and data aggregation. The last term, data aggregation, can be confusing, given the previous discussion of aggregate data. Aggregating data is the process of summarizing data at different levels of detail, while data aggregation is the process of combining different data sources. To avoid confusion, we will use the term data joining to refer to data aggregation throughout the text.

Figure 6-6 shows the data in Figure 6-5 pivoted back into a different form. The Figure 6-6 form is much closer to the original but aggregated at a higher level. Since the information about individual products was lost in pivoting the data to Figure 6-5, individual products cannot be displayed in the pivot from Figure 6-5 to Figure 6-6. This is another example of how aggregating data results in a loss of detail that cannot be recaptured unless the original data is analyzed.

FIGURE 6-6 Pivoting Figure 6-5 Data

VendorName | ProdCat | TotalCosts
B&D | 1 | $15,982.00
B&D | 2 | $2,529.00
Black and Decker | 1 | $2,220.06
Black and Decker | 2 | $568.00
Black and Decker | 3 | $13,024.57
Calphalon | 1 | $19,509.75
Honeywell | 2 | $5,516.90
KitchenAid | 1 | $43,282.53
Oster | 1 | $28,020.11
Panasonic | 1 | $15,765.12
Panasonic | 2 | $5,693.50
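The pivot from Figure 6-6 to Figure 6-5, and the unpivot back, can be sketched as follows in pandas. The values are taken from Figure 6-6; the variable names are illustrative.

```python
import pandas as pd

# The Figure 6-6 data (vendor, product category, total costs).
fig6_6 = pd.DataFrame({
    "VendorName": ["B&D", "B&D", "Black and Decker", "Black and Decker",
                   "Black and Decker", "Calphalon", "Honeywell",
                   "KitchenAid", "Oster", "Panasonic", "Panasonic"],
    "ProdCat": [1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2],
    "TotalCosts": [15982.00, 2529.00, 2220.06, 568.00, 13024.57,
                   19509.75, 5516.90, 43282.53, 28020.11, 15765.12, 5693.50],
})

# Pivot rows into columns (one column per product category), as in Figure 6-5.
# pivot_table aggregates with the supplied function while it pivots.
fig6_5 = fig6_6.pivot_table(index="VendorName", columns="ProdCat",
                            values="TotalCosts", aggfunc="sum")

# Unpivot (melt) back to the long form of Figure 6-6.
back = (fig6_5.reset_index()
              .melt(id_vars="VendorName", var_name="ProdCat",
                    value_name="TotalCosts")
              .dropna())
print(fig6_5)
print(back)
```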

Data Standardization
Data standardization is the process of standardizing the structure and meaning of each data element so it can be analyzed and used in decision making. Data standardization is particularly important when merging data from several sources. Achieving standardized data may involve changing data to a common format, data type, or coding scheme. Data standardization also
encompasses ensuring the information is contained in the correct field and the fields are orga-
nized in a useful manner. When dealing with a database file or a flat file format, think of data
standardization as making sure the columns of information are correct. In doing so, several
important things should be considered, including data parsing and data concatenation, cryptic
data values, misfielded data values, and data formatting and data consistency. Focus 6-1 shows
the importance of data standardization in practice.

DATA PARSING AND DATA CONCATENATION


Separating data from a single field into multiple fields is called data parsing. For example, an employee code might be saved in a single field as “N-0504-2002”. As discussed in Chapter 2, this employee code is a combination of a mnemonic code, a sequence code, and a group code.
The first element is a mnemonic code which indicates that the employee works at the North
Office (the “N”). The second is a sequence code (employee number 0504). The last element
is a group code representing the year of hire (2002). Data parsing separates this information,
contained in a single field, into three different fields so that the different items can be sepa-
rately analyzed.
In the S&S data, the column labeled EarlyPayDiscount contains information about the
terms for early payment. For row 1, the entry “2/10 N30” signifies that S&S receives a 2%
discount if they pay in 10 days; otherwise, the full amount is due in 30 days. To be more use-
ful, the information should be parsed into three columns as shown in Figure 6-7. With the data
parsed into three columns, it is easier to use the data for decision making.
FOCUS 6-1 Data Standardization Success Story

In the mid-2000s, cellphones began to be successfully used for navigating not only streets but also public transportation. Prior to that time, you had to look up public transportation schedules, often still in paper form, for each different transit authority. Bibiana McHugh, an IT manager at a transportation company in Portland, Oregon, was fed up with this difficulty. Recognizing that the general public already used Google web apps for transportation needs, she contacted Google and partnered with them to create Google Transit. Using the data at her company, they were able to build an app for helping people find public transit information. Upon launch, the interest was so high across the world that they decided to scale the app they had built for the city of Portland across the globe. To make scaling possible, they generated open data standards for all public transportation agencies to use. Transit agencies voluntarily adopted these data standards. The standards continued to evolve so that now, thanks to standardized data, the general public can quickly and easily find public transportation options in most places in the world.

Data standards make it possible for different parties to use and share data. This makes it easier to build apps for analyzing and displaying data from different sources. In auditing, the American Institute of Certified Public Accountants (AICPA) has established audit data standards so that external auditors can more easily use the data produced by different clients. When the data standards are followed, the standardization of data saves auditors time so they are able to charge lower fees and conduct a more thorough audit.

Data parsing is often an iterative process that relies heavily on pattern recognition. Since data often follow multiple patterns, parsing can usually be performed in several ways, any of which can yield a valid output. For the S&S example, the EarlyPayDiscount field could be
parsed in the following alternative ways:
● Select the first digit to put in the DiscountRate Column. Select the third and fourth digits
to put in the DiscountDays column, and then select the seventh and eighth digits for the
BalanceDueDays column.
● Select all data before the forward slash to put in the DiscountRate column. Select the
data between the forward slash and first space and put it in the DiscountDays column.
Select all the data after the N to put into the BalanceDueDays column.
● Split the data into columns based on the space. From the new first column (that contains
the DiscountRate and DiscountDays information), take all numerical digits before the
slash and put them in the DiscountRate column and, reading from right to left, take all
digits until it comes to the forward slash and put them in the DiscountDays column. With
the new second column from the original split (i.e., the “N30” information), remove all
alpha characters and place any numerical characters in the BalanceDueDays column.

FIGURE 6-7 Data Parsing Example

RowNumber | EarlyPayDiscount (original) | DiscountRate (new) | DiscountDays (new) | BalanceDueDays (new)
1 | 2/10 N30 | 2 | 10 | 30
2 | 2/10 N30 | 2 | 10 | 30
3 | | | |
4 | 2/10 N30 | 2 | 10 | 30
5 | 2/10 N30 | 2 | 10 | 30
6 | 2/10 N15 | 2 | 10 | 15
7 | 2/10 N30 | 2 | 10 | 30
8 | 2/10 N15 | 2 | 10 | 15
9 | | | |
10 | 2/10 N30 | 2 | 10 | 30
11 | 2/10 N30 | 2 | 10 | 30
12 | 1/10 N30 | 1 | 10 | 30
13 | | | |
14 | 2/10 N30 | 2 | 10 | 30
15 | 1/10 N30 | 1 | 10 | 30
16 | 2/10 N30 | 2 | 10 | 30
17 | 2/10 N15 | 2 | 10 | 15

Notice that each of these approaches makes different assumptions about the structure of the
data contained in the original EarlyPayDiscount field. In the S&S example, all of these pat-
terns hold for the 17 rows displayed. However, when datasets contain millions of rows, iden-
tifying a pattern that applies to all, or even a subset of rows, is much more challenging. When
one pattern cannot be identified for all rows, often patterns can be identified for sets of rows.
These patterns can be used to iteratively parse the appropriate sets of rows. The challenge of
parsing data correctly is one reason why the ETL process can be so challenging.
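As an illustration, the second pattern listed above (everything before the forward slash, the digits between the slash and the space, and the digits after the N) might be expressed as a single regular expression, assuming the field is held in a pandas Series. The Series values are taken from Figure 6-7; the variable names are illustrative.

```python
import pandas as pd

terms = pd.Series(["2/10 N30", "2/10 N15", "1/10 N30", None],
                  name="EarlyPayDiscount")

# Capture the digits before the slash, between the slash and the space,
# and after the "N"; rows that do not match (or are empty) come back as NaN.
parsed = terms.str.extract(
    r"^(?P<DiscountRate>\d+)/(?P<DiscountDays>\d+)\s+N(?P<BalanceDueDays>\d+)$"
)
print(parsed)
```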
Data concatenation is combining data from two or more fields into a single field. In the S&S data, Ashton would like to have one field that lists the full name of each vendor representative as shown in Figure 6-8. To do this, the original columns of FName and LName need
to be combined, or concatenated, into a single field. Notice that when FName and LName are
concatenated there is also a space added between the first and last name. Without the space,
the new column would list names like “DeShawnWilliams” and “MiltonArmstrong”—which
makes the data messy and would require later cleaning.
Data concatenation is often used to create a unique identifier for a row. Notice in Figure 6-1 that ProdID has some of the same values as ProdCat. That is, ProdCat 1 has ProdIDs of 1, 2, and 3, and ProdCat 2 also has ProdIDs of 1 and 2. Thus, ProdID does not uniquely identify each product; rather, the combination of ProdID and ProdCat identifies each product at S&S. In this case, it would be useful to concatenate ProdCat and ProdID so that each product can be uniquely identified. This could be done by creating a new column labeled ProdCatID that adds a dash character, so the final result is “1-1” and “1-2” where the first number shows the product category number and the second number shows the product ID.
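Both concatenations—the FullName column in Figure 6-8 and the ProdCatID identifier—might be scripted as follows. The toy rows are illustrative, and the column names follow the data dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    "FName": ["DeShawn", "Milton"],
    "LName": ["Williams", "Armstrong"],
    "ProdCat": [1, 2],
    "ProdID": [2, 1],
})

# Concatenate first and last names with a separating space (as in Figure 6-8).
df["FullName"] = df["FName"] + " " + df["LName"]

# Build a unique product identifier from category and product ID ("1-2", "2-1").
df["ProdCatID"] = df["ProdCat"].astype(str) + "-" + df["ProdID"].astype(str)
print(df)
```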

CRYPTIC DATA VALUES


Cryptic data values are data items that have no apparent meaning without understanding the underlying coding scheme. For example, a consulting firm may keep track of positions in the organization such as partner, senior consultant, and research analyst by entering into the database the number 1 for partner, 2 for senior consultant, and 3 for research analyst. Without
understanding the code, the numbers in the column for employee position have no meaning.
In the S&S data, Ashton notices that he is unable to tell what the values in ProdCat and
ProdID mean. Referring to the data dictionary, he learns that a ProdCat of 1 is for kitchen ap-
pliances, 2 for fans, and 3 for hand tools. Similarly, ProdID is defined in the data dictionary by
referring to another table displayed in Figure 6-3.

FIGURE 6-8 Data Concatenation Example

RowNumber | FName | LName | FullName (new)
1 | DeShawn | Williams | DeShawn Williams
2 | Milton | Armstrong | Milton Armstrong
3 | Chiyo | Tanaka | Chiyo Tanaka
4 | M. | Armstrong | M. Armstrong
5 | Milton | Armstrong | Milton Armstrong
6 | Maryam | Ahmad | Maryam Ahmad
7 | Milton G. | Armstrong | Milton G. Armstrong
8 | Maryam | Ahmad | Maryam Ahmad
9 | Chiyo | Tanaka | Chiyo Tanaka
10 | DeShawn | Williams | DeShawn Williams
11 | DeShawn | Williams | DeShawn Williams
12 | Larsena | Hansen | Larsena Hansen
13 | Jacobsen | Sofia | Jacobsen Sofia
14 | Milton | Armstrong | Milton Armstrong
15 | Larsena | Hansen | Larsena Hansen
16 | Milton | Armstrong | Milton Armstrong
17 | Maryam | Ahmad | Maryam Ahmad

Ashton considers two possible solutions for this problem: (1) to replace the values in
the ProdID and ProdCat columns with the words rather than the numbers or (2) to add new
columns to the flat file that identify what the ProdID and the ProdCat values mean. Ashton
decides on the second option because keeping the original IDs will allow him to later merge
additional data into his file if he needs it. Typically, the cryptic data values problem is handled
through joining additional data into the file or by replacing the values. If storage space is not
an issue and there is no concern over the complexity of the working file, it is generally advis-
able to join data rather than replace values.
For some fields, there is a general understanding of what cryptic values mean. When a field contains only two different responses, typically 0 or 1, this field is called a dummy variable or dichotomous variable. In this case, standard practice is to use a 1 to signify the presence and 0 the absence of the attribute. For example, if the field captures data about
whether a vendor is a preferred vendor or not, the value of 1 would suggest they are a preferred
vendor and 0 that they are not. With dummy variables, best practice is to give them a meaning-
ful name rather than a generic name. So, for the previous example, rather than naming the field
VendorStatus, naming the field PreferredVendor is superior because the user knows the mean-
ing of values 0 and 1 in this field without referring to the data dictionary.
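A sketch of the approach Ashton chose—joining the Figure 6-3 lookup table and mapping the ProdCat codes rather than overwriting them—might look like this in pandas. The extract rows shown are illustrative; the lookup values come from Figure 6-3 and the data dictionary.

```python
import pandas as pd

# Figure 6-3 lookup table for product descriptions.
prod_lookup = pd.DataFrame({
    "ProdCat": [1, 1, 1, 2, 2, 3, 3],
    "ProdID":  [1, 2, 3, 1, 2, 1, 2],
    "ProdDesc": ["Blender", "Microwave", "Toaster", "Box Fan",
                 "Desk Fan", "Hand Drill", "Jigsaw"],
})
# Category meanings from the data dictionary in Figure 6-2.
cat_names = {1: "kitchen appliances", 2: "fans", 3: "hand tools"}

extract = pd.DataFrame({"ProdCat": [1, 2, 3], "ProdID": [3, 1, 2]})

# Keep the original codes and add descriptive columns alongside them.
extract = extract.merge(prod_lookup, on=["ProdCat", "ProdID"], how="left")
extract["ProdCatDesc"] = extract["ProdCat"].map(cat_names)
print(extract)
```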

MISFIELDED DATA VALUES


Misfielded data values are data values that are correctly formatted but do not belong in the field. As an example, if a data field for city contains the country name Germany, the data values are misfielded. The value Germany should be entered in a data field for country.
Misfielded data values can be a problem with an entire field (i.e., the entire column) or
with individual values (i.e., entries in a row). Correcting this problem can be considered either
a data standardization or a data cleaning step. How this problem is categorized is less impor-
tant than recognizing and fixing the problem in the dataset.
For S&S, notice that the entries in the City and State columns are backwards. That is, the
City field holds a state value and the State field holds city values. The data should be corrected
so that each field holds the appropriate information.

DATA FORMATTING AND DATA CONSISTENCY


MI5, the United Kingdom’s domestic counter-intelligence and security agency, learned the
importance of data formatting and data consistency when they “bugged” 134 incorrect phone
numbers. The cause of the mistake was a spreadsheet formatting error that altered the last three
digits of each telephone number to be “000” instead of the true last three digits. Data values
should have the same format as all other data values in the same field and match the data type
specified in the data dictionary. Similarly, every value in a field should be stored in the same way, which is referred to as data consistency.
One of the best examples showing the similarities and differences in data formatting and data consistency relates to how calendar dates are displayed and stored in databases and
spreadsheets. Dates can be displayed in many different formats. For example, each of these
date formats represents the same date: April 3, 1982; 3 April 1982; 03/04/82; and 04/03/82.
Dates are formatted differently because of differences in preference, display space, or geo-
graphic location—different countries choose to arrange days, months, and years differently.
A single format should be chosen and used for all dates in a field and typically all dates con-
tained in a file. The chosen format should be documented in the data dictionary using a mnemonic representation like dd/mm/yyyy to represent days, months, and years.
Although dates are often formatted differently for display, they typically are stored in
a consistent manner in the underlying systems. There are several common ways to store
dates digitally. Most typical is to store dates as the amount of time since a specific calen-
dar date. For example, Microsoft Excel stores dates as serial dates meaning the number of
days since January 1, 1900, plus a fractional portion to represent the hour of the 24-hour
day. The date 1:00 a.m. on November 11, 1979 is actually stored in Excel as the number
“29170.0416”. Other programs store dates using Unix epoch time (sometimes called Unix
epoch or epoch time), which is the number of seconds that have elapsed since January 1,
1970, not counting leap seconds. In Unix epoch time, 1:00 a.m. on November 11, 1979 is stored as “311130000”. Data consistency within time data is particularly problematic when it comes from different time zones. Thus, time data in most enterprise systems is stored according to Coordinated Universal Time (UTC). When local time is needed, it is computed based on the stored UTC value and the local time zone. Data consistency requires that the information stored for each field be stored the same way. When storing a date value, one field should not switch between the serial date and the epoch time storage designs. Rather, it should use only one of these two designs.
This textbook categorizes data formatting and data consistency issues together because it is often hard to determine from viewing data whether a problem is a formatting or a consistency issue. Furthermore, when given a flat file or an extraction of data, fixing formatting and consistency issues likely uses the same techniques. Distinguishing between formatting and consistency issues is more important when trying to fix the original source data.
The S&S data has both data formatting and data consistency errors—and some errors that could be caused by either or both issues. For example, the PhoneNumber field shows phone numbers formatted in six different ways. Notice how sometimes the formats have parentheses around the area code, sometimes there are spaces in different places, and the dashes are not always included. An example of each of the different formats is listed in Figure 6-9.

FIGURE 6-9 Examples of Different Data Formats

PhoneNumber
907961-4917
(844) 986-0858
3605659487
(551) 7779393
(790)447-1783
(551) 7779393
From the S&S data extract, it is not clear whether the data is stored in the database in dif-
ferent ways (data consistency) or just displayed in different formats. The data dictionary says
that the data are intended to be stored in the same way (as a 10-digit number), so it is likely that
the problem was a formatting problem that occurred when creating the data extract. Regardless
of the underlying problem, the solution is that the PhoneNumber field needs to be manipulated
to be consistent in appearance, as inconsistent formatting of the data can make it more difficult
to analyze the data. In the phone number example, if someone wanted to extract the area code
for all of the phone numbers, he would need to perform a much more complex parsing and
extraction exercise because of the poor formatting. If the data always contained the area code
between the parentheses, it would make the extraction of the area code a relatively straightfor-
ward exercise.
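One way the field might be standardized is to strip every non-digit character and then re-apply a single display format, leaving anything that is not exactly 10 digits for follow-up. The sketch below uses the example values from Figure 6-9; the variable and function names are illustrative.

```python
import pandas as pd

phones = pd.Series(["907961-4917", "(844) 986-0858", "3605659487",
                    "(551) 7779393", "(790)447-1783"],
                   name="PhoneNumber")

# Keep only the digits, then re-apply one display format to 10-digit values.
digits = phones.str.replace(r"\D", "", regex=True)

def fmt(d):
    # Re-format a 10-digit string; anything else is left as-is for investigation.
    return f"({d[:3]}) {d[3:6]}-{d[6:]}" if len(d) == 10 else d

print(digits.apply(fmt))
```

With every value reduced to the same 10-digit form, later steps such as extracting the area code become simple slices rather than complex parsing.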
The S&S data in Figure 6-1 contain additional data consistency and data formatting prob-
lems in the following fields: VendorName, FName, ZipCode, EarlyPayDiscount, ProdCat,
TotalCosts, and AvgCostPerUnit. Take a minute to see if you can find the errors in each of
these fields before reviewing the bullet points below.
● VendorName: The Black and Decker vendor name is sometimes written in its entirety
and sometimes abbreviated B&D. This is likely the same company and not two different
companies with similar initials because the address is the same for both types of entries.
The same company name should be listed for every instance of Black and Decker.
● FName: There are two problems in this field. The same person’s name is sometimes
listed in different ways; that is, Milton Armstrong is listed as Milton Armstrong,
M. Armstrong, and Milton G. Armstrong. This is likely the same person because all the
different name iterations have the same address. The second problem is that the data is
aligned in different ways: lines 9, 14, and 16 are right justified and all the other entries
are left justified.
● ZipCode: Sometimes the zip code contains the 5-digit zip code and other times it con-
tains a 9-digit zip code.
● EarlyPayDiscount: The data in the field has inconsistent alignment (left, center, and
right). While alignment is unlikely to influence data analysis, good practice is to use con-
sistent alignment for a field.
● ProdCat: The field does not consistently show the same number of decimal places for
numbers. For this field, the decimals are meaningless, and the data should only show
whole numbers.
● TotalCosts: The field sometimes shows dollar signs and a different number of decimal
places for different values. Since total cost data is collected at the penny level, these
should all be formatted to that level.
● AvgCostPerUnit: The error is that the euro currency marker is used even though this
company is in the United States and the total cost information is denominated in the U.S.
dollar (see the data dictionary description).

Data Cleaning
A survey of 16,000 users on Kaggle, a popular website for data scientists owned by Google,
reveals that the biggest problem data scientists face is dirty data. Dirty data is data that is inconsistent, inaccurate, or incomplete. To be useful, dirty data must be cleaned. Data cleaning is the process of updating data to be consistent, accurate, and complete.
Costs of dirty data can be exceptionally high, both in terms of dollars and reputation. Fidelity’s Magellan fund learned this when they reported to shareholders that they expected
to pay a $4.32 per share dividend at the end of the year. However, no year-end dividend was
issued. Fidelity later reported to shareholders that they wouldn’t be receiving the predicted
dividend because a tax accountant missed inputting a minus sign on a $1.3 billion net capital
loss. Thus, the system treated it as a gain—a total error of $2.6 billion dollars. Below are some
of the common errors to look for in cleaning data.

DATA DE-DUPLICATION
Data de-duplication is the process of analyzing the data and removing two or more records that contain identical information. In the S&S data, one might be concerned there is significant duplicate data in the columns related to vendor attributes (e.g., names, phone numbers, addresses) because they show a lot of repeated data. Although these items repeat, it does not
mean that a row is a duplicate of another row. That is, each row represents a unique product
within a product category, and since some vendors sell multiple products to S&S, some of
their information is repeated. In contrast, row 17 appears to be an exact duplicate of row 6, as the information on the two rows is identical. Note that the formatting of the rows sometimes
differs, but formatting is not considered for data de-duplication, rather only the values of the
fields are compared. Row 17 should be examined in more detail to see if it occurred on the
exact same day at the same time and does not represent two similar orders. If found to be an
exact duplicate, in most circumstances, row 17 should be deleted.
Duplicate data means that all information for a record is exactly the same as another
record when, in reality, there were not two identical occurrences of whatever is measured.
Note there is one exception to this general rule. As seen in the S&S data, the value in the
RowNumber field is different for the two records. Often row identifiers are added to a file
without considering if there are duplicate rows. In this case, the identifier may not exactly
match between records even though all the rest of the information is duplicative. The second
record is still a duplicate even though the row identifiers don’t match, as is the case in the S&S
example where the field RowNumber has different values for row 6 and row 17. When looking
for duplicate values, be aware of how the data is constructed to decide whether the row identi-
fiers are likely to indicate a duplicate or not.
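This logic might be sketched as follows in pandas; the two toy rows stand in for rows 6 and 17 of Figure 6-1 and do not reproduce the actual field values.

```python
import pandas as pd

df = pd.DataFrame({
    "RowNumber":  [6, 17],                 # row identifiers differ
    "VendorName": ["Oster", "Oster"],      # illustrative values only
    "ProdCat":    [2, 2],
    "ProdID":     [2, 2],
    "UnitsPurch": [40, 40],
    "TotalCosts": [1200.00, 1200.00],
})

# Compare every field except the row identifier when testing for duplicates.
compare_cols = [c for c in df.columns if c != "RowNumber"]
dupes = df[df.duplicated(subset=compare_cols, keep=False)]
print(dupes)            # rows 6 and 17 surface for investigation

# After investigation confirms a true duplicate, keep only the first occurrence.
deduped = df.drop_duplicates(subset=compare_cols, keep="first")
print(deduped)
```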

DATA FILTERING
The process of removing records or fields of information from a data source is called
data filtering. Companies often collect information about many vendors, even if they have not purchased from all of them. In the transformation of data, a data analyst might filter
out inactive vendors (no purchases in the last year) to analyze active vendors. This is an
example of filtering information. Typically, columns are filtered out of the data during
the extraction process; however, when extra columns are included in the data extraction,
removing unnecessary columns of data can be performed in the transformation process. In
the example of vendors, the data analyst may receive an extraction of vendor information
that contains the vendor representative. If this field is not necessary for the analysis, it can
be filtered out of the dataset.
The S&S data has several irrelevant records included in the file. Ashton intended to analyze products in the kitchen appliances and fans product categories—product categories 1 and 2—but his file also contains information about product category 3, hand tools. Ashton can filter the data to delete the two ProdCat 3 records.
Another important filtering question is what to do with missing or empty values (referred
to as null values). Notice in the S&S data that the EarlyPayDiscount field contains null values
for several rows. There is no hard rule for what to do with null values. The first step to take
when a null value is encountered is to go back to the original source data and verify that the
information was not lost in extracting the data. If the original source data contains the missing
information, the missing data should be added to the file.
Once the missing values are verified as validly missing or empty, there are generally three
approaches to take. The first approach retains the null values and continues with the transfor-
mation process. When the data is analyzed, the analyst will decide how to manage the null
values. The second approach is to delete the record with null values and document the deletion
in the analysis log because it may change inferences made when analyzing the data. The third
approach is to perform a data imputation, which is the process of replacing a null or missing value with a substituted value. Data imputation only works with numeric data (so it is not possible for the missing payment terms in the S&S data).
There are many different types of data imputation, each with strengths and weaknesses.
One approach is to replace all null values with the mean of the column of the non-null values
in the data. Other imputations replace the value with a random value from the dataset or use
statistical models, like regression, to compute a replacement value. Regardless of the method
chosen, it is critical to document in the analysis log when values have been imputed and the
method used. The analyst often adds an additional column that uses a dichotomous value to
indicate which values have been imputed.
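A minimal sketch of mean imputation with an indicator column, assuming a numeric field such as UnitsPurch, is shown below; the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"UnitsPurch": [250, None, 120, 400]})

# Flag which values will be imputed before overwriting them.
df["UnitsPurchImputed"] = df["UnitsPurch"].isna().astype(int)

# Replace nulls with the mean of the non-null values; the method chosen
# should also be documented in the analysis log.
df["UnitsPurch"] = df["UnitsPurch"].fillna(df["UnitsPurch"].mean())
print(df)
```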

DATA CONTRADICTION ERRORS


Data contradiction errors exist when the same entity is described in two conflicting ways. For example, a data file contains information about a manufacturing plant in two different records; however, the physical address of the manufacturing plant is different in each record
even though each record is meant to reference the same physical location.
In the S&S dataset, Milton Armstrong’s telephone number on line 16 is different than
his phone number on all other lines. Due to the contradiction error in Milton’s phone num-
ber, we do not know the true value. The phone number should be corrected so that Milton’s
phone number is the same throughout the dataset. This contradiction error may have been
caused by incomplete updates between the two separate systems from which the data came.
Milton updated his phone number, but it was only recorded in one system. The reason this
is most likely a data contradiction error is that Milton’s address is the exact same in all the
other records. If there were two different people named Milton Armstrong that lived at the
exact same house and each had a different phone number, this would not be considered a
data contradiction error. Contradiction errors need to be investigated and resolved appro-
priately. In this case, one might query the update logs of each system to determine which phone number was most recently updated. Alternatively, one could call each number and resolve the issue directly with the vendor.
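Contradictions of this kind can often be surfaced programmatically by counting distinct values within what should be a single entity. The sketch below is illustrative; the address and phone values are stand-ins, not the actual S&S data.

```python
import pandas as pd

df = pd.DataFrame({
    "FName": ["Milton", "Milton", "Milton"],
    "LName": ["Armstrong", "Armstrong", "Armstrong"],
    "Address": ["12 Elm St", "12 Elm St", "12 Elm St"],      # illustrative
    "PhoneNumber": ["8449860858", "8449860858", "8449861111"],
})

# For each representative (identified by name and address), count how many
# distinct phone numbers appear; more than one signals a contradiction.
conflicts = (df.groupby(["FName", "LName", "Address"])["PhoneNumber"]
               .nunique()
               .reset_index(name="DistinctPhones"))
print(conflicts[conflicts["DistinctPhones"] > 1])
```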

DATA THRESHOLD VIOLATIONS


Data threshold violations are data errors that occur when a data value falls outside an allowable level. An example would be a field capturing the number of children a taxpayer claims as dependents in which the taxpayer lists the value of “300.” Acceptable thresholds
for a field can be found in the data dictionary. In this case, let us assume that the field at-
tributes indicate that the field is numeric with a length of 2. The “300” value would violate
this threshold. Additionally, the data dictionary may establish a range of acceptable values
for this field as between 0 and 15. The “300” value would also violate this threshold.
Alternatively, a value of “65” would not violate the field attribute but would still be in viola-
tion of acceptable values.
In the S&S data, the PhoneNumber field always displays 10 numeric characters, except
for line 13. On line 13, the phone number contains 13 numeric characters because of the extra
“(208)”. Notice this example only considered numeric characters as the data dictionary de-
fined this field using numeric characters and not formatting. This threshold violation will need
to be investigated and resolved. Both errors in the value of the data and in the attributes of the
data can cause data threshold violations.
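Threshold checks of both kinds—field length and value range—might be scripted as follows. The sample values echo the chapter's examples, but the code is only a sketch.

```python
import pandas as pd

phones = pd.Series(["9079614917", "2088449860858", "3605659487"],
                   name="PhoneNumber")

# The data dictionary defines PhoneNumber as a 10-digit number, so any value
# whose digit count differs violates the field-length threshold.
digit_len = phones.str.replace(r"\D", "", regex=True).str.len()
print(phones[digit_len != 10])          # flags the 13-digit value

# A range threshold works the same way, e.g., dependents between 0 and 15.
dependents = pd.Series([2, 300, 65, 0])
print(dependents[(dependents < 0) | (dependents > 15)])
```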

VIOLATED ATTRIBUTE DEPENDENCIES


Violated attribute dependencies are errors that occur when a secondary attribute does not match the primary attribute. For example, if a record accurately indicates a person lives in Nauvoo, Illinois but mistakenly lists the zip code as 26354, there is a violated attribute dependency because the zip code for Nauvoo, Illinois is 62354. A transposition of the first two
digits of the zip code has created an attribute dependency violation. The data in the secondary
attribute, in this example zip code, needs to be corrected to match the primary attribute, in this
example city.
In the S&S data, there is a violated attribute dependency in the field ZipCode. The zip code
for Concord, North Carolina is not 94519 (which is the zip code for Concord, California), but
rather 28027. All instances where the zip code is wrong for Concord, North Carolina should be
replaced with the correct values.

DATA ENTRY ERRORS


Data entry errors are all types of errors that come from inputting data incorrectly. Data entry errors often occur in human data entry, such as misspelling words, transposing digits in numeric strings, and failing to enter data. Data entry errors can also be introduced by the com-
puter system. For example, a system may fail to record the first two digits of a year, and so it is
not clear if the date is meant to be 1910 or 2010. Controls around prevention and detection of
data entry errors are discussed extensively in Chapter 13.
Data entry errors may be indistinguishable from data formatting and data consistency er-
rors in an output data file. For example, the previous error of the inconsistent names of Milton
Armstrong, M. Armstrong, and Milton G. Armstrong may have been caused by data input er-
rors, formatting differences between systems, or data consistency errors in how the data was
structured. When correcting the problem in source data, it will be important to understand
which type of error is occurring so the appropriate fix can be administered; however, in data
cleaning, it is less important to distinguish between these three types of errors and more im-
portant to correct them.
A data entry error in the S&S data not likely caused by data formatting or data consistency
problems is the misspelling of San Diego as “SanDgieo” for rows 6, 8, and 17. Likely, some-
one simply input the data incorrectly.

Data Validation
Before, during, and while finalizing the data transformation process, one should analyze
whether the data has the properties of high-quality data. Analyzing data to make sure it has
these properties is called data validation.
Data validation is both a formal and informal process. Often companies have a formal data validation process that follows the transformation process. Informally, data validation is performed in parallel with each transformation. For example, if a column of information listing customer names shows that all names were recorded fully capitalized, one might want to
change the formatting such that only the first letter of the first and last name are capitalized.
After performing this operation, it is important to check to see if it was done correctly and
completely—which is an example of informal data validation.
Data validation is an important precursor to data cleaning. The techniques used to validate
data once cleaned are also useful for discovering problems in the data that need to be fixed.
Thus, data validation is an iterative process that both helps identify what needs to be trans-
formed in the data as well as verify that the data has been transformed correctly.
The techniques used to validate data can be thought of as a continuum from simple to
complex. As there is rarely a clear stopping point for data validation, the importance of the
data analysis and the availability of resources (both time and money) will guide the decision of which of these techniques to use. These techniques include visual inspection,
basic statistical tests, auditing a sample of data, and advanced testing techniques.

VISUAL INSPECTION
Visual inspection is exactly what it sounds like—examining data using human vision to see if there are problems. When examining data, one should look for each of the different problems
listed above. Visual inspection is particularly helpful in finding data standardization issues be-
cause it is easy to quickly scan an entire column. Visual scanning also is useful in identifying
if data parsing or concatenation is necessary, finding if data is misfielded, and identifying data
formatting and data consistency problems.
Visual inspection is easy to accomplish when the dataset is small, but even when large,
visually inspecting the data can be done efficiently. Consider the following two techniques.
First, sort the data in each column in ascending and then descending order. The sorting of data
can reveal missing values and the wrong type of data in a column. For example, if a column
should have numeric values, sorting will show if there are also characters contained in some
entries in the column.
Second, copy a column of data to a new environment and then remove all duplicate values
in the data copied. The reduced list, especially coupled with sorting, can be reviewed more
quickly and may show data consistency errors like the “Milton Armstrong” naming problems
discussed above. It can also help show if there are inappropriate values in the data. For ex-
ample, if the data should only contain whole numbers between 1 and 5, examining a list of
unique values will reveal if some entries contain decimals or are outside the acceptable range.
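Both techniques can be prepared with a few lines of code before the visual review; the sketch below assumes the column of interest has been loaded into a pandas Series, and the sample names are illustrative.

```python
import pandas as pd

names = pd.Series(["Milton Armstrong", "M. Armstrong", "Milton Armstrong",
                   "Milton G. Armstrong", None])

# Sorting pushes blanks and odd values toward one end of the column.
print(names.sort_values(ascending=True))
print(names.sort_values(ascending=False))

# Reducing the column to its distinct values makes inconsistencies easy to spot.
print(sorted(names.dropna().unique()))
```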

BASIC STATISTICAL TESTS


Several basic statistical tests can be performed to validate the data. The types of tests that can
be conducted differ for numeric fields and text fields.
For numeric fields, compute basic descriptive statistics for each field such as the mini-
mum, maximum, mean, median, and sum (or total). Examine these descriptive statistics for
unreasonable values such as negative amounts for product prices or large outliers. Comparing
totals from the transformed data with totals from the original data sources will show if the
transformation process inappropriately changed values. This will allow testing to see if the
dataset contains a complete set of all the original transactions. Consider computing descriptive
statistics before and after complex transformations to determine if the transformation was suc-
cessful or created unintended errors.
For text fields, a similar comparison can be made by counting the number of total records
or the number of distinct records present before and after a data transformation. For text
fields, it is also possible to compute the length of each value and compare this amount to
pre-transformation lengths to see if there are changes.
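A sketch of these checks, using illustrative values and assuming the pre- and post-transformation data are available as pandas DataFrames, might look like this:

```python
import pandas as pd

original = pd.DataFrame({"TotalCosts": [19509.75, 5516.90, 43282.53]})
transformed = original.copy()   # stand-in for the post-transformation file

# Descriptive statistics highlight negatives, outliers, and missing values.
print(transformed["TotalCosts"].describe())

# Control totals and record counts should match before and after transformation.
assert len(transformed) == len(original)
assert abs(transformed["TotalCosts"].sum() - original["TotalCosts"].sum()) < 0.01

# For text fields, compare distinct counts and value lengths instead.
vendor_names = pd.Series(["Calphalon", "Honeywell", "KitchenAid"])
print(vendor_names.nunique())
print(vendor_names.str.len().describe())
```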

AUDIT A SAMPLE
One of the best techniques for assuring data quality is to audit a sample of the data. Select a
sample of rows, columns, or both and analyze each value to make sure it is valid. One can trace
the transformed values back to the original data sources to make sure they accurately reflect
what was originally captured. To ensure completeness, select a sample of data items from the
original data sources and make sure all those items are listed in the final dataset.
There are many ways to choose a sample of data to audit. If there is a particular data item
that is very important, select all records that have that data item present. In contrast, if un-
derstanding the cleanliness of the overall dataset is desired, selecting a random sample will
allow for the computation of an overall error rate in the population. As an example, imag-
ine the dataset Ashton selects has 100,000 records. Ashton chooses to audit 1,000 records,
or 1% of the total number of records. If in those 1,000 records Ashton finds 70 errors,
Ashton can compute a 7% error rate for his sample. Because Ashton randomly selected
the 1,000 records, Ashton can assume the 7% error rate is the same across the entire set of
100,000 transactions—meaning there are likely 7,000 errors in the entire dataset. If Ashton
tracks each type of error, he can estimate the quantity of each type in the transformed dataset.
If the error rate exceeds an acceptable level, Ashton should continue the standardization and
cleaning process.
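The arithmetic of this sampling example might be scripted as follows. The record count, sample size, and error count mirror the numbers in the text; the random draw is only a sketch of how the sample could be selected.

```python
import pandas as pd

# Suppose the transformed dataset has 100,000 records (simulated here).
data = pd.DataFrame({"RecordID": range(100_000)})

# Draw a 1% random sample to audit; random_state makes the draw repeatable.
sample = data.sample(n=1_000, random_state=42)

# After tracing each sampled record to the source, suppose 70 errors were found.
errors_found = 70
error_rate = errors_found / len(sample)        # 0.07, or 7%
estimated_errors = error_rate * len(data)      # about 7,000 in the population
print(f"{error_rate:.1%} sample error rate, ~{estimated_errors:,.0f} estimated errors")
```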

ADVANCED TESTING TECHNIQUES


A deeper understanding of the content of data can allow for more sophisticated data validation
tests. As an example, accountants know that every journal entry should have a balanced set
of debit and credit entries. If an accountant has extracted journal entries, she can analyze the
extracted data to make sure the balance of credits equals the balance of debits. She can also
test the relationship between sub-ledgers and the general ledger in extracted data; namely, that
the sum of the balances in all of the sub-ledgers should equal the total in the general ledger.
Testing within a transaction could be done by comparing the sum of quantities times prices of
items in a transaction with the total of that transaction. With an understanding of the data, one
can leverage that knowledge to check if the transformed data conforms to accounting rules,
thus providing evidence of the quality of your data.
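A sketch of the debits-equal-credits test, assuming the extracted journal entries carry an entry identifier and separate debit and credit columns (an assumption about the extract layout), might look like this:

```python
import pandas as pd

journal = pd.DataFrame({
    "EntryID": [1, 1, 2, 2, 2],
    "Debit":   [500.00, 0.00, 250.00, 0.00, 0.00],
    "Credit":  [0.00, 500.00, 0.00, 100.00, 150.00],
})

# Every journal entry should balance: total debits equal total credits.
by_entry = journal.groupby("EntryID")[["Debit", "Credit"]].sum()
unbalanced = by_entry[(by_entry["Debit"] - by_entry["Credit"]).abs() > 0.005]
print(unbalanced)   # empty if the extracted entries all balance

# The same idea extends to sub-ledger vs. general ledger totals and to
# quantity-times-price checks within a transaction.
```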

Summary and Case Conclusion


Ashton carefully reviewed the data extract he was given by the IT department. He no-
ticed many problems in the data and was able to correct them before analyzing the vendor
information.
Ashton was also able to document the errors he found and send that list back to the data
owners. With the list of errors, the data owners corrected the source data in their systems (see
corrected extract of data in Figure 6-10). In this way, all future users of the vendor data will not
have to repair the same data errors again. The list of errors will also be useful as S&S incorpo-
rates the new system into their existing data architecture.

FIGURE 6-10
Transformed Extract of S&S Vendor Data

KEY TERMS

data structuring
aggregate data
data pivoting
data standardization
data parsing
data concatenation
cryptic data values
dummy variable or dichotomous variable
misfielded data values
data consistency
dirty data
data cleaning
data de-duplication
data filtering
data imputation
data contradiction errors
data threshold violations
violated attribute dependencies
data entry errors
data validation
visual inspection

AIS in Action
CHAPTER QUIZ

1. In which step of the data transformation process would you analyze whether the data has the properties of high-quality data?
a. data structuring
b. data standardization
c. data cleaning
d. data validation
2. As part of the data standardization process, often items contained in different fields for the same record need to be combined into a single field. This process is called:
a. aggregating data
b. data aggregation
c. data concatenation
d. data parsing
3. Which of the following reasons describes why transforming data is necessary?
a. Data aggregated at different levels needs to be joined.
b. Data within a field has various formats.
c. Multiple data values are contained in the same field and need to be separated.
d. All of the above
Use the following image to answer questions 4 through 10.

4. In column 3, which of the following problems do you find?
a. data consistency error
b. data imputation error
c. data contradiction error
d. violated attribute dependencies
5. In column 5, which of the following problems do you find?
a. data pivoting error
b. violated attribute dependencies
c. data consistency error
d. cryptic values
6. In column 7, row 1, which of the following problems do you find?
a. data consistency error
b. data parsing error
c. data threshold violation
d. misfielded data value
7. In row 8 and row 9, which of the following problems do you find?
a. data contradiction error
b. data concatenation error
c. data aggregation error
d. duplicate values
8. In column 2, row 7, which of the following problems do you find?
a. data threshold violation
b. data entry error
c. violated attribute dependencies
d. dichotomous variable problem

9. Column 4, row 12, is most likely an example of which of the following?


a. data imputation c. violated attribute dependencies
b. cryptic data value d. listing the date in serial date format
10. Which of the following could be used to catch the problems listed in the figure?
a. visual inspection c. auditing a sample
b. basic statistical tests d. All of the above

DISCUSSION QUESTIONS

6.1 Why is transforming data necessary, and why does it take so much time? What ways can
you think of to reduce the time needed to transform data?
6.2 For each of the data quality attributes listed in Table 6-1, discuss a business scenario that
would be harmed if data did not have the data attribute. As you prepare these scenarios,
are there other attributes of data that should be added to Table 6-1? If so, why?
6.3 Companies are increasingly operating throughout the entire world. As such, the data
companies collect can differ based on the country where the data is captured. This chap-
ter identified two problems that can exist in data based on operating throughout the
world: different formats for dates and capturing time stamps in different locations. What
are other possible challenges you would observe in data captured throughout the world?
6.4 What are the strengths and weaknesses of each of the four data validation procedures
discussed in this chapter? What are other possible ways to validate data?

PROBLEMS

6.1 Match the following terms with their definitions or examples:

___ 1. aggregate data a. Process of analyzing data to make certain the data has the
properties of high-quality data: accuracy, completeness,
consistency, timeliness, and validity
___ 2. cryptic data b. Data values that are correctly formatted but not listed in
values the correct field
___ 3. data cleaning c. All types of errors that come from inputting data
incorrectly
___ 4. data d. Examining data using human vision to see if there are
concatenation problems
___ 5. data consistency e. Process of tracing extracted or transformed values back
to their original source
___ 6. data f. Data items that have no meaning without understanding a
contradiction coding scheme
errors
___ 7. data g. The process of standardizing the structure and meaning
de-duplication of each data element so it can be analyzed and used in
decision making
___ 8. data entry errors h. Technique that rotates data from a state of rows to a state
of columns
___ 9. data filtering i. Errors that occur when a secondary attribute in a row of
data does not match the primary attribute
___ 10. data imputation j. Data field that contains only two responses, typically a 0
or 1; also called a dichotomous variable
___ 11. data parsing k. Principle that every value in a field should be stored in
the same way

___ 12. data pivoting l. Process of changing the organization and relationships
among data fields to prepare the data for analysis
___ 13. data m. Data errors that occur when a data value falls outside an
standardization allowable level
___ 14. data structuring n. Data field that contains only two responses, typically a 0
or 1; also called a dummy variable
___ 15. data threshold o. Process of updating data to be consistent, accurate, and
violations complete
___ 16. data validation p. Data that is inconsistent, inaccurate, or incomplete
___ 17. dichotomous q. Separating data from a single field into multiple fields
variable
___ 18. dirty data r. Combining data from two or more fields into a single
field
___ 19. dummy variable s. Process of replacing a null or missing value with a
substituted value
___ 20. misfielded data t. Process of removing records or fields of information from
values a data source
___ 21. violated attribute u. Process of analyzing data and removing two or more
dependencies records that contain identical information
___ 22. visual inspection v. Presentation of data in a summarized form
w. Error that occurs when the same entity is described in
two conflicting ways
x. Process of ordering data to reveal unexpected values

6.2 Excel Project: Data Pivoting


You are a data analyst for the city of Burlington, Vermont. Download the data file
“P6-2BurlingtonVermontData.xlsx” from the student download page at http://www.
pearsonglobaleditions.com, which contains the annual account balance information
for city departments for six fiscal years. For this problem, use the sheet titled “Annual
Data”—it contains data aggregated to the annual level for the city departments.

REQUIRED
Using this data, prepare a PivotTable in a new sheet to answer each of the following
questions:
a. Have total department budgets changed each year? To answer the question, create a
PivotTable that shows the budgeted amount of expenditures for each fiscal year. Do
not include grand totals. Add conditional formatting data bars to show which amounts
are the greatest.
b. Which funds have the largest expense budgets for fiscal year 6? Create a PivotTable
that shows fund names and budgeted amounts for fiscal year 6. Sort the data so the
greatest budget amounts are listed at the top.
c. Regardless of department, organization, or fund, which type of activity was most
costly during the entire time period (hint: view the “Detail_Description” field)? How
much did the city pay for this activity?
6.3 Excel Project: Aggregating Data at Different Levels
You are an internal auditor for the city of Burlington, Vermont. Download the data file “P6-
3BurlingtonVermontData.xlsx” from the student download page at http://www.pearson
globaleditions.com, which contains the annual account balance information for city de-
partments for six fiscal years. There are two sheets in this workbook. The “Annual Data”
sheet contains data aggregated to the annual level for the city departments. The “Monthly
Data” sheet contains data aggregated to the monthly level for the city departments. You
are planning to perform audit procedures on the “Monthly Data.” Before you do, verify
that the data in this sheet matches the data in the “Annual Data” worksheet, which you
already verified as correct.

REQUIRED
Analyze the two sheets. Based on your analysis, answer the following questions:
a. Under what circumstances can you NOT use the “Annual Data” sheet for your audit?
Said differently, why might you need the data in the “Monthly Data” sheet for your
audit?
b. On a separate worksheet in Excel, create a summary of the data that shows the total
dollar amount of transactions for the two different sheets. Are these the same for both
datasets?
c. Does the total amount for transactions differ for the different departments and years?
Create two sheets: the first sheet should compare departments and the second sheet
should compare years. What do you learn from these analyses?
d. Based on your analysis in the previous questions, suggest the areas you believe are
most important to investigate further. Why do you believe these areas are the most
important to investigate further?
6.4 Excel Project: Parsing Data
Go to the student download page at http://www.pearsonglobaleditions.com and download
the file labeled “P6-4CustomerData.xlsx”. For this problem, assume your supervisor gave
you this file and said it contains information about customers. The file has several fields
merged together, and your supervisor wants you to parse them into separate fields.

REQUIRED
Separate the fields as instructed in parts a and b below.
a. Parse the data on the “Data – Simplified” sheet. Put your solution on the “Simplified
Solution” sheet. The image below shows the solution for the first three rows. Your
solution should separate the information using only formulas. Leave the formulas in
your solution file for evaluation. As a hint, this problem can be solved using a combi-
nation of the LEFT, RIGHT, MID, FIND, and VALUE formulas.

b. Parse the data on the “Data – Advanced” sheet. Put your solution on the “Advanced
Solution” sheet. The image below shows the solution for the first three rows. Your
solution should separate the information using only formulas. Leave the formulas in
your solution file for evaluation. As a hint, this problem can be solved using a com-
bination of the LEFT, RIGHT, MID, FIND, TRIM, SUBSTITUTE, REPT, LEN, and
VALUE formulas.*

6.5 Excel Project: Fixing Cryptic Data Using Data Concatenation


You work at a non-profit that seeks to help people start businesses in third world coun-
tries so those people can lift themselves out of poverty. Your organization has found more
success working in countries that have policies that make it easier to start a business. You
have been asked to explore which countries your organization should consider entering.
To do this, go to the student download page at http://www.pearsonglobaleditions.com and

* Life-long learning opportunity: see pp. xxiii–xxiv in preface.



download the file labeled “P6-5StartingABusiness.xlsx”. This dataset provides informa-


tion about the relative ease of starting a business and doing business in countries around
the world. The Excel spreadsheet contains three years of data in multiple sheets. The data
is not easy to understand as it is spread across multiple sheets and the “Final Report” sheet
currently shows only cryptic data values for the Country and Year. The data for this prob-
lem comes from The World Bank's Doing Business dataset; see http://www.doingbusiness.
org/en/data for more information.

REQUIRED
Do the following:
a. Fill out the “Final Report” sheet so that it looks like the image below. You should use
formulas to prepare the final report. Leave the formulas in your solution file for evalu-
ation. As a hint, this problem can be solved using a combination of the VLOOKUP
and CONCATENATE functions. Add columns to the sheets to complete the task with
these formulas as needed.

b. Fill out the “Final Report – Advanced” sheet so that it looks like the image below.
You should use formulas to prepare the final report. Leave the formulas in your solu-
tion file for evaluation. For this problem, you are only allowed to use combinations
of the following functions: INDEX, MATCH, and CONCATENATE. You will NOT
need to add columns to complete the task.

c. Compare and contrast the formulas you used to answer part a and part b. Research
both types of formulas and discuss which formulas are better for use in practice.*
6.6 Excel Project: Formatting Data Consistently
You are preparing to analyze data about the Washington, DC Public Schools from
January 2009 to February 2019. Go to the student download page at http://www.pearson
globaleditions.com and download the file labeled “P6-6WashingtonDCData.xlsx”. Notice
that the formatting of the data in the “Transaction_Date” and “Transaction_Amount” fields
has become corrupted, and each entry can be formatted in many different ways. Note that all
of the data in the “Transaction_Date” field contains a date (so any numbers are dates stored as
serial values). All of the data listed in the “Transaction_Amount” column refers to dollar amounts.

REQUIRED
For the two fields, format the data in each field to be consistent within each column.
Justify the data format you chose to use for each column.
Describe the process you went through to make sure you formatted all the data correctly.
The data in the “Transaction_Date” column only needs to be accurate at the day, month,
and year level (so do not worry about time information in the column if it is present).
The data in the “Transaction_Amount” column needs to be accurate at the penny level.*

* Life-long learning opportunity: see pp. xxiii–xxiv in preface.



6.7 Excel Project: Auditing a Data Extract


You work as an auditor for the city of Washington, DC, and have been asked to audit
employee purchasing card (p-card) expenses. You have been asked to validate that the
cleaned data for a sample of the p-card transactions matches the original records from before
the transformation process. Go to the student download page at http://www.pearson
globaleditions.com and download the file labeled “P6-7WashingtonDCDataAudit.xlsx”.
The file contains two sheets. The sheet labeled “Population” includes all p-card transac-
tions from January 2009 to February 2019. The sheet labeled “Sample” contains the 20
transactions from 2019 you have selected for examination.

REQUIRED
a. Determine whether all the data in the sample transactions matches records in the orig-
inal data. If there are errors, document any you find.
b. Once you are finished, estimate how many errors are likely for each field in the 2019
data.
6.8 For the S&S case discussed in this chapter, you receive the following output containing
basic descriptive statistics for some of the columns in the full dataset (the chapter ex-
ample problem contained a small excerpt of data; this problem uses more data). S&S has
a total of 60 products that customers purchase across 3 categories.

REQUIRED
List the concerns you have with the data and discuss what steps you would take for each
concern identified.

6.9 Excel Project: Creating Data Errors


This problem asks you to create a dataset with specified errors.

REQUIRED
a. Create a fake dataset that has at least 20 rows and 6 columns. The dataset can relate
to any setting in accounting or business. In the dataset, include at least one of each
of the following types of errors: cryptic data values, data contradiction errors, data
entry errors, data threshold violations, misfielded data values, and violated attribute
dependencies.
b. Prepare a memo to go with your dataset that describes the nature of the data. Then
define each error and describe the example error you included in your dataset.
6.10 Search the Internet for an example of a business project failure caused by a failure in
some part of the ETL process. Based on the example you find, discuss the principles
from this chapter that the company failed to follow. In addition, discuss the reasons why
you think the company failed to follow these principles.

The following case was developed by the Ernst & Young Academic Resource Center, which is
sponsored by the Ernst & Young Foundation.

CASE 6-1 Hotel Data Cleaning Case

For this case, you received a data file, Analytics_mindset_case_studies_ETL_Case4.xlsx (go to the
student download page at http://www.pearsonglobaleditions.com to download this file). It includes
789 lines of journal entries for 11 days from a hotel and conference center (on the tab labeled
JELineItems) as well as other important accounting-relevant datasets on these other tabs:
BusinessUnits, ChartOfAccounts, PreparerInfo, and Source. The following is a select list of data
fields from this file, noting the field name and field description, and the tabs on which the data
field is located.

Field Name (Tab): Field Description

JENumber (JELineItems): Unique identifier for each journal entry

GLAccountNumber (JELineItems and ChartOfAccounts): GL account number from the chart of accounts; the tab labeled GLAccounts contains the full information about the GL accounts

BusinessUnitID (JELineItems, BusinessUnits, and PreparerInfo): Business unit number (1–8) of the journal entry

PreparerID (JELineItems and PreparerInfo): Employee ID for the employee who initiated the transaction; for transactions recorded initially in a subsystem (e.g., GuestSYS or POS), the PreparerID is listed as the system and not the employee
● Note that the PreparerID is not unique. The company starts all ID numbering over for each business unit. Thus, the combination of the PreparerID and BusinessUnit number is unique for each employee.

SourceID (JELineItems and Source): Unique identifier for each source

EffectiveDate (JELineItems): Date the entry was posted to the GL as occurring; the EffectiveDate is the date that the transaction is posted in the GL and recognized as revenue. The corporate office, therefore, is recognizing revenue throughout the year based on this date, rather than the date it is meeting its performance obligations, which you would consider the “right” effective date for proper accounting treatment. However, the corporate office performs year-end cutoff procedures to account for this at a level of materiality that, year over year, would suit corporate and ensure that amounts are properly stated.

JEDescription (JELineItems): Description of the transaction; may include vendor or guest name, etc.

Debit (JELineItems): Debit amount of the entry (positive)

Credit (JELineItems): Credit amount of the entry (negative)

Amount (JELineItems): Total amount of the journal entry line item (may be positive or negative)

BusinessUnit (BusinessUnits): Business unit name (e.g., hotel, food and beverage) of the journal entry

BUDescription (BusinessUnits): Description of each business unit

AccountType (ChartOfAccounts): For each account, a high-level description of which type of general ledger account it is (e.g., asset, liability, equity, expense, revenue)
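As the field list notes, PreparerID numbering restarts within each business unit, so identifying individual preparers requires a composite key. The following is a minimal sketch of building that key, assuming the workbook is saved locally, the data is loaded with pandas, and the column names match the field list above:

import pandas as pd

# Load the journal entry line items tab (file and tab names come from the case; the local path is assumed)
je = pd.read_excel("Analytics_mindset_case_studies_ETL_Case4.xlsx", sheet_name="JELineItems")

# PreparerID alone is not unique because ID numbering starts over in each business unit,
# so combine BusinessUnitID and PreparerID into a composite preparer key
je["PreparerKey"] = je["BusinessUnitID"].astype(str) + "-" + je["PreparerID"].astype(str)

# Example use: count journal entry line items recorded by each unique preparer
# (subsystem preparers such as GuestSYS or POS appear in these counts as well)
print(je["PreparerKey"].value_counts().head(10))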
