CHAPTER 6
Transforming Data

Data Transformation Challenges
Ashton successfully helped the organization transform its culture and adopt an analytics
mindset for dealing with data. The company successfully built a data structure that consists
of a data lake, a data warehouse, and data marts. With the new structure, employees are able
to fulfill many of their data needs, allowing data analysts more time to analyze data and not
just locate and clean data for others.
Although data handling at S&S has improved, Ashton still receives requests for extract-
ing, transforming, and loading data into other applications that either have not been added
to the current data architecture or require other special processing. Such is the current case
with which Ashton is dealing.
S&S purchased another company that used a different enterprise resource planning
(ERP) system. The two systems don’t work well together—storing data in different formats,
with different field names, and a host of other challenges. Although S&S plans to better in-
tegrate these two systems, until then Ashton has become frustrated trying to make business
decisions because he cannot seem to get a clear picture of what is happening due to incom-
patible data. Ashton has been trying to figure out how to merge and clean the data to be able
to improve S&S’s decision making.
To figure this problem out, Ashton decided to study a subset of S&S’s data that contains
information about its vendors. He had an IT specialist export data from the two systems into
a combined flat file for his review. He hopes to be able to understand all of the differences
that exist in the data so he can recommend ways to clean the data going forward.
The data extract Ashton is analyzing is shown in Figure 6-1, and an extract of the data
dictionary describing the merged data is shown in Figure 6-2. This data dictionary provides
abbreviated metadata of the data extract. Figure 6-3 contains additional information about
Product ID, one of the fields in Figure 6-1. This chapter will use this data to teach principles
related to transforming data.
Introduction
IBM estimates that poor quality data costs the United States economy $3.1 trillion per year and
that 1 in 3 business decision makers don’t trust the information they use to make decisions. A
worldwide survey conducted by Experian shows that business executives believe that 26% of
their total data may be inaccurate, and it is getting worse over time. While some of the errors
are complex, many are simple, including incomplete data, outdated data, duplicate informa-
tion, or simple errors in the data (e.g., spelling mistakes, typos, etc.).
So what is to be done? The ideal is to capture and store clean data. The internal controls
presented in Chapters 10–13 address methods of capturing accurate data. Even with strong
internal controls, data can still become “dirty” or need to be transformed for alternate uses.
Data can become dirty by such things as changing events in the world (e.g., a customer changes
a phone number), merging various data sources that store the same type of data differently, or
making mistakes in dealing with data. When this happens, the data needs to be transformed in
a way that restores its quality. This chapter presents a four-step process for transforming data
that will maintain or improve data quality: (1) structure the data, (2) standardize the data,
(3) clean the data, and (4) validate the data.

FIGURE 6-1 Extract of Vendor Data
The data transformation process spans the life of data. That is, while data transformation
is specifically a step of the extract, transform, and load (ETL) process, it also occurs when data
is gathered, stored, and analyzed. Understanding the transformation process is valuable for all
data stages. Before discussing the steps of the transformation process, we review the attributes
of high-quality data. These attributes form the objectives of the data transformation process.
For example, in one well-known NASA project, one engineering team used the English measurement
system (e.g., feet, miles), while the other team used the metric system (e.g., meters, kilometers).
This inconsistency in the data resulted in a software failure and the loss of a very expensive
Mars orbiter.
In every interaction with data, it is critical to make sure the data has the attributes of high-
quality data. If any of these attributes are missing, the data will need to be remedied. While not
all data problems can be remedied by data transformation, the four-step process introduced in
this chapter will help improve data quality.
Data Structuring
Data structure describes the way data are stored, including the relationships between different
fields. Data structuring is the process of changing the organization and relationships among
data fields to prepare the data for analysis. Extracted data often needs to be structured in a
manner that will enable analysis. This can entail aggregating the data at different levels of
detail, joining different data together, and/or pivoting the data.

data structuring - The process of changing the organization and relationships among data fields to prepare the data for analysis.
AGGREGATE DATA
Aggregate data is the presentation of data in a summarized form. Aggregate data has fewer
details than disaggregated data. As an example, consider the typical sales process and reporting
of data. When a business makes a sale, it records a credit to sales revenue in a database. This
is fully disaggregated sales data. In aggregation, the company might sum sales data around a
corresponding attribute. For example, summing sales by sales manager removes the detail of
each sale but allows the business to evaluate manager performance. Similarly, a company may
fully aggregate sales data by summing all sales to create the sales revenue line item on the
income statement.

aggregate data - The presentation of data in a summarized form.
When transforming data, it is critical to understand the level of aggregation for each data
source because combining data aggregated at different levels can create problems. For ex-
ample, S&S could analyze the vendor purchases shown in Figure 6-1 by aggregating, or sum-
ming, all individual purchases to produce the total units purchased from each vendor and the
total expenditures for each vendor. Figure 6-4 shows this higher level aggregation as well as
aggregating across all vendors.
When data is aggregated, information is lost. That is, with the data aggregated in Figure
6-4 at the individual vendor level, it is no longer possible to know to which products the
UnitsPurch and TotalCosts fields relate. The only way to analyze data at the product level
would be to go back to the original disaggregated data because it cannot be extracted from the
aggregated data. Best practice for storing data is to store the information in as disaggregated
form as possible and then aggregate the data through querying for other uses.

FIGURE 6-4 Examples of Different Levels of Aggregating Data
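To make the idea concrete, the following is a minimal sketch of vendor-level aggregation in Python with pandas. The values and field names (VendorName, UnitsPurch, TotalCosts) are illustrative stand-ins modeled on Figure 6-1, not the actual S&S query.

```python
import pandas as pd

# Hypothetical disaggregated purchase lines modeled on the Figure 6-1 fields.
purchases = pd.DataFrame({
    "VendorName": ["B&D", "B&D", "Calphalon", "Oster"],
    "ProdID":     [1, 2, 1, 1],
    "UnitsPurch": [100, 40, 75, 120],
    "TotalCosts": [15982.00, 2529.00, 19509.75, 28020.11],
})

# Aggregate to the vendor level: product-level detail is summed away.
by_vendor = (purchases
             .groupby("VendorName", as_index=False)[["UnitsPurch", "TotalCosts"]]
             .sum())

# Aggregate across all vendors: a single total, as in the fully aggregated view.
overall = purchases[["UnitsPurch", "TotalCosts"]].sum()

print(by_vendor)
print(overall)
```

Because the sums replace the underlying rows, recovering product-level detail later requires returning to the disaggregated source, which is why the disaggregated form is the one worth storing.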
DATA JOINING
As discussed in Chapter 4, a key part in querying databases is joining information from differ-
ent tables into a single table so that the joined data can be analyzed.1 Frequently in the ETL
process, data is queried from the database into a single flat file, as seen in the S&S dataset. The
table created in the S&S example required joining data that was likely contained in different da-
tabase tables such as the vendor table (e.g., all fields from VendorName to EarlyPayDiscount),
from the product table (e.g., fields ProdCat and ProdID), and from a transaction table that
recorded purchases (e.g., fields UnitsPurch and TotalCosts).
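A small sketch of how such a flat file might be assembled with joins is shown below. The vendor, product, and purchase tables and their key columns are hypothetical, since the actual S&S schema is not reproduced in this excerpt.

```python
import pandas as pd

# Hypothetical normalized tables; the real S&S database schema is not shown in the text.
vendors = pd.DataFrame({"VendorID": [1, 2],
                        "VendorName": ["B&D", "Calphalon"]})
products = pd.DataFrame({"ProdID": [1, 2], "ProdCat": [1, 1]})
purchases = pd.DataFrame({"VendorID": [1, 1, 2], "ProdID": [1, 2, 1],
                          "UnitsPurch": [10, 5, 8],
                          "TotalCosts": [1500.00, 400.00, 960.00]})

# Join the transaction table to its related master tables to build one flat file.
flat_file = (purchases
             .merge(vendors, on="VendorID", how="left")
             .merge(products, on="ProdID", how="left"))
print(flat_file)
```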
DATA PIVOTING
Data pivoting is rotating data from rows to columns. Some software programs are designed
to use pivoted data rather than unpivoted data. Understanding the design of the program into
which the data will be loaded will help you determine if data needs to be pivoted or not.

data pivoting - A technique that rotates data from rows to columns.
The S&S data in Figure 6-1 could be pivoted in a variety of ways to make it easier to in-
terpret the data. Figure 6-5 shows a pivot of the data so that the vendor names are listed in each
row and the columns show the different product categories. Note that when data is pivoted, the
data often will be aggregated in some way. For example, in the pivoted data the total product
costs are summed for each vendor and product category.
Figure 6-6 shows the data in Figure 6-5 pivoted back into a different form. The Figure 6-6
form is much closer to the original but aggregated at a higher level. Since the information
about individual products was lost in pivoting the data to Figure 6-5, individual products cannot
be displayed in the pivot from Figure 6-5 to Figure 6-6. This is another example of how
aggregating data results in a loss of detail that cannot be recaptured unless the original data is
analyzed.

FIGURE 6-6 Pivoting Figure 6-5 Data

VendorName          ProdCat   TotalCosts
B&D                 1         $15,982.00
B&D                 2         $ 2,529.00
Black and Decker    1         $ 2,220.06
Black and Decker    2         $   568.00
Black and Decker    3         $13,024.57
Calphalon           1         $19,509.75
Honeywell           2         $ 5,516.90
KitchenAid          1         $43,282.53
Oster               1         $28,020.11
Panasonic           1         $15,765.12
Panasonic           2         $ 5,693.50

1 Data joining is also referred to as data merging and data aggregation. The last term, data aggregation, can be confusing, given the previous discussion of aggregate data. Aggregating data is the process of summarizing data at different levels of detail, while data aggregation is the process of combining different data sources. To avoid confusion, we will use the term data joining to refer to data aggregation throughout the text.
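The pivot-and-aggregate behavior can be sketched in pandas as follows. The rows and amounts are illustrative stand-ins for the Figure 6-5 and Figure 6-6 data rather than the actual extract.

```python
import pandas as pd

# Hypothetical purchase lines similar in shape to the S&S extract.
data = pd.DataFrame({
    "VendorName": ["B&D", "B&D", "Black and Decker", "Panasonic"],
    "ProdCat":    [1, 2, 1, 2],
    "TotalCosts": [15982.00, 2529.00, 2220.06, 5693.50],
})

# Pivot as in Figure 6-5: vendors as rows, product categories as columns,
# with total costs summed within each vendor/category cell.
pivoted = data.pivot_table(index="VendorName", columns="ProdCat",
                           values="TotalCosts", aggfunc="sum")
print(pivoted)

# "Unpivot" back to a long format similar to Figure 6-6; the product-level
# detail that was summed away cannot be recovered.
unpivoted = (pivoted.reset_index()
             .melt(id_vars="VendorName", value_name="TotalCosts")
             .dropna())
print(unpivoted)
```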
Data Standardization
Data standardization is the process of standardizing the structure and meaning of each data
element so it can be analyzed and used in decision making. Data standardization is particularly
important when merging data from several sources. Achieving standardized data may involve
changing data to a common format, data type, or coding scheme. Data standardization also
encompasses ensuring the information is contained in the correct field and the fields are
organized in a useful manner. When dealing with a database file or a flat file format, think of data
standardization as making sure the columns of information are correct. In doing so, several
important things should be considered, including data parsing and data concatenation, cryptic
data values, misfielded data values, and data formatting and data consistency. Focus 6-1 shows
the importance of data standardization in practice.

data standardization - The process of standardizing the structure and meaning of each data element so it can be analyzed and used in decision making.
FOCUS 6-1

In the mid-2000s, cellphones began to be successfully used for navigating not only streets but
also public transportation. Prior to that time, you had to look up public transportation schedules,
often still in paper form, for each different transit authority. Bibiana McHugh, an IT manager at a
transportation company in Portland, Oregon, was fed up with this difficulty. Recognizing that the
general public already used Google web apps for transportation needs, she contacted Google and
partnered with them to create Google Transit. Using the data at her company, they were able to
build an app for helping people find public transit information. Upon launch, the interest was so
high across the world that they decided to scale the app they had built for the city of Portland
across the globe. To make scaling possible, they generated open data standards for all public
transportation agencies to use. Transit agencies voluntarily adopted these data standards. The
standards continued to evolve so that now, thanks to standardized data, the general public can
quickly and easily find public transportation options in most places in the world.

Data standards make it possible for different parties to use and share data. This makes it easier
to build apps for analyzing and displaying data from different sources. In auditing, the American
Institute of Certified Public Accountants (AICPA) has established audit data standards so that
external auditors can more easily use the data produced by different clients. When the data
standards are followed, the standardization of data saves auditors time so they are able to charge
lower fees and conduct a more thorough audit.
which can yield a valid output. For the S&S example, the EarlyPayDiscount field could be
parsed in the following alternative ways:
● Select the first digit to put in the DiscountRate column. Select the third and fourth digits
to put in the DiscountDays column, and then select the seventh and eighth digits for the
BalanceDueDays column.
● Select all data before the forward slash to put in the DiscountRate column. Select the
data between the forward slash and first space and put it in the DiscountDays column.
Select all the data after the N to put into the BalanceDueDays column.
● Split the data into columns based on the space. From the new first column (which contains
the DiscountRate and DiscountDays information), take all numerical digits before the
slash and put them in the DiscountRate column; then, reading from right to left, take all
digits up to the forward slash and put them in the DiscountDays column. From the new
second column of the original split (i.e., the “N30” information), remove all alpha
characters and place any numerical characters in the BalanceDueDays column.
Notice that each of these approaches makes different assumptions about the structure of the
data contained in the original EarlyPayDiscount field. In the S&S example, all of these patterns
hold for the 17 rows displayed. However, when datasets contain millions of rows, identifying a
pattern that applies to all rows, or even a subset of rows, is much more challenging. When one
pattern cannot be identified for all rows, patterns can often be identified for sets of rows. These
patterns can be used to iteratively parse the appropriate sets of rows. The difficulty of parsing
data correctly is one reason the ETL process can be so challenging.
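As an illustration, the second approach above could be sketched in Python as follows. The exact layout of the EarlyPayDiscount field is not reproduced in this excerpt, so "2/10 N30"-style payment terms are assumed.

```python
import re

# Hypothetical payment-terms strings shaped like the pattern implied above.
terms = ["2/10 N30", "1/15 N45"]

parsed = []
for t in terms:
    # Rate before the slash, days between the slash and the first space,
    # balance-due days after the "N".
    match = re.match(r"^(\d+)/(\d+)\s+N(\d+)$", t)
    if match:
        discount_rate, discount_days, balance_due_days = match.groups()
        parsed.append({"DiscountRate": int(discount_rate),
                       "DiscountDays": int(discount_days),
                       "BalanceDueDays": int(balance_due_days)})
    else:
        # Rows that do not fit the pattern are flagged for another pass,
        # mirroring the iterative approach described above.
        parsed.append({"raw": t, "needs_review": True})

print(parsed)
```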
Data concatenation is combining data from two or more fields into a single field. In the
S&S data, Ashton would like to have one field that lists the full name of each vendor
representative, as shown in Figure 6-8. To do this, the original columns of FName and LName need
to be combined, or concatenated, into a single field. Notice that when FName and LName are
concatenated there is also a space added between the first and last name. Without the space,
the new column would list names like “DeShawnWilliams” and “MiltonArmstrong”, which
makes the data messy and would require later cleaning.

data concatenation - The combining of data from two or more fields into a single field.
Data concatenation is often used to create a unique identifier for a row. Notice in Figure
6-1 that ProductID has some of the same values as ProductCategory. That is, ProductCategory
1 has ProductIDs of 1, 2, and 3, and ProductCategory 2 also has ProductIDs of 1 and 2. Thus,
the ProductID does not uniquely identify each product; rather, the combination of ProductID
and ProductCategory identifies each product at S&S. In this case, it would be useful to
concatenate ProductID and ProductCategory so that each product can be uniquely identified. This
could be done by creating a new column labeled ProductCatID that adds a dash character, so
the final result is “1-1” and “1-2”, where the first number shows the product category number
and the second number shows the product ID.
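A brief sketch of both concatenation steps in pandas follows; the representative names and IDs are made up rather than taken from the actual Figure 6-1 rows.

```python
import pandas as pd

reps = pd.DataFrame({"FName": ["DeShawn", "Milton"],
                     "LName": ["Williams", "Armstrong"],
                     "ProdCat": [1, 2],
                     "ProdID": [1, 1]})

# Concatenate first and last names with a space so values do not run together.
reps["RepFullName"] = reps["FName"] + " " + reps["LName"]

# Build a unique product key by combining the category and product IDs with a dash.
reps["ProductCatID"] = reps["ProdCat"].astype(str) + "-" + reps["ProdID"].astype(str)

print(reps[["RepFullName", "ProductCatID"]])
```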
Ashton considers two possible solutions for this problem: (1) to replace the values in
the ProdID and ProdCat columns with the words rather than the numbers or (2) to add new
columns to the flat file that identify what the ProdID and the ProdCat values mean. Ashton
decides on the second option because keeping the original IDs will allow him to later merge
additional data into his file if he needs it. Typically, the cryptic data values problem is handled
through joining additional data into the file or by replacing the values. If storage space is not
an issue and there is no concern over the complexity of the working file, it is generally advis-
able to join data rather than replace values.
For some fields, there is a general understanding of what cryptic values mean. When a
field contains only two different responses, typically 0 or 1, this field is called a dummy
variable or dichotomous variable. In this case, standard practice is to use a 1 to signify
the presence and 0 the absence of the attribute. For example, if the field captures data about
whether a vendor is a preferred vendor or not, the value of 1 would suggest they are a preferred
vendor and 0 that they are not. With dummy variables, best practice is to give them a meaningful
name rather than a generic name. So, for the previous example, rather than naming the field
VendorStatus, naming the field PreferredVendor is superior because the user knows the meaning
of values 0 and 1 in this field without referring to the data dictionary.

dummy variable or dichotomous variable - A data field that contains only two responses, typically 0 or 1.
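A short sketch of recoding such a field as a well-named dummy variable; the VendorStatus values shown here are hypothetical.

```python
import pandas as pd

vendors = pd.DataFrame({"VendorName": ["B&D", "Calphalon", "Oster"],
                        "VendorStatus": ["Preferred", "Standard", "Preferred"]})

# Recode the status field as a dummy variable with a meaningful name:
# 1 = preferred vendor, 0 = not a preferred vendor.
vendors["PreferredVendor"] = (vendors["VendorStatus"] == "Preferred").astype(int)
print(vendors[["VendorName", "PreferredVendor"]])
```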
1970, not counting leap seconds. In Unix epoch time, 1:00 a.m. on November 11, 1979 is
stored as “311130000”. Data consistency within time data is particularly problematic when
it comes from different time zones. Thus, time data in most enterprise systems is stored
according to Coordinated Universal Time (UTC). When local time is needed, it is computed
based on the stored UTC value and the local time zone. Data consistency requires that the
information stored for each field be stored the same way. When storing a date value, one
field should not switch between the serial date and the epoch time storage designs. Rather, it
should use only one of these two designs.

This textbook categorizes data formatting and data consistency issues together because it
is often hard to determine from viewing data whether a problem is a formatting or a consistency
issue. Furthermore, when given a flat file or an extraction of data, fixing formatting and
consistency issues likely uses the same techniques. Distinguishing between formatting and
consistency issues is more important when trying to fix the original source data.

The S&S data has both data formatting and data consistency errors, and some errors that
could be caused by either or both issues. For example, the PhoneNumber field shows phone
numbers formatted in six different ways. Notice how sometimes the formats have parentheses
around the area code, sometimes there are spaces in different places, and the dashes are not
always included. An example of each of the different formats is listed in Figure 6-9.

FIGURE 6-9 Examples of Different Data Formats

PhoneNumber
907961-4917
(844) 986-0858
3605659487
(551) 7779393
(790)447-1783
(551) 7779393
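Returning to the epoch-time storage design described above, the following is a minimal sketch of converting the stored value back to UTC and then to a local time zone; the choice of time zone here is arbitrary.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Unix epoch time counts seconds elapsed since January 1, 1970 (UTC).
stored_value = 311_130_000  # the example value quoted above

# Convert the stored epoch value back to a human-readable UTC timestamp.
as_utc = datetime.fromtimestamp(stored_value, tz=timezone.utc)
print(as_utc)  # 1979-11-11 01:00:00+00:00

# Local time is computed from the stored UTC value and a time zone,
# rather than being stored separately for every location.
print(as_utc.astimezone(ZoneInfo("America/New_York")))
```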
From the S&S data extract, it is not clear whether the data is stored in the database in dif-
ferent ways (data consistency) or just displayed in different formats. The data dictionary says
that the data are intended to be stored in the same way (as a 10-digit number), so it is likely that
the problem was a formatting problem that occurred when creating the data extract. Regardless
of the underlying problem, the solution is that the PhoneNumber field needs to be manipulated
to be consistent in appearance, as inconsistent formatting of the data can make it more difficult
to analyze the data. In the phone number example, if someone wanted to extract the area code
for all of the phone numbers, he would need to perform a much more complex parsing and
extraction exercise because of the poor formatting. If the data always contained the area code
between the parentheses, it would make the extraction of the area code a relatively straightfor-
ward exercise.
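One way to standardize the PhoneNumber field is sketched below, assuming (per the data dictionary description above) that every value should reduce to a 10-digit U.S. number; values that do not are left alone so they can be investigated.

```python
import re

# Phone numbers formatted in several inconsistent ways, as in Figure 6-9.
raw_numbers = ["907961-4917", "(844) 986-0858", "3605659487",
               "(551) 7779393", "(790)447-1783"]

def standardize_phone(value: str) -> str:
    """Strip everything but digits, then reformat a 10-digit number as (XXX) XXX-XXXX."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 10:
        return value  # leave values that do not match the expected design for review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

for number in raw_numbers:
    print(standardize_phone(number))
```

With every value in one format, later steps such as extracting the area code become simple slicing rather than complex parsing.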
The S&S data in Figure 6-1 contain additional data consistency and data formatting prob-
lems in the following fields: VendorName, FName, ZipCode, EarlyPayDiscount, ProdCat,
TotalCosts, and AvgCostPerUnit. Take a minute to see if you can find the errors in each of
these fields before reviewing the bullet points below.
● VendorName: The Black and Decker vendor name is sometimes written in its entirety
and sometimes abbreviated B&D. This is likely the same company and not two different
companies with similar initials because the address is the same for both types of entries.
The same company name should be listed for every instance of Black and Decker.
● FName: There are two problems in this field. The same person’s name is sometimes
listed in different ways; that is, Milton Armstrong is listed as Milton Armstrong,
M. Armstrong, and Milton G. Armstrong. This is likely the same person because all the
different name iterations have the same address. The second problem is that the data is
aligned in different ways: lines 9, 14, and 16 are right justified and all the other entries
are left justified.
● ZipCode: Sometimes the field contains the 5-digit zip code and other times it contains
the 9-digit zip code.
● EarlyPayDiscount: The data in the field has inconsistent alignment (left, center, and
right). While alignment is unlikely to influence data analysis, good practice is to use con-
sistent alignment for a field.
● ProdCat: The field does not consistently show the same number of decimal places for
numbers. For this field, the decimals are meaningless, and the data should only show
whole numbers.
● TotalCosts: The field sometimes shows dollar signs and a different number of decimal
places for different values. Since total cost data is collected at the penny level, these
should all be formatted to that level.
● AvgCostPerUnit: The error is that the euro currency marker is used even though this
company is in the United States and the total cost information is denominated in the U.S.
dollar (see the data dictionary description).
Data Cleaning
A survey of 16,000 users on Kaggle, a popular website for data scientists owned by Google,
reveals that the biggest problem data scientists face is dirty data. Dirty data is data that is
inconsistent, inaccurate, or incomplete. To be useful, dirty data must be cleaned. Data cleaning
is the process of updating data to be consistent, accurate, and complete.

dirty data - Data that is inconsistent, inaccurate, or incomplete.
data cleaning - The process of updating data to be consistent, accurate, and complete.

Costs of dirty data can be exceptionally high, both in terms of dollars and reputation.
Fidelity’s Magellan fund learned this when it reported to shareholders that it expected
to pay a $4.32 per share dividend at the end of the year. However, no year-end dividend was
issued. Fidelity later reported to shareholders that they wouldn’t be receiving the predicted
dividend because a tax accountant missed inputting a minus sign on a $1.3 billion net capital
loss. Thus, the system treated the loss as a gain, a total error of $2.6 billion. Below are some
of the common errors to look for in cleaning data.
DATA DE-DUPLICATION
Data de-duplication is the process of analyzing the data and removing two or more records
that contain identical information. In the S&S data, one might be concerned there is significant
duplicate data in the columns related to vendor attributes (e.g., names, phone numbers,
addresses) because they show a lot of repeated data. Although these items repeat, it does not
mean that a row is a duplicate of another row. That is, each row represents a unique product
within a product category, and since some vendors sell multiple products to S&S, some of
their information is repeated. In contrast, row 17 appears to be an exact duplicate of row 6,
as the information on the two rows is identical. Note that the formatting of the rows sometimes
differs, but formatting is not considered for data de-duplication; rather, only the values of the
fields are compared. Row 17 should be examined in more detail to see if it occurred on the
exact same day at the same time and does not represent two similar orders. If found to be an
exact duplicate, in most circumstances, row 17 should be deleted.

data de-duplication - The process of analyzing data and removing two or more records that contain identical information.
Duplicate data means that all information for a record is exactly the same as another
record when, in reality, there were not two identical occurrences of whatever is measured.
Note there is one exception to this general rule. As seen in the S&S data, the value in the
RowNumber field is different for the two records. Often row identifiers are added to a file
without considering if there are duplicate rows. In this case, the identifier may not exactly
match between records even though all the rest of the information is duplicative. The second
record is still a duplicate even though the row identifiers don’t match, as is the case in the S&S
example where the field RowNumber has different values for row 6 and row 17. When looking
for duplicate values, be aware of how the data is constructed to decide whether the row identi-
fiers are likely to indicate a duplicate or not.
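A minimal sketch of de-duplication that ignores the row identifier when comparing records is shown below; the rows and amounts are illustrative, not the actual Figure 6-1 values.

```python
import pandas as pd

# Simplified rows modeled on the situation described above: the last row repeats
# an earlier row exactly except for the RowNumber identifier.
data = pd.DataFrame({
    "RowNumber":  [6, 7, 17],
    "VendorName": ["Oster", "Calphalon", "Oster"],
    "ProdCat":    [1, 1, 1],
    "ProdID":     [2, 1, 2],
    "TotalCosts": [14010.05, 19509.75, 14010.05],
})

# Compare every field except the row identifier, which was added after the fact.
compare_cols = [c for c in data.columns if c != "RowNumber"]
duplicates = data[data.duplicated(subset=compare_cols, keep="first")]
print(duplicates)          # rows flagged as duplicates of an earlier row

deduplicated = data.drop_duplicates(subset=compare_cols, keep="first")
print(deduplicated)
```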
DATA FILTERING
The process of removing records or fields of information from a data source is called
data filtering. Companies often collect information about many vendors, even if they have
not purchased from all of them. In the transformation of data, a data analyst might filter
out inactive vendors (no purchases in the last year) to analyze active vendors. This is an
example of filtering information. Typically, columns are filtered out of the data during
the extraction process; however, when extra columns are included in the data extraction,
removing unnecessary columns of data can be performed in the transformation process. In
the example of vendors, the data analyst may receive an extraction of vendor information
that contains the vendor representative. If this field is not necessary for the analysis, it can
be filtered out of the dataset.

data filtering - The process of removing records or fields of information from a data source.
The S&S data has several irrelevant records included in the file. Ashton intended to ana-
lyze products in the kitchen appliances and fans product category—product category 1 and
2—but his file also contains information about product category 3, hand tools. Ashton can
filter the data to delete the two ProdCat 3 records.
Another important filtering question is what to do with missing or empty values (referred
to as null values). Notice in the S&S data that the EarlyPayDiscount field contains null values
for several rows. There is no hard rule for what to do with null values. The first step to take
when a null value is encountered is to go back to the original source data and verify that the
information was not lost in extracting the data. If the original source data contains the missing
information, the missing data should be added to the file.
Once the missing values are verified as validly missing or empty, there are generally three
approaches to take. The first approach retains the null values and continues with the transfor-
mation process. When the data is analyzed, the analyst will decide how to manage the null
values. The second approach is to delete the record with null values and document the deletion
in the analysis log because it may change inferences made when analyzing the data. The third
approach is to perform a data imputation, which is the process of replacing a null or missing
value with a substituted value. Data imputation only works with numeric data (so it is not
possible for the missing payment terms in the S&S data).

data imputation - The process of replacing a null or missing value with a substituted value.
There are many different types of data imputation, each with strengths and weaknesses.
One approach is to replace all null values with the mean of the column of the non-null values
in the data. Other imputations replace the value with a random value from the dataset or use
statistical models, like regression, to compute a replacement value. Regardless of the method
chosen, it is critical to document in the analysis log when values have been imputed and the
method used. The analyst often adds an additional column that uses a dichotomous value to
indicate which values have been imputed.
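A short sketch combining filtering on product category with mean imputation and an imputation-indicator column follows; the data values are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "VendorName": ["B&D", "Calphalon", "Honeywell", "Oster"],
    "ProdCat":    [1, 1, 2, 3],
    "UnitsPurch": [100, None, 80, 120],   # one missing numeric value
})

# Filtering: keep only the product categories relevant to the analysis
# (categories 1 and 2 in the S&S example).
active = data[data["ProdCat"].isin([1, 2])].copy()

# Imputation: flag which values are missing, then replace them with the
# column mean so the analysis log and later analyses can account for it.
active["UnitsPurchImputed"] = active["UnitsPurch"].isna().astype(int)
active["UnitsPurch"] = active["UnitsPurch"].fillna(active["UnitsPurch"].mean())

print(active)
```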
Data Validation
Before, during, and while finalizing the data transformation process, one should analyze
whether the data has the properties of high-quality data. Analyzing data to make sure it has
these properties is called data validation.

data validation - The process of analyzing data to make certain the data has the properties of high-quality data: accuracy, completeness, consistency, timeliness, and validity.

Data validation is both a formal and informal process. Often companies have a formal
data validation process that follows the transformation process. Informally, data validation is
performed in parallel with each transformation. For example, if a column of information listing
customer names shows that all names were recorded fully capitalized, one might want to
change the formatting such that only the first letter of the first and last name are capitalized.
After performing this operation, it is important to check to see if it was done correctly and
completely, which is an example of informal data validation.
Data validation is an important precursor to data cleaning. The techniques used to validate
data once cleaned are also useful for discovering problems in the data that need to be fixed.
Thus, data validation is an iterative process that both helps identify what needs to be trans-
formed in the data as well as verify that the data has been transformed correctly.
The techniques used to validate data can be thought of as a continuum from simple to
complex. As there is rarely a clear stopping point for data validation, the importance of the
data analysis and the availability of resources (both time and money) will guide the decision
of which of these techniques to use. These techniques include visual inspection, basic
statistical tests, auditing a sample of data, and advanced testing techniques.
VISUAL INSPECTION
Visual inspection is exactly what it sounds like: examining data using human vision to see if
there are problems. When examining data, one should look for each of the different problems
listed above. Visual inspection is particularly helpful in finding data standardization issues
because it is easy to quickly scan an entire column. Visual scanning also is useful in identifying
if data parsing or concatenation is necessary, finding if data is misfielded, and identifying data
formatting and data consistency problems.

visual inspection - Examining data using human vision to see if there are problems.
Visual inspection is easy to accomplish when the dataset is small, but even when large,
visually inspecting the data can be done efficiently. Consider the following two techniques.
First, sort the data in each column in ascending and then descending order. The sorting of data
can reveal missing values and the wrong type of data in a column. For example, if a column
should have numeric values, sorting will show if there are also characters contained in some
entries in the column.
Second, copy a column of data to a new environment and then remove all duplicate values
in the data copied. The reduced list, especially coupled with sorting, can be reviewed more
quickly and may show data consistency errors like the “Milton Armstrong” naming problems
discussed above. It can also help show if there are inappropriate values in the data. For ex-
ample, if the data should only contain whole numbers between 1 and 5, examining a list of
unique values will reveal if some entries contain decimals or are outside the acceptable range.
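A brief sketch of both techniques in pandas follows; the column values are invented to mirror the kinds of problems described above.

```python
import pandas as pd

# Hypothetical column values: the same person's name written several ways
# and a product category stored as text instead of a whole number.
data = pd.DataFrame({"FName": ["Milton", "M.", "Milton G.", "DeShawn"],
                     "ProdCat": [1, 2, 1, "2.0"]})

# Sorting a column (here as text, so mixed types do not break the comparison)
# pushes unexpected values such as "2.0" together where they are easy to spot.
print(data["ProdCat"].astype(str).sort_values())

# A de-duplicated, sorted list of values is quick to scan for consistency
# problems such as the multiple versions of the same representative's name.
print(sorted(data["FName"].unique()))
```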
AUDIT A SAMPLE
One of the best techniques for assuring data quality is to audit a sample of the data. Select a
sample of rows, columns, or both and analyze each value to make sure it is valid. One can trace
the transformed values back to the original data sources to make sure they accurately reflect
what was originally captured. To ensure completeness, select a sample of data items from the
original data sources and make sure all those items are listed in the final dataset.
There are many ways to choose a sample of data to audit. If there is a particular data item
that is very important, select all records that have that data item present. In contrast, if un-
derstanding the cleanliness of the overall dataset is desired, selecting a random sample will
allow for the computation of an overall error rate in the population. As an example, imag-
ine the dataset Ashton selects has 100,000 records. Ashton chooses to audit 1,000 records,
or 1% of the total number of records. If in those 1,000 records Ashton finds 70 errors,
Ashton can compute a 7% error rate for his sample. Because Ashton randomly selected
the 1,000 records, Ashton can assume the 7% error rate is the same across the entire set of
100,000 transactions—meaning there are likely 7,000 errors in the entire dataset. If Ashton
tracks each type of error, he can estimate the quantity of each type in the transformed dataset.
If the error rate exceeds an acceptable level, Ashton should continue the standardization and
cleaning process.
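The error-rate projection in the example can be sketched in a few lines; the population size and error count simply restate the numbers above.

```python
import random

# Hypothetical population of 100,000 record IDs, mirroring the example above.
random.seed(1)
population = list(range(100_000))

# Draw a 1% random sample (1,000 records) without replacement.
sample = random.sample(population, k=1_000)

# Suppose the audit of the sample finds 70 records with errors.
errors_found = 70
error_rate = errors_found / len(sample)          # 7% sample error rate

# Because the sample was random, project the rate to the population.
estimated_errors = error_rate * len(population)  # roughly 7,000 records
print(f"Sample error rate: {error_rate:.1%}")
print(f"Estimated errors in population: {estimated_errors:,.0f}")
```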
FIGURE 6-10 Transformed Extract of S&S Vendor Data
KEY TERMS
AIS in Action
CHAPTER QUIZ
1. In which step of the data transformation process would you analyze whether the data has
the properties of high-quality data?
a. data structuring
b. data standardization
c. data cleaning
d. data validation
2. As part of the data standardization process, often items contained in different fields for the
same record need to be combined into a single field. This process is called:
a. aggregating data
b. data aggregation
c. data concatenation
d. data parsing
3. Which of the following reasons describes why transforming data is necessary?
a. Data aggregated at different levels needs to be joined.
b. Data within a field has various formats.
c. Multiple data values are contained in the same field and need to be separated.
d. All of the above
Use the following image to answer questions 4 through 10.
DISCUSSION QUESTIONS
6.1 Why is transforming data necessary, and why does it take so much time? What ways can
you think of to reduce the time needed to transform data?
6.2 For each of the data quality attributes listed in Table 6-1, discuss a business scenario that
would be harmed if data did not have the data attribute. As you prepare these scenarios,
are there other attributes of data that should be added to Table 6-1? If so, why?
6.3 Companies are increasingly operating throughout the entire world. As such, the data
companies collect can differ based on the country where the data is captured. This chap-
ter identified two problems that can exist in data based on operating throughout the
world: different formats for dates and capturing time stamps in different locations. What
are other possible challenges you would observe in data captured throughout the world?
6.4 What are the strengths and weaknesses of each of the four data validation procedures
discussed in this chapter? What are other possible ways to validate data?
PROBLEMS
___ 1. aggregate data
___ 2. cryptic data values
___ 3. data cleaning
___ 4. data concatenation
___ 5. data consistency
___ 6. data contradiction errors
___ 7. data de-duplication
___ 8. data entry errors
___ 9. data filtering
___ 10. data imputation
___ 11. data parsing
___ 12. data pivoting
___ 13. data standardization
___ 14. data structuring
___ 15. data threshold violations
___ 16. data validation
___ 17. dichotomous variable
___ 18. dirty data
___ 19. dummy variable
___ 20. misfielded data values
___ 21. violated attribute dependencies
___ 22. visual inspection

a. Process of analyzing data to make certain the data has the properties of high-quality data: accuracy, completeness, consistency, timeliness, and validity
b. Data values that are correctly formatted but not listed in the correct field
c. All types of errors that come from inputting data incorrectly
d. Examining data using human vision to see if there are problems
e. Process of tracing extracted or transformed values back to their original source
f. Data items that have no meaning without understanding a coding scheme
g. The process of standardizing the structure and meaning of each data element so it can be analyzed and used in decision making
h. Technique that rotates data from a state of rows to a state of columns
i. Errors that occur when a secondary attribute in a row of data does not match the primary attribute
j. Data field that contains only two responses, typically a 0 or 1; also called a dummy variable
k. Principle that every value in a field should be stored in the same way
l. Process of changing the organization and relationships among data fields to prepare the data for analysis
m. Data errors that occur when a data value falls outside an allowable level
n. Data field that contains only two responses, typically a 0 or 1; also called a dichotomous variable
o. Process of updating data to be consistent, accurate, and complete
p. Data that is inconsistent, inaccurate, or incomplete
q. Separating data from a single field into multiple fields
r. Combining data from two or more fields into a single field
s. Process of replacing a null or missing value with a substituted value
t. Process of removing records or fields of information from a data source
u. Process of analyzing data and removing two or more records that contain identical information
v. Presentation of data in a summarized form
w. Error that occurs when the same entity is described in two conflicting ways
x. Process of ordering data to reveal unexpected values
REQUIRED
Using this data, prepare a PivotTable in a new sheet to answer each of the following
questions:
a. Have total department budgets changed each year? To answer the question, create a
PivotTable that shows the budgeted amount of expenditures for each fiscal year. Do
not include grand totals. Add conditional formatting data bars to show which amounts
are the greatest.
b. Which funds have the largest expense budgets for fiscal year 6? Create a PivotTable
that shows fund names and budgeted amounts for fiscal year 6. Sort the data so the
greatest budget amounts are listed at the top.
c. Regardless of department, organization, or fund, which type of activity was most
costly during the entire time period (hint: view the “Detail_Description” field)? How
much did the city pay for this activity?
6.3 Excel Project: Aggregating Data at Different Levels
You are an internal auditor for the city of Burlington, Vermont. Download the data file “P6-
3BurlingtonVermontData.xlsx” from the student download page at http://www.pearson
globaleditions.com, which contains the annual account balance information for city de-
partments for six fiscal years. There are two sheets in this workbook. The “Annual Data”
sheet contains data aggregated to the annual level for the city departments. The “Monthly
Data” sheet contains data aggregated to the monthly level for the city departments. You
are planning to perform audit procedures on the “Monthly Data.” Before you do, verify
that the data in this sheet matches the data in the “Annual Data” worksheet, which you
already verified as correct.
REQUIRED
Analyze the two sheets. Based on your analysis, answer the following questions:
a. Under what circumstances can you NOT use the “Annual Data” sheet for your audit?
Said differently, why might you need the data in the “Monthly Data” sheet for your
audit?
b. On a separate worksheet in Excel, create a summary of the data that shows the total
dollar amount of transactions for the two different sheets. Are these the same for both
datasets?
c. Does the total amount for transactions differ for the different departments and years?
Create two sheets: the first sheet should compare departments and the second sheet
should compare years. What do you learn from these analyses?
d. Based on your analysis in the previous questions, suggest the areas you believe are
most important to investigate further. Why do you believe these areas are the most
important to investigate further?
6.4 Excel Project: Parsing Data
Go to the student download page at http://www.pearsonglobaleditions.com and download
the file labeled “P6-4CustomerData.xlsx”. For this problem, assume your supervisor gave
you this file and said it contains information about customers. The file has several fields
merged together, and your supervisor wants you to parse them into separate fields.
REQUIRED
Separate the fields as instructed in parts a and b below.
a. Parse the data on the “Data – Simplified” sheet. Put your solution on the “Simplified
Solution” sheet. The image below shows the solution for the first three rows. Your
solution should separate the information using only formulas. Leave the formulas in
your solution file for evaluation. As a hint, this problem can be solved using a combi-
nation of the LEFT, RIGHT, MID, FIND, and VALUE formulas.
b. Parse the data on the “Data – Advanced” sheet. Put your solution on the “Advanced
Solution” sheet. The image below shows the solution for the first three rows. Your
solution should separate the information using only formulas. Leave the formulas in
your solution file for evaluation. As a hint, this problem can be solved using a com-
bination of the LEFT, RIGHT, MID, FIND, TRIM, SUBSTITUTE, REPT, LEN, and
VALUE formulas.*
REQUIRED
Do the following:
a. Fill out the “Final Report” sheet so that it looks like the image below. You should use
formulas to prepare the final report. Leave the formulas in your solution file for evalu-
ation. As a hint, this problem can be solved using a combination of the VLOOKUP
and CONCATENATE functions. Add columns to the sheets to complete the task with
these formulas as needed.
b. Fill out the “Final Report – Advanced” sheet so that it looks like the image below.
You should use formulas to prepare the final report. Leave the formulas in your solu-
tion file for evaluation. For this problem, you are only allowed to use combinations
of the following functions: INDEX, MATCH, and CONCATENATE. You will NOT
need to add columns to complete the task.
c. Compare and contrast the formulas you used to answer part a and part b. Research
both types of formulas and discuss which formulas are better for use in practice.*
6.6 Excel Project: Formatting Data Consistently
You are preparing to analyze data about the Washington, DC Public Schools from
January 2009 to February 2019. Go to the student download page at http://www.pearson
globaleditions.com and download the file labeled “P6-6WashingtonDCData.xlsx”. Notice
that the formatting of the data in the fields “Transaction_Date” and “Transaction_Amount”
has become corrupt and each entry can be formatted in many ways. Note that all of the data in
the “Transaction_Date” field contains a date (so any numbers are stored as date values). All of
the data listed in the “Transaction_Amount” column refer to dollar amounts.
REQUIRED
For the two fields, format the data in each field to be consistent within each column.
Justify the choice you made for the data format you chose to use for each column.
Describe the process you went through to make sure you formatted all the data correctly.
The data in the “Transaction_Date” column only needs to be accurate at the day, month,
and year level (so do not worry about time information in the column if it is present).
The data in the “Transaction_Amount” column needs to be accurate at the penny level.*
REQUIRED
a. Determine whether all the data in the sample transactions matches records in the orig-
inal data. If there are errors, document any you find.
b. Once you are finished, estimate how many errors are likely for each field in the 2019
data.
6.8 For the S&S case discussed in this chapter, you receive the following output containing
basic descriptive statistics for some of the columns in the full dataset (the chapter ex-
ample problem contained a small excerpt of data; this problem uses more data). S&S has
a total of 60 products that customers purchase across 3 categories.
REQUIRED
List the concerns you have with the data and discuss what steps you would take for each
concern identified.
REQUIRED
a. Create a fake dataset that has at least 20 rows and 6 columns. The dataset can relate
to any setting in accounting or business. In the dataset, include at least one of each
of the following types of errors: cryptic data values, data contradiction errors, data
entry errors, data threshold violations, misfielded data values, and violated attribute
dependencies.
b. Prepare a memo to go with your dataset that describes the nature of the data. Then
define each error and describe the example error you included in your dataset.
6.10 Search the Internet for an example of a business project failure caused by a failure in
some part of the ETL process. Based on the example you find, discuss the principles
from this chapter that the company failed to follow. In addition, discuss the reasons why
you think the company failed to follow these principles.
The following case was developed by the Ernst & Young Academic Resource Center, which is
sponsored by the Ernst & Young Foundation.
For this case, you received a data file, Analytics_mindset_case_studies_ETL_Case4.xlsx
(go to the student download page at http://www.pearsonglobaleditions.com to download this
file). It includes 789 lines of journal entries for 11 days from a hotel and conference center
(on the tab labeled JELineItems) as well as other important accounting-relevant datasets on
these other tabs: BusinessUnits, ChartOfAccounts, PreparerInfo, and Source. The following
is a select list of data fields from this file noting the field name and field description tabs on
which the data field is located.