Data quality issue
Incorrect rows
Summary rows
Extra rows
Missing Column Names
Fix rows and columns
Inconsistent column names
Unnecessary columns
Columns containing Multiple data values
No Unique Identifier
Misaligned columns
Disguised Missing values
Missing Values
Significant number of missing values in a row/column
Partial missing values
Non-standard units
Values with varying Scales
Standardise Numbers
Over-precision
Remove outliers
Extra characters
Different cases of same words
Standardise Text
Non-standard formats
Encoding Issues
Incorrect data types
Correct values not in list
Fix Invalid Values
Wrong structure
Correct values beyond range
Validate internal rules
Duplicate data
Filter Data
Extra/Unnecessary rows
Columns not relevant to analysis
Dispersed data
Examples
Header rows, footer rows
Total, subtotal rows
Column numbers, indicators, blank rows
Column names as blanks, NA, XX etc.
X1, X2, C4, which give no information about the column
Unidentified columns, irrelevant columns, blank columns
E.g. address columns containing city, state, country
E.g. Multiple cities with same name in a column
Shifted columns
Blank strings, "NA", "XX", "999", etc.
Missing time zone, century, etc.
Convert lbs to kgs, miles/hr to km/hr
A column containing marks in subjects, with some subject
marks out of 50 and others out of 100
4.5312341 kgs, 9.323252 meters
Abnormally High and Low values
Common prefix/suffix, leading/trailing/multiple spaces
Uppercase, lowercase, Title Case, Sentence case, etc
23/10/16 to 2016/10/23
"Modi, Narendra" to "Narendra Modi"
CP1252 instead of UTF-8
Number stored as a string: "12,300"
Date stored as a string: "2013-Aug"
String stored as a number: PIN Code "110001" stored as 110001
Non-existent country, PIN code
Phone number with over 10 digits
Temperature less than -273.15 °C (0 K)
Gross sales > Net sales
Date of delivery > Date of ordering
If Title is "Mr" then Gender is "M"
Identical rows, rows where some columns are identical
Rows that are not required in the analysis. E.g. if only
observations before or after a particular date are
required for analysis, other rows become unnecessary
Columns that are not needed for analysis e.g. Personal
Detail columns such as Address, phone column in a
dataset for
Parts of data required for analysis stored in different files
or part of different datasets
How to resolve
Delete
Delete
Delete
Add the column names
Add column names that give some information
about the data
Delete
Split columns into components
Combine columns to create unique identifiers
e.g. combine City with the State
Align these columns
Set values as missing values
Delete rows, columns
Fill the missing values with the correct value
Standardise the observations so all of them
have the same consistent units
Make the scale common. E.g. a percentage scale
Standardise precision for better presentation of
data. 4.5312341 kgs could be presented as 4.53
kgs
Correct if caused by a mistake, else remove
Remove the extra characters
Standardise the case/bring to a common case
Correct the format/Standardise format for
better readability in R
Re-encode the text in the correct Unicode encoding (e.g. UTF-8)
Convert to Correct data type
Delete the invalid values, treat as Missing
Deduplicate data/Remove duplicated data
Filter rows to keep only the relevant data.
Filter columns: pick columns relevant to analysis
Bring the data together, Group by required keys,
aggregate the rest
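
A few of the fixes listed above can be sketched in code. The snippet below is a minimal illustration in plain Python (chosen here for illustration only; the source mentions R), not a definitive implementation. All function names and the `MISSING_TOKENS` set are hypothetical helpers introduced for this sketch, and the `dd/mm/yy` date layout is an assumption based on the "23/10/16" example.

```python
import re
from datetime import datetime

# Sentinel tokens treated as disguised missing values (from the examples above).
# Assumption: the same token list applies to every column.
MISSING_TOKENS = {"", "NA", "XX", "999"}
LBS_TO_KG = 0.45359237  # one pound in kilograms


def to_missing(value):
    """Return None for blank strings and sentinel tokens, else the stripped text."""
    text = value.strip()
    return None if text in MISSING_TOKENS else text


def clean_text(value):
    """Collapse leading/trailing/multiple spaces and bring the text to Title Case."""
    return re.sub(r"\s+", " ", value).strip().title()


def reorder_name(value):
    """Turn 'Surname, Given' into 'Given Surname'; leave other values unchanged."""
    if "," not in value:
        return value
    surname, given = (part.strip() for part in value.split(",", 1))
    return f"{given} {surname}"


def parse_number(value):
    """Convert a number stored as a string, e.g. '12,300', to an int."""
    return int(value.replace(",", ""))


def parse_date(value):
    """Convert a dd/mm/yy string (assumed layout) to an ISO yyyy-mm-dd string."""
    return datetime.strptime(value, "%d/%m/%y").date().isoformat()


def lbs_to_kg(pounds, digits=2):
    """Standardise units (lbs to kgs) and precision in one step."""
    return round(pounds * LBS_TO_KG, digits)


def rescale_to_percent(mark, out_of):
    """Bring marks scored out of 50, 100, etc. onto a common percentage scale."""
    return 100.0 * mark / out_of


def is_valid_temperature_c(celsius):
    """Range check: nothing can be colder than absolute zero."""
    return celsius >= -273.15


def deduplicate(rows):
    """Drop identical rows (dicts), keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Note that a token such as "999" may be a legitimate value in some columns, so in practice the list of missing-value tokens would likely need to be chosen per column rather than globally.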