
Data Quality Issues & Solutions

Uploaded by

diop samba
Copyright
© All Rights Reserved

Data quality issue | Examples | How to resolve

Fix rows and columns
Incorrect rows | Header rows, footer rows | Delete
Summary rows | Total, subtotal rows | Delete
Extra rows | Column numbers, indicators, blank rows | Delete
Missing column names | Column names given as blanks, NA, XX, etc. | Add the column names
Inconsistent column names | X1, X2, C4, which give no information about the column | Add column names that give some information about the data
Unnecessary columns | Unidentified columns, irrelevant columns, blank columns | Delete
Columns containing multiple data values | E.g. address columns containing city, state, country | Split columns into components
No unique identifier | E.g. multiple cities with the same name in a column | Combine columns to create unique identifiers, e.g. combine City with State
Misaligned columns | Shifted columns | Align these columns

Missing Values
Disguised missing values | Blank strings, "NA", "XX", "999", etc. | Set values as missing values
Significant number of missing values in a row/column | - | Delete rows, columns
Partial missing values | Missing time zone, century, etc. | Fill the missing values with the correct value

Standardise Numbers
Non-standard units | Convert lbs to kg, miles/hr to km/hr | Standardise the observations so all of them have the same consistent units
Values with varying scales | A column containing marks in subjects, with some subject marks out of 50 and others out of 100 | Make the scale common, e.g. a percentage scale
Over-precision | 4.5312341 kg, 9.323252 meters | Standardise precision for better presentation of data; 4.5312341 kg could be presented as 4.53 kg
Remove outliers | Abnormally high and low values | Correct if entered by mistake, else remove

Standardise Text
Extra characters | Common prefix/suffix, leading/trailing/multiple spaces | Remove the extra characters
Different cases of same words | Uppercase, lowercase, Title Case, Sentence case, etc. | Standardise the case/bring to a common case
Non-standard formats | 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi" | Correct the format/standardise the format for better readability in R
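
Several of the fixes listed above (disguised missing values, unit and scale conversion, precision, and text standardisation) can be sketched in a few small functions. The sketch below is only an illustration in Python using the standard library; the checklist itself mentions R, so the choice of Python is an assumption of this example. All function names, the set of missing-value markers, and the assumed dd/mm/yy input date format are hypothetical and would need adjusting for a real dataset.

```python
import re
from datetime import datetime

# Hypothetical markers for disguised missing values; adjust per dataset.
MISSING_MARKERS = {"", "NA", "XX", "999"}
LB_TO_KG = 0.45359237  # one international pound in kilograms


def to_missing(value):
    """Return None for disguised missing values, else the stripped string."""
    text = value.strip()
    return None if text in MISSING_MARKERS else text


def lbs_to_kg(pounds, digits=2):
    """Convert pounds to kilograms and round to a fixed precision."""
    return round(pounds * LB_TO_KG, digits)


def to_percentage(marks, max_marks):
    """Bring marks out of different maxima onto a common percentage scale."""
    return 100.0 * marks / max_marks


def clean_text(value):
    """Collapse repeated whitespace, trim the ends, and use Title Case."""
    return re.sub(r"\s+", " ", value).strip().title()


def standardise_date(value):
    """Convert dd/mm/yy to yyyy/mm/dd (the input format is an assumption)."""
    return datetime.strptime(value, "%d/%m/%y").strftime("%Y/%m/%d")


def standardise_name(value):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    if "," not in value:
        return value.strip()
    last, first = (part.strip() for part in value.split(",", 1))
    return f"{first} {last}"


print(to_missing(" NA "))                  # None
print(lbs_to_kg(10))                       # 4.54
print(to_percentage(40, 50))               # 80.0
print(clean_text("  new   DELHI "))        # New Delhi
print(standardise_date("23/10/16"))        # 2016/10/23
print(standardise_name("Modi, Narendra"))  # Narendra Modi
```

Keeping each fix in its own function makes it easy to apply them column by column, and to swap a rule (for example a different set of missing markers) without touching the others.
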
Data quality issue | Examples | How to resolve

Fix Invalid Values
Encoding issues | CP1252 instead of UTF-8 | Encode Unicode properly
Incorrect data types | Number stored as a string: "12,300"; date stored as a string: "2013-Aug"; string stored as a number: PIN code "110001" stored as 110001 | Convert to the correct data type
Correct values not in list | Non-existent country, PIN code | Delete the invalid values, treat as missing
Wrong structure | Phone number with over 10 digits | Delete the invalid values, treat as missing
Correct values beyond range | Temperature less than -273 °C (0 K) | Delete the invalid values, treat as missing
Validate internal rules | Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M" | Delete the invalid values, treat as missing

Filter Data
Duplicate data | Identical rows, rows where some columns are identical | Deduplicate data/remove duplicated data
Extra/unnecessary rows | Rows that are not required in the analysis, e.g. if only observations before or after a particular date are required for analysis, other rows become unnecessary | Filter rows to keep only the relevant data
Columns not relevant to analysis | Columns that are not needed for analysis, e.g. personal detail columns such as address, phone column in a dataset | Filter columns: pick columns relevant to analysis
Dispersed data | Parts of data required for analysis stored in different files or part of different datasets | Bring the data together, group by required keys, aggregate the rest
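
The invalid-value and filtering fixes above can be combined into one small pipeline. The following is a minimal sketch in standard-library Python, not a definitive implementation: the record fields (`id`, `amount`, `pin`, `ordered`, `delivered`), the six-digit PIN rule, and every function name are assumptions made for illustration only.

```python
from datetime import date


def parse_amount(text):
    """Convert a number stored as a string, e.g. '12,300', to an int."""
    return int(text.replace(",", ""))


def blank_invalid(row):
    """Return a copy of row with invalid values replaced by None (missing)."""
    fixed = dict(row)
    # Structure rule (assumed): a PIN code is exactly six digits.
    if not (fixed["pin"].isdigit() and len(fixed["pin"]) == 6):
        fixed["pin"] = None
    # Internal rule: delivery cannot come before the order.
    if fixed["delivered"] < fixed["ordered"]:
        fixed["delivered"] = None
    return fixed


def deduplicate(rows):
    """Drop rows identical to an earlier row, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def prepare(rows, keep_columns, since):
    """Deduplicate, blank invalid values, fix types, then filter rows and columns."""
    result = []
    for row in deduplicate(rows):
        row = blank_invalid(row)
        row["amount"] = parse_amount(row["amount"])
        if row["ordered"] >= since:  # keep only rows relevant to the analysis
            result.append({name: row[name] for name in keep_columns})
    return result


# Hypothetical records, made up for this example.
orders = [
    {"id": "A1", "amount": "12,300", "pin": "110001",
     "ordered": date(2016, 10, 20), "delivered": date(2016, 10, 23)},
    {"id": "A1", "amount": "12,300", "pin": "110001",
     "ordered": date(2016, 10, 20), "delivered": date(2016, 10, 23)},
    {"id": "B2", "amount": "450", "pin": "1100012",
     "ordered": date(2016, 11, 5), "delivered": date(2016, 11, 1)},
    {"id": "C3", "amount": "80", "pin": "560001",
     "ordered": date(2015, 1, 2), "delivered": date(2015, 1, 5)},
]

for row in prepare(orders, ["id", "amount", "pin", "delivered"], date(2016, 1, 1)):
    print(row)
```

With this input the duplicate A1 row is dropped, B2 keeps its row but has its PIN and delivery date set to missing, and C3 is filtered out because it predates the cut-off. Blanking invalid values rather than dropping the whole row follows the "treat as missing" resolution in the checklist.
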
