Data Cleaning Checklist: Data Constraints Problems, Examples in Action, Potential Solutions
Data cleaning is often said to take up as much as 80% of the data science workflow. Use this checklist to identify and resolve any quality issues with your data.

Uniqueness constraints
Ensuring that there are no exact or almost exact duplicates within your rows.

Example: A duplicate row where the name and phone_number columns are identical, but not the height_cm column.

name            height_cm    phone_number
Carl Rosseel    177          (555) 200-5598

Potential solutions:
- Keep only one of the exact duplicate rows
- Merge rows that are non-exact duplicates
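
Both solutions for duplicates can be sketched in pandas. This is a minimal illustration, not a definitive recipe: the extra rows are hypothetical sample data, and averaging height_cm when merging near-duplicates is an assumed rule that you would replace with whatever your domain dictates.

```python
import pandas as pd

# Hypothetical sample data: only the first row comes from the checklist.
df = pd.DataFrame({
    "name": ["Carl Rosseel", "Carl Rosseel", "Carl Rosseel"],
    "height_cm": [177, 177, 178],
    "phone_number": ["(555) 200-5598"] * 3,
})

# Solution 1: keep only one of the exact duplicate rows.
exact_deduped = df.drop_duplicates()

# Solution 2: merge non-exact duplicates. Rows are grouped on the
# identifying columns and the conflicting column is aggregated.
# Using the mean here is an assumption, not a rule from the checklist.
merged = (
    exact_deduped
    .groupby(["name", "phone_number"], as_index=False)
    .agg(height_cm=("height_cm", "mean"))
)

print(merged)
```

Which columns identify a record (here name and phone_number) is itself a judgment call that depends on the dataset.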

Length violation for text data
Ensuring that text columns that follow a specific standard have the same string length.

Example: A phone_number value that is 9 characters long instead of 14.

name            height_cm    phone_number
Carl Rosseel    177          (555) 200-5598

Potential solutions:
- Drop rows that are affected by the length violation
- Set affected observations to missing
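
A length check along these lines could be written in pandas as follows. The second row is a hypothetical record added to trigger the violation, and the expected length of 14 is taken from the "(555) 200-5598" format shown in the example.

```python
import pandas as pd

# The "Jane Doe" row is hypothetical and has a 9-character phone number.
df = pd.DataFrame({
    "name": ["Carl Rosseel", "Jane Doe"],
    "height_cm": [177, 165],
    "phone_number": ["(555) 200-5598", "200-55981"],
})

EXPECTED_LENGTH = 14  # length of "(555) 200-5598"

# Boolean mask marking rows whose phone number has the wrong length.
violates = df["phone_number"].str.len() != EXPECTED_LENGTH

# Solution 1: drop rows affected by the length violation.
dropped = df[~violates]

# Solution 2: keep the rows but set the affected values to missing.
flagged = df.copy()
flagged.loc[violates, "phone_number"] = pd.NA

print(flagged)
```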

Text data inconsistent formatting
Ensuring that text columns that follow a specific standard have the same string formatting.

Example: A phone_number column that contains different phone number formats.

name            height_cm    phone_number
Carl Rosseel    177          (555) 200-5598
Carl Rosseel    178          (555) 200-5598

Potential solutions:
- Standardize formatting for affected observations
- Drop rows that are affected by the inconsistency
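
One way to standardize formatting is to strip each value down to its digits and rebuild it in a single target format. The sketch below assumes 10-digit numbers and the "(555) 200-5598" layout as the target; the raw values are hypothetical, and values that cannot be standardized are set to missing so they can be dropped afterwards.

```python
import pandas as pd

# Hypothetical raw values in several formats; the last one is incomplete.
phones = pd.Series([
    "(555) 200-5598",
    "555-200-5598",
    "555.200.5598",
    "200-5598",
])

# Remove every non-digit character.
digits = phones.str.replace(r"\D", "", regex=True)

# Rebuild 10-digit numbers in the target format "(XXX) XXX-XXXX".
standardized = digits.str.replace(
    r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", regex=True
)

# Anything that is not exactly 10 digits becomes missing.
standardized = standardized.where(digits.str.len() == 10)

# Dropping the missing values removes the rows that could not be fixed.
cleaned = standardized.dropna()

print(standardized)
```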

Crossfield validation for numeric columns
Crossfield validation is when we use multiple fields in a dataset to ensure the validity of another. For example, ensuring that part-to-whole columns add up to a relevant total (flight bookings per class add up to the total recorded bookings).

date          economy    first class    total
05-18-2022    250        50             300
05-19-2022    200        50             200

Potential solutions:
- Drop rows where sanity checks fail
- Apply rules from domain knowledge based on knowing the data
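
The part-to-whole check on the bookings table can be sketched in pandas as shown here. Recomputing the total from the class columns is only one possible repair rule, and it assumes the per-class counts are the trustworthy fields.

```python
import pandas as pd

# The two rows from the bookings example.
bookings = pd.DataFrame({
    "date": ["05-18-2022", "05-19-2022"],
    "economy": [250, 200],
    "first class": [50, 50],
    "total": [300, 200],
})

# Sum the parts row by row and compare against the recorded total.
parts_sum = bookings[["economy", "first class"]].sum(axis=1)
is_consistent = parts_sum == bookings["total"]

# Solution 1: drop rows where the sanity check fails.
valid = bookings[is_consistent]

# Solution 2 (assumed domain rule): trust the class counts and
# recompute the total from them.
repaired = bookings.assign(total=parts_sum)

print(bookings[~is_consistent])
```

The printed row is the 05-19-2022 record, where 200 + 50 does not match the recorded total of 200.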

Crossfield validation for date columns
Ensuring that date and temporal columns pass sanity checks (for example, ensuring that webinar registration dates always precede webinar attendance dates).

name        date of birth    age
John Doe    02-07-1994       27
Jane Doe    10-12-2000       34

Potential solutions:
- Drop rows where sanity checks fail
- Apply rules from domain knowledge based on knowing the data
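
For the date of birth and age example, a sanity check could recompute the age from the birth date and compare it with the recorded value. The sketch below rests on two assumptions that are not stated in the checklist: dates are in month-day-year order, and ages were recorded as of 2022-01-01 (a hypothetical reference date).

```python
import pandas as pd

# The two rows from the example; the column is renamed date_of_birth
# for convenience.
people = pd.DataFrame({
    "name": ["John Doe", "Jane Doe"],
    "date_of_birth": ["02-07-1994", "10-12-2000"],
    "age": [27, 34],
})

# Assumption: ages were recorded as of this date.
as_of = pd.Timestamp("2022-01-01")

# Assumption: dates are month-day-year.
dob = pd.to_datetime(people["date_of_birth"], format="%m-%d-%Y")

# Has the birthday already occurred in the reference year?
had_birthday = (dob.dt.month < as_of.month) | (
    (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
)

# Age in completed years on the reference date.
expected_age = as_of.year - dob.dt.year - (~had_birthday).astype(int)

passes = expected_age == people["age"]

# Solution 1: drop rows where the sanity check fails.
valid = people[passes]

# Solution 2 (assumed domain rule): trust the birth date and recompute age.
repaired = people.assign(age=expected_age)

print(people[~passes])
```

The same pattern applies to ordering checks such as registration before attendance: parse both columns with pd.to_datetime and compare them directly.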