Lecture 4: Data Preprocessing And Cleaning

1. Recap of Previous Lecture
Data collection methods.
Different data sources (public datasets, APIs).
Importance of data quality.

2. What Is Data Preprocessing?
Definition: Data preprocessing is the process of transforming raw data into a format that is clean and ready for analysis.
Why Is It Important?
Raw data is often incomplete, inconsistent, or contains errors.
Properly preprocessed data ensures that the analysis will be accurate and meaningful.
Example: Imagine a survey where some students didn’t answer all the questions or entered incorrect information. Without cleaning that data, your analysis would give wrong results.

3. Common Data Issues
Missing Data:
Definition: When some values are missing from a dataset.
Example: A student forgot to fill in their age in a survey.
Duplicate Data:
Definition: When the same record appears multiple times in a dataset.
Example: The same student appears twice in the dataset due to a typo.
Inconsistent Data:
Definition: When data is entered in different formats.
Example: The gender column has "Male" and "M" as different entries for the same value.

4. Basic Data Cleaning Techniques
Handling Missing Data:
Removing rows/columns: If too much data is missing, it’s often best to remove those rows or columns.
Example: Remove survey responses where more than half of the answers are missing.
Imputation: Filling in missing values using a method like the mean, median, or mode.
Example: If a student's age is missing, you might fill it with the average age of the class.
Removing Duplicates:
Simple but crucial step to ensure no duplicate records are present.
Fixing Inconsistencies:
Standardizing data formats, such as making sure all "M" entries in the gender column are changed to "Male."
Example Task:
Show a small dataset with missing, duplicate, and inconsistent data, and demonstrate how to clean it.

5. Data Transformation
Normalization:
Definition: Adjusting values to fit within a certain range (usually 0 to 1) to prevent some variables from dominating the analysis.
Example: If you have student heights in centimeters and exam scores out of 100, normalization makes sure both variables are on the same scale.
Encoding Categorical Data:
Definition: Converting text data into numerical format.
Example: Turning "Male" and "Female" in the gender column into 0 and 1.

6. Task
Find a small, messy dataset.
Identify missing, duplicate, and inconsistent data.
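
The cleaning and transformation steps from this lecture can be sketched on a tiny survey dataset. This is a minimal illustration in plain Python, not a definitive implementation: the records, the field names, the "M"/"F" spelling variants, and the helper functions `clean`, `min_max`, and `transform` are all invented for this example. It drops rows with more than half of the answers missing, standardizes the gender labels, removes duplicates, fills missing ages with the mean, encodes gender as 0 and 1, and rescales height and score to the range 0 to 1.

```python
from statistics import mean

# Hypothetical survey records (invented for illustration); None marks a missing value.
raw = [
    {"name": "Ana", "gender": "Female", "age": 20, "height_cm": 160.0, "score": 80},
    {"name": "Ben", "gender": "M", "age": None, "height_cm": 180.0, "score": 60},
    {"name": "Ana", "gender": "Female", "age": 20, "height_cm": 160.0, "score": 80},
    {"name": "Chris", "gender": "Male", "age": 22, "height_cm": 170.0, "score": 100},
    {"name": "Dee", "gender": None, "age": None, "height_cm": None, "score": None},
]

GENDER_LABELS = {"M": "Male", "F": "Female"}  # assumed spelling variants
GENDER_CODES = {"Male": 0, "Female": 1}       # encoding used in the lecture example


def clean(records):
    """Drop mostly-empty rows, standardize gender, drop duplicates, impute age."""
    rows, seen = [], set()
    for rec in records:
        answers = [value for key, value in rec.items() if key != "name"]
        # Removing rows: skip responses where more than half of the answers are missing.
        if sum(value is None for value in answers) > len(answers) / 2:
            continue
        row = dict(rec)  # copy so the raw data is left untouched
        # Fixing inconsistencies: map "M" to "Male" and "F" to "Female".
        row["gender"] = GENDER_LABELS.get(row["gender"], row["gender"])
        # Removing duplicates: keep only the first copy of an identical record.
        fingerprint = tuple(row.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        rows.append(row)
    # Imputation: fill missing ages with the mean of the known ages.
    average_age = mean(r["age"] for r in rows if r["age"] is not None)
    for row in rows:
        if row["age"] is None:
            row["age"] = average_age
    return rows


def min_max(values):
    """Normalization: rescale values to the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:  # all values equal, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def transform(rows):
    """Encode gender as a number and normalize height and score."""
    heights = min_max([r["height_cm"] for r in rows])
    scores = min_max([r["score"] for r in rows])
    return [
        {
            "name": r["name"],
            "gender": GENDER_CODES.get(r["gender"]),  # None if the label is unknown
            "age": r["age"],
            "height": h,
            "score": s,
        }
        for r, h, s in zip(rows, heights, scores)
    ]


cleaned = clean(raw)
final = transform(cleaned)
for row in final:
    print(row)
```

In this sketch the cleaning runs before the transformation, so the duplicate and mostly-empty rows do not influence the mean age or the minimum and maximum used for normalization. The median or mode could be swapped in for the mean in the imputation step if the data contains outliers.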