
Data Cleaning Using R

Uploaded by

Tina Parker

Data cleaning is an essential step in the data analysis process: it ensures that the dataset is accurate, consistent, and ready for analysis. R provides several functions and techniques for cleaning data. Below are some commonly used data cleaning techniques in R:

1. Handling Missing Data

 Identifying Missing Data: You can use is.na() to detect missing values.

 is.na(data)

 Removing Missing Data: Use na.omit() or complete.cases() to remove rows with missing values.

 cleaned_data <- na.omit(data)

Or:

cleaned_data <- data[complete.cases(data), ]

 Imputing Missing Data: Replace missing values with the mean, median, or other statistical
imputation methods.

 data$column[is.na(data$column)] <- mean(data$column, na.rm = TRUE)
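The three steps above can be tried on a small example. This is a minimal sketch in base R; the data frame df and its values are hypothetical and only serve to illustrate the calls.

```r
# Toy data frame (hypothetical values) with one missing entry per column
df <- data.frame(
  age  = c(25, NA, 40, 31),
  city = c("Oslo", "Lima", NA, "Pune"),
  stringsAsFactors = FALSE
)

# Identify: count missing values in each column
na_counts <- colSums(is.na(df))

# Remove: drop every row that contains at least one NA
df_dropped <- na.omit(df)

# Impute: fill the numeric column with its mean, leaving other columns untouched
df_imputed <- df
df_imputed$age[is.na(df_imputed$age)] <- mean(df$age, na.rm = TRUE)
```

Dropping rows is simple but discards data, while mean imputation keeps every row at the cost of shrinking the column's variance. Which one fits depends on how much data is missing and why.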

2. Handling Duplicates

 Identifying Duplicates: Use the duplicated() function to find duplicate rows.

 duplicated_rows <- duplicated(data)

 Removing Duplicates: To remove duplicate rows, use unique() or distinct() from the dplyr
package.

 data_unique <- unique(data)

Or using dplyr:

library(dplyr)

data_unique <- distinct(data)
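As a small illustration, the sketch below uses base R only. The data frame and its values are hypothetical. It also shows a variation not covered above: judging duplicates on a single key column instead of the whole row.

```r
# Toy data frame (hypothetical values): row 3 repeats row 2 exactly,
# row 5 repeats the id of row 4 but with a different score
df <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  score = c(90, 85, 85, 70, 75)
)

# Logical vector: TRUE marks a row that repeats an earlier row in full
dup_flags <- duplicated(df)

# Keep only the first occurrence of each full row
df_unique <- df[!dup_flags, ]

# Keep only the first row for each id, regardless of the other columns
df_by_id <- df[!duplicated(df$id), ]
```

Note that duplicated() marks the second and later occurrences only, so the first copy of each row always survives.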

3. Outlier Detection and Removal

 Visualizing Outliers: You can visualize the distribution using boxplots to detect outliers.

 boxplot(data$column)

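 Removing Outliers: A common approach is to remove values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

 Q1 <- quantile(data$column, 0.25, na.rm = TRUE)

 Q3 <- quantile(data$column, 0.75, na.rm = TRUE)

 iqr <- Q3 - Q1  # named iqr so the built-in IQR() function is not shadowed

 # which() drops rows where the comparison is NA instead of returning all-NA rows

 data_clean <- data[which(data$column > (Q1 - 1.5 * iqr) & data$column < (Q3 + 1.5 * iqr)), ]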
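The IQR rule can be wrapped in a small helper. The sketch below is one possible version: the function name remove_outliers_iqr and the parameter k are hypothetical, and it keeps values that sit exactly on a fence, which is a design choice rather than a fixed rule.

```r
# Hypothetical helper: keep values within k * IQR of the quartiles
remove_outliers_iqr <- function(x, k = 1.5) {
  # First and third quartiles, ignoring missing values
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  # Drop NAs and anything outside the fences
  keep <- !is.na(x) & x >= lower & x <= upper
  x[keep]
}

# Toy vector (hypothetical values) with one extreme value and one NA
values <- c(10, 12, 11, 13, 12, 95, NA)
cleaned <- remove_outliers_iqr(values)
```

For this vector the quartiles are 11.25 and 12.75, so the fences are 9 and 15 and only 95 (plus the NA) is removed. Whether an extreme value is an error or a genuine observation is a separate question that the rule cannot answer.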

4. Data Type Conversion

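 Converting Data Types: You may need to convert columns to appropriate data types such as numeric values, factors, or dates. Each line below is an independent example, not a sequence to run in order. Calling as.numeric() directly on a factor returns its internal integer codes, so convert through as.character() first.

 data$column <- as.numeric(as.character(data$column))

 data$column <- as.factor(data$column)

 data$column <- as.Date(data$column, format = "%Y-%m-%d")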
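Two conversion pitfalls are worth seeing in action: factors that hold numbers, and text that cannot be parsed. The sketch below uses made-up values; all object names are hypothetical.

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

# Pitfall: this returns the internal level codes, not the labels
codes <- as.numeric(f)

# Converting through character first recovers the actual numbers
numbers <- as.numeric(as.character(f))

# Text that is not a number becomes NA (R also emits a warning)
messy <- c("1.5", "2", "n/a")
parsed <- suppressWarnings(as.numeric(messy))

# Dates in a non-ISO layout need an explicit format string
d <- as.Date("31/12/2023", format = "%d/%m/%Y")
```

Checking sum(is.na(parsed)) after a conversion is a quick way to see how many values failed to parse.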

5. Handling Categorical Data

 Renaming Factor Levels: Use factor() with the levels and labels arguments, levels(), or forcats::fct_recode() to rename factor levels. With factor(), any value not listed in levels becomes NA.

 data$column <- factor(data$column, levels = c("old_level1", "old_level2"), labels = c("new_level1", "new_level2"))

 Recode Factors Using forcats: You can also use fct_recode for recoding factor levels.

 library(forcats)

 data$column <- fct_recode(data$column, "New Level" = "Old Level")
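The levels() route mentioned above needs no extra packages. A minimal sketch, using a hypothetical factor named status:

```r
# Toy factor (hypothetical values); levels are sorted alphabetically: "n", "y"
status <- factor(c("y", "n", "y", "n", "y"))

# Rename each level by matching on its current name,
# which avoids relying on the position of a level
levels(status)[levels(status) == "y"] <- "yes"
levels(status)[levels(status) == "n"] <- "no"

# Count observations per renamed level
counts <- table(status)
```

Matching by name rather than by index keeps the code correct even if the level order changes later.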

6. String Cleaning

 Removing Whitespaces: You can use trimws() to remove leading and trailing whitespaces.

 data$column <- trimws(data$column)

 Converting to Lowercase or Uppercase: Convert text data to lowercase or uppercase using tolower() or toupper().

 data$column <- tolower(data$column)

 Removing Special Characters: Use gsub() to remove or replace special characters.

 data$column <- gsub("[^[:alnum:][:space:]]", "", data$column)
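These string steps are usually chained. The sketch below applies them to three hypothetical spellings of the same city. It replaces special characters with a space rather than deleting them, which is an assumption made here so that "NEW-YORK" does not collapse into "newyork".

```r
# Three inconsistent spellings of the same value (hypothetical data)
raw <- c("  New York!! ", "new   york", "NEW-YORK")

clean <- trimws(raw)                                  # strip leading/trailing spaces
clean <- tolower(clean)                               # unify case
clean <- gsub("[^[:alnum:][:space:]]", " ", clean)    # punctuation -> space
clean <- gsub("[[:space:]]+", " ", clean)             # collapse runs of whitespace
clean <- trimws(clean)                                # trim spaces left by the replacements
```

After these steps all three entries read "new york", so they will be treated as one category in later grouping or counting.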

7. Feature Engineering and Transformation

 Create New Variables: You can create new variables based on existing ones.

 data$new_column <- data$column1 + data$column2

 Log Transformation: Log transformations are useful for right-skewed, non-negative data. Adding 1 before taking the log keeps zero values defined.

 data$log_column <- log(data$column + 1)

 Binning or Categorizing Continuous Variables: Use cut() to categorize continuous variables into
bins.

 data$category <- cut(data$column, breaks = 4, labels = c("Low", "Medium", "High", "Very High"))
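With breaks = 4, cut() splits the observed range into four equal-width intervals, so the bin edges depend on the data. When the categories have a fixed meaning, explicit break points may be preferable. The sketch below assumes hypothetical income thresholds and also shows log1p(), the base R equivalent of log(x + 1).

```r
# Toy income values (hypothetical)
income <- c(12000, 35000, 58000, 91000)

# Explicit, data-independent bin edges (assumed thresholds);
# right = FALSE makes each interval include its lower edge
bands <- cut(
  income,
  breaks = c(0, 25000, 50000, 75000, Inf),
  labels = c("Low", "Medium", "High", "Very High"),
  right  = FALSE
)

# log1p(x) computes log(1 + x) and is defined at zero
log_income <- log1p(income)
```

Fixed break points also make results comparable across datasets, since a value of 35000 always lands in the same bin.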

8. Standardizing/Scaling Data

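 Scaling Data (Normalization or Standardization): You can rescale a column to the range 0 to 1 (min-max normalization) or standardize it to have zero mean and unit variance. The scale() function performs standardization, not min-max normalization. It returns a one-column matrix, so wrap it in as.numeric() to store a plain vector.

 data$scaled_column <- as.numeric(scale(data$column))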
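Base R has no dedicated function for rescaling to the 0 to 1 range, so it is usually written by hand. The sketch below compares the two approaches; the helper name min_max is hypothetical.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
# A constant vector would divide by zero, so it is not handled here.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(2, 4, 6, 8, 10)

# Min-max normalization: smallest value maps to 0, largest to 1
x_minmax <- min_max(x)

# Standardization (z-scores): mean 0, standard deviation 1
x_z <- as.numeric(scale(x))
```

Min-max scaling is sensitive to extreme values because they define the range, so it is often applied after outliers have been handled.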

9. Handling Date and Time Data

 Converting to Date Type: Use as.Date() for date conversion.

 data$Date <- as.Date(data$Date, format="%Y-%m-%d")

 Extracting Date Components: Extract year, month, or day from a date. format() returns character strings, so wrap the result in as.integer() if you need numbers.

 data$Year <- format(data$Date, "%Y")

 data$Month <- format(data$Date, "%m")
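A short base R sketch of the conversion and extraction steps, using two hypothetical dates. It stores the components as integers so they sort and compare numerically.

```r
# Parse ISO-formatted strings into Date objects (hypothetical values)
dates <- as.Date(c("2024-01-15", "2023-11-02"), format = "%Y-%m-%d")

# format() yields text such as "01"; as.integer() turns it into a number
year  <- as.integer(format(dates, "%Y"))
month <- as.integer(format(dates, "%m"))
day   <- as.integer(format(dates, "%d"))

# Date objects compare chronologically, unlike date strings in other layouts
earliest <- min(dates)
```

The lubridate package listed later offers year(), month(), and day() for the same purpose.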

10. Data Transformation

 Pivoting Data: Reshape data using tidyr's pivot_longer() or pivot_wider() functions.

 library(tidyr)

 data_long <- pivot_longer(data, cols = c("column1", "column2"), names_to = "variable", values_to = "value")

 data_wide <- pivot_wider(data, names_from = "variable", values_from = "value")

 Merging Data: Use merge() or dplyr's left_join(), right_join(), etc., to merge datasets.

 library(dplyr)

 merged_data <- left_join(data1, data2, by = "common_column")
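The base R merge() function mentioned above can express the same joins without dplyr. The sketch below uses two hypothetical tables, customers and orders, to contrast a left join with an inner join.

```r
# Two toy tables (hypothetical values) sharing the key column "id"
customers <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
orders    <- data.frame(id = c(1, 1, 3), amount = c(20, 35, 50))

# Left join: keep every customer; customers without orders get NA in amount
left <- merge(customers, orders, by = "id", all.x = TRUE)

# Inner join (the default): keep only customers that have at least one order
inner <- merge(customers, orders, by = "id")
```

Customer 1 appears twice in both results because it matches two orders. Checking row counts before and after a join is a simple way to catch unexpected duplication or loss.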

11. Dealing with Factors and Levels

 Reordering Factor Levels: You can reorder factor levels with factor() and relevel().

 data$column <- factor(data$column, levels = c("level1", "level2", "level3"))

 data$column <- relevel(data$column, ref = "level2")
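The difference between the two calls is easiest to see on a small factor. In this sketch the factor size and its values are hypothetical.

```r
# Default level order is alphabetical: "large", "medium", "small"
size <- factor(c("medium", "small", "large", "small"))

# factor() with explicit levels sets the full order
size_ordered <- factor(size, levels = c("small", "medium", "large"))

# relevel() moves one level to the front and keeps the rest in place
size_ref <- relevel(size_ordered, ref = "medium")
```

Neither call changes the underlying values. Only the level order changes, which affects plot ordering and which level a model treats as the reference category.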

Commonly Used Packages for Data Cleaning in R:

 dplyr: For data manipulation (filter, select, mutate, arrange, etc.).

 tidyr: For reshaping and tidying data (pivot_longer, pivot_wider, and the older gather and spread).

 stringr: For string manipulation functions (regex, trimming, etc.).

 lubridate: For handling date and time data.


 forcats: For working with factors.

 data.table: For high-performance data manipulation.

Example of a Complete Data Cleaning Workflow:

# Load necessary libraries

library(dplyr)

library(tidyr)

# Step 1: Remove duplicates

data <- distinct(data)

# Step 2: Handle missing values (impute with mean for numeric columns)

data$numeric_column[is.na(data$numeric_column)] <- mean(data$numeric_column, na.rm = TRUE)

# Step 3: Remove rows with missing data in any column

data <- data[complete.cases(data), ]

# Step 4: Handle outliers (using IQR method)

Q1 <- quantile(data$numeric_column, 0.25)

Q3 <- quantile(data$numeric_column, 0.75)

iqr <- Q3 - Q1  # named iqr so the built-in IQR() function is not shadowed

data <- data[data$numeric_column > (Q1 - 1.5 * iqr) & data$numeric_column < (Q3 + 1.5 * iqr), ]

# Step 5: Convert a factor to a character

data$factor_column <- as.character(data$factor_column)

# Step 6: Standardize data (zero mean, unit variance); scale() does not map to 0-1

data$scaled_column <- as.numeric(scale(data$numeric_column))

# Step 7: Split date into year and month


data$Year <- format(data$Date, "%Y")

data$Month <- format(data$Date, "%m")

# Step 8: Rename factor levels

data$factor_column <- factor(data$factor_column, levels = c("old_level1", "old_level2"), labels = c("new_level1", "new_level2"))

This workflow is just a sample and can be customized based on the specific needs of your dataset.
