
Case Study

Student Name: Tushar Thakur    UID: 24MCI10214
Branch: MCA (AIML)    Section/Group: 24MAM 4(A)
Semester: 1st    Subject Code: 24CAT-613
Subject Name: Data Mining

Topic: Data Cleaning in E-Commerce Customer Analytics

To tackle the problem of data inconsistencies, duplicates, and errors in e-commerce customer analytics, we'll go
through the process of cleaning a sample dataset. We'll use a variety of data cleaning techniques to ensure the data
is reliable for analytics. Let's start by outlining the steps we'll follow:

1. Load the Dataset: Import the dataset for cleaning.
2. Inspect the Data: Understand the structure and identify potential issues.
3. Handle Missing Values: Deal with any missing data points.
4. Remove Duplicates: Eliminate duplicate records.
5. Correct Inconsistencies: Standardize inconsistent data entries.
6. Data Type Conversion: Ensure data types are correct for each column.
7. Data Validation: Check for any anomalies or outliers.

Step 1: Relevant Dataset: For this case study, we will assume a dataset of customer transactions from an e-commerce website. The dataset contains the following fields:

• Transaction ID: Unique identifier for each transaction.

• Customer ID: Unique identifier for each customer.

• Product ID: Unique identifier for each product purchased.

• Purchase Amount: Amount spent on a particular product during a transaction.

• Purchase Date: Date and time when the transaction occurred.

• Payment Method: Method of payment used by the customer (e.g., Credit Card, PayPal).

• Customer Feedback: Rating or review provided by the customer after a transaction.

• Email: Customer's email address.

• Location: Customer's shipping location.

Here is an example of what the raw data might look like:
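A minimal pandas sketch of such raw data. The specific values below are hypothetical, chosen only to reproduce the issues discussed in Step 3 (a duplicate of T001 as T005 for john.doe@email.com, missing fields on T003, inconsistent payment labels, and one outlier amount):

import pandas as pd

# Hypothetical raw data; all values are illustrative only.
raw = pd.DataFrame({
    "Transaction ID": ["T001", "T002", "T003", "T004", "T005"],
    "Customer ID": ["C001", "C002", "C004", "C003", "C001"],
    "Product ID": ["P010", "P011", "P012", "P013", "P010"],
    "Purchase Amount": [120.0, 45.5, 60.0, 2500.0, 120.0],
    "Purchase Date": ["2024-01-05 10:30", "2024-01-06 14:00",
                      "2024-01-07 09:15", "2024-01-08 16:45",
                      "2024-01-05 10:30"],
    "Payment Method": ["Credit Card", "paypal", "Credit card",
                       "PayPal", "Credit Card"],
    "Customer Feedback": [5, 4, None, 3, 5],
    "Email": ["john.doe@email.com", "jane@email.com", None,
              "sam@email.com", "john.doe@email.com"],
    "Location": ["Delhi", "Mumbai", None, "Pune", "Delhi"],
})
df = raw.copy()
df.info()  # inspect structure, dtypes, and missing-value counts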

Step 2: Novel Dataset:

The dataset is typical of e-commerce customer analytics. The "novel" aspect we assume here is that it contains multiple customer entries with duplicate transactions and incomplete data points. Our focus will be on removing duplicates, filling missing values, handling outliers, and correcting errors in the dataset.

Step 3: Data Cleaning Techniques:

Here are the data cleaning techniques applied to this dataset, along with explanations:

a. Handling Duplicates:

Problem: Duplicate records can lead to inflated transaction counts, incorrect revenue calculations, and misrepresentation
of customer behavior.

Action: We identified that the customer john.doe@email.com made multiple transactions with the same product, purchase amount, and timestamp (Transactions T001 and T005). These appear to be duplicates.

Solution: Remove duplicate transactions where all fields except Transaction ID match.
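A sketch of this step in pandas, assuming the dataframe df built in the sketch above:

# Treat rows as duplicates when every field except Transaction ID matches,
# and keep only the first occurrence (here, T001 survives and T005 is dropped).
dedup_cols = [c for c in df.columns if c != "Transaction ID"]
df = df.drop_duplicates(subset=dedup_cols, keep="first")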

b. Handling Missing Data:


Problem: Missing values in critical fields like Customer Feedback, Email, and Location reduce the quality of analysis,
especially if these fields are essential for customer segmentation or feedback analysis.

Action:

• We noticed that Customer Feedback, Email, and Location are missing for Transaction T003.

• We can either fill in missing values using imputation techniques (mean, mode, or a custom approach) or drop the affected rows when the missing fields are critical and cannot be reliably imputed.

Solution: In this case, we'll drop rows with missing Customer ID, Email, or Location, as these are essential for customer-level analysis, and impute missing Customer Feedback with the mean rating.
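A sketch of both choices in pandas, assuming the same df (mean imputation of Customer Feedback follows the approach summarized in the conclusion):

# Drop rows missing any field essential for customer-level analysis.
df = df.dropna(subset=["Customer ID", "Email", "Location"])

# Impute any remaining missing Customer Feedback values with the mean rating.
df["Customer Feedback"] = df["Customer Feedback"].fillna(
    df["Customer Feedback"].mean()
)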

c. Fixing Inconsistent Data:

Problem: Inconsistent data entry in fields like Payment Method can cause issues in analysis.

Action:

• Inconsistent naming in the Payment Method field (e.g., "Credit card" vs. "Credit Card").

Solution: Standardize the Payment Method values by converting them to a single convention (e.g., title case) for consistency.
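A one-line sketch in pandas, assuming df:

# Trim whitespace and apply title case so "Credit card" and "Credit Card"
# collapse to one label (note: title case renders "PayPal" as "Paypal").
df["Payment Method"] = df["Payment Method"].str.strip().str.title()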

d. Handling Outliers:

Problem: Outliers in Purchase Amount could represent either fraudulent transactions or incorrect data entry.

Action:

• Outliers need to be identified and handled, either by removing or correcting them.

• For simplicity, let's assume any transaction greater than $1000 is an outlier in this dataset.

Solution: Filter out transactions where the Purchase Amount exceeds $1000.
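A sketch of the filter, assuming df and the $1000 threshold assumed above:

# Keep only transactions at or below the assumed $1000 outlier threshold.
df = df[df["Purchase Amount"] <= 1000]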

e. Converting Data Types:

Problem: In some cases, fields like Purchase Date may be stored as strings instead of date-time objects, which makes
analysis difficult.

Action:

• The Purchase Date column is currently stored as strings rather than datetime objects.

Solution: Convert the Purchase Date column to a datetime type for accurate time-series analysis.
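A sketch of the conversion, assuming df:

# Parse the date strings into datetime objects; errors="coerce" turns any
# unparseable entries into NaT so they can be reviewed rather than raise.
df["Purchase Date"] = pd.to_datetime(df["Purchase Date"], errors="coerce")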


Step 4: Relevant Graphs and Tables Depicting Clean Data:

After cleaning the data using the techniques outlined above, we can visualize the cleaned dataset and analyze customer
behavior.

a. Transaction Frequency by Payment Method (Bar Graph):

This will help us understand which payment methods are most commonly used.

b. Purchase Amount Distribution (Histogram):

This will show the distribution of purchase amounts and help identify trends or anomalies in spending behavior.

c. Customer Feedback Ratings (Bar Graph):

This will give insights into how customers rate their experience after purchasing products.
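A matplotlib sketch of all three charts, assuming the cleaned df from the steps above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# a. Transaction frequency by payment method (bar graph)
df["Payment Method"].value_counts().plot.bar(
    ax=axes[0], title="Transactions by Payment Method")

# b. Purchase amount distribution (histogram)
df["Purchase Amount"].plot.hist(
    ax=axes[1], bins=10, title="Purchase Amount Distribution")

# c. Customer feedback ratings (bar graph)
df["Customer Feedback"].value_counts().sort_index().plot.bar(
    ax=axes[2], title="Customer Feedback Ratings")

plt.tight_layout()
plt.show()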
d. Cleaned Dataset Table:

Here is the cleaned version of the dataset after applying the data cleaning techniques:

Conclusion:

The process of data cleaning for the e-commerce customer dataset involved several key steps aimed at improving data
quality and reliability for better analytics. Here's a summary of the steps taken and their impact:

1. Handling Duplicates:

• We removed duplicate transactions where all key fields (such as Customer ID, Product ID, and Purchase Amount) were the same, except for the unique Transaction ID. This eliminated redundant entries and ensured accurate transaction counts and revenue figures.

2. Handling Missing Data:

• Rows with missing critical fields like Email and Location were dropped so that only complete customer information was retained. Missing Customer Feedback values were imputed with the mean rating to avoid bias and preserve customer sentiment data.

3. Fixing Inconsistent Data:

• We standardized the Payment Method field (e.g., changing "credit card" to "Credit Card") to ensure uniformity and allow consistent grouping and analysis by payment method.

4. Handling Outliers:

• Outliers in the Purchase Amount (transactions above $1000) were removed, on the assumption that such transactions are likely errors or anomalies that would distort the analysis.

5. Converting Data Types:

• The Purchase Date column was converted to a proper datetime format to enable accurate time-based analysis, such as tracking purchase trends over time.
