Data Cleaning in Data Mining
Data cleaning is a crucial process in data mining. It plays an important part in building a
model. Data cleaning is a necessary step, but it is often neglected. Data quality is the central
issue in quality information management: data quality problems can occur anywhere in an
information system, and data cleaning is how they are addressed.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.
Generally, data cleaning reduces errors and improves data quality. Correcting errors in data and
eliminating bad records can be a time-consuming and tedious process, but it cannot be ignored.
Data mining is a key technique for data cleaning. Data mining is a technique for discovering
interesting information in data. Data quality mining is a recent approach that applies data mining
techniques to identify and repair data quality problems in large databases. Data mining
automatically extracts hidden, intrinsic information from collections of data, and several of its
techniques are well suited to data cleaning.
Understanding and correcting the quality of your data is imperative for an accurate final
analysis. The data needs to be prepared so that crucial patterns can be discovered. Data mining is
considered exploratory, and data cleaning in data mining allows the user to discover inaccurate or
incomplete data before business analysis begins.
In most cases, data cleaning in data mining is a laborious process and typically requires IT
resources to help in the initial step of evaluating the data, because it is so time-consuming.
But without proper data quality, your final analysis will suffer in accuracy, or you could arrive
at the wrong conclusion entirely.
Steps of Data Cleaning
While the techniques used for data cleaning may vary according to the types of data your company
stores, you can follow these basic steps to clean your data:
1. Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate and irrelevant observations.
Duplicate observations most often arise during data collection: when you combine data sets from
multiple places, scrape data, or receive data from clients or multiple departments, there are many
opportunities to create duplicate data. De-duplication is therefore one of the largest areas to be
considered in this process. Irrelevant observations are those that do not fit the specific problem
you are trying to analyze.
For example, if you want to analyze data regarding millennial customers but your dataset includes
older generations, you might remove those irrelevant observations. This makes the analysis more
efficient, minimizes distraction from your primary target, and creates a more manageable and
performant dataset.
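As an illustration, here is a minimal sketch in Python with pandas; the column names
(customer_id, age_group, spend) are hypothetical, not from any particular dataset:

    import pandas as pd

    # Hypothetical customer data with one duplicate row and an irrelevant segment
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],
        "age_group": ["millennial", "boomer", "boomer", "millennial"],
        "spend": [120.0, 80.0, 80.0, 95.0],
    })

    # Drop exact duplicate observations
    df = df.drop_duplicates()

    # Keep only the observations relevant to the analysis (millennials here)
    df = df[df["age_group"] == "millennial"]

    print(df)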
2. Fix structural errors
Structural errors arise when you measure or transfer data and end up with strange naming
conventions, typos, or inconsistent capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find both "N/A" and "Not Applicable" in the same
sheet, but they should be analyzed as the same category.
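A small sketch of how such labels might be normalized, assuming a hypothetical status column
whose inconsistent spellings should collapse into one category:

    import pandas as pd

    df = pd.DataFrame({"status": ["N/A", "Not Applicable", "n/a", "Applicable"]})

    # Normalize whitespace and capitalization, then map spelling variants to one label
    cleaned = df["status"].str.strip().str.lower()
    cleaned = cleaned.replace({"not applicable": "n/a"})
    df["status"] = cleaned

    print(df["status"].value_counts())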
3. Filter unwanted outliers
Often there will be one-off observations that, at a glance, do not appear to fit within the data
you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data
entry, doing so will improve the quality of the data you are working with.
However, sometimes the appearance of an outlier will prove a theory you are working on, and just
because an outlier exists doesn't mean it is incorrect. This step is needed to determine the
validity of each such value. If an outlier proves to be irrelevant for analysis or is a mistake,
consider removing it.
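One common (though not the only) way to flag candidate outliers is the interquartile-range rule;
a minimal sketch with made-up numbers:

    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 looks like a one-off value

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Inspect the flagged values before deciding anything --
    # an outlier is not automatically a mistake
    outliers = s[(s < lower) | (s > upper)]
    print(outliers)

    # Remove them only once there is a legitimate reason to do so
    filtered = s[(s >= lower) & (s <= upper)]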
4. Handle missing data
You can't ignore missing data, because many algorithms will not accept missing values. There are a
few ways to deal with missing data. None of them is optimal, but each can be considered:
• You can drop observations with missing values, but this loses information, so be careful
before removing them.
• You can impute missing values based on other observations; again, there is an opportunity to
lose the integrity of the data, because you may be operating from assumptions rather than actual
observations.
• You might alter how the data is used so that null values are handled effectively.
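A sketch of all three options in pandas (dropping, imputing with the column mean, and flagging
the nulls for downstream handling); the income column is hypothetical:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, 58000.0]})

    # Option 1: drop observations with missing values (loses information)
    dropped = df.dropna()

    # Option 2: impute from other observations, e.g. the column mean
    # (this introduces an assumption, not an actual observation)
    imputed = df.fillna({"income": df["income"].mean()})

    # Option 3: keep the nulls but flag them so later steps can handle them
    df["income_missing"] = df["income"].isna()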
5. Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as part of
basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory or bring any insight to light?
• Can you find trends in the data that help you form your next theory?
• If not, is that because of a data quality issue?
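Checks of this kind can be partly automated. A minimal sketch, assuming two hypothetical field
rules (age must lie between 0 and 120, and an email must contain "@"):

    import pandas as pd

    df = pd.DataFrame({
        "age": [34, 29, -5],
        "email": ["a@example.com", "b@example.com", "not-an-email"],
    })

    # Encode the field rules as boolean checks
    rules = {
        "age in [0, 120]": df["age"].between(0, 120),
        "email contains @": df["email"].str.contains("@"),
    }

    # Report how many rows violate each rule
    for name, passed in rules.items():
        print(f"{name}: {(~passed).sum()} violations")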
Because of incorrect or noisy data, false conclusions can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn't stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture.
Methods of Data Cleaning
There are many data cleaning methods through which the data should be run. The main methods are
described below:
1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple
has several attributes with missing values.
2. Fill in the missing value: This approach is also not very effective or feasible, and it can
be time-consuming. In this approach, the missing values are filled in, usually manually, but
this can also be done using the attribute mean or the most probable value.
3. Binning method: This approach is simple to understand. The sorted data is first divided
into several segments (bins) of equal size, and then each value is smoothed using the values
around it, for example by replacing it with its bin's mean (see the sketch after this list).
4. Regression: The data is smoothed by fitting it to a regression function. The regression can
be linear or multiple: linear regression has only one independent variable, while multiple
regression has more than one.
5. Clustering: This method operates on groups. Similar values are arranged into a "group" or
"cluster", and values that fall outside the clusters are detected as outliers.
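To make the binning method concrete, here is a minimal sketch that smooths sorted values by bin
means; the equal-size bins of three values are an arbitrary choice for illustration:

    import numpy as np

    data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))

    # Split the sorted data into equal-size bins and replace each value
    # with its bin mean (smoothing by bin means)
    bins = data.reshape(-1, 3)           # 3 bins of 3 values each
    smoothed = np.repeat(bins.mean(axis=1), 3)

    print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]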
Process of Data Cleaning
The following steps show the process of data cleaning in data mining.
1. Monitor the errors: Keep a record of where the most mistakes arise. This will make it
easier to identify and fix incorrect or corrupt information. Monitoring is especially
necessary when integrating another data source with your established management software.
2. Standardize the mining process: Standardize the point of entry to help reduce the
risk of duplication.
3. Validate data accuracy: Research and invest in data tools that can clean records in real
time. Some tools use artificial intelligence to better check data for correctness.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing data. Repeatedly
processing the same records can be avoided by investing in dedicated data cleaning tools
that can analyze raw data in bulk and automate the operation.
5. Research the data: Before this activity, the data must be standardized, validated, and
scrubbed for duplicates. Many approved and authorized third-party sources can capture
information directly from our databases; they help clean and compile the data to ensure
completeness, accuracy, and reliability for business decision-making.
6. Communicate with the team: Keeping the team in the loop will help strengthen client
relationships and allow more targeted data to be sent to prospective customers.
Usage of Data Cleaning in Data Mining
Here are the main uses of data cleaning in data mining:
• Data Integration: Since it is difficult to ensure quality with low-quality data, data
integration plays an important role in solving this problem. Data integration is the process of
combining data from different data sets into a single one. This process uses data cleansing
tools to ensure that the combined data set is standardized and formatted before it moves to its
final destination.
• Data Migration: Data migration is the process of moving data from one system to another, one
format to another, or one application to another. While the data is on the move, it is important
to maintain its quality, security, and consistency, to ensure that the resultant data has the
correct format and structure, without any discrepancies, at the destination.
• Data Transformation: Before the data is uploaded to a destination, it needs to be
transformed. This is only possible through data cleaning, which considers the system criteria
of formatting, structuring, etc. Data transformation processes usually include using rules and
filters before further analysis. Data transformation is an integral part of most data integration
and data management processes. Data cleansing tools help to clean the data using the built-
in transformations of the systems.
• Data Debugging in ETL Processes: Data cleansing is crucial to preparing data during
extract, transform, and load (ETL) for reporting and analysis. Data cleansing ensures that
only high-quality data is used for decision-making and analysis.
For example, a retail company may receive data from various sources, such as CRM or ERP systems,
containing erroneous or duplicate data. A good data debugging tool detects inconsistencies in
the data and rectifies them. The cleansed data is then converted to a standard format and
uploaded to a target database.
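A minimal sketch of this kind of standardization step before loading; the CRM-style records and
column names are hypothetical:

    import pandas as pd

    # Hypothetical records merged from two source systems
    crm = pd.DataFrame({"customer": ["Acme Corp ", "acme corp"],
                        "revenue": ["1,200", "1200"]})

    # Standardize formats before loading to the target database
    crm["customer"] = crm["customer"].str.strip().str.title()
    crm["revenue"] = crm["revenue"].str.replace(",", "").astype(float)
    crm = crm.drop_duplicates()

    print(crm)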
Characteristics of Data Cleaning
Data cleaning is mandatory to guarantee the accuracy, integrity, and security of business data.
Data can vary in quality depending on the following characteristics, which are the main points of
data cleaning in data mining:
• Accuracy: All the data that make up a database within the business must be highly accurate.
One way to corroborate accuracy is to compare the data with different sources. If the source
cannot be found or contains errors, the stored information will have the same problems.
• Coherence: The data must be consistent with each other, so you can be sure that the
information about an individual or entity is the same across the different forms of storage used.
• Validity: The stored data must comply with certain regulations or established constraints.
Likewise, the information has to be verified to corroborate its authenticity.
• Uniformity: The data that make up a database must use the same units or value conventions.
Uniformity is an essential aspect when carrying out the data cleansing process, since it keeps
the complexity of the procedure from increasing.
• Data Verification: The appropriateness and effectiveness of the process must be verified at
all times. This verification is carried out through several iterations of the study, design, and
validation stages, since drawbacks often only become evident after the data has been through a
certain number of changes.
• Clean Data Backflow: After quality problems have been eliminated, the cleaned data should
flow back to replace the dirty data in the original sources, so that legacy applications also
obtain its benefits and no further data cleaning actions are needed afterward.
Tools for Data Cleaning in Data Mining
Data cleansing tools can be very helpful if you are not confident cleaning the data yourself or
have no time to clean all your data sets. You might need to invest in those tools, but it is
worth the expenditure. There are many data cleaning tools on the market. Here are some top-ranked
data cleaning tools:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM InfoSphere QualityStage
9. TIBCO Clarity
10. Winpure
Types of data cleaning
There are various types of data cleaning, which are as follows −
• Missing Values − Missing values are filled in with appropriate values. There are the
following approaches for filling them:
  • The tuple is ignored when it includes several attributes with missing values.
  • The missing values are filled in manually.
  • A global constant is used to fill in the values.
  • The attribute mean is used to fill in the missing values.
  • The most probable value is used to fill in the missing values.
• Noisy data − Noise is a random error or variance in a measured variable. There are the
following smoothing methods for handling noise −
  • Binning − These methods smooth a sorted data value by consulting its
  “neighborhood,” that is, the values around it. The sorted values are distributed
  into a number of buckets, or bins. Because binning methods consult the neighborhood
  of values, they perform local smoothing.
  • Regression − Data can be smoothed by fitting it to a function, such as with
  regression. Linear regression involves finding the “best” line to fit two
  attributes (or variables) so that one attribute can be used to predict the other.
  Multiple linear regression is an extension of linear regression in which more than
  two attributes are involved and the data are fit to a multidimensional surface
  (see the sketch after this list).
  • Clustering − Clustering helps in identifying outliers. Similar values are
  organized into clusters, and values that fall outside the clusters are treated as
  outliers.
  • Combined computer and human inspection − Outliers can also be identified with a
  combination of computer and human inspection. An outlier pattern may be informative
  or garbage; patterns with a high surprise value can be output to a list for a human
  to review.
• Inconsistent data − Inconsistencies can be introduced in various transactions, during data
entry, or when integrating data from multiple databases. Some redundancies can be detected by
correlation analysis, and accurate, careful integration of the data from various sources can
reduce or avoid redundancy.
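As a concrete illustration of regression-based smoothing, the sketch below fits a line by least
squares and replaces the noisy observations with the fitted values; the data is made up:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])  # roughly linear but noisy

    # Fit the "best" line y = m*x + b by least squares
    m, b = np.polyfit(x, y, deg=1)

    # Replace each observed value with the value predicted by the line
    smoothed = m * x + b
    print(smoothed)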
Benefits of Data Cleaning
Having clean data ultimately increases overall productivity and allows for the highest-quality
information in your decision-making. Here are some major benefits of data cleaning in data
mining:
• Removal of errors when multiple sources of data are at play.
• Fewer errors make for happier clients and less-frustrated employees.
• Ability to map the different functions and what your data is intended to do.
• Monitoring errors and better reporting to see where errors are coming from, making it easier
to fix incorrect or corrupt data for future applications.
• Using tools for data cleaning will make for more efficient business practices and quicker
decision-making.