Chapter: Pre-processing
1. Introduction
Data preprocessing is the process of transforming raw initial data into
machine-readable data; it is a crucial step and a prerequisite before loading
data into an AI model.
Figure 1: The importance of the pre-processing task.
Figure 2: How data scientists spend their time.
According to Figure 2, 19% of a data scientist's time is spent on data loading and 26%
on data cleansing. In total, 45% of the time is dedicated to preparing and processing
data before it can be used for AI model development.
In machine learning, data acts as raw material. Insufficient training datasets may
introduce unintended biases, unfairly benefiting or disadvantaging specific
demographic groups. Additionally, incomplete or inconsistent data can negatively
affect outcomes in data mining efforts. To address these issues, data preprocessing
is vital, involving four main stages: cleaning, integration, reduction, and
transformation.
- Data Cleaning:
Data cleaning, also known as data cleansing, is the identification and correction of
errors, inconsistencies, and inaccuracies in a dataset to improve its quality and
reliability for analysis. It addresses issues such as missing values, duplicate entries,
outliers, and formatting differences, with the goal of making the dataset accurate,
complete, and suitable for analysis. This step is essential for the effectiveness and
credibility of data-driven models and decision-making.
A key problem in this stage is missing data. Missing values are very common and can
stem from the data collection process or from validation rules. Typical remedies
include acquiring additional data samples or seeking out supplementary datasets.
Missing values can also arise when merging multiple datasets; in that case, fields that
are not present in every dataset may need to be excluded before merging, as
illustrated in the sketch below.
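As a minimal sketch of how missing values might be handled in MATLAB (the tool used later in this chapter), the snippet below either removes or fills them in a small table; the variable names Ra and CuttingSpeed are hypothetical and only serve as an illustration.

% Hypothetical table with missing values (NaN) in two numeric columns.
Ra           = [0.54; NaN; 0.85; 1.01];
CuttingSpeed = [120; 150; NaN; 180];
T = table(Ra, CuttingSpeed);

% Option 1: drop every row that contains a missing value.
T_dropped = rmmissing(T);

% Option 2: fill missing values, here with linear interpolation.
T_filled = fillmissing(T, 'linear');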
- Data Integration:
Because data comes from many different sources, integrating it is a key part of data
preparation. However, this merging can also introduce inconsistent and redundant data
entities, which in turn leads to models with reduced accuracy.
Some approaches to integrate data:
- Data consolidation: involves gathering and storing data in a centralized
location, enhancing efficiency and productivity by utilizing data warehouse
software.
- Data virtualization: This method offers a unified, real-time interface for
accessing data from various sources, providing a single point of view for data
viewing.
- Data propagation: This process entails transferring data from one location to
another using specialized applications. It can occur synchronously or
asynchronously, typically triggered by events.
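A minimal sketch of data consolidation in MATLAB, assuming two hypothetical tables that share a SampleID key; the table contents are invented for illustration only.

% Two hypothetical sources describing the same machined samples.
machining = table([1;2;3], [120;150;180], 'VariableNames', {'SampleID','CuttingSpeed'});
roughness = table([1;2;3], [0.54;0.85;1.01], 'VariableNames', {'SampleID','Ra'});

% Consolidate on the shared key so each sample carries all of its attributes.
merged = innerjoin(machining, roughness, 'Keys', 'SampleID');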
- Data reduction:
As its name implies, data reduction minimizes data volume, cutting the cost of data
mining or analysis. It condenses the representation of the dataset while preserving
data integrity, which is crucial for managing large amounts of big data efficiently.
Some techniques used for data reduction:
- Dimensionality Reduction: also known as dimension reduction, this common
form of data reduction decreases the number of features or variables in a
dataset. It can be achieved through techniques such as Principal Component
Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE),
Singular Value Decomposition (SVD), high-correlation filters, missing-values-
ratio filters, low-variance filters, or random forests. These methods aim to
capture the most important aspects of the data while discarding redundant or
less informative features (see the PCA sketch after this list).
- Data Compression: Data compression techniques transform the original data
into a more compact representation, reducing storage requirements and
computational complexity. Methods like wavelet transforms, Huffman
coding, or run-length encoding are commonly used for this purpose.
- Feature subset selection: Feature selection is a key preprocessing step in
machine learning that aims to select the attributes that provide the most
meaningful input to the task at hand. Take, for instance, predicting students'
weights from historical data. In a dataset with features such as Roll Number,
Age, Height, and Weight, Roll Number has no influence on weight, so it is
excluded. The refined dataset, now with only three features, promises better
performance than the original set. This form of data reduction helps create
faster and more cost-efficient machine learning models. Attribute subset
selection can also be performed in the data transformation step.
- Numerosity reduction: is the process of replacing the original data with a
smaller form of data representation. There are two ways to perform this:
parametric and non-parametric methods. Parametric methods use models for
data representation. Log-linear and regression methods are used to create such
models. In contrast, non-parametric methods store reduced data
representations using clustering, histograms, data cube aggregation, and data
sampling.
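As a sketch of the dimensionality-reduction idea referenced in the list above, the snippet below projects a hypothetical feature matrix onto its first two principal components using MATLAB's pca function (Statistics and Machine Learning Toolbox); the data and the choice of two components are assumptions made for illustration.

% Hypothetical feature matrix: 100 samples x 6 correlated features.
rng(0);                              % for reproducibility
X = randn(100, 6) * randn(6, 6);     % induce correlations between features

% PCA: coeff holds the principal directions, score the projected data,
% explained the percentage of variance captured by each component.
[coeff, score, ~, ~, explained] = pca(X);

% Keep only the first two components as the reduced representation.
X_reduced = score(:, 1:2);
fprintf('Variance retained: %.1f%%\n', sum(explained(1:2)));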
- Data transformation:
Data transformation refers to the conversion of data from one format to another,
enabling efficient learning for computers by adopting suitable formats. Some strategies
for data transformation:
- Smoothing: This statistical method utilizes algorithms to filter out data noise,
emphasizing key dataset features and predicting patterns by removing outliers
for clearer insights.
- Aggregation: Aggregation involves consolidating data from diverse sources
into a unified format, crucial for data mining or analysis. It boosts the quantity
of data points, ensuring machine learning models have sufficient examples for
effective learning.
- Discretization: Discretization transforms continuous data into smaller
interval sets. For instance, categorizing individuals as "teen," "young adult,"
"middle age," or "senior" is more efficient than using continuous age values.
- Generalization: Generalization entails elevating low-level data features to
high-level ones. For example, categorical attributes like home addresses can
be generalized to higher-level definitions such as city or state.
- Normalization: Normalization involves scaling all data variables to a defined
range, typically between 0 and 1. It ensures that attribute values are within a
standardized range, facilitating comparisons. Methods like decimal scaling,
min-max normalization, and z-score normalization are commonly used for
this purpose.
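To make the discretization and normalization strategies concrete, here is a small MATLAB sketch; the age values, bin edges, and category labels are hypothetical and only mirror the examples given above.

% Hypothetical age data used for illustration.
age = [15; 22; 38; 47; 63; 71];

% Min-max normalization to [0, 1] and z-score normalization.
age_minmax = (age - min(age)) ./ (max(age) - min(age));   % or normalize(age, 'range')
age_zscore = (age - mean(age)) ./ std(age);               % or normalize(age, 'zscore')

% Discretization into the categories mentioned above (assumed bin edges).
edges  = [0 20 35 60 Inf];
labels = {'teen', 'young adult', 'middle age', 'senior'};
age_group = discretize(age, edges, 'categorical', labels);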
2. Classification
Based on the characteristics of the images and the distribution of their histogram
values, the data can be divided into five groups as follows:
- "Underexposed" and "overexposed" are two terms commonly used in
photography to describe the brightness level of an image.
An image is underexposed when it is too dark, usually because of insufficient light or
improper exposure settings during capture. As a result, details in the image may be
lost in the shadows and appear unclear.
An image is overexposed when it is too bright, often because of excessive light or
incorrect exposure settings. In this case, bright areas may lose detail, become washed
out, and appear blurred.
- “Chatter-free” and “Chatter-rich”: "Chatter" in manufacturing refers to
undesirable vibration or oscillation that occurs during machining processes,
particularly in metalworking. It can result in poor surface finish, dimensional
inaccuracies, tool wear, and even tool breakage. Chatter can occur in various
machining operations such as turning, milling, drilling, and grinding.
In the context of machined surface roughness, "chatter-free" refers to stable machining
without vibration, which is crucial for high-quality finishes and tool longevity.
"Chatter-rich" describes machining with significant vibration, leading to poor results
and wear. Achieving chatter-free machining involves optimizing parameters and
tooling, while addressing chatter-rich conditions requires identifying and fixing root
causes such as improper settings or tooling.
- Others: images affected by various other factors, such as rust, scratches,
abrasions, blurring, or partial glare.
Sample images for each group (labelled by surface roughness Ra):
- Dark (Underexposed): Ra 0.85 (chatter-free), Ra 0.62 (chatter-rich)
- Bright (Overexposed): Ra 0.54 (chatter-free), Ra 1.01 (chatter-rich)
- Others: Ra 0.83 (image is blurred and partially glared)
3. Methods
There are two image processing techniques used in the article:
- Fast local Laplacian filtering of images: Local Laplacian image filtering
rapidly enhances images by accentuating edges and retaining vital details. It
employs adaptive filtering strength, adjusting to image content to maintain
sharpness while minimizing noise, utilizing efficient algorithms for real-time
application.
- Enhance contrast using histogram equalization: Histogram equalization is
a technique used to enhance the contrast of an image by redistributing the
intensity values in its histogram. It works by stretching the histogram so that it
covers the entire intensity range, thereby making the image more visually
appealing with improved contrast and detail.
Input arguments of these two methods:
Fast Local Laplacian filtering of images:
- sigma — Amplitude of edges: edge amplitude, specified as a non-negative
number. For integer images, sigma should be in the range [0, 1]; for single
images in the range [0, 1], sigma stays in that range; for single images with a
different range [a, b], sigma should lie in that range. sigma is used to process
the details and, together with alpha, to increase the contrast, effectively
enhancing the local contrast of the image. Chosen value: 0.5.
- alpha — Smoothing of details: specified as a positive number; typical values
of alpha are in the range [0.01, 10]. A value below 1 increases the details of
the input image, effectively enhancing its local contrast without affecting
edges or introducing halos. Chosen value: 0.4.
- beta — Dynamic range: specified as a non-negative number; typical values of
beta are in the range [0, 5]. beta controls the dynamic range of the output
image. Chosen value: 1 (leaves the dynamic range essentially unchanged).
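A minimal sketch of process 1 with the parameter values chosen above, assuming that the filter in question is MATLAB's locallapfilt (Image Processing Toolbox); the input file name is hypothetical.

% Hypothetical file name; the input is converted to grayscale first.
I = imread('Ra054_surface.png');
if size(I, 3) == 3
    I = rgb2gray(I);
end

% Process 1: fast local Laplacian filtering with the chosen parameters.
sigma = 0.5;   % edge amplitude
alpha = 0.4;   % < 1 enhances details
beta  = 1;     % keeps the dynamic range essentially unchanged
J = locallapfilt(I, sigma, alpha, beta);

imshowpair(I, J, 'montage');   % compare input and filtered output side by side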
Enhance contrast using histogram equalization:
- I — Grayscale image: the grayscale input image, specified as a numeric array
of any dimension.
- n — Number of discrete gray levels: specified as a positive integer. Chosen
value: n_binsHistoEqui = 256.
Figure: Comparison between before and after Histogram Equalization in MATLAB
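A corresponding sketch of process 2, assuming MATLAB's histeq and imhist functions and a hypothetical input file.

% Process 2: histogram equalization with 256 discrete gray levels.
I = imread('Ra085_surface.png');    % hypothetical file name
if size(I, 3) == 3
    I = rgb2gray(I);
end
n_binsHistoEqui = 256;
K = histeq(I, n_binsHistoEqui);

% Compare the image and its histogram before and after equalization,
% as in the figure above.
subplot(2, 2, 1); imshow(I); title('Before histeq');
subplot(2, 2, 2); imshow(K); title('After histeq');
subplot(2, 2, 3); imhist(I);
subplot(2, 2, 4); imhist(K);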
4. Implementation
Common standards:
Regarding brightness: based on the histogram of the image obtained with the imhist()
function, we have the following:
Figure: Histogram of Ra 0.54 image
Comment: The histogram of the image is skewed to the right, indicating that the image
is bright. However, overall, the peak of the graph is located near the center of the
distribution, suggesting that the image has a balanced brightness.
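This histogram-based assessment can be reproduced with a few lines of MATLAB, sketched below; the file name is hypothetical and the mean-gray-level threshold of 128 is an assumption, not a value given in this chapter. The resulting mean gray level underlies the usually bright / usually dark split described next.

% Sketch: assess brightness from the gray-level histogram and mean intensity.
I = imread('Ra054_surface.png');          % hypothetical file name
if size(I, 3) == 3
    I = rgb2gray(I);
end

imhist(I);                                 % inspect the histogram, as in the figure
meanGray = mean(I(:));                     % mean gray level of the uint8 image

threshold = 128;                           % assumed split between dark and bright
if meanGray >= threshold
    group = 'usually bright';
else
    group = 'usually dark';
end
disp(['Brightness group: ' group]);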
From there, we define two brightness regions: usually bright and usually dark. For
usually bright images, we enhance edges without adjusting the contrast distribution;
that is, we use Fast Local Laplacian filtering of images without an additional contrast-
enhancement method such as histogram equalization. An example is as follows:
- Sample Ra 0.54 (Usually bright)
Figure: (Sample: Ra 0.54)
Row 1: Histogram of the image.
Row 2: Image before (leftmost) and after passing through processes 1 and 2 (from left
to right, image 2 and image 3).
Row 3: Using edge detection for the corresponding processes of row 2.
Note: Process 1 refers to Fast Local Laplacian filtering of images. Process 2 refers to
contrast enhancement using histogram equalization.
Observation: For images in the usually bright category, stopping at process 1 is
sufficient. If the image also goes through process 2, it becomes overexposed and loses
edge-detection information. Comparing the two processes (in the figure above, row 3,
columns 2 and 3), the image in column 3 loses edge information at the bottom-right
corner and in parts of the middle of the image.
- Sample Ra 0.85 (Usually dark)
Figure: (Sample: Ra 0.85)
Row 1: Histogram of the image.
Row 2: Image before (leftmost) and after passing through processes 1 and 2 (from left
to right, image 2 and image 3).
Row 3: Edge detection applied to the corresponding results of row 2.
Observation: Usually dark images may need to go through process 2 as well.
Comparing the two processes (in the figure above, row 3, columns 2 and 3), the middle
of the lower part of the image shows the surface texture lines more clearly after the
brightness enhancement. However, the upper part of the image is brighter than the rest,
so after process 2 it becomes glared and loses edge information there.
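Putting the two observations together, the sketch below applies process 1 to every image and process 2 only to usually dark images, then runs edge detection on the result; the Canny detector, the brightness threshold, and the file name are assumptions added on top of the workflow described above.

% Sketch of the full pipeline described above (parameter values from section 3).
I = imread('Ra085_surface.png');    % hypothetical file name
if size(I, 3) == 3
    I = rgb2gray(I);
end

% Process 1: fast local Laplacian filtering (always applied).
J = locallapfilt(I, 0.5, 0.4, 1);

% Process 2: histogram equalization, only for usually dark images.
if mean(I(:)) < 128                 % assumed brightness threshold
    J = histeq(J, 256);
end

% Edge detection on the preprocessed image (Canny is an assumed choice;
% the original text does not name the edge detector used).
E = edge(J, 'canny');
imshowpair(J, E, 'montage');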
For the chatter-free and chatter-rich groups: since these characteristics arise mainly
from the machining process, highlighting the surface texture lines in the chatter-free
group requires a separate, dedicated detection method, which needs further study in
future work.
For the "others" group: based on the brightness classification, these images are split
into the two brightness groups and passed through the same two processing steps
described above. This group is retained, even though its defects arise during
machining, storage, or image capture, so that the AI can learn from them and treat
them as variations that occur during real use.