Chapter: Pre-processing
1. Introduction
Data preprocessing is the process of transforming raw initial data into
machine-readable data; it is a crucial step and a prerequisite before loading
data into an AI model.
Figure 1: The importance of the pre-processing task.
Figure 2: How data scientists spend their time.
According to Figure 2, 19% of a data scientist's time is spent on data loading and 26%
on data cleansing. In total, 45% of the time is dedicated to preparing and processing
data before it can be used for AI model development.
In machine learning, data acts as raw material. Insufficient training datasets may
introduce unintended biases, unfairly benefiting or disadvantaging specific
demographic groups. Additionally, incomplete or inconsistent data can negatively
affect outcomes in data mining efforts. To address these issues, data preprocessing
is vital, involving four main stages: cleaning, integration, reduction, and
transformation.
- Data Cleaning:
Data cleaning, also known as data cleansing, is the identification and correction of
errors, inconsistencies, and inaccuracies in a dataset to improve its quality and
reliability for analysis. It addresses issues such as missing values, duplicate entries,
outliers, and formatting differences, with the goal of making the dataset accurate,
complete, and suitable for analysis. This step is essential for the effectiveness and
credibility of data-driven models and decision-making.
A key problem in this stage is missing data. Missing values are very common and can
stem from the data collection process or from validation rules. Typical remedies
include acquiring additional data samples or seeking out supplementary datasets.
Missing values can also arise when merging multiple datasets; in that case, fields that
are not present in every dataset may need to be excluded before merging, as
illustrated in the sketch below.
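As a minimal sketch of how missing values might be handled in MATLAB (the tool used later in this chapter), the snippet below either removes or fills them in a small table; the variable names Ra and CuttingSpeed are hypothetical and only serve as an illustration.

% Hypothetical table with missing values (NaN) in two numeric columns.
Ra           = [0.54; NaN; 0.85; 1.01];
CuttingSpeed = [120; 150; NaN; 180];
T = table(Ra, CuttingSpeed);

% Option 1: drop every row that contains a missing value.
T_dropped = rmmissing(T);

% Option 2: fill missing values, here with linear interpolation.
T_filled = fillmissing(T, 'linear');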
- Data Integration:
Because data comes from many different sources, integrating it is a key part of data
preparation. However, this merging can also introduce inconsistent and redundant data
entities, which in turn leads to models with reduced accuracy.
Some approaches to integrate data:
- Data consolidation: involves gathering and storing data in a centralized
location, enhancing efficiency and productivity by utilizing data warehouse
software.
- Data virtualization: This method offers a unified, real-time interface for
accessing data from various sources, providing a single point of view for data
viewing.
- Data propagation: This process entails transferring data from one location to
another using specialized applications. It can occur synchronously or
asynchronously, typically triggered by events.
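A minimal sketch of data consolidation in MATLAB, assuming two hypothetical tables that share a SampleID key; the table contents are invented for illustration only.

% Two hypothetical sources describing the same machined samples.
machining = table([1;2;3], [120;150;180], 'VariableNames', {'SampleID','CuttingSpeed'});
roughness = table([1;2;3], [0.54;0.85;1.01], 'VariableNames', {'SampleID','Ra'});

% Consolidate on the shared key so each sample carries all of its attributes.
merged = innerjoin(machining, roughness, 'Keys', 'SampleID');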
- Data reduction:
As its name implies, data reduction minimizes data volume, cutting the cost of data
mining or analysis. It condenses the representation of the dataset while preserving
data integrity, which is crucial for managing large amounts of big data efficiently.
Some techniques used for data reduction:
- Dimensionality Reduction: also known as dimension reduction, this common
form of data reduction decreases the number of features or variables in a
dataset. It can be achieved through techniques such as Principal Component
Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE),
Singular Value Decomposition (SVD), high-correlation filters, missing-values-
ratio filters, low-variance filters, or random forests. These methods aim to
capture the most important aspects of the data while discarding redundant or
less informative features (see the PCA sketch after this list).
- Data Compression: Data compression techniques transform the original data
into a more compact representation, reducing storage requirements and
computational complexity. Methods like wavelet transforms, Huffman
coding, or run-length encoding are commonly used for this purpose.
- Feature subset selection: Feature selection is a key preprocessing step in
machine learning that aims to select the attributes that provide the most
meaningful input to the task at hand. Take, for instance, predicting students'
weights from historical data. In a dataset with features such as Roll Number,
Age, Height, and Weight, Roll Number has no influence on weight, so it is
excluded. The refined dataset, now with only three features, promises better
performance than the original set. This form of data reduction helps create
faster and more cost-efficient machine learning models. Attribute subset
selection can also be performed in the data transformation step.
- Numerosity reduction: is the process of replacing the original data with a
smaller form of data representation. There are two ways to perform this:
parametric and non-parametric methods. Parametric methods use models for
data representation. Log-linear and regression methods are used to create such
models. In contrast, non-parametric methods store reduced data
representations using clustering, histograms, data cube aggregation, and data
sampling.
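As a sketch of the dimensionality-reduction idea referenced in the list above, the snippet below projects a hypothetical feature matrix onto its first two principal components using MATLAB's pca function (Statistics and Machine Learning Toolbox); the data and the choice of two components are assumptions made for illustration.

% Hypothetical feature matrix: 100 samples x 6 correlated features.
rng(0);                              % for reproducibility
X = randn(100, 6) * randn(6, 6);     % induce correlations between features

% PCA: coeff holds the principal directions, score the projected data,
% explained the percentage of variance captured by each component.
[coeff, score, ~, ~, explained] = pca(X);

% Keep only the first two components as the reduced representation.
X_reduced = score(:, 1:2);
fprintf('Variance retained: %.1f%%\n', sum(explained(1:2)));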
- Data transformation:
Data transformation refers to the conversion of data from one format to another,
enabling efficient learning for computers by adopting suitable formats. Some strategies
for data transformation:
- Smoothing: This statistical method utilizes algorithms to filter out data noise,
emphasizing key dataset features and predicting patterns by removing outliers
for clearer insights.
- Aggregation: Aggregation involves consolidating data from diverse sources
into a unified format, crucial for data mining or analysis. It boosts the quantity
of data points, ensuring machine learning models have sufficient examples for
effective learning.
- Discretization: Discretization transforms continuous data into smaller
interval sets. For instance, categorizing individuals as "teen," "young adult,"
"middle age," or "senior" is more efficient than using continuous age values.
- Generalization: Generalization entails elevating low-level data features to
high-level ones. For example, categorical attributes like home addresses can
be generalized to higher-level definitions such as city or state.
- Normalization: Normalization involves scaling all data variables to a defined
range, typically between 0 and 1. It ensures that attribute values are within a
standardized range, facilitating comparisons. Methods like decimal scaling,
min-max normalization, and z-score normalization are commonly used for
this purpose.
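To make the discretization and normalization strategies concrete, here is a small MATLAB sketch; the age values, bin edges, and category labels are hypothetical and only mirror the examples given above.

% Hypothetical age data used for illustration.
age = [15; 22; 38; 47; 63; 71];

% Min-max normalization to [0, 1] and z-score normalization.
age_minmax = (age - min(age)) ./ (max(age) - min(age));   % or normalize(age, 'range')
age_zscore = (age - mean(age)) ./ std(age);               % or normalize(age, 'zscore')

% Discretization into the categories mentioned above (assumed bin edges).
edges  = [0 20 35 60 Inf];
labels = {'teen', 'young adult', 'middle age', 'senior'};
age_group = discretize(age, edges, 'categorical', labels);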
2. Classification
Based on the characteristics of the images and the distribution of their histogram
values, the data can be divided into five groups as follows:
- "Underexposed" and "overexposed" are two terms commonly used in
photography to describe the brightness level of an image.
An image is underexposed when it is too dark, usually because of insufficient light or
improper exposure settings during capture. As a result, details in the image may be
lost in the shadows and appear unclear.
An image is overexposed when it is too bright, often because of excessive light or
incorrect exposure settings. In this case, bright areas may lose detail, become washed
out, and appear blurred.
- “Chatter-free” and “Chatter-rich”: "Chatter" in manufacturing refers to
undesirable vibration or oscillation that occurs during machining processes,
particularly in metalworking. It can result in poor surface finish, dimensional
inaccuracies, tool wear, and even tool breakage. Chatter can occur in various
machining operations such as turning, milling, drilling, and grinding.
In the context of machined surface roughness, "chatter-free" refers to stable machining
without vibration, which is crucial for high-quality finishes and tool longevity.
"Chatter-rich" describes machining with significant vibration, leading to poor results
and wear. Achieving chatter-free machining involves optimizing parameters and
tooling, while addressing chatter-rich conditions requires identifying and fixing root
causes such as improper settings or tooling.
- Others: images affected by various other factors, such as rust, scratches,
abrasions, blurring, or partial glare.
Sample images for each group (labelled by surface roughness Ra):
- Dark (Underexposed): Ra 0.85 (chatter-free), Ra 0.62 (chatter-rich)
- Bright (Overexposed): Ra 0.54 (chatter-free), Ra 1.01 (chatter-rich)
- Others: Ra 0.83 (image is blurred and partially glared)
3. Methods
There are two image processing techniques used in the article:
- Fast local Laplacian filtering of images: Local Laplacian image filtering
rapidly enhances images by accentuating edges and retaining vital details. It
employs adaptive filtering strength, adjusting to image content to maintain
sharpness while minimizing noise, utilizing efficient algorithms for real-time
application.
- Enhance contrast using histogram equalization: Histogram equalization is
a technique used to enhance the contrast of an image by redistributing the
intensity values in its histogram. It works by stretching the histogram so that it
covers the entire intensity range, thereby making the image more visually
appealing with improved contrast and detail.
Input arguments of these two methods:
Fast Local Laplacian filtering of images:
- sigma — Amplitude of edges: edge amplitude, specified as a non-negative
number. For integer images, sigma should be in the range [0, 1]; for single
images in the range [0, 1], sigma stays in that range; for single images with a
different range [a, b], sigma should lie in that range. sigma is used to process
the details and, together with alpha, to increase the contrast, effectively
enhancing the local contrast of the image. Chosen value: 0.5.
- alpha — Smoothing of details: specified as a positive number; typical values
of alpha are in the range [0.01, 10]. A value below 1 increases the details of
the input image, effectively enhancing its local contrast without affecting
edges or introducing halos. Chosen value: 0.4.
- beta — Dynamic range: specified as a non-negative number; typical values of
beta are in the range [0, 5]. beta controls the dynamic range of the output
image. Chosen value: 1 (leaves the dynamic range essentially unchanged).
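A minimal sketch of process 1 with the parameter values chosen above, assuming that the filter in question is MATLAB's locallapfilt (Image Processing Toolbox); the input file name is hypothetical.

% Hypothetical file name; the input is converted to grayscale first.
I = imread('Ra054_surface.png');
if size(I, 3) == 3
    I = rgb2gray(I);
end

% Process 1: fast local Laplacian filtering with the chosen parameters.
sigma = 0.5;   % edge amplitude
alpha = 0.4;   % < 1 enhances details
beta  = 1;     % keeps the dynamic range essentially unchanged
J = locallapfilt(I, sigma, alpha, beta);

imshowpair(I, J, 'montage');   % compare input and filtered output side by side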
Enhance contrast using histogram equalization:
- I — Grayscale image: the grayscale input image, specified as a numeric array
of any dimension.
- n — Number of discrete gray levels: specified as a positive integer. Chosen
value: n_binsHistoEqui = 256.
Figure: Comparison between before and after Histogram Equalization in MATLAB
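A corresponding sketch of process 2, assuming MATLAB's histeq and imhist functions and a hypothetical input file.

% Process 2: histogram equalization with 256 discrete gray levels.
I = imread('Ra085_surface.png');    % hypothetical file name
if size(I, 3) == 3
    I = rgb2gray(I);
end
n_binsHistoEqui = 256;
K = histeq(I, n_binsHistoEqui);

% Compare the image and its histogram before and after equalization,
% as in the figure above.
subplot(2, 2, 1); imshow(I); title('Before histeq');
subplot(2, 2, 2); imshow(K); title('After histeq');
subplot(2, 2, 3); imhist(I);
subplot(2, 2, 4); imhist(K);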
4. Implementation
Common standards:
Regarding brightness: based on the histogram of the image obtained with the imhist()
function, we have the following:
Figure: Histogram of Ra 0.54 image
Comment: The histogram of the image is skewed to the right, indicating that the image
is bright. However, overall, the peak of the graph is located near the center of the
distribution, suggesting that the image has a balanced brightness.
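This histogram-based assessment can be reproduced with a few lines of MATLAB, sketched below; the file name is hypothetical and the mean-gray-level threshold of 128 is an assumption, not a value given in this chapter. The resulting mean gray level underlies the usually bright / usually dark split described next.

% Sketch: assess brightness from the gray-level histogram and mean intensity.
I = imread('Ra054_surface.png');          % hypothetical file name
if size(I, 3) == 3
    I = rgb2gray(I);
end

imhist(I);                                 % inspect the histogram, as in the figure
meanGray = mean(I(:));                     % mean gray level of the uint8 image

threshold = 128;                           % assumed split between dark and bright
if meanGray >= threshold
    group = 'usually bright';
else
    group = 'usually dark';
end
disp(['Brightness group: ' group]);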
From there, we define two brightness regions: usually bright and usually dark. For
usually bright images, we enhance edges without adjusting the contrast distribution;
that is, we use Fast Local Laplacian filtering of images without an additional contrast-
enhancement method such as histogram equalization. An example is as follows:
- Sample Ra 0.54 (Usually bright)
Figure: (Sample: Ra 0.54)
Row 1: Histogram of the image.
Row 2: Image before (leftmost) and after passing through processes 1 and 2 (from left
to right, image 2 and image 3).
Row 3: Using edge detection for the corresponding processes of row 2.
Note: Process 1 refers to Fast Local Laplacian filtering of images. Process 2 refers to
contrast enhancement using histogram equalization.
Observation: For images in the usually bright category, stopping at process 1 is
sufficient. If the image also goes through process 2, it becomes overexposed and loses
edge-detection information. Comparing the two processes (in the figure above, row 3,
columns 2 and 3), the image in column 3 loses edge information at the bottom-right
corner and in parts of the middle of the image.
- Sample Ra 0.85 (Usually dark)
Figure: (Sample: Ra 0.85)
Row 1: Histogram of the image.
Row 2: Image before (leftmost) and after passing through processes 1 and 2 (from left
to right, image 2 and image 3).
Row 3: Edge detection applied to the corresponding results of row 2.
Observation: Usually dark images may need to go through process 2 as well.
Comparing the two processes (in the figure above, row 3, columns 2 and 3), the middle
of the lower part of the image shows the surface texture lines more clearly after the
brightness enhancement. However, the upper part of the image is brighter than the rest,
so after process 2 it becomes glared and loses edge information there.
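Putting the two observations together, the sketch below applies process 1 to every image and process 2 only to usually dark images, then runs edge detection on the result; the Canny detector, the brightness threshold, and the file name are assumptions added on top of the workflow described above.

% Sketch of the full pipeline described above (parameter values from section 3).
I = imread('Ra085_surface.png');    % hypothetical file name
if size(I, 3) == 3
    I = rgb2gray(I);
end

% Process 1: fast local Laplacian filtering (always applied).
J = locallapfilt(I, 0.5, 0.4, 1);

% Process 2: histogram equalization, only for usually dark images.
if mean(I(:)) < 128                 % assumed brightness threshold
    J = histeq(J, 256);
end

% Edge detection on the preprocessed image (Canny is an assumed choice;
% the original text does not name the edge detector used).
E = edge(J, 'canny');
imshowpair(J, E, 'montage');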
For the chatter-free and chatter-rich groups: since these characteristics arise mainly
from the machining process, highlighting the surface texture lines in the chatter-free
group requires a separate, dedicated detection method, which needs further study in
future work.
For the "others" group: based on the brightness classification, these images are split
into the two brightness groups and passed through the same two processing steps
described above. This group is retained, even though its defects arise during
machining, storage, or image capture, so that the AI can learn from them and treat
them as variations that occur during real use.