UNIT-II
Know the Data and Data Preprocessing
Know the Data and Data Preprocessing: Data Objects and
attribute types, Basic statistical description of Data, Data preprocessing,
Data cleaning, Data Integration and Data reduction. Main Approaches
for Dimensionality Reduction, Projection, Manifold Learning, PCA.
Insufficient Quantity of Training Data, Nonrepresentative Training Data,
Poor-Quality Data, Irrelevant Features, Overfitting the Training Data,
Underfitting the Training Data, Stepping Back, Testing and Validating.
                               Data Object
❑Data sets are made up of data objects.
❑A data object represents an entity.
❑Examples:
   ❑   sales database: customers, store items, sales
   ❑   medical database: patients, treatments
   ❑   university database: students, professors, courses
❑Also called samples, examples, instances, data points, objects, or data tuples.
❑Data objects are described by attributes.
                              Data Objects
❑An attribute is a property or characteristic or feature of a data object.
  ❑ Examples: eye color of a person, temperature, etc.
❑Attribute is also known as variable, field, characteristic, or feature
❑A collection of attributes describe an object.
❑Attribute values are numbers or symbols assigned to an attribute
❑Database rows → data objects; columns → attributes.
                                     Attributes
❑ Attribute (or dimensions, features, variables): a data field, representing a
  characteristic or feature of a data object.
      ❑E.g., customer_ID, name, address
❑ Attribute values are numbers or symbols assigned to an attribute.
❑ Distinction between attributes and attribute values
   ❑ Same attribute can be mapped to different attribute values
       ❑Example: height can be measured in feet or meters
   ❑ Different attributes can be mapped to the same set of values
       ❑Example: Attribute values for ID and age are integers
       ❑But properties of attribute values can be different; ID has no limit, but
        age has a maximum and minimum value
                     Attribute Types
❖NOMINAL ( “relating to names”)
❖BINARY (only two categories or states)
❖ORDINAL (Order or Ranking)
❖NUMERIC (Measurable quantity)
❖DISCRETE
❖CONTINUOUS
                            Attribute Types
❑Categorical (Qualitative)
  ❑   Nominal and Ordinal attributes are collectively referred to as
      categorical or qualitative attributes.
❑Numeric (Quantitative)
  ❑   Interval and Ratio are collectively referred to as quantitative
      or numeric attributes.
❑Discrete vs Continuous attributes
                              Attribute Types
Nominal: categories, states, or “names of things”, “Symbols”.
◼   Hair_color = {auburn, black, blond, brown, grey, red, white}
◼   Marital status, occupation, ID numbers, zip codes
Binary
◼   Nominal attribute with only 2 states (0 and 1)
◼   Symmetric binary: both outcomes equally important
       e.g., gender
◼   Asymmetric binary: outcomes not equally important.
       e.g., medical test (positive vs. negative)
       Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
◼   Values have a meaningful order (ranking) but magnitude between successive values
    is not known.
◼   Size = {small, medium, large}, grades, army rankings
Numeric: a quantity (integer or real-valued)
Interval
        Measured on a scale of equal-sized units
        Values have order
        ◼ E.g., temperature in C˚ or F˚, calendar dates
        No true zero-point
Ratio
        Inherent zero-point
        We can speak of values as being an order of magnitude larger
        than the unit of measurement (10 K is twice as high as 5 K).
         ◼ e.g., temperature in Kelvin, length, counts, monetary
           quantities
                               Attribute Types
                      (Discrete vs. Continuous Attributes)
Discrete Attribute
◼   Has only a finite or countably infinite set of values
          ◼ zip codes, profession, or the set of words in a collection of documents
◼   Sometimes, represented as integer variables
◼   Note: Binary attributes are a special case of discrete attributes
◼   Binary attributes where only non-zero values are important are called asymmetric binary
    attributes.
Continuous Attribute
◼   Has real numbers as attribute values
          ◼ Temperature, height, or weight
◼   Practically, real values can only be measured and represented using a finite number of
    digits
◼   Continuous attributes are typically represented as floating-point variables
            Basic statistical description of data
❑Basic statistical descriptions can be used to
 identify properties of the data and highlight
 which data values should be treated as noise
 or outliers.
❑For data preprocessing tasks, we want to learn
 about data characteristics regarding both
 central tendency and dispersion of the data.
Measures of central tendency include mean, median, mode,
and midrange.
Measures of data dispersion include quartiles, interquartile
range (IQR), and variance.
These descriptive statistics are of great help in understanding
the distribution of the data.
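As a quick illustration (not part of the original slides), these measures can be computed with NumPy and SciPy on a small hypothetical sample; the values in `data` are made up for the example, and a recent SciPy (≥ 1.9) is assumed for the `keepdims` argument:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of incomes (in thousands)
data = np.array([30, 36, 47, 50, 52, 52, 52, 60, 63, 70, 70, 110])

# Central tendency
mean = data.mean()                              # arithmetic mean
median = np.median(data)                        # middle value
mode = stats.mode(data, keepdims=False).mode    # most frequent value
midrange = (data.min() + data.max()) / 2        # average of min and max

# Dispersion
q1, q3 = np.percentile(data, [25, 75])          # first and third quartiles
iqr = q3 - q1                                   # interquartile range
variance = data.var(ddof=1)                     # sample variance

print(mean, median, mode, midrange, iqr, variance)
```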
Symmetric vs. Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and
negatively skewed data. In a symmetric distribution the three coincide; in
a positively skewed one the mode lies below the median and mean, and in a
negatively skewed one the order is reversed.]
Dispersion
Dispersion measures the extent to which the items vary from a central
value. It is also called spread, scatter, or variability.
                     Data Preprocessing
Data Preprocessing: An Overview
◼   Data Quality
◼   Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable
• Consistency: some modified but some not, dangling, …
• Timeliness: timely updates?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Major Techniques / Tasks in Data Preprocessing
Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
  and resolve inconsistencies
Data integration
• Integration of multiple databases, data cubes, or files
Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
Data transformation and data discretization
• Normalization
• Concept hierarchy generation
[Figure: forms of data preprocessing: cleaning, integration, reduction,
transformation.]
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data,
e.g., instrument faults, human or computer error, transmission errors.
Incomplete:
• lacking attribute values, lacking certain attributes of interest, or
  containing only aggregate data
• e.g., Occupation=“ ” (missing data)
Noisy:
• containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
Inconsistent:
• containing discrepancies in codes or names
• e.g., Age=“42”, Birthday=“20/03/2010”
• was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
Intentional (e.g., disguised missing data):
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
 • E.g., many tuples have no recorded value for several attributes, such as
   customer income in sales data
Missing data may be due to
 • Equipment malfunction
 • Inconsistency with other recorded data, leading to deletion
 • Data not entered due to misunderstanding
 • Certain data not being considered important at the time of entry
 • History or changes of the data not being registered
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing
classification). Not effective when the percentage of missing values per
attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as a Bayesian formula or
  decision tree
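A minimal pandas sketch of the automatic fill-in strategies above; the table, column names, and the sentinel constant are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 30_000, np.nan, 34_000],
})

# Global constant: a sentinel standing in for "unknown"
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all tuples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class: the "smarter" variant above
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```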
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
 • faulty data collection instruments
 • data entry problems
 • data transmission problems
 • technology limitations
 • inconsistency in naming conventions
Other data problems which require data cleaning
 • duplicate records
 • incomplete data
 • inconsistent data
How to Handle Noisy Data?
Binning
 • First sort the data and partition it into (equal-frequency) bins
 • Then smooth by bin means, bin medians, or bin boundaries, etc.
   (see the sketch below)
Regression
 • Data can also be smoothed by fitting regression functions
 • Linear regression, multiple linear regression
Clustering
 • Place data elements into groups of similar values (clusters)
 • Detect and remove outliers
Combined computer and human inspection
 • Detect suspicious values and check by human (e.g., deal with possible
   outliers)
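The binning variants can be sketched with NumPy; the nine sorted prices and the bin depth of 3 are a hypothetical example:

```python
import numpy as np

# Hypothetical sorted prices, partitioned into 3 equal-frequency bins
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = prices.reshape(3, 3)                    # depth-3 bins

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)     # [9, 9, 9, 22, ..., 29]

# Smoothing by bin boundaries: each value snaps to the nearest bin edge
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds.ravel())
```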
Data Integration
Data integration:
 • Combines data from multiple sources into a coherent store
Entity identification problem:
 • Identify real-world entities from multiple data sources,
   e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
 • For the same real-world entity, attribute values from different sources
   may differ
 • Possible reasons: different representations, different scales,
   e.g., metric vs. British units
Redundancy:
 • Object identification
 • Derivable data
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases
• Object identification: the same attribute or object may have different
  names in different databases
• Derivable data: one attribute may be a “derived” attribute in another
  table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and
covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
Χ² Correlation Test (Nominal Data)
Χ² (chi-square) test:

    Χ² = Σ (Observed − Expected)² / Expected

The larger the Χ² value, the more likely the variables are related
The cells that contribute the most to the Χ² value are those whose actual
count is very different from the expected count
Correlation does not imply causality
◼   # of hospitals and # of car thefts in a city are correlated
◼   Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated from the data distribution in the two categories):

    Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360
         + (1000 − 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the
group.
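The same table can be checked with SciPy's `chi2_contingency` (a real SciPy function; `correction=False` disables Yates' continuity correction so the result matches the hand calculation):

```python
from scipy.stats import chi2_contingency
import numpy as np

# Observed contingency table from the slide
observed = np.array([[250, 200],      # like science fiction
                     [50, 1000]])     # do not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)        # [[ 90. 360.] [210. 840.]] -- matches the slide
print(round(chi2, 2))  # 507.93
```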
Data Reduction Strategies
Data reduction: a process that reduces the volume of the original data and
represents it in a much smaller volume.
Why data reduction?
• Increase storage efficiency / reduce storage cost
• Performance: complex data analysis may take a very long time to run on
  the complete data set
Data reduction strategies
• Dimensionality reduction (remove unimportant attributes)
  • Wavelet transforms
  • Principal Components Analysis (PCA)
  • Feature subset selection, feature creation
• Numerosity reduction (some simply call it: data reduction)
  • Regression and log-linear models
  • Histograms, clustering, sampling
  • Data cube aggregation
• Data compression
Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used
for dimensionality reduction in machine learning.
It is a statistical process that converts observations of correlated
features into a set of linearly uncorrelated features with the help of an
orthogonal transformation. These new transformed features are called the
principal components.
It is a popular tool for exploratory data analysis and predictive
modeling. It draws out strong patterns in the given dataset by projecting
it onto the directions of maximum variance.
PCA generally tries to find a lower-dimensional surface onto which to
project the high-dimensional data.
[Figure: 2-D points in the (x1, x2) plane projected onto their first
principal component.]
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance
matrix.
Step 4: Sort the eigenvalues and their corresponding eigenvectors.
Step 5: Pick the top k eigenvalues and form a matrix of their
eigenvectors.
Step 6: Transform the original matrix.
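A minimal NumPy sketch of the six steps on hypothetical random data (step 1 only centers the data here; dividing by the standard deviation as well would give full standardization):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 3-D data
X = rng.normal(size=(100, 3)) @ np.array([[2, 0, 0],
                                          [1, 1, 0],
                                          [0, 0, 0.1]])

# Step 1: standardize (here: center each feature)
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

# Step 3: eigenvalues and eigenvectors (symmetric matrix -> eigh)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the top k eigenvectors as the projection matrix
k = 2
W = eigvecs[:, :k]

# Step 6: transform the original (centered) data
Z = Xc @ W
print(Z.shape)   # (100, 2)
```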
Regression Analysis
Regression analysis:
◼   Regression analysis is a statistical method to model the relationship
    between a dependent (target) variable and one or more independent
    (predictor) variables.
◼   More specifically, regression analysis helps us understand how the
    value of the dependent variable changes with respect to one
    independent variable when the other independent variables are held
    fixed.
◼   It predicts continuous/real values such as temperature, age, salary,
    house price, etc.
The parameters are estimated so as to give a “best fit” of the data.
Used for prediction (including forecasting of time-series data),
inference, hypothesis testing, and modeling of causal relationships.
[Figure: fitted line y = x + 1 through points (X1, Y1), with Y1′ the
predicted value at X1.]
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
◼   Two regression coefficients, w and b, specify the line and are to be
    estimated by using the data at hand
◼   Uses the least-squares criterion on the known values of Y1, Y2, …,
    X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
◼   Many nonlinear functions can be transformed into the above
Log-linear models:
◼   Approximate discrete multidimensional probability distributions
◼   Estimate the probability of each point (tuple) in a multi-dimensional
    space for a set of discretized attributes, based on a smaller subset
    of dimensional combinations
◼   Useful for dimensionality reduction and data smoothing
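A sketch of linear regression as a numerosity-reduction device, on hypothetical points lying near the slide's line y = x + 1; only the two coefficients (w, b) need to be stored instead of all the points:

```python
import numpy as np

# Hypothetical (x, y) data roughly following y = x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 6.1])

# Least-squares estimates of w (slope) and b (intercept)
w, b = np.polyfit(x, y, deg=1)
print(w, b)          # close to 1 and 1

# The data can now be approximated by the fitted line alone
y_hat = w * x + b
```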
Histogram Analysis
[Figure: equal-width histogram with buckets of width 10,000 from 10,000
to 100,000 and counts on the y-axis from 0 to 40.]
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
◼   Equal-width: equal bucket range
◼   Equal-frequency (or equal-depth): equal bucket counts
Clustering
Partition the data set into clusters based on similarity, and store only
the cluster representation (e.g., centroid and diameter)
Can be very effective if the data is clustered, but not if the data is
“smeared”
Can use hierarchical clustering and store the result in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering algorithms
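A sketch with scikit-learn's `KMeans`; the data and the diameter definition (twice the maximum distance from a member to its centroid) are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical 2-D data with three natural groups
X = np.vstack([rng.normal(c, 0.3, size=(500, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Store only the cluster representation: centroid + diameter per cluster,
# reducing 1500 points to 3 summary records
for label in range(3):
    members = X[km.labels_ == label]
    centroid = members.mean(axis=0)
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(label, centroid.round(2), round(diameter, 2))
```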
Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allows a mining algorithm to run with complexity that is potentially
sub-linear in the size of the data
Key principle: choose a representative subset of the data
◼   Simple random sampling may perform very poorly in the presence of skew
◼   Develop adaptive sampling methods, e.g., stratified sampling
Note: sampling may not reduce database I/Os (a page is read at a time)
Types of Sampling
Simple random sampling
• There is an equal probability of selecting any particular item
Sampling without replacement
• Once an object is selected, it is removed from the population
Sampling with replacement
• A selected object is not removed from the population
Stratified sampling
• Partition the data set, and draw samples from each partition
  (proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
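A pandas sketch of the sampling schemes above; the column names and strata are hypothetical, while `DataFrame.sample` and `GroupBy.sample` are real pandas methods:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": rng.choice(["young", "middle", "senior"],
                          size=1000, p=[0.2, 0.5, 0.3]),
})

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: ~10% from each stratum, preserving proportions
stratified = (df.groupby("stratum", group_keys=False)
                .sample(frac=0.1, random_state=0))
print(stratified["stratum"].value_counts())
```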
Sampling: With or without Replacement
[Figure: raw data sampled with replacement (SRSWR) and without
replacement (SRSWOR).]
Sampling: Cluster or Stratified Sampling
[Figure: raw data beside a cluster/stratified sample.]
What Is Wavelet Transform?
Decomposes a signal into different frequency sub-bands
◼   Applicable to n-dimensional signals
Data are transformed so as to preserve the relative distance between
objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
Wavelet Transformation
Discrete wavelet transform (DWT) for linear signal processing and
multi-resolution analysis (e.g., Haar-2, Daubechies-4 wavelets)
Compressed approximation: store only a small fraction of the strongest
wavelet coefficients
Similar to the discrete Fourier transform (DFT), but better lossy
compression, localized in space
Method:
 ▪   The length, L, must be an integer power of 2 (pad with 0s when
     necessary)
 ▪   Each transform has 2 functions: smoothing and difference
 ▪   Applied to pairs of data, resulting in two sets of data of length L/2
 ▪   The two functions are applied recursively until the desired length is
     reached
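A minimal sketch of the method for the Haar case: the smoothing function is the pairwise average, the difference function is half the pairwise difference, applied recursively. The function name and the example signal are assumptions for illustration:

```python
import numpy as np

def haar_dwt(signal):
    """Recursive Haar DWT: pairwise averages (smoothing) and halved
    pairwise differences (detail), applied until one smooth coefficient
    remains. Length must be a power of 2 (pad with zeros otherwise)."""
    s = np.asarray(signal, dtype=float)
    coeffs = []
    while len(s) > 1:
        avg = (s[0::2] + s[1::2]) / 2      # smoothing function
        diff = (s[0::2] - s[1::2]) / 2     # difference function
        coeffs.append(diff)                # keep detail coefficients
        s = avg                            # recurse on the smooth half
    return s, coeffs[::-1]                 # coarsest details first

smooth, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(smooth, details)   # [2.75], then details at each resolution level
```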
Why Wavelet Transform?
Uses hat-shaped filters
• Emphasizes regions where points cluster
• Suppresses weaker information at their boundaries
Effective removal of outliers
• Insensitive to noise, insensitive to input order
Multi-resolution
• Detects arbitrarily shaped clusters at different scales
Only applicable to low-dimensional data
Efficient: complexity O(N)
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values, i.e., each old value can be identified with
one of the new values.
Data transformation is the process of converting data from one format or
structure into another format or structure.
• Smoothing: remove noise from the data
• Attribute/feature construction: new attributes constructed from the
  given ones
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a smaller, specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
• Discretization: concept hierarchy climbing
Normalization
Min-max normalization: to [new_minA, new_maxA]

    v′ = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

◼   Ex. Let income range from $12,000 to $98,000 be normalized to
    [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μA: mean, σA: standard deviation):

    v′ = (v − μA) / σA

◼   Ex. Let μ = 54,000, σ = 16,000. Then
    (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:

    v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
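All three normalizations in NumPy, reusing the slide's numbers (the decimal-scaling line assumes the maximum absolute value is not an exact power of 10):

```python
import numpy as np

v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization (using the slide's mu and sigma)
mu, sigma = 54_000, 16_000
zscore = (v - mu) / sigma

# Decimal scaling: divide by 10**j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal = v / 10**j

print(minmax[2], zscore[2], decimal)   # 0.716..., 1.225, v / 100000
```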
Discretization
Three types of attributes:
• Nominal: values from an unordered set, e.g., color, profession
• Ordinal: values from an ordered set, e.g., military or academic rank
• Numeric: real numbers, e.g., integer or real values
Discretization: divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Reduce data size by discretization
• Discretization can be performed recursively on an attribute
• Prepares for further analysis, e.g., classification
Data Discretization Methods
All of the following methods can be applied recursively:
• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
• Decision-tree analysis: supervised, top-down split
• Correlation (e.g., Χ²) analysis: unsupervised, bottom-up merge
Simple Discretization: Binning
Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: a uniform grid
• If A and B are the lowest and highest values of the attribute, the
  width of the intervals will be W = (B − A) / N
• The most straightforward, but outliers may dominate the presentation
• Skewed data is not handled well
Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the
  same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
Equal-width vs. Equal-depth Binning
[Figure: the same data partitioned into equal-width and equal-depth bins;
a sketch with pandas follows.]
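A pandas sketch contrasting the two; the price list is hypothetical, while `pd.cut` (equal-width) and `pd.qcut` (equal-depth) are real pandas functions:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width: 3 bins of equal range, W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 bins with ~equal numbers of samples
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```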
Challenges of ML
• Insufficient quantity of training data
• Nonrepresentative training data
• Poor quality of data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data
• Data mismatch
• Hyperparameter tuning and model selection
• Testing and validating
• Stepping back
BY PUNNA RAO