Data
COMP5318/COMP4318 Machine Learning and Data Mining
Semester 1, 2023, week 1b
Irena Koprinska
Reference: Tan ch. 2
                                                                                      Outline
•   Nominal and numeric attributes
•   Data cleaning
     •   Noise
     •   Missing values
•   Data preprocessing
     •   Data aggregation
     •   Feature extraction
     •   Feature subset selection
     •   Converting features from one type to another
     •   Normalization of feature values
•   Similarity measures
     •   Euclidean, Manhattan, Minkowski
     •   Hamming, SMC, Jaccard coefficient
     •   Cosine similarity
     •   Correlation
                                                                                             Data
• Data is a collection of examples (also called instances, records, observations, objects)
• Examples are described with attributes (features, variables)

  Example: each row is an example, each column an attribute; Cheat is the class attribute

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
                                                       Nominal and numeric attributes
• Two types of attributes:
    •   nominal (categorical) - their values belong to a pre-specified, finite set of
        possibilities
    •   numeric (continuous) - their values are numbers
    nominal attributes (weather data):

    outlook   temp.  humidity  windy  play
    sunny     hot    high      false  no
    sunny     hot    high      true   no
    overcast  hot    high      false  yes
    rainy     mild   high      false  yes
    rainy     cool   normal    false  yes
    rainy     cool   normal    true   no
    overcast  cool   normal    true   yes
    sunny     mild   high      false  no
    sunny     cool   normal    false  yes
    rainy     mild   normal    false  yes
    sunny     mild   normal    true   yes
    overcast  mild   high      true   yes
    overcast  hot    normal    false  yes
    rainy     mild   high      true   no

    numeric attributes (iris data):

    sepal length  sepal width  petal length  petal width  iris type
    5.1           3.5          1.4           0.2          iris setosa
    4.9           3.0          1.4           0.2          iris setosa
    4.7           3.2          1.3           0.2          iris setosa
    ...
    6.4           3.2          4.5           1.5          iris versicolor
    6.9           3.1          4.9           1.5          iris versicolor
    5.5           2.3          4.0           1.3          iris versicolor
    6.5           2.8          4.6           1.5          iris versicolor
    6.3           3.3          6.0           2.5          iris virginica
    5.8           2.7          5.1           1.9          iris virginica
    ...
                                                                                         Types of data
• Data matrix, e.g. the Refund / Marital Status / Taxable Income / Cheat table above

• Transaction data:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

• Sequential data (e.g. genetic sequence):

  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG

• Graph data (e.g. molecular structure)

• Spatio-temporal data (e.g. average monthly temperature)
                          Data cleaning
                                                                           Data cleaning
•   Data is not perfect
•   Noise due to
     • distortion of values
     • addition of spurious examples
     • inconsistent and duplicate data
•   Missing values
                                                                                       Noise
•   Noise is caused by human errors when entering data, limitations of measuring
    instruments, or flaws in the data collection process
•   1) Noise - distortion of values
     •   Ex: distortion of human voice when talking on a poor phone line
          • Higher distortion => the shape of the signal may be lost
                   [Figure: clean voice signal vs. voice + noise]
                                                                                Noise (2)
•   2) Noise – addition of spurious examples
    •   Some are far from the other examples (are outliers), some are mixed
        with the non-noisy data
                                                                                     Noise (3)
•   3) Noise – inconsistent and duplicate data
     • E.g. negative weight and height values, non-existing zip codes, 2
       records for the same person – need to be detected and corrected
     • Typically easier to detect and correct than the other two types of
       noise
•   Reducing noise types 1) and 2):
         • Using signal and image processing and outlier detection techniques
           before DM
         • Using ML algorithms that are more robust to noise – give acceptable
           results in presence of noise
                                                             Dealing with missing values
•       Various methods, e.g.:

1) Ignore all examples with missing values
•       Can be done if only a small % of values are missing

                                                                     outlook    temp.   humidity   windy   play
                                                                     sunny      hot     high       false   no
                                                                     sunny      hot     high       true    no
                                                                     overcast   hot     high       false   yes
                                                                     rainy      mild    high       false   yes
                                                                     ?          cool    normal     false   yes
                                                                     rainy      cool    normal     true    no
                                                                     overcast   cool    normal     true    yes
                                                                     sunny      mild    high       false   no
                                                                     sunny      ?       normal     false   yes
                                                                     rainy      mild    normal     false   yes
                                                                     sunny      mild    normal     true    yes
                                                                     overcast   ?       high       true    yes
                                                                     overcast   hot     normal     false   yes
                                                                     rainy      mild    high       true    no
    2) Estimate the missing values by using the remaining values
    •       Nominal attributes - replace the missing values for attribute A with
          •    the most common value for A or
          •    the most common value among the examples with the same class (if
               supervised learning)
    •       Numerical – replace with the average value of the nearest neighbors (the
            most similar examples)
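A minimal sketch of these estimation strategies with pandas and scikit-learn; the data frame, column names and values below are illustrative, not from the slides:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative data: one nominal and two numeric attributes with missing values
df = pd.DataFrame({
    "outlook":  ["sunny", None, "rainy", "overcast", "sunny"],
    "temp":     [64.0, 70.0, None, 75.0, 68.0],
    "humidity": [65.0, 90.0, 86.0, 70.0, None],
})

# Nominal attribute: fill with the most common value
df["outlook"] = df["outlook"].fillna(df["outlook"].mode()[0])

# Numeric attributes: fill with the average of the k most similar examples
num_cols = ["temp", "humidity"]
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])
print(df)
```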
                    Data preprocessing
                                                                   Data preprocessing
•   Data aggregation
•   Dimensionality reduction
•   Feature extraction
•   Feature subset selection
•   Converting attributes from one type to another
•   Normalization
                                                                     Data aggregation
• Combining two or more attributes into one – purpose:
   • Data reduction - less memory and computation time; may allow the
     use of computationally more expensive ML algorithms
   • Change of scale - provides high-level view
       • E.g. cities aggregated into states or countries
   • More stable data - aggregated data is less variable than non-
     aggregated
       • E.g. consumed daily food (food_day1, food_day2, etc.) aggregated into
         weekly food to get a more reliable understanding of the diet
         (carbohydrates, fat, protein, etc.)
   • Disadvantage – potential loss of interesting details
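A small pandas sketch of the daily-to-weekly aggregation idea above; the dates and numbers are made up for illustration:

```python
import pandas as pd

# Illustrative daily food intake data (one row per day)
daily = pd.DataFrame(
    {"carbohydrates": [210, 180, 250, 200, 190, 260, 230],
     "fat": [70, 65, 90, 75, 60, 95, 80]},
    index=pd.date_range("2023-03-06", periods=7, freq="D"),
)

# Aggregate to weekly totals: fewer rows and less day-to-day variability,
# but the daily detail is lost
weekly = daily.resample("W").sum()
print(weekly)
```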
                                                                Feature extraction
• Feature extraction is the creation of features from raw data – very
  important task
    • Requires domain expertise
    • Ex: classifying images into outdoors or indoors
      raw data: color value for each pixel
      extracted features: color histogram, dominant color, edge histogram, etc.
• May require mapping data to a new space, then extract features
    • The new space may reveal important characteristics
      [Figure: a signal consisting of 2 sine waves + noise; after a Fourier transform, the
      power spectrum shows 2 peaks corresponding to the periods of the sine waves]
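A NumPy sketch of the same idea: two sine waves plus noise are mapped to the frequency domain, where two peaks stand out. The 5 Hz and 12 Hz components are arbitrary choices for illustration:

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)        # 1 second, 1000 samples
signal = (np.sin(2 * np.pi * 5 * t)                 # 5 Hz component
          + np.sin(2 * np.pi * 12 * t)              # 12 Hz component
          + 0.5 * np.random.randn(t.size))          # noise

power = np.abs(np.fft.rfft(signal)) ** 2            # power spectrum
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# the two dominant peaks sit near 5 Hz and 12 Hz
print(freqs[np.argsort(power)[-2:]])
```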
                                                            Feature subset selection
• The process of removing irrelevant and redundant features and
  selecting a small set of features that are necessary and sufficient for
  good classification
• Very important for successful classification
• Good feature selection typically improves accuracy
• Using fewer features also means:
   • Faster building of the classifiers, i.e. reduced computational cost
   • Often a more compact and easier-to-interpret classification rule
Useful references:
Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence, vol. 97,
issue 1-2 (1997), pp. 273-324
Hall, M.: Correlation-based Feature Selection for Discrete and Numeric Class Machine
Learning. 17th Int. Conf. on Machine Learning (ICML), Morgan Kaufmann (2000), pp. 359-366
                                           Feature subset selection methods
• Brute force – try all possible combinations of features as input to a ML
  algorithm, select the best one (rarely possible in practice – too many
  combinations of features)
• Embedded - some ML algorithms can automatically select features (e.g.
  decision trees)
• Filter – select features before the ML algorithm is run; the feature
  selection is independent of the ML algorithm
    • Based on statistical measures, e.g. information gain, mutual
       information, odds ratio, etc.
    • Correlation-based feature selection, Relief
• Wrapper – select the best subset for a given ML algorithm; it uses the
  ML algorithm as a black box to evaluate different subsets and select the
  best
   Feature selection is well studied in ML and there are many excellent
                                  methods!
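As an illustration of the filter approach, a hedged scikit-learn sketch that scores each feature by its mutual information with the class and keeps the top k; the dataset and k = 2 are arbitrary choices, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter method: features are scored independently of any ML algorithm,
# then the 2 highest-scoring ones are kept
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the selected features
```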
                                                                  Feature weighting
• Can be used instead of feature reduction or in conjunction with it
• The more important features are assigned higher weights, the less
  important ones – lower weights
    • manually - based on domain knowledge
    • automatically – some classification algorithms do it (e.g. boosting) or may
      do it if this option is selected (k-nearest neighbor)
• Key idea: features with higher weights play more important role in
  the construction of the ML model
                      Converting attributes from one type to another
• Converting numeric attributes to nominal (discretization)
• Converting numeric and nominal attributes to binary attributes
  (binarization)
• Needed as some ML algorithms work only with numeric, nominal or
  binary attributes
                                                                          Binarization
• Converting categorical and numeric attributes into binary
    • There is no best method; the best one is the one that works best for a given
      ML algorithm but all possibilities cannot be evaluated
• Simple technique
    • categorical attribute -> integer -> binary
    • numeric attribute -> categorical -> integer -> binary
       [Table: example mapping categorical attribute values -> integer codes -> binary attributes]
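A minimal pandas sketch of the categorical -> integer -> binary chain; the attribute name and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"marital_status": ["Single", "Married", "Divorced", "Single"]})

# categorical -> integer codes
df["marital_int"] = df["marital_status"].astype("category").cat.codes

# categorical -> binary (one-hot): one 0/1 column per category value
binary = pd.get_dummies(df["marital_status"], prefix="marital")
print(binary)
```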
                                                                        Discretization
• Converting numeric attributes into nominal
• 2 types: unsupervised and supervised
   • Unsupervised – class information is not used
   • Supervised - class information is used
• Decisions to be taken:
   • How many categories (intervals)?
   • Where should the splits be?

   [Figure: numeric -> nominal. 2-dimensional data where x and y are numeric attributes;
   goal: convert x from numeric to nominal]
                                                            Unsupervised discretization
•   How many intervals?
     • The user specifies them, e.g. 4
•   Where should the splits be? 3 methods:
     • equal width – 4 intervals with the same width, e.g. [0,5), [5,10), [10,15), [15,20)
     • equal frequency – 4 intervals with the same number of points in each of them
     • clustering (e.g. k-means) – 4 intervals determined by a clustering method

   [Figure: the original data and its discretization into 4 intervals with equal width,
   equal frequency and k-means]
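The three options correspond to the strategies of scikit-learn's KBinsDiscretizer; a hedged sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.2], [3.4], [4.9], [7.5], [8.1], [12.0], [14.7], [19.3]])

# "uniform" = equal width, "quantile" = equal frequency, "kmeans" = clustering
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    print(strategy, disc.fit_transform(x).ravel())
```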
                                 Supervised discretization – entropy-based
• Splits are placed so that they maximize the purity of the intervals
• Entropy is a measure of the purity of a dataset (interval) S
• The higher the entropy, the lower the purity of the dataset
     entropy(S) = − Σ_i P_i · log2(P_i)          P_i – proportion of examples from class i
•  Ex.: Consider a split between 70 and 71. What is the entropy of the left and right
   datasets (intervals)?
•   values of temperature:
64 65 68 69 70 71 72 73 74 75 80 81 83 85
yes no yes yes yes no no no yes yes no yes yes no
entropy(S_left)  = − (4/5)·log2(4/5) − (1/5)·log2(1/5) = 0.722 bits
entropy(S_right) = − (4/9)·log2(4/9) − (5/9)·log2(5/9) = 0.991 bits
                                   Entropy-based discretization - example
• Total entropy of the split = weighted average of the interval entropies
      totalEntropy = Σ_{i=1}^{n} w_i · entropy(S_i)
      wi – proportion of values in interval i, n – number of intervals
• Algorithm: evaluate all possible splits and choose the best one (with the
  lowest total entropy); repeat recursively until stopping criteria are satisfied
  (e.g. user specified number of splits is reached)
                              Entropy-based discretization – example (2)
• Attribute: temperature (values and class labels):
64 65     68 69 70 71 72 73 74 75 80                            81 83 85
yes no    yes yes yes no no no yes yes no                       yes yes no
•   7 initial possible splits
•   For each of the 7 splits:
     • Compute the entropy of the 2 intervals
     • Compute the total entropy of the split
•   Choose the best split (the one with minimum total entropy)
•   Repeat for the remaining splits until the desired number of splits is
    reached
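A minimal Python sketch of one round of this procedure on the temperature data above; for simplicity it scores every gap between consecutive values rather than only the 7 candidate splits considered on the slide:

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 73, 74, 75, 80, 81, 83, 85])
play = np.array(["yes", "no", "yes", "yes", "yes", "no", "no", "no",
                 "yes", "yes", "no", "yes", "yes", "no"])

best_split, best_total = None, np.inf
for i in range(1, len(temperature)):               # split between positions i-1 and i
    left, right = play[:i], play[i:]
    # total entropy = weighted average of the two interval entropies
    total = (len(left) * entropy(left) + len(right) * entropy(right)) / len(play)
    if total < best_total:
        best_split, best_total = (temperature[i - 1] + temperature[i]) / 2, total

print(best_split, round(best_total, 3))            # split point with the lowest total entropy
```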
                                        Normalization and standardization
• Attribute transformation to a new range, e.g. [0,1]
• Used to avoid the dominance of attributes with large values over
  attributes with small values
• Required for distance-based ML algorithms; some other algorithms also
  work better with normalized data
    • E.g. age (in years) and annual income (in dollars) have different scales
      A=[20, 40 000]
      B=[40, 60 000]
      D(A,B)=|20-40| + |40 000-60 000|=20 020
      Difference in income dominates, age doesn’t contribute
• Solution: first normalize or standardize, then calculate the distance
                                     Normalization and standardization (2)
•   Performed for each attribute
     Normalization (also called min-max scaling):

         x' = (x − min(x)) / (max(x) − min(x))

     Standardization:

         x' = (x − μ(x)) / σ(x)

     x – original value
     x' – new value
     x (inside min, max, μ, σ) – all values of the attribute; a vector
     min(x) and max(x) – min and max values of the attribute (of the vector x)
     μ(x) – mean value of the attribute
     σ(x) – standard deviation of the attribute
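Both transformations in a few lines of NumPy (scikit-learn provides MinMaxScaler and StandardScaler for the same purpose); the attribute values are illustrative:

```python
import numpy as np

x = np.array([20.0, 40.0, 25.0, 60.0])           # one attribute, e.g. age

x_norm = (x - x.min()) / (x.max() - x.min())      # min-max scaling into [0, 1]
x_std = (x - x.mean()) / x.std(ddof=1)            # standardization (z-score, sample std)

print(x_norm, x_std)
```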
                                                      Normalization - example
Examples with 2 attributes: age and income:
A=[20, 40 000]
B=[40, 60 000]
C=[25, 30 000]
…
Suppose that:
for age: min = 0, max=100
for income: min=0, max=100 000
After normalization:
A=[0.2, 0.4]
B=[0.4, 0.6]
C=[0.25, 0.3]
…
D(A,B)=|0.2-0.4| + |0.4-0.6|=0.4, i.e. income and age contribute equally
                    Similarity measures
                                                               Measuring similarity
• Many ML algorithms need to measure the similarity between 2 examples
• Two main types of measures
   • Distance
   • Correlation
                                              Euclidean and Manhattan distance
•   Distance measures for numeric attributes
     •   A, B – examples with attribute values a1, a2,..., an & b1, b2,..., bn
     •   E.g. A= [1, 3, 5], B=[1, 6, 9]
•   Euclidean distance (L2 norm) – most frequently used
    D(A,B) = sqrt((a1 − b1)² + (a2 − b2)² + ... + (an − bn)²)
     D(A,B) = sqrt((1−1)² + (3−6)² + (5−9)²) = 5
•   Manhattan distance (L1 norm)
    D(A,B) = |a1 − b1| + |a2 − b2| + ... + |an − bn|
    D(A,B) = |1−1| + |3−6| + |5−9| = 7
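The two distances for the example A = [1, 3, 5], B = [1, 6, 9], in NumPy:

```python
import numpy as np

A = np.array([1, 3, 5])
B = np.array([1, 6, 9])

euclidean = np.sqrt(((A - B) ** 2).sum())   # 5.0
manhattan = np.abs(A - B).sum()             # 7
print(euclidean, manhattan)
```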
                                                                     Minkowski distance
•   Minkowski distance – generalization of Euclidean & Manhattan
    D(A,B) = (|a1 − b1|^q + |a2 − b2|^q + ... + |an − bn|^q)^(1/q)
    q – positive integer
•   Weighted distance – each attribute is assigned a weight according to
    its importance (requires domain knowledge)
     • Weighted Euclidean:
      D(A,B) = sqrt(w1·(a1 − b1)² + w2·(a2 − b2)² + ... + wn·(an − bn)²)
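A hedged SciPy sketch: scipy.spatial.distance.minkowski takes the order q as its parameter p, and scipy.spatial.distance.euclidean accepts an attribute weight vector w (the weights below are arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy.spatial.distance import minkowski, euclidean

A = np.array([1.0, 3.0, 5.0])
B = np.array([1.0, 6.0, 9.0])

print(minkowski(A, B, p=1))     # q = 1: Manhattan distance (7.0)
print(minkowski(A, B, p=2))     # q = 2: Euclidean distance (5.0)

# Weighted Euclidean: more important attributes get larger weights
w = np.array([0.5, 0.3, 0.2])
print(euclidean(A, B, w=w))
```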
                                               Similarity between binary vectors
•   Hamming distance = Manhattan for binary vectors
     • Counts the number of different bits
        D(A,B) = |a1 − b1| + |a2 − b2| + ... + |an − bn|
         A = [1 0 0 0 0 0 0 0 0 0 ]
         B = [0 0 0 0 0 0 1 0 0 1 ]
         D(A,B) = 3
•   Similarity coefficients
         f00: number of matching 0-0 bits
         f01: number of matching 0-1 bits
         f10: number of matching 1-0 bits
         f11: number of matching 1-1 bits
•   Calculate these coefficients for the example above!
    Answer: f01 = 2, f10 = 1, f00 = 7 , f11 = 0
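The counts, together with the SMC and Jaccard coefficients defined below, computed in NumPy for these A and B:

```python
import numpy as np

A = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = int(((A == 1) & (B == 1)).sum())   # 0
f00 = int(((A == 0) & (B == 0)).sum())   # 7
f10 = int(((A == 1) & (B == 0)).sum())   # 1
f01 = int(((A == 0) & (B == 1)).sum())   # 2

hamming = f01 + f10                               # 3, number of differing bits
smc = (f11 + f00) / (f11 + f00 + f01 + f10)       # 0.7
jaccard = f11 / (f11 + f01 + f10)                 # 0.0
print(hamming, smc, jaccard)
```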
                                         Similarity between binary vectors (2)
• Simple Matching Coefficient (SMC) - number of matching 1-1 and 0-0 bits / number of attributes
      SMC = (f11+f00)/(f01+f10+f11+f00)
       Ex.:      A = [1 0 0 0 0 0 0 0 0 0 ]
                 B = [0 0 0 0 0 0 1 0 0 1 ]
                 f01 = 2, f10 = 1, f00 = 7 , f11 = 0
                 SMC = (0+7) / (2+1+0+7) = 0.7
• Task: Suppose that A and B are the supermarket bills of 2 customers. Each
  product in the supermarket corresponds to a different attribute.
   • attribute value = 1 – product was purchased
   • attribute value = 0 - product was not purchased
• SMC is used to calculate the similarity between A and B. Is there any
  problem using SMC?
                                         Similarity between binary vectors (3)
• Yes, SMC will find all customer transactions (bills) to be similar
• Reason: The number of products that are not purchased in a transaction is
  much bigger than the number of products that are purchased
• => f00 will be very high (not purchased products)
   • f11 will be low (purchased products)
   • f00 will be much higher than f11 and its effect will be lost
       SMC = (f11+f00)/(f01+f10+f11+f00)
• => More generally, the problem is that the 2 vectors A and B contain many
  0s, i.e. are very sparse => SMC is not suitable for sparse data
               A = [1 0 0 0 0 0 0 0 0 0 ]
               B = [0 0 0 0 0 0 1 0 0 1 ]
                                                                       SMC vs Jaccard
• An alternative: Jaccard coefficient
   • counts matching 1-1 and ignores matching 0-0
        J=f11/(f01+f10+f11)
        A = [1 0 0 0 0 0 0 0 0 0 ]
        B = [0 0 0 0 0 0 1 0 0 1 ]
        f01 = 2, f10 = 1, f00 = 7 , f11 = 0
        J = 0 / (2 + 1 + 0) = 0 (A and B are dissimilar)
•   Compare with SMC:
       SMC= (0+7) / (2+1+0+7) = 0.7 (A and B are highly similar - incorrect)
                                                                       Cosine similarity
• Useful for sparse data (both binary and non-binary)
• Widely used for classification of text documents:
     cos(A, B) = (A · B) / (||A|| · ||B||)

     · – vector dot product, ||A|| – length (norm) of vector A
• Geometric representation: measures the angle between A and B
    • Cosine similarity=1 => angle(A,B)=0º
    • Cosine similarity =0 => angle (A,B)=90º
                                                         Cosine similarity - example
• Two document vectors:
      d1 = 3 2 0 5 0 0 0 2 0 0
      d2 = 1 0 0 0 0 0 0 1 0 2
        cos(d1, d2) = (d1 · d2) / (||d1|| · ||d2||)
   d1 . d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
   ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^(1/2) = 42^(1/2) = 6.481
   ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^(1/2) = 6^(1/2) = 2.449
   => cos( d1, d2 ) = 0.315
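The same computation in NumPy:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

# dot product divided by the product of the vector lengths
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 3))   # 0.315
```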
                                                                               Correlation
• Measures linear relationship between numeric attributes
• Pearson correlation coefficient between vectors x and y with
  dimensionality n
     corr(x, y) = covar(x, y) / (std(x) · std(y))

     where:

     mean(x) = (1/n) · Σ_{k=1}^{n} x_k

     std(x) = sqrt( Σ_{k=1}^{n} (x_k − mean(x))² / (n − 1) )

     covar(x, y) = (1/(n−1)) · Σ_{k=1}^{n} (x_k − mean(x)) · (y_k − mean(y))
•   Range: [-1, 1]
     • -1: perfect negative correlation
     • +1: perfect positive correlation
     • 0: no correlation
                                                             Correlation - examples
•   Ex1: corr(x,y)=?
     x=(-3, 6, 0, 3, -6)
     y=( 1,-2, 0,-1, 2)
•   Ex2: corr(x,y)=?
     x=(3, 6, 0, 3, 6)
     y=( 1, 2, 0, 1, 2)
•   Ex3: corr(x,y)=?
     x=(-3, -2, -1, 0, 1, 2, 3)
     y=( 9, 4, 1, 0, 1, 4, 9)
                                                                                 Answers
•   Ex1: corr(x,y)=?                    corr(x,y) = -1
     x=(-3, 6, 0, 3, -6)                perfect negative linear correlation
     y=( 1,-2, 0,-1, 2)
•   Ex2: corr(x,y)=?                    corr(x,y) = +1
     x=(3, 6, 0, 3, 6)                  perfect positive linear correlation
     y=( 1, 2, 0, 1, 2)
•   Ex3: corr(x,y)=?                    corr(x,y) = 0
     x=(-3, -2, -1, 0, 1, 2, 3)         no linear correlation
     y=( 9, 4, 1, 0, 1, 4, 9)           However, there is a non-linear
                                        relationship: y=x2
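The three answers can be verified with NumPy's corrcoef:

```python
import numpy as np

pairs = {
    "Ex1": ([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]),
    "Ex2": ([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]),
    "Ex3": ([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]),
}
for name, (x, y) in pairs.items():
    # off-diagonal entry of the 2x2 correlation matrix
    print(name, round(np.corrcoef(x, y)[0, 1], 3))   # -1.0, 1.0, 0.0
```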
                                          Correlation – visual evaluation
                              Distance measures for nominal attributes
• Various options, depending on the task and the type of data; requires
  domain expertise
• E.g.:
   • difference =0 if attribute values are the same
   • difference =1 if they are not
   • Example: 2 attributes = temperature and windy
     temperature values: low and high
     windy values: yes and no
     A = (high, no)
     B = (high, yes)
      d(A,B) = (0 + 1)^(1/2) = 1 (Euclidean distance)
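A tiny sketch of this 0/1 mismatch distance; the function name is illustrative:

```python
def nominal_distance(a, b):
    """Per-attribute 0/1 mismatch, combined as in the Euclidean distance."""
    return sum(int(x != y) for x, y in zip(a, b)) ** 0.5

print(nominal_distance(("high", "no"), ("high", "yes")))   # 1.0
```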