Data mining revision:
Q1: What is data mining, and why data mining?
Q2: What is the Knowledge Discovery (KDD) process?
An outline of the steps of the KDD process
Q3: Difference between classification and clustering, with examples
Examples
Classification Task
Q4: Decision tree
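Class counts per attribute value (14 tuples in total):

age        No   yes
<=30       2    3
31…40      4    0
>40        3    2

Blood hemoglobin   No   yes
High               3    1
Medium             4    2
Low                2    2

Each partition's entropy is weighted by its share of the 14 tuples, and a term of the form 0 × log2(0) is taken as 0.

$$
\begin{aligned}
Info_{age}(D) &= \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right)
 + \frac{4}{14}\left(-\frac{0}{4}\log_2\frac{0}{4} - \frac{4}{4}\log_2\frac{4}{4}\right) \\
&\quad + \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) = 0.694 \text{ bits}
\end{aligned}
$$

$$
\begin{aligned}
Info_{blood\ hem}(D) &= \frac{4}{14}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right)
 + \frac{6}{14}\left(-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\right) \\
&\quad + \frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) = 0.911 \text{ bits}
\end{aligned}
$$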
Gender     No   yes
M          3    4
F          6    1

CBCR        No   yes
Fair        6    2
Excellent   3    3

$$
Info_{gender}(D) = \frac{7}{14}\left(-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}\right)
 + \frac{7}{14}\left(-\frac{6}{7}\log_2\frac{6}{7} - \frac{1}{7}\log_2\frac{1}{7}\right) = 0.7884 \text{ bits}
$$

$$
Info_{CBCR}(D) = \frac{8}{14}\left(-\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8}\right)
 + \frac{6}{14}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) = 0.892 \text{ bits}
$$
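The four Info values can be re-checked with a short script. This is only a sketch: the function names `info` and `info_attr` are made up for illustration, the count pairs are copied from the counts and formulas above, and the 9/5 class split of the whole data set is obtained by summing those counts. Because entropy is symmetric in the two classes, only the pair of counts matters, not which one is labelled yes or no. The gain of an attribute is then Gain(A) = Info(D) - Info_A(D).

```python
from math import log2

def info(counts):
    """Entropy in bits of a tuple of class counts; 0*log2(0) is treated as 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_attr(partitions):
    """Weighted entropy Info_A(D) after splitting D into the given partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

# Class-count pairs per attribute value, copied from the notes.
partitions = {
    "age": [(2, 3), (4, 0), (3, 2)],
    "blood hemoglobin": [(3, 1), (4, 2), (2, 2)],
    "gender": [(3, 4), (6, 1)],
    "CBCR": [(6, 2), (3, 3)],
}

# Whole data set: 9 tuples of one class and 5 of the other (column sums).
info_d = info((9, 5))

for name, parts in partitions.items():
    ia = info_attr(parts)
    print(f"{name}: Info_A(D) = {ia:.3f} bits, Gain = {info_d - ia:.3f}")
```

The printed gains may differ from hand-computed ones in the last digit, since the hand calculation subtracts already rounded values.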
$$
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
$$
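$$
SplitInfo_{age}(D) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.5774
$$

$$
SplitInfo_{blood\ hem}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
$$

$$
SplitInfo_{gender}(D) = -\frac{7}{14}\log_2\frac{7}{14} - \frac{7}{14}\log_2\frac{7}{14} = 1
$$

$$
SplitInfo_{CBCR}(D) = -\frac{8}{14}\log_2\frac{8}{14} - \frac{6}{14}\log_2\frac{6}{14} = 0.9852
$$

             Age      Blood h   Gender   CBCR
Gain         0.246    0.029     0.151    0.048
Split Info   1.5774   1.557     1        0.9852

GainRatio(A) = Gain(A) / SplitInfo_A(D):
Age:      0.246 / 1.5774 = 0.156
Blood h:  0.029 / 1.557  = 0.019
Gender:   0.151 / 1      = 0.151
CBCR:     0.048 / 0.9852 = 0.049

Age has the highest gain ratio, so it is chosen as the root of the tree.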
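As a cross-check of the split information and gain ratio figures, here is a minimal sketch. The helper name `split_info` is hypothetical, the partition sizes are read off the class counts earlier in these notes (for example 5, 4 and 5 tuples for the three age ranges), and the gain values are the rounded ones from the table.

```python
from math import log2

def split_info(sizes):
    """SplitInfo_A(D) for partitions of the given sizes."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s > 0)

# Number of tuples in each partition, per attribute.
sizes = {
    "age": [5, 4, 5],
    "blood h": [4, 6, 4],
    "gender": [7, 7],
    "CBCR": [8, 6],
}

# Rounded gains as listed in the notes.
gain = {"age": 0.246, "blood h": 0.029, "gender": 0.151, "CBCR": 0.048}

gain_ratio = {a: gain[a] / split_info(sizes[a]) for a in sizes}
best = max(gain_ratio, key=gain_ratio.get)

for a, r in gain_ratio.items():
    print(f"{a}: SplitInfo = {split_info(sizes[a]):.4f}, GainRatio = {r:.3f}")
print("attribute with the highest gain ratio:", best)
```

Dividing by the split information penalises attributes that break the data into many small partitions, which is why gain ratio is used instead of raw gain here.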
Tree after the first split:

age
  <=30   -> ??
  31…40  -> Yes
  >40    -> ??
Branch age <= 30:

             Blood h   Gender   CBCR
Gain         0.057     0.97     0.02
Split Info   0.993     0.97     0.97
GainRatio    0.0574    1        0.0206

Gender has the highest gain ratio, so it splits the <=30 branch.
Age
  <=30   -> Gender
              male   -> Yes
              female -> No
  31…40  -> Yes
  >40    -> ??
Final tree:

Age
  <=30   -> Gender
              male      -> Yes
              female    -> No
  31…40  -> Yes
  >40    -> CBCR
              excellent -> No
              fair      -> Yes
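To see how the finished tree is used for prediction, it can be written as a nested dictionary and walked from the root to a leaf. This is an illustrative sketch only: the dictionary layout, the `classify` function and the sample records are assumptions, and the middle age range is spelled `31..40` in ASCII.

```python
# The final tree from the notes, as nested dicts; leaves are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {
            "attribute": "gender",
            "branches": {"male": "Yes", "female": "No"},
        },
        "31..40": "Yes",
        ">40": {
            "attribute": "CBCR",
            "branches": {"excellent": "No", "fair": "Yes"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

# Hypothetical records, not taken from the original data set.
print(classify(tree, {"age": "<=30", "gender": "female", "CBCR": "fair"}))
print(classify(tree, {"age": ">40", "gender": "male", "CBCR": "fair"}))
```

Note that blood hemoglobin never appears in the tree, so a record does not need that attribute to be classified.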
Q5: Single linkage
[Figure not recovered; surviving labels: 295, 268, 255, 219, 138]
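For revision, single linkage can be sketched as a naive agglomerative loop: start with every point in its own cluster and repeatedly merge the two clusters whose closest members are nearest to each other. The distance matrix and the labels A to D below are made-up illustration data, not the values from the exercise, and `single_linkage` is a hypothetical helper name.

```python
def single_linkage(dist, labels):
    """Naive single-linkage clustering on a symmetric distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            sorted(labels[i] for i in clusters[a]),
            sorted(labels[i] for i in clusters[b]),
            d,
        ))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances between four points A, B, C, D.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

merges = single_linkage(dist, labels)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The merge distances, read in order, are the heights at which the branches of the dendrogram join.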