Datamining 1

Jntuk

Uploaded by

Hello Hello
Copyright
© © All Rights Reserved

UNIT–I:

Why Data Mining?
We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need.

Moving toward the Information Age:

● We are actually living in the data age. Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life.
● Businesses worldwide generate gigantic data sets, including sales transactions, stock trading records, product descriptions, sales promotions, company profiles and performance, and customer feedback.

What Is Data Mining? And its architecture

Data mining is the retrieval of correct data, without redundancy and errors.

DATA MINING ARCHITECTURE

1. Data mining is “knowledge mining from data”.
2. Many people treat data mining as a synonym for knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery.

STEPS INVOLVED IN DATA MINING:

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the database)

4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)

5. Data mining (an essential process where intelligent methods are applied to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures; see Section 1.4.6)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)

Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
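
The seven steps can be read as a pipeline in which each stage feeds the next. The following is a minimal Python sketch of that idea, not a real mining system: the record layout, the function names, and the toy "best-selling item" pattern are all assumptions made for illustration.

```python
from collections import Counter

def clean(records):
    """Step 1: drop records with a missing item or a non-positive amount."""
    return [r for r in records if r.get("item") and r.get("amount", 0) > 0]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, branch):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["branch"] == branch]

def transform(records):
    """Step 4: aggregate amounts per item (a summary operation)."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return totals

def mine(totals, top_n=1):
    """Step 5: a toy 'pattern': the best-selling items."""
    return totals.most_common(top_n)

def evaluate(patterns, min_amount):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return [(item, amount) for item, amount in patterns if amount >= min_amount]

# Hypothetical data from two sources; one record has no item, one has a bad amount.
store_a = [{"branch": "A", "item": "computer", "amount": 900},
           {"branch": "A", "item": "phone", "amount": 300},
           {"branch": "A", "item": None, "amount": 50}]
store_b = [{"branch": "A", "item": "computer", "amount": 1100},
           {"branch": "B", "item": "phone", "amount": -10}]

data = select(integrate(clean(store_a), clean(store_b)), branch="A")
patterns = evaluate(mine(transform(data)), min_amount=1000)
print(patterns)  # Step 7: present the mined knowledge to the user
```

In this sketch steps 1 to 4 only reshape the data; the pattern itself is produced in step 5 and filtered in step 6, which mirrors the split between preprocessing and mining described above.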

What Kinds of Data Can Be Mined?

Data mining can be applied to any kind of data as long as the data are meaningful for a target application.

The most basic forms of data for mining applications are:

1. Database data
2. Data warehouse data
3. Transactional data
4. Data mining can also be applied to other forms of data (e.g., data streams, ordered/sequence data, graph or networked data, spatial data, text data, multimedia data, and the WWW).

Database Data:

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

A relational database for AllElectronics. The fictitious AllElectronics store is used to illustrate concepts throughout this book. The company is described by the following relation tables:

customer, item, employee, and branch.

customer (custID, name, address, age, occupation, annual income, credit information, category, ...)
item (itemID, brand, category, type, price, place made, supplier, cost, ...)
employee (emplID, name, category, group, salary, commission, ...)
branch (branchID, name, address, ...)
purchases (transID, custID, emplID, date, time, method paid, amount)
items sold (transID, itemID, qty)
works at (emplID, branchID)

Relational schema for a relational database, AllElectronics

Relational languages also use aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum). Using aggregates allows you to ask: “Show me the total sales of the last month, grouped by branch,” or “How many sales transactions occurred in the month of December?” or “Which salesperson had the highest sales?”
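
Two of these questions can be tried out with an in-memory SQLite database from Python's standard library. This is only a sketch: the single simplified `purchases` table (with hypothetical `branch` and `month` columns folded in) and its four rows are assumptions for illustration, not the AllElectronics schema itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table: branch and month are stored directly on each row.
conn.execute(
    "CREATE TABLE purchases (transID TEXT, branch TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("T100", "Vancouver", "2023-11", 250.0),
        ("T200", "Vancouver", "2023-12", 100.0),
        ("T300", "Toronto", "2023-12", 400.0),
        ("T400", "Toronto", "2023-12", 50.0),
    ],
)

# "Show me the total sales of the month, grouped by branch."
totals = conn.execute(
    "SELECT branch, SUM(amount) FROM purchases "
    "WHERE month = '2023-12' GROUP BY branch ORDER BY branch"
).fetchall()

# "How many sales transactions occurred in the month of December?"
(december_count,) = conn.execute(
    "SELECT COUNT(*) FROM purchases WHERE month = '2023-12'"
).fetchone()

print(totals, december_count)
```

The `GROUP BY` clause produces one output row per branch, and the aggregate function (`SUM`, `COUNT`) is evaluated once per group.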

Data Warehouses:

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

Example:

Suppose that AllElectronics is a successful international company with branches around the world. Each branch has its own set of databases. The president of AllElectronics has asked you to provide an analysis of the company’s sales per item type per branch for the third quarter.

To facilitate decision making, the data in a data warehouse are organized around major subjects (e.g., customer, item, supplier, and activity). The data are stored to provide information from a historical perspective, such as in the past 6 to 12 months, and are typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region.

A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales amount). A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

Typical framework of a data warehouse for AllElectronics.

Example 1.3 A data cube for AllElectronics.

A data cube for summarized sales data of AllElectronics is presented in Figure 1.7(a). The cube has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values home entertainment, computer, phone, security). The aggregate value stored in each cell of the cube is sales amount (in thousands). For example, the total sales for the first quarter, Q1, for the items related to security systems in Vancouver is $400,000, as stored in cell (Vancouver, Q1, security). Additional cubes may be used to store aggregate sums over each dimension, corresponding to the aggregate values obtained using different SQL group-bys (e.g., the total sales amount per city and quarter, or per city and item, or per quarter and item, or per each individual dimension).
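
The relationship between the base cube and the additional group-by cubes can be sketched in a few lines of Python. The fact rows below are invented for illustration, except that the Vancouver/Q1/security cell is set to 400 (thousands) to match the text; the function and variable names are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "quarter", "item")

# (city, quarter, item type, sales amount in thousands); made-up values,
# apart from Vancouver/Q1/security = 400 as in the example above.
facts = [
    ("Vancouver", "Q1", "security", 400),
    ("Vancouver", "Q1", "computer", 825),
    ("Vancouver", "Q2", "security", 512),
    ("Toronto", "Q1", "security", 350),
]

def group_by(rows, dims):
    """Sum the sales amount over the chosen dimensions (like SQL GROUP BY)."""
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in positions)
        totals[key] += row[3]
    return dict(totals)

# One aggregate table per subset of the dimensions: 2^3 = 8 group-bys,
# from the full (city, quarter, item) cube down to the grand total ().
cuboids = {
    dims: group_by(facts, dims)
    for k in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, k)
}

print(cuboids[("city", "quarter")])
```

Moving from the `("city", "quarter", "item")` table to `("city", "quarter")` is a roll-up (less detail); going the other way is a drill-down (more detail).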

By providing multidimensional data views and the precomputation of summarized data, data warehouse systems can provide inherent support for OLAP. Online analytical processing operations make use of background knowledge regarding the domain of the data being studied to allow the presentation of data at different levels of abstraction.

Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization.

A multidimensional data cube, commonly used for data warehousing, (a) showing summarized data for AllElectronics and (b) showing summarized data resulting from drill-down and roll-up operations on the cube in (a). For improved readability, only some of the cube cell values are shown.

Transactional Data:

In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page. A transaction typically includes a unique transaction identity number (transID) and a list of the items making up the transaction, such as the items purchased in the transaction.

A transactional database for AllElectronics. Transactions can be stored in a table, with one record per transaction. A fragment of a transactional database for AllElectronics is shown in the figure.

As an analyst of AllElectronics, you may ask, “Which items sold well together?” This kind of market basket data analysis would enable you to bundle groups of items together as a strategy for boosting sales.
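
A first, very rough answer to “Which items sold well together?” is to count how often each pair of items appears in the same transaction. The sketch below does exactly that; the transaction IDs and item lists are invented for illustration, and counting pairs is only the simplest form of market basket analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical fragment of a transactional table: transID -> items sold.
transactions = {
    "T100": ["computer", "software", "printer"],
    "T200": ["computer", "software"],
    "T300": ["phone", "printer"],
    "T400": ["computer", "printer", "software"],
}

pair_counts = Counter()
for items in transactions.values():
    # Sort so that (a, b) and (b, a) are counted as the same pair.
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count)
```

Here the pair bought together most often is a natural candidate for bundling.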

Other Kinds of Data:

Besides relational database data, data warehouse data, and transaction data, there are many other kinds of data that have versatile forms and structures and rather different semantic meanings. Such kinds of data can be seen in many applications: time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data), data streams (e.g., video surveillance and sensor data, which are continuously transmitted), spatial data (e.g., maps), engineering design data (e.g., the design of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), graph and networked data (e.g., social and information networks), and the Web (a huge, widely distributed information repository made available by the Internet). These applications bring about new challenges, like how to handle data carrying special structures (e.g., sequences, trees, graphs, and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity), and how to mine patterns that carry rich structures and semantics.

What Kinds of Patterns Can Be Mined? Or Data Mining Functionalities

There are a number of data mining functionalities. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks, such as:

1. characterization
2. discrimination
3. mining of frequent patterns
4. associations
5. correlations
6. classification and regression
7. clustering analysis

In general, such functionalities can be classified into two categories:

1. Descriptive (expressive)
Descriptive mining tasks describe properties of the data in a target data set.
Ex: Descriptive statistics are useful to show things like total stock in inventory, average dollars spent per customer, and year-over-year change in sales.

2. Predictive (analytical)

Predictive mining tasks perform induction (training) on the current data in order to make predictions.

Predictive analytics can be used throughout the organization, from forecasting customer behavior and purchasing patterns to identifying trends in sales activities. They also help forecast demand for inputs from the supply chain, operations and inventory.

Class/Concept Description: Characterization and Discrimination:

Data entries can be associated with classes or concepts.

For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.

Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived using

1. Data characterization
2. Data discrimination
3. Both data characterization and discrimination.

Data characterization (classification, categorization, description):

Data characterization summarizes the data of the class under study (often called the target class) in general terms.

Example:

A customer relationship manager at AllElectronics may order the following data mining task: Summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result is a general profile of these customers, such as that they are 40 to 50 years old, employed, and have excellent credit ratings. The data mining system should allow the customer relationship manager to drill down on any dimension, such as on occupation to view these customers according to their type of employment.

The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.

Data discrimination:

Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes. The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries.

Example (data discrimination). A customer relationship manager at AllElectronics may want to compare two groups of customers: those who shop for computer products regularly (e.g., more than twice a month) and those who rarely shop for such products (e.g., less than three times a year). The resulting description provides a general comparative profile of these customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either seniors or youths, and have no university degree.

Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.

Association analysis:

Suppose, as a marketing manager, you would like to determine which items are frequently purchased together within the same transactions.

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer. Confidence = 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

Support = 1% means that 1% of all of the transactions under analysis showed that computer and software were purchased together.
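
Both measures are simple ratios over the transaction list, which the following Python sketch makes explicit. The four transactions are invented, so the resulting numbers (25% and 50%) are properties of this toy data only; the function names are illustrative, not from any library.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"phone"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

In this toy data one of four transactions contains both items (support 0.25), and one of the two computer buyers also bought software (confidence 0.5).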

Classification and Regression for Predictive Analysis:

Classification is the process of finding a model that describes and distinguishes data classes for the purpose of being able to use the model to predict the class of objects whose class label is unknown.


“How is the derived model presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.

Clustering:

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate (separate) groups with similar qualities and assign them into clusters.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups.
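
One common way to form such groups is k-means, which the text does not name; it is used here only as an example technique. The sketch below is a deliberately tiny one-dimensional version with two groups instead of ten, and the spending figures and starting centers are invented.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then move
    each center to the mean of the values assigned to it, and repeat."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its old center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
monthly_spend = [10, 12, 11, 95, 100, 105]
centers, clusters = k_means_1d(monthly_spend, centers=[0, 50])
print(centers, clusters)
```

The low spenders and the high spenders end up in separate clusters, and a different strategy could then be applied to each.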

Which Technologies Are Used?

Data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high performance computing, and many application domains.

Statistics:

● Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics.
● A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions.
● Statistical models are widely used to model data and data classes.
● Statistics research develops tools for prediction and forecasting using data and statistical models. Statistical methods can be used to summarize or describe a collection of data.
● Statistical methods can also be used to verify data mining results.
● A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical decisions using experimental data.

Machine Learning:

Machine learning investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data.

Supervised learning, as the name indicates, involves a supervisor acting as a teacher. In supervised learning we teach or train the machine using data that is well labeled, meaning that each example is already tagged with the correct answer. After that, the machine is given a new set of examples (data), and the supervised learning algorithm, having analyzed the training data (the set of training examples), produces an outcome for them based on what it learned from the labeled data.

For instance, suppose you are given a basket filled with different kinds of fruits. The first step is to train the machine with all the different fruits one by one, like this:

● If the shape of the object is rounded with a depression at the top and its color is red, then it will be labelled as Apple.
● If the shape of the object is a long curving cylinder and its color is green-yellow, then it will be labelled as Banana.

Now suppose that, after training, you are given a new, separate fruit from the basket, say a banana, and asked to identify it.

Since the machine has already learnt from the previous data, this time it has to use that knowledge wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from training data (the basket containing fruits) and then applies the knowledge to test data (the new fruit).
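
The fruit example can be written down as a toy program. This sketch "learns" by simply memorising which label goes with each combination of features, which is far cruder than a real learning algorithm; the feature names, the `train` and `predict` functions, and the data are all assumptions made for illustration.

```python
# Labeled training data: (features, correct answer).
training_data = [
    ({"shape": "round with depression at top", "color": "red"}, "Apple"),
    ({"shape": "long curving cylinder", "color": "green-yellow"}, "Banana"),
]

def train(examples):
    """'Training' here just memorises the label seen for each feature combination."""
    return {tuple(sorted(features.items())): label for features, label in examples}

def predict(model, features):
    """Look up the label for unseen data; unknown combinations get 'unknown'."""
    return model.get(tuple(sorted(features.items())), "unknown")

model = train(training_data)

# Test data: a new fruit taken from the basket.
new_fruit = {"color": "green-yellow", "shape": "long curving cylinder"}
print(predict(model, new_fruit))
```

The split into a training phase on labeled data and a prediction phase on new data is the part that carries over to real supervised learning methods.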

Supervised learning is classified into two categories of algorithms:
● Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
● Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Unsupervised learning:
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.

Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore the machine is left to find the hidden structure in unlabeled data by itself.

For instance, suppose it is given an image containing both dogs and cats, which it has never seen before. The machine has no idea about the features of dogs and cats, so it cannot label them as dogs and cats. But it can categorize them according to their similarities, patterns and differences; i.e., it can easily divide the picture into two parts. The first part may contain all the pictures having dogs in them and the second part may contain all the pictures having cats in them. Here the machine did not learn anything beforehand, meaning there were no training data or examples.

Unsupervised learning is classified into two categories of algorithms:
● Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
● Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes. For a two-class problem, we can think of the set of examples belonging to one class as the positive examples and those belonging to the other class as the negative examples.

Information Retrieval:

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences between traditional information retrieval and database systems are twofold. Information retrieval assumes that

(1) the data under search are unstructured; and
(2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).

The typical approaches in information retrieval adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document. The document’s language model is the probability density function that generates the bag of words in the document. The similarity between two documents can be measured by the similarity between their corresponding language models.
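
The bag-of-words idea is easy to illustrate in Python. Note that the sketch below compares the raw word counts with cosine similarity, which is a simpler stand-in and not the language-model comparison described above; the tokenisation (lowercasing and splitting on whitespace) and the sample sentences are also assumptions.

```python
import math
from collections import Counter

def bag_of_words(text):
    """A multiset of words: each word mapped to how often it occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

doc1 = bag_of_words("data mining finds patterns in data")
doc2 = bag_of_words("mining patterns in large data sets")
doc3 = bag_of_words("web search engine")

print(cosine_similarity(doc1, doc2), cosine_similarity(doc1, doc3))
```

Documents that share many words score close to 1, while documents with no words in common score 0.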

Which Kinds of Applications Are Targeted?

1. Business Intelligence
2. Web Search Engines

Business Intelligence

It is critical for businesses to acquire a better understanding of the commercial context of their organization, such as their customers, the market, supply and resources, and competitors. Business intelligence (BI) technologies provide historical, current, and predictive views of business operations. Examples include reporting, online analytical processing, business performance management, competitive intelligence, benchmarking, and predictive analytics.

“How important is business intelligence?” Without data mining, many businesses may not be able to perform effective market analysis, compare customer feedback on similar products, discover the strengths and weaknesses of their competitors, retain highly valuable customers, and make smart business decisions.

Web Search Engines

A Web search engine is a specialized computer server that searches for information on the Web. The search results of a user query are often returned as a list (sometimes called hits). The hits may consist of web pages, images, and other types of files. Some search engines also search and return data available in public databases or open directories.

Search engines differ from web directories in that web directories are maintained by human editors whereas search engines operate algorithmically or by a mixture of algorithmic and human input.
Web search engines are essentially very large data mining applications. Various data mining techniques are used in all aspects of search engines, ranging from crawling (e.g., deciding which pages should be crawled and the crawling frequencies), indexing (e.g., selecting pages to be indexed and deciding to which extent the index should be constructed), and searching (e.g., deciding how pages should be ranked, which advertisements should be added, and how the search results can be personalized or made “context aware”).

Major Issues in Data Mining:

1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability
4. Diversity of Database Types
5. Data Mining and Society

1. Mining Methodology

Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Data mining: an interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining

Mining various and new kinds of knowledge:

Data mining covers a wide spectrum of data analysis and knowledge discovery tasks, from data characterization and discrimination to association and correlation analysis, classification, regression, clustering, outlier analysis, sequence analysis, and trend and evolution analysis. These tasks may use the same database in different ways and require the development of numerous data mining techniques. Due to the diversity of applications, new mining tasks continue to emerge, making data mining a dynamic and fast-growing field.

Mining knowledge in multidimensional space:

When searching for knowledge in large data sets, we can explore the data in multidimensional space, like a cube. That is, we can search for interesting patterns among combinations of dimensions (attributes) at varying levels of abstraction. Such mining is known as (exploratory) multidimensional data mining.

Data mining: an interdisciplinary effort:

The power of data mining can be substantially enhanced by integrating new methods from multiple disciplines. For example, to mine data with natural language text, it makes sense to fuse data mining methods with methods of information retrieval and natural language processing.

Boosting the power of discovery in a networked environment:

Most data objects reside in a linked or interconnected environment, whether it be the Web, database relations, files, or documents. Semantic links across multiple data objects can be used to advantage in data mining. Knowledge derived in one set of objects can be used to boost the discovery of knowledge in a “related” or semantically linked set of objects.

Handling uncertainty, noise, or incompleteness of data:

Data cleaning, data preprocessing, outlier detection and removal, and uncertainty reasoning are examples of techniques that need to be integrated with the data mining process.

Pattern evaluation and pattern- or constraint-guided mining:

Techniques are needed to assess the interestingness of discovered patterns based on subjective measures. These estimate the value of patterns with respect to a given user class, based on user beliefs or expectations. Moreover, by using interestingness measures or user-specified constraints to guide the discovery process, we may generate more interesting patterns and reduce the search space.

2. User Interaction:
The user plays an important role in the data mining process. Interesting areas of research include:

1. Interactive mining
2. Incorporation of background knowledge
3. Ad hoc data mining and data mining query languages
4. Presentation and visualization of data mining results

1. Interactive mining:

The data mining process should be highly interactive. Thus, it is important to build flexible user interfaces and an exploratory mining environment, facilitating the user’s interaction with the system. A user may like to first sample a set of data, explore general characteristics of the data, and estimate potential mining results. Interactive mining should allow users to dynamically change the focus of a search, to refine mining requests based on returned results, and to drill, dice, and pivot through the data and knowledge space interactively, dynamically exploring “cube space” while mining.

2. Incorporation of background knowledge

Background knowledge, constraints, rules, and other information regarding the domain under study should be incorporated into the knowledge discovery process. Such knowledge can be used for pattern evaluation as well as to guide the search toward interesting patterns.

3. Ad hoc data mining and data mining query languages

Query languages (e.g., SQL) have played an important role in flexible searching because they allow users to pose ad hoc queries. Similarly, high-level data mining query languages or other high-level flexible user interfaces will give users the freedom to define ad hoc data mining tasks.

4. Presentation and visualization of data mining results

A data mining system should present data mining results vividly and flexibly, so that the discovered knowledge can be easily understood and directly usable by humans.

3. Efficiency and Scalability:

Efficiency and scalability of data mining algorithms:

Data mining algorithms must be efficient and scalable in order to effectively extract information from huge amounts of data in many data repositories or in dynamic data streams. In other words, the running time of a data mining algorithm must be predictable, short, and acceptable by applications.

Parallel, distributed, and incremental mining algorithms:

The humongous size of many data sets, the wide distribution of data, and the computational complexity of some data mining methods are factors that motivate the development of parallel and distributed data-intensive mining algorithms. Such algorithms first partition the data into “pieces.” Each piece is processed, in parallel, by searching for patterns.

Cloud computing and cluster computing, which use computers in a distributed and collaborative way to tackle very large-scale computational tasks, are also active research themes in parallel data mining.

4. Diversity of Database Types:

The wide diversity of database types brings about challenges to data mining. These include

Handling complex types of data:

Different applications generate a wide spectrum of new data types, from structured data such as relational and data warehouse data to semi-structured and unstructured data; from stable data repositories to dynamic data streams; from simple data objects to temporal data, biological sequences, sensor data, spatial data, hypertext data, multimedia data, software program code, Web data, and social network data. The construction of effective and efficient data mining tools for diverse applications remains a challenging and active area of research.

Mining dynamic, networked, and global data repositories:

The discovery of knowledge from different sources of structured, semi-structured, or unstructured yet interconnected data with diverse data semantics poses great challenges to data mining. Mining such gigantic, interconnected information networks may help disclose many more patterns and knowledge in heterogeneous data sets than can be discovered from a small set of isolated data repositories.

5. Data Mining and Society

Social impacts of data mining:

With data mining penetrating our everyday lives, it is important to study the impact of data mining on society. How can we use data mining technology to benefit society? How can we guard against its misuse? The improper disclosure or use of data and the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.

Privacy-preserving data mining:

Data mining will help scientific discovery, business management, economy recovery, and security protection (e.g., the real-time discovery of intruders and cyberattacks). However, it poses the risk of disclosing an individual’s personal information.

Data Objects and Attribute Types
When we talk about data mining, we usually mean knowledge discovery from data. To get to know the data, it is necessary to discuss data objects, data attributes, and the types of data attributes. Mining data includes knowing about the data and finding relations between data, and for this we need to discuss data objects and attributes.
Data objects are the essential part of a database. A data object represents an entity; it is like a group of attributes of that entity. For example, in sales data an object may represent a customer, a sale, or a purchase. When data objects are listed in a database, they are called data tuples.

Attribute types:
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given object is known as an attribute vector or feature vector.

Types of attributes:
Identifying attribute types is the first step of data preprocessing. We differentiate between the different types of attributes and then preprocess the data. Here is a description of the attribute types.
1. Qualitative (Nominal (N), Ordinal (O), Binary (B)).
2. Quantitative (Discrete, Continuous)

Qualitative Attributes
1. Nominal attributes (related to names): The values of a nominal attribute are names of things or some kind of symbols. The values of a nominal attribute represent some category or state, which is why nominal attributes are also referred to as categorical attributes. There is no order (rank, position) among the values of a nominal attribute.

2. Binary attributes: Binary data has only 2 values/states, for example yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: Both values are not equally important (Result).

3. Ordinal Attributes: An ordinal attribute contains values that have a meaningful sequence or ranking (order) between them, but the magnitude of the difference between successive values is not known. The order shows which value ranks higher but does not indicate by how much.

Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real values. Numeric attributes are of two types, interval-scaled and ratio-scaled.
i) An interval-scaled attribute has values whose differences are interpretable, but it has no true reference point, or zero point. Interval-scaled values can be added and subtracted but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if the temperature on one day is twice the temperature on another day, we cannot say that the first day is twice as hot as the other.
ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the difference between values, and the mean, median, mode, quantile range, and five-number summary can be given.
2. Discrete: A discrete attribute has a finite or countably infinite set of values. The values can be numerical or categorical.

3. Continuous: A continuous attribute has an infinite number of possible states and is typically represented as a floating-point value. For example, there can be many values between 2 and 3.
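The difference between interval-scaled and ratio-scaled attributes can be illustrated with a short Python sketch. The two temperatures are made-up values, and the conversion to Kelvin is used here only as an example of a scale with a true zero point; neither comes from the notes themselves.

```python
# Illustrative values only: two daily temperatures in degrees Centigrade.
day_a_c = 10.0
day_b_c = 20.0


def celsius_to_kelvin(c):
    """Convert Centigrade (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15


# Differences are meaningful on an interval scale.
difference_c = day_b_c - day_a_c

# The naive ratio on the Centigrade scale suggests "twice as hot" ...
naive_ratio = day_b_c / day_a_c

# ... but on a scale with a true zero point the ratio is close to 1.
true_ratio = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

print(difference_c, naive_ratio, round(true_ratio, 3))
```

The difference of 10 degrees is the same on both scales, while the ratio changes with the choice of zero point, which is why ratios are only interpreted for ratio-scaled attributes.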

Basic Statistical Descriptions of Data:

1) Measures of central tendency

a) Mean: $\bar{x} = \frac{\sum x}{n}$

Boxplot

A boxplot is a graph that summarizes the data by representing five values: the minimum and maximum values, and the quartiles Q1, Q2, and Q3.
In the graph below, the three central horizontal lines represent Q1, Q2, and Q3, while the point at the extreme represents an outlier value. This is a variation of the boxplot described above, since the other two horizontal lines are not the minimum and maximum; they are the 5th and 95th percentiles.
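The five values behind a boxplot can be computed with a few lines of Python. This is a minimal sketch on a hypothetical sample; the function name is made up for illustration, and the "inclusive" quartile convention of the standard library is an assumption, since different textbooks and tools compute quartiles in slightly different ways.

```python
import statistics

# Hypothetical sample, chosen so the quartiles are easy to verify by hand.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


summary = five_number_summary(data)
print(summary)
```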
Measures of Location: The mode
The mode is a measure of center that is usually used for data that is non-numerical. Namely, the mode is the value that occurs most frequently, and it is the only measure of center that can be computed for qualitative data.
For example, in example 3.17, the sizes of dresses sold by a store are listed as 10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 14, 11, 9 and 20. In this case, the numbers 9 and 14 appear most frequently, each occurring exactly 5 times. Therefore, this data is bimodal (has two modes), namely 9 and 14.
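The dress-size example can be checked with a short sketch that counts how often each value occurs. The variable names are chosen for illustration only.

```python
from collections import Counter
import statistics

# Dress sizes from the example above.
dress_sizes = [10, 7, 14, 9, 9, 14, 18, 9, 11, 12,
               16, 14, 9, 14, 14, 11, 9, 20]

# Frequency of each size.
counts = Counter(dress_sizes)

# multimode returns every value that shares the highest frequency.
modes = sorted(statistics.multimode(dress_sizes))

print(modes, counts[9], counts[14])
```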

Measures of Variation: The range
The range is a measure of variation or variability of the data. The range of a data set is the largest value minus the smallest value.
For example, for the age distribution in the class, which we write here again for convenience:
21, 21, 25, 20, 26, 22, 46, 25, 58, 24, 20, 20, 25, 23, 27, 21, 23, 22, 28,
the range is 58 - 20 = 38 years.
One disadvantage of the range is that it is highly influenced by outliers. For that reason, we use the following measure of variation more often.
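The range calculation above can be reproduced with a small sketch, reading the age list as two-digit values. The helper name is hypothetical, and dropping the largest age is done only to show how much a single extreme value contributes to the range.

```python
# Ages from the class example, read as two-digit values.
ages = [21, 21, 25, 20, 26, 22, 46, 25, 58, 24,
        20, 20, 25, 23, 27, 21, 23, 22, 28]


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


full_range = value_range(ages)

# Remove the single largest age to see its effect on the range.
ages_without_oldest = [a for a in ages if a != max(ages)]
reduced_range = value_range(ages_without_oldest)

print(full_range, reduced_range)
```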

Measures of Variation: The standard deviation
The standard deviation is the most widely used measure of variation.
To calculate the standard deviation of a population, one first takes the difference of each data point to the mean (the deviation), and squares that difference (to ensure it is positive). Then, all those squared differences are added together and divided by N, the size of the population. Finally, the square root is taken. This is summarized in the following formula:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
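The steps of the population standard deviation can be written out directly. This is a minimal sketch that reuses the class ages read as two-digit values; the function name is hypothetical, and the result is compared against the standard library's `statistics.pstdev` as a cross-check.

```python
import math
import statistics

# Ages from the class example, read as two-digit values.
ages = [21, 21, 25, 20, 26, 22, 46, 25, 58, 24,
        20, 20, 25, 23, 27, 21, 23, 22, 28]


def population_std(values):
    """Population standard deviation, following the steps in the text."""
    n = len(values)
    mu = sum(values) / n                             # population mean
    squared_diffs = [(x - mu) ** 2 for x in values]  # squared deviations
    return math.sqrt(sum(squared_diffs) / n)         # divide by N, take root


sigma = population_std(ages)

# Cross-check against the standard library implementation.
print(sigma, statistics.pstdev(ages))
```

Note that this divides by N, which is appropriate for a whole population; a sample standard deviation would divide by N - 1 instead.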
Data Visualization:
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

Examples of methods to visualize data:

● Dot Distribution Map
● Gantt Chart
● Heat Map
● Highlight Table
● Histogram
● Matrix
● Scatter Plot (2D or 3D)
Measuring Data Similarity and Dissimilarity:
Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure indicates how close two distributions are. For multivariate data, complex summary methods are developed to answer this question.
Similarity Measure

A numerical measure of how alike two data objects are. It often falls between 0 (no similarity) and 1 (complete similarity).

Dissimilarity Measure

A numerical measure of how different two data objects are. It ranges from 0 (objects are alike) to ∞ (objects are different).

Proximity

Refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.
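A minimal sketch of how similarity and dissimilarity might be computed for a single attribute, depending on its type. The specific formulas are common textbook conventions and are assumptions here, not definitions taken from these notes; the function names are hypothetical.

```python
def nominal_dissimilarity(p, q):
    """Assumed convention: 0 if the values match, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_dissimilarity(p, q, n_levels):
    """Assumed convention: p and q are ranks 0..n_levels-1,
    and the rank gap is scaled into the interval [0, 1]."""
    return abs(p - q) / (n_levels - 1)


def numeric_dissimilarity(p, q):
    """Assumed convention for interval or ratio values: absolute difference."""
    return abs(p - q)


def numeric_similarity(p, q):
    """Assumed convention: map a dissimilarity in [0, inf) to (0, 1]."""
    return 1.0 / (1.0 + numeric_dissimilarity(p, q))


# Hypothetical attribute values for two data objects.
print(nominal_dissimilarity("red", "blue"))
print(ordinal_dissimilarity(0, 2, n_levels=3))
print(numeric_similarity(1.0, 2.0))
```

For nominal and ordinal attributes, a matching similarity under these conventions would be one minus the dissimilarity.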
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
Common Properties of Dissimilarity Measures

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q, p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.
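The three properties can be checked numerically on a small set of points. This is a sketch under stated assumptions: the points and function names are hypothetical, a check on a finite sample is not a proof, and the squared Euclidean distance is included only as an example of a dissimilarity that fails the triangle inequality.

```python
import itertools
import math

# Hypothetical sample points in two dimensions.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]


def euclidean(p, q):
    return math.dist(p, q)


def squared_euclidean(p, q):
    return math.dist(p, q) ** 2


def satisfies_metric_properties(d, pts, tol=1e-12):
    """Check the three metric properties on every combination of pts."""
    for p, q in itertools.product(pts, repeat=2):
        if d(p, q) < 0:                      # property 1: non-negativity
            return False
        if (d(p, q) == 0) != (p == q):       # property 1: zero iff identical
            return False
        if abs(d(p, q) - d(q, p)) > tol:     # property 2: symmetry
            return False
    for p, q, r in itertools.product(pts, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:  # property 3: triangle inequality
            return False
    return True


print(satisfies_metric_properties(euclidean, points))
print(satisfies_metric_properties(squared_euclidean, points))
```

For the squared distance, the points (0, 0), (3, 4), and (6, 8) give 100 on the left of the triangle inequality and 25 + 25 on the right, so it is a dissimilarity but not a metric.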

Euclidean Distance

Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also called attributes).
The Euclidean distance between the $i$th and $j$th objects is
$$d_E(i, j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$$
for every pair $(i, j)$ of observations.
The weighted Euclidean distance is:
$$d_{WE}(i, j) = \left( \sum_{k=1}^{p} W_k (x_{ik} - x_{jk})^2 \right)^{1/2}$$
If the scales of the attributes differ substantially, standardization is necessary.
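The Euclidean and weighted Euclidean distances can be sketched in a few lines. The two objects, the weights, and the function names below are hypothetical and chosen so the results are easy to verify by hand.

```python
import math


def euclidean_distance(x_i, x_j):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


def weighted_euclidean_distance(x_i, x_j, weights):
    """Same as above, with each squared difference scaled by a weight W_k."""
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j))
    )


# Two hypothetical objects measured on p = 3 attributes.
obj_i = (1.0, 2.0, 3.0)
obj_j = (4.0, 6.0, 3.0)

d_e = euclidean_distance(obj_i, obj_j)
# A weight of 0 removes the second attribute from the comparison.
d_we = weighted_euclidean_distance(obj_i, obj_j, (1.0, 0.0, 1.0))

print(d_e, d_we)
```

One common choice of weights, mentioned here as an assumption rather than something the notes prescribe, is the reciprocal of each attribute's variance, which gives the same result as computing the Euclidean distance on standardized attributes.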

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.
With the measurements $x_{ik}$, $i = 1, \ldots, N$, $k = 1, \ldots, p$, the Minkowski distance is
$$d_M(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$$
where $\lambda \geq 1$. It is also called the $L_\lambda$ metric.

● $\lambda = 1$: $L_1$ metric, Manhattan or City-block distance.
● $\lambda = 2$: $L_2$ metric, Euclidean distance.
● $\lambda \to \infty$: $L_\infty$ metric, Supremum distance.
$$\lim_{\lambda \to \infty} \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda} = \max\left( |x_{i1} - x_{j1}|, \ldots, |x_{ip} - x_{jp}| \right)$$
Note that $\lambda$ and $p$ are two different parameters. The dimension of the data matrix remains finite.
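The three special cases of the Minkowski distance can be compared on one pair of objects. This is a minimal sketch; the objects and the function name are hypothetical, and the supremum case is handled explicitly by taking the maximum rather than by evaluating the limit.

```python
def minkowski_distance(x_i, x_j, lam):
    """Minkowski (L-lambda) distance between two equal-length tuples."""
    if lam < 1:
        raise ValueError("lambda must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x_i, x_j)]
    if lam == float("inf"):
        # Supremum distance: the largest absolute attribute difference.
        return max(diffs)
    return sum(d ** lam for d in diffs) ** (1.0 / lam)


# Two hypothetical objects measured on p = 3 attributes.
obj_i = (1.0, 2.0, 3.0)
obj_j = (4.0, 6.0, 3.0)

manhattan = minkowski_distance(obj_i, obj_j, 1)
euclid = minkowski_distance(obj_i, obj_j, 2)
supremum = minkowski_distance(obj_i, obj_j, float("inf"))

# A large finite lambda approaches the supremum distance.
near_supremum = minkowski_distance(obj_i, obj_j, 50)

print(manhattan, euclid, supremum, near_supremum)
```

Here the absolute differences are 3, 4, and 0, so the distance shrinks from 7 toward 4 as lambda grows, while the number of attributes p stays fixed at 3.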
