0% found this document useful (0 votes)
28 views48 pages

DWDM

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views48 pages

DWDM

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 48

Unit I

Data Warehousing and Online Analytical Processing :

Data Mining

1. Why Data Mining?


Businesses wor ldwide generat e gigant ic dat a set s, including sales t ransact io ns,
st ock t rading records, product descr ipt ions, sales pro mot ions, co mpany pro file s
and per for mance, and cust omer feedback. For example, large st ores, such as
Wal-Mart , handle hundreds o f millio ns o f transact io ns per week at t housands o f
branches around t he wor ld. Scient ific and engineer ing pract ices generat e high
orders of pet abyt es of dat a in a cont inuous manner, fro m remot e sensing,
process measur ing, scient ific exper iment s, syst em per for mance, engineer ing
obser vat ions, and environment sur veillance.
Glo bal backbo ne t eleco mmunicat io n net works carr y t ens o f pet abyt es o f dat a
t raffic ever y day. The medical and healt h indust r y generat es t remendous
amount s of dat a fro m medical records, patient mo nit or ing, and medical imag ing.
This explo sive ly growing, widely available, and gigant ic body of dat a makes our
t ime t ruly t he dat a age. Power ful and versat ile t ools are badly needed t o
aut omat ically uncover valuable infor mat io n fro m t he t remendo us amount s o f
dat a and t o t ransfor m such dat a int o organized knowledge. This necessit y has
led to t he birt h o f dat a mining. The field is young, dynamic, and pro mising.
Dat a mining has and will co nt inue t o make great st rides in our journey fro m t he
dat a age toward t he coming infor mat ion age.
Data mining tu rns a large collection of data into knowledge.

 Data Mining as the Evolution of Information Technology

Dat a mining can be viewed as a result of t he nat ural evo lut io n of infor mat io n
t echno logy. The dat abase and dat a management indust r y evo lved in t he
development o fseveral cr it ical funct ionalit ies : data col lection and database
creation, data management ( inc luding data storage and r et r ieval and
database t ransaction processing ), and advanced data analysi s ( invo lving
data warehousing and data mining ).
E ffic ient met hods for online t ransact io n processing (OLTP), where a quer y is
viewed as a read-only t ransact io n, cont ribut ed subst ant ially t o t he evo lut io n
and wide accept ance of relat ional t echnology as a major t ool for effic ient
st orage, ret rieval, and management of large amount s of dat a.
Aft er t he est ablishment of dat abase management syst ems, dat abase
t echno logy mo ved t oward t he development of advanced dat abase syst ems,
dat a warehousing, and dat a mining for advanced dat a analys is and web - based
dat abases.
Advanced dat a analys is sprang up fr om t he lat e 1980s onward. This
t echno logy provides a great boost to t he dat abase and infor mat io n indust r y,
and it enables a huge number of dat abases and infor mat io n reposit or ies t o be
available for t ransact ion management , infor mat io n ret rieval, and dat a
analys is.
One emerging dat a reposit ory archit ect ure is t he data warehou se. This is a
reposit ory o f mu lt iple het erogeneous dat a sources organized under a unified
sche ma at a single sit e t o facilit at e management decis io n making. Dat a
warehouse t echno logy in cludes dat a cleaning, dat a int egrat io n, and online
analyt ical processing (OLAP) —t hat is, analys is t echniques wit h
funct io nalit ies such as summar izat ion, conso lidat io n, and aggregat ion, as
well as t he abilit y t o view infor mat io n from different angles. Alt ho ugh OLAP
tools support mult idimensio nal ana lysis and decis io n making, addit io nal dat a
analys is t ools are required for in -dept h analys is — for examp le, dat a mining
tools t hat provide dat a classificat ion, clust er ing, out lier/ ano maly det ect ion,
and t he char act er izat ion o f changes in dat a over t ime.

In summar y, t he abundance of dat a, coupled wit h t he need for power ful dat a
analys is t ools, has been descr ibed as a dat a rich but infor mat io n poor sit uat io n
(Figure 1.2). The fast -growing, tremendo us amount of dat a, collect ed and st ored
in large and numerous dat a reposit ories, has far exceeded our human abilit y for
co mprehensio n wit hout power ful t ools. As a result , dat a co llect ed in large dat a
reposit ories beco me “dat a tombs” —dat a archives t hat are seldo m visit ed.
Consequent ly, import ant decisio ns are o ft en made based not on t he infor mat ion -
r ich dat a st ored in dat a reposit ories but rat her on a decis io n maker ’s int uit io n,
simply because t he decis io n maker does not have t he t ools t o ext ract t he
valuable knowledge embedde d in t he vast amount s of dat a. Effort s have bee n
made t o develop expert syst em and knowledge - based t echno logies, which
t ypically rely o n users or domain exper t s to manually input knowledge int o
knowledge bases. Unfort unat ely, however, t he manual knowledge input
procedure is prone t o biases and errors and is ext remely cost ly and t ime
consu ming. The widening gap bet ween dat a and infor mat io n calls for t he
syst emat ic development of dat a mining tools t hat can t urn dat a t ombs int o
“go lden nugget s” o f knowledge.

2. What Is Data Mining?


 Dat a mining is t he process o f disco ver ing int erest ing pat t erns and knowledge
fro m large amo unt s of dat a. The dat a sources can include dat abases, dat a
warehouses, t he Web, ot her infor mat ion reposit ories, or dat a t hat are
st reamed int o t he syst em d ynamically.
 Dat a mining can also be used as a syno nym for anot her popular ly used t er m,
knowledge discover y fro m dat a, or KDD, while ot hers view dat a mining as
mer ely an essent ial st ep in t he process o f knowledge discover y.
 The knowledge disco ver y process is shown in F igure 1.4 as an it erat ive
sequence of t he fo llo wing st eps:
1. Dat a cleaning (t o remo ve no ise and inc onsist ent dat a).
2. Dat a int egrat io n (where mult iple dat a sources may be co mbined).
3. Dat a select io n (where dat a relevant to the analys is t ask are ret r ieved fro m t he
dat abase).
4. Dat a t ransfor mat io n (where dat a are t ransfor med and conso lidat ed int o for ms
appropr iat e for mining by per for ming summar y or aggregat ion operat ions).
5. Dat a mining (an essent ial process where int elligent met hods are app lied t o
ext ract dat a patt erns).
6. Patt ern evaluat ion (t o ident ify t he t ruly int erest ing pat t erns represent ing
knowledge based on int er est ingness measures).
7. Knowledge present at ion (where visualizat ion and knowledge represent at io n
t echniques ar e used to present mined knowledge t o users).

St eps 1 t hrough 4 are different for ms of dat a preprocessing, where dat a are
prepared for mining. The dat a mining st ep may int er act wit h t he user or a
know ledge base. The int erest ing pat t erns are present ed to t he user and ma y be
st ored as new knowledge in t he knowledge base .
3. What Kinds of Data Can Be Mined?

The mo st basic for ms o f dat a for mining applicat io ns are dat abase dat a , dat a
warehouse dat a, and t ransact io nal dat a. Dat a mining can also be applied to ot her
for ms o f dat a (e.g., dat a st reams, ordered/sequence dat a, graph or net worked
dat a, spat ial dat a, t ext dat a, mult imedia dat a, and t he WWW).

Database Data

A dat abase s yst em, also called a dat abase management syst em (DBMS),
consist s of a co llect ion of int errelat ed dat a, known as a dat abase, and a set of
so ft ware programs t o manage and access t he dat a. The soft ware programs
provide mechanisms for defining dat abase st ruct ures and dat a st orage; for
spec ifying and managing concurrent , shared, or dist ribut ed dat a access; and
for ensur ing consist ency and secur it y o f the infor mat io n st ored despit e syst e m
crashes or att empt s at unaut hor ized access.
A relat io nal dat abase is a co llect ion o f t ables, eac h of which is assigned a
unique name. Each t able consist s o f a set of at t ribut es (co lumns or fie lds) and
usually st ores a large set of t uples (records or rows). Each t uple in a
relat ional t able represent s an o bject ident ified by a unique key and descr ibe d
by a set of at t ribut e values. A se mant ic dat a model, such as an ent it y -
relat ionship (ER) dat a model, is o ft en const ruct ed for relat io nal dat abases. An
ER dat a model represent s t he dat abase as a set of ent it ies and t heir
relat ionships.

Examp le :
A relat io nal dat abase for AllE lect ronics. The fict it ious AllE lect ronics st ore is
used t o illust rat e concept s t hroughout t his book. The co mpany is descr ibed b y
t he fo llo wing relat io n t ables: cust o mer, it em, emplo yee, and branch. T he
header s o f t he t ables descr ibed here. ( A header is also called t he schema o f a
relat ion.)
cust omer (cust ID, name, address, age, occupat ion, annua l inco me, credit
infor mat io n, cat egory, . . .)
it e m (it em ID, brand, cat egory, t ype, pr ice, place made, supplier, cost ,
. . .)
emplo yee (empl ID, name, cat egor y, group, salar y, co mmissio n, . . .)
branch (branch ID, name, addr ess, . . .)
purchases (t rans ID, cust ID, empl I D, dat e, t ime, met hod paid, amount )
it e ms_so ld (t rans ID, it em ID, qt y) works at (empl ID, branch ID)

Relat io nal dat a can be accessed by dat abase quer ies wr it t en in a relat io na l
quer y language (e.g., SQL) or wit h t he assist ance of graphical user int er faces.
A given quer y is t ransfor med int o a set of r elat ional operat io ns, such as jo in,
select ion, and pro ject ion, and is t hen opt imized for effic ient processing.
When mining relat io nal dat abases, we can go furt her by searching for t rends
or dat a patt erns. For examp le, dat a mining may discover t hat t here has been a
change in packaging o f an it em or a significant increase in pr ice.

Data Warehouses
 A dat a warehouse is a reposit ory o f informat io n co llect ed fro m mult iple
sources, st ored under a unified schema, and usually residing at a single sit e.
 Dat a warehouses are const ruct ed via a process o f dat a cleaning, dat a
int egrat ion, dat a t ransfor mat io n, dat a loading, and per iodic dat a refr eshing.
 To facilit at e decisio n making, t he dat a in a dat a warehouse are organized
around major subject s (e.g., cust omer, it em, supplier, and act ivit y). The
dat a are stored to provide infor mat io n from a hist or ica l per spect ive, such as
in t he past 6 to 12 mont hs, and are t ypically summar ized.
 A dat a warehouse is usually modeled by a mult idimensio nal dat a st ruct ure,
called a dat a cube, in which each dimension correspon ds to an att ribut e or a
set of att ribut es in t he schema, and each cell st ores t he value of so me
aggregat e measur e such as count or sum( sales amount ). A dat a cube provides
a mult idimensio nal view o f dat a and all ows t he preco mput at ion and fast
access of summar ized dat a.

Examp le:
A dat a cube for AllE lect ronics. A dat a cube for summar ized sa les dat a of
AllE lect ronics is present ed in Figure 1.7(a). The cube has t hree dimensio ns:
address (wit h cit y va lues Chicago, New York, Toronto, Vancouver), t ime (wit h
quart er values Q1, Q2, Q3, Q4), and it em(wit h it em t ype values ho me
ent ert ainment , co mput er, pho ne, secur it y) . The aggregat e value st ored in eac h
cell o f t he cube issales amount ( in t housands).
Examples o f OLAP operat ions include dr ill -down and roll-up, which allo w t he
user t o view t he dat a at differ ing degrees of summar izat ion. For inst ance, we
can dr ill down on sales dat a summar ized by quart er to see dat a summar ized b y
mo nt h. S imilar ly, we can ro ll up o n sa les dat a summar ized by cit y t o view
dat a summar ized by count r y.

Transaction Data
In general, each record in a t ransact ional dat abase capt ures a t ransact ion, suc h
as a cust omer ’s purchase, a flight booking, or a user ’s clicks o n a web page. A
t ransact ion t ypically includes a unique t ransact ion ident it y number (t rans ID)
and a list of t he it e ms making up t he t ransact io n, such as t he it ems pur chased
in t he t ransact ion.

Examp le:
A t ransact ional dat abase for AllE lect ronics. Transact io ns can be st ored in a
t able, wit h one record per t ransact ion. A fragment o f a t r ansact io nal dat abase
for AllE lect ronics is shown in Figure 1.8. From t he relat io nal dat abase point
of view, t he sales t able in t he figure is a nest ed relat ion because t he att ribut e
list of it em I Ds cont ains a set of it ems. As an analyst of AllE lect ronics, you
ma y ask, “Which it ems so ld well t oget her?” This kind o f market basket dat a
analys is would enable you t o bundle groups of it ems t oget her as a st rat egy for
boost ing sales. For example, given t he knowledge t hat print ers are co mmo nl y
purchased t oget her wit h co mput ers, you could offer cert ain pr int ers at a st eep
discount (or even for free) to cust omers buying select ed co mput ers, in t he
hopes o f selling more co mput ers (which are o ft en more expensive t ha n
pr int ers).

Object Oriented Databases

These dat abases are based on object orient ed programming paradigm wher e in
general t er ms each ent it y is considered as an object .
Each object has associat ed wit h t he fo llowing:
i. A set of var iables t hat descr ibe t he object s. These correspond t o attr ibut es
in t he ER mod el.
ii. A set of messages t hat t he object can use t o communicat e wit h ot her
object s.
iii. A set of met hods where each met hod holds t he code t o implement message.

Object Relational Databases


These DB's are const ruct ed based on t he object relat ional mode l. This mode l
ent ends t he relat io nal model by providing a r ich dat a t ype for handling
co mplex object s and object orient at ion.

Spatial Databases

Spat ial dat abases cont ain spat ial relat ed infor mat io n such as dat abases
inc lude geographic dat abases, VLSI chin des ign dat abases, medical and
sat ellit e image dat abases.
DM may uncover pat t erns descr ibing t he charact er ist ics o f houses locat ed
near a specific kind o f locat ion such as a park.

Temporal & Time Series Databases


These DB's st ore t ime relat ed dat a.
A t emporal DB usua lly st ores relat ional dat a t hat include t ime relat ed
att ribut es.
Time ser ies De's st ores sequence o f values t hat change wit h t ime such as dat a
co llect ed regarding t he st ock e nchange.
Tent databases & Multi-media databases
Pext DB's are DB's t hat cont ain word descr ipt io n for object s. These word
descr ipt io ns are usually not simple keywords but rat her lo ng sent ences or
paragraphs such as product specificat ions, errors and bug r eport s, warning
messages, su mmar y r eport s, not es or ot her document s.
Mult i- media DB's st ore image, audio , video dat a. For mult i- media DB mining
st orage & such t echniques need t o be int egrat ed wit h st andard dat a mining
met ho ds.

World Wide Web


The www and it 's associat ed dist r ibut ed infor mat io n ser vices such as google,
bingo, yahoo, alt avist a, amer ica online, world wide online infor mat io n
ser vices where dat a object s are linked t oget her t o facilit at e access.

4. Data Mining Functionalities – What kind of patterns can be


mined?

Dat a mining funct ionalit ies are used to specify t he kinds of pat t erns to be found
in dat a mining t asks. In general, such t asks can be classified int o t wo
cat egories:

i. Descr ipt ive


Descr ipt ive mining t asks charact er ize propert ies of t he dat a in a t arget dat a
set .
ii. Predict ive
Predict ive mining t asks per for m induct io n on t he current dat a in order to
make predict io ns.

Class/Concept Description: Characterization and Discrimin ation


 Dat a ent r ies can be associat ed wit h classes or concept s. For example, in t he
AllE lect ronics st ore, classes o f it ems for sale include co mput ers and pr int er s,
and concept s of cust omers include bigSpenders and budget Spender s.
 It can be useful t o descr ibe individual classes and concept s in summar ized,
concise, and yet precise t erms. Such descr ipt ions of a class or a concept are
called class/co ncept descr ipt ions.
 These descr ipt ions can be der ived using dat a charact er izat io n, or dat a
discr iminat io n or bo t h.
Data Characterization
 Dat a charact er izat io n is a summar izat io n o f t he general charact er ist ics or
feat ures of a t arget class of dat a. The dat a corresponding t o t he user -
spec ified class are t ypically co llect ed by a quer y.
 The out put of dat a charact er izat ion can be present ed in var ious for ms.
Examples include pie chart s, bar chart s, curves, mult id imensio nal dat a cubes,
and mult idimensio nal t ables, includ ing cr osst abs.
 The result ing descr ipt io ns can also be present ed as generalized relat io ns.

Examp le:
For example, A cust omer relat io nship manager at AllE lect ronics may order t he
fo llo wing dat a mining t ask: Summar ize t he charact er ist ics of cust omers who
spend mor e t han $5000 a year at AllE lect ronics .
The result could be a genera l pro file o f t hese cust omers, such as t hat t hey ar e
40 to 50 years o ld, emplo yed, and have excellent credit rat ings. The dat a
mining syst em should allo w t he cust omer relat io nship manager to drill down
on any dimensio n, such as on occupat ion t o view t hese cust omer s according t o
t heir t ype of e mplo yment .

Data Discrimination
 Dat a discr iminat io n is a co mpar iso n of t he general feat ures o f t he t arget class
dat a object s against t he general feat ures o f object s fro m o ne or mult ip le
cont rast ing classes.T
 he for ms o f out put present at ion are similar t o t hose for charact er ist ic
descr ipt io ns, alt hough discr iminat io n descr ipt io ns should include co mparat ive
measures t hat help t o dist inguish bet ween t he t arget and cont rast ing classes.
 Discr iminat io n descr ipt ions expressed in t he for m of ru les are referred t o as
discr iminant rules.

Example:
A cust omer relat ionship manager at AllE lect ronics may want to compare t wo
groups of cust omer s—t hose who shop for comput er product s regular ly ( e.g.,
more t han t wice a mo nt h) and t hose who rarely shop for such product s (e.g.,
less t han t hree t imes a year).
The result ing descr ipt ion provides a general co mparat ive pro file o f t hese
cust omers, such as t hat 80% of t he cust omer s who frequent ly pur chase
co mput er product s are bet ween 20 and 40 years o ld and have a univer sit y
educat io n, whereas 60% o f t he cust omers who infr equent ly buy such product s
are eit her senior s or yout hs, and have no universit y degree. Dr illing down o n
a dimensio n like occupat io n, or adding a new dimens io n like inco me leve l,
ma y help t o find even more discr iminat ive feat ures bet ween t he t wo classes.
Mining Frequent Patterns, Associations, and Correlations

 Frequent pat t erns, as t he name suggest s, are pat t erns t hat occur frequent ly in
dat a. There are many kinds o f frequent patt erns, including frequent it emset s,
frequent subsequences (also known as sequent ial pat t erns), and frequent
subst ruct ures.
 A frequent it emset t ypically refer s t o a set of it ems t hat oft en appear
toget her in a t ransact ional dat a set .— for example, milk and bread, which ar e
frequent ly bought t oget her in grocer y st ores by many cust omers.
 A frequent ly occurr ing subsequence, such as t he pat t e rn t hat customer s, t end
to purchase first a lapt op, fo llowed by a digit al camera, and t hen a me mor y
card, is a ( frequent ) sequent ial pat t ern.
 A subst ruct ure can refer t o different st ruct ural for ms (e.g., graphs, t rees, or
lat t ices) t hat may be co mbined wit h it emset s or subsequences. If a
subst ruct ure occurs frequent ly, it is called a (frequent ) st ruct ured patt ern.

Exa mple:
Associat ion ana lysis. Suppose t hat , as a market ing manager at AllE lect ronics,
you want to know which it ems ar e frequent ly purchased t oget her ( i.e., wit hin
t he same t ransact io n). An examp le of such a rule, mined fro m t he
AllE lect ronics t ransact io nal dat abase, is
buys(X, “computer”) ⇒ buys(X, “software”) [ support = 1%,confidence = 50%] ,

where X is a var iable represent ing a cust omer. A confidence, or cert aint y, o f
50% means t hat if a cust o mer buys a co mput er, t here is a 50% chance t hat she
will buy so ft ware as well. A 1% support means t hat 1% of all t he t ransact io ns
under analys is show t hat comput er and soft ware are pur chased t oget her. This
associat ion rule invo lves a single at t ribut e or predicat e (i. e., buys) t hat
repeat s. Associat io n rules t hat cont ain a single predicat e are referred t o as
single-dimens io nal associat ion rules. Dropping t he pre dicat e not at io n, t he rule
can be wr it t en simply as “co mput er ⇒ soft ware [1%, 50%].” Suppose, inst ead,
t hat we are given t he AllE lect ronics relat io nal dat abase relat ed to purchases.
A dat a mining syst em may find associat ion rules like
age(X, “20..29 ”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”)
[ support = 2%, confidence = 60%]

The rule indicat es t hat of t he AllE lect ronics cust omer s under st udy, 2% are 20
to 29 years o ld wit h an inco me o f $40,000 t o $49,000 and have purch ased a
lapt op (comput er) at AllE lect ronics. There is a 60% probabilit y t hat a
cust omer in t his age and inco me group will pur chase a lapt op. Not e t hat t his
is an associat ion invo lving more t han one at tribut e or predicat e ( i.e., age,
inco me, and buys). Adopt ing t he t er mino logy used in mult id imensio na l
dat abases, where each at tr ibut e is referred to as a dimens io n, t he above rule
can be referred t o as a mult idimensio nal associat io n rule.
Typically, associat ion rules are discarded as unint er est ing if t hey do no t
sat isfy bot h a minimum support t hresho ld and a minimum co nfidence
t hresho ld. Addit io nal analys is can be per for med t o uncover int erest ing
st at ist ical correlat io ns bet ween associat ed att ribut e –value pair s.

Classification and Regression for Predictive Anal ysis


 Classificat ion is t he process o f find ing a model (or funct ion) t hat descr ibe s
and dist inguishes dat a classes or concept s. The model ar e der ived based o n
t he analys is o f a set of t raining dat a.
 “How is t he der ived model present ed?” The der ived model may be represent ed
in var ious for ms, such as $40,000 to $49,000 and have purchased a lapt op
(comput er) at AllE lect ronics. There is a 60% probabilit y t hat a cust omer
( frequent ) st ruct ured patt ern. t ree s, or latt ices) t hat may be co mbined wit h
it e mset s or subsequences.
 I f a subst ruct ure occurs frequent ly, it is called a t hen a memor y card, is a
( frequent ) sequent ial pat t ern. A subst ruct ure can refer t o differ ent st ruct ural
for ms (e.g., graphs, occurr ing subsequence, such as t he pat t ern t hat
cust omers, t end t o purchase fir st a lapt op, fo llowed by a digit al camera.
 A decis io nt ree is a flowchart - like t ree st ruct ure, where each node denot es a
t est on an at t ribut e value, each branch represent s an out come of t he t est , and
t ree leaves r epresent classes or class dist ribut io ns.Decis io n t rees can easily be
convert ed to classificat ion rules.
 A neural net work,when used for classificat ion, is t ypically a co llect io n o f
neuron- like processing unit swit hweight ed connect io ns bet ween t he unit s.
There are many ot her met hods for const ruct ing classificat ionmodels, such as
na¨ıve Bayesian classificat io n, support vect or machines, and k - near est -
neighbor classificat io n.
 Whereas classificat ion predict s cat egor ical (discret e, unorde red) labels,
regressio n mode ls cont inuous - valued funct ions. Regressio n analys is is a
st at ist ical met hodology t hat is most oft en used for numer ic predict io n,
alt hough ot her met hods exist as well.

Examp le : Classi fication and regression .


Suppose as a sale s manager o f AllE lect ronics you want to classify a large set
of it ems in t he st ore, based on t hree kinds o f responses t o a sales
campaign:good response, mild response and no response. You want to der ive a
model for each o f t hese t hr ee classes based on t he descr ipt ive feat ures o f t he
it e ms, such as pr ice, brand, place made, type, and cat egor y.
Cluster Analysis

 Unlike classificat io n and regr essio n, which analyze class - labeled (t raining)
dat a set s, clust er ing analyzes dat a object s wit hout consult ing class labels. In
many cases, class labeled dat a may simp ly not exist at t he beginning.
 Clust er ing can be used t o generat e class labels for a group of dat a. The
object s are clust ered or grouped based on t he pr inc iple o f maximizing t he
int raclass similar it y and minimizing t he int erclass similar it y.

Outlier Analysis

 A dat a set may cont ain object s t hat do not comply wit h t he general behavior
or model o f t he dat a. These dat a object s are out lier s.
 Many dat a mining met hods discard out liers as no ise or except io ns. However,
in so me applicat ions (e.g., fr aud det ect io n) t he rare event s can be mor e
int erest ing t han t he mor e regular ly occurring ones.
 The analys is o f out lier dat a is referred t o as out lier analys is or ano mal y
mining.

Examp le:
Out lier analys is may uncover fraudulent usage of credit cards by det ect ing
purchases o f unusually large amount s for a given account number i n
co mpar iso n to regular charges incurred by t he same account . Out lier values
ma y also be det ect ed wit h respect to t he locat io ns and t ypes of purchase, or
t he purchase frequency.

Are All Patterns Interesting?


“Ar e all o f t he pat t erns int erest ing?” Typically, t he answer is no —only a
small fr act io n of t he pat t erns pot ent ia lly gener at ed would act ually be o f
int erest to a given user. “What makes a patt ern int erest ing? Can a dat a mining
syst em generat e all o f t he int erest ing pat t erns? Or, Can t he syst em generat e
only t he int erest ing ones?” To answer t he fir st quest io n, a pat t ern is
int erest ing if it is (1) easily under st ood by humans, (2) va lid on new or t est
dat a wit h so me degree of cert aint y, (3) pot ent ially useful, and (4) novel. An
int erest ing pat t ern represent s knowledge. Several object ive measures o f
pat t ern int erest ingness exist . These are based on t he st ruct ure of discovered
pat t erns and t he st at ist ics under lying t hem. T his is t aken t o be t he condit io na l
probabilit y P.YjX), t hat is, t he probabilit y t hat a t ransact ion cont aining X a lso
cont ains Y.
support (X ⇒ Y) = P(X ∪ Y), confidence(X ⇒ Y) = P(Y|X).

Ot her object ive int erest ingness measures include accuracy and coverage for
classificat ion (IF-THEN) rules. In general t er ms, accuracy t ells us t he
percent age of dat a t hat are correct ly classified by a rule. Cover age is similar
to support , in t hat it t ells us t he percent age of dat a to which a rule applies.
Alt hough object ive measures help ident ify int erest ing pat t erns, t hey are o ft en
insufficient unless co mbined wit h subject ive measures t hat reflect a part icular
user ’s needs and int erest s. For example, pat t erns descr ibing t he
charact er ist ics o f cust omers who shop fr equent ly at AllE lect ronics should be
int erest ing t o t he market ing manager, but ma y be o f lit t le int erest to ot her
analyst s st udying t he same dat abase for patt erns on emplo yee per for mance.
Subject ive int erest ingness measures are based on user beliefs in t he dat a.
These measures find pat t erns int erest ing if t he pat t erns ar e unexpect ed
(cont radict ing a user ’s belief) or offer st rat egic infor mat io n on which t h e user
can act . In t he lat t er case, such pat t erns are referred t o as act ionable. For
example, pat t erns like “a large eart hquake oft en fo llows a clust er of sma l l
quakes” may be highly act io nable if user s can act on t he infor mat io n t o save
lives. Pat t erns t hat are expect ed can be int er est ing if t hey confir m a
hypot hesis t hat t he user wishes t o validat e or t hey r esemble a user ’s hunch.
The seco nd quest ion— “Can a dat a mining syst em generat e all o f t he
int erest ing pat t erns?” — refer s t o t he co mplet eness o f a dat a mining algor it hm.
It is o ft en unrealist ic and inefficient for dat a mining syst ems t o generat e al l
possible pat t erns. Fina lly, t he t hird quest ion — “Can a dat a mining s yst e m
generat e only int erest ing pat t erns?” — is an opt imizat io n problem in dat a
mining. It is highly des ir able for dat a mining syst ems t o generat e onl y
int erest ing pat t erns.

5. Which Technologies Are Used?


As a highly applicat ion-dr iven do ma in, dat a mining has incorporat ed many
t echniques fro m ot her do mains such as st at ist ics, machine lear ni ng, pat t ern
recognit io n, dat abase and dat a warehouse syst ems, infor mat io n ret r ieval,
visualizat io n, algor it hms, high per for mance co mput ing, and many applicat io n
domains.
Statistics

St at ist ics st udies t he co llect io n, analys is, int erpret at ion or explanat ion, and
present at ion o f dat a. Dat a mining has an inherent connect ion wit h st at ist ics. A
st at ist ical model is a set of mat he mat ical funct io ns t hat descr ibe t he behavior
of t he object s in a t arget class in t erms o f rando m var iables and t heir
associat ed probabilit y dist r ibut ions. St at ist ical models are widely used to
model dat a and dat a classes. For example, in dat a mining t asks like dat a
charact er izat io n and classificat ion, st at ist ical mo dels o f t arget classes can be
built . I n ot her words, such st at ist ical models can be t he out come o f a dat a
mining t ask.

Machine Learning

Machine lear ning invest igat es how co mput ers can lear n (or impro ve t heir
per for mance) based on dat a. A main research area is for comput er programs t o
aut omat ically lear n to recognize co mplex pat t erns and make int elligent
decis io ns based on dat a. For examp le, a t ypical machine lear ning proble m is
to program a co mput er so t hat it can automat ically recognize handwr it t en
post al codes on mail aft er lear ning fro m a set of examples. Machine lear ning
is a fast -growing discipline. Here, we illust rat e classic proble ms in machine
lear ning t hat are highly relat ed t o dat a mining.

Supervised learnin g

Super vised lear ning is basically a syno nym for classificat io n. The super vis io n
in t he lear ning co mes fro m t he labeled examples in t he t raining dat a set . For
example, in t he post al code recognit io n problem, a set of handwr it t en post a l
code images and t heir corresponding machine -readable t ranslat io ns are used
as t he t raining examp les, which super vise t he lear ning o f t he classificat io n
model.

Unsupervised learnin g

Unsuper vised lear ning is essent ially a synonym for clust er ing. T he lear ning
process is unsuper vised since t he input examples are not class labeled.
Typically, we ma y use clust er ing t o discover classes wit hin t he dat a. For
example, an unsuper vised lear ning met hod can t ake, as i nput , a set of image s
of handwr it t en digit s. Suppose t hat it finds 10 clust ers o f dat a. These clust er s
ma y correspond t o t he 10 dist inct digit s of 0 t o 9, respect ively. However,
since t he t raining dat a are not labeled, the lear ned model cannot t ell us t he
semant ic meaning o f t he clust ers found .
Semi- supervi sed learning

Semi- super vised lear ning is a class of machine lear ning t echniques t hat make
use o f bot h labeled and unlabeled examples when lear ning a model. I n one
approach, labeled examples ar e used t o lear n class models and unlabeled
examples are used t o refine t he boundar ies bet ween classes. For a t wo -clas s
problem, we can t hink o f t he set of examples belo nging t o one class as t he
posit ive examples and t hose belo ng ing to t he ot her class as t he negat iv e
examples.

Active learning

Act ive lear ning is a machine learning approach t hat let s user s play an act ive
role in t he lear ning process. An act ive learning approach can ask a user (e.g.,
a domain expert ) to label an example, which ma y be fro m a set of u nlabe led
examples or synt hes ized by t he lear ning program. The goal is t o opt imize t he
model qualit y by act ively acquir ing knowledge fro m human user s, given a
const raint on how many examples t he y can be asked t o label.

Database Systems and Data Warehouses

Dat abase syst ems research fo cuses on t he creat io n, ma int enance, and use o f
dat abases for organizat io ns and end -users. Part icular ly, dat abase syst ems
resear cher s have est ablished highly r ecognized pr incip les in dat a models,
quer y languages, quer y processing and opt imizat io n met hods, dat a st orage,
and indexing and accessing met hods. Dat abase syst ems are o ft en well known
for t heir high scalabilit y in processing ver y large, relat ively st ruct ured dat a
set s. Recent dat abase syst ems have built syst emat ic dat a an alysis capabilit ies
on dat abase dat a using dat a warehousing and dat a mining facilit ies.
A dat a warehouse int egrat es dat a originat ing fro m mu lt iple sources and
var ious t imefr ames. It conso lidat es dat a in mult idimens io nal space t o for m
part ially mat er ialized dat a cubes. The dat a cube model not only facilit at es
OLAP in mult id imensio nal dat abases but also promot es mult id imensio nal dat a
mining.

Information Retrieval
Infor mat io n ret r ieval (IR) is t he science o f searching for document s or
infor mat io n in document s. Document s can be t ext or mult imedia, and ma y
reside on t he Web. The differences bet ween t radit ional infor mat io n ret rieval
and dat abase syst ems are t wofo ld: In for mat io n ret r ieval assumes t hat
(1) t he dat a under search are unst ruct ured
(2) t he quer ies are for med mainly by keywords, which do not have co mplex
st ruct ures.
The t ypical approaches in infor mat io n ret rieval adopt probabilist ic models:
( i) language mode l
( ii) t opic model.
Increas ingly large amount s o f t ext and mult imedia dat a have bee n
accumulat ed and made available online due t o t he fast growt h o f t he Web and
applicat io ns such as d igit al librar ies, digit al gover nment s, and healt h car e
infor mat io n syst ems. Their effect ive search and analys is have raised man y
challenging issues in dat a mining. Therefore, t ext mining and mult imedia dat a
mining, int egrat ed wit h infor mat io n ret rieval met hods, have beco me
incr easingly import ant .

6. Which Kinds of Applicatio ns Are Targeted?

As a highly applicat ion-dr iven disc ipline, dat a mining has seen great
successes in many applicat io ns. It is impossible t o enumer at e all applicat io ns
where dat a mining plays a cr it ical ro le. Present at ions o f dat a mining in
knowledge- int ensive applicat ion do mains, such as bio infor mat ics and
so ft ware engineer ing .

Business Intelligence

Business int elligence (BI) t echno logies provide hist or ical, current , and
predict ive views o f business operat io ns. Examp les include report ing, online
analyt ica l processing, business per for m.
dat a mining is t he core of business int elligence. Online analyt ical processing
tools in business int elligence rely on dat a warehousing and mult idimens io na l
dat a mining. Classificat io n and predict io n t echniques are t he core o f
predict ive analyt ics in business int elli gence, for which t her e are many
applicat io ns in analyz ing market s, supplie s, and sales.
Using charact er izat io n mining t echniques, we can bet t er under st and feat ures
of each cust omer group and develo p cust omized cus t omer reward programs.

Web Search Engines

A Web search engine is a specia lized comput er ser ver t hat searches for
infor mat io n on t he Web. The search result s of a user quer y are oft en ret urned
as a list (so met imes ca lled hit s). The hit s may consist of web pages, images,
and ot her t ypes o f files. So me search engines also search and ret urn dat a
available in public dat abases or open dir ect ories. Search engines differ fro m
web direct ories in t hat web direct ories are maint ained by human edit ors
whereas search engines operat e algor it hmically or by a mixt ure of algor it hmic
and human input .
Var ious dat a mining t echniques are used in all aspect s of search engines,
rang ing fro m crawling (e.g., decid ing which pages should be crawled and t he
crawling frequencies), index ing (e.g., select ing pages t o be indexed and
deciding t o which ext ent t he index sho uld be const ruct ed), and searching
(e.g., deciding how pages should be ranked, which advert ise ment s should be
added, and how t he search r esult s can be personalized or made “ cont ext
aware”).
Search engines pose grand challenges t o dat a mining.
i. The y have t o handle a huge and ever -gr owing amount of dat a.
ii. Web search engines oft en have t o deal wit h online dat a. A sear ch engine
ma y be able t o afford const ruct ing a model o ffline on huge dat a set s.
iii. Web search engines o ft en have t o deal wit h quer ies t hat are asked only a
ver y small number of t imes.

7. Major issues in Data Mining

Dat a mining is a dyna mic and fast -expanding field wit h great st rengt hs. majo r
issues in dat a mining are part it ioned int o five groups: mining met hodology,
user int eract ion, efficiency and scalabilit y, divers it y o f dat a t ypes, and dat a
mining and societ y.

Mining Methodology

Researchers have been vigorously developing new dat a mining met hodo logies.
This invo lves t he invest igat ion o f new kinds o f knowledge, mining in
mult id imensio nal space, int egrat ing met hods fro m ot her disciplines, and t he
considerat ion o f semant ic t ies amo ng dat a object s. In addit io n, mining
met hodologies should consider issues such as dat a uncert aint y, no ise, and
inco mplet eness.
 Mining vari ous and ne w kinds of knowledge: Dat a mining covers a wide
spect rum o f dat a analys is and knowledge discover y t asks, fro m dat a
charact er izat io n and discr iminat io n t o associat ion and correlat ion
analys is, classificat ion, regressio n, clust er ing, out lier analysis, sequence
analys is, and t rend and evo lut ion ana lysis .
 Mining kno wledge in multidimensional space: We can search for
int erest ing pat t erns amo ng co mbinat io ns of dimensio ns (at t ribut es) at
var ying levels o f abst ract ion. Such mining is known as (explorat ory)
mult id imensio nal dat a mining. I n many cases, dat a can be aggregat ed or
viewed as a mult idimens io nal dat a cube.
 Data mining —an interdi sci plinary ef f ort: The power of dat a mining can be
subst ant ially enhanced by int egrat ing new met hods fro m mult iple
disciplines. For examp le, t o mine dat a wit h nat ural language t ext , it
makes sense t o fuse dat a mining met hods wit h met hods o f infor mat io n
ret rieval and nat ural language processing .
 Handling uncertaint y, noise, or incompleteness of data : Dat a oft en
cont ain no ise, errors, except ions, or uncer t aint y, or are inco mplet e. Errors
and no ise may co nfuse t he dat a mining process, leading t o t he der ivat io n
of erroneous pat t erns. Dat a cleaning, dat a preprocessing, out lier det ect io n
and remo val, and uncert aint y reasoning are exa mples o f t echniques t hat
need t o be int egrat ed wit h t he dat a mining process.
 Pattern evaluation and pattern - or const raint-guided mi ning: What makes
a patt ern int erest ing ma y var y fro m user to user. Therefore, t echniques are
needed t o assess t he int erest ingness o f discovered pat t erns based on
subject ive measures. These est imat e t he value o f pat t erns wit h respect t o a
g iven user class, based on user beliefs or expect at io ns. Moreover, by
using int er est ingness measures or user -specified co nst raint s t o guide t he
disco ver y process, we may generat e more int erest ing pat t erns and reduce
t he search space.

User Interaction

The user p lays an import ant role in t he dat a mining process. Int erest ing ar eas
of research include how to int eract wit h a dat a mining syst em, how to
incorporat e a user ’s background knowledge in mining, and how to visualize
and co mprehend dat a mining result s.
 Interactive mi ning: The dat a mining process should be highly int eract ive.
Thus, it is import ant to build flexible user int er faces and an explorat or y
mining environment , facilit at ing t he user ’s int eract io n wit h t he syst em.
Int eract ive mining should allo w us ers t o dynamically change t he focus o f
a search, t o refine mining request s based on ret urned r esult s, and t o drill,
dice, and pivot t hrough t he dat a and knowledge space int eract ively,
dynamically explor ing “cube space” while mining.
 Incorporation of background knowl edge: Background knowledge,
const raint s, rules, and ot her infor mat io n regarding t he do main under st ud y
should be incorporat ed int o t he knowledge discover y process. Such
knowledge can be used for patt ern evaluat ion as well as t o guide t he
search t oward int erest ing pat t erns.
 Ad hoc data mining and data mining query languages: Quer y languages
(e.g., SQL) have played an import ant role in flexible sear ching because
t hey allo w users t o pose ad hoc quer ies. S imilar ly, high - level dat a mining
quer y languages or ot her high- level flexible user int er faces will give user s
t he freedo m t o define ad hoc dat a mining t asks.

Efficiency and Scalability


 Eff iciency and scalabili ty of data mi ning algorithms: Dat a mining
algor it hms must be effic ient and scalable in or der t o effect ively ext ract
infor mat io n fro m huge amount s o f dat a in many dat a reposit ories or in
dynamic dat a st reams. E ffic iency, scalabilit y, per for mance, opt imizat io n,
and t he abilit y t o execut e in real t ime are key cr it er ia t hat dr ive t he
development o f many new dat a mining a lgorit hms.
 Parallel, di stri buted, and incremental mi ning algorithms: T he hu mo ngous
size o f many dat a set s, t he wide dist r ibut io n of dat a, and t he
co mput at io nal co mplexit y o f so me dat a mining met hods are fact ors t hat
mot ivat e t he development of par allel and dist r ibut ed dat a - int ensive
mining a lgor it hms. Such algor it hms first part it io n t he dat a int o “pieces.”
Each piece is processed, in paralle l, by searching for pat t erns. The
parallel processes ma y int eract wit h o ne anot her. The pat t er ns fro m each
part it io n ar e event ually merged. Cloud comput ing and clust er co mput ing,
which use co mput ers in a dist r ibut ed and co llaborat ive wa y t o t ackle ver y
large-scale co mput at io nal t asks, are also act ive research t hemes in
parallel dat a mining.

Diversity of Database Types


 Handling complex types of data: a. It is unrealist ic t o expect one dat a
mining syst em t o mine all kinds o f dat a, given t he diver sit y o f dat a t ypes
and t he differ ent goals o f dat a mining. Domain - or applicat io n-dedicat ed
dat a mining syst ems are being co nst ruct ed for indept h mining o f specific
kinds o f dat a. The co nst ruct ion o f effect ive and efficient dat a mining t ools
for diver se applicat io ns remains a challenging and act ive area o f research.
 Mining dyna mic, net worked, and globa l dat a reposit ories: The discover y
of knowledge fro m different sources of st ruct ured, semi -st ruct ured, or
unst ruct ured yet int erconnect ed dat a wit h diver se dat a semant ics poses
great challenges t o dat a mining. Mining such gigant ic, int erconnect ed
infor mat io n net works ma y help disclose many more pat t erns and
knowledge in het erogeneous dat a set s t han can be discovered fro m a sma l l
set of iso lat ed dat a reposit ories.

Data Mining and Society

 Social impact s of data mining : Wit h dat a mining penet rat ing our ever yday
lives, it is import ant to st udy t he impact of dat a mining on societ y. How
can we use dat a mining t echno logy t o benefit societ y? How can we guard
against it s misuse? The improper disclosure or use of dat a and t he
pot ent ial vio lat io n o f individual pr ivacy and dat a prot ect io n r ight s ar e
areas o f concer n t hat need t o be addressed.
 Pri vacy-preserving data mining: Dat a mining will help scient ific
disco ver y, business management , econo my recover y, and secur it y
prot ect ion (e.g., t he real-t ime discover y of int ruder s and cyberat t acks).
However, it poses t he r isk o f disclo sing an individual’s personal
infor mat io n. St udies on pr ivacy-preser ving dat a publishing and dat a
mining are ongo ing. The philosophy is to obser ve dat a sens it ivit y and
preser ve people’s pr ivacy while per for ming successful dat a mining.
 Invisibl e data mi ning: I nt elligent search engines and I nt ernet -based st ores
per for m such invis ible dat a mining by incorporat ing dat a mining int o t heir
co mponent s t o improve t heir funct io nalit y and per for mance. This is done
oft en unbeknownst t o t he user .

7. Typical Data Mining Architecture


Knowl edge base:
This is t he do ma in knowledge t hat is used to guide t he searc h or evaluat e t he
int erest ing ness of result ing p at t erns. Such knowledge can include concept
hier archies, used t o organize at t ribut es or att ribut e values int o differ ent
levels o f abst ract ion.
Data m ining engine:
This is essent ial t o t he dat a mining sys t em and ideally cons ist s of a set of
funct io nal modules for t asks s uch as charact er izat io n, asso ciat io n,
classificat ion, clust er analys is, and evo lut io n and deviat io n ana lysis.
Pattern evaluation m odule:
This co mponent t ypica lly emplo ys int er est ingness measures and int eract s
wit h t he dat a mining modules so as to focus t h e search t owards int erest ing
pat t erns. Alt er nat ively, t he pat t ern evaluat ion module may be int egrat ed wit h
t he mining module, depending on t he imple ment at ion of t he dat a mining
met hod used.
Graphi cal user interface:
This module co mmunicat es bet ween users and t he dat a mining syst em,
allowing t he user t o int eract wit h t he syst em by specifying a dat a mining
quer y or t ask, providing infor mat io n to help focus t he search, and per for ming
explorat ory dat a mining based on t he int er mediat e dat a mining result s.
Data Warehouse

1.1
1. What Is a Data Warehouse?
A dat a warehouse is a subject -orient ed, int egrat ed, t ime - var iant , and nonvo lat ile
co llect io n of dat a in support of manageme nt ’s decis io n making process.
A dat a warehouse refer s t o a dat a reposit ory t hat is maint ained separat ely fro m
an organizat ion’s operat io nal dat abases. Dat a warehouse syst ems a llow for
int egrat ion o f a var iet y o f applicat ion syst ems. T he y support infor mat io n
processing by pro viding a so lid plat for m o f conso lidat ed hist or ic dat a for
analys is.
 Subject-oriented : A dat a warehouse is organized around major subject s such
as cust omer, supplier, product , and sales. Rat her t han concent rat ing on t he
day-t o -day operat ions and t ransact ion pr ocessing o f an organizat ion, a dat a
warehouse focuses on t he modeling and analysis o f dat a for decis io n makers.
Hence, dat a warehouses t ypically provide a simp le and concise view of
part icular subject issues by excluding dat a t hat are not useful in t he dec isio n
support process.
 Integrated: A dat a warehouse is u sually const ruct ed by int egrat ing mult iple
het erogeneous sources, such as relat io nal dat abases, flat files, and online
t ransact ion records. Dat a cleaning and dat a int egrat ion t echniques are
applied t o ensur e consist ency in naming convent io ns, encoding st ru ct ures,
att ribut e measures, and so on.
 Tim e-variant: Dat a are st ored t o provide infor mat ion fro m an hist or ic
perspect ive. E xample: The past 5 –10 years. Ever y key st ruct ure in t he dat a
warehouse co nt ains, eit her implicit ly or explic it ly, a t ime element .
 Nonvolatil e: A dat a warehouse is alwa ys a phys ically separat e st ore of dat a
t ransfor med fro m t he applicat ion dat a found in t he operat io nal environment .
Due to t his separat io n, a dat a warehouse does not require t ransact io n
processing, recover y, and concurre ncy cont rol mechanis ms. It usuall y
requir es o nly t wo operat ions in dat a accessing: init ial lo ading o f dat a and
access of dat a.

In summar y, a dat a warehouse is a semant ically co nsist ent dat a st ore t hat ser ves
as a physical imple ment at ion o f a decis ion support dat a model. It stores t he
infor mat io n an ent erpr ise needs t o make st rat egic decis io ns. A dat a warehouse
is also oft en viewed as an archit ect ure, const ruct ed by int egr at ing dat a fro m
mult ip le het erogeneous sources t o support st ruct ured and/or ad hoc quer ies,
analyt ical report ing, and decis io n making.
How are organizat io ns using t he infor mat ion fro m dat a warehouses?
Many organizat ions use t his infor mat io n to support bus iness decis io n - making
act ivit ies, including:
 Increas ing cust o mer focus, which inc ludes t he analys is o f cust omer buying
pat t erns (such as buying preference, buying t ime, budget cycles, and
appet it es for spending).
 Reposit io ning product s and managing pr oduct port fo lio s by co mpar ing t he
per for mance o f sales by quart er, by year, and by geographic reg io ns in order
to fine-t une product io n st rat egies.
 Analyzing operat io ns and looking for sources o f pro fit .
 Managing cust omer relat io nships, mak ing enviro nment al correct io ns, and
managing t he cost of corporat e asset s.

Dat a warehousing provides an int erest ing alt er nat ive t o t his t radit ional
approach. Rat her t han using a quer y - dr iven approach, dat a warehousing
emplo ys an update-d riven approach in which infor mat ion fro m mu lt iple,
het erogeneous sources is int egrat ed in advance and st ored in a warehouse for
direct quer ying and analys is.

2. Differences between Operational Database Systems and


Data Warehouses
The major t ask of online operat iona l dat abase syst ems is t o per for m online
t ransact ion and quer y processing. These syst ems are ca lled online t ransact ion
processing (OLTP) syst ems. T hey cover mo st of t he day -t o -day operat ions o f a n
organizat ion such as purchasing, invent ory, manuf act ur ing, banking, payro ll,
regist rat io n, and account ing.
Dat a warehouse syst ems, o n t he ot her hand, ser ve user s or knowledge workers i n
t he role o f dat a analys is and decis io n making. Such syst ems can organize and
present dat a in var ious for mat s in order t o acco mmodat e t he diver se needs o f
different users. These syst ems are known as online analyt ical processing
(OLAP) syst ems.
T he ma jor distingu is hing f eatur es of OLT P and OL AP ar e su mmar ized as f ollows :
Users and system orientation:
 An OLTP syst em is cust omer-orient ed and is used for t ransact io n and quer y
processing by c lerks, client s, and infor ma t ion t echno logy pro fessio nals.
 An OLAP syst em is market -orient ed and is used for dat a analys is by
knowledge workers, including manager s, ex ecut ives, and analyst s.
Data content s:
 An OLTP s yst em manages current dat a t hat , t ypically, ar e t oo det ailed t o be
easily used for decisio n making.
 An OLAP syst e m manages large amo unt s of hist or ic dat a, provides facilit ie s
for summar izat ion and aggr egat io n, and st ores and manages infor mat ion at
different levels o f gr anular it y. T hese feat ures make t he dat a easier t o use for
infor med decis io n making.
Database design:
 An OLTP syst em usually adopt s an ent it y -relat io nship (ER) dat a model and
an applicat ion-or ient ed dat abase design.
 An OLAP syst e m t ypically adopt s eit her a st ar or a snowflake model and a
subject -orient ed dat abase design.

View:
 An OLTP syst em fo cuses mainly o n t he current dat a wit hin an ent erpr ise or
depart ment , wit hout referr ing to hist or ic dat a or dat a in different
organizat ions.
 In cont rast , an OLAP syst em o ft en spans mult iple ver sio ns of a dat abase
sche ma, due to t he evo lut ionar y process of an organizat ion. OLAP syst ems
also deal wit h infor mat ion t hat originat es fro m different organiza t io ns,
int egrat ing infor mat io n fro m many dat a st ores. Because o f t heir huge vo lume,
OLAP dat a are st ored on mult iple st orage media.
Access patterns:
 The access pat t erns of an OLTP syst em consist mainly o f short , ato mic
t ransact ions. Such a syst em requir e s concurrency cont ro l and recover y
mechanis ms.
 However, accesses t o OLAP syst ems are most ly r ead -only operat io ns
(because most dat a warehouses st ore hist or ic rat her t han up -t o -dat e
infor mat io n), alt hough many could be co mplex quer ies.
3. Why Have a Separate Data Warehouse?
“Why not perf orm online analyti cal processing directly on such databases
instead of spending additi onal time and resources to const ruct a separate dat a
warehouse?”
i. A major reason for such a separ at ion is t o help pro mot e t he hig h
per for mance o f bot h syst ems. An operat io nal dat abase is designed and
t uned fro m known t asks and workloads like indexing and hashing using
pr imar y keys, sear ching for part icular records, and opt imizing “canned”
quer ies.
ii. On t he ot her hand, dat a warehouse que r ies are o ft en co mplex. The y
invo lve t he co mput at ion o f large dat a groups at summar ized levels, and
ma y requir e t he use o f special dat a organizat io n, access, and
imple ment at ion met hods based on mult idimens io nal views. Processing
OLAP quer ies in operat ional dat abases would subst ant ially degr ade t he
per for mance o f operat ional t asks.
iii. Moreover, an operat ional dat abase support s t he concurrent processing o f
mult ip le t ransact io ns. Co ncurrency cont rol and recover y mechanis ms
(e.g., locking and logging) are requir ed t o ensure t he consist ency and
robust ness of t ransact ions.
iv. An OLAP quer y o ft en needs read -only access of dat a records for
summar izat ion and aggregat ion. Concurrency cont rol and recover y
mechanis ms, if applied for such OLAP operat ions, may jeopardize t he
execut io n of concurrent t ransact ions and t hus subst ant ially reduce t he
t hroughput of an OLTP syst em.
v. Fina lly, t he separat io n of operat io nal dat abases fro m dat a warehouses is
based on t he different st ruct ures, cont ent s, and uses o f t he dat a in t hese
t wo syst ems. Decisio n support requir es hist oric dat a, wher eas operat iona l
dat abases do not t ypically maint ain hist oric dat a. In t his cont ext , t he dat a
in operat ional dat abases, t hough abundant , are usually far fro m co mplet e
for decis io n making.
vi. Decis io n support requires co nso lidat ion (e.g., aggregat io n and
summar izat ion) o f dat a fro m het erogeneous sources, resu lt ing in high -
qualit y, clean, int egrat ed dat a. In co nt rast , operat ional dat abases co nt ain
only det ailed raw dat a, such as t ransact io ns, which need t o be
co nso lidat ed before analys is. Because t he t wo syst ems provide quit e
different funct ionalit ies and requir e differ ent kinds o f dat a, it is present ly
necessar y t o maint ain separat e dat abases.

4. Data Warehousing: A Multitiered Architecture


Dat a warehouses oft en adopt a t hree -t ier archit ect ure.
i. The bott om t ier is a warehouse dat abase ser ver t hat is almo st alwa ys a
relat ional dat abase syst em. Back - end tools and ut ilit ies are used to feed dat a
int o t he bott om t ier fro m operat ional dat abases or ot her ext ernal sources.
These t ools and ut ilit ies per for m dat a ext ract ion, cleaning, and
t ransfor mat io n (e.g., to merge similar dat a fro m different sources int o a
unified for mat ), as well as lo ad and r efresh funct io ns t o updat e t he dat a
warehouse.
ii. The dat a are ext ract ed using applicat ion program int er faces known as
gat eways. A gat eway is support ed by t he under lying DBMS and allo ws c lient
programs t o generat e SQL code t o be execut ed at a ser ver. E xamples o f
gat eways include ODBC (Open Dat abase Connect ion) a nd OLEDB (Object
Linking and E mbedding Dat abase) by Microso ft and JDBC (Java Dat abase
Connect ion). T his t ier also co nt ains a met adat a reposit or y, which st ores
infor mat io n about t he dat a warehouse and it s cont ent s.
iii. The middle t ier is an OLAP ser ver t hat is t ypically imp lement ed using e it her
(1) a relat ional OLAP (ROLAP) model ( i.e., an ext ended relat io nal DBMS
t hat maps operat ions on mult id imensional dat a to st andard relat ional
operat ions) ; or
(2) a mult idimensio nal OLAP (MOLAP) model ( i.e., a special -purpose ser ver
t hat direct ly imple ment s mult idimensio nal dat a and operat ions).
iv. The t op t ier is a front -end client layer, which cont ains quer y and report ing
tools, analys is t ools, and/or dat a mining t ools (e.g., t rend analys is,
predict ion, and so on).
5. Data Warehouse Models: Enterprise Warehouse, Data
Mart, and Virtual Warehouse
Enterpri se warehouse:
 An ent erpr ise warehouse co llect s all o f t he infor mat ion about subject s
spanning t he ent ire organizat ion. It provides corporat e -wide dat a
int egrat ion, usually fro m one or more operat ional syst ems or ext erna l
infor mat io n provider s, and is cross - funct ional in scope.
 It t ypically co nt ains det ailed dat a as well as summar ized dat a, and can
range in size fro m a few gigabyt es t o hundreds o f gigabyt es, t erabyt es, or
beyo nd.

Data m art:
 A dat a mart cont ains a subset of corporate -wide dat a t hat is o f value t o a
spec ific group of user s. The scope is confined t o specific select ed
subject s. Dat a mart s are usually implement ed on low -cost depart ment a l
ser ver s t hat are Unix/Linux or Windows based.
 The implement at io n cycle o f a dat a mart is more likely t o be measured in
weeks rat her t han mo nt hs or year s.
 Depend ing on t he source of dat a, data mart s can be cat egor ized as
independent or dependent .
 Independent dat a mart s are sourced fro m dat a capt ured fro m one or more
operat ional syst ems or ext er nal infor mat io n provider s, or fro m dat a
generat ed locally wit hin a part icular depart ment or geographic area.
Dependent dat a mart s are sourced direct ly fro m ent erpr ise dat a
warehouses.
Virtu al warehou se:
 A virt ual warehouse is a set of views over operat ional dat abases. For
effic ient quer y processing, only so me o f t he possible summar y views ma y
be mat er ialized. A virt ual warehouse is easy t o build but requir es excess
capacit y on operat io nal dat abase ser ver s.
“What are the pros and cons of the top - down and bottom-up approaches to
data warehouse devel opment?”
 The t op-down development o f an ent erpr ise warehouse ser ves as a
syst emat ic so lut ion and minimizes int egr at io n proble ms. However, it is
expensive, t akes a lo ng t ime t o develo p, and lacks flexibilit y due t o t he
difficult y in achieving consist ency and consensus for a commo n dat a
model for t he ent ire organizat io n.
 The bottom-up approach to t he design, development , and deplo yment of
independent dat a mart s provides flexibilit y, low cost , and rapid ret urn of
invest ment . It , however, can le ad to proble ms when int egrat ing var io us
dispar at e dat a mart s int o a consist ent ent erpr ise dat a warehouse.
Data Warehouse Implementation

A met hod for t he development of dat a warehouse syst ems is t o imp lement
t he warehouse in an increment al and evo lut io n ar y manner.
o First , a high- level corporat e dat a model is defined wit hin a reaso nabl y
short period (such as one or t wo mont hs) t hat provides a corporat e -wide,
consist ent , int egrat ed view o f dat a amo ng different subject s and pot ent ial
usages. This high- level model will great ly reduce fut ure int egrat ion
problems.
o Second, independent dat a mart s can be imple ment ed in parallel wit h t he
ent erpr ise warehouse based on t he same corporat e dat a model set not ed
before.
o Third, dist r ibut ed dat a mart s can be co nst ruct ed to int egrat e different
dat a mart s via hub ser ver s. Fina lly, a mult it ier dat a warehouse is
const ruct ed wher e t he ent erpr ise warehouse is t he so le cust odian o f al l
warehouse dat a, which is t hen dist r ibut ed to t he var ious dependent dat a
mart s.

6. Extraction, Transformation, and Loading


Dat a warehouse syst ems use back -end tools and ut ilit ies t o populat e and
refresh t heir dat a. These t ools and ut ilit ie s include t he fo llowing funct ions:
 Dat a ext ract ion, which t ypically gat her s dat a fro m mult iple, het erogeneous,
and ext ernal sources.
 Dat a cleaning, which det ect s errors in t he dat a and r ect ifies t he m when
possible.
 Dat a t ransfor mat ion, which convert s dat a fro m legacy or host for mat to
warehouse for mat .
 Load, which sort s, summar izes, conso lidat es, comput es views, checks
int egr it y, and builds indices and part it io ns.
 Refr esh, which propagat es t he updat es fro m t he dat a sources to t he
warehouse.
Bes ides t hese t ools, dat a warehouse s yst ems usua lly provide a good set of
dat a warehouse management tools. Dat a cleaning and dat a t ransfor mat ion ar e
import ant st eps in impro ving t he dat a qualit y and, subsequent ly, t he dat a
mining result s.

7. Metadata Repository
Met adat a are dat a about dat a. When used in a dat a warehouse, met adat a are
t he dat a t hat define warehouse object s. Met adat a are creat ed for t he dat a
names and definit io ns o f t he given warehouse.
A met adat a reposit ory should cont ain t he fo llo wing:
o A descr ipt ion o f t he dat a warehouse st ruct ure, which includes t he
warehouse schema, view, dimensio ns, hierarchies, and der ived dat a
definit io ns, as well as dat a mart locat io ns and cont ent s.
o Operat ional met adat a, which inc lude dat a lineage (hist or y o f migr at ed
dat a and t he sequence o f t ransfor mat io ns applied t o it ), currency o f dat a
(act ive, archived, or purged), and mo nit or ing infor mat io n (warehouse
usage st at ist ics, error report s, and aud it t rails).
o The algor it hms used for summar izat io n, which include measure an d
dimensio n definit io n algor it hms, dat a on granular it y, part it ions, subject
areas, aggregat io n, summar izat ion, and pr edefined quer ies and report s.
o Mapping fro m t he operat io nal environme nt to t he dat a warehouse, which
inc ludes source dat abases and t heir co nt ent s, gat eway descr ipt io ns, dat a
part it io ns, dat a ext ract ion, cleaning, t ransfor mat ion rules and default s,
dat a refresh and purging rules, and sec ur it y (user aut hor izat ion and access
cont rol).
o Dat a relat ed to syst em per for mance, which include indices and profile s
t hat impro ve dat a access and ret r ieval per for mance, in addit io n t o rules
for t he t iming and scheduling o f refresh, updat e, and replicat io n c ycles.
o Business met adat a, which include business t erms and definit io ns, dat a
ownership infor mat ion, and charging po licies.

Met adat a also ser ve as a guide t o t he algor it hms used for summar izat io n
bet ween t he current det ailed dat a and t he light ly summar i zed dat a, and
bet ween t he light ly summar ized dat a and t he highly summar ized dat a.
Met adat a should be st ored and managed persist ent ly ( i.e., on disk).

1.2
1. Data Cube: A Multidimensional Data Model
Data Cube
A dat a cube allows dat a to be modeled and viewed in mu lt iple dimensio ns. It
is defined by dimens io ns and fact s.

Dimension
Dimens io ns are t he per spect ives or ent it ies wit h respect to which a n
organizat ion want s t o keep records. Each dimensio n may have a t able
associat ed wit h it , called a dimensio n t able, which furt her descr ibes t he
dimensio n.

Fact
Fact s are numer ic measures. They are used t o analyze relat io nships bet wee n
dimensio ns. T he fact t able co nt ains t he names o f t he fact s, or measures, as
well as keys t o each o f t he relat ed dimensio n t ables.
In t his 2-D represent at ion, t he sales for Vancouver are shown wit h r espect to
t he t ime dimensio n (organized in quart ers) and t he it em di mens io n (organized
according to t he t ypes o f it ems so ld). The fact or measure displayed is
dollars so ld ( in t housands).

Now, suppose t hat we would like t o view t he sales dat a wit h a t hir d
dimensio n. For inst ance, suppose we would like t o view t he dat a a ccording t o
t ime and it em, as well as locat ion, for t he cit ies Chicago, New York, Toront o,
and Vancouver.
Suppose t hat we would now like t o view our sales dat a wit h an addit io nal
fourt h dimensio n such as supplier. Viewing t hings in 4 -D beco mes t ricky.
However, we can t hink o f a 4 -D cube as being a ser ies o f 3 -D cubes.

In t he dat a warehousing research lit er at ure, a dat a cube like t hose shown in
Figur es is o ft en referred t o as a cubo id. Given a set of dimensio ns, we ca n
generat e a cubo id for each o f t he possible subset s of t he given dimensio ns.
The result would for m a lat t ice of cuboids, each showing t he dat a at a
different level o f summar izat io n, or group -by. T he lat t ice o f cubo ids is t hen
referred t o as a dat a cube. Figure 4.5 shows a lat t ice o f c ubo ids for ming a
dat a cube for t he dimensio ns t ime, it em, locat ion, and supplier. The cubo id
t hat ho lds t he lowest leve l o f summar izat ion is called t he base cubo id.

The 0-D cubo id, which ho lds t he highest level o f summar izat io n, is ca lled t he
apex cubo id.

2. Schemas for Multidimensional Data Models


A dat a warehouse, however, requir es a concise, subject -orient ed schema t hat
facilit at es online dat a analys is. T he mo st popular dat a model for a dat a
warehouse is a mult idimens io nal model, which can exist in t he for m of a st ar
sche ma, a snowflake schema, or a fact const ellat io n schema.

Star Schema

The most commo n modeling paradigm is t he st ar schema, in which t he dat a


warehouse co nt ains:
 A large cent ral t able ( fact t able) cont aining t he bulk o f t he dat a, wit h no
redundancy
 A set of smaller at t endant t ables (dimensio n t ables), one for each
dimensio n.
 The schema graph resembles a st arbur st , wit h t he dimensio n t able s
displa yed in a radial pat t ern around t he cent ral fact t able.

Examp le:
A st ar schema for AllE lect ronics sales is shown in F igure 4.6. Sales are
considered alo ng four dimensio ns: t ime, it em, br anch, and locat io n. The
sche ma cont ains a cent ral fact t able for sales t hat cont ains ke ys t o each of
t he four dimensio ns, alo ng wit h t wo measures: do llar s so ld a nd unit s so ld. To
minimize t he size o f t he fact t able, dimensio n ident ifier s (e.g., t ime key and
it e m key) are syst em-generat ed ident ifiers.
Snowflake schema

 The snowflake schema is a var iant of t he st ar sche ma model, where so me


dimensio n t ables are nor malized, t hereby furt her split t ing t he dat a int o
addit io nal t ables.
 The result ing sche ma graph for ms a shape similar t o a snowflake.
 The major differ ence bet wee n t he snowflake and st ar schema mo dels is
t hat t he dimensio n t ables of t he snowflake mo del ma y be kept in
nor malized for m t o reduce redundanc ies. Such a t able is easy t o maint ai n
and saves st orage space.
 Alt hough t he snowflake schema reduces redundancy, it is not as popular
as t he st ar schema in dat a warehouse des ign.

Example :

A snowflake schema for AllE lect ronics sales is given in Figure 4.7. Here, t he
sales fact t able is ident ical t o t hat of t he st ar schema in Figure 4.6. T he ma in
difference bet ween t he t wo schemas is in t he definit io n o f dimensio n t ables.
The single dimensio n t able for it em in t he st ar schema is nor ma lized in t he
snowflake schema, result ing in new it em and supplier t ables. For example,
t he it em dimens io n t able now cont ains t he at tribut es it em key, it em name,
brand, t ype, and supplier key, where supplier key is linked t o t he supplier
dimensio n t able, cont aining supplier key and supplier t ype infor mat ion.
S imilar ly, t he single dimens io n t able for locat io n in t he st ar schema can be
nor malized int o t wo new t ables: locat ion and cit y. The cit y key in t he new
lo cat ion t able links t o t he cit y dimensio n. Not ice t hat , when desirable,
furt her nor malizat io n can be per for med on province or st at e and count r y in
t he snowflake schema.
Fact constellation:

 Sophist icat ed applicat ions may r equire mult iple fact t ables t o shar e
dimensio n t ables.
 This kind o f schema can be viewed as a co llect io n of st ars, and hence is
called a galaxy schema or a fact const ellat io n.

Examp le:
A fact const ellat io n schema is shown in Figure 4.8. This schema specifies
t wo fact t ables, sa les and shipp ing. T he sales t able definit io n is ident ical t o
t hat of t he st ar schema (Figure 4.6). The shipping t able has five dimens io ns,
or keys—it em key, t ime key, shipper key, fr o m locat io n, and t o locat ion —and
t wo measures—do llars cost and unit s shipped. A fact const ellat ion schema
allows dimensio n t ables to be shared bet ween fact t ables. For examp le, t he
dimensio ns t ables for t ime, it em, and lo cat io n are shar ed bet ween t he sales
and shipping fact t ables.
A dat a warehouse co llect s infor mat ion about subject s t hat span t he ent ire
organizat ion, such as cust omers, it ems, sales, asset s, and per sonnel, and t hus
it s scope is ent erpr ise -wide. For dat a warehouses, t he fact const ell at io n
sche ma is co mmo nly used, since it can mo del mu lt iple, int errelat ed subject s.
A dat a mart , on t he ot her hand, is a depar t ment subset of t he dat a warehouse
t hat focuses on select ed subject s, and t hus it s scope is depart ment wide. For
dat a mart s, t he st ar or snowflake schema is co mmo nly used, since bot h are
geared t oward modeling single subject s, alt hough t he st ar schema is more
popular and efficient .

3. Concept Hierarchies
 A co ncept hierarchy defines a sequence o f mappings fro m a set of low - level
concept s to higher- level, more general concept s.
 For example, Vancouver can be mapped to Brit ish Co lumbia, and Chicago
to Illino is. The provinces and st at es can in t ur n be mapped to t he count ry
(e.g., Canada or t he Unit ed St at es) to which t hey belo ng.
 These mappings for m a concept hier archy for t he dimensio n locat ion,
mapping a set of low - level concept s ( i.e., cit ies) t o higher - level, more
general co ncept s ( i. e., count ries).
 Many co ncept hierarchies ar e imp licit wit hin t he dat abase schema.
 Alt er nat ively, t he at t rib ut es o f a dimens ion may be organized in a part ial
order, forming a lat t ice.
 A concept hierar chy t hat is a tot al or part ial order among at t ribut es in a
dat abase schema is called a schema hierar chy.
 Concept hierarchies may also be defined by discret izing or grouping values
for a given dimensio n or att ribut e, result ing in a set -grouping hierar chy. A
tot al or part ial order can be defined amo ng groups of values.
 There ma y be more t han o ne concept hier archy for a given at t ribut e or
dimensio n, based on different u ser viewpo int s.
 Concept hierarchies may be provided manually by syst em user s, domain
expert s, or knowledge engineers, or may be aut omat ica lly generat ed based
on st at ist ical analys is o f t he dat a dist r ibut io n.
4. Typical OLAP Operations

“How are concept hierarchi es usef ul in OLAP?”


 In t he mult idimens io nal model, dat a are organized int o mult iple
dimensio ns, and each dimensio n cont ains mult iple levels o f abst ract ion
defined by concept hier archies.
 This organizat ion provides user s wit h t he flexibilit y t o view dat a fro m
different perspect ives.
 A number of OLAP dat a cube operat io ns exist to mat er ia lize t hese different
views, allowing int eract ive quer ying and analys is o f t he dat a at hand.
Hence, OLAP provides a user- fr iendly environment for int eract ive dat a
analys is.

OLAP Operations
Example:
At t he cent er of t he figure is a dat a cube for AllE lect ronics sales. The cube
cont ains t he dimensio ns lo cat ion, t ime, and it e m, where locat io n is aggregat ed
wit h respect to cit y values, t ime is aggr egat ed wit h respect to quart ers, and it em
is aggregat ed wit h respect to it em t ypes. To aid in our explanat ion, we refer to
t his cube as t he cent ral cube. The measur e displayed is do llar s so ld ( in
t housands). (For impro ved readabilit y, only so me of t he cubes’ cell values are
shown.) The dat a examined are for t he cit ies Chicago, New York, Toronto, and
Vancouver.

i. Roll-up:
 The ro ll-up operat ion (also called t he dr ill -up operat ion by so me vendors)
per for ms aggregat io n on a dat a cube, eit her by climbing up a concept
hier archy for a dimensio n or by dimensio n reduct ion.
 When roll- up is per for med by dimens io n reduct io n, one or more dimens io ns
are remo ved fro m t he given cube.

ii. Drill-down:
 Dr ill-down is t he rever se o f roll-up. It navigat es fro m less det ailed dat a to
more det ailed dat a. Dr ill-down can be realized by eit her st epping down a
concept hierarchy for a dimensio n or int roducing addit io nal dimensio ns.
 Because a dr ill-down adds more det ail t o t he given dat a, it can also be
per for med by adding new dimensio ns t o a cube.
iii. Slice and dice:
 The slice operat io n per for ms a select ion on one dimensio n o f t he given cube,
result ing in a subcube.
 The dice operat ion defines a subcube by per for ming a select ion on t wo or more
dimensio ns.

iv. Pivot (rotate):


 P ivot (also called rot at e) is a visualizat ion operat ion t hat rot at es t he dat a axes
in view t o provide an alt er nat ive dat a present at ion.
 Ot her examp les include rot at ing t he axes in a 3 -D cube, or t ransfor ming a 3 -D
cube int o a ser ies of 2 -D planes.

v. Other OLAP operations:


 Some OLAP s yst ems o ffer ad dit io nal dr illing operat ions. For examp le, dr ill -
across execut es quer ies invo lving ( i. e., across) more t han one fact t able.
 The dr ill-t hrough operat ion uses relat ional SQL facilit ies t o drill t hrough t he
bott om level o f a dat a cube down t o it s back -end re lat io nal t ables.
 Ot her OLAP operat ions may include ranking t he t op N or bottom N it ems in
list s, as well as co mput ing mo ving averages, growt h rat es, int erest s, int erna l
ret urn rat es, depreciat io n, currency conversio ns, and st at ist ical funct ions.
1.3 Data Warehouse Design and Usage

1. A Business Analysis Framework for Data Warehouse


Design

 To design an effect ive dat a warehouse we need t o underst and and analyze
business needs and co nst ruct a business analysis fr amework. The const ruct io n
of a large and co mplex infor mat ion syst em can be viewed as t he const ruct io n
of a large and co mplex building , for which t he owner, archit ect , and builder
have different views.
 These views are co mbined to form a complex framework t hat represent s t he
top-down, business-dr iven, or owner ’s perspect ive, as well as t he botto m -up,
builder-dr iven, or imp lement or ’s view of t he infor mat io n syst em.
 Four different views regard ing a dat a warehouse design must be co nsidered:
t he t opdown view, t he dat a source view, t he dat a warehouse view, and t he
business quer y view.

 The t op-down view allows t he select io n o f t he relevant infor mat io n


necessar y for t he dat a warehouse. T his infor mat io n mat ches current and
fut ure business needs.
 The dat a source view exposes t he infor mat io n being capt ured, stored, and
managed by operat ional syst ems. T his infor mat io n may be document ed at
var ious levels o f det ail and accuracy, from ind ividual dat a source t ables t o
int egrat ed dat a source t ables. Dat a sources are oft en modeled by t radit io nal
dat a modeling t echniques, such as t he ent it y-re lat io nship model or CASE
(comput er-aided so ft ware engineer ing) t ools.
 The dat a warehouse view includes fact t ables and dimensio n t ables. It
represent s t he infor mat ion t hat is st ored inside t he dat a warehouse,
inc luding precalculat ed t ot als and count s, as well as infor mat ion regarding
t he source, dat e, and t ime o f origin, added to provide hist or ical cont ext .
 Fina lly, t he business quer y view is t he dat a per spect ive in t he dat a
warehouse fro m t he end -user ’s viewpo int .

2 . Data Warehouse Design Process


A dat a warehouse can be built using a top -down approach, a bot tom-up
approach, or a combinat io n o f bot h.
 The t op-down approach st art s wit h o verall design and planning. It is useful in
cases wher e t he t echno logy is mat ure and well known, and wher e t he
business problems t hat must be so lved are clear and well underst ood.
 The bott omup approach st art s wit h exper i ment s and protot ypes. T his is usefu l
in t he ear ly st age o f business modeling and t echno logy development . It
allows an organizat ion t o move forward at considerably less expense and t o
evaluat e t he t echno logical benefit s before making significant commit ment s .
 In t he co mbined approach, an organizat io n can explo it t he planned and
st rat egic nat ure of t he top -down approach while ret aining t he rapid
imple ment at ion and opport unist ic applicat io n o f t he bott om -up approach .

3 Data Warehouse Usage for Information Processing

There are t hree kinds of dat a warehouse applicat io ns: infor mat ion processing,
analyt ical processing, and dat a mining.
i. Inform ation processing support s quer ying, basic st at ist ical analys is, and
report ing using cr osst abs, t ables, chart s, or graphs. A current t rend in dat a
warehouse infor mat io n processing is to const ruct low -cost web- based
accessing t ools t hat are t hen int egrat ed wit h web browsers.
ii. Analyti cal processing support s basic OLAP operat ions, including sli ce-and-
dice, dr ill-down, roll-up, and pivot ing. It generally operat es on hist oric dat a
in bot h summar ized and det ailed for ms. The major st rengt h of online
analyt ical processing over infor mat ion processing is t he mult idimensio na l
dat a ana lysis o f dat a warehouse dat a.
iii. Data m ining support s knowledge discover y by finding hidden pat t erns and
associat ions, const ruct ing analyt ical mo dels, per for ming classificat ion and
predict ion, and present ing t he mining result s using visualizat io n t ools .

“How does data mining relate to inf ormation processi ng and online analyti cal
processing?”
a) Infor mat io n process ing, based on quer ies, can find useful infor mat io n.
However, answer s t o such quer ies reflect t he infor mat io n dir ect ly st ored in
dat abases or comput able by aggregat e fu nct io ns.
b) They do not reflect sophist icat ed pat t erns or regular it ies bur ied in t he
dat abase. T herefore, infor mat io n processing is not dat a mining.
c) Online analyt ical processing co mes a st ep closer t o dat a mining because it
can der ive infor mat io n summar ized at mult iple gr anular it ies fro m user -
spec ified subset s of a dat a warehouse. Such descr ipt ions ar e equiva lent t o t he
class/co ncept descr ipt ions.

4 From Online Analytical Processing t o Multidimensional


Data Mining

Mult idimensio nal dat a mining (also kno wn as explorat ory mult idimensio nal dat a
mining, o nline analyt ical mining, or OLAM) int egr at es OLAP wit h dat a mining t o
uncover knowledge in mult id imensio na l dat abases. Amo ng t he many dif ferent
paradigms and archit ect ures o f dat a mining syst ems, mult idimensio nal dat a
mining is part icular ly import ant for t he follo wing reasons:
 High qualit y o f dat a in dat a warehouses: Most dat a mining t ools need t o work
on int egrat ed, consist ent , and cleaned dat a, which requires cost ly dat a
cleaning, dat a int egr at ion, and dat a t ransfor mat io n as preprocessing st eps. A
dat a warehouse const ruct ed by such prepr ocessing ser ves as a valuable source
of high-qualit y dat a for OLAP as well as for dat a mining.
 Available infor mat io n processing infr ast ruct ure surrounding dat a warehouses:
Co mprehensive infor mat io n processing and dat a analys is infr ast ruct ures ha ve
been or will be syst emat ically const ruct ed surrounding dat a warehouses,
which include accessing, int egrat io n, conso lidat ion, and t ransfor mat io n o f
mult ip le het erogeneous dat abases, ODBC/ OLEDB co nnect ions, Web accessing
and ser vice facilit ies, and report ing and OLAP analys is t ools. It is prudent to
make t he best use of t he available infr ast ruct ures rat her t han const ruct ing
ever yt hing fro m scrat ch.
 OLAP- based explorat ion o f mult idimensio nal dat a: Effect ive dat a mining
needs explorat ory dat a analysis. A use r will o ft en want to t raverse t hrough a
dat abase, select port ions o f relevant dat a, ana lyze t hem at different
granular it ies, and present knowledge/result s in differ ent for ms.
Mult idimensio nal dat a mining provides facilit ies for mining on differ ent
subset s of dat a and at var ying levels o f abst ract io n —by dr illing, pivot ing,
filt er ing, dic ing, and slic ing on a dat a cube and/or int er mediat e dat a mining
result s. This, t oget her wit h dat a/knowledge visualizat ion t ools, great l y
enhances t he power and flexibilit y o f dat a mining.
 Online select ion o f dat a mining funct ions: Users ma y not always know t he
spec ific k inds o f knowledge t he y want to mine. By int egr at ing OLAP wit h
var ious dat a mining funct ions, mult idimensio nal dat a mining provides user s
wit h t he flexibilit y t o select desired dat a mining funct io ns and swap dat a
mining t asks dyna mica lly.

1.4 Data Warehouse Implementation


Dat a warehouses cont ain huge vo lume s of dat a. OLAP ser ver s demand t hat
decis io n support queries be answer ed in t he order of seconds. Therefore, it is
crucial for dat a warehouse syst ems t o support highly efficient cube co mput at ion
t echniques, access met hods, and quer y processing t echniques.

1. Efficient Data Cube Computation: An Overview


At t he core of mult idimensio nal dat a analysis is t he efficient co mput at ion o f
aggregat io ns across many set s of dimensions. In SQL t er ms, t hese aggregat io ns are
referred t o as group -by’s. Each group-by can be represent ed by a cubo id, where
t he set of group-by’s for ms a lat t ice o f cubo ids defining a dat a cube.
2. The compute cube Operator and the Curse of Dimensionality

One approach t o cube co mput at ion ext ends SQL so as t o include a co mput e cube
operator. The comput e cube operat or comput es aggr egat es over all subset s of t he
dimensio ns specified in t he operat ion. This can requir e excessive st orage space,
especially for large nu mbers o f dimensio ns.

Examp le :
A dat a cube is a lat t ice of cubo ids. Suppose t hat you want to creat e a dat a cube for
AllE lect ronics sales t hat cont ains t he follo wing: cit y, it em, year, and sales in
dollars. You want to be able t o analyze t he dat a, wit h quer ies such as t he
fo llo wing: “Co mput e t he sum o f sales, gr ouping by cit y and it em. ” “Co mput e t he
sum o f sales, grouping by cit y.” “Co mput e t he sum o f sales, grouping by it em.”

What is t he tot al number o f cubo ids, or group -by’s, t hat can be co mput ed for t his
dat a cube? Taking t he t hree at t ribut es, cit y, it em, and year, as t he dimens io ns for
t he dat a cube, and sales in do llar s as t he mea sure, t he tot al number of cubo ids, or
groupby’s, t hat can be co mput ed for t his dat a cube is 23 = 8. T he possible group -
by’s are t he fo llowing: {(cit y, it em, year ), (cit y, it em), (cit y, year), ( it em, year),
(cit y), ( it em), ( year), ()}, where () means t hat t he group-by is empt y ( i.e., t he
dimensio ns ar e not grouped).

The base cubo id co nt ains all t hree dimensio ns, cit y, it em, and year. It can ret urn
t he tot al sales for any co mbinat io n of t he t hree dimensio ns. The apex cubo id, or 0 -
D cubo id, refers t o t he case where t he group -by is empt y. It cont ains t he tot al sum
of all sales. The base cubo id is t he least generalized ( most specific) o f t he
cubo ids. The apex cubo id is t he most generalized ( least specific) of t he cubo ids,
and is oft en denot ed as all. I f we st art at the apex cubo id and explore downward in
t he lat t ice, t his is equivalent t o drilling down wit hin t he dat a cube. I f we st art at
t he base cubo id and explor e upward, t his is akin t o rolling up.
3. Partial Materialization: Selected Computation of Cuboids
There ar e t hree cho ices for dat a cube mat er ializat ion given a base cubo id:
1. No materiali zation:
Do not precomput e any o f t he “nonbase” cubo ids. This leads to comput ing
expensive mult id imensio n al aggregat es on-t he- fly, which can be ext remely slow.
2. Full materi ali zation:
Preco mput e all o f t he cubo ids. The result ing lat t ice o f co mput ed cubo ids is
referred t o as t he full cube. This cho ice t ypically requires huge amount s o f
memor y space in order to store all o f t he preco mput ed cubo ids.
3. Partial mat eriali zation:
Select ively co mput e a proper subset of t he who le set of possible cubo ids.
Alt er nat ively, we may co mput e a subset of t he cube, which cont ains only t hose
cells t hat sat isfy so me user -specified cr it er io n, such as where t he t uple count of
each cell is above so me t hresho ld. We will use t he t erm subcube t o refer t o t he
lat t er case, wher e only so me of t he cells ma y be preco mput ed for var ious
cubo ids. Part ia l mat er ializat ion represent s an int er est ing t rade-off bet wee n
st orage space and response t ime.
4. Indexing OLAP Data: Bitmap Index and Join Index

To facilit at e effic ient dat a accessing, mo st dat a warehouse s yst ems support index
st ruct ures and mat er ialized views.

Bitmap indexing
The bit map indexing met hod is popular in OLAP product s because it allows quick
searching in dat a cubes. T he bit map index is an alt er nat ive represent at ion o f t he
record ID (RID) list . In t he bit map index for a given at tribut e, t here is a dist inct
bit vect or, Bv, for each value v in t he att ribut e’s do ma in. I f a given at t ribut e’s
domain consist s o f n values, t hen n bit s are needed for each ent r y in t he bit map
index ( i.e., t here are n bit vect ors). If t he attr ibut e has t he value v for a given row
in t he dat a t able, t hen t he bit represent ing t hat value is set to 1 in t he
corresponding row of t he bit map index. All ot her bit s for t hat row are set to 0.

Bit map indexing is advant ageous co mpared to hash and t ree indices. It is
especially useful for low - cardina lit y do mains because co mpar iso n, jo in, and
aggregat io n operat ions are t hen reduced to bit ar it hmet ic, which subst ant ially
reduces t he processing t ime. Bit map indexing leads t o significant reduct ions in
space and input /out put (I/O) since a st r ing o f charact ers can be represent ed by a
single bit . For higher-cardinalit y do mains, t he met hod can be adapt ed using
co mpressio n t echniques.

Join indexing
The jo in indexing met hod gained popular it y fro m it s use in relat io nal dat abase
quer y processing. Tradit io nal indexin g maps t he value in a given co lumn t o a list
of rows having t hat value. In co nt rast , jo in indexing regist er s t he jo inable rows o f
t wo relat ions fro m a relat io nal dat abase.
Jo in indexing maint ains r elat ionships bet ween att ribut e values o f a dimensio n
(e.g., wit hin a dimensio n t able) and t he corresponding rows in t he fact t able. Jo in
indices may span mult iple dimens io ns t o form co mposit e jo in indices.

Bitmapped join indices


To furt her speed up quer y processing, t he jo in indexing and t he bit map indexing
met hods can be int egrat ed to for m bit mapped jo in indices.

1.5 ROLAP vs MOLAP vs HOLAP


Logically, OLAP ser vers present busine ss users wit h mu lt idimensio nal dat a
fro m dat a warehouses or dat a mart s, wit hout concer ns regarding how or wher e
t he dat a are st ored. However, t he phys ical ar chit ect ure and imple ment at ion o f
OLAP ser vers must consider dat a st orage issues. es. I mplement at io ns o f a
warehouse ser ver for OLAP processing include t he fo llo wing:
a) Relational OLAP (RO LAP) servers:
 These are t he int er mediat e ser ver s t hat stand in bet ween a relat io nal back -
end ser ver and client front -end t ools.
 They use a relat io nal or ext ended -relat ional DBMS t o store and manage
warehouse dat a, and OLAP middleware t o support missing pieces.
 ROLAP ser vers include opt imizat ion for each DBMS back end,
imple ment at ion o f aggregat io n navigat io n logic, and addit io nal t ools and
ser vices.
 ROLAP t echno logy t ends to have great er scalabilit y t han MOLAP
t echno logy.
b) Multidimen sional O LAP (MOLAP) servers:
 These ser vers support mult idimensio nal dat a views t hrough array - based
mult id imensio nal st orage engines. They map mult id imensio na l views
direct ly t o dat a cube array st ruct ures.
 The advant age o f using a dat a cube is t hat it allows fast indexing t o
preco mput ed summar ized dat a.
 Many MOLAP ser ver s adopt a t wo -level st orage represent at ion t o handle
dense and spar se dat a set s: Denser subcubes are ident ified and st ored as
array st ruct ures, whereas sparse subcubes emplo y co mpressio n t echno log y
for efficient st orage ut ilizat ion.
c) Hyb rid OLAP (HOLAP) servers:
 The hybr id OLAP approach co mbines ROLAP and MOLAP t echno logy,
benefit ing fro m t he great er scalabilit y o f ROLAP and t he fast er
co mput at io n of MOLAP.
 For example, a HOLAP ser ver may a llow large vo lumes o f det ailed dat a t o
be st ored in a relat io nal dat abase, while aggregat ions are kept in a separat e
MOLAP st ore.

You might also like