DWDM
DWDM
Data Mining
Dat a mining can be viewed as a result of t he nat ural evo lut io n of infor mat io n
t echno logy. The dat abase and dat a management indust r y evo lved in t he
development o fseveral cr it ical funct ionalit ies : data col lection and database
creation, data management ( inc luding data storage and r et r ieval and
database t ransaction processing ), and advanced data analysi s ( invo lving
data warehousing and data mining ).
E ffic ient met hods for online t ransact io n processing (OLTP), where a quer y is
viewed as a read-only t ransact io n, cont ribut ed subst ant ially t o t he evo lut io n
and wide accept ance of relat ional t echnology as a major t ool for effic ient
st orage, ret rieval, and management of large amount s of dat a.
Aft er t he est ablishment of dat abase management syst ems, dat abase
t echno logy mo ved t oward t he development of advanced dat abase syst ems,
dat a warehousing, and dat a mining for advanced dat a analys is and web - based
dat abases.
Advanced dat a analys is sprang up fr om t he lat e 1980s onward. This
t echno logy provides a great boost to t he dat abase and infor mat io n indust r y,
and it enables a huge number of dat abases and infor mat io n reposit or ies t o be
available for t ransact ion management , infor mat io n ret rieval, and dat a
analys is.
One emerging dat a reposit ory archit ect ure is t he data warehou se. This is a
reposit ory o f mu lt iple het erogeneous dat a sources organized under a unified
sche ma at a single sit e t o facilit at e management decis io n making. Dat a
warehouse t echno logy in cludes dat a cleaning, dat a int egrat io n, and online
analyt ical processing (OLAP) —t hat is, analys is t echniques wit h
funct io nalit ies such as summar izat ion, conso lidat io n, and aggregat ion, as
well as t he abilit y t o view infor mat io n from different angles. Alt ho ugh OLAP
tools support mult idimensio nal ana lysis and decis io n making, addit io nal dat a
analys is t ools are required for in -dept h analys is — for examp le, dat a mining
tools t hat provide dat a classificat ion, clust er ing, out lier/ ano maly det ect ion,
and t he char act er izat ion o f changes in dat a over t ime.
In summar y, t he abundance of dat a, coupled wit h t he need for power ful dat a
analys is t ools, has been descr ibed as a dat a rich but infor mat io n poor sit uat io n
(Figure 1.2). The fast -growing, tremendo us amount of dat a, collect ed and st ored
in large and numerous dat a reposit ories, has far exceeded our human abilit y for
co mprehensio n wit hout power ful t ools. As a result , dat a co llect ed in large dat a
reposit ories beco me “dat a tombs” —dat a archives t hat are seldo m visit ed.
Consequent ly, import ant decisio ns are o ft en made based not on t he infor mat ion -
r ich dat a st ored in dat a reposit ories but rat her on a decis io n maker ’s int uit io n,
simply because t he decis io n maker does not have t he t ools t o ext ract t he
valuable knowledge embedde d in t he vast amount s of dat a. Effort s have bee n
made t o develop expert syst em and knowledge - based t echno logies, which
t ypically rely o n users or domain exper t s to manually input knowledge int o
knowledge bases. Unfort unat ely, however, t he manual knowledge input
procedure is prone t o biases and errors and is ext remely cost ly and t ime
consu ming. The widening gap bet ween dat a and infor mat io n calls for t he
syst emat ic development of dat a mining tools t hat can t urn dat a t ombs int o
“go lden nugget s” o f knowledge.
St eps 1 t hrough 4 are different for ms of dat a preprocessing, where dat a are
prepared for mining. The dat a mining st ep may int er act wit h t he user or a
know ledge base. The int erest ing pat t erns are present ed to t he user and ma y be
st ored as new knowledge in t he knowledge base .
3. What Kinds of Data Can Be Mined?
The mo st basic for ms o f dat a for mining applicat io ns are dat abase dat a , dat a
warehouse dat a, and t ransact io nal dat a. Dat a mining can also be applied to ot her
for ms o f dat a (e.g., dat a st reams, ordered/sequence dat a, graph or net worked
dat a, spat ial dat a, t ext dat a, mult imedia dat a, and t he WWW).
Database Data
A dat abase s yst em, also called a dat abase management syst em (DBMS),
consist s of a co llect ion of int errelat ed dat a, known as a dat abase, and a set of
so ft ware programs t o manage and access t he dat a. The soft ware programs
provide mechanisms for defining dat abase st ruct ures and dat a st orage; for
spec ifying and managing concurrent , shared, or dist ribut ed dat a access; and
for ensur ing consist ency and secur it y o f the infor mat io n st ored despit e syst e m
crashes or att empt s at unaut hor ized access.
A relat io nal dat abase is a co llect ion o f t ables, eac h of which is assigned a
unique name. Each t able consist s o f a set of at t ribut es (co lumns or fie lds) and
usually st ores a large set of t uples (records or rows). Each t uple in a
relat ional t able represent s an o bject ident ified by a unique key and descr ibe d
by a set of at t ribut e values. A se mant ic dat a model, such as an ent it y -
relat ionship (ER) dat a model, is o ft en const ruct ed for relat io nal dat abases. An
ER dat a model represent s t he dat abase as a set of ent it ies and t heir
relat ionships.
Examp le :
A relat io nal dat abase for AllE lect ronics. The fict it ious AllE lect ronics st ore is
used t o illust rat e concept s t hroughout t his book. The co mpany is descr ibed b y
t he fo llo wing relat io n t ables: cust o mer, it em, emplo yee, and branch. T he
header s o f t he t ables descr ibed here. ( A header is also called t he schema o f a
relat ion.)
cust omer (cust ID, name, address, age, occupat ion, annua l inco me, credit
infor mat io n, cat egory, . . .)
it e m (it em ID, brand, cat egory, t ype, pr ice, place made, supplier, cost ,
. . .)
emplo yee (empl ID, name, cat egor y, group, salar y, co mmissio n, . . .)
branch (branch ID, name, addr ess, . . .)
purchases (t rans ID, cust ID, empl I D, dat e, t ime, met hod paid, amount )
it e ms_so ld (t rans ID, it em ID, qt y) works at (empl ID, branch ID)
Relat io nal dat a can be accessed by dat abase quer ies wr it t en in a relat io na l
quer y language (e.g., SQL) or wit h t he assist ance of graphical user int er faces.
A given quer y is t ransfor med int o a set of r elat ional operat io ns, such as jo in,
select ion, and pro ject ion, and is t hen opt imized for effic ient processing.
When mining relat io nal dat abases, we can go furt her by searching for t rends
or dat a patt erns. For examp le, dat a mining may discover t hat t here has been a
change in packaging o f an it em or a significant increase in pr ice.
Data Warehouses
A dat a warehouse is a reposit ory o f informat io n co llect ed fro m mult iple
sources, st ored under a unified schema, and usually residing at a single sit e.
Dat a warehouses are const ruct ed via a process o f dat a cleaning, dat a
int egrat ion, dat a t ransfor mat io n, dat a loading, and per iodic dat a refr eshing.
To facilit at e decisio n making, t he dat a in a dat a warehouse are organized
around major subject s (e.g., cust omer, it em, supplier, and act ivit y). The
dat a are stored to provide infor mat io n from a hist or ica l per spect ive, such as
in t he past 6 to 12 mont hs, and are t ypically summar ized.
A dat a warehouse is usually modeled by a mult idimensio nal dat a st ruct ure,
called a dat a cube, in which each dimension correspon ds to an att ribut e or a
set of att ribut es in t he schema, and each cell st ores t he value of so me
aggregat e measur e such as count or sum( sales amount ). A dat a cube provides
a mult idimensio nal view o f dat a and all ows t he preco mput at ion and fast
access of summar ized dat a.
Examp le:
A dat a cube for AllE lect ronics. A dat a cube for summar ized sa les dat a of
AllE lect ronics is present ed in Figure 1.7(a). The cube has t hree dimensio ns:
address (wit h cit y va lues Chicago, New York, Toronto, Vancouver), t ime (wit h
quart er values Q1, Q2, Q3, Q4), and it em(wit h it em t ype values ho me
ent ert ainment , co mput er, pho ne, secur it y) . The aggregat e value st ored in eac h
cell o f t he cube issales amount ( in t housands).
Examples o f OLAP operat ions include dr ill -down and roll-up, which allo w t he
user t o view t he dat a at differ ing degrees of summar izat ion. For inst ance, we
can dr ill down on sales dat a summar ized by quart er to see dat a summar ized b y
mo nt h. S imilar ly, we can ro ll up o n sa les dat a summar ized by cit y t o view
dat a summar ized by count r y.
Transaction Data
In general, each record in a t ransact ional dat abase capt ures a t ransact ion, suc h
as a cust omer ’s purchase, a flight booking, or a user ’s clicks o n a web page. A
t ransact ion t ypically includes a unique t ransact ion ident it y number (t rans ID)
and a list of t he it e ms making up t he t ransact io n, such as t he it ems pur chased
in t he t ransact ion.
Examp le:
A t ransact ional dat abase for AllE lect ronics. Transact io ns can be st ored in a
t able, wit h one record per t ransact ion. A fragment o f a t r ansact io nal dat abase
for AllE lect ronics is shown in Figure 1.8. From t he relat io nal dat abase point
of view, t he sales t able in t he figure is a nest ed relat ion because t he att ribut e
list of it em I Ds cont ains a set of it ems. As an analyst of AllE lect ronics, you
ma y ask, “Which it ems so ld well t oget her?” This kind o f market basket dat a
analys is would enable you t o bundle groups of it ems t oget her as a st rat egy for
boost ing sales. For example, given t he knowledge t hat print ers are co mmo nl y
purchased t oget her wit h co mput ers, you could offer cert ain pr int ers at a st eep
discount (or even for free) to cust omers buying select ed co mput ers, in t he
hopes o f selling more co mput ers (which are o ft en more expensive t ha n
pr int ers).
These dat abases are based on object orient ed programming paradigm wher e in
general t er ms each ent it y is considered as an object .
Each object has associat ed wit h t he fo llowing:
i. A set of var iables t hat descr ibe t he object s. These correspond t o attr ibut es
in t he ER mod el.
ii. A set of messages t hat t he object can use t o communicat e wit h ot her
object s.
iii. A set of met hods where each met hod holds t he code t o implement message.
Spatial Databases
Spat ial dat abases cont ain spat ial relat ed infor mat io n such as dat abases
inc lude geographic dat abases, VLSI chin des ign dat abases, medical and
sat ellit e image dat abases.
DM may uncover pat t erns descr ibing t he charact er ist ics o f houses locat ed
near a specific kind o f locat ion such as a park.
Dat a mining funct ionalit ies are used to specify t he kinds of pat t erns to be found
in dat a mining t asks. In general, such t asks can be classified int o t wo
cat egories:
Examp le:
For example, A cust omer relat io nship manager at AllE lect ronics may order t he
fo llo wing dat a mining t ask: Summar ize t he charact er ist ics of cust omers who
spend mor e t han $5000 a year at AllE lect ronics .
The result could be a genera l pro file o f t hese cust omers, such as t hat t hey ar e
40 to 50 years o ld, emplo yed, and have excellent credit rat ings. The dat a
mining syst em should allo w t he cust omer relat io nship manager to drill down
on any dimensio n, such as on occupat ion t o view t hese cust omer s according t o
t heir t ype of e mplo yment .
Data Discrimination
Dat a discr iminat io n is a co mpar iso n of t he general feat ures o f t he t arget class
dat a object s against t he general feat ures o f object s fro m o ne or mult ip le
cont rast ing classes.T
he for ms o f out put present at ion are similar t o t hose for charact er ist ic
descr ipt io ns, alt hough discr iminat io n descr ipt io ns should include co mparat ive
measures t hat help t o dist inguish bet ween t he t arget and cont rast ing classes.
Discr iminat io n descr ipt ions expressed in t he for m of ru les are referred t o as
discr iminant rules.
Example:
A cust omer relat ionship manager at AllE lect ronics may want to compare t wo
groups of cust omer s—t hose who shop for comput er product s regular ly ( e.g.,
more t han t wice a mo nt h) and t hose who rarely shop for such product s (e.g.,
less t han t hree t imes a year).
The result ing descr ipt ion provides a general co mparat ive pro file o f t hese
cust omers, such as t hat 80% of t he cust omer s who frequent ly pur chase
co mput er product s are bet ween 20 and 40 years o ld and have a univer sit y
educat io n, whereas 60% o f t he cust omers who infr equent ly buy such product s
are eit her senior s or yout hs, and have no universit y degree. Dr illing down o n
a dimensio n like occupat io n, or adding a new dimens io n like inco me leve l,
ma y help t o find even more discr iminat ive feat ures bet ween t he t wo classes.
Mining Frequent Patterns, Associations, and Correlations
Frequent pat t erns, as t he name suggest s, are pat t erns t hat occur frequent ly in
dat a. There are many kinds o f frequent patt erns, including frequent it emset s,
frequent subsequences (also known as sequent ial pat t erns), and frequent
subst ruct ures.
A frequent it emset t ypically refer s t o a set of it ems t hat oft en appear
toget her in a t ransact ional dat a set .— for example, milk and bread, which ar e
frequent ly bought t oget her in grocer y st ores by many cust omers.
A frequent ly occurr ing subsequence, such as t he pat t e rn t hat customer s, t end
to purchase first a lapt op, fo llowed by a digit al camera, and t hen a me mor y
card, is a ( frequent ) sequent ial pat t ern.
A subst ruct ure can refer t o different st ruct ural for ms (e.g., graphs, t rees, or
lat t ices) t hat may be co mbined wit h it emset s or subsequences. If a
subst ruct ure occurs frequent ly, it is called a (frequent ) st ruct ured patt ern.
Exa mple:
Associat ion ana lysis. Suppose t hat , as a market ing manager at AllE lect ronics,
you want to know which it ems ar e frequent ly purchased t oget her ( i.e., wit hin
t he same t ransact io n). An examp le of such a rule, mined fro m t he
AllE lect ronics t ransact io nal dat abase, is
buys(X, “computer”) ⇒ buys(X, “software”) [ support = 1%,confidence = 50%] ,
where X is a var iable represent ing a cust omer. A confidence, or cert aint y, o f
50% means t hat if a cust o mer buys a co mput er, t here is a 50% chance t hat she
will buy so ft ware as well. A 1% support means t hat 1% of all t he t ransact io ns
under analys is show t hat comput er and soft ware are pur chased t oget her. This
associat ion rule invo lves a single at t ribut e or predicat e (i. e., buys) t hat
repeat s. Associat io n rules t hat cont ain a single predicat e are referred t o as
single-dimens io nal associat ion rules. Dropping t he pre dicat e not at io n, t he rule
can be wr it t en simply as “co mput er ⇒ soft ware [1%, 50%].” Suppose, inst ead,
t hat we are given t he AllE lect ronics relat io nal dat abase relat ed to purchases.
A dat a mining syst em may find associat ion rules like
age(X, “20..29 ”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”)
[ support = 2%, confidence = 60%]
The rule indicat es t hat of t he AllE lect ronics cust omer s under st udy, 2% are 20
to 29 years o ld wit h an inco me o f $40,000 t o $49,000 and have purch ased a
lapt op (comput er) at AllE lect ronics. There is a 60% probabilit y t hat a
cust omer in t his age and inco me group will pur chase a lapt op. Not e t hat t his
is an associat ion invo lving more t han one at tribut e or predicat e ( i.e., age,
inco me, and buys). Adopt ing t he t er mino logy used in mult id imensio na l
dat abases, where each at tr ibut e is referred to as a dimens io n, t he above rule
can be referred t o as a mult idimensio nal associat io n rule.
Typically, associat ion rules are discarded as unint er est ing if t hey do no t
sat isfy bot h a minimum support t hresho ld and a minimum co nfidence
t hresho ld. Addit io nal analys is can be per for med t o uncover int erest ing
st at ist ical correlat io ns bet ween associat ed att ribut e –value pair s.
Unlike classificat io n and regr essio n, which analyze class - labeled (t raining)
dat a set s, clust er ing analyzes dat a object s wit hout consult ing class labels. In
many cases, class labeled dat a may simp ly not exist at t he beginning.
Clust er ing can be used t o generat e class labels for a group of dat a. The
object s are clust ered or grouped based on t he pr inc iple o f maximizing t he
int raclass similar it y and minimizing t he int erclass similar it y.
Outlier Analysis
A dat a set may cont ain object s t hat do not comply wit h t he general behavior
or model o f t he dat a. These dat a object s are out lier s.
Many dat a mining met hods discard out liers as no ise or except io ns. However,
in so me applicat ions (e.g., fr aud det ect io n) t he rare event s can be mor e
int erest ing t han t he mor e regular ly occurring ones.
The analys is o f out lier dat a is referred t o as out lier analys is or ano mal y
mining.
Examp le:
Out lier analys is may uncover fraudulent usage of credit cards by det ect ing
purchases o f unusually large amount s for a given account number i n
co mpar iso n to regular charges incurred by t he same account . Out lier values
ma y also be det ect ed wit h respect to t he locat io ns and t ypes of purchase, or
t he purchase frequency.
Ot her object ive int erest ingness measures include accuracy and coverage for
classificat ion (IF-THEN) rules. In general t er ms, accuracy t ells us t he
percent age of dat a t hat are correct ly classified by a rule. Cover age is similar
to support , in t hat it t ells us t he percent age of dat a to which a rule applies.
Alt hough object ive measures help ident ify int erest ing pat t erns, t hey are o ft en
insufficient unless co mbined wit h subject ive measures t hat reflect a part icular
user ’s needs and int erest s. For example, pat t erns descr ibing t he
charact er ist ics o f cust omers who shop fr equent ly at AllE lect ronics should be
int erest ing t o t he market ing manager, but ma y be o f lit t le int erest to ot her
analyst s st udying t he same dat abase for patt erns on emplo yee per for mance.
Subject ive int erest ingness measures are based on user beliefs in t he dat a.
These measures find pat t erns int erest ing if t he pat t erns ar e unexpect ed
(cont radict ing a user ’s belief) or offer st rat egic infor mat io n on which t h e user
can act . In t he lat t er case, such pat t erns are referred t o as act ionable. For
example, pat t erns like “a large eart hquake oft en fo llows a clust er of sma l l
quakes” may be highly act io nable if user s can act on t he infor mat io n t o save
lives. Pat t erns t hat are expect ed can be int er est ing if t hey confir m a
hypot hesis t hat t he user wishes t o validat e or t hey r esemble a user ’s hunch.
The seco nd quest ion— “Can a dat a mining syst em generat e all o f t he
int erest ing pat t erns?” — refer s t o t he co mplet eness o f a dat a mining algor it hm.
It is o ft en unrealist ic and inefficient for dat a mining syst ems t o generat e al l
possible pat t erns. Fina lly, t he t hird quest ion — “Can a dat a mining s yst e m
generat e only int erest ing pat t erns?” — is an opt imizat io n problem in dat a
mining. It is highly des ir able for dat a mining syst ems t o generat e onl y
int erest ing pat t erns.
St at ist ics st udies t he co llect io n, analys is, int erpret at ion or explanat ion, and
present at ion o f dat a. Dat a mining has an inherent connect ion wit h st at ist ics. A
st at ist ical model is a set of mat he mat ical funct io ns t hat descr ibe t he behavior
of t he object s in a t arget class in t erms o f rando m var iables and t heir
associat ed probabilit y dist r ibut ions. St at ist ical models are widely used to
model dat a and dat a classes. For example, in dat a mining t asks like dat a
charact er izat io n and classificat ion, st at ist ical mo dels o f t arget classes can be
built . I n ot her words, such st at ist ical models can be t he out come o f a dat a
mining t ask.
Machine Learning
Machine lear ning invest igat es how co mput ers can lear n (or impro ve t heir
per for mance) based on dat a. A main research area is for comput er programs t o
aut omat ically lear n to recognize co mplex pat t erns and make int elligent
decis io ns based on dat a. For examp le, a t ypical machine lear ning proble m is
to program a co mput er so t hat it can automat ically recognize handwr it t en
post al codes on mail aft er lear ning fro m a set of examples. Machine lear ning
is a fast -growing discipline. Here, we illust rat e classic proble ms in machine
lear ning t hat are highly relat ed t o dat a mining.
Supervised learnin g
Super vised lear ning is basically a syno nym for classificat io n. The super vis io n
in t he lear ning co mes fro m t he labeled examples in t he t raining dat a set . For
example, in t he post al code recognit io n problem, a set of handwr it t en post a l
code images and t heir corresponding machine -readable t ranslat io ns are used
as t he t raining examp les, which super vise t he lear ning o f t he classificat io n
model.
Unsupervised learnin g
Unsuper vised lear ning is essent ially a synonym for clust er ing. T he lear ning
process is unsuper vised since t he input examples are not class labeled.
Typically, we ma y use clust er ing t o discover classes wit hin t he dat a. For
example, an unsuper vised lear ning met hod can t ake, as i nput , a set of image s
of handwr it t en digit s. Suppose t hat it finds 10 clust ers o f dat a. These clust er s
ma y correspond t o t he 10 dist inct digit s of 0 t o 9, respect ively. However,
since t he t raining dat a are not labeled, the lear ned model cannot t ell us t he
semant ic meaning o f t he clust ers found .
Semi- supervi sed learning
Semi- super vised lear ning is a class of machine lear ning t echniques t hat make
use o f bot h labeled and unlabeled examples when lear ning a model. I n one
approach, labeled examples ar e used t o lear n class models and unlabeled
examples are used t o refine t he boundar ies bet ween classes. For a t wo -clas s
problem, we can t hink o f t he set of examples belo nging t o one class as t he
posit ive examples and t hose belo ng ing to t he ot her class as t he negat iv e
examples.
Active learning
Act ive lear ning is a machine learning approach t hat let s user s play an act ive
role in t he lear ning process. An act ive learning approach can ask a user (e.g.,
a domain expert ) to label an example, which ma y be fro m a set of u nlabe led
examples or synt hes ized by t he lear ning program. The goal is t o opt imize t he
model qualit y by act ively acquir ing knowledge fro m human user s, given a
const raint on how many examples t he y can be asked t o label.
Dat abase syst ems research fo cuses on t he creat io n, ma int enance, and use o f
dat abases for organizat io ns and end -users. Part icular ly, dat abase syst ems
resear cher s have est ablished highly r ecognized pr incip les in dat a models,
quer y languages, quer y processing and opt imizat io n met hods, dat a st orage,
and indexing and accessing met hods. Dat abase syst ems are o ft en well known
for t heir high scalabilit y in processing ver y large, relat ively st ruct ured dat a
set s. Recent dat abase syst ems have built syst emat ic dat a an alysis capabilit ies
on dat abase dat a using dat a warehousing and dat a mining facilit ies.
A dat a warehouse int egrat es dat a originat ing fro m mu lt iple sources and
var ious t imefr ames. It conso lidat es dat a in mult idimens io nal space t o for m
part ially mat er ialized dat a cubes. The dat a cube model not only facilit at es
OLAP in mult id imensio nal dat abases but also promot es mult id imensio nal dat a
mining.
Information Retrieval
Infor mat io n ret r ieval (IR) is t he science o f searching for document s or
infor mat io n in document s. Document s can be t ext or mult imedia, and ma y
reside on t he Web. The differences bet ween t radit ional infor mat io n ret rieval
and dat abase syst ems are t wofo ld: In for mat io n ret r ieval assumes t hat
(1) t he dat a under search are unst ruct ured
(2) t he quer ies are for med mainly by keywords, which do not have co mplex
st ruct ures.
The t ypical approaches in infor mat io n ret rieval adopt probabilist ic models:
( i) language mode l
( ii) t opic model.
Increas ingly large amount s o f t ext and mult imedia dat a have bee n
accumulat ed and made available online due t o t he fast growt h o f t he Web and
applicat io ns such as d igit al librar ies, digit al gover nment s, and healt h car e
infor mat io n syst ems. Their effect ive search and analys is have raised man y
challenging issues in dat a mining. Therefore, t ext mining and mult imedia dat a
mining, int egrat ed wit h infor mat io n ret rieval met hods, have beco me
incr easingly import ant .
As a highly applicat ion-dr iven disc ipline, dat a mining has seen great
successes in many applicat io ns. It is impossible t o enumer at e all applicat io ns
where dat a mining plays a cr it ical ro le. Present at ions o f dat a mining in
knowledge- int ensive applicat ion do mains, such as bio infor mat ics and
so ft ware engineer ing .
Business Intelligence
Business int elligence (BI) t echno logies provide hist or ical, current , and
predict ive views o f business operat io ns. Examp les include report ing, online
analyt ica l processing, business per for m.
dat a mining is t he core of business int elligence. Online analyt ical processing
tools in business int elligence rely on dat a warehousing and mult idimens io na l
dat a mining. Classificat io n and predict io n t echniques are t he core o f
predict ive analyt ics in business int elli gence, for which t her e are many
applicat io ns in analyz ing market s, supplie s, and sales.
Using charact er izat io n mining t echniques, we can bet t er under st and feat ures
of each cust omer group and develo p cust omized cus t omer reward programs.
A Web search engine is a specia lized comput er ser ver t hat searches for
infor mat io n on t he Web. The search result s of a user quer y are oft en ret urned
as a list (so met imes ca lled hit s). The hit s may consist of web pages, images,
and ot her t ypes o f files. So me search engines also search and ret urn dat a
available in public dat abases or open dir ect ories. Search engines differ fro m
web direct ories in t hat web direct ories are maint ained by human edit ors
whereas search engines operat e algor it hmically or by a mixt ure of algor it hmic
and human input .
Var ious dat a mining t echniques are used in all aspect s of search engines,
rang ing fro m crawling (e.g., decid ing which pages should be crawled and t he
crawling frequencies), index ing (e.g., select ing pages t o be indexed and
deciding t o which ext ent t he index sho uld be const ruct ed), and searching
(e.g., deciding how pages should be ranked, which advert ise ment s should be
added, and how t he search r esult s can be personalized or made “ cont ext
aware”).
Search engines pose grand challenges t o dat a mining.
i. The y have t o handle a huge and ever -gr owing amount of dat a.
ii. Web search engines oft en have t o deal wit h online dat a. A sear ch engine
ma y be able t o afford const ruct ing a model o ffline on huge dat a set s.
iii. Web search engines o ft en have t o deal wit h quer ies t hat are asked only a
ver y small number of t imes.
Dat a mining is a dyna mic and fast -expanding field wit h great st rengt hs. majo r
issues in dat a mining are part it ioned int o five groups: mining met hodology,
user int eract ion, efficiency and scalabilit y, divers it y o f dat a t ypes, and dat a
mining and societ y.
Mining Methodology
Researchers have been vigorously developing new dat a mining met hodo logies.
This invo lves t he invest igat ion o f new kinds o f knowledge, mining in
mult id imensio nal space, int egrat ing met hods fro m ot her disciplines, and t he
considerat ion o f semant ic t ies amo ng dat a object s. In addit io n, mining
met hodologies should consider issues such as dat a uncert aint y, no ise, and
inco mplet eness.
Mining vari ous and ne w kinds of knowledge: Dat a mining covers a wide
spect rum o f dat a analys is and knowledge discover y t asks, fro m dat a
charact er izat io n and discr iminat io n t o associat ion and correlat ion
analys is, classificat ion, regressio n, clust er ing, out lier analysis, sequence
analys is, and t rend and evo lut ion ana lysis .
Mining kno wledge in multidimensional space: We can search for
int erest ing pat t erns amo ng co mbinat io ns of dimensio ns (at t ribut es) at
var ying levels o f abst ract ion. Such mining is known as (explorat ory)
mult id imensio nal dat a mining. I n many cases, dat a can be aggregat ed or
viewed as a mult idimens io nal dat a cube.
Data mining —an interdi sci plinary ef f ort: The power of dat a mining can be
subst ant ially enhanced by int egrat ing new met hods fro m mult iple
disciplines. For examp le, t o mine dat a wit h nat ural language t ext , it
makes sense t o fuse dat a mining met hods wit h met hods o f infor mat io n
ret rieval and nat ural language processing .
Handling uncertaint y, noise, or incompleteness of data : Dat a oft en
cont ain no ise, errors, except ions, or uncer t aint y, or are inco mplet e. Errors
and no ise may co nfuse t he dat a mining process, leading t o t he der ivat io n
of erroneous pat t erns. Dat a cleaning, dat a preprocessing, out lier det ect io n
and remo val, and uncert aint y reasoning are exa mples o f t echniques t hat
need t o be int egrat ed wit h t he dat a mining process.
Pattern evaluation and pattern - or const raint-guided mi ning: What makes
a patt ern int erest ing ma y var y fro m user to user. Therefore, t echniques are
needed t o assess t he int erest ingness o f discovered pat t erns based on
subject ive measures. These est imat e t he value o f pat t erns wit h respect t o a
g iven user class, based on user beliefs or expect at io ns. Moreover, by
using int er est ingness measures or user -specified co nst raint s t o guide t he
disco ver y process, we may generat e more int erest ing pat t erns and reduce
t he search space.
User Interaction
The user p lays an import ant role in t he dat a mining process. Int erest ing ar eas
of research include how to int eract wit h a dat a mining syst em, how to
incorporat e a user ’s background knowledge in mining, and how to visualize
and co mprehend dat a mining result s.
Interactive mi ning: The dat a mining process should be highly int eract ive.
Thus, it is import ant to build flexible user int er faces and an explorat or y
mining environment , facilit at ing t he user ’s int eract io n wit h t he syst em.
Int eract ive mining should allo w us ers t o dynamically change t he focus o f
a search, t o refine mining request s based on ret urned r esult s, and t o drill,
dice, and pivot t hrough t he dat a and knowledge space int eract ively,
dynamically explor ing “cube space” while mining.
Incorporation of background knowl edge: Background knowledge,
const raint s, rules, and ot her infor mat io n regarding t he do main under st ud y
should be incorporat ed int o t he knowledge discover y process. Such
knowledge can be used for patt ern evaluat ion as well as t o guide t he
search t oward int erest ing pat t erns.
Ad hoc data mining and data mining query languages: Quer y languages
(e.g., SQL) have played an import ant role in flexible sear ching because
t hey allo w users t o pose ad hoc quer ies. S imilar ly, high - level dat a mining
quer y languages or ot her high- level flexible user int er faces will give user s
t he freedo m t o define ad hoc dat a mining t asks.
Social impact s of data mining : Wit h dat a mining penet rat ing our ever yday
lives, it is import ant to st udy t he impact of dat a mining on societ y. How
can we use dat a mining t echno logy t o benefit societ y? How can we guard
against it s misuse? The improper disclosure or use of dat a and t he
pot ent ial vio lat io n o f individual pr ivacy and dat a prot ect io n r ight s ar e
areas o f concer n t hat need t o be addressed.
Pri vacy-preserving data mining: Dat a mining will help scient ific
disco ver y, business management , econo my recover y, and secur it y
prot ect ion (e.g., t he real-t ime discover y of int ruder s and cyberat t acks).
However, it poses t he r isk o f disclo sing an individual’s personal
infor mat io n. St udies on pr ivacy-preser ving dat a publishing and dat a
mining are ongo ing. The philosophy is to obser ve dat a sens it ivit y and
preser ve people’s pr ivacy while per for ming successful dat a mining.
Invisibl e data mi ning: I nt elligent search engines and I nt ernet -based st ores
per for m such invis ible dat a mining by incorporat ing dat a mining int o t heir
co mponent s t o improve t heir funct io nalit y and per for mance. This is done
oft en unbeknownst t o t he user .
1.1
1. What Is a Data Warehouse?
A dat a warehouse is a subject -orient ed, int egrat ed, t ime - var iant , and nonvo lat ile
co llect io n of dat a in support of manageme nt ’s decis io n making process.
A dat a warehouse refer s t o a dat a reposit ory t hat is maint ained separat ely fro m
an organizat ion’s operat io nal dat abases. Dat a warehouse syst ems a llow for
int egrat ion o f a var iet y o f applicat ion syst ems. T he y support infor mat io n
processing by pro viding a so lid plat for m o f conso lidat ed hist or ic dat a for
analys is.
Subject-oriented : A dat a warehouse is organized around major subject s such
as cust omer, supplier, product , and sales. Rat her t han concent rat ing on t he
day-t o -day operat ions and t ransact ion pr ocessing o f an organizat ion, a dat a
warehouse focuses on t he modeling and analysis o f dat a for decis io n makers.
Hence, dat a warehouses t ypically provide a simp le and concise view of
part icular subject issues by excluding dat a t hat are not useful in t he dec isio n
support process.
Integrated: A dat a warehouse is u sually const ruct ed by int egrat ing mult iple
het erogeneous sources, such as relat io nal dat abases, flat files, and online
t ransact ion records. Dat a cleaning and dat a int egrat ion t echniques are
applied t o ensur e consist ency in naming convent io ns, encoding st ru ct ures,
att ribut e measures, and so on.
Tim e-variant: Dat a are st ored t o provide infor mat ion fro m an hist or ic
perspect ive. E xample: The past 5 –10 years. Ever y key st ruct ure in t he dat a
warehouse co nt ains, eit her implicit ly or explic it ly, a t ime element .
Nonvolatil e: A dat a warehouse is alwa ys a phys ically separat e st ore of dat a
t ransfor med fro m t he applicat ion dat a found in t he operat io nal environment .
Due to t his separat io n, a dat a warehouse does not require t ransact io n
processing, recover y, and concurre ncy cont rol mechanis ms. It usuall y
requir es o nly t wo operat ions in dat a accessing: init ial lo ading o f dat a and
access of dat a.
In summar y, a dat a warehouse is a semant ically co nsist ent dat a st ore t hat ser ves
as a physical imple ment at ion o f a decis ion support dat a model. It stores t he
infor mat io n an ent erpr ise needs t o make st rat egic decis io ns. A dat a warehouse
is also oft en viewed as an archit ect ure, const ruct ed by int egr at ing dat a fro m
mult ip le het erogeneous sources t o support st ruct ured and/or ad hoc quer ies,
analyt ical report ing, and decis io n making.
How are organizat io ns using t he infor mat ion fro m dat a warehouses?
Many organizat ions use t his infor mat io n to support bus iness decis io n - making
act ivit ies, including:
Increas ing cust o mer focus, which inc ludes t he analys is o f cust omer buying
pat t erns (such as buying preference, buying t ime, budget cycles, and
appet it es for spending).
Reposit io ning product s and managing pr oduct port fo lio s by co mpar ing t he
per for mance o f sales by quart er, by year, and by geographic reg io ns in order
to fine-t une product io n st rat egies.
Analyzing operat io ns and looking for sources o f pro fit .
Managing cust omer relat io nships, mak ing enviro nment al correct io ns, and
managing t he cost of corporat e asset s.
Dat a warehousing provides an int erest ing alt er nat ive t o t his t radit ional
approach. Rat her t han using a quer y - dr iven approach, dat a warehousing
emplo ys an update-d riven approach in which infor mat ion fro m mu lt iple,
het erogeneous sources is int egrat ed in advance and st ored in a warehouse for
direct quer ying and analys is.
View:
An OLTP syst em fo cuses mainly o n t he current dat a wit hin an ent erpr ise or
depart ment , wit hout referr ing to hist or ic dat a or dat a in different
organizat ions.
In cont rast , an OLAP syst em o ft en spans mult iple ver sio ns of a dat abase
sche ma, due to t he evo lut ionar y process of an organizat ion. OLAP syst ems
also deal wit h infor mat ion t hat originat es fro m different organiza t io ns,
int egrat ing infor mat io n fro m many dat a st ores. Because o f t heir huge vo lume,
OLAP dat a are st ored on mult iple st orage media.
Access patterns:
The access pat t erns of an OLTP syst em consist mainly o f short , ato mic
t ransact ions. Such a syst em requir e s concurrency cont ro l and recover y
mechanis ms.
However, accesses t o OLAP syst ems are most ly r ead -only operat io ns
(because most dat a warehouses st ore hist or ic rat her t han up -t o -dat e
infor mat io n), alt hough many could be co mplex quer ies.
3. Why Have a Separate Data Warehouse?
“Why not perf orm online analyti cal processing directly on such databases
instead of spending additi onal time and resources to const ruct a separate dat a
warehouse?”
i. A major reason for such a separ at ion is t o help pro mot e t he hig h
per for mance o f bot h syst ems. An operat io nal dat abase is designed and
t uned fro m known t asks and workloads like indexing and hashing using
pr imar y keys, sear ching for part icular records, and opt imizing “canned”
quer ies.
ii. On t he ot her hand, dat a warehouse que r ies are o ft en co mplex. The y
invo lve t he co mput at ion o f large dat a groups at summar ized levels, and
ma y requir e t he use o f special dat a organizat io n, access, and
imple ment at ion met hods based on mult idimens io nal views. Processing
OLAP quer ies in operat ional dat abases would subst ant ially degr ade t he
per for mance o f operat ional t asks.
iii. Moreover, an operat ional dat abase support s t he concurrent processing o f
mult ip le t ransact io ns. Co ncurrency cont rol and recover y mechanis ms
(e.g., locking and logging) are requir ed t o ensure t he consist ency and
robust ness of t ransact ions.
iv. An OLAP quer y o ft en needs read -only access of dat a records for
summar izat ion and aggregat ion. Concurrency cont rol and recover y
mechanis ms, if applied for such OLAP operat ions, may jeopardize t he
execut io n of concurrent t ransact ions and t hus subst ant ially reduce t he
t hroughput of an OLTP syst em.
v. Fina lly, t he separat io n of operat io nal dat abases fro m dat a warehouses is
based on t he different st ruct ures, cont ent s, and uses o f t he dat a in t hese
t wo syst ems. Decisio n support requir es hist oric dat a, wher eas operat iona l
dat abases do not t ypically maint ain hist oric dat a. In t his cont ext , t he dat a
in operat ional dat abases, t hough abundant , are usually far fro m co mplet e
for decis io n making.
vi. Decis io n support requires co nso lidat ion (e.g., aggregat io n and
summar izat ion) o f dat a fro m het erogeneous sources, resu lt ing in high -
qualit y, clean, int egrat ed dat a. In co nt rast , operat ional dat abases co nt ain
only det ailed raw dat a, such as t ransact io ns, which need t o be
co nso lidat ed before analys is. Because t he t wo syst ems provide quit e
different funct ionalit ies and requir e differ ent kinds o f dat a, it is present ly
necessar y t o maint ain separat e dat abases.
Data m art:
A dat a mart cont ains a subset of corporate -wide dat a t hat is o f value t o a
spec ific group of user s. The scope is confined t o specific select ed
subject s. Dat a mart s are usually implement ed on low -cost depart ment a l
ser ver s t hat are Unix/Linux or Windows based.
The implement at io n cycle o f a dat a mart is more likely t o be measured in
weeks rat her t han mo nt hs or year s.
Depend ing on t he source of dat a, data mart s can be cat egor ized as
independent or dependent .
Independent dat a mart s are sourced fro m dat a capt ured fro m one or more
operat ional syst ems or ext er nal infor mat io n provider s, or fro m dat a
generat ed locally wit hin a part icular depart ment or geographic area.
Dependent dat a mart s are sourced direct ly fro m ent erpr ise dat a
warehouses.
Virtu al warehou se:
A virt ual warehouse is a set of views over operat ional dat abases. For
effic ient quer y processing, only so me o f t he possible summar y views ma y
be mat er ialized. A virt ual warehouse is easy t o build but requir es excess
capacit y on operat io nal dat abase ser ver s.
“What are the pros and cons of the top - down and bottom-up approaches to
data warehouse devel opment?”
The t op-down development o f an ent erpr ise warehouse ser ves as a
syst emat ic so lut ion and minimizes int egr at io n proble ms. However, it is
expensive, t akes a lo ng t ime t o develo p, and lacks flexibilit y due t o t he
difficult y in achieving consist ency and consensus for a commo n dat a
model for t he ent ire organizat io n.
The bottom-up approach to t he design, development , and deplo yment of
independent dat a mart s provides flexibilit y, low cost , and rapid ret urn of
invest ment . It , however, can le ad to proble ms when int egrat ing var io us
dispar at e dat a mart s int o a consist ent ent erpr ise dat a warehouse.
Data Warehouse Implementation
A met hod for t he development of dat a warehouse syst ems is t o imp lement
t he warehouse in an increment al and evo lut io n ar y manner.
o First , a high- level corporat e dat a model is defined wit hin a reaso nabl y
short period (such as one or t wo mont hs) t hat provides a corporat e -wide,
consist ent , int egrat ed view o f dat a amo ng different subject s and pot ent ial
usages. This high- level model will great ly reduce fut ure int egrat ion
problems.
o Second, independent dat a mart s can be imple ment ed in parallel wit h t he
ent erpr ise warehouse based on t he same corporat e dat a model set not ed
before.
o Third, dist r ibut ed dat a mart s can be co nst ruct ed to int egrat e different
dat a mart s via hub ser ver s. Fina lly, a mult it ier dat a warehouse is
const ruct ed wher e t he ent erpr ise warehouse is t he so le cust odian o f al l
warehouse dat a, which is t hen dist r ibut ed to t he var ious dependent dat a
mart s.
7. Metadata Repository
Met adat a are dat a about dat a. When used in a dat a warehouse, met adat a are
t he dat a t hat define warehouse object s. Met adat a are creat ed for t he dat a
names and definit io ns o f t he given warehouse.
A met adat a reposit ory should cont ain t he fo llo wing:
o A descr ipt ion o f t he dat a warehouse st ruct ure, which includes t he
warehouse schema, view, dimensio ns, hierarchies, and der ived dat a
definit io ns, as well as dat a mart locat io ns and cont ent s.
o Operat ional met adat a, which inc lude dat a lineage (hist or y o f migr at ed
dat a and t he sequence o f t ransfor mat io ns applied t o it ), currency o f dat a
(act ive, archived, or purged), and mo nit or ing infor mat io n (warehouse
usage st at ist ics, error report s, and aud it t rails).
o The algor it hms used for summar izat io n, which include measure an d
dimensio n definit io n algor it hms, dat a on granular it y, part it ions, subject
areas, aggregat io n, summar izat ion, and pr edefined quer ies and report s.
o Mapping fro m t he operat io nal environme nt to t he dat a warehouse, which
inc ludes source dat abases and t heir co nt ent s, gat eway descr ipt io ns, dat a
part it io ns, dat a ext ract ion, cleaning, t ransfor mat ion rules and default s,
dat a refresh and purging rules, and sec ur it y (user aut hor izat ion and access
cont rol).
o Dat a relat ed to syst em per for mance, which include indices and profile s
t hat impro ve dat a access and ret r ieval per for mance, in addit io n t o rules
for t he t iming and scheduling o f refresh, updat e, and replicat io n c ycles.
o Business met adat a, which include business t erms and definit io ns, dat a
ownership infor mat ion, and charging po licies.
Met adat a also ser ve as a guide t o t he algor it hms used for summar izat io n
bet ween t he current det ailed dat a and t he light ly summar i zed dat a, and
bet ween t he light ly summar ized dat a and t he highly summar ized dat a.
Met adat a should be st ored and managed persist ent ly ( i.e., on disk).
1.2
1. Data Cube: A Multidimensional Data Model
Data Cube
A dat a cube allows dat a to be modeled and viewed in mu lt iple dimensio ns. It
is defined by dimens io ns and fact s.
Dimension
Dimens io ns are t he per spect ives or ent it ies wit h respect to which a n
organizat ion want s t o keep records. Each dimensio n may have a t able
associat ed wit h it , called a dimensio n t able, which furt her descr ibes t he
dimensio n.
Fact
Fact s are numer ic measures. They are used t o analyze relat io nships bet wee n
dimensio ns. T he fact t able co nt ains t he names o f t he fact s, or measures, as
well as keys t o each o f t he relat ed dimensio n t ables.
In t his 2-D represent at ion, t he sales for Vancouver are shown wit h r espect to
t he t ime dimensio n (organized in quart ers) and t he it em di mens io n (organized
according to t he t ypes o f it ems so ld). The fact or measure displayed is
dollars so ld ( in t housands).
Now, suppose t hat we would like t o view t he sales dat a wit h a t hir d
dimensio n. For inst ance, suppose we would like t o view t he dat a a ccording t o
t ime and it em, as well as locat ion, for t he cit ies Chicago, New York, Toront o,
and Vancouver.
Suppose t hat we would now like t o view our sales dat a wit h an addit io nal
fourt h dimensio n such as supplier. Viewing t hings in 4 -D beco mes t ricky.
However, we can t hink o f a 4 -D cube as being a ser ies o f 3 -D cubes.
In t he dat a warehousing research lit er at ure, a dat a cube like t hose shown in
Figur es is o ft en referred t o as a cubo id. Given a set of dimensio ns, we ca n
generat e a cubo id for each o f t he possible subset s of t he given dimensio ns.
The result would for m a lat t ice of cuboids, each showing t he dat a at a
different level o f summar izat io n, or group -by. T he lat t ice o f cubo ids is t hen
referred t o as a dat a cube. Figure 4.5 shows a lat t ice o f c ubo ids for ming a
dat a cube for t he dimensio ns t ime, it em, locat ion, and supplier. The cubo id
t hat ho lds t he lowest leve l o f summar izat ion is called t he base cubo id.
The 0-D cubo id, which ho lds t he highest level o f summar izat io n, is ca lled t he
apex cubo id.
Star Schema
Examp le:
A st ar schema for AllE lect ronics sales is shown in F igure 4.6. Sales are
considered alo ng four dimensio ns: t ime, it em, br anch, and locat io n. The
sche ma cont ains a cent ral fact t able for sales t hat cont ains ke ys t o each of
t he four dimensio ns, alo ng wit h t wo measures: do llar s so ld a nd unit s so ld. To
minimize t he size o f t he fact t able, dimensio n ident ifier s (e.g., t ime key and
it e m key) are syst em-generat ed ident ifiers.
Snowflake schema
Example :
A snowflake schema for AllE lect ronics sales is given in Figure 4.7. Here, t he
sales fact t able is ident ical t o t hat of t he st ar schema in Figure 4.6. T he ma in
difference bet ween t he t wo schemas is in t he definit io n o f dimensio n t ables.
The single dimensio n t able for it em in t he st ar schema is nor ma lized in t he
snowflake schema, result ing in new it em and supplier t ables. For example,
t he it em dimens io n t able now cont ains t he at tribut es it em key, it em name,
brand, t ype, and supplier key, where supplier key is linked t o t he supplier
dimensio n t able, cont aining supplier key and supplier t ype infor mat ion.
S imilar ly, t he single dimens io n t able for locat io n in t he st ar schema can be
nor malized int o t wo new t ables: locat ion and cit y. The cit y key in t he new
lo cat ion t able links t o t he cit y dimensio n. Not ice t hat , when desirable,
furt her nor malizat io n can be per for med on province or st at e and count r y in
t he snowflake schema.
Fact constellation:
Sophist icat ed applicat ions may r equire mult iple fact t ables t o shar e
dimensio n t ables.
This kind o f schema can be viewed as a co llect io n of st ars, and hence is
called a galaxy schema or a fact const ellat io n.
Examp le:
A fact const ellat io n schema is shown in Figure 4.8. This schema specifies
t wo fact t ables, sa les and shipp ing. T he sales t able definit io n is ident ical t o
t hat of t he st ar schema (Figure 4.6). The shipping t able has five dimens io ns,
or keys—it em key, t ime key, shipper key, fr o m locat io n, and t o locat ion —and
t wo measures—do llars cost and unit s shipped. A fact const ellat ion schema
allows dimensio n t ables to be shared bet ween fact t ables. For examp le, t he
dimensio ns t ables for t ime, it em, and lo cat io n are shar ed bet ween t he sales
and shipping fact t ables.
A dat a warehouse co llect s infor mat ion about subject s t hat span t he ent ire
organizat ion, such as cust omers, it ems, sales, asset s, and per sonnel, and t hus
it s scope is ent erpr ise -wide. For dat a warehouses, t he fact const ell at io n
sche ma is co mmo nly used, since it can mo del mu lt iple, int errelat ed subject s.
A dat a mart , on t he ot her hand, is a depar t ment subset of t he dat a warehouse
t hat focuses on select ed subject s, and t hus it s scope is depart ment wide. For
dat a mart s, t he st ar or snowflake schema is co mmo nly used, since bot h are
geared t oward modeling single subject s, alt hough t he st ar schema is more
popular and efficient .
3. Concept Hierarchies
A co ncept hierarchy defines a sequence o f mappings fro m a set of low - level
concept s to higher- level, more general concept s.
For example, Vancouver can be mapped to Brit ish Co lumbia, and Chicago
to Illino is. The provinces and st at es can in t ur n be mapped to t he count ry
(e.g., Canada or t he Unit ed St at es) to which t hey belo ng.
These mappings for m a concept hier archy for t he dimensio n locat ion,
mapping a set of low - level concept s ( i.e., cit ies) t o higher - level, more
general co ncept s ( i. e., count ries).
Many co ncept hierarchies ar e imp licit wit hin t he dat abase schema.
Alt er nat ively, t he at t rib ut es o f a dimens ion may be organized in a part ial
order, forming a lat t ice.
A concept hierar chy t hat is a tot al or part ial order among at t ribut es in a
dat abase schema is called a schema hierar chy.
Concept hierarchies may also be defined by discret izing or grouping values
for a given dimensio n or att ribut e, result ing in a set -grouping hierar chy. A
tot al or part ial order can be defined amo ng groups of values.
There ma y be more t han o ne concept hier archy for a given at t ribut e or
dimensio n, based on different u ser viewpo int s.
Concept hierarchies may be provided manually by syst em user s, domain
expert s, or knowledge engineers, or may be aut omat ica lly generat ed based
on st at ist ical analys is o f t he dat a dist r ibut io n.
4. Typical OLAP Operations
OLAP Operations
Example:
At t he cent er of t he figure is a dat a cube for AllE lect ronics sales. The cube
cont ains t he dimensio ns lo cat ion, t ime, and it e m, where locat io n is aggregat ed
wit h respect to cit y values, t ime is aggr egat ed wit h respect to quart ers, and it em
is aggregat ed wit h respect to it em t ypes. To aid in our explanat ion, we refer to
t his cube as t he cent ral cube. The measur e displayed is do llar s so ld ( in
t housands). (For impro ved readabilit y, only so me of t he cubes’ cell values are
shown.) The dat a examined are for t he cit ies Chicago, New York, Toronto, and
Vancouver.
i. Roll-up:
The ro ll-up operat ion (also called t he dr ill -up operat ion by so me vendors)
per for ms aggregat io n on a dat a cube, eit her by climbing up a concept
hier archy for a dimensio n or by dimensio n reduct ion.
When roll- up is per for med by dimens io n reduct io n, one or more dimens io ns
are remo ved fro m t he given cube.
ii. Drill-down:
Dr ill-down is t he rever se o f roll-up. It navigat es fro m less det ailed dat a to
more det ailed dat a. Dr ill-down can be realized by eit her st epping down a
concept hierarchy for a dimensio n or int roducing addit io nal dimensio ns.
Because a dr ill-down adds more det ail t o t he given dat a, it can also be
per for med by adding new dimensio ns t o a cube.
iii. Slice and dice:
The slice operat io n per for ms a select ion on one dimensio n o f t he given cube,
result ing in a subcube.
The dice operat ion defines a subcube by per for ming a select ion on t wo or more
dimensio ns.
To design an effect ive dat a warehouse we need t o underst and and analyze
business needs and co nst ruct a business analysis fr amework. The const ruct io n
of a large and co mplex infor mat ion syst em can be viewed as t he const ruct io n
of a large and co mplex building , for which t he owner, archit ect , and builder
have different views.
These views are co mbined to form a complex framework t hat represent s t he
top-down, business-dr iven, or owner ’s perspect ive, as well as t he botto m -up,
builder-dr iven, or imp lement or ’s view of t he infor mat io n syst em.
Four different views regard ing a dat a warehouse design must be co nsidered:
t he t opdown view, t he dat a source view, t he dat a warehouse view, and t he
business quer y view.
There are t hree kinds of dat a warehouse applicat io ns: infor mat ion processing,
analyt ical processing, and dat a mining.
i. Inform ation processing support s quer ying, basic st at ist ical analys is, and
report ing using cr osst abs, t ables, chart s, or graphs. A current t rend in dat a
warehouse infor mat io n processing is to const ruct low -cost web- based
accessing t ools t hat are t hen int egrat ed wit h web browsers.
ii. Analyti cal processing support s basic OLAP operat ions, including sli ce-and-
dice, dr ill-down, roll-up, and pivot ing. It generally operat es on hist oric dat a
in bot h summar ized and det ailed for ms. The major st rengt h of online
analyt ical processing over infor mat ion processing is t he mult idimensio na l
dat a ana lysis o f dat a warehouse dat a.
iii. Data m ining support s knowledge discover y by finding hidden pat t erns and
associat ions, const ruct ing analyt ical mo dels, per for ming classificat ion and
predict ion, and present ing t he mining result s using visualizat io n t ools .
“How does data mining relate to inf ormation processi ng and online analyti cal
processing?”
a) Infor mat io n process ing, based on quer ies, can find useful infor mat io n.
However, answer s t o such quer ies reflect t he infor mat io n dir ect ly st ored in
dat abases or comput able by aggregat e fu nct io ns.
b) They do not reflect sophist icat ed pat t erns or regular it ies bur ied in t he
dat abase. T herefore, infor mat io n processing is not dat a mining.
c) Online analyt ical processing co mes a st ep closer t o dat a mining because it
can der ive infor mat io n summar ized at mult iple gr anular it ies fro m user -
spec ified subset s of a dat a warehouse. Such descr ipt ions ar e equiva lent t o t he
class/co ncept descr ipt ions.
Mult idimensio nal dat a mining (also kno wn as explorat ory mult idimensio nal dat a
mining, o nline analyt ical mining, or OLAM) int egr at es OLAP wit h dat a mining t o
uncover knowledge in mult id imensio na l dat abases. Amo ng t he many dif ferent
paradigms and archit ect ures o f dat a mining syst ems, mult idimensio nal dat a
mining is part icular ly import ant for t he follo wing reasons:
High qualit y o f dat a in dat a warehouses: Most dat a mining t ools need t o work
on int egrat ed, consist ent , and cleaned dat a, which requires cost ly dat a
cleaning, dat a int egr at ion, and dat a t ransfor mat io n as preprocessing st eps. A
dat a warehouse const ruct ed by such prepr ocessing ser ves as a valuable source
of high-qualit y dat a for OLAP as well as for dat a mining.
Available infor mat io n processing infr ast ruct ure surrounding dat a warehouses:
Co mprehensive infor mat io n processing and dat a analys is infr ast ruct ures ha ve
been or will be syst emat ically const ruct ed surrounding dat a warehouses,
which include accessing, int egrat io n, conso lidat ion, and t ransfor mat io n o f
mult ip le het erogeneous dat abases, ODBC/ OLEDB co nnect ions, Web accessing
and ser vice facilit ies, and report ing and OLAP analys is t ools. It is prudent to
make t he best use of t he available infr ast ruct ures rat her t han const ruct ing
ever yt hing fro m scrat ch.
OLAP- based explorat ion o f mult idimensio nal dat a: Effect ive dat a mining
needs explorat ory dat a analysis. A use r will o ft en want to t raverse t hrough a
dat abase, select port ions o f relevant dat a, ana lyze t hem at different
granular it ies, and present knowledge/result s in differ ent for ms.
Mult idimensio nal dat a mining provides facilit ies for mining on differ ent
subset s of dat a and at var ying levels o f abst ract io n —by dr illing, pivot ing,
filt er ing, dic ing, and slic ing on a dat a cube and/or int er mediat e dat a mining
result s. This, t oget her wit h dat a/knowledge visualizat ion t ools, great l y
enhances t he power and flexibilit y o f dat a mining.
Online select ion o f dat a mining funct ions: Users ma y not always know t he
spec ific k inds o f knowledge t he y want to mine. By int egr at ing OLAP wit h
var ious dat a mining funct ions, mult idimensio nal dat a mining provides user s
wit h t he flexibilit y t o select desired dat a mining funct io ns and swap dat a
mining t asks dyna mica lly.
One approach t o cube co mput at ion ext ends SQL so as t o include a co mput e cube
operator. The comput e cube operat or comput es aggr egat es over all subset s of t he
dimensio ns specified in t he operat ion. This can requir e excessive st orage space,
especially for large nu mbers o f dimensio ns.
Examp le :
A dat a cube is a lat t ice of cubo ids. Suppose t hat you want to creat e a dat a cube for
AllE lect ronics sales t hat cont ains t he follo wing: cit y, it em, year, and sales in
dollars. You want to be able t o analyze t he dat a, wit h quer ies such as t he
fo llo wing: “Co mput e t he sum o f sales, gr ouping by cit y and it em. ” “Co mput e t he
sum o f sales, grouping by cit y.” “Co mput e t he sum o f sales, grouping by it em.”
What is t he tot al number o f cubo ids, or group -by’s, t hat can be co mput ed for t his
dat a cube? Taking t he t hree at t ribut es, cit y, it em, and year, as t he dimens io ns for
t he dat a cube, and sales in do llar s as t he mea sure, t he tot al number of cubo ids, or
groupby’s, t hat can be co mput ed for t his dat a cube is 23 = 8. T he possible group -
by’s are t he fo llowing: {(cit y, it em, year ), (cit y, it em), (cit y, year), ( it em, year),
(cit y), ( it em), ( year), ()}, where () means t hat t he group-by is empt y ( i.e., t he
dimensio ns ar e not grouped).
The base cubo id co nt ains all t hree dimensio ns, cit y, it em, and year. It can ret urn
t he tot al sales for any co mbinat io n of t he t hree dimensio ns. The apex cubo id, or 0 -
D cubo id, refers t o t he case where t he group -by is empt y. It cont ains t he tot al sum
of all sales. The base cubo id is t he least generalized ( most specific) o f t he
cubo ids. The apex cubo id is t he most generalized ( least specific) of t he cubo ids,
and is oft en denot ed as all. I f we st art at the apex cubo id and explore downward in
t he lat t ice, t his is equivalent t o drilling down wit hin t he dat a cube. I f we st art at
t he base cubo id and explor e upward, t his is akin t o rolling up.
3. Partial Materialization: Selected Computation of Cuboids
There ar e t hree cho ices for dat a cube mat er ializat ion given a base cubo id:
1. No materiali zation:
Do not precomput e any o f t he “nonbase” cubo ids. This leads to comput ing
expensive mult id imensio n al aggregat es on-t he- fly, which can be ext remely slow.
2. Full materi ali zation:
Preco mput e all o f t he cubo ids. The result ing lat t ice o f co mput ed cubo ids is
referred t o as t he full cube. This cho ice t ypically requires huge amount s o f
memor y space in order to store all o f t he preco mput ed cubo ids.
3. Partial mat eriali zation:
Select ively co mput e a proper subset of t he who le set of possible cubo ids.
Alt er nat ively, we may co mput e a subset of t he cube, which cont ains only t hose
cells t hat sat isfy so me user -specified cr it er io n, such as where t he t uple count of
each cell is above so me t hresho ld. We will use t he t erm subcube t o refer t o t he
lat t er case, wher e only so me of t he cells ma y be preco mput ed for var ious
cubo ids. Part ia l mat er ializat ion represent s an int er est ing t rade-off bet wee n
st orage space and response t ime.
4. Indexing OLAP Data: Bitmap Index and Join Index
To facilit at e effic ient dat a accessing, mo st dat a warehouse s yst ems support index
st ruct ures and mat er ialized views.
Bitmap indexing
The bit map indexing met hod is popular in OLAP product s because it allows quick
searching in dat a cubes. T he bit map index is an alt er nat ive represent at ion o f t he
record ID (RID) list . In t he bit map index for a given at tribut e, t here is a dist inct
bit vect or, Bv, for each value v in t he att ribut e’s do ma in. I f a given at t ribut e’s
domain consist s o f n values, t hen n bit s are needed for each ent r y in t he bit map
index ( i.e., t here are n bit vect ors). If t he attr ibut e has t he value v for a given row
in t he dat a t able, t hen t he bit represent ing t hat value is set to 1 in t he
corresponding row of t he bit map index. All ot her bit s for t hat row are set to 0.
Bit map indexing is advant ageous co mpared to hash and t ree indices. It is
especially useful for low - cardina lit y do mains because co mpar iso n, jo in, and
aggregat io n operat ions are t hen reduced to bit ar it hmet ic, which subst ant ially
reduces t he processing t ime. Bit map indexing leads t o significant reduct ions in
space and input /out put (I/O) since a st r ing o f charact ers can be represent ed by a
single bit . For higher-cardinalit y do mains, t he met hod can be adapt ed using
co mpressio n t echniques.
Join indexing
The jo in indexing met hod gained popular it y fro m it s use in relat io nal dat abase
quer y processing. Tradit io nal indexin g maps t he value in a given co lumn t o a list
of rows having t hat value. In co nt rast , jo in indexing regist er s t he jo inable rows o f
t wo relat ions fro m a relat io nal dat abase.
Jo in indexing maint ains r elat ionships bet ween att ribut e values o f a dimensio n
(e.g., wit hin a dimensio n t able) and t he corresponding rows in t he fact t able. Jo in
indices may span mult iple dimens io ns t o form co mposit e jo in indices.