Algorithmics Research on Knowledge Discovery and Data Mining
Vladimir Estivill-Castro
School of Computing and Information Technology
Outline
u u u
Motivation What is Data Mining Extent of The Field
u Web
Mining, Text Mining, Software Engineering and Data Mining
u Example
of algorithm
Oriented Generalization
Vladimir Estivill -Castro 2
u Attribute
Motivation
Motivation
u
The problem of data overload looms ominously ahead.
u
Our technology to analyze data and understand massive datasets lags far behind our technology to gather and store data.
Vladimir Estivill -Castro
Interest for Knowledge Discovery
u
Emerged in Databases
u
Large logs of transactions
u u
Credit card transactions Supermarket transactions
Data Mining (1960s)
u
Massaging the data so statistics would reflect my preconceived hypothesis Data => Hypothesis vs Hypothesis validated by data
Vladimir Estivill -Castro 5
Deductive vs Inductive Science
u
Knowledge Discovery
The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patters in large data sets.
Data: The geo-referenced layers. u Information: The average population per administrative region u Knowledge: The patterns of growth of population densities and valid explanations for them.
u
Vladimir Estivill -Castro 7 6
Aspects of Knowledge Discovery
u
Nontrivial
u
u u
Potentially useful
u
it goes beyond computing closed- from quantities or evaluating models. the discovered patterns are true with some degree of certainty of unseen data
for the user simple, descriptive, informative
Understandable
u
Valid
u
Vladimir Estivill -Castro
What is Data Mining
What is Data Mining?
u
Different answers
Using computers to make sense of large volumes of data
u using
u u
a set of fundamental data manipulation tasks
classification association rules / basket analysis
what makes it different from Statistics?
A task in the process of knowledge discovery A multidisciplinary field
u what
makes it different from Statistics?
Vladimir Estivill -Castro 9
Mining Different Kinds of Knowledge
u Description u Estimation. u Prediction. u Affinity
Grouping .
u Classification. u Clustering.
Vladimir Estivill -Castro
Mining Different Kinds of Knowledge (by J. Han)
u
Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions. Association: Rules like inside(x, city) near(x, highway). Classification: Classify data based on the values in a classifying attribute, e.g., classify countries based on climate. Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns. Trend and deviation analysis : Find and characterize evolution trend, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis. Pattern-directed analysis : Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.
Vladimir Estivill -Castro
u u
Example: Retail
u
Bar-code technology makes possible to collect and store massive amounts of sales data (the basked data). u Information driven marketing process demands mining association rules over basket data.
u
``98% of customers that purchase tires and auto accessories also get service done
Vladimir Estivill -Castro 12
Application of Rules
u
Cross-marketing and attached mailing applications.
u
customized and designed junk-mail
catalog design, add-on sales. u store layout u customer segmentation based on buying patterns.
Vladimir Estivill -Castro 13
Illustration
Product A Customer 1 Customer 2 Customer 3
Product B
Product C
Product D
X X
X X X X X X
``90% of purchases that have bread and butter also include milk. u It is a rule of the form A B (90%). u A is the antecedent. u B is the consequent. u There is a confidence value associated to the rule.
u
Vladimir Estivill -Castro
14
Usage of mining for rules
u Find all rules that have Diet Coke as the consequent. u Help plan what should be done to boast the sales of Diet Coke. u Find all the rules that have bagels in the antecedent. u Determine what products may be impacted if the store discontinues selling bagels.
Vladimir Estivill -Castro
15
Usage of mining rules
u
Find all rules that have sausage in the antecedent and mustard in the consequent u what items should be sold with sausages to make highly likely that mustard will also be sold. Find all rules relating items located in shelves A and B u understand if the distance affects the sales of items from both shelves.
Vladimir Estivill -Castro
16
The Knowledge Discovery Process
Arrangement Preprocessing Selection Data Mining Interpretation
Data
Target Data
Preprocessed Transformed Data Data
Patterns
Knowledge
Vladimir Estivill -Castro
17
A formal process (and a standard)
Vladimir Estivill -Castro
18
Multi-disciplinary
u
Finding useful patterns in data is known by different names among different communities:
u u u u u
data mining (statistics, databases) knowledge discovery information discovery, information harvesting data archeology pattern processing
Vladimir Estivill -Castro 19
A multi-disciplinary field
Data Bases
Statistic
Visualization Machine Learning Logic Programming
Vladimir Estivill -Castro 20
Knowledge Discovery
u
Discovery of useful knowledge from data.
u u u u u u u u u
Databases Machine Learning Pattern Recognition Artificial Intelligence Knowledge Acquisition Scientific Discovery High-Performance Computing Algorithms (Analysis and Design) Statistics
Vladimir Estivill -Castro 21
Differences with Statistics
Vladimir Estivill -Castro
22
A significant amount of overlap
u Statistics u KDDM
probabilistic models u descriptive, nonparametric, exploratory u mathematically sound (advanced) u informative and predictive
u
concern for computability and scalability u interpret data u understandable u data on electronic media (and structured)
u
Exploratory Analysis of Massive Data Sets
Vladimir Estivill -Castro 23
Extent of the Field
24
Large number of conferences and specialized workshops
u
KDD - ACM but now IEEE conference on KDD SIAM conference on KDD PAKDD (2001 - 5th Pacific Rim Conf. on KDD) PKDD (2000 - 4th European Conf. on Principles and Practice of KDD) Data Base conferences u SIGMOD
u u
u u u
VLDB Artificial Intelligence and Machine Learning u ICML u ECML
Vladimir Estivill -Castro 25
Successful Example
u Recent
u u u
analysis of Bank of America loan database
250 fields per customer back to 1914! over nine million records
u A clustering tools was used to automatically
segment customers into groups with many similar categorical attributes.
u 14 groups identified, only one could be explained
u
The interesting cluster had 2 properties
u u
39% of customers had business and personal accounts with the bank this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer
Vladimir Estivill -Castro 26
Text Mining
u u
With / without Natural Language Processing Different from Information Retrieval/Extraction
u Results:
u There
exists a text (or a sets of texts) such as speak-much-of Miss Lewinsky & speak-littleof Mrs. Clinton u There exists another text such as speak-a-lot-of Mrs. Clinton & do-not-speak-of Miss Lewinsky
Vladimir Estivill -Castro 27
Spatial Data Mining
u
Data Mining system generates hypothesis search (and visualization) in abstract space inductive generalizations exceeding content of database
GIS user generates hypothesis visualization in geographical space shows whats inside the data
hard to visualize multivariate dependencies on a map
Vladimir Estivill -Castro
28
Spatial Data Mining
Thematic Map Decision Tree
Vladimir Estivill -Castro
29
Spatial Associations
FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course" FROM Washington_Golf_courses, Washington WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km") AND Washington.CFCC <> "D81" IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC SET SUPPORT THRESHOLD 0.5
Vladimir Estivill -Castro
30
Spatial Associations & Hierarchy of Spatial Relationships
u
Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
u
Topological relations: u intersects, overlaps, disjoint, etc. Spatial orientations: u left_of, west_of, under, etc. Distance information:
u close_to, within_distance, etc. Hierarchy of spatial relationship:
u u
g_close_to: near_by, touch, intersect, contain, etc. First search for rough relationship and then refine it.
Vladimir Estivill -Castro
Example: Spatial Association Rule Mining
u What
kinds of spatial objects are close to each other in
B.C.?
u
Kinds of objects: cities, water, forests, usa_boundary, mines, etc.
u Rules
u u
mined:
is_a(x, large_town) ^ intersect(x, highway) adjacent_to(x, water). [7%, 85%] is_a(x, large_town) ^adjacent_to(x, georgia_strait) close_to(x, u.s.a.). [1%, 78%]
u Mining method:
Apriori + multi-level association + geospatial algorithms (from rough to high precision).
Vladimir Estivill -Castro
Spatial Classification
u Generalization-
based induction u Interactive classification
Vladimir Estivill -Castro 33
Visualization of Predicted Distribution
Vladimir Estivill -Castro
Spatial Prediction and Trend Analysis
u Spatial
u u u u
trend predictive modeling (Ester et al97):
Discover centers: local maximal of some non-spatial attribute. Determine the (theoretical) trend of some non-spatial attribute, when moving away from the centers. Discover deviations (from the theoretical trend). Explain the deviations.
u Example:
Trend of unemployment rate change according to the distance to Munich. u Similar modeling can be used to study trend of temperature with the altitude, degree of pollution in relevance to the regions of population density, etc.
Vladimir Estivill -Castro
Spatial Clustering
u How
can we cluster points? u What are the distinct features of the clusters?
There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!
Vladimir Estivill -Castro 36
Mining in Image & Raster Databases
u Magellan
u u u u
project (Fayyad et al.96, JPL).
identify volcanos on Venus surface over 30,000 high resolution images Resolution accuracy: 80% 3 steps: data focusing, feature extraction, and classification learning
u POSSII
u u u
project (Palomar Obervatory Sky Survey II, )
2x stellar objects (galaxies, stars, etc.) classified Resolution:one magnitude better than in previous studies Classification accuracy: no normalization 75%, with normalization 94%, and compared with neural networks.
109
u QuakeFinder
(Stolorz et al96): Find earth quakes from
space.
u
using statistics, massive parallelism, and global optimization
Vladimir Estivill -Castro
Basket Analysis / Link Analysis
u Amazon.com
u u
Who bought what together What are related books
u Finding
u
who has fax at home
Get phone data for business with a fax number u Get usage records of lines to find who dials a business fax number from home for larger than 20 seconds
Vladimir Estivill -Castro 38
Crime detection
u crime
investigation (e.g., the Okalahoma City bombing) u fraud detection [Italy KDD-99 San Diego] but also Australian Taxation Office and HIC [PAKDD-99]
u
Characterization of Doctor Shoppers
Vladimir Estivill -Castro
39
Software Engineering and Data Mining
Vladimir Estivill -Castro
40
KDD techniques used in Software Engineering
Clustering
u (Mancoridis):
u Graph
Clustering for graph partitioning towards high-coupling and low-cohesion.
partitioning algorithms using hill-climbing
u (Ouyang)
Clustering towards improving the reusability in the design phase. u (Anquetil) Re-modularization of software
Vladimir Estivill -Castro 41
Decomposition into subsystemsNon OO case
u
u
Input from ISA
u
Software S made of a set of programs P, and set of files F.
p1p2p3p4p5 f1 X X X X f2 X X f3 X X X f4 X X X X
c p3= 3/4, cp4= 3/3
Representation of database of the system
A set alpha : A={P, F} : P P , F F
Creation of a grouping table
Programason rows, Files in columnss T = t1 , t2 , ... , t|F| ti={ p P | p uses fi}
u
KDDM
Associaiton rules c% of the programs that use file X also use files Y and Z. Results in a group of programs that uses a similar set of files.
p4 p3 (100%)
Colect and summirize results
u
Association rules are used to guide clustering
u
A set of programs and files makes a subsystem=> SUBSISTEMA
Vladimir Estivill -Castro 42
Decomposition into Subsystems - OO
u u
Using metrics, OO, KDDM and clustering to split with low coupling. Approach: ( Create list of related entity pairs (classes, methods and objects) ( Use OO metrics to create sets of metrics CBOSet, RFCSet, and DACSet (DAC: Data Abstraction Coupling) CBO_d, CBO_d, and DAC_t ( Generate matrices of classes that interact (matrices de interaccion) ( Apply KDDM algorithms (association rules)
CBOSet Interaction Matrix
Use hierarchical clustering on the coefficients of the association rules to produce a hierarchical decomposition.
Vladimir Estivill -Castro 43
Case Study - OO Sistema II
Mozilla - Netscape Communicator
u
HTML Editor
MOZILLA
HTML Composer/Editor - Mozilla SUMMARY Symbol Table Statistics of 30 projects ============================================== Files: 111 Includes: 697 Macros: 147 Functions: 60 Types: 0 Variables: 79 Enums: 4 Userdef: 0 Classes: 90 Inst Vars: 415 Methods: 1320 Friends: 25 Localdefs: 46 SUMMARY File Type Statistics of 30 projects ============================================== File Type: Number of Files: HTML Header IDL Interface Image Implementation Project Description 4 65 4 57 42 38
SUMMARY Symbol Table Statistics of 1223 projects ======================================== Files: 6713 Includes: 36492 Macros: 27024 Functions: 15898 Types: 3176 Variables: 11151 Enums: 715 Userdef: 0 Classes: 5933 Instance Variables: 23757 Methods: 41015 Friends: 273 Localdefs: 290 SUMMARY File Type Statistics of 1223 projects ======================================== File Type Number of files HTML Header Implementation Make Project Description 763 3331 3382 117 1309
Mozilla Statistics
HTML Editor Statistics
Vladimir Estivill -Castro 44
Graph Mining and WEB Mining
u Step
beyond link analysis
sub-graphs that are similar
molecules CAD/CAM parts (designs)
u Finding
u
u Chemical
Patterns of use u Analysis of WEB data u Text Mining/ Multimedia Mining
Vladimir Estivill -Castro 45
Time Series Mining
u
Predicting stock market Monitoring condition of equipment, weather, pilot behavior during long flights.
Vladimir Estivill -Castro
46
An example of generalization
u
For attribute-oriented data
Vladimir Estivill -Castro
47
Automated Attribute Oriented Induction
u
Illustration of basic strategies
based on a hierarchy of concepts, u where discovery initiated by a query for a rule with necessary conditions for a class.
u
Vladimir Estivill -Castro
48
The data set
NAME Anderson Bach Carlton Fraser STATUS M.A. junior junior M.S. PhD sophomore senior PhD MAJOR History Math Lib. Arts Physics Math Chemistry Computing Biology BIRTH PLACE Vancouver Calgary Edmonton Ottawa Bombay Richmond Victoria Shanghai GPA 3.5 3.7 2.6 3.9 3.3 2.7 3.5 3.4
STUDENT
Gupta Hart Jackson Liu
Wang Wise
M.S. freshman
Statistics Literature
Nanjing Toronto
3.2 3.9 49
Vladimir Estivill -Castro
The concept hierarchy
For attribute STATUS undergraduate ANY
graduate
freshman sophomore
junior
senior
M.A.
M.S.
PhD
Vladimir Estivill -Castro
50
The discovery query
u
In relation STUDENT, learn characteristic rule for STATUS=graduate in relevance to NAME, BIRTH PLACE, GPA u (threshold value of 3)
Vladimir Estivill -Castro
51
The induction
u 1)
Select graduate students.
STATUS M.A. M.S. PhD PhD PhD MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 1 1
NAME Anderson Frazer Gupta Liu Monk
Wang
M.S.
Statistics
Nanjing
3.2
Vladimir Estivill -Castro
52
The induction
u 1)
Select graduate students.
STATUS M.A. M.S. PhD PhD PhD MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 1 1
NAME Anderson Frazer Gupta Liu Monk
Wang
M.S.
Statistics
Nanjing
3.2
Vladimir Estivill -Castro
53
The induction
u 2)
Eliminate attribute STATUS because it is graduate for all students.
NAME Anderson Frazer Gupta Liu Monk MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 Wang Statistics Nanjing 3.2 1 54
Vladimir Estivill -Castro
The induction
u 3)
Generalize on the smallest decomposable components.
u
Not illustrated here, because we do not have composite attributes.
Vladimir Estivill -Castro
55
The induction
u
4) If there is a large set of distinct values for an attribute but there is no higher level concept provided for the attribute, the attribute should be removed.
u
Attribute NAME satisfies this.
Vladimir Estivill -Castro
56
The induction
u 2)
Eliminate attribute NAME because it has to many values.
NAME Anderson Frazer Gupta Liu Monk MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 1 1 1 1 1 1 Wang Statistics Nanjing 3.2 1 57
Vladimir Estivill -Castro
A generalization
MAJOR History Physics Math Biology Computing BIRTH PLACE Vancouver Ottawa Bombay Shanghai Victoria GPA 3.5 3.9 3.3 3.4 3.8 VOTE 2 3 1 1 2
Statistics
Nanjing
3.2
The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Vladimir Estivill -Castro 58
Concept tree ascension
u
4) If there is a higher level concept in the concept tree for an attribute, the substitution of the higher level concept generalizes the tuple.
Science Arts
Statistics
Computing
Biology
Math
Physics
History
Vladimir Estivill -Castro
59
A generalization of each attribute
MAJOR Art Science Science Science Science BIRTH PLACE British Columbia Ontario British Columbia India China GPA excellent excellent excellent good good VOTE 35 10 30 10 15
The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Vladimir Estivill -Castro 60
Threshold Control
u
If the number of distinct values of an attribute is larger than the generalization threshold value, further generalization on this attribute should be performed.
u u
Not the case for Major It is the case for Birth Place
Vladimir Estivill -Castro
61
Further generalization
MAJOR Art Science Science
BIRTH PLACE Canada Canada foreign
GPA excellent excellent good
VOTE 35 40 25
The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
Vladimir Estivill -Castro 62
Transformation to rules
MAJOR {Art, Science} Science BIRTH PLACE Canada GPA excellent VOTE 37
foreign
good
25
A graduate student is either (with 75% probability) a Canadian with excellent GPA or (with 25% probability) a foreign student, majoring in science with a good GPA.
Vladimir Estivill -Castro
63
Questions?
64