CS699
Lecture 1
Introduction
• Our focus is “data mining” not “data warehousing.”
• Data mining is an important component of data analysis.
• Will discuss
– Data preprocessing
– Basic data mining algorithms
– How to evaluate data mining models and data mining results
– How to perform data mining using software tools
• A good data mining web site: kdnuggets.com
• A good dataset site: UCI Machine Learning Repository
• Prerequisites:
– CS546 and either CS669 or CS579, or instructor’s consent.
• Math requirements
– Math is a tool to describe algorithms
– Mostly basic algebra (not linear algebra) and basic
probabilities and statistics
– A little bit of calculus
– You will have to do calculations using a calculator (which has a
“log” function)
• You will practice data mining with Weka, JMP Pro, and
Oracle.
• These software packages are used for the assignments.
• Weka:
– Free
– Easy to learn and easy to use
– Has a large number of data mining algorithms
– You will use it immediately
– Also used for class project
• Oracle data mining: takes time to learn
• You will learn how to use it through the assignments
• Oracle:
– Will use preconfigured virtual machine
– VM runs on Linux, but you don’t need to use Linux
– You will use SQL Developer for data mining
• JMP Pro: statistical analysis software with some data
mining algorithms implemented in it
• Freely available from BU’s IT website (refer to
homework 1)
• You will use it for assignments
• Class project:
– Building and testing classifier models using a real‐world
dataset
– You will use primarily Weka
– You may use any other tools, including R, Python, or JMP Pro
for data preprocessing
• Each week
– Quiz (except in Week 6)
– Assignment
– Discussion
• Live Class
8:00 – 10:00 PM EST, every Wednesday
• Live Class (and/or Q & A )
11 AM – 12 PM EST, every Saturday
• Attendance is not mandatory, but students must study
the live class material.
Blackboard
• Under Class Discussion (Discussion Board)
– Announcement (Common Area)
– Live Classroom Slides
– Weka issues
– Oracle issues
– JMP Pro issues
– Around the Clock Help (other questions)
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e‐commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, social network
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non‐trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing
communities
• Data mining plays an essential role in the knowledge discovery process
• The steps, shown as a figure on the slide:
Databases → Data Cleaning / Data Integration → Data Warehouse →
Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Data Mining in Business Intelligence
• The slide shows a pyramid of layers with increasing potential to support
business decisions (bottom to top), together with the role working at
each layer:
– Decision Making (End User)
– Data Presentation: Visualization Techniques (Business Analyst)
– Data Mining: Information Discovery (Data Analyst)
– Data Exploration: Statistical Summary, Querying, and Reporting
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Sources: Paper, Files, Web documents, Scientific experiments,
Database Systems
A Typical View from ML and Statistics
• The slide shows a pipeline:
Input Data → Data Pre-Processing → Data Mining → Post-Processing
– Pre-processing: data integration, normalization, feature selection,
dimension reduction
– Mining: pattern discovery, association & correlation, classification,
clustering, outlier analysis, …
– Post-processing: pattern evaluation, pattern selection, pattern
interpretation, pattern visualization
• This is a view from typical machine learning and statistics communities
What Kinds of Data?
• Database‐oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time‐series data, temporal data, sequence data (incl. bio‐sequences)
– Structure data, graphs, social networks and multi‐linked data
– Object‐relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World‐Wide Web
Data Types
• Categorical (or nominal) vs. numeric data:
Categorical:
OID  Age     Income  Buy?
1    Young   Low     Y
2    Young   High    Y
3    Old     Low     N
4    Middle  Low     Y
5    Middle  High    N
6    Old     Low     N
7    Young   High    N
8    Old     High    Y
9    Old     High    Y
10   Young   Low     N

Numeric:
OID  Age  Height  Weight
1    15   60      180
2    8    48      115
3    32   72      153
4    27   65      145
5    17   58      189
6    56   70      150
7    72   56      163
8    22   63      172
9    42   71      139
10   39   68      150
Classification
• Classification and label prediction
– Construct models (functions) based on some training examples, called
training dataset.
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas
mileage)
– Predict some unknown class label (or class attribute)
• Typical methods
– Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule‐based classification, pattern‐based classification, logistic
regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars, diseases,
web‐pages, …
• Also called supervised learning
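Decision-tree learners pick split attributes with measures such as information gain; J48 (Weka's C4.5 implementation) actually uses the gain-ratio refinement, but information gain is the core quantity. A minimal sketch, computed on the categorical buy table from the "Data Types" slide, using only the standard library:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, class_attr="Buy"):
    """Information gain of splitting `rows` (a list of dicts) on `attr`."""
    labels = [r[class_attr] for r in rows]
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[class_attr] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# The categorical table from the "Data Types" slide.
data = [
    {"Age": "Young",  "Income": "Low",  "Buy": "Y"},
    {"Age": "Young",  "Income": "High", "Buy": "Y"},
    {"Age": "Old",    "Income": "Low",  "Buy": "N"},
    {"Age": "Middle", "Income": "Low",  "Buy": "Y"},
    {"Age": "Middle", "Income": "High", "Buy": "N"},
    {"Age": "Old",    "Income": "Low",  "Buy": "N"},
    {"Age": "Young",  "Income": "High", "Buy": "N"},
    {"Age": "Old",    "Income": "High", "Buy": "Y"},
    {"Age": "Old",    "Income": "High", "Buy": "Y"},
    {"Age": "Young",  "Income": "Low",  "Buy": "N"},
]

print(info_gain(data, "Age"), info_gain(data, "Income"))
```

On this tiny table both gains are near zero (every Age group is an even Y/N split), which is itself instructive: a tree learner needs attributes that actually separate the classes.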
• Example (decision tree)
Classify a car with unknown class
label (risk):
4‐door, 4‐cylinder, wagon.
==> risk = 1
• Music CD purchase dataset example
• A synthetic dataset, where 1’s and 0’s were entered
arbitrarily.
• Contains information about customers’ purchases of
music CDs collected over a certain period of time, say
the past 12 months.
• A 1 in the dataset indicates that the customer
purchased a CD by the musician at least once in the
past 12 months.
• The class attribute indicates whether a customer is
“young” or “old.”
• 12 attributes: 1 ID, 10 predictor (independent) attributes, and 1
class (dependent) attribute
• 50 tuples
• A part of the dataset
• Decision tree generated by J48 algorithm
Classification vs. Numeric Prediction
• Classification:
– Predicted (dependent) attribute is a nominal attribute.
– Example: Predict whether a customer will buy a computer or not (yes or
no, for example).
• Numeric prediction:
– Predicted (dependent) attribute is a numeric attribute.
– Example: Predict the weight (numeric value) of a person given the age and
the height of the person.
– Example: CPU dataset
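A minimal numeric-prediction sketch: fitting a one-variable least-squares line to predict Weight from Height, using the numeric table from the "Data Types" slide. (A real numeric predictor, such as one for the CPU dataset, would typically use several predictor attributes; this only illustrates the idea.)

```python
# One-variable least-squares regression: predict Weight from Height,
# using the numeric table from the "Data Types" slide.
heights = [60, 48, 72, 65, 58, 70, 56, 63, 71, 68]
weights = [180, 115, 153, 145, 189, 150, 163, 172, 139, 150]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# slope = cov(height, weight) / var(height);
# the intercept makes the line pass through the two means.
num = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
den = sum((h - mean_h) ** 2 for h in heights)
slope = num / den
intercept = mean_w - slope * mean_h

def predict_weight(height):
    return intercept + slope * height

print(round(slope, 3), round(predict_weight(65), 1))
```

On this toy table the fitted slope is small (about 0.14), a reminder that a numeric prediction model is only as good as the relationship present in the data.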
Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in a grocery
store?
– Mine all frequent itemsets and then all strong rules.
– An itemset is frequent if its support is >= a predefined
threshold, the minimum support.
– A rule is written as: <left-hand side> => <right-hand side>
– Example of a rule: {milk, butter} => {cheese, egg}
– A rule is strong if its confidence is >= a predefined
threshold, the minimum confidence.
• Example (transaction table shown in the slide figure)
• Support examples:
– support of {bread} = 7
– support of {egg, milk} = 4
– support of {bread, egg, milk} = 3
• A rule R: {bread} => {egg, milk}
• Quality measures of the rule and informal interpretation:
– Support(R) = 33.3% (3/9): the fraction of people who purchased
{bread, milk, egg}
– Confidence(R) = 42.9% (3/7): among those who purchased bread, the
fraction who also purchased {milk, egg}
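The counts quoted above can be checked directly in code. The transaction list below is hypothetical, constructed only to match the stated counts (9 transactions; support({bread}) = 7, support({egg, milk}) = 4, support({bread, egg, milk}) = 3):

```python
# Hypothetical 9-transaction list matching the counts quoted on the slide.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "egg", "milk"},
    {"bread", "egg", "milk"},
    {"bread", "egg"},
    {"bread"},
    {"bread"},
    {"bread"},
    {"egg", "milk"},
    {"butter"},
]

def support_count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_measures(lhs, rhs):
    """Support and confidence of the rule lhs => rhs."""
    both = support_count(lhs | rhs)
    support = both / len(transactions)
    confidence = both / support_count(lhs)
    return support, confidence

s, c = rule_measures({"bread"}, {"egg", "milk"})
print(f"support = {s:.3f}, confidence = {c:.3f}")  # 3/9 and 3/7
```

Note that confidence is normalized by the support count of the left-hand side only, which is why the same rule has a lower support (3/9) than confidence (3/7).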
Association and Correlation Analysis
• Music CD purchase dataset (preprocessed for Weka’s Apriori
algorithm)
• A part of the dataset
• Some association rules (mined by Apriori algorithm)
• Handel=t Mahler=t 5 ==> Bach=t 5 <conf:(1)>
• Bach=t Haydn=t Mendelssohn=t 5 ==> Mozart=t 5 <conf:(1)>
• Bach=t Haydn=t Mozart=t 5 ==> Mendelssohn=t 5 <conf:(1)>
• Bach=t Mendelssohn=t 7 ==> Mozart=t 6 <conf:(0.86)>
• Bach=t Handel=t 6 ==> Mahler=t 5 <conf:(0.83)>
• Bach=t Mozart=t Mendelssohn=t 6 ==> Haydn=t 5 <conf:(0.83)>
• Haydn=t Mendelssohn=t 9 ==> Mozart=t 7 <conf:(0.78)>
• Haydn=t Mozart=t 9 ==> Mendelssohn=t 7 <conf:(0.78)>
• Bach=t Mozart=t 8 ==> Mendelssohn=t 6 <conf:(0.75)>
• Mahler=t 14 ==> Bach=t 10 <conf:(0.71)>
• Association, correlation vs. causality
– Are strongly associated items also strongly correlated?
– If two items are strongly correlated, is there a causal
relationship?
• How to mine such patterns and rules efficiently in large
datasets?
• Association rules can also be used for classification or
clustering.
Cluster Analysis
• Unsupervised learning (i.e., there is no class label)
• Group data to form new categories (i.e., clusters), e.g., cluster
customers into different groups
• Principle: Maximizing intra-class similarity & minimizing
inter-class similarity
• Many methods and applications
• Clustering output types are shown in the slide figure
• London cholera epidemic (Source: J. Leskovec, A. Rajaraman,
and J.D. Ullman, “Mining of Massive Datasets,” 2014, page 3.)
• Iris dataset (from UCI ML Repository)
• Used for classification
• Has 4 attributes and class attribute
• Class attribute: type of iris plant
• A part of the dataset
• A clustering algorithm was run on only two attributes
• Clustering result visualization
• X: petallength, Y: petalwidth
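A minimal k-means (Lloyd's algorithm) sketch in the spirit of the two-attribute clustering above. The points are hypothetical, loosely shaped like the small-petal vs. large-petal iris groups; the run shown on the slide used Weka, not this code.

```python
# Minimal k-means: alternate between assigning points to their nearest
# centroid and moving each centroid to the mean of its cluster.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: each centroid moves to the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical (petallength, petalwidth)-style points.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (1.6, 0.4),   # small petals
          (4.9, 1.8), (5.1, 2.0), (4.7, 1.6), (5.4, 2.1)]   # large petals
centroids, clusters = kmeans(points, centroids=[points[0], points[4]])
print(centroids)
```

Because this is unsupervised, the algorithm never sees the iris class labels; on well-separated data like this it still recovers the two natural groups.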
Outlier Analysis
– Outlier: A data object that does not comply with the general
behavior of the data
– Noise or exception? ― One person’s garbage could be
another person’s treasure
– Methods: byproduct of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
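One simple statistical method for the "does not comply with the general behavior" idea above is a z-score check: flag values more than k standard deviations from the mean. This is only one method among many (clustering- and regression-based detection work on the same principle), and the data below are hypothetical:

```python
from math import sqrt

def zscore_outliers(values, k=2.0):
    """Return the values more than k standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) > k * std]

# Hypothetical transaction amounts with one obviously unusual value.
amounts = [42, 38, 45, 41, 39, 44, 40, 500]
print(zscore_outliers(amounts))
```

Whether a flagged value is noise to discard or a fraud case to investigate is a judgment call, which is exactly the "one person's garbage, another person's treasure" point above.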
Sequential Pattern, Trend and Evolution Analysis
– Trend, time‐series, and deviation analysis: e.g., regression
and value prediction
– Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory
cards
– Periodicity analysis
– Biological sequence analysis
Evaluation of Knowledge
• Are all mined knowledge interesting?
– One can mine a tremendous amount of “patterns” and knowledge
– Some may fit only certain dimension space (time, location, …)
– Some may not be representative, may be transient, …
• A pattern is interesting if it is
– easily understood
– valid on new data or test data with some degree of certainty
– potentially useful
– novel
• Objective measures (e.g., support and confidence of an
association rule)
• Subjective measures (e.g., expected/unexpected, actionable)
Technologies Used in Data Mining
• Data mining draws on many surrounding technologies (slide figure):
machine learning, pattern recognition, statistics, visualization,
applications, algorithms, database technology, and high-performance
computing
Applications of Data Mining
• Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological network
analysis
• Data mining and software engineering
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL‐Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining
38
Major Issues in Data Mining
• Mining Methodology
• User Interaction
• Efficiency and Scalability
• Diversity of data types
• Data mining and society
39
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject‐oriented, integrated, time‐variant, and
nonvolatile collection of data in support of management’s decision‐making
process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
Data Warehouse—Subject‐Oriented
• Organized around major subjects, such as customer, product,
sales
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on‐line transaction records
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer
than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5‐10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”
Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
OLTP vs. OLAP

Feature             OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical; summarized,
                    detailed, flat relational;    multidimensional;
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write; index/hash        lots of scans
                    on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100 MB to GB                  100 GB to TB
metric              transaction throughput        query throughput, response time
Data Warehouse: A Three-Tier Architecture
• The slide figure shows three tiers:
– Bottom tier (data warehouse server): operational DBs and other
sources are brought in via extract, transform, load, and refresh,
coordinated by a monitor and integrator; this tier holds the data
warehouse, data marts, and a metadata repository
– Middle tier (OLAP engine): OLAP server
– Top tier (front-end tools): query and reports, analysis, data mining
Three Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning the
entire organization
• Data Mart
– a subset of corporate‐wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
Extraction, Transformation, and Loading (ETL)
• Data extraction
– get data from multiple, heterogeneous, and external sources
• Data cleaning
– detect errors in the data and rectify them when possible
• Data transformation
– convert data from legacy or host format to warehouse format
• Load
– sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
• Refresh
– propagate the updates from the data sources to the
warehouse
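The ETL steps above can be sketched in a few lines. This toy example extracts records from two hypothetical sources that use different field names and currencies, transforms them to a single warehouse convention (the consistency-of-encoding point from the integration slide), and loads them into one consolidated store; all field names and the exchange rate are made up for illustration:

```python
# Toy ETL: extract from two hypothetical sources, transform to one
# convention (shared field names, a single currency), load into one list.
source_a = [{"cust": "C1", "price_usd": 100.0}]
source_b = [{"customer_id": "C2", "price_eur": 80.0}]

EUR_TO_USD = 1.10  # assumed fixed rate, for the example only

def transform_a(rec):
    # rename the customer field to the warehouse convention
    return {"customer_id": rec["cust"], "price_usd": rec["price_usd"]}

def transform_b(rec):
    # convert the price to the warehouse's currency
    return {"customer_id": rec["customer_id"],
            "price_usd": round(rec["price_eur"] * EUR_TO_USD, 2)}

# extract -> transform -> load
warehouse = ([transform_a(r) for r in source_a] +
             [transform_b(r) for r in source_b])
print(warehouse)
```

Real ETL tools add the cleaning, integrity-checking, and refresh machinery listed above, but the shape of the work (per-source transforms into one target schema) is the same.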
Metadata Repository
• Metadata is the data defining warehouse objects. The metadata repository stores:
• Description of the structure of the data warehouse
– schema, view, dimensions, hierarchies, derived data definitions, data mart
locations and contents
• Operational meta‐data
– data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
– warehouse schema, view and derived data definitions
• Business data
– business terms and definitions, ownership of data, charging policies
References
• Han, J., Kamber, M., and Pei, J., Data Mining: Concepts and
Techniques, 3rd ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/