Introduction
Data Mining
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amounts
of data.
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD),
• knowledge extraction,
• data/pattern analysis,
• data archeology,
• information harvesting,
• business intelligence, etc.
Note: KDD is also the name of a journal, and TKDD (ACM Transactions on Knowledge Discovery from Data) is a well-known journal from ACM, which has held the SIGKDD (Special Interest Group) conference every year since 1995.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes.
• Data collection and data availability.
• Automated data collection tools, database systems, Web,
computerized society.
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, …
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention” (attributed to Plato): data mining
answers this need with the automated analysis of massive data sets.
Why Data Mining?
• The fast-growing, tremendous amount of data, collected and stored
in large and numerous data repositories, has far exceeded our human
ability for comprehension without powerful tools.
Data Rich but Information Poor
Businesses have so much data but lack the knowledge, skills, and tools to make the most of it.
Data Mining
• Data mining—searching for knowledge
(interesting patterns) in data.
• Many people treat data mining as a
synonym for another popularly used
term, knowledge discovery from data,
or KDD.
• Others view data mining as
merely an essential step in the process
of knowledge discovery.
Data Mining
• The knowledge discovery process is an
iterative sequence of the following steps:
1. Data cleaning: to remove noise and
inconsistent data.
2. Data integration: multiple data sources
may be combined.
3. Data selection: data relevant to the
analysis task are retrieved from the
database.
4. Data transformation: data are
transformed and consolidated into forms
appropriate for mining by performing
summary or aggregation operations.
5. Data mining: intelligent methods are
applied to extract data patterns.
6. Pattern evaluation: to identify the truly
interesting patterns representing
knowledge based on interestingness
measures.
7. Knowledge presentation: visualization
and knowledge representation
techniques are used to present mined
knowledge to users.
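The seven steps above can be sketched end to end on toy data. This is only an illustrative sketch: the two store lists, the function names, and the 50% support threshold are hypothetical and do not come from any particular system.

```python
from collections import Counter

# Hypothetical raw sources: (customer, item) purchase records from two stores.
store_a = [("c1", "milk"), ("c1", "bread"), ("c2", "milk"), ("c2", None), ("c2", "pen")]
store_b = [("c3", "Milk "), ("c3", "bread"), ("c4", "eggs"), ("c4", "milk")]

def clean(records):
    """Step 1: drop records with a missing item and normalize item names."""
    return [(c, item.strip().lower()) for c, item in records if item]

def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [record for source in sources for record in source]

def select(records, relevant_items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [(c, item) for c, item in records if item in relevant_items]

def transform(records):
    """Step 4: aggregate records into one basket (set of items) per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep only items whose support meets the threshold."""
    return {item: n / n_baskets for item, n in counts.items()
            if n / n_baskets >= min_support}

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, {"milk", "bread", "eggs"}))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)

# Step 7: present the mined knowledge to the user.
for item, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(f"{item} appears in {support:.0%} of baskets")
```

In practice the process is iterative: an unsatisfying result at step 6 usually sends the analyst back to an earlier step.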
Data Mining
• Data mining plays an essential role in the knowledge discovery
process.
• Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
What Kinds of Data Can Be Mined?
• Data mining can be applied to any kind of data as long as the data are
meaningful for a target application.
• The most basic forms of Data:
• Database Data
• Data warehouse Data
• Transactional Data
• Other forms of Data
• Data Streams
• Ordered/Sequence Data
• Graph or Networked Data
• Spatial Data
• Text Data
• Multimedia Data
• WWW
Data Mining Functionalities and Tasks
• There are a number of data mining functionalities:
• Mining of frequent patterns, associations, and correlations
• Classification and Regression
• Clustering Analysis
• Outlier Analysis
• Data mining functionalities are used to specify the kinds of patterns
to be found in data mining tasks.
• Data mining tasks can be classified into two categories:
1) Descriptive: Descriptive mining tasks characterize properties of
the data in a target data set.
2) Predictive: Predictive mining tasks perform induction on the
current data in order to make predictions.
Data Mining Functionalities and Tasks
• Predictive tasks: use some variables to predict unknown or future
values of other variables.
• Descriptive tasks: find human-interpretable patterns that describe the
data.
Mining Frequent Patterns, Associations, and Correlations
• Frequent patterns are patterns that occur frequently in data.
• There are many kinds of frequent patterns, including:
• Frequent itemsets,
• Frequent subsequences (or sequential patterns), and
• Frequent substructures.
• A frequent itemset typically refers to a set of items that often appear
together in a transactional data set.
• For example, milk and bread, which are frequently bought together in
grocery stores by many customers.
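As a rough illustration of frequent itemsets, the sketch below counts how often each pair of items appears together and keeps the pairs that reach a minimum count. The transactions, the function name `frequent_itemsets`, and the threshold of 3 are assumptions made up for this example; it is a brute-force count, not an optimized mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions (one set of items per purchase).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs", "juice"},
]

def frequent_itemsets(transactions, size, min_count):
    """Return itemsets of the given size found in at least min_count transactions."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

frequent_pairs = frequent_itemsets(transactions, size=2, min_count=3)
print(frequent_pairs)  # {('bread', 'milk'): 3}
```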
Mining Frequent Patterns, Associations, and Correlations
• A frequently occurring subsequence, such as the pattern that customers tend
to purchase first a laptop, followed by a digital camera, and then a
memory card, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms (e.g., graphs,
trees, or lattices) that may be combined with itemsets or
subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
• Example (social network analysis): consider these simplified interactions:
Interaction 1: A -> B -> C (A is connected to B, and B is connected to C)
Interaction 2: D -> E -> F (D is connected to E, and E is connected to F)
Interaction 3: G -> H -> I -> J (G is connected to H, H is connected to I, and I is connected to J)
Interaction 4: A -> B -> C -> D (A is connected to B, B is connected to C, and C is connected to D)
Upon analyzing these interactions, we might find that the substructure "A -> B -> C" (a chain of three individuals connected sequentially) is a frequent
substructure, as it appears in multiple interactions.
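The four interactions above can be checked mechanically. The following is a minimal sketch that treats each interaction as a path and counts contiguous chains of three individuals; the function name `frequent_chains` and the minimum count of 2 are assumptions for illustration, and real graph mining handles far more general structures than simple paths.

```python
from collections import Counter

# The four interactions from the example, written as paths.
interactions = [
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I", "J"],
    ["A", "B", "C", "D"],
]

def frequent_chains(paths, length, min_count):
    """Return chains of `length` consecutive nodes found in at least min_count paths."""
    counts = Counter()
    for path in paths:
        # A set ensures each chain is counted at most once per interaction.
        chains = {tuple(path[i:i + length]) for i in range(len(path) - length + 1)}
        counts.update(chains)
    return {chain: n for chain, n in counts.items() if n >= min_count}

print(frequent_chains(interactions, length=3, min_count=2))  # {('A', 'B', 'C'): 2}
```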
Mining Frequent Patterns, Associations, and Correlations
• Association analysis: Suppose that, as a marketing manager at
AllElectronics, you want to know which items are frequently
purchased together (i.e., within the same transaction). An example of
such a rule, mined from the AllElectronics transactional database, is
buys(X, "computer") → buys(X, "software") [support = 1%, confidence = 50%]
• Association rules are discarded as uninteresting if they do not satisfy
both a minimum support threshold and a minimum confidence
threshold.
• Additional analysis can be performed to uncover interesting statistical
correlations between associated attribute–value pairs.
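To make support and confidence concrete, here is a small sketch that computes both for a rule over a toy transaction list. The eight transactions and the two thresholds are hypothetical and are not the AllElectronics data, so the resulting numbers differ from the 1% and 50% quoted above.

```python
# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
    {"software"},
    {"camera"},
    {"camera", "printer"},
]

def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})

# A rule is kept only if it passes both (hypothetical) thresholds.
is_interesting = rule_support >= 0.2 and rule_confidence >= 0.5
print(rule_support, rule_confidence, is_interesting)  # 0.25 0.5 True
```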
Mining Frequent Patterns, Associations, and Correlations
• Correlation: a correlation rule is measured not only by its support and
confidence but also by the correlation between itemsets A and B.
• There are several correlation measures, including lift and chi-square (χ²).
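As a sketch of the two measures, lift compares the observed joint frequency of A and B with what independence would predict (a value of 1 suggests independence, above 1 positive correlation, below 1 negative correlation), and chi-square sums the squared deviations of observed from expected counts over the 2x2 contingency table. The function names and the counts below are hypothetical.

```python
def lift(n, n_a, n_b, n_ab):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def chi_square(n, n_a, n_b, n_ab):
    """Chi-square statistic for the 2x2 table of A present/absent vs. B present/absent."""
    observed = [
        [n_ab, n_a - n_ab],                 # A present: with B, without B
        [n_b - n_ab, n - n_a - n_b + n_ab], # A absent:  with B, without B
    ]
    row_totals = [n_a, n - n_a]
    col_totals = [n_b, n - n_b]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# Hypothetical counts: 10,000 transactions, 6,000 contain A, 7,500 contain B,
# and 4,000 contain both.
print(round(lift(10_000, 6_000, 7_500, 4_000), 2))        # 0.89
print(round(chi_square(10_000, 6_000, 7_500, 4_000), 1))  # 555.6
```

Here the lift below 1 suggests that A and B are negatively correlated even though the rule A → B would have a confidence of about 67%.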
Classification and Regression
• Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts.
• The model is derived based on the analysis of a set of training data
(i.e., data objects for which the class labels are known).
• The model is used to predict the class label of objects for which the
class label is unknown.
• The derived model may be represented in various forms, such as
classification rules (i.e., IF-THEN rules), decision trees, mathematical
formulae, or neural networks.
• There are many other methods for constructing classification models,
such as naïve Bayesian classification, support vector machines
(SVM), and k-nearest-neighbor classification.
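As one concrete instance of the train-then-predict idea, the sketch below implements a bare-bones k-nearest-neighbor classifier: the "model" is simply the labeled training data, and a new object receives the majority label of its k closest training objects. The data points, the labels "low" and "high", and the function name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical training data: (feature vector, known class label).
training_data = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.0), "high"),
]

def knn_predict(training_data, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    nearest = sorted(training_data, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(training_data, (1.2, 1.4)))  # low
print(knn_predict(training_data, (6.4, 6.2)))  # high
```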
Classification and Regression
• Classification predicts categorical (discrete, unordered) labels, and
regression models continuous-valued functions.
• Regression is used to predict missing or unavailable numerical data
values rather than (discrete) class labels.
• Regression analysis is a statistical methodology that is most often
used for numeric prediction.
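For numeric prediction, a minimal sketch is simple linear regression fitted by ordinary least squares. The data below are hypothetical and deliberately lie exactly on the line y = 2x + 1 so the fitted coefficients are easy to verify by hand; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict the missing value at x = 6
print(slope, intercept, prediction)  # 2.0 1.0 13.0
```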
Cluster Analysis
• Unlike classification and regression, which analyze class-labeled
(training) data sets, clustering analyzes data objects without
consulting class labels.
• In many cases, class-labeled data may simply not exist at the
beginning.
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass
similarity.
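One common way to group objects without class labels is k-means, sketched below as an assumption-laden illustration rather than the method the slides prescribe. The points and the two starting centroids are hypothetical; the starting centroids are passed in explicitly so that the run is deterministic.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    # `clusters` holds the assignment made in the final iteration.
    return centroids, clusters

# Hypothetical unlabeled objects.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]

centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (7.0, 6.0)])
print(centroids)  # two cluster centers, near (1.5, 1.5) and (6.5, 6.5)
```

The index of the cluster each object lands in can then serve as a generated class label.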
Outlier Analysis
• A data set may contain objects that do not comply with the general
behavior or model of the data. These data objects are outliers.
• Many data mining methods discard outliers as noise or exceptions.
However, in some applications (e.g., fraud detection) the rare events
can be more interesting than the more regularly occurring ones.
• The analysis of outlier data is referred to as outlier analysis or
anomaly mining.
• Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or
using distance measures, where objects that are remote from any
other cluster are considered outliers.
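Both detection styles can be sketched in a few lines. The first function is a simple statistical test based on z-scores, which implicitly assumes roughly normally distributed values; the second flags points with too few neighbors within a given radius. The data, the function names, and the thresholds (2.5 standard deviations, radius 2.0) are assumptions chosen for this toy example.

```python
import math
import statistics

def zscore_outliers(values, threshold=2.5):
    """Statistical approach: flag values far from the mean in standard-deviation units."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def distance_outliers(points, radius, min_neighbors=1):
    """Distance approach: flag points with fewer than min_neighbors other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if i != j and math.dist(p, q) <= radius)
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

# Hypothetical data: one unusually large value, one isolated point.
amounts = [10, 11, 9, 10, 12, 11, 10, 9, 11, 50]
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0), (15.0, 1.0)]

print(zscore_outliers(amounts))               # [50]
print(distance_outliers(points, radius=2.0))  # [(15.0, 1.0)]
```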
Which Technologies Are Used?
• Data mining has incorporated many techniques from other domains
such as statistics, machine learning, pattern recognition, database
and data warehouse systems, information retrieval, visualization,
algorithms, high performance computing, and many application
domains.
• Data mining adopts
techniques from many
domains.
Machine Learning
• Machine learning investigates how computers can learn based on
data.
• Classic problems in machine learning that are highly related to data
mining include:
▪ Supervised learning
▪ Unsupervised learning
▪ Semi-supervised learning
▪ Active learning
Machine Learning
▪ Supervised learning
• Supervised learning is basically a synonym for classification.
• The supervision in the learning comes from the labeled examples in
the training data set.
Machine Learning
▪ Unsupervised learning
• Unsupervised learning is essentially a synonym for clustering.
• The learning process is unsupervised since the input examples are not
class labeled.
• We may use clustering to discover classes within the data.
Machine Learning
▪ Semi-supervised learning
• Semi-supervised learning is a class of machine learning techniques
that make use of both labeled and unlabeled examples when learning
a model.
• In one approach, labeled examples are used to learn class models and
unlabeled examples are used to refine the boundaries between
classes.
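The approach just described can be illustrated with a very small self-training sketch on one-dimensional data. It is only one possible reading of the idea: class models are the class means, unlabeled examples receive pseudo-labels from the nearest mean, and the means are then re-estimated, which shifts the boundary between the classes. All data, labels, and function names are hypothetical.

```python
def class_means(examples):
    """Learn a trivial class model: the mean value of each class."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def self_train(labeled, unlabeled, rounds=5):
    """Refine class means by pseudo-labeling the unlabeled examples."""
    model = class_means(labeled)
    for _ in range(rounds):
        pseudo = [(x, min(model, key=lambda label: abs(x - model[label])))
                  for x in unlabeled]
        model = class_means(labeled + pseudo)
    return model

def boundary(model):
    """Decision boundary between the two classes: midpoint of the class means."""
    return (model["neg"] + model["pos"]) / 2

# Hypothetical data: two labeled examples and five unlabeled ones.
labeled = [(1.0, "neg"), (9.0, "pos")]
unlabeled = [2.0, 3.0, 4.0, 8.0, 8.5]

before = boundary(class_means(labeled))
after = boundary(self_train(labeled, unlabeled))
print(before, after)  # 5.0 5.5
```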
Machine Learning
▪ Active learning
• Active learning lets users play an active role in the learning process.
• An active learning approach can ask a user (e.g., a domain expert) to
label an example, which may be from a set of unlabeled examples or
synthesized by the learning program.
• The goal is to optimize the model quality by actively acquiring
knowledge from human users, given a constraint on how many
examples they can be asked to label.
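A minimal sketch of this loop, assuming one-dimensional data and a query strategy known as uncertainty sampling: at each step the learner asks about the unlabeled example closest to its current decision boundary, until the labeling budget is used up. The `expert` function is a hypothetical stand-in for the human user, and every name and number here is illustrative.

```python
def boundary(labeled):
    """Current decision boundary: midpoint between the two class means."""
    neg = [x for x, y in labeled if y == "neg"]
    pos = [x for x, y in labeled if y == "pos"]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def active_learn(labeled, pool, ask_user, budget):
    """Query the user for at most `budget` labels, always choosing the
    unlabeled example the current model is least certain about."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(budget, len(pool))):
        cut = boundary(labeled)
        query = min(pool, key=lambda x: abs(x - cut))  # closest to the boundary
        pool.remove(query)
        labeled.append((query, ask_user(query)))
    return labeled

def expert(x):
    """Hypothetical domain expert whose true boundary is 4.5."""
    return "pos" if x >= 4.5 else "neg"

seed = [(1.0, "neg"), (9.0, "pos")]
pool = [2.0, 4.0, 4.8, 6.0, 8.0]

result = active_learn(seed, pool, ask_user=expert, budget=2)
queried = [x for x, _ in result[len(seed):]]
print(queried)  # [4.8, 4.0]
```

With only two questions, the learner's boundary moves from 5.0 to about 4.7, closer to the expert's boundary than labeling two arbitrary examples would typically achieve.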
Applications
• Data mining has many successful applications, such as:
• Business Intelligence
• Web Search
• Bioinformatics
• Health Informatics
• Finance
• Digital Libraries
• Digital Governments