
NMAMIT - CSE

BUSINESS INTELLIGENCE (21CSE304)

Unit 3

By Mr. Prithviraj Jain, CSE, NMAMIT
Syllabus:
Data Mining—On What Kind of Data? Data Mining Functionalities—What Kinds of Patterns
Can Be Mined? Mining Association rules: Basic concepts, frequent item set mining methods
- Apriori Algorithm, Generating Association Rules from Frequent Item sets.

Data Mining:
Data mining is the process of extracting knowledge or insights from large amounts of data
using various statistical and computational techniques.
Moving toward the Information Age:
 Terabytes or petabytes of data pour into our computer networks, the World Wide Web
(WWW), and various data storage devices every day from business, society, science and
engineering, medicine, and almost every other aspect of daily life.
 This explosive growth of available data volume is a result of the computerization of our
society and the fast development of powerful data collection and storage tools.
 Global backbone telecommunication networks carry tens of petabytes of data traffic every
day. The medical and health industry generates tremendous amounts of data from medical
records, patient monitoring, and medical imaging. Billions of Web searches supported by
search engines process tens of petabytes of data daily.
 This explosively growing, widely available, and gigantic body of data makes our time
truly the data age.
 Powerful and versatile tools are badly needed to automatically uncover valuable
information from the tremendous amounts of data and to transform such data into
organized knowledge. This necessity has led to the birth of data mining.
 Example: A search engine (e.g., Google) receives hundreds of millions of queries every
day. Each query can be viewed as a transaction where the user describes her or his
information need.
Data Mining as the Evolution of Information Technology:
• Before 1600: empirical science.
• 1600–1950s: theoretical science.
Each discipline has grown a theoretical component. Theoretical models often motivate
experiments and generalize our understanding.
• 1950s–1990s: computational science.
Over the last 50 years, most disciplines have grown a third, computational branch (e.g.,
empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of our inability to find
closed-form solutions for complex mathematical models.
• 1990–now: data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally
accessible
– Scientific info. management, acquisition, organization, query, and
visualization tasks scale almost linearly with data volumes. Data mining is a
major new challenge!
Evolution of database technology:
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial,
implicit, previously unknown, and potentially useful) patterns or knowledge from huge
amounts of data.
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in
the process of knowledge discovery.

Figure 1: Data mining as a step in the process of knowledge discovery.


1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users).
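To make the flow of these steps concrete, here is a minimal Python sketch of a KDD-style pipeline using pandas. The file names, column names, and the $5000 threshold are assumptions chosen only to illustrate how the steps chain together; this is a sketch, not a prescribed implementation.

# Minimal sketch of the KDD steps over a hypothetical sales dataset.
import pandas as pd

# 1-2. Data cleaning and integration: load two hypothetical sources and combine them.
sales = pd.read_csv("sales.csv")          # hypothetical file
customers = pd.read_csv("customers.csv")  # hypothetical file
data = sales.merge(customers, on="cust_id", how="inner")
data = data.dropna(subset=["amount"])     # drop incomplete records (cleaning)

# 3. Data selection: keep only the attributes relevant to the analysis task.
data = data[["cust_id", "age", "amount"]]

# 4. Data transformation: aggregate to yearly spend per customer.
yearly = data.groupby("cust_id", as_index=False)["amount"].sum()

# 5. Data mining: apply a (trivial) method to extract a pattern of interest.
yearly["high_spender"] = yearly["amount"] > 5000

# 6-7. Pattern evaluation and knowledge presentation: summarize and report.
print(f"{yearly['high_spender'].mean():.1%} of customers are high spenders")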
What Kinds of Data Can Be Mined?
1. Database Data
 A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data.
 The software programs provide mechanisms for defining database structures and data
storage; for specifying and managing concurrent, shared, or distributed data access; and
for ensuring consistency and security of the information stored despite system crashes or
attempts at unauthorized access.
 A relational database is a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes (columns or fields) and usually stores a
large set of tuples (records or rows).
 Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values. A semantic data model, such as an entity-
relationship (ER) data model, is often constructed for relational databases.
Example (relation schemas for such a store database):
customer (cust_ID, name, address, age, occupation, annual_income, credit_information, category, ...)
item (item_ID, brand, category, type, price, place_made, supplier, cost, ...)
employee (empl_ID, name, category, group, salary, commission, ...)
branch (branch_ID, name, address, ...)
purchases (trans_ID, cust_ID, empl_ID, date, time, method_paid, amount)
items_sold (trans_ID, item_ID, qty)
works_at (empl_ID, branch_ID)
2. Data Warehouses
 A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and usually residing at a single site. Data warehouses are
constructed via a process of data cleaning, data integration, data transformation, data
loading, and periodic data refreshing.
 To facilitate decision making, the data in a data warehouse are organized around major
subjects (e.g., customer, item, supplier, and activity).
 A data warehouse is usually modeled by a multidimensional data structure, called a data
cube, in which each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure such as count or
sum(sales_amount).
 A data cube provides a multidimensional view of data and allows the precomputation
and fast access of summarized data.

Figure 2. Typical framework of a data warehouse.


3. Transactional Data
In general, each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page. A transaction
typically includes a unique transaction identity number (trans ID) and a list of the items
making up the transaction, such as the items purchased in the transaction. A transactional
database may have additional tables, which contain other information related to the
transactions, such as item description, information about the salesperson or the branch, and so
on.
4. Other Kinds of Data
 Besides relational database data, data warehouse data, and transaction data, there are
many other kinds of data that have versatile forms and structures and rather different
semantic meanings.
 Such kinds of data can be seen in many applications: time-related or sequence data (e.g.,
historical records, stock exchange data, and time-series and biological sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously
transmitted), spatial data (e.g., maps), engineering design data (e.g., the design of
buildings, system components, or integrated circuits), hypertext and multimedia data
(including text, image, video, and audio data), graph and networked data (e.g., social
and information networks), and the Web (a huge, widely distributed information
repository made available by the Internet).

What Kinds of Patterns Can Be Mined? (Data Mining Functionalities)

1. Class/Concept Description: Characterization and Discrimination

 It can be useful to describe individual classes and concepts in summarized, concise, and
yet precise terms. Such descriptions of a class or a concept are called class/concept
descriptions.

 These descriptions can be derived using (1) data characterization, by summarizing the
data of the class under study (often called the target class) in general terms, or (2) data
discrimination, by comparison of the target class with one or a set of comparative
classes (often called the contrasting classes), or (3) both data characterization and
discrimination.

 Data characterization is a summarization of the general characteristics or features of a


target class of data.

 The data corresponding to the user-specified class are typically collected by a query. For
example, to study the characteristics of software products with sales that increased by
10% in the previous year, the data related to such products can be collected by executing
an SQL query on the sales database.

 The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, curves, multidimensional data cubes, and multidimensional
tables, including crosstabs.
 Example: A customer relationship manager at AllElectronics may order the following
data mining task: Summarize the characteristics of customers who spend more than
$5000 a year at AllElectronics. The result is a general profile of these customers, such
as that they are 40 to 50 years old, employed, and have excellent credit ratings (a code
sketch of this kind of summarization appears at the end of this subsection).

 Data discrimination is a comparison of the general features of the target class data
objects against the general features of objects from one or multiple contrasting classes.

 The target and contrasting classes can be specified by a user, and the corresponding data
objects can be retrieved through database queries. For example, a user may want to
compare the general features of software products with sales that increased by 10% last
year against those with sales that decreased by at least 30% during the same period.

 Example: A customer relationship manager at AllElectronics may want to compare two


groups of customers—those who shop for computer products regularly (e.g., more than
twice a month) and those who rarely shop for such products (e.g., less than three times a
year). The resulting description provides a general comparative profile of these
customers, such as that 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university education, whereas 60%
of the customers who infrequently buy such products are either seniors or youths, and
have no university degree.
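The following is a small pandas sketch, provided for illustration only, of the kind of summarization behind these two examples: it profiles the target class (customers spending more than $5000 a year) and compares it with the remaining customers. The file and column names are hypothetical.

# Sketch: characterize a target class and contrast it with the other customers.
import pandas as pd

customers = pd.read_csv("allelectronics_customers.csv")   # hypothetical file
target = customers[customers["annual_spend"] > 5000]      # target class
contrast = customers[customers["annual_spend"] <= 5000]   # contrasting class

def profile(df):
    """Summarize general characteristics of one class of customers."""
    return {
        "count": len(df),
        "median_age": df["age"].median(),
        "pct_employed": (df["employment_status"] == "employed").mean() * 100,
        "pct_excellent_credit": (df["credit_rating"] == "excellent").mean() * 100,
    }

print("target class:", profile(target))          # data characterization
print("contrasting class:", profile(contrast))   # data discrimination (comparison)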
2. Mining Frequent Patterns, Associations, and Correlations
 Frequent patterns, as the name suggests, are patterns that occur frequently in data.
There are many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures.
 A frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers. A frequently occurring subsequence,
such as the pattern that customers tend to purchase first a laptop, followed by a digital
camera, and then a memory card, is a (frequent) sequential pattern.
 A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that
may be combined with itemsets or subsequences.
 Association analysis. Suppose that, as a marketing manager at AllElectronics, you want
to know which items are frequently purchased together (i.e., within the same
transaction).
 An example of such a rule, mined from the AllElectronics transactional database, is
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%],
where X is a variable representing a customer. A confidence, or certainty, of 50% means
that if a customer buys a computer, there is a 50% chance that she will buy software as
well. A 1% support means that 1% of all the transactions under analysis show that
computer and software are purchased together. Another example, involving more than one
attribute, is
age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%].
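As an illustration (not from the source text), the following Python sketch shows how the support and confidence of a rule such as buys(X, "computer") => buys(X, "software") would be computed from a toy list of transactions; the transactions themselves are invented.

# Compute support and confidence of the rule {computer} => {software}
# over a toy list of transactions (each transaction is a set of items).
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "monitor"},
    {"milk", "bread"},
]

antecedent, consequent = {"computer"}, {"software"}
n = len(transactions)
count_a = sum(antecedent <= t for t in transactions)                   # transactions containing A
count_ab = sum((antecedent | consequent) <= t for t in transactions)   # transactions containing A and B

support = count_ab / n           # fraction of all transactions containing both items
confidence = count_ab / count_a  # chance of buying software given a computer purchase
print(f"support = {support:.0%}, confidence = {confidence:.0%}")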
3. Classification and Regression for Predictive Analysis
 Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
 The models are derived based on the analysis of a set of training data (i.e., data
objects for which the class labels are known).
 The model is used to predict the class label of objects for which the class label is
unknown.
 The derived model may be represented in various forms, such as classification rules
(i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks.
 A decision tree is a flowchart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.

Figure 3. A classification model can be represented in various forms: (a) IF-THEN rules, (b) a
decision tree, or (c) a neural network.
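For illustration, here is a minimal scikit-learn sketch of deriving a decision-tree model from class-labeled training data and using it to predict the label of an unseen object. The features, labels, and values are invented, and a decision tree is only one of the model forms mentioned above.

# Train a decision tree on labeled data, then predict the class of a new object.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [age, income] -> credit class label.
X_train = [[25, 30000], [45, 80000], [35, 60000], [50, 20000], [23, 25000]]
y_train = ["fair", "excellent", "excellent", "fair", "fair"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# The derived model can be inspected as an IF-THEN-style tree.
print(export_text(model, feature_names=["age", "income"]))

# Predict the class label of an object whose label is unknown.
print(model.predict([[40, 70000]]))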
4. Cluster Analysis
 Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels.
 In many cases, class-labeled data may simply not exist at the beginning. Clustering can
be used to generate class labels for a group of data. The objects are clustered or grouped
based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
 That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects in other
clusters. Each cluster so formed can be viewed as a class of objects, from which rules
can be derived.
 Example: Cluster analysis. Cluster analysis can be performed on AllElectronics
customer data to identify homogeneous subpopulations of customers. These clusters
may represent individual target groups for marketing.

Figure 4. A 2-D plot of customer data with respect to customer locations in a city, showing three
data clusters.
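A minimal k-means sketch (scikit-learn assumed) that groups 2-D customer locations into three clusters without using any class labels, in the spirit of Figure 4; the coordinates are made up.

# Cluster 2-D customer locations into three groups without class labels.
import numpy as np
from sklearn.cluster import KMeans

locations = np.array([
    [1.0, 1.2], [1.1, 0.9], [0.8, 1.0],   # points near (1, 1)
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],   # points near (5, 5)
    [9.0, 1.0], [9.1, 1.2], [8.9, 0.8],   # points near (9, 1)
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(locations)
print(labels)                   # generated cluster labels, one per customer
print(kmeans.cluster_centers_)  # the center of each discovered cluster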
5. Outlier Analysis
 A data set may contain objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Many data mining methods discard outliers as
noise or exceptions.
 Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures where objects that are remote from any
other cluster are considered outliers.
Are All Patterns Interesting?
A pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm.

Mining Association Rules:


Market Basket Analysis: A Motivating Example
 Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets. With massive amounts of data
continuously being collected and stored, many industries are becoming interested in
mining such patterns from their databases.
 A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets”
 Example: Suppose, as manager of an AllElectronics branch, you would like to learn
more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?”
To answer your question, market basket analysis may be performed on the retail data of
customer transactions at your store. You can then use the results to plan marketing or
advertising strategies, or in the design of a new catalog.
Computerantivirus software [support = 2%,confidence = 60%].
Frequent Itemset Mining Methods:
Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

A two-step process is followed, consisting of join and prune actions.


1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1
with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The
notation li[j] refers to the jth item in li (e.g., l1[k -2] refers to the second to the last
item in l1).
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be
frequent, but all of the frequent k-itemsets are included in Ck. A database scan to
determine the count of each candidate in Ck would result in the determination of Lk
(i.e., all candidates having a count no less than the minimum support count are
frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so
this could involve heavy computation.
Example:
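Since the worked example table is not reproduced above, the following Python sketch stands in for it: it implements the join and prune steps for one Apriori iteration over toy transactions reconstructed to match the support counts used in the rule-generation example below. The minimum support count of 2 and the simplified (brute-force) join are assumptions for illustration.

# One Apriori iteration: join L(k-1) with itself, prune candidates having an
# infrequent (k-1)-subset, then scan the data to count candidate supports.
from itertools import combinations

transactions = [                      # toy data, reconstructed for illustration
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_sup_count = 2                     # assumed minimum support count

def apriori_gen(prev_frequent, k):
    """Join + prune: build candidate k-itemsets Ck from the frequent (k-1)-itemsets."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            # Join (simplified): keep unions of two (k-1)-itemsets that form a k-itemset.
            if len(union) == k:
                # Prune: every (k-1)-subset of the candidate must already be frequent.
                if all(frozenset(s) in prev_frequent
                       for s in combinations(union, k - 1)):
                    candidates.add(frozenset(union))
    return candidates

def frequent_itemsets(candidates):
    """Database scan: keep candidates whose count meets the minimum support count."""
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_sup_count}

items = {frozenset([i]) for t in transactions for i in t}
L1 = frequent_itemsets(items)                   # frequent 1-itemsets
L2 = frequent_itemsets(apriori_gen(L1, 2))      # join/prune L1, then scan for L2
print(sorted(tuple(sorted(s)) for s in L2))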
Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence).

For a rule A => B, the confidence is the conditional probability P(B|A), expressed in terms of
itemset support counts as confidence(A => B) = support_count(A ∪ B) / support_count(A),
where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and
support_count(A) is the number of transactions containing the itemset A.
Generating association rules:
Let’s try an example based on the AllElectronics transactional data discussed earlier. The
data contain the frequent itemset X = {I1, I2, I5}. What are the association rules that can
be generated from X? The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1},
{I2}, and {I5}. The resulting association rules are shown below, each listed with its
confidence:
{I1, I2}=>I5, confidence = 2/4 = 50%
{I1, I5}=>I2, confidence = 2/2 = 100%
{I2, I5}=>I1, confidence = 2/2 = 100%
I1=> {I2, I5}, confidence = 2/6 = 33%
I2=> {I1, I5}, confidence = 2/7 = 29%
I5=> {I1, I2}, confidence = 2/2 = 100%
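The computation behind this list can be sketched in a few lines of Python. The support counts below are taken from the example above, and the 70% minimum confidence threshold is an assumption used only to mark which rules would count as strong.

# Generate all rules from the frequent itemset X = {I1, I2, I5} and compute
# confidence(A => B) = support_count(X) / support_count(A) for each antecedent A.
from itertools import combinations

support_count = {                       # counts implied by the worked example above
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
}

X = frozenset({"I1", "I2", "I5"})
min_conf = 0.70                         # assumed minimum confidence threshold

for r in range(1, len(X)):
    for antecedent in map(frozenset, combinations(sorted(X), r)):
        consequent = X - antecedent
        conf = support_count[X] / support_count[antecedent]
        tag = "strong" if conf >= min_conf else "below min_conf"
        print(f"{sorted(antecedent)} => {sorted(consequent)}: "
              f"confidence = {conf:.0%} ({tag})")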

Improving the Efficiency of Apriori:

Figure: Hash table, H2, for candidate 2-itemsets (illustrating the hash-based technique described
below). This hash table is generated by scanning the transactions while determining L1. If the
minimum support count is, say, 3, then the itemsets in buckets 0, 1, 3, and 4 cannot be frequent
and so they should not be included in C2.

Hash-based technique (hashing itemsets into corresponding buckets):


 A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for
k > 1.
 For example, when scanning each transaction in the database to generate the frequent 1-
itemsets, L1, we can generate all the 2-itemsets for each transaction, hash (i.e., map) them
into the different buckets of a hash table structure, and increase the corresponding bucket
counts.
 A 2-itemset with a corresponding bucket count in the hash table that is below the support
threshold cannot be frequent and thus should be removed from the candidate set. Such a
hash-based technique may substantially reduce the number of candidate k-itemsets
examined.
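A minimal sketch of this idea, assuming an invented hash function and bucket count: while counting 1-itemsets, every 2-itemset of every transaction is hashed into a small table, and any 2-itemset that lands only in a low-count bucket can later be excluded from C2.

# Hash all 2-itemsets of each transaction into buckets while scanning for L1;
# 2-itemsets whose bucket count is below the support threshold cannot be frequent.
from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}]
num_buckets = 7                        # assumed table size
min_sup_count = 2

def bucket(pair):
    """Hypothetical hash function: fold the item names into num_buckets buckets."""
    return sum(ord(ch) for item in pair for ch in item) % num_buckets

bucket_counts = [0] * num_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

def may_be_frequent(pair):
    """A 2-itemset in a bucket whose total count is below min_sup_count is pruned."""
    return bucket_counts[bucket(pair)] >= min_sup_count

print(bucket_counts)
print(may_be_frequent(("I1", "I2")))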
Transaction reduction (reducing the number of transactions scanned in future iterations):
 A transaction that does not contain any frequent k-itemsets cannot contain any frequent
(k+1)-itemsets. Therefore, such a transaction can be marked or removed from further
consideration, because subsequent database scans for j-itemsets, where j > k, will not need
to consider such a transaction.
Partitioning (partitioning the data to find candidate itemsets):
 A partitioning technique can be used that requires just two database scans to mine the
frequent itemsets.
 It consists of two phases. In phase 1, the algorithm divides the transactions of D into n
nonoverlapping partitions. If the minimum relative support threshold for transactions in D
is min_sup, then the minimum support count for a partition is min_sup × the number of
transactions in that partition.
 For each partition, all the local frequent itemsets (i.e., the itemsets frequent within the
partition) are found.

Figure 5: Mining by partitioning the data.


 A local frequent itemset may or may not be frequent with respect to the entire database,
D. However, any itemset that is potentially frequent with respect to D must occur as a
frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are
candidate itemsets with respect to D. The collection of frequent itemsets from all
partitions forms the global candidate itemsets with respect to D.
 In phase 2, a second scan of D is conducted in which the actual support of each candidate
is assessed to determine the global frequent itemsets. Partition size and the number of
partitions are set so that each partition can fit into main memory and therefore be read
only once in each phase.
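A compact Python sketch of the two-phase idea follows; the toy data, partition size, and the pairs-only stand-in for a per-partition Apriori run are all assumptions made to keep the example short.

# Two-phase partitioning: (1) mine local frequent itemsets in each partition,
# (2) rescan the whole database D once to count the union of local candidates.
import math
from itertools import combinations

transactions = [{"I1", "I2"}, {"I2", "I3"}, {"I1", "I2"}, {"I1", "I3"},
                {"I2", "I3"}, {"I1", "I2", "I3"}]
min_rel_sup = 0.5                  # minimum relative support over all of D
partition_size = 3                 # chosen so a partition fits in main memory

def local_frequent_itemsets(partition, min_count):
    """Stand-in for running Apriori on one in-memory partition (2-itemsets only)."""
    counts = {}
    for t in partition:
        for pair in combinations(sorted(t), 2):
            counts[frozenset(pair)] = counts.get(frozenset(pair), 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

# Phase 1: the local frequent itemsets from all partitions form the global candidates.
candidates = set()
for i in range(0, len(transactions), partition_size):
    part = transactions[i:i + partition_size]
    local_min = math.ceil(min_rel_sup * len(part))
    candidates |= local_frequent_itemsets(part, local_min)

# Phase 2: one full scan of D determines which candidates are globally frequent.
global_min = min_rel_sup * len(transactions)
frequent = {c for c in candidates if sum(c <= t for t in transactions) >= global_min}
print([sorted(s) for s in frequent])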
Sampling (mining on a subset of the given data):
 The basic idea of the sampling approach is to pick a random sample S of the given data D,
and then search for frequent itemsets in S instead of D. In this way, we trade off some
degree of accuracy against efficiency.
 The size of S is such that the search for frequent itemsets in S can be done in main
memory, and so only one scan of the transactions in S is required overall. Because we are
searching for frequent itemsets in S rather than in D, it is possible that we will miss some
of the global frequent itemsets.
Dynamic itemset counting (adding candidate itemsets at different points during a scan):
 A dynamic itemset counting technique was proposed in which the database is partitioned
into blocks marked by start points.
 In this variation, new candidate itemsets can be added at any start point, unlike in Apriori,
which determines new candidate itemsets only immediately before each complete
database scan.
 The technique uses the count-so-far as the lower bound of the actual count. If the count-
so-far passes the minimum support, the itemset is added into the frequent itemset
collection and can be used to generate longer candidates.
 This leads to fewer database scans than with Apriori for finding all the frequent itemsets.
