CHAPTER 5
CONCEPT DESCRIPTION AND ASSOCIATION RULE MINING
SUBJECT: DMBI          CODE: 2170715
PREPARED BY: ASST. PROF. NENSI KANSAGARA (CSE DEPARTMENT, ACET)
WHAT IS CONCEPT DESCRIPTION?
❖ Descriptive vs. predictive data mining
❖ Descriptive mining:
  ➢ describes concepts or task-relevant data sets in concise, summarized,
      informative, and discriminative forms
❖ Predictive mining: Based on data and analysis, constructs models for the
  database, and predicts the trend and properties of unknown data
❖ Concept description:
  ➢ Characterization: provides a concise and succinct summarization of the
      given collection of data
  ➢ Comparison: provides descriptions comparing two or more collections of
      data
WHAT IS CONCEPT DESCRIPTION?
❖ A concept usually refers to a collection of data such as
    frequent_buyers, graduate_students etc.
❖ As a data mining task, concept description is not a simple
    enumeration (a one-by-one listing) of the data.
❖ Concept description generates descriptions for characterization and
    comparison of the data; it is also called class description.
WHAT IS CONCEPT DESCRIPTION?
❖   Characterization provides a concise and brief summarization of the data.
❖   Concept or class comparison (also known as discrimination), in contrast, provides descriptions
    that distinguish two or more collections of data from one another.
❖   Example
    ➢    Given the ABC Company database, for example, rather than examining individual customer
         transactions, sales managers may prefer to view the data generalized to higher levels,
         such as summarized by customer groups according to geographic region, frequency of
         purchases per group, and customer income.
DATA GENERALIZATION AND
SUMMARIZATION
❖ Data generalization
  ➢ A process which abstracts a large set of task-relevant data in a
     database from low conceptual levels to higher ones (a small
     illustration follows below).
❖ Approaches:
  ➢ Data cube approach(OLAP approach)
  ➢ Attribute-oriented induction approach
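❖ As a tiny illustration of how generalization raises data to a higher conceptual level (the attribute
  names and the concept hierarchy below are invented for this sketch, not taken from the slides),
  low-level values are replaced by higher-level concepts and identical generalized tuples are merged
  with a count:

from collections import Counter

# Hypothetical task-relevant tuples at a low conceptual level: (city, age).
tuples = [("Rajkot", 21), ("Surat", 23), ("Rajkot", 22), ("Pune", 35)]

# Illustrative concept hierarchies: city -> state, age -> age range.
city_to_state = {"Rajkot": "Gujarat", "Surat": "Gujarat", "Pune": "Maharashtra"}
def age_range(age):
    return "20-29" if age < 30 else "30-39"

# Generalize each tuple and accumulate counts for the generalized relation.
generalized = Counter((city_to_state[c], age_range(a)) for c, a in tuples)
print(generalized)   # ('Gujarat', '20-29'): 3, ('Maharashtra', '30-39'): 1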
DATA GENERALIZATION
PRESENTATION OF GENERALIZED RESULTS
❖ Generalized relation:
  ➢ Relations where some or all attributes are generalized, with counts or other
      aggregation values accumulated.
❖ Cross tabulation:
  ➢ Mapping results into cross tabulation form (similar to contingency tables).
❖ Visualization techniques:
  ➢ Pie charts, bar charts, curves, cubes, and other visual forms.
❖ Quantitative characteristic rules:
  ➢ Mapping generalized result into characteristic rules with quantitative
      information associated with it
ATTRIBUTE RELEVANCE
❖ Why?
  ➢ Which dimensions should be included?
  ➢ How high level of generalization?
  ➢ Automatic vs. interactive generalization control
  ➢ Reduce the number of attributes; easy-to-understand patterns
❖ What?
  ➢ statistical method for preprocessing data
      ■ filter out irrelevant or weakly relevant attributes
      ■ retain or rank the relevant attributes
  ➢ relevance related to dimensions and levels
  ➢ analytical characterization, analytical comparison
❖ How?
  ➢ Data collection
  ➢ Preliminary relevance analysis
      ■ Perform AOI to remove or generalize attributes
  ➢ Relevance analysis
      ■ Use a relevance analysis measure, e.g. information gain, to identify highly
         relevant dimensions and levels (a small sketch follows below)
      ■ Sort and select the most relevant dimensions and levels
  ➢ Attribute-oriented induction for class description on the selected dimensions/levels
  ➢ OLAP operations (e.g. drilling, slicing) on the resulting rules
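❖ A minimal sketch of the information gain measure mentioned above (the attribute and class values are
  toy data invented for the example):

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, class_labels):
    # Expected reduction in class entropy when splitting on the attribute.
    total = len(class_labels)
    gain = entropy(class_labels)
    for value in set(attribute_values):
        subset = [c for a, c in zip(attribute_values, class_labels) if a == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: how well does "major" discriminate graduate vs. undergraduate students?
major  = ["CS", "CS", "Arts", "Arts", "CS"]
status = ["grad", "grad", "undergrad", "undergrad", "undergrad"]
print(information_gain(major, status))   # higher gain = more relevant attribute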
CLASS COMPARISON
ASSOCIATION RULE MINING
MARKET BASKET ANALYSIS
❖ Market Basket Analysis is a modelling technique based upon the theory that if
  you buy a certain group of items, you are more (or less) likely to buy another
  group of items. For example, if you are in an English pub and you buy a pint of
  beer and don't buy a bar meal, you are more likely to buy crisps (US: chips) at the
  same time than somebody who didn't buy beer.
❖ The set of items a customer buys is referred to as an itemset, and market basket
  analysis seeks to find relationships between purchases.
❖ Typically the relationship will be in the form of a rule:
  ➢ IF {beer, no bar meal} THEN {crisps}.
MARKET BASKET ANALYSIS
❖   The fraction of transactions that contain both sides of the rule (beer without a bar meal, and
    crisps) is referred to as the support for the rule. The conditional probability that a customer who
    buys beer without a bar meal will also purchase crisps is referred to as the confidence (a small
    computational sketch follows at the end of this slide).
❖   The algorithms for performing market basket analysis are fairly straightforward (Berry and
    Linhoff is a reasonable introductory resource for this). The complexities mainly arise in
    exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock
    10,000 or more line items), and dealing with the large amounts of transaction data that may
    be available.
❖   A major difficulty is that a large number of the rules found may be trivial for anyone
    familiar with the business. Although the volume of data has been reduced, we are still
    asking the user to find a needle in a haystack. Requiring rules to have a high minimum
    support level and a high confidence level risks missing any exploitable result we might
    have found.
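❖ A rough sketch of computing support and confidence for a rule like IF {beer, no bar meal} THEN
  {crisps}; the transactions below are invented for illustration, and support is taken here as the
  fraction of all transactions covered by both sides of the rule:

# Hypothetical transactions (invented for illustration).
transactions = [
    {"beer", "crisps"},
    {"beer", "bar meal"},
    {"beer", "crisps", "peanuts"},
    {"cola", "crisps"},
    {"beer"},
]

# Rule: IF {beer, no bar meal} THEN {crisps}
antecedent = [t for t in transactions if "beer" in t and "bar meal" not in t]
both = [t for t in antecedent if "crisps" in t]

support = len(both) / len(transactions)     # fraction of all transactions covered by the rule
confidence = len(both) / len(antecedent)    # of antecedent transactions, how many also bought crisps
print(f"support = {support:.2f}, confidence = {confidence:.2f}")   # support = 0.40, confidence = 0.67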
MARKET BASKET ANALYSIS
EXAMPLE OF MARKET BASKET ANALYSIS
ASSOCIATION RULE MINING
❖ Association rule learning is a rule-based machine learning method for discovering interesting
  relations between variables in large databases. It is intended to identify strong rules
  discovered in databases using some measures of interestingness.
ASSOCIATION RULE MINING
❖ Learning of association rules is used to find relationships between attributes in
  large databases. An association rule A => B has the form: "for a set of transactions,
  the presence of itemset A determines the presence of itemset B, provided the minimum
  support and confidence thresholds are met".
❖ Support and Confidence can be represented by the following example rule:
  ➢ bread => butter [support = 2%, confidence = 60%]
❖ The above statement means that 2% of all transactions contain bread and butter together, and
  60% of the customers who bought bread also bought butter.
ASSOCIATION RULE MINING
Association rule mining consists of 2 steps:
     1.   Find all the frequent itemsets.
     2.   Generate association rules from the above frequent itemsets.
EXAMPLE OF ASSOCIATION RULE
FINDING FREQUENT ITEM SETS:
APRIORI ALGORITHM
❖   With the quick growth in e-commerce applications, a vast quantity of data accumulates in
    months, not in years. Data Mining, also known as Knowledge Discovery in Databases (KDD), is used
    to find anomalies, correlations, patterns, and trends to predict outcomes.
❖   Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and
    relevant association rules. It is devised to operate on a database containing a lot of transactions, for
    instance, items bought by customers in a store.
❖   It is very important for effective Market Basket Analysis: it helps customers purchase their
    items with more ease, which increases sales. It has also been used in the field of
    healthcare for the detection of adverse drug reactions (ADRs), producing association rules that
    indicate which combinations of medications and patient characteristics lead to ADRs.
WHAT IS AN ITEMSET?
❖ A set of items together is called an itemset. If an itemset has k items, it
  is called a k-itemset. An itemset may consist of one or more items. An
  itemset that occurs frequently is called a frequent itemset. Thus
  frequent itemset mining is a data mining technique to identify the
  items that often occur together.
  ➢ For Example, Bread and butter, Laptop and Antivirus software,
       etc.
WHAT IS A FREQUENT ITEMSET?
❖ A set of items is called frequent if it satisfies a minimum threshold value for
  support and confidence. Support is the fraction of transactions in which the items are
  purchased together; confidence measures how often the items on the right-hand side appear
  in transactions that contain the items on the left-hand side.
❖ For the frequent itemset mining method, we consider only those itemsets which meet the
  minimum support and confidence thresholds. Insights from these mining algorithms offer
  many benefits, such as cost-cutting and improved competitive advantage.
❖ There is a trade-off between the time taken to mine the data and the volume of data to be
  mined. Efficient frequent mining algorithms uncover the hidden patterns of itemsets within
  a short time and with low memory consumption.
FREQUENT PATTERN MINING
❖   The frequent pattern mining algorithm is one of the most important techniques of data
    mining to discover relationships between different items in a dataset. These relationships
    are represented in the form of association rules. It helps to find the regularities in data.
❖   FPM has many applications in the field of data analysis, software bugs, cross-marketing,
    sale campaign analysis, market basket analysis, etc.
❖   Frequent itemsets discovered through Apriori have many applications in data mining
    tasks, such as finding interesting patterns in the database and finding sequences;
    mining of association rules is the most important of them.
❖   Association rules apply to supermarket transaction data, that is, to examine the customer
    behavior in terms of the purchased products. Association rules describe how often the
    items are purchased together.
WHY FREQUENT ITEMSET MINING?
❖   Frequent itemset or pattern mining is broadly used because of its wide applications in mining
    association rules, correlations, and graph-pattern constraints based on frequent patterns,
    sequential patterns, and many other data mining tasks.
❖   Apriori says:
    ➢ If P(I) < minimum support threshold, then itemset I is not frequent.
    ➢ If P(I + A) < minimum support threshold, then I + A is not frequent either, where A is
         another item added to the itemset.
    ➢ If an itemset has support below the minimum support threshold, then all of its supersets
         also fall below minimum support and can therefore be ignored. This property is called
         the Antimonotone property.
STEPS TO FOLLOW IN APRIORI ALGORITHM
 1. Join Step: This step generates (K+1)-itemsets from K-itemsets by
    joining each itemset with itself.
 2. Prune Step: This step scans the support count of each candidate in the database.
    If a candidate itemset does not meet minimum support, it is regarded as infrequent
    and is removed. This step is performed to reduce the size of the candidate itemsets
    (a small sketch of both steps follows below).
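❖ A minimal sketch of the join and prune steps in Python (the itemsets, transactions and the min_count
  value are invented for the example):

from itertools import combinations

def join(frequent_k, k):
    # Join step: form candidate (k+1)-itemsets by unioning pairs of frequent k-itemsets.
    return {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}

def prune(candidates, frequent_k, transactions, min_count):
    # Prune step: drop candidates that have an infrequent k-subset (antimonotone property)
    # or whose support count in the database is below min_count.
    survivors = set()
    for cand in candidates:
        if all(frozenset(sub) in frequent_k for sub in combinations(cand, len(cand) - 1)):
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:
                survivors.add(cand)
    return survivors

# Tiny illustration.
transactions = [frozenset(t) for t in ({"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"})]
L1 = {frozenset({"A"}), frozenset({"B"}), frozenset({"C"})}
C2 = join(L1, 1)
L2 = prune(C2, L1, transactions, min_count=2)
print(L2)   # {A,B}, {A,C} and {B,C} each appear in 2 transactions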
STEPS IN APRIORI ALGORITHM
Apriori algorithm is a sequence of steps to be followed to find the most frequent itemsets in the given
database. This data mining technique follows the join and the prune steps iteratively until no new
frequent itemsets can be found. A minimum support threshold is given in the problem or is assumed by the
user.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm
counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose occurrence
satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup
are taken ahead to the next iteration; the others are pruned.
#3) Next, the frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemset
candidates are generated by forming pairs of the frequent 1-itemsets.
STEPS IN APRIORI ALGORITHM
#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table will have only
the 2-itemsets with min_sup.
#5) The next iteration will form 3-itemsets using the join and prune steps. This iteration follows the
antimonotone property: the 2-itemset subsets of each candidate 3-itemset must themselves meet min_sup.
If all 2-itemset subsets are frequent then the candidate is kept, otherwise it is pruned.
#6) The next step makes 4-itemsets by joining 3-itemsets with themselves and pruning any candidate whose
subsets do not meet the min_sup criteria. The algorithm stops when no further frequent itemsets can be
generated.
Example of Apriori: Support threshold=50%, Confidence= 60%
Generate Association Rules: From the frequent itemset {I1, I2, I3} discovered above, the candidate rules and their confidences are:
{I1, I2} => {I3}
Confidence = support {I1, I2, I3} / support {I1, I2} = (3/ 4)* 100 = 75%
{I1, I3} => {I2}
Confidence = support {I1, I2, I3} / support {I1, I3} = (3/ 3)* 100 = 100%
{I2, I3} => {I1}
Confidence = support {I1, I2, I3} / support {I2, I3} = (3/ 4)* 100 = 75%
{I1} => {I2, I3}
Confidence = support {I1, I2, I3} / support {I1} = (3/ 4)* 100 = 75%
{I2} => {I1, I3}
Confidence = support {I1, I2, I3} / support {I2} = (3/ 5)* 100 = 60%
{I3} => {I1, I2}
Confidence = support {I1, I2, I3} / support {I3} = (3/ 4)* 100 = 75%
This shows that all the above association rules are strong if minimum confidence threshold is 60%.
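❖ The confidences above can be reproduced in a few lines; the support counts below are the ones used in
  the worked example (support{I1,I2,I3} = 3, support{I1,I2} = 4, and so on), while the function name is
  only illustrative:

# Support counts taken from the example above.
support = {
    frozenset({"I1"}): 4, frozenset({"I2"}): 5, frozenset({"I3"}): 4,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 3, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 3,
}

def confidence(antecedent, consequent):
    # confidence(A => B) = support(A union B) / support(A), as a percentage.
    whole = frozenset(antecedent) | frozenset(consequent)
    return support[whole] / support[frozenset(antecedent)] * 100

print(confidence({"I1", "I2"}, {"I3"}))   # 75.0
print(confidence({"I1", "I3"}, {"I2"}))   # 100.0
print(confidence({"I2"}, {"I1", "I3"}))   # 60.0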
PSEUDO CODE OF APRIORI ALGORITHM
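❖ A compact Python sketch of the pseudo code described by the steps above (variable names are
  illustrative; the example transactions are the six used in the FP-growth walkthrough later in this
  chapter, and min_count = 3 corresponds to the 50% support threshold):

from itertools import combinations

def apriori(transactions, min_count):
    # Return all frequent itemsets (as frozensets) with their support counts.
    transactions = [frozenset(t) for t in transactions]

    def count(itemsets):
        return {s: sum(1 for t in transactions if s <= t) for s in itemsets}

    # First scan: 1-itemset candidates and their counts.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s: c for s, c in count(items).items() if c >= min_count}
    all_frequent, k = dict(frequent), 1

    while frequent:
        # Join step: build (k+1)-itemset candidates from the frequent k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step: antimonotone subset check, then support counting.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {s: c for s, c in count(candidates).items() if c >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
     {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
print(apriori(T, min_count=3))   # includes {I1, I2, I3} with a count of 3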
Advantages
   1. Easy to understand algorithm
   2. Join and Prune steps are easy to implement on large itemsets in large databases
Disadvantages
   1. It requires high computation if the itemsets are very large and the minimum
      support is kept very low.
   2. The entire database needs to be scanned for every iteration.
IMPROVED APRIORI ALGORITHM
Many methods are available for improving the efficiency of the algorithm.
      1.   Hash-Based Technique: This method uses a hash-based structure called a hash table for generating
           the k-itemsets and their corresponding counts. It uses a hash function for generating the table.
      2.   Transaction Reduction: This method reduces the number of transactions scanned in later iterations.
           Transactions which do not contain frequent items are marked or removed (a small sketch follows
           after this list).
      3.   Partitioning: This method requires only two database scans to mine the frequent itemsets. It says
           that for any itemset to be potentially frequent in the database, it should be frequent in at least one of
           the partitions of the database.
      4.   Sampling: This method picks a random sample S from Database D and then searches for frequent
           itemset in S. It may be possible to lose a global frequent itemset. This can be reduced by lowering
           the min_sup.
      5.   Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start
           point of the database during the scanning of the database.
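❖ A minimal sketch of the transaction reduction idea from point 2 above (the data is invented for
  illustration):

def reduce_transactions(transactions, frequent_items):
    # Drop transactions containing no frequent item; they cannot contribute
    # to any frequent itemset in later iterations.
    return [t for t in transactions if t & frequent_items]

transactions = [{"bread", "butter"}, {"pen"}, {"bread", "milk"}]
frequent_items = {"bread", "butter", "milk"}
print(reduce_transactions(transactions, frequent_items))   # the {"pen"} transaction is dropped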
APPLICATION OF ALGORITHM
Some fields where Apriori is used:
    1.   In Education Field: Extracting association rules in data mining of admitted
         students through characteristics and specialties.
    2.   In the Medical field: For example Analysis of the patient's database.
    3.   In Forestry: Analysis of probability and intensity of forest fire with the forest fire
         data.
    4.   Apriori is used by many companies like Amazon in the Recommender System and
         by Google for the auto-complete feature.
INCREMENTAL ARM
❖   It is noted that analysis of past transaction data can provide very valuable information
    on customer buying behavior, and thus improve the quality of business decisions.
❖   Record-based databases are increasingly used, and their data is being
    continuously added, updated, and deleted.
❖   Examples of such applications include Web log records, stock market data, grocery
    sales data, transactions in e-commerce, and daily weather/traffic records etc.
❖   In many applications, we would like to mine the transaction database for a fixed
    amount of most recent data (say, data in the last 12 months).
INCREMENTAL ARM
❖ Mining is not a one-time operation; a naive approach to solve the
  incremental mining problem is to re-run the mining algorithm on the
  entire updated database.
ASSOCIATIVE CLASSIFICATION
❖   Associative classification (AC) is a branch of a wide area of scientific study known
    as data mining. Associative classification makes use of association rule mining for
    extracting efficient rules, which can precisely generalize the training data set, in the
    rule discovery process.
❖   An associative classifier (AC) is a kind of supervised learning model that uses
    association rules to assign a target value. The term associative classification was
    coined by Bing Liu et al., in which the authors defined a model made of rules "whose
    right-hand side are restricted to the classification class attribute".
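❖ A rough sketch of how class association rules (right-hand side restricted to the class attribute)
  might be applied; the rules, attribute values and the highest-confidence tie-breaking policy are
  assumptions made for illustration:

# Each class association rule: (antecedent itemset, predicted class, confidence).
rules = [
    ({"age=young", "student=yes"}, "buys_computer=yes", 0.90),
    ({"income=low"},               "buys_computer=no",  0.75),
]

def classify(record, rules, default="buys_computer=no"):
    # Pick the highest-confidence rule whose antecedent is contained in the record.
    matching = [r for r in rules if r[0] <= record]
    return max(matching, key=lambda r: r[2])[1] if matching else default

print(classify({"age=young", "student=yes", "income=low"}, rules))   # buys_computer=yes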
DATA FLOW DIAGRAM OF ASSOCIATIVE
CLASSIFICATION
FP GROWTH ALGORITHM
❖   The FP-Growth Algorithm, proposed by Han, is an efficient and scalable method for
    mining the complete set of frequent patterns by pattern fragment growth, using an
    extended prefix-tree structure for storing compressed and crucial information about
    frequent patterns, named the frequent-pattern tree (FP-tree).
FREQUENT PATTERN GROWTH ALGORITHM
❖ This algorithm is an improvement to the Apriori method. A frequent pattern is
  generated without the need for candidate generation. FP growth algorithm
  represents the database in the form of a tree called a frequent pattern tree or FP
  tree.
❖ This tree structure will maintain the association between the itemsets. The
  database is fragmented using one frequent item. This fragmented part is called
  “pattern fragment”. The itemsets of these fragmented patterns are analyzed.
  Thus with this method, the search for frequent itemsets is reduced
  comparatively.
FP TREE
❖ Frequent Pattern Tree is a tree-like structure that is made with the
  initial itemsets of the database. The purpose of the FP tree is to mine
  the most frequent pattern. Each node of the FP tree represents an
  item of the itemset.
❖ The root node represents null while the lower nodes represent the
  itemsets. The associations of the nodes with the lower nodes, that is,
  of the itemsets with the other itemsets, are maintained while forming
  the tree.
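❖ A minimal sketch of the node structure such a tree might use (the field names are assumptions, not
  taken from the slides):

class FPNode:
    # One node of an FP tree: an item, its count along this path, and tree links.
    def __init__(self, item, parent=None):
        self.item = item          # None for the root node (null)
        self.count = 0            # incremented each time a transaction passes through this node
        self.parent = parent
        self.children = {}        # item -> child FPNode
        self.node_link = None     # link to the next node carrying the same item

root = FPNode(item=None)          # the root represents null, as described above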
FP GROWTH STEPS
#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the same as
the first step of Apriori. The count of 1-itemsets in the database is called support count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first transaction and find
the itemset in it. The item with the maximum count is taken at the top, the next item with a lower count below it, and
so on. It means that the branch of the tree is constructed with the transaction's items in descending order of count.
#4) The next transaction in the database is examined. Its items are again ordered in descending order of count. If any
prefix of this ordered transaction is already present in another branch (for example from the first transaction), then
this transaction's branch shares that common prefix starting from the root.
This means that the common node is linked to a new node for the remaining items of this transaction.
#5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the common node
and new node count is increased by 1 as they are created and linked according to transactions.
#6) The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the
links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From it, traverse
the path(s) in the FP Tree; this path or these paths are called the conditional pattern base.
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the
lowest node (the suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The itemsets
meeting the threshold support are considered in the Conditional FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
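❖ Steps #1 to #5 can be sketched with a compact nested-dictionary tree (a simplification of the FPNode
  structure above; the header-table and node-link bookkeeping needed from step #6 onward is omitted):

from collections import Counter

def build_fp_tree(transactions, min_count):
    # Steps #1-#2: first scan counts the items; infrequent items are dropped.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}

    # Steps #3-#5: second scan orders each transaction by descending frequency
    # and inserts it into a prefix tree, reusing (and incrementing) shared prefixes.
    tree = {}                                   # the root represents null
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = tree
        for item in path:
            child = node.setdefault(item, {"count": 0, "children": {}})
            child["count"] += 1
            node = child["children"]
    return tree, freq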
Example Of FP-Growth Algorithm
Support threshold=50%, Confidence= 60%
STEP 1
STEP 2
STEP 3: BUILD FP TREE
  1.   Considering the root node null.
  2.   The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1},
       {I3:1}, where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked
       to I1.
  3.   T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to
       I2 and I4 is linked to I3. But this branch would share I2 node as common as it
       is already used in T1.
  4.   Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as
       a child to I3. The count is {I2:2}, {I3:1}, {I4:1}.
  5.   T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
  6.   T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the
       root node, hence it will be incremented by 1. Similarly I1 will be incremented
       by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
  7.   T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3},
       {I3:2}, {I5:1}.
  8.   T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4},
        {I3:3}, {I4:1}.
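❖ Feeding the six transactions above into the build_fp_tree sketch from the previous section reproduces
  the counts of this walkthrough ({I2:5} at the top, {I1:4} below it, {I3:3} below that); ties in the
  frequency ordering are broken alphabetically here, which is an assumption where the walkthrough does
  not specify the order:

# The six transactions T1..T6 from the walkthrough.
T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
     {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]

# min_count=1 keeps every item during construction, as the walkthrough does
# (I5 is only discarded later, during mining).
tree, freq = build_fp_tree(T, min_count=1)
print(tree["I2"]["count"])                                        # 5
print(tree["I2"]["children"]["I1"]["count"])                      # 4
print(tree["I2"]["children"]["I1"]["children"]["I3"]["count"])    # 3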
STEP 4: MINING THE FP TREE IS SUMMARIZED BELOW:
  1.   The lowest node item I5 is not considered as it does not meet the minimum support count, hence it
       is deleted.
  2.   The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore,
       considering I4 as the suffix, the prefix paths are {I2, I1, I3:1} and {I2, I3:1}. This forms the
       conditional pattern base.
  3.   The conditional pattern base is treated as a transaction database, and an FP-tree is constructed
       from it. This will contain {I2:2, I3:2}; I1 is not considered as it does not meet the min support count.
  4.   This path will generate all combinations of frequent patterns : {I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
  5.   For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP-tree {I2:4,
       I1:3}, and frequent patterns are generated: {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
  6.   For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree: {I2:4} and
       frequent patterns are generated: {I2, I1:4}.
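❖ A small illustration of points 2 and 3: the prefix paths and the count threshold of 2 are the ones
  appearing in the walkthrough above, while representing the conditional pattern base as a plain list
  of (path, count) pairs is an assumption of this sketch:

from collections import Counter

# Conditional pattern base for suffix I4: prefix paths with their counts.
conditional_pattern_base = [(["I2", "I1", "I3"], 1), (["I2", "I3"], 1)]

# Count each item across the prefix paths.
counts = Counter()
for path, count in conditional_pattern_base:
    for item in path:
        counts[item] += count

# Keep only items meeting the count threshold used at this point of the walkthrough (2),
# giving the conditional FP tree {I2:2, I3:2}; I1 is dropped.
conditional_tree = {item: c for item, c in counts.items() if c >= 2}
print(conditional_tree)   # {'I2': 2, 'I3': 2}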
Advantages Of FP Growth Algorithm
   1.   This algorithm needs to scan the database only twice when compared to Apriori which
        scans the transactions for each iteration.
   2.   The pairing of items is not done in this algorithm and this makes it faster.
   3.   The database is stored in a compact version in memory.
   4.   It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
   1.   FP Tree is more cumbersome and difficult to build than Apriori.
   2.   It may be expensive.
   3.   When the database is large, the FP tree may not fit in main memory.
FP GROWTH VS APRIORI ALGORITHM
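❖ As a practical note, both algorithms are available in the mlxtend library (this assumes mlxtend and
  pandas are installed; the package and function names below are mlxtend's, everything else is
  illustrative). Given the same minimum support they return the same frequent itemsets; FP growth
  simply avoids candidate generation and repeated database scans:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
                ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Same 50% minimum support, two different algorithms, identical frequent itemsets out.
print(apriori(df, min_support=0.5, use_colnames=True))
print(fpgrowth(df, min_support=0.5, use_colnames=True))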