Rapid Association Rule Mining
All content following this page was uploaded by Yew-Kwong Woon on 03 June 2014.
section reviews related work. Section 3 gives a description of the problem while Section 4 presents the new tree structure. Section 5 describes the RARM algorithm. Time and space complexity of the new structure will be examined in Section 6. Performance evaluation is discussed in Section 7. Finally, the paper is concluded and recommendations for future work are made in Section 8.

2. RELATED WORK

Since the introduction of the Apriori algorithm [2] in 1994, there has been sustained interest in discovering new association rule mining algorithms that could perform more efficiently. Incremental algorithms are presented in [5, 6, 11] and algorithms which support dynamic support thresholds are introduced in [1, 8]. However, in this paper, we are only interested in breaking the speed barrier. Hence, in this section, we will only look at two influential algorithms which achieve impressive speed-ups against Apriori.

The Direct Hashing and Pruning (DHP) algorithm [9] is the next widely used algorithm for the efficient mining of association rules. It is built on top of Apriori but it employs a hashing technique to reduce the size of the candidate itemsets and the database. This amounts to significant speed-ups because the dominating factor in the generation of large itemsets is the size of the candidate itemsets, particularly that of the candidate 2-itemsets. Park et al. divide the DHP algorithm into the following three main parts:

1. A set of large 1-itemsets is obtained and a hash table for 2-itemsets is built.

2. Candidate k-itemsets are generated based on the hash table built from the previous iteration. The hash table serves as a filtering mechanism; DHP only adds an itemset to the set of candidate k-itemsets if it is hashed to an entry whose value is larger than or equal to the minimum support count needed. Large k-itemsets are determined and a hash table for the candidate (k+1)-itemsets is built as well.

3. Same as part 2 except that hash tables are not used. This is because part 3 is only used in later iterations where significantly fewer candidate itemsets are generated.

It is obvious that DHP incurs additional overheads due to the need to do hashing and to maintain a hash table. Therefore, after experiments are performed, it is concluded that the hashing technique should only be applied during the generation of candidate 2-itemsets and not for the generation of larger candidate itemsets. This allows DHP to achieve speed-ups of up to three times against Apriori.

Recently, Han et al. proposed the Frequent Pattern Growth (FP-growth) [7] algorithm which achieves impressive results compared to Apriori. It completely eliminates the candidate generation bottleneck by using a new tree structure called the Frequent Pattern tree (FP-tree) which stores critical itemset information. This compact structure also removes the need for repeated database scans and it is constructed using only two scans. In the first database scan, large 1-itemsets L1 are obtained and sorted in support descending order. In the second scan, items in the transactions are first sorted according to the order of L1. These sorted items are used to construct the FP-tree. Refer to [7] for the construction details and completeness proof of the FP-tree.

FP-growth then proceeds to recursively mine FP-trees of decreasing size to generate large itemsets without candidate generation and database scans. It does so by examining all the conditional pattern bases of the FP-tree, which consist of the sets of large itemsets occurring with the suffix pattern. Conditional FP-trees are constructed from these conditional pattern bases and mining is carried out recursively with such trees to discover large itemsets of various sizes. However, the construction and use of the FP-trees are complex and cause the performance of FP-growth to be on par with Apriori at support thresholds of 3% and above. It only shows significant speed-ups at support thresholds of 1.5% and below.

3. PROBLEM DESCRIPTION

The problem of mining association rules can be formally described as follows: Let I = {a1, a2, ..., aN} be a set of literals called items. Let D be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. An itemset is a set of items and a k-itemset is an itemset that contains exactly k items. For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let σX be the support count of an itemset X, which is the number of transactions in D that contain X. Let s be the support threshold and |D| be the number of transactions in D. An itemset X is large or frequent if σX ≥ |D| × s%. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in the database D with confidence c% if no less than c% of the transactions in D that contain X also contain Y. The association rule X ⇒ Y has support s% in D if σX∪Y = |D| × s%.

For a given pair of confidence and support thresholds, the problem of mining association rules is to discover all rules that have confidence and support greater than the corresponding thresholds. For example, in a computer hardware shop, the association rule Scanner ⇒ Printer means that whenever customers buy scanners, they also buy printers c% of the time and this trend occurs s% of the time.

This problem can be decomposed into two sub-problems as discussed in [2]:

1. Finding of large itemsets

2. Generation of association rules from large itemsets

Most researchers addressed the first sub-problem only because it is more computationally expensive. Moreover, once the large itemsets are identified, the corresponding association rules can be generated in a simple and straightforward manner. Hence, only the first sub-problem will be addressed in this paper. For a discussion of optimized algorithms for the generation of association rules, more information can be found in [1].

4. DATA STRUCTURE

This section first describes a general trie-like structure, followed by an enhanced version of it and finally the most optimized version. A trie is a tree structure whose organization is based on a key space decomposition. In key space decomposition, the key range is equally subdivided and the
splitting position within the key range for each node is predefined. An alphabet trie is a trie used to store a dictionary of words. The following data structures will be based on the alphabet trie.

Table 1: An example transaction database with N = 4.

4.1 A Complete TrieIT

Given a database D of transactions, we store it as a forest of lexicographically-ordered tree nodes known as Trie Itemset (TrieIT) so that the first sub-problem of association rule mining (finding large itemsets) can be done efficiently. Let the set of items I = {a1, a2, ..., aN} in the database be ordered so that for any two items ai ∈ I, aj ∈ I (1 ≤ i, j ≤ N), ai ≺ aj if and only if i < j. Likewise, every transaction T ∈ D is also ordered with respect to the ordering in I.

DEFINITION 1. A complete TrieIT is a set of tree nodes such that every tree node w is a 2-tuple (wl, wc) where wl ∈ I is the label of the node and wc is a support count. Since every tree node corresponds to some item ai ∈ I, for brevity, we also use wi to refer to a tree node that corresponds to ai ∈ I. The following conditions hold:

1. Let C(wi) be the ordered set of children nodes of wi. If C(wi) ≠ ∅, then C(wi) ⊆ {wi+1, wi+2, ..., wN}.

2. Given a node wi, let wk, wk+1, ..., wi-1 (1 ≤ k ≤ i-1) be the set of nodes on the path from the root to the parent of wi; then wc (the count of wi) is a count of the itemset {ak, ak+1, ..., ai} in the database. Hence, the support count of any k-itemset can be obtained by following a set of nodes to a depth of k.

Each complete TrieIT Wi corresponds to some ai ∈ I such that the root node has label ai. Then D is stored as a set of complete TrieITs denoted by W where W ⊆ {W1, W2, ..., WN}.

In order to improve the efficiency of finding large itemsets using complete TrieITs, every transaction T ∈ D is inserted into W as follows: Every ordered itemset X ∈ P(T) (P(T) is the powerset of T) increases the count by one of node wj of TrieIT Wi where ai and aj are the first and last items of X respectively. If TrieIT Wi or node wj does not exist, it is created with an initial count of one. Thus, each T ∈ D updates the support counts of all its sub-itemsets (from its powerset) in the corresponding TrieITs of W. Hence, no database scanning is required subsequently as the support counts are already stored in the TrieITs. The following example illustrates the concept of TrieITs and how transactions update the TrieITs.

4.1.1 Example

Figure 1(a) shows the TrieITs after the transactions 100 to 300 in Table 1 have been inserted into the tree. The numbers in brackets are the support counts. The special node ROOT simply contains pointers to the root nodes of the TrieITs. To understand how the TrieITs are updated, let us see what happens when transaction 400 is inserted. From transaction 400, the following itemsets are extracted:

• 1-itemsets: {A}, {B}, {C}, {D}
• 2-itemsets: {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
• 3-itemsets: {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}
• 4-itemsets: {A, B, C, D}

They are used to increment the support counts of their corresponding nodes in the TrieITs by one. To locate the node that corresponds to an itemset {ak, ak+1, ..., ai}, simply trace the path {ak, ak+1, ..., ai} from the ROOT node. If a node does not exist, create a new node and initialize it with a support count of 1. Here, we need to create seven nodes with the label D to represent itemsets ending with the item D. The resultant set of complete TrieITs is shown in Figure 1(b). This set of TrieITs contains the maximum number of tree nodes because all N unique items appear in the database.

To obtain the support count of an itemset, simply locate its path along the tree nodes. The node representing the last item in the itemset would contain the support count for the itemset. For example, to find the support count of itemset {A, B, D}, locate the path from the ROOT node to the node containing the last item D. Thus, from Figure 1(b), the support count of itemset {A, B, D} is 1. This search is efficient because the nodes are sorted lexicographically. However, the set of complete TrieITs requires a huge amount of memory space which scales up exponentially with N.

4.2 Support-Ordered Trie Itemset

This approach builds on the ideas presented in the paper on DHP [9]. In that paper, it is discovered that the generation of large 2-itemsets is the main performance bottleneck. Using a hash table, DHP is able to improve performance significantly by reducing the size of the candidate 2-itemsets. Similarly, this approach seeks to find a data structure that allows for quick generation of large 2-itemsets without the heavy processing and memory requirements of the previous two structures. The solution is a 2-level support-ordered trie which is called a SOTrieIT (Support-Ordered Trie Itemset). It is unnecessary and inefficient to go beyond two levels because the memory requirements will far outweigh the computation savings. This will be discussed in greater detail in Section 7.

DEFINITION 2. A SOTrieIT is a complete TrieIT with a depth of 1; i.e., it consists only of a root node wi and some child nodes. Moreover, all nodes in the forest of SOTrieITs are sorted according to their support counts in descending order from the left.

In constructing the set of SOTrieITs Y from a database D, only 1-itemsets and 2-itemsets X ∈ P(T) of each transaction T ∈ D are used to update the support counts in Y. In other words, the set of SOTrieITs only keeps a record of all 1-itemsets and 2-itemsets contained in a transaction. The first-level nodes represent 1-itemsets while second-level nodes represent 2-itemsets. Henceforth, we shall use the term SOTrieIT to denote a set of SOTrieITs.
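The two-level, support-ordered bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class name, field names and sample transactions are our own, and a real SOTrieIT is a linked tree rather than a pair of dictionaries.

```python
from itertools import combinations

class SOTrieIT:
    """Sketch of a SOTrieIT: first-level nodes hold 1-itemset counts,
    second-level nodes hold 2-itemset counts. Items are kept in
    lexicographic order so each 2-itemset has one canonical key."""

    def __init__(self):
        self.c1 = {}  # item -> support count (first-level nodes)
        self.c2 = {}  # (item_i, item_j) with i < j -> support count

    def insert(self, transaction):
        items = sorted(set(transaction))
        for a in items:                      # all 1-itemsets
            self.c1[a] = self.c1.get(a, 0) + 1
        for a, b in combinations(items, 2):  # all 2-itemsets
            self.c2[(a, b)] = self.c2.get((a, b), 0) + 1

    def ordered_level1(self):
        # siblings in descending support-count order, as in Definition 2
        return sorted(self.c1.items(), key=lambda kv: -kv[1])

y = SOTrieIT()
for t in [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]:  # hypothetical data
    y.insert(t)
print(y.c1["B"], y.c2[("A", "B")])  # B appears 3 times, {A, B} twice
```

Sorting a level on demand (rather than keeping siblings permanently ordered, as the paper does on every update) keeps the sketch short; the support counts obtained are identical.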
Figure 1: Resultant Complete TrieIT: (a) after transactions 100 to 300; (b) after transaction 400.
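The counting scheme behind Figure 1 (every ordered sub-itemset of a transaction gets its own counted path) can be sketched as follows. The transactions used here are hypothetical stand-ins, not the actual contents of Table 1.

```python
from itertools import combinations

def insert_transaction(counts, transaction):
    """Increment the count of every non-empty ordered sub-itemset of a
    transaction, as inserting it into a complete TrieIT would: the path
    to a node spells out the itemset it counts."""
    items = sorted(set(transaction))
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            counts[itemset] = counts.get(itemset, 0) + 1

counts = {}
for t in [{"A", "B", "C"}, {"A", "B", "D"}]:  # hypothetical transactions
    insert_transaction(counts, t)

# the support count of {A, B} is read off one path, with no database scan
print(counts[("A", "B")])  # -> 2
```

The exponential memory cost noted above is visible here too: a transaction of size n contributes 2^n - 1 sub-itemsets.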
By keeping track of the support counts of all 1-itemsets and 2-itemsets, the SOTrieIT allows both large 1-itemsets and 2-itemsets to be found very quickly. This is because there is no need to scan the database, which could be very large. Instead, only the small SOTrieIT is scanned. Moreover, as the SOTrieIT is sorted according to the support counts of the itemsets, only part of the structure needs to be scanned. This feature is elaborated below in the discussion of the RARM algorithm. Finally, the amount of pre-processing work and memory needed are greatly reduced.

4.2.1 Example

Figure 2 represents the fully constructed SOTrieIT for the example transaction database in Table 1. To illustrate how nodes are created, let us examine what happens when a new transaction arrives. Note that only 1-itemsets and 2-itemsets are extracted from the transactions. The nodes are created in a similar manner as in the complete TrieITs. When transactions 100 and 200 arrive, the nodes created are the same as those created for the complete TrieITs as shown in Figures 2(a) and 2(b) because only 1-itemsets and 2-itemsets are extracted in both cases. However, notice that in Figure 2(b), the node wC under the ROOT node comes before the node wA. This is because the nodes are sorted according to their support counts and wC has a higher support count than wA.

When transaction 300 arrives, the following itemsets are extracted: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}. Unlike the situation for the complete TrieITs, the itemset {A, B, C} is not extracted from the transaction. Figure 2(c) shows the resultant SOTrieIT when this transaction is processed. When transaction 400 arrives, the following itemsets are extracted: {A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}. The SOTrieITs are updated in a similar fashion for transaction 400 as seen in Figure 2(d). The final resultant SOTrieIT resembles the complete set of TrieITs except that it consists of only three levels of nodes and that sibling nodes are sorted according to their support counts.

4.3 Correctness

We need to show that with a SOTrieIT, the support counts of all 1-itemsets and 2-itemsets can be correctly obtained without scanning the database. Let Ts be a transaction of size s and Ts = {b1, b2, ..., bs}. The 1-itemsets that are extracted and used to build W are {b1}, {b2}, ..., {bs} and the 2-itemsets extracted are {bx, by} where 0 < x < s and x < y ≤ s. These itemsets update counts in the SOTrieITs accordingly. Every itemset increments or decrements the support count of its corresponding tree node depending on whether the transaction is added or deleted.

At any point in time, W contains all the support counts of all 1-itemsets and 2-itemsets that appear in all the transactions. Hence, there is no longer any need to scan the database during the generation of large 1-itemsets and 2-itemsets.

5. ALGORITHM RARM

In this section, the RARM algorithm that uses a SOTrieIT is described.

5.1 Pre-processing

Figure 3 shows the pre-processing steps taken whenever a transaction arrives. For every transaction that arrives, 1-itemsets and 2-itemsets are first extracted from it. For each itemset, the SOTrieIT, denoted by Y, will be traversed in order to locate the node that stores its support count. Support counts of 1-itemsets and 2-itemsets are stored in first-level and second-level nodes respectively. Therefore, this traversal requires at most two redirections, which makes it very fast. Y will then be sorted level-wise from left to right according to the support counts of the nodes in descending order. If such a node does not exist, it will be created and inserted into Y accordingly. Similarly, Y is then sorted after such an insertion.

5.2 Mining of large itemsets

Figure 4 shows the steps taken when the mining process is started. The SOTrieIT, Y, is first traversed to discover large 1-itemsets and 2-itemsets. In our approach, depth-first search is used, starting from the leftmost first-level node. As Y is sorted according to support counts, the traversal can be stopped the moment a node is found not to satisfy the minimum support threshold. After large 1-itemsets and 2-itemsets are found, the algorithm proceeds to discover other larger itemsets using the Apriori algorithm. Experiments in
Figure 2: Resultant SOTrieIT.
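The benefit of the support-ordering shown in Figure 2 is that a level can be scanned left to right and abandoned at the first node that falls below the threshold. A minimal sketch of that traversal, using flat support-sorted lists in place of the actual tree levels (the node values are hypothetical):

```python
def mine_level(sorted_nodes, min_count):
    """sorted_nodes: [(itemset, support)] in descending support order,
    mirroring one level of the SOTrieIT. Because siblings are sorted,
    the scan stops at the first node below the threshold."""
    large = []
    for itemset, support in sorted_nodes:
        if support < min_count:
            break  # every node to the right is smaller: stop early
        large.append(itemset)
    return large

# hypothetical support-ordered first and second levels
level1 = [(("B",), 4), (("A",), 3), (("C",), 1)]
level2 = [(("A", "B"), 3), (("B", "C"), 1)]

print(mine_level(level1, 2))  # -> [('B',), ('A',)]
print(mine_level(level2, 2))  # -> [('A', 'B')]
```

This early termination is why only part of the structure needs to be examined at higher support thresholds.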
is only two levels deep, it takes at most two links to reach the desired node. Suppose it also takes one unit of time to move over one link and update/create the node, and one unit of time to extract one itemset; then it will take a maximum of 2 × (sC1 + sC2) units of time to move to all the nodes required by a transaction of size s.

In the worst case, N first-level nodes and N(N-1)/2 second-level nodes are needed to store the entire pre-processing information. Hence, its complexity is O(N²).
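Under the unit-cost assumptions above, both bounds are easy to check numerically; `math.comb(s, k)` gives the binomial coefficients sC1 and sC2:

```python
from math import comb

def preprocessing_time(s):
    # at most 2 links to reach each node; one node per 1-itemset (sC1)
    # and one per 2-itemset (sC2) of a transaction of size s
    return 2 * (comb(s, 1) + comb(s, 2))

def max_nodes(n):
    # worst case: N first-level nodes plus N(N-1)/2 second-level nodes
    return n + n * (n - 1) // 2

print(preprocessing_time(10))  # 2 x (10 + 45) = 110 time units
print(max_nodes(1000))         # 500500 nodes, i.e. O(N^2)
```

For realistic transaction sizes the per-transaction cost stays small, which is consistent with the pre-processing timings reported in Section 7.3.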
Parameter  Meaning
N          Number of unique items
|D|        Number of transactions
|L|        Number of maximal potentially large itemsets
|T|        Average size of transactions
|I|        Average size of maximal potentially large itemsets

Figure 7: Definition of Parameters.

improvements of RARM over Apriori. Figure 7 shows the various parameters used and their meanings.

The method used for generating synthetic data is the same as the one used in [2]. To describe an experiment, we use the notation Tw.Ix.Ny.Dz where w is the average size of transactions, x is the average size of maximal potentially large itemsets, y is the number of unique items and z is the size of the database. The databases created are similar to those used in [7]. The first one is T25.I10.N1K.D10K, which is denoted as D1, while the second one is T25.I20.N10K.D100K, which is denoted as D2.

7.1 Comparison of RARM and Apriori

Figure 8 shows the execution times of both Apriori and RARM for the two different databases. From the graphs, it can be quickly observed that RARM outperforms Apriori in all situations. In Figure 8(a), RARM maintains a steady speed-up of at least 20 times for support thresholds ranging from 3% to 1.5% in D1. However, when the support threshold falls below 1.5%, the speed-up is significantly reduced. This is due to the fact that larger-sized frequent itemsets exist and RARM uses the Apriori algorithm for discovering large k-itemsets for k ≥ 3. Hence, the computation savings in the first two iterations are insignificant compared to the huge computation costs in obtaining larger frequent itemsets.

The situation changes dramatically in D2. Figure 8(b) uses a log scale for the time axis because of the vast difference between the execution times of RARM and Apriori. RARM performs at least 70 times faster than Apriori for support thresholds ranging from 3% to 1%. Its performance peaks at a support threshold of 2%, where it performs 120 times faster than Apriori. We cannot determine the execution time of Apriori for a support threshold of 0.5% because there is insufficient memory to hold the number of candidate 2-itemsets. RARM does not have this memory problem because it can obtain large 2-itemsets without generating candidate 2-itemsets. The lowest support threshold that allows us to compare the performance of the two algorithms is 0.75% and at this threshold, RARM outperforms Apriori by a factor of 47.

The vast improvement of RARM in D2 can be explained by Figure 9, which shows the number of candidate k-itemsets generated during the mining of D1 and D2 for a support threshold of 0.75%. From Figure 9, it is obvious that the main difference in candidate generation between D1 and D2 is in the generation of candidate 2-itemsets and that D1 has larger-sized candidate itemsets. Thus, by eliminating the need for candidate 2-itemset generation, RARM achieves a much greater speed-up in D2 as it contains more than twice the number of candidate 2-itemsets as compared to D1. As for higher support thresholds, RARM is able to outperform Apriori by up to two orders of magnitude because in D2, the maximum size of the large itemsets, k', is much lower than that of D1. If k' increases indefinitely, the performance of RARM will eventually be reduced to that of Apriori. However, we can conclude from the experiments that as databases increase in size, k' will decrease and thus RARM will scale up well against Apriori.

Figure 9: Size of candidate k-itemsets for D1 and D2 at a support threshold of 0.75%.

7.2 Comparison of RARM and FP-growth

Due to the lack of time, FP-growth is not implemented but its performance against RARM can be evaluated using Apriori as a basis for relative comparisons. The experiments conducted in [7] reported an overall improvement of only an order of magnitude for FP-growth over Apriori. In addition, the performance of FP-growth is on par with Apriori for support thresholds ranging from 3% to 1.5% in D2. The poor performance of FP-growth can be attributed to the cost of recursively constructing FP-trees. Hence, significant speed-ups can only be noticed at lower support thresholds when Apriori cannot cope with the exponential increase in candidate itemset generation. This is undesirable because we want to mine databases at a wide variety of support thresholds quickly instead of only at low support thresholds. RARM overcomes this limitation of FP-growth and consistently outperforms Apriori at all support thresholds, and experiments show that it can even perform up to two orders of magnitude faster than Apriori. In future, FP-growth will be implemented to directly assess its performance against RARM, but in this case, it is clear that RARM will outperform FP-growth.

7.3 Pre-processing Requirements

As pre-processing is carried out on a transaction at the moment it arrives in the database, it is distributive by nature and thus will not burden a system excessively. RARM spends an average of only 22 ms and 48 ms in pre-processing a single transaction found in D1 and D2 respectively. This amount of time is insignificant considering that it will result in major speed-ups in the mining process. This requirement should not be taken into consideration in comparing the performance of RARM and Apriori because pre-processing is done outside of the actual mining process itself.
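Apriori's memory failure at the 0.5% threshold is consistent with how candidate 2-itemsets are formed: every pair of large 1-itemsets becomes a candidate, so the candidate set grows quadratically in the number of large 1-itemsets. The item counts below are illustrative, not measured values from D2:

```python
from math import comb

def candidate_2_itemsets(num_large_1_itemsets):
    # Apriori joins every pair of large 1-itemsets into a candidate pair
    return comb(num_large_1_itemsets, 2)

for n1 in (1000, 5000, 10000):
    print(n1, candidate_2_itemsets(n1))
# 10,000 large 1-itemsets already yield ~50 million candidate pairs
```

RARM sidesteps this entirely because large 2-itemsets are read directly out of the SOTrieIT, with no candidate set to hold in memory.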
Figure 8: Execution times for the two databases of the form Tw.Ix.Ny.Dz, where w is the average size of transactions, x is the average size of maximal potentially large itemsets, y is the number of unique items and z is the size of the database, at varying support thresholds.