
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/221614591

Rapid Association Rule Mining
Conference Paper · October 2001
DOI: 10.1145/502585.502665 · Source: DBLP

3 authors, including Wee Keong Ng (Nanyang Technological University) and Yew-Kwong Woon (Airbus Group).

All content following this page was uploaded by Yew-Kwong Woon on 03 June 2014.


Rapid Association Rule Mining

Amitabha Das          Wee-Keong Ng          Yew-Kwong Woon
Nanyang Technological University
Nanyang Avenue, Singapore 639798
asadas@ntu.edu.sg    wkn@acm.org    davidwyk@singnet.com.sg

ABSTRACT

Association rule mining is a well-researched area where many algorithms have been proposed to improve the speed of mining. In this paper, we propose an innovative algorithm called Rapid Association Rule Mining (RARM) to once again break this speed barrier. It uses a versatile tree structure known as the Support-Ordered Trie Itemset (SOTrieIT) structure to hold pre-processed transactional data. This allows RARM to generate large 1-itemsets and 2-itemsets quickly without scanning the database and without candidate 2-itemset generation. It achieves significant speed-ups because the main bottleneck in association rule mining using the Apriori property is the generation of candidate 2-itemsets. RARM has been compared with the classical mining algorithm Apriori and it is found that it outperforms Apriori by up to two orders of magnitude (100 times), much more than what recent mining algorithms are able to achieve.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Data Mining

General Terms

Data Mining

Keywords

Association rule mining, electronic commerce, index data structure

1. INTRODUCTION

Since the dawn of the Internet era in 1994, electronic commerce is growing at such an astonishing rate that companies around the world race to move their business online in order to position themselves for the near future when the Internet would dominate worldwide trading.

In a Forrester Research article, Global eCommerce Approaches Hypergrowth [10], it is estimated that global Internet trade would reach US$6.8 trillion in 2004, amounting to 8.6% of the global sales of goods and services. Hence, it is obvious as well as inevitable that companies will realign their business strategies with the Internet.

One of the most important areas that needs addressing is Customer Relationship Management (CRM) [4], which is now a US$11.5 billion market. In a fast-changing environment like the Internet, data is abundant in the form of transactional data. In electronic commerce, thousands and millions of transactions can easily take place in a single day. Embedded within these data is valuable hidden knowledge about the behavior of customers that could be unraveled with data mining techniques. Hence, the importance of data mining in electronic commerce, particularly in CRM, is undisputed.

Market basket analysis [3], or the mining of association rules, is the most practical and beneficial data mining technique to be used in electronic commerce. This is particularly true in CRM because it focuses on the buying habits of customers and helps to decide which products to be bundled together. It can be employed in a wide variety of applications such as marketing, promotional bundling of products and planning an optimal virtual store layout.

In this paper, a new algorithm called Rapid Association Rule Mining (RARM) is proposed to further push the speed barrier so that association rule mining can be performed more efficiently in electronic commerce. To achieve large speed-ups even at low support thresholds, RARM constructs a new data structure called the Support-Ordered Trie Itemset (SOTrieIT). This trie-like tree structure stores the support counts of all 1-itemsets and 2-itemsets in the database. All transactions that arrive are pre-processed; all 1-itemsets and 2-itemsets are extracted from each transaction. The extracted information is used to update the SOTrieIT. This structure is then sorted according to the support counts of each node in descending order.

RARM uses the SOTrieIT to quickly discover large 1-itemsets and 2-itemsets without scanning the database. The need to generate candidate 1-itemsets and 2-itemsets constitutes the main bottleneck in large itemset generation, as observed in [9]. Therefore, by eliminating this step, RARM achieves significant speed-ups even though subsequently, it also applies the Apriori algorithm to obtain larger-sized itemsets.

Experiments have been conducted to study the performance of RARM and compare it against Apriori [2]. RARM is found to outperform Apriori by up to two orders of magnitude, far exceeding what the latest algorithms can achieve.

The rest of the paper is organized as follows.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'01, November 5-10, 2001, Atlanta, Georgia, USA.
Copyright 2001 ACM 1-58113-436-3/01/0011...$5.00.

The next section reviews related work. Section 3 gives a description of the problem while Section 4 presents the new tree structure. Section 5 describes the RARM algorithm. Time and space complexity of the new structure will be examined in Section 6. Performance evaluation is discussed in Section 7. Finally, the paper is concluded and recommendations for future work are made in Section 8.

2. RELATED WORK

Since the introduction of the Apriori algorithm [2] in 1994, there has been sustained interest in discovering new association rule mining algorithms that could perform more efficiently. Incremental algorithms are presented in [5, 6, 11] and algorithms which support dynamic support thresholds are introduced in [1, 8]. However, in this paper, we are only interested in breaking the speed barrier. Hence, in this section, we will only look at two influential algorithms which achieve impressive speed-ups against Apriori.

The Direct Hashing and Pruning (DHP) algorithm [9] is the next widely used algorithm for the efficient mining of association rules. It is built on top of Apriori but it employs a hashing technique to reduce the size of the candidate itemsets and the database. This amounts to significant speed-ups because the dominating factor in the generation of large itemsets is the size of the candidate itemsets, particularly that of the candidate 2-itemsets. Park et al. divide the DHP algorithm into the following three main parts:

1. A set of large 1-itemsets is obtained and a hash table for 2-itemsets is built.

2. Candidate k-itemsets are generated based on the hash table built in the previous iteration. The hash table serves as a filtering mechanism; DHP only adds an itemset to the set of candidate k-itemsets if it is hashed to an entry whose value is larger than or equal to the minimum support count needed. Large k-itemsets are determined and a hash table for the candidate (k+1)-itemsets is built as well.

3. Same as part 2 except that hash tables are not used. This is because part 3 is only used in later iterations where significantly fewer candidate itemsets are generated.

It is obvious that DHP incurs additional overheads due to the need to do hashing and to maintain a hash table. Therefore, after experiments are performed, it is concluded that the hashing technique should only be applied during the generation of candidate 2-itemsets and not for the generation of larger candidate itemsets. This allows DHP to achieve speed-ups of up to three times against Apriori.

Recently, Han et al. proposed the Frequent Pattern Growth (FP-growth) [7] algorithm which achieves impressive results compared to Apriori. It completely eliminates the candidate generation bottleneck by using a new tree structure called the Frequent Pattern Tree (FP-tree) which stores critical itemset information. This compact structure also removes the need for database scans and it is constructed using only two scans. In the first database scan, large 1-itemsets L1 are obtained and sorted in support descending order. In the second scan, items in the transactions are first sorted according to the order of L1. These sorted items are used to construct the FP-tree. Refer to [7] for the construction details and completeness proof of the FP-tree.

FP-growth then proceeds to recursively mine FP-trees of decreasing size to generate large itemsets without candidate generation and database scans. It does so by examining all the conditional pattern bases of the FP-tree, which consist of the sets of large itemsets occurring with each suffix pattern. Conditional FP-trees are constructed from these conditional pattern bases and mining is carried out recursively with such trees to discover large itemsets of various sizes. However, the construction and use of the FP-trees are complex and cause the performance of FP-growth to be on par with Apriori at support thresholds of 3% and above. It only shows significant speed-ups at support thresholds of 1.5% and below.

3. PROBLEM DESCRIPTION

The problem of mining association rules can be formally described as follows: Let I = {a1, a2, ..., aN} be a set of literals called items. Let D be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. An itemset is a set of items and a k-itemset is an itemset that contains exactly k items. For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let σ(X) be the support count of an itemset X, which is the number of transactions in D that contain X. Let s be the support threshold and |D| be the number of transactions in D. An itemset X is large or frequent if σ(X) ≥ |D| × s%. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in the database D with confidence c% if no less than c% of the transactions in D that contain X also contain Y. The association rule X ⇒ Y has support s% in D if σ(X ∪ Y) = |D| × s%.

For a given pair of confidence and support thresholds, the problem of mining association rules is to discover all rules that have confidence and support greater than the corresponding thresholds. For example, in a computer hardware shop, the association rule Scanner ⇒ Printer means that whenever customers buy scanners, they also buy printers c% of the time and this trend occurs s% of the time.

This problem can be decomposed into two sub-problems as discussed in [2]:

1. Finding of large itemsets
2. Generation of association rules from large itemsets

Most researchers addressed the first sub-problem only because it is more computationally expensive. Moreover, once the large itemsets are identified, the corresponding association rules can be generated in a simple and straightforward manner. Hence, only the first sub-problem will be addressed in this paper. For a discussion of optimized algorithms for the generation of association rules, more information can be found in [1].
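To make the definitions above concrete, here is a short Python sketch; the four-transaction database is hypothetical (it echoes the Scanner ⇒ Printer example, not data from the paper):

```python
# Hypothetical toy transaction database D (not from the paper).
D = [
    {"scanner", "printer"},
    {"scanner", "printer", "paper"},
    {"scanner"},
    {"printer", "paper"},
]

def support_count(itemset, D):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for T in D if itemset <= T)

def support(itemset, D):
    """Fraction of transactions containing the itemset."""
    return support_count(itemset, D) / len(D)

def confidence(X, Y, D):
    """Fraction of transactions containing X that also contain Y."""
    return support_count(X | Y, D) / support_count(X, D)

# Rule {scanner} => {printer}: held in 2 of 4 transactions (support 50%),
# and 2 of the 3 scanner-buying transactions also bought a printer.
s = support({"scanner", "printer"}, D)
c = confidence({"scanner"}, {"printer"}, D)
```

This mirrors the two thresholds of the problem statement: a rule is reported only when both `s` and `c` clear the user-supplied minimums.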

4. DATA STRUCTURE

This section first describes a general trie-like structure, followed by an enhanced version of it and finally the most optimized version. A trie is a tree structure whose organization is based on a key space decomposition. In key space decomposition, the key range is equally subdivided and the splitting position within the key range for each node is predefined. An alphabet trie is a trie used to store a dictionary of words. The following data structures will be based on the alphabet trie.

[Table 1: An example transaction database with N = 4.]

4.1 A Complete TrieIT

Given a database D of transactions, we store it as a forest of lexicographically-ordered tree nodes known as Trie Itemsets (TrieITs) so that the first sub-problem of association rule mining (finding large itemsets) can be done efficiently. Let the set of items I = {a1, a2, ..., aN} in the database be ordered so that for any two items ai ∈ I, aj ∈ I (1 ≤ i, j ≤ N), ai ≺ aj if and only if i < j. Likewise, every transaction T ∈ D is also ordered with respect to the ordering in I.

DEFINITION 1. A complete TrieIT is a set of tree nodes such that every tree node w is a 2-tuple (wl, wc) where wl ∈ I is the label of the node and wc is a support count. Since every tree node corresponds to some item ai ∈ I, for brevity, we also use wi to refer to a tree node that corresponds to ai ∈ I. The following conditions hold:

1. Let C(wi) be the ordered set of children nodes of wi. If C(wi) ≠ ∅, then C(wi) ⊆ {wi+1, wi+2, ..., wN}.

2. Given a node wi, let wk, wk+1, ..., wi-1 (1 ≤ k ≤ i-1) be the set of nodes on the path from the root to the parent of wi; then wc (the count of wi) is a count of the itemset {ak, ak+1, ..., ai} in the database. Hence, the support count of any k-itemset can be obtained by following a set of nodes to a depth of k.

Each complete TrieIT Wi corresponds to some ai ∈ I such that the root node has label ai. Then D is stored as a set of complete TrieITs denoted by W where W ⊆ {W1, W2, ..., WN}.

In order to improve the efficiency of finding large itemsets using complete TrieITs, every transaction T ∈ D is inserted into W as follows: Every ordered itemset X ∈ P(T) (P(T) is the powerset of T) increases by one the count of node wj of TrieIT Wi, where ai and aj are the first and last items of X respectively. If TrieIT Wi or node wj does not exist, it is created with an initial count of one. Thus, each T ∈ D updates the support counts of all its sub-itemsets (from its powerset) in the corresponding TrieITs of W. Hence, no database scanning is required subsequently as the support counts are already stored in the TrieITs. The following example illustrates the concept of TrieITs and how transactions update the TrieITs.

4.1.1 Example

Figure 1(a) shows the TrieITs after the transactions 100 to 300 in Table 1 have been inserted into the tree. The numbers in brackets are the support counts. The special node ROOT simply contains pointers to the root nodes of the TrieITs. To understand how the TrieITs are updated, let us see what happens when transaction 400 is inserted. From transaction 400, the following itemsets are extracted:

• 1-itemsets: {A}, {B}, {C}, {D}
• 2-itemsets: {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
• 3-itemsets: {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}
• 4-itemsets: {A, B, C, D}

They are used to increment the support counts of their corresponding nodes in the TrieITs by one. To locate the node that corresponds to an itemset {ak, ak+1, ..., ai}, simply trace the path {ak, ak+1, ..., ai} from the ROOT node. If a node does not exist, create a new node and initialize it with a support count of 1. Here, we need to create seven nodes with the label D to represent itemsets ending with the item D. The resultant set of complete TrieITs is shown in Figure 1(b). This set of TrieITs contains the maximum number of tree nodes because all N unique items appear in the database.

To obtain the support count of an itemset, simply locate its path along the tree nodes. The node representing the last item in the itemset would contain the support count for the itemset. For example, to find the support count of itemset {A, B, D}, locate the path from the ROOT node to the node containing the last item D. Thus, from Figure 1(b), the support count of itemset {A, B, D} is 1. This search is efficient because the nodes are sorted lexicographically. However, the set of complete TrieITs requires a huge amount of memory space which scales up exponentially with N.

4.2 Support-Ordered Trie Itemset

This approach builds on the ideas presented in the paper on DHP [9]. In that paper, it is discovered that generation of large 2-itemsets is the main performance bottleneck. Using a hash table, DHP is able to improve performance significantly by reducing the size of the candidate 2-itemsets.

Similarly, this approach seeks to find a data structure that allows for quick generation of large 2-itemsets without the heavy processing and memory requirements of the previous two structures. The solution is a 2-level support-ordered tree which is called a SOTrieIT (Support-Ordered Trie Itemset). It is unnecessary and inefficient to go beyond two levels because the memory requirements will far outweigh the computation savings. This will be discussed in greater detail in Section 7.

DEFINITION 2. A SOTrieIT is a complete TrieIT with a depth of 1; i.e., it consists only of a root node wi and some child nodes. Moreover, all nodes in the forest of SOTrieITs are sorted according to their support counts in descending order from the left.

In constructing the set of SOTrieITs Y from a database D, only the 1-itemsets and 2-itemsets X ∈ P(T) of each transaction T ∈ D are used to update the support counts in Y. In other words, the set of SOTrieITs only keeps a record of all 1-itemsets and 2-itemsets contained in a transaction. The first-level nodes represent 1-itemsets while second-level nodes represent 2-itemsets. Henceforth, we shall use the term SOTrieIT to denote a set of SOTrieITs.

[Figure 1: Resultant Complete TrieIT. (a) After transactions 100 to 300; (b) after transaction 400.]

By keeping track of the support counts of all 1-itemsets and 2-itemsets, the SOTrieIT allows both large 1-itemsets and 2-itemsets to be found very quickly. This is because there is no need to scan the database, which could be very large. Instead, only the small SOTrieIT is scanned. Moreover, as the SOTrieIT is sorted according to the support counts of the itemsets, only part of the structure needs to be scanned. This feature is elaborated below in the discussion of the RARM algorithm. Finally, the amount of pre-processing work and memory needed are greatly reduced.

4.2.1 Example

Figure 2 represents the fully constructed SOTrieIT for the example transaction database in Table 1. To illustrate how nodes are created, let us examine what happens when a new transaction arrives. Note that only 1-itemsets and 2-itemsets are extracted from the transactions. The nodes are created in a similar manner as in the complete TrieITs. When transactions 100 and 200 arrive, the nodes created are the same as those created for the complete TrieITs, as shown in Figures 2(a) and 2(b), because only 1-itemsets and 2-itemsets are extracted in both cases. However, notice that in Figure 2(b), the node wC under the ROOT node comes before the node wA. This is because the nodes are sorted according to their support counts and wC has a higher support count than wA.

When transaction 300 arrives, the following itemsets are extracted: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}. Unlike the situation for the complete TrieITs, the itemset {A, B, C} is not extracted from the transaction. Figure 2(c) shows the resultant SOTrieIT when this transaction is processed. When transaction 400 arrives, the following itemsets are extracted: {A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}.

The SOTrieIT is updated in a similar fashion for transaction 400, as seen in Figure 2(d). The final resultant SOTrieIT resembles the complete set of TrieITs except that it consists of only three levels of nodes (including ROOT) and that sibling nodes are sorted according to their support counts.

4.3 Correctness

We need to show that with a SOTrieIT, the support counts of all 1-itemsets and 2-itemsets can be correctly obtained without scanning the database. Let Ts be a transaction of size s and Ts = {b1, b2, ..., bs}. The 1-itemsets that are extracted and used to build W are {b1}, {b2}, ..., {bs} and the 2-itemsets extracted are {bx, by} where 0 < x < s and x < y ≤ s. These itemsets update counts in the SOTrieITs accordingly. Every itemset increments or decrements the support count of its corresponding tree node depending on whether the transaction is added or deleted.

At any point in time, W contains all the support counts of all 1-itemsets and 2-itemsets that appear in all the transactions. Hence, there is no longer any need to scan the database during the generation of large 1-itemsets and 2-itemsets.

5. ALGORITHM RARM

In this section, the RARM algorithm that uses a SOTrieIT is described.

5.1 Pre-processing

Figure 3 shows the pre-processing steps taken whenever a transaction arrives. For every transaction that arrives, 1-itemsets and 2-itemsets are first extracted from it. For each itemset, the SOTrieIT, denoted by Y, will be traversed in order to locate the node that stores its support count. Support counts of 1-itemsets and 2-itemsets are stored in first-level and second-level nodes respectively. Therefore, this traversal requires at most two redirections, which makes it very fast. Y will then be sorted level-wise from left to right according to the support counts of the nodes in descending order. If such a node does not exist, it will be created and inserted into Y accordingly. Similarly, Y is then sorted after such an insertion.

5.2 Mining of large itemsets

Figure 4 shows the steps taken when the mining process is started. The SOTrieIT, Y, is first traversed to discover large 1-itemsets and 2-itemsets. In our approach, depth-first search is used, starting from the leftmost first-level node. As Y is sorted according to support counts, the traversal can be stopped the moment a node is found not to satisfy the minimum support threshold. After large 1-itemsets and 2-itemsets are found, the algorithm proceeds to discover other larger itemsets using the Apriori algorithm.
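The two phases just described can be condensed into a Python sketch. This is an assumption-laden simplification, not the paper's code: the two-level trie is flattened into a single support-ordered list, so one early exit stands in for the per-level traversal, and the Apriori handoff for itemsets of size 3 and beyond is omitted:

```python
from itertools import combinations

def preprocess(transactions):
    """Phase 1 (pre-processing): count every 1-itemset and 2-itemset."""
    counts = {}
    for T in transactions:
        items = sorted(set(T))
        for a in items:
            key = frozenset([a])
            counts[key] = counts.get(key, 0) + 1
        for pair in combinations(items, 2):
            key = frozenset(pair)
            counts[key] = counts.get(key, 0) + 1
    return counts

def mine_L1_L2(counts, n_transactions, s_percent):
    """Phase 2 (mining): scan counts in descending support order and stop
    at the first entry below the threshold - everything after is smaller."""
    minsup = n_transactions * s_percent / 100.0
    L1, L2 = [], []
    for itemset, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        if c < minsup:
            break
        (L1 if len(itemset) == 1 else L2).append(set(itemset))
    return L1, L2

# Hypothetical four-transaction database, mined at a 75% support threshold.
DB = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
L1, L2 = mine_L1_L2(preprocess(DB), len(DB), 75)
```

The early `break` is the whole point of support ordering: no candidate 1- or 2-itemsets are ever generated, and no database scan is needed during mining.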

Experiments in Section 7 prove that the savings obtained during the generation of large 1-itemsets and 2-itemsets are enough to greatly improve its performance.

[Figure 2: Resultant SOTrieIT. (a)-(d) after transactions 100 to 400 respectively.]

1  Let Y be a set of SOTrieITs
2  for (k = 1; k <= 2; k++) do begin
3      Obtain all k-itemsets of the transaction and store them in Ck
4      foreach itemset X ∈ Ck do begin
5          Traverse Y to locate the nodes along the path that represents X
6          if such a set of nodes exists in Y then
7              Increment the support count of the leaf node
8              Sort the updated node among its siblings according to its new support count in descending order
9          else
10             Create a new set of nodes with support counts of 1 that represent a path to X
11             Insert them into Y according to their support counts in descending order from the left
12         endif
13     endfor
14 endfor

Figure 3: Pre-processing Algorithm.

1  Let N[p, q] be the q-th child node of parent node p.
2  Let NC[n] be the number of child nodes under node n.
3  Let I[n] be the itemset represented by node n.
4  for (x = 1; x <= NC[ROOT]; x++) do begin
5      Let X = N[ROOT, x].
6      if σ(I[X]) ≥ |D| × s% then begin
7          Add I[X] to L1.
8          for (y = 1; y <= NC[X]; y++) do begin
9              if σ(I[N[X, y]]) ≥ |D| × s% then
10                 Add I[N[X, y]] to L2.
11         endfor
12     endif
13 endfor
14 Run Apriori from its third iteration to find the rest of the large itemsets from 3-itemsets onwards.

Figure 4: Mining Algorithm.

5.2.1 Example

To illustrate the mining algorithm, we use the same transaction database found in Table 1 and the SOTrieIT structure in Figure 2(d). Suppose the support threshold is set at 80%. Then the minimum support count to qualify an itemset as large is 4. Figure 5 shows the traversal path taken in obtaining the large 1-itemsets and 2-itemsets. The bold numbers on the arrows denote the sequence with which the SOTrieIT is traversed. The RARM algorithm stops traversing the SOTrieIT at the third traversal when it encounters the item A, which has a support count of 3. This is because all other nodes that come after first-level node A will have a support count of 3 or less. Therefore, there will not be any more large itemsets in the rest of the SOTrieIT. The algorithm terminates; the only large itemset is {C} and the total number of traversals is only three.
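After L1 and L2 are read off the SOTrieIT, line 14 of Figure 4 hands over to Apriori from its third iteration. A sketch of that candidate-generation step (generic Apriori join-and-prune, not code from the paper):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidate k-itemsets from large (k-1)-itemsets: join step (union two
    large itemsets into a k-itemset) followed by the prune step (every
    (k-1)-subset of a candidate must itself be large)."""
    L_prev = {frozenset(s) for s in L_prev}
    candidates = set()
    for a in L_prev:          # join: unions that yield exactly k items
        for b in L_prev:
            u = a | b
            if len(u) == k:
                candidates.add(u)
    # prune: discard candidates with a non-large (k-1)-subset
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# With L2 = {{A,C}, {B,C}}, the only union {A,B,C} is pruned because
# {A,B} is not large, so no third iteration is needed.
C3 = apriori_gen([{"A", "C"}, {"B", "C"}], 3)
```

This matches the behavior in the examples here: once L2 is small, the Apriori tail of RARM terminates almost immediately.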

[Figure 5: Traversal path of the SOTrieIT at a support threshold of 80%.]

For a support threshold of 75%, the minimum support count needed is 3. Figure 6 shows the traversal path taken in obtaining the large 1-itemsets and 2-itemsets. During the generation of the first two large itemsets, the moment a first-level node with a support count lower than 3 is encountered, the rest of its siblings and subtrees are not scanned. However, when a second-level node is found not to have satisfied the minimum support count, only its subsequent siblings will be ignored. In this case, at the fifth traversal, when the node that represents itemset {A, B} is found to have a support count of less than 3, the node that represents itemset {A, D} is ignored. The final large 1-itemsets and 2-itemsets found are L1 = {{A}, {B}, {C}} and L2 = {{A, C}, {B, C}}, and the total number of traversals is nine.

[Figure 6: Traversal path of the SOTrieIT at a support threshold of 75%.]

6. TIME AND SPACE COMPLEXITY

6.1 Pre-processing

We discuss the time and space complexities of the pre-processing phase in this section.

6.1.1 Time Complexity

The amount of time to pre-process a transaction depends on the amount of time to extract 1-itemsets and 2-itemsets from the transaction, to traverse the SOTrieIT to increment the support counts of the respective nodes, and to create new nodes in the SOTrieIT for items that are not encountered yet.

For a transaction of size s, only C(s,1) + C(s,2) itemsets are pre-processed. Hence, its complexity is O(s²). For example, when transaction 400 of Table 1 arrives, the following ten itemsets are extracted: {A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}. According to our formulation, this gives a total of C(4,1) + C(4,2) = 4 + 6 = 10 itemsets, which is correct. As the SOTrieIT is only two levels deep, it takes at most two links to reach the desired node. Suppose it also takes one unit of time to move over one link and update/create the node, and one unit of time to extract one itemset; it will then take a maximum of 2 × (C(s,1) + C(s,2)) units of time to move to all the nodes required by a transaction of size s.

6.1.2 Space Complexity

In a database with N unique items, there will be N first-level nodes in the SOTrieIT. For each first-level node, since the SOTrieIT is created in a trie-like manner, it will contain only items that are lexicographically larger than itself. The first-level node which has the largest number of child nodes is the one which has the first position in the set of lexicons; it will have N-1 child nodes. Subsequent first-level nodes will have one less child node than the previous one. Therefore, for N unique items, a maximum of only Σ(x=1..N) x = N(N+1)/2 nodes, inclusive of both first-level and second-level nodes, are needed to store the entire pre-processing information. Hence, its complexity is O(N²).

6.2 Mining of large itemsets

This section discusses the time complexity of the mining phase as compared to that of Apriori. Space complexity will not be mentioned because this phase also involves the SOTrieIT, whose space complexity is already discussed in the previous section.

6.2.1 Time Complexity

To compare the time complexity of RARM and Apriori, we shall focus only on the scanning process to obtain the support counts of itemsets. This is enough to see the vast improvement of RARM over Apriori.

For each pass of the Apriori algorithm, there is a need to scan the entire database regardless of the desired support threshold. Suppose the database is of size a and the average size of each transaction is b. Then, Apriori takes O(ab) units of time to scan the database at each pass.

For the first two passes of RARM, only the SOTrieIT Y needs to be scanned. According to Section 6.1.2, Y has at most Σ(x=1..N) x nodes and since each node has one link from its parent, in the worst case, it will take at most 2 × Σ(x=1..N) x units of time to traverse the entire structure, where N < a. In addition, the time needed also depends on the desired support threshold, which would further reduce the number of traversals. Therefore, the average total amount of scanning time will be far less than O(ab).

7. PERFORMANCE EVALUATION

This section evaluates and compares the relative performance of the Apriori and RARM algorithms by conducting experiments on a Pentium-III machine with a CPU clock rate of 1 GHz, 256 MB of main memory and running on a Windows 2000 platform. The algorithms are implemented in Java and hence the large memory requirements of the Java Virtual Machine prevented us from scaling up the experiments. This issue will be tackled in future experiments. The SOTrieIT structure is implemented using a combination of integer arrays and files. Implementation details are omitted due to the lack of space. Despite extra file I/O requirements, RARM maintains its efficient performance. Detailed analysis of the results is performed to explain the improvements of RARM over Apriori.
Parameter M e a n i n g
Number of unique items Candidate ltemset Generation
Number of transactions 600000
:Yi I I I I
1;; Number of maximal potentially
large itemsets
Average size of transactions
Average size of maximal potentially
large itemsets
Figure 7: Definition of Parameters.

improvements of RARM over Apriori. Figure 7 shows the various parameters used and their meanings.

The method used for generating synthetic data is the same as the one used in [2]. To describe an experiment, we use the notation Tw.Ix.Ny.Dz, where w is the average size of transactions, x is the average size of maximal potentially large itemsets, y is the number of unique items and z is the size of the database. The databases created are similar to those used in [7]. The first one is T25.I10.N1K.D10K, which is denoted as D1, while the second one is T25.I20.N10K.D100K, which is denoted as D2.

7.1 Comparison of RARM and Apriori

Figure 8 shows the execution times of both Apriori and RARM for the two databases. From the graphs, it can be quickly observed that RARM outperforms Apriori in all situations. In Figure 8(a), RARM maintains a steady speed-up of at least 20 times for support thresholds ranging from 3% to 1.5% in D1. However, when the support threshold falls below 1.5%, the speed-up is significantly reduced. This is because larger-sized frequent itemsets exist and RARM uses the Apriori algorithm for discovering large k-itemsets for k ≥ 3. Hence, the computation savings in the first two iterations are insignificant compared to the huge computation costs of obtaining the larger frequent itemsets.

The situation changes dramatically in D2. Figure 8(b) uses a log scale for the time axis because of the vast difference between the execution times of RARM and Apriori. RARM performs at least 70 times faster than Apriori for support thresholds ranging from 3% to 1%. Its performance peaks at a support threshold of 2%, where it performs 120 times faster than Apriori. We cannot determine the execution time of Apriori for a support threshold of 0.5% because there is insufficient memory to hold the candidate 2-itemsets. RARM does not have this memory problem because it obtains large 2-itemsets without generating candidate 2-itemsets. The lowest support threshold that allows us to compare the performance of the two algorithms is 0.75%, and at this threshold, RARM outperforms Apriori by a factor of 47.

[Figure 9: Size of candidate k-itemsets for D1 and D2 at a support threshold of 0.75%.]

The vast improvement of RARM in D2 can be explained by Figure 9, which shows the number of candidate k-itemsets generated during the mining of D1 and D2 at a support threshold of 0.75%. From Figure 9, it is obvious that the main difference in candidate generation between D1 and D2 lies in the generation of candidate 2-itemsets, and that D1 has larger-sized candidate itemsets. Thus, by eliminating the need for candidate 2-itemset generation, RARM achieves a much greater speed-up in D2, as it contains more than twice the number of candidate 2-itemsets compared to D1. As for higher support thresholds, RARM is able to outperform Apriori by up to two orders of magnitude because in D2 the maximum size of the large itemsets, k', is much lower than that of D1. If k' increases indefinitely, the performance of RARM will eventually be reduced to that of Apriori. However, we can conclude from the experiments that as databases increase in size, k' will decrease and RARM will thus scale up well against Apriori.

7.2 Comparison of RARM and FP-growth

Due to the lack of time, FP-growth was not implemented, but its performance relative to RARM can be evaluated using Apriori as a basis for comparison. The experiments conducted in [7] reported an overall improvement of only an order of magnitude for FP-growth over Apriori. In addition, the performance of FP-growth is on par with that of Apriori for support thresholds ranging from 3% to 1.5% in D2. The poor performance of FP-growth can be attributed to the cost of recursively constructing FP-trees. Hence, significant speed-ups can only be noticed at lower support thresholds, when Apriori cannot cope with the exponential increase in candidate itemset generation. This is undesirable because we want to mine databases quickly at a wide variety of support thresholds, not only at low ones. RARM overcomes this limitation of FP-growth and consistently outperforms Apriori at all support thresholds; experiments show that it can even perform up to two orders of magnitude faster than Apriori. In future, FP-growth will be implemented to directly assess its performance against RARM, but it is already clear that RARM will outperform FP-growth.

7.3 Pre-processing Requirements

As pre-processing is carried out on a transaction the moment it arrives in the database, it is distributive by nature and thus will not burden a system excessively. RARM spends an average of only 22 ms and 48 ms pre-processing a single transaction in D1 and D2 respectively. This amount of time is insignificant considering the major speed-ups it yields in the mining process. This requirement should not be taken into account when comparing the performance of RARM and Apriori because pre-processing is done outside of the actual mining process itself.
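To illustrate the idea behind these savings, the sketch below shows how support counts for 1-itemsets and 2-itemsets can be accumulated as each transaction arrives, so that the large 1- and 2-itemsets are read off directly and no candidate 2-itemsets are ever generated. This is a simplified stand-in for the SOTrieIT, with all names assumed; it is not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

# Simplified stand-in (not the authors' SOTrieIT implementation) for
# counting 1-itemset and 2-itemset supports at transaction-arrival time,
# so that mining never generates candidate 2-itemsets.
class TwoLevelCounter:
    def __init__(self):
        self.singles = defaultdict(int)  # first-level: 1-itemset counts
        self.pairs = defaultdict(int)    # second-level: 2-itemset counts
        self.n_transactions = 0

    def insert(self, transaction):
        # Called once per arriving transaction (the pre-processing step).
        items = sorted(set(transaction))
        self.n_transactions += 1
        for item in items:
            self.singles[item] += 1
        for pair in combinations(items, 2):
            self.pairs[pair] += 1

    def large_itemsets(self, min_support):
        # Read the large 1- and 2-itemsets straight off the counts.
        threshold = min_support * self.n_transactions
        l1 = sorted(i for i, c in self.singles.items() if c >= threshold)
        l2 = sorted(p for p, c in self.pairs.items() if c >= threshold)
        return l1, l2

counter = TwoLevelCounter()
for txn in [[1, 2, 3], [1, 2], [2, 3], [1, 3], [1, 2, 3]]:
    counter.insert(txn)
l1, l2 = counter.large_itemsets(0.6)  # 60% support threshold
```

For k ≥ 3, Apriori-style candidate generation would take over, which is consistent with the observation above that the speed-up shrinks once larger frequent itemsets appear.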

[Figure 8: Execution times for the two databases of the form Tw.Ix.Ny.Dz, where w is the average size of transactions, x is the average size of maximal potentially large itemsets, y is the number of unique items and z is the size of the database, at varying support thresholds: (a) D1, (b) D2.]

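Section 7.4 below describes splitting the SOTrieIT between memory and files: first-level counts live in primitive integer arrays, while second-level nodes are persisted in files named after their parents' labels. A minimal sketch of that idea follows; the file format and naming are assumptions for illustration, not details taken from the paper.

```python
import json
import os
import tempfile

# Illustrative sketch (file format and naming assumed) of a two-level
# structure split between memory and disk: first-level counts live in a
# plain integer array, while each first-level node's children are
# persisted to a file named after the parent node's label.
NUM_ITEMS = 1000               # assumed number of unique items (N)
first_level = [0] * NUM_ITEMS  # in-memory 1-itemset counts

def persist_children(parent_label, child_counts, directory):
    """Write one parent's second-level counts to its own file."""
    path = os.path.join(directory, f"{parent_label}.json")
    with open(path, "w") as f:
        json.dump(child_counts, f)
    return path

def load_children(parent_label, directory):
    """Read one parent's second-level counts back from disk."""
    path = os.path.join(directory, f"{parent_label}.json")
    with open(path) as f:
        # JSON object keys are strings; convert back to item identifiers.
        return {int(k): v for k, v in json.load(f).items()}

workdir = tempfile.mkdtemp()
first_level[7] = 4                          # support count of item 7
persist_children(7, {8: 3, 9: 2}, workdir)  # counts of {7,8} and {7,9}
restored = load_children(7, workdir)
```

Only the small first level stays resident, so memory use is bounded by the number of unique items rather than by the number of item pairs.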
7.4 Storage Requirements

The SOTrieIT structure resides in both memory and files. As primitive integer arrays are employed in memory for storing the first-level nodes, the SOTrieIT takes up only 2 KB and 14 KB in D1 and D2 respectively. Second-level nodes grow exponentially with respect to N, as seen in Section 6.1.2, and as such they cannot be stored in memory. Instead, they are stored in files which are named after the labels of their parents. These files take up approximately 2 MB and 53 MB for D1 and D2 respectively. Therefore, it can be concluded that by distributing the SOTrieIT structure between memory and files, scalability is ensured, as hard disk capacity is currently in the realm of tens of gigabytes.

8. CONCLUSIONS

The rising popularity of electronic commerce presents new challenges to association rule mining. Due to the easy availability of huge amounts of transactional data, there is a greater need for fast and scalable algorithms to exploit this knowledge base. We have proposed a new algorithm called RARM which uses an efficient trie-like structure known as the SOTrieIT. By eliminating the need for candidate 1-itemset and 2-itemset generation, RARM is able to achieve significant speed-ups. Experiments have shown that RARM is much faster than Apriori and FP-growth. It also maintains its sharp edge at various support thresholds and is scalable. Though there are additional pre-processing and storage requirements, they are both insignificant and worthwhile considering the immense speed-ups they help achieve.

The SOTrieIT is a simple yet highly dynamic structure which can be put to greater use in association rule mining. Due to the dynamic nature of the Internet, current algorithms are inadequate for handling web-based databases. The SOTrieIT may be a good tool for imparting dynamism to static algorithms. Hence, methods and algorithms to exploit the SOTrieIT will be explored in future work.

9. REFERENCES

[1] C. C. Aggarwal and P. S. Yu. Online generation of association rules. In Proc. 14th Int. Conf. on Data Engineering, pages 402-411, Orlando, Florida, USA, 1998.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Databases, pages 487-499, Santiago, Chile, 1994.
[3] M. Berry and G. Linoff. Data mining techniques: for marketing, sales, and customer support. Wiley Computer Pub., New York, 1997.
[4] A. Berson, S. Smith, and K. Thearling. Building data mining applications for CRM. McGraw-Hill, New York, 2000.
[5] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proc. 12th Int. Conf. on Data Engineering, pages 106-114, New Orleans, Louisiana, 1996.
[6] D. W. Cheung, S. D. Lee, and B. Kao. A general incremental technique for maintaining discovered association rules. In Proc. 5th Int. Conf. on Database Systems for Advanced Applications, pages 185-194, Melbourne, Australia, 1997.
[7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, 2000.
[8] C. Hidber. Online association rule mining. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 145-154, Philadelphia, Pennsylvania, USA, 1999.
[9] J. S. Park, M. S. Chen, and P. S. Yu. Using a hash-based method with transaction trimming for mining association rules. IEEE Trans. on Knowledge and Data Engineering, 9(5):813-825, Sept. 1997.
[10] M. R. Sanders. Global eCommerce Approaches Hypergrowth. http://www.forrester.nl/ER/Research/Brief/Excerpt/0,1317,9229,FF.html, April 2000.
[11] N. L. Sarda and N. V. Srinivas. An adaptive algorithm for incremental mining of association rules. In Proc. 9th Int. Conf. on Database and Expert Systems, pages 240-245, Vienna, Austria, 1998.
