
Query Directed Web Page Clustering

Daniel Crabtree, Peter Andreae, Xiaoying Gao


School of Mathematics, Statistics and Computer Science
Victoria University of Wellington
New Zealand
daniel@danielcrabtree.com, pondy@mcs.vuw.ac.nz, xgao@mcs.vuw.ac.nz

Abstract

Web page clustering methods categorize and organize search results into semantically meaningful clusters that assist users with search refinement; but finding clusters that are semantically meaningful to users is difficult. In this paper, we describe a new web page clustering algorithm, QDC, which uses the user's query as part of a reliable measure of cluster quality. The new algorithm has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining or cluster drifting problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster. We evaluate QDC by comparing its clustering performance against that of four other algorithms on eight data sets (four use full text data and four use snippet data) by using eleven different external evaluation measurements. We also evaluate QDC by informally analysing its real world usability and performance through comparison with six other algorithms on four data sets. QDC provides a substantial performance improvement over other web page clustering algorithms.

1 Introduction

Web search is difficult because it is hard for users to construct queries that are both sufficiently descriptive and sufficiently discriminating to find just the web pages that are relevant to the user's search goal. Queries are often ambiguous: words and phrases are frequently polysemantic and user search goals are often narrower in scope than the queries used to express them. This ambiguity leads to search result sets containing distinct page groups that meet different user search goals. Often users must refine their search by modifying the query to filter out the irrelevant results. Users must understand the result set to refine queries effectively, but this is time consuming if the result set is unorganised.

Web page clustering is one approach for assisting users to both comprehend the result set and to refine the query. Web page clustering algorithms identify semantically meaningful groups of web pages and present these to the user as clusters. The clusters provide an overview of the contents of the result set, and when a cluster is selected the result set is refined to just the relevant pages in that cluster.

Clustering performance is very important for usability. If cluster quality is poor, the clusters will be semantically meaningless or will contain many irrelevant pages. If cluster coverage is poor, then clusters representing useful groups of pages will be missing or the clusters will be missing many relevant pages. Therefore, improving the performance of web page clustering algorithms is both worthwhile and very important.

This paper presents QDC, a query directed web page clustering algorithm that gives better clustering performance than other clustering algorithms. QDC has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining (drifting) problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster.

The next section of the paper sets QDC in the context of other research in the field by describing related work. The following sections describe the algorithm and evaluate QDC by comparing its performance against other clustering algorithms.

2 Related Work

Most clustering algorithms for web pages start by pre-processing the pages in a standard way. Various page elements and words are removed from the pages: HTML tags, punctuation and other similar non-informative text, and a set of stop words containing very common and uninformative words such as "the", "it", and "on". Light stemming, using the Porter stemming algorithm [12], is often applied to reduce terms to their root form; for example, "dogs" becomes "dog". This leaves each page represented by a sequence of words.

There are several models used to represent the pre-processed pages; the most common models are the bag of terms and the set of terms. Terms can be either words or phrases, although often just words are used. There are also graph based models that preserve the ordering of document terms [13]; one that is quite efficient is the suffix tree model [19].

Researchers have applied all the standard data clustering methods [2, 8, 15] to web page clustering: hierarchical (agglomerative and divisive), partitioning (probabilistic, k-medoids, k-means), grid-based, density-based, fuzzy c-means, Bayesian, Kohonen self-organising maps, and many more. Many algorithms build on the standard methods by using web or document specific characteristics to assist clustering: Suffix Tree Clustering (STC) [18] and Lingo [10, 11] use phrases, and some algorithms [17, 9] consider the hyperlinked nature of web pages.

Current web page clustering algorithms produce clusterings of low quality: many clusters are semantically meaningless, and the meaningful clusters are often small, miss many relevant pages, and contain irrelevant pages. The problem is that, from a textual perspective, the algorithms only use properties and statistics of pages from within the result set. Many algorithms such as hierarchical and partitioning algorithms [15] use data similarity measures [2] to construct clusters; when applied directly to page data, the similarity based methods are not effective at producing semantically meaningful clusters.

One way of improving web page clustering algorithms is to make better use of the textual properties of web pages. The semantic relationships between words (for example, synonyms, hyponyms, meronyms, etc. [4]) are very useful information. WordNet [4] is a lexical reference system and is one source of this information. However, the data in these systems is incomplete, particularly for commercial, technical, and popular culture word usage.

An alternative source, although less accurate and less informative, is to use global document analysis and term co-occurrence statistics to identify whether terms are related or unrelated. The number of pages in multi-term search result sets can approximate term co-occurrence statistics. Google distance [4] and the Rough Set based Graded Thesaurus [5] are two techniques that use these statistics to determine term similarity, and both have been shown to be effective on various tasks, such as hierarchical word clustering [4] and web query expansion [5].

QDC uses term relationships to provide a dramatic improvement in clustering performance. Specifically, QDC uses the normalized Google distance (NGD) [4]:

NGD(i, j) = \frac{\max(\ln f(i), \ln f(j)) - \ln f(i \wedge j)}{\ln M - \max(\ln f(i), \ln f(j))}

where i and j are terms, f(t) is the approximate web frequency of some term or terms, and M is the approximate total number of pages.
of terms and set of terms. Terms can be either words or Some algorithms represent pages using more advanced
phrases, although often just words are used. There are also models than a bag of words, but their performance is still in-
graph based models that preserve the ordering of document adequate. One of the best web page clustering algorithms is
terms [13]; one that is quite efficient is the suffix tree model STC, which uses the suffix tree model to identify base clus-
[19]. ters consisting of all pages containing one phrase. While
Researchers have applied all the standard data cluster- there are some high quality base clusters, many are too
ing methods [2, 8, 15] to web page clustering: hierarchi- broad and are ambiguous even within the context of the
cal (agglomerative and divisive), partitioning (probabilistic, result set and introduce many irrelevant pages into the fi-
k-medoids, k-means), grid-based, density-based, fuzzy c- nal clusters and degrade clustering performance. Advanced
means, Bayesian, Kohonen self-organising maps, and many models alone are inadequate for producing good clustering
more. Many algorithms build on the standard methods performance. QDC uses a simple set of words model and
by using web or document specific characteristics to as- removes low quality base clusters using the relationship be-
sist clustering: Suffix Tree Clustering (STC) [18] and Lingo tween cluster descriptions and the user’s query.
[10, 11] use phrases and some algorithms [17, 9] consider Base clusters often have poor coverage as they miss
the hyperlinked nature of web pages. many relevant pages. STC addresses this by merging clus-
Current web page clustering algorithms produce cluster- ters using a single-link clustering algorithm [8] with cluster
ings of low quality: many clusters are semantically mean- overlap as the similarity measure. But cluster overlap may
ingless and the meaningful clusters are often small, miss- merge semantically unrelated clusters, which lowers cluster
ing many relevant pages, and contain irrelevant pages. The quality, unless the overlap threshold is set very high. How-
problem is that from a textual perspective the algorithms ever, this leaves many related clusters separate, which limits
only use properties and statistics of pages from within the cluster coverage. QDC uses cluster description similarity in
result set. Many algorithms such as hierarchical and parti- addition to cluster overlap to provide a more effective simi-
tioning algorithms [15] use data similarity measures [2] to larity measure for merging clusters.
construct clusters; when applied directly to page data, the Some clustering algorithms, including single-link clus-
similarity based methods are not effective at producing se- tering and STC, are susceptible to cluster chaining (drifting)
mantically meaningful clusters. [19]. In a sequence of clusters, each cluster may be similar
One way of improving web page clustering algorithms to its immediate neighbours, but completely dissimilar from
is to make better use of the textual properties of web pages. clusters further away in the sequence. Clusters obtained by
The semantic relationships between words is very useful in- merging such sequences are often of low quality and are not
formation; for example, synonyms, hyponyms, meronyms, semantically meaningful. The improved similarity measure
etc. [4]. WordNet [4] is a lexical reference system and is used for merging in QDC limits this significantly, but does
one source of this information. However, the data in these not stop it entirely. QDC solves the cluster chaining (drift-
systems is incomplete, particularly for commercial, techni- ing) problem by making a second pass over merged clusters
cal, and popular culture word usage. and splitting those that have been joined inappropriately.
An alternate source, although less accurate and less in- Algorithms that construct many clusters, like STC, must
formative, is to use global document analysis and term co- select a subset (no more than about ten) to show the user,
occurrence statistics to identify whether terms are related or as the user cannot comprehend too much at one time. Ex-
unrelated. The number of pages in multi-term search result tended Suffix Tree Clustering (ESTC) [6], an extension of
sets can approximate term co-occurrence statistics. Google STC, uses a cluster selection method that considers page

Providing the most relevant pages earlier in the results can reduce the time users spend searching [1]. Most clustering algorithms order the pages in the clusters by their position in the search results [3]. Such an ordering fails to use the additional information about the user's search goal, provided by the user selecting the cluster, so the most relevant pages may not be shown first. QDC orders the pages within each cluster according to their relevance to the cluster.

3 Algorithm - QDC

This section describes QDC, a query directed web page clustering algorithm with five stages, which roughly match our five key innovations.

3.1 Base Cluster Identification

A base cluster is described by a single word and consists of all the pages containing that word. Equivalently, base clusters are single word search refinements based on the current search results. After standard page pre-processing, QDC constructs a collection of base clusters, one for every word that is in at least 4% of the pages. Using a lower threshold will increase clustering performance at the cost of algorithm speed.

Many base clusters are useless and only serve to contaminate the final clusters. Removing these useless clusters would improve the clustering, but selecting the right clusters to prune requires some guide to cluster quality. The user's query is the best, and often the only, specification of the information desired by the user. QDC uses the relationship between query terms and cluster descriptions as one part of a cluster quality guide. QDC computes the query distance of each base cluster, its distance from the query, using NGD as defined in section 2. The query distance from a base cluster to the query is the minimum of the NGD between the word specifying the base cluster and any query term.

Terms with a low query distance tend to be very specific and are often unambiguous in the context of the query, while terms with a high query distance tend to be quite broad and are often ambiguous, even in the context of the query. Ambiguous clusters are often of poor quality as they combine multiple distinct ideas of which only one is normally of interest to a given user. QDC removes these low quality clusters by removing clusters whose query distance is too large. Our experiments use cutoffs of 1.5 when using full text data and 2.5 when using snippet data. This step removes most low quality clusters, but if the cutoff is too low, high quality clusters may be removed as well; using a higher cutoff removes fewer clusters.

After pruning using query distance, there are still many low quality clusters. The relationship between the pages and the clusters can be used to further prune the collection of base clusters. The distribution of web pages tends to follow the frequency of user interest in the page topics. Therefore, larger clusters have a greater probability of being useful refinements, and cluster size is an indication of cluster quality. QDC removes the worst clusters according to a measure proportional to cluster size and inversely proportional to query distance. The number of clusters kept is proportional to the total number of pages being clustered. Keeping fewer clusters will increase algorithm speed but lower clustering performance; if too many clusters are kept, low quality clusters are not pruned and may contaminate the merging process.

Removing this many clusters would normally have a negative effect on clustering performance, but because the query directed heuristics give a reliable guide to cluster quality, the low quality clusters that would later contaminate the merging process are removed, and the performance actually improves.
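As a rough illustration of this stage, the sketch below builds base clusters and applies the query-directed pruning just described. The 4% frequency threshold and the query-distance cutoff are the values quoted in the text; the page representation, the ngd helper (a function returning the NGD between two terms, for example a wrapper around the ngd sketch in section 2 that looks up the required frequencies), and the keep_ratio used to decide how many clusters to retain are illustrative assumptions.

```python
from collections import defaultdict

def build_base_clusters(pages, min_fraction=0.04):
    """One base cluster per word occurring in at least 4% of the pages.

    pages: dict mapping page id -> set of pre-processed words.
    Returns a dict mapping word -> set of page ids."""
    clusters = defaultdict(set)
    for page_id, words in pages.items():
        for word in words:
            clusters[word].add(page_id)
    min_size = min_fraction * len(pages)
    return {word: ids for word, ids in clusters.items() if len(ids) >= min_size}

def query_distance(word, query_terms, ngd):
    # Minimum NGD between the cluster's describing word and any query term.
    return min(ngd(word, term) for term in query_terms)

def prune_base_clusters(clusters, query_terms, ngd, cutoff, keep_ratio=0.2):
    """Drop clusters whose query distance exceeds the cutoff, then keep only
    the best remaining clusters, scored proportionally to cluster size and
    inversely to query distance. The number kept is proportional to the number
    of pages being clustered; keep_ratio is an assumed constant of
    proportionality, not a value from the paper."""
    total_pages = len({p for ids in clusters.values() for p in ids})
    scored = []
    for word, ids in clusters.items():
        qd = query_distance(word, query_terms, ngd)
        if qd <= cutoff:
            scored.append((len(ids) / max(qd, 1e-9), word, ids))
    scored.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, int(keep_ratio * total_pages))
    return {word: ids for _, word, ids in scored[:n_keep]}
```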
3.2 Cluster Merging

QDC constructs larger clusters by merging clusters together. Each cluster (c) is constructed from a set of base clusters (base(c)), and a cluster is described by the word that describes the cluster's largest base cluster. However, the set of pages in a cluster is not necessarily all the pages in its base clusters. A page is only included in the cluster if it is present in enough of the base clusters in the cluster. This threshold should increase with the number of base clusters in the cluster, but should not increase steeply; QDC uses a log function. A cluster is a set that contains the pages that are in at least log2(|base(c)| + 1) of the cluster's base clusters.
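A minimal sketch of this membership rule, assuming each base cluster is given as a set of page ids:

```python
import math
from collections import Counter

def cluster_pages(base_clusters):
    """Pages belonging to a merged cluster: those appearing in at least
    log2(n + 1) of its n base clusters."""
    threshold = math.log2(len(base_clusters) + 1)
    counts = Counter(page for base in base_clusters for page in base)
    return {page for page, n in counts.items() if n >= threshold}
```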
Initially there is a singleton cluster for each base cluster. QDC merges clusters using single-link clustering over a relatedness graph. Single-link clustering merges together all clusters that are part of the same connected component on the graph. The relatedness graph has the clusters as vertices and has an edge between any two clusters that are sufficiently similar.
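A sketch of this merge step as connected components of the relatedness graph; the similar predicate stands for the "sufficiently similar" test defined below, and the union-find representation is an implementation choice, not something specified by the paper.

```python
def single_link_merge(clusters, similar):
    """Group clusters into the connected components of the relatedness graph,
    where an edge joins any two clusters judged sufficiently similar."""
    parent = list(range(len(clusters)))

    def find(i):
        # Find the component representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if similar(clusters[i], clusters[j]):
                parent[find(i)] = find(j)

    components = {}
    for i, cluster in enumerate(clusters):
        components.setdefault(find(i), []).append(cluster)
    return list(components.values())
```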
Previous methods use cluster content similarity and often merge unrelated clusters. Merging unrelated clusters decreases cluster quality by introducing irrelevant pages. The problem is exacerbated by cluster chaining (drifting): clusters that are closely related to one of the unrelated clusters but not the others are often merged in too, bringing further irrelevant pages with them.

QDC defines two clusters to be sufficiently similar only if both the cluster contents and the cluster descriptions are sufficiently similar. Requiring the cluster descriptions to match in addition to the contents dramatically reduces the merging of semantically unrelated clusters and increases cluster quality. Additionally, the cluster contents similarity threshold can be significantly reduced, which allows more semantically related clusters to merge (increasing cluster coverage).

The cluster contents are sufficiently similar if enough of the pages in one cluster are also in the other cluster (i.e., if there is enough overlap between the clusters):

\frac{|c_1 \wedge c_2|}{\min(|c_1|, |c_2|)} > 0.6

The cluster descriptions are sufficiently similar if the pair of cluster descriptions occur together on the web significantly more frequently than would be expected if the pair were unrelated (i.e., if their appearances were independent):

\frac{M \, f(d_1 \wedge d_2)}{f(d_1) \, f(d_2)} > 4

where d_1 and d_2 are the cluster descriptions, and f(t) and M are as per NGD in section 2.

Decreasing either the cluster content or the cluster description similarity threshold will increase cluster coverage at the cost of greater cluster overlap.
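The two requirements above can be folded into a single edge test for the relatedness graph used by the merge step. In the sketch below, the 0.6 and 4 thresholds are the values quoted above; the frequency arguments are the same kind of approximate web counts used for NGD, and the data structures are illustrative.

```python
def sufficiently_similar(pages1, pages2, f_d1, f_d2, f_d1d2, m,
                         overlap_threshold=0.6, cooccurrence_threshold=4.0):
    """Edge test for the relatedness graph: two clusters are connected only if
    both their page overlap and their description co-occurrence are high enough.

    pages1, pages2     : sets of page ids in the two clusters
    f_d1, f_d2, f_d1d2 : approximate web frequencies of the two cluster
                         descriptions and of the pair occurring together
    m                  : approximate total number of pages
    """
    overlap = len(pages1 & pages2) / min(len(pages1), len(pages2))
    if overlap <= overlap_threshold:
        return False
    return m * f_d1d2 / (f_d1 * f_d2) > cooccurrence_threshold
```

A predicate of this shape is what the single_link_merge sketch shown earlier would take as its similar argument.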
3.3 Cluster Splitting

Each cluster now contains at least all the base clusters that relate to one idea; this is assured as single-link clustering merges all related clusters. But single-link clustering, even with our improved similarity function, can produce clusters containing multiple ideas and irrelevant base clusters due to cluster chaining (drifting). Such clusters need to be split. Interestingly, it is easier to split such a compound cluster than to prevent its formation in the first place, because the splitting can take into account the final cluster, whereas the merging process cannot.

QDC uses a hierarchical agglomerative clustering algorithm to identify the sub-cluster structure within each cluster. The algorithm uses a distance measure to build a dendrogram for each cluster starting from the base clusters in the cluster. Each cluster is split by cutting its dendrogram at an appropriate point: when the distance between the closest pair of sub-clusters falls below a threshold (our experiments use -2). This threshold means that any groups of base clusters that are not tightly interconnected with each other will be split. Using a higher threshold will lower the split point and increase the splitting frequency.

QDC uses a distance measure with three components: the number of paths between the two sub-clusters on the relatedness graph of length one (onelinks) or of length two (twolinks), and the average distance from base clusters in one sub-cluster to base clusters in the other sub-cluster.

dist(c_1, c_2) = onelinks + 0.5 \, twolinks - avgdist(c_1, c_2)

avgdist(c_1, c_2) = \frac{\sum_{b_1 \in base(c_1)} \sum_{b_2 \in base(c_2)} len(b_1, b_2)}{|base(c_1)| \, |base(c_2)|}

where len(b_1, b_2) is the path length between two base clusters in the relatedness graph.
3.4 Cluster Selection

At this stage, QDC has a small set of coherent clusters. However, there will still be more clusters than can be presented to the user. QDC needs to select the best subset of the clusters to present to the user. Ideally, these clusters should be high quality clusters that cover all the pages in the original set with minimal overlap.

QDC uses the ESTC cluster selection algorithm [6] with an improved heuristic, H(C), to select a set of clusters to show the user. The ESTC cluster selection algorithm uses the heuristic with a 3-step look-ahead hill-climbing search to select a set of clusters to present to the user. To evaluate a candidate set of clusters, C, the new heuristic considers the number of pages covered by the clusters (C_P), the number of distinct pages covered by the clusters (C_D), the number of pages not covered by any of the clusters (C_O), and the quality of each cluster (q(c)).

H(C) = \left( \sum_{c \in C} q(c) \right) - \alpha C_O - \beta (C_P - C_D)

The new heuristic has two parameters that enable control of characteristics of the clustering: α (our experiments use 0.2) and β (our experiments use 0.3). α controls coverage, and increasing α will generate clusterings with greater coverage at the cost of cluster quality. β controls overlap, and increasing β will lead to clusterings with fewer pages in multiple clusters at the cost of page coverage.

The quality of a cluster (q(c)) controls the number of clusters selected and places a bias towards high quality clusters. Because of the logarithm, high quality clusters have above average quality and therefore positive quality values, whereas low quality clusters have below average quality and therefore negative quality values.

q(c) = \log_2 \left( \frac{quality(c)}{\text{average cluster quality}} \right)

The quality measure for a cluster is extended from the base cluster quality measure to take the number of base clusters into account as well as the cluster size and query distance. The more base clusters that form a cluster, the greater the evidence that the cluster represents a semantically meaningful group of pages. But the increase in evidence with each additional base cluster decreases, so we need a function with a monotonically decreasing first derivative; QDC uses the logarithm. The query distance of a cluster to the query, QD(c), is the average query distance of its base clusters.

quality(c) = \log_2(|base(c)| + 1) \, \frac{|c|}{QD(c)}
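Putting these definitions together, the sketch below scores one candidate set of clusters; it is a sketch of the heuristic only, not of the ESTC look-ahead search that uses it. The cluster representation, the way query distances are supplied, and the interpretation of C_P as a count with repetition (so that C_P - C_D measures overlap) are assumptions consistent with the description above.

```python
import math

def cluster_quality(pages, base_distances):
    """quality(c) = log2(|base(c)| + 1) * |c| / QD(c), where QD(c) is the
    average query distance of the cluster's base clusters."""
    qd = sum(base_distances) / len(base_distances)
    return math.log2(len(base_distances) + 1) * len(pages) / qd

def heuristic(candidate, all_pages, average_quality, alpha=0.2, beta=0.3):
    """H(C) = sum over selected clusters of q(c) - alpha * C_O - beta * (C_P - C_D).

    candidate       : list of (pages, quality) pairs for the selected clusters
    all_pages       : set of all page ids in the result set
    average_quality : average cluster quality, used to normalise q(c)
    """
    covered = set()
    c_p = 0
    for pages, _ in candidate:
        covered |= pages
        c_p += len(pages)           # C_P: pages counted with repetition
    c_d = len(covered)              # C_D: distinct pages covered
    c_o = len(all_pages - covered)  # C_O: pages not covered at all
    q_sum = sum(math.log2(quality / average_quality) for _, quality in candidate)
    return q_sum - alpha * c_o - beta * (c_p - c_d)
```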

3.5 Cluster Cleaning

Base clusters are sometimes formed from polysemous words, and therefore clusters can contain pages that cover different topics. Since the clusters should relate to only one topic, pages from other topics are irrelevant. QDC computes the relevance of each page in each cluster and removes irrelevant pages.

The relevance of a page to a cluster is based on the number and size of the cluster's base clusters of which it is a member. Page relevance varies between 0 and 1, with 0 being a page that is completely irrelevant to the cluster. Page relevance is computed as the sum of the sizes of the cluster's base clusters of which it is a member, divided by the sum of the sizes of all of the cluster's base clusters.

relevance(p, c) = \frac{\sum_{\{b \mid b \in base(c) \wedge p \in b\}} |b|}{\sum_{b \in base(c)} |b|}

QDC removes irrelevant pages from clusters where two requirements are met: the page has relevance below a threshold (our experiments use 0.1) and the page has higher relevance in another cluster. A higher threshold will remove additional irrelevant pages but will also remove relevant pages; however, the threshold is not very sensitive, as the second requirement limits the pages that can be removed.

Page relevance also provides a ranking on the pages with respect to a cluster. QDC sorts and displays the pages in each cluster according to relevance. This improves cluster quality from the user's perspective, as any remaining irrelevant pages are frequently near the bottom of clusters and so users rarely see them.
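A sketch of the relevance computation, the two-condition removal rule, and the relevance ranking used for display; the 0.1 threshold is the value quoted above, and the best_other_relevance helper (the page's highest relevance in any other cluster) is an assumed interface.

```python
def page_relevance(page, base_clusters):
    """Sum of the sizes of the cluster's base clusters containing the page,
    divided by the sum of the sizes of all of the cluster's base clusters."""
    total = sum(len(base) for base in base_clusters)
    member = sum(len(base) for base in base_clusters if page in base)
    return member / total

def clean_and_rank(cluster_pages, base_clusters, best_other_relevance, threshold=0.1):
    """Drop a page only if its relevance is below the threshold AND it is more
    relevant to some other cluster; rank the remaining pages by relevance."""
    kept = []
    for page in cluster_pages:
        rel = page_relevance(page, base_clusters)
        if rel < threshold and best_other_relevance(page) > rel:
            continue
        kept.append((rel, page))
    kept.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in kept]
```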
4 Evaluation

4.1 Algorithm Speed

QDC is on the order of ten times faster than STC and on the order of one hundred times faster than K-means. QDC achieves a significant increase in algorithm speed by pruning many base clusters during base cluster construction using the new query directed cluster quality guide.

4.2 Algorithm Performance

We used 11 measurements to compare the clustering performance of QDC against four other web page clustering algorithms (STC, ESTC, K-means, and Random Clustering) on eight data sets: search results of four different queries ("salsa", "jaguar", "gp", and "victoria university") using both full-page and snippet data. The queries are of varying clustering difficulty. The simplest is "salsa", which has two distinct clusters (both large) and few outliers. "jaguar" is more challenging with five distinct clusters (one large, three medium, and one small) and some outliers. "gp" is harder with five distinct clusters (two large and three small) and many outliers. "victoria university" is the hardest with five very similar clusters (two large, one medium, and two small) and few outliers.

We compared the algorithms under an external evaluation methodology using a gold standard method [16, 7]. The method uses a rich ideal clustering structure and QC4 measurements (quality and coverage) [7], as these are well suited for web page clustering evaluation. The QC4 measurements provide a fair measure of clustering performance as they do not have any bias towards particular clustering characteristics. In particular, the clustering granularity may be coarse or fine; the clusters may be disjoint or the clusters may overlap, so that the same page may appear in several clusters; and the clustering may be "flat" so that all clusters are at the same level, or the clustering may be hierarchical so that lower-level clusters are sub-clusters of higher level clusters. Additionally, the QC4 measurements do not have the problems that other measurements have with extreme or boundary case clusterings, such as the extreme case of having all pages in one large cluster. On actual clusterings the QC4 measurements are also more expressive, as they give random clusterings much lower performance: the information capacity of the measurements is larger because the range of informative values is larger. The unreliability of precision and recall can be seen in the experiments, where the recall measure gives a higher ranking (almost 60%) to the random clustering than to one of the clearly better algorithms.

In addition to the QC4 measurements, we present the standard precision, recall, entropy, and mutual information measurements [16, 7] to provide further evidence for the results. All measurements come in both average and cluster-size weighted varieties (except mutual information, for which averaging is not applicable), providing 11 measurements in total. Average measurements treat all clusters as equally important, while weighted measurements treat larger clusters as more important. Note that there is a trade-off between different measurement pairs: quality vs coverage, precision vs recall, and entropy vs recall. Mutual information provides a single measure that combines both aspects. For all measurements, higher is better, except entropy, for which lower is better.

Figure 1. Full Text Results Averaged Over All Data Sets: (A) Combined QC4 Measure (Improvement Over Random Clustering), (B) Individual QC4 Measures, (C) Precision and Recall, (D) Entropy and Mutual Information

On average QDC performs substantially better than the other algorithms. Figure 1 shows that for full text data, on average, QDC outperforms all of the other algorithms on all measurements by convincing margins. Figure 1 (A) shows the overall percentage improvement each algorithm makes over the random clustering using the combined QC4 measure \frac{1}{2}(H(AQ, AC) + H(WQ, WC)), where H is the Harmonic Mean, AQ is Average Quality, AC is Average Coverage, WQ is Weighted Quality, and WC is Weighted Coverage.
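For reference, a small sketch of this combined score, assuming the four QC4 component scores have already been computed:

```python
def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) else 0.0

def combined_qc4(aq, ac, wq, wc):
    """0.5 * (H(AQ, AC) + H(WQ, WC)), the combined QC4 measure of Figure 1 (A)."""
    return 0.5 * (harmonic_mean(aq, ac) + harmonic_mean(wq, wc))
```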
A more detailed investigation of all test cases shows that QDC was almost universally better than the other algorithms. In 40 of the 44 full text test cases (11 measurements on each of 4 data sets), QDC was significantly better than all the other algorithms. In the four cases where QDC was worse, QDC had second best performance. The four cases were for the "salsa" data set, which was the easiest search as all algorithms performed comparatively well on this data set. In all cases where QDC performed worse, the advantage of the other algorithms was very marginal (typically a few percent). Furthermore, when considering the trade-offs, it was clear that QDC performed better overall. When QDC had slightly worse average and weighted precision and entropy than STC, it had significantly better average and weighted recall and would be significantly better on a combination score that balanced both factors in the trade-off.

We also evaluated the performance of QDC against the other algorithms at clustering just snippet data. Figure 2 shows the 11 measurements averaged across the four data sets and shows the percentage improvement each algorithm makes over the random clustering. The results show that QDC offers a very large and significant improvement in performance over other clustering algorithms. QDC has better performance in all but the unreliable recall measurements (where QDC is slightly outperformed by K-means), but QDC does significantly better on precision and entropy and would be significantly better on a combination score that balanced both factors in the trade-off.

As with the full text, QDC was almost universally better than the other algorithms on the snippet data sets. In 38 of the 44 snippet test cases, QDC was significantly better than all the other algorithms. In five of the six cases where QDC was worse, QDC had second best performance. Four of the cases were for weighted recall, a particularly unreliable measure that often gave better performance to the random clustering than to other algorithms. The other two cases were the coverage for the "salsa" data set. In all six cases the coverage or recall were only slightly worse (a few percent), but the quality, precision, and entropy were much better (twice as good in five of the six cases).

Figure 2. Snippet Results Averaged Over All Data Sets: (A) Combined QC4 Measure, (B) Individual QC4 Measures, (C) Precision and Recall, (D) Entropy and Mutual Information

4.3 Stage Performance and Sensitivity

We conducted experiments to discover the relative importance of each of the five innovations, which roughly correspond to the five stages of the algorithm. Each innovation and stage of the algorithm individually has a positive effect on clustering or algorithm performance, though not as much as the combination of all five.

The query directed cluster quality guide has a large impact on performance. The pruning it enables in the first stage of the algorithm dramatically improves algorithm speed and clustering performance. Using the cluster description similarity in the second stage improves both cluster quality and cluster coverage. The new cluster splitting method in the third stage solves the cluster chaining (drifting) problem and improves cluster quality; we independently verified this by applying the cluster splitting method to synthetic test cases, including cases that exhibited cluster chaining or drifting.

The improved heuristic of the fourth stage improves cluster selection speed significantly over ESTC and makes far better selections. The improvement to the heuristic is an additional result of the query directed cluster quality guide. The ranking method used in the final stage improves cluster quality, but does not contribute much to the external evaluation, as the ordering is not taken into account. We conducted an independent analysis of the ranking method and found that most of the pages that were hurting cluster quality were placed very close to the bottom of the cluster rankings; when sorted by search position, they were distributed randomly throughout the cluster rankings.

The heuristics in QDC use quite a number of parameters. For the experiments above, we did minimal tuning of the parameters on one of the eight data sets. We ran further experiments to explore the sensitivity of the results to the parameter values and found that, with one exception, all parameters were able to be shifted in either direction by at least 20% without making more than a ±2% difference in clustering performance. The query distance threshold in the first stage of the algorithm was more sensitive: shifting this by 20% could make up to a ±6% difference in clustering performance. This is a further indication of the importance and effect of query distance in QDC. It may be worth tuning this parameter.

4.4 Real World Usability

The results of the external evaluation are impressive, but the real test of a web page clustering algorithm is end user usability. While we acknowledge a formal user study would best confirm the results from the external evaluation, at this stage we can only provide an informal analysis and comparison with other clustering algorithms. The analysis used the same four queries as the external evaluation ("salsa", "jaguar", "gp", and "victoria university") and indicates the results from the external evaluation may have underestimated the real world usability and performance of QDC. The remainder of this section presents the informal analysis of the "jaguar" query (the results were similar for the other queries).

Table 1 shows the cluster names and number of pages in each cluster produced by QDC, K-means, ESTC, Lingo, and Vivisimo [14] for the search "jaguar", sorted by size. STC (due to result similarity with ESTC) and Random clustering (due to its obviously poor performance) are excluded here, but were included in our analysis. Lingo results are from http://carrot.cs.put.poznan.pl and Vivisimo from http://vivisimo.com. Unlike the other algorithms, Lingo and Vivisimo clustered snippets instead of full-page data and used different data sets of 200 and 228 pages respectively. We made several minor changes to the Lingo and Vivisimo clusters: normalizing cluster sizes to account for the different data set sizes, and truncating overly long cluster names. For Lingo, we display only the ten largest clusters of twenty-five.

Table 1. Clusters for "jaguar"

QDC            K-means        ESTC
Car 109        Include 115    Car 56
Cat 48         Car 22         OS 10 33
Other 40       OS 17          Panthera onca 21
Apple 35       Free 16        Online 9
Atari 18       Largest 14     Pictures 9
               Type 13        System 8
               Atari 12       Racing 7
               Service 12     Prices 7
               Panthera 9     Auto 7
               Wildlife 7

Lingo                      Vivisimo
Other 68                   Club 48
Locate a Used Car 29       Parts 46
Mac OS Jaguar 24           Jaguar Cars 41
Cat the Jaguar 20          Photos 32
Jaguar Auto Parts 18       Classic 16
Safety Information 16      Animals 7
Jaguar Club 15             Mark Webber 7
Home Page 13               Maya 5
Official Web 13            Enthusiast 4
Amazon.com Books 11        Panthera onca 4

An informal analysis of the clusters produced by the algorithms shows that QDC finds larger, broader clusters such as "Car", while the other algorithms find smaller more specific clusters such as "Locate a Used Car" and "Jaguar Auto Parts". A problem with capturing topics that are more specific than necessary is that topics of interest to some users may not be covered at all. Showing broader topics both maximizes the probability of a user being able to refine their query and simplifies the user's decision process. The decision process is simpler as there are fewer choices, and it is less likely that there are multiple relevant choices. While the smaller clusters that relate to sub-topics of "Car" are valid and semantically meaningful, they are better left for refinements of a more specific search, for instance, "jaguar car". With "jaguar", there are more obvious refinements that should be made first, and they are exactly those captured by QDC.

The informal analysis also shows that QDC finds fewer semantically meaningless clusters compared with the other algorithms. For instance, QDC found none when clustering "jaguar", whereas K-means found three ("include", "free", and "service"), ESTC and Lingo each found two, and Vivisimo found one.

The informal analysis also indicates that the usability and performance of QDC is even better than is shown by the external evaluation, because the evaluation did not penalise the creation of overly specific clusters since the gold standard included them. What the external evaluation does show is that, of the clusters produced by each algorithm, those produced by QDC had fewer irrelevant pages and covered additional relevant pages.

5 Conclusions and Future Work

This paper has presented QDC, a new query directed web page clustering algorithm that has five key innovations. Firstly, it identifies better clusters using a query directed cluster quality guide that considers the relationship between a cluster's descriptive terms and the query terms. Secondly, it increases the merging of semantically related clusters and decreases the merging of semantically unrelated clusters by comparing the descriptions of clusters in addition to comparing the overlap of page contents between clusters. Thirdly, it fixes the cluster chaining (drifting) problem using a new cluster splitting method. Fourthly, it chooses better clusters to show the user by improving the ESTC cluster selection heuristic to consider the number of clusters to select and cluster quality. Finally, it improves the clusters by ranking the pages according to cluster relevance.

The gold standard evaluation used QC4 measurements of cluster quality and cluster coverage, and the standard measurements of precision, recall, entropy, and mutual information, on eight data sets (four queries using full text data and four queries using snippet data) to show that QDC provides a substantial improvement over the Random, K-means, STC, and ESTC clustering algorithms. Additionally, an informal usability evaluation showed that QDC performs very well when compared with Random, K-means, STC, ESTC, Lingo, and Vivisimo, and that the gold standard evaluation may have underestimated the performance of QDC.

While the results are already very impressive, QDC only considers single words; STC, Lingo, and other clustering algorithms have shown that using phrase information can provide a dramatic improvement. One obvious direction for future work is to extend QDC to use phrases rather than just words. Another direction for future improvement is to consider multiple terms from the cluster descriptions when merging clusters, instead of just considering the most descriptive term.

Acknowledgment

Daniel Crabtree is supported by a Top Achiever Doctoral Scholarship from the Tertiary Education Commission of New Zealand.

References

[1] J. Back and C. Oppenheim. A model of cognitive load for IR: implications for user relevance feedback interaction. Information Research, 6(2), 2001.
[2] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[3] H. Chen and S. Dumais. Bringing order to the web: automatically categorizing search results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 145–152, 2000.
[4] R. Cilibrasi and P. M. B. Vitanyi. Automatic meaning discovery using Google. www.cwi.nl/paulv/papers/amdug.pdf, 2004.
[5] M. D. Cock and C. Cornelis. Fuzzy rough set based web query expansion. In International Workshop on Rough Sets and Soft Computing in Intelligent Agent and Web Technologies, pages 9–16, September 2005.
[6] D. Crabtree, X. Gao, and P. Andreae. Improving web clustering by cluster selection. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 172–178, September 2005.
[7] D. Crabtree, X. Gao, and P. Andreae. Standardized evaluation method for web clustering results. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 280–283, September 2005.
[8] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[9] F. Menczer. Lexical and semantic clustering by web links. Journal of the American Society for Information Science and Technology, 55(14):1261–1269, December 2004.
[10] S. Osiński, J. Stefanowski, and D. Weiss. Lingo: Search results clustering algorithm based on singular value decomposition. In Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference, Advances in Soft Computing, pages 359–368, Zakopane, Poland, 2004. Springer.
[11] S. Osiński and D. Weiss. A concept-driven algorithm for clustering search results. IEEE Intelligent Systems, 20(3):48–54, May/June 2005.
[12] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980.
[13] A. Schenker, M. Last, H. Bunke, and A. Kandel. A comparison of two novel algorithms for clustering web documents. In Proceedings of the 2nd International Workshop on Web Document Analysis (WDA 2003), pages 71–74, Edinburgh, Scotland, August 2003.
[14] A. Spink, S. Koshman, M. Park, C. Field, and B. J. Jansen. Multitasking web search on Vivisimo.com. In International Conference on Information Technology: Coding and Computing (ITCC'05), volume II, pages 486–490, 2005.
[15] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[16] A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. PhD thesis, Faculty of the Graduate School of The University of Texas at Austin, 2002.
[17] Y. Wang and M. Kitsuregawa. On combining link and contents information for web page clustering. In 13th International Conference on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence, France, pages 902–913, September 2002.
[18] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Research and Development in Information Retrieval, pages 46–54, 1998.
[19] S. M. zu Eissen, B. Stein, and M. Potthast. The suffix tree document model revisited. In Proceedings of the 5th International Conference on Knowledge Management (I-KNOW 2005), Graz, Austria, 2005.
