
Query Directed Web Page Clustering

Daniel Crabtree, Peter Andreae, Xiaoying Gao


School of Mathematics, Statistics and Computer Science
Victoria University of Wellington
New Zealand
daniel@danielcrabtree.com, pondy@mcs.vuw.ac.nz, xgao@mcs.vuw.ac.nz

Abstract

Web page clustering methods categorize and organize search results into semantically meaningful clusters that assist users with search refinement; but finding clusters that are semantically meaningful to users is difficult. In this paper, we describe a new web page clustering algorithm, QDC, which uses the user's query as part of a reliable measure of cluster quality. The new algorithm has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining or cluster drifting problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster. We evaluate QDC by comparing its clustering performance against that of four other algorithms on eight data sets (four use full text data and four use snippet data) by using eleven different external evaluation measurements. We also evaluate QDC by informally analysing its real world usability and performance through comparison with six other algorithms on four data sets. QDC provides a substantial performance improvement over other web page clustering algorithms.

1 Introduction

Web search is difficult because it is hard for users to construct queries that are both sufficiently descriptive and sufficiently discriminating to find just the web pages that are relevant to the user's search goal. Queries are often ambiguous: words and phrases are frequently polysemantic and user search goals are often narrower in scope than the queries used to express them. This ambiguity leads to search result sets containing distinct page groups that meet different user search goals. Often users must refine their search by modifying the query to filter out the irrelevant results. Users must understand the result set to refine queries effectively, but this is time consuming if the result set is unorganised.

Web page clustering is one approach for assisting users to both comprehend the result set and to refine the query. Web page clustering algorithms identify semantically meaningful groups of web pages and present these to the user as clusters. The clusters provide an overview of the contents of the result set, and when a cluster is selected the result set is refined to just the relevant pages in that cluster.

Clustering performance is very important for usability. If cluster quality is poor, the clusters will be semantically meaningless or will contain many irrelevant pages. If cluster coverage is poor, then clusters representing useful groups of pages will be missing or the clusters will be missing many relevant pages. Therefore, improving the performance of web page clustering algorithms is both worthwhile and very important.

This paper presents QDC, a query directed web page clustering algorithm that gives better clustering performance than other clustering algorithms. QDC has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining (drifting) problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster.

The next section of the paper sets QDC in the context of other research in the field by describing related work. The following sections describe the algorithm and evaluate QDC by comparing its performance against other clustering algorithms.

2 Related Work

Most clustering algorithms for web pages start by pre-processing the pages in a standard way. Various page elements and words are removed from the pages: HTML tags, punctuation and other similar non-informative text, and a set of stop words containing very common and uninformative words such as "the", "it", and "on". Light stemming, using the Porter stemming algorithm [12], is often applied to reduce terms to their root form; for example, "dogs" becomes "dog". This leaves each page represented by a sequence of words.

There are several models used to represent the pre-processed pages; the most common models are the bag of terms and the set of terms. Terms can be either words or phrases, although often just words are used. There are also graph based models that preserve the ordering of document terms [13]; one that is quite efficient is the suffix tree model [19].

Researchers have applied all the standard data clustering methods [2, 8, 15] to web page clustering: hierarchical (agglomerative and divisive), partitioning (probabilistic, k-medoids, k-means), grid-based, density-based, fuzzy c-means, Bayesian, Kohonen self-organising maps, and many more. Many algorithms build on the standard methods by using web or document specific characteristics to assist clustering: Suffix Tree Clustering (STC) [18] and Lingo [10, 11] use phrases, and some algorithms [17, 9] consider the hyperlinked nature of web pages.

Current web page clustering algorithms produce clusterings of low quality: many clusters are semantically meaningless, and the meaningful clusters are often small, miss many relevant pages, and contain irrelevant pages. The problem is that, from a textual perspective, the algorithms only use properties and statistics of pages from within the result set. Many algorithms such as hierarchical and partitioning algorithms [15] use data similarity measures [2] to construct clusters; when applied directly to page data, the similarity based methods are not effective at producing semantically meaningful clusters.

One way of improving web page clustering algorithms is to make better use of the textual properties of web pages. The semantic relationships between words (for example, synonyms, hyponyms, meronyms, etc. [4]) are very useful information. WordNet [4] is a lexical reference system and is one source of this information. However, the data in these systems is incomplete, particularly for commercial, technical, and popular culture word usage.

An alternative source, although less accurate and less informative, is to use global document analysis and term co-occurrence statistics to identify whether terms are related or unrelated. The number of pages in multi-term search result sets can approximate term co-occurrence statistics. Google distance [4] and the Rough Set based Graded Thesaurus [5] are two techniques that use these statistics to determine term similarity, and both have been shown to be effective on various tasks, such as hierarchical word clustering [4] and web query expansion [5].

QDC uses term relationships to provide a dramatic improvement in clustering performance. Specifically, QDC uses the normalized Google distance (NGD) [4]:

NGD(i, j) = \frac{\max(\ln f(i), \ln f(j)) - \ln f(i \wedge j)}{\ln M - \max(\ln f(i), \ln f(j))}

where i and j are terms, f(t) is the approximate web frequency of some term or terms, and M is the approximate total number of pages.
of terms and set of terms. Terms can be either words or Some algorithms represent pages using more advanced
phrases, although often just words are used. There are also models than a bag of words, but their performance is still in-
graph based models that preserve the ordering of document adequate. One of the best web page clustering algorithms is
terms [13]; one that is quite efficient is the suffix tree model STC, which uses the suffix tree model to identify base clus-
[19]. ters consisting of all pages containing one phrase. While
Researchers have applied all the standard data cluster- there are some high quality base clusters, many are too
ing methods [2, 8, 15] to web page clustering: hierarchi- broad and are ambiguous even within the context of the
cal (agglomerative and divisive), partitioning (probabilistic, result set and introduce many irrelevant pages into the fi-
k-medoids, k-means), grid-based, density-based, fuzzy c- nal clusters and degrade clustering performance. Advanced
means, Bayesian, Kohonen self-organising maps, and many models alone are inadequate for producing good clustering
more. Many algorithms build on the standard methods performance. QDC uses a simple set of words model and
by using web or document specific characteristics to as- removes low quality base clusters using the relationship be-
sist clustering: Suffix Tree Clustering (STC) [18] and Lingo tween cluster descriptions and the user’s query.
[10, 11] use phrases and some algorithms [17, 9] consider Base clusters often have poor coverage as they miss
the hyperlinked nature of web pages. many relevant pages. STC addresses this by merging clus-
Current web page clustering algorithms produce cluster- ters using a single-link clustering algorithm [8] with cluster
ings of low quality: many clusters are semantically mean- overlap as the similarity measure. But cluster overlap may
ingless and the meaningful clusters are often small, miss- merge semantically unrelated clusters, which lowers cluster
ing many relevant pages, and contain irrelevant pages. The quality, unless the overlap threshold is set very high. How-
problem is that from a textual perspective the algorithms ever, this leaves many related clusters separate, which limits
only use properties and statistics of pages from within the cluster coverage. QDC uses cluster description similarity in
result set. Many algorithms such as hierarchical and parti- addition to cluster overlap to provide a more effective simi-
tioning algorithms [15] use data similarity measures [2] to larity measure for merging clusters.
construct clusters; when applied directly to page data, the Some clustering algorithms, including single-link clus-
similarity based methods are not effective at producing se- tering and STC, are susceptible to cluster chaining (drifting)
mantically meaningful clusters. [19]. In a sequence of clusters, each cluster may be similar
One way of improving web page clustering algorithms to its immediate neighbours, but completely dissimilar from
is to make better use of the textual properties of web pages. clusters further away in the sequence. Clusters obtained by
The semantic relationships between words is very useful in- merging such sequences are often of low quality and are not
formation; for example, synonyms, hyponyms, meronyms, semantically meaningful. The improved similarity measure
etc. [4]. WordNet [4] is a lexical reference system and is used for merging in QDC limits this significantly, but does
one source of this information. However, the data in these not stop it entirely. QDC solves the cluster chaining (drift-
systems is incomplete, particularly for commercial, techni- ing) problem by making a second pass over merged clusters
cal, and popular culture word usage. and splitting those that have been joined inappropriately.
An alternate source, although less accurate and less in- Algorithms that construct many clusters, like STC, must
formative, is to use global document analysis and term co- select a subset (no more than about ten) to show the user,
occurrence statistics to identify whether terms are related or as the user cannot comprehend too much at one time. Ex-
unrelated. The number of pages in multi-term search result tended Suffix Tree Clustering (ESTC) [6], an extension of
sets can approximate term co-occurrence statistics. Google STC, uses a cluster selection method that considers page

Providing the most relevant pages earlier in the results can reduce the time users spend searching [1]. Most clustering algorithms order the pages in the clusters by their position in the search results [3]. Such an ordering fails to use the additional information about the user's search goal, provided by the user selecting the cluster, so the most relevant pages may not be shown first. QDC orders the pages within each cluster according to their relevance to the cluster.

3 Algorithm - QDC

This section describes QDC, a query directed web page clustering algorithm with five stages, which roughly match our five key innovations.

3.1 Base Cluster Identification

A base cluster is described by a single word and consists of all the pages containing that word. Equivalently, base clusters are single word search refinements based on the current search results. After standard page pre-processing, QDC constructs a collection of base clusters, one for every word that is in at least 4% of the pages. Using a lower threshold will increase clustering performance at the cost of algorithm speed.

Many base clusters are useless and only serve to contaminate the final clusters. Removing these useless clusters would improve the clustering, but selecting the right clusters to prune requires some guide to cluster quality. The user's query is the best, and often the only, specification of the information desired by the user. QDC uses the relationship between query terms and cluster descriptions as one part of a cluster quality guide. QDC computes the query distance of each base cluster, its distance from the query, using NGD as defined in section 2. The query distance from a base cluster to the query is the minimum of the NGD between the word specifying the base cluster and any query term.

Terms with a low query distance tend to be very specific and are often unambiguous in the context of the query, while terms with a high query distance tend to be quite broad and are often ambiguous, even in the context of the query. Ambiguous clusters are often of poor quality as they combine multiple distinct ideas of which only one is normally of interest to a given user. QDC removes these low quality clusters by removing clusters whose query distance is too large. Our experiments use cutoffs of 1.5 when using full text data and 2.5 when using snippet data. This step removes most low quality clusters, but if the cutoff is too low, high quality clusters may be removed as well; using a higher cutoff removes fewer clusters.

After pruning using query distance, there are still many low quality clusters. The relationship between the pages and the clusters can be used to further prune the collection of base clusters. The distribution of web pages tends to follow the frequency of user interest in the page topics. Therefore, larger clusters have a greater probability of being useful refinements, and cluster size is an indication of cluster quality. QDC removes the worst clusters according to a measure proportional to cluster size and inversely proportional to query distance. The number of clusters kept is proportional to the total number of pages being clustered. Keeping fewer clusters will increase algorithm speed but lower clustering performance; if too many clusters are kept, low quality clusters are not pruned and may contaminate the merging process.

Removing this many clusters would normally have a negative effect on clustering performance, but because the query directed heuristics give a reliable guide to cluster quality, the low quality clusters that would later contaminate the merging process are removed, and the performance actually improves.
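As a rough illustration of this stage, the sketch below builds base clusters and applies the query-directed pruning just described. The 4% frequency threshold and the query-distance cutoff are the values quoted in the text; the page representation, the ngd helper (a function returning the NGD between two terms, for example a wrapper around the ngd sketch in section 2 that looks up the required frequencies), and the keep_ratio used to decide how many clusters to retain are illustrative assumptions.

```python
from collections import defaultdict

def build_base_clusters(pages, min_fraction=0.04):
    """One base cluster per word occurring in at least 4% of the pages.

    pages: dict mapping page id -> set of pre-processed words.
    Returns a dict mapping word -> set of page ids."""
    clusters = defaultdict(set)
    for page_id, words in pages.items():
        for word in words:
            clusters[word].add(page_id)
    min_size = min_fraction * len(pages)
    return {word: ids for word, ids in clusters.items() if len(ids) >= min_size}

def query_distance(word, query_terms, ngd):
    # Minimum NGD between the cluster's describing word and any query term.
    return min(ngd(word, term) for term in query_terms)

def prune_base_clusters(clusters, query_terms, ngd, cutoff, keep_ratio=0.2):
    """Drop clusters whose query distance exceeds the cutoff, then keep only
    the best remaining clusters, scored proportionally to cluster size and
    inversely to query distance. The number kept is proportional to the number
    of pages being clustered; keep_ratio is an assumed constant of
    proportionality, not a value from the paper."""
    total_pages = len({p for ids in clusters.values() for p in ids})
    scored = []
    for word, ids in clusters.items():
        qd = query_distance(word, query_terms, ngd)
        if qd <= cutoff:
            scored.append((len(ids) / max(qd, 1e-9), word, ids))
    scored.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, int(keep_ratio * total_pages))
    return {word: ids for _, word, ids in scored[:n_keep]}
```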
3.2 Cluster Merging

QDC constructs larger clusters by merging clusters together. Each cluster (c) is constructed from a set of base clusters (base(c)), and a cluster is described by the word that describes the cluster's largest base cluster. However, the set of pages in a cluster is not necessarily all the pages in its base clusters. A page is only included in the cluster if it is present in enough of the base clusters in the cluster. This threshold should increase with the number of base clusters in the cluster, but should not increase steeply; QDC uses a log function. A cluster is a set that contains the pages that are in at least log2(|base(c)| + 1) of the cluster's base clusters.
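A minimal sketch of this membership rule, assuming each base cluster is given as a set of page ids:

```python
import math
from collections import Counter

def cluster_pages(base_clusters):
    """Pages belonging to a merged cluster: those appearing in at least
    log2(n + 1) of its n base clusters."""
    threshold = math.log2(len(base_clusters) + 1)
    counts = Counter(page for base in base_clusters for page in base)
    return {page for page, n in counts.items() if n >= threshold}
```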
Initially there is a singleton cluster for each base cluster. QDC merges clusters using single-link clustering over a relatedness graph. Single-link clustering merges together all clusters that are part of the same connected component on the graph. The relatedness graph has the clusters as vertices and has an edge between any two clusters that are sufficiently similar.
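A sketch of this merge step as connected components of the relatedness graph; the similar predicate stands for the "sufficiently similar" test defined below, and the union-find representation is an implementation choice, not something specified by the paper.

```python
def single_link_merge(clusters, similar):
    """Group clusters into the connected components of the relatedness graph,
    where an edge joins any two clusters judged sufficiently similar."""
    parent = list(range(len(clusters)))

    def find(i):
        # Find the component representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if similar(clusters[i], clusters[j]):
                parent[find(i)] = find(j)

    components = {}
    for i, cluster in enumerate(clusters):
        components.setdefault(find(i), []).append(cluster)
    return list(components.values())
```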
Previous methods use cluster content similarity and often merge unrelated clusters. Merging unrelated clusters decreases cluster quality by introducing irrelevant pages. The problem is exacerbated by cluster chaining (drifting): clusters that are closely related to one of the unrelated clusters but not the others are often merged in too, bringing further irrelevant pages with them.

QDC defines two clusters to be sufficiently similar only if both the cluster contents and the cluster descriptions are sufficiently similar. Requiring the cluster descriptions to match in addition to the contents dramatically reduces the merging of semantically unrelated clusters and increases cluster quality. Additionally, the cluster contents similarity threshold can be significantly reduced, which allows more semantically related clusters to merge (increasing cluster coverage).

The cluster contents are sufficiently similar if enough of the pages in one cluster are also in the other cluster (i.e., if there is enough overlap between the clusters):

\frac{|c_1 \wedge c_2|}{\min(|c_1|, |c_2|)} > 0.6

The cluster descriptions are sufficiently similar if the pair of cluster descriptions occur together on the web significantly more frequently than would be expected if the pair were unrelated (i.e., if their appearances were independent):

\frac{M \, f(d_1 \wedge d_2)}{f(d_1) \, f(d_2)} > 4

where d_1 and d_2 are the cluster descriptions, and f(t) and M are as per NGD in section 2.

Decreasing either the cluster content or the cluster description similarity threshold will increase cluster coverage at the cost of greater cluster overlap.
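The two requirements above can be folded into a single edge test for the relatedness graph used by the merge step. In the sketch below, the 0.6 and 4 thresholds are the values quoted above; the frequency arguments are the same kind of approximate web counts used for NGD, and the data structures are illustrative.

```python
def sufficiently_similar(pages1, pages2, f_d1, f_d2, f_d1d2, m,
                         overlap_threshold=0.6, cooccurrence_threshold=4.0):
    """Edge test for the relatedness graph: two clusters are connected only if
    both their page overlap and their description co-occurrence are high enough.

    pages1, pages2     : sets of page ids in the two clusters
    f_d1, f_d2, f_d1d2 : approximate web frequencies of the two cluster
                         descriptions and of the pair occurring together
    m                  : approximate total number of pages
    """
    overlap = len(pages1 & pages2) / min(len(pages1), len(pages2))
    if overlap <= overlap_threshold:
        return False
    return m * f_d1d2 / (f_d1 * f_d2) > cooccurrence_threshold
```

A predicate of this shape is what the single_link_merge sketch shown earlier would take as its similar argument.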
3.3 Cluster Splitting

Each cluster now contains at least all the base clusters that relate to one idea; this is assured as single-link clustering merges all related clusters. But single-link clustering, even with our improved similarity function, can produce clusters containing multiple ideas and irrelevant base clusters due to cluster chaining (drifting). Such clusters need to be split. Interestingly, it is easier to split such a compound cluster than to prevent its formation in the first place, because the splitting can take into account the final cluster, whereas the merging process cannot.

QDC uses a hierarchical agglomerative clustering algorithm to identify the sub-cluster structure within each cluster. The algorithm uses a distance measure to build a dendrogram for each cluster starting from the base clusters in the cluster. Each cluster is split by cutting its dendrogram at an appropriate point: when the distance between the closest pair of sub-clusters falls below a threshold (our experiments use -2). This threshold means that any groups of base clusters that are not tightly interconnected with each other will be split. Using a higher threshold will lower the split point and increase the splitting frequency.

QDC uses a distance measure with three components: the number of paths between the two sub-clusters on the relatedness graph of length one (onelinks) or of length two (twolinks), and the average distance from base clusters in one sub-cluster to base clusters in the other sub-cluster.

dist(c_1, c_2) = onelinks + 0.5 \, twolinks - avgdist(c_1, c_2)

avgdist(c_1, c_2) = \frac{\sum_{b_1 \in base(c_1)} \sum_{b_2 \in base(c_2)} len(b_1, b_2)}{|base(c_1)| \, |base(c_2)|}

where len(b_1, b_2) is the path length between two base clusters in the relatedness graph.
3.4 Cluster Selection

At this stage, QDC has a small set of coherent clusters. However, there will still be more clusters than can be presented to the user. QDC needs to select the best subset of the clusters to present to the user. Ideally, these clusters should be high quality clusters that cover all the pages in the original set with minimal overlap.

QDC uses the ESTC cluster selection algorithm [6] with an improved heuristic, H(C), to select a set of clusters to show the user. The ESTC cluster selection algorithm uses the heuristic with a 3-step look-ahead hill-climbing search to select a set of clusters to present to the user. To evaluate a candidate set of clusters, C, the new heuristic considers the number of pages covered by the clusters (C_P), the number of distinct pages covered by the clusters (C_D), the number of pages not covered by any of the clusters (C_O), and the quality of each cluster (q(c)).

H(C) = \left( \sum_{c \in C} q(c) \right) - \alpha C_O - \beta (C_P - C_D)

The new heuristic has two parameters that enable control of characteristics of the clustering: α (our experiments use 0.2) and β (our experiments use 0.3). α controls coverage, and increasing α will generate clusterings with greater coverage at the cost of cluster quality. β controls overlap, and increasing β will lead to clusterings with fewer pages in multiple clusters at the cost of page coverage.

The quality of a cluster (q(c)) controls the number of clusters selected and places a bias towards high quality clusters. Because of the logarithm, high quality clusters have above average quality and therefore positive quality values, whereas low quality clusters have below average quality and therefore negative quality values.

q(c) = \log_2 \left( \frac{quality(c)}{\text{average cluster quality}} \right)

The quality measure for a cluster is extended from the base cluster quality measure to take the number of base clusters into account as well as the cluster size and query distance. The more base clusters that form a cluster, the greater the evidence that the cluster represents a semantically meaningful group of pages. But the increase in evidence with each additional base cluster decreases, so we need a function with a monotonically decreasing first derivative; QDC uses the logarithm. The query distance of a cluster to the query, QD(c), is the average query distance of its base clusters.

quality(c) = \log_2(|base(c)| + 1) \, \frac{|c|}{QD(c)}
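Putting these definitions together, the sketch below scores one candidate set of clusters; it is a sketch of the heuristic only, not of the ESTC look-ahead search that uses it. The cluster representation, the way query distances are supplied, and the interpretation of C_P as a count with repetition (so that C_P - C_D measures overlap) are assumptions consistent with the description above.

```python
import math

def cluster_quality(pages, base_distances):
    """quality(c) = log2(|base(c)| + 1) * |c| / QD(c), where QD(c) is the
    average query distance of the cluster's base clusters."""
    qd = sum(base_distances) / len(base_distances)
    return math.log2(len(base_distances) + 1) * len(pages) / qd

def heuristic(candidate, all_pages, average_quality, alpha=0.2, beta=0.3):
    """H(C) = sum over selected clusters of q(c) - alpha * C_O - beta * (C_P - C_D).

    candidate       : list of (pages, quality) pairs for the selected clusters
    all_pages       : set of all page ids in the result set
    average_quality : average cluster quality, used to normalise q(c)
    """
    covered = set()
    c_p = 0
    for pages, _ in candidate:
        covered |= pages
        c_p += len(pages)           # C_P: pages counted with repetition
    c_d = len(covered)              # C_D: distinct pages covered
    c_o = len(all_pages - covered)  # C_O: pages not covered at all
    q_sum = sum(math.log2(quality / average_quality) for _, quality in candidate)
    return q_sum - alpha * c_o - beta * (c_p - c_d)
```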

3.5 Cluster Cleaning

Base clusters are sometimes formed from polysemous words, and therefore clusters can contain pages that cover different topics. Since the clusters should relate to only one topic, pages from other topics are irrelevant. QDC computes the relevance of each page in each cluster and removes irrelevant pages.

The relevance of a page to a cluster is based on the number and size of the cluster's base clusters of which it is a member. Page relevance varies between 0 and 1, with 0 being a page that is completely irrelevant to the cluster. Page relevance is computed as the sum of the sizes of the cluster's base clusters of which it is a member, divided by the sum of the sizes of all of the cluster's base clusters.

relevance(p, c) = \frac{\sum_{\{b \mid b \in base(c) \wedge p \in b\}} |b|}{\sum_{b \in base(c)} |b|}

QDC removes irrelevant pages from clusters where two requirements are met: the page has relevance below a threshold (our experiments use 0.1) and the page has higher relevance in another cluster. A higher threshold will remove additional irrelevant pages but will also remove relevant pages; however, the threshold is not very sensitive, as the second requirement limits the pages that can be removed.

Page relevance also provides a ranking on the pages with respect to a cluster. QDC sorts and displays the pages in each cluster according to relevance. This improves cluster quality from the user's perspective, as any remaining irrelevant pages are frequently near the bottom of clusters and so users rarely see them.
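A sketch of the relevance computation, the two-condition removal rule, and the relevance ranking used for display; the 0.1 threshold is the value quoted above, and the best_other_relevance helper (the page's highest relevance in any other cluster) is an assumed interface.

```python
def page_relevance(page, base_clusters):
    """Sum of the sizes of the cluster's base clusters containing the page,
    divided by the sum of the sizes of all of the cluster's base clusters."""
    total = sum(len(base) for base in base_clusters)
    member = sum(len(base) for base in base_clusters if page in base)
    return member / total

def clean_and_rank(cluster_pages, base_clusters, best_other_relevance, threshold=0.1):
    """Drop a page only if its relevance is below the threshold AND it is more
    relevant to some other cluster; rank the remaining pages by relevance."""
    kept = []
    for page in cluster_pages:
        rel = page_relevance(page, base_clusters)
        if rel < threshold and best_other_relevance(page) > rel:
            continue
        kept.append((rel, page))
    kept.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in kept]
```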
4 Evaluation

4.1 Algorithm Speed

QDC is on the order of ten times faster than STC and on the order of one hundred times faster than K-means. QDC achieves a significant increase in algorithm speed by pruning many base clusters during base cluster construction using the new query directed cluster quality guide.

4.2 Algorithm Performance

We used 11 measurements to compare the clustering performance of QDC against four other web page clustering algorithms (STC, ESTC, K-means, and Random Clustering) on eight data sets: search results of four different queries ("salsa", "jaguar", "gp", and "victoria university") using both full-page and snippet data. The queries are of varying clustering difficulty. The simplest is "salsa", which has two distinct clusters (both large) and few outliers. "jaguar" is more challenging with five distinct clusters (one large, three medium, and one small) and some outliers. "gp" is harder with five distinct clusters (two large and three small) and many outliers. "victoria university" is the hardest with five very similar clusters (two large, one medium, and two small) and few outliers.

We compared the algorithms under an external evaluation methodology using a gold standard method [16, 7]. The method uses a rich ideal clustering structure and QC4 measurements (quality and coverage) [7], as these are well suited for web page clustering evaluation. The QC4 measurements provide a fair measure of clustering performance as they do not have any bias towards particular clustering characteristics. In particular, the clustering granularity may be coarse or fine; the clusters may be disjoint or the clusters may overlap, so that the same page may appear in several clusters; and the clustering may be "flat" so that all clusters are at the same level, or the clustering may be hierarchical so that lower-level clusters are sub-clusters of higher level clusters. Additionally, the QC4 measurements do not have the problems that other measurements have with extreme or boundary case clusterings, such as the extreme case of having all pages in one large cluster. On actual clusterings the QC4 measurements are also more expressive, as they give random clusterings much lower performance: the information capacity of the measurements is larger because the range of informative values is larger. The unreliability of precision and recall can be seen in the experiments, where the recall measure gives a higher ranking (almost 60%) to the random clustering than to one of the clearly better algorithms.

In addition to the QC4 measurements, we present the standard precision, recall, entropy, and mutual information measurements [16, 7] to provide further evidence for the results. All measurements come in both average and cluster-size weighted varieties (except mutual information, for which averaging is not applicable), providing 11 measurements in total. Average measurements treat all clusters as equally important, while weighted measurements treat larger clusters as more important. Note that there is a trade-off between different measurement pairs: quality vs coverage, precision vs recall, and entropy vs recall. Mutual information provides a single measure that combines both aspects. For all measurements, higher is better, except entropy, for which lower is better.

Figure 1. Full Text Results Averaged Over All Data Sets: (A) Combined QC4 Measure (Improvement Over Random Clustering), (B) Individual QC4 Measures, (C) Precision and Recall, (D) Entropy and Mutual Information

On average QDC performs substantially better than the other algorithms. Figure 1 shows that for full text data, on average, QDC outperforms all of the other algorithms on all measurements by convincing margins. Figure 1 (A) shows the overall percentage improvement each algorithm makes over the random clustering using the combined QC4 measure \frac{1}{2}(H(AQ, AC) + H(WQ, WC)), where H is the Harmonic Mean, AQ is Average Quality, AC is Average Coverage, WQ is Weighted Quality, and WC is Weighted Coverage.
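For reference, a small sketch of this combined score, assuming the four QC4 component scores have already been computed:

```python
def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) else 0.0

def combined_qc4(aq, ac, wq, wc):
    """0.5 * (H(AQ, AC) + H(WQ, WC)), the combined QC4 measure of Figure 1 (A)."""
    return 0.5 * (harmonic_mean(aq, ac) + harmonic_mean(wq, wc))
```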
A more detailed investigation of all test cases shows that QDC was almost universally better than the other algorithms. In 40 of the 44 full text test cases (11 measurements on each of 4 data sets), QDC was significantly better than all the other algorithms. In the four cases where QDC was worse, QDC had second best performance. The four cases were for the "salsa" data set, which was the easiest search as all algorithms performed comparatively well on this data set. In all cases where QDC performed worse, the advantage of the other algorithms was very marginal (typically a few percent). Furthermore, when considering the trade-offs, it was clear that QDC performed better overall. When QDC had slightly worse average and weighted precision and entropy than STC, it had significantly better average and weighted recall and would be significantly better on a combination score that balanced both factors in the trade-off.

We also evaluated the performance of QDC against the other algorithms at clustering just snippet data. Figure 2 shows the 11 measurements averaged across the four data sets and shows the percentage improvement each algorithm makes over the random clustering. The results show that QDC offers a very large and significant improvement in performance over other clustering algorithms. QDC has better performance in all but the unreliable recall measurements (where QDC is slightly outperformed by K-means), but QDC does significantly better on precision and entropy and would be significantly better on a combination score that balanced both factors in the trade-off.

As with the full text, QDC was almost universally better than the other algorithms on the snippet data sets. In 38 of the 44 snippet test cases, QDC was significantly better than all the other algorithms. In five of the six cases where QDC was worse, QDC had second best performance. Four of the cases were for weighted recall, a particularly unreliable measure that often gave better performance to the random clustering than to other algorithms. The other two cases were the coverage for the "salsa" data set. In all six cases the coverage or recall were only slightly worse (a few percent), but the quality, precision, and entropy were much better (twice as good in five of the six cases).

Figure 2. Snippet Results Averaged Over All Data Sets: (A) Combined QC4 Measure, (B) Individual QC4 Measures, (C) Precision and Recall, (D) Entropy and Mutual Information

4.3 Stage Performance and Sensitivity

We conducted experiments to discover the relative importance of each of the five innovations, which roughly correspond to the five stages of the algorithm. Each innovation and stage of the algorithm individually has a positive effect on clustering or algorithm performance, though not as much as the combination of all five.

The query directed cluster quality guide has a large impact on performance. The pruning it enables in the first stage of the algorithm dramatically improves algorithm speed and clustering performance. Using the cluster description similarity in the second stage improves both cluster quality and cluster coverage. The new cluster splitting method in the third stage solves the cluster chaining (drifting) problem and improves cluster quality; we independently verified this by applying the cluster splitting method to synthetic test cases, including cases that exhibited cluster chaining or drifting.

The improved heuristic of the fourth stage improves cluster selection speed significantly over ESTC and makes far better selections. The improvement to the heuristic is an additional result of the query directed cluster quality guide. The ranking method used in the final stage improves cluster quality, but does not contribute much to the external evaluation, as the ordering is not taken into account. We conducted an independent analysis of the ranking method and found that most of the pages that were hurting cluster quality were placed very close to the bottom of the cluster rankings; when sorted by search position, they were distributed randomly throughout the cluster rankings.

The heuristics in QDC use quite a number of parameters. For the experiments above, we did minimal tuning of the parameters on one of the eight data sets. We ran further experiments to explore the sensitivity of the results to the parameter values and found that, with one exception, all parameters were able to be shifted in either direction by at least 20% without making more than a ±2% difference in clustering performance. The query distance threshold in the first stage of the algorithm was more sensitive: shifting this by 20% could make up to a ±6% difference in clustering performance. This is a further indication of the importance and effect of query distance in QDC. It may be worth tuning this parameter.

4.4 Real World Usability

The results of the external evaluation are impressive, but the real test of a web page clustering algorithm is end user usability. While we acknowledge a formal user study would best confirm the results from the external evaluation, at this stage we can only provide an informal analysis and comparison with other clustering algorithms. The analysis used the same four queries as the external evaluation ("salsa", "jaguar", "gp", and "victoria university") and indicates the results from the external evaluation may have underestimated the real world usability and performance of QDC. The remainder of this section presents the informal analysis of the "jaguar" query (the results were similar for the other queries).

Table 1 shows the cluster names and number of pages in each cluster produced by QDC, K-means, ESTC, Lingo, and Vivisimo [14] for the search "jaguar", sorted by size. STC (due to result similarity with ESTC) and Random clustering (due to its obviously poor performance) are excluded here, but were included in our analysis. Lingo results are from http://carrot.cs.put.poznan.pl and Vivisimo from http://vivisimo.com. Unlike the other algorithms, Lingo and Vivisimo clustered snippets instead of full-page data and used different data sets of 200 and 228 pages respectively. We made several minor changes to the Lingo and Vivisimo clusters: normalizing cluster sizes to account for the different data set sizes, and truncating overly long cluster names. For Lingo, we display only the ten largest clusters of twenty-five.

Table 1. Clusters for "jaguar"

QDC            K-means        ESTC
Car 109        Include 115    Car 56
Cat 48         Car 22         OS 10 33
Other 40       OS 17          Panthera onca 21
Apple 35       Free 16        Online 9
Atari 18       Largest 14     Pictures 9
               Type 13        System 8
               Atari 12       Racing 7
               Service 12     Prices 7
               Panthera 9     Auto 7
               Wildlife 7

Lingo                      Vivisimo
Other 68                   Club 48
Locate a Used Car 29       Parts 46
Mac OS Jaguar 24           Jaguar Cars 41
Cat the Jaguar 20          Photos 32
Jaguar Auto Parts 18       Classic 16
Safety Information 16      Animals 7
Jaguar Club 15             Mark Webber 7
Home Page 13               Maya 5
Official Web 13            Enthusiast 4
Amazon.com Books 11        Panthera onca 4

An informal analysis of the clusters produced by the algorithms shows that QDC finds larger, broader clusters such as "Car", while the other algorithms find smaller more specific clusters such as "Locate a Used Car" and "Jaguar Auto Parts". A problem with capturing topics that are more specific than necessary is that topics of interest to some users may not be covered at all. Showing broader topics both maximizes the probability of a user being able to refine their query and simplifies the user's decision process. The decision process is simpler as there are fewer choices, and it is less likely that there are multiple relevant choices. While the smaller clusters that relate to sub-topics of "Car" are valid and semantically meaningful, they are better left for refinements of a more specific search, for instance, "jaguar car". With "jaguar", there are more obvious refinements that should be made first, and they are exactly those captured by QDC.

The informal analysis also shows that QDC finds fewer semantically meaningless clusters compared with the other algorithms. For instance, QDC found none when clustering "jaguar", whereas K-means found three ("include", "free", and "service"), ESTC and Lingo each found two, and Vivisimo found one.

The informal analysis also indicates that the usability and performance of QDC is even better than is shown by the external evaluation, because the evaluation did not penalise the creation of overly specific clusters since the gold standard included them. What the external evaluation does show is that, of the clusters produced by each algorithm, those produced by QDC had fewer irrelevant pages and covered additional relevant pages.

5 Conclusions and Future Work

This paper has presented QDC, a new query directed web page clustering algorithm that has five key innovations. Firstly, it identifies better clusters using a query directed cluster quality guide that considers the relationship between a cluster's descriptive terms and the query terms. Secondly, it increases the merging of semantically related clusters and decreases the merging of semantically unrelated clusters by comparing the descriptions of clusters in addition to comparing the overlap of page contents between clusters. Thirdly, it fixes the cluster chaining (drifting) problem using a new cluster splitting method. Fourthly, it chooses better clusters to show the user by improving the ESTC cluster selection heuristic to consider the number of clusters to select and cluster quality. Finally, it improves the clusters by ranking the pages according to cluster relevance.

The gold standard evaluation used QC4 measurements of cluster quality and cluster coverage, and the standard measurements of precision, recall, entropy, and mutual information, on eight data sets (four queries using full text data and four queries using snippet data) to show that QDC provides a substantial improvement over the Random, K-means, STC, and ESTC clustering algorithms. Additionally, an informal usability evaluation showed that QDC performs very well when compared with Random, K-means, STC, ESTC, Lingo, and Vivisimo, and that the gold standard evaluation may have underestimated the performance of QDC.

While the results are already very impressive, QDC only considers single words; STC, Lingo, and other clustering algorithms have shown that using phrase information can provide a dramatic improvement. One obvious direction for future work is to extend QDC to use phrases rather than just words. Another direction for future improvement is to consider multiple terms from the cluster descriptions when merging clusters, instead of just considering the most descriptive term.

Acknowledgment

Daniel Crabtree is supported by a Top Achiever Doctoral Scholarship from the Tertiary Education Commission of New Zealand.

References

[1] J. Back and C. Oppenheim. A model of cognitive load for IR: implications for user relevance feedback interaction. Information Research, 6(2), 2001.
[2] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[3] H. Chen and S. Dumais. Bringing order to the web: automatically categorizing search results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 145–152, 2000.
[4] R. Cilibrasi and P. M. B. Vitanyi. Automatic meaning discovery using Google. www.cwi.nl/paulv/papers/amdug.pdf, 2004.
[5] M. D. Cock and C. Cornelis. Fuzzy rough set based web query expansion. In International Workshop on Rough Sets and Soft Computing in Intelligent Agent and Web Technologies, pages 9–16, September 2005.
[6] D. Crabtree, X. Gao, and P. Andreae. Improving web clustering by cluster selection. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 172–178, September 2005.
[7] D. Crabtree, X. Gao, and P. Andreae. Standardized evaluation method for web clustering results. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 280–283, September 2005.
[8] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[9] F. Menczer. Lexical and semantic clustering by web links. Journal of the American Society for Information Science and Technology, 55(14):1261–1269, December 2004.
[10] S. Osiński, J. Stefanowski, and D. Weiss. Lingo: Search results clustering algorithm based on singular value decomposition. In Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference, Advances in Soft Computing, pages 359–368, Zakopane, Poland, 2004. Springer.
[11] S. Osiński and D. Weiss. A concept-driven algorithm for clustering search results. IEEE Intelligent Systems, 20(3):48–54, May/June 2005.
[12] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980.
[13] A. Schenker, M. Last, H. Bunke, and A. Kandel. A comparison of two novel algorithms for clustering web documents. In Proceedings of the 2nd International Workshop on Web Document Analysis (WDA 2003), pages 71–74, Edinburgh, Scotland, August 2003.
[14] A. Spink, S. Koshman, M. Park, C. Field, and B. J. Jansen. Multitasking web search on Vivisimo.com. In International Conference on Information Technology: Coding and Computing (ITCC'05), volume II, pages 486–490, 2005.
[15] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[16] A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. PhD thesis, Faculty of the Graduate School of The University of Texas at Austin, 2002.
[17] Y. Wang and M. Kitsuregawa. On combining link and contents information for web page clustering. In 13th International Conference on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence, France, pages 902–913, September 2002.
[18] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Research and Development in Information Retrieval, pages 46–54, 1998.
[19] S. M. zu Eissen, B. Stein, and M. Potthast. The suffix tree document model revisited. In Proceedings of the 5th International Conference on Knowledge Management (I-KNOW 2005), Graz, Austria, 2005.
