PRWB: A Framework for Creating Personal,
Site-Specific Web Crawlers
Abstract
Crawlers, also called robots and spiders, are programs that browse the World Wide
Web autonomously. This paper describes PRWB, a Python toolkit and interactive
development environment for Web crawlers. Unlike other crawler development
systems, PRWB is geared towards developing crawlers that are Web-site-specific,
personally customized, and relocatable. PRWB allows site-specific crawling rules to
be encapsulated and reused in content analyzers, known as classifiers. Personal
crawling tasks can be performed (often without programming) in the Crawler
Workbench, an interactive environment for crawler development and testing. For
efficiency, relocatable crawlers developed using PRWB can be uploaded and executed
on a remote Web server.
Site-specific crawlers are also ill-supported by current crawler development
techniques. A site-specific crawler is tailored to the syntax of the particular Web sites
it crawls (presentation style, linking, and URL naming schemes). Examples of site-
specific crawlers include metasearch engines, homepage finders, robots for
personalized news retrieval, and comparison-shopping robots. Site-specific crawlers
are created by trial-and-error. The programmer needs to develop rules and patterns for
navigating within the site and parsing local documents by a process of "reverse
engineering." No good model exists for modular construction of site-specific crawlers
from reusable components, so site-specific rules engineered for one crawler are
difficult to transfer to another crawler. This causes users building crawlers for a site to
repeat the work of others.
A third area with room for improvement in present techniques is relocatability. A
relocatable crawler is capable of executing on a remote host on the network. Existing
Web crawlers tend to be nonrelocatable, pulling all the pages back to their home site
before they process them. This may be inefficient and wasteful of network bandwidth,
since crawlers often tend to look at more pages than they actually need. Further, this
strategy may not work in some cases because of access restrictions at the source site.
For example, when a crawler resides on a server outside a company's firewall, then it
can't be used to crawl the Web inside the firewall, even if the user invoking it has
permission to do so. Users often have a home computer connected by a slow phone
line to a fast Internet Service Provider (ISP) who may in turn communicate with fast
Web servers. The bottleneck in this communication is the user's link to the ISP. In
such cases the best location for a user-specific crawler is the ISP or, in the case of site-
specific crawlers, the Web site itself. On-site execution provides the crawler with
high-bandwidth access to the data, and permits the site to do effective billing, access
restriction, and load control.
2 Crawling Model
This section describes the model of Web traversal adopted by the PRWB crawling
toolkit, and the basic Python classes that implement the model. This interface is
used by a programmer writing a crawler directly in Python. Users of the Crawler
Workbench need not learn it, unless they want to extend the capabilities of the
Workbench with custom-programmed code.
The Workbench includes a customizable crawler with a selection of
predefined shouldVisit and visit operations for the user to choose from. For more
customized operations, the Workbench allows a programmer to write JavaScript code
for shouldVisit and visit (if the Web browser supports it),
manipulating Pages and Links as Python objects.
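To make the shouldVisit/visit model concrete, here is a minimal, self-contained sketch of
the traversal loop it implies, written directly in Python. Only the shouldVisit and visit
names come from the text; the Crawler class, its run method, and the link-extraction
details below are illustrative assumptions rather than the actual PRWB classes.

    # Sketch of the shouldVisit/visit traversal model (illustrative, not the PRWB API).
    import re
    import urllib.parse
    import urllib.request
    from collections import deque

    class Crawler:
        def __init__(self, max_pages=50):
            self.max_pages = max_pages

        def shouldVisit(self, link):
            """Decide from the link (a URL string here) whether to follow it."""
            return True

        def visit(self, url, page):
            """Process the content of a fetched page."""
            pass

        def run(self, root):
            seen, frontier, fetched = {root}, deque([root]), 0
            while frontier and fetched < self.max_pages:
                url = frontier.popleft()
                try:
                    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
                except OSError:
                    continue
                fetched += 1
                self.visit(url, page)
                # Queue only the outgoing links that the crawler wants to follow.
                for href in re.findall(r'href="([^"#]+)"', page, re.I):
                    link = urllib.parse.urljoin(url, href)
                    if link not in seen and self.shouldVisit(link):
                        seen.add(link)
                        frontier.append(link)

A site-specific crawler subclasses this, overriding shouldVisit with the URL and anchor
rules for its target site and visit with the desired processing.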
The built-in shouldVisit predicates test whether a link should be followed, based on
the link's URL, anchor, or attributes attached to it by a classifier. The built-
in visit operations include, among others:
Save, which stores visited pages to the local filesystem, in a directory structure
mirroring the organization of files at the server, as revealed by the URLs seen;
Concatenate, which concatenates the visited pages into a single HTML
document, making it easy to view, save, or print them all at once (this feature
has also been called linearization);
Extract, which matches a pattern (such as a regular expression) against each
visited page and stores all the matching text to a file.
Each of the visit operations can be parameterized by a page predicate. This predicate
is used to determine whether a visited page should actually be processed. It can be
based on the page's title, URL, content, or attributes attached to it by a classifier. The
page predicate is needed for two reasons: first, shouldVisit cannot always tell from
the link alone whether a page will be interesting, and in such cases the crawler may
actually need to fetch the page and let the page predicate decide whether to process it.
Second, it may be necessary to crawl through uninteresting pages to reach interesting
ones, and so shouldVisit may need to be more general than the processing criterion.
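Continuing the sketch above, the following shows how an Extract-style operation
parameterized by a page predicate might be expressed: shouldVisit bounds the crawl to a
URL prefix, while the page predicate separately decides which of the fetched pages are
actually processed. The ExtractCrawler class, its parameters, and the example URL are
illustrative assumptions, not the PRWB API.

    # Extract-style crawler (sketch): reuses the Crawler class sketched earlier.
    import re

    class ExtractCrawler(Crawler):
        def __init__(self, prefix, page_predicate, pattern, outfile):
            super().__init__()
            self.prefix = prefix                    # crawl boundary (URL prefix)
            self.page_predicate = page_predicate    # which fetched pages to process
            self.pattern = re.compile(pattern)      # text to pull out of each page
            self.outfile = outfile

        def shouldVisit(self, link):
            # Broader than the processing criterion: uninteresting pages may
            # still be crawled through to reach interesting ones.
            return link.startswith(self.prefix)

        def visit(self, url, page):
            if not self.page_predicate(url, page):
                return                              # fetched, but not processed
            with open(self.outfile, "a", encoding="utf-8") as f:
                for match in self.pattern.findall(page):
                    f.write(match + "\n")

    # Example: collect second-level headings, but only from pages mentioning "ML".
    crawler = ExtractCrawler(
        prefix="http://www.example.edu/sml/",               # hypothetical site
        page_predicate=lambda url, page: "ML" in page,
        pattern=r"<h2[^>]*>(.*?)</h2>",
        outfile="headings.txt",
    )
    # crawler.run("http://www.example.edu/sml/index.html")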
An example of using the customizable crawler for a personal crawling task is shown
in Figure 1. The task is to print a hardcopy of Robert Harper's Introduction to
Standard ML, a Web document divided into multiple HTML pages.
Figure 1 (a)-(d): Crawler Workbench screenshots for this task.
Category Ranking
The precision of search engine queries often suffers from the polysemy problem:
query words with multiple meanings introduce spurious matches. This problem gets
worse if there exists a bias on the Web towards certain interpretations of a word. For
example, querying AltaVista for the term "amoeba" turns up far more references to
the Amoeba distributed operating system than to the unicellular organism, at least
among the first 50 results.
Architecture/Design
Advantages for crawler operators:
You get to gather exactly the data you want.
Once coded, the crawler can serve you for a long time.
Disadvantages for crawler operators:
It requires some programming knowledge, such as Python or R.
In terms of time, it is sometimes not the best option.
Your traffic may be identified as abusive or suspicious and blocked.
You may be constrained by your limits on bandwidth, processing, or storage.
Advantages for site owners:
If your site is included in some kind of index, list, or database, you may
gain additional traffic as a result.
Disadvantages for site owners:
The crawler bot traffic may be disruptive or annoying
Suggestions:
I have an idea for a search tool, based not on a custom web crawler but on
Google's Search API:
Whenever a new buzzword appears, the service would set up searches for that buzzword
on a continuous basis, to see how it evolves in use.
In my market it is, for example, Eddystone. I (or customers) would then set this up as an
interesting keyword to follow over time, to see the number of sites referring to it and
possible associations with other keywords/phrases such as physical web and proximity
marketing.
Some seeding of such (possibly) trending buzzwords would be needed, but after that the
service would build up relations between buzzwords automatically.
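As a rough sketch of this idea, the script below could be run on a schedule (e.g. daily)
to record how often each tracked buzzword turns up. It assumes Google's Custom Search
JSON API (the customsearch/v1 endpoint and its searchInformation.totalResults field); the
API key, search-engine ID, keyword list, and output file are placeholders.

    # Buzzword tracker sketch: append today's estimated hit counts to a CSV file.
    # API_KEY and CX (programmable search engine ID) are placeholders to fill in.
    import csv
    import datetime
    import requests

    API_KEY = "YOUR_API_KEY"
    CX = "YOUR_SEARCH_ENGINE_ID"
    KEYWORDS = ["eddystone", "physical web", "proximity marketing"]

    def estimated_hits(query):
        """Return the estimated number of results the API reports for a query."""
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query},
            timeout=30,
        )
        resp.raise_for_status()
        return int(resp.json()["searchInformation"]["totalResults"])

    def record_trends(path="buzzword_trends.csv"):
        """Append one row per keyword so the counts can be plotted over time."""
        today = datetime.date.today().isoformat()
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            for kw in KEYWORDS:
                writer.writerow([today, kw, estimated_hits(kw)])

    if __name__ == "__main__":
        record_trends()

Associations between buzzwords could be approximated the same way, by querying for pairs
of keywords and comparing the combined count with the individual counts.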