EECS 395/495 Lecture 5: Web Crawlers
Doug Downey
How Does Web Search Work?
- Inverted Indices
- Web Crawlers
- Ranking Algorithms
Web Crawlers
http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
Crawling
Crawl(urls = {p1, ..., pn}):
    Retrieve some pi from urls
    urls -= pi
    urls += pi's links
Questions: Breadth-first or depth-first? Where to start? How to scale?
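The pseudocode above leaves these choices open. As a concrete illustration (not Mercator's implementation), here is a minimal breadth-first crawl loop in Python; the helper names fetch and extract_links and the regex-based link extraction are simplifying assumptions.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch(url):
    # Download a page; a real crawler adds politeness delays and retries.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(base_url, html):
    # Crude href extraction; a real crawler would use an HTML parser.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)        # FIFO queue => breadth-first
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()       # "Retrieve some pi from urls"
        try:
            html = fetch(url)
        except Exception:
            continue                   # skip unreachable pages
        fetched += 1
        for link in extract_links(url, html):   # "urls += pi's links"
            if link not in seen:                # URL-Seen test (more below)
                seen.add(link)
                frontier.append(link)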
Basic Crawler Design
Our discussion is mostly based on: Mercator: A Scalable, Extensible Web Crawler (1999)
by Allan Heydon and Marc Najork
URLs: The Final Frontier
- URL frontier
  - Stores URLs to be crawled
  - Typically breadth-first (FIFO queue)
- Mercator uses a set of subqueues (sketched below)
  - One subqueue per Web-page-fetching thread
  - When a new URL is enqueued, the destination subqueue is chosen based on its host name
    - So access to a given host is serial
    - For politeness and bottleneck avoidance
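A minimal sketch of the host-keyed subqueue idea, assuming a fixed pool of fetcher threads; the class and method names are illustrative, not Mercator's actual data structures.

from collections import deque
from urllib.parse import urlparse
import hashlib

class Frontier:
    # One FIFO subqueue per fetcher thread. All URLs from the same host
    # hash to the same subqueue, so each host is crawled by one thread.
    def __init__(self, num_threads):
        self.subqueues = [deque() for _ in range(num_threads)]

    def _queue_for(self, url):
        host = urlparse(url).netloc.lower()
        h = int(hashlib.md5(host.encode()).hexdigest(), 16)  # stable host hash
        return self.subqueues[h % len(self.subqueues)]

    def enqueue(self, url):
        self._queue_for(url).append(url)

    def dequeue(self, thread_id):
        q = self.subqueues[thread_id]
        return q.popleft() if q else None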
Etiquette
- robots.txt
  - Tells you which parts of a site you shouldn't crawl
  - Web-page fetchers cache this for each host (see the sketch below)
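For illustration, Python's standard library can fetch and query robots.txt; the user-agent string below is a placeholder, and real crawlers also refresh the cached file periodically.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}   # one parsed robots.txt per host, as above

def allowed(url, user_agent="MyCrawler"):
    parts = urlparse(url)
    host = f"{parts.scheme}://{parts.netloc}"
    if host not in robots_cache:
        rp = RobotFileParser()
        rp.set_url(host + "/robots.txt")
        rp.read()                        # fetch and parse robots.txt
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(user_agent, url)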
Basic Crawler Design
HTTP/DNS
- Multiple Web-page-fetching threads download documents from the specified host(s)
- Domain Name Service (DNS)
  - Maps a host name to an IP address
    - www.yahoo.com -> 209.191.93.52
  - A well-known bottleneck of crawlers
    - Exacerbated by synchronized (blocking) lookup interfaces
  - Solution: caching, asynchronous access (a caching sketch follows below)
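A toy illustration of DNS caching with an in-memory dict; production crawlers use asynchronous resolvers and honor DNS TTLs, which this sketch ignores.

import socket

dns_cache = {}   # host name -> IP address (TTLs ignored for simplicity)

def resolve(host):
    # Return a cached IP if available; otherwise do a blocking lookup.
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

# Repeated lookups of the same host now hit the cache:
# resolve("www.yahoo.com"); resolve("www.yahoo.com")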
Caching Web Data
- Zipf distribution
  - Freq(i-th most frequent element) ∝ i^(-z)
[Figure: link-to frequency vs. link-to frequency rank for hosts such as Yahoo.com, Northwestern.edu, and sqlzoo.net]
Caching Web Data
- Zipf distribution, on a log-log scale
  - log Freq(i-th most frequent element) ≈ const - z·log i
  - i.e., a straight line with slope -z
[Figure: the same link-to frequency data plotted on log-log axes, with Yahoo.com, Northwestern.edu, and sqlzoo.net labeled]
Caching Web Data
- Thus:
  - Caching several hundred thousand hosts in memory saves a large number of disk seeks (see the numerical sketch below)
  - Caching the rest on disk saves many DNS requests
- Zipf's Law also applies to:
  - Web page accesses, term frequency, link distribution (in-degree and out-degree), search query frequency, etc.
  - City sizes, incomes, earthquake magnitudes
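To make the caching argument concrete, here is a small numerical sketch of my own (not from the Mercator paper): under a Zipf distribution with z = 1, it approximates the fraction of link-to requests that fall on the k most popular hosts. The host counts are illustrative.

import math

def zipf_coverage_z1(num_hosts, k):
    # Fraction of requests hitting the k most popular hosts, assuming
    # Freq(i-th most popular host) ~ 1/i (Zipf with z = 1). The
    # normalizing sum is a harmonic number, approximated by ln(n) + gamma.
    gamma = 0.5772156649   # Euler-Mascheroni constant
    harmonic = lambda n: math.log(n) + gamma
    return harmonic(k) / harmonic(num_hosts)

# Illustrative numbers: ~100 million hosts, cache the top 300,000.
print(zipf_coverage_z1(100_000_000, 300_000))   # ~0.69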
Basic Crawler Design
URL Seen Test
- Huge number of duplicate hyperlinks
  - Estimated at 20x the number of pages
- Goal: keep track of which URLs you've downloaded
- Problem: lots of data
  - So use hashes
Brief Review of Hashing
- Hash function
  - Maps a data object (e.g., a URL) to a fixed-size binary representation
  - Mercator used Rabin's fingerprinting algorithm
- For n strings of length < m and a k-bit fingerprint:
  - P(two strings have the same representation) ≈ nm^2 / 2^k
- Some applications of Rabin's fingerprinting method [Broder, 1993]
Hashing for URL Seen Test
- For URLs of length < 1000 and 20,000,000,000 Web pages, with 64-bit hashes:
  - P(any two strings have the same representation) ≈ nm^2 / 2^k ≈ 0.001
  - About a 1/1000 chance of any collisions, even using today's numbers (see the check below)
- Thus: just store and check the hashes
  - Space savings: 12x (assuming avg. 100 bytes/URL)
  - Also translates to efficiency gains
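A quick check of the arithmetic above, using the slide's formula and numbers:

n = 20_000_000_000   # Web pages (URLs)
m = 1_000            # maximum URL length, in characters
k = 64               # fingerprint length, in bits

print(n * m**2 / 2**k)   # ~0.0011, i.e., about a 1/1000 collision chance

# Space: an 8-byte hash vs. a ~100-byte average URL => roughly 12x savings.
print(100 / 8)           # 12.5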
Storing/Querying URLs Seen
- We have a list of URL hashes we've seen
  - Smaller than the text URLs, but still large: 20B URLs * 8 bytes/URL = 160 GB
- How do we store/query it?
  - Store it in sorted form on disk (with an in-memory index)
  - Query with binary search (a sketch follows below)
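A simplified sketch of the sorted-hashes-plus-binary-search idea, kept entirely in memory here; in the real design the sorted hashes live on disk and the in-memory index maps key ranges to disk blocks. The truncated SHA-1 below is a stand-in for Mercator's Rabin fingerprints.

import bisect
import hashlib

def url_hash(url):
    # Any well-distributed 64-bit hash works for illustration.
    return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")

# Sorted list of hashes of URLs already crawled.
seen_hashes = sorted(url_hash(u) for u in [
    "http://example.com/a", "http://example.com/b", "http://example.org/"])

def url_seen(url):
    h = url_hash(url)
    i = bisect.bisect_left(seen_hashes, h)    # binary search
    return i < len(seen_hashes) and seen_hashes[i] == h

print(url_seen("http://example.com/a"))   # True
print(url_seen("http://example.com/z"))   # False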
Wrinkles
- Store two hashes instead of one
  - One for the host name, one for the rest of the URL
  - Store as <host name hash><rest of URL hash> (sketched below)
  - Why? URLs from the same host become adjacent in the sorted file, exploiting disk buffering
- Also use an in-memory cache of popular URLs
- In Mercator, all this together results in about 1/6 of a seek-and-read operation per URL-Seen test
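A sketch of the two-part key; the 3-byte/5-byte split and helper names are illustrative. The point is only that keys sharing a host name sort next to each other, so successive URL-Seen probes for one host tend to hit the same buffered region of the on-disk file.

import hashlib
from urllib.parse import urlparse

def _h(text, nbytes):
    # Truncated SHA-1 as a stand-in for Mercator's fingerprints.
    return hashlib.sha1(text.encode()).digest()[:nbytes]

def url_key(url):
    # <host name hash><rest-of-URL hash>, 8 bytes total.
    parts = urlparse(url)
    host = parts.netloc
    rest = url[len(parts.scheme) + 3 + len(host):]   # strip "scheme://host"
    return _h(host, 3) + _h(rest, 5)

keys = sorted(url_key(u) for u in [
    "http://example.com/a", "http://example.org/x", "http://example.com/b"])
# After sorting, the two example.com keys are adjacent.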
Basic Crawler Design
Content Seen?
- Many different URLs contain the same content
  - One website under multiple host names
  - Mirrored documents
  - >20% of Web docs are duplicates/near-duplicates
How to detect?
- Naive: store each page's content and check
- Better: use hashes (a sketch follows below)
  - Mercator does this
- Some issues
  - Poor cache locality
  - What about similar pages?
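A minimal content-seen sketch, using a cryptographic hash of the page body as a stand-in for Mercator's fingerprints and an in-memory set for the store.

import hashlib

content_fingerprints = set()   # in a real crawler this lives on disk

def content_seen(page_body: bytes) -> bool:
    # True if a byte-identical page body has already been crawled.
    fp = hashlib.sha1(page_body).digest()
    if fp in content_fingerprints:
        return True
    content_fingerprints.add(fp)
    return False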
Alternative Technique
- Compute binary feature vectors for each doc
  - E.g., term incidence vectors <1:1, 192:1, 4002:1, 4036:1, ...>
- Generate 100 random permutations of the feature indices
  - E.g., 1->40002, 2->5, 3->1, 4->2031, ...
- For each document D, store a vector vD containing the minimum (permuted) feature index present in D, for each permutation
- Compare these representations (a sketch follows below)
  - Much smaller than the original feature vectors
  - But comparison is still expensive (see later techniques)
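The scheme above is minhashing. A minimal sketch under simplifying assumptions (a small illustrative vocabulary, explicit random permutations); the fraction of matching minima estimates the Jaccard similarity of the two documents' feature sets.

import random

NUM_PERMS = 100
VOCAB_SIZE = 10_000            # illustrative vocabulary size

random.seed(0)
# Each "permutation" reorders the feature indices, as in 1->40002, 2->5, ...
perms = [random.sample(range(VOCAB_SIZE), VOCAB_SIZE) for _ in range(NUM_PERMS)]

def signature(doc_features):
    # For each permutation, keep the minimum permuted index of any
    # feature present in the document: the vector vD from the slide.
    return [min(perm[f] for f in doc_features) for perm in perms]

def estimated_similarity(sig_a, sig_b):
    # Fraction of agreeing minima ~ Jaccard similarity of the feature sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERMS

doc1 = {1, 192, 4002, 4036}
doc2 = {1, 192, 4002, 9999}
print(estimated_similarity(signature(doc1), signature(doc2)))   # ~0.6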
- Next time: Document Ranking & Link Analysis