The linear algebra behind Googles PageRank algorithm
Sujit Dunga
11110102
1. Abstract
Today, Around 40% of the world population has access to the internet and there are over 800
million websites on the Internet. Search engine forms the most important part of the use of
browsing the internet. More than 60% of the users do not go past the first page and more than 90%
users do not go past the 3rd page in the search engine results page (SERP). If your website cannot be
found within the first 3 pages in the search engine results page (SERP), your website is in great
danger of being visited by very few people.
This paper gives an insight on fundamental mathematics behind PageRank, an algorithm
used by worlds fastest search engine Google to rank websites in their search engine
results page.
2. Introduction
Earlier search engines typically used to give links to pages that had the highest keyword
density. Internet spammers took it to advantage and created websites by repeating the same
words/sentences again and again in order to attract higher search page results.
To overcome the internet spamming, PageRank was first formulated in 1996 by Sergey Brin
and Larry Page (co-founders of Google) at Stanford University. It was named after Larry
Page.
PageRank is a function that assigns a real number to each page in the Web (or at least to that
portion of the Web that has been crawled and its links discovered). The intent is that the
higher the PageRank of a page, the more important it is.1
The PageRank of a page is a measure of how likely it is to be a good response to a search
query. In its simplest form, PageRank is a solution to the recursive equation a page is
important if important pages link to it.2
PageRank is displayed on the toolbar of your web browser if you have installed the Google
toolbar (http://toolbar.google.com/)
1 Mining of Massive Datasets Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman,
page 165
2 Mining of Massive Datasets Anand Rajaraman, Jure Leskovec, and Jeffrey D.
Ullman, page 197
3. Motivation
Earlier, Internet has always been full of spammers and it has become a major challenge for
the search engine and even the user to get the exact results the user desires. Developing an
efficient and accurate Web search was one of the biggest challenges of the decade. Even
though Google was not the first search engine, it was the first to be able to defeat the
spammers using the PageRank algorithm. PageRank is the first and still the best-known
algorithm used by the worlds fastest search engine. Hence the paper focuses on giving a
insight on how this PageRank algorithm works.
4. Mathematics behind PageRank
3 http://searchengineland.com/what-is-google-pagerank-a-guide-for-searcherswebmasters-11068
The entire web is assumed to be a directed graph where each node is a page each directed line
is a link from one page to another. In the above picture, n1, n2, n3 are the web pages. It
shows the various out-links from each webpage.
Calculating the PageRank:
Step 1: Formulating the transition matrix:
4 http://computationalculture.net/article/what_is_in_pagerank
5 Mining of Massive Datasets Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman,
page 166
A transition matrix
is a square matrix of order
nn
webpages. Each row
i
and column correspond to the in-links and out-links of the webpages i.e, the th row and the
i
i
th column represent the th page of the Web.
The value of the element
i
out-link is to page .
If page
mij
is
1/ k
, if there are
if number of out-links from page
has no out-links to any page then the value of
mij
is
and one of the
is 0.
Therefore the transition matrix for the above sketch is as follows:
0 1/ 2
1/ 3 0
M
1/ 3 0
1/ 3 1/ 2
1 0
0 1/ 2
0 1/ 2
0 0
Column 1, row 1 represent page A; Column 2, row 2 represent page B and so on.
The matrix in-short gives the probability of a random surfer going from one page to all the
other pages in the web including itself.
Step 2: Formulating the PageRank vector:
Page rank is nothing but a column vector
j
the random surfer is at page .
whose
th component gives the probability that
The solution to the PageRank vector is given by the equation:
v Mv
From the above equation PageRank can also be defined as the principal eigenvector of the
M .v 1.v
transition matrix with eigen value 1(
).
This is nothing but a system of linear equations which can solved using analytical methods
(Gaussian elimination ) but in reality there will be billions of pages and the solving
becomes difficult (It takes time of cubic power of number of equations). Hence we use an
iterative method to solve this equation.
We re-write the equation in iterative terms as:
vi 1 Mvi
Step 3: Initializing the PageRank vector:
v0
Initial guess for the PageRank vector is taken as a column vector in which each element is
1
n
n
i.e, we set equal probability for the random surfer to be at any of the webpages initially.
Step 4: Iterating the PageRank vector:
Now we find
v1 Mv0
v2 Mv1 M ( Mv0 ) M 2 v0
v3 Mv2 M (M ( Mv0 )) M 3v0
M
vn M n v0
vn , vn 1
vn
We do this until
converge and the resulting
will be our PageRank function/vector
n
for the webpages.
The above algorithm counts all in-links equally. Modified later versions of PageRank
algorithm determines that some links may be more valuable than others, and therefore assigns
them more weight than others.
5. Applications in Various Disciplines
Predicting the extinction of a species by studying food web.
Used in molecular universe to map the shapes and chemical reactions of
quadrillions of molecules.
Analysis of Metabolic and protein-protein interaction (PPI) networks.
Ranking tweets in Twitter.
Recommendation systems: suggesting Movies on Netflix.
Treatment of lung cancer- to identify tumours more likely to spread to other sites
in the body and treat them before they have a chance to spread.
Suggesting friends on Facebook
6. Open Problems
A link farm is a website (or a group of websites) whose majority of the content
is hyperlinks , often random and unrelated to other websites. Link farms were generated in
response to Google's PageRank algorithm. It is created to increase the PageRank of another
site by increasing the number of incoming links. Since PageRank algorithm assumes that the
webpage and its in-links are related to each other, a page might have a higher PageRank even
if it is unrelated to search query. Google has the right to penalize web sites that have created
link farms.
PageRank of a website can be increased by purchasing paid links from high PageRank
sites. Google has publicly warned high PageRank sites that if they were found to be selling
links to other low rank sites, their links would not be included from then on in their PageRank
calculation scheme.
Google bombing is the practice of enhancing any websites SEO profile by repeatedly
inputting the keyword associated with the destination. With enough mass participation,
Google bombing can bump the desired content to a higher page ranking in search results.6
Example: In 2003, large amount of people gathered to link the phrase miserable failure as
anchor text to link to the biography of George Bush.
The war between those who want to make the internet useful and those who would exploit it
for their own purposes will never end. Modifications will be made to the PageRank algorithm
as years pass by or even a day might come when a new algorithm will algorithm take over the
place of PageRank algorithm.
7. Conclusion
In this paper we have learnt fundamental mathematics behind the googles PageRank
algorithm and its increasing applications in network based problems.
PageRank is a powerful and irreplaceable tool that made the web-search efficient and
accurate and found its place in various network (graph) based real life applications.
8. References
Book:Mining of Massive Datasets Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman
http://infolab.stanford.edu/~ullman/mmds/book.pdf
http://www.google.co.in/about/company/history/
http://www.wired.com/2009/09/googlefoodwebs/
http://www.internetlivestats.com/internet-users/
http://news.bbc.co.uk/2/hi/8238462.stm
http://www.extremetech.com/extreme/118419-applying-googles-pagerank-algorithm-to-the-molecular-universe
http://web.stanford.edu/class/msande233/handouts/lecture8.pdf
http://en.wikipedia.org/wiki/PageRank
http://blog.woorank.com/2013/04/what-is-a-pagerank-and-how-important-is-it-in-2013/
http://en.wikipedia.org/wiki/Link_farm
http://searchengineland.com/what-is-google-pagerank-a-guide-for-searchers-webmasters-11068
http://knowyourmeme.com/memes/google-bombing
http://google.about.com/od/g/g/googlebombdef.htm
6 http://knowyourmeme.com/memes/google-bombing