Expt 9
Aim: Implementation of the PageRank and HITS algorithms
Objective: Write a program (WAP) to implement the PageRank algorithm for a given graph, and the HITS algorithm.
References:
- http://pi.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
- PageRank - Wikipedia
- http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
- https://www.youtube.com/watch?v=3_1h13PJkUs
Algorithm
Part 1: Simple PageRank (power iteration)
I/P: the initial graph, read/defined as a transition matrix M, and N (number of nodes)
O/P: the PageRank vector and k, the number of iterations
1) Pass the matrix M to the PageRank function.
2) Initialize the vector V0 = [1/N, 1/N, ..., 1/N] (N entries); print V0.
3) Loop with i = 1:
   Vi = M * V(i-1)
   i = i + 1
4) Iterate until V converges or a maximum number of steps is reached (compare: if Vi = V(i-1), then stop).
5) Return (print) Vi and i, the number of iterations.
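The steps above can be sketched as follows. This is a minimal illustration, assuming M is supplied as a column-stochastic nested list (M[i][j] is the probability of moving from page j to page i); the function name and the 3-node example graph are not from the manual.

```python
def pagerank_simple(M, n, max_iter=100, tol=1e-9):
    """Power iteration: V0 = [1/n]*n, then Vi = M * V(i-1) until convergence.
    Returns the PageRank vector and the number of iterations performed."""
    v = [1.0 / n] * n  # step 2: V0 = [1/n, 1/n, ..., 1/n]
    for i in range(1, max_iter + 1):
        # step 3: Vi = M * V(i-1) (plain matrix-vector product)
        v_next = [sum(M[r][c] * v[c] for c in range(n)) for r in range(n)]
        # step 4: stop when Vi is (numerically) equal to V(i-1)
        if all(abs(v_next[r] - v[r]) < tol for r in range(n)):
            return v_next, i
        v = v_next
    return v, max_iter

# Example graph: 1 -> {2, 3}, 2 -> 3, 3 -> 1 (each column sums to 1)
M = [[0, 0, 1], [0.5, 0, 0], [0.5, 1, 0]]
ranks, iters = pagerank_simple(M, 3)  # converges to approximately [0.4, 0.2, 0.4]
```

Here convergence is guaranteed because the example graph is strongly connected and aperiodic; Part 2 below handles the general case.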
Part 2: Random (Web) Surfer Algorithm (damping factor d = 0.85)
- PageRank Algorithm and Implementation - GeeksforGeeks
- PageRank - Wikipedia
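With the damping factor, the update in Part 1 becomes Vi = d * M * V(i-1) + (1-d)/N: with probability d the surfer follows a link, with probability 1-d they teleport to a random page. A sketch under the same assumptions as Part 1 (the function name and example are illustrative):

```python
def pagerank_damped(M, n, d=0.85, max_iter=100, tol=1e-9):
    """Random-surfer PageRank: Vi = d*M*V(i-1) + (1-d)/n for each entry.
    M must be column-stochastic; the result always sums to 1."""
    v = [1.0 / n] * n
    teleport = (1.0 - d) / n  # uniform teleportation term
    for i in range(1, max_iter + 1):
        v_next = [teleport + d * sum(M[r][c] * v[c] for c in range(n))
                  for r in range(n)]
        if all(abs(v_next[r] - v[r]) < tol for r in range(n)):
            return v_next, i
        v = v_next
    return v, max_iter

# Same example graph as Part 1: 1 -> {2, 3}, 2 -> 3, 3 -> 1
M = [[0, 0, 1], [0.5, 0, 0], [0.5, 1, 0]]
ranks, iters = pagerank_damped(M, 3)
```

Because each iteration is a contraction with factor d < 1, this version converges regardless of the graph's structure.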
Part 3: HITS algorithm - refer to the PPT example
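One way to implement the HITS hub/authority updates from the slides is sketched below. It assumes the graph is given as an adjacency dict mapping each page to the pages it links to; the function name and example graph are illustrative, not from the PPT.

```python
def hits(adj, pages, max_iter=50, tol=1e-8):
    """HITS: authority(p) = sum of hub scores of pages linking to p;
    hub(p) = sum of authority scores of pages p links to.
    Both score vectors are normalized (Euclidean norm) each round."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        new_auth = {p: sum(hub[q] for q in pages if p in adj.get(q, []))
                    for p in pages}
        norm = sum(a * a for a in new_auth.values()) ** 0.5
        new_auth = {p: a / norm for p, a in new_auth.items()}
        new_hub = {p: sum(new_auth[q] for q in adj.get(p, [])) for p in pages}
        norm = sum(h * h for h in new_hub.values()) ** 0.5
        new_hub = {p: h / norm for p, h in new_hub.items()}
        done = (max(abs(new_auth[p] - auth[p]) for p in pages) < tol and
                max(abs(new_hub[p] - hub[p]) for p in pages) < tol)
        auth, hub = new_auth, new_hub
        if done:
            break
    return auth, hub

# Example: A -> {B, C}, B -> C. A points at everything (good hub);
# C is pointed at by everything (good authority).
adj = {"A": ["B", "C"], "B": ["C"], "C": []}
auth, hub = hits(adj, ["A", "B", "C"])
```

Note the normalization step assumes at least one page has an in-link and one has an out-link; a production version would guard against a zero norm.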
Lab Questions
1) Give an efficient approach to handle the matrix M for a large web graph, with reasons.
Ans:
● Sparse Matrix Representation: In a web graph, the link structure between pages often
results in a sparse matrix where most of the elements are zero. By representing the
web graph as a sparse matrix, we can significantly reduce the memory footprint and
computational cost associated with handling large datasets.
● Power Iteration Method: The power iteration method allows for the computation of
the PageRank vector through iterative matrix-vector multiplication. This method
converges to the principal eigenvector of the transition probability matrix, which
represents the PageRank scores for each page. It's an efficient way to estimate
PageRank without directly computing the full eigenvalue decomposition of the
transition matrix, which can be computationally expensive for large matrices.
● Parallel Processing: The power iteration method can be parallelized, allowing for the
distributed computation of PageRank across multiple processors or computing nodes.
This parallel processing capability further improves the efficiency of handling a large
number of web pages by distributing the computation workload.
● Iterative Convergence Control: Implementing a convergence criterion during the
iterative process ensures that the computation stops when the PageRank scores have
converged to a stable value. This helps avoid unnecessary iterations, optimizing the
algorithm's performance for large-scale datasets.
● Graph Partitioning Techniques: Leveraging graph partitioning techniques can help
distribute the web graph across different computing nodes or clusters, enabling
efficient processing and reducing communication overhead, especially in distributed
computing environments.
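The sparse-representation point above can be made concrete: instead of storing an N x N dense matrix, store only the out-links of each page, so memory is O(edges) rather than O(N^2), and each power-iteration step touches only existing links. A minimal sketch (names and the example graph are illustrative; dead ends are assumed absent here and handled in question 2):

```python
def pagerank_sparse(out_links, n, d=0.85, max_iter=100, tol=1e-9):
    """Power iteration over a sparse graph: out_links maps a page index
    to the list of pages it links to. Each iteration costs O(edges)."""
    v = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [(1.0 - d) / n] * n  # teleportation term
        for src, dests in out_links.items():
            # each linked page receives an equal share of src's rank
            share = d * v[src] / len(dests)  # assumes no dead ends
            for dst in dests:
                nxt[dst] += share
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v

# Same 3-node graph as Part 1, stored as adjacency lists (6 numbers vs. a 3x3 matrix)
out_links = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank_sparse(out_links, 3)
```

In practice a library sparse format (e.g. compressed sparse row) plays the same role; the dict-of-lists version just makes the memory saving visible.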
2) Describe improvements to the PageRank algorithm for the spider-trap and dead-end problems.
Ans:
● Damping Factor (Teleportation): The introduction of a damping factor, typically
denoted by the symbol d, helps in avoiding spider traps and dead ends. This factor
allows the possibility of "teleporting" from a dead end to any other page in the web
graph. By incorporating this probability of teleportation into the computation, the
PageRank algorithm ensures the convergence of the scores even in the presence of
dead ends.
● Random Surfer Model: The concept of the random surfer model assumes that a
random surfer, after encountering a dead end, will teleport to another page in the
graph with a certain probability. This model helps in mitigating the impact of dead
ends and ensures that the web graph remains connected, thereby improving the
robustness and accuracy of the PageRank algorithm.
● Handling Spider Traps: To address spider traps, where a surfer can get stuck in a
subset of pages without escaping, algorithms often introduce heuristics that break the
loops within the web graph. Techniques such as "trust-rank" or "topic-sensitive
PageRank" help prevent a surfer from getting trapped by adjusting the random walk
probabilities based on the topical relevance of web pages.
● Personalized PageRank: Personalized PageRank allows the computation of PageRank
scores with respect to a specific query or context, thereby reducing the impact of
spider traps and dead ends for personalized search results. This approach emphasizes
pages that are more closely related to the user's interests, minimizing the influence of
irrelevant or disconnected components.
● Topic-Sensitive PageRank: Topic-Sensitive PageRank modifies the original PageRank
algorithm to consider the topical relevance of web pages. By incorporating topical
information, the algorithm can navigate through the web graph more effectively,
avoiding spider traps and dead ends related to specific topics or themes.
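The two core fixes above (teleportation for spider traps, redistribution for dead ends) can be combined in one iteration step. A sketch under the same sparse representation as question 1 (function name and example graph are illustrative):

```python
def pagerank_robust(out_links, n, d=0.85, max_iter=100, tol=1e-9):
    """PageRank with both fixes: pages with no out-links (dead ends)
    spread their rank uniformly over all pages, and the damping factor d
    lets the surfer teleport out of spider traps with probability 1-d."""
    v = [1.0 / n] * n
    for _ in range(max_iter):
        # total rank sitting on dead-end pages this round
        dangling = sum(v[p] for p in range(n) if not out_links.get(p))
        # every page gets teleportation mass plus its share of dangling mass
        base = (1.0 - d) / n + d * dangling / n
        nxt = [base] * n
        for src, dests in out_links.items():
            if dests:
                share = d * v[src] / len(dests)
                for dst in dests:
                    nxt[dst] += share
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v

# Example with a dead end: page 0 links to page 1, page 1 links nowhere.
# Without the dangling fix, rank would leak out and the vector would shrink.
ranks = pagerank_robust({0: [1], 1: []}, 2)
```

The dangling-mass term keeps the vector summing to 1 every iteration, which is what guarantees convergence in the presence of dead ends.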
Other references:
• Data Mining - Mining Text Data - Tutorialspoint
• Data Mining - Mining World Wide Web - Tutorialspoint
• Web Mining (tutorialride.com)
• Web Mining and Text Mining - An In-Depth Mining Guide (eduonix.com)
• Hyperlink Induced Topic Search (HITS) Algorithm using Networxx Module | Python - GeeksforGeeks
• PageRank Algorithm - The Mathematics of Google Search (cornell.edu)
• Google's PageRank Algorithm: Explained and Tested (link-assistant.com)
• HITS Algorithm - Hubs and Authorities on the Internet (cornell.edu)
• HITS Algorithm: Link Analysis Explanation and Python Implementation from Scratch | by Chonyy | Towards Data Science
• Text Data Mining - Javatpoint
• Text Summarization with NLTK in Python (stackabuse.com)
• (Tutorial) Text ANALYTICS for Beginners using NLTK - DataCamp
• Beautiful Soup 4 Python - PythonForBeginners.com
• NLTK Tutorial in Python: What is Natural Language Toolkit? (guru99.com)
• NLTK Python Tutorial (Natural Language Toolkit) - DataFlair (data-flair.training)
• Natural Language Processing - Introduction - Tutorialspoint