TM3 ch05 Link Analysis

Tutorial Meeting #3
Data Science and Machine Learning
Link Analysis
(Chapter 5 of MMDS book)
DAMA60 Group
School of Science and Technology
Hellenic Open University
Content
• Preliminaries
• Matrices, Vectors
• Linear Algebra
• Graphs
• Methods and Algorithms

• Google PageRank algorithm
• HITS algorithm
2
Matrices
• A Matrix is a rectangular array of numbers
𝑎11 𝑎12 𝑎13 1 2 3

𝐴= 𝑎 𝑎22 𝑎23 =
21 4 5 6
• 𝑎𝑖𝑗 is the element of matrix A in row 𝑖 and column 𝑗

• 𝐴 is said to be a 𝑛 × 𝑚 matrix if it has 𝑛 rows and 𝑚
columns
• A square matrix is a 𝑛 × 𝑛 matrix
• The transpose 𝑨𝑻 of a matrix 𝐴 is the matrix obtained by
exchanging the rows and the columns
𝑇 𝑎11 𝑎21
𝑎11 𝑎12 𝑎13
𝐴𝑇 = 𝑎 𝑎 𝑎 = 𝑎12 𝑎22
21 22 23
𝑎13 𝑎23
3
Matrices
• A square matrix is diagonal if has 𝑎𝑖𝑗 = 0 for all 𝑖 ≠ 𝑗
𝑎11 0
𝐴=
0 𝑎22
• The Identity matrix 𝑰 is the diagonal matrix with 1´s along

the main diagonal
1 0
𝐴=
0 1
• A symmetric matrix 𝑨 satisfies the condition 𝑨 = 𝑨𝑻
4
Vectors
• A vector v is a one-dimensional array of numbers (is an
𝑛 × 1 matrix – column vector)
• Example:
3
𝒗= 5
7
• The standard form of a vector is a column vector
• The transpose of a column vector is a row vector
Τ
3
𝑻
𝒗 = 5 = (3 5 7)
7
5
Operation on matrices
• Addition: 𝚨 = 𝐚𝐢𝐣 , 𝐁 = 𝐛𝐢𝐣 , 𝐂 = 𝐜𝐢𝐣 = 𝐀 + 𝐁

• 𝐜𝐢𝐣 = 𝐚𝐢𝐣 + 𝒃𝐢𝐣
• Scalar multiplication: 𝜆 is a number, λ𝚨 = 𝝀𝐚𝐢𝐣

• Matrix Multiplication: if A and B are compatible, (the number
of columns of A are equal to the number of rows of B), then:
• 𝐂 = 𝐜𝐢𝐣 = 𝐀𝐁
• 𝐜𝐢𝐣 = σ𝒌 𝒂𝒊𝒌 𝒃𝒌𝒋
6
Examples
1 4
1 2 3 1∗1+2∗2+3∗3 1∗4+2∗5+3∗6 14 32
2 5 = =
4 5 6 4∗1+5∗2+6∗3 4∗4+5∗5+6∗6 32 77
3 6
1 1 1 −1 1 0
=
0 1 0 1 0 1
• If 𝑨𝑩 = 𝑰, then 𝑩 is said to be the inverse of 𝑨 and is

denoted with 𝑨−𝟏
• If a matrix has an inverse, it is called invertible or
nonsingular
7
Graphs: Introduction
• ‘Everything is connected: Graph Neural Networks’ a
recent publication by Petar Veličković (Research Scientist
at Google DeepMind) outlines the importance of graph
structures.
• In many ways, graphs are the main modality of data we
receive from nature.
• Social networks are also of primal importance in todays’
applications:
• Who talks to whom?
• Who leads to what?
• Which pages do we like?
• What items do we order?
8
Graphs: Notation
• A graph (or network) G is a set of vertices 𝑉 = 1,2, … , 𝑛 joined
by edges(or links).
• An undirected graph G = (V, E), consists of the vertex set V and

the edge set E, which contains the binary relations on V.
• V is the Vertex set of G: contains the vertices/nodes
• E is the Edge set of G: contains the edges
• A Directed graph G = (V, E) consists of edges that have ordered

pairs of vertices.
• The in-degree of a vertex v is the number of edges entering in v
• The out-degree of a vertex v is the number of edges leaving v
• A neighbor set N(v) is the set of vertices adjacent to v: 𝑁 𝑣 =

𝑢 ∈ 𝑉 𝑢 ≠ 𝑣, (𝑣, 𝑢) ∈ 𝐸}
9
Node Degree in Undirected graphs
• 𝑑𝑖 is the degree (# of adjacent nodes or # of neighbors) for a target
vertex 𝑣𝑖.
• In the above graph, the degree for vertex 𝑣1 is 𝑑1 = 8 and for all
others is 𝑑𝑖 = 1, 𝑖 ≠ 1.
• The above graph is called star graph 𝑆𝑘 or k-star, with 𝑘 = 9.
• In general, those graphs are trees with k nodes, where one node
has degree equal to k-1, and the other k-1 nodes have degree
equal to 1.
10
Directed Edges and Directed Graphs
• Edges can have directions. A directed edge is sometimes
called an arc.
• Figure (a): Edges are represented using their end-points,

i.e., 𝑒 𝑣2 , 𝑣1 ‒ representative example: citation networks
• Figure (b): Both representations are the same, i.e.,

𝑒 𝑣2 , 𝑣1 ≡ 𝑒 𝑣1 , 𝑣2 ‒ representative example: friendship
networks 11
Example for Node Degree in Directed Graphs
In- Out-
Node Degree Degree
A 1 3
B
C
D
E
F
G
12
Can you complete the rest values?

In- Out-
Node Degree Degree
A 1 3
B
C
D
E
F
G
13
Resulting in/out degrees

In- Out-
Node Degree Degree
A 1 3
B 1 2
C 2 3
D 3 1
E 2 1
F 2 2
G 2 1
14
Adjacency Matrix
Given a graph G with n nodes numbered from 1 to n, the adjacency
matrix 𝐴 = 𝑎𝑖𝑗 of G is a square 𝑛𝑥𝑛 matrix such that 𝑎𝑖𝑗 = 1, if
there exists an edge joining nodes i and j, and 𝑎𝑖𝑗 = 0, otherwise.
Here is an example:
A 1 2 3 4 5
1 0 1 1 1 1
2 1 0 1 0 1
3 1 1 0 0 1
4 1 0 0 0 0
5 1 1 1 0 0
You can verify that A is symmetric matrix (𝐴 = 𝐴𝑇 ), since we have an

undirected graph.
15
Google
• Google is the leading search and online advertising

company - founded by Larry Page and Sergey Brin (Ph.D.
students at Stanford University).
• “googol” or 10100 is the mathematical term Google was

named after.
• Google’s success in search is largely based on its

PageRank algorithm.
16
Intuition of PageRank
• Web pages are important if people visit them a lot. But we can’t
watch everybody using the Web.
• If we assume that people follow web-links randomly, that leads us to

random walker model:
• Start at a random page and follow random out-links repeatedly, from
whatever page you are at.
• In other words, “the importance of a page equals to its share of the

importance of each of its predecessor pages”.
• PageRank of a page equals to the probability of being at that page.
17
Ranking web pages
Web pages are not equally “important”

• www.oxford.uk (scam) vs. www.oxford.ac.uk (official)
• We can use the number of incoming links for a website

• Incoming links as votes
• Are all incoming links equal?
18
Simple recursive formulation
• Each link’s vote is proportional to the importance of its
source page.
• If page P with importance x has n out links, each link gets
x/n votes.
333 $ 333 $
333 $
1000 $
19
y, a, m represent the
Simple “flow” model importance of the 3 web
𝑦 pages, respectively.
2 Based only on their

incoming links
Yahoo
𝑦 𝑎
y 𝑦= +
𝑎 2 2
𝑦
2
2 𝑦
m 𝑎= + 𝑚
2
Amazon Microsoft
𝑎 𝑎
a 2 m 𝑚=
2
20
Transition Probability Matrix
• Matrix M is a square matrix that has one row and one column for
each web page
• Suppose page 𝑖 has 𝑛 outgoing links

If 𝑖 links to 𝑗 then
𝑀𝑗𝑖 = 1/𝑛
else
𝑀𝑗𝑖 = 0
• M is a column stochastic matrix

• Columns sum to 1 (be careful, since some authors use row stochastic
matrices, where rows sum to 1)
• Suppose 𝒓 is a vector with one entry per web page. We can call this
as rank vector, for which holds that:
• 𝒓𝒊 is the importance score of page 𝑖
• σ𝒊 𝒓𝒊 = 1
21
Transition Probability Matrix
When M is a row stochastic matrix, it holds that:
If 𝑖 links to 𝑗 then
𝑀𝑖𝑗 = 1/𝑛
else
then 𝑀𝑖𝑗 = 0
22
Flow equations:
𝑦 𝑎
𝑦= +
Example 𝑦
2 2
𝑎= + 𝑚
𝑦 2
𝑎
2 𝑚=
2
Incoming weights
Outgoing weights
Yahoo
y 1/2
𝑦 1/2
0
2
Amazon Microsoft
a m
23
Flow equations:
𝑦 𝑎
𝑦= +
Example 𝑦
2 2
𝑎= + 𝑚
2
𝑎
𝑚=
2
Incoming weights
Outgoing weights
Yahoo
y 1/2
𝑎
0
2 1/2
Amazon Microsoft
𝑎
a 2 m
24
Flow equations:
𝑦 𝑎
𝑦= +
Example 𝑦
2 2
𝑎= + 𝑚
2
𝑎
𝑚=
2
Incoming weights
Outgoing weights
Yahoo
y 0
1
0
m
Amazon Microsoft
a m
25
Flow equations:
𝑦 𝑎
𝑦= +
Example 𝑦
2 2
𝑎= + 𝑚
𝑦 2
𝑎
2 𝑚=
2
Yahoo y a m
y ½ ½ 0
𝑎
y
a ½ 0 1 =M
𝑦 m 0 ½ 0
2
2
m
Amazon Microsoft
𝑎
a 2 m
26
Flow equations:
Example (row-stochastic) 𝑦 𝑎
𝑦= +
2 2
𝑦
𝑎 = +𝑚
2
𝑦 𝑎
𝑚=
Yahoo 2 2
y
y a m
𝑎 𝑦 y ½ ½ 0
2 2 a ½ 0 ½ =M
m 0 1 0
m
Amazon Microsoft
𝑎
a m
2
27
Solving the flow equations
• Considering that v is a column vector

• 3 equations, 3 unknowns
• Solving Linear Equations of the type 𝒗 = 𝑀𝒗, do not have a unique
solution. For example, try with (2 2 1).
• Additional constraint forces uniqueness

• 𝑦 + 𝑎 + 𝑚 = 1 (normalization)
• 𝑦 = 2/5, 𝑎 = 2/5, 𝑚 = 1/5
• Gaussian elimination works for small examples, but we need a

better method for large graphs.
• We need to use relaxation (an iterative solution, e.g., power
iteration).
28
Solving the flow equations
• We elaborate the need of regularization in the previously seen

examined network.
′
𝑦 𝑎
1/2 1/2 0 𝑦 = +
𝑦′ 𝑦 2 2
𝑎′ = 1/2 0 1 𝑎 ⇒ 𝑦
𝑎′ = + 𝑚
𝑚′ 0 1/2 0 𝑚 2
′
𝑚 = 𝑎/2
• Given y=2, a=2, and m=1, we find that y’=2, a’=2, and m=1.
• The constraint 𝑦 ′ + 𝑎′ + 𝑚′ = 1 help us to normalize the
network importance at each step (dividing by the sum of y’,
a’, m’).
29
Random Walks on the Web
• Suppose v is a vector whose 𝑖𝑡ℎ component is the

probability that a random surfer is at page 𝑖 at a
certain time.
• If a surfer chooses a successor page from page 𝑖 at

random, the probability distribution for surfers is
then given by the product M ∗ 𝒗.
• In general, the 𝑀𝑖 ∗ 𝒗 gives us the distribution of

the surfer position after i steps.
30
Simulating a Random Walk
• Start with the vector 𝒗 = [1, 1, … , 1] representing the

idea that each Web page is given one unit of
importance.
• Note: it is more common to start with each vector element =
1/𝑁, where 𝑁 is the number of Web pages and to keep the
sum of the elements at 1.
• Repeatedly multiply matrix 𝑀 with vector 𝑣, allowing

the importance to flow like a random walk.
• About 10 iterations is sufficient to estimate the limiting

solution.
31
Power Iteration Solution
Yahoo y a m
y ½ ½ 0
a ½ 0 1 =M
m 0 ½ 0
Amazon Microsoft M*v = v’

M (y a m) 𝑇 = v’
(1/3 1/3 1/3) 𝑻 : initial probability
M (1/3 1/3 1/3) 𝑻 = (1/3 1/2 1/6)𝑻

M (1/3 1/2 1/6) 𝑻 = (5/12 1/3 1/4)𝑻
M (5/12 1/3 1/4) 𝑻 = (3/8 11/24 1/6) 𝑻
…
(2/5 2/5 1/5) 𝑻 is the final probability
32
The Surfers
Yahoo
Amazon M’soft
Importance of the 3 nodes

(y a m) 𝚻 = (1/3 1/3 1/3) 𝚻
33
y a m
The Surfers y ½ ½ 0
a ½ 0 1
m 0 ½ 0
Yahoo
Amazon M’soft
Importance of the 3 nodes (y a m)

M (1/3 1/3 1/3) 𝚻 = (1/3 1/2 1/6) 𝚻
34
y a m
a ½ 0 1
m 0 ½ 0
Yahoo
Amazon M’soft

M (1/3 1/2 1/6) 𝚻 = (5/12 1/3 1/4) 𝚻
•35
y a m
a ½ 0 1
m 0 ½ 0
Yahoo
Amazon M’soft

M (5/12 1/3 1/4) 𝚻 = (3/8 11/24 1/6) 𝚻
36
After convergence Y and A have equal relative
importance…
Yahoo
Amazon M’soft
Final Importance of the 3 nodes

(y a m) 𝚻 = (2/5 2/5 1/5) 𝑻
37
Dead Ends
Spider Traps
Taxation Policies
38
Dead Ends
• Some pages are dead ends (have no links out).
• Such a page causes importance to leak out (drain out),
or surfers to disappear.
• When there are dead ends the transition

matrix is no longer stochastic (the sum of
the row elements is not 1)
39
Microsoft Becomes Dead End
Yahoo y a m
y ½ ½ 0
a ½ 0 0 =M
m 0 ½ 0
Amazon Microsoft
A substochastic matrix =“not every column sums up to 1”
40
Example: Effect of Dead Ends
• Flow Equations v = Mv :
y a m
y = y /2 + a /2 y ½ ½ 0
a = y /2 a ½ 0 0 =M
m = a /2 m 0 ½ 0
41
Microsoft Becomes a Dead End
Yahoo
y a m
y ½ ½ 0
a ½ 0 0 =M
m 0 ½ 0
Amazon Microsoft
importance y 1
of the a = 1
3 nodes m 1
42
Yahoo
y a m
y ½ ½ 0
a ½ 0 0 =M
m 0 ½ 0
Amazon Microsoft
importance y 1 1
of the a = 1 1/2
3 nodes m 1 1/2
43
Yahoo
y a m
y ½ ½ 0
a ½ 0 0 =M
m 0 ½ 0
Amazon Microsoft
importance y 1 1 3/4
of the a = 1 1/2 1/2
3 nodes m 1 1/2 1/4
44
After convergence all relative
importance of nodes is lost…
Yahoo
y a m
y ½ ½ 0
a ½ 0 0 =M
m 0 ½ 0
Amazon Microsoft
importance y 1 1 3/4 5/8 0

of the a = 1 1/2 1/2 3/8 0
...
3 nodes m 1 1/2 1/4 1/4 0
No available surfers left to travel!

45
Dealing with dead-ends
• 1) Prune and propagate
• Preprocess the graph to eliminate dead-ends
• Keeps the Strongly connected component of the graph
• Compute PageRank on the reduced graph
• Approximate values for dead ends by propagating values from
reduced graph, where their PageRank is an aggregation of the one
from all predecessors
2) Teleportation
• Follow random teleport links with probability 1.0 from dead-ends
• Adjust matrix accordingly
46
Spider Trap
• Some groups of pages may be spider traps (all

out-links are within the group)
• Eventually spider traps absorb all importance; all
surfers get stuck in the trap
47
Example: Effect of Spider Trap
Microsoft Becomes Spider Trap
• Flow Equations v = M v : y a m
y = y /2 + a /2 y ½ ½ 0
a = y /2 a
m
½
0
0
½
0
1
=M
m = a /2 + m
48
Microsoft Becomes a Spider Trap
Yahoo y a m
y ½ ½ 0
a
m
½
0
0
½
0
1
=M
Amazon M’soft
importance y 1
of the a = 1
3 nodes m 1
49
Yahoo
y a m
y ½ ½ 0
a
m
½
0
0
½
0
1
=M
Amazon M’soft
importance y 1 1
of the a = 1 1/2
3 nodes m 1 3/2
50
Yahoo
y a m
y ½ ½ 0
a
m
½
0
0
½
0
1
=M
Amazon M’soft
importance y 1 1 3/4 5/8

of the a = 1 1/2 1/2 3/8
3 nodes m 1 3/2 7/4 2
51
After convergence all importance
goes in one node…
Yahoo
y a m
y ½ ½ 0
a
m
½
0
0
½
0
1
=M
Amazon M’soft
𝑦 1 1 3/4 5/8 0
importance of the
𝑎 =1 1/2 1/2 3/8 … 0
3 nodes
𝑚 1 3/2 7/4 2 3
52
Dealing with spider traps
Teleportation
• At each time step, the random surfer has two

options:
• With probability , follow a link at random
• With probability 1-, jump to some page uniformly
at random
• Common values for  are in the range 0.8 to 0.9
• Surfer will teleport out of a spider trap within

a few time steps
53
Vector formulation
• Suppose there are N pages
• Consider a page 𝑖 , with set of out links 𝑂 𝑖
• We have
1
• 𝑀𝑗𝑖 = 𝑂 𝑖
when 𝑖 links 𝑗, 𝑀𝑗𝑖 = 0 otherwise
• The random teleportation is equivalent to:

• adding a teleport link from i to every other page with
1−𝛽
probability 𝑁
• tax each page a fraction 1 − 𝛽 of its score and
redistribute evenly
• e is a vector with all elements equal to one
• Construct 𝒗′ vector as follows

′
𝒆
𝒗 =𝜷∗𝑴∗𝒗+ 𝟏−𝜷 ∗
𝑵
54
Some Problems with Page Rank
• Measures generic popularity of a page

• Biased against topic-specific authorities
• Solution: Topic-Sensitive PageRank (TSPR)
• Uses a single measure of importance

• Other models of importance
• Solution: Hubs-and-Authorities (coming next)
• Susceptible to Link spam

• Artificial link topographies created in order to boost page rank
• Solution: TrustRank
55
Topic-Sensitive PageRank
• Motivation: Topic of each page is critical for personalized

search
• Same or similar words into search queries can cover a wide
spectrum of topics
• Users are interested in pages with context related to their
preferences
• Goal: Integrate into PageRank scores the similarity

between pages and distinct topics
• Example: Query “jaguar” could retrieve pages related to
animals, automobile, operating systems or game consoles.
• Reducing the diversity of those topics could increase user
experience
• Idea: Bias the random walk

• Introduce a vector with weights for limited number of distinct
topics
• Boost teleporting to pages with related topic
56
Topic-Sensitive PageRank
• We need to update the teleportation part of PageRank
formulation:
𝟏−𝜷
𝜷∗𝑴∗𝒗+ , 𝑖𝑓 𝑖 ∈ 𝑆
• 𝑣TSPR ′ = ቐ 𝑺
𝜷 ∗ 𝑴 ∗ 𝒗, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
• Through this, we weight all topics in the S set equally,

reducing the chance of visiting irrelevant pages.
• We could not do that for each page, since we should keep an
enormous vector (equal to the number of distinct pages) per
user.
• The personalized set of S per user could be created by
directly asking the user to add some preferences, utilize
user’s history or by classifying queries into topics (text-based
approach).
57
Hubs and Authorities
• HITS (Hypertext-Induced Topic Selection)

• Is a measure of importance of pages or documents,
similar to PageRank
• Proposed at around same time as PageRank (‘98)
• Goal: Say we want to find good newspapers

• Don’t just find newspapers. Find “experts” – people
who link in a coordinated way to good newspapers
• Idea: Links as votes

• Page is more important if it has more links
• In-coming links? Out-going links?
58
Finding newspapers
NYT: 10
• Hubs and Authorities
Each page has 2 scores: eBay: 3
• Quality as an expert (hub): Yahoo: 3
• Total sum of votes of authorities pointed to
• Quality as a content (authority): CNN: 8
• Total sum of votes coming from experts
WSJ: 9
• Principle of repeated improvement
59
Interesting pages fall into two classes:

1. Authorities are pages containing
useful information
• Newspaper home pages
• Course home pages
• Home pages of auto manufacturers
2. Hubs are pages that link to authorities

• List of newspapers
• Course bulletin
• List of US auto manufacturers
60
Counting in-links: Authority
Each page starts with hub

score 1.
Authorities collect their votes
Note that this is an idealized example. In reality, graph is not

bipartite, and each page has both the hub and authority score. 61
Counting in-links: Authority
Sum of hub
scores of nodes
pointing to NYT.
62
Expert Quality: Hub
Sum of authority
scores of nodes
that the node
points to.
Hubs collect
authority scores
63
Reweighting
Authorities again collect

the hub scores
64
Mutually Recursive Definition
• A good hub links to many good authorities
• A good authority is linked from many good

hubs
• Model using two scores for each node:

• Hub score and Authority score represented as
vectors 𝒉 and 𝒂
65
• Each page 𝑖 has 2 scores: j1 j2 j3 j4
• Authority score: 𝒂𝒊
• Hub score: 𝒉𝒊
i
HITS algorithm:
(0) 𝒂𝒊 = ෍ 𝒉𝒋
• Initialize hj =1 𝒋→𝒊
• Then keep iterating until convergence: i

(𝑡+1)
• ∀𝒊: Authority: 𝑎𝑖 = σ𝒋→𝒊 ℎ𝑗(𝑡)
𝑡+1
• Normalize: max 𝑎𝑖 =1
j1 j2 j3 j4
(𝑡+1) (𝑡)
• ∀𝒊: Hub: ℎ𝑖 = σ𝒊→𝒋 𝑎𝑗
𝒉𝒊 = ෍ 𝒂𝒋
𝑡+1 𝒊→𝒋
• Normalize: max ℎ𝑖 =1
66
• HITS converges to a single stable point

• Notation:
• Vector 𝒂 = (𝑎1 … , 𝑎𝑛 )𝑇 , 𝒉 = (ℎ1 … , ℎ𝑛 )𝑇
• Adjacency matrix 𝑨 : 𝑨𝒊𝒋 = 1 if 𝑖𝑗, 0 otherwise
• Similarly, 𝒂𝒊 = σ𝒋→𝒊 𝒉𝒋 can be rewritten as 𝒂𝒊 = σ𝒋 𝑨𝒋𝒊 ⋅ 𝒉𝒋

• So: 𝒂 = 𝑨𝑻 ⋅ 𝒉
• Then 𝒉𝒊 = σ𝒊→𝒋 𝒂𝒋
can be rewritten as 𝒉𝒊 = σ𝒋 𝑨𝒊𝒋 ⋅ 𝒂𝒋
So: 𝒉 = 𝑨 ⋅ 𝒂
67
Example of HITS Yahoo
111 110
A= 101 AT = 1 0 1
010 110 Amazon M’soft
A𝑥𝑦 = 1 if there is an outgoing link from node 𝑥 to node 𝑦
AT𝑥𝑦 = 1 if there is incoming link from node 𝑦 to node 𝑥
a(yahoo) = 1.0 1.0 ... 1.0

a(amazon) = 1.0 0.8 ... 0.73
a(m’soft) = 1.0 1.0 ... 1.0
h(yahoo) = 1.0 1.0 ... 1.0

h(amazon) = 0.67 0.71 ... 0.73
h(m’soft) = 0.33 0.29 ... 0.27
68
Summary : PageRank and HITS
• PageRank and HITS are two solutions to the

same problem:
• What is the value of an in-link from u to v?
• In the PageRank model, the value of the link
depends on the links into u
• In the HITS model, it depends on the value of the
other links out of u
• The former uses the Adjacency matrix (A),

while the latter uses the Link matrix (L).
• The destinies of PageRank and HITS
post-1998 were very different.
69

TM3 ch05 Link Analysis

Uploaded by

Copyright:

Available Formats

TM3 ch05 Link Analysis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

TM3 ch05 Link Analysis

Uploaded by

Copyright:

Available Formats

Tutorial Meeting #3

Data Science and Machine Learning

• Methods and Algorithms

• A Matrix is a rectangular array of numbers

𝑎11 𝑎12 𝑎13 1 2 3

• 𝑎𝑖𝑗 is the element of matrix A in row 𝑖 and column 𝑗

• A square matrix is diagonal if has 𝑎𝑖𝑗 = 0 for all 𝑖 ≠ 𝑗

• The Identity matrix 𝑰 is the diagonal matrix with 1´s along

• A symmetric matrix 𝑨 satisfies the condition 𝑨 = 𝑨𝑻

• The transpose of a column vector is a row vector

• Addition: 𝚨 = 𝐚𝐢𝐣 , 𝐁 = 𝐛𝐢𝐣 , 𝐂 = 𝐜𝐢𝐣 = 𝐀 + 𝐁

• Scalar multiplication: 𝜆 is a number, λ𝚨 = 𝝀𝐚𝐢𝐣

• If 𝑨𝑩 = 𝑰, then 𝑩 is said to be the inverse of 𝑨 and is

• An undirected graph G = (V, E), consists of the vertex set V and

• A Directed graph G = (V, E) consists of edges that have ordered

• A neighbor set N(v) is the set of vertices adjacent to v: 𝑁 𝑣 =

• Figure (a): Edges are represented using their end-points,

• Figure (b): Both representations are the same, i.e.,

Can you complete the rest values?

Resulting in/out degrees

You can verify that A is symmetric matrix (𝐴 = 𝐴𝑇 ), since we have an

• Google is the leading search and online advertising

• “googol” or 10100 is the mathematical term Google was

• Google’s success in search is largely based on its

• If we assume that people follow web-links randomly, that leads us to

• In other words, “the importance of a page equals to its share of the

• PageRank of a page equals to the probability of being at that page.

Web pages are not equally “important”

• We can use the number of incoming links for a website

• Are all incoming links equal?

2 Based only on their

• Suppose page 𝑖 has 𝑛 outgoing links

• M is a column stochastic matrix

• Considering that v is a column vector

• Additional constraint forces uniqueness

• Gaussian elimination works for small examples, but we need a

• We elaborate the need of regularization in the previously seen

• Suppose v is a vector whose 𝑖𝑡ℎ component is the

• If a surfer chooses a successor page from page 𝑖 at

• In general, the 𝑀𝑖 ∗ 𝒗 gives us the distribution of

• Start with the vector 𝒗 = [1, 1, … , 1] representing the

• Repeatedly multiply matrix 𝑀 with vector 𝑣, allowing

• About 10 iterations is sufficient to estimate the limiting

Amazon Microsoft M*v = v’

M (1/3 1/3 1/3) 𝑻 = (1/3 1/2 1/6)𝑻

Importance of the 3 nodes

Importance of the 3 nodes (y a m)

Importance of the 3 nodes (y a m)

Importance of the 3 nodes (y a m)

Final Importance of the 3 nodes

• When there are dead ends the transition

A substochastic matrix =“not every column sums up to 1”

importance y 1 1 3/4 5/8 0

No available surfers left to travel!

• Some groups of pages may be spider traps (all

importance y 1 1 3/4 5/8

• At each time step, the random surfer has two

• Surfer will teleport out of a spider trap within

• The random teleportation is equivalent to: