The Value of Being Linked In: Yaakov (J) Stein April 2009
The Value of Being Linked In: Yaakov (J) Stein April 2009
Abstract
I semi-empirically study the social networking sites such as LinkedIn.
Such sites enable users to maintain contact information of people they
know and trust (their first degree connections or friends), and to discover
the friends of their friends (their second degree connections), and to access
the friends of the friends of their friends (their third degree connections).
Connections up to some degree (e.g., third) make up a network. I find
the size of such a tree network grows sublinearly with time, even when
its owner actively seeks out new friends. Under simplistic assumptions I
find that the value of such a network to its owner is three times that of a
standard contact list (containing only first degree connection). The total
value of a network of N connections up to d degrees of separation to all
1
its members scales as N 1+ d . This is less than Metcalfe’s law that states
that the value of a fully connected network scales as N 2 , but more than
Odlyzko’s law where the scaling is only N log N . On the other hand, the
cost of maintaining a large network increases faster than the value, and
thus there is an optimal network size from the value point of view. Using
models I am able to estimate the average degrees of separation between
members of my network, and the size of the strongly connected cluster to
which I belong.
1 Introduction
In early 2008 I started receiving from acquaintances inviting me to join
their networks on various professional social networking sites, the most
popular being LinkedIn [9], which at the time advertised over 25 million
users (and is now quoting over 35 million). At first I ignored all such invi-
tations, assuming that most have come either from people seeking jobs or
from marketing types using these sites instead of business cards. However,
I was subsequently led to reconsider my behavior. While tracking down
prior art for a patent application I came across a reference to someone
whom I had never met, and whom I could not locate via a general In-
ternet search. A colleague performed a search on LinkedIn and promptly
produced the desired contact information.
Acknowledgements : Thanks to all my connections who provided information and suf-
fered my endless requests for assistance. Special thanks to Jonathan Levy for his help in
programming the Web-scraping tool.
1
This anecdotal evidence lead me to ask whether it was possible to
quantify the value of such large social networks. Does their value result
from the fact that the tool accesses a large network or purely from the
specific capabilities of the software platform? How much more valuable is
a network of contacts than a conventional contact list, such as provided
by conventional email clients?
It is evident that the social networking platform itself has some advan-
tages, such as the fact that people update their own contact information.
Thus, when your contact changes email address or telephone number, you
need not manually update this information, or even know of the change
until you need to contact him or her. However, securely updating contact
information is surely a solvable problem, and one that probably should
not require large interconnected networks.
So I was left with the question as to whether true value is derived from
the fact that in addition to enabling access to your friends (called contacts
in most conventional tools but connections in LinkedIn terminology) the
user gains visibility to the friend’s friends (second degree connections),
and somewhat more limited knowledge of the friends’s friends’s friends
(third degree connections). This entire network of first, second, and third
degree connections is much larger than the set of friends (usually by a
factor of between 1000 to 10,000), and has recently been the subject of
interest. In order to process information in such networks, various machine
readable formats for describing a FOAF (Friend Of A Friend) have been
developed. Collections of FOAF entries have been called social graphs
[7, 6], and Tim Berners Lee has called the set of all such data the giant
global graph (GGG) [3]. Lee contends that the value of the GGG exceeds
that of the WWW.
An extremely large social network of a different kind was recently
studied by Leskovec and Horvitz [8]. They analyzed a month of commu-
nications activities of 240 million users of the Microsoft Messenger instant
messaging system. They were able to construct the graph describing users
who communicated, and found it to be richly connected, with an average
of less than seven degrees of separation between randomly chosen users.
In order to better understand the advantages of a social network, I
decided to join LinkedIn. I chose to join this particular network for several
reasons.
• it is widely used by professionals in my areas of interest,
• unlike Facebook, its use is not blocked by my company’s firewall,
• there are no undesirable instant messaging or similar features.
People who heard of my decision, gave me the benefit of their (anecdotal)
experience. In particular I was told
• the size of the network increases ”exponentially” (over time?)
• I would be able to build a network with a million connections with
two weeks,
• I would find this network an extremely valuable tool,
• acquiring and maintaining such a network entails no cost,
• I would discover that any two people in a given field of interest are
separated by no more than two or three degrees of separation,
• all 25 million LinkedIn users are somehow connected.
2
I decided to investigate these ideas by carefully keeping track of my
network, as will be explained in section 2. The growth of the network,
both analytically and empirically, is found to be governed by a power law,
and not exponential. It took me over six weeks to reach a network size of
one million.
In addition to providing a compendium of information on your friends
that you can access from anywhere, the LinkedIn platform provides vari-
ous tools to make contact with people in your network, such as introduc-
tion by a friend in common. Such access can be of value when trying to
find potential collaborators or customers, when looking for a new job or
looking for a candidate for a job you need to fill, etc. However, it is clear
that not all network members are equally valuable, and thus the value of
such a network only scales weakly with the total network size. In fact,
the value of the network to me turns out to scale linearly in the number
of friends (first degree connections), and not in the network size. I discuss
value in section 3.
On the other hand, I questioned the idea that growing and maintain-
ing a valuable network are completely free. Even if LinkedIn does not
directly charge for maintaining a network, one certainly has to devote
some time and effort to its upkeep. I shall show that the cost of main-
taining a network scales quadratically in the number of friends, and thus
must eventually exceed the linearly scaling value. In section 4 I empirically
conclude that the trade-off is optimal at about F = 500 friends.
A key feature of social networks is that we can define the degrees
of separation between any two network members. This represents the
number of hops that need to be traversed between the two. Social network
sites usually define the network of a particular user as the set of all users
separated no more than some degree (e.g., 3). Thus although network
membership is symmetric (if you are in my network then I must be in
yours) it is not transitive (the friend of a third degree connection of mine
is not necessarily in my network). I often fall into a self-centered way of
thinking wherein my network is a tree emanating from me and extended to
my third degree connections. In fact there are actually many connections
between members of my network that have nothing to do with me (for
example, my friends often directly know each other as well).
When finding a nonfriend using LinkedIn’s search tools, the degrees
of separation as well as partial information as to the connections along
the path of separation are displayed. Likewise, when I view a friend’s
LinkedIn connections, friends in common are displayed first. However,
LinkedIn does not make available the information necessary to directly
deduce the degrees of separation between any two of my connections. In
section 5 I present indirect inferences regarding connectivity and degrees of
separation based on a pure tree model. I extend these results to somewhat
more general models in section 6.
Finally, I build a detailed model of how a networks grows in section 7
and compare the analytic results to simulation and to the dynamics of
my own network. From this model I was able to estimate the size of
the connected cluster in which I reside. While LinkedIn claimed over 25
million users at the time, I seem to belong to a strongly connected cluster
of only a few million people.
3
2 Linking In
In order to study social networks such as LinkedIn, I conducted an ex-
periment over a period of several months, from the end of August 2008.
I was interested in understanding the growth dynamics of such networks,
the connection between the number of friends and the total network size,
and in determining the value of maintaining such a network.
It is important to note that I studied my LinkedIn network, not the
LinkedIn network. The latter is a complex graph with LinkedIn users as
nodes and bidirectional edges between users who are friends. The only
thing I know about it is the number of users (rounded to the closest five
million). It would be interesting to know the size of the largest connected
cluster, the distribution of the number of friends per user, the degrees
of separation between users, etc. However, gathering such information
would require access to LinkedIn’s database. My information was limited
to the information that LinkedIn provides to its users.
On the other hand my LinkedIn network is not simply a subgraph
of this graph. In my network there is an order relation, assigning to
every node its degree (between zero and three). Furthermore, there is an
underlying tree structure, with each node being connected to me along a
particular path or paths (see Figure 5).
A colleague who was aware of the experiment predicted that I would
reach a network size of over one million within 2 weeks. It actually took
much longer. After one week I had over 100 friends, and a network of close
to 800,000 connections. But from then on the growth slowed considerably,
and the million mark was passed only after over six weeks.
I started my network by accepting several outstanding invitations to
join LinkedIn networks. Once accepted into someone’s network I gained
access to their contact list (although you can opt to keep your contact
list confidential, in practice I found that very few people block access). I
then scanned each contact list looking for people I wished to add to my
network, and sent invitations accordingly. I never returned to rescan a
friend’s contact list afterwards. In a short time I started receiving, and
accepting, numerous unsolicited invitations as well. If the need arose, I
also used the search tools provided by LinkedIn to track down particular
people I needed to contact.
In order to build a valuable network, I only invited, or accepted invi-
tations from, people who met two criteria. The criteria were essentially
the same ones that I normally consider before opening a contact record
in my email client.
First, I had to actually know the person in question. The standard I
used here is stronger than simply recognizing someone’s name. I require
that either I have met the potential friend on multiple occasions, or if on
only on a single occasion, that we had conversed for at least one hour.
Second, there had to be some finite probability that I would need to
contact the potential contact in the future. Such decisions were typically
clear cut, but when in doubt I preferred to err on the side of letting in
a possibly low-value contact rather than blocking someone whose contact
information would be difficult to acquire later if I did end up needing it.
Note that I emphatically did not use profession as a criterion. Al-
4
N2
15000
10000
5000 N1
50 75 100 125 150 175
N2 = 76N1 + 2724
This is a clear indication that while we observe a straight line for this
portion of the graph, the slope is decreasing with increasing number of
friends. We will study the form of this graph in section 7.
5
N3
1500000
1400000
1300000
1200000
1100000
1000000
900000
800000
700000
N2
8000 10000 12000 14000 16000
N3 = 46N2 + 268729
N = 3574N1 + 397159
6
N
1500000
1400000
1300000
1200000
1100000
1000000
900000
800000
700000
N1
50 75 100 125 150 175
linear. This sublinear behavior is a far cry from the predictions of expo-
nential growth.
It may be objected that this (sub)linear tendency is only characteristic
of the first phase of the experiment, when organic growth can be neglected.
The dramatic jump corresponding to the second phase demonstrates that
organic growth can be significant, and may possibly contribute to su-
perlinear growth rates. However, the dramatic jump in this depiction is
misleading. The first and second phases were chosen to be approximately
equal in time duration. During the first phase the number of friends in-
creased from 0 to 170, and the network size from 0 to a million. In the
second phase the number of friends barely increased, but the network in-
creased to about a million and a half. So, in the same time duration, the
organic growth of an already large network contributes only about half
the growth rate of active expansion up to that point.
The organic growth rate is driven by first and second degree con-
nections adding connections of their own. It thus is expected to have
contributions proportional to N1 and to N2
∆N (t)
= βN1 (t) + γN2 (t) + B = β 0 N1 (t) + B 0
∆t
where we have taken the linearity of N2 in N1 into account. Although in
our second phase we did not actively increase N1 in order to isolate the
contributions, were we to continue increasing N1 by n friends per unit
time, we could substitute N1 (t) = nt + n0 and and would see linearly
increasing organic growth.
∆N (t)
= β 0 nt + B 00
∆t
7
N
1E+07
1E+06
1E+05
1E+04 N1
10 100 1000
8
3 Quantifying the Value of Contact Lists
Anyone who has ever had his list of emails or telephone numbers acciden-
tally erased will attest to the value of such lists. Calling the number of
entries in the contact list F , the quantitative value V is proportional to
F . This trivially derives from the fact that if the contact information of
any specific contact is lost or becomes corrupted, the list owner will be
forced to spend time and effort, or alternatively would be willing to pay
money, to restore this information. The cost that the list owner would
be willing to incur to restore a contact’s information is equivalent to the
value assessed by the list owner to the possession of that contact informa-
tion. If each of the F entries has value v then the total cost is V = F v,
which scales linearly with F .
This argument can be readily extended to the case where all contacts
are valuable to some degree (else, the contact should be removed from the
list) but not necessarily equally valuable. For such
P as case each contact
has value vi , the total value of the list is V = vi = F < v > where
< v > is the average value. Thus, once again, the value of the list scales
linearly with F .
In the LinkedIn social network, in addition to full access to N1 = F
first degree connections, one can retrieve more limited information re-
garding N2 second degree connections, and have some access to N3 third
degree connections.
Metcalfe’s law [10] states that the value of a telecommunications net-
work with N participants is proportional to N 2 , i.e., V ∼ N 2 . (I use
V ∼ N 2 to indicate scaling, i.e., V = O(N 2 ).) The reasoning behind
this rule is simple to comprehend. Metcalfe is considering the total value
of the network, i.e., the sum of the value to all participants. Since each
participant can converse with N others, the value of the network to each
scales as N , and the total value for N participants scales as N 2 .
Of course Metcalfe’s law makes the implicit assumption that as N
grows all the new possible pairings remain valuable. In real life, as a net-
work grows, its geographic extent and variability increases, reducing the
value to an individual of each new participant in a manner that decreases
with N .
Metcalfe’s rule is not the most optimistic valuation for a network of N
participants. Reed’s law [13] goes even further based on the premise that
in a network of N participants not only are N 2 conversations possible,
but 2N possible conference calls or email lists. Thus, according to Reed’s
law, V ∼ 2N .
Andy Odlyzko and Ben Tilly [12] dispute these strongly increasing val-
uations and propose a much weaker N log N behavior. They reason that
the value’s dependence on N must be superlinear, but not too strongly so.
For any superlinear dependence, the value of two separate networks is less
than a single unified one, causing strong pressure for disparate networks
to be joined. For example, two networks of size N are separately worth
2N 2 , but together (2N )2 – twice as much. This joining is indeed seen in
the Internet, but many networks take too long time to merge for the N 2
law to be reasonable.
The simplest argument for N log N scaling is to assume that randomly
9
F friends depth=1
of
friends
3
F2
10
value of a second degree connection is necessarily subjective, in practice
the decisions were typically easily made. I rated as valuable a connection
that fulfilled at least one of the following criteria:
• I immediately recognized the connection’s name
• the connection and I have previously communicated more than once
• the connection and I share at least two areas of interest.
Note that these criteria are more lenient than those I use to accept some-
one into my network as a first degree connection.
In this search I found about 120 valuable second degree connections.
More interestingly, I almost always found one or two valuable second
degree connections in the list of each friend. Perhaps surprisingly, the
number of valuable connections was not greater for first degree connections
with more connections than for those with fewer (i.e., larger lists tended to
have more chaff). Thus the number of valuable second degree connections
M2 was only slightly larger than the number of first degree connections
M1 = N1 = F .
Let’s assume that each first-degree connection contributes exactly m
valuable second-degree connections. Then the number of second-degree
connections is M2 = mF . Although not directly verifiable, it would seem
to be highly improbable for a non-valuable second-degree connection to
contribute a valuable third degree connection. On the other hand, valu-
able second degree connections, although held to a looser standard than
friends, are not really that different from my first-degree connections. It
would thus seem probable for each of them to contribute about m valuable
connections to the network.
Taking these two assumptions to be true, we deduce that M3 =
m(mF ) = m2 F , and
M = M1 + M2 + M3 = F + mF + m2 F = (1 + m + m2 )F = f3 (m)F
where,
d−1
X
fd (m) = mp
p=0
for example,
2 1 m=0
3 m=1
X
p
f3(m) = m =
p=0 7
m=2
13 m=3
and the value of the network to me scales like f (m)F = f (m)N1 . In
our simple tree model, N ∼ N13 so that my valuation of my network is
1
V ∼ N 3 . This result can be readily extended to a network with d > 3
degrees of freedom, where my valuation of my network would scale as
1
V ∼ fd (m)F ∼ N d .
Of course, this is my valuation of the network, that is, the value of the
network to me. On the other hand, members of my network find value
in their friends, and some value to all their connections including myself.
Assuming that their situation is not too different from mine, each of the
1
N members of my network also finds value V ∼ N d in their respective
11
networks. hence, the total value of the network for all N members is N
1
times the value for each, which thus scales as N 1+ d .
For simple contact lists d = 1 and thus the total network value V ∼ N 2
as required by Metcalfe’s law. For networks such as LinkedIn d = 3, so
4
that V ∼ N 3 . This is weaker than Metcalfe’s law V ∼ N 2 , but stronger
than Odlyzko’s law V ∼ N log N .
12
• people are queued up waiting for me to accept their invitations,
• I haven’t been taking care of my LinkedIn network, I am too busy
with Facebook and MySpace,
• sorry for taking so long to accept your invitation, I get so many that
I only process them once a month,
• too many people are asking me to join to get access to my contacts,
• I have to do something for someone twice a week, but only use it
myself every few months,
• I have stopped accepting invitations - there are too many job seekers
out there.
People with between 200 and 700 friends seem to divided about half
and half. Some put a lot of time into keeping building their profile and
keeping it up to date, while others cut and paste something once and have
since forgotten about it. When questioned specifically as to how often they
retrieve useful information from the network versus how frequently they
are asked to take some action for someone else, the numbers ranged from
twice a week to it’s only happened a few times for both questions. We
can surmise that the optimal size is somewhere around F = 500. Under
that, the upkeep costs are small and useful information can be retrieved
relatively inexpensively. Over that, the maintenance costs become exces-
sive, and people that well connected apparently have alternative ways of
getting the needed information.
0 100 200 300 400 500 600 700 800 900 1000
13
friends) the histogram is approximately as linearly decreasing. For larger
networks the distribution is long-tailed (in the figure I somewhat arbitrar-
ily use the F −2 power law). While there is insufficient data to be able
to accurately determine the exponent of the power law, it is clear that
the behavior is that of a scale-free network [2] rather than exponential
decrease of a random network [4].
The remarkable feature here is the clear existence of delineated regimes.
The linear decrease occurs in the realm where value exceeds cost, and thus
people are motivated to continue adding friends to their networks. The
power law decrease is in the regime where the cost exceeds value. In be-
tween these two is a noticeable (and statistically significant) bump in the
histogram corresponding to more than expected friends of mine having
between 400 and 500 friends of their own. This overabundance comes at
the expense of fewer than expected friends having contact lists with 550 to
750 contacts. The existence of the regimes and the bump lend credence to
our conclusion that there is a cross-over somewhere in the neighborhood
of 500 contacts.
5 Degrees of Separation
It is of great interest to know the number of degrees of separation between
two members of my network. Milgram’s six degrees of separation rule (also
known as the small world phenomenon) [11] states that anyone is at most
six degrees of separation from anyone else on the planet. Milgram did
not invent the number six. The first to state six degree of separation
concept was Frigyes Karinthy (1887-1938), the Hungarian author, play-
wright, poet, and journalist. Karinthy believed that the modern world was
shrinking in the sense that technology was making social distances much
smaller than physical distances. In a short story entitled Chain-Links
in his 1929 volume of short stories Everything is Different his characters
conjecture that any two individuals could be connected through at most
five acquaintances.
While six degrees of separation has become popularized for any two
humans, people working in the same narrow field of international interest
are probably much closer than that. One well-known case is that of Erdös
numbers [5] that have become entrenched in the mathematician’s folklore.
Paul Erdös (1913-1996) wrote over 1,400 mathematical research papers
with over 500 co-authors. Erdös himself is given Erdös number 0, and his
co-authors (including Odlyzko whose law was mentioned in section 3) have
Erdös number 1. Co-authors of people with Erdös number 1 (who are not
Erdös nor have written a joint paper with Erdös) have Erdös number 2,
and so on recursively. Anyone not on the Erdös collaboration graph has
an infinite Erdös number. My own Erdös number is apparently 5 through
three different paths. Most active mathematicians have relatively low
Erdös numbers (i.e., few degrees of separation from Paul Erdös) and are
even closer to randomly chosen active mathematicians [1].
By construction, I am never further than three degrees of separation
from anyone in my LinkedIn network of first, second, and third degree
connections. However, two people in my network may be up to six degrees
14
of separation from each other, three from the first person to me, and three
back to the second person.
What is the average distance between any two members of my net-
work? In a pure tree model with large F , the great majority of network
members are third degree connections, and a randomly picked pair of them
will mostly likely be separated by the full 6 degrees. However, for finite
F the average separation will be smaller.
Let’s define the average separation < d > as the expectation of the
separation when we pick two network members i and j at random. This
will be the same as finding the expected number of degrees of separation
< di > of a randomly chosen member j from a given i, averaged over all
possible i.
1 X
<d> = dij
N (N − 1)/2
i6=j
1 X
= dij
N (N − 1)
i<j
1 X 1 X
= dij
N N −1
i j
1 X
= < di >
N
i
Let us assume that the network is a tree with fan-out F . From sym-
metry it is clear that all i of the same depth k on the tree will have the
same < di >. Thus rather than averaging over all i to find < d >, we
need only perform the weighted average over all depths.
1 X
< d >= N k < dk >
F3 + F2+F +1
k
This significantly reduces the analysis effort required to find the desired
average.
The degrees of separation between two nodes on our network tree
depend on the depths of the two nodes, and the depth of their (borrowing
terminology from family trees) Most Recent Common Ancestor (MRCA).
In Table 1 I give all the information required to calculate the average
degrees of separation between two nodes on a network tree. The first
column contains the depth k of the node for which we want to calculate
< dk >. The second column lists the depth of the node to which we want
to find the degrees of separation. In general there will be several rows
for each such depth, differentiated by the depth of the MRCA. The third
column gives the number of such second nodes, and the final column gives
the degrees of separation. Note that each pair of nodes appears twice in
the table.
In order to calculate < dk > we need to average over all rows belonging
to a given first column, of the fourth column weighted by the third column.
Next, we need to divide by the total number of nodes on the tree excepting
the present one.
1
3F 3 + 2F 2 + F
d0 =
F3 2
+F +F
15
from to MRCA number separation
0 1 0 F 1
2 0 F2 2
3 0 F3 3
1 0 0 1 1
1 0 F −1 2
2 1 F 1
0 F2 − F 3
3 1 F2 2
0 F3 − F2 4
2 0 0 1 2
1 1 1 1
0 F −1 3
2 1 F −1 2
0 F2 − F 4
3 2 F 1
1 F2 − F 3
0 F3 − F2 5
3 0 0 1 3
1 1 1 2
0 F −1 4
2 2 1 1
1 F −1 3
0 F2 − F 5
3 2 F −1 2
1 F2 − F 4
0 F3 − F2 6
1
4F 3 + F 2 − 1
d1 =
F3 2
+F +F
1
5F 3 + 2F 2 − F − 2
d2 =
F3 + F2 + F
1
6F 3 + 3F 2 − 3
d3 = 3 2
F +F +F
Finally, the average degrees of separation is the weighted average over
all depths k of < dk >.
1
d0 + F d1 + F 2 d2 + F 3 d3
<d> = (1)
(F 3+ F2
+ F + 1)
1
6F 6 + 8F 5 + 6F 4
= 3 2 3 2
(F + F + F + 1)(F + F + F )
16
<d>
6
2
5/3
F
10 20 30 40 50
2 nodes of depth 1 and 2 have two nodes with one degree of separation,
and one with 2 degrees of separation, for an average of 34 . Thus the overall
average is half of 2 + 34 , i.e., 53 .
As expected, as F → ∞ we see that the average degrees of separation
approaches 6. This means that the pure tree model predicts that although
my connections are all on LinkedIn, and although they are all no more
than three degrees of separation from me, they are just about as far from
each other as any two people on the planet.
17
10000
1000
100
10
1 overlap
0 0.2 0.4 0.6 0.8 1
they have in common (not including myself) Fij divided by the smaller of
the number of friends i has or j has (once again, not including myself).
Fij
Oij =
min(Fi, Fj )
A histogram of these overlaps is displayed on a logarithmic scale in Fig-
ure 8. The overlap decreases from two thirds of the pairs with zero over-
lap, to less than a fifth of a percent with overlap over 12 . There is a slight
overtendency for overlaps to be around 13 .
The detailed information gathered by scraping my LinkedIn network
can be used for further studies, such as blind determination of relation-
ships between people. Anecdotally, I discovered that two friends of mine
with a surprisingly high overlap had once worked together. It is simple to
define a distance measure between two members of my network based on
the overlap, and to perform clustering and cladistic analysis of this data.
Such an analysis is in its early stages and will be described elsewhere.
Now we can reperform the calculation that lead to Equation 1 taking
cross-link connections into account. The simplest topology that we can
study is based on the pure tree model, but augmented with cross-links
between friends with probability p1 = 31 . This is the only model for which
we have access to the probabilities, since one can not scrape from LinkedIn
information regarding the probability of cross-links between higher degree
connections. In addition, the probability of such higher degree cross-links
will undoubtedly be much lower.
Repeating our computation is straightforward. The intermediate re-
sults are presented in Table 2 and the average degrees of separation from
connections at the different depths are easy to find.
1
3F 3 + 2F 2 + F
d0 =
F3 + F2 + F
1
(4 − p1 )F 3 + F 2 − (1 − p1 )
d1 = 3 2
F +F +F
18
from to MRCA number separation
0 1 0 F 1
2 0 F2 2
3 0 F3 3
1 0 0 1 1
1 x p1 (F − 1) 1
1 0 (1 − p1 )(F − 1) 2
2 1 F 1
x p1 (F 2 − F ) 2
0 (1 − p1 )(F 2 − F ) 3
3 1 F2 2
x p1 (F 3 − F 2 ) 3
0 (1 − p1 )(F 3 − F 2 ) 4
2 0 0 1 2
1 1 1 1
x p1 (F − 1) 2
0 (1 − p1 )(F − 1) 3
2 1 F −1 2
x p1 (F 2 − F ) 3
0 (1 − p1 )(F 2 − F ) 4
3 2 F 1
1 F2 − F 3
x p1 (F 3 − F 2 ) 4
0 (1 − p1 )(F 3 − F 2 ) 5
3 0 0 1 3
1 1 1 2
x p1 (F − 1) 3
0 (1 − p1 )(F − 1) 4
2 2 1 1
1 F −1 3
x p1 (F 2 − F ) 4
0 (1 − p1 )(F 2 − F ) 5
3 2 F −1 2
1 F2 − F 4
x p1 (F 3 − F 2 ) 5
0 (1 − p1 )(F 3 − F 2 ) 6
1
(5 − p1 )F 3 + 2F 2 − F − (2 − p1 )
d2 =
F3 + F2 + F
1
(6 − p1 )F 3 + 3F 2 − (3 − p1 )
d3 = 3 2
F +F +F
Finally, the average degrees of separation between any two connections
is the weighted average over all depths k of < dk >.
(6 − p1 )F 6 + (8 − p1 )F 5 + (6 − p1 )F 4 + p1 F 2 + p1 F
< d >= (2)
(F 3 + F 2 + F + 1)(F 3 + F 2 + F )
Note that for p1 = 0 we return to equation 1 as required. We depict the
19
<d>
6
2
5/3
C
10 20 30 40 50
20
F 2 (F 2 −1)
2
pairs of second degree connections, F F (F2−1) share MRCA of
the first degree, and thus would be expected to have a cross-link with
3
probability p1 ! This leaves F (F2 −1) pairs with probability p21 so that the
p +F p2
average probability is p2 = 1F +1 1 . So far large F it is indeed true that
p2 ≈ p21 .
In similar fashion we expect for large F for the probability of cross-
links between two third degree connections to be p3 = p22 = p41 . For large
F we can neglect all other cross-links and all pairs of connections except
those whom would have been at separation 6 in the pure tree model.
These connections are directly connected, and thus at separation 1 with
probability p41 , and otherwise at separation 6.
For p1 = 31 this works out to be about 5.94, although it is lower for finite
F.
So higher degree cross-links are even less effective at reducing < d >
than cross-links between friends. Even taking both effects together would
not reduce the average value to less than 5. Perturbed tree models do not
lead to low numbers of degrees of separation.
21
for description of the full LinkedIn network, but does not seem to properly
capture the features of my personal network.
In this section I take a different approach. I incrementally grow
my connection tree, assuming that there are a finite number of poten-
tial connections N from which to grow it. At the stage when I have
amassed k friends I will call the total number of connections in my net-
work N (k), and the number of first, second, and third degree connec-
tions N1 (k) = k, N2 (k), and N3 (k) respectively. I call the number of
second degree connections gained by adding the kth friend n2 (k), i.e.,
N2 (k) = N2 (k − 1) + n2 (k). Similarly, the finite difference of N3 (k) is
called n3 (k), so that N3 (k) = N3 (k − 1) + n3 (k).
I assume that before I start growing my tree everyone else in the
network has already chosen F friends, and call the probability of any any
F
specific user having another user as a friend p = N .
I start growing my network by choosing a first connection. This gives
me N1 (1) = 1 first degree connections, and N2 (1) = n2 (1) = F second
degree connections. When I choose my second connection (different from
the first), I have N1 (2) = 2 first degree connections, but somewhat fewer
than 2F second degree connections. In fact, the probability that any
one of my second connection’s connections was also chosen by my first
connection is the ratio of his number of connections to the number of
F
users in the network, i.e. N = p. Thus the probability that it was not one
of his connections is 1 − p, and the expected number of new connections of
the second degree is n2 (2) = F (1 − p). So the total number of connections
of the second degree is N2 (2) = N2 (1) + n2 (2) = F + F (1 − p) = 2F − pF ,
indeed slightly less than 2F . I neglected the possibility that my first friend
is also a friend of my second friend, but assuming F >> 1 this will not
significantly change the results.
Now, adding a third (different) connection brings me to n1 (3) = 3,
but the probability of overlap with either of the first two connections is
N2 (2)
N
, so that
N2 (2)
n2 (3) = F (1 − )
N
F
= F {1 − (1 − p)} = F {1 − p(1 − p)}
N
= F (1 − 2p + p2 )
and
In like fashion
and in general, the total number of second degree connections after adding
22
k connections is
k k 2
N2 (k) = F k− p+ p − . . . ± pk−1
2 3
1 k k
= F k(−p) + (−p)2 + (−p)3 + . . . (−p)k
−p 2 3
1 1 − (1 − p)k
−1 + (1 − p)k = F
= F
−p p
N 1 − (1 − p)k
=
N2 (k) = N (1 − (1 − kp)) = N kp = F k
so that N3 (k) relates to N2 (k) in the same way that N2 (k) relates to
N1 (k) = k.
In order to check the approximations, I simulated the behavior of a
small network. My simulated environment consists of 65,536 users (each
user is given a 16-bit identifier). I initialize an array of length 65,536 that
represents the degree of each user in my tree by entering a suitably large
number in each position. The fanout was chosen to be F = 64.
I choose friends one at a time by randomly selecting a 16-bit number,
and entering the value 1 in the array of 65,536 users. Each of these
friends randomly selects F = 64 friends (my second degree connections)
by randomly choosing F 16-bit numbers, and entering the value 2 in
the array, unless there is already a 1 there. Each of my second degree
connections chooses F = 64 friends (my third degree connections) at
23
4096
3584
3072
2560
2048
1536
1024
512
0
0 8 16 24 32 40 48 56 64
Figure 10: The results of network growth simulation. The graph depicts
the number of second degree connections N2 (k) as a function
of the number of friends k, for k from 1 to 64. Note the slight
deceleration of growth as compared to linear growth expected
for the infinite network case.
65536
57344
49152
40960
32768
24576
16384
8192
0
0 8 16 24 32 40 48 56 64
Figure 11: The results of simulating network growth. The graph depicts
the total number of connections N1 (k) + N2 (k) + N3 (k) as a
function of the number of friends k, for k from 1 to 64. Note
the strong leveling off of growth due to finite network size.
10000
1000
100
10
1
1 10 100 1000
Figure 12: The results of simulating network growth. The graph depicts
N2 (k) vs. N1 (k) = k, and N3 (k) vs. N2 (k) on a logarithmic
scale as compared to the expected curve.
24
1e+06
100000
10000
1000
10 100 1000 10000
Figure 13: Finding NLI by matching the predicted growth curve. The
graph depicts N2 (k) vs. N1 (k) = k and N3 (k) vs. N2 (k) on
a logarithmic scale as compared to the expected curve. In
this graph NLI = 2, 000, 000.
random and enters a 3 into the array, unless the value already in that
position is less than 3. Finally, I scan the array counting how many 1s,
2s, and 3s are present. I repeat this procedure for between 1 and 64
friends.
Figure 10 depicts the number of second degree connections (i.e., num-
ber of entries in the array with value 2) as a function of k, for k between
1 and 64. Were there an infinite number of users we would expect this to
be a linearly increasing function, with 64 new second degree connections
for each new friend, and 642 = 4096 second degree connections at the end.
The actual behavior displays a slight deceleration of growth in comparison
with the superposed straight line.
Figure 11 depicts the total network size (i.e., number of entries in the
array with value less than or equal to 3) as a function of k, for k between
1 and 64. For an infinite number of users we would expect
N3 (k) = N1 (k) + N2 (k) + N3 (k) = k + kF + kF 2 = k(1 + F + F 2 )
which for k = 64 would reach N3 (64) = 64 + 642 + 643 = 266, 304. Ob-
viously the effect of there only being 65,536 possible connections is very
pronounced.
Figure 12 depicts both the number of second degree connections as a
function of the number of first degree connections (from Figure 10) and
the number of third degree connections as a function of the number of
second degree ones. The similarity of behavior of N3 (k) vs. N2 (k) as
compared to N2 (k) vs. N1 (k), is evident. In addition, I include
k
64
65536 1− 1−
65536
as a continuous line. The match is relatively good, but the logarithmic
scale hides the fact that the simulation falls detectably below the pre-
diction for large k. This shows that the assumptions break down in this
regime.
Having strengthened my confidence in the reasonableness of the ap-
proximations in the above analysis, I can now apply the results to the
25
data from my LinkedIn experiment. According to the theory, Figure 1
and Figure 2 should both be small portions of a single graph representing
the growth of a network with a maximum number of potential members
NLI .
F k
NLI 1 − 1 −
NLI
We determined F in section 2 to be about 78, all we need now is to perform
a best fit of our data to find NLI . The result, depicted in Figure 13, is
that NLI is surprisingly low, on the order of a few million at most. This
is much lower than the advertised 25 million registered LinkedIn users.
There are several possible explanations for this discrepancy. First, my
network may indeed be contained in a connected cluster of only a few
million users. This is possible as I mainly invited or accepted invitations
from people in one of my fields of interest, and did not pro-actively search
for people far removed from these fields. Second, connectedness is defined
only up to three hops, while fourth and higher degree connections are
not considered in the LinkedIn framework. Third, the technique used to
ascertain NLI is not really able to determine the size of connected clusters.
It estimates instead the size of the strongly connected cluster, neglecting
isolated links that were not taken. In addition, the model is not being
sensitive to links of my connections that I am not likely to use, further
reducing the effective size.
8 Concluding remarks
Interestingly, my semi-empirical study proved all the tips given to me to
be incorrect.
I was told that my network would increase ”exponentially”. Even were
I able to keep up adding some number of friends per day, the network only
grows linearly over time. Once I have most of my friends in my network,
the network growth slows considerably to the organic growth rate which
is sublinear in time. This result becomes even more reasonable when it
is realized that the network size is already on the order of the size of it
embedding connected cluster.
I was told that I would have a million connections in two weeks. It
took about three times as much time (although others have told me that
it took them much less). Even after a year I had not attained the two
million mark.
I was told that I would find this network an extremely valuable tool.
The value turns out to be only linear in the number of friends, albeit with
a proportionality constant above 1. So while social networks are more
valuable than simple contact list, they scale much more slowly in total
network size than other types of networks. After a year of use, I find that
I only use it once every week or two.
I was told that acquiring and maintaining such a network entails no
cost. It turns out that the maintenance cost increases more rapidly with
network size than the value, and that above about 500 friends maintaining
the network becomes time consuming and cumbersome. After a year I find
that I need to service requests (mostly invitations from people I turn down
26
due to not meeting my criteria) at about the same frequency as I exploit
LinkedIn for my own needs.
I was told that any two people in any of my fields of interest would
be separated by no more than two or three degrees of separation. If tree-
based models (natural for this kind of network) are to be believed, the
degrees of separation between two randomly chosen people in my network
are close to the maximum of 6, which incidentally is the value quoted as
the separation between any two randomly chosen people. On the other
hand, the network statistics I could gather does not match other models,
such as random graphs or scale-free networks.
I was told that all 25 million LinkedIn users are connected (i.e., that
there exists a single giant cluster to which essentially all users belong). In
practice, my network seems to reside in a connected cluster of only a few
million, from which I could not break out.
27
A Parsing LinkedIn Data
Computing the probability that two LinkedIn connections are themselves
connected is not as easy as it seems. The idea is to retrieve and parse
the connection lists of all friends and to look for duplicates. Retrieving
web pages and extracting information from them is commonly called web
scraping.
However, LinkedIn presents a challenge to web scrapers as it is based
on AJAX (Asynchronous JavaScript and XML) programming. Viewing
the source page of AJAX sites in the usual fashion reveals the Javascript
source code, not the desired information. It is not difficult to retrieve the
source of the HTML page as viewed, but the method to do so is browser
dependent. Internet Explorer users viewing an AJAX generated page need
only jump to the following address:
javascript:’<xmp>’%20+%20window.document.body.outerHTML+%20’</xmp>’
in order to dump the raw data to the screen for scraping.
The first thing one needs to know is that LinkedIn assigns a numeric
key that uniquely identifies each user. You can readily see the key of any
of your friends by jumping to that friends profile from your connection
list. Look at the profile page’s URL, it will be of the form:
http://www.linkedin.com/profile?...key=xxxxxxxx....
and that connection’s key is readily seen. Finding you own key is a bit
trickier. One way is to go to any of your pages such as your profile or
connection list, dump the raw format as described above, and search for
the string key=.
Now from the my contacts page:
http://www.linkedin.com/connections?trk=hb_side_connections
I can easily produce a list of all my contacts, their numeric keys, and
the number of contacts they have. I can then form the URLs of the pages
with their contacts, which are all of the form:
http://www.linkedin.com/profile?viewConns=&key=xxxxxx&split_page=n
with 60 contacts per page.
The program I used to automate this process is called linkscraper.
The program starts with my contact list and extracts the keys of all of
my friends. It also observes how many friends each friend has. After
performing these steps, linkscraper creates the URLs of the contacts list for
each friend, visits each of these links, dumps their raw format, and extracts
the numeric keys of their connections (my second degree connections).
The program then sorts the connections belonging to each friend creating
a sorted list of numeric keys for each of my connections. It is now a simple
matter of looping over every two different connections and checking if the
two files had a numeric key in common. In addition, linkscraper calculates
the number of connections in common between any two friends, building
a triangular array that can be later analyzed in various ways.
28
References
[1] The American Mathematical Society Math-
SciNet Collaboration Distance calculator, online at
http://www.ams.org/mathscinet/collaborationDistance.html.
[2] A.L. Barabási and R Albert, Emergence of Scaling in Random Net-
works, Science 286 509-512 (1999).
[3] Tim Berners-Lee, Giant Global Graph, online at
http://dig.csail.mit.edu/breadcrumbs/node/215.
[4] P. Erdös and A. Rényi, On Random Graphs, Publicationes Mathemat-
icae 6: 290 - 297 (1959); The Evolution of Random Graphs, Magyar
Tud. Akad. Mat. Kutató Int. Közl. 5: 17 - 61 (1960).
[5] The Erdös Number Project, online at http://www4.oakland.edu/enp/.
[6] Brad Fitzpatrick and David Recordon, Thoughts on the Social Graph,
online at http://bradfitz.com/social-graph-problem/.
[7] Alex Iskold, Social Graph: Concept and Issues, online at
http://www.readwriteweb.com/archives/social graph concepts and issues.php.
[8] J. Leskovec and Eric Horvitz, Planetary-Scale Views
on a Large Instant-Messaging Network, online at
http://www.cs.cmu.edu/ jure/pubs/.
[9] About LinkedIn, online at http://www.linkedin.com/static?key=company info.
[10] B. Metcalfe, Metcalfe’s Law: A network becomes more valuable as it
reaches more users, Infoworld, Oct. 2, 1995.
[11] S. Milgram, The small-world problem, Psychology Today 1: 61-67,
1967.
[12] A. M. Odlyzko and B. Tilly, A refutation of Metcalfe’s Law and a
better estimate for the value of networks and network interconnections,
online at http://www.dtc.umn.edu/ odlyzko/doc/networks.html.
[13] D. P. Reed, Weapon of math destruction: A simple formula explains
why the Internet is wreaking havoc on business models, Context Mag-
azine, Spring 1999.
29