
Patent Citation Network Analysis

This document contains the report submitted as part of the NAM Project for CSCI 5352. It details the analysis of the U.S. Patent Citation Network.

Analysis of U.S. Patent Citation Network


Krishna Chaitanya Sripada, Sesha Sailendra Chetlur
Keywords: Preferential Attachment, Relevance, Reciprocity, Generality, Originality, Self-Citations.

Abstract

The U.S. patent data tells us a lot about how technological fields have evolved over time. Viewed as a network,
the data lets us infer many things from the pattern of citations that connect one patent to another. There
are a number of points of similarity between the citation network of publications and the
citation network of patents. One of the main goals of this project was to observe whether
preferential attachment can be seen in the patent network, as it is in the citation network
for publications. Other aims include a general study of certain properties of the network, such
as reciprocity, degree distribution, generality, originality, and trends in self-citations [1]. Finally, we discuss
the shape we expect the patent citation network to eventually take.

Introduction

The similarity of the patent network to the paper citation network lies in the fact that in both cases, newer
nodes connect to older ones by means of a citation. Both therefore form a directed graph.
In the case of the paper citation network, it has been shown that the nodes that entered
the network at a very early stage of its formation tend to hold a sort of pioneer status and attract
citations from the newer nodes that are added to the network. This has been described as the preferential
attachment model [2], where the probability of an incoming edge being added to an existing node grows with
the number of edges the node already has, which in practice makes it a function of the age of the node.
As the network grows, the older nodes therefore tend to gain more inbound edges, slowly forming a
hub-dominated, star-like structure [3]. The hypothesis considered
as part of this project questions whether the same is the case for the patent citation network.
Speaking in terms of technological development, it makes sense to believe that, as time goes by,
a field of research may well become obsolete and be replaced by newer and more relevant technologies.
As a result, a patent filed at present would have no reason to cite the oldest patent available but
would instead cite the most relevant patent that had the earliest impact on the field. The probability
of an incoming edge being added to a node should then no longer be a function of its age, but rather
a function of its relevance to the field of research put forth in the citing patent. This may give rise to
what we term a relevance curve, in which the maximum citation count belongs
to a patent filed significantly later than the pioneer patents. A graph
of the year versus the maximum number of citations in that year would then show a curve whose peak
lies somewhere to the right of the origin.
Filed patents are categorized into different classes, with each class identifying a particular field
of research. Each class is associated with a number that identifies it uniquely. A patent may
also cite other patents, reflecting the existing knowledge that the citing patent depends on. The data
collected contains information about all the patents filed in a particular time period, and their citations
form a directed network, which can be used for analytical purposes.

Another interesting aspect to study is the way citations flow between different classes of fields of research.
This helps us understand the relations between different technologies and can give us an idea of how different
fields of research can be combined to give rise to interdisciplinary fields of study. The measures for
this that have been studied as part of this project are generality and originality.

Data

The data set was collected from the Harvard Business School's Dataverse Network [5] and comprises two
parts: the first part contains data from the year 1975 to 1999, and the second part from the year 2000 to
2010. Each of these data sets contains information regarding the patents, such as the patents that have
been cited, the year in which the patent was filed, the class to which the patent belongs in terms of the field
of research, the inventor data, and so on.
Together, these data sets consist of approximately 6 million nodes and 20 million edges. The data, which was
available in CSV format, was loaded as tables into a SQLite database, and a set of queries was run to
extract the data relevant for analyzing this network. For the purpose of analysis, we sampled the
data to include only a certain set of technology classes under which patents were filed. These classes
were related specifically to the fields of Computer Science and Electrical Engineering.
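
The loading and sampling step can be sketched roughly as follows. This is only an illustration under stated assumptions: the table names, the column names (`patent_id`, `app_year`, `tech_class`, `citing`, `cited`), and the toy rows are hypothetical and do not reflect the actual Dataverse files, and keeping only citations whose two endpoints both fall in the selected classes is one possible reading of the sampling described above.

```python
import csv
import io
import sqlite3

# Hypothetical schema; the real CSV files use their own column names.
SCHEMA = """
CREATE TABLE patent   (patent_id TEXT PRIMARY KEY, app_year INTEGER, tech_class TEXT);
CREATE TABLE citation (citing TEXT, cited TEXT);
"""

def load_csv(conn, table, text):
    """Bulk-insert CSV text (first row is the header) into an existing table."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    placeholders = ",".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {table} ({','.join(header)}) VALUES ({placeholders})", body
    )

def sample_edges(conn, classes):
    """Return (citing, cited) pairs whose two patents both belong to `classes`."""
    marks = ",".join("?" for _ in classes)
    query = f"""
        SELECT c.citing, c.cited
        FROM citation AS c
        JOIN patent AS p1 ON p1.patent_id = c.citing
        JOIN patent AS p2 ON p2.patent_id = c.cited
        WHERE p1.tech_class IN ({marks}) AND p2.tech_class IN ({marks})
        ORDER BY c.citing, c.cited
    """
    return conn.execute(query, list(classes) * 2).fetchall()

# Toy data standing in for the real files.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_csv(conn, "patent",
         "patent_id,app_year,tech_class\nA,1980,709\nB,1992,709\nC,1995,101\n")
load_csv(conn, "citation", "citing,cited\nB,A\nC,A\nC,B\n")
edges = sample_edges(conn, ["709"])
```

With the real data, the in-memory database would be replaced by a file on disk and the CSV text by the downloaded files.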

Degree Distribution

Patent citation networks are directed networks with no loops. The figures below show three graphs of
vertex degree distributions (i.e., in-degree, out-degree, and degree). The graphs are plotted on a logarithmic
scale. We see that the degree distributions have a power-law tail [3], which indicates that the probability
of selecting a vertex with $k$ edges is distributed according to $k^{-\gamma}$, where $\gamma$ is a constant.
Networks of this kind are called scale-free networks [4].
The degree distribution for this network follows a power law for degrees larger than 20. Since our dataset
comprises data from 1975 to 2010 and no citation data is available for the period before that, we assume
that the patents granted prior to 1975 have no citations.
The out-degree distribution tells us that most patents cite fewer than 100 other patents, although
there are patents that cite a very large number of other patents.

Figure 1: Degree distributions: (a) in-degree distribution, (b) out-degree distribution, (c) degree distribution
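
As a rough illustration of how such distributions can be obtained from an edge list, the sketch below counts in- and out-degrees and estimates a tail exponent. The estimator is the common approximate maximum-likelihood formula for discrete power laws, used here as an assumption; the text does not say how the tail was fitted. The cutoff `k_min=20` follows the threshold mentioned above, and the function names are illustrative.

```python
import math
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for (citing, cited) pairs.

    Vertices with degree zero do not appear in the corresponding Counter.
    """
    out_deg = Counter(citing for citing, _ in edges)
    in_deg = Counter(cited for _, cited in edges)
    return in_deg, out_deg

def degree_distribution(degrees):
    """Map each degree k to the fraction of listed vertices having degree k."""
    hist = Counter(degrees.values())
    total = sum(hist.values())
    return {k: n / total for k, n in sorted(hist.items())}

def powerlaw_exponent(degrees, k_min=20):
    """Approximate discrete MLE of the exponent for the tail k >= k_min."""
    tail = [k for k in degrees.values() if k >= k_min]
    if not tail:
        raise ValueError("no vertices with degree >= k_min")
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)
```

Plotting the result of `degree_distribution` on log-log axes gives a picture comparable to Figure 1; the total degree of a vertex is the sum of its two counters.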

Relevance

As mentioned earlier, our hypothesis was that new incoming citations prefer to attach to
a patent based on its relevance to the field of research rather than on its age. This is not to say that
preferential attachment can be totally ignored, as is explained later on. Figure 2 below shows
the number of citations versus the year for a subset of classes.

Figure 2: Graphs showing the variation of citation count with respect to the year: (a) Class 709, (b) Class 715, (c) Class 718, (d) Class 726

From the above samples, we can see that the patents with the most citations are not the ones filed
in the late 70s but rather those filed in the early 90s. From there onwards, we see a decrease in the citation counts.
The second half of each graph, where the number of citations begins to decrease, shows what could possibly
be interpreted as a preferential attachment model. Hence, preferential attachment cannot be completely
discounted, as mentioned above. What we end up with seems to be an amalgamation of relevance and
preferential attachment, where the probability of a patent receiving citations is a function
of both relevance and age.
The conclusion that can be drawn from this observation is that citations are based more on the
relevance of the patent. The fields of research developed in the late 70s in the classes above changed
drastically, and by the 90s these changes were significant enough to make the previous
technologies rather obsolete. It is possible that later patents
built not on the initial knowledge but on these more recent and more relevant ideas, upon which further
work could be done.
Following this idea, we may further hypothesize that, as time goes by, the peak will keep being
drawn towards the right, so that the maximum citations always accumulate at the year
in which a significant change occurred in the course of the field of research. This sort of behavior could
aptly be called the relevance curve.
More plots can be found here: Additional Plots.
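
A relevance curve of this kind can be sketched as below. It assumes an edge list of `(citing, cited)` pairs already restricted to one class and a hypothetical `filing_year` mapping from patent to filing year; the curve records, for each year, the highest citation count reached by any patent filed in that year.

```python
from collections import Counter

def relevance_curve(edges, filing_year):
    """For each filing year, the largest citation count of a patent filed that year."""
    citations = Counter(cited for _, cited in edges)
    curve = {}
    for patent, year in filing_year.items():
        curve[year] = max(curve.get(year, 0), citations[patent])
    return dict(sorted(curve.items()))

def peak_year(curve):
    """Year at which the curve peaks (the earliest such year on ties)."""
    return max(sorted(curve), key=lambda year: curve[year])
```

Under pure age-driven attachment, `peak_year` would be expected to sit at the earliest years in the data; a peak well to the right is what the relevance hypothesis predicts.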

Reciprocity

The next observation concerns the idea that if the number of citations going from class X
to class Y has a certain value, the number of citations coming back from class Y to class X
could be expected to fall in the same range. Figure 3 shows, for each of several classes, the citations
to and from that class.

Figure 3: Graphs showing the property of reciprocity between classes: (a) Class 703, (b) Class 706, (c) Class 713, (d) Class 717

We can see that regardless of whether two classes are tightly or loosely coupled, reciprocity
is maintained in the sense that if class X doesn't relate to the technologies of class Y, class Y doesn't
work with the technologies of class X either. We can also clearly see from the graphs that the number of
citations is highest within a class itself. This shows an assortative structure, which is discussed
later as part of the discussion on the hypothesized shape of the network.
More plots can be found here: Additional Plots.
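
The class-level comparison can be sketched as follows, assuming a `(citing, cited)` edge list and a hypothetical `patent_class` mapping from patent to class. For a chosen class, the table lists how many citations go to and come from every class it is connected to, which is the pair of values compared in the reciprocity graphs.

```python
from collections import Counter

def class_citation_counts(edges, patent_class):
    """counts[(X, Y)] = number of citations from patents of class X to class Y."""
    counts = Counter()
    for citing, cited in edges:
        counts[(patent_class[citing], patent_class[cited])] += 1
    return counts

def reciprocity_table(counts, focus):
    """For class `focus`, list (other_class, citations_to_other, citations_from_other)."""
    others = {y for x, y in counts if x == focus} | {x for x, y in counts if y == focus}
    return [(o, counts[(focus, o)], counts[(o, focus)]) for o in sorted(others)]
```

The row for the class itself holds the within-class citation count in both columns.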

Generality & Originality

The terms Generality and Originality are defined by Bronwyn H. Hall, Adam B. Jaffe, and
Manuel Trajtenberg in their paper [1]. We applied these measures to our dataset and found the following:

7.1 Generality

The measure of generality gives us an idea of how general a patent's idea is. The generality score
is higher if a patent attracts a higher percentage of its citations from a diverse
range of classes other than its own. If the idea is very specific to its own technological field, the percentage
of citations it receives from its own class is much higher. The score itself is calculated simply by
dividing the number of citations from patents of classes other than a patent's own by the total number of
citations received by the patent. Figure 4 below shows how the generality score has varied over time.

Figure 4: Variation of generality score over the years
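
The ratio described above can be sketched as follows. The `patent_class` and `filing_year` mappings are hypothetical inputs, and averaging the score over the patents filed in each year is an assumption about how a yearly curve like Figure 4 could be produced.

```python
from collections import defaultdict

def generality(patent, edges, patent_class):
    """Share of a patent's incoming citations that come from other classes."""
    citing_classes = [patent_class[src] for src, dst in edges if dst == patent]
    if not citing_classes:
        return None  # undefined for patents that were never cited
    own = patent_class[patent]
    return sum(c != own for c in citing_classes) / len(citing_classes)

def mean_score_by_year(scores, filing_year):
    """Average the defined (non-None) scores over the patents filed in each year."""
    by_year = defaultdict(list)
    for patent, score in scores.items():
        if score is not None:
            by_year[filing_year[patent]].append(score)
    return {year: sum(v) / len(v) for year, v in sorted(by_year.items())}
```

Scanning the whole edge list per patent is fine for a toy example; on millions of edges the citing classes would be grouped by cited patent in a single pass first.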


From the above graph, it appears that patents are becoming more and more specific
in nature, attracting a smaller percentage of citations from classes other than their own. This
is also supported by the graph in Figure 6, which depicts the variation of the self-citation percentage
over time. As one would expect after seeing the above graph, the percentage of citations that occur within the
same class increases over time, creating a more and more assortative network as time goes by.

7.2 Originality

The counterpart of generality is originality. The originality score shows how widely a patent
cites across different classes. This value is calculated by dividing the number of citations a patent makes to classes other
than its own by the total number of citations made by the patent. Intuitively, this score should decrease
over time, since patents are becoming more specific, as shown by the graph for generality. However,
as we can see in Figure 5 below, the originality score actually increases, albeit not at a high rate.

Figure 5: Variation of originality score over the years
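
A single-pass sketch of the originality ratio described above, again assuming a `(citing, cited)` edge list and a hypothetical `patent_class` mapping. Patents that cite nothing have no score and are simply absent from the result.

```python
from collections import defaultdict

def originality_scores(edges, patent_class):
    """Return {patent: share of its outgoing citations that go to other classes}."""
    total = defaultdict(int)  # citations made by each patent
    cross = defaultdict(int)  # of those, citations to a different class
    for citing, cited in edges:
        total[citing] += 1
        if patent_class[citing] != patent_class[cited]:
            cross[citing] += 1
    return {patent: cross[patent] / total[patent] for patent in total}
```

The scores can then be averaged per filing year in the same way as any other per-patent measure to obtain a curve over time.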


While we are not yet certain why this score is increasing, we can still deduce that the
diversity of the classes that patents cite is increasing very slowly
as the years go by.

Self-Citations

The next measure studied was the self-citation percentage. This is the percentage of citations
that are made within a class, computed with respect to that class. The value is calculated for a class by dividing the
total number of edges that start and end within the class by the total number of edges that have at least
one end in that class. This gives the percentage of self-citations. The average self-citation
percentage is calculated for all the patents in a particular year, and this value is plotted against the year.
As discussed above, Figure 6 shows how the percentage of self-citations increases over
time. This is consistent with the observation that patents are seemingly becoming more specific and
are attracting fewer citations from patents of other classes, so that more citations come
from patents of the same class.

Figure 6: Variation of self-citation percentage with respect to time
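
The class-level ratio described above can be sketched as follows, with a hypothetical `patent_class` mapping. Only the per-class percentage is shown; the yearly averaging is left out because the text does not fully specify it.

```python
from collections import Counter

def self_citation_percentage(edges, patent_class):
    """Per class: 100 * (edges inside the class) / (edges touching the class)."""
    internal = Counter()  # edges that start and end in the class
    touching = Counter()  # edges with at least one end in the class
    for citing, cited in edges:
        a, b = patent_class[citing], patent_class[cited]
        if a == b:
            internal[a] += 1
            touching[a] += 1
        else:
            touching[a] += 1
            touching[b] += 1
    return {c: 100.0 * internal[c] / touching[c] for c in touching}
```

An edge between two different classes counts once in the denominator of each class, while a within-class edge counts once in both the numerator and the denominator of its class.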

Hypothesis for network graph

Considering the size of the network, it was computationally infeasible for us to visualize the entire network.
However, we can make fair assumptions from the graphs observed so far. To start with, there exists
a community structure, where each class is a community. These communities show a high
level of assortativity. This can be seen in the reciprocity graphs in Figure 3, where the highest
number of citations to a class comes from the class itself. Hence, if we were to describe the network
with a Stochastic Block Model (SBM) [6], the diagonal elements of the block matrix would have higher values than the
off-diagonal elements.
The second point to note is that these diagonal values keep increasing. This can be deduced
from the fact that the percentage of self-citations keeps increasing over time. Hence, the network
becomes more and more assortative over time.
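
One simple empirical stand-in for the block matrix is a class-by-class mixing matrix, sketched below with a hypothetical `patent_class` mapping. This is not a fitted SBM. The summary statistic is Newman's assortativity coefficient for categorical attributes, a measure swapped in here for illustration and not used in the analysis above; it equals 1 when every citation stays within its class.

```python
def mixing_matrix(edges, patent_class):
    """Return (classes, e), where e[i][j] is the fraction of all citations
    that go from class i to class j. Requires at least one edge."""
    classes = sorted(set(patent_class.values()))
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    counts = [[0] * n for _ in range(n)]
    for citing, cited in edges:
        counts[index[patent_class[citing]]][index[patent_class[cited]]] += 1
    total = sum(map(sum, counts))
    return classes, [[v / total for v in row] for row in counts]

def assortativity(e):
    """r = (sum_i e_ii - sum_i a_i*b_i) / (1 - sum_i a_i*b_i).

    Undefined (division by zero) when every edge lies within one single class.
    """
    n = len(e)
    a = [sum(e[i]) for i in range(n)]                       # share of edges leaving class i
    b = [sum(e[i][j] for i in range(n)) for j in range(n)]  # share of edges entering class j
    trace = sum(e[i][i] for i in range(n))
    ab = sum(a[i] * b[i] for i in range(n))
    return (trace - ab) / (1 - ab)
```

Computing the matrix on citations made up to successive years would show whether the diagonal entries grow over time, as hypothesized.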

References
[1] Hall, B. H., Jaffe, A. B., & Trajtenberg, M. (2001). The NBER Patent Citations Data File:
Lessons, Insights, and Methodological Tools. UC Berkeley, Brandeis University, Tel Aviv University, and
NBER. August 2001.
[2] Newman, M. E. J. (2001). Clustering and preferential attachment in growing networks. Phys. Rev. E 64,
025102.
[3] Barabasi, A.-L., & Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286(5439),
509-512.
[4] Barabasi, A.-L., & Bonabeau, E. (2003). Scale-free networks. Scientific American, May, 60-69.
[5] Lai, R., D'Amour, A., Yu, A., Sun, Y., & Fleming, L. (2013). Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database (1975 - 2010).
http://hdl.handle.net/1902.1/15705 UNF:5:RqsI3LsQEYLHkkg5jG/jRg== The Harvard Dataverse Network [Distributor] V5 [Version].
[6] Karrer, B., & Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys.
Rev. E 83, 016107.
