Conformal Prediction
Albin Bååw

Examiner
Henrik Boström <bostromh@kth.se>
KTH Royal Institute of Technology, Stockholm, Sweden

Supervisors
Paris Carbone <parisc@kth.se>
KTH Royal Institute of Technology, Stockholm, Sweden
Sonia Horchidan
KTH Royal Institute of Technology, Stockholm, Sweden
Abstract
We have seen a rapid increase in the development of deep learning algorithms in recent
decades. However, while these algorithms have unlocked new business areas and led
to great development in many fields, they are usually limited to Euclidean data. Researchers are increasingly finding that the data used in many real-life applications is better represented as graphs. Examples include high-risk domains such as finding the side effects of combining medicines using a protein-protein network. In high-risk domains, there is a need for trust and transparency in the results
work. In high-risk domains, there is a need for trust and transparency in the results
returned by deep learning algorithms. In this work, we explore how we can quantify
uncertainty in Graph Neural Network predictions using conventional methods for con-
formal prediction as well as novel methods exploiting graph connectivity information.
We evaluate the methods on both static and dynamic graphs and find that neither of
the novel methods offers any clear benefits over the conventional methods. However,
we see indications that using the graph connectivity information can lead to more effi-
cient conformal predictors and a lower prediction latency than the conventional meth-
ods on large data sets. We propose that future work extend the research on using the
connectivity information, specifically the node embeddings, to boost the performance
of conformal predictors on graphs.
Acknowledgements
There are a few people who have contributed to this work with their knowledge, which
I am very grateful for. One of these people is Henrik Boström, who acted as examiner
for this project. Henrik has a background in Conformal prediction and has guided this work with his insight into the field, providing me with both direction and resources to bring the work forward.
Another is Paris Carbone, one of the two supervisors for the project. Paris contributed
with feedback throughout the project and helped form and understand the experimen-
tal results.
Finally, I want to give a big "thank you" to Sonia Horchidan, the second supervisor. Sonia has been heavily involved in the project and has provided feedback on every aspect of the report. Sonia has, at times, gone above and beyond to help bring this project into safe harbor. I am incredibly grateful that she adjusted her schedule to help me out, and that she has answered my endless Slack messages from early morning to late night.
I would also like to thank the Oracle Corporation for lending me their servers to run
the experiments. With the virtual machine offered by Oracle Cloud, I have been able
to make many more iterations on the algorithms presented in this project than I would
have been able to if I were only using my local machine.
Contents
Acronyms ix
List of Algorithms ix
List of Figures x
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Benefits, Ethics and Sustainability . . . . . . . . . . . . . . . . . . . . . 4
1.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.7 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.8 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Theoretical Background 6
2.1 Static and dynamic graphs . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Graph Convolutional Networks . . . . . . . . . . . . . . . . . . . 10
2.2.2 GraphSAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Conformal predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Class-Conditional Conformal Predictors . . . . . . . . . . . . . . 17
2.3.2 Attribute-Conditional Conformal Predictors . . . . . . . . . . . . 18
2.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Countering the distribution shift in conformal prediction . . . . . 18
3 Method 22
3.1 Choice of research method . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Application of research method . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Overview of the case studies . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Constructing the graphs . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.4 Models implementation . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Quality assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Results 44
5.1 Static graphs study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.1 Cora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.2 PubMed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.3 OGB Arxiv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Dynamic graphs study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.1 Bitcoin Elliptic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 OGB Arxiv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Discussion 62
6.1 The conventional conformal predictors . . . . . . . . . . . . . . . . . . . 62
6.2 The novel conformal predictors . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 The effect of resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7 Conclusions 68
References 70
List of Figures
Theoretical Background 6
2.1.1 Example of a social network (1) . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Example of a social network (2) . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Example of a social network (3) . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Example of variations of a graph . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Example of a convolution on an image . . . . . . . . . . . . . . . . . . 11
2.2.3 Message passing in a GCN . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Overview of the GraphSAGE algorithm . . . . . . . . . . . . . . . . . . 14
Method 22
3.2.1 Class distributions for the Cora and PubMed data sets. . . . . . . . . . 26
3.2.2 Class distribution for the OGB Arxiv data . . . . . . . . . . . . . . . . . 27
3.2.3 Class distribution for the Bitcoin Elliptic data set . . . . . . . . . . . . 28
3.2.4 Example of a single GraphSAGE convolution . . . . . . . . . . . . . . 32
Results 44
5.2.1 GraphSAGE - Accuracy (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . 48
5.2.2 CP - Coverage (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 CP - Average prediction set size (Bitcoin Elliptic) . . . . . . . . . . . . 50
5.2.4 CP - Singleton predictions (Bitcoin Elliptic) . . . . . . . . . . . . . . . 50
5.2.5 CP - Empty predictions (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . 51
5.2.6 CP - Prediction latency (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . 51
List of Tables
Method 22
3.2.1 Data sets used in the case study on static graphs. . . . . . . . . . . . . 25
3.2.2 Data sets used in the case study on dynamic graphs. . . . . . . . . . . 25
3.4.1 Hyperparameters used in the experiments . . . . . . . . . . . . . . . . 38
Results 44
5.1.1 GraphSAGE performance (Cora) . . . . . . . . . . . . . . . . . . . . . 45
5.1.2 CP performance (Cora) . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.3 GraphSAGE performance (PubMed) . . . . . . . . . . . . . . . . . . . 46
5.1.4 CP performance (PubMed) . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.5 GraphSAGE performance (OGB Arxiv) . . . . . . . . . . . . . . . . . . 47
5.1.6 CP performance (OGB Arxiv) . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 1
Introduction
In this thesis, we explore how Conformal Prediction (CP) can be applied to Graph Neural Networks (GNNs) to improve the reliability of results from node classification tasks. We look at both static graphs and, the more realistic scenario [21], dynamic graphs.
The results include the performance of conventional CP methods as well as our own
novel methods exploiting the graph connectivity information.
1.1 Background
Graph Neural Networks (GNNs) are an emerging deep learning technology designed to handle highly interconnected graph data. Conventional neural networks are limited to fixed-form data where each data point exists in Euclidean space. Graph data, by contrast, has neither a fixed form nor data points that exist in Euclidean space; hence, conventional neural networks cannot process it. GNNs resolve this issue by generating, from a graph or a component of a graph, a low-dimensional vector embedding that a neural network can process [37].
There are four types of GNNs available today [37]. The type we will use in this thesis belongs to the family of Graph Convolutional Networks (GCNs). Generally speaking,
GCNs are a type of neural network architecture that can leverage the graph structure
and aggregate node information from the neighborhoods in a convolutional fashion.
Furthermore, GCNs have a great expressive power to learn the graph representations
and have achieved superior performance in various tasks and applications.
Conformal Prediction is a framework for quantifying the uncertainty of machine learning algorithms on both classification and regression tasks. CP replaces the
singular class prediction in machine learning algorithms with a prediction set that in-
cludes the true class with a predefined confidence level [27]. There are several varia-
tions of CP; one of those is Class-Conditional Conformal Prediction, which provides
a valid predictor for each class label and is used to deal with imbalanced data sets
[28].
1.2 Problem
The application of GNNs in high-risk domains is increasing, and common tasks revolve
around medicinal applications such as predicting the side effects of mixing two drugs.
Therefore, it is crucial to quantify the uncertainty in the algorithms' output to invoke trust and transparency [26].
As the field is still in its infancy, there have been few attempts at quantifying uncertainty
in GNNs. Most of the research we have found focuses on applying Bayesian learning to
GNNs to boost prediction accuracy [12, 30] or quantifying prediction reliability using
Bayesian concepts [10]. However, one recent study applied conformal prediction to
GNNs [38] but without a more profound analysis or accounting for the graph struc-
ture.
A cornerstone in CP is the validity of the predictor, which ensures that predictions are
within a specified confidence level. The validity depends on the underlying distribution
of the data set. If the distribution of the calibration data used to tune the predictor
differs from the distribution of the test data, the validity will likely not hold.
In the case of dynamic graphs, the distribution of the data set is constantly changing.
Distribution shifts may cause a coverage gap in conformal predictors and potentially
render the predictor invalid. For that reason, a conformal predictor used on a dynamic
graph needs to take the coverage gap into account to maintain validity. Several papers
have been written on the topic of retraining conformal predictors when the distribu-
tion shift is too significant, such as [34]. Retraining is a sure way of maintaining valid
conformal predictors but can also be quite computationally costly. However, just
recently, a study was conducted on how to reduce the coverage gap for predictors in
the context of time-series data without the need for retraining [2]. The study proposes
a method for putting an upper bound on the coverage gap that we will describe further
in Section 2.4.1.
Furthermore, a study showed that GCNs have a bias towards high-degree nodes, i.e., they are likely to provide more accurate predictions on high-degree nodes than on low-degree nodes [29]. This naturally affects any conformal predictor that relies on the output of a GCN, and it needs to be investigated whether this also affects the predictor's validity.
1.3 Purpose
We have seen a rapidly increasing development of deep learning algorithms in recent
decades. However, while these algorithms have unlocked new business areas and led
to great development in many fields, they are usually limited to Euclidean data [37].
Researchers are increasingly finding that the data used in many real-life applications is better represented as graphs [21]. Examples include finding side effects
when combining medicines using a protein-protein network [37], detecting fraudulent
transactions in financial networks, and predicting friendships in social networks.
The purpose of this thesis is to further the research on making GNNs viable in real-
life applications. We believe that a crucial part is providing prediction reliability in
high-risk domains, which is where this thesis's focus lies. Thus, we try to answer the following research question: how can we quantify the uncertainty in Graph Neural Network predictions using conformal prediction, on both static and dynamic graphs?
1.4 Goal
The goal of this thesis is to propose a method for applying CP to GNNs and raise aware-
ness about the importance of measuring uncertainty in the output of GNNs. We can break this goal down into the following sub-goals:
1. Exploration of data sets that have previously been used for similar tasks.
2. Training of a GNN on the data sets and comparing its performance with bench-
marks found in the literature.
We also expect that the project by itself is going to be interesting for GNN and Conformal prediction research groups, as GNNs are still in early development and there is not much written in the area.
1.5 Benefits, Ethics and Sustainability
This project's ethics and sustainability considerations primarily concern the carbon
emissions released as part of training the deep learning algorithms. A recent study has
shown that carbon emissions can vary as much as 10-100x depending on how you set
up your algorithm [22]. By evaluating the performance of our proposed methods, we hope to understand not only whether the algorithms are computationally viable but also their climate impact, as the two should be correlated.
1.6 Methodology
This thesis is grounded in the realist’s philosophical assumptions and uses the funda-
mental research method [5] to extract quantitative and qualitative results. The quali-
tative results come in the form of novel methods for quantifying uncertainty in graphs
based on CP and other related work. The quantitative results consist of a comparison of the performance of the conventional and the novel conformal predictors in our experiments.
1.7 Delimitations
There is a lot that we can explore when it comes to graphs and CP, and in this work, we only address a relatively small share of the possible questions.
In the thesis, we design several novel methods for leveraging graph connectivity in-
formation to make conformal predictions. There is a lot of connectivity information
that could be used, ranging from basic properties such as the node degree to complex
centrality measures. However, in this work, we solely focus on using the node degree
and the embeddings of nodes, as we have seen indications in the literature that these
could be helpful when making predictions.
Studying conformal predictors on dynamic graphs could merit several research stud-
ies just by itself. For example, we could look at the effect of retraining the conformal
predictors. However, the main focus of this work is to avoid the extra cost of retraining,
and we therefore refrain from using retraining in our experiments.
Furthermore, there are different definitions of what a dynamic graph is; in one of them, the nodes are streamed one by one to the GNN. However, this is not something we will consider in this paper; instead, we will assume that we have access to all static snapshots of the graph at all times, as described in Section 2.1. Also,
the dynamic graphs we use in our experiments are only growing, never losing nodes or
edges.
1.8 Outline
We have organized the thesis as follows: Chapter 2 gives the reader a theoretical back-
ground to GNNs, conformal prediction, and related work. Chapter 3 discusses the
method used in the thesis, while Chapter 4 introduces the novel conformal predictors.
Finally, Chapter 5 provides the results from our experiments, Chapter 6 discusses the
results and Chapter 7 presents our conclusions.
Chapter 2
Theoretical Background
In this thesis, two concepts are tied together: Graph Neural Networks and Conformal
prediction. The former is a deep learning algorithm to draw insights from highly con-
nected graph data, while the latter is used to evaluate the quality of those insights. In
this chapter, we will dig deeper into how each functions.
2.1 Static and dynamic graphs
A graph G = (V, E) consists of a set of nodes V connected by a set of edges E. A graph can either be directed, meaning that each edge (u, v) implies that node u has a
unidirectional connection to node v, or the graph can be undirected, meaning that each
edge is bidirectional. An example of a unidirectional edge is the edge indicating that
Bob is the father of Charlie; this edge has a single direction as it would be impossible
for Charlie to be the father of Bob. However, for Bob and Alice to be friends, each of
them must be a friend of the other, and thus they share a bidirectional edge as seen in
Figure 2.1.1.
On a similar topic, we also define two other types of graphs: homogeneous and hetero-
geneous graphs. In homogeneous graphs, nodes and edges can only be a single type,
for instance, nodes are people, and edges indicate friendships. In the heterogeneous
graph, nodes and edges can take on different types. For example, nodes could be both
people and pets, and edges could be any relationship between them, such as friend-
ship, family, or owner. In this work, we will only consider homogeneous graphs, but
GNNs like the GCN can often be reconfigured to treat heterogeneous graphs as well
[15].

Figure 2.1.1: Example of a social network. The network captures different relationships between people and is a directed heterogeneous graph.
The degree of a node n is the number of edges connecting n to the other nodes in the
graph. In our example above, Bob has a degree of 5, counting the directed edges connecting him to Alice, Charlie, and David. We can further specify the degree for directed graphs; here, we split the degree into in-degree and out-degree, which describe the incoming and outgoing edges of n, respectively. In our example, Bob has an out-degree of 3 but only an in-degree of 2.
Furthermore, the graph can also have an accompanying set of features describing the
nodes or edges in the graph. For example, Bob likes football (a node feature), and Bob and
Alice became friends through work (an edge feature).
While we often describe graphs as static, where the nodes and edges remain the same, most real-life graphs tend to be of a dynamic nature [21], where the set of nodes and edges might evolve over time. For instance, in Figure 2.1.2, Fiona has entered our graph and has developed friendships with the earlier people in the graph. The graph can also evolve in other ways: node features may change, or nodes might disappear.
Figure 2.1.2: Example of a growing social network. The social network in Figure 2.1.1 has grown
to also include Fiona.
We can represent a dynamic graph as a sequence of static graphs: G = {G_1, ..., G_n}, where G_n = (V_n, E_n) is the set of nodes and edges for snapshot n [7].
We see an example of this in our examples of the social network above, where Figure
2.1.1 and Figure 2.1.2 shows two snapshots of the same graph. By treating the graph
as a sequence, we can study each static snapshot of the dynamic graph using the same
techniques we would use on a static graph. There are other methods for working with
dynamic graphs, for example, by streaming the graph as incoming nodes and edges that
are only seen once. However, in this study, we will treat the graph as an accumulating
collection of nodes and edges over time, where we always have access to all the nodes
and edges within a specific snapshot.
As the graph evolves, we are bound to see previously unseen vertices, edges, and maybe
even classes in each snapshot. Due to this, a distribution shift may happen, i.e., the
property distribution among the nodes is widely different between a former and a latter
snapshot. The shift poses a problem for deep learning algorithms, which assume that
the distribution is static during training. If the distribution of the test data varies a
lot from the training data’s distribution, we can expect poor results during prediction.
We can relate this to our example of the social network above: Figure 2.1.1 and Figure 2.1.3 show two very different snapshots of the same dynamic graph. If we train our deep learning algorithm on the graph in Figure 2.1.1, it is unlikely that it will perform well on
the graph in Figure 2.1.3. For this reason, algorithms on temporal data usually need to
consider the distribution shift and re-train the deep learning algorithm when the shift is
too large [36]. One method for countering the distribution shift is Continual learning,
also called life-long learning. Continual learning is commonly used on streamed data,
and the goal is to continuously train on incoming nodes without making use of previous
data [35]. However, in this thesis, we will not use any methods for re-training the
learning algorithm as it can be expensive, and we are trying to avoid it by designing
algorithms that maintain performance over time.
Figure 2.1.3: Example of a growing social network. The social network in Figure 2.1.2 has grown
to also include George and Herb.
2.2 Graph Neural Networks
Graph data is complex for two main reasons: (1) it is not Euclidean, which means that there is an unlimited number of ways we can traverse a graph, and (2) the connections
between nodes are as important as the data held by the nodes. In Figure 2.2.1, we
see the same graph from two different perspectives. Conventional deep learning algo-
rithms are restricted to the Euclidean space and the order of the data points. Therefore,
they would learn multiple separate representations for the same node depending on the
order the nodes are seen [14].
To avoid the issue of the same node being interpreted differently depending on when
we see the node, we can instead embed the node in the Euclidean space. There are
multiple ways to create a node embedding, but the one criteria that most embedding
algorithms share is that two similar nodes should be close to each other in the embed-
ding space [14]. The similarity can be defined as closeness in the graph or that nodes
share similar structural properties, like a similar node degree.

Figure 2.2.1: Example of how a graph can be ordered in two different ways depending on where we start.
In the early days, the graph representation field was mainly dominated by various ma-
trix factorization methods for generating graph and node embeddings. Examples of
these methods are node2vec and random walks [4]. The embeddings were then used
as input to a deep learning algorithm, like a Multilayer Perceptron. However, this
came with an issue: the embedding methods are commonly transductive, and we need to learn entirely new embeddings as soon as the graph changes even slightly [14]. This
quickly becomes expensive in the case of dynamic graphs that are constantly changing.
To get around this issue, a new type of deep learning method was developed to find a
cheaper inductive method for embedding graphs and their components: GNNs [14].
GNNs use a message-passing framework to dynamically create embeddings from the
nodes in the graph based on the node’s neighborhood. Typical tasks for GNNs include
node property prediction, link prediction, and graph classification.
GNNs are a broad concept, and there exist several variants. We can divide the main
ones into four categories [37]: (1) Recurrent Graph Neural Networks, (2) Convolu-
tional Graph Neural Networks (also known as GCN), (3) Graph Autoencoders, and (4)
Spatial-Temporal Graph Neural Networks. In this thesis, we only discuss GCNs.
2.2.1 Graph Convolutional Networks
The Graph Convolutional Network (GCN) was first developed by Kipf et al. [11] and is one of the earliest GNNs, with many other GNNs building upon it. In summary, the
GCN is a deep learning algorithm that leverages the graph structure and aggregates
information from a node’s neighborhood in a convolutional fashion. GCNs tend to
have great expressive power and are therefore used in a wide range of tasks.
Figure 2.2.2: Example of a convolution in an image (Euclidean data) and a graph. For the
convolution on the image, we retrieve a weighted average of the red pixel’s neighboring pixels’
features. The blue edges indicate the neighboring pixels. Similarly, we retrieve a weighted
average from the red node’s neighbors in the graph convolution. The figure is adapted from
[37].
The idea behind GCNs comes from the classic Convolutional Neural Network (CNN)
model, which is commonly used on grid-like data, such as images [39]. CNNs exploit
the basic properties of grid-like data to aggregate data and make predictions. For ex-
ample, if we want to make a prediction on an image, we know that the number of neigh-
boring pixels is always fixed, and we can decide the order in which we scan images (e.g.,
left to right, up to down). We use this information when we design how the CNN should
perform its convolutions on the image. In a convolution, the CNN aggregates the in-
formation from a neighborhood of pixels in the image and outputs a new super pixel
representing the neighborhood. This process is performed several times, and in the
end, we retrieve a low-dimensional embedding that can be used by another algorithm,
such as a Multilayer Perceptron. We can see how a convolution is performed in Figure
2.2.2 (left).
Unfortunately, the grid properties that we take advantage of in the CNN do not exist
in the context of graphs. For example, the number of neighboring nodes can vary, and
there is no set order in which we process a graph.
However, the convolution operation can be generalized to graph data. Instead of ag-
gregating pixels, we can learn a node representation by aggregating the target node’s
features and the features of the neighboring nodes. The new node representation can
then be used to create a representation of the next node. This way, we can quickly ag-
gregate the information from a large neighborhood or whole graphs by stacking layers.
An overview of this process can be seen in Figure 2.2.3.

Figure 2.2.3: Message passing in a GCN. Here we want to make a prediction for node A by aggregating information from A's neighbors. In Layer 2, target node A aggregates input from its neighbors B, C, and D, and uses it to make a prediction. Before that, in Layer 1, B, C, and D have themselves aggregated the node features from their neighbors in Layer 0. The figure is adapted from [14].
In Algorithm 1, we see how the forward propagation is calculated. First, we set the
hidden node embedding for the 0-th layer to be the features of the nodes in the layer. Then we run several iterations of neighborhood aggregation. Next, we retrieve the hidden node embedding h_v^{(k)} for the k-th layer by taking a weighted average of the input from the nodes of the previous layer, summing it with the features of the nodes we are currently aggregating in, and passing it through an activation function. Finally, after reaching the final layer K, we use the hidden node embedding for the K-th layer to represent node v.
Algorithm 1: GCN forward propagation for a node v
1: h_v^{(0)} ← X_v
2: k ← 0
3: while k < K do
4:     h_v^{(k+1)} ← σ( W_k · Σ_{u∈N(v)} h_u^{(k)} / |N(v)| + B_k · h_v^{(k)} )
5:     k ← k + 1
6: z_v ← h_v^{(K)}
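To make line 4 concrete, the following is a minimal NumPy sketch of one mean-aggregation layer; the toy graph, the random weights W and B, and all names are our own illustrative assumptions, not the implementation used in this thesis.

```python
import numpy as np

def gcn_layer(h, adj, W, B):
    """One mean-aggregation update (line 4 of Algorithm 1):
    h_v <- sigma(W_k * sum_{u in N(v)} h_u / |N(v)| + B_k * h_v).

    h:    (num_nodes, d_in) current node embeddings
    adj:  (num_nodes, num_nodes) binary adjacency matrix
    W, B: (d_out, d_in) layer weights (random here, learned in practice)
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # |N(v)|, avoid div by 0
    neigh_mean = (adj @ h) / deg                      # average over neighbors
    return np.maximum(neigh_mean @ W.T + h @ B.T, 0)  # ReLU activation

# Toy example: three nodes on a path 0-1-2, 4-dim features, 2-dim output.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))                          # h_v^(0) = X_v
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W, B = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
print(gcn_layer(h0, adj, W, B).shape)                 # (3, 2)
```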
In 2020, Tang et al. conducted a study on degree-related biases in GCNs [29]. The study concluded that there is a bias towards high-degree nodes when making predictions using GCNs: GCNs often achieve higher accuracy on high-degree nodes than on lower-degree nodes. The bias was explained as being due to the fact that the same high-degree nodes occur more often during training, as other connected nodes poll them for information. Tang et al. used this information to build a degree-aware GCN and successfully used it to boost prediction accuracy in low-degree nodes. We explore this hypothesis when we design some of our own conformal
predictors.
2.2.2 GraphSAGE
The forward propagation of GraphSAGE is divided into three steps: (1) Sample the
nodes in the neighborhood of the target node, (2) aggregate the feature information
from the sampled nodes, and (3) use the aggregated information to perform the pre-
diction task. We can see an overview of the forward propagation in Figure 2.2.4.
Sampling nodes is an optional first step that we can use to reduce computational com-
plexity [6]. We can see an example of the sampling in Figure 2.2.4. We sample a pre-
defined number of nodes from each layer starting from the target node. The nodes
are randomly sampled with equal probability. After sampling nodes among the target
node’s neighbors, we sample nodes in the following layer by sampling neighbors from
the sampled nodes in the first step. This continues until we reach the predefined sample size.

Figure 2.2.4: Overview of the GraphSAGE algorithm with sampling. The goal is to predict node A's label. In the first step, we sample three nodes among node A's neighbors and two nodes from each of the sampled nodes' neighbors. In the second step, we aggregate the information similarly to the GCN. Finally, we predict node A's label using the hidden node embedding we have aggregated. The figure is adapted from [6].
In the second step, we aggregate the information from the sampled nodes using a simi-
lar algorithm to that of the GCN. The main difference is not how the node information
is propagated but how we calculate the hidden node embedding. We simply change
line 4 in Algorithm 1 to the following instead:

h_v^{(k+1)} ← σ( W_k · Concat( Aggregate({h_u^{(k)}, ∀u ∈ N(v)}), h_v^{(k)} ) )
where Concat represents the concatenation operation, which connects the neighbor’s
embedding and node v’s embedding, N (v) represents all the neighbors of node v, and
Aggregate is aggregating the features of all neighboring nodes. As we can see, the main
difference between the aggregation in GraphSAGE and GCN is that GraphSAGE con-
catenates the aggregated neighborhood vector with the node’s feature vector rather
than just summing the two. GraphSAGE also uses a more flexible feature aggrega-
tion function than the GCN. There are four feature aggregation methods for users to
choose from to cater to different scenarios, including Mean aggregator, GCN aggrega-
tor, LSTM aggregator, and Pooling aggregator. This makes the GraphSAGE algorithm
much more expressive than the GCN.
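As an illustration, a minimal NumPy sketch of this update with the Mean aggregator follows; the weight shape and all names are our own assumptions, and a real implementation would typically use a library such as PyTorch Geometric.

```python
import numpy as np

def sage_layer(h, adj, W):
    """GraphSAGE update with the Mean aggregator:
    h_v <- sigma(W_k * Concat(Aggregate({h_u : u in N(v)}), h_v)).

    W has shape (d_out, 2 * d_in): the aggregated neighborhood vector is
    concatenated with the node's own embedding instead of summed into it.
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    neigh = (adj @ h) / deg                       # Mean aggregation
    concat = np.concatenate([neigh, h], axis=1)   # (num_nodes, 2 * d_in)
    return np.maximum(concat @ W.T, 0)            # ReLU activation
```

Swapping the Mean aggregation for an LSTM or pooling function would change only the line computing `neigh`; the concatenation step stays the same.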
2.3 Conformal predictors
The conformal prediction algorithm uses a nonconformity measure A and a set of cal-
ibration data points Z. The nonconformity measure determines how similar an in-
coming node n is to the data points in the calibration set. n receives a nonconformity
score A(Z, n) that is compared to the nonconformity scores of the calibration set. If n’s
nonconformity score is lower than a predefined fraction of the scores from the other
data points, the label coupled with the nonconformity score is added to the prediction
set. The predefined fraction that we need to overcome is determined by the signifi-
cance level, which is equal to 1 minus the confidence level. For instance, if we set the
confidence level to 95% (0.95), the significance level would be 0.05.
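As a sketch of how this works in code, the fragment below assembles a prediction set from pre-computed nonconformity scores; it uses a simple, unsmoothed p-value and toy numbers, so it illustrates the idea rather than reproducing our exact implementation.

```python
import numpy as np

def prediction_set(cal_scores, test_scores, significance=0.2):
    """Inductive conformal prediction set for one test node.

    cal_scores:  nonconformity scores of the calibration set
    test_scores: one nonconformity score per candidate label
    A label enters the set if its p-value exceeds the significance level.
    """
    cal_scores = np.asarray(cal_scores)
    labels = []
    for label, score in enumerate(test_scores):
        # Fraction of calibration points at least as nonconforming.
        p_value = (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)
        if p_value > significance:
            labels.append(label)
    return labels

cal = np.array([0.10, 0.35, 0.20, 0.05, 0.50])      # toy calibration scores
print(prediction_set(cal, [0.08, 0.60, 0.30]))      # -> [0, 2]
```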
We deem a conformal predictor valid if the outputted prediction sets contain the target label at the specified confidence level, e.g., 95% of the times we make a prediction. The validity is a cornerstone safety criterion when designing conformal predictors.
Another vital measure when looking at the performance of conformal predictors is efficiency, which concerns the size of the prediction sets: the smaller the prediction sets, the more informative the predictor.
The validity guarantee of the conformal prediction algorithm comes from a core con-
cept that the algorithm relies on. The concept is that all data points in the calibration
set are exchangeable, which means they need to be independent and identically dis-
tributed (i.i.d.). The conformal prediction algorithm cannot promise that the predictor is valid if this does not hold. This is an issue when we apply conformal prediction to
time series. In time series, the underlying distribution of the data points may change
over time, and thus the data points are not i.i.d. [2]. This also holds for graphs that
evolve over time.
The original conformal prediction algorithm, sometimes called full conformal prediction or transductive conformal prediction, was based on the assumption that we used the training data as our calibration set. This means we need to retrain the underlying
model whenever we want to retrieve a prediction set. While the transductive confor-
mal predictor produces precise prediction sets, it is also computationally expensive.
Therefore, another algorithm called split conformal prediction or inductive confor-
mal prediction was developed to avoid the computational cost. The algorithm splits
the training set into a proper training set and a calibration set [20]. We train the
underlying model on the proper training set, and we use the calibration set when we
make predictions to calculate the prediction set. By splitting the training set, we do
not need to retrain the underlying model for every prediction, which reduces the com-
putational cost. However, we will also retrieve less precise prediction sets as the cali-
bration set will not be as finely-tuned to the underlying model. As inductive conformal
prediction is more fitting for real-life scenarios where we can not retrain before ev-
ery prediction, we use it in our experiments and refer to it as the Inductive Conformal
Predictor (ICP).
2.3.1 Class-Conditional Conformal Predictors
The Class-Conditional Conformal Predictor (CCCP) algorithm is very similar to the standard CP algorithm. As we can see in
Algorithm 3, the only change is that we add a new first step where we limit the examples
(a.k.a. the calibration set) to only include examples where y_i = y. This way, we retrieve a nonconformity score only for examples with label y, and we only compare α_n with those examples.
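A minimal sketch of this class-conditional filtering step, under the same simplified p-value convention as the sketch in Section 2.3; names and toy numbers are our own assumptions.

```python
import numpy as np

def cccp_set(cal_scores, cal_labels, test_scores, significance=0.2):
    """Class-conditional prediction set: the p-value for each candidate
    label y is computed only against calibration examples with y_i = y."""
    cal_scores, cal_labels = np.asarray(cal_scores), np.asarray(cal_labels)
    labels = []
    for y, score in enumerate(test_scores):
        scores_y = cal_scores[cal_labels == y]      # calibration subset for y
        p_value = (np.sum(scores_y >= score) + 1) / (len(scores_y) + 1)
        if p_value > significance:
            labels.append(y)
    return labels

cal_scores = [0.1, 0.4, 0.2, 0.6, 0.3]
cal_labels = [0, 0, 1, 1, 1]
print(cccp_set(cal_scores, cal_labels, test_scores=[0.5, 0.5]))  # -> [0, 1]
```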
2.3.2 Attribute-Conditional Conformal Predictors
Mondrian conformal predictors do not necessarily need to divide the data into cate-
gories based on the class labels, as described in the section above. Instead, we can also
choose to split the data based on some other property that produces non-overlapping
categories [31]. When we have a conformal predictor that splits the data based on an-
other property than the class label, we call it an Attribute-conditional Conformal pre-
diction. Besides using another property than the class label, the algorithm looks the
same as the Class-conditional Conformal Algorithm described in Algorithm 3. As we
can see in Algorithm 4, we only change the first step to split the calibration set based
on the category-forming j-th property of the prediction object.
2.4 Related work

2.4.1 Countering the distribution shift in conformal prediction
Barber et al. [2] observe that a conformal predictor may lose coverage when the distribution of the calibration set is too different from the distribution of the sample (the incoming data point we run our prediction on), i.e., when the calibration set fails the exchangeability requirement.
Based on this insight, Barber et al. lift the exchangeability requirement and propose the
following method for bounding the coverage gap in the prediction intervals produced
by a conformal predictor for the regression task:
Coverage gap ≤ ( Σ_{i=1}^{n} w_i · d_TV(Z, Z_i) ) / ( 1 + Σ_{i=1}^{n} w_i )    (2.2)
where d_TV(Z, Z_i) is the total variation distance between the data point Z_i and the other data points. d_TV(Z, Z_i) can be seen as a measure of the distribution shift between Z and Z_i. If the data is truly exchangeable, d_TV will be 0; otherwise, it will increase as the distribution shift grows larger. w is the set of weights we apply to the data points to reduce the coverage gap, with w_i ∈ [0, 1]. Hence, we need to pair a low weight w_i with a large d_TV(Z, Z_i) to retrieve a low coverage gap. Due to the way the equation is formu-
lated, it works well both when there is not a distribution shift (the coverage gap will be
0) and in the case where there is a shift and the coverage gap will be bounded.
Barber et al. also show how the method for bounding the coverage gap can be applied
to several common conformal prediction algorithms. In the following equation we can
see how the bound is applied when computing the prediction interval Ĉ(Xn+1 ) using
an inductive conformal predictor:
Ĉ(X_{n+1}) = μ̂(X_{n+1}) ± Q_{1−ϵ}( Σ_{i=1}^{n} w̃_i · α_i + w̃_{n+1} · α_{+∞} )    (2.3)

w̃_i = w_i / (w_1 + ... + w_n + 1) for i = 1, ..., n, and w̃_{n+1} = 1 / (w_1 + ... + w_n + 1)    (2.4)
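The quantile Q_{1−ϵ} in Equation 2.3 is taken over a weighted empirical distribution of the calibration scores, with the leftover mass w̃_{n+1} placed on +∞. The sketch below shows one simple way to compute such a weighted quantile; the cumulative-weight rule and all names are our own assumptions.

```python
import numpy as np

def weighted_quantile(scores, weights, level):
    """Level-quantile of a weighted empirical distribution of scores.

    weights are the normalized w~_i from Equation 2.4; the remaining
    mass w~_{n+1} sits on +infinity, as in Equation 2.3."""
    order = np.argsort(scores)
    scores, weights = np.asarray(scores)[order], np.asarray(weights)[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, level)   # first score reaching the level
    return np.inf if idx >= len(scores) else scores[idx]

w = np.array([1.0, 1.0, 0.2, 0.2])       # raw weights w_i
w_norm = w / (w.sum() + 1.0)             # w~_i, Equation 2.4
alphas = np.array([0.1, 0.4, 0.3, 0.9])  # calibration nonconformity scores
print(weighted_quantile(alphas, w_norm, level=0.6))  # -> 0.4
```

Note that if the weights are heavily down-weighted, the cumulative mass may never reach the level and the quantile becomes +∞, yielding an uninformative (infinitely wide) interval.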
While Barber et al. provide a robust mathematical framework for bounding the cov-
erage gap, one missing part is how we can apply the method in various scenarios. The
cornerstone of their algorithm is the weights that we set to the data points to lower the
effect that distant data points have on the prediction result. However, Barber et al. do
not provide a framework for deciding the weights. In their experiment on time series,
they base their weights on the closeness in time between the data points. While this
can work well for time series, it does not apply to any data set.
For the above reason, and because Barber et al.’s method is designed for the regression
task, we can not directly apply these learnings to our work. However, we take inspi-
ration from Barber et al.’s method when we design our novel conformal predictors in
Chapter 4.
Like Barber et al.’s method, spatial conformal prediction is mainly designed for the
regression task and is therefore not directly applicable to our work. However, we take
inspiration from spatial conformal prediction when designing the Embedding distance
Weighted Conformal Predictor.
In our research, we have only been able to find a few cases where conformal prediction has been applied to GNNs. In most of these studies, including a study from 2020 [38], conformal prediction was applied to GNNs but without a more profound analysis of the CP method itself.
Conformal prediction is not the only way to quantify uncertainty in predictions. Most
of the research we have found on quantifying uncertainty in GNNs is based on Bayesian
learning, such as [10, 12, 30], where Bayesian concepts are leveraged to quantify pre-
diction reliability and boost the accuracy of predictors.
Another method used to quantify prediction reliability is Venn prediction [31]. Venn
predictors work similarly to CCCPs: we divide the old objects into categories, somehow
classify the new object into one of the categories, and then use the frequencies of labels
in the chosen category as probabilities for the new object’s label.
Chapter 3
Method
This chapter discusses the methodology used when conducting our research and de-
scribes our experiments. We will start by continuing our discussion from Section 1.6
about the research method we chose for this thesis. After that, we will describe in detail
how we have applied the research method in our experiments.
3.1 Choice of research method
When deciding on the research method for this thesis, we took advice from Anne Håkansson's paper on research methodologies applied to software engineering [5]. According to Håkansson, most studies are either of a quantitative or a qualitative nature; however, some belong in between, and so does this thesis.
We considered using a purely quantitative approach for this work. In that case, we
would have only applied existing methods for quantifying uncertainty in graph pre-
dictions and measuring their performance over many data sets. However, this brings
two caveats: (1) we have not been able to find methods that use the graph connectivity
information, and (2) we have not been able to find enough data sets to bring statistical
significance to the results. On the other hand, using a purely qualitative approach, such
as performing a survey on existing approaches, would not bring much value either, as
there is limited work on quantifying uncertainty in graph predictions. So instead, this
work has to be a mix of quantitative and qualitative research to bring significant results.
In our experiments evaluating the novel methods, we retrieve quantitative results in
the form of performance metrics. By exploring the metrics, we observe patterns in the
data from which we can contrive qualitative results that can guide further research in
the area.
Based on the above, we saw two possible research methods for this study: (1) the ap-
plied research method, and (2) the fundamental research method [5]. We can use
the applied research method to answer specific questions originating from previous
research. However, as we described above, there is limited research on applying con-
formal prediction to GNNs, and therefore there are no questions from previous research to answer. The research question that we try to answer in this thesis is also open-ended,
making the applied research method even more unsuitable. Instead, we decided on the
fundamental research method, a curiosity-driven method used to generate new ideas
and theories. This suits our work well as we propose novel methods for leveraging the
graph connectivity information to produce valid and efficient conformal predictors for
graph predictions. The novel methods originate from our own ideas, but they are based
on theories and inspired by other research.
3.2 Application of research method

3.2.1 Overview of the case studies
We conducted two case studies in this thesis, one on static graphs and another on dy-
namically evolving graphs.
In the case study on static graphs, we run a single experiment on several static-graph data sets. We train a GraphSAGE classifier in the experiment and set up several
conformal predictors using conventional and novel algorithms. We then capture the
evaluation metrics over several rounds, and between every run, we resample the nodes
used for testing and calibrating the conformal predictors. After we have captured the
evaluation metrics, we plot the metrics, as can be seen in Chapter 5, and observe any phenomena in the performance of the predictors.
In the second case study, we focused on methods for mitigating the coverage gap in
conformal predictors caused by the distribution shift in dynamic graphs. The goal of
the experiment was to evaluate if any of our CP methods could avoid the need for re-
training the GraphSAGE classifier and re-calibrating the conformal predictors when-
ever the distribution changes.
3.2.2 Data sources
In the case studies, we used five public data sets for the classification task to evaluate
the performance of our proposed methods. In the case study on static graphs, we used
the three data sets listed in Table 3.2.1, and we used the two data sets listed in Table
3.2.2 for the study on dynamic graphs. As we can see, the OGB Arxiv data set is part of
both lists. We can usually treat dynamic graphs as static graphs by only considering a
single snapshot, such as the final snapshot, which is why we can use the OGB Arxiv data
set for both case studies. However, we could not treat a static graph as a dynamic graph
in our experiments, as we rely on a temporal feature to create the snapshots.
Table 3.2.1: Data sets used in the case study on static graphs.
Table 3.2.2: Data sets used in the case study on dynamic graphs.
Cora and PubMed
The Cora and PubMed data sets were both used in the case study on static graphs. The
data sets belong to the Planetoid data sets [18] and are commonly used in research on
GNNs, such as the paper that introduced the GCN [11]. In that paper, they measured
an accuracy score of 81.5% for the Cora data set and 79.0% for PubMed.
The Cora and PubMed data sets both represent citation networks. Each node repre-
sents a published paper, and an edge represents the citation of a paper by another
paper. The node features are a word vector representing the presence or absence of words in the paper. The task on both data sets is multiclass classification.
As we can see in Table 3.2.1, the Cora data set is relatively small in size yet complex
as each node has half as many features as there are nodes in total, and there are seven
possible classes. Due to the small size, we had to be cautious when splitting the data set
into a training, a calibration, and a test data set. We had to maintain a large enough
training data set to build an accurate classifier while keeping enough data for statis-
tically significant testing. While similar, the PubMed data set has fewer classes and about ten times more data points than the Cora data set. For this reason, we do not need to worry as much about the data set size when splitting it [11].
OGB Arxiv
The OGB Arxiv data set belongs to the Open Graph Benchmark collection of data sets
introduced in 2020 [9]. OGB Arxiv describes a citation network of papers published on
Arxiv.org. Each node is a paper, and edges represent citations between papers. Each
node also has a set of features: a word embedding of the paper's title and abstract.

Figure 3.2.1: Class distributions for the Cora and PubMed data sets.
The task coupled with the data set is to predict the category of each node. As the data set belongs to the Open Graph Benchmark collection, there exist benchmarks for common GNN algorithms, including GraphSAGE, whose benchmark is an accuracy score of 71.5% [9]. Beyond the feature set, each node also has a secondary feature called the node
year which is the year when the paper was published. We use this feature to treat the
data set as a dynamic graph. We used the data set in both case studies, on static and dynamic graphs. In the static case study, we use the final snapshot of the fully
evolved graph, and in the study on dynamic graphs, we use snapshots between 2010
and 2020. As we can see in Figure 3.2.2, the OGB Arxiv data set is a great example of
a data set that is affected by a distribution shift. Over time, we see several almost non-
existent classes grow to become the largest share of nodes. This made it an exciting
data set for our case study on dynamic graphs.
Bitcoin Elliptic
The other data set used in the case study on dynamic graphs is the Bitcoin Elliptic data
set. The data set was introduced by the Elliptic co. on the website Kaggle in 2019 [3].
The data set describes a Bitcoin transaction network, and the task is binary classifica-
tion, where we mainly want to find illicit transactions. The data set has more nodes
than the Planetoid data sets but a much lower average node degree (1.2). The feature
set of each node is also more varied and includes information about the transaction
and neighboring transactions. The fact that each node already has some information
aggregated from its neighbors might make the use of a GNN redundant, which can also be seen in our results.
Figure 3.2.2: Class distribution for the OGB Arxiv data at various time steps.
3.2.3 Constructing the graphs
For most of the data sets, we used pre-built data loaders from either PyTorch Geomet-
ric [24], or the Open Graph Benchmark collection [9] to access the data sets. Beyond
downloading the data set, the data loaders also helped us construct the graph repre-
senting the data set. However, in the case of the Elliptic data set, we had to download
the raw data ourselves from Kaggle [3] and construct the graph. Furthermore, to represent a graph as a dynamically evolving graph, we also had to construct the snapshots out of the graph. We will outline below how we constructed the Bitcoin Elliptic graph and created the snapshots for the dynamic graphs.

Figure 3.2.3: Class distribution for the Bitcoin Elliptic data set at various time steps. Class −1 is nodes with an unknown label, Class 0 is licit nodes, and Class 1 is illicit nodes.
To represent a graph used in a node property prediction task, we need three compo-
nents: (1) a list of all the edges connecting nodes, (2) the node features, and (3) the
target labels for the classification task. PyTorch Geometric [24] provides a Data object
that we used to gather this information to represent a graph. To construct the graph,
we had to retrieve the parts listed above and construct the Data object.
The Data object expects nodes to use incremental IDs starting from 0, but each node
was assigned a random ID in the raw data. Hence, to make the raw data compatible
with the Data object, we had to reassign new IDs to all nodes and update the edge list
accordingly.
Beyond reassigning the IDs and updating the edge list, we had to clean the node fea-
ture matrix. The node features contain some features in the raw data set that should
be hidden to the GraphSAGE model, so it does not use them when making predictions.
The features that we removed were the node ID (we do not need it to access the node
features) and the time step where the node appears. We kept the time steps in a sep-
arate list so that we could use them to construct the snapshots for the dynamic graph
later on.
Lastly, we constructed the list of target labels. In the raw data, the target label can take
on three values: unknown, 1 - illicit, and 2 - licit. We reindexed the target labels to
enable us to include the nodes with an unknown target label when making predictions
on labeled data and easily excluding them when calculating loss during training and
measuring performance in the evaluation. We assigned these new labels to the nodes:
-1 - unknown, 0 - licit, and 1 - illicit.
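As an illustration, the following is a minimal sketch of this construction; the raw IDs, the feature dimension, and the helper name are our own assumptions, not the exact code used in the project.

```python
import torch
from torch_geometric.data import Data

def build_graph(raw_ids, raw_edges, features, labels):
    """Build a PyTorch Geometric Data object from Elliptic-style raw data.

    raw_ids:   arbitrary node IDs as found in the raw data
    raw_edges: (src_id, dst_id) pairs using those raw IDs
    features, labels: per-node tensors, ordered like raw_ids
    """
    # Reassign incremental IDs starting from 0, as Data expects,
    # and rewrite the edge list accordingly.
    id_map = {raw: new for new, raw in enumerate(raw_ids)}
    edge_index = torch.tensor(
        [[id_map[s] for s, _ in raw_edges],
         [id_map[d] for _, d in raw_edges]], dtype=torch.long)
    return Data(x=features, edge_index=edge_index, y=labels)

# Toy usage with two hypothetical transactions and one edge between them.
data = build_graph(
    raw_ids=[230425980, 5530458],
    raw_edges=[(230425980, 5530458)],
    features=torch.randn(2, 165),   # feature dimension illustrative
    labels=torch.tensor([-1, 1]))   # -1 unknown, 1 illicit
print(data)
```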
For the case study on dynamic graphs, we had to turn the static graphs given by the
data loaders into dynamically evolving graphs. As described in Section 2.1, we can
represent a dynamic graph G as a series of snapshots G = {G1 , ..., Gn } where Gn is the
snapshot representing the dynamic graph at time step n. As we only use graphs that
are growing, this is easy to construct. For each snapshot, we need to remove the nodes
and edges that did not exist at that time step from the original static graph. By putting
the snapshots into an ordered list, we can now use them to represent a growing graph.
This step was required for both the OGB Arxiv and the Bitcoin Elliptic data sets.
For the OGB Arxiv data set, we could not create snapshots from all available time steps,
as the early time steps have too few nodes to build an accurate classifier. Thus, for the
Arxiv data set, we merged all time steps before 2011 into a single time step that we call
2010 and count this as the first time step of the graph.
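A minimal sketch of how a snapshot can be cut out of the full graph with PyTorch Geometric's subgraph utility; the per-node time-step tensor and all names are our own assumptions.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def snapshot(data, node_time, t):
    """Snapshot G_t of a growing graph: keep only the nodes (and the
    edges between them) that appeared at time step t or earlier."""
    mask = node_time <= t
    edge_index, _ = subgraph(mask, data.edge_index,
                             relabel_nodes=True, num_nodes=data.num_nodes)
    return Data(x=data.x[mask], edge_index=edge_index, y=data.y[mask])

# Toy usage: three nodes appearing at time steps 1, 1, and 2.
data = Data(x=torch.randn(3, 4),
            edge_index=torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]]),
            y=torch.tensor([0, 1, 0]))
node_time = torch.tensor([1, 1, 2])
snapshots = [snapshot(data, node_time, t) for t in (1, 2)]  # [G_1, G_2]
print(snapshots[0].num_nodes, snapshots[1].num_nodes)       # 2 3
```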
Before we could start our experiments, we also had to split each data set into three: (1)
a training data set for the GraphSAGE model, (2) a test set used in the evaluation, and (3) a calibration set for the conformal predictors.
We dedicated a large share of the data set for evaluation and to the calibration set. We
used 20% of the nodes for the test set and an equal amount for the calibration set, leav-
ing 60% of the nodes for the training of the GraphSAGE model. The large share of nodes
dedicated to testing was required to get statistically significant results for the Cora and
PubMed data set. The calibration set also required a large share for the conformal
predictors to perform well on these data sets. We sampled the nodes for the subsets
randomly to preserve the original distribution of the nodes in each data set.
When we split the data sets, we also had to decide how to handle the edges between
the subsets. For the training set, we removed all edges to the test and calibration sets
as we did not want the GraphSAGE model to learn the representation of these nodes.
This is a required step to keep the inductive abilities of the model [13]. However, for
the test and calibration set, we included edges to nodes in the other sets to aggregate
information from the other nodes in predictions but without, for example, counting
the training nodes in the evaluation.
In the case study on dynamic graphs, we only sampled training and calibration nodes
from the first snapshot of the graph. The only difference is for the conformal predic-
tors with resampling, which re-calibrate every snapshot and therefore require us to resample the calibration set. The re-sampling was done by randomly sampling 20% of
the nodes available in the snapshot.
3.2.4 Models implementation
In the experiments, we produce our results using both a GCN and several CP algo-
rithms. In this section, we describe how we implement these algorithms.
The GCN model we chose for our experiments is the GraphSAGE model described in
Section 2.2.2. We chose the GraphSAGE model as it is designed to generalize well on
new data and performs relatively well in relation to other GNN models on the OGB
Arxiv data set [9].
In our case studies, we build a classifier using the GraphSAGE algorithm described in
Section 2.2.2. We implement the GraphSAGE algorithm using the PyTorch Geometric
[24] framework. There are multiple parts to the implementation. In this section, we are
going to focus on describing how we implemented the model’s forward propagation
function, training, and how we make predictions.
The model’s forward function describes how we turn data into predictions. The forward
function consists of multiple layers. All but the final layer look as follows:

1. We perform a GraphSAGE convolution, aggregating the node's and its neighbors' hidden embeddings.

2. We batch normalize the resulting hidden node embedding.

3. If we are training the model, we perform dropout, where we leave out parts of the node's hidden embedding.
We still perform a GraphSAGE convolution on the data in the final layer. However, in-
stead of normalizing the data and performing dropout, we feed it to a Softmax function
which gives us the class probabilities. We also provide an option to return the node em-
bedding, the result of the last convolution, together with the class probabilities. This
allows us to construct the Embedding distance Weighted Conformal Predictor (EWCP)
later on.
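The sketch below condenses this forward pass using PyTorch Geometric's SAGEConv; the layer sizes, dropout rate, and the return_embedding flag are our own illustrative assumptions rather than the exact model configuration.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGEClassifier(torch.nn.Module):
    """All but the final layer: convolution, batch norm, dropout (in
    training). The final layer: convolution, then (log-)softmax, which
    pairs with the negative log-likelihood loss used in training."""

    def __init__(self, in_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [num_classes]
        self.convs = torch.nn.ModuleList(
            [SAGEConv(i, o) for i, o in zip(dims[:-1], dims[1:])])
        self.norms = torch.nn.ModuleList(
            [torch.nn.BatchNorm1d(d) for d in dims[1:-1]])

    def forward(self, x, edge_index, return_embedding=False):
        for conv, norm in zip(self.convs[:-1], self.norms):
            x = F.relu(norm(conv(x, edge_index)))
            x = F.dropout(x, p=0.5, training=self.training)
        emb = self.convs[-1](x, edge_index)   # final convolution = embedding
        out = F.log_softmax(emb, dim=-1)      # class (log-)probabilities
        return (out, emb) if return_embedding else out
```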
To train the GraphSAGE model, we feed data to the model, which will process it and
give back the class probabilities for each sample. Then, we compare the class prob-
abilities with the true class labels and retrieve the loss using PyTorch’s negative log-
likelihood loss function. Next, an optimizer uses the loss to update the model’s weights.
We used the Adam optimizer provided by PyTorch.
We make predictions using the model similarly to how we train it, except that we do not retrieve the loss or update the model's weights. Instead, we return the class probabilities, which we compare with the true labels of the test set to retrieve the evaluation metrics.
Figure 3.2.4: Example of a single GraphSAGE convolution. The target node A and its neighbors' node features are aggregated in the hidden embedding h_A. Then we retrieve the batch-normalized version ĥ_A. If we are training, we leave out one of the fields in ĥ_A by setting it to 0. Finally, we either forward the hidden embedding for A to the next layer, or we pass it to a Softmax function to make a class prediction.
We test both conventional and novel CP methods in the case studies. As we describe
in Section 2.3, the conformal prediction algorithm relies on having a nonconformity measure that tells us how similar an incoming data point is to previous data points. We only include classes in the prediction set with a low nonconformity score.
Hence, a common way to form a nonconformity measure is to use the error of the deep
learning algorithm’s prediction [19]. We calculate the error for the classification task
as follows:
α_i = 1 − ŷ_i
where α_i are the retrieved nonconformity scores for sample i, and ŷ_i are the class probabilities produced by the GraphSAGE classifier. We use this nonconformity measure
for all our conformal predictors. Instead, the CP methods vary in how they use the
nonconformity scores to retrieve the prediction set, which is what we will describe in
the sections below.
Conventional predictors We test two conventional CP methods: the ICP and the CCCP. The ICP is the traditional conformal predictor, and we use it as a baseline in our experiments to compare with the other methods.
The CCCP uses the implied class received from the GraphSAGE classifier to retrieve a
subset of the nonconformity scores in the calibration set, which is then used for predic-
tion. We use the CCCP in our experiments to investigate the relation between the node
class and the validity of the conformal predictor. The implied class from the Graph-
SAGE classifier is the class with the highest outputted class probability.
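The following sketch shows one possible reading of this description, where the calibration scores are partitioned by the calibration nodes' implied classes and an incoming node is compared only against the scores in its own implied-class category; the names and the p-value formulation are ours.

```python
import numpy as np

def cccp_prediction_set(cal_scores, cal_implied, test_probs, significance=0.05):
    # Mondrian category: the classifier's implied (most probable) class.
    implied = int(np.argmax(test_probs))
    category_scores = cal_scores[cal_implied == implied]
    prediction_set = []
    for c, p in enumerate(test_probs):
        p_value = (np.sum(category_scores >= (1.0 - p)) + 1) / (len(category_scores) + 1)
        if p_value > significance:
            prediction_set.append(c)
    return prediction_set
```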
Novel predictors We present a set of novel conformal predictors and the reason-
ing behind them in Chapter 4. The novel methods all build on the notion that graph
connectivity information can be leveraged to improve validity and efficiency. This is
a key aspect of our research, and we test this by including the novel methods in our
experiments.
The Node degree-Conditional Conformal Predictor (NCCP) is the first novel method,
which we use to test if we can achieve validity among nodes sharing a similar in-degree.
To use this method, we divide the calibration and test nodes into a fixed number of bins
depending on their node degree. Then, we only use nonconformity scores from the
sample's corresponding bin to retrieve the prediction set.
Another novel method that we test is the Node degree Weighted Conformal Predictor
(NWCP). With this method, we want to see if we can lower the coverage gap in
predictors affected by distribution shift by leveraging the node degree. Like the NCCP,
we base it on the in-degree. The NWCP takes in both the node degree and the
nonconformity score of the sample. We retrieve a normalized node degree distance
between the sample and the calibration set, and we use the distances as weights on the
calibration set's nonconformity scores when retrieving the prediction set. The distance
is calculated by first normalizing the node degrees d:

d̃ = d / max(d)

where d̃ is the normalized node degrees and max(d) is the maximum node degree
found in the calibration set and the sample node. We then calculate the normalized
node degree distance between the sample and each calibration data point:

distance_i = |d̃_sample − d̃_i|

where d̃_sample is the normalized node degree of the sample, and d̃_i is that of the ith
calibration data point.
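A sketch of the weight computation is given below; as described above, the normalized degree distances themselves serve as the weights that multiply the calibration scores in Algorithm 6, and the helper name is ours.

```python
import numpy as np

def nwcp_weights(cal_degrees, sample_degree):
    # Normalize by the maximum degree over the calibration set and the sample.
    degrees = np.append(np.asarray(cal_degrees, dtype=float), float(sample_degree))
    normalized = degrees / degrees.max()
    # Normalized node degree distance between the sample and each calibration node.
    return np.abs(normalized[:-1] - normalized[-1])

# weighted_scores = nwcp_weights(cal_degrees, test_degree) * cal_scores
```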
The final novel method is the EWCP. This method is similar to the NWCP but uses the
distance in the embedding space rather than the node degree distance. This means that
the predictor needs access to the node embeddings when retrieving the prediction set
for a sample. We implemented a special mode in the GraphSAGE classifier that returns
the node embedding together with the class probabilities when making a prediction.
The node embedding has as many dimensions as the hidden layers in the classifier.
Because this dimensionality makes exact distance computations costly, we calculate the
embedding space distance between the sample node and the calibration set as a
simplified version of the Euclidean distance. We calculate the simplified Euclidean
distance as follows:

distance_i = Σ_d |emb_{i,d} − emb_{sample,d}|

where emb_i is the node embedding for the ith calibration data point, emb_sample is
the incoming sample's embedding, and d indexes the embedding dimensions. We then
use the distances to find calibration data points within a predefined radius of the
sample. Only the data points within the radius are considered when producing the
prediction set.
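The distance computation and radius filter could look as follows; the array shapes and names are assumptions.

```python
import numpy as np

def ewcp_calibration_subset(cal_embeddings, cal_scores, sample_embedding, radius):
    # Simplified Euclidean distance: sum of absolute per-dimension differences.
    distances = np.abs(cal_embeddings - sample_embedding).sum(axis=1)
    # Keep only the nonconformity scores of nodes within the predefined radius.
    return cal_scores[distances <= radius]
```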
Predictors with resampling As part of our case study on dynamic graphs, we test
all conformal predictors both with and without resampling of the calibration set.
Resampling the calibration set acts as a middle ground between retraining the model
and only training on the first graph snapshot. We resample the calibration set for each
snapshot, as sketched below.
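A minimal sketch of the per-snapshot resampling step, assuming the snapshot's node identifiers are available as a NumPy array; the fixed seed and function name are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def resample_calibration(snapshot_node_ids, calibration_size):
    # Draw a fresh calibration set (without replacement) from the current snapshot.
    return rng.choice(snapshot_node_ids, size=calibration_size, replace=False)
```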
3.2.5 Evaluation
In the case studies, we used multiple metrics to evaluate both the underlying classifier
and the conformal predictors.

We used the overall accuracy and the macro-averaged F1 score for the multiclass
classification tasks. As described above, the conventional F1 score can only be used for
a single class at a time. When many classes are available, this gives a poor overview of
the model's performance. The macro-averaged F1 score solves this by returning an
average over all classes in which each class has the same weight [16].

To obtain the validity and performance of the conformal predictors, we used the
metrics outlined in [33].

We measure the efficiency of the conformal predictors, i.e., the size of the prediction
set, using three common efficiency measures:

• the average prediction set size,
• the fraction of singleton predictions, and
• the fraction of empty predictions.
We used the average prediction set size to retrieve an overarching efficiency score,
and by comparing it to the number of possible classes, we get an understanding of the
predictor’s overall performance. Furthermore, we used the fraction of singleton pre-
dictions to understand how often the predictor gave us the ideal case, and the fraction
of empty predictions to see how often the predictor failed to give us any predictions at
all.
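These metrics are straightforward to compute from the produced prediction sets; below is a hedged sketch, with names of our own choosing.

```python
import numpy as np

def cp_metrics(pred_sets, true_labels):
    sizes = np.array([len(s) for s in pred_sets])
    covered = np.array([y in s for s, y in zip(pred_sets, true_labels)])
    return {
        "coverage": covered.mean(),            # fraction of sets containing the true label
        "avg_set_size": sizes.mean(),          # overarching efficiency score
        "frac_singleton": (sizes == 1).mean(),
        "frac_empty": (sizes == 0).mean(),
    }
```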
We used the same measures for evaluating the conformal predictors in the case study
on dynamic graphs. However, we instead capture the performance for each snapshot
and observe how the metrics behave over time.
• We ensured that our results were reliable by running the experiments many times
and measuring averages and standard deviations. We also ran the experiments
on a generic virtual machine hosted by a cloud provider.
• We ensured our work was ethical by using data sets without sensitive data and
not wasting resources due to poorly written software.
To implement the experiments, we used Python 3.8 [25]. In addition, we relied heavily
on the PyTorch Geometric framework [24] to implement the GraphSAGE model. We
also used NumPy [8] for fast matrix calculations and scikit-learn [23] for some of the
metrics, such as the macro F1 score described earlier.
Table 3.4.1: Hyperparameters used for each data set. Arx is OGB Arxiv, ArxD is OGB Arxiv in
the dynamic case study, and BE is the Bitcoin Elliptic data set.
Chapter 4
Conformal Predictors for Graphs
In this thesis, one of the things we explore is whether graph connectivity information
can be used to improve the validity and efficiency of conformal predictors. In this
chapter, we describe our three novel methods for Conformal Prediction: the NCCP, the
NWCP, and the EWCP.
A conformal predictor's validity is tightly coupled with the precision of the nonconformity
measure used to produce prediction sets. In our case, as in many others, the
nonconformity measure is computed as the error from an underlying deep learning
algorithm. As we bring up in Section 2.2.1, a study has shown that Graph Convolutional
Networks tend to have higher accuracy on high-degree nodes than on low-degree
nodes [29]. Therefore, we can assume that a conformal predictor that bases its
nonconformity measure on the output of a GCN is more likely to be valid for high-degree
nodes than for low-degree nodes.
We relate this to the issue of imbalanced data sets, where a conformal predictor can
achieve general validity even though it is invalid for the minority class. In Section 2.3.1,
we describe the Class-Conditional Conformal Predictor, which solves this issue by
dividing the data space into Mondrian categories based on the class labels and forming
a separate valid conformal predictor for each label. Thus, we can achieve validity for all
classes. We also know that we can form Mondrian categories based on other properties
of the data points to achieve validity for data points that share the same property, such
as the node degree. One such approach is the weighted conformal prediction of Barber
et al., which relaxes the exchangeability assumption by weighting the calibration
scores [2].
While Barber et al.'s method is not directly applicable to our case, we take inspiration
from it and design our own weighted conformal predictor called the Node degree
Weighted Conformal Predictor (NWCP). As we can see in Algorithm 6, the only change
from the standard CP algorithm is that we, on line 3, multiply the nonconformity scores
of the calibration set with the assigned weights before we compare the scores with the
subject sample.
In the NWCP, we assume that nodes with similar node degrees originate from the same
underlying distribution. We leverage this information when we compute the weights
for the algorithm. We calculate the weights as the distance in node degree between the
sample node A and the calibration node B, normalized to be within the range [0, 1]
using the formula given in Section 3.2.4.

A related idea appears in the spatial conformal predictor [17], where a calibration point
is assigned a weight of 1 if it lies within a predefined radius of the test point; otherwise,
the weight is set to 0. The spatial conformal predictor can produce valid and efficient
prediction intervals as long as the data is locally exchangeable.
Figure 4.3.1: Example of how nodes in a graph can be represented in the embedding space.
Unfortunately, the spatial conformal prediction algorithm is not designed for the
classification task, so when we designed the Embedding distance Weighted Conformal
Predictor (EWCP), we could only take inspiration from it. We define the EWCP in
Algorithm 7. As we can see, it is quite similar to the CCCP algorithm, except that we
determine the calibration set based on the embedding distance and that the calibration
sets are allowed to overlap. In our experiments, we calculate the embedding distance
as a simplified Euclidean distance between two node embeddings, as described in
Section 3.2.4. However, nothing stops us from using another distance measure, such as
the Manhattan distance, if we believe it would perform better.
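To illustrate, the sketch below combines the radius-based calibration subset from Section 3.2.4 with a standard conformal p-value; it is our reading of the prediction step rather than a reproduction of Algorithm 7, and the names are ours.

```python
import numpy as np

def ewcp_prediction_set(cal_embeddings, cal_scores, sample_embedding,
                        test_probs, radius, significance=0.05):
    # Calibration subset: only nodes within the radius in embedding space.
    distances = np.abs(cal_embeddings - sample_embedding).sum(axis=1)
    local_scores = cal_scores[distances <= radius]
    # Standard conformal prediction set over the local calibration scores.
    prediction_set = []
    for c, p in enumerate(test_probs):
        p_value = (np.sum(local_scores >= (1.0 - p)) + 1) / (len(local_scores) + 1)
        if p_value > significance:
            prediction_set.append(c)
    return prediction_set
```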
Chapter 5
Results
This chapter presents the results from our case studies on static and dynamic graphs.
First, we present the conformal predictors' performance on static graphs. These results
are presented as tables, accompanied by short explanations. Second, we present the
conformal predictors' performance on the dynamic graphs. To avoid confusion, we
present the results for the conformal predictors with and without resampling
separately. These results are presented as line graphs showing the performance of the
conformal predictors over time.
5.1 Static graphs

5.1.1 Cora
In this section, we present the results of the experiments on the Cora citation graph.
The Cora data set is small and contains seven different class labels.
In Table 5.1.1, we can see that the underlying GraphSAGE model has almost perfect
accuracy and macro F1 when classifying nodes. As the accuracy and the macro F1
scores have the same values, we can assume that the GraphSAGE classifier is not biased
towards any class and performs equally well for all classes.
Accuracy: 0.94
Macro F1: 0.94
Table 5.1.1: Performance of the underlying GraphSAGE model on the Cora data set.
In Table 5.1.2, we can see the performance of the conformal predictors. Notably, all the
conformal predictors perform about the same on the Cora data set, with minimal
variation in terms of validity and efficiency. All of the predictors appear to be valid, as
they have a coverage score above the specified confidence level. The predictors are also
almost equally efficient. However, the EWCP is the most efficient, with the smallest
average prediction set size, and it beats the NWCP by just a hair's breadth in terms of
singleton predictions generated.

The most variation in the results appears in the prediction latency at the bottom of
Table 5.1.2. Here we can see that the conventional ICP is by far the fastest, with a
prediction latency of 0.32 milliseconds. This is more than twice as fast as the EWCP,
which has a prediction latency of 0.80 milliseconds. It is also surprising that the CCCP
is slower than the ICP. This goes against our intuition, as the CCCP compares incoming
test nodes with a subset of the calibration nodes, in contrast to the ICP, which compares
with all calibration nodes.
Table 5.1.2: Performance of the conformal predictors on the Cora data set.
5.1.2 PubMed
Here we present the results from the experiments on the PubMed citation graph. The
PubMed data set is larger than the Cora data set but only includes three different class
labels. However, the results from the PubMed data set are almost identical to those
from the experiments on the Cora data set.
As we can see in Table 5.1.3, we again have an almost perfect GraphSAGE classifier,
and again the accuracy and macro F1 scores are identical, telling us that the classifier
performs equally well on all classes.
Accuracy: 0.93
Macro F1: 0.93
Table 5.1.3: Performance of the underlying GraphSAGE model on the PubMed data set.
In Table 5.1.4, we see that all of the conformal predictors appear to be valid, and overall
there is minimal variation in terms of efficiency, with the ICP, the NWCP, and the EWCP
sharing identical scores. The variation in the prediction latency is also similar to what
we see for the Cora data set, as shown at the bottom of Table 5.1.4. Again, the
conventional ICP is by far the fastest, and the EWCP is almost ten times slower than
the ICP. The CCCP is again surprisingly slower than the ICP.
Table 5.1.4: Performance of the conformal predictors on the PubMed data set.
5.1.3 OGB Arxiv

This section presents the results from the experiments on the OGB Arxiv citation
network. The OGB Arxiv data set is orders of magnitude larger than the Cora and
PubMed data sets and includes 47 different class labels, making it more of a challenge
for the GraphSAGE classifier.
As we can see in Table 5.1.5, the GraphSAGE model does not reach the same high
accuracy as on the Cora and PubMed data sets. There is a significant difference between
the accuracy and the macro F1 score, indicating that the GraphSAGE model might
perform better on some class labels than on others.
Table 5.1.6 shows the performance of the conformal predictors. As we can see, all of the
conformal predictors turn out to be valid except for the EWCP, which falls just below
the threshold.
Accuracy: 0.71
Macro F1: 0.57
Table 5.1.5: Performance of the underlying GraphSAGE model on the Arxiv data set.
Both the CCCP and the EWCP stand out in terms of efficiency. The Class-Conditional
Conformal Predictor is less efficient than the other predictors and has almost double
the average prediction set size of the conventional ICP. The low efficiency is likely due
to class imbalance in the data set. We can see signs of bias towards some classes in the
GraphSAGE model's macro F1 score, which is much lower than its accuracy. While the
other predictors provide narrow prediction sets, the CCCP instead seems to offer its
guarantee of validity for each class. In contrast, the EWCP seems to offer some
efficiency gains relative to the other predictors, both in terms of a smaller average
prediction set size and a larger fraction of singleton predictions.
The Node degree-Conditional Conformal Predictor has similar efficiency to the ICP
but seems to offer some gains in terms of prediction latency. A reason for this could
be that the node degrees are more broadly distributed, so the NCCP needs to compare
with fewer calibration nodes on average when making a prediction. The NWCP also
has similar efficiency to the ICP but, contrary to the NCCP, has a higher prediction
latency. The higher latency could be due to its more expensive weight calculations.

On another note, we can see that the prediction latencies are overall orders of
magnitude larger than those on the Cora and PubMed data sets. This can be explained
by the much larger calibration set for the Arxiv data set, which leads to proportionally
more comparisons in the conformal predictors.
Table 5.1.6: Performance of the conformal predictors on the Arxiv data set.
5.2 Dynamic graphs

5.2.1 Bitcoin Elliptic

Here we present the results from the experiments on the Bitcoin Elliptic transaction
network. The graph consists of 49 snapshots, with no edges between the nodes of a
previous snapshot and the new nodes included in the following snapshot. Furthermore,
there are two possible class labels, but most nodes do not have a class label, and we
leave them out of the evaluation.
In Figure 5.2.1, we can see the performance of the underlying GraphSAGE classifier
used by the conformal predictors. As we can see, the accuracy is overall high, but it
drops slightly over time, indicating that there might be a distribution shift. Further-
more, the macro F1 score is just half that of the accuracy, which tells us that the classi-
fier is biased towards one of the classes.
Figure 5.2.1: Accuracy for the underlying GraphSAGE model on the Bitcoin Elliptic dynamic
graph.
Without resampling
In this section, we look at the results from the conformal predictors without resam-
pling. These are particularly interesting as this is the standard mode for conformal
predictors.
Figure 5.2.2 shows the coverage of the conformal predictors for each snapshot. As we
can see, the coverage follows the GraphSAGE accuracy, and most of the time there is a
significant coverage gap for all predictors; the gap is large enough for all predictors to
be invalid. Among the novel predictors, the EWCP seems to have a slightly smaller
coverage gap than the other predictors at the start. However, over time the CCCP takes
the lead with its smaller gap. At the same time, the NWCP seems to maintain the largest
coverage gap and is already invalid on the early snapshots.
Figure 5.2.2: Coverage for the conformal predictors on the Bitcoin Elliptic dynamic graph.
Looking instead at the efficiency of the conformal predictors, we can see in Figure 5.2.3
that all predictors except the CCCP maintain an average prediction set size below 1.0.
This means that we mostly have singleton and empty predictions, as we can also see in
Figures 5.2.4 and 5.2.5. A large fraction of singleton predictions is the best possible
outcome for a conformal predictor, as singleton predictions are the most informative.
On the other hand, empty predictions are usually unsatisfactory, as they result from
the predictor not being able to predict any label at the specified confidence level. As we
can see, we also have a relatively large fraction of those, around 1 to 2% of predictions.
The NWCP seems to have more empty predictions than the other close-by predictors,
which could explain its low coverage.
The CCCP stands out from the other predictors. It maintains an average prediction set
size of just above 1.0, which is more desirable than a size below 1.0. It is not unusual for
Mondrian conformal predictors to have a larger prediction set size than other
predictors, but in turn, the predictor should offer validity guarantees for each category.
While it does not seem to offer validity guarantees over time, it still maintains a lower
coverage gap than the other predictors, so in this case, it still wins in both categories.
However, it is odd that the CCCP also has significantly more empty predictions than
the others. With a higher average prediction set size and a similar fraction of singleton
predictions, we would have expected fewer empty predictions than for the others.
Figure 5.2.3: Average prediction set size for the conformal predictors on the Bitcoin Elliptic
dynamic graph.
Figure 5.2.4: Fraction of singleton predictions for the conformal predictors on the Bitcoin El-
liptic dynamic graph.
The prediction latency of the conformal predictors is similar to what we saw for the
static graphs. In Figure 5.2.6, we see that the EWCP stands out as by far the slowest
Figure 5.2.5: Fraction of empty predictions for the conformal predictors on the Bitcoin Elliptic
dynamic graph.
predictor. The other predictors all have a similar prediction latency, with the conven-
tional ICP having the lowest latency.
Figure 5.2.6: Prediction latency for the conformal predictors on the Bitcoin Elliptic dynamic
graph.
Using resampling
Next, we look at the conformal predictors with resampling. These conformal predictors
similarly do not retrain the GraphSAGE classifier, but they do resample the calibration
set at every snapshot before running the predictions.
As expected, the predictors with resampling maintain validity for most of the
snapshots, as shown in Figure 5.2.7. The EWCP seems to maintain a coverage well above
the threshold of 0.95. The NWCP stands out, as it is invalid for the early snapshots and
only turns valid at time step 8, after which it stays valid. This is unexpected, as we do
not see similar behavior for the NCCP, which also uses the node degree.
Figure 5.2.7: Coverage for the conformal predictors with resampling on the Bitcoin Elliptic
dynamic graph.
However, the validity comes at the cost of efficiency, as shown in Figure 5.2.8. For most
predictors, the average prediction set size is just below 1.5 after a few time steps. This
is relatively high considering that the maximum possible prediction set size is 2. We
can also see the effect in Figure 5.2.9, which shows the fraction of singleton predictions.
Here we can see that the predictors produce fewer singleton predictions over time. As
Figure 5.2.10 shows that this is not compensated for by empty predictions, it means
that we get more uninformative predictions where both possible class labels are
included. The CCCP stands out from the others again, with an overall much higher
average prediction set size and more empty predictions than the others on the early
snapshots. This renders its predictions uninformative most of the time.

The prediction latency for the predictors with resampling is similar to that of those
without resampling, as shown in Figure 5.2.11. The contrast is that the prediction
latency grows over time for the predictors with resampling. This is explained by the
calibration set growing each time we resample it, leading to predictions taking longer.
The EWCP stands out even more in this experiment, as its latency also grows faster
than the others'. This could be a sign that the embedding distance calculations grow
increasingly costly as the calibration set grows.
Figure 5.2.8: Average prediction set size for the conformal predictors with resampling on the
Bitcoin Elliptic dynamic graph.
Figure 5.2.9: Fraction of singleton predictions for the conformal predictors with resampling on
the Bitcoin Elliptic dynamic graph.
Figure 5.2.10: Fraction of empty predictions for the conformal predictors with resampling on
the Bitcoin Elliptic dynamic graph.
Figure 5.2.11: Prediction latency for the conformal predictors with resampling on the Bitcoin
Elliptic dynamic graph.
5.2.2 OGB Arxiv

Now we present the final graph: the OGB Arxiv citation network, this time as a dynamic
graph. As described earlier, OGB Arxiv is a much larger graph than the other graphs we
present in this work. It is also complex, with 47 possible classes. As shown in Figure
3.2.2, the class distribution also changes significantly over time, and some classes that
used to be barely represented in the data set grow to be the dominant classes in the last
snapshots. We therefore already know that there is a significant distribution shift in
the graph.
Figure 5.2.12 shows the performance of the underlying GraphSAGE classifier on the
data set. As we can see, the performance goes down significantly over time, signifying
the distribution shift we just mentioned. There is also a gap between the accuracy
score and the macro F1 score, indicating that the classifier has a bias towards some
classes.
Without resampling
First, we present the results for the conformal predictors without resampling. In Figure
5.2.13, we can see the coverage for the predictors. Similar to the GraphSAGE classifier's
performance, the coverage drops over time, confirming the effect of the distribution
shift. Already in the second snapshot, all predictors turn invalid. However, while
significant, the coverage gap remains relatively low. The predictors have similar
coverage, and the ICP and the EWCP have almost identical scores.
Figure 5.2.12: Accuracy for the underlying GraphSAGE model on the OGB Arxiv dynamic
graph.
Figure 5.2.13: Coverage for the conformal predictors on the OGB Arxiv dynamic graph.
We see a similar trend in efficiency. Figures 5.2.14 and 5.2.15 show the average
prediction set size and the fraction of singleton predictions. The prediction set size
grows significantly over time, and we see fewer singleton predictions. Thus, the
predictions become less informative over time. However, in Figure 5.2.16, we see that
the fraction of empty predictions remains low over time, showing that the predictors
include too many class labels rather than too few.

The CCCP stands out from the others, as it maintains a much higher prediction set size.
However, while we would then also expect a higher coverage, this is not the case, and
the CCCP simply appears less efficient than the others.
The results on the prediction latency show nothing new, as shown in Figure 5.2.17. The
EWCP remains the slowest predictor, and the other predictors share a similar
prediction latency.
Figure 5.2.14: Average prediction set size for the conformal predictors on the OGB Arxiv dy-
namic graph.
Figure 5.2.15: Fraction of singleton predictions for the conformal predictors on the OGB Arxiv
dynamic graph.
Figure 5.2.16: Fraction of empty predictions for the conformal predictors on the OGB Arxiv
dynamic graph.
Figure 5.2.17: Prediction latency for the conformal predictors on the OGB Arxiv dynamic graph.
Using resampling
Finally, we present the results for the conformal predictors with resampling. Like the
predictors above, they only train the GraphSAGE classifier on the first snapshot, but
the significant change is that they resample the calibration set for every snapshot.
In Figure 5.2.18, we see that the coverage for the predictors remains steady at around
0.95 across all time steps. This means that all predictors remain valid.
Figure 5.2.18: Coverage for the conformal predictors with resampling on the OGB Arxiv dy-
namic graph.
In Figure 5.2.19, we see that the average prediction set size grows steadily over time,
and the prediction sets quickly grow relatively large. In the final snapshot, the
prediction set size averages around 25, which is about half of the maximum possible
prediction set size. We can assume that any prediction set with 25 labels is
uninformative. Looking at the singleton and empty predictions in Figures 5.2.20 and
5.2.21, we see that the fraction of singleton predictions goes down while the fraction of
empty predictions remains low. This shows that we again include too many labels
rather than too few.
The CCCP turns out again to have the worst efficiency without any clear gains in cov-
erage.
Figure 5.2.19: Average prediction set size for the conformal predictors with resampling on the
OGB Arxiv dynamic graph.
Figure 5.2.20: Fraction of singleton predictions for the conformal predictors with resampling
on the OGB Arxiv dynamic graph.
The prediction latency of the predictors is similar to what we have seen before. In
Figure 5.2.22, we see that prediction latency grows over time for all predictors, and the
Figure 5.2.21: Fraction of empty predictions for the conformal predictors with resampling on
the OGB Arxiv dynamic graph.
EWCP is by far the slowest. The increasing calibration set size can again explain the
growing prediction latency.
Figure 5.2.22: Prediction latency for the conformal predictors with resampling on the OGB
Arxiv dynamic graph.
In the experiments on the OGB Arxiv graph, we saw only minor differences between
the ICP and the EWCP, leading us to believe that we used too large a radius for the
EWCP: the radius (1000) is much higher than the one we used for the static graph
(300). Therefore, we extended the experiment to look at what happens when we reduce
the radius to the one used for the static graph.

In Figure 5.2.23, we see the drop in coverage when we lower the EWCP's radius to 300.
The coverage drop is significant, and not even the predictor with resampling achieves
validity.
Figure 5.2.23: EWCP radius comparison on OGB Arxiv dynamic graph - Coverage for the con-
formal predictors.
However, we can see that reducing the radius offers significant efficiency gains. In
Figure 5.2.24, we see that lowering the radius reduces the average prediction set size,
which also leads to more singleton predictions, as we can see in Figure 5.2.25. At the
same time, we also receive fewer empty prediction sets, as shown in Figure 5.2.26. All
in all, we seem to get more informative prediction sets when we lower the radius.
Figure 5.2.24: EWCP radius comparison on OGB Arxiv dynamic graph - Average prediction set
size for the conformal predictors.
Figure 5.2.25: EWCP radius comparison on OGB Arxiv dynamic graph - Fraction of singleton
predictions for the conformal predictors.
Figure 5.2.26: EWCP radius comparison on OGB Arxiv dynamic graph - Fraction of empty
predictions for the conformal predictors.
Figure 5.2.27: EWCP radius comparison on OGB Arxiv dynamic graph - Prediction latency for
the conformal predictors.
Chapter 6
Discussion
In this chapter, we review the results from the case studies on static and dynamic
graphs. We start by looking at the performance of the conventional conformal pre-
dictors, and then we move on to the novel predictors and the effect of resampling.
Finally, we discuss the limitations of current work and propose directions for future
studies.
The ICP behaved as expected in all scenarios. On the static graphs, it performed on
par with the other predictors and acted as a good baseline to compare the other
predictors against. The ICP proved to be valid in all static scenarios and was also quite
efficient, even if it was not always the most efficient predictor. In the dynamic
scenarios, none of the predictors, including the ICP, were able to maintain validity
without resampling. From what we see in the experiments, the ICP seems to be an
attractive conformal predictor for graphs due to its decent performance and low
prediction latency relative to the other predictors we tested. The ICP proved to have
the lowest prediction latency in all experiments except the one on the static OGB Arxiv
graph.
In contrast, the other conventional conformal predictor, the CCCP, came with mixed
results. The GraphSAGE classifier turned out to be well balanced in the case of the
Cora and PubMed data sets, which likely explains why we did not see any performance
gains from using the CCCP. However, while it did not bring any performance gains,
it came with a surprisingly significant increase in prediction latency. We expected the
CCCP to have a latency similar to that of the ICP, but in some cases, it was more than
double. As the high latency is unexpected, it would make sense to look into the code
for bottlenecks in our implementation of the CCCP before drawing any hard
conclusions. For the other data sets, the CCCP saw a decrease in efficiency relative to
the other predictors. We could also see that these data sets were imbalanced by looking
at the class distribution and comparing the accuracy and macro F1 scores. In these
cases, the CCCP likely has better coverage within each class, which is why we see the
decrease in efficiency. However, while the CCCP should guarantee validity for each
class, we have not proven this, and it could be worth investigating whether the
guarantee holds in further studies. It is also worth noting that while the CCCP was
unable to maintain validity on the dynamic graphs, it seemed to have a slightly lower
coverage gap on the Bitcoin Elliptic graph. However, the gap was not low enough to
justify using the CCCP instead of retraining or resampling.
For the NCCP, we divided the node degrees into a fixed number of bins rather than
forming one Mondrian category per exact degree, as some particular node degrees
would otherwise not be covered by the calibration set. We do not know if this has
affected the performance, but it should at least not violate the per-category validity
guarantee. Furthermore, the only place where the NCCP stood out was the static OGB
Arxiv graph, where it offered significant gains in terms of latency while maintaining
validity and efficiency. This could indicate that the NCCP offers a lower computational
cost relative to other predictors, and it should be investigated further on a larger
number of large data sets.
Finally, the last novel predictor we test is the EWCP, another weighted predictor that
determines which nodes should be part of the calibration set by looking at their
distance from the test node in the embedding space. In our experiments, we find that
the radius used to determine which nodes to include in the calibration set is a
hyperparameter that we can tune to control the coverage, and thereby the validity, of
the predictor. We see an example of this in our experiments on the OGB Arxiv data set.
For the experiment on the static version of the graph, we use a radius of 300, which
produces a predictor with high coverage that nonetheless falls just short of validity. On
the upside, the predictor is also more efficient than the other predictors. In contrast,
on the dynamic graph, we use a radius of 1000, which leads the EWCP to have almost
identical performance to the ICP. Usually, we can also use the confidence level to
achieve a similar effect: by reducing the confidence level, we receive a slimmer
prediction set. In contrast to tuning the radius of the EWCP, however, changing the
confidence level is a more transparent way of reducing the prediction set size.
Nonetheless, this indicates that we can use the node embeddings to reduce the
prediction set size. As we note in our comparison of different values for the EWCP
radius on the dynamic OGB Arxiv graph, it is especially attractive that we can do so
without generating more empty prediction sets. We suggest that future studies look
further into using the node embeddings for this purpose.
In our experiments, we also note that the EWCP has a significantly higher prediction
latency than the other predictors. This is due to the expensive calculations needed to
determine the distance between two nodes in the high-dimensional embedding space
(we use 256 dimensions). Therefore, we believe it is worth exploring whether the cost
can be reduced, for example, by using a low-dimensional projection of the embedding
for the comparison or by using a different distance measure.
6.4 Limitations
The primary limitation of this work is the reliability of the results. The evaluation of
the predictors is not extensive enough to draw any hard conclusions. We use a small
number of data sets: two in the case study on dynamic graphs and three in the case
study on static graphs. Furthermore, we were able to train two very accurate
GraphSAGE models for two of the data sets in the static experiment, which we believe
boosted the performance of the predictors and reduced the variation in the results. The
Bitcoin Elliptic data
set is also sub-optimal, as the node features already include some information about
the neighboring nodes. This could lower the impact of using the graph connectivity
information in the CP algorithms.
However, even if only one of our data sets gave fully reliable results, it should be noted
that there are not many existing data sets of dynamic graphs that can be used in a study
like this. Many data sets lack labels for both regression and classification tasks and are
primarily suitable for unsupervised learning tasks such as clustering. Hence, this study
can still act as inspiration for future work, as it is the first of its kind, and hopefully
motivate researchers to assemble more data sets for dynamic graphs that can be used
to cement the results of this and future studies.
6.5 Future work

Based on our findings, we propose the following directions for future work:

• Perform the same experiments with more and larger data sets to get more robust
results.
• Run the measurements on prediction latency again with more performant con-
formal predictor implementations. Our implementation could only handle a sin-
gle node per prediction, which might have slowed predictions down significantly.
We are especially interested in understanding if the NCCP could offer lower la-
tency than the ICP on larger data sets.
• Investigate the effect of continual learning and if the coverage gap that occurs in
dynamic graphs can be used as a measure to trigger retraining efficiently.
• Investigate if the NCCP and the CCCP fulfill their promise of per-category
validity.
• Investigate the NWCP's low performance on the Bitcoin Elliptic data set, and
experiment with other node degree-based weighting formulas that handle the
case where many nodes have similar degrees.
• Investigate if the NWCP can offer a similar upper bound on the coverage gap as
the weighted method by Barber et al. [2].
• Propose a method for determining the radius used in the EWCP to guarantee
validity and efficiency.
• Propose a method for resampling a fixed-size calibration set without losing
predictor validity or efficiency.
Chapter 7
Conclusions
To answer the research question, we evaluated two conventional algorithms: the In-
ductive Conformal Predictor and the Class-Conditional Conformal Predictor, and we
also designed three novel CP algorithms leveraging the graph connectivity informa-
tion: the Node degree-Conditional Conformal Predictor, the Node degree Weighted
Conformal Predictor, and the Embedding distance Weighted Conformal Predictor. We
compared the predictors on both static and dynamic graphs. However, we could not
show that including the graph connectivity information improves the performance of
the predictors in any significant way. Nevertheless, some indications make the area
worth exploring further. For instance, the NCCP seems to have the lowest prediction
latency on the OGB Arxiv data set, which is a reasonably large graph. We believe there
should be further investigation into whether the NCCP can be used to keep down the
prediction latency on large data sets. We also did not show that the NCCP brings
validity within node degree categories, and this could be investigated in further work
to understand if it can be used to mitigate the bias in GCNs. We also saw indications
that we can use the node embedding information to design efficient conformal
predictors. However, it requires further investigation to understand how this
information is leveraged most effectively. We also investigated the effect of resampling
the calibration set in dynamic graphs. In the experiments, we saw that most of the
predictors could maintain validity only when resampling was used. However, this came
at the cost of efficiency, making the prediction sets less informative over time.
To conclude, while we could not show that the graph connectivity information is useful
in GCN-based conformal predictors, we believe it is worth investigating further. We
look forward to seeing more studies in this emerging research area.
Bibliography
[2] Barber, Rina Foygel, Candes, Emmanuel J, Ramdas, Aaditya, and Tibshirani,
Ryan J. “Conformal prediction beyond exchangeability”. In: arXiv preprint arXiv:2202.13415
(2022).
[4] Grover, Aditya and Leskovec, Jure. “node2vec: Scalable feature learning for net-
works”. In: Proceedings of the 22nd ACM SIGKDD international conference on
Knowledge discovery and data mining. 2016, pp. 855–864.
[5] Håkansson, Anne. “Portal of research methods and methodologies for research
projects and degree projects”. In: The 2013 World Congress in Computer Sci-
ence, Computer Engineering, and Applied Computing WORLDCOMP 2013;
Las Vegas, Nevada, USA, 22-25 July. CSREA Press USA. 2013, pp. 67–73.
[6] Hamilton, Will, Ying, Zhitao, and Leskovec, Jure. “Inductive representation learn-
ing on large graphs”. In: Advances in neural information processing systems 30
(2017).
[7] Harary, Frank and Gupta, Gopal. “Dynamic graph models”. In: Mathematical
and Computer Modelling 25.7 (1997), pp. 79–87.
[8] Harris, Charles R., Millman, K. Jarrod, Walt, Stéfan J. van der, Gommers, Ralf,
Virtanen, Pauli, Cournapeau, David, Wieser, Eric, Taylor, Julian, Berg, Sebas-
tian, Smith, Nathaniel J., Kern, Robert, Picus, Matti, Hoyer, Stephan, Kerk-
wijk, Marten H. van, Brett, Matthew, Haldane, Allan, Río, Jaime Fernández
del, Wiebe, Mark, Peterson, Pearu, Gérard-Marchant, Pierre, Sheppard, Kevin,
Reddy, Tyler, Weckesser, Warren, Abbasi, Hameer, Gohlke, Christoph, and Oliphant,
Travis E. “Array programming with NumPy”. In: Nature 585.7825 (Sept. 2020),
pp. 357–362. DOI: 10.1038/s41586-020-2649-2. URL: https://doi.org/10.1038/s41586-020-2649-2.
[9] Hu, Weihua, Fey, Matthias, Zitnik, Marinka, Dong, Yuxiao, Ren, Hongyu, Liu,
Bowen, Catasta, Michele, and Leskovec, Jure. “Open Graph Benchmark: Datasets
for Machine Learning on Graphs”. In: arXiv preprint arXiv:2005.00687 (2020).
[10] Hwang, Doyeong, Lee, Grace, Jo, Hanseok, Yoon, Seyoul, and Ryu, Seongok. “A
benchmark study on reliable molecular supervised learning via Bayesian learn-
ing”. In: arXiv preprint arXiv:2006.07021 (2020).
[11] Kipf, Thomas N and Welling, Max. “Semi-supervised classification with graph
convolutional networks”. In: arXiv preprint arXiv:1609.02907 (2016).
[12] Kwon, Youngchun, Lee, Dongseon, Choi, Youn-Suk, and Kang, Seokho. “Uncertainty-
aware prediction of chemical reaction yields with graph neural networks”. In:
Journal of Cheminformatics 14.1 (2022), pp. 1–10.
[13] Leskovec, Jure. Stanford CS224W: GNN Augmentation and Training. https://web.stanford.edu/class/cs224w/slides/08-GNN-application.pdf. Oct. 2021.
[14] Leskovec, Jure. Stanford CS224W: Graph Neural Networks. https://web.stanford.edu/class/cs224w/slides/06-GNN1.pdf. Oct. 2021.
[15] Leskovec, Jure. Stanford CS224W: Heterogeneous Graphs and Knowledge Graph
Embeddings. https://web.stanford.edu/class/cs224w/slides/10-kg.pdf.
Oct. 2021.
[16] Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.
[17] Mao, Huiying, Martin, Ryan, and Reich, Brian. “Valid model-free spatial predic-
tion”. In: arXiv preprint arXiv:2006.15640 (2020).
[18] McCallum, Andrew Kachites, Nigam, Kamal, Rennie, Jason, and Seymore, Kristie.
“Automating the construction of internet portals with machine learning”. In: In-
formation Retrieval 3.2 (2000), pp. 127–163.
[19] Norinder, Ulf, Carlsson, Lars, Boyer, Scott, and Eklund, Martin. “Introducing
conformal prediction in predictive modeling. A transparent and flexible alter-
native to applicability domain determination”. In: Journal of chemical infor-
mation and modeling 54.6 (2014), pp. 1596–1603.
[20] Papadopoulos, Harris, Proedrou, Kostas, Vovk, Volodya, and Gammerman, Alex.
“Inductive confidence machines for regression”. In: European Conference on
Machine Learning. Springer. 2002, pp. 345–356.
[21] Pareja, Aldo, Domeniconi, Giacomo, Chen, Jie, Ma, Tengfei, Suzumura, Toy-
otaro, Kanezashi, Hiroki, Kaler, Tim, Schardl, Tao, and Leiserson, Charles. “Evolvegcn:
Evolving graph convolutional networks for dynamic graphs”. In: Proceedings
of the AAAI Conference on Artificial Intelligence. Vol. 34. 04. 2020, pp. 5363–
5370.
[22] Patterson, David, Gonzalez, Joseph, Le, Quoc, Liang, Chen, Munguia, Lluis-
Miquel, Rothchild, Daniel, So, David, Texier, Maud, and Dean, Jeff. “Carbon
emissions and large neural network training”. In: arXiv preprint arXiv:2104.10350
(2021).
[23] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. “Scikit-learn:
Machine Learning in Python”. In: Journal of Machine Learning Research 12
(2011), pp. 2825–2830.
[26] Rajpurkar, Pranav, Chen, Emma, Banerjee, Oishi, and Topol, Eric J. “AI in health
and medicine”. In: Nature Medicine (2022), pp. 1–8.
[27] Shafer, Glenn and Vovk, Vladimir. “A tutorial on conformal prediction.” In: Jour-
nal of Machine Learning Research 9.3 (2008).
[28] Sun, Jiangming, Carlsson, Lars, Ahlberg, Ernst, Norinder, Ulf, Engkvist, Ola,
and Chen, Hongming. “Applying mondrian cross-conformal prediction to esti-
mate prediction confidence on large imbalanced bioactivity data sets”. In: Jour-
nal of chemical information and modeling 57.7 (2017), pp. 1591–1598.
[29] Tang, Xianfeng, Yao, Huaxiu, Sun, Yiwei, Wang, Yiqi, Tang, Jiliang, Aggarwal,
Charu, Mitra, Prasenjit, and Wang, Suhang. “Graph convolutional networks against
degree-related biases”. In: arXiv preprint arXiv:2006.15643 (2020).
[30] Vaida, Maria and Patil, Pranita. “Semi-Supervised Graph Neural Network with
Probabilistic Modeling to Mitigate Uncertainty”. In: Proceedings of the 2020 the
4th International Conference on Information System and Data Mining. 2020,
pp. 152–156.
[31] Vovk, Vladimir, Gammerman, Alexander, and Shafer, Glenn. Algorithmic learn-
ing in a random world. Springer Science & Business Media, 2005.
[32] Vovk, Vladimir, Lindsay, David, Nouretdinov, Ilia, and Gammerman, Alex. “Mon-
drian confidence machine”. In: Technical Report (2003).
[33] Vovk, Vladimir, Nouretdinov, Ilia, Fedorova, Valentina, Petej, Ivan, and Gam-
merman, Alex. “Criteria of efficiency for set-valued classification”. In: Annals of
Mathematics and Artificial Intelligence 81.1 (2017), pp. 21–46.
[34] Vovk, Vladimir, Petej, Ivan, Nouretdinov, Ilia, Ahlberg, Ernst, Carlsson, Lars,
and Gammerman, Alex. “Retrain or not retrain: Conformal test martingales for
change-point detection”. In: Conformal and Probabilistic Prediction and Ap-
plications. PMLR. 2021, pp. 191–210.
[35] Wang, Junshan, Song, Guojie, Wu, Yi, and Wang, Liang. “Streaming graph neu-
ral networks via continual learning”. In: Proceedings of the 29th ACM Interna-
tional Conference on Information & Knowledge Management. 2020, pp. 1515–
1524.
[36] Wu, Bingzhe, Li, Jintang, Hou, Chengbin, Fu, Guoji, Bian, Yatao, Chen, Liang,
and Huang, Junzhou. “Recent advances in reliable deep graph learning: Adver-
sarial attack, inherent noise, and distribution shift”. In: arXiv preprint arXiv:2202.07114
(2022).
[37] Wu, Zonghan, Pan, Shirui, Chen, Fengwen, Long, Guodong, Zhang, Chengqi,
and Philip, S Yu. “A comprehensive survey on graph neural networks”. In: IEEE
transactions on neural networks and learning systems 32.1 (2020), pp. 4–24.
[38] Zhang, Jin, Norinder, Ulf, and Svensson, Fredrik. “Deep learning-based confor-
mal prediction of toxicity”. In: Journal of Chemical Information and Modeling
61.6 (2021), pp. 2648–2657.
[39] Zhang, Si, Tong, Hanghang, Xu, Jiejun, and Maciejewski, Ross. “Graph convolu-
tional networks: a comprehensive review”. In: Computational Social Networks
6.1 (2019), pp. 1–23.