
DEGREE PROJECT IN TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2022

Reliable graph predictions

Conformal prediction for Graph Neural Networks

Albin Bååw

KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Authors
Albin Bååw <abaaw@kth.se>
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project


Stockholm, Sweden
KTH Royal Institute of Technology

Examiner
Henrik Boström <bostromh@kth.se>
Stockholm, Sweden
KTH Royal Institute of Technology

Supervisors
Paris Carbone <parisc@kth.se>
Stockholm, Sweden
KTH Royal Institute of Technology

Sonia-Florina Horchidan <sfhor@kth.se>


Stockholm, Sweden
KTH Royal Institute of Technology

Abstract

We have seen a rapid increase in the development of deep learning algorithms in recent
decades. However, while these algorithms have unlocked new business areas and led
to great development in many fields, they are usually limited to Euclidean data. Re-
searchers are increasingly starting to find out that they can better represent the data
used in many real-life applications as graphs. Examples include high-risk domains
such as drug development, where the side effects of combining medicines are predicted
using a protein-protein network. In high-risk domains, there is a need for trust and transparency in the results
returned by deep learning algorithms. In this work, we explore how we can quantify
uncertainty in Graph Neural Network predictions using conventional methods for con-
formal prediction as well as novel methods exploiting graph connectivity information.
We evaluate the methods on both static and dynamic graphs and find that neither of
the novel methods offers any clear benefits over the conventional methods. However,
we see indications that using the graph connectivity information can lead to more effi-
cient conformal predictors and a lower prediction latency than the conventional meth-
ods on large data sets. We propose that future work extend the research on using the
connectivity information, specifically the node embeddings, to boost the performance
of conformal predictors on graphs.

Keywords

Conformal prediction, Graph Neural Networks, Dynamic graphs, Distribution shift,
Coverage gap

Abstract

In recent decades, we have seen a drastic increase in the development of deep learning
algorithms. Even though these algorithms have created new potential business areas and
have also led to new discoveries in several other fields, they are unfortunately usually
limited to Euclidean data. At the same time, we see that more and more researchers have
discovered that the data in real-life applications is often better represented in the form
of graphs. Examples include high-risk domains such as drug development, where side
effects of medicines are predicted with the help of protein-protein networks. In high-risk
domains, there is a requirement for trust and for the results of deep learning algorithms
to be transparent. In this thesis, we explore how the uncertainty in the results of Graph
Neural Networks can be quantified using Conformal Prediction. We test both conventional
methods for conformal prediction and novel methods that exploit structural information
from the graph. We evaluate the methods on both static and dynamic graphs, and we
conclude that the novel methods are neither better nor worse than the conventional
methods. However, we find indications that using the structural information from the
graph can lead to more efficient predictors and a lower prediction latency than the
conventional methods when used on large graphs. We propose that future work in the
area further explores how the structural information, and above all the node embeddings,
can be used to increase the performance of conformal predictors on graphs.

Keywords

Conformal prediction, Graph Neural Networks, Dynamic graphs, Distribution shift,
Coverage gap

Acknowledgements

There are a few people who have contributed to this work with their knowledge, which
I am very grateful for. One of these people is Henrik Boström, who acted as examiner
for this project. Henrik has a background in Conformal prediction and has guided
this work with his insight into the field and has provided me with both directions and
resources to bring the work forward.

Another is Paris Carbone, one of the two supervisors for the project. Paris contributed
with feedback throughout the project and helped form and understand the experimen-
tal results.

Finally, I want to give a big "thank you" to Sonia Horchidan, the second supervisor.
Sonia has been heavily involved in the project and has provided feedback on every
aspect of the report. Sonia has, at times, gone above and beyond to help bring this project
into safe harbor. I am incredibly grateful that she adjusted her schedule to help me
out, and that she has answered my endless Slack messages from early morning to late
night.

I would also like to thank the Oracle Corporation for lending me their servers to run
the experiments. With the virtual machine offered by Oracle Cloud, I have been able
to make many more iterations on the algorithms presented in this project than I would
have been able to if I was only using my local machine.

Contents

Acronyms ix

List of Algorithms ix

List of Figures x

List of Tables xii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Benefits, Ethics and Sustainability . . . . . . . . . . . . . . . . . . . . . 4
1.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.7 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.8 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Theoretical Background 6
2.1 Static and dynamic graphs . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Graph Convolutional Networks . . . . . . . . . . . . . . . . . . . 10
2.2.2 GraphSAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Conformal predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Class-Conditional Conformal Predictors . . . . . . . . . . . . . . 17
2.3.2 Attribute-Conditional Conformal Predictors . . . . . . . . . . . . 18
2.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Countering the distribution shift in conformal prediction . . . . . 18


2.4.2 Spatial conformal prediction . . . . . . . . . . . . . . . . . . . . 20


2.4.3 Methods for quantifying uncertainty in deep learning . . . . . . . 20

3 Method 22
3.1 Choice of research method . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Application of research method . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Overview of the case studies . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Constructing the graphs . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.4 Models implementation . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Quality assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4 Conformal predictors for graphs 39


4.1 Node degree-Conditional Conformal Predictor . . . . . . . . . . . . . . 39
4.2 Node degree Weighted Conformal Predictor . . . . . . . . . . . . . . . 40
4.3 Embedding distance Weighted Conformal Predictor . . . . . . . . . . . 41

5 Results 44
5.1 Static graphs study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.1 Cora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.2 PubMed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.3 OGB Arxiv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Dynamic graphs study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.1 Bitcoin Elliptic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 OGB Arxiv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6 Discussion 62
6.1 The conventional conformal predictors . . . . . . . . . . . . . . . . . . . 62
6.2 The novel conformal predictors . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 The effect of resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7 Conclusions 68


References 70

Acronyms

CCCP Class-Conditional Conformal Predictor


CNN Convolutional Neural Network
CP Conformal Prediction
EWCP Embedding distance Weighted Conformal Predictor
GCN Graph Convolutional Network
GNN Graph Neural Network
ICP Inductive Conformal Predictor
NCCP Node degree-Conditional Conformal Predictor
NWCP Node degree Weighted Conformal Predictor

List of Algorithms

1 The GCN forward propagation [11] . . . . . . . . . . . . . . . . . . . . . 13


2 The Conformal Algorithm [27] . . . . . . . . . . . . . . . . . . . . . . . . 16
3 The Class-conditional Conformal Algorithm . . . . . . . . . . . . . . . . 17
4 The Attribute-conditional Conformal Algorithm . . . . . . . . . . . . . . 18

5 The Node degree-conditional Conformal Algorithm . . . . . . . . . . . . 40


6 The Node degree Weighted Conformal Predictor . . . . . . . . . . . . . . 41
7 The Embedding distance weighted Conformal Algorithm . . . . . . . . . 43

List of Figures

Theoretical Background 6
2.1.1 Example of a social network (1) . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Example of a social network (2) . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Example of a social network (3) . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Example of variations of a graph . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Example of a convolution on an image . . . . . . . . . . . . . . . . . . 11
2.2.3 Message passing in a GCN . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Overview of the GraphSAGE algorithm . . . . . . . . . . . . . . . . . . 14

Method 22
3.2.1 Class distributions for the Cora and PubMed data sets. . . . . . . . . . 26
3.2.2 Class distribution for the OGB Arxiv data . . . . . . . . . . . . . . . . . 27
3.2.3 Class distribution for the Bitcoin Elliptic data set . . . . . . . . . . . . 28
3.2.4 Example of a single GraphSAGE convolution . . . . . . . . . . . . . . 32

Conformal predictors for graphs 39


4.3.1 Example of nodes in the embedding space . . . . . . . . . . . . . . . . 42

Results 44
5.2.1 GraphSAGE - Accuracy (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . 48
5.2.2 CP - Coverage (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 CP - Average prediction set size (Bitcoin Elliptic) . . . . . . . . . . . . 50
5.2.4 CP - Singleton predictions (Bitcoin Elliptic) . . . . . . . . . . . . . . . 50
5.2.5 CP - Empty predictions (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . 51
5.2.6 CP - Prediction latency (Bitcoin Elliptic) . . . . . . . . . . . . . . . . . 51


5.2.7 CP with resampling - Coverage (Bitcoin Elliptic) . . . . . . . . . . . . . 52


5.2.8 CP with resampling - Average prediction set size (Bitcoin Elliptic) . . . 53
5.2.9 CP with resampling - Singleton predictions (Bitcoin Elliptic) . . . . . . 53
5.2.10 CP with resampling - Empty predictions (Bitcoin Elliptic) . . . . . . . 53
5.2.11 CP with resampling - Prediction latency (Bitcoin Elliptic) . . . . . . . 54
5.2.12 GraphSAGE - Accuracy (OGB Arxiv) . . . . . . . . . . . . . . . . . . . 55
5.2.13 CP - Coverage (OGB Arxiv) . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.14 CP - Average prediction set size (OGB Arxiv) . . . . . . . . . . . . . . . 56
5.2.15 CP - Singleton predictions (OGB Arxiv) . . . . . . . . . . . . . . . . . . 56
5.2.16 CP - Empty predictions (OGB Arxiv) . . . . . . . . . . . . . . . . . . . 56
5.2.17 CP - Prediction latency (OGB Arxiv) . . . . . . . . . . . . . . . . . . . 57
5.2.18 CP with resampling - Coverage (OGB Arxiv) . . . . . . . . . . . . . . . 57
5.2.19 CP with resampling - Average prediction set size (OGB Arxiv) . . . . . 58
5.2.20 CP with resampling - Singleton predictions (OGB Arxiv) . . . . . . . . 58
5.2.21 CP with resampling - Empty predictions (OGB Arxiv) . . . . . . . . . . 59
5.2.22 CP with resampling - Prediction latency (OGB Arxiv) . . . . . . . . . . 59
5.2.23 EWCP radius comparison - Coverage (OGB Arxiv) . . . . . . . . . . . 60
5.2.24 EWCP radius comparison - Average prediction set size (OGB Arxiv) . 60
5.2.25 EWCP radius comparison - Singleton predictions (OGB Arxiv) . . . . 61
5.2.26 EWCP radius comparison - Empty predictions (OGB Arxiv) . . . . . . 61
5.2.27 EWCP radius comparison - Prediction latency (OGB Arxiv) . . . . . . 61

List of Tables

Method 22
3.2.1 Data sets used in the case study on static graphs. . . . . . . . . . . . . 25
3.2.2 Data sets used in the case study on dynamic graphs. . . . . . . . . . . 25
3.4.1 Hyperparameters used in the experiments . . . . . . . . . . . . . . . . 38

Results 44
5.1.1 GraphSAGE performance (Cora) . . . . . . . . . . . . . . . . . . . . . 45
5.1.2 CP performance (Cora) . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.3 GraphSAGE performance (PubMed) . . . . . . . . . . . . . . . . . . . 46
5.1.4 CP performance (PubMed) . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.5 GraphSAGE performance (OGB Arxiv) . . . . . . . . . . . . . . . . . . 47
5.1.6 CP performance (OGB Arxiv) . . . . . . . . . . . . . . . . . . . . . . . 47

Chapter 1

Introduction

In this thesis, we explore how Conformal Prediction (CP) can be applied to Graph Neu-
ral Networks (GNN) to improve the reliability of results from node classification tasks.
We look at both static graphs and, the more realistic scenario [21], dynamic graphs.
The results include the performance of conventional CP methods as well as our own
novel methods exploiting the graph connectivity information.

1.1 Background
Graph Neural Networks are an emerging deep learning technology designed to handle
highly-interconnected graph data. Conventional neural networks are limited to fixed-form
data where each data point exists in Euclidean space. Graph data, on the
contrary, neither has a fixed form nor data points that exist in Euclidean space; hence,
conventional neural networks can not process it. GNNs resolve this issue by generating
a low-dimensional vector embedding from a graph or a component in a graph that a
neural network can process [37].

There are four types of GNNs available today [37]. The type we will use in this thesis
belongs to the family of Graph Convolutional Networks (GCN). Generally speaking,
GCNs are a type of neural network architecture that can leverage the graph structure
and aggregate node information from the neighborhoods in a convolutional fashion.
Furthermore, GCNs have great expressive power for learning graph representations
and have achieved superior performance in various tasks and applications.

Conformal Prediction is a commonly used method for quantifying uncertainty in machine
learning algorithms on both classification and regression tasks. CP replaces the
singular class prediction in machine learning algorithms with a prediction set that in-
cludes the true class with a predefined confidence level [27]. There are several varia-
tions of CP; one of those is Class-Conditional Conformal Prediction which provides
a valid predictor for each class label and is used to deal with imbalanced data sets
[28].

As CP provides a confidence level with each prediction, performance is usually measured
in terms of efficiency. The size of the prediction set determines the efficiency,
and a conformal predictor is more efficient when the average size of the prediction set
is lower [27].

We provide a longer description of GNNs and CP in Chapter 2.

1.2 Problem

The application of GNNs in high-risk domains is increasing, and common tasks revolve
around medicinal applications such as predicting the side effects of mixing two drugs.
Therefore, it is crucial to quantify the uncertainty in the algorithms’ output to invoke
trust, and transparency [26].

As the field is still in its infancy, there have been few attempts at quantifying uncertainty
in GNNs. Most of the research we have found focuses on applying Bayesian learning to
GNNs to boost prediction accuracy [12, 30] or quantifying prediction reliability using
Bayesian concepts [10]. However, one recent study applied conformal prediction to
GNNs [38] but without a more profound analysis or accounting for the graph struc-
ture.

A cornerstone in CP is the validity of the predictor, which ensures that predictions are
within a specified confidence level. The validity depends on the underlying distribution
of the data set. If the distribution of the calibration data used to tune the predictor
differs from the distribution of the test data, the validity will likely not hold.

In the case of dynamic graphs, the distribution of the data set is constantly changing.
Distribution shifts may cause a coverage gap in conformal predictors and potentially
render the predictor invalid. For that reason, a conformal predictor used on a dynamic
graph needs to take the coverage gap into account to maintain validity. Several papers
have been written on the topic of retraining conformal predictors when the distribu-
tion shift is too significant, such as [34]. Retraining is a sure way of maintaining valid
conformal predictors but can also be quite costly computational-wise. However, just
recently, a study was conducted on how to reduce the coverage gap for predictors in
the context of time-series data without the need for retraining [2]. The study proposes
a method for putting an upper bound on the coverage gap that we will describe further
in Section 2.4.1.

Furthermore, a study showed that GCNs have a bias towards high degree nodes, i.e.,
they are likely to provide more accurate predictions on high degree nodes than on low
degree nodes [29]. This naturally affects any conformal predictor that relies on the
output of a GCN, and it needs to be investigated if this also affects the predictor’s va-
lidity.

1.3 Purpose
We have seen a rapidly increasing development of deep learning algorithms in recent
decades. However, while these algorithms have unlocked new business areas and led
to great development in many fields, they are usually limited to Euclidean data [37].
Researchers are increasingly starting to find out that the data used in many real-life
applications is better represented as graphs [21]. Examples include finding side effects
when combining medicines using a protein-protein network [37], detecting fraudulent
transactions in financial networks, and predicting friendships in social networks.

The purpose of this thesis is to further the research on making GNNs viable in real-
life applications. We believe that a crucial part is providing prediction reliability in
high-risk domains, which is where this thesis’s focus lies. Thus, we try to answer the
following research question:

How can we leverage the graph connectivity information to design a valid
and efficient conformal predictor for static and dynamic graphs?

1.4 Goal
The goal of this thesis is to propose a method for applying CP to GNNs and raise aware-
ness about the importance of measuring uncertainty in the output of GNNs. We can
break down this work into the following steps:

1. Exploration of data sets that have previously been used for similar tasks.

2. Training of a GNN on the data sets and comparing its performance with bench-
marks found in the literature.

3. Applying conventional CP methods with the GNN as the underlying algorithm
and measuring the methods' performance in both a static and a dynamic graph
setting.

4. Exploration of graph-aware CP methods and measuring of their performance.

5. Evaluation of the results from the experiments.

1.5 Benefits, Ethics and Sustainability


Many domains can benefit from the use of GNNs; one of those is medicine [1]. In
medicine, it is vital to be able to understand the results outputted by the deep learn-
ing algorithm [26]. By using conformal prediction, a practitioner can understand how
much to trust those results.

We also expect that the project by itself is going to be interesting for GNN and Conformal
prediction research groups, as GNNs are still in early development and there is not
much written in the area.

This project’s ethics and sustainability considerations primarily concern the carbon
emissions released as part of training the deep learning algorithms. A recent study has
shown that carbon emissions can vary as much as 10-100x depending on how you set
up your algorithm [22]. By evaluating the performance of our proposed methods, we
hope not only to understand if the algorithms are computationally viable but also the
climate impact, as the two should be correlated.

1.6 Methodology
This thesis is grounded in the realist’s philosophical assumptions and uses the funda-
mental research method [5] to extract quantitative and qualitative results. The quali-
tative results come in the form of novel methods for quantifying uncertainty in graphs
based on CP and other related work. The quantitative results consist of a comparison of
the proposed methods and conventional conformal predictors based on performance
metrics.

We discuss the choice of research method further in Section 3.1.

1.7 Delimitations
There is a lot that we can explore when it comes to graphs and CP, and in this work, we
only answer a relatively small share of the possible questions we could answer.

In the thesis, we design several novel methods for leveraging graph connectivity in-
formation to make conformal predictions. There is a lot of connectivity information
that could be used, ranging from basic properties such as the node degree to complex
centrality measures. However, in this work, we solely focus on using the node degree
and the embeddings of nodes, as we have seen indications in the literature that these
could be helpful when making predictions.

Studying conformal predictors on dynamic graphs could warrant several research studies
just by itself. For example, we could look at the effect of retraining the conformal
predictors. However, the main focus of this work is to avoid the extra cost of retraining,
and we therefore refrain from using retraining in our experiments.

Furthermore, there are different definitions of what a dynamic graph is, of which one
is a graph where each node is streamed one by one to the GNN. However, this is not
something we will consider in this paper, and instead, we will assume that we have
access to all static snapshots of the graph at all times as described in Section 2.1. Also,
the dynamic graphs we use in our experiments are only growing, never losing nodes or
edges.

1.8 Outline
We have organized the thesis as follows: Chapter 2 gives the reader a theoretical back-
ground to GNNs, conformal prediction, and related work. Chapter 3 discusses the
method used in the thesis, while Chapter 4 introduces the novel conformal predictors.
Finally, Chapter 5 provides the results from our experiments, Chapter 6 discusses the
results and Chapter 7 presents our conclusions.

Chapter 2

Theoretical Background

In this thesis, two concepts are tied together: Graph Neural Networks and Conformal
prediction. The former is a deep learning algorithm to draw insights from highly con-
nected graph data, while the latter is used to evaluate the quality of those insights. In
this chapter, we will dig deeper into how each functions.

2.1 Static and dynamic graphs


In this thesis, we apply conformal prediction in the context of graphs. A graph G is
defined [7] as a pair (V, E) where V is a set of nodes (or vertices) and E is the set of
edges connecting pairs of nodes. An example of a graph is the social network seen in
Figure 2.1.1, where the nodes are people, and the edges describe relationships between
people, for example, Bob is a friend of Alice.

A graph can either be directed, meaning that each edge (u, v) implies that node u has a
unidirectional connection to node v, or the graph can be undirected, meaning that each
edge is bidirectional. An example of a unidirectional edge is the edge indicating that
Bob is the father of Charlie; this edge has a single direction as it would be impossible
for Charlie to be the father of Bob. However, for Bob and Alice to be friends, each of
them must be a friend of the other, and thus they share a bidirectional edge as seen in
Figure 2.1.1.

Figure 2.1.1: Example of a social network. The network captures different relationships between
people and is a directed heterogeneous graph.

On a similar topic, we also define two other types of graphs: homogeneous and heterogeneous
graphs. In homogeneous graphs, nodes and edges can only be a single type,
for instance, nodes are people, and edges indicate friendships. In the heterogeneous
graph, nodes and edges can take on different types. For example, nodes could be both
people and pets, and edges could be any relationship between them, such as friend-
ship, family, or owner. In this work, we will only consider homogeneous graphs, but
GNNs like the GCN can often be reconfigured to treat heterogeneous graphs as well
[15].

The degree of a node n is the number of edges connecting n to the other nodes in the
graph. In our example above, Bob has a degree of 5 as he is connected to Alice, Charlie,
and David. We can further specify the degree for directed graphs; here, we split the
degree into in-degree and out-degree, which describe the incoming and outgoing edges
from n respectively. In our example, Bob has an out-degree of 3 but only an in-degree
of 2.

Furthermore, the graph can also have an accompanying set of features describing the
nodes or edges in the graph. For example, Bob likes football (a node feature), and Bob and
Alice became friends through work (an edge feature).

While we often describe graphs as static, where the nodes and edges remain the same,
most real-life graphs tend to be of a dynamic nature [21], in that the set of nodes and
edges might evolve over time. For instance, in Figure 2.1.2, Fiona has
entered our graph and has developed friendships with the earlier people in the graph.
The graph can also evolve in other ways, node features may change, or nodes might
disappear.

Figure 2.1.2: Example of a growing social network. The social network in Figure 2.1.1 has grown
to also include Fiona.

A common way to study dynamic graphs is to treat them as a discrete sequence of static
graphs: G = {G_1, ..., G_n}, where G_n is the set of nodes and edges (V_n, E_n) for snapshot n [7].
We see an example of this in the social network above, where Figure
2.1.1 and Figure 2.1.2 show two snapshots of the same graph. By treating the graph
as a sequence, we can study each static snapshot of the dynamic graph using the same
techniques we would use on a static graph. There are other methods for working with
dynamic graphs, for example, by streaming the graph as incoming nodes and edges that
are only seen once. However, in this study, we will treat the graph as an accumulating
collection of nodes and edges over time, where we always have access to all the nodes
and edges within a specific snapshot.
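
To make the snapshot representation concrete, the following minimal Python sketch (with illustrative names; it is not code from this project) stores a dynamic graph as an accumulating list of static snapshots:

from dataclasses import dataclass, field

@dataclass
class Snapshot:
    """One static view G_n = (V_n, E_n) of the dynamic graph."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # directed edges stored as (u, v) pairs

# Snapshot 1: a small social network with two people.
g1 = Snapshot(nodes={"Alice", "Bob"},
              edges={("Alice", "Bob"), ("Bob", "Alice")})
# Snapshot 2: the graph has grown; all earlier nodes and edges are kept.
g2 = Snapshot(nodes=g1.nodes | {"Fiona"},
              edges=g1.edges | {("Fiona", "Alice"), ("Alice", "Fiona")})

snapshots = [g1, g2]  # each element can be processed like a static graph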

As the graph evolves, we are bound to see previously unseen vertices, edges, and maybe
even classes in each snapshot. Due to this, a distribution shift may happen, i.e., the
property distribution among the nodes may be widely different between an earlier and a later
snapshot. The shift poses a problem for deep learning algorithms, which assume that
the distribution is static during training. If the distribution of the test data varies a
lot from the training data’s distribution, we can expect poor results during prediction.
We can relate this to our example of the social network above: Figure 2.1.1 and Figure
2.1.3 are two very different snapshots of the same dynamic graph. If we train our deep
learning algorithm on the graph in Figure 2.1.1 it is unlikely that it will perform well on
the graph in Figure 2.1.3. For this reason, algorithms on temporal data usually need to
consider the distribution shift and re-train the deep learning algorithm when the shift is
too large [36]. One method for countering the distribution shift is Continual learning,
also called life-long learning. Continual learning is commonly used on streamed data,
and the goal is to continuously train on incoming nodes without making use of previous
data [35]. However, in this thesis, we will not use any methods for re-training the
learning algorithm as it can be expensive, and we are trying to avoid it by designing
algorithms that maintain performance over time.

Figure 2.1.3: Example of a growing social network. The social network in Figure 2.1.2 has grown
to also include George and Herb.

2.2 Graph Neural Networks


In this thesis, we are investigating prediction reliability in Graph Neural Networks.
Graph Neural Networks fall under a broader topic called graph representation learning
which also concerns other techniques for embedding and retrieving information from
graphs [37]. Graph representation learning came around to fill a gap in conventional
deep learning algorithms that effectively find hidden patterns in Euclidean data but
are incompatible with graph data.

Graph data is complex for two main reasons: (1) it is not Euclidean, which means that
there is an unlimited number of ways we can traverse a graph, and (2) the connections
between nodes are as important as the data held by the nodes. In Figure 2.2.1, we
see the same graph from two different perspectives. Conventional deep learning algo-
rithms are restricted to the Euclidean space and the order of the data points. Therefore,
they would learn multiple separate representations for the same node depending on the
order the nodes are seen [14].

To avoid the issue of the same node being interpreted differently depending on when
we see the node, we can instead embed the node in the Euclidean space. There are
multiple ways to create a node embedding, but the one criterion that most embedding
algorithms share is that two similar nodes should be close to each other in the embedding
space [14]. The similarity can be defined as closeness in the graph or as nodes sharing
similar structural properties, like a similar node degree.

Figure 2.2.1: Example of how a graph can be ordered in two different ways depending on where
we start.

In the early days, the graph representation field was mainly dominated by various ma-
trix factorization methods for generating graph and node embeddings. Examples of
these methods are node2vec and random walks [4]. The embeddings were then used
as input to a deep learning algorithm, like a Multilayer Perceptron. However, this
came with an issue: the embedding methods are commonly transductive, and we need
to learn entirely new embeddings as soon as the graph changes even slightly [14]. This
quickly becomes expensive in the case of dynamic graphs that are constantly changing.
To get around this issue, a new type of deep learning method was developed to find a
cheaper inductive method for embedding graphs and their components: GNNs [14].
GNNs use a message-passing framework to dynamically create embeddings from the
nodes in the graph based on the node’s neighborhood. Typical tasks for GNNs include
node property prediction, link prediction, and graph classification.

GNNs are a broad concept, and several variants exist. We can divide the main
ones into four categories [37]: (1) Recurrent Graph Neural Networks, (2) Convolu-
tional Graph Neural Networks (also known as GCN), (3) Graph Autoencoders, and (4)
Spatial-Temporal Graph Neural Networks. In this thesis, we only discuss GCNs.

2.2.1 Graph Convolutional Networks

Graph Convolutional Networks (GCN) were first developed by Kipf et al. [11] and are
among the earliest GNNs, with many other GNNs building upon them. In summary, the
GCN is a deep learning algorithm that leverages the graph structure and aggregates
information from a node’s neighborhood in a convolutional fashion. GCNs tend to
have great expressive power and are therefore used in a wide range of tasks.

Figure 2.2.2: Example of a convolution in an image (Euclidean data) and a graph. For the
convolution on the image, we retrieve a weighted average of the red pixel’s neighboring pixels’
features. The blue edges indicate the neighboring pixels. Similarly, we retrieve a weighted
average from the red node’s neighbors in the graph convolution. The figure is adapted from
[37].

The idea behind GCNs comes from the classic Convolutional Neural Network (CNN)
model, which is commonly used on grid-like data, such as images [39]. CNNs exploit
the basic properties of grid-like data to aggregate data and make predictions. For ex-
ample, if we want to make a prediction on an image, we know that the number of neigh-
boring pixels is always fixed, and we can decide the order in which we scan images (e.g.,
left to right, up to down). We use this information when we design how the CNN should
perform its convolutions on the image. In a convolution, the CNN aggregates the in-
formation from a neighborhood of pixels in the image and outputs a new super pixel
representing the neighborhood. This process is performed several times, and in the
end, we retrieve a low-dimensional embedding that can be used by another algorithm,
such as a Multilayer Perceptron. We can see how a convolution is performed in Figure
2.2.2 (left).

Unfortunately, the grid properties that we take advantage of in the CNN do not exist
in the context of graphs. For example, the number of neighboring nodes can vary, and
there is no set order in which we process a graph.

However, the convolution operation can be generalized to graph data. Instead of aggregating
pixels, we can learn a node representation by aggregating the target node's
features and the features of the neighboring nodes. The new node representation can
then be used to create a representation of the next node. This way, we can quickly
aggregate the information from a large neighborhood or whole graphs by stacking layers.
An overview of this process can be seen in Figure 2.2.3.

Figure 2.2.3: Message passing in a GCN. Here we want to make a prediction on node A by
aggregating information from A's neighbors. In Layer 2, target node A aggregates input from its
neighbors B, C, and D, and uses it to make a prediction. Before that, in Layer 1, B, C, and D
have themselves aggregated the node features from their neighbors in Layer 0. The figure is
adapted from [14].

In Algorithm 1, we see how the forward propagation is calculated. First, we set the
hidden node embedding for the 0-th layer to be the features of the nodes in the layer.
Then we run several iterations of neighborhood aggregation. Next, we retrieve the
hidden node embedding h_v^k for the k-th layer by retrieving a weighted average of the
input from the nodes of the previous layer, summing it with the features of the node we
are currently aggregating into, and passing it through an activation function. Finally,
after reaching the final layer K, we use the hidden node embedding for the K-th layer
to represent node v.
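
As a complement to the pseudocode in Algorithm 1 below, the following is a minimal NumPy sketch of one such aggregation step on a toy adjacency-list graph. The function and variable names are our own illustrative choices and not code from this project:

import numpy as np

def gcn_layer(features, neighbors, W, B, activation=np.tanh):
    """One round of neighborhood aggregation (line 4 of Algorithm 1).

    features:  dict mapping node id -> current embedding h_v^(k)
    neighbors: dict mapping node id -> list of neighbor ids N(v)
    W, B:      weight matrices for the neighborhood and the target node
    """
    new_features = {}
    for v, h_v in features.items():
        # Average the neighbors' embeddings, then mix in the node's own features.
        neigh_mean = sum(features[u] for u in neighbors[v]) / len(neighbors[v])
        new_features[v] = activation(W @ neigh_mean + B @ h_v)
    return new_features

# Toy example: a triangle graph with 2-dimensional node features.
rng = np.random.default_rng(0)
feats = {v: rng.normal(size=2) for v in "ABC"}
nbrs = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
W, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

h1 = gcn_layer(feats, nbrs, W, B)   # embeddings after one layer
h2 = gcn_layer(h1, nbrs, W, B)      # stacking layers widens the receptive field

Stacking two such calls corresponds to the two layers of message passing illustrated in Figure 2.2.3.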

Algorithm 1: The GCN forward propagation [11]

Data: X_v, the node features for the nodes at the 0-th layer; σ, a non-linear
activation function; W, the neighborhood weight matrix; N(v), a function to
retrieve the neighbors of node v; and B, the target node weight matrix.
Result: z_v, the embedding of node v.

1  h_v^0 ← X_v
2  k ← 0
3  while k < K do
4      h_v^{k+1} ← σ( W_k · Σ_{u ∈ N(v)} h_u^{(k)} / |N(v)| + B_k · h_v^{(k)} )
5      k ← k + 1
6  z_v ← h_v^{(K)}

The node degree's effect on prediction quality

In 2020, Tang et al. conducted a study on degree-related biases in GCNs [29]. The
study concluded that there is a bias towards high degree nodes when making predictions
using GCNs. The bias shows in that GCNs often achieve higher accuracy in high
degree nodes than in lower degree nodes. The bias was explained as being due to the
fact that the same high degree nodes occur more often during training as other con-
nected nodes poll them for information. Tang et al. used this information to build a
degree-aware GCNs and successfully used it to boost prediction accuracy in low de-
gree nodes. We explore this hypothesis when we design some of our own conformal
predictors.

2.2.2 GraphSAGE

The GraphSAGE algorithm developed by Hamilton et al. [6] is in essence an extension
of the GCN. What differs is that GraphSAGE uses a new way of sampling and aggregat-
ing node information. This makes GraphSAGE an inductive alternative to GCN that is
more efficient and usually gives higher accuracy. We will show the difference by briefly
describing the forward propagation of GraphSAGE.

The forward propagation of GraphSAGE is divided into three steps: (1) Sample the
nodes in the neighborhood of the target node, (2) aggregate the feature information
from the sampled nodes, and (3) use the aggregated information to perform the pre-
diction task. We can see an overview of the forward propagation in Figure 2.2.4.

Figure 2.2.4: Overview of the GraphSAGE algorithm with sampling. The goal is to predict node
A's label. In the first step, we sample three nodes among node A's neighbors and two nodes
from each of the sampled nodes' neighbors. In the second step, we aggregate the information
in a similar way as in the GCN. Finally, we predict node A's label using the hidden node
embedding we have aggregated. The figure is adapted from [6].

Sampling nodes is an optional first step that we can use to reduce computational complexity
[6]. We can see an example of the sampling in Figure 2.2.4. We sample a predefined
number of nodes from each layer, starting from the target node. The nodes
are randomly sampled with equal probability. After sampling nodes among the target
node’s neighbors, we sample nodes in the following layer by sampling neighbors from
the sampled nodes in the first step. Thus, it continues until we reach the predefined
sample size.

In the second step, we aggregate the information from the sampled nodes using a simi-
lar algorithm to that of the GCN. The main difference is not how the node information
is propagated but how we calculate the hidden node embedding. We simply change
line 4 in Algorithm 1 to the following instead:

h_v^k = \sigma\left(W^k \cdot \mathrm{Concat}\left(h_v^{k-1}, \mathrm{Aggregate}(h_u^{k-1}, \forall u \in N(v))\right)\right) \qquad (2.1)

where Concat represents the concatenation operation, which connects the neighbor’s
embedding and node v's embedding, N(v) represents all the neighbors of node v, and
Aggregate aggregates the features of all neighboring nodes. As we can see, the main
difference between the aggregation in GraphSAGE and GCN is that GraphSAGE con-
catenates the aggregated neighborhood vector with the node’s feature vector rather
than just summing the two. GraphSAGE also uses a more flexible feature aggrega-
tion function than the GCN. There are four feature aggregation methods for users to
choose from to cater to different scenarios, including the Mean aggregator, GCN aggregator,
LSTM aggregator, and Pooling aggregator. This makes the GraphSAGE algorithm
much more expressive than the GCN.
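
To illustrate Equation (2.1), the sketch below implements a GraphSAGE-style convolution with a mean aggregator and optional uniform neighbor sampling. It is a simplified illustration under assumed names and shapes, not the implementation used in this thesis:

import numpy as np

def sage_layer(features, neighbors, W, activation=np.tanh, sample_size=None, rng=None):
    """A GraphSAGE-style convolution (Equation 2.1) with a mean aggregator.

    Because the neighborhood vector is concatenated with the node's own embedding,
    W must have shape (d_out, 2 * d_in).
    """
    rng = rng or np.random.default_rng()
    out = {}
    for v, h_v in features.items():
        nbrs = neighbors[v]
        if sample_size is not None and len(nbrs) > sample_size:
            # Optional step 1: uniformly sample a fixed number of neighbors.
            nbrs = list(rng.choice(nbrs, size=sample_size, replace=False))
        agg = sum(features[u] for u in nbrs) / len(nbrs)      # Aggregate(h_u, u in N(v))
        out[v] = activation(W @ np.concatenate([h_v, agg]))   # sigma(W . Concat(...))
    return out

Replacing the mean with an LSTM or pooling function would give the other aggregator variants mentioned above.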

2.3 Conformal predictors


Conformal prediction is a method used to determine precise confidence levels in pre-
dictions [27]. Given a method for making a prediction ŷ, if we specify the confidence
level to be 95%, the conformal prediction algorithm will produce a prediction set that
includes the target label 95% of the time. The confidence level is a setting we can
adjust at any time, so we can also specify the confidence level to be 99% to retrieve
the 99% confidence interval. Conformal prediction applies to both classification and
regression tasks. In the classification case, the prediction set is a set of labels; in the
regression case, the prediction set is an interval.

The conformal prediction algorithm uses a nonconformity measure A and a set of cal-
ibration data points Z. The nonconformity measure determines how similar an in-
coming node n is to the data points in the calibration set. n receives a nonconformity
score A(Z, n) that is compared to the nonconformity scores of the calibration set. If n’s
nonconformity score is lower than a predefined fraction of the scores from the other
data points, the label coupled with the nonconformity score is added to the prediction
set. The predefined fraction that we need to overcome is determined by the signifi-
cance level, which is equal to 1 minus the confidence level. For instance, if we set the
confidence level to 95% (0.95), the significance level would be 0.05.

We can see how we implement conformal prediction in Algorithm 2. In the algorithm,
we want to decide whether to include y in our prediction set Γ^ϵ after seeing
the new object x_n. First, we need to compute the nonconformity scores α for the previous
examples Z_1, ..., Z_{n−1} that we are comparing with (a.k.a. the calibration set) and the
proposed example (x_n, y). When we have the nonconformity scores, we compare the
nonconformity score α_n for example Z_n with the scores of the other examples α_1, ..., α_{n−1}.
If α_n is smaller than at least a fraction ϵ of α_1, ..., α_{n−1}, we add y to the prediction set.

Algorithm 2: The Conformal Algorithm [27]

Data: nonconformity measure A, significance level ϵ, examples Z = {z_1, ..., z_{n−1}},
object x_n, and the label y
Result: Decision whether to include y in Γ^ϵ(z_1, ..., z_{n−1}, x_n)

1  Provisionally set Z_n := (x_n, y);
2  For i = 1, ..., n, set α_i = A(Z \ {z_i}, z_i);
3  p_y = #{i = 1, ..., n−1 | α_i ≥ α_n} / n;
4  if p_y > ϵ then
5      Include y in Γ^ϵ(z_1, ..., z_{n−1}, x_n)

We deem a conformal predictor valid if the outputted prediction sets contain the target
label at least as often as the chosen confidence level, for example 95% of the time.
Validity is a cornerstone safety criterion when designing conformal predictors.

Another vital measure when looking at the performance of conformal predictors is efficiency.
A conformal predictor is deemed efficient if it outputs a small prediction set.


Efficiency is an important criterion to counterbalance the validity criteria. If we only
looked at the validity criteria, we could always output the maximum prediction set, i.e.,
all the possible target labels and the predictor would always be valid. However, these
predictions would not be very informative, so we need to look at the efficiency and find
a suitable trade-off.
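
As a small illustration of the two measures, the helper below computes the empirical coverage (validity) and the average prediction set size (efficiency) for a batch of predictions; it is an illustrative sketch rather than code from the thesis:

def coverage_and_efficiency(prediction_sets, true_labels):
    """Empirical coverage and average prediction set size."""
    hits = sum(y in s for s, y in zip(prediction_sets, true_labels))
    coverage = hits / len(true_labels)
    avg_size = sum(len(s) for s in prediction_sets) / len(prediction_sets)
    return coverage, avg_size

# Three test objects: two prediction sets contain the true label, one (empty) does not.
print(coverage_and_efficiency([[0, 2], [1], []], [0, 1, 2]))  # -> (0.666..., 1.0)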

The validity guarantee of the conformal prediction algorithm comes from a core con-
cept that the algorithm relies on. The concept is that all data points in the calibration
set are exchangeable, which holds, for example, if they are independent and identically
distributed (i.i.d.). The conformal prediction algorithm can not promise that the predic-
tor is valid if this does not hold. This is an issue when we apply conformal prediction to
time series. In time series, the underlying distribution of the data points may change
over time, and thus the data points are not i.i.d. [2]. This also holds for graphs that
evolve over time.

The original conformal prediction algorithm, sometimes called full conformal predic-
tion or transductive conformal prediction, was based on the assumption that we used
the training data as our calibration set. This means we need to retrain the underlying
model whenever we want to retrieve a prediction set. While the transductive confor-
mal predictor produces precise prediction sets, it is also computationally expensive.
Therefore, another algorithm called split conformal prediction or inductive confor-
mal prediction was developed to avoid the computational cost. The algorithm splits
the training set into a proper training set and a calibration set [20]. We train the
underlying model on the proper training set, and we use the calibration set when we
make predictions to calculate the prediction set. By splitting the training set, we do
not need to retrain the underlying model for every prediction, which reduces the com-
putational cost. However, we will also retrieve less precise prediction sets as the cali-
bration set will not be as finely-tuned to the underlying model. As inductive conformal
prediction is more fitting for real-life scenarios where we can not retrain before ev-
ery prediction, we use it in our experiments and refer to it as the Inductive Conformal
Predictor (ICP).
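
To make the inductive procedure concrete, the following is a minimal sketch of an ICP for classification. It assumes the common nonconformity measure "one minus the underlying model's predicted probability for the candidate label", which is an illustrative choice rather than necessarily the measure used later in this work:

def calibrate(probs_cal, labels_cal):
    """Nonconformity scores for the calibration set: 1 - p(true label)."""
    return [1.0 - p[y] for p, y in zip(probs_cal, labels_cal)]

def predict_set(alphas_cal, probs_new, epsilon):
    """Prediction set for one test object at significance level epsilon (Algorithm 2)."""
    n = len(alphas_cal) + 1
    prediction_set = []
    for label, p in enumerate(probs_new):
        alpha_new = 1.0 - p                                   # provisional nonconformity score
        p_value = sum(a >= alpha_new for a in alphas_cal) / n
        if p_value > epsilon:                                 # keep labels that conform enough
            prediction_set.append(label)
    return prediction_set

# Toy usage: three classes and four calibration points scored by the underlying classifier.
probs_cal = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.6, 0.3, 0.1]]
labels_cal = [0, 1, 2, 0]
alphas = calibrate(probs_cal, labels_cal)
print(predict_set(alphas, [0.8, 0.15, 0.05], epsilon=0.2))    # -> [0], a singleton prediction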

2.3.1 Class-Conditional Conformal Predictors

Class-Conditional Conformal Predictors (CCCP), also called Mondrian conformal predictors,
were developed to deal with the case of imbalanced data sets [32]. The CCCP
divides the data into non-overlapping categories according to their label, and a sepa-
rate confidence level is set for each class. By this means, as long as the exchangeability
requirement holds within each class, we effectively create a separate valid conformal
predictor for each class, and we can guarantee validity for all classes.

The CCCP algorithm is very similar to the standard CP algorithm. As we can see in
Algorithm 3, the only change is that we add a new first step where we limit the examples
(a.k.a. the calibration set) to only include examples where y_i = y. This way, we retrieve
a nonconformity score only for examples with label y, and we only compare α_n
with those examples.

Algorithm 3: The Class-conditional Conformal Algorithm

Data: nonconformity measure A, significance level ϵ, examples Z = {z_1, ..., z_{n−1}},
object x_n, and the label y
Result: Decision whether to include y in Γ^ϵ(z_1, ..., z_{n−1}, x_n)

1  Set Ẑ := {z_i | z_i ∈ Z ∧ y_i = y}
2  Provisionally set Ẑ_n := (x_n, y)
3  For i = 1, ..., n, set α_i = A(Ẑ \ {ẑ_i}, ẑ_i)
4  p_y = #{i = 1, ..., n−1 | α_i ≥ α_n} / n
5  if p_y > ϵ then
6      Include y in Γ^ϵ(z_1, ..., z_{n−1}, x_n)


2.3.2 Attribute-Conditional Conformal Predictors

Mondrian conformal predictors do not necessarily need to divide the data into cate-
gories based on the class labels, as described in the section above. Instead, we can also
choose to split the data based on some other property that produces non-overlapping
categories [31]. When we have a conformal predictor that splits the data based on another
property than the class label, we call it an Attribute-conditional Conformal predictor.
Besides using another property than the class label, the algorithm looks the
same as the Class-conditional Conformal Algorithm described in Algorithm 3. As we
can see in Algorithm 4, we only change the first step to split the calibration set based
on the category-forming j-th property of the prediction object.

Algorithm 4: The Attribute-conditional Conformal Algorithm

Data: nonconformity measure A, significance level ϵ, examples Z = {z_1, ..., z_{n−1}},
object x_n (x_n^j is the j-th property of x_n), and the label y
Result: Decision whether to include y in Γ^ϵ(z_1, ..., z_{n−1}, x_n)

1  Set Ẑ := {z_i | z_i ∈ Z ∧ x_i^j = x_n^j}
2  Provisionally set Ẑ_n := (x_n, y)
3  For i = 1, ..., n, set α_i = A(Ẑ \ {ẑ_i}, ẑ_i)
4  p_y = #{i = 1, ..., n−1 | α_i ≥ α_n} / n
5  if p_y > ϵ then
6      Include y in Γ^ϵ(z_1, ..., z_{n−1}, x_n)
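
A minimal sketch of the class-conditional calibration in Algorithm 3 could look as follows (illustrative names, same assumed nonconformity measure as in the sketch above). Algorithm 4 is obtained by grouping the calibration examples on another attribute, such as a binned node degree, instead of on the label:

from collections import defaultdict

def calibrate_per_class(probs_cal, labels_cal):
    """Class-conditional calibration: one list of nonconformity scores per label."""
    scores = defaultdict(list)
    for p, y in zip(probs_cal, labels_cal):
        scores[y].append(1.0 - p[y])
    return scores

def cccp_predict_set(scores_per_class, probs_new, epsilon):
    """Algorithm 3: each candidate label is compared only against its own class."""
    prediction_set = []
    for label, p in enumerate(probs_new):
        cal = scores_per_class[label]
        alpha_new = 1.0 - p
        p_value = sum(a >= alpha_new for a in cal) / (len(cal) + 1)
        if p_value > epsilon:
            prediction_set.append(label)
    return prediction_set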

2.4 Related work


In this section, we will discuss other work that relates to quantifying uncertainty in
deep learning algorithms and specifically GNNs.

2.4.1 Countering the distribution shift in conformal prediction

In a recent study called Conformal prediction beyond exchangeability [2], Barber et
al. look at how to mitigate the coverage gap in conformal predictors caused by a distri-
bution shift in time series data. The coverage gap is the difference between how often
we include the target label in our predictions and the validity threshold set by the con-
fidence level. Barber et al. argue that the coverage gap is caused by the fact that the
distribution of the calibration set is too different from the distribution of the sample
(the incoming data point we run our prediction on), i.e., the calibration set fails the
exchangeability requirement.

Based on this insight, Barber et al. lift the exchangeability requirement and propose the
following method for bounding the coverage gap in the prediction intervals produced
by a conformal predictor for the regression task:

\text{Coverage gap} \leq \frac{\sum_{i=1}^{n} w_i \cdot d_{TV}(Z, Z_i)}{1 + \sum_{i=1}^{n} w_i} \qquad (2.2)

where d_TV(Z, Z_i) is the total variation distance between the data point Z_i and the other
data points. d_TV(Z, Z_i) can be seen as a measure of the distribution shift between Z and
Z_i. If the data is truly exchangeable, d_TV will be 0, but otherwise, it will increase as the
distribution shift grows larger. w is the set of weights we apply to the data points to
reduce the coverage gap, with w_i ∈ [0, 1]. Hence, we need to pair a low weight w_i with
a large d_TV(Z, Z_i) to retrieve a low coverage gap. Due to the way the equation is formulated,
it works well both when there is no distribution shift (the coverage gap will be
0) and in the case where there is a shift and the coverage gap will be bounded.

Barber et al. also show how the method for bounding the coverage gap can be applied
to several common conformal prediction algorithms. In the following equation we can
see how the bound is applied when computing the prediction interval Ĉ(Xn+1 ) using
an inductive conformal predictor:

\hat{C}(X_{n+1}) = \hat{\mu}(X_{n+1}) \pm Q_{1-\epsilon}\left(\sum_{i=1}^{n} \tilde{w}_i \cdot \alpha_i + \tilde{w}_{n+1} \cdot \alpha_{+\infty}\right) \qquad (2.3)

where µ̂ is the estimated y value outputted by the underlying regression algorithm,
Q_{1−ϵ} is the (1 − ϵ)-quantile, and w̃ are the normalized weights. The weights are
normalized as follows:

\tilde{w}_i = \frac{w_i}{w_1 + \ldots + w_n + 1}, \quad i = 1, \ldots, n, \qquad \tilde{w}_{n+1} = \frac{1}{w_1 + \ldots + w_n + 1} \qquad (2.4)
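
The weighted quantile in Equations (2.3) and (2.4) can be sketched as follows. The residual scores and weights below are placeholders, and the sketch is a simplified illustration rather than Barber et al.'s reference implementation:

def weighted_interval(mu_hat, alphas, weights, epsilon):
    """Prediction interval of Equation (2.3) for one test point.

    alphas:  nonconformity scores (e.g. absolute residuals) of the n calibration points
    weights: the unnormalised weights w_1..w_n assigned to those points
    """
    total = sum(weights) + 1.0
    # Normalised weights (Equation 2.4); the last one belongs to the score +infinity.
    w_tilde = [w / total for w in weights] + [1.0 / total]
    scores = list(alphas) + [float("inf")]
    # Weighted (1 - epsilon)-quantile of the score distribution.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = float("inf")
    cum = 0.0
    for i in order:
        cum += w_tilde[i]
        if cum >= 1.0 - epsilon:
            q = scores[i]
            break
    return mu_hat - q, mu_hat + q

# Equal weights recover the ordinary inductive interval.
print(weighted_interval(mu_hat=10.0, alphas=[0.5, 1.0, 2.0, 4.0],
                        weights=[1.0, 1.0, 1.0, 1.0], epsilon=0.25))  # -> (6.0, 14.0)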

While Barber et al. provide a robust mathematical framework for bounding the cov-
erage gap, one missing part is how we can apply the method in various scenarios. The
cornerstone of their algorithm is the weights that we set to the data points to lower the

19
CHAPTER 2. THEORETICAL BACKGROUND

effect that distant data points have on the prediction result. However, Barber et al. do
not provide a framework for deciding the weights. In their experiment on time series,
they base their weights on the closeness in time between the data points. While this
can work well for time series, it does not carry over to arbitrary data sets.

For the above reason, and because Barber et al.’s method is designed for the regression
task, we can not directly apply these learnings to our work. However, we take inspi-
ration from Barber et al.’s method when we design our novel conformal predictors in
Chapter 4.

2.4.2 Spatial conformal prediction

Spatial conformal prediction is a new conformal prediction algorithm that produces
prediction intervals based on a data point and its spatial location [17]. The method
can produce valid and efficient prediction intervals when data points are either glob-
ally or locally approximately exchangeable. Spatial conformal prediction is a weighted
conformal prediction algorithm similar to Barber et al.’s method. However, here we
set the weights of the calibration data points based on their closeness to the sample
data point. A special case of this is when we set weight 1 to nearby nodes (e.g., nodes
within a predefined radius) and weight 0 to all other nodes.
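
This radius-based special case can be written as a small helper; the Euclidean distance and the hard cut-off below are our illustrative choices:

import math

def radius_weights(cal_points, test_point, radius):
    """Weight 1 for calibration points within `radius` of the test point, 0 otherwise."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [1.0 if dist(p, test_point) <= radius else 0.0 for p in cal_points]

# Two of the three calibration points fall inside a radius of 1.5 around the test point.
print(radius_weights([(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)], (0.5, 0.5), 1.5))  # -> [1.0, 1.0, 0.0]

Such weights could then be plugged into a weighted conformal predictor like the one sketched in Section 2.4.1.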

Like Barber et al.’s method, spatial conformal prediction is mainly designed for the
regression task and is therefore not directly applicable to our work. However, we take
inspiration from spatial conformal prediction when designing the Embedding distance
Weighted Conformal Predictor.

2.4.3 Methods for quantifying uncertainty in deep learning

In our research, we have only been able to find a few cases where conformal prediction
has been applied to GNNs. However, in most of these studies, including a study from
2020 [38], conformal prediction was applied to GNNs without a more profound
analysis of the CP method itself.

Conformal prediction is not the only way to quantify uncertainty in predictions. Most
of the research we have found on quantifying uncertainty in GNNs is based on Bayesian
learning, such as [10, 12, 30], where Bayesian concepts are leveraged to quantify pre-
diction reliability and boost the accuracy of predictors.


Another method used to quantify prediction reliability is Venn prediction [31]. Venn
predictors work similarly to CCCPs: we divide the old objects into categories, somehow
classify the new object into one of the categories, and then use the frequencies of labels
in the chosen category as probabilities for the new object’s label.

Chapter 3

Method

This chapter discusses the methodology used when conducting our research and de-
scribes our experiments. We will start by continuing our discussion from Section 1.6
about the research method we chose for this thesis. After that, we will describe in detail
how we have applied the research method in our experiments.

3.1 Choice of research method

When deciding on the research method for this thesis, we took advice from Anne Håkans-
son’s paper on research methodologies applied to software engineering [5]. According
to Håkansson, most studies are either of a quantitative or a qualitative nature, how-
ever, some belong in between, and so does this thesis.

We considered using a purely quantitative approach for this work. In that case, we
would have only applied existing methods for quantifying uncertainty in graph pre-
dictions and measured their performance over many data sets. However, this brings
two caveats: (1) we have not been able to find methods that use the graph connectivity
information, and (2) we have not been able to find enough data sets to bring statistical
significance to the results. On the other hand, using a purely qualitative approach, such
as performing a survey on existing approaches, would not bring much value either, as
there is limited work on quantifying uncertainty in graph predictions. So instead, this
work has to be a mix of quantitative and qualitative research to bring significant results.
In our experiments evaluating the novel methods, we retrieve quantitative results in
the form of performance metrics. By exploring the metrics, we observe patterns in the
data from which we can contrive qualitative results that can guide further research in
the area.

Based on the above, we saw two possible research methods for this study: (1) the ap-
plied research method, and (2) the fundamental research method [5]. We can use
the applied research method to answer specific questions originating from previous
research. However, as we described above, there is limited research on applying con-
formal prediction to GNNs, and there are therefore no questions from previous research to
answer. The research question that we try to answer in this thesis is also open-ended,
making the applied research method even more unsuitable. Instead, we decided on the
fundamental research method, a curiosity-driven method used to generate new ideas
and theories. This suits our work well as we propose novel methods for leveraging the
graph connectivity information to produce valid and efficient conformal predictors for
graph predictions. The novel methods originate from our own ideas, but they are based
on theories and inspired by other research.

3.2 Application of research method


We are conducting two case studies on static and dynamic graphs to answer our re-
search question. A case study is an empirical investigation of a particular phenomenon
based on both quantitative and qualitative evidence [5]. We believe that the require-
ments of conformal predictors in static and dynamic graphs are widely different due
to the distribution shift in dynamic graphs. In our two case studies, we look at various
approaches to improve prediction performance and their cost. The sections below will
detail how we conducted the case studies.

3.2.1 Overview of the case studies

We conducted two case studies in this thesis, one on static graphs and another on dy-
namically evolving graphs.

In the case study on static graphs, we wanted to investigate Conformal Prediction’s
general applicability to GNNs. Specifically, we investigated whether we could build a
valid and efficient conformal predictor without adding too much computational over-
head. We discuss the evaluation used in the case studies in Section 3.2.5.

As part of the case study, we run a single experiment on several data sets of static
graphs. We train a GraphSAGE classifier in the experiment and set up several
conformal predictors using conventional and novel algorithms. We then capture the
evaluation metrics over several rounds, and between every run, we resample the nodes
used for testing and calibrating the conformal predictors. After we have captured the
evaluation metrics, we plot the metrics, as can be seen in Chapter 5, and observe any
phenomena in the performance of the predictors.

In the second case study, we focused on methods for mitigating the coverage gap in
conformal predictors caused by the distribution shift in dynamic graphs. The goal of
the experiment was to evaluate if any of our CP methods could avoid the need for re-
training the GraphSAGE classifier and re-calibrating the conformal predictors when-
ever the distribution changes.

We conducted an experiment similar to the experiment on static graphs. The main
difference in this experiment is that we divided the graph into multiple snapshots rep-
resenting the state of the full graph at various time steps. For instance, in the exper-
iment on the OGB Arxiv data set, we had a snapshot for each year between 2010 and
2020. The snapshot included all nodes and edges that were added before or during
the corresponding year, such that the snapshot for 2011 included nodes from 2011 and
before. We trained our GraphSAGE classifier on the first snapshot. We also sampled
the conformal predictors’ calibration set from this snapshot. We then measured the
conformal predictors’ performance for each snapshot.

3.2.2 Data sources

In the case studies, we used five public data sets for the classification task to evaluate
the performance of our proposed methods. In the case study on static graphs, we used
the three data sets listed in Table 3.2.1, and we used the two data sets listed in Table
3.2.2 for the study on dynamic graphs. As we can see, the OGB Arxiv data set is part of
both lists. We can usually treat dynamic graphs as static graphs by only considering a
single snapshot, such as the final snapshot, which is why we can use the OGB Arxiv data
set for both case studies. However, we could not treat a static graph as a dynamic graph
in our experiments, as we rely on a temporal feature to create the snapshots.


Data set # nodes # edges # node features # classes


Cora (Planetoid) 2,708 10,556 1,433 7
PubMed (Planetoid) 19,717 88,648 500 3
OGB Arxiv 169,343 1,166,243 128 40

Table 3.2.1: Data sets used in the case study on static graphs.

Data set # nodes # edges # node features # classes


OGB Arxiv 169,343 1,166,243 128 40
Bitcoin Elliptic 203,769 234,355 166 2

Table 3.2.2: Data sets used in the case study on dynamic graphs.

Cora and PubMed

The Cora and PubMed data sets were both used in the case study on static graphs. The
data sets belong to the Planetoid data sets [18] and are commonly used in research on
GNNs, such as the paper that introduced the GCN [11]. In that paper, they measured
an accuracy score of 81.5% for the Cora data set and 79.0% for PubMed.

The Cora and PubMed data sets both represent citation networks. Each node repre-
sents a published paper, and an edge represents the citation of a paper by another
paper. The node features form a word vector representing the presence or absence of
each word in the paper. The task on both data sets is multiclass classification.

As we can see in Table 3.2.1, the Cora data set is relatively small in size yet complex
as each node has half as many features as there are nodes in total, and there are seven
possible classes. Due to the small size, we had to be cautious when splitting the data set
into a training, a calibration, and a test data set. We had to maintain a large enough
training data set to build an accurate classifier while keeping enough data for statis-
tically significant testing. While similar, the PubMed data set has fewer classes and
about ten times more data points than the Cora data set. For this reason, we do not
need to worry as much about the data set size when splitting it [11].

Figure 3.2.1: Class distributions for the Cora and PubMed data sets.

OGB Arxiv

The OGB Arxiv data set belongs to the Open Graph Benchmark collection of data sets
introduced in 2020 [9]. OGB Arxiv describes a citation network of papers published on
Arxiv.org. Each node is a paper, and edges represent citations between papers. Each
node also has a set of features: a word embedding of the paper's title and abstract.
The task coupled with the data set is to predict the category of each node. As the data set
belongs to the Open Graph Benchmark collection, there exist benchmarks for common
GNN algorithms, including GraphSAGE, whose benchmark is an accuracy score of
roughly 71.5% [9]. Beyond the feature set, each node also has a secondary feature called the node
year, which is the year when the paper was published. We use this feature to treat the
data set as a dynamic graph. However, we used the data set in both case studies on
static and dynamic graphs. In the static case study, we use the final snapshot of the fully
evolved graph, and in the study on dynamic graphs, we use snapshots between 2010
and 2020. As we can see in Figure 3.2.2, the OGB Arxiv data set is a great example of
a data set that is affected by a distribution shift. Over time, we see several almost non-
existent classes grow to become the largest share of nodes. This made it an exciting
data set for our case study on dynamic graphs.

Bitcoin Elliptic

The other data set used in the case study on dynamic graphs is the Bitcoin Elliptic data
set. The data set was introduced by the Elliptic co. on the website Kaggle in 2019 [3].
The data set describes a Bitcoin transaction network, and the task is binary classifica-
tion, where we mainly want to find illicit transactions. The data set has more nodes
than the Planetoid data sets but a much lower average node degree (1.2). The feature
set of each node is also more varied and includes information about the transaction
and neighboring transactions. The fact that each node already has some information
aggregated from its neighbors might make the use of a GNN redundant, which can also
be seen in previous benchmarks where standard neural networks achieve comparable
performance [3]. However, we still think that the data set is interesting for our case
study. The data set has a temporal time step feature with 49 different values. We use
this feature to create snapshots for the dynamic graphs. The snapshots for the Bitcoin
Elliptic data set are interesting because the new nodes added in each snapshot share no
edges with the previous nodes. Because they share no edges, we get a new distribution
shift every snapshot, and we wanted to see how this affected the performance of the
conformal predictors.

Figure 3.2.2: Class distribution for the OGB Arxiv data at various time steps.

3.2.3 Constructing the graphs

For most of the data sets, we used pre-built data loaders from either PyTorch Geomet-
ric [24], or the Open Graph Benchmark collection [9] to access the data sets. Beyond
downloading the data set, the data loaders also helped us construct the graph repre-
senting the data set. However, in the case of the Elliptic data set, we had to download
the raw data ourselves from Kaggle [3] and construct the graph. Furthermore, to rep-
resent a graph as a dynamically evolving graph, we also had to construct the snapshots
out of the graph. We will outline how we constructed the Bitcoin Elliptic graph and
created the snapshots for the dynamic graphs below.

Figure 3.2.3: Class distribution for the Bitcoin Elliptic data set at various time steps. Class −1
is nodes with an unknown label, Class 0 is licit nodes, and Class 1 is illicit nodes.

Constructing the Elliptic graph

To represent a graph used in a node property prediction task, we need three compo-
nents: (1) a list of all the edges connecting nodes, (2) the node features, and (3) the
target labels for the classification task. PyTorch Geometric [24] provides a Data object
that we used to gather this information to represent a graph. To construct the graph,
we had to retrieve the parts listed above and construct the Data object.

The Data object expects nodes to use incremental IDs starting from 0, but each node
was assigned a random ID in the raw data. Hence, to make the raw data compatible
with the Data object, we had to reassign new IDs to all nodes and update the edge list
accordingly.

Beyond reassigning the IDs and updating the edge list, we had to clean the node fea-
ture matrix. The raw data contains some features that should
be hidden from the GraphSAGE model so that it does not use them when making predictions.
The features that we removed were the node ID (we do not need it to access the node
features) and the time step where the node appears. We kept the time steps in a sep-
arate list so that we could use them to construct the snapshots for the dynamic graph
later on.

Lastly, we constructed the list of target labels. In the raw data the target label can take
on three values: unknown, 1 - illicit, and 2 - licit. We reindexed the target labels to
enable us to include the nodes with an unknown target label when making predictions,
while easily excluding them when calculating the loss during training and
measuring performance in the evaluation. We assigned these new labels to the nodes:
-1 - unknown, 0 - licit, and 1 - illicit.
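
To illustrate the steps above, a simplified sketch of how the graph can be assembled into
a PyTorch Geometric Data object is shown below. The toy inputs stand in for the raw
Kaggle data, and the variable names are ours rather than those of our actual implementation:

import torch
from torch_geometric.data import Data

# toy stand-ins for the raw Kaggle data (the real files are much larger)
raw_node_ids = [1001, 42, 77]                         # original, non-consecutive node IDs
raw_edges = [(1001, 42), (42, 77)]                    # edges using the original IDs
node_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # ID and time-step columns removed
raw_labels = ["unknown", "1", "2"]                    # unknown, 1 - illicit, 2 - licit

# reassign incremental IDs starting from 0 and update the edge list accordingly
id_map = {raw_id: new_id for new_id, raw_id in enumerate(raw_node_ids)}
edge_index = torch.tensor([[id_map[s] for s, _ in raw_edges],
                           [id_map[t] for _, t in raw_edges]], dtype=torch.long)

# reindex the target labels: -1 - unknown, 0 - licit, 1 - illicit
label_map = {"unknown": -1, "2": 0, "1": 1}
y = torch.tensor([label_map[label] for label in raw_labels], dtype=torch.long)

data = Data(x=torch.tensor(node_features, dtype=torch.float),
            edge_index=edge_index, y=y)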

Constructing the dynamic graphs

For the case study on dynamic graphs, we had to turn the static graphs given by the
data loaders into dynamically evolving graphs. As described in Section 2.1, we can
represent a dynamic graph G as a series of snapshots G = {G1 , ..., Gn } where Gn is the
snapshot representing the dynamic graph at time step n. As we only use graphs that
are growing, this is easy to construct. For each snapshot, we need to remove the nodes
and edges that did not exist at that time step from the original static graph. By putting
the snapshots into an ordered list, we can now use them to represent a growing graph.
This step was required for both the OGB Arxiv and the Bitcoin Elliptic data sets.
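
A sketch of this snapshot construction, using PyTorch Geometric's subgraph utility, is
shown below. It assumes that node_time holds each node's time step; the function name
is ours:

import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def build_snapshots(data, node_time, time_steps):
    # node_time[i] is the time step at which node i first appears in the graph
    snapshots = []
    for t in time_steps:
        mask = node_time <= t                  # nodes that exist at time step t
        edge_index, _ = subgraph(mask, data.edge_index, relabel_nodes=False)
        snapshots.append(Data(x=data.x, edge_index=edge_index, y=data.y,
                              node_mask=mask))  # remember which nodes are active
    return snapshots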

For the OGB Arxiv data set, we could not create snapshots from all available time steps,
as the early time steps have too few nodes to build an accurate classifier. Thus, for the
Arxiv data set, we merged all time steps before 2011 into a single time step that we call
2010 and count this as the first time step of the graph.

Splitting the data sets

Before we could start our experiments, we also had to split each data set into three: (1)
a training data set for the GraphSAGE model, (2) a test set used in the evaluation, and
(3) a calibration set used by the conformal predictors.

We dedicated a large share of the data set to evaluation and to the calibration set. We
used 20% of the nodes for the test set and an equal amount for the calibration set, leav-
ing 60% of the nodes for the training of the GraphSAGE model. The large share of nodes
dedicated to testing was required to get statistically significant results for the Cora and
PubMed data sets. The calibration set also required a large share for the conformal
predictors to perform well on these data sets. We sampled the nodes for the subsets
randomly to preserve the original distribution of the nodes in each data set.

When we split the data sets, we also had to decide how to handle the edges between
the subsets. For the training set, we removed all edges to the test and calibration sets
as we did not want the GraphSAGE model to learn the representation of these nodes.
This is a required step to keep the inductive abilities of the model [13]. However, for
the test and calibration set, we included edges to nodes in the other sets to aggregate
information from the other nodes in predictions but without, for example, counting
the training nodes in the evaluation.
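
A sketch of this split and edge handling is shown below; it is a simplification of our setup,
and the names are illustrative:

import torch
from torch_geometric.utils import subgraph

def split_nodes(data, train_frac=0.6, cal_frac=0.2):
    perm = torch.randperm(data.num_nodes)
    n_train = int(train_frac * data.num_nodes)
    n_cal = int(cal_frac * data.num_nodes)
    train_idx = perm[:n_train]
    cal_idx = perm[n_train:n_train + n_cal]
    test_idx = perm[n_train + n_cal:]

    train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    train_mask[train_idx] = True
    # training graph: keep only edges between training nodes to stay inductive;
    # test and calibration predictions still use the full edge_index so that
    # information can be aggregated from nodes in the other subsets
    train_edge_index, _ = subgraph(train_mask, data.edge_index, relabel_nodes=False)
    return train_idx, cal_idx, test_idx, train_edge_index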

In the case study on dynamic graphs, we only sampled training and calibration nodes
from the first snapshot of the graph. The only difference is for the conformal predic-
tors with resampling, which re-calibrate at every snapshot and therefore require us to
resample the calibration set. The re-sampling was done by randomly sampling 20% of
the nodes available in the snapshot.

3.2.4 Models implementation

In the experiments, we produce our results using both a GCN and several CP algo-
rithms. In this section, we describe how we implement these algorithms.

Implementation of the GraphSAGE classifier

The GCN model we chose for our experiments is the GraphSAGE model described in
Section 2.2.2. We chose the GraphSAGE model as it is designed to generalize well on
new data and performs relatively well in relation to other GNN models on the OGB
Arxiv data set [9].

In our case studies, we build a classifier using the GraphSAGE algorithm described in
Section 2.2.2. We implement the GraphSAGE algorithm using the PyTorch Geometric
[24] framework. There are multiple parts to the implementation. In this section, we are
going to focus on describing how we implemented the model’s forward propagation
function, training, and how we make predictions.

As we described in Section 2.2.2, GraphSAGE can optionally sample the neighboring
nodes used for a prediction in a target node as part of the forward propagation. This
can be used to bound the computational complexity and memory usage of the model.
However, this can also cause reduced accuracy [6]. We had some issues getting an
accuracy over 50% on the OGB Arxiv data set when using sampling, and the other data
sets are too small for the computational complexity or memory usage to be a burden,
which is why we refrained from using sampling in our experiments.

The model’s forward function describes how we turn data into predictions. The forward
function consists of multiple layers. All but the final layer look as follows:

1. We perform a GraphSAGE convolution on the data.

2. The data from the convolution is normalized.

3. If we are training the model, we perform dropout, where we leave out parts of the
node’s hidden embedding.

4. The data is forwarded to the next layer.

We still perform a GraphSAGE convolution on the data in the final layer. However, in-
stead of normalizing the data and performing dropout, we feed it to a Softmax function
which gives us the class probabilities. We also provide an option to return the node em-
bedding, the result of the last convolution, together with the class probabilities. This
allows us to construct the Embedding distance Weighted Conformal Predictor (EWCP)
later on.
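
A condensed sketch of such a classifier is shown below. It mirrors the layer structure
described above but is not a line-for-line copy of our implementation; in particular, the
returned embedding is assumed to be the hidden representation fed into the final
convolution, and the option name return_embedding is ours:

import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGEClassifier(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, num_layers=3, dropout=0.5):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [num_classes]
        self.convs = torch.nn.ModuleList(
            SAGEConv(dims[i], dims[i + 1]) for i in range(num_layers))
        self.norms = torch.nn.ModuleList(
            torch.nn.BatchNorm1d(dims[i + 1]) for i in range(num_layers - 1))
        self.dropout = dropout

    def forward(self, x, edge_index, return_embedding=False):
        for conv, norm in zip(self.convs[:-1], self.norms):
            x = conv(x, edge_index)                                    # GraphSAGE convolution
            x = norm(x)                                                # normalize the hidden embedding
            x = F.dropout(x, p=self.dropout, training=self.training)   # dropout only when training
        emb = x                                       # hidden node embedding (used by the EWCP)
        logits = self.convs[-1](emb, edge_index)      # final GraphSAGE convolution
        log_probs = F.log_softmax(logits, dim=-1)     # class (log-)probabilities
        return (log_probs, emb) if return_embedding else log_probs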

To train the GraphSAGE model, we feed data to the model, which will process it and
give back the class probabilities for each sample. Then, we compare the class prob-
abilities with the true class labels and retrieve the loss using PyTorch’s negative log-
likelihood loss function. Next, an optimizer uses the loss to update the model’s weights.
We used the Adam optimizer provided by PyTorch.
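
A minimal sketch of a single training step under this setup could look as follows,
assuming data is a PyTorch Geometric Data object with a boolean train_mask and that the
model outputs log-probabilities:

import torch
import torch.nn.functional as F

def train_step(model, data, optimizer):
    model.train()
    optimizer.zero_grad()
    log_probs = model(data.x, data.edge_index)       # forward pass
    loss = F.nll_loss(log_probs[data.train_mask],    # negative log-likelihood loss
                      data.y[data.train_mask])       # on the training nodes only
    loss.backward()                                  # backpropagate
    optimizer.step()                                 # let Adam update the model's weights
    return float(loss)

# usage: optimizer = torch.optim.Adam(model.parameters(), lr=0.01)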

We make predictions using the model similarly to how we train it, except that we
do not retrieve the loss or update the model's weights. Instead, we return the class
probabilities, which we compare with the true labels of the test set to retrieve
the performance metrics.

Figure 3.2.4: Example of a single GraphSAGE convolution. The target node A and its neigh-
bors' node features are aggregated in the hidden embedding hA . Then we retrieve the batch
normalized version ĥA . If we are training, we leave out one of the fields in ĥA by setting it to
0. Finally, we either forward the hidden embedding for A to the next layer, or we pass it to a
Softmax function to make a class prediction.

Implementation of the Conformal predictors

We test both conventional and novel CP methods in the case studies. As we describe
in Section 2.3, the conformal prediction algorithm relies on a noncon-
formity measure that tells us how similar an incoming data point is to previous data
points. We only include classes with a low nonconformity score in the prediction set.
Hence, a common way to form a nonconformity measure is to use the error of the deep
learning algorithm’s prediction [19]. We calculate the error for the classification task
as follows:

αi = 1 − ŷi

Where αi is the nonconformity score for sample i, and ŷi is the class probability that the
GraphSAGE classifier outputs for the label in question. We use this nonconformity measure
for all our conformal predictors. Instead, the CP methods vary in how they use the
nonconformity scores to retrieve the prediction set, which is what we will describe in
the sections below.
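
To make this concrete, a sketch of the nonconformity scores and a plain inductive
prediction set computed from the classifier's class probabilities is shown below; the
function and variable names are ours:

import numpy as np

def calibration_scores(cal_probs, cal_labels):
    # nonconformity score of each calibration node: 1 minus the probability of its true class
    return 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

def icp_prediction_set(test_probs, cal_scores, epsilon=0.05):
    prediction_set = []
    n = len(cal_scores) + 1
    for label, prob in enumerate(test_probs):
        alpha = 1.0 - prob                        # nonconformity score if `label` were true
        p_value = np.sum(cal_scores >= alpha) / n
        if p_value > epsilon:                     # keep labels that conform well enough
            prediction_set.append(label)
    return prediction_set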

Conventional predictors In Section 2.3, we described the Inductive Conformal
Predictor and the Class-Conditional Conformal Predictor, which are both common CP
methods. The ICP is the traditional conformal predictor, and we use it as a baseline
against which we compare the other methods in our experiments.

The CCCP uses the implied class received from the GraphSAGE classifier to retrieve a
subset of the nonconformity scores in the calibration set, which is then used for predic-
tion. We use the CCCP in our experiments to investigate the relation between the node
class and the validity of the conformal predictor. The implied class from the Graph-
SAGE classifier is the class with the highest outputted class probability.

Novel predictors We present a set of novel conformal predictors and the reason-
ing behind them in Chapter 4. The novel methods all build on the notion that graph
connectivity information can be leveraged to improve validity and efficiency. This is
a key aspect of our research, and we test this by including the novel methods in our
experiments.

The Node degree-Conditional Conformal Predictor (NCCP) is the first novel method
that we use to test if we can achieve validity among nodes sharing a similar in-degree.
We divide the calibration and test nodes in a fixed number of bins depending on their
node degree to use this method. Then, we only use nonconformity scores from the
sample’s corresponding bin to retrieve the prediction set.

Another novel method that we test is the Node degree Weighted Conformal Predic-
tor (NWCP). Using this method, we want to see if we can lower the coverage gap in
predictors affected by distribution shift by leveraging the node degree. We base it on
the in-degree, similarly to the NCCP. The NWCP takes in both the node degree and the
nonconformity score of the sample. We retrieve a normalized node degree distance
between the sample and each calibration data point. We use the distances as weights on the cal-
ibration set's nonconformity scores when retrieving the prediction set. The distance is
calculated by first normalizing the node degrees d:

d˜ = d / max(d)

Where d˜ is the normalized node degrees and max(d) is the maximum node degree
found in the calibration set and the sample node. We then calculate the normalized
node degree distance between the sample and each calibration data point:


distancei = abs(d˜sample − d˜i )

Where d˜sample is the normalized node degree of the sample, and d˜i is that of the ith
calibration data point.

The final novel method is the EWCP. This method is similar to the NWCP but uses the
distance in the embedding space rather than the node degree distance. This means that
the predictor needs access to the node embeddings when retrieving the prediction set
for a sample. We implemented a special mode in the GraphSAGE classifier that returns
the node embedding together with the class probabilities when making a prediction.
The node embedding has as many dimensions as the dimension size of the hidden lay-
ers in the classifier. Therefore, we calculate the embedding space distance between the
sample node and the calibration set as a simplified version of the Euclidean distance
to reduce the computational cost. We calculate the simplified Euclidean distance like
so:


distancei = Σd abs(embi,d − embsample,d )

Where the sum runs over the embedding dimensions d, embi is the node embedding for the ith
calibration data point, and embsample is
the incoming sample’s embedding. We then use the distances to find calibration data
points within a predefined radius of the sample. Only the data points within the radius
are considered when producing the prediction set.

Predictors with resampling As part of our case study on dynamic graphs, we test
all conformal predictors with and without resampling the calibration set. Resampling
the calibration set acts like a middle-ground between re-training and only training on
the first graph snapshot. We resample the calibration set for each snapshot.

A note on performance Our implementations of the conformal predictors are not
optimized for speed, and we are only capable of retrieving a single prediction at a time
from the conformal predictors. This likely has a significant impact on the prediction
time and could probably be improved for most predictors.


3.2.5 Evaluation

In the case studies, we used multiple metrics to capture the following information:

1. The performance of the underlying model.

2. Validity and performance of the conformal predictor.

3. Performance of the conformal predictor over time in dynamically evolving graphs.

4. Computational overhead caused by the conformal predictors.

To measure the performance of the underlying GraphSAGE algorithm, we used met-
rics similar to the benchmarks of the data sets. We used the overall accuracy and the F1
score for the binary classification tasks. The overall accuracy is measured as the per-
centage of correctly classified labels, and it is a standard performance measure in the
machine learning field. The F1 score combines two other measures: recall and preci-
sion, to measure how well a predictor is performing on a particular class. We measured
a separate F1 score for each class in the binary setting. The F1 score is particularly well
suited for imbalanced data sets, as in the case of the Bitcoin Elliptic data set, where
only a fraction of samples are labeled illicit.

We used the overall accuracy and the Macro-average F1 score for the multiclass clas-
sification tasks. As described above, the conventional F1 score can only be used for a
single class at a time. In the case where we have many available classes, this gives us
a poor overview of the model’s performance. The macro-averaged F1 score solves this
by returning an average for all classes where each class has the same weight [16].
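
Both scores can be obtained directly from scikit-learn, for example (with toy labels):

from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 1, 0]    # true class labels (toy example)
y_pred = [0, 2, 1, 0, 0]    # predicted class labels
accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")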

For obtaining validity and performance of the conformal predictors, we used the met-
rics outlined in [33].

We verified the conformal predictors' validity by measuring their
average coverage. The coverage is the percentage of prediction sets that include the
true class label. We deem the predictor valid if the average coverage is greater or equal
to the specified confidence level.

We measure the efficiency of the conformal predictors, i.e., the size of the prediction
set, using three common efficiency measures:

1. Average prediction set size

2. Fraction of singleton prediction sets


3. Fraction of empty prediction sets

We used the average prediction set size to retrieve an overarching efficiency score,
and by comparing it to the number of possible classes, we get an understanding of the
predictor’s overall performance. Furthermore, we used the fraction of singleton pre-
dictions to understand how often the predictor gave us the ideal case, and the fraction
of empty predictions to see how often the predictor failed to give us any predictions at
all.
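
A small sketch of how these metrics can be computed from the produced prediction sets
is shown below (the function name is illustrative):

import numpy as np

def cp_metrics(prediction_sets, true_labels):
    sizes = np.array([len(s) for s in prediction_sets])
    coverage = np.mean([y in s for s, y in zip(prediction_sets, true_labels)])
    return {
        "coverage": coverage,                       # fraction of sets containing the true label
        "avg_set_size": sizes.mean(),               # average prediction set size
        "singleton_fraction": np.mean(sizes == 1),  # ideal, most informative predictions
        "empty_fraction": np.mean(sizes == 0),      # failed predictions
    }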

Finally, we evaluated the computational overhead caused by the conformal predictors
by capturing the average prediction latency. The prediction latency is the time added
by the conformal predictor and includes both preparations, for example, retrieving the
node degree, and computing the prediction set.

We used the same measures for evaluating the conformal predictors in the case study
on dynamic graphs. However, there we capture the performance for each snapshot
and observe how the metrics behave over time.

3.3 Quality assurance


The last step in the research methodology is quality assurance. We used quality as-
surance to ensure that we had correctly obtained our results. We based our quality
assurance on the requirements for deductive research [5]:

• We ensured the validity of our experiments by using popular open-source soft-
ware frameworks, measurements, and tools in as many places as possible. In
cases where this was not applicable, we implemented our algorithms strictly fol-
lowing the literature and rigorously tested the implementations.

• We ensured that our results were reliable by running the experiments many times
and measuring averages and standard deviations. We also ran the experiments
on a generic virtual machine hosted by a cloud provider.

• We ensured that our results were replicable by other researchers by providing
thorough explanations of our experiments and using standard and publicly avail-
able implementations and data sets to the largest possible extent.

• We ensured our work was ethical by using data sets without sensitive data and
not wasting resources due to poorly written software.


3.4 Experimental setup


We received an Oracle Cloud Starter Award to conduct our experiments, and for this
reason, we performed our experiments on virtual machines hosted by Oracle. The vir-
tual machine that we used had access to was relatively low-end and used a GPU from
an older generation. However, running the experiments using a GPU was much faster
than what it would have been using a CPU only.

Virtual machine specification


OS: Ubuntu 18.04
CPU: Intel Xeon Platinum 8167M. Base frequency 2.0GHz. OCPU count: 12
Memory: 72 GB
GPU: NVIDIA Tesla P100 16GB

To implement the experiments, we used Python 3.8 [25]. In addition, we heavily relied
on the PyTorch Geometric framework [24] to implement the GraphSAGE model. We
also used NumPy [8] for fast matrix calculations and scikit-learn [23] for some of the
metrics, such as the Macro F1 score described earlier.

Programming language
Language: Python 3.8
Notable packages:
• PyTorch Geometric [24]
• Numpy [8]
• scikit-learn [23]

We used the following hyperparameters for each of the experiments:


Cora PubMed Arx ArxD BE


Num. experiment re-runs 20 20 5 5 5
GraphSAGE - Num. layers 2 2 3 3 3
GraphSAGE - Hidden dimensions 256 256 256 256 256
GraphSAGE - Learning rate 0.01 0.01 0.01 0.01 0.01
GraphSAGE - Num. training epochs 100 100 200 300 100
CP - Confidence level 0.95 0.95 0.95 0.95 0.95
EWCP - Radius 300 300 300 1000 100
NCCP - Node degree bins [0, 1, 5, 10] [0, 1, 5, 10] [0, 5, 10, 20] [0, 5, 10, 20] [0, 5, 10, 20]

Table 3.4.1: Hyperparameters used for each data set. Arx is OGB Arxiv, ArxD is OGB Arxiv in
the dynamic case study, and BE is the Bitcoin Elliptic data set.

Chapter 4

Conformal predictors for graphs

In this thesis, one of the things we explore is the effect of using graph connectivity
information in order to improve validity and efficiency in conformal predictors. In this
chapter, we describe our three novel methods for Conformal Prediction: the NCCP, the
NWCP, and the EWCP.

4.1 Node degree-Conditional Conformal Predictor

A conformal predictor’s validity is tightly coupled with the preciseness of the noncon-
formity measure used to produce prediction sets. In our case and many other cases, the
nonconformity measure is computed as the error from an underlying deep learning al-
gorithm. As we bring up in Section 2.2.1, a study has shown that Graph Convolutional
Networks tend to have a higher accuracy on nodes with a high degree than low degree
nodes [29]. Therefore, we can assume that a conformal predictor that bases its non-
conformity measure on the output of a GCN is more likely to be valid for higher degree
nodes than low degree nodes.

We relate this to the issue of imbalanced data sets, where we can have a conformal
predictor that achieves general validity even though it is invalid for the minority class.
In Section 2.3.1, we describe the Class-Conditional Conformal Predictor, which solves
this issue by dividing the data space into Mondrian categories based on the class labels.
Next, it forms a separate valid conformal predictor for each label. Thus, we can achieve
validity for all classes. We also know that we can form Mondrian categories based on
other properties from the data points to achieve validity for data points that share the
same property; we call this an Attribute-Conditional Conformal Predictor.

When we designed the Node degree-Conditional Conformal Predictor (NCCP), we wanted
to ensure that the conformal predictor was valid for both low and high-degree nodes.
Therefore, we designed the NCCP as an Attribute-Conditional Conformal Predictor
that splits the data into Mondrian categories based on the nodes’ degree. We define
the NCCP in Algorithm 5. As we can see, the only difference from the CCCP algorithm
(as defined in Algorithm 3) is in line 1, where we choose to only include calibration data
points with the same node degree as the object.

Algorithm 5: The Node degree-conditional Conformal Algorithm


Data: nonconformity measure A, significance level ϵ, examples Z = {z1 , ..., zn−1 },
object xn , the label y, the node degrees deg1 , ..., degn , and the specified
node degree d
Result: Decision whether to include y in Γϵ (z1 , ..., zn−1 , xn )

1 Set Ẑ := {zi |zi ∈ Z ∧ degi = d}


2 Provisionally set Ẑn := (xn , y)
3 For i = 1, ..., n, set αi = A(Ẑ \ {ẑi }, ẑi )
4 py = #{i = 1, ..., n − 1 | αi ≥ αn } / n
5 if py > ϵ then
6 Include y in Γϵ (z1 , ..., zn−1 , xn )
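
For concreteness, a sketch of the binned variant of the NCCP that we use in the experiments
(see Section 3.2.4 and Table 3.4.1) is shown below; the function and variable names are ours:

import numpy as np

def nccp_prediction_set(test_probs, test_degree, cal_scores, cal_degrees,
                        bins=(0, 1, 5, 10), epsilon=0.05):
    bin_ids = np.digitize(cal_degrees, bins)
    test_bin = np.digitize([test_degree], bins)[0]
    scores = cal_scores[bin_ids == test_bin]      # only calibration nodes in the same degree bin
    n = len(scores) + 1
    prediction_set = []
    for label, prob in enumerate(test_probs):
        alpha = 1.0 - prob                        # nonconformity score if `label` were true
        if np.sum(scores >= alpha) / n > epsilon:
            prediction_set.append(label)
    return prediction_set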

4.2 Node degree Weighted Conformal Predictor


Barber et al. propose a weighted conformal predictor to put an upper bound on the
coverage gap caused by the distribution shift that can occur in time-series data, as de-
scribed in Section 2.4.1. Barber et al.'s method is restricted to the regression
task, where we want to retrieve a confidence interval for a numeric value. Their method
is to multiply the nonconformity scores with a set of weights when computing the con-
fidence interval. The weights are used to compensate for the distribution shift between
the sample data point and the calibration set, where a larger distribution shift between
two points leads to a lower weight.

While Barber et al.’s method is not directly applicable to our case, we take inspiration
and design our own weighted conformal predictor called the Node degree Weighted
Conformal Predictor (NWCP). As we can see in Algorithm 6, the only change from the
standard CP algorithm is that we, on line 3, multiply the nonconformity scores of the
calibration set with the assigned weights before we compare the score with the subject
sample.

Algorithm 6: The Node degree Weighted Conformal Predictor


Data: nonconformity measure A, significance level ϵ, examples Z = {z1 , ..., zn−1 },
object xn , the label y, and the weights {w1 , ..., wn−1 }
Result: Decision whether to include y in Γϵ (z1 , ..., zn−1 , xn )

1 Provisionally set Zn := (xn , y)


2 For i = 1, ..., n, set αi = A(Z \ {zi }, zi )
3 Set py := #{i = 1, ..., n − 1 | αi ∗ wi ≥ αn } / n
4 if py > ϵ then
5 Include y in Γϵ (z1 , ..., zn−1 , xn )

In the NWCP, we assume that nodes with similar node degrees originate from the same
underlying distribution. We leverage this information when we compute the weights
for the algorithm. We calculate the weights as the distance in node degree between the
sample node A and the calibration node B. We also normalize the distance to be within
the range [0, 1]. Below, we can see how the weights are assigned:

degmax = max(deg1 , ..., degn )

distanceA→B = abs(degA − degB )          (4.1)

WB = distanceA→B / degmax
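
A sketch of the resulting weighted p-value computation, combining Algorithm 6 with the
weights above, could look as follows (assuming the nonconformity scores have already been
computed and at least one node has a nonzero degree; variable names are ours):

import numpy as np

def nwcp_prediction_set(test_probs, test_degree, cal_scores, cal_degrees, epsilon=0.05):
    deg_max = max(np.max(cal_degrees), test_degree)        # maximum degree incl. the sample
    weights = np.abs(cal_degrees - test_degree) / deg_max  # normalized degree distances
    n = len(cal_scores) + 1
    prediction_set = []
    for label, prob in enumerate(test_probs):
        alpha = 1.0 - prob                                  # nonconformity score for this label
        if np.sum(cal_scores * weights >= alpha) / n > epsilon:
            prediction_set.append(label)
    return prediction_set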

4.3 Embedding distance Weighted Conformal Predictor
Another weighted conformal predictor is the spatial conformal predictor described in
Section 2.4.2. Instead of assigning weights based on the distribution shift, as in
Barber et al.'s method, we assign weights based on the spatial distance between
two data points. In a special case of the spatial conformal predictor, the weight of a
calibration data point is set to 1 if the point is within a certain radius of the sample
point, otherwise, the weight is set to 0. The spatial conformal predictor can produce
valid and efficient prediction intervals as long as the data is locally exchangeable.

As part of producing a prediction, a GCN model computes a high-dimensional embed-
ding of a node, on top of which we put a prediction head to classify the node. The
node embedding exists in the Euclidean space, which allows us to use the output of a
GCN as input for other deep learning algorithms. If the GCN model is accurate, the
generated node embeddings should ensure that two similar nodes are also close
to each other in the embedding space. We can see an example of the node embedding
in Figure 4.3.1. For this reason, we think it is logical to assume that the node’s underly-
ing distribution determines the node’s location in the embedding space, and we should
be able to use this information to reduce the coverage gap.

Figure 4.3.1: Example of how nodes in a graph can be represented in the embedding space.

Unfortunately, the spatial conformal prediction algorithm is not designed for the clas-
sification task, and when we designed the Embedding distance Weighted Conformal
Predictor (EWCP), we could only take inspiration from it. We define the EWCP in Al-
gorithm 7. As we can see, it is quite similar to the CCCP algorithm, except that we de-
termine our calibration set based on the embedding distance and that the calibration
sets are allowed to overlap. In our experiments, we calculate the embedding distance
as a simplified Euclidean distance between two node embeddings as described in Sec-
tion 3.2.4. However, nothing stops us from using another distance measure, such as
the Manhattan distance, if we believe that to be more performant.


Algorithm 7: The Embedding distance weighted Conformal Algorithm


Data: nonconformity measure A, significance level ϵ, examples Z = {z1 , ..., zn−1 },
object xn , the label y, the embedding distances dist(z1 , zn ), ..., dist(zn−1 , zn ),
and the specified radius r.
Result: Decision whether to include y in Γϵ (z1 , ..., zn−1 , xn )

1 Set Ẑ := {zi |zi ∈ Z ∧ dist(zi , zn ) ≤ r}


2 Provisionally set Ẑn := (xn , y)
3 For i = 1, ..., n, set αi = A(Ẑ \ {ẑi }, ẑi )
4 py = #{i = 1, ..., n − 1 | αi ≥ αn } / n
5 if py > ϵ then
6 Include y in Γϵ (z1 , ..., zn−1 , xn )
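
A corresponding sketch of the EWCP, filtering the calibration set by the simplified
embedding distance from Section 3.2.4, is shown below; the names are ours:

import numpy as np

def ewcp_prediction_set(test_probs, test_emb, cal_scores, cal_embs, radius, epsilon=0.05):
    distances = np.abs(cal_embs - test_emb).sum(axis=1)   # simplified Euclidean distance
    scores = cal_scores[distances <= radius]              # calibration points within the radius
    n = len(scores) + 1
    prediction_set = []
    for label, prob in enumerate(test_probs):
        alpha = 1.0 - prob
        if np.sum(scores >= alpha) / n > epsilon:
            prediction_set.append(label)
    return prediction_set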

Chapter 5

Results

This chapter presents the results from our case studies on static and dynamic graphs.
First, we present the conformal predictors’ performance on static graphs. The results
are presented as tables, and we also provide some explanations of the results. Sec-
ondly, we present the conformal predictors’ performance on the dynamic graphs. To
avoid confusion, we present the results for the conformal predictors using resampling
and those not using resampling separately. The results are presented as line graphs
showing the performance of the conformal predictors over time.

We leave any interpretation and discussion of the results to Chapter 6.

5.1 Static graphs study


Here we present the results from the case study on static graphs. In the case study, we
run experiments on three data sets: Cora, PubMed, and OGB Arxiv. We present the
results from these experiments separately.

5.1.1 Cora

In this section, we present the results of the experiments on the Cora citation graph.
The Cora data set is small and contains seven different class labels.

In Table 5.1.1, we can see that the underlying GraphSAGE model has almost perfect ac-
curacy and macro F1 when classifying nodes. As the accuracy and the macro F1 scores
have the same values, we can assume that the GraphSAGE classifier is not biased to-
wards any class and performs equally well for all classes.

Accuracy 0.94
Macro F1 0.94

Table 5.1.1: Performance of the underlying GraphSAGE model on the Cora data set.

In Table 5.1.2, we can see the performance of the conformal predictors. Notably, all
the conformal predictors perform about the same on the Cora data set, and there is
minimal variation in terms of validity and efficiency. All of the predictors prove to be
valid, as they have a coverage score at or above the specified confidence level. The predictors
are also almost equally efficient. However, the EWCP is the most efficient with the
smallest average prediction set size, and it beats the NWCP by just a hair’s breadth in
terms of singleton predictions generated.

We see the most variation in the prediction latency, shown at the bottom
of Table 5.1.2. Here we can see that the conventional ICP is by far the fastest, with a
prediction latency of 0.32 milliseconds. This is more than twice as fast as the EWCP
which has a prediction latency of 0.80 milliseconds. It is also surprising that the CCCP
is slower than the ICP. This goes against our intuition, as the CCCP compares incoming
test nodes with a subset of the calibration nodes in contrast to the ICP which compares
with all calibration nodes.

ICP CCCP NCCP NWCP EWCP


Coverage (confidence level: 0.95) 0.95 0.95 0.95 0.95 0.95
Average prediction set size 1.03 1.06 1.04 1.02 1.02
Fraction of singleton predictions 0.97 0.94 0.96 0.97 0.98
Fraction of empty predictions 0.00 0.01 0.01 0.00 0.00
Prediction latency (ms) 0.32 0.46 0.36 0.56 0.80

Table 5.1.2: Performance of the conformal predictors on the Cora data set.

5.1.2 PubMed

Here we present the results from the experiments on the PubMed citation graph. The
PubMed data set is larger than the Cora data set but only includes three different class
labels. However, the results from the PubMed data set are almost identical to those
from the experiments on the Cora data set.


As we can see in Table 5.1.3, we again have an almost perfect GraphSAGE classifier,
and again the accuracy and macro F1 scores are identical, telling us that the classifier
performs equally well on all classes.

Accuracy 0.93
Macro F1 0.93

Table 5.1.3: Performance of the underlying GraphSAGE model on the PubMed data set.

In Table 5.1.4, we see that all of the conformal predictors appear to be valid, and overall
there is minimal variation in terms of efficiency, with the ICP, the NWCP, and the EWCP
sharing identical scores. The variation in the prediction latency is also similar to
what we see for the Cora data set, as we can see at the bottom of Table 5.1.4. Again, the
conventional ICP is by far the fastest, and the EWCP is almost ten times slower than
the ICP. Again the CCCP is surprisingly slower than the ICP.

ICP CCCP NCCP NWCP EWCP


Coverage (confidence level: 0.95) 0.95 0.95 0.95 0.95 0.95
Average prediction set size 1.03 1.04 1.04 1.03 1.03
Fraction of singleton predictions 0.97 0.97 0.96 0.97 0.97
Fraction of empty predictions 0.00 0.00 0.00 0.00 0.00
Prediction latency (ms) 0.18 0.47 0.32 0.48 1.42

Table 5.1.4: Performance of the conformal predictors on the PubMed data set.

5.1.3 OGB Arxiv

This section presents the results from the experiments on the OGB Arxiv citation net-
work. The OGB Arxiv data set is orders of magnitude larger than the Cora and PubMed data sets
and includes 40 different class labels, making it more of a challenge for the GraphSAGE
classifier.

As we can see in Table 5.1.5 the GraphSAGE model does not reach the same high accu-
racy as on the Cora and PubMed data sets. We can see a significant difference between
the accuracy and the macro F1 score, indicating that the GraphSAGE model might per-
form better on some class labels than others.

Accuracy 0.71
Macro F1 0.57

Table 5.1.5: Performance of the underlying GraphSAGE model on the Arxiv data set.

Table 5.1.6 shows the performance of the conformal predictors. As we can see, all of the
conformal predictors turn out to be valid except for the EWCP, which falls just below
the threshold. Both the CCCP and the EWCP stand out in terms of efficiency. The
Class-Conditional Conformal Predictor is less efficient than the other predictors and
has almost double the average prediction set size of the conventional ICP. The low
efficiency is likely due to class imbalance in the data set. We can see signs of bias
towards some classes in the Macro F1 for the GraphSAGE model, which is much lower
than the accuracy. While the other predictors provide narrow prediction sets, the CCCP
seems instead to offer its guarantee of validity for each class. In contrast, the EWCP
seems to offer some efficiency gains in relation to the other predictors, both in terms of
smaller average prediction set size and a larger fraction of singleton predictions.

The Node degree-Conditional Conformal Predictor has similar efficiency to the ICP
but seems to offer some gains in terms of prediction latency. A reason for this could
be that there is a broader distribution in terms of node degree, and the NCCP needs
to compare with fewer calibration nodes on average when making a prediction. The
NWCP also has similar efficiency to the ICP, but contrary to the NCCP, it has
a higher prediction latency. The higher latency could be caused by the additional weight
calculations.

On another note, we can see that the prediction latencies are overall orders of magnitude larger
than those on the Cora and PubMed data sets. This can be explained by the fact that we have
a much larger calibration set for the Arxiv data set, leading to proportionally more
comparisons in the conformal predictors.

ICP CCCP NCCP NWCP EWCP


Coverage (confidence level: 0.95) 0.95 0.95 0.95 0.95 0.94
Average prediction set size 4.02 7.41 4.07 4.02 3.77
Fraction of singleton predictions 0.20 0.09 0.21 0.20 0.23
Fraction of empty predictions 0.00 0.00 0.00 0.00 0.01
Prediction latency (ms) 5.45 9.21 3.87 7.53 24.7

Table 5.1.6: Performance of the conformal predictors on the Arxiv data set.


5.2 Dynamic graphs study


This section presents the results retrieved from the case study on dynamic graphs. We
performed experiments on the Bitcoin Elliptic and the OGB Arxiv data sets in the case
study. The latter is interesting as it represents a classical case of distribution shift in an
evolving graph. At the same time, the Bitcoin Elliptic provides an example of a graph
whose distribution changes completely at each time step. For both experiments, we use
conformal predictors that are only trained on the first snapshot of the dynamic graph.
We use the conformal predictors with and without resampling and present those re-
sults separately.

5.2.1 Bitcoin Elliptic

Here we present the results from the experiments on the Bitcoin Elliptic transaction
network. The graph consists of 49 snapshots, with no edges between nodes in the pre-
vious snapshot and the new nodes included in the following snapshot. Furthermore,
there are two possible class labels, but most nodes do not have a class label, and we
leave them out in the evaluation.

In Figure 5.2.1, we can see the performance of the underlying GraphSAGE classifier
used by the conformal predictors. As we can see, the accuracy is overall high, but it
drops slightly over time, indicating that there might be a distribution shift. Further-
more, the macro F1 score is just half that of the accuracy, which tells us that the classi-
fier is biased towards one of the classes.

Figure 5.2.1: Accuracy for the underlying GraphSAGE model on the Bitcoin Elliptic dynamic
graph.


Without resampling

In this section, we look at the results from the conformal predictors without resam-
pling. These are particularly interesting as this is the standard mode for conformal
predictors.

Figure 5.2.2 shows us the coverage for the conformal predictors for each snapshot.
As we can see, the coverage follows the GraphSAGE accuracy, and most of the time,
there is a significant coverage gap for all predictors. The coverage gap is large enough
for all predictors to be invalid. The novel predictors stand out, as the EWCP seems
to have a slightly smaller coverage gap than the other predictors at the start.
However, over time the CCCP takes the lead with its smaller gap. At the same time, the
NWCP seems to maintain the largest coverage gap and is already invalid on the early
snapshots.

Figure 5.2.2: Coverage for the conformal predictors on the Bitcoin Elliptic dynamic graph.

Looking instead at the efficiency of the conformal predictors, we can see in Figure 5.2.3
that all predictors except the CCCP maintain an average prediction set size below 1.0. This
means that we have mostly singleton- and empty predictions, as we can also see in
Figures 5.2.4 and 5.2.5. To receive a large fraction of singleton predictions is the best
possible outcome for a conformal predictor, as singleton predictions are the most in-
formative. On the other hand, empty predictions are usually unsatisfactory, as it is the
effect of the predictor not being capable of predicting a label for the specified confi-
dence level. As we can see, we also have a relatively large fraction of those, around
1to2% of predictions. The NWCP seem to have more empty predictions than the other
close-by predictors, which could explain its low coverage.


The CCCP stands out from the other predictors. It maintains an average prediction set
size of just above 1.0, which is more desirable than a size below 1.0. It is not unusual for Mondrian
conformal predictors to have a higher prediction set size than other predictors, but in
turn, the predictor should offer validity guarantees for each category. While it does not
seem to offer validity guarantees over time, it still maintains a lower coverage gap than
the other predictors, so in this case, it still wins in both categories. However, it is odd
that the CCCP also has significantly more empty predictions than the others. With a
higher average prediction set size and a similar fraction of singleton predictions, we
would have assumed the empty predictions to be lower than the others.

Figure 5.2.3: Average prediction set size for the conformal predictors on the Bitcoin Elliptic
dynamic graph.

Figure 5.2.4: Fraction of singleton predictions for the conformal predictors on the Bitcoin El-
liptic dynamic graph.

Figure 5.2.5: Fraction of empty predictions for the conformal predictors on the Bitcoin Elliptic
dynamic graph.

The prediction latency of the conformal predictors is similar to what we saw for the
static graphs. In Figure 5.2.6, we see that the EWCP stands out as by far the slowest
predictor. The other predictors all have a similar prediction latency, with the conven-
tional ICP having the lowest latency.

Figure 5.2.6: Prediction latency for the conformal predictors on the Bitcoin Elliptic dynamic
graph.

Using resampling

Next, we look at the conformal predictors with resampling. These conformal predictors
similarly do not retrain the GraphSAGE classifier, but they do resample the calibration
set at every snapshot before running the predictions.

As expected, the predictors with resampling maintain validity for most of the snap-
shots, as shown in Figure 5.2.7. The EWCP seems to maintain a coverage well above
the threshold of 0.95. The NWCP stands out as it is invalid for the early snapshots, and
only at time step 8 does it turn valid and stay valid. This is unexpected, as we do not see a
similar behavior for the NCCP, which also uses the node degree.

Figure 5.2.7: Coverage for the conformal predictors with resampling on the Bitcoin Elliptic
dynamic graph.

However, the validity comes at the cost of efficiency, as shown in Figure 5.2.8. For most
predictors, the average prediction set size is just below 1.5 after a few time steps. This
is relatively high considering that the maximum possible prediction set size is 2. We
can also see the effect in Figure 5.2.9 which shows the number of singleton predictions.
Here we can see that the predictors produce fewer singleton predictions over time. As
we can see in Figure 5.2.10, this is not compensated for by empty predictions, which
means that we get more uninformative predictions where both possible class labels
are included. The CCCP stands out from the others again. The CCCP has an overall
much higher average prediction set size and more empty predictions than the others on
the early snapshots. This renders its predictions uninformative most of the time.

The prediction latency for the predictors with resampling is similar to that of those
without resampling, as shown in Figure 5.2.11. The contrast is that the pre-
diction latency grows over time for the predictors with resampling. This is explained
by the calibration set growing in size as we resample it at every snapshot, leading
to the predictions taking longer. The EWCP stands out even more in this experiment,
as its latency also grows faster than the others. This could signify that the embedding
distance calculations grow in cost exponentially.


Figure 5.2.8: Average prediction set size for the conformal predictors with resampling on the
Bitcoin Elliptic dynamic graph.

Figure 5.2.9: Fraction of singleton predictions for the conformal predictors with resampling on
the Bitcoin Elliptic dynamic graph.

Figure 5.2.10: Fraction of empty predictions for the conformal predictors with resampling on
the Bitcoin Elliptic dynamic graph.


Figure 5.2.11: Prediction latency for the conformal predictors with resampling on the Bitcoin
Elliptic dynamic graph.

5.2.2 OGB Arxiv

Now we present the final graph, the OGB Arxiv citation network, but this time as a
dynamic graph. As described earlier, the OGB Arxiv is a much larger graph than the
other graphs we present in this thesis. It is also complex, with 40 possible classes. As
shown in Figure 3.2.2, the class distribution also changes significantly over time, and
some classes that used to be barely represented in the data set grow to be the dominant
classes in the last snapshots. So we already know there is a significant distribution shift
in the graph.

Figure 5.2.12 shows the performance of the underlying GraphSAGE classifier on the
data set. As we can see, the performance goes down significantly over time, signifying
the distribution shift we just mentioned. There is also a gap between the accuracy
score and the macro F1 score, indicating that the classifier has a bias towards some
classes.

Without resampling

First, we present the results for the conformal predictors without resampling. In Figure
5.2.13, we can see the coverage for the predictors. Similar to the GraphSAGE classifier's
performance, the coverage drops over time, confirming the effect of the distribution shift.
Already in the second snapshot, all predictors turn invalid. However, while significant,
the coverage gap remains relatively low. The predictors have similar coverage, and the
ICP and the EWCP have almost identical scores.


Figure 5.2.12: Accuracy for the underlying GraphSAGE model on the OGB Arxiv dynamic
graph.

Figure 5.2.13: Coverage for the conformal predictors on the OGB Arxiv dynamic graph.

We see a similar trend in efficiency. Figures 5.2.14 and 5.2.15 show the average pre-
diction set size and the fraction of singleton predictions. The prediction set size grows
significantly over time, and we see fewer singleton predictions. Thus, the predictions
become less informative over time. However, in Figure 5.2.16, we see that the fraction
of empty predictions remains low over time, showing that we overestimate some class
labels rather than vice versa.

The CCCP stands out from the others as it maintains a much higher prediction set size.
However, while we would then also expect to see a higher coverage, this is not the case, and the
CCCP simply comes across as less efficient than the others.

Figure 5.2.14: Average prediction set size for the conformal predictors on the OGB Arxiv dy-
namic graph.

Figure 5.2.15: Fraction of singleton predictions for the conformal predictors on the OGB Arxiv
dynamic graph.

Figure 5.2.16: Fraction of empty predictions for the conformal predictors on the OGB Arxiv
dynamic graph.

The results on the prediction latency show nothing new, as shown in Figure 5.2.17.
The EWCP remains the slowest predictor, and the others share a similar prediction
latency.

Figure 5.2.17: Prediction latency for the conformal predictors on the OGB Arxiv dynamic graph.

Using resampling

Finally, we present the results for the conformal predictors with resampling. Similar to
the other predictors, they only train the GraphSAGE classifier on the first snapshot, but the
significant change is that they resample the calibration set for every snapshot.

In Figure 5.2.18, we see that the coverage for the predictors remains steady at around
0.95 across all time steps. This means that all predictors remain valid.

Figure 5.2.18: Coverage for the conformal predictors with resampling on the OGB Arxiv dy-
namic graph.

However, efficiency-wise we see a similar behavior to that of the conformal predictors
without resampling. Figure 5.2.19 shows that the average prediction set size grows
steadily over time. The prediction sets quickly grow relatively large. In the final snap-
shot, the prediction set size averages around 25, which is more than half of the maximum
possible prediction set size. We can assume that any prediction set with 25 labels is
uninformative. Looking at the singleton- and empty predictions in Figures 5.2.20 and
5.2.21, we see that the number of singleton predictions goes down but the number of
empty predictions remains low. This shows that we again overestimate labels rather
than underestimate them.

The CCCP turns out again to have the worst efficiency without any clear gains in cov-
erage.

Figure 5.2.19: Average prediction set size for the conformal predictors with resampling on the
OGB Arxiv dynamic graph.

Figure 5.2.20: Fraction of singleton predictions for the conformal predictors with resampling
on the OGB Arxiv dynamic graph.

Figure 5.2.21: Fraction of empty predictions for the conformal predictors with resampling on the OGB Arxiv dynamic graph.

The prediction latency of the predictors is similar to what we have seen before. In Figure 5.2.22, we see that prediction latency grows over time for all predictors, and the EWCP is by far the slowest. The increasing calibration set size can again explain the growing prediction latency.

Figure 5.2.22: Prediction latency for the conformal predictors with resampling on the OGB
Arxiv dynamic graph.

The effect of the EWCP’s radius

In the experiments on the OGB Arxiv graph, we saw only minor differences between
the ICP and the EWCP, leading us to believe that the radius we used for the EWCP was too large. The radius was much higher (1000) than the one we used for the static graph
(300). Therefore, we extended the experiment to look at what happens when we reduce
the radius to be the same as for the static graph.

In Figure 5.2.23, we see the drop in coverage when we lower the EWCP’s radius to 300.
The coverage drop is significant, and not even the predictor with resampling achieves validity for any of the time steps.

Figure 5.2.23: EWCP radius comparison on OGB Arxiv dynamic graph - Coverage for the con-
formal predictors.

However, we can see that reducing the radius offers significant gains when we look at the efficiency. In Figure 5.2.24, we see that lowering the radius also reduces the average prediction set size, which in turn leads to more singleton predictions, as we can see in Figure 5.2.25. At the same time, we also receive fewer empty prediction sets, as shown in Figure 5.2.26. All in all, we seem to get more informative prediction sets when we lower the radius.

Figure 5.2.24: EWCP radius comparison on OGB Arxiv dynamic graph - Average prediction set
size for the conformal predictors.


Figure 5.2.25: EWCP radius comparison on OGB Arxiv dynamic graph - Fraction of singleton
predictions for the conformal predictors.

Figure 5.2.26: EWCP radius comparison on OGB Arxiv dynamic graph - Fraction of empty
predictions for the conformal predictors.

Figure 5.2.27: EWCP radius comparison on OGB Arxiv dynamic graph - Prediction latency for
the conformal predictors.

Chapter 6

Discussion

In this chapter, we review the results from the case studies on static and dynamic
graphs. We start by looking at the performance of the conventional conformal pre-
dictors, and then we move on to the novel predictors and the effect of resampling.
Finally, we discuss the limitations of current work and propose directions for future
studies.

6.1 The conventional conformal predictors


We used two conventional conformal predictors in the case studies: the Inductive
Conformal Predictor and the Class-Conditional Conformal Predictor. We start by dis-
cussing the ICP.

The ICP behaved as expected in all scenarios. On the static graphs, it performed on
par with the other predictors and acted as a good baseline to compare with the other
predictors. The ICP proved to be valid in all static scenarios and was also perceived to
be quite efficient, even if it was not always the most efficient predictor. In the dynamic
scenarios, none of the predictors, including the ICP, was able to maintain validity
without resampling. From what we see in the experiments, the ICP seems to be an
attractive conformal predictor for graphs due to its decent performance and low pre-
diction latency relative to the other predictors we tested. The ICP proved to have the
lowest prediction latency in all experiments except the experiment on the static OGB
Arxiv graph.
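To recap the baseline's mechanics, the sketch below shows one common way to implement ICP classification, assuming the underlying GraphSAGE classifier exposes predicted class probabilities and using one minus the predicted probability as the nonconformity score. The exact nonconformity measure and implementation used in our experiments may differ; all names here are illustrative.

import numpy as np

def icp_prediction_sets(cal_probs, cal_labels, test_probs, significance=0.05):
    """Inductive conformal prediction with a 1 - p(true class) nonconformity score.

    cal_probs:  (n_cal, n_classes) predicted probabilities for the calibration nodes.
    cal_labels: (n_cal,) true labels for the calibration nodes.
    test_probs: (n_test, n_classes) predicted probabilities for the test nodes.
    """
    # Nonconformity score: one minus the probability assigned to the true label.
    cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

    prediction_sets = []
    for probs in test_probs:
        labels = set()
        for label, p in enumerate(probs):
            score = 1.0 - p
            # p-value: fraction of calibration scores at least as nonconforming.
            p_value = (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)
            if p_value > significance:
                labels.add(label)
        prediction_sets.append(labels)
    return prediction_sets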

In contrast, the other conventional conformal predictor, the CCCP, came with mixed results. The GraphSAGE classifier turned out to be perfectly balanced in the case of the Cora and PubMed data sets, and this is likely why we did not see any performance gains from using the CCCP. However, while it did not bring any performance gains, it came with a significant increase in prediction latency, which was surprising. We expected the CCCP to have a similar latency to that of the ICP, but in some cases, it was more than double. As the high latency is unexpected, it would make sense to look into the code to see if there are any bottlenecks in our implementation of the CCCP before we draw any hard conclusions. For the other data sets, the CCCP saw a decrease in efficiency relative to the other predictors. We could also see that these data sets were imbalanced by looking at the class distribution and comparing the accuracy and macro F1 scores. In these cases, the CCCP likely has a better coverage within each class, which is why we see the decrease in efficiency. However, while the CCCP should guarantee validity for each class, we have not verified this. It would be worth investigating whether this guarantee holds in further studies. It is also worth noting that while the CCCP was unable to maintain validity on the dynamic graphs, it seemed to have a slightly lower coverage gap on the Bitcoin Elliptic graph. However, this gap was not low enough to justify using the CCCP instead of retraining or resampling.
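For comparison with the ICP sketch above, the key difference in the CCCP is that the p-value for a candidate class is computed only against calibration nodes whose true label is that class. The sketch below illustrates this under the same assumptions as before and is not the exact implementation we used.

import numpy as np

def cccp_p_value(cal_probs, cal_labels, test_probs_row, candidate_class):
    """Class-conditional (Mondrian) p-value for one test node and one candidate class."""
    # Only calibration nodes whose true label equals the candidate class take part.
    class_scores = 1.0 - cal_probs[cal_labels == candidate_class, candidate_class]
    test_score = 1.0 - test_probs_row[candidate_class]
    return (np.sum(class_scores >= test_score) + 1) / (len(class_scores) + 1)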

6.2 The novel conformal predictors


In the case studies, we also tested the effect of using graph connectivity information as
part of a conformal prediction. We tested this by designing three novel conformal pre-
dictors: the Node degree-Conditional Conformal Predictor, the Node degree Weighted
Conformal Predictor, and the Embedding distance Weighted Conformal Predictor. We
start by discussing the node degree-based predictors.

The NCCP is an attribute-conditional conformal predictor that seeks to guarantee validity for low-degree nodes. Overall, it seems to perform on par with the other predictors, and as we see no change in efficiency or coverage, we assume that the other predictors are already valid for low-degree nodes and that there is no gain from using the NCCP. However, similar to the CCCP, we did not look into the validity per degree to see if this was the case, and we recommend that future work examine this to better understand the value of the NCCP. Our implementation might also not bring out the desired validity properties of the NCCP. Instead of looking at each node degree separately, our implementation uses binning to batch similar node degrees. We found this necessary, as some node degrees would otherwise not be covered by the calibration set. We do not know if this has affected the performance, but it should at least not violate the per-category validity guarantee. Furthermore, the only place where the NCCP stood out was the static OGB Arxiv graph, where it offered significant gains in terms of latency while maintaining validity and efficiency. This could indicate that the NCCP offers a lower computational cost relative to the other predictors, and it should be investigated further on more large data sets.
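To make the binning concrete, the sketch below shows one way such degree-conditional calibration could look, using quantile bins so that every bin is populated. The binning strategy (quantile versus fixed-width) and the bin count are assumptions for illustration and may differ from our implementation.

import numpy as np

def nccp_include_label(cal_scores, cal_degrees, test_score, test_degree,
                       significance=0.05, n_bins=5):
    """Decide whether a candidate label enters the prediction set, calibrating only
    against calibration nodes whose degree falls in the same bin as the test node."""
    # Quantile bin edges from the calibration degrees keep every bin populated;
    # np.unique guarantees strictly increasing edges even with repeated degrees.
    edges = np.unique(np.quantile(cal_degrees, np.linspace(0, 1, n_bins + 1)[1:-1]))
    cal_bins = np.digitize(cal_degrees, edges)
    test_bin = np.digitize([test_degree], edges)[0]
    scores_in_bin = cal_scores[cal_bins == test_bin]
    p_value = (np.sum(scores_in_bin >= test_score) + 1) / (len(scores_in_bin) + 1)
    return p_value > significance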

The other node-degree-based predictor is the NWCP, a weighted conformal predictor that puts more weight on calibration data points that share a similar node degree with the test data point. The NWCP performs on par with the other predictors for all data sets except the Bitcoin Elliptic dynamic graph. On the Bitcoin Elliptic data set, the NWCP has a significantly greater coverage gap and is even invalid on the first snapshot, unlike any other predictor. An explanation could be that there is only minor variation in the node degrees, leading to most of the calibration set receiving low weights, so that fewer class labels are included in the prediction set. If this is the case, it can probably be fixed by a weight calculation that accounts for many nodes sharing similar degrees. We propose that future studies look into this.
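As a concrete illustration of the weighting idea, the sketch below computes a weighted conformal p-value where the weight of a calibration node decays with the difference between its degree and the test node's degree. The Gaussian kernel and the bandwidth are our own illustrative choices and not necessarily the weighting used by the NWCP in our experiments.

import numpy as np

def degree_weighted_p_value(cal_scores, cal_degrees, test_score, test_degree,
                            bandwidth=2.0):
    """Weighted conformal p-value with node-degree-similarity weights."""
    # Calibration nodes with a degree close to the test node's degree get weight near 1.
    weights = np.exp(-((cal_degrees - test_degree) ** 2) / (2.0 * bandwidth ** 2))
    test_weight = 1.0  # the test point counts with full weight
    numerator = np.sum(weights * (cal_scores >= test_score)) + test_weight
    denominator = np.sum(weights) + test_weight
    return numerator / denominator

Flooring or renormalising the weights, or widening the bandwidth when the degree distribution is narrow, are the kinds of adjustments future work could explore to handle many similar node degrees.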

Finally, the last novel predictor we test is the EWCP, another weighted predictor that determines what nodes should be part of the calibration set by looking at their distance from the test node in the embedding space. In our experiments, we find that the radius used to determine which nodes to include in the calibration set is a hyperparameter that we can tune to control the coverage and the validity of the predictor. We see an example of this in our experiments on the OGB Arxiv data set. For the experiment on the static version of the graph, we use a radius of 300, which produces a high-coverage but invalid predictor. On the upside, the predictor is also more efficient than the other predictors. In contrast, on the dynamic graph, we use a radius of 1000, which leads to the EWCP having almost identical performance to that of the ICP. Usually, we can also use the confidence level to achieve a similar effect: by reducing the confidence level, we receive a slimmer prediction set. But in contrast to tuning the radius of the EWCP, changing the confidence level is a more transparent way of reducing the prediction set size. Nonetheless, this indicates that we can use the node embeddings to reduce the prediction set size. As we note in our comparison of different values for the EWCP radius on the dynamic OGB Arxiv graph, it is especially attractive that we can do so without generating more empty prediction sets. We suggest that future studies look further at how the node embeddings can be leveraged in conformal prediction.
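To make the role of the radius explicit, the sketch below shows a simplified reading of the EWCP in which calibration nodes within the radius take part in the calibration with weight one and all others are excluded. We assume a Euclidean distance over the GraphSAGE embeddings; the actual weighting scheme and the fallback behaviour when no nodes fall within the radius are illustrative assumptions.

import numpy as np

def ewcp_p_value(cal_scores, cal_embeddings, test_score, test_embedding, radius=300.0):
    """Embedding-distance-based p-value: calibrate against nearby nodes only."""
    distances = np.linalg.norm(cal_embeddings - test_embedding, axis=1)
    nearby_scores = cal_scores[distances <= radius]
    if len(nearby_scores) == 0:
        # Illustrative fallback: use the full calibration set if nothing is nearby.
        nearby_scores = cal_scores
    return (np.sum(nearby_scores >= test_score) + 1) / (len(nearby_scores) + 1)

Seen this way, a small radius acts like a local calibration set: the prediction sets become tighter, but the local sample may be too small or too biased to keep the predictor valid, which matches what we observe when the radius is lowered to 300.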

In our experiments, we also note that the EWCP has a significantly higher prediction
latency than the other predictors. This is due to the expensive calculations to determine
the distance between two nodes in the high-dimensional embedding space (we use 256
dimensions). Therefore, we believe it is worth exploring if there are ways to reduce
the cost, for example, by using a low-dimensional projection of the embedding for the
comparison or a different distance measure.
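One low-effort way to explore the cost reduction is a random projection of the embeddings before the distance computation, which approximately preserves Euclidean distances by the Johnson-Lindenstrauss lemma. The sketch below is a plain NumPy illustration with an assumed target dimension; we have not evaluated how the projection would affect the EWCP's coverage.

import numpy as np

def project_embeddings(embeddings, target_dim=32, seed=0):
    """Gaussian random projection of node embeddings to a lower dimension."""
    rng = np.random.default_rng(seed)
    n_dims = embeddings.shape[1]  # e.g. 256 in our experiments
    projection = rng.normal(size=(n_dims, target_dim)) / np.sqrt(target_dim)
    return embeddings @ projection  # pairwise distances are approximately preserved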

6.3 The effect of resampling


In the experiments on the dynamic graphs, we test the effect of only resampling the
calibration nodes but not retraining the predictors. We show that most of the confor-
mal predictors are capable of maintaining validity for all time steps. However, validity
comes at the cost of efficiency. For example, on the Bitcoin Elliptic data set, most of
the predictors have an average prediction set size of around 1.5, which means that we
have many predictions which include both of the two possible class labels. We see a
similar behavior for the OGB Arxiv data set, where the average prediction set size is
around 25 out of the 47 possible classes. This means that we overall have a large share
of uninformative predictions, and the value of resampling is unclear. Furthermore, the way we have implemented resampling in our experiments is that we always sample a proportional share of the available nodes. This leads to a growing calibration set and slower predictions, which is unsustainable in a production system that sees a lot of incoming data points. Any future work on resampling nodes should therefore investigate whether it is possible to use a fixed-size calibration set without losing performance.
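One candidate for such a fixed-size scheme is reservoir sampling, which keeps a uniformly sampled calibration set of constant size as new calibration candidates arrive with every snapshot. The sketch below is a standard reservoir-sampling routine and only a starting point; whether it preserves validity under the distribution shifts we observed would need to be evaluated.

import random

def update_calibration_reservoir(reservoir, new_nodes, max_size, seen_so_far):
    """Maintain a fixed-size, uniformly sampled calibration set over a node stream."""
    for node in new_nodes:
        seen_so_far += 1
        if len(reservoir) < max_size:
            reservoir.append(node)
        else:
            # Each new node replaces an existing one with probability max_size / seen_so_far.
            j = random.randrange(seen_so_far)
            if j < max_size:
                reservoir[j] = node
    return reservoir, seen_so_far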

6.4 Limitations
The primary limitation of this work is the reliability of the results. The evaluation of
the predictors is not extensive enough to draw any hard conclusions. We use a small
number of data sets: two in the case study on dynamic graphs and three in the case study on static
graphs. Furthermore, we were able to train two very accurate GraphSAGE models for
two data sets in the static experiment, which we believe also boosted the performance
of the predictors and reduced the variation in the results. The Bitcoin Elliptic data set is also sub-optimal, as the node features already include some information about the neighboring nodes. This could lower the impact of using the graph connectivity information in the CP algorithms.

However, even if we only had one data set that gave reliable results, it should be noted that there are not many existing data sets of dynamic graphs that can be used in
a study like this. Many data sets lack labels both for regression and classification tasks
and are primarily suitable for unsupervised learning tasks such as clustering. Hence,
this study can still act as inspiration for future work, as it is the first of its kind, and
hopefully motivate researchers to assemble more data sets for dynamic graphs that can
be used to cement the results of this and future studies.

6.5 Future Work


The GNN field, and especially quantifying uncertainty in GNN predictions, is still in its infancy. Our work merely scratches the surface, and we believe there is much
work left before it is ready for broad usage. We list some of the natural next steps
below:

• Perform the same experiments with more and larger data sets to get more robust
results.

• Run the measurements on prediction latency again with more performant con-
formal predictor implementations. Our implementation could only handle a sin-
gle node per prediction, which might have slowed predictions down significantly.
We are especially interested in understanding if the NCCP could offer lower la-
tency than the ICP on larger data sets.

• Investigate the effect of continual learning and if the coverage gap that occurs in
dynamic graphs can be used as a measure to trigger retraining efficiently.

• Investigate if the NCCP and the CCCP fulfill their promise of per-category validity.

• Investigate the NWCP’s low performance on the Bitcoin Elliptic data set, and experiment with other weighting formulas based on the node degree that handle the case where many nodes have similar degrees.

• Investigate if the NWCP can offer a similar upper bound on the coverage gap as Barber et al.’s method [2].

• Propose a method for determining the radius used in the EWCP to guarantee
validity and efficiency.

• Design a nonconformity measure that uses node embeddings, and investigate if it leads to a more efficient predictor.

• Propose a method for resampling a fixed-size calibration set without losing predictor validity or efficiency.

Chapter 7

Conclusions

In Chapter 1, we introduced our research question:

How can we leverage the graph connectivity information to design a valid and efficient conformal predictor for static and dynamic graphs?

To answer the research question, we evaluated two conventional algorithms: the In-
ductive Conformal Predictor and the Class-Conditional Conformal Predictor, and we
also designed three novel CP algorithms leveraging the graph connectivity informa-
tion: the Node degree-Conditional Conformal Predictor, the Node degree Weighted
Conformal Predictor, and the Embedding distance Weighted Conformal Predictor. We
compared the predictors on both static and dynamic graphs. However, we could not show that including the graph connectivity information improves the performance of the predictors in any significant way. Nevertheless, some indications make the area worth
exploring further. For instance, the NCCP seems to have the lowest prediction latency
on the OGB Arxiv data set, which is a reasonably large graph. We believe there should
be further investigation into whether the NCCP can be used to keep down the prediction
latency on large data sets. We also did not show that the NCCP brings validity within node degree categories, and this could be investigated in further work to understand
if it can be used to mitigate the bias in GCNs. We also saw indications that we can use
the node embedding information to design efficient conformal predictors. However,
it requires further investigation to understand how this information is leveraged most
effectively. We also investigated the effect of resampling the calibration set in dynamic
graphs. In the experiments, we saw that most of the predictors could maintain validity
only when resampling was used. However, this came at the cost of efficiency, making resampling quite unattractive.

To conclude, while we could not show that the graph connectivity information is useful
in GCN-based conformal predictors, we believe it is worth investigating further. We
look forward to seeing more studies in this emerging research area.

Bibliography

[1] Ahmedt-Aristizabal, David, Armin, Mohammad Ali, Denman, Simon, Fookes, Clinton, and Petersson, Lars. “Graph-based deep learning for medical diagnosis and analysis: past, present and future”. In: Sensors 21.14 (2021), p. 4758.

[2] Barber, Rina Foygel, Candes, Emmanuel J, Ramdas, Aaditya, and Tibshirani,
Ryan J. “Conformal prediction beyond exchangeability”. In: arXiv preprint arXiv:2202.13415
(2022).

[3] Elliptic Co. Elliptic Data Set. https://www.kaggle.com/ellipticco/elliptic-data-set. July 2019.

[4] Grover, Aditya and Leskovec, Jure. “node2vec: Scalable feature learning for net-
works”. In: Proceedings of the 22nd ACM SIGKDD international conference on
Knowledge discovery and data mining. 2016, pp. 855–864.

[5] Håkansson, Anne. “Portal of research methods and methodologies for research
projects and degree projects”. In: The 2013 World Congress in Computer Sci-
ence, Computer Engineering, and Applied Computing WORLDCOMP 2013;
Las Vegas, Nevada, USA, 22-25 July. CSREA Press USA. 2013, pp. 67–73.

[6] Hamilton, Will, Ying, Zhitao, and Leskovec, Jure. “Inductive representation learn-
ing on large graphs”. In: Advances in neural information processing systems 30
(2017).

[7] Harary, Frank and Gupta, Gopal. “Dynamic graph models”. In: Mathematical
and Computer Modelling 25.7 (1997), pp. 79–87.

[8] Harris, Charles R., Millman, K. Jarrod, Walt, Stéfan J. van der, Gommers, Ralf,
Virtanen, Pauli, Cournapeau, David, Wieser, Eric, Taylor, Julian, Berg, Sebas-
tian, Smith, Nathaniel J., Kern, Robert, Picus, Matti, Hoyer, Stephan, Kerk-
wijk, Marten H. van, Brett, Matthew, Haldane, Allan, Río, Jaime Fernández
del, Wiebe, Mark, Peterson, Pearu, Gérard-Marchant, Pierre, Sheppard, Kevin, Reddy, Tyler, Weckesser, Warren, Abbasi, Hameer, Gohlke, Christoph, and Oliphant, Travis E. “Array programming with NumPy”. In: Nature 585.7825 (Sept. 2020), pp. 357–362. DOI: 10.1038/s41586-020-2649-2. URL: https://doi.org/10.1038/s41586-020-2649-2.

[9] Hu, Weihua, Fey, Matthias, Zitnik, Marinka, Dong, Yuxiao, Ren, Hongyu, Liu,
Bowen, Catasta, Michele, and Leskovec, Jure. “Open Graph Benchmark: Datasets
for Machine Learning on Graphs”. In: arXiv preprint arXiv:2005.00687 (2020).

[10] Hwang, Doyeong, Lee, Grace, Jo, Hanseok, Yoon, Seyoul, and Ryu, Seongok. “A
benchmark study on reliable molecular supervised learning via Bayesian learn-
ing”. In: arXiv preprint arXiv:2006.07021 (2020).

[11] Kipf, Thomas N and Welling, Max. “Semi-supervised classification with graph
convolutional networks”. In: arXiv preprint arXiv:1609.02907 (2016).

[12] Kwon, Youngchun, Lee, Dongseon, Choi, Youn-Suk, and Kang, Seokho. “Uncertainty-
aware prediction of chemical reaction yields with graph neural networks”. In:
Journal of Cheminformatics 14.1 (2022), pp. 1–10.

[13] Leskovec, Jure. Stanford CS224W: GNN Augmentation and Training. https://web.stanford.edu/class/cs224w/slides/08-GNN-application.pdf. Oct. 2021.

[14] Leskovec, Jure. Stanford CS224W: Graph Neural Networks. https://web.stanford.edu/class/cs224w/slides/06-GNN1.pdf. Oct. 2021.

[15] Leskovec, Jure. Stanford CS224W: Heterogeneous Graphs and Knowledge Graph
Embeddings. https://web.stanford.edu/class/cs224w/slides/10-kg.pdf.
Oct. 2021.

[16] Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.

[17] Mao, Huiying, Martin, Ryan, and Reich, Brian. “Valid model-free spatial predic-
tion”. In: arXiv preprint arXiv:2006.15640 (2020).

[18] McCallum, Andrew Kachites, Nigam, Kamal, Rennie, Jason, and Seymore, Kristie.
“Automating the construction of internet portals with machine learning”. In: In-
formation Retrieval 3.2 (2000), pp. 127–163.


[19] Norinder, Ulf, Carlsson, Lars, Boyer, Scott, and Eklund, Martin. “Introducing
conformal prediction in predictive modeling. A transparent and flexible alter-
native to applicability domain determination”. In: Journal of chemical infor-
mation and modeling 54.6 (2014), pp. 1596–1603.

[20] Papadopoulos, Harris, Proedrou, Kostas, Vovk, Volodya, and Gammerman, Alex.
“Inductive confidence machines for regression”. In: European Conference on
Machine Learning. Springer. 2002, pp. 345–356.

[21] Pareja, Aldo, Domeniconi, Giacomo, Chen, Jie, Ma, Tengfei, Suzumura, Toy-
otaro, Kanezashi, Hiroki, Kaler, Tim, Schardl, Tao, and Leiserson, Charles. “Evolvegcn:
Evolving graph convolutional networks for dynamic graphs”. In: Proceedings
of the AAAI Conference on Artificial Intelligence. Vol. 34. 04. 2020, pp. 5363–
5370.

[22] Patterson, David, Gonzalez, Joseph, Le, Quoc, Liang, Chen, Munguia, Lluis-
Miquel, Rothchild, Daniel, So, David, Texier, Maud, and Dean, Jeff. “Carbon
emissions and large neural network training”. In: arXiv preprint arXiv:2104.10350
(2021).

[23] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. “Scikit-learn:
Machine Learning in Python”. In: Journal of Machine Learning Research 12
(2011), pp. 2825–2830.

[24] PYG Documentation. https://pytorch-geometric.readthedocs.io/. 2022.

[25] Python 3.8.0. https://www.python.org/downloads/release/python-380/. Oct. 2019.

[26] Rajpurkar, Pranav, Chen, Emma, Banerjee, Oishi, and Topol, Eric J. “AI in health
and medicine”. In: Nature Medicine (2022), pp. 1–8.

[27] Shafer, Glenn and Vovk, Vladimir. “A tutorial on conformal prediction.” In: Jour-
nal of Machine Learning Research 9.3 (2008).

[28] Sun, Jiangming, Carlsson, Lars, Ahlberg, Ernst, Norinder, Ulf, Engkvist, Ola,
and Chen, Hongming. “Applying mondrian cross-conformal prediction to esti-
mate prediction confidence on large imbalanced bioactivity data sets”. In: Jour-
nal of chemical information and modeling 57.7 (2017), pp. 1591–1598.


[29] Tang, Xianfeng, Yao, Huaxiu, Sun, Yiwei, Wang, Yiqi, Tang, Jiliang, Aggarwal,
Charu, Mitra, Prasenjit, and Wang, Suhang. “Graph convolutional networks against
degree-related biases”. In: arXiv preprint arXiv:2006.15643 (2020).

[30] Vaida, Maria and Patil, Pranita. “Semi-Supervised Graph Neural Network with
Probabilistic Modeling to Mitigate Uncertainty”. In: Proceedings of the 2020 the
4th International Conference on Information System and Data Mining. 2020,
pp. 152–156.

[31] Vovk, Vladimir, Gammerman, Alexander, and Shafer, Glenn. Algorithmic learn-
ing in a random world. Springer Science & Business Media, 2005.

[32] Vovk, Vladimir, Lindsay, David, Nouretdinov, Ilia, and Gammerman, Alex. “Mon-
drian confidence machine”. In: Technical Report (2003).

[33] Vovk, Vladimir, Nouretdinov, Ilia, Fedorova, Valentina, Petej, Ivan, and Gam-
merman, Alex. “Criteria of efficiency for set-valued classification”. In: Annals of
Mathematics and Artificial Intelligence 81.1 (2017), pp. 21–46.

[34] Vovk, Vladimir, Petej, Ivan, Nouretdinov, Ilia, Ahlberg, Ernst, Carlsson, Lars,
and Gammerman, Alex. “Retrain or not retrain: Conformal test martingales for
change-point detection”. In: Conformal and Probabilistic Prediction and Ap-
plications. PMLR. 2021, pp. 191–210.

[35] Wang, Junshan, Song, Guojie, Wu, Yi, and Wang, Liang. “Streaming graph neu-
ral networks via continual learning”. In: Proceedings of the 29th ACM Interna-
tional Conference on Information & Knowledge Management. 2020, pp. 1515–
1524.

[36] Wu, Bingzhe, Li, Jintang, Hou, Chengbin, Fu, Guoji, Bian, Yatao, Chen, Liang,
and Huang, Junzhou. “Recent advances in reliable deep graph learning: Adver-
sarial attack, inherent noise, and distribution shift”. In: arXiv preprint arXiv:2202.07114
(2022).

[37] Wu, Zonghan, Pan, Shirui, Chen, Fengwen, Long, Guodong, Zhang, Chengqi,
and Philip, S Yu. “A comprehensive survey on graph neural networks”. In: IEEE
transactions on neural networks and learning systems 32.1 (2020), pp. 4–24.

[38] Zhang, Jin, Norinder, Ulf, and Svensson, Fredrik. “Deep learning-based confor-
mal prediction of toxicity”. In: Journal of Chemical Information and Modeling
61.6 (2021), pp. 2648–2657.


[39] Zhang, Si, Tong, Hanghang, Xu, Jiejun, and Maciejewski, Ross. “Graph convolu-
tional networks: a comprehensive review”. In: Computational Social Networks
6.1 (2019), pp. 1–23.

TRITA-EECS-EX-2022:555

www.kth.se
