Deep Graph Library

GNN training acceleration with BFloat16 data type on CPU

2024-08-10T00:00:00+00:00

Graph neural networks (GNNs) have achieved state-of-the-art performance on various industrial tasks. However, most GNN operations are memory-bound and require a significant amount of RAM. To tackle this problem, we apply a well-known technique, reducing tensor size by using a smaller data type, to make GNN training more memory-efficient on Intel® Xeon® Scalable processors with bfloat16. The approach delivers substantial gains across various GNN models and a wide range of datasets, speeding up training by up to 5×.

Bfloat16 data type

Bfloat16 is a 16-bit (half-precision) floating-point data type. It differs from the default 32-bit float data type only in mantissa length, which is 7 bits compared to 23 bits; the sign bit and the 8-bit exponent are the same.

Bfloat16 was developed by the Google Brain team and is now widely used in DNN and other AI applications. Many devices natively support bfloat16, from GPUs and AI accelerators to CPUs. Even compilers such as GCC and LLVM are enabling this data type in the latest C/C++ standards. According to the Google Brain team, the exponent is more valuable than the mantissa for training and other ML operations, so reducing the mantissa bits preserves the accuracy of the model while providing the same performance as other half-precision data types. Another advantage of bfloat16 is the simplicity of conversion to and from float: a bfloat16 value has the same layout as the upper 16 bits of a float.
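
To make the conversion concrete, here is a minimal pure-Python sketch that converts a float to bfloat16 by keeping the upper 16 bits of its 32-bit pattern (with round-to-nearest-even) and widens it back by appending 16 zero bits. The function names are ours, chosen for illustration; real frameworks do this in hardware or optimized kernels.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Convert x to a 16-bit bfloat16 pattern with round-to-nearest-even."""
    bits = float32_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep NaN a NaN: truncate and force one mantissa bit on.
        return (bits >> 16) | 0x0040
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b: int) -> float:
    """Widen a bfloat16 pattern back to float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


pi_bf16 = to_bfloat16_bits(3.14159265)
print(hex(pi_bf16), from_bfloat16_bits(pi_bf16))
```

With only 7 mantissa bits, the value that comes back keeps roughly two to three significant decimal digits, while the representable range stays the same as for float.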

Bfloat16 CPU acceleration

Starting from the 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), x86 CPUs natively support bfloat16. Support was enabled via the AVX512_BF16 vector instruction set of Intel® Advanced Vector Extensions-512 (Intel® AVX-512), which provides basic operations: dot product and conversion functions. In the latest 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids), the Intel® AMX instruction set was introduced to further improve 16-bit and 8-bit matrix performance. This instruction set adds “tile” instructions, which operate on special 2D “tile” registers. Currently, it supports only the tile matrix multiply unit (TMUL), which performs matrix multiplication for the bfloat16 and int8 data types. In upcoming Intel Xeon generations, starting from Granite Rapids, fp16 will be supported in addition to bfloat16.
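
One way to check which of these instruction sets a Linux machine exposes is to look at the CPU flags the kernel reports. The sketch below assumes the Linux `/proc/cpuinfo` flag names `avx512_bf16`, `amx_tile` and `amx_bf16`; the helper name `bf16_features` is ours, and it takes the file contents as a string so it can be tried on any platform.

```python
def bf16_features(cpuinfo_text: str) -> dict:
    """Report which bfloat16-related CPU flags appear in /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first colon is a space-separated flag list.
            flags.update(line.partition(":")[2].split())
    return {name: name in flags for name in ("avx512_bf16", "amx_tile", "amx_bf16")}


# Hand-written sample resembling one logical CPU of a Sapphire Rapids machine.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx512f avx512_bf16 amx_tile amx_bf16 amx_int8\n"
print(bf16_features(sample))
```

On a real Linux host the input would come from `open("/proc/cpuinfo").read()`.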

Bfloat16 in DGL

Recently, bfloat16 support was added to the DGL library (starting from DGL version 1.0.0 for Nvidia GPU and DGL version 1.1.0 for CPU), so it can be used in model training and inference on both CPU and GPU devices. The following DGL API calls convert the graph and the model to the bfloat16 data type:

# Convert graph, model, and graph features to bfloat16
g = dgl.to_bfloat16(g)
feat = feat.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)

The following example trains GraphSAGE with bfloat16 using the provided API:

import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl
from dgl.data import CiteseerGraphDataset
from dgl.nn import SAGEConv
from dgl.transforms import AddSelfLoop


class SAGE(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer SAGE
        self.layers.append(SAGEConv(in_size, hid_size, "gcn"))
        self.layers.append(SAGEConv(hid_size, out_size, "gcn"))
        self.dropout = nn.Dropout(0.5)


    def forward(self, graph, features):
        h = self.dropout(features)
        for l, layer in enumerate(self.layers):
            h = layer(graph, h)
            if l != len(self.layers) - 1:
                h = F.relu(h)
                h = self.dropout(h)
        return h


# Data loading
transform = AddSelfLoop()
data = CiteseerGraphDataset(transform=transform)


g = data[0]
g = g.int()
train_mask = g.ndata['train_mask']
feat = g.ndata['feat']
label = g.ndata['label']


in_size = feat.shape[1]
hid_size = 16
out_size = data.num_classes
model = SAGE(in_size, hid_size, out_size)


# Convert model and graph to bfloat16
g = dgl.to_bfloat16(g)
feat = feat.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)


model.train()


# Create optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=5e-4)
loss_fcn = nn.CrossEntropyLoss()


for epoch in range(100):
    logits = model(g, feat)
    loss = loss_fcn(logits[train_mask], label[train_mask])

    # Clear gradients left over from the previous epoch before backpropagating
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch {} | Loss {}'.format(epoch, loss.item()))

Experimental results

The most popular examples, GCN and GraphSAGE, have been tested. For full graph training, basic datasets were chosen, while for the mini-batch approach datasets from OGB were used, which are significantly larger than the basic ones. For instance, ogbn-products has around 2.5 million nodes and 61 million edges, whereas ogbn-papers100M has 111 million nodes and 1.6 billion edges. Table 1 shows that accuracy with bfloat16 is the same as with float or differs only slightly.

Model	Dataset	Test accuracy (float)	Test accuracy (bfloat16)
gcn	citeseer	71%	71%
gcn	cora	81%	81%
gcn	pubmed	79%	79%
graphsage	citeseer	71%	71%
graphsage	cora	81%	81%
graphsage	pubmed	78%	78%
gcn minibatch	ogbn-papers100M	57%	57%
gcn minibatch	ogbn-products	78%	78%
graphsage minibatch	ogbn-papers100M	62%	61%
graphsage minibatch	ogbn-products	76%	74%

The plots below show results on AWS r6i instances, powered by 3rd Generation Intel Xeon Scalable processors (codename Ice Lake), which do not have native bfloat16 instructions, and on AWS r7iz instances, based on 4th Generation Intel Xeon Scalable processors (Sapphire Rapids), which natively support both AVX512_BF16 and AMX. In both experiments, the number of threads is limited to 16, which is the best-known number of threads for a single run on Intel® Xeon®.

With bfloat16, GNN training became more efficient on both types of Intel® Xeon® instances. For basic datasets, performance improved by up to 32% on the AWS r6i and by up to 92% on the Intel® AMX-accelerated r7iz. For the notably larger OGB datasets, performance improved by up to 2.89 times on the r6i and up to 5.04 times on the r7iz. The plots show that all training steps became faster. The most significant impact was observed in the forward pass, which was up to 12.7 times faster with bfloat16.

Conclusion

Using the bfloat16 data type is strongly recommended for GNN training on Sapphire Rapids and later generations of Intel Xeon Scalable processors to enhance performance. Even on Ice Lake, which lacks native bfloat16 instructions, bfloat16 improves the efficiency of memory-bound operations and reduces training costs.

Nevertheless, certain methods within DL frameworks may not fully support or optimally use CPU bfloat16 instructions. In such situations, we advise evaluating both float and bfloat16 performance over a small number of epochs to determine the optimal choice.
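
A comparison like this can be as simple as timing a few epochs per data type. The following is a minimal sketch, not part of DGL: `time_epochs` and `pick_faster` are hypothetical helper names, and the two `time.sleep` lambdas are stand-ins for closures that would run one real training epoch in float and in bfloat16.

```python
import time


def time_epochs(run_epoch, epochs=3, warmup=1):
    """Return mean wall-clock seconds per epoch, ignoring warm-up epochs."""
    for _ in range(warmup):
        run_epoch()
    start = time.perf_counter()
    for _ in range(epochs):
        run_epoch()
    return (time.perf_counter() - start) / epochs


def pick_faster(candidates, epochs=3):
    """candidates maps a label to a zero-argument function that runs one epoch."""
    timings = {name: time_epochs(fn, epochs) for name, fn in candidates.items()}
    return min(timings, key=timings.get), timings


# Stand-in workloads; replace with closures that run one real training epoch.
best, timings = pick_faster({
    "float32": lambda: time.sleep(0.03),
    "bfloat16": lambda: time.sleep(0.002),
})
print(best, timings)
```

The warm-up epoch keeps one-time costs such as memory allocation and kernel selection out of the measurement.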

DGL 2.1: GPU Acceleration for Your GNN Data Pipeline

2024-03-06T00:00:00+00:00

We are happy to announce the release of DGL 2.1. In this release, we are making GNN data loading lightning fast. We introduce GPU acceleration for the whole GNN data loading pipeline in GraphBolt, including the graph sampling and feature fetching stages.

Flexible data pipeline & customizable stages, all accelerated on your GPU

Starting from this release, the data moving stage can now be moved earlier in the data pipeline to enable GPU acceleration. With this in mind, the following permutations of the core stages are now possible:

To execute all of the data loading stages on the GPU, the graph and the features need to be GPU accessible. As the GPU memory may be limited, GraphBolt offers an in-place pinning operation to enable GPU access to the graph and features resident in main memory.

# Pin the graph and features in-place.
graph = dataset.graph.pin_memory_()
features = dataset.feature.pin_memory_()

However, if the GPU has sufficient memory, the graph, its features, or both can be moved to the GPU as follows:

# Move the graph and features to the GPU.
graph = dataset.graph.to("cuda:0")
features = dataset.feature.to("cuda:0")

The GPU may have a large memory that is still not large enough to fit all the features. In that case, it is possible to cache part of the features using gb.GPUCachedFeature; please see the GraphBolt multiGPU example.
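
The idea behind caching part of the features can be sketched in a few lines of plain Python. This is only an illustration of the concept: the class name `LRUFeatureCache` is ours, and the least-recently-used eviction policy is an assumption chosen for simplicity, not a description of how gb.GPUCachedFeature is implemented.

```python
from collections import OrderedDict


class LRUFeatureCache:
    """Keep the most recently used feature rows in a small fast store."""

    def __init__(self, fetch_row, capacity):
        self._fetch_row = fetch_row      # slow path, e.g. a read from host memory
        self._capacity = capacity
        self._rows = OrderedDict()       # node id -> feature row
        self.hits = 0
        self.misses = 0

    def get(self, node_id):
        if node_id in self._rows:
            self.hits += 1
            self._rows.move_to_end(node_id)    # mark as most recently used
            return self._rows[node_id]
        self.misses += 1
        row = self._fetch_row(node_id)
        self._rows[node_id] = row
        if len(self._rows) > self._capacity:
            self._rows.popitem(last=False)     # evict the least recently used row
        return row


host_features = {i: [float(i)] * 4 for i in range(100)}
cache = LRUFeatureCache(host_features.__getitem__, capacity=2)
for node in [1, 2, 1, 3, 2]:
    cache.get(node)
print(cache.hits, cache.misses)
```

The more often the same nodes are sampled across mini-batches, the larger the share of feature reads that never leave the fast store.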

Placing the copy operation earlier in the pipeline enables GPU execution for the rest of the operations. All of the GraphBolt components compose as you expect.

# Seed edge sampler.
dp = gb.ItemSampler(train_edge_set, batch_size=1024, shuffle=True)
# Copy here to execute the remaining operations on the GPU.
dp = dp.copy_to(device="cuda:0")
# Negative sampling.
dp = dp.sample_uniform_negative(graph, negative_ratio=10)
# Neighbor sampling.
dp = dp.sample_neighbor(graph, fanouts=[15, 10, 5])
# Fetch features.
dp = dp.fetch_feature(features, node_feature_keys=["feat"])

The descriptive nature of the PyTorch datapipe lets us take a defined data pipeline and modify it to support GPU-specific optimizations with no change to the user experience. Two such examples are the overlap_feature_fetch and overlap_graph_fetch arguments of gb.DataLoader, which overlap the feature fetching and graph access operations with the rest of the operations using a separate CUDA stream via pipeline parallelism.
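
The effect of such overlapping can be illustrated without a GPU. The sketch below is a CPU-thread analogy rather than GraphBolt's CUDA-stream implementation: a background thread prepares the next batches while the consumer works on the current one. `prefetch` and `fetch_features` are hypothetical names used only here.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            q.put(batch)        # blocks when the buffer is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def fetch_features(batch_ids):
    # Stand-in for a slow feature fetch.
    return [i * 10 for i in batch_ids]


seed_batches = [[0, 1], [2, 3], [4, 5]]
loaded = prefetch(fetch_features(b) for b in seed_batches)
results = [sum(features) for features in loaded]   # "training" consumes batches
print(results)
```

Because the fetch for batch k+1 runs while batch k is being consumed, the slower of the two stages sets the pace instead of their sum.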

GPU acceleration speedups

dgl.graphbolt doesn’t just give you flexibility; it also provides top performance under the hood. As of the 2.1 release, almost all dgl.graphbolt operations are GPU accelerated, except for sampling with replacement. Additionally, the feature fetch operation now runs in parallel with everything else via pipeline parallelism. This has the potential to cut runtimes by up to 2x depending on the scenario. Moreover, utilizing gb.GPUCachedFeature can cut feature transfer times even more: our multi-GPU benchmarks show up to 1.6x speedup.

To evaluate the performance of GraphBolt, we have tested 4 different scenarios:

Single-GPU Node Classification
Single-GPU Link Prediction
Single-GPU Heterogeneous Node Classification
Multi-GPU Node Classification

In these scenarios, we will compare 5 different configurations. First two are the existing baselines, and the last three are new configurations enabled by the DGL 2.1 release:

GraphBolt CPU backend, denoted as “dgl.graphbolt (cpu)”.
The legacy DGL dataloader with UVA on GPU and the dataset pinned in system memory, denoted as “Legacy DGL (pinned)”.
GraphBolt GPU backend with the dataset pinned in system memory, denoted as “dgl.graphbolt (pinned)”.
GraphBolt GPU backend with the dataset pinned in system memory and gb.GPUCachedFeature caching 5M of the node features, denoted as “dgl.graphbolt (pinned, 5M)”.
GraphBolt GPU backend with the dataset moved to GPU memory, denoted as “dgl.graphbolt (cuda)”.

All the experiments were run on an NVIDIA DGX-A100 system with 8 GPUs.

Single-GPU Node Classification

We evaluate the performance of GraphBolt and the Legacy DGL dataloader when the dataset is stored in pinned memory (UVA) against the CPU GraphBolt baseline. We use a 3-layer GraphSAGE model with batch size 1024 and fanout 10 in each layer and evaluate the performance on the ogbn-products and ogbn-papers100M datasets using the baselines listed above.

As one can see, GraphBolt’s new GPU backend can get up to 4.2x speedup compared to the GraphBolt CPU baseline while the legacy DGL dataloader can get at most 2.5x.

Single-GPU Link Prediction

Here, we shift our focus to the link prediction scenario on the ogbl-citation2 dataset with a similar setting as the previous section. Here, two different modes are evaluated, including or excluding reverse edges.

We observe that GraphBolt’s new GPU backend can get up to 5x speedup compared to its CPU baselines. Legacy DGL dataloader is slow due to missing GPU counterparts of some operations required for link prediction dataloading.

Single-GPU Heterogeneous Node Classification

You can accelerate heterogeneous sampling on your GPU as well. The R-GCN example runtime on the ogbn-mag dataset was 43.5s with the “dgl.graphbolt (cpu)” baseline and it went down to 25.2s with the new “dgl.graphbolt (pinned)” for a 1.73x speedup. You can expect the speedup numbers to increase as we optimize the different use cases for the GPU.

Multi-GPU Node Classification

Here, we evaluate the node classification use case in the multi-GPU setting with the GraphSAGE model on the ogbn-papers100M dataset (111M nodes and 3.2B edges). The setting is similar to the single-GPU scenario, except that each GPU uses a batch size of 1024, so the global batch size increases linearly with the number of GPUs.

40GB memory on each of our A100 GPUs can be utilized by the GPU cache feature in GraphBolt. We can achieve 1.75x improvement on a single GPU and 2.31x improvement on 8 GPUs compared to previous state-of-the-art baseline, Legacy DGL (pinned). Moreover, compared to the GraphBolt CPU baseline, we achieve over 10x improvement.

Reduced graph storage space requirements

Many large-scale graphs and existing GNN datasets have fewer than 2 billion nodes but more than 2 billion edges. One such example is the ogbn-papers100M graph with its 111 million nodes and 3.2 billion edges. dgl.graphbolt uses the CSC (Compressed Sparse Column) format to store your graph in a memory-efficient way. With our latest additions, memory usage now scales with 4 bytes (int32) per edge and 8 bytes (int64) per node, meaning close to 2x space savings for graph storage by using mixed data types. The provided preprocessing functionality automatically casts the tensors in your dataset, such as the edge type information in the heterogeneous case, into the smallest possible data types for optimum space use and performance. With these optimizations, you get 3x space savings for the heterogeneous ogb-lsc-mag240m graph compared to our previous release.
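
The "close to 2x" figure can be checked with a little arithmetic. In CSC, the offset array has one entry per node (plus one) and must hold values up to the edge count, so it needs int64 once there are more than about 2.1 billion edges; the index array has one entry per edge and holds node IDs, which fit in int32 when there are fewer than about 2.1 billion nodes. The sketch below uses the rounded ogbn-papers100M sizes quoted above and, as a simplifying assumption, counts only these two arrays.

```python
def csc_bytes(num_nodes, num_edges, indptr_bytes, indices_bytes):
    """Size of a CSC graph: one offset per node (+1) and one node id per edge."""
    return (num_nodes + 1) * indptr_bytes + num_edges * indices_bytes


nodes, edges = 111_000_000, 3_200_000_000   # rounded ogbn-papers100M sizes

all_int64 = csc_bytes(nodes, edges, indptr_bytes=8, indices_bytes=8)
mixed = csc_bytes(nodes, edges, indptr_bytes=8, indices_bytes=4)

print(f"int64 everywhere: {all_int64 / 1e9:.1f} GB")
print(f"int64 offsets + int32 ids: {mixed / 1e9:.1f} GB")
print(f"saving: {all_int64 / mixed:.2f}x")
```

Because edges outnumber nodes by almost 30 to 1 in this graph, halving the per-edge cost nearly halves the total.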

What’s more

Furthermore, dgl.graphbolt is compatible with PyTorch Geometric (PyG) as well. In the figure below, the notation in the parentheses represents where the graph and the features are placed. “(cpu-cuda)” means that the graph is placed on the CPU while the features are moved to the GPU. We compare our advanced PyG example against the PyG official example, both using the PyG GraphSAGE model. We run the node classification task on the ogbn-products dataset with [15, 10, 5] fanout.

While providing an extremely optimized Neighbor Sampler implementation, we also offer a new drop-in replacement called Layer Neighbor Sampler from NeurIPS 2023. One can see that we provide up to 5.5x speedup over PyG, combining GPU acceleration, pipeline parallelism and the state-of-the-art algorithms. For more information on the new features in DGL 2.1, please refer to our release notes.

Get started with DGL 2.1

You can easily install DGL 2.1 with dgl.graphbolt on any platform using pip or conda. Dive into our updated Stochastic Training of GNNs with GraphBolt tutorial and experiment with our node classification and link prediction examples in Google Colab. No need to set up a local environment - just point and click! We also updated the existing 7 comprehensive single-GPU examples and 1 multi-GPU example with GPU Acceleration options. DGL 2.1 will be featured in the NVIDIA DGL container 24.03 release which will be released before the end of March 2024.

We welcome your feedback and are available via GitHub issues and Discuss posts. Join our Slack channel to stay updated and to connect with the community.

DGL 2.0: Streamlining Your GNN Data Pipeline from Bottleneck to Boost

2024-01-26T00:00:00+00:00

We’re thrilled to announce the release of DGL 2.0, a major milestone in our mission to empower developers with cutting-edge tools for Graph Neural Networks (GNNs). Traditionally, data loading has been a significant bottleneck in GNN training. Complex graph structures and the need for efficient sampling often lead to slow data loading times and resource constraints. This can drastically hinder the training speed and scalability of your GNN models. DGL 2.0 breaks free from these limitations with the introduction of dgl.graphbolt, a revolutionary data loading framework that supercharges your GNN training by streamlining the data pipeline.

High-Level Architecture of GraphBolt Data Pipeline

Flexible data pipeline & customizable stages

One size doesn’t fit all, especially when it comes to dealing with a variety of graph data and GNN tasks. For instance, link prediction requires negative sampling while node classification does not, some features are too large to be stored in memory, and occasionally we might combine multiple sampling operations to form subgraphs. To offer adaptable operators while maintaining high performance, dgl.graphbolt integrates seamlessly with the PyTorch datapipe, relying on the unified “MiniBatch” data structure to connect processing stages. The core stages are defined as:

Item Sampling: randomly selects a subset (nodes, edges, graphs) from the entire training set as an initial mini-batch for downstream computation.
Negative Sampling (for Link Prediction): generates non-existing edges as negative examples.
Subgraph Sampling: generates subgraphs based on the input nodes/edges.
Feature Fetching: fetches related node/edge features from the dataset for the given input.
Data Moving (for training on GPU): moves the data to specified device for training.

# Seed edge sampler.
dp = gb.ItemSampler(train_edge_set, batch_size=1024, shuffle=True)
# Negative sampling.
dp = dp.sample_uniform_negative(graph, negative_ratio=10)
# Neighbor sampling.
dp = dp.sample_neighbor(graph, fanouts=[15, 10, 5])
# Fetch features.
dp = dp.fetch_feature(features, node_feature_keys=["feat"])
# Copy to GPU for training.
dp = dp.copy_to(device="cuda:0")
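
To show what the first two stages do conceptually, here is a small pure-Python sketch of item sampling followed by uniform negative sampling. It is a simplified stand-in, not the GraphBolt implementation: the function names mirror the stages above for readability, and the negative sampler makes the simplifying assumption that a randomly drawn pair may occasionally coincide with a real edge.

```python
import random


def item_sampler(items, batch_size, shuffle=True, rng=random):
    """Yield mini-batches of seed items, optionally in shuffled order."""
    items = list(items)
    if shuffle:
        rng.shuffle(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def sample_uniform_negative(batch, num_nodes, negative_ratio, rng=random):
    """For each positive edge (u, v), draw `negative_ratio` random pairs (u, w)."""
    negatives = [(u, rng.randrange(num_nodes))
                 for (u, _v) in batch for _ in range(negative_ratio)]
    return batch, negatives


rng = random.Random(0)
train_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
for seeds in item_sampler(train_edges, batch_size=2, rng=rng):
    positives, negatives = sample_uniform_negative(seeds, num_nodes=5,
                                                   negative_ratio=2, rng=rng)
    print(positives, negatives)
```

Each stage consumes the output of the previous one, which is the same chaining pattern the `dp = dp.<stage>(...)` calls express.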

dgl.graphbolt allows you to plug in your own custom processing steps to build the perfect data pipeline for your needs, for example:

# Seed edge sampler.
dp = gb.ItemSampler(train_edge_set, batch_size=1024, shuffle=True)
# Negative sampling.
dp = dp.sample_uniform_negative(graph, negative_ratio=10)
# Neighbor sampling.
dp = dp.sample_neighbor(graph, fanouts=[15, 10, 5])

# Exclude seed edges.
dp = dp.transform(gb.exclude_seed_edges)

# Fetch features.
dp = dp.fetch_feature(features, node_feature_keys=["feat"])
# Copy to GPU for training.
dp = dp.copy_to(device="cuda:0")

dgl.graphbolt empowers you to customize stages in your data pipelines. Implement custom stages using pre-defined APIs, such as loading features from external storage or adding customized caching mechanisms (e.g. GPUCachedFeature), and integrate the custom stages seamlessly without any modifications to your core training code.

Speed enhancement & memory efficiency

dgl.graphbolt doesn’t just give you flexibility; it also provides top performance under the hood. It features a compact graph data structure for efficient sampling, a blazing-fast multi-threaded neighbor sampling operator and edge exclusion operator, and a built-in option to store large feature tensors outside your CPU’s main memory. Additionally, dgl.graphbolt takes care of scheduling across all hardware, minimizing wait times and maximizing efficiency.

dgl.graphbolt brings impressive speed gains to your GNN training, showcasing over 30% faster node classification and a remarkable ~390% acceleration for link prediction with edge exclusion in our benchmarks.

Epoch Time(s)	GraphSAGE	R-GCN
DGL Dataloader	22.5	73.6
dgl.graphbolt	17.2	64.6
Speedup	1.31x	1.14x

Node classification speedup (NVIDIA T4 GPU). GraphSAGE is tested on OGBN-Products. R-GCN is tested on OGBN-MAG

Epoch Time(s)	include seeds	exclude seeds
DGL Dataloader	37.75	135.32
dgl.graphbolt	15.51	27.62
Speedup	2.43x	4.90x

Link prediction speedup (NVIDIA T4 GPU) on OGBL-Citation2
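
The speedup rows and the percentages quoted earlier follow directly from the epoch times in the two tables. A short check, using only the numbers reported above:

```python
epoch_times = {
    # benchmark: (DGL dataloader seconds, dgl.graphbolt seconds)
    "GraphSAGE node classification": (22.5, 17.2),
    "R-GCN node classification": (73.6, 64.6),
    "link prediction, include seeds": (37.75, 15.51),
    "link prediction, exclude seeds": (135.32, 27.62),
}

speedups = {name: old / new for name, (old, new) in epoch_times.items()}
for name, s in speedups.items():
    # A speedup of s times means (s - 1) * 100 percent faster.
    print(f"{name}: {s:.2f}x ({(s - 1) * 100:.0f}% faster)")
```

A 4.90x speedup is the same statement as "about 390% faster", and 1.31x corresponds to the "over 30%" figure for GraphSAGE.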

dgl.graphbolt also proves its worth for memory-constrained training on enormous graphs like OGBN-MAG240m. While both dgl.graphbolt and the DGL dataloader use mmap-based optimization, dgl.graphbolt is faster at every RAM size in the table below, by 1.07x to 1.30x. Its well-defined component API also makes it easier for contributors to refine out-of-core solutions in the future, so that even the most massive graphs can be tackled with ease.

Iteration time with different RAM size (s)	128GB RAM	256GB RAM	384GB RAM
Naïve DGL dataloader	OOM	OOM	OOM
Optimized DGL dataloader	65.42	3.86	0.30
dgl.graphbolt	60.99	3.21	0.23

Node classification on OGBN-MAG240m under different RAM sizes. Optimized DGL dataloader baseline uses mmap to load features.
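
For readers unfamiliar with mmap-based feature loading, the sketch below shows the basic mechanism with the Python standard library: the feature file is mapped into the address space and only the rows that are actually requested get read from disk. The class `MmapFeatureStore` and the raw little-endian float32 file layout are assumptions made for this illustration; they are not the format used by DGL.

```python
import mmap
import os
import struct
import tempfile


class MmapFeatureStore:
    """Read fixed-width float32 feature rows from a file without loading it all."""

    def __init__(self, path, dim):
        self._dim = dim
        self._row_bytes = 4 * dim
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __getitem__(self, node_id):
        start = node_id * self._row_bytes
        raw = self._mm[start:start + self._row_bytes]   # only these pages are touched
        return list(struct.unpack(f"<{self._dim}f", raw))

    def close(self):
        self._mm.close()
        self._file.close()


# Write a tiny 3 x 2 feature matrix to disk, then read one row back on demand.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
path = os.path.join(tempfile.mkdtemp(), "feat.bin")
with open(path, "wb") as f:
    for row in rows:
        f.write(struct.pack("<2f", *row))

store = MmapFeatureStore(path, dim=2)
row_1 = store[1]
store.close()
print(row_1)
```

The operating system keeps recently used pages in RAM, which is why iteration time in the table drops sharply as more of the feature file fits in memory.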

What’s more

Furthermore, DGL 2.0 includes various new additions such as a hetero-relational GCN example and several datasets. Improvements have been made to the system, examples, and documentation, including updates to the CPU Docker tcmalloc, support for sparse matrix slicing operators, and enhancements in various examples. A set of utilities for building graph transformer models is released along with this version, including NN modules such as positional encoders and layers as building blocks, plus examples and tutorials demonstrating their usage. Additionally, numerous bugs have been fixed, such as the cusparseCreateCsr format issue for cuda12 and the lazy device copy problem related to DGL node/edge features. For more information on the new additions and changes in DGL 2.0, please refer to our release note.

Get started with DGL 2.0

You can easily install DGL 2.0 with dgl.graphbolt on any platform using pip or conda. To jump right in, dive into our brand-new Stochastic Training of GNNs with GraphBolt tutorial and experiment with our node classification and link prediction examples in Google Colab. No need to set up a local environment - just point and click! This first release of DGL 2.0 with dgl.graphbolt packs a punch with 7 comprehensive single-GPU examples and 1 multi-GPU example, covering a wide range of tasks.

We welcome your feedback and are available via GitHub issues and Discuss posts. Join our Slack channel to stay updated and to connect with the community.

DGL 1.0: Empowering Graph Machine Learning for Everyone

2023-02-20T00:00:00+00:00

We are thrilled to announce the arrival of DGL 1.0, a cutting-edge machine learning framework for deep learning on graphs. Over the past three years, there has been growing interest from both academia and industry in this technology. Our framework has received requests from various scenarios, from academic research on state-of-the-art models to industrial demands for scaling Graph Neural Network (GNN) solutions to large, real-world problems. With DGL 1.0, we aim to provide a comprehensive and user-friendly solution for all users to take advantage of graph machine learning.

Different levels of user requests and what DGL 1.0 provides to fulfill them

DGL 1.0 adopts a layered and modular design to fulfill various user requests. The key features of DGL 1.0 include:

100+ examples of state-of-the-art GNN models, 15+ top-ranked baselines on Open Graph Benchmark (OGB), available for learning and integration
150+ GNN utilities including GNN layers, datasets, graph data transform modules, graph samplers, etc. for building new model architectures or GNN-based solutions
Flexible and efficient message passing and sparse matrix abstraction for developing new GNN building blocks
Multi-GPU and distributed training capability to scale to graphs of billions of nodes and edges

The new additions and updates in DGL 1.0 are depicted in the accompanying figure. One of the highlights of this release is the introduction of DGL-Sparse, a new specialized package for graph ML models defined in sparse matrix notations. DGL-Sparse streamlines the programming process not just for well-established GNNs such as Graph Convolutional Networks, but also for the latest models, including diffusion-based GNNs, hypergraph neural networks, and Graph Transformers. In the following article, we will provide an overview of two popular paradigms for expressing GNNs, i.e., the message passing view and the matrix view, which motivated the creation of DGL-Sparse. We will then show you how to get started with this new and exciting feature.

DGL 1.0 stack

The Message Passing View vs. The Matrix View

“It’s the theory that the language you speak determines how you think and affects how you see everything.” — Louise Banks from Film Arrival

Representing a Graph Neural Network can take two distinct forms. The first, known as the message passing view, approaches GNN models from a fine-grained, local perspective, detailing how messages are exchanged along edges and how node states are updated accordingly. Alternatively, due to the algebraic equivalence of a graph to a sparse adjacency matrix, many researchers opt to express their GNN models from a coarse-grained, global perspective, emphasizing the operations involving the sparse adjacency matrix and dense feature tensors.

These local and global perspectives are sometimes interchangeable, but more often, provide complementary insights into the fundamentals and limitations of GNNs. For instance, the message passing view highlights the connection between GNNs and the Weisfeiler Lehman (WL) graph isomorphism test, which also relies on aggregating information from neighbors (as described in Xu et al., 2018). Meanwhile, the matrix view provides valuable understanding of the algebraic properties of GNNs, leading to intriguing findings such as the oversmoothing phenomenon (as discussed in Li et al., 2018). In conclusion, both the message passing view and matrix view are indispensable tools in studying and describing GNNs, and this is precisely what motivates the key feature we will be showcasing in DGL 1.0.

DGL Sparse: Sparse Matrix Abstraction for Graph ML

In DGL 1.0, we are happy to announce the release of DGL Sparse, a new sub-package (dgl.sparse) that complements the existing message passing interface in DGL to support the entire spectrum of GNN models. DGL Sparse provides sparse matrix classes and operations specialized for Graph ML, making it easier to program GNNs described in the matrix view. In the following section, we will demonstrate a few examples of GNNs, showcasing their mathematical formulation and corresponding code implementation in DGL Sparse.

The Graph Convolutional Network (Kipf et al., 2017) is one of the pioneering works in GNN modeling. GCN can be expressed in both the message passing view and the matrix view. The code below compares the two perspectives and their implementations in DGL.

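import torch.nn as nn
import torch.nn.functional as F
import dgl.function as fn  # DGL message passing functions

class GCNLayer(nn.Module):
    ...
    
    def forward(self, g, X):
        g.ndata['X'] = X
        g.ndata['deg'] = g.in_degrees().float()
        # Send normalized messages along edges, then sum them at each destination
        g.update_all(self.message, fn.sum('m', 'X_neigh'))
        X_neigh = g.ndata['X_neigh']
        return F.relu(self.W(X_neigh))
    
    def message(self, edges):
        # Symmetric normalization 1 / sqrt(deg_i * deg_j), one scalar per edge
        c_ij = (edges.src['deg'] * edges.dst['deg']) ** -0.5
        # unsqueeze so the per-edge scalar broadcasts over the feature dimension
        return {'m' : edges.src['X'] * c_ij.unsqueeze(-1)}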

GCN in DGL's message passing API

import dgl.sparse as dglsp  # DGL 1.0 sparse matrix package

class GCNLayer(nn.Module):
    ...
    
    def forward(self, A, X):
        D_invsqrt = dglsp.diag(A.sum(1)) ** -0.5
        A_norm = D_invsqrt @ A @ D_invsqrt
        return F.relu(self.W(A_norm @ X))

GCN in DGL Sparse
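
That the two views compute the same thing can be verified on a toy graph without any framework. The pure-Python sketch below covers only the normalized aggregation step (the linear layer and ReLU are left out); the graph, features and function names are made up for the illustration, and the graph is symmetric with self-loops so that in-degrees and row sums coincide and no degree is zero.

```python
import math

# Directed edges (src, dst) of a small symmetric graph with self-loops.
edges = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 0), (1, 2), (2, 1)]
num_nodes = 3
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]          # node features, shape (3, 2)

in_deg = [sum(1 for _, d in edges if d == v) for v in range(num_nodes)]


def message_passing_view():
    """Each edge sends src features scaled by c_ij; each node sums its inbox."""
    out = [[0.0, 0.0] for _ in range(num_nodes)]
    for s, d in edges:
        c = (in_deg[s] * in_deg[d]) ** -0.5
        for k in range(2):
            out[d][k] += c * X[s][k]
    return out


def matrix_view():
    """Build A_norm = D^-1/2 A D^-1/2 explicitly, then multiply by X."""
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for s, d in edges:
        A[d][s] = 1.0                            # row = destination, column = source
    A_norm = [[A[i][j] * in_deg[i] ** -0.5 * in_deg[j] ** -0.5
               for j in range(num_nodes)] for i in range(num_nodes)]
    return [[sum(A_norm[i][j] * X[j][k] for j in range(num_nodes))
             for k in range(2)] for i in range(num_nodes)]


mp, mat = message_passing_view(), matrix_view()
print(mp)
print(mat)
```

The two results agree up to floating-point rounding, since the per-edge coefficient c_ij is exactly the (i, j) entry of the normalized adjacency matrix.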

Graph Diffusion-based GNNs. Graph diffusion is a process of propagating or smoothing node features/signals along edges. Many classical graph algorithms such as PageRank belong to this category. A line of research has shown that combining graph diffusion with neural networks is an effective and efficient way to enhance model predictions. The equation below describes the core computation of one representative model, Approximate Personalized Propagation of Neural Predictions (Gasteiger et al., 2018), which can be implemented in DGL Sparse straightforwardly.
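
In symbols, and consistent with the code that follows, the propagation can be written as below. The notation is assumed here for illustration: $\hat{A}$ is the adjacency matrix passed to the layer after edge dropout, $\alpha$ is the teleport probability, $f_\theta$ is the prediction network, and $K$ is the number of hops.

$Z^{(0)} = f_\theta(X)$

$Z^{(k+1)} = (1-\alpha)\,\hat{A}\,Z^{(k)} + \alpha\,Z^{(0)}, \quad k = 0, \dots, K-1$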

class APPNP(nn.Module):
    ...

    def forward(self, A, X):
        Z_0 = Z = self.f_theta(X)
        for _ in range(self.num_hops):
            A_drop = dglsp.val_like(A, self.A_dropout(A.val))
            Z = (1 - self.alpha) * A_drop @ Z + self.alpha * Z_0
        return Z

Hypergraph Neural Networks. A hypergraph is a generalization of a graph in which an edge (called a hyperedge) can join any number of nodes. Hypergraphs are particularly useful in scenarios that require capturing high-order relations, such as co-purchase behaviors in e-commerce platforms or co-authorship in citation networks. A hypergraph is typically characterized by its sparse incidence matrix, and thus Hypergraph Neural Networks (HGNN) are commonly defined in sparse matrix notations. The equation and code implementation of Hypergraph Convolution, proposed by Feng et al., 2018, are presented below.
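
Written to match the code that follows, the layer can be expressed as below. The notation is assumed for illustration: $H$ is the node-by-hyperedge incidence matrix, $D_V$ and $D_E$ are the diagonal node and hyperedge degree matrices, $W$ is the diagonal hyperedge weight matrix (the identity in the code), and $\Theta$ is the learnable linear transform; any nonlinearity is left out, as in the code.

$L = D_V^{-1/2}\, H\, W\, D_E^{-1}\, H^{\top}\, D_V^{-1/2}$

$X' = L\, X\, \Theta$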

class HypergraphConv(nn.Module):
    ...

    def forward(self, H, X):
        d_V = H.sum(1)  # node degree
        d_E = H.sum(0)  # edge degree
        n_edges = d_E.shape[0]
        D_V_invsqrt = dglsp.diag(d_V**-0.5)  # D_V ** (-1/2)
        D_E_inv = dglsp.diag(d_E**-1)  # D_E ** (-1)
        W = dglsp.identity((n_edges, n_edges))
        L = D_V_invsqrt @ H @ W @ D_E_inv @ H.T @ D_V_invsqrt
        return self.Theta(L @ X)

Graph Transformers. The Transformer has proven to be an effective learning architecture in natural language processing and computer vision. Researchers have begun to extend the use of Transformers to graph learning as well. One of the pioneering works (Dwivedi et al., 2020) proposed constraining the all-pair multi-head attention to the connected node pairs in a graph. With DGL Sparse, implementing this formulation is now a straightforward process, taking only about 10 lines of code.

class GraphMHA(nn.Module):
    ...

    def forward(self, A, h):
        N = len(h)
        q = self.q_proj(h).reshape(N, self.head_dim, self.num_heads)
        q *= self.scaling
        k = self.k_proj(h).reshape(N, self.head_dim, self.num_heads)
        v = self.v_proj(h).reshape(N, self.head_dim, self.num_heads)

        attn = dglsp.bsddmm(A, q, k.transpose(1, 0))  # [N, N, nh]
        attn = attn.softmax()
        out = dglsp.bspmm(attn, v)

        return self.out_proj(out.reshape(N, -1))

Key Features of DGL Sparse

To handle diverse use cases in an efficient manner, DGL Sparse is designed with two key features that set it apart from other sparse matrix libraries such as scipy.sparse or torch.sparse:

Automatic Sparse Format Selection. DGL Sparse eliminates the complexity of choosing the right data structure for storing a sparse matrix (also known as the sparse format). Users can create a sparse matrix with a single call to dgl.sparse.spmatrix, and DGL’s internal sparse matrix will automatically select the optimal format based on the intended operation.
Scalar or Vector Non-zero Elements. GNN models often associate edges with multi-channel weight vectors, such as multi-head attention vectors, as demonstrated in the Graph Transformer example. To accommodate this, DGL Sparse allows non-zero elements to have vector shapes and extends common sparse operations, such as sparse-dense-matrix multiplication (SpMM), to operate on this new form (as seen in the bspmm operation in the Graph Transformer example).
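
To make the sparse operations behind the Graph Transformer example more tangible, here is a single-head, pure-Python sketch of the three steps: sampled dense-dense multiplication (scores computed only at nonzero positions), a softmax over each row's nonzeros, and sparse-dense multiplication. The function names and toy data are ours, and the softmax skips the usual max-subtraction for brevity; the batched bsddmm and bspmm operations apply the same idea with one value per head stored in each nonzero.

```python
import math


def sddmm(nonzeros, Q, K):
    """Compute q_i . k_j only at the (i, j) positions stored in the sparse matrix."""
    return {(i, j): sum(a * b for a, b in zip(Q[i], K[j])) for (i, j) in nonzeros}


def row_softmax(values):
    """Softmax over the nonzeros of each row."""
    exp = {pos: math.exp(v) for pos, v in values.items()}
    row_sum = {}
    for (i, _j), e in exp.items():
        row_sum[i] = row_sum.get(i, 0.0) + e
    return {(i, j): e / row_sum[i] for (i, j), e in exp.items()}


def spmm(values, V, num_rows):
    """Multiply the sparse matrix by dense V, touching only the nonzeros."""
    out = [[0.0] * len(V[0]) for _ in range(num_rows)]
    for (i, j), w in values.items():
        for k, v in enumerate(V[j]):
            out[i][k] += w * v
    return out


# Nonzero positions of a 3-node adjacency matrix (row attends to column).
nonzeros = [(0, 0), (0, 1), (1, 1), (2, 0), (2, 2)]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attn = row_softmax(sddmm(nonzeros, Q, K))
out = spmm(attn, V, num_rows=3)
print(out)
```

The cost of every step is proportional to the number of nonzeros rather than to the square of the number of nodes, which is the point of restricting attention to connected node pairs.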

By utilizing these design features, DGL Sparse reduces code length by 2.7 times on average when compared to previous implementations of matrix-view models with message passing interface. The simplified code also results in 43% less overhead in the framework. Additionally, DGL Sparse is PyTorch compatible, making it easy to integrate with the various tools and packages available within the PyTorch ecosystem.

Get started with DGL 1.0

The framework is readily available on all platforms and can be easily installed using pip or conda. To get started with DGL Sparse, check out the new Quickstart tutorial and play with it in Google Colab without having to set up a local environment. In addition to the examples you’ve seen above, the first release of DGL Sparse includes 5 tutorials and 11 end-to-end examples to help you learn and understand the different uses of this new package.

We welcome your feedback and are available via GitHub issues and Discuss posts. Join our Slack channel to stay updated and to connect with the community.

For more information on the new additions and changes in DGL 1.0, please refer to our release note.

(Banner image generated by Midjourney.)

Improving Graph Neural Networks via Network-in-network Architecture

2022-11-28T00:00:00+00:00

As Graph Neural Networks (GNNs) have become increasingly popular, there is wide interest in designing deeper GNN architectures. However, deep GNNs suffer from the oversmoothing issue, where the learnt node representations quickly become indistinguishable as more layers are added. This blog features a simple yet effective technique to build a deep GNN without the concern of oversmoothing. The new architecture, Network in Graph Neural Networks (NGNN), inspired by the network-in-network architecture for computer vision, has shown superior performance on multiple Open Graph Benchmark (OGB) leaderboards.

Introducing NGNN

At a high level, a message-passing graph neural network (MPGNN) layer can be written as a non-linear function:

$h^{(l+1)}=\sigma\left(f_w\left(\mathcal{G}, h^l\right)\right)$

with $h^{(0)}=X$ being the input node features, $\mathcal{G}$ being the input graph, $h^L$ being the node embeddings in the last layer used by downstream tasks, $L$ being the number of GNN layers. Additionally, the function $f_w\left(\mathcal{G}, h^l\right)$ is determined by learnable parameters $w$ and $\sigma(\cdot)$ is a non-linear activation function.

Instead of adding many more GNN layers, NGNN deepens a GNN model by inserting nonlinear feedforward neural network layer(s) within each GNN layer.
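One way to write this down, as a sketch rather than the paper's exact notation: let $g_{ngnn}$ (a symbol assumed here) denote the inserted feedforward network with its own learnable weights. The layer then becomes

$h^{(l+1)}=\sigma\left(g_{ngnn}\left(f_w\left(\mathcal{G}, h^l\right)\right)\right)$

so the graph structure $\mathcal{G}$ is only used inside $f_w$, while $g_{ngnn}$ acts on each node's embedding independently.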

In essence, NGNN is just a nonlinear transformation of the original embeddings of the nodes in the $l$ -th layer. Despite its simplicity, the NGNN technique is quite powerful (we will come to that in a moment). Additionally, it does not have large memory overhead and can work with various training methods such as neighbor sampling or subgraph sampling.

The intuition behind it is straightforward. As the number of GNN layers and the number of training iterations increase, the representations of nodes within the same connected component tend to converge to the same value. NGNN uses a simple MLP after certain GNN layers to tackle this so-called oversmoothing issue.
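The convergence described above can be observed on a toy example. The sketch below is an illustration, not a GNN layer: it repeatedly applies a parameter-free mean aggregation (a stand-in assumed here for a stack of graph convolutions) on a small path graph and measures how far apart the node values remain.

```python
def mean_aggregate(adj, h):
    """One propagation step: average each node with its neighbors."""
    return [
        (h[i] + sum(h[j] for j in adj[i])) / (1 + len(adj[i]))
        for i in range(len(h))
    ]

# Path graph 0 - 1 - 2 - 3 with one scalar feature per node.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
h = [1.0, 0.0, 0.0, -1.0]

spread_before = max(h) - min(h)
for _ in range(50):
    h = mean_aggregate(adj, h)
spread_after = max(h) - min(h)

print(spread_before, spread_after)
```

After 50 propagation steps the spread between the largest and smallest node value has shrunk by several orders of magnitude, which is the oversmoothing effect in its simplest form.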

Implementing NGNN in Deep Graph Library (DGL)

To gain better insight into this trick, let us use DGL to implement a simple NGNN, using the GCN layer as the backbone.

With DGL’s built-in GCN layer dgl.nn.GraphConv, we can easily implement a minimal NGNN_GCNConv layer, which just applies a $\rm{ReLU}$ activation and a linear transformation after a GCN layer.

import torch
import torch.nn.functional as F
from dgl.nn import GraphConv
from torch.nn import Linear

class NGNN_GCNConv(torch.nn.Module):
    def __init__(self, input_channels, hidden_channels, output_channels):
        super(NGNN_GCNConv, self).__init__()
        self.conv = GraphConv(input_channels, hidden_channels)
        self.fc = Linear(hidden_channels, output_channels)

    def forward(self, g, x, edge_weight=None):
        x = self.conv(g, x, edge_weight)
        x = F.relu(x)
        x = self.fc(x)
        return x

Afterwards, you can simply stack the NGNN_GCNConv layer and the dgl.nn.GraphConv layer to form a multi-layer NGNN_GCN network.

import torch.nn as nn
import torch.nn.functional as F

class NGNN_GCN(nn.Module):
    def __init__(self, input_channels, hidden_channels, output_channels):
        super(NGNN_GCN, self).__init__()
        # NGNN layer: a GCN layer followed by ReLU and a linear transformation
        self.conv1 = NGNN_GCNConv(input_channels, hidden_channels, hidden_channels)
        self.conv2 = GraphConv(hidden_channels, output_channels)

    def forward(self, g, x):
        h = self.conv1(g, x)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

You can replace dgl.nn.GraphConv with any other graph convolution layer in the NGNN architecture. DGL provides implementations of many popular convolutional layers and utility modules. You can easily invoke them with one line of code and build your own NGNN modules.

Model Performance

NGNN can be used for many downstream tasks, such as node classification/regression, edge classification/regression, link prediction and graph classification. In general, NGNN achieves better results than its backbone GNN on these tasks. For instance, NGNN+SEAL achieves top-1 performance on the ogbl-ppa leaderboard, improving Hit@100 by 10.91 percentage points over the vanilla SEAL. The table below shows the performance improvement of NGNN over various vanilla GNN backbones.

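Dataset	Metric	Backbone	Variant	Performance
ogbn-proteins	ROC-AUC(%)	GraphSage+Cluster Sampling	Vanilla	67.45 ± 1.21
ogbn-proteins	ROC-AUC(%)	GraphSage+Cluster Sampling	+NGNN	68.12 ± 0.96
ogbn-products	Accuracy(%)	GraphSage	Vanilla	78.27 ± 0.45
ogbn-products	Accuracy(%)	GraphSage	+NGNN	79.88 ± 0.34
ogbn-products	Accuracy(%)	GAT+Neighbor Sampling	Vanilla	79.23 ± 0.16
ogbn-products	Accuracy(%)	GAT+Neighbor Sampling	+NGNN	79.67 ± 0.09
ogbl-collab	hit@50(%)	GCN	Vanilla	49.52 ± 0.70
ogbl-collab	hit@50(%)	GCN	+NGNN	53.48 ± 0.40
ogbl-collab	hit@50(%)	GraphSage	Vanilla	51.66 ± 0.35
ogbl-collab	hit@50(%)	GraphSage	+NGNN	53.59 ± 0.56
ogbl-ppa	hit@100(%)	SEAL-DGCNN	Vanilla	48.80 ± 3.16
ogbl-ppa	hit@100(%)	SEAL-DGCNN	+NGNN	59.71 ± 2.45
ogbl-ppa	hit@100(%)	GCN	Vanilla	18.67 ± 1.32
ogbl-ppa	hit@100(%)	GCN	+NGNN	36.83 ± 0.99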

Accelerating Partitioning of Billion-scale Graphs with DGL v0.9.1

2022-09-19T00:00:00+00:00

Graphs are ubiquitous for representing relational data, and many real-world applications such as recommendation and fraud detection involve learning from massive graphs. As such, GNNs have emerged as a powerful family of models to learn their representations. However, training GNNs on massive graphs is challenging. One of the issues is the high resource demand of distributing graph data to a cluster. For example, partitioning a random graph of 1 billion nodes and 5 billion edges into 8 partitions requires a powerful AWS EC2 x1e.32xlarge instance (128 vCPU, 3.9TB RAM) running for 10 hours to finish the job.

In the latest DGL v0.9.1, we released a new pipeline to preprocess, partition, and dispatch graphs with billions of nodes or edges for distributed GNN training. At its core is a new data format called Chunked Graph Data Format (CGDF), which stores graph data in chunks. The new pipeline processes data chunks in parallel, which not only reduces the memory requirement of each machine but also significantly accelerates the entire procedure. For the same random graph with 1B nodes/5B edges, using a cluster of 8 AWS EC2 x1e.4xlarge (16 vCPU, 488GB RAM each), the new pipeline reduces the running time to 2.7 hours and cuts the monetary cost by 3.7x.
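The 3.7x figures can be checked with a few lines of arithmetic. The sketch below assumes that one x1e.32xlarge costs as much per hour as eight x1e.4xlarge instances; that pricing ratio is an assumption of this example, not a number stated in the post.

```python
# Figures quoted in the text.
single_hours = 10.0    # one x1e.32xlarge
cluster_hours = 2.7    # eight x1e.4xlarge

# Assumption: hourly price scales with instance size, so the large
# instance is billed like 8 small ones.
single_units = 8
cluster_units = 8

speedup = single_hours / cluster_hours
cost_ratio = (single_hours * single_units) / (cluster_hours * cluster_units)
print(f"speedup: {speedup:.1f}x, cost reduction: {cost_ratio:.1f}x")
```

Under that pricing assumption the hourly bill of both setups is the same, so the cost reduction equals the speedup.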

In this blog, we will illustrate step-by-step how to partition and distribute a graph of billions of nodes and edges using this new feature.

Distributed GNN Training 101

A graph dataset typically consists of graph structure and the features associated with nodes/edges. If the graph is heterogeneous (i.e., having multiple types of nodes or edges), different types of nodes/edges may have different sets of features. Training a GNN model on a multi-machine cluster first requires users to partition their input graph, which involves two steps:

Run a graph partition algorithm (e.g., random, METIS) to assign each node to one partition.
Shuffle and dispatch the graph structure and node/edge features to the target machine that owns the partition.
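The two steps above can be sketched in plain Python on an in-memory toy graph. This is only an illustration of the idea: the function names are hypothetical, and assigning each edge to the partition of its destination node is an assumption of this sketch.

```python
import random

def random_partition(num_nodes, num_parts, seed=0):
    """Step 1: assign every node ID to a partition uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_parts) for _ in range(num_nodes)]

def dispatch(edges, feats, assignment, num_parts):
    """Step 2: route each node feature and each edge to its owning partition.

    Assumption for this sketch: an edge is owned by the partition of its
    destination node.
    """
    parts = [{"nodes": {}, "edges": []} for _ in range(num_parts)]
    for nid, feat in enumerate(feats):
        parts[assignment[nid]]["nodes"][nid] = feat
    for src, dst in edges:
        parts[assignment[dst]]["edges"].append((src, dst))
    return parts

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
feats = [[float(i)] for i in range(4)]
assignment = random_partition(num_nodes=4, num_parts=2)
parts = dispatch(edges, feats, assignment, num_parts=2)
```

Every node and every edge ends up in exactly one partition, which is the invariant the real pipeline also has to maintain while moving data between machines.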

Once the graph is partitioned and provisioned, users can then launch the distributed training program using DGL’s launch tool, which will:

Launch one main graph server per machine that loads the local graph partition into RAM. Graph servers provide remote procedure calls (RPCs) to conduct computation like graph sampling. Optionally, users can launch more backup graph servers that share the in-memory data of the main graph server to increase service throughput.
Launch a key-value store (KVStore) server per machine that loads the local node/edge features into RAM. KVStore service provides RPCs to fetch/update node/edge features.
Launch one or more trainer processes per machine. Trainer processes are connected with each other via PyTorch’s DistributedDataParallel component. At each training iteration, they issue requests to local or remote graph servers and KVStore servers to get a mini-batch of samples, perform gradient descent and synchronize their gradients before the next iteration.

The figure below depicts the system architecture. For more information, check out the Distributed Training chapter of DGL User Guide.

Due to the complexity of graph data, graph partitioning is typically run on a single machine, which demands that the machine have enough memory to fit the entire graph and features as well as the runtime usage of the partition algorithm. For example, a random graph of 1 billion nodes and 5 billion edges with 50 features per node needs 268GB when stored in DGL graph format. Using the existing dgl.distributed.partition_graph API to partition this graph requires a powerful AWS EC2 x1e.32xlarge instance (128 vCPU, 3.9TB RAM) and runs for 10 hours, a significant bottleneck for users who want to train GNNs at scale.

DGL v0.9.1 addresses the issue with a new distributed graph partitioning pipeline. Specifically,

We designed a Chunked Graph Data Format (CGDF) for storing large graph data in chunks to avoid loading the entire graph into a single machine.
We provided scripts to partition and dispatch chunked graphs in parallel using multiple machines, which reduces the memory requirement of each machine and accelerates the procedure.

Chunked Graph Data Format

A chunked graph dataset is organized as a data folder with the following data files:

A metadata.json file that stores the meta information of the graph, e.g., graph name, node/edge types, chunk sizes, chunk file paths, etc.
A list of edge index chunk files that store the source and destination node IDs. They are typically in plain text.
A list of node data chunk files. They are typically arrays stored in the NumPy binary format (.npy).
A list of edge data chunk files. They are also stored in the NumPy binary format.

Here, we illustrate the folder structure and the data files of a random social graph (nodes being users, edges being follow relations) where nodes have two data fields: “feat” and “label”. Check out the doc page for the full specification of the format and tips for how to convert your data to chunks.

//data/random_graph_chunked/
  |-- metadata.json            # metadata JSON
  |-- edge_index/              # edge index chunks
    |-- user:follow:user0.txt  # user-follow-user edges chunk 0
    |-- user:follow:user1.txt  # user-follow-user edges chunk 1
    |-- user:follow:user2.txt  # user-follow-user edges chunk 2
    |-- ...
  |-- node_data/           # node data chunks
    |-- user/              # user nodes have two data: "feat" and "label"
      |-- feat0.npy        # feat chunk 0
      |-- feat1.npy        # feat chunk 1
      |-- feat2.npy        # feat chunk 2
      |-- ...
      |-- label0.npy       # label chunk 0
      |-- label1.npy       # label chunk 1
      |-- label2.npy       # label chunk 2
      |-- ...
  |-- edge_data/           # edge data chunks
    |-- user:follow:user/
       |-- ...

Running Distributed Partitioning & Dispatching

The first step is to get a cluster of machines to partition the graph. We recommend that the total RAM size of the cluster be 2-3x larger than the graph data size to accommodate the runtime memory needed. The next step is to set up a shared workspace and software environment.

Set up a shared folder that is accessible by each instance in the cluster (e.g., using NFS). Make sure all instances can ssh to each other. Here, we suppose the folder is mounted to /workspace.

Clone the DGL 0.9.x branch, which contains the scripts, to /workspace:

git clone https://github.com/dmlc/dgl.git -b 0.9.x /workspace/dgl

Copy/move the chunked graph data to /workspace/. Here, we suppose the data folder is at /workspace/random_graph_chunked/.

Create a /workspace/ip_config.txt file that contains the IP address of each instance.

# example IP config file of a 4 machine cluster
172.31.19.1
172.31.23.205
172.31.29.175
172.31.16.98

We can then run a partition algorithm to assign each node to a partition. Here, we choose the random partitioning algorithm.

python /workspace/dgl/tools/partition_algo/random_partition.py \
    --in_dir=/workspace/random_graph_chunked/ \
    --out_dir=/workspace/partition_assign/ \
    --num_partitions=4

The above script simply calculates which partition a node belongs to. We then pass both the chunked graph and the partition assignments to the dispatch_data.py script to physically split the graph data into multiple pieces and distribute them to the entire cluster.

python /workspace/dgl/tools/dispatch_data.py \
    --in-dir=/workspace/random_graph_chunked/ \
    --partitions-dir=/workspace/partition_assign/ \
    --out-dir=/workspace/random_graph_dist/ \
    --ip-config=/workspace/ip_config.txt

The end result will look like the following. We can then launch distributed training following the instructions here.

/workspace/random_graph_dist/
  |-- metadata.json       # metadata JSON file
  |-- part0/              # partition 0
    |-- graph.dgl         # graph structure of partition 0 in DGL binary format
    |-- node_feat.dgl     # node feature of partition 0 in DGL binary format
    |-- edge_feat.dgl     # edge feature of partition 0 in DGL binary format
  |-- part1/              # partition 1
    |-- graph.dgl         # graph structure of partition 1 in DGL binary format
    |-- node_feat.dgl     # node feature of partition 1 in DGL binary format
    |-- edge_feat.dgl     # edge feature of partition 1 in DGL binary format
  |-- part2/
  ...

Note that the scripts utilize multiple machines to cooperatively partition and process the data. Therefore, the new pipeline is significantly faster. For the same random graph with 1B nodes/5B edges, the new pipeline can finish partitioning in 2.7 hours (3.7x faster) using a cluster of 8 AWS EC2 x1e.4xlarge (16 vCPU, 488GB RAM each).

v0.9 Release Highlights

2022-07-25T00:00:00+00:00

Six years after the first Graph Convolutional Networks paper, researchers are actively investigating more advanced GNN architectures and training methodologies. As the developer team of DGL, we closely watch these new research trends and release features to facilitate them. Here, we highlight some of the new functionalities of the recent v0.9 release.

Combining Graph Analytics with GNNs using cuGraph+DGL

Graph neural networks (GNNs) are capable of combining the feature and structural information of graph data. Their power can be further extended when combined with graph analytics techniques, for example through feature augmentation.

Graph analytics has been widely used for characterising graph structures, e.g., identifying important nodes, leading to interesting feature augmentation methods. To exploit the synergy, we want a fast and scalable graph analytics engine. NVIDIA’s RAPIDS cuGraph library provides a collection of GPU accelerated algorithms for graph analytics, such as centrality computation and community detection. According to this documentation, “the latest NVIDIA GPUs (RAPIDS supports Pascal and later GPU architectures) make graph analytics 1000x faster on average over NetworkX”.

In collaboration with NVIDIA’s engineers, DGL v0.9 now allows conversion between a DGLGraph object and a cuGraph graph object with two APIs, to_cugraph and from_cugraph, making it possible for DGL users to access efficient graph analytics implementations in cuGraph.

Installation

To install cuGraph with PyTorch and DGL, we recommend following the practice below. Mamba is a multi-threaded version of conda.

conda install mamba -n base -c conda-forge

mamba create -n dgl_and_cugraph -c dglteam -c rapidsai-nightly -c nvidia -c pytorch -c conda-forge \
    cugraph pytorch torchvision torchaudio cudatoolkit=11.3 dgl-cuda11.3 tqdm

conda activate dgl_and_cugraph

Feature Initialization via cuGraph

We showcase an example of node feature initialization using the graph analytics algorithms provided by cuGraph. Here, we consider two options:

Louvain algorithm that detects the community membership of each node based on modularity optimization.
Core number algorithm that calculates the maximal k-core subgraph each node belongs to. A k-core of a graph is a maximal subgraph that contains nodes of degree k or more.

The two algorithms capture different structural characteristics of a node. Louvain groups nodes that are close to each other in the graph, while nodes with the same core number are more structurally similar to each other. The figures below illustrate the node coloring produced by Louvain communities and core numbers on Zachary’s Karate Club Network.

cuGraph offers efficient GPU implementations of these two algorithms. To call them, we convert a dgl.DGLGraph to a cugraph.Graph using the to_cugraph API.

import cugraph
import torch

def louvain(dgl_g):
    cugraph_g = dgl_g.to_cugraph().to_undirected()
    df, _ = cugraph.louvain(cugraph_g, resolution=3)
    # revert the node ID renumbering by cugraph
    df = cugraph_g.unrenumber(df, 'vertex').sort_values('vertex')
    return torch.utils.dlpack.from_dlpack(df['partition'].to_dlpack()).long()

def core_number(dgl_g):
    cugraph_g = dgl_g.to_cugraph().to_undirected()
    df = cugraph.core_number(cugraph_g)
    # revert the node ID renumbering by cugraph
    df = cugraph_g.unrenumber(df, 'vertex').sort_values('vertex')
    return torch.utils.dlpack.from_dlpack(df['core_number'].to_dlpack()).long()
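For intuition about what the core number of a node is, here is a small CPU-only sketch of the classic peeling procedure in plain Python. It is not how cuGraph computes the result; the function name core_numbers and the dictionary-based graph layout are assumptions of this example.

```python
def core_numbers(adj):
    """Core number of every node via iterative peeling.

    adj: dict mapping each node to the set of its neighbors (undirected).
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Remove a node of minimum remaining degree.
        u = min(remaining, key=degree.get)
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core

# A triangle (0, 1, 2) with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
core = core_numbers(adj)
```

The triangle nodes belong to the 2-core, while the pendant node only belongs to the 1-core, so the two groups receive different structural features.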

Training GNN via DGL

We then use the above functions to prepare node features for the ogbn-arxiv dataset. Note that since both algorithms calculate structural categories, we convert them to one-hot encoding and concatenate them as the initial node features.

import dgl.transforms as T
import torch.nn as nn
import torch.nn.functional as F

from dgl.nn import SAGEConv
from ogb.nodeproppred import DglNodePropPredDataset, Evaluator

device = torch.device('cuda')
dataset = DglNodePropPredDataset(name='ogbn-arxiv')
g, label = dataset[0]
transform = T.Compose([
    T.AddReverse(),
    T.AddSelfLoop(),
    T.ToSimple()
])
g = transform(g).int().to(device)
feat1 = louvain(g)
feat2 = core_number(g)
# convert to one-hot
feat1 = F.one_hot(feat1, feat1.max() + 1)
feat2 = F.one_hot(feat2, feat2.max() + 1)
# concat feat1 and feat2
x = torch.cat([feat1, feat2], dim=1).float()

We then train a simple three-layer GraphSAGE model (see complete training code here). With the help of node features initialized by graph analytics algorithms, we are able to achieve an accuracy of about 0.6 on the test set using pure structural information, which even outperforms an MLP model using the original input node features. With the new DGL release, we are looking forward to seeing more innovation on GNNs combined with graph analytics.

FP16 & Mixed Precision Support

DGL v0.9 is now fully compatible with the PyTorch Automatic Mixed Precision (AMP) package for mixed precision training, thus saving both training time and GPU memory consumption.

By wrapping the forward pass with torch.cuda.amp.autocast(), PyTorch automatically selects the appropriate data type for each op and tensor. Half precision tensors are memory efficient, and most operators on half precision tensors are faster because they leverage GPU Tensor Cores.

import torch.nn.functional as F
from torch.cuda.amp import autocast

def forward(g, feat, label, mask, model):
    with autocast(enabled=True):
        logit = model(g, feat)
        loss = F.cross_entropy(logit[mask], label[mask])
        return loss

Small gradients in float16 format have underflow problems (flush to zero). PyTorch AMP provides a GradScaler module to address this issue. It multiplies the loss by a factor and invokes backward pass on the scaled loss to prevent the underflow problem. It then unscales the computed gradients before the optimizer updates the parameters. The scale factor is determined automatically.

from torch.cuda.amp import GradScaler

scaler = GradScaler()

def backward(scaler, loss, optimizer):
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
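The underflow problem and the effect of loss scaling can be reproduced without a GPU. The sketch below uses Python's struct module to round values to IEEE 754 half precision; the gradient value and the scale factor are illustrative assumptions, and the code does not use or model the GradScaler implementation.

```python
import struct

def to_half(x):
    """Round a Python float to float16 and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8        # a small gradient value (illustrative)
scale = 65536.0    # illustrative scale factor

underflowed = to_half(grad)       # too small for float16, becomes 0.0
scaled = to_half(grad * scale)    # representable once scaled up
recovered = scaled / scale        # unscale in higher precision

print(underflowed, scaled, recovered)
```

Without scaling the gradient is lost entirely, while the scaled value survives the round trip through float16 and can be unscaled to a close approximation of the original.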

Putting everything together, we have the example below.

import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.data import RedditDataset
from dgl.nn import GATConv
from dgl.transforms import AddSelfLoop

class GAT(nn.Module):
    def __init__(self, in_feats, num_classes, num_hidden=256, num_heads=2):
        super().__init__()
        self.conv1 = GATConv(in_feats, num_hidden, num_heads, activation=F.elu)
        # The output layer maps to num_classes so the logits match the labels
        self.conv2 = GATConv(num_hidden * num_heads, num_classes, num_heads)

    def forward(self, g, h):
        h = self.conv1(g, h).flatten(1)
        h = self.conv2(g, h).mean(1)
        return h

device = torch.device('cuda')

transform = AddSelfLoop()
data = RedditDataset(transform)

g = data[0]
g = g.int().to(device)
train_mask = g.ndata['train_mask']
feat = g.ndata['feat']
label = g.ndata['label']
in_feats = feat.shape[1]

model = GAT(in_feats, data.num_classes).to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

for epoch in range(100):
    optimizer.zero_grad()
    loss = forward(g, feat, label, train_mask, model)
    backward(scaler, loss, optimizer)

Training GNNs using low precision or mixed precision is still an active research topic. We hope the new v0.9 release will facilitate more research on this topic. Check out the documentation to know more.

DGL-Go Update: Model Inference and Graph Prediction

DGL-Go now supports training GNNs for graph property prediction tasks. It includes two popular GNN models – Graph Isomorphism Network (GIN) and Principal Neighborhood Aggregation (PNA). For example, to train a GIN model on the ogbg-molpcba dataset, first generate a YAML configuration file using the command:

dgl configure graphpred --data ogbg-molpcba --model gin

which generates the following configuration file. Users can then manually adjust the configuration file.

version: 0.0.2
pipeline_name: graphpred
pipeline_mode: train
device: cpu                     # Torch device name, e.g., cpu or cuda or cuda:0
data:
    name: ogbg-molpcba
    split_ratio:                # Ratio to generate data split, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
model:
     name: gin
     embed_size: 300            # Embedding size
     num_layers: 5              # Number of layers
     dropout: 0.5               # Dropout rate
     virtual_node: false        # Whether to use virtual node
general_pipeline:
    num_runs: 1                 # Number of experiments to run
    train_batch_size: 32        # Graph batch size when training
    eval_batch_size: 32         # Graph batch size when evaluating
    num_workers: 4              # Number of workers for data loading
    optimizer:
        name: Adam
        lr: 0.001
        weight_decay: 0
    lr_scheduler:
        name: StepLR
        step_size: 100
        gamma: 1
    loss: BCEWithLogitsLoss
    metric: roc_auc_score
    num_epochs: 100             # Number of training epochs
    save_path: results          # Directory to save the experiment results

Alternatively, users can fetch model recipes of pre-defined hyperparameters for the original experiments.

dgl recipe get graphpred_pcba_gin.yaml

To launch training:

dgl train --cfg graphpred_ogbg-molpcba_gin.yaml

Another addition is a new command to conduct inference of a trained model on some other dataset. For example, the following shows how to apply the GIN model trained on ogbg-molpcba to ogbg-molhiv:

# Generate an inference configuration file from a saved experiment checkpoint
dgl configure-apply graphpred --data ogbg-molhiv --cpt results/run_0.pth

# Apply the trained model for inference
dgl apply --cfg apply_graphpred_ogbg-molhiv_gin.yaml

It will save the model predictions in a CSV file.

May 2022 Update Note

2022-05-31T00:00:00+00:00

Synthetic Datasets for Developing GNN Explainability Approaches

We added the following new datasets. BAShapeDataset, BACommunityDataset, TreeCycleDataset, and TreeGridDataset were first introduced in GNNExplainer: Generating Explanations for Graph Neural Networks for node classification. BA2MotifDataset was first introduced in Parameterized Explainer for Graph Neural Network for graph classification.

# A dataset for node classification
from dgl.data import BAShapeDataset

dataset = BAShapeDataset()
num_classes = dataset.num_classes
graph = dataset[0]
feat = graph.ndata['feat']
label = graph.ndata['label']

# A dataset for graph classification
from dgl.data import BA2MotifDataset

dataset = BA2MotifDataset()
num_classes = dataset.num_classes
graph, label = dataset[0]           # Get the first graph data
feat = graph.ndata['feat']

Developer Recommendation: These synthetic graphs integrate specially designed substructures, motifs, into traditional random graph models, and assign labels based on their existence. Therefore, those substructures act as ground truth explanations for the node/graph labels, making them commonly used benchmarks for evaluating GNN explainability approaches.

SIGN Diffusion Transform

We added a new data transform module SIGNDiffusion first introduced in SIGN: Scalable Inception Graph Neural Networks, which diffuses node features for later use. It supports four built-in diffusion matrices, including raw adjacency matrix, random walk adjacency matrix, symmetrically normalized adjacency matrix, and personalized PageRank matrix.

import dgl
import dgl.transforms as T

dataset = dgl.data.CoraGraphDataset(
    transform=T.SIGNDiffusion(k=2)) # Diffuse for 1 & 2 hops
g = dataset[0]  # diffused node features will be generated as ndata
feat1 = g.ndata['out_feat_1']       # feature diffused for 1 hop
feat2 = g.ndata['out_feat_2']       # feature diffused for 2 hops

Developer Recommendation: The ability to learn to aggregate neighbor information is one of the key innovations of Message Passing Neural Networks, but it also brings scalability challenges because the receptive field grows exponentially with the number of hops explored. SIGN proposed a cheap yet efficient solution that decouples model depth and receptive field size by diffusing input node features using various kinds of algorithms. Because the diffusion process is not trainable, we package it as a data transform module so that users can easily plug in SIGN before running their own model.
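To show what diffusing features for k hops means, here is a plain-Python sketch using the random walk adjacency matrix, one of the four options listed above. It is an illustration only, not the SIGNDiffusion implementation; the function name diffuse is hypothetical, and the sketch assumes every node has at least one neighbor.

```python
def diffuse(adj, feats, k):
    """Return the features diffused for 1, 2, ..., k hops.

    Each hop replaces a node's features by the mean of its neighbors'
    features, i.e. multiplies by the random walk matrix D^-1 A.
    """
    outputs, h = [], feats
    for _ in range(k):
        h = [
            [sum(h[j][d] for j in adj[i]) / len(adj[i]) for d in range(len(h[i]))]
            for i in range(len(h))
        ]
        outputs.append(h)
    return outputs

# Path graph 0 - 1 - 2 with a single feature per node.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = [[1.0], [0.0], [3.0]]
hop1, hop2 = diffuse(adj, feats, k=2)
```

Since no parameters are involved, these diffused features can be computed once before training and then fed to any model as extra inputs.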

Label Propagation

We added a new NN module LabelPropagation first introduced in Learning from Labeled and Unlabeled Data with Label Propagation, which propagates node labels over a graph for inferring the labels of unlabeled nodes.

import dgl

lp = dgl.nn.LabelPropagation(k=3, alpha=0.9)
dataset = dgl.data.CoraGraphDataset()
g = dataset[0]
labels = g.ndata['label']
train_mask = g.ndata['train_mask']
new_labels = lp(g, labels, train_mask)

Developer Recommendation: The classical label propagation is a simple (non-parametric) yet effective algorithm, making it a strong baseline for many node classification datasets.
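The propagation idea fits in a few lines of plain Python. The sketch below is a simplified variant written for illustration; the update rule, the clamping of known labels, and the function signature are assumptions of this example and may differ from what dgl.nn.LabelPropagation does internally.

```python
def label_propagation(adj, labels, labeled, num_classes, k=3, alpha=0.9):
    """Propagate known labels over the graph for k iterations."""
    n = len(adj)
    # One-hot rows for labeled nodes, zero rows for unlabeled ones.
    y0 = [[1.0 if (i in labeled and labels[i] == c) else 0.0
           for c in range(num_classes)] for i in range(n)]
    y = [row[:] for row in y0]
    for _ in range(k):
        # Average the neighbors' label scores, then mix in the seed labels.
        y = [[alpha * sum(y[j][c] for j in adj[i]) / len(adj[i])
              + (1 - alpha) * y0[i][c] for c in range(num_classes)]
             for i in range(n)]
        for i in labeled:  # keep the known labels fixed
            y[i] = y0[i][:]
    return [max(range(num_classes), key=row.__getitem__) for row in y]

# Path graph 0 - 1 - 2 - 3 - 4; only the two end nodes are labeled.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = [0, -1, -1, -1, 1]
pred = label_propagation(adj, labels, labeled={0, 4}, num_classes=2)
```

Each unlabeled node next to a labeled end node inherits that node's class, which is the behaviour that makes label propagation a useful baseline.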

Directional Graph Network Layer

We added a new NN module DGNConv first introduced in Directional Graph Networks, which introduces directional aggregators in message passing based on the gradient of low-frequency eigenvectors of the graph Laplacian matrix.

import dgl
import torch

g = ...  # some graph
# Precompute the smallest non-trivial eigenvector
transform = dgl.transforms.LaplacianPE(k=1)
g = transform(g)
x = torch.randn(g.num_nodes(), 10)  # node features
eig = g.ndata['PE']
conv = dgl.nn.DGNConv(10, 10,
                      aggregators=['dir1-av', 'dir1-dx'],
                      scalers=['identity', 'amplification'],
                      delta=2.5)
h = conv(g, x, eig_vec=eig)

Developer Recommendation: Directional Graph Networks (DGN) allow defining graph convolutions according to topologically-derived directional flows. It is a state-of-the-art baseline for many graph classification tasks.

Graph Isomorphism Network Layer with Edge Features

We added a new NN module GINEConv first introduced in Strategies for Pre-training Graph Neural Networks, which extends Graph Isomorphism Network (GIN) for handling edge features.

import dgl
import torch

g = ...  # some graph
xn = torch.randn(g.num_nodes(), 10)  # node features
xe = torch.randn(g.num_edges(), 10)  # edge features
conv = dgl.nn.GINEConv(torch.nn.Linear(10, 10))
hn = conv(g, xn, xe)

Developer Recommendation: Graph Isomorphism Network with edge features has been an important baseline for many graph classification tasks such as OGB graph property datasets.

Feature Masking

We added a new data transform module FeatMask first introduced in Graph Contrastive Learning with Augmentations, which randomly masks columns of node/edge features.

import dgl
import dgl.transforms as T

dataset = dgl.data.CoraGraphDataset(
    transform=T.FeatMask(p=0.1, node_feat_names=['feat']))
g = dataset[0]
feat = g.ndata['feat'] # The node feature tensor has been randomly masked.

Developer Recommendation: Randomly masking columns of features is a simple yet useful data augmentation for graph contrastive learning.

Row-Normalizer of Features

We added a new data transform module RowFeatNormalizer, which performs row-normalization of features.

import dgl
import dgl.transforms as T

dataset = dgl.data.CoraGraphDataset(
    transform=T.RowFeatNormalizer(node_feat_names=['feat']))
g = dataset[0]
feat = g.ndata['feat'] # The node feature tensor has been row-normalized.

Developer Recommendation: Row-normalization of raw features is a useful data pre-processing step.

For further reading, check out the release note for a complete list of new additions, improvements and bug fixes. If you have questions about DGL or GNNs in general, you are welcome to join our Slack channel. If you have specific requests on what should be included in DGL next, you can submit them on our GitHub or fill in this survey.

April 2022 Update Note

2022-04-18T00:00:00+00:00

Grouped Reversible Residual Connection for GNNs

We added a new module GroupRevRes introduced in Training Graph Neural Networks with 1000 Layers. It can wrap any GNN module with grouped, reversible and residual connection (example code below).

import dgl
import torch

class GNNLayer(torch.nn.Module):
    def __init__(self, in_size, dropout=0.2):
        super(GNNLayer, self).__init__()
        # Use BatchNorm and dropout to prevent gradient vanishing
        # In particular if you use a large number of GNN layers
        self.norm = torch.nn.BatchNorm1d(in_size)
        self.conv = dgl.nn.GraphConv(in_size, in_size)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, g, x):
        x = self.norm(x)
        x = self.dropout(x)
        return self.conv(g, x)

g = ... # some graph
x = torch.randn(g.num_nodes(), 32)
reversible_conv = dgl.nn.GroupRevRes(GNNLayer(32 // 4), 4)     # 4 groups
y = reversible_conv(g, x)  # forward

Developer Recommendation: The GroupRevRes module is reversible, meaning the backward propagation does not require storing forward activations. It can significantly reduce memory usage of GNNs, making it possible to train a very deep GNN with up to 1000 layers on a single commodity GPU.
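Why a reversible block does not need stored activations can be seen in a tiny example. The sketch below shows the classic two-group reversible residual coupling on scalars; it is not the GroupRevRes implementation, which works on feature groups of a GNN layer, and the functions f and g are arbitrary stand-ins.

```python
def rev_forward(x1, x2, f, g):
    """Two-group reversible residual coupling."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recompute the inputs from the outputs instead of storing them."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

f = lambda t: 3 * t + 1   # stand-in for one sub-layer
g = lambda t: t * t       # stand-in for another sub-layer

y1, y2 = rev_forward(2.0, 5.0, f, g)
x1, x2 = rev_inverse(y1, y2, f, g)
```

During the backward pass the inputs of each block can be recomputed from its outputs in this way, so memory use does not grow with the number of layers.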

Laplacian Positional Encoding

We added a new data transform module LaplacianPE first introduced in Benchmarking Graph Neural Networks. It computes Laplacian positional encoding for a graph. Besides the data transform module, we also provide a functional API. See the example of usage below:

import dgl

# data transform
dataset = dgl.data.CoraGraphDataset(
    transform=dgl.transforms.LaplacianPE(k=2, feat_name='PE'))
g = dataset[0]  # positional encodings will be generated as an ndata
pe = g.ndata['PE']
# functional API
pe = dgl.laplacian_pe(g, k=2)

Developer Recommendation: Laplacian positional encoding improves the expressive power of GNNs by using k-smallest non-trivial Laplacian eigenvectors as additional node features.

Random Walk Positional Encoding

We added a new data transform module RandomWalkPE introduced in Graph Neural Networks with Learnable Structural and Positional Representations. It computes random-walk-based positional encoding for a graph. Besides the data transform module, we also provide a functional API. See the usage example below:

import dgl

# data transform
dataset = dgl.data.CoraGraphDataset(
    transform=dgl.transforms.RandomWalkPE(k=2, feat_name='PE'))
g = dataset[0]  # positional encodings will be generated automatically
pe = g.ndata['PE']
# functional API
pe = dgl.random_walk_pe(g, k=2)

Developer Recommendation: Random walk positional encoding improves the expressive power of GNNs by using the landing probabilities of a node to itself in 1, 2, …, K steps as additional node features.
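
The landing probabilities mentioned above can be computed by hand on a tiny graph. The sketch below is a plain-Python illustration, not DGL's implementation; the helper name `return_probabilities` is hypothetical, and it assumes a dense adjacency matrix with no isolated nodes.

```python
def return_probabilities(adj, k):
    """adj: dense 0/1 adjacency matrix (list of lists); returns an n x k list."""
    n = len(adj)
    # Row-normalised transition matrix P = D^{-1} A
    P = [[a / sum(row) for a in row] for row in adj]
    power = [row[:] for row in P]          # holds P^step
    pe = [[] for _ in range(n)]
    for _ in range(k):
        for v in range(n):
            pe[v].append(power[v][v])      # probability of being back at v
        # Advance to the next power of P
        power = [[sum(power[i][t] * P[t][j] for t in range(n))
                  for j in range(n)] for i in range(n)]
    return pe

triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
pe = return_probabilities(triangle, 3)
```

On the triangle every node receives the encoding [0.0, 0.5, 0.25]: a walk cannot return in one step, returns with probability 1/2 after two steps, and with probability 1/4 after three.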

GraphSAINT Samplers

We added a new sampler SAINTSampler introduced in GraphSAINT: Graph Sampling Based Inductive Learning Method. SAINTSampler provides three strategies to extract induced subgraphs from a graph — by randomly selected node sets, randomly selected edge sets or nodes reached by random walks. See an example of usage below:

import torch
import dgl
from dgl.dataloading import SAINTSampler, DataLoader

sampler = SAINTSampler(
    mode='node',                      # Can be 'node', 'edge' or 'walk'
    budget=200,
    prefetch_ndata=['feat', 'label']  # optionally, specify data to prefetch
)

data_index = torch.arange(1000)  # 1000 mini-batches
g = dgl.data.CoraGraphDataset()[0]
dataloader = DataLoader(g, data_index, sampler, num_workers=4)
for sg in dataloader:
    train(sg)

Developer Recommendation: GraphSAINT is one of the state-of-the-art sampling methods in the family of subgraph sampling. Compared with neighbor sampling (or node-wise sampling), subgraph sampling avoids the issue of exponential neighborhood expansion, thus saving data transmission cost and enabling mini-batch training of deeper GNNs.
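
As a rough illustration of the node-based strategy, the sketch below extracts a node-induced subgraph from an edge list in plain Python. It is a simplification under stated assumptions: nodes are drawn uniformly without replacement, which is not necessarily how SAINTSampler picks them, and `sample_induced_subgraph` is a hypothetical helper.

```python
import random

def sample_induced_subgraph(edges, num_nodes, budget, rng):
    """Pick `budget` distinct nodes uniformly and keep the edges among them."""
    nodes = set(rng.sample(range(num_nodes), budget))
    sub_edges = [(u, v) for (u, v) in edges if u in nodes and v in nodes]
    return sorted(nodes), sub_edges

# A 6-node ring as a toy graph
edges = [(i, (i + 1) % 6) for i in range(6)]
rng = random.Random(0)
nodes, sub_edges = sample_induced_subgraph(edges, 6, 4, rng)
```

Every mini-batch is a self-contained subgraph of bounded size, so the cost per batch does not grow with the number of GNN layers.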

E(n) Equivariant Convolutional Layer

We added a new GNN module EGNNConv introduced in E(n) Equivariant Graph Neural Networks. It performs equivariant transformations on node embeddings and coordinate embeddings. See an example of usage below:

import dgl
import torch

g = ...  # some graph
h = torch.randn(g.num_nodes(), 10)   # node features
x = torch.randn(g.num_nodes(), 4)    # node coordinates
a = torch.randn(g.num_edges(), 2)    # edge features
conv = dgl.nn.EGNNConv(10, 10, 10, 2)
h, x = conv(g, h, x, a)

Developer Recommendation: GNNs capable of equivariant transformations have wide application in real-world structured data that have coordinates (e.g., molecules, point clouds, etc.). EGNN simplified previous attempts and proposed a design that is equivariant to rotations, translations, reflections and permutations on N-dimensional coordinates while considering both node features and node coordinates.
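
Translation equivariance can be checked on a simplified coordinate update in the spirit of EGNN. The sketch below is not EGNNConv: the weight function `phi` is a hypothetical stand-in for a learned network and depends only on the squared distance, and node features are left out entirely.

```python
def coord_update(coords, edges, phi):
    """x_i' = x_i + sum_j (x_i - x_j) * phi(||x_i - x_j||^2) over edges j -> i."""
    new = [list(c) for c in coords]
    for src, dst in edges:
        diff = [a - b for a, b in zip(coords[dst], coords[src])]
        weight = phi(sum(d * d for d in diff))
        for t, d in enumerate(diff):
            new[dst][t] += weight * d
    return new

coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
phi = lambda d2: 1.0 / (1.0 + d2)      # stand-in for a learned MLP
shift = [5.0, -3.0]

# Translate the inputs, then compare against translating the outputs
shifted = [[c + s for c, s in zip(p, shift)] for p in coords]
out = coord_update(coords, edges, phi)
out_shifted = coord_update(shifted, edges, phi)
```

Because the update uses only coordinate differences, `out_shifted` equals `out` moved by the same `shift`, which is what equivariance to translations means.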

Principal Neighbourhood Aggregation Layer

We added a new GNN module PNAConv introduced in Principal Neighbourhood Aggregation for Graph Nets. The code below shows an example of usage:

import dgl
import torch

g = ...  # some graph
x = torch.randn(g.num_nodes(), 10)  # node features
conv = dgl.nn.PNAConv(10, 10,
                      aggregators=['mean', 'max', 'sum'],
                      scalers=['identity', 'amplification'],
                      delta=2.5)
h = conv(g, x)

Developer Recommendation: Principal Neighbourhood Aggregation (PNA) improves the expressive power of a GNN by combining multiple aggregation functions with degree-scalers, thus making it a state-of-the-art baseline for many graph classification tasks.
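
The combination of aggregators and degree-scalers can be sketched for a single node with scalar neighbor messages. This is an illustration only, assuming the amplification scaler log(d + 1) / delta from the PNA paper; `pna_aggregate` is a hypothetical helper, and the real layer goes on to process the combined vector with learned weights.

```python
import math

def pna_aggregate(neighbor_values, delta):
    """Combine mean/max/sum aggregators with identity/amplification scalers."""
    d = len(neighbor_values)
    aggregated = [
        sum(neighbor_values) / d,   # mean
        max(neighbor_values),       # max
        sum(neighbor_values),       # sum
    ]
    scalers = [
        1.0,                        # identity
        math.log(d + 1) / delta,    # amplification
    ]
    # One output per (scaler, aggregator) pair
    return [s * a for s in scalers for a in aggregated]

# delta is chosen so that the amplification factor is 1 for a degree-2 node
out = pna_aggregate([1.0, 3.0], delta=math.log(3))
```

With three aggregators and two scalers, each node ends up with six values per input feature, which is where the extra expressive power comes from.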

Survey

If there are papers you would like to see implemented in DGL, or if you have other feedback and suggestions, please fill in this survey.

v0.8 Release Highlights

2022-03-01T00:00:00+00:00

We are excited to announce the release of DGL v0.8, which brings many new features as well as improvements in system performance. The highlights are:

A major update of the mini-batch sampling pipeline, with better customizability and more optimizations: 3.9x and 1.5x faster for supervised and unsupervised GraphSAGE on OGBN-Products, with only one line of code change.
Significant acceleration and code simplification of popular heterogeneous graph NN modules (up to 36x for RGCN convolution and 12x for HGT convolution), plus 11 new off-the-shelf NN modules for building models for link prediction, heterogeneous graph learning and GNN explanation.
GNNLens: a DGL empowered tool to visualize and understand graph data using GNN explanation models.
New functions to create, transform and augment graph datasets, making it easier to conduct research on graph contrastive learning or repurposing a graph for different tasks.
DGL-Go: a new GNN model training command line tool that utilizes a simple interface so that users can quickly apply GNNs to their problems and orchestrate experiments with state-of-the-art GNN models.

Mini-batch Sampling Pipeline Update

When training neural networks, mini-batch sampling is used both to improve model performance and to scale to large datasets. For GNNs, mini-batch training on graphs introduces new complexities; each iteration can be broken down into four main steps:

Extract a subgraph from the original graph.
Perform transformations on the subgraph.
Fetch the node/edge features of the subgraph.
Pass the subgraph and its features as the input to your GNN model and update parameters.

Among them, steps 1-3 are unique to GNNs and are quite costly. In v0.7, we released a feature that speeds up step 2 by transforming subgraphs on GPU, but the other two steps could still be the bottleneck. In this release, we further optimized the entire pipeline to reach even better performance. Below we briefly describe the technical solutions behind it.

To speed up subgraph extraction (step 1), we utilized CUDA Unified Virtual Addressing (UVA).

(Image courtesy: https://developer.download.nvidia.cn/CUDA/training/cuda_webinars_GPUDirect_uva.pdf)

CUDA UVA allows users to create in-memory data beyond the size of GPU RAM capacity while still harnessing GPU kernels for fast computation. Storing the entire graph structure and its features in UVA enables efficient subgraph extraction using GPU kernels, which is effective for training large-scale GNNs [1][2]. In this release, users can turn on the UVA mode by setting the use_uva flag in DataLoader, as shown in the example below:

g = ...                  # some DGLGraph data
train_nids = ...         # training node IDs
sampler = dgl.dataloading.MultiLayerNeighborSampler(
    fanouts=[10, 15])
dataloader = dgl.dataloading.DataLoader(
    g, train_nids, sampler,
    device='cuda:0',     # perform sampling on GPU 0
    batch_size=1024,
    shuffle=True,
    use_uva=True         # turn on UVA optimization
)

To speed up feature fetching (step 3), DGL 0.8 supports pre-fetching node/edge features so that the model computation can happen in parallel with data movement. Users can specify the features as well as the labels to prefetch in the sampler object.

g = ...           # some DGLGraph data
train_nids = ...  # training node IDs
sampler = dgl.dataloading.MultiLayerNeighborSampler(
    fanouts=[10, 15],
    prefetch_node_feats=['feat'],   # prefetch node feature 'feat'
    prefetch_labels=['label'],      # prefetch node label 'label'
)
dataloader = dgl.dataloading.DataLoader(
    g, train_nids, sampler,
    device='cuda:0',     # perform sampling on GPU 0
    batch_size=1024,
    shuffle=True,
    use_uva=True         # turn on UVA optimization
)
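
The idea of overlapping feature fetching with model computation can be pictured with a background thread and a bounded queue. This is a conceptual sketch using only the Python standard library, not a description of how DGL implements prefetching; `prefetching_loader` and `fetch` are hypothetical names.

```python
import queue
import threading

def prefetching_loader(batches, fetch, depth=2):
    """Yield (batch, features) pairs while a worker fetches ahead."""
    q = queue.Queue(maxsize=depth)     # at most `depth` batches in flight
    sentinel = object()

    def worker():
        for b in batches:
            q.put((b, fetch(b)))       # blocks when the queue is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# Toy "feature store": the feature of node i is i * 10
fetch = lambda ids: [i * 10 for i in ids]
results = list(prefetching_loader([[0, 1], [2]], fetch))
```

While the consumer processes one batch, the worker is already fetching the next ones, so data movement is hidden behind computation as long as fetching is not slower than the model.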

These optimizations bring significant speedup for both supervised and unsupervised mini-batch training. We compared the new pipeline against the original one, which samples on CPU and trains on GPU, by training a two-layer GraphSAGE model on the ogbn-papers100M graph using A100 GPUs. On a single GPU, we observed speedups of 3.9x and 1.5x for supervised and unsupervised GraphSAGE, respectively. The speedup applies to multi-GPU training as well.


(Figures: Speedup of Supervised GraphSAGE; Speedup of Unsupervised GraphSAGE)

Defining a new sampler in DGL v0.8 is also easier, with only one simple interface, sample, to follow. Optionally, users can specify which features to prefetch for each sample. For example, the cluster sampler used by Cluster-GCN can be implemented in just a few lines of code.

import dgl
import torch

class ClusterGCNSampler:
    def __init__(self, g, k, prefetch_ndata=None):
        part_ids = dgl.metis_partition_assignment(g, k)
        # convert partition assignment to bins of nodes
        # (torch.split expects a list of ints, hence tolist())
        part_sizes = torch.bincount(part_ids, minlength=k).tolist()
        self.node_bins = torch.split(torch.argsort(part_ids), part_sizes)
        # save the node feature names to be prefetched
        self.prefetch_ndata = prefetch_ndata

    def sample(self, g, part_ids):
        """Sample a subgraph given a list of partition IDs."""
        node_ids = torch.cat([self.node_bins[pid] for pid in part_ids])
        sg = g.subgraph(node_ids)  # get an induced subgraph
        # tell which feature to pre-fetch
        dgl.dataloading.base.set_node_lazy_features(sg, self.prefetch_ndata)
        return sg
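
The binning step in the sampler above is easy to follow on plain lists. The sketch below mirrors the same logic without DGL or PyTorch; the helper names and the hard-coded partition assignment are hypothetical.

```python
def build_node_bins(part_ids, k):
    """Group node IDs by the partition they were assigned to."""
    bins = [[] for _ in range(k)]
    for node_id, pid in enumerate(part_ids):
        bins[pid].append(node_id)
    return bins

def nodes_for_partitions(bins, chosen):
    """Concatenate the node IDs of the chosen partitions (one mini-batch)."""
    return [n for pid in chosen for n in bins[pid]]

# part_ids[i] is the partition of node i, e.g. as produced by a partitioner
part_ids = [0, 2, 1, 0, 2, 1, 0]
bins = build_node_bins(part_ids, 3)
batch_nodes = nodes_for_partitions(bins, [0, 2])
```

A mini-batch is then the subgraph induced by `batch_nodes`, so the items iterated by the data loader are partition IDs rather than node IDs.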

New samplers in v0.8:

dgl.dataloading.ClusterGCNSampler: The sampler from Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks.
dgl.dataloading.ShaDowKHopSampler: The sampler from Deep Graph Neural Networks with Shallow Subgraph Samplers.

This remarkable improvement would not have happened without help from the community. We want to thank Xin Yao (@yaox12) and Dominique LaSalle (@nv-dlasalle) from NVIDIA and David Min (@davidmin7) from UIUC for their contributions.

Further reading:

User guide chapter for customizing graph samplers.
User guide chapter for writing graph samplers with feature prefetching.
Example implementation of ClusterGCN.

NN Module Update

Heterogeneous GNNs are known to be difficult both to implement and to optimize. In this release, we have significantly improved the speed of dgl.nn.RelGraphConv and dgl.nn.HGTConv, two state-of-the-art NN modules for training on heterogeneous graphs, sometimes by an order of magnitude compared with various baselines [3]:


(Figures: Speedup of RGCN convolution; Speedup of HGT convolution)

More importantly, writing an efficient heterogeneous graph convolution is substantially easier. Here is a minimal implementation of RGCN convolution in 0.8 using the new nn.TypedLinear module:

import dgl
import torch.nn as nn

class RGCNConv(nn.Module):
    def __init__(self, in_size, out_size, num_etypes):
        super().__init__()
        # TypedLinear is a new module in 0.8!
        self.linear_r = dgl.nn.TypedLinear(in_size, out_size, num_etypes)

    def forward(self, g, x, etype):
        g.ndata['x'] = x
        g.edata['etype'] = etype
        # sum the incoming messages 'm' into the node field 'h'
        g.update_all(self.message, dgl.function.sum('m', 'h'))
        return g.ndata['h']

    def message(self, edges):
        # transform source features with the weight of each edge's type
        return {'m': self.linear_r(edges.src['x'], edges.data['etype'])}
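
What a typed linear layer computes can be written out in a few lines of plain Python: each input row is multiplied by the weight matrix selected by its type. This sketch only illustrates the semantics as we understand them; it says nothing about how nn.TypedLinear is implemented, and `typed_linear` is a hypothetical helper.

```python
def typed_linear(xs, types, weights):
    """Apply weights[types[i]] (an out x in matrix) to feature vector xs[i]."""
    out = []
    for x, t in zip(xs, types):
        W = weights[t]
        out.append([sum(w * v for w, v in zip(row, x)) for row in W])
    return out

weights = [
    [[1.0, 0.0], [0.0, 1.0]],   # type 0: identity
    [[0.0, 1.0], [1.0, 0.0]],   # type 1: swap the two features
]
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
types = [0, 1, 0]
ys = typed_linear(xs, types, weights)
```

In the RGCN convolution above, the rows are source-node features gathered per edge and the types are edge types, so each relation gets its own projection.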

This release also brings 11 new NN modules covering the most requested ones from the community. They include but are not limited to:

Commonly used edge score modules (e.g., TransE, TransR, etc.) for link prediction.
Linear projection module and embedding module for heterogeneous graphs (nn.HeteroLinear and nn.HeteroEmbedding).
The GNNExplainer module.

Understand Graph via Visualisation and GNN-based Explanation

Understanding graph data using GNN-based explanation models has become an important research topic. We partnered with the HKUST VisLab team to release GNNLens, an interactive visualization tool for graph neural networks.

To install GNNLens, run pip install gnnlens.

It provides Python APIs for specifying the data to be visualized. For example, the following shows how to load DGL’s built-in Cora and Citeseer graph datasets and visualize them:

from dgl.data import CoraGraphDataset, CiteseerGraphDataset
from gnnlens import Writer

# Load the two built-in citation graphs.
cora_graph = CoraGraphDataset()[0]
citeseer_graph = CiteseerGraphDataset()[0]

# Specify the path to create a new directory for dumping data files.
writer = Writer('tutorial_graph')
writer.add_graph(name='Cora', graph=cora_graph)
writer.add_graph(name='Citeseer', graph=citeseer_graph)
# Finish dumping
writer.close()

After running the script, you can then launch GNNLens with the following command:

gnnlens --logdir tutorial_graph

You will then see the GNNLens web page in your browser.

GNNLens can not only visualize raw graph data but also inspect graph neural networks, for example by running explanation models to explain their predictions. Please check out the tutorials in the project README: https://github.com/dmlc/gnnlens2.

Composable Graph Data Transforms

Graph data augmentation has become an important component for graph contrastive learning or structural prediction in general. The new release makes it easier to compose and apply various graph augmentation and transformation algorithms to all of DGL’s built-in datasets. The new dgl.transforms package follows the style of the PyTorch Dataset Transforms. Users can specify the transforms to use with the transform keyword argument of all DGL datasets:

import dgl
import dgl.transforms as T
t = T.Compose([
    T.AddSelfLoop(),
    T.GCNNorm(),
])
dataset = dgl.data.CoraGraphDataset(transform=t)
g = dataset[0]  # graph and features will be transformed automatically
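
The composition pattern itself is small enough to sketch in plain Python. In the toy version below a graph is just a pair of node and edge sets, and `add_self_loop` and `add_reverse_edges` are hypothetical stand-ins rather than DGL transforms; only the chaining behaviour is the point.

```python
class Compose:
    """Apply a list of graph transforms one after another."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, g):
        for t in self.transforms:
            g = t(g)
        return g

def add_self_loop(g):
    nodes, edges = g
    return nodes, edges | {(v, v) for v in nodes}

def add_reverse_edges(g):
    nodes, edges = g
    return nodes, edges | {(v, u) for (u, v) in edges}

pipeline = Compose([add_self_loop, add_reverse_edges])
g = ({0, 1, 2}, {(0, 1), (1, 2)})
nodes, edges = pipeline(g)
```

Since every transform takes a graph and returns a graph, a composed pipeline is itself a transform and can be passed wherever a single one is expected.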

DGL v0.8 provides 16 commonly used data transform APIs. See the API doc for more information.

Making graph datasets easily accessible for all kinds of research is important. A common scenario is to adapt a dataset for a different task than it was originally designed for (e.g., training a link prediction model on Cora, which was originally built for node classification). We therefore added two dataset adapters (dgl.data.AsNodePredDataset and dgl.data.AsLinkPredDataset) for this purpose. We also support generating new train/val/test splits and saving them for later use:

import dgl
dataset = dgl.data.CoraGraphDataset()
# make a Cora dataset suitable for link prediction
# add train/val/test split and negative samples
dataset = dgl.data.AsLinkPredDataset(dataset, split_ratio=[0.8, 0.1, 0.1], neg_ratio=3)
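
To make the roles of split_ratio and neg_ratio concrete, here is a plain-Python sketch of splitting an edge list and drawing negative pairs. It is an illustration under stated assumptions, not the adapter's actual procedure: `split_with_negatives` is a hypothetical helper, negatives are drawn uniformly among non-edges, and they are attached to every split purely for illustration.

```python
import random

def split_with_negatives(edges, num_nodes, split_ratio, neg_ratio, rng):
    """Split edges into train/val/test and sample neg_ratio negatives per edge."""
    edges = list(edges)
    rng.shuffle(edges)
    n_train = int(len(edges) * split_ratio[0])
    n_val = int(len(edges) * split_ratio[1])
    splits = {
        'train': edges[:n_train],
        'val': edges[n_train:n_train + n_val],
        'test': edges[n_train + n_val:],
    }
    existing = set(edges)
    negatives = {}
    for name, pos in splits.items():
        neg = []
        while len(neg) < neg_ratio * len(pos):
            u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
            # keep only node pairs that are not edges of the graph
            if u != v and (u, v) not in existing:
                neg.append((u, v))
        negatives[name] = neg
    return splits, negatives

edges = [(i, (i + 1) % 10) for i in range(10)]   # a 10-node directed ring
splits, negatives = split_with_negatives(
    edges, 10, [0.8, 0.1, 0.1], 3, random.Random(0))
```

The positive edges in each split serve as link prediction targets, and the sampled non-edges give the model examples of pairs that should score low.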

One more thing

As GNN is still a young and blooming domain, we received many “how to start” questions from our users:

“I’ve heard about GNNs, how to start training a GNN model on my own datasets?”
“I want to learn more about GNNs, how to start experimenting with SOTA baselines?”
“I have some new research ideas, how to start building it upon existing GNN models?”

To make those first steps easier, we developed DGL-Go, a command line tool for users to quickly access the latest GNN research progress.

Using DGL-Go is as easy as three steps:

Use dgl configure to pick the task, dataset and model of interest. It generates a configuration file for later use. You can also use dgl recipe get to retrieve a configuration file we provide.
Use dgl train to launch training according to the configuration and see the results.
Use dgl export to generate a self-contained, reproducible Python script for advanced customization, or try the model on custom data stored in CSV format.

Install DGL-Go by running pip install dglgo, and check out the project README for more details.

Reference

[1] PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses

[2] TorchQuiver: https://github.com/quiver-team/torch-quiver

[3] We compared our new nn.RelGraphConv module with multiple existing baselines from DGL and PyG. For DGL v0.7, Baseline#1 uses the old nn.RelGraphConv module with low_mem=False; Baseline#2 uses the old nn.RelGraphConv with low_mem=True; Baseline#3 uses nn.HeteroGraphConv. For PyG, Baseline#1 uses nn.RGCNConv while Baseline#2 uses nn.FastRGCNConv. All the benchmarks were run on one NVIDIA T4 GPU card.
