
Scene Graph Generation: A Comprehensive Survey

Guangming Zhu, Liang Zhang, Youliang Jiang, Yixuan Dang, Haoran Hou, Peiyi Shen*, Mingtao Feng, Xia Zhao*, Qiguang Miao, Syed Afaq Ali Shah and Mohammed Bennamoun

arXiv:2201.00443v2 [cs.CV] 22 Jun 2022

Abstract—Deep learning techniques have led to remarkable breakthroughs in the field of generic object detection and have spawned a lot of scene-understanding tasks in recent years. Scene graph has been the focus of research because of its powerful semantic representation and applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image or a video into a semantic structural scene graph, which requires the correct labeling of detected objects and their relationships. Although this is a challenging task, the community has proposed a lot of SGG approaches and achieved good results. In this paper, we provide a comprehensive survey of recent achievements in this field brought about by deep learning techniques. We review 138 representative works, and systematically summarize existing methods of image-based SGG from the perspective of feature representation and refinement. We attempt to connect and systematize the existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. Finally, we finish this survey with deep discussions about current existing problems and future research directions. This survey will help readers to develop a better understanding of the current research status and ideas.

Index Terms—Scene Graph Generation, Visual Relationship Detection, Object Detection, Scene Understanding.

1 INTRODUCTION

The ultimate goal of computer vision (CV) is to build intelligent systems, which can extract valuable information from digital images, videos, or other modalities as humans do. In the past decades, machine learning (ML) has significantly contributed to the progress of CV. Inspired by the ability of humans to interpret and understand visual scenes effortlessly, visual scene understanding has long been advocated as the holy grail of CV and has already attracted much attention from the research community.

Visual scene understanding includes numerous sub-tasks, which can be generally divided into two parts: recognition and application tasks. These recognition tasks can be described at several semantic levels. Most of the earlier works, which mainly concentrated on image classification, only assign a single label to an image, e.g., an image of a cat or a car, and go further in assigning multiple annotations without localizing where in the image each annotation belongs [38]. A large number of neural network models have emerged and even achieved near humanlike performance in image classification tasks [27], [29], [33], [34]. Furthermore, several other complex tasks, such as semantic segmentation at the pixel level, object detection and instance segmentation at the instance level, have suggested the decomposition of an image into foreground objects vs background clutter. The pixel-level tasks aim at classifying each pixel of an image (or several) into an instance, where each instance (or category) corresponds to a class [37]. The instance-level tasks focus on the detection and recognition of individual objects in the given scene and delineating an object with a bounding box or a segmentation mask, respectively. A recently proposed approach named Panoptic Segmentation (PS) takes into account both per-pixel class and instance labels [32]. With the advancement of Deep Neural Networks (DNN), we have witnessed important breakthroughs in object-centric tasks and various commercialized applications based on existing state-of-the-art models [17], [19], [21], [22], [23]. However, scene understanding goes beyond the localization of objects. The higher-level tasks lay emphasis on exploring the rich semantic relationships between objects, as well as the interaction of objects with their surroundings, such as visual relationship detection (VRD) [15], [24], [26], [41] and human-object interaction (HOI) [14], [16], [20]. These tasks are equally significant and more challenging. To a certain extent, their development depends on the performance of individual instance recognition techniques. Meanwhile, the deeper semantic understanding of image content can also contribute to visual recognition tasks [2], [6], [36], [39], [120]. Divvala et al. [40] investigated various forms of context models, which can improve the accuracy of object-centric recognition tasks. In the last few years, researchers have combined computer vision with natural language processing (NLP) and proposed a number of advanced research directions, such as image captioning, visual question answering (VQA), visual dialog and so on.

• G. Zhu, L. Zhang, Y. Jiang, Y. Dang, H. Hou, P. Shen, M. Feng and Q. Miao are with the School of Computer Science and Technology, Xidian University, 710071, Xian, China (E-mail: gmzhu@xidian.edu.cn; liang.zhang.cn@ieee.org; mqjyl2012@163.com; dyx4work@gmail.com; unse3ry@gmail.com; pyshen@xidian.edu.cn; mtfeng@xidian.edu.cn; qgmiao@xidian.edu.cn), * indicates the corresponding author.
• X. Zhao is with the School of Arts and Sciences, National University of Defense Technology, Changsha, China (E-mail: zxmdi@163.com)
• S. A. A. Shah is with the centre for AI and Machine Learning, Edith Cowan University, Australia (E-mail: afaq.shah@ecu.edu.au)
• M. Bennamoun is with School of Computer Science and Software Engineering, The University of Western Australia, Australia (E-mail: mohammed.bennamoun@uwa.edu.au)
Manuscript received January XX, 2021. This work is supported by National Natural Science Foundation of China (62073252, 62072358), National Key R&D Program of China under Grant No. 2020YFF0304900, 2019YFB1311600, and Chinese Defense Advance Research Program (50912020105).
Fig. 1: A visual illustration of a scene graph structure and some applications. Scene graph generation models take an image as an input and generate a visually-grounded scene graph. An image caption can be generated from a scene graph directly. In contrast, image generation inverts the process by generating realistic images from a given sentence or scene graph. The Referring Expression (REF) task marks a region of the input image corresponding to the given expression, while the region and the expression map to the same subgraph of the scene graph. Scene graph-based image retrieval takes a query as an input and regards the retrieval as a scene graph matching problem. For the Visual Question Answering (VQA) task, the answer can sometimes be found directly on the scene graph; even for more complex visual reasoning, the scene graph is also helpful.

These vision-and-language topics require a rich understanding of our visual world and offer various application scenarios of intelligent systems.

Although rapid advances have been achieved in scene understanding at all levels, there is still a long way to go. Overall perception and effective representation of information are still bottlenecks. As indicated by a series of previous works [1], [44], [191], building an efficient structured representation that captures comprehensive semantic knowledge is a crucial step towards a deeper understanding of visual scenes. Such representation can not only offer contextual cues for fundamental recognition challenges, but also provide a promising alternative to high-level intelligence vision tasks. Scene graph, proposed by Johnson et al. [1], is a visually-grounded graph over the object instances in a specific scene, where the nodes correspond to object bounding boxes with their object categories, and the edges represent their pair-wise relationships.

Because of the structured abstraction and greater semantic representation capacity compared to image features, scene graph has the instinctive potential to tackle and improve other vision tasks. As shown in Fig. 1, a scene graph parses the image into a simple and meaningful structure and acts as a bridge between the visual scene and textual description. Many tasks that combine vision and language can be handled with scene graphs, including image captioning [3], [12], [18], visual question answering [4], [5], content-based image retrieval [1], [7], image generation [8], [9] and referring expression comprehension [35]. Some tasks take an image as an input and parse it into a scene graph, and then generate a reasonable text as output. Other tasks invert the process by extracting scene graphs from the text description and then generate realistic images or retrieve the corresponding visual scene.

Xu et al. [199] have produced a thorough survey on scene graph generation, which analyses the SGG methods based on five typical models (CRF, TransE, CNN, RNN/LSTM and GNN) and also includes a discussion of important contributions by prior knowledge. Moreover, a detailed investigation of the main applications of scene graphs was also provided. The current survey focuses on the visual relationship detection of SGG, and our survey's organization is based on feature representation and refinement. Specifically, we first provide a comprehensive and systematic review of 2D SGG. In addition to multimodal features, we also review how prior information and commonsense knowledge help overcome the long-tailed distribution and large intra-class diversity problems. To refine the local features and fuse the contextual information for high-quality relationship prediction, we analyze mechanisms such as message passing, attention, and visual translation embedding. In addition to 2D SGG, spatio-temporal and 3D SGG are also examined. Further, a detailed discussion of the most common datasets is provided together with performance evaluation measures. Finally, a comprehensive and systematic review of the most recent research on the generation of scene graphs is presented. We provide a survey of 138 papers on SGG¹, which have appeared since 2016 in the leading computer vision, pattern recognition, and machine learning conferences and journals. Our goal is to help the reader study and understand this research topic, which has gained a significant momentum in the past few years. The main contributions of this article are as follows:

1) A comprehensive review of 138 papers on scene graph generation is presented, covering nearly all of the current literature on this topic.
2) A systematic analysis of 2D scene graph generation is presented, focusing on feature representation and refinement. The long-tail distribution problem and the large intra-class diversity problem are addressed from the perspectives of fusing prior information and commonsense knowledge, as well as refining features through message passing, attention, and visual translation embedding.
3) A review of typical datasets for 2D, spatio-temporal and 3D scene graph generation is presented, along with an analysis of the performance evaluation of the corresponding methods on these datasets.

¹ We provide a curated list of scene graph generation methods, publicly accessible at https://github.com/mqjyl/awesome-scene-graph

The rest of this paper is organized as follows: Section 2 gives the definition of a scene graph and thoroughly analyses the characteristics of visual relationships and the structure of a scene graph. Section 3 surveys scene graph generation methods. Section 4 summarizes almost all currently published datasets. Section 5 compares and discusses the performance of some key methods on the most commonly used datasets. Finally, Section 6 summarizes open problems in the current research and discusses potential future research directions. Section 7 concludes the paper.

2 SCENE GRAPH

A scene graph is a structural representation, which can capture detailed semantics by explicitly modeling objects ("man", "fire hydrant", "shorts"), attributes of objects ("fire hydrant is yellow"), and relations between paired objects ("man jumping over fire hydrant"), as shown in Fig. 1. The fundamental elements of a scene graph are objects, attributes and relations. Subjects/objects are the core building blocks of an image and they can be located with bounding boxes. Each object can have zero or more attributes, such as color (e.g., yellow), state (e.g., standing), material (e.g., wooden), etc. Relations can be actions (e.g., "jump over"), spatial relations (e.g., "is behind"), descriptive verbs (e.g., "wear"), prepositions (e.g., "with"), comparatives (e.g., "taller than"), prepositional phrases (e.g., "drive on"), etc. [10], [28], [30], [110]. In short, a scene graph is a set of visual relationship triplets in the form of ⟨subject, relation, object⟩ or ⟨object, is, attribute⟩. The latter is also considered as a relationship triplet (using the "is" relation for uniformity [10], [11]).

In this survey paper, we mainly focus on the triplet description of a static scene. Given a visual scene S ∈ 𝒮 [62], such as an image or a 3D mesh, its scene graph is a set of visual triplets R_S ⊆ O_S × P_S × (O_S ∪ A_S), where O_S is the object set, A_S is the attribute set and P_S is the relation set, which includes the "is" relation p_{S,is} in which only one object is involved. Each object o_{S,k} ∈ O_S has a semantic label l_{S,k} ∈ O_L (O_L is the semantic label set) and is grounded with a bounding box (BB) b_{S,k} in scene S, where k ∈ {1, ..., |O_S|}. Each relation p_{S,i→j} ∈ P_S ⊆ P is the core of a visual relationship triplet r_{S,i→j} = ⟨o_{S,i}, p_{S,i→j}, o_{S,j}⟩ ∈ R_S with i ≠ j, where the third element o_{S,j} could be an attribute a_{S,j} ∈ A_S if p_{S,i→j} is p_{S,is}. As the relationship is one-way, we express r_{S,i→j} as ⟨s_{S,i}, p_{S,i→j}, o_{S,j}⟩ to maintain semantic accuracy, where s_{S,i}, o_{S,j} ∈ O_S; s_{S,i} is the subject and o_{S,j} is the object.
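To make the notation above concrete, the following minimal Python sketch (ours, not taken from any surveyed method) stores a scene graph as a list of labeled, attributed objects plus a set of directed ⟨subject, relation, object⟩ triplets; the class and field names (SceneObject, SceneGraph, etc.) are illustrative only.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SceneObject:
        # o_{S,k}: a detected object with its semantic label l_{S,k},
        # bounding box b_{S,k} = (x1, y1, x2, y2) and optional attributes
        label: str
        box: Tuple[float, float, float, float]
        attributes: List[str] = field(default_factory=list)

    @dataclass
    class SceneGraph:
        # R_S: directed triplets <subject, relation, object>, stored as index pairs
        objects: List[SceneObject] = field(default_factory=list)
        relations: List[Tuple[int, str, int]] = field(default_factory=list)

        def triplets(self) -> List[Tuple[str, str, str]]:
            """Return human-readable <subject, relation, object> triplets,
            including the "is" relation for attributes, as in Section 2."""
            out = [(self.objects[i].label, rel, self.objects[j].label)
                   for i, rel, j in self.relations]
            out += [(obj.label, "is", attr)
                    for obj in self.objects for attr in obj.attributes]
            return out

    # Toy example mirroring Fig. 1: <man, jumping over, fire hydrant>, <fire hydrant, is, yellow>
    man = SceneObject("man", (10, 20, 120, 300))
    hydrant = SceneObject("fire hydrant", (130, 150, 200, 300), attributes=["yellow"])
    g = SceneGraph(objects=[man, hydrant], relations=[(0, "jumping over", 1)])
    print(g.triplets())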
From the point of view of graph theory, a scene graph is a directed graph with three types of nodes: object, attribute, and relation. However, for the convenience of semantic expression, a node of a scene graph is seen as an object with all its attributes, while the relation is called an edge. A subgraph can be formed with an object, and it is made up of all the related visual triplets of the object. Therefore, the subgraph contains all the adjacent nodes of the object, and these adjacent nodes directly reflect the context information of the object. From the top-down view, a scene graph can be broken down into several subgraphs, a subgraph can be split into several triplets, and a triplet can be split into individual objects with their attributes and relations. Accordingly, we can find a region in the scene corresponding to the substructure that is a subgraph, a triplet, or an object. Clearly, a perfectly-generated scene graph corresponding to a given scene should be structurally unique. The process of generating a scene graph should be objective and should only be dependent on the scene. Scene graphs should serve as an objective semantic representation of the state of the scene. The SGG process should not be affected by who labelled the data, by how object and predicate categories were assigned, or by the performance of the SGG model used. In reality, not all annotators who label the data produce the exact same visual relationships for each triplet, and the methods that generate scene graphs do not always predict the correct relationships. Nevertheless, this uniqueness supports the argument that the use of a scene graph as a replacement for a visual scene at the language level is reasonable.

Compared with scene graphs, the well-known knowledge graph is represented as multi-relational data with enormous fact triplets in the form of (head entity type, relation, tail entity type) [112], [180]. Here, we have to emphasize that the visual relationships in a scene graph are different from those in social networks and knowledge bases. In the case of vision, images and visual relationships are incidental and are not intentionally constructed. Especially, visual relationships are usually image-specific because they only depend on the content of the particular image in which they appear. Although a scene graph is generated from a textual description in some language-to-vision tasks, such as image generation, the relationships in a scene graph are always situation-specific. Each of them has the corresponding visual feature in the output image. Objects in scenes are not independent and tend to cluster. Sadeghi et al. [43] coined the term visual phrases to introduce composite intermediates between objects and scenes. Visual phrases, which integrate linguistic representations of relationship triplets, encode the interactions between objects and scenes.

A two-dimensional (2D) image is a projection of a three-dimensional (3D) world from a particular perspective. Because of the occlusion and dimensionality reduction caused by the projection of 3D to 2D, 2D images may have incomplete or ambiguous information about the 3D scene, leading to an imperfect representation of 2D scene graphs. As opposed to a 2D scene graph, a 3D scene graph prevents spatial relationship ambiguities between object pairs caused by different viewpoints. The relationships described above are static and instantaneous because the information is grounded in an image or a 3D mesh that can only capture a specific moment or a certain scene. On the other hand, with videos, a visual relationship is not instantaneous, but varies with time. A digital video consists of a series of images called frames, which means relations span over multiple frames and have different durations. Visual relationships in a video can construct a Spatio-Temporal Scene Graph, which includes entity nodes of the neighborhood in the time and space dimensions.

The scope of our survey therefore extends beyond the generation of 2D scene graphs to include 3D and spatio-temporal scene graphs as well.

3 SCENE GRAPH GENERATION

The goal of scene graph generation is to parse an image or a sequence of images in order to generate a structured representation, to bridge the gap between visual and semantic perception, and ultimately to achieve a complete understanding of visual scenes. However, it is difficult to generate an accurate and complete scene graph. Generating a scene graph is generally a bottom-up process in which entities are grouped into triplets and these triplets are connected to form the entire scene graph. Evidently, the essence of the task is to detect the visual relationships, i.e., ⟨subject, relation, object⟩ triplets, abbreviated as ⟨s, r, o⟩. Methods which are used to connect the detected visual relationships to form a scene graph do not fall in the scope of this survey. This paper focuses on reviewing methods for visual relationship detection.

Visual Relationship Detection has attracted the attention of the research community since the pioneering work by Lu et al. [28] and the release of the ground-breaking large-scale scene graph dataset Visual Genome (VG) by Krishna et al. [30]. Given a visual scene S and its scene graph T_S [31], [62]:

• B_S = {b_{S,1}, ..., b_{S,n}} is the region candidate set, with element b_{S,i} denoting the bounding box of the i-th candidate object.
• O_S = {o_{S,1}, ..., o_{S,n}} is the object set, with element o_{S,i} denoting the corresponding class label of the object b_{S,i}.
• A_S = {a_{S,o_1,1}, ..., a_{S,o_1,k_1}, ..., a_{S,o_n,1}, ..., a_{S,o_n,k_n}} is the attribute set, with element a_{S,o_i,j} denoting the j-th attribute of the i-th object, where k_i ≥ 0 and j ∈ {1, ..., k_i}.
• R_S = {r_{S,1→2}, r_{S,1→3}, ..., r_{S,n→n−1}} is the relation set, with element r_{S,i→j} corresponding to a visual triplet t_{S,i→j} = ⟨s_{S,i}, r_{S,i→j}, o_{S,j}⟩, where s_{S,i} and o_{S,j} denote the subject and object, respectively. This set also includes the "is" relation, in which only one object is involved.

When attribute detection and relationship prediction are considered as two independent processes, we can decompose the probability distribution of the scene graph p(T_S|S) into four components, similar to [31]:

p(T_S|S) = p(B_S|S) p(O_S|B_S, S) (p(A_S|O_S, B_S, S) p(R_S|O_S, B_S, S))        (1)

In the equation, the bounding box component p(B_S|S) generates a set of candidate regions that cover most of the crucial objects directly from the input image. The object component p(O_S|B_S, S) predicts the class label of the object in each bounding box. Both steps are identical to those used in two-stage target detection methods, and can be implemented by the widely used Faster RCNN detector [17]. Conditioned on the predicted labels, the attribute component p(A_S|O_S, B_S, S) infers all possible attributes of each object, while the relationship component p(R_S|O_S, B_S, S) infers the relationship of each object pair [31]. When all visual triplets are collected, a scene graph can then be constructed.
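This factorization maps directly onto a bottom-up pipeline: propose regions, label them, then score a predicate for every ordered object pair. The sketch below is a minimal, hypothetical illustration of that control flow under the decomposition above (ignoring the attribute component for brevity); detect_boxes, classify_objects and predict_predicate are placeholder stubs standing in for learned components such as a Faster R-CNN detector and a predicate classifier.

    import random

    # Hypothetical stand-ins for the learned components of Eq. (1); a real system
    # would use, e.g., a Faster R-CNN detector and a learned predicate classifier.
    def detect_boxes(image):                      # p(B_S | S)
        return [(10, 20, 120, 300), (130, 150, 200, 300)]

    def classify_objects(image, boxes):           # p(O_S | B_S, S)
        return ["man", "fire hydrant"]

    def predict_predicate(image, box_i, box_j):   # p(R_S | O_S, B_S, S), one pair at a time
        return random.choice(["jumping over", "next to", "behind"])

    def generate_scene_graph(image):
        """Bottom-up SGG following the factorized decomposition: detect regions,
        label them, then predict a relation for every ordered pair of objects."""
        boxes = detect_boxes(image)
        labels = classify_objects(image, boxes)
        triplets = []
        for i in range(len(boxes)):
            for j in range(len(boxes)):
                if i == j:
                    continue
                rel = predict_predicate(image, boxes[i], boxes[j])
                triplets.append((labels[i], rel, labels[j]))
        return triplets

    print(generate_scene_graph(image=None))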
Since attribute detection is generally regarded as an independent research topic, visual relationship detection and scene graph generation are often regarded as the same task. Then, the probability of a scene graph T_S can be decomposed into three factors:

p(T_S|S) = p(B_S|S) p(O_S|B_S, S) p(R_S|O_S, B_S, S)        (2)

The following section provides a detailed review of more than a hundred deep learning-based methods proposed until 2020 on visual relationship detection and scene graph generation. In view of the fact that much more work has been published on 2D SGG than on 3D or spatio-temporal SGG, a comprehensive overview of the methods for 2D SGG is first provided. This is followed by a review of the 3D and spatio-temporal SGG methods in order to ensure the completeness and breadth of the survey.

Note: We use "relationship" or a "triplet" to refer to the tuple of ⟨subject, relation, object⟩ in this paper, and "relation" or a "predicate" to refer to a relation element.

3.1 2D Scene Graph Generation

Scene graphs can be generated in two different ways [13]. The mainstream approach uses a two-step pipeline that detects objects first and then solves a classification task to determine the relationship between each pair of objects. The other approach involves jointly inferring the objects and their relationships based on the object region proposals. Both of the above approaches need to first detect all existing objects or proposed objects in the image, group them into pairs and use the features of their union area (called relation features) as the basic representation for the predicate inference. In this section, we focus on the two-step approach, and Fig. 2 illustrates the general framework for creating 2D scene graphs. Given an image, a scene graph generation method first generates subject/object and union proposals with a Region Proposal Network (RPN), which are sometimes derived from the ground-truth human annotations of the image. Each union proposal is made up of a subject, an object and a predicate ROI. The predicate ROI is the box that tightly covers both the subject and the object. We can then obtain appearance, spatial information, label, depth, and mask for each object proposal using the feature representation, and for each predicate proposal we can obtain appearance, spatial, depth, and mask features. These multimodal features are vectorized, combined, and refined in the third step by the Feature Refinement module using message passing mechanisms, attention mechanisms and visual translation embedding approaches. Finally, classifiers are used to predict the categories of the predicates, and the scene graph is generated.

In this section, SGG methods for 2D inputs will be reviewed and analyzed according to the following strategies.

(1) Off-the-shelf object detectors can be used to detect subjects, objects and predicate ROIs. The first point to consider is how to utilize the multimodal features of the detected proposals. As a result, Section 3.1.1 reviews and analyzes the use of multimodal features, including appearance, spatial, depth, mask, and label.

(2) A scene graph's compositionality is its most important characteristic, and can be seen as an elevation of its semantic expression from independent objects to visual phrases. A deeper meaning can, however, be derived from two aspects: the frequency of visual phrases and the common-sense constraints on relationship prediction. For example, when "man", "horse" and "hat" are detected individually in an image, the most likely visual triplets are ⟨man, ride, horse⟩, ⟨man, wearing, hat⟩, etc. ⟨hat, on, horse⟩ is possible, though not common, but ⟨horse, wearing, hat⟩ is normally unreasonable. Thus, how to integrate Prior Information about visual phrases and Commonsense Knowledge will be analyzed in Section 3.1.2 and Section 3.1.3, respectively.

(3) A scene graph is a representation of visual relationships between objects, and it includes contextual information about those relationships. To achieve high-quality predictions, information must be fused between the individual objects or relationships. In a scene graph, message passing can be used to refine local features and integrate contextual information, while attention mechanisms can be used to allow the models to focus on the most important parts of the scene. Considering the large intra-class divergence and long-tailed distribution problems, visual translation embedding methods have been proposed to model relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Therefore, we categorize the related methods into Message Passing, Attention Mechanism, and Visual Translation Embedding, which will be deeply analyzed in Section 3.1.4, Section 3.1.5 and Section 3.1.6, respectively.

3.1.1 Multimodal Features

The appearance features of the subject, object, and predicate ROIs make up the input of SGG methods, and affect SGG significantly. The rapid development of deep learning based object detection and classification has led to the use of many types of classical CNNs to extract appearance features from ROIs cropped from a whole image by bounding boxes or masks. Some CNNs even outperform humans when it comes to detecting/classifying objects based on appearance features. Nevertheless, the appearance features of a subject, an object, and their union region alone are insufficient to accurately recognize the relationship of a subject-object pair. In addition to appearance features, semantic features of object categories or relations, spatial features of object candidates, and even contextual features can also be crucial to understand a scene and can be used to improve the visual relationship detection performance. In this subsection, some integrated utilization methods of Appearance, Semantic, Spatial and Context features will be reviewed and analyzed.
Fig. 2: An overview of 2D general scene graph generation framework. Firstly, off-the-shelf object detectors are used to detect
subjects, objects and predicate ROIs. Then, different kinds of methods are used in the stages of (b) Feature Representation
and (c) Feature Refinement to improve the final (d) Relation Prediction for high-quality visual relationship detection. This
survey focuses on the methods of feature representation and refinement.

Appearance-Semantic Features: A straightforward way to fuse semantic features is to concatenate the semantic word embeddings of object labels to the corresponding appearance features. As in [28], there is another approach that utilizes language priors from semantic word embeddings to finetune the likelihood of a predicted relationship, dealing with the fact that objects and predicates occur frequently on their own, even if relationship triplets are infrequent. Moreover, taking into account that the appearance of objects may profoundly change when they are involved in different visual relations, it is also possible to directly learn an appearance model to recognize richer-level visual composites, i.e., visual phrases [43], as a whole, rather than detecting the basic atoms and then modeling their interactions.

Appearance-Semantic-Spatial Features: The spatial distribution of objects is not only a reflection of their position, but also a representation of their structural information. The spatial distribution of objects is described by the properties of regions, which include positional relations, size relations, distance relations, and shape relations. In this context, Zhu et al. [83] investigated how the spatial distribution of objects can aid in visual relation detection. Sharifzadeh et al. [46] used 3D information in visual relation detection by synthetically generating depth maps from an RGB-to-Depth model incorporated within relation detection frameworks. They extracted pairwise feature vectors for depth, spatial, label and appearance cues.

The subject and object come from different distributions. In response, Zhang et al. [84] proposed a 3-branch Relationship Proposal Network (Rel-PN) to produce a set of candidate boxes that represent subject, relationship, and object proposals. A proposal selection module then selects the candidate pairs that satisfy the spatial constraints. The resulting pairs are fed into two separate network modules designed to evaluate the relationship compatibility using visual and spatial criteria, respectively. Finally, the visual and spatial scores are combined with different weights to get the final score for predicates. In another work [11], the authors added a semantic module to produce a semantic score for predicates; all three scores are then added up to obtain an overall score. Liang et al. [24] also considered three types of features and proposed to cascade the multi-cue based convolutional neural network with a structural ranking loss function. For an input image x, the feature representations of the visual appearance cue, spatial location cue and semantic embedding cue are extracted for each relationship instance tuple. The learned features combined with multiple cues are further concatenated and fused into a joint feature vector through one fully connected layer.

Appearance-Semantic-Spatial-Context Features: Previous studies typically extract features from a restricted object-object pair region and focus on local interaction modeling to infer the objects and pairwise relations. For example, by fusing pairwise features, ViP-CNN [26] captures contextual information directly. However, the global visual context beyond these pairwise regions is ignored, which may result in losing the chance to shrink the possible semantic space using the rich context. Xu et al. [25] proposed a multi-scale context modeling method that can simultaneously discover and integrate the object-centric and region-centric contexts for the inference of scene graphs, in order to overcome the problem of large object/relation spaces. Yin et al. [72] proposed a Spatiality-Context-Appearance module to learn the spatiality-aware contextual feature representation.

In summary, appearance, semantic, spatial and contextual features all contribute to visual relationship detection from different perspectives. The integration of these multimodal features precisely corresponds to the human's multi-scale, multi-cue cognitive model. Using well-designed features, visual relationships can be detected more accurately, so scene graphs can be constructed more accurately.
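As a concrete illustration of such multimodal fusion, the PyTorch sketch below (ours, with illustrative dimensions; it does not reproduce any specific method above) concatenates appearance features of the subject, object and union ROIs with label word embeddings and normalized box geometry, and feeds the joint vector to a small MLP that outputs predicate logits.

    import torch
    import torch.nn as nn

    class MultiCuePredicateClassifier(nn.Module):
        """A minimal sketch of multimodal feature fusion for predicate prediction:
        appearance features of subject/object/union ROIs are concatenated with
        label word embeddings and normalized box geometry, then fused by an MLP.
        All dimensions are illustrative assumptions."""
        def __init__(self, num_obj_classes=150, num_predicates=50,
                     appear_dim=1024, embed_dim=200, spatial_dim=8):
            super().__init__()
            self.label_embed = nn.Embedding(num_obj_classes, embed_dim)
            fused = 3 * appear_dim + 2 * embed_dim + spatial_dim
            self.fusion = nn.Sequential(
                nn.Linear(fused, 512), nn.ReLU(),
                nn.Linear(512, num_predicates))

        def forward(self, f_subj, f_obj, f_union, subj_label, obj_label, spatial):
            # spatial: e.g., normalized (x1, y1, x2, y2) of both boxes -> 8 dims
            sem = torch.cat([self.label_embed(subj_label),
                             self.label_embed(obj_label)], dim=-1)
            x = torch.cat([f_subj, f_obj, f_union, sem, spatial], dim=-1)
            return self.fusion(x)   # predicate logits

    # Toy usage with random ROI features for a single subject-object pair
    model = MultiCuePredicateClassifier()
    logits = model(torch.randn(1, 1024), torch.randn(1, 1024), torch.randn(1, 1024),
                   torch.tensor([3]), torch.tensor([17]), torch.rand(1, 8))
    print(logits.shape)   # torch.Size([1, 50])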
3.1.2 Prior Information

The scene graph is a semantically structured description of a visual world. Intuitively, the SGG task can be regarded as a two-stage semantic tag retrieval process. Therefore, the determination of the relation category often depends on the labels of the participating subject and object. In Section 3.1, we discussed the compositionality of a scene graph in detail. Although visual relationships are scene-specific, there are strong semantic dependencies between the relationship predicate r and the object categories s and o in a relationship triplet (s, p, o).

Data balance plays a key role in the performance of deep neural networks due to their data-dependent training process. However, because of the long-tailed distribution of relationships between objects, collecting enough training images for all relationships is time-consuming and too expensive [15], [90], [104], [107]. Scene graphs should serve as an objective semantic representation of the state of a scene. We cannot arbitrarily assign the relationship ⟨man, feeding, horse⟩ to the scene in Fig. 3(b) just because ⟨man, feeding, horse⟩ occurs more frequently than ⟨man, riding, horse⟩ in some datasets. In fact, however, weighting the probability output of relationship detection networks by statistical co-occurrences may improve the visual relationship detection performance on some datasets. We cannot deny the fact that human beings sometimes think about the world based on their experiences. As such, prior information, including Statistical Priors and Language Priors, can be regarded as a type of experience that allows neural networks to "correctly understand" a scene more frequently. Prior information has already been widely used to improve the performance of SGG networks.

Fig. 3: Examples of the wide variety of visual relationships: (a) television-on-wall, (b) man-riding-horse, (c) boy-playing-seesaw, (d) cat-on-suitcase, (e) dog-sitting on-horse, (f) man-riding-elephant. The solid bounding boxes indicate the individual objects and the dashed red bounding boxes denote a visual relationship.

Statistical Priors: The simplest way to use prior knowledge is to assume that an event should happen this time since it almost always does. This is called a statistical prior. Baier et al. [87] demonstrated how a visual statistical model could improve visual relationship detection. Their semantic model was trained using absolute frequencies that describe how often a triplet appears in the training data. Dai et al. [49] designed a deep relational network that exploited both spatial configuration and statistical dependency to resolve ambiguities during relationship recognition. Zellers et al. [42] analyzed the statistical co-occurrences between relationships and object pairs on the Visual Genome dataset and concluded that these statistical co-occurrences provided strong regularization for relationship prediction.

Furthermore, Chen et al. [31] formally represented this information and explicitly incorporated it into graph propagation networks to aid in scene graph generation. For each object pair with predicted labels (a subject o_i and an object o_j), they constructed a graph with a subject node, an object node, and K relation nodes. Each node v ∈ V = {o_i, o_j, r_1, r_2, ..., r_K} has a hidden state h^t_v at timestep t. Let m_{o_i o_j r_k} denote the statistical co-occurrence probability between o_i and relation node r_k, as well as between o_j and relation node r_k. At timestep t, the relationship nodes aggregate messages from the object nodes, while object nodes aggregate messages from the relationship nodes:

a^t_v = Σ_{k=1}^{K} m_{o_i o_j r_k} h^{t−1}_{r_k},                if v is an object node
a^t_v = m_{o_i o_j r_k} (h^{t−1}_{o_i} + h^{t−1}_{o_j}),          if v is a relation node        (3)

Then, the hidden state h^t_v is updated with a^t_v and its previous hidden state by a gated mechanism.
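The aggregation step of Eq. (3) can be written compactly. The NumPy sketch below (ours, not the authors' code) computes the incoming messages for the object nodes and the K relation nodes of a single object pair, leaving the subsequent gated (e.g., GRU-style) state update abstract.

    import numpy as np

    def aggregate_messages(h_oi, h_oj, h_rel, m):
        """One aggregation step in the spirit of Eq. (3): m[k] is the statistical
        co-occurrence weight between the object pair (o_i, o_j) and relation node
        r_k; h_rel has shape (K, D), the object states have shape (D,)."""
        # Message into each object node: co-occurrence-weighted sum of relation states
        a_obj = (m[:, None] * h_rel).sum(axis=0)
        # Message into each relation node r_k: weighted sum of the two object states
        a_rel = m[:, None] * (h_oi + h_oj)[None, :]
        return a_obj, a_rel   # to be fed to a gated (e.g., GRU-style) update

    K, D = 5, 16
    a_obj, a_rel = aggregate_messages(np.random.rand(D), np.random.rand(D),
                                      np.random.rand(K, D), np.random.rand(K))
    print(a_obj.shape, a_rel.shape)   # (16,) (5, 16)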
These methods, however, are data-dependent because their statistical co-occurrence probabilities are derived from training data. They do not contribute to the design of a universal SGG network. We believe that, in the semantic space, language priors will be more useful.

Language Priors: Human communication is primarily based on the use of words in a structured and conventional manner. Similarly, visual relationships are represented as triplets of words. Given the polysemy of words across different contexts, one cannot simply encode objects and predicates as indexes or bitmasks. The semantics of object and predicate categories should be used to deal with the polysemy of words. In particular, the following observations can be made. First, the visual appearance of relationships which have the same predicate but different agents varies greatly [26]. For instance, "television-on-wall" (Fig. 3a) and "cat-on-suitcase" (Fig. 3d) have the same predicate type "on", but they have distinct visual and spatial features. Second, the type of relation between two objects is determined not only by their relative spatial information but also by their categories. For example, the relative position between the kid and the horse (Fig. 3b) is very similar to the one between the dog and the horse (Fig. 3e), but it is preferred to describe the relationship as "dog-sitting on-horse" rather than "dog-riding-horse" in the natural language setting. It is also very rare to say "person-sitting on-horse". On the other hand, the relationships between the observed objects are naturally based on our language knowledge. For example, we would like to use the expression "sitting on" or "playing" for a seesaw but not "riding" (Fig. 3c), even though the pose is very similar to that of "riding" the horse in Fig. 3b. Third, relationships are semantically similar when they appear in similar contexts. That is, in a given context, i.e., an object pair, the probabilities of different predicates to describe this pair are related to their semantic similarity. For example, "person-ride-horse" (Fig. 3b) is similar to "person-ride-elephant" (Fig. 3f), since "horse" and "elephant" belong to the same animal category [28]. It is therefore necessary to explore methods for utilizing language priors in the semantic space.
Lu et al. [28] proposed the first visual relationship detection pipeline, which leverages language priors (LP) to finetune the prediction. They scored each pair of object proposals ⟨O1, O2⟩ using a visual appearance module and a language module. In the training phase, to optimize the projection function f(·) such that it projects similar relationships closer to one another, they used a heuristic formulated as:

constant = [f(r, W) − f(r′, W)]² / d(r, r′),   ∀ r, r′        (4)

where d(r, r′) is the sum of the cosine distances in word2vec space between the two objects and the predicates of the two relationships r and r′. Similarly, Plesse et al. [105] computed the similarity between each neighbor r′ ∈ {r_1, ..., r_K} and the query r with a softmax function:

constant = exp(−d(r, r′)²) / Σ_{j=1}^{K} exp(−d(r, r_j)²)        (5)

Based on this LP model, Jung et al. [104] further summarized some major difficulties of visual relationship detection and performed extensive experiments on all possible models with variant modules.
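A minimal sketch of this kind of language-prior weighting is given below (ours, and only one plausible reading of Eqs. (4)-(5)): d(r, r′) sums the cosine distances between the corresponding words of two relationships, and the neighbors of a query relationship are weighted by a softmax over their negative squared distances. The toy embeddings stand in for pretrained word2vec vectors and are illustrative only.

    import numpy as np

    def cosine_distance(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def relationship_distance(r, r_prime, emb):
        """d(r, r') as in Eq. (4): sum of cosine distances in embedding space
        between the subjects, predicates and objects of the two relationships."""
        return sum(cosine_distance(emb[a], emb[b]) for a, b in zip(r, r_prime))

    def neighbor_weights(query, neighbors, emb):
        """Softmax over negative squared distances, in the spirit of Eq. (5)."""
        scores = np.array([-relationship_distance(query, n, emb) ** 2 for n in neighbors])
        scores = np.exp(scores - scores.max())
        return scores / scores.sum()

    # Toy embeddings standing in for pretrained word2vec vectors (illustrative only)
    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in
           ["person", "horse", "elephant", "dog", "ride", "sit on"]}
    query = ("person", "ride", "horse")
    neighbors = [("person", "ride", "elephant"), ("dog", "sit on", "horse")]
    print(neighbor_weights(query, neighbors, emb))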
Liao et al. [85] assumed that an inherent semantic relationship connects the two words in the triplet, rather than a mathematical distance in the embedding space. They proposed to use a generic bi-directional RNN to predict the semantic connection between the participating objects in a relationship from the aspect of natural language. Zhang et al. [15] used semantic associations to compensate for infrequent classes on a large and imbalanced benchmark with an extremely skewed class distribution. Their approach was to learn a visual and a semantic module that maps features from the two modalities into a shared space, and then to employ a modified triplet loss to learn the joint visual and semantic embedding. Abdelkarim et al. [97] also highlighted the long-tail recognition problem and adopted a weighted version of the softmax triplet loss above.

From the perspective of collective learning on multi-relational data, Hwang et al. [106] designed an efficient multi-relational tensor factorization algorithm that yields highly informative priors. Analogously, Dupty et al. [107] learned conditional triplet joint distributions in the form of their normalized low-rank non-negative tensor decompositions.

In addition, some other papers have also tried to mine the value of language prior knowledge for relationship prediction. Donadello et al. [108] encoded visual relationship detection with Logic Tensor Networks (LTNs), which exploit both the similarities with other seen relationships and background knowledge, expressed with logical constraints between subjects, relations and objects. In order to leverage the inherent structure of the predicate categories, Zhou et al. [184] proposed to first build the language hierarchy and then utilize a Hierarchy Guided Feature Learning (HGFL) strategy to learn better region features at both the coarse-grained and the fine-grained level. Liang et al. [110] proposed a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in an image sample. Recently, Wen et al. [109] proposed the Rich and Fair semantic extraction network (RiFa), which is able to extract richer semantics and preserve fairness for relations with imbalanced distributions.

In summary, statistical and language priors are effective in providing some regularization for visual relationship detection, derived from statistical and semantic spaces. However, additional knowledge outside the scope of object and predicate categories is not included. The human mind is capable of reasoning over the visual elements of an image based on common sense. Thus, incorporating commonsense knowledge into SGG tasks will be valuable to explore.

3.1.3 Commonsense Knowledge

As previously stated, there are a number of models which emphasize the importance of language priors. However, due to the long-tailed distribution of relationships, it is costly to collect enough training data for all relationships [90]. We should therefore use knowledge beyond the training data to help generate scene graphs [136]. Commonsense knowledge includes information about events that occur in time, about the effects of actions, about physical objects and how they are perceived, and about their properties and relationships with one another. Researchers have proposed to extract commonsense knowledge to refine object and phrase features and thus improve the generalizability of scene graph generation. In this section, we analyze three fundamental sub-issues of commonsense knowledge applied to SGG, i.e., the Source, Formulation and Usage, as illustrated in Fig. 4. To be specific, the source of commonsense is generally divided into internal training samples [88], [137], an external knowledge base [89], or both [90], [91], and it can be transformed into different formulations [92]. It is mainly applied in the refinement of the original features or in other typical procedures [93].

Source: Commonsense knowledge can be directly extracted from the local training samples. For example, Duan et al. [88] calculated the co-occurrence probability of object pairs, p(o_i | o_j), and of a relationship in the presence of an object pair, p(r_k | o_i, o_j), as prior statistical knowledge obtained from the training samples of the VG dataset, to assist reasoning and deal with the unbalanced distribution. However, considering the tremendous amount of valuable information in large-scale external bases, e.g., Wikipedia and ConceptNet, increasing efforts have been devoted to distilling knowledge from these resources.

Gu et al. [89] proposed a knowledge-based module, which improves the feature refinement procedure by reasoning over a basket of commonsense knowledge retrieved from ConceptNet. Yu et al. [90] introduced a Linguistic Knowledge Distillation Framework that obtains linguistic knowledge by mining both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), and then constructs a teacher network to distill the knowledge into a student network that predicts visual relationships from visual, semantic and spatial representations. Zhan et al. [91] proposed a novel multi-modal feature based undetermined relationship learning network (MF-URLN), which extracts and fuses features of object pairs from three complementary modules: visual, spatial, and linguistic. The linguistic module provides two kinds of features: external linguistic features and internal linguistic features. The former are the semantic representations of the subject and object generated by a word2vec model pretrained on Wikipedia 2014. The latter are the probability distributions of all relationship triplets in the training set according to the subject's and object's categories.
Fig. 4: Three basic sub-issues (source, formulation and usage) of commonsense knowledge applied to scene graph generation.

Formulation: Apart from the actual sources of knowledge, it is also important to consider the formulation and how to incorporate the knowledge in a more efficient and comprehensive manner. As shown in several previous studies [31], [88], statistical correlation has been the most common formulation of knowledge. These works employ co-occurrence matrices on both the object pairs and the relationships in an explicit way. Similarly, the linguistic knowledge in [91] is modeled by a conditional probability that encodes the strong correlation between the object pair ⟨subj, obj⟩ and the predicate. However, Lin et al. [92] pointed out that such formulations are generally composable, complex and image-specific, which leads to poor learning improvements. They proposed Atom Correlation Based Graph Propagation (ACGP) for the scene graph generation task. The key idea is to separate the relationship semantics to form new nodes and decompose the conventional multi-dependency reasoning path of ⟨subject, predicate, object⟩ into four different types of atom correlations, i.e., ⟨subject, object⟩, ⟨subject, predicate⟩, ⟨predicate, predicate⟩ and ⟨predicate, object⟩, which are much more flexible and easier to learn. This consequently results in four kinds of knowledge graphs; the information propagation is then performed using a graph convolutional network (GCN) under the guidance of the knowledge graphs, to produce the evolved node features.

Usage: In general, commonsense knowledge is used as guidance for the original feature refinement in most cases [88], [89], [92], but there are many attempts to implement it differently, so that it contributes to the scene graph model from a different aspect. Yao et al. [93] demonstrated a framework which can train scene graph models in an unsupervised manner, based on knowledge bases extracted from the triplets of Web-scale image captions. The relationships from the knowledge base are regarded as the potential relation candidates of the corresponding pairs. The first step is to align the knowledge base with images and initialize the probability distribution D_S for each candidate:

d = Ψ(s, o, Λ)        (6)

where (s, o) denotes the corresponding object pair proposed by the detector, Λ represents the knowledge base and Ψ(·) is the alignment procedure. d is a vector, where d_i = 1 if the relation r_i belongs to the set of relation labels retrieved from the knowledge base, and 0 otherwise. In every non-initial iteration t (t > 1), this distribution is constantly updated by a convex combination of the internal prediction from the scene graph model and the external semantic signals.
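The alignment and update steps described around Eq. (6) can be sketched as follows (our illustration, not the released implementation); the toy knowledge base, the mixing weight alpha and the normalization choices are assumptions.

    import numpy as np

    def align_knowledge_base(pair, kb, relation_vocab):
        """Eq. (6) as a sketch: d = Psi(s, o, Lambda). kb maps an object pair to the
        set of relation labels retrieved from the knowledge base; d_i = 1 iff
        relation i is among the retrieved candidates."""
        retrieved = kb.get(pair, set())
        return np.array([1.0 if r in retrieved else 0.0 for r in relation_vocab])

    def update_distribution(d_prev, model_probs, alpha=0.5):
        """Convex combination of the model's internal prediction and the external
        semantic signal, used to refresh the candidate distribution at iteration t > 1."""
        if d_prev.sum() > 0:
            d_prev = d_prev / d_prev.sum()
        else:
            d_prev = np.full_like(d_prev, 1.0 / len(d_prev))
        return alpha * model_probs + (1.0 - alpha) * d_prev

    relation_vocab = ["ride", "wear", "on", "behind"]
    kb = {("man", "horse"): {"ride", "on"}}          # toy knowledge base (illustrative)
    d0 = align_knowledge_base(("man", "horse"), kb, relation_vocab)
    d1 = update_distribution(d0, model_probs=np.array([0.6, 0.1, 0.2, 0.1]))
    print(d0, d1)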
Inspired by the hierarchical reasoning of the human prefrontal cortex, Yu et al. [94] built a Cognition Tree (CogTree) for all the relationship categories in a coarse-to-fine manner. For all the samples with the same ground-truth relationship class, they predicted the relationships with a biased model and calculated the distribution of the predicted label frequencies, based on which the hierarchical structure of the CogTree can be consequently built. The tree can be divided into four layers (root, concept, coarse-fine, fine-grained) and progressively divides coarse concepts with a clear distinction into fine-grained relationships which share similar features. On the basis of this structure, a model-independent loss function, the tree-based class-balanced (TCB) loss, is introduced in the training procedure. This loss can suppress the inter-concept and intra-concept noises in a hierarchical way, which finally contributes to an unbiased scene graph prediction.

Recently, Zareian et al. [95] proposed a Graph Bridging Network (GB-NET). This model is based on the assumption that a scene graph can be seen as an image-conditioned instantiation of a commonsense knowledge graph. They generalized the formulation of scene graphs into knowledge graphs, where predicates are nodes rather than edges, and reformulated SGG from object and relation classification into graph linking. The GB-NET is an iterative process of message passing inside a heterogeneous graph. It consists of a commonsense graph and an initialized scene graph connected by some bridge edges. The commonsense graph is made up of commonsense entity nodes (CE) and commonsense predicate nodes (CP), and the commonsense edges are compiled from three sources: WordNet, ConceptNet, and the Visual Genome training set. The scene graph is initialized with scene entity nodes (SE), i.e., detected objects, and scene predicate nodes (SP) for all entity pairs. Its goal is to create bridge edges between the two graphs that connect each instance (SE and SP node) to its corresponding class (CE and CP node).
Another work by Zareian et al. [96] fits the discussion of this section perfectly. It points out two specific issues of current research on knowledge-based SGG methods: (1) external sources of commonsense tend to be incomplete and inaccurate; (2) statistical information, such as co-occurrence frequency, is limited in its ability to reveal the complex, structured patterns of commonsense. Therefore, they proposed a novel mathematical formalization of visual commonsense and extracted it with a global-local attention multi-head transformer. It is implemented by training the encoders on a corpus of annotated scene graphs to predict the missing elements of a scene. Moreover, to compensate for the disagreement between commonsense reasoning and visual prediction, it disentangles commonsense and perception into two separately trained models and builds a cascaded fusion architecture to balance the results. The commonsense is then used to adjust the final prediction.

As the main characteristics of commonsense knowledge, external large-scale knowledge bases and specially-designed formulations of statistical correlations have drawn considerable attention in recent years. However, [93], [94] have demonstrated that, apart from feature refinement, commonsense knowledge can also be useful in different ways. Due to its graph-based structure and enriched information, commonsense knowledge may boost the reasoning process directly. Its graph-based structure also makes it well suited to guide the message passing in GNN- and GCN-based scene graph generation methods.

3.1.4 Message Passing

A scene graph consists not only of individual objects and their relations, but also of contextual information surrounding and forming those visual relationships. From an intuitive perspective, individual predictions of objects and relationships are influenced by their surrounding context. Context can be understood on three levels. First, for a triplet, the predictions of the different phrase components depend on each other. This is the compositionality of a scene graph. Second, the triplets are not isolated. Objects which have relationships are semantically dependent, and relationships which partially share object(s) are also semantically related to one another. Third, visual relationships are scene-specific, so learning feature representations from a global view is helpful when predicting relationships. Therefore, message passing between individual objects or triplets is valuable for visual relationship detection.

Constructing a high-quality scene graph relies on a prior layout structure of proposals (objects and unions). There are four forms of layout structures: triplet set, chain, tree and fully-connected graph. Accordingly, RNNs and their variants (LSTM, GRU), as sequential models, are used to encode context for chains, while TreeLSTM [79] is used for trees and GNN (or CRF) [59], [60], [77] for fully-connected graphs.

Basically, features and messages are passed between the elements of a scene graph, including objects and relationships. To refine object features and extract phrase features, several models rely on a variety of message passing techniques. Our discussion in the subsections below is structured around two key perspectives: local propagation within triplet items and global propagation across all the elements, as illustrated in Fig. 5.

Fig. 5: Examples of two different types of message passing methods, i.e., local propagation within triplet items [26] and global propagation across all the elements. The global items, according to layout structures, can be further divided into the following forms: fully-connected graph [59], [77], chain [73], [74] and tree [78], [79].

Local Message Passing Within Triplets: Generally, features of the subject, predicate and object proposals are extracted for each triplet, and the information fusion within the triplets contributes to refining features and recognizing visual relationships. ViP-CNN, proposed by Li et al. [26], is a phrase-guided visual relationship detection framework, which can be divided into two parts: triplet proposal and phrase recognition. In phrase detection, for each triplet proposal, there are three feature extraction branches for the subject, predicate and object, respectively. A phrase-guided message passing structure (PMPS) is introduced to exchange information between the branches. Dai et al. [49] proposed an effective framework called Deep Relational Network (DR-Net), which uses Faster RCNN to locate a set of candidate objects. Through multiple inference units, which capture the statistical relations between triplet components, the DR-Net outputs the posterior probabilities of s, r, and o. At each step of the iterative updating procedure, it takes in a fixed set of inputs, i.e., the observed features x_s, x_r, and x_o, and refines the estimates of the posterior probabilities. Another interesting model is Zoom-Net [72], which propagates spatiality-aware object features to interact with the predicate features and broadcasts predicate features to reinforce the features of the subject and object. The core of Zoom-Net is a Spatiality-Context-Appearance Module, which consists of two spatiality-aware feature alignment cells for message passing between the different components of a triplet.
11

messages across all elements enhances the ability to detect Several other techniques consider SGG as a graph in-
finer visual relationships. ference process because of its particular structure. By con-
Global Message Passing Across All Elements: Consid- sidering all other objects as carriers of global contextual
ering that objects that have visual relationships are seman- information for each object, they will pass messages to each
tically related to each other, and that relationships which other’s via a fully-connected graph. However, inference on
partially share objects are also semantically related, passing a densely connected graph is very expensive. As shown
messages between related elements can be beneficial. Learn- in previous works [64], [65], dense graph inference can be
ing feature representation from a global view is helpful approximated by mean field in Conditional Random Fields
to scene-specific visual relationship detection. Scene graphs (CRF). Moreover, Johnson et al. [1] designed a CRF model
have a particular structure, so message passing on the graph that reasons about the connections between an image and
or subgraph structures is a natural choice. Chain-based its ground-truth scene graph, and use these scene graphs as
models (such as RNN or LSTM) can also be used to encode queries to retrieve images with similar semantic meanings.
contextual cues due to their ability to represent sequence Zheng et al. [66], [67] combines the strengths of CNNs with
features. When taking into consideration the inherent paral- CRFs, and formulates mean-field inference as Recurrent
lel/hierarchical relationships between objects, dynamic tree Neural Networks (RNN). Therefore, it is reasonable to use
structures can also be used to capture task-specific visual CRF or RNN to formulate a scene graph generation problem
contexts. In the following subsections, message passing [49], [56].
methods will be analyzed according to the three categories Further, there are some other relevant works which pro-
described below. posed modeling methods based on a pre-determined graph.
Message Passing on Graph Structures. Li et al. [41] Hu et al. [113] explicitly model objects and interactions
developed an end-to-end Multi-level Scene Description Net- by an interaction graph, a directed graph built on object
work (MSDN), in which message passing is guided by proposals based on the spatial relationships between objects,
the dynamic graph constructed from objects and caption and then propose a message-passing algorithm to propagate
region proposals. In the case of a phrase proposal, the the contextual information. Zhou et al. [114] mined and mea-
message comes from a caption region proposal that may sured the relevance of predicates using relative location and
cover multiple object pairs, and may contain contextual constructed a location-based Gated Graph Neural Network
information with a larger scope than a triplet. For com- (GGNN) to improve the relationship representation. Chen et
parison, the Context-based Captioning and Scene Graph al. [31] built a graph to associate the regions and employed
Generation Network (C2SGNet) [73] also simultaneously a graph neural network to propagate messages through the
generates region captions and scene graphs from input graph. Dornadula et al. [61] initialized a fully connected
images, but the message passing between phrase and re- graph, i.e., all objects are connected to all other objects by
gion proposals is unidirectional, i.e., the region proposals all predicate edges, and updated their representation using
requires additional context information for the relationships message passing protocols within a well-designed graph
between object pairs. Moreover, in an extension of MSDN convolution framework. Zareian et al. [95] formed a het-
model, Li et al. [13] proposed a subgraph-based scene graph erogeneous graph by using some bridge edges to connect a
generation approach called Factorizable Network (F-Net), commonsense graph and initialized a fully connected graph.
where the object pairs referring to the similar interacting They then employed a variant of GGNN to propagate infor-
regions are clustered into a subgraph and share the phrase mation among nodes and updated node representations and
representation. F-Net clusters the fully-connected graph into bridge edges. Wang et al. [115] constructed a virtual graph
several subgraphs to obtain a factorized connection graph with two types of nodes (objects vio and relations vij r
) and
by treating each subgraph as a node, and passing messages o o o r r o
three types of edges ( vi , vj , vi , vij and vij , vj , ), and
between subgraph and object features along the factorized then refined representations for objects and relationships
connection graph with a Spatial-weighted Message Passing with an explicit message passing mechanism.
(SMP) structure for feature refinement. Message Passing on Chain Structures. Dense graph
Even though MSDN and F-Net extended the scope of inference can be approximated by mean fields in CRF, and
message passing, a subgraph is considered as a whole when it can also be dealt with using an RNN-based model. Xu et
sending and receiving messages. Liao et al. [53] proposed al. [54] generated structured scene representation from an
semantics guided graph relation neural network (SGRNN), image, and solved the graph inference problem using GRUs
in which the target and source must be an object or a to iteratively improve its predictions via message passing.
predicate within a subgraph. It first establishes an undi- This work is considered as a milestone in scene graph
rected fully-connected graph by associating any two objects generation, demonstrating that RNN-based models can be
as a possible relationship. Then, they remove the connec- used to encode the contextual cues for visual relationship
tions that are semantically weakly dependent, through a recognition. At this point, Zellers et al. [42] presented a
semantics guided relation proposal network (SRePN), and novel model, Stacked Motif Network (MOTIFNET), which
a semantically connected graph is formed. To refine the uses LSTMs to create a contextualized representation of each
feature of a target entity (object or relationship), source- object. Dhingra et al. [55] proposed an object communication
target-aware message passing is performed by exploiting module based on a bi-directional GRU layer and used two
contextual information from the objects and relationships different transformer encoders to further refine the object
that the target is semantically correlated with for feature features and gather information for the edges. The Counter-
refinement. The scope of messaging is the same as Feature factual critic Multi-Agent Training (CMAT) approach [116]
Inter-refinement of objects and relations in [89]. is another important extension where an agent represents a
12

detected object. Each agent communicates with the others


for T rounds to encode the visual context. In each round
of communication, an LSTM is used to encode the agent
interaction history and extracts the internal state of each
agent.
Many other message passing methods based on RNN
have been developed. Chen et al. [74] used an RNN mod-
ule to capture instance-level context, including objects co-
occurrence, spatial location dependency and label relation.
Dai et al. [76] used a Bi-directional RNN and Shin et al. [73]
used Bi-directional LSTM as a replacement. Masui et al. [75]
proposed three triplet units (TUs) for selecting a correct SPO
triplet at each step of LSTM, while achieving class–number
scalability by outputting a single fact without calculating a
score for every combination of SPO triplets.
Message Passing on Tree Structures. As previously
stated, graph and chain structures are widely used for
message passing. However, these two structures are sub-
optimal. Chains are oversimplified and may only capture
simple spatial information or co-occurring information. Fig. 6: Two kinds of attention mechanisms in SGG: (1) Self-
Even though fully-connected graphs are complete, they do Attention mechanism [63] aggregates multimodal features
not distinguish between hierarchical relations. Tang et al. of an object to generate a more comprehensive representa-
[78] constructed a dynamic tree structure, dubbed VCTREE, tion. (2) Context-Aware Attention [58] learns the contextual
that places objects into a visual context, and then adopted features using graph parsing.
bidirectional TreeLSTM to encode the visual contexts. VC-
TREE has several advantages over chains and graphs, such
as hierarchy, dynamicity, and efficiency. VCTREE construc- more comprehensive representation. Zheng et al. [63] pro-
tion can be divided into three stages: (1) learn a score matrix posed a multi-level attention visual relation detection model
S , where each element is defined as the product of the object (MLA-VRD), which uses multi-stage attention for appear-
correlation and the pairwise task-dependency; (2) obtain a ance feature extraction and multi-cue attention for feature
maximum spanning tree using fusion. In order to capture discriminative information from
P the Prim’s algorithm, with
a root i satisfying argmaxi j6=i Sij ; (3) convert the multi- the visual appearance, the channel-wise attention is applied
branch tree into an equivalent binary tree (i.e., VCTREE) by in each convolutional block of the backbone network to
changing non-leftmost edges into right branches. The ways improve the representative ability of low-level appearance
of context encoding and decoding for objects and predicates features, and the spatial attention learns to capture the
are similar to [42], but they replace LSTM with TreeLSTM. In salient interaction regions in the union bounding box of the
[42], Zellers et al. tried several ways to order the bounding object pair. The multi-cue attention is designed to combine
regions in their analysis. Here, we can see the tree structure appearance, spatial and semantic cues dynamically accord-
in VCTREE as another way to order the bounding regions. ing to their significance for relation detection.
In another work, Zhou et al. [80] combined multi-stage
3.1.5 Attention Mechanisms and multi-cue attention to structure the Language and Posi-
Attention mechanisms flourished soon after the success of tion Guided Attention module (LPGA), where language and
Recurrent Attention Model (RAM) [51] for image classifica- position information are exploited to guide the generation
tion. They enable models to focus on the most significant of more efficient attention maps. Zhuang et al. [57] proposed
parts of the input [52]. With scene graph generation, as with a context-aware model, which applies an attention-pooling
iterative message passing models, there are two objectives: layer to the activations of the conv5 3 layer of VGG-16 as
refine local features and fuse contextual information. On the an appearance feature representation of the union region.
basic framework shown in Fig.2, attention mechanisms can For each relation class, there is a corresponding attention
be used both at the stage of feature representation and at model imposed on the feature map to generate a rela-
the stage of feature refinement. At the feature representation tion class-specific attention-pooling vectors. Han et al. [118]
stage, attention can be used in the spatial domain, channel argued that the context-aware model pays less attention
domain or their mixed domain to produce a more precise to small-scale objects. Therefore, they proposed the Vision
appearance representation of object regions and unions of Spatial Attention Network (VSA-Net), which employs a
object-pairs. At the feature refinement stage, attention is two-dimensional normal distribution attention scheme to
used to update each object and relationship representation effectively model small objects. The attention is added to the
by integrating contextual information. Therefore, this sec- corresponding position of the image according to the spa-
tion will analyze two types of attention mechanisms for tial information of the Faster R-CNN outputs. Kolesnikov
SGG (as illustrated in Fig. 6), namely, Self-Attention and et al. [121] proposed the Box Attention and incorporated
Context-Aware Attention mechanisms. box attention maps in the convolutional layers of the base
Self-Attention Mechanisms. Self-Attention mechanism detection model.
aggregates multimodal features of an object to generate a Context-Aware Attention Mechanisms. Context-Aware
13

Attention learns the contextual features using graph pars- the large variety in their appearance, which depends on
ing. Yang et al. [58] proposed Graph R-CNN based on graph the involved entities. Second, is to handle the scarcity of
convolutional neural network (GCN) [59], which can be fac- training data for zero-shot visual relation triplets. Visual
torized into three logical stages: (1) produce a set of localized embedding approaches aim at learning a compositional
object regions, (2) utilize a relation proposal network (RePN) representation for subject, object and predicate by learning
to learns to efficiently compute relatedness scores between separate visual-language embedding spaces, where each of
object pairs, which are used to intelligently prune unlikely these entities is mapped close to the language embedding
scene graph connections, and (3) apply an attentional graph of its associated annotation. By constructing a mathemati-
convolution network (aGCN) to propagate a higher-order cal relationship of visual-semantic embeddings for subject,
context throughout the sparse graph. In the aGCN, for predicate and object, an end-to-end architecture can be built
a target node i in the graph, the representations of its and trained to learn a visual translation vector for predic-
neighboring nodes {zj |j ∈ N (i)} are first transformed via tion. In this section, we divide the visual translation embed-
a learned linear transformation W . Then, these transformed ding methods according to the translations (as illustrated in
representations are gathered with predetermined weights α, Fig.7), including Translation between Subject and Object,
followed by a nonlinear function σ (ReLU). This layer-wise and Translation among Subject, Object and Predicate.
propagation can be written as:
 
(l+1) (l) (l)
X
zi = σ zi + αij W zj (7)
j∈N (i)

The attention αij for node i is:


(l) (l)
αij = softmax(whT σ(Wa [zi , zj ])) (8)
where wh and Wa are learned parameters and [·; ·] is the
concatenation operation. From the derivation, it can be seen
that the aGCN is similar to Graph Attention Network (GAT)
[60]. In conventional GCN, the connections in the graph are
known and the coefficient vectors αij are preset based on
the symmetrically normalized adjacency matrix of features.
Qi et al. [62] also leveraged a graph self-attention module
to embed entities, but the strategies to determine connection
(i.e., edges that represent relevant object pairs are likely
to have relationships) are different from the RePN, which
uses the multi-layer perceptron (MLP) to learn to efficiently Fig. 7: Two types of visual translation embedding ap-
estimate the relatedness of an object pair, where the adjacency proaches according to whether to embed the predicate into
matrix is determined with the space position of nodes. Lin et N-dimensional space [70] or not [68], besides the subject and
al. [165] designed a direction-aware message passing (DMP) object embedding.
module based on GAT to enhances the node feature with
node-specific contextual information. Moreover, Zhang et Translation Embedding between Subject and Object:
al. [81] used context-aware attention mechanism directly on Translation-based models in knowledge graphs are good at
the fully-connected graph to refine object region feature and learning embeddings, while also preserving the structural
performed comparative experiments of Soft-Attention and information of the graph [138], [139], [140]. Inspired by
Hard-Attention in ablation studies. Dornadula et al. [61] Translation Embedding (TransE) [138] to represent large-
introduced another interesting GCN-based attention model, scale knowledge bases, Zhang et al. [68] proposed a Visual
which treats predicates as learned semantic and spatial func- Translation Embedding network (VTransE) which places
tions that are trained within a graph convolution network objects in a low-dimensional relation space, where a re-
on a fully connected graph where object representations lationship can be modeled as a simple vector translation,
form the nodes and the predicate functions act as edges. i.e., subject + predicate ≈ object. Suppose xs , xo ∈ M R
are the M-dimensional features, VTransE learns a relation
3.1.6 Visual Translation Embedding
R
translation vector tp ∈ r (r  M ) and two projection
Each visual relation involves subject, object and predicate,
resulting in a greater skew of rare relations, especially when
R
matrices Ws , Wo ∈ r×M from the feature space to the
relation space. The visual relation can be represented as:
the co-occurrence of some pairs of objects is infrequent in
the dataset. Some types of relations contain very limited Ws xs + tp ≈ Wo xo (9)
examples. The long-tail problem heavily affects the scala-
bility and generalization ability of learned models. Another The overall feature xs or xo is a weighted concatenation
problem is the large intra-class divergence [122], i.e., rela- of three features: semantic, spatial and appearance. The
tions that have the same predicate but from which different semantic information is an (N + 1)-d vector of object clas-
subjects or objects are essentially different. Therefore, there sification probabilities (i.e., N classes and 1 background)
are two challenges for visual relationship detection models. from the object detection network, rather than the word2vec
First, is the right representation of visual relations to handle embedding of label.
14

Translation Embedding between Subject, Object and Another TransE-inspired model is RLSV (Representa-
Predicate: In an extension of VTransE, Hung et al. [69] tion Learning via Jointly Structural and Visual Embedding)
proposed the Union Visual Translation Embedding network [112]. The architecture of RLSV is a three-layered hierarchi-
(UVTransE), which learns three projection matrices Ws , cal projection that projects a visual triple onto the attribute
Wo , Wu which map the respective feature vectors of the space, the relation space, and the visual space in order.
bounding boxes enclosing the subject, object, and union of This makes the subject and object, which are packed with
subject and object into a common embedding space, as well attributes, projected onto the same space of the relation,
as translation vectors tp (to be consistent with VTransE) in instantiated, and translated by the relation vector. This also
the same space corresponding to each of the predicate labels makes the head entity and the tail entity packed with
that are present in the dataset. Another extension is ATR- attributes, projected onto the same space of the relation,
Net (Attention-Translation-Relation Network), proposed by instantiated, and translated by the relation vector. It jointly
Gkanatsios et al. [70] which projects the visual features from combines the structural embeddings and the visual embed-
the subject, the object region and their union into a score dings of a visual triple t = (s, r, o) as new representations
space as S , O and P with multi-head language and spatial (xs , xr , xo ) and scores it as follow:
attention guided. Let A denotes the attention matrix of all
predicates, Eq. 9 can be reformulated as: EI (t) = ||xs + xr − xo ||L1/L2 (15)

Wp (A)xp ≈ Wo (A)xo − Ws (A)xs (10)


In summary, while many of the above 2D SGG models use
Contrary to VTransE, the authors do not directly align P more than one method, we selected the method we felt best
and O − S by minimizing ||P + S − O||, instead, they reflected the idea of the paper for our primary classification
create two separate score spaces for predicate classification of methods. The aforementioned 2D SGG challenges can
(p) and object relevance (r), respectively and impose loss also be addressed in other ways utilizing different concepts.
constraints, LeP and LeOS (e can be p or r), to force both P As an example, Knyazev et al. [185] used Generative Ad-
and O − S to match with the task’s ground-truth as follows: versarial Networks (GANs) to synthesize rare yet plausi-
ble scene graphs to overcome the long-tailed distribution
problem. Huang et al. [181] designed a Contrasting Cross-
Le (W e ) = Lef (W e ) + LeP (WPe ) + LeOS (WOe , WSe ) (11)
Entropy loss and a scoring module to address class imbal-
Subsequently, Qi et al. [62] introduced a semantic transfor- ance. Fukuzawa and Toshiyuki [82] introduced a pioneering
mation module into their network structure to represent approach to visual relationship detection by reducing it to
hS, P, Oi in the semantic domain. This module leverages an object detection problem, and they won the Google AI
both the visual features (i.e., fi , fj and fij ) and the se- Open Images V4 Visual Relationship Track Challenge. A
mantic features (i.e., vi , vj and vij ) that are concatenated neural tensor network was proposed by Qiang et al. [190] for
and projected into a common semantic space to learn the predicting visual relationships in an image. These methods
relationship between pair-wise entities. L2 loss is used to contribute to the 2D SGG field in their own ways.
guide the learning process:
3.2 Spatio-Temporal Scene Graph Generation
L = ||W3 · [fj , vo ] − (W1 · [fi , vi ] + W2 · [fij , vij ])||22 (12) Recently, with the development of relationship detection
models in the context of still images (ImgVRD), some re-
The Multimodal Attentional Translation Embeddings
searchers have started to pay attention to understand visual
(MATransE) model built upon VTransE [71] learns a pro-
relationships in videos (VidVRD). Compared to images,
jection of hS, P, Oi into a score space where S + P ≈ O, by
videos provide a more natural set of features for detecting
guiding the features’ projection with attention to satisfy:
visual relations, such as the dynamic interactions between
objects. Due to their temporal nature, videos enable us to
Wp (s, o, m)xp = Wo (s, o, m)xo − Ws (s, o, m)xs (13) model and reason about a more comprehensive set of visual
relationships, such as those requiring temporal observa-
where x+ are the visual appearance features and
tions (e.g., man, lift up, box vs. man, put down, box), as
W+ (s, o, m) are the projection matrices that are learned
well as relationships that are often correlated through time
by employing a Spatio-Linguistic Attention module (SLA-
(e.g., woman, pay, money followed by woman, buy, coffee).
M) that uses binary masks’ convolutional features m and
Meanwhile, motion features extracted from spatial-temporal
encodes subject and object classes with pre-trained word
content in videos help to disambiguate similar predicates,
embeddings s, o. Compared with Eq. 9, Eq. 13 can be
such as “walk” and “run” (Fig. 8a). Another significant
interpreted as:
difference between VidVRD and ImgVRD is that the visual
relations in a video are usually changeable over time, while
Wo (s, o, m)xo − Ws (s, o, m)xs = tp = Wp (s, o, m)xp these of images are fixed. For instance, the objects may be
(14) occluded or out of one or more frame temporarily, which
Therefore, there are two branches, P-branch and OS-branch, causes the occurrence and disappearance of visual relations.
to learn the relation translation vector tp separately. To Even when two objects consistently appear in the same
satisfy Eq. 13, it enforces a score-level alignment by jointly video frames, the interactions between them may be tempo-
minimizing the loss of each one of the P- and OS-branches rally changed [101]. Fig. 8b shows an example of temporally
with respect to ground-truth using deep supervision. changing visual relation between two objects within a video.
15

Two visual relation instances containing their relationship relation prediction process consists of two steps: relation-
triplets and object trajectories of the subjects and objects. ship feature extraction and relationship modeling. Given a
pair of object tracklet proposals (Ts , To ) in a segment, (1)
extract the improved dense trajectory (iDT) features [179]
with HoG, HoF and MBH in video segments, which capture
both the motion and the low-level visual characteristics; (2)
extract the relative characteristics between Ts and To which
describes the relative position, size and motion between
the two objects; (3) add the classeme feature [68]. The
concatenation of these three types of features as the overall
relationship feature vector is fed into three predictors to
classify the observed relation triplets. The dominating way
to get the final video-level relationships is greedy local
association, which greedily merges two adjacent segments
if they contain the same relation.
Tsai et al. [102] proposed a Gated Spatio-Temporal En-
ergy Graph (GSTEG) that models the spatial and temporal
structure of relationship entities in a video by a spatial-
Fig. 8: Examples of video visual relations. (From [101]). temporal fully-connected graph, where each node repre-
sents an entity and each edge denotes the statistical de-
Different from static images and because of the addi- pendencies between the connected nodes. It also utilizes
tional temporal channel, dynamic relationships in videos an energy function with adaptive parameterization to meet
are often correlated in both the spatial and temporal dimen- the diversity of relations, and achieves the state-of-the-art
sions. All the relationships in a video can collectively form performance. The construction of the graph is realized by
a spatial-temporal graph structure, as mentioned in [99], linking all segments as a Markov Random Fields (MRF)
[102], [145], [171]. Therefore, we redefine the VidVRD as conditioned on a global observation.
Spatio-Temporal Scene Graph Generation (ST-SGG). To be Shang et al. [144] has published another dataset, Vi-
consistent with the definition of 2D scene graph, we also de- dOR and launched the ACM MM 2019 Video Relation
fine a spatio-temporal scene graph as a set of visual triplets Understanding (VRU) Challenge2 to encourage researchers
RS . However, for each rS,i→j = (sS,i , pS,i→j , oS,j ) ∈ RS , to explore visual relationships from a video [142]. In this
sS,i = (lS,k1 , Ts ) and oS,j = (lS,k2 , To ) both have a trajectory challenge, Zheng et al. [141] use Deep Structural Ranking
(resp. Ts and To ) rather than a fixed bbox. Specifically, Ts (DSR) [24] model to predict relations. Different from the
and To are two sequences of bounding boxes, which respec- pipeline in [101], they associate the short-term preliminary
tively enclose the subject and object, within the maximal trajectories before relation prediction by using a sliding
duration of the visual relation. Therefore, VidVRD aims window method to locate the endpoint frames of a re-
to detect each entire visual relation RS instance with one lationship triplet, rather than relational association at the
bounding box trajectory. end. Similarly, Sun et al. [143] also associate the preliminary
ST-SGG relies on video object detection (VOD). The trajectories on the front by applying a kernelized correlation
mainstream methods address VOD by integrating the lat- filter (KCF) tracker to extend the preliminary trajectories
est techniques in both image-based object detection and generated by Seq-NMS in a concurrent way and generate
multi-object tracking [175], [176], [177]. Although recent complete object trajectories to further associate the short-
sophisticated deep neural networks have achieved superior term ones.
performances in image object detection [17], [19], [173],
[174], object detection in videos still suffers from a low 3.3 3D Scene Graph Generation
accuracy, because of the presence of blur, camera motion The classic computer vision methods aim to recognize ob-
and occlusion in videos, which hamper an accurate object jects and scenes in static images with the use of a mathe-
localization with bounding box trajectories. Inevitably, these matical model or statistical learning, and then progress to do
problems have gone down to downstream video relation- motion recognition, target tracking, action recognition etc. in
ship detection and even are amplified. video. The ultimate goal is to be able to accurately obtain the
Shang et al. [101] first proposed VidVRD task and in- shapes, positions and attributes of the objects in the three-
troduced a basic pipeline solution, which adopts a bottom- dimensional space, so as to realize detection, recognition,
up strategy. The following models almost always use this tracking and interaction of objects in the real world. In the
pipeline, which decomposes the VidVRD task into three in- computer vision field, one of the most important branches
dependent parts: multi-object tracking, relation prediction, of 3D research is the representation of 3D information.
and relation instances association. They firstly split videos The common 3D representations are multiple views, point
into segments with a fixed duration and predict visual re- clouds, polygonal meshes, wireframe meshes and voxels of
lations between co-occurrent short-term object tracklets for various resolutions. To extend the concept of scene graph to
each video segment. Then they generate complete relation 3D space, researchers are trying to design a structured text
instances by a greedy associating procedure. Their object representation to encode 3D information. Although existing
tracklet proposal is implemented based on a video object de-
tection method similar to [178] on each video segment. The 2. https://videorelation.nextcenter.org/mm19-gdc/
16

scene graph research concentrates on 2D static scenes, based 4 DATASETS


on these findings as well as on the development of 3D In this section, we provide a summary of some of the
object detection [149], [150], [151], [155] and 3D Semantic most widely used datasets for visual relationship and scene
Scene Segmentation [152], [153], [154], scene graphs in 3D graph generation. These datasets are grouped into three
have recently started to gain more popularity [44], [146], categories–2D images, videos and 3D representation.
[147], [148]. Compared with the 2D scene graph generation
problem at the image level, to understand and represent
the interaction of objects in the three-dimensional space is 4.1 2D Datasets
usually more complicated. The majority of the research on visual relationship detection
Stuart et al. [146] were the first to introduce the term of and scene graph generation has focused on 2D images;
“3D scene graph” and defined the problem and the model therefore, several 2D image datasets are available and their
related to the prediction of 3D scene graph representations statistics are summarized in Table 1. The following are some
across multiple views. However, there is no essential dif- of the most popular ones:
ference in structure between their 3D scene graph and a Visual Phrase [43] is on visual phrase recognition and
2D scene graph. Moreover, Zhang et al. [135] started with detection. The dataset contains 8 object categories from
the cardinal direction relations and analyzed support relations Pascal VOC2008 [157] and 17 visual phrases that are formed
between a group of connected objects grounded in a set of by either an interaction between objects or activities of
RGB-D images about the same static scene from different single objects. There are 2,769 images including 822 negative
views. Kim et al. [147] proposed a 3D scene graph model samples and on average 120 images per category. A total of
for robotics, which however takes the traditional scene 5,067 bounding boxes (1,796 for visual phrases + 3,271 for
graph only as a sparse and semantic representation of three- objects) were manually marked.
dimensional physical environments for intelligent agents. To Scene Graph [1] is the first dataset of real-world scene
be precise, they just use the scene graph for the 3D scene graphs. The full dataset consists of 5,000 images selected
understanding task. Similarly, Johanna et al. [148] tried to from the intersection of the YFCC100m [161] and Microsoft
understand indoor reconstructions by constructing 3D se- COCO [162] datasets and each of which has a human-
mantic scene graph. None of these works have proposed an generated scene graph.
ideal way to model the 3D space and multi-level semantics. Visual Relationship Detection (VRD) [28] dataset in-
tends to benchmark the scene graph generation task. It
Until now, there is no unified definition and represen- highlights the long-tailed distribution of infrequent rela-
tation of 3D scene graph. However, as an extension of the tionships. The public benchmark based on this dataset uses
2D scene graph in 3D spaces, 3D scene graph should be 4,000 images for training and test on the remaining 1,000
designed as a simple structure which encodes the relevant images. The relations broadly fit into categories, such as
semantics within environments in an accurate, applicable, action, verbal, spatial, preposition and comparative.
usable, and scalable way, such as object categories and Visual Genome [30] has the maximum number of re-
relations between objects as well as physical attributes. It lation triplets with the most diverse object categories and
is noteworthy that, Armeni et al. [44] creatively proposed a relation labels up to now. Unlike VRD that is constructed by
novel 3D scene graph model, which performs a hierarchi- computer vision experts, VG is annotated by crowd workers
cal mapping of 3D models of large spaces in four stages: and thus a substantial fraction of the object annotations
camera, object, room and building, and describe a semi- has poor quality and overlapping bounding boxes and/or
automatic algorithm to build the scene graph. Recently, ambiguous object names. As an attempt to eliminate the
Rosinolet al. [172] defined 3D Dynamic Scene Graphs as noise, prior works have explored semi-automatic ways (e.g.,
a unified representation for actionable spatial perception. class merging and filtering) to clean up object and relation
More formally, this 3D scene graph is a layered directed annotations and constructed their own VG versions. Of
graph where nodes represent spatial concepts (e.g., ob- these, VG200 [68], VG150 [54], VG-MSDN [41] and sVG
jects, rooms, agents) and edges represent pair-wise spatio- [49] have released their cleansed annotations and are the
temporal relations (e.g., “agent A is in room B at time t”). most frequently used. Other works [26], [56], [72], [83],
They provide an example of a single-layer indoor environ- [86], [90], [110], [114], [117] use a paper-specific and non-
ment which includes 5 layers (from low to high abstraction publicly available split, disabling direct future comparisons
level): Metric-Semantic Mesh, Objects and Agents, Places with their experiments. Moreover, [15] presents experiments
and Structures, Rooms, and Building. Whether it is a four on a large-scale version of VG, named VG80K, and [156]
[44] - or five-story [172] structure, we can get a hint that proposes a new split that has not been benchmarked yet.
3D scene contains rich semantic information that goes far Sun [111] constructed two datasets for hierarchical visual
beyond the 2D scene graph representation. relationship detection (HVRD) based on VRD dataset and
VG dataset, named H-VRD and H-VG, by expanding their
In summary, this section includes a comprehensive flat relationship category spaces to hierarchical ones, respec-
overview of 2D SGG, followed by reviews of ST-SGG and tively. The statistics of these datasets are summarized in
3D SGG. Researchers have contributed to the SGG field and Table 2.
will continue to do so, but the long-tail problem and the VG150 [54] is constructed by pre-processing VG to im-
large intra-class diversity problem will remain hot issues, prove the quality of object annotations. On average, this
motivating researchers to explore more models to generate annotation refinement process has corrected 22 bounding
more useful scene graphs. boxes and/or names, deleted 7.4 boxes, and merged 5.4
17

TABLE 1: The statistics of common 2D datasets.

Dataset object bbox relationship triplet image source link


Visual Phrase [43] 8 3,271 9 1,796 2,769 http://vision.cs.uiuc.edu/phrasal/
http://imagenet.stanford.edu/internal/jcjohns/
Scene Graph [1] 266 69,009 68 109,535 5,000
scene graphs/sg dataset.zip
VRD [28] 100 - 70 37,993 5,000 https://cs.stanford.edu/people/ranjaykrishna/vrd/
https://storage.googleapis.com/openimages/
Open Images v4 [10] 57 3,290,070 329 374,768 9,178,275
web/index.html
Visual Genome [30] 33,877 3,843,636 40,480 2,347,187 108,077 http://visualgenome.org/
VrR-VG [156] 1,600 282,460 117 203,375 58,983 http://vrr-vg.com/
UnRel [128] - - 18 76 1,071 https://www.di.ens.fr/willow/research/unrel/
SpatialSense [159] 3,679 - 9 13,229 11,569 https://github.com/princeton-vl/SpatialSense
SpatialVOC2K [158] 20 5775 34 9804 2,026 https://github.com/muskata/SpatialVOC2K

TABLE 2: The statistics of common VG versions. images collected from the web with 76 unusual language
Dataset Pred. Classes Obj. Classes Total Images Train Images Test Images
triplet queries such as “person ride giraffe”. All images are
VG150 [54] 50 150 108,077 75.6k 32.4k annotated at box-level for the given triplet queries. Since the
VG200 [68] 100 200 99,658 73.8k 25.8k triplet queries of UnRel are rare (and thus likely not seen
sVG [49] 24 399 108,077 64.7k 8.7k
VG-MSDN [41] 50 150 95,998 71k 25k
at training), it is often used to evaluate the generalization
VG80k [15] 29,086 53,304 104,832 99.9k 4.8k performance of the algorithm.
SpatialSense [159] is a dataset specializing in spatial
relation recognition. A key feature of the dataset is that it
duplicate bounding boxes per image. The benchmark uses is constructed through adversarial crowdsourcing: a human
the most frequent 150 object categories and 50 predicates annotator is asked to come up with adversarial examples to
for evaluation. As a result, each image has a scene graph of confuse a recognition system.
around 11.5 objects and 6.2 relationships. SpatialVOC2K [158] is the first multilingual image
VrR-VG [156] is also based on Visual Genome. Its pre- dataset with spatial relation annotations and object features
processing aims at reducing the duplicate relationships for image-to-text generation. It consists of all 2,026 images
by hierarchical clustering and filtering out the visually- with 9,804 unique object pairs from the PASCAL VOC2008
irrelevant relationships. As a result, the dataset keeps the dataset. For each image, they provided additional annota-
top 1,600 objects and 117 visually-relevant relationships tions for each ordered object pair, i.e., (a) the single best, and
of Visual Genome. Their hypothesis to identify visually- (b) all possible prepositions that correctly describe the spatial
irrelevant relationships is that if a relationship label in relationship between objects. The preposition set contains 17
different triplets is predictable according to any informa- English prepositions and 17 French prepositions.
tion, except visual information, the relationship is visually-
irrelevant. This definition is a bit far-fetched but helps to 4.2 Video Datasets
eliminate redundant relationships. The area of video relation understanding aims at promoting
Open Images [10] is a dataset of 9M images anno- novel solutions and research on the topic of object detec-
tated with image-level labels, object bounding boxes, ob- tion, object tracking, action recognition, relation detection
ject segmentation masks, visual relationships, and localized and spatio-temporal analysis, that are integral parts into a
narratives. The images are very diverse and often contain comprehensive visual system of the future. So far there are
complex scenes with several objects (8.3 per image on av- two public datasets for video relational understanding.
erage). It contains a total of 16M bounding boxes for 600 ImageNet-VidVRD [101] is the first video visual rela-
object classes on 1.9M images, making it the largest existing tion detection dataset, which is constructed by selecting
dataset with object location annotations. The boxes have 1,000 videos from the training set and the validation set
largely been manually drawn by professional annotators to of ILSVRC2016-VID [163]. Based on the 1,000 videos, the
ensure accuracy and consistency. Open Images also offers object categories increase to 35. It contains a total of 3,219
visual relationship annotations, indicating pairs of objects relationship triplets (i.e., the number of visual relation types)
in particular relations (e.g., “woman playing guitar”, “beer with 132 predicate categories. All videos were decomposed
on table”), object properties (e.g., “table is wooden”), and into segments of 30 frames with 15 overlapping frames
human actions (e.g., “woman is jumping”). In total it has in advance, and all the predicates appearing in each seg-
3.3M annotations from 1,466 distinct relationship triplets. ment were labeled to obtain segment-level visual relation
So far, there are six released versions which are available on instances.
the official website and [10] describes Open Images V4 in VidOR [144] consists of 10,000 user-generated videos
details, i.e., from the data collection and annotation to the (98.6 hours) together with dense annotations on 80 cate-
detailed statistics about the data and the evaluation of the gories of objects and 50 categories of predicates. The whole
models trained on it. dataset is divided into 7,000 videos for training, 835 videos
UnRel [128] is a challenging dataset that contains 1,000 for validation, and 2,165 videos for testing. All the annotated
18

categories of objects and predicates appear in each of the overlap with the ground truth box. It is also called
train/val/test sets. Specifically, objects are annotated with Union boxes detectionin [49].
a bounding-box trajectory to indicate their spatio-temporal 2) Predicate Classification (PredCls) [54]: Given a set
locations in the videos; and relationships are temporally an- of localized objects with category labels, decide
notated with start and end frames. The videos were selected which pairs interact and classify each pair’s pred-
from YFCC-100M multimedia collection and the average icate.
length of the videos is about 35 seconds. The relations are 3) Scene Graph Classification (SGCls) [54]: Given a
divided into two types, spatial relations (8 categories) and set of localized objects, predict the predicate as well
action relations (42 categories) and the annotation method as the object categories of the subject and the object
is different for the two types of relations. in every pairwise relationship.
4) Scene Graph Generation (SGGen) [54]: Detect a
4.3 3D Datasets set of objects and predict the predicate between
each pair of the detected objects. This task is also
Three dimensional data is usually provided via multi-view
called Relationship Detection (RelDet) in [28] or
images such as point clouds, meshes, or voxels. Recently,
Two boxes detection in [49]. It is similar to phrase
several 3D datasets related to scene graphs have been re-
detection, but with the difference that both the
leased to satisfy the needs of SGG study.
bounding box of the subject and object need at least
3D Scene Graph is constructed by annotated the Gib-
50 percent of overlap with their ground truth. Since
son Environment Database [160] using the automated 3D
SGGen only scores a single complete triplet, the re-
Scene Graph generation pipeline proposed in [44]. Gibson’s
sult cannot reflect the detection effects of each com-
underlying database of spaces includes 572 full buildings
ponent in the whole scene graph. So Yang et al. [58]
composed of 1,447 floors covering a total area of 211km2 . It
proposed the Comprehensive Scene Graph Gen-
is collected from real indoor spaces using 3D scanning and
eration (SGGen+) as an augmentation of SGGen.
reconstruction and provides the corresponding 3D mesh
SGGen+ not only considers the triplets in the graph,
model of each building. Meanwhile, for each space, the RGB
but also the singletons (object and predicate). To be
images, depth and surface normals are provided. A fraction
clear, SGGen+ is essentially a metric rather than a
of the spaces is annotated with semantic objects.
task.
3DSGG, proposed in [147], is a large scale 3D dataset
that extends 3RScan with semantic scene graph annotations, There are also some paper-specific task settings includ-
containing relationships, attributes and class hierarchies. A ing Triple Detection [87], Relation Retrieval [68] and so
scene graph here is a set of tuples (N, R) between nodes N on.
and edges R. Each node is defined by a hierarchy of classes In the video based visual relationship detection task,
c = (c1 , · · · , cd ) and a set of attributes A that describe the there are two standard evaluation modes: Relation De-
visual and physical appearance of the object instance. The tection and Relation Tagging. The detection task aims to
edges define the semantic relations between the nodes. This generate a set of relationship triplets with tracklet proposals
representation shows that a 3D scene graph can easily be from a given video, while the tagging task only considers
rendered to 2D. the accuracy of the predicted video relation triplets and
ignores the object localization results.
5 P ERFORMANCE E VALUATION
In this section, we first introduce some commonly used 5.2 Metrics
evaluation modes and criteria for the scene graph genera- Recall@K. The conventional metric for the evaluation of
tion task. Then, we provide the quantitative performance of SGG is the image-level Recall@K(R@K), which com-
the promising models on popular datasets. Since there is no putes the fraction of times the correct relationship is pre-
uniform definition of a 3D scene graph, we will introduce dicted in the top K confident relationship predictions. In
these contents around 2D scene graph and spatio-temporal addition to the most commonly used R@50 and R@100,
scene graph. some works also use the more challenging R@20 for a
more comprehensive evaluation. Some methods compute
5.1 Tasks R@K with the constraint that merely one relationship can be
obtained for a given object pair. Some other works omit this
Given an image, the scene graph generation task consists of
constraint so that multiple relationships can be obtained,
localizing a set of objects, classifying their category labels,
leading to higher values. There is a superparameter k ,
and predicting relations between each pair of these objects.
often not clearly stated in some works, which measures the
Most prior works often evaluated their SGG models on
maximum predictions allowed per object pair. Most works
several of the following common sub-tasks. We preserve the
have seen PhrDet as a multiclass problem and they use
names of tasks as defined in [28] and [54] here, despite the
k = 1 to reward the correct top-1 prediction for each pair.
inconsistent terms used in other papers and the inconsisten-
While other works [63], [85], [105] tackle this as a multilabel
cies on whether they are in fact classification or detection
problem and they use a k equal to the number of predicate
tasks.
classes to allow for predicate co-occurrences [70]. Some
1) Phrase Detection (PhrDet) [28]: Outputs a label works [42], [70], [90], [129], [164] have also identified this
subject-predicate-object and localizes the entire rela- inconsistency and interpret it as whether there is graph con-
tionship in one bounding box with at least 0.5 straint (i.e., the k is the maximum number of edges allowed
19

between a pair of object nodes). The unconstrained metric Precision@K. In the video relation detection task,
(i.e., no graph constraint) evaluates models more reliably, P recision@K(P @K) is used to measure the accuracy
since it does not require a perfect triplet match to be the of the tagging results for the relation tagging task.
top-1 prediction, which is an unreasonable expectation in mAP. In the OpenImages VRD Challenge, results are evalu-
a dataset with plenty of synonyms and mislabeled annota- ated by calculating Recall@50(R@50), mean AP of relation-
tions. For example, ‘man wearing shirt’ and ‘man in shirt’ are ships (mAPrel ), and mean AP of phrases (mAPphr ) [129].
similar predictions, however, only the unconstrained metric The mAPrel evaluates AP of hs, p, oi triplets where both
allows for both to be included in ranking. Obviously, the the subject and object boxes have an IOU of at least 0.5
SGGen+ metric above has a similar motivation as removing with the ground truth. The mAPphr is similar, but applied
the graph constraint. Gkanatsios et al. [70] re-formulated the to the enclosing relationship box. mAP would penalize the
metric as Recallk @K(Rk @K). k = 1 is equivalent to prediction if that particular ground truth annotation does
‘graph constraints” and a larger k to “no graph constraints”, not exist. Therefore, it is a strict metric because we can’t
also expressed as ngRk @K . For n examined subject-object exhaustively annotate all possible relationships in an image.
pairs in an image, Recallk @K(Rk @K) keeps the top-k
predictions per pair and examines the K most confident out
5.3 Quantitative Performance
of nk total.
Given a set of ground truth triplets, GT , the image-level We present the quantitative performance on Recall@K met-
R@K is computed as: ric of some representative methods on several commonly
used datasets in Table 3-4. We preserve the respective task
R@K = |T opK ∩ GT | / |GT |, (16)
settings and tasks’ names for each dataset, though SGGen
where T opK is the top-K triplets extracted from the en- on VG150 are the same to the RelDet on others. ‡ denotes
tire image based on ranked predictions of a model [169]. the experimental results are under “no graph constraints”.
However, in the PredCLs setting, which is actually a simple By comparing Table 3 and Table 4, we notice that only
classification task, the R@K degenerates into the triplet- a few of the proposed methods have been simultaneously
level Recall@K (Rtr @K ). Rtr @K is similar to the top- verified on both VRD and VG150 datasets. The performance
K accuracy. Furthermore, Knyazev et al. [169] proposed of most methods on VG150 is better than that on VRD
weighted triplet Recall(wRtr @K ), which computes a recall dataset, because VG150 has been cleaned and enhanced.
at each triplet and reweights the average result based on the Experimental results on VG150 can better reflect the per-
frequency of the GT in the training set: formance of different methods, therefore, several recently
T proposed methods have adopted VG150 to compare their
performance metrics with other techniques.
X
wRtr @K = wt [rankt ≤ K], (17)
t Recently, two novel techniques i.e., SABRA [197] and
HET [195] have achieved SOTA performance for PhrDet
where T is the number of all test triplets, [·] is the Iverson
1 and RelDet on VRD, respectively. SABRA enhanced the
bracket, wt = (nt +1) P 1/(n t +1)∈[0,1]
and nt is the number of
t robustness of the training process of the proposed model
occurrences of the t-th triplet in the training set. It is friendly
by subdividing negative samples, while HET followed the
to those infrequent instances, since frequent triplets (with
intuitive perspective i.e., the more salient the object, the
high nt ) are downweighted proportionally. To speak for all
more important it would be for the scene graph.
predicates rather than very few trivial ones, Tang et al. [78]
On VG150, excellent performances have been achieved
and Chen et al. [31] proposed meanRecall@K(mR@K)
by using the Language Prior’s model, especially RiFa [109].
which retrieves each predicate separately then averages
In particular, RiFa has achieved good results on the unbal-
R@K for all predicates.
anced data distribution by mining the deep semantic infor-
Notably, there is an inconsistency in Recall’s definition
mation of the objects and relations in triplets. SGRN [53]
on the entire test set: whether it is a micro- or macro-Recall
generates the initial scene graph structure using the seman-
[70]. Let N be the number of testing images and GTi the
tic information, to ensure that its information transmission
ground-truth relationship annotations in image i. Then, hav-
process accepts the positive influence from the semantic
ing detected T Pi = T opKi ∩ GTi true positives in the image
PN
i |T P |
information. Theoretically, Commonsense Knowledge can
i, micro-Recall micro-averages these positives as PN i
greatly improve the performance, but in practice, several
i |GTi |
to reward correct predictions across dataset. Macro-Recall models that use Prior Knowledge have unsatisfactory per-
1
PN |T Pi |
computed as N i |GTi | macro-averages the detections in formance. We believe the main reason is the difficultly
terms of images. Early works use micro-Recall on VRD and to extract and use the effective knowledge information in
macro-Recall on VG150, but later works often use the two the scene graph generation model. Gb-net [95] has paid
types interchangeably and without consistency. attention to this problem, and achieved good results in
Zero-Shot Recall@K. Zero-shot relationship learning was PredDet and PhrDet by establishing connection between
proposed by Lu et al. [28] to evaluate the performance scene graph and Knowledge Graph, which can effectively
of detecting zero-shot relationships. Due to the long-tailed use the commonsense knowledge.
relationship distribution in the real world, it is a practical Due to the long tail effect of visual relationships, it is
setting to evaluate the extensibility of a model since it is hard to collect images for all the possible relationships. It is
difficult to build a dataset with every possible relationship. therefore crucial for a model to have the generalizability to
Besides, a single wRtr @K value can show zero or few-shot detect zero-shot relationships. VRD dataset contains 1,877
performance linearly aggregated for all n ≥ 0. relationships that only exist in the test set. Some researchers
TABLE 3: Performance summary of some representative methods on the VRD dataset.

Models | PredCls R@100/R@50 | PhrDet R@100/R@50 | RelDet R@100/R@50 | Year
LP [28] | 47.87/47.84 | 17.03/16.17 | 14.70/13.86 | 2016
VRL [110] | -/- | 22.60/21.37 | 20.79/18.19 | 2017
U+W+SF+L:S+T [90] | 55.16/55.16 | 24.03/23.14 | 21.34/19.17 | 2017
DR-Net [49] | 81.90/80.78 | 23.45/19.93 | 20.88/17.73 | 2017
ViP-CNN [26] | -/- | 27.91/22.78 | 20.01/17.32 | 2017
AP+C+CAT [57] | 53.59/53.59 | 25.56/24.04 | 23.52/20.35 | 2017
VTransE [68] | -/- | 22.42/19.42 | 15.20/14.07 | 2017
Cues [168] | -/- | 20.70/16.89 | 18.37/15.08 | 2017
Weakly-supervised [125] | -/46.80 | -/16.00 | -/14.10 | 2017
PPR-FCN [126] | 47.43/47.43 | 23.15/19.62 | 15.72/14.41 | 2017
Large VRU [15] | -/- | 39.66/32.90 | 32.63/26.98 | 2018
Interpretable SGG [11] | -/- | 41.25/33.29 | 32.55/26.67 | 2018
CDDN-VRD [103] | 93.76/87.57 | -/- | 26.14/21.46 | 2018
DSR [24] | 93.18/86.01 | -/- | 23.29/19.03 | 2018
Joint VSE [124] | -/- | 24.12/20.53 | 16.26/14.23 | 2018
Fo+Lm [50] | -/- | 23.95/22.67 | 18.33/17.40 | 2018
SG-CRF [56] | 50.47/49.16 | -/- | 25.48/24.98 | 2018
OSL [122] | 56.56/56.56 | 24.50/20.82 | 16.01/13.81 | 2018
F-Net [13] | -/- | 30.77/26.03 | 21.20/18.32 | 2018
Zoom-Net [72] | 50.69/50.69 | 28.09/24.82 | 21.41/18.92 | 2018
CAI+SCA-M [72] | 55.98/55.98 | 28.89/25.21 | 22.39/19.54 | 2018
VSA-Net [118] | 49.22/49.22 | 21.65/19.07 | 17.74/16.03 | 2018
MF-URLN [91] | 58.20/58.20 | 36.10/31.50 | 26.80/23.90 | 2019
LRNNTD [107] | -/- | 30.92/28.53 | 25.87/24.20 | 2019
KB-GAN [89] | -/- | 34.38/27.39 | 25.01/20.31 | 2019
RLM [114] | 57.19/57.19 | 39.74/33.20 | 31.15/26.55 | 2019
NMP [113] | 57.69/57.69 | -/- | 23.98/20.19 | 2019
MLA-VRD [63] | 95.05/90.18 | 28.12/23.36 | 24.91/20.54 | 2019
ATR-Net [70] | 58.40/58.40 | 34.63/29.74 | 24.87/22.83 | 2019
BLOCK [130] | 92.58/86.58 | 28.96/26.32 | 20.96/19.06 | 2019
MR-Net [131] | 61.19/61.19 | -/- | 17.58/16.71 | 2019
RelDN [129] | -/- | 36.42/31.34 | 28.62/25.29 | 2019
UVTransE [69] | -/26.49 | 18.44/13.07 | 16.78/11.00 | 2020
AVR [164] | 55.61/55.61 | 33.27/29.33 | 25.41/22.83 | 2020
GPS-Net [165] | -/63.40 | 39.20/33.80 | 31.70/27.80 | 2020
MemoryNet [198] | -/- | 34.90/29.80 | 27.90/24.30 | 2020
HET [195] | -/- | 42.94/35.47 | 24.88/22.42 | 2020
HOSE-Net [186] | -/- | 31.71/37.04 | 23.57/20.46 | 2020
SABRA [197] | -/- | 39.62/33.56 | 32.48/27.87 | 2020
NLGVRD‡ [85] | 92.65/84.92 | 47.92/42.29 | 22.22/20.81 | 2017
U+W+SF+L:S+T‡ [90] | 94.65/85.64 | 29.43/26.32 | 31.89/22.68 | 2017
Zoom-Net‡ [72] | 90.59/84.05 | 37.34/29.05 | 27.30/21.37 | 2018
CAI+SCA-M‡ [72] | 94.56/89.03 | 38.39/29.64 | 28.52/22.34 | 2018
LRNNTD‡ [107] | -/- | 41.28/32.29 | 34.93/27.09 | 2019
RLM‡ [114] | 96.48/90.00 | 46.03/36.79 | 37.35/30.22 | 2019
NMP‡ [113] | 96.61/90.61 | -/- | 27.50/21.50 | 2019
ATR-Net‡ [70] | 96.97/91.00 | 41.01/33.20 | 31.94/26.04 | 2019
RelDN‡ [129] | -/- | 42.12/34.45 | 33.91/28.15 | 2019
AVR‡ [164] | 95.72/90.73 | 41.36/34.51 | 32.96/27.35 | 2020
MemoryNet‡ [198] | -/- | 39.80/32.10 | 32.40/26.50 | 2020
HET‡ [195] | -/- | 43.05/35.47 | 31.81/26.88 | 2020
HOSE-Net‡ [186] | -/- | 36.16/28.89 | 27.36/22.13 | 2020
SABRA‡ [197] | -/- | 45.29/36.62 | 37.71/30.71 | 2020

TABLE 4: Performance summary of some representative methods on the VG150 dataset.

Models | PredCls R@100/R@50 | PhrDet R@100/R@50 | RelDet R@100/R@50 | Year
IMP [54] | 53.08/44.75 | 24.38/21.72 | 4.24/3.44 | 2017
Px2graph [100] | 86.40/82.00 | 38.40/35.70 | 18.80/15.50 | 2017
Interpretable SGG [11] | 68.30/68.30 | 36.70/36.70 | 32.50/28.10 | 2018
IK-Re [86] | 77.60/67.71 | 42.74/35.55 | -/- | 2018
SK-Re [86] | 77.43/67.42 | 42.25/35.07 | -/- | 2018
TFR [106] | 58.30/51.90 | 26.60/24.30 | 6.00/4.80 | 2018
MotifNet [42] | 67.10/65.20 | 36.50/35.80 | 30.30/27.20 | 2018
Graph R-CNN [58] | 59.10/54.20 | 31.60/29.60 | 13.70/11.40 | 2018
LinkNet [45] | 68.50/67.00 | 41.70/41.00 | 30.10/27.40 | 2018
GPI [98] | 66.90/65.10 | 38.80/36.50 | -/- | 2018
KERN [31] | 67.60/65.80 | 37.40/36.70 | 29.80/27.1 | 2019
SGRN [53] | 66.40/64.20 | 39.70/38.60 | 35.40/32.30 | 2019
Mem+Mix+Att [115] | 57.90/53.20 | 29.50/27.80 | 13.90/11.40 | 2019
VCTREE [78] | 68.10/66.40 | 38.80/38.10 | 31.30/27.90 | 2019
CMAT [116] | 68.10/66.40 | 39.80/39.00 | 31.20/27.90 | 2019
VRasFunctions [61] | 57.21/56.65 | 24.66/23.71 | 13.45/13.18 | 2019
PANet [74] | 67.90/66.00 | 41.80/40.90 | 29.90/26.90 | 2019
ST+GSA+RI [62] | 61.30/56.60 | 40.40/38.20 | -/- | 2019
Attention [81] | 67.10/65.00 | 37.10/36.30 | 29.50/26.60 | 2019
RelDN [129] | 68.40/68.40 | 36.80/36.80 | 32.70/28.30 | 2019
Large VRU [15] | 68.40/68.40 | 36.70/36.70 | 32.50/27.9 | 2019
GB-NET [95] | 68.20/66.60 | 38.80/38.00 | 30.00/26.40 | 2020
UVTransE [69] | 67.30/65.30 | 36.60/35.90 | 33.60/30.10 | 2020
GPS-Net [165] | 69.70/69.70 | 42.30/42.30 | 33.20/28.90 | 2020
RiFa [109] | 88.35/80.64 | 44.38/37.62 | 26.68/20.86 | 2020
RONNIE [119] | 69.00/65.00 | 37.00/36.20 | -/- | 2020
DG-PGNN [166] | 73.00/70.10 | 40.80/39.50 | 33.10/32.10 | 2020
NODIS [132] | 69.10/67.20 | 41.50/40.60 | 31.50/28.10 | 2020
HCNet [167] | 68.80/66.40 | 37.30/36.60 | 31.20/28.00 | 2020
Self-supervision [182] | 68.87/68.85 | 37.03/37.01 | 32.56/28.28 | 2020
MemoryNet [198] | 69.30/69.20 | 37.10/37.10 | 32.40/27.60 | 2020
HET [195] | 68.10/66.30 | 37.30/36.60 | 30.90/27.50 | 2020
HOSE-Net [186] | 69.20/66.70 | 37.40/36.30 | 33.30/28.90 | 2020
PAIL [196] | 69.40/67.70 | 40.20/39.40 | 32.70/29.40 | 2020
BL+SO+KT+FC [187] | 68.80/66.20 | 38.30/37.50 | 31.40/28.20 | 2020
Interpretable SGG‡ [11] | 97.70/93.70 | 50.80/48.90 | 36.40/30.10 | 2018
MotifNet‡ [42] | 88.30/81.10 | 47.70/44.50 | 35.80/30.50 | 2018
GPI‡ [98] | 88.20/80.80 | 50.80/45.50 | -/- | 2018
KERN‡ [31] | 88.90/81.90 | 49.00/45.90 | 35.80/30.90 | 2019
CMAT‡ [116] | 90.10/83.20 | 52.00/48.60 | 36.80/31.60 | 2019
PANet‡ [74] | 89.70/82.60 | 55.20/51.30 | 36.30/31.10 | 2019
RelDN‡ [129] | 97.80/93.80 | 50.80/48.90 | 36.70/30.40 | 2019
GB-NET‡ [95] | 90.50/83.60 | 51.10/47.70 | 35.10/29.40 | 2020
HOSE-Net‡ [186] | 89.20/81.10 | 48.10/44.20 | 36.30/30.50 | 2020
BL+SO+KT+FC‡ [187] | 90.20/82.50 | 50.20/46.20 | 36.50/31.40 | 2020
Compared with the traditional Recall, meanRecall calculates a Recall rate for each relation and then averages over relations, so it better describes the performance of a model on each individual relation. Table 6 shows the meanRecall performance of several typical models. In Table 6, IMP's meanRecall performance in detecting tail relationships is not ideal. In IMP+, due to the introduction of a bidirectional LSTM to extract the features of each object, more attention is paid to the object itself, so there is an improvement. The core idea of VCTREE comes from MotifNet, but it improves the strategy of information transmission by changing the chain structure to a tree structure, making the information transmission between objects more directional. MemoryNet [198], which focuses on the semantic overlap between low- and high-frequency relationships, has achieved SOTA results on both PredCls and SGGen.

Tables 7 and 8 show the performance of several ST-SGG methods on the ImageNet-VidVRD and VidOR datasets. Evaluation is based on two tasks, namely Relation Detection and Relation Tagging. Because of its large size, VidOR presents many challenges to relation detection and tagging. ST-SGG is much more complex than 2D SGG because additional steps, such as object tracking, temporal segmentation, and merging the relationships detected in different segments, are involved. It is expected that ST-SGG performance will improve as more researchers contribute.

A fair comparison between 3D SGG methods cannot currently be undertaken due to the lack of a unified definition of 3D scene graphs. With rapid advances in 3D object detection, segmentation and description, 3D SGG should be able to provide unified tasks, evaluation metrics, as well as quantitative performances in the near future.
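Since meanRecall@K (Table 6) is the per-predicate counterpart of the Recall metrics discussed earlier, a minimal sketch of its computation is given below. This is only an illustration under the same assumptions about matched triplets, not the official evaluation code of any benchmark.

```python
# Illustrative sketch of meanRecall@K: Recall@K is computed per predicate
# category and then averaged, so head and tail predicates contribute equally.
from collections import defaultdict
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject class, predicate class, object class)

def mean_recall_at_k(gt: List[Set[Triplet]], topk: List[List[Triplet]]) -> float:
    tp, total = defaultdict(int), defaultdict(int)
    for gt_i, pred_i in zip(gt, topk):
        hits = gt_i & set(pred_i)
        for _, pred, _ in gt_i:
            total[pred] += 1                    # ground-truth instances per predicate
        for _, pred, _ in hits:
            tp[pred] += 1                       # recalled instances per predicate
    recalls = [tp[p] / total[p] for p in total]
    return sum(recalls) / len(recalls) if recalls else 0.0
```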
TABLE 5: Performance summary of some representative methods for zero-shot visual relationship detection on the VRD dataset.

Models | PredCls R@100/R@50 | PhrDet R@100/R@50 | RelDet R@100/R@50 | Year
LP [28] | 8.45/8.45 | 3.75/3.36 | 3.52/3.13 | 2016
VRL [110] | -/- | 10.31/9.17 | 8.52/7.94 | 2017
Cues [168] | -/- | 15.23/10.86 | 13.43/9.67 | 2017
VTransE [68] | -/- | 3.51/2.65 | 2.14/1.71 | 2017
Weakly-supervised [125] | -/19.00 | -/6.90 | -/6.70 | 2017
U+W+SF+L:S [90] | 16.98/16.98 | 10.89/10.44 | 9.14/8.89 | 2017
AP+C+CAT [57] | 16.37/16.37 | 11.30/10.78 | 10.26/9.54 | 2017
PPR-FCN [126] | -/- | 8.22/6.93 | 6.29/5.68 | 2017
DSR [24] | 79.81/60.90 | -/- | 9.20/5.25 | 2018
CDDN-VRD [103] | 84.00/67.66 | -/- | 10.29/6.40 | 2018
Joint VSE [124] | -/- | 6.16/5.05 | 5.73/4.79 | 2018
SG-CRF [56] | 21.22/- | 6.70/- | 5.22/- | 2018
MF-URLN [91] | 26.90/26.90 | 7.90/5.90 | 5.50/4.30 | 2019
MLA-VRD [63] | 88.96/73.65 | 13.84/8.43 | 12.81/8.08 | 2019
U+W+SF+L:S‡ [90] | 74.65/54.20 | 17.24/13.01 | 16.15/12.31 | 2017

TABLE 6: Mean Recall performance summary of some typical methods on the VG150 dataset.

Models | PredCls mR@50/mR@100 | SGCls mR@50/mR@100 | SGGen mR@50/mR@100 | Year
IMP [54] | 6.1/8.0 | 3.1/3.8 | 0.6/0.9 | 2017
IMP+ [42] | 9.8/10.5 | 5.8/6.0 | 3.8/4.8 | 2018
FREQ [42] | 13.0/16.0 | 7.2/8.5 | 6.1/7.1 | 2018
MotifNet [42] | 14.0/15.3 | 7.7/8.2 | 5.7/6.6 | 2018
KERN [31] | 17.7/19.2 | 9.4/10 | 6.4/7.3 | 2019
VCTREE-SL [78] | 17.0/18.5 | 9.8/10.5 | 6.7/7.7 | 2019
VCTREE-HL [78] | 17.9/19.4 | 10.1/10.8 | 6.9/8 | 2019
GPS-Net [165] | 21.3/22.8 | 11.8/12.6 | 8.7/9.8 | 2020
MemoryNet [198] | 22.6/22.7 | 10.9/11 | 7.4/9 | 2020
PAIL [196] | 19.2/20.9 | 10.9/11.6 | 7.7/8.8 | 2020
GB-NET [95] | 19.3/20.9 | 9.6/10.2 | 6.1/7.3 | 2020
GB-NET-β [95] | 22.1/24.0 | 12.7/13.4 | 7.1/8.5 | 2020

TABLE 7: Performance for standard video relation detection and video relation tagging on the ImageNet-VidVRD dataset [171].

Models | Relation Detection R@50/R@100/mAP | Relation Tagging P@1/P@5/P@10 | Year
VP [43] | 0.89/1.41/1.01 | 36.50/25.55/19.20 | 2011
Lu's-V [28] | 0.99/1.80/2.37 | 20.00/12.60/9.55 | 2016
Lu's [28] | 1.10/2.23/2.40 | 20.50/16.30/14.05 | 2016
VTransE [68] | 0.72/1.45/1.23 | 15.00/10.00/7.65 | 2017
VidVRD [101] | 5.54/6.37/8.58 | 43.00/28.90/20.80 | 2017
GSTEG [102] | 7.05/8.67/9.52 | 51.50/39.50/28.23 | 2019
VRD-GCN [99] | 8.07/9.33/16.26 | 57.50/41.00/28.50 | 2019
VRD-STGC [171] | 11.21/13.69/18.38 | 60.00/43.10/32.24 | 2020

TABLE 8: Performance for standard video relation detection and video relation tagging on the VidOR dataset [171].

Models | Relation Detection R@50/R@100/mAP | Relation Tagging P@1/P@5 | Year
RELAbuilder [141] | 1.58/1.85/1.47 | 33.05/35.27 | 2019
OTD+CAI [143] | 6.19/8.16/5.65 | 48.31/38.49 | 2019
OTD+GSTEG [143] | 6.40/8.43/5.58 | 51.20/37.26 | 2019
MAGUS.Gamma [143] | 6.89/8.83/6.56 | 51.20/40.73 | 2019
VRD-STGC [171] | 8.21/9.90/6.85 | 48.92/36.78 | 2020

6 CHALLENGES & FUTURE RESEARCH DIRECTIONS

6.1 Challenges

There is no doubt that there are many excellent SGG models which have achieved good performance on the standard image datasets, such as VRD and VG150. However, several challenges have still not been well resolved.

First, both the number of objects in the real world and the number of categories of relations are very large, but reasonable and meaningful relationships are scarce. Therefore, detecting all individual objects first and then classifying all pairs would be inefficient. Moreover, classification requires a limited number of object categories, which does not scale with real-world images. Several works [15], [26], [49], [53], [58], [84], [126], [193] have helped to filter out a set of object pairs with a low probability of interaction from the set of detected objects. An effective proposal network will definitely reduce the learning complexity and computational cost for the subsequent predicate classification, thus improving the accuracy of relationship detection.

The second main challenge comes from the long-tailed distribution of visual relationships. Since an interaction occurs between two objects, there is a greater skew of rare relationships, as object co-occurrence is infrequent in a real-world scenario. An uneven distribution makes it difficult for the model to fully understand the properties of some rare relationships and triplets. For example, if a model is trained to predict "on" 1,000 times more than "standing on", then, during the test phase, "on" is more likely to prevail over "standing on". This phenomenon, where the model is more likely to predict a simple and coarse relation than the accurate one, is called Biased Scene Graph Generation. Under this condition, even though the model can output a reasonable predicate, it is too coarse and obscure to describe the scene. However, for several downstream tasks, an accurate and informative pair-wise relation is undoubtedly the most fundamental requirement. Therefore, to perform sensible graph reasoning, we need to distinguish the more fine-grained relationships from the ostensibly probable but trivial ones, which is generally regarded as unbiased scene graph generation. A lot of works [24], [56], [63], [68], [90], [103], [123], [125], [126], [127], [168] have provided solutions for zero-shot relationship learning. Some researchers have recently proposed unbiased SGG [133], [134], [183], [188], [189] to make the tail classes receive more attention in a coarse-to-fine mode.
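One simple way to "make the tail classes receive more attention" is to reweight the predicate classification loss by class frequency; the sketch below is purely illustrative (a hypothetical example, not the method of any particular surveyed paper) and uses class-balanced weights in PyTorch.

```python
# Illustrative (hypothetical) re-weighted predicate loss for long-tailed SGG:
# rare predicates get larger weights so training is not dominated by head
# classes such as "on".
import torch
import torch.nn.functional as F

def reweighted_predicate_loss(logits: torch.Tensor,   # [B, C] predicate logits
                              labels: torch.Tensor,   # [B] ground-truth predicate ids
                              class_counts,           # [C] training frequencies per predicate
                              beta: float = 0.999) -> torch.Tensor:
    counts = torch.as_tensor(class_counts, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, counts))  # class-balanced weights
    weights = weights / weights.sum() * weights.numel()       # normalize around 1
    return F.cross_entropy(logits, labels, weight=weights)
```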
The third challenge is that the visual appearance of the same relation varies greatly from scene to scene (Fig. 3a and 3d). This makes the feature extraction phase more challenging. As we have described in Section 3.1.2, a great deal of methods focus on semantic features, trying to make up for the lack of visual features. However, we have emphasized that visual relationships are incidental and scene-specific. This requires us to think from the bottom up and try to extract more discriminative visual features.

The fourth challenge is the lack of clarity/consensus in the definition of the relationships. It is always difficult to give mutually exclusive definitions to the predicate categories, as opposed to objects, which have clear meanings. As a result, one relationship can be labelled with different but reasonable predicates, making the datasets noisy and the general SGG task ill-posed. Providing a well-defined relationship set is therefore one of the key challenges of the SGG task.

The fifth challenge is the evaluation metric. Even though many evaluation metrics are used to assess the performance of the proposed networks, and Recall@K or meanRecall@K are common and widely adopted, none of them can provide perfect statistics on how well a model performs on the SGG task. When Recall@50 equals 100, does that mean that the model generates the perfect scene graph for an image? Of course not. The existing evaluation metrics only reveal relative performance, especially at the current research stage. As the research on SGG progresses, the evaluation metrics and benchmark datasets will pose a great challenge.

6.2 Opportunities

The community has published hundreds of scene graph models and has obtained a wealth of research results. We think there are several avenues for future work. Researchers will be motivated to explore more models as a result of the above challenges. Besides, on the one hand, from the learning point of view, building a large dataset with fine-grained labels and accurate annotations is necessary and significant. Such a dataset should contain as many scenes as possible, preferably constructed by computer vision experts. The models trained on such a dataset will have better performance on visual semantics and develop a broader understanding of our visual world. However, this is a very challenging and expensive task. On the other hand, from the application point of view, we can design models by subdividing the scene to reduce the imbalance of the relationship distribution. Obviously, the categories and probability distributions of visual relationships are different in different scenarios; of course, even the types of objects are different. As a result, we can design relationship detection models for different scenarios and employ ensemble learning methods to promote scene graph generation applications.

Another area of research is 3D scene graphs. An initial step is to define an effective and unified 3D scene graph structure, along with what information it should encode. A 2D image is a two-dimensional projection of a 3D world scene taken from a specific viewpoint, and it is this specific viewpoint that makes some descriptions of spatial relationships in 2D images meaningful. Taking the triplet ⟨women, is behind, fire hydrant⟩ in Fig. 1 as an example, the relation "is behind" makes sense because of the viewpoint. But how can a relation be defined as "is behind" in 3D scenes without a given viewpoint? Therefore, the challenge is how to define such spatial semantic relationships in 3D scenes without simply resorting to 2.5D scenes (for example, RGB-D data captured from a specific viewpoint). Armeni et al. augment the basic scene graph structure with essential 3D information and generate a 3D scene graph which extends the scene graph to 3D space and grounds semantic information there [44], [147]. However, their proposed structure representation does not have expansibility and generality. Second, because 3D information can be grounded in many storage formats, which are fragmented into specific types based on the visual modality (e.g., RGB-D, point clouds, 3D mesh/CAD models, etc.), the presentation and extraction of 3D semantic information has technological challenges.

7 CONCLUSION

This paper provides a comprehensive survey of the developments in the field of scene graph generation using deep learning techniques. We first introduced the representative works on 2D scene graphs, spatio-temporal scene graphs and 3D scene graphs in different sections, respectively. Furthermore, we provided a summary of some of the most widely used datasets for visual relationship and scene graph generation, which are grouped into 2D images, video, and 3D representations, respectively. The performance of different approaches on different datasets is also compared. Finally, we discussed the challenges, problems and opportunities in scene graph generation research. We believe this survey can promote more in-depth ideas on SGG.

REFERENCES

[1] J. Johnson, R. Krishna, M. Stark, L. J. Li, D. A. Shamma, M. S. Bernstein, and F. F. Li, “Image retrieval using scene graphs,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3668–3678.
[2] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, “Relation distillation networks for video object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7023–7032.
[3] L. Gao, B. Wang, and W. Wang, “Image captioning with scene-graph based semantic concepts,” in Proceedings of the 2018 10th International Conference on Machine Learning and Computing, 2018, pp. 225–229.
[4] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attention network for visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10313–10322.
[5] C. Zhang, W.-L. Chao, and D. Xuan, “An empirical study on leveraging scene graphs for visual question answering,” arXiv preprint arXiv:1907.12133, 2019.
[6] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
[7] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning, “Generating semantically precise scene graphs from textual descriptions for improved image retrieval,” in Proceedings of the Fourth Workshop on Vision and Language, 2015, pp. 70–80.
[8] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1219–1228.
[9] G. Mittal, S. Agrawal, A. Agarwal, S. Mehta, and T. Marwah, “Interactive image generation using scene graphs,” arXiv preprint arXiv:1905.03743, 2019.
[10] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
[11] J. Zhang, K. Shih, A. Tao, B. Catanzaro, and A. Elgammal, “An interpretable model for scene graph generation,” arXiv preprint arXiv:1811.09543, 2018.
[12] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10685–10694.
[13] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: an efficient subgraph-based framework for scene graph generation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 335–351.
[14] G. Gkioxari, R. Girshick, P. Dollár, and K. He, “Detecting and recognizing human-object interactions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8359–8367.
[15] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, [36] Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure inference
and M. Elhoseiny, “Large-scale visual relationship understand- net: Object detection using scene-level context and instance-level
ing,” in Proceedings of the AAAI Conference on Artificial Intelligence, relationships,” in Proceedings of the IEEE conference on computer
vol. 33, 2019, pp. 9185–9194. vision and pattern recognition, 2018, pp. 6985–6994.
[16] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu, “Learning human- [37] S. A. Taghanaki, K. Abhishek, J. P. Cohen, J. Cohen-Adad, and
object interactions by graph parsing neural networks,” in Proceed- G. Hamarneh, “Deep semantic segmentation of natural and
ings of the European Conference on Computer Vision (ECCV), 2018, medical images: A review,” Artificial Intelligence Review, pp. 1–42,
pp. 401–417. 2020.
[17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards [38] J. Li and J. Z. Wang, “Automatic linguistic indexing of pictures
real-time object detection with region proposal networks,” in by a statistical modeling approach,” IEEE Transactions on pattern
Advances in neural information processing systems, 2015, pp. 91–99. analysis and machine intelligence, vol. 25, no. 9, pp. 1075–1088, 2003.
[18] D.-J. Kim, J. Choi, T.-H. Oh, and I. S. Kweon, “Dense relational [39] R. Grzeszick and G. A. Fink, “Zero-shot object prediction us-
captioning: Triple-stream networks for relationship-based cap- ing semantic scene knowledge,” arXiv preprint arXiv:1604.07952,
tioning,” in Proceedings of the IEEE Conference on Computer Vision 2016.
and Pattern Recognition, 2019, pp. 6271–6280. [40] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert,
[19] J. Redmon and A. Farhadi, “Yolov3: An incremental improve- “An empirical study of context in object detection,” in 2009 IEEE
ment,” arXiv preprint arXiv:1804.02767, 2018. Conference on computer vision and Pattern Recognition, 2009, pp.
[20] T. Wang, R. M. Anwer, M. H. Khan, F. S. Khan, Y. Pang, L. Shao, 1271–1278.
and J. Laaksonen, “Deep contextual attention for human-object
[41] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene
interaction detection,” in Proceedings of the IEEE International
graph generation from objects, phrases and region captions,” in
Conference on Computer Vision, 2019, pp. 5694–5702.
Proceedings of the IEEE International Conference on Computer Vision,
[21] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, 2017, pp. 1261–1270.
“Encoder-decoder with atrous separable convolution for seman-
tic image segmentation,” in Proceedings of the European conference [42] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs:
on computer vision (ECCV), 2018, pp. 801–818. Scene graph parsing with global context,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018,
[22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional
pp. 5831–5840.
networks for biomedical image segmentation,” in International
Conference on Medical image computing and computer-assisted inter- [43] M. A. Sadeghi and A. Farhadi, “Recognition using visual
vention, 2015, pp. 234–241. phrases,” in Proceedings of the IEEE Conference on Computer Vision
[23] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, and Pattern Recognition, 2011, pp. 1745–1752.
Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance [44] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and
segmentation,” in Proceedings of the IEEE conference on computer S. Savarese, “3d scene graph: A structure for unified semantics,
vision and pattern recognition, 2019, pp. 4974–4983. 3d space, and camera,” in Proceedings of the IEEE International
[24] K. Liang, Y. Guo, H. Chang, and X. Chen, “Visual relationship Conference on Computer Vision, 2019, pp. 5664–5673.
detection with deep structural ranking,” in Proceedings of the [45] S. Woo, D. Kim, D. Cho, and I. S. Kweon, “Linknet: Relational
AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018. embedding for scene graph,” in Advances in Neural Information
[25] N. Xu, A.-A. Liu, Y. Wong, W. Nie, Y. Su, and M. Kankanhalli, Processing Systems, 2018, pp. 560–570.
“Scene graph inference via multi-scale context modeling,” IEEE [46] S. Sharifzadeh, M. Berrendorf, and V. Tresp, “Improving vi-
Transactions on Circuits and Systems for Video Technology, vol. 31, sual relation detection using depth maps,” arXiv preprint
no. 3, pp. 1031–1041, 2020. arXiv:1905.00966, 2019.
[26] Y. Li, W. Ouyang, X. Wang, and X. Tang, “Vip-cnn: Visual phrase [47] K. Kato, Y. Li, and A. Gupta, “Compositional learning for human
guided convolutional neural network,” in Proceedings of the IEEE object interaction,” in Proceedings of the European Conference on
Conference on Computer Vision and Pattern Recognition, 2017, pp. Computer Vision (ECCV), 2018, pp. 234–251.
1347–1356. [48] T. Nagarajan, C. Feichtenhofer, and K. Grauman, “Grounded
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning human-object interaction hotspots from video,” in Proceedings
for image recognition,” in Proceedings of the IEEE conference on of the IEEE International Conference on Computer Vision, 2019, pp.
computer vision and pattern recognition, 2016, pp. 770–778. 8688–8697.
[28] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relation- [49] B. Dai, Y. Zhang, and D. Lin, “Detecting visual relationships with
ship detection with language priors,” in European Conference on deep relational networks,” in Proceedings of the IEEE conference on
Computer Vision, 2016, pp. 852–869. computer vision and Pattern recognition, 2017, pp. 3076–3086.
[29] F. Chollet, “Xception: Deep learning with depthwise separable [50] Y. Zhu and S. Jiang, “Deep structured learning for visual relation-
convolutions,” in Proceedings of the IEEE conference on computer ship detection,” in Proceedings of the AAAI Conference on Artificial
vision and pattern recognition, 2017, pp. 1251–1258. Intelligence, vol. 32, no. 1, 2018.
[30] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, [51] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual
S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual attention,” in Advances in neural information processing systems,
genome: Connecting language and vision using crowdsourced 2014, pp. 2204–2212.
dense image annotations,” International Journal of Computer Vision,
[52] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
vol. 123, no. 1, pp. 32–73, 2017.
lation by jointly learning to align and translate,” arXiv preprint
[31] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded
arXiv:1409.0473, 2014.
routing network for scene graph generation,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2019, [53] W. Liao, C. Lan, W. Zeng, M. Y. Yang, and B. Rosenhahn,
pp. 6163–6171. “Exploring the semantics for visual relationship detection,” arXiv
[32] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic preprint arXiv:1904.02104, 2019.
segmentation,” in Proceedings of the IEEE Conference on Computer [54] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph gener-
Vision and Pattern Recognition, 2019, pp. 9404–9413. ation by iterative message passing,” in Proceedings of the IEEE
[33] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, Conference on Computer Vision and Pattern Recognition, 2017, pp.
“Densely connected convolutional networks,” in Proceedings of 5410–5419.
the IEEE conference on computer vision and pattern recognition, 2017, [55] N. Dhingra, F. Ritter, and A. Kunz, “Bgt-net: Bidirectional gru
pp. 4700–4708. transformer network for scene graph generation,” in Proceedings
[34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
inception-resnet and the impact of residual connections on learn- nition, 2021, pp. 2150–2159.
ing,” in Proceedings of the AAAI Conference on Artificial Intelligence, [56] W. Cong, W. Wang, and W.-C. Lee, “Scene graph generation via
vol. 31, no. 1, 2017. conditional random fields,” arXiv preprint arXiv:1811.08075, 2018.
[35] S. Yang, G. Li, and Y. Yu, “Cross-modal relationship inference [57] B. Zhuang, L. Liu, C. Shen, and I. Reid, “Towards context-
for grounding referring expressions,” in Proceedings of the IEEE aware interaction recognition for visual relationship detection,”
Conference on Computer Vision and Pattern Recognition, 2019, pp. in Proceedings of the IEEE International Conference on Computer
4145–4154. Vision, 2017, pp. 589–598.
[58] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for [79] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic
scene graph generation,” in Proceedings of the European conference representations from tree-structured long short-term memory
on computer vision (ECCV), 2018, pp. 670–685. networks,” arXiv preprint arXiv:1503.00075, 2015.
[59] T. N. Kipf and M. Welling, “Semi-supervised classification with [80] H. Zhou, C. Hu, C. Zhang, and S. Shen, “Visual relation-
graph convolutional networks,” arXiv preprint arXiv:1609.02907, ship recognition via language and position guided attention,”
2016. in ICASSP 2019-2019 IEEE International Conference on Acoustics,
[60] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Speech and Signal Processing (ICASSP), 2019, pp. 2097–2101.
and Y. Bengio, “Graph attention networks,” arXiv preprint [81] L. Zhang, S. Zhang, P. Shen, G. Zhu, S. Afaq Ali Shah, and
arXiv:1710.10903, 2017. M. Bennamoun, “Relationship detection based on object semantic
[61] A. Dornadula, A. Narcomey, R. Krishna, M. Bernstein, and F.-F. inference and attention mechanisms,” in Proceedings of the 2019 on
Li, “Visual relationships as functions: Enabling few-shot scene International Conference on Multimedia Retrieval, 2019, pp. 68–72.
graph prediction,” in Proceedings of the IEEE International Confer- [82] T. Fukuzawa, “A problem reduction approach for visual relation-
ence on Computer Vision Workshops, 2019, pp. 0–0. ships detection,” arXiv preprint arXiv:1809.09828, 2018.
[62] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo, “Attentive relational [83] Y. Zhu, S. Jiang, and X. Li, “Visual relationship detection with
networks for mapping images to scene graphs,” in Proceedings object spatial distribution,” in 2017 IEEE International Conference
of the IEEE Conference on Computer Vision and Pattern Recognition, on Multimedia and Expo (ICME), 2017, pp. 379–384.
2019, pp. 3957–3966. [84] J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal,
[63] S. Zheng, S. Chen, and Q. Jin, “Visual relation detection with “Relationship proposal networks,” in Proceedings of the IEEE
multi-level attention,” in Proceedings of the 27th ACM International Conference on Computer Vision and Pattern Recognition, 2017, pp.
Conference on Multimedia, 2019, pp. 121–129. 5678–5686.
[64] P. Krähenbühl and V. Koltun, “Efficient inference in fully con- [85] W. Liao, B. Rosenhahn, L. Shuai, and M. Ying Yang, “Natural
nected crfs with gaussian edge potentials,” in Advances in neural language guided visual relationship detection,” in Proceedings of
information processing systems, 2011, pp. 109–117. the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2019, pp. 0–0.
[65] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-
works for semantic segmentation,” in Proceedings of the IEEE [86] F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, “Visual re-
conference on computer vision and pattern recognition, 2015, pp. lationship detection based on guided proposals and semantic
3431–3440. knowledge distillation,” in 2018 IEEE International Conference on
Multimedia and Expo (ICME), 2018, pp. 1–6.
[66] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object
parsing with graph lstm,” in European Conference on Computer [87] S. Baier, Y. Ma, and V. Tresp, “Improving visual relationship
Vision, 2016, pp. 125–143. detection using semantic modeling of scene descriptions,” in
International Semantic Web Conference, 2017, pp. 53–68.
[67] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su,
[88] J. Duan, W. Min, D. Lin, J. Xu, and X. Xiong, “Multimodal
D. Du, C. Huang, and P. H. Torr, “Conditional random fields as
graph inference network for scene graph generation,” Applied
recurrent neural networks,” in Proceedings of the IEEE international
Intelligence, no. 5, 2021.
conference on computer vision, 2015, pp. 1529–1537.
[89] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling, “Scene graph
[68] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual transla-
generation with external knowledge and image reconstruction,”
tion embedding network for visual relation detection,” in Proceed-
in Proceedings of the IEEE Conference on Computer Vision and Pattern
ings of the IEEE conference on computer vision and pattern recognition,
Recognition, 2019, pp. 1969–1978.
2017, pp. 5532–5540.
[90] R. Yu, A. Li, V. I. Morariu, and L. S. Davis, “Visual relationship
[69] Z.-S. Hung, A. Mallya, and S. Lazebnik, “Contextual translation detection with internal and external linguistic knowledge distilla-
embedding for visual relationship detection and scene graph tion,” in Proceedings of the IEEE international conference on computer
generation,” IEEE Transactions on Pattern Analysis and Machine vision, 2017, pp. 1974–1982.
Intelligence, 2020.
[91] Y. Zhan, J. Yu, T. Yu, and D. Tao, “On exploring undetermined
[70] N. Gkanatsios, V. Pitsikalis, P. Koutras, and P. Maragos, relationships for visual relationship detection,” in Proceedings of
“Attention-translation-relation network for scalable scene graph the IEEE Conference on Computer Vision and Pattern Recognition,
generation,” in Proceedings of the IEEE International Conference on 2019, pp. 5128–5137.
Computer Vision Workshops, 2019, pp. 0–0. [92] A. Bl, Z. B. Yi, and A. Xl, “Atom correlation based graph propa-
[71] N. Gkanatsios, V. Pitsikalis, P. Koutras, A. Zlatintsi, and P. Mara- gation for scene graph generation,” Pattern Recognition, 2021.
gos, “Deeply supervised multimodal attentional translation em- [93] Y. Yao, A. Zhang, X. Han, M. Li, and M. Sun, “Visual distant
beddings for visual relationship detection,” in 2019 IEEE Interna- supervision for scene graph generation,” 2021.
tional Conference on Image Processing (ICIP), 2019, pp. 1840–1844.
[94] J. Yu, Y. Chai, Y. Wang, Y. Hu, and Q. Wu, “Cogtree: Cognition
[72] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and tree loss for unbiased scene graph generation,” in Thirtieth Inter-
C. Change Loy, “Zoom-net: Mining deep feature interactions for national Joint Conference on Artificial Intelligence IJCAI-21, 2021.
visual relationship recognition,” in Proceedings of the European [95] A. Zareian, S. Karaman, and S.-F. Chang, “Bridging knowledge
Conference on Computer Vision (ECCV), 2018, pp. 322–338. graphs to generate scene graphs,” arXiv preprint arXiv:2001.02314,
[73] D. Shin and I. Kim, “Deep image understanding using multilay- 2020.
ered contexts,” Mathematical Problems in Engineering, vol. 2018, [96] A. Zareian, Z. Wang, H. You, and S.-F. Chang, “Learning visual
2018. commonsense for robust scene graph generation,” in Computer
[74] Y. Chen, Y. Wang, Y. Zhang, and Y. Guo, “Panet: A context based Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
predicate association network for scene graph generation,” in 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 642–
2019 IEEE International Conference on Multimedia and Expo (ICME), 657.
2019, pp. 508–513. [97] S. Abdelkarim, P. Achlioptas, J. Huang, B. Li, K. Church, and
[75] K. Masui, A. Ochiai, S. Yoshizawa, and H. Nakayama, “Recurrent M. Elhoseiny, “Long-tail visual relationship recognition with
visual relationship recognition with triplet unit for diversity,” a visiolinguistic hubless loss,” arXiv preprint arXiv:2004.00436,
International Journal of Semantic Computing, vol. 12, no. 04, pp. 2020.
523–540, 2018. [98] R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson,
[76] Y. Dai, C. Wang, J. Dong, and C. Sun, “Visual relationship detec- “Mapping images to scene graphs with permutation-invariant
tion based on bidirectional recurrent neural network,” Multimedia structured prediction,” Advances in Neural Information Processing
Tools and Applications, pp. 1–17, 2019. Systems, vol. 31, pp. 7211–7221, 2018.
[77] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation [99] X. Qian, Y. Zhuang, Y. Li, S. Xiao, S. Pu, and J. Xiao, “Video
learning on large graphs,” in Advances in neural information pro- relation detection with spatio-temporal graph,” in Proceedings of
cessing systems, 2017, pp. 1024–1034. the 27th ACM International Conference on Multimedia, 2019, pp. 84–
[78] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu, “Learning to com- 93.
pose dynamic tree structures for visual contexts,” in Proceedings [100] A. Newell and J. Deng, “Pixels to graphs by associative embed-
of the IEEE Conference on Computer Vision and Pattern Recognition, ding,” in Advances in neural information processing systems, 2017,
2019, pp. 6619–6628. pp. 2171–2180.
[101] X. Shang, T. Ren, J. Guo, H. Zhang, and T.-S. Chua, “Video visual [122] L. Zhou, J. Zhao, J. Li, L. Yuan, and J. Feng, “Object relation detec-
relation detection,” in Proceedings of the 25th ACM international tion based on one-shot learning,” arXiv preprint arXiv:1807.05857,
conference on Multimedia, 2017, pp. 1300–1308. 2018.
[102] Y.-H. H. Tsai, S. Divvala, L.-P. Morency, R. Salakhutdinov, and [123] J. Peyre, I. Laptev, C. Schmid, and J. Sivic, “Detecting unseen
A. Farhadi, “Video relationship reasoning using gated spatio- visual relations using analogies,” in Proceedings of the IEEE Inter-
temporal energy graph,” in Proceedings of the IEEE Conference on national Conference on Computer Vision, 2019, pp. 1981–1990.
Computer Vision and Pattern Recognition, 2019, pp. 10 424–10 433. [124] B. Li and Y. Wang, “Visual relationship detection using joint
[103] Z. Cui, C. Xu, W. Zheng, and J. Yang, “Context-dependent dif- visual-semantic embedding,” in 2018 24th International Conference
fusion network for visual relationship detection,” in Proceedings on Pattern Recognition (ICPR), 2018, pp. 3291–3296.
of the 26th ACM international conference on Multimedia, 2018, pp. [125] J. Peyre, J. Sivic, I. Laptev, and C. Schmid, “Weakly-supervised
1475–1482. learning of visual relations,” in Proceedings of the IEEE Interna-
[104] J. Jung and J. Park, “Visual relationship detection with language tional Conference on Computer Vision, 2017, pp. 5179–5188.
prior and softmax,” in 2018 IEEE international conference on image [126] H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang, “Ppr-fcn: Weakly
processing, applications and systems (IPAS), 2018, pp. 143–148. supervised visual relation detection via parallel pairwise r-fcn,”
[105] F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, “Learning in Proceedings of the IEEE International Conference on Computer
prototypes for visual relationship detection,” in 2018 International Vision, 2017, pp. 4233–4241.
Conference on Content-Based Multimedia Indexing (CBMI), 2018, pp. [127] V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-
1–6. Fei, “Scene graph prediction with limited labels,” in Proceedings
[106] S. J. Hwang, S. Ravi, Z. Tao, H. J. Kim, M. D. Collins, and V. Singh, of the IEEE International Conference on Computer Vision, 2019, pp.
“Tensorize, factorize and regularize: Robust visual relationship 2580–2590.
learning,” 2018 IEEE/CVF Conference on Computer Vision and Pat- [128] A. Zareian, S. Karaman, and S.-F. Chang, “Weakly supervised vi-
tern Recognition, pp. 1014–1023, 2018. sual semantic parsing,” in Proceedings of the IEEE/CVF Conference
[107] M. H. Dupty, Z. Zhang, and W. S. Lee, “Visual relationship on Computer Vision and Pattern Recognition, 2020, pp. 3736–3745.
detection with low rank non-negative tensor decomposition,” in [129] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro,
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, “Graphical contrastive losses for scene graph parsing,” in Pro-
no. 07, 2020, pp. 10 737–10 744. ceedings of the IEEE Conference on Computer Vision and Pattern
[108] I. Donadello and L. Serafini, “Compensating supervision incom- Recognition, 2019, pp. 11 535–11 543.
pleteness with prior knowledge in semantic image interpreta- [130] H. Ben-Younes, R. Cadene, N. Thome, and M. Cord, “Block:
tion,” in 2019 International Joint Conference on Neural Networks Bilinear superdiagonal fusion for visual question answering and
(IJCNN), 2019, pp. 1–8. visual relationship detection,” in Proceedings of the AAAI Confer-
[109] B. Wen, J. Luo, X. Liu, and L. Huang, “Unbiased scene graph ence on Artificial Intelligence, vol. 33, 2019, pp. 8102–8109.
generation via rich and fair semantic extraction,” arXiv preprint [131] Y. Bin, Y. Yang, C. Tao, Z. Huang, J. Li, and H. T. Shen, “Mr-net:
arXiv:2002.00176, 2020. Exploiting mutual relation for visual relationship detection,” in
[110] X. Liang, L. Lee, and E. P. Xing, “Deep variation-structured Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
reinforcement learning for visual relationship and attribute de- no. 1, 2019, pp. 8110–8117.
tection,” in Proceedings of the IEEE conference on computer vision
[132] C. Yuren, H. Ackermann, W. Liao, M. Y. Yang, and B. Rosenhahn,
and pattern recognition, 2017, pp. 848–857.
“Nodis: Neural ordinary differential scene understanding,” arXiv
[111] X. Sun, Y. Zi, T. Ren, J. Tang, and G. Wu, “Hierarchical visual re- preprint arXiv:2001.04735, 2020.
lationship detection,” in Proceedings of the 27th ACM International
[133] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased
Conference on Multimedia, 2019, pp. 94–102.
scene graph generation from biased training,” in Proceedings of the
[112] H. Wan, Y. Luo, B. Peng, and W.-S. Zheng, “Representation learn-
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
ing for scene graph completion via jointly structural and visual
2020, pp. 3716–3725.
embedding,” in Twenty-Seventh International Joint Conference on
[134] X. Yang, H. Zhang, and J. Cai, “Shuffle-then-assemble: Learning
Artificial Intelligence (IJCAI-2018), 2018, pp. 949–956.
object-agnostic visual relationship features,” in Proceedings of the
[113] Y. Hu, S. Chen, X. Chen, Y. Zhang, and X. Gu, “Neural message
European conference on computer vision (ECCV), 2018, pp. 36–52.
passing for visual relationship detection,” in ICML Workshop on
Learning and Reasoning with Graph-Structured Representations, Long [135] P. Zhang, X. Ge, and J. Renz, “Support relation analysis for objects
Beach, CA, 2019. in multiple view rgb-d images,” arXiv preprint arXiv:1905.04084,
2019.
[114] H. Zhou, C. Zhang, and C. Hu, “Visual relationship detection
with relative location mining,” in Proceedings of the 27th ACM [136] M. Y. Yang, W. Liao, H. Ackermann, and B. Rosenhahn, “On
International Conference on Multimedia, 2019, pp. 30–38. support relations and semantic scene graphs,” Isprs Journal of
[115] W. Wang, R. Wang, S. Shan, and X. Chen, “Exploring context Photogrammetry and Remote Sensing, vol. 131, pp. 15–25, 2017.
and visual pattern of relationship for scene graph generation,” in [137] D. Chen, X. Liang, Y. Wang, and W. Gao, “Soft transfer learning
Proceedings of the IEEE Conference on Computer Vision and Pattern via gradient diagnosis for visual relationship detection,” in 2019
Recognition, 2019, pp. 8188–8197. IEEE Winter Conference on Applications of Computer Vision (WACV),
[116] L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang, “Coun- 2019, pp. 1118–1126.
terfactual critic multi-agent training for scene graph generation,” [138] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and
in Proceedings of the IEEE International Conference on Computer O. Yakhnenko, “Translating embeddings for modeling multi-
Vision, 2019, pp. 4613–4623. relational data,” Advances in neural information processing systems,
[117] M. Klawonn and E. Heim, “Generating triples with adversarial vol. 26, pp. 2787–2795, 2013.
networks for scene graph construction,” in Proceedings of the [139] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph
AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018, pp. embedding by translating on hyperplanes,” in Proceedings of the
6992–6999. AAAI Conference on Artificial Intelligence, vol. 28, no. 1, 2014.
[118] C. Han, F. Shen, L. Liu, Y. Yang, and H. T. Shen, “Visual spatial [140] G. Ji, K. Liu, S. He, and J. Zhao, “Knowledge graph completion
attention network for relationship detection,” in Proceedings of the with adaptive sparse transfer matrix,” in Proceedings of the AAAI
26th ACM international conference on Multimedia, 2018, pp. 510– Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
518. [141] S. Zheng, X. Chen, S. Chen, and Q. Jin, “Relation understanding
[119] G. S. Kenigsfield and R. El-Yaniv, “Leveraging auxiliary text for in videos,” in Proceedings of the 27th ACM International Conference
deep recognition of unseen visual relationships,” arXiv preprint on Multimedia, 2019, pp. 2662–2666.
arXiv:1910.12324, 2019. [142] X. Shang, J. Xiao, D. Di, and T.-S. Chua, “Relation understanding
[120] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, “Referring in videos: A grand challenge overview,” in Proceedings of the 27th
relationships,” in Proceedings of the IEEE Conference on Computer ACM International Conference on Multimedia, 2019, pp. 2652–2656.
Vision and Pattern Recognition, 2018, pp. 6867–6876. [143] X. Sun, T. Ren, Y. Zi, and G. Wu, “Video visual relation detection
[121] A. Kolesnikov, A. Kuznetsova, C. Lampert, and V. Ferrari, “De- via multi-modal feature fusion,” in Proceedings of the 27th ACM
tecting visual relationships using box attention,” in Proceedings International Conference on Multimedia, 2019, pp. 2657–2661.
of the IEEE International Conference on Computer Vision Workshops, [144] X. Shang, D. Di, J. Xiao, Y. Cao, X. Yang, and T.-S. Chua, “Annotat-
2019, pp. 0–0. ing objects and relations in user-generated videos,” in Proceedings
of the 2019 on International Conference on Multimedia Retrieval, 2019, large scale visual recognition challenge,” International journal of
pp. 279–287. computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[145] J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: [164] J. Lv, Q. Xiao, and J. Zhong, “Avr: Attention based salient visual
Actions as compositions of spatio-temporal scene graphs,” in relationship detection,” arXiv preprint arXiv:2003.07012, 2020.
Proceedings of the IEEE/CVF Conference on Computer Vision and [165] X. Lin, C. Ding, J. Zeng, and D. Tao, “Gps-net: Graph property
Pattern Recognition, 2020, pp. 10 236–10 247. sensing network for scene graph generation,” in Proceedings of the
[146] P. Gay, J. Stuart, and A. Del Bue, “Visual graphs from motion IEEE/CVF Conference on Computer Vision and Pattern Recognition,
(vgfm): Scene understanding with object geometry reasoning,” 2020, pp. 3746–3753.
in Asian Conference on Computer Vision, 2018, pp. 330–346. [166] M. Khademi and O. Schulte, “Deep generative probabilistic
[147] U. Kim, J. Park, T. Song, and J. Kim, “3-d scene graph: A sparse graph neural networks for scene graph generation.” in Proceed-
and semantic representation of physical environments for intelli- ings of the AAAI Conference on Artificial Intelligence, 2020, pp.
gent agents,” IEEE Transactions on Systems, Man, and Cybernetics, 11 237–11 245.
pp. 1–13, 2019. [167] G. Ren, L. Ren, Y. Liao, S. Liu, B. Li, J. Han, and S. Yan, “Scene
[148] J. Wald, H. Dhamo, N. Navab, and F. Tombari, “Learning 3d graph generation with hierarchical context,” IEEE Transactions on
semantic scene graphs from 3d indoor reconstructions,” in Pro- Neural Networks and Learning Systems, 2020.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern [168] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier,
Recognition, 2020, pp. 3961–3970. and S. Lazebnik, “Phrase localization and visual relationship
[149] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object de- detection with comprehensive image-language cues,” 2017 IEEE
tection from point clouds,” 2018 IEEE/CVF Conference on Computer International Conference on Computer Vision (ICCV), pp. 1946–1955,
Vision and Pattern Recognition, pp. 7652–7660, 2018. 2017.
[150] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal gener- [169] B. Knyazev, H. de Vries, C. Cangea, G. W. Taylor, A. Courville,
ation and detection from point cloud,” 2019 IEEE/CVF Conference and E. Belilovsky, “Graph density-aware losses for novel
on Computer Vision and Pattern Recognition (CVPR), pp. 770–779, compositions in scene graph generation,” arXiv preprint
2018. arXiv:2005.08230, 2020.
[151] W. Ali, S. Abdelkarim, M. Zidan, M. Zahran, and A. El Sallab, [170] D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-
“Yolo3d: End-to-end real-time 3d oriented object bounding box world visual reasoning and compositional question answering,”
detection from lidar point cloud,” in Proceedings of the European 2019 IEEE/CVF Conference on Computer Vision and Pattern Recogni-
Conference on Computer Vision (ECCV), 2018, pp. 0–0. tion (CVPR), pp. 6693–6702, 2019.
[171] C. Liu, Y. Jin, K. Xu, G. Gong, and Y. Mu, “Beyond short-term
[152] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning
snippet: Video relation detection with spatio-temporal global
on point sets for 3d classification and segmentation,” 2017 IEEE
context,” in Proceedings of the IEEE/CVF Conference on Computer
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
Vision and Pattern Recognition, 2020, pp. 10 840–10 849.
77–85, 2017.
[172] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d
[153] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari, “Fully-
dynamic scene graphs: Actionable spatial perception with places,
convolutional point networks for large-scale point clouds,” in
objects, and humans,” arXiv preprint arXiv:2002.06289, 2020.
Proceedings of the European Conference on Computer Vision (ECCV),
[173] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv
2018, pp. 596–611.
preprint arXiv:1904.07850, 2019.
[154] M. Jaritz, T.-H. Vu, R. d. Charette, E. Wirbel, and P. Pérez,
[174] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high
“xmuda: Cross-modal unsupervised domain adaptation for 3d
quality object detection,” in Proceedings of the IEEE conference on
semantic segmentation,” in Proceedings of the IEEE/CVF Conference
computer vision and pattern recognition, 2018, pp. 6154–6162.
on Computer Vision and Pattern Recognition, 2020, pp. 12 605–
[175] F. Xiao and Y. Jae Lee, “Video object detection with an aligned
12 614.
spatial-temporal memory,” in Proceedings of the European Confer-
[155] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: ence on Computer Vision (ECCV), 2018, pp. 485–501.
3d object detection from point cloud with part-aware and part- [176] M. Shvets, W. Liu, and A. C. Berg, “Leveraging long-range
aggregation network,” IEEE Transactions on Pattern Analysis and temporal relationships between proposals for video object de-
Machine Intelligence, 2020. tection,” in Proceedings of the IEEE International Conference on
[156] Y. Liang, Y. Bai, W. Zhang, X. Qian, L. Zhu, and T. Mei, “Vrr-vg: Computer Vision, 2019, pp. 9756–9764.
Refocusing visually-relevant relationships,” in Proceedings of the [177] H. Wu, Y. Chen, N. Wang, and Z. Zhang, “Sequence level seman-
IEEE International Conference on Computer Vision, 2019, pp. 10 403– tics aggregation for video object detection,” in Proceedings of the
10 412. IEEE International Conference on Computer Vision, 2019, pp. 9217–
[157] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and 9225.
A. Zisserman, “The pascal visual object classes (voc) challenge,” [178] K. Kang, W. Ouyang, H. Li, and X. Wang, “Object detection
International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, from video tubelets with convolutional neural networks,” in
2010. Proceedings of the IEEE conference on computer vision and pattern
[158] A. Belz, A. Muscat, P. Anguill, M. Sow, G. Vincent, and Y. Zi- recognition, 2016, pp. 817–825.
nessabah, “Spatialvoc2k: A multilingual dataset of images with [179] H. Wang and C. Schmid, “Action recognition with improved
annotations and features for spatial relations between objects,” in trajectories,” in Proceedings of the IEEE international conference on
Proceedings of the 11th International Conference on Natural Language computer vision, 2013, pp. 3551–3558.
Generation, 2018, pp. 140–145. [180] S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. S. Yu, “A survey
[159] K. Yang, O. Russakovsky, and J. B. Deng, “Spatialsense: An on knowledge graphs: Representation, acquisition and applica-
adversarially crowdsourced benchmark for spatial relation recog- tions,” arXiv preprint arXiv:2002.00388, 2020.
nition,” 2019 IEEE/CVF International Conference on Computer Vision [181] H. Huang, S. Saito, Y. Kikuchi, E. Matsumoto, W. Tang, and
(ICCV), pp. 2051–2060, 2019. P. S. Yu, “Addressing class imbalance in scene graph parsing
[160] F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, by learning to contrast and score,” in Proceedings of the Asian
“Gibson env: Real-world perception for embodied agents,” 2018 Conference on Computer Vision, 2020.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [182] S. Inuganti and V. N. Balasubramanian, “Assisting scene
pp. 9068–9079, 2018. graph generation with self-supervision,” arXiv preprint
[161] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, arXiv:2008.03555, 2020.
D. Poland, D. Borth, and L.-J. Li, “The new data and new chal- [183] J. Yu, Y. Chai, Y. Hu, and Q. Wu, “Cogtree: Cognition
lenges in multimedia research,” arXiv preprint arXiv:1503.01817, tree loss for unbiased scene graph generation,” arXiv preprint
2015. arXiv:2009.07526, 2020.
[162] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, [184] Y. Zhou, S. Sun, C. Zhang, Y. Li, and W. Ouyang, “Exploring the
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in hierarchy in relation labels for scene graph generation,” arXiv
context,” in European conference on computer vision, 2014, pp. 740– preprint arXiv:2009.05834, 2020.
755. [185] B. Knyazev, H. de Vries, C. Cangea, G. W. Taylor, A. Courville,
[163] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, and E. Belilovsky, “Generative graph perturbations for scene
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet graph prediction,” arXiv preprint arXiv:2007.05756, 2020.
[186] M. Wei, C. Yuan, X. Yue, and K. Zhong, “Hose-net: Higher order


structure embedded network for scene graph generation,” in
Proceedings of the 28th ACM International Conference on Multimedia,
2020, pp. 1846–1854.
[187] T. He, L. Gao, J. Song, J. Cai, and Y.-F. Li, “Learning from the
scene and borrowing from the rich: Tackling the long tail in scene
graph generation,” arXiv preprint arXiv:2006.07585, 2020.
[188] S. Yan, C. Shen, Z. Jin, J. Huang, R. Jiang, Y. Chen, and X.-S.
Hua, “Pcpl: Predicate-correlation perception learning for unbi-
ased scene graph generation,” in Proceedings of the 28th ACM
International Conference on Multimedia, 2020, pp. 265–273.
[189] T.-J. J. Wang, S. Pehlivan, and J. Laaksonen, “Tackling the unan-
notated: Scene graph generation with bias-reduced models,”
arXiv preprint arXiv:2008.07832, 2020.
[190] Y. Qiang, Y. Yang, Y. Guo, and T. M. Hospedales, “Tensor com-
position net for visual relationship prediction,” arXiv preprint
arXiv:2012.05473, 2020.
[191] S. Y. Bao, M. Bagra, Y.-W. Chao, and S. Savarese, “Semantic
structure from motion with points, regions, and objects,” in 2012
IEEE Conference on Computer Vision and Pattern Recognition, 2012,
pp. 2703–2710.
[192] M. Raboh, R. Herzig, J. Berant, G. Chechik, and A. Globerson,
“Differentiable scene graphs,” in The IEEE Winter Conference on
Applications of Computer Vision, 2020, pp. 1488–1497.
[193] Y. Guo, J. Song, L. Gao, and H. T. Shen, “One-shot scene graph
generation,” in Proceedings of the 28th ACM International Confer-
ence on Multimedia, 2020, pp. 3090–3098.
[194] W. Wang, R. Liu, M. Wang, S. Wang, X. Chang, and Y. Chen,
“Memory-based network for scene graph with unbalanced rela-
tions,” in Proceedings of the 28th ACM International Conference on
Multimedia, 2020, pp. 2400–2408.
[195] W. Wang, R. Wang, S. Shan, and X. Chen, “Sketching image
gist: Human-mimetic hierarchical scene graph generation,” arXiv
preprint arXiv:2007.08760, 2020.
[196] H. Tian, N. Xu, A.-A. Liu, and Y. Zhang, “Part-aware interactive
learning for scene graph generation,” in Proceedings of the 28th
ACM International Conference on Multimedia, 2020, pp. 3155–3163.
[197] D. Jin, X. Ma, C. Zhang, Y. Zhou, J. Tao, M. Zhang, H. Zhao, S. Yi,
Z. Li, X. Liu, and H. Li, “Towards overcoming false positives
in visual relationship detection,” arXiv preprint arXiv:2012.12510,
2020.
[198] W. Wang, R. Liu, M. Wang, S. Wang, X. Chang, and Y. Chen,
“Memory-based network for scene graph with unbalanced rela-
tions,” in Proceedings of the 28th ACM International Conference on
Multimedia, 2020, pp. 2400–2408.
[199] P. Xu, X. Chang, L. Guo, P.-Y. Huang, X. Chen, and A. G. Haupt-
mann, “A survey of scene graph: Generation and application,”
IEEE Trans. Neural Netw. Learn. Syst, 2020.
