Scene Graph Generation
Abstract—Deep learning techniques have led to remarkable breakthroughs in the field of generic object detection and have spawned
many scene-understanding tasks in recent years. The scene graph has become a focus of research because of its powerful semantic
representation and its applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping
an image or a video into a semantic structural scene graph, which requires the correct labeling of detected objects and their
relationships. Although this is a challenging task, the community has proposed many SGG approaches and achieved promising results. In
this paper, we provide a comprehensive survey of recent achievements in this field brought about by deep learning techniques. We
review 138 representative works and systematically summarize existing methods for image-based SGG from the perspective of feature
representation and refinement. We attempt to connect and systematize existing visual relationship detection methods, and to
summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. Finally, we conclude this survey with an
in-depth discussion of open problems and future research directions. This survey will help readers develop a better
understanding of the current research status and ideas.
Index Terms—Scene Graph Generation, Visual Relationship Detection, Object Detection, Scene Understanding.
Fig. 1: A visual illustration of a scene graph structure and some applications. Scene graph generation models take an
image as input and generate a visually-grounded scene graph. An image caption can be generated from a scene graph
directly. In contrast, image generation inverts the process by generating realistic images from a given sentence or scene
graph. Referring Expression (REF) comprehension marks the region of the input image corresponding to a given expression,
where the region and the expression map to the same subgraph of the scene graph. Scene graph-based image retrieval takes a
query as input and treats retrieval as a scene graph matching problem. For the Visual Question Answering (VQA) task, the
answer can sometimes be found directly on the scene graph, and the scene graph is helpful even for more complex visual reasoning.
ing (NLP) and proposed a number of advanced research directions, such as image captioning, visual question answering (VQA),
visual dialog and so on. These vision-and-language topics require a rich understanding of our visual world and offer various
application scenarios for intelligent systems.
Although rapid advances have been achieved in scene understanding at all levels, there is still a long way to go. Overall
perception and effective representation of information are still bottlenecks. As indicated by a series of previous works [1],
[44], [191], building an efficient structured representation that captures comprehensive semantic knowledge is a crucial step
towards a deeper understanding of visual scenes. Such a representation can not only offer contextual cues for fundamental
recognition challenges, but also provide a promising alternative for high-level intelligence vision tasks. The scene graph,
proposed by Johnson et al. [1], is a visually-grounded graph over the object instances in a specific scene, where the nodes
correspond to object bounding boxes with their object categories, and the edges represent their pair-wise relationships.
Because of the structured abstraction and greater semantic representation capacity compared to image features, the scene graph
has the instinctive potential to tackle and improve other vision tasks. As shown in Fig.1, a scene graph parses the image into a
simple and meaningful structure and acts as a bridge between the visual scene and its textual description. Many tasks that
combine vision and language can be handled with scene graphs, including image captioning [3], [12], [18], visual question
answering [4], [5], content-based image retrieval [1], [7], image generation [8], [9] and referring expression comprehension
[35]. Some tasks take an image as input, parse it into a scene graph, and then generate a reasonable text as output. Other
tasks invert the process by extracting scene graphs from a text description and then generating realistic images or retrieving
the corresponding visual scene.
Xu et al. [199] have produced a thorough survey on scene graph generation, which analyses the SGG methods based on five typical
models (CRF, TransE, CNN, RNN/LSTM and GNN) and also includes a discussion of important contributions by prior knowledge.
Moreover, a detailed investigation of the main applications of scene graphs was also provided. The current survey focuses on
the visual relationship detection part of SGG, and its organization is based on feature representation and refinement.
Specifically, we first provide a comprehensive and systematic review of 2D SGG. In addition to multimodal features, prior
information and commonsense knowledge that help overcome the long-tailed distribution and the large intra-class diversity
problems are also covered. To refine the local features and fuse the contextual information for high-quality relationship
prediction, we analyze mechanisms such as message passing, attention, and visual translation embedding. In addition to 2D SGG,
spatio-temporal and 3D SGG are also examined. Further, a detailed discussion of the most common datasets is provided together
with performance evaluation measures. Finally, a comprehensive and systematic review of the most recent research on the
generation of scene graphs is presented. We provide a survey of 138 papers on SGG¹, which have appeared since 2016 in the
leading computer vision, pattern recognition, and machine learning conferences and journals. Our goal is to help the reader
study and understand this research topic, which has gained significant momentum in the past few years. The main contributions
of this article are as follows:
1) A comprehensive review of 138 papers on scene graph generation is presented, covering nearly all of the current literature
on this topic.
2) A systematic analysis of 2D scene graph generation is presented, focusing on feature representation and refinement. The
long-tail distribution problem and the large intra-class diversity problem are addressed from the perspectives of fusing prior
information and commonsense knowledge, as well as refining features through message passing, attention, and visual translation
embedding.
3) A review of typical datasets for 2D, spatio-temporal and 3D scene graph generation is presented, along with an analysis of
the performance evaluation of the corresponding methods on these datasets.
1. We provide a curated list of scene graph generation methods, publicly accessible at
https://github.com/mqjyl/awesome-scene-graph
The rest of this paper is organized as follows: Section 2 gives the definition of a scene graph and thoroughly analyses the
characteristics of visual relationships and the structure of a scene graph. Section 3 surveys scene graph generation methods.
Section 4 summarizes almost all currently published datasets. Section 5 compares and discusses the performance of some key
methods on the most commonly used datasets. Finally, Section 6 summarizes open problems in the current research and discusses
potential future research directions. Section 7 concludes the paper.

2 SCENE GRAPH
A scene graph is a structural representation, which can capture detailed semantics by explicitly modeling objects ("man", "fire
hydrant", "shorts"), attributes of objects ("fire hydrant is yellow"), and relations between paired objects ("man jumping over
fire hydrant"), as shown in Fig.1. The fundamental elements of a scene graph are objects, attributes and relations.
Subjects/objects are the core building blocks of an image and they can be located with bounding boxes. Each object can have
zero or more attributes, such as color (e.g., yellow), state (e.g., standing), material (e.g., wooden), etc. Relations can be
actions (e.g., "jump over"), spatial relations (e.g., "is behind"), descriptive verbs (e.g., "wear"), prepositions (e.g.,
"with"), comparatives (e.g., "taller than"), prepositional phrases (e.g., "drive on"), etc. [10], [28], [30], [110]. In short,
a scene graph is a set of visual relationship triplets in the form of ⟨subject, relation, object⟩ or ⟨object, is, attribute⟩.
The latter is also considered a relationship triplet (using the "is" relation for uniformity [10], [11]).
In this survey paper, we mainly focus on the triplet description of a static scene. Given a visual scene S ∈ 𝒮 [62], such as an
image or a 3D mesh, its scene graph is a set of visual triplets R_S ⊆ O_S × P_S × (O_S ∪ A_S), where O_S is the object set, A_S
is the attribute set and P_S is the relation set, including the "is" relation p_{S,is} where only one object is involved. Each
object o_{S,k} ∈ O_S has a semantic label l_{S,k} ∈ O_L (O_L is the semantic label set) and is grounded with a bounding box
(BB) b_{S,k} in scene S, where k ∈ {1, ..., |O_S|}. Each relation p_{S,i→j} ∈ P_S ⊆ P is the core of a visual relationship
triplet r_{S,i→j} = ⟨o_{S,i}, p_{S,i→j}, o_{S,j}⟩ ∈ R_S with i ≠ j, where the third element o_{S,j} could be an attribute
a_{S,j} ∈ A_S if p_{S,i→j} is p_{S,is}. As the relationship is one-way, we express r_{S,i→j} as ⟨s_{S,i}, p_{S,i→j}, o_{S,j}⟩
to maintain semantic accuracy, where s_{S,i}, o_{S,j} ∈ O_S, s_{S,i} is the subject and o_{S,j} is the object.
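To make this formalism concrete, the following minimal Python sketch encodes a scene graph as a list of labeled, grounded
objects and a list of relation triplets; ⟨object, is, attribute⟩ triplets are stored here as attributes attached to the object.
The class and field names are illustrative assumptions only, not an implementation from any of the surveyed papers.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SceneObject:
        label: str                                   # semantic label l_{S,k}
        bbox: Tuple[float, float, float, float]      # grounding box b_{S,k} as (x1, y1, x2, y2)
        attributes: List[str] = field(default_factory=list)

    @dataclass
    class RelationTriplet:
        subj: int      # index of the subject object s_{S,i}
        pred: str      # relation p_{S,i->j}
        obj: int       # index of the object o_{S,j}

    @dataclass
    class SceneGraph:
        objects: List[SceneObject]
        triplets: List[RelationTriplet]

    # Example from Fig.1: <man, jumping over, fire hydrant> and <fire hydrant, is, yellow>
    graph = SceneGraph(
        objects=[SceneObject("man", (10, 5, 60, 120)),
                 SceneObject("fire hydrant", (70, 80, 95, 120), attributes=["yellow"])],
        triplets=[RelationTriplet(subj=0, pred="jumping over", obj=1)],
    )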
From the point of view of graph theory, a scene graph is a directed graph with three types of nodes: object, attribute, and
relation. However, for the convenience of semantic expression, a node of a scene graph is seen as an object with all its
attributes, while the relation is called an edge. A subgraph can be formed around an object, made up of all the visual triplets
that involve that object. The subgraph therefore contains all the adjacent nodes of the object, and these adjacent nodes
directly reflect the context information of the object. From the top-down view, a scene graph can be broken down into several
subgraphs, a subgraph can be split into several triplets, and a triplet can be split into individual objects with their
attributes and relations. Accordingly, we can find a region in the scene corresponding to each substructure, be it a subgraph,
a triplet, or an object. Clearly, a perfectly-generated scene graph corresponding to a given scene should be structurally
unique. The process of generating a scene graph should be objective and should depend only on the scene. Scene graphs should
serve as an objective semantic representation of the state of the scene, and the SGG process should not be affected by who
labelled the data, by how object and predicate categories were assigned, or by the performance of the SGG model used. In
reality, however, not all annotators produce exactly the same visual relationship for each triplet, and the methods that
generate scene graphs do not always predict the correct relationships. This uniqueness supports the argument that using a scene
graph as a replacement for a visual scene at the language level is reasonable.
Compared with scene graphs, the well-known knowledge graph is represented as multi-relational data with enormous fact triplets
in the form of (head entity type, relation, tail entity type) [112], [180]. Here, we have to emphasize that the visual
relationships in a scene graph are different from those in social networks and knowledge bases. In the case of vision, images
and visual relationships are incidental and are not intentionally constructed. In particular, visual relationships are usually
image-specific because they only depend on the content of the particular image in which they appear. Although a scene graph is
generated from a textual description in some language-to-vision tasks, such as image generation, the relationships in a scene
graph are always situation-specific: each of them has a corresponding visual feature in the output image. Objects in scenes are
not independent and tend to cluster. Sadeghi et al. [43] coined the term visual phrases to introduce composite intermediates
between objects and scenes. Visual phrases, which integrate linguistic representations of relationship triplets, encode the
interactions between objects and scenes.
A two-dimensional (2D) image is a projection of a three-dimensional (3D) world from a particular perspective. Because of the
visual occlusion and dimensionality reduction caused by the projection from 3D to 2D, 2D images may have incomplete or
ambiguous information about the 3D scene, leading to an imperfect representation in 2D scene graphs. As opposed to a 2D scene
graph, a 3D scene graph prevents spatial relationship ambiguities between object pairs caused by different viewpoints. The
relationships described above are static and instantaneous because the information is grounded in an image or a 3D mesh that
can only capture a specific moment or a certain scene. With videos, on the other hand, a visual relationship is not
instantaneous but varies with time. A digital video consists of a series of images called frames, which means relations span
multiple frames and have different durations. Visual relationships in a video can construct a Spatio-Temporal Scene Graph,
which includes entity nodes of the neighborhood in the time and space dimensions.
The scope of our survey therefore extends beyond the generation of 2D scene graphs to include 3D and spatio-temporal scene
graphs as well.

3 SCENE GRAPH GENERATION
The goal of scene graph generation is to parse an image or a sequence of images in order to generate a structured
representation, to bridge the gap between visual and semantic perception, and ultimately to achieve a complete understanding of
visual scenes. However, it is difficult to generate an accurate and complete scene graph. Generating a scene graph is generally
a bottom-up process in which entities are grouped into triplets and these triplets are connected to form the entire scene
graph. Evidently, the essence of the task is to detect the visual relationships, i.e., ⟨subject, relation, object⟩ triplets,
abbreviated as ⟨s, r, o⟩. Methods used to connect the detected visual relationships to form a scene graph do not fall within
the scope of this survey. This paper focuses on reviewing methods for visual relationship detection.
Visual relationship detection has attracted the attention of the research community since the pioneering work by Lu et al.
[28], and the release of the ground-breaking large-scale scene graph dataset Visual Genome (VG) by Krishna et al. [30]. Given a
visual scene S and its scene graph T_S [31], [62]:
• B_S = {b_{S,1}, ..., b_{S,n}} is the region candidate set, with element b_{S,i} denoting the bounding box of the i-th
candidate object.
• O_S = {o_{S,1}, ..., o_{S,n}} is the object set, with element o_{S,i} denoting the corresponding class label of the object
b_{S,i}.
• A_S = {a_{S,o_1,1}, ..., a_{S,o_1,k_1}, ..., a_{S,o_n,1}, ..., a_{S,o_n,k_n}} is the attribute set, with element a_{S,o_i,j}
denoting the j-th attribute of the i-th object, where k_i ≥ 0 and j ∈ {1, ..., k_i}.
• R_S = {r_{S,1→2}, r_{S,1→3}, ..., r_{S,n→n−1}} is the relation set, with element r_{S,i→j} corresponding to a visual triple
t_{S,i→j} = ⟨s_{S,i}, r_{S,i→j}, o_{S,j}⟩, where s_{S,i} and o_{S,j} denote the subject and object respectively. This set also
includes the "is" relation, where only one object is involved.
When attribute detection and relationship prediction are considered as two independent processes, we can decompose the
probability distribution of the scene graph p(T_S|S) into four components, similar to [31]:

    p(T_S | S) = p(B_S | S) p(O_S | B_S, S) p(A_S | O_S, B_S, S) p(R_S | O_S, B_S, S)        (1)

In the equation, the bounding box component p(B_S|S) generates a set of candidate regions that cover most of the crucial
objects directly from the input image. The object component p(O_S|B_S, S) predicts the class label of the object in each
bounding box. Both steps are identical to those used in two-stage object detection methods, and can be implemented by the
widely used Faster R-CNN detector [17]. Conditioned on the predicted labels, the attribute component p(A_S|O_S, B_S, S) infers
all possible attributes of each object, while the relationship component p(R_S|O_S, B_S, S) infers the relationship of each
object pair [31]. When all visual triplets are collected, a scene graph can then be constructed. Since attribute detection is
generally regarded as an independent research topic, visual relationship detection
and scene graph generation are often regarded as the same task. Then, the probability of a scene graph T_S can be decomposed
into three factors:

    p(T_S | S) = p(B_S | S) p(O_S | B_S, S) p(R_S | O_S, B_S, S)        (2)
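The factorizations in Eq. (1) and Eq. (2) map directly onto a staged inference procedure. The Python sketch below makes the
conditioning order explicit; the detector, object classifier and relation classifier are placeholders for whatever components a
concrete SGG model uses, so this is an illustration of the decomposition rather than of any particular published system.

    def generate_scene_graph(image, detector, object_classifier, relation_classifier):
        """Staged inference following p(T|S) = p(B|S) p(O|B,S) p(R|O,B,S)."""
        boxes = detector(image)                                   # p(B_S | S): region candidates
        labels = [object_classifier(image, b) for b in boxes]     # p(O_S | B_S, S): class labels

        triplets = []
        for i, (b_i, l_i) in enumerate(zip(boxes, labels)):
            for j, (b_j, l_j) in enumerate(zip(boxes, labels)):
                if i == j:
                    continue
                # p(R_S | O_S, B_S, S): predict a predicate for the ordered pair (i -> j)
                predicate = relation_classifier(image, (b_i, l_i), (b_j, l_j))
                if predicate is not None:                         # e.g., low-scoring pairs may be dropped
                    triplets.append((l_i, predicate, l_j))
        return triplets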
The following section provides a detailed review of more than a hundred deep learning-based methods proposed up to 2020 for
visual relationship detection and scene graph generation. In view of the fact that much more work has been published on 2D SGG
than on 3D or spatio-temporal SGG, a comprehensive overview of the methods for 2D SGG is provided first. This is followed by a
review of the 3D and spatio-temporal SGG methods in order to ensure the completeness and breadth of the survey.
Note: We use "relationship" or "triplet" to refer to the tuple ⟨subject, relation, object⟩ in this paper, and "relation" or
"predicate" to refer to the relation element.

3.1 2D Scene Graph Generation
Scene graphs can be generated in two different ways [13]. The mainstream approach uses a two-step pipeline that detects objects
first and then solves a classification task to determine the relationship between each pair of objects. The other approach
involves jointly inferring the objects and their relationships based on the object region proposals. Both approaches need to
first detect all existing or proposed objects in the image, group them into pairs, and use the features of their union area
(called relation features) as the basic representation for predicate inference. In this section, we focus on the two-step
approach, and Fig.2 illustrates the general framework for creating 2D scene graphs. Given an image, a scene graph generation
method first generates subject/object and union proposals with a Region Proposal Network (RPN), which are sometimes derived
from the ground-truth human annotations of the image. Each union proposal is made up of a subject, an object and a predicate
ROI, where the predicate ROI is the box that tightly covers both the subject and the object. We can then obtain appearance,
spatial, label, depth, and mask features for each object proposal in the feature representation stage, and appearance, spatial,
depth, and mask features for each predicate proposal. These multimodal features are vectorized, combined, and refined in the
third step, the Feature Refinement module, using message passing mechanisms, attention mechanisms and visual translation
embedding approaches. Finally, classifiers are used to predict the categories of the predicates, and the scene graph is
generated.
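For clarity, the predicate (union) ROI mentioned above is simply the tightest box enclosing the subject and object boxes. A
minimal helper, assuming boxes in (x1, y1, x2, y2) format, could look as follows:

    def union_box(subj_box, obj_box):
        """Tightest box covering both the subject and the object ROI."""
        x1 = min(subj_box[0], obj_box[0])
        y1 = min(subj_box[1], obj_box[1])
        x2 = max(subj_box[2], obj_box[2])
        y2 = max(subj_box[3], obj_box[3])
        return (x1, y1, x2, y2)

    # e.g., union_box((10, 5, 60, 120), (70, 80, 95, 120)) -> (10, 5, 95, 120)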
In this section, SGG methods for 2D inputs are reviewed and analyzed according to the following strategies.
(1) Off-the-shelf object detectors can be used to detect subject, object and predicate ROIs. The first point to consider is how
to utilize the multimodal features of the detected proposals. Accordingly, Section 3.1.1 reviews and analyzes the use of
multimodal features, including appearance, spatial, depth, mask, and label features.
(2) A scene graph's compositionality is its most important characteristic, and can be seen as an elevation of its semantic
expression from independent objects to visual phrases. A deeper meaning can, however, be derived from two aspects: the
frequency of visual phrases and the common-sense constraints on relationship prediction. For example, when "man", "horse" and
"hat" are detected individually in an image, the most likely visual triplets are ⟨man, ride, horse⟩, ⟨man, wearing, hat⟩, etc.
⟨hat, on, horse⟩ is possible, though not common, but ⟨horse, wearing, hat⟩ is normally unreasonable. Thus, how to integrate
Prior Information about visual phrases and Commonsense Knowledge will be analyzed in Section 3.1.2 and Section 3.1.3,
respectively.
(3) A scene graph is a representation of visual relationships between objects, and it includes contextual information about
those relationships. To achieve high-quality predictions, information must be fused between the individual objects or
relationships. In a scene graph, message passing can be used to refine local features and integrate contextual information,
while attention mechanisms can be used to allow the models to focus on the most important parts of the scene. Considering the
large intra-class divergence and long-tailed distribution problems, visual translation embedding methods have been proposed to
model relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities.
Therefore, we categorize the related methods into Message Passing, Attention Mechanism, and Visual Translation Embedding, which
are analyzed in depth in Section 3.1.4, Section 3.1.5 and Section 3.1.6, respectively.

3.1.1 Multimodal Features
The appearance features of the subject, object, and predicate ROIs make up the input of SGG methods and affect SGG
significantly. The rapid development of deep learning-based object detection and classification has led to the use of many
classical CNNs to extract appearance features from ROIs cropped from a whole image by bounding boxes or masks. Some CNNs even
outperform humans when it comes to detecting/classifying objects based on appearance features. Nevertheless, the appearance
features of a subject, an object, and their union region alone are insufficient to accurately recognize the relationship of a
subject-object pair. In addition to appearance features, semantic features of object categories or relations, spatial features
of object candidates, and even contextual features can also be crucial to understand a scene and can be used to improve visual
relationship detection performance. In this subsection, some integrated utilization methods of Appearance, Semantic, Spatial
and Context features are reviewed and analyzed.
Appearance-Semantic Features: A straightforward way to fuse semantic features is to concatenate the semantic word embeddings of
object labels to the corresponding appearance features. As in [28], another approach utilizes language priors from semantic
word embeddings to finetune the likelihood of a predicted relationship, dealing with the fact that objects and predicates
independently occur frequently, even if relationship triplets are infrequent. Moreover, taking into account that the appearance
of objects may change profoundly when they are involved in different visual relations, it is also possible to directly learn an
appearance model to recognize richer-level visual composites,
Fig. 2: An overview of 2D general scene graph generation framework. Firstly, off-the-shelf object detectors are used to detect
subjects, objects and predicate ROIs. Then, different kinds of methods are used in the stages of (b) Feature Representation
and (c) Feature Refinement to improve the final (d) Relation Prediction for high-quality visual relationship detection. This
survey focuses on the methods of feature representation and refinement.
i.e., visual phrases [43], as a whole, rather than detecting the basic atoms and then modeling their interactions.
Appearance-Semantic-Spatial Features: The spatial distribution of objects is not only a reflection of their position, but also
a representation of their structural information. The spatial distribution of objects is described by the properties of
regions, which include positional relations, size relations, distance relations, and shape relations. In this context, Zhu et
al. [83] investigated how the spatial distribution of objects can aid visual relation detection. Sharifzadeh et al. [46] used
3D information in visual relation detection by synthetically generating depth maps with an RGB-to-Depth model incorporated
within relation detection frameworks. They extracted pairwise feature vectors for depth, spatial, label and appearance cues.
The subject and object come from different distributions. In response, Zhang et al. [84] proposed a 3-branch Relationship
Proposal Network (Rel-PN) to produce a set of candidate boxes that represent subject, relationship, and object proposals. A
proposal selection module then selects the candidate pairs that satisfy spatial constraints, and the resulting pairs are fed
into two separate network modules designed to evaluate the relationship compatibility using visual and spatial criteria,
respectively. Finally, the visual and spatial scores are combined with different weights to get the final score for the
predicates. In another work [11], the authors added a semantic module to produce a semantic score for predicates, and the three
scores are then added up to obtain an overall score. Liang et al. [24] also considered three types of features and proposed to
cascade the multi-cue based convolutional neural network with a structural ranking loss function. For an input image x, the
feature representations of the visual appearance cue, spatial location cue and semantic embedding cue are extracted for each
relationship instance tuple. The learned features from the multiple cues are further concatenated and fused into a joint
feature vector through one fully connected layer.
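A generic realization of this kind of multi-cue fusion is to concatenate the appearance, spatial and semantic-embedding vectors
of a relationship instance and project them with a single fully connected layer. The PyTorch-style sketch below illustrates
only this fusion step; the layer sizes and module name are assumptions for illustration, not the configuration of any
particular paper.

    import torch
    import torch.nn as nn

    class MultiCueFusion(nn.Module):
        def __init__(self, app_dim=1024, spa_dim=64, sem_dim=300, joint_dim=512):
            super().__init__()
            # One fully connected layer fuses the concatenated cues into a joint vector.
            self.fc = nn.Linear(app_dim + spa_dim + sem_dim, joint_dim)

        def forward(self, appearance, spatial, semantic):
            joint = torch.cat([appearance, spatial, semantic], dim=-1)
            return torch.relu(self.fc(joint))

    # Usage: one relationship instance per row.
    fusion = MultiCueFusion()
    feat = fusion(torch.randn(8, 1024), torch.randn(8, 64), torch.randn(8, 300))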
Appearance-Semantic-Spatial-Context Features: Previous studies typically extract features from a restricted object-object pair
region and focus on local interaction modeling to infer the objects and their pairwise relation. For example, by fusing
pairwise features, VIP-CNN [26] captures contextual information directly. However, the global visual context beyond these
pairwise regions is ignored, which means losing the chance to shrink the possible semantic space using the rich context. Xu et
al. [25] proposed a multi-scale context modeling method that can simultaneously discover and integrate object-centric and
region-centric contexts for the inference of scene graphs, in order to overcome the problem of large object/relation spaces.
Yin et al. [72] proposed a Spatiality-Context-Appearance module to learn a spatiality-aware contextual feature representation.
In summary, appearance, semantic, spatial and contextual features all contribute to visual relationship detection from
different perspectives. The integration of these multimodal features corresponds closely to the human multi-scale, multi-cue
cognitive model. With well-designed features, visual relationships can be detected more accurately and, in turn, scene graphs
can be constructed more reliably.

3.1.2 Prior Information
The scene graph is a semantically structured description of a visual world. Intuitively, the SGG task can be regarded
as a two-stage semantic tag retrieval process. Therefore, the determination of the relation category often depends on the
labels of the participating subject and object. In Section 3.1, we discussed the compositionality of a scene graph in detail.
Although visual relationships are scene-specific, there are strong semantic dependencies between the relationship predicate r
and the object categories s and o in a relationship triplet ⟨s, r, o⟩.
Data balance plays a key role in the performance of deep neural networks due to their data-dependent training process. However,
because of the long-tailed distribution of relationships between objects, collecting enough training images for all
relationships is time-consuming and too expensive [15], [90], [104], [107]. Scene graphs should serve as an objective semantic
representation of the state of a scene: we cannot arbitrarily assign the relationship ⟨man, feeding, horse⟩ to the scene in
Fig. 3(b) just because ⟨man, feeding, horse⟩ occurs more frequently than ⟨man, riding, horse⟩ in some datasets. In practice,
however, weighting the probability output of relationship detection networks by statistical co-occurrences may improve visual
relationship detection performance on some datasets, and we cannot deny that human beings sometimes reason about the world
based on their experience. As such, prior information, including Statistical Priors and Language Priors, can be regarded as a
type of experience that allows neural networks to "correctly understand" a scene more often. Prior information has already been
widely used to improve the performance of SGG networks.
Statistical Priors: The simplest way to use prior knowledge is to assume that an event should happen this time because it
almost always does; this is called a statistical prior. Baier et al. [87] demonstrated how a visual statistical model could
improve visual relationship detection. Their semantic model was trained using absolute frequencies that describe how often a
triplet appears in the training data. Dai et al. [49] designed a deep relational network that exploited both spatial
configuration and statistical dependency to resolve ambiguities during relationship recognition. Zellers et al. [42] analyzed
the statistical co-occurrences between relationships and object pairs on the Visual Genome dataset and concluded that these
statistical co-occurrences provide strong regularization for relationship prediction.
Furthermore, Chen et al. [31] formally represented this information and explicitly incorporated it into graph propagation
networks to aid scene graph generation. For each object pair with predicted labels (a subject o_i and an object o_j), they
constructed a graph with a subject node, an object node, and K relation nodes. Each node v ∈ V = {o_i, o_j, r_1, r_2, ..., r_K}
has a hidden state h_v^t at timestep t. Let m_{o_i o_j r_k} denote the statistical co-occurrence probability between o_i and
relation node r_k, as well as between o_j and relation node r_k. At timestep t, the relation nodes aggregate messages from the
object nodes, while object nodes aggregate messages from the relation nodes:

    a_v^t = Σ_{k=1}^{K} m_{o_i o_j r_k} h_{r_k}^{t−1},              if v is an object node
    a_v^t = m_{o_i o_j r_k} (h_{o_i}^{t−1} + h_{o_j}^{t−1}),        if v is a relation node        (3)

Then, the hidden state h_v^t is updated with a_v^t and its previous hidden state by a gated mechanism.
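Read literally, Eq. (3) says that each object node sums the relation-node states weighted by the co-occurrence probabilities m,
while each relation node receives the weighted sum of the two object-node states. The numpy sketch below is a schematic
re-implementation of one such aggregation step for a single object pair; variable names mirror the equation and this is not the
authors' released code.

    import numpy as np

    def aggregate_messages(h_oi, h_oj, h_r, m):
        """One timestep of Eq. (3) for an (o_i, o_j) pair with K candidate relations.

        h_oi, h_oj: hidden states of the subject/object nodes, shape (d,)
        h_r:        hidden states of the K relation nodes, shape (K, d)
        m:          co-occurrence probabilities m_{o_i o_j r_k}, shape (K,)
        """
        # Object nodes aggregate from all relation nodes.
        a_object = (m[:, None] * h_r).sum(axis=0)
        # Each relation node aggregates from the two object nodes.
        a_relation = m[:, None] * (h_oi + h_oj)[None, :]
        return a_object, a_relation   # both are then fed to a gated update of h_v

    K, d = 5, 16
    a_obj, a_rel = aggregate_messages(np.zeros(d), np.ones(d), np.random.rand(K, d), np.random.rand(K))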
These methods, however, are data-dependent because their statistical co-occurrence probabilities are derived from training
data. They do not contribute to the design of a universal SGG network. We believe that, in the semantic space, language priors
will be more useful.

Fig. 3: Examples of the wide variety of visual relationships: (a) television-on-wall, (b) man-riding-horse, (c)
boy-playing-seesaw, (d) cat-on-suitcase, (e) dog-sitting on-horse, (f) man-riding-elephant. The solid bounding boxes indicate
the individual objects and the dashed red bounding boxes denote a visual relationship.

Language Priors: Human communication is primarily based on the use of words in a structured and conventional manner. Similarly,
visual relationships are represented as triplets of words. Given the polysemy of words across different contexts, one cannot
simply encode objects and predicates as indexes or bitmasks; the semantics of the object and predicate categories should be
used to deal with this polysemy. In particular, the following observations can be made. First, the visual appearance of
relationships that share the same predicate but have different agents varies greatly [26]. For instance, "television-on-wall"
(Fig.3a) and "cat-on-suitcase" (Fig.3d) have the same predicate type "on", but they have distinct visual and spatial features.
Second, the type of relation between two objects is determined not only by their relative spatial information but also by
their categories. For example, the relative position between the kid and the horse (Fig.3b) is very similar to that between
the dog and the horse (Fig.3e), but it is preferable to describe the relationship as "dog-sitting on-horse" rather than
"dog-riding-horse" in a natural language setting, and it is also very rare to say "person-sitting on-horse". On the other hand,
the relationships between observed objects are naturally based on our language knowledge; for example, we would use the
expression "sitting on" or "playing" for a seesaw but not "riding" (Fig.3c), even though the pose is very similar to that of
"riding" the horse in Fig.3b. Third, relationships are semantically similar when they appear in similar contexts. That is, in a
given context, i.e., an object pair, the probabilities of different predicates describing this pair are related to their
semantic similarity. For example, "person-ride-horse" (Fig.3b) is similar to "person-ride-elephant" (Fig.3f), since "horse"
and "elephant" belong to the same animal category [28]. It is therefore necessary to explore methods for utilizing language
priors in the semantic space.
Lu et al. [28] proposed the first visual relationship detection pipeline that leverages language priors (LP) to finetune the
prediction. They scored each pair of object proposals ⟨O_1, O_2⟩ using a visual appearance module and a language module. In the
training phase, to optimize the projection function f(·) such that it projects similar relationships closer to one another,
they used a heuristic formulated as:

    constant = [f(r, W) − f(r′, W)]² / d(r, r′),   ∀ r, r′        (4)

where d(r, r′) is the sum of the cosine distances in word2vec space between the two objects and the predicates of the two
relationships r and r′. Similarly, Plesse et al. [105] computed the similarity between each neighbor r′ ∈ {r_1, ..., r_K} and
the query r with a softmax function:

    constant = e^{−d(r, r′)²} / Σ_{j=1}^{K} e^{−d(r, r_j)²}        (5)
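Both heuristics hinge on the word2vec-space distance d(r, r′) between two relationships, defined above as a sum of cosine
distances between their corresponding words. The sketch below computes this distance and the softmax-style neighbour weights of
Eq. (5); the pretrained embedding lookup word2vec is assumed to be available, and the exact choice of which words enter the sum
is illustrative rather than prescriptive.

    import numpy as np

    def cosine_distance(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def relationship_distance(r, r_prime, word2vec):
        """d(r, r'): sum of cosine distances over the aligned words of the two triplets."""
        return sum(cosine_distance(word2vec[a], word2vec[b])
                   for a, b in zip(r, r_prime))   # r = (subject, predicate, object)

    def neighbour_weights(r, neighbours, word2vec):
        """Softmax weights of Eq. (5) over the K neighbouring relationships."""
        d2 = np.array([relationship_distance(r, r_j, word2vec) ** 2 for r_j in neighbours])
        w = np.exp(-d2)
        return w / w.sum()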
We should therefore use knowledge beyond the training
Based on this LP model, Jung et al. [104] further summarized data to help generate scene graphs [136]. Commonsense
some major difficulties for visual relationship detection and knowledge includes information about events that occur
performed a lot of experiments on all possible models with in time, about the effects of actions, about physical objects
variant modules. and how they are perceived, and about their properties and
Liao et al. [85] assumed that an inherent semantic rela- relationships with one another. Researchers have proposed
tionship connects the two words in the triplet rather than to extract commonsense knowledge to refine object and
a mathematical distance in the embedding space. They phrase features to improve generalizability of scene graph
proposed to use a generic bi-directional RNN to predict the generation. In this section, we analyze three fundamental
semantic connection between the participating objects in a sub-issues of commonsense knowledge applied to SGG, i.e.,
relationship from the aspect of natural language. Zhang et the Source, Formulation and Usage, as illustrated in Fig.
al. [15] used semantic associations to compensate for infre- 4. To be specific, the source of commonsense is generally
quent classes on a large and imbalanced benchmark with divided into internal training samples [88], [137], external
an extremely skewed class distribution. Their approach was knowledge base [89] or both [90], [91], and it can be trans-
to learn a visual and a semantic module that maps features formed into different formulations [92]. It is mainly applied
from the two modalities into a shared space and then to in the feature refinement on the original feature or other
employ the modified triplet loss to learn the joint visual typical procedures [93].
and semantic embedding. As a result, Abdelkarim et al. [97] Source: Commonsense knowledge can be directly ex-
highlighted the long-tail recognition problem and adopted tracted from the local training samples. For example, Duan
a weighted version of the softmax triplet loss above. et al. [88] calculated the co-occurrence probability of object
From the perspective of collective learning on multi- pairs, p (oi | oj ), and relationship in the presence of object
relational data, Hwang et al. [106] designed an efficient pairs, p (rk | oi , oj ), as the prior statistical knowledge ob-
multi-relational tensor factorization algorithm that yields tained from the training samples of VG dataset, to assist rea-
highly informative priors. Analogously, Dupty et al. [107] soning and deal with the unbalanced distribution. However,
learned conditional triplet joint distributions in the form of considering the tremendous valuable information from the
their normalized low rank non-negative tensor decomposi- large-scale external bases, e.g., Wikipedia and ConceptNet,
tions. increasing efforts have been devoted to distill knowledge
In addition, some other papers have also tried to mine from these resources.
the value of language prior knowledge for relationship Gu et al. [89] proposed a knowledge-based module,
prediction. Donadello et al. [108] encoded visual relationship which improves the feature refinement procedure by rea-
detection with Logic Tensor Networks (LTNs), which ex- soning over a basket of commonsense knowledge retrieved
ploit both the similarities with other seen relationships and from ConceptNet. Yu et al. [90] introduced a Linguistic
background knowledge, expressed with logical constraints Knowledge Distillation Framework that obtains linguistic
between subjects, relations and objects. In order to leverage knowledge by mining from both training annotations (inter-
the inherent structures of the predicate categories, Zhou et nal knowledge) and publicly available text, e.g., Wikipedia
al. [184] proposed to firstly build the language hierarchy (external knowledge), and then construct a teacher network
and then utilize the Hierarchy Guided Feature Learning to distill the knowledge into a student network that predicts
(HGFL) strategy to learn better region features of both the visual relationships from visual, semantic and spatial rep-
coarse-grained level and the fine-grained level. Liang et al. resentations. Zhan et al. [91] proposed a novel multi-modal
[110] proposed a deep Variation-structured Reinforcement feature based undetermined relationship learning network
Learning (VRL) framework to sequentially discover object (MF-URLN), which extracts and fuses features of object
relationships and attributes in an image sample. Recently, pairs from three complementary modules: visual, spatial,
    d = Ψ(s, o, Λ)        (6)

messages across all elements enhances the ability to detect finer visual relationships.
Global Message Passing Across All Elements: Considering that objects which have visual relationships are semantically related
to each other, and that relationships which partially share objects are also semantically related, passing messages between
related elements can be beneficial. Learning feature representations from a global view is helpful for scene-specific visual
relationship detection. Scene graphs have a particular structure, so message passing on the graph or subgraph structures is a
natural choice. Chain-based models (such as RNNs or LSTMs) can also be used to encode contextual cues due to their ability to
represent sequence features. When taking into consideration the inherent parallel/hierarchical relationships between objects,
dynamic tree structures can also be used to capture task-specific visual contexts. In the following subsections, message
passing methods are analyzed according to the three categories described below.
Message Passing on Graph Structures. Li et al. [41] developed an end-to-end Multi-level Scene Description Network (MSDN), in
which message passing is guided by a dynamic graph constructed from object and caption region proposals. In the case of a
phrase proposal, the message comes from a caption region proposal that may cover multiple object pairs and may contain
contextual information with a larger scope than a triplet. For comparison, the Context-based Captioning and Scene Graph
Generation Network (C2SGNet) [73] also simultaneously generates region captions and scene graphs from input images, but the
message passing between phrase and region proposals is unidirectional, i.e., the region proposals require additional context
information about the relationships between object pairs. Moreover, in an extension of the MSDN model, Li et al. [13] proposed
a subgraph-based scene graph generation approach called Factorizable Network (F-Net), where object pairs referring to similar
interacting regions are clustered into a subgraph and share the phrase representation. F-Net clusters the fully-connected graph
into several subgraphs to obtain a factorized connection graph by treating each subgraph as a node, and passes messages between
subgraph and object features along the factorized connection graph with a Spatial-weighted Message Passing (SMP) structure for
feature refinement.
Even though MSDN and F-Net extended the scope of message passing, a subgraph is considered as a whole when sending and
receiving messages. Liao et al. [53] proposed the semantics guided graph relation neural network (SGRNN), in which the target
and source must be an object or a predicate within a subgraph. It first establishes an undirected fully-connected graph by
associating any two objects as a possible relationship. Then, the connections that are semantically weakly dependent are
removed through a semantics guided relation proposal network (SRePN), and a semantically connected graph is formed. To refine
the feature of a target entity (object or relationship), source-target-aware message passing is performed by exploiting
contextual information from the objects and relationships with which the target is semantically correlated. The scope of
message passing is the same as the feature inter-refinement of objects and relations in [89].
Several other techniques consider SGG as a graph inference process because of its particular structure. By considering all
other objects as carriers of global contextual information for each object, they pass messages to each other via a
fully-connected graph. However, inference on a densely connected graph is very expensive. As shown in previous works [64],
[65], dense graph inference can be approximated by mean field in Conditional Random Fields (CRF). Moreover, Johnson et al. [1]
designed a CRF model that reasons about the connections between an image and its ground-truth scene graph, and used these
scene graphs as queries to retrieve images with similar semantic meanings. Zheng et al. [66], [67] combined the strengths of
CNNs with CRFs and formulated mean-field inference as a Recurrent Neural Network (RNN). It is therefore reasonable to use CRFs
or RNNs to formulate the scene graph generation problem [49], [56].
Further, there are some other relevant works which proposed modeling methods based on a pre-determined graph. Hu et al. [113]
explicitly model objects and interactions with an interaction graph, a directed graph built on object proposals based on the
spatial relationships between objects, and then propose a message-passing algorithm to propagate contextual information. Zhou
et al. [114] mined and measured the relevance of predicates using relative locations and constructed a location-based Gated
Graph Neural Network (GGNN) to improve the relationship representation. Chen et al. [31] built a graph to associate the regions
and employed a graph neural network to propagate messages through the graph. Dornadula et al. [61] initialized a fully
connected graph, i.e., all objects are connected to all other objects by all predicate edges, and updated their representations
using message passing protocols within a well-designed graph convolution framework. Zareian et al. [95] formed a heterogeneous
graph by using bridge edges to connect a commonsense graph and an initialized fully connected graph. They then employed a
variant of GGNN to propagate information among nodes and updated the node representations and bridge edges. Wang et al. [115]
constructed a virtual graph with two types of nodes (objects v_i^o and relations v_{ij}^r) and three types of edges
(⟨v_i^o, v_j^o⟩, ⟨v_i^o, v_{ij}^r⟩ and ⟨v_{ij}^r, v_j^o⟩), and then refined the representations of objects and relationships
with an explicit message passing mechanism.
Message Passing on Chain Structures. Dense graph inference can be approximated by mean fields in CRF, and it can also be dealt
with using an RNN-based model. Xu et al. [54] generated a structured scene representation from an image and solved the graph
inference problem using GRUs to iteratively improve the predictions via message passing. This work is considered a milestone in
scene graph generation, demonstrating that RNN-based models can be used to encode the contextual cues for visual relationship
recognition. Along this line, Zellers et al. [42] presented a novel model, Stacked Motif Network (MOTIFNET), which uses LSTMs
to create a contextualized representation of each object. Dhingra et al. [55] proposed an object communication module based on
a bi-directional GRU layer and used two different transformer encoders to further refine the object features and gather
information for the edges. The Counterfactual critic Multi-Agent Training (CMAT) approach [116] is another important extension
where an agent represents a
Attention learns the contextual features using graph parsing. Yang et al. [58] proposed Graph R-CNN, based on the graph
convolutional neural network (GCN) [59], which can be factorized into three logical stages: (1) produce a set of localized
object regions, (2) utilize a relation proposal network (RePN) that learns to efficiently compute relatedness scores between
object pairs, which are used to intelligently prune unlikely scene graph connections, and (3) apply an attentional graph
convolution network (aGCN) to propagate higher-order context throughout the sparse graph. In the aGCN, for a target node i in
the graph, the representations of its neighboring nodes {z_j | j ∈ N(i)} are first transformed via a learned linear
transformation W. Then, these transformed representations are gathered with predetermined weights α, followed by a nonlinear
function σ (ReLU). This layer-wise propagation can be written as:

    z_i^{(l+1)} = σ( z_i^{(l)} + Σ_{j∈N(i)} α_{ij} W z_j^{(l)} )        (7)
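Eq. (7) is a standard attentional graph-convolution update. The PyTorch sketch below implements one such layer for a dense
attention matrix alpha; it is a generic illustration of the update rule, not the released Graph R-CNN code.

    import torch
    import torch.nn as nn

    class AttentionalGCNLayer(nn.Module):
        """One propagation step of Eq. (7): z_i <- sigma(z_i + sum_j alpha_ij * W z_j)."""
        def __init__(self, dim):
            super().__init__()
            self.W = nn.Linear(dim, dim, bias=False)

        def forward(self, z, alpha):
            # z:     (N, dim) node representations
            # alpha: (N, N) attention weights; alpha[i, j] = 0 if j is not a neighbour of i
            messages = alpha @ self.W(z)      # gather transformed neighbour features
            return torch.relu(z + messages)

    layer = AttentionalGCNLayer(dim=256)
    z_next = layer(torch.randn(10, 256), torch.softmax(torch.randn(10, 10), dim=-1))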
the large variety in their appearance, which depends on the involved entities. Second is to handle the scarcity of training
data for zero-shot visual relation triplets. Visual embedding approaches aim at learning a compositional representation for
subject, object and predicate by learning separate visual-language embedding spaces, where each of these entities is mapped
close to the language embedding of its associated annotation. By constructing a mathematical relationship between the
visual-semantic embeddings of subject, predicate and object, an end-to-end architecture can be built and trained to learn a
visual translation vector for prediction. In this section, we divide the visual translation embedding methods according to the
translations (as illustrated in Fig.7), including Translation between Subject and Object, and Translation among Subject,
Object and Predicate.
Translation Embedding among Subject, Object and Predicate: In an extension of VTransE, Hung et al. [69] proposed the Union
Visual Translation Embedding network (UVTransE), which learns three projection matrices W_s, W_o, W_u that map the respective
feature vectors of the bounding boxes enclosing the subject, the object, and the union of subject and object into a common
embedding space, as well as translation vectors t_p (to be consistent with VTransE) in the same space corresponding to each of
the predicate labels present in the dataset. Another extension is ATR-Net (Attention-Translation-Relation Network), proposed by
Gkanatsios et al. [70], which projects the visual features of the subject, the object region and their union into a score
space as S, O and P, guided by multi-head language and spatial attention. Letting A denote the attention matrix of all
predicates, Eq. 9 can then be reformulated accordingly.
Another TransE-inspired model is RLSV (Representation Learning via Jointly Structural and Visual Embedding) [112]. The
architecture of RLSV is a three-layered hierarchical projection that projects a visual triple onto the attribute space, the
relation space, and the visual space in order. This makes the subject (head entity) and the object (tail entity), both packed
with attributes, projected onto the same space as the relation, instantiated, and translated by the relation vector. RLSV
jointly combines the structural embeddings and the visual embeddings of a visual triple t = (s, r, o) into new representations
(x_s, x_r, x_o) and scores it as follows:

    E_I(t) = ||x_s + x_r − x_o||_{L1/L2}        (15)
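The scoring functions used by these TransE-inspired models share the same template: project subject, predicate and object into
a common space and measure how well the subject embedding plus a relation (translation) vector matches the object embedding. A
minimal sketch of this Eq. (15)-style scoring, with assumed pre-projected embeddings and not tied to any specific published
method, is:

    import numpy as np

    def translation_score(x_s, x_r, x_o, norm="L2"):
        """E(t) = ||x_s + x_r - x_o|| for a visual triple t = (s, r, o); lower is better."""
        residual = x_s + x_r - x_o
        if norm == "L1":
            return np.abs(residual).sum()
        return np.sqrt((residual ** 2).sum())

    # Projected embeddings of a candidate triple, e.g., x_s = W_s @ f_s and so on.
    x_s, x_r, x_o = np.random.rand(3, 128)
    score = translation_score(x_s, x_r, x_o)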
Fig. 8: Examples of video visual relations: two visual relation instances with their relationship triplets and the object
trajectories of the subjects and objects. (From [101]).

Different from static images, and because of the additional temporal channel, dynamic relationships in videos are often
correlated in both the spatial and temporal dimensions. All the relationships in a video can collectively form a
spatio-temporal graph structure, as mentioned in [99], [102], [145], [171]. Therefore, we redefine VidVRD as Spatio-Temporal
Scene Graph Generation (ST-SGG). To be consistent with the definition of a 2D scene graph, we also define a spatio-temporal
scene graph as a set of visual triplets R_S. However, for each r_{S,i→j} = ⟨s_{S,i}, p_{S,i→j}, o_{S,j}⟩ ∈ R_S, s_{S,i} =
(l_{S,k1}, T_s) and o_{S,j} = (l_{S,k2}, T_o) both have a trajectory (T_s and T_o, respectively) rather than a fixed bounding
box. Specifically, T_s and T_o are two sequences of bounding boxes which respectively enclose the subject and the object within
the maximal duration of the visual relation. Therefore, VidVRD aims to detect each entire visual relation instance with one
bounding box trajectory.
ST-SGG relies on video object detection (VOD). The mainstream methods address VOD by integrating the latest techniques in both
image-based object detection and multi-object tracking [175], [176], [177]. Although recent sophisticated deep neural networks
have achieved superior performance in image object detection [17], [19], [173], [174], object detection in videos still suffers
from low accuracy, because of the presence of blur, camera motion and occlusion in videos, which hamper accurate object
localization with bounding box trajectories. Inevitably, these problems propagate to downstream video relationship detection
and are even amplified.
Shang et al. [101] first proposed the VidVRD task and introduced a basic pipeline solution, which adopts a bottom-up strategy.
Subsequent models almost always use this pipeline, which decomposes the VidVRD task into three independent parts: multi-object
tracking, relation prediction, and relation instance association. They first split videos into segments of a fixed duration and
predict visual relations between co-occurrent short-term object tracklets for each video segment. Then they generate complete
relation instances by a greedy association procedure. Their object tracklet proposal is implemented based on a video object
detection method similar to [178] on each video segment. The relation prediction process consists of two steps: relationship
feature extraction and relationship modeling. Given a pair of object tracklet proposals (T_s, T_o) in a segment, they (1)
extract improved dense trajectory (iDT) features [179] with HoG, HoF and MBH in the video segments, which capture both the
motion and the low-level visual characteristics; (2) extract the relative characteristics between T_s and T_o, which describe
the relative position, size and motion between the two objects; and (3) add the classeme feature [68]. The concatenation of
these three types of features forms the overall relationship feature vector, which is fed into three predictors to classify the
observed relation triplets. The dominant way to get the final video-level relationships is greedy local association, which
greedily merges two adjacent segments if they contain the same relation.
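The greedy local association step can be summarized as: scan the segments in temporal order and extend an existing video-level
instance whenever the adjacent segment predicts the same ⟨subject, predicate, object⟩, otherwise start a new instance. The
sketch below illustrates this merging logic on a simplified representation where each segment yields a set of predicted
triplets; it is a schematic illustration of the strategy, not the reference implementation.

    def greedy_association(segment_detections):
        """segment_detections: list over segments; each item is a set of
        (subject_label, predicate, object_label) triplets detected in that segment.
        Returns video-level instances as (triplet, start_segment, end_segment)."""
        active, finished = {}, []
        for t, triplets in enumerate(segment_detections):
            still_active = {}
            for triplet in triplets:
                if triplet in active:                  # same relation in the adjacent segment
                    still_active[triplet] = active.pop(triplet)   # extend the existing instance
                else:
                    still_active[triplet] = t          # start a new instance
            # Instances whose relation disappeared in this segment are terminated.
            finished.extend((tri, start, t - 1) for tri, start in active.items())
            active = still_active
        finished.extend((tri, start, len(segment_detections) - 1) for tri, start in active.items())
        return finished

    video = [{("dog", "chase", "frisbee")}, {("dog", "chase", "frisbee")}, {("dog", "bite", "frisbee")}]
    print(greedy_association(video))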
Tsai et al. [102] proposed a Gated Spatio-Temporal Energy Graph (GSTEG) that models the spatial and temporal structure of the
relationship entities in a video by a spatio-temporal fully-connected graph, where each node represents an entity and each edge
denotes the statistical dependency between the connected nodes. It also utilizes an energy function with adaptive
parameterization to accommodate the diversity of relations, and achieves state-of-the-art performance. The construction of the
graph is realized by linking all segments as a Markov Random Field (MRF) conditioned on a global observation.
Shang et al. [144] published another dataset, VidOR, and launched the ACM MM 2019 Video Relation Understanding (VRU)
Challenge² to encourage researchers to explore visual relationships in videos [142]. In this challenge, Zheng et al. [141]
used the Deep Structural Ranking (DSR) [24] model to predict relations. Different from the pipeline in [101], they associate
the short-term preliminary trajectories before relation prediction by using a sliding-window method to locate the endpoint
frames of a relationship triplet, rather than performing relational association at the end. Similarly, Sun et al. [143] also
associate the preliminary trajectories up front, applying a kernelized correlation filter (KCF) tracker to extend the
preliminary trajectories generated by Seq-NMS in a concurrent way and generate complete object trajectories that further
associate the short-term ones.
2. https://videorelation.nextcenter.org/mm19-gdc/

3.3 3D Scene Graph Generation
Classic computer vision methods aim to recognize objects and scenes in static images with the use of a mathematical model or
statistical learning, and then progress to motion recognition, target tracking, action recognition, etc. in videos. The
ultimate goal is to accurately obtain the shapes, positions and attributes of objects in three-dimensional space, so as to
realize detection, recognition, tracking and interaction of objects in the real world. In the computer vision field, one of
the most important branches of 3D research is the representation of 3D information. The common 3D representations are multiple
views, point clouds, polygonal meshes, wireframe meshes and voxels of various resolutions. To extend the concept of the scene
graph to 3D space, researchers are trying to design a structured text representation to encode 3D information. Although
existing
TABLE 2: The statistics of common VG versions.

Dataset       | Pred. Classes | Obj. Classes | Total Images | Train Images | Test Images
VG150 [54]    | 50            | 150          | 108,077      | 75.6k        | 32.4k
VG200 [68]    | 100           | 200          | 99,658       | 73.8k        | 25.8k
sVG [49]      | 24            | 399          | 108,077      | 64.7k        | 8.7k
VG-MSDN [41]  | 50            | 150          | 95,998       | 71k          | 25k
VG80k [15]    | 29,086        | 53,304       | 104,832      | 99.9k        | 4.8k

duplicate bounding boxes per image. The benchmark uses the most frequent 150 object categories and 50 predicates for evaluation. As a result, each image has a scene graph of around 11.5 objects and 6.2 relationships.
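As an illustration of the kind of per-image annotation behind these statistics, the following Python sketch shows a simplified scene-graph record and how the per-image averages quoted above can be computed. The field names are assumptions for illustration and do not reproduce the official Visual Genome annotation format.

```python
# Simplified per-image scene graph annotation: boxes with labels, plus
# subject-predicate-object triplets expressed as indices into the object list.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraphAnnotation:
    image_id: int
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # (x1, y1, x2, y2)
    labels: List[int] = field(default_factory=list)  # object class ids (150 classes in VG150)
    relations: List[Tuple[int, int, int]] = field(default_factory=list)  # (subj_idx, predicate_id, obj_idx)

def per_image_averages(annos: List[SceneGraphAnnotation]) -> Tuple[float, float]:
    """Average number of objects and relationships per image
    (roughly 11.5 and 6.2 on VG150 according to the text)."""
    n = max(len(annos), 1)
    return (sum(len(a.labels) for a in annos) / n,
            sum(len(a.relations) for a in annos) / n)
```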
VrR-VG [156] is also based on Visual Genome. Its pre-processing aims at reducing duplicate relationships by hierarchical clustering and filtering out visually-irrelevant relationships. As a result, the dataset keeps the top 1,600 objects and 117 visually-relevant relationships of Visual Genome. Their hypothesis for identifying visually-irrelevant relationships is that if a relationship label in different triplets is predictable according to any information except visual information, the relationship is visually-irrelevant. This definition is a bit far-fetched but helps to eliminate redundant relationships.
Open Images [10] is a dataset of 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. The images are very diverse and often contain complex scenes with several objects (8.3 per image on average). It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations. The boxes have largely been manually drawn by professional annotators to ensure accuracy and consistency. Open Images also offers visual relationship annotations, indicating pairs of objects in particular relations (e.g., "woman playing guitar", "beer on table"), object properties (e.g., "table is wooden"), and human actions (e.g., "woman is jumping"). In total, it has 3.3M annotations from 1,466 distinct relationship triplets. So far, there are six released versions available on the official website, and [10] describes Open Images V4 in detail, i.e., from the data collection and annotation to the detailed statistics about the data and the evaluation of models trained on it.
UnRel [128] is a challenging dataset that contains 1,000 images collected from the web with 76 unusual language triplet queries such as "person ride giraffe". All images are annotated at box-level for the given triplet queries. Since the triplet queries of UnRel are rare (and thus likely not seen at training), it is often used to evaluate the generalization performance of an algorithm.
SpatialSense [159] is a dataset specializing in spatial relation recognition. A key feature of the dataset is that it is constructed through adversarial crowdsourcing: a human annotator is asked to come up with adversarial examples to confuse a recognition system.
SpatialVOC2K [158] is the first multilingual image dataset with spatial relation annotations and object features for image-to-text generation. It consists of all 2,026 images with 9,804 unique object pairs from the PASCAL VOC2008 dataset. For each image, they provided additional annotations for each ordered object pair, i.e., (a) the single best preposition, and (b) all possible prepositions that correctly describe the spatial relationship between the objects. The preposition set contains 17 English prepositions and 17 French prepositions.

4.2 Video Datasets
The area of video relation understanding aims at promoting novel solutions and research on the topics of object detection, object tracking, action recognition, relation detection and spatio-temporal analysis, which are integral parts of a comprehensive visual system of the future. So far, there are two public datasets for video relation understanding.
ImageNet-VidVRD [101] is the first video visual relation detection dataset, which is constructed by selecting 1,000 videos from the training set and the validation set of ILSVRC2016-VID [163]. Based on the 1,000 videos, the object categories increase to 35. It contains a total of 3,219 relationship triplets (i.e., the number of visual relation types) with 132 predicate categories. All videos were decomposed into segments of 30 frames with 15 overlapping frames in advance, and all the predicates appearing in each segment were labeled to obtain segment-level visual relation instances.
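The 30-frame, 15-frame-overlap decomposition can be sketched as follows; this is a minimal illustration (the handling of a trailing partial window is an assumption), not the dataset's preprocessing code.

```python
from typing import List, Tuple

def split_into_segments(num_frames: int,
                        segment_length: int = 30,
                        overlap: int = 15) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) windows, end exclusive; stride = length - overlap."""
    stride = segment_length - overlap
    segments, start = [], 0
    while start + segment_length <= num_frames:
        segments.append((start, start + segment_length))
        start += stride
    return segments

# Example: a 90-frame clip yields (0, 30), (15, 45), (30, 60), (45, 75), (60, 90).
print(split_into_segments(90))
```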
VidOR [144] consists of 10,000 user-generated videos (98.6 hours) together with dense annotations on 80 categories of objects and 50 categories of predicates. The whole dataset is divided into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing. All the annotated
categories of objects and predicates appear in each of the train/val/test sets. Specifically, objects are annotated with a bounding-box trajectory to indicate their spatio-temporal locations in the videos, and relationships are temporally annotated with start and end frames. The videos were selected from the YFCC-100M multimedia collection, and the average length of the videos is about 35 seconds. The relations are divided into two types, spatial relations (8 categories) and action relations (42 categories), and the annotation method is different for the two types of relations.

4.3 3D Datasets
Three-dimensional data is usually provided via representations such as multi-view images, point clouds, meshes, or voxels. Recently, several 3D datasets related to scene graphs have been released to satisfy the needs of SGG study.
3D Scene Graph is constructed by annotating the Gibson Environment Database [160] using the automated 3D scene graph generation pipeline proposed in [44]. Gibson's underlying database of spaces includes 572 full buildings composed of 1,447 floors covering a total area of 211 km². It is collected from real indoor spaces using 3D scanning and reconstruction and provides the corresponding 3D mesh model of each building. Meanwhile, for each space, the RGB images, depth and surface normals are provided. A fraction of the spaces is annotated with semantic objects.
3DSGG, proposed in [147], is a large-scale 3D dataset that extends 3RScan with semantic scene graph annotations, containing relationships, attributes and class hierarchies. A scene graph here is a set of tuples (N, R) between nodes N and edges R. Each node is defined by a hierarchy of classes c = (c1, · · · , cd) and a set of attributes A that describe the visual and physical appearance of the object instance. The edges define the semantic relations between the nodes. This representation shows that a 3D scene graph can easily be rendered to 2D.
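The node-and-edge structure just described can be sketched as a simple data structure. The following Python sketch is illustrative only; the field names and the query helper are assumptions, not the data format actually released with the dataset.

```python
# A 3D scene graph as a set of nodes (with a class hierarchy and attributes)
# and directed edges carrying semantic relation labels.
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Node3D:
    node_id: int
    class_hierarchy: Tuple[str, ...]                   # c = (c1, ..., cd), coarse to fine
    attributes: Set[str] = field(default_factory=set)  # e.g., {"wooden", "brown"}

@dataclass
class SceneGraph3D:
    nodes: Dict[int, Node3D] = field(default_factory=dict)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)  # (subject_id, relation, object_id)

    def relations_of(self, node_id: int) -> List[Tuple[int, str, int]]:
        """All edges in which the given node participates."""
        return [e for e in self.edges if node_id in (e[0], e[2])]
```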
that extends 3RScan with semantic scene graph annotations, There are also some paper-specific task settings includ-
containing relationships, attributes and class hierarchies. A ing Triple Detection [87], Relation Retrieval [68] and so
scene graph here is a set of tuples (N, R) between nodes N on.
and edges R. Each node is defined by a hierarchy of classes In the video based visual relationship detection task,
c = (c1 , · · · , cd ) and a set of attributes A that describe the there are two standard evaluation modes: Relation De-
visual and physical appearance of the object instance. The tection and Relation Tagging. The detection task aims to
edges define the semantic relations between the nodes. This generate a set of relationship triplets with tracklet proposals
representation shows that a 3D scene graph can easily be from a given video, while the tagging task only considers
rendered to 2D. the accuracy of the predicted video relation triplets and
ignores the object localization results.
5 P ERFORMANCE E VALUATION
In this section, we first introduce some commonly used 5.2 Metrics
evaluation modes and criteria for the scene graph genera- Recall@K. The conventional metric for the evaluation of
tion task. Then, we provide the quantitative performance of SGG is the image-level Recall@K(R@K), which com-
the promising models on popular datasets. Since there is no putes the fraction of times the correct relationship is pre-
uniform definition of a 3D scene graph, we will introduce dicted in the top K confident relationship predictions. In
these contents around 2D scene graph and spatio-temporal addition to the most commonly used R@50 and R@100,
scene graph. some works also use the more challenging R@20 for a
more comprehensive evaluation. Some methods compute
5.1 Tasks R@K with the constraint that merely one relationship can be
obtained for a given object pair. Some other works omit this
Given an image, the scene graph generation task consists of
constraint so that multiple relationships can be obtained,
localizing a set of objects, classifying their category labels,
leading to higher values. There is a superparameter k ,
and predicting relations between each pair of these objects.
often not clearly stated in some works, which measures the
Most prior works often evaluated their SGG models on
maximum predictions allowed per object pair. Most works
several of the following common sub-tasks. We preserve the
have seen PhrDet as a multiclass problem and they use
names of tasks as defined in [28] and [54] here, despite the
k = 1 to reward the correct top-1 prediction for each pair.
inconsistent terms used in other papers and the inconsisten-
While other works [63], [85], [105] tackle this as a multilabel
cies on whether they are in fact classification or detection
problem and they use a k equal to the number of predicate
tasks.
classes to allow for predicate co-occurrences [70]. Some
1) Phrase Detection (PhrDet) [28]: Outputs a label works [42], [70], [90], [129], [164] have also identified this
subject-predicate-object and localizes the entire rela- inconsistency and interpret it as whether there is graph con-
tionship in one bounding box with at least 0.5 straint (i.e., the k is the maximum number of edges allowed
between a pair of object nodes). The unconstrained metric (i.e., no graph constraint) evaluates models more reliably, since it does not require a perfect triplet match to be the top-1 prediction, which is an unreasonable expectation in a dataset with plenty of synonyms and mislabeled annotations. For example, 'man wearing shirt' and 'man in shirt' are similar predictions; however, only the unconstrained metric allows both to be included in the ranking. Obviously, the SGGen+ metric above has a similar motivation to removing the graph constraint. Gkanatsios et al. [70] re-formulated the metric as Recall_k@K (R_k@K): k = 1 is equivalent to "graph constraints" and a larger k to "no graph constraints", also expressed as ngR_k@K. For n examined subject-object pairs in an image, Recall_k@K keeps the top-k predictions per pair and examines the K most confident out of the nk total.
Given a set of ground truth triplets, GT, the image-level R@K is computed as:

R@K = |TopK ∩ GT| / |GT|,   (16)

where TopK is the set of the top-K triplets extracted from the entire image based on the ranked predictions of a model [169]. However, in the PredCls setting, which is actually a simple classification task, R@K degenerates into the triplet-level Recall@K (R_tr@K). R_tr@K is similar to the top-K accuracy. Furthermore, Knyazev et al. [169] proposed the weighted triplet Recall (wR_tr@K), which computes a recall at each triplet and reweights the average result based on the frequency of the GT in the training set:

wR_tr@K = Σ_{t=1}^{T} w_t [rank_t ≤ K],   (17)

where T is the number of all test triplets, [·] is the Iverson bracket, w_t = (1/(n_t + 1)) / Σ_t 1/(n_t + 1) ∈ [0, 1], and n_t is the number of occurrences of the t-th triplet in the training set. It is friendly to infrequent instances, since frequent triplets (with high n_t) are downweighted proportionally. To speak for all predicates rather than very few trivial ones, Tang et al. [78] and Chen et al. [31] proposed meanRecall@K (mR@K), which retrieves each predicate separately and then averages R@K over all predicates.
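The following sketch pulls these definitions together for a single image: image-level R@K with an optional per-pair cap k (the graph constraint) and mR@K averaged over predicate classes. It is an illustrative implementation under simplifying assumptions (predictions carry scores, boxes and labels; matching is delegated to a criterion such as the sggen_match sketch above), not the evaluation code of any particular benchmark.

```python
# Image-level Recall@K and meanRecall@K as in Eq. (16) and the surrounding text.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[float, Dict]  # (confidence score, triplet dict with "labels", "sbox", "obox")

def recall_at_k(predictions: List[Prediction], gts: List[Dict], K: int,
                matched: Callable[[Dict, Dict], bool], k_per_pair: int = 1) -> float:
    # Graph constraint: keep at most k predictions per subject-object pair.
    # (Real evaluators group by detected object-pair indices rather than raw boxes.)
    per_pair = defaultdict(list)
    for score, triplet in predictions:
        per_pair[(triplet["sbox"], triplet["obox"])].append((score, triplet))
    kept = [p for plist in per_pair.values()
            for p in sorted(plist, key=lambda x: -x[0])[:k_per_pair]]
    topk = [t for _, t in sorted(kept, key=lambda x: -x[0])[:K]]
    hits = sum(any(matched(p, g) for p in topk) for g in gts)
    return hits / max(len(gts), 1)

def mean_recall_at_k(predictions: List[Prediction], gts: List[Dict], K: int,
                     matched: Callable[[Dict, Dict], bool],
                     predicate_classes: List[int], k_per_pair: int = 1) -> float:
    # mR@K: compute R@K restricted to each predicate's ground truths, then average.
    recalls = []
    for pred_class in predicate_classes:
        gts_p = [g for g in gts if g["labels"][1] == pred_class]
        if gts_p:
            recalls.append(recall_at_k(predictions, gts_p, K, matched, k_per_pair))
    return sum(recalls) / max(len(recalls), 1)
```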
Notably, there is an inconsistency in Recall's definition on the entire test set: whether it is a micro- or macro-Recall [70]. Let N be the number of testing images and GT_i the ground-truth relationship annotations in image i. Then, having detected TP_i = TopK_i ∩ GT_i true positives in image i, micro-Recall micro-averages these positives as (Σ_{i=1}^{N} |TP_i|) / (Σ_{i=1}^{N} |GT_i|) to reward correct predictions across the dataset. Macro-Recall, computed as (1/N) Σ_{i=1}^{N} |TP_i| / |GT_i|, macro-averages the detections in terms of images. Early works use micro-Recall on VRD and macro-Recall on VG150, but later works often use the two types interchangeably and without consistency.
Zero-Shot Recall@K. Zero-shot relationship learning was proposed by Lu et al. [28] to evaluate the performance of detecting zero-shot relationships. Due to the long-tailed relationship distribution in the real world, it is a practical setting to evaluate the extensibility of a model, since it is difficult to build a dataset with every possible relationship. Besides, a single wR_tr@K value can show zero- or few-shot performance linearly aggregated for all n ≥ 0.
Precision@K. In the video relation detection task, Precision@K (P@K) is used to measure the accuracy of the tagging results for the relation tagging task.
mAP. In the OpenImages VRD Challenge, results are evaluated by calculating Recall@50 (R@50), the mean AP of relationships (mAP_rel), and the mean AP of phrases (mAP_phr) [129]. The mAP_rel evaluates the AP of ⟨s, p, o⟩ triplets where both the subject and object boxes have an IoU of at least 0.5 with the ground truth. The mAP_phr is similar, but applied to the enclosing relationship box. mAP would penalize a prediction if that particular ground truth annotation does not exist. Therefore, it is a strict metric, because we cannot exhaustively annotate all possible relationships in an image.

5.3 Quantitative Performance
We present the quantitative performance on the Recall@K metric of some representative methods on several commonly used datasets in Tables 3-4. We preserve the respective task settings and task names for each dataset, though SGGen on VG150 is the same as RelDet on the others. ‡ denotes that the experimental results are under "no graph constraints". By comparing Table 3 and Table 4, we notice that only a few of the proposed methods have been simultaneously verified on both the VRD and VG150 datasets. The performance of most methods on VG150 is better than that on the VRD dataset, because VG150 has been cleaned and enhanced. Experimental results on VG150 can better reflect the performance of different methods; therefore, several recently proposed methods have adopted VG150 to compare their performance metrics with other techniques.
Recently, two novel techniques, i.e., SABRA [197] and HET [195], have achieved SOTA performance for PhrDet and RelDet on VRD, respectively. SABRA enhanced the robustness of the training process of the proposed model by subdividing negative samples, while HET followed the intuitive perspective that the more salient the object, the more important it would be for the scene graph.
On VG150, excellent performances have been achieved by models using language priors, especially RiFa [109]. In particular, RiFa has achieved good results on the unbalanced data distribution by mining the deep semantic information of the objects and relations in triplets. SGRN [53] generates the initial scene graph structure using the semantic information, to ensure that its information transmission process accepts the positive influence from the semantic information. Theoretically, commonsense knowledge can greatly improve the performance, but in practice, several models that use prior knowledge have unsatisfactory performance. We believe the main reason is the difficulty of extracting and using effective knowledge information in the scene graph generation model. GB-Net [95] has paid attention to this problem and achieved good results in PredDet and PhrDet by establishing connections between the scene graph and the knowledge graph, which can effectively use commonsense knowledge.
Due to the long-tail effect of visual relationships, it is hard to collect images for all the possible relationships. It is therefore crucial for a model to have the generalizability to detect zero-shot relationships. The VRD dataset contains 1,877 relationships that only exist in the test set. Some researchers
TABLE 3: Performance summary of some representative methods on the VRD dataset.
TABLE 4: Performance summary of some representative methods on the VG150 dataset.
TABLE 5: Performance summary of some representative methods for zero-shot visual relationship detection on the VRD dataset.
TABLE 7: Performance for standard video relation detection and video relation tagging on the ImageNet-VidVRD dataset [171].
the general SGG task ill-posed. Providing a well-defined relationship set is therefore one of the key challenges of the SGG task.
The fifth challenge is the evaluation metric. Even though many evaluation metrics are used to assess the performance of the proposed networks, and Recall@K or meanRecall@K are common and widely adopted, none of them can provide perfect statistics on how well a model performs on the SGG task. When Recall@50 equals 100, does that mean that the model generates the perfect scene graph for an image? Of course not. The existing evaluation metrics only reveal relative performance, especially at the current research stage. As the research on SGG progresses, the evaluation metrics and benchmark datasets will continue to pose a great challenge.

6.2 Opportunities
The community has published hundreds of scene graph models and has obtained a wealth of research results. We think there are several avenues for future work. Researchers will be motivated to explore more models as a result of the above challenges. Besides, on the one hand, from the learning point of view, building a large dataset with fine-grained labels and accurate annotations is necessary and significant. It should contain as many scenes as possible, preferably constructed by computer vision experts. The models trained on such a dataset would achieve better performance on visual semantics and develop a broader understanding of our visual world. However, this is a very challenging and expensive task. On the other hand, from the application point of view, we can design models by subdividing the scene to reduce the imbalance of the relationship distribution. Obviously, the categories and probability distributions of visual relationships are different in different scenarios; of course, even the types of objects are different. As a result, we can design relationship detection models for different scenarios and employ ensemble learning methods to promote scene graph generation applications.
Another area of research is 3D scene graphs. An initial step is to define an effective and unified 3D scene graph structure, along with what information it should encode. A 2D image is a two-dimensional projection of a 3D world scene taken from a specific viewpoint. It is the specific viewpoint that makes some descriptions of spatial relationships in 2D images meaningful. Taking the triplet ⟨woman, is behind, fire hydrant⟩ in Fig. 1 as an example, the relation "is behind" makes sense because of the viewpoint. But how can a relation be defined as "is behind" in 3D scenes without a given viewpoint? Therefore, the challenge is how to define such spatial semantic relationships in 3D scenes without simply resorting to 2.5D scenes (for example, RGB-D data captured from a specific viewpoint). Armeni et al. augment the basic scene graph structure with essential 3D information and generate a 3D scene graph which extends the scene graph to 3D space and grounds semantic information there [44], [147]. However, their proposed structure representation does not have expansibility and generality. Second, because 3D information can be grounded in many storage formats that are fragmented into specific types based on the visual modality (e.g., RGB-D, point clouds, 3D mesh/CAD models, etc.), the presentation and extraction of 3D semantic information pose technological challenges.

7 CONCLUSION
This paper provides a comprehensive survey of the developments in the field of scene graph generation using deep learning techniques. We first introduced the representative works on 2D scene graphs, spatio-temporal scene graphs and 3D scene graphs in different sections, respectively. Furthermore, we provided a summary of some of the most widely used datasets for visual relationship and scene graph generation, which are grouped into 2D images, video, and 3D representations, respectively. The performance of different approaches on different datasets is also compared. Finally, we discussed the challenges, problems and opportunities in scene graph generation research. We believe this survey can promote more in-depth research ideas on SGG.

REFERENCES
[1] J. Johnson, R. Krishna, M. Stark, L. J. Li, D. A. Shamma, M. S. Bernstein, and F. F. Li, "Image retrieval using scene graphs," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3668–3678.
[2] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, "Relation distillation networks for video object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7023–7032.
[3] L. Gao, B. Wang, and W. Wang, "Image captioning with scene-graph based semantic concepts," in Proceedings of the 2018 10th International Conference on Machine Learning and Computing, 2018, pp. 225–229.
[4] L. Li, Z. Gan, Y. Cheng, and J. Liu, "Relation-aware graph attention network for visual question answering," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10313–10322.
[5] C. Zhang, W.-L. Chao, and D. Xuan, "An empirical study on leveraging scene graphs for visual question answering," arXiv preprint arXiv:1907.12133, 2019.
[6] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
[7] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning, "Generating semantically precise scene graphs from textual descriptions for improved image retrieval," in Proceedings of the Fourth Workshop on Vision and Language, 2015, pp. 70–80.
[8] J. Johnson, A. Gupta, and L. Fei-Fei, "Image generation from scene graphs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1219–1228.
[9] G. Mittal, S. Agrawal, A. Agarwal, S. Mehta, and T. Marwah, "Interactive image generation using scene graphs," arXiv preprint arXiv:1905.03743, 2019.
[10] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig et al., "The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale," International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
[11] J. Zhang, K. Shih, A. Tao, B. Catanzaro, and A. Elgammal, "An interpretable model for scene graph generation," arXiv preprint arXiv:1811.09543, 2018.
[12] X. Yang, K. Tang, H. Zhang, and J. Cai, "Auto-encoding scene graphs for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10685–10694.
[13] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, "Factorizable net: an efficient subgraph-based framework for scene graph generation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 335–351.
[14] G. Gkioxari, R. Girshick, P. Dollár, and K. He, "Detecting and recognizing human-object interactions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8359–8367.
[15] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, [36] Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure inference
and M. Elhoseiny, “Large-scale visual relationship understand- net: Object detection using scene-level context and instance-level
ing,” in Proceedings of the AAAI Conference on Artificial Intelligence, relationships,” in Proceedings of the IEEE conference on computer
vol. 33, 2019, pp. 9185–9194. vision and pattern recognition, 2018, pp. 6985–6994.
[16] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu, “Learning human- [37] S. A. Taghanaki, K. Abhishek, J. P. Cohen, J. Cohen-Adad, and
object interactions by graph parsing neural networks,” in Proceed- G. Hamarneh, “Deep semantic segmentation of natural and
ings of the European Conference on Computer Vision (ECCV), 2018, medical images: A review,” Artificial Intelligence Review, pp. 1–42,
pp. 401–417. 2020.
[17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards [38] J. Li and J. Z. Wang, “Automatic linguistic indexing of pictures
real-time object detection with region proposal networks,” in by a statistical modeling approach,” IEEE Transactions on pattern
Advances in neural information processing systems, 2015, pp. 91–99. analysis and machine intelligence, vol. 25, no. 9, pp. 1075–1088, 2003.
[18] D.-J. Kim, J. Choi, T.-H. Oh, and I. S. Kweon, “Dense relational [39] R. Grzeszick and G. A. Fink, “Zero-shot object prediction us-
captioning: Triple-stream networks for relationship-based cap- ing semantic scene knowledge,” arXiv preprint arXiv:1604.07952,
tioning,” in Proceedings of the IEEE Conference on Computer Vision 2016.
and Pattern Recognition, 2019, pp. 6271–6280. [40] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert,
[19] J. Redmon and A. Farhadi, “Yolov3: An incremental improve- “An empirical study of context in object detection,” in 2009 IEEE
ment,” arXiv preprint arXiv:1804.02767, 2018. Conference on computer vision and Pattern Recognition, 2009, pp.
[20] T. Wang, R. M. Anwer, M. H. Khan, F. S. Khan, Y. Pang, L. Shao, 1271–1278.
and J. Laaksonen, “Deep contextual attention for human-object
[41] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene
interaction detection,” in Proceedings of the IEEE International
graph generation from objects, phrases and region captions,” in
Conference on Computer Vision, 2019, pp. 5694–5702.
Proceedings of the IEEE International Conference on Computer Vision,
[21] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, 2017, pp. 1261–1270.
“Encoder-decoder with atrous separable convolution for seman-
tic image segmentation,” in Proceedings of the European conference [42] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs:
on computer vision (ECCV), 2018, pp. 801–818. Scene graph parsing with global context,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018,
[22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional
pp. 5831–5840.
networks for biomedical image segmentation,” in International
Conference on Medical image computing and computer-assisted inter- [43] M. A. Sadeghi and A. Farhadi, “Recognition using visual
vention, 2015, pp. 234–241. phrases,” in Proceedings of the IEEE Conference on Computer Vision
[23] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, and Pattern Recognition, 2011, pp. 1745–1752.
Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance [44] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and
segmentation,” in Proceedings of the IEEE conference on computer S. Savarese, “3d scene graph: A structure for unified semantics,
vision and pattern recognition, 2019, pp. 4974–4983. 3d space, and camera,” in Proceedings of the IEEE International
[24] K. Liang, Y. Guo, H. Chang, and X. Chen, “Visual relationship Conference on Computer Vision, 2019, pp. 5664–5673.
detection with deep structural ranking,” in Proceedings of the [45] S. Woo, D. Kim, D. Cho, and I. S. Kweon, “Linknet: Relational
AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018. embedding for scene graph,” in Advances in Neural Information
[25] N. Xu, A.-A. Liu, Y. Wong, W. Nie, Y. Su, and M. Kankanhalli, Processing Systems, 2018, pp. 560–570.
“Scene graph inference via multi-scale context modeling,” IEEE [46] S. Sharifzadeh, M. Berrendorf, and V. Tresp, “Improving vi-
Transactions on Circuits and Systems for Video Technology, vol. 31, sual relation detection using depth maps,” arXiv preprint
no. 3, pp. 1031–1041, 2020. arXiv:1905.00966, 2019.
[26] Y. Li, W. Ouyang, X. Wang, and X. Tang, “Vip-cnn: Visual phrase [47] K. Kato, Y. Li, and A. Gupta, “Compositional learning for human
guided convolutional neural network,” in Proceedings of the IEEE object interaction,” in Proceedings of the European Conference on
Conference on Computer Vision and Pattern Recognition, 2017, pp. Computer Vision (ECCV), 2018, pp. 234–251.
1347–1356. [48] T. Nagarajan, C. Feichtenhofer, and K. Grauman, “Grounded
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning human-object interaction hotspots from video,” in Proceedings
for image recognition,” in Proceedings of the IEEE conference on of the IEEE International Conference on Computer Vision, 2019, pp.
computer vision and pattern recognition, 2016, pp. 770–778. 8688–8697.
[28] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relation- [49] B. Dai, Y. Zhang, and D. Lin, “Detecting visual relationships with
ship detection with language priors,” in European Conference on deep relational networks,” in Proceedings of the IEEE conference on
Computer Vision, 2016, pp. 852–869. computer vision and Pattern recognition, 2017, pp. 3076–3086.
[29] F. Chollet, “Xception: Deep learning with depthwise separable [50] Y. Zhu and S. Jiang, “Deep structured learning for visual relation-
convolutions,” in Proceedings of the IEEE conference on computer ship detection,” in Proceedings of the AAAI Conference on Artificial
vision and pattern recognition, 2017, pp. 1251–1258. Intelligence, vol. 32, no. 1, 2018.
[30] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, [51] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual
S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual attention,” in Advances in neural information processing systems,
genome: Connecting language and vision using crowdsourced 2014, pp. 2204–2212.
dense image annotations,” International Journal of Computer Vision,
[52] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
vol. 123, no. 1, pp. 32–73, 2017.
lation by jointly learning to align and translate,” arXiv preprint
[31] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded
arXiv:1409.0473, 2014.
routing network for scene graph generation,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2019, [53] W. Liao, C. Lan, W. Zeng, M. Y. Yang, and B. Rosenhahn,
pp. 6163–6171. “Exploring the semantics for visual relationship detection,” arXiv
[32] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic preprint arXiv:1904.02104, 2019.
segmentation,” in Proceedings of the IEEE Conference on Computer [54] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph gener-
Vision and Pattern Recognition, 2019, pp. 9404–9413. ation by iterative message passing,” in Proceedings of the IEEE
[33] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, Conference on Computer Vision and Pattern Recognition, 2017, pp.
“Densely connected convolutional networks,” in Proceedings of 5410–5419.
the IEEE conference on computer vision and pattern recognition, 2017, [55] N. Dhingra, F. Ritter, and A. Kunz, “Bgt-net: Bidirectional gru
pp. 4700–4708. transformer network for scene graph generation,” in Proceedings
[34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
inception-resnet and the impact of residual connections on learn- nition, 2021, pp. 2150–2159.
ing,” in Proceedings of the AAAI Conference on Artificial Intelligence, [56] W. Cong, W. Wang, and W.-C. Lee, “Scene graph generation via
vol. 31, no. 1, 2017. conditional random fields,” arXiv preprint arXiv:1811.08075, 2018.
[35] S. Yang, G. Li, and Y. Yu, “Cross-modal relationship inference [57] B. Zhuang, L. Liu, C. Shen, and I. Reid, “Towards context-
for grounding referring expressions,” in Proceedings of the IEEE aware interaction recognition for visual relationship detection,”
Conference on Computer Vision and Pattern Recognition, 2019, pp. in Proceedings of the IEEE International Conference on Computer
4145–4154. Vision, 2017, pp. 589–598.
[58] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for [79] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic
scene graph generation,” in Proceedings of the European conference representations from tree-structured long short-term memory
on computer vision (ECCV), 2018, pp. 670–685. networks,” arXiv preprint arXiv:1503.00075, 2015.
[59] T. N. Kipf and M. Welling, “Semi-supervised classification with [80] H. Zhou, C. Hu, C. Zhang, and S. Shen, “Visual relation-
graph convolutional networks,” arXiv preprint arXiv:1609.02907, ship recognition via language and position guided attention,”
2016. in ICASSP 2019-2019 IEEE International Conference on Acoustics,
[60] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Speech and Signal Processing (ICASSP), 2019, pp. 2097–2101.
and Y. Bengio, “Graph attention networks,” arXiv preprint [81] L. Zhang, S. Zhang, P. Shen, G. Zhu, S. Afaq Ali Shah, and
arXiv:1710.10903, 2017. M. Bennamoun, “Relationship detection based on object semantic
[61] A. Dornadula, A. Narcomey, R. Krishna, M. Bernstein, and F.-F. inference and attention mechanisms,” in Proceedings of the 2019 on
Li, “Visual relationships as functions: Enabling few-shot scene International Conference on Multimedia Retrieval, 2019, pp. 68–72.
graph prediction,” in Proceedings of the IEEE International Confer- [82] T. Fukuzawa, “A problem reduction approach for visual relation-
ence on Computer Vision Workshops, 2019, pp. 0–0. ships detection,” arXiv preprint arXiv:1809.09828, 2018.
[62] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo, “Attentive relational [83] Y. Zhu, S. Jiang, and X. Li, “Visual relationship detection with
networks for mapping images to scene graphs,” in Proceedings object spatial distribution,” in 2017 IEEE International Conference
of the IEEE Conference on Computer Vision and Pattern Recognition, on Multimedia and Expo (ICME), 2017, pp. 379–384.
2019, pp. 3957–3966. [84] J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal,
[63] S. Zheng, S. Chen, and Q. Jin, “Visual relation detection with “Relationship proposal networks,” in Proceedings of the IEEE
multi-level attention,” in Proceedings of the 27th ACM International Conference on Computer Vision and Pattern Recognition, 2017, pp.
Conference on Multimedia, 2019, pp. 121–129. 5678–5686.
[64] P. Krähenbühl and V. Koltun, “Efficient inference in fully con- [85] W. Liao, B. Rosenhahn, L. Shuai, and M. Ying Yang, “Natural
nected crfs with gaussian edge potentials,” in Advances in neural language guided visual relationship detection,” in Proceedings of
information processing systems, 2011, pp. 109–117. the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2019, pp. 0–0.
[65] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-
works for semantic segmentation,” in Proceedings of the IEEE [86] F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, “Visual re-
conference on computer vision and pattern recognition, 2015, pp. lationship detection based on guided proposals and semantic
3431–3440. knowledge distillation,” in 2018 IEEE International Conference on
Multimedia and Expo (ICME), 2018, pp. 1–6.
[66] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object
parsing with graph lstm,” in European Conference on Computer [87] S. Baier, Y. Ma, and V. Tresp, “Improving visual relationship
Vision, 2016, pp. 125–143. detection using semantic modeling of scene descriptions,” in
International Semantic Web Conference, 2017, pp. 53–68.
[67] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su,
[88] J. Duan, W. Min, D. Lin, J. Xu, and X. Xiong, “Multimodal
D. Du, C. Huang, and P. H. Torr, “Conditional random fields as
graph inference network for scene graph generation,” Applied
recurrent neural networks,” in Proceedings of the IEEE international
Intelligence, no. 5, 2021.
conference on computer vision, 2015, pp. 1529–1537.
[89] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling, “Scene graph
[68] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual transla-
generation with external knowledge and image reconstruction,”
tion embedding network for visual relation detection,” in Proceed-
in Proceedings of the IEEE Conference on Computer Vision and Pattern
ings of the IEEE conference on computer vision and pattern recognition,
Recognition, 2019, pp. 1969–1978.
2017, pp. 5532–5540.
[90] R. Yu, A. Li, V. I. Morariu, and L. S. Davis, “Visual relationship
[69] Z.-S. Hung, A. Mallya, and S. Lazebnik, “Contextual translation detection with internal and external linguistic knowledge distilla-
embedding for visual relationship detection and scene graph tion,” in Proceedings of the IEEE international conference on computer
generation,” IEEE Transactions on Pattern Analysis and Machine vision, 2017, pp. 1974–1982.
Intelligence, 2020.
[91] Y. Zhan, J. Yu, T. Yu, and D. Tao, “On exploring undetermined
[70] N. Gkanatsios, V. Pitsikalis, P. Koutras, and P. Maragos, relationships for visual relationship detection,” in Proceedings of
“Attention-translation-relation network for scalable scene graph the IEEE Conference on Computer Vision and Pattern Recognition,
generation,” in Proceedings of the IEEE International Conference on 2019, pp. 5128–5137.
Computer Vision Workshops, 2019, pp. 0–0. [92] A. Bl, Z. B. Yi, and A. Xl, “Atom correlation based graph propa-
[71] N. Gkanatsios, V. Pitsikalis, P. Koutras, A. Zlatintsi, and P. Mara- gation for scene graph generation,” Pattern Recognition, 2021.
gos, “Deeply supervised multimodal attentional translation em- [93] Y. Yao, A. Zhang, X. Han, M. Li, and M. Sun, “Visual distant
beddings for visual relationship detection,” in 2019 IEEE Interna- supervision for scene graph generation,” 2021.
tional Conference on Image Processing (ICIP), 2019, pp. 1840–1844.
[94] J. Yu, Y. Chai, Y. Wang, Y. Hu, and Q. Wu, “Cogtree: Cognition
[72] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and tree loss for unbiased scene graph generation,” in Thirtieth Inter-
C. Change Loy, “Zoom-net: Mining deep feature interactions for national Joint Conference on Artificial Intelligence IJCAI-21, 2021.
visual relationship recognition,” in Proceedings of the European [95] A. Zareian, S. Karaman, and S.-F. Chang, “Bridging knowledge
Conference on Computer Vision (ECCV), 2018, pp. 322–338. graphs to generate scene graphs,” arXiv preprint arXiv:2001.02314,
[73] D. Shin and I. Kim, “Deep image understanding using multilay- 2020.
ered contexts,” Mathematical Problems in Engineering, vol. 2018, [96] A. Zareian, Z. Wang, H. You, and S.-F. Chang, “Learning visual
2018. commonsense for robust scene graph generation,” in Computer
[74] Y. Chen, Y. Wang, Y. Zhang, and Y. Guo, “Panet: A context based Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
predicate association network for scene graph generation,” in 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 642–
2019 IEEE International Conference on Multimedia and Expo (ICME), 657.
2019, pp. 508–513. [97] S. Abdelkarim, P. Achlioptas, J. Huang, B. Li, K. Church, and
[75] K. Masui, A. Ochiai, S. Yoshizawa, and H. Nakayama, “Recurrent M. Elhoseiny, “Long-tail visual relationship recognition with
visual relationship recognition with triplet unit for diversity,” a visiolinguistic hubless loss,” arXiv preprint arXiv:2004.00436,
International Journal of Semantic Computing, vol. 12, no. 04, pp. 2020.
523–540, 2018. [98] R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson,
[76] Y. Dai, C. Wang, J. Dong, and C. Sun, “Visual relationship detec- “Mapping images to scene graphs with permutation-invariant
tion based on bidirectional recurrent neural network,” Multimedia structured prediction,” Advances in Neural Information Processing
Tools and Applications, pp. 1–17, 2019. Systems, vol. 31, pp. 7211–7221, 2018.
[77] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation [99] X. Qian, Y. Zhuang, Y. Li, S. Xiao, S. Pu, and J. Xiao, “Video
learning on large graphs,” in Advances in neural information pro- relation detection with spatio-temporal graph,” in Proceedings of
cessing systems, 2017, pp. 1024–1034. the 27th ACM International Conference on Multimedia, 2019, pp. 84–
[78] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu, “Learning to com- 93.
pose dynamic tree structures for visual contexts,” in Proceedings [100] A. Newell and J. Deng, “Pixels to graphs by associative embed-
of the IEEE Conference on Computer Vision and Pattern Recognition, ding,” in Advances in neural information processing systems, 2017,
2019, pp. 6619–6628. pp. 2171–2180.
[101] X. Shang, T. Ren, J. Guo, H. Zhang, and T.-S. Chua, “Video visual [122] L. Zhou, J. Zhao, J. Li, L. Yuan, and J. Feng, “Object relation detec-
relation detection,” in Proceedings of the 25th ACM international tion based on one-shot learning,” arXiv preprint arXiv:1807.05857,
conference on Multimedia, 2017, pp. 1300–1308. 2018.
[102] Y.-H. H. Tsai, S. Divvala, L.-P. Morency, R. Salakhutdinov, and [123] J. Peyre, I. Laptev, C. Schmid, and J. Sivic, “Detecting unseen
A. Farhadi, “Video relationship reasoning using gated spatio- visual relations using analogies,” in Proceedings of the IEEE Inter-
temporal energy graph,” in Proceedings of the IEEE Conference on national Conference on Computer Vision, 2019, pp. 1981–1990.
Computer Vision and Pattern Recognition, 2019, pp. 10 424–10 433. [124] B. Li and Y. Wang, “Visual relationship detection using joint
[103] Z. Cui, C. Xu, W. Zheng, and J. Yang, “Context-dependent dif- visual-semantic embedding,” in 2018 24th International Conference
fusion network for visual relationship detection,” in Proceedings on Pattern Recognition (ICPR), 2018, pp. 3291–3296.
of the 26th ACM international conference on Multimedia, 2018, pp. [125] J. Peyre, J. Sivic, I. Laptev, and C. Schmid, “Weakly-supervised
1475–1482. learning of visual relations,” in Proceedings of the IEEE Interna-
[104] J. Jung and J. Park, “Visual relationship detection with language tional Conference on Computer Vision, 2017, pp. 5179–5188.
prior and softmax,” in 2018 IEEE international conference on image [126] H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang, “Ppr-fcn: Weakly
processing, applications and systems (IPAS), 2018, pp. 143–148. supervised visual relation detection via parallel pairwise r-fcn,”
[105] F. Plesse, A. Ginsca, B. Delezoide, and F. Prêteux, “Learning in Proceedings of the IEEE International Conference on Computer
prototypes for visual relationship detection,” in 2018 International Vision, 2017, pp. 4233–4241.
Conference on Content-Based Multimedia Indexing (CBMI), 2018, pp. [127] V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-
1–6. Fei, “Scene graph prediction with limited labels,” in Proceedings
[106] S. J. Hwang, S. Ravi, Z. Tao, H. J. Kim, M. D. Collins, and V. Singh, of the IEEE International Conference on Computer Vision, 2019, pp.
“Tensorize, factorize and regularize: Robust visual relationship 2580–2590.
learning,” 2018 IEEE/CVF Conference on Computer Vision and Pat- [128] A. Zareian, S. Karaman, and S.-F. Chang, “Weakly supervised vi-
tern Recognition, pp. 1014–1023, 2018. sual semantic parsing,” in Proceedings of the IEEE/CVF Conference
[107] M. H. Dupty, Z. Zhang, and W. S. Lee, “Visual relationship on Computer Vision and Pattern Recognition, 2020, pp. 3736–3745.
detection with low rank non-negative tensor decomposition,” in [129] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro,
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, “Graphical contrastive losses for scene graph parsing,” in Pro-
no. 07, 2020, pp. 10 737–10 744. ceedings of the IEEE Conference on Computer Vision and Pattern
[108] I. Donadello and L. Serafini, “Compensating supervision incom- Recognition, 2019, pp. 11 535–11 543.
pleteness with prior knowledge in semantic image interpreta- [130] H. Ben-Younes, R. Cadene, N. Thome, and M. Cord, “Block:
tion,” in 2019 International Joint Conference on Neural Networks Bilinear superdiagonal fusion for visual question answering and
(IJCNN), 2019, pp. 1–8. visual relationship detection,” in Proceedings of the AAAI Confer-
[109] B. Wen, J. Luo, X. Liu, and L. Huang, “Unbiased scene graph ence on Artificial Intelligence, vol. 33, 2019, pp. 8102–8109.
generation via rich and fair semantic extraction,” arXiv preprint [131] Y. Bin, Y. Yang, C. Tao, Z. Huang, J. Li, and H. T. Shen, “Mr-net:
arXiv:2002.00176, 2020. Exploiting mutual relation for visual relationship detection,” in
[110] X. Liang, L. Lee, and E. P. Xing, “Deep variation-structured Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
reinforcement learning for visual relationship and attribute de- no. 1, 2019, pp. 8110–8117.
tection,” in Proceedings of the IEEE conference on computer vision
[132] C. Yuren, H. Ackermann, W. Liao, M. Y. Yang, and B. Rosenhahn,
and pattern recognition, 2017, pp. 848–857.
“Nodis: Neural ordinary differential scene understanding,” arXiv
[111] X. Sun, Y. Zi, T. Ren, J. Tang, and G. Wu, “Hierarchical visual re- preprint arXiv:2001.04735, 2020.
lationship detection,” in Proceedings of the 27th ACM International
[133] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased
Conference on Multimedia, 2019, pp. 94–102.
scene graph generation from biased training,” in Proceedings of the
[112] H. Wan, Y. Luo, B. Peng, and W.-S. Zheng, “Representation learn-
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
ing for scene graph completion via jointly structural and visual
2020, pp. 3716–3725.
embedding,” in Twenty-Seventh International Joint Conference on
[134] X. Yang, H. Zhang, and J. Cai, “Shuffle-then-assemble: Learning
Artificial Intelligence (IJCAI-2018), 2018, pp. 949–956.
object-agnostic visual relationship features,” in Proceedings of the
[113] Y. Hu, S. Chen, X. Chen, Y. Zhang, and X. Gu, “Neural message
European conference on computer vision (ECCV), 2018, pp. 36–52.
passing for visual relationship detection,” in ICML Workshop on
Learning and Reasoning with Graph-Structured Representations, Long [135] P. Zhang, X. Ge, and J. Renz, “Support relation analysis for objects
Beach, CA, 2019. in multiple view rgb-d images,” arXiv preprint arXiv:1905.04084,
2019.
[114] H. Zhou, C. Zhang, and C. Hu, “Visual relationship detection
with relative location mining,” in Proceedings of the 27th ACM [136] M. Y. Yang, W. Liao, H. Ackermann, and B. Rosenhahn, “On
International Conference on Multimedia, 2019, pp. 30–38. support relations and semantic scene graphs,” Isprs Journal of
[115] W. Wang, R. Wang, S. Shan, and X. Chen, “Exploring context Photogrammetry and Remote Sensing, vol. 131, pp. 15–25, 2017.
and visual pattern of relationship for scene graph generation,” in [137] D. Chen, X. Liang, Y. Wang, and W. Gao, “Soft transfer learning
Proceedings of the IEEE Conference on Computer Vision and Pattern via gradient diagnosis for visual relationship detection,” in 2019
Recognition, 2019, pp. 8188–8197. IEEE Winter Conference on Applications of Computer Vision (WACV),
[116] L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang, “Coun- 2019, pp. 1118–1126.
terfactual critic multi-agent training for scene graph generation,” [138] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and
in Proceedings of the IEEE International Conference on Computer O. Yakhnenko, “Translating embeddings for modeling multi-
Vision, 2019, pp. 4613–4623. relational data,” Advances in neural information processing systems,
[117] M. Klawonn and E. Heim, “Generating triples with adversarial vol. 26, pp. 2787–2795, 2013.
networks for scene graph construction,” in Proceedings of the [139] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph
AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018, pp. embedding by translating on hyperplanes,” in Proceedings of the
6992–6999. AAAI Conference on Artificial Intelligence, vol. 28, no. 1, 2014.
[118] C. Han, F. Shen, L. Liu, Y. Yang, and H. T. Shen, “Visual spatial [140] G. Ji, K. Liu, S. He, and J. Zhao, “Knowledge graph completion
attention network for relationship detection,” in Proceedings of the with adaptive sparse transfer matrix,” in Proceedings of the AAAI
26th ACM international conference on Multimedia, 2018, pp. 510– Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
518. [141] S. Zheng, X. Chen, S. Chen, and Q. Jin, “Relation understanding
[119] G. S. Kenigsfield and R. El-Yaniv, “Leveraging auxiliary text for in videos,” in Proceedings of the 27th ACM International Conference
deep recognition of unseen visual relationships,” arXiv preprint on Multimedia, 2019, pp. 2662–2666.
arXiv:1910.12324, 2019. [142] X. Shang, J. Xiao, D. Di, and T.-S. Chua, “Relation understanding
[120] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, “Referring in videos: A grand challenge overview,” in Proceedings of the 27th
relationships,” in Proceedings of the IEEE Conference on Computer ACM International Conference on Multimedia, 2019, pp. 2652–2656.
Vision and Pattern Recognition, 2018, pp. 6867–6876. [143] X. Sun, T. Ren, Y. Zi, and G. Wu, “Video visual relation detection
[121] A. Kolesnikov, A. Kuznetsova, C. Lampert, and V. Ferrari, “De- via multi-modal feature fusion,” in Proceedings of the 27th ACM
tecting visual relationships using box attention,” in Proceedings International Conference on Multimedia, 2019, pp. 2657–2661.
of the IEEE International Conference on Computer Vision Workshops, [144] X. Shang, D. Di, J. Xiao, Y. Cao, X. Yang, and T.-S. Chua, “Annotat-
2019, pp. 0–0. ing objects and relations in user-generated videos,” in Proceedings
of the 2019 on International Conference on Multimedia Retrieval, 2019, large scale visual recognition challenge,” International journal of
pp. 279–287. computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[145] J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: [164] J. Lv, Q. Xiao, and J. Zhong, “Avr: Attention based salient visual
Actions as compositions of spatio-temporal scene graphs,” in relationship detection,” arXiv preprint arXiv:2003.07012, 2020.
Proceedings of the IEEE/CVF Conference on Computer Vision and [165] X. Lin, C. Ding, J. Zeng, and D. Tao, “Gps-net: Graph property
Pattern Recognition, 2020, pp. 10 236–10 247. sensing network for scene graph generation,” in Proceedings of the
[146] P. Gay, J. Stuart, and A. Del Bue, “Visual graphs from motion IEEE/CVF Conference on Computer Vision and Pattern Recognition,
(vgfm): Scene understanding with object geometry reasoning,” 2020, pp. 3746–3753.
in Asian Conference on Computer Vision, 2018, pp. 330–346. [166] M. Khademi and O. Schulte, “Deep generative probabilistic
[147] U. Kim, J. Park, T. Song, and J. Kim, “3-d scene graph: A sparse graph neural networks for scene graph generation.” in Proceed-
and semantic representation of physical environments for intelli- ings of the AAAI Conference on Artificial Intelligence, 2020, pp.
gent agents,” IEEE Transactions on Systems, Man, and Cybernetics, 11 237–11 245.
pp. 1–13, 2019. [167] G. Ren, L. Ren, Y. Liao, S. Liu, B. Li, J. Han, and S. Yan, “Scene
[148] J. Wald, H. Dhamo, N. Navab, and F. Tombari, “Learning 3d graph generation with hierarchical context,” IEEE Transactions on
semantic scene graphs from 3d indoor reconstructions,” in Pro- Neural Networks and Learning Systems, 2020.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern [168] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier,
Recognition, 2020, pp. 3961–3970. and S. Lazebnik, “Phrase localization and visual relationship
[149] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object de- detection with comprehensive image-language cues,” 2017 IEEE
tection from point clouds,” 2018 IEEE/CVF Conference on Computer International Conference on Computer Vision (ICCV), pp. 1946–1955,
Vision and Pattern Recognition, pp. 7652–7660, 2018. 2017.
[150] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal gener- [169] B. Knyazev, H. de Vries, C. Cangea, G. W. Taylor, A. Courville,
ation and detection from point cloud,” 2019 IEEE/CVF Conference and E. Belilovsky, “Graph density-aware losses for novel
on Computer Vision and Pattern Recognition (CVPR), pp. 770–779, compositions in scene graph generation,” arXiv preprint
2018. arXiv:2005.08230, 2020.
[151] W. Ali, S. Abdelkarim, M. Zidan, M. Zahran, and A. El Sallab, [170] D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-
“Yolo3d: End-to-end real-time 3d oriented object bounding box world visual reasoning and compositional question answering,”
detection from lidar point cloud,” in Proceedings of the European 2019 IEEE/CVF Conference on Computer Vision and Pattern Recogni-
Conference on Computer Vision (ECCV), 2018, pp. 0–0. tion (CVPR), pp. 6693–6702, 2019.
[171] C. Liu, Y. Jin, K. Xu, G. Gong, and Y. Mu, “Beyond short-term
[152] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning
snippet: Video relation detection with spatio-temporal global
on point sets for 3d classification and segmentation,” 2017 IEEE
context,” in Proceedings of the IEEE/CVF Conference on Computer
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
Vision and Pattern Recognition, 2020, pp. 10 840–10 849.
77–85, 2017.
[172] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d
[153] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari, “Fully-
dynamic scene graphs: Actionable spatial perception with places,
convolutional point networks for large-scale point clouds,” in
objects, and humans,” arXiv preprint arXiv:2002.06289, 2020.
Proceedings of the European Conference on Computer Vision (ECCV),
[173] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv
2018, pp. 596–611.
preprint arXiv:1904.07850, 2019.
[154] M. Jaritz, T.-H. Vu, R. d. Charette, E. Wirbel, and P. Pérez,
[174] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high
“xmuda: Cross-modal unsupervised domain adaptation for 3d
quality object detection,” in Proceedings of the IEEE conference on
semantic segmentation,” in Proceedings of the IEEE/CVF Conference
computer vision and pattern recognition, 2018, pp. 6154–6162.
on Computer Vision and Pattern Recognition, 2020, pp. 12 605–
[175] F. Xiao and Y. Jae Lee, “Video object detection with an aligned
12 614.
spatial-temporal memory,” in Proceedings of the European Confer-
[155] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: ence on Computer Vision (ECCV), 2018, pp. 485–501.
3d object detection from point cloud with part-aware and part- [176] M. Shvets, W. Liu, and A. C. Berg, “Leveraging long-range
aggregation network,” IEEE Transactions on Pattern Analysis and temporal relationships between proposals for video object de-
Machine Intelligence, 2020. tection,” in Proceedings of the IEEE International Conference on
[156] Y. Liang, Y. Bai, W. Zhang, X. Qian, L. Zhu, and T. Mei, “Vrr-vg: Computer Vision, 2019, pp. 9756–9764.
Refocusing visually-relevant relationships,” in Proceedings of the [177] H. Wu, Y. Chen, N. Wang, and Z. Zhang, “Sequence level seman-
IEEE International Conference on Computer Vision, 2019, pp. 10 403– tics aggregation for video object detection,” in Proceedings of the
10 412. IEEE International Conference on Computer Vision, 2019, pp. 9217–
[157] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and 9225.
A. Zisserman, “The pascal visual object classes (voc) challenge,” [178] K. Kang, W. Ouyang, H. Li, and X. Wang, “Object detection
International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, from video tubelets with convolutional neural networks,” in
2010. Proceedings of the IEEE conference on computer vision and pattern
[158] A. Belz, A. Muscat, P. Anguill, M. Sow, G. Vincent, and Y. Zi- recognition, 2016, pp. 817–825.
nessabah, “Spatialvoc2k: A multilingual dataset of images with [179] H. Wang and C. Schmid, “Action recognition with improved
annotations and features for spatial relations between objects,” in trajectories,” in Proceedings of the IEEE international conference on
Proceedings of the 11th International Conference on Natural Language computer vision, 2013, pp. 3551–3558.
Generation, 2018, pp. 140–145. [180] S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. S. Yu, “A survey
[159] K. Yang, O. Russakovsky, and J. B. Deng, “Spatialsense: An on knowledge graphs: Representation, acquisition and applica-
adversarially crowdsourced benchmark for spatial relation recog- tions,” arXiv preprint arXiv:2002.00388, 2020.
nition,” 2019 IEEE/CVF International Conference on Computer Vision [181] H. Huang, S. Saito, Y. Kikuchi, E. Matsumoto, W. Tang, and
(ICCV), pp. 2051–2060, 2019. P. S. Yu, “Addressing class imbalance in scene graph parsing
[160] F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, by learning to contrast and score,” in Proceedings of the Asian
“Gibson env: Real-world perception for embodied agents,” 2018 Conference on Computer Vision, 2020.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [182] S. Inuganti and V. N. Balasubramanian, “Assisting scene
pp. 9068–9079, 2018. graph generation with self-supervision,” arXiv preprint
[161] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, arXiv:2008.03555, 2020.
D. Poland, D. Borth, and L.-J. Li, “The new data and new chal- [183] J. Yu, Y. Chai, Y. Hu, and Q. Wu, “Cogtree: Cognition
lenges in multimedia research,” arXiv preprint arXiv:1503.01817, tree loss for unbiased scene graph generation,” arXiv preprint
2015. arXiv:2009.07526, 2020.
[162] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, [184] Y. Zhou, S. Sun, C. Zhang, Y. Li, and W. Ouyang, “Exploring the
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in hierarchy in relation labels for scene graph generation,” arXiv
context,” in European conference on computer vision, 2014, pp. 740– preprint arXiv:2009.05834, 2020.
755. [185] B. Knyazev, H. de Vries, C. Cangea, G. W. Taylor, A. Courville,
[163] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, and E. Belilovsky, “Generative graph perturbations for scene
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet graph prediction,” arXiv preprint arXiv:2007.05756, 2020.