
LoL-V2T: Large-Scale Esports Video Description Dataset

Tsunehiko Tanaka
Waseda University
tsunehiko@fuji.waseda.jp

Edgar Simo-Serra
Waseda University
ess@waseda.jp

[Figure 1 diagram: narration subtitles with timestamps (e.g., "and now it's all about this best of one and for g2") are split into sentences, assigned to video segments with computed timestamps, and masked (e.g., "and now it's all about this best of <0> and for <team>") to form the LoL-V2T dataset used to train the captioning model.]
Figure 1: Overview of our video description approach for esports. Left: We created a new large-scale esports dataset consisting of gameplay clips with multiple captions. Right: We mask domain-specific words in the captions to improve the training and generalization abilities of our model.

Abstract

Esports is a fast-growing field with a largely online presence, and it is creating a demand for automatic domain-specific captioning tools. However, at the current time, there are few approaches that tackle the esports video description problem. In this work, we propose a large-scale dataset for esports video description, focusing on the popular game "League of Legends". The dataset, which we call LoL-V2T, is the largest video description dataset in the video game domain, and includes 9,723 clips with 62,677 captions. This new dataset presents multiple new video captioning challenges, such as large amounts of domain-specific vocabulary, subtle motions with large importance, and a temporal gap between most captions and the events that occurred. In order to tackle the issue of vocabulary, we propose masking the domain-specific words and provide additional annotations for this. In our results, we show that the dataset poses a challenge to existing video captioning approaches, and that the masking can significantly improve performance. Our dataset and code are publicly available1.

1. Introduction

Esports are growing rapidly in popularity, generated $950 million in revenue in 2019 with a year-on-year growth of +22.7%, and are approaching the scale of existing sports leagues [19]. The main content in esports consists of tournament or gameplay videos; however, they are challenging for a new audience to understand given that they contain significant information (e.g., character hit points, skill cooldowns, and effects of using items). Captions are a useful tool for viewers to understand the status of a match. Recently, video description approaches have started to tackle the sports domain [32, 34]; however, there are few approaches tackling the esports domain. We believe this is caused by a lack of large-scale datasets, and we present the LoL-V2T dataset to address this issue.

1 https://github.com/Tsunehiko/lol-v2t

[Figure 2 diagram: two encoder-decoder stacks built from multi-head attention, add & norm, and feed-forward layers over CNN clip features and shifted-right caption tokens W_j, predicting W_{j+1}; the right-hand model adds a memory updater that carries a memory state M_t across segments.]
Figure 2: Overview of our captioning models. The left model is based on the vanilla transformer proposed by Zhou et al. [37], while the model on the right is based on the transformer with a recurrent memory module [15].

Our dataset for video description in esports consists of narrated videos of the League of Legends world championships "Worlds". We chose League of Legends as the esports title because it is one of the most popular esports games and thus it is easy to obtain a significant amount of narrated videos. We split each video into different scenes, which are then filtered such that non-gameplay-related scenes are removed. Afterwards, we extract complete sentence captions from the narrations, and compute the temporal boundaries corresponding to each of the captions. After this processing, we obtain a total of 9,723 clips with 62,677 captions, where each clip is associated with multiple captions. LoL-V2T poses three different challenges for training captioning models. First, the captions contain a significant amount of proper nouns specific to the esports title and numerals that describe the status of matches. Second, some important objects in the clips are difficult for captioning models to recognize because of their size and subtle motions. Third, some pairs of clips and captions are not necessarily aligned temporally, i.e., there can be a significant lag between game actions and captions.

In this paper, we additionally tackle the difficulty corresponding to the large amounts of proper nouns and numerals. While they are important for describing gameplay, memorizing them is difficult for captioning models because of the large variety and low frequency of occurrence of each word. Furthermore, they change over time as the game is updated and renewed. To tackle this problem in the LoL-V2T dataset, we rely on masking. In particular, our preprocessing approach consists of a masking scheme for proper nouns and numerals in the captions. First, we classify the proper nouns in the captions into several groups according to their meaning. We treat numerals as a single group. Then, when a proper noun or a numeral appears in a caption, we replace the word with the name of the group to which it belongs. By using this method, we increase the frequency of proper nouns and numerals while keeping the group names in the captions simple enough for a new audience to comprehend. Our key task is to generate multi-sentence descriptions for esports videos, and we use video paragraph captioning models that build upon the Vanilla Transformer [37] and MART [15]. An overview is shown in Figure 2. We show the challenges of LoL-V2T and corroborate the effectiveness of our proposed approach through experiments.

2. Related Work

Video Description Dataset

Various datasets for video description have been proposed, covering a wide range of domains such as cooking [36, 22], instructions [18], and human activities [24, 31, 14]. We summarize existing video description datasets and compare key statistics in Table 1.

Name Domain Clips Captions Duration Multiple Narration
Charades [24] human 10k 16k 82h - -
MSR-VTT [31] open 10k 200k 40h - -
YouCook2 [36] cooking 14k 14k 176h X -
ANet Captions [14] open 100k 100k 849h X -
TACOS-ML [22] cooking 14k 53k 13h X -
HowTo100M [18] instruction 136M 136M 134,472h - X
Getting Over It [16] video game 2,274 2,274 1.8h - X
LoL-V2T (Ours) video game 9.7k 63k 76h X X

Table 1: Comparison of the existing video description datasets. Multiple indicates whether or not multiple captions are
associated with a single clip. Narration indicates whether or not captions are generated from the narration contained in the
video.

Video description datasets can be divided into two types: those with one caption per clip, and those with multiple captions per clip. Furthermore, there are two types of captions: those generated by manual annotation, and those generated automatically from the narration. In this work, we propose a large dataset for esports where multiple captions correspond to each clip and are automatically generated from the narration. As shown in Table 1, LoL-V2T is the largest dataset in the video game domain and is comparable in size to datasets in other domains.

Video Description

Early work in video description [13, 3, 7] was based on template-based methods. The templates require a large number of manually defined linguistic rules, which are only effective in constrained environments. They also have limited applicability, and most research has focused on human actions. With the growth of deep learning, a method using an encoder-decoder framework for video description [27] was proposed, which overcomes the limitations of template-based methods. This method used a CNN as the encoder and an RNN as the decoder, demonstrating the high performance of CNNs for video feature representation and of RNNs for sequential learning. Subsequently, the mean pooling used to aggregate the features of each frame in the encoder was replaced by a CRF [8], and the CNN used in the encoder was replaced by an RNN [8, 26]. Following the success [1] of attention mechanisms in machine translation, several methods have been proposed: temporal attention [33], which focuses on the temporal direction, semantic attention [9], which focuses on tags of semantic concepts extracted from images, and methods using both [20]. Furthermore, Zhou et al. [37] applied the transformer architecture [25] to the video description model. Self-attention in the transformer can replace RNNs and is effective for modeling long-term dependencies in sequential data. However, transformer architectures are unable to model history information because they can operate only on separate fixed-length segments. Lei et al. [15] solved this problem with MART, an architecture based on the transformer with a memory module similar to LSTM [11] and GRU [5].

Video Description in Video Games

Several works [32, 34] on video description for traditional sports videos have been proposed. Yan et al. [32] use tennis videos as input and generate captions using a structured SVM and an LSTM. Yu et al. [34] used videos of NBA games as input and generated captions using an LSTM and subnetworks suited to basketball videos.

Esports footage contains much more information than traditional sports videos, making video comprehension difficult. Some video description efforts target video games as complex as esports games and use "Let's Play" videos as input. "Let's Play" videos contain audio of players' comments on gameplay, which can be converted to text by an Automatic Speech Recognition (ASR) system and used as captioning data. Shah et al. [23] generated a caption for each frame by training a simple CNN model that combines three conv layers, using 75 minutes of Minecraft "Let's Play" videos and 4,840 sentences. Li et al. [16] applied sequence-to-sequence networks with attention to this task, using a dataset of 110 minutes of "Let's Play" videos of Getting Over It with Bennett Foddy and 2,274 sentences. Various inputs such as video, optical flow, and audio are compared. In this paper, we propose LoL-V2T, a dataset built from esports video game footage that focuses on gameplay. LoL-V2T consists of 4,568 minutes of video and 62,677 sentences, which is much larger than existing datasets in the video game domain.

3. LoL-V2T Dataset

We create a new dataset for video description in esports, LoL-V2T. LoL-V2T consists of 9.7k clips of League of Legends gameplay video taken from YouTube and 63,000 captions. Each video is associated with multiple captions based on manual or ASR-generated subtitles.

accuracy   precision   recall   F1
0.963      0.928       0.285    0.437

Table 2: Performance of the model that detects whether a clip is related to gameplay.

3.1. Data Collection

We collect footage of 157 matches of the League of Legends world championships "Worlds" from YouTube. League of Legends is the most popular esports title and the most-watched game on Twitch and YouTube [19]. Besides, since Worlds has a large number of matches and commentators always provide commentary on them, it is easy to obtain narrations for the videos. While "Let's Play" videos include narration not related to gameplay, the quality of the narration in esports tournament footage is higher than that of "Let's Play" videos because the purpose of the commentators is not to entertain viewers but to explain the gameplay.

3.2. Splitting Videos and Selecting Clips

The average length of the collected League of Legends videos is 44 minutes, which is too long to generate captions for. The videos also include scenes not related to gameplay, such as player seats and venue scenes. In order to reduce noise in the model training, we first split the videos into clips by scene, and then remove the clips which are not gameplay. This process is shown in Figure 3.

We automatically split each video into clips of a length that the video description model can handle. For this splitting, we use PySceneDetect2, a tool that can detect scene changes in videos. The average length of the clips is 23.4 seconds.

To remove the clips not related to gameplay, we create a model that detects whether a clip is a gameplay clip. It takes the temporally centered RGB frame as input and outputs 0 or 1 to indicate whether it is relevant to gameplay or not, respectively. ResNeXt-50 (32x4d) [29] is adopted as the model, and the clips segmented by the splitting tool are used as the training dataset. As shown in Table 2, precision is much higher than recall for this model; this reduces false positives and thus preserves the quality of the dataset, even if the amount of data decreases.

3.3. Generation of Captions and Temporal Boundaries

We produce full-text captions and temporal boundaries from the subtitles generated from the narrations by YouTube. Subtitles are usually organized as a list of text chunks. Each chunk is not a complete sentence and is associated with a specific time interval in the video. To help captioning models understand the context, we re-segment the chunks into complete sentences with a sentence segmenter. For the sentence segmenter, we use DeepSegment3.

The relationship between chunks and temporal boundaries breaks down when the sentences are reconstructed. We compute the new temporal boundaries to which the reconstructed sentences correspond using the number of words in the sentences. The sequence of caption and temporal boundary generation is shown in Figure 4.

3.4. Dataset Analysis

LoL-V2T is a labeled dataset with 4,568 minutes of video and 62,677 captions. As shown in Table 1, LoL-V2T is larger than the existing dataset [16] in the video game domain and contains video-text data as large as medium-scale datasets frequently used in video description, such as Charades [24] and MSR-VTT [31]. The number of clips is 9,723, the average length of a clip is 28.0 seconds, and the average length of the intervals between temporal boundaries is 4.38 seconds. The mean number of words in a caption is 15.4.

In LoL-V2T, as in ActivityNet Captions [14], multiple captions are associated with a single clip. This dataset can therefore also be used for the task of inferring temporal boundaries (dense video captioning) in the future. LoL-V2T has three features that make it more difficult to train captioning models on, compared to ActivityNet Captions. First, there are more proper nouns and numerals in LoL-V2T than in ActivityNet Captions, as shown in Figure 5. They interfere with the training of captioning models, yet they are also important elements in describing the information of a game. However, too many of them can result in captions that are difficult for beginners to comprehend. Second, the motions of objects important for gameplay are too subtle in the clips to be recognized by captioning models. Characters in League of Legends appear smaller than people do, which indicates that it is difficult for captioning models pre-trained on human activities to understand the clips in LoL-V2T. Finally, the clips and captions do not necessarily match temporally. The content represented by a caption often occurs earlier than the timestamps calculated in Section 3.3 because narrators talk about gameplay after watching it. In the next section, we propose a method for dealing with the first difficulty in LoL-V2T.

2 https://pyscenedetect.readthedocs.io/
3 https://github.com/notAI-tech/deepsegment
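To make the splitting step of Section 3.2 concrete, the sketch below obtains scene boundaries with PySceneDetect's high-level API. The detector choice, its default threshold, and the file name are our own assumptions; the paper only states that PySceneDetect is used to detect scene changes.

# Hedged sketch: split a match video into scene-level clips with PySceneDetect
# (recent releases expose this high-level API).
from scenedetect import detect, ContentDetector

def split_into_clips(video_path):
    scenes = detect(video_path, ContentDetector())
    # Each scene is a (start, end) pair of FrameTimecodes; return seconds.
    return [(start.get_seconds(), end.get_seconds()) for start, end in scenes]

clips = split_into_clips("worlds_match.mp4")  # hypothetical file name
print(len(clips), "clips, e.g.", clips[:2])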

[Figure 3 diagram: Scene Extraction splits the video into per-scene clips; a CNN then labels each clip (0/1) for Gameplay Extraction.]

Figure 3: Overview of splitting and selecting videos in League of Legends (LoL). We split the LoL gameplay videos into clips for each scene because their average length is too long to be handled by the video description model. We remove clips not related to gameplay with ResNeXt-50 (32x4d) [29] to minimize noise in training.
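As a rough illustration of the gameplay filter from Section 3.2, the following sketch puts a two-class head on a torchvision ResNeXt-50 (32x4d) backbone and trains it on the temporally centered RGB frame of each clip. The head, optimizer, and learning rate are illustrative assumptions, not the authors' training recipe.

import torch
import torch.nn as nn
from torchvision import models

# Binary gameplay / non-gameplay classifier over the temporally centered frame.
model = models.resnext50_32x4d(weights="IMAGENET1K_V1")  # recent torchvision API
model.fc = nn.Linear(model.fc.in_features, 2)            # two classes: gameplay or not

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed hyperparameters

def train_step(frames, labels):
    # frames: (B, 3, H, W) center frames; labels: (B,) gameplay indicators.
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()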

[Figure 4 diagram: subtitle chunks ("I have" / "a pen I live" / "in Tokyo I had") are concatenated, re-segmented into sentences ("I have a pen", "I live in Tokyo"), and assigned new timestamps ([0.00, 2.00] and [2.00, 4.00]).]

Figure 4: Sequence used to create captions and temporal boundaries from subtitles. First, we concatenate several consecutive chunks and split them into complete sentences with a sentence segmenter. Next, the temporal boundaries are calculated according to the number of words in the reconstructed sentences. In this example, four words appear between 00:01:00.000 and 00:03:00.000, half of which belong to the previous caption, so the final timestamp of the previous caption is 00:02:00.000.
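The word-count rule illustrated in Figure 4 can be written down directly: divide each subtitle chunk's duration evenly among its words, then give each re-segmented sentence the span from its first to its last word. The sketch below reproduces the figure's toy example; the uniform word rate within a chunk is our reading of the procedure, and the sentence list stands in for the segmenter's output.

def word_times(chunks):
    # Divide each chunk's duration evenly among its words
    # (assumption: a uniform speaking rate within a chunk).
    times = []
    for start, end, text in chunks:
        words = text.split()
        step = (end - start) / max(len(words), 1)
        for i, word in enumerate(words):
            times.append((word, start + i * step, start + (i + 1) * step))
    return times

def sentence_boundaries(chunks, sentences):
    # Each reconstructed sentence spans from its first to its last word.
    timeline = word_times(chunks)
    boundaries, idx = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        span = timeline[idx:idx + n]
        boundaries.append((span[0][1], span[-1][2]))
        idx += n
    return boundaries

# Toy example mirroring Figure 4 (times in seconds); the sentence list stands
# in for the output of the sentence segmenter (DeepSegment in the paper).
chunks = [(0.0, 60.0, "I have"), (60.0, 180.0, "a pen I live"), (180.0, 300.0, "in Tokyo I had")]
print(sentence_boundaries(chunks, ["I have a pen", "I live in Tokyo"]))
# -> [(0.0, 120.0), (120.0, 240.0)], i.e. [0:00, 2:00] and [2:00, 4:00]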

4. Proposed Method

In this section, we introduce a method for video description in esports. Our work builds upon the vanilla transformer for video description proposed by Zhou et al. [37] and MART [15], a transformer with a memory module for modeling history information. We modify them by masking proper nouns in preprocessing, as we explain next.

4.1. Model

Vanilla Transformer. This model is an application of the transformer [25] to the video description task, proposed by Zhou et al. [37], as shown in Figure 2 (left). It contains two components: a video encoder and a caption decoder. The video encoder is composed of a stack of two identical layers, each with a self-attention layer [25] and a position-wise fully connected feed-forward network. The caption decoder inserts a multi-head attention layer over the output of the video encoder stack in addition to the two layers in the video encoder. We also create masks from the ground-truth temporal boundaries and apply them to the outputs of the video encoder to make the caption decoder focus on the event proposals in the clips. Although [37] includes a proposal decoder that outputs event proposals and uses them to make the masks, we remove it to simplify the task by focusing on the inference of captions.

MART [15] is a model based on the transformer [25] for the video paragraph captioning task, as shown in Figure 2 (right). The transformer model decodes each caption individually, without using the context of the previously generated captions, as it can operate only on separate fixed-length segments. MART makes two changes to the transformer to solve this problem. The first change is a unified encoder-decoder design, where the encoder and decoder are shared.
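Before turning to MART's second change, the sketch below wires up a video encoder and caption decoder with PyTorch's built-in transformer layers, in the spirit of the vanilla transformer of Section 4.1. The feature dimension, layer counts, and the omission of positional encodings and of the temporal-boundary masks are simplifications of ours, not the authors' configuration.

import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    # Minimal encoder-decoder captioner: project clip features, encode them,
    # and decode the shifted-right caption with a causal mask.
    def __init__(self, feat_dim=3072, d_model=512, vocab_size=4528, nhead=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # assumed RGB+flow TSN feature size
        self.embed = nn.Embedding(vocab_size, d_model)  # caption word embedding
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)       # next-word logits

    def forward(self, feats, tokens):
        # feats: (B, T, feat_dim) clip features; tokens: (B, L) shifted-right caption.
        memory = self.encoder(self.proj(feats))
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1)
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(hidden)  # (B, L, vocab_size)

model = VideoCaptioner()
logits = model(torch.randn(2, 20, 3072), torch.randint(0, 4528, (2, 12)))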

[Figure 5 diagrams: part-of-speech frequency histograms for (a) LoL-V2T and (b) ActivityNet Captions.]

Figure 5: Frequency of words included in captions for each part-of-speech. In LoL-V2T, there are more proper nouns and
numerals than in ActivityNet Captions.
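The histogram in Figure 5 is essentially a part-of-speech frequency count over all captions. A minimal way to reproduce such a count is sketched below; the paper does not say which tagger was used, so spaCy's small English model is an assumption.

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumed tagger; any POS tagger would do

def pos_frequencies(captions):
    counts = Counter()
    for doc in nlp.pipe(captions):
        counts.update(token.pos_ for token in doc)  # e.g. PROPN, NUM, NOUN, VERB
    return counts

print(pos_frequencies(["and now it is all about this best of one and for g2"]))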

The second change is an external memory module, similar to LSTM [11] and GRU [5], that enables the modeling of the history of clips and generated captions. With these two improvements, MART is able to use previous contextual information and generate better paragraphs with higher coherence and less repetition.

4.2. Masking

As described in Section 3.4, the proportion of proper nouns and numerals among the words in LoL-V2T is higher than in other datasets. Although proper nouns and numerals are important for explaining gameplay, they are difficult for captioning models to learn. To address this problem, we propose a masking method applied to the captions in preprocessing. First, we classify the proper nouns in the captions into groups by meaning. Then, when a proper noun or numeral appears in a caption, we mask it with the name of the group to which it belongs. An example of masking is shown in Figure 6.

Origin 1: and now it's all about this best of one and for g2
Masked 1: and now it is all about this best of <0> and for <Team>

Origin 2: that's respect from the clutch gaming top laner and so now they don't have the gangplank ultimate to drop on a Herald fight but they do have more pressure bottom line
Masked 2: that is respect from the <Team> top laner and so now they do not have the <Champion> ultimate to drop on a <Monster> fight but they do have more pressure bottom line

Figure 6: An example of masking. Origin is the original caption and Masked is the caption after masking. For example, "one" and "g2" in Origin 1 are converted to "<0>" and "<Team>", respectively. All numerals are converted to "<0>" regardless of their value.

Since the masking increases the frequency of proper nouns and numerals, it makes them easier for the model to recognize. In addition, complex proper nouns with detailed meanings are replaced by simpler ones, resulting in more comprehensible captions. We treat numerals as a single group and also deal with misspellings introduced by ASR. The classification into groups is done manually by people knowledgeable in League of Legends. Examples of the created groups are shown in Table 3.

5. Experiments

In this section, we show the experimental results of video description for esports footage using LoL-V2T. We also demonstrate that our masking method improves the generated captions. We measure the captioning performance with the automatic evaluation metrics BLEU@4 [21], RougeL [17], METEOR [2], and Repetition@4 [30].

5.1. Implementation Details

We train the models on the LoL-V2T dataset with the splits in Table 4. To preprocess the videos, we down-sample each video every 0.26 s and extract TSN features [28] from the sampled frames. The TSN model extracts spatial features from RGB appearance and temporal features from optical flow, and concatenates the two.
description for esports footages using LoL-V2T. We also use the two different ResNet-50 [10] model pre-trained on

Group Name   Meaning                                      Examples
Team         names of teams competing in "Worlds"         fnatic, fanatic, Cloud9, C9, genji, Gen.G, g2
Player       names of players competing in "Worlds"       Baker, Matt, Svenskeren, Ceros, Diamondprox, dyNquedo
Champion     names of characters in League of Legends     Recon, Akali, Kaiser, Ryze, Ziya, Camille
Monster      names of monsters in League of Legends       Drake, Herald, Raptor, Krug

Table 3: Examples of the created proper noun groups. Proper nouns related to the competition, such as Team and Player, and proper nouns related to gameplay, such as Champion and Monster, are grouped separately. Moreover, ASR misrecognitions, such as "fanatic" for "fnatic", are absorbed into the corresponding group.
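Given group lists like those in Table 3, the masking of Section 4.2 reduces to a word-level lookup plus a numeral test. The sketch below uses only a tiny illustrative fragment of the groups and a deliberately crude numeral check; the real lists are curated manually by League of Legends experts, and the paper's pipeline also normalizes contractions (e.g., "it's" to "it is"), which this sketch does not.

import re

# Illustrative fragment of the proper noun groups from Table 3 (not the full lists).
GROUPS = {
    "<Team>": {"fnatic", "fanatic", "cloud9", "c9", "gen.g", "g2"},
    "<Champion>": {"akali", "ryze", "camille", "gangplank"},
    "<Monster>": {"drake", "herald", "raptor", "krug"},
}
NUMERAL = re.compile(r"^\d+$|^(one|two|three|four|five)$")  # crude numeral test

def mask_caption(caption):
    out = []
    for word in caption.lower().split():
        group = next((g for g, members in GROUPS.items() if word in members), None)
        if group:
            out.append(group)
        elif NUMERAL.match(word):
            out.append("<0>")  # all numerals share one tag, regardless of value
        else:
            out.append(word)
    return " ".join(out)

print(mask_caption("and now it's all about this best of one and for g2"))
# -> "and now it's all about this best of <0> and for <Team>"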

Split      train    validation   test
clips      6,977    851          1,895
captions   44,042   5,223        13,412

Table 4: The splits of the LoL-V2T dataset.

For TSN, our implementation builds upon mmaction2 [6], and we use two different ResNet-50 [10] models pre-trained on Kinetics-400 [12] without fine-tuning. We also use TV-L1 [35] to calculate the optical flow. All captions are converted to lowercase. Words that occur fewer than 5 times across all captions are replaced with "<unk>" tags. All other settings of our implementations of the vanilla transformer and MART are the same as in [37] and [15], respectively.

5.2. Evaluation Metrics

We measure the performance on the video description task with four automatic evaluation metrics: BLEU@4 [21], RougeL [17], METEOR [2], and Repetition@4 [30]. We use the standard evaluation implementation from the MSCOCO server [4].

5.3. Results and Analysis

We compare the case where our proposed masking is included in preprocessing (Masking) with the case where it is excluded (baseline), using the Vanilla Transformer and MART trained on the LoL-V2T dataset. The vocabulary of the training data is 4,528 words after our proposed preprocessing and 5,078 words after the baseline preprocessing.

The results on the testing set of the LoL-V2T dataset are shown in Table 5. We can see that our proposed masking outperforms the baseline in BLEU@4, RougeL, and METEOR. This shows that by grouping proper nouns with a low frequency of occurrence, our proposed masking increases their frequency of occurrence, which helps captioning models to recognize proper nouns. We can also observe that MART outperforms the Vanilla Transformer in Repetition@4, which shows that the recurrent module in MART is effective in suppressing repetitive expressions.

Qualitative results are shown in Figure 7. The models trained with our proposed masking (VTransformer+Masking, MART+Masking) generate captions containing the names of the groups, such as "<Player>" and "<Team>", and this tendency is particularly prominent in MART, so the captions produced with our method are more explicit than the ones produced with the baseline. The models also generate gameplay keywords such as "kill" and "fight", and simple League of Legends terms such as "turret", which indicates that the captions can explain the contents in more detail. On the other hand, the models trained with the baseline (VTransformer, MART) generate keywords such as "mid lane" and "damage", but there are many "<unk>" occurrences. Besides, some of the captions from VTransformer are repetitions of "<unk>" and do not form sentences. This shows that our method generates more comprehensible captions than the baseline.

6. Conclusion

In this paper, we introduced LoL-V2T, a new large-scale dataset for video description in esports with 9,723 clips and 62,677 captions, where each clip is associated with multiple captions. LoL-V2T poses three difficulties for training captioning models. First, its captions contain many esports-specific proper nouns and numerals. Second, the motions of objects in the clips are subtle. Third, clips and captions do not necessarily have a direct temporal correspondence. We addressed the first difficulty and proposed a preprocessing method consisting of masking proper nouns and numerals in the captions. Our captioning models build upon [37] and [15]. Experimental results show that our proposed approach can improve the generated captions. In the future, addressing the remaining two difficulties should further improve performance. We hope that with the release of our LoL-V2T dataset, other researchers will be encouraged to advance the state of esports research.

Method BLEU@4↑ RougeL↑ METEOR↑ Repetition@4↓
VTransformer 1.53 12.76 8.58 37.33
MART 2.13 14.48 11.50 8.76
VTransformer+Masking 3.14 16.57 12.03 29.20
MART+Masking 3.56 15.39 12.98 9.74

Table 5: Captioning results on testing set in LoL-V2T dataset. We evaluate the performance of our proposed masking using
BLEU@4, RougeL, METEOR, and Repetition@4.
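For reference, scores like those in Table 5 could be computed with the MSCOCO evaluation code [4] along the lines of the sketch below. The gts/res layout follows the pycocoevalcap convention, and the Repetition@4 helper is our simplified reading of [30] (the share of 4-grams that repeat within a generated paragraph), not the reference implementation.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def coco_scores(gts, res):
    # gts/res: dict mapping a clip id to a list of reference / generated captions.
    bleu, _ = Bleu(4).compute_score(gts, res)
    meteor, _ = Meteor().compute_score(gts, res)
    rouge, _ = Rouge().compute_score(gts, res)
    return {"BLEU@4": bleu[3], "METEOR": meteor, "RougeL": rouge}

def repetition_at_4(paragraph):
    # Simplified Repetition@4: fraction of 4-grams already seen earlier in the
    # same paragraph (our reading of [30]).
    words = paragraph.split()
    grams = [tuple(words[i:i + 4]) for i in range(len(words) - 3)]
    if not grams:
        return 0.0
    seen, repeated = set(), 0
    for g in grams:
        repeated += g in seen
        seen.add(g)
    return repeated / len(grams)

gts = {"clip_0": ["<Team> is going to take down the turret"]}            # hypothetical
res = {"clip_0": ["<Team> is going to be able to take down the turret"]}
print(coco_scores(gts, res), repetition_at_4(res["clip_0"][0]))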

VTransformer: i think that 's a lot of damage that you can see that
MART: I think that this is a very good sign of a team that has been playing around
VTransformer+Masking: i think that is a very good start to the game for <Team> to be able to get
MART+Masking: i think that is kind of the best <Team> in the <League> and you can see how much of the
Ground-Truth: this is a <Team> that had a very good record against <Team> throughout the regular session

VTransformer: i think that 's a lot of damage that you can see that the <unk> <unk> is going
MART: I mean you can see that the gold lead is still in the mid lane for the side of the
VTransformer+Masking: i think that is a lot of the <unk> that is going to be a big <unk> for
MART+Masking: <Champion> is going to be a very big deal with so many of these fights and the fact that he
Ground-Truth: lane continue to fight for experience so as knows as the <Champion> he is never solar carrying this lane so instead he uses
his advantage to help out the other side of the map

VTransformer: the <unk> of the <unk> <unk> <unk>


MART: the fight and then the end of the day and the fight
VTransformer+Masking: the <unk> of the <Team> that is the best <Team>
MART+Masking: he is got the ultimate available and he is got the kill
Ground-Truth: just kill <0> of the squished

VTransformer: <unk> <unk> is gon na be taken down but they 're gon na find the kill on the
MART: I think that this is a really good play for vitality to be able to do it
VTransformer+Masking: <Player> 's going to be able to get the kill but he is going to be able to
MART+Masking: but he is going to be able to get away from this <0> as well as the turret goes down
Ground-Truth: his old <Team> is actually going to die out of the time that was an oopsie other choice will die as well"

Figure 7: Qualitative results on the testing set of the LoL-V2T dataset. Colored text highlights relevant content in the clips. Captions generated by the models trained with the proposed method contain more words related to what is shown on screen than those of the baseline. Some references are not complete sentences because they are generated by ASR.

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. 3
[2] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. 6, 7
[3] Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, et al. Video in sentences out. arXiv preprint arXiv:1204.2742, 2012. 3
[4] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 7
[5] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. 3, 6
[6] MMAction2 Contributors. Openmmlab's next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2, 2020. 6
[7] Pradipto Das, Chenliang Xu, Richard F Doell, and Jason J Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2634–2641, 2013. 3
[8] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 3
[9] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5630–5639, 2017. 3
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. 3, 6
[12] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 7
[13] Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2):171–184, 2002. 3
[14] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 2, 3, 4
[15] Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara Berg, and Mohit Bansal. MART: Memory-augmented recurrent transformer for coherent video paragraph captioning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2603–2614, Online, July 2020. Association for Computational Linguistics. 2, 3, 5, 7
[16] Chengxi Li, Sagar Gandhi, and Brent Harrison. End-to-end let's play commentary generation using multi-modal video representations. In Proceedings of the 14th International Conference on the Foundations of Digital Games. Association for Computing Machinery, 2019. 3, 4
[17] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 6, 7
[18] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE international conference on computer vision, pages 2630–2640, 2019. 2, 3
[19] newzoo. Newzoo global esports market report 2020 — light version. https://newzoo.com/insights/trend-reports/newzoo-global-esports-market-report-2020-light-version/, 2020. [Online: accessed 25-February-2021]. 1, 4
[20] Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. Video captioning with transferred semantic attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6504–6512, 2017. 3
[21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 6, 7
[22] Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. In Xiaoyi Jiang, Joachim Hornegger, and Reinhard Koch, editors, Pattern Recognition, pages 184–195, Cham, 2014. Springer International Publishing. 2, 3
[23] Shukan Shah, Matthew Guzdial, and Mark O. Riedl. Automated let's play commentary. CoRR, abs/1909.02195, 2019. 3
[24] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016. 2, 3, 4
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in neural
information processing systems, pages 5998–6008, 2017. 3,
5
[26] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Don-
ahue, Raymond Mooney, Trevor Darrell, and Kate Saenko.
Sequence to sequence-video to text. In Proceedings of the
IEEE international conference on computer vision, pages
4534–4542, 2015. 3
[27] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Mar-
cus Rohrbach, Raymond Mooney, and Kate Saenko. Trans-
lating videos to natural language using deep recurrent neural
networks. arXiv preprint arXiv:1412.4729, 2014. 3
[28] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua
Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment net-
works: Towards good practices for deep action recognition.
In European conference on computer vision, pages 20–36.
Springer, 2016. 6
[29] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and
Kaiming He. Aggregated residual transformations for deep
neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), July
2017. 4, 5
[30] Yilei Xiong, Bo Dai, and Dahua Lin. Move forward and
tell: A progressive generator of video descriptions. In Pro-
ceedings of the European Conference on Computer Vision
(ECCV), September 2018. 6, 7
[31] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large
video description dataset for bridging video and language. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 5288–5296, 2016. 2, 3, 4
[32] F. Yan, K. Mikolajczyk, and J. Kittler. Generating commen-
taries for tennis videos. In 2016 23rd International Con-
ference on Pattern Recognition (ICPR), pages 2658–2663,
2016. 1, 3
[33] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas,
Christopher Pal, Hugo Larochelle, and Aaron Courville. De-
scribing videos by exploiting temporal structure. In Proceed-
ings of the IEEE international conference on computer vi-
sion, pages 4507–4515, 2015. 3
[34] H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang, and X.
Yang. Fine-grained video captioning for sports narrative. In
2018 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 6006–6015, 2018. 1, 3
[35] Christopher Zach, Thomas Pock, and Horst Bischof. A du-
ality based approach for realtime tv-l 1 optical flow. In Joint
pattern recognition symposium, pages 214–223. Springer,
2007. 7
[36] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards
automatic learning of procedures from web instructional
videos. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018. 2, 3
[37] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher,
and Caiming Xiong. End-to-end dense video captioning with
masked transformer. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 8739–
8748, 2018. 2, 3, 4, 5, 7
