Deep Learning YouTube Video Tags
Travis Addair
Stanford University
taddair@stanford.edu
Abstract

Video tagging is a complex problem that combines single-image feature extraction with arbitrarily long sequence understanding. By improving at the task of tagging videos with useful metadata labels, we necessarily improve our ability to understand the content and context of video data. Until recently, however, there has not been a large corpus of labelled video data for researchers to study. With Google's release of the YouTube-8M [1] dataset, academic researchers now have access to 7 million video URLs, 450,000 hours of video, 3.2 billion features, and 4716 label classes. In conjunction with Kaggle's announcement of their Video Understanding Challenge (https://www.kaggle.com/c/youtube8m), this project applies state-of-the-art deep learning methods to the problem of automatically labelling video frame data. We propose a hybrid CNN-RNN architecture that takes the image features generated for each video and combines them with an LSTM model run over the word embeddings of the label set, generating label predictions that take label correlation and dependency into account. We show that this model outperforms baseline models that operate only on raw image features without accounting for structural label similarity.

1. Introduction

The goal of this project was to generate a classifier that most accurately labels a collection of YouTube videos with up to 20 tags that denote the genre and context of each video.

1.1. Dataset

The YouTube-8M dataset (https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html) consists of 7 million videos sampled uniformly at random from the entire collection of YouTube videos available publicly online. Every video sampled has at least 1000 views, is between 120 and 500 seconds long, is associated with at least one tag in the vocabulary, and is not considered to have adult content.

Figure 1. Example images tagged with the label "guitar" from the YouTube-8M dataset.

Due to the volume of data in the collection, pre-computed features have been derived from the source videos. 1.6 billion video features were extracted using Google's Inception-V3 image annotation model [2] (https://www.tensorflow.org/tutorials/image_recognition), and 1.6 billion audio features were extracted using a VGG acoustic model. Both sets of features were run through PCA and quantized such that the combined set of all features is less than 2TB.

Both video level and frame level features are provided for each video. 1024 8-bit quantized visual features are provided per second of video (frame), up to 300 seconds, and 128 8-bit audio features are provided per second of video as well, up to 300 seconds.
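Because the released features are stored as 8-bit quantized values, they need to be mapped back to floating point before training. A minimal sketch of that dequantization is shown below; the symmetric [-2, 2] range mirrors the bounds used by the official starter code and is an assumption that should be checked against the released files.

```python
import numpy as np

def dequantize(quantized, min_value=-2.0, max_value=2.0):
    """Map 8-bit quantized features back to float32 values.

    Assumes the features were linearly quantized into [min_value, max_value];
    adjust the bounds if the released files use a different range.
    """
    quantized = np.asarray(quantized, dtype=np.float32)
    scale = (max_value - min_value) / 255.0
    return quantized * scale + min_value

# Example: one frame's 1024 visual features stored as uint8.
frame = np.random.randint(0, 256, size=1024, dtype=np.uint8)
rgb_features = dequantize(frame)  # float32 vector in [-2, 2]
```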
Every video is annotated with 1 to 31 tags that identify the themes of the video. The tags were generated using a combination of content, metadata, context, and user signals: a neural network produced candidate tags as a first pass, and 3 human raters curated them as a second pass.

The tags are drawn from a vocabulary of 4716 Knowledge Graph entities (https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html). Each tag is represented by at least 101 videos in the dataset. Tags are grouped into one of 24 different verticals, or high level categories.
Figure 2. The distribution of videos tagged with each label roughly follows a power law distribution.

In addition to providing the Knowledge Graph entity ID associated with each tag, a Wikipedia URL and summary description of each tag is provided for context.

1.2. Metrics

In evaluating the performance of our tagging system, we use Kaggle's preferred evaluation metric, Global Average Precision (GAP), defined as:

GAP = Σ_{i=1}^{N} p(i) Δr(i)

where p(i) is the precision over the top i predictions, Δr(i) is the change in recall contributed by prediction i, and N is the number of predictions (label/confidence pairs). For the purpose of the Kaggle competition, we limit our submission to the top k predictions per video, where k = 20. The total number of predictions is therefore N = k * m, where m is the number of videos in the test set.
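To make the metric concrete, here is a minimal sketch of computing GAP over pooled predictions; the function name and the toy inputs are illustrative, not taken from the competition toolkit.

```python
import numpy as np

def global_average_precision(predictions, num_positives):
    """Compute GAP over (confidence, is_correct) pairs pooled across videos.

    predictions: list of (confidence, is_correct) tuples, where is_correct is
    1 if the predicted label is in the video's ground truth set, 0 otherwise.
    Assumes the per-video top-k cutoff has already been applied.
    num_positives: total number of ground-truth labels across all videos.
    """
    ranked = sorted(predictions, key=lambda pair: -pair[0])
    correct = np.array([hit for _, hit in ranked], dtype=np.float64)
    precisions = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    recall_deltas = correct / num_positives  # Δr(i) is nonzero only on hits
    return float(np.sum(precisions * recall_deltas))

# Toy example: 4 pooled predictions against 3 ground-truth labels in total.
preds = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 1)]
print(global_average_precision(preds, num_positives=3))  # ~0.806
```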
2. Related Work

Research in the area of multi-label classification for rich video data has been limited in the past by the lack of an accurately labeled, large scale database of such videos. The release of the YouTube-8M dataset marks the beginning of a new set of opportunities to explore this nascent space. That said, the problem of multi-label video classification can be compared to image labelling and image captioning, both of which have been heavily studied in recent years.

Investigation into the problem of multi-label image classification has taken off in recent years due to the creation of the ImageNet [3] database. Wang et al. [4] created a hybrid CNN-RNN architecture that attempted to learn a joint image-label embedding that could be used to generate a set of distinct labels for a given image, as well as learn the semantic label dependency structure. Their approach provided the intuition behind our own approach to the problem of video classification, as did their decision to use beam search during inference.

Hu et al. [5] also applied a CNN-RNN architecture to the problem of multi-label image classification, but instead of using word embeddings as input to the recurrent sub-network, they use a word similarity matrix constructed from the WordNet [6] taxonomy. One limitation of Hu et al.'s approach is that it relies heavily on the labels having a pre-existing, rich hierarchy of attributes that can be used to generate a structure from coarse attributes to fine labels. The Google Knowledge Graph taxonomy, in contrast to WordNet, uses Schema.org (http://schema.org/) entities, which lack rich attribute associations.

Liu et al. [7] expanded upon the work done by Wang et al. [4] but separated the problem of learning the visual concepts (tags) from learning the concept similarity structure. They proposed using a semantic regularization embedding between the CNN image features and the RNN label features. The idea of separating these two subproblems for more efficient training is interesting; there was not sufficient time in this study to explore it fully, but we believe it could potentially be used to improve training performance on our model.

There has also been work done directly on the problem of multi-label video classification, but often with the addition of certain metadata features or raw visual/audio signals we did not have access to in this study. Yang and Toderici [8] combined metadata with raw video, but their metadata consisted of per-user watching statistics, which we lacked in the YouTube-8M dataset.

Jiang et al. [9] proposed a model called rDNN (Regularized Deep Neural Network) that extracts visual, audio, and trajectory features for each video and combines them using a series of deep fully connected layers. Label structure is incorporated by concatenating label-level predictions into another series of fully connected layers for final prediction. In our work, we do something similar to fuse LSTM outputs together for final prediction, which goes one step further than the feed-forward model
proposed by Jiang et al.

Karpathy et al. [10] used CNNs on the problem of video tagging, but focused primarily on improvements to the upstream CNN architecture to enhance predictive capability. Because our dataset provides pre-computed CNN features, our focus is primarily on synthesizing these rich features with label structure, and on constructing a robust ensemble model for final prediction.

Vinyals et al. [11] produced an open source image captioning architecture (https://github.com/tensorflow/models/tree/master/im2txt) similar to the multi-label image classification model proposed by Wang et al. [4], including the use of beam search during inference. The problem of image captioning is very similar to the problem of image labelling, with the key difference being that in the former the order of the outputted labels matters, and in the latter it does not. Their approach was one of the first to create a holistic model that learns visual and label features jointly, in contrast to previous models [12, 13] that stitched those subproblems together using disjoint model architectures.

Other approaches that do not employ the CNN-RNN architecture for exploiting label structure in multi-label image classification include [14], which trains a series of binary classifiers that predict whether a given label is present in the image, given the image features and the previous predictions. Graphical models have also been employed to model label similarity, including Conditional Random Fields [15], Dependency Networks [16], co-occurrence matrices [17], and label augment models [18]. However, these models fall short in only capturing pairwise label correlations, whereas an RNN can efficiently capture more complex probability distributions, as Wang et al. [4] demonstrated. As such, our solution also adopts a CNN-RNN formulation.
3. Methods

Our model uses the video-level visual features generated by the Inception-V3 network in conjunction with an LSTM to generate per-label confidence scores, roughly probabilities that the given video is tagged with the given label. Concretely, we attempt to minimize the cross-entropy loss of the predicted label probabilities against the true label probabilities, where a true label has probability 1 if the video is tagged with the label, and 0 if it is not.

The model architecture described by Figure 3 is our version of the CNN-RNN hybrid architecture popularized for image classification and captioning by [4, 11]. It begins by taking the robust visual features C generated by the Inception-V3 network as inputs to a multi-layer perceptron (MLP), shown in the figure in purple and denoted Video Projection. Given a sequence of input features of length N, the Video Projection layer non-linearly combines these features using a sequence of ReLU-activated fully connected layers to generate a vector in the concept embedding space. Each concept vector has dimensionality D, so we project the final output vector I from this layer to have length D as well.

At training time, we select from the set of ground truth labels the K labels with the highest confidence for the given video. We break ties by selecting the labels that most frequently occur in the dataset, so as to optimize our prior probability of choosing a valid label for the video. We then perform a lookup into our label embedding matrix to obtain the concept vectors of length D corresponding to the top K labels for the current video. We discuss the concept embeddings further in Section 3.2 below.

Figure 3. The network architecture used in our model. The video features (generated by a CNN) and concept embeddings are provided as inputs. Several dense layers are used to combine features and project their representations into the appropriate dimensions. An LSTM is used to learn the label dependencies. The video and label features are combined into a concatenated feature vector for prediction.
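To make the training-time label selection and embedding lookup concrete, here is a minimal sketch; the helper names, the frequency dictionary, and the toy shapes are illustrative assumptions rather than the project's actual code.

```python
import numpy as np

def select_top_k_labels(ground_truth, label_frequency, k=10):
    """Pick up to k ground-truth labels, breaking ties by dataset frequency.

    ground_truth: list of label ids tagged on the video.
    label_frequency: dict mapping label id -> number of training videos
    carrying that label, used as the tie-breaking prior.
    """
    ranked = sorted(ground_truth, key=lambda label: -label_frequency.get(label, 0))
    return ranked[:k]

def lookup_concept_vectors(label_ids, embedding_matrix):
    """Gather the D-dimensional concept vectors for the chosen labels."""
    return embedding_matrix[label_ids]  # shape (len(label_ids), D)

# Toy example with a 5-label vocabulary and D = 4.
embeddings = np.random.randn(5, 4).astype(np.float32)
frequency = {0: 900, 1: 120, 2: 450, 3: 60, 4: 300}
top_labels = select_top_k_labels([1, 2, 4], frequency, k=2)  # -> [2, 4]
label_vectors = lookup_concept_vectors(top_labels, embeddings)
```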
Now that we have embedded the ground truth labels L to represent our desired predictions at each time step in the RNN, we begin the process of learning the label dependencies. We use a Long Short-Term Memory (LSTM) cell for this step. The LSTM cell is given as its initial input the output vector I from the Video Projection layer. At each time step t, the LSTM outputs an M-dimensional vector of output features. The M output features at timestep t, denoted Ot, are finally concatenated with the raw CNN input features C to produce the final feature vector:

O2 = [C, O]

Here, O is the T x M matrix of LSTM outputs over the T timesteps, where O[t] = Ot. This matrix acts as our Video/Label Projection output from Figure 3. Finally, we apply a multi-layer perceptron to O2 to produce our final V-dimensional output of label probabilities, given our label vocabulary of length V. Similar to the Video Projection layer, we apply a series of ReLU dense layers to this input to produce our final output confidence scores, which are normalized with a sigmoid function to produce the final probabilities.
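The end-to-end architecture described above can be sketched as follows. This is a simplified reconstruction in tf.keras, not the project's training code: the dimensions follow the values reported in this paper (C has 1024 features, D = 300, K = 10, M = 512, V = 4716), while the sizes of the intermediate dense layers and the flattening of the LSTM outputs before the final MLP are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

V = 4716      # label vocabulary size
D = 300       # concept embedding dimensionality
K = 10        # number of label concept vectors fed to the LSTM
M = 512       # LSTM output units
C_DIM = 1024  # video-level visual feature dimensionality

# Inputs: the video-level CNN features C and the K concept vectors of the
# selected labels (ground truth at training time, beam-searched at inference).
video_features = layers.Input(shape=(C_DIM,), name="video_features")
label_vectors = layers.Input(shape=(K, D), name="label_concept_vectors")

# "Video Projection": ReLU dense layers projecting C into the D-dim concept space.
hidden = layers.Dense(1024, activation="relu")(video_features)
projected = layers.Dense(D, activation="relu", name="video_projection")(hidden)

# The projected vector I seeds the label sequence fed to the LSTM.
first_step = layers.Reshape((1, D))(projected)
sequence = layers.Concatenate(axis=1)([first_step, label_vectors])

# LSTM over the label sequence, keeping the output O_t at every timestep.
lstm_outputs = layers.LSTM(M, return_sequences=True)(sequence)
flattened = layers.Flatten()(lstm_outputs)

# O2 = [C, O]: fuse the raw CNN features with the LSTM outputs, then apply
# the final MLP with a sigmoid over the V labels.
fused = layers.Concatenate()([video_features, flattened])
hidden2 = layers.Dense(1024, activation="relu")(fused)
probabilities = layers.Dense(V, activation="sigmoid", name="label_probs")(hidden2)

model = tf.keras.Model([video_features, label_vectors], probabilities)
model.compile(optimizer="adam", loss="binary_crossentropy")
```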
3.1. Inference

At the time of inference, we no longer have access to the ground truth label vectors for the purpose of generating the input features to our LSTM cell at time t. Instead, we must synthesize these features from our previous predictions.

One approach would be to either sample or select the argmax label from the output probabilities of the model at time t-1, but this creates the problem that if our first guess is wrong, the entire subsequent chain of predictions is likely to be wrong as well.

As an alternative, we use a technique known as beam search, where we select some number of beams B, and at each time step we retain the B highest scoring labels.

To generate our a priori estimates of the label scores independent of their semantic structure, we use the same dense MLP architecture from the Video Projection layer to generate a vector of features projected into the label vocabulary space. Given V possible labels in our vocabulary, the resulting vector is of length V and corresponds to rough confidence scores for each label in our vocabulary with respect to the current image. At inference time, we use these coarse confidence values to make our first set of selections for the beam search.
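A minimal sketch of this inference procedure is shown below. The additive score combination and the score_next callable (which in the full model would wrap the LSTM conditioned on the embeddings of the labels chosen so far) are simplifying assumptions.

```python
import numpy as np

def beam_search(initial_scores, score_next, num_steps, beam_width=3):
    """Simplified label beam search.

    initial_scores: length-V array of coarse per-label confidences used to
    seed the beams (the output of the dense projection described above).
    score_next: callable taking the list of labels chosen so far and
    returning a length-V array of scores for the next label.
    """
    seeds = np.argsort(-initial_scores)[:beam_width]
    beams = [([int(label)], float(initial_scores[label])) for label in seeds]

    for _ in range(num_steps - 1):
        candidates = []
        for prefix, score in beams:
            next_scores = score_next(prefix)
            for label in np.argsort(-next_scores)[:beam_width]:
                if int(label) in prefix:
                    continue  # do not emit the same label twice
                candidates.append((prefix + [int(label)],
                                   score + float(next_scores[label])))
        # Retain only the B highest scoring partial label sequences.
        beams = sorted(candidates, key=lambda beam: -beam[1])[:beam_width]
    return beams

# Toy usage with a random stand-in for the model over a 20-label vocabulary.
rng = np.random.default_rng(0)
coarse_scores = rng.random(20)
results = beam_search(coarse_scores, lambda prefix: rng.random(20),
                      num_steps=5, beam_width=3)
```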
3.2. Concept Embeddings

It has been shown [19] that word embeddings generated by methods such as word2vec (https://code.google.com/archive/p/word2vec/) are effective at encoding semantic relationships between words in a language. Because our labels are entities or concepts that span multiple words, we needed vectors trained to learn the context of holistic concepts.

We explored using entity vectors generated by Google using word2vec and the Freebase open source knowledge base, but found that 845 of the tags in our vocabulary were missing from the pretrained embeddings. Furthermore, the embeddings had dimensionality 1000, which was prohibitively expensive for training. We decided to explore other embeddings.

As an alternative, we used the concept embeddings described in more detail in [20], which attempt to encode entity similarity with higher accuracy than word2vec and have dimensionality 300. Of the 4716 label classes in the vocabulary, 2226 had direct concept vector mappings, 2237 were approximated from the Wikipedia descriptions provided in the CSV accompanying the YouTube-8M dataset, and 37 were missing entirely and had to be generated with random noise.

To approximate concept vectors from Wikipedia descriptions, we extracted the noun phrases from the descriptions and queried the concept vectors corresponding to each noun phrase. We then appealed to the geometric properties of word vectors and simply calculated the center of mass of all these concept embeddings to create approximate concept vectors.
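A minimal sketch of this approximation is given below, assuming a lookup table of known concept vectors and a pluggable noun phrase extractor; both are placeholders rather than the actual resources used.

```python
import numpy as np

def approximate_concept_vector(description, concept_vectors, extract_noun_phrases):
    """Approximate a missing concept vector from a Wikipedia description.

    concept_vectors: dict mapping known phrases to D-dimensional vectors.
    extract_noun_phrases: callable extracting noun phrases from text, e.g.
    a chunker from an NLP toolkit (treated here as an assumption).
    """
    matched = [concept_vectors[phrase]
               for phrase in extract_noun_phrases(description)
               if phrase in concept_vectors]
    if not matched:
        # Fall back to random noise, as was done for the unmatched labels.
        dim = len(next(iter(concept_vectors.values())))
        return np.random.randn(dim).astype(np.float32)
    # "Center of mass" of the matched phrase vectors.
    return np.mean(matched, axis=0)

# Toy usage with a naive extractor that just splits the description on commas.
vectors = {"electric guitar": np.ones(3), "musical instrument": np.zeros(3)}
approx = approximate_concept_vector(
    "electric guitar, musical instrument",
    vectors,
    extract_noun_phrases=lambda text: [p.strip() for p in text.split(",")])
```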
4. Results

Our experiments were run on Google Compute Engine using the full set of 1024 visual features aggregated for each video and the full vocabulary of 4716 tags.

4.1. Baselines

In evaluating the performance of our models, we began by investigating several baseline models. All of these baselines, except for the dense model described in more detail below, were provided by Google as part of the YouTube-8M starter code (https://github.com/google/youtube-8m).

The first and simplest baseline model we examined was a Logistic Model that applies a fully connected layer over the video level features in the input, using a sigmoid activation function and an L2 regularization penalty of 1e-8.

The second model, and the one that performed best in evaluation, is the Mixture of Experts (MoE) model. This is an ensemble method in which multiple classifiers (experts) divide the feature space into homogeneous regions, such that one classifier predicts on one set of features and another classifier on others. An additional gating network is used to determine which classifier to use for which region of the input. We chose for our baseline a model consisting of two experts and one dummy network that always predicts 0.
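As a rough illustration of how such a per-label mixture can be evaluated, here is a sketch of the forward pass; the weight shapes and the treatment of the dummy expert are assumptions, and the official starter code remains the authoritative implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    shifted = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return shifted / np.sum(shifted, axis=axis, keepdims=True)

def moe_forward(features, expert_weights, gate_weights, num_experts=2):
    """Forward pass of a per-label mixture of logistic experts.

    features:       (batch, F) video-level features.
    expert_weights: (F, V * num_experts) logistic expert weights.
    gate_weights:   (F, V * (num_experts + 1)) gating weights; the extra
                    gate corresponds to a dummy expert that always predicts 0.
    """
    batch = features.shape[0]
    vocab = expert_weights.shape[1] // num_experts
    # Per-label expert predictions, plus an explicit all-zero dummy expert.
    experts = sigmoid(features @ expert_weights).reshape(batch, vocab, num_experts)
    experts = np.concatenate([experts, np.zeros((batch, vocab, 1))], axis=2)
    # Gating distribution over the experts for each label.
    gates = softmax((features @ gate_weights).reshape(batch, vocab, num_experts + 1))
    return np.sum(gates * experts, axis=2)  # (batch, V) label probabilities
```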
The third video level model is a dense multi-layer perceptron that attempts to mirror the MLP layers from our CNN-RNN network, to determine how much gain we achieve by introducing the semantic concept embeddings. The dense model consists of two hidden layers with ReLU activations and 1024 hidden units. A small regularization penalty is applied to prevent overfitting.

The next two baseline models were selected to operate on the frame level features, to see if using these more granular features would result in greater predictive capability.

The first of these models was a Deep Bag of Frames model that projected the features for each frame into a higher dimensional clustering space and pooled across the frames within that space. It used a configurable video-level model to classify the newly aggregated features, and randomly sampled either frames or sequences of frames during training to speed up convergence.

The final baseline used a stack of Long Short-Term Memory (LSTM) networks to represent each video. It used a forget bias of 1.0 to improve performance. The input to the LSTM was the frame at each time step, up to 300 frames in total, with each frame consisting of 1024 RGB features. A learning rate of 0.001 was used in place of the 0.01 used in other baseline models. The LSTM was very slow to train overall, and surprisingly performed worse than other models along most metrics.

An interesting observation from the baseline experimentation was that video level models tended to significantly outperform frame level models in both time to convergence and overall GAP score. The conclusion we drew was that there was little headroom to be found in simply finding a better way to generate image features from the frames of the video, and that more significant performance gains could be found by focusing instead on the semantic similarity between the labels.

4.2. Hyperparameters

We chose a learning rate of 0.001, which helped prevent the LSTM from overstepping and consistently predicting 0 scores for all labels. In the Top K layers, we selected K = 10. In practice, we submit the top 20 labels when computing the GAP score, but we found empirically that increasing K to 20 did not significantly improve performance and severely slowed down time to convergence.

For the LSTM layers, we chose M, the number of units in the output, to be 512, and stacked the LSTMs into three separate layers.

In beam search, we selected B = 3 as a means of reducing the computational overhead of using a larger number of paths.

4.3. Evaluation

For evaluation, we trained both the video level and frame level models for 20 epochs.

In addition to the GAP score described above, we also looked at Hit@1, which computes the fraction of videos for which the top scoring prediction is a valid label in the ground truth label set. PERR, or precision at equal recall rate, gives us the average precision at the point where precision and recall are equivalent. In practice, we found that some classifiers performed better on some metrics than others, but no classifier diverged too heavily on one metric over the others.
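For reference, minimal sketches of these two additional metrics are shown below, assuming dense score and binary label matrices; they are illustrative rather than the evaluation code used to produce Table 1.

```python
import numpy as np

def hit_at_one(scores, labels):
    """Fraction of videos whose top scoring prediction is a true label.

    scores: (num_videos, V) predicted confidences.
    labels: (num_videos, V) binary ground-truth indicator matrix.
    """
    top = np.argmax(scores, axis=1)
    return float(np.mean(labels[np.arange(len(top)), top]))

def precision_at_equal_recall_rate(scores, labels):
    """PERR: for a video with n true labels, precision of its top n predictions."""
    precisions = []
    for video_scores, video_labels in zip(scores, labels):
        n = int(np.sum(video_labels))
        if n == 0:
            continue
        top_n = np.argsort(-video_scores)[:n]
        precisions.append(np.sum(video_labels[top_n]) / n)
    return float(np.mean(precisions))
```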
Model               GAP     Hit@1   PERR
Logistic (video)    .720    .800    .660
MoE (video)         .760    .805    .679
Dense (video)       .708    .811    .669
Deep Bag (frame)    .692    .781    .633
LSTM (frame)        .674    .773    .611
CNN-RNN (video)     .890    .962    .865

Table 1. Performance of each classifier with respect to each of the evaluation metrics described. We note which models use video level features and which use frame level features.
As can be seen from the results above, the CNN-RNN architecture we have laid out significantly outperforms the baseline models on the evaluation dataset.

5. Conclusions

In summary, we have presented a variation of the CNN-RNN architecture recently popularized as state-of-the-art for the task of image captioning, and applied it to the problem of multi-label video classification. By learning the structural similarity between tags in the label space, we were able to significantly improve classification performance over baseline models.

In contrast to other approaches to this problem, we did not focus on directly improving the quality of the RGB features by passing them through very deep networks or attempting to create a better aggregator of the frame level features, finding instead that the base video features with a simple model significantly outperformed a more powerful LSTM model on the frame features. Instead, we focused on label similarity and learning the structure of the label space, and achieved strong improvements as a result of this alternative approach.

Future work should examine ways to combine the beam search generated labels into a unified confidence score for each label that can be compared directly with other classifiers. In this model, we were unable to make an entirely fair comparison against the baselines because the sequence-generating nature of our model diverged from the logistic classification models used in the baselines.

We further suspect that, with additional work integrating semantic regularization, this architecture could prove to be a component of state-of-the-art video tagging systems.

References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan: YouTube-8M: A Large-Scale Video Classification Benchmark, 2016; arXiv:1609.08675.
[2] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens: Rethinking the Inception Architecture for Computer Vision, 2015; arXiv:1512.00567.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009, pages 248-255. IEEE, 2009.
[4] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang: CNN-RNN: A Unified Framework for Multi-label Image Classification, 2016; arXiv:1604.04573.
[5] Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao: Learning Structured Inference Neural Networks with Label Relations, 2015; arXiv:1511.05616.
[6] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM (CACM), 38(11):39-41, 1995.
[7] Feng Liu, Tao Xiang, Timothy M. Hospedales, Wankou Yang: Semantic Regularisation for Recurrent Image Annotation, 2016; arXiv:1611.05490.
[8] W. Yang and G. Toderici. Discriminative tag learning on YouTube videos with latent sub-tags. In CVPR, 2011.
[9] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. arXiv preprint arXiv:1502.07209, 2015.
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725-1732, Columbus, Ohio, USA, 2014.
[11] Oriol Vinyals, Alexander Toshev, Samy Bengio: Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016; arXiv:1609.06647. DOI: 10.1109/TPAMI.2016.2587640.
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[13] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[14] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In IEEE 12th International Conference on Computer Vision (ICCV), pages 237-244. IEEE, 2009.
[15] N. Ghamrawi and A. McCallum. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 195-200. ACM, 2005.
[16] Y. Guo and S. Gu. Multi-label classification using conditional dependency networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 22, page 1300, 2011.
[17] X. Xue, W. Zhang, J. Zhang, B. Wu, J. Fan, and Y. Lu. Correlative multi-label multi-instance image annotation. In IEEE International Conference on Computer Vision (ICCV), pages 651-658. IEEE, 2011.
[18] X. Li, F. Zhao, and Y. Guo. Multi-label image classification with a probabilistic label enhancement model. In UAI, 2014.
[19] Tomas Mikolov, Kai Chen, Greg Corrado: Efficient Estimation of Word Representations in Vector Space, 2013; arXiv:1301.3781.
[20] Robert Speer and Catherine Havasi. Representing General Relational Knowledge in ConceptNet 5. In LREC, 2012; http://lrec-conf.org/proceedings/lrec2012/pdf/1072_Paper.pdf