Autoencoding beyond pixels using a learned similarity metric

Anders Boesen Lindbo Larsen (1), abll@dtu.dk
Søren Kaae Sønderby (2), skaaesonderby@gmail.com
Hugo Larochelle (3), hlarochelle@twitter.com
Ole Winther (1,2), olwi@dtu.dk

(1) Department for Applied Mathematics and Computer Science, Technical University of Denmark
(2) Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark
(3) Twitter, Cambridge, MA, USA

arXiv:1512.09300v2 [cs.LG] 10 Feb 2016

Abstract

We present an autoencoder that leverages learned representations to better measure similarities in data space. By combining a variational autoencoder with a generative adversarial network we can use learned feature representations in the GAN discriminator as basis for the VAE reconstruction objective. Thereby, we replace element-wise errors with feature-wise errors to better capture the data distribution while offering invariance towards e.g. translation. We apply our method to images of faces and show that it outperforms VAEs with element-wise similarity measures in terms of visual fidelity. Moreover, we show that the method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.

(Preliminary work submitted to the International Conference on Machine Learning (ICML).)

Figure 1. Overview of our network. We combine a VAE with a GAN by collapsing the decoder and the generator into one (x → encoder → z → decoder/generator → x̃ → discriminator → real/generated).

1. Introduction

Deep architectures have allowed a wide range of discriminative models to scale to large and diverse datasets. However, generative models still have problems with complex data distributions such as images and sound. In this work, we show that currently used similarity metrics impose a hurdle for learning good generative models and that we can improve a generative model by employing a learned similarity measure.

When learning models such as the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), the choice of similarity metric is central as it provides the main part of the training signal via the reconstruction error objective. For this task, element-wise measures like the squared error are the default. Element-wise metrics are simple but not very suitable for image data, as they do not model the properties of human visual perception. E.g. a small image translation might result in a large pixel-wise error whereas a human would barely notice the change. Therefore, we argue in favor of measuring image similarity using a higher-level and sufficiently invariant representation of the images. Rather than hand-engineering a suitable measure to accommodate the problems of element-wise metrics, we want to learn a function for the task. The question is how to learn such a similarity measure. We find that by jointly training a VAE and a generative adversarial network (GAN) (Goodfellow et al., 2014) we can use the GAN discriminator to measure sample similarity. We achieve this by combining a VAE with a GAN as shown in Fig. 1. We collapse the VAE decoder and the GAN generator into one by letting them share parameters and training them jointly. For the VAE training objective, we replace the typical element-wise reconstruction metric with a feature-wise metric expressed in the discriminator.

1.1. Contributions

Our contributions are as follows:
• We combine VAEs and GANs into an unsupervised generative model that simultaneously learns to encode, generate and compare dataset samples.

• We show that generative models trained with learned similarity measures produce better image samples than models trained with element-wise error measures.

• We demonstrate that unsupervised training results in a latent image representation with disentangled factors of variation (Bengio et al., 2013). This is illustrated in experiments on a dataset of face images labelled with visual attribute vectors, where it is shown that simple arithmetic applied in the learned latent space produces images that reflect changes in these attributes.

2. Autoencoding with learned similarity

In this section we provide background on VAEs and GANs. Then, we introduce our method for combining both approaches, which we refer to as VAE/GAN. As we'll describe, our proposed hybrid is motivated as a way to improve VAE, so that it relies on a more meaningful, feature-wise metric for measuring reconstruction quality during training.

2.1. Variational autoencoder

A VAE consists of two networks that encode a data sample x to a latent representation z and decode the latent representation back to data space, respectively:

    z ∼ Enc(x) = q(z|x) ,    x̃ ∼ Dec(z) = p(x|z) .    (1)

The VAE regularizes the encoder by imposing a prior over the latent distribution p(z). Typically z ∼ N(0, I) is chosen. The VAE loss is minus the sum of the expected log likelihood (the reconstruction error) and a prior regularization term:

    L_VAE = −E_{q(z|x)}[ log( p(x|z) p(z) / q(z|x) ) ] = L_llike^pixel + L_prior    (2)

with

    L_llike^pixel = −E_{q(z|x)}[ log p(x|z) ]    (3)
    L_prior = D_KL( q(z|x) ‖ p(z) ) ,    (4)

where D_KL is the Kullback-Leibler divergence.
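For concreteness, Eqs. 2-4 can be sketched in a few lines. The following is a minimal PyTorch-style illustration, not the authors' DeepPy implementation; the Gaussian observation model, the hypothetical enc/dec modules and the assumption that enc(x) returns the mean and log-variance of q(z|x) are choices made only for this sketch.

```python
import torch
import torch.nn.functional as F

def vae_losses(enc, dec, x):
    """Plain VAE objective: pixel-wise reconstruction (Eq. 3) plus KL prior term (Eq. 4).

    Assumes enc(x) -> (mu, logvar) of q(z|x) and dec(z) -> reconstruction x_tilde;
    both modules are hypothetical stand-ins.
    """
    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
    x_tilde = dec(z)

    # L_llike^pixel: element-wise (Gaussian) reconstruction error.
    l_pixel = F.mse_loss(x_tilde, x, reduction='sum') / x.size(0)

    # L_prior: KL(q(z|x) || N(0, I)) in closed form.
    l_prior = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)

    return l_pixel, l_prior
```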
2.2. Generative adversarial network

A GAN consists of two networks: the generator network Gen(z) maps latents z to data space, while the discriminator network assigns probability y = Dis(x) ∈ [0, 1] that x is an actual training sample and probability 1 − y that x is generated by our model through x = Gen(z) with z ∼ p(z). The GAN objective is to find the binary classifier that gives the best possible discrimination between true and generated data while simultaneously encouraging Gen to fit the true data distribution. We thus aim to maximize/minimize the binary cross entropy

    L_GAN = log(Dis(x)) + log(1 − Dis(Gen(z))) ,    (5)

with respect to Dis / Gen, with x being a training sample and z ∼ p(z).

2.3. Beyond element-wise reconstruction error with VAE/GAN

An appealing property of GAN is that its discriminator network implicitly has to learn a rich similarity metric for images, so as to discriminate them from "non-images". We thus propose to exploit this observation so as to transfer the properties of images learned by the discriminator into a more abstract reconstruction error for the VAE. The end result will be a method that combines the advantage of GAN as a high quality generative model and VAE as a method that produces an encoder of data into the latent space z.

Specifically, since element-wise reconstruction errors are not adequate for images and other signals with invariances, we propose replacing the VAE reconstruction (expected log likelihood) error term from Eq. 3 with a reconstruction error expressed in the GAN discriminator. To achieve this, let Dis_l(x) denote the hidden representation of the lth layer of the discriminator. We introduce a Gaussian observation model for Dis_l(x) with mean Dis_l(x̃) and identity covariance:

    p(Dis_l(x)|z) = N(Dis_l(x) | Dis_l(x̃), I) ,    (6)

where x̃ ∼ Dec(z) is the sample from the decoder of x. We can now replace the VAE error of Eq. 3 with

    L_llike^Dis_l = −E_{q(z|x)}[ log p(Dis_l(x)|z) ] .    (7)

We train our combined model with the triple criterion

    L = L_prior + L_llike^Dis_l + L_GAN .    (8)

Notably, we optimize the VAE wrt. L_GAN, which we regard as a style error in addition to the reconstruction error, which can be interpreted as a content error using the terminology from Gatys et al. (2015). Moreover, since both Dec and Gen map from z to x, we share the parameters between the two (or in other words, we use Dec instead of Gen in Eq. 5).

In practice, we have observed the devil in the details during development and training of this model. We therefore provide a list of practical considerations in this section. We refer to Fig. 2 and Alg. 1 for overviews of the training procedure.
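Eqs. 6-7 amount (up to an additive constant) to a squared error between discriminator activations of the real image and its reconstruction. Below is a minimal sketch of that term, assuming a hypothetical dis.features(x, layer=l) helper that exposes the lth-layer hidden representation; the triple criterion of Eq. 8 is then simply the sum of this term with L_prior and L_GAN.

```python
import torch.nn.functional as F

def feature_reconstruction_loss(dis, x, x_tilde, layer):
    """L_llike^Dis_l (Eq. 7): the Gaussian observation model on Dis_l(x) with
    mean Dis_l(x_tilde) and identity covariance reduces to a squared error
    between discriminator features (plus a constant that does not affect training)."""
    feat_real = dis.features(x, layer=layer)        # Dis_l(x); hypothetical helper
    feat_rec = dis.features(x_tilde, layer=layer)   # Dis_l(x~)
    return 0.5 * F.mse_loss(feat_rec, feat_real, reduction='sum') / x.size(0)
```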
Figure 2. Flow through the combined VAE/GAN model during training. Gray lines represent terms in the training objective.

Algorithm 1  Training the VAE/GAN model

    θ_Enc, θ_Dec, θ_Dis ← initialize network parameters
    repeat
        X ← random mini-batch from dataset
        Z ← Enc(X)
        L_prior ← D_KL( q(Z|X) ‖ p(Z) )
        X̃ ← Dec(Z)
        L_llike^Dis_l ← −E_{q(Z|X)}[ log p(Dis_l(X)|Z) ]
        Z_p ← samples from prior N(0, I)
        X_p ← Dec(Z_p)
        L_GAN ← log(Dis(X)) + log(1 − Dis(X̃)) + log(1 − Dis(X_p))
        // Update parameters according to gradients
        θ_Enc ←+ −∇_θ_Enc (L_prior + L_llike^Dis_l)
        θ_Dec ←+ −∇_θ_Dec (γ L_llike^Dis_l − L_GAN)
        θ_Dis ←+ −∇_θ_Dis L_GAN
    until deadline

Limiting error signals to relevant networks. Using the loss function in Eq. 8, we train both a VAE and a GAN simultaneously. This is possible because we do not update all network parameters wrt. the combined loss. In particular, Dis should not try to minimize L_llike^Dis_l as this would collapse the discriminator to 0. We also observe better results by not backpropagating the error signal from L_GAN to Enc.

Weighting VAE vs. GAN. As Dec receives an error signal from both L_llike^Dis_l and L_GAN, we use a parameter γ to weight the ability to reconstruct vs. fooling the discriminator. This can also be interpreted as weighting style and content. Rather than applying γ to the entire model (Eq. 8), we perform the weighting only when updating the parameters of Dec:

    θ_Dec ←+ −∇_θ_Dec (γ L_llike^Dis_l − L_GAN)    (9)

Discriminating based on samples from p(z) and q(z|x). We observe better results when using samples from q(z|x) (i.e. the encoder Enc) in addition to our prior p(z) in the GAN objective:

    L_GAN = log(Dis(x)) + log(1 − Dis(Dec(z))) + log(1 − Dis(Dec(Enc(x))))    (10)

Note that the regularization of the latent space, L_prior, should make the set of samples from either p(z) or q(z|x) similar. However, for any given example x, the negative sample Dec(Enc(x)) is much more likely to be similar to x than Dec(z). When updating according to L_GAN, we suspect that having similar positive and negative samples makes for a more useful learning signal.
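Put together, the considerations above can be read as a single training step. The following is a hedged PyTorch-style sketch of one Alg. 1 update, with the γ weighting of Eq. 9 and the extra negative samples of Eq. 10; it is not the released DeepPy/CUDArray code. The module interfaces, the dis.features helper and the default γ value are assumptions (in the setup of Section 4 the optimizers would be RMSProp with learning rate 0.0003).

```python
import torch
import torch.nn.functional as F

def vaegan_step(enc, dec, dis, opt_enc, opt_dec, opt_dis, x, gamma=1.0):
    """One VAE/GAN update in the spirit of Alg. 1 (a sketch under stated assumptions).

    Assumed interfaces: enc(x) -> (mu, logvar), dec(z) -> image, dis(x) -> probability
    in [0, 1], dis.features(x) -> Dis_l(x) activations of a chosen layer.
    """
    n = x.size(0)
    real = torch.ones(n, 1, device=x.device)
    fake = torch.zeros(n, 1, device=x.device)

    # X, X~ = Dec(Enc(X)) and X_p = Dec(Z_p) with Z_p ~ N(0, I), as in Alg. 1.
    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    x_tilde = dec(z)
    x_p = dec(torch.randn_like(z))

    # L_prior (Eq. 4), feature-wise L_llike^Dis_l (Eq. 7), L_GAN (Eq. 10).
    l_prior = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / n
    l_feat = 0.5 * F.mse_loss(dis.features(x_tilde), dis.features(x),
                              reduction='sum') / n
    l_gan = (F.binary_cross_entropy(dis(x), real)
             + F.binary_cross_entropy(dis(x_tilde), fake)
             + F.binary_cross_entropy(dis(x_p), fake))

    # Limit error signals: Enc sees L_prior + L_llike^Dis_l, Dec sees the
    # gamma-weighted trade-off of Eq. 9, Dis sees only L_GAN. All gradients
    # are computed before any parameter is changed.
    g_enc = torch.autograd.grad(l_prior + l_feat, list(enc.parameters()),
                                retain_graph=True)
    g_dec = torch.autograd.grad(gamma * l_feat - l_gan, list(dec.parameters()),
                                retain_graph=True)
    g_dis = torch.autograd.grad(l_gan, list(dis.parameters()))

    for params, grads, opt in ((enc.parameters(), g_enc, opt_enc),
                               (dec.parameters(), g_dec, opt_dec),
                               (dis.parameters(), g_dis, opt_dis)):
        for p, g in zip(params, grads):
            p.grad = g
        opt.step()

    return float(l_prior), float(l_feat), float(l_gan)
```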
3. Related work

Element-wise distance measures are notoriously inadequate for complex data distributions like images. In the computer vision community, preprocessing images is a prevalent solution to improve robustness to certain perturbations. Examples of preprocessing are contrast normalization, working with gradient images or pixel statistics gathered in histograms. We view these operations as a form of metric engineering to account for the shortcomings of simple element-wise distance measures. A more detailed discussion on the subject is provided by Wang & Bovik (2009).

Neural networks have been applied to metric learning in the form of the Siamese architecture (Bromley et al., 1993; Chopra et al., 2005). The learned distance metric is minimized for similar samples and maximized for dissimilar samples using a max margin cost. However, since Siamese networks are trained in a supervised setup, we cannot apply them directly to our problem.

Several attempts at improving on element-wise distances for generative models have been proposed within the last year. Ridgeway et al. (2015) apply the structural similarity index as an autoencoder (AE) reconstruction metric for grey-scale images. Yan et al. (2015) let a VAE output two additional images to learn shape and edge structures more explicitly. Mansimov et al. (2015) append a GAN-based sharpening step to their generative model. Mathieu et al. (2015) supplement a squared error measure with both a GAN and an image gradient-based similarity measure to improve image sharpness of video prediction. While all these extensions yield visibly sharper images, they do not have the same potential for capturing high-level structure compared to a deep learning approach.

In contrast to AEs that model the relationship between a dataset sample and a latent representation directly, GANs learn to generate samples indirectly. By optimizing the GAN generator to produce samples that imitate the dataset according to the GAN discriminator, GANs avoid element-wise similarity measures by construction. This is a likely explanation for their ability to produce high-quality images as demonstrated by Denton et al. (2015); Radford et al. (2015).

Lately, convolutional networks with upsampling have proven useful for generating images from a latent representation. This has sparked interest in learning image embeddings where semantic relationships can be expressed using simple arithmetic, similar to the surprising results of the word2vec model by Mikolov et al. (2013). First, Dosovitskiy et al. (2015) used supervised training to train a convolutional network to generate chairs given high-level information about the desired chair. Later, Kulkarni et al. (2015); Yan et al. (2015); Reed et al. (2015) have demonstrated encoder-decoder architectures with disentangled feature representations, but their training schemes rely on supervised information. Radford et al. (2015) inspect the latent space of a GAN after training and find directions corresponding to eyeglasses and smiles. As they rely on pure GANs, however, they cannot encode images, making it challenging to explore the latent space.

Our idea of a learned similarity metric is partly motivated by the neural artistic style network of Gatys et al. (2015), who demonstrate the representational power of deep convolutional features. They obtain impressive results by optimizing an image to have similar features as a subject image and similar feature correlations as a style image in a pretrained convolutional network. In our VAE/GAN model, one could view L_llike^Dis_l as content and L_GAN as style. Our style term, though, is not computed from feature correlations but is the error signal from trying to fool the GAN discriminator.
4. Experiments

Measuring the quality of generative models is challenging as current evaluation methods are problematic for larger natural images (Theis et al., 2015). In this work, we use images of size 64×64 and focus on more qualitative assessments since traditional log likelihood measures do not capture visual fidelity. Indeed, we have tried discarding the GAN discriminator after training of the VAE/GAN model and computing a pixel-based log likelihood using the remaining VAE. The results are far from competitive with plain VAE models (on the CIFAR-10 dataset).

In this section we investigate the performance of different generative models:

• Plain VAE with an element-wise Gaussian observation model.

• VAE with a learned distance (VAE_Disl). We first train a GAN and use the discriminator network as a learned similarity measure. We select a single layer l at which we measure the similarity according to Dis_l. l is chosen such that the comparison is performed after three downsamplings, each by a factor of 2, in the convolutional encoder.

• The combined VAE/GAN model. This model is similar to VAE_Disl but we also optimize Dec wrt. L_GAN.

• GAN. This model has recently been shown capable of generating high-quality images (Radford et al., 2015).

All models share the same architectures for Enc, Dec and Dis respectively. For all our experiments, we use convolutional architectures and use backward convolution (aka. fractional striding) with stride 2 to upscale images in Dec. Backward convolution is achieved by flipping the convolution direction such that striding causes upsampling. Our models are trained with RMSProp using a learning rate of 0.0003 and a batch size of 64. In Table 1 we list the network architectures. We refer to our implementation available online at http://github.com/andersbll/autoencoding_beyond_pixels.

4.1. CelebA face images

We apply our methods to face images from the CelebA dataset (Liu et al., 2015); we use the aligned and cropped version of the dataset. It consists of 202,599 images annotated with 40 binary attributes such as eyeglasses, bangs, pale skin etc. We scale and crop the images to 64×64 pixels and use only the images (not the attributes) for unsupervised training.

After training, we draw samples from p(z) and propagate these through Dec to generate new images, which are shown in Fig. 3. The plain VAE is able to draw the frontal part of the face sharply, but off-center the images get blurry. This is because the dataset aligns faces using frontal landmarks. When we move too far away from the aligned parts, the recognition model breaks down because pixel correspondence cannot be assumed. VAE_Disl produces sharper images even off-center because the reconstruction error is lifted beyond pixels. However, we see severe noisy artefacts which we believe are caused by the harsh downsampling scheme. In comparison, VAE/GAN and pure GAN produce sharper images with more natural textures and face parts.

Additionally, we make the VAEs reconstruct images taken from a separate test set. Reconstruction is not possible with the GAN model as it lacks an encoder network. The results are shown in Fig. 4 and our conclusions are similar to what we observed for the random samples. Note that VAE_Disl generates noisy blue patterns in some of the reconstructions. We suspect the GAN-based similarity measure can collapse to 0 in certain cases (such as the pattern we observe), which encourages Dec to generate such patterns.
Enc:
  5×5 64 conv. ↓, BNorm, ReLU
  5×5 128 conv. ↓, BNorm, ReLU
  5×5 256 conv. ↓, BNorm, ReLU
  2048 fully-connected, BNorm, ReLU

Dec:
  8·8·256 fully-connected, BNorm, ReLU
  5×5 256 conv. ↑, BNorm, ReLU
  5×5 128 conv. ↑, BNorm, ReLU
  5×5 32 conv. ↑, BNorm, ReLU
  5×5 3 conv., tanh

Dis:
  5×5 32 conv., ReLU
  5×5 128 conv. ↓, BNorm, ReLU
  5×5 256 conv. ↓, BNorm, ReLU
  5×5 256 conv. ↓, BNorm, ReLU
  512 fully-connected, BNorm, ReLU
  1 fully-connected, sigmoid

Table 1. Architectures for the three networks that comprise VAE/GAN. ↓ and ↑ represent down- and upsampling respectively. BNorm denotes batch normalization (Ioffe & Szegedy, 2015). When batch normalization is applied to convolutional layers, per-channel normalization is used.
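Read column-wise, Table 1 translates roughly into the following PyTorch-style modules; this is a sketch, not the authors' DeepPy implementation. The latent dimensionality, padding and output-padding choices, and the split of the encoder output into mean and log-variance heads are assumptions not fixed by the table.

```python
import torch
import torch.nn as nn

def conv(cin, cout, down=True, bn=True):
    """5x5 convolution block; stride 2 when downsampling (the table's ↓)."""
    layers = [nn.Conv2d(cin, cout, 5, stride=2 if down else 1, padding=2)]
    if bn:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def upconv(cin, cout):
    """5x5 backward convolution with stride 2 (the table's ↑), doubling resolution."""
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 5, stride=2, padding=2, output_padding=1),
        nn.BatchNorm2d(cout), nn.ReLU())

latent_dim = 128  # assumption; the latent size is not given in this excerpt

# Enc: three 5x5 downsampling convs (64, 128, 256) and a 2048-unit FC layer,
# here followed by a linear head producing mean and log-variance of q(z|x).
enc = nn.Sequential(
    conv(3, 64), conv(64, 128), conv(128, 256), nn.Flatten(),
    nn.Linear(256 * 8 * 8, 2048), nn.BatchNorm1d(2048), nn.ReLU(),
    nn.Linear(2048, 2 * latent_dim))   # chunk into (mu, logvar) downstream

# Dec: FC to an 8x8x256 map, three 5x5 upsampling convs (256, 128, 32), tanh output.
dec = nn.Sequential(
    nn.Linear(latent_dim, 8 * 8 * 256), nn.BatchNorm1d(8 * 8 * 256), nn.ReLU(),
    nn.Unflatten(1, (256, 8, 8)),
    upconv(256, 256), upconv(256, 128), upconv(128, 32),
    nn.Conv2d(32, 3, 5, padding=2), nn.Tanh())

# Dis: one plain 5x5 conv (32), three downsampling convs (128, 256, 256),
# a 512-unit FC layer and a sigmoid output unit.
dis = nn.Sequential(
    conv(3, 32, down=False, bn=False),
    conv(32, 128), conv(128, 256), conv(256, 256), nn.Flatten(),
    nn.Linear(256 * 8 * 8, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 1), nn.Sigmoid())

# Quick shape check: 64x64 images in, 64x64 reconstructions and scalar scores out.
x = torch.randn(2, 3, 64, 64)
mu, logvar = enc(x).chunk(2, dim=1)
print(dec(mu).shape, dis(x).shape)   # -> torch.Size([2, 3, 64, 64]) torch.Size([2, 1])
```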

Figure 3. Samples from different generative models (rows: VAE, VAE_Disl, VAE/GAN, GAN).

Figure 4. Reconstructions from different autoencoders (rows: input, VAE, VAE_Disl, VAE/GAN).

4.1.1. Visual attribute vectors

Inspired by attempts at learning embeddings in which semantic concepts can be expressed using simple arithmetic (Mikolov et al., 2013), we inspect the latent space of a trained VAE/GAN model. The idea is to find directions in the latent space corresponding to specific visual features in image space.

We use the binary attributes of the dataset to extract visual attribute vectors. For all images we use the encoder to calculate latent vector representations. For each attribute, we compute the mean vector for images with the attribute and the mean vector for images without the attribute. We then compute the visual attribute vector as the difference between the two mean vectors. This is a very simple method for calculating visual attribute vectors that will have problems with highly correlated visual attributes such as heavy makeup and wearing lipstick. In Fig. 5, we show face images as well as the reconstructions after adding different visual attribute vectors to the latent representations. Though not perfect, we clearly see that the attribute vectors capture semantic concepts like eyeglasses, bangs, etc. E.g. when bangs are added to the faces, both the hair color and the hair texture match the original face. We also see that being a man is highly correlated with having a mustache, which is caused by attribute correlations in the dataset.

Figure 5. Using the VAE/GAN model to reconstruct dataset samples with visual attribute vectors added to their latent representations (columns: input, reconstruction, and reconstructions with individual attribute vectors added).
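The attribute vectors are simply differences of class-conditional means in latent space. Below is a small sketch of that computation and of applying a vector to a latent code; the array shapes and the encode/decode stand-ins are illustrative assumptions, not part of the paper.

```python
import numpy as np

def attribute_vector(latents, has_attribute):
    """Visual attribute vector: mean latent code of images with the attribute
    minus the mean latent code of images without it.

    latents: (N, latent_dim) array of encoder outputs; has_attribute: (N,) bool array.
    """
    return latents[has_attribute].mean(axis=0) - latents[~has_attribute].mean(axis=0)

# Usage sketch: shift a face's latent code along the "eyeglasses" direction.
# `encode`/`decode` stand in for Enc/Dec of a trained VAE/GAN model (hypothetical).
# z = encode(face_image)
# z_glasses = z + attribute_vector(latents, attrs[:, EYEGLASSES])
# modified_face = decode(z_glasses)
```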

4.2. Attribute similarity, Labeled Faces in the Wild

Inspired by the attribute similarity experiment of Yan et al. (2015), we seek a more quantitative evaluation of our generated images. The idea is to learn a generative model for face images conditioned on facial attributes. At test time, we generate face images from chosen attribute configurations and let a separately trained regressor network predict the attributes from the generated images. A good generative model should be able to produce visual attributes that are correctly recognized by the regression model. To imitate the original experiment, we use Labeled Faces in the Wild (LFW) images (Huang et al., 2007) with attributes (Kumar et al., 2009). We align the face images according to the landmarks in (Zhu et al., 2014). Additionally, we crop and resize the images to 64×64 pixels and augment the dataset with common operations. Again, we refer to our implementation online for more details.

We construct conditional VAE, GAN and VAE/GAN models by concatenating the attribute vector to the vector representation of the input in Enc, Dec and Dis, similar to (Mirza & Osindero, 2014). For Enc and Dis, the attribute vector is concatenated to the input of the top fully connected layer.
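Conditioning by concatenation can be sketched as follows. This is only an illustration of the idea: the layer sizes follow Table 1 loosely and the abbreviated module layout is an assumption, not the paper's exact conditional architecture.

```python
import torch
import torch.nn as nn

class ConditionalDec(nn.Module):
    """Decoder/generator conditioned on an attribute vector y by concatenation
    (sketch; the convolutional body is abbreviated compared with Table 1)."""

    def __init__(self, latent_dim=128, attr_dim=40):
        super().__init__()
        self.fc = nn.Linear(latent_dim + attr_dim, 8 * 8 * 256)
        self.body = nn.Sequential(
            nn.ConvTranspose2d(256, 32, 5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 3, 5, padding=2),
            nn.Tanh(),
        )

    def forward(self, z, y):
        # The attribute vector enters the model alongside the latent code.
        h = self.fc(torch.cat([z, y], dim=1))
        return self.body(h.view(-1, 256, 8, 8))

class ConditionalTopLayer(nn.Module):
    """For Enc and Dis, y is concatenated to the input of the top FC layer."""

    def __init__(self, feat_dim=256 * 8 * 8, attr_dim=40, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim + attr_dim, out_dim)

    def forward(self, features, y):
        return torch.relu(self.fc(torch.cat([features.flatten(1), y], dim=1)))
```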
Our regression network has almost the same architecture as Enc. We train using the LFW training set, and during testing, we condition on the test set attributes and sample faces to be propagated through the regression network. Figure 6 shows faces generated by conditioning on attribute vectors from the test set. We report regressor performance numbers in Table 2. Compared to an ordinary VAE, the VAE/GAN model yields significantly better attributes visually, which leads to a smaller recognition error. The GAN network performs surprisingly poorly and we suspect that this is caused by instabilities during training (GAN models are very difficult to train reliably due to the minimax objective function). Note that our results are not directly comparable with those of Yan et al. (2015) since we do not have access to their preprocessing scheme nor their regression model.

Model          Cosine similarity   Mean squared error
LFW test set   0.9193              14.1987
VAE            0.9030              27.59 ± 1.42
GAN            0.8892              27.89 ± 3.07
VAE/GAN        0.9114              22.39 ± 1.16

Table 2. Attribute similarity scores. To replicate (Yan et al., 2015), the cosine similarity is measured as the best out of 10 samples per attribute vector from the test set. The mean squared error is computed over the test set and statistics are measured over 25 runs.

Figure 6. Generating samples conditioned on the LFW attributes listed alongside their corresponding image (for each query image the prominent attributes are listed and samples from the conditional VAE, GAN and VAE/GAN are shown; e.g. one query's prominent attributes are White, Fully Visible Forehead, Mouth Closed, Male, Curly Hair, Eyes Open, Pale Skin, Frowning, Pointy Nose, Teeth Not Visible, No Eyewear).
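The evaluation protocol behind Table 2 (cosine similarity taken as the best of 10 generated samples per attribute configuration, plus a mean squared error between predicted and target attributes) could be computed roughly as follows. Here generate and regressor are stand-ins for the conditional model and the separately trained regression network, and how the squared error aggregates over the 10 samples is an assumption, since the excerpt does not specify it.

```python
import numpy as np

def attribute_scores(attrs, generate, regressor, n_samples=10):
    """Best-of-n cosine similarity and mean squared error between the regressor's
    predicted attributes and the conditioning attributes (cf. Table 2)."""
    cos_best, sq_err = [], []
    for a in attrs:                      # one attribute configuration per test image
        preds = np.stack([regressor(generate(a)) for _ in range(n_samples)])
        cos = preds @ a / (np.linalg.norm(preds, axis=1) * np.linalg.norm(a))
        cos_best.append(cos.max())       # best out of n_samples, as in the caption
        sq_err.append(np.mean((preds - a) ** 2))  # aggregation over samples assumed
    return float(np.mean(cos_best)), float(np.mean(sq_err))
```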
4.3. Unsupervised pretraining for supervised tasks

For completeness, we report that we have tried evaluating VAE/GAN in a semi-supervised setup by unsupervised pretraining followed by finetuning using a small number of labeled examples (for both the CIFAR-10 and STL-10 datasets). Unfortunately, we have not been able to reach results competitive with the state of the art (Rasmus et al., 2015; Zhao et al., 2015). We speculate that the intra-class variation may be too high for the VAE/GAN model to learn good generalizations of the different object classes.

5. Discussion

The problems with element-wise distance metrics are well known in the literature and many attempts have been made at going beyond pixels, typically using hand-engineered measures. Much in the spirit of deep learning, we argue that the similarity measure is yet another component which can be replaced by a learned model capable of capturing high-level structure relevant to the data distribution. In this work, our main contribution is an unsupervised scheme for learning and applying such a distance measure. With the learned distance measure we are able to train an image encoder-decoder network generating images of unprecedented visual fidelity as shown by our experiments. Moreover, we show that our network is able to disentangle factors of variation in the input data distribution and discover visual attributes in the high-level representation of the latent space. In principle, this lets us employ a large set of unlabeled images for training and use a small set of labeled images to discover features in latent space.

We regard our method as an extension of the VAE framework. Though, it must be noted that the high quality of our generated images is due to the combined training of Dec as both a VAE decoder and a GAN generator. This makes our method more of a hybrid between VAE and GAN, and alternatively, one could view our method more as an extension of GAN where p(z) is constrained by an additional network.

It is not obvious that the discriminator network of a GAN provides a useful similarity measure as it is trained for a different task, namely being able to tell generated samples from real samples. However, convolutional features are often surprisingly good for transfer learning, and as we show, good enough in our case to improve on element-wise distances for images. It would be interesting to see if better features in the distance measure would improve the model, e.g. by employing a similarity measure provided by a Siamese network trained on faces, though in practice Siamese networks are not a good fit with our method as they require labeled data. Alternatively, one could investigate the effect of using a pretrained feedforward network for measuring similarity.

In summary, we have demonstrated a first attempt at unsupervised learning of encoder-decoder models as well as a similarity measure. Our results show that the visual fidelity of our method is competitive with GAN, which in that regard is considered state of the art. We therefore consider learned similarity measures a promising step towards scaling up generative models to more complex data distributions.

Acknowledgements

We would like to thank Søren Hauberg, Casper Kaae Sønderby and Lars Maaløe for insightful discussions, Nvidia for donating GPUs used in experiments, and the authors of DeepPy (http://github.com/andersbll/deeppy) and CUDArray (Larsen, 2014) for the software frameworks used to implement our model.

References
Bengio, Yoshua, Courville, Aaron, and Vincent, Pierre. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798-1828, 2013.

Bromley, Jane, Bentz, James W., Bottou, Léon, Guyon, Isabelle, LeCun, Yann, Moore, Cliff, Säckinger, Eduard, and Shah, Roopak. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 07(04):669-688, 1993.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, volume 1, pp. 539-546, June 2005.

Denton, Emily L, Chintala, Soumith, Szlam, Arthur, and Fergus, Rob. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28, pp. 1486-1494. Curran Associates, Inc., 2015.

Dosovitskiy, Alexey, Springenberg, Jost Tobias, and Brox, Thomas. Learning to generate chairs with convolutional neural networks. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1538-1546, 2015.

Gatys, Leon A., Ecker, Alexander S., and Bethge, Matthias. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672-2680. Curran Associates, Inc., 2014.

Huang, Gary B., Ramesh, Manu, Berg, Tamara, and Learned-Miller, Erik. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 448-456. JMLR Workshop and Conference Proceedings, 2015.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

Kulkarni, Tejas D., Whitney, Will, Kohli, Pushmeet, and Tenenbaum, Joshua B. Deep convolutional inverse graphics network. CoRR, abs/1503.03167, 2015.

Kumar, Neeraj, Berg, Alexander C., Belhumeur, Peter N., and Nayar, Shree K. Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 365-372, Sept 2009.

Larsen, Anders Boesen Lindbo. CUDArray: CUDA-based NumPy. Technical Report DTU Compute 2014-21, Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2014.

Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

Mansimov, Elman, Parisotto, Emilio, Ba, Lei Jimmy, and Salakhutdinov, Ruslan. Generating images from captions with attention. CoRR, abs/1511.02793, 2015.

Mathieu, Michaël, Couprie, Camille, and LeCun, Yann. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111-3119. Curran Associates, Inc., 2013.

Mirza, Mehdi and Osindero, Simon. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.

Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems 28, pp. 3532-3540. Curran Associates, Inc., 2015.

Reed, Scott E, Zhang, Yi, Zhang, Yuting, and Lee, Honglak. Deep visual analogy-making. In Advances in Neural Information Processing Systems 28, pp. 1252-1260. Curran Associates, Inc., 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278-1286, 2014.

Ridgeway, Karl, Snell, Jake, Roads, Brett, Zemel, Richard S., and Mozer, Michael C. Learning to generate images with perceptual similarity metrics. CoRR, abs/1511.06409, 2015.

Theis, Lucas, van den Oord, Aäron, and Bethge, Matthias. A note on the evaluation of generative models. CoRR, abs/1511.01844, 2015.

Wang, Zhou and Bovik, A.C. Mean squared error: Love it or leave it? A new look at signal fidelity measures. Signal Processing Magazine, IEEE, 26(1):98-117, Jan 2009.

Yan, X., Yang, J., Sohn, K., and Lee, H. Attribute2Image: Conditional image generation from visual attributes. CoRR, abs/1512.00570, 2015.

Zhao, Junbo, Mathieu, Michael, Goroshin, Ross, and LeCun, Yann. Stacked what-where auto-encoders. CoRR, abs/1506.02351, 2015.

Zhu, Shizhan, Li, Cheng, Loy, Chen Change, and Tang, Xiaoou. Transferring landmark annotations for cross-dataset face alignment. CoRR, abs/1409.0602, 2014.
