CNN-Generated Images Are Surprisingly Easy to Spot... for Now
Sheng-Yu Wang1    Oliver Wang2    Richard Zhang2    Andrew Owens1,3    Alexei A. Efros1
1UC Berkeley    2Adobe Research    3University of Michigan
[Figure 1: paired rows of synthetic (top) and real (bottom) images for each generator: ProGAN [21], StyleGAN [22], BigGAN [9], CycleGAN [54], StarGAN [12], GauGAN [34], CRN [11], IMLE [26], SITD [10], Super-res. [15], Deepfakes [39].]

Figure 1: Are CNN-generated images hard to distinguish from real images? We show that a classifier trained to detect images generated by only one CNN (ProGAN, far left) can detect those generated by many other models (remaining columns). Our code and models are available at https://peterwang512.github.io/CNNDetection/.
Abstract

In this work we ask whether it is possible to create a "universal" detector for telling apart real images from those generated by a CNN, regardless of architecture or dataset used. To test this, we collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, cascaded refinement networks, implicit maximum likelihood estimation, second-order attention super-resolution, seeing-in-the-dark). We demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods (including the just-released StyleGAN2 [23]). Our findings suggest the intriguing possibility that today's CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis.

1. Introduction

Recent rapid advances in deep image synthesis techniques, such as Generative Adversarial Networks (GANs), have generated a huge amount of public interest and concern, as people worry that we are entering a world where it will be impossible to tell which images are real and which are fake [16]. This issue has started to play a significant role in global politics; in one case, a video of the president of Gabon that was claimed by the opposition to be fake was one factor leading to a failed coup d'état.* Much of this concern has been directed at specific manipulation techniques, such as "deepfake"-style face replacement [3] and photorealistic synthetic humans [22]. However, these methods represent only two instances of a broader set of techniques: image synthesis via convolutional neural networks (CNNs). Our goal in this work is to find a general image forensics approach for detecting CNN-generated imagery.

Detecting whether an image was generated by a specific synthesis technique is relatively straightforward — just train a classifier on a dataset consisting of real images and images synthesized by the technique in question. However, such an approach will likely be tied to the dataset used in image generation (e.g. faces), and, due to dataset bias [41], might not generalize when tested on new data (e.g. cars). Even worse, the technique-specific detector is likely to soon become ineffective as generation methods evolve and the technique it was trained on becomes obsolete.

It is natural, therefore, to ask whether today's CNN-generated images contain common artifacts, e.g., some kind of detectable CNN fingerprints, that would allow a classifier to generalize to an entire family of generation methods, rather than a single one. Unfortunately, prior work has reported generalization to be a significant problem for

* https://www.motherjones.com/politics/2019/03/deepfake-gabon-ali-bongo/
image forensics approaches. For example, several recent works [50, 14, 43] observe that classifiers trained on images produced by one GAN architecture perform poorly when tested on others, and in many cases they also fail to generalize when only the dataset (and not the architecture or task) is changed [50]. This makes sense, as image generation methods are highly varied: they use different datasets, network architectures, loss functions, and image pre-processing.

In this paper, we show that, contrary to this current understanding, classifiers trained to detect CNN-generated images can exhibit a surprising amount of generalization ability across datasets, architectures, and tasks. We follow convention and train our classifiers in a straightforward manner: we generate a large number of fake images using a single CNN model (we use ProGAN, a high-performing unconditional GAN model [21]) and train a binary classifier to detect fakes, using the model's real training images as negative examples.
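This training recipe lends itself to a compact sketch. The backbone choice, directory layout, and hyperparameters below are our own illustrative assumptions (they are not specified in this section), not the authors' released code:

    # Minimal sketch: binary real-vs-fake classifier trained on ProGAN outputs,
    # with ProGAN's real training images as negatives. Assumes an ImageFolder
    # layout with "fake/" and "real/" subfolders (hypothetical path).
    import torch
    import torch.nn as nn
    import torchvision
    from torchvision import transforms

    model = torchvision.models.resnet50(pretrained=True)  # assumed backbone
    model.fc = nn.Linear(model.fc.in_features, 1)         # single logit

    dataset = torchvision.datasets.ImageFolder(
        "progan_train",  # hypothetical directory with fake/ and real/ subfolders
        transform=transforms.Compose([transforms.RandomCrop(224),
                                      transforms.ToTensor()]))
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed lr

    for images, labels in loader:  # one epoch
        # ImageFolder assigns labels alphabetically (fake -> 0, real -> 1),
        # so the logit here scores "realness".
        loss = criterion(model(images).squeeze(1), labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()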
To evaluate our model, we create a new dataset of CNN-generated images, the ForenSynths dataset, consisting of synthesized images from 11 models that range from unconditional image generation methods, such as StyleGAN [22], to super-resolution methods [15] and deepfakes [39]. Each model is trained on a different image dataset appropriate for its specific task. We have also continued evaluating our detector on models that were released after our paper was originally written, finding that it works out-of-the-box on the very recent unconditional GAN, StyleGAN2 [23].

Underneath the apparent simplicity of this approach, we have found that there are a number of subtle challenges, which we study through a set of experiments and a new dataset of trained image generation models. We find that data augmentation, in the form of common image post-processing operations, is critical for generalization, even when the target images are not post-processed themselves. We also find that the diversity of training images matters: large datasets sampled from CNN synthesis methods lead to classifiers that outperform those trained on smaller datasets, up to a point. Finally, it is critical to examine the effect of post-processing operations, which often occur downstream of image creation (e.g., during storage and distribution), on the model's generalization ability. We show that when the correct steps are taken, classifiers are indeed robust to common operations such as JPEG compression, blurring, and resizing.

In summary, our main contributions are: 1) we show that forensics models trained on CNN-generated images exhibit a surprising amount of generalization to other CNN synthesis methods; 2) we propose a new dataset and evaluation metric for detecting CNN-generated images; 3) we experimentally analyze the factors that account for cross-model generalization.

2. Related work

Detecting CNN-based Manipulations Several recent works have addressed the problem of detecting images generated by CNNs. Rössler et al. [39] evaluated methods for detecting face manipulation techniques, including CNN-based face and mouth replacement methods. While they showed that simple classifiers could detect fakes generated by the same model, they did not study generalization between models or datasets. Marra et al. [29] likewise showed that simple classifiers can detect images created by an image translation network [19], but did not consider cross-model transfer.

Recently, Cozzolino et al. [14] found that forensics classifiers transferred poorly between models, often obtaining near-chance performance. They propose a new representation learning method, based on autoencoders, to improve transfer performance in zero- and low-shot training regimes for a variety of generation methods. While their ultimate goal is similar to ours, they take an orthogonal approach: they focus on new learning methods for improving transfer learning, and apply them to a diverse assortment of models (including both CNN and non-CNN). In contrast, we empirically study the performance of simple "baseline" classifiers under different training and testing conditions for CNN-based image generation. Zhang et al. [50] find that classifiers generalize poorly between GAN models. They propose a method called AutoGAN for generating images that contain the upsampling artifacts common in GAN architectures, and test it on two types of GANs. Other work has proposed to detect GAN images using hand-crafted co-occurrence features [31], or with anomaly detection models built on pretrained face detectors [43]. Researchers have also proposed methods for identifying which of several known GANs generated a given image [30, 47].

Image forensics Researchers have proposed a variety of methods for detecting more traditional manipulation techniques, such as those made by image editing tools. Early work focused on hand-crafted cues [16] such as compression artifacts [5], resampling [37], or physical scene constraints [32]. More recently, researchers have applied learning-based methods to these problems [51, 18, 13, 38, 44]. This line of work has found, like us, that simple, supervised classifiers are often effective at detecting manipulations [51, 44].

Artifacts from CNN-based Generators Researchers have recently shown that common CNN designs contain artifacts that reduce their representational power. Much of this work has focused on the way networks perform upsampling and downsampling. A well-known example of such an artifact is the checkerboard artifact produced by deconvolutional layers [33].
Azulay and Weiss [6] showed that convolutional networks ignore the classical sampling theorem and that strided convolutions therefore reduce translation invariance, and Zhang [49] improved translation invariance by reducing aliasing in these layers. Very recently, Bau et al. [7] suggested that GANs have limited generation capacity, and analyzed the image structures that a pretrained GAN is unable to produce.

3. A dataset of CNN-based generation models

To study the transferability of classifiers trained to detect CNN-generated images, we collected a dataset of images created from a variety of CNN models.

3.1. Generation models

Our dataset contains 11 synthesis models. We chose methods that span a variety of CNN architectures, datasets, and losses. All of these models have an upsampling-convolutional structure (i.e. they generate images by a series of convolution and upsampling operations), since this is by far the most common design for generative CNNs. Examples of their synthesized images can be found in Figure 1. The statistics of each dataset are listed in Table 1. Details of the data collection process are provided in Appendix B.1.

Family             Method                Image Source           # Images
Unconditional GAN  ProGAN [21]           LSUN                     8.0k
                   StyleGAN [22]         LSUN                    12.0k
                   BigGAN [9]            ImageNet                 4.0k
Conditional GAN    CycleGAN [54]         Style/object transfer    2.6k
                   StarGAN [12]          CelebA                   4.0k
                   GauGAN [34]           COCO                    10.0k
Perceptual loss    CRN [11]              GTA                     12.8k
                   IMLE [26]             GTA                     12.8k
Low-level vision   SITD [10]             Raw camera                360
                   SAN [15]              Standard SR benchmark     440
Deepfake           FaceForensics++ [39]  Videos of faces          5.4k

Table 1: Generation models. We evaluate forensic classifiers on a variety of CNN-based image generation methods.

GANs We include three state-of-the-art unconditional GANs: ProGAN [21], StyleGAN [22], and BigGAN [9], trained on either the LSUN [46] or ImageNet [40] datasets. The network structures and training procedures for these models contain significant differences. ProGAN and StyleGAN train a different network for each category; StyleGAN injects large, per-pixel noise into the model to introduce high-frequency detail. BigGAN has a monolithic, class-conditional structure, is trained with very large batch sizes, and uses self-attention layers [48, 45].

We also include three conditional GANs: the state-of-the-art image-to-image translation method GauGAN [34], and the popular unpaired image-to-image translation methods CycleGAN [54] and StarGAN [12].

Perceptual loss We consider models that directly optimize a perceptual loss [20], with no adversarial training. This includes Cascaded Refinement Networks (CRN) [11], which synthesizes images in a coarse-to-fine manner, and the recent Implicit Maximum Likelihood Estimation (IMLE) conditional image translation model [26].

Low-level vision We include the Seeing In The Dark (SITD) model [10], which approximates long-exposure photography under low-light conditions from short-exposure raw camera input, using a high-resolution fully convolutional network. We also use a state-of-the-art super-resolution model, the Second Order Attention Network (SAN) [15].

Deep fakes We also evaluate our model on the face replacement images provided in the FaceForensics++ benchmark of Rössler et al. [39], which used the publicly available faceswap tool [1]. While "deepfake" is often used as a general term, we take inspiration from the convention in [39] and refer to this specific model as DeepFake. This model uses an autoencoder to generate faces, and images undergo extensive post-processing steps, including Poisson image blending [35] with real content. We note that our main goal is to detect images directly output by CNN decoders, while DeepFake serves as an out-of-distribution test case. Following [39], we use cropped faces.

3.2. Generating fake images

We collect images from the models, taking care to match the pre-processing operations performed by each (e.g. resizing and cropping). For each dataset, we collect fake images by generating them from the model without applying additional post-processing (or we download the officially released generated images if they are available). We collect an equal number of real images from each method's training set. To make the distribution of the real and fake images as close as possible, real images are pre-processed according to the pipeline prescribed by each method.

Since 256 × 256 is the most commonly shared output size among off-the-shelf image synthesis models (e.g., CycleGAN, StarGAN, ProGAN LSUN, GauGAN COCO, IMLE, etc.), we used this resolution for our dataset. For models that produce images at lower resolutions (e.g., DeepFake), we rescale the images using bilinear interpolation to 256 pixels on the shorter side, preserving the aspect ratio; for models that produce images at higher resolution (e.g., ProGAN, StyleGAN, SAN, SITD), we keep the images at their original resolution. Although these cases differ slightly from our training scheme, we observe that our model is still able to detect fake images in these categories. For all datasets, we make our real/fake prediction from 224 × 224 crops (random-crop at training time and center-crop at testing time).
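For concreteness, the resizing and cropping conventions above can be expressed with torchvision transforms. This is a sketch of the described pipeline, not the exact released preprocessing code:

    # Shorter-side resize to 256 (bilinear) only for lower-resolution sources
    # such as DeepFake; 224x224 random crop at train time, center crop at test.
    from torchvision import transforms

    def make_transform(train: bool, needs_upscale: bool) -> transforms.Compose:
        ops = []
        if needs_upscale:
            ops.append(transforms.Resize(
                256, interpolation=transforms.InterpolationMode.BILINEAR))
        ops.append(transforms.RandomCrop(224) if train
                   else transforms.CenterCrop(224))
        ops.append(transforms.ToTensor())
        return transforms.Compose(ops)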
                  Training settings                               Individual test generators                                                   Total
Family        Name             Train     Input  #Class  Blur JPEG  ProGAN StyleGAN BigGAN CycleGAN StarGAN GauGAN  CRN   IMLE  SITD  SAN   DeepFake  mAP
Zhang et al.  Cyc-Im           CycleGAN  RGB      –      –    –     84.3   65.7    55.1   100.     99.2    79.9   74.5  90.6  67.8  82.9   53.2     77.6
[50]          Cyc-Spec         CycleGAN  Spec     –      –    –     51.4   52.7    79.6   100.    100.     70.8   64.7  71.3  92.2  78.5   44.5     73.2
              Auto-Im          AutoGAN   RGB      –      –    –     73.8   60.1    46.1    99.9   100.     49.0   82.5  71.0  80.1  86.7   80.8     75.5
              Auto-Spec        AutoGAN   Spec     –      –    –     75.6   68.6    84.9   100.    100.     61.0   80.8  75.3  89.9  66.1   39.0     76.5
Ours          2-class          ProGAN    RGB      2      X    X     98.8   78.3    66.4    88.7    87.3    87.4   94.0  97.3  85.2  52.9   58.1     81.3
              4-class          ProGAN    RGB      4      X    X     99.8   87.0    74.0    93.2    92.3    94.1   95.8  97.5  87.8  58.5   59.6     85.4
              8-class          ProGAN    RGB      8      X    X     99.9   94.2    78.9    94.3    91.9    95.4   98.9  99.4  91.2  58.6   63.8     87.9
              16-class         ProGAN    RGB     16      X    X    100.    98.2    87.7    96.4    95.5    98.1   99.0  99.7  95.3  63.1   71.9     91.4
              No aug           ProGAN    RGB     20      –    –    100.    96.3    72.2    84.0   100.     67.0   93.5  90.3  96.2  93.6   98.2     90.1
              Blur only        ProGAN    RGB     20      X    –    100.    99.0    82.5    90.1   100.     74.7   66.6  66.7  99.6  53.7   95.1     84.4
              JPEG only        ProGAN    RGB     20      –    X    100.    99.0    87.8    93.2    91.8    97.5   99.0  99.5  88.7  78.1   88.1     93.0
              Blur+JPEG (0.5)  ProGAN    RGB     20      X    X    100.    98.5    88.2    96.8    95.4    98.1   98.9  99.5  92.7  63.9   66.3     90.8
              Blur+JPEG (0.1)  ProGAN    RGB     20      †    †    100.    99.6    84.5    93.5    98.2    89.5   98.2  98.4  97.2  70.5   89.0     92.6
Table 2: Cross-generator generalization results. We show the average precision (AP) of various classifiers from the baseline of Zhang et al. [50] and from ours, tested across 11 generators. The symbols X and † mean the augmentation is applied with 50% or 10% probability, respectively, at training. Chance is 50% and the best possible performance is 100%. When test generators are used in training, we show those results in gray (as they are not testing generalization). Values in black show cross-generator generalization; amongst those, the highest value is highlighted. We show ablations with respect to fewer classes in ProGAN and with data augmentation removed. We report the mean AP by averaging the AP scores over all datasets. Subsets are plotted in Figures 2, 3, and 4 for comparison.
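The per-generator metric in Table 2 is average precision over a generator's mixed set of real and fake images; it summarizes the ranking quality of the classifier's scores without fixing a threshold. One way to compute it is with scikit-learn (the paper does not prescribe a particular implementation; the values here are a toy example):

    import numpy as np
    from sklearn.metrics import average_precision_score

    y_true = np.array([0, 0, 1, 1])            # toy labels: 1 = fake, 0 = real
    y_score = np.array([0.1, 0.4, 0.35, 0.8])  # classifier "fakeness" scores
    print(100 * average_precision_score(y_true, y_score))  # 100 = perfect ranking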
[Figure 2: bar chart of AP per test generator (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, CRN, IMLE, SITD, SAN, DeepFake) for detectors trained with different augmentations, including Blur+JPEG(0.5) and Blur+JPEG(0.1).]

Figure 2: Effect of augmentation methods. All detectors are trained on ProGAN and tested on other generators (AP shown). In general, training with augmentation helps performance. Notable exceptions are super-resolution and DeepFake.
[Figure 3: bar chart of AP per test generator (ProGAN through DeepFake) for classifiers trained with 2, 4, 8, 16, and 20 LSUN classes; chance is 50.]

Figure 3: Effect of dataset diversity. All detectors are trained on ProGAN and tested on other generators (AP shown). Training with more classes improves performance. All runs use blur and JPEG augmentation with 50% probability.
[Figure 4: bar chart of AP per test generator for Zhang et al. (Auto-Im, Auto-Spec) and our Blur+JPEG(0.1) model; chance is 50.]

Figure 4: Model comparison. Compared to Zhang et al. [50], we observe that for the most part, our models generalize better to other architectures. Notable exceptions to this are CycleGAN (which is identical to the training architecture from [50]), StarGAN (where both methods obtain close to 100. AP), and SAN (where applying data augmentation hurts performance).
to 224×224 pixels without resizing in order to match the post-processing pipeline used by models during training. No data augmentation is included during testing; instead, we conduct experiments on model robustness under post-processing in Section 4.2.

4.2. Effect of data augmentation

In Table 2, we investigate the generalization ability of training with different augmentation methods. We find that using aggressive data augmentation (in the form of simulated post-processing) provides surprising generalization capabilities, even when such perturbations are not used at test time. Additionally, we observe that these models are significantly more robust to post-processing (Figure 5).

Augmentation (usually) improves generalization To begin, we evaluate the ProGAN-based classifier without augmentation, shown in the "No aug" row. As in previous work [39], we find that testing on held-out ProGAN images works well (100.0 AP). We then test how well it generalizes to other unconditional GANs. We find that it generalizes extremely well to StyleGAN, which has a similar network structure, but not as well to BigGAN. When adding augmentations, the performance on BigGAN significantly improves, 72.2 → 88.2. On conditional models (CycleGAN, GauGAN, CRN, and IMLE), performance is similarly improved: 84.0 → 96.8, 67.0 → 98.1, 93.5 → 98.9, and 90.3 → 99.5, respectively.

Interestingly, there are two models, SAN and DeepFake, where directly training on ProGAN without augmentation performs strongly (93.6 and 98.2, respectively), but augmentation hurts performance. As SAN is a super-resolution model, only high-frequency components can differentiate between real and fake images; removing such cues at training time (e.g. by blurring) would therefore be likely to reduce performance. As explained in Section 3.1, DeepFake serves as an out-of-distribution test case, since its images are not generated by CNN architectures alone, yet surprisingly our model is able to generalize to it. However, it remains challenging to identify clear reasons for the performance deterioration when applying augmentations. Applying augmentation at a reduced rate (Blur+JPEG (0.1)) offers a good balance: DeepFake detection is comparable to the no-augmentation case (89.0), while most other datasets are significantly improved over no augmentation.
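The simulated post-processing augmentation discussed above can be sketched as follows: Gaussian blur and a JPEG round-trip, each applied independently with probability p (0.5 or 0.1 in Table 2). The sigma and quality ranges below are our assumptions for illustration, chosen to loosely match the ranges swept in Figure 5:

    import io
    import random
    from PIL import Image, ImageFilter

    def augment(img: Image.Image, p: float = 0.5) -> Image.Image:
        if random.random() < p:  # simulated resampling via Gaussian blur
            img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 3)))
        if random.random() < p:  # simulated compression via JPEG round-trip
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=random.randint(30, 100))
            buf.seek(0)
            img = Image.open(buf).convert("RGB")
        return img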
[Figure 5: line plots of AP vs. blur sigma (0 to 4, left) and AP vs. JPEG quality (100 down to 30, right), with panels for ProGAN, StyleGAN, BigGAN, CycleGAN, and the other test sets; curves correspond to classifiers trained with different augmentations, including Blur+JPEG (0.1).]

Figure 5: Robustness. We show the effect on AP of test-time perturbations: (left) Gaussian blurring and (right) JPEG compression. We show classifiers trained on ProGAN, with different augmentations applied during training. Note that for both perturbations, when training without augmentation (red), performance degrades across all datasets as perturbations are added. In most cases, training with both augmentations performs best or near best. Notable exceptions are super-resolution (where no augmentation is best) and DeepFake, where training only with the perturbation used during testing, rather than both, performs best.
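The protocol behind Figure 5 can be sketched as a severity sweep: perturb every test image at a fixed setting, then recompute AP. Here images, labels, and score_fn are placeholders standing in for a test set and a trained classifier:

    from PIL import ImageFilter
    from sklearn.metrics import average_precision_score

    def ap_under_blur(images, labels, score_fn, sigma):
        scores = [score_fn(im.filter(ImageFilter.GaussianBlur(radius=sigma)))
                  for im in images]
        return average_precision_score(labels, scores)

    # e.g. sweep the sigmas on the x-axis of Figure 5 (left):
    # for sigma in (0, 1, 2, 3, 4):
    #     print(sigma, ap_under_blur(test_images, test_labels, score_fn, sigma))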
Augmentation improves robustness In many real-world scenarios, images that we would like to evaluate have undergone unknown post-processing operations, such as compression and resizing. We investigated whether CNN-generated images can still be detected, even after these post-processing steps. To test this, we blurred (simulating resampling) and JPEG-compressed the real and fake images following the protocol in [44], and evaluated our ability to detect them (Figure 5). On ProGAN (i.e. the case where the test distribution matches the training), performance is 100% even when applying augmentation operations, indicating that artifacts may not only be high-frequency, but exist across frequency bands. In terms of cross-generator generalization, the augmented model is most robust to post-processing operations that are included in data augmentation, agreeing with observations from [39, 44, 47, 50]. However, we note that our model also gains robustness from augmentation even when testing on out-of-distribution CNN models.

4.3. Effect of data diversity

Next, we asked how the diversity of the real and fake images in the training set affects a classifier's generalization ability.

Image diversity improves performance To study this, we varied the number of classes in the dataset used to train our real-or-fake classifier (Figure 3). Specifically, we trained multiple classifiers, each one on a subset of the full training dataset, by excluding both real and fake images derived from a specific set of LSUN classes. For all models we use the same augmentation scheme as the Blur+JPEG (0.5) model. We found that increasing the training set diversity improves performance, but only up to a point. When the number of classes used increases from 2 to 16, AP consistently improves, but we see diminishing returns: minimal improvement is observed when increasing from 16 to 20 classes. This indicates that there may be a training dataset that is "diverse enough" for practical generalization.

4.4. Comparison to other models

Next, we asked how our generalization performance compares to other proposed forensic methods. We compare our approach to Zhang et al. [50], which is a suite of classifiers trained to detect artifacts generated by a common CNN architecture, one shared by many image synthesis tasks such as CycleGAN and StarGAN. They introduced AutoGAN, an autoencoder based on CycleGAN's generator that simulates artifacts resembling those of CycleGAN images.
[Figure 6: BigGAN (top) and StarGAN (bottom) samples at the 0th, 25th, 50th, 75th, and 100th percentiles of the model's "fakeness" score.]

Figure 6: Does our model's confidence correlate with visual quality? We have found that for two models, BigGAN and StarGAN, the images on the left (considered more real) tend to look better than the images on the right (considered more fake). However, this does not seem to hold for the other models. More examples for each dataset are provided in Appendix A.1.
[Figure 7: average high-pass spectra for synthetic (top) and real (bottom) images from ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, CRN, IMLE, SITD, SAN, and DeepFake.]

Figure 7: Frequency analysis on each dataset. We show the average spectra of each high-pass filtered image, for both the real and fake images, similar to Zhang et al. [50]. We observe periodic patterns (dots or lines) in most of the synthetic images, while BigGAN and ProGAN contain relatively few such artifacts.
We considered four variations of pretrained models from Zhang et al. [50], each trained on one of the two image sources (CycleGAN and AutoGAN) and one of the two image representations (images and spectra). All four variants included JPEG and resize data augmentation during training to improve the robustness of each model. We found that our models generalized significantly better to other architectures, except on CycleGAN (which is the model architecture used by [50]) and StarGAN (where both methods obtain near 100.0 AP). The comparison results are shown in Table 2 and Figure 4. We also include comparisons to other baseline models in Appendix A.5.

4.5. New CNN models

We hope that as new deep synthesis models arrive, our system will detect them out-of-the-box. One such evaluation scenario has naturally arisen with the recent release of StyleGAN2 [23], a state-of-the-art unconditional GAN appearing in these proceedings. The StyleGAN2 model makes several changes to StyleGAN, including redesigned normalization, multi-resolution, and regularization methods. In Table 3, we test our detector on publicly available StyleGAN2 generators. We used our Blur+JPEG (0.1) model and tested on the LSUN car, cat, church, and horse variants. Despite these changes, our technique performs at 99.1% AP. These results reinforce the notion that training on today's generators can generalize well to future generators, given that they use similar underlying building blocks.

4.6. Qualitative Analysis

To understand how the network is able to generalize to unseen CNN models, we study what possible cues the classifier might be using by visualizing its "fakeness" ranking over the synthetic dataset. In addition, we analyze the difference between the frequency responses of real and synthetic images across datasets.
         ProGAN   StyleGAN   StyleGAN2
AP        100.      99.6       99.1

Table 3: Out-of-the-box evaluation on the recently released StyleGAN2 [23] model. We used our Blur+JPEG (0.1) model and tested on StyleGAN2. We observed that our model generalizes to detecting StyleGAN2 images. Numbers for ProGAN and StyleGAN are included for comparison.

"Fakeness" ranking by the model We study whether our model is learning subtle low-level features generated by CNN architectures, or high-level features such as visual quality. Taking a similar approach to previous image realism works [25, 53], we rank synthesized images from each dataset by the model's prediction, and visualize images at the 0th, 25th, 50th, 75th, and 100th percentiles of the "fakeness" score from our model's output.
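A sketch of this percentile selection, assuming per-image "fakeness" scores and file paths are already available (the names below are placeholders):

    import numpy as np

    def percentile_examples(scores, paths, qs=(0, 25, 50, 75, 100)):
        order = np.argsort(scores)  # most "real" first, most "fake" last
        picks = [order[round(q / 100 * (len(order) - 1))] for q in qs]
        return [paths[i] for i in picks]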
In most datasets, we observe little noticeable correlation between the model's predictions and the visual quality of the synthesized images. However, there is a weak correlation in the BigGAN and StarGAN datasets; qualitative examples are shown in Figure 6. As the "fakeness" scores increase, the images tend to contain more visible artifacts, which deteriorate the visual quality. This suggests that our model might learn to capture perceptual realism under this task. However, since the correlation is not observed in other datasets, it is more likely that the model learns features closer to low-level CNN artifacts. Examples across all datasets are provided in Appendix A.1.

Artifacts of CNN image synthesis Inspired by Zhang et al. [50], we visualize the average frequency spectra of each dataset to study the artifacts generated by CNNs, as shown in Figure 7. Following prior work, we perform a simple form of high-pass filtering (subtracting each image's median-blurred version) before calculating the Fourier transform, as it provides a more informative visualization [30]. For each dataset, we average over 2000 randomly chosen images (or the entire set, if it is smaller).
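A sketch of this visualization, with the median-filter size as an assumption (the text does not specify it):

    import numpy as np
    from PIL import Image
    from scipy.ndimage import median_filter

    def average_spectrum(paths, max_images=2000, median_size=3):
        acc, n = None, 0
        for p in paths[:max_images]:
            img = np.asarray(Image.open(p).convert("L"), dtype=np.float64)
            highpass = img - median_filter(img, size=median_size)  # high-pass
            spec = np.abs(np.fft.fftshift(np.fft.fft2(highpass)))
            acc = spec if acc is None else acc + spec
            n += 1
        return acc / n  # assumes all images share one resolution (256x256 here)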
We note that there are many interesting patterns visible in these visualizations. While the real image spectra generally look alike (with minor variations due to differences in the datasets), there are distinct patterns visible in images generated by different CNN models. Furthermore, the repeated periodic patterns in these spectra may be consistent with aliasing artifacts, a cue considered by [50]. Interestingly, the most effective unconditional GANs (BigGAN, ProGAN) contain relatively few such artifacts. Also, DeepFake images do not contain obvious artifacts. We note that DeepFake images have gone through various pre- and post-processing steps, in which the synthesized face region is resized, blended, and compressed with MPEG. These operations perturb the low-level image statistics, which may cause the frequency patterns to not emerge with this visualization method.

5. Discussion

Despite the alarm that has been raised by the rapidly improving quality of image synthesis methods, our results suggest that today's CNN-generated images retain detectable fingerprints that distinguish them from real photos. This allows forensic classifiers to generalize from one model to another without extensive adaptation.

However, this does not mean that the current situation will persist. Due to the difficulties in achieving Nash equilibria, none of the current GAN-based architectures are optimized to convergence, i.e. the generator never wins against the discriminator. Were this to change, we would suddenly find ourselves in a situation where synthetic images are completely indistinguishable from real ones.

Even with the current techniques, there remain practical reasons for concern. First, even the best forensics detector will have some trade-off between true-detection and false-positive rates. Since a malicious user is typically looking to create a single fake image (rather than a distribution of fakes), they could simply hand-pick the fake image that happens to pass the detection threshold. Second, malicious use of fake imagery is likely to be deployed on a social media platform (Facebook, Twitter, YouTube, etc.), so the data will undergo a number of often aggressive transformations (compression, resizing, re-sampling, etc.). While we demonstrated robustness to some degree of JPEG compression, blurring, and resizing, much more work is needed to evaluate how well the current detectors can cope with these transformations in the wild. Finally, most documented instances of effective deployment of visual fakes to date have used classic "shallow" methods, such as Photoshop. We have experimented with running our detector on the face-aware liquify dataset from [44], and found that our method performs at chance on this data. This suggests that shallow methods exhibit fundamentally different behavior than deep methods, and should not be neglected.

We note that detecting fake images is just one small piece of the puzzle of how to combat the threat of visual disinformation. Effective solutions will need to incorporate a wide range of strategies, from technical to social to legal.

Acknowledgements We'd like to thank Jaakko Lehtinen, Taesung Park, Jacob (Minyoung) Huh, Hany Farid, and Matthias Kirchner for helpful discussions. We are grateful to Xu Zhang, Lakshmanan Nataraj, and Davide Cozzolino for significant help with comparisons to [50, 31, 14], respectively. This work was funded, in part, by DARPA MediFor, an Adobe gift, and a grant from the UC Berkeley Center for Long-Term Cybersecurity. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
References

[1] Deepfakes faceswap GitHub repository. https://github.com/deepfakes/faceswap.
[2] Faced. https://github.com/iitzco/faced.
[3] Faceswap. https://faceswap.dev/.
[4] Which face is real? http://www.whichfaceisreal.com/.
[5] Shruti Agarwal and Hany Farid. Photo forensics from JPEG dimples. In IEEE Workshop on Information Forensics and Security (WIFS), 2017.
[6] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? JMLR, 2019.
[7] David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a GAN cannot generate. In ICCV, 2019.
[8] Gary Bradski and Adrian Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc., 2008.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[10] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In CVPR, 2018.
[11] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
[12] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[13] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Splicebuster: A new blind image splicing detector. In IEEE International Workshop on Information Forensics and Security (WIFS), 2015.
[14] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias Nießner, and Luisa Verdoliva. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510, 2018.
[15] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, 2019.
[16] Hany Farid. Photo Forensics. MIT Press, 2016.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. Fighting fake news: Image splice detection via learned self-consistency. In ECCV, 2018.
[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[22] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[23] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[25] Jean-François Lalonde and Alexei A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.
[26] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse image synthesis from semantic layouts via conditional IMLE. In ICCV, 2019.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale CelebFaces Attributes (CelebA) dataset.
[29] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. Detection of GAN-generated fake images over social networks. In IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018.
[30] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do GANs leave artificial fingerprints? In IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019.
[31] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, B. S. Manjunath, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H. Bappy, and Amit K. Roy-Chowdhury. Detecting GAN generated fake images using co-occurrence matrices. Electronic Imaging, 2019.
[32] James F. O'Brien and Hany Farid. Exposing photo manipulation with inconsistent reflections. ACM Trans. Graph., 2012.
[33] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[34] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[35] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 2003.
[36] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. 1999.
[37] Alin C. Popescu and Hany Farid. Exposing digital forgeries by detecting traces of resampling. IEEE Transactions on Signal Processing, 2005.
[38] Yuan Rao and Jiangqun Ni. A deep learning approach to detection of splicing and copy-move forgeries in images. In IEEE International Workshop on Information Forensics and Security (WIFS), 2016.
[39] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In ICCV, 2019.
[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[41] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
[42] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In CVPR, 2018.
[43] Run Wang, Lei Ma, Felix Juefei-Xu, Xiaofei Xie, Jian Wang, and Yang Liu. FakeSpotter: A simple baseline for spotting AI-synthesized fake faces. arXiv preprint arXiv:1909.06122, 2019.
[44] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A. Efros. Detecting photoshopped faces by scripting Photoshop. In ICCV, 2019.
[45] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[46] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[47] Ning Yu, Larry Davis, and Mario Fritz. Attributing fake images to GANs: Analyzing fingerprints in generated images. In ICCV, 2019.
[48] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
[49] Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.
[50] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In WIFS, 2019.

Appendix

A. Additional Analysis

A.1. Additional ranking visualizations

In Section 4.6, we rank-ordered the fake images according to how "fake" the classifier deemed them to be. The full ranking results are included at the following link: https://peterwang512.github.io/CNNDetection/ranking/. We randomly select 20 real and 20 fake images from each dataset, and rank all images based on our Blur+JPEG (0.1) model's scores. Note that there is a clear separation between real and fake images: the real images have lower "fakeness" scores and vice versa. Moreover, we observe that the synthetic images ranked more "real" are super-resolution (SAN) outputs, and the ones ranked more "fake" are CRN and IMLE outputs. However, we observe little noticeable correlation between the model's predictions and the visual quality of the synthesized images in each dataset, with BigGAN and StarGAN images being the exceptions.

A.2. Effect of dataset size

We include additional ablation studies on the effect of dataset size; the results are shown in Table 4. To compare with the dataset diversity ablation in Section 4.3 of the main text, we train 4 additional models with 10%, 20%, 40%, and 80% of the entire dataset, respectively, while keeping all 20 LSUN classes in the training set. The same augmentation scheme as Blur+JPEG (0.5) is applied to all models. We observe a much smaller reduction in generalization performance, indicating that data diversity, compared to dataset size, contributes more toward better CNN detection in general.
Table 4: Additional evaluations. We evaluate other baseline models, classifiers trained with DIP and BigGAN images, respectively, and classifiers trained with various dataset sizes. As in Table 2 of the main text, we show the average precision (AP) of each model tested across 11 generators. For comparison, we include the ablations on the number of classes and the Blur+JPEG (0.5) model's results, which are presented in the main text. A checkmark (✓) means the augmentation is applied with 50% probability at training. The color-coding scheme is identical to that of Table 2 in the main text. We note that when only the dataset size is reduced, AP drops less than when the number of classes is reduced. Also, the model trained on ProGAN outperforms the baselines trained on DIP and BigGAN images (DIP and Blur+JPEG (Big) in the table).
A.4. Training with images generated with a deep image prior

Instead of generating fake images with GANs, which have limited representational capacity and hence large synthesis errors, we consider an "oracle" generation method based on the deep image prior (DIP) [42]. We ask what the very best reconstruction of an image achievable by a given network architecture is, regardless of the synthesis task. For each real image in our dataset, we train a separate network to reconstruct it by minimizing an ℓ1 loss:

    min_{θ_i} ||f(θ_i) − I_i||_1,    (1)

where f(θ_i) is the image generated by a neural network parameterized by weights θ_i, and I_i is a real image. We use the reconstructed image f(θ_i) as an instance of a fake image. During reconstruction, we use the Adam optimizer [24] with β1 = 0.9, β2 = 0.999 and a decreasing learning rate schedule 0.01 → 0.001 → 0.0001, optimizing for 2000 iterations at each learning rate.
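For concreteness, the per-image optimization can be sketched in a few lines of PyTorch. This is a minimal illustration of the schedule above, not the released code: make_generator is a hypothetical constructor standing in for the ProGAN generator architecture, and the 512-dimensional latent input is an assumption.

```python
import torch

def dip_reconstruct(real_image, make_generator, device="cuda"):
    """Fit a freshly initialized generator to one real image by minimizing
    the l1 loss of Eq. (1): 2000 Adam steps at each of lr = 0.01, 0.001,
    0.0001, snapshotting every 1000 steps (the 1k-6k "fakes" used below)."""
    net = make_generator().to(device)            # fresh weights theta_i per image
    z = torch.randn(1, 512, device=device)       # fixed latent input (assumed size)
    target = real_image.unsqueeze(0).to(device)  # (1, 3, H, W)

    snapshots, step = {}, 0
    for lr in (1e-2, 1e-3, 1e-4):
        opt = torch.optim.Adam(net.parameters(), lr=lr, betas=(0.9, 0.999))
        for _ in range(2000):
            opt.zero_grad()
            recon = net(z)
            loss = (recon - target).abs().mean()  # l1 reconstruction loss
            loss.backward()
            opt.step()
            step += 1
            if step % 1000 == 0:
                snapshots[step] = recon.detach().cpu()
    return snapshots
```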
   As training data, we take 44k real images randomly sampled from ImageNet [40]; the "fake" images are their reconstructions using the ProGAN generator architecture as the DIP network (and hence 44k different networks). We take the DIP images optimized for 1000, 2000, 3000, 4000, 5000, and 6000 iterations into our "fake" image set. We then train a classifier on this dataset, over-sampling the real images 6× to balance the classes. All training configurations and augmentations are the same as for Blur+JPEG (0.5). This model is denoted DIP in Tab. 4.
   We note that although this model does not perform as well as the model trained directly on ProGAN images, it is able to detect several datasets, including StarGAN, CRN, SITD, and SAN. This indicates that low-level artifacts are shared across different methods, but leveraging them alone may not be sufficient for general detection.

A.5. Comparison to other baselines

   In the main text, we compared with Zhang et al. [50], a state-of-the-art GAN detection method, and outperformed it across different synthesis methods. In addition, we include the performance of Nataraj et al. [31], another GAN detection method trained on co-occurrence matrices of images, and Cozzolino et al. [14], a few-shot, single-target domain adaptation method trained on high-frequency (HF) filtered images. For Cozzolino et al., we evaluate the ProGAN/CycleGAN model. Both methods are evaluated on 256 × 256 images in a zero-shot setting; if an image is larger than 256 pixels, it is center-cropped to 256 × 256. The results are in Tab. 4.
A.6. Other evaluation metrics

   To help contextualize the threshold-less AP evaluation metric, we also compute several other metrics (Table 5), and we provide the precision-recall curve on each dataset for our Blur+JPEG (0.1) model in Figure 8. We report the uncalibrated generalization accuracy of the model on each test distribution, obtained by simply using the classifier threshold learned during training, and an oracle accuracy, which chooses the threshold that maximizes accuracy on the test set.
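As a rough sketch of how these quantities relate (not the paper's evaluation code), all three can be computed from per-image logits with NumPy and scikit-learn; here labels uses 1 = fake, 0 = real:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(logits, labels):
    """logits: per-image classifier outputs; labels: 1 = fake, 0 = real."""
    ap = average_precision_score(labels, logits)  # threshold-less AP

    # Uncalibrated accuracy: reuse the threshold learned at training (logit > 0).
    uncalibrated = np.mean((logits > 0) == labels)

    # Oracle accuracy: the best accuracy achievable by sweeping all thresholds
    # on the test set itself.
    candidates = np.concatenate(([-np.inf], np.unique(logits)))
    oracle = max(np.mean((logits > t) == labels) for t in candidates)
    return ap, uncalibrated, oracle
```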
   We also consider a two-shot regime, in which we have access to one real and one fake image from each test dataset and adjust only the model's threshold during calibration. Concretely, we calibrate the model with a single random real/fake pair, augmenting each image by taking 224 × 224 random crops 128 times. The crops are passed through the model to obtain logits, to which we fit a logistic regression (this method is also known as Platt scaling [36]). We take the bias learned from the logistic regression to adjust the base rate of our model: we apply the bias to our model's logit and then take the sigmoid to obtain the calibrated probability.
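The following sketch illustrates this two-shot calibration under our reading of the protocol; model is assumed to map a (1, 3, 224, 224) tensor to a single "fakeness" logit, and scikit-learn's logistic regression stands in for the Platt-scaling fit:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision import transforms

def two_shot_bias(model, real_img, fake_img, n_crops=128):
    """Fit Platt scaling [36] on random 224x224 crops of one real/fake
    pair and return the learned bias; real_img and fake_img are (3, H, W)
    tensors with H, W >= 224."""
    crop = transforms.RandomCrop(224)
    logits, labels = [], []
    with torch.no_grad():
        for img, y in ((real_img, 0), (fake_img, 1)):
            for _ in range(n_crops):
                logits.append(model(crop(img).unsqueeze(0)).item())
                labels.append(y)
    platt = LogisticRegression().fit(np.array(logits).reshape(-1, 1), labels)
    return platt.intercept_[0]

# As described above, only the learned bias shifts the detector's logit:
# p_fake = torch.sigmoid(logit + two_shot_bias(model, real_img, fake_img))
```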
Figure 8: Precision and recall curves. The PR curve on each dataset for the Blur+JPEG (0.1) model is shown. Note that AP is defined as the area under the PR curve, so a higher AP indicates a better trade-off between precision and recall.
A.7. Detecting GAN images from the internet

   Unfortunately, there are currently no collections of "in-the-wild" CNN-generated images on which we can evaluate our model. As a proxy test case, we scraped 1k real faces and 1k fake faces from whichfaceisreal.com [4], a website containing StyleGAN-generated faces and real faces at 1024 × 1024 resolution, all compressed as JPEG. We tested our Blur+JPEG (0.1) model on this test set in two scenarios: (1) directly center-cropping images to 224 pixels without resizing (matching how we test StyleGAN), or (2) resizing to 256 pixels and then center-cropping to 224 pixels. Without resizing, the model achieves 83.6% accuracy and 93.2% AP. With resizing, the model drops to 74.9% accuracy and 82.6% AP, still well above chance (50%). This indicates that our model can be robust to resizing and in-the-wild JPEG compression, although maintaining similar performance after significant post-processing (e.g., heavy resizing) remains challenging.
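Both evaluation pipelines are straightforward to express; a plausible torchvision rendering of the two scenarios (our reading of the protocol, not the released evaluation code) is:

```python
from torchvision import transforms

# Scenario 1: center-crop to 224 px with no resizing (matches StyleGAN testing).
no_resize = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Scenario 2: resize the short side to 256 px, then center-crop to 224 px.
with_resize = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```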
A.8. CycleGAN testcase

   While prior works on GAN detection [29, 31, 50] train on CycleGAN images and evaluate generalization across CycleGAN categories, our method is not trained on any CycleGAN images and instead tests generalization across methods, a significantly harder task. Nonetheless, we observe comparable performance in terms of AP (Tab. 2 in the main text) when compared to Zhang et al. [50]. For a further comparison, we include our Blur+JPEG (0.1) model's accuracy on each CycleGAN category in Tab. 6.
              StyleGAN  BigGAN  CycleGAN  StarGAN  GauGAN  CRN   IMLE  SITD  SAN   DeepFake
Uncalibrated    87.1     70.2     85.2      91.7     78.9   86.3  86.2  90.3  50.5    53.5
Oracle          96.8     81.1     86.3      92.8     85.5   95.3  95.4  92.8  68.0    80.7
Two-shot        91.9     74.0     82.4      86.0     79.1   91.6  91.2  88.7  54.8    65.7
Table 5: Two-shot classifier calibration. We show the accuracy of the classifier trained on ProGAN when used directly ("uncalibrated"), after calibrating its threshold given two examples from the test distribution ("two-shot"), and an upper bound given a perfect calibration ("oracle").
Horse  Zebra  Summer  Winter  Apple  Orange  Facades  Cityscape  Map   Ukiyoe  Vangogh  Cezanne  Monet  Photo  Avg.
62.1   87.5   83.2    88.0    90.5   87.7    100.0    66.6       78.0  85.4    76.9     82.8     56.2   86.8   80.8
Table 6: CycleGAN testcase. We evaluate the uncalibrated accuracy of the Blur+JPEG (0.1) model on each CycleGAN category. We note that our model still performs well above chance (50%) even though it was not trained on any CycleGAN images.

B. Implementation Details

B.1. Dataset Collection

ProGAN [21] (https://github.com/tkarras/progressive_growing_of_gans). We take 20 officially released ProGAN models pretrained on the LSUN [46] classes airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv-monitor, respectively. Following the official code, we sample synthetic images with z ∼ N(0, I), and we generate real images by center cropping along the long edge (the crop length is exactly the length of the short edge) and then resizing to 256 × 256.
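This real-image preprocessing (shared with BigGAN below) amounts to a square center crop at the short-edge length followed by a resize. A small PIL sketch, with preprocess_real as our illustrative name and bicubic resampling as an assumption:

```python
from PIL import Image

def preprocess_real(path, size=256):
    """Center-crop a square with side equal to the short edge,
    then resize to size x size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    s = min(w, h)                               # short-edge length
    left, top = (w - s) // 2, (h - s) // 2      # centered square crop
    square = img.crop((left, top, left + s, top + s))
    return square.resize((size, size), Image.BICUBIC)
```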
StyleGAN [22] (https://github.com/NVlabs/stylegan). We take officially released StyleGAN models pretrained on LSUN [46] bedroom, cat, and car, with sizes 256 × 256, 256 × 256, and 512 × 384, respectively. We download the released synthesized images, all of which are generated with 0.5 truncation, and, following the code, we generate real images by resizing to the corresponding size for each category.

StyleGAN2 [23] (https://github.com/NVlabs/stylegan2). We take officially released StyleGAN2 config-F models pretrained on LSUN [46] church, cat, horse, and car, with sizes 256 × 256, 256 × 256, 256 × 256, and 512 × 384, respectively. We download the released synthesized images, all of which are generated with 0.5 truncation, and, following the code, we generate real images by resizing to the corresponding size for each category.

BigGAN [9] (https://tfhub.dev/s?q=biggan). We take the officially released BigGAN-deep model pretrained on 256 × 256 ImageNet images. Following the official code, we sample images with a uniform class distribution and 0.4 truncation; we generate real images by center cropping along the long edge (the crop length is exactly the length of the short edge) and then resizing to 256 × 256.

CycleGAN [54] (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix). We take the officially released CycleGAN models apple2orange, orange2apple, horse2zebra, zebra2horse, summer2winter, and winter2summer, and generate real and fake image pairs from all six categories. Pre-processed real images and synthetic images are generated directly from the released code.

StarGAN [12] (https://github.com/yunjey/stargan). We take the officially released StarGAN model pretrained on CelebA [28] and generate real and fake image pairs. Pre-processed real images and synthetic images are generated directly from the released code.

SAN [15]. We take both the ground truth and the officially released 4× super-resolution predictions on the standard benchmark datasets Set5, Set14, BSD100, and Urban100. The synthetic images are downloaded directly from the repository.