
CNN-generated images are surprisingly easy to spot... for now

Sheng-Yu Wang1 Oliver Wang2 Richard Zhang2 Andrew Owens1,3 Alexei A. Efros1
UC Berkeley1 Adobe Research2 University of Michigan3
arXiv:1912.11035v2 [cs.CV] 4 Apr 2020
[Figure 1: paired synthetic (top row) and real (bottom row) images from ProGAN [21], StyleGAN [22], BigGAN [9], CycleGAN [54], StarGAN [12], GauGAN [34], CRN [11], IMLE [26], SITD [10], Super-res. [15], and Deepfakes [39].]
Figure 1: Are CNN-generated images hard to distinguish from real images? We show that a classifier trained to detect images generated
by only one CNN (ProGAN, far left) can detect those generated by many other models (remaining columns). Our code and models are
available at https://peterwang512.github.io/CNNDetection/.

Abstract

In this work we ask whether it is possible to create a “universal” detector for telling apart real images from those generated by a CNN, regardless of architecture or dataset used. To test this, we collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, cascaded refinement networks, implicit maximum likelihood estimation, second-order attention super-resolution, seeing-in-the-dark). We demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods (including the just-released StyleGAN2 [23]). Our findings suggest the intriguing possibility that today's CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis.

1. Introduction

Recent rapid advances in deep image synthesis techniques, such as Generative Adversarial Networks (GANs), have generated a huge amount of public interest and concern, as people worry that we are entering a world where it will be impossible to tell which images are real and which are fake [16]. This issue has started to play a significant role in global politics; in one case, a video of the president of Gabon that was claimed by the opposition to be fake was one factor leading to a failed coup d'état*. Much of this concern has been directed at specific manipulation techniques, such as “deepfake”-style face replacement [3] and photorealistic synthetic humans [22]. However, these methods represent only two instances of a broader set of techniques: image synthesis via convolutional neural networks (CNNs). Our goal in this work is to find a general image forensics approach for detecting CNN-generated imagery.

* https://www.motherjones.com/politics/2019/03/deepfake-gabon-ali-bongo/

Detecting whether an image was generated by a specific synthesis technique is relatively straightforward — just train a classifier on a dataset consisting of real images and images synthesized by the technique in question. However, such an approach will likely be tied to the dataset used in image generation (e.g. faces), and, due to dataset bias [41], might not generalize when tested on new data (e.g. cars). Even worse, the technique-specific detector is likely to soon become ineffective as generation methods evolve and the technique it was trained on becomes obsolete.

It is natural, therefore, to ask whether today's CNN-generated images contain common artifacts, e.g., some kind of detectable CNN fingerprints, that would allow a classifier to generalize to an entire family of generation methods, rather than a single one.
Unfortunately, prior work has reported generalization to be a significant problem for image forensics approaches. For example, several recent works [50, 14, 43] observe that classifiers trained on images produced by one GAN architecture perform poorly when tested on others, and in many cases they also fail to generalize when only the dataset (and not the architecture or task) is changed [50]. This makes sense, as image generation methods are highly varied: they use different datasets, network architectures, loss functions, and image pre-processing.

In this paper, we show that, contrary to this current understanding, classifiers trained to detect CNN-generated images can exhibit a surprising amount of generalization ability across datasets, architectures, and tasks. We follow convention and train our classifiers in a straightforward manner, by generating a large number of fake images using a single CNN model (we use ProGAN, a high-performing unconditional GAN model [21]), and train a binary classifier to detect fakes, using the model's real training images as negative examples.

To evaluate our model, we create a new dataset of CNN-generated images, the ForenSynths dataset, consisting of synthesized images from 11 models that range from unconditional image generation methods, such as StyleGAN [22], to super-resolution methods [15] and deepfakes [39]. Each model is trained on a different image dataset appropriate for its specific task. We have also continued evaluating our detector on models that were released after our paper was originally written, finding that it works out-of-the-box on the very recent unconditional GAN, StyleGAN2 [23].

Underneath the apparent simplicity of this approach, we have found that there are a number of subtle challenges, which we study through a set of experiments and a new dataset of trained image generation models. We find that data augmentation, in the form of common image post-processing operations, is critical for generalization, even when the target images are not post-processed themselves. We also find that diversity of training images matters; large datasets sampled from CNN synthesis methods lead to classifiers that outperform those trained on smaller datasets, up to a point. Finally, it is critical to examine the effect of post-processing operations, which often occur downstream of image creation (e.g., during storage and distribution), on the model's generalization ability. We show that when the correct steps are taken, classifiers are indeed robust to common operations such as JPEG compression, blurring, and resizing.

In summary, our main contributions are: 1) we show that forensics models trained on CNN-generated images exhibit a surprising amount of generalization to other CNN synthesis methods; 2) we propose a new dataset and evaluation metric for detecting CNN-generated images; 3) we experimentally analyze the factors that account for cross-model generalization.

2. Related work

Detecting CNN-based manipulations Several recent works have addressed the problem of detecting images generated by CNNs. Rössler et al. [39] evaluated methods for detecting face manipulation techniques, including CNN-based face and mouth replacement methods. While they showed that simple classifiers could detect fakes generated by the same model, they did not study generalization between models or datasets. Marra et al. [29] likewise showed that simple classifiers can detect images created by an image translation network [19], but did not consider cross-model transfer.

Recently, Cozzolino et al. [14] found that forensics classifiers transferred poorly between models, often obtaining near-chance performance. They propose a new representation learning method, based on autoencoders, to improve transfer performance in zero- and low-shot training regimes for a variety of generation methods. While their ultimate goal is similar to ours, they take an orthogonal approach: they focus on new learning methods for improving transfer learning, and apply them to a diverse assortment of models (including both CNN and non-CNN). In contrast, we empirically study the performance of simple “baseline” classifiers under different training and testing conditions for CNN-based image generation. Zhang et al. [50] find that classifiers generalize poorly between GAN models. They propose a method called AutoGAN for generating images that contain the upsampling artifacts common in GAN architectures, and test it on two types of GANs. Other work has proposed to detect GAN images using hand-crafted co-occurrence features [31], or by anomaly detection models built on pretrained face detectors [43]. Researchers have also proposed methods for identifying which, of several, known GANs generated a given image [30, 47].

Image forensics Researchers have proposed a variety of methods for detecting more traditional manipulation techniques, such as those made by image editing tools. Early work focused on hand-crafted cues [16] such as compression artifacts [5], resampling [37], or physical scene constraints [32]. More recently, researchers have applied learning-based methods to these problems [51, 18, 13, 38, 44]. This line of work has found, like us, that simple, supervised classifiers are often effective at detecting manipulations [51, 44].

Artifacts from CNN-based generators Researchers have recently shown that common CNN designs contain artifacts that reduce their representational power. Much of this work has focused on the way networks perform upsampling and downsampling. A well-known example of such an artifact is the checkerboard artifact produced by deconvolutional layers [33].
Azulay and Weiss [6] showed that convolutional networks ignore the classical sampling theorem and that strided convolutions therefore reduce translation invariance, and Zhang [49] improved translation invariance by reducing aliasing in these layers. Very recently, Bau et al. [7] suggested that GANs have limited generation capacity, and analyzed the image structures that a pretrained GAN is unable to produce.

3. A dataset of CNN-based generation models

To study the transferability of classifiers trained to detect CNN-generated images, we collected a dataset of images created from a variety of CNN models.

Family            | Method               | Image Source          | # Images
Unconditional GAN | ProGAN [21]          | LSUN                  | 8.0k
                  | StyleGAN [22]        | LSUN                  | 12.0k
                  | BigGAN [9]           | ImageNet              | 4.0k
Conditional GAN   | CycleGAN [54]        | Style/object transfer | 2.6k
                  | StarGAN [12]         | CelebA                | 4.0k
                  | GauGAN [34]          | COCO                  | 10.0k
Perceptual loss   | CRN [11]             | GTA                   | 12.8k
                  | IMLE [26]            | GTA                   | 12.8k
Low-level vision  | SITD [10]            | Raw camera            | 360
                  | SAN [15]             | Standard SR benchmark | 440
Deepfake          | FaceForensics++ [39] | Videos of faces       | 5.4k

Table 1: Generation models. We evaluate forensic classifiers on a variety of CNN-based image generation methods.

3.1. Generation models

Our dataset contains 11 synthesis models. We chose methods that span a variety of CNN architectures, datasets, and losses. All of these models have an upsampling-convolutional structure (i.e. they generate images by a series of convolution and upsampling operations), since this is by far the most common design for generative CNNs. Examples of their synthesized images can be found in Figure 1. The statistics of each dataset are listed in Table 1. Details of the data collection process are provided in Appendix B.1.

GANs We include three state-of-the-art unconditional GANs: ProGAN [21], StyleGAN [22], and BigGAN [9], trained on either the LSUN [46] or ImageNet [40] datasets. The network structures and training procedures for these models contain significant differences. ProGAN and StyleGAN train a different network for each category; StyleGAN injects large, per-pixel noise into the model to introduce high-frequency detail. BigGAN has a monolithic, class-conditional structure, is trained with very large batch sizes, and uses self-attention layers [48, 45].

We also include three conditional GANs: the state-of-the-art image-to-image translation method GauGAN [34], and the popular unpaired image-to-image translation methods CycleGAN [54] and StarGAN [12].

Perceptual loss We consider models that directly optimize a perceptual loss [20], with no adversarial training. This includes Cascaded Refinement Networks (CRN) [11], which synthesizes images in a coarse-to-fine manner, and the recent Implicit Maximum Likelihood Estimation (IMLE) conditional image translation model [26].

Low-level vision We include the Seeing In The Dark (SITD) model [10], which approximates long-exposure photography under low-light conditions from short-exposure raw camera input using a high-resolution fully convolutional network. We also use a state-of-the-art super-resolution model, the Second Order Attention Network (SAN) [15].

Deep fakes We also evaluate our model on the face replacement images provided in the FaceForensics++ benchmark of Rössler et al. [39], which used the publicly available faceswap tool [1]. While “deepfake” is often used as a general term, we take inspiration from the convention in [39] and refer to this specific model as DeepFake. This model uses an autoencoder to generate faces, and images undergo extensive post-processing steps, including Poisson image blending [35] with real content. We note that our main goal is to detect images directly output by CNN decoders, while DeepFake serves as an out-of-distribution test case. Following [39], we use cropped faces.

3.2. Generating fake images

We collect images from the models, taking care to match the pre-processing operations performed by each (e.g. resizing and cropping). For each dataset, we collect fake images by generating them from the model without applying additional post-processing (or we download the officially released generated images if they are available). We collect an equal number of real images from each method's training set. To make the distribution of the real and fake images as close as possible, real images are pre-processed according to the pipeline prescribed by each method.

Since 256 × 256 is the most commonly shared output size among off-the-shelf image synthesis models (e.g., CycleGAN, StarGAN, ProGAN LSUN, GauGAN COCO, IMLE), we used this resolution for our dataset. For models that produce images at lower resolutions (e.g., DeepFake), we rescale the images using bilinear interpolation to 256 on the shorter side, preserving the aspect ratio, and for models that produce images at higher resolutions (e.g., ProGAN, StyleGAN, SAN, SITD), we keep the images at their original resolution. Despite these cases being slightly different from our training scheme, we observe that our model is still able to detect fake images in these categories. For all datasets, we make our real/fake prediction from 224 × 224 crops (random-crop at training time and center-crop at testing time).
Family            | Name            | Train    | Input | #Class | Blur | JPEG | ProGAN | StyleGAN | BigGAN | CycleGAN | StarGAN | GauGAN | CRN  | IMLE | SITD | SAN  | DeepFake | mAP
Zhang et al. [50] | Cyc-Im          | CycleGAN | RGB   | –      | –    | –    | 84.3   | 65.7     | 55.1   | 100.     | 99.2    | 79.9   | 74.5 | 90.6 | 67.8 | 82.9 | 53.2     | 77.6
Zhang et al. [50] | Cyc-Spec        | CycleGAN | Spec  | –      | –    | –    | 51.4   | 52.7     | 79.6   | 100.     | 100.    | 70.8   | 64.7 | 71.3 | 92.2 | 78.5 | 44.5     | 73.2
Zhang et al. [50] | Auto-Im         | AutoGAN  | RGB   | –      | –    | –    | 73.8   | 60.1     | 46.1   | 99.9     | 100.    | 49.0   | 82.5 | 71.0 | 80.1 | 86.7 | 80.8     | 75.5
Zhang et al. [50] | Auto-Spec       | AutoGAN  | Spec  | –      | –    | –    | 75.6   | 68.6     | 84.9   | 100.     | 100.    | 61.0   | 80.8 | 75.3 | 89.9 | 66.1 | 39.0     | 76.5
Ours              | 2-class         | ProGAN   | RGB   | 2      | ✓    | ✓    | 98.8   | 78.3     | 66.4   | 88.7     | 87.3    | 87.4   | 94.0 | 97.3 | 85.2 | 52.9 | 58.1     | 81.3
Ours              | 4-class         | ProGAN   | RGB   | 4      | ✓    | ✓    | 99.8   | 87.0     | 74.0   | 93.2     | 92.3    | 94.1   | 95.8 | 97.5 | 87.8 | 58.5 | 59.6     | 85.4
Ours              | 8-class         | ProGAN   | RGB   | 8      | ✓    | ✓    | 99.9   | 94.2     | 78.9   | 94.3     | 91.9    | 95.4   | 98.9 | 99.4 | 91.2 | 58.6 | 63.8     | 87.9
Ours              | 16-class        | ProGAN   | RGB   | 16     | ✓    | ✓    | 100.   | 98.2     | 87.7   | 96.4     | 95.5    | 98.1   | 99.0 | 99.7 | 95.3 | 63.1 | 71.9     | 91.4
Ours              | No aug          | ProGAN   | RGB   | 20     | –    | –    | 100.   | 96.3     | 72.2   | 84.0     | 100.    | 67.0   | 93.5 | 90.3 | 96.2 | 93.6 | 98.2     | 90.1
Ours              | Blur only       | ProGAN   | RGB   | 20     | ✓    | –    | 100.   | 99.0     | 82.5   | 90.1     | 100.    | 74.7   | 66.6 | 66.7 | 99.6 | 53.7 | 95.1     | 84.4
Ours              | JPEG only       | ProGAN   | RGB   | 20     | –    | ✓    | 100.   | 99.0     | 87.8   | 93.2     | 91.8    | 97.5   | 99.0 | 99.5 | 88.7 | 78.1 | 88.1     | 93.0
Ours              | Blur+JPEG (0.5) | ProGAN   | RGB   | 20     | ✓    | ✓    | 100.   | 98.5     | 88.2   | 96.8     | 95.4    | 98.1   | 98.9 | 99.5 | 92.7 | 63.9 | 66.3     | 90.8
Ours              | Blur+JPEG (0.1) | ProGAN   | RGB   | 20     | †    | †    | 100.   | 99.6     | 84.5   | 93.5     | 98.2    | 89.5   | 98.2 | 98.4 | 97.2 | 70.5 | 89.0     | 92.6

Table 2: Cross-generator generalization results. We show the average precision (AP) of various classifiers from the baseline Zhang et al. [50] and ours, tested across 11 generators. The symbols ✓ and † mean the augmentation is applied with 50% or 10% probability, respectively, at training. Chance is 50% and the best possible performance is 100%. Entries where the test generator was also used for training (e.g., CycleGAN for the Cyc-Im and Cyc-Spec models) do not test generalization; all other values show cross-generator generalization. We show ablations with respect to fewer ProGAN classes and with data augmentation removed. We report the mean AP (mAP) by averaging the AP scores over all datasets. Subsets are plotted in Figures 2, 3, and 4 for comparison.
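Table 2 reports per-dataset average precision. As a point of reference, AP over a dataset's real/fake scores can be computed with scikit-learn; a minimal sketch with placeholder inputs:

```python
# Sketch: computing AP for one test set from the classifier's "fakeness"
# scores. Assumes scikit-learn; the label and score arrays are placeholders.
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1, 1, 0])               # 0 = real, 1 = fake
y_score = np.array([0.1, 0.4, 0.9, 0.8, 0.6, 0.3])  # model outputs

ap = 100.0 * average_precision_score(y_true, y_score)
print(f"AP: {ap:.1f}")  # a threshold-less, ranking-based score
```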

4. Detecting CNN-synthesized images

Are there common features or artifacts shared across diverse CNN generators? To understand this, we study whether it is possible to train a forensics classifier on images from one model that generalizes to those of many models.

4.1. Training classifiers

While all of these models are useful for evaluation, due to limitations in dataset size, not all are well-suited to training a classifier. We take advantage of the fact that the unconditional GAN models in our dataset can synthesize arbitrary numbers of images, and choose one specific model, ProGAN [21], to train the detector on. The decision to use a single model for training most closely resembles real-world detection problems, where the diversity or number of models to generalize to is unknown at training time. By selecting only a single model to train on, we are computing an upper bound on how challenging the task is — jointly training on multiple models would make the generalization problem easier. We chose ProGAN since it generates high-quality images and has a simple convolutional network structure.

We then create a large-scale dataset that consists solely of ProGAN-generated images and real images. We use 20 models, each trained on a different LSUN [46] object category, and generate 36K train images and 200 validation images, each with equal numbers of real and fake images, for each model. In total there are 720K images for training and 4K images for validation. To evaluate the choice of the training dataset, we also include a model that is trained solely on the BigGAN dataset. We also consider a model that generates training images using the deep image prior [42], rather than a GAN. The details for these models are provided in Appendix A.3 and A.4.

The main idea of our experiments is to train a “real-or-fake” classifier on this ProGAN dataset, and evaluate how well the model generalizes to other CNN-synthesized images. For the choice of classifier, we use ResNet-50 [17] pre-trained on ImageNet, and train it in a binary classification setting. Details of the training procedure are provided in Appendix B.2.

Data augmentation During training, we simulate image post-processing operations in a variety of ways. All of our models are trained with images that are randomly left-right flipped and cropped to 224 pixels. We evaluate several additional augmentation variants: (1) No aug: no augmentation applied; (2) Gaussian blur: before cropping, with 50% probability, images are blurred with σ ∼ Uniform[0, 3]; (3) JPEG: with 50% probability, images are JPEG-ed by one of two popular libraries, OpenCV [8] and the Python Imaging Library (PIL), with quality ∼ Uniform{30, 31, . . . , 100}; (4a) Blur+JPEG (0.5): the image is possibly blurred and JPEG-ed, each with 50% probability; (4b) Blur+JPEG (0.1): similar to (4a), but with 10% probability.

Evaluation Following other recent forensics works [52, 18, 44], we evaluate our model's performance on each dataset using average precision (AP), since it is a threshold-less, ranking-based score that is not sensitive to the base rate of the real and fake images in the dataset. We compute this score for each dataset separately, since we expect it to be dependent on the semantic content of the photos as a whole. To help interpret the threshold-less results, we also conduct experiments on thresholding the model's outputs and computing accuracy, under the assumption that real and fake images are equally likely to appear; the details are in Appendix A.6.
[Figure 2: bar chart of AP per test generator (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, CRN, IMLE, SITD, SAN, DeepFake) for the No aug., Blur only, JPEG only, Blur+JPEG (0.5), and Blur+JPEG (0.1) models, with chance at 50.]

Figure 2: Effect of augmentation methods. All detectors are trained on ProGAN and tested on other generators (AP shown). In general, training with augmentation helps performance. Notable exceptions are super-resolution and DeepFake.

[Figure 3: bar chart of AP per test generator for classifiers trained with 2, 4, 8, 16, and 20 LSUN classes, with chance at 50.]

Figure 3: Effect of dataset diversity. All detectors are trained on ProGAN and tested on other generators (AP shown). Training with more classes improves performance. All runs use blur and JPEG augmentation with 50% probability.

[Figure 4: bar chart of AP per test generator for Zhang et al. Auto-Im, Zhang et al. Auto-Spec, and our Blur+JPEG (0.1) model, with chance at 50.]

Figure 4: Model comparison. Compared to Zhang et al. [50], we observe that for the most part, our models generalize better to other architectures. Notable exceptions to this are CycleGAN (which is identical to the training architecture from [50]), StarGAN (where both methods obtain close to 100. AP), and SAN (where applying data augmentation hurts performance).

During testing, each image is center-cropped to 224 × 224 pixels without resizing in order to match the post-processing pipeline used by the models during training. No data augmentation is applied during testing; instead, we conduct experiments on model robustness under post-processing in Section 4.2.

4.2. Effect of data augmentation

In Table 2, we investigate the generalization ability of training with different augmentation methods. We find that using aggressive data augmentation (in the form of simulated post-processing) provides surprising generalization capabilities, even when such perturbations are not used at test time. Additionally, we observe that these models are significantly more robust to post-processing (Figure 5).

Augmentation (usually) improves generalization To begin, we first evaluate the ProGAN-based classifier without augmentation, shown in the “No aug” row. As in previous work [39], we find that testing on held-out ProGAN images works well (100.0 AP). We then test how well it generalizes to other unconditional GANs. We find that it generalizes extremely well to StyleGAN, which has a similar network structure, but not as well to BigGAN. When adding augmentations, the performance on BigGAN significantly improves, 72.2 → 88.2. On conditional models (CycleGAN, GauGAN, CRN, and IMLE), performance is similarly improved: 84.0 → 96.8, 67.0 → 98.1, 93.5 → 98.9, and 90.3 → 99.5, respectively.

Interestingly, there are two models, SAN and DeepFake, where directly training on ProGAN without augmentation performs strongly (93.6 and 98.2, respectively), but augmentation hurts performance. As SAN is a super-resolution model, only high-frequency components can differentiate between real and fake images; removing such cues at training time (e.g. by blurring) would therefore be likely to reduce performance. As explained in Section 3.1, DeepFake serves as an out-of-distribution test case, as its images are not generated by CNN architectures alone, but surprisingly our model is able to generalize to this test case. However, it remains challenging to identify clear reasons for the performance deterioration when applying augmentations. Applying augmentation at a reduced rate (Blur+JPEG (0.1)) offers a good balance: DeepFake detection is comparable to the no-augmentation case (89.0), while most other datasets are significantly improved over no augmentation.
[Figure 5: line plots of AP versus blur sigma (left) and JPEG quality (right), one panel per test generator (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, CRN, IMLE, SITD, SAN, DeepFake), for the No aug., Blur only, JPEG only, Blur+JPEG (0.5), and Blur+JPEG (0.1) models, with chance at 50.]

Figure 5: Robustness. We show the effect on AP of test-time perturbation by (left) Gaussian blurring and (right) JPEG compression. We show classifiers trained on ProGAN, with different augmentations applied during training. Note that for both perturbations, when training without augmentation (red), performance degrades across all datasets as perturbations are added. In most cases, training with both augmentations performs best or near best. Notable exceptions are super-resolution (where no augmentation is best) and DeepFake, where training only with the perturbation used during testing, rather than both, performs best.
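The Figure 5 sweeps amount to perturbing every test image at a fixed strength and recomputing AP. A rough sketch follows, where `load_test_images` and `model_scores` are assumed placeholder helpers, not the released evaluation code:

```python
# Sketch of a test-time robustness sweep (cf. Figure 5): blur the test set
# at a fixed sigma and recompute AP at each strength.
import cv2
from sklearn.metrics import average_precision_score

def blur_all(images, sigma):
    """Apply a fixed-strength Gaussian blur to every image."""
    if sigma == 0:
        return images
    return [cv2.GaussianBlur(im, (0, 0), sigmaX=sigma) for im in images]

images, labels = load_test_images()                 # placeholder loader
for sigma in [0, 1, 2, 3, 4]:
    scores = model_scores(blur_all(images, sigma))  # placeholder inference
    print(sigma, average_precision_score(labels, scores))
```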

Augmentation improves robustness In many real-world scenarios, images that we would like to evaluate have undergone unknown post-processing operations, such as compression and resizing. We investigated whether CNN-generated images can still be detected even after these post-processing steps. To test this, we blurred (simulating resampling) and JPEG-compressed the real and fake images following the protocol in [44], and evaluated our ability to detect them (Figure 5). On ProGAN (i.e. the case where the test distribution matches the training distribution), performance is 100% even when applying augmentation operations, indicating that artifacts may not only be high-frequency but exist across frequency bands. In terms of cross-generator generalization, the augmented model is most robust to post-processing operations that are included in data augmentation, agreeing with observations from [39, 44, 47, 50]. However, we note that our model also gains robustness from augmentation even when testing on out-of-distribution CNN models.

4.3. Effect of data diversity

Next, we asked how the diversity of the real and fake images in the training set affects a classifier's generalization ability.

Image diversity improves performance To study this, we varied the number of classes in the dataset used to train our real-or-fake classifier (Figure 3). Specifically, we trained multiple classifiers, each on a subset of the full training dataset, by excluding both real and fake images derived from a specific set of LSUN classes. For all models we use the same augmentation scheme as the Blur+JPEG (0.5) model. We found that increasing the training set diversity improves performance, but only up to a point. When the number of classes increases from 2 to 16, AP consistently improves, but we see diminishing returns: minimal improvement is observed when increasing from 16 to 20 classes. This indicates that there may be a training dataset that is “diverse enough” for practical generalization.

4.4. Comparison to other models

Next, we asked how our generalization performance compares to other proposed forensic methods. We compare our approach to Zhang et al. [50], which is a suite of classifiers trained to detect artifacts generated by a common CNN architecture that is shared by many image synthesis tasks, such as CycleGAN and StarGAN. They introduced AutoGAN, an autoencoder based on CycleGAN's generator that simulates artifacts resembling those of CycleGAN images.
[Figure 6: for BigGAN and StarGAN, example images at the 0th, 25th, 50th, 75th, and 100th percentiles of the model's “fakeness” score.]

Figure 6: Does our model's confidence correlate with visual quality? We have found that for two models, BigGAN and StarGAN, the images on the left (considered more real) tend to look better than the images on the right (considered more fake). However, this does not seem to hold for the other models. More examples on each dataset are provided in Appendix A.1.

[Figure 7: average high-pass-filtered spectra of real and synthetic images for each dataset (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, CRN, IMLE, SITD, SAN, DeepFake).]

Figure 7: Frequency analysis on each dataset. We show the average spectra of the high-pass filtered images, for both the real and fake images, similar to Zhang et al. [50]. We observe periodic patterns (dots or lines) in most of the synthetic images, while BigGAN and ProGAN contain relatively few such artifacts.

We considered four variations of pretrained models from Zhang et al. [50], each trained on one of the two image sources (CycleGAN and AutoGAN) and one of the two image representations (images and spectra). All four variants included JPEG and resize data augmentation during training to improve robustness. We found that our models generalized significantly better to other architectures, except on CycleGAN (which is the model architecture used by [50]) and StarGAN (where both methods obtain near 100.0 AP). The comparison results are shown in Table 2 and Figure 4. We also include comparisons to other baseline models in Appendix A.5.

4.5. New CNN models

We hope that as new deep synthesis models arrive, our system will detect them out-of-the-box. One such evaluation scenario has naturally arisen with the recent release of StyleGAN2 [23], a state-of-the-art unconditional GAN appearing in these proceedings. The StyleGAN2 model makes several changes to StyleGAN, including redesigned normalization, multi-resolution, and regularization methods. In Table 3, we test our detector on publicly available StyleGAN2 generators. We used our Blur+JPEG (0.1) model and tested on the LSUN car, cat, church, and horse variants. Despite these changes, our technique performs at 99.1% AP. These results reinforce the notion that training on today's generators can generalize well to future generators, given that they use similar underlying building blocks.

4.6. Qualitative analysis

To understand how the network is able to generalize to unseen CNN models, we study what cues the classifier might be using by visualizing its ranking of the “fakeness” of the synthetic datasets. In addition, we analyze the difference between the frequency responses of real and synthetic images across datasets.
     | ProGAN | StyleGAN | StyleGAN2
AP   | 100.   | 99.6     | 99.1

Table 3: Out-of-the-box evaluation on the recently released StyleGAN2 [23] model. We used our Blur+JPEG (0.1) model and tested it on StyleGAN2. We observed that our model generalizes to detecting StyleGAN2 images. Numbers for ProGAN and StyleGAN are included for comparison.

“Fakeness” ranking by the model We study whether our model is learning subtle low-level features generated by CNN architectures, or high-level features such as visual quality. Taking a similar approach to previous image realism works [25, 53], we rank the synthesized images from each dataset by the model's prediction, and visualize images at the 0th, 25th, 50th, 75th, and 100th percentiles of the “fakeness” score from our model's output.

In most datasets, we observe little noticeable correlation between the model's predictions and the visual quality of the synthesized images. However, there is a weak correlation in the BigGAN and StarGAN datasets; qualitative examples are shown in Figure 6. As the “fakeness” scores get higher, the images tend to contain more visible artifacts, which deteriorate the visual quality. This implies that our model might learn to capture perceptual realism under this task. However, since the correlation is not observed in the other datasets, it is more likely that the model learns features closer to low-level CNN artifacts. Examples across all datasets are provided in Appendix A.1.

Artifacts of CNN image synthesis Inspired by Zhang et al. [50], we visualize the average frequency spectra of each dataset to study the artifacts generated by CNNs, as shown in Figure 7. Following prior work, we perform a simple form of high-pass filtering (subtracting the image from its median-blurred version) before calculating the Fourier transform, as it provides a more informative visualization [30]. For each dataset, we average over 2000 randomly chosen images (or the entire set, if it is smaller).
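A sketch of this visualization, assuming OpenCV and NumPy; the 3×3 median kernel and the use of log-magnitude are assumptions on top of the description in the text.

```python
# Sketch of the Figure 7 visualization: high-pass filter each image by
# subtracting its median-blurred version, then average the log-magnitude
# of the Fourier transform over the set. Input: float32 grayscale arrays.
import cv2
import numpy as np

def average_spectrum(images):
    acc = None
    for im in images:
        highpass = im - cv2.medianBlur(im, 3)  # subtract median-blurred copy
        mag = np.abs(np.fft.fftshift(np.fft.fft2(highpass)))
        logmag = np.log(mag + 1e-8)            # compress the dynamic range
        acc = logmag if acc is None else acc + logmag
    return acc / len(images)
```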
We note that there are many interesting patterns visible in these visualizations. While the real image spectra generally look alike (with minor variations due to differences in the datasets), there are distinct patterns visible in the images generated by different CNN models. Furthermore, the repeated periodic patterns in these spectra may be consistent with aliasing artifacts, a cue considered by [50]. Interestingly, the most effective unconditional GANs (BigGAN, ProGAN) contain relatively few such artifacts. Also, DeepFake images do not contain obvious artifacts. We note that DeepFake images have gone through various pre- and post-processing steps, in which the synthesized face region is resized, blended, and compressed with MPEG. These operations perturb the low-level image statistics, which may cause the frequency patterns not to emerge with this visualization method.

5. Discussion

Despite the alarm that has been raised by the rapidly improving quality of image synthesis methods, our results suggest that today's CNN-generated images retain detectable fingerprints that distinguish them from real photos. This allows forensic classifiers to generalize from one model to another without extensive adaptation.

However, this does not mean that the current situation will persist. Due to the difficulties in achieving Nash equilibria, none of the current GAN-based architectures are optimized to convergence, i.e. the generator never wins against the discriminator. Were this to change, we would suddenly find ourselves in a situation where synthetic images are completely indistinguishable from real ones.

Even with the current techniques, there remain practical reasons for concern. First, even the best forensics detector will have some trade-off between the true detection and false-positive rates. Since a malicious user is typically looking to create a single fake image (rather than a distribution of fakes), they could simply hand-pick the fake image that happens to pass the detection threshold. Second, malicious use of fake imagery is likely to be deployed on a social media platform (Facebook, Twitter, YouTube, etc.), so the data will undergo a number of often aggressive transformations (compression, resizing, re-sampling, etc.). While we demonstrated robustness to some degree of JPEG compression, blurring, and resizing, much more work is needed to evaluate how well current detectors can cope with these transformations in-the-wild. Finally, most documented instances of effective deployment of visual fakes to date have used classic “shallow” methods, such as Photoshop. We have experimented with running our detector on the face-aware liquify dataset from [44], and found that our method performs at chance on this data. This suggests that shallow methods exhibit fundamentally different behavior than deep methods, and should not be neglected.

We note that detecting fake images is just one small piece of the puzzle of how to combat the threat of visual disinformation. Effective solutions will need to incorporate a wide range of strategies, from technical to social to legal.

Acknowledgements We'd like to thank Jaakko Lehtinen, Taesung Park, Jacob (Minyoung) Huh, Hany Farid, and Matthias Kirchner for helpful discussions. We are grateful to Xu Zhang, Lakshmanan Nataraj, and Davide Cozzolino for significant help with comparisons to [50, 31, 14], respectively. This work was funded, in part, by DARPA MediFor, an Adobe gift, and a grant from the UC Berkeley Center for Long-Term Cybersecurity. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
References

[1] Deepfakes faceswap GitHub repository. https://github.com/deepfakes/faceswap.
[2] Faced. https://github.com/iitzco/faced.
[3] Faceswap. https://faceswap.dev/.
[4] Which face is real? http://www.whichfaceisreal.com/.
[5] Shruti Agarwal and Hany Farid. Photo forensics from JPEG dimples. In 2017 IEEE Workshop on Information Forensics and Security (WIFS), 2017.
[6] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? JMLR, 2019.
[7] David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a GAN cannot generate. In ICCV, 2019.
[8] Gary Bradski and Adrian Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc., 2008.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[10] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In CVPR, 2018.
[11] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
[12] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[13] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Splicebuster: A new blind image splicing detector. In 2015 IEEE International Workshop on Information Forensics and Security (WIFS), 2015.
[14] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias Nießner, and Luisa Verdoliva. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510, 2018.
[15] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, 2019.
[16] Hany Farid. Photo Forensics. MIT Press, 2016.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. Fighting fake news: Image splice detection via learned self-consistency. In ECCV, 2018.
[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[22] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[23] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[25] Jean-François Lalonde and Alexei A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.
[26] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse image synthesis from semantic layouts via conditional IMLE. In ICCV, 2019.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale CelebFaces Attributes (CelebA) dataset.
[29] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. Detection of GAN-generated fake images over social networks. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018.
[30] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do GANs leave artificial fingerprints? In 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019.
[31] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, B. S. Manjunath, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H. Bappy, and Amit K. Roy-Chowdhury. Detecting GAN generated fake images using co-occurrence matrices. Electronic Imaging, 2019.
[32] James F. O'Brien and Hany Farid. Exposing photo manipulation with inconsistent reflections. ACM Trans. Graph., 2012.
[33] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[34] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[35] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 2003.
[36] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. 1999.
[37] Alin C. Popescu and Hany Farid. Exposing digital forgeries by detecting traces of resampling. IEEE Transactions on Signal Processing, 2005.
[38] Yuan Rao and Jiangqun Ni. A deep learning approach to detection of splicing and copy-move forgeries in images. In 2016 IEEE International Workshop on Information Forensics and Security (WIFS), 2016.
[39] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In ICCV, 2019.
[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[41] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
[42] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In CVPR, 2018.
[43] Run Wang, Lei Ma, Felix Juefei-Xu, Xiaofei Xie, Jian Wang, and Yang Liu. FakeSpotter: A simple baseline for spotting AI-synthesized fake faces. arXiv preprint arXiv:1909.06122, 2019.
[44] Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A. Efros. Detecting photoshopped faces by scripting Photoshop. In ICCV, 2019.
[45] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[46] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[47] Ning Yu, Larry Davis, and Mario Fritz. Attributing fake images to GANs: Analyzing fingerprints in generated images. In ICCV, 2019.
[48] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
[49] Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.
[50] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In WIFS, 2019.
[51] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
[52] Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S. Davis. Learning rich features for image manipulation detection. In CVPR, 2018.
[53] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Learning a discriminative model for the perception of realism in composite images. In ICCV, 2015.
[54] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix

A. Additional Analysis

A.1. Additional ranking visualizations

In Section 4.6 we rank-ordered the fake images according to how “fake” the classifier deemed them to be. The full ranking results are included at the following link: https://peterwang512.github.io/CNNDetection/ranking/. We randomly select 20 real and 20 fake images from each dataset, and rank all images based on our Blur+JPEG (0.1) model's scores. Note that there is a clear separation between real and fake images, where the real images have lower “fakeness” scores and vice versa. Moreover, we observe that the synthetic images ranked more “real” are super-resolution (SAN) outputs, and the ones ranked more “fake” are CRN and IMLE outputs. However, we observe little noticeable correlation between the model predictions and the visual quality of the synthesized images in each dataset, with BigGAN and StarGAN images being the exceptions.

A.2. Effect of dataset size

We include additional ablation studies on the effect of dataset size; the results are shown in Table 4. To compare with the dataset diversity ablation in Section 4.3 of the main text, we train 4 additional models with 10%, 20%, 40%, and 80% of the entire dataset, respectively, while keeping all 20 LSUN classes in the training set. The same augmentation scheme as Blur+JPEG (0.5) is applied to all models. We observe much less reduction in generalization performance, indicating that data diversity, compared to dataset size, contributes more towards better CNN detection in general.

A.3. Comparison to training on a different model

To evaluate the choice of training architecture, we also include a model that is trained solely on BigGAN. To prepare the training data, we generate 400k fake images from an ImageNet-pretrained 256 × 256 BigGAN model [9], and take 400k ImageNet images with the same class distribution as real images. For comparison, we train the model with the same data augmentation as Blur+JPEG (0.5). We denote this model as Blur+JPEG (Big). We see in Table 4 that this model also exhibits generalization, albeit with slightly lower results in most cases. One explanation for this is that while our ProGAN model was trained on an ensemble (one model per class), the BigGAN images were generated with a single model.
single model.
Family                | Name             | Train      | Input         | #Class | Blur | JPEG | ProGAN | StyleGAN | BigGAN | CycleGAN | StarGAN | GauGAN | CRN  | IMLE | SITD | SAN  | DeepFake | mAP
Nataraj et al. [31]   | –                | CycleGAN   | Co-occur. mtx | –      | –    | –    | 76.4   | 96.5     | 56.4   | 100.     | 88.2    | 56.2   | 58.7 | 83.1 | 39.6 | 46.1 | 55.1     | 68.8
Cozzolino et al. [14] | ForensicTransfer | ProGAN     | HF residual   | –      | –    | –    | 88.9   | 77.9     | 79.5   | 77.2     | 91.7    | 83.3   | 99.9 | 31.3 | 72.8 | 90.8 | 79.2     | 79.3
Ours                  | DIP              | ProGAN-DIP | RGB           | –      | ✓    | ✓    | 62.0   | 52.3     | 61.7   | 62.4     | 100.    | 49.0   | 98.2 | 38.6 | 92.8 | 93.1 | 63.1     | 70.3
Ours                  | Blur+JPEG (Big)  | BigGAN     | RGB           | 1000   | ✓    | ✓    | 85.1   | 82.4     | 100.   | 86.2     | 87.4    | 96.7   | 79.7 | 82.6 | 91.2 | 71.9 | 60.3     | 83.9
Ours                  | 2-class          | ProGAN     | RGB           | 2      | ✓    | ✓    | 98.8   | 78.3     | 66.4   | 88.7     | 87.3    | 87.4   | 94.0 | 97.3 | 85.2 | 52.9 | 58.1     | 81.3
Ours                  | 4-class          | ProGAN     | RGB           | 4      | ✓    | ✓    | 99.8   | 87.0     | 74.0   | 93.2     | 92.3    | 94.1   | 95.8 | 97.5 | 87.8 | 58.5 | 59.6     | 85.4
Ours                  | 8-class          | ProGAN     | RGB           | 8      | ✓    | ✓    | 99.9   | 94.2     | 78.9   | 94.3     | 91.9    | 95.4   | 98.9 | 99.4 | 91.2 | 58.6 | 63.8     | 87.9
Ours                  | 16-class         | ProGAN     | RGB           | 16     | ✓    | ✓    | 100.   | 98.2     | 87.7   | 96.4     | 95.5    | 98.1   | 99.0 | 99.7 | 95.3 | 63.1 | 71.9     | 91.4
Ours                  | 10% data         | ProGAN     | RGB           | 20     | ✓    | ✓    | 100.   | 93.2     | 82.3   | 94.1     | 93.2    | 97.1   | 96.8 | 99.4 | 88.2 | 58.1 | 63.5     | 87.8
Ours                  | 20% data         | ProGAN     | RGB           | 20     | ✓    | ✓    | 100.   | 96.8     | 85.9   | 95.9     | 93.6    | 97.9   | 98.7 | 99.5 | 90.2 | 61.8 | 65.2     | 89.6
Ours                  | 40% data         | ProGAN     | RGB           | 20     | ✓    | ✓    | 100.   | 97.8     | 87.5   | 96.0     | 95.3    | 98.1   | 98.2 | 99.3 | 91.2 | 61.4 | 67.9     | 90.2
Ours                  | 80% data         | ProGAN     | RGB           | 20     | ✓    | ✓    | 100.   | 98.1     | 88.1   | 96.4     | 95.4    | 98.0   | 98.9 | 99.4 | 93.0 | 63.8 | 65.1     | 90.6
Ours                  | Blur+JPEG (0.5)  | ProGAN     | RGB           | 20     | ✓    | ✓    | 100.   | 98.5     | 88.2   | 96.8     | 95.4    | 98.1   | 98.9 | 99.5 | 92.7 | 63.9 | 66.3     | 90.8

Table 4: Additional evaluations. We evaluate other baseline models, classifiers trained on DIP and BigGAN images, respectively, and classifiers trained with various dataset sizes. As in Table 2 of the main text, we show the average precision (AP) of the models tested across 11 generators. For comparison, we include the ablations on the number of classes and the Blur+JPEG (0.5) model's results, which are presented in the main text. The symbol ✓ means the augmentation is applied with 50% probability at training, and the layout is identical to that of Table 2 in the main text. We note that when only the dataset size is reduced, AP drops less compared to reducing the number of classes. Also, the model trained on ProGAN outperforms the baselines, DIP and Blur+JPEG (Big).

A.4. Training with images generated with a deep image prior

Instead of generating fake images with GANs, which have limited representational capacity and hence large synthesis errors, we consider an “oracle” generation method based on the deep image prior (DIP) [42]. We ask what the very best reconstruction of an image achievable via a given network architecture is, regardless of the synthesis task. For each synthesized image in our dataset, we train a different network to reconstruct it by minimizing an ℓ1 loss:

    min_{θ_i} || f(θ_i) − I_i ||_1 ,        (1)

where f(θ_i) is the image generated by a neural network parameterized with weights θ_i and I_i is a real image. We use the reconstructed image f(θ_i) as an instance of a fake image. During reconstruction, we use the Adam optimizer [24] with β1 = 0.9, β2 = 0.999, and a decreasing learning rate: 0.01 → 0.001 → 0.0001. For each learning rate we optimize for 2000 iterations.

As training data, we take 44k real images randomly sampled from ImageNet [40], and the “fake” images are their reconstructions by the generator architecture of ProGAN (and hence 44k different networks). We take DIP images optimized for 1000, 2000, 3000, 4000, 5000, and 6000 iterations into our “fake” image set. We then train a classifier on this dataset, and we over-sample the real images 6 times to balance the classes. All training configurations and augmentations are the same as Blur+JPEG (0.5). This model is denoted as DIP in Tab. 4.

We note that although this model does not perform as well as the model directly trained on ProGAN images, it is able to detect several datasets, including StarGAN, CRN, SITD, and SAN. This indicates that low-level artifacts are shared across different methods, but leveraging those alone may not be sufficient for general detection.

A.5. Comparison to other baselines

In the main text, we compared with Zhang et al. [50], a state-of-the-art method in GAN detection, and outperform it across different synthesis methods. In addition, we include the performance of Nataraj et al. [31], another GAN detection method trained on co-occurrence matrices of images, and Cozzolino et al. [14], a few-shot single-target domain adaptation method trained on high-pass filtered images. For Cozzolino et al., we evaluate the ProGAN/CycleGAN model. Both methods are evaluated on 256 × 256 images in a zero-shot setting, and if an image is larger than 256 pixels, it is center-cropped to 256 pixels. The results are in Tab. 4.

A.6. Other evaluation metrics

To help clarify the threshold-less AP evaluation metric, we also computed several other metrics (Table 5). We provide the precision-recall curve on each dataset from our Blur+JPEG (0.1) model in Figure 8. We give the uncalibrated generalization accuracy of the model on the test distribution, obtained by simply using the classifier threshold learned during training, and the oracle accuracy, which chooses the threshold that maximizes accuracy on the test set. We also consider a two-shot regime where we have access to one real and one fake image from each dataset, and only the model's threshold is adjusted during the two-shot calibration process.

We calibrate the model with a single random real and fake pair, and we augment the image pair by taking 224 × 224 random crops 128 times. The images are passed into the model to get the logits, which are then fitted by a logistic regression (this method is also known as Platt scaling [36]).
[Figure 8: precision-recall curves for each test dataset, from the Blur+JPEG (0.1) model.]

Figure 8: Precision and recall curves. The PR curves on each dataset from the Blur+JPEG (0.1) model are shown. Note that AP is defined as the area under the PR curve; higher AP indicates a better trade-off between precision and recall.

We take the bias learned from the logistic regression to adjust the base rate of our model. Specifically, we apply the bias to our model's logit and then take the sigmoid to get the calibrated probability.

A.7. Detecting GAN images from the internet

Unfortunately, there are currently no collections of “in-the-wild” CNN-generated image datasets on which we can evaluate our model. As a proxy test case, we scraped 1k real faces and 1k fake faces from whichfaceisreal.com [4]. This is a website containing StyleGAN-generated faces and real faces at 1024 pixels, with all images compressed as JPEG. We tested our Blur+JPEG (0.1) model on this test set in two scenarios: (1) directly center-crop images to 224 pixels without resizing (matching how we test StyleGAN), or (2) resize to 256 pixels and then center-crop to 224 pixels. Without resizing, the model gets 83.6% accuracy and 93.2% AP. With resizing, the model drops to 74.9% accuracy and 82.6% AP, still well above chance (50%). This indicates our model can be robust to resizing and in-the-wild JPEG compression. However, maintaining similar performance after significant post-processing (e.g., heavy resizing) remains challenging.

A.8. CycleGAN testcase

While prior works on GAN detection [29, 31, 50] train on CycleGAN images and evaluate generalization across CycleGAN categories, our method is not trained on any CycleGAN images and tests generalization across methods (a significantly harder task). Nonetheless, we still observe comparable performance in terms of AP (Tab. 2 in the main text) when compared to Zhang et al. [50]. For a further comparison, we include our Blur+JPEG (0.1) model's accuracy on each CycleGAN category in Tab. 6.

B. Implementation Details

B.1. Dataset Collection

ProGAN [21] ² We take 20 officially released ProGAN models pretrained on LSUN [46] airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv-monitor, respectively. Following the official code, we sample the synthetic images with z ∼ N(0, I), and generate real images by center-cropping the images on the long edge (the center-crop length is exactly the length of the short edge) and then resizing to 256 × 256.

StyleGAN [22] ³ We take officially released StyleGAN models pretrained on LSUN [46] bedroom, cat, and car, with sizes 256 × 256, 256 × 256, and 512 × 384, respectively. We download the released synthesized images, all of which are generated with 0.5 truncation, and, following the code, we generate real images by resizing to the corresponding size of each category.

StyleGAN2 [23] ⁴ We take officially released StyleGAN2 config-F models pretrained on LSUN [46] church, cat, horse, and car, with sizes 256 × 256, 256 × 256, 256 × 256, and 512 × 384, respectively. We download the released synthesized images, all of which are generated with 0.5 truncation, and, following the code, we generate real images by resizing to the corresponding size of each category.

BigGAN [9] ⁵ We take the officially released BigGAN-deep model pretrained on 256 × 256 ImageNet images. Following the official code, we sample the images with a uniform class distribution and 0.4 truncation; we generate real images by center-cropping the images on the long edge (the center-crop length is exactly the length of the short edge) and then resizing to 256 × 256.

CycleGAN [54] ⁶ We take officially released CycleGAN models: apple2orange, orange2apple, horse2zebra, zebra2horse, summer2winter, and winter2summer, and generate real and fake image pairs from all six categories. Pre-processed real images and synthetic images are generated directly from the released code.

StarGAN [12] ⁷ We take the officially released StarGAN model pretrained on CelebA [28], and generate real and fake image pairs. Pre-processed real images and synthetic images are generated directly from the released code.

² https://github.com/tkarras/progressive_growing_of_gans
³ https://github.com/NVlabs/stylegan
⁴ https://github.com/NVlabs/stylegan2
⁵ https://tfhub.dev/s?q=biggan
⁶ https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
⁷ https://github.com/yunjey/stargan
             | StyleGAN | BigGAN | CycleGAN | StarGAN | GauGAN | CRN  | IMLE | SITD | SAN  | DeepFake
Uncalibrated | 87.1     | 70.2   | 85.2     | 91.7    | 78.9   | 86.3 | 86.2 | 90.3 | 50.5 | 53.5
Oracle       | 96.8     | 81.1   | 86.3     | 92.8    | 85.5   | 95.3 | 95.4 | 92.8 | 68.0 | 80.7
Two-shot     | 91.9     | 74.0   | 82.4     | 86.0    | 79.1   | 91.6 | 91.2 | 88.7 | 54.8 | 65.7

Table 5: Two-shot classifier calibration. We show the accuracy of the classifiers directly trained on ProGAN (“uncalibrated”), after calibrating the threshold given two examples from the test distribution (“two-shot”), and an upper bound given a perfect calibration (“oracle”).

Horse | Zebra | Summer | Winter | Apple | Orange | Facades | Cityscape | Map  | Ukiyoe | Vangogh | Cezanne | Monet | Photo | Avg.
62.1  | 87.5  | 83.2   | 88.0   | 90.5  | 87.7   | 100.    | 66.6      | 78.0 | 85.4   | 76.9    | 82.8    | 56.2  | 86.8  | 80.8

Table 6: CycleGAN testcase. We evaluate the uncalibrated accuracy of the Blur+JPEG (0.1) model tested on each CycleGAN category. We note that our model still performs well above chance (50%) even though it was not directly trained on any CycleGAN images.

GauGAN [34] ⁸ We take the officially released GauGAN model pretrained on COCO [27], and generate real and fake image pairs. Pre-processed real images and synthetic images are generated directly from the released code.

CRN [11] ⁹ We take the officially released CRN model pretrained on GTA, and generate synthesized images from pre-processed segmentation maps. Pre-processed real images and segmentation maps are downloaded from the IMLE repository.

IMLE [26] ¹⁰ We take the officially released IMLE model pretrained on GTA, and generate synthesized images from pre-processed segmentation maps. Pre-processed real images and segmentation maps are downloaded from the official repository.

SITD [10] ¹¹ We take the officially released pretrained model and the dataset captured by Sony and Fuji cameras from the repository. Pre-processed real images and synthetic images are generated directly from the released code.

SAN [15] ¹² We take both the ground truth and the officially released 4x super-resolution predictions on the standard benchmark datasets: Set5, Set14, BSD100, and Urban100. The synthetic images are downloaded directly from the repository.

DeepFake [39] ¹³ We download the raw manipulated and original image sequences in the validation and test splits of the Deepfakes dataset. We extract all frames from the videos, and in each frame a face is detected and cropped using Faced [2]. Similar to [39], our dataset is comprised entirely of cropped faces.

B.2. Training details

To train the classifiers, we use the Adam optimizer [24] with β1 = 0.9, β2 = 0.999, batch size 64, and initial learning rate 10⁻⁴. The learning rate is dropped by 10× if the validation accuracy does not increase by 0.1% over 5 epochs, and we terminate training at learning rate 10⁻⁶. One exception is that, in order to balance training iterations with the size of the training set, for the {2, 4, 8, 16}-class models and the {10, 20, 40, 80}%-data models, the learning rate is dropped if the validation accuracy plateaus for {50, 25, 13, 7} epochs instead.

⁸ https://github.com/NVlabs/SPADE
⁹ https://github.com/CQFIO/PhotographicImageSynthesis
¹⁰ https://github.com/zth667/Diverse-Image-Synthesis-from-Semantic-Layout
¹¹ https://github.com/cchen156/Learning-to-See-in-the-Dark
¹² https://github.com/daitao/SAN
¹³ https://github.com/ondyari/FaceForensics
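The B.2 learning-rate schedule maps naturally onto PyTorch's ReduceLROnPlateau. A sketch under that assumption, using the ResNet-50 binary classifier from Section 4.1, with the training and validation loops as placeholders:

```python
# Sketch of the B.2 optimization schedule: Adam at 1e-4, drop the learning
# rate 10x when validation accuracy fails to improve by 0.1% over 5 epochs,
# and stop once the rate falls below 1e-6. Training helpers are placeholders.
import torch
import torchvision

# ImageNet-pretrained ResNet-50 with a 1-logit real/fake head (Sec. 4.1)
model = torchvision.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=5,
    threshold=0.001, threshold_mode='abs')

for epoch in range(1000):                 # placeholder epoch budget
    train_one_epoch(model, optimizer)     # placeholder training loop
    val_acc = validate(model)             # placeholder: accuracy in [0, 1]
    scheduler.step(val_acc)
    if optimizer.param_groups[0]['lr'] < 1e-6:
        break                             # terminate at learning rate 1e-6
```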
