Deepfake Detection For Human Face Images and Videos: A Survey
  ABSTRACT Techniques for creating and manipulating multimedia information have progressed to the point where they can now ensure a high degree of realism. DeepFake is a generative deep learning technique that creates or modifies facial features so realistically that it is difficult to distinguish real from fake. This technology has greatly advanced and supports a wide range of applications in television, the video game industry, and cinema, such as improving visual effects in movies; it also enables a variety of criminal activities, such as generating misinformation by mimicking famous people. To identify and classify DeepFakes, research on DeepFake detection using deep neural networks (DNNs) has attracted increased interest. Essentially, a DeepFake is regenerated media obtained by injecting or replacing information within a DNN model. In this survey, we summarize DeepFake detection methods for face images and videos on the basis of their results, performance, methodology and detection type. We review the existing DeepFake creation techniques and sort them into five major categories. Generally, DeepFake models are trained on DeepFake datasets and tested with experiments; accordingly, we summarize the available DeepFake dataset trends, focusing on their improvements. Additionally, we analyze how DeepFake detection research aims to produce a generalized detection model. Finally, the challenges related to DeepFake creation and detection are discussed. We hope that the knowledge encompassed in this survey will accelerate the use of deep learning in face image and video DeepFake detection methods.
maps can be obtained by implementing the convolution operation with different kernels. While training, the convolution operation is called forward propagation; during backpropagation, the gradient descent optimization technique updates the learnable parameters (kernels and weights) according to the loss value. The feature value z^l_{i,j,k} at location (i, j) in the k-th feature map of the l-th layer in [13] is as follows:

    z^l_{i,j,k} = (W^l_k)^T x^l_{i,j} + b^l_k,                      (1)

where W^l_k and b^l_k are the weight vector and bias term of the k-th filter of the l-th layer, respectively, and x^l_{i,j} is the input patch centered at location (i, j) of the l-th layer. Then, a nonlinear activation function, such as the sigmoid, tanh or ReLU, is applied to detect nonlinear features. A nonlinear activation function A(·) can be expressed as:

    a^l_{i,j,k} = A(z^l_{i,j,k}),                                   (2)

where a^l_{i,j,k} is the output value after applying the nonlinear activation function.

A pooling layer provides a typical downsampling operation that reduces the dimensionality of the feature maps, introduces translation invariance to small shifts and distortions, and thereby decreases the number of subsequent learnable parameters. With the pooling function pool(·), for each feature map a^l_{:,:,k} we have:

    y^l_{i,j,k} = pool(a^l_{m,n,k}),    ∀(m, n) ∈ R_{i,j},          (3)

where R_{i,j} is a local neighborhood around location (i, j). The fully connected layers produce the final outputs of the CNN, such as the probabilities for each class in classification tasks; the number of output nodes in the final fully connected layer therefore typically equals the number of classes. The network is trained by minimizing a loss function of the form

    L = (1/N) Σ_{n=1}^{N} ℓ(θ; y^(n), o^(n)),                       (4)

where N denotes the number of input-output relations (x^(n), y^(n)), x^(n) is the n-th input data, y^(n) is its target label, o^(n) is the output of the CNN, and θ denotes the learnable parameters [13]. Training a CNN determines the global minima, which identify the best-fitting set of parameters by minimizing the loss function. Currently, many CNN models exist, such as AlexNet [15], ZFNet [16], VGGNet [17], GoogLeNet/Inception [18] and ResNet [19].

B. RNN BACKGROUND
An RNN is a neural network in which the output from the previous step is used as input in the next step. All inputs and outputs in typical neural networks are independent of one another; however, in some situations, such as when predicting the next word of a phrase, the prior words are necessary and must therefore be remembered. Consequently, RNNs were created, which use a hidden layer to overcome this problem. The hidden state, which remembers certain information about a sequence, is the most significant aspect of RNNs. RNNs have a ''memory'' that stores information about the calculations performed so far. This memory utilizes the same parameters for each input, since it performs the same job on all inputs and hidden states; unlike in other neural networks, this weight sharing reduces the number of parameters. Because simple recurrent cells struggle when the gap between relevant inputs is large, Hochreiter and Schmidhuber [20] proposed long short-term memory (LSTM) in 1997, which handles such long-term dependencies. LSTM has been a focus of deep learning since it accomplishes nearly all the exciting outcomes based on RNNs. The recurrent layers, also known as hidden layers in RNNs, are made up of recurrent cells whose states are influenced by both previous states and the current input via feedback connections. The classic recurrent cell updates its hidden state as h_t = tanh(W_h h_{t-1} + W_x x_t + b), combining the previous state h_{t-1} with the current input x_t.
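To make Eqs. (1)-(4) and the recurrent update concrete, the following NumPy sketch (our illustration; the function and variable names are our own and do not come from [13]) computes a single convolutional feature map, applies the ReLU activation, max-pools the result, and performs one classic recurrent step.

```python
import numpy as np

def conv_feature_map(x, W, b):
    """Eq. (1): z[i, j] = W^T x_patch(i, j) + b for one filter W."""
    kh, kw, _ = W.shape
    h, w, _ = x.shape
    z = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]   # local input patch x^l_{i,j}
            z[i, j] = np.sum(W * patch) + b    # inner product plus bias
    return z

def relu(z):
    """Eq. (2): elementwise nonlinear activation a = A(z)."""
    return np.maximum(z, 0.0)

def max_pool(a, size=2):
    """Eq. (3): maximum over each local neighborhood R_{i,j}."""
    h, w = a.shape
    a = a[:h - h % size, :w - w % size]
    return a.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def rnn_step(h_prev, x, Wh, Wx, b):
    """Classic recurrent cell: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(Wh @ h_prev + Wx @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))           # input volume (H, W, channels)
W = rng.standard_normal((3, 3, 3)) * 0.1       # one 3x3 convolution kernel
y = max_pool(relu(conv_feature_map(x, W, b=0.0)))
h = rnn_step(np.zeros(8), y.flatten(), rng.standard_normal((8, 8)),
             rng.standard_normal((8, y.size)), np.zeros(8))
print(y.shape, h.shape)                        # (15, 15) (8,)
```

In practice these loops are replaced by vectorized library kernels, and the parameters are updated by backpropagation as described above.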
B. IDENTITY SWAP
The identity swap technique, also called the face-swap method, is very popular for replacing the face of one person in an image or video with that of another person. An example of an identity swap can be seen in Figure 6, where the source image provides the identity, the target image provides the attributes, and a swapped face image is generated. Such swaps can be divided into two major types: i) graphics-based approaches such as FaceSwap and ii) deep learning-based approaches such as DeepFakes. The existing face-swap datasets are UADFV (49-FakeApp), DF-TIMIT (620-faceswap-GAN), FF++ (1k-FaceSwap, 1k-DeepFake), DFD (3k-DeepFake), Celeb-DF (5k-DeepFake) and DFDC Preview (4k-Unknown). This kind of manipulation might be useful in a variety of industries, including the entertainment industry. However, it might also be used for malicious objectives, such as the production of celebrity pornographic videos and financial fraud.

C. ATTRIBUTE MANIPULATION
Attribute manipulation, also known as face editing or face retouching, entails changing aspects of the face, such as hair or skin color, gender and age, or adding spectacles [31]. An example of attribute manipulation can be seen in Figure 7, where Figure 7(a) shows a source image and the corresponding generated images (blond hair, different gender, aged, and pale skin), and Figure 7(b) shows a source image and the corresponding generated images (angry, happy, and fearful). This manipulation process is usually carried out with a GAN, such as the StarGAN approach proposed in [31]. The popular AI face editor FaceApp, a mobile application, is an example of this type of manipulation. The existing attribute manipulation dataset is DFFD [28] (80K-StarGAN, 12K-FaceAPP). Consumers may utilize this technology to try a wide range of items in a virtual environment, including cosmetics and makeup, spectacles, and hairstyles.

FIGURE 7. Example of attribute manipulation in [31].

D. EXPRESSION SWAP
Expression swap, also known as face reenactment, modifies the facial expression of a person. An example of an expression swap can be seen in Figure 8, where the input expression is transferred to the targeted image, which then generates a reenactment result. The available techniques, such as image-level manipulation through popular GAN architectures [32], [33] and popular video-based manipulation techniques such as Face2Face [34] and neural textures [35], replace one person's facial expression in a video with another person's facial expression. The existing reenactment-based datasets are FF++ (509k-Face2Face [34], 406k-Neural-Textures [36]). This form of fraud could have significant consequences, such as a video of someone saying something that he or she never said.

FIGURE 8. Example of expression swap in [34].

E. MISCELLANEOUS
Regarding miscellaneous manipulation, we identified three types: face morphing, face deidentification, and audio-to-video and text-to-video facial expression swaps.

Face morphing is a technique for creating artificial biometric face samples that mimic the biometric data of multiple people. If a morphed face image is stored as a reference in a facial recognition system database, the faces of the contributing individuals are verified correctly against that manipulated reference. Hence, morphed face images constitute a significant threat to face recognition systems, as they contradict the core principle of biometrics, which is the unique link between a sample and its matching person. A comprehensive study of face morphing, covering both morphing strategies and morphing attack detectors, was presented in [37] in 2019.

Face deidentification is a type of manipulation used to remove biometric identity information from images and videos, which can prevent this information from being exploited for unauthorized verification. This can be accomplished in a variety of ways. The most basic method is face blurring or pixelation. Other methods also exist, such as identity swapping or synthesized identity swapping (applying operations such as changes in pose or expression). An adversarial autoencoder-based video face deidentification method was demonstrated in [38].

Audio-to-video (A2V) and text-to-video (T2V) manipulations are also called lip-sync DeepFakes [39]. Basically, the facial expression in a video is synthesized from audio or text. As an example of a fake video, [40] describes a method for synthesizing high-quality videos of a person (in this case, Barack Obama) speaking with an accurate lip-sync track. Other important state-of-the-art methods are discussed in [41], [42]. In addition, [43] presents a procedure for synthesizing counterfeit recordings from text: it takes a video of an individual talking and the desired content to be spoken and produces another video in which the individual's lips are synchronized with the new words.
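As a toy illustration of the pixel blending at the heart of the face morphing technique described above, the sketch below cross-dissolves two pre-aligned face crops. A realistic morphing attack would first warp both faces so that their facial landmarks coincide; the input file names here are hypothetical, and the plain pixel average is a deliberate simplification.

```python
import numpy as np
from PIL import Image

def cross_dissolve(face_a, face_b, alpha=0.5):
    """Naive morph: pixel-wise blend of two equally sized, pre-aligned faces.
    Real morphing warps both faces to averaged landmark positions first;
    this sketch shows only the final blending step."""
    a = np.asarray(face_a, dtype=np.float32)
    b = np.asarray(face_b, dtype=np.float32)
    blend = alpha * a + (1.0 - alpha) * b
    return Image.fromarray(blend.clip(0, 255).astype(np.uint8))

# 'subject_a.png' and 'subject_b.png' are placeholder pre-aligned face crops.
morph = cross_dissolve(Image.open("subject_a.png"), Image.open("subject_b.png"))
morph.save("morphed.png")   # candidate reference image for a morphing attack
```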
IV. DATASETS
Forensics datasets can be classified into two broad types: traditional and DeepFake datasets. Traditional forensics datasets are created with extensive manual effort under carefully controlled conditions targeting, for example, camera artifacts, splicing, inpainting, resampling and rotation detection. The Dresden Image Database (DID) [59] is based on camera fingerprinting and consists of 14,000 images from 73 cameras of 25 different models, covering both indoor and outdoor scenes. While most traditional datasets address image alteration forensics, only some of them cover video-based manipulation forensics. For example, MICC-F220, MICC-F2000 and MICC-F600 are image datasets used to detect copy-move modifications. MICC-F220 is composed of 110 tampered and 110 original images, MICC-F2000 is composed of 700 tampered and 1,300 original images, and MICC-F600 is composed of 160 tampered and 440 original images. The IEEE Information Forensics and Security Technical Committee (IFS-TC) conducted the First Image Forensics Challenge (2013), an international competition that collected thousands of photographs of varied scenes, both indoors and outdoors, using 25 digital cameras. The Wild Web Dataset (WWD) [45] contains 82 cases of 92 forgery variants and 101 unique mask splice detections; it aims to address a gap in the evaluation of image tampering localization algorithms. The performance of [45] is evaluated in [60]. The CelebFaces Attributes Dataset (CelebA) is a large-scale face attribute dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversity, large quantities, and rich annotations, including 10,177 identities, 202,599 face images, 5 landmark locations, and 40 binary attribute annotations per image.
In 2017, the VISION dataset was created; it contains 11,732 original images and 648 original videos. The images were uploaded to social platforms such as Facebook and WhatsApp, and the videos were uploaded to YouTube and WhatsApp, resulting in a total of 34,427 images and 1,914 videos.

The second main type of forensics dataset is the DeepFake dataset. These datasets are generally created by GAN-based models, which are very popular due to their realistic performance. UADFV [48] consists of 49 real YouTube videos and 49 DeepFake videos. The DeepFake videos are generated using the DNN model of FakeAPP. The average length of these videos is approximately 11.14 seconds, with a typical resolution of 294 × 500. The DeepFake-TIMIT (DF-TIMIT) dataset [49] was created using the VidTIMIT dataset [61] and faceswap-GAN; 16 similar-looking pairs of people from VidTIMIT [61] were selected, and for each of the 32 people, approximately 10 videos were generated with a face-swap GAN model at low quality (64 × 64, DF-TIMIT-LQ) and high quality (128 × 128, DF-TIMIT-HQ). FaceForensics (FF) [50] is a DeepFake dataset that supports forensic tasks of facial identification and segmentation in forged images. It is composed of 1,004 videos (face videos downloaded from YouTube) with over 500,000 frames. Its two types of manipulation are source-to-target, where facial expressions are transferred from a source video to a target video using Face2Face [34], and self-reenactment, where Face2Face reenacts the facial expressions of a source video. The FaceForensics++ (FF++) [51] dataset has 1,000 real videos collected from YouTube, and 1,000 DeepFake videos were generated by applying each of 4 face modification techniques: DeepFake, Face2Face [34], FaceSwap and Neural Textures [36] (4,000 face modification videos were created overall). These fake videos have produced 1.8 million manipulated face images. The Diverse Fake Face Dataset (DFFD) [28] combines multiple forgery types (FaceSwap, DeepFake, DeepFaceLab, FaceAPP, StarGAN and StyleGAN) in a single dataset. DeepFake Detection (DFD) [55] was developed by Google and Jigsaw; 363 original videos were filmed with the assistance of 28 invited actors, from which over 3,600 DeepFake videos were generated using DeepFake techniques. In September 2019, Amazon Web Services, Facebook, Microsoft, and a number of academics collected a large-scale DeepFake dataset for the DeepFake Detection Challenge-Preview (DFDC-P) [53]. A full version of the DFDC-P was then developed with eight manipulation methods and is known as the DeepFake Detection Challenge (DFDC) dataset. The Celeb-DF dataset [55] contains 590 actual videos and 5,639 DeepFake videos. Recently, the DeeperForensics-1.0 dataset (DF-1.0) [56] was introduced, consisting of 60,000 videos with a total of 17.6 million frames for real-world face forgery detection. For its construction, 100 paid actors were invited from 26 countries to collect high-resolution images of size 1920 × 1080; a new end-to-end face-swapping method (DF-VAE) was introduced, and seven types of perturbations at five intensity levels were systematically applied to the fake videos. More recently, the small WildDeepfake dataset (WDF) [57] was introduced, consisting of 7,314 face sequences extracted from 707 DeepFake videos collected entirely from the internet. WildDeepfake can be used to extend the existing datasets; moreover, WDF is used to develop and test the effectiveness of DeepFake detectors against real-world DeepFakes. In addition, research on DeepFakes is also expanding to examine more than one face in a single image, as in the OpenForensics dataset (OF) [58]. The OF dataset consists of 115K unrestricted images with 334K human faces. Table 3 summarizes these existing datasets.

V. DEEPFAKE DETECTION
DeepFake face image and video detection dominates research on monitoring multimedia information, with the positive intention of improving the confidentiality and integrity of multimedia content. Detecting such altered multimedia content is not an easy task, and it has become even more challenging since the emergence of generative models. Basically, forgery detection in multimedia content entails analyzing the content to determine whether it has been tampered with or is original. In the past, forgery detection was considered traditional research; in recent years, however, the detection of DNN-generated (AI-based) multimedia has become more popular. In this section, we discuss both traditional and DeepFake forensics-based techniques.

A. TRADITIONAL FORENSIC-BASED TECHNIQUES
To modify image content, various traditional image processing technologies are employed, such as copy-move (splicing), resampling (resizing, rotating, stretching), and the addition and/or removal of parts of the image. Traditional forensics-based techniques are commonly divided into two types: active and passive.

Active techniques require prior knowledge of the multimedia for the authentication process. Basically, at the time of multimedia generation, some information is encoded, such as a watermark or digital signature. For instance, a watermark is information that is added to a source image without degrading its visible content. A watermark extraction procedure is used to recover the watermark from the target image to discern whether the image has been manipulated, and the manipulated portions of the target image can be detected using the extracted watermark. Over the past few years, mimicking aspects of genuine users or generating hyperrealistic masks at the presentation side for face images and videos has highlighted one kind of biometric vulnerability (biometric attack). To monitor or identify such biometric attacks, a variety of anti-spoofing techniques are used, including eye blink detection in live stream scenarios, challenge-response techniques, 3D cameras, active flash and deep learning.
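As a minimal illustration of the active, watermark-based verification just described (a simplified sketch of our own, not a production scheme), the following code hides a binary watermark in the least significant bits of an image; re-extracting the bits later reveals exactly which pixels no longer match, localizing the manipulation.

```python
import numpy as np

def embed_lsb(image, watermark_bits):
    """Embed a binary watermark into the least significant bit of each pixel."""
    flat = image.flatten()
    bits = np.resize(watermark_bits, flat.size)     # tile the watermark over the image
    return ((flat & 0xFE) | bits).reshape(image.shape)

def extract_lsb(image):
    """Recover the embedded bit plane from a (possibly tampered) image."""
    return image.flatten() & 1

rng = np.random.default_rng(1)
img = rng.integers(0, 256, (64, 64), dtype=np.uint8)   # stand-in source image
wm = rng.integers(0, 2, 512, dtype=np.uint8)           # secret watermark bits

marked = embed_lsb(img, wm)
tampered = marked.copy()
tampered[10:20, 10:20] = 0                             # simulate a local manipulation

expected = np.resize(wm, marked.size)
mismatch = (extract_lsb(tampered) != expected).reshape(img.shape)
print("suspicious pixels:", int(mismatch.sum()))       # concentrated in the edited block
```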
Facial recognition [110] is essential for face image and video analysis before applying a traditional or DeepFake detection method. In this context, many researchers are interested in recognizing face images to identify authentic expressions, i.e., gestures made by the human face that communicate information such as fear, disgust, happiness, sadness, surprise, anger, and neutrality. Umer et al. [111], [112] proposed a method to identify human facial expressions using data augmentation and fine-tuning of a CNN model. A brief survey of biometric anti-spoofing methods for face recognition is available in [113]. To check the validity of face images, Umer et al. [114] proposed a method that combines preprocessing, feature extraction and classification techniques. Initially, landmarks are extracted from the face image to identify the face region of the person; next, features are extracted from the detected face region, and finally, the classifier scores obtained from these features are fused to calculate the final result.

In contrast to active techniques, passive techniques do not require prior knowledge of the multimedia for the authentication process. Instead, statistical information about the source image (multimedia) that is highly consistent across distinct images is used, and this inherent statistical information is utilized to detect any faked areas of the image. Passive forensic techniques are thus applicable in the absence of digital watermarks, signatures, or specialized hardware [115]. In Table 5, passive forensic techniques used in specific types of applications are summarized.

TABLE 5. Traditional forensics methods.

B. DEEPFAKES FORENSICS-BASED TECHNIQUES
Currently, DeepFake forensics-based techniques are a very active research area. Due to the popularity of DeepFake tools on the internet, it is very easy to create fake content that looks highly realistic and is difficult to distinguish with traditional techniques. To mitigate this challenge and classify content as either fake or pristine, researchers are developing DeepFake detection models. In contrast, many researchers are focusing on generating generalized realistic models to create DeepFakes. Creating DeepFakes is fun for users because many web-based tools are available online to perform such manipulations, but these manipulations can still identify people and cause them to be misused for unwanted activities. Moreover, such technology can be employed by cyber attackers to penetrate identification or authentication systems and gain illegitimate access, thus violating privacy and compromising social security and democracy.

To combat the destructive impacts of DeepFakes, researchers have also turned dedicated attention to multimedia forensic techniques to identify DeepFakes. Existing methods have focused on either spatial and temporal artifacts left by the generation process or data-driven classification. Recently, researchers have used features such as those in Figure 9 to build DeepFake detection models. This section reviews these features and the detection methods created from them; a summary of typical approaches is provided in Table 4. Inconsistencies, irregularities in the background, and GAN fingerprints are examples of spatial artifacts. Detecting fluctuations in a person's behavior, physiological signals, coherence, and video frame synchronization are all examples of temporal artifacts.

In this part, we review recent DeepFake detection techniques grouped into three types: (1) traditional-based techniques for DeepFakes, (2) DNN-based techniques for DeepFakes, and (3) artifact analysis for DeepFakes.

1) TRADITIONAL-BASED TECHNIQUES FOR DEEPFAKES
These methods examine pixel-level differences in images and videos to identify DeepFakes. Focusing on pixels and exploiting their correlations is easy to understand and provides hints in the detection process that clarify the variations between real and counterfeit (fake) content. When images or videos are modified by basic transformations, however, these approaches suffer from robustness concerns.

A novel photoresponse nonuniformity (PRNU) analysis method has been tested for its effectiveness at detecting DeepFake video manipulation [62]. This PRNU analysis reveals a statistically significant difference in mean normalized cross-correlation scores between real and DeepFake videos. However, the model was tested on a very small dataset: the DeepFake GUI application OpenFaceSwap was used to create 10 authentic and 16 DeepFake videos. The results show that a cut-off value of 0.05 yields a 3.8% false positive rate and a 0% false negative rate. In [64], a steganalysis method was adopted to identify DeepFake images. Co-occurrence matrices were constructed from RGB images, and the resulting values were used to train a deep convolutional neural network to identify the fakes. The experimental results show 99% classification accuracy for CycleGAN- and StarGAN-based fake images. Li et al. [65] evaluated the statistical properties of deep network-generated images, such as the correlation between adjacent pixels in the HSV and YCbCr color spaces, to distinguish DeepFake images. In Lips Don't Lie, Haliassos et al. [66] suggested a generalizable and robust approach, known as LipForensics, to detect face forgery in videos. The fundamental theme is monitoring lip movements for the high-level semantic inconsistencies that are present in many synthesized videos.
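A minimal sketch in the spirit of the PRNU test of [62] is given below: a noise residual is estimated by subtracting a denoised copy of each frame, and a probe frame is scored against a reference fingerprint with normalized cross-correlation. The Gaussian denoiser and the synthetic data are our simplifications; practical PRNU extraction uses stronger wavelet-based denoising.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(frame, sigma=1.0):
    """PRNU-style residual: frame minus a denoised (here Gaussian-smoothed) copy."""
    frame = frame.astype(np.float64)
    return frame - gaussian_filter(frame, sigma)

def ncc(a, b):
    """Normalized cross-correlation between two residual patterns."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(2)
# Reference fingerprint: average residual over pristine frames (synthetic stand-ins).
pristine = rng.integers(0, 256, (8, 128, 128)).astype(np.float64)
fingerprint = np.mean([noise_residual(f) for f in pristine], axis=0)

probe = pristine[0] + rng.normal(0, 2, (128, 128))     # hypothetical probe frame
score = ncc(noise_residual(probe), fingerprint)
print(f"mean NCC score: {score:.4f}")   # markedly lower scores suggest a swapped face
```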
Lugstein et al. [67] designed a novel pipeline to detect DeepFakes using photoresponse nonuniformity (PRNU). The PRNU technique is well known for detecting facial retouching and face morphing attacks. In [67], PRNU feature detection similar to that in [116], [117] is extended with a face image extraction stage as well as an SVM classification stage. Two mesoscopic models (Meso-4 and MesoInception-4), forming a compact facial video forgery detection network, were proposed by Afchar et al. [63] to classify hyperrealistic videos forged with DeepFake and Face2Face. Because compressed videos severely degrade image noise, microscopic analysis based on image noise is not applicable, and at a purely semantic level such forgeries are hard to judge; the mesoscopic models target the scale in between. The models efficiently detect hyperrealistic forged videos at a low computational cost. The average detection rate was found to be 98% for DeepFake videos and 95% for Face2Face videos under real conditions of diffusion on the internet.

2) DNN-BASED TECHNIQUES FOR DEEPFAKES
These methods use existing DNN models to analyze spatial characteristics, boost detection efficacy and improve the generalization capacity for detecting DeepFakes; they are entirely data-driven. However, all of these DNN-based detection approaches are vulnerable to adversarial attacks, and very few studies have assessed their performance in combating such attacks. Existing studies that use DNNs to detect DeepFakes can be divided into three types: fine-tuning approaches that improve the detection capacity of existing DNN models, approaches that explore artifact clues, and approaches that train DNN models on different types of datasets to improve generalization capacity. Güera and Delp [68] proposed a face-swapping-based detection method combining a CNN and an LSTM. InceptionV3 (a CNN) is used to extract frame-level features, and the output of the CNN is fed to an LSTM to construct a sequence descriptor that is used for classification. The highest accuracy of the model is greater than 97% when classifying a video as pristine or DeepFake.

A capsule network was used to detect forged images and videos in a variety of forging scenarios, including replay attack detection and (both full and partial) computer-generated image/video detection, in [69], where the capsule network was developed to resolve computer vision challenges and digital forensics issues. The ability of a capsule network based on a dynamic routing algorithm [118] to represent hierarchical pose relationships between object pieces has recently been demonstrated. To distinguish between fake and real images, the dynamic routing algorithm routes the outputs of three capsules to the output capsules over a series of iterations. Four datasets covering a wide spectrum of fabricated image and video attacks were used to test the approach, and on all four, the suggested strategy outperformed existing methods. This outcome demonstrates the capsule network's utility in developing a generic detection system that can effectively detect a variety of counterfeit image and video attacks.

A generalized fake face image detection method was proposed by Xuan et al. [71] in 2019. The key idea is to explicitly add a preprocessing step in the training stage to remove low-level unstable artifacts of GAN images and force the forensics classifier to focus on more intrinsic forensic clues. In the preprocessing step, Xuan et al. applied Gaussian blur and Gaussian noise, which suppress low-level unstable artifacts in the pixel data. DCGAN [21], WGAN-GP [22] and PGGAN [23] are used to generate the GAN images, with pristine images taken from CelebA-HQ. The images generated by PGGAN [23] are used to train the CNN, while those of DCGAN [21] and WGAN-GP [22] are used for testing. However, the model shows only a small improvement in generalization ability on unseen types of fake image datasets.

Investigating artifact clues in images and videos is also a prominent scheme for detecting DeepFakes. In [72], a combination of a recurrent convolutional model and a face alignment approach was introduced to detect three types of manipulation: DeepFake, Face2Face and FaceSwap. Initially, preprocessing operations are applied to the video to detect, crop and align faces in a sequence of frames. Next, a combination of an appropriate CNN model, ResNet [19] or DenseNet [119], with alignment and a bidirectional recurrent network is used to test the accuracy. The model of [72] is able to utilize micro-, meso- and macroscopic features for manipulation detection. According to the experimental results, landmark-based face alignment with a bidirectional recurrent DenseNet performs best for detecting face manipulation in videos.
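The frame-features-plus-recurrence pattern shared by [68] and [72] can be sketched in a few lines of PyTorch. The block below is a hedged illustration, not either paper's exact architecture: a small CNN stands in for InceptionV3/ResNet/DenseNet, the face alignment stage is omitted, and all layer sizes are our own choices.

```python
import torch
import torch.nn as nn

class FrameSequenceDetector(nn.Module):
    """Toy CNN+LSTM detector: per-frame features -> temporal model -> one logit."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                  # stand-in frame-level backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)           # one logit: fake vs. pristine

    def forward(self, clips):                      # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))      # (batch*frames, feat_dim)
        seq, _ = self.lstm(feats.view(b, t, -1))   # per-frame sequence descriptor
        return self.head(seq[:, -1])               # classify from the last state

model = FrameSequenceDetector()
clips = torch.randn(2, 8, 3, 64, 64)               # 2 clips of 8 aligned face frames
print(model(clips).shape)                          # (2, 1); sigmoid gives P(fake)
```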
Jeon et al. [73] introduced the FDFtNet method to improve the capability of existing CNN models, such as SqueezeNet, ShallowNetV3, ResNetV2, and Xception. In this method, fine-tuning is used to extract features with MBblockV3, and the approach can be regarded as a fine-tuning transformation. It shows higher performance than the existing classical models; however, its performance against unseen types of GAN-based image manipulation attacks has not been evaluated. Jeon et al. [74] also proposed the transferable GAN-image detection framework (T-GD), which efficiently detects DeepFake images. The model works on teacher and student relations, which mutually improve the detection performance.

Hsu et al. [75] proposed a pairwise learning model to detect GAN-generated fake images. The model combines an improved version of the DenseNet backbone network with a Siamese network architecture and is called the common fake feature network (CFFN). To learn discriminative common fake features, pairwise information (a labeled training dataset) is provided to the CFFN. The trained CFFN can then perform the classification task of indicating whether an image is real or fake.

Gandhi and Jain [76] proposed a method to enhance the performance of existing DeepFake models by adding adversarial perturbations to DeepFake images. The fast gradient sign method and the Carlini and Wagner L2 norm attack are used to create adversarial perturbations in both black-box and white-box settings, and Lipschitz regularization and the deep image prior (DIP) are introduced to increase the robustness of CNN (ResNet and VGG)-based DeepFake detectors. Lipschitz regularization improves the detection of perturbed DeepFakes by 10 percent in the black-box scenario, and the DIP defense obtains 95 percent accuracy on perturbed DeepFakes while retaining 98 percent accuracy on the original set. However, the two defenses have limitations: the improvement from Lipschitz regularization in the white-box scenario is only 2.2 percent, and although the DIP method outperforms Lipschitz regularization, its detection process is highly time-consuming even with a high-performance configuration. Wu et al. [77] introduced the SSTNet method, which combines spatial, steganalysis and temporal feature extraction procedures to detect DeepFakes. XceptionNet is used to monitor the spatial features and statistical information of the image, steganalysis operations are applied, and an RNN is used to mine the temporal features. Finally, all the extracted information is combined for binary classification to detect DeepFakes.

Liu et al. [78] used global texture data to increase the robustness and generalization capabilities of existing CNNs in identifying synthetic fake faces. Their Gram-Net shows significant resistance to perturbation attacks such as downsampling, JPEG compression, blur, and noise, according to experimental data. Gram-Net, which has demonstrated encouraging results in the wild, also has a proven generalization capacity in working with various GANs.

Current DeepFake detection methods use small datasets for specific types of manipulation, and because the generated DeepFakes are highly realistic, the detection techniques suffer in performance. To address this issue, Khalid and Woo [79] proposed the OC-FakeDect method, which uses a one-class variational autoencoder (VAE) trained only on real face images and detects nonreal images such as DeepFakes by treating them as anomalies.

Fung et al. [80] introduced a unique unsupervised learning method for detecting facial modification. Two modified copies of a face image are generated using two distinct transformations and fed into two sequential subnetworks (an Xception backbone and a projection head network), and the outputs of the projection head networks are trained to maximize their agreement. The model architecture was inspired by the method proposed by Chen et al. [120], which achieves higher accuracy in visual representation learning than previous state-of-the-art methods.

Conventional DNNs have frequently been used to detect fake faces by improving their generalization ability; however, they can overfit specific manipulation types and suffer from transferability concerns when confronted with unknown manipulation methods. Tariq et al. [81] proposed a generalized method to detect multiple types of DeepFakes, tested also on unseen types of DeepFakes such as the DeepFake-in-the-Wild video dataset (Shahroztariq/CLRNet/blob/main/dataset_samples). The main idea is to trace the spatial and temporal information in DeepFakes with a convolutional LSTM-based residual network (CLRNet), which has a unique training strategy. The best performance of the CLRNet model on the DeepFake-in-the-Wild video dataset is 93.86%.

3) ARTIFACT ANALYSIS FOR DEEPFAKES
DeepFakes frequently contain artifacts that are difficult for humans to identify but are quickly recognized by machine and forensic analysis. Inconsistencies, irregularities in the background, and GAN fingerprints are examples of spatial artifacts. Detecting fluctuations in a person's behavior, physiological signals, coherence, and video frame synchronization are all examples of temporal artifacts. Agarwal et al. [88], [97] proposed combining static biometrics of facial identity with temporal behavioral biometrics of facial expressions and head movements for DeepFake detection. According to Chai et al. [98], redundant artifacts can be evaluated from local patches to identify fake faces. This idea has been tested using different existing models, such as ResNet-18 [19], Xception [121], MesoInception4 [63], and a CNN [122], with p values of 0.1 and 0.5 on the CelebA-HQ and FFHQ datasets, respectively.5 This idea shows generalized characteristics across different network architectures and different datasets.

5 https://github.com/NVlabs/ffhq-dataset
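For reference, the FGSM perturbation used by Gandhi and Jain [76] can be written in a few lines; the sketch below is generic rather than their exact setup, and `detector`, the labels and epsilon are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(detector, faces, labels, epsilon=0.01):
    """Fast gradient sign method: x_adv = x + eps * sign(dL/dx).
    `detector` is any differentiable model producing one logit per image."""
    faces = faces.clone().detach().requires_grad_(True)
    loss = F.binary_cross_entropy_with_logits(detector(faces), labels)
    loss.backward()
    adv = faces + epsilon * faces.grad.sign()   # one-step perturbation
    return adv.clamp(0.0, 1.0).detach()         # keep a valid image range

# Usage sketch (hypothetical trained detector `model`):
# adv = fgsm_attack(model, fake_faces, torch.ones(fake_faces.size(0), 1))
# A robust detector should still assign adv a high probability of being fake.
```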
Zhang et al. [82] raised concerns about applications that can swap faces in less than a minute, which can be a serious problem for face authentication on the internet. To address this issue, they proposed an automated face-swapping method together with a detection method built from basic machine learning techniques. Initially, key points are detected in the face image and represented as descriptors (capturing local information around each key point). Because each key point is independent, a clustering operation is then applied to generate a codebook for each image, and this codebook is taken as input to linear or nonlinear machine learning models to estimate the image's legitimacy. Specifically, the features are extracted using speeded-up robust features (SURF) [123], and the bag-of-words (BoW) [124] method is used to generate the codebook. The codebook information is then fed into support vector machines (SVMs), random forests (RFs) and multilayer perceptrons (MLPs) for binary classification. In the experiments, the best detection accuracy is greater than 92%. Nirkin et al. [109] used the discrepancy between faces and their context to identify fake faces. In other words, two networks are trained: the first network is trained to identify the person's face, and the second, context recognition, network takes the face's context into account, such as the person's hair, ears, and neck. To identify fake faces, discrepancies are calculated by comparing these two networks. This method exhibits a high generalization ability.

Rather than looking at the visual artifacts in fake faces, other researchers are looking at the imperfect designs of current GANs, which offer signals for distinguishing between genuine and DeepFake faces. McCloskey and Albright [89] exploited the architecture of the GAN generator to enhance methods for detecting visual artifacts in DeepFake images. In particular, the generator's normalization processes, which reduce the frequency of saturated and underexposed pixels, are taken into account, and the resulting features are classified by an SVM. Marra et al. [90] proposed GAN fingerprints (unique artifacts, e.g., of ProGAN and CycleGAN) that aim to detect DeepFake images.

Yu et al. [92] studied GAN fingerprints for image attribution and used them to classify images as real or GAN-generated; this study also identified the source of GAN-generated images. If the model is trained with even a small change in the dataset, the model fingerprint will be distinct, which lends greater granularity to model authentication. Additionally, fine-tuning is an effective technique for immunizing the DNN model against adversarial perturbations of fingerprint images.

Analyzing artifacts in biological signals is also gaining prominent attention from researchers aiming to identify DeepFakes. In synthesized fake faces, biological signal artifacts provide evident cues for fake detection. These biological signals can be divided into the following groups: visual-audio inconsistency, visual inconsistency and biological signals in video. Visual-audio irregularity in DeepFake videos is a very important clue for detecting synthesized video, and the techniques of [39], [99], [102] can clearly demonstrate why a video is fake. Mittal et al. [99] distinguish ''real'' and ''fake'' videos using the correlation between modalities and affective signals. To model the visual and audio streams in videos, a Siamese network is used along with a combination of two triplet loss functions to determine similarity. One loss function calculates the similarity between the visual and auditory stimuli, while the other calculates affect cues such as perceived emotion. The experimental results show that estimating the audio-visual correlation is efficient for identifying DeepFake videos. Agarwal et al. [39] introduced a fake video detection method that takes advantage of mismatches between the dynamics of the mouth shape (visemes) and the pronounced phonemes. Mama, baba, and papa are examples of words whose phonemes require the lips to be completely closed to be properly spoken. The authors' recommended strategy worked well, especially as the video became longer. The Modality Dissonance Score (MDS) was proposed by Chugh et al. [102] to detect DeepFake videos. Dissimilarity scores are calculated between audio-visual segments over 1-second video segments, and the MDS is estimated by aggregating over all the segments. The resulting value can efficiently identify DeepFake videos. This method can also be utilized for temporal forgery localization, which identifies the video segments that have been tampered with.

A second group of methods monitors the lack of visual consistency, particularly in the shape, facial features, and landmarks of faces, to identify DeepFake videos [48], [84], [87], [94]. Li et al. [84] proposed an eye blinking-based fake face video detection method using a CNN and an RNN in an LRCN model. The LRCN model consists of three steps: feature extraction from the eye sequence using VGG16; sequence learning using LSTM, a special kind of RNN; and finally, state prediction, which generates the likelihood of eye open and closed states based on the output of the LSTM. The best performance of the model in terms of the area under the ROC curve was 0.99. Li and Lyu [87] described a new deep learning-based model that can distinguish DeepFake videos from real videos. The model takes advantage of the warping step during DeepFake creation: this step leaves a resolution discrepancy between the warped face area and the surrounding context, producing noticeable artifacts, and CNN models are used to detect them. Specifically, a CNN is trained to recognize faces first and then extract landmarks to compute transform matrices that align the faces to a standard configuration. Gaussian blurring is applied to the aligned face, and the inverse of the estimated transformation matrix is then used to affine-warp it back to the original image. Faces are aligned at several scales to boost data diversity and to simulate more varied resolution scenarios of affine-warped faces. The performance was calculated for four CNN models, namely, VGG16, ResNet50, ResNet101 and ResNet152, on DeepFake datasets (UADFV and DF-TIMIT with two qualities, LQ and HQ). The ResNet50-based DeepFake detection model outperforms the other models on these DeepFake datasets.
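The LRCN of [84] learns blink dynamics end to end; a far simpler landmark-based heuristic in the same spirit is the eye aspect ratio (EAR), sketched below. This is our illustration, not the method of [84]: the six-landmark eye layout follows common facial landmark detectors, and the 0.2 blink threshold is an assumed constant.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR from six eye landmarks p1..p6 (each an (x, y) point):
    (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|); it collapses toward 0 on closure."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal + 1e-12)

def never_blinks(eye_landmarks_per_frame, closed_thresh=0.2):
    """Flag a clip as suspicious if the EAR never falls below the blink threshold,
    a telltale of early DeepFakes whose subjects rarely blinked."""
    ears = [eye_aspect_ratio(eye) for eye in eye_landmarks_per_frame]
    return min(ears) > closed_thresh

# eye_landmarks_per_frame would come from a landmark detector such as dlib:
# an array of shape (num_frames, 6, 2) for one eye (hypothetical input).
```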
Yang et al. [48] suggested a method for detecting inconsistencies in 3D head pose movement, which includes head orientation and position. To estimate orientation and position, 68 facial landmarks of the central face region are used. The 3D head poses are investigated because the DeepFake face generation pipeline has a flaw that introduces such inconsistencies. After the features are retrieved, they are passed to an SVM classifier. Experiments on two datasets (UADFV, DARPA MediFor) reveal that this detection method outperforms the other methods. Guarnera et al. [103] proposed a model for DeepFake detection that monitors hidden forensic traces in images. The expectation maximization (EM) algorithm [125] is used to extract a set of local features that model the underlying convolutional generative process. The model was evaluated on five different types of DeepFake creation techniques, namely, GDWCT, StarGAN, ATTGAN, StyleGAN and StyleGAN2, and on the CELEBA dataset, using naïve classifiers to discriminate between originals and fakes.

Matern et al. [94] investigated a way to exploit DeepFake and face manipulation artifacts based on visual attributes such as the eyes, teeth, and facial contours. The visual artifacts are caused by a lack of global consistency, an incorrect or inadequate estimate of the incident illumination, or an inaccurate estimate of the underlying geometry. To detect DeepFakes, geometric inconsistencies in the reflections of the eye and tooth areas are monitored, and textural characteristics collected from the face region based on facial landmarks are taken into account. Consequently, eye, teeth, and full-face crop features are employed. Following feature extraction, two classifiers, namely, logistic regression and a shallow neural network, are used to distinguish DeepFakes from original videos. The model works well on YouTube videos, with a best result of 0.851 in terms of the area under the receiver operating characteristic curve. The drawback of this method is that it requires pictures that satisfy specific criteria, such as open eyes or visible teeth. Fernandes et al. [104] proposed an attribution-based confidence (ABC) metric [126] for detecting DeepFake videos. Initially, DeepFake videos were created using a commercial website (https://deepfakesweb.com/). The generated DeepFakes were then tested on a pretrained ResNet50 model that had been trained with the VGGFace2 dataset [105]. Based on the obtained attribution scores, a threshold value of 0.94 on the ABC metric was chosen to differentiate pristine from DeepFake videos. Hu et al. [107] analyzed the inconsistency between the two eyes to detect DeepFake face images. The detection model takes advantage of physical/physiological constraints on GAN-based images and estimates the discrepancy between the two eyes to identify fakes. These constraints provide solid assurances for explaining the decision to label an image real or fake; however, they will become invalid as improved GANs emerge. In addition, the model's resistance to perturbation attacks is unknown. Demir and Ciftci [108] proposed a model to detect DeepFakes by analyzing gaze in videos.

The biological signals in such videos are difficult to duplicate. Studies have demonstrated that heart rate is useful for detecting DeepFake videos, although extracting the heart rate from videos is itself a challenging task. Fernandes et al. [96] took advantage of the neural ordinary differential equation (Neural-ODE [127]) to identify DeepFake videos. Qi et al. [106] proposed the DeepRhythm model, which also exposes DeepFake videos using heartbeat rhythms. The authors created a motion-magnified spatial-temporal representation (MMSTR) of the video to highlight heart rhythm signals. Finally, based on the output of the MMSTR, a dual-spatial-temporal attentional network was built to identify fraudulent videos.

VI. CHALLENGES FOR DEEPFAKE CREATION AND DETECTION
In recent years, many DeepFake tools with highly realistic performance have become available, and many more are in development. At the same time, the development of DeepFake generation models is creating large challenges for the forensics experts who must combat them. DeepFakes are AI-generated hyperrealistic images or videos that have been digitally edited using techniques such as swapping faces, changing attributes, and representing individuals saying and doing things that never happened.

GANs, which are popular artificial intelligence (AI) techniques, consist of a discriminative and a generative model that compete against each other to improve their performance in generating believable fakes. These impersonations of real persons frequently go viral and spread swiftly across social media platforms, making them an effective tool for propaganda. In digital forensics, as in other security-related disciplines, it is necessary to account for the presence of an adversary who is actively attempting to fool investigators. In reality, a knowledgeable attacker who understands the concepts on which the forensic tools are based may take a variety of counterforensic steps to avoid detection [128]. Forensics tools should be able to detect such adversarial threats, as well as any real-world conditions that tend to degrade test accuracy. The numerous counterforensic approaches intended to confuse current detectors are therefore a valuable aid in the development of multimedia forensics, as they expose the flaws in current solutions and encourage research toward more robust solutions.

To date, many models are available to create or detect fakes, but they still have weaknesses. In the following subsections, we discuss the main challenges, point by point, in creating or detecting DeepFakes.

A. CHALLENGES FOR DEEPFAKE CREATION
Despite the fact that significant efforts have been made to increase the visual quality of created DeepFakes, a number of hurdles remain. Some challenges related to creating DeepFakes include generalization,
B. CHALLENGES FOR DEEPFAKE DETECTION
• Lack of DeepFake datasets: The performance of a DeepFake detection model depends on the variety of large datasets used during training. If the model is tested on downloaded media with an unknown type of manipulation, then designing the model to identify that manipulation is challenging. Moreover, owing to the popularity of web-based applications, postprocessing operations are applied to DeepFake multimedia with the intention of fooling the DeepFake detector; such manipulation may consist of removing temporal artifacts, blurring, smoothing, cropping, etc.
• Unknown type of attack: Another challenging task is to design a DeepFake detection model that is robust against unknown types of attacks, such as the fast gradient sign method (FGSM) [129] and the Carlini and Wagner L2-norm attack (CW-L2) [130]. These attacks are used to fool classifiers into changing their output. An example of a DeepFake created from source and target faces, with adversarial perturbations, is shown in Figure 13: DeepFakes are accurately classified as fake by a DeepFake detector, but adversarially perturbed DeepFakes are classified as real. A minimal FGSM sketch is given after this list.
• Temporal aggregation: Existing DeepFake detection algorithms use binary frame-level classification, which involves determining whether each video frame is real or fake. However, as these methods do not take interframe temporal consistency into consideration, they may encounter issues such as temporal abnormalities, with real and artificial frames occurring in consecutive intervals. Furthermore, these methods necessitate an extra step to compute the video integrity score, which must be aggregated over all frames to obtain the final result; see the aggregation sketch after this list.
• Unlabeled data: Usually, DeepFake detection models are trained with large datasets. However, in some cases, such as journalism- or law enforcement-based DeepFake detection, only a small dataset may be available. Moreover, this kind of dataset requires additional effort to label each sample according to the type of forgery used. Consequently, further study is required to understand journalism- and law enforcement-based forgery cases.
FIGURE 13. An example of an adversarial attack on a DeepFake detector in [76].
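The extra aggregation step noted under temporal aggregation can be as simple as mean pooling the per-frame fake probabilities, as in the following sketch. Mean pooling is one common choice, not a prescribed method, and it inherits the temporal-consistency blind spot described above.

```python
import torch

def video_integrity_score(frame_logits: torch.Tensor) -> float:
    """Mean-pool per-frame logits from a frame-level detector into a
    single video-level probability of being fake."""
    return torch.sigmoid(frame_logits).mean().item()

frame_logits = torch.randn(120)               # e.g., a 120-frame clip
score = video_integrity_score(frame_logits)
label = "fake" if score > 0.5 else "real"
```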
VII. CONCLUSION
This article offers a comprehensive survey of a new and prominent technology, namely, DeepFake. It communicates the basics, benefits and threats associated with DeepFake, as well as GAN-based DeepFake applications. In addition, DeepFake detection models are discussed. The inability to transfer and generalize is common in most existing deep learning-based detection methods, which implies that multimedia forensics has not yet reached its zenith. Considerable interest has been shown by important organizations and experts, who are contributing to the improvement of applied techniques. However, much effort is still needed to ensure data integrity, hence the need for other protection methods. Furthermore, experts anticipate a new wave of DeepFake propaganda in AI-against-AI encounters in which neither side has an edge over the other.

REFERENCES
[1] H. Farid, ‘‘Image forgery detection,’’ IEEE Signal Process. Mag., vol. 26, no. 2, pp. 16–25, Mar. 2009.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.
[3] P. Baldi, ‘‘Autoencoders, unsupervised learning, and deep architectures,’’ in Proc. ICML Workshop Unsupervised Transf. Learn., 2012, pp. 37–49.
[4] T. Karras, S. Laine, and T. Aila, ‘‘A style-based generator architecture for generative adversarial networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4401–4410.
[5] Y. Mirsky and W. Lee, ‘‘The creation and detection of deepfakes: A survey,’’ ACM Comput. Surv., vol. 54, no. 1, pp. 1–41, Jan. 2022.
[6] M. Masood, M. Nawaz, K. M. Malik, A. Javed, and A. Irtaza, ‘‘Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward,’’ 2021, arXiv:2103.00484.
[7] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, and J. Ortega-Garcia, ‘‘Deepfakes and beyond: A survey of face manipulation and fake detection,’’ Inf. Fusion, vol. 64, pp. 131–148, Dec. 2020.
[8] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, and C. M. Nguyen, ‘‘Deep learning for deepfakes creation and detection: A survey,’’ 2019, arXiv:1909.11573.
[9] L. Verdoliva, ‘‘Media forensics and DeepFakes: An overview,’’ IEEE J. Sel. Topics Signal Process., vol. 14, no. 5, pp. 910–932, Aug. 2020.
[10] K. Fukushima, ‘‘Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,’’ Biol. Cybern., vol. 36, no. 4, pp. 193–202, Apr. 1980.
[11] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, ‘‘Handwritten digit recognition with a back-propagation network,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 2, 1989, pp. 396–404.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[13] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, ‘‘Recent advances in convolutional neural networks,’’ Pattern Recognit., vol. 77, pp. 354–377, May 2018.
[14] S. Dong, P. Wang, and K. Abbas, ‘‘A survey on deep learning and its applications,’’ Comput. Sci. Rev., vol. 40, May 2021, Art. no. 100379.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and A. C. Berg, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[16] M. D. Zeiler and R. Fergus, ‘‘Visualizing and understanding convolutional networks,’’ in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2014, pp. 818–833.
[17] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[19] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[20] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[21] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised representation learning with deep convolutional generative adversarial networks,’’ 2015, arXiv:1511.06434.
[22] A. Creswell and A. A. Bharath, ‘‘Inverting the generator of a generative adversarial network,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 7, pp. 1967–1974, Jul. 2019.
[23] T. Karras, T. Aila, S. Laine, and J. Lehtinen, ‘‘Progressive growing of GANs for improved quality, stability, and variation,’’ 2017, arXiv:1710.10196.
[24] A. Brock, J. Donahue, and K. Simonyan, ‘‘Large scale GAN training for high fidelity natural image synthesis,’’ 2018, arXiv:1809.11096.
[25] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, ‘‘Analyzing and improving the image quality of StyleGAN,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8110–8119.
[26] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, ‘‘Training generative adversarial networks with limited data,’’ 2020, arXiv:2006.06676.
[27] Generated Photos. Face Generator—Generate Faces Online Using AI. [Online]. Available: https://generated.photos/face-generator
[28] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, ‘‘On the detection of digital face manipulation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5781–5790.
[29] J. C. Neves, R. Tolosana, R. Vera-Rodriguez, V. Lopes, H. Proenca, and J. Fierrez, ‘‘GANprintR: Improved fakes and evaluation of the state of the art in face manipulation detection,’’ IEEE J. Sel. Topics Signal Process., vol. 14, no. 5, pp. 1038–1048, Aug. 2020.
[30] Y. Zhu, Q. Li, J. Wang, C. Xu, and Z. Sun, ‘‘One shot face swapping on megapixels,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 4834–4844.
[31] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, ‘‘StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8789–8797.
[32] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, ‘‘AttGAN: Facial attribute editing by only changing what you want,’’ 2017, arXiv:1711.10678.
[33] M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, ‘‘STGAN: A unified selective transfer network for arbitrary image attribute editing,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3673–3682.
[34] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Niessner, ‘‘Face2Face: Real-time face capture and reenactment of RGB videos,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2387–2395.
[35] J. Thies, M. Zollhöfer, and M. Nießner, ‘‘Deferred neural rendering: Image synthesis using neural textures,’’ ACM Trans. Graph., vol. 38, no. 4, pp. 1–12, 2019.
[36] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1125–1134.
[37] U. Scherhag, C. Rathgeb, J. Merkle, R. Breithaupt, and C. Busch, ‘‘Face recognition systems under morphing attacks: A survey,’’ IEEE Access, vol. 7, pp. 23012–23026, 2019.
[38] O. Gafni, L. Wolf, and Y. Taigman, ‘‘Live face de-identification in video,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9378–9387.
[39] S. Agarwal, H. Farid, O. Fried, and M. Agrawala, ‘‘Detecting deep-fake videos from phoneme-viseme mismatches,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 660–661.
[40] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, ‘‘Synthesizing Obama: Learning lip sync from audio,’’ ACM Trans. Graph., vol. 36, no. 4, pp. 1–13, 2017.
[41] Y. Song, J. Zhu, D. Li, X. Wang, and H. Qi, ‘‘Talking face generation by conditional recurrent adversarial network,’’ 2018, arXiv:1804.04786.
[42] L. Song, W. Wu, C. Qian, R. He, and C. C. Loy, ‘‘Everybody’s Talkin’: Let me talk as you want,’’ 2020, arXiv:2001.05201.
[43] O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala, ‘‘Text-based editing of talking-head video,’’ ACM Trans. Graph., vol. 38, no. 4, pp. 1–14, Aug. 2019.
[44] I. Amerini, L. Ballan, R. Caldelli, A. Del Bimbo, and G. Serra, ‘‘A SIFT-based forensic method for copy–move attack detection and transformation recovery,’’ IEEE Trans. Inf. Forensics Security, vol. 6, no. 3, pp. 1099–1110, Sep. 2011.
[45] (2015). Wild Web Tampered Image Dataset. [Online]. Available: https://mklab.iti.gr/results/the-wild-web-tampered-image-dataset/
[46] Z. Liu, P. Luo, X. Wang, and X. Tang, ‘‘Deep learning face attributes in the wild,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.
[47] D. Shullani, M. Fontani, M. Iuliani, O. A. Shaya, and A. Piva, ‘‘VISION: A video and image dataset for source identification,’’ EURASIP J. Inf. Secur., vol. 2017, no. 1, pp. 1–16, Dec. 2017.
[48] X. Yang, Y. Li, and S. Lyu, ‘‘Exposing deep fakes using inconsistent head poses,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 8261–8265.
[49] P. Korshunov and S. Marcel, ‘‘DeepFakes: A new threat to face recognition? Assessment and detection,’’ 2018, arXiv:1812.08685.
[50] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, ‘‘FaceForensics: A large-scale video dataset for forgery detection in human faces,’’ 2018, arXiv:1803.09179.
[51] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niessner, ‘‘FaceForensics++: Learning to detect manipulated facial images,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1–11.
[52] Google AI Blog. (2019). Contributing Data to Deepfake Detection Research. [Online]. Available: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html
[53] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, ‘‘The deepfake detection challenge (DFDC) preview dataset,’’ 2019, arXiv:1910.08854.
[54] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, ‘‘The DeepFake detection challenge (DFDC) dataset,’’ 2020, arXiv:2006.07397.
[55] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, ‘‘Celeb-DF: A large-scale challenging dataset for DeepFake forensics,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3207–3216.
[56] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, ‘‘DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2889–2898.
[57] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, ‘‘WildDeepfake: A challenging real-world dataset for deepfake detection,’’ in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 2382–2390.
[58] T.-N. Le, H. H. Nguyen, J. Yamagishi, and I. Echizen, ‘‘OpenForensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild,’’ in Proc. Int. Conf. Comput. Vis., Oct. 2021, pp. 10117–10127.
[59] T. Gloe and R. Böhme, ‘‘The ‘Dresden Image Database’ for benchmarking digital image forensics,’’ in Proc. ACM Symp. Appl. Comput. (SAC), 2010, pp. 1584–1590.
[60] M. Zampoglou, S. Papadopoulos, and Y. Kompatsiaris, ‘‘Detecting image splicing in the wild (web),’’ in Proc. IEEE Int. Conf. Multimedia Expo. Workshops (ICMEW), Jun. 2015, pp. 1–6.
[61] C. Sanderson, ‘‘The VidTIMIT database,’’ IDIAP Inst. Res., Martigny, Switzerland, Tech. Rep. Idiap-Com-06-2002, 2002.
[62] M. Koopman, A. M. Rodriguez, and Z. Geradts, ‘‘Detection of deepfake video manipulation,’’ in Proc. 20th Irish Mach. Vis. Image Process. Conf. (IMVIP), Aug. 2018, pp. 133–136.
[63] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, ‘‘MesoNet: A compact facial video forgery detection network,’’ in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Dec. 2018, pp. 1–7.
[64] L. Nataraj, T. M. Mohammed, B. Manjunath, S. Chandrasekaran, A. Flenner, J. H. Bappy, and A. K. Roy-Chowdhury, ‘‘Detecting GAN generated fake images using co-occurrence matrices,’’ Electron. Imag., vol. 2019, no. 5, pp. 1–532, 2019.
[65] H. Li, B. Li, S. Tan, and J. Huang, ‘‘Identification of deep network generated images using disparities in color components,’’ Signal Process., vol. 174, Sep. 2020, Art. no. 107616.
[66] A. Haliassos, K. Vougioukas, S. Petridis, and M. Pantic, ‘‘Lips don’t lie: A generalisable and robust approach to face forgery detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5039–5049.
[67] F. Lugstein, S. Baier, G. Bachinger, and A. Uhl, ‘‘PRNU-based deepfake detection,’’ in Proc. ACM Workshop Inf. Hiding Multimedia Secur., Jun. 2021, pp. 7–12.
[68] D. Guera and E. J. Delp, ‘‘Deepfake video detection using recurrent neural networks,’’ in Proc. 15th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Nov. 2018, pp. 1–6.
[69] H. H. Nguyen, J. Yamagishi, and I. Echizen, ‘‘Capsule-forensics: Using capsule networks to detect forged images and videos,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 2307–2311.
[70] I. Chingovska, A. Anjos, and S. Marcel, ‘‘On the effectiveness of local binary patterns in face anti-spoofing,’’ in Proc. BIOSIG Int. Conf. Biometrics Special Interest Group (BIOSIG), Sep. 2012, pp. 1–7.
[71] X. Xuan, B. Peng, W. Wang, and J. Dong, ‘‘On the generalization of GAN image forensics,’’ in Proc. Chin. Conf. Biometric Recognit., Cham, Switzerland: Springer, 2019, pp. 134–141.
[72] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, ‘‘Recurrent convolutional strategies for face manipulation detection in videos,’’ Interface (GUI), vol. 3, no. 1, pp. 80–87, 2019.
[73] H. Jeon, Y. Bang, and S. S. Woo, ‘‘FDFtNet: Facing off fake images using fake detection fine-tuning network,’’ in Proc. IFIP Int. Conf. ICT Syst. Secur. Privacy Protection, Cham, Switzerland: Springer, 2020, pp. 416–430.
[74] H. Jeon, Y. Bang, J. Kim, and S. S. Woo, ‘‘T-GD: Transferable GAN-generated images detection framework,’’ 2020, arXiv:2008.04115.
[75] C.-C. Hsu, Y.-X. Zhuang, and C.-Y. Lee, ‘‘Deep fake image detection based on pairwise learning,’’ Appl. Sci., vol. 10, no. 1, p. 370, Jan. 2020.
[76] A. Gandhi and S. Jain, ‘‘Adversarial perturbations fool deepfake detectors,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–8.
[77] X. Wu, Z. Xie, Y. Gao, and Y. Xiao, ‘‘SSTNet: Detecting manipulated faces through spatial, steganalysis and temporal features,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 2952–2956.
[78] Z. Liu, X. Qi, and P. H. S. Torr, ‘‘Global texture enhancement for fake face detection in the wild,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8060–8069.
[79] H. Khalid and S. S. Woo, ‘‘OC-FakeDect: Classifying deepfakes using one-class variational autoencoder,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 656–657.
[80] S. Fung, X. Lu, C. Zhang, and C.-T. Li, ‘‘DeepfakeUCL: Deepfake detection via unsupervised contrastive learning,’’ 2021, arXiv:2104.11507.
[81] S. Tariq, S. Lee, and S. Woo, ‘‘One detector to rule them all: Towards a general deepfake attack detection framework,’’ in Proc. Web Conf., Apr. 2021, pp. 3625–3637, doi: 10.1145/3442381.3449809.
[82] Y. Zhang, L. Zheng, and V. L. L. Thing, ‘‘Automated face swapping and its detection,’’ in Proc. IEEE 2nd Int. Conf. Signal Image Process. (ICSIP), Aug. 2017, pp. 15–19.
[83] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, ‘‘Labeled faces in the wild: A database for studying face recognition in unconstrained environments,’’ in Proc. Workshop Faces Real-Life Images, Detection, Alignment, Recognit., 2008, pp. 1–11.
[84] Y. Li, M.-C. Chang, and S. Lyu, ‘‘In ictu oculi: Exposing AI created fake videos by detecting eye blinking,’’ in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Dec. 2018, pp. 1–7.
[85] CEW Dataset. [Online]. Available: http://parnec.nuaa.edu.cn/_upload/tpl/02/db/731/template731/pages/xtan/ClosedEyeDatabases.html
[86] EBV Dataset. [Online]. Available: http://www.cs.albany.edu/lsw/downloads.html
[87] Y. Li and S. Lyu, ‘‘Exposing DeepFake videos by detecting face warping artifacts,’’ 2018, arXiv:1811.00656.
[88] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li, ‘‘Protecting world leaders against deep fakes,’’ in Proc. CVPR Workshops, vol. 1, Jun. 2019, pp. 1–8.
[89] S. McCloskey and M. Albright, ‘‘Detecting GAN-generated imagery using saturation cues,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 4584–4588.
[90] F. Marra, D. Gragnaniello, L. Verdoliva, and G. Poggi, ‘‘Do GANs leave artificial fingerprints?’’ in Proc. IEEE Conf. Multimedia Inf. Process. Retr. (MIPR), Mar. 2019, pp. 506–511.
[91] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, ‘‘RAISE: A raw images dataset for digital image forensics,’’ in Proc. 6th ACM Multimedia Syst. Conf., Mar. 2015, pp. 219–224.
[92] N. Yu, L. Davis, and M. Fritz, ‘‘Attributing fake images to GANs: Learning and analyzing GAN fingerprints,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7556–7566.
[93] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, ‘‘LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,’’ 2015, arXiv:1506.03365.
[94] F. Matern, C. Riess, and M. Stamminger, ‘‘Exploiting visual artifacts to expose deepfakes and face manipulations,’’ in Proc. IEEE Winter Appl. Comput. Vis. Workshops (WACVW), Jan. 2019, pp. 83–92.
[95] D. P. Kingma and P. Dhariwal, ‘‘Glow: Generative flow with invertible 1x1 convolutions,’’ 2018, arXiv:1807.03039.
[96] S. Fernandes, S. Raj, E. Ortiz, I. Vintila, M. Salter, G. Urosevic, and S. Jha, ‘‘Predicting heart rate variations of deepfake videos using neural ODE,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1721–1729.
[97] S. Agarwal, H. Farid, T. El-Gaaly, and S.-N. Lim, ‘‘Detecting deep-fake videos from appearance and behavior,’’ in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Dec. 2020, pp. 1–6.
[98] L. Chai, D. Bau, S.-N. Lim, and P. Isola, ‘‘What makes fake images detectable? Understanding properties that generalize,’’ in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2020, pp. 103–120.
[99] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha, ‘‘Emotions don’t lie: An audio-visual deepfake detection method using affective cues,’’ 2020, arXiv:2003.06711.
[100] Four In-the-Wild Lip-Sync Deep Fakes, Instagram. [Online]. Available: https://www.instagram.com/bill_posters_uk
[101] Four In-the-Wild Lip-Sync Deep Fakes, YouTube. [Online]. Available: https://www.youtube.com/watch?v=VWMEDacz3L4
[102] K. Chugh, P. Gupta, A. Dhall, and R. Subramanian, ‘‘Not made for each other- audio-visual dissonance-based deepfake detection and localization,’’ in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 439–447.
[103] L. Guarnera, O. Giudice, and S. Battiato, ‘‘DeepFake detection by analyzing convolutional traces,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 666–667.
[104] S. Fernandes, S. Raj, R. Ewetz, J. S. Pannu, S. K. Jha, E. Ortiz, I. Vintila, and M. Salter, ‘‘Detecting deepfake videos using attribution-based confidence metric,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 308–309.
[105] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, ‘‘VGGFace2: A dataset for recognising faces across pose and age,’’ in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 67–74.
[106] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, and J. Zhao, ‘‘DeepRhythm: Exposing DeepFakes with attentional visual heartbeat rhythms,’’ in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 4318–4327.
[107] S. Hu, Y. Li, and S. Lyu, ‘‘Exposing GAN-generated faces using inconsistent corneal specular highlights,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2021, pp. 2500–2504.
[108] I. Demir and U. A. Ciftci, ‘‘Where do deep fakes look? Synthetic face detection via gaze tracking,’’ in Proc. ACM Symp. Eye Tracking Res. Appl., May 2021, pp. 1–11.
[109] Y. Nirkin, L. Wolf, Y. Keller, and T. Hassner, ‘‘DeepFake detection based on discrepancies between faces and their context,’’ IEEE Trans. Pattern Anal. Mach. Intell., early access, Jun. 29, 2021, doi: 10.1109/TPAMI.2021.3093446.
[110] M. Wang and W. Deng, ‘‘Deep face recognition: A survey,’’ Neurocomputing, vol. 429, pp. 215–244, Mar. 2021.
[111] S. Umer, R. K. Rout, C. Pero, and M. Nappi, ‘‘Facial expression recognition with trade-offs between data augmentation and deep learning features,’’ J. Ambient Intell. Humanized Comput., vol. 13, pp. 721–735, Jan. 2021.
[112] S. Hossain, S. Umer, V. Asari, and R. K. Rout, ‘‘A unified framework of deep learning-based facial expression recognition system for diversified applications,’’ Appl. Sci., vol. 11, no. 19, p. 9174, Oct. 2021.
[113] J. Galbally, S. Marcel, and J. Fierrez, ‘‘Biometric antispoofing methods: A survey in face recognition,’’ IEEE Access, vol. 2, pp. 1530–1552, 2014.
[114] S. Umer, B. C. Dhara, and B. Chanda, ‘‘Face recognition using fusion of feature learning techniques,’’ Measurement, vol. 146, pp. 43–54, Nov. 2019.
[115] H. Farid, ‘‘Digital image forensics,’’ Sci. Amer., vol. 298, no. 6, pp. 66–71, 2008.
[116] C. Rathgeb, A. Botaljov, F. Stockhardt, S. Isadskiy, L. Debiasi, A. Uhl, and C. Busch, ‘‘PRNU-based detection of facial retouching,’’ IET Biometrics, vol. 9, no. 4, pp. 154–164, Jul. 2020.
[117] U. Scherhag, L. Debiasi, C. Rathgeb, C. Busch, and A. Uhl, ‘‘Detection of face morphing attacks based on PRNU analysis,’’ IEEE Trans. Biometrics, Behav., Identity Sci., vol. 1, no. 4, pp. 302–317, Oct. 2019.
[118] S. Sabour, N. Frosst, and G. E. Hinton, ‘‘Dynamic routing between capsules,’’ 2017, arXiv:1710.09829.
[119] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[120] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, ‘‘A simple framework for contrastive learning of visual representations,’’ in Proc. Int. Conf. Mach. Learn., 2020, pp. 1597–1607.
[121] F. Chollet, ‘‘Xception: Deep learning with depthwise separable convolutions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1251–1258.
[122] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, ‘‘CNN-generated images are surprisingly easy to spot. . . for now,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8695–8704.
[123] H. Bay, T. Tuytelaars, and L. Van Gool, ‘‘SURF: Speeded up robust features,’’ in Proc. Eur. Conf. Comput. Vis., Berlin, Germany: Springer, 2006, pp. 404–417.
[124] G. Qiu, ‘‘Indexing chromatic and achromatic patterns for content-based colour image retrieval,’’ Pattern Recognit., vol. 35, no. 8, pp. 1675–1686, Aug. 2002.
[125] T. K. Moon, ‘‘The expectation-maximization algorithm,’’ IEEE Signal Process. Mag., vol. 13, no. 6, pp. 47–60, Nov. 1997.
[126] S. Jha, S. Raj, S. Fernandes, S. K. Jha, S. Jha, B. Jalaian, G. Verma, and A. Swami, ‘‘Attribution-based confidence metric for deep neural networks,’’ Tech. Rep., 2019.
[127] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, ‘‘Neural ordinary differential equations,’’ in Advances in Neural Information Processing Systems, vol. 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf
[128] T. Gloe, M. Kirchner, A. Winkler, and R. Böhme, ‘‘Can we trust digital image forensics?’’ in Proc. 15th Int. Conf. Multimedia (MULTIMEDIA), 2007, pp. 78–86.
[129] I. J. Goodfellow, J. Shlens, and C. Szegedy, ‘‘Explaining and harnessing adversarial examples,’’ 2014, arXiv:1412.6572.
[130] N. Carlini and D. Wagner, ‘‘Towards evaluating the robustness of neural networks,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2017, pp. 39–57.
ASAD MALIK (Member, IEEE) received the B.Sc. degree (Hons.) in computer application from Aligarh Muslim University, Aligarh, India, in 2012, the master's degree in computer application from Jamia Millia Islamia University, India, in 2015, and the Ph.D. degree from the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, in 2020. He is currently an Assistant Professor with the Department of Computer Science, Aligarh Muslim University. His research interests include multimedia forensics and security, image processing, information hiding, and deep learning.

MINORU KURIBAYASHI (Senior Member, IEEE) received the B.E., M.E., and D.E. degrees from Kobe University, Japan, in 1999, 2001, and 2004, respectively. From 2002 to 2007, he was a Research Associate at Kobe University, where he was an Assistant Professor from 2007 to 2015. Since 2015, he has been an Associate Professor with the Graduate School of Natural Science and Technology, Okayama University. His research interests include multimedia security, digital watermarking, cryptography, and coding theory. He is a member of the Information Forensics and Security Technical Committee of the IEEE Signal Processing Society. He received the Young Professionals Award from the IEEE Kansai Section in 2014 and the Best Paper Award at IWDW 2015 and 2019. He is the Vice Chair of the APSIPA Multimedia Security and Forensics Technical Committee. He serves as an Associate Editor for IEEE Signal Processing Letters, JISA, and IEICE.

SANI M. ABDULLAHI (Member, IEEE) received the M.Sc. degree from The University of Manchester, U.K., in 2013, and the Ph.D. degree from Southwest Jiaotong University, China, in 2019. He is currently a Postdoctoral Researcher at China Three Gorges University, Yichang, China. He has published a number of papers in reputable journals and conference proceedings, including the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, IEEE AVSS, IWDW, and IWDCF. His research interests include information security, biometric template protection, digital forensics, multimedia security, and digital watermarking. He received the Best Paper Award at the International Workshop on Digital Crime and Forensics (IWDCF-2017).

AHMAD NEYAZ KHAN (Member, IEEE) received the B.Sc. (Hons.) and master's degrees in computer applications from Aligarh Muslim University, India, in 2009 and 2012, respectively, and the Ph.D. degree from the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. He is currently an Assistant Professor with Integral University, India. His research interests include information security, machine learning, and reversible data hiding in the encrypted domain.