Deep Fake Detection Through Convolutional Neural Network
Anjum Bano¹, Shreya Chaudhary¹, Simran¹, Jaspreet Singh
Abstract
Videos and pictures whose veracity has not been verified in any way are categorised as fake media. The problem of fake accounts and fake media is serious and grows more complicated every day, and the detection of both has drawn considerable attention from researchers in recent years.
The primary objective of this article is to detect fraudulent media by applying a range of deep learning techniques and neural networks. Convolutional neural networks (CNNs) have been used for deepfake detection and have yielded the best outcomes. This study employs a sequential convolutional neural network model along with additional techniques such as max pooling and the Adam optimiser, using three distinct datasets (Celeb-DF v1, Celeb-DF v2, and FaceForensics++). The model's final accuracy and loss rate were 93.3% and 19.5%, respectively.
1. Overview
Significant advancements have been made in the field of automatic video editing techniques in recent
years. Particularly, techniques for manipulating the face have drawn a lot of attention.
Today, for instance, it is possible to execute facial reenactment, which involves transferring facial
expressions from one film to another. With minimal effort, this makes it possible to alter a speaker's
identity. These days, facial modification tools and systems are so sophisticated that even those with no
prior knowledge of digital arts or picture retouching may utilise them. In fact, libraries and code that
operate practically automatically are increasingly being made freely available to the public [1]. The
sophistication of mobile camera technology has increased, while social media and media sharing
websites have become more widely available, making it simpler than ever to create and share digital
movies [2]. Up until recently, the number of fake videos and their levels of realism were constrained by
the lack of sophisticated editing tools, the high demand for subject-matter expertise, and the challenging
and time-consuming process involved. However, the time required to create and alter videos has dropped drastically in recent years due to the availability of massive volumes of training data and high-throughput processing. These technologies can be dangerous instruments for damaging digital identity and reputation, and they pose serious risks to the trustworthiness of visual information. These worries are supported by the numerous abuse cases involving prominent members of the political and economic communities that have been publicised in recent months. It is reasonable to assume that this issue will only worsen in the years to come.
In response, much research has been conducted in recent years on detecting synthetic media created with new, effective generation approaches. Along with the creation of benchmark datasets
(like FaceForensics++) and global open challenges (like the Facebook Deepfake Detection Challenge), an
ever-growing variety of tools and methodologies have been suggested in recent years [3]. Thanks to (i)
the availability of vast amounts of public data and (ii) the development of deep learning techniques that
do away with numerous manual editing steps, such as Autoencoders (AE) and Generative Adversarial
Networks (GAN), it is now easier than ever to automatically synthesise nonexistent faces or manipulate a
real face (also known as a bonafide presentation) of one subject in an image or video. Because of this,
open software and smartphone apps like ZAO and FaceApp have been made available, allowing anyone
to produce phoney photos and videos without any prior knowledge of the subject [4].
The remainder of the paper is organised as follows. Section 2 provides background on deep learning techniques. Section 3 reviews the related literature. Section 4 describes the proposed model's methodology. Section 5 presents the experimental results, and Section 6 concludes the paper.
2. Background
Deep learning is a machine learning technique founded on the neural network concept. The word "deep" refers to the use of multiple hidden layers in the network. The deep learning architecture, which draws inspiration from artificial neural networks, extracts more information from raw input data by using an unbounded number of hidden layers of bounded size. The complexity of the training data determines how many hidden layers are used.
More complex data requires more hidden layers to obtain correct results effectively. In recent years, deep learning has been applied successfully in several domains, such as computer vision, audio processing, machine translation, and natural language processing. Compared with traditional machine learning techniques, deep learning produces state-of-the-art outcomes in a number of domains. Additionally,
deep learning has demonstrated promise in identifying deepfakes. Numerous deep learning methods, such as 1) convolutional neural networks (CNN) and 2) recurrent neural networks (RNN), have been proposed in the literature. In the sections that follow, these techniques are briefly discussed before their application to deepfake detection is explained.
2.1. Convolutional Neural Network (CNN)
Convolutional neural networks (CNNs) are the most widely used deep neural network models. Like other neural networks, CNNs have input and output layers together with one or more hidden layers. After reading the inputs from the first layer, the hidden layers apply a convolutional mathematical operation to the input values.
Convolution in this sense refers to a matrix multiplication or other dot product. Following the matrix multiplication, a CNN applies a nonlinear activation function such as the Rectified Linear Unit (ReLU). To reduce the dimensionality of the data, pooling layers compute outputs using functions such as maximum pooling or average pooling [5].
2.2. Recurrent Neural Network (RNN)
Recurrent neural networks (RNNs) are another class of artificial neural networks that can learn characteristics from sequence data. Like other neural networks, RNNs are made up of several hidden layers, each with weights and biases. Nodes in an RNN are linked sequentially, forming a directed cycle graph. One advantage of RNNs is that they can capture temporal dynamics [6].
Unlike feedforward networks (FFNs), RNNs use an internal memory to store sequences of information from prior inputs, which makes them valuable in a range of applications such as speech recognition and natural language processing.
To manage a temporal sequence, an RNN maintains a recurrent hidden state that captures interdependence across time scales.
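A single recurrent update can be sketched in NumPy. The hidden size, input size, and tanh nonlinearity below are illustrative choices; the key point is that the same weights process every step while the hidden state carries memory forward:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One recurrent update: the new state mixes the current input
    # with the prior hidden state through shared weights.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
b = np.zeros(4)

h = np.zeros(4)                        # initial hidden state
sequence = [rng.normal(size=3) for _ in range(5)]
for x_t in sequence:                   # same weights at every time step
    h = rnn_step(x_t, h, W_x, W_h, b)
```

After the loop, `h` summarises the whole sequence, which is what makes the recurrent hidden state useful for temporal data.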
2.3. Generative Adversarial Network (GAN)
A deep learning model combining a generative and a discriminative model was proposed in 2014. The generative model generates data at random, while the discriminative model determines whether the generated data comes from the training datasets. The rivalry between the generative and discriminative models helps a GAN achieve superior results. GANs are frequently applied to tasks such as image classification and the generation of text and images.
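The adversarial objective behind this rivalry can be illustrated with a toy sketch, in which scalar affine maps stand in for the generator and discriminator; all parameter values here are illustrative, not part of any real GAN:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(z, w, b):
    # Maps random noise z to a "sample" (here just a scalar affine map)
    return w * z + b

def discriminator(x, v, c):
    # Outputs the probability that sample x is real
    return sigmoid(v * x + c)

rng = np.random.default_rng(1)
z = rng.normal()                       # random noise input
fake = generator(z, w=0.5, b=0.0)
real = 1.0                             # stand-in "real" sample

# The discriminator minimises this loss (maximises the log-likelihoods);
# the generator tries to push the second term the other way.
d_real = discriminator(real, v=1.0, c=0.0)
d_fake = discriminator(fake, v=1.0, c=0.0)
loss_d = -(np.log(d_real) + np.log(1.0 - d_fake))
```

Training alternates gradient steps on the two models so that the generator's samples gradually become harder for the discriminator to reject.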
3. Review of Literature
This section surveys publications on fake media detection, covering the principal identification techniques and the models they employ, together with related research and comparable applications.
Ekraam Sabir et al. [7] improved the state of the art with their model and face alignment technique. The community gained access to video-based face manipulation data with the release of FaceForensics and its enlarged and improved version, FaceForensics++. FaceForensics made Face2Face manipulation public; FaceForensics++ (FF++) expands the collection by including Deepfake and FaceSwap manipulations. The collection's 1,000 videos are split with 720 set aside for training and 140 for validation.
Their overall manipulation detection technique is as follows: first, faces from video frames are cropped and aligned; then manipulation detection is carried out on the preprocessed facial region using a CNN + RNN. The model ultimately achieved an accuracy of 93%.
Shruti Agarwal et al. [8] describe an audio-to-video (A2V) synthesis technology that receives a video of a person speaking together with an audio recording, and outputs a new video with the speaker's mouth synced to the new audio. They investigated whether a hand-crafted profile feature could be outperformed by a more contemporary learning-based strategy. In particular, a convolutional neural network (CNN) was trained to determine whether the mouth in a single video frame is open or closed. A colour image cropped around the mouth and rescaled to 128 x 128 pixels serves as the network's input (Figure 1). The network's output, c, is a real-valued number in [0, 1] representing an "open" (0) or "closed" (1) mouth. The model ultimately achieved an accuracy of 96.4%.
Davide Coccomini et al. [9] tested the 5,000 test videos made available for the DFDC dataset as well as FaceForensics++. Wodajo and Atnafu's Convolutional Vision Transformer [2021] was evaluated on these videos to obtain the AUC and F1-score values needed for comparison on the DFDC test set. They trained the networks using 220,444 faces extracted from the FaceForensics++ and DFDC training sets, then used 8,070 faces from the DFDC dataset for validation.
Their Convolutional Cross ViT employs two separate branches: the L-branch, which operates on bigger patches and provides a more global perspective of the input image, and the S-branch, which operates on small patches and provides a local view. Cross-attention combines the Transformer Encoders' outputs from the two branches, enabling direct communication between them. The model obtained an accuracy of 87%.
4. Proposed Model
The model's initial preprocessing step concentrates on the face: the video is first split into frames, followed by face detection, face cropping, and face alignment.
The processed Celeb-DF (v1 and v2) dataset is then used to start the training phase on the samples it contains.
The next step is the manipulation detection phase, which uses a CNN to extract features; a trained model is then loaded and the input's authenticity is determined, in the manner depicted in Figure 1.
4.1 Datasets
The DeepFake Forensics (Celeb-DF) dataset includes artificially produced videos of greater visual quality, intended to allow the creation of more potent detection systems and to test current DeepFake detection algorithms more thoroughly. In these datasets, DeepFake videos are clearly distinguished from real footage by a range of visual artefacts. The Celeb-DF dataset contains 5,639 DeepFake videos and 590 real videos, totalling over two million video frames. All videos have an average duration of roughly 13 seconds and a frame rate of 30 frames per second. The authentic videos are derived from publicly accessible YouTube recordings of interviews with 59 celebrities of different ages, genders, and races. Of the people in the real videos, 43.2 percent are women and 56.8 percent are men; 30.5 percent are between the ages of 50 and 60, 26.6 percent are in their 40s, 28.0 percent are in their 30s, 6.4 percent are under 30, and 8.5 percent are over 60. Asians make up 5.1 percent, African Americans 6.8 percent, and Caucasians 88.1 percent.
The second dataset, FaceForensics++, was created by altering original video sequences using four automated face manipulation techniques: Deep Fakes, Face2Face, FaceSwap, and NeuralTextures. The data can be used for segmentation as well as image and video classification, since it provides binary masks. Furthermore, it offers 1,000 Deep Fakes models to produce and augment new data.
4.2 Preprocessing
A computer vision library was used to crop each frame, detect the face, and align it with the centre of the frame. Each frame is then normalised to determine which information from each pixel should be kept and which should be ignored.
4.3. Feature Extraction
To prevent needless computation, the dataset is first divided into two halves (real and fake) and then rescaled so that all frames share the same size. The training set is capped at 8,000 images; using more than 8,000 training images would yield better results, but this requires far more computing power.
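The split-and-cap step can be sketched as follows. The feature arrays, class counts, and shuffling details are assumptions for illustration; only the real/fake split and the 8,000-image limit come from the description above:

```python
import numpy as np

def build_training_set(real_frames, fake_frames, limit=8000):
    # Label the two halves (0 = real, 1 = fake), then cap the total size
    X = np.concatenate([real_frames, fake_frames])
    y = np.concatenate([np.zeros(len(real_frames)),
                        np.ones(len(fake_frames))])
    idx = np.random.default_rng(0).permutation(len(X))[:limit]
    return X[idx], y[idx]

real = np.zeros((5000, 8))             # stand-in feature vectors
fake = np.ones((6000, 8))
X, y = build_training_set(real, fake)  # capped at 8,000 samples
```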
4.4. CNN Model
The sequential model contains two convolutional layers. Both layers use 64 filters, a (3,3) kernel, and the rectified linear (ReLU) activation function. The first layer extracts features that capture the image's form. Batch normalisation is then applied to shorten the training epochs and stabilise the learning process. Max pooling reduces the frames with a (2,2) window, and global average pooling computes the average output of each feature map. A dense layer of 265 neurons with ReLU adjusts the size of the vectors, followed by a softmax output layer for the neural network.
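As a sketch of how the spatial dimensions shrink through these layers, assuming a conv–pool–conv–pool ordering and a 128×128 input (neither of which is stated explicitly in the text):

```python
def conv_out(size, kernel=3, stride=1, pad=0):
    # Spatial size after a "valid" convolution
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, window=2):
    # Spatial size after non-overlapping pooling
    return size // window

s = 128          # assumed input resolution
s = conv_out(s)  # Conv2D, 64 filters, (3,3) kernel -> 126
s = pool_out(s)  # MaxPooling (2,2)                 -> 63
s = conv_out(s)  # second Conv2D, (3,3) kernel      -> 61
s = pool_out(s)  # MaxPooling (2,2)                 -> 30
# Global average pooling then collapses each of the 64 feature maps
# to a single value, giving a 64-dimensional vector that feeds the
# dense layer and the softmax output.
```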
5. Experiment Result
The area under the receiver operating characteristic curve (AUC) scores are also reported. All numbers are reported on Celeb-DF (v1, v2).
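For reference, an AUC score can be computed from predicted "fake" probabilities as the fraction of positive/negative pairs the model ranks correctly. The labels and scores below are made up purely for illustration:

```python
def auc_score(labels, scores):
    # Rank-based AUC: probability a random positive outranks
    # a random negative (ties count half).
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]          # model's "fake" probabilities
auc = auc_score(labels, scores)        # 3 of the 4 pairs ranked correctly
```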
Training uses the Adam optimiser with a 1e-4 learning rate. Furthermore, all findings are based on the dataset's highly compressed version; the high- and low-quality videos are not assessed because baseline performance on them is already very high.
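The Adam update itself can be sketched as follows; the weights and gradient values are illustrative, and only a single step on two parameters is shown:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: bias-corrected first/second moment estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])              # two illustrative weights
m = np.zeros_like(w)
v = np.zeros_like(w)
g = np.array([0.5, -0.5])              # a gradient from one batch
w, m, v = adam_step(w, g, m, v, t=1)   # each weight moves ~lr against g
```

The per-parameter moment estimates are what let Adam train stably at a fixed 1e-4 learning rate.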
Table 2 displays the outcomes of our model for several levels of recurrence and choices of face alignment method. In the simpler configuration, a straightforward CNN with two convolution and max-pooling layers, each followed by a feedforward network, serves as the sampler's localisation net with bilinear interpolation; the DenseNet used in our experiments contains two feature map-generating blocks.
6. Conclusion
This study showed that deep neural network techniques are effective at detecting deepfakes. Based on deep learning principles, we proposed a method that can automatically identify deepfakes.
Sharing information among the feature extraction, preprocessing, and model construction stages enhanced the overall performance of the model. The model's accuracy and dependability were good. In the near future, this study can be expanded by investigating further architectures that enable new deepfake detection methods.
7. References
[1] Bonettini, N., & Cannas, E. D. Video Face Manipulation Detection Through Ensemble of CNNs. (2020).
[2] Kohli, A., & Gupta, A. Detecting Deepfake, FaceSwap and Face2Face Facial Forgeries Using Frequency CNN. (2020).
[3] Marcon, F., Pasquini, C., Boato, G., et al. Detection of Manipulated Face Videos over Social Networks: A Large-Scale Study. (2021).
[4] Tolosana, R., Rodriguez, R. V., et al. An Introduction to Digital Face Manipulation. (2022).
[5] Pishori, A., Rollins, B., et al. Detecting Deepfake Videos: An Analysis of Three Techniques. (2020).
[6] Heo, Y.-J., Kim, B.-G., et al. Deepfake Detection Scheme Based on Vision Transformer and Distillation. (2021).
[7] Sabir, E., Cheng, J., et al. Recurrent Convolutional Strategies for Face Manipulation Detection in Videos. (2021).
[8] Agarwal, S., Fried, O., et al. Detecting Deep-Fake Videos from Phoneme-Viseme Mismatches. (2021).
[9] Coccomini, D., Messina, N., et al. Combining EfficientNet and Vision Transformers for Video Deepfake Detection. (2021).