Submitted by
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
July-Dec 2024
Acknowledgement
We are thankful to Dr. Supriya Panda, Project Coordinator, Professor, CSE department for
her guidance and support.
We express our deep gratitude to Dr. Tapas Kumar, Head of Department (CSE-spl) for his
endless support and affection towards us. His constant encouragement has helped to widen the
horizon of our knowledge and inculcate the spirit of dedication to the purpose.
We would like to express our sincere gratitude to Dr. Geeta Nijhawan, Associate Dean SET,
MRIIRS for providing us the facilities in the Institute for completion of our work.
Words cannot express our gratitude for all those people who helped us directly or indirectly in
our endeavour. We take this opportunity to express our sincere thanks to all staff members of the
CSE-spl department for their valuable suggestions, and also to our family and friends for their
support.
Declaration
We hereby declare that this project report entitled “Deepfake detection using AI” by Ayush
Choudhary (1/21/FET/BCS/284), Vani Jain (1/21/FET/BCS/262), and Nishant
(1/21/FET/BCS/302), being submitted in partial fulfilment of the requirements for the degree of
Bachelor of Technology in Computer Science & Engineering under the School of Engineering &
Technology of Manav Rachna International Institute of Research and Studies, Faridabad, during
the academic year 2024-2025, is a bonafide record of our original work carried out under the
guidance of Krishan Kumar, Associate Professor, CSE department.
We further declare that we have not submitted the matter presented in this Project for the award
of any other Degree/Diploma of this University or any other University/Institute.
Manav Rachna International Institute of Research and Studies,
Faridabad
School of Engineering & Technology
July-Dec 2024
Certificate
This is to certify that this project report entitled “Deepfake detection using AI” by Ayush
Choudhary (1/21/FET/BCS/284), Vani Jain (1/21/FET/BCS/262), and Nishant
(1/21/FET/BCS/302), submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology in Computer Science and Engineering under the School of Engineering &
Technology of Manav Rachna International Institute of Research and Studies, Faridabad, during
the academic year 2024-2025, is a bonafide record of work carried out under my guidance and
supervision.
TABLE OF CONTENTS
Acknowledgement i
Declaration ii
Certificate iii
List of Tables vi
Abstract viii
Chapters
1. Introduction 1-14
1.1. Introduction 1
1.2. Problem Statement 2-6
1.3. Objectives 7-9
1.4. Methodology 10-13
1.5. Organization 14
5. Conclusions 55-57
5.1. Conclusions 55
5.2. Future Work 56
5.3. Applications 57
6. Appendix: Code
7. References
List of Tables
List of Figures
Abstract
Audio, video, and image manipulation techniques are advancing along with artificial
intelligence and cloud computing. Media content manipulated in this way is referred to as a
deepfake. Increasingly believable ways in which computers can alter media, such as duplicating a
public figure's voice or placing one person's face onto another's body, make this possible. Three
types of countermeasures are used to fight back against deepfakes: media authentication, media
provenance, and deepfake detection. Deepfake detection solutions rely on multi-modal detection
to identify any alteration or synthesis of the target media. Two kinds of techniques are currently
in use: manual and algorithmic. Techniques that rely on human media analysts with access to
software are known as traditional manual techniques, while algorithmic detection uses AI-based
algorithms to find manipulated media.
We are developing a deep learning binary classifier intended to serve as one tool for addressing
several of the real-world issues created by deepfakes, including distortion of democratic
discourse, election manipulation, weakening of institutional credibility, erosion of journalism,
and social division. The first segment examines the technical architecture of deepfake
generation, primarily focusing on Generative Adversarial Networks (GANs) and autoencoders.
These tools enable the creation of deceptive videos and images that can challenge human
perception and traditional detection methods. Understanding these architectures is vital to
designing countermeasures. Additionally, we discuss the evolution of deepfake creation
techniques, highlighting how improvements in generative AI have exacerbated detection
challenges.
Next, we delve into AI-based deepfake detection methodologies, categorizing them into
image-based, video-based, and audio-based detection. Techniques such as convolutional neural
networks (CNNs) for image analysis, recurrent neural networks (RNNs) for temporal
consistency checks in videos, and spectrogram analysis for audio forgery detection are
explored. Emphasis is placed on identifying anomalies, such as unnatural blinking patterns,
inconsistencies in lighting, or spectral artifacts, that signal manipulation. Moreover, hybrid
models that combine multiple modalities for enhanced detection accuracy are discussed.
The discussion also addresses dataset curation for deepfake detection. Quality datasets, such as
FaceForensics++, DFDC (Deepfake Detection Challenge), and Celeb-DF, play a pivotal role
in training robust AI models. Challenges associated with dataset bias, generalization across
different types of deepfakes, and ethical considerations in dataset creation are examined.
Strategies for generating synthetic datasets to improve model robustness are also considered.
Deepfake technology, leveraging advancements in artificial intelligence and deep learning,
has emerged as a significant tool for creating hyper-realistic but fabricated audio, video, and
image content. While offering potential for creative applications, deepfakes pose severe
ethical, security, and societal threats, particularly in misinformation campaigns, identity theft,
and the erosion of public trust. This paper explores the methodologies for detecting
deepfakes, focusing on machine learning techniques, forensic analysis, and hybrid
approaches.
Chapter-1: INTRODUCTION
1.1 Introduction:
Image, video, and audio editing technologies are developing at a rapid pace, and there has
been a surge of innovation in methods for altering images, sound recordings, and videos. A
wide range of techniques for creating and manipulating digital content is now available; today
it is possible to generate hyper-realistic digital images with few resources and a simple
how-to guide found on the web. The term deepfake refers to a technique in which the face of
one person in a video is used to replace somebody else's face: a composite portrait is created
and one individual's face is inserted onto another person's body. The word therefore refers
both to the technique itself and to the resulting doctored video. Beyond creating impressive
Computer-Generated Imagery (CGI), Virtual Reality (VR), and Augmented Reality (AR),
deepfakes can be used for educational purposes, animation, art, and cinema.
As smartphones have become more sophisticated and good internet access has become the
rule, social media and media-sharing sites have made the creation and distribution of digital
videos easier than ever. With steadily increasing low-cost computing power, deep learning is
more powerful than ever. Such advances have inevitably brought new challenges. DeepFake
(DF) is the manipulation of video and audio using deep generative adversarial models. There
are many instances where DFs spread over social media platforms, causing spam and
misinformation. Such DFs can frighten or mislead the people they depict and the people who
view them.
To create deepfake images that swap the faces of two people A and B, an autoencoder EA is
trained to reconstruct the face of A from a dataset of images of A's face, and a second
autoencoder EB is trained on a dataset of images of B's face. The encoding weights of EA and
EB are shared, while the decoding part of each autoencoder remains individual. Once this
shared encoder is trained, any picture containing the face of A can be passed through the
common encoder and decoded with the decoder of EB instead of its own decoder, producing
B's face with A's pose and expression. This principle is summarized in Figures 1 and 2. The
ability to detect and counter deepfakes has become a critical area of research to mitigate these
risks. Deepfake detection focuses on identifying subtle artifacts, inconsistencies, or anomalies
in generated media that distinguish it from authentic content. Despite advancements in
detection, evolving deepfake generation techniques continue to challenge existing models,
highlighting the need for robust, adaptive, and scalable detection systems. This paper explores
the methodologies, challenges, and emerging trends in deepfake detection, emphasizing the
importance of proactive research and collaborative efforts to safeguard against misuse.
This methodology includes an encoder that captures general information about the brightness,
position, and appearance of a face, while a specific decoder reassembles the details of a
particular face and recalls its stable characteristics and features. The identity-specific
information must be separated from this general morphological information, otherwise the
approach does not work; when it does, the results are excellent, which makes the process
worthwhile. The final step is to take the target video, extract the target face from each frame,
align it so that illumination and expression match, use the modified autoencoder to produce
the swapped face, and then blend it back onto the target frame.
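As an illustration of this shared-encoder, per-identity-decoder idea, the following is a minimal Keras sketch. It is not the exact network used by any particular deepfake tool: the layer sizes and the 64 x 64 input resolution are assumptions chosen for brevity. In practice the two autoencoders are trained in alternation so the shared encoder learns identity-agnostic structure while each decoder specializes in one identity.

from tensorflow.keras import layers, Model, Input

IMG_SHAPE = (64, 64, 3)  # assumed low-resolution face crops

def build_encoder():
    inp = Input(shape=IMG_SHAPE)
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    latent = layers.Dense(256, activation="relu")(x)  # shared latent face representation
    return Model(inp, latent, name="shared_encoder")

def build_decoder(name):
    latent = Input(shape=(256,))
    x = layers.Dense(16 * 16 * 128, activation="relu")(latent)
    x = layers.Reshape((16, 16, 128))(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return Model(latent, out, name="decoder_" + name)

encoder = build_encoder()                       # weights shared between both identities
decoder_A, decoder_B = build_decoder("A"), build_decoder("B")

# Autoencoder EA reconstructs faces of A, EB reconstructs faces of B.
face_in = Input(shape=IMG_SHAPE)
ae_A = Model(face_in, decoder_A(encoder(face_in)))
ae_B = Model(face_in, decoder_B(encoder(face_in)))
ae_A.compile("adam", "mae")
ae_B.compile("adam", "mae")
# After training EA on images of A and EB on images of B, a face of A swapped
# onto B's appearance is obtained as decoder_B(encoder(face_of_A)).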
Figure 2. Illustration of a picture (left) being manufactured (right) using the Deepfake
procedure. Note that the manufactured face lacks the expressiveness of the original.
Since the emergence of the Deepfake phenomenon, various authors have proposed systems to
distinguish real recordings from fake ones. According to [1], although each proposed system
has its own strengths, current detection techniques lack generalizability. The authors also
note that current models tend to concentrate on the behaviour of the particular Deepfake
creation tools they were trained against. For example, Yuezun et al. [2] and TackHyun et al.
[3] detected Deepfakes through anomalies in eye blinking. However, blinking can now be
replicated, as shown by Konstantinos et al. [4] and Hai et al. [5]. The system proposed in [4]
uses natural facial expressions, such as blinking eyes, to produce videos of talking heads. The
authors in [5] propose a model that can generate facial expressions from a portrait: their
framework can animate a still picture to express feelings, including realistic eye-blinking
movements. Such progress in Deepfake generation makes manipulated recordings harder to
identify, a situation that demands better recognition of DFs.
We now propose another deep learning-based approach to effectively distinguish AI-generated
fake videos (DF videos) from real videos. Our aim is to develop a CNN model that classifies
and labels deepfake videos with very high accuracy. Based on the accuracy metric, we further
establish a methodology based on semi-supervised learning that outperforms the plain CNN.
Our ResNet50 + LSTM model has been evaluated on a subset of the Deepfake Detection
Challenge data: since the original dataset was too large to train on in full, a subset of the data
was taken while keeping the same data distribution and splits.
1.2 Problem Statement:
The task involves designing and developing a deep learning algorithm to classify the
video as deepfake or pristine. It predicts the probability that the video is fake or not by
using DF detection, mostly a binary classification, where the input video is mp4 and the
output is a label L E ["REAL, "FAKE, and we will analyze the input video.
We can treat the image classification problem to detect deepfakes as a binary problem of
image classification. The video is 150 frames by input, where each one is 1920x1080
pixels or 1080x2040 if done vertically. We can now define the output as an indicator of
the presence or absence of DF videos, which is a binary class V * [0,1].
The formula p(Y = i|V) is the probability with which the network will tag i.
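As a hedged sketch of this formulation, a video-level fake probability can be derived from per-frame scores as shown below. Aggregating by a simple mean is our assumption for illustration only; the report's actual model aggregates frames with an LSTM, as described in Section 1.4.

import numpy as np

def video_fake_probability(frame_scores):
    # frame_scores: per-frame sigmoid outputs in [0, 1], one per sampled frame.
    # Returns p(Y = FAKE | V) as a simple mean over frames.
    return float(np.mean(frame_scores))

def classify(frame_scores, threshold=0.5):
    p_fake = video_fake_probability(frame_scores)
    label = "FAKE" if p_fake >= threshold else "REAL"
    return label, p_fake

# Example: 150 sampled frames, mostly scored as manipulated.
scores = np.clip(np.random.normal(0.8, 0.1, size=150), 0, 1)
print(classify(scores))  # -> ('FAKE', ~0.8)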
1.3 Objectives:
Deepfake detection objectives are complex and reflect the increasingly critical nature of
identifying and managing manipulated media in the context of an increasingly digital and
connected world. The ability to effectively identify and mark synthetic media will
become an imperative in a number of different sectors, from social media and news
agencies to law enforcement agencies and cybersecurity professionals. We can break
down the top objectives of deepfake detection into several key goals: identification of
manipulated content, accuracy and reliability, improved generalization, robustness, real-time
detection capacity, increased trust, and responsible AI use. These goals set the foundation for
effective detection systems that will wage the battle against the malevolent effects of
deepfakes.
One of the most apparent goals of deepfake detection is the identification of manipulated
content. Deepfake algorithms are powered by very advanced machine learning models that
can create highly realistic videos and images that mimic the original. These videos, whether
produced by face-swapping, voice synthesis, or puppet-based manipulation, tend to contain
minor inconsistencies that humans cannot easily detect. A central objective of deepfake
detection is therefore to build systems that can automatically and accurately determine
whether a video or image has been tampered with, thereby telling whether content is
authentic or artificial. The critical scenarios lie in the arena of news reporting, political
campaigns, and public personalities, because dissemination of fakes there can have large and
harmful impacts: spreading misinformation, defamation, or political maneuvering. Identifying
manipulated media helps prevent the spread of false information and maintains the integrity
of digital content.
The other critical objective is the accuracy and reliability of detection. Deepfake detection
models should be very accurate on both real and fake media, with a minimal occurrence of
false positives (real videos identified as fake) and false negatives (fake videos misinterpreted
as real). This is assessed through several performance metrics, such as precision, recall, the
F1-score, and the Area Under the Receiver Operating Characteristic curve (AUC-ROC),
through which deepfake detection models can be measured and tuned for the best possible
efficiency. Admittedly, very high accuracy is hard to sustain against each new variation of
deepfake generation; the objective nonetheless remains for detection systems to flag content
reliably across a wide range of uses and thereby keep digital media trustworthy.
Another important objective is generalization, whereby the model can detect deepfakes across
different datasets, manipulation techniques, and types of video. Deepfake generation
techniques are highly diverse, and novel techniques are being developed rapidly, often
specifically to evade common detection methods. A detection model should be able to
generalize across many manipulation techniques, including face swapping, lip-syncing, and
puppet manipulation, so as not to rely heavily on any single dataset or manipulation type.
Generalization is critical because deepfake generation tools are not static; as generation
methods change, detection algorithms must adapt and keep pace. Detection systems that
generalize stay effective over time and can catch new forms of deepfakes that did not exist at
training time, which is essential for deployment in a dynamic environment where new forms
of manipulation keep appearing.
The next goal is to increase users' trust and confidence in the detection system. Deepfakes
erode public trust in digital content: the less a user can tell what is real from what is fake, the
more reason the user has to distrust digital content altogether. Deepfake detection systems are
crucial in restoring trust by providing transparency and guaranteeing accuracy in the media
people consume. This is especially important in fields like journalism, law enforcement, and
social media, where circulating deepfake content has critical effects. Clear explanations
accompanying detection results give users assurance that the content they interact with has
not been tampered with and retains its originality. Above all, transparent detection is
important for raising public awareness both of the danger posed by deepfakes and of the
detection technology that offers solutions to it.
Responsible use of AI is another major goal of deepfake detection systems. The rising power
of AI in producing hyper-realistic deepfakes calls for an ethical code and responsible use of
the detection technology itself. Detection systems must be designed with social ethics in
mind: they should not infringe on privacy or free-speech rights, or unduly suppress legitimate
content. Ethical considerations also require that detection systems be unbiased and applied in
a balanced manner across categories of content, so that altered media from various sources or
platforms can be detected without favouring any specific group or agenda. Developing
detection systems around ethical values helps stakeholders build responsible usage of AI
systems that detect and prevent malicious manipulation of digital media.
Finally, continuous research and development underpin the incentives that promote
cooperation and innovation in deepfake detection. As deepfake generation evolves, detection
methods must advance ahead of new manipulation techniques. This calls for collaboration
between researchers, developers, and practitioners across the relevant disciplines of computer
vision, machine learning, cybersecurity, and ethics. Open-source initiatives, shared datasets,
and collaborations between institutions all further the advancement of deepfake detection,
because continuously evolving systems allow more inventive solutions to an increasingly
rampant threat and spur innovation in the field. All of these components are integral parts of
efficient deepfake detection systems that counter the growing danger of synthetic media,
providing countermeasures so that society can continue to believe in the authenticity of
digital content despite the growing complexity and pervasiveness of digital manipulation.
1.4 Methodology:
A host of tools exists for the generation of Deepfakes (DF), but surprisingly few tools have
been developed for detection. We expect that our approach to DF detection will play a large
role in stopping the "percolation" of deepfakes across the worldwide web. We developed an
approach that detects all types of Deepfakes: replacement deepfakes, retrenchment deepfakes,
and interpersonal deepfakes.
Our proposed method (Figure 3) detects temporal inconsistencies in faces by using a
combination of a CNN and an RNN, because all the visual manipulations are found within
face regions, and faces typically occupy only a narrow region of each frame.

Accordingly, we focus strictly on extracting features from face regions in video frames where
a face has been detected. We first split the video into frames, then detect faces, and crop the
video frames around the detected faces. Finally, the newly cropped face frames are collated
into a new face-only video. We then apply the ResNet50 CNN model to extract features from
the video frames, followed by an LSTM layer for sequence processing. We then conduct
test-time augmentation and make predictions. The next subsections describe our methodology
in full detail, including a helpful test-time augmentation approach that we included in our
DFDC submission.
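A minimal sketch of this ResNet50 + LSTM pipeline in Keras is shown below. It is an illustrative outline rather than the exact configuration of our trained model: the sequence length of 150 frames and the 224 x 224 face crops follow the report, while the LSTM width, dropout rate, and dense-layer sizes are assumptions.

from tensorflow.keras import layers, Model, Input
from tensorflow.keras.applications import ResNet50

SEQ_LEN, IMG_SIZE = 150, 224  # 150 face crops per video, resized to 224x224

# Frame-level feature extractor: pretrained ResNet50 without its classifier head.
# Face crops are assumed to be already preprocessed (e.g. resnet50.preprocess_input).
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False  # start with transfer learning; fine-tune later if needed

frames_in = Input(shape=(SEQ_LEN, IMG_SIZE, IMG_SIZE, 3))
x = layers.TimeDistributed(backbone)(frames_in)   # -> (batch, 150, 2048)
x = layers.LSTM(256)(x)                           # sequence descriptor over frames
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)    # p(FAKE | video)

model = Model(frames_in, out, name="resnet50_lstm_detector")
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()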
Figure 3. System Architecture
1.5 Organization:
The first step in organizing deepfake detection is data collection and preprocessing.
The quality and diversity of the data on which a deepfake detection system is trained largely
determine its success. Deepfakes are mostly detected from video and image data, which is
relatively large and comprises real as well as synthetic media. FaceForensics++, Celeb-DF,
the Kaggle Deepfake Detection Dataset, and the DeepFake Detection Challenge dataset are
some of the most recognized datasets for deepfake research. These datasets contain a great
deal of tampered content and cover various kinds of deepfake generation methods.
Preprocessing involves cleaning the raw data before sending it to the detection models for
training. This includes resizing images and videos to the same dimensions, normalizing pixel
values, and converting videos into frames or sequences that the model can process. Noise
reduction, image enhancement, and augmenting the dataset through rotation, flipping, and
scaling help diversify the data and make the model easier to apply to real-life inputs.
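As a hedged illustration of such augmentation, Keras' ImageDataGenerator can apply rotation, flipping, and scaling on the fly. The specific transforms, parameter values, and directory name below are assumptions for demonstration, not the report's exact settings.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation applied only to training data; validation/test data are left untouched.
train_augmenter = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    rotation_range=10,        # small random rotations
    horizontal_flip=True,     # mirror faces to simulate different views
    zoom_range=0.1,           # mild scaling
    width_shift_range=0.05,   # small translations
    height_shift_range=0.05,
)

train_flow = train_augmenter.flow_from_directory(
    "data/train_faces",       # hypothetical directory of cropped face images
    target_size=(224, 224),
    batch_size=32,
    class_mode="binary",      # real vs. fake
)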
Feature extraction follows data preprocessing. Feature extraction is the process through which
raw data is transformed into a format that machine learning algorithms can use for
classification in deepfake detection. In this context, feature extraction focuses on the most
subtle artifacts and inconsistencies that deepfake algorithms inject into media. Such features
might include facial expressions, inconsistent lighting, or even erratic blinking. Most deepfake
detection algorithms use CNNs, which are specifically designed for image and video analysis,
for feature extraction. CNNs automatically learn spatial hierarchies of features such as edges,
textures, and patterns, which helps in the detection of inconsistencies in manipulated content.
Other techniques used in advanced systems include RNNs and LSTM networks, particularly
for sequential data such as videos. Temporal features, for instance the change in facial
expression or lip movement over time, play a significant role in these systems in
distinguishing real from fake content. Besides, we are likely to use ResNet and XceptionNet,
two types of CNNs, to extract multi-dimensional and abstract features from videos and
images.
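The following sketch shows what this per-frame feature extraction step could look like with a pretrained backbone. The choice of ResNet50 with global average pooling and the resulting 2048-dimensional output follow the standard Keras model and are assumptions here rather than a prescription.

import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def frame_features(face_crops):
    # face_crops: float array of shape (num_frames, 224, 224, 3), RGB in [0, 255].
    # Returns one 2048-dimensional feature vector per frame.
    x = preprocess_input(face_crops.astype("float32"))
    return extractor.predict(x, verbose=0)   # shape: (num_frames, 2048)

# Example with dummy frames standing in for cropped faces from one video.
dummy = np.random.randint(0, 256, size=(8, 224, 224, 3)).astype("float32")
print(frame_features(dummy).shape)  # (8, 2048)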
After feature extraction comes model development in the organization of a deepfake detection
system. This is where the machine learning algorithms are trained to classify data based on
the extracted features. Approaches range from traditional machine learning models, such as
support vector machines and random forests, to more complicated deep learning models, such
as CNNs, RNNs, and hybrids that combine multiple kinds of neural network architectures.
The selection among these models is influenced by factors such as the characteristics of the
data to be evaluated, the complexity of the targeted deepfakes, and the amount of
computational resources available. Deepfake detection models are mostly trained with
supervised learning, where the model is trained on labeled datasets of real and fake media.
During training the model learns which features correspond to which label, based on patterns
observed in the data. Knowledge acquired from categorizing other images can also be reused:
pre-trained models are fine-tuned on deepfake datasets, which is transfer learning in
application.
The next step in organizing a deepfake detection system is training. Training tunes the model's
internal parameters so that its predictions come as close as possible to the true labels on the
training data. Optimization algorithms such as stochastic gradient descent or Adam work by
iteratively updating the model weights based on the loss function. Overfitting is a training
issue in which the model learns the detailed characteristics of the training data and fails to
generalize to new, unseen data. Techniques used to prevent overfitting include dropout, early
stopping, and cross-validation, along with data augmentation and the use of diverse datasets,
so that the model learns to generalize better to the real world. We then test the model against
a separate validation dataset, taking into account its performance on unseen data, and tune the
hyperparameters appropriately. Overall, the training phase aims to produce a model that can
classify real and fake media accurately even when new manipulation techniques are thrown at
it or the quality of the input varies.
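A hedged sketch of such a training step is given below. The tiny stand-in classifier, the dummy data, and the callback settings (patience, monitored metric) are illustrative assumptions, not the report's actual configuration; only the use of Adam, dropout, and early stopping follows the text above.

import numpy as np
from tensorflow.keras import layers, Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Tiny stand-in classifier over pre-extracted 2048-d frame features (illustrative only).
model = Sequential([
    layers.Input(shape=(2048,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                       # dropout against overfitting
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Dummy features/labels standing in for the real training and validation splits.
X_train, y_train = np.random.rand(512, 2048), np.random.randint(0, 2, 512)
X_val, y_val = np.random.rand(128, 2048), np.random.randint(0, 2, 128)

early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=30, batch_size=32, callbacks=[early_stop])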
The step after training is model evaluation. Model evaluation is a critical component in
organizing a deepfake detection system, as it assesses performance along several dimensions.
Common performance metrics include accuracy, precision, recall, F1-score, and AUC-ROC.
These help assess how well the model detects fake media while avoiding false positives and
false negatives. Since deepfakes are constantly progressing, a fixed set of evaluation
conditions cannot be relied on forever; the system must also be assessed on its generalization
across as wide a variety of datasets, manipulation techniques, and video qualities as possible.
A model trained on one particular dataset may not generalize well to another dataset produced
with a different manipulation technique, so the model is tested on a number of datasets to
ensure it is robust and applicable across different deepfake generation methods. Adversarial
testing is another evaluation strategy: it determines the vulnerability of the model to attacks
designed to deceive it, thereby ensuring its reliability and effectiveness in real-world
applications.
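These metrics can be computed directly with scikit-learn, as sketched below on hypothetical prediction arrays; the 0.5 decision threshold is an assumption.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # 0 = real, 1 = fake
y_prob = np.array([0.1, 0.4, 0.9, 0.7, 0.55, 0.2, 0.8, 0.6])   # model scores
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))  # uses scores, not hard labels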
The last phase of organizing deepfake detection is deployment and real-world application.
Deployment means making the trained and tested model available for use in practical
scenarios. Depending on the application, deployment might mean integrating the detector into
other platforms: social media websites, news websites, law enforcement systems, or streaming
services. For example, a social media platform can run deepfake detection to flag a
manipulated video automatically before it spreads virally. Such applications need real-time
processing of video with high efficiency and without adding latency to content moderation.
We often rely on models distributed across both edge computing and the cloud to spread the
workload and enable real-time feedback. For instance, video data streamed from cameras or
smartphones during live streaming can be preprocessed on site before it reaches the cloud for
further inspection. Deployment also calls for constant monitoring and maintenance, since the
detection system needs to keep changing in step with new deepfake techniques and the latest
approaches to media manipulation. This can be achieved through periodic retraining of the
model on new datasets or fine-tuning it based on feedback from users or stakeholders.
In conclusion, we outline deepfake detection as a process of multiple steps that begins with
data collection and preprocessing and continues with feature extraction, model development,
training, evaluation, and deployment. Every step is crucial to the effectiveness of the system in
detecting manipulated media. As methods for making deepfakes evolve, so must deepfake
detection systems: they must accommodate new, creative ways of altering media while
maintaining accuracy, reliability, and efficiency. Combining machine learning techniques such
as CNNs, LSTMs, and hybrid models with strong evaluation frameworks keeps a deepfake
detection system efficient enough for real-world use cases. Across applications such as
content moderation and law enforcement, this organized approach to detection lets developers
and researchers combat the spreading menace of deepfakes and preserve the integrity of
digital media.
Methodology Applied:
• Data Augmentation
• Transfer Learning
• ResNet50
• LSTM
Technologies Implemented:
• Python, matplotlib
• SKlearn
• Keras
• TensorFlow
All of these tools are free and do not require a specialist technical degree to use. Given the
time constraints of the development cycle, these technologies are easy to use and fast, which
let us implement the project readily.
Chapter-2: LITERATURE SURVEY
Deepfake videos are emerging everywhere, threatening democracy, justice, and public trust.
This creates a growing demand for video analysis, detection, and intervention. A few
examples are listed below:
• Exposing DF Videos by Detecting Face Warping Artifacts: a Convolutional Neural Network
model is trained specifically to compare synthesized faces against their surrounding context,
allowing the authors to isolate the artifacts. The paper presents two classes of facial artifacts.
The strategy rests on the observation that currently available implementations of the DF
algorithm can only generate images of limited resolution, which must then be transformed to
match the size of the faces in the source video.
• Uncover AI-Generated Fake Videos by Detecting Eye Blinking [7] is another approach to
unveiling forged face recordings produced by deep neural network models. The approach
relies on detecting eye blinking in the recording, a physiological cue that is poorly reproduced
in synthesized fake recordings. It has been experimentally tested on benchmark datasets for
eye-blinking detection and shows promising results when applied to recordings created with
deep neural network-based tools. However, their technique uses only the absence of blinking
as a cue for detection, whereas other parameters should also be considered for identifying
deepfakes, such as the appearance of teeth, wrinkles on faces, and so on. We present an
approach that considers these additional parameters.
• Capsule networks are another way to detect manipulated or forged video and image data,
covering manipulated and forged videos and images in replay-attack and computer-generated
video scenarios. Here, the authors trained their method with added random noise, which is
not ideal. Their model performed well on their own dataset but struggles on real-time data
because of the noise used in training. For training our method, we propose a noiseless,
real-time dataset.
• Synthetic Portrait Videos Detection [9] detects and identifies fake portrait videos using
biological signals, extracted from pairs of real and fake portrait videos. Feature sets and PPG
maps are transformed to capture signal characteristics and to preserve spatial coherence and
temporal consistency for training a probabilistic SVM and a CNN. Once a video has been
classified as probably fake or real, further verification is carried out. The approach determines
with good accuracy whether content contains counterfeit parts regardless of the generator,
content, intention, or nature of the video. However, it proves inconvenient to formulate a
differentiable loss function from the proposed signal-processing steps, since the lack of a
discriminator limits the loss design.
Chapter-3: SYSTEM DEVELOPMENT
Deepfakes are one of the current and growing threats to security, governance, democracy, and
privacy. Some credible applications of face swapping exist in video compositing, portrait
editing, and identity verification, because it is possible to replace faces in photographs with
faces selected from a variety of stock pictures. Digital attackers, however, use face swapping
as a tactic to gain unauthorized access through identity-verification frameworks. Forensic
modelling becomes very difficult when deep learning algorithms such as CNNs and GANs are
used, because the residual images can retain the pose, facial expression, and lighting of the
originals.

Images from GANs are probably the most difficult of all images generated by deep learning
algorithms, since they can be extremely realistic and of high quality: a GAN learns the
distribution of its inputs and then outputs samples that match that distribution.
3.1 Design:
The design of a deepfake detection system is complex and multi-faceted, since the system
must effectively identify and differentiate manipulated media from authentic content. With
media produced by deepfake technology now nearly indistinguishable from reality, and at
times impossible to spot, existing systems need to change in response to the new challenges
posed by such sophisticated manipulations. The design process involves several stages, such
as the selection of appropriate methodologies, model architecture, data preprocessing, feature
extraction, model training, evaluation, and deployment. Each of these stages contributes to the
goal of developing a robust, accurate, and reliable system that can detect deepfakes across a
wide range of manipulation techniques, content types, and media formats. A system like this
therefore requires deep knowledge of machine learning, computer vision, digital forensics,
and ethics for a sound design.
The methodologies and models used for deepfake detection systems are a critical design
decision. In most cases deepfakes are detected using machine learning, and deep learning in
particular is often the only viable option, since it is capable of learning complex patterns from
big data. Convolutional neural networks (CNNs) are the most widely used models in deepfake
detection, especially for static image and frame-level analysis. CNNs identify the subtle visual
artifacts introduced by deepfake generation methods by detecting spatial hierarchies in
images, including textures, edges, and facial landmarks. More complex models, like RNNs or
LSTM networks, can capture the dependencies and motion patterns across frames in a video.
Such models are better suited to sequential data such as expressions, head movements, and lip
synchronization, which often break down in deepfakes.
Hybrid architectures are another heavily exploited avenue for deepfake detectors. Here,
CNN-LSTM networks take the best of CNN feature extractors and LSTMs for sequential data.
This lets the detector view video in both its spatial and its temporal aspects, which in turn
helps catch many more deepfakes. The hybrid approach makes the detector aware not only of
single-frame anomalies but also of temporal inconsistencies over time, such as awkward facial
movements or unnatural motion, a common characteristic of most deepfake videos. Advanced
systems even explore transformer models and attention mechanisms for feature extraction,
detecting subtle spatial and temporal cues to ensure a high level of accuracy.
Apart from the selection of suitable models, data collection and processing are equally
important. The success of a deepfake detection system depends largely on the quality and
diversity of its training data. For deepfake detection, large high-quality datasets are necessary,
containing both authentic and synthetic media. Research and development often build on
prime datasets such as FaceForensics++, the DeepFake Detection Challenge, and the Deepfake
Detection Dataset on Kaggle. Such datasets contain videos over a very wide range of subjects,
from face-swapped celebrity faces to voice synthesis, which helps train the models on the
various types of media manipulation.
At the preprocessing stage, the raw video and image data are prepared through various steps
so that they are suitable for training the deepfake detection model. Preprocessing involves
tasks such as resizing images and videos to uniform dimensions, converting video files into
frames, normalizing pixel values, and possibly converting the video format for proper use
with deep learning frameworks.
Feature extraction is the most important part of deepfake detection. After preprocessing the
data, meaningful features must be extracted to determine whether content is real or fake.
Feature extraction in traditional image analysis usually deals with low-level features such as
edges, textures, and color histograms. Deepfake detection systems, however, look for more
complex, higher-level features such as unnatural facial movements, inconsistent lighting, or
artifacts introduced by the generation algorithms themselves. One or a combination of
methods may be applied to extract these: CNNs are applied to recognize pixel-level
inconsistencies, and autoencoders or GANs are used to model normal facial appearance and
motion patterns. Some systems additionally use optical flow analysis to track the movement
of facial muscles or eyes, which is altered in most deepfake videos.
The model gets trained after feature extraction during the design process. In this training
phase, the model learns to identify the difference between real and manipulated media.
Deepfake detection systems are very commonly trained through supervised learning, which is
training a model on a labeled dataset containing both real and fake samples. This model
learns to associate specific features with the real or fake label by using a technique called
backpropagation. In this technique, the model's weights are updated so that the difference
between the predicted and actual labels decreases. Dropout, early stopping, and cross-
validation techniques are applied during training to avoid overfitting. Deepfake datasets are
very large and complex, which means this step demands much computing power.
Accelerated processing can be achieved using Graphics Processing Units or Tensor
Processing Units.
Training can also be done through a process called transfer learning, whereby a model
pre-trained on some other task, for instance on a massive repository of real-world images, is
fine-tuned on a deepfake-specific dataset, reusing its learned characteristics to adapt to
deepfakes more efficiently. Transfer learning has proved especially helpful when labeled
deepfake data are scarce, as it reduces the volume of training data needed and increases the
speed with which the model reaches a high accuracy level.
The training process is followed by model evaluation, which ensures that the system's models
are actually reliable at distinguishing real images from deepfakes. The main evaluation
metrics include accuracy, precision, recall, F1-score, and the Area Under the Receiver
Operating Characteristic Curve. These metrics give insight into the performance of the model,
especially how well it identifies fake media without incorrectly flagging real media as fake.
The model is also tested on various different datasets, manipulation techniques, and video
qualities to test its generalization abilities. Because the algorithms behind deepfakes are
constantly evolving, the detection mechanisms need to be flexible enough to adapt to new and
unseen manipulation techniques.
We should measure the deepfake detection systems in terms of their robustness against
adversarial attacks, besides other traditional performance metrics. Because the attacker might
try to build deepfakes which are particularly designed to evade the detection system, it
becomes an integral part of the design process to test the robustness of the system. The
robustness tests may include testing a model in conditions such as compressed video, low
resolution, or even noise, which prevail in most real-world scenarios.
The last step in the design process is deployment. In this phase we embed the trained model
into real-world applications so it can detect deepfakes across various media formats: social
media platforms, video streaming services, and news organizations. This requires ensuring the
model can handle real-time detection, especially in live streaming or fast-paced media
consumption, which usually means optimizing the model so it runs over video streams with as
little compromise on speed as possible. It also calls for edge computing and cloud-based
systems to deal with large volumes of data and lessen latency.
In conclusion, designing a deepfake detection system involves several stages, each important
to ensuring that the system can detect manipulated media effectively and correctly: selecting
an appropriate machine learning model, collecting data, preprocessing, feature extraction,
training, evaluation, and eventually deployment, all of which need to be thought through
properly. These stages are designed to address the emerging challenges of deepfake
technology. Deepfake technology is always evolving, and detection systems need to evolve as
well, which requires keeping up with the latest research in machine learning, computer vision,
and artificial intelligence. The result is robust, scalable, and reliable systems that can help
society identify and combat the harmful impacts of deepfakes.
3.2 Algorithm:
1. Begin
2. Upload the Deepfake Detection Challenge image dataset
3. Import all necessary libraries
4. Data preprocessing
5. Develop and train the ResNet50 + LSTM model
6. ResNet50 extracts features
7. LSTM for sequence processing
8. Test-time augmentation
9. Predict on the test dataset
10. Conclusion
Dataset:
Learning from data is at the core of deep learning. To achieve excellent learning quality and
accurate predictions, careful dataset preparation is vital. MTCN, YouTube, and Deep Fake
Detection Challenge datasets are used in equal quantities for our mixed dataset. These
recordings commonly show standing or sitting individuals, either facing the camera or not,
with a wide range of backgrounds, lighting conditions, and video qualities. Training
recordings are 1920 × 1080 pixels, or 1080 × 1920 pixels when captured in vertical mode. The
dataset is built from a total of 119,146 recordings, each with a unique label, either real or fake,
in the training set; 400 recordings make up the validation set, and the test set contains 4,000
private anonymous recordings. We are able to train and evaluate the model on those 4,000
test recordings using the Kaggle infrastructure, despite the fact that we cannot inspect them
directly. The ratio of manipulated to real recordings is 1:0.28. Only the 119,245 training
recordings carry labels, and for that reason we use this entire set to train our approach. The
training recordings are divided into 50 numbered sections: our training process uses 30
sections, our validation process uses 10 sections, and our testing process uses 10 sections.
The private set is used in testing: it evaluates submitted strategies in the Kaggle environment
and returns a log-likelihood loss. Log-likelihood loss penalizes very heavily the case of being
simultaneously confident and wrong; in the worst case, asserting with full confidence that a
video is real when it is actually manipulated (or vice versa) adds infinity to the error score. In
practice, predictions are bounded, so even in the worst case the loss is reduced to a very large
but finite value. This is a further challenge of the evaluation system, because methods that
perform well on measures like accuracy can still make enormous errors under log-likelihood.
Every video carries a single label telling whether it is manipulated or not; the label does not
say whether the manipulation is of the face, the audio, or both. Since our method considers
only video evidence, recordings manipulated solely through their audio make the labels
noisy: such a video is labelled fake while its faces are real. The labels are also noisy because
people who appear only occasionally in a video may be the ones carrying the face
manipulation. Our newly prepared dataset comprises half original videos and half
manipulated deepfake recordings, and it was divided into 80% training and 20% testing sets
respectively.
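To make the behaviour of this metric concrete, here is a small sketch of binary log loss. The clipping value of 1e-15 mirrors common practice (for example, scikit-learn's default) and is an assumption rather than the competition's exact setting.

import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    # Binary log-likelihood loss. Confident wrong predictions are punished hard;
    # clipping keeps the loss finite when a prediction is exactly 0 or 1.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(log_loss([1, 0], [0.9, 0.1]))    # confident and right -> ~0.105
print(log_loss([1, 0], [0.01, 0.99]))  # confident and wrong -> ~4.6
print(log_loss([1], [0.0]))            # infinite without clipping -> ~34.5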
Classification Setup:
In the following model, X is the input set, Y is the output set, and f is the prediction function
of the classifier, which takes inputs from X and produces values in the action set A; the
random variable pair (X, Y) is assumed to take values in X × Y. The classifier is chosen to
minimize the classification loss over this pair. The models are implemented with PyTorch and
Keras 2.1.5 under Python 3.5, and the network weights are optimized with Adam using the
default parameters β1 = 0.9 and β2 = 0.999, operating on successive frames of size
224 × 224 × 3.
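As a hedged sketch of this optimizer configuration in Keras, the learning rate below is an assumption, while beta_1 and beta_2 follow the defaults quoted above.

from tensorflow.keras.optimizers import Adam

# Adam with the default exponential-decay rates mentioned in the text.
optimizer = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)

# The detector (e.g. the ResNet50 + LSTM model sketched earlier) would then be
# compiled for binary classification over 224 x 224 x 3 frame sequences:
# model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])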
Preprocessing:
First, the video is pre-processed: it is split into frames, faces are detected, and the video
frames are cropped around the detected faces. We compute the mean number of frames across
the dataset and derive a new set of face-cropped videos containing that mean number of
frames, so that the frame count is kept equal across videos. Frames without a face are ignored
during preprocessing. Since all the visual manipulations are confined to the face regions, and
faces generally occupy a small region of the frame, a system that extracts features from entire
frames is not a good choice; instead, we extract features only from regions in which a face
exists. Processing a 10-second video at 30 frames per second, i.e. a full 300 frames, would
require a lot of computational power, so we propose using only 150 frames for training the
model.
In general, preprocessing is the process of making the input data consistent, clean, and ready
for analysis by a deepfake detection model. It starts with gathering a robust and diverse
dataset containing both manipulated and real media; datasets used include FaceForensics++,
Celeb-DF, and the DeepFake Detection Challenge, with varied real and synthetic content for
training. After collecting the dataset, for video data we extract single frames at a fixed rate,
here one frame per second, to maintain temporal coherence while reducing the computational
load; frame extraction lets the detection effort concentrate on the content areas that deepfake
techniques mainly manipulate, which are faces.
The most important step is face detection and cropping. We often use algorithms such as Haar
cascades or MTCNN, and sometimes Dlib for single-face detection, to isolate faces both in
pictures and within video frames. In practice this limits the model's focus so it does not
process unnecessary data from areas that contain nothing of value. Detected faces are then
aligned with the help of landmarks such as the eyes or the nose, which further standardizes
the data: it ensures uniformity in the orientation of the face and makes it easier for the model
to detect inconsistencies without variability from head tilt or placement.
The cropped faces are then resized to 224 x 224 pixels, a very common input dimension for
many deep models including CNNs; this ensures compatibility with standard architectures
while keeping scale and consistency within the input data. Pixel values are normalized to a
fixed range, for example [0, 1] or [-1, 1], which standardizes the input, speeds up convergence
when training this kind of model, and, in many standard deep-learning libraries, helps prevent
numerical instability. A short sketch of the detection, cropping, and normalization steps is
given below.
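This is a minimal illustration using the mtcnn and OpenCV packages; the margin around the detected box, the [0, 1] normalization, and the input file name are assumptions for demonstration, not the report's exact preprocessing settings.

import cv2
import numpy as np
from mtcnn import MTCNN

detector = MTCNN()

def extract_face(frame_bgr, size=224, margin=10):
    # Detect the most confident face in a BGR frame, crop it with a small margin,
    # resize to size x size and normalize pixels to [0, 1]. Returns None if no face is found.
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    detections = detector.detect_faces(frame_rgb)
    if not detections:
        return None
    best = max(detections, key=lambda d: d["confidence"])
    x, y, w, h = best["box"]
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    crop = frame_rgb[y0:y + h + margin, x0:x + w + margin]
    crop = cv2.resize(crop, (size, size))
    return crop.astype("float32") / 255.0  # normalized face crop

# Example: read one frame from a video and extract the face region.
cap = cv2.VideoCapture("sample_video.mp4")   # hypothetical input file
ok, frame = cap.read()
cap.release()
face = extract_face(frame) if ok else None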
Another important pre-processing aspect is data augmentation, which helps enhance the
model's generalization capability. Methods such as random rotation, flipping, scaling, and
translation bring variations to the data, making it more diverse and better at mimicking
real-world conditions. Flipping a face horizontally simulates multiple views, while rotation
and scaling simulate variations in head pose and camera distance.
Highlighting subtle yet detectable deepfake artifacts is also crucial in preprocessing.
Compression artifacts, such as pixel irregularities or unnatural textures, are much more
prominent in synthetic content because of the encoding processes involved in deepfake
creation. One of the most important areas of attention is temporal artifacts, especially in
videos, which appear as inconsistencies between frames; for instance, unnatural blinking or
abrupt motion transitions indicate manipulation. We extract and highlight these artifacts
during preprocessing so that the model learns effective patterns for distinguishing real content
from fake.
Temporal smoothing preprocesses adjacent video frames so that the model can identify
inconsistencies from one frame to the next. It is very useful for detecting manipulations like
lip-sync mismatches, which are hard to identify through single-frame analysis. Advanced
techniques, like spatiotemporal smoothing, combine spatial and temporal features to make the
preprocessing pipeline more comprehensive.
Feature extraction is an optional but powerful preprocessing step that enriches the dataset
further. We use OpenCV or Dlib for facial landmark extraction to obtain a structural view of
the face, focusing the model's attention on the areas where manipulation typically takes place:
the eyes, the mouth, and the nose. Another option is frequency-domain analysis with Fourier
transforms, which can detect unusual patterns in an image's frequency content and often
exposes mistakes made at the moment a deepfake is created. These extra features make the
input data richer and help the model detect manipulations better.
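A hedged sketch of such a frequency-domain feature is given below. Using the log-magnitude spectrum of a face crop is a common choice, though the exact transform and post-processing used in any given detector vary.

import numpy as np

def log_magnitude_spectrum(face_gray):
    # face_gray: 2-D grayscale face crop as a float array.
    # Returns the centered log-magnitude of its 2-D Fourier transform, which can
    # reveal periodic upsampling/blending artifacts left by generative models.
    spectrum = np.fft.fftshift(np.fft.fft2(face_gray))
    return np.log1p(np.abs(spectrum))

# Example on a dummy 224x224 grayscale crop.
crop = np.random.rand(224, 224)
features = log_magnitude_spectrum(crop)
print(features.shape)  # (224, 224), usable as an extra input channel or feature map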
Label encoding is one of the simplest yet most essential preprocessing steps, in which every
sample is assigned a specific label. For example, in a binary classification task, real is
assigned 0 and fake is assigned 1. In multi-class classification this is extended with additional
labels for the type of manipulation, such as face-swapping or lip-syncing. Proper labeling
helps the model learn the correct relation between the input data and the corresponding
outputs.
The prepared data is divided into training, validation, and test sets. This ensures the model is
tested on data it has never encountered before, avoiding overfitting and ensuring
generalizability. To prevent biased training, real and fake content must be balanced across
these sets. The data is also shuffled and organized into batches to improve the training
pipeline: batching allows efficient use of memory during training, and shuffling avoids
unwanted patterns being learned from the ordering of the data.
Preprocessing is also concerned with pipeline optimization for real-time applications, which
demand low latency and high throughput. Techniques such as preloading frames or efficient
face-detection algorithms like RetinaFace can be used to speed up preprocessing. The
preprocessing pipeline is built to be modular, so that a change in one step does not disturb the
entire workflow; new data augmentation techniques, improved face-alignment algorithms, and
so on can be plugged in without affecting the rest of the pipeline.
Model:
In deepfake videos, inconsistencies are found both between frames and within frames. An
LSTM and a CNN are combined into a temporal-aware pipeline for recognizing deepfake
recordings. The CNN is used to extract frame-level features, which are then passed to the
LSTM to form a sequence descriptor. The model places one LSTM layer after the residual
ResNet50 network, followed by a fully connected layer that separates altered videos from
genuine ones based on this sequence descriptor. In preprocessing, a data loader takes all
face-cropped, pre-processed recordings and splits them into train and test sets. The frames of
the processed recordings are then submitted to the model for training and testing in
mini-batches. The recognition network formed by the fully connected layers takes the
category label as its target and computes the probabilities of the frame sequence belonging to
either the real or the deepfake class.
CNNs are essentially the backbones of image-based analysis, and three architectures named
XceptionNet, ResNet, and EfficientNet have proven to be very effective. The models achieve
high accuracy in identifying pixel-level anomalies, irregular lighting, unnatural facial
expressions, and inconsistencies in facial landmarks. Video-based detection uses RNNs and
LSTM networks to scan for anomalies in sequential patterns like blinking, lip-syncing, or
motion dynamics that are inconsistent from frame to frame. Hybrid models combining CNNs
with LSTMs or transformers effectively address the problem, because both spatial and
temporal features can be captured, which means manipulations of this complexity can be
detected.
Emerging techniques such as Vision Transformers (ViT), along with spatio-temporal attention
mechanisms, are changing the game by improving the analysis of multi-dimensional data,
giving detectors the ability to uncover very subtle and adaptive deepfake methods. Examples
include models like XceptionNet and EfficientNet fine-tuned to specific deepfake datasets.
The strength of these approaches lies in their robustness to the limitations of computational
resources or labeled data. In ensemble learning, predictions are aggregated from multiple
models, improving detection accuracy by combining the strengths of different architectures.
Real-world applications often demand low-latency, high-throughput systems for real-time
content moderation or forensic analysis. Moreover, adversarial attacks and generalizability are
two major challenges requiring ongoing training with different datasets and evolving
techniques. Last but not least, the final challenge is to produce systems that are scalable,
ethical, and resilient enough to deal with the continually advancing threat from deepfake
technologies in domains like cybersecurity, media verification, and legal investigations.
ResNet CNN for Feature Extraction:
ResNet is a deep network architecture that is highly effective as the CNN for feature
extraction in deepfake detection, because it can model complex patterns within image data
while mitigating the vanishing gradient problem in deep neural networks. ResNet is an
innovation of Microsoft Research. In residual learning, the network learns to map the
differences, or residuals, between input and output rather than the full transformation, by
using shortcut connections in which the input skips one or more layers and is added directly
to the output. With shortcut connections, much deeper networks can be trained with no
degradation in performance. Here, ResNet is employed for its strong spatial feature extraction
in the context of deepfake detection, where it can detect subtle anomalies and identify artifacts
that distinguish real images from manipulated content.
ResNet's power as a feature extractor comes from its hierarchical design: it is essentially a stack of residual blocks. Each residual block consists of convolutional layers, batch normalization layers, and ReLU activations, connected by a shortcut that adds the block's input directly to its output. The network therefore learns low-level features such as edges and textures as well as high-level ones such as facial structures and patterns. These capabilities are particularly important in deepfake detection, where minute inconsistencies, including irregular textures, unnatural lighting, or mismatched facial landmarks, can differentiate real from fake images. For example, artifacts introduced by generative models such as GANs may show up as slight pixel-level inconsistencies, which ResNet is well equipped to detect from its hierarchical feature maps.
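For illustration, a basic residual block of the kind described above can be written in PyTorch as follows; this is a generic sketch of the standard two-convolution block, not code taken from the project.

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv-BN-ReLU stages with a shortcut that adds the input to the output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut: input added directly to the block output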
One of the key advantages of using ResNet for feature extraction is its scalability. Variants such as ResNet-18, ResNet-50, and ResNet-101 offer different depths, letting researchers balance computational efficiency against feature complexity according to the requirements of the detection task. ResNet-50, for example, is a very popular choice: its 50 layers are sufficient for detecting deepfake artifacts in moderately sized datasets, while ResNet-101 can be used when finer details have to be extracted. In addition, ResNet models pre-trained on ImageNet can be reused for deepfake detection at a much lower computational cost than training from scratch.
Transfer learning is another approach to feature extraction with ResNet: the weights learned by an existing network are fine-tuned for the task at hand. For deepfakes, the early layers, which capture generic features, are frozen, while the deeper layers are fine-tuned to specialise in deepfake-specific patterns. The idea is that the large amount of knowledge encoded in ResNet's pre-trained weights helps the network detect both general anomalies and deepfake-specific ones. Transfer learning has proven particularly effective when labeled training data are scarce, a very common issue in deepfake research.
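A minimal sketch of this freeze-and-fine-tune scheme in PyTorch might look as follows; which layers are left trainable (here only layer4 and the new classification head) is an illustrative assumption rather than a setting documented in this report.

import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)  # replace the head: real vs fake

# Freeze early layers (generic features); fine-tune the deepest stage and the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("layer4") or name.startswith("fc")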
Apart from the spatial features, ResNet can also be integrated into a temporal model for
deepfake detection on videos. The systems that merge ResNet with an RNN or LSTM
network are capable of evaluating the sequence of frames for identifying temporal
inconsistencies which occur in unnatural blinking patterns as well as abrupt motion shifts. In
these setups, ResNet would work as a feature extractor for independent frames and the high
dimensional feature vectors are passed forward into the temporal model. This hybrid
approach would integrate the strengths of spatial analysis by ResNet with those of RNNs or
LSTMs to provide holistic detection capabilities for manipulations in video data.
The output of ResNet feature extraction is normally a high-dimensional feature map that is fed into a classification head, typically a fully connected layer followed by softmax, which assigns probabilities to the classes. In multi-class deepfake detection, for example, where different manipulation types must be distinguished, the classification head is given one output node per class.
ResNet also copes with the subtle artifacts that advanced deepfake methods introduce. For instance, many deepfake algorithms leave compression artifacts or blending errors that are almost impossible to detect by traditional means. Its deep, residual-learning-based architecture is able to capture these minor inconsistencies, making it dependable for detecting even high-quality deepfakes. ResNet also combines well with data augmentation techniques such as random cropping, flipping, and rotation, which improve generalization and robustness to variations in the input data.
Although ResNet is very robust, it has a weakness: its performance depends on the quality and diversity of the training data. Generalization typically suffers from overfitting when the training data are biased or drawn from a narrow set of deepfake generators. Researchers therefore keep enriching the training set and apply techniques such as adversarial training to harden the architecture against constantly evolving deepfake algorithms.
In conclusion, ResNet is a very powerful CNN architecture for feature extraction in deepfake detection. Combined with transfer learning, hybrid modeling, and data augmentation, ResNet-based systems can achieve high detection performance while addressing both spatial and temporal inconsistencies. Challenges such as limited data and computational demands remain, but rapid progress in model optimization and dataset curation continues to extend ResNet's ability to keep pace with deepfakes.
LSTM for Temporal Analysis:
Long Short-Term Memory (LSTM) networks are recurrent neural networks designed for sequential data processing. Unlike feedforward networks, LSTMs do not treat inputs independently; they capture dependencies over time by maintaining memory across long sequences. This is very useful in video-based deepfake detection, because temporal consistency between consecutive frames is exactly what must be examined to detect manipulated content. It allows LSTMs to recognise inconsistencies that might not be visible if frames were studied individually, such as unnatural facial expressions, inconsistent eye-blinking patterns, abrupt motions, and unrealistic movements, all of which are common artifacts of the deepfake generation process.
Their strength comes from their architecture. LSTMs have memory cells that retain information over long time periods and mitigate the vanishing gradient problem that prevents traditional RNNs from handling long sequences. An LSTM maintains a cell state together with three gates, the input, forget, and output gates, which dictate the flow of information. The input gate decides what is written to the cell state, the forget gate determines what is discarded, and the output gate determines what is forwarded to the next layer. This architecture captures long-term dependencies while ignoring irrelevant data, making LSTMs well suited to applications where earlier inputs drive the interpretation of later ones. In deepfake detection, this means an LSTM can track facial movements or background changes across several frames that may indicate a tampering pattern.
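For reference, the gating mechanism described above is usually written with the standard LSTM equations (the textbook formulation, not notation introduced in this report), where x_t is the input at time step t, h_t the hidden state, c_t the cell state, and sigma the sigmoid function:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \quad \text{(candidate cell state)}
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}
h_t = o_t \odot \tanh(c_t)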
LSTMs can therefore be used in deepfake detection to process video data frame by frame, so that the model learns temporal patterns and detects inconsistencies that emerge over time. For instance, deepfake generation algorithms often fail to create natural, synchronised facial expressions, which makes it difficult to maintain consistent eye movement or lip-syncing between frames. This allows the LSTM to identify anomalies such as unnatural blinking patterns, inconsistent eye movement, or lip movements that do not conform to the temporal characteristics of a real video. In addition, the background, lighting, and motion patterns of deepfakes often show subtle inconsistencies across frames.
The advantage LSTMs add to deepfake detection is their ability to model dynamic change across multiple frames. A genuine video shows smooth changes over frames, with consistent motion and realistic dynamics, whereas a deepfake may exhibit suddenly changing facial expressions, unnatural motion, and inconsistent illumination, exactly the anomalies an LSTM model can detect. Manipulation algorithms may also fail to model the relations between a subject's movements and expressions precisely, creating unnatural sequences. Such a distorted sequence can be caught because LSTMs learn the patterns that would normally occur in genuine videos and notice when that pattern has been broken by tampering.
The quality of the training data largely determines the success of LSTMs in deepfake detection. Since LSTMs rely on learning sequential patterns, the model needs to be exposed to a diverse set of videos, both authentic and deepfake, so that it captures the temporal dependencies of real-world video sequences. Preprocessing is equally important: video frames have to be extracted, resized appropriately, and normalized so that the model receives consistent input. Face detection and alignment help the model concentrate on the face regions, the parts of a video that are most frequently manipulated in deepfakes.
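As an illustration of this preprocessing step, the following sketch extracts frames from a video and crops the face region before resizing. The use of OpenCV for frame extraction and the facenet_pytorch MTCNN detector is an assumption made for the example, since the report does not name a specific face-detection library.

import cv2
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=20)  # detects faces and returns 224x224 crops

def extract_face_crops(video_path, every_n=10):
    """Read a video, keep every n-th frame, and return face crops as tensors."""
    cap = cv2.VideoCapture(video_path)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            face = mtcnn(Image.fromarray(rgb))  # None if no face is found
            if face is not None:
                crops.append(face)
        idx += 1
    cap.release()
    return crops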
This approach can detect various types of deepfakes, including face swaps, voice impersonations, and more. LSTMs also fit naturally into anomaly-based detectors, since the same logic can be used to flag temporal anomalies in facial expressions. They can further be integrated into larger hybrid architectures that combine LSTM networks with other popular models such as CNNs, Vision Transformers, and even Generative Adversarial Networks. For high-level manipulations such as voice synthesis and face swapping, multi-model approaches are particularly useful: one component handles spatial or semantic feature extraction while the LSTM-based part extracts the temporal relations among frames.
Recently, LSTMs have also been used to detect other types of media manipulation, including audio deepfakes. Applied to audio, LSTMs process the sequential nature of sound waves and identify unnatural speech patterns that differ from those found in authentic recordings. By analysing dependencies in both the visual and audio modalities, LSTM-based systems can therefore identify multimodal deepfakes involving both video and audio manipulation. LSTM networks can detect the fine-grained variations that deepfakes introduce by taking into account the temporal dependencies present in video frames across different time steps; they capture the dynamics that characterise real video sequences and flag anomalies such as unnatural facial movement, blinking, or lighting.
This is what distinguishes genuine content from manipulated media. Used together with other models such as CNNs, LSTMs provide a comprehensive solution to the problems of deepfake detection, allowing systems to analyse both spatial and temporal patterns effectively. Despite issues with overfitting, optimization, and poor-quality training data, the flexibility and efficiency of LSTMs make them a valuable addition to the growing field of deepfake detection, especially for video-based applications in which temporal coherence is critical.
Training Process: -
The essential components of deepfake detection are training, validation, and testing. Training is the core of the proposed model; this is where learning actually happens. For deep learning models to fit a specific problem domain, careful design and fine-tuning are necessary, and we need to search for the parameters that are optimal for our dataset. Training and validation use the same components: the model is fine-tuned during validation, and the validation module monitors the performance and accuracy of deepfake detection during training. The testing module classifies a specific video by determining the class of the extracted faces, and thereby supports the research goals.
The model thus has two parts: feature learning (FL) and classification. FL is essentially learnable feature extraction from face images. Classification takes the FL output and maps it to the decision used in the final detection step. FL consists of stacked convolutional operations; the feature learning component uses an architecture based on ResNet50.
We begin with the pre-trained ResNet50 and then add the LSTM and fully connected layers with randomly initialised weights. The network is trained end-to-end with a binary cross-entropy (BCE) loss applied to the LSTM prediction. The BCE loss is computed on the cropped faces from the frames of a randomly chosen video; note that this loss is driven by the video-level output probabilities. Backpropagating the BCE loss updates all weights of the ensemble except those of ResNet50.
Before training the complete ensemble end-to-end, we initialise it with an optional pre-training step of 2000 epochs on random crops, simply to obtain a preliminary set of model parameters. In our experiments this did not increase detection accuracy, but it led to faster convergence and a significantly more stable training procedure. Because of GPU memory constraints, the network size, and the number of input frames, only one video can be processed at a time; however, the network parameters are updated for the binary cross-entropy loss after every 64 videos. Adam is used as the optimizer with a learning rate of 0.001.
The training process for deepfake detection involves a systematic approach to ensure the model
can accurately identify synthetic media. It begins with the collection of diverse datasets
containing real and deepfake content, sourced from publicly available repositories like
FaceForensics++ or Celeb-DF or custom-curated datasets. Preprocessing follows, where data
is optimized through resizing, normalization, and augmentation to enhance variations in
lighting, scaling, and occlusions. Next, the model architecture is selected, often involving
convolutional neural networks (CNNs) for spatial analysis and recurrent neural networks
(RNNs) or transformers for temporal inconsistencies in videos.
The training phase employs supervised learning, where labeled real and fake data help the
model learn distinguishing features. Techniques like cross-entropy loss and backpropagation
refine the model's weights. Validation using a separate dataset ensures performance
monitoring and helps prevent overfitting, employing methods such as early stopping and
regularization. Post-training, the model’s effectiveness is evaluated against unseen data using
metrics like accuracy, precision, recall, and F1 score. To ensure robustness, adversarial
examples and variations in quality and compression are tested.
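The update schedule described above (one video at a time, with a parameter update after every 64 videos) corresponds to gradient accumulation. A minimal sketch, assuming a model that outputs a single real/fake logit per video, a train_loader that yields one video's frame tensor and its label tensor at a time, and BCEWithLogitsLoss as the concrete form of the binary cross-entropy loss:

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()                       # binary cross-entropy on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
accum_videos = 64                                        # one parameter update per 64 videos

optimizer.zero_grad()
for step, (frames, label) in enumerate(train_loader):    # one video per iteration
    logits = model(frames.unsqueeze(0)).view(-1)         # video-level real/fake logit
    loss = criterion(logits, label.float().view(-1)) / accum_videos
    loss.backward()                                       # accumulate gradients
    if (step + 1) % accum_videos == 0:
        optimizer.step()                                  # apply the accumulated update
        optimizer.zero_grad()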
Figure 5. Data Flow Diagram
Evaluation: -
One of the basic metrics for evaluating a deepfake detection system is accuracy, the fraction of correct predictions computed from the true positives, true negatives, false positives, and false negatives. It usually proves inadequate for many applications, however. On the highly imbalanced datasets that are typical in this field, where one class dominates, a model can score well simply by classifying most inputs as real, because that is the larger class; this is not very helpful for practical deepfake detection. Precision, recall, and F1-score are therefore widely used as more informative measures.
• Precision: the proportion of true positives, i.e. correctly tagged fake videos, out of all videos tagged as fake. A high precision means the system rarely labels real videos as fake.
• Recall (or sensitivity): the proportion of true positives out of all actually fake videos, i.e. true positives divided by the sum of true positives and false negatives. A high recall means most of the fakes are identified, although it may come at the cost of additional false alarms.
• F1-score: the harmonic mean of precision and recall, giving false positives and false negatives equal weight. It is therefore very useful on imbalanced datasets because it does not favour either class. The standard formulas for these measures are given below.
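In standard notation (TP = true positives, FP = false positives, FN = false negatives):

\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}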
Another useful evaluation measure is the Area Under the Receiver Operating Characteristic curve, or AUC-ROC. The ROC curve plots the true positive rate against the false positive rate across decision thresholds, and the AUC gives the probability that the model ranks a randomly chosen fake sample higher than a randomly chosen real one. A greater AUC therefore indicates better discriminative power, with 1 representing perfect separation between genuine and fake content.
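In practice the AUC can be computed directly from per-video scores with scikit-learn (the library already used for other metrics in the appendix); the variable names here are illustrative:

from sklearn.metrics import roc_auc_score

# y_true: 1 for fake, 0 for real; y_score: the model's predicted probability of "fake"
auc = roc_auc_score(y_true, y_score)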
Another important evaluation criterion is generalization: how well the model works on unseen data or on other manipulation techniques. Detection models are typically trained on specific datasets such as FaceForensics++ or Celeb-DF, which contain a varied range of synthetic media including face-swapping, lip-syncing, and puppet-based manipulations. However, as new deepfake techniques emerge, the detection system must generalise and detect the new manipulations effectively. This can be checked by cross-validating over datasets produced with different deepfake generation techniques, or by introducing data augmentation strategies during training so that the model is not overly sensitive to variations in the input data.
As deepfakes advance, adversarial attacks raise considerable concerns for detection systems. An adversarial attack consists of specially constructed inputs designed to mislead a machine learning model into misclassifying real or synthetic content. To ensure detectors are reliable in real-world applications, their susceptibility to adversarial manipulation has to be tested. A common strategy is to apply perturbations directly to the input data or to use adversarial deepfakes built purely to evade detection; the performance of the system under these attacks is then evaluated to assess its robustness and guide improvements in its security.
One major issue when assessing performance on real data is that results depend heavily on the quality of the dataset used. Popular choices include the DeepFake Detection Challenge dataset, FaceForensics++, Celeb-DF, and Kaggle's Deepfake Detection Dataset. These datasets contain a mix of real and synthetic media with different manipulations such as face swaps, lip-syncing, and facial reenactments. Evaluating a model on a range of such datasets shows whether it generalises to deepfake techniques that differ from those seen in training. In addition, a diverse set of datasets with representative examples of the manipulation techniques helps prevent bias in model performance and supports fairness and robustness.
Nevertheless, cross-domain generalization remains the greatest challenge for deepfake detection. Variations in background features and changes in illumination can affect the model's output. Because deepfakes appear in very different scenarios, such as films, news broadcasts, and social media, testing whether a model generalizes across these domains is usually necessary to determine whether it works in practice.
Another area of growing interest in the study of deepfake detection models is interpretability. Most deep learning models, such as CNNs and LSTMs, are black-box systems that do not explain why a video has been classified as real or fake. This can be problematic in high-stakes applications such as legal or regulatory use cases. Detectors can be evaluated for interpretability by producing saliency maps, Grad-CAM visualisations, or attention weights that highlight the regions of a video on which the model based its decision. Such improvements in interpretability foster trust among developers and other stakeholders and enable users to understand the model's reasoning, which is essential for adoption in practice.
Real-world deployment and user feedback are also critical in assessing deepfake detection systems. Models that perform excellently on benchmark datasets may behave differently in real-world deployments for several reasons, such as video quality, new manipulation techniques, and domain-specific challenges. Feedback from users in real-world environments can identify weaknesses in a model and provide useful insights for further refinement. In addition to these
metrics, the Receiver Operating Characteristic (ROC) curve and the Area Under the
Curve (AUC) are used to analyze the model's ability to distinguish between real and fake
samples across different thresholds. Robustness tests are also conducted, exposing the model
to adversarial examples, varying video resolutions, compression artifacts, and unseen
deepfake techniques to evaluate its adaptability. Generalizability is another critical aspect,
ensuring the model performs well across datasets it was not trained on. A thorough
evaluation process helps identify areas for improvement and ensures the model is reliable for
deployment in real-world scenarios.
Chapter-4: PERFORMANCE ANALYSIS
4.1 Analysis:
Deepfakes open new avenues in digital media, virtual reality, robotics, education, and more, yet they are also innovations that can be used to undermine society. With this in mind we designed a model that combines CNNs and LSTMs for the task of deepfake video identification. LSTMs handle the sequence of consecutive frames, while CNNs are good at learning local features; our model exploits both, relating individual pixels of a given image while also capturing non-local features. We attached equal importance to data preprocessing, training, and aggregation.
Network classification: We investigated how the networks arrive at their classification. One way is to inspect the weights of the convolutional kernels and neurons and interpret them as descriptions of the images; for example, a sequence of a positive weight, a negative weight, and another positive weight can be read as a discrete second-order operator. This, however, is only indicative for the topmost layer and says little about appearance. Another method is to generate an input image that maximises the activation of a particular channel, to see what kind of signal that channel responds to [6]. This was done for the last hidden layer of ResNet50, as shown in the following figure. Based on the weight assigned to each neuron's output in the final classification decision, we can distinguish neurons that push towards a negative or a positive score, and thereby towards the genuine or the generated class. Input images that maximally activate the positive-weighted neurons show exceptionally clear eye, nose, and mouth regions, whereas the negative-weighted neurons respond to "differences" in the background while the face area remains "smooth." This is consistent with deepfake-generated faces, which are generally blurry or lacking in fine detail and are therefore rendered differently from the rest of the photograph, which remains unchanged.
The activation of a layer can also be averaged over groups of genuine and manufactured images, and the differences between these averages can likewise be interpreted through the input photographs associated with each class. The strongest activations occur on real photographs, typically on images with clearly open eyes. Once again the distinguishing factor is blur: in real images the eye is the sharpest part of the picture, whereas in synthetic images the eye loses this sharpness because of the downscaling applied to the generated face.
4.2 Results:
We employ a combination of convolutional networks and LSTMs together with test-time augmentation, and apply transfer learning to pre-trained models such as ResNet50, MesoNet, and DenseNet121. Training and evaluation of our approach are carried out on the DFDC dataset. Our comparisons show that this approach is superior to the other three approaches considered. Where the networks operate per frame, the per-frame predictions are averaged to obtain a video-level prediction. The configuration that produces the highest balanced accuracy is selected with the help of the validation set; the results in terms of balanced accuracy are reported in Table 1. When only the face regions are used in preprocessing, the accuracy jumps dramatically. Almost all models were overfitting, as shown in Figure 3, which plots the training and validation loss for ResNet+LSTM: the validation loss begins to increase at the 5th epoch, which is when the model starts overfitting. Testing was improved using test-time augmentation (TTA), in which data augmentation is applied to a test image and the predictions for multiple augmented versions are averaged. In our TTA experiments, different transformations were used than those used when training the ResNet model. The model with the highest accuracy is ResNet50 + LSTM, with an accuracy of 94.63%.
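A minimal sketch of test-time augmentation as described here: several augmented copies of a test image are scored and the class probabilities averaged. The specific transforms used below (horizontal flip, a small rotation) and the 299x299 input size are illustrative assumptions, not the exact settings of our experiments.

import torch
from torchvision import transforms

tta_transforms = [
    transforms.Compose([transforms.Resize((299, 299)), transforms.ToTensor()]),
    transforms.Compose([transforms.Resize((299, 299)),
                        transforms.RandomHorizontalFlip(p=1.0), transforms.ToTensor()]),
    transforms.Compose([transforms.Resize((299, 299)),
                        transforms.RandomRotation((10, 10)), transforms.ToTensor()]),
]

def predict_with_tta(model, pil_image):
    """Average the softmax outputs over several augmented versions of one test image."""
    model.eval()
    with torch.no_grad():
        probs = [torch.softmax(model(t(pil_image).unsqueeze(0)), dim=1) for t in tta_transforms]
    return torch.stack(probs).mean(dim=0)  # averaged class probabilities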
[Figures: Train and Validation Accuracy; Train and Validation Loss]
[Code screenshots: Importing Packages; Model Architecture; Pre-Process; Training Model; Model Accuracy & Loss]
4.3 Comparisons:
Multiple models were trained on our dataset. Of these, ResNet50 + LSTM offers the most accurate training and testing results. Our observations show that the network regularly fails on manipulations in poorly lit or blurry, low-quality footage, even when the manipulation itself is highly effective. Manipulations made in high-quality recordings, despite their difficulty, are identified accurately.
Chapter-5: CONCLUSIONS
5.1 Conclusions:
In this design, a neural network-based approach is adopted that classifies a video as deepfake or real and reports the confidence of the proposed model. The inspiration behind the method comes from the way GANs and autoencoders are used to create deepfakes. A ResNet50 CNN is used for frame-level feature extraction, followed by an LSTM-based recurrent network for video-level classification, and the proposed method separates fake videos from real ones on the basis of the parameters described in this report. Our analysis shows that the method can identify deepfakes with an average accuracy of 94.63% under realistic, web-like distribution conditions. One appealing aspect of deep learning is that a solution to a given problem can be built without a prior theoretical analysis; nevertheless, we also wanted to understand how this solution works in order to evaluate its strengths and limitations, so we spent a considerable amount of time visualising the channels within our network. The most striking empirical finding is the significant role played by the eyes and mouth in identifying deepfakes. Continued work in this direction will make such systems stronger and more effective, and their decisions easier to understand.
5.2 Future Work:
Model Ensembling:
In ensemble modeling, several models built with different algorithms or different training data each predict an outcome; the base models are then combined, and the ensemble computes a single final prediction on unseen data. Ensembling multiple models such as ResNet50, MesoNet, DenseNet, and a custom model could further enhance our predictions, as sketched below.
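A minimal sketch of such prediction averaging, assuming the individual trained models all output class logits for the same input batch:

import torch

def ensemble_predict(models, frames):
    """Average the softmax outputs of several trained models (e.g. ResNet50, MesoNet, DenseNet)."""
    with torch.no_grad():
        probs = [torch.softmax(m(frames), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)  # final ensemble probabilities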
Data Augmentation:
Data augmentation is a technique in which new training examples are created by adding slightly modified copies of existing data or newly synthesised variants of it. It also acts as a general-purpose regularizer of the model and reduces overfitting.
Early Stopping:
Early stopping can improve the model during training. A maximum number of epochs is specified, and training is halted once the model stops improving on a held-out validation dataset.
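A minimal sketch of this idea, assuming a validation accuracy computed each epoch; the patience of 3 epochs and the helper functions train_one_epoch and evaluate are illustrative assumptions.

import torch

max_epochs, patience = 20, 3          # illustrative values
best_acc, bad_epochs = 0.0, 0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # assumed helper: one pass over the training set
    val_acc = evaluate(model, val_loader)     # assumed helper: accuracy on the holdout set
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # stop training: no improvement on the holdout validation set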
BlockChain:
Image Aggregation:
Videos, especially those viewed online, are usually compressed, so large amounts of information are lost. However, the same face appears in many frames, and aggregating these multiple observations into an overall score can increase the accuracy of the video-level decision. The simplest form is to average the network's predictions over the whole video; in principle this could be refined further by exploiting the relations between the frames of a single video.
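A minimal sketch of this frame-score aggregation, assuming per-frame fake probabilities have already been computed for one video:

import torch

def video_score(frame_probs, threshold=0.5):
    """Average per-frame fake probabilities into a single video-level decision."""
    mean_prob = torch.stack(frame_probs).mean()
    return mean_prob.item(), mean_prob.item() >= threshold  # (score, is_fake)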
Audio Tampering: Until now, our model has only been able to handle deepfakes manipulated on the video side. However, due to the fast development of CNNs [4, 20], GANs, and their variants, it is now possible to create deepfakes manipulated on the audio side as well. A video with manipulated audio should be declared fake even though the faces remain genuine, which makes it much more difficult to distinguish untampered audiovisual content from altered content. Because of this ability to synthesise realistic sound, video, and images, various stakeholders have taken action to deter such techniques from being used maliciously, and researchers are working on deepfake detection mechanisms that also detect synthetic audio.
5.3 Applications:
Chapter-6: APPENDIX: CODE
Preprocess:
# Preprocessing helpers from src/preprocess.py (excerpt, completed for readability)
from PIL import Image
from torchvision import transforms

def preprocess_image(image, augment=False):
    """Resize, optionally augment, and normalize a single PIL image."""
    transform_list = [transforms.Resize((299, 299))]
    if augment:
        transform_list.extend([
            transforms.RandomHorizontalFlip(),
            transforms.RandomRotation(10),
            transforms.ColorJitter(brightness=0.1),
        ])
    transform_list.extend([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5],
                             std=[0.5, 0.5, 0.5]),
    ])
    transform = transforms.Compose(transform_list)
    # image = Image.open(image_path).convert('RGB')  # when loading from a file path
    image = transform(image)
    return image

def get_transforms(augment=False):
    """Build the same transform pipeline for use with torchvision datasets."""
    transform_list = [transforms.Resize((299, 299))]
    if augment:
        transform_list.extend([
            transforms.RandomHorizontalFlip(),
            transforms.RandomRotation(10),
            transforms.ColorJitter(brightness=0.1),
        ])
    transform_list.extend([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5],
                             std=[0.5, 0.5, 0.5]),
    ])
    return transforms.Compose(transform_list)
Model:
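The model code was not reproduced in the original appendix (the block below this heading duplicated the preprocessing code), so the following is a hypothetical sketch of the get_pretrained_model helper that the application code imports; the architecture choice of replacing the final fully connected layer with a two-class head is an assumption.

# Hypothetical sketch of src/model.py; only the function name is taken from the app code.
import torch.nn as nn
from torchvision import models

def get_pretrained_model(name='resnet50', num_classes=2):
    """Return an ImageNet-pretrained backbone with a new real/fake classification head."""
    if name == 'resnet18':
        model = models.resnet18(weights="IMAGENET1K_V1")
    elif name == 'resnet50':
        model = models.resnet50(weights="IMAGENET1K_V1")
    else:
        raise ValueError(f"Unsupported model: {name}")
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model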
Train:
import torch
from torch.utils.data import DataLoader
import os
import logging
import torch.nn as nn
import torch.optim as optim
def load_data(batch_size):
    # get_datasets (defined elsewhere in the project) builds the train/validation loaders.
    train_loader, val_loader = get_datasets(data_dir='data')
    return train_loader, val_loader

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
best_val_accuracy = 0.0

for epoch in range(epochs):
    # --- training pass (loss/accuracy accumulation elided in this excerpt) ---
    epoch_acc = running_corrects.double() / len(train_loader.dataset)

    # --- validation pass ---
    model.eval()
    val_running_loss = 0.0
    val_running_corrects = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)
            val_running_loss += loss.item() * inputs.size(0)
            val_running_corrects += torch.sum(preds == labels.data)
    val_loss = val_running_loss / len(val_loader.dataset)
    val_acc = val_running_corrects.double() / len(val_loader.dataset)

    print(f'Epoch {epoch+1}/{epochs}')
    print(f'Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f}')
    print(f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')
Evaluate:
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import json
import os
import numpy as np
import seaborn as sns
import torch.nn as nn
import matplotlib.pyplot as plt
from preprocess import get_transforms
def evaluate_model(model, test_loader, device):
    """Run the model on the test set and compute the standard classification metrics."""
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)

    # Aggregate metrics with scikit-learn.
    accuracy = accuracy_score(all_labels, all_preds)
    precision = precision_score(all_labels, all_preds)
    recall = recall_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    cm = confusion_matrix(all_labels, all_preds)

    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': cm.tolist()
    }
    return metrics
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix'):
    """Plot an (optionally normalized) confusion matrix as a heatmap."""
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt=".2f" if normalize else "d", cmap='Blues',
                xticklabels=classes, yticklabels=classes)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.title(title)
    plt.show()
App:
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import streamlit as st
from PIL import Image
import torch
from src.model import get_pretrained_model
from src.preprocess import preprocess_image
from io import BytesIO

@st.cache_resource()  # cache the loaded model object between Streamlit reruns
def load_model():
    model = get_pretrained_model('resnet18')
    model.eval()
    return model

model = load_model()
st.title("Image Classification")
# Display the probabilities (excerpt: `probabilities` and the predicted `label`
# come from the inference code earlier in the app)
st.write(f"Probabilities: Real: {probabilities[0][0]:.4f}, Fake: {probabilities[0][1]:.4f}")

ground_truth_label = 1
# Calculate precision, recall, F1-score if ground truth is available.
# Note: for a single image these degenerate to 1.0 when the prediction matches
# the ground truth and 0.0 otherwise; meaningful values require a labeled test set.
if ground_truth_label is not None:
    precision = float(label == ground_truth_label)
    recall = float(label == ground_truth_label)
    f1_score = 0.0
    if precision + recall:
        f1_score = 2 * (precision * recall) / (precision + recall)
    st.write(f"Precision: {precision:.4f}")
    st.write(f"Recall: {recall:.4f}")
    st.write(f"F1 Score: {f1_score:.4f}")
Chapter-7: REFERENCES
[1] Joshua Brockschmidt, Jiacheng Shang, and Jie Wu. On the Generality of Facial Forgery Detection. In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems.
[2] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. arXiv preprint arXiv:1806.02877v2, 2018.
[3] TackHyun Jung, SangWon Kim, and KeeCheon Kim. Deep-Vision: Deepfakes Detection
[4] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic Speech-Driven
2020.
[5] Hai X. Pham, Yuting Wang, and Vladimir Pavlovic. Generative Adversarial Talking Head:
Bringing Portraits to Life with a Weakly Supervised Neural Network. arXiv preprint
arXiv:1803.07716, 2018
[6] Yuezun Li, Siwei Lyu, “ExposingDF Videos By Detecting Face Warping Artifacts,” in
arXiv:1811.00656v3.
[7] Yuezun Li, Ming-Ching Chang and Siwei Lyu “Exposing AI Created Fake Videos by
[8] Huy H. Nguyen , Junichi Yamagishi, and Isao Echizen “ Using capsule networks to detect
[9] Umur Aybars Ciftci, ˙Ilke Demir, Lijun Yin “Detection of Synthetic Portrait Videos using
[10] https://www.kaggle.com/c/deepfake-detection-challenge/data
[11] Liu, M. Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., and Kautz, J. (2019).
[12] Park, T., Liu, M. Y., Wang, T. C., and Zhu, J. Y. (2019). Semantic image synthesis with
https://mrdeepfakes.com/forums/thread-deepfacelab-explained-and-usage-tutorial.
team/kerascontrib/blob/master/keras_contrib/losses/dssim.py.
[15] Lattas, A., Moschoglou, S., Gecer, B., Ploumpis, S., Triantafyllou, V., Ghosh, A., &
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp.
760- 769).
[16] Ha, S., Kersner, M., Kim, B Seo, S., & Kim, D. (2020, April). MarioNETte: few-shot
face reenactment preserving identity of unseen targets. In Proceedings of the AAAI Conference
[17] Deng, Y., Yang, J., Chen, D., Wen, F., & Tong, X.(2020). Disentangled and controllable
[18] Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H. P., P´erez, P., ... & Theobalt,
C. (2020). StyleRig: Rigging StyleGAN for 3D control over portrait images. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6142-6151).
[19] Li, L., Bao, J., Yang, H., Chen, D., & Wen, F. (2019). FaceShifter: Towards high fidelity
[20] Nirkin, Y., Keller, Y., & Hassner, T. (2019). FSGAN: subject agnostic face swapping and
[21] Olszewski, K., Tulyakov, S., Woodford, O., Li, H., & Luo, L. (2019). Transformable
Vision (pp. 7648-7657). [22] Chan, C., Ginosar, S., Zhou, T., & Efros, A. A. (2019). Everybody
(pp. 5933-5942).
[23] Thies, J., Elgharib, M., Tewari, A., Theobalt, C., & Nießner, M. (2020, August). Neural