A SYNOPSIS ON
Multimodal Deepfake Detection
Submitted in partial fulfilment of the requirement for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
Submitted by:
Student Name 1: Amol Joshi University Roll No. 2021648
Student Name 2: Ayush Kumar Singh University Roll No. 2021671
Student Name 3: Abhinav University Roll No. 2021635
Under the Guidance of
Dr. Deepak Gaur
Project Team ID: MP2025AI&DS18
Department of Computer Science and Engineering
Graphic Era (Deemed to be University)
Dehradun, Uttarakhand
September-2025
CANDIDATE’S DECLARATION
I/We hereby certify that the work which is being presented in the Synopsis entitled
“Multimodal Deepfake Detection” in partial fulfillment of the requirements for the award
of the Degree of Bachelor of Technology in Computer Science and Engineering in the
Department of Computer Science and Engineering of the Graphic Era (Deemed to be
University), Dehradun shall be carried out by the undersigned under the supervision of Dr.
Deepak Gaur, Associate Professor, Department of Computer Science and Engineering,
Graphic Era (Deemed to be University), Dehradun.
Amol Joshi 2021648 signature
Ayush Kumar Singh 2021671 signature
Abhinav 2021635 signature
The above-mentioned students shall be working under the supervision of the undersigned on
the project “Multimodal Deepfake Detection”.
Signature Signature
Internal Evaluation (By DPRC Committee)
Status of the Synopsis: Accepted / Rejected
Any Comments:
Name of the Committee Members: Signature with Date
1.
2.
Table of Contents
Chapter No. Description Page No.
Chapter 1 Introduction and Problem Statement 4-5
Chapter 2 Background/ Literature Survey 6-7
Chapter 3 Objectives 8
Chapter 4 Possible Approach/ Algorithms 9-10
Chapter 5 References 11
Chapter 1
Introduction and Problem Statement
In the following sections, a brief introduction and the problem statement for the work are
presented.
1.1 Introduction
The rapid growth of the internet and social media platforms has revolutionized the way information is
produced, consumed, and shared. While these advancements have made communication faster and
more accessible, they have also created an environment where misinformation and manipulated
content can spread at an unprecedented scale. Two of the most critical issues in this regard are fake
news and deepfakes.
Fake news fabricates or distorts textual information to mislead readers at scale. Deepfakes, on
the other hand, use advanced AI techniques such as Generative Adversarial Networks
(GANs) to manipulate images, audio, and video content. These fabricated media files can
convincingly mimic real individuals, creating false scenarios that can damage reputations, spread
propaganda, or even threaten national security.
These phenomena pose serious challenges to digital trust, cybersecurity, journalism, and democratic
systems. For instance, fake news articles can mislead public opinion during elections, while deepfake
videos can be weaponized for blackmail or misinformation campaigns. With the increasing
sophistication of these manipulations, traditional detection methods such as manual verification or
single-modality models are no longer sufficient.
To address this growing problem, researchers are turning towards Artificial Intelligence (AI) and
Deep Learning. In particular, Natural Language Processing (NLP) techniques like BERT have proven
highly effective in understanding textual semantics, while Computer Vision methods such as
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are capable of detecting
subtle inconsistencies in manipulated images and videos.
This project proposes a multimodal detection framework that integrates both text and image/video
analysis for a comprehensive solution. By combining BERT-based text classification with CNN +
ViT-based visual detection, the system aims to accurately identify fake news and deepfakes. This
approach not only increases detection accuracy but also mirrors real-world scenarios, where
misinformation often appears in multiple forms simultaneously.
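As an illustration of this design, the score-level (late) fusion of the two pipelines could be prototyped as in the minimal PyTorch sketch below. It assumes the Hugging Face transformers library; the bert-base-uncased checkpoint, the visual feature dimension, and the fusion weight alpha are illustrative placeholders rather than finalized design choices.

import torch
import torch.nn as nn
from transformers import BertModel

class LateFusionDetector(nn.Module):
    def __init__(self, visual_dim=768, alpha=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_head = nn.Linear(self.bert.config.hidden_size, 1)
        self.visual_head = nn.Linear(visual_dim, 1)   # on top of CNN/ViT features
        self.alpha = alpha                            # placeholder fusion weight

    def forward(self, input_ids, attention_mask, visual_features):
        text_feat = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        p_text = torch.sigmoid(self.text_head(text_feat))          # P(fake | text)
        p_vis = torch.sigmoid(self.visual_head(visual_features))   # P(fake | visuals)
        # Weighted average of the two modality scores gives the final decision.
        return self.alpha * p_text + (1 - self.alpha) * p_vis

Each modality produces an independent probability that the content is fake; a learned fusion layer over concatenated features is an equally plausible alternative to this fixed weighting.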
1.2 Problem Statement
The problem statement for the present work can be stated as follows:
The main problem lies in the rapid spread of misinformation through fake news articles and deepfake
media content. Fake news manipulates textual information to mislead readers, while deepfakes alter
images and videos so realistically that they are almost impossible to detect with the human eye. This
has created a serious challenge for digital trust, cybersecurity, and public awareness, as false content
can influence opinions, damage reputations, and even destabilize societies.
Traditional detection methods are either manual (slow and unscalable) or single-modality (focusing
only on text or only on images/videos), which makes them insufficient in combating the multimodal
nature of modern misinformation. Moreover, the increasing sophistication of AI-based content
generation tools makes it even harder to identify manipulated media accurately.
To address this problem, we propose building a multimodal AI-based detection system that combines
Convolutional Neural Networks (CNNs) for spatial feature extraction with Vision Transformers
(ViTs) for analyzing relationships across image patches and video frames.
Finally, we will fuse the outputs of both models to produce a unified decision on whether the given
content is real or fake.
Chapter 2
Background/ Literature Survey
The emergence of deepfake technology has introduced a major challenge to digital media authenticity.
Deepfakes are synthetic media — images, audio, or videos — generated or manipulated using deep
learning techniques, especially Generative Adversarial Networks (GANs) and autoencoders. These
manipulations can create hyper-realistic yet false content that is almost indistinguishable from
genuine media.
While deepfake technology has legitimate applications in entertainment, gaming, and accessibility, it
is increasingly misused for malicious purposes such as spreading misinformation, political
propaganda, character defamation, identity fraud, and cybercrime. The ability of deepfakes to erode
trust in visual evidence poses significant risks to cybersecurity, journalism, legal systems, and social
stability.
Traditional manual detection methods, such as visual inspection or forensic analysis, are no longer
effective given the realism and scale of modern deepfakes. This has led to growing interest in AI-
driven deepfake detection systems. Early approaches relied on Convolutional Neural Networks (CNNs)
to identify artifacts, inconsistencies, or unnatural features in images and videos. More recently, Vision
Transformers (ViTs) have been explored to capture global contextual information across image
patches, improving robustness against advanced manipulations.
Several researchers have attempted to tackle the problem of deepfake detection using different
techniques, ranging from traditional CNNs to more advanced transformer-based architectures. Below
is a summary of notable previous works:
I. MesoNet
MesoNet introduced a lightweight CNN architecture designed for real-time deepfake
detection. It focused on capturing mid-level patterns that distinguish fake from real faces.
While efficient, MesoNet struggled with generalization. It performed well on the dataset it
was trained on but failed when tested on unseen manipulation techniques. Its lightweight
nature also limited its ability to detect subtle, high-quality deepfakes.
II. XceptionNet
XceptionNet, a deeper CNN model, was trained on the FaceForensics++ dataset,
leveraging depthwise separable convolutions for efficient feature extraction. It became a
benchmark in early deepfake detection research.
Despite its strong within-dataset accuracy, however, XceptionNet's performance degrades
noticeably under heavy compression and drops sharply when evaluated on manipulation
techniques or datasets not seen during training, limiting its cross-dataset generalization.
III. Capsule Networks
Capsule Networks were used to capture part-whole relationships in facial structures,
aiming to detect inconsistencies caused by manipulation.
While promising, Capsule Networks were computationally expensive and had
unstable training, making them impractical for large-scale real-world use.
Chapter 3
Objectives
The objectives of the proposed work are as follows:
Develop a multimodal deepfake detection framework capable of detecting manipulated content.
For this, we will create a unified pipeline that integrates two complementary models: CNNs and
ViTs.
Leverage CNNs to extract local features such as facial texture inconsistencies, blending errors,
edge artifacts, or pixel-level distortions that typically appear in deepfake images and video
frames.
Incorporate Vision Transformers (ViTs) to capture long-range dependencies and relationships
between image patches, enabling the detection of subtle manipulations in high-quality deepfakes
that CNNs may overlook.
Design a robust and generalizable detection pipeline to ensure the system is not limited to one
dataset or manipulation technique by training and testing on multiple benchmark datasets. The
goal is to minimize overfitting to dataset-specific artifacts and enhance real-world applicability.
Benchmark and validate the proposed solution against state-of-the-art methods to compare the
hybrid CNN + ViT model with existing deepfake detection approaches in terms of accuracy,
precision, recall, scalability, and robustness under different noise and compression settings.
Chapter 4
Possible Approach/ Algorithms
1. Overview & Design Rationale
The goal is a robust deepfake detection system that generalizes across manipulation techniques
and real-world artifacts. Our proposed strategy is a hybrid model that combines:
i. CNN backbone(s) to capture local, pixel-level artifacts (texture, edges, blending), and
ii. Vision Transformer (ViT) layers to model global, long-range dependencies across image
patches (context, inconsistencies across face/background).
The system will operate primarily at the frame level (extracting frames from videos) and
aggregate frame-level predictions into a video-level decision using temporal pooling or a
lightweight temporal model. The pipeline emphasizes dataset diversity, augmentation, and
cross-dataset validation to reduce overfitting to dataset-specific artifacts.
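A minimal sketch of this hybrid design, assuming PyTorch and torchvision, is given below; the ResNet-18 backbone, embedding size, and encoder depth are illustrative choices rather than final hyperparameters.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class HybridCnnVit(nn.Module):
    def __init__(self, embed_dim=256, depth=4, heads=8):
        super().__init__()
        cnn = resnet18(weights=None)
        # Keep the convolutional stages only; drop global pooling and the fc head.
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # (B, 512, 7, 7)
        self.proj = nn.Conv2d(512, embed_dim, kernel_size=1)       # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, 49, embed_dim))     # 7x7 = 49 tokens
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 1)        # frame-level real/fake logit

    def forward(self, x):                          # x: (B, 3, 224, 224)
        feat = self.proj(self.backbone(x))         # (B, embed_dim, 7, 7)
        tokens = feat.flatten(2).transpose(1, 2) + self.pos   # (B, 49, embed_dim)
        tokens = self.encoder(tokens)              # global attention over patches
        return self.head(tokens.mean(dim=1))       # mean-pool tokens -> logit

The CNN stages supply artifact-sensitive local feature maps, which are re-interpreted as 49 patch tokens so that the transformer encoder can relate distant regions of the face and background.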
2. End-to-end Pipeline
Data collection: Obtain public datasets (FaceForensics++, DFDC, Celeb-DF, etc.). Keep
separate splits for cross-dataset testing.
Preprocessing:
i. Video to frame extraction.
ii. Face detection & alignment (crop to face bounding box + margin). Optionally keep whole-
frame variants for context.
iii. Resize frames to a fixed input (e.g., 224×224 or 256×256); a preprocessing sketch is given after this pipeline.
Data augmentation: Random crops, rotations, color jitter, Gaussian blur/noise, JPEG
compression, and random temporal jitter to mimic social-media artifacts.
Modeling:
i. CNN backbone to extract feature maps.
ii. Patch embedding + ViT encoder(s) to process either raw image patches or CNN feature patches
(hybrid).
Temporal aggregation — Average pooling over frame probabilities or a small
transformer/LSTM to produce a video-level prediction.
Evaluation & cross-dataset generalization — test across unseen datasets and compression
levels.
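For concreteness, preprocessing steps (i)-(iii) could look like the sketch below, using OpenCV. The Haar cascade face detector is a lightweight stand-in; a stronger detector such as MTCNN or RetinaFace would likely replace it in the actual pipeline.

import cv2

def extract_face_frames(video_path, every_n=10, size=224, margin=0.2):
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:                     # temporal subsampling
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, 1.1, 5)
            for (x, y, w, h) in faces[:1]:         # first detected face only
                m = int(margin * w)                # margin around the bounding box
                crop = frame[max(0, y - m):y + h + m, max(0, x - m):x + w + m]
                frames.append(cv2.resize(crop, (size, size)))
        idx += 1
    cap.release()
    return frames                                  # list of (size, size, 3) crops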
3. Training Strategy & Losses
Loss function: Binary Cross-Entropy (BCE) loss.
Optimizer: AdamW.
Regularization: dropout in the classification head and label smoothing, both of which help
reduce overconfidence and improve generalization.
Batch size: chosen according to available GPU memory. We plan to train for 20–50 epochs,
monitoring validation AUC and applying early stopping.
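A minimal training-loop sketch reflecting these choices is shown below; the data loader, the validation-AUC callback, and all hyperparameter values are assumptions for illustration, not tuned settings.

import torch

def train(model, train_loader, val_auc_fn, epochs=50, patience=5, eps=0.1):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    best_auc, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            y = y.float() * (1 - eps) + 0.5 * eps  # label smoothing for BCE
            loss = loss_fn(model(x).squeeze(1), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        auc = val_auc_fn(model)                    # validation AUC each epoch
        if auc > best_auc:                         # keep the best checkpoint
            best_auc, stale = auc, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            stale += 1
            if stale >= patience:                  # early stopping
                break
    return best_auc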
4. Evaluation Protocols and Metrics
Metrics: Accuracy, Precision, Recall, F1-score, AUC-ROC, and Balanced Accuracy. Report
both frame-level and video-level metrics.
Robustness testing: Evaluate with different compression levels, additive noise, and cross-dataset
generalization.
Comparison with previous strategies using the above evaluation metrics.
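The frame-level metrics listed above can be computed with scikit-learn, as in the sketch below; y_true and y_score are assumed to be collected from a held-out (ideally cross-dataset) test split.

from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def report(y_true, y_score, threshold=0.5):
    y_pred = [int(s >= threshold) for s in y_score]   # binarize model scores
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_score),    # uses raw scores, not labels
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }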
Chapter 5
References
[1] Afchar, D., Nozick, V., Yamagishi, J., & Echizen, I. (2018). MesoNet: A compact facial video forgery detection network. IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7.
[2] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2019). FaceForensics++: Learning to detect manipulated facial images. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1–11.
[3] Guera, D., & Delp, E. J. (2018). Deepfake video detection using recurrent neural networks. IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6.
[4] Nguyen, H. H., Yamagishi, J., & Echizen, I. (2019). Use of a Capsule Network to detect fake images and videos. arXiv preprint arXiv:1910.12467.
[5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR).
[6] Tran, H., He, X., Singh, A., Zheng, C., & Bui, T. (2021). Exploring self-attention for deepfake detection. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2500–2504.
[7] Dolhansky, B., Howes, R., Pflaum, B., Baram, N., & Ferrer, C. C. (2020). The DeepFake Detection Challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397.
[8] Li, Y., Yang, X., Sun, P., Qi, H., & Lyu, S. (2020). Celeb-DF: A large-scale challenging dataset for deepfake forensics. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3207–3216.
[9] Verdoliva, L. (2020). Media forensics and deepfakes: An overview. IEEE Journal of Selected Topics in Signal Processing, 14(5), 910–932.
[10] Tariq, S., Lee, S., Kim, H., Shin, Y., & Woo, S. S. (2018). Detecting both machine and human created fake face images in the wild. ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), pp. 81–87.