Deepfake Image Detection System Using Vision Transformer
by
Ajit Kumar Maurya
Akash Gupta
Abhishek
Akash Singh
under the guidance of
Dr. Sarsij Tripathi
We declare that the work presented in this report titled “DEEPFAKE IMAGE
DETECTION SYSTEM USING VISION TRANSFORMER”, submitted to the
Computer Science and Engineering Department, Motilal Nehru National Institute of
Technology Allahabad, Prayagraj, for the award of the Bachelor of Technology degree
in Computer Science & Engineering, is our original work. We have not plagiarized or
submitted the same work for the award of any other degree. In case this undertaking is
found incorrect, we accept that our degree may be unconditionally withdrawn.
May, 2025
Allahabad
Ajit Kumar Maurya (20214051)
Akash Gupta (20214081)
Abhishek (20214268)
Akash Singh (20214271)
CERTIFICATE
May, 2025
Preface
The rapid evolution of Artificial Intelligence has brought forth remarkable innova-
tions, yet it has also introduced significant challenges - notably the rise of deepfake
technology. What began as an interesting technological advancement has quickly
become a pressing concern, as these sophisticated fake media creations increasingly
test our ability to distinguish authentic content from artificial manipulation.
Our major project tackles this growing challenge, explicitly focusing on the de-
tection of synthetic images. Through hands-on research and experimentation, we
have explored the intricate mechanisms behind deepfake creation while developing
practical approaches to identify manipulated content. The journey has been both
challenging and enlightening, revealing the complex nature of this technological phe-
nomenon.
This report outlines our research journey, documenting our systematic approach
to understanding and addressing deepfake detection. We discuss our chosen method-
ologies, outline the significant challenges encountered, and present our key findings.
Though conducted within the scope of a course project, our research yields insights
that we believe will prove valuable to the broader conversation surrounding digital
media authentication.
We are particularly indebted to Dr. Sarsij Tripathi, whose expertise and men-
torship proved instrumental throughout our research process. As concerns about
digital media manipulation continue to grow, we trust this work adds a meaningful
perspective to ongoing efforts in protecting digital integrity. Our findings, while
preliminary, point toward promising directions for future research in this critical
field.
Acknowledgements
Contents
Preface iv
Acknowledgements v
1 Introduction 1
1.1 Generation of DeepFake Images . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Steps in Deepfake Generation . . . . . . . . . . . . . . . . . . 3
1.1.2 Deepfake Generation Algorithms . . . . . . . . . . . . . . . . 4
1.1.3 Tools for Generating DeepFake Images . . . . . . . . . . . . . 7
1.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related Works 10
3 Methodology 15
3.1 High Level Overview of the ViT model . . . . . . . . . . . . . . . . . 16
3.2 Overview of Vision Transformer (ViT) Architecture . . . . . . . . . . 17
3.2.1 Architecture Overview . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Key Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 Delving into Various Layers of Vision Transformer (ViT) . . . 20
3.2.4 Advantages of Vision Transformer (ViT) in Deepfake Detection 22
3.3 Comparative Analysis of Vision Transformer (ViT) with Other Models 23
3.3.1 Summary of Comparison . . . . . . . . . . . . . . . . . . . . . 25
4 Experimental Setup 27
4.1 Tools and Technologies Used . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Dataset Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.1 Data Composition and Preprocessing . . . . . . . . . . . . . . 30
4.3 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.7 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.8 Model Saving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.9 Model Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 1
Introduction
The world of digital media stands at a critical turning point. Deepfakes - those
incredibly convincing fake videos and images created by sophisticated AI - are not
just another technological novelty. They represent a fundamental shift in how we
must approach digital truth, and their implications run deeper than many realize.
Think about the ripple effects: A carefully timed fake video showing a politician’s
inflammatory speech could upend an entire election. Social media’s lightning-fast
sharing mechanisms mean such content could spread worldwide before fact-checkers
even notice. These are not hypothetical scenarios - they are very real possibilities
that keep security experts up at night.
The personal stakes are equally concerning. Anyone could become a target.
Imagine waking up to find a convincingly fabricated video damaging your profes-
sional reputation or a fake news story wreaking havoc on your company’s stock price.
Financial markets, particularly sensitive to executive statements and interviews, face
unique vulnerabilities when deepfakes enter the picture.
What is particularly troubling is how accessible this technology has become.
Creating deepfakes no longer requires extensive technical expertise or expensive
equipment. With open-source tools and online tutorials readily available, this capa-
bility has moved beyond sophisticated criminals to a much wider audience. While
precise statistics on deepfake prevalence remain elusive, the trend is clear - reports
of synthetic media targeting everyone from high-profile celebrities to private citizens
continue to climb at an alarming rate. Social media’s viral nature only compounds
this problem, creating a perfect storm for rapid disinformation spread.
The technology’s democratization presents a double-edged sword: while it drives
innovation, it also lowers barriers for potential misuse. As we have seen from numer-
ous recent incidents, the targets of deepfakes span the entire social spectrum - from
public figures to private citizens, leaving no one immune to this emerging threat.
1.1 Generation of DeepFake Images
Deepfake images are typically created using deep learning algorithms, particularly
Generative Adversarial Networks (GANs). A GAN consists of two neural networks, a generator
and a discriminator, that are trained concurrently in competition with each other.
Generator:
The generator network accepts random noise as input and generates artificial images.
Initially, these images might exhibit low quality and realism.
Discriminator:
The discriminator network is trained to differentiate between genuine and fabricated
images. It offers feedback to the generator, aiding in enhancing the quality of its
generated outputs.
1.1.1 Steps in Deepfake Generation
Data Collection
Deepfake algorithms require a large dataset of real images to learn from. These
images are typically sourced from the internet or specialized databases.
Preprocessing
Before training the model, the images need to be preprocessed to ensure consistency
in size, format, and quality. Typical preprocessing methods consist of normalization,
cropping, and scaling.
Model Training
The Generative Adversarial Network (GAN) is trained using the gathered dataset.
During training, the generator learns to produce increasingly realistic fake images,
while the discriminator learns to differentiate between real and fake images.
Fine-tuning
After initial training, the model may undergo fine-tuning to improve its performance
further. This involves adjusting hyperparameters, optimizing the architecture, or
training on additional data.
Generation
Once the model is trained, it can be used to generate Deepfake images. By providing
random noise as input to the generator, the model produces fake images that closely
resemble real ones.
1.1.2 Deepfake Generation Algorithms
1. Generative Adversarial Networks (GANs)
Generator:
The generator network takes random noise as input and learns to generate fake images
from it. It aims to produce images that are indistinguishable from real images in the
training dataset.
Discriminator:
The discriminator network is trained to differentiate between authentic and fake
images. It provides feedback to the generator, guiding it to generate more realistic
images over time.
Training Process
Throughout training, the generator and discriminator are optimized within a min-
imax game framework. The generator’s goal is to deceive the discriminator, while
the discriminator’s objective is to accurately classify between real and fake images.
This adversarial training method results in the creation of realistic synthetic images.
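To make the adversarial setup concrete, the sketch below shows one simplified training step of a GAN in PyTorch. The tiny fully connected networks, noise dimension, and learning rates are illustrative assumptions, not the architecture of any particular deepfake generator.

```python
import torch
import torch.nn as nn

# Illustrative toy networks; real deepfake generators are far larger and convolutional.
latent_dim = 100
generator = nn.Sequential(nn.Linear(latent_dim, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 1))  # outputs a single real/fake logit

criterion = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def gan_train_step(real_images):            # real_images: (batch, 784) tensor
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator update: classify real images as 1 and generated images as 0.
    fake_images = generator(torch.randn(batch, latent_dim)).detach()
    loss_d = criterion(discriminator(real_images), real_labels) + \
             criterion(discriminator(fake_images), fake_labels)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to make the discriminator label generated images as real.
    loss_g = criterion(discriminator(generator(torch.randn(batch, latent_dim))), real_labels)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Repeating this step over many batches plays out the minimax game: the discriminator loss pushes the two classes apart, while the generator loss pulls the generated samples toward the "real" decision.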
2. Autoencoders
Autoencoders are another class of neural networks used for image generation tasks.
They consist of an encoder network, which compresses the input image into a lower-
dimensional latent space, and a decoder network, which reconstructs the original
image from the latent representation.
Encoder
The encoder network learns to map input images to a latent space representation,
capturing the essential features of the input images.
Decoder
The decoder network takes samples from the latent space and reconstructs the orig-
inal images. By training the autoencoder to minimize the reconstruction error, it
learns to generate images that resemble the training data.
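The encoder-decoder structure can be summarised in a few lines of PyTorch. The layer sizes below are placeholders chosen for illustration, assuming flattened images, and training simply minimises the reconstruction error.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal autoencoder: compress an image into a latent code, then reconstruct it."""
    def __init__(self, image_dim=784, latent_dim=64):
        super().__init__()
        # Encoder: map the flattened image to a lower-dimensional latent representation.
        self.encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: reconstruct the image from the latent representation.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, image_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(8, 784)                             # a dummy batch of flattened images
reconstruction_loss = nn.MSELoss()(model(x), x)    # minimised during training
```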
3. Deep Convolutional Neural Networks (CNNs)
Deep Convolutional Neural Networks (CNNs) are widely used as the backbone ar-
chitecture in both GANs and autoencoders for image generation tasks. CNNs are
adept at learning hierarchical features from images, making them well-suited for
generating visually realistic images.
Convolutional Layers
Convolutional Neural Networks (CNNs) are composed of multiple layers of convo-
lutional filters that are trained to capture features from input images across various
levels of abstraction.
Generator Architecture
In GANs, the generator network typically consists of upsampling layers followed by
convolutional layers, which progressively generate higher-resolution images from the
input noise.
Discriminator Architecture
The discriminator network in GANs also employs convolutional layers to process
input images and make binary classification decisions.
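As a rough illustration of these two architectures, the DCGAN-style sketch below builds a generator from transposed (upsampling) convolutions and a discriminator from ordinary convolutions. The 64x64 output resolution and channel widths are assumptions for the example, not parameters of any specific deepfake system.

```python
import torch
import torch.nn as nn

# Generator: upsample a (100, 1, 1) noise tensor into a 64x64 RGB image.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # -> 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # -> 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(),  # -> 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1),   nn.BatchNorm2d(32),  nn.ReLU(),  # -> 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1),    nn.Tanh(),                       # -> 64x64
)

# Discriminator: stacked convolutions ending in a single real/fake logit.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1),    nn.LeakyReLU(0.2),   # -> 32x32
    nn.Conv2d(64, 128, 4, 2, 1),  nn.LeakyReLU(0.2),   # -> 16x16
    nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),   # -> 8x8
    nn.Conv2d(256, 1, 8, 1, 0),   nn.Flatten(),        # -> (batch, 1) logit
)

fake = generator(torch.randn(4, 100, 1, 1))   # (4, 3, 64, 64)
logits = discriminator(fake)                  # (4, 1)
```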
Figure 2: Deepfake Generation Process
users to swap faces in images and videos. It provides both GUI and command-
line interfaces and supports various deep learning backends, including Tensor-
Flow and PyTorch.
• FakeApp: FakeApp is one of the earliest deepfake creation tools that gained
popularity. It is based on the Autoencoder algorithm and allows users to create
deepfake videos by swapping faces in existing videos with minimal user input.
1.2 Motivations
The emergence of deepfakes, intricately crafted synthetic media that depict real indi-
viduals in fictional scenarios, poses a significant challenge to our digital landscape.
Their capacity to distort reality and undermine trust in information underscores
the critical need for robust deepfake detection systems. Here’s why this research is
crucial:
1. Combating the Disinformation Threat: Deepfakes can be weaponized
to spread misinformation and influence public opinion. Malicious actors can fab-
ricate compromising statements from politicians or create damaging scandals in-
volving celebrities. Developing reliable deepfake detection systems is essential for
distinguishing truth from fiction in today’s information age.
2. Safeguarding Public Trust: The proliferation of deepfakes erodes public
trust in media sources and public figures. By detecting deepfakes, we can uphold
the integrity of information and enhance public confidence in institutions.
3. Mitigating Social Engineering Risks: Deepfakes pose risks in the form
of social engineering attacks, where individuals may be manipulated into revealing
sensitive information or engaging in harmful actions. Detection systems are crucial
for identifying and addressing these deceptive tactics.
4. Cultivating a Healthier Online Environment: Uncontrolled dissemina-
tion of deepfakes can contribute to a toxic online environment. Detection systems
promote authenticity and accountability, fostering a more positive digital space for
all participants.
5. Driving Technological Advancements: Research in deepfake detection
contributes significantly to the broader field of artificial intelligence. By developing
methods to identify synthetic media, we advance the capabilities of computer vision
and machine learning, leading to innovations that transcend the specific challenge
of deepfakes.
Chapter 2
Related Works
The detection of Deepfake imagery has become a critical research area, spurred
by the rapid advancement and proliferation of generative techniques. While Con-
volutional Neural Networks (CNNs) have been the cornerstone of initial detection
efforts, the landscape is evolving. Recently, transformer-based architectures, partic-
ularly the Vision Transformer (ViT), have shown significant promise due to their
ability to capture global dependencies within images. This section reviews key prior
works that highlight the transition towards and advancements within ViT-based
Deepfake detection, providing context for the methodology adopted in this project.
Methodology
The authors introduced the Convolutional Vision Transformer (CVT), integrating a
CNN module with the standard ViT architecture. The CNN component was used to
extract learnable local features, which were then processed by the ViT component
leveraging its attention mechanism to capture global relationships and classify the
input. The model was trained and evaluated on the DeepFake Detection Challenge
(DFDC) dataset.
Findings
The proposed CVT achieved competitive results (reported accuracy of 91.5% and
AUC of 0.91 on DFDC), demonstrating the potential benefit of combining the local
feature extraction strength of CNNs with the global context modeling capabili-
ties of ViT for detecting Deepfakes. This work highlighted the feasibility of using
transformer-based models in this domain.
Methodology
The proposed scheme utilized a Vision Transformer architecture but added a distilla-
tion token, a technique borrowed from the DeiT (Data-efficient Image Transformer)
model, to improve training efficiency and performance, especially with potentially
limited datasets compared to typical ViT pre-training scales. The input to the model
considered both standard ViT patch embeddings and features extracted via CNNs,
aiming to create a more generalized model.
Findings
The study suggested that combining patch embeddings, CNN features, and distilla-
tion techniques within a ViT framework could lead to more robust and generalized
Deepfake detection models. This approach aimed to leverage multiple feature types
and training strategies to improve upon standard ViT performance for the specific
task of Deepfake identification. (Note: Specific performance metrics might be de-
tailed in the full paper).
Methodology
DFDT is an end-to-end framework featuring four main components: patch extraction
and embedding, a multi-stream transformer block utilizing a re-attention mecha-
nism (instead of standard multi-head self-attention to potentially improve scalabil-
ity), an attention-based patch selection module to focus on informative regions, and
a multi-scale classifier. It was designed to learn both local image features and global
pixel relationships indicative of forgery.
Findings
DFDT demonstrated high detection rates on established benchmarks, achieving re-
ported accuracies of 99.41% on FaceForensics++, 99.31% on Celeb-DF (V2), and
81.35% on the challenging WildDeepfake dataset. The authors also highlighted the
framework’s excellent cross-dataset and cross-manipulation generalization capabili-
ties, indicating its effectiveness and robustness.
CNNs for Deepfake detection, providing practical insights.
Methodology
The authors systematically trained and benchmarked different ViT models and CNN
baselines (like EfficientNet) on standard Deepfake datasets (e.g., FaceForensics++,
DFDC). They often used CNNs (like EfficientNet-B0) as feature extractors combined
with transformer blocks. Their analysis included performance metrics, robustness to
perturbations, and generalization across different forgery types (cross-forgery anal-
ysis).
Findings
The empirical results confirmed that ViT-based models could achieve state-of-the-art
performance. Notably, their studies often concluded that ViTs demonstrated supe-
rior generalization capabilities compared to CNNs when tested on unseen forgery
methods. However, they also noted that ViTs might require substantial data or
pre-training. Their work underscored the practical trade-offs and highlighted ViT’s
robustness advantage.
Methodology
The M2TR (Multi-modal Multi-scale Transformer) framework processes both stan-
dard RGB visual data and frequency domain information (e.g., phase spectrum)
as separate modalities. Within the transformer architecture, it employs a multi-
scale patch strategy to capture artifacts that might manifest at different resolutions.
Cross-modal attention mechanisms are used to fuse information effectively.
Findings
The study showed that incorporating multi-modal data (RGB + frequency) and
analyzing features at multiple scales significantly improved detection performance
and robustness, especially against common corruptions like compression. This work
demonstrated the flexibility of the ViT framework for integrating domain-specific
knowledge (frequency artifacts) and advanced processing strategies (multi-scale).
The works reviewed here illustrate the increasing adoption and adaptation of
Vision Transformers for Deepfake detection. From early hybrid models combining
CNN strengths with ViT’s global view, to specialized end-to-end transformer frame-
works like DFDT, and multi-modal approaches like M2TR, the research demon-
strates ViT’s versatility. Empirical studies consistently highlight ViT’s potential
for better generalization compared to traditional CNNs, albeit sometimes requiring
careful training strategies. These advancements provide a strong foundation and
motivation for employing ViT architectures, as explored in this project, to tackle
the challenges of sophisticated Deepfake image detection.
Chapter 3
Methodology
The methodology adopted for the development of the Deepfake image detection
model is based on the Vision Transformer (ViT) architecture. This approach inte-
grates cutting-edge transformer-based techniques into the field of computer vision,
deviating from traditional convolutional architectures. The process initiates with
dataset collection and preprocessing, involving real and fake image datasets.
Subsequently, the dataset is partitioned into training, validation, and testing
subsets to ensure robust evaluation. Data augmentation techniques are employed to
enhance the model’s ability to generalize across diverse manipulations and variations.
The Vision Transformer (ViT) model is then utilized, which segments the input
images into patches, embeds them into vectors, and processes them using a series of
transformer encoder blocks. A classification head is appended for final prediction.
The model is trained using the Adam optimizer, with carefully chosen hyperpa-
rameters such as learning rate, batch size, and number of epochs. During training,
key metrics such as training loss, validation loss, and accuracy are monitored to
prevent overfitting and ensure convergence.
Finally, the trained model’s performance is evaluated on the test set using var-
ious evaluation metrics, including accuracy, precision, recall, F1-score, confusion
matrix, and ROC-AUC curve, providing a comprehensive insight into its efficacy
for Deepfake detection tasks. This methodology ensures a rigorous and state-of-
the-art pipeline, leveraging transformer-based learning for robust Deepfake image
classification.
Initially, the user provides an input image, which is forwarded to the preprocess-
ing module where the image is resized, normalized, and divided into patches. Each
patch is then linearly embedded and positionally encoded before being passed into
the Vision Transformer encoder blocks. After sequential processing through self-
attention mechanisms and feed-forward networks, a final embedding is produced
and passed through the classification head.
The output consists of the predicted label — indicating whether the input image
is real or fake — which is communicated back to the user or evaluation system.
1. Start: The process begins with the acquisition of real and fake images from
datasets.
8. Evaluation: Predictions are compared with ground truth labels to compute
evaluation metrics.
$z_0^i = x_p^i W_e + b_e$

where $x_p^i$ is the $i$-th flattened patch, and $W_e$ and $b_e$ are the learnable weights and bias of the projection layer.
3. Positional Embedding: Since Transformer models are permutation-invariant,
ViT adds learnable positional embeddings to each patch embedding to retain
spatial information:
$z_0^i = z_0^i + E_{pos}^i$

where $E_{pos}^i$ represents the learnable positional encoding for the $i$-th patch.
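These two embedding steps can be expressed directly in PyTorch. The sketch below mirrors the configuration used later in this report (224x224 inputs, 32x32 patches, 1024-dimensional embeddings) but omits the [CLS] token for brevity.

```python
import torch
import torch.nn as nn

image_size, patch_size, channels, dim = 224, 32, 3, 1024
num_patches = (image_size // patch_size) ** 2        # 7 * 7 = 49 patches
patch_dim = channels * patch_size * patch_size       # 3 * 32 * 32 = 3072

patch_projection = nn.Linear(patch_dim, dim)                     # x_p^i W_e + b_e
pos_embedding = nn.Parameter(torch.randn(1, num_patches, dim))   # learnable E_pos

def embed(images):                                   # images: (B, 3, 224, 224)
    B = images.shape[0]
    # Cut the image into non-overlapping patches and flatten each patch.
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, num_patches, patch_dim)
    # Linear projection of each patch plus its positional embedding.
    return patch_projection(patches) + pos_embedding
```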
Figure 3: Vision Transformer (ViT) Architecture
• Global Context Modeling: Unlike CNNs which capture local patterns, ViT
captures global relationships between all image patches through self-attention,
making it highly sensitive to subtle Deepfake artifacts spread across the image.
• Scalability: ViT can easily scale up in terms of model size, patch size, and
dataset size, achieving state-of-the-art results given sufficient data and com-
pute.
• Transfer Learning: Pre-trained ViT models on large datasets like ImageNet-
21k or JFT-300M can be fine-tuned efficiently on downstream tasks, including
Deepfake detection.
• Patch Embedding Layer: This initial layer partitions the input image into
fixed-size patches, flattens each patch, and projects it into a dense vector
space. It acts similarly to the convolutional layer in CNNs but treats patches
independently, facilitating sequence processing by the Transformer.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the
dimension of the keys.
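This formula translates almost literally into code; the function below implements single-head scaled dot-product attention for a batch of patch-token sequences.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with shapes (batch, tokens, d)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)             # attention weights over all patches
    return weights @ V

# Example: 50 tokens (49 patches plus [CLS]) with 64-dimensional heads.
Q = K = V = torch.randn(2, 50, 64)
out = scaled_dot_product_attention(Q, K, V)          # (2, 50, 64)
```

Multi-head self-attention runs several such attention operations in parallel on learned projections of the same token sequence and concatenates the results.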
Figure 4: Layer-wise Structure of Vision Transformer (ViT)
high-resolution images by dividing them into patches and processing them as
a sequence. This enables better detection of fine-grained manipulations often
missed by traditional CNNs.
1. ResNet (Residual Networks): ResNet introduced the concept of residual
connections to allow very deep networks to train effectively. Although ResNet
achieves remarkable results on many image recognition tasks, its convolutional
structure limits its ability to capture long-range dependencies within images.
In contrast, ViT’s self-attention mechanism naturally models these global in-
teractions, which are particularly important for identifying subtle, widespread
artifacts in Deepfake images.
3.3.1 Summary of Comparison
In conclusion, while CNNs like ResNet, VGG, and EfficientNet have demonstrated
strong performance on standard image classification tasks, Vision Transformer (ViT)
presents a transformative approach by leveraging self-attention to model global fea-
tures directly. This makes ViT particularly well-suited for Deepfake detection, where
identifying distributed and subtle inconsistencies across the entire image is essential.
Figure 5: Comparison of ViT with other models
Chapter 4
Experimental Setup
The experimental setup for this study involved the development of an anti-spoofing
(deepfake detection) system using a Vision Transformer (ViT) architecture, specif-
ically implemented from scratch using the vit-pytorch library. The dataset was
organized into real (client) and fake (imposter) images, with their respective paths
loaded through structured text files. No external dataset platform such as Kaggle was used;
instead, images were accessed directly from mounted Google Drive storage.
Preprocessing steps included resizing all images to 256×256 pixels, followed by
a center crop to 224×224 pixels to match the input size expected by the Vision
Transformer model. Standard ImageNet normalization parameters (mean = [0.485,
0.456, 0.406], std = [0.229, 0.224, 0.225]) were applied to ensure consistency with
the ViT training distribution. The dataset was split into training and validation
sets while maintaining the class balance between real and fake images.
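These preprocessing steps correspond to a standard torchvision transform pipeline; a sketch is shown below (the exact transform composition used in the project may differ slightly).

```python
from torchvision import transforms

# Resize, center crop, convert to tensor, and apply ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```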
The Vision Transformer model was configured with the following specifications: a
patch size of 32×32, an embedding dimension of 1024, 6 transformer encoder layers,
16 attention heads, and a feed-forward MLP dimension of 2048, with a dropout rate
of 0.1. The model was trained from scratch without any ImageNet pretraining or
additional Dense layers.
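Using the vit-pytorch library, a model with these hyperparameters can be constructed roughly as follows; the single-logit output is an assumption consistent with the BCE-with-logits loss described below.

```python
import torch
from vit_pytorch import ViT

model = ViT(
    image_size=224,   # input resolution after center cropping
    patch_size=32,    # 32x32 patches -> 49 patches per image
    num_classes=1,    # one logit for the real-vs-fake decision
    dim=1024,         # patch embedding dimension
    depth=6,          # number of transformer encoder layers
    heads=16,         # attention heads per layer
    mlp_dim=2048,     # feed-forward hidden dimension
    dropout=0.1,
)

logits = model(torch.randn(4, 3, 224, 224))   # -> shape (4, 1)
```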
Training was conducted using the Binary Cross-Entropy with Logits Loss (BCE-
WithLogitsLoss) function, optimized with the Adam optimizer (initial learning rate
= 1e-4, weight decay = 5e-4). An exponential learning rate scheduler with a gamma
value of 0.45 was employed to progressively reduce the learning rate across epochs.
The batch size was set to 32 for both training and validation phases.
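These settings map directly onto PyTorch objects; a minimal sketch, reusing the model defined above, is given below.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()   # binary cross-entropy applied to raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.45)
# Calling scheduler.step() once per epoch multiplies the learning rate by 0.45.
```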
Model checkpointing was implemented to save the model with the best validation
accuracy automatically. Training was performed for 5 epochs using GPU acceler-
ation to expedite convergence. Throughout training, metrics such as training loss,
validation loss, training accuracy, and validation accuracy were tracked and plotted.
For evaluation, the saved best model was loaded and assessed on the validation
set. Performance metrics included accuracy, precision, recall, F1-score, ROC AUC
score, confusion matrix analysis, APCER (Attack Presentation Classification Error
Rate), BPCER (Bona Fide Presentation Classification Error Rate), ACER (Average
Classification Error Rate), and Equal Error Rate (EER). Visualizations such as the
ROC curve, confusion matrix heatmap, and error rate plots were also generated to
better interpret the model’s behavior.
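A sketch of how these metrics can be computed from the validation outputs is given below, assuming NumPy arrays of ground-truth labels (1 = real, 0 = fake, as in Section 4.4) and predicted probabilities of the real class; the 0.5 threshold is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

def evaluate(y_true, y_score, threshold=0.5):
    y_pred = (y_score >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_score)

    # APCER: fake (attack) images classified as real; BPCER: real (bona fide) images classified as fake.
    fake, real = (y_true == 0), (y_true == 1)
    apcer = float(np.mean(y_pred[fake] == 1))
    bpcer = float(np.mean(y_pred[real] == 0))
    acer = (apcer + bpcer) / 2

    # EER: the ROC operating point where the false positive rate equals the false negative rate.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    eer = float(fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))])
    return {"accuracy": acc, "roc_auc": auc, "apcer": apcer,
            "bpcer": bpcer, "acer": acer, "eer": eer}
```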
All experiments and implementations were performed using PyTorch, torchvi-
sion, and supporting Python libraries such as NumPy, scikit-learn, and Matplotlib,
within a Google Colab environment.
4.1 Tools and Technologies Used
– vit-pytorch: A PyTorch-based library used to implement the Vision Trans-
former architecture from scratch.
– Vision Transformer (ViT): Transformer-based model architecture em-
ployed for binary classification of real and fake images.
– Matplotlib: Used for plotting training and validation loss curves, ROC
curves, confusion matrices, and error rate plots.
• Development Environment:
The efficient integration of these tools and technologies facilitated the effective
development, training, evaluation, and visualization of the deepfake detection system
based on the Vision Transformer model.
4.2.1 Data Composition and Preprocessing
The dataset contained two classes: real (client) images and fake (imposter) images.
The dataset was then split into training, validation and test sets with an ap-
proximately balanced distribution between real and fake classes to ensure unbiased
training.
4.4 Data Processing
For training the Vision Transformer (ViT) model, standardized preprocessing steps
were necessary to prepare the images appropriately for input.
Initially, real (client) and fake (imposter) image paths were loaded from separate
text files. Labels were assigned as 1 for real images and 0 for fake images. The images
were not manually sampled or randomly selected; the entire available dataset was
used, ensuring balanced representation between the two classes.
Each image underwent the following preprocessing steps:
• Resized to 256 × 256 pixels.
• Center cropped to 224 × 224 pixels to match the ViT input requirements.
• Normalized using the standard ImageNet mean and standard deviation values.
4.5 Model Architecture
Instead of using a pre-trained ViT model, a new ViT model was initialized with
custom parameters specifically for this task. The architecture includes:
• An image patching layer that divides the input image into fixed-size patches.
• A classification token ([CLS]) that aggregates the representation for final clas-
sification.
• A final MLP head consisting of a single linear layer outputting a binary decision
(REAL or FAKE).
4.6 Model Training
The Vision Transformer (ViT)-based model was trained using a custom dataset
loader, with real and fake images provided via specific text files listing image paths.
Training configurations:
• Batch size: 32
• Number of epochs: 5
• The model achieving the highest validation accuracy during training was saved
to disk.
The model training was managed using custom train() and validate() func-
tions, with real-time evaluation on validation data at the end of each epoch. Only
validation accuracy was primarily monitored, and the best model was checkpointed
for further evaluation.
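The project's train() and validate() functions are not reproduced here; the skeleton below shows, in simplified form, how the epoch loop, scheduler step, and best-model checkpointing fit together (the helper names and checkpoint file name are assumptions).

```python
import torch

best_val_acc = 0.0
for epoch in range(5):
    train_loss, train_acc = train(model, train_loader, criterion, optimizer)  # assumed helpers
    val_loss, val_acc = validate(model, val_loader, criterion)
    scheduler.step()   # exponential learning-rate decay once per epoch

    # Keep only the weights of the model that performs best on the validation set.
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "vit_best.pt")
    print(f"epoch {epoch + 1}: train_acc={train_acc:.3f}, val_acc={val_acc:.3f}")
```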
4.7 Model Evaluation
After training, the Vision Transformer (ViT) model’s performance was evaluated on
an unseen validation dataset to assess its generalization capabilities.
The following evaluation metrics were calculated:
• ROC AUC Score: The Receiver Operating Characteristic - Area Under Curve
(ROC AUC) was computed to assess the model’s ability to distinguish between
classes.
• BPCER (Bona Fide Presentation Classification Error Rate): The rate at which
real (bona fide) images were misclassified as fake.
• ROC Curve: Plotted to visualize the trade-off between true positive rate and
false positive rate.
Visualization tools like Matplotlib were used to plot the ROC curve, aiding in
the interpretation of the model’s performance.
4.8 Model Saving
Only the model parameters were saved, without storing the optimizer state or
training configuration. To reload and use the model, the model architecture must
be redefined before loading the saved weights.
A version-controlled saving approach was adopted, where models were saved
based on their validation accuracy. The saved files were named accordingly to reflect
the best-performing model checkpoints.
The model files were saved in the standard .pt format and stored on Google
Drive for persistent access.
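Because only the parameters (state_dict) are stored, the same architecture must be rebuilt before the weights can be loaded; a sketch of this save/load pattern follows (the file name is illustrative).

```python
import torch
from vit_pytorch import ViT

# Saving: parameters only, without the optimizer state or training configuration.
torch.save(model.state_dict(), "vit_deepfake_best.pt")

# Loading: re-create the architecture with identical hyperparameters, then restore the weights.
restored = ViT(image_size=224, patch_size=32, num_classes=1,
               dim=1024, depth=6, heads=16, mlp_dim=2048, dropout=0.1)
restored.load_state_dict(torch.load("vit_deepfake_best.pt", map_location="cpu"))
restored.eval()   # switch to inference mode before making predictions
```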
4.9 Model Deployment
The trained model was deployed as a web application using Streamlit, which involved:
• Loading the saved ViT model weights (.pt file) and reconstructing the model
architecture in the Streamlit app.
• Passing the processed images through the ViT model to predict the probability
of the image being real or fake.
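The steps listed above can be sketched as a small Streamlit script; the weight-file path, decision threshold, and page layout are illustrative assumptions rather than the project's exact code.

```python
import streamlit as st
import torch
from PIL import Image
from torchvision import transforms
from vit_pytorch import ViT

# Rebuild the architecture and load the saved weights (path is illustrative).
model = ViT(image_size=224, patch_size=32, num_classes=1,
            dim=1024, depth=6, heads=16, mlp_dim=2048, dropout=0.1)
model.load_state_dict(torch.load("vit_deepfake_best.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 256)), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

st.title("Deepfake Image Detection")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded image")
    with torch.no_grad():
        prob_real = torch.sigmoid(model(preprocess(image).unsqueeze(0))).item()
    st.write(f"Probability that the image is REAL: {prob_real:.2%}")
    st.write("Prediction:", "REAL" if prob_real >= 0.5 else "FAKE")
```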
The application provides immediate feedback to the user regarding the authen-
ticity of the uploaded image. The deployment highlights:
Chapter 5
The deep learning model was developed using the PyTorch framework, implementing
a custom Vision Transformer (ViT) architecture specifically designed for binary
classification (real vs. fake images).
The dataset used for training consisted of images listed through curated text files,
differentiating real and fake samples. The images were preprocessed by resizing them to
256×256 pixels and normalizing them before they were fed into the model.
Training was conducted using the Adam optimizer with an exponential learning rate
schedule, as described in the experimental setup. The loss function employed was Binary
Cross-Entropy with logits, which is suitable for binary classification tasks.
Throughout the training process:
• The best model was selected based on the highest validation accuracy and
saved for evaluation.
• ROC AUC scores, confusion matrix analysis, and attack detection metrics such
as APCER and BPCER were used to further assess performance.
• The Receiver Operating Characteristic (ROC) curve was plotted to visualize
the model’s classification capability.
• APCER and BPCER: 0.114 and 0.039 respectively, highlighting low error rates
in classifying real and fake images
Figure 6: Receiver Operating Characteristic (ROC) curve
Figure 7: Confusion Matrix for the ViT Model
5.3.1 Frontend of the Application
5.4 Model Limitations
Despite the promising performance of the Vision Transformer (ViT)-based model
for real versus fake image classification, several limitations must be acknowledged:
1. Dataset Bias: The model’s effectiveness is closely tied to the quality, size,
and diversity of the training dataset. In this project, the dataset included a
limited variety of real and fake samples, which may introduce biases. Limited
diversity in spoofing techniques could restrict the model’s ability to generalize
across different types of attacks.
In summary, while the ViT-based model exhibits strong potential for detecting
spoofed images, addressing these limitations is vital to improve its robustness, scala-
bility, and ethical deployment. Ongoing research and interdisciplinary collaboration
are essential to build more resilient and responsible deep learning systems.
Chapter 6
In this project, we successfully developed a deep learning model to classify real and
fake images using the Vision Transformer (ViT) architecture. The model was trained
using a balanced dataset, featuring advanced preprocessing techniques to optimize
for robustness and generalization across multiple evaluation metrics.
Our experimental process involved meticulous preprocessing steps, including re-
sizing, normalizing, and augmenting the dataset to enhance the model’s capability.
The dataset was partitioned into training, validation, and test sets to ensure a com-
prehensive evaluation. The custom ViT model, with its linear classification head, proved
effective for feature extraction and classification.
The training regimen consisted of optimizing over 5 epochs with the Adam op-
timizer, applying an initial learning rate of 1e-4 with a decay strategy, and achieving
a validation accuracy of approximately 93.1%. This strategy fostered a robust conver-
gence of the model, characterized by high accuracy and minimal overfitting across
datasets.
Testing the model on an external test set, the ViT exhibited robust accuracy,
achieving a high ROC AUC score of 0.98, corroborated by detailed confusion ma-
trix analysis and F1-scores, demonstrating efficacy in distinguishing real from fake
images [2].
This project underscores the capabilities of leveraging Transformer-based archi-
tectures in media authenticity verification contexts, offering substantial advance-
ments in deepfake detection. Moving forward, extended research focusing on model
scalability and incorporation of diverse media types can significantly refine detection
technologies. As synthetic media continue to evolve, these innovations will play a
critical role in combating digital misinformation and its societal implications.
The future development of this project focuses on refining the model to enhance
its accuracy, robustness, and generalization against evolving and sophisticated im-
age manipulation techniques. Several promising directions for future research and
development include:
• Ensemble Learning Strategies: Using ensemble methods—such as bagging
or stacking multiple models—can reduce prediction variance and bias, leading
to more accurate and reliable detection outcomes.
References
2. Y.-J. Heo, Y.-J. Choi, Y.-W. Lee, and B.-G. Kim. “Deepfake Detection
Scheme Based on Vision Transformer and Distillation.” IEEE Access, vol.
9, pp. 75194-75203, 2021.
7. Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu. “Celeb-DF: A Large-scale Chal-
lenging Dataset for DeepFake Forensics.” Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2020.
12. S. Yu, Z. Li, H. Yang, and L. Zhao. “EfficientViT: Memory and Computation
Efficient Vision Transformers for Deepfake Detection.” Pattern Recognition,
vol. 144, 2024.
16. S. Lyu. “Deepfake Detection: Current Challenges and Next Steps.” In Pro-
ceedings of the 2023 IEEE International Conference on Acoustics, Speech and
Signal Processing Workshops (ICASSPW), 2023.