
Vision Transformers

A Project Report Submitted

by

Gaurav Kumar Gupta
(2101EE31)

to

Dr. Maheshkumar Kolekar

for the subject

EE508: Video Surveillance

DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY, PATNA

Feb 18, 2024

1. Introduction

Vision Transformers (ViT) represent a pivotal shift in the landscape of image
processing by adapting the transformer architecture, originally designed for
natural language processing, to the realm of computer vision. At its core, ViT
works by treating an image as a sequence of patches rather than pixels. Each
image is divided into fixed-size patches, which are then linearly embedded into
tokens, similar to how words are tokenized in text. These tokens are augmented
with positional encodings to retain their spatial relationships and then fed into a
series of transformer blocks. These blocks apply self-attention mechanisms to
learn contextual relationships among patches, allowing the model to understand
the image's content holistically.

1.1 Contrast with Convolutional Neural Networks

Traditional Convolutional Neural Networks (CNNs) have long been the go-to
architecture for image recognition tasks due to their ability to exploit spatial
hierarchies through the use of convolutional layers. CNNs inherently assume
that nearby pixels are more related than distant ones, which is beneficial for
tasks like edge detection or recognizing simple shapes. However, Vision
Transformers bypass this locality assumption by using attention mechanisms to
weigh the importance of all patches relative to one another, regardless of their
spatial proximity. This global context understanding can lead to superior
performance on complex data, particularly when trained on large datasets.

1.2 Significance in Modern Computer Vision

The introduction of Vision Transformers has been revolutionary in the field of
computer vision. They bring several advantages:
• Scalability: ViTs can scale effectively with model size and dataset size,
often outperforming CNNs when trained on vast amounts of data.
• Flexibility: Unlike CNNs, which are somewhat rigid in structure due to
their convolutional nature, transformers can be adapted for various
architectures or combined with other models for hybrid approaches.
• Transfer Learning: Pre-trained ViT models have shown excellent transfer
learning capabilities, allowing them to be fine-tuned for diverse vision
tasks with less data.

1.3 Revolutionizing Image Processing

Vision Transformers have effectively redefined how we approach image
processing tasks. Their ability to scale with data size has led to significant
breakthroughs in areas like image classification, object detection, and
semantic segmentation, particularly when trained on large datasets. This
scalability implies that ViTs can leverage the vast amounts of internet data
for training, promising continual improvement in performance as more data
becomes available.

1.4 Impact on AI and Machine Learning

The adoption of Vision Transformers has broader implications for AI and
machine learning. It encourages a reevaluation of core principles in model
design, pushing for more flexible, data-driven architectures. The success of
ViTs has spurred research into hybrid models, where the strengths of CNNs
and transformers are synergistically combined, potentially leading to more
robust and efficient systems. Moreover, the principles of transformers, like
self-attention, are now being explored for various other modalities and
tasks, suggesting a convergence in methodologies across different branches
of AI. Vision Transformers have not only brought a fresh perspective to
image recognition but have also ignited a wave of innovation, influencing
how we think about and develop AI systems for visual understanding. Their
emergence marks a pivotal moment in the evolution of machine learning,
with far-reaching implications for both academic research and practical
applications.

2. History and Evolution

The journey of transformers from text to images is a narrative of innovative
adaptation in neural network architectures. Originally introduced in the
landmark paper "Attention Is All You Need" by Vaswani et al. in 2017 [1],
transformers were designed to address the limitations of recurrent neural
networks (RNNs) in handling long-range dependencies in sequential data.
Transformers quickly became the go-to architecture for many NLP tasks due to
their parallel processing capabilities and the introduction of the self-attention
mechanism. This mechanism allowed models like BERT, GPT, and T5 to
achieve unprecedented results in tasks like translation, text generation, and
understanding by focusing on relevant parts of the input sequence irrespective of
their positions. The leap to applying transformers to vision tasks came with the
work of Dosovitskiy et al. in 2020, who introduced Vision Transformers (ViT)
in "An Image is Worth 16x16 Words". Here, the principle was to treat an image
as a sequence of patches, analogous to words in a sentence, thus allowing the
application of transformer architectures directly to image data. This conceptual
bridge from NLP to vision marked a significant evolution in computer vision,
challenging the established hegemony of Convolutional Neural Networks
(CNNs) by demonstrating that transformer-based models could achieve or
surpass the performance of CNNs on large datasets.
Core Transformer Mechanics
• Attention Mechanism: At the heart of transformers is the attention
mechanism, which allows the model to weigh the importance of different
parts of the input data relative to one another. For NLP, this meant giving
attention to relevant words in a sentence; for vision, it is about understanding
the relevance of different image patches.
• Self-Attention in Sequence Data: Self-attention, a variant of attention,
enables the model to relate different positions of a single sequence in order to
compute a representation of the sequence. In ViT, this is extended to relate
patches within an image, capturing complex spatial interactions in a way that
does not rely on the local connectivity of CNNs.
The self-attention mechanism computes a weighted sum of the values at all
positions, where each weight is derived from a compatibility function between a
query and the corresponding key, with queries, keys, and values all derived from
the input data. This process allows the model to dynamically focus on different
parts of the input for each output position, providing a powerful tool for
understanding both the local and global context of images without explicit
convolutional operations.
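
As a concrete illustration, the following is a minimal single-head sketch of this scaled dot-product self-attention, written in PyTorch (an assumed framework here; the tensor and weight names are illustrative rather than taken from any particular implementation):

```python
import torch

def self_attention(x, W_q, W_k, W_v):
    # x: (batch, num_patches, dim) sequence of patch embeddings
    q, k, v = x @ W_q, x @ W_k, x @ W_v           # queries, keys, values
    d = q.size(-1)
    # compatibility of every query with every key, scaled by sqrt(d)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, N, N)
    weights = scores.softmax(dim=-1)              # attention weights
    return weights @ v                            # weighted sum of the values

# toy usage: 4 patches embedded in 8 dimensions
x = torch.randn(1, 4, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)     # torch.Size([1, 4, 8])
```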

3. Working of Vision Transformers
As outlined in Section 2, Dosovitskiy et al. [2] proposed treating an image as a
sequence of patches, analogous to words in a sentence, so that transformer
architectures can be applied directly to image data. The steps below describe
how a Vision Transformer processes an image in practice.

Figure 1. Vision Transformer Architecture

1. Patch Embedding: Vision Transformers (ViT) begin processing an image by
dividing it into a series of fixed-size, non-overlapping patches. Typically, an
image of size H×W with 3 color channels is split into patches of size P×P,
resulting in H×W/P^2 patches. Each patch is then flattened into a
one-dimensional vector. This vector is linearly projected into an embedding space
of dimension D, which is consistent with the transformer's input requirements.
This process can be viewed as creating a sequence of tokens from the image,
similar to how words are tokenized in NLP, with each patch representing a
"visual token".

2. Positional Encoding: Since transformers do not inherently understand the
order or spatial arrangement of the input data due to their permutation-invariant
nature, positional information must be explicitly encoded. In ViT, positional
encodings are added to the patch embeddings to preserve the spatial relationships
among patches. These encodings can be learned during training or predefined,
similar to those used in NLP transformers:

• Learned Positional Encoding: Each position gets a unique, learnable
embedding added to its patch embedding.
• Fixed Positional Encoding: Sinusoidal encodings based on position, similar to
those in the original transformer paper, may be used, though ViT typically
opts for learned embeddings for flexibility and performance.

This step ensures that the model can differentiate between identical patch
content appearing at different positions in the image.
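
The snippet below sketches the learned variant, assuming PyTorch and the token shape from the patch-embedding example above (196 tokens of dimension 768); the class and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class AddPositionalEncoding(nn.Module):
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        # one learnable positional embedding per patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim); identical patch
        # content at different positions now gets distinct representations
        return patch_tokens + self.pos_embed

tokens = torch.randn(2, 196, 768)
print(AddPositionalEncoding()(tokens).shape)   # torch.Size([2, 196, 768])
```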

Transformer Blocks

Once the patches are embedded and their positions encoded, they are processed
through a series of transformer blocks. Each block includes:

• Multi-Head Self-Attention: This mechanism computes attention scores for
every pair of patches, allowing each patch to "attend" to every other patch in the
image. It does this multiple times (with different heads) to capture different
aspects of the relationships.
• Layer Normalization: Applied before each self-attention and feed-forward
sub-layer (with a final normalization after the last block) to stabilize the
learning process.
• Feed-Forward Networks (FFN): Fully connected layers that process each
position independently, adding non-linearity and complexity to the model's
feature representation.
• Residual Connections: To facilitate gradient flow during training, residual
connections skip over each main component (self-attention, FFN), adding the
input to the output of these layers.

This structure allows the model to capture both local and global dependencies
within the image, processing all patches simultaneously, much like how
transformers handle words in a sentence. The final output from these blocks can
be used for classification by adding a class token or for other vision tasks like
segmentation by further processing the spatial information retained throughout
the network.
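
A hedged sketch of one such block, written in PyTorch with the built-in nn.MultiheadAttention and a pre-norm layout (normalization before each sub-layer, residual connections around both); the sizes mirror a ViT-Base-like configuration but are only illustrative:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                  # position-wise feed-forward
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # multi-head self-attention with a residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # feed-forward network with a residual connection
        return x + self.ffn(self.norm2(x))

tokens = torch.randn(2, 197, 768)          # 196 patch tokens + 1 class token
print(TransformerBlock()(tokens).shape)    # torch.Size([2, 197, 768])
```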

4. Implementation, Training Challenges and
Optimization

Figure 2. Roadmap for ViT Implementation

4.1 Data Preparation

Large-Scale Datasets: Vision Transformers (ViT) thrive on vast amounts of
data due to their lack of inductive biases compared to CNNs. To perform
well, ViTs often require datasets at scales like JFT-300M or ImageNet-21k,
which provide the diversity and volume needed for learning rich feature
representations.
Data Augmentation: To compensate for potentially smaller datasets or to
enhance model robustness, data augmentation techniques are crucial.
Methods include random cropping, flipping, color jittering, and rotation.
These augmentations artificially expand the dataset, providing the model with
varied examples that help it generalize better.
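
As an illustration, a typical training-time augmentation pipeline built with torchvision (an assumed library; the specific transforms and magnitudes are illustrative choices rather than prescribed settings) might look like this:

```python
from torchvision import transforms

# illustrative training-time augmentation pipeline for a ViT
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),         # random cropping
    transforms.RandomHorizontalFlip(),         # flipping
    transforms.ColorJitter(0.4, 0.4, 0.4),     # color jittering
    transforms.RandomRotation(degrees=15),     # rotation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```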

4.2 Training Challenges

Resource Requirements: Training ViTs is computationally intensive due to
the quadratic cost of self-attention with respect to the sequence length
(number of patches). This necessitates powerful GPUs or TPUs, substantial
memory, and extended training times.

4.3 Strategies for Overcoming Computational Barriers:

• Model Parallelism: Distributing model parameters across multiple devices.


• Gradient Accumulation: Accumulating gradients over several mini-batches
before updating weights to simulate larger batch sizes on smaller memory
footprints (see the sketch after this list).
• Layer-wise Adaptive Rate Scaling (LARS): Optimizing learning rates for
different layers, which can help in training stability with large models.
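
A minimal sketch of gradient accumulation in PyTorch (assumed framework); model, loader, optimizer, and loss_fn stand for any standard training objects, and accum_steps is an illustrative setting:

```python
def train_one_epoch(model, loader, optimizer, loss_fn, accum_steps=8):
    # effective batch size = loader batch size * accum_steps
    model.train()
    optimizer.zero_grad()
    for step, (images, labels) in enumerate(loader):
        loss = loss_fn(model(images), labels) / accum_steps  # scale the loss
        loss.backward()                     # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:   # update only every accum_steps
            optimizer.step()
            optimizer.zero_grad()
```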

4.4 Fine-Tuning Vision Transformers

• Transfer Learning from Large Pre-Trained Models: ViTs benefit
significantly from transfer learning, where models pre-trained on large datasets
like ImageNet are fine-tuned on smaller, task-specific datasets. This approach
leverages the general visual understanding of the pre-trained model, often
leading to superior performance with less data.
• Layer Freezing: Keeping early layers static to preserve general features while
allowing later layers to adapt to new tasks (a sketch combining layer freezing,
a task-specific head, and mixed precision follows this list).
• Task-Specific Layers: Adding custom layers or heads on top of the transformer
for specific tasks like segmentation or detection.
• Hyperparameter Tuning: ViTs are sensitive to hyperparameters such as
learning rate, batch size, and the number of transformer blocks. Techniques like
grid search, random search, or more advanced methods like Bayesian
optimization are used to find the optimal settings.
• Mixed Precision Training: This involves using lower precision (like FP16) for
most computations to speed up training and reduce memory usage while
maintaining high precision (FP32) for critical parts like gradient accumulation,
thus balancing speed and accuracy.
• Efficient Attention Mechanisms: Due to the high computational cost of the
full self-attention mechanism, variants like "Linear Transformers" or
"Reformer" have been proposed, aiming to reduce complexity while retaining
much of the performance.
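
To make layer freezing, task-specific heads, and mixed precision concrete, the sketch below fine-tunes a pre-trained ViT from torchvision; the vit_b_16 checkpoint, the 10-class head, and the CUDA mixed-precision step are illustrative assumptions, not the report's prescribed setup:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# transfer learning: start from a ViT pre-trained on ImageNet
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# layer freezing: keep the pre-trained backbone static
for param in model.parameters():
    param.requires_grad = False

# task-specific head: replace the classifier for a hypothetical 10-class task
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)

# mixed precision training step (assumes the model and data are on a GPU)
scaler = torch.cuda.amp.GradScaler()

def train_step(images, labels, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # FP16 compute where safe
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()            # keep FP32 gradient updates stable
    scaler.step(optimizer)
    scaler.update()
```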

Figure 3. Comparison of ViT and ResNet (BiT) accuracies for different amounts of
pretraining data. The y-axis is the size of the ImageNet pretraining data; the
x-axis is the five-shot top-1 accuracy.

5. Conclusion

The advent of Vision Transformers (ViT) has fundamentally transformed the
landscape of computer vision, marking a significant departure from the
traditional reliance on Convolutional Neural Networks (CNNs). By
conceptualizing images as sequences of patches akin to words in text, ViT
leverages the power of transformer architecture to achieve remarkable results in
image classification, object detection, and beyond. The inherent scalability,
flexibility, and transfer learning capabilities of ViTs have not only challenged
the dominance of CNNs but have also spurred a broader rethinking of model
architectures in AI.

This shift has led to innovations like hybrid models that combine the strengths
of both transformers and convolutions, potentially ushering in a new era of AI
where models are more adaptable and efficient. The computational challenges
associated with training ViTs, due to their attention mechanisms, have catalyzed
advancements in model optimization, data handling, and hardware utilization,
setting new standards for what is possible in machine learning research and
application.

Looking forward, the implications of ViTs extend beyond current applications
into areas like multi-modal learning, where understanding across different data
types could be revolutionized. As we continue to refine these models, the focus
will likely intensify on making them more interpretable, robust against
adversarial threats, and efficient for deployment on diverse hardware platforms.
Vision Transformers not only represent a milestone in AI but also a beacon for
future explorations in how we process and interpret visual information. The
journey of ViTs from a groundbreaking idea to a practical tool underscores the
dynamic and ever-evolving nature of AI research, promising further innovations
that will continue to reshape our technological landscape.

6. References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ...
& Polosukhin, I. (2017). Attention is all you need. In Advances in Neural
Information Processing Systems (pp. 5998-6008).
2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., ... & Houlsby, N. (2021). An image is worth 16x16 words:
Transformers for image recognition at scale. In International Conference on
Learning Representations.
3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S.
(2020). DETR: End-to-end object detection with transformers. In European
Conference on Computer Vision (pp. 213-229). Springer, Cham.
4. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H.
(2021). Training data-efficient image transformers & distillation through
attention. In Proceedings of Machine Learning Research (Vol. 139, pp. 10347-
10357). PMLR.
5. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., & Veit, A.
(2021). ViT on small datasets. In Advances in Neural Information Processing
Systems (Vol. 34, pp. 21553-21564).
