DL - QB Solution

The document discusses key concepts in neural networks, including the definition of neural networks, activation functions, and the role of loss functions. It explains the learning process of artificial neural networks (ANNs), the challenges of weight assignment, and the steps to build a feedforward neural network. Additionally, it compares the use of GPUs and CPUs in deep learning, highlights the advantages of mixed precision training, and addresses the benefits and challenges of using pre-trained models in computer vision tasks.

Uploaded by

aryanraina480
Module 1

2 Marks
1. What is a neural network? Contrast the function of a loss function in a neural network.

A neural network is a computational model, inspired by the human brain, composed of
interconnected nodes (neurons) arranged in layers. It processes input data to learn patterns and
make predictions. In contrast, a loss function does not process data; instead, it is a
mathematical function that measures the "error" or "cost" of the network's predictions
compared to the true values. While the network performs the task, the loss function quantifies
how well the task was performed, guiding the network to improve.
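As a small illustration of that distinction, the sketch below computes a mean squared error for a handful of predictions. The function name `mean_squared_error` and the sample numbers are made up for this example, and MSE is only one of several common loss functions.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and true values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical network outputs and the true values they should match.
predictions = [2.5, 0.0, 2.0]
targets = [3.0, 0.0, 2.0]
loss = mean_squared_error(predictions, targets)  # only the first prediction is off
```

A lower value means the predictions are closer to the targets; a perfect match gives a loss of zero.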

2. Define activation functions in the context of deep learning.

An activation function is a mathematical function applied to the output of each neuron in a
neural network. Its primary purpose is to introduce non-linearity into the model. This is critical
because without non-linearity, a deep network would behave like a simple linear model,
preventing it from learning the complex patterns found in data like images or text. Essentially, it
determines whether a neuron should be "activated" based on its weighted input sum.

3. List two examples of popular deep learning frameworks.

Two of the most popular deep learning frameworks are TensorFlow and PyTorch.

• TensorFlow, developed by Google, is known for its robust production deployment
capabilities and a comprehensive ecosystem of tools.

• PyTorch, developed by Facebook (now Meta), is praised for its flexibility, intuitive
Python-based interface, and dynamic computational graph, making it a favorite in the
research community.

4. What is the purpose of optimization in a neural network? Why is model optimization
essential for training?

The purpose of optimization is to systematically adjust the model's parameters (weights and
biases) to minimize the loss function. This process, often carried out by an optimizer algorithm
like Adam or SGD, is the core of how a network "learns" from data. It is essential for training
because it provides the mechanism for the model to improve itself. By iteratively updating its
parameters based on the calculated error, the model moves from a state of poor performance
to one that makes accurate predictions.

5. Illustrate distributed computing in the context of deep learning.

Distributed computing in deep learning accelerates training by using multiple processing units
(like GPUs or servers) simultaneously. The most common approach is Data Parallelism: the
model is copied to each processor, and the training dataset is split among them. Each
processor trains the model on its data subset, and the resulting updates (gradients) are then
averaged and applied to a master copy of the model. This allows for training on massive
datasets in significantly less time than on a single machine.
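The gradient-averaging step of data parallelism can be sketched on a toy problem. The example below is a minimal illustration, assuming a one-parameter model y = w * x with a mean squared error loss and two equally sized shards; the function names are hypothetical, and the workers run one after another here rather than truly in parallel.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, learning_rate):
    """One synchronous update: each worker computes a gradient, then they are averaged."""
    worker_gradients = [shard_gradient(w, shard) for shard in shards]
    averaged = sum(worker_gradients) / len(worker_gradients)
    return w - learning_rate * averaged, averaged


# Toy dataset following y = 3x, split evenly across two hypothetical workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]
new_w, avg_grad = data_parallel_step(0.0, shards, learning_rate=0.01)

# With equally sized shards, the averaged gradient matches the full-batch gradient.
full_batch_grad = shard_gradient(0.0, data)
```

Because the shards have the same size, averaging the per-worker gradients gives the same update a single machine would compute on the whole batch.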

6. What does ReLU stand for, and what is its purpose in deep learning?

ReLU stands for Rectified Linear Unit. It is a widely used activation function in the hidden
layers of neural networks. Its mathematical form is f(x) = max(0, x), which means it outputs the
input if the input is positive and zero otherwise. Its purpose is twofold: to introduce non-linearity
into the model and to do so in a computationally efficient way. Importantly, it helps mitigate
the vanishing gradient problem, allowing for faster and more effective training of deep
networks.
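The definition f(x) = max(0, x) translates almost directly into code. This is a minimal scalar sketch; the sample inputs are arbitrary.

```python
def relu(x):
    """Rectified Linear Unit: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [relu(x) for x in inputs]  # negatives are clipped to 0, positives pass through
```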

7. What are the key components of a deep learning model architecture, such as layers,
neurons, and connections?

The key components of a deep learning model are:

• Neurons (or Nodes): The fundamental computational units that receive weighted inputs
and apply an activation function to produce an output.

• Layers: Neurons are organized into layers: an input layer to receive data, one or
more hidden layers to learn features at different levels of abstraction, and an output
layer to produce the final result.

• Connections (and Weights): These are the links between neurons in adjacent layers.
Each connection has a learnable weight that represents the strength of the connection,
which the model adjusts during training.

5 Marks

1. Calculate feed forward neural network.

This question is likely asking for an explanation of the calculation process, also known as the
forward pass or forward propagation.

The calculation in a feed-forward neural network is a step-by-step process where information
flows in one direction, from the input layer, through the hidden layers, to the output layer,
without any loops.

The calculation for a single neuron is as follows:

1. Weighted Sum: Each input xᵢ coming into the neuron is multiplied by its corresponding
weight wᵢ. These products are summed together, and a bias term b is added. This
produces a value, often denoted as z.

o Formula: z = (w₁x₁ + w₂x₂ + ... + wₙxₙ) + b or in vector form: z = w · x + b.

2. Activation: The result z is then passed through a non-linear activation function f(z) (like
ReLU or Sigmoid). The output of this function, a = f(z), becomes the neuron's final
output.

This process is performed for every neuron in a layer, using the outputs from the previous layer
as inputs. The calculation then "propagates" forward to the next layer until it reaches the final
output layer, which produces the network's prediction.
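The two steps above (weighted sum, then activation) can be sketched in plain Python. The network shape, weights, and biases below are hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron_forward(weights, inputs, bias, activation):
    """Weighted sum z = w . x + b followed by the activation a = f(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer_forward(weight_rows, biases, inputs, activation):
    """Apply neuron_forward once per neuron; each row holds one neuron's weights."""
    return [neuron_forward(w, inputs, b, activation) for w, b in zip(weight_rows, biases)]


# Hypothetical network: 2 inputs, a hidden layer of 2 ReLU neurons, 1 sigmoid output.
x = [1.0, 2.0]
hidden = layer_forward([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], x, relu)
output = layer_forward([[1.0, -0.5]], [0.0], hidden, sigmoid)
```

Here the first hidden neuron computes z = 0.5 - 2.0 = -1.5, which ReLU clips to 0, while the second computes z = 1 + 2 - 1 = 2. The output neuron then applies the sigmoid to z = -1.
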
2. Describe the learning process of an ANN. Explain with example the challenge in
assigning synaptic weights for the interconnection between neurons. How can this
challenge be addressed.

This question has three parts: describing the learning process, explaining the weight assignment
challenge, and providing the solution.

1. The Learning Process of an ANN:


The learning process in an Artificial Neural Network (ANN) is an iterative cycle designed to
minimize prediction error. It consists of four main steps:

1. Forward Propagation: Input data is fed into the network. Calculations are performed
layer by layer (as described above) to produce an output or prediction.

2. Loss Calculation: The network's prediction is compared to the true, known label using a
loss function (e.g., Mean Squared Error or Cross-Entropy). This function calculates a
single value representing the model's error.

3. Backpropagation: The error is propagated backward through the network. Using the
chain rule from calculus, the algorithm calculates the gradient of the loss function with
respect to each weight and bias in the network. This gradient indicates the direction of
the steepest increase in error.

4. Weight Update: An optimizer (like Gradient Descent or Adam) uses these gradients to
update the weights and biases. It takes a small step in the opposite direction of the
gradient to reduce the error. This cycle is repeated for many epochs until the loss is
minimized.
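The four-step cycle can be seen end to end on the smallest possible model. The sketch below assumes a single linear neuron y = w * x + b trained with plain gradient descent on a mean squared error loss; the function name, learning rate, and toy data are illustrative choices, and the gradients are written out by hand rather than computed by a framework.

```python
def train_linear_neuron(data, learning_rate=0.05, epochs=1000):
    """Fit y = w * x + b by repeating forward pass, loss, gradients, and update."""
    w, b = 0.0, 0.0  # a single neuron has no symmetry to break, so zeros are fine here
    n = len(data)
    history = []
    for epoch in range(epochs):
        # 1. Forward propagation
        predictions = [w * x + b for x, y in data]
        # 2. Loss calculation (mean squared error)
        errors = [p - y for p, (x, y) in zip(predictions, data)]
        history.append(sum(e * e for e in errors) / n)
        # 3. Backpropagation: gradients of the loss with respect to w and b
        grad_w = sum(2 * e * x for e, (x, y) in zip(errors, data)) / n
        grad_b = sum(2 * e for e in errors) / n
        # 4. Weight update: step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Toy data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, history = train_linear_neuron(data)
```

After training, `w` and `b` should be close to 2 and 1, and the recorded loss in `history` should fall from its starting value toward zero.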

2. The Challenge in Assigning Synaptic Weights:


The primary challenge is that the optimal values for the millions (or billions) of synaptic weights
in a deep network are unknown and impossible to determine manually. This creates a vast, high-
dimensional, and non-convex optimization problem.

• Example: Consider a network to classify images of cats and dogs. The network must
learn to recognize features like pointy ears, whiskers, snouts, etc. A specific weight
might contribute to recognizing "pointiness," while another detects "fur texture."
Manually assigning a numerical value to a weight to represent "pointiness" is
impossible. Furthermore, these features interact in complex ways; changing one weight
can have unpredictable effects throughout the entire network, making the problem
intractable to solve by hand. Starting with all weights at zero would cause all neurons in
a layer to learn the same thing, defeating the purpose of a deep network.

3. How This Challenge is Addressed:


This challenge is addressed by a combination of two key techniques:

1. Random Initialization: Instead of trying to guess the correct weights, they are initialized
with small, random numbers (e.g., using Xavier or He initialization schemes). This
breaks the symmetry and ensures that different neurons in a layer will learn different
features during training.

2. Iterative Optimization: Using the learning process described above (specifically
backpropagation and gradient descent), the network starts with these random weights
and iteratively adjusts them over many training examples. Each small adjustment moves
the weights closer to a set of values that minimizes the overall loss, effectively "learning"
the optimal weights from the data.
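For reference, the two initialization schemes named above can be sketched as follows. This assumes the Gaussian variants (He: standard deviation sqrt(2 / fan_in); Xavier: sqrt(2 / (fan_in + fan_out))); the function names and layer sizes are hypothetical, and frameworks also offer uniform variants.

```python
import math
import random


def he_init(fan_in, fan_out, rng):
    """He initialization: Gaussian with standard deviation sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def xavier_init(fan_in, fan_out, rng):
    """Xavier (Glorot) initialization: Gaussian with std sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = he_init(fan_in=4, fan_out=3, rng=rng)
# Each of the 3 neurons gets its own row of 4 random weights, so no two neurons start identical.
```

Because every neuron starts from different values, their gradients differ from the first update onward, which is what breaks the symmetry.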

3. Describe the steps to build a simple feedforward neural network.

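Building a simple feedforward neural network typically follows a structured, five-step process
using a modern framework like TensorFlow or PyTorch. The step names below follow the Keras
API that ships with TensorFlow (compile() and fit()); in PyTorch the same steps are written out
explicitly as optimizer and loss objects plus a training loop.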

1. Define the Architecture: Decide the structure of the model. This includes specifying
the number of layers (input, hidden, output), the number of neurons in each layer, and
the activation function for each layer (e.g., ReLU for hidden layers, Sigmoid/Softmax for
the output layer depending on the task). The input layer's size must match the number of
features in the dataset.

2. Compile the Model: Configure the learning process. This involves selecting and
assigning three key components:

o An Optimizer (e.g., Adam, SGD) which defines the algorithm for updating
weights.

o A Loss Function (e.g., 'binary_crossentropy' for binary classification,
'categorical_crossentropy' for multi-class) which measures the model's error.

o Metrics (e.g., ['accuracy']) to monitor during training and testing.

3. Prepare the Data: Load, clean, and preprocess the dataset. This includes tasks like
normalizing or scaling numerical features to a common range (e.g., 0 to 1) and splitting
the data into training, validation, and test sets.

4. Train the Model (Fitting): This is the core learning step. The model is trained by calling a
fit() method, providing the training data and labels. Key parameters for this step are the
number of epochs (how many times to cycle through the entire dataset) and the batch
size (how many samples to process before updating the weights).

5. Evaluate the Model: After training is complete, the model's performance is assessed on
a separate, unseen test dataset. This gives an unbiased estimate of how well the model
will generalize to new data. Based on the evaluation, you may need to go back and tune
the architecture or hyperparameters.

4. Compare the use of GPUs and CPUs in the context of deep learning model training.

CPUs (Central Processing Units) and GPUs (Graphics Processing Units) have fundamentally
different architectures, which makes them suited for different tasks in deep learning.

| Feature | CPU (Central Processing Unit) | GPU (Graphics Processing Unit) |
|---|---|---|
| Architecture | Composed of a few (~4-32) highly powerful cores optimized for serial tasks and low-latency access to memory. | Composed of thousands (~1000-10,000) of simpler cores designed for massively parallel tasks. |
| Primary Task in DL | Excellent for sequential tasks like data loading, pre-processing, and orchestrating the overall training loop. | Specialized for the core computations of training: matrix multiplications and tensor operations, which are highly parallelizable. |
| Speed | Significantly slower for training deep neural networks. A training process can take weeks or months on a CPU. | Dramatically faster (10x to 100x+) for training. The parallel cores can perform thousands of identical operations simultaneously. |
| Best Use Case | Data preparation, general computing, and sometimes inference (running a trained model) for small-scale or low-latency applications. | The gold standard for training deep learning models and for high-throughput inference where many predictions are needed at once. |
| Summary | A CPU is a generalist, excelling at complex, sequential logic. | A GPU is a specialist, excelling at simple, repetitive calculations performed in parallel, which is exactly what deep learning training requires. |

5. Describe the advantages of using mixed precision training in distributed deep learning
systems.

Mixed precision training is a technique that uses both 16-bit (half-precision, FP16) and 32-bit
(single-precision, FP32) floating-point numbers during training. It provides three main
advantages, which are amplified in a distributed setting:

1. Increased Training Speed: Modern GPUs (especially those with Tensor Cores) can
perform operations on FP16 numbers much faster than on FP32 numbers. This directly
translates to higher computational throughput, reducing the time it takes to train a
model. In a distributed system, this means each node finishes its portion of the work
faster.

2. Reduced Memory Footprint: Activations and gradients stored in FP16 take half the memory
of their FP32 equivalents. Because mixed precision usually keeps an FP32 master copy of the
weights, the total saving is less than 50%, but it is typically enough to train larger models or
use larger batch sizes on the same hardware. In a distributed setting, this can reduce
the number of GPUs needed, lowering overall system cost.

3. Faster Communication: In distributed training, nodes must regularly communicate and
synchronize their gradients across the network. Because FP16 gradients are half the size
of FP32 gradients, the amount of data transferred over the network is halved. This
reduces the communication bottleneck, which is often a key limiting factor in
distributed system performance.
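The size difference is easy to check with Python's standard `struct` module, whose `e` format is IEEE 754 half precision and `f` is single precision. The gradient count below is a made-up figure, and the sketch only measures storage size and rounding, not training speed.

```python
import struct


def buffer_size_bytes(num_values, fmt):
    """Bytes needed to store num_values floats in the given struct format."""
    return num_values * struct.calcsize(fmt)


num_gradients = 1_000_000  # hypothetical number of gradient values exchanged per step
fp32_bytes = buffer_size_bytes(num_gradients, "<f")  # 32-bit single precision
fp16_bytes = buffer_size_bytes(num_gradients, "<e")  # 16-bit half precision

# Round-tripping a value through FP16 shows the precision that is given up in exchange.
value = 0.1234567
as_fp16 = struct.unpack("<e", struct.pack("<e", value))[0]
```

The FP16 buffer is exactly half the size, while the round-tripped value only agrees with the original to about three decimal digits, which is why FP32 is kept for the parts of training that are sensitive to precision.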

6. Discuss the benefits and challenges of using pre-trained models in computer vision
tasks.
Using pre-trained models, a technique known as transfer learning, is a dominant approach in
modern computer vision. It involves taking a model that has already been trained on a large,
general dataset (like ImageNet) and adapting it for a new, specific task.

Benefits:

• Improved Performance: Pre-trained models have already learned a rich hierarchy of
features (from simple edges and textures to complex object parts) from the large
dataset. This provides a much better starting point than random initialization, often
leading to higher accuracy, especially when the target dataset is small.

• Reduced Training Time: Since the model has already learned foundational features,
training on the new task is much faster. You are only "fine-tuning" the existing weights,
which requires fewer epochs and less computational resources compared to training a
model from scratch.

• Lower Data Requirements: A major benefit is that you can often achieve strong results
with a much smaller, task-specific dataset. This is crucial in fields like medicine
or manufacturing, where large labeled datasets are rare and expensive to create.

Challenges:

• Domain Mismatch: If the pre-trained model's original domain (e.g., everyday objects in
ImageNet) is vastly different from the target domain (e.g., medical X-rays, satellite
imagery, or abstract art), the learned features may not be relevant or could even be
detrimental (negative transfer).

• Architectural Rigidity: You are typically constrained to the architecture of the pre-
trained model. Making significant changes to the model's structure can be difficult
without breaking the learned weights, limiting flexibility for novel problems.

• Fine-Tuning Complexity: Deciding how to fine-tune is a challenge. One must choose
how many layers to freeze (keep weights fixed) versus how many to retrain. An incorrect
choice can lead to either destroying the valuable pre-trained features or failing to adapt
the model sufficiently to the new task.

7. Discuss how distributed computing supports real-time AI applications.

Distributed computing is essential for deploying robust, real-time AI applications that require
both low latency and high throughput. It supports these applications in three critical ways:

1. Low Latency through Geographic Distribution: For real-time applications like virtual
assistants, online gaming, or real-time translation, network delay is a major bottleneck.
By deploying model inference servers in multiple geographic regions (on edge devices or
in regional cloud data centers), requests can be routed to the server closest to the user.
This minimizes network latency and provides a near-instantaneous response.

2. High Throughput and Scalability via Load Balancing: A single server can quickly
become overwhelmed if it receives thousands of concurrent requests per second. A
distributed system uses a load balancer to distribute incoming inference requests
across a fleet of servers. This horizontal scaling ensures that the system can handle
massive, fluctuating user demand without slowing down, which is crucial for
applications like social media feed recommendations or e-commerce search.

3. High Availability and Fault Tolerance: Real-time AI services need to be operational
24/7. A distributed architecture provides high availability by eliminating single points of
failure. If one server (or even an entire data center) goes down, the load balancer
automatically reroutes traffic to healthy servers, ensuring the application remains
responsive and available to users without interruption.
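The load-balancing and failover behavior in points 2 and 3 can be sketched with a simple round-robin policy. This is an illustrative toy, assuming health status is already known; the class name and server names are hypothetical, and production load balancers use more elaborate policies and health checks.

```python
class RoundRobinBalancer:
    """Cycles through servers in order, skipping any that are marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def route(self):
        """Return the next healthy server, or raise if none is available."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


# Hypothetical inference servers in three regions.
balancer = RoundRobinBalancer(["us-east", "eu-west", "ap-south"])
first_three = [balancer.route() for _ in range(3)]
balancer.mark_down("eu-west")  # simulate a failure
after_failure = [balancer.route() for _ in range(4)]
```

After `eu-west` is marked down, requests keep flowing to the two remaining servers without the caller having to change anything.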

Module 2
2 Marks
1. Describe a convolutional layer in CNNs? What is the purpose of feature maps in
convolutional layers?

A convolutional layer is the fundamental component of a CNN that applies learnable filters (or
kernels) across an input image. These filters are small matrices that slide over the input,
performing convolutions to detect specific local patterns. The output of a single filter applied
across the entire image is called a feature map. The purpose of a feature map is to act as a
"feature detector," creating a spatial map that shows the locations where its specific feature
(e.g., a vertical edge, a specific color, or a texture) was found in the input.
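To make the sliding-filter idea concrete, here is a minimal pure-Python sketch with stride 1 and no padding. The image and the vertical-edge filter are hand-written for illustration; a trained CNN would learn its filter values, and like most deep learning frameworks this computes a cross-correlation (the kernel is not flipped).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the feature map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# 4x5 toy image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hand-written filter that responds to a dark-to-bright transition from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, vertical_edge)
```

The resulting 2x3 feature map is large where the filter window straddles the edge and zero where the window covers a uniform region.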

2. Explain two types of pooling layers commonly used in CNNs.

Two common types of pooling layers are Max Pooling and Average Pooling. Max Pooling
partitions the input feature map into a grid and outputs the maximum value from each section.
Its purpose is to reduce dimensionality and make the representation robust to the exact
location of features by retaining the strongest activation. In contrast, Average Pooling also
reduces dimensionality but works by calculating the average value within each section. This
creates a smoother, more generalized representation of the features found in the input map.
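Both pooling types can share one helper that differs only in how each window is reduced. The sketch below assumes non-overlapping 2x2 windows on a small hand-made feature map; the function names are illustrative.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 2, 8, 6],
    [4, 2, 6, 4],
]
max_pooled = pool2d(feature_map, 2, max)      # keeps the strongest activation per window
avg_pooled = pool2d(feature_map, 2, average)  # keeps the mean activation per window
```

Both outputs are 2x2, so each spatial dimension is halved; only the summary statistic per window differs.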

3. Why are pre-trained models used in transfer learning?

Pre-trained models are used in transfer learning because they act as powerful feature extractors
that have already been trained on a large-scale, general dataset like ImageNet. This pre-training
process allows the model to learn a rich hierarchy of generic features, from simple edges and
textures to complex object parts. By leveraging this existing knowledge, we can achieve high
performance on a new, specific task with significantly less data and computational cost. This
provides a much better starting point than random initialization and often leads to higher
accuracy.

4. Define transfer learning and its importance in deep learning.

Transfer learning is a deep learning methodology where a model previously trained on a large,
foundational task (e.g., ImageNet classification) is repurposed as the starting point for a new,
related task. Its importance is immense because it drastically reduces the need for massive
labeled datasets and extensive computational resources. By reusing the learned features,
developers can build highly accurate models much faster and with far less data. This makes
deep learning accessible and practical for a wide range of applications where data is scarce or
expensive.

5. List two applications of CNNs in computer vision.

Two major applications of CNNs in computer vision are Image Classification and Object
Detection. In image classification, the CNN takes an entire image as input and outputs a single
label that categorizes it (e.g., identifying a picture as containing a "cat" or a "dog"). Object
detection is a more advanced task where the CNN not only classifies objects within an image
but also locates each one by drawing a bounding box around it (e.g., identifying all cars and
pedestrians in a self-driving car's camera feed).

6. Explain object detection and its applications in real-world scenarios.

Object detection is a computer vision task that involves identifying and localizing one or more
objects within an image. It outputs "bounding boxes" around each detected object along with a
class label (e.g., "person," "car"). Its applications are vast; in autonomous vehicles, it is critical
for detecting pedestrians, cars, and traffic signs. In retail, it can be used for automated
checkout systems or to monitor shelf inventory. Other common applications include security
surveillance systems and finding anomalies in medical scans like X-rays.

7. Define what a convolutional neural network (CNN) is and describe its primary function in
the context of image processing.

A Convolutional Neural Network (CNN) is a specialized class of deep neural network designed
to process grid-like data, such as images. It is characterized by its architecture, which includes
convolutional, pooling, and fully-connected layers. Its primary function in image processing is
to automatically learn a spatial hierarchy of features directly from the raw pixels. Early layers
learn simple features like edges and colors, while deeper layers combine these to recognize
complex patterns and objects, eliminating the need for manual feature engineering.

5 Marks

1. Calculate the size of the output feature map for a 28x28 input image with a 3x3 filter,
stride 1, and no padding.

To calculate the size of the output feature map, we use the standard formula for a convolutional
layer's output dimension:

Output Size = floor((W - K + 2P) / S) + 1

Where:

• W = Input size (width or height) = 28

• K = Kernel (filter) size = 3

• P = Padding = 0 (since the question specifies "no padding")

• S = Stride = 1

Calculation:
• Output Size = (28 - 3 + 2 * 0) / 1 + 1

• Output Size = (25 + 0) / 1 + 1

• Output Size = 25 / 1 + 1

• Output Size = 25 + 1

• Output Size = 26

Since the input image is 28x28 and the parameters are symmetrical, the output feature map will
be 26x26.

2. Explain the importance of ReLU in optimizing CNN performance.

The ReLU (Rectified Linear Unit) activation function is critically important for optimizing CNN
performance in two main ways: by improving learning speed and enabling deeper networks.

1. Mitigating the Vanishing Gradient Problem: Older activation functions like Sigmoid and
Tanh have gradients that approach zero for large positive or negative input values. In a deep
CNN, these small gradients are multiplied together during backpropagation, causing the
overall gradient to "vanish" for early layers. This stops them from learning effectively.
ReLU's derivative is a constant 1 for all positive inputs, creating a direct path for the
gradient to flow backward without diminishing, which allows very deep networks to train
successfully.

2. Computational Efficiency: The ReLU function, f(x) = max(0, x), is a very simple
thresholding operation. It is computationally much cheaper than the complex
exponential calculations required for Sigmoid or Tanh. This means both the forward and
backward passes of training are significantly faster, reducing the overall time required to
train a CNN. This combination of faster computation and more effective gradient flow is
key to the success of modern deep CNNs.
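A rough numerical sketch shows why the difference in derivatives matters. It multiplies the same local derivative across 20 layers and deliberately ignores the weight terms that also appear in real backpropagation, so it illustrates the trend only; the depth and input values are arbitrary.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(derivative, z, depth):
    """Product of `depth` identical local derivatives, ignoring the weight terms."""
    return derivative(z) ** depth


depth = 20
# z = 0 is the sigmoid's best case: its derivative peaks there at 0.25.
sigmoid_chain = chained_gradient(sigmoid_derivative, 0.0, depth)
# Any positive input gives ReLU a derivative of exactly 1.
relu_chain = chained_gradient(relu_derivative, 1.0, depth)
```

Even in its best case the sigmoid chain shrinks to 0.25 raised to the 20th power, while the ReLU chain stays at 1 for active neurons.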

3. Describe the steps involved in training a CNN model for object detection.

Training a CNN for object detection is a more complex process than standard image
classification and involves the following key steps:

1. Data Preparation and Annotation: First, you need a dataset where each image is
annotated with bounding boxes (coordinates for a box around each object) and class
labels for every object present. This is more intensive than simple image labeling. The
data is then split into training, validation, and test sets.

2. Model Selection/Architecture: Choose an object detection architecture like YOLO
(You Only Look Once), SSD (Single Shot Detector), or Faster R-CNN. These models
typically consist of:

o A backbone: A pre-trained CNN (like ResNet or VGG) used for extracting
features from the input image.

o A head: One or more layers that use the features from the backbone to predict
bounding box coordinates and class probabilities.

3. Define a Multi-part Loss Function: Object detection models use a composite loss
function that combines two different errors:

o Localization Loss (or Regression Loss): Measures how inaccurate the
predicted bounding box coordinates are compared to the ground truth (e.g.,
using Smooth L1 loss).

o Classification Loss: Measures how incorrect the predicted class label is for the
detected object (e.g., using Cross-Entropy loss).

4. Training (Fine-tuning): The model is trained using backpropagation and an optimizer
like Adam. The goal is to minimize the combined loss function. Almost always, this is
done using transfer learning, where the pre-trained backbone's weights are fine-tuned
on the new object detection dataset.

5. Evaluation: Performance is evaluated using metrics specific to object detection,
primarily Mean Average Precision (mAP). This metric considers both the correctness of
the class prediction and the accuracy of the bounding box, measured by Intersection
over Union (IoU).
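Intersection over Union, the overlap measure used in the evaluation step, is short enough to sketch directly. The box format (x_min, y_min, x_max, y_max) and the example coordinates are assumptions made for this illustration.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (5.0, 0.0, 15.0, 10.0)  # same size, shifted right by half its width
iou = intersection_over_union(ground_truth, prediction)
```

Here the boxes overlap in a 5x10 region (area 50) and their union covers 150, so the IoU is one third. A detection is usually counted as correct only if its IoU with the ground truth exceeds a chosen threshold.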

4. Consider a CNN has three convolutional layers with strides of 1, 2, and 2 respectively.
Kernel size = 3, Padding = 'same' for stride 1, and 'valid' (no padding) for stride > 1. Solve by
computing the final output size for a 512x512 input image.

Let's calculate the output size layer by layer using the formula:
Output = floor((W - K + 2P) / S) + 1

Initial Input: 512x512

Layer 1:

• Input (W): 512

• Kernel (K): 3

• Stride (S): 1

• Padding (P): 'same'. For a stride of 1, 'same' padding means padding is added to keep the
output size the same as the input size. Here, P=1.

• Calculation: (512 - 3 + 2*1) / 1 + 1 = 511 / 1 + 1 = 512.

• Output after Layer 1: 512x512

Layer 2:

• Input (W): 512 (from Layer 1)

• Kernel (K): 3

• Stride (S): 2
• Padding (P): 'valid' (means no padding), so P=0.

• Calculation: floor((512 - 3 + 2*0) / 2) + 1 = floor(509 / 2) + 1 = 254 + 1 = 255.

• Output after Layer 2: 255x255

Layer 3:

• Input (W): 255 (from Layer 2)

• Kernel (K): 3

• Stride (S): 2

• Padding (P): 'valid', so P=0.

• Calculation: floor((255 - 3 + 2*0) / 2) + 1 = floor(252 / 2) + 1 = 126 + 1 = 127.

• Output after Layer 3: 127x127

The final output size for the 512x512 input image after three convolutional layers is 127x127.
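The layer-by-layer arithmetic can be double-checked with a few lines of Python. This sketch assumes an odd kernel size and only supports 'same' padding at stride 1, which is all this question needs; the function name and the layer list are written for this example.

```python
def conv_output_size(input_size, kernel_size, stride, padding_mode):
    """Output size for 'same' (stride 1 only in this sketch) or 'valid' padding."""
    if padding_mode == "same":
        if stride != 1:
            raise ValueError("this sketch only handles 'same' padding with stride 1")
        padding = (kernel_size - 1) // 2  # assumes an odd kernel size
    elif padding_mode == "valid":
        padding = 0
    else:
        raise ValueError(f"unknown padding mode: {padding_mode}")
    # floor((W - K + 2P) / S) + 1, using integer floor division
    return (input_size - kernel_size + 2 * padding) // stride + 1


layers = [
    {"stride": 1, "padding_mode": "same"},
    {"stride": 2, "padding_mode": "valid"},
    {"stride": 2, "padding_mode": "valid"},
]
size = 512
sizes = []
for layer in layers:
    size = conv_output_size(size, kernel_size=3, **layer)
    sizes.append(size)
```

The recorded sizes follow the hand calculation: 512, then 255, then 127.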

5. Compare and contrast pre-trained models considering key features and applications.

| Feature | VGGNet (e.g., VGG16) | ResNet (e.g., ResNet50) | Inception (GoogLeNet) |
|---|---|---|---|
| Key Innovation | Deep, uniform architecture using only stacked 3x3 convolution filters. Proved depth is critical. | Residual "skip" connections that allow gradients to bypass layers, enabling much deeper networks without degradation. | Inception module, which performs convolutions with multiple filter sizes (1x1, 3x3, 5x5) in parallel and concatenates the results. |
| Architecture | Sequential and deep. Very simple and straightforward structure. | Deep with shortcut paths. More complex blocks, but allows for extreme depth (50, 101, 152+ layers). | Wide, not just deep. It processes information at multiple scales simultaneously within the same layer. |
| Parameters | Very high (e.g., VGG16 has ~138M). Memory-intensive and prone to overfitting without large datasets. | Moderate (e.g., ResNet50 has ~25M). Much more efficient than VGG for its depth and accuracy. | Low (e.g., InceptionV1 has ~7M). Designed for high accuracy with a strict computational budget. |
| Applications | Good as a general-purpose feature extractor. Popular in style transfer and for baseline comparisons due to its simple structure. | The de-facto standard backbone for a wide range of computer vision tasks, including classification, object detection, and segmentation. | Excellent for real-time applications where both high accuracy and computational efficiency are important, such as on-device vision. |

Contrast Summary: VGG valued simplicity and depth; ResNet solved the problems of extreme
depth with skip connections; and Inception prioritized computational efficiency by making the
network wider instead of just deeper.

6. Explain the challenges involved in Pre-trained models for Image Classification.

Using pre-trained models for image classification, while powerful, presents several key
challenges:

1. Domain Mismatch: This is the most significant challenge. A model pre-trained on a
general dataset like ImageNet (common objects) may perform poorly on a specialized
domain like medical imagery (X-rays, MRIs), satellite photos, or manufacturing defects.
The features learned from ImageNet, especially the higher-level, object-specific ones, might
not be relevant or may even be misleading for the new task, a phenomenon known as negative
transfer.

2. Fine-Tuning Complexity: It is challenging to determine the optimal strategy for
fine-tuning. One must decide how many layers to "freeze" (keep weights fixed) and how many
to retrain. Freezing too many layers may prevent the model from adapting to the new
data, while retraining too many layers on a small dataset can lead to overfitting and
losing the valuable pre-trained knowledge.

3. Data Imbalance in the Target Task: If the new, smaller dataset for classification is
highly imbalanced (e.g., 95% one class, 5% another), the pre-trained model can easily
become biased towards the majority class. This requires careful handling through
techniques like data augmentation, class weighting in the loss function, or specialized
sampling strategies during fine-tuning.
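Class weighting, one of the remedies mentioned for imbalance, is often based on inverse class frequency. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count); the function name and the 95/5 label split are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 95 "normal" images and 5 "defect" images.
labels = ["normal"] * 95 + ["defect"] * 5
weights = balanced_class_weights(labels)
```

With these counts the rare "defect" class gets a weight of 10 and the majority class about 0.53, so each mistake on a rare example contributes more to the loss.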

7. Differentiate between Semantic Segmentation and Instance Segmentation.

Semantic Segmentation and Instance Segmentation are both pixel-level image
understanding tasks, but they differ in how they treat individual objects.

• Semantic Segmentation: The goal is to assign a class label to every single pixel in the
image. It does not distinguish between different objects of the same class. For example,
in an image containing three cars, semantic segmentation would label all pixels
belonging to any of the three cars simply as "car". It answers the question: "What is at
this pixel?"

• Instance Segmentation: This is a more challenging task. Like semantic segmentation, it
assigns class labels at the pixel level, but it additionally differentiates between
individual instances of the same object class. In the same image with three cars,
instance segmentation would identify three distinct car objects, labeling them as
"car 1", "car 2", and "car 3". It answers the question: "What object instance is at
this pixel?"

In summary, the key difference is that semantic segmentation groups all objects of a class
together, while instance segmentation separates and identifies each unique object.
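
The difference can be seen on a toy mask. The sketch below starts from a hypothetical semantic mask (every car pixel is just "car") and derives instance ids by labeling 4-connected blobs. This shortcut only works because the toy cars do not touch; it is an assumption for illustration, not how a trained instance segmentation model separates objects.

```python
def label_instances(semantic, target):
    """Give each 4-connected blob of `target` pixels its own instance id (1, 2, ...)."""
    rows, cols = len(semantic), len(semantic[0])
    instance = [[0] * cols for _ in range(rows)]  # 0 means "no instance"
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if semantic[r][c] != target or instance[r][c]:
                continue
            next_id += 1
            instance[r][c] = next_id
            stack = [(r, c)]
            while stack:  # flood fill over neighbouring pixels of the same class
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and semantic[ny][nx] == target
                            and not instance[ny][nx]):
                        instance[ny][nx] = next_id
                        stack.append((ny, nx))
    return instance, next_id

# Semantic output: all three cars share the single label "car".
semantic_mask = [
    ["car", "car", "bg", "car", "bg", "bg"],
    ["car", "car", "bg", "car", "bg", "car"],
    ["bg",  "bg",  "bg", "bg",  "bg", "car"],
]

# Instance output: the same pixels, but each car has its own id.
instance_mask, n_cars = label_instances(semantic_mask, "car")
for row in instance_mask:
    print(row)
print("cars found:", n_cars)
```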

Module 3
2 Marks
1. Define Recurrent Neural Networks (RNNs). What is the primary difference between
RNNs and feedforward neural networks?

A Recurrent Neural Network (RNN) is a class of neural network designed specifically for
processing sequential data like text or time series. Its defining feature is an internal "memory" or
hidden state that captures information about previous elements in the sequence. The primary
difference is that RNNs have a feedback loop, allowing information from the past (the previous
hidden state) to influence the current output. In contrast, feedforward networks process each
input independently, with no concept of time or sequence.

2. What is the purpose of Long Short-Term Memory (LSTM) units in RNNs? State the role of
gate mechanisms in LSTMs.

The main purpose of Long Short-Term Memory (LSTM) units is to mitigate the vanishing gradient
problem that plagues standard RNNs. This allows them to effectively learn long-range
dependencies in data. The role of the gate mechanisms (Forget, Input, and Output gates) is to
regulate the flow of information. These gates are small neural networks that learn to control
what information should be removed from, added to, or read from the cell state (the long-term
memory), enabling the network to remember relevant information over long sequences.
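
A minimal sketch of one LSTM step, assuming scalar inputs and states so that the gate arithmetic stays readable. The parameter names and values are hypothetical and hand-picked (forget gate near 1, input gate near 0) to show the cell state being preserved over many steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar state; p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g           # update the long-term memory
    h = o * math.tanh(c)             # read from the memory
    return h, c

# Hand-picked (not learned) parameters: forget gate ~1, input gate ~0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8] * 10:  # 30 time steps
    h, c = lstm_step(x, h, c, params)
print(c)  # stays close to the initial cell state of 1.0
```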

3. List and explain any two components of RNN.

Two fundamental components of a vanilla RNN are:

1. Hidden State (hₜ): This is the internal memory of the RNN at a given time step t. It's a
vector that encapsulates information from all previous steps in the sequence. The
hidden state is updated at each step using the current input and the previous hidden
state.

2. Shared Weights: An RNN uses the same set of weight matrices for every time step of
the input sequence. This parameter sharing makes the network efficient and allows it
to generalize its learning across different positions in the sequence.

4. Justify the need of Transformers for Vision.

Transformers were needed for vision to overcome a key limitation of CNNs: modeling long-
range, global dependencies. While CNNs are excellent at learning local features through their
convolutional filters, they struggle to capture relationships between distant parts of an image.
The self-attention mechanism in Transformers allows the model to weigh the importance of all
input patches relative to each other, regardless of their position. This enables a more holistic
understanding of image context, leading to state-of-the-art performance.
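
The sketch below illustrates scaled dot-product self-attention over four toy "patches". As a simplifying assumption it uses the patch vectors directly as queries, keys, and values; real models apply learned projection matrices first. The patch values are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(patches):
    """Scaled dot-product self-attention with Q = K = V = patches (no learned projections)."""
    d = len(patches[0])
    outputs, weights = [], []
    for q in patches:
        # Compare this patch with every patch, regardless of position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in patches]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all patch vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, patches)) for i in range(d)])
    return outputs, weights

# Patch 0 and patch 3 look alike although they are far apart in the sequence.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
out, attn = self_attention(patches)
print(attn[0])  # patch 0 attends to distant patch 3 as strongly as to itself
```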

5. Define Gated Recurrent Units (GRUs) and their role in RNNs.

Gated Recurrent Units (GRUs) are a type of gated RNN, similar to LSTMs but with a simpler
architecture. Their primary role is also to mitigate the vanishing gradient problem and capture
long-term dependencies. A GRU uses two gates: a Reset Gate to decide how much past
information to forget, and an Update Gate to decide how much of the past information to carry
forward. This makes them computationally more efficient than LSTMs while often achieving
comparable performance.
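
A minimal sketch of one GRU step with scalar state. It assumes the convention in which the update gate z is the fraction of the previous hidden state that is carried forward (some libraries use 1 - z instead). The parameter values are hypothetical and hand-picked so that z is close to 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step with scalar state; p maps a name to (w_x, w_h, bias)."""
    def pre(name, h):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    r = sigmoid(pre("reset", h_prev))               # how much of the past to use in the candidate
    z = sigmoid(pre("update", h_prev))              # how much of the past to carry forward
    h_cand = math.tanh(pre("candidate", r * h_prev))
    return z * h_prev + (1.0 - z) * h_cand          # blend old state and candidate

# Hand-picked (not learned) parameters: update gate ~1, so the state is carried forward.
params = {
    "reset": (0.0, 0.0, 0.0),
    "update": (0.0, 0.0, 10.0),
    "candidate": (1.0, 1.0, 0.0),
}

h = 0.7
for x in [0.9, -0.9] * 10:  # 20 time steps
    h = gru_step(x, h, params)
print(h)  # remains close to the initial value 0.7
```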

6. Summarize the challenges in RNN.

The main challenges in training traditional RNNs are:

1. Vanishing and Exploding Gradients: During backpropagation through time, gradients
can either shrink exponentially to zero (vanish), preventing learning of long-range
dependencies, or grow exponentially and become unstable (explode).

2. Sequential Computation: RNNs must process data step-by-step, which prevents
parallelization over the time dimension. This makes them significantly slower to train
compared to models like Transformers.

3. Difficulty with Very Long Sequences: Even with LSTMs and GRUs, retaining precise
information from the very distant past remains a practical challenge.
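
The first challenge can be illustrated with a deliberately simplified recurrence. The sketch assumes a scalar, linear RNN h_t = w * h_{t-1} with no activation function, so the gradient of the last state with respect to the first is just w multiplied by itself once per step.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad

for w in (0.5, 1.0, 1.5):
    print(w, gradient_through_time(w, 50))
```

With w below 1 the gradient collapses towards zero after 50 steps, and with w above 1 it blows up. In a real RNN the derivative of tanh (at most 1) multiplies in as well, which pushes further towards vanishing.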

7. What is a Swin Transformer, and how does it differ from traditional RNNs?

A Swin (Shifted Window) Transformer is a modern vision transformer architecture that efficiently
processes images by introducing a hierarchical structure and local attention. It differs from
traditional RNNs in two fundamental ways. First, Swin Transformers can process image patches
in parallel using self-attention, whereas RNNs are inherently sequential and must process
data one element at a time. Second, Swin Transformers are built for spatial data (images), while
traditional RNNs are designed for 1D temporal sequences (like text or time series).

5 Marks

1. Categorize the various concepts of text classification using RNNs and how they can be
trained to classify text into different categories or labels.

Text classification using RNNs is fundamentally a many-to-one sequence processing task. The
core concept is to encode an entire variable-length sequence of text (the "many" part) into a
single fixed-size vector representation, which is then used for classification (the "one" part). A
popular variation is using a bidirectional RNN, which processes the text both forward and
backward to capture context from both directions, often improving accuracy.

The training process involves these key steps:

1. Text Preprocessing: The raw text is cleaned, tokenized into individual words or sub-
words, and a vocabulary is built. Each token is mapped to a unique integer index.

2. Embedding Layer: These integer indices are converted into dense vector
representations called embeddings (e.g., using GloVe, Word2Vec, or a learnable
embedding layer). These vectors capture semantic relationships between words.

3. RNN Layer: The sequence of word embeddings is fed one-by-one into an RNN (typically
an LSTM or GRU). The network updates its hidden state at each step, accumulating
contextual information from the sequence.

4. Classification: The final hidden state from the last time step (which serves as a
summary of the entire sentence) is passed to a standard feedforward (Dense) layer with
a Softmax activation function. This final layer outputs a probability distribution over the
different categories or labels.

5. Training: The model is trained using backpropagation to minimize a loss function like
Categorical Cross-Entropy, which measures the difference between the predicted
probabilities and the true one-hot encoded labels.
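
Steps 1 and 5 can be sketched in a few lines. The corpus, the whitespace tokenizer, the `<pad>`/`<unk>` entries, and the predicted probabilities are all assumptions made for illustration. With one-hot labels, categorical cross-entropy reduces to the negative log of the probability assigned to the true class.

```python
import math

def build_vocab(sentences):
    """Map each distinct lowercase token to a unique integer index."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved entries (an assumption of this sketch)
    for sentence in sentences:
        for tok in sentence.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Convert a sentence into a list of integer indices; unseen words map to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for a single example with a one-hot label."""
    return -math.log(probs[true_index])

corpus = ["the match was thrilling", "the budget was approved"]
vocab = build_vocab(corpus)
ids = encode("the match was cancelled", vocab)
print(ids)  # "cancelled" is not in the vocabulary, so it maps to <unk>

# Hypothetical model output over three labels, with label 0 as the true class.
loss = cross_entropy([0.7, 0.2, 0.1], true_index=0)
print(loss)
```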

2. Explain the basic architecture of an RNN. What do you mean by rolled and unrolled
recurrent neural networks?

Basic Architecture of an RNN:

The basic architecture of a simple RNN cell is a loop. At each time step t, the cell takes two
inputs: the current input data xₜ and the hidden state from the previous time step hₜ₋₁. It then
computes the new hidden state hₜ by applying a transformation (using shared weight matrices)
to both inputs and passing the result through an activation function (commonly tanh). The cell
also produces an output yₜ for that time step, usually by applying another transformation and
activation function to the new hidden state hₜ. The key idea is that the same weights are used at
every step, allowing it to process sequences of any length.
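
Written as equations, the description above corresponds to the following. The symbols W_{hy}, b_y, and the output activation g are notation assumed here for the output transformation; they are not named in the text.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right) \\
y_t &= g\left(W_{hy}\, h_t + b_y\right)
\end{aligned}
```

The same W_{xh}, W_{hh}, and b_h appear at every time step t, which is the weight sharing described above.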

Rolled vs. Unrolled RNN:

These are two ways of visualizing the same RNN architecture:

• Rolled RNN: This is the compact, conceptual representation showing the RNN cell with
a feedback loop. This visualization emphasizes the recurrent nature of the network and
the concept of parameter sharing across time steps. It clearly shows how the hidden
state h is updated in place.

• Unrolled RNN: This representation "unrolls" the loop across the time dimension,
showing the RNN as a deep feedforward network where each time step is a new layer.
This view is essential for understanding the flow of information and gradients during
training (i.e., backpropagation through time). It makes it clear that the output at a given
step depends on all preceding steps.
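
The two views can be written side by side in code. This is a sketch with hypothetical scalar weights: the rolled form is a loop that reuses one cell, and the unrolled form spells out three steps as three "layers" that call the same cell.

```python
import math

# Hypothetical scalar weights, shared by every time step.
W_XH, W_HH, B_H = 0.8, 0.5, 0.1

def cell(x, h_prev):
    return math.tanh(W_XH * x + W_HH * h_prev + B_H)

def rolled(xs, h0=0.0):
    """Rolled view: one cell with a feedback loop over the hidden state."""
    h = h0
    for x in xs:
        h = cell(x, h)
    return h

def unrolled_three_steps(x1, x2, x3, h0=0.0):
    """Unrolled view: each time step is written out as its own layer."""
    h1 = cell(x1, h0)
    h2 = cell(x2, h1)
    h3 = cell(x3, h2)
    return h3

xs = [0.2, -0.4, 0.9]
print(rolled(xs), unrolled_three_steps(*xs))  # identical results
```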

3. Consider the input "feeling under the weather." and explain step by step execution of
this input with RNN.

Here is the step-by-step execution for classifying the sentiment of "feeling under the weather."
using an RNN:

1. Preprocessing and Embedding: The sentence is first tokenized into a sequence of
words: ['feeling', 'under', 'the', 'weather', '.']. Each word is converted into its corresponding
dense vector representation (embedding).

2. Initialization: The RNN initializes its first hidden state h₀ to a vector of zeros.

3. Execution at Time Step 1 (word: "feeling"):

o The RNN cell takes the embedding for "feeling" (x₁) and the initial hidden state
h₀.

o It calculates the new hidden state: h₁ = tanh(W_hh * h₀ + W_xh * x₁ + b_h). This h₁
now encodes the meaning of "feeling".

4. Execution at Time Step 2 (word: "under"):

o The cell takes the embedding for "under" (x₂) and the previous hidden state h₁.

o It calculates h₂ = tanh(W_hh * h₁ + W_xh * x₂ + b_h). Now, h₂ contains information
about both "feeling" and "under".

5. Continued Execution: This process repeats for "the", "weather", and ".". At each step t,
the hidden state hₜ is a function of the current word xₜ and the accumulated context
from all previous words stored in hₜ₋₁.

6. Final Classification: After the last word ("."), the final hidden state h₅ is produced. This
vector represents a contextual summary of the entire sentence. This h₅ is then fed into a
final feedforward (Dense) layer with a Softmax/Sigmoid activation function to output a
probability for the class "negative sentiment".
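
The walkthrough above can be traced in code. In this sketch the 2-dimensional embeddings and all weights are made-up, untrained values, so the printed probability has no meaning as a prediction; only the flow of computation (h₀ through h₅, then a Sigmoid output) mirrors the steps.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)."""
    from_input = matvec(w_xh, x)
    from_state = matvec(w_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_state, b_h)]

tokens = ["feeling", "under", "the", "weather", "."]

# Made-up 2-d embeddings and weights (a trained model would learn these).
embeddings = {
    "feeling": [0.2, 0.1],
    "under": [-0.6, 0.3],
    "the": [0.0, 0.0],
    "weather": [-0.4, 0.5],
    ".": [0.0, 0.1],
}
w_xh = [[0.5, -0.2], [0.1, 0.4]]
w_hh = [[0.3, 0.0], [0.0, 0.3]]
b_h = [0.0, 0.0]
w_out, b_out = [-1.5, 0.8], 0.0

h = [0.0, 0.0]  # h0: vector of zeros
states = []
for tok in tokens:  # the same weights are reused at every step
    h = rnn_step(embeddings[tok], h, w_xh, w_hh, b_h)
    states.append(h)
    print(tok, h)

# Final Dense layer with a Sigmoid on the last hidden state h5.
logit = sum(w * hi for w, hi in zip(w_out, h)) + b_out
p_negative = 1.0 / (1.0 + math.exp(-logit))
print("P(negative sentiment) =", p_negative)
```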

4. Evaluate the impact of sentiment analysis using RNNs on content moderation in social
media.

The impact of using RNN-based sentiment analysis for content moderation is significant, with
both powerful benefits and considerable challenges.

Positive Impacts (The Evaluation):

• Scalability and Automation: Social media platforms generate vast amounts of content
impossible for humans to moderate manually. RNNs automate the detection of toxic,
abusive, or harmful content at an immense scale, in real-time.

• Contextual Understanding: Unlike simple keyword filters, RNNs can understand
nuance and context. For example, they can distinguish between "You are the bomb!"
(positive) and "He wants to bomb the place" (negative), and can handle negation like
"not a good movie."

• Proactive Moderation: By identifying escalating negative sentiment or hateful speech
early, platforms can proactively flag, deprioritize, or remove content before it causes
widespread harm, protecting users from harassment and abuse.

Challenges and Negative Impacts (The Evaluation):

• Bias and Fairness: If trained on biased data, the models can disproportionately flag
content from minority groups or certain political viewpoints, leading to unfair
censorship.

• Lack of Nuance for Sarcasm and Irony: RNNs still struggle to reliably understand
complex human expressions like sarcasm or irony, which can lead to both false
positives (flagging jokes) and false negatives (missing veiled threats).

• Adversarial Attacks: Users can intentionally circumvent models by using creative
misspellings, leetspeak (e.g., h4te), or rephrasing toxic comments in novel ways that the
model has not seen during training.

5. Compare the architecture of GRUs and LSTMs and their advantages in handling long-
term dependencies.

Comparison of Architecture:

| Feature | LSTM (Long Short-Term Memory) | GRU (Gated Recurrent Unit) |
|---|---|---|
| Gating Mechanisms | Three gates: (1) Forget Gate: decides what to discard from the long-term memory. (2) Input Gate: decides what new info to add. (3) Output Gate: decides what to output. | Two gates: (1) Reset Gate: decides how much of the past to forget. (2) Update Gate: decides how much of the past to keep vs. how much new info to add (combines forget/input gates). |
| State Vectors | Two states: (1) Cell State (cₜ): the "conveyor belt" for long-term memory. (2) Hidden State (hₜ): the short-term memory and output. | One state: Hidden State (hₜ): acts as both the memory and the output. |
| Complexity | More complex, more parameters. | Simpler, fewer parameters, computationally faster. |

Advantages in Handling Long-term Dependencies:

Both LSTMs and GRUs excel at handling long-term dependencies by overcoming the vanishing
gradient problem, but they do so slightly differently.

• The LSTM's primary advantage comes from its dedicated cell state. This acts as an
uninterrupted "information superhighway" where gradients can flow easily across many
time steps. The gates precisely control the information flowing into and out of this
highway, but the path itself remains direct, preventing gradients from vanishing.

• The GRU's advantage is its efficiency. It achieves a similar effect with fewer moving
parts. Its update gate can learn to pass the previous hidden state almost directly to the
next, effectively creating the same kind of shortcut path for gradients and long-term
information as the LSTM's cell state.

In summary, both are highly effective, with the LSTM being more explicit and complex, while the
GRU is more efficient and has shown comparable performance on many tasks.
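
The "fewer parameters" claim can be checked with simple arithmetic. The sketch assumes the textbook formulation in which each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector; some frameworks store extra bias vectors, so their reported counts can be slightly higher. The layer sizes are arbitrary examples.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Parameters for n_blocks gate/candidate blocks, each with W_x, W_h and a bias."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

input_size, hidden_size = 128, 256  # arbitrary example sizes

# LSTM: forget, input, output gates + candidate = 4 blocks.
lstm = gated_rnn_params(input_size, hidden_size, n_blocks=4)
# GRU: reset, update gates + candidate = 3 blocks.
gru = gated_rnn_params(input_size, hidden_size, n_blocks=3)

print(lstm, gru, gru / lstm)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.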

6. Explain sentiment analysis using RNN.

Sentiment analysis using an RNN is the process of training a model to classify a piece of text as
having a positive, negative, or neutral sentiment. The process works by leveraging an RNN's
ability to understand context and word order, which are crucial for determining sentiment.

The steps are as follows:

1. Input: A sentence or document is taken as input (e.g., "This movie was absolutely
fantastic!").

2. Tokenization and Embedding: The text is broken down into a sequence of words
(tokens). Each token is then converted into a numerical vector via an embedding layer,
which represents the word's meaning.

3. Sequential Processing with RNN: The sequence of word vectors is fed into an RNN (like
an LSTM or GRU) one word at a time. The RNN reads the sequence and updates its
internal hidden state at each step, building a vector representation that captures the
cumulative meaning and context of the sentence up to that point.

4. Feature Extraction: After processing the entire sentence, the RNN's final hidden state
is used as a fixed-size summary vector. This vector encapsulates the overall sentiment
of the text.

5. Classification: This summary vector is passed through a final fully-connected (Dense)
layer with a Softmax (for positive/negative/neutral) or Sigmoid (for positive/negative)
activation function to produce the final classification.
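
The choice between the two output activations in step 5 can be sketched as follows. The logits are hypothetical values standing in for the output of the Dense layer, and the label order is an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

# Three-way sentiment: one logit per class, Softmax gives a distribution.
labels = ["positive", "negative", "neutral"]
three_way = softmax([2.0, -1.0, 0.5])  # hypothetical Dense-layer outputs
predicted = labels[three_way.index(max(three_way))]
print(three_way, predicted)

# Binary sentiment: a single logit, Sigmoid gives P(positive).
p_positive = sigmoid(1.2)  # hypothetical Dense-layer output
print(p_positive, "positive" if p_positive >= 0.5 else "negative")
```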

7. Discuss the significance of Swin Transformers in reducing computational costs in vision
tasks.

The significance of Swin Transformers lies in their ability to make the powerful Transformer
architecture practical and efficient for computer vision tasks, which was a major challenge with
the original Vision Transformer (ViT). They reduce computational costs through two primary
innovations:

1. Window-Based Local Self-Attention: The standard ViT calculates self-attention
globally, meaning every image patch attends to every other patch. This has a quadratic
computational complexity (O(N²)) with respect to the number of patches (N), which is
extremely expensive for high-resolution images. Swin Transformers solve this by dividing
the image into small, non-overlapping windows (e.g., 7x7 patches) and performing
self-attention only within each window. Because the window size is fixed, the
complexity becomes linear (O(N)) with respect to the image size, drastically cutting
computational cost.

2. Shifted Window Mechanism and Hierarchical Design: To allow for information to flow
between windows (and thus learn global context), Swin introduces a shifted window
approach. In alternating layers, the window configuration is shifted, so that subsequent
attention operations are performed on different groupings of patches. This allows for
cross-window connections without the quadratic cost of global attention. This,
combined with a hierarchical design that merges patches to reduce spatial resolution in
deeper layers (similar to a CNN), further enhances computational efficiency and makes
the architecture compatible with standard vision backbones.
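
The quadratic versus linear behaviour can be checked by counting query-key pairs. This sketch counts pairs only, ignoring the feature dimension and the cost of the projections; the 56x56 patch grid and 7x7 window are illustrative sizes.

```python
def global_attention_pairs(h, w):
    """Every patch attends to every patch: N * N pairs with N = h * w."""
    n = h * w
    return n * n

def window_attention_pairs(h, w, window):
    """Attention only inside non-overlapping window x window blocks of patches."""
    assert h % window == 0 and w % window == 0, "grid must tile evenly in this sketch"
    n_windows = (h // window) * (w // window)
    per_window = (window * window) ** 2
    return n_windows * per_window

for side in (56, 112):  # doubling the side quadruples the number of patches
    print(side, global_attention_pairs(side, side), window_attention_pairs(side, side, 7))
```

Quadrupling the number of patches multiplies the global count by 16 but the windowed count only by 4, which is the O(N²) versus O(N) difference described in point 1.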
