DL - QB Solution
2 Marks
1. What is a neural network? Contrast the function of a loss function in a neural network.
Two of the most popular deep learning frameworks are TensorFlow and PyTorch.
The purpose of optimization is to systematically adjust the model's parameters (weights and
biases) to minimize the loss function. This process, often carried out by an optimizer algorithm
like Adam or SGD, is the core of how a network "learns" from data. It is essential for training
because it provides the mechanism for the model to improve itself. By iteratively updating its
parameters based on the calculated error, the model moves from a state of poor performance
to one that makes accurate predictions.
Distributed computing in deep learning accelerates training by using multiple processing units
(like GPUs or servers) simultaneously. The most common approach is Data Parallelism: the
model is copied to each processor, and the training dataset is split among them. Each
processor trains the model on its data subset, and the resulting updates (gradients) are then
averaged and applied to a master copy of the model. This allows for training on massive
datasets in significantly less time than on a single machine.
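The averaging step in data parallelism can be sketched in plain Python. This is a minimal illustration, assuming a one-parameter model y = w * x, two equally sized data shards, and hypothetical helper names (`local_gradient`, `data_parallel_step`); a real system would run the workers concurrently on separate devices.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    """One synchronous step: per-worker gradients are averaged, then applied once."""
    grads = [local_gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]                            # two hypothetical workers

w_parallel = data_parallel_step(0.0, shards)
w_single = 0.0 - 0.01 * local_gradient(0.0, data)        # same step on one machine
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so `w_parallel` and `w_single` agree.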
6. What does ReLU stand for, and what is its purpose in deep learning?
ReLU stands for Rectified Linear Unit. It is a widely used activation function in the hidden
layers of neural networks. Its mathematical form is f(x) = max(0, x), which means it outputs the
input if the input is positive and zero otherwise. Its purpose is twofold: to introduce non-linearity
into the model and to do so in a computationally efficient way. Importantly, it helps mitigate
the vanishing gradient problem, allowing for faster and more effective training of deep
networks.
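As a small sketch, ReLU and its derivative can be written as follows. The function names are illustrative, and the derivative at exactly zero is set to 0 here by convention.

```python
def relu(x):
    """f(x) = max(0, x): passes positive inputs through, zeroes the rest."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

activations = [relu(v) for v in [-3.0, 0.0, 2.5]]
```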
7. What are the key components of a deep learning model architecture, such as layers,
neurons, and connections?
• Neurons (or Nodes): The fundamental computational units that receive weighted inputs
and apply an activation function to produce an output.
• Layers: Neurons are organized into layers: an input layer to receive data, one or
more hidden layers to learn features at different levels of abstraction, and an output
layer to produce the final result.
• Connections (and Weights): These are the links between neurons in adjacent layers.
Each connection has a learnable weight that represents the strength of the connection,
which the model adjusts during training.
5 Marks
This question is likely asking for an explanation of the calculation process, also known as the
forward pass or forward propagation.
1. Weighted Sum: Each input xᵢ coming into the neuron is multiplied by its corresponding
weight wᵢ. These products are summed together, and a bias term b is added. This
produces a value, often denoted as z.
2. Activation: The result z is then passed through a non-linear activation function f(z) (like
ReLU or Sigmoid). The output of this function, a = f(z), becomes the neuron's final
output.
This process is performed for every neuron in a layer, using the outputs from the previous layer
as inputs. The calculation then "propagates" forward to the next layer until it reaches the final
output layer, which produces the network's prediction.
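The two steps above can be sketched for a tiny network. This is a minimal example with made-up weights and biases (two inputs, two hidden ReLU neurons, one sigmoid output); the helper names are assumptions for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias, activation):
    """Weighted sum z = sum(w_i * x_i) + b, followed by a = f(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def layer_forward(inputs, weight_rows, biases, activation):
    """Apply every neuron in a layer to the same inputs."""
    return [neuron_forward(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]

x = [1.0, 2.0]
hidden = layer_forward(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = layer_forward(hidden, [[1.0, 0.5]], [0.0], sigmoid)
```

Here the first hidden neuron has z = -1.5 and outputs 0, the second has z = 2 and outputs 2, and the output neuron applies the sigmoid to z = 1.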
2. Describe the learning process of an ANN. Explain with example the challenge in
assigning synaptic weights for the interconnection between neurons. How can this
challenge be addressed?
This question has three parts: describing the learning process, explaining the weight assignment
challenge, and providing the solution.
1. Forward Propagation: Input data is fed into the network. Calculations are performed
layer by layer (as described above) to produce an output or prediction.
2. Loss Calculation: The network's prediction is compared to the true, known label using a
loss function (e.g., Mean Squared Error or Cross-Entropy). This function calculates a
single value representing the model's error.
3. Backpropagation: The error is propagated backward through the network. Using the
chain rule from calculus, the algorithm calculates the gradient of the loss function with
respect to each weight and bias in the network. This gradient indicates the direction of
the steepest increase in error.
4. Weight Update: An optimizer (like Gradient Descent or Adam) uses these gradients to
update the weights and biases. It takes a small step in the opposite direction of the
gradient to reduce the error. This cycle is repeated for many epochs until the loss is
minimized.
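The four-step cycle above can be sketched for the simplest possible case. This assumes a single linear neuron (pred = w * x + b), squared-error loss, and plain per-sample gradient descent; the data and hyperparameters are made up for illustration.

```python
def train(samples, w=0.0, b=0.0, lr=0.1, epochs=500):
    """Repeat forward pass, loss, backpropagation, and update for each sample."""
    loss = None
    for _ in range(epochs):
        for x, y in samples:
            pred = w * x + b               # 1. forward propagation
            loss = (pred - y) ** 2         # 2. loss calculation (squared error)
            dloss_dpred = 2 * (pred - y)   # 3. backpropagation via the chain rule
            grad_w = dloss_dpred * x       #    d(loss)/dw = d(loss)/d(pred) * d(pred)/dw
            grad_b = dloss_dpred           #    d(pred)/db = 1
            w -= lr * grad_w               # 4. weight update, against the gradient
            b -= lr * grad_b
    return w, b, loss

samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # generated by y = 2x + 1
w, b, final_loss = train(samples)
```

After training, `w` and `b` approach 2 and 1, the values that generated the data.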
• Example: Consider a network to classify images of cats and dogs. The network must
learn to recognize features like pointy ears, whiskers, snouts, etc. A specific weight
might contribute to recognizing "pointiness," while another detects "fur texture."
Manually assigning a numerical value to a weight to represent "pointiness" is
impossible. Furthermore, these features interact in complex ways; changing one weight
can have unpredictable effects throughout the entire network, making the problem
intractable to solve by hand. Starting with all weights at zero would cause all neurons in
a layer to learn the same thing, defeating the purpose of a deep network.
1. Random Initialization: Instead of trying to guess the correct weights, they are initialized
with small, random numbers (e.g., using Xavier or He initialization schemes). This
breaks the symmetry and ensures that different neurons in a layer will learn different
features during training.
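A minimal sketch of the two named schemes, assuming the commonly used forms: Xavier (Glorot) uniform draws from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), and He normal draws from a Gaussian with standard deviation sqrt(2 / fan_in). The function names and layer sizes are illustrative.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-limit, limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """Weights drawn from a zero-mean Gaussian with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)                       # fixed seed for reproducibility
weights = xavier_uniform(fan_in=4, fan_out=3, rng=rng)
he_weights = he_normal(fan_in=4, fan_out=3, rng=rng)
```

Each row holds one neuron's incoming weights. Because the rows differ, the neurons start from different points and receive different gradient updates.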
Building a simple feedforward neural network typically follows a structured, five-step process
using a modern framework like TensorFlow or PyTorch.
1. Define the Architecture: Decide the structure of the model. This includes specifying
the number of layers (input, hidden, output), the number of neurons in each layer, and
the activation function for each layer (e.g., ReLU for hidden layers, Sigmoid/Softmax for
the output layer depending on the task). The input layer's size must match the number of
features in the dataset.
2. Compile the Model: Configure the learning process. This involves selecting and
assigning three key components:
o An Optimizer (e.g., Adam, SGD) which defines the algorithm for updating
weights.
3. Prepare the Data: Load, clean, and preprocess the dataset. This includes tasks like
normalizing or scaling numerical features to a common range (e.g., 0 to 1) and splitting
the data into training, validation, and test sets.
4. Train the Model (Fitting): This is the core learning step. The model is trained by calling a
fit() method, providing the training data and labels. Key parameters for this step are the
number of epochs (how many times to cycle through the entire dataset) and the batch
size (how many samples to process before updating the weights).
5. Evaluate the Model: After training is complete, the model's performance is assessed on
a separate, unseen test dataset. This gives an unbiased estimate of how well the model
will generalize to new data. Based on the evaluation, you may need to go back and tune
the architecture or hyperparameters.
4. Compare the use of GPUs and CPUs in the context of deep learning model training.
CPUs (Central Processing Units) and GPUs (Graphics Processing Units) have fundamentally
different architectures, which makes them suited for different tasks in deep learning.
Best Use Case:
• CPU: Data preparation, general computing, and sometimes inference (running a trained
model) for small-scale or low-latency applications.
• GPU: The gold standard for training deep learning models and for high-throughput
inference where many predictions are needed at once.
5. Describe the advantages of using mixed precision training in distributed deep learning
systems.
Mixed precision training is a technique that uses both 16-bit (half-precision, FP16) and 32-bit
(single-precision, FP32) floating-point numbers during training. It provides three main
advantages, which are amplified in a distributed setting:
1. Increased Training Speed: Modern GPUs (especially those with Tensor Cores) can
perform operations on FP16 numbers much faster than on FP32 numbers. This directly
translates to higher computational throughput, reducing the time it takes to train a
model. In a distributed system, this means each node finishes its portion of the work
faster.
2. Reduced Memory Footprint: Storing model weights, activations, and gradients in FP16
requires half the memory compared to FP32. This allows for training larger models or
using larger batch sizes on the same hardware. In a distributed setting, this can reduce
the number of GPUs needed, lowering overall system cost.
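The memory claim can be checked with a short calculation. This sketch uses Python's `struct` module, whose `e` and `f` format codes correspond to IEEE half and single precision; the 100-million-value tensor size is a hypothetical figure.

```python
import struct

BYTES_FP32 = struct.calcsize("f")   # single precision
BYTES_FP16 = struct.calcsize("e")   # half precision

def tensor_megabytes(num_values, bytes_per_value):
    """Memory needed to store num_values numbers at the given width."""
    return num_values * bytes_per_value / (1024 ** 2)

def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

num_values = 100_000_000            # hypothetical number of stored values
fp32_mb = tensor_megabytes(num_values, BYTES_FP32)
fp16_mb = tensor_megabytes(num_values, BYTES_FP16)

tiny_gradient = to_fp16(1e-8)       # too small for FP16, rounds to zero
```

The last line hints at why the technique is "mixed": very small values underflow in FP16, which is one reason some quantities are kept in FP32.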
6. Discuss the benefits and challenges of using pre-trained models in computer vision
tasks.
Using pre-trained models, a technique known as transfer learning, is a dominant approach in
modern computer vision. It involves taking a model that has already been trained on a large,
general dataset (like ImageNet) and adapting it for a new, specific task.
Benefits:
• Reduced Training Time: Since the model has already learned foundational features,
training on the new task is much faster. You are only "fine-tuning" the existing weights,
which requires fewer epochs and less computational resources compared to training a
model from scratch.
• Lower Data Requirements: A major benefit is that you can often achieve strong results
with a much smaller, task-specific dataset. This is crucial in fields like medicine
or manufacturing, where large labeled datasets are rare and expensive to create.
Challenges:
• Domain Mismatch: If the pre-trained model's original domain (e.g., everyday objects in
ImageNet) is vastly different from the target domain (e.g., medical X-rays, satellite
imagery, or abstract art), the learned features may not be relevant or could even be
detrimental (negative transfer).
• Architectural Rigidity: You are typically constrained to the architecture of the pre-
trained model. Making significant changes to the model's structure can be difficult
without breaking the learned weights, limiting flexibility for novel problems.
Distributed computing is essential for deploying robust, real-time AI applications that require
both low latency and high throughput. It supports these applications in three critical ways:
1. Low Latency through Geographic Distribution: For real-time applications like virtual
assistants, online gaming, or real-time translation, network delay is a major bottleneck.
By deploying model inference servers in multiple geographic regions (on edge devices or
in regional cloud data centers), requests can be routed to the server closest to the user.
This minimizes network latency and provides a near-instantaneous response.
2. High Throughput and Scalability via Load Balancing: A single server can quickly
become overwhelmed if it receives thousands of concurrent requests per second. A
distributed system uses a load balancer to distribute incoming inference requests
across a fleet of servers. This horizontal scaling ensures that the system can handle
massive, fluctuating user demand without slowing down, which is crucial for
applications like social media feed recommendations or e-commerce search.
Module 2
2 Marks
1. Describe a convolutional layer in CNNs. What is the purpose of feature maps in
convolutional layers?
A convolutional layer is the fundamental component of a CNN that applies learnable filters (or
kernels) across an input image. These filters are small matrices that slide over the input,
performing convolutions to detect specific local patterns. The output of a single filter applied
across the entire image is called a feature map. The purpose of a feature map is to act as a
"feature detector," creating a spatial map that shows the locations where its specific feature
(e.g., a vertical edge, a specific color, or a texture) was found in the input.
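A minimal sketch of one filter producing one feature map, assuming stride 1, no padding, and the sliding-window product that deep learning frameworks call convolution (technically cross-correlation, since the kernel is not flipped). The image and the vertical-edge kernel are made up.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output cell is a sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Dark left half, bright right half: one vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = conv2d_valid(image, vertical_edge)
```

The feature map is non-zero only in the middle column, which is where the dark-to-bright edge sits in the input.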
Two common types of pooling layers are Max Pooling and Average Pooling. Max Pooling
partitions the input feature map into a grid and outputs the maximum value from each section.
Its purpose is to reduce dimensionality and make the representation robust to the exact
location of features by retaining the strongest activation. In contrast, Average Pooling also
reduces dimensionality but works by calculating the average value within each section. This
creates a smoother, more generalized representation of the features found in the input map.
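Both pooling types can be sketched with one helper that differs only in the reduction applied to each window. This assumes non-overlapping 2x2 windows on a made-up 4x4 feature map.

```python
def pool2d(fmap, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    return [
        [
            reduce_fn([fmap[i + di][j + dj]
                       for di in range(size) for dj in range(size)])
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

def mean(values):
    return sum(values) / len(values)

fmap = [
    [1, 3, 2, 0],
    [5, 6, 1, 2],
    [0, 2, 4, 4],
    [1, 1, 8, 0],
]

max_pooled = pool2d(fmap, 2, max)    # keeps the strongest activation per window
avg_pooled = pool2d(fmap, 2, mean)   # keeps the average activation per window
```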
Pre-trained models are used in transfer learning because they act as powerful feature extractors
that have already been trained on a large-scale, general dataset like ImageNet. This pre-training
process allows the model to learn a rich hierarchy of generic features, from simple edges and
textures to complex object parts. By leveraging this existing knowledge, we can achieve high
performance on a new, specific task with significantly less data and computational cost. This
provides a much better starting point than random initialization and often leads to higher
accuracy.
Transfer learning is a deep learning methodology where a model previously trained on a large,
foundational task (e.g., ImageNet classification) is repurposed as the starting point for a new,
related task. Its importance is immense because it drastically reduces the need for massive
labeled datasets and extensive computational resources. By reusing the learned features,
developers can build highly accurate models much faster and with far less data. This makes
deep learning accessible and practical for a wide range of applications where data is scarce or
expensive.
Two major applications of CNNs in computer vision are Image Classification and Object
Detection. In image classification, the CNN takes an entire image as input and outputs a single
label that categorizes it (e.g., identifying a picture as containing a "cat" or a "dog"). Object
detection is a more advanced task where the CNN not only classifies objects within an image
but also locates each one by drawing a bounding box around it (e.g., identifying all cars and
pedestrians in a self-driving car's camera feed).
Object detection is a computer vision task that involves identifying and localizing one or more
objects within an image. It outputs "bounding boxes" around each detected object along with a
class label (e.g., "person," "car"). Its applications are vast; in autonomous vehicles, it is critical
for detecting pedestrians, cars, and traffic signs. In retail, it can be used for automated
checkout systems or to monitor shelf inventory. Other common applications include security
surveillance systems and finding anomalies in medical scans like X-rays.
7. Define what a convolutional neural network (CNN) is and describe its primary function in
the context of image processing.
A Convolutional Neural Network (CNN) is a specialized class of deep neural network designed
to process grid-like data, such as images. It is characterized by its architecture, which includes
convolutional, pooling, and fully-connected layers. Its primary function in image processing is
to automatically learn a spatial hierarchy of features directly from the raw pixels. Early layers
learn simple features like edges and colors, while deeper layers combine these to recognize
complex patterns and objects, eliminating the need for manual feature engineering.
5 Marks
1. Calculate the size of the output feature map for a 28x28 input image with a 3x3 filter,
stride 1, and no padding.
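To calculate the size of the output feature map, we use the standard formula for a convolutional
layer's output dimension:
Output Size = (W - K + 2P) / S + 1
Where:
• W = Input size = 28
• K = Filter (kernel) size = 3
• P = Padding = 0
• S = Stride = 1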
Calculation:
• Output Size = (28 - 3 + 2 * 0) / 1 + 1
• Output Size = 25 / 1 + 1
• Output Size = 25 + 1
• Output Size = 26
Since the input image is 28x28 and the parameters are symmetrical, the output feature map will
be 26x26.
The ReLU (Rectified Linear Unit) activation function is critically important for optimizing CNN
performance in two main ways: by improving learning speed and enabling deeper networks.
1. Mitigating the Vanishing Gradient Problem: Older activation functions like Sigmoid and
Tanh saturate: their gradients approach zero for large positive or negative inputs. In a deep CNN,
these small gradients are multiplied together during backpropagation, causing the
overall gradient to "vanish" for early layers. This stops them from learning effectively.
ReLU's derivative is a constant 1 for all positive inputs, creating a direct path for the
gradient to flow backward without diminishing, which allows very deep networks to train
successfully.
2. Computational Efficiency: The ReLU function, f(x) = max(0, x), is a very simple
thresholding operation. It is computationally much cheaper than the complex
exponential calculations required for Sigmoid or Tanh. This means both the forward and
backward passes of training are significantly faster, reducing the overall time required to
train a CNN. This combination of faster computation and more effective gradient flow is
key to the success of modern deep CNNs.
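The gradient argument can be illustrated numerically. This simplified sketch multiplies the same local derivative across 20 layers and ignores the weight terms that a real backward pass would also include, so it shows the trend rather than an exact gradient.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(z) ** depth

sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 20)  # 0.25 ** 20, best case
relu_chain = chained_gradient(relu_grad, 1.0, 20)        # stays at 1.0
```

Even in the sigmoid's best case the product shrinks below 1e-11 after 20 layers, while the ReLU product for positive inputs is unchanged.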
3. Describe the steps involved in training a CNN model for object detection.
Training a CNN for object detection is a more complex process than standard image
classification and involves the following key steps:
1. Data Preparation and Annotation: First, you need a dataset where each image is
annotated with bounding boxes (coordinates for a box around each object) and class
labels for every object present. This is more intensive than simple image labeling. The
data is then split into training, validation, and test sets.
3. Define a Multi-part Loss Function: Object detection models use a composite loss
function that combines two different errors:
o Classification Loss: Measures how incorrect the predicted class label is for the
detected object (e.g., using Cross-Entropy loss).
4. Consider a CNN has three convolutional layers with strides of 1, 2, and 2 respectively.
Kernel size = 3, Padding = 'same' for stride 1, and 'valid' (no padding) for stride > 1. Solve by
computing the final output size for a 512x512 input image.
Let's calculate the output size layer by layer using the formula:
Output = floor((W - K + 2P) / S) + 1
Layer 1:
• Kernel (K): 3
• Stride (S): 1
• Padding (P): 'same'. For a stride of 1, 'same' padding adds enough padding to keep the
output size equal to the input size. For a 3x3 kernel, P=1.
• Output = floor((512 - 3 + 2*1) / 1) + 1 = 511 + 1 = 512
Layer 2:
• Kernel (K): 3
• Stride (S): 2
• Padding (P): 'valid' (no padding), so P=0.
• Output = floor((512 - 3 + 0) / 2) + 1 = floor(254.5) + 1 = 254 + 1 = 255
Layer 3:
• Kernel (K): 3
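• Stride (S): 2
• Padding (P): 'valid' (no padding), so P=0.
• Output = floor((255 - 3 + 0) / 2) + 1 = 126 + 1 = 127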
The final output size for the 512x512 input image after three convolutional layers is 127x127.
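The layer-by-layer result can be checked with a short script that applies the formula once per layer. The helper name and the (kernel, stride, padding) tuples are illustrative, with P=1 standing in for 'same' padding at stride 1 and P=0 for 'valid'.

```python
def conv_output_size(w, k, s, p):
    """Output = floor((W - K + 2P) / S) + 1 for one spatial dimension."""
    return (w - k + 2 * p) // s + 1

# (kernel, stride, padding) per layer
layers = [(3, 1, 1), (3, 2, 0), (3, 2, 0)]

size = 512
sizes = []
for k, s, p in layers:
    size = conv_output_size(size, k, s, p)
    sizes.append(size)   # 512 after layer 1, 255 after layer 2, 127 after layer 3
```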
5. Compare and contrast pre-trained models considering key features and applications.
Contrast Summary: VGG valued simplicity and depth; ResNet solved the problems of extreme
depth with skip connections; and Inception prioritized computational efficiency by making the
network wider instead of just deeper.
Using pre-trained models for image classification, while powerful, presents several key
challenges:
3. Data Imbalance in the Target Task: If the new, smaller dataset for classification is
highly imbalanced (e.g., 95% one class, 5% another), the pre-trained model can easily
become biased towards the majority class. This requires careful handling through
techniques like data augmentation, class weighting in the loss function, or specialized
sampling strategies during fine-tuning.
• Semantic Segmentation: The goal is to assign a class label to every single pixel in the
image. It does not distinguish between different objects of the same class. For example,
in an image containing three cars, semantic segmentation would label all pixels
belonging to any of the three cars simply as "car". It answers the question: "What is at
this pixel?"
• Instance Segmentation: This is a more challenging task. It also assigns a class label to
every pixel, but it also differentiates between individual instances of the same object
class. In the same image with three cars, instance segmentation would identify three
distinct car objects, labeling them as "car 1", "car 2", and "car 3". It answers the
question: "What object instance is at this pixel?"
In summary, the key difference is that semantic segmentation groups all objects of a class
together, while instance segmentation separates and identifies each unique object.
Module 3
2 Marks
1. Define Recurrent Neural Networks (RNNs). What is the primary difference between
RNNs and feedforward neural networks?
A Recurrent Neural Network (RNN) is a class of neural network designed specifically for
processing sequential data like text or time series. Its defining feature is an internal "memory" or
hidden state that captures information about previous elements in the sequence. The primary
difference is that RNNs have a feedback loop, allowing information from the past (the previous
hidden state) to influence the current output. In contrast, feedforward networks process each
input independently, with no concept of time or sequence.
2. What is the purpose of Long Short-Term Memory (LSTM) units in RNNs? State the role of
gate mechanisms in LSTMs.
The main purpose of Long Short-Term Memory (LSTM) units is to mitigate the vanishing gradient
problem that plagues standard RNNs. This allows them to effectively learn long-range
dependencies in data. The role of the gate mechanisms (Forget, Input, and Output gates) is to
regulate the flow of information. These gates are small neural networks that learn to control
what information should be removed from, added to, or read from the cell state (the long-term
memory), enabling the network to remember relevant information over long sequences.
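The gate roles can be sketched with a scalar LSTM cell. This is a simplified illustration using the standard gate equations with one-dimensional inputs and states; the parameter names and values are hypothetical and chosen so the forget gate stays open and the input gate stays closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: what to remove
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: what to add
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: what to read
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate content
    c = f * c_prev + i * g                                   # updated cell state
    h = o * math.tanh(c)                                     # updated hidden state
    return h, c

# All parameters zero except the gate biases.
params = {name: 0.0 for name in
          ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]}
params["bf"] = 10.0    # forget gate close to 1: keep the memory
params["bi"] = -10.0   # input gate close to 0: write almost nothing

h, c = 0.0, 0.7
for _ in range(50):
    h, c = lstm_step(1.0, h, c, params)
```

After 50 steps the cell state is still close to its starting value of 0.7, which is the "remember over long sequences" behavior the gates make possible.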
1. Hidden State (hₜ): This is the internal memory of the RNN at a given time step t. It's a
vector that encapsulates information from all previous steps in the sequence. The
hidden state is updated at each step using the current input and the previous hidden
state.
2. Shared Weights: An RNN uses the same set of weight matrices for every time step of
the input sequence. This parameter sharing makes the network efficient and allows it
to generalize its learning across different positions in the sequence.
Transformers were needed for vision to overcome a key limitation of CNNs: modeling long-
range, global dependencies. While CNNs are excellent at learning local features through their
convolutional filters, they struggle to capture relationships between distant parts of an image.
The self-attention mechanism in Transformers allows the model to weigh the importance of all
input patches relative to each other, regardless of their position. This enables a more holistic
understanding of image context, leading to state-of-the-art performance.
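A minimal sketch of self-attention over image patches, assuming scaled dot-product attention with identity query/key/value projections (a simplification; real models learn these projections) and made-up 2-dimensional patch vectors.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(patches):
    """Each patch attends to every patch, regardless of position."""
    d = len(patches[0])
    outputs, all_weights = [], []
    for q in patches:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in patches]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, patches))
                        for j in range(d)])
    return outputs, all_weights

# Patches 0 and 2 look alike but are not adjacent; patch 1 sits between them.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(patches)
```

Patch 0 weights the distant but similar patch 2 as heavily as itself, and more heavily than its neighbor patch 1, because the scores depend on content rather than distance.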
3. Difficulty with Very Long Sequences: Even with LSTMs and GRUs, retaining precise
information from the very distant past remains a practical challenge.
7. What is a Swin Transformer, and how does it differ from traditional RNNs?
A Swin (Shifted Window) Transformer is a modern vision transformer architecture that efficiently
processes images by introducing a hierarchical structure and local attention. It differs from
traditional RNNs in two fundamental ways. First, Swin Transformers can process image patches
in parallel using self-attention, whereas RNNs are inherently sequential and must process
data one element at a time. Second, Swin Transformers are built for spatial data (images), while
traditional RNNs are designed for 1D temporal sequences (like text or time series).
5 Marks
1. Categorize the various concepts of text classification using RNNs and how they can be
trained to classify text into different categories or labels.
Text classification using RNNs is fundamentally a many-to-one sequence processing task. The
core concept is to encode an entire variable-length sequence of text (the "many" part) into a
single fixed-size vector representation, which is then used for classification (the "one" part). A
popular variation is using a bidirectional RNN, which processes the text both forward and
backward to capture context from both directions, often improving accuracy.
1. Text Preprocessing: The raw text is cleaned, tokenized into individual words or sub-
words, and a vocabulary is built. Each token is mapped to a unique integer index.
2. Embedding Layer: These integer indices are converted into dense vector
representations called embeddings (e.g., using GloVe, Word2Vec, or a learnable
embedding layer). These vectors capture semantic relationships between words.
3. RNN Layer: The sequence of word embeddings is fed one-by-one into an RNN (typically
an LSTM or GRU). The network updates its hidden state at each step, accumulating
contextual information from the sequence.
4. Classification: The final hidden state from the last time step (which serves as a
summary of the entire sentence) is passed to a standard feedforward (Dense) layer with
a Softmax activation function. This final layer outputs a probability distribution over the
different categories or labels.
5. Training: The model is trained using backpropagation to minimize a loss function like
Categorical Cross-Entropy, which measures the difference between the predicted
probabilities and the true one-hot encoded labels.
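The first and last stages of this pipeline can be sketched without a framework. This example assumes whitespace tokenization, a hypothetical `<unk>` token for unseen words, and made-up logits for a two-class problem; the embedding and RNN layers are omitted.

```python
import math

def build_vocab(texts):
    """Map each distinct token to a unique integer index."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Convert text to integer indices; unknown tokens map to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Categorical cross-entropy against a one-hot label."""
    return -math.log(probs[true_index])

vocab = build_vocab(["great movie", "terrible movie"])
ids = encode("great terrible film", vocab)   # "film" was never seen

probs = softmax([2.0, 0.5])                  # made-up logits for two classes
loss = cross_entropy(probs, true_index=0)
```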
2. Explain the basic architecture of an RNN. What do you mean by rolled and unrolled
recurrent neural networks?
• Rolled RNN: This is the compact, conceptual representation showing the RNN cell with
a feedback loop. This visualization emphasizes the recurrent nature of the network and
the concept of parameter sharing across time steps. It clearly shows how the hidden
state h is updated in place.
• Unrolled RNN: This representation "unrolls" the loop across the time dimension,
showing the RNN as a deep feedforward network where each time step is a new layer.
This view is essential for understanding the flow of information and gradients during
training (i.e., backpropagation through time). It makes it clear that the output at a given
step depends on all preceding steps.
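The two views describe the same computation, which a small sketch can make concrete. This assumes a scalar RNN cell h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) with made-up weights and a three-step input.

```python
import math

def rnn_cell(x, h_prev, w_x, w_h, b):
    """One recurrent update using the same weights at every step."""
    return math.tanh(w_x * x + w_h * h_prev + b)

W_X, W_H, B = 0.5, 0.8, 0.0   # shared across all time steps
xs = [1.0, -1.0, 0.5]

# Rolled view: one cell, hidden state updated in place inside a loop.
h = 0.0
for x in xs:
    h = rnn_cell(x, h, W_X, W_H, B)
h_rolled = h

# Unrolled view: one "layer" per time step, each reusing the same weights.
h1 = rnn_cell(xs[0], 0.0, W_X, W_H, B)
h2 = rnn_cell(xs[1], h1, W_X, W_H, B)
h3 = rnn_cell(xs[2], h2, W_X, W_H, B)
```

`h_rolled` and `h3` are identical; the unrolled form just makes explicit the chain that gradients follow during backpropagation through time.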
3. Consider the input "feeling under the weather." and explain step by step execution of
this input with RNN.
Here is the step-by-step execution for classifying the sentiment of "feeling under the weather."
using an RNN:
2. Initialization: The RNN initializes its first hidden state h₀ to a vector of zeros.
3. Execution at Time Step 1 (word: "feeling"):
o The RNN cell takes the embedding for "feeling" (x₁) and the initial hidden state
h₀.
4. Execution at Time Step 2 (word: "under"):
o The cell takes the embedding for "under" (x₂) and the previous hidden state h₁.
5. Continued Execution: This process repeats for "the", "weather", and ".". At each step t,
the hidden state hₜ is a function of the current word xₜ and the accumulated context
from all previous words stored in hₜ₋₁.
6. Final Classification: After the last token ("."), the final hidden state h₅ is produced. This
vector represents a contextual summary of the entire sentence. This h₅ is then fed into a
final feedforward (Dense) layer with a Softmax/Sigmoid activation function to output a
probability for the class "negative sentiment".
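The walk-through above can be traced in code. This is a sketch with untrained, made-up 2-dimensional embeddings and weights, so the resulting probability is not meaningful; it only shows how the hidden state is carried from token to token and then fed to a sigmoid output.

```python
import math

def rnn_step(x_vec, h_prev, w_x, w_h):
    """h_t = tanh(W_x x_t + W_h h_{t-1}) for small list-based vectors."""
    return [
        math.tanh(sum(a * x for a, x in zip(w_x[i], x_vec))
                  + sum(a * h for a, h in zip(w_h[i], h_prev)))
        for i in range(len(h_prev))
    ]

tokens = ["feeling", "under", "the", "weather", "."]

# Hypothetical embeddings and weights (a trained model would learn these).
embeddings = {
    "feeling": [0.2, 0.7], "under": [-0.5, 0.1], "the": [0.0, 0.1],
    "weather": [-0.3, -0.4], ".": [0.0, 0.0],
}
W_X = [[0.5, -0.3], [0.2, 0.7]]
W_H = [[0.6, 0.1], [-0.2, 0.5]]
W_OUT, B_OUT = [1.0, -1.0], 0.0

h = [0.0, 0.0]                      # h0: vector of zeros
states = []
for token in tokens:                # produces h1 ... h5
    h = rnn_step(embeddings[token], h, W_X, W_H)
    states.append(h)

logit = sum(w * v for w, v in zip(W_OUT, h)) + B_OUT
p_negative = 1.0 / (1.0 + math.exp(-logit))   # sigmoid over the final state h5
```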
4. Evaluate the impact of sentiment analysis using RNNs on content moderation in social
media.
The impact of using RNN-based sentiment analysis for content moderation is significant, with
both powerful benefits and considerable challenges.
• Scalability and Automation: Social media platforms generate vast amounts of content
impossible for humans to moderate manually. RNNs automate the detection of toxic,
abusive, or harmful content at an immense scale, in real-time.
• Bias and Fairness: If trained on biased data, the models can disproportionately flag
content from minority groups or certain political viewpoints, leading to unfair
censorship.
• Lack of Nuance for Sarcasm and Irony: RNNs still struggle to reliably understand
complex human expressions like sarcasm or irony, which can lead to both false
positives (flagging jokes) and false negatives (missing veiled threats).
5. Compare the architecture of GRUs and LSTMs and their advantages in handling long-
term dependencies.
Comparison of Architecture:
• The LSTM's primary advantage comes from its dedicated cell state. This acts as an
uninterrupted "information superhighway" where gradients can flow easily across many
time steps. The gates precisely control the information flowing into and out of this
highway, but the path itself remains direct, preventing gradients from vanishing.
• The GRU's advantage is its efficiency. It achieves a similar effect with fewer moving
parts. Its update gate can learn to pass the previous hidden state almost directly to the
next, effectively creating the same kind of shortcut path for gradients and long-term
information as the LSTM's cell state.
In summary, both are highly effective, with the LSTM being more explicit and complex, while the
GRU is more efficient and has shown comparable performance on many tasks.
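The GRU shortcut path can be sketched with a scalar cell. This assumes the convention h_t = z * h_{t-1} + (1 - z) * candidate (some references swap the roles of z and 1 - z); the parameter names and values are hypothetical and chosen so the update gate stays close to 1.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One time step of a scalar GRU cell (two gates, no separate cell state)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])          # reset gate
    n = math.tanh(p["wn"] * x + p["un"] * (r * h_prev) + p["bn"])  # candidate state
    return z * h_prev + (1.0 - z) * n

params = {name: 0.0 for name in
          ["wz", "uz", "bz", "wr", "ur", "br", "wn", "un", "bn"]}
params["bz"] = 10.0   # update gate close to 1: copy the previous state forward
params["wn"] = 1.0    # the candidate still depends on the input

h = 0.7
for _ in range(50):
    h = gru_step(1.0, h, params)
```

After 50 steps `h` has barely moved from 0.7, mirroring what an open forget gate achieves for the LSTM cell state.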
1. Input: A sentence or document is taken as input (e.g., "This movie was absolutely
fantastic!").
2. Tokenization and Embedding: The text is broken down into a sequence of words
(tokens). Each token is then converted into a numerical vector via an embedding layer,
which represents the word's meaning.
3. Sequential Processing with RNN: The sequence of word vectors is fed into an RNN (like
an LSTM or GRU) one word at a time. The RNN reads the sequence and updates its
internal hidden state at each step, building a vector representation that captures the
cumulative meaning and context of the sentence up to that point.
4. Feature Extraction: After processing the entire sentence, the RNN's final hidden state
is used as a fixed-size summary vector. This vector encapsulates the overall sentiment
of the text.
The significance of Swin Transformers lies in their ability to make the powerful Transformer
architecture practical and efficient for computer vision tasks, which was a major challenge with
the original Vision Transformer (ViT). They reduce computational costs through two primary
innovations:
2. Shifted Window Mechanism and Hierarchical Design: To allow for information to flow
between windows (and thus learn global context), Swin introduces a shifted window
approach. In alternating layers, the window configuration is shifted, so that subsequent
attention operations are performed on different groupings of patches. This allows for
cross-window connections without the quadratic cost of global attention. This,
combined with a hierarchical design that merges patches to reduce spatial resolution in
deeper layers (similar to a CNN), further enhances computational efficiency and makes
the architecture compatible with standard vision backbones.
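The cost difference between global and windowed attention can be estimated by counting pairwise attention scores. This sketch assumes a 56x56 grid of patches and 7x7 windows as illustrative values, and it counts score pairs only, ignoring projection and feedforward costs.

```python
def global_attention_pairs(height, width):
    """Every patch attends to every patch: N * N scores."""
    n = height * width
    return n * n

def window_attention_pairs(height, width, window):
    """Attention restricted to non-overlapping window x window groups."""
    num_windows = (height // window) * (width // window)
    patches_per_window = window * window
    return num_windows * patches_per_window ** 2

global_pairs = global_attention_pairs(56, 56)
window_pairs = window_attention_pairs(56, 56, 7)
ratio = global_pairs // window_pairs
```

For these values the windowed scheme computes 64 times fewer scores. Shifting the windows in alternating layers changes which patches share a window but not the count per layer.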