1. Why is the choice of activation function crucial in shaping a neural network's performance?

The activation function introduces non-linearity into the model, enabling the neural network to learn complex patterns and relationships in data. Without it, the network behaves like a linear model regardless of depth.

2. Why is Mean Squared Error (MSE) commonly used as a loss function in deep learning?

MSE is widely used because it penalizes larger errors more heavily, encouraging precise predictions. It is differentiable and computationally efficient for regression problems.

3. How does adjusting the learning rate impact the efficiency of the backpropagation algorithm?

A higher learning rate speeds up learning but may overshoot minima, while a lower rate ensures stability but slows convergence. Proper tuning helps achieve efficient and accurate learning.

4. Mention the significance of BAM in neural networks.

Bidirectional Associative Memory (BAM) stores pattern pairs and recalls outputs from given inputs using a bidirectional associative mechanism, which is useful in memory-based neural models.

5. What are the different types of pooling in CNNs, and how do they influence feature extraction?

Common pooling types include max pooling, average pooling, and global pooling. Pooling reduces spatial dimensions, extracts dominant features, and provides translation invariance.

6. Why might increasing the filter size in a CNN lead to a loss of fine image details?

Larger filters cover broader areas, smoothing out high-frequency details such as edges and textures, and thus lose fine-grained information crucial for precise feature detection.

7. In what key ways do RNNs and CNNs differ in processing data?

CNNs process spatial data (e.g., images) by capturing local features using filters, while RNNs handle sequential data (e.g., text or time series) by maintaining temporal dependencies through hidden states.

8. How does the forget gate in an LSTM improve learning efficiency?

The forget gate selectively removes irrelevant information from the cell state, preventing long-term memory clutter and enhancing the network's ability to learn relevant patterns over time.

9. What is the primary role of Generative Adversarial Networks (GANs) in deep learning?

GANs generate realistic synthetic data by training two models (a generator and a discriminator) in a competitive setup; they are widely used for data augmentation, image generation, and more.

10. Why is BERT widely recognized as a breakthrough architecture in NLP-based Neural Network Applications?

BERT uses bidirectional context from transformers, enabling it to understand word meanings based on full sentence context. It significantly improved performance on many NLP tasks such as question answering and sentiment analysis.
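The brief answers to questions 2 and 3 can be made concrete with a minimal NumPy sketch (all values are illustrative, not from the notes): it computes MSE on a toy one-parameter regression and takes a single gradient-descent step with a small and a large learning rate, showing the overshoot described above.

python
import numpy as np

# Toy regression data (illustrative values): predict y from x with a single weight w.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # true relationship: y = 2x

def mse(w):
    # Mean Squared Error: larger errors are penalized quadratically.
    return np.mean((y - w * x) ** 2)

def grad(w):
    # d(MSE)/dw, derived analytically for this one-parameter model.
    return np.mean(-2 * x * (y - w * x))

w0 = 0.0
for lr in (0.01, 0.2):
    w1 = w0 - lr * grad(w0)          # one gradient-descent step
    print(f"lr={lr}: w {w0} -> {w1:.3f}, MSE {mse(w0):.2f} -> {mse(w1):.2f}")

With the small rate the loss decreases slowly but safely; with the large rate the step jumps past the minimum and the loss grows.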
1. Multi-Layer Perceptron (MLP) Model for Laptop Price Classification

Problem Statement:
Classify laptops into three price categories: low, medium, and high, based on features such as processor speed, RAM, brand, screen size, SSD presence, and GPU availability.

Proposed MLP Architecture:

Layer            Details
Input Layer      6 neurons (one for each feature, e.g., RAM, CPU speed)
Hidden Layer 1   64 neurons, ReLU activation
Hidden Layer 2   32 neurons, ReLU activation
Output Layer     3 neurons, Softmax activation (for 3 classes)

python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, input_dim=6, activation='relu'),   # Hidden Layer 1
    Dense(32, activation='relu'),                # Hidden Layer 2
    Dense(3, activation='softmax')               # Output Layer (3 price classes)
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Justification:
•   ReLU Activation: Prevents vanishing gradients and allows deep networks to learn complex patterns.
•   Softmax: Ideal for multi-class classification; normalizes outputs to probabilities.
•   Adam Optimizer: Combines momentum and RMSProp for adaptive learning rates and faster convergence.
•   Loss Function: Categorical Crossentropy is standard for multi-class classification.

Benefits of MLP:
•   Learns non-linear boundaries between classes.
•   Works well for structured/tabular data.
•   Can handle categorical features with embedding or one-hot encoding.

2. Activation Functions and Their Properties

Sigmoid Function:
•   Formula: \sigma(x) = \frac{1}{1 + e^{-x}}
•   Input Range: (-∞, ∞)
•   Output Range: (0, 1)
•   Use Case: Binary classification, last-layer activation.
•   Limitation: Causes vanishing gradients when inputs are very large or very small.

ReLU (Rectified Linear Unit):
•   Formula: f(x) = \max(0, x)
•   Input Range: (-∞, ∞)
•   Output Range: [0, ∞)
•   Use Case: Hidden layers; introduces sparsity.
•   Limitation: Dying ReLU (neurons stuck at zero when inputs < 0).

Softmax Function:
•   Formula: \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
•   Input Range: Vector of real numbers
•   Output Range: Probabilities summing to 1
•   Use Case: Multi-class classification output layer.
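A minimal NumPy sketch of the three activations above (my own illustrative implementation, not from the notes), showing their output ranges on a small input vector:

python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zeroes out negative inputs, passes positives through: range [0, inf).
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting the max improves numerical stability; outputs sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(sigmoid(x))        # values strictly between 0 and 1
print(relu(x))           # [0. 0. 0. 1. 3.]
print(softmax(x))        # non-negative, sums to 1.0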
3. Forward vs Backward Propagation

Forward Propagation:
•   Pass input through the layers.
•   Compute outputs via weighted sums and activation functions.
•   The loss function compares the predicted output with the actual label.

Backward Propagation:
•   Uses the chain rule of calculus.
•   Calculates gradients of the loss with respect to the weights.
•   Updates weights using gradient descent.

Why Both Are Needed:
•   Forward propagation gives the prediction and computes the error.
•   Backward propagation corrects the model by minimizing this error.
•   One without the other results in either no learning or no objective.
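The interaction of the two passes can be seen in a minimal NumPy sketch (a single sigmoid neuron with MSE loss; all values are illustrative assumptions):

python
import numpy as np

x = np.array([0.5, -1.0, 2.0])       # one input sample with 3 features
t = 1.0                              # target label
w = np.array([0.1, 0.2, -0.1])       # weights
b = 0.0                              # bias
lr = 0.5                             # learning rate

# Forward propagation: weighted sum -> activation -> loss.
z = w @ x + b
y = 1.0 / (1.0 + np.exp(-z))         # sigmoid activation
loss = 0.5 * (y - t) ** 2
print("prediction:", y, "loss:", loss)

# Backward propagation: chain rule dL/dw = dL/dy * dy/dz * dz/dw.
dL_dy = (y - t)
dy_dz = y * (1.0 - y)                # derivative of the sigmoid
grad_w = dL_dy * dy_dz * x
grad_b = dL_dy * dy_dz

# Gradient-descent update.
w -= lr * grad_w
b -= lr * grad_b
print("updated weights:", w)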
4. Optimizers in Neural Networks

Role of Optimizers:
•   Minimize the loss function.
•   Control the learning dynamics of the network.

A. Stochastic Gradient Descent (SGD)
•   Basic optimizer.
•   May get stuck in local minima.

B. Momentum
•   Adds inertia to SGD.
•   Helps escape shallow minima.

C. RMSProp
•   Scales the learning rate adaptively using a moving average of squared gradients.
•   Suitable for RNNs.

D. Adam
•   Combines Momentum + RMSProp.
•   Popular for all neural networks.

Comparison Table:

Optimizer   Adaptive LR   Momentum   Use Case
SGD         No            No         Simple problems
RMSProp     Yes           No         RNNs
Adam        Yes           Yes        Most deep models
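A side-by-side NumPy sketch of one update step for each optimizer above (single parameter, commonly used default hyperparameters assumed for illustration; not from the notes):

python
import numpy as np

# One update step of each optimizer on a single parameter w with gradient g.
w, g = 1.0, 0.4
lr = 0.01

# SGD: plain gradient step.
w_sgd = w - lr * g

# Momentum: a velocity term accumulates past gradients (v starts at 0 here).
beta, v = 0.9, 0.0
v = beta * v + g
w_mom = w - lr * v

# RMSProp: divide by a moving average of squared gradients (s starts at 0).
rho, eps, s = 0.9, 1e-8, 0.0
s = rho * s + (1 - rho) * g ** 2
w_rms = w - lr * g / (np.sqrt(s) + eps)

# Adam: momentum + RMSProp, with bias correction on the first step (t = 1).
beta1, beta2, m, u, t = 0.9, 0.999, 0.0, 0.0, 1
m = beta1 * m + (1 - beta1) * g
u = beta2 * u + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** t)
u_hat = u / (1 - beta2 ** t)
w_adam = w - lr * m_hat / (np.sqrt(u_hat) + eps)

print(w_sgd, w_mom, w_rms, w_adam)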
Detailed answers follow for these deep learning topics:
•   5. Types of Pooling in CNNs
•   6. Why increasing filter size can cause a loss of fine details
•   7. Difference between RNNs and CNNs in data processing
•   8. Forget gate in LSTM and its impact on learning
•   9. Role of GANs in deep learning
•   10. BERT as a breakthrough in NLP

5. What are the different types of pooling in CNNs, and how do they influence feature extraction?

Pooling is a downsampling technique used in CNNs to reduce the spatial dimensions (width and height) of feature maps, controlling overfitting and computational load.

Purpose of Pooling:
•   Reduce dimensionality and computation.
•   Retain essential features while discarding noise.
•   Introduce translation invariance to the model.

Types of Pooling:

A. Max Pooling
•   Mechanism: Selects the maximum value from each patch of the feature map.
•   Effect: Retains the strongest activation; great for edge and texture detection.
•   Formula: y_{i,j} = \max(x_{m,n}) \text{ for } m, n \in \text{patch around } (i, j)

B. Average Pooling
•   Mechanism: Averages the values within the patch.
•   Effect: Smooths the feature map and preserves background information better.
•   Use Case: Used in early CNNs like LeNet; modern networks use it for global pooling.

C. Global Pooling
•   Global max or average pooling over the entire feature map.
•   Used before the output layer to reduce each feature map to a single number.
•   Prevents overfitting; fewer parameters than fully connected layers.

D. Lp Pooling
•   Generalized pooling where y = \left( \sum x^p \right)^{1/p}
•   Special cases: p = 1 → average pooling; p = ∞ → max pooling.

Influence on Feature Extraction:

Pooling Type     Characteristics         Best Use
Max Pooling      Keeps strong features   Edges/textures
Avg Pooling      Smooth representation   Backgrounds
Global Pooling   Removes spatial info    Classification, efficient CNNs
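A short NumPy sketch of max, average, and global pooling on a toy 4x4 feature map (the values are illustrative, not from the notes):

python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [5, 6, 1, 2],
                 [0, 1, 9, 4],
                 [2, 1, 3, 8]], dtype=float)

def pool(x, size=2, op=np.max):
    # Non-overlapping pooling with a size x size window and stride equal to size.
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h, size):
        for j in range(0, w, size):
            out[i // size, j // size] = op(x[i:i + size, j:j + size])
    return out

print(pool(fmap, op=np.max))    # max pooling keeps the strongest activation per patch
print(pool(fmap, op=np.mean))   # average pooling smooths each patch
print(fmap.max(), fmap.mean())  # global max / average pooling: one number per feature map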
6. Why might increasing the filter size in a CNN lead to a loss of fine image details?

Role of Filter Size in Feature Detection:
•   Filters (kernels) are responsible for extracting features such as edges, corners, and textures.
•   Common filter sizes: 3x3, 5x5, 7x7.

Effect of Larger Filters:
•   Cover larger spatial areas in one pass.
•   Capture broader features but lose local granularity.

Fine Details Missed:
•   Small patterns such as dots, thin edges, and noise.
•   These details often lie in small pixel differences; larger filters average them out.

Comparative View:

Filter Size   Captures                     Misses
3x3           Fine textures, sharp edges   High-level patterns
7x7           Abstract shapes, context     Fine image textures

Best Practice:
•   Stack smaller filters (e.g., two 3x3s ≈ one 5x5) to keep the receptive field while preserving detail.
•   Stacking also adds more non-linearity and better feature richness.
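The averaging-out effect can be seen with a tiny NumPy sketch (an illustrative 9x9 image with one bright pixel, not from the notes): the larger the averaging filter, the weaker its response to the fine detail.

python
import numpy as np

# A single bright pixel in an otherwise flat image plays the role of a "fine detail".
img = np.zeros((9, 9))
img[4, 4] = 1.0

def box_filter_response(image, k):
    # Response of a k x k averaging (box) filter centred on the bright pixel.
    half = k // 2
    patch = image[4 - half:4 + half + 1, 4 - half:4 + half + 1]
    return patch.mean()

for k in (3, 5, 7):
    print(f"{k}x{k} average filter response: {box_filter_response(img, k):.3f}")
# 3x3 -> 0.111, 5x5 -> 0.040, 7x7 -> 0.020: the larger the filter, the more the detail is washed out.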
7. How do RNNs and CNNs differ in processing data?

Core Idea:
•   CNNs process data spatially.
•   RNNs process data sequentially (temporally).

CNNs: Convolutional Neural Networks
•   Input: Fixed-size 2D grids (e.g., images).
•   Learn spatial hierarchies (edges → textures → shapes).
•   Weight sharing across spatial dimensions.

Example:
•   Image classification, object detection, medical scans.

RNNs: Recurrent Neural Networks
•   Input: Sequences (e.g., time series, text).
•   Maintain a hidden state/memory across time steps.
•   Handle variable-length input.

Example:
•   Text generation, speech recognition, stock prediction.

Comparative Table:

Feature             CNN                  RNN
Input               2D (images)          1D sequences
Order sensitivity   Low                  High
Memory              No internal memory   Maintains hidden state
Use Case            Vision               NLP, time series
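The contrast also shows up in how the two model types are declared. A minimal Keras sketch in the same style as the MLP example earlier (layer sizes and input shapes are illustrative assumptions):

python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, SimpleRNN

# CNN: fixed-size 2D input (e.g., a 28x28 grayscale image); filters slide over space.
cnn = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),
])

# RNN: sequence input (e.g., 50 time steps of 8 features); the hidden state carries memory.
rnn = Sequential([
    SimpleRNN(32, input_shape=(50, 8)),
    Dense(1),
])

cnn.summary()
rnn.summary()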
8. How does the forget gate in an LSTM improve learning efficiency?

Problem in Standard RNNs:
•   Struggle to retain long-term dependencies.
•   Gradients either vanish or explode.

LSTM Architecture Overview:
An LSTM has three gates:
1.  Forget Gate f_t
2.  Input Gate i_t
3.  Output Gate o_t

Forget Gate Mechanics:
•   Formula: f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
•   f_t ∈ [0, 1]: determines how much of the previous memory to keep.
•   If f_t → 1: keep the memory.
•   If f_t → 0: forget it completely.

Impact on Learning:
•   Helps avoid memorizing irrelevant patterns.
•   Enables selective memory retention.
•   Better gradient flow → improved long-term learning.

Example:
In the sentence "The weather in Paris is cold but not snowy.", to predict "snowy" the LSTM might forget irrelevant words like "Paris" but retain "cold".
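A NumPy sketch of the forget-gate computation above (dimensions and random weights are illustrative assumptions; the input and output gates are omitted for brevity):

python
import numpy as np

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # forget-gate weights
b_f = np.zeros(hidden_size)                                     # forget-gate bias

h_prev = rng.normal(size=hidden_size)   # previous hidden state h_{t-1}
x_t = rng.normal(size=input_size)       # current input x_t
c_prev = rng.normal(size=hidden_size)   # previous cell state (long-term memory)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f), with every entry in (0, 1).
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
c_partial = f_t * c_prev   # entries of c_prev with f_t near 0 are effectively erased

print("forget gate:", np.round(f_t, 2))
print("cell state after forgetting:", np.round(c_partial, 2))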
9. What is the primary role of Generative Adversarial Networks (GANs) in deep learning?

Goal:
GANs aim to generate realistic data by learning from an existing dataset.

GAN Architecture:
•   Generator (G): Creates fake data from noise.
•   Discriminator (D): Classifies real vs. fake data.

They play a minimax game:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

How It Works:
•   The Generator improves to fool the Discriminator.
•   The Discriminator improves to detect fakes.
•   Both get better over time, resulting in realistic data generation.

Applications of GANs:
•   Face generation (ThisPersonDoesNotExist).
•   Super-resolution images.
•   Art and music synthesis.
•   Data augmentation in low-data scenarios.
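The minimax value V(D, G) above can be evaluated directly for toy discriminator outputs; a small NumPy sketch (the probabilities are illustrative, not from the notes):

python
import numpy as np

d_real = np.array([0.9, 0.8, 0.95])   # D(x): discriminator's score on real samples
d_fake = np.array([0.1, 0.2, 0.05])   # D(G(z)): discriminator's score on generated samples

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
value = np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))
print("V(D, G):", round(value, 3))

# The discriminator is trained to maximize V (score real samples near 1, fakes near 0).
# The generator is trained to minimize V, i.e., to push D(G(z)) towards 1 so that its
# fakes are mistaken for real data.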
10. Why is BERT widely recognized as a breakthrough in NLP-based Neural Network Applications?

BERT = Bidirectional Encoder Representations from Transformers

Why It's Revolutionary:

A. Bidirectional Context
•   Unlike older models (e.g., GPT), BERT reads left and right context together.
•   Deeply understands contextual meaning.

B. Pre-training + Fine-tuning Paradigm
•   Pre-trained on a large corpus (Wikipedia + books).
•   Fine-tuned on tasks such as sentiment analysis, question answering, and NER.

Key Components:
•   Based on the Transformer encoder (not the decoder).
•   Input: WordPiece embeddings + positional encodings.
•   Uses Masked Language Modeling (MLM) to learn context.
•   Uses Next Sentence Prediction (NSP) for
    understanding relationships.
Applications:
•   Sentiment classification
•   Named Entity Recognition (NER)
•   Question answering (like SQuAD)
•   Document search and ranking (Google
    Search uses it!)
Real-world Impact:
•   Reduced task-specific architecture
    engineering.
•   State-of-the-art results on 11 NLP tasks.
•   Foundation of models like RoBERTa,
    DistilBERT, ALBERT.
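Masked Language Modeling can be tried directly with the Hugging Face transformers library, assuming it is installed; this snippet is an illustration added here, not part of the original notes. It reuses the sentence from the LSTM example above with one word masked.

python
from transformers import pipeline

# Load a pre-trained BERT model for the fill-mask task (downloads weights on first run).
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left and right context to predict the masked word.
for prediction in unmasker("The weather in Paris is [MASK] but not snowy."):
    print(prediction["token_str"], round(prediction["score"], 3))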