ISE-1 Imp DLPDF
Question:
How would you reflect on the key differences between Machine Learning and Deep Learning in terms of
their applicability to real-world problems?
Answer:
   •   Machine Learning (ML) typically relies on manually engineered features and simpler models. It
       works well when you have structured data and can identify important variables by hand.
   •   Deep Learning (DL) uses neural networks with multiple layers to automatically learn hierarchical
       features. This makes DL especially powerful for complex tasks like image recognition, natural
       language processing, and speech analysis where the raw data is high-dimensional.
   •   In real-world applications, ML can be more interpretable and faster to train on smaller datasets,
       while DL often requires large amounts of data and computational power but can capture much more
       complex patterns.
Question:
How would you reflect on the importance of activation functions in shaping the performance of Artificial
Neural Networks in real-life applications?
Answer:
   •   Activation functions introduce non-linearity into the network, allowing it to learn complex
       relationships that a simple linear model cannot capture.
   •   They determine whether a neuron should be activated or not, effectively controlling the signal flow
       through the network.
   •   Different functions (like ReLU, Sigmoid, and Tanh) have unique properties that impact issues such
       as vanishing gradients, training speed, and model accuracy in real-world applications.
Question:
List and explain the basic building blocks of deep learning.
Answer:
   •   Neurons: The basic units that receive inputs, perform a weighted sum, and apply an activation
       function.
   •   Layers: Groupings of neurons; include an input layer, one or more hidden layers, and an output
       layer.
   •   Weights and Biases: Parameters that are learned during training; they adjust how inputs are
       transformed as they pass through the network.
   •   Activation Functions: Functions such as ReLU, Sigmoid, and Tanh that introduce non-linearity.
   •   Loss Function: Measures the difference between the predicted output and the true output, guiding
       the training process.
   •   Optimization Algorithms: Methods like Gradient Descent that update weights and biases to
       minimize the loss.
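To see how these pieces fit together, here is a minimal sketch in Python/NumPy (the specific numbers and helper names such as relu and mse_loss are illustrative assumptions, not from the source):

    import numpy as np

    def relu(z):
        # Activation function: non-linearity applied to the weighted sum
        return np.maximum(0.0, z)

    def mse_loss(y_pred, y_true):
        # Loss function: measures how far the prediction is from the target
        return np.mean((y_pred - y_true) ** 2)

    x = np.array([0.5, -1.2, 3.0])   # inputs
    w = np.array([0.4, 0.1, -0.6])   # weights (learned during training)
    b = 0.2                          # bias (learned during training)

    z = np.dot(w, x) + b             # weighted sum performed by the neuron
    a = relu(z)                      # neuron output after the activation
    print("output:", a, "loss:", mse_loss(a, 1.0))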
4. Meaning of ReLU
Question:
What does ReLU stand for in the context of activation functions?
Answer:
ReLU stands for Rectified Linear Unit. It is widely used because of its simplicity and efficiency in
allowing networks to converge faster by outputting the input directly if it is positive, and zero otherwise.
Question:
Which activation function is most commonly used in deep learning models due to its simplicity and
efficiency?
Answer:
The ReLU (Rectified Linear Unit) is the most common activation function because it is simple to compute
and helps mitigate issues like the vanishing gradient problem.
Question:
What is the primary purpose of the loss function in training a neural network?
Answer:
The loss function measures how far the network’s predictions are from the actual values. It provides a
quantitative way to adjust the network's weights and biases during training, so that the overall error is
minimized.
Question:
A neural network with no hidden layers is equivalent to which type of model?
Answer:
A network without hidden layers essentially performs a linear transformation of the input data. This is
equivalent to a linear model (e.g., linear regression for continuous outputs or logistic regression for binary
classification).
8. Designing an MP Neuron for the AND Function
Question:
Design an MP (McCulloch-Pitts) neuron that implements the AND logic function using numerical values.
Answer:
Set both input weights to 1 and the threshold to 2: the neuron outputs 1 only when 1·x_1 + 1·x_2 ≥ 2, which happens only for the input (1, 1). This reproduces the AND truth table.
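This can be checked with a quick sketch in Python (the helper name mp_and and the chosen weights/threshold are one standard parameterization; any values satisfying the inequalities work):

    def mp_and(x1, x2, w1=1, w2=1, theta=2):
        # McCulloch-Pitts neuron: fire (1) iff the weighted sum reaches the threshold
        return int(w1 * x1 + w2 * x2 >= theta)

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "->", mp_and(x1, x2))   # only (1, 1) -> 1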
Question:
How would you demonstrate the impact of choosing the right activation function on the efficiency and
performance of a neural network?
Answer:
   •   Experimentation: Compare models with different activation functions (e.g., ReLU vs. Sigmoid)
       using the same dataset.
   •   Visualization: Plot training curves and observe convergence speed, final accuracy, and issues like
       the vanishing gradient.
   •   Performance Metrics: Evaluate metrics such as accuracy, loss, and training time. The right choice
       often leads to faster convergence and improved accuracy.
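A minimal sketch of such an experiment, assuming PyTorch is available (the XOR-style toy dataset, network sizes, and hyperparameters are illustrative choices, not from the source):

    import torch
    import torch.nn as nn

    # Toy non-linear dataset: XOR of two binary inputs
    X = torch.randint(0, 2, (256, 2)).float()
    y = (X[:, 0] != X[:, 1]).float().unsqueeze(1)

    for name, act in [("ReLU", nn.ReLU()), ("Sigmoid", nn.Sigmoid())]:
        torch.manual_seed(0)                      # same initialization for a fair comparison
        model = nn.Sequential(nn.Linear(2, 16), act, nn.Linear(16, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
        loss_fn = nn.BCEWithLogitsLoss()
        for step in range(500):
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
        print(f"{name}: final training loss = {loss.item():.4f}")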
Question:
Describe the computational flow in a Multilayer Perceptron (MLP).
Answer:
Question:
How is the ReLU activation function mathematically represented?
Answer:
The ReLU function is defined as:
    f(x) = max(0, x)
That is, it returns x when x > 0 and 0 otherwise.
Question:
Compare the behavior of the Sigmoid and Tanh activation functions.
Answer:
   •   Sigmoid:
          o Output range: (0, 1)
          o Can cause vanishing gradients in deep networks
   •   Tanh:
          o Output range: (-1, 1), which is zero-centered
          o Often converges faster in practice due to its centered output
   •   Both functions introduce non-linearity but differ in scale and gradient behavior.
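For reference, the two functions can be written as:
    sigmoid(x) = 1 / (1 + e^(-x))                                  range (0, 1)
    tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2·sigmoid(2x) - 1   range (-1, 1)
The last identity makes explicit that Tanh is a rescaled, zero-centered version of Sigmoid.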
Question:
How does the choice of activation function influence the depth of a Neural Network?
Answer:
Activation functions that keep gradients flowing (such as ReLU) make it practical to train deeper networks, whereas saturating functions (such as Sigmoid) suffer from vanishing gradients that limit the effective depth a network can reach.
Question:
How would you justify the use of Leaky ReLU over standard ReLU in certain cases?
Answer:
   •   Standard ReLU outputs zero for all negative inputs, which can lead to the dying ReLU problem
       where neurons stop updating.
   •   Leaky ReLU assigns a small, non-zero slope (e.g., 0.01) for negative inputs, ensuring that gradients
       continue to flow and reducing the risk of neurons becoming inactive.
   •   This makes Leaky ReLU advantageous in networks where many neurons might otherwise become
       “dead.”
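A small numerical illustration of the difference (a sketch using NumPy; the slope 0.01 is a common default, not a fixed part of the definition):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # Negative inputs keep a small slope instead of being zeroed out
        return np.where(x > 0, x, alpha * x)

    x = np.array([-3.0, -0.5, 0.0, 2.0])
    print("ReLU:      ", relu(x))        # negatives become 0 -> zero gradient there
    print("Leaky ReLU:", leaky_relu(x))  # negatives keep a small non-zero value and gradient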
Module 2
1. Role of Bias Terms in Neural Networks
Question:
Discuss the role of bias terms in neural networks.
Answer:
   •   Bias terms act as additional adjustable parameters that allow the activation function to be shifted left
       or right.
   •   They provide an extra degree of freedom for the model to fit the data better by ensuring that even
       when all input features are zero, the neuron can still be activated.
Question:
What are the differences between Stochastic Gradient Descent (SGD) and Batch Gradient Descent, and
when should each be used?
Answer:
   •   Batch Gradient Descent: Uses the entire dataset to compute the gradient for each update. It is stable
       but can be very slow and computationally expensive for large datasets.
   •   Stochastic Gradient Descent (SGD): Uses one data sample (or a small mini-batch) at a time to
       update parameters. It is faster and introduces noise that can help escape local minima, though it
       might result in a more noisy convergence.
   •   Usage:
           o Use Batch GD when the dataset is small or when stable convergence is critical.
           o Use SGD (or mini-batch gradient descent) for large-scale datasets where computational
              efficiency is important.
Question:
Can you demonstrate why hidden layers are critical for enabling Neural Networks to learn complex patterns
and improve model performance?
Answer:
Hidden layers give the network its depth: each layer applies a non-linear transformation, so stacked layers can build increasingly abstract features. Without them the network reduces to a linear model, which cannot capture patterns such as XOR; adding hidden layers is therefore what enables the performance gains of deep learning on complex data.
Question:
Can you discuss the key benefits of using pre-trained Neural Network models in real-world applications?
Answer:
   •   Reduced Training Time: Pre-trained models have already learned useful features from large
       datasets, so you can fine-tune them for a specific task.
   •   Less Data Required: They allow for effective performance even with smaller task-specific datasets.
   •   Improved Accuracy: Leveraging features from vast amounts of data can enhance model
       performance on new but related tasks through transfer learning.
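As a sketch of how this looks in code, assuming PyTorch and torchvision (the ResNet-18 backbone and the 5-class head are illustrative choices):

    import torch.nn as nn
    from torchvision import models

    # Load weights learned on ImageNet (torchvision >= 0.13 API assumed)
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pre-trained feature extractor
    for p in model.parameters():
        p.requires_grad = False

    # Replace the final classification layer for a new task with 5 classes
    model.fc = nn.Linear(model.fc.in_features, 5)

    # During fine-tuning, only model.fc receives gradient updates,
    # so training is fast and needs relatively little task-specific data.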
Question:
What are the main types of loss functions used in deep learning models?
Answer:
   •   Mean Squared Error (MSE) and Mean Absolute Error, used in regression tasks.
   •   Cross-Entropy Loss (binary or categorical), used in classification tasks.
   •   Hinge Loss, used with margin-based classifiers such as SVM-style models.
Question:
What is the role of the learning rate in the convergence of a deep learning model?
Answer:
   •   The learning rate (η) determines the step size during the weight update process in gradient descent.
   •   A high learning rate can cause the model to overshoot the optimal solution, while a low learning rate
       may lead to slow convergence.
   •   Selecting an appropriate learning rate is crucial for achieving a balance between speed and stability
       during training.
Question:
Write the formula for gradient descent and explain its components.
Answer:
The gradient descent update rule is:
    θ_new = θ_old − η · ∇J(θ)
Where:
   •   θ represents the model parameters (weights and biases),
   •   η is the learning rate (step size), and
   •   ∇J(θ) is the gradient of the loss function J with respect to the parameters.
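A minimal sketch of this rule applied to a single parameter (the quadratic f(x) = x^2 + 3x + 2 reused from the NAG example later in this document; the learning rate and iteration count are illustrative):

    def f(x):
        return x**2 + 3*x + 2

    def grad_f(x):
        return 2*x + 3              # derivative of f

    eta = 0.1                       # learning rate
    x = 3.0                         # initial parameter value
    for step in range(50):
        x = x - eta * grad_f(x)     # move against the gradient
    print(x, f(x))                  # approaches the minimum at x = -1.5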
Question:
Describe the components of a feedforward neural network and explain the purpose of each.
Answer:
Question:
How does gradient descent help in minimizing errors during training?
Answer:
   •     Gradient descent calculates the derivative of the loss function with respect to each parameter.
   •     By moving in the opposite direction of the gradient, it iteratively updates the parameters to reduce
         the error until the model converges to a minimum.
Question:
How does mini-batch gradient descent improve convergence in deep learning models?
Answer:
   •   Mini-batch gradient descent splits the training data into small batches and updates the parameters once per batch.
   •   This is a middle ground between full-batch gradient descent (stable but slow) and pure SGD (fast but noisy): gradient estimates have lower variance than SGD, while updates stay frequent and computationally efficient on modern hardware, leading to smoother, faster convergence.
Question:
Consider a simple deep learning optimization problem where we aim to minimize f(x) = x^2 + 3x + 2 using Nesterov Accelerated Gradient (NAG) with initial x_0 = 3, learning rate η = 0.1, momentum γ = 0.9, and 2 iterations.
Answer:
Using NAG, we first initialize the velocity v_0 = 0.
Iteration 1:
   1. Lookahead Step:
      x_lookahead = x_0 − γ·v_0 = 3 − 0.9 × 0 = 3
   2. Gradient Calculation:
      f′(x) = 2x + 3, so f′(3) = 2 × 3 + 3 = 9
   3. Update Velocity:
      v_1 = γ·v_0 + η·f′(x_lookahead) = 0.9 × 0 + 0.1 × 9 = 0.9
   4. Update Position:
      x_1 = x_0 − v_1 = 3 − 0.9 = 2.1
Iteration 2:
   1. Lookahead Step:
      x_lookahead = x_1 − γ·v_1 = 2.1 − 0.9 × 0.9 = 2.1 − 0.81 = 1.29
   2. Gradient Calculation:
      f′(1.29) = 2 × 1.29 + 3 = 2.58 + 3 = 5.58
   3. Update Velocity:
      v_2 = γ·v_1 + η·f′(x_lookahead) = 0.9 × 0.9 + 0.1 × 5.58 = 0.81 + 0.558 = 1.368
   4. Update Position:
      x_2 = x_1 − v_2 = 2.1 − 1.368 ≈ 0.732
After two iterations, the updated value is approximately x ≈ 0.732. This example demonstrates how NAG uses a lookahead step to potentially achieve faster convergence.
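The same two iterations can be reproduced with a short Python sketch (a direct transcription of the updates above, not a general-purpose optimizer):

    def grad(x):
        return 2*x + 3                # f'(x) for f(x) = x^2 + 3x + 2

    x, v = 3.0, 0.0                   # initial position and velocity
    eta, gamma = 0.1, 0.9             # learning rate and momentum

    for i in range(2):
        lookahead = x - gamma * v     # peek ahead along the momentum direction
        v = gamma * v + eta * grad(lookahead)
        x = x - v
        print(f"iteration {i + 1}: v = {v:.3f}, x = {x:.3f}")
    # Prints v = 0.900, x = 2.100, then v = 1.368, x = 0.732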
12. Comparing Gradient Descent (GD) and Stochastic Gradient Descent (SGD)
Question:
Compare Gradient Descent (GD) with Stochastic Gradient Descent (SGD).
Answer:
   •   Gradient Descent (GD): computes the gradient over the entire training set before each update, giving stable but slow, memory-heavy updates.
   •   Stochastic Gradient Descent (SGD): updates the parameters after each individual sample (or small mini-batch), giving faster, noisier updates that can help escape shallow local minima.
   •   In practice, GD suits small datasets, while SGD (usually in mini-batch form) is preferred for large-scale deep learning.
Question:
Explain the differences between forward propagation and backward propagation in a neural network.
Answer:
   •   Forward Propagation:
          o Data flows from the input layer through the hidden layers to the output layer.
          o It computes the network’s predictions.
   •   Backward Propagation:
          o Computes the gradient of the loss function with respect to each weight by propagating the
             error backward through the network.
          o It is used to update the weights and biases to minimize the loss.
Question:
How do activation functions like ReLU and Leaky ReLU handle negative input values?
Answer:
   •   ReLU:
          o   Outputs 0 for any negative input, effectively "turning off" the neuron.
   •   Leaky ReLU:
           o Allows a small, non-zero output (e.g., 0.01 × x) for negative input values,
               ensuring that neurons can still learn even if the input is negative.
   •   This small difference helps prevent issues like the dying ReLU problem.
Module 3
1. Concept of Autoencoders in Unsupervised Learning
Question:
Explain the concept of Autoencoders and their role in unsupervised learning.
Answer:
   •   Autoencoders are neural networks designed to learn efficient encodings of input data.
   •   They compress the data into a lower-dimensional latent space and then reconstruct the original input
       from this compressed representation.
   •   This makes them useful for tasks like dimensionality reduction, denoising, and feature extraction
       without needing labeled data.
Question:
Describe the bottleneck architecture in Autoencoders and its importance.
Answer:
   •   The bottleneck is the layer in the autoencoder with the smallest number of neurons.
   •   It forces the network to capture the most critical features of the input data by compressing the
       information.
   •   This compression is key for learning an efficient representation and reducing noise or redundancy.
Question:
What is the role of dimensionality reduction in Autoencoders?
Answer:
   •   The encoder reduces the input to a lower-dimensional latent representation, keeping only the most informative features.
   •   This removes noise and redundancy, makes the data easier to visualize or cluster, and forces the network to learn structure rather than memorize the input.
Question:
Explain the reconstruction error in Autoencoders and its significance.
Answer:
   •   The reconstruction error is the difference between the original input and the autoencoder’s output
       (reconstructed data).
   •   It serves as a measure of how well the network has learned to capture the essential information of the
       data.
   •   Lower reconstruction error indicates that the autoencoder has learned a good representation.
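With mean squared error, for example, the reconstruction loss is commonly written as:
    L(x, x̂) = ||x − x̂||²   (averaged over the training samples)
where x is the input and x̂ is the autoencoder's reconstruction; training minimizes this quantity.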
5. Concept of Latent Space
Question:
Define the concept of latent space in Autoencoders.
Answer:
   •   The latent space is the compressed representation of the input data produced by the encoder.
   •   It captures the most essential features in a lower-dimensional format, which can be useful for
       visualization, clustering, or as input for other tasks.
Question:
Explain the bottleneck architecture in autoencoders.
Answer:
Question:
What is the role of nonlinear activation functions in autoencoders?
Answer:
   •   Nonlinear activation functions allow autoencoders to learn complex mappings between the input
       and the latent representation.
   •   They enable the network to capture nonlinear relationships in the data, which improves the quality
       of the learned features and the reconstruction.
Question:
Discuss the difference between the Encoder and Decoder in an Autoencoder.
Answer:
   •   The Encoder compresses the input data into a latent space representation.
   •   The Decoder takes this compressed representation and attempts to reconstruct the original input.
   •   Together, they learn a meaningful, compact representation of the data.
Question:
Compare undercomplete and overcomplete Autoencoders with examples.
Answer:
   •   Undercomplete Autoencoders:
          o The latent space has fewer dimensions than the input.
          o Forces the network to learn the most essential features.
          o Example: Compressing a 1000-dimensional input to a 50-dimensional latent space.
   •   Overcomplete Autoencoders:
          o The latent space has equal or more dimensions than the input.
          o Risk: The network might simply learn an identity function.
          o Regularization techniques (like sparsity constraints) are often applied to force the network to
             learn useful features.
Question:
How do Denoising Autoencoders improve generalization in learning tasks?
Answer:
   •   Denoising Autoencoders add noise to the input data and train the network to reconstruct the
       original, clean data.
   •   This process forces the model to learn robust features that capture the underlying structure of the
       data, thereby improving its ability to generalize to new, unseen data.
Question:
Analyze the differences between Contractive Autoencoders and Sparse Autoencoders.
Answer:
   •   Contractive Autoencoders:
          o Add a penalty to the loss function based on the sensitivity of the hidden layer activations with
             respect to the input.
          o Encourage the learned representation to be robust to small input changes.
   •   Sparse Autoencoders:
          o Impose a sparsity constraint on the activations (i.e., most neurons are inactive).
          o This encourages the network to learn a compact and interpretable representation by
             activating only a few neurons for a given input.
Question:
Analyze the reconstruction loss of an overcomplete Autoencoder and discuss its implications.
Answer:
   •   In an overcomplete autoencoder, the latent space has more dimensions than the input, so the network
       might learn to simply copy the input, resulting in a low reconstruction loss.
   •   However, this low loss does not necessarily mean that the model has learned useful features.
   •   Regularization (such as sparsity constraints) is crucial to prevent the network from learning an
       identity mapping and to ensure that the model generalizes well.
Question:
Discuss the role of dimensionality reduction in autoencoders.
Answer:
Dimensionality reduction is carried out by the encoder: compressing the input into a smaller latent space strips away noise and redundancy so that only the most informative features are kept.
Question:
Illustrate the architecture of a basic autoencoder.
Answer:
A basic autoencoder consists of three main parts: an encoder that compresses the input, a bottleneck (latent space) that holds the compact representation, and a decoder that reconstructs the original input from it. The whole network is trained to minimize the reconstruction error.
The section below revisits every question with a more detailed answer, even where a question is repeated, with the important keywords highlighted.
Module 1
1. Key Differences between Machine Learning and Deep Learning in Real-World
Problems
Question:
How would you reflect on the key differences between Machine Learning (ML) and Deep Learning (DL)
in terms of their applicability to real-world problems?
Answer:
   •   ML relies on hand-engineered features and simpler models; it works well on structured data and is often more interpretable and quicker to train on small datasets.
   •   DL learns hierarchical features automatically with multi-layer neural networks, excelling at high-dimensional raw data (images, text, speech) but requiring more data and compute.
Question:
How would you reflect on the importance of activation functions in shaping the performance of Artificial
Neural Networks in real-life applications?
Answer:
   •   Non-linearity:
          o Activation functions introduce non-linearity, allowing networks to model complex patterns
              beyond linear relationships.
   •   Decision Making:
          o They determine whether a neuron should be activated based on the weighted sum of its
              inputs, directly influencing the flow of information.
   •   Impact on Training:
          o The choice of activation function affects issues like the vanishing gradient problem and
              convergence speed.
   •   Examples:
          o ReLU, Sigmoid, and Tanh each have unique properties that influence model performance in
              various tasks.
3. Basic Building Blocks of Deep Learning
Question:
List and explain the basic building blocks of deep learning.
Answer:
   •   Neurons:
          o Basic computational units that perform a weighted sum of inputs followed by an activation
              function.
   •   Layers:
          o Input Layer: Receives raw data.
          o Hidden Layers: Intermediate layers that transform inputs through non-linear functions.
          o Output Layer: Provides final predictions.
   •   Weights and Biases:
          o Weights: Determine the strength of each input connection.
          o Biases: Allow neurons to shift the activation function and improve flexibility.
   •   Activation Functions:
          o Functions like ReLU, Sigmoid, and Tanh that introduce non-linearity.
   •   Loss Function:
          o Measures the error between predicted outputs and true values, guiding the training process.
   •   Optimization Algorithms:
          o Methods such as Gradient Descent that update the weights and biases to minimize the loss.
4. Meaning of ReLU
Question:
What does ReLU stand for in the context of activation functions?
Answer:
ReLU stands for Rectified Linear Unit. It is popular because it outputs the input directly if it is positive and
outputs 0 if the input is negative, making computations simple and effective.
Question:
Which activation function is most commonly used in deep learning models due to its simplicity and
efficiency?
Answer:
The ReLU (Rectified Linear Unit) is the most commonly used activation function. It is favored for its
efficiency in computation and its ability to help networks converge faster by mitigating the vanishing
gradient issue.
Question:
What is the primary purpose of the loss function in training a neural network?
Answer:
The loss function quantifies the difference between the predicted outputs and the actual labels. It serves as a
feedback signal that is minimized during training using optimization techniques, ensuring that the network
improves its performance over time.
Question:
A neural network with no hidden layers is equivalent to which type of model?
Answer:
A neural network with no hidden layers performs a simple linear transformation of the input data. This is
equivalent to a linear model (such as linear regression for continuous outputs or logistic regression for
binary classification).
Question:
Design an MP (McCulloch-Pitts) neuron that implements the AND logic function using numerical values.
Answer:
One standard choice is to give both inputs weight 1 and set the threshold to 2: output = 1 if (1·x_1 + 1·x_2) ≥ 2, else 0. Only the input (1, 1) reaches the threshold, so the neuron reproduces the AND truth table.
Question:
How would you demonstrate the impact of choosing the right activation function on the efficiency and
performance of a neural network?
Answer:
   •   Experimental Comparison:
          o Build two similar models, one using ReLU and another using Sigmoid (or another function).
   •   Observation of Training Curves:
          o Compare convergence speed and final accuracy on the validation set.
   •   Performance Metrics:
          o Evaluate metrics like loss, accuracy, and training time.
   •   Conclusion:
          o A well-chosen activation function can lead to faster training and better performance by
             ensuring efficient gradient flow during backpropagation.
10. Computational Flow in a Multilayer Perceptron (MLP)
Question:
Describe the computational flow in a Multilayer Perceptron (MLP).
Answer:
   •   Input Layer:
          o Receives raw data and passes it to the next layer.
   •   Forward Propagation:
          o Data is processed through successive hidden layers using weighted sums and activation
              functions.
   •   Output Layer:
          o The final layer produces the network’s prediction.
   •   Backpropagation:
          o The prediction error (from the loss function) is propagated back through the network to
              update the weights and biases using an optimization algorithm like Gradient Descent.
Question:
How is the ReLU activation function mathematically represented?
Answer:
The ReLU function is defined as f(x) = max(0, x); that is, f(x) = x for x > 0 and f(x) = 0 otherwise.
Question:
Compare the behavior of the Sigmoid and Tanh activation functions.
Answer:
   •   Sigmoid Function:
          o Range: (0, 1)
          o Can lead to vanishing gradients in deep networks.
   •   Tanh Function:
          o Range: (-1, 1) and is zero-centered, which can result in faster convergence.
   •   Key Difference:
          o Tanh often provides better performance because its zero-centered output can make
             optimization easier in many cases.
Question:
How does the choice of activation function influence the depth of a Neural Network?
Answer:
   •   Gradient Flow:
          o Activation functions that avoid vanishing gradients (like ReLU) support training of deeper
             networks.
   •   Network Depth:
          o Using functions prone to saturation (like Sigmoid) can limit the effective depth, as gradients
             become too small in deep networks.
   •   Overall Effect:
          o A proper activation function helps maintain effective learning even as the network’s depth
             increases.
Question:
How would you justify the use of Leaky ReLU over standard ReLU in certain cases?
Answer:
   •   Standard ReLU outputs zero for all negative inputs, so affected neurons can stop updating (the dying ReLU problem).
   •   Leaky ReLU keeps a small non-zero slope for negative inputs, so gradients keep flowing; it is preferable when many neurons would otherwise become inactive.
Module 2
1. Role of Bias Terms in Neural Networks
Question:
Discuss the role of bias terms in neural networks.
Answer:
   •   Offset Adjustment:
          o Bias terms allow neurons to shift the activation function left or right, ensuring that the neuron
              can activate even when inputs are zero.
   •   Flexibility:
          o They add flexibility to the model, enabling it to better fit the data by allowing a non-zero
              output independent of the weighted inputs.
   •   Learning Capability:
          o Biases act as additional parameters that the network learns, improving its overall
              performance.
2. Differences between SGD and Batch Gradient Descent
Question:
What are the differences between Stochastic Gradient Descent (SGD) and Batch Gradient Descent, and
when should each be used?
Answer:
   •   Batch Gradient Descent computes the gradient over the whole dataset per update: stable but slow and memory-intensive, so it suits small datasets or cases where stable convergence is critical.
   •   SGD updates after each sample (or mini-batch): faster and noisier, which can help escape local minima; it is the practical choice for large datasets.
Question:
Can you demonstrate why hidden layers are critical for enabling Neural Networks to learn complex patterns
and improve model performance?
Answer:
   •   Without hidden layers a network can only represent a linear mapping of its inputs (equivalent to linear or logistic regression).
   •   Hidden layers with non-linear activations let the network compose intermediate features layer by layer, which is what allows it to model complex, non-linear patterns (for example, XOR cannot be separated without a hidden layer) and improve performance.
Question:
Can you discuss the key benefits of using pre-trained Neural Network models in real-world applications?
Answer:
   •   Reduced training time and data requirements, because the model starts from features already learned on large datasets.
   •   Often better accuracy on related tasks via transfer learning, where only the final layers are fine-tuned.
Question:
What are the main types of loss functions used in deep learning models?
Answer:
   •   Cross-Entropy Loss:
          o Commonly used in classification tasks to measure the difference between predicted
             probabilities and true labels.
   •   Mean Squared Error (MSE):
          o Frequently used in regression problems, measuring the average squared differences between
             predictions and actual values.
   •   Hinge Loss:
          o Often used with support vector machines and some classification neural networks.
   •   Absolute Error Loss:
          o Measures the absolute differences between predicted and true values, sometimes used in
             regression tasks.
Question:
What is the role of the learning rate in the convergence of a deep learning model?
Answer:
   •   Step Size:
           o The learning rate (η) determines how large a step is taken in the direction opposite to the
              gradient during optimization.
   •   Balancing Convergence:
          o A high learning rate may lead to overshooting the minimum, while a low rate can cause very
              slow convergence.
   •   Optimal Setting:
          o Selecting the right learning rate is crucial for balancing training speed and ensuring stable
              convergence.
Question:
Write the formula for gradient descent and explain its components.
Answer:
The gradient descent update rule is:
    θ_new = θ_old − η · ∇J(θ)
Where:
   •   θ represents the model parameters (weights and biases),
   •   η is the learning rate (step size), and
   •   ∇J(θ) is the gradient of the loss function J with respect to the parameters.
Question:
Describe the components of a feedforward neural network and explain the purpose of each.
Answer:
   •     Input Layer:
            o Receives the raw input data.
   •     Hidden Layers:
            o Process data through weighted sums and activation functions to extract and transform
                features.
   •     Output Layer:
            o Generates the final prediction.
   •     Weights and Biases:
            o Parameters that are adjusted during training to best fit the data.
   •     Activation Functions:
            o Introduce non-linearity, enabling the network to learn complex mappings.
Question:
How does gradient descent help in minimizing errors during training?
Answer:
   •     Error Reduction:
             o Gradient descent computes the gradient of the loss function with respect to each parameter.
   •     Iterative Updates:
             o By updating the parameters in the opposite direction of the gradient, it reduces the error
                 gradually until the model converges at a (local) minimum.
   •     Optimization:
             o This process minimizes the overall loss, thereby improving the model’s predictions.
10. Mini-Batch Gradient Descent and Convergence
Question:
How does mini-batch gradient descent improve convergence in deep learning models?
Answer:
   •   Balanced Approach:
           o Mini-batch gradient descent divides the dataset into small batches.
   •   Reduced Variance:
           o It offers a middle ground between the high variance of SGD and the computational expense
              of full batch gradient descent.
   •   Efficient Updates:
           o Utilizing mini-batches allows for faster, more stable updates and leverages parallel
              processing on modern hardware.
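A minimal sketch of a mini-batch training loop, assuming PyTorch (the batch size of 32 and the toy regression data are illustrative):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy regression data
    X = torch.randn(1024, 10)
    y = X.sum(dim=1, keepdim=True)

    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(5):
        for xb, yb in loader:            # one parameter update per mini-batch
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
        print(f"epoch {epoch}: last batch loss = {loss.item():.4f}")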
Question:
Consider a simple deep learning optimization problem where we aim to minimize:
f(x) = x^2 + 3x + 2
using Nesterov Accelerated Gradient (NAG) with initial x_0 = 3, learning rate η = 0.1, momentum γ = 0.9, and 2 iterations.
Answer:
Iteration 1:
   1. Lookahead:
          o x_lookahead = x_0 − γ·v_0 = 3 − 0.9 × 0 = 3
   2. Gradient:
          o f′(x) = 2x + 3 ⇒ f′(3) = 9
   3. Velocity Update:
          o v_1 = 0.9 × 0 + 0.1 × 9 = 0.9
   4. Position Update:
          o x_1 = 3 − 0.9 = 2.1
Iteration 2:
   1. Lookahead:
          o x_lookahead = x_1 − γ·v_1 = 2.1 − 0.9 × 0.9 = 1.29
   2. Gradient:
          o f′(1.29) = 2 × 1.29 + 3 = 5.58
   3. Velocity Update:
          o v_2 = 0.9 × 0.9 + 0.1 × 5.58 = 1.368
   4. Position Update:
          o x_2 = 2.1 − 1.368 ≈ 0.732
After two iterations, the updated value is approximately x ≈ 0.732. This shows how NAG uses a lookahead step to improve the update and potentially converge faster.
12. Comparing Gradient Descent (GD) with Stochastic Gradient Descent (SGD)
Question:
Compare Gradient Descent (GD) with Stochastic Gradient Descent (SGD).
Answer:
   •   GD uses every training example to compute each update, so updates are exact but expensive and infrequent.
   •   SGD estimates the gradient from a single example (or mini-batch), trading some noise for much cheaper, more frequent updates that scale to large datasets and can help escape local minima.
Question:
Explain the differences between forward propagation and backward propagation in a neural network.
Answer:
   •   Forward Propagation:
          o Data flows from the input layer to the output layer, computing predictions at each layer
             using the current weights and activation functions.
   •   Backward Propagation:
          o The error (calculated by the loss function) is propagated backwards through the network.
          o This process computes gradients for each weight and bias, which are then updated to reduce
             the overall error.
Question:
How do activation functions like ReLU and Leaky ReLU handle negative input values?
Answer:
   •   ReLU:
          o  Outputs 0 for any negative input, effectively deactivating the neuron.
   •   Leaky ReLU:
           o Allows a small, non-zero output for negative inputs (e.g., 0.01 × x), which helps
              maintain a small gradient and prevents the dying ReLU problem.
   •   Summary:
          o   This difference can significantly affect training, especially in deep networks where retaining
              some gradient for negative inputs is beneficial.
Module 3
1. Autoencoders and Unsupervised Learning
Question:
Explain the concept of Autoencoders and their role in unsupervised learning.
Answer:
   •   Autoencoders are neural networks designed to learn an efficient encoding of the input data.
   •   They consist of an encoder that compresses the data into a lower-dimensional representation (latent
       space) and a decoder that reconstructs the original data.
   •   They are used in unsupervised learning for tasks like dimensionality reduction, denoising, and
       feature extraction without the need for labeled data.
Question:
Describe the bottleneck architecture in Autoencoders and its importance.
Answer:
The bottleneck is the narrowest layer of the autoencoder. By forcing all information through this small layer, the network must keep only the most essential features of the input, which is what makes the learned representation compact, less noisy, and useful.
Question:
What is the role of dimensionality reduction in Autoencoders?
Answer:
Dimensionality reduction is what the encoder performs: it maps the input to a smaller latent representation, discarding noise and redundancy so that only the most informative features are kept for reconstruction or downstream tasks.
Question:
Explain the reconstruction error in Autoencoders and its significance.
Answer:
   •   Reconstruction error measures the difference between the original input and its reconstruction by
       the autoencoder.
   •   It is a key metric used during training: a lower reconstruction error indicates that the network has
       successfully learned to capture the essential features of the data.
   •   Minimizing this error is central to training an effective autoencoder.
Question:
Define the concept of latent space in Autoencoders.
Answer:
   •   The latent space is the compressed representation of the input data learned by the encoder.
   •   It is a lower-dimensional space that captures the most significant features of the data, often used for
       clustering, visualization, or as input to other models.
   •   The quality of the latent space directly impacts the performance of the autoencoder.
Question:
Explain the bottleneck architecture in autoencoders.
Answer:
   •   As described earlier, the bottleneck is the layer with the minimum number of neurons.
   •   This design choice forces the autoencoder to focus on the most important features of the input,
       effectively reducing noise and ensuring a compressed, efficient representation.
   •   This design ensures the model does not simply copy the input but instead learns a meaningful,
        compressed encoding.
Question:
What is the role of nonlinear activation functions in autoencoders?
Answer:
   •   Nonlinear activation functions allow the autoencoder to model complex, non-linear relationships
       between the input and its compressed representation.
   •   They help in capturing more complex patterns in the data than linear functions would allow, leading
       to a more expressive and robust latent space.
   •   Common functions used include ReLU, Sigmoid, and Tanh.
8. Difference Between Encoder and Decoder in Autoencoders
Question:
Discuss the difference between the Encoder and Decoder in an Autoencoder.
Answer:
   •   Encoder:
          o Compresses the input data into the latent space by learning an efficient, lower-dimensional
             representation.
   •   Decoder:
          o Reconstructs the original input data from the latent space, ideally producing an output as
             close as possible to the input.
   •   Together, they allow the autoencoder to learn the intrinsic structure of the data.
Question:
Compare undercomplete and overcomplete Autoencoders with examples.
Answer:
   •   Undercomplete Autoencoders:
          o The latent space has fewer dimensions than the input, forcing the network to learn only the
             most salient features.
          o Example: Compressing a 1000-dimensional input to a 50-dimensional latent representation.
   •   Overcomplete Autoencoders:
          o The latent space has equal to or more dimensions than the input.
          o Risk: They may learn an identity function if not regularized properly.
          o Regularization methods (such as sparsity constraints) are often used to ensure that even with
             many dimensions, the autoencoder learns meaningful features.
Question:
How do Denoising Autoencoders improve generalization in learning tasks?
Answer:
   •   Denoising Autoencoders are trained by deliberately corrupting the input data with noise.
   •   The network then learns to reconstruct the original, clean data from this noisy version.
   •   This process forces the model to capture robust features that are invariant to noise, which improves
       its ability to generalize to new, unseen data.
Answer:
   •   Contractive Autoencoders:
          o Add a penalty term to the loss that discourages large derivatives in the hidden layers.
          o This regularization forces the model to learn representations that are robust to small changes
             in the input.
   •   Sparse Autoencoders:
          o Enforce a sparsity constraint on the hidden layer activations, meaning that only a few neurons
             are active at a time.
          o This constraint leads to a more interpretable and compact representation.
   •   Key Difference:
          o While both techniques aim to improve the quality of the learned representation, contractive
             autoencoders focus on robustness to perturbations, whereas sparse autoencoders emphasize
             sparse, efficient coding of the input.
Question:
Analyze the reconstruction loss of an overcomplete Autoencoder and discuss its implications.
Answer:
   •   In an overcomplete autoencoder, the latent space has more dimensions than the input, which may
       allow the model to simply copy the input.
   •   This can result in a low reconstruction loss even if the model has not learned a meaningful
       representation.
   •   Implications:
           o A low reconstruction loss in this setting may be deceptive; it could mean that the network has
               memorized the input rather than learning robust, generalizable features.
           o Regularization techniques (e.g., sparsity constraints) are necessary to ensure that the model
               learns useful representations rather than an identity mapping.
Question:
Discuss the role of dimensionality reduction in autoencoders.
Answer:
As noted earlier, the encoder's compression into a lower-dimensional latent space is itself dimensionality reduction: it keeps the most informative features, filters out noise and redundancy, and yields representations useful for visualization, clustering, or downstream models.
Question:
Illustrate the architecture of a basic autoencoder.
Answer:
A basic autoencoder consists of three main parts:
   •   Input Layer:
          o Receives the raw input data.
   •   Encoder:
          o One or more layers that compress the input into a lower-dimensional latent space (the
              bottleneck).
   •   Decoder:
          o Mirrors the encoder to reconstruct the original input from the latent representation.
   •   Overall Purpose:
          o The network is trained to minimize the reconstruction error, ensuring that the compressed
              representation retains as much essential information as possible.
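To tie these parts together, here is a minimal sketch of such an autoencoder, assuming PyTorch (the 784-dimensional input, the 32-dimensional bottleneck, and the layer sizes are illustrative):

    import torch
    import torch.nn as nn

    class BasicAutoencoder(nn.Module):
        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            # Encoder: compress the input down to the bottleneck (latent space)
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 128), nn.ReLU(),
                nn.Linear(128, latent_dim),
            )
            # Decoder: reconstruct the input from the latent representation
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 128), nn.ReLU(),
                nn.Linear(128, input_dim),
            )

        def forward(self, x):
            z = self.encoder(x)          # latent representation
            return self.decoder(z)       # reconstruction

    model = BasicAutoencoder()
    x = torch.randn(16, 784)             # a batch of (unlabeled) inputs
    loss = nn.MSELoss()(model(x), x)     # reconstruction error to be minimized
    loss.backward()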