- Sigmoid
  - Formula: σ(x) = 1 / (1 + e^(-x))
  - Output range: (0, 1)
  - Commonly used in binary classification
  - Limitation: saturates for large inputs, leading to vanishing gradients
- Tanh
  - Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  - Output range: (-1, 1)
  - Zero-centered activation, an advantage over sigmoid
- ReLU
  - Formula: f(x) = max(0, x)
  - Fast and computationally efficient
  - Limitation: dying ReLU problem, where neurons get stuck outputting zero
- Softmax
  - Formula: softmax(x_i) = exp(x_i) / Σ exp(x_j)
  - Converts raw scores into a probability distribution
  - Used in the output layer for multi-class classification
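A minimal NumPy sketch of these four functions (illustrative only; any deep learning framework ships built-in versions):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); squashes inputs to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # (e^x - e^(-x)) / (e^x + e^(-x)); zero-centered, range (-1, 1)
    return np.tanh(x)

def relu(x):
    # max(0, x); negative inputs are clipped to zero
    return np.maximum(0.0, x)

def softmax(x):
    # exp(x_i) / sum_j exp(x_j); subtract max for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([-2.0, 0.0, 3.0])
print(sigmoid(scores), relu(scores), softmax(scores))
```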
| Layer | Role |
|---|---|
| Convolution | Extracts local features using filters (e.g., edges, textures) |
| Max Pooling | Reduces dimensionality while retaining important features |
| Flatten | Converts 2D feature maps to a 1D vector before feeding them to dense layers |
- Dropout – Randomly drops neurons during training to prevent co-adaptation
- L2 Regularization – Penalizes large weights by adding their squared magnitudes to the loss function
- Early Stopping – Stops training when validation error starts increasing
- Data Augmentation – Expands dataset with transformations (flip, rotate, etc.)
- Cross-Validation – Ensures model generalization on unseen data
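A sketch of how the first three techniques can be combined, assuming TensorFlow/Keras; the layer sizes, 20-feature input, and `x_train`/`y_train` names are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),                                # 20 input features (arbitrary)
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 penalty on weights
    layers.Dropout(0.5),                                      # randomly drop 50% of units
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt once validation loss stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```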
- Uses a pre-trained model (e.g., on ImageNet) for a new task
- Helps reduce training time and data requirements
- Retains useful low-level features (edges, corners)
- Common in computer vision and NLP
- Only the top layers are retrained for the new task
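A transfer-learning sketch, assuming TensorFlow/Keras with a MobileNetV2 backbone; the 128×128 input and 10-class head are arbitrary choices for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Pre-trained backbone (ImageNet weights) without its original classification head.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(128, 128, 3))
base.trainable = False  # freeze low-level features (edges, corners, textures)

# New task-specific head: only these layers are trained.
model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),  # e.g., a new 10-class problem
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```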
- A Siamese Network has two or more identical subnetworks with shared weights
- Measures similarity between two inputs
- Outputs feature embeddings which are compared using a distance metric
- Applications: Face verification, signature matching, one-shot learning
- Learns to tell whether two inputs are similar or different
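A minimal Siamese sketch in Keras, assuming grayscale 105×105 inputs and arbitrary layer sizes; one encoder instance is reused so the two branches share weights:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_encoder(input_shape=(105, 105, 1)):
    # Shared subnetwork that maps an image to an embedding vector.
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return Model(inp, layers.Dense(64)(x))

encoder = make_encoder()              # ONE encoder => shared weights
img_a = layers.Input(shape=(105, 105, 1))
img_b = layers.Input(shape=(105, 105, 1))
emb_a, emb_b = encoder(img_a), encoder(img_b)

# Compare embeddings with an L1 distance, then predict "same / different".
dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
out = layers.Dense(1, activation="sigmoid")(dist)
siamese = Model([img_a, img_b], out)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
```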
- Max Pooling
  - Selects the maximum value from each patch of the feature map.
  - Helps retain the most dominant features.
- Average Pooling
  - Takes the average of the values in each patch.
  - Used when feature intensity is less important.
- Global Average Pooling
  - Averages each entire feature map to produce a single number per map.
  - Reduces overfitting; often used before dense layers.
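A NumPy toy example on a made-up 4×4 feature map, showing all three pooling operations with a 2×2 window and stride 2:

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [5, 6, 1, 2],
                 [0, 2, 4, 4],
                 [3, 1, 0, 8]], dtype=float)

# Split into 2x2 patches, then reduce each patch.
patches = fmap.reshape(2, 2, 2, 2)
max_pooled = patches.max(axis=(1, 3))    # keep the largest value per patch
avg_pooled = patches.mean(axis=(1, 3))   # mean of each patch
gap = fmap.mean()                        # global average: one number per map

print(max_pooled)  # [[6. 2.]
                   #  [3. 8.]]
print(avg_pooled)  # [[3.75 1.25]
                   #  [1.5  4.  ]]
print(gap)         # 2.625
```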
Input layer: 6 → Hidden Layer: 4 → Hidden Layer: 2 → Output layer (binary) (4 Point)
- Input to Hidden1 (6 → 4): (6 × 4) + 4 = 28
- Hidden1 to Hidden2 (4 → 2): (4 × 2) + 2 = 10
- Hidden2 to Output (2 → 1): (2 × 1) + 1 = 3
✅ Total parameters = 28 + 10 + 3 = 41
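The same arithmetic in plain Python, so the per-layer rule (weights plus biases) is explicit:

```python
# Parameters per dense layer = (inputs * outputs) + outputs (the biases).
layer_sizes = [6, 4, 2, 1]  # input -> hidden1 -> hidden2 -> output (binary)

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out
    total += params
    print(f"{n_in} -> {n_out}: {params} parameters")

print("Total:", total)  # 28 + 10 + 3 = 41
```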
- In deep networks, during backpropagation, gradients become very small in early layers.
- Leads to very slow or no learning in initial layers.
- Use ReLU instead of sigmoid/tanh
- Use Batch Normalization
- Use Residual connections (as in ResNet)
- Use architectures like LSTM/GRU for sequential tasks
- IoU (Intersection over Union)
- mAP (mean Average Precision)
- Precision & Recall
- F1 Score
- Measures overlap between predicted and ground truth boxes.
- IoU > threshold (usually 0.5) → correct detection.
- Helps evaluate localization accuracy.
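A simple Python helper for IoU, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# IoU > 0.5 is typically counted as a correct detection.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```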
- Algorithm to compute gradients for updating weights using the chain rule (backpropagation)
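A minimal worked example of the chain rule on a single sigmoid neuron with squared-error loss; the input, target, weights, and learning rate are made-up values for illustration:

```python
import numpy as np

# L = 0.5 * (y_hat - y)^2, where y_hat = sigmoid(w*x + b)
x, y = 1.5, 1.0          # input and target
w, b = 0.4, 0.1          # current weight and bias
lr = 0.1                 # learning rate

z = w * x + b
y_hat = 1.0 / (1.0 + np.exp(-z))

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = y_hat - y
dyhat_dz = y_hat * (1 - y_hat)   # derivative of the sigmoid
dL_dw = dL_dyhat * dyhat_dz * x
dL_db = dL_dyhat * dyhat_dz * 1.0

# Gradient-descent update
w -= lr * dL_dw
b -= lr * dL_db
```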
Q1 a) Summarize the Neural Network process. Write a short note on Learning Rate and Momentum. (4 Point)
- Input layer receives raw data.
- Data flows through hidden layers, where weights and biases are applied.
- Activation functions add non-linearity.
- Output layer makes prediction.
- Loss is computed using a loss function.
- Backpropagation adjusts weights to minimize the loss.
- Process repeats for multiple epochs.
- Learning Rate
  - Controls the size of weight updates during training.
  - Too small → slow learning; too large → overshooting minima.
- Momentum
  - Adds a fraction of the previous update to the current one.
  - Helps accelerate convergence and escape local minima.
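A NumPy sketch of one SGD-with-momentum step; the `momentum_step` name, learning rate, and momentum coefficient are illustrative choices, and the gradients are toy values:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One update: v = beta*v - lr*grad, then w = w + v."""
    velocity = beta * velocity - lr * grad   # keep a fraction of the previous update
    return w + velocity, velocity

w = np.array([0.5, -1.2])
v = np.zeros_like(w)
for grad in [np.array([0.3, -0.1]), np.array([0.25, -0.12])]:  # toy gradients
    w, v = momentum_step(w, grad, v)
print(w)
```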
- Technique to normalize the inputs to a layer for each mini-batch.
- Helps maintain a mean of 0 and standard deviation of 1.
- Speeds up training.
- Reduces internal covariate shift.
- Allows higher learning rates.
- Acts as a regularizer, sometimes reducing the need for dropout.
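A NumPy sketch of the batch-norm forward pass for one mini-batch (running statistics used at inference time are omitted); the batch shape and random data are arbitrary:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per feature, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale (gamma) and shift (beta)

batch = np.random.randn(32, 4) * 10 + 3       # 32 samples, 4 features
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```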
- Optimization algorithm to minimize loss by updating weights in the direction of negative gradient.
- Batch Gradient Descent
  - Uses the entire dataset for each update.
  - Stable, but slow and memory intensive.
- Stochastic Gradient Descent (SGD)
  - Uses one data point per update.
  - Faster, but has high variance.
- Mini-Batch Gradient Descent
  - Uses small batches of data.
  - Combines speed and stability.
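A NumPy mini-batch gradient descent loop on toy linear-regression data (all numbers are made up); setting `batch_size` to the dataset size gives batch GD, and setting it to 1 gives SGD:

```python
import numpy as np

# Toy data: y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(200)

w, b, lr, batch_size = 0.0, 0.0, 0.1, 16
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):      # loop over mini-batches
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb                      # prediction error
        w -= lr * (2 * err * xb).mean()              # gradient of MSE w.r.t. w
        b -= lr * (2 * err).mean()                   # gradient of MSE w.r.t. b

print(round(w, 2), round(b, 2))  # approximately 2.0 and 1.0
```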
Q1 d) Why do we need Activation Functions? Write a short note on ReLU. What are the drawbacks of ReLU? (4 Point)
- Introduce non-linearity to the model.
- Enable the network to learn complex patterns and relationships.
- ReLU
  - Formula: f(x) = max(0, x)
  - Simple and computationally efficient.
  - Helps avoid the vanishing gradient problem.
- Drawbacks
  - Dying ReLU: neurons may output zero for all inputs.
  - Not zero-centered.
- Single-shot object detection algorithm.
- Divides image into grid cells; each predicts bounding boxes and class probabilities.
- Input image is divided into an S × S grid.
- Each grid cell predicts:
  - Bounding box coordinates (x, y, w, h)
  - Confidence score
  - Class probabilities
- Non-Maximum Suppression (NMS) removes redundant detections.
- Real-time speed
- End-to-end prediction
- Fewer false positives
- Overly complex model with too many parameters
- Insufficient training data
- Too many training epochs
- Noisy or irrelevant features
- Lack of regularization
- Dropout – Randomly disables neurons during training to reduce over-reliance.
- Data Augmentation – Increases dataset size and variability using transformations. (Also acceptable: L2 regularization, early stopping, cross-validation)
| Layer | Purpose |
|---|---|
| Convolution | Detects spatial patterns/features using learnable filters |
| Pooling | Reduces feature map size, controls overfitting, speeds up training |
| Dense | Fully connected layer used for final classification |
- In deep networks, gradients get smaller as they are backpropagated.
- Early layers receive almost zero updates → learning halts.
- Use ReLU activation instead of sigmoid/tanh
- Apply Batch Normalization
- Use Residual connections (ResNet)
- Use LSTM or GRU for sequential models
Q1 d) List various performance metrics for object detection. What is the use of Non-Maximal Suppression? (4 Point)
- IoU (Intersection over Union)
- Precision / Recall
- mAP (mean Average Precision)
- F1 Score
- Eliminates multiple overlapping boxes for the same object.
- Keeps only the box with the highest confidence.
- Improves detection accuracy and reduces duplicates.
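A simplified greedy NMS sketch in Python, reusing the `iou()` helper sketched earlier (not a library implementation):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression.

    boxes: list of (x1, y1, x2, y2); scores: confidence per box.
    Returns indices of the boxes that are kept.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```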
Q1 e) Why are activation functions needed in deep neural networks? Define Tanh, ReLU, and Softmax. (4 Point)
- Introduce non-linearity
- Allow networks to model complex patterns
| Function | Formula | Usage |
|---|---|---|
| Tanh | tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) | Range (-1, 1), zero-centered |
| ReLU | f(x) = max(0, x) | Fast training, helps avoid vanishing gradients |
| Softmax | softmax(x_i) = exp(x_i) / Σ exp(x_j) | Converts scores to probabilities; used in the output layer for multi-class classification |
- Input Layer
  - Takes image data (e.g., 128x128x3)
- Convolution Layers
  - Apply filters/kernels to extract features such as edges and textures
  - Output: feature maps
- Activation Function (ReLU)
  - Adds non-linearity
- Pooling Layers (Max/Average Pooling)
  - Downsample feature maps to reduce spatial size
  - Help with translation invariance
- Flatten Layer
  - Converts 2D feature maps into a 1D vector for the Dense layers
- Dense (Fully Connected) Layers
  - Final decision-making layers
  - Often followed by softmax for classification
- Output Layer
  - Produces final class predictions
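A compact Keras version of this pipeline, assuming TensorFlow/Keras; the filter counts and 10-class output are arbitrary illustrative choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 3)),                     # input layer
    layers.Conv2D(32, (3, 3), activation="relu"),          # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                           # downsample feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                      # 2D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),                  # fully connected layer
    layers.Dense(10, activation="softmax"),                # output: class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```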
- Model performs well on training data but poorly on unseen/test data
- Learns noise or irrelevant details
- Dropout – Randomly deactivate neurons
- Data Augmentation – Increases effective dataset size
- Early Stopping – Stops training when validation loss increases
- Regularization (L2) – Penalizes large weights
- Mathematical functions applied after each layer
- Introduce non-linearity into the model
- Without them, deep models behave like linear functions
- ReLU – Used in hidden layers
- Tanh – Output centered at 0
- Sigmoid – Output between 0 and 1
- Softmax – Used in output layer for multi-class classification
| Type | Description | Example |
|---|---|---|
| Single-stage | Detects objects in one forward pass over the image | YOLO, SSD |
| Multi-stage | Generates region proposals, then classifies and refines them | Faster R-CNN |
- Single-stage: Fast, less accurate
- Multi-stage: Slower, more accurate
- Comprises two networks:
- Generator – Tries to produce fake data
- Discriminator – Tries to distinguish real vs fake data
- They are trained adversarially until the generator produces realistic data
- Can generate highly realistic data (images, text, audio)
- Used in image synthesis, super-resolution, data augmentation
- Helps in creating synthetic training data for low-resource domains
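A skeletal Keras GAN sketch (generator and discriminator only; the adversarial training loop is omitted); the layer sizes, 100-dimensional noise vector, and 28×28 image shape are arbitrary assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 100  # size of the random noise vector fed to the generator

# Generator: noise -> fake 28x28 image
generator = tf.keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(28 * 28, activation="tanh"),
    layers.Reshape((28, 28, 1)),
])

# Discriminator: image -> probability that it is real
discriminator = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Adversarial setup: the generator is trained to make the (frozen) discriminator say "real".
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")
```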