Module 3
Deep Learning [BCA701]
BY
Prof. Prasanna Patil
Asst. Professor
Dept of CS&E
VSM SRKIT NIDASOSHI
Regularization for
Deep Learning
Chapter 7
Contents
• Introduction to Regularization
• Overfitting vs. Underfitting
• Regularization Techniques Overview
• L1 Regularization
• L2 Regularization
• Dropout Regularization
• Early Stopping
• Data Augmentation
• Weight Decay
• Batch Normalization
• Ensemble Methods
• Hyperparameter Tuning
• Summary of Regularization Techniques
Introduction to Regularization
• Regularization is a set of techniques to prevent overfitting in
machine learning models.
• Purpose:
• Helps improve model generalization to unseen data.
• Controls model complexity to avoid fitting noise in the training
data.
• Overfitting vs. Underfitting:
• Overfitting: Model learns training data too well, leading to poor
performance on new data.
• Underfitting: Model is too simple to capture the underlying data
patterns.
Contd…
• Need for Regularization:
• Complex models can learn intricate patterns but risk overfitting.
• Regularization techniques balance bias and variance trade-off.
• Common Regularization Methods:
• L1 and L2 regularization
• Dropout
• Early stopping
• Data augmentation
• Applications:
• Widely used in neural networks, especially in deep learning frameworks.
• Essential for tasks with limited training data.
Overfitting vs. Underfitting
• Overfitting:
• Definition: Model learns the training data too well, capturing noise and
fluctuations.
• Symptoms:
• High accuracy on training data.
• Poor performance on validation/testing data.
• Causes:
• Excessive model complexity (too many parameters).
• Insufficient training data.
• Consequences:
• Lack of generalization to new data.
• Reduced model reliability in practical applications.
• Underfitting:
• Definition: Model is too simple to capture underlying patterns in the data.
• Symptoms:
• Low accuracy on both training and validation data.
• Poor predictive performance.
• Causes:
• Inadequate model complexity (too few parameters).
• Excessive regularization.
• Consequences:
• Inability to learn from the data.
• Missed opportunities for effective predictions.
Regularization Techniques Overview
Common Regularization Techniques:
• L1 Regularization
• L2 Regularization
• Dropout
• Early Stopping
• Data Augmentation
• Batch Normalization
• Ensemble Methods
L1 Regularization
• Definition:
• Adds the sum of the absolute values of the weights to the loss function, promoting sparsity.
• Mathematical Formulation:
• Loss = Original Loss + λ Σi |wi|
• λ: Regularization strength (hyperparameter).
• wi: Weights of the model.
• Key Characteristics:
• Encourages the model to use fewer features by driving some weights to zero.
• Useful for feature selection, especially in high-dimensional datasets.
• Pros:
• Results in simpler models that are easier to interpret.
• Helps reduce overfitting by focusing on the most important features.
• Cons:
• Can lead to instability in weight estimation when features are highly
correlated.
• May not perform well if all features contribute to the output.
• Use Cases:
• Effective in problems with many irrelevant features (e.g., text classification).
• Commonly used in linear models and sparse data settings.
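A minimal sketch of adding the L1 penalty to one training step, using PyTorch; the tiny linear model, random batch, and the value of lambda are illustrative placeholders, not part of the original slides:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # placeholder model for illustration
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lambda_l1 = 1e-3                                 # regularization strength (lambda)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch

optimizer.zero_grad()
data_loss = criterion(model(x), y)
# L1 penalty: lambda times the sum of absolute weight values; drives some weights to zero.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + lambda_l1 * l1_penalty
loss.backward()
optimizer.step()
```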
L2 Regularization
• Definition:
• Adds the sum of the squared weights to the loss function, discouraging large weights.
• Mathematical Formulation:
• Loss = Original Loss + λ Σi wi²
• λ: Regularization strength (hyperparameter).
• wi: Weights of the model.
• Key Characteristics:
• Tends to distribute weights more evenly across features rather than driving some to zero.
• Helps to maintain all features in the model while reducing their impact.
• Pros:
• Reduces overfitting by preventing the
model from becoming overly complex.
• More stable weight estimates, particularly
in the presence of multicollinearity.
• Cons:
• Does not perform feature selection; all features remain in the model.
• May lead to a slight increase in training time due to additional calculations.
• Use Cases:
• Commonly used in many machine learning algorithms, including linear
regression and neural networks.
• Effective in scenarios where all features are believed to contribute to the output.
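A minimal sketch of the L2 penalty in PyTorch; compared with the L1 sketch, only the penalty term changes (model, data, and lambda are again placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # placeholder model for illustration
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lambda_l2 = 1e-3                                 # regularization strength (lambda)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch

optimizer.zero_grad()
data_loss = criterion(model(x), y)
# L2 penalty: lambda times the sum of squared weights; shrinks weights without zeroing them.
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = data_loss + lambda_l2 * l2_penalty
loss.backward()
optimizer.step()
```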
Dropout Regularization
• Definition:
• A regularization technique that randomly deactivates a fraction of neurons during
training.
• Mechanism:
• During each training iteration, a specified percentage of neurons (e.g., 20-50%) are
"dropped out" or set to zero.
• Prevents neurons from co-adapting too much.
• Purpose:
• Encourages the network to learn robust features that are useful in conjunction with
many different random subsets of the neurons.
• Benefits:
• Reduces overfitting by introducing noise during training.
• Helps improve model generalization to unseen data.
• Acts as an ensemble of multiple networks, promoting diversity in learned
representations.
• Implementation:
• Commonly applied in fully connected layers of neural networks.
• Typically not used during inference; all neurons are active.
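A minimal sketch of dropout in a fully connected PyTorch network; the layer sizes and the dropout rate of 0.5 are illustrative choices:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zeroes 50% of activations during training
    nn.Linear(256, 10),
)

x = torch.randn(8, 784)       # dummy batch

model.train()                 # dropout active: a random subset of units is dropped each pass
train_out = model(x)

model.eval()                  # dropout disabled at inference: all units are active
with torch.no_grad():
    eval_out = model(x)
```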
Early Stopping
• Definition:
• A regularization technique that halts training when the model's performance on a
validation set begins to degrade.
• Purpose:
• Prevents overfitting by monitoring the model's performance during training.
• Aims to find the optimal point where the model generalizes best.
• Mechanism:
• During training, the validation loss is evaluated at regular intervals (e.g., after each
epoch).
• Training stops if the validation loss does not improve for a specified number of
epochs (patience).
• Benefits:
• Reduces unnecessary training time by stopping early.
• Helps in maintaining a balance between bias and variance.
• Implementation:
• Requires a validation dataset to monitor performance.
• Can be combined with other regularization techniques for improved results.
• Considerations:
• Selecting the right patience value is crucial; too short may lead to
underfitting.
• Can be sensitive to the choice of the validation set.
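A minimal sketch of the early-stopping loop described above; train_one_epoch, evaluate, train_loader, and val_loader are hypothetical helpers assumed to exist, and the patience value is illustrative:

```python
import copy

best_val_loss = float("inf")
patience, patience_counter = 5, 0                     # stop after 5 epochs without improvement
best_state = None

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical training helper
    val_loss = evaluate(model, val_loader)            # hypothetical validation helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())  # remember the best weights
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:              # no improvement for `patience` epochs
            print(f"Early stopping at epoch {epoch}")
            break

model.load_state_dict(best_state)                     # restore the best-performing weights
```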
Data Augmentation
• Definition:
• A technique to artificially increase the size of a training dataset by creating modified versions
of existing data points.
• Purpose:
• Improves model generalization by exposing it to varied examples.
• Reduces overfitting, especially in cases with limited data.
• Common Techniques:
• Image Augmentation:
• Rotation, flipping, cropping, scaling, and color adjustments.
• Text Augmentation:
• Synonym replacement, random insertion, and back-translation.
• Audio Augmentation:
• Time stretching, pitch shifting, and adding background noise.
• Benefits:
• Enhances model robustness to variations and noise in real-world data.
• Helps in learning invariant features that are crucial for performance.
• Implementation:
• Often performed on-the-fly during training to save storage and increase
diversity.
• Can be integrated into training pipelines using libraries like TensorFlow and
PyTorch.
• Considerations:
• Care must be taken to ensure that augmentations do not alter the
fundamental characteristics of the data.
• Balance is needed; excessive augmentation can lead to noise and confuse the
model.
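A minimal sketch of an on-the-fly image augmentation pipeline using torchvision; the specific transforms and parameter values are illustrative:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomRotation(degrees=15),                # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # cropping and scaling
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # color adjustments
    transforms.ToTensor(),
])

# Typically attached to the training dataset so each image is augmented as it is loaded, e.g.:
# dataset = torchvision.datasets.ImageFolder("data/train", transform=train_transforms)
```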
Weight Decay
• Definition:
• A regularization technique that penalizes large weights in a neural network by adding a term
to the loss function.
• Mathematical Formulation:
• Loss = Original Loss + λ Σi wi²
• λ: Regularization strength (hyperparameter).
• wi: Weights of the model.
• Purpose:
• Prevents overfitting by discouraging the model from assigning excessive importance to any
single feature.
• Encourages smaller, more evenly distributed weights across the network.
• Mechanism:
• The added penalty term shrinks weights during optimization, effectively
controlling model complexity.
• Benefits:
• Leads to smoother loss surfaces, improving optimization stability.
• Can enhance model generalization to unseen data.
• Use Cases:
• Commonly used in various neural network architectures (e.g., CNNs, RNNs).
• Effective in scenarios where model simplicity is desired.
• Considerations:
• Choosing the right value for lambda is critical; too high may lead to
underfitting.
• Often combined with other regularization techniques for optimal results.
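A minimal sketch of weight decay applied through the optimizer in PyTorch (the model and the value of lambda are placeholders); with plain SGD this is equivalent to adding an L2 penalty to the loss:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)          # placeholder model for illustration
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    weight_decay=1e-4,            # lambda: shrinks weights toward zero on every step
)
# Training then proceeds as usual; each optimizer.step() applies the gradient
# update plus the weight-decay shrinkage.
```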
Batch Normalization
• Definition:
• A technique that normalizes the inputs of each layer in a neural network to improve
training stability and speed.
• Purpose:
• Reduces internal covariate shift by ensuring that the inputs to each layer have a
consistent distribution.
• Allows for higher learning rates and can reduce the need for other regularization
techniques.
• Mechanism:
• Normalizes activations using the mean and variance of the mini-batch.
• Applies learnable parameters (scaling and shifting) to restore the network’s capacity.
• Benefits:
• Accelerates convergence, leading to faster training times.
• Helps mitigate overfitting by introducing a form of regularization.
• Makes the training process less sensitive to weight initialization.
• Implementation:
• Typically inserted after linear layers and before activation functions.
• Can be applied in both feedforward and convolutional networks.
• Considerations:
• Requires careful tuning of batch size; too small can lead to noisy estimates of mean
and variance.
• Not always beneficial for all types of models, especially when batch sizes are very
small.
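A minimal sketch showing where batch normalization is typically inserted (after the linear or convolutional layer, before the activation); the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes each feature over the mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

cnn_block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),    # normalizes each channel over the mini-batch
    nn.ReLU(),
)

x = torch.randn(32, 784)   # batch statistics need more than one sample per batch
out = mlp(x)
```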
Ensemble Methods
• Definition:
• Techniques that combine predictions from multiple models to improve overall
performance and robustness.
• Purpose:
• Reduces model variance and increases accuracy by leveraging the strengths of
different algorithms.
• Helps mitigate overfitting by averaging out individual model errors.
• Common Types:
• Bagging:
• Builds multiple models from random subsets of the training data (e.g., Random Forest).
• Reduces variance by averaging predictions.
• The outputs of the base learners are combined into a final prediction: averaging for regression tasks, a majority vote for classification tasks.
• Boosting:
• A sequential ensemble method where weak learners (simple models) are built one after another.
• Each new model focuses on correcting the errors of the previous one (e.g., AdaBoost, XGBoost).
• This iterative process helps to reduce overall bias.
• Improves accuracy by converting weak learners into strong ones.
• Stacking:
• Combines predictions from multiple base models using a meta-model.
• Learns to make better predictions based on the outputs of the base models.
• Works in two stages:
• Stage 1: Training the base learners.
• Stage 2: Generating meta-features and building the final (meta) model.
• Benefits:
• Improved predictive performance compared to single models.
• Greater robustness against noise and outliers in data.
• Flexibility in combining diverse algorithms (e.g., decision trees, linear models).
• Limitations:
• Bagging may not significantly improve models that already have low variance.
• If boosting continues for too many rounds, the ensemble may become overly complex and overfit the training data, leading to poor performance on unseen data.
• In stacking, if the meta-model is not chosen or trained well, the ensemble can become too complex and overfit the data.
• Considerations:
• Increased computational cost and complexity due to training multiple models.
• Requires careful tuning of hyperparameters and model selection for optimal
performance.
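A minimal sketch of the three ensemble types using scikit-learn; the synthetic dataset and model settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,       # bagging of decision trees
    GradientBoostingClassifier,   # boosting: sequential weak learners
    StackingClassifier,           # stacking: meta-model over base learners
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)
stacking = StackingClassifier(
    estimators=[("rf", bagging), ("gb", boosting)],
    final_estimator=LogisticRegression(),   # meta-model (stage 2)
)

for name, clf in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(name, round(score, 3))
```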
Hyperparameter Tuning
• Definition:
• The process of optimizing hyperparameters to improve model performance and
generalization.
• Purpose:
• Adjusts settings that control the learning process, influencing how well a model fits the
data.
• Essential for achieving the best possible results from machine learning algorithms.
• Common Hyperparameters:
• Learning Rate: Determines step size during optimization.
• Regularization Strength: Controls the impact of regularization techniques (e.g., L1, L2).
• Batch Size: Number of samples processed before updating the model.
• Number of Epochs: Total training iterations over the entire dataset.
• Network Architecture: Number of layers, neurons per layer, and activation functions.
• Tuning Methods:
• Grid Search:
• Exhaustively tests combinations of hyperparameters.
• Simple but computationally expensive.
• Random Search:
• Samples hyperparameter combinations randomly.
• More efficient than grid search in high-dimensional spaces.
• Bayesian Optimization:
• Uses probabilistic models to find optimal hyperparameters iteratively.
• More sophisticated and often yields better results with fewer evaluations.
• Cross-Validation:
• Evaluates model performance using multiple splits of the training data to ensure
robustness.
• Benefits:
• Improved model accuracy and generalization.
• Helps prevent overfitting by optimizing regularization parameters.
• Considerations:
• Tuning can be time-consuming; balancing thoroughness with computational
resources is crucial.
• May require domain knowledge to set reasonable ranges for
hyperparameters.
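A minimal sketch of grid search with cross-validation using scikit-learn; the model, parameter grid, and synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Grid search over the regularization strength, evaluated with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}   # C is the inverse of the regularization strength
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```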
Summary of Regularization Techniques
• Importance of Regularization:
• Crucial for preventing overfitting in complex models, especially in deep learning.
• Overview of Techniques:
• Multiple regularization techniques available (L1, L2, Dropout, Early Stopping, etc.) to suit different scenarios.
• Impact on Model Performance:
• Proper application of regularization enhances model generalization and robustness.
• Balances bias and variance for optimal predictive performance.
• Practical Considerations:
• Hyperparameter tuning is essential for maximizing the effectiveness of regularization methods.
• Combining techniques can yield superior results.
• Future Directions:
• Ongoing research into more advanced regularization methods and adaptive techniques.
• Importance of understanding and experimenting with regularization in diverse applications.
• Final Takeaway:
• Regularization is a key component in the design of effective machine learning models, driving better
performance in real-world tasks.