Random Forests
Deep Dive into Random Forests
How Random Forests Work
Random Forests are an ensemble learning method that builds multiple decision trees and merges them together to get a more accurate and stable prediction. The basic idea is to combine the output of multiple (randomly created) decision trees to generate a single result.
1. Training Process:
Bootstrap Sampling: Each tree is trained on a different bootstrapped sample of the original dataset; for each tree, a subset of the training data is randomly chosen with replacement.
Decision Tree Construction: Each tree is grown to its full extent without pruning. During the construction of each tree, a random subset of features is selected at each split point to determine the best split.
Aggregation of Predictions: For classification tasks, the final prediction is the majority vote of the individual trees. For regression tasks, it is the average of the predictions from all the individual trees.
2. Feature Randomness:
Random Feature Selection: At each split in the tree, a random subset of the features is considered. This randomness helps to create a diverse set of trees and ensures that the ensemble is not overly dependent on any single feature.
Reduces Correlation: By using different subsets of features, the correlation between the individual trees is reduced, which improves the overall performance of the Random Forest. (A short sketch of the whole procedure follows this list.)
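The following is a minimal, illustrative sketch of this procedure, written from scratch around scikit-learn's DecisionTreeClassifier and a synthetic dataset (the parameter values are arbitrary choices for illustration); in practice you would simply use RandomForestClassifier, which bundles bootstrap sampling, per-split feature randomness, and vote aggregation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []

for _ in range(n_trees):
    # Bootstrap sampling: draw len(X_train) rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    X_boot, y_boot = X_train[idx], y_train[idx]

    # Each tree is grown fully (no pruning); max_features="sqrt" makes the
    # tree consider only a random subset of features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X_boot, y_boot)
    trees.append(tree)

# Aggregation: majority vote across the individual trees (ties go to class 1).
all_preds = np.stack([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)

print("Ensemble accuracy:", (majority_vote == y_test).mean())
```

For regression, the only changes would be a DecisionTreeRegressor base learner and averaging the predictions instead of voting.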
Building and Tuning Random Forests
Practical Considerations
1. Data Preprocessing:
Handling Missing Values: Some Random Forest implementations can handle missing values internally, but it is still good practice to deal with them during preprocessing.
Feature Scaling: Not strictly necessary, since Random Forests are not sensitive to the scale of the features, but it can be beneficial for other preprocessing steps.
2. Training the Model:
Number of Trees (n_estimators): The number of trees in the forest. More trees generally lead to better performance, but at the cost of increased computation time.
Number of Features (max_features): The number of features to consider when looking for the best split. This can be set as a fixed number or as a fraction of the total number of features. (A minimal scikit-learn sketch follows this list.)
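A rough scikit-learn sketch of these two steps (the dataset, imputation strategy, and parameter values below are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# A small benchmark dataset; it has no missing values, so the imputer is
# included only to show where missing-value handling would fit.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# No feature scaling is added: trees split on thresholds, so the forest is
# insensitive to the scale of the features.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(
        n_estimators=200,      # number of trees
        max_features="sqrt",   # features considered at each split
        random_state=42,
        n_jobs=-1,             # use all available CPU cores
    ),
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```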
Hyperparameter Tuning
1. Key Hyperparameters:
n_estimators: The number of trees in the forest. A larger number of trees generally leads to better performance but also increases the computational cost.
max_features: The maximum number of features considered for splitting a node. Can be a fixed number or a fraction of the total features.
max_depth: The maximum depth of each tree. Deeper trees can capture more detail but are more likely to overfit.
min_samples_split: The minimum number of samples required to split an internal node. Higher values prevent the model from learning overly specific patterns (overfitting).
min_samples_leaf: The minimum number of samples required at a leaf node. A higher number makes the model more robust by smoothing its predictions.
bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
2. Grid Search and Cross-Validation:
Grid Search: A systematic way to work through multiple combinations of hyperparameter values, cross-validating each combination to determine which gives the best performance (see the sketch below).
Cross-Validation: Used to assess how the model will generalize to an independent dataset, helping to avoid overfitting.
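A minimal sketch of grid search with cross-validation over the hyperparameters listed above (the grid values and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Candidate values for the key hyperparameters (illustrative, not tuned).
param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", 0.5],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
}

# GridSearchCV runs 5-fold cross-validation for every combination in the grid
# and keeps the combination with the best average score.
search = GridSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Note that the grid grows multiplicatively with each added hyperparameter, so large grids become expensive; RandomizedSearchCV is a common cheaper alternative.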
Performance Evaluation
1. Metrics for Classification:
Accuracy: The fraction of correctly classified instances.
Precision: The fraction of predicted positives that are actually positive, TP / (TP + FP).
Recall: The fraction of actual positives that are correctly identified, TP / (TP + FN).
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
2. Metrics for Regression:
Mean Squared Error (MSE): The average of the squared errors, giving higher weight to larger errors.
Mean Absolute Error (MAE): The average of the absolute errors, providing a linear score that does not over-penalize large errors.
R-squared: The proportion of the variance in the dependent variable that is predictable from the independent variables.
3. Out-of-Bag (OOB) Error Estimate:
Definition: An internal validation method where each tree is evaluated on the data not used in its bootstrap sample.
Purpose: Provides an unbiased estimate of the generalization error without the need for a separate validation set.
4. Confusion Matrix:
Definition: A table used to evaluate the performance of a classification algorithm by comparing actual vs. predicted classifications.
Components: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).
5. Receiver Operating Characteristic (ROC) Curve and AUC:
ROC Curve: A graphical representation of the true positive rate vs. the false positive rate at various threshold settings.
AUC (Area Under the Curve): A single scalar value summarizing the ROC curve, used to compare the performance of different models. (The sketch after this list computes these evaluation quantities.)
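A short sketch that computes these evaluation quantities with scikit-learn (dataset and parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True scores each tree on the samples left out of its bootstrap
# sample, giving the out-of-bag estimate described above.
clf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
print("Confusion matrix (rows = actual, columns = predicted):")
print(confusion_matrix(y_test, y_pred))
print("OOB score estimated during training:", clf.oob_score_)
```

For regression, the analogous calls are mean_squared_error, mean_absolute_error, and r2_score from sklearn.metrics, used with a RandomForestRegressor.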
Summary
Random Forests are a powerful and versatile machine learning technique that improves accuracy and robustness by combining multiple decision trees.
Feature Randomness and Bootstrap Sampling are key to reducing variance and preventing overfitting.
Hyperparameter Tuning and Performance Evaluation are essential to optimizing and assessing the model.
Random Forests can be applied to both classification and regression tasks and are widely used in various domains for their effectiveness and ease of use.