Random Forest
Prof. Kailash Singh
Department of Chemical Engineering
MNIT Jaipur
What is Random Forest
• Random Forest is a versatile machine learning
  algorithm used for both classification and regression
  tasks.
• It works by building a large number of decision trees at
  training time and outputting either:
   – the majority vote of the trees (for classification), or
   – the mean of their predictions (for regression).
• Random forests are widely used because they handle
  complex data well, reduce overfitting through averaging,
  and give reliable predictions in many settings (see the
  sketch below).
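The sketch below (not from the original slides) contrasts the two output modes using scikit-learn and synthetic data; note that scikit-learn's classifier actually averages class probabilities rather than taking a strict hard vote, so treat this as illustrative:

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the forest combines its trees' votes into a class label
Xc, yc = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:3]))   # predicted class labels

# Regression: the forest reports the mean of its trees' predictions
Xr, yr = make_regression(n_samples=200, n_features=8, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))   # averaged numeric predictions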
A schematic of Random Forest
What is Ensemble Learning
• In ensemble learning, different models team
  up to enhance predictive performance.
• It’s all about leveraging the collective wisdom
  of the group to overcome individual
  limitations and make more informed decisions
  in various machine learning tasks.
• Some popular ensemble models include
  XGBoost, AdaBoost, LightGBM, Random
  Forest, Bagging, and Voting; a voting
  ensemble is sketched below.
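The sketch below (an illustration, not part of the slides) uses scikit-learn's VotingClassifier to let three heterogeneous models team up on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Three different models vote; with the default voting="hard",
# the majority class wins
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
print(cross_val_score(ensemble, X, y, cv=5).mean())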
What is Bagging and Boosting
• Bagging is an ensemble learning method in which multiple
  weak models are trained on different subsets of the
  training data.
• Each subset is sampled with replacement, and the prediction
  is made by averaging the weak models' predictions for
  regression problems and by majority vote for
  classification problems.
• Boosting trains multiple base models sequentially: each
  model tries to correct the errors made by the previous
  models. Each model is trained on a modified version of the
  dataset in which the instances misclassified by the
  previous models are given more weight. The final
  prediction is made by weighted voting. Both approaches
  are sketched below.
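The sketch below is illustrative only (the estimator keyword assumes scikit-learn 1.2 or newer, where it replaced base_estimator):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)   # a deliberately weak model

# Bagging: weak models fit independently on bootstrap samples, then vote
bag = BaggingClassifier(estimator=stump, n_estimators=100, random_state=0)

# Boosting: weak models fit sequentially, each focusing on the errors of
# its predecessors; the final prediction is a weighted vote
boost = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())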
How Random Forest Works
• Step 1: Bootstrapping
   – Draw several bootstrapped samples from the training data
     (sampling with replacement).
   – Build a decision tree from each bootstrapped sample.
• Step 2: Random Feature Selection
   – At each node of a tree, instead of considering all features,
     select a random subset of features and choose the best split
     among them.
   – This process reduces correlation between individual trees,
     making the model more robust.
• Step 3: Tree Voting/Averaging
   – In classification, trees "vote" for the class.
   – In regression, each tree produces a numeric prediction, and the
     average of these predictions becomes the final result.
   – A toy version of all three steps is sketched in code below.
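The sketch below is an illustrative from-scratch version, assuming the iris data; per-node feature selection is delegated to the tree's max_features argument, and names such as n_trees are made up for the example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
n_trees, max_features = 25, 2   # consider 2 of the 4 features at each split

trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (draw rows with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: random feature subset at every node, via max_features
    trees.append(DecisionTreeClassifier(max_features=max_features).fit(X[idx], y[idx]))

# Step 3: each tree votes; the majority class is the forest's prediction
votes = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
forest_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("Training accuracy:", (forest_pred == y).mean())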
Random Forest Hyperparameters
• Number of Trees (n_estimators):
   – The number of trees in the forest.
   – Larger forests tend to give better performance, but at the cost
     of higher computation.
• Max Features (max_features):
   – The maximum number of features to consider when splitting a node.
   – Higher values can result in better accuracy but may increase
     overfitting.
• Max Depth (max_depth):
   – The maximum depth of each decision tree.
   – Deeper trees capture more complexity but can also lead to
     overfitting.
• Min Samples Split (min_samples_split):
   – The minimum number of samples required to split an internal
     node.
• Min Samples Leaf (min_samples_leaf):
   – The minimum number of samples required to be at a leaf node.
(A grid-search sketch over these hyperparameters follows.)
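One common way to choose these values is a cross-validated grid search; the sketch below is illustrative, and the grid values are arbitrary choices, not recommendations:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
# Try every combination of the listed hyperparameter values with 5-fold CV
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={
        "n_estimators": [50, 100, 200],
        "max_features": ["sqrt", None],
        "max_depth": [3, 5, None],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)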
Python program
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset (e.g., the iris dataset)
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)  # fixed seed for reproducibility

# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Advantages of Random Forest
• High accuracy due to multiple decision trees.
• Reduces overfitting by averaging.
• Can handle missing data by using majority
  voting or averaging predictions.
• Provides a built-in measure of the importance of
  each feature in prediction (see the sketch below).
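Continuing the earlier program (reusing data and clf from the Python slide), the importance scores can be read off directly; a minimal sketch:

# feature_importances_ holds impurity-based importance scores
for name, score in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")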
Limitations
• Computational Complexity: Slower to train
  and predict due to multiple trees.
• Memory Usage: Requires more memory,
  especially for a large number of trees or large
  datasets.
• Interpretability: Harder to interpret than a
  single decision tree.
Applications of Random Forest
• Healthcare: Disease prediction, personalized
  medicine.
• Finance: Credit scoring, risk analysis.
• E-commerce: Customer segmentation,
  recommendation systems.
• Image Recognition: Classification of objects in
  images.
• Genomics: Feature selection and classification
  in bioinformatics.
Questions for students
• What is Random Forest used for?
• What is the difference between a decision tree
  and a random forest?
• What is the difference between XGBoost and
  Random Forest?