Random Forest (RF)
Random Forest (RF) is one of the many machine learning algorithms used for supervised learning,
meaning that it learns from labelled data and makes predictions based on the learned patterns. RF
can be used for both classification and regression tasks.
Decision trees
RF is based on decision trees. In machine learning decision trees are a technique for creating
predictive models. They are called decision trees because the prediction follows several
branches of “if… then…” decision splits - similar to the branches of a tree.
If we imagine that we start with a sample for which we want to predict a class, we would
start at the bottom of a tree and travel up the trunk until we come to the first split-off
branch. This split can be thought of as a feature in machine learning; let's say it is
"age": we would now make a decision about which branch to follow: "if our sample has an
age bigger than 30, continue along the left branch, else continue along the right branch".
We would repeat this decision process at every branch we reach, until there are no more
branches before us. This endpoint is called a leaf and in decision trees represents the
final result: a predicted class or value.
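A single tree's prediction can be sketched as a chain of such if/else rules; the features, thresholds, and classes below are invented purely for illustration:

```python
# A minimal sketch of how one decision tree predicts: follow the
# "if... then..." splits until a leaf is reached. All feature names,
# thresholds, and class labels here are made up.

def predict(sample: dict) -> str:
    # First split: the "age" feature with a threshold of 30.
    if sample["age"] > 30:
        # Second split on this branch: the "income" feature.
        if sample["income"] > 50_000:
            return "class A"   # leaf: final predicted class
        return "class B"       # leaf
    # The other branch ends directly in a leaf.
    return "class B"           # leaf

print(predict({"age": 42, "income": 60_000}))  # class A
print(predict({"age": 25, "income": 80_000}))  # class B
```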
At each branch, the feature and threshold that best split the (remaining) samples locally
are found.
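One common way to score candidate splits is Gini impurity. A toy sketch of this local search, assuming a single numeric feature and binary labels (the data is made up), could look like this:

```python
# Sketch of a local split search: for one numeric feature, try each
# candidate threshold and keep the one with the lowest weighted Gini
# impurity of the two resulting groups. Real implementations repeat
# this for every feature and pick the best overall split.

def gini(labels):
    # Gini impurity of a group of binary labels (0/1).
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n           # fraction of positive labels
    return 1.0 - p**2 - (1 - p)**2

def best_threshold(values, labels):
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):
        left  = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weighted average of the two groups' impurities.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

ages   = [22, 25, 28, 35, 40, 45]
labels = [0, 0, 0, 1, 1, 1]        # perfectly separable at age 28
print(best_threshold(ages, labels))  # 28
```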
Single decision trees are very easy to visualize and understand because they follow a
method of decision-making that is very similar to how we humans make decisions: with a
chain of simple rules. However, they are not very robust, i.e. they don’t generalize well to
unseen samples. Here is where Random Forests come into play.
Ensemble learning
RF makes predictions by combining the results from many individual decision trees - so we call the
model a forest of decision trees. Because RF combines multiple models, it falls under the category of
ensemble learning. Other ensemble learning methods are gradient boosting and stacked ensembles.
Combining decision trees
There are two main ways for combining the outputs of multiple decision trees into a random forest:
1. Bagging, also called bootstrap aggregation (the default method used in Random Forests)
Decision trees are trained on randomly sampled subsets of the data, where sampling is
done with replacement.
A big advantage of bagging over individual trees is that it decreases the variance of the
model. Individual trees are very prone to overfitting and are very sensitive to noise in
the data. As long as our individual trees are not correlated, combining them with
bagging will make them more robust without increasing the bias.
We remove (most of) the correlation by randomly sampling subsets of data and training
the different decision trees on these subsets instead of on the entire dataset.
In addition to randomly sampling instances from our data, RF also uses feature bagging.
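The two kinds of random sampling can be sketched with Python's standard library (the toy dataset and feature names are placeholders):

```python
import random

# Sketch of the per-tree sampling a random forest uses:
# - a bootstrap sample of the rows, drawn WITH replacement
# - a feature subset, drawn WITHOUT replacement (feature bagging)
# The dataset and feature names are made up for illustration.

random.seed(42)  # for reproducibility

data = list(range(10))                       # stand-in for 10 training samples
features = ["age", "income", "height", "weight"]

def bootstrap_sample(rows):
    # With replacement: the same row can be drawn several times.
    return random.choices(rows, k=len(rows))

def feature_subset(names, k=2):
    # Without replacement: each tree sees only k distinct features.
    return random.sample(names, k)

for tree_id in range(3):
    print(tree_id, bootstrap_sample(data), feature_subset(features))
```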
2. Boosting (used in Gradient Boosting Machines)
The samples are weighted for sampling so that samples which were predicted incorrectly
get a higher weight and are therefore sampled more often.
The idea behind this is that difficult cases should be emphasized during learning compared
to easy cases.
Because of this difference, bagging can be easily parallelized, while boosting is performed
sequentially.
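The reweighting idea can be sketched as follows; the doubling factor is purely illustrative, as real boosting algorithms such as AdaBoost derive the weight update from the model's error rate:

```python
import random

# Sketch of the boosting idea: after a round of training, samples the
# current model got wrong receive a higher weight, so they are drawn
# more often when training the next model. The factor of 2.0 is an
# illustrative assumption, not taken from any real algorithm.

def reweight(weights, correct, factor=2.0):
    new = [w if ok else w * factor for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]          # renormalize to sum to 1

weights = [0.25, 0.25, 0.25, 0.25]           # start with uniform weights
correct = [True, True, False, True]          # sample 2 was misclassified

weights = reweight(weights, correct)
print(weights)   # sample 2 now carries twice the weight of the others

# The next round's training set is drawn in proportion to these weights:
random.seed(0)
next_sample = random.choices(range(4), weights=weights, k=8)
```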
Final Result
The final result of our model is calculated by combining the predictions of all these sampled trees:
by majority vote for classification, or by averaging the predicted values for regression.
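Given one prediction per tree, the two combination rules might look like this (the tree outputs below are made up):

```python
from collections import Counter

# Sketch of how a forest combines its trees' outputs:
# majority vote for classification, the mean for regression.

def majority_vote(predictions):
    # Classification: the class predicted by most trees wins.
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    # Regression: the mean of the trees' predicted values.
    return sum(predictions) / len(predictions)

class_preds = ["A", "B", "A", "A", "B"]   # one prediction per tree
value_preds = [3.1, 2.9, 3.4, 3.0]

print(majority_vote(class_preds))  # A
print(average(value_preds))
```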
Hyperparameters
Hyperparameters are the arguments that can be set before training and which define how
the training is done.
The main hyperparameters in Random Forests are:
o The number of decision trees to be combined
o The maximum depth of the trees
o The maximum number of features considered at each split
o Whether bagging/bootstrapping is performed with or without replacement
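Assuming scikit-learn is used, these hyperparameters map onto the RandomForestClassifier constructor roughly as follows (parameter names per scikit-learn's API; the values are arbitrary examples):

```python
from sklearn.ensemble import RandomForestClassifier

# Example values only; suitable settings depend on the dataset.
rf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees to be combined
    max_depth=5,          # maximum depth of each tree
    max_features="sqrt",  # max number of features considered at each split
    bootstrap=True,       # sample with replacement (bagging)
    random_state=42,      # for reproducibility
)
```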
Pros and Cons of Random Forests:
Pros
They are a relatively fast and powerful algorithm for classification and regression learning.
Calculations can be parallelized, they perform well on many problems (even with small
datasets), and the output includes prediction probabilities.
Cons
They are black boxes, meaning that we can't easily interpret the decisions made by the model
because they are too complex.
RF is also somewhat prone to overfitting, and it tends to be bad at predicting
underrepresented classes in unbalanced datasets.
Boosting
The idea of boosting grew out of the question of whether a weak learner can be modified to
become better.
A weak hypothesis or weak learner is defined as one whose performance is at least slightly
better than random chance.