Random Forest is a supervised learning algorithm.
As you can already see from its name, it creates a forest and makes it somehow random. The forest it builds is an ensemble of Decision Trees, most of the time trained with the bagging method. Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. It is also one of the most used algorithms because of its simplicity and because it can be used for both classification and regression tasks. In this post you are going to learn how the random forest algorithm works and several other important things about it.
Table of Contents:
How it works
Real Life Analogy
Feature Importance
Difference between Decision Trees and Random Forests
Important Hyperparameters (predictive power, speed)
Advantages and Disadvantages
Use Cases
Summary
How it works:
As already mentioned, the forest that Random Forest builds is an ensemble of Decision Trees, most of the time trained with the bagging method. The general idea of the bagging method is that a combination of learning models increases the overall result. To put it simply: Random Forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Below you can see what a random forest with two trees would look like:
Random Forest has nearly the same hyperparameters as a decision tree or a bagging classifier. Fortunately, you don't have to combine a decision tree with a bagging classifier; you can simply use Random Forest's classifier class. As I already said, with Random Forest you can also handle regression tasks by using the Random Forest regressor. Random Forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature when splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally leads to a better model.
Therefore in Random Forest only a random subset of the features is taken into consideration by the
algorithm for splitting a node. You can even make trees more random by additionally using random
thresholds for each feature rather than searching for the best possible thresholds (like a normal
decision tree does).
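To make these two ideas concrete, here is a minimal sketch with scikit-learn; the dataset and parameter values are synthetic and purely illustrative. RandomForestClassifier considers a random subset of features at each split, while ExtraTreesClassifier is the extra-random variant that additionally draws random split thresholds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Synthetic stand-in data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Random Forest: each split only considers a random subset of the features.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X, y)

# Extra Trees: on top of the random feature subset, it also uses random split
# thresholds instead of searching for the best threshold of each feature.
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=42)
et.fit(X, y)

print(rf.predict(X[:5]), et.predict(X[:5]))
```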
Real Life Analogy:
Imagine a guy named Andrew who wants to decide which places he should travel to during a one-year vacation trip. He asks people who know him for advice. First he goes to a friend, who asks Andrew where he traveled to in the past and whether he liked it or not. Based on the answers, he will give Andrew some advice. This is a typical decision tree algorithm approach: Andrew's friend created rules to guide his decision about what he should recommend, using Andrew's answers.
Afterwards, Andrew starts asking more and more of his friends to advise him, and they again ask him different questions from which they can derive some recommendations. Then he chooses the places that were recommended to him the most, which is the typical Random Forest algorithm approach.
Feature Importance:
Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature on the prediction. Sklearn provides a great tool for this that measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. If you don't know how a decision tree works or what a leaf or node is, here is a good description from Wikipedia: In a decision tree each internal node represents a test on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). A node that has no children is a leaf.
By looking at the feature importance you can decide which features you may want to drop, because they contribute too little, or nothing at all, to the prediction process. This is important because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa.
Below you can see a table and a visualization that show the importance of 13 features that we used during a supervised classification project with the famous Titanic dataset on Kaggle. You can find the whole project here.
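The Titanic table and plot themselves are not reproduced here, but the measurement can be sketched on any dataset through scikit-learn's feature_importances_ attribute; the dataset below is just a stand-in, not the Titanic data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# A stand-in dataset (not the Titanic data from the project mentioned above).
data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ measures how much the nodes using each feature reduce
# impurity, averaged over all trees, and is scaled so the values sum to 1.
importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))
print(importances.sum())  # ~1.0
```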
Difference between Decision Trees and Random Forests: As I already mentioned, Random Forest is a collection of Decision Trees, but there are some differences. If you input a training dataset with features and labels into a decision tree, it will formulate a set of rules which will be used to make the predictions. For example, if you want to predict whether a person will click on an online advertisement, you could collect the ads the person clicked on in the past and some features that describe his decision. If you put the features and labels into a decision tree, it will generate some rules, and then you can predict whether the advertisement will be clicked or not. In comparison, the Random Forest algorithm randomly selects observations and features to build several decision trees and then averages the results. Another difference is that deep decision trees might suffer from overfitting. Random Forest prevents overfitting most of the time by creating random subsets of the features and building smaller trees using these subsets. Afterwards, it combines the subtrees. Note that this doesn't work every time and that it also makes the computation slower, depending on how many trees your random forest builds.
Important Hyperparameters:
The hyperparameters in random forest are either used to increase the predictive power of the model or to make the model faster. We will talk here about the hyperparameters of sklearn's built-in random forest function.
1. Increasing the Predictive Power
Firstly, there is the n_estimators hyperparameter, which is just the number of trees the algorithm builds before taking the majority vote or the average of the predictions. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation. Another important hyperparameter is max_features, which is the maximum number of features Random Forest considers when splitting a node.
2. Increasing the Model's Speed
The n_jobs hyperparameter tells the engine how many processors it is allowed to use. If it has a value of 1, it can only use one processor; a value of -1 means that there is no limit.
random_state makes the model's output replicable: the model will always produce the same results when it has a definite value of random_state and has been given the same hyperparameters and the same training data.
Lastly, there is the oob_score (also called oob sampling), which is a random forest cross-validation method. In this sampling, about one-third of the data is not used to train the model and can instead be used to evaluate its performance. These samples are called the out-of-bag samples. It is very similar to the leave-one-out cross-validation method, but almost no additional computational burden goes along with it.
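As a small sketch of how these hyperparameters are set with scikit-learn's RandomForestClassifier (the data is synthetic and the concrete values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=200,     # number of trees built before voting / averaging
    max_features="sqrt",  # maximum number of features considered when splitting a node
    n_jobs=-1,            # -1 = use all available processors
    random_state=42,      # makes the model's output replicable
    oob_score=True,       # evaluate on the out-of-bag samples
)
model.fit(X, y)
print(model.oob_score_)   # OOB estimate of the accuracy
```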
Advantages and Disadvantages:
As we already mentioned, an advantage of random forest is that it can be used for both regression and classification tasks, and it's easy to view the relative importance it assigns to the input features. Random Forest is also considered a very handy and easy-to-use algorithm, because its default hyperparameters often produce a good prediction result. The number of hyperparameters is also not that high, and they are straightforward to understand. One of the big problems in machine learning is overfitting, but most of the time this won't happen that easily to a random forest classifier: if there are enough trees in the forest, the classifier won't overfit. The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train but quite slow to create predictions once they are trained. A more accurate prediction requires more trees, which results in a slower model. In most real-world applications the random forest algorithm is fast enough, but there can certainly be situations where run-time performance is important and other approaches would be preferred. And of course, Random Forest is a predictive modeling tool and not a descriptive tool. That means if you are looking for a description of the relationships in your data, other approaches would be preferred.
Use Cases: The random forest algorithm is used in a lot of different fields, like banking, the stock market, medicine and e-commerce. In banking it is used, for example, to detect customers who will use the bank's services more frequently than others and repay their debt in time. In this domain it is also used to detect fraudsters who want to scam the bank. In finance it is used to determine a stock's behavior in the future. In the healthcare domain it is used to identify the correct combination of components in medicine and to analyze a patient's medical history to identify diseases. And lastly, in e-commerce random forest is used to determine whether a customer will actually like a product or not.
Summary: Random Forest is a great algorithm to train early in the model development process, to see how it performs, and it's hard to build a bad Random Forest because of its simplicity. The algorithm is also a great choice if you need to develop a model in a short period of time. On top of that, it provides a pretty good indicator of the importance it assigns to your features. Random Forests are also very hard to beat in terms of performance. Of course, you can probably always find a model that performs better, like a neural network, but these usually take much more time to develop. And on top of that, Random Forests can handle a lot of different feature types, like binary, categorical and numerical ones. Overall, Random Forest is a (mostly) fast, simple and flexible tool, although it has its limitations.
Bagging and Random Forest Ensemble Algorithms for Machine
Learning
Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type
of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
Bootstrap Method
The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest
to understand if the quantity is a descriptive statistic such as a mean or a standard deviation. Let’s assume
we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample. We can
calculate the mean directly from the sample as:
mean(x) = 1/100 * sum(x)
We know that our sample is small and that our mean has error in it. We can improve the estimate of our
mean using the bootstrap procedure:
Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can
select the same value multiple times).
Calculate the mean of each sub-sample.
Calculate the average of all of our collected means and use that as our estimated mean for the
data.
For example, let's say we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average
of these we could take the estimated mean of the data to be 3.367. This process can be used to estimate
other quantities like the standard deviation and even quantities used in machine learning algorithms like
learned coefficients.
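A minimal sketch of this bootstrap procedure in Python; the sample itself is randomly generated here, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)  # a sample of 100 values

# 1. draw many (here 1000) random sub-samples with replacement,
# 2. calculate the mean of each sub-sample,
# 3. average the collected means to get the bootstrap estimate of the mean.
boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(1000)]
print(x.mean(), np.mean(boot_means))
```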
Bootstrap Aggregation (Bagging)
Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method. An
ensemble method is a technique that combines the predictions from multiple machine learning algorithms
together to make more accurate predictions than any individual model. Bootstrap Aggregation is a general
procedure that can be used to reduce the variance of algorithms that have high variance. Algorithms with high variance include decision trees, like classification and regression trees (CART). Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data), the resulting decision tree can be quite different and in turn the predictions can be quite different. Bagging is the application of the bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Let's assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.
Create many (e.g. 100) random sub-samples of our dataset with replacement.
Train a CART model on each sample.
Given a new dataset calculate the average prediction from each model.
For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue. When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging. The only parameters when bagging decision trees are the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement (e.g. on a cross-validation test harness). Very large numbers of models may take a long time to prepare, but will not overfit the training data. Just like the decision trees themselves, bagging can be used for classification and regression problems.
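Here is a from-scratch sketch of these three steps, using scikit-learn's decision tree as the CART model. The data is synthetic; in practice you would normally reach for a library implementation such as scikit-learn's BaggingClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
rng = np.random.default_rng(1)

# 1. create many random sub-samples of the dataset with replacement,
# 2. train a CART model (a deep, unpruned decision tree) on each sample.
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample indices
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# 3. given new data, combine the predictions (majority vote for classification).
new_points = X[:5]
votes = np.array([tree.predict(new_points) for tree in trees]).astype(int)
majority = [np.bincount(votes[:, i]).argmax() for i in range(len(new_points))]
print(majority)
```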
Random Forest
Random Forests are an improvement over bagged decision trees. A problem with decision trees like CART is that they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with bagging, the decision trees can have a lot of structural similarities and in turn have high correlation in their predictions. Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or at best weakly correlated. Random forest changes the way the sub-trees are learned so that the resulting predictions from all of the sub-trees have less correlation. It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the most optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search over. The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross validation (see the sketch below).
For classification a good default is: m = sqrt(p)
For regression a good default is: m = p/3
Where m is the number of randomly selected features that can be searched at a split point and p is the
number of input variables. For example if a dataset had 25 input variables for a classification problem
then:
m = sqrt(25)
m = 5
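One possible way to tune m by cross validation is sketched below with scikit-learn's GridSearchCV, where max_features plays the role of m; the candidate values and data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 25 input variables, so the classification default would be m = sqrt(25) = 5.
X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# Try a few values of m (max_features) around the default and pick the best
# one by cross validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": [3, 5, 7, "sqrt"]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```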
Estimated Performance
For each bootstrap sample taken from the training data there will be samples left behind that were not
included. These samples are called Out-Of-Bag samples or OOB. The performance of each model on its
left out samples when averaged can provide an estimated accuracy of the bagged models. This estimated
performance is often called the OOB estimate of performance. These performance measures are a reliable estimate of the test error and correlate well with cross-validation estimates.
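In scikit-learn the OOB estimate can be read off a fitted forest and compared against a cross-validation estimate, as in this small sketch (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.oob_score_)                           # OOB estimate of accuracy
print(cross_val_score(forest, X, y, cv=5).mean())  # cross-validation estimate
```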
Variable Importance
As the Bagged decision trees are constructed we can calculate how much the error function drops for a
variable at each split point. In regression problems this may be the drop in sum squared error and in
classification this might be the Gini score. These drops in error can be averaged across all decision trees
and output to provide an estimate of the importance of each input variable. The greater the drop when the
variable was chosen the greater the importance. These outputs can help identify subsets of input variables
that may be most or least relevant to the problem and suggest possible feature selection experiments
you could perform where some features are removed from the dataset.
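The same averaging can be seen directly on a fitted scikit-learn forest, which exposes the per-tree importance scores; a stand-in dataset is used here, and the averaged scores should closely match the forest's own feature_importances_.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each fitted tree records how much it reduced the error criterion (Gini here)
# per feature; averaging those per-tree scores across the forest gives the
# overall variable importance.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
print(per_tree.mean(axis=0)[:5])
print(forest.feature_importances_[:5])  # should closely agree
```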
A Short Introduction: Bagging and Random Forest
Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of
ensemble machine learning algorithm called Bootstrap Aggregation or bagging. The bootstrap is a
powerful statistical method for estimating a quantity from a data sample, such as a mean. You take lots of
samples of your data, calculate the mean, then average all of your mean values to give you a better
estimation of the true mean value. In bagging, the same approach is used, but instead for estimating entire
statistical models, most commonly decision trees. Multiple samples of your training data are taken then
models are constructed for each data sample. When you need to make a prediction for new data, each
model makes a prediction and the predictions are averaged to give a better estimate of the true output
value. Random forest is a tweak on this approach where decision trees are created so that rather than
selecting optimal split points, suboptimal splits are made by introducing randomness. The models created
for each sample of the data are therefore more different than they otherwise would be, but still accurate in
their unique and different ways. Combining their predictions results in a better estimate of the true
underlying output value.
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak
classifiers. This is done by building a model from the training data then creating a second model that
attempts to correct the errors from the first model. Models are added until the training set is predicted
perfectly or a maximum number of models are added. AdaBoost was the first really successful boosting
algorithm developed for binary classification. It is the best starting point for understanding boosting.
Modern boosting methods build on AdaBoost most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree that is created should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially one after the other, each updating the weights on the training instances, which affects the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on the training data. Because the algorithm puts so much attention on correcting mistakes, it is important that you have clean data with outliers removed.
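A minimal AdaBoost sketch with scikit-learn; the data is synthetic and the number of trees is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The default base learner is a depth-1 decision tree (a "decision stump");
# trees are added one after another, each focusing more on the training
# instances that the previous trees misclassified.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
print(boost.score(X, y))  # accuracy on the training data
```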
Machine Learning Basics – Random Forest
Random Forest (RF) is one of the many machine learning algorithms used for supervised learning, which means learning from labelled data and making predictions based on the learned patterns. RF can be
used for both classification and regression tasks.
Decision trees
RF is based on decision trees. In machine learning, decision trees are a technique for creating predictive models. They are called decision trees because the prediction follows several branches of "if… then…" decision splits, similar to the branches of a tree. If we imagine that we start with a sample for which we want to predict a class, we would start at the bottom of a tree and travel up the trunk until we come to the first split-off branch. This split can be thought of as a feature in machine learning; let's say it is "age". We would now make a decision about which branch to follow: "if our sample has an age bigger than 30, continue along the left branch, else continue along the right branch". We would do this until we come to the next branch and repeat the same decision process until there are no more branches before us. This endpoint is called a leaf and in decision trees it represents the final result: a predicted class or value. At each branch, the feature threshold that best splits the (remaining) samples locally is found. The most common metrics for defining the "best split" are Gini impurity and information gain for classification tasks, and variance reduction for regression. Single decision trees are very easy to visualize and understand because they follow a method of decision-making that is very similar to how we humans make decisions: with a chain of simple rules. However, they are not very robust, i.e. they don't generalize well to unseen samples. Here is where Random Forests come into play.
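A tiny sketch of such a chain of if/then rules, fitted and printed with scikit-learn; the iris data is used here purely as a stand-in, since the "age" example above is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A stand-in dataset; the "age > 30" split above is only a hypothetical example.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned chain of if/then splits; each leaf shows the
# final predicted class.
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]
print(export_text(tree, feature_names=feature_names))
```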
Ensemble learning
RF makes predictions by combining the results from many individual decision trees, so we call them a forest of decision trees. Because RF combines multiple models, it falls under the category of ensemble
learning. Other ensemble learning methods are gradient boosting and stacked ensembles.
Combining decision trees
There are two main ways for combining the outputs of multiple decision trees into a random forest:
Bagging which is also called Bootstrap aggregation (used in Random Forests)
Boosting (used in Gradient Boosting Machines)
Bagging works the following way: decision trees are trained on randomly sampled subsets of the data, where the sampling is done with replacement. Bagging is the default method used with Random Forests. A big advantage of bagging over individual trees is that it decreases the variance of the model.
Individual trees are very prone to overfitting and are very sensitive to noise in the data. As long as our
individual trees are not correlated combining them with bagging will make them more robust without
increasing the bias. The part about correlation is important, though! We remove (most of) the correlation
by randomly sampling subsets of data and training the different decision trees on these subsets instead of
on the entire dataset. In addition to randomly sampling instances from our data RF also uses feature
bagging. With feature bagging at each split in the decision tree only a random subset of features is
considered. This technique reduces correlation even more because it helps reduce the impact of very
strong predictor variables (i.e. features that have a very strong influence on predicting the target or
response variable). Boosting works similarly but with one major difference: the samples are weighted for
sampling so that samples which were predicted incorrectly get a higher weight and are therefore sampled
more often. The idea behind this is that difficult cases should be emphasized during learning compared to
easy cases. Because of this difference, bagging can be easily parallelized, while boosting is performed
sequentially.
The final result of our model is calculated by averaging over all predictions from these sampled trees or
by majority vote.
Hyperparameters to be tuned
Hyperparameters are the arguments that can be set before training and that define how the training is done. The main hyperparameters in Random Forests are
The number of decision trees to be combined
The maximum depth of the trees
The maximum number of features considered at each split
Whether bagging/bootstrapping is performed with or without replacement
Training Random Forest models
Random Forest implementations are available in many machine learning libraries for R and Python, such as caret (R, which imports the randomForest and other RF packages), Scikit-learn (Python) and H2O (R and Python).
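For example, a minimal scikit-learn sketch that sets the four hyperparameters listed above; the concrete values and data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = RandomForestClassifier(
    n_estimators=300,     # number of decision trees to be combined
    max_depth=10,         # maximum depth of the trees
    max_features="sqrt",  # maximum number of features considered at each split
    bootstrap=True,       # whether sub-samples are drawn with replacement
    random_state=7,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))      # accuracy on held-out data
print(model.predict_proba(X_test[:3]))  # prediction probabilities
```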
Other tree-based machine learning algorithms
The pros of Random Forests are that they are a relatively fast and powerful algorithm for classification and regression learning, the calculations can be parallelized, they perform well on many problems, even with small datasets, and the output returns prediction probabilities. Downsides of Random Forests are that they are black boxes, meaning that we can't interpret the decisions made by the model because they are too complex. RF is also somewhat prone to overfitting and tends to be bad at predicting underrepresented classes in unbalanced datasets. Other tree-based algorithms are (Extreme) Gradient Boosting and Rotation Forests.
Lesson 13: Bagging and Random Forest
Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of
ensemble machine learning algorithm called Bootstrap Aggregation or bagging. The bootstrap is a
powerful statistical method for estimating a quantity from a data sample, such as a mean. You take lots of
samples of your data, calculate the mean, then average all of your mean values to give you a better
estimation of the true mean value. In bagging, the same approach is used, but instead for estimating entire
statistical models, most commonly decision trees. Multiple samples of your training data are taken then
models are constructed for each data sample. When you need to make a prediction for new data, each
model makes a prediction and the predictions are averaged to give a better estimate of the true output
value. Random forest is a tweak on this approach where decision trees are created so that rather than
selecting optimal split points, suboptimal splits are made by introducing randomness. The models created
for each sample of the data are therefore more different than they otherwise would be, but still accurate in
their unique and different ways. Combining their predictions results in a better estimate of the true
underlying output value. If you get good results with an algorithm with high variance (like decision
trees), you can often get better results by bagging that algorithm.
Lesson 14: Boosting and AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak
classifiers. This is done by building a model from the training data, then creating a second model that
attempts to correct the errors from the first model. Models are added until the training set is predicted
perfectly or a maximum number of models are added. AdaBoost was the first really successful boosting
algorithm developed for binary classification. It is the best starting point for understanding boosting.
Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on
each training instance is used to weight how much attention the next tree that is created should pay
to each training instance. Training data that is hard to predict is given more weight, whereas
easy to predict instances are given less weight. Models are created sequentially one after the other, each
updating the weights on the training instances that affect the learning performed by the next tree in the
sequence. After all the trees are built, predictions are made for new data, and the performance of each tree
is weighted by how accurate it was on the training data. Because so much attention is put on correcting
mistakes, it is important that you have clean data with outliers removed.