0% found this document useful (0 votes)

30 views31 pages

Biau 2016

Uploaded by

Kashif Alim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

30 views31 pages

Biau 2016

Uploaded by

Kashif Alim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

TEST

DOI 10.1007/s11749-016-0481-7

INVITED PAPER

A random forest guided tour

Gérard Biau1,2 · Erwan Scornet1

© Sociedad de Estadística e Investigación Operativa 2016

Abstract The random forest algorithm, proposed by L. Breiman in 2001, has been
extremely successful as a general-purpose classification and regression method. The
approach, which combines several randomized decision trees and aggregates their pre-
dictions by averaging, has shown excellent performance in settings where the number
of variables is much larger than the number of observations. Moreover, it is versatile
enough to be applied to large-scale problems, is easily adapted to various ad hoc learn-
ing tasks, and returns measures of variable importance. The present article reviews the
most recent theoretical and methodological developments for random forests. Empha-
sis is placed on the mathematical forces driving the algorithm, with special attention
given to the selection of parameters, the resampling mechanism, and variable impor-
tance measures. This review is intended to provide non-experts easy access to the main
ideas.

Keywords Random forests · Randomization · Resampling · Parameter tuning ·

Variable importance

This invited paper is discussed in comments available at: doi:10.1007/s11749-016-0482-6;

doi:10.1007/s11749-016-0483-5; doi:10.1007/s11749-016-0484-4; doi:10.1007/s11749-016-0485-3;
doi:10.1007/s11749-016-0487-1.

B Gérard Biau
gerard.biau@upmc.fr
Erwan Scornet
erwan.scornet@upmc.fr

1 Sorbonne Universités, UPMC Univ Paris 06, CNRS, Laboratoire de Statistique Théorique et
Appliquées (LSTA), boîte 158, 4 place Jussieu, 75005 Paris, France
2 Institut universitaire de France, Paris, France

123
G. Biau, E. Scornet

Mathematics Subject Classification 62G02

1 Introduction

To take advantage of the sheer size of modern data sets, we now need learning algo-
rithms that scale with the volume of information, while maintaining sufficient statistical
efficiency. Random forests, devised by Breiman in the early 2000s (Breiman 2001),
are part of the list of the most successful methods currently available to handle data
in these cases. This supervised learning procedure, influenced by the early work of
Amit and Geman (1997), Ho (1998), and Dietterich (2000), operates according to
the simple but effective “divide and conquer” principle: sample fractions of the data,
grow a randomized tree predictor on each small piece, then paste (aggregate) these
predictors together.
What has greatly contributed to the popularity of forests is the fact that they can
be applied to a wide range of prediction problems and have few parameters to tune.
Aside from being simple to use, the method is generally recognized for its accuracy
and its ability to deal with small sample sizes and high-dimensional feature spaces. At
the same time, it is easily parallelizable and has, therefore, the potential to deal with
large real-life systems. The corresponding R package randomForest can be freely
downloaded on the CRAN web site (http://www.r-project.org), while a MapReduce
(Jeffrey and Sanja 2008) open source implementation called Partial Decision Forests
is available on the Apache Mahout website at https://mahout.apache.org. This allows
the building of forests using large data sets as long as each partition can be loaded into
memory.
The random forest methodology has been successfully involved in various prac-
tical problems, including a data science hackathon on air quality prediction (http://
www.kaggle.com/c/dsg-hackathon), chemoinformatics (Svetnik et al. 2003), ecol-
ogy (Prasad et al. 2006; Cutler et al. 2007), 3D object recognition (Shotton et al.
2011), and bioinformatics (Díaz-Uriarte and de Andrés 2006), just to name a few.
Howard (Kaggle) and Bowles (Biomatica) claim in Howard and Bowles (2012)
that ensembles of decision trees—often known as “random forests”—have been the
most successful general-purpose algorithm in modern times, while Varian, Chief
Economist at Google, advocates in Varian (2014) the use of random forests in
econometrics.
On the theoretical side, the story of random forests is less conclusive and, despite
their extensive use, little is known about the mathematical properties of the method.
The most celebrated theoretical result is that of Breiman (2001), which offers an upper
bound on the generalization error of forests in terms of correlation and strength of the
individual trees. This was followed by a technical note (Breiman 2004), which focuses
on a stylized version of the original algorithm (see also Breiman 2000a, b). A critical
step was subsequently taken by Lin and Jeon (2006), who highlighted an interesting
connection between random forests and a particular class of nearest neighbor predic-
tors, further developed by Biau and Devroye (2010). In recent years, various theoretical
studies have been performed (e.g., Meinshausen 2006; Biau et al. 2008; Ishwaran and
Kogalur 2010; Biau 2012; Genuer 2012; Zhu et al. 2015), analyzing more elaborate

123
A random forest guided tour

models and moving ever closer to the practical situation. Recent attempts towards
narrowing the gap between theory and practice include that of Denil et al. (2013),
who prove the consistency of a particular online forest, Wager (2014) and Mentch and
Hooker (2015), who study the asymptotic distribution of forests, and Scornet et al.
(2015), who show that Breiman’s (2001) forests are consistent in an additive regression
framework.
The difficulty in properly analyzing random forests can be explained by the
black-box flavor of the method, which is indeed a subtle combination of different
components. Among the forests’ essential ingredients, both bagging (Breiman 1996)
and the Classification And Regression Trees (CART)-split criterion (Breiman et al.
1984) play critical roles. Bagging (a contraction of bootstrap-aggregating) is a general
aggregation scheme, which generates bootstrap samples from the original data set,
constructs a predictor from each sample, and decides by averaging. It is one of the
most effective computationally intensive procedures to improve on unstable estimates,
especially for large, high-dimensional data sets, where finding a good model in one
step is impossible because of the complexity and scale of the problem (Bühlmann and
Yu 2002; Kleiner et al. 2014; Wager et al. 2014). As for the CART-split criterion, it
originates from the influential CART program of Breiman et al. (1984), and is used
in the construction of the individual trees to choose the best cuts perpendicular to the
axes. At each node of each tree, the best cut is selected by optimizing the CART-split
criterion, based on the so-called Gini impurity (for classification) or the prediction
squared error (for regression).
However, while bagging and the CART-splitting scheme play key roles in the ran-
dom forest mechanism, both are difficult to analyze with rigorous mathematics, thereby
explaining why theoretical studies have so far considered simplified versions of the
original procedure. This is often done by simply ignoring the bagging step and/or
replacing the CART-split selection by a more elementary cut protocol. As well as this,
in Breiman’s (2001) forests, each leaf (that is, a terminal node) of individual trees
contains a small number of observations, typically between 1 and 5.
The goal of this survey is to embark the reader on a guided tour of random forests.
We focus on the theory behind the algorithm, trying to give an overview of major
theoretical approaches while discussing their inherent pros and cons. For a more
methodological review covering applied aspects of random forests, we refer to the
surveys by Criminisi et al. (2011) and Boulesteix et al. (2012). We start by gently
introducing the mathematical context in Sect. 2 and describe in full detail Breiman’s
(2001) original algorithm. Section 3 focuses on the theory for a simplified forest model
called purely random forests, and emphasizes the connections between forests, nearest
neighbor estimates and kernel methods. Section 4 provides some elements of theory
about resampling mechanisms, the splitting criterion and the mathematical forces at
work in Breiman’s approach. Section 5 is devoted to the theoretical aspects of con-
nected variable selection procedures. Section 6 discusses various extensions to random
forests including online learning, survival analysis and clustering problems. A short
discussion follows in Sect. 7.

123
G. Biau, E. Scornet

2 The random forest estimate

2.1 Basic principles

Let us start with a word of caution. The term “random forests” is a bit ambiguous. For
some authors, it is but a generic expression for aggregating random decision trees, no
matter how the trees are obtained. For others, it refers to Breiman’s (2001) original
algorithm. We essentially adopt the second point of view in the present survey.
As mentioned above, the forest mechanism is versatile enough to deal with both
supervised classification and regression tasks. However, to keep things simple, we
focus in this introduction on regression analysis, and only briefly survey the classifi-
cation case. Our objective in this section is to provide a concise but mathematically
precise presentation of the algorithm for building a random forest. The general
framework is nonparametric regression estimation, in which an input random vec-
tor X ∈ X ⊂ R p is observed, and the goal is to predict the square integrable random
response Y ∈ R by estimating the regression function m(x) = E[Y |X = x]. With this
aim in mind, we assume we are given a training sample Dn = ((X1 , Y1 ), . . . , (Xn , Yn ))
of independent random variables distributed as the independent prototype pair (X, Y ).
The goal is to use the data set Dn to construct an estimate m n : X → R of the function
m. In this respect, we say that the regression function estimate m n is (mean squared
error) consistent if E[m n (X) − m(X)]2 → 0 as n → ∞ (the expectation is evaluated
over X and the sample Dn ).
A random forest is a predictor consisting of a collection of M randomized regression
trees. For the jth tree in the family, the predicted value at the query point x is denoted
by m n (x; Θ j , Dn ), where Θ1 , . . . , Θ M are independent random variables, distributed
the same as a generic random variable Θ and independent of Dn . In practice, the
variable Θ is used to resample the training set prior to the growing of individual trees
and to select the successive directions for splitting—more precise definitions will be
given later. In mathematical terms, the jth tree estimate takes the form

1Xi ∈An (x;Θ j ,Dn ) Yi

m n (x; Θ j , Dn ) = ,
Nn (x; Θ j , Dn )
i∈Dn (Θ j )

where Dn (Θ j ) is the set of data points selected prior to the tree construction,
An (x; Θ j , Dn ) is the cell containing x, and Nn (x; Θ j , Dn ) is the number of (pres-
elected) points that fall into An (x; Θ j , Dn ).
At this stage, we note that the trees are combined to form the (finite) forest estimate

1
M
m M,n (x; Θ1 , . . . , Θ M , Dn ) = m n (x; Θ j , Dn ). (1)
M
j=1

In the R package randomForest, the default value of M (the number of trees in

the forest) is ntree = 500. Since M may be chosen arbitrarily large (limited only
by available computing resources), it makes sense, from a modeling point of view, to

123
A random forest guided tour

let M tend to infinity, and consider instead of (1) the (infinite) forest estimate

m ∞,n (x; Dn ) = EΘ [m n (x; Θ, Dn )] .

In this definition, EΘ denotes the expectation with respect to the random parameter
Θ, conditional on Dn . In fact, the operation “M → ∞” is justified by the law of large
numbers, which asserts that almost surely, conditional on Dn ,

lim m M,n (x; Θ1 , . . . , Θ M , Dn ) = m ∞,n (x; Dn )

M→∞

(see for instance Breiman 2001, and Scornet 2015a, for more information on this limit
calculation). In the following, to lighten notation we will simply write m ∞,n (x) instead
of m ∞,n (x; Dn ).

2.2 Algorithm

We now provide some insight into how the individual trees are constructed and how
randomness kicks in. In Breiman’s (2001) original forests, each node of a single tree
is associated with a hyperrectangular cell. The root of the tree is X itself and, at each
step of the tree construction, a node (or equivalently its corresponding cell) is split
into two parts. The terminal nodes (or leaves), taken together, form a partition of X .
The algorithm works by growing M different (randomized) trees as follows. Prior
to the construction of each tree, an observations are drawn at random with (or without)
replacement from the original data set. These—and only these—an observations (with
possible repetitions) are taken into account in the tree building. Then, at each cell of
each tree, a split is performed by maximizing the CART-criterion (see below) over
mtry directions chosen uniformly at random among the p original ones. (The resulting
subset of selected coordinates is called Mtry .) Lastly, construction of individual trees is
stopped when each cell contains less than nodesize points. For any query point x ∈
X , each regression tree predicts the average of the Yi (that were among the an points)
for which the corresponding Xi falls into the cell of x. We draw attention to the fact that
growing the tree and making the final estimation only depends on the an preselected
data points. Algorithm 1 describes in full detail how to compute a forest’s prediction.
Algorithm 1 may seem a bit complicated at first sight, but the underlying ideas are
simple. We start by noticing that this algorithm has three important parameters:
1. an ∈ {1, . . . , n}: the number of sampled data points in each tree;
2. mtry ∈ {1, . . . , p}: the number of possible directions for splitting at each node
of each tree;
3. nodesize ∈ {1, . . . , an }: the number of examples in each cell below which the
cell is not split.
By default, in the regression mode of the R package randomForest, the parameter
mtry is set to p/3 (· is the ceiling function), an is set to n, and nodesize is set
to 5. The role and influence of these three parameters on the accuracy of the method
will be thoroughly discussed in the next section.

123
G. Biau, E. Scornet

Algorithm 1: Breiman’s random forest predicted value at x.

Input: Training set Dn , number of trees M > 0, an ∈ {1, . . . , n}, mtry ∈ {1, . . . , p},
nodesize ∈ {1, . . . , an }, and x ∈ X .
Output: Prediction of the random forest at x.
1 for j = 1, . . . , M do
2 Select an points, with (or without) replacement, uniformly in Dn . In the following steps, only
these an observations are used.
3 Set P = (X ) the list containing the cell associated with the root of the tree.
4 Set Pfinal = ∅ an empty list.
5 while P = ∅ do
6 Let A be the first element of P.
7 if A contains less than nodesize points or if all Xi ∈ A are equal then
8 Remove the cell A from the list P.
9 Pfinal ← Concatenate(Pfinal , A).
10 else
11 Select uniformly, without replacement, a subset Mtry ⊂ {1, . . . , p} of cardinality mtry.
12 Select the best split in A by optimizing the CART-split criterion along the coordinates in
Mtry (see text for details).
13 Cut the cell A according to the best split. Call A L and A R the two resulting cells.
14 Remove the cell A from the list P.
15 P ← Concatenate(P, A L , A R ).
16 end
17 end
18 Compute the predicted value m n (x; Θ j , Dn ) at x equal to the average of the Yi falling in the cell
of x in partition Pfinal .
19 end
20 Compute the random forest estimate m M,n (x; Θ1 , . . . , Θ M , Dn ) at the query point x according to
(1).

We still have to describe how the CART-split criterion operates. As for now, we
consider for the ease of understanding a tree with no subsampling, which uses the
entire and original data set Dn for its construction. In addition, we let A be a generic
cell and denote by Nn (A) the number of data points falling in A. A cut in A is a pair
( j, z), where j is some value (dimension) from {1, . . . , p} and z the position of the cut
along the jth coordinate, within the limits of A. Let C A be the set of all such possible
(1) ( p)
cuts in A. Then, with the notation Xi = (Xi , . . . , Xi ), for any ( j, z) ∈ C A , the
CART-split criterion takes the form

1
n
L reg,n ( j, z) = (Yi − Ȳ A )2 1Xi ∈A
Nn (A)
i=1

1
n
− (Yi − Ȳ A L 1X( j) <z − Ȳ A R 1X( j) ≥z )2 1Xi ∈A , (2)
Nn (A) i i
i=1

where A L = {x ∈ A : x( j) < z}, A R = {x ∈ A : x( j) ≥ z}, and Ȳ A (resp., Ȳ A L , Ȳ A R )

is the average of the Yi such that Xi belongs to A (resp., A L , A R ), with the convention
that the average is equal to 0 when no point Xi belongs to A (resp., A L , A R ). For each
cell A, the best cut ( jn , z n ) is selected by maximizing L reg,n ( j, z) over Mtry and C A ;
that is,

123
A random forest guided tour

( jn , z n ) ∈ arg max L reg,n ( j, z).

j∈Mtry
( j,z)∈C A

(To remove some of the ties in the argmax, the best cut is always performed in the
middle of two consecutive data points.) Let us finally notice that the above optimiza-
tion program extends effortlessly to the resampling case, by optimizing over the an
preselected observations instead of the original data set Dn .
Thus, at each cell of each tree, the algorithm chooses uniformly at random mtry
coordinates in {1, . . . , p}, evaluates criterion (2) over all possible cuts along the direc-
tions in Mtry , and returns the best one. The quality measure (2) is the criterion used
in the CART algorithm of Breiman et al. (1984). This criterion computes the (renor-
malized) difference between the empirical variance in the node before and after a
cut is performed. There are three essential differences between CART and a tree of
Breiman’s (2001) forest. First of all, in Breiman’s forests, the criterion (2) is evaluated
over a subset Mtry of randomly selected coordinates, and not over the whole range
{1, . . . , p}. Besides, the individual trees are not pruned, and the final cells do not
contain more than nodesize observations (unless all data points in the cell have the
same Xi ). At last, each tree is constructed on a subset of an examples picked within the
initial sample, not on the whole sample Dn ; and only these an observations are used to
calculate the estimation. When an = n (and the resampling is done with replacement),
the algorithm runs in bootstrap mode, whereas an < n corresponds to subsampling
(with or without replacement).

2.3 Supervised classification

For simplicity, we only consider here the binary classification problem, keeping in
mind that random forests are intrinsically capable of dealing with multi-class problems
(see, e.g., Díaz-Uriarte and de Andrés 2006). In this setting (Devroye et al. 1996), the
random response Y takes values in {0, 1} and, given X, one has to guess the value of
Y . A classifier, or classification rule, m n is a Borel measurable function of X and Dn
that attempts to estimate the label Y from X and Dn . In this framework, one says that
the classifier m n is consistent if its probability of error

L(m n ) = P[m n (X) = Y ] → L ,

n→∞

where L is the error of the optimal—but unknown—Bayes classifier:

1 if P[Y = 1|X = x] > P[Y = 0|X = x]
m (x) =
0 otherwise.

In the classification context, the random forest classifier is obtained via a majority vote
among the classification trees, that is,
M
m M,n (x; Θ1 , . . . , Θ M , Dn ) =
1 if M1 j=1 m n (x; Θ j , Dn ) > 1/2
0 otherwise.

123
G. Biau, E. Scornet

If a leaf represents region A, then a randomized tree classifier takes the simple form
⎧
⎨ 1 if 1Xi ∈A,Yi =1 > 1Xi ∈A,Yi =0 , x ∈ A
m n (x; Θ j , Dn ) = i∈Dn (Θ j ) i∈Dn (Θ j )
⎩ 0 otherwise,

where Dn (Θ j ) contains the data points selected in the resampling step, that is, in each
leaf, a majority vote is taken over all (Xi , Yi ) for which Xi is in the same region. Ties
are broken, by convention, in favor of class 0. Algorithm 1 can be easily adapted to do
two-class classification without modifying the CART-split criterion. To see this, take
Y ∈ {0, 1} and consider a single tree with no subsampling step. For any generic cell
A, let p0,n (A) (resp., p1,n (A)) be the empirical probability, given a data point in a cell
A, that it has label 0 (resp., label 1). By noticing that Ȳ A = p1,n (A) = 1 − p0,n (A),
the classification CART-split criterion reads, for any ( j, z) ∈ C A ,
Nn (A L )
L class,n ( j, z) = p0,n (A) p1,n (A) − × p0,n (A L ) p1,n (A L )
Nn (A)
Nn (A R )
− × p0,n (A R ) p1,n (A R ).
Nn (A)

This criterion is based on the so-called Gini impurity measure 2 p0,n (A) p1,n (A)
(Breiman et al. 1984), which has the following simple interpretation. To classify a
data point that falls in cell A, one uses the rule that assigns a point, uniformly selected
from {Xi ∈ A : (Xi , Yi ) ∈ Dn }, to label with probability p,n (A), for j ∈ {0, 1}.
The estimated probability that the item has actually label is p,n (A). Therefore, the
estimated error under this rule is the Gini index 2 p0,n (A) p1,n (A). Note, however, that
the prediction strategy is different in classification and regression: in the classification
regime, each tree uses a local majority vote, whereas in regression the prediction is
achieved by a local averaging.
When dealing with classification problems, it is usually recommended to set
√
nodesize = 1 and mtry = p (see, e.g., Liaw and Wiener 2002).
We draw attention to the fact that regression estimation may also have an interest
in the context of dichotomous and multicategory outcome variables (in this case, it is
often termed probability estimation). For example, estimating outcome probabilities
for individuals is important in many areas of medicine, with applications to surgery,
oncology, internal medicine, pathology, pediatrics, and human genetics. We refer the
interested reader to Malley et al. (2012) and to the survey papers by Kruppa et al.
(2014a, b).

2.4 Parameter tuning

Literature focusing on tuning the parameters M, mtry, nodesize and an is unfortu-

nately rare, with the notable exception of Díaz-Uriarte and de Andrés (2006), Bernard
et al. (2008), and Genuer et al. (2010). According to Schwarz et al. (2010), tuning
the forest parameters may result in a computational burden, in particular for big data
sets, with hundreds and thousands of samples and variables. To circumvent this issue,

123
A random forest guided tour

Schwarz et al. (2010) implement a fast version of the original algorithm, which they
name Random Jungle.
It is easy to see that the forest’s variance decreases as M grows. Thus, more accurate
predictions are likely to be obtained by choosing a large number of trees. Interestingly,
picking a large M does not lead to overfitting. In effect, following an argument of
Breiman (2001), we have

lim E[m M,n (X; Θ1 , . . . , Θ M ) − m(X)]2 = E[m ∞,n (X) − m(X)]2 .

M→∞

However, the computational cost for inducing a forest increases linearly with M, so
a good choice results from a trade-off between computational complexity (M should
not be too large for the computations to finish in a reasonable time) and accuracy (M
must be large enough for predictions to be stable). In this respect, Díaz-Uriarte and de
Andrés (2006) argue that the value of M is irrelevant (provided that M is large enough)
in a prediction problem involving microarray data sets, where the aim is to classify
patients according to their genetic profiles (typically, less than one hundred patients
for several thousand genes). For more details we refer the reader to Genuer et al.
(2010), who offer a thorough discussion on the choice of this parameter in various
regression problems. Another interesting and related approach is by Latinne et al.
(2001), who propose a simple procedure that determines a priori a minimum number
of tree estimates to combine to obtain a prediction accuracy level similar to that of a
larger forest. Their experimental results show that it is possible to significantly limit
the number of trees.
In the R package randomForest, the default value of the parameter nodesize
is 1 for classification and 5 for regression. These values are often reported to be
good choices (e.g., Díaz-Uriarte and de Andrés 2006), despite the fact that this is not
supported by solid theory. A simple algorithm to tune the parameter nodesize in
the classification setting is discussed in Kruppa et al. (2013).
The effect of mtry is thoroughly investigated in Díaz-Uriarte and de Andrés (2006),
who show that this parameter has a little impact on the performance of the method,
though larger values may be associated with a reduction in the predictive performance.
On the other hand, Genuer et al. (2010) claim that the default value of mtry is either
optimal or too small. Therefore, a conservative approach is to take mtry as large as
possible (limited by available computing resources) and set mtry = p (recall that
p is the dimension of the Xi ). A data-driven choice of mtry is implemented in the
algorithm Forest-RK of Bernard et al. (2008).
Let us finally notice that even if there is no theoretical guarantee to support the
default values of the parameters, they are nevertheless easy to tune without requiring
an independent validation set. Indeed, the procedure accuracy is estimated internally,
during the run, as follows. Since each tree is constructed using a different bootstrap
sample from the original data, about one-third of the observations are left out of the
bootstrap sample and not used in the construction of the jth tree. In this way, for each
tree, a test set—disjoint from the training set—is obtained, and averaging over all these
left-out data points and over all trees is known as the out-of-bag error estimate. Thus,
the out-of-bag error, computed on the observations set aside by the resampling prior

123
G. Biau, E. Scornet

to the tree building, offers a simple way to adjust the parameters without the need of
a validation set. (e.g., Kruppa et al. 2013).

3 Simplified models and local averaging estimates

3.1 Simplified models

Despite their widespread use, a gap remains between the theoretical understanding
of random forests and their practical performance. This algorithm, which relies on
complex data-dependent mechanisms, is difficult to analyze and its basic mathematical
properties are still not well understood.
As observed by Denil et al. (2014), this state of affairs has led to polarization
between theoretical and empirical contributions to the literature. Empirically focused
papers describe elaborate extensions to the basic random forest framework but come
with no clear guarantees. In contrast, most theoretical papers focus on simplifications
or stylized versions of the standard algorithm, where the mathematical analysis is more
tractable.
A basic framework to assess the theoretical properties of forests involves models
in which partitions do not depend on the training set Dn . This family of simplified
models is often called purely random forests, for which X = [0, 1]d . A widespread
example is the centered forest, whose principle is as follows: (i) there is no resampling
step; (ii) at each node of each individual tree, a coordinate is uniformly chosen in
{1, . . . , p}; and (iii) a split is performed at the center of the cell along the selected
coordinate. The operations (ii)–(iii) are recursively repeated k times, where k ∈ N is a
parameter of the algorithm. The procedure stops when a full binary tree with k levels
is reached, so that each tree ends up with exactly 2k leaves. The final estimation at the
query point x is achieved by averaging the Yi corresponding to the Xi in the cell of x.
The parameter k acts as a smoothing parameter that controls the size of the terminal
cells (see Fig. 1 for an example in two dimensions). It should be chosen large enough
to detect local changes in the distribution, but not too much to guarantee an effective
averaging process in the leaves. In uniform random forests, a variant of centered forests,
cuts are performed uniformly at random over the range of the selected coordinate, not
at the center. Modulo some minor modifications, their analysis is similar.
The centered forest rule was first formally analyzed by Breiman (2004), and then
later by Biau et al. (2008) and Scornet (2015a), who proved that the method is consistent
(both for classification and regression) provided k → ∞ and n/2k → ∞. The proof
relies on a general consistency result for random trees stated in Devroye et al. (1996,
Chapter 6). If X is uniformly distributed in X = [0, 1] p , then there are on average about
n/2k data points per terminal node. In particular, the choice k ≈ log n corresponds
to obtaining a small number of examples in the leaves, in accordance with Breiman’s
(2001) idea that the individual trees should not be pruned. Unfortunately, this choice
of k does not satisfy the condition n/2k → ∞, so something is lost in the analysis.
Moreover, the bagging step is absent, and forest consistency is obtained as a by-
product of tree consistency. Overall, this model does not demonstrate the benefit of

123
A random forest guided tour

Fig. 1 A centered tree at level 2

using forests in place of individual trees and is too simple to explain the mathematical
forces driving Breiman’s forests.
The rates of convergence of centered forests are discussed in Breiman (2004) and
Biau (2012). In their approach, the covariates X ( j) are independent and the target
regression function m(x) = E[Y |X = x], which is originally a function of x =
(x (1) , . . . , x ( p) ), is assumed to depend only on a nonempty subset S (for Strong) of
the p features. Thus, letting XS = (X ( j) : j ∈ S), we have

m(x) = E[Y |XS = xS ].

The variables of the remaining set {1, . . . , p}\S have no influence on the function
m and can be safely removed. The ambient dimension p can be large, much larger than
the sample size n, but we believe that the representation is sparse, i.e., that a potentially
small number of arguments of m are active—the ones with indices matching the set
S. Letting |S| be the cardinality of S, the value |S| characterizes the sparsity of the
model: the smaller |S|, the sparser m. In this dimension-reduction scenario, Breiman
(2004) and Biau (2012) proved that if the probability p j,n of splitting along the jth
direction tends to 1/S and m satisfies a Lipschitz-type smoothness condition, then

2 −0.75
E m ∞,n (X) − m(X) = O n |S| log 2+0.75 .

This equality shows that the rate of convergence of m ∞,n to m depends only on
the number |S| of strong variables, not on the dimension p. This rate is strictly faster
than the usual rate n −2/( p+2) as soon as |S| ≤ 0.54 p (· is the floor function).
In effect, the intrinsic dimension of the regression problem is |S|, not p, and we

123
G. Biau, E. Scornet

see that the random forest estimate adapts itself to the sparse framework. Of course,
this is achieved by assuming that the procedure succeeds in selecting the informative
variables for splitting, which is indeed a strong assumption.
An alternative model for pure forests, called Purely Uniform Random Forests
(PURF) is discussed in Genuer (2012). For p = 1, a PURF is obtained by drawing k
random variables uniformly on [0, 1], and subsequently dividing [0, 1] into random
sub-intervals. (Note that as such, the PURF can only be defined for p = 1.) Although
this construction is not exactly recursive, it is equivalent to growing a decision tree
by deciding at each level which node to split with a probability equal to its length.
Genuer (2012) proves that PURF are consistent and, under a Lipschitz assumption,
that the estimate satisfies

E[m ∞,n (X) − m(X)]2 = O n −2/3 .

This rate is minimax over the class of Lipschitz functions (Stone 1980, 1982).
It is often acknowledged that random forests reduce the estimation error of a single
tree, while maintaining the same approximation error. In this respect, Biau (2012)
argues that the estimation error of centered forests tends to zero (at the slow rate
1/ log n) even if each tree is fully grown (i.e., k ≈ log n). This result is a consequence
of the tree-averaging process, since the estimation error of an individual fully grown
tree does not tend to zero. Unfortunately, the choice k ≈ log n is too large to ensure
consistency of the corresponding forest, whose approximation error remains constant.
Similarly, Genuer (2012) shows that the estimation error of PURF is reduced by a
factor of 0.75 compared to the estimation error of individual trees. The most recent
attempt to assess the gain of forests in terms of estimation and approximation errors
is by Arlot and Genuer (2014), who claim that the rate of the approximation error of
certain models is faster than that of the individual trees.

3.2 Forests, neighbors and kernels

Let us consider a sequence of independent and identically distributed random variables

X1 , . . . , Xn . In random geometry, an observation Xi is said to be a layered nearest
neighbor (LNN) of a point x (from X1 , . . . , Xn ) if the hyperrectangle defined by x
and Xi contains no other data points (Barndorff-Nielsen and Sobel 1966; Bai et al.
2005; see also Devroye et al. 1996, Chapter 11, Problem 6). As illustrated in Fig. 2,
the number of LNN of x is typically larger than one and depends on the number and
configuration of the sample points.
Surprisingly, the LNN concept is intimately connected to random forests that ignore
the resampling step. Indeed, if exactly one point is left in the leaves and if there is
no resampling, then no matter what splitting strategy is used, the forest estimate at
x is a weighted average of the Yi whose corresponding Xi are LNN of x. In other
words,
n
m ∞,n (x) = Wni (x)Yi , (3)
i=1

123
A random forest guided tour

Fig. 2 The layered nearest neighbors (LNN) of a point x in dimension p = 2

where the weights (Wn1 , . . . , Wnn ) are nonnegative functions of the sample Dn that
n
satisfy Wni (x) = 0 if Xi is not an LNN of x and i=1 Wni = 1. This important
connection was first pointed out by Lin and Jeon (2006), who proved that if (X) has a
density on ([0, 1] p ) then, providing tree growing is independent of Y1 , . . . , Yn (such
simplified models are sometimes called non-adaptive), we have

2 1
E m ∞,n (X) − m(X) ≥C ,
n max (log n) p−1

where C > 0 is a constant and (n max ) is the maximal number of points in the terminal
cells. Unfortunately, the exact values of the weight vector (Wn1 , . . . , Wnn ) attached
to the original random forest algorithm are unknown, and a general theory of forests
in the LNN framework is still undeveloped.
It remains, however, that Eq. (3) opens the way to the analysis of random forests
via a local averaging approach, i.e., via the average of those Yi for which Xi is “close”
to x (Györfi et al. 2002). Indeed, observe starting from (1) that for a finite forest with
M trees and without resampling, we have

n
1 Yi 1Xi ∈An (x;Θ j )
M
m M,n (x; Θ1 , . . . , Θ M ) = ,
M Nn (x; Θ j )
j=1 i=1

123
G. Biau, E. Scornet

n
where An (x; Θ j ) is the cell containing x and Nn (x; Θ j ) = i=1 1Xi ∈An (x;Θ j ) is the
number of data points falling in An (x; Θ j ). Thus,

n
m M,n (x; Θ1 , . . . , Θ M ) = Wni (x)Yi ,
i=1

where the weights Wni (x) are defined by

1 1Xi ∈An (x;Θ j )

M
Wni (x) = .
M Nn (x; Θ j )
j=1

It is easy to see that the Wni are nonnegative and sum to one if the cell containing
x is not empty. Thus, the contribution of observations falling into cells with a high
density of data points is smaller than the contribution of observations belonging to less-
populated cells. This remark is especially true when the forests are built independently
of the data set—for example, PURF—since, in this case, the number of examples in
each cell is not controlled. Next, if we let M tend to infinity, then the estimate m ∞,n
may be written (up to some negligible terms)
n
Yi K n (Xi , x)
m ∞,n (x) ≈ i=1
n , (4)
j=1 K n (X j , x)

where

K n (x, z) = PΘ [z ∈ An (x, Θ)] .

The function K n (·, ·) is called the kernel and characterizes the shape of the “cells” of
the infinite random forest. The quantity K n (x, z) is nothing but the probability that
x and z are connected (i.e., they fall in the same cell) in a random tree. Therefore,
the kernel K n can be seen as a proximity measure between two points in the forest.
Hence, any forest has its own metric K n , but unfortunately the one associated with
the CART-splitting strategy is strongly data dependent and, therefore, complicated to
work with.
It should be noted that K n does not necessarily belong to the family of Nadaraya–
Watson-type kernels (Nadaraya 1964; Watson 1964), which satisfy a translation-
invariant homogeneous property of the form K h (x, z) = h1 K ((x − z)/ h) for some
smoothing parameter h > 0. The analysis of estimates of the form (4) is, in general,
more complicated, depending of the type of forest under investigation. For example,
Scornet (2015b) proved that for a centered forest defined on [0, 1] p with parameter k,
we have

k! 1 k
p
K n,k (x, z) = 12k j x kj .
k1 ! . . . k p ! p j =2 zj
k ,...,k p j=1
p1
j=1 k j =k

123
A random forest guided tour

Fig. 3 Representations of f 1 , f 2 and f 5 in [0, 1]2

As an illustration, Fig. 3 shows the graphical representation for k = 1, 2 and 5 of the

function f k defined by

f k : [0, 1] × [0, 1] → [0, 1]

z = (z 1 , z 2 ) → K n,k ( 21 , 21 ), z .

The connection between forests and kernel estimates is mentioned in Breiman

(2000a) and developed in detail in Geurts et al. (2006). The most recent advances
in this direction are by Arlot and Genuer (2014), who show that a simplified forest
model can be written as a kernel estimate, and provide its rates of convergence. On the
practical side, Davies and Ghahramani (2014) plug a specific (random forest-based)
kernel—seen as a prior distribution over the piecewise constant functions—into a
standard Gaussian process algorithm, and empirically demonstrate that it outperforms
the same algorithm ran with linear and radial basis kernels. Besides, forest-based
kernels can be used as the input for a large variety of existing kernel-type methods
such as Kernel Principal Component Analysis and Support Vector Machines.

4 Theory for Breiman’s forests

This section deals with Breiman’s (2001) original algorithm. Since the construction
of Breiman’s forests depends on the whole sample Dn , a mathematical analysis of the

123
G. Biau, E. Scornet

entire algorithm is difficult. To move forward, the individual mechanisms at work in

the procedure have been investigated separately, namely the resampling step and the
splitting scheme.

4.1 The resampling mechanism

The resampling step in Breiman’s (2001) original algorithm is performed by choosing

n times from n points with replacement to compute the individual tree estimates. This
procedure, which traces back to the work of Efron (1982) (see also Politis et al. 1999), is
called the bootstrap in the statistical literature. The idea of generating many bootstrap
samples and averaging predictors is called bagging (bootstrap-aggregating). It was
suggested by Breiman (1996) as a simple way to improve the performance of weak or
unstable learners. Although one of the great advantages of the bootstrap is its simplicity,
the theory turns out to be complex. In effect, the distribution of the bootstrap sample
Dn is different from that of the original one Dn , as the following example shows.
Assume that X has a density, and note that whenever the data points are sampled
with replacement then, with positive probability, at least one observation from the
original sample is selected more than once. Therefore, with positive probability, there
exist two identical data points in Dn , and the distribution of Dn cannot be absolutely
continuous.
The role of the bootstrap in random forests is still poorly understood and, to date,
most analyses are doomed to replace the bootstrap by a subsampling scheme, assuming
that each tree is grown with an < n examples randomly chosen without replacement
from the initial sample (Mentch and Hooker 2015; Wager 2014; Scornet et al. 2015).
Most of the time, the subsampling rate an /n is assumed to tend to zero at some
prescribed rate—an assumption that excludes de facto the bootstrap regime. In this
respect, the analysis of so-called median random forests by Scornet (2015a) provides
some insight into the role and importance of subsampling.
A median forest resembles a centered forest. Once the splitting direction is chosen,
the cut is performed at the empirical median of the Xi in the cell. In addition, the
construction does not stop at level k but continues until each cell contains exactly one
observation. Since the number of cases left in the leaves does not grow with n, each
tree of a median forest is in general inconsistent (see Györfi et al. 2002, Problem 4.3).
However, Scornet (2015a) shows that if an /n → 0, then the median forest is consistent,
despite the fact that the individual trees are not. The assumption an /n → 0 guarantees
that every single observation pair (Xi , Yi ) is used in the jth tree’s construction with
a probability that becomes small as n grows. It also forces the query point x to be
disconnected from (Xi , Yi ) in a large proportion of trees. Indeed, if this were not
the case, then the predicted value at x would be overly influenced by the single pair
(Xi , Yi ), which would make the ensemble inconsistent. In fact, the estimation error of
the median forest estimate is small as soon as the maximum probability of connection
between the query point and all observations is small. Thus, the assumption an /n → 0
is but a convenient way to control these probabilities, by ensuring that partitions are
dissimilar enough.

123
A random forest guided tour

Biau and Devroye (2010) noticed that Breiman’s bagging principle has a simple
application in the context of nearest neighbor methods. Recall that the 1-nearest neigh-
bor (1-NN) regression estimate sets rn (x) = Y(1) (x), where Y(1) (x) corresponds to the
feature vector X(1) (x) whose Euclidean distance to x is minimal among all X1 , . . . , Xn .
(Ties are broken in favor of smallest indices.) It is clearly not, in general, a consistent
estimate (Devroye et al. 1996, Chapter 5). However, by subbagging, one may turn
the 1-NN estimate into a consistent one, provided that the size of subsamples is suffi-
ciently small. We proceed as follows, via a randomized basic regression estimate ran
in which 1 ≤ an < n is a parameter. The elementary predictor ran is the 1-NN rule
for a random subsample of size an drawn with (or without) replacement from Dn . We
apply subbagging, that is, we repeat the random subsampling an infinite number of
times and take the average of the individual outcomes. Thus, the subbagged regression
estimate rn is defined by

rn (x) = E ran (x) ,

where E denotes expectation with respect to the resampling distribution, conditional

on the data set Dn . Biau and Devroye (2010) proved that the estimate rn is universally
(i.e., without conditions on the distribution of (X, Y )) mean squared error consistent,
provided an → ∞ and an /n → 0. The proof relies on the observation that rn is in
fact a local averaging estimate (Stone 1977) with weights

Wni (x) = P[Xi is the 1-NN of x in a random selection of size an ].

The connection between bagging and nearest neighbor estimation is further explored
by Biau et al. (2010), who prove that the subbagged estimate rn achieves optimal rate
of convergence over Lipschitz smoothness classes, independently from the fact that
resampling is done with or without replacement.

4.2 Decision splits

The coordinate-split process of the random forest algorithm is not easy to grasp,
essentially because it uses both the Xi and Yi variables to make its decision. Building
upon the ideas of Bühlmann and Yu (2002), Banerjee and McKeague (2007) establish
a limit law for the split location in the context of a regression model of the form
Y = m(X) + ε, where X is real-valued and ε an independent Gaussian noise. In
essence, their result is as follows. Assume for now that the distribution of (X, Y ) is
known, and denote by d the (optimal) split that maximizes the theoretical CART-
criterion at a given node. In this framework, the regression estimate restricted to the
left (resp., right) child of the cell takes the form

β,n = E[Y |X ≤ d ]
resp., βr,n = E[Y |X > d ] .

123
G. Biau, E. Scornet

When the distribution of (X, Y ) is unknown, so are β , βr and d , and these quantities
are estimated by their natural empirical counterparts:

n
2
(β̂,n , β̂r,n , d̂n ) ∈ arg min Yi − β 1 X i ≤d − βr 1 X i >d .
β ,βr ,d i=1

Assuming that the model satisfies some regularity assumptions (in particular, X has a
density f , and both f and m are continuously differentiable), Banerjee and McKeague
(2007) prove that
⎛ ⎞ ⎛ ⎞
β̂,n − β c1
D
n 1/3 ⎝ β̂r,n − βr ⎠ → ⎝ c2 ⎠ arg max(aW (t) − bt 2 ), (5)
d̂n − d 1 t

where D denotes convergence in distribution, and W is a standard two-sided Brownian

motion process on the real line. Both a and b are positive constants that depend upon the
model parameters and the unknown quantities β , βr and d . The limiting distribution
in (5) allows one to construct confidence intervals for the position of CART-splits.
Interestingly, Banerjee and McKeague (2007) refer to the study of Qian et al. (2003)
on the effects of phosphorus pollution in the Everglades, which uses split points in a
novel way. There, the authors identify threshold levels of phosphorus concentration
that are associated with declines in the abundance of certain species. In their approach,
split points are not just a means to build trees and forests, but can also provide important
information on the structure of the underlying distribution.
A further analysis of the behavior of forest splits is performed by Ishwaran (2013),
who argues that the so-called End-Cut Preference (ECP) of the CART-splitting pro-
cedure (that is, the fact that splits along non-informative variables are likely to be near
the edges of the cell—see Breiman et al. 1984) can be seen as a desirable property.
Given the randomization mechanism at work in forests, there is indeed a positive
probability that none of the preselected variables at a node are informative. When this
happens, and if the cut is performed, say, at the center of a side of the cell (assuming
that X = [0, 1]d ), then the sample size of the two resulting cells is roughly reduced
by a factor of two—this is an undesirable property, which may be harmful for the
prediction task. Thus, Ishwaran (2013) stresses that the ECP property ensures that a
split along a noisy variable is performed near the edge, thus maximizing the tree node
sample size and making it possible for the tree to recover from the split downstream.
Ishwaran (2013) claims that this property can be of benefit even when considering a
split on an informative variable, if the corresponding region of space contains little
signal.
It is shown in Scornet et al. (2015) that random forests asymptotically perform, with
high probability, splits along the S informative variables (in the sense of Sect. 3.1).
Denote by jn,1 (X), . . . , jn,k (X) the first k cut directions used to construct the cell of
X, with the convention that jn,q (X) = ∞ if the cell has been cut strictly less than q
times. Assuming some regularity conditions on the regression model, and considering
a modification of Breiman’s forests in which all directions are preselected for splitting,

123
A random forest guided tour

Scornet et al. (2015) prove that, with probability 1 − ξ , for all n large enough and all
1 ≤ q ≤ k,

jn,q (X) ∈ {1, . . . , S}.

This result offers an interesting perspective on why random forests nicely adapt to the
sparsity setting. Indeed, it shows that the algorithm selects splits mostly along the S
informative variables, so that everything happens as if data were projected onto the
vector space spanned by these variables.
There exists a variety of random forest variants based on the CART-criterion. For
example, the Extra-Tree algorithm of Geurts et al. (2006) consists in randomly select-
ing a set of split points and then choosing the split that maximizes the CART-criterion.
This algorithm has similar accuracy performance while being more computationally
efficient. In the PERT (Perfect Ensemble Random Trees) approach of Cutler and Zhao
(2001), one builds perfect-fit classification trees with random split selection. While
individual trees clearly overfit, the authors claim that the whole procedure is eventually
consistent since all classifiers are believed to be almost uncorrelated. As a variant of
the original algorithm, Breiman (2001) considered splitting along linear combinations
of features (this procedure has been implemented by Truong 2009, in the package
obliquetree of the statistical computing environment R). As noticed by Menze
et al. (2011), the feature space separation by orthogonal hyperplanes in random forests
results in box-like decision surfaces, which may be advantageous for some data but
suboptimal for other, particularly for collinear data with correlated features.
With respect to the tree building process, selecting uniformly at each cell a set of
features for splitting is simple and convenient, but such procedures inevitably select
irrelevant variables. Therefore, several authors have proposed modified versions of the
algorithm that incorporate a data-driven weighing of variables. For example, Kyril-
lidis and Zouzias (2014) study the effectiveness of non-uniform randomized feature
selection in classification tree, and experimentally show that such an approach may
be more effective compared to naive uniform feature selection. Enriched Random
Forests, designed by Amaratunga et al. (2008) choose at each node the eligible sub-
sets by weighted random sampling with the weights tilted in favor of informative
features. Similarly, the Reinforcement Learning Trees (RLT) of Zhu et al. (2015) build
at each node a random forest to determine the variable that brings the greatest future
improvement in later splits, rather than choosing the one with largest marginal effect
from the immediate split. Splits in random forests are known to be biased toward
covariates with many possible splits (Breiman et al. 1984; Segal 1988) or with miss-
ing values (Kim and Loh 2001). Hothorn et al. (2006) propose a two-step procedure
to correct this situation by first selecting the splitting variable and then the position of
the cut along the chosen variable. The predictive performance of the resulting trees is
empirically shown to be as good as the performance of the exhaustive search proce-
dure. We also refer the reader to Ziegler and König (2014), who review the different
splitting strategies.
Choosing weights can also be done via regularization. Deng and Runger (2012)
propose a Regularized Random Forest (RRF), which penalizes selecting a new feature
for splitting when its gain is similar to the features used in previous splits. Deng and

123
G. Biau, E. Scornet

Runger (2013) suggest a Guided RRF (GRRF), in which the importance scores from an
ordinary random forest are used to guide the feature selection process in RRF. Lastly,
a Garrote-style convex penalty, proposed by Meinshausen (2009), selects functional
groups of nodes in trees, yielding to parsimonious estimates. We also mention the
work of Konukoglu and Ganz (2014) who address the problem of controlling the false-
positive rate of random forests and present a principled way to determine thresholds
for the selection of relevant features without any additional computational load.

4.3 Consistency, asymptotic normality, and more

All in all, little has been proven mathematically for the original procedure of Breiman.
A seminal result by Breiman (2001) shows that the error of the forest is small as soon
as the predictive power of each tree is good and the correlation between the tree errors
is low. More precisely, independently of the type of forest, one has

EX,Y [Y − m ∞,n (X)]2 ≤ ρ̄ EX,Y,Θ [Y − m n (X; Θ)]2 ,

where

EΘ,Θ [ρ(Θ, Θ )g(Θ)g(Θ )]

ρ̄ = ,
EΘ [g(Θ)]2

with Θ and Θ independent and identically distributed,

ρ(Θ, Θ ) = CorrX,Y Y − m n (X; Θ), Y − m n (X; Θ ) ,

and g(Θ) = EX,Y [Y − m n (X; Θ)]2 . Similarly, Friedman et al. (2009) decompose
the variance of the forest as a product of the correlation between trees and the variance
of a single tree. Thus, for all x,

Var[m ∞,n (x)] = ρ(x)σ (x),

where ρ(x) = Corr[m n (x; Θ), m n (x; Θ )] and σ (x) = Var[m n (x; Θ)].
A link between the error of the finite and infinite forests is established in Scornet
(2015a), who shows, provided some regularity assumptions are satisfied, that

0 ≤ E[m M,n (X; Θ1 , . . . , Θ M ) − m(X)]2 − E[m ∞,n (X) − m(X)]2

8
≤ × m2∞ + σ 2 (1 + 4 log n) .
M

This inequality provides an interesting solution for choosing the number of trees, by
making the error of the finite forest arbitrary close to that of the infinite one.
Consistency and asymptotic normality of the whole algorithm were recently proved,
replacing bootstrap by subsampling and simplifying the splitting step. So, Wager
(2014) shows the asymptotic normality of Breiman’s infinite forests, assuming that

123
A random forest guided tour

(i) cuts are spread along all the p directions and do not separate a small fraction of
the data set; and (ii) two different data set are used to, respectively, build the tree and
estimate the value within a leaf. He also establishes that the infinitesimal jackknife
(Efron 1979) consistently estimates the forest variance.
Mentch and Hooker (2015) prove a similar result for finite forests, which mainly
relies on the assumption that the prediction of the forests does not much vary when
the label of one point in the training set is slightly modified. These authors show√ that
whenever Mn (the number of trees) is allowed to vary with n, and when an = o( n)
and limn→∞ n/Mn = 0, then, for a fixed x,

√
n m M,n (x; Θ1 , . . . , Θ M ) − E[m M,n (x; Θ1 , . . . , Θ M )] D
→ N,
an2 ζ1,an

where N is a standard normal random variable,

ζ1,an = Cov m n (X1 , X2 , . . . , Xan ; Θ), m n (X1 , X2 , . . . , Xa n ; Θ ) ,

Xi an independent copy of Xi and Θ an independent copy of Θ. It is worth noting

that both Mentch and Hooker (2015) and Wager et al. (2014) provide corrections for
estimating the forest variance ζ1,an .
Scornet et al. (2015) proved a consistency result in the context of additive regression
models for the pruned version of Breiman’s forest. Unfortunately, the consistency of
the unpruned procedure comes at the price of a conjecture regarding the behavior of
the CART algorithm that is difficult to verify.
We close this section with a negative but interesting result due to Biau et al. (2008).
In this example, the total number k of cuts is fixed and mtry = 1. Furthermore,
each tree is built by minimizing the true probability of error at each node. Consider
the joint distribution of (X, Y ) sketched in Fig. 4 and let m(x) = E[Y |X = x]. The
variable X has a uniform distribution on [0, 1]2 ∪ [1, 2]2 ∪ [2, 3]2 and Y is a function
of X—that is, m(x) ∈ {0, 1} and L = 0—defined as follows. The lower left square
[0, 1] × [0, 1] is divided into countably infinitely many vertical strips in which the
strips with m(x) = 0 and m(x) = 1 alternate. The upper right square [2, 3] × [2, 3] is
divided similarly into horizontal strips. The middle rectangle [1, 2] × [1, 2] is a 2 × 2
checkerboard. It is easy to see that no matter what the sequence of random selection
of split directions is and no matter for how long each tree is grown, no tree will ever
cut the middle rectangle and, therefore, the probability of error of the corresponding
random forest classifier is at least 1/6. This example illustrates that consistency of
greedily grown random forests is a delicate issue. We note, however, that if Breiman’s
(2001) original algorithm is used in this example (that is, each cell contains exactly
one data point) then one obtains a consistent classification rule. We also note that
the regression function m is not Lipschitz—a smoothness assumption on which many
results on random forests rely.

123
G. Biau, E. Scornet

Fig. 4 An example of a distribution for which greedy random forests are inconsistent. The distribution of
X is uniform on the union of the three large squares. White areas represent the set where m(x) = 0 and grey
where m(x) = 1

5 Variable importance

5.1 Variable importance measures

Random forests can be used to rank the importance of variables in regression or clas-
sification problems via two measures of significance. The first, called Mean Decrease
Impurity (MDI; see Breiman 2003a), is based on the total decrease in node impurity
from splitting on the variable, averaged over all trees. The second, referred to as Mean
Decrease Accuracy (MDA), first defined by Breiman (2001), stems from the idea
that if the variable is not important, then rearranging its values should not degrade
prediction accuracy.
Set X = (X (1) , . . . , X ( p) ). For a forest resulting from the aggregation of M trees,
the MDI of the variable X ( j) is defined by

1
M

MDI(X ( j) ) =
pn,t L reg,n ( jn,t
, z n,t ),
M
=1 t∈T
=j
jn,t

where pn,t is the fraction of observations falling in the node t, {T }1≤≤M the collection
, z ) the split that maximizes the empirical criterion
of trees in the forest, and ( jn,t n,t
(2) in node t. Note that the same formula holds for classification random forests by
replacing the criterion L reg,n by its classification version L class,n . Thus, the MDI of

123
A random forest guided tour

X ( j) computes the weighted decrease of impurity corresponding to splits along the

variable X ( j) and averages this quantity over all trees.
The MDA relies on a different principle and uses the out-of-bag error estimate (see
Sect. 2.4). To measure the importance of the jth feature, we randomly permute the
values of variable X ( j) in the out-of-bag observations and put these examples down
the tree. The MDA of X ( j) is obtained by averaging the difference in out-of-bag error
estimation before and after the permutation over all trees. In mathematical terms,
consider a variable X ( j) and denote by D,n the out-of-bag data set of the th tree and
D,n the same data set where the values of X ( j) have been randomly permuted. Recall
j

that m n (·; Θ ) stands for the -th tree estimate. Then, by definition,

M
1
( j) ) =
MDA(X
j
Rn m n (·; Θ ), D,n − Rn m n (·; Θ ), D,n , (6)
M
=1

j
where Rn is defined for D = D,n or D = D,n by

1
Rn m n (·; Θ ), D = (Yi − m n (Xi ; Θ ))2 . (7)
|D|
i:(Xi ,Yi )∈D

( j) ) is
It is easy to see that the population version of MDA(X

MDA (X ( j) ) = E Y − m n (Xj ; Θ)
2 2
− E Y − m n (X; Θ) ,

where Xj = (X (1) , . . . , X ( j) , . . . , X ( p) ) and X ( j) is an independent copy of X ( j) .

For classification purposes, the MDA still satisfies (6) and (7) since Yi ∈ {0, 1}
(so, Rn (m n (·; Θ), D) is also the proportion of points that are correctly classified by
m n (·; Θ) in D).

5.2 Theoretical results

In the context of a pair of categorical variables (X, Y ), where X takes finitely many
values in, say, X1 × · · · × Xd , Louppe et al. (2013) consider an infinite ensemble
of totally randomized and fully developed trees. At each cell, the th tree is grown
by selecting a variable X ( j) uniformly among the features that have not been used
in the parent nodes, and by subsequently dividing the cell into |X j | children (so the
number of children equals the number of modalities of the selected variable). In this
framework, it can be shown that the population version of MDI(X ( j) ) computed with
the whole forest satisfies

p−1
1
( j)
MDI (X )= k I (X ( j) ; Y |B),
k=0 p ( p − k) B∈Pk (V − j )

123
G. Biau, E. Scornet

where V − j = {1, . . . , j − 1, j + 1, . . . , p}, Pk (V − j ) the set of subsets of V − j of

cardinality k, and I (X ( j) ; Y |B) the conditional mutual information of X ( j) and Y
given the variables in B. In addition,

p
MDI (X ( j) ) = I (X (1) , . . . , X ( p) ; Y ).
j=1

These results show that the information I (X (1) , . . . , X ( p) ; Y ) is the sum of the impor-
tances of each variable, which can itself be made explicit using the information values
I (X ( j) ; Y |B) between each variable X ( j) and the output Y , conditional on variable
subsets B of different sizes.
Louppe et al. (2013) define a variable X ( j) as irrelevant with respect to B ⊂ V =
X1 × · · · × X p whenever I (X ( j) ; Y |B) = 0. Thus, X ( j) is irrelevant with respect
to V if and only if MDI (X ( j) ) = 0. It is easy to see that if an additional irrelevant
variable X ( p+1) is added to the list of variables, then, for any j, the variable importance
MDI (X ( j) ) computed with a single tree does not change if the tree is built with the
new collection of variables V ∪ {X ( p+1) }. In other words, building a tree with an
additional irrelevant variable does not alter the importances of the other variables in
an infinite sample setting.
The most notable results regarding MDA are due to Ishwaran (2007), who studies
a slight modification of the criterion replacing permutation by feature noising. To add
noise to a variable X ( j) , one considers a new observation X, take X down the tree and
stop when a split is made according to the variable X ( j) . Then the right or left child
node is selected with probability 1/2, and this procedure is repeated for each subsequent
node (whether it is performed along the variable X ( j) or not). The variable importance
MDA (X ( j) ) is still computed by comparing the error of the forest with that of the
“noisy” forest. Assuming that the forest is consistent and that the regression function
is piecewise constant, Ishwaran (2007) gives the asymptotic behavior of MDA (X ( j) )
when the sample size tends to infinity. This behavior is intimately related to the set of
subtrees (of the initial regression tree) whose roots are split along the coordinate X ( j) .
Let us lastly mention the approach of Gregorutti et al. (2016), who compute the
MDA criterion for several distributions of (X, Y ). For example, consider a model of
the form

Y = m(X) + ε,

where (X, ε) is a Gaussian random vector, and assume that the correlation matrix C
satisfies C = [Cov(X ( j) , X (k) )]1≤ j,k≤ p = (1 − c)I p + c11 [the symbol denotes
transposition, 1 = (1, . . . , 1) , and c is a constant in (0, 1)]. Assume, in addition,
that Cov(X ( j) , Y ) = τ0 for all j ∈ {1, . . . , p}. Then, for all j,

2
τ0
MDA (X ( j) ) = 2 .
1 − c + pc

123
A random forest guided tour

Thus, in the Gaussian setting, the variable importance decreases as the inverse of the
square of p when the number of correlated variables p increases.

5.3 Related works

The empirical properties of the MDA criterion have been extensively explored and
compared in the statistical computing literature. Indeed, Archer and Kimes (2008),
Strobl et al. (2008), Nicodemus and Malley (2009), Auret and Aldrich (2011), and
Toloşi and Lengauer (2011) stress the negative effect of correlated variables on MDA
performance. In this respect, Genuer et al. (2010) noticed that MDA is less able to detect
the most relevant variables when the number of correlated features increases. Similarly,
the empirical study of Archer and Kimes (2008) points out that both MDA and MDI
behave poorly when correlation increases—these results have been experimentally
confirmed by Auret and Aldrich (2011) and Toloşi and Lengauer (2011). An argument
of Strobl et al. (2008) to justify the bias of MDA in the presence of correlated variables
is that the algorithm evaluates the marginal importance of the variables instead of taking
into account their effect conditional on each other. A way to circumvent this issue is to
combine random forests and the Recursive Feature Elimination algorithm of Guyon
et al. (2002), as in Gregorutti et al. (2016). Detecting relevant features can also be
achieved via hypothesis testing (Mentch and Hooker 2015)—a principle that may be
used to detect more complex structures of the regression function, like for instance its
additivity (Mentch and Hooke 2014).

6 Extensions

6.1 Weighted forests

In Breiman’s (2001) forests, the final prediction is the average of the individual tree
outcomes. A natural way to improve the method is to incorporate tree-level weights to
emphasize more accurate trees in prediction (Winham et al. 2013). A closely related
idea, proposed by Bernard et al. (2012), is to guide tree building—via resampling of
the training set and other ad hoc randomization procedures—so that each tree will
complement as much as possible the existing trees in the ensemble. The resulting
Dynamic Random Forest (DRF) shows significant improvement in terms of accuracy
on 20 real-based data sets compared to the standard, static, algorithm.

6.2 Online forests

In its original version, random forests is an offline algorithm, which is given the whole
data set from the beginning and required to output an answer. In contrast, online
algorithms do not require that the entire training set is accessible at once. These models
are appropriate for streaming settings, where training data are generated over time and
must be incorporated into the model as quickly as possible. Random forests have been
extended to the online framework in several ways (Saffari et al. 2009; Denil et al.

123
G. Biau, E. Scornet

2013; Lakshminarayanan et al. 2014). In Lakshminarayanan et al. (2014), so-called

Mondrian forests are grown in an online fashion and achieve competitive predictive
performance comparable with other online random forests while being faster. When
building online forests, a major difficulty is to decide when the amount of data is
sufficient to cut a cell. Exploring this idea, Yi et al. (2012) propose Information Forests,
whose construction consists in deferring classification until a measure of classification
confidence is sufficiently high, and in fact break down the data so as to maximize this
measure. An interesting theory related to these greedy trees can be found in Biau and
Devroye (2013).

6.3 Survival forests

Survival analysis attempts to deal with analysis of time duration until one or more
events happen. Most often, survival analysis is also concerned with incomplete data,
and particularly right-censored data, in fields such as clinical trials. In this context,
parametric approaches such as proportional hazards are commonly used, but fail to
model nonlinear effects. Random forests have been extended to the survival context
by Ishwaran et al. (2008), who prove consistency of Random Survival Forests (RSF)
algorithm assuming that all variables are categorical. Yang et al. (2010) showed that
by incorporating kernel functions into RSF, their algorithm KIRSF achieves better
results in many situations. Ishwaran et al. (2011) review the use of the minimal depth,
which measures the predictive quality of variables in survival trees.

6.4 Ranking forests

Clémençon et al. (2013) have extended random forests to deal with ranking problems
and propose an algorithm called Ranking Forests based on the ranking trees of Clé-
mençon and Vayatis (2009). Their approach relies on nonparametric scoring and ROC
curve optimization in the sense of the AUC criterion.

6.5 Clustering forests

Yan et al. (2013) present a new clustering ensemble method called Cluster Forests (CF)
in the context of unsupervised classification. CF randomly probes a high-dimensional
data cloud to obtain good local clusterings, then aggregates via spectral clustering to
obtain cluster assignments for the whole data set. The search for good local clusterings
is guided by a cluster quality measure, and CF progressively improves each local
clustering in a fashion that resembles tree growth in random forests.

6.6 Quantile forests

Meinshausen (2006) shows that random forests provide information about the full
conditional distribution of the response variable, and thus can be used for quantile
estimation.

123
A random forest guided tour

6.7 Missing data

One of the strengths of random forests is that they can handle missing data. The
procedure, explained in Breiman (2003b), takes advantage of the so-called proximity
matrix, which measures the proximity between pairs of observations in the forest,
to estimate missing values. This measure is the empirical counterpart of the kernels
defined in Sect. 3.2. Data imputation based on random forests has further been explored
by Rieger et al. (2010), Crookston and Finley (2008), and extended to unsupervised
classification by Ishioka (2013).

6.8 Single class data

One-class classification is a binary classification task for which only one class of sam-
ples is available for learning. Désir et al. (2013) study the One Class Random Forests
algorithm, which is designed to solve this particular problem. Geremia et al. (2013)
have introduced a supervised learning algorithm called Spatially Adaptive Random
Forests to deal with semantic image segmentation applied to medical imaging pro-
tocols. Lastly, in the context of multi-label classification, Joly et al. (2014) adapt the
idea of random projections applied to the output space to enhance tree-based ensem-
ble methods by improving accuracy while significantly reducing the computational
burden.

6.9 Unbalanced data set

Random forests can naturally be adapted to fit the unbalanced data framework by
down-sampling the majority class and growing each tree on a more balanced data
set (Chen et al. 2004; Kuhn and Johnson 2013). An interesting application in which
unbalanced data sets are involved is by Fink et al. (2010), who explore the continent-
wide inter-annual migrations of common North American birds. They use random
forests for which each tree is trained and allowed to predict on a particular (random)
region in space and time.

7 Conclusion and perspectives

The authors trust that this review paper has provided an overview of some of the recent
literature on random forests and offered insights into how new and emerging fields
are impacting the method. As statistical applications become increasingly sophisti-
cated, massive and complex data sets require today the development of algorithms
that ensure global competitiveness, achieving both computational efficiency and safe
with high-dimension models and huge number of samples. It is our belief that forests
and their basic principles (“divide and conquer”, resampling, aggregation, random
search of the feature space) offer simple but fundamental ideas that may leverage new
state-of-the-art algorithms.

123
G. Biau, E. Scornet

It remains, however, that the present results are insufficient to explain in full gen-
erality the remarkable behavior of random forests. The authors’ intuition is that tree
aggregation models are able to estimate patterns that are more complex than classical
ones—patterns that cannot be simply characterized by standard sparsity or smoothness
conditions. These patterns, which are beyond the reach of classical methods, are still
to be discovered, quantified, and mathematically described.
It is sometimes alluded to that random forests have the flavor of deep network
architectures (e.g., Bengio 2009), insofar as ensemble of trees allow to discriminate
between a very large number of regions. Indeed, the identity of the leaf node with
which a data point is associated for each tree forms a tuple that can represent a consid-
erable quantity of possible patterns, because the total intersections of the leaf regions
can be exponential in the number of trees. This point of view, could be one of the rea-
sons for the success of forests on large-scale data. As a matter of fact, the connection
between random forests and neural networks is largely unexamined (Welbl 2014).
Another critical issue is how to choose tuning parameters that are optimal in a certain
sense, especially the size an of the preliminary resampling. By default, the algorithm
runs in bootstrap mode (i.e., an = n points selected with replacement) and although
this seems to give excellent results, there is to date no theory to support this choice.
Furthermore, although random forests are fully grown in most applications, the impact
of tree depth on the statistical performance of the algorithm is still an open question.

Acknowledgments We thank the Editors and three anonymous referees for valuable comments and
insightful suggestions.

References
Amaratunga D, Cabrera J, Lee Y-S (2008) Enriched random forests. Bioinformatics 24:2010–2014
Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput
9:1545–1588
Archer KJ, Kimes RV (2008) Empirical characterization of random forest variable importance measures.
Comput Stat Data Anal 52:2249–2260
Arlot S, Genuer R (2014) Analysis of purely random forests bias. arXiv:1407.3939
Auret L, Aldrich C (2011) Empirical comparison of tree ensemble variable importance measures. Chemom
Intell Lab Syst 105:157–170
Bai Z-H, Devroye L, Hwang H-K, Tsai T-H (2005) Maxima in hypercubes. Random Struct Algorithms
27:290–309
Banerjee M, McKeague IW (2007) Confidence sets for split points in decision trees. Ann Stat 35:543–574
Barndorff-Nielsen O, Sobel M (1966) On the distribution of the number of admissible points in a vector
random sample. Theory Probab Appl 11:249–269
Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2:1–127
Bernard S, Heutte L, Adam S (2008) Forest-RK: a new random forest induction method. In: Huang D-S,
Wunsch DC II, Levine DS, Jo K-H (eds) Advanced intelligent computing theories and applications.
With aspects of artificial intelligence. Springer, Berlin, pp 430–437
Bernard S, Adam S, Heutte L (2012) Dynamic random forests. Pattern Recognit Lett 33:1580–1586
Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13:1063–1095
Biau G, Devroye L (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate
and the random forest method in regression and classification. J Multivar Anal 101:2499–2518
Biau G, Devroye L (2013) Cellular tree classifiers. Electron J Stat 7:1875–1912
Biau G, Devroye L, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach
Learn Res 9:2015–2033

123
A random forest guided tour

Biau G, Cérou F, Guyader A (2010) On the rate of convergence of the bagged nearest neighbor estimate. J
Mach Learn Res 11:687–712
Boulesteix A-L, Janitza S, Kruppa J, König IR (2012) Overview of random forest methodology and practical
guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip Rev Data
Mining Knowl Discov 2:493–507
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2000a) Some infinity theory for predictor ensembles. Technical Report 577, University of
California, Berkeley
Breiman L (2000b) Randomizing outputs to increase prediction accuracy. Mach Learn 40:229–242
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L (2003a) Setting up, using, and understanding random forests V3.1. https://www.stat.berkeley.
edu/~breiman/Using_random_forests_V3.1.pdf
Breiman L (2003b) Setting up, using, and understanding random forests V4.0. https://www.stat.berkeley.
edu/~breiman/Using_random_forests_v4.0.pdf
Breiman L (2004) Consistency for a simple model of random forests. Technical Report 670, University of
California, Berkeley
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman &
Hall/CRC, Boca Raton
Bühlmann P, Yu B (2002) Analyzing bagging. Ann Stat 30:927–961
Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data. Technical Report 666,
University of California, Berkeley
Clémençon S, Depecker M, Vayatis N (2013) Ranking forests. J Mach Learn Res 14:39–73
Clémençon S, Vayatis N (2009) Tree-based ranking methods. IEEE Trans Inform Theory 55:4316–4336
Criminisi A, Shotton J, Konukoglu E (2011) Decision forests: a unified framework for classification, regres-
sion, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph
Vis 7:81–227
Crookston NL, Finley AO (2008) yaImpute: an R package for kNN imputation. J Stat Softw 23:1–16
Cutler A, Zhao G (2001) PERT—perfect random tree ensembles. Comput Sci Stat 33:490–497
Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for
classification in ecology. Ecology 88:2783–2792
Davies A, Ghahramani Z (2014) The Random Forest Kernel and creating other kernels for big data from
random partitions. arXiv:1402.4293
Deng H, Runger G (2012) Feature selection via regularized trees. In: The 2012 international joint conference
on neural networks, pp 1–8
Deng H, Runger G (2013) Gene selection with guided regularized random forest. Pattern Recognit 46:3483–
3489
Denil M, Matheson D, de Freitas N (2013) Consistency of online random forests. In: International conference
on machine learning (ICML)
Denil M, Matheson D, de Freitas N (2014) Narrowing the gap: random forests in theory and in practice. In:
International conference on machine learning (ICML)
Désir C, Bernard S, Petitjean C, Heutte L (2013) One class random forests. Pattern Recognit 46:3490–3506
Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using
random forest. BMC Bioinform 7:1–13
Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) Multiple classifier
systems. Springer, Berlin, pp 1–15
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26
Efron B (1982) The jackknife, the bootstrap and other resampling plans, vol 38. CBMS-NSF Regional
Conference Series in Applied Mathematics, Philadelphia
Fink D, Hochachka WM, Zuckerberg B, Winkler DW, Shaby B, Munson MA, Hooker G, Riedewald M,
Sheldon D, Kelling S (2010) Spatiotemporal exploratory models for broad-scale survey data. Ecol
Appl 20:2131–2147
Friedman J, Hastie T, Tibshirani R (2009) The elements of statistical learning, 2nd edn. Springer, New York
Genuer R (2012) Variance reduction in purely random forests. J Nonparametr Stat 24:543–562
Genuer R, Poggi J-M, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recognit
Lett 31:2225–2236

123
G. Biau, E. Scornet

Geremia E, Menze BH, Ayache N (2013) Spatially adaptive random forests. In: IEEE international sympo-
sium on biomedical imaging: from nano to macro, pp 1332–1335
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42
Gregorutti B, Michel B, Saint Pierre P (2016) Correlation and variable importance in random forests. Stat
Comput. doi:10.1007/s11222-016-9646-1
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support
vector machines. Mach Learn 46:389–422
Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression.
Springer, New York
Ho T (1998) The random subspace method for constructing decision forests. Pattern Anal Mach Intell
20:832–844
Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework.
J Comput Graph Stat 15:651–674
Howard J, Bowles M (2012) The two most important algorithms in predictive modeling today. In: Strata
Conference: Santa Clara. http://strataconf.com/strata2012/public/schedule/detail/22658
Ishioka T (2013) Imputation of missing values for unsupervised data using the proximity in random forests.
In: eLmL 2013, The fifth international conference on mobile, hybrid, and on-line learning, pp 30–36.
International Academy, Research, and Industry Association
Ishwaran H (2007) Variable importance in binary regression trees and forests. Electron J Stat 1:519–537
Ishwaran H (2013) The effect of splitting on random forests. Mach Learn 99:75–118
Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80:1056–1064
Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008) Random survival forests. Ann Appl Stat 2:841–
860
Ishwaran H, Kogalur UB, Chen X, Minn AJ (2011) Random survival forests for high-dimensional data.
Stat Anal Data Mining ASA Data Sci J 4:115–132
Jeffrey D, Sanja G (2008) Simplified data processing on large clusters. Commun ACM 51:107–113
Joly A, Geurts P, Wehenkel L (2014) Random forests with random projections of the output space for high
dimensional multi-label classification. In: Calders T, Esposito F, Hüllermeier E, Meo R (eds) Machine
learning and knowledge discovery in databases. Springer, Berlin, pp 607–622
Kim H, Loh W-Y (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96:589–604
Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2014) A scalable bootstrap for massive data. J Royal Stat Soc
Ser B (Stat Methodol) 76:795–816
Konukoglu E, Ganz M (2014) Approximate false positive rate control in selection frequency for random
forest. arXiv:1410.2838
Kruppa J, Schwarz A, Arminger G, Ziegler A (2013) Consumer credit risk: individual probability estimates
using machine learning. Expert Syst Appl 40:5125–5131
Kruppa J, Liu Y, Biau G, Kohler M, König IR, Malley JD, Ziegler A (2014a) Probability estimation with
machine learning methods for dichotomous and multicategory outcome: theory. Biometr J 56:534–563
Kruppa J, Liu Y, Diener H-C, Holste T, Weimar C, König IR, Ziegler A (2014b) Probability estimation
with machine learning methods for dichotomous and multicategory outcome: applications. Biometr J
56:564–583
Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York
Kyrillidis A, Zouzias A (2014) Non-uniform feature sampling for decision tree ensembles. In: IEEE inter-
national conference on acoustics, speech and signal processing, pp 4548–4552
Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: efficient online random forests. In:
Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural infor-
mation processing systems, pp 3140–3148
Latinne P, Debeir O, Decaestecker C (2001) Limiting the number of trees in random forests. In: Kittler J,
Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 178–187
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22
Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101:578–590
Louppe G, Wehenkel L, Sutera A, Geurts P (2013) Understanding variable importances in forests of ran-
domized trees. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances
in neural information processing systems, pp 431–439
Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A (2012) Probability machines: consistent probability
estimation using nonparametric learning machines. Methods Inform Med 51:74–81
Meinshausen N (2006) Quantile regression forests. J Mach Learn Res 7:983–999

123
A random forest guided tour

Meinshausen N (2009) Forest Garrote. Electron J Stat 3:1288–1304

Mentch L, Hooker G (2014) A novel test for additivity in supervised ensemble learners. arXiv:1406.1845
Mentch L, Hooker G (2015) Quantifying uncertainty in random forests via confidence intervals and hypoth-
esis tests. J Mach Learn Res (in press)
Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In:
Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M (eds) Machine learning and knowledge dis-
covery in databases. Springer, Berlin, pp 453–469
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142
Nicodemus KK, Malley JD (2009) Predictor correlation impacts machine learning algorithms: Implications
for genomic studies. Bioinformatics 25:1884–1890
Politis DN, Romano JP, Wolf M (1999) Subsampling. Springer, New York
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and
random forests for ecological prediction. Ecosystems 9:181–199
Qian SS, King RS, Richardson CJ (2003) Two statistical methods for the detection of environmental thresh-
olds. Ecol Model 166:87–97
Rieger A, Hothorn T, Strobl C (2010) Random forests with missing values in the covariates. Technical
Report 79, University of Munich, Munich
Saffari A, Leistner C, Santner J, Godec M, Bischof H (2009) On-line random forests. In: IEEE 12th
international conference on computer vision workshops, pp 1393–1400
Schwarz DF, König IR, Ziegler A (2010) On safari to random jungle: a fast implementation of random
forests for high-dimensional data. Bioinformatics 26:1752–1758
Scornet E (2015a) On the asymptotics of random forests. J Multivar Anal 146:72–83
Scornet E (2015b) Random forests and kernel methods. IEEE Trans Inform Theory 62:1485–1500
Scornet E, Biau G, Vert J-P (2015) Consistency of random forests. Ann Stat 43:1716–1741
Segal MR (1988) Regression trees for censored data. Biometrics 44:35–47
Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time
human pose recognition in parts from single depth images. In: IEEE conference on computer vision
and pattern recognition, pp 1297–1304
Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5:595–645
Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat 8:1348–1360
Stone CJ (1982) Optimal global rates of convergence for nonparametric regression. Ann Stat 10:1040–1053
Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random
forests. BMC Bioinform 9:307
Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification
and regression tool for compound classification and QSAR modeling. J Chem Inform Comput Sci
43:1947–1958
Toloşi L, Lengauer T (2011) Classification with correlated features: unreliability of feature ranking and
solutions. Bioinformatics 27:1986–1994
Truong AKY (2009) Fast growing and interpretable oblique trees via logistic regression models. PhD thesis,
University of Oxford, Oxford
Varian H (2014) Big data: new tricks for econometrics. J Econ Perspect 28:3–28
Wager S (2014) Asymptotic theory for random forests. arXiv:1405.0352
Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal
jackknife. J Mach Learn Res 15:1625–1651
Watson GS (1964) Smooth regression analysis. Sankhyā Ser A 26:359–372
Welbl J (2014) Casting random forests as artificial neural networks and profiting from it. In: Jiang X,
Hornegger J, Koch R (eds) Pattern recognition. Springer, Berlin, pp 765–771
Winham SJ, Freimuth RR, Biernacka JM (2013) A weighted random forests approach to improve predictive
performance. Stat Anal Data Mining ASA Data Sci J 6:496–505
Yan D, Chen A, Jordan MI (2013) Cluster forests. Comput Stat Data Anal 66:178–192
Yang F, Wang J, Fan G (2010) Kernel induced random survival forests. arXiv:1008.3952
Yi Z, Soatto S, Dewan M, Zhan Y (2012) Information forests. In: 2012 information theory and applications
workshop, pp 143–146
Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110(512):1770–1784
Ziegler A, König IR (2014) Mining data with random forests: current options for real-world applications.
Wiley Interdiscip Rev Data Mining Knowl Discov 4:55–63

123

A Random Forest Guided Tour: Gerard - Biau@
No ratings yet
A Random Forest Guided Tour: Gerard - Biau@
41 pages
Test Bs
No ratings yet
Test Bs
33 pages
Guided Tour To Random Forest
No ratings yet
Guided Tour To Random Forest
42 pages
Machine Learning: Practical Tutorial On Random Forest and Parameter Tuning in R
No ratings yet
Machine Learning: Practical Tutorial On Random Forest and Parameter Tuning in R
11 pages
Main
No ratings yet
Main
27 pages
Schonlau Zou 2020 The Random Forest Algorithm For Statistical Learning
No ratings yet
Schonlau Zou 2020 The Random Forest Algorithm For Statistical Learning
27 pages
Random Forest
No ratings yet
Random Forest
8 pages
Fin Irjmets1655542455
No ratings yet
Fin Irjmets1655542455
4 pages
Random Forest
No ratings yet
Random Forest
11 pages
Random Forest Algorithm Overview
No ratings yet
Random Forest Algorithm Overview
11 pages
ML Lec6
No ratings yet
ML Lec6
4 pages
Random Forest
100% (1)
Random Forest
83 pages
Random Forests: Lecture Overview
No ratings yet
Random Forests: Lecture Overview
28 pages
Randomization As Regularization: A Degrees of Freedom Explanation For Random Forest Success
No ratings yet
Randomization As Regularization: A Degrees of Freedom Explanation For Random Forest Success
36 pages
Da MS
No ratings yet
Da MS
24 pages
Random Forest for Insurance Recommendations
No ratings yet
Random Forest for Insurance Recommendations
4 pages
Forest
No ratings yet
Forest
2 pages
Random Forests
No ratings yet
Random Forests
35 pages
Random Forests
No ratings yet
Random Forests
43 pages
Decision Tree & Regression
No ratings yet
Decision Tree & Regression
33 pages
Random Forest
No ratings yet
Random Forest
14 pages
Random Forest
No ratings yet
Random Forest
5 pages
Random Forests 2
No ratings yet
Random Forests 2
43 pages
Random Forest: Machine Learning Guide
100% (1)
Random Forest: Machine Learning Guide
32 pages
20schonlau Rforest
No ratings yet
20schonlau Rforest
23 pages
Random Forests - SpringerLink
No ratings yet
Random Forests - SpringerLink
6 pages
Random Forests: N 1 N J X A I X A I
No ratings yet
Random Forests: N 1 N J X A I X A I
12 pages
Random Forest Algorithm Overview: Review Article
No ratings yet
Random Forest Algorithm Overview: Review Article
11 pages
Bagging and Random Forest Presentation1
100% (4)
Bagging and Random Forest Presentation1
23 pages
Random Forests For Beginners PDF
No ratings yet
Random Forests For Beginners PDF
71 pages
Random Forest, CNN and Different Algorithm
No ratings yet
Random Forest, CNN and Different Algorithm
14 pages
Association Rule Learning Explained
No ratings yet
Association Rule Learning Explained
35 pages
Deep Learning and Neural Networks
No ratings yet
Deep Learning and Neural Networks
21 pages
12 PAGES - Random Forest Algorithm, Support Vector Machine For Regression Analysis
No ratings yet
12 PAGES - Random Forest Algorithm, Support Vector Machine For Regression Analysis
12 pages
Random Forests Simplified
No ratings yet
Random Forests Simplified
39 pages
Random Forest for ML Enthusiasts
No ratings yet
Random Forest for ML Enthusiasts
4 pages
Data Science - Decision Tree - Random Forest
No ratings yet
Data Science - Decision Tree - Random Forest
15 pages
Ensemble Learning Explained
No ratings yet
Ensemble Learning Explained
32 pages
Random Forest Algorithms - Comprehensive Guide With Examples
No ratings yet
Random Forest Algorithms - Comprehensive Guide With Examples
13 pages
Random Forest
No ratings yet
Random Forest
29 pages
Classification Algorithms
No ratings yet
Classification Algorithms
68 pages
Machine Learning
No ratings yet
Machine Learning
5 pages
Ilovepdf Merged-3
No ratings yet
Ilovepdf Merged-3
70 pages
Lecture-12 Machine Learning With Python
No ratings yet
Lecture-12 Machine Learning With Python
18 pages
Random Forest Algorithm in Machine Learning Random Forest Random Forests or Random Decision Trees Decision Trees
No ratings yet
Random Forest Algorithm in Machine Learning Random Forest Random Forests or Random Decision Trees Decision Trees
6 pages
Business Analytics: Foundation: Material Handouts
No ratings yet
Business Analytics: Foundation: Material Handouts
7 pages
Random Forests For Time Series
No ratings yet
Random Forests For Time Series
21 pages
Present
No ratings yet
Present
20 pages
Random Forest Class Lecture Notes
No ratings yet
Random Forest Class Lecture Notes
2 pages
Paper 6
No ratings yet
Paper 6
15 pages
DL Unit 1
No ratings yet
DL Unit 1
20 pages
Random Forest
No ratings yet
Random Forest
2 pages
Lecture+Notes+-+Random Forests
No ratings yet
Lecture+Notes+-+Random Forests
10 pages
Ijeit1412201405 47
No ratings yet
Ijeit1412201405 47
7 pages
Lecture 5
No ratings yet
Lecture 5
53 pages
RandomForests Sayed
No ratings yet
RandomForests Sayed
21 pages
Developing Green Fleet Management Strategies
No ratings yet
Developing Green Fleet Management Strategies
11 pages
Fishers Linear Discriminant Function
No ratings yet
Fishers Linear Discriminant Function
24 pages
Practical File MDC
No ratings yet
Practical File MDC
12 pages
FrequencyDomainImageProcessing Set 02
No ratings yet
FrequencyDomainImageProcessing Set 02
79 pages
Identification of Defects in Fruits Usin
No ratings yet
Identification of Defects in Fruits Usin
4 pages
(CSE303: ADA) (Assignment-1) : Count 0 For (I 1 I N I For (J 1 J N J Count Count +1
No ratings yet
(CSE303: ADA) (Assignment-1) : Count 0 For (I 1 I N I For (J 1 J N J Count Count +1
2 pages
Cs221 Section2 Solutions
No ratings yet
Cs221 Section2 Solutions
7 pages
AI Lab Assignment 1
No ratings yet
AI Lab Assignment 1
4 pages
SLA - Class Test - 2 - AnswerKey
No ratings yet
SLA - Class Test - 2 - AnswerKey
5 pages
2.5 Newton Raphson
No ratings yet
2.5 Newton Raphson
16 pages
Capstone Project - 1
No ratings yet
Capstone Project - 1
12 pages
Assignment 12 1 (PQT)
100% (1)
Assignment 12 1 (PQT)
3 pages
Hasan
No ratings yet
Hasan
8 pages
Digital Control 4 - Lecture Notes 2023
No ratings yet
Digital Control 4 - Lecture Notes 2023
29 pages
Ex. No: 9.B Power Spectral Density Estimation Date: by Welch Method
No ratings yet
Ex. No: 9.B Power Spectral Density Estimation Date: by Welch Method
5 pages
Appendix A Numerical Solution To Heat Equation
No ratings yet
Appendix A Numerical Solution To Heat Equation
3 pages
Linear Programming Guide
No ratings yet
Linear Programming Guide
4 pages
Wavelets 99
No ratings yet
Wavelets 99
12 pages
MBA or LPP Graphical Method
No ratings yet
MBA or LPP Graphical Method
2 pages
DAA Unit Wise Questions
No ratings yet
DAA Unit Wise Questions
8 pages
Image Processing Basics
No ratings yet
Image Processing Basics
17 pages
HW 10
No ratings yet
HW 10
2 pages
LABEX3
No ratings yet
LABEX3
28 pages
EE370 Digital Electronics: L12: Logic Synthesis - Part-2
No ratings yet
EE370 Digital Electronics: L12: Logic Synthesis - Part-2
24 pages
Lab 5,6,7
No ratings yet
Lab 5,6,7
18 pages
Report 1 Coderindeed PDF
No ratings yet
Report 1 Coderindeed PDF
15 pages
Dsa Iat 1
No ratings yet
Dsa Iat 1
3 pages
Operations Research 9ed Chapter 22
No ratings yet
Operations Research 9ed Chapter 22
40 pages
Tutorial 3: Assignment Combinatorial Optimization
No ratings yet
Tutorial 3: Assignment Combinatorial Optimization
7 pages
B.Tech Digital Image Processing Exam
No ratings yet
B.Tech Digital Image Processing Exam
2 pages

Biau 2016

Uploaded by

Biau 2016

Uploaded by

TEST

A random forest guided tour

Gérard Biau1,2 · Erwan Scornet1

© Sociedad de Estadística e Investigación Operativa 2016

Keywords Random forests · Randomization · Resampling · Parameter tuning ·

This invited paper is discussed in comments available at: doi:10.1007/s11749-016-0482-6;

Mathematics Subject Classification 62G02

2 The random forest estimate

2.1 Basic principles

1Xi ∈An (x;Θ j ,Dn ) Yi

In the R package randomForest, the default value of M (the number of trees in

m ∞,n (x; Dn ) = EΘ [m n (x; Θ, Dn )] .

lim m M,n (x; Θ1 , . . . , Θ M , Dn ) = m ∞,n (x; Dn )

Algorithm 1: Breiman’s random forest predicted value at x.

where A L = {x ∈ A : x( j) < z}, A R = {x ∈ A : x( j) ≥ z}, and Ȳ A (resp., Ȳ A L , Ȳ A R )

( jn , z n ) ∈ arg max L reg,n ( j, z).

2.3 Supervised classification

L(m n ) = P[m n (X) = Y ] → L ,

where L is the error of the optimal—but unknown—Bayes classifier:

2.4 Parameter tuning

Literature focusing on tuning the parameters M, mtry, nodesize and an is unfortu-

lim E[m M,n (X; Θ1 , . . . , Θ M ) − m(X)]2 = E[m ∞,n (X) − m(X)]2 .

3 Simplified models and local averaging estimates

3.1 Simplified models

Fig. 1 A centered tree at level 2

m(x) = E[Y |XS = xS ].

E[m ∞,n (X) − m(X)]2 = O n −2/3 .

3.2 Forests, neighbors and kernels

Let us consider a sequence of independent and identically distributed random variables

Fig. 2 The layered nearest neighbors (LNN) of a point x in dimension p = 2

where the weights Wni (x) are defined by

1 1Xi ∈An (x;Θ j )

K n (x, z) = PΘ [z ∈ An (x, Θ)] .

Fig. 3 Representations of f 1 , f 2 and f 5 in [0, 1]2

As an illustration, Fig. 3 shows the graphical representation for k = 1, 2 and 5 of the

f k : [0, 1] × [0, 1] → [0, 1] 

The connection between forests and kernel estimates is mentioned in Breiman

4 Theory for Breiman’s forests

entire algorithm is difficult. To move forward, the individual mechanisms at work in

4.1 The resampling mechanism

The resampling step in Breiman’s (2001) original algorithm is performed by choosing

where E denotes expectation with respect to the resampling distribution, conditional

Wni (x) = P[Xi is the 1-NN of x in a random selection of size an ].

4.2 Decision splits

where D denotes convergence in distribution, and W is a standard two-sided Brownian

jn,q (X) ∈ {1, . . . , S}.

4.3 Consistency, asymptotic normality, and more

EX,Y [Y − m ∞,n (X)]2 ≤ ρ̄ EX,Y,Θ [Y − m n (X; Θ)]2 ,

EΘ,Θ  [ρ(Θ, Θ  )g(Θ)g(Θ  )]

with Θ and Θ  independent and identically distributed,

Var[m ∞,n (x)] = ρ(x)σ (x),

0 ≤ E[m M,n (X; Θ1 , . . . , Θ M ) − m(X)]2 − E[m ∞,n (X) − m(X)]2

where N is a standard normal random variable,

Xi an independent copy of Xi and Θ  an independent copy of Θ. It is worth noting

5.1 Variable importance measures

X ( j) computes the weighted decrease of impurity corresponding to splits along the

where Xj = (X (1) , . . . , X ( j) , . . . , X ( p) ) and X ( j) is an independent copy of X ( j) .

5.2 Theoretical results

where V − j = {1, . . . , j − 1, j + 1, . . . , p}, Pk (V − j ) the set of subsets of V − j of

5.3 Related works

6.1 Weighted forests

6.2 Online forests

2013; Lakshminarayanan et al. 2014). In Lakshminarayanan et al. (2014), so-called

6.3 Survival forests

6.4 Ranking forests

6.5 Clustering forests

6.6 Quantile forests

6.7 Missing data

6.8 Single class data

6.9 Unbalanced data set

7 Conclusion and perspectives

Meinshausen N (2009) Forest Garrote. Electron J Stat 3:1288–1304

You might also like

f k : [0, 1] × [0, 1] → [0, 1]

EΘ,Θ [ρ(Θ, Θ )g(Θ)g(Θ )]

with Θ and Θ independent and identically distributed,

Xi an independent copy of Xi and Θ an independent copy of Θ. It is worth noting

where Xj = (X (1) , . . . , X ( j) , . . . , X ( p) ) and X ( j) is an independent copy of X ( j) .