
To cite this version: Erwan Scornet, Giles Hooker. Theory of Random Forests: A Review. 2025. HAL preprint hal-05006431, submitted on 26 Mar 2025. https://hal.science/hal-05006431v1
Theory of Random Forests: A Review

Erwan Scornet,¹ Giles Hooker²

¹ Sorbonne Université and Université Paris Cité, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75005 Paris, France; email: erwan.scornet@polytechnique.edu

² Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA; email: ghooker@wharton.upenn.edu

Abstract
Random forests (RF) have a long history, originally defined by Leo Breiman in 2001, but with antecedents in bagging methods introduced in 1996. They have become one of the most widely adopted machine learning tools thanks to their computational efficiency, relative insensitivity to tuning parameters, built-in cross-validation, and interpretation tools. Despite their popularity, mathematical theory about the fundamental properties of random forests has been slow to emerge. Nonetheless, the past decade has seen significant advances in our understanding of all these processes. In this review paper, we describe several variations of RF and how the rates of consistency of these variants highlight the impact of different RF mechanisms on their performance. Another line of research focuses on establishing central limit theorems and confidence intervals for random forests. We also describe recent analyses of variable importance measures computed with RF.

Keywords: Random forests, excess risk, central limit theorem, variable importance.

Contents
1 INTRODUCTION
  1.1 Notations and algorithm

2 BIAS, VARIANCE AND CONSISTENCY
  2.1 Purely random forests
  2.2 Uniform forests
  2.3 Mondrian Forests
  2.4 Median Forests
  2.5 Slight variants of Breiman's forest
  2.6 Breiman's forests
  2.7 Interpolation, self-regularization and double descent

3 VARIATIONS
  3.1 Performance Improvements
  3.2 Functionality
  3.3 Secondary Uses

4 CENTRAL LIMIT THEOREMS AND VARIANCE ESTIMATION

5 VARIABLE IMPORTANCE
  5.1 Mean Decrease in Impurity (MDI)
  5.2 Mean Decrease in Accuracy (MDA)
  5.3 Tree SHAP

6 DISCUSSION

1 INTRODUCTION
Decision trees are among the most comprehensible machine learning methods: a recursive
partition of the input space is constructed from the data, and the prediction at a new point
results only from an average (or majority vote) of the observations falling within the same
leaf as that point. The instability of these methods in the face of dataset perturbation
and their limited predictive capacity have led to the consideration of tree-based ensemble
methods such as random forests or gradient boosting decision trees. Despite the advent
of deep learning, such methods are part of the state of the art for dealing with tabular
datasets. What's more, these ensemble methods benefit from the versatility of trees: they can handle a variety of problems (classification, regression, quantile estimation, survival analysis, ...), accommodate different types of missing data, and are computationally attractive. For a long time, the
statistical properties of random forests remained a mystery. Before delving into the various
directions of random forest research, we start by describing the original algorithm.

1.1 Notations and algorithm


Throughout the paper, we assume we are given a dataset Dn = {(Xi , Yi ), i = 1, . . . , n},
composed of i.i.d. pairs (Xi , Yi ) distributed as the generic pair (X, Y ), where X ∈ Rd
and Y ∈ R. This general setting includes continuous and discrete inputs X and covers
classification and regression problems.
At its most general, a random forest consists of an average of the predictions of M decision trees, which we write as
\[
\hat f(x) = \frac{1}{M}\sum_{m=1}^{M} T(x;\,\omega_m, \mathcal{D}_n), \qquad (1)
\]

in which T is the prediction at x ∈ R^d of a decision tree trained on the data D_n along with independent and identically distributed randomization parameters ω_m. Because the T(x; ω_m, D_n) for m = 1, . . . , M are i.i.d. conditional on the data, it is common to consider the limit M → ∞, yielding the infinite forest (see Scornet et al., 2015)
\[
\hat f_\infty(x) = \mathbb{E}_\omega\, T(x;\,\omega, \mathcal{D}_n), \qquad (2)
\]
where the expectation is taken with respect to ω only. Indeed, M can be made large enough to remove the variation associated with ω, at the price of additional computation.
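As a concrete illustration of (1) and (2), the following sketch (ours, not code from the paper; it assumes scikit-learn and NumPy are available, and all names are illustrative) averages the predictions of M CART trees, each trained on a bootstrap resample with its own randomization seed ω_m:

```python
# A minimal sketch of Equation (1): a forest prediction as the average of M
# randomized trees. Assumes scikit-learn and NumPy; all names are ours.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, d, M = 500, 5, 100
X = rng.uniform(size=(n, d))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=n)

trees = []
for m in range(M):
    # omega_m: the tree-specific randomization (resampling + feature subsampling).
    idx = rng.integers(0, n, size=n)                     # bootstrap sample D_n^*
    tree = DecisionTreeRegressor(max_features="sqrt",    # mtry-style feature subsampling
                                 random_state=m)         # split randomization seed
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def forest_predict(x):
    """Average of the M tree predictions at x, i.e. Equation (1)."""
    x = np.atleast_2d(x)
    return np.mean([t.predict(x) for t in trees], axis=0)

print(forest_predict(X[:3]))
```

Increasing M approximates the infinite forest (2) at the cost of additional computation.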

A random forest is thus defined by two choices: the decision tree algorithm and how the
randomization parameters ωm impact the tree structure/prediction. The original random
forest algorithm was designed by Breiman (2001), who proposed aggregating fully-developed CART trees (a specific decision tree algorithm, see Breiman et al., 1984), randomized via data resampling and, at each node, feature subsampling. Since then, the class of random forest-like
algorithms has significantly expanded, with variations proposed to improve performance,
to expand the applicability of random forests to new tasks or data types, and to produce
algorithms that are theoretically tractable or have desirable distributional properties. Here we outline the broad structure of a forest and note the choices that were made in Breiman's original version; many of the results that we detail below are given for specific variations. Note that almost all random forest variants are based on sequentially chosen dyadic splits, that is, the feature space R^d is recursively divided into cells until some stopping criterion
is satisfied.

• Data resampling. The original random forest algorithm obtained bootstrap samples from D_n to create an input D_n^* to each tree, which we characterize using the vector N_im indicating the number of times observation i is repeated in the sample given to the mth tree. However, most theoretical advances have been made by selecting a random subset of size k either with (Mentch and Hooker, 2016; Wager and Athey, 2018) or without (Scornet, 2016a; Zhou et al., 2021) replacement. It is also possible to rely solely on randomization within the tree-building process, as is implicitly done in defining the equivalent kernels described in Scornet (2016b).
• Split randomization - Set of potential splits. Most random forest algorithms consider
partitioning the current cell based on the value of one of the dimensions of X. Early
tree building algorithms such as CART (Breiman et al., 1984) considered the midpoint
between each successive data value, along each dimension, as potential splits. Random
forests restricted these to a randomly chosen subset of mtry dimensions. More restric-
tive versions are limited to the medians of the observed data values or the midpoints
of the current node range; others restrict the imbalance in the proportion of data on
each side of the split. Other variants consider splits based on linear combinations of
features.

• Tree building process - Best split selection. How to choose between the potential splits selected above? Here random forests follow CART in scoring each split by the decrease
in squared prediction error when the prediction on each side of the split is held constant
(see, e.g., Geurts et al., 2006); many extensions consider criteria based on particular
models. Honest forests (Wager and Athey, 2018) choose among splits based on 1/2 of
the data available to the current tree, reserving the other half to select values in the
leaves. Other random forest variants choose randomly among candidate splits.
• Tree building process - Stopping rule. The original random forests ceased splitting
regions with fewer than k observations (with k = 1 or 5 by default). Other models
can specify tree depths, trees pruned in the manner of CART, or stochastically chosen
stopping rules.
• Tree prediction. The original random forests return a constant value which minimizes the loss (squared error, or weighted misclassification cost) in each leaf. Honest forests use a held-out set of data to decide on this value, so as to keep the leaf value independent from the structure of the tree. Other algorithms estimate parametric models in each leaf (see, for example, Chaudhuri et al., 1994).
• Prediction combination. The original random forest algorithm was designed as a pre-
diction tool and produced either an average, if predicting a real-valued quantity, or a
majority vote if a class label was needed. Averaging remains the most common way
to combine the predictions from trees, but weighted averages, or using leaf indicators
within a larger regression model (Agarwal et al., 2023) are also possible.
Figure 1 presents a schematic of random forest algorithms. Note that other descriptions
are possible: for instance, the selection of data can be described as a process internal to
the tree. However, the collection of algorithmic choices presented above encompasses most
proposed practical random forest variants, and their mathematical models.
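For reference, one widely used implementation, scikit-learn's RandomForestRegressor, exposes most of the choices listed above as hyperparameters. The mapping sketched below is our own reading of that interface, not part of the paper, and is only indicative:

```python
# Indicative mapping (ours) from the algorithmic choices above to
# scikit-learn's RandomForestRegressor hyperparameters.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,     # M, the number of trees
    bootstrap=True,       # data resampling (bootstrap samples of D_n)
    max_samples=None,     # subsample size a_n when bootstrap=True (None means a_n = n)
    max_features="sqrt",  # mtry: size of the random subset of candidate split directions
    min_samples_leaf=5,   # stopping rule: stop splitting cells with fewer than k points
    max_depth=None,       # alternative stopping rule: maximal tree depth
    random_state=0,       # seed for the randomization parameters omega_m
)
```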

Figure 1: Illustration of a random forest on the left and a tree partition on the right. (The left panel traces the pipeline for trees 1 through M: training data, data resampling, split randomization, split selection, stopping rule, prediction, and prediction combination.)

In this review, we focus on four different aspects of random forest theory. In Section 2,
we detail different rates of consistency for several random forest models, with the goal of
elucidating the role of each RF hyperparameter on RF predictive performance. In Section 3,
we review variants of RF that have been theoretically studied. In Section 4, we detail
central limit theorems for random forests and the resulting uncertainty quantification. In Section 5, we summarize the analyses of variable importance measures in random forests, in order to provide guidance on how to use them. Finally, a discussion of further research avenues is
presented in Section 6.
Several earlier papers already review some aspects of random forests theory. Notably,
Biau and Scornet (2016) provide a detailed review of the theoretical properties of random
forests at that time. From a methodological perspective, Loh (2014) provides a very detailed review of various decision tree algorithms. A description of different applications that can be solved using random forests (or their variants) is provided in Criminisi et al. (2012). Boulesteix et al. (2012) describe the existing implementations of random forests, with an emphasis on advances in bioinformatics. The interested reader is also referred to Probst et al. (2019)
for insightful discussions on the impact of parameter tuning in random forests. Thus, the
present review focuses on the most recent theoretical advances in this field.

2 BIAS, VARIANCE AND CONSISTENCY
The analysis of random forests is notoriously difficult due to a combination of several in-
gredients – recursive (tree) structure with data-dependent splits, split/data randomization,
large tree depths, aggregation – each element of which produces theoretical challenges. As
first attempts to dissect their internal behavior, several research works designed new styl-
ized versions of random forests, whose splitting procedure does not depend on the data.
Such forests, called purely random forests (see Arlot and Genuer, 2014), or non-adaptive
random forests, allow us to comprehend the contribution of some key ingredients (split ran-
domization, tree depth) on the predictive performance of the forest. The strength of the
results comes at the price of simplification of the original algorithm. Another route consists
in deriving results for Breiman’s random forests: results focus on simpler properties, while
requiring much more involved technical tools. We explore these two avenues in the following
sections. We will work with the general regression model

\[
Y = f^\star(X) + \varepsilon, \qquad (3)
\]
where X ∼ U([0, 1]^d), f^\star(x) = E[Y | X = x] is the regression function, and ε is a noise term satisfying E[ε | X] = 0 and V[ε | X] ≤ σ^2 a.s. In this section, all results are stated for infinite
random forests (see (2)). They can be extended to finite random forests if the number of
trees M is chosen large enough.

2.1 Purely random forests


The first purely random forest is the centered random forest in which each cell of each
tree is built in the following way: first, a direction is selected uniformly at random among
{1, . . . , d}, and then the split is made at the center of the cell, along the chosen direction.
The process is repeated until a prespecified depth is reached, which results in balanced trees.
Centered random forests were first introduced by Breiman (2004) and later analyzed by Biau
(2012) who derived an upper bound on their rate of consistency. According to Theorem 5
in Biau (2012), assuming that f^\star is Lipschitz, the centered forest \hat f^{\,cen}_{k,n} of depth k satisfies
\[
\mathbb{E}\big[(\hat f^{\,cen}_{k,n}(X) - f^\star(X))^2\big] \lesssim n^{-1/((4/3)\, d \log 2 + 1)}, \qquad (4)
\]
with k ∼ C_d \log_2(n) and C_d = (1 + 3/(4 d \log 2))^{-1}. This rate was further improved by Klusowski (2021) (Theorem 2), who proves that, under similar assumptions,
\[
\mathbb{E}\big[(\hat f^{\,cen}_{k,n}(X) - f^\star(X))^2\big] \lesssim n^{-\frac{1}{d\log 2 + 1}(1+\Delta_d)}, \qquad (5)
\]

where \lim_{d\to\infty} \Delta_d = 0. Klusowski (2021) showed that this rate is not improvable for general Lipschitz functions. Recall that the minimax rate over the class of Lipschitz functions is given by n^{-2/(d+2)} (Stone, 1982). Many estimators reach this optimal rate, among which kernel methods and simple cubic partitions (regular partitions, a very specific tree form). Thus, it may seem surprising that randomizing tree partitions leads to a suboptimal estimator, as highlighted by (5). In fact, the slow rate of convergence of centered forests results from their (relatively) poor approximation capacity (Klusowski, 2021): the bias term is of order 2^{-k/(d\log 2)}, whereas it is of order 2^{-2k/d} < 2^{-k/(d\log 2)} for a single cubic partition. Therefore, the randomness of the tree partition (and in particular the fact that all cells do not have the same shape) harms the approximation error of the forest compared to a single
cubic partition.
While thorough analyses established the exact properties of centered forests, studying
such models does not allow for a precise understanding of Breiman's forests. The most important limitation of centered forests is the data-independent splitting process, which makes the algorithm unable to select only relevant variables and thus unable to adapt to sparsity. Besides, the analysis does not show any benefit from using centered forests instead of centered trees, whereas in practice, Breiman's forests outperform individual CART trees. Since the lower and upper bounds match (see Klusowski, 2021), this is a consequence of the overly simplistic centered-forest model rather than of a loose theoretical analysis. Finally, the rate of consistency is optimal when leaves contain numerous observations, whereas original random forests build fully-developed trees, whose leaves contain few observations. (Note, however, that according to the optimal expression of k in (4), in a high-dimensional setting, centered trees are almost fully grown.)
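To make the centered construction concrete, here is a short sketch (ours, assuming NumPy; not code from the papers cited above) that grows the partition of a single centered tree of depth k on [0, 1]^d: at each level, every cell draws a split direction uniformly at random and is cut at its midpoint.

```python
# Sketch (ours) of a centered purely random tree of depth k on [0, 1]^d:
# each cell is split at its center along a uniformly chosen direction.
import numpy as np

def centered_tree_partition(k, d, rng):
    """Return the list of leaf cells, each cell being a (d, 2) array of [low, high] bounds."""
    cells = [np.column_stack((np.zeros(d), np.ones(d)))]
    for _ in range(k):                      # k levels of splits -> 2^k balanced leaves
        new_cells = []
        for cell in cells:
            j = rng.integers(d)             # split direction, uniform on {1, ..., d}
            mid = cell[j].mean()            # split at the center of the cell
            left, right = cell.copy(), cell.copy()
            left[j, 1] = mid
            right[j, 0] = mid
            new_cells += [left, right]
        cells = new_cells
    return cells

rng = np.random.default_rng(0)
leaves = centered_tree_partition(k=3, d=2, rng=rng)
print(len(leaves))   # 2^3 = 8 leaves
```

A centered forest prediction at x then averages, over independently drawn partitions of this kind, the mean of the responses Y_i falling in the leaf containing x.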

2.2 Uniform forests


Uniform random forests, introduced by Biau et al. (2008), work as centered forests, except that the split position is drawn uniformly at random instead of being made at the middle of the cell. The resulting tree is still balanced, of depth k. Biau et al. (2008) prove the consistency of uniform forests when the depth k tends to infinity such that n/2^k also tends to infinity,
thus ensuring a growing number of observations per cell on average. Arlot and Genuer
(2014) (Equation 53) show that the risk of such forests satisfies
\[
\mathbb{E}\big[(\hat f^{\,uni}_{k,n}(X) - f^\star(X))^2\big] \lesssim n^{-2\alpha/(2\alpha+1)}, \qquad (6)
\]

with \alpha = -\log_2(1 - 1/(2d)). It can be shown that the risk of a single uniform tree is larger than the risk of the forest. However, uniform forests do not reach the minimax rate of convergence for C^2 functions, as 2\alpha/(2\alpha + 1) < 4/(4 + d). Arlot and Genuer (2014) claim
that this slow rate of convergence results from the fact that all cells are split equally, even
if some are much smaller than others (see also the comment above on centered forests and
Klusowski, 2021).
Contrary to previous analyses, Arlot and Genuer (2014) show that the number of trees
must be larger than M = (1 - 1/(2d))^{-k} to ensure that the predictive performance of the finite forest (with M trees) is close to that of the infinite forest.

2.3 Mondrian Forests


Splitting schemes that take into account the geometry of the current partition can be con-
sidered instead of centered or uniform forests. Such partitions are still independent of the
dataset but contain cells designed to be of similar shapes. An example of such a construction
is the Mondrian forest (Lakshminarayanan et al., 2014), which originates from Mondrian
processes (Roy et al., 2008). In Mondrian forests, the probability of splitting a cell is proportional to its perimeter; a split direction is then chosen at random with probability proportional to the corresponding side length, and the split position is drawn uniformly along the chosen direction. A lifetime parameter
controls the depth of the tree. Since leaves of Mondrian trees have shapes close to hyper-
cubes, such forests exhibit an approximation error of the same order as cubic partitions for
Lipschitz functions, and are thus minimax over this class (see Mourtada et al., 2020). Moreover, contrary to a single Mondrian tree or a cubic partition, Mondrian forests are also minimax over the class of Hölder functions with exponent β ∈ (1, 2) (see Mourtada et al., 2020). Hence, there is an improvement of Mondrian forests compared to a single Mondrian tree in terms of rate of consistency. This phenomenon results solely from the randomization of partitions.
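The Mondrian construction can be sketched in a few lines (our illustration, assuming NumPy; the splitting rate below uses the sum of the side lengths of the cell, and all names are ours):

```python
# Sketch (ours) of one Mondrian tree partition of [0, 1]^d with lifetime lam:
# cells split at an exponential rate given by the sum of their side lengths,
# along a direction chosen proportionally to side length, at a uniform position.
import numpy as np

def mondrian_partition(cell, time, lam, rng):
    """Recursively split `cell` (a (d, 2) array of bounds); return the leaf cells."""
    sides = cell[:, 1] - cell[:, 0]
    split_time = time + rng.exponential(1.0 / sides.sum())
    if split_time > lam:                                  # lifetime exceeded: leaf cell
        return [cell]
    j = rng.choice(len(sides), p=sides / sides.sum())     # direction ~ side length
    pos = rng.uniform(cell[j, 0], cell[j, 1])             # uniform split position
    left, right = cell.copy(), cell.copy()
    left[j, 1], right[j, 0] = pos, pos
    return (mondrian_partition(left, split_time, lam, rng)
            + mondrian_partition(right, split_time, lam, rng))

rng = np.random.default_rng(0)
root = np.column_stack((np.zeros(2), np.ones(2)))
leaves = mondrian_partition(root, time=0.0, lam=2.0, rng=rng)
print(len(leaves))
```

The lifetime parameter lam plays the role of the depth parameter: larger values yield finer partitions.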

2.4 Median Forests


In Breiman’s forests, two types of randomization occur: a randomization of partitions due
to split randomization and subsampling, the latter inducing changes in the optimal split
positions, and a randomization of labels due to subsampling only; see Arlot and Genuer
(2016) for a thorough discussion. We have so far considered randomization of partitions due
to split randomization only, as purely random forests do not use the dataset to split. In order
to analyze all randomizations at work in random forests, we now turn to median forests. Median trees (see Section 20.3 in Devroye et al., 2013) are built by uniformly choosing a split direction and then cutting at the empirical median of the inputs along this direction. The
resulting cells contain the same number of points up to ±1 observation. As this construction
depends on the position of the dataset inputs, we can build fully-developed trees whose
leaves contain exactly one observation, bringing us closer to the original random forest
algorithm. As for any fully-developed non-adaptive tree, a single median tree is inconsistent
(see Problem 20.2 in Devroye et al., 2013). However, median forests, composed of fully-
developed median trees, are consistent if either subsampling (Scornet, 2016a, Duroux and
Scornet, 2018) or split randomization (Arnould et al., 2023) is properly incorporated into
the model.
Duroux and Scornet (2018) show that tuning the subsample size an or the tree depth
kn in median forests results in the same upper bound on the rate of consistency (improved
later by Klusowski, 2021). This suggests that these two hyperparameters play the same role
in median forests and thus, only one of them needs to be optimized (see also Tang et al.,
2018, for the importance of subsampling). More importantly, their analysis can be carried
out in the exact same way if split randomization is removed with all trees choosing the same
split directions, thus stressing the impact of subsampling on forest (rate of) consistency.
On the other hand, disabling subsampling, and studying only split randomization for fully-
developed median trees, Arnould et al. (2023) prove that the risk of such median forests
satisfies
\[
\mathbb{E}\big[(\hat f^{\,med}_{n}(X) - f^\star(X))^2\big] \lesssim (\log_2 n)^{-(d-1)/2}. \qquad (7)
\]

These results highlight the benefits of split randomization on the consistency of the forest.
However, the rate of consistency in (7) is very slow. It turns out that such a phenomenon
is expected for fully-developed non-adaptive random forests. Indeed, Lin and Jeon (2006)
proved that the risk of a non-adaptive forest trained on the full dataset is larger than
\[
\frac{(\log n)^{-(d-1)}}{n_{\mathrm{obs/leaf}}}, \qquad (8)
\]
where n_{obs/leaf} is the number of samples inside each leaf of each tree. Therefore, non-adaptive fully-developed forests that do not incorporate subsampling cannot reach rates of consistency faster than (\log n)^{-(d-1)}. Such results seem to point toward the use of subsampling instead
of split randomization in non-adaptive forests.

2.5 Slight variants of Breiman’s forest
Wager and Walther (2015) define (α, k)-valid tree partitions that are (i) constructed on
the full original sample without bootstrapping or subsampling, (ii) whose splits leave at
least a fraction α of parent node observations in each child node, and (iii) such that each
final leaf contains at least k observations. Standard implementations of random forests allow removing the bootstrap step (i) and enforcing a minimum number of observations in each leaf (iii). Condition (ii) is not satisfied in general, as it is known that splits performed along noise features are close to the edges of the cell (end-cut preference, see Ishwaran, 2015; Breiman et al., 1984). Enforcing (ii) thus requires modifying the splitting criterion. However, the
overall algorithm remains quite close to the algorithm used in practice.
Wager and Walther (2015) establish a control on any random forest using (α, k)-valid tree
partitions. More precisely, assuming that lim inf d/n > 0 and under regularity constraints
on the regression model,
\[
P\left[\,\sup_{x\in[0,1]^d,\ \Lambda_1,\dots,\Lambda_M}\big|\hat f_{\Lambda_1,\dots,\Lambda_M}(x) - f^\star_{\Lambda_1,\dots,\Lambda_M}(x)\big| \;\lesssim\; \sqrt{\frac{\log(n)\log(d)}{\log((1-\alpha)^{-1})}}\,\frac{1}{\sqrt{k}}\,\right] \to 1, \qquad (9)
\]
where \hat f_{\Lambda_1,\dots,\Lambda_M}(x) is the prediction of the forest composed of the tree partitions Λ_1, . . . , Λ_M and f^\star_{\Lambda_1,\dots,\Lambda_M}(x) is the same forest where the empirical mean in each cell is replaced by the conditional expectation of the output variable in the corresponding cell. Based on the
above concentration result, Wager and Walther (2015) propose the Guess-and-Check forest
and prove its pointwise consistency.
Wager and Athey (2018) introduce the honesty property: a tree is honest if each obser-
vation cannot be used simultaneously for the tree construction and for estimating the mean
response in terminal nodes. A simple manner for a tree to be honest is to use a subsample
for the tree construction and the rest of the observations for mean leaf estimation. Under
regularity conditions, Wager and Athey (2018) prove the pointwise consistency of honest
trees with (α, k)-valid partitions in the context of causal inference.

2.6 Breiman’s forests


What may be the most difficult ingredient of Breiman’s forests is the data-dependent split-
ting criterion: both inputs and outputs are used to find the optimal splits via exhaustive
search. The process does not involve any gradient-based strategies, and thus theoretical
analyses cannot leverage the vast literature on gradient descent. Such adaptive partitions
are intrinsically difficult to analyze, as they do not correspond to a local averaging estimate
that can be addressed using Stone’s theorem since weights depend on the labels, see Stone
(1977).
Scornet et al. (2015) is the first work to study Breiman's original algorithm. Considering an additive model of the form
\[
Y = \sum_{j=1}^{d} f_j(X^{(j)}) + \varepsilon, \qquad (10)
\]
the original random forest was proved to be consistent, assuming that the number of leaves t_n tends to infinity while satisfying t_n = o(a_n/(\log a_n)^9), where a_n is the subsample size. Due to the previous condition, the consistency result of Scornet et al. (2015) is not valid for fully-developed trees (t_n = a_n). They also establish the ability of random forests to
asymptotically identify the relevant direction for splitting in a sparse additive framework.
Using other proof techniques, Klusowski and Tian (2022) extend these results to addi-
tive (logistic and ordinary) regression, establishing oracle inequalities on the excess risk of
generic decision trees. They also prove that the rate of consistency for Breiman’s random
forests is upper bounded by (\log n)^{-1}. Klusowski and Tian (2022) highlight that split randomization cannot degrade the predictive performance of RF by more than a factor of
3 (with default RF parameters). Klusowski and Tian (2022) also prove the consistency
of RF in sparse high-dimensional settings, where the dimension d is allowed to grow sub-
exponentially with n and where additive components can be discontinuous (Corollary 7.2).
Assuming a minimum impurity decrease in all tree nodes, Chi et al. (2022) derive upper
bounds on the approximation error of random forests in high-dimensional settings, high-
lighting the influence of the number of directions eligible for splitting. In the case of some
specific high-dimensional r-sparse regression models with binary input features, Syrgkanis
and Zampetakis (2020) establish a rate of consistency for the MSE of Breiman's forests of order 2^r \log(d)\log(n)/n, which increases logarithmically with the ambient dimension d.
While decision trees are able to detect signal and are thus tailored for sparse settings,
they fail to adapt to the simplicity of additive models. Using rate-distortion theory, Tan
et al. (2022) establish a lower bound on the MSE of general honest decision trees, implying
that they fail to recover the efficient rate of n^{-2/3} of Generalized Additive Models (Stone,
1985). We note that while these results bring us closer to Breiman’s original forests, they
still rely on modifications such as subsampling and controlling tree depth. The impact of
these choices relative to the original algorithm remains unclear.

2.7 Interpolation, self-regularization and double descent


Recent works (Belkin et al., 2019) show that overparameterized interpolating estimators do
not automatically overfit the data. This seems to contradict the common belief, based on the bias-variance tradeoff, that model complexity must be kept moderate in order to prevent the variance from exploding. Such a phenomenon, called double descent, was established for overparameterized neural networks and for random forests in Wyner et al. (2017) and Belkin et al. (2019). The double descent phenomenon is observed for random forests when the complexity of the model is controlled via the number of leaves per tree and the number of trees (see Figure 5 in Belkin et al., 2019). In the binary classification setting, even in the presence of bootstrap, random forests exactly interpolate the training set. Curth et al. (2024a) argue
that the double descent phenomenon only occurs when the complexity is measured via two
distinct quantities – the number of leaves and trees – merged into one single axis. Unfolding
these two quantities leads to the classical U-shaped curve (see Figures 1 and 2 in Curth et al., 2024a).
Wyner et al. (2017) argue that the good predictive performance of RF comes from their spiked-smooth nature (see Figure 1 in Wyner et al., 2017), which results directly from their interpolating properties. However, Mentch and Zhou (2020) claim that interpolation in itself is not sufficient to explain the good performance of RF, as bagging (without split randomization) has lower predictive performance while still interpolating in the binary classification setting. According to Mentch and Zhou (2020), the good performance of RF results from feature randomization (via mtry), which acts as implicit regularization: low values of mtry lead to better predictive performance in low signal-to-noise ratio (SNR) settings. In
the same setting, Mentch and Zhou (2022) show that adding input noise variables can
improve RF predictive performance. In fact, tuning the tree depth instead of considering
fully-grown trees in RF may act as a beneficial regularization in low Signal-to-Noise Ratio
settings (Zhou and Mentch, 2023).
Curth et al. (2024b) propose to reconcile these two views. The “spiked-smooth” behavior
mentioned by Wyner et al. (2017) is quantified for RF by Curth et al. (2024b) via an effective-parameter measure: the effective-parameter value computed on test points is lower than that computed on training points, which provides empirical evidence of a spiked RF behavior near training points and a smooth behavior far from training points. The effective-parameter value computed on test points also decreases when the number of trees, the feature randomization (small values of mtry), or the tree depth increases (see Figure 4 in Curth et al., 2024b). Thus, the good performance of RF seems to result from their "spiked-smooth" behavior, something that is exaggerated with interpolating trees or low values of
mtry.

3 VARIATIONS
Beyond algorithmic modifications for purposes of theoretical tractability, or elucidating the
role of computational mechanisms, many variations on random forests have been proposed to improve performance, to provide new functionality, or to repurpose some aspects of random forests. We provide a brief catalog of some of these
variants here, particularly focusing on recent methods – see Loh (2014) for earlier variants.

3.1 Performance Improvements


RF improvements come with either theoretical or heuristic justification. Some of these modify the splitting criteria used for constituent trees. Tomita et al. (2020) split along randomly selected sparse linear combinations of features, while Hornung and Boulesteix (2022) use
two-variable combinations to split out axis-oriented subsets of the current node. These splits
are a strict superset of axis-oriented splits and necessarily increase the expressive capacity
of trees, but the additional complexity increases the challenges of a theoretical motivation.
Cattaneo et al. (2024) analyze a variant of these; with additional constraints on the regression function, the tree partition, and the number of observations per node, oblique RF can reach a rate of consistency of n^{-2/(2+q)}, where q acts as an effective dimension. Alternative splitting criteria fit models within nodes; Raymaekers et al. (2024) develop
computationally fast methods to include a further linear term at each split, choosing be-
tween enforcing continuity across the split or allowing a break.
Other variants focus on specific tasks or model structures; Basu et al. (2018) explicitly
uses the structure of random forests to emphasize high-dimensional interactions, iteratively
reweighting data, and re-fitting forests, based on the data’s participation in interactions
defined through strings of important splits within a tree. Fink et al. (2010) proposes a
method aimed at ecological survey data in which each tree is obtained from data that is
localized to one of many overlapping spatiotemporal regions. This induces interactions
between time and location and any landscape or ecological features used in the data, in
particular preventing the forest from using relationships in one ecological context in contexts
that could be quite different. Coleman et al. (2019) employ a reweighting strategy when
covariate shift is expected at the time of deployment.
Further extensions use random forests within larger models. Tan et al. (2025) extends
Friedman and Popescu (2008) in performing regression on indicators for the leaves of each tree with linear functions. Ghosal and Hooker (2020) "boost" random forests, fitting a
second random forest to the out-of-bag residuals of the first. Bayesian methods for trees
and ensembles of trees (e.g. Chipman et al., 2012, Jeong and Rockova, 2020) do not explicitly
use random forest structures, but averages of posterior samples create forest-like averages;
see Ročková and Van der Pas (2020) for posterior consistency of these ensembles.

3.2 Functionality
The original random forest algorithm was developed for regression and classification tasks, but its scope has been significantly expanded. A first strategy for doing so is to adapt the splitting criteria to
reflect the data model. Here, Saha et al. (2021), Rabinowicz and Rosset (2022) assume
spatially correlated predictions and split based on log likelihood. Ishwaran et al. (2008)
develop random forests for survival analysis, accounting for censoring, with consistency
studied in Cui et al. (2022). Meinshausen and Ridgeway (2006) developed a random forest
quantile regression. There has also been considerable interest in causal estimation; Wager
and Athey (2018) adapted the tree building process to estimate conditional average treat-
ment effect, extended to a survival setting in Cui et al. (2023). Athey et al. (2019) presents a
general model for splits maximizing a likelihood with approximate optimization for models
in each daughter node to speed computation.

3.3 Secondary Uses


The structure of trees in random forests has also been employed to provide an adaptive
kernel or distance function to be used in further models. In the context of purely random
trees, Scornet (2016b) developed an explicit connection between random forests and kernel
methods. This distance is induced by the frequency over trees with which an observation
falls within the same leaf as the point of interest; these are often referred to as “potential
nearest neighbors” (Lin and Jeon, 2006). When the tree structure is obtained from data, its
adaptivity results in a kernel that, heuristically, locally mimics the contours of the target
of interest, although we are aware of few results studying this. This adaptivity has been
used to produce prototypes for classification (Tan et al., 2020), as the kernel in a local linear regression (Friedberg et al., 2020), which demonstrates an improvement in the resulting convergence rate, and to provide prediction intervals based on out-of-bag residuals (Lu and Hardin, 2021) or to conduct more formal quantile regression (Li et al., 2024).
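As a concrete illustration of this induced kernel (a sketch of ours, not the construction of any specific paper), scikit-learn's apply method returns the leaf index of each point in each tree, from which the proximity, i.e. the fraction of trees in which two points share a leaf, follows directly:

```python
# Sketch (ours) of the forest-induced kernel: the proximity between two points
# is the fraction of trees in which they fall in the same leaf.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X[:, 0] + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def forest_proximity(rf, A, B):
    """Proximity matrix K[i, j] = fraction of trees where A[i] and B[j] share a leaf."""
    leaves_A = rf.apply(A)        # shape (len(A), n_trees): leaf index per tree
    leaves_B = rf.apply(B)
    return (leaves_A[:, None, :] == leaves_B[None, :, :]).mean(axis=2)

K = forest_proximity(rf, X[:5], X[:5])
print(np.round(K, 2))
```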

4 CENTRAL LIMIT THEOREMS AND VARIANCE ESTIMATION
While much of the analysis of convergence rates rests on the analysis of tree building al-
gorithms, providing uncertainty quantification in terms of confidence intervals or variance
estimates for fˆn (x) has generally focused on the ensembling structure of the random forest.
Here it will be helpful to explicitly re-write the subsampling step into the random forest (1):
\[
\hat f_n(x) = \frac{1}{M}\sum_{m=1}^{M} T(x;\,\omega_m, \mathcal{D}_n^{I_m}),
\]
in which I_m is an a-vector of the indices 1, . . . , n and D_n^{I_m} selects the corresponding set of observations (X_{i_1}, Y_{i_1}), . . . , (X_{i_a}, Y_{i_a}). Breiman's original forest employed a bootstrap sample, generating a = n observations sampled with replacement, but employing subsampling with
or without replacement has proved valuable for variance estimation as well as consistency.
Early approaches to estimating var(fˆn (x)) centered around variations of the bootstrap
or jackknife for ensemble methods. Sexton and Laake (2009) employed a "bootstrap of little bags" approach: generate B ensembles \hat f_n^1(x), . . . , \hat f_n^B(x), each of M/B trees, resulting in a final prediction \frac{1}{B}\sum_{b=1}^{B}\hat f_n^b(x), with uncertainty obtained via the standard deviation among the \hat f_n^b(x) together with a correction for the bias of the standard deviation; the
consistency of this estimate was later demonstrated by Athey et al. (2019).
In a similar manner, methods based around the infinitesimal jackknife were developed
by examining the infinite forest (2). The infinitesimal jackknife operates by attaching equal
weights to each observation and examining the directional derivative Ui – the gradient of
the estimate with respect to the weight given to observation i. For ensemble methods, Efron (2014) attached weights in the form of selection probabilities when generating I but not for how observations are used subsequently; this also reflects the implementation of weights within the randomForest package in R (Liaw and Wiener, 2002). The application of this to random forests is given in Wager and Walther (2015), where the resulting directional derivatives are given by
\[
U_i = \mathrm{cov}_I\big(T(x;\,\mathcal{D}_n^{I},\omega),\, N(I)_i\big),
\]


where N (I)i indicates the number of times observation i is repeated in sample I. For a finite
forest, the theoretical covariance above is simply replaced with the empirical covariance over
the subsamples Im . We then obtain an estimate
\[
\widehat{\mathrm{var}}\big(\hat f_n(x)\big) = \sum_{i=1}^{n} U_i^2,
\]
although this can be substantially biased unless M is much larger than usually employed in
practice.
Beyond variance estimates, inferential tools like confidence intervals also require distri-
butional results. For random forests, these have been obtained by observing a connection
between an infinite forest and the classical construction of U-statistics (Hoeffding, 1948)
when the sets Im are obtained from subsamples taken without replacement. Mentch and
Hooker (2016), Wager and Athey (2018), Peng et al. (2022) developed versions of a central
limit theorem extending basic U -statistics constructions. To formally state these results,
we examine a forest of M trees obtained using subsamples of size an and define the Hájek
projection

gc ((x1 , y1 ), . . . , (xc , yc )) = E[T (x; DnI , ω)|(X1 , Y1 ), . . . , (Xc , Yc )]

to be the expected value of T conditioned on the first c data points with gan understood to
also include ω, and define the corresponding variance components

ζc,an = var[gc ((X1 , Y1 ), . . . , (Xc , Yc ))].

Then, so long as limn→∞ (an /n)(ζan ,an /an ζ1,an ) = 0 and the response Y is appropriately
regular, we have
fˆ (x) − E[fˆn (x)] d
p n → N (0, 1).
2
an ζ1,an /n + ζan ,an /M

The limit condition above reflects a heuristic that \zeta_{a_n,a_n} \propto a_n\zeta_{1,a_n}, reducing the condition to a_n = o(n). Note here that the centering in this theorem is given by E[\hat f_n(x)] rather than
a target of the form E[Y |X = x]. This is because the construction of the result ignores
the tree-building process and cannot guarantee that the bias in fˆn (x) is ignorable. In that
sense, this result cannot be used to guarantee that confidence intervals derived from it
are consistent. One can argue that the stability of the forest is of interest independently
of formal coverage guarantees. Alternatively, Wager and Athey (2018) do demonstrate a CLT centered on E[Y | X = x] by a careful control of the tree-building process and by choosing a subsampling rate a_n = n^{\beta} for a range of β that depends on constraints on the asymmetry of allowable splits. This results in a general a_n/n = n^{-1+\beta} convergence rate.
Athey et al. (2019) demonstrate the asymptotic equivalence of generalized random forests
to a U-statistic with equivalent statistical properties; similar analyses are carried out for
random survival forests in Cui et al. (2023).
Peng et al. (2022) also provided a Berry-Esseen theorem for complete U -statistics, that
is, assuming that M is large enough to be ignorable, resulting in
\[
\sup_{z\in\mathbb{R}}\; P\!\left(\frac{\hat f_n(x) - \mathbb{E}[\hat f_n(x)]}{\sqrt{a_n^2\,\zeta_{1,a_n}/n}} < z\right) - \Phi(z) \;<\; \frac{6.1\,\mathbb{E}[|g_1|^3]}{\sqrt{n}\,\zeta_{1,a_n}^{3/2}} + (1+\sqrt{2})\left(\frac{a_n\,\zeta_{a_n,a_n}}{n\,a_n\,\zeta_{1,a_n}}\right)^{1/2}.
\]

Here, for fixed subsample sizes a_n, the bound has a classical n^{-1/2} rate, but specific rates in the infinite-order case depend on the scaling of the variance components (a_n\zeta_{1,a_n}, \zeta_{a_n,a_n}); as for the CLT above, assuming these remain bounded, a_n = o(n) suffices for the bound to
vanish asymptotically.
Within the context of estimating variances, both the infinitesimal jackknife and bootstrap-
of-little-bags can be shown to be consistent for ζ1,n . However, both employ the variance over
a Monte Carlo approximation to quantities defined for an infinite forest – either a forest built
on a bootstrap sample, or cov(T, N (I)); this was also the case for the two-level estimate
of Mentch and Hooker (2016). As such, the estimated variance is inflated by the Monte
Carlo variance of the estimate, and this bias can be substantial unless much larger forests
are built than are needed for good prediction. While intuitive estimates of the Monte Carlo
variance are available, they are themselves inflated by the without-replacement sampling
scheme which induces negative correlation between trees, conditional on the data. Zhou
et al. (2021) instead examined a forest built using samples taken with replacement in which
subsamples can be viewed as simulations from the empirical distribution and provided a
general estimation scheme, defining
\[
\mu_i = \frac{\sum_{m=1}^{M} N_{im}\, T(x;\,\mathcal{D}^{I_m},\omega_m)}{\sum_{m=1}^{M} N_{im}}, \qquad i = 1, \dots, n,
\]
\[
SS_\tau = \sum_{i=1}^{n}\sum_{m=1}^{M} N_{im}\,(\mu_i - \bar\mu)^2, \qquad SS_\epsilon = \sum_{i=1}^{n}\sum_{m=1}^{M} N_{im}\,(T_m - \mu_i)^2,
\]
\[
\hat\zeta_{1,a_n} = \frac{SS_\tau - (n-1)\,SS_\epsilon/(M a_n - n)}{M a_n - \sum_i\big(\sum_m N_{im}\big)^2/(M a_n)}.
\]
Here the µi are the means of trees for which observation i was included in the subsample,
weighted by its inclusion frequency and the resulting ζ̂1,an is unbiased for data simulated
from the empirical distribution. The use of subsamples with replacement moves forests from
U -statistics to being V -statistics. The classical equivalence between these (e.g. Van der
Vaart, 2000) only holds when a_n = o(n^{1/4}) (Shieh, 1994); Zhou et al. (2021) re-express a general V-statistic as a U-statistic with a modified kernel satisfying the equivalent central
limit theorem, and the variance estimates above remain consistent and unbiased for data
generated from the empirical distribution.
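A direct transcription of this estimator (our sketch, assuming NumPy; variable names are ours) from per-tree predictions and with-replacement inclusion counts reads as follows.

```python
# Sketch (ours) of the variance-component estimator of Zhou et al. (2021)
# for forests built on subsamples drawn WITH replacement.
import numpy as np

def zeta1_hat(tree_preds, counts):
    """tree_preds: (M,) tree predictions at x; counts: (M, n) with counts[m, i] = N_im,
       each row summing to the subsample size a_n."""
    M, n = counts.shape
    a_n = counts[0].sum()
    w_i = counts.sum(axis=0)                              # total inclusion count of obs i
    mu_i = counts.T @ tree_preds / np.maximum(w_i, 1)     # per-observation means
    mu_bar = (w_i * mu_i).sum() / (M * a_n)               # grand (weighted) mean
    ss_tau = (w_i * (mu_i - mu_bar) ** 2).sum()
    ss_eps = (counts * (tree_preds[:, None] - mu_i[None, :]) ** 2).sum()
    denom = M * a_n - (w_i ** 2).sum() / (M * a_n)
    return (ss_tau - (n - 1) * ss_eps / (M * a_n - n)) / denom

# Toy illustration with synthetic tree predictions and inclusion counts.
rng = np.random.default_rng(0)
M, n, a_n = 5000, 200, 50
counts = rng.multinomial(a_n, np.ones(n) / n, size=M)     # with-replacement subsamples
tree_preds = counts @ rng.normal(size=n) / a_n + 0.1 * rng.normal(size=M)
print(zeta1_hat(tree_preds, counts))
```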
The distributional results defined here only characterize the variability of fˆn (x); they do
not provide prediction intervals. These can be obtained by quantile regression (Meinshausen and Ridgeway, 2006), or by the out-of-bag form of conformal inference (Angelopoulos et al., 2023) developed by Lu and Hardin (2021). Bias corrections have also been studied in Hooker and Mentch (2018), Lu and Hardin (2021), and Ghosal and Hooker (2020); these all add a correction ĝ_n(x) to f̂_n(x), which improved predictive performance on a range of regression problems, often very significantly. Ghosal et al. (2025) demonstrate the extension of infinitesimal jackknife (IJ) methods to estimate the covariance between models to which IJ is applicable
and hence to iterative bias corrections of this form. Finally, note that these distributional
results rest entirely on the subsampling process of random forests rather than the specifics
of tree construction. They thus apply more generally to any bagged estimator. They also
apply to any quantity calculated from individual trees and then averaged over the forest;
that includes all of the variable importance measures studied below.

5 VARIABLE IMPORTANCE
The original implementation of random forests comes with two importance measures, widely used in practice: the Mean Decrease in Impurity (MDI) and the Mean Decrease in Accuracy (MDA). We review below their theoretical properties and the improvements that have been proposed recently. Although relatively rarely employed, central limit theorems for these measures follow immediately by repurposing the distributional theory for random forests and observing that each is a quantity defined from individual trees that is then averaged over the forest, thus allowing uncertainty quantification. (MDA can be defined with respect to a generic model and, when implemented generically, does not take this form; however, Breiman (2001) defined variable importance as an average over trees, taken using out-of-bag samples.)

5.1 Mean Decrease in Impurity (MDI)


The MDI (Breiman, 2002) of any variable X (j) is defined for each tree of the forest as the
weighted sum of the impurity reduction in each node that splits along the jth variable, the
weights being simply the fraction of data points falling into each node. More formally, the
MDI of the jth variable, computed via a tree T, is defined as
\[
\widehat{\mathrm{MDI}}_T(X^{(j)}) = \sum_{\substack{A\in T \\ j_{n,A}=j}} p_{n,A}\, L_{n,A}(j_{n,A}, z_{n,A}),
\]

where L_{n,A}(j_{n,A}, z_{n,A}) is the impurity reduction obtained in cell A by splitting along the chosen direction j_{n,A} at the chosen position z_{n,A}, and p_{n,A} is the fraction of data points falling into A. The MDI of a forest is then simply the average of the MDI values computed via the trees that compose it.
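In scikit-learn, a normalized version of the MDI is exposed as feature_importances_, both per tree and for the whole forest; a brief usage sketch (ours):

```python
# Sketch (ours): the MDI as exposed by scikit-learn, per tree and averaged over
# the forest (scikit-learn reports a normalized version of the impurity decrease).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.uniform(size=(n, d))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)   # only the first two features matter

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

per_tree_mdi = np.array([t.feature_importances_ for t in rf.estimators_])
print("forest MDI  :", np.round(rf.feature_importances_, 3))
print("tree average:", np.round(per_tree_mdi.mean(axis=0), 3))
```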
Louppe et al. (2013) established one of the first expressions of the population version
of the MDI, for fully-grown trees built on categorical covariates. In this context, a tree
splits on all possible covariates and deriving the MDI expression reduces to enumerating all
eligible splits. In this scenario, Louppe et al. (2013) proved that the sum of the MDI on all
input variables is equal to the mutual information of Y and X (1) , . . . , X (d) , highlighting that
MDI is a natural importance measure that is spread through all input variables. Louppe
et al. (2013) also establish that the importance of irrelevant features (independent of Y
conditional on any subset of covariates) is zero. While interesting, the setting presented in
Louppe et al. (2013) does not make it possible to study the impact of split positions on MDI performance (all variables are categorical and split so that each node corresponds to a modality) or of tree depth (trees are fully grown).
The CART splitting procedure is more likely to select variables with a large number
of split points (such as continuous features or categorical variables with many values). As a
consequence, the MDI is positively biased towards such variables (see, e.g., Nicodemus, 2011,
Boulesteix et al., 2012). Such a phenomenon was first noticed by Breiman et al. (1984) (see
also Strobl et al., 2007). Similarly, MDI of variables containing many missing values may
be overinflated (see, e.g., Kim and Loh, 2001).
In order to reduce these intrinsic biases, several works proposed to replace the usual
splitting procedure in CART by statistical tests, such as F-test on ordered variables (FACT,
Loh and Vanichsetakul, 1988), chi-squared tests on categorical ones (see algorithms QUEST
and CRUISE in Loh and Shih, 1997, Kim and Loh, 2001), permutation tests, or tests on the
maximal value of the Gini criterion (Strobl et al., 2007). To the best of our knowledge, there exist no detailed theoretical analyses of these procedures.
Even in the absence of dependencies between inputs and outputs, variables with many
categories are more likely to be split (Zhou and Hooker, 2021). This phenomenon is related
to multiple testing problems, since the quality of many split points is evaluated to choose
the best one. Because a unique dataset is used to build the tree and evaluate the MDI,
the resulting MDI may be positively biased for such high cardinality variables. In order to
remove this bias, Zhou and Hooker (2021) propose using two different datasets to respectively
build the tree and compute the MDI. Li et al. (2019) propose another way to use an extra-
sample based on a rewriting of the MDI of any variable X (j) as
n
1X
fT,j (Xi )Yi , (11)
n i=1

where
X 
fT,j (x) = ȲAL 1x∈AL + ȲAR 1x∈AR − ȲA 1x∈A . (12)
A∈T,jA =j

Equation (11) shows that MDI can be seen as a covariance between the output and a
function fT,j related to the tree partition. Li et al. (2019) propose to use a part of the data
to build the partition (and thus fT,j ) and the other to estimate the MDI via Equation (11),
employing similar motivation to that of honesty in Wager and Athey (2018).
Deep trees tend to produce biased MDI. Indeed, Scornet (2023) proves that the sum of all
MDI produced by a fully-grown tree tends to the mutual information of Y and X (1) , . . . , X (d)
plus the variance of the model noise. Any input variable can have its MDI modified by an additive factor of the noise variance. This phenomenon highlights the asymptotic empirical
bias of fully grown trees. A finite sample analysis of the same phenomenon was proposed by
Li et al. (2019). Consider a model with d input variables, whose noise variables – features
assumed independent of all other quantities – are indexed by k ∈ S ⊂ {1, . . . , d}. For any
tree T , let
\[
G_0(T) = \sum_{j\in S}\widehat{\mathrm{MDI}}_T(X^{(j)}) \qquad (13)
\]

be the sum of importances of noise variables. Then, letting Tn (mn , kn ) be the set of decision
trees whose leaves contain more than mn observations and whose depth is less than kn , there
exists a constant C depending on ∥f^\star∥_∞ such that, for all n large enough,
\[
\frac{\log d}{C m_n} \;\le\; \mathbb{E}\Big[\sup_{T\in\mathcal{T}_n(m_n,k_n)} G_0(T)\Big] \;\le\; C\,\frac{k_n \log(nd)}{m_n}, \qquad (14)
\]

where the lower bound holds when f ⋆ = 0 and mn ≳ log n. The population version of the
sum of MDIs of irrelevant features is zero, but its empirical counterpart is lower bounded
by (\log d)/(C m_n). Thus, part of the MDI bias is related to the number of observations in each
leaf. Increasing the number of observations per leaf, or equivalently, reducing the depth
of the tree can help to diminish this bias. However, it may also decrease the predictive
performance of the tree, therefore leading to erroneous MDI.
Two different types of MDI biases have been identified. Biases (related to the splitting
criterion) in favor of several types of variables can be reduced by computing MDI on a dataset
that was not used during the training phase. Biases due to tree depths (or equivalently, low
numbers of observations per leaf) can be reduced by simply limiting the tree depth. These
two types of bias act separately: for instance, decreasing tree depth reduces the second
bias but has no effect on the first. Note that the standard implementation of RF (i) uses
a unique dataset to build trees and compute MDI and (ii) builds fully-grown trees, thus
exacerbating MDI biases. Finally, the population value targeted by MDI remains unclear
outside of specific generating structures such as generalized additive models (Scornet, 2023).

5.2 Mean Decrease in Accuracy (MDA)


Mean Decrease in Accuracy (MDA, see Breiman, 2001) is built on a random permutation
of the values of a given variable in the test set. The difference between the RF predictive
performance on this modified test set and on the original test set is called the MDA. MDA
is also called permutation importance and can be used for any learner (Fisher et al., 2019),
including neural networks (Gevrey et al., 2003). When computing permutation importance
within RF, the predictive performance of each tree is measured and the result aggregated,
thus making it amenable to the variance estimation methods discussed in Section 4 and allowing each tree's out-of-bag samples to be used in lieu of a test set.
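As a usage sketch (ours), scikit-learn's permutation_importance implements the permutation principle on a held-out test set; it differs from Breiman's original out-of-bag MDA mainly in that it uses a single held-out set rather than each tree's out-of-bag samples:

```python
# Sketch (ours): permutation importance on a held-out test set via scikit-learn.
# This follows the MDA principle (permute one variable, measure the drop in
# performance), although Breiman's original MDA uses each tree's out-of-bag samples.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(1500, 5))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
print(np.round(result.importances_mean, 3))   # mean decrease in R^2 per feature
```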
As illustrated by Strobl et al. (2008), MDA is biased in favor of correlated inputs since
(i) splits are more likely to be performed along correlated inputs and (ii) permutation
importance itself is biased in favor of correlated inputs. To understand this last point, note
that if the population MDA is nonzero, then there exists a dependence between X (j) and
(Y, X (−j) ), where X (−j) is the set of all input variables different from X (j) . Thus MDA,
which is meant to assess the link between X (j) and Y , is biased as soon as X (j) and X (−j)
are not independent. Strobl et al. (2008) propose to overcome this limitation by computing
conditional permutation importance.

Another limitation of unconditional permutation is due to the nature of the permuted
samples. Indeed, Hooker et al. (2021) explain that, because of correlation between inputs,
permuting values of a given variable may create unrealistic observations. Thus, MDA eval-
uates the extrapolation properties of RF rather than the importance of the given variable.
The first theoretical analysis of permutation importance for trees is by Ishwaran (2007).
In order to evaluate the importance of a variable X (j) , the error of a tree on noisy obser-
vations is computed, where the noise comes from randomly assigning each observation to the left or right node whenever a split occurs along X^{(j)}, and doing the same for all subsequent splits
in the tree. Ishwaran (2007) establishes the limit of such an importance measure, which
resembles the permutation importance. Gregorutti et al. (2017) were the first to provide
the population expression of the MDA, in an additive model with Gaussian inputs.
Bénard et al. (2022) establish the convergence of MDA and interpret the limit quantity
by the yardstick of sensitivity analysis. Denote the Total Sobol Index of X (j) , corresponding
to the proportion of explained variance lost when X (j) is removed from the full model, by

\[
ST^{(j)} = \frac{\mathbb{E}\big[\mathbb{V}[f^\star(X)\,|\,X^{(-j)}]\big]}{\mathbb{V}[Y]}. \qquad (15)
\]
Then, considering a random forest composed of M trees with tn leaves, built with subsamples
of size a_n (with t_n/a_n → 0 and a_n/n ≤ 1 − κ for some κ ∈ (0, 1)), Bénard et al. (2022) prove that the MDA of such a forest satisfies
\[
\widehat{\mathrm{MDA}}_{M,n}(X^{(j)}) \;\to\; \mathbb{V}[Y]\times ST^{(j)} + \mathbb{V}[Y]\times ST^{(j)}_{mg} + MDA_3^{\star(j)}, \qquad (16)
\]
where, letting X_{\pi j} denote the vector X whose jth component is replaced by an independent copy,
\[
ST^{(j)}_{mg} = \frac{\mathbb{E}\big[\mathbb{V}[f^\star(X_{\pi j})\,|\,X^{(-j)}]\big]}{\mathbb{V}[Y]}, \qquad (17)
\]
and MDA_3^{\star(j)} is a positive quantity with a closed-form expression.
in (16) is difficult to interpret, and does not correspond to any known sensitivity index
(Bénard et al., 2022). When performing backward selection, starting from the model with
d covariates and eliminating iteratively the variable that decreases the least the predictive
performances, the variable to discard is the one with the smallest ST (j) . However, it can be
proved that the quantity in (16) reduces to ST (j) only in the case of independent features.
Such a result highlights the drawback of MDA in presence of correlated inputs.
Williamson et al. (2021, 2023) consider variable importance that can be written as the
difference in predictive capacity of a model trained with the full set of input variable com-
pared to that trained without the variable(s) of interest. Computing such measures require
retraining the whole model for each variable/group of variables whose importance needs to
be measured, which can be computationally intractable. Such measures have been studied
for other predictors than random forests by Lei et al. (2018). Other methods to evaluate
variable importance have been proposed (see, e.g., Section 5 in Hooker et al., 2021). Bénard
et al. (2022) note that from a forest trained on the full dataset, one can extract a forest
trained on a subset of variables, the so-called projected forests. A variable importance can
be computed with such forests, which is proved to target the quantity ST (j) .
Ishwaran and Lu (2019) (Sections 3.2 and 4.4) establish asymptotic confidence intervals for the MDA based on subsampling. Theorem 1 in Williamson et al. (2023) establishes the

17
asymptotic normality of generic importance measures that are computed as the difference
in predictive performance between a predictor trained on all variables and the same predictor
trained on a strict subset.
One could be tempted to use such results to test whether a variable is important. How-
ever, in the presence of a variable of null importance, asymptotic normality does not hold,
and a specific test must be designed (see Section 3.4 in Williamson et al., 2023). Hooker
et al. (2021) and Mentch and Zhou (2022) establish that the testing procedure described
in Williamson et al. (2023) has a tendency to incorrectly reject the null hypothesis (i.e.,
to identify irrelevant variables as relevant). Indeed, Mentch and Zhou (2022) prove that
adding noise features (independent of the output) can increase the predictive performance
of random forests. This is not surprising, as this can be seen as a regularization procedure
(Mentch and Zhou, 2022). However, this calls into question the interpretation drawn from such tests: the null hypothesis for a variable can be rejected, implying improved predictive performance when the variable is added to the model, even though the variable itself is not related to the output (see the discussion in Section 5.1 of Mentch and Zhou, 2022).
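The regularization effect of irrelevant features can be probed with a small simulation; the setup below is only a hypothetical illustration in the spirit of Mentch and Zhou (2022), not their experiment, and whether the augmented forest wins depends on the signal-to-noise ratio and the forest's own randomization.

```python
# Toy probe of "noise features as regularization": compare cross-validated R^2
# of a random forest with and without appended pure-noise columns.
# This is an illustrative setup, not the experiment of Mentch and Zhou (2022).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       noise=10.0, random_state=2)
rng = np.random.default_rng(2)
X_aug = np.hstack([X, rng.standard_normal((X.shape[0], 20))])  # 20 noise features

rf = RandomForestRegressor(n_estimators=300, random_state=2)
print("R^2 without noise features:", cross_val_score(rf, X, y, cv=5).mean())
print("R^2 with noise features:   ", cross_val_score(rf, X_aug, y, cv=5).mean())
```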

5.3 Tree SHAP


Both MDI and MDA measure the importance of each input through its contribution to the forest as a whole. Local feature importance instead asks how features contribute to the specific prediction $f_n(x)$ at a given $x$. A number of model-agnostic feature attribution methods have been proposed, based on gradients (Simonyan et al., 2014) or local sparse linear approximations (Ribeiro et al., 2016), which have developed a significant literature. Among these methods, Shapley values (Lundberg and Lee, 2017) are particularly notable for being able to leverage the structure of trees to significantly improve computation.
Shapley values borrow from game-theoretic considerations for valuing individuals in cooperative games (Shapley, 1953). They partition the total reward from a game among its participants in a manner originally designed to satisfy a set of desiderata associated with fair rewards, and are based on combining the rewards (or value functions) that would be obtained for every subset of the participants. In the context of local interpretation, Lundberg and Lee (2017) suggest examining $f_S(x_S)$: the prediction that we would make if we were only given access to the features whose indices are in the set $S$. Rather than fitting new models, this is taken to be the expected value of the prediction over the features not in $S$:
$$f_S(x_S) = \mathbb{E}\, f(x_S, X_{-S}),$$
where the expectation can be taken with respect to the marginal distribution of $X_{-S}$ (yielding "interventional" Shapley values) or to its conditional distribution $X_{-S} \mid x_S$ ("conditional" Shapley values), both of which are obtained by Monte-Carlo averaging for a generic $f$.
Formally, the Shapley value of feature $j$ is then defined by averaging its change to the prediction when it enters the model, with the average taken over permutations of the ordering of the features. Letting $\pi = (\pi_1, \ldots, \pi_d)$ be a permutation of the feature indices and $S_j^{\pi}$ the set of indices appearing before $j$ in $\pi$, we define the $j$th SHAP value as
$$\phi_j = \frac{1}{d!} \sum_{\pi} \Big( f_{S_j^{\pi} \cup \{j\}}\big(x_{S_j^{\pi} \cup \{j\}}\big) - f_{S_j^{\pi}}\big(x_{S_j^{\pi}}\big) \Big). \qquad (18)$$
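For small $d$, (18) can be evaluated directly by enumerating all $d!$ permutations and estimating the value function $f_S(x_S)$ by marginal (interventional) averaging over a background sample. The sketch below does exactly this; `model` (any fitted regressor with a `predict` method), the query point `x`, and the background matrix `X_bg` are hypothetical placeholders.

```python
# Brute-force SHAP values following (18): the value function f_S(x_S) is the
# interventional one, estimated by averaging predictions over a background
# sample X_bg.  `model`, `x` and `X_bg` are assumed given (placeholders).
from itertools import permutations
from math import factorial
import numpy as np

def value_function(model, x, S, X_bg):
    """Estimate f_S(x_S) by clamping the features in S at x and averaging."""
    X_mix = X_bg.copy()
    cols = list(S)
    if cols:
        X_mix[:, cols] = x[cols]
    return model.predict(X_mix).mean()

def shapley_values(model, x, X_bg):
    d = x.shape[0]
    phi = np.zeros(d)
    for pi in permutations(range(d)):      # d! orderings: feasible only for small d
        S = set()
        for j in pi:                       # j enters the model after the features in S
            gain = (value_function(model, x, S | {j}, X_bg)
                    - value_function(model, x, S, X_bg))
            phi[j] += gain
            S.add(j)
    return phi / factorial(d)
```

This costs on the order of $d \cdot d!$ evaluations of the value function, which is precisely what the tree-specific shortcuts described below avoid.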

SHAP has become one of the most common feature attribution methods in machine learning (see, e.g., Holzinger et al., 2020; Molnar, 2020), although critiques point to its computational expense, its rather abstract definition, and a disjunction between the desiderata used to define Shapley values and the specific properties needed for variable importance (Verdinelli and Wasserman, 2024).
In the context of tree-based models, Shapley values can be computed efficiently based on two properties. First, the conditional expectations involved in calculating $f_S(x_S)$ can be obtained analytically as weighted averages over the leaves; this is particularly simple in the case of interventional Shapley values (see Bénard et al., 2022). Second, the set of permutations to be considered can be substantially reduced: only those $\pi$ for which some feature used in the nodes encountered by $x$ appears after $j$ matter, and only for the trees in which feature $j$ appears along that path (see Lundberg et al., 2018, Laberge and Pequignot, 2022, for algorithmic details).
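In practice these shortcuts are implemented, for instance, in the `shap` Python package. The snippet below is a hedged sketch assuming a fitted scikit-learn forest `rf` and a data matrix `X`; the exact arguments, in particular `feature_perturbation`, may vary across `shap` versions.

```python
# Hedged sketch of TreeSHAP via the `shap` package; `rf` is a fitted tree
# ensemble and `X` a matrix of observations (both assumed given).  Argument
# names such as feature_perturbation may differ across shap versions.
import shap

explainer = shap.TreeExplainer(rf, data=X, feature_perturbation="interventional")
phi = explainer.shap_values(X[:100])   # one attribution per (observation, feature)
```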
Few theoretical studies of Shapley values have been carried out, but we observe that the consistency of a random forest also translates into consistency of the conditional expectations in (18). As noted above, these values are also expressed as averages over values calculated for each tree, for which a central limit theorem is also available.

6 DISCUSSION
The last two decades have seen important advances in random forest theory. Designing and studying random forest variants has led to a better understanding of the role of each ingredient (subsampling, feature randomization, tree depth) in the predictive performance of RF. However, quite notably, the roles of two ingredients remain largely unaddressed: the impact of fully-developed trees and the number of features eligible for splitting. Indeed, they are intrinsically difficult to analyze: many nodes in fully-developed trees contain very few observations (while statistical guarantees require many data points), and analyzing the number of preselected features requires a (data-dependent) way of selecting the best split, which is a difficulty in its own right.
Additionally, a unique RF model that allows for studying possible interactions between
these ingredients has yet to be found. Most models are tailored for the analysis of one
ingredient only. For example, centered forests are well suited for the analysis of split ran-
domization, but make the analysis of fully-developed trees impossible.
Most analyses assess the impact of RF hyperparameters on predictive performance, but their impact on variable importance remains largely unclear. For example, RFs of deep trees have good predictive performance while the resulting variable importances (either MDI or MDA) are clearly biased. Understanding the tradeoff between predictive performance and variable importance (that is, interpretability) is a promising avenue for future research.
Finally, the greatest source of open problems involves understanding adaptively chosen
splits. Adaptation is often listed as one of the key advantages of tree-based models, but
mathematical results supporting this are limited. While conditions such as regularity and
honesty in Wager and Athey (2018) allow for adaptively chosen splits, they are designed to provide a means of ignoring adaptivity within the mathematical analysis. A fully developed theory would
provide a finer-level description of the adaptation of node geometry to the target model,
elucidating the role of adaptive tree kernels both for secondary uses as in Tan et al. (2020)
and for combinations of random forest models (Ghosal et al., 2025), as well as controlling
the dependence between leaf values and tree structures.

References
Agarwal, A., A. M. Kenney, Y. S. Tan, T. M. Tang, and B. Yu (2023). Mdi+: A flexible
random forest-based feature importance framework. arXiv preprint arXiv:2307.01932 .

Angelopoulos, A. N., S. Bates, et al. (2023). Conformal prediction: A gentle introduction.
Foundations and Trends in Machine Learning 16 (4), 494–591.

Arlot, S. and R. Genuer (2014). Analysis of purely random forests bias. arXiv preprint
arXiv:1407.3939 .
Arlot, S. and R. Genuer (2016). Comments on: A random forest guided tour. Test 25,
228–238.
Arnould, L., C. Boyer, and E. Scornet (2023). Is interpolation benign for random forest
regression? In International Conference on Artificial Intelligence and Statistics, pp.
5493–5548. PMLR.
Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of
Statistics 47 (2), 1148 – 1178.

Basu, S., K. Kumbier, J. B. Brown, and B. Yu (2018). Iterative random forests to discover
predictive and stable high-order interactions. Proceedings of the National Academy of
Sciences 115 (8), 1943–1948.
Belkin, M., D. Hsu, S. Ma, and S. Mandal (2019). Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy of
Sciences 116 (32), 15849–15854.

Bénard, C., S. Da Veiga, and E. Scornet (2022). Mean decrease accuracy for random forests:
inconsistency, and a practical solution via the sobol-mda. Biometrika 109 (4), 881–900.
Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning
Research 13, 1063–1095.

Biau, G., L. Devroye, and G. Lugosi (2008). Consistency of random forests and other
averaging classifiers. Journal of Machine Learning Research 9 (9).
Biau, G. and E. Scornet (2016). A random forest guided tour. Test 25, 197–227.
Boulesteix, A.-L., A. Bender, J. Lorenzo Bermejo, and C. Strobl (2012). Random forest gini
importance favours snps with large minor allele frequency: impact, sources and recom-
mendations. Briefings in Bioinformatics 13 (3), 292–304.
Boulesteix, A.-L., S. Janitza, J. Kruppa, and I. R. König (2012). Overview of random forest
methodology and practical guidance with emphasis on computational biology and bioin-
formatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2 (6),
493–507.
Breiman, L. (2001). Random forests. Machine learning 45, 5–32.
Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3. 1.
Statistics Department University of California Berkeley, CA, USA 1 (58), 3–42.

Breiman, L. (2004). Consistency for a simple model of random forests.
Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and regression
trees. CRC press.
Cattaneo, M. D., R. Chandak, and J. M. Klusowski (2024). Convergence rates of oblique
regression trees for flexible function libraries. The Annals of Statistics 52 (2), 466–490.
Chaudhuri, P., M.-C. Huang, W.-Y. Loh, and R. Yao (1994). Piecewise-polynomial regres-
sion trees. Statistica Sinica, 143–167.
Chi, C.-M., P. Vossler, Y. Fan, and J. Lv (2022). Asymptotic properties of high-dimensional
random forests. The Annals of Statistics 50 (6), 3415–3438.

Chipman, H. A., E. I. George, and R. E. McCulloch (2012). Bart: Bayesian additive regression trees. Annals of Applied Statistics 6 (1), 266–298.
Coleman, T., K. Kaufeld, M. F. Dorn, and L. Mentch (2019). Locally optimized random
forests. arXiv preprint arXiv:1908.09967 .

Criminisi, A., J. Shotton, E. Konukoglu, et al. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision 7 (2–3), 81–227.
Cui, Y., M. R. Kosorok, E. Sverdrup, S. Wager, and R. Zhu (2023). Estimating heteroge-
neous treatment effects with right-censored data via causal survival forests. Journal of
the Royal Statistical Society Series B: Statistical Methodology 85 (2), 179–211.
Cui, Y., R. Zhu, M. Zhou, and M. Kosorok (2022). Consistency of survival tree and forest
models. Statistica Sinica 32 (3), 1245–1267.
Curth, A., A. Jeffares, and M. van der Schaar (2024a). A u-turn on double descent: Rethink-
ing parameter counting in statistical learning. Advances in Neural Information Processing
Systems 36.
Curth, A., A. Jeffares, and M. van der Schaar (2024b). Why do random forests work?
understanding tree ensembles as self-regularizing adaptive smoothers. arXiv preprint
arXiv:2402.01502 .

Devroye, L., L. Györfi, and G. Lugosi (2013). A probabilistic theory of pattern recognition,
Volume 31. Springer Science & Business Media.
Duroux, R. and E. Scornet (2018). Impact of subsampling and tree depth on random forests.
ESAIM: Probability and Statistics 22, 96–128.

Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American
Statistical Association 109 (507), 991–1007.
Fink, D., W. M. Hochachka, B. Zuckerberg, D. W. Winkler, B. Shaby, M. A. Munson,
G. Hooker, M. Riedewald, D. Sheldon, and S. Kelling (2010). Spatiotemporal exploratory
models for broad-scale survey data. Ecological Applications 20 (8), 2131–2147.

Fisher, A., C. Rudin, and F. Dominici (2019). All models are wrong, but many are use-
ful: Learning a variable’s importance by studying an entire class of prediction models
simultaneously. Journal of Machine Learning Research 20 (177), 1–81.
Friedberg, R., J. Tibshirani, S. Athey, and S. Wager (2020). Local linear forests. Journal
of Computational and Graphical Statistics 30 (2), 503–517.

Friedman, J. H. and B. E. Popescu (2008). Predictive learning via rule ensembles. Annals of Applied Statistics 2 (3), 916–954.
Geurts, P., D. Ernst, and L. Wehenkel (2006). Extremely randomized trees. Machine
learning 63, 3–42.

Gevrey, M., I. Dimopoulos, and S. Lek (2003). Review and comparison of methods to
study the contribution of variables in artificial neural network models. Ecological mod-
elling 160 (3), 249–264.
Ghosal, I. and G. Hooker (2020). Boosting random forests to reduce bias; one-step boosted
forest and its variance estimate. Journal of Computational and Graphical Statistics 30 (2),
493–502.
Ghosal, I., Y. Zhou, and G. Hooker (2025). The infinitesimal jackknife and combinations of
models. Statistical Science, to appear.
Gregorutti, B., B. Michel, and P. Saint-Pierre (2017). Correlation and variable importance
in random forests. Statistics and Computing 27, 659–678.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. The
Annals of Mathematical Statistics 19 (3), 293 – 325.
Holzinger, A., A. Saranti, C. Molnar, P. Biecek, and W. Samek (2020). Explainable ai
methods-a brief overview. In International workshop on extending explainable AI beyond
deep models and classifiers, pp. 13–38. Springer.
Hooker, G. and L. Mentch (2018). Bootstrap bias corrections for ensemble methods. Statis-
tics and Computing 28, 77–86.
Hooker, G., L. Mentch, and S. Zhou (2021). Unrestricted permutation forces extrapola-
tion: variable importance requires at least one more model, or there is no free variable
importance. Statistics and Computing 31, 1–16.
Hornung, R. and A.-L. Boulesteix (2022). Interaction forests: Identifying and exploiting
interpretable quantitative and qualitative interaction effects. Computational Statistics &
Data Analysis 171, 107460.

Ishwaran, H. (2007). Variable importance in binary regression trees and forests.


Ishwaran, H. (2015). The effect of splitting on random forests. Machine learning 99, 75–118.
Ishwaran, H., U. B. Kogalur, E. H. Blackstone, and M. S. Lauer (2008). Random survival
forests. The Annals of Applied Statistics 2 (3), 841 – 860.

Ishwaran, H. and M. Lu (2019). Standard errors and confidence intervals for variable impor-
tance in random forest regression, classification, and survival. Statistics in medicine 38 (4),
558–582.
Jeong, S. and V. Rockova (2020). The art of bart: On flexibility of bayesian forests. arXiv
preprint arXiv:2008.06620 3 (69), 146.
Kim, H. and W.-Y. Loh (2001). Classification trees with unbiased multiway splits. Journal
of the American Statistical Association 96 (454), 589–604.
Klusowski, J. (2021). Sharp analysis of a simple model for random forests. In International
Conference on Artificial Intelligence and Statistics, pp. 757–765. PMLR.
Klusowski, J. and P. Tian (2022). Large scale prediction with decision trees. Journal of the
American Statistical Association.
Laberge, G. and Y. Pequignot (2022). Understanding interventional treeshap: How and
why it works. arXiv preprint arXiv:2209.15123 .
Lakshminarayanan, B., D. M. Roy, and Y. W. Teh (2014). Mondrian forests: Efficient online
random forests. Advances in neural information processing systems 27.
Lei, J., M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018). Distribution-
free predictive inference for regression. Journal of the American Statistical Associa-
tion 113 (523), 1094–1111.
Li, M., B. Sarmah, D. Desai, J. Rosaler, S. Bhagat, P. Sommer, and D. Mehta (2024).
Quantile regression using random forest proximities. In Proceedings of the 5th ACM
International Conference on AI in Finance, pp. 728–736.
Li, X., Y. Wang, S. Basu, K. Kumbier, and B. Yu (2019). A debiased mdi feature importance
measure for random forests. Advances in Neural Information Processing Systems 32.
Liaw, A. and M. Wiener (2002). Classification and regression by randomforest. R News 2 (3),
18–22.
Lin, Y. and Y. Jeon (2006). Random forests and adaptive nearest neighbors. Journal of the
American Statistical Association 101 (474), 578–590.
Loh, W.-Y. (2014). Fifty years of classification and regression trees. International Statistical
Review 82 (3), 329–348.
Loh, W.-Y. and Y.-S. Shih (1997). Split selection methods for classification trees. Statistica
sinica, 815–840.
Loh, W.-Y. and N. Vanichsetakul (1988). Tree-structured classification via generalized
discriminant analysis. Journal of the American Statistical Association 83 (403), 715–725.
Louppe, G., L. Wehenkel, A. Sutera, and P. Geurts (2013). Understanding variable im-
portances in forests of randomized trees. Advances in neural information processing sys-
tems 26.
Lu, B. and J. Hardin (2021). A unified framework for random forest prediction error esti-
mation. Journal of Machine Learning Research 22 (8), 1–41.

Lundberg, S. M., G. G. Erion, and S.-I. Lee (2018). Consistent individualized feature
attribution for tree ensembles. arXiv preprint arXiv:1802.03888 .
Lundberg, S. M. and S.-I. Lee (2017). A unified approach to interpreting model predictions.
Advances in neural information processing systems 30.

Meinshausen, N. and G. Ridgeway (2006). Quantile regression forests. Journal of Machine Learning Research 7 (6).
Mentch, L. and G. Hooker (2016). Quantifying uncertainty in random forests via confidence
intervals and hypothesis tests. The Journal of Machine Learning Research 17 (1), 841–881.
Mentch, L. and S. Zhou (2020). Randomization as regularization: A degrees of freedom
explanation for random forest success. Journal of Machine Learning Research 21 (171),
1–36.
Mentch, L. and S. Zhou (2022). Getting better from worse: Augmented bagging and a
cautionary tale of variable importance. Journal of Machine Learning Research 23 (224),
1–32.

Molnar, C. (2020). Interpretable machine learning. Lulu.com.


Mourtada, J., S. Gaïffas, and E. Scornet (2020). Minimax optimal rates for mondrian trees and forests. The Annals of Statistics 48 (4), 2253–2276.
Nicodemus, K. K. (2011). On the stability and ranking of predictors from random forest
variable importance measures. Briefings in bioinformatics 12 (4), 369–373.
Peng, W., T. Coleman, and L. Mentch (2022). Rates of convergence for random forests via
generalized u-statistics. Electronic Journal of Statistics 16 (1), 232–292.
Probst, P., M. N. Wright, and A.-L. Boulesteix (2019). Hyperparameters and tuning strate-
gies for random forest. Wiley Interdisciplinary Reviews: data mining and knowledge
discovery 9 (3), e1301.
Rabinowicz, A. and S. Rosset (2022). Tree-based models for correlated data. The Journal
of Machine Learning Research 23 (1), 11802–11832.
Raymaekers, J., P. J. Rousseeuw, T. Verdonck, and R. Yao (2024). Fast linear model trees
by pilot. Machine Learning 113 (9), 6561–6610.
Ribeiro, M. T., S. Singh, and C. Guestrin (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144.

Ročková, V. and S. Van der Pas (2020). Posterior concentration for bayesian regression
trees and forests. The Annals of Statistics 48 (4), 2108–2131.
Roy, D. M., Y. W. Teh, et al. (2008). The mondrian process. In NIPS, Volume 21.
Saha, A., S. Basu, and A. Datta (2021). Random forests for spatially dependent data.
Journal of the American Statistical Association, 1–19.

Scornet, E. (2016a). On the asymptotics of random forests. Journal of Multivariate Analy-
sis 146, 72–83.
Scornet, E. (2016b). Random forests and kernel methods. IEEE Transactions on Informa-
tion Theory 62 (3), 1485–1500.

Scornet, E. (2023). Trees, forests, and impurity-based variable importance in regression. In Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, Volume 59, pp. 21–52. Institut Henri Poincaré.
Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests.
Sexton, J. and P. Laake (2009). Standard errors for bagged and random forest estimators.
Computational Statistics & Data Analysis 53 (3), 801–811.
Shapley, L. S. (1953). A value for n-person games. Contribution to the Theory of Games 2.
Shieh, G. S. (1994). Infinite order v-statistics. Statistics & Probability Letters 20 (1), 75–80.

Simonyan, K., A. Vedaldi, and A. Zisserman (2014). Deep inside convolutional networks:
visualising image classification models and saliency maps. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR). ICLR.
Stone, C. J. (1977). Consistent nonparametric regression. The annals of statistics, 595–620.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The
annals of statistics, 1040–1053.
Stone, C. J. (1985). Additive regression and other nonparametric models. The annals of
Statistics 13 (2), 689–705.
Strobl, C., A.-L. Boulesteix, and T. Augustin (2007). Unbiased split selection for classifi-
cation trees based on the gini index. Computational Statistics & Data Analysis 52 (1),
483–501.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional
variable importance for random forests. BMC bioinformatics 9, 1–11.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest
variable importance measures: Illustrations, sources and a solution. BMC bioinformat-
ics 8 (1), 1–21.
Syrgkanis, V. and M. Zampetakis (2020). Estimation and inference with trees and forests
in high dimensions. In Conference on learning theory, pp. 3453–3454. PMLR.

Tan, S., M. Soloviev, G. Hooker, and M. T. Wells (2020). Tree space prototypes: Another
look at making tree ensembles interpretable. In Proceedings of the 2020 ACM-IMS on
foundations of data science conference, pp. 23–34.
Tan, Y. S., A. Agarwal, and B. Yu (2022). A cautionary tale on fitting decision trees to
data from additive models: generalization lower bounds. In International Conference on
Artificial Intelligence and Statistics, pp. 9663–9685. PMLR.

Tan, Y. S., C. Singh, K. Nasseri, A. Agarwal, J. Duncan, O. Ronen, M. Epland, A. Kornblith,
and B. Yu (2025). Fast interpretable greedy-tree sums. Proceedings of the National
Academy of Sciences 122 (7), e2310151122.
Tang, C., D. Garreau, and U. von Luxburg (2018). When do random forests fail? Advances
in neural information processing systems 31.

Tomita, T. M., J. Browne, C. Shen, J. Chung, J. L. Patsolic, B. Falk, C. E. Priebe, J. Yim, R. Burns, M. Maggioni, et al. (2020). Sparse projection oblique randomer forests. The Journal of Machine Learning Research 21 (1), 4193–4231.
Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge university press.

Verdinelli, I. and L. Wasserman (2024). Feature Importance: A Closer Look at Shapley Values and LOCO. Statistical Science 39 (4), 623–636.
Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association 113 (523), 1228–
1242.

Wager, S. and G. Walther (2015). Adaptive concentration of regression trees, with applica-
tion to random forests. arXiv preprint arXiv:1503.06388 .
Williamson, B. D., P. B. Gilbert, M. Carone, and N. Simon (2021). Nonparametric variable
importance assessment using machine learning techniques. Biometrics 77 (1), 9–22.

Williamson, B. D., P. B. Gilbert, N. R. Simon, and M. Carone (2023). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association 118 (543), 1645–1658.
Wyner, A. J., M. Olson, J. Bleich, and D. Mease (2017). Explaining the success of ad-
aboost and random forests as interpolating classifiers. Journal of Machine Learning Re-
search 18 (48), 1–33.
Zhou, S. and L. Mentch (2023). Trees, forests, chickens, and eggs: when and why to prune
trees in a random forest. Statistical Analysis and Data Mining: The ASA Data Science
Journal 16 (1), 45–64.

Zhou, Z. and G. Hooker (2021). Unbiased measurement of feature importance in tree-based methods. ACM Transactions on Knowledge Discovery from Data (TKDD) 15 (2), 1–21.
Zhou, Z., L. Mentch, and G. Hooker (2021). V-statistics and variance estimation. The
Journal of Machine Learning Research 22 (1), 13112–13159.

