Abstract
Random forests (RF) have a long history, originally defined by Leo Breiman in 2001, but with antecedents in bagging methods introduced in 1996. They have become one of the most widely adopted machine learning tools thanks to their computational efficiency, relative insensitivity to tuning parameters, in-built cross validation, and interpretation tools. Despite their popularity, mathematical theory about the fundamental properties of random forests has been slow to emerge. Nonetheless, the past decade has seen significant advances in our understanding of these methods. In this review paper, we describe several variations of RF and how the rates of consistency of these variants highlight the impact of different RF mechanisms on their performance. Another line of research focuses on establishing central limit theorems and confidence intervals for random forests. We also describe recent analyses of variable importance measures computed with RF.
Keywords: Random forests, excess risk, central limit theorem, variable importance.
Contents
1 INTRODUCTION
1.1 Notations and algorithm
3 VARIATIONS
3.1 Performance Improvements
3.2 Functionality
3.3 Secondary Uses
5 VARIABLE IMPORTANCE
5.1 Mean Decrease in Impurity (MDI)
5.2 Mean Decrease in Accuracy (MDA)
5.3 Tree SHAP
6 DISCUSSION
1 INTRODUCTION
Decision trees are among the most comprehensible machine learning methods: a recursive
partition of the input space is constructed from the data, and the prediction at a new point
results only from an average (or majority vote) of the observations falling within the same
leaf as that point. The instability of these methods in the face of dataset perturbation
and their limited predictive capacity have led to the consideration of tree-based ensemble
methods such as random forests or gradient boosting decision trees. Despite the advent
of deep learning, such methods are part of the state of the art for dealing with tabular
datasets. What’s more, these ensemble methods benefit from the versatility of trees: they
can handle a variety of problems (classification, regression, quantile estimation, survival...),
missing data of different types and are computationally attractive. For a long time, the
statistical properties of random forests remained a mystery. Before delving into the various
directions of random forest research, we start by describing the original algorithm.
where the expectation is taken w.r.t. ω only. Indeed, M can be made large enough to
remove variation associated with ω at the price of additional computing.
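As a small illustration of this point (our own construction, using scikit-learn's RandomForestRegressor as a stand-in tree ensemble), the prediction at a fixed query point varies across independent draws of ω, and this variation shrinks as the number of trees M grows:

```python
# Illustration: with the data held fixed, predictions vary only through the forest
# randomization omega (the seed), and this variation shrinks as M grows.
# scikit-learn's RandomForestRegressor is used here as a stand-in tree ensemble.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.uniform(size=(n, d))
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * rng.normal(size=n)
x0 = np.full((1, d), 0.5)                                   # fixed query point

for M in (10, 100, 500):
    preds = [RandomForestRegressor(n_estimators=M, random_state=seed).fit(X, y).predict(x0)[0]
             for seed in range(10)]                         # ten independent draws of omega
    print(f"M = {M:3d}   std over omega = {np.std(preds):.4f}")
```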
A random forest is thus defined by two choices: the decision tree algorithm and how the
randomization parameters ω_m impact the tree structure/prediction. The original random forest algorithm was designed by Breiman (2001), who proposed aggregating fully-developed CART trees (a specific decision tree algorithm, see Breiman et al., 1984), randomized via data resampling and, at each node, feature subsampling. Since then, the class of random forest-like algorithms has significantly expanded, with variations proposed to improve performance, to extend the applicability of random forests to new tasks or data types, and to produce algorithms that are theoretically tractable or have desirable distributional properties. Here we outline the broad structure of a forest and note the choices that were made in Breiman's original version; many of the results that we detail below are given for specific variations. Note that almost all random forest variants are based on sequentially-chosen dyadic splits, that is, the feature space R^d is recursively divided into cells until some stopping criterion is satisfied.
Data resampling. The original random forest algorithm obtained bootstrap samples
from D_n to create an input D_n^* to each tree, which we characterize using the vector N_im indicating the number of times observation i is repeated in the sample given to the
mth tree. However, most theoretical advances have been made by selecting a random
subset of size k either with (Mentch and Hooker (2016), Wager and Athey (2018))
or without (Scornet (2016a), Zhou et al. (2021)) replacement. It is also possible to
rely solely on randomization within the tree-building process, as is implicitly done in
defining the equivalent kernels described in Scornet (2016b).
Split randomization - Set of potential splits. Most random forest algorithms consider
partitioning the current cell based on the value of one of the dimensions of X. Early
tree building algorithms such as CART (Breiman et al., 1984) considered the midpoint
between each successive data value, along each dimension, as potential splits. Random
forests restricted these to a randomly chosen subset of mtry dimensions. More restric-
tive versions are limited to the medians of the observed data values or the midpoints
of the current node range; others restrict the imbalance in the proportion of data on
each side of the split. Other variants consider splits based on linear combinations of
features.
Tree building process - Best split selection. How to choose between potential splits
selected above? Here random forests follow CART in scoring each split by the decrease
in squared prediction error when the prediction on each side of the split is held constant
(see, e.g., Geurts et al., 2006); many extensions consider criteria based on particular
models. Honest forests (Wager and Athey, 2018) choose among splits based on 1/2 of
the data available to the current tree, reserving the other half to select values in the
leaves. Other random forest variants choose randomly among candidate splits.
Tree building process - Stopping rule. The original random forests ceased splitting
regions with fewer than k observations (with k = 1 or 5 by default). Other variants specify a maximum tree depth, prune trees in the manner of CART, or use stochastically chosen stopping rules.
Tree prediction. The original random forests return a constant value which minimizes
the loss (squared error, or weighted misclassification cost) in each leaf. Honest forests use a held-out set of data to decide on this value, keeping the leaf value independent of the structure of the tree. Other algorithms estimate parametric models in each
leaf (see, for example, Chaudhuri et al., 1994).
Prediction combination. The original random forest algorithm was designed as a pre-
diction tool and produced either an average, if predicting a real-valued quantity, or a
majority vote if a class label was needed. Averaging remains the most common way
to combine the predictions from trees, but weighted averages, or using leaf indicators
within a larger regression model (Agarwal et al., 2023) are also possible.
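To make the catalog of choices above concrete, the following is a minimal, self-contained Python sketch of the generic structure (data resampling, split randomization via mtry, a squared-error split criterion, a minimum-node-size stopping rule, constant leaf predictions, and averaging). It is a toy illustration written for this review, not Breiman's exact algorithm or any particular implementation.

```python
# Toy regression forest following the generic recipe above (not Breiman's exact CART):
# resample the data, at each node draw mtry candidate features, pick the split that most
# reduces the squared error, stop at a minimum node size, predict by leaf means, and
# average the trees.
import numpy as np

def build_tree(X, y, rng, mtry, min_leaf=5):
    if len(y) <= min_leaf or np.all(y == y[0]):
        return {"leaf": True, "value": float(np.mean(y))}
    best = None
    for j in rng.choice(X.shape[1], size=mtry, replace=False):   # split randomization
        for s in np.unique(X[:, j])[:-1]:                        # candidate thresholds
            left = X[:, j] <= s
            if not left.any() or left.all():
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()      # squared-error criterion
            if best is None or sse < best[0]:
                best = (sse, j, s, left)
    if best is None:
        return {"leaf": True, "value": float(np.mean(y))}
    _, j, s, left = best
    return {"leaf": False, "feature": int(j), "threshold": float(s),
            "left": build_tree(X[left], y[left], rng, mtry, min_leaf),
            "right": build_tree(X[~left], y[~left], rng, mtry, min_leaf)}

def predict_tree(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]

def random_forest(X, y, M=50, mtry=None, seed=0):
    rng = np.random.default_rng(seed)
    mtry = mtry or max(1, X.shape[1] // 3)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, len(y), size=len(y))               # bootstrap resampling
        trees.append(build_tree(X[idx], y[idx], rng, mtry))
    return lambda x: float(np.mean([predict_tree(t, x) for t in trees]))

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = X[:, 0] + rng.normal(scale=0.1, size=200)
forest_predict = random_forest(X, y, M=25)
print(forest_predict(X[0]), y[0])
```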
Figure 1 presents a schematic of random forests algorithms. Note that other descriptions
are possible: for instance, the selection of data can be described as a process internal to
the tree. However, the collection of algorithmic choices presented above encompasses most
proposed practical random forest variants, and their mathematical models.
Figure 1: Illustration of a random forest on the left and a tree partition on the right.
In this review, we focus on four different aspects of random forest theory. In Section 2,
we detail different rates of consistency for several random forest models, with the goal of
elucidating the role of each RF hyperparameter on RF predictive performance. In Section 3,
we review variants of RF that have been theoretically studied. In Section 4, we detail
central limit theorems for random forests and the resulting uncertainty quantification. In Section 5, we summarize the analyses of variable importance measures in random forests, in order to provide guidance on how to use them. Finally, a discussion of further research avenues is presented in Section 6.
Several earlier papers already review some aspects of random forests theory. Notably,
Biau and Scornet (2016) provide a detailed review of the theoretical properties of random
forests at that time. From a methodological perspective, Loh (2014) provides a very detailed review of various decision tree algorithms. A description of different applications that can be addressed using random forests (or their variants) is provided in Criminisi et al. (2012). Boulesteix et al. (2012) describe the existing implementations of random forests, with an emphasis on advances in bioinformatics. The interested reader is also referred to Probst et al. (2019) for insightful discussions on the impact of parameter tuning in random forests. Thus, the present review focuses on the most recent theoretical advances in this field.
2 BIAS, VARIANCE AND CONSISTENCY
The analysis of random forests is notoriously difficult due to a combination of several in-
gredients – recursive (tree) structure with data-dependent splits, split/data randomization,
large tree depths, aggregation – each element of which produces theoretical challenges. As
first attempts to dissect their internal behavior, several research works designed new styl-
ized versions of random forests, whose splitting procedure does not depend on the data.
Such forests, called purely random forests (see Arlot and Genuer, 2014) or non-adaptive random forests, allow us to understand the contribution of some key ingredients (split randomization, tree depth) to the predictive performance of the forest. The strength of the
results comes at the price of simplification of the original algorithm. Another route consists
in deriving results for Breiman’s random forests: results focus on simpler properties, while
requiring much more involved technical tools. We explore these two avenues in the following
sections. We will work with the general regression model
\[
Y = f^\star(X) + \varepsilon, \tag{3}
\]
where X ∼ U([0,1]^d), f^\star(x) = E[Y | X = x] is the regression function, and ε is a noise term satisfying E[ε | X] = 0 with variance V[ε | X] ≤ σ² a.s. In this section, all results are stated for infinite random forests (see (2)). They can be extended to finite random forests if the number of trees M is chosen large enough.
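As a concrete instance of model (3), the following snippet simulates data with uniform features and bounded-variance noise; the particular choice of f⋆ is ours and purely illustrative.

```python
# Simulating the regression model (3): X ~ U([0,1]^d) and Y = f*(X) + eps with
# E[eps|X] = 0 and V[eps|X] <= sigma^2. The specific f* is an arbitrary illustrative choice.
import numpy as np

def simulate(n=1000, d=10, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    f_star = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2     # illustrative regression function
    eps = sigma * rng.normal(size=n)                         # noise with bounded variance
    return X, f_star + eps, f_star

X, y, f_star = simulate()
```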
with k ∼ C_d log₂(n) and C_d = (1 + 3/(4d log 2))^{-1}. This rate was further improved by Klusowski (2021, Theorem 2), who proves that, under similar assumptions,
\[
\mathbb{E}\big[(\hat f^{\,\mathrm{cen}}_{k,n}(X) - f^\star(X))^2\big] \lesssim n^{-\frac{1+\Delta_d}{d\log 2 + 1}}, \tag{5}
\]
where lim_{d→∞} Δ_d = 0. Klusowski (2021) showed that this rate is not improvable for general Lipschitz functions. Recall that the minimax rate over the class of Lipschitz functions is given by n^{-2/(d+2)} (Stone, 1982). Many estimators reach this optimal rate, among which kernel methods and simple cubic partitions (regular partitions, a very specific tree form). Thus, it may seem surprising that randomizing tree partitions leads to a suboptimal estimator, as highlighted by (5). In fact, the slow rate of convergence of centered forests results from their (relatively) poor approximation capacity (Klusowski, 2021): the bias term is of order 2^{-k/(d log 2)}, whereas it is of order 2^{-2k/d} < 2^{-k/(d log 2)} for a single cubic partition.
Therefore, the randomness of the tree partition (and in particular the fact that all cells do
not have the same shape) harms the approximation error of the forest compared to a single
cubic partition.
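For concreteness, here is our own minimal sketch of a centered forest on [0,1]^d: each cell is split at its midpoint along a uniformly drawn coordinate, independently of the data, down to depth k, and each tree predicts by the empirical mean of the responses falling in the leaf containing the query point.

```python
# Centered (purely random) tree of depth k on [0,1]^d: split each cell at its midpoint
# along a coordinate drawn uniformly at random, independently of the data; a forest
# prediction averages, over trees, the mean response of the leaf containing the query point.
import numpy as np

def centered_tree(k, d, rng, lower=None, upper=None):
    lower = np.zeros(d) if lower is None else lower
    upper = np.ones(d) if upper is None else upper
    if k == 0:
        return {"leaf": True, "lower": lower, "upper": upper}
    j = int(rng.integers(d))                            # data-independent split direction
    mid = 0.5 * (lower[j] + upper[j])                   # split at the midpoint of the cell
    up_left, lo_right = upper.copy(), lower.copy()
    up_left[j], lo_right[j] = mid, mid
    return {"leaf": False, "feature": j, "threshold": mid,
            "left": centered_tree(k - 1, d, rng, lower, up_left),
            "right": centered_tree(k - 1, d, rng, lo_right, upper)}

def leaf_of(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def predict_centered_forest(trees, X_train, y_train, x):
    preds = []
    for tree in trees:
        cell = leaf_of(tree, x)
        inside = np.all((X_train >= cell["lower"]) & (X_train <= cell["upper"]), axis=1)
        preds.append(y_train[inside].mean() if inside.any() else y_train.mean())
    return float(np.mean(preds))

rng = np.random.default_rng(0)
d, k = 5, 6                                             # theory takes k of order log2(n)
trees = [centered_tree(k, d, rng) for _ in range(50)]
X_train = rng.uniform(size=(500, d))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=500)
print(predict_centered_forest(trees, X_train, y_train, x=np.full(d, 0.3)))
```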
While thorough analyses established the exact properties of centered forests, studying
such models does not allow for a precise understanding of Breiman's forests. The most important limitation of centered forests is the data-independent splitting process, which makes the algorithm unable to select only relevant variables, thus not adapting to sparsity. Besides, the analysis does not show any benefit from using centered forests instead of centered trees, whereas in practice, Breiman's forests outperform individual CART trees. Since lower and upper bounds match (see Klusowski, 2021), this is the result of an overly simplistic model of centered forests rather than of an improvable theoretical analysis. Finally, the rate of
consistency is optimal when leaves contain numerous observations, whereas original random
forests build fully-developed trees (whose leaves contain few observations)1 .
with α = −log₂(1 − 1/(2d)). It can be shown that the risk of a single uniform tree is larger than the risk of the forest. However, uniform forests do not reach the minimax rate of convergence for C² functions, as 2α/(2α + 1) < 4/(4 + d). Arlot and Genuer (2014) claim that this slow rate of convergence results from the fact that all cells are split in the same way, even if some are much smaller than others (see also the comment above on centered forests and Klusowski, 2021).
Contrary to previous analyses, Arlot and Genuer (2014) show that the number of trees M must be larger than (1 − 1/(2d))^{-k} to ensure that the predictive performance of the finite forest (with M trees) is close to that of the infinite forest.
Lipschitz function, and are thus minimax over this class (see Mourtada et al., 2020). More-
over, contrary to a single Mondrian tree or a cubic partition, Mondrian forests are also minimax over the class of Hölder functions with exponent β ∈ (1, 2) (see Mourtada et al., 2020). Hence, Mondrian forests improve upon a single Mondrian tree in terms of rate of consistency. Such a phenomenon results solely from the randomization of partitions.
These results highlight the benefits of split randomization on the consistency of the forest.
However, the rate of consistency in (7) is very slow. It turns out that such a phenomenon
is expected for fully-developed non-adaptive random forests. Indeed, Lin and Jeon (2006) proved that the risk of a non-adaptive forest trained on the full dataset is larger than
\[
\frac{(\log n)^{-(d-1)}}{n_{\mathrm{obs/leaf}}}, \tag{8}
\]
where n_{obs/leaf} is the number of samples inside each leaf of each tree. Therefore, non-adaptive fully-developed forests that do not incorporate subsampling cannot reach rates of consistency faster than (log n)^{-(d-1)}. Such results seem to point toward the use of subsampling instead of split randomization in non-adaptive forests.
2.5 Slight variants of Breiman’s forest
Wager and Walther (2015) define (α, k)-valid tree partitions that are (i) constructed on
the full original sample without bootstrapping or subsampling, (ii) whose splits leave at
least a fraction α of parent node observations in each child node, and (iii) such that each
final leaf contains at least k observations. Standard implementations of random forests allow removing the bootstrap step, so that (i) can be enforced, and allow imposing a minimum number of observations in each leaf, which enforces (iii). Condition (ii) is not satisfied in general, as it is known that splits performed along noise features are close to the edges of the cell (End-Cut Preference, see Ishwaran, 2015, Breiman et al., 1984). Enforcing (ii) thus requires modifying the splitting criterion. However, the overall algorithm remains quite close to the algorithm used in practice.
Wager and Walther (2015) establish a control on any random forest using (α, k)-valid tree partitions. More precisely, assuming that lim inf d/n > 0 and under regularity constraints on the regression model,
\[
\mathbb{P}\left[\,\sup_{x\in[0,1]^d,\;\Lambda_1,\ldots,\Lambda_M} \big|\hat f_{\Lambda_1,\ldots,\Lambda_M}(x) - f^\star_{\Lambda_1,\ldots,\Lambda_M}(x)\big| \lesssim \sqrt{\frac{\log(n)\log(d)}{\log((1-\alpha)^{-1})}}\,\frac{1}{\sqrt{k}}\,\right] \to 1, \tag{9}
\]
where \hat f_{\Lambda_1,\ldots,\Lambda_M}(x) is the prediction of the forest composed of the tree partitions Λ_1, ..., Λ_M and f^\star_{\Lambda_1,\ldots,\Lambda_M}(x) is the same forest where empirical means in each tree partition are replaced by the conditional expectation of the output variable in the corresponding cell. Based on the
above concentration result, Wager and Walther (2015) propose the Guess-and-Check forest
and prove its pointwise consistency.
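As an illustration of conditions (ii) and (iii), the sketch below checks whether a fitted tree is (α, k)-valid in the sense above; condition (i) concerns how the tree was trained rather than its structure. The use of scikit-learn's tree internals here is our own choice of implementation.

```python
# Checking (alpha, k)-validity of a fitted scikit-learn tree: (ii) each split leaves at
# least a fraction alpha of its node's samples in both children, and (iii) every leaf
# contains at least k observations. (Condition (i) concerns how the tree is trained.)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def is_alpha_k_valid(fitted_tree, alpha=0.2, k=10):
    t = fitted_tree.tree_
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                                       # leaf node
            if t.n_node_samples[node] < k:
                return False
        else:                                                # internal node
            child_min = min(t.n_node_samples[left], t.n_node_samples[right])
            if child_min < alpha * t.n_node_samples[node]:
                return False
    return True

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=500)
tree = DecisionTreeRegressor(min_samples_leaf=10).fit(X, y)
print(is_alpha_k_valid(tree, alpha=0.1, k=10))
```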
Wager and Athey (2018) introduce the honesty property: a tree is honest if each obser-
vation cannot be used simultaneously for the tree construction and for estimating the mean
response in terminal nodes. A simple manner for a tree to be honest is to use a subsample
for the tree construction and the rest of the observations for leaf mean estimation. Under
regularity conditions, Wager and Athey (2018) prove the pointwise consistency of honest
trees with (α, k)-valid partitions in the context of causal inference.
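A minimal sketch of honesty, under our own choice of tooling (scikit-learn): the tree structure is built on one half of the sample and the leaf means are recomputed on the other half, so that leaf values are independent of the learned partition.

```python
# Honest leaf estimation: fit the tree structure on one half of the sample and recompute
# each leaf mean on the other half, so leaf values are independent of the learned
# partition. The scikit-learn building blocks are our own implementation choice.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_tree(X, y, seed=0, **tree_kwargs):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    struct, est = perm[: len(y) // 2], perm[len(y) // 2:]
    tree = DecisionTreeRegressor(**tree_kwargs).fit(X[struct], y[struct])
    leaves_est = tree.apply(X[est])                          # leaves of the estimation half
    leaf_mean = {leaf: y[est][leaves_est == leaf].mean() for leaf in np.unique(leaves_est)}
    fallback = y[est].mean()                                 # leaves empty of estimation data

    def predict(X_new):
        return np.array([leaf_mean.get(leaf, fallback) for leaf in tree.apply(X_new)])

    return predict

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=1000)
predict = honest_tree(X, y, min_samples_leaf=20)
print(predict(X[:3]))
```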
the original random forest was proved to be consistent, assuming that the number of leaves t_n tends to infinity while satisfying t_n = o(a_n/(log a_n)^9), where a_n is the subsample size. Due to the previous condition, the consistency result of Scornet et al. (2015) is not valid for fully-developed trees (t_n = a_n). They also establish the ability of random forests to asymptotically identify the relevant directions for splitting in a sparse additive framework.
Using other proof techniques, Klusowski and Tian (2022) extend these results to addi-
tive (logistic and ordinary) regression, establishing oracle inequalities on the excess risk of
generic decision trees. They also prove that the rate of consistency for Breiman’s random
forests is upper bounded by (log n)^{-1}. Klusowski and Tian (2022) highlight that split randomization cannot damage the predictive performance of RF by more than a factor of 3 (with default RF parameters). Klusowski and Tian (2022) also prove the consistency of RF in sparse high-dimensional settings, where the dimension d is allowed to grow sub-exponentially with n and where additive components can be discontinuous (Corollary 7.2). Assuming a minimum impurity decrease in all tree nodes, Chi et al. (2022) derive upper bounds on the approximation error of random forests in high-dimensional settings, highlighting the influence of the number of directions eligible for splitting. In the case of some specific high-dimensional r-sparse regression models with binary input features, Syrgkanis and Zampetakis (2020) establish a rate of consistency for the MSE of Breiman's forests of order 2^r log(d) log(n)/n, which increases logarithmically with the ambient dimension d.
While decision trees are able to detect signal and are thus tailored for sparse settings,
they fail to adapt to the simplicity of additive models. Using rate-distortion theory, Tan
et al. (2022) establish a lower bound on the MSE of general honest decision trees, implying
that they fail to recover the efficient rate of n^{-2/3} of Generalized Additive Models (Stone,
1985). We note that while these results bring us closer to Breiman’s original forests, they
still rely on modifications such as subsampling and controlling tree depth. The impact of
these choices relative to the original algorithm remains unclear.
fully-grown trees in RF may act as a beneficial regularization in low Signal-to-Noise Ratio
settings (Zhou and Mentch, 2023).
Curth et al. (2024b) propose to reconcile these two views. The “spiked-smooth” behavior
mentioned by Wyner et al. (2017) is quantified for RF by Curth et al. (2024b) via an effective number of parameters: the effective number of parameters computed on test points is lower than that computed on training points, which provides empirical evidence of a spiked RF behavior near training points and a smooth behavior far from training points. The effective number of parameters computed on test points also decreases when the number of trees, the amount of feature randomization (smaller values of mtry), or the tree depth increases (see Figure 4 in Curth et al., 2024b). Thus, the good performance of RF seems to result from their “spiked-smooth” behavior, something that is exaggerated with interpolating trees or low values of mtry.
3 VARIATIONS
Beyond algorithmic modifications for purposes of theoretical tractability, or elucidating the
role of computational mechanisms, many variations on random forests have been proposed to improve performance, to provide new functionality, or to repurpose some aspects of random forests. We provide a brief catalog of some of these
variants here, particularly focusing on recent methods – see Loh (2014) for earlier variants.
tree with linear functions. Ghosal and Hooker (2020) "boost" random forests, fitting a second random forest to the out-of-bag residuals of the first. Bayesian methods for trees
and ensembles of trees (e.g. Chipman et al., 2012, Jeong and Rockova, 2020) do not explicitly
use random forest structures, but averages of posterior samples create forest-like averages;
see Ročková and Van der Pas (2020) for posterior consistency of these ensembles.
3.2 Functionality
The original random forest algorithm was developed for regression and classification tasks, but this scope has been significantly expanded. A first strategy is to adapt the splitting criterion to reflect the data model. Here, Saha et al. (2021) and Rabinowicz and Rosset (2022) assume spatially correlated observations and split based on the log-likelihood. Ishwaran et al. (2008) develop random forests for survival analysis, accounting for censoring, with consistency studied in Cui et al. (2022). Meinshausen and Ridgeway (2006) developed quantile regression forests. There has also been considerable interest in causal estimation; Wager and Athey (2018) adapted the tree building process to estimate conditional average treatment effects, extended to a survival setting in Cui et al. (2023). Athey et al. (2019) present a general model for splits maximizing a likelihood, with approximate optimization of the models in each daughter node to speed up computation.
in which I_m is an a-vector of indices in {1, ..., n} and D_n^{I_m} selects the corresponding set of observations (X_{i_1}, Y_{i_1}), ..., (X_{i_a}, Y_{i_a}). Breiman's original forest employed a bootstrap sample, generating a = n observations sampled with replacement, but employing subsampling with or without replacement has proved valuable for variance estimation as well as consistency.
Early approaches to estimating var(f̂_n(x)) centered around variations of the bootstrap or jackknife for ensemble methods. Sexton and Laake (2009) employed a "bootstrap of little bags" approach: generating B ensembles f̂_n^1(x), ..., f̂_n^B(x), each of M/B trees, resulting in a final prediction (1/B) Σ_{b=1}^B f̂_n^b(x), with uncertainty obtained via the standard deviation among the f̂_n^b(x) together with a correction for the bias of the standard deviation; the consistency of this estimate was later demonstrated by Athey et al. (2019).
In a similar manner, methods based around the infinitesimal jackknife were developed by examining the infinite forest (2). The infinitesimal jackknife operates by attaching equal weights to each observation and examining the directional derivative U_i, the gradient of the estimate with respect to the weight given to observation i. For ensemble methods, Efron (2014) attached weights in the form of selection probabilities when generating I but not for how observations are used subsequently; this also reflects the implementation of weights within the randomForest package in R (Liaw and Wiener, 2002). The application of this to random forests is given in Wager and Walther (2015), where the resulting directional derivatives are given by
\[
U_i = \mathrm{cov}_I\big(T(x; D_n^{I}, \omega),\, N(I)_i\big),
\]
where N(I)_i indicates the number of times observation i is repeated in sample I. For a finite forest, the theoretical covariance above is simply replaced with the empirical covariance over the subsamples I_m. We then obtain an estimate
\[
\widehat{\mathrm{var}}\big(\hat f_n(x)\big) = \sum_{i=1}^{n} U_i^2,
\]
although this can be substantially biased unless M is much larger than usually employed in
practice.
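The following sketch illustrates the infinitesimal-jackknife estimate above on a hand-rolled subsampled forest (our own construction; it omits the corrections for the finite-M Monte Carlo bias noted above).

```python
# Infinitesimal-jackknife sketch: grow M trees on random subsamples, record inclusion
# counts N_im, estimate U_i as the empirical covariance between tree predictions at x
# and N_im, and sum U_i^2. Hand-rolled illustration without finite-M bias corrections.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ij_variance(X, y, x0, M=1000, a_n=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty(M)
    N = np.zeros((n, M))
    for m in range(M):
        idx = rng.choice(n, size=a_n, replace=True)        # subsample with replacement
        np.add.at(N[:, m], idx, 1)                          # inclusion counts N_im
        tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx])
        preds[m] = tree.predict(x0.reshape(1, -1))[0]
    # U_i: empirical covariance over trees between predictions and inclusion counts
    U = ((N - N.mean(axis=1, keepdims=True)) * (preds - preds.mean())).mean(axis=1)
    return preds.mean(), float(np.sum(U ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = X[:, 0] + rng.normal(scale=0.2, size=500)
estimate, variance = ij_variance(X, y, x0=np.full(5, 0.5))
print(estimate, variance)
```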
Beyond variance estimates, inferential tools like confidence intervals also require distri-
butional results. For random forests, these have been obtained by observing a connection
between an infinite forest and the classical construction of U-statistics (Hoeffding, 1948)
when the sets I_m are obtained from subsamples taken without replacement. Mentch and Hooker (2016), Wager and Athey (2018), and Peng et al. (2022) developed versions of a central limit theorem extending basic U-statistic constructions. To formally state these results, we examine a forest of M trees obtained using subsamples of size a_n and define the Hájek projection
\[
g_c(z_1, \ldots, z_c) = \mathbb{E}\big[T(x; D_n^{I}, \omega) \mid Z_1 = z_1, \ldots, Z_c = z_c\big], \qquad Z_i = (X_i, Y_i),
\]
to be the expected value of T conditioned on the first c data points, with g_{a_n} understood to also include ω, and define the corresponding variance components
\[
\zeta_{c,a_n} = \mathbb{V}\big[g_c(Z_1, \ldots, Z_c)\big].
\]
Then, so long as lim_{n→∞} (a_n/n)(ζ_{a_n,a_n}/(a_n ζ_{1,a_n})) = 0 and the response Y is appropriately regular, we have
\[
\frac{\hat f_n(x) - \mathbb{E}[\hat f_n(x)]}{\sqrt{a_n^2\,\zeta_{1,a_n}/n + \zeta_{a_n,a_n}/M}} \;\xrightarrow{d}\; \mathcal{N}(0, 1).
\]
The limit condition above reflects a heuristic that ζ_{a_n,a_n} ∝ a_n ζ_{1,a_n}, reducing the condition to a_n = o(n). Note here that the centering in this theorem is given by E[f̂_n(x)] rather than a target of the form E[Y | X = x]. This is because the construction of the result ignores the tree-building process and cannot guarantee that the bias in f̂_n(x) is ignorable. In that sense, this result cannot be used to guarantee that confidence intervals derived from it are consistent. One can argue that the stability of the forest is of interest independently of formal coverage guarantees. Alternatively, Wager and Athey (2018) do demonstrate a CLT centered on E[Y | X = x] by a careful control of both the tree-building process and the choice of a subsampling rate a_n = n^β for a range of β that depends on constraints on the asymmetry of allowable splits. This results in a general a_n/n = n^{-1+β} convergence rate. Athey et al. (2019) demonstrate the asymptotic equivalence of generalized random forests to a U-statistic with equivalent statistical properties; similar analyses are carried out for random survival forests in Cui et al. (2023).
Peng et al. (2022) also provided a Berry-Esseen theorem for complete U-statistics, that is, assuming that M is large enough to be ignorable, resulting in
\[
\sup_{z\in\mathbb{R}} \left|\, \mathbb{P}\!\left(\frac{\hat f_n(x) - \mathbb{E}[\hat f_n(x)]}{\sqrt{a_n^2\,\zeta_{1,a_n}/n}} < z\right) - \Phi(z) \,\right| \;<\; \frac{6.1\,\mathbb{E}[|g_1|^3]}{\sqrt{n}\,\zeta_{1,a_n}^{3/2}} \;+\; (1+\sqrt{2})\left(\frac{a_n}{n}\,\frac{\zeta_{a_n,a_n}}{a_n\,\zeta_{1,a_n}}\right)^{1/2}.
\]
Here, for fixed subsample sizes a_n, the bound has a classical n^{-1/2} rate, but specific rates in the infinite-order case depend on the scaling of the variance components (a_n ζ_{1,a_n}, ζ_{a_n,a_n}); as for the CLT above, assuming these remain bounded, a_n = o(n) suffices for the bound to vanish asymptotically.
Within the context of estimating variances, both the infinitesimal jackknife and bootstrap-
of-little-bags can be shown to be consistent for ζ_{1,n}. However, both employ the variance over a Monte Carlo approximation to quantities defined for an infinite forest – either a forest built on a bootstrap sample, or cov(T, N(I)); this was also the case for the two-level estimate
of Mentch and Hooker (2016). As such, the estimated variance is inflated by the Monte
Carlo variance of the estimate, and this bias can be substantial unless much larger forests
are built than are needed for good prediction. While intuitive estimates of the Monte Carlo
variance are available, they are themselves inflated by the without-replacement sampling
scheme which induces negative correlation between trees, conditional on the data. Zhou
et al. (2021) instead examined a forest built using samples taken with replacement in which
subsamples can be viewed as simulations from the empirical distribution and provided a
general estimation scheme, defining
\[
\mu_i = \sum_{m=1}^{M} N_{im}\, T(x; D^{I_m}, \omega_m) \Big/ \sum_{m=1}^{M} N_{im}, \qquad i = 1, \ldots, n,
\]
\[
SS_\tau = \sum_{i=1}^{n}\sum_{m=1}^{M} N_{im}\,(\mu_i - \bar\mu)^2, \qquad SS_\epsilon = \sum_{i=1}^{n}\sum_{m=1}^{M} N_{im}\,(T_m - \mu_i)^2,
\]
\[
\hat\zeta_{1,a_n} = \frac{SS_\tau - (n-1)\,SS_\epsilon/(M a_n - n)}{M a_n - \sum_i \big(\sum_m N_{im}\big)^2/(M a_n)}.
\]
Here the μ_i are the means of the trees for which observation i was included in the subsample, weighted by its inclusion frequency, and the resulting ζ̂_{1,a_n} is unbiased for data simulated from the empirical distribution. The use of subsamples with replacement moves forests from U-statistics to V-statistics. The classical equivalence between these (e.g., Van der Vaart, 2000) only holds when a_n = o(n^{1/4}) (Shieh, 1994); Zhou et al. (2021) re-express a general V-statistic as a U-statistic with a modified kernel satisfying the equivalent central limit theorem, and the variance estimates above remain consistent and unbiased for data generated from the empirical distribution.
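The sketch below implements these quantities under our own simulation setup and plug-in choices: trees are grown on with-replacement subsamples of size a_n, the inclusion counts and tree predictions at a point give SS_τ, SS_ε and ζ̂_{1,a_n}, and a CLT-based interval follows by plugging the estimates into the variance expression of the previous section, with the overall tree mean used for μ̄ and the empirical tree variance as a stand-in for ζ_{a_n,a_n}.

```python
# Sketch of the balanced variance estimate of Zhou et al. (2021): grow M trees on
# with-replacement subsamples of size a_n, record inclusion counts N_im and the tree
# predictions T_m at a query point, form SS_tau, SS_eps and zeta_hat_{1,a_n} as above,
# and plug them into the CLT variance a_n^2 zeta_{1,a_n}/n + zeta_{a_n,a_n}/M.
# The simulation setup, the use of the overall tree mean for mu_bar, and the empirical
# tree variance as a stand-in for zeta_{a_n,a_n} are our own choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_with_counts(X, y, x0, M=1000, a_n=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    T = np.empty(M)
    N = np.zeros((n, M))
    for m in range(M):
        idx = rng.choice(n, size=a_n, replace=True)          # subsample with replacement
        np.add.at(N[:, m], idx, 1)                           # inclusion counts N_im
        tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx])
        T[m] = tree.predict(x0.reshape(1, -1))[0]
    return T, N

def zeta1_hat(T, N, a_n):
    n, M = N.shape
    tot = N.sum(axis=1)                                      # total inclusions per point
    mu = np.full(n, T.mean())
    mu[tot > 0] = (N[tot > 0] * T).sum(axis=1) / tot[tot > 0]
    SS_tau = (N * (mu[:, None] - T.mean()) ** 2).sum()
    SS_eps = (N * (T[None, :] - mu[:, None]) ** 2).sum()
    return (SS_tau - (n - 1) * SS_eps / (M * a_n - n)) / \
           (M * a_n - (tot ** 2).sum() / (M * a_n))

rng = np.random.default_rng(0)
n, a_n, M = 500, 100, 1000
X = rng.uniform(size=(n, 5))
y = X[:, 0] + rng.normal(scale=0.2, size=n)
T, N = forest_with_counts(X, y, x0=np.full(5, 0.5), M=M, a_n=a_n)
zeta1 = max(zeta1_hat(T, N, a_n), 0.0)                       # guard against negative estimates
se = np.sqrt(a_n ** 2 * zeta1 / n + T.var() / M)
print(f"forest prediction {T.mean():.3f} +/- {1.96 * se:.3f}")
```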
The distributional results defined here only characterize the variability of f̂_n(x); they do not provide prediction intervals. These can be obtained by quantile regression (Meinshausen and Ridgeway, 2006) or by the out-of-bag form of conformal inference (Angelopoulos et al., 2023) developed in Lu and Hardin (2021). Bias corrections have also been studied in Hooker and Mentch (2018), Lu and Hardin (2021), and Ghosal and Hooker (2020); these all add a correction ĝ_n(x) to f̂_n(x), which improved predictive performance on a range of regression problems, often very significantly. Ghosal et al. (2025) demonstrate the extension of infinitesimal jackknife (IJ) methods to estimate the covariance between models to which IJ is applicable
and hence to iterative bias corrections of this form. Finally, note that these distributional
results rest entirely on the subsampling process of random forests rather than the specifics
of tree construction. They thus apply more generally to any bagged estimator. They also
apply to any quantity calculated from individual trees and then averaged over the forest;
that includes all of the variable importance measures studied below.
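As an illustration of the prediction-interval discussion above, here is a simplified out-of-bag interval in the spirit of Lu and Hardin (2021): the empirical quantile of absolute out-of-bag residuals is used as a symmetric interval width around new forest predictions. This is a sketch under our own simulation setup, not their exact procedure.

```python
# Simplified out-of-bag prediction interval in the spirit of Lu and Hardin (2021):
# use an empirical quantile of absolute out-of-bag residuals as a symmetric interval
# width around new predictions (a sketch, not their exact construction).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=1000)

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
oob_resid = np.abs(y - rf.oob_prediction_)                   # out-of-bag residuals
width = np.quantile(oob_resid, 0.9)                          # targets roughly 90% coverage

X_new = rng.uniform(size=(5, 5))
pred = rf.predict(X_new)
print(np.column_stack([pred - width, pred + width]))
```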
5 VARIABLE IMPORTANCE
The original implementation of random forests comes with two importance measures, widely used in practice: the Mean Decrease in Impurity (MDI) and the Mean Decrease in Accuracy (MDA). We review below their theoretical properties and improvements that have been proposed recently. Although relatively rarely employed, central limit theorems for these measures follow immediately by repurposing the distributional theory for random forests and observing that each is a quantity defined from individual trees that is then averaged over the forest², thus allowing uncertainty quantification.

²MDA can be defined with respect to a generic model and, when implemented generically, does not take this form; however, Breiman (2001) defined variable importance as an average over trees, taken using out-of-bag samples.
where L_{n,A}(j_{n,A}, z_{n,A}) is the impurity reduction in the cell A and p_{n,A} the fraction of data points falling into A. The MDI of a forest is then simply the average of the MDI values computed over the trees that compose it.
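For illustration, the sketch below computes the MDI of each feature from a fitted scikit-learn forest by summing, for every node splitting on that feature, the weighted impurity decrease, and then averaging over trees; the choice of scikit-learn internals is ours, and the result matches the library's feature_importances_ up to normalization.

```python
# MDI sketch: for each tree, sum over the nodes splitting on feature j the weighted
# impurity reduction (p_{n,A} times the decrease), then average over the trees of the
# forest. Computed here from scikit-learn's tree internals (our implementation choice).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_mdi(fitted_tree, n_features):
    t = fitted_tree.tree_
    mdi = np.zeros(n_features)
    p = t.weighted_n_node_samples / t.weighted_n_node_samples[0]     # fraction p_{n,A}
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                                               # leaf: no split
            continue
        decrease = (p[node] * t.impurity[node]
                    - p[left] * t.impurity[left] - p[right] * t.impurity[right])
        mdi[t.feature[node]] += decrease
    return mdi

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(np.mean([tree_mdi(est, X.shape[1]) for est in rf.estimators_], axis=0))
```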
Louppe et al. (2013) established one of the first expressions of the population version of the MDI, for fully-grown trees built on categorical covariates. In this context, a tree splits on all possible covariates and deriving the MDI expression reduces to enumerating all eligible splits. In this scenario, Louppe et al. (2013) proved that the sum of the MDI over all input variables is equal to the mutual information between Y and X^(1), ..., X^(d), highlighting that MDI is a natural importance measure that is spread across all input variables. Louppe et al. (2013) also establish that the importance of irrelevant features (independent of Y conditionally on any subset of covariates) is zero. While interesting, the setting presented in Louppe et al. (2013) does not permit studying the impact on MDI of split positions (all variables are categorical and split so that each node corresponds to a modality) nor of tree depth (trees are fully grown).
The CART splitting procedure is more likely to select variables with a large number
of split points (such as continuous features or categorical variables with many categories). As a
consequence, the MDI is positively biased towards such variables (see, e.g., Nicodemus, 2011,
Boulesteix et al., 2012). Such a phenomenon was first noticed by Breiman et al. (1984) (see
also Strobl et al., 2007). Similarly, MDI of variables containing many missing values may
be overinflated (see, e.g., Kim and Loh, 2001).
In order to reduce these intrinsic biases, several works proposed to replace the usual
splitting procedure in CART by statistical tests, such as F-test on ordered variables (FACT,
Loh and Vanichsetakul, 1988), chi-squared tests on categorical ones (see algorithms QUEST
and CRUISE in Loh and Shih, 1997, Kim and Loh, 2001), permutation tests, or tests on the
maximal value of the Gini criterion (Strobl et al., 2007). To the best of our knowledge, there exist no detailed theoretical analyses of these procedures.
Even in the absence of dependencies between inputs and outputs, variables with many
categories are more likely to be split (Zhou and Hooker, 2021). This phenomenon is related
to multiple testing problems, since the quality of many split points is evaluated to choose the best one. Because a unique dataset is used to build the tree and evaluate the MDI, the resulting MDI may be positively biased for such high-cardinality variables. In order to remove this bias, Zhou and Hooker (2021) propose using two different datasets to respectively build the tree and compute the MDI. Li et al. (2019) propose another way to use an extra sample, based on a rewriting of the MDI of any variable X^(j) as
\[
\frac{1}{n}\sum_{i=1}^{n} f_{T,j}(X_i)\,Y_i, \tag{11}
\]
where
\[
f_{T,j}(x) = \sum_{A\in T,\; j_A = j} \Big(\bar Y_{A_L}\,\mathbf{1}_{x\in A_L} + \bar Y_{A_R}\,\mathbf{1}_{x\in A_R} - \bar Y_{A}\,\mathbf{1}_{x\in A}\Big). \tag{12}
\]
Equation (11) shows that MDI can be seen as a covariance between the output and a
function f_{T,j} related to the tree partition. Li et al. (2019) propose to use one part of the data to build the partition (and thus f_{T,j}) and the other part to estimate the MDI via Equation (11),
employing similar motivation to that of honesty in Wager and Athey (2018).
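A minimal sketch of this debiasing, using our own implementation choices (scikit-learn tree internals and simulated data): the node means Ȳ_A come from the sample used to build the tree, f_{T,j} follows Equation (12), and the importance is evaluated via Equation (11) on held-out data.

```python
# Debiased MDI sketch in the spirit of Li et al. (2019): build the tree on one sample,
# form f_{T,j} from the node means (Equation (12)), and evaluate (1/n) sum_i f_{T,j}(X_i) Y_i
# (Equation (11)) on held-out data. Scikit-learn internals and simulated data are our
# own implementation choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def debiased_mdi(fitted_tree, X_out, y_out, n_features):
    t = fitted_tree.tree_
    node_mean = t.value[:, 0, 0]                        # Ybar_A from the building sample
    memb = fitted_tree.decision_path(X_out).toarray().astype(float)
    mdi = np.zeros(n_features)
    for a in range(t.node_count):
        left, right = t.children_left[a], t.children_right[a]
        if left == -1:                                   # leaf: not a split node
            continue
        # contribution of node A to f_{T,j}, evaluated at the held-out points
        f_a = (node_mean[left] * memb[:, left]
               + node_mean[right] * memb[:, right]
               - node_mean[a] * memb[:, a])
        mdi[t.feature[a]] += np.mean(f_a * y_out)        # Equation (11)
    return mdi

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 5))
y = X[:, 0] + rng.normal(scale=0.2, size=2000)
tree = DecisionTreeRegressor(min_samples_leaf=20).fit(X[:1000], y[:1000])
print(debiased_mdi(tree, X[1000:], y[1000:], X.shape[1]))
```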
Deep trees tend to produce biased MDI. Indeed, Scornet (2023) proves that the sum of all MDI produced by a fully-grown tree tends to the mutual information between Y and X^(1), ..., X^(d) plus the variance of the model noise. Any input variable can have its MDI modified by an additive factor of the noise variance. Such a phenomenon highlights the asymptotic empirical bias of fully-grown trees. A finite-sample analysis of the same phenomenon was proposed by
Li et al. (2019). Consider a model with d input variables, whose noise variables – features
assumed independent of all other quantities – are indexed by k ∈ S ⊂ {1, . . . , d}. For any
tree T, let
\[
G_0(T) = \sum_{j\in S} \widehat{\mathrm{MDI}}_T\big(X^{(j)}\big) \tag{13}
\]
be the sum of the importances of the noise variables. Then, letting T_n(m_n, k_n) be the set of decision trees whose leaves contain more than m_n observations and whose depth is less than k_n, there exists a constant C depending on ‖f⋆‖_∞ such that, for all n large enough,
\[
\frac{\log d}{C m_n} \;\le\; \mathbb{E}\left[\sup_{T\in\mathcal{T}_n(m_n,k_n)} G_0(T)\right] \;\le\; C\,\frac{k_n\log(nd)}{m_n}, \tag{14}
\]
where the lower bound holds when f⋆ = 0 and m_n ≳ log n. The population version of the sum of the MDIs of irrelevant features is zero, but its empirical counterpart is lower bounded by (log d)/(C m_n). Thus, part of the MDI bias is related to the number of observations in each leaf. Increasing the number of observations per leaf or, equivalently, reducing the depth of the tree can help diminish this bias. However, it may also decrease the predictive performance of the tree, therefore leading to erroneous MDI.
Two different types of MDI biases have been identified. Biases (related to the splitting
criterion) in favor of several types of variables can be reduced by computing MDI on a dataset
that was not used during the training phase. Biases due to tree depth (or equivalently, a low number of observations per leaf) can be reduced by simply limiting the tree depth. These
two types of bias act separately: for instance, decreasing tree depth reduces the second
bias but has no effect on the first. Note that the standard implementation of RF (i) uses
a unique dataset to build trees and compute MDI and (ii) builds fully-grown trees, thus
exacerbating MDI biases. Finally, the population value targeted by MDI remains unclear
outside of specific generating structures such as generalized additive models (Scornet, 2023).
Another limitation of unconditional permutation is due to the nature of the permuted
samples. Indeed, Hooker et al. (2021) explain that, because of correlation between inputs,
permuting values of a given variable may create unrealistic observations. Thus, MDA eval-
uates the extrapolation properties of RF rather than the importance of the given variable.
The first theoretical analysis of permutation importance for trees is by Ishwaran (2007).
In order to evaluate the importance of a variable X^(j), the error of a tree on noisy observations is computed, where the noise comes from randomly assigning each observation to the left or right node whenever a split occurs along X^(j), doing the same for all subsequent splits in the tree. Ishwaran (2007) establishes the limit of such an importance measure, which
resembles the permutation importance. Gregorutti et al. (2017) were the first to provide
the population expression of the MDA, in an additive model with Gaussian inputs.
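For illustration, here is a minimal test-set version of MDA under our own setup: each feature is permuted in held-out data and the increase in mean squared error is recorded. Breiman's original version permutes out-of-bag samples tree by tree; scikit-learn's sklearn.inspection.permutation_importance offers a ready-made generic alternative.

```python
# Minimal test-set MDA sketch: permute one feature at a time in held-out data and record
# the increase in mean squared error. Breiman's version permutes out-of-bag samples tree
# by tree; sklearn.inspection.permutation_importance is a ready-made generic alternative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mda(model, X_test, y_test, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y_test - model.predict(X_test)) ** 2)
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])      # break the link with y
            importances[j] += np.mean((y_test - model.predict(X_perm)) ** 2) - base_mse
    return importances / n_repeats

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=2000)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:1000], y[:1000])
print(mda(rf, X[1000:], y[1000:]))
```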
Bénard et al. (2022) establish the convergence of MDA and interpret the limit quantity through the lens of sensitivity analysis. Denote the Total Sobol Index of X^(j), corresponding to the proportion of explained variance lost when X^(j) is removed from the full model, by
where, letting X_{πj} be the variable X whose jth component is replaced by an independent copy,
asymptotic normality of generic importance measures that are computed as the difference
in predictive performance between a predictor trained on all variables and the same predictor
trained on a strict subset.
One could be tempted to use such results to test whether a variable is important. How-
ever, in the presence of a variable of null importance, asymptotic normality does not hold,
and a specific test must be designed (see Section 3.4 in Williamson et al., 2023). Hooker
et al. (2021) and Mentch and Zhou (2022) establish that the testing procedure described
in Williamson et al. (2023) has a tendency to incorrectly reject the null hypothesis (i.e.,
to identify irrelevant variables as relevant). Indeed, Mentch and Zhou (2022) prove that
adding noise features (independent of the output) can increase the predictive performance
of random forests. This is not surprising, as this can be seen as a regularization procedure
(Mentch and Zhou, 2022). However, this calls into question the interpretation drawn from such tests: the null hypothesis for a variable can be rejected, thus implying increased predictive performance when the variable is added to the model, whereas the variable itself is not related to the output (see the discussion in Section 5.1 in Mentch and Zhou, 2022).
SHAP has become one of the most common feature attribution methods in machine learning (see, e.g., Holzinger et al., 2020, Molnar, 2020), although critiques point to the computational expense, the rather abstract definition, and a disjunction between the desiderata that were used to define Shapley values and the specific properties needed for variable importance (Verdinelli and Wasserman, 2024).
In the context of tree-based models, Shapley values can be computed efficiently based on two properties. First, the conditional expectations involved in calculating f_S(x_S) can be computed analytically via weighted averages of the leaves; this is particularly simple in the case of interventional Shapley values (see Bénard et al., 2022). Second, the set of permutations
to be considered can be substantially reduced: only those π for which some feature used in
nodes encountered by x appears after j, and only for the trees for which feature j appears
along that path (see Lundberg et al., 2018, Laberge and Pequignot, 2022, for algorithmic
details).
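As a usage illustration (assuming the shap package is installed), its TreeExplainer class implements the polynomial-time Tree SHAP algorithm of Lundberg et al. (2018) for tree ensembles:

```python
# Example use of the Tree SHAP implementation in the `shap` package (assumed installed):
# TreeExplainer exploits the tree structure as described above to compute Shapley values
# efficiently; each row of the output attributes one prediction across the features.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=1000)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X[:10])     # one attribution per (observation, feature)
print(shap_values.shape)                        # (10, 5)
```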
Few theoretical studies of Shapley values have been carried out, but we observe that the
consistency of a random forest also translates to consistency of the conditional expectations
in (18). As noted above, these values are also expressed as averages over values calculated for each tree, for which a central limit theorem therefore also holds.
6 DISCUSSION
The last two decades have seen important advances in random forest theory. Designing
and studying random forests variants have led to a better understanding of the role of each
ingredient (subsampling, feature randomization, tree depth) on the predictive performance
of RF. However, quite notably, the roles of two ingredients remain largely unaddressed: the impact of fully-developed trees and the number of features eligible for splitting. Indeed, they are intrinsically difficult to analyze, since many nodes in fully-developed trees have very few observations (while statistical guarantees require many data points) and since analyzing the number of preselected features requires a (data-dependent) way of selecting the best split, which is, in itself, a difficulty.
Additionally, a single RF model that allows for studying possible interactions between these ingredients has yet to be found. Most models are tailored for the analysis of one
ingredient only. For example, centered forests are well suited for the analysis of split ran-
domization, but make the analysis of fully-developed trees impossible.
Most analyses assess the impact of RF hyperparameters on predictive performance, but the impact on variable importance remains largely unclear. For example, RFs of deep trees have good predictive performance while the resulting variable importances (either MDI or
MDA) are clearly biased. Understanding the tradeoff between predictive performance and
variable importance (that is, interpretability) is a promising future avenue for research.
Finally, the greatest source of open problems involves understanding adaptively chosen
splits. Adaptation is often listed as one of the key advantages of tree-based models, but
mathematical results supporting this are limited. While conditions such as regularity and
honesty in Wager and Athey (2018) allow for adaptively chosen splits, they are designed to provide a means of ignoring this adaptivity within the mathematics. A fully developed theory would
provide a finer-level description of the adaptation of node geometry to the target model,
elucidating the role of adaptive tree kernels both for secondary uses as in Tan et al. (2020)
and for combinations of random forest models (Ghosal et al., 2025), as well as controlling
the dependence between leaf values and tree structures.
References
Agarwal, A., A. M. Kenney, Y. S. Tan, T. M. Tang, and B. Yu (2023). Mdi+: A flexible
random forest-based feature importance framework. arXiv preprint arXiv:2307.01932 .
Angelopoulos, A. N., S. Bates, et al. (2023). Conformal prediction: A gentle introduction.
Foundations and Trends in Machine Learning 16 (4), 494–591.
Arlot, S. and R. Genuer (2014). Analysis of purely random forests bias. arXiv preprint
arXiv:1407.3939 .
Arlot, S. and R. Genuer (2016). Comments on: A random forest guided tour. Test 25,
228–238.
Arnould, L., C. Boyer, and E. Scornet (2023). Is interpolation benign for random forest
regression? In International Conference on Artificial Intelligence and Statistics, pp.
5493–5548. PMLR.
Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of
Statistics 47 (2), 1148 – 1178.
Basu, S., K. Kumbier, J. B. Brown, and B. Yu (2018). Iterative random forests to discover
predictive and stable high-order interactions. Proceedings of the National Academy of
Sciences 115 (8), 1943–1948.
Belkin, M., D. Hsu, S. Ma, and S. Mandal (2019). Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy of
Sciences 116 (32), 15849–15854.
Bénard, C., S. Da Veiga, and E. Scornet (2022). Mean decrease accuracy for random forests:
inconsistency, and a practical solution via the sobol-mda. Biometrika 109 (4), 881–900.
Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning
Research 13, 1063–1095.
Biau, G., L. Devroye, and G. Lugosi (2008). Consistency of random forests and other
averaging classifiers. Journal of Machine Learning Research 9 (9).
Biau, G. and E. Scornet (2016). A random forest guided tour. Test 25, 197–227.
Boulesteix, A.-L., A. Bender, J. Lorenzo Bermejo, and C. Strobl (2012). Random forest gini
importance favours snps with large minor allele frequency: impact, sources and recom-
mendations. Briefings in Bioinformatics 13 (3), 292–304.
Boulesteix, A.-L., S. Janitza, J. Kruppa, and I. R. König (2012). Overview of random forest
methodology and practical guidance with emphasis on computational biology and bioin-
formatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2 (6),
493–507.
Breiman, L. (2001). Random forests. Machine learning 45, 5–32.
Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3.1.
Statistics Department University of California Berkeley, CA, USA 1 (58), 3–42.
Breiman, L. (2004). Consistency for a simple model of random forests.
Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and regression
trees. CRC press.
Cattaneo, M. D., R. Chandak, and J. M. Klusowski (2024). Convergence rates of oblique
regression trees for flexible function libraries. The Annals of Statistics 52 (2), 466–490.
Chaudhuri, P., M.-C. Huang, W.-Y. Loh, and R. Yao (1994). Piecewise-polynomial regres-
sion trees. Statistica Sinica, 143–167.
Chi, C.-M., P. Vossler, Y. Fan, and J. Lv (2022). Asymptotic properties of high-dimensional
random forests. The Annals of Statistics 50 (6), 3415–3438.
Criminisi, A., J. Shotton, E. Konukoglu, et al. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision 7 (2–3), 81–227.
Cui, Y., M. R. Kosorok, E. Sverdrup, S. Wager, and R. Zhu (2023). Estimating heteroge-
neous treatment effects with right-censored data via causal survival forests. Journal of
the Royal Statistical Society Series B: Statistical Methodology 85 (2), 179–211.
Cui, Y., R. Zhu, M. Zhou, and M. Kosorok (2022). Consistency of survival tree and forest
models. Statistica Sinica 32 (3), 1245–1267.
Curth, A., A. Jeffares, and M. van der Schaar (2024a). A u-turn on double descent: Rethink-
ing parameter counting in statistical learning. Advances in Neural Information Processing
Systems 36.
Curth, A., A. Jeffares, and M. van der Schaar (2024b). Why do random forests work?
understanding tree ensembles as self-regularizing adaptive smoothers. arXiv preprint
arXiv:2402.01502 .
Devroye, L., L. Györfi, and G. Lugosi (2013). A probabilistic theory of pattern recognition,
Volume 31. Springer Science & Business Media.
Duroux, R. and E. Scornet (2018). Impact of subsampling and tree depth on random forests.
ESAIM: Probability and Statistics 22, 96–128.
Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American
Statistical Association 109 (507), 991–1007.
Fink, D., W. M. Hochachka, B. Zuckerberg, D. W. Winkler, B. Shaby, M. A. Munson,
G. Hooker, M. Riedewald, D. Sheldon, and S. Kelling (2010). Spatiotemporal exploratory
models for broad-scale survey data. Ecological Applications 20 (8), 2131–2147.
Fisher, A., C. Rudin, and F. Dominici (2019). All models are wrong, but many are use-
ful: Learning a variable’s importance by studying an entire class of prediction models
simultaneously. Journal of Machine Learning Research 20 (177), 1–81.
Friedberg, R., J. Tibshirani, S. Athey, and S. Wager (2020). Local linear forests. Journal
of Computational and Graphical Statistics 30 (2), 503–517.
Friedman, J. H. and B. E. Popescu (2008). Predictive learning via rule ensembles. Annals of
Applied Statistics 2 (3), 916–954.
Geurts, P., D. Ernst, and L. Wehenkel (2006). Extremely randomized trees. Machine
learning 63, 3–42.
Gevrey, M., I. Dimopoulos, and S. Lek (2003). Review and comparison of methods to
study the contribution of variables in artificial neural network models. Ecological mod-
elling 160 (3), 249–264.
Ghosal, I. and G. Hooker (2020). Boosting random forests to reduce bias; one-step boosted
forest and its variance estimate. Journal of Computational and Graphical Statistics 30 (2),
493–502.
Ghosal, I., Y. Zhou, and G. Hooker (2025). The infinitesimal jackknife and combinations of
models. Statistical Science, to appear.
Gregorutti, B., B. Michel, and P. Saint-Pierre (2017). Correlation and variable importance
in random forests. Statistics and Computing 27, 659–678.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. The
Annals of Mathematical Statistics 19 (3), 293 – 325.
Holzinger, A., A. Saranti, C. Molnar, P. Biecek, and W. Samek (2020). Explainable ai
methods-a brief overview. In International workshop on extending explainable AI beyond
deep models and classifiers, pp. 13–38. Springer.
Hooker, G. and L. Mentch (2018). Bootstrap bias corrections for ensemble methods. Statis-
tics and Computing 28, 77–86.
Hooker, G., L. Mentch, and S. Zhou (2021). Unrestricted permutation forces extrapola-
tion: variable importance requires at least one more model, or there is no free variable
importance. Statistics and Computing 31, 1–16.
Hornung, R. and A.-L. Boulesteix (2022). Interaction forests: Identifying and exploiting
interpretable quantitative and qualitative interaction effects. Computational Statistics &
Data Analysis 171, 107460.
Ishwaran, H. and M. Lu (2019). Standard errors and confidence intervals for variable impor-
tance in random forest regression, classification, and survival. Statistics in medicine 38 (4),
558–582.
Jeong, S. and V. Rockova (2020). The art of bart: On flexibility of bayesian forests. arXiv
preprint arXiv:2008.06620 3 (69), 146.
Kim, H. and W.-Y. Loh (2001). Classification trees with unbiased multiway splits. Journal
of the American Statistical Association 96 (454), 589–604.
Klusowski, J. (2021). Sharp analysis of a simple model for random forests. In International
Conference on Artificial Intelligence and Statistics, pp. 757–765. PMLR.
Klusowski, J. and P. Tian (2022). Large scale prediction with decision trees. Journal of the
American Statistical Association.
Laberge, G. and Y. Pequignot (2022). Understanding interventional treeshap: How and
why it works. arXiv preprint arXiv:2209.15123 .
Lakshminarayanan, B., D. M. Roy, and Y. W. Teh (2014). Mondrian forests: Efficient online
random forests. Advances in neural information processing systems 27.
Lei, J., M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018). Distribution-
free predictive inference for regression. Journal of the American Statistical Associa-
tion 113 (523), 1094–1111.
Li, M., B. Sarmah, D. Desai, J. Rosaler, S. Bhagat, P. Sommer, and D. Mehta (2024).
Quantile regression using random forest proximities. In Proceedings of the 5th ACM
International Conference on AI in Finance, pp. 728–736.
Li, X., Y. Wang, S. Basu, K. Kumbier, and B. Yu (2019). A debiased mdi feature importance
measure for random forests. Advances in Neural Information Processing Systems 32.
Liaw, A. and M. Wiener (2002). Classification and regression by randomforest. R News 2 (3),
18–22.
Lin, Y. and Y. Jeon (2006). Random forests and adaptive nearest neighbors. Journal of the
American Statistical Association 101 (474), 578–590.
Loh, W.-Y. (2014). Fifty years of classification and regression trees. International Statistical
Review 82 (3), 329–348.
Loh, W.-Y. and Y.-S. Shih (1997). Split selection methods for classification trees. Statistica
sinica, 815–840.
Loh, W.-Y. and N. Vanichsetakul (1988). Tree-structured classification via generalized
discriminant analysis. Journal of the American Statistical Association 83 (403), 715–725.
Louppe, G., L. Wehenkel, A. Sutera, and P. Geurts (2013). Understanding variable im-
portances in forests of randomized trees. Advances in neural information processing sys-
tems 26.
Lu, B. and J. Hardin (2021). A unified framework for random forest prediction error esti-
mation. Journal of Machine Learning Research 22 (8), 1–41.
Lundberg, S. M., G. G. Erion, and S.-I. Lee (2018). Consistent individualized feature
attribution for tree ensembles. arXiv preprint arXiv:1802.03888 .
Lundberg, S. M. and S.-I. Lee (2017). A unified approach to interpreting model predictions.
Advances in neural information processing systems 30.
Ročková, V. and S. Van der Pas (2020). Posterior concentration for bayesian regression
trees and forests. The Annals of Statistics 48 (4), 2108–2131.
Roy, D. M., Y. W. Teh, et al. (2008). The mondrian process. In NIPS, Volume 21.
Saha, A., S. Basu, and A. Datta (2021). Random forests for spatially dependent data.
Journal of the American Statistical Association, 1–19.
Scornet, E. (2016a). On the asymptotics of random forests. Journal of Multivariate Analy-
sis 146, 72–83.
Scornet, E. (2016b). Random forests and kernel methods. IEEE Transactions on Informa-
tion Theory 62 (3), 1485–1500.
Simonyan, K., A. Vedaldi, and A. Zisserman (2014). Deep inside convolutional networks:
visualising image classification models and saliency maps. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR). ICLR.
Stone, C. J. (1977). Consistent nonparametric regression. The annals of statistics, 595–620.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The
annals of statistics, 1040–1053.
Stone, C. J. (1985). Additive regression and other nonparametric models. The annals of
Statistics 13 (2), 689–705.
Strobl, C., A.-L. Boulesteix, and T. Augustin (2007). Unbiased split selection for classifi-
cation trees based on the gini index. Computational Statistics & Data Analysis 52 (1),
483–501.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional
variable importance for random forests. BMC bioinformatics 9, 1–11.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest
variable importance measures: Illustrations, sources and a solution. BMC bioinformat-
ics 8 (1), 1–21.
Syrgkanis, V. and M. Zampetakis (2020). Estimation and inference with trees and forests
in high dimensions. In Conference on learning theory, pp. 3453–3454. PMLR.
Tan, S., M. Soloviev, G. Hooker, and M. T. Wells (2020). Tree space prototypes: Another
look at making tree ensembles interpretable. In Proceedings of the 2020 ACM-IMS on
foundations of data science conference, pp. 23–34.
Tan, Y. S., A. Agarwal, and B. Yu (2022). A cautionary tale on fitting decision trees to
data from additive models: generalization lower bounds. In International Conference on
Artificial Intelligence and Statistics, pp. 9663–9685. PMLR.
Tan, Y. S., C. Singh, K. Nasseri, A. Agarwal, J. Duncan, O. Ronen, M. Epland, A. Kornblith,
and B. Yu (2025). Fast interpretable greedy-tree sums. Proceedings of the National
Academy of Sciences 122 (7), e2310151122.
Tang, C., D. Garreau, and U. von Luxburg (2018). When do random forests fail? Advances
in neural information processing systems 31.
Wager, S. and G. Walther (2015). Adaptive concentration of regression trees, with applica-
tion to random forests. arXiv preprint arXiv:1503.06388 .
Williamson, B. D., P. B. Gilbert, M. Carone, and N. Simon (2021). Nonparametric variable
importance assessment using machine learning techniques. Biometrics 77 (1), 9–22.