An Introduction To Boosting and Leveraging: 1 A Brief History of Boosting
f(x) = Σ_{t=1}^{T} α_t h_t(x). (1)
Here αt denotes the coefficient with which the ensemble member ht is combined;
both αt and the learner or hypothesis ht are to be learned within the Boosting
procedure.
The idea of Boosting has its roots in PAC learning (cf. [188]). Kearns and
Valiant [101] proved the astonishing fact that learners, each performing only
slightly better than random, can be combined to form an arbitrarily good en-
semble hypothesis (when enough data is available). Schapire [163] was the first
to provide a provably polynomial time Boosting algorithm, while [55] were the
first to apply the Boosting idea to a real-world OCR task, relying on neural
networks as base learners.
The AdaBoost (Adaptive Boosting) algorithm by [67, 68, 70] (cf. Algorithm 2.1) is generally considered a first step towards more practical Boosting algorithms. Very similar to AdaBoost is the Arcing algorithm, for which convergence to a linear programming solution can be shown (cf. [27]). Although
the Boosting scheme seemed intuitive in the context of algorithmic design, a
step forward in transparency was taken by explaining Boosting in terms of
a stage-wise gradient descent procedure in an exponential cost function (cf.
[27, 63, 74, 127, 153]). A further interesting step towards practical applicabil-
ity is worth mentioning: large parts of the early Boosting literature persistently
contained the misconception that Boosting would not overfit even when run-
ning for a large number of iterations. Simulations by [81, 153] on data sets with
higher noise content clearly showed overfitting effects, which can only be
avoided by regularizing Boosting so as to limit the complexity of the function
class (cf. Section 6).
When trying to develop means for achieving robust Boosting it is important
to elucidate the relations between Optimization Theory and Boosting procedures
(e.g. [71, 27, 48, 156, 149]). Developing this interesting relationship opened the
field to new types of Boosting algorithms. Among other options it now became
possible to rigorously define Boosting algorithms for regression (cf. [58, 149]),
multi-class problems [3, 155], unsupervised learning (cf. [32, 150]) and to estab-
lish convergence proofs for Boosting algorithms by using results from the The-
ory of Optimization. Further extensions to Boosting algorithms can be found in
[32, 150, 74, 168, 169, 183, 3, 57, 129, 53].
Recently, Boosting strategies have been quite successfully used in various
real-world applications. For instance [176] and earlier [55] and [112] used boosted
ensembles of neural networks for OCR. [139] proposed a non-intrusive monitoring
system for assessing whether certain household appliances consume power or not.
In [49] Boosting was used for tumor classification with gene expression data. For
further applications and more details we refer to the Applications Section 8.3
and to http://www.boosting.org/applications.
Although we attempt a full and balanced treatment of most available liter-
ature, naturally the presentation leans in parts towards the authors’ own work.
In fact, the Boosting literature, as will become clear from the bibliography, is so
extensive, that a full treatment would require a book length treatise. The present
review differs from other reviews, such as the ones of [72, 165, 166], mainly in the
choice of the presented material: we place more emphasis on robust algorithms
for Boosting, on connections to other margin based approaches (such as support
vector machines), and on connections to the Theory of Optimization. Finally,
we discuss applications and extensions of Boosting.
The content of this review is organized as follows. After presenting some ba-
sic ideas on Boosting and ensemble methods in the next section, a brief overview
of the underlying theory is given in Section 3. The notion of the margin and its role in Boosting are developed in Sections 3.4 and 4.
This section includes a brief definition of the general learning setting addressed
in this paper, followed by a discussion of different aspects of ensemble learning.
Given a training sample S = {(x_n, y_n)}_{n=1}^{N} drawn from an unknown distribution P(x, y), the goal is to estimate a function f such that f will correctly predict unseen examples (x, y). In the case where
Y = {−1, +1} we have a so-called hard classifier and the label assigned to an
input x is given by f (x). Often one takes Y = R, termed a soft classifier, in
which case the label assignment is performed according to sign(f (x)). The true
performance of a classifier f is assessed by
L(f) = ∫ λ(f(x), y) dP(x, y), (2)
where λ denotes a suitably chosen loss function. The risk L(f ) is often termed the
generalization error as it measures the expected loss with respect to examples
which were not observed in the training set. For binary classification one often
uses the so-called 0/1-loss λ(f (x), y) = I(yf (x) ≤ 0), where I(E) = 1 if the
event E occurs and zero otherwise. Other loss functions are often introduced
depending on the specific context.
Unfortunately the risk cannot be minimized directly, since the underlying
probability distribution P (x, y) is unknown. Therefore, we have to try to esti-
mate a function that is close to the optimal one based on the available informa-
tion, i.e. the training sample and properties of the function class F from which
the solution f is chosen. To this end, we need what is called an induction prin-
ciple. A particularly simple one consists of approximating the minimum of the
risk (2) by the minimum of the empirical risk
L̂(f) = (1/N) Σ_{n=1}^{N} λ(f(x_n), y_n). (3)
From the law of large numbers (e.g. [61]) one expects that L̂(f ) → L(f ) as N →
∞. However, in order to guarantee that the function obtained by minimizing L̂(f )
also attains asymptotically the minimum of L(f ) a stronger condition is required.
Intuitively, the complexity of the class of functions F needs to be controlled,
since otherwise a function f with arbitrarily low error on the sample may be
found, which, however, leads to very poor generalization. A sufficient condition
for preventing this phenomenon is the requirement that L̂(f ) converge uniformly
(over F) to L(f ); see [50, 191, 4] for further details.
While it is possible to provide conditions for the learning machine which
ensure that asymptotically (as N → ∞) the empirical risk minimizer will perform
optimally, for small sample sizes large deviations are possible and overfitting
might occur. Then a small generalization error cannot be obtained by simply
minimizing the training error (3). As mentioned above, one way to avoid the overfitting dilemma – called regularization – is to restrict the size of the function class F from which the function f is chosen [190, 4]. The intuition, which will
be formalized in Section 3, is that a “simple” function that explains most of the
data is preferable to a complex one which fits the data very well (Occam’s razor,
e.g. [21]).
For Boosting algorithms which generate a complex composite hypothesis as
in (1), one may be tempted to think that the complexity of the resulting function
class would increase dramatically when using an ensemble of many learners. How-
ever this is provably not the case under certain conditions (see e.g. Theorem 3),
as the ensemble complexity saturates (cf. Section 3). This insight may lead us to
think that since the complexity of Boosting is limited we might not encounter the
effect of overfitting at all. However when using Boosting procedures on noisy real-
world data, it turns out that regularization (e.g. [103, 186, 143, 43]) is mandatory
if overfitting is to be avoided (cf. Section 6). This is in line with the general expe-
rience in statistical learning procedures when using complex non-linear models,
for instance for neural networks, where a regularization term is used to appropri-
ately limit the complexity of the models (e.g. [2, 159, 143, 133, 135, 187, 191, 36]).
Before proceeding to discuss ensemble methods we briefly review the strong
and weak PAC models for learning binary concepts [188]. Let S be a sample
consisting of N data points {(x_n, y_n)}_{n=1}^{N}, where x_n are generated independently
at random from some distribution P (x) and yn = f (xn ), and f belongs to some
known class F of binary functions. A strong PAC learning algorithm has the
property that for every distribution P, every f ∈ F and every 0 ≤ ε, δ ≤ 1/2, with probability larger than 1 − δ, the algorithm outputs a hypothesis h such that Pr[h(x) ≠ f(x)] ≤ ε. The running time of the algorithm should be polynomial
in 1/ε, 1/δ, n, d, where d is the dimension (appropriately defined) of the input
space. A weak PAC learning algorithm is defined analogously, except that it is
only required to satisfy the conditions for particular ε and δ, rather than all
pairs. Various extensions and generalizations of the basic PAC concept can be found in [87, 102, 4].
Fig. 1. Illustration of AdaBoost on a 2D toy data set: The color indicates the label and the diameter is proportional to the weight of the examples in the first, second, third, 5th, 10th and 100th iteration. The dashed lines show the decision boundaries of the single classifiers (up to the 5th iteration). The solid line shows the decision line of the combined classifier. In the last two plots the decision line of Bagging is plotted for comparison. (Figure taken from [153].)
Algorithm 2.1 (AdaBoost)

1. Input: sample S = {(x_1, y_1), ..., (x_N, y_N)}, number of iterations T.
2. Initialize: d_n^{(1)} = 1/N for all n = 1, ..., N.
3. Do for t = 1, ..., T:
   a) Train the base learner on the weighted sample {S, d^{(t)}} and obtain hypothesis h_t.
   b) Calculate the weighted training error of h_t:
      ε_t = Σ_{n=1}^{N} d_n^{(t)} I(y_n ≠ h_t(x_n)),
   c) Set:
      α_t = (1/2) log((1 − ε_t)/ε_t)
   d) Update weights:
      d_n^{(t+1)} = d_n^{(t)} exp{−α_t y_n h_t(x_n)} / Z_t,
      where Z_t is a normalization constant, such that Σ_{n=1}^{N} d_n^{(t+1)} = 1.
4. Break if ε_t = 0 or ε_t ≥ 1/2 and set T = t − 1.
5. Output: f_T(x) = Σ_{t=1}^{T} (α_t / Σ_{r=1}^{T} α_r) h_t(x).
At each iteration t, the base learner is trained so as to (approximately) minimize the weighted training error

ε_t(h_t, d^{(t)}) = Σ_{n=1}^{N} d_n^{(t)} I(y_n ≠ h_t(x_n)). (4)
After selecting the hypothesis ht , its weight αt is computed such that it minimizes
a certain loss function (cf. step (3c)). In AdaBoost one minimizes
G^{AB}(α) = Σ_{n=1}^{N} exp{−y_n (α h_t(x_n) + f_{t−1}(x_n))}, (5)

where

f_{t−1}(x_n) = Σ_{r=1}^{t−1} α_r h_r(x_n). (6)

A related loss function, used in the context of logistic regression (cf. [74]), is

G^{LR}(α) = Σ_{n=1}^{N} log{1 + exp(−y_n (α h_t(x_n) + f_{t−1}(x_n)))}. (7)
An important feature of (7) as compared to (5) is that the former increases only
linearly for negative values of yn (αht (xn ) + ft−1 (xn )), while the latter increases
exponentially. It turns out that this difference is important in noisy situations,
where AdaBoost tends to focus too much on outliers and noisy data. In Section 6
we will discuss other techniques to approach this problem. Some further details
concerning LogitBoost will be given in Section 5.2.
For AdaBoost it has been shown [168] that αt in (5) can be computed analyt-
ically leading to the expression in step (3c) of Algorithm 2.1. Based on the new
combined hypothesis, the weighting d of the sample is updated as in step (3d) of Algorithm 2.1. The initial weighting d^{(1)} is chosen uniformly: d_n^{(1)} = 1/N.
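To make Algorithm 2.1 concrete, here is a minimal numpy sketch with axis-parallel decision stumps as base learners (the stump search and all function names are our own illustration, not part of the original text):

```python
import numpy as np

def train_stump(X, y, d):
    """Exhaustive search for the axis-parallel stump minimizing the
    weighted training error (4) under the weighting d."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                h = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(d[h != y])
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    return best_err, best_params

def predict_stump(params, X):
    j, thr, sign = params
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=100):
    N = len(y)
    d = np.full(N, 1.0 / N)                    # step 2: d_n^(1) = 1/N
    alphas, stumps = [], []
    for t in range(T):
        eps, params = train_stump(X, y, d)     # steps 3a/3b
        if eps == 0 or eps >= 0.5:             # step 4: break condition
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # step 3c
        h = predict_stump(params, X)
        d = d * np.exp(-alpha * y * h)         # step 3d ...
        d /= d.sum()                           # ... normalization by Z_t
        alphas.append(alpha)
        stumps.append(params)
    a = np.array(alphas)

    def f_T(X_new):                            # step 5: convex combination
        H = np.array([predict_stump(p, X_new) for p in stumps])
        return (a / a.sum()) @ H
    return f_T
```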
The so-called Arcing algorithms proposed in [27] are similar to AdaBoost, but use a different loss function.¹ Each loss function leads to different weighting
schemes of the examples and hypothesis coefficients. One well-known example
is Arc-X4 where one uses a fourth order polynomial to compute the weight of
the examples. Arcing is the predecessor of the more general leveraging algorithms discussed in Section 5.2. Moreover, one variant, called Arc-GV, has been shown [27] to find the linear combination that solves a linear optimization problem (LP) (cf. Section 4).
The AdaBoost algorithm presented above is based on using binary (hard)
weak learners. In [168] and much subsequent work, real-valued (soft) weak learn-
ers were used. One may also find the hypothesis weight αt and the hypothesis
ht in parallel, such that (5) is minimized. Many other variants of Boosting al-
gorithms have emerged over the past few years, many of which operate very
similarly to AdaBoost, even though they often do not possess the PAC-boosting
property. The PAC-boosting property refers to schemes which are able to guar-
antee that weak learning algorithms are indeed transformed into strong learning
algorithms in the sense described in Section 2.1. Duffy and Helmbold [59] reserve
the term ‘boosting’ for algorithms for which the PAC-boosting property can be
proved to hold, while using ‘leveraging’ in all other cases. Since we are not overly
concerned with PAC learning per se in this review, we use the terms ‘boosting’
and ‘leveraging’ interchangeably.
In principle, any leveraging approach iteratively selects one hypothesis h_t ∈ H at a time and then updates the hypothesis weights; this can be implemented in different
ways. Ideally, given {h1 , . . . , ht }, one solves the optimization problem for all
hypothesis coefficients {α1 , . . . , αt }, as proposed by [81, 104, 48]. In contrast to
this, the greedy approach is used by the original AdaBoost/Arc-GV algorithm
as discussed above: only the weight of the last hypothesis is selected, while
minimizing some appropriate cost function [27, 74, 127, 153]. Later we will show
in Sections 4 and 5 that this relates to barrier optimization techniques [156]
and coordinate descent methods [197, 151]. Additionally, we will briefly discuss
relations to information geometry [24, 34, 110, 104, 39, 147, 111] and column
generation techniques (cf. Section 6.2).
An important issue in the context of the algorithms discussed in this review
pertains to the construction of the weak learner (e.g. step (3a) in Algorithm 2.1).
¹ Any convex loss function is allowed that goes to infinity when the margin goes to minus infinity and to zero when the margin goes to infinity.
At step t, the weak learner is constructed based on the weighting d(t) . There are
basically two approaches to taking the weighting into consideration. In the first
approach, we assume that the learning algorithm can operate with reweighted
examples. For example, if the weak learner is based on minimizing a cost func-
tion (see Section 5), one can construct a revised cost function which assigns a
weight to each of the examples, so that heavily weighted examples are more in-
fluential. For example, for the least squares error one may attempt to minimize
N (t) 2
n=1 dn (yn − h(xn )) . However, not all weak learners are easily adaptable to
the inclusion of weights. An approach proposed in [68] is based on resampling
the data with replacement based on the distribution d(t) . The latter approach
is more general as it is applicable to any weak learner; however, the former ap-
proach has been more widely used in practice. Friedman [73] has also considered
sampling based approaches within the general framework described in Section 5.
He found that in certain situations (small samples and powerful weak learners) it
is advantageous to sample a subset of the data rather than the full data set itself,
where different subsets are sampled at each iteration of the algorithm. Overall,
however, it is not yet clear whether there is a distinct advantage to using either
one of the two approaches.
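Both options are straightforward to sketch (hypothetical helper names; the weighted mean serves as the simplest possible "weak regressor"):

```python
import numpy as np

# Approach 1: reweighting -- e.g. the weighted least squares cost
# sum_n d_n (y_n - h(x_n))^2; over constant functions it is minimized
# by the d-weighted mean of the labels.
def weighted_mean_learner(y, d):
    return np.sum(d * y) / np.sum(d)

# Approach 2: resampling -- draw a bootstrap sample with replacement
# according to the distribution d^(t) and run an unweighted learner
# on the resampled set [68].
def resample(X, y, d, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(y), size=len(y), replace=True, p=d)
    return X[idx], y[idx]
```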
A weak learner is required to perform only slightly better than random guessing; formally, its weighted error should satisfy

ε(h, d) = Σ_{n=1}^{N} d_n I[y_n ≠ h(x_n)] ≤ 1/2 − γ, (γ > 0). (8)
For simple weak learners, it may not be possible to find a strictly positive
value of γ for which (8) holds without making any assumptions about the
data. For example, consider the two-dimensional xor problem with N = 4,
x_1 = (−1, −1), x_2 = (+1, +1), x_3 = (−1, +1), x_4 = (+1, −1), and corresponding
labels {−1, −1, +1, +1}. If the weak learner h is restricted to be an axis-parallel
half-space, it is clear that no such h can achieve an error smaller than 1/2 for a
uniform weighting over the examples.
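This can be verified by brute force over all relevant stumps (a small self-contained check):

```python
import numpy as np

X = np.array([[-1, -1], [+1, +1], [-1, +1], [+1, -1]])
y = np.array([-1, -1, +1, +1])           # the xor labeling
d = np.full(4, 0.25)                     # uniform weighting

best = 1.0
for j in (0, 1):                         # coordinate defining the half-space
    for thr in (-2.0, 0.0, 2.0):         # thresholds below/between/above the points
        for sign in (+1, -1):
            h = sign * np.where(X[:, j] <= thr, 1, -1)
            best = min(best, float(np.sum(d[h != y])))
print(best)                              # 0.5: no stump beats random guessing
```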
We consider two situations, where it is possible to establish the strict positiv-
ity of γ, and to set a lower bound on its value. Consider a mapping f from the bi-
nary cube X = {−1, +1}d to Y = {−1, 1}, and assume the true labels yn are given
by yn = f (xn ). We wish to approximate f by combinations of binary hypotheses
ht belonging to H. Intuitively we expect that a large edge γ can be achieved if
we can find a weak hypothesis h which correlates well with f (cf. Section 4.1).
Let H be a class of binary hypotheses (Boolean functions), and let D be a distri-
bution over X. The correlation between f and H, with respect to D, is given by
CH,D (f ) = suph∈H ED {f (x)h(x)}. The distribution-free correlation between f
and H is given by C_H(f) = inf_D C_{H,D}(f). [64] shows that if T > 2 log(2) d C_H(f)^{−2}, then f can be represented exactly as f(x) = sign(Σ_{t=1}^{T} h_t(x)). In other words, if H is highly correlated with the target function f, then f can be exactly represented as a convex combination of a small number of functions from H. Hence,
after a sufficiently large number of Boosting iterations, the empirical error can
be expected to approach zero. Interestingly, this result is related to the Min-Max
theorem presented in Section 4.1.
Fig. 2. A single convex set containing the positively labeled examples, separated from the negative examples by a gap of η.
The results of [64] address only the case of Boolean functions and a known
target function f . An important question that arises relates to the establishment
of geometric conditions for which the existence of a weak learner can be guaran-
teed. Consider the case where the input patterns x belong to R^d, and let H, the class of weak learners, consist of linear classifiers of the form sign(⟨w, x⟩ + b). It is not hard to show [122] that for any distribution d over a training set of
The margin of a function f at an example (x_n, y_n) is defined as

ρ_n(f) = y_n f(x_n). (9)
Let f be a real-valued function taking values in [−1, +1]. The empirical margin
error is defined as
L̂^θ(f) = (1/N) Σ_{n=1}^{N} ϕ_θ(y_n f(x_n)). (10)
It is obvious from the definition that the classification error, namely the fraction
of misclassified examples, is given by θ = 0, i.e. L̂(f ) = L̂0 (f ). In addition,
L̂θ (f ) is monotonically increasing in θ. We note that one often uses the so-called
0/1-margin error defined by
L̃^θ(f) = (1/N) Σ_{n=1}^{N} I(y_n f(x_n) ≤ θ).
Noting that ϕθ (yf (x)) ≤ I(yf (x) ≤ θ), it follows that L̂θ (f ) ≤ L̃θ (f ). Since we
use L̂θ (f ) as part of an upper bound on the generalization error in Section 3.4,
the bound we obtain using it is tighter than would be obtained using L̃θ (f ).
The following bound on the empirical margin error of AdaBoost in terms of the edges γ_t of the base hypotheses holds:

L̂^θ(f_T) ≤ Π_{t=1}^{T} (1 − γ_t)^{(1−θ)/2} (1 + γ_t)^{(1+θ)/2}. (11)
Proof. We present a proof from [167] for the case where ht ∈ {−1, +1}. We begin
by showing that for every {αt }
L̃^θ(f_T) ≤ exp(θ Σ_{t=1}^{T} α_t) Π_{t=1}^{T} Z_t. (12)
By definition,

Z_t = Σ_{n=1}^{N} d_n^{(t)} e^{−y_n α_t h_t(x_n)} = Σ_{n: y_n = h_t(x_n)} d_n^{(t)} e^{−α_t} + Σ_{n: y_n ≠ h_t(x_n)} d_n^{(t)} e^{α_t},

which we rewrite as Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}.
Moreover, observe that for any θ,

I[y f_T(x) ≤ θ] ≤ exp(−y Σ_{t=1}^{T} α_t h_t(x) + θ Σ_{t=1}^{T} α_t). (13)
Note that

d_n^{(T+1)} = d_n^{(T)} exp(−α_T y_n h_T(x_n)) / Z_T = exp(−Σ_{t=1}^{T} α_t y_n h_t(x_n)) / (N Π_{t=1}^{T} Z_t) (by induction). (14)
Using (13) and (14) we find that

L̃^θ(f_T) = (1/N) Σ_{n=1}^{N} I[y_n f_T(x_n) ≤ θ]
 ≤ (1/N) Σ_{n=1}^{N} exp(−y_n Σ_{t=1}^{T} α_t h_t(x_n) + θ Σ_{t=1}^{T} α_t)
 = exp(θ Σ_{t=1}^{T} α_t) (1/N) Σ_{n=1}^{N} exp(−y_n Σ_{t=1}^{T} α_t h_t(x_n))
 = exp(θ Σ_{t=1}^{T} α_t) (Π_{t=1}^{T} Z_t) Σ_{n=1}^{N} d_n^{(T+1)}
 = exp(θ Σ_{t=1}^{T} α_t) Π_{t=1}^{T} Z_t,

where the last step uses Σ_{n=1}^{N} d_n^{(T+1)} = 1; this establishes (12).
Next, set αt = (1/2) log((1 − εt )/εt ) as in Algorithm 2.1, which easily implies
that
Z_t = 2 √(ε_t (1 − ε_t)).
Substituting this result into (12) we find that

L̃^θ(f_T) ≤ Π_{t=1}^{T} √(4 ε_t^{1−θ} (1 − ε_t)^{1+θ}),
which yields the desired result upon recalling that εt = 1/2 − γt /2, and noting
that L̂θ (f ) ≤ L̃θ (f ).
In the special case that θ = 0, i.e. one considers the training error, we find that

L̂(f_T) ≤ e^{−Σ_{t=1}^{T} γ_t²/2},

from which we infer that the condition Σ_{t=1}^{T} γ_t² → ∞ suffices to guarantee that L̂(f_T) → 0. For example, the condition γ_t ≥ c/√t suffices for this.² Clearly
² When one uses binary valued hypotheses, then γ_t ≥ c/t is already sufficient to achieve a margin of at least zero on all training examples (cf. [146], Lemma 2.2, 2.3).
this holds if γt ≥ γ0 > 0 for some positive constant γ0 . In fact, in this case
L̂θ (fT ) → 0 for any θ ≤ γ0 /2.
In general, however, one may not be interested in the convergence of the
empirical error to zero, due to overfitting (see discussion in Section 6). Sufficient
conditions for convergence of the error to a nonzero value can be found in [121].
where I[E] = 1 if the event E occurs and zero otherwise. In this case it is easy to see that L(f) = Pr[y ≠ f(x)], namely the probability of misclassification.
A classic result by Vapnik and Chervonenkis relates the empirical classification error of a binary hypothesis f to the probability of error Pr[y ≠ f(x)]. For binary functions f we use the notation P(y ≠ f(x)) for the probability of error and P̂(y ≠ f(x)) for the empirical classification error. Before presenting
the result, we need to define the VC-dimension of a class of binary hypotheses
F. Consider a set of N points X = (x1 , . . . , xN ). Each binary function f defines
a dichotomy (f (x1 ), . . . , f (xN )) ∈ {−1, +1}N on these points. Allowing f to
run over all elements of F, we generate a subset of the binary N -dimensional
cube {−1, +1}N , denoted by FX , i.e. FX = {(f (x1 ), . . . , f (xN )) : f ∈ F}. The
VC-dimension of F, VCdim(F), is defined as the maximal cardinality of the set
of points X = (x1 , . . . , xN ) for which |FX | = 2N . Good estimates of the VC
dimension are available for many classes of hypotheses. For example, in the d-
dimensional space Rd one finds for hyperplanes a VC dimension of d + 1, while
for rectangles the VC dimension is 2d. Many more results and bounds on the
VC dimension of various classes can be found in [50] and [4].
We present an improved version of the classic VC bound, taken from [11].
where the expectation is taken with respect to both {σn } and {xn }. The
Rademacher complexity has proven to be essential in the derivation of effec-
tive generalization bounds (e.g. [189, 11, 108, 10]).
The basic intuition behind the definition of RN (F) is its interpretation as
a measure of correlation between the class F and a random set of labels. For
very rich function classes F we expect a large value of RN (F) while small func-
tion classes can only achieve small correlations with random labels. In the
special case that the class F consists of binary functions, one can show that
R_N(F) = O(√(VCdim(F)/N)). For real-valued functions, one needs to extend the
Corollary 1. Let the conditions of Theorem 3 hold, and set F = coT (H). Then,
for any integer N , with probability at least 1 − δ over samples of length N , every
f ∈ coT (H) satisfies
L(f) ≤ L̂^θ(f) + 4 R_N(H)/θ + √(log(2/δ)/(2N)). (17)
Had we used a bound of the form of (15) for f ∈ coT (H), we would have obtained
a bound depending on Pdim(coT (H)), which often grows linearly in T , leading to
inferior results. The important observation here is that the complexity penalty
in (17) is independent of T , the number of Boosting iterations.
As a final comment we add that a considerable amount of recent work has
been devoted to the derivation of so-called data-dependent bounds (e.g. [5, 9,
92]), where the second term on the r.h.s. of (16) is made to depend explicitly on
the data. Data-dependent bounds depending explicitly on the weights αt of the
weak learners are given in [130]. In addition, bounds which take into account
the full margin distribution are presented in [180, 45, 181]. Such results are
particularly useful for the purpose of model selection, but are beyond the scope
of this review.
3.5 Consistency
The bounds presented in Theorems 2 and 3, depend explicitly on the data, and
are therefore potentially useful for the purpose of model selection. However,
an interesting question regarding the statistical consistency of these procedures
arises. Consider a generic binary classification problem, characterized by the class
conditional distribution function P (y|x), where τ (x) = P (y = 1|x) is assumed
to belong to some target class of functions T. It is well known [50] that in this
case the optimal classifier is given by the Bayes classifier f_B(x) = sign(τ(x) − 1/2),
leading to the minimal error LB = L(fB ). We say that an algorithm is strongly
consistent with respect to T if, based on a sample of size N , it generates an
empirical classifier fˆN for which L(fˆN ) → LB almost surely for N → ∞, for every
τ ∈ T. While consistency may seem to be mainly of theoretical significance, it
is reassuring to have the guarantee that a given procedure ultimately performs
optimally. However, it turns out that in many cases inconsistent procedures
perform better for finite amounts of data, than consistent ones. A classic example
of this is the so-called James-Stein estimator ([96, 160], Section 2.4.5).
In order to establish consistency one needs to assume (or prove in specific
cases) that as T → ∞ the class of functions coT (H) is dense in T. The consistency
of Boosting algorithms has recently been established in [116, 123], following
related previous work [97]. The work of [123] also includes rates of convergence
for specific weak learners and target classes T. We point out that the full proof
of consistency must tackle at least three issues. First, it must be shown that
the specific algorithm used converges as a function of the number of iterations
- this is essentially an issue of optimization. Furthermore, one must show that
the function to which the algorithm converges itself converges to the optimal
estimator as the sample size increases - this is a statistical issue. Finally, the
approximation theoretic issue of whether the class of weak learners is sufficiently
powerful to represent the underlying decision boundary must also be addressed.
The edge of a hypothesis h with respect to a weighting d is defined as γ(h, d) = Σ_{n=1}^{N} d_n y_n h(x_n); it is an affine transformation of ε(h, d) in the case when h(x) ∈ {−1, +1}: ε(h, d) = 1/2 − (1/2) γ(h, d) [68, 27, 168]. Recall from Section 3.2 that the margin of a function f for a given example (x_n, y_n) is defined by ρ_n(f) = y_n f(x_n) (cf. (9)).
Thus, the minimal edge γ* that can be achieved over all possible weightings d of the training set is equal to the maximal margin ρ* of a combined hypothesis from H. We refer to the left hand side of (19) as the edge minimization problem.
This problem can be rewritten as a linear program (LP):
min_{γ, d} γ
s.t. Σ_{n=1}^{N} d_n = 1 and Σ_{n=1}^{N} d_n y_n h̃(x_n) ≤ γ ∀h̃ ∈ H, (20)
     d ∈ R^N, d ≥ 0.
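For a finite hypothesis set, (20) can be solved with any LP solver; the following is a sketch using scipy.optimize.linprog (variable layout and names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def min_edge(U):
    """Solve LP (20): minimize gamma s.t. U d <= gamma, sum d = 1, d >= 0.
    U is a (J, N) matrix with U[j, n] = y_n * h_j(x_n)."""
    J, N = U.shape
    c = np.r_[np.zeros(N), 1.0]            # variables z = (d_1..d_N, gamma)
    A_ub = np.c_[U, -np.ones(J)]           # U d - gamma <= 0
    b_ub = np.zeros(J)
    A_eq = np.r_[np.ones(N), 0.0][None]    # sum_n d_n = 1
    b_eq = [1.0]
    bounds = [(0, None)] * N + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:N], res.x[-1]            # optimal weighting d and gamma*
```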
The combined hypothesis is a weighted combination of the hypotheses in H = {h̃_1, ..., h̃_J}:

f_w(x) = Σ_{j=1}^{J} w_j h̃_j(x). (21)
Remark 1. Theorem 4 was stated for finite hypothesis sets H. However, the
same result holds also for countable hypothesis classes. For uncountable classes,
one can establish the same results under some regularity conditions on H,
in particular that the real-valued hypotheses h are uniformly bounded (cf.
[93, 149, 157, 147]).
After t iterations of Algorithm 2.1, the accumulated weight of the j-th hypothesis is

w_j^t = Σ_{r=1}^{t} α_r I(h_r = h̃_j), j = 1, . . . , J. (22)
It is easy to verify that Σ_{r=1}^{t} α_r h_r(x) = Σ_{j=1}^{J} w_j^t h̃_j(x). Also note that if the α's are positive (as in Algorithm 2.1), then ‖w‖₁ = ‖α‖₁ holds.
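The conversion (22) from round-wise coefficients α_r to hypothesis weights w_j is a simple accumulation; a minimal sketch:

```python
from collections import defaultdict

def alphas_to_w(alphas, hypothesis_ids):
    """Accumulate (22): w_j^t is the sum of alpha_r over rounds r in which
    the base learner returned the j-th hypothesis of the finite set H.
    `hypothesis_ids` holds a hashable identifier per round."""
    w = defaultdict(float)
    for alpha, j in zip(alphas, hypothesis_ids):
        w[j] += alpha
    return dict(w)
```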
Margins have been used frequently in the context of Support Vector Machines
(SVMs) [22, 41, 190] and Boosting. These so-called large margin algorithms fo-
cus on generating hyperplanes/functions with large margins on most training
examples. Let us therefore study some properties of the maximum margin hy-
perplane and discuss some consequences of using different norms for measuring
the margin (see also Section 6.3).
Suppose we are given N examples in some space F: S = {(x_n, y_n)}_{n=1}^{N}, where (x_n, y_n) ∈ F × {−1, 1}. Note that we use x as elements of the feature space F instead of x as elements of the input space X (details below). We are interested in the separation of the training set using a hyperplane A through the origin⁴, A = {x | ⟨x, w⟩ = 0}, in F determined by some vector w, which is assumed to
be normalized with respect to some norm. The ℓ_p-norm margin of an example (x_n, y_n) with respect to the hyperplane A is defined as

ρ_n^p(w) := y_n ⟨x_n, w⟩ / ‖w‖_p,
where the superscript p ∈ [1, ∞] specifies the norm with respect to which w is
normalized (the default is p = 1). A positive margin corresponds to a correct
classification. The margin of the hyperplane A is defined as the minimum margin
over all N examples, ρp (w) = minn ρpn (w).
To maximize the margin of the hyperplane one has to solve the following
convex optimization problem [119, 22]:
max_w ρ^p(w) = max_w min_{1≤n≤N} y_n ⟨x_n, w⟩ / ‖w‖_p. (23)
³ For simplicity, we have omitted the normalization implicit in Theorem 4.
⁴ This can easily be generalized to general hyperplanes by introducing a bias term.
The form of (23) implies that without loss of generality we may take ‖w‖_p = 1, leading to the following convex optimization problem:

max_{ρ, w} ρ
s.t. y_n ⟨x_n, w⟩ ≥ ρ, n = 1, 2, . . . , N (24)
     ‖w‖_p = 1.
For normalized w, the margin measures the ℓ_q-distance of an example to the hyperplane A, where q is the dual norm with 1/p + 1/q = 1 (cf. Theorem 5):

|⟨x, w⟩| / ‖w‖_p = ‖x − A‖_q. (25)
Boosting (p = 1). In the case of Boosting, the space F is spanned by the base
hypotheses. One considers the set of base hypotheses that could be generated by
the base learner – the base hypothesis set – and constructs a mapping Φ from
the input space X to the “feature space” F:
Φ(x) = (h̃_1(x), h̃_2(x), . . .)ᵀ : X → F. (26)
The combined hypothesis then corresponds to a hyperplane in F determined by w, as used before in Algorithm 2.1, where we used the ℓ₁-norm for normalization and without loss of generality assumed that the w's and α's are non-negative (assuming H is closed under negation). By Theorem 5, one therefore maximizes the ℓ∞-norm distance of the mapped examples to the hyperplane in the feature
space F. Under mild assumptions, one can show that the maximum margin
hyperplane is aligned with most coordinate axes in the feature space, i.e. many
wj ’s are zero (cf. Figure 3 and Section 6.3).
Support Vector Machines (p = 2). Here, the feature space F is implicitly de-
fined by a Mercer kernel k(x, y) [131], which computes the inner product of two
examples in the feature space F [22, 175]. One can show that for every such
kernel there exists a mapping Φ : X → F, such that k(x, y) = ⟨Φ(x), Φ(y)⟩ for all x, y ∈ X. Additionally, one uses the ℓ₂-norm for normalization and, hence, the Euclidean distance to measure distances between the mapped examples and
the hyperplane in the feature space F:
ρ_n²(w) = y_n ⟨Φ(x_n), w⟩ / ‖w‖₂ = y_n Σ_{i=1}^{N} β_i k(x_n, x_i) / √(Σ_{i,j=1}^{N} β_i β_j k(x_i, x_j)),

where we used the Representer Theorem [103, 172], which shows that the maximum margin solution w can be written as a sum of the mapped examples, i.e. w = Σ_{n=1}^{N} β_n Φ(x_n).
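Given a kernel matrix K and coefficients β, the ℓ₂-normalized margins can be computed directly (a sketch under our own naming):

```python
import numpy as np

def l2_margins(K, beta, y):
    """rho_n^2(w) for w = sum_i beta_i Phi(x_i): the numerator is
    y_n <Phi(x_n), w> = y_n (K beta)_n and ||w||_2 = sqrt(beta' K beta)."""
    return y * (K @ beta) / np.sqrt(beta @ K @ beta)
```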
We have seen in Section 3.4 that a large value of the margin is conducive to good
generalization, in the sense that if a large margin can be achieved with respect
to the data, then an upper bound on the generalization error is small (see also
discussion in Section 6). This observation motivates searching for algorithms
which maximize the margin.
The algorithm makes progress, if each of the products on the right hand side of
(27) is smaller than one.
Suppose we would like to reach a margin θ on all training examples, where we obviously need to assume θ ≤ ρ* (here ρ* is defined in (19)). The question arises as to which sequence {θ_t}_{t=1}^{T} one should use to find such a combined hypothesis in as few iterations as possible (according to (27)). One can show that the right hand side of (27) is minimized for θ_t = θ and, hence, one should always use this choice, independent of how the base learner performs.
Using Proposition 1, we can determine an upper bound on the number of iterations needed by AdaBoost_θ for achieving a margin of θ on all examples, given that the maximum margin is ρ* (cf. [167, 157]):

Corollary 2. Assume the base learner always achieves an edge γ_t ≥ ρ*. If 0 ≤ θ ≤ ρ* − ν, ν > 0, then AdaBoost_θ will converge to a solution with margin of at least θ on all examples in at most 2 log(N)/(ν²(1 − θ²)) + 1 steps.
Maximal Margins. Using the methodology reviewed so far, we can also analyze
to what value the maximum margin of the original AdaBoost algorithm converges
asymptotically. First, we state a lower bound on the margin that is achieved by
AdaBoost_θ. We find that the size of the margin is not as large as it could be
theoretically based on Theorem 4. We briefly discuss below two algorithms that
are able to maximize the margin.
As long as each factor on the r.h.s. of (27) is smaller than 1, the bound
decreases. If it is bounded away from 1, then it converges exponentially fast to
zero. The following corollary considers the asymptotic case and provides a lower
bound on the margin when running AdaBoost forever.
Corollary 3 ([153]). Assume AdaBoost_θ generates hypotheses h_1, h_2, . . . with edges γ_1, γ_2, . . . . Let γ = inf_{t=1,2,...} γ_t and assume γ > θ. Then the smallest margin ρ of the combined hypothesis is asymptotically (t → ∞) bounded from below by

ρ ≥ (log(1 − θ²) − log(1 − γ²)) / (log((1 + γ)/(1 − γ)) − log((1 + θ)/(1 − θ))) ≥ (γ + θ)/2. (28)

From (28) one can understand the interaction between θ and γ: if the difference between γ and θ is small, then the middle term of (28) is small. Thus, if θ is large (assuming θ ≤ γ), then ρ must be large, i.e. choosing a larger θ results in a larger margin on the training examples.
Remark 2. Under the conditions of Corollary 3, one can compute a lower bound
on the hypothesis coefficients in each iteration. Hence the sum of the hypothesis
coefficients will increase to infinity at least linearly. It can be shown that this
suffices to guarantee that the combined hypothesis has a large margin, i.e. larger than θ (cf. [147], Section 2.3).
However, in Section 4.1 we have shown that the maximal achievable margin is at least γ. Thus if θ is chosen to be too small, then we guarantee only a suboptimal asymptotic margin. In the original formulation of AdaBoost we have θ = 0 and we guarantee only that AdaBoost_0 achieves a margin of at least γ/2.⁶ This gap in the theory led to the so-called Arc-GV algorithm [27] and the Marginal AdaBoost algorithm [157].
Arc-GV. The idea of Arc-GV [27] is to set θ_t to the margin that is currently achieved by the combined hypothesis, i.e. depending on the previous performance of the base learner:⁷

θ_t = max{ θ_{t−1}, min_{n=1,...,N} y_n f_{t−1}(x_n) }. (29)
We would like to point out how the discussed algorithms can be seen in the
context of barrier optimization. This illustrates how, from an optimization point
of view, Boosting algorithms are related to linear programming, and provide
further insight as to why they generate combined hypotheses with large margins.
The idea of barrier techniques is to solve a sequence of unconstrained op-
timization problems in order to solve a constrained optimization problem (e.g.
[77, 136, 40, 156, 147]). We show that the exponential function acts as a barrier
function for the constraints in the maximum margin LP (obtained from (24) by
setting p = 1 and restricting wj ≥ 0):
max_{ρ, w} ρ
s.t. y_n ⟨x_n, w⟩ ≥ ρ, n = 1, 2, . . . , N (30)
     w_j ≥ 0, Σ_j w_j = 1.
Following the standard methodology and using the exponential barrier function⁸ β exp(−z/β) [40], we find the barrier objective for problem (30):

F_β(w, ρ) = −ρ + β Σ_{n=1}^{N} exp[ (1/β) (ρ − y_n f_w(x_n)/Σ_j w_j) ], (31)
which is minimized with respect to ρ and w for some fixed β. We denote the
optimum by wβ and ρβ . One can show that any limit point of (wβ , ρβ ) for β → 0
is a solution of (30) [136, 19, 40]. This also holds when one only has a sequence
of approximate minimizers and the approximation becomes better for decreasing
β [40]. Additionally, the quantities d̃_n = exp(ρ_β Σ_j w_j − y_n f_{w_β}(x_n))/Z (where Z is chosen such that Σ_n d̃_n = 1) converge for β → 0 to the optimal Lagrange multipliers d_n* of the constraints in (30) (under mild assumptions, see [40], Theorem 2, for details). Hence, they are a solution of the edge minimization problem (20).
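The barrier objective (31) is easy to write down and minimize numerically for a decreasing sequence of β; a sketch under our own naming, with H[n, j] = h̃_j(x_n):

```python
import numpy as np
from scipy.optimize import minimize

def barrier_objective(params, beta, H, y):
    """F_beta(w, rho) of (31), with margins normalized by the 1-norm of w."""
    w, rho = params[:-1], params[-1]
    margins = y * (H @ w) / np.sum(w)
    return -rho + beta * np.sum(np.exp((rho - margins) / beta))

def barrier_path(H, y, betas=(1.0, 0.1, 0.01)):
    """Solve a sequence of unconstrained problems while annealing beta -> 0."""
    N, J = H.shape
    x0 = np.r_[np.full(J, 1.0 / J), 0.0]   # start at (uniform w, rho = 0)
    for beta in betas:
        res = minimize(barrier_objective, x0, args=(beta, H, y),
                       bounds=[(1e-8, None)] * J + [(None, None)])
        x0 = res.x
    return x0[:-1], x0[-1]                  # approximate (w, rho)
```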
To see the connection between Boosting and the barrier approach, we choose β = (Σ_j w_j)^{−1}. If the sum of the hypothesis coefficients increases to infinity (cf. Remark 2 in Section 4.3), then β converges to zero. Plugging this choice of β into (31) one obtains: −ρ + (Σ_j w_j)^{−1} Σ_{n=1}^{N} exp(ρ Σ_j w_j − y_n f_w(x_n)).
Without going into further details [147, 150], one can show that this function
⁸ Other barrier functions are β log(z) (log-barrier) and βz log(z) (entropic barrier).
A linear convergence of the form G[f_{t+1}] − G[f*] ≤ η(G[f_t] − G[f*]) for some fixed η ∈ [0, 1) has been shown. This means that the deviation from the minimum loss converges exponentially to zero (in the number
of iterations). Similar convergence rates have been proven for AdaBoost in the
special case of separable data (cf. Section 4.3 and [70]). In the general case these
rates can only be shown when the hypothesis set is finite. However, in practice
one often uses infinite sets (e.g. hypotheses parameterized by some real valued
parameters). Recently, Zhang [199] has shown order one convergence for such
algorithms, i.e. G[ft+1 ] − G[f ∗ ] ≤ c/t for some fixed c > 0 depending on the loss
function.
5.1 Preliminaries
In this section and the next one we show that AdaBoost, Logistic Regression and
many other leveraging algorithms generate a combined hypothesis f minimizing
a particular loss G on the training sample.
The composite hypothesis f_w is a linear combination of the base hypotheses:

f_w ∈ lin(H) := { Σ_{h̃∈H} w_h̃ h̃ | h̃ ∈ H, w_h̃ ∈ R },
where the w’s are the combination coefficients. Often one would like to find
a combined function with small classification error, hence one would like to
minimize the 0/1-loss:
G^{0/1}(f_w, S) := Σ_{n=1}^{N} I(y_n ≠ sign(f_w(x_n))).
However, since this loss is non-convex and not even differentiable, the problem of
finding the best linear combination is a very hard problem. In fact, the problem is
provably intractable [98]. One idea is to use another loss function which bounds
the 0/1-loss from above. For instance, AdaBoost employs the exponential loss
G^{AB}(f_w, S) := Σ_{n=1}^{N} exp(−y_n f_w(x_n)),

while LogitBoost employs the logistic loss

G^{LR}(f_w, S) := Σ_{n=1}^{N} log₂(1 + exp(−y_n f_w(x_n))).
As can be seen in Figure 4, both loss functions bound the classification error
G0/1 (f, S) from above. Other loss functions have been proposed in the literature,
e.g. [74, 127, 154, 196].
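The bounding property is easy to check numerically (a small sketch; the base-2 logistic loss as in G^{LR}):

```python
import numpy as np

def loss_01(z):        # 0/1-loss on the margin z = y f(x)
    return (z <= 0).astype(float)

def loss_exp(z):       # AdaBoost's exponential loss, cf. G^AB
    return np.exp(-z)

def loss_logistic(z):  # logistic loss, cf. G^LR (base-2 as in the text)
    return np.log2(1 + np.exp(-z))

z = np.linspace(-1.5, 2.0, 8)
assert np.all(loss_exp(z) >= loss_01(z))       # both bound the 0/1-loss ...
assert np.all(loss_logistic(z) >= loss_01(z))  # ... from above
```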
It can be shown [116, 196, 28] that in the infinite sample limit, where the sample average converges to the expectation, minimizing either G^{AB}(f_w, S) or G^{LR}(f_w, S) yields a function whose sign agrees with the Bayes-optimal classifier.
Fig. 4. The 0/1-loss, squared loss, logistic loss and hinge loss as functions of the margin yf(x).
More generally, one considers loss functions of the form

G(f, S) := Σ_{n=1}^{N} g(f(x_n), y_n), (32)
and the corresponding optimization problem

min_{f ∈ lin(H)} G(f, S) = min_w Σ_{n=1}^{N} g(f_w(x_n), y_n). (33)
In step (2) of Algorithm 5.2 the example weights are initialized using the derivative g′ of the loss function g with respect to the first argument at (f_0(x_n), y_n) (i.e. at (0, y_n)).
At each step t of Algorithm 5.2 one can compute the weight w_j^t assigned to each base hypothesis h̃_j, such that (cf. (22))

f_{w^t} := f_t = Σ_{h̃∈H} w_h̃^t h̃ = Σ_{r=1}^{t} α_r h_r.
In step (3a) the base learner is assumed to return the hypothesis with the largest absolute gradient component (cf. (34) and (35) below). This is a rather restrictive assumption, being one among many that can be made (cf. [168, 199, 152]). We will later significantly relax this condition (cf. (40)).
Let us discuss why the choice in (34) is useful. For this we compute the
gradient of the loss function with respect to the weight wh̃ of each hypothesis h̃
in the hypothesis set:
∂G(f_{w^t}, S)/∂w_h̃ = Σ_{n=1}^{N} g′(f_t(x_n), y_n) h̃(x_n) = Σ_{n=1}^{N} d_n^t h̃(x_n). (35)
Hence, h_t is the coordinate that has the largest component in the direction of the gradient vector of G evaluated at w^t.¹⁰ If one optimizes this coordinate one would expect to achieve a large reduction in the value of the loss function.
This method of selecting the coordinate with largest absolute gradient com-
ponent is called Gauss-Southwell-method in the field of numerical optimization
(e.g. [115]). It is known to converge under mild assumptions [117] (see details in
[152] and Section 5.4).
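In matrix form, for a finite hypothesis set the selection step (35) reads (a sketch, with H[n, j] = h̃_j(x_n) and dt holding the current derivatives):

```python
import numpy as np

def gauss_southwell_select(H, dt):
    """Pick the base hypothesis with the largest absolute gradient
    component (35): argmax_j | sum_n d_n^t h_j(x_n) |."""
    grads = dt @ H                  # one gradient component per column of H
    j = int(np.argmax(np.abs(grads)))
    return j, grads[j]
```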
Performing the explicit calculation in (35) for the loss function G^{AB} (see (5)) yields the AdaBoost approach described in detail in Algorithm 2.1. In this case g(f(x), y) = exp(−y f(x)), implying that d_n^{(t)} ∝ y_n exp(−y_n f_{t−1}(x_n)), which, up to a normalization constant and the factor y_n, is the form given in Algorithm 2.1.
The LogitBoost algorithm [74] mentioned in Section 2.2 can be derived in
a similar fashion using the loss function GLR (see (7)). In this case we find
that d_n^{(t)} ∝ y_n/(1 + exp(y_n f_{t−1}(x_n))). Observe that in this case, the weights assigned to misclassified examples with very negative margins are much smaller than the exponential weights assigned by AdaBoost. This may help to explain
why LogitBoost tends to be less sensitive to noise and outliers than AdaBoost
[74]. Several extensions of the AdaBoost and LogitBoost algorithms based on
Bregman distances were proposed in [39] following earlier work of [104] and
[110] (cf. Section 5.3).
A central quantity in what follows is the (unnormalized) relative entropy between two weightings d and d̃:

Δ(d, d̃) = Σ_{n=1}^{N} [ d_n log(d_n/d̃_n) − d_n + d̃_n ]. (36)
The update of the distribution in steps (3c) and (3d) of Algorithm 2.1 is the solution to the following optimization problem (“generalized projection”):

d^{(t+1)} = argmin_{d ≥ 0} Δ(d, d^{(t)}) s.t. Σ_{n=1}^{N} d_n y_n h_t(x_n) = 0.
Hence, the new weighting d(t+1) is chosen such that the edge of the previous hy-
pothesis becomes zero (as e.g. observed in [70]) and the relative entropy between
the new d(t+1) and the old d(t) distribution is as small as possible [104]. After
the projection the new d lies on the hyperplane defined by the corresponding
constraint (cf. Figure 5).
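This zero-edge property of the update is easy to verify numerically (a small self-contained check):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
y = rng.choice([-1, 1], size=N)
h = y.copy()
h[:3] *= -1                            # a hypothesis wrong on three examples
d = rng.random(N)
d /= d.sum()

eps = float(np.sum(d[h != y]))
alpha = 0.5 * np.log((1 - eps) / eps)  # step (3c)
d_new = d * np.exp(-alpha * y * h)     # step (3d)
d_new /= d_new.sum()

print(np.sum(d_new * y * h))           # ~0: the old hypothesis has zero edge
```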
The work of [104, 110] and later of [39, 47, 111, 196] led to a more general understanding of Boosting methods in the context of Bregman distance optimization and Information Theory. Given an arbitrary strictly convex function G : R₊^N → R (of Legendre-type), called a Bregman function, one can define a Bregman divergence (“generalized distance”)

Δ_G(d, d̃) := G(d) − G(d̃) − ⟨∇G(d̃), d − d̃⟩,

and the updates can be viewed as successive generalized projections onto hyperplanes.¹¹ A precursor of such algorithms is the Bregman algorithm [24, 34], and
it is known that the sequence {d(t) } converges to a point on the intersection
of all hyperplanes which minimizes the divergence to the first point d(0) – in
AdaBoost the uniform distribution. Hence they solve the following optimization problem:¹²

min_{d ≥ 0} Δ_G(d, d^{(0)}) s.t. Σ_{n=1}^{N} d_n y_n h_t(x_n) = 0, t = 1, 2, . . . .
There has recently been a lot of work on establishing the convergence of several
algorithms, which are all very similar to Algorithm 5.2. Since we cannot discuss
all approaches in detail, we only provide an overview of different results which
have been obtained in the past few years (cf. Table 1). In the following few
paragraphs we briefly discuss several aspects which are important for proving
convergence of such algorithms.
for some fixed constants α > 0, β > 0, γ > 0 and some strictly monotone and
continuous function τ : R → R with τ (0) = 0.
Size of the Hypothesis Set. For most convergence results the size of the
hypothesis set does not matter. However, in order to be able to attain the min-
imum, one has to assume that the set is bounded and closed (cf. [149]). In
[39, 48, 151] finite hypothesis sets have been assumed in order to show con-
vergence and rates. In the case of classification, where the base hypotheses are
discrete valued, this holds true. However, in practice, one often wants to use
real valued, parameterized hypotheses. Then the hypothesis set is uncountable.
[197, 199] present rather general convergence results for an algorithm related to that in Algorithm 5.2, which do not depend on the size of the hypothesis set and are applicable to uncountable hypothesis sets.
¹⁴ Note that the label y_n may be absorbed into the d's.
Convergence Rates. The first convergence result was obtained [70] for Ad-
aBoost and the γ-relaxed condition on the base learner. It showed that the
objective function is reduced at every step by a factor. Hence, it decreases ex-
ponentially. Here, the convergence speed does not directly depend on the size
of the hypothesis class and the number of examples. Later this was generalized
to essentially exponentially decreasing functions (like in Logistic Regression, but
still α-relaxed condition, cf. [59]). Also, for finite hypothesis sets and strictly con-
vex loss one can show the exponential decrease of the loss towards its minimum
[151] (see also [117]) – the constants, however, depend heavily on the size of the
training set and the number of base hypotheses. The best known rate for convex
loss functions and arbitrary hypothesis sets is given in [197]. Here the rate is
O(1/t) and the constants depend only on the smoothness of the loss function. In this approach it is mandatory to regularize with the ℓ₁-norm. Additionally,
the proposed algorithm does not exactly fit into the scheme discussed above (see
[197] for details).
153, 127, 12, 51]). In this section, we summarize techniques that yield state-of-
the-art results and extend the applicability of boosting to the noisy case.
alleviate the distortions that single difficult examples (e.g. outliers) can cause to the decision boundary.
We will discuss several techniques to approach this problem. We start with
algorithms that implement the intuitive idea of limiting the influence of a single
example. First we present AdaBoostReg [153] which trades off the influence with
the margin, then discuss BrownBoost [65] which gives up on examples for which
one cannot achieve large enough margins within a given number of iterations.
We then discuss SmoothBoost [178] which prevents overfitting by disallowing
overly skewed distributions, and finally briefly discuss some other approaches of
the same flavor.
In Section 6.2 we discuss a second group of approaches which are motivated
by the margin-bounds reviewed in Section 3.4. The important insight is that
it is not the minimal margin which is to be maximized, but that the whole
margin distribution is important: one should accept small amounts of margin-
errors on the training set, if this leads to considerable increments of the margin.
First we describe the DOOM approach [125] that uses a non-convex, monotone
upper bound to the training error motivated from the margin-bounds. Then we
discuss a linear program (LP) implementing a soft-margin [17, 153] and outline
algorithms to iteratively solve the linear programs [154, 48, 146].
The latter techniques are based on modifying the AdaBoost margin loss
function to achieve better noise robustness. However, DOOM as well as the
LP approaches employ ℓ∞-margins, which correspond to an ℓ₁-penalized cost
function. In Section 6.3 we will discuss some issues related to other choices of
the penalization term.
The weights of examples which are hard to classify are increased at every step
of AdaBoost. This leads to AdaBoost’s tendency to exponentially decrease the
training error and to generate combined hypotheses with large minimal margin.
To discuss the suboptimal performance of such a hard margin classifier in the
presence of outliers and mislabeled examples in a more abstract way, we analyze
Figure 6. Let us first consider the noise free case [left]. Here, we can estimate a
separating hyperplane correctly. In Figure 6 [middle] we have one outlier, which
corrupts the estimation. Hard margin algorithms will concentrate on this outlier
and impair the good estimate obtained in the absence of the outlier. Finally,
let us consider more complex decision boundaries (cf. Figure 6 [right]). Here the
overfitting problem is even more pronounced, if one can generate more and more
complex functions by combining many hypotheses. Then all training examples
(even mislabeled ones or outliers) can be classified correctly, which can result in
poor generalization.
From these cartoons, it is apparent that AdaBoost and any other algorithm
with large hard margins are noise sensitive. Therefore, we need to relax the hard
margin and allow for a possibility of “mistrusting” the data. We present several
algorithms which address this issue.
Fig. 6. The problem of finding a maximum margin “hyperplane” on reliable data [left],
data with outlier [middle] and with a mislabeled example [right]. The solid line shows
the resulting decision boundary, whereas the dashed line marks the margin area. In the middle and right plots the original decision boundary is plotted with dots. The
hard margin implies noise sensitivity, because only one example can spoil the whole
estimation of the decision boundary. (Figure taken from [153].)
AdaBoost_Reg. Examples that are mislabeled are usually more difficult to classify and should not be forced to have a positive margin. If one knew beforehand which
examples are “unreliable”, one would simply remove them from the training set
or, alternatively, not require that they have a large margin. Assume one has
defined a non-negative mistrust parameter ζn , which expresses the “mistrust” in
an example (x_n, y_n). Then one may relax the hard margin constraints (cf. (23) and (24)) leading to:

ρ_n(w) + C ζ_n ≥ ρ, n = 1, . . . , N,

where C is a trade-off constant, and the ζ_n may be defined in terms of the α_t's and d_n^{(t)}'s, the weights for the hypotheses and examples, re-
spectively. Moreover, one defines a new quantity, the soft margin ρ̃n (w) :=
ρn (w) + Cζn of an example (xn , yn ), which is used as a replacement of the mar-
gin in the AdaBoost algorithm. So, the idea in AdaBoostReg is to find a solution
with large soft margin on all examples, i.e. maximize ρ w.r.t. w and ρ such that
ρ̃n (w) ≥ ρ, n = 1, . . . , N . In this case one expects to observe less overfitting.
One of the problems of AdaBoostReg is that it lacks a convergence proof.
Additionally, it is not clear what the underlying optimization problem is. The
modification is done at an algorithmic level, which makes it difficult to relate
to an optimization problem. Nevertheless, it was one of the first boosting-like
algorithm that achieved state-of-the art generalization results on noisy data (cf.
[146]). In experimental evaluations, it was found that this algorithm is among
the best performing ones (cf. [153]).
determines its weight. More details on this algorithm and theoretical results on
its performance can be found in [65]. Whereas the idea of the algorithm seems
to be very promising, we are not aware of any empirical results illustrating the
efficiency of the approach.
where θ ∈ (0, 1] (cf. Corollary 1). We recall again that, perhaps surprisingly, the
bound in (43) is independent of the number of hypotheses ht that contribute
to defining f = Σ_t α_t h_t ∈ co(H). It was stated that a reason for the success
of AdaBoost, compared to other ensemble learning methods (e.g. Bagging [25]),
is that it generates combined hypotheses with large margins on the training
examples. It asymptotically finds a linear combination fw of base hypotheses
satisfying
ρ_n(w) = y_n f_w(x_n) / Σ_j w_j ≥ ρ, n = 1, . . . , N, (44)
for some large margin ρ (cf. Section 4). Then the first term of (43) can be made
zero for θ = ρ and the second term becomes small, if ρ is large. In [27, 81, 157]
algorithms have been proposed that generate combined hypotheses with even
larger margins than AdaBoost. It was shown that as the margin increases, the
generalization performance can become better on data sets with almost no noise
(see also [167, 138, 157]). However, on problems with a large amount of noise,
it has been found that the generalization ability often degrades for hypotheses
with larger margins (see also [145, 27, 153]).
In Section 4 we have discussed connections of boosting to margin maximiza-
tion. The algorithms under consideration approximately solve a linear program-
ming problem, but tend to perform sub-optimally on noisy data. From the margin
bound (43), this is indeed not surprising. The minimum on the right hand side
of (43) is not necessarily achieved with the maximum (hard) margin (largest θ).
In any event, one should keep in mind that (43) is only an upper bound, and
that more sophisticated bounds (e.g. [180, 45, 181]), based on looking at the full
margin distribution as opposed to the single hard margin θ, may lead to different
conclusions.
In this section we discuss algorithms, where the number of margin errors can
be controlled, and hence one is able to control the contribution of both terms in
(43) separately. We first discuss the DOOM approach [125] and then present an
extended linear program – the ν-LP problem – which allows for margin errors.
Additionally, we briefly discuss approaches to solve the resulting linear program
[48, 154, 147].
DOOM. (Direct Optimization Of Margins) The basic idea of [125, 127, 124] is
to replace AdaBoost's exponential loss function with another loss function with ℓ₁-normalized hypothesis coefficients. In order to do this one defines a class of
B-admissible margin cost functions, parameterized by some integer M .
These loss functions are motivated from the margin bounds of the kind dis-
cussed in Section 3.4, which have a free parameter θ. A large value of M ∼ 1/θ²
corresponds to a “high resolution” and a high effective complexity of the convex
combination (small margin θ), whereas smaller values of M correspond to larger
θ and therefore smaller effective complexity.
The B-admissibility condition ensures that the loss functions appropriately
take care of the trade-off between complexity and empirical error. Following
some motivation (which we omit here), [125, 127, 124] derived an optimal family
of margin loss functions (according to the margin bound). Unfortunately, this
loss function is non-monotone and non-convex (cf. Figure 7, [left]), leading to a
very difficult optimization problem (NP hard).
The idea proposed in [125] is to replace this loss function with a piece-wise linear loss function that is monotone (but non-convex, cf. Figure 7, [right]):

C_θ(z) = 1.1 − 0.1 z for −1 ≤ z ≤ 0,
C_θ(z) = 1.1 − z/θ for 0 ≤ z ≤ θ,
C_θ(z) = 0.1 (1 − z)/(1 − θ) for θ ≤ z ≤ 1.
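The piecewise linear loss is straightforward to implement (a sketch for margins z ∈ [−1, 1]):

```python
import numpy as np

def C_theta(z, theta):
    """Piecewise linear, monotone upper bound on the 0/1-loss used by DOOM,
    for margins z in [-1, 1] and parameter theta in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 1.1 - 0.1 * z,
           np.where(z <= theta, 1.1 - z / theta,
                    0.1 * (1 - z) / (1 - theta)))
```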
The optimization of this margin loss function to a global optimum is still very
difficult, but good heuristics for optimization have been proposed [125, 127, 124].
Theoretically, however, one can only prove convergence to a local minimum (cf.
Section 5.4).
Fig. 7. [left] The “optimal” margin loss functions CM (z), for M = 20, 50 and 200,
compared to the 0/1-loss. [right] Piecewise linear upper bound on the functions CM (z)
and the 0/1-loss (Figure taken from [125].)
Despite the above mentioned problems the DOOM approach seems very
promising. Experimental results have shown that, in noisy situations, one can
considerably improve the performance compared to AdaBoost. However, there
is the additional regularization parameter θ (or M ), which needs to be chosen
appropriately (e.g. via cross validation).
The ν-LP Problem. Let us consider the case where we are given a set H = {h_j : x → [−1, 1], j = 1, . . . , J} of J hypotheses. To find the coefficients w for the
combined hypothesis fw (x) in (21), we extend the linear problem (30) and solve
the following linear optimization problem (see also [17, 154]), which we call the
ν-LP problem:
max_{ρ, w, ξ} ρ − (1/(νN)) Σ_{n=1}^{N} ξ_n
s.t. y_n f_w(x_n) ≥ ρ − ξ_n, n = 1, . . . , N (45)
     ξ_n ≥ 0, w_j ≥ 0, j = 1, . . . , J, n = 1, . . . , N
     Σ_{j=1}^{J} w_j = 1,
where ν ∈ (1/N, 1] is a parameter of the algorithm. Here, one does not force all margins to be large and obtains a soft margin hyperplane.¹⁶ The dual optimization problem of (45) (cf. (46)) is the same as the edge minimization problem (20) with one additional constraint: d_n ≤ 1/(νN). Since d_n is the Lagrange multi-
plier associated with the constraint for the n-th example, its size characterizes
how much the example influences the solution of (45). In this sense, the ν-LP
also implements the intuition discussed in Section 6.1. In addition, there seems
to be a direct connection to the SmoothBoost algorithm.
The following proposition shows that ν has an immediate interpretation:
Proposition 2 (ν-Property, e.g. [174, 78, 150]). The solution to the opti-
mization problem (45) possesses the following properties:
1. ν upper-bounds the fraction of margin errors.
2. 1 − ν is greater than the fraction of examples with a margin larger than ρ.
Since the slack variables ξn only enter the cost function linearly, their gra-
dient is constant (once ξn > 0) and the absolute size is not important. Loosely
speaking, this is due to the fact that for the optimum of the primal objective
function, only derivatives w.r.t. the primal variables matter, and the derivative
of a linear function is constant. This can be made more explicit: Any example
outside the margin area, i.e. satisfying yn fw (xn ) > ρ, can be shifted arbitrarily,
as long as it does not enter the margin area. Only if the example is exactly on the
edge of the margin area, i.e. yn fw (xn ) = ρ, then (almost) no local movement is
possible without changing the solution. If the example (xn , yn ) is in the margin
area, i.e. ξn > 0, then it has been shown that almost any local movements are
allowed (cf. [147] for details).
$$
\begin{aligned}
\min_{d, \gamma} \quad & \gamma \\
\text{s.t.} \quad & \sum_{n=1}^{N} y_n d_n h_j(x_n) \le \gamma, \qquad j = 1, \dots, J, \\
& d \ge 0, \quad \sum_{n=1}^{N} d_n = 1, \\
& d_n \le \frac{1}{\nu N}, \qquad n = 1, \dots, N.
\end{aligned}
\tag{46}
$$
16 See Figure 4 for the soft margin loss; here one uses a scaled version.
A Barrier Algorithm. Another idea is to derive a barrier algorithm for (45) along
the lines of Section 4.4. Problem (45) is reformulated: one removes the constant
$\frac{1}{\nu N}$ in the objective of (45), fixes ρ = 1, and adds the sum of the hypothesis
coefficients multiplied by another regularization constant C (instead of $\frac{1}{\nu N}$).
One can in fact show that the modified problem has the same solution as (45):
for any given ν one can find a C such that both problems have the same solution
(up to scaling) [48]. The corresponding barrier objective $F_\beta$ for this problem
can be derived as in Section 4.4. It has two sets of parameters, the combination
coefficients $w$ and the slack variables $\xi \in \mathbb{R}^N$. By setting $\nabla_\xi F_\beta = 0$, we can
find the minimizing ξ for given β and w, which one plugs back in to obtain
$$
F_\beta(w) = C \sum_{j=1}^{J} w_j + \beta \sum_{n=1}^{N} \log\!\left[ 1 + \exp\!\left( \frac{1 - y_n f_w(x_n)}{\beta} \right) \right] + \beta N.
\tag{47}
$$
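The objective (47) is straightforward to evaluate; the sketch below (function name and vectorization ours) can be handed to any generic optimizer over w ≥ 0:

    import numpy as np

    def barrier_objective(w, H, y, C, beta):
        # F_beta(w) of (47): l1 penalty plus smoothed soft-margin terms
        margins = y * (H @ w)                          # y_n f_w(x_n)
        soft = np.log1p(np.exp((1.0 - margins) / beta))
        return C * np.sum(w) + beta * np.sum(soft) + beta * len(y)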
An earlier realization of the same idea was proposed in [154], and termed
ν-Arc. The idea was to reformulate the linear program with soft margin into a
non-linear program with maximum hard margin, by appropriately redefining the
margin (similar to AdaBoostReg in Section 6.1). Then, as a heuristic, Arc-GV
(cf. Section 4.3) was employed to maximize this newly defined margin.17
The barrier approach and ν-Arc led to considerable improvements compared
to AdaBoost on a large range of benchmark problems [154, 147]. Since they
solve an optimization problem similar to that of the column generation algorithm,
one would expect only minor differences. In practice, the column generation
algorithm often converges faster; however, it can cause problems in combination with
some base learners [48]: in the CG approach, most of the example weights become
zero after a few iterations (since the margin of these examples is larger than ρ and
the margin constraints are not active). The base learner may then have difficulty
generating good hypotheses. Since the barrier approach is much “smoother”,
this effect is reduced considerably.
We will now discuss two main choices of regularizers that have a drastic influence
on the properties of the solution of regularized problems. Let us consider opti-
mization problems as in Section 5 with a regularization term in the objective:18
$$
\begin{aligned}
\min_{w} \quad & \sum_{n=1}^{N} g(f_w(x_n), y_n) + C \sum_{j=1}^{J} p(w_j) \\
\text{s.t.} \quad & 0 \le w \in \mathbb{R}^J,
\end{aligned}
\tag{48}
$$
Let us first consider the $\ell_1$-norm as regularizer, i.e. $p(w_j) = w_j$. Assume
further that $w^*$ is the optimal solution to (48) for some $C \ge 0$. Since the regularizer
is linear, there will be a $(J-1)$-dimensional subspace of $w$'s having the same
$\ell_1$-norm as $w^*$. By further restricting $f_w(x_n) = f_{w^*}(x_n)$, $n = 1, \dots, N$, one
obtains $N$ additional constraints. Hence, one can choose a $w$ from a $(J - N - 1)$-
dimensional space that leads to the same objective value as $w^*$. Therefore, there
exists a solution that has at most $N + 1$ non-zero entries and, hence, is sparse.
This observation holds for arbitrary loss functions and any concave
regularizer, such as the $\ell_1$-norm [147]. Note that $N + 1$ is only an upper bound on the
number of non-zero elements – in practice much sparser solutions are observed
[37, 48, 23, 8].
In neural networks, SVMs, matching pursuit (e.g. [118]) and many other
algorithms, one uses the $\ell_2$-norm for regularization. In this case the optimal
solution $w^*$ can be expressed as a linear combination of the mapped examples
in feature space (cf. (26)),

$$
w^* = \sum_{n=1}^{N} \beta_n \Phi(x_n).
$$
Hence, if the vectors $\Phi(x_n)$ are not sparse and the β's are not all zero, the solution
$w^*$ is also unlikely to be sparse (since it is a linear combination of non-sparse
vectors). Consider, for instance, the case where the $J$ vectors $(h_j(x_1), \dots, h_j(x_N))$,
$j = 1, \dots, J$, interpreted as points in an $N$-dimensional space, are in general
position (i.e. any subset of $N$ points spans the full $N$-dimensional space; this
occurs, for instance, with probability 1 for points drawn independently at random from
a probability density supported on the whole space). If $J \gg N$, then one can
show that a fraction of at most $O\left(\frac{N}{J-N}\right)$ of the coefficients are zero.
Hence, for large J almost all coefficients are non-zero and the solution is not
sparse. This holds not only for $\ell_2$-norm regularization, but for any other
strictly convex regularizer [147]. This observation can be intuitively explained
as follows: There is a J − N dimensional subspace leading to the same output
fw (xn ) on the training examples. Assume the solution is sparse and one has a
small number of large weights. If the regularizer is strictly convex, then one can
reduce the regularization term by distributing large weights over many other
weights which were previously zero (while keeping the loss term constant).
We can conclude that the type of regularizer determines whether there exists
an optimal hypothesis coefficient vector that is sparse or not. Minimizing a
convex loss function with a strictly concave regularizer will generally lead to non-convex
optimization problems with potentially many local minima. Fortunately,
the $\ell_1$-norm is both concave and convex (it is linear on the feasible set $w \ge 0$).
By employing $\ell_1$-norm regularization, one can obtain sparse solutions while
retaining the computational tractability of the optimization problem. This
observation seems to lend strong support to $\ell_1$-norm regularization.
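The contrast is easy to see numerically. The following toy sketch (an unconstrained least-squares setting, so the constraint w ≥ 0 of (48) is dropped for simplicity; scikit-learn's Lasso serves as the ℓ1 solver) counts non-zero coefficients under the two regularizers:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    N, J = 50, 200
    H = rng.standard_normal((N, J))   # H[n, j] = h_j(x_n)
    y = rng.standard_normal(N)

    w_l2 = np.linalg.solve(H.T @ H + np.eye(J), H.T @ y)  # ridge, closed form
    w_l1 = Lasso(alpha=0.1).fit(H, y).coef_               # lasso

    print((np.abs(w_l2) > 1e-8).sum(), "non-zeros under the l2-norm")
    print((np.abs(w_l1) > 1e-8).sum(), "non-zeros under the l1-norm")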
7 Extensions
The bulk of the work discussed in this review deals with the case of binary
classification and supervised learning. However, the general Boosting framework
is much more widely applicable. In this section we present a brief survey of
several extensions and generalizations, although many others exist, e.g. [76,
158, 62, 18, 31, 3, 155, 15, 44, 52, 82, 164, 66, 38, 137, 147, 16, 80, 185, 100, 14].
7.1 Single-class

For unsupervised (single-class) problems, one seeks a combined hypothesis that
attains a large soft margin ρ on all (unlabeled) examples, leading to the linear
program

$$
\begin{aligned}
\min_{w \in \mathbb{R}^J,\, \xi \in \mathbb{R}^N,\, \rho \in \mathbb{R}} \quad & -\nu\rho + \frac{1}{N}\sum_{n=1}^{N} \xi_n \\
\text{s.t.} \quad & \sum_{j=1}^{J} w_j h_j(x_n) \ge \rho - \xi_n, \qquad n = 1, \dots, N, \\
& \|w\|_1 = 1, \quad \xi_n \ge 0, \qquad n = 1, \dots, N,
\end{aligned}
\tag{49}
$$

which is essentially the same as in the two-class soft-margin approach (cf. Section
6.2), but with all labels equal to +1. Hence, the derivation of appropriate
optimization algorithms is very similar.
7.2 Multi-class
There has been much recent interest in extending Boosting to multi-class prob-
lems, where the number of classes exceeds two. Let $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$
denote the sample, where each label $y_n$ is now a binary membership vector over
the $K$ classes. A natural cost function in this setting is the log-likelihood
$$
G = \sum_{n=1}^{N}\sum_{k=1}^{K} \left\{ y_{n,k}\log p(k|x_n) + (1 - y_{n,k})\log\bigl(1 - p(k|x_n)\bigr) \right\},
\tag{50}
$$
where $y_{n,k}$ is the $k$-th component of the vector $y_n$, and $p(k|x_n)$ is the model
probability that $x_n$ belongs to the $k$-th class. Using the standard softmax
representation,

$$
p(k|x) = \frac{e^{F_k(x)}}{\sum_{k'=1}^{K} e^{F_{k'}(x)}},
$$

the class probabilities are parameterized by $K$ real-valued functions $F_k$.
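In code, the softmax and the log-likelihood (50) are a few lines; the sketch below assumes an N × K score array F with F[n, k] = F_k(x_n) and an N × K binary label matrix Y:

    import numpy as np

    def softmax_rows(F):
        # p(k | x_n) from the scores, row-wise and numerically stabilized
        E = np.exp(F - F.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    def log_likelihood(F, Y):
        # G of Eq. (50)
        P = softmax_rows(F)
        return np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))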
7.3 Regression
The learning problem for regression is based on a data set $S = \{(x_n, y_n)\}_{n=1}^{N}$,
where, in contrast to the classification case, the variable $y$ can take on real val-
ues, rather than being restricted to a finite set. Probably the first approach to
addressing regression in the context of Boosting appeared in [70], where the
problem was addressed by translating the regression task into a classification
problem (see also [18]). Most of the Boosting algorithms for binary classifica-
tion described in this review are based on a greedy stage-wise minimization of a
smooth cost function. It is therefore not surprising that we can directly use such a
smooth cost function in the context of regression where we are trying to estimate
a smooth function. Probably the first work to discuss regression in the context of
stage-wise minimization appeared in [76] (LS-Boost and other algorithms), and
was later further extended by others (e.g. [158, 182, 195]). The work in [156] further
addressed regression in connection with barrier optimization (essentially with a
two-sided soft-margin loss – the ε-insensitive loss). This approach is very simi-
lar in spirit to one of the approaches proposed in [57] (see below). Later it was
extended in [149, 147] to arbitrary strictly convex loss functions and discussed
in connection with infinite hypothesis sets and semi-infinite programming. The
resulting algorithms were shown to be very useful in real world problems.
We briefly describe a regression framework recently introduced in [57], where
several Boosting like algorithms were introduced. All these algorithms operate
using the following scheme, where different cost functions are used for each algo-
rithm. (i) At each iteration the algorithm modifies the sample to produce a new
sample S̃ = {(x1 , ỹ1 ), . . . , (xN , ỹN )} where the y’s have been modified, based on
the residual error at each stage. (ii) A distribution d over the modified samples
S̃ is constructed and a base regression algorithm is called. (iii) The base learner
produces a function f ∈ F with some edge on S̃ under d (for a definition of
an edge in the regression context see [57]). (iv) A new composite regressor of
the form F + αf is formed, where α is selected by solving a one-dimensional
optimization problem.
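Schematically, one instance of this four-step scheme might look as follows in Python; the residual-based sample modification, the particular weighting, and the closed-form line search are placeholder choices of ours, not the specific cost functions of [57]:

    import numpy as np

    def boost_regression(X, y, base_learner, n_iter=100):
        F = np.zeros(len(y))                  # current composite regressor
        ensemble = []
        for _ in range(n_iter):
            y_tilde = y - F                   # (i) modified sample (residuals)
            d = np.abs(y_tilde) + 1e-12
            d /= d.sum()                      # (ii) distribution over examples
            f = base_learner(X, y_tilde, d)   # (iii) base learner, returns a callable
            pred = f(X)
            alpha = (pred @ y_tilde) / (pred @ pred + 1e-12)  # (iv) line search
            F += alpha * pred
            ensemble.append((alpha, f))
        return ensemble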
Theoretical performance guarantees for the regression algorithms introduced
in [57] were analyzed in a similar fashion to that presented in Section 3 for
binary classification. Under the assumption that a strictly positive edge was ob-
tained at each step, it was shown that the training error converges exponentially
fast to zero (this is analogous to Theorem 1 for classification). Significantly, the
optimality of the base learners was not required for this claim, similarly to re-
lated results described in Section 5 for classification. Finally, generalization error
bounds were derived in [57], analogous to Theorem 3. These bounds depend on
the edge achieved by the weak learner.
Decision trees and stumps. Decision trees have been widely used for many
years in the statistical literature (e.g. [29, 144, 85]) as powerful, effective and
easily interpretable classification algorithms that are able to automatically select
relevant features. Hence, it is not surprising that some of the most successful
initial applications of Boosting algorithms were performed using decision trees
as weak learners. Decision trees are based on the recursive partition of the input
19 A Matlab implementation of the algorithm, as part of an extensive Pattern
Recognition toolbox, can be downloaded from
http://tiger.technion.ac.il/˜eladyt/classification/index.htm.
space into a set of nested regions, usually of rectangular shape. The so-called
decision stump is simply a one-level decision tree, namely a classifier formed
by splitting the input space in an axis-parallel fashion once and then halting. A
systematic approach to the effective utilization of trees in the context of Boosting
is given in [168], while several other approaches based on logistic regression with
decision trees are described in [76]. This work showed that Boosting significantly
enhances the performance of decision trees and stumps.
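As an illustration of the simplest such weak learner, the following sketch fits a weighted decision stump by brute force over all axis-parallel splits (names and interface are our own):

    import numpy as np

    def fit_stump(X, y, d):
        # Best one-level tree under example weights d; y in {-1, +1}.
        best_err, best = np.inf, None
        for j in range(X.shape[1]):                # candidate feature
            for thr in np.unique(X[:, j]):         # candidate threshold
                base = np.where(X[:, j] <= thr, 1, -1)
                for sign in (1, -1):               # orientation of the split
                    err = d[(sign * base) != y].sum()  # weighted 0/1 error
                    if err < best_err:
                        best_err, best = err, (j, thr, sign)
        return best                                # (feature, threshold, sign)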
Neural networks. Neural networks (e.g. [88, 20]) were extensively used during
the 1990’s in many applications. The feed-forward neural network, by far the
most widely used in practice, is essentially a highly non-linear function repre-
sentation formed by repeatedly combining a fixed non-linear transfer function.
It has been shown in [113] that any real-valued continuous function over Rd can
be arbitrarily well approximated by a feed-forward neural network with a single
hidden layer, as long as the transfer function is not a polynomial. Given the po-
tential power of neural networks in representing arbitrary continuous functions,
it would seem that they could easily lead to overfitting and not work effectively
in the context of Boosting. However, careful numerical experiments conducted
by [176] demonstrated that AdaBoost can significantly improve the performance
of neural network classifiers in real world applications such as OCR.
Kernel functions. One may also use as base hypotheses linear combinations of
a fixed set of basis functions,

$$
h_\beta(x) := \sum_{k=1}^{K} \beta_k G_k(x), \qquad \beta \in \mathbb{R}^K.
\tag{51}
$$
The functions $G_k$ could, for example, be kernel functions, i.e. $G_k(x) = k(x, x_k)$,
centered around the training examples. The set $\{h_\beta\}$ is an infinite hypothesis set,
and it is unbounded if β is unbounded. Hence, maximizing the edge as discussed in
Section 5.4 would lead to diverging β's. Keeping in mind the overfitting issues
discussed in Section 6, we attempt to restrict the complexity of the class of
functions by limiting the norm of β. We discuss two approaches.
Then $h_\beta$ with $\beta = (0, \dots, 0, \beta_{k^*}, 0, \dots, 0)$ maximizes the edge, where
$\beta_{k^*} = \operatorname{sign}\left(\sum_{n=1}^{N} d_n k(x_{k^*}, x_n)\right)$. This means that we will be adding exactly one
basis function $G_{k^*}(\cdot)$ per iteration. Hence, this approach reduces to the case of
using single basis functions. A variant of this approach has been used in [183].
A related approach was used for solving a drug discovery (classification) task
in [149].
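Assuming, as the formula above suggests, that $k^*$ is the index maximizing the absolute weighted sum $|\sum_n d_n k(x_{k^*}, x_n)|$ (the definition of $k^*$ is not spelled out here), the per-iteration selection reduces to a single matrix-vector product; a sketch with K_mat[k, n] = k(x_k, x_n):

    import numpy as np

    def best_basis(K_mat, d):
        scores = K_mat @ d                     # sum_n d_n k(x_k, x_n) for each k
        k_star = int(np.argmax(np.abs(scores)))
        return k_star, np.sign(scores[k_star])  # index and beta_{k*}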
In [149] so-called active kernel functions have been used, where the set of
functions were kernel-like functions $h_{n,p}(x) = k_p(x, x_n)$, where $k_p$ is an SVM kernel
function parameterized by $p$ (e.g. the variance of an RBF kernel). In this case one
has to find the example xn and the parameter p that maximize the edge. This
may lead to a hard non-convex optimization problem, for which quite efficient
heuristics have been proposed in [149].
8.3 Applications
The list of applications of Boosting and Leveraging methods is growing rapidly.
We discuss three applications in detail and then list a range of other applications.
Non-intrusive power monitoring. A prerequisite for controlling the electric
energy demand is the ability to correctly and non-intrusively detect and classify
the operating status of electric appliances of individual households. The goal in
[139] was to develop such a non-intrusive measuring system for assessing the
status of electric appliances.
This is a hard problem, in particular for appliances with inverter systems,20
whereas non-intrusive measuring systems have already been developed for con-
ventional on/off (non-inverter) electric equipment (cf. [83, 33]). The
study in [139] presents a first evaluation of machine learning techniques to clas-
sify the operating status of electric appliances (with and without inverter) for the
purpose of constructing a non-intrusive monitoring system. In this study, RBF
networks, K-nearest neighbor classifiers (KNNs) (e.g. [42]), SVMs and ν-Arc (cf.
Section 6.2) were compared.
The data set available for this task is rather small (36 examples), since the
collection and labeling of data is manual and therefore expensive. As a result,
one has to be very careful when selecting model parameters. All model
parameters were found using the computationally expensive but generally
reliable leave-one-out method.
The results reported in [139] demonstrate that the ν-Arc algorithm with RBF
networks as base learner performs better on average than all other algorithms
studied, followed closely by the SVM classifier with a RBF kernel. These results
suggest that the goal of a control system to balance load peaks might be a
feasible prospect in the not too distant future.
9 Conclusions
We have presented a general overview of ensemble methods in general and the
Boosting and Leveraging family of algorithms in particular. While Boosting was
introduced within a rather theoretical framework in order to transform a poorly
performing learning algorithm into a powerful one, this idea has turned out to
have manifold extensions and applications, as discussed in this survey.
As we have shown, Boosting turns out to belong to a large family of models
which are greedily constructed by adding one base learner at a time to a pool of
previously constructed learners using adaptively determined weights. Interest-
ingly, Boosting was shown to be derivable as a stagewise greedy gradient descent
algorithm attempting to minimize a suitable cost function. In this sense Boost-
ing is strongly related to other algorithms that were known within the Statistics
literature for many years, in particular additive models [86] and matching pur-
suit [118]. However, the recent work on Boosting has brought to the fore many
issues which were not studied previously. (i) The important concept of the mar-
gin, and its impact on learning and generalization has been emphasized. (ii)
The derivation of sophisticated finite sample data-dependent bounds has been
possible. (iii) Understanding the relationship between the strength of the weak
learner, and the quality of the composite classifier (in terms of training and gen-
eralization errors). (iv) The establishment of consistency. (v) The development
of computationally efficient procedures.
With the emphasis of much of the Boosting work on the notion of the margin,
it became clear that Boosting is strongly related to another very successful cur-
rent algorithm, namely the Support Vector Machine [190]. As was pointed out
in [167, 134, 150], both Boosting and the SVM can be viewed as attempting to
maximize the margin, except that the norm used by each procedure is different.
Moreover, the optimization procedures used in both cases are very different.
While Boosting constructs a complex composite hypothesis, which can in
principle represent highly irregular functions, the generalization bounds for
Boosting turn out to lead to tight bounds in cases where large margins can
be guaranteed. Although initial work seemed to indicate that Boosting does not
overfit, it was soon realized that overfitting does indeed occur under noisy con-
ditions. Following this observation, regularized Boosting algorithms were devel-
oped which are able to achieve the appropriate balance between approximation
and estimation required to achieve excellent performance even under noisy con-
ditions. Regularization is also essential in order to establish consistency under
general conditions.
We conclude with several open questions.
1. While it has been possible to derive Boosting-like algorithms based on many
types of cost functions, there does not seem to be at this point a systematic
approach to the selection of a particular one. Numerical experiments and
some theoretical results indicate that the choice of cost function may have
a significant effect on the performance of Boosting algorithms (e.g. [74, 126,
154]).
2. The selection of the best type of weak learner for a particular task is also not
entirely clear. Some weak learners are unable even in principle to represent
References
42. T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Trans-
actions on Information Theory, 13(1):21–27, 1967.
43. D.D. Cox and F. O’Sullivan. Asymptotic analysis of penalized likelihood and
related estimates. The Annals of Statistics, 18(4):1676–1695, 1990.
44. K. Crammer and Y. Singer. On the learnability and design of output codes for
multiclass problems. In N. Cesa-Bianchi and S. Goldberg, editors, Proc. COLT,
pages 35–46, San Francisco, 2000. Morgan Kaufmann.
45. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, Cambridge, UK, 2000.
46. S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393,
April 1997.
47. S. Della Pietra, V. Della Pietra, and J. Lafferty. Duality and auxiliary functions
for Bregman distances. Technical Report CMU-CS-01-109, School of Computer
Science, Carnegie Mellon University, 2001.
48. A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear programming boosting
via column generation. Machine Learning, 46(1–3):225–254, 2002.
49. M. Dettling and P. Bühlmann. How to use boosting for tumor classification with
gene expression data. Preprint. See http://stat.ethz.ch/˜dettling/boosting,
2002.
50. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recog-
nition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
51. T.G. Dietterich. An experimental comparison of three methods for construct-
ing ensembles of decision trees: Bagging, boosting, and randomization. Machine
Learning, 40(2):139–157, 1999.
52. T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-
correcting output codes. Journal of Artificial Intelligence Research, 2:263–286,
1995.
53. C. Domingo and O. Watanabe. A modification of AdaBoost. In Proc. COLT, San
Francisco, 2000. Morgan Kaufmann.
54. H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, and V. Vapnik. Boosting and other
ensemble methods. Neural Computation, 6, 1994.
55. H. Drucker, R.E. Schapire, and P.Y. Simard. Boosting performance in neural
networks. International Journal of Pattern Recognition and Artificial Intelligence,
7:705–719, 1993.
56. N. Duffy and D.P. Helmbold. A geometric approach to leveraging weak learners.
In P. Fischer and H. U. Simon, editors, Computational Learning Theory: 4th
European Conference (EuroCOLT ’99), pages 18–33, March 1999. Long version
to appear in TCS.
57. N. Duffy and D.P. Helmbold. Boosting methods for regression. Technical report,
Department of Computer Science, University of Santa Cruz, 2000.
58. N. Duffy and D.P. Helmbold. Leveraging for regression. In Proc. COLT, pages
208–219, San Francisco, 2000. Morgan Kaufmann.
59. N. Duffy and D.P. Helmbold. Potential boosters? In S.A. Solla, T.K. Leen,
and K.-R. Müller, editors, Advances in Neural Information Processing Systems,
volume 12, pages 258–264. MIT Press, 2000.
60. G. Escudero, L. Màrquez, and G. Rigau. Boosting applied to word sense dis-
ambiguation. In LNAI 1810: Proceedings of the 12th European Conference on
Machine Learning, ECML, pages 129–141, Barcelona, Spain, 2000.
61. W. Feller. An Introduction to Probability Theory and its Applications. Wiley,
Chichester, third edition, 1968.
62. D.H. Fisher, Jr., editor. Improving regressors using boosting techniques, 1997.
63. M. Frean and T. Downs. A simple cost function for boosting. Technical report,
Dep. of Computer Science and Electrical Engineering, University of Queensland,
1998.
64. Y. Freund. Boosting a weak learning algorithm by majority. Information and
Computation, 121(2):256–285, September 1995.
65. Y. Freund. An adaptive version of the boost by majority algorithm. Machine
Learning, 43(3):293–318, 2001.
66. Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm
for combining preferences. In Proc. ICML, 1998.
67. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learn-
ing and an application to boosting. In EuroCOLT: European Conference on Com-
putational Learning Theory. LNCS, 1994.
68. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In
Proc. 13th International Conference on Machine Learning, pages 148–156. Mor-
gan Kaufmann, 1996.
69. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In
Proc. COLT, pages 325–332, New York, NY, 1996. ACM Press.
70. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learn-
ing and an application to boosting. Journal of Computer and System Sciences,
55(1):119–139, 1997.
71. Y. Freund and R.E. Schapire. Adaptive game playing using multiplicative weights.
Games and Economic Behavior, 29:79–103, 1999.
72. Y. Freund and R.E. Schapire. A short introduction to boosting. Journal of
Japanese Society for Artificial Intelligence, 14(5):771–780, September 1999. Ap-
peared in Japanese, translation by Naoki Abe.
73. J. Friedman. Stochastic gradient boosting. Technical report, Stanford University,
March 1999.
74. J. Friedman, T. Hastie, and R.J. Tibshirani. Additive logistic regression: a sta-
tistical view of boosting. Annals of Statistics, 28(2):337–374, 2000. With discussion
pp. 375–407; also Technical Report, Department of Statistics, Sequoia Hall,
Stanford University.
75. J.H. Friedman. On bias, variance, 0/1–loss, and the curse of dimensionality. In
Data Mining and Knowledge Discovery, volume I, pages 55–77. Kluwer Academic
Publishers, 1997.
76. J.H. Friedman. Greedy function approximation. Technical report, Department of
Statistics, Stanford University, February 1999.
77. K.R. Frisch. The logarithmic potential method of convex programming. Memo-
randum, University Institute of Economics, Oslo, May 13 1955.
78. T. Graepel, R. Herbrich, B. Schölkopf, A.J. Smola, P.L. Bartlett, K.-R. Müller,
K. Obermayer, and R.C. Williamson. Classification on proximity data with LP-
machines. In D. Willshaw and A. Murray, editors, Proceedings of ICANN’99,
volume 1, pages 304–309. IEE Press, 1999.
79. Y. Grandvalet. Bagging can stabilize without reducing variance. In ICANN’01,
Lecture Notes in Computer Science. Springer, 2001.
80. Y. Grandvalet, F. D’alché-Buc, and C. Ambroise. Boosting mixture models for
semi-supervised tasks. In Proc. ICANN, Vienna, Austria, 2001.
81. A.J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin
of learned ensembles. In Proceedings of the Fifteenth National Conference on
Artificial Intelligence, 1998.
103. G.S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions.
J. Math. Anal. Applic., 33:82–95, 1971.
104. J. Kivinen and M. Warmuth. Boosting as entropy projection. In Proc. 12th Annu.
Conference on Comput. Learning Theory, pages 134–144. ACM Press, New York,
NY, 1999.
105. J. Kivinen, M. Warmuth, and P. Auer. The perceptron algorithm vs. winnow:
Linear vs. logarithmic mistake bounds when few input variables are relevant.
Special issue of Artificial Intelligence, 97(1–2):325–343, 1997.
106. J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates
for linear prediction. Information and Computation, 132(1):1–64, 1997.
107. K.C. Kiwiel. Relaxation methods for strictly convex regularizations of piecewise
linear programs. Applied Mathematics and Optimization, 38:239–259, 1998.
108. V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding
the generalization error of combined classifiers. Ann. Statis., 30(1), 2002.
109. A. Krieger, A. Wyner, and C. Long. Boosting noisy data. In Proceedings, 18th
ICML. Morgan Kaufmann, 2001.
110. J. Lafferty. Additive models, boosting, and inference for generalized divergences.
In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 125–133, New
York, NY, 1999. ACM Press.
111. G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential
models. In Advances in Neural Information Processing Systems, volume 14, 2002.
to appear. Longer version also NeuroCOLT Technical Report NC-TR-2001-098.
112. Y.A. LeCun, L.D. Jackel, L. Bottou, A. Brunot, C. Cortes, J.S. Denker,
H. Drucker, I. Guyon, U.A. Müller, E. Säckinger, P.Y. Simard, and V.N. Vap-
nik. Comparison of learning algorithms for handwritten digit recognition. In
F. Fogelman-Soulié and P. Gallinari, editors, Proceedings ICANN’95 — Interna-
tional Conference on Artificial Neural Networks, volume II, pages 53–60, Nanterre,
France, 1995. EC2.
113. M. Leshno, V. Lin, A. Pinkus, and S. Schocken. Multilayer Feedforward Networks
with a Nonpolynomial Activation Function Can Approximate any Function. Neu-
ral Networks, 6:861–867, 1993.
114. N. Littlestone, P.M. Long, and M.K. Warmuth. On-line learning of linear func-
tions. Journal of Computational Complexity, 5:1–23, 1995. Earlier version is
Technical Report CRL-91-29 at UC Santa Cruz.
115. D.G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley Publish-
ing Co., Reading, second edition, May 1984. Reprinted with corrections in May,
1989.
116. Gábor Lugosi and Nicolas Vayatis. A consistent strategy for boosting algorithms.
In Proceedings of the Annual Conference on Computational Learning Theory, vol-
ume 2375 of LNAI, pages 303–318, Sydney, February 2002. Springer.
117. Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for
convex differentiable minimization. Journal of Optimization Theory and Applica-
tions, 72(1):7–35, 1992.
118. S. Mallat and Z. Zhang. Matching Pursuits with time-frequency dictionaries.
IEEE Transactions on Signal Processing, 41(12):3397–3415, December 1993.
119. O.L. Mangasarian. Linear and nonlinear separation of patterns by linear pro-
gramming. Operations Research, 13:444–452, 1965.
120. O.L. Mangasarian. Arbitrary-norm separating plane. Operation Research Letters,
24(1):15–23, 1999.
121. S. Mannor and R. Meir. Geometric bounds for generalization in boosting. In Pro-
ceedings of the Fourteenth Annual Conference on Computational Learning Theory,
pages 461–472, 2001.
122. S. Mannor and R. Meir. On the existence of weak learners and applications to
boosting. Machine Learning, 48(1-3):219–251, 2002.
123. S. Mannor, R. Meir, and T. Zhang. The consistency of greedy algorithms for
classification. In Procedings COLT’02, volume 2375 of LNAI, pages 319–333,
Sydney, 2002. Springer.
124. L. Mason. Margins and Combined Classifiers. PhD thesis, Australian National
University, September 1999.
125. L. Mason, P.L. Bartlett, and J. Baxter. Improved generalization through explicit
optimization of margins. Technical report, Department of Systems Engineering,
Australian National University, 1998.
126. L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient tech-
niques for combining hypotheses. In A. J. Smola, P.L. Bartlett, B. Schölkopf,
and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press,
Cambridge, MA, 1999.
127. L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient tech-
niques for combining hypotheses. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and
D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221–247.
MIT Press, Cambridge, MA, 2000.
128. J. Matoušek. Geometric Discrepancy: An Illustrated Guide. Springer Verlag, 1999.
129. R. Meir, R. El-Yaniv, and Shai Ben-David. Localized boosting. In Proc. COLT,
pages 190–199, San Francisco, 2000. Morgan Kaufmann.
130. R. Meir and T. Zhang. Data-dependent bounds for Bayesian mixture models.
unpublished manuscript, 2002.
131. J. Mercer. Functions of positive and negative type and their connection with the
theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415–446,
1909.
132. S. Merler, C. Furlanello, B. Larcher, and A. Sboner. Tuning cost-sensitive boost-
ing and its application to melanoma diagnosis. In J. Kittler and F. Roli, edi-
tors, Proceedings of the 2nd International Workshop on Multiple Classifier Systems
MCS2001, volume 2096 of LNCS, pages 32–42. Springer, 2001.
133. J. Moody. The effective number of parameters: An analysis of generalization and
regularization in non-linear learning systems. In J. Moody, S.J. Hanson, and R.P.
Lippmann, editors, Advances in Neural Information Processing Systems, volume 4,
pages 847–854, San Mateo, CA, 1992. Morgan Kaufman.
134. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction
to kernel-based learning algorithms. IEEE Transactions on Neural Networks,
12(2):181–201, 2001.
135. N. Murata, S. Amari, and S. Yoshizawa. Network information criterion — deter-
mining the number of hidden units for an artificial neural network model. IEEE
Transactions on Neural Networks, 5:865–872, 1994.
136. S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill, New
York, NY, 1996.
137. Richard Nock and Patrice Lefaucheur. A robust boosting algorithm. In Proc.
13th European Conference on Machine Learning, volume LNAI 2430, Helsinki,
2002. Springer Verlag.
156. G. Rätsch, M. Warmuth, S. Mika, T. Onoda, S. Lemm, and K.-R. Müller. Barrier
boosting. In Proc. COLT, pages 170–179, San Francisco, 2000. Morgan Kaufmann.
157. G. Rätsch and M.K. Warmuth. Maximizing the margin with boosting. In Proc.
COLT, volume 2375 of LNAI, pages 319–333, Sydney, 2002. Springer.
158. G. Ridgeway, D. Madigan, and T. Richardson. Boosting methodology for re-
gression problems. In D. Heckerman and J. Whittaker, editors, Proceedings of
Artificial Intelligence and Statistics ’99, pages 152–161, 1999.
159. J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471,
1978.
160. C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer
Verlag, New York, 1994.
161. M. Rochery, R. Schapire, M. Rahim, N. Gupta, G. Riccardi, S. Bangalore, H. Al-
shawi, and S. Douglas. Combining prior knowledge and boosting for call classi-
fication in spoken language dialogue. In International Conference on Acoustics,
Speech and Signal Processing, 2002.
162. R.T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics.
Princeton University Press, New Jersey, 1970.
163. R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–
227, 1990.
164. R.E. Schapire. Using output codes to boost multiclass learning problems. In
Machine Learning: Proceedings of the 14th International Conference, pages 313–
321, 1997.
165. R.E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth
International Joint Conference on Artificial Intelligence, 1999.
166. R.E. Schapire. The boosting approach to machine learning: An overview. In
Workshop on Nonlinear Estimation and Classification. MSRI, 2002.
167. R.E. Schapire, Y. Freund, P.L. Bartlett, and W. S. Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics,
26(5):1651–1686, October 1998.
168. R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated
predictions. Machine Learning, 37(3):297–336, December 1999. also Proceedings
of the 14th Workshop on Computational Learning Theory 1998, pages 80–91.
169. R.E. Schapire and Y. Singer. Boostexter: A boosting-based system for text cate-
gorization. Machine Learning, 39(2/3):135–168, 2000.
170. R.E. Schapire, Y. Singer, and A. Singhal. Boosting and rocchio applied to text
filtering. In Proc. 21st Annual International Conference on Research and Devel-
opment in Information Retrieval, 1998.
171. R.E. Schapire, P. Stone, D. McAllester, M.L. Littman, and J.A. Csirik. Modeling
auction price uncertainty using boosting-based conditional density estimation.
In Proceedings of the Nineteenth International Conference on Machine Learning,
2002.
172. B. Schölkopf, R. Herbrich, and A.J. Smola. A generalized representer theorem.
In D.P. Helmbold and R.C. Williamson, editors, COLT/EuroCOLT, volume 2111
of LNAI, pages 416–426. Springer, 2001.
173. B. Schölkopf, J. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Esti-
mating the support of a high-dimensional distribution. TR 87, Microsoft Research,
Redmond, WA, 1999.
174. B. Schölkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector
algorithms. Neural Computation, 12:1207–1245, 2000. also NeuroCOLT Technical
Report NC-TR-1998-031.
175. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge,
MA, 2002.
176. H. Schwenk and Y. Bengio. Boosting neural networks. Neural Computation,
12(8):1869–1887, 2000.
177. R.A. Servedio. PAC analogues of perceptron and winnow via boosting the mar-
gin. In Proc. COLT, pages 148–157, San Francisco, 2000. Morgan Kaufmann.
178. R.A. Servedio. Smooth boosting and learning with malicious noise. In Proceedings
of the Fourteenth Annual Conference on Computational Learning Theory, pages
473–489, 2001.
179. J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural
risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory,
44(5):1926–1940, September 1998.
180. J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In
Proceedings of the twelfth Conference on Computational Learning Theory, pages
278–285, 1999.
181. J. Shawe-Taylor and N. Cristianini. On the generalization of soft margin algo-
rithms. Technical Report NC-TR-2000-082, NeuroCOLT2, June 2001.
182. J. Shawe-Taylor and G. Karakoulas. Towards a strategy for boosting regressors.
In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances
in Large Margin Classifiers, pages 247–258, Cambridge, MA, 2000. MIT Press.
183. Y. Singer. Leveraged vector machines. In S.A. Solla, T.K. Leen, and K.-R.
Müller, editors, Advances in Neural Information Processing Systems, volume 12,
pages 610–616. MIT Press, 2000.
184. D. Tax and R. Duin. Data domain description by support vectors. In M. Verleysen,
editor, Proc. ESANN, pages 251–256, Brussels, 1999. D. Facto Press.
185. F. Thollard, M. Sebban, and P. Ezequel. Boosting density function estimators. In
Proc. 13th European Conference on Machine Learning, volume LNAI 2430, pages
431–443, Helsinki, 2002. Springer Verlag.
186. A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-posed Problems. W.H. Winston,
Washington, D.C., 1977.
187. K. Tsuda, M. Sugiyama, and K.-R. Müller. Subspace information criterion for
non-quadratic regularizers – model selection for sparse regressors. IEEE Trans-
actions on Neural Networks, 13(1):70–80, 2002.
188. L.G. Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134–1142, November 1984.
189. A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Pro-
cesses. Springer Verlag, New York, 1996.
190. V.N. Vapnik. The nature of statistical learning theory. Springer Verlag, New
York, 1995.
191. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
192. V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probab. and its Applications,
16(2):264–280, 1971.
193. J. von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann., 100:295–320,
1928.
194. M.A. Walker, O. Rambow, and M. Rogati. Spot: A trainable sentence planner.
In Proc. 2nd Annual Meeting of the North American Chapter of the Association
for Computational Linguistics, 2001.