Poly Aml
Austin J. Stromme∗
November 6, 2024
Abstract
Contents
1 Introduction and basics of binary classification
1.1 What is machine learning?
1.2 Binary classification: a minimal theoretical model
1.3 Empirical risk minimization
1.4 Warm up: finite classes
5 Support vector machines
5.1 Margin
5.2 Hard-SVM
5.3 Soft-SVM
5.4 Statistical analysis of soft-SVM
6 Ensemble methods
6.1 Majority vote
6.2 Random forests
6.3 AdaBoost
7 Neural networks
7.1 VC dimension of neural networks with sign activation
7.2 Training neural networks with backpropagation
7.3 Brief introduction to first-order methods
7.4 Backpropagation
B Chaining
B.1 Covering and packing numbers
B.2 Single-scale chaining
B.3 Dudley's entropy integral
1 Introduction and basics of binary classification
1.1 What is machine learning?
Machine learning (ML) is the theory and practice of using data to make decisions. Compar-
ing with statistics, ML was historically less concerned with modeling, studying problems
where modeling was difficult or impossible, and less interested in quantifying uncertainty.
For example, suppose we want to classify images of handwritten digits {0, 1, . . . , 9}. Cre-
ating an explicit model for such data may be quite difficult; the approach of ML is to
instead focus on finding a good predictor, side-stepping the problem of recovering the
“true” data-generating process.
In this sense, ML favors a more blackbox approach than statistics, and is oriented more
towards prediction than inference. However, these differences are receding. Modern statis-
tics is increasingly concerned with computational considerations, finite sample analysis,
and model misspecification. And modern ML incorporates many aspects of probabilistic
modeling.
Examples of ML problems. Let us give some more discussion of the kinds of problems
that are studied in ML.
Example 1.1 (Supervised learning). Supervised learning refers to problems where the
training data includes a response variable that we are interested in estimating for the
test data. We observe (X1 , Y1 ), . . . , (Xn , Yn ) ∈ X × Y, assumed to be iid draws from a
distribution P, and we wish to use this information to estimate y ∼ P(· | x) for a new x.
Examples:
1. Spam detection: given a collection of emails labeled spam or not, we want to use this data to predict whether new emails are spam
2. Clustering: given K ∈ N, produce S1, . . . , SK ⊂ {X1, . . . , Xn} such that the sets are pairwise disjoint and ⋃_{k=1}^K Sk = {X1, . . . , Xn}
3. Generative modeling: how can we generate more samples that are (approximately)
from the distribution of X1 , . . . , Xn ? Examples include DALL-E for images, or GPT
for text.
Example 1.3 (Reinforcement learning). In many settings, the ML model interacts with
its environment in some way. For example:
1. ML systems for games, such as chess, Go, Starcraft II. Here the system has to take
a series of actions which it receives rewards for.
Note that the above categories are not necessarily disjoint, and often are mutually
supportive. For example, dimensionality reduction is a common preliminary step in many
machine learning pipelines, clustering can be used as a method for supervised learning,
reinforcement learning pipelines often involve training an evaluator which estimates the
reward of a given action (e.g. the value of a given chess board), and generative models
often include supervised components.
1.2 Binary classification: a minimal theoretical model
In binary classification, we observe iid samples (X1, Y1), . . . , (Xn, Yn) from a distribution P on X × {0, 1}, and our goal is to find a classifier h : X → {0, 1} such that the classification error (or risk)
R(h) := P(X,Y)∼P[h(X) ̸= Y] (1.1)
is as small as possible.
Let us make some remarks at this point. First, since y is a binary random variable,
it has a Bernoulli distribution, so the problem is entirely encoded within the regression
function, namely the conditional probability
η(x) := P[Y = 1 | X = x].
Second, we emphasize that we cannot hope to predict y with perfect accuracy even given
knowledge of the regression function η(x), unless η(x) ∈ {0, 1}. In the extreme case
η(x) = 0.5, we cannot predict y at all. As we will discuss, the amount of noise has a
strong impact on the sample complexity of the problem.
Because of this issue of noise, we cannot, in general, drive the classification error R(h)
to 0. What is the smallest value that can be achieved? As a thought experiment, assume
that we have all the information that we can hope to get, so that we know the regression
function η(x) perfectly. Again, if η(x) = 1/2, then x contains no information about y and
we might as well guess randomly. Otherwise, we can do better than random guessing: if
η(x) > 1/2 then P[y = 1 | x] > P[y = 0 | x], so if we output 1 we’ll be correct more often
than if we output 0. And similarly if η(x) < 1/2. As we will rigorously show, the optimal
classifier is the Bayes classifier h⋆(x) := 1[η(x) > 1/2], in the sense that R(h⋆) ⩽ R(h) for every classifier h : X → {0, 1}.
To measure the performance of another classifier h, we will look at the excess risk E(h)
defined as
E(h) = R(h) − R(h⋆ ) ⩾ 0. (1.2)
Our hope is that, with sufficiently many observations (sufficiently large n), we can output
a classifier which achieves nearly 0 excess risk by mimicking h⋆ . We have the following
result.
Theorem 1.4 (Decomposition of excess risk). Suppose h : X → {0, 1} is given. Write the
marginal distribution of P on x as Px . Then
E(h) = ∫_{{h ̸= h⋆}} |2η(x) − 1| dPx(x).
Note also that when η(x) ≡ 1/2 identically this result recovers the intuitive fact that
no learning is possible – every classifier performs the same.
Now, notice that if h(x) ̸= h⋆(x) then h(x) ̸= y if and only if h⋆(x) = y, since these are all binary numbers. Hence,
E(h) = R(h) − R(h⋆) = ∫_{{h ̸= h⋆}} (1[h⋆(x) = y] − 1[h⋆(x) ̸= y]) dP(x, y)
= ∫_{{h ̸= h⋆}} (P[h⋆(x) = y | x] − P[h⋆(x) ̸= y | x]) dPx(x)
= ∫_{{h ̸= h⋆}} (2P[h⋆(x) = y | x] − 1) dPx(x).
Now, we claim that
P[h⋆ (x) = y | x] = |η(x) − 1/2| + 1/2
Indeed, if η(x) > 1/2, then the LHS is η(x), and if η(x) ⩽ 1/2, then the LHS is 1 − η(x),
by the definition of the Bayes classifier h⋆ . The result follows by plugging this equation
into the above.
1.3 Empirical risk minimization
Given the data (X1, Y1), . . . , (Xn, Yn), a natural estimate of the risk of a classifier h is the empirical risk
Rn(h) := (1/n) Σ_{i=1}^n 1[h(Xi) ̸= Yi]. (1.3)
Observe that for any given classifier h, Rn is an unbiased estimator of R since E[Rn (h)] =
R(h). Moreover, Rn(h) → R(h) as n → ∞, almost surely, by the law of large numbers, and so we expect Rn to be a good approximation of R for large n.
Since the Bayes classifier h⋆ minimizes the population risk (1.1), a natural approach
to estimating it is thus to minimize the empirical risk (1.3). But notice that here we run
into a problem: what classifiers do we consider when minimizing (1.3)? Because without
any restrictions, we can minimize the empirical risk by taking the classifier ĥ(Xi) = Yi for i = 1, . . . , n and ĥ(x) = 0 if x ̸∈ {X1, . . . , Xn}. This classifier achieves the optimal value
Rn (ĥ) = 0, but doesn’t generalize outside the data – a case of extreme overfitting.
This problem is inherent to any method if we aren’t willing to make assumptions on
the distribution of (x, y): without assumptions on its distribution, it can have arbitrary
behavior out of the sample, and thus we cannot learn. For example, we have the following
“no-free-lunch” result, which shows that there is no algorithm which does substantially
better than random chance uniformly over all distributions of (x, y).
Theorem 1.5 (No-free-lunch for binary classification). Let n be a positive integer and
assume X is a domain with at least 2n elements. Let A be any algorithm for the task
of binary classification over X . Then there exists a distribution P on X such that the
associated Bayes estimator h⋆ achieves 0 risk, R(h⋆ ) = 0, yet
Generative vs. discriminative approaches Because of this issue, it is clear that
we need to somehow restrict the distributions P of (x, y). A common approach to this
kind of issue in statistics is the generative approach, where we directly restrict the set of
candidate distributions P. For example, in discriminant analysis, we might assume that
(x, y) is generated by sampling x from a mixture of two Gaussians and letting y be the
mixture component. But in machine learning problems we generally don’t expect that
we can accurately model the true underlying distribution in this kind of way (consider
again the example of handwritten digit classification), so we instead take a discriminative
approach, where we restrict the set of classifiers.
Estimation error vs. approximation error Specifically, we’ll assume that we are
given a class H of classifiers such that R(h) is small for some h ∈ H. This class may be
constructed from domain knowledge, computational considerations, or some mixture of
the two. We will see several examples in this class. For any candidate classifier ĥ built
from the data, we can decompose its excess risk as
E(ĥ) = R(ĥ) − R(h⋆) = [R(ĥ) − inf_{h∈H} R(h)] + [inf_{h∈H} R(h) − R(h⋆)], (1.4)
where the first bracketed term is the estimation error and the second is the approximation error.
The estimation error accounts for the error due to only having a sample from P rather
than the full distribution; this should go to zero as n → ∞. For this to be possible we
want H to be relatively small – the previous results imply, for example, that this cannot
happen if H is the set of all classifiers. On the other hand, if we make H too small then
we may have little estimation error but large approximation error. For this reason, we will
often want to balance the two sources of error. Without knowing more about the underlying data generating process P, it is difficult to say anything about the approximation error. In contrast, and perhaps surprisingly, we will be able to say a lot about the estimation error, even under mild assumptions on P.
The central question we will focus on for this chapter and the next is thus the following:
Problem 1.6. How does the estimation error go to 0 as a function of n and the class H?
This question is typical of the area of theoretical machine learning called statistical
learning theory, and the methods we will develop can be extended to settings well beyond
binary classification. However, to focus on the main ideas and leave time for other topics,
we will focus on binary classification as our minimal theoretical model.
The key technical tools will be concentration inequalities, which guarantee that empirical averages are close to their expectations with high probability. Note that we give a basic treatment, with proofs, of the concentration inequalities we will use in this course in Appendix A.
1.4 Warm up: finite classes
To see why these bounds will be important, assume that H = {h1, . . . , hM} is a finite class, and suppose that h̄ ∈ arg min_{h∈H} R(h) and ĥ ∈ arg min_{h∈H} Rn(h). The former we shall call the population risk minimizer, and the latter we shall call the empirical risk minimizer. (In general, our results will go through even if we can only approximately solve these problems.) Then we can bound the estimation error
R(ĥ) − R(h̄) = R(ĥ) − Rn(ĥ) + Rn(ĥ) − Rn(h̄) + Rn(h̄) − R(h̄) ⩽ (R(ĥ) − Rn(ĥ)) + (Rn(h̄) − R(h̄)),
where we used the definition of ĥ to observe that Rn (ĥ) ⩽ Rn (h̄). Now observe that the
two remaining terms concern the difference between a population and empirical average.
When the inner function is deterministic as h̄, these differences can be controlled with
concentration inequalities. However, the first term R(ĥ) − Rn (ĥ) cannot be directly con-
trolled with concentration inequalities since ĥ is itself random – it depends in a potentially
complex way on the sample (X1 , Y1 ), . . . , (Xn , Yn ). To get around this issue, we will use a
coarse bound:
R(ĥ) − Rn(ĥ) ⩽ sup_{h∈H} (R(h) − Rn(h)).
It is remarkable that, as we will see, this bound is sufficient to get good results in many
cases; these objects are called suprema of empirical processes. Much of the theoretical
work of this course will be in the service of controlling such suprema.
Note that we could be a bit more careful and handle the Rn (h̄) − R(h̄) separately; the
only reason we use the supremum step is to deal with the fact that ĥ depends on the data
in a complex way; on the contrary, h̄ doesn’t depend on the data at all. However, using
the bound for Rn (h̄) − R(h̄) somewhat simplifies the analysis, and anyways only worsens
the bound by a constant factor.
To control the supremum in (1.5), we now apply Hoeffding's inequality (see Appendix A) to each of the classifiers in H. For each j ∈ [M] and i ∈ [n], we consider the random variables ζij := 1[Yi ̸= hj(Xi)]. Notice that Rn(hj) = (1/n) Σ_{i=1}^n ζij and R(hj) = E[ζij] for each i ∈ [n]. On the other hand, ζij ∈ [0, 1] almost surely. Hence applying Hoeffding's inequality for bounded random variables, Corollary A.9, yields that for all j ∈ [M] and t > 0,
P[|Rn(hj) − R(hj)| > t] ⩽ 2e^{−2nt²}.
Setting
tδ := √(log(2M/δ)/(2n)),
it follows that, for all j ∈ [M],
P[|Rn(hj) − R(hj)| > tδ] ⩽ δ/M.
Hence, by the union bound,
P[∃j ∈ [M] : |Rn(hj) − R(hj)| > tδ] ⩽ Σ_{j=1}^M P[|Rn(hj) − R(hj)| > tδ] ⩽ Σ_{j=1}^M δ/M = δ.
Using the above with (1.5) we find that, with probability at least 1 − δ, R(ĥ) − R(h̄) ⩽ 2tδ. Plugging in the definition of tδ and recalling the decomposition of the excess risk from (1.4) yields the result.
This is the first of many bounds along these lines, but it illustrates the main ideas: we bound the estimation error by the supremum of an empirical process, and then use concentration inequalities to argue that concentration holds uniformly over the relevant class, which yields the bound.
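To make this bound concrete, here is a small numerical sketch in Python (it is not part of the notes themselves). It simulates the M classifiers simply as independent Bernoulli error indicators with known risks, an illustrative assumption only, and checks how often the uniform deviation max_j |Rn(hj) − R(hj)| exceeds tδ = √(log(2M/δ)/(2n)); the union bound guarantees this happens with probability at most δ.

import numpy as np

rng = np.random.default_rng(0)
n, M, delta, trials = 200, 50, 0.05, 2000

risks = rng.uniform(0.1, 0.9, size=M)              # R(h_j) for each simulated classifier
t_delta = np.sqrt(np.log(2 * M / delta) / (2 * n))

failures = 0
for _ in range(trials):
    # errors[i, j] = 1[h_j(X_i) != Y_i], drawn with mean risks[j]
    errors = rng.random((n, M)) < risks
    emp_risks = errors.mean(axis=0)                 # Rn(h_j)
    if np.max(np.abs(emp_risks - risks)) > t_delta:
        failures += 1

print("t_delta =", round(t_delta, 3))
print("empirical failure rate:", failures / trials, "(guaranteed to be at most", delta, ")")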
The bound we gave in the last section interpolated between these two extremes by
considering finite classes H and showed that the statistical difficulty of the problem scaled
as log(|H|). One might naturally think at this point that finite classes are all that can be
learned, but in fact much more is possible. There is more than one approach to this goal,
but in this lecture we will focus on an approach called Vapnik-Chervonenkis theory, which
was pioneered by Vapnik and Chervonenkis in the late 1960s.
Recall that we broke the excess risk up, as in (1.4), into the estimation error and the approximation error, where
ĥ ∈ arg min_{h∈H} Rn(h),  h̄ ∈ arg min_{h∈H} R(h)
are the empirical risk minimizer and the population risk minimizer, respectively. We then use the definition of ĥ to yield the bound
R(ĥ) − R(h̄) ⩽ (R(ĥ) − Rn(ĥ)) + (Rn(h̄) − R(h̄)).
As in the last section, rather than try to understand the possibly complex dependence of
ĥ on the sample D, we instead take the supremum over all h ∈ H. This should still be a
surprising strategy, but its strength will be further established in this lecture. We obtain
the bound
R(ĥ) − R(h̄) ⩽ 2 sup_{h∈H} |R(h) − Rn(h)|. (2.1)
2.1 VC dimension
To bound the size of the empirical process (2.1), we will use the following notion of
complexity of function classes.
In other words, H is shattered by a set of n points if functions from H attain all 2^n possible output patterns on those n points.
Definition 2.2 (Shattering coefficients). The shattering coefficients of H are defined, for
each n ∈ N, as
SH(n) := max_{(x1,...,xn)∈X^n} |{(h(xi))_{i=1}^n : h ∈ H}|.
The key intuition of VC theory is that the growth of these shattering coefficients SH (n)
should control the “complexity” of H: if they grow quickly then H must be very big since
it can fit arbitrary sign patterns.
Definition 2.3 (VC dimension). Given an integer d ∈ N, we say that a set of classifiers
H has VC dimension d, written VCdim(H) = d, if H is shattered by some set of d points
in X , but not by any set of d + 1 points in X . If H is shattered by a set of d points in X
for all d ∈ N, we write VCdim(H) = ∞.
Example 2.4 (Finite classes). Suppose H = {h1, . . . , hM} is a finite class. Let {x1, . . . , xn} ⊂ X be any subset of n points. Then |{(h(xi))_{i=1}^n : h ∈ H}| ⩽ M, so this set cannot be shattered once 2^n > M. Hence
VCdim(H) ⩽ ⌈log M⌉.
In particular, if H has one element then VCdim(H) = 0.
The following example is our first instance of an infinite class with finite VC dimension.
Theorem 2.6 (Main upper bound of VC theory). Suppose H has VC dimension d and n > d + 1. Then for all δ > 0, with probability at least 1 − δ,
sup_{h∈H} |R(h) − Rn(h)| ⩽ 2√(2d log(2en/d)/n) + √(log(2/δ)/(2n)).
In particular,
R(ĥ) − R(h̄) ⩽ 4√(2d log(2en/d)/n) + 2√(log(2/δ)/(2n)).
Example 2.7 (Finite classes). Let’s compare this result to what we found in the previous
section. Suppose H = {h1 , . . . , hM } is a finite class. Then Theorem 2.6 implies that for
all δ > 0 with probability at least 1 − δ,
sup_{h∈H} |R(h) − Rn(h)| ⩽ 2√(2 log(M) log(2en/log(M))/n) + √(log(2/δ)/(2n)).
In the previous lecture we showed that, for all δ > 0, with probability at least 1 − δ,
R(ĥ) − R(h̄) ⩽ √(2 log(2M/δ)/n).
Ignoring constants, we have thus nearly recovered our previous result, but with the addi-
tion of the log(n/ log M ) factor.
We remark that we generally expect terms of the form d/n where d = VCdim(H), since when n < d, we cannot distinguish H from the set of all classifiers. This intuition can be
made rigorous with formal lower bounds to the effect that the above bound is essentially
optimal (in the sense that there is some distribution P on X × {0, 1} for which it is tight),
but such lower bounds will not be covered in this class. If you’re curious, you can find a
proof in [6, Chapter 28].
2.3 Symmetrization
The proof of Theorem 2.6 will use an important and ingenious technique known as
symmetrization. Symmetrization applies to the expectation of the empirical process
suph∈H |R(h) − Rn (h)|; the following consequence of the bounded differences inequality
allows us to reduce to the expected empirical process.
Lemma 2.8 (Tail bounds for the empirical process). For all δ > 0, with probability at least 1 − δ,
sup_{h∈H} |R(h) − Rn(h)| ⩽ ED[sup_{h∈H} |R(h) − Rn(h)|] + √(log(1/δ)/(2n)).
Proof. Consider the map D ↦ sup_{h∈H} |R(h) − Rn(h)|. It is straightforward to verify that this function has the bounded differences property with ci = 1/n: if we change one sample (xi, yi) to (x′i, y′i) and write the corresponding
empirical risk as Rn′, we find that
sup_{h∈H} |R(h) − Rn(h)| − sup_{h∈H} |R(h) − Rn′(h)| ⩽ sup_{h∈H} (|R(h) − Rn(h)| − |R(h) − Rn′(h)|)
⩽ sup_{h∈H} |Rn(h) − Rn′(h)|
= sup_{h∈H} (1/n)|1[h(xi) ̸= yi] − 1[h(x′i) ̸= y′i]| ⩽ 1/n.
This verifies the bounded difference condition, so we may apply Theorem A.14 to obtain,
for all t > 0,
P[sup_{h∈H} |R(h) − Rn(h)| − ED[sup_{h∈H} |R(h) − Rn(h)|] > t] ⩽ e^{−2nt²}.
Taking t = √(log(1/δ)/(2n)) yields the result.
So let us focus on the expectation of the empirical process. We will introduce a new
sample, known as a ghost sample: suppose D′ = ((x′1 , y1′ ), . . . , (x′n , yn′ )) ∼ P n is an iid
draw from P^n, distinct from D. For a classifier h : X → {0, 1}, write
Rn′(h) := (1/n) Σ_{i=1}^n 1[h(x′i) ̸= y′i].
Then
ED[sup_{h∈H} |R(h) − Rn(h)|] = ED[sup_{h∈H} |ED′[Rn′(h)] − Rn(h)|] ⩽ ED,D′[sup_{h∈H} |Rn′(h) − Rn(h)|],
where we apply Jensen's inequality twice. We now make the crucial observation of the
symmetrization technique: for any vector (εi )ni=1 ∈ {−1, 1}n the symmetry of D and D′
imply
ED,D′[sup_{h∈H} |Rn′(h) − Rn(h)|] = ED,D′[sup_{h∈H} |(1/n) Σ_{i=1}^n (1[h(xi) ̸= yi] − 1[h(x′i) ̸= y′i])|]
= ED,D′[sup_{h∈H} |(1/n) Σ_{i=1}^n εi (1[h(xi) ̸= yi] − 1[h(x′i) ̸= y′i])|].
Since this is true for all ε ∈ {−1, 1}n , we can take each entry to be distributed according
to a Rademacher random variable, εi ∼ Rad(1/2) (recall this means P[εi = 1] = P[εi =
−1] = 1/2), and write
ED,D′[sup_{h∈H} |Rn′(h) − Rn(h)|] = ED,D′,ε[sup_{h∈H} |(1/n) Σ_{i=1}^n εi (1[h(xi) ̸= yi] − 1[h(x′i) ̸= y′i])|].
Applying the triangle inequality finally yields
ED[sup_{h∈H} |R(h) − Rn(h)|] ⩽ 2 ED,ε[sup_{h∈H} |(1/n) Σ_{i=1}^n εi 1[h(xi) ̸= yi]|]. (2.2)
We now work conditionally on the data D. In this case, the supremum is actually over
a finite set. Let
AH(D) := {(1[h(xi) ̸= yi])_{i=1}^n : h ∈ H} ⊂ {0, 1}^n.
Then
Eε[sup_{h∈H} |(1/n) Σ_{i=1}^n εi 1[h(xi) ̸= yi]|] = (1/n) Eε[max_{a∈AH(D)} |⟨a, ε⟩|].
Lemma 2.9. Suppose A ⊂ R^n is a finite set and (εi)_{i=1}^n is a vector of iid Rad(1/2) random variables. Then
Eε[sup_{a∈A} ⟨a, ε⟩] ⩽ max_{a∈A} ∥a∥ · √(2 log |A|).
Proof. The result follows from a more general observation, which is as follows.
Lemma 2.10 (Maximum of finitely many sub-Gaussian random variables). Suppose ζ1, . . . , ζM are mean zero and σ²-sub-Gaussian. Then
E[max_{k∈[M]} ζk] ⩽ √(2σ² log M).
With Lemma 2.10 in hand, we simply need to check that ⟨a, ε⟩ is mean 0 and sub-
Gaussian. For this, we use the fact that
⟨a, ε⟩ = Σ_{i=1}^n ai εi,
so it is a sum of n independent, mean zero, sub-Gaussian variables, where the ith term is ai²-sub-Gaussian by Example A.4 and Proposition A.7. Hence by Proposition A.7 once again, we find that ⟨a, ε⟩ is ∥a∥²-sub-Gaussian. The result then follows immediately
from Lemma 2.10.
Note that
|AH(D)| = |{(h(xi))_{i=1}^n : h ∈ H}| ⩽ SH(n).
Thus
Eε[sup_{h∈H} |(1/n) Σ_{i=1}^n εi 1[h(xi) ̸= yi]|] ⩽ √(2 log(4 SH(n))/n).
We therefore see that the shattering coefficients SH (n) control the rate of convergence of
the empirical process.
Lemma 2.11 (Sauer-Shelah). Suppose H has VC dimension d. Then for all n > d + 1, we have
SH(n) ⩽ (en/d)^d.
Observe that this lemma does indeed imply a transition from exponential to polynomial growth: the definition of VC dimension implies that SH(n) = 2^n for n ⩽ d and SH(d + 1) < 2^{d+1}. This lemma is proved with a non-trivial combinatorial argument that we will not
cover in this course. If you are curious about its proof, take a look at this nice writeup.
2.5 Application to linear classes
Let X = Rd and define the set of linear maps
Ld := {hw : w ∈ Rd},
for
hw (x) := ⟨w, x⟩ ∀x ∈ Rd .
We remark that in practice we will want to include a constant bias term, but this is
equivalent to our setting here since we may consider x′ = (x, 1) ∈ Rd+1 and w′ = (w, b) ∈
Rd+1 .
Using this set of linear maps, let us consider the following hypothesis class
HL,d := {1[hw ⩾ 0] : hw ∈ Ld }.
Given data (X1, Y1), . . . , (Xn, Yn), we would like to find a classifier in HL,d whose risk is small. Following the ERM paradigm, we choose ĥ to minimize the empirical risk
ĥ ∈ arg min_{h∈HL,d} Rn(h) := (1/n) Σ_{i=1}^n 1[h(Xi) ̸= Yi].
Using the theory we’ve developed so far, we can get a handle on the resulting estimation
error by controlling the VC dimension VCdim(HL,d ). We have the following result.
VCdim(HL,d ) = d.
Proof. Let e1, . . . , ed ∈ Rd be the standard basis vectors. Then, given any labeling (y1, . . . , yd) ∈ {0, 1}^d, let w := Σ_{i=1}^d (2yi − 1) ei. Then ⟨w, ei⟩ = 2yi − 1 ⩾ 0 if and only if yi = 1, so that
SHL,d(d) = 2^d,
and hence VCdim(HL,d) ⩾ d.
For the upper bound, let x1, . . . , x_{d+1} ∈ Rd be arbitrary. Since any d + 1 vectors in Rd are linearly dependent, there exist α1, . . . , α_{d+1}, not all zero, such that
Σ_{i=1}^{d+1} αi xi = 0.
Without loss of generality, let us suppose there is some i0 ∈ [d + 1] such that αi0 > 0
(there must be one that is non-zero, and we can multiply by −1 to make it positive if
necessary). Let yi := −1 if αi ⩾ 0, and 1 if αi < 0. We claim that there is no w ∈ Rd for
which hw (xi ) = yi for all i ∈ [d + 1]. Suppose, for the sake of contradiction, that there
does exist some such w, so that ⟨w, xi⟩ ⩾ 0 if and only if yi = 1, for all i ∈ [d + 1]. Then ⟨w, xi0⟩ < 0, yet
0 > ⟨w, xi0⟩ = (−1/αi0) Σ_{i̸=i0} αi ⟨w, xi⟩ ⩾ 0,
where the last inequality follows because if αi ⩾ 0 then ⟨w, xi ⟩ < 0 so αi ⟨w, xi ⟩ ⩽ 0, and
if αi < 0, then ⟨w, xi ⟩ ⩾ 0, so αi ⟨w, xi ⟩ ⩽ 0. Hence we have a contradiction and can
conclude.
How can we actually compute this ERM? One approach is the Perceptron algorithm,
a classic algorithm due to Rosenblatt. The Perceptron algorithm iteratively improves the
estimate of w⋆ using the update
w(t+1) = w(t) + Yi Xi
for any pair (Xi , Yi ) which is mis-classified by w(t) , namely for which ⟨w(t) , Xi ⟩Yi ⩽ 0.
Observe that this update improves the performance of the classifier, at least on Xi, since it ensures
Yi⟨w(t+1), Xi⟩ = Yi⟨w(t), Xi⟩ + Yi²∥Xi∥² = Yi⟨w(t), Xi⟩ + ∥Xi∥² ⩾ Yi⟨w(t), Xi⟩.
Algorithm 1 The Perceptron.
1: Input: Data points {(Xi, Yi)}_{i=1}^n ⊂ Rd × {−1, 1}.
2: procedure Perceptron({(Xi, Yi)}_{i=1}^n)
3:   Initialize w(0) = 0.
4:   for t = 1, 2, . . . do
5:     If there is an i such that Yi⟨w(t), Xi⟩ ⩽ 0, choose one and put w(t+1) = w(t) + Yi Xi.
6:     Else, return w(t).
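For concreteness, here is a minimal Python sketch of Algorithm 1 (it is not part of the notes); it assumes the data is given as numpy arrays and is linearly separable through the origin, and the toy dataset at the bottom is an arbitrary illustration.

import numpy as np

def perceptron(X, Y, max_iters=10_000):
    """Sketch of Algorithm 1. X: (n, d) features, Y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        margins = Y * (X @ w)
        mistakes = np.where(margins <= 0)[0]
        if len(mistakes) == 0:
            return w                      # all points correctly classified
        i = mistakes[0]                   # pick any misclassified point
        w = w + Y[i] * X[i]               # Perceptron update
    raise RuntimeError("did not converge; data may not be linearly separable")

# tiny usage example on separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1, 1, -1, -1])
print(perceptron(X, Y))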
Let’s consider the well-specified case, where each Yi = hw⋆ (Xi ) for some w⋆ ∈ Rd .
Proof. Suppose the algorithm runs for at least T ∈ N iterations. Let us assume without
loss of generality that ∥w⋆ ∥ = B and Yi ⟨w⋆ , Xi ⟩ ⩾ 1 for all i ∈ [n].
We start by showing that
⟨w⋆ , w(T ) ⟩ ⩾ T. (3.2)
Observe that for any time t ∈ {0, 1, . . . , T − 1}, if (Xi, Yi) is the pair used in the update at time t, then
⟨w⋆, w(t+1)⟩ − ⟨w⋆, w(t)⟩ = Yi⟨w⋆, Xi⟩ ⩾ 1.
Hence
⟨w⋆, w(T)⟩ = ⟨w⋆, w(T)⟩ − ⟨w⋆, w(0)⟩ = Σ_{t=1}^T (⟨w⋆, w(t)⟩ − ⟨w⋆, w(t−1)⟩) ⩾ T,
yielding (3.2).
Next, we claim that
∥w(t+1)∥ ⩽ √t · R, ∀t ∈ {0, 1, . . . , T − 1}, (3.3)
or equivalently ∥w(t+1)∥² ⩽ tR².
3.2 Computational hardness of ERM
Unfortunately, the fact that the ERM problem over halfspaces admits a practical algorithm
(the perceptron algorithm) is a rare occurrence. For one, when we remove the well-specified
assumption, the ERM problem over halfspaces becomes computationally hard. But, in
general, even well-specified ERM problems can be computationally hard (unless P = N P ,
there is no algorithm for the ERM problem with running time polynomial in n and d;
those who are curious can find more precise details in [6, Chapter 8]).
4.1 Introduction
Notice that, by definition, the Perceptron algorithm doesn’t converge if the data points
aren’t linearly separable. That is, if there is no w ∈ Rd such that Yi ⟨w, Xi ⟩ > 0 for all
i ∈ [n], then the algorithm never stops.
As a simple example of such a case, consider the following data for X = R and n = 3:
(X1, Y1) = (−1, 1), (X2, Y2) = (0, −1), (X3, Y3) = (1, 1).
This data is not linearly separable, and so the Perceptron algorithm won’t converge. Of
course, data which is not linearly separable, either due to an underlying non-linear data
generating process, or due to noise, is the rule rather than the exception.
In this section, we’ll discuss the kernel trick, a very clever and useful workaround for
this issue, which will have applications well beyond the Perceptron algorithm.
There are three key ideas involved in the kernel trick.
1. Embed the data into a higher dimensional feature space. To overcome this issue of linear separability, we will embed our data into a higher dimensional vector space R^{d′} for d′ ⩾ d, called the feature space, using an embedding ψ : Rd → R^{d′}, called the feature map. If we choose this map properly then the data ((ψ(X1), Y1), . . . , (ψ(Xn), Yn)) will be linearly separable. We now want to run the Perceptron on this transformed data, to obtain a vector w⋆ ∈ R^{d′} such that Yi⟨w⋆, ψ(Xi)⟩ > 0 for all i ∈ [n]. We can use this w⋆ to classify new x via x ↦ sgn(⟨w⋆, ψ(x)⟩).
2. We only need to compute inner products. In general, we may have made our
problem significantly more computationally difficult, since now we have to manipu-
late these higher dimensional data points ψ(Xi ). But the Perceptron algorithm has
a special feature: it only depends on inner products of the feature vectors. Because
of the update w(t+1) = w(t) + Yi ψ(Xi ), we know that the iterate w(t) will always be a
sum of (signed) vectors in {ψ(X1 ), . . . , ψ(Xn )}, and so checking the mis-classification
condition Yi ⟨w(t) , ψ(Xi )⟩ ⩽ 0 can be implemented by considering a (signed) sum of
products ⟨ψ(Xi), ψ(Xj)⟩. Explicitly, if it denotes the index used at step t of the algorithm for our update, then
w(T) = Σ_{t=1}^{T−1} Yit ψ(Xit),
so that
Yj⟨w(T), ψ(Xj)⟩ ⩽ 0 ⟺ Yj Σ_{t=1}^{T−1} Yit ⟨ψ(Xit), ψ(Xj)⟩ ⩽ 0.
And if the algorithm terminates after T steps, we can obtain a predicted label for a new x via
sgn(⟨w(T), ψ(x)⟩) = sgn(Σ_{t=1}^{T−1} Yit ⟨ψ(Xit), ψ(x)⟩).
Hence, by keeping track of the indices we have visited so far, we can implement the Perceptron algorithm using only the inner products (⟨ψ(Xi), ψ(Xj)⟩)_{i,j=1}^n, and compute predictions for x ∈ Rd using only the inner products (⟨ψ(Xi), ψ(x)⟩)_{i=1}^n.
3. Only use ψ such that the kernel K(x, x′ ) := ⟨ψ(x), ψ(x′ )⟩ is easy to compute.
The map K : Rd × Rd → R defined by K(x, x′ ) := ⟨ψ(x), ψ(x′ )⟩ is called the kernel.
So long as K(x, x′ ) is easy to compute (typically O(d)), we will be able to take
advantage of the flexibility of this very high dimensional space W , without paying
any computational penalty.
Example 4.1 (Classifying three points). Let’s take the example we gave above: X = R
and n = 3:
(X1, Y1) = (−1, 1), (X2, Y2) = (0, −1), (X3, Y3) = (1, 1).
Then we can take the embedding ψ : R → R² defined by ψ(x) := (x, x²). The transformed data is then
(ψ(X1), Y1) = ((−1, 1), 1), (ψ(X2), Y2) = ((0, 0), −1), (ψ(X3), Y3) = ((1, 1), 1),
which is linearly separable.
Example 4.2 (Embedding into finite degree polynomials). We can generalize the above example by considering X = Rd and the embedding ψ : Rd → R^{(d+1)^k} for some k ∈ N defined by taking
ψ(x)J := ∏_{l=1}^k x_{Jl}, ∀J ∈ {0, 1, . . . , d}^k,
where we use the convention x0 := 1.
Observe that for x, x′ ∈ Rd, we have
⟨ψ(x), ψ(x′)⟩ = Σ_{J∈{0,...,d}^k} ∏_{l=1}^k x_{Jl} x′_{Jl} = (Σ_{j=0}^d xj x′j)^k = (1 + ⟨x, x′⟩)^k.
Thus, for this choice of embedding, the kernel is easy to compute (time O(d)), despite the
fact that the feature space is exponentially larger than the original space!
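As a sanity check, the following short Python sketch (not from the notes) compares the explicit feature inner product of Example 4.2 with the closed-form kernel (1 + ⟨x, x′⟩)^k on random inputs; the helper poly_features and the chosen values of d and k are illustrative.

import numpy as np
from itertools import product

def poly_features(x, k):
    """Explicit embedding of Example 4.2: psi(x)_J = prod_l x_{J_l} over J in {0,...,d}^k,
    with the convention x_0 := 1 (append a constant coordinate)."""
    z = np.concatenate(([1.0], x))                    # (d+1,) vector with x_0 = 1
    return np.array([np.prod(z[list(J)]) for J in product(range(len(z)), repeat=k)])

rng = np.random.default_rng(0)
d, k = 3, 4
x, xp = rng.normal(size=d), rng.normal(size=d)

explicit = poly_features(x, k) @ poly_features(xp, k)  # O((d+1)^k) work
kernel = (1 + x @ xp) ** k                             # O(d) work
print(explicit, kernel)   # the two agree up to floating point error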
Example 4.3 (Embedding into ℓ2 with the Gaussian kernel). Let's consider another example, where the feature space is the space of square-summable infinite sequences, defined as
ℓ2 := {(xj)_{j=0}^∞ ∈ R^N : Σ_{j=0}^∞ xj² < ∞},
and where, for X = R, we take the embedding
ψ(x)j := e^{−∥x∥²/2} x^j/√(j!), j = 0, 1, . . . .
One can check that ⟨ψ(x), ψ(x′)⟩_{ℓ2} = e^{−(∥x∥² + ∥x′∥²)/2} Σ_{j=0}^∞ (xx′)^j/j! = e^{−(x − x′)²/2}, the Gaussian kernel. So we have an infinite dimensional feature space, yet still can compute the kernel in O(1) time.
Another reference for this material, albeit not freely available, is [8, Chapter 12]. There
is also a discussion in [6, Chapter 16], but it’s pretty vague.
First, we’ll need to recall the definition of a Hilbert space.
Definition 4.4 (Hilbert space). A Hilbert space (W, ⟨·, ·⟩W ) is a real vector space W
associated with an inner product ⟨·, ·⟩W , such that the resulting normed space is complete.
We’ll now formally define these three equivalent ways of viewing the kernel trick.
Definition 4.5 (Feature map). A feature map is any map ψ : X → V from X into a
Hilbert space V .
We say that a symmetric map K : X × X → R is a positive semi-definite (PSD) kernel if for all n ∈ N, X1, . . . , Xn ∈ X, and α ∈ R^n,
Σ_{i,j=1}^n αi αj K(Xi, Xj) ⩾ 0.
Given a Hilbert space W of functions on X, we say that K is a reproducing kernel for W if K(x, ·) ∈ W for all x ∈ X and the reproducing property f(x) = ⟨K(x, ·), f⟩W holds for all f ∈ W and x ∈ X. If there is some reproducing kernel for W, we say that W is a reproducing kernel Hilbert space (RKHS).
Algorithm 2 The Kernelized Perceptron.
1: Input: Data points {(Xi, Yi)}_{i=1}^n ⊂ X × {−1, 1} and a PSD kernel K : X × X → R.
2: procedure KernelizedPerceptron({(Xi, Yi)}_{i=1}^n, K)
3:   Initialize a list I = [].
4:   For any list of indices I, define hI(x) := sgn(Σ_{j∈I} Yj K(x, Xj)), where the sum includes repeated elements of I.
5:   for t = 1, 2, . . . do
6:     If there is an i such that hI(Xi) ̸= Yi, add it to I.
7:     Else, return hI.
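A minimal Python sketch of Algorithm 2 follows (again, not part of the notes); it stores how many times each index enters the list I, assumes ±1 labels, and is run on the three-point example from Section 4.1 with the kernel K(x, x′) = (1 + xx′)².

import numpy as np

def kernelized_perceptron(X, Y, K, max_iters=10_000):
    """Sketch of Algorithm 2. X: data, Y: (n,) labels in {-1, +1}, K: a PSD kernel function."""
    n = len(Y)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    counts = np.zeros(n)                      # how many times each index appears in I
    for _ in range(max_iters):
        scores = G @ (counts * Y)             # sum_{j in I} Y_j K(X_i, X_j) for each i
        mistakes = np.where(np.sign(scores) != Y)[0]
        if len(mistakes) == 0:
            return counts
        counts[mistakes[0]] += 1              # "add i to I"
    raise RuntimeError("did not converge")

# the three-point example from Section 4.1
X = np.array([-1.0, 0.0, 1.0])
Y = np.array([1, -1, 1])
K = lambda x, xp: (1 + x * xp) ** 2
counts = kernelized_perceptron(X, Y, K)
predict = lambda x: np.sign(sum(c * y * K(xj, x) for c, y, xj in zip(counts, Y, X)))
print([predict(x) for x in X])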
Proof. We first verify that the inner product is well-defined. It suffices to show that for
any α ∈ Rn and X1 , . . . , Xn ∈ X , if
Σ_{i=1}^n αi K(Xi, ·) = 0, (4.2)
then
⟨Σ_{i=1}^n αi K(Xi, ·), Σ_{j=1}^m βj K(X′j, ·)⟩W0 = 0, ∀β ∈ R^m, X′1, . . . , X′m ∈ X.
So that
⟨Σ_{i=1}^n αi K(Xi, ·), Σ_{j=1}^m βj K(X′j, ·)⟩W0 = Σ_{i=1}^n Σ_{j=1}^m αi βj K(Xi, X′j) = Σ_{j=1}^m βj Σ_{i=1}^n αi K(Xi, X′j) = 0.
Implying
∥Σ_{i=1}^n αi K(Xi, ·)∥²_{W0} > 0,
as desired.
Hence W0 is a real vector space of functions on X with valid inner product ⟨·, ·⟩W0. And it obeys the reproducing property, since for all f ∈ W0, we have
f = Σ_{i=1}^n αi K(Xi, ·)
for some α ∈ R^n and X1, . . . , Xn ∈ X, and hence ⟨K(x, ·), f⟩W0 = Σ_{i=1}^n αi K(Xi, x) = f(x).
We have thus nearly constructed an RKHS which has K as its reproducing kernel, but
we are missing one crucial property: the space W0 may not be complete, and so isn’t a
Hilbert space. To handle this issue, we consider the completion W of W0 . Recall that we
can complete a metric space (Z, d) by letting Z⋆ be the set of all Cauchy sequences in Z with the metric d⋆((zk)_{k=1}^∞, (z′k)_{k=1}^∞) := lim_{k→∞} d(zk, z′k), where two sequences (zk)_{k=1}^∞, (z′k)_{k=1}^∞ are identified whenever lim_{k→∞} d(zk, z′k) = 0.
In our setting, the completion of W0 is the set of all Cauchy sequences (fn)_{n=1}^∞ with fn ∈ W0. We'll first check that we can regard these Cauchy sequences as functions on X. To this end, for each Cauchy sequence (fk)_{k=1}^∞ in W0, let f⋆ : X → R be defined as
f⋆(x) := lim_{k→∞} fk(x). (4.4)
This limit exists since, by the reproducing property,
|fk(x) − fℓ(x)| = |⟨K(x, ·), fk − fℓ⟩W0| ⩽ ∥K(x, ·)∥W0 ∥fk − fℓ∥W0,
which can be made arbitrarily small for k, ℓ sufficiently large. And (4.4) is well-defined, in the sense that if (fk)_{k=1}^∞, (f′k)_{k=1}^∞ are two Cauchy sequences in W0 such that ∥fk − f′k∥W0 → 0 as k → ∞, then the limits agree, since by the reproducing property (4.3)
|lim_{k→∞} fk(x) − lim_{k→∞} f′k(x)| = lim_{k→∞} |fk(x) − f′k(x)| ⩽ lim_{k→∞} ∥K(x, ·)∥W0 ∥fk − f′k∥W0 = 0.
So we can regard W as a set of functions on X via (4.4). It's clear that W is also a real vector space under the definition, for α, β ∈ R and (fk)_{k=1}^∞, (gk)_{k=1}^∞ ∈ W,
α(fk)_{k=1}^∞ + β(gk)_{k=1}^∞ := (αfk + βgk)_{k=1}^∞.
If we moreover define the inner product
⟨(fk)_{k=1}^∞, (gk)_{k=1}^∞⟩W := lim_{k→∞} ⟨fk, gk⟩W0,
then it is straightforward to check that this is well-defined, compatible with the above
vector space structure, and gives rise to the same metric as that of the completion on W .
So W is a Hilbert space with these definitions, and each of its elements can be regarded
as a function on X .
We can finally check that W is an RKHS with reproducing kernel K. First of all, it is
clear that for all x ∈ X, the constant sequence (K(x, ·))_{k=1}^∞ is Cauchy, so is in W, and corresponds to the function x′ ↦ K(x, x′). Hence K(x, ·) ∈ W for all x ∈ X. For the reproducing property, let f ∈ W and x ∈ X. Then there is a Cauchy sequence (fk)_{k=1}^∞ in W0 representing f, so that by (4.4) and the reproducing property on W0,
f(x) = lim_{k→∞} fk(x) = lim_{k→∞} ⟨K(x, ·), fk⟩W0 = ⟨K(x, ·), f⟩W.
Hence W is an RKHS with reproducing kernel K, and we conclude the existence portion
of the proof.
For uniqueness, suppose W′ is another RKHS over X with reproducing kernel K. Then clearly, W ⊂ W′. We'll show W′ ⊂ W to conclude uniqueness. So take any f ∈ W′. Since W is a closed linear subspace of W′, we can write f = f∥ + f⊥ where f∥ ∈ W and f⊥ is orthogonal to all elements of W. Using the reproducing property, we find that for all x ∈ X,
f(x) = ⟨K(x, ·), f⟩W′ = ⟨K(x, ·), f∥⟩W′ + ⟨K(x, ·), f⊥⟩W′ = f∥(x),
since K(x, ·) ∈ W.
But since this is true for all x ∈ X , it implies that f = f ∥ ∈ W , so that W ′ ⊂ W . Hence
W ′ = W and we have shown uniqueness.
Now that we have shown Propositions 4.8 and 4.9, we can safely think of RKHSs and
PSD kernels as coming in pairs, and we’ll do this from now on. We conclude our equiva-
lences by showing that each feature map has a unique kernel and RKHS pair associated
to it.
Our first claim in this direction is essentially trivial, and states that each RKHS and
PSD kernel pair has an associated unique feature map.
Proposition 4.10. Suppose we have two pairs of RKHSs and reproducing kernels, W0 , K0
and W1 , K1 . Define ψ0 : X → W0 and ψ1 : X → W1 via ψ0 (x) := K0 (x, ·) and ψ1 (x) :=
K1 (x, ·). Then ψ0 , ψ1 are feature maps, and we have uniqueness in the following sense:
K0 = K1 ⇐⇒ ψ0 (x) = ψ1 (x) ∀x ∈ X ,
To go in the other direction, we want to say that each feature map ψ : X → V corre-
sponds to a unique RKHS and PSD kernel pair W, K.
We then have the following statement.
Proposition 4.11. Suppose ψ0 : X → V0 and ψ1 : X → V1 are feature maps. Let K0 : X ×
X → R and K1 : X × X → R be defined by K0 (x, x′ ) := ⟨ψ0 (x), ψ0 (x′ )⟩V0 and K1 (x, x′ ) :=
⟨ψ1 (x), ψ1 (x′ )⟩V1 for all x, x′ ∈ X . Then K0 , K1 are PSD kernels. Let W0 , W1 be their
associated RKHSs. Then we have
span({ψ0(x) : x ∈ X}) ≅ W0,  span({ψ1(x) : x ∈ X}) ≅ W1.
In particular, if K0 = K1, then
span({ψ0(x) : x ∈ X}) ≅ span({ψ1(x) : x ∈ X}).
We remark that the uniqueness statement above is the best we can hope for because
we can always compose a Hilbert space isomorphism L : V0 → V1 with ψ0 to obtain a new
feature map ψ1 – feature maps are only specified up to Hilbert space isomorphism.
Proof. Let's first show that K0 is a PSD kernel. Observe that for all x, x′ ∈ X, using the symmetry of inner products,
K0(x, x′) = ⟨ψ0(x), ψ0(x′)⟩V0 = ⟨ψ0(x′), ψ0(x)⟩V0 = K0(x′, x),
and for any n ∈ N, X1, . . . , Xn ∈ X, and α ∈ R^n,
Σ_{i,j=1}^n αi αj K0(Xi, Xj) = ∥Σ_{i=1}^n αi ψ0(Xi)∥²_{V0} ⩾ 0,
so K0 is a PSD kernel. Now define a linear map T0 from span({ψ0(x) : x ∈ X}) to W0 by
T0(Σ_{i=1}^n αi ψ0(Xi)) := Σ_{i=1}^n αi K0(Xi, ·).
It isn’t hard to see that T0 is well-defined. We next check that T0 is a bijection. It is
clearly a surjection, and to check it is an injection it suffices to show it has trivial kernel.
So suppose X1 , . . . , Xn ∈ X and α ∈ Rn are such that
Σ_{i=1}^n αi K0(Xi, ·) = 0.
so that
Σ_{i=1}^n αi ψ0(Xi) = 0.
Hence T0 is a bijection; since it also preserves inner products, it is a Hilbert space isomorphism, and we conclude
span({ψ0(x) : x ∈ X}) ≅ span({K0(x, ·) : x ∈ X}) = W0,
where we used Proposition 4.8 for the equality. Obviously, the same proof works to show
span({ψ1(x) : x ∈ X}) ≅ W1
The upshot is that optimization problems over the potentially infinite-dimensional RKHS W can be reduced to finite-dimensional problems over R^n. As you can see, the crucial step in the proof comes from the reproducing property f(x) = ⟨K(x, ·), f⟩W, giving an example of a situation where this formulation of the kernel concept is most useful.
Proof. Let
Wn := {gα : α ∈ R^n}, where gα := Σ_{i=1}^n αi K(Xi, ·).
Then Wn is a closed linear subspace of W, and so has an orthogonal complement Wn⊥, so that each f ∈ W can be written uniquely as f = f∥ + f⊥, for f∥ ∈ Wn and f⊥ ∈ Wn⊥. By the reproducing property, for each i ∈ [n],
f(Xi) = ⟨K(Xi, ·), f⟩W = ⟨K(Xi, ·), f∥⟩W = f∥(Xi).
Therefore the objective depends on f only through f∥; moreover, since ∥f∥²W = ∥f∥∥²W + ∥f⊥∥²W, minimization over ∥f∥²W ⩽ λ² is equivalent to minimization over f∥ ∈ Wn with ∥f∥∥²W ⩽ λ².
α ∈ Rn such that
X n
f= αi K(Xi , ·).
i=1
Thus f (Xj ) = (Kn α)j , the jth element of the matrix-vector product Kn α. The constraint
∥f ∥W ⩽ λ can then be written
n
X
∥f ∥W ⩽ λ ⇐⇒ ∥f ∥2W ⩽ λ2 ⇐⇒ αi αj K(Xi , Xj ) ⩽ λ2 .
i,j=1
5 Support vector machines
As we saw in the last section, the kernel trick lets us run algorithms like the Perceptron with computational cost scaling only with the number of sample points n and the data dimension d, rather than the feature space dimension. But what about the statistical performance of such a procedure?
In general, the statistical performance could be quite bad. Observe that the Perceptron
using the PSD kernel K and associated RKHS W outputs an element of the set of half-
spaces of the RKHS W :
HW := {v ∈ W 7→ sgn(⟨w, v⟩W ) : w ∈ W }.
So we are effectively looking at classification over halfspaces in the RKHS W , and the
VC dimension of that class scales as dim(W ), which is often infinite. To handle this
problem, we need to use some structural condition on the resulting classification problem
over halfspaces in W .
The support vector machine paradigm (SVM) addresses this problem by considering a
natural condition known as the margin condition. As we will eventually see, this condition
will allow us to get totally dimension-free rates.
5.1 Margin
In this section, we’ll work for simplicity over X = Rd , but the discussion will apply as
well to RKHSs over X . We will thus be interested in results which don’t depend on the
dimension of the space.
Suppose we have data D = ((X1 , Y1 ), . . . , (Xn , Yn )) ∈ (Rd × {−1, 1})n that is linearly
separable. Often, there will be many linear halfspaces which perfectly classify the data.
The idea of support vector machines is that, in general, we should prefer halfspaces which
separate the points by a lot rather than halfspaces which separate the points by a little.
The notion of margin makes this idea precise.
Definition 5.1 (Margin). Given data D = ((X1, Y1), . . . , (Xn, Yn)) we define the margin margD as follows. If w ∈ Rd correctly classifies D, then
margD(w) := min_{i∈[n]} Yi ⟨w/∥w∥, Xi⟩.
For w ∈ Rd, write Hw := {x ∈ Rd : ⟨w, x⟩ = 0} for the associated separating hyperplane, and for z ∈ Rd define the distance
d(z, Hw) := min_{z′∈Hw} ∥z − z′∥.
Proof. Without loss of generality, we may assume ∥w∥ = 1. We first show that
d²(z, Hw) = ⟨z, w⟩². (5.1)
Indeed, note that z − ⟨z, w⟩w ∈ Hw, so
d²(z, Hw) ⩽ ∥z − (z − ⟨z, w⟩w)∥² = ⟨z, w⟩².
On the other hand, for any z′ ∈ Hw,
⟨z, w⟩² = ⟨z − z′, w⟩² ⩽ ∥z − z′∥².
Hence (5.1) holds.
Now, notice that, since w correctly classifies D, for all i ∈ [n], we have sgn(⟨Xi , w⟩) =
Yi . Hence for all i ∈ [n],
|⟨Xi , w⟩| = sgn(⟨Xi , w⟩)⟨Xi , w⟩ = Yi ⟨Xi , w⟩.
Taking the minimum over i and using (5.1) yields the result.
5.2 Hard-SVM
Let some linearly separable data D = ((X1 , Y1 ), . . . , (Xn , Yn )) be fixed. In the SVM
paradigm, we prefer hyperplanes with large margin over hyperplanes with small margin.
This leads us to solve the following optimization problem, known as the Hard-SVM prob-
lem.
max_{w∈Rd} margD(w). (5.2)
However, it is not clear that this problem can be solved efficiently in this form. In fact,
we can re-write it in the following equivalent form, which is more suitable to optimization
algorithms
min {∥w∥² : w ∈ Rd, Yi⟨w, Xi⟩ ⩾ 1 ∀i ∈ [n]}. (5.3)
Problem (5.3) is an example of a quadratic program, for which efficient solvers exist.
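For illustration, here is a short Python sketch of solving (5.3) as a quadratic program. It assumes the cvxpy package is available and uses a tiny separable dataset; neither the data nor the use of cvxpy comes from the notes.

import numpy as np
import cvxpy as cp   # assumption: cvxpy is installed; any QP solver would do

X = np.array([[3.0, 1.0], [2.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])   # toy separable data
Y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
constraints = [cp.multiply(Y, X @ w) >= 1]      # Y_i <w, X_i> >= 1 for all i
cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()

w_hat = w.value
print("w_hat =", w_hat)
print("margin =", np.min(Y * (X @ w_hat)) / np.linalg.norm(w_hat))   # equals 1/||w_hat|| at the optimum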
Proposition 5.3 (Equivalence of both formulations of Hard-SVM). Suppose there exists
w ∈ Rd which correctly classifies D. Then any solution of (5.3) is a solution of (5.2).
And any solution of (5.2) can be re-scaled to become a solution of (5.3).
Proof. Suppose w0 is a solution of (5.2) and w1 is a solution of (5.3). Then
margD(w0) ⩾ margD(w1) ⩾ 1/∥w1∥.
On the other hand, w0/(min_{i∈[n]} Yi⟨w0, Xi⟩) is feasible for (5.3), so
∥w1∥² ⩽ ∥w0/(min_{i∈[n]} Yi⟨w0, Xi⟩)∥² = 1/margD(w0)² ⩽ ∥w1∥².
Hence each inequality holds with equality, completing the proof.
One of the main reasons for our interest in the margin condition is it allows us to
prove statistical rates of convergence for the support vector machine problem which do
not depend on the dimension of the feature space (d in our notation), but instead on the
margin and the scale of the dataset.
5.3 Soft-SVM
The “hard” in “Hard-SVM” refers to the assumption that the data set is linearly separa-
ble. It is often more useful to consider a relaxation of Hard-SVM to allow for imperfect
classification – the problem may not be linearly separable, or even if it is, outputting a
perfect classifier makes the model much more sensitive to outliers. We’ll now define this
relaxed problem. Starting with (5.3), let us introduce slack variables that allow for violations of the constraints Yi⟨w, Xi⟩ ⩾ 1, to form
min {λ∥w∥² + (1/n) Σ_{i=1}^n ζi : w ∈ Rd, ζ ∈ R^n_{⩾0}, Yi⟨w, Xi⟩ ⩾ 1 − ζi ∀i ∈ [n]},
where λ > 0 is a parameter that controls the amount of penalization associated to the
average violated constraint. Observe that we can equivalently write this problem, using
the hinge loss φ(z) := max(0, 1 + z),
min_{w∈Rd} (1/n) Σ_{i=1}^n φ(−Yi⟨w, Xi⟩) + λ∥w∥². (5.4)
This is the Soft-SVM problem; you may also recognize it as a one hidden-layer neural
network with ℓ2 regularization and ReLU activations.
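Since (5.4) is a convex but non-smooth problem, one simple way to solve it approximately is subgradient descent. The following Python sketch is illustrative only and is not taken from the notes; the step-size schedule and the toy data are arbitrary choices, and φ(−Yi⟨w, Xi⟩) = max(0, 1 − Yi⟨w, Xi⟩) is the hinge loss from the text.

import numpy as np

def soft_svm_subgradient(X, Y, lam=0.1, eta=0.1, T=2000):
    """Subgradient descent on (1/n) sum_i max(0, 1 - Y_i <w, X_i>) + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        margins = Y * (X @ w)
        active = (margins < 1).astype(float)              # points with positive hinge loss
        # a subgradient of the hinge term is -(1/n) sum_{i active} Y_i X_i
        grad = -(X * (active * Y)[:, None]).mean(axis=0) + 2 * lam * w
        w = w - (eta / np.sqrt(t)) * grad                  # decaying step size
    return w

X = np.array([[3.0, 1.0], [2.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = soft_svm_subgradient(X, Y)
print(w_hat, np.sign(X @ w_hat))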
Definition 5.4 ((γ, D)-margin assumption). We say that P satisfies the (γ, D)-margin
assumption if there exists w⋆ ∈ W such that ∥w⋆ ∥W ⩽ γ, and for P-almost every (x, y),
we have ∥x∥ ⩽ D and 1 ⩽ y⟨w⋆ , K(x, ·)⟩W .
This condition is enough to permit a satisfying statistical analysis of both the Soft-
SVM and Hard-SVM estimators. For reasons of space, we defer the analysis of the latter
problem to Appendix C, and here just provide the analysis of the soft-SVM problem.
We recall our notation for binary classification: for each w ∈ W, we define the classifier hw(x) := sgn(⟨w, x⟩), and for every classifier h : W → {−1, 1} we define R(h) := P(X,Y)∼P[h(X) ̸= Y] as well as the excess risk E(h) := R(h) − R(h⋆). We then have the following guarantee: if P satisfies the (γ, D)-margin assumption and ŵ is the soft-SVM solution, then
E[E(hŵ)] ⩽ 2D²/(λn) + λγ²,
where the expectation is over the sample D.
We remark that the above can be extended to bounds in high probability via the
bounded differences inequality, but we omit this for simplicity. The proof is based on a
more general technique, known as a stability argument; see [6, Chapter 13].
Proof. Under the (γ, D)-margin assumption the classifier hw⋆ is correct P-almost surely, so the Bayes risk is 0 and
E(hŵ) = R(hŵ).
Writing the hinge loss φ(z) := max(0, 1 + z), notice that for any (X, Y ) ∈ W × {−1, 1},
we have
1[hw (X) ̸= Y ] ⩽ φ(−Y ⟨w, X⟩).
Hence
R(hŵ ) ⩽ E(X,Y )∼P [φ(−Y ⟨ŵ, X⟩)].
Now, let (X1′ , Y1′ ) ∼ P be a draw from P, independent of D. Define w′ to be the solution
of the soft-SVM problem with just the first sample changed, namely
w′ := arg min_{w∈W} (1/n) φ(−Y′1⟨w, X′1⟩) + (1/n) Σ_{i=2}^n φ(−Yi⟨w, Xi⟩) + λ∥w∥².
Since φ is 1-Lipschitz,
ED,(X′1,Y′1)[φ(−Y1⟨w′, X1⟩) − φ(−Y1⟨ŵ, X1⟩)] ⩽ ED,(X′1,Y′1)[|⟨w′ − ŵ, X1⟩|] ⩽ D · ED,(X′1,Y′1)[∥ŵ − w′∥]. (5.5)
But by definition of w′ (writing L̂ for the objective in the soft-SVM problem (5.4)), we have
L̂(w′) − L̂(ŵ) ⩽ (1/n)φ(−Y1⟨w′, X1⟩) − (1/n)φ(−Y′1⟨w′, X′1⟩) + (1/n)φ(−Y′1⟨ŵ, X′1⟩) − (1/n)φ(−Y1⟨ŵ, X1⟩).
Applying the fact that φ is 1-Lipschitz again yields
λ∥w′ − ŵ∥² ⩽ (2D/n)∥ŵ − w′∥  ⟹  ∥w′ − ŵ∥ ⩽ 2D/(λn).
Plugging into (5.5) yields
ED,(X′1,Y′1)[φ(−Y1⟨w′, X1⟩) − φ(−Y1⟨ŵ, X1⟩)] ⩽ 2D²/(λn).
Finally, we observe that, by symmetry, for all i ∈ [n] we have ED[φ(−Yi⟨ŵ, Xi⟩)] = ED[φ(−Y1⟨ŵ, X1⟩)]. Whence
ED[φ(−Y1⟨ŵ, X1⟩)] = ED[(1/n) Σ_{i=1}^n φ(−Yi⟨ŵ, Xi⟩)]
⩽ ED[(1/n) Σ_{i=1}^n φ(−Yi⟨w⋆, Xi⟩) + λ(∥w⋆∥² − ∥ŵ∥²)]
= ED[λ(∥w⋆∥² − ∥ŵ∥²)] ⩽ λ∥w⋆∥².
Putting the pieces together,
ED E(X,Y)∼P[φ(−Y⟨ŵ, X⟩)] = ED,(X′1,Y′1)[φ(−Y1⟨w′, X1⟩)]
= ED,(X′1,Y′1)[φ(−Y1⟨w′, X1⟩) − φ(−Y1⟨ŵ, X1⟩)] + ED[φ(−Y1⟨ŵ, X1⟩)]
⩽ 2D²/(λn) + λγ²,
which combined with R(hŵ) ⩽ E(X,Y)∼P[φ(−Y⟨ŵ, X⟩)] completes the proof.
6 Ensemble methods
Another important paradigm is that of ensemble methods. The idea here is that we can
sometimes combine multiple classifiers into one classifier which achieves better performance
than each individual classifier does on its own. Suppose that we have K fixed classifiers, h1, . . . , hK : X → {−1, 1}. An ensemble classifier, based on these K classifiers, is an element of
{hw(x) := sgn(Σ_{k=1}^K wk hk(x)) : w ∈ R^K}.
It is thus a certain halfspace in the coordinates of the classifiers h1, . . . , hK. There are many different methods in this family; we'll just briefly mention a couple here.
6.1 Majority vote
A very simple way of combining multiple classifiers is through majority vote. Here, we
train each hk on the data D = ((X1 , Y1 ), . . . , (Xn , Yn )), and then output the classifier
x ↦ sgn(Σ_{k=1}^K hk(x)).
6.2 Random forests
In a random forest, each classifier hk is grown as follows:
1. Sample m points from D uniformly at random with replacement (a bootstrap sample).
2. Grow a decision tree hk on these m points, where at each node you choose a random subset of ℓ feature dimensions to find the best split on.
6.3 AdaBoost
The AdaBoost algorithm is a famous method for producing the K classifiers h1 , . . . , hK .
Here, it is assumed that we have access to an algorithm which produces weak learners.
Definition 6.1 (Weak learner). Given ε ∈ (0, 1/2], a dataset D = ((X1, Y1), . . . , (Xn, Yn)) ∈ (X × {−1, 1})^n, and a weight vector µ := (µ1, . . . , µn) ∈ R^n_{⩾0}, we say that a classifier h : X → {−1, 1} is an ε-weak learner if
Σ_{i=1}^n µi 1[h(Xi) ̸= Yi] ⩽ 1/2 − ε.
For example, a common weak learning algorithm in practice is decision stumps, which
corresponds to a depth 1 decision tree (of course, this may not be a true weak learner in
the formal sense above, but it is nonetheless used in practice).
The AdaBoost algorithm assumes access to an algorithm which returns an ε-weak learner for any weight vector µ, and follows the steps
1. Initialize the weight vector µ(1) := (1/n, . . . , 1/n).
2. For t = 1, . . . , T:
a. Let ht be an ε-weak learner for the dataset D and weight vector µ(t).
b. Put εt := Σ_{i=1}^n µ(t)_i 1[ht(Xi) ̸= Yi].
c. Put wt := (1/2) log((1 − εt)/εt).
d. Update the weights, µ(t+1)_i ∝ µ(t)_i e^{−wt Yi ht(Xi)}, normalized so that Σ_{i=1}^n µ(t+1)_i = 1.
3. Return the weighted ensemble classifier
hw : x ↦ sgn(Σ_{t=1}^T wt ht(x)).
So the AdaBoost classifier iteratively up-weights data points that are mis-classified
(where Yi ht (Xi ) < 0) and down-weights data points that are correctly classified (where
Yi ht (Xi ) ⩾ 0).
We have the following result.
Theorem 6.2. Suppose the AdaBoost algorithm is run for T steps with an algorithm that
always returns an ε-weak learner for any weight vector µ. Let the output be hw . Then
(1/n) Σ_{i=1}^n 1[hw(Xi) ̸= Yi] ⩽ e^{−2ε²T}.
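To see these steps in action, here is a compact Python sketch of AdaBoost with decision stumps as the weak learners. It is not taken from the notes: the stump search, the weight formula wt = (1/2) log((1 − εt)/εt), and the toy dataset are standard choices written out for illustration only.

import numpy as np

def stump_weak_learner(X, Y, mu):
    """Decision stump: the depth-1 threshold classifier minimizing weighted error."""
    n, d = X.shape
    best = None
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sgn in (+1, -1):
                pred = sgn * np.where(X[:, j] >= thr, 1, -1)
                err = mu @ (pred != Y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    err, j, thr, sgn = best
    return lambda Z: sgn * np.where(Z[:, j] >= thr, 1, -1), err

def adaboost(X, Y, T=20):
    """Minimal AdaBoost sketch: weighted weak learners plus multiplicative re-weighting."""
    n = len(Y)
    mu = np.full(n, 1.0 / n)                              # step 1: uniform weights
    stumps, ws = [], []
    for t in range(T):
        h, eps = stump_weak_learner(X, Y, mu)             # steps a, b
        w_t = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # step c
        mu = mu * np.exp(-w_t * Y * h(X))                 # step d: up-weight mistakes
        mu = mu / mu.sum()
        stumps.append(h); ws.append(w_t)
    return lambda Z: np.sign(sum(w * h(Z) for w, h in zip(ws, stumps)))

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 1, -1)        # labels a stump cannot fit alone
H = adaboost(X, Y, T=40)
print("training error:", np.mean(H(X) != Y))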
7 Neural networks
We will use the following notation for a neural network: let (V, E, w) be a directed, weighted graph, and suppose the vertex set V can be split into disjoint sets of vertices, V = ⋃_{ℓ=0}^L Vℓ, such that if e ∈ E then e = (v, v′) for v ∈ Vℓ and v′ ∈ Vℓ+1 for some ℓ ∈ {0, 1, . . . , L − 1}. We'll write the vertices in layer ℓ as Vℓ = {vℓ^1, . . . , vℓ^{|Vℓ|}}. The
number L is called the number of layers of the neural network. The integers din := |V0 |
and dout := |VL | are called the input and output dimensions of the network. We’ll think of
the directed graph (V, E) as fixed throughout training, and this is often called the network
architecture.
The parameters w are the weights, which we can think of either as a function w : E → R,
or as a collection of matrices (W1 , . . . , WL ) where each Wℓ ∈ R|Vℓ |×|Vℓ−1 | and is fixed to
have entry (v, v ′ ) to be 0 if (v, v ′ ) ̸∈ E, along with biases (b1 , . . . , bL ) such that bℓ ∈ R|Vℓ | .
It is these weights that are adjusted through training to eventually fit the model.
We assume as given an activation function σ : R → R, for example a ReLU unit
(σ(τ ) := max(0, τ )), a logistic unit (σ(τ ) = (1 + e−τ )−1 ) or something else.
The network then generates output via a forward pass. Given an input x ∈ R^{din}, we set o0(x) = (x1, . . . , x_{din}), and then for subsequent layers we put aℓ(x) = Wℓ oℓ−1(x) + bℓ and oℓ(x) := σ(aℓ(x)). The output of the network is, finally, oL(x).
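As a concrete illustration of the forward pass (not part of the notes), here is a short Python sketch; the tanh activation and the tiny 2-3-1 architecture are arbitrary choices.

import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    """Forward pass as described above: a_l = W_l o_{l-1} + b_l, o_l = sigma(a_l).
    Returns the pre-activations a_l and outputs o_l of every layer (useful for backprop)."""
    o = [np.asarray(x, dtype=float)]       # o_0(x) = x
    a = [None]                             # no pre-activation at the input layer
    for W, b in zip(weights, biases):
        a.append(W @ o[-1] + b)
        o.append(sigma(a[-1]))
    return a, o

# toy 2-3-1 architecture
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
a, o = forward(np.array([0.5, -1.0]), weights, biases)
print(o[-1])     # o_L(x), the network output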
Following our standard approach in this class, we can imagine that we want to fit some distribution P over features x ∈ R^{din} and outputs y ∈ {−1, 1} with a loss ℓ, using our neural network. The population problem is then
min_w R(w) := E(x,y)∼P[ℓ(hw(x), y)].
But we don't have direct access to this population problem, and instead have some training data D := ((x1, y1), . . . , (xn, yn)), so we consider the empirical problem
min_w Rn(w) := (1/n) Σ_{i=1}^n ℓ(hw(xi), yi).
Proposition 7.1 (VC dimension of neural nets with sign activation). Suppose the net-
work architecture (V, E) is such that dout = 1 and the activation function σ = sgn, so
that H is a family of binary classifiers from the feature space Rdin → {−1, 1}. Then for
universal constants C, C ′ , we have
To upper bound these shattering coefficients, we will use the following fact.
Proposition 7.2. Suppose H0^1, H0^2, . . . , H0^K are each sets of functions from some domain X to {−1, 1} and H1 is a set of functions from {−1, 1}^K to {−1, 1}. Let
H1 ◦ H0 := {h1(h0^1, . . . , h0^K) : h1 ∈ H1, h0^1 ∈ H0^1, . . . , h0^K ∈ H0^K}.
Then
S_{H1◦H0}(n) ⩽ S_{H1}(n) ∏_{k=1}^K S_{H0^k}(n).
k=1
Now, let Hℓk for ℓ = 0, . . . , L and k = 1, . . . , |Vℓ | be the the function which maps
z → σ(Wℓ,k z + bℓ,k ) where Wℓ,k indicates the kth row of the matrix Wℓ and bℓ,k the kth
entry (this is just the output of the neuron vℓk ∈ Vℓ ). Using the above notation, we have
H = HL ◦ · · · ◦ H0 .
Applying Proposition 7.2 recursively with these Hℓk we bound the shattering coefficients
as
|Vℓ |
L Y
Y
SH (n) ⩽ SHk (n).
ℓ
ℓ=0 k=1
But notice that we can directly compute the VC dimension of the classes Hℓ^k, since each is a class of halfspace predictors whose dimension is determined by the network architecture. In particular, letting deg(vℓ^k) be the number of edges that go into vℓ^k, we have
VCdim(Hℓ^k) = deg(vℓ^k) + 1.
Using the Sauer-Shelah Lemma with the above product inequality we obtain that for n > deg(vℓ^k) + 1,
SH(n) ⩽ ∏_{ℓ=0}^L ∏_{k=1}^{|Vℓ|} (en/(deg(vℓ^k) + 1))^{deg(vℓ^k)+1} ⩽ ∏_{ℓ=0}^L ∏_{k=1}^{|Vℓ|} (en)^{deg(vℓ^k)+1} = (en)^{|E|+|V|}.
Observe that if VCdim(H) < |E| + |V| + 2, then we are done. Otherwise, we can apply the above bound with n = VCdim(H) (valid because in this case n = VCdim(H) ⩾ |E| + |V| + 2 > VCdim(Hℓ^k)) to observe that
2^{VCdim(H)} = SH(VCdim(H)) ⩽ (e VCdim(H))^{|E|+|V|}. (7.1)
We conclude with the following observation, whose proof is included at the end of these
notes.
Proposition 7.3. Suppose a > 1 and k is a positive integer. If x > 1 is such that
a^x ⩽ x^k,
then
x ⩽ 2 (k/log a) log(k/log a) + 5.
We can apply this fact to the inequality (7.1) with a = 2^{1/e}, x = e VCdim(H), and k = |E| + |V| to finally obtain the result.
We won’t prove it here, but similar results hold for networks with sigmoid activation
functions [6, Chapter 20]. For completeness, we include a proof of Proposition 7.3 here.
If k/log a < 2, then a^x ⩽ x^k implies x < 2 log x, which has no solution for x > 1, so we may assume k/log a ⩾ 2. Suppose for the sake of contradiction that
x > 2 (k/log a) log(k/log a) + 5.
We claim that
log(2 (k/log a) log(k/log a)) ⩽ 2 log(k/log a).
Indeed, if not, it reduces once more to the claim τ < 2 log τ with τ = k/log a ⩾ 2, which has no solution. So the previous inequality holds, and plugging it into (7.3) yields the desired contradiction.
7.3 Brief introduction to first-order methods
First-order methods are iterative algorithms for solving optimization problems of the form
min_{θ∈Rd} f(θ).
The function f is called the objective. Then gradient descent uses the iteration, for some
initial point θ0 ,
θt+1 = θt − ηt ∇f (θt ),
where ηt ∈ R is a learning rate that is specified by the user (typically a constant in the
case of gradient descent proper).
An objective f that arises frequently in statistics and machine learning is of the form
f(θ) = (1/n) Σ_{i=1}^n ℓ(hθ(xi), yi),
where ℓ is a loss function and (hθ)θ∈Θ is a parametrized model class. Indeed, this objective corresponds to an empirical loss. When n is large, computing the full gradient ∇f(θ) requires a pass over the entire dataset at every iteration, which can be expensive. A work-around for this issue, which has additional attractive properties that we won't discuss now, is called stochastic gradient descent (SGD).
Here, we take a uniformly random i ∈ [n], and use it to form a stochastic gradient
∇̂f(θ) := ∇θ ℓ(hθ(xi), yi).
Observe that, over the randomness of this uniform random sample, E[∇̂f(θ)] = ∇f(θ). It is thus plausible that we can use this stochastic gradient as a replacement for the full gradient ∇f(θ), and run a stochastic version of the gradient descent algorithm. This is called stochastic gradient descent, and uses the updates
θt+1 = θt − ηt ∇̂f(θt),
where, importantly, we use fresh randomness at each iteration to form the stochastic gradient ∇̂f(θt).
In the next class we will develop significantly more material about first-order methods.
7.4 Backpropagation
The fundamental thing we need to implement first-order methods is the gradient of the
objective. Backpropagation is an efficient method to calculate the gradients of the weights
in a neural-network, famously invented by Rumelhart, Hinton, and Williams in 1986 [5].
Backpropagation can be thought of as the opposite of the forward pass that is used to
compute the output of the neural network.
To describe backpropagation, let us again take a given network architecture (V, E) as
fixed, and let hw : R^{din} → R^{dout} denote the network with weights w. For simplicity,
we’ll work without any biases but the extension to them is quite simple. We’ll consider
the squared norm loss, where we aim to minimize the empirical risk
R(w) := (1/n) Σ_{i=1}^n (1/2)∥hw(xi) − yi∥².
Since we are working without biases, the network parameters are exactly the weight ma-
trices (W1 , . . . , WL ), and we thus need to compute gradients of the form
∇Wℓ (1/2)∥hw(x) − y∥²
for a fixed pair (x, y).
The back-prop algorithm works in the following way: first, you do a forward pass, keeping track of the aℓ and oℓ (we'll suppress mention of x for notational ease) as you go. Then, the backward pass begins. For this pass, you initialize δL := (oL − y)^T and then recursively put δℓ = δℓ+1 diag(σ′(aℓ+1)) Wℓ+1, where σ′(aℓ+1) is the vector which is formed by taking the derivative of the activation and applying it element-wise to the vector aℓ+1, and diag(·) is the diagonal matrix built from a vector. You finally return (δℓ^T ⊙ σ′(aℓ)) oℓ−1^T as the gradient for the Wℓ parameters, where we define u ⊙ v to be the element-wise product of vectors u and v. (Note that in class I stated an incorrect formula for the output of the back-propagation algorithm, this version is correct.) One magical thing about back-propagation is that it allows us to obtain gradients with respect to matrix-valued parameters without doing any matrix-matrix multiplies, and indeed only using matrix-vector products. This is an extremely important point in practice.
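Here is a short Python sketch of this backward pass for the bias-free case (not from the notes; the tanh activation and the finite-difference check at the end are illustrative choices only).

import numpy as np

def forward_no_bias(x, Ws, sigma=np.tanh):
    """Forward pass without biases: a_l = W_l o_{l-1}, o_l = sigma(a_l)."""
    o, a = [np.asarray(x, dtype=float)], [None]
    for W in Ws:
        a.append(W @ o[-1])
        o.append(sigma(a[-1]))
    return a, o

def backprop(x, y, Ws, sigma=np.tanh, dsigma=lambda t: 1.0 - np.tanh(t) ** 2):
    """Backward pass for the loss (1/2)||h_w(x) - y||^2, following the recursion above:
    delta_L = (o_L - y)^T, delta_l = delta_{l+1} diag(sigma'(a_{l+1})) W_{l+1},
    grad_{W_l} = (delta_l^T * sigma'(a_l)) o_{l-1}^T  (elementwise product *)."""
    L = len(Ws)
    a, o = forward_no_bias(x, Ws, sigma)
    delta = o[L] - np.asarray(y, dtype=float)             # delta_L, stored as a 1-d array
    grads = [None] * (L + 1)
    for l in range(L, 0, -1):
        grads[l] = np.outer(delta * dsigma(a[l]), o[l - 1])   # gradient for W_l
        if l > 1:
            delta = (delta * dsigma(a[l])) @ Ws[l - 1]        # delta_{l-1}; Ws[l-1] is W_l
    return grads[1:]   # gradients for W_1, ..., W_L

# quick finite-difference check of one entry of grad_{W_1}
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
x, y = np.array([0.5, -1.0]), np.array([0.3])
loss = lambda W_list: 0.5 * np.sum((forward_no_bias(x, W_list)[1][-1] - y) ** 2)
g = backprop(x, y, Ws)
eps = 1e-6
Ws_pert = [Ws[0].copy(), Ws[1].copy()]
Ws_pert[0][1, 0] += eps
print(g[0][1, 0], (loss(Ws_pert) - loss(Ws)) / eps)   # the two numbers should agree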
To prove that this is correct, the following matrix calculus fact will be quite helpful.
Proposition 7.4. Fix some v ∈ R^m and let T : R^{n×m} → R^n be the map defined on matrices as T(A) = Av. If g : R^n → R is a scalar valued function, then
∇(g ◦ T)[A] = ∇g[T(A)] v^T.
Proof. For a map h : R^k → R^l we use the notation dh_b to denote the Jacobian of h at some b ∈ R^k. The chain rule then implies
d(g ◦ T)_A = dg_{T(A)} d_A T.
Writing u := ∇g[T(A)], so that dg_{T(A)} = u^T, we compute
u^T d_A T = d_A(u^T T) = d_A(u^T A v) = u v^T.
The result follows.
8 Basics of convex optimization
For our last lecture, we will consider gradient descent (GD) and stochastic gradient descent
(SGD) as paradigmatic examples of optimization algorithms in machine learning. These
algorithms fall within the field of convex optimization, and we will only scratch the surface
of this fascinating area. I recommend those who are interested in a rapid yet fairly thorough
introduction to the theory read [4]; the references [2, 3] are also worth a look.
The intuitive idea behind gradient descent and related methods is to solve a problem
of the form
min_{θ∈Rd} f(θ),
by using the Taylor approximation (let's assume f is twice continuously differentiable for this discussion)
f(θ) = f(θ0) + ⟨∇f(θ0), θ − θ0⟩ + O(∥θ − θ0∥²),
to argue that if θ is close to θ0 and θ − θ0 is in the direction of −∇f(θ0), then f(θ) < f(θ0).
Iterating the update θt+1 := θt −ηt ∇f (θt ) for appropriately chosen step-sizes, our objective
value f will decrease. We can then hope that we eventually reach a global optimizer. The
purpose of this lecture will be to describe some mathematical conditions under which this
method works.
Recall that a set C ⊂ Rd is convex if for all x, y ∈ C and t ∈ [0, 1], (1 − t)x + ty ∈ C.
Throughout this section let us fix a convex set C ⊂ Rd and a convex function f : C → R.
The following fact is important, but its proof is out of scope of this lecture, see [4, Chapter
1] if you are curious.
Proposition 8.3 (Convex functions have sub-gradients). For any x ∈ int(C) there exists some g ∈ Rd such that for all y ∈ C,
f(y) ⩾ f(x) + ⟨g, y − x⟩.
As we show next, when the function f is differentiable, the sub-gradient coincides with
the usual gradient. While the sub-gradient may thus seem a bit abstract, it is actually
important in applications. A fundamental example is in ℓ1 regularized linear regression
(LASSO). There, the relevant convex function is essentially the absolute value f (x) = |x|,
and understanding its sub-gradient at 0 (where it is not differentiable) is important both
for statistics and computation. It is instructive to verify for yourself that for f (x) = |x|,
the sub-gradient at 0 is ∂f (0) = [−1, 1].
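For the reader who wants to see it spelled out, here is the short verification, using only the definition of the sub-gradient given above:
g ∈ ∂f(0) ⟺ |y| ⩾ |0| + g(y − 0) = gy for all y ∈ R.
Taking y = 1 gives g ⩽ 1 and taking y = −1 gives g ⩾ −1; conversely, if g ∈ [−1, 1] then gy ⩽ |g||y| ⩽ |y| for all y. Hence ∂f(0) = [−1, 1].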
Proposition 8.5 (Gradients and sub-gradients). Suppose f is differentiable at x ∈ int(C).
Then ∂f (x) = {∇f (x)}.
Proof. Let U ⊂ C be an open set containing x. Then for all y ∈ U and t ∈ [0, 1], convexity gives f((1 − t)x + ty) ⩽ (1 − t)f(x) + tf(y), which implies
( f((1 − t)x + ty) − f(x) ) / t ⩽ f(y) − f(x).
Taking the limit as t → 0+, we obtain ⟨∇f(x), y − x⟩ ⩽ f(y) − f(x), so that ∇f(x) ∈ ∂f(x).
which implies f(x0) ⩽ f(y). But since y ∈ C was arbitrary, it follows that x0 is a global minimizer of f on C.
For the second claim, note that 0 ∈ ∂f(x0) if and only if for all y ∈ C,
0 ⩽ f(y) − f(x0),
8.2 Convex optimization isn't easy
It is commonly thought that "convex optimization problems are easy." But this isn't necessarily true. Indeed, we can transform any optimization problem into a convex optimization problem with a linear objective. To see this, suppose we have some function g : Rd → R (not necessarily convex), that we want to solve
min_{x∈Rd} g(x),
and that this problem has a solution. Then define the epigraph of g by
epi(g) := {(x, t) ∈ Rd × R : t ⩾ g(x)},
and let
S := conv(epi(g)),
the convex hull of the epigraph. The problem min_{y∈S} yd+1 is a convex problem with a linear objective, and since (x, g(x)) ∈ epi(g) ⊂ S for every x, its value is at most min_{x∈Rd} g(x). On the other hand, since every y ∈ S is a convex combination of points in epi(g), it must hold that yd+1 ⩾ min_{x∈Rd} g(x). We finally conclude
min_{y∈S} yd+1 = min_{x∈Rd} g(x).
GD for Lipschitz and convex functions. We'll consider the gradient descent (GD) algorithm: given an initial point x1 ∈ Rd and a sequence of step-sizes {ηt}_{t=1}^T, it uses the iterative update, for t = 1, . . . , T,
xt+1 = xt − ηt gt,
where gt ∈ ∂f(xt) is any sub-gradient of f at xt (when f is differentiable, gt = ∇f(xt)).
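As a concrete illustration (not taken from the lecture), here is a minimal NumPy sketch of sub-gradient descent with the constant step-size η = R/(L√T) used in the theorem below, applied to the L-Lipschitz convex function f(x) = ∥x∥1, whose sub-gradient we take to be sign(x); the averaged iterate is returned alongside the list of iterates.

import numpy as np

def subgradient_descent(subgrad, x1, eta, T):
    """Run x_{t+1} = x_t - eta * g_t for T steps; return the iterates x_1..x_T and their average."""
    xs = [np.asarray(x1, dtype=float)]
    for _ in range(T):
        xs.append(xs[-1] - eta * subgrad(xs[-1]))
    xs = np.array(xs[:-1])          # x_1, ..., x_T
    return xs, xs.mean(axis=0)      # iterates and the averaged iterate (1/T) sum_t x_t

# Example: f(x) = ||x||_1 is sqrt(d)-Lipschitz (w.r.t. the Euclidean norm) and convex,
# with sub-gradient sign(x); the minimizer is x_* = 0.
d, T = 10, 10_000
x1 = np.full(d, 5.0)
R = np.linalg.norm(x1)              # ||x_1 - x_*||
L = np.sqrt(d)                      # Lipschitz constant of ||.||_1
eta = R / (L * np.sqrt(T))          # step size from the theorem
xs, x_avg = subgradient_descent(np.sign, x1, eta, T)
print(np.abs(x_avg).sum(), L * R / np.sqrt(T))   # f(averaged iterate) vs the bound LR/sqrt(T)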
Theorem 8.7 (GD for Lipschitz and convex functions). Suppose f : Rd → R is both L-Lipschitz and convex, and there exists x⋆ ∈ arg min_{x∈Rd} f(x). Assume that ∥x1 − x⋆∥ ⩽ R, and the step-sizes are chosen to be ηt = η := R/(L√T). Then
f( (1/T) Σ_{t=1}^T xt ) − f(x⋆) ⩽ LR/√T,
and
min_{t∈[T]} f(xt) − f(x⋆) ⩽ LR/√T.
Proof. Observe that by convexity
f( (1/T) Σ_{t=1}^T xt ) − f(x⋆) ⩽ (1/T) Σ_{t=1}^T ( f(xt) − f(x⋆) ).
It is also clear that
min_{t∈[T]} f(xt) − f(x⋆) ⩽ (1/T) Σ_{t=1}^T ( f(xt) − f(x⋆) ).
So it suffices to show the bound
(1/T) Σ_{t=1}^T ( f(xt) − f(x⋆) ) ⩽ LR/√T.
To this end, we use the definition of sub-gradients to observe that
f(x⋆) − f(xt) ⩾ ⟨gt, x⋆ − xt⟩.
Hence
f(xt) − f(x⋆) ⩽ ⟨gt, xt − x⋆⟩ = (1/η) ⟨xt − xt+1, xt − x⋆⟩.
We then observe that
2⟨xt − xt+1, xt − x⋆⟩ = ∥xt − xt+1∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥²,
so that, putting together the pieces, we find
(1/T) Σ_{t=1}^T ( f(xt) − f(x⋆) ) ⩽ (1/(ηT)) Σ_{t=1}^T ⟨xt − xt+1, xt − x⋆⟩
= (1/(2ηT)) Σ_{t=1}^T ( ∥xt − xt+1∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥² )
= (1/(2ηT)) ( ∥x1 − x⋆∥² − ∥xT+1 − x⋆∥² ) + (1/(2ηT)) Σ_{t=1}^T ∥xt − xt+1∥²
⩽ R²/(2ηT) + (1/(2ηT)) Σ_{t=1}^T ∥xt − xt+1∥²
= R²/(2ηT) + (η/(2T)) Σ_{t=1}^T ∥gt∥².
We then observe that, since f is L-Lipschitz, every sub-gradient satisfies ∥gt∥ ⩽ L. We finally obtain
(1/T) Σ_{t=1}^T ( f(xt) − f(x⋆) ) ⩽ R²/(2ηT) + ηL²/2.
Plugging in η = R/(L√T), we obtain the result.
SGD for convex functions with Lipschitz gradients. Next, let us consider the stochastic gradient descent (SGD) algorithm. We assume that for all x ∈ Rd we have access to stochastic gradients g̃ ∈ Rd such that E[g̃] ∈ ∂f(x). Given an initial point x1 ∈ Rd and step-sizes {ηt}_{t=1}^T we use the update, for t = 1, . . . , T,
xt+1 := xt − ηt g̃t,
where g̃t is a stochastic gradient at xt. The canonical example is the finite-sum setting f(x) = (1/n) Σ_{i=1}^n fi(x), where we may take
g̃ := ∇fi(x)
for a uniform random i ∈ [n]. As we have discussed previously, this is a standard approach to solving ERM problems in practice, in which case each fi represents the loss of the model associated to a different data point.
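To make the finite-sum case concrete, here is a short NumPy sketch (an illustration, not code from the course) of SGD for least-squares ERM, where fi(x) = (1/2)(⟨ai, x⟩ − bi)² and g̃ = ∇fi(x) for a uniformly random index i; it also tracks the averaged iterate, which is what Theorem 8.8 below controls.

import numpy as np

def sgd(grad_i, n, x1, eta, T, rng):
    """SGD with constant step size: x_{t+1} = x_t - eta * grad_i(x_t, i), i uniform in [n]."""
    x, x_sum = np.asarray(x1, dtype=float), np.zeros_like(x1, dtype=float)
    for _ in range(T):
        x_sum += x
        i = rng.integers(n)
        x = x - eta * grad_i(x, i)
    return x, x_sum / T             # last iterate and averaged iterate

# Least-squares ERM: f(x) = (1/n) sum_i (1/2) (<a_i, x> - b_i)^2.
rng = np.random.default_rng(1)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]     # gradient of f_i at x

x_last, x_avg = sgd(grad_i, n, np.zeros(d), eta=0.01, T=50_000, rng=rng)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
print(f(x_avg), f(x_last))          # both should be close to the optimal value 0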
Theorem 8.8 (SGD for convex and Lipschitz functions). Suppose f : Rd → R is a convex function and that the stochastic gradients are bounded as E[∥g̃t∥²] ⩽ L² for all t. Set ηt = η := R/(L√T), where ∥x1 − x⋆∥ ⩽ R. Then
E[ f( (1/T) Σ_{t=1}^T xt ) ] − f(x⋆) ⩽ LR/√T.
Proof. As in the proof of Theorem 8.7, convexity gives
f( (1/T) Σ_{t=1}^T xt ) ⩽ (1/T) Σ_{t=1}^T f(xt).
We obtain
f(xt) − f(x⋆) ⩽ ⟨E[g̃t | xt], xt − x⋆⟩ = (1/η) E[⟨xt − xt+1, xt − x⋆⟩ | xt]
= (1/(2η)) E[ ∥xt − xt+1∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt ]
= (1/(2η)) E[ η²∥g̃t∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt ]
⩽ ηL²/2 + (1/(2η)) E[ ∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt ].
Note that these guarantees control the averaged iterate (1/T) Σ_{t=1}^T xt (as well as, in the case of GD, the best iterate), but practitioners invariably use the last iterate xT.
By making stronger assumptions on the function f (strong convexity and smoothness vs.
convexity and Lipschitzness), we can provide results which are more similar to practice,
although with the caveat that they are only valid under these stronger assumptions. In
general, we do not yet have a strong theoretical understanding of the success of (S)GD in
practice [1], and developing this understanding is a fascinating and extremely active area
of research. Unfortunately, this topic would require another semester to cover!
A Background on concentration inequalities
In this appendix we give some background on concentration inequalities, specifically Ho-
effding’s, Bernstein’s, and the bounded differences inequality. For a thorough treatment of
these topics we recommend the reader consult [8, Chapter 2]. The core underlying bound is Chernoff's inequality, which we include here.
Proposition A.1 (Chernoff's inequality). Suppose ζ is a real random variable with finite mean. Then for all λ, t > 0,
P[ζ − E[ζ] > t] = P[λ(ζ − E[ζ]) > λt] = P[exp(λ(ζ − E[ζ])) > e^{λt}] ⩽ E[exp(λ(ζ − E[ζ]))]/e^{λt} = e^{−λt} E[exp(λ(ζ − E[ζ]))],
where the inequality is Markov's.
All three of the concentration inequalities that we develop will rely on Chernoff’s in-
equality; their main difference will be in how they control the moment generating function
E[exp(λ(ζ − E[ζ]))].
Example A.3 (Gaussians are sub-Gaussian). Suppose ζ ∼ N(0, σ²). Then, for all λ ∈ R,
E[exp(λ(ζ − E[ζ]))] = (1/(√(2π)σ)) ∫ e^{λζ − ζ²/(2σ²)} dζ
= (1/(√(2π)σ)) ∫ e^{−(ζ − σ²λ)²/(2σ²) + σ²λ²/2} dζ
= e^{σ²λ²/2}.
The result follows for Gaussians with arbitrary mean by observing that ζ − E[ζ] is dis-
tributed according to a mean 0 Gaussian.
Example A.4 (Bounded random variables are sub-Gaussian). Suppose ζ is a random variable which takes values in [a, b] for some a < b. Let ψ(λ) := log E[e^{λζ}] denote its log moment generating function. Note that
ψ′(λ) = E[ζ e^{λζ}] / E[e^{λζ}],
and
ψ′′(λ) = E[ζ² e^{λζ}] / E[e^{λζ}] − ( E[ζ e^{λζ}] / E[e^{λζ}] )².
In particular, ψ(0) = 0 and ψ′(0) = E[ζ]. Let us now estimate the second derivative. For this, it is convenient to set pλ(ζ) := e^{λζ}/E[e^{λζ}], and write Eλ[h(ζ)] := E[pλ(ζ)h(ζ)]. We then notice that this is, in fact, a probability distribution, and thus ψ′′(λ) is the variance of ζ when it follows this modified distribution:
ψ′′(λ) = Eλ[ζ²] − (Eλ[ζ])².
Recall that a random variable in [a, b] can have variance at most (b − a)²/4, so that
ψ′′(λ) ⩽ (b − a)²/4.
Putting this together, we obtain
ψ(λ) = ψ(0) + λψ′(0) + ∫_0^λ ψ′′(γ)(λ − γ) dγ
= λE[ζ] + ∫_0^λ ψ′′(γ)(λ − γ) dγ
⩽ λE[ζ] + ((b − a)²/4) ∫_0^λ (λ − γ) dγ
= λE[ζ] + ((b − a)²/4) ( λ² − λ²/2 )
= λE[ζ] + (b − a)²λ²/8.
Recalling the definition of ψ, we conclude that ζ is (b − a)²/4-sub-Gaussian.
The main consequence of sub-Gaussianity that we’ll use is that the tails decay quickly.
Proposition A.5 (Tail decay of sub-Gaussian random variables). Suppose ζ is σ²-sub-Gaussian. Then for all t ⩾ 0,
P[ζ > E[ζ] + t] ⩽ e^{−t²/(2σ²)}.
Proof. For any λ > 0, Chernoff's inequality and the definition of sub-Gaussianity give
P[ζ > E[ζ] + t] = P[e^{λ(ζ−E[ζ])} > e^{λt}] ⩽ E[e^{λ(ζ−E[ζ])}] / e^{λt} ⩽ e^{λ²σ²/2 − λt}.
Taking λ = t/σ² (suggested by minimizing the quadratic) we obtain
P[ζ > E[ζ] + t] ⩽ e^{−t²/(2σ²)}.
Remark A.6 (Additional tail bounds). Notice that by symmetry of the definition of sub-Gaussianity, ζ being σ²-sub-Gaussian implies −ζ is σ²-sub-Gaussian, so that the above implies in particular that
P[ζ < E[ζ] − t] ⩽ e^{−t²/(2σ²)}.
Putting the two together, we have the following useful two-sided tail control:
P[|ζ − E[ζ]| > t] ⩽ 2e^{−t²/(2σ²)}.
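As a quick sanity check (an illustration, not part of the notes), the following NumPy snippet compares the empirical two-sided tail of a sum of bounded random variables with the bound 2e^{−t²/(2σ²)} above, using Example A.4 for each summand together with the standard fact (assumed here) that sub-Gaussian parameters add over independent sums.

import numpy as np

# zeta is a sum of n independent Uniform[0,1] variables; each term is 1/4-sub-Gaussian,
# so (assuming additivity over independent sums) zeta is sigma^2 = n/4 sub-Gaussian.
rng = np.random.default_rng(0)
n, reps = 30, 200_000
zeta = rng.uniform(0.0, 1.0, size=(reps, n)).sum(axis=1)
sigma2 = n / 4.0
mean = n / 2.0

for t in [1.0, 2.0, 4.0, 6.0]:
    empirical = np.mean(np.abs(zeta - mean) > t)
    bound = 2.0 * np.exp(-t ** 2 / (2.0 * sigma2))
    print(f"t={t}: empirical tail {empirical:.5f} <= bound {bound:.5f}")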
A.2 Bernstein’s inequality
We can develop a similar theory for sub-exponential random variables, namely those that
have tail decay at an exponential rate. However, this would take us even further afield, so
here we will just focus on the case of bounded random variables and prove our concentra-
tion inequality directly.
Theorem (Bernstein's inequality). Suppose ζ1, . . . , ζn are independent random variables taking values in [a, b]. Then for all t > 0,
P[ (1/n) Σ_{i=1}^n (ζi − E[ζi]) > t ] ⩽ exp( −nt² / ( 2( (1/n) Σ_{i=1}^n Var(ζi) + t(b − a)/3 ) ) ).
Proof. Expanding the power series of the exponential we obtain that, for all λ > 0 and i ∈ [n],
E[e^{λ(ζi − E[ζi])}] = 1 + Σ_{k=2}^∞ (λ^k / k!) E[(ζi − E[ζi])^k]
⩽ 1 + Var(ζi) Σ_{k=2}^∞ λ^k (b − a)^{k−2} / k!
= 1 + Var(ζi) (1/(b − a)²) ( e^{λ(b−a)} − λ(b − a) − 1 )
⩽ exp( Var(ζi) (1/(b − a)²) ( e^{λ(b−a)} − λ(b − a) − 1 ) ),
where the second inequality uses |ζi − E[ζi]| ⩽ b − a, and the last inequality follows because 1 + u ⩽ e^u for all u ∈ R. For convenience of notation, let
g(λ) := (1/(b − a)²) ( e^{λ(b−a)} − λ(b − a) − 1 ).
Using Chernoff's inequality, Prop. A.1, we obtain, for all λ > 0,
P[ (1/n) Σ_{i=1}^n (ζi − E[ζi]) > t ] ⩽ e^{−λt} E[ exp( (λ/n) Σ_{i=1}^n (ζi − E[ζi]) ) ]
= e^{−λt} Π_{i=1}^n E[ exp( (λ/n)(ζi − E[ζi]) ) ]
⩽ exp( g(λ/n) Σ_{i=1}^n Var(ζi) − λt ).
Hence
P[ (1/n) Σ_{i=1}^n (ζi − E[ζi]) > t ] ⩽ inf_{λ>0} exp( g(λ/n) Σ_{i=1}^n Var(ζi) − λt ). (A.1)
To conclude, we use the inequality e^z ⩽ 1 + z + (z²/2)/(1 − z/3) for |z| < 3 (which can be verified numerically). Hence for λ < 3n/(b − a), we have
g(λ/n) ⩽ ( λ²/(2n²) ) / ( 1 − λ(b − a)/(3n) ).
For ease of notation, let σ² := Σ_{i=1}^n Var(ζi). Then take
λ = nt / ( (1/n)σ² + t(b − a)/3 ).
Note that, so long as σ² > 0, we have λ < 3n/(b − a). If σ² = 0 the variables are constant, and the inequality is trivial. Thus assume σ² > 0. We may thus apply the above bound on g with this choice of λ to find
σ² g(λ/n) − λt ⩽ ( σ²λ²/(2n²) ) / ( 1 − λ(b − a)/(3n) ) − λt
= ( σ²t² / ( 2((1/n)σ² + t(b − a)/3)² ) ) · ( ((1/n)σ² + t(b − a)/3) / (σ²/n) ) − nt² / ( (1/n)σ² + t(b − a)/3 )
= − nt² / ( 2( (1/n)σ² + t(b − a)/3 ) ).
Combining this with (A.1) yields the result.
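For intuition (this comparison is not in the notes), the following snippet evaluates the Bernstein bound just proved against the sub-Gaussian bound exp(−2nt²/(b − a)²) that one gets from Example A.4 together with the standard fact, assumed here, that sub-Gaussian parameters add over independent sums; Bernstein wins when the variance is much smaller than (b − a)²/4.

import numpy as np

def bernstein_bound(n, t, var_bar, b_minus_a):
    # P[(1/n) sum (zeta_i - E zeta_i) > t] <= exp(-n t^2 / (2 (var_bar + t (b-a)/3)))
    return np.exp(-n * t**2 / (2 * (var_bar + t * b_minus_a / 3)))

def subgaussian_bound(n, t, b_minus_a):
    # Tail bound obtained from the (b-a)^2/4 sub-Gaussian estimate for each summand.
    return np.exp(-2 * n * t**2 / b_minus_a**2)

# Bernoulli(p) variables in [0, 1]: the variance p(1-p) is tiny for small p.
n, t = 1000, 0.02
for p in [0.5, 0.1, 0.01]:
    var_bar = p * (1 - p)          # average variance (1/n) sum_i Var(zeta_i)
    print(p, bernstein_bound(n, t, var_bar, 1.0), subgaussian_bound(n, t, 1.0))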
Proof. We apply Chernoff's inequality, Prop. A.1, to yield, for all λ > 0,
P[ (1/n) Σ_{i=1}^n Δi > t ] ⩽ e^{−λt} E[ exp( (λ/n) Σ_{i=1}^n Δi ) ].
Observe that, by the definition of the martingale difference sequence, Δ1, . . . , Δn−1 are each Fn−1-measurable. Hence
E[ exp( (λ/n) Σ_{i=1}^n Δi ) ] = E[ E[ exp( (λ/n) Σ_{i=1}^n Δi ) | Fn−1 ] ]
= E[ exp( (λ/n) Σ_{i=1}^{n−1} Δi ) E[ e^{(λ/n)Δn} | Fn−1 ] ].
But since Δn ∈ [An, Bn] for An, Bn Fn−1-measurable, we can use our previous bound from Example A.4 on the moment generating function of bounded random variables to yield
E[ e^{(λ/n)Δn} | Fn−1 ] ⩽ exp( (λ/n) E[Δn | Fn−1] + (Bn − An)²λ²/(8n²) )
= exp( (Bn − An)²λ²/(8n²) ) ⩽ exp( Cn²λ²/(8n²) ),
where the equality uses E[Δn | Fn−1] = 0 for a martingale difference sequence, and the last inequality uses Bn − An ⩽ Cn.
Hence
E[ exp( (λ/n) Σ_{i=1}^n Δi ) ] ⩽ e^{Cn²λ²/(8n²)} E[ exp( (λ/n) Σ_{i=1}^{n−1} Δi ) ].
Theorem (Bounded differences inequality). Suppose X1, . . . , Xn are independent random variables and that g satisfies the bounded differences condition with constants C1, . . . , Cn, meaning that changing the value of the i-th coordinate changes the value of g by at most Ci. Then for all t > 0,
P[ g(X1, . . . , Xn) − E[g(X1, . . . , Xn)] > t ] ⩽ exp( −2t² / Σ_{i=1}^n Ci² ).
Proof. For i ∈ [n], set
Δi := E[g(X1, . . . , Xn) | Fi] − E[g(X1, . . . , Xn) | Fi−1].
Notice that the Δi form a martingale difference sequence with respect to the filtration Fi := σ(X1, . . . , Xi). Write the law of each variable Xi as µi. Then for all i ∈ [n] we have
Δi = ∫ g(X1, . . . , Xi, xi+1, . . . , xn) dµi+1 · · · dµn − ∫ g(X1, . . . , Xi−1, xi, . . . , xn) dµi · · · dµn
= ∫ ( g(X1, . . . , Xi−1, Xi, xi+1, . . . , xn) − g(X1, . . . , Xi−1, xi, xi+1, . . . , xn) ) dµi · · · dµn
⩽ Ci.
B Chaining
In this section we will introduce a very powerful technique for controlling Rademacher
complexities and related stochastic processes. Given a stochastic process (Zt )t∈T we want
to understand the size of the expected supremum
E[ sup_{t∈T} Zt ]. (B.1)
But the application to Rademacher complexities will be more powerful (this is, indeed, one
of the reasons why we went through all the effort of symmetrization: while it is possible
to apply the techniques we develop in this section directly to empirical processes, it yields
much worse bounds).
Observe that, as soon as the index set T is infinite, the expected supremum (B.1) can
be infinite. Indeed, if T is infinite and the Zt are iid and unbounded,
then (B.1) is infinite. On the other hand, if every Zt is exactly equal to every other Zt′
(total dependence) then (B.1) reduces to E[Zt ], which will typically be 0. Our bounds
in this and the next section will thus interpolate between these very simple cases, using
dependence between the Zt s to establish bounds on (B.1). It turns out there is an elegant
and beautiful way to formalize this intuition; to introduce it we’ll first need to recall the
definition of a sub-Gaussian random variable.
Definition B.1 (Sub-Gaussian random variable). We say that a real-valued random variable ζ is σ²-sub-Gaussian if for all λ ∈ R,
E[e^{λ(ζ−E[ζ])}] ⩽ e^{λ²σ²/2}.
Using this definition, we can define a broad class of stochastic processes that we'll be able to control using the techniques we introduce in this and the next section: we say that a process (Zt)t∈T is sub-Gaussian with respect to a metric d on T if, for all t, t′ ∈ T, the increment Zt − Zt′ is d(t, t′)²-sub-Gaussian.
The intuition for this definition is that the dependency structure of the stochastic pro-
cess (Zt )t∈T is encoded by the metric space structure (T, d): when t and t′ are close, the
difference Zt − Zt′ is highly concentrated around 0, meaning the variables are highly corre-
lated. If t and t′ are far, on the other hand, then Zt and Zt′ can be essentially independent.
Intuitively, this allows us to reduce questions about the size of (B.1) to the size of the
metric space T .
Our main tool to verify that stochastic processes are indeed sub-Gaussian will be
Hoeffding’s bound.
Example B.4 (Empirical process). For a set of real-valued functions G and a probability distribution Q, we can consider the stochastic process (Zg)g∈G where
Zg := ∫ g(ζ) dQ(ζ) − (1/n) Σ_{i=1}^n g(ζi),
for ζ1, . . . , ζn ∼ Q iid. Then Hoeffding's bound implies that the stochastic process is sub-Gaussian with respect to the metric
d(g, g′) := (1/√n) ∥g − g′∥∞ := (1/√n) sup_ζ |g(ζ) − g′(ζ)|.
Similarly, for the Rademacher process (Zs)s∈S with Zs := (1/n)⟨ε, s⟩ over a set S ⊂ Rn, Hoeffding's bound shows that the process is sub-Gaussian with respect to the metric
d(s, s′) := (1/n) ∥s − s′∥.
B.1 Covering and packing numbers
We will measure the size of the metric space (T, d) with covering numbers, which we define
now. We will write the closed ball in (T, d) at t0 of radius ρ > 0 as B(t0, ρ) := {t ∈ T : d(t, t0) ⩽ ρ}.
Definition B.6 (Covering). For a metric space (T, d) and ρ > 0, a covering at scale ρ of T is any collection of points U such that
T ⊂ ∪_{u∈U} B(u, ρ).
Definition B.7 (Covering number). For a metric space (T, d) and ρ > 0, the covering number of (T, d) at scale ρ is defined to be the size of the smallest covering at scale ρ. In other words, it is defined as
N(T, d, ρ) := min{ K : ∃ t1, . . . , tK s.t. T ⊂ ∪_{k=1}^K B(tk, ρ) }.
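Covering numbers are easy to estimate numerically for finite point clouds; the following NumPy sketch (an illustration, not from the notes) computes an upper bound on N(T, d, ρ) via the standard greedy construction, which repeatedly picks an uncovered point as a new center. The greedy centers are pairwise more than ρ apart, so they also form a packing at scale ρ, which is exactly the mechanism behind Proposition B.10 below.

import numpy as np

def greedy_cover(points, rho):
    """Greedily select centers so that every point is within distance rho of some center.

    Returns the list of center indices; its length upper-bounds the covering number of the
    point cloud at scale rho, and the centers are pairwise more than rho apart (a packing).
    """
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        idx = int(np.flatnonzero(uncovered)[0])        # any uncovered point becomes a center
        centers.append(idx)
        dists = np.linalg.norm(points - points[idx], axis=1)
        uncovered &= dists > rho                        # points within rho of the new center are covered
    return centers

# Example: 2000 random points in the unit square, covered at a few scales.
rng = np.random.default_rng(0)
T = rng.uniform(0, 1, size=(2000, 2))
for rho in [0.5, 0.25, 0.1]:
    print(rho, len(greedy_cover(T, rho)))   # grows roughly like (1/rho)^2 as rho decreases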
Definition B.8 (Packing). For a metric space (T, d) and ρ > 0, a packing at scale ρ is defined to be a collection of points U ⊂ T such that d(u, u′) > ρ for all u, u′ ∈ U with u ≠ u′.
Definition B.9 (Packing number). For a metric space (T, d) and ρ > 0, the packing number of (T, d) at scale ρ is defined to be the size of the largest packing at scale ρ. In other words, it is defined as
M(T, d, ρ) := max{ K : ∃ t1, . . . , tK ∈ T s.t. d(tj, tk) > ρ, ∀ j ≠ k }.
Proposition B.10 (Comparing covering and packing numbers). For any metric space (T, d) and ρ > 0 we have
M(T, d, 2ρ) ⩽ N(T, d, ρ) ⩽ M(T, d, ρ).
Proof. For the upper bound, take t1, . . . , tK such that d(tj, tk) > ρ for all j ≠ k and K = M(T, d, ρ). We claim that these t1, . . . , tK form a covering at scale ρ of T. Indeed, if there were some t ∈ T such that
t ∉ ∪_{k=1}^K B(tk, ρ),
then d(t, tk) > ρ for all k ∈ [K]. But then we could form a larger packing at scale ρ with the points {t1, . . . , tK, t}, contradicting the maximality of the packing t1, . . . , tK. Hence they form a covering at scale ρ, and since N(T, d, ρ) is defined to be the size of the smallest covering at scale ρ, we must have N(T, d, ρ) ⩽ K = M(T, d, ρ).
For the lower bound, let t′1, . . . , t′K′ be a packing at scale 2ρ with K′ = M(T, d, 2ρ), and let u1, . . . , uK be a covering at scale ρ with K = N(T, d, ρ). If K′ > K, then by the pigeonhole principle two distinct packing points t′j, t′k would lie in the same ball B(um, ρ), so that d(t′j, t′k) ⩽ d(t′j, um) + d(um, t′k) ⩽ 2ρ, which is a contradiction to our assumption that t′1, . . . , t′K′ is a packing of T at scale 2ρ. Hence M(T, d, 2ρ) ⩽ N(T, d, ρ).
B.2 Single-scale chaining
The first ingredient is the following maximal inequality for finitely many sub-Gaussian random variables.
Proposition B.11 (Maximal inequality). Let ζ1, . . . , ζK be centered random variables, each of which is σ²-sub-Gaussian. Then
E[ max_{k∈[K]} ζk ] ⩽ σ √(2 log K).
Notice that this result is a generalization of our previous bound on the size of a Rademacher process over a finite set, which was of crucial importance in the VC theory analysis.
Proof. By Jensen's inequality, for any λ > 0,
exp( λ E[ max_{k∈[K]} ζk ] ) ⩽ E[ max_{k∈[K]} exp(λζk) ].
So we obtain
E[ max_{k∈[K]} ζk ] ⩽ (1/λ) log E[ max_{k∈[K]} exp(λζk) ].
We then bound the maximum by the sum and use the definition of sub-Gaussianity to obtain
E[ max_{k∈[K]} ζk ] ⩽ (1/λ) log( Σ_{k=1}^K E[exp(λζk)] )
⩽ (1/λ) log( K exp(λ²σ²/2) )
= (log K)/λ + λσ²/2.
Since this is true for all λ > 0, we are free to choose a positive λ that minimizes this upper bound. Selecting λ = √(2 log K)/σ (suggested via calculus) we obtain the result.
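A quick numerical illustration (not from the notes): for K independent standard Gaussians, which are 1-sub-Gaussian by Example A.3, the expected maximum indeed stays below √(2 log K).

import numpy as np

# Monte Carlo estimate of E[max of K standard Gaussians] versus the bound sqrt(2 log K).
rng = np.random.default_rng(0)
reps = 100_000
for K in [10, 100, 1000]:
    samples = rng.standard_normal((reps, K))
    est = samples.max(axis=1).mean()
    print(K, round(est, 3), round(np.sqrt(2 * np.log(K)), 3))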
We’ll now prove the following result which will serve as a preliminary form of chaining.
Proposition B.12 (Single-scale chaining). Let (Zt)t∈T be a sub-Gaussian process with respect to a metric space (T, d). Then for any ρ > 0,
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + √( 2 diam(T)² log N(T, d, ρ) ),
where diam(T) := sup_{t,t′∈T} d(t, t′).
Proof. Let {t1, . . . , tK} be a covering of T at scale ρ with K = N(T, d, ρ), and for each t ∈ T let π(t) be a covering point with d(t, π(t)) ⩽ ρ. Observe that
sup_{t∈T} Zt ⩽ sup_{t,t′∈T : d(t,t′)⩽ρ} ( Zt − Zt′ ) + max_{k∈[K]} Ztk.
We thus have a maximum of K random variables, each diam(T)²-sub-Gaussian, and so we apply Proposition B.11 to obtain
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + √( 2 diam(T)² log N(T, d, ρ) ).
B.3 Dudley’s entropy integral
The next result refines this single-scale chaining bound.
Theorem B.13 (Dudley's entropy integral bound). Let (Zt)t∈T be a stochastic process, sub-Gaussian with respect to a metric space (T, d). Then for all ρ > 0,
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + 12 ∫_{ρ/4}^∞ √( log N(T, d, τ) ) dτ.
Let us make a few remarks on this result. First of all, in almost all applications it will be the case that
E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] → 0 as ρ → 0+. (B.2)
Sometimes the entropy integral bound is stated in this form (under mild additional assump-
tions which ensure that (B.2) holds), but we prefer our form since it is more convenient
in applications.
Secondly, notice that as soon as τ ⩾ diam(T), log N(T, d, τ) = 0, so that the bound can be equivalently stated
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + 12 ∫_{ρ/4}^{diam(T)} √( log N(T, d, τ) ) dτ.
Third, the function log(N (T, d, ·)) is often called the metric entropy, since it is related
to the number of bits it takes to represent points in T up to error τ . Thus the name
“entropy integral.”
Proof. Let us begin by assuming that diam(T ) < ∞, and then once we have proven the
result in this case we will explain how to handle the general case. Define J0 to be the largest integer such that 2^{J0} ⩽ ρ and J1 to be the smallest integer such that 2^{J1} ⩾ diam(T). Then for each integer j such that J0 ⩽ j ⩽ J1, let Uj = {t_1^j, . . . , t_{Kj}^j} be a cover of T at scale 2^j, and define a map πj : T → Uj which takes t ∈ T to some t_k^j ∈ Uj such that d(t, t_k^j) ⩽ 2^j. Then for any t ∈ T, we can write
Zt = Zt − Zπ_{J0}(t) + Σ_{j=J0}^{J1−1} ( Zπ_j(t) − Zπ_{j+1}(t) ) + Zπ_{J1}(t). (B.3)
This decomposition is the most important step in the chaining argument, and the reason
why it bears the name “chaining”: each term in the sum is a link in a “chain” connecting
the lowest-scale term ZπJ0 (t) to the highest-scale term ZπJ1 (t).
Once (B.3) is established, we can apply the same idea as in the single-scale proof:
bound each link in the chain with our estimate on maxima of finitely many sub-Gaussian
random variables. To set this step up, we use the following estimate
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t∈T} Zt − Zπ_{J0}(t) ] + Σ_{j=J0}^{J1−1} E[ sup_{t∈T} Zπ_j(t) − Zπ_{j+1}(t) ] + E[ sup_{t∈T} Zπ_{J1}(t) ]
⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + Σ_{j=J0}^{J1−1} E[ sup_{t∈T} Zπ_j(t) − Zπ_{j+1}(t) ] + E[ sup_{t∈T} Zπ_{J1}(t) ]
= E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + Σ_{j=J0}^{J1−1} E[ sup_{t∈T} Zπ_j(t) − Zπ_{j+1}(t) ],
where the last equality follows because there is only one element of UJ1 (since 2J1 ⩾
diam(T )).
We can now focus on the terms E[ sup_{t∈T} Zπ_j(t) − Zπ_{j+1}(t) ]. For any t ∈ T, the triangle inequality gives
d(πj(t), πj+1(t)) ⩽ d(πj(t), t) + d(t, πj+1(t)) ⩽ 2^j + 2^{j+1} = 3 · 2^j.
Thus
sup_{t∈T} Zπ_j(t) − Zπ_{j+1}(t) ⩽ max_{tj∈Uj, tj+1∈Uj+1 : d(tj,tj+1)⩽3·2^j} Ztj − Ztj+1.
This is a maximum over at most |Uj||Uj+1| = Kj Kj+1 ⩽ Kj² variables, each of which is at worst (3 · 2^j)²-sub-Gaussian, so by Proposition B.11,
E[ sup_{t∈T} Zπ_j(t) − Zπ_{j+1}(t) ] ⩽ 3 · 2^j √(4 log Kj) = 3 · 2^{j+1} √(log Kj) = 6 · 2^j √( log N(T, d, 2^j) ).
Hence
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + 6 Σ_{j=J0}^{J1−1} 2^j √( log N(T, d, 2^j) ). (B.4)
Since N(T, d, ·) is non-increasing, for each j we have 2^j √(log N(T, d, 2^j)) ⩽ 2 ∫_{2^{j−1}}^{2^j} √(log N(T, d, τ)) dτ. This yields
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + 12 ∫_{2^{J0−1}}^{2^{J1−1}} √( log N(T, d, τ) ) dτ.
We conclude the desired inequality by observing that 2^{J1−1} ⩽ diam(T) and 2^{J0−1} ⩾ ρ/4.
The above argument proves the result in the case where diam(T) < ∞. To prove the full result, fix some t0 ∈ T, take any R > 0, and let TR := B(t0, R). Then the above argument implies
E[ sup_{t∈TR} Zt ] ⩽ E[ sup_{t,t′∈TR : d(t,t′)⩽ρ} Zt − Zt′ ] + 12 ∫_{ρ/4}^∞ √( log N(TR, d, τ) ) dτ
⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + 12 ∫_{ρ/4}^∞ √( log N(T, d, τ) ) dτ.
It therefore remains to argue that
E[ sup_{t∈TR} Zt ] → E[ sup_{t∈T} Zt ] as R → ∞. (B.5)
For this we may apply the monotone convergence theorem to the functions which take ω ↦ sup_{t∈TR} Zt(ω) − Zt0(ω): these functions are non-negative and pointwise increasing in R, and so (B.5) holds and we conclude.
From the proof, particularly (B.4), it is clear that we can state the analogous bound
with discrete sums rather than integrals.
Corollary B.14. Let (Zt)t∈T be a stochastic process, sub-Gaussian with respect to a metric space (T, d). Then for all ρ > 0,
E[ sup_{t∈T} Zt ] ⩽ E[ sup_{t,t′∈T : d(t,t′)⩽ρ} Zt − Zt′ ] + 6 Σ_{j : 2^j ⩾ ρ/2} 2^j √( log N(T, d, 2^j) ).
C.1 Convexified loss functions
The computational hardness of ERM motivates us to consider alternative, more compu-
tationally practical problems that we can solve as proxies for the ERM problem. The key
idea here is to take the original 0 − 1 loss 1[h(Xi ) ̸= Yi ] and relax it to a convex loss
function, which is generally easier to optimize.
Let’s write the ERM objective in a more suggestive way. Since we are using the labels
{−1, 1}, we have h(x) ̸= y if and only if −h(x)y > 0, and so the ERM objective can be
written
(1/n) Σ_{i=1}^n 1[h(Xi) ≠ Yi] = (1/n) Σ_{i=1}^n φ1(−h(Xi)Yi),
where we define φ1(z) := 1[z > 0].
Definition C.1 (Convex set). A set C ⊂ Rd is convex if for all x, y ∈ C and t ∈ [0, 1],
(1 − t)x + ty ∈ C.
We will first make the hypothesis class H convex. Recall that each h ∈ H is a classifier
h : X → {−1, 1}; so H will not be convex as soon as it contains at least two elements. We
thus consider the following relaxed notion of a classifier.
Definition C.3 (Soft classifier). A soft classifier is any (measurable) f : X → [−1, 1]. The hard classifier associated to f is hf := sgn(f).
2. Majority vote: given classifiers h1, . . . , hM, we look at the set of all convex combinations of these classifiers
F := { Σ_{j=1}^M λj hj : λj ⩾ 0, Σ_{j=1}^M λj = 1 }.
We’ve now convexified the hypothesis class, but we still have non-convexity in the loss φ1 .
We thus consider minimizing a surrogate for φ1 .
1. Hinge loss: φ(z) = max(1 + z, 0)
And its value is written Rφ⋆ := Rφ(fφ⋆). Let η(x) be the regression function, suitably modified to consider labels in {1, −1}, so that
η(x) = P[Y = 1 | X = x].
By monotonicity of φ′, and again ignoring issues of division by 0, this implies
η(x) ⩾ 1/2 ⟺ η(x)/(1 − η(x)) ⩾ 1 ⟺ φ′(fφ⋆(x)) ⩾ φ′(−fφ⋆(x)) ⟺ fφ⋆(x) ⩾ 0.
We thus see – admittedly not rigorously – that fφ⋆ (x) should be outputting the same signs
as the Bayes classifier h⋆ (x) := sgn(2η(x) − 1), meaning that this surrogate problem is, at
least intuitively, equivalent to the original problem.
The following Lemma shows a rigorous and quantitative version of this claim. It shows that if we can solve the φ-risk problem with some f, then the associated hard classifier hf necessarily solves the original classification problem, at least under the hypotheses of the Lemma.
Lemma (Zhang's Lemma). Suppose there are constants c > 0 and γ ∈ [0, 1] such that for all η ∈ [0, 1],
|η − 1/2| ⩽ c (1 − τ(η))^γ,
where Hη(α) := η φ(−α) + (1 − η) φ(α) and τ(η) := inf_{α∈[−1,1]} Hη(α). Then for any soft classifier f : X → [−1, 1], letting hf := sgn(f) be the associated hard classifier, we have the inequality
R(hf) − R(h⋆) ⩽ 2c ( Rφ(f) − Rφ⋆ )^γ.
Notice that τ(η) ⩽ Hη(0) = 1 for all η ∈ [0, 1], so the hypothesis makes sense.
We have the following values of τ, c, and γ for the convex surrogates we mentioned
earlier:
3. Logistic loss (φ(z) = log2(1 + e^z)): τ(η) = −η log2 η − (1 − η) log2(1 − η), c = 1/√(2 log 2) (note that log without a subscript always indicates the natural logarithm), and γ = 1/2.
Proof. We use the decomposition of binary classification risk we proved at the start of class (which clearly holds unchanged for y ∈ {−1, 1} rather than y ∈ {0, 1}) to observe that
R(hf) − R(h⋆) = E[ |2η(X) − 1| 1[hf(X) ≠ h⋆(X)] ] ⩽ 2c E[ (1 − τ(η(X)))^γ 1[hf(X) ≠ h⋆(X)]^γ ],
where we used the hypothesis of the Lemma together with the fact that z^γ = z for any z ∈ {0, 1} and all γ ∈ [0, 1]. Applying Jensen's inequality (the function z^γ is concave for z ⩾ 0 and γ ∈ [0, 1]) we obtain
R(hf) − R(h⋆) ⩽ 2c ( E[ (1 − τ(η(X))) 1[hf(X) ≠ h⋆(X)] ] )^γ.
We claim that, for every x,
(1 − τ(η(x))) 1[hf(x) ≠ h⋆(x)] ⩽ Hη(x)(f(x)) − τ(η(x)). (C.1)
To show (C.1) we split into cases. The case where hf(x) = h⋆(x) is clear because τ(η) ⩽ Hη(f(x)) by definition. So suppose hf(x) ≠ h⋆(x). Then it must be that f(x)(η(x) − 1/2) ⩽ 0. But by convexity of φ,
Hη(x)(f(x)) = η(x) φ(−f(x)) + (1 − η(x)) φ(f(x)) ⩾ φ( −η(x)f(x) + (1 − η(x))f(x) ) = φ( 2(1/2 − η(x)) f(x) ).
And since φ is non-decreasing and φ(0) = 1, it follows that φ(2(1/2 − η(x))f(x)) ⩾ 1, implying
Hη(x)(f(x)) ⩾ 1 =⇒ Hη(x)(f(x)) − τ(η(x)) ⩾ 1 − τ(η(x)),
proving (C.1) in this case and finishing the proof of Zhang's Lemma.
Suppose we are given a set of maps f : X → [−1, 1], written F. We'll then consider the empirical risk minimizer fˆ and the population risk minimizer f¯, defined as1
fˆ ∈ argmin_{f∈F} Rℓ,n(f) and f¯ ∈ argmin_{f∈F} Rℓ(f).
We also write the excess risk of a map f as
Eℓ(f) := Rℓ(f) − Rℓ(f⋆),
1
As in our development of VC theory, we will assume the existence of these optimal maps, but if they
don’t exist, the results can be easily made approximate.
where f⋆ is the minimizer of the risk Rℓ over all (measurable) maps. Our interest will be in controlling the excess risk of the empirical risk minimizer fˆ.
Let’s consider some examples. First, we observe that we can recover our old setting of
hard binary classification.
So that this is an extension of the binary classification loss, and thus the empirical risk
minimizer fˆ and the population risk minimizer f¯ for the ℓramp loss are exactly the same
as the empirical risk minimizer ĥ and the population risk minimizer h̄ in the binary
classification problem we studied previously.
as well as its empirical analog. We can recover this setting by taking ℓφ (z, y) := φ(−zy).
In particular, it can be checked that the following choices of φ lead to a loss ℓφ that satisfies our assumptions (when appropriately re-scaled):
Example C.8 (Regression problems with squared loss). If we take the squared loss ℓ(z, y) := (1/4)(z − y)², we can recover regression (with bounded target variables).
We now turn to controlling the estimation error Rℓ(fˆ) − Rℓ(f¯).
Let’s follow a similar development as we did with VC theory until we can’t go any further.
We start with the inequality
Rℓ (fˆ) − Rℓ (f¯) = Rℓ (fˆ) − Rℓ,n (fˆ) + Rℓ,n (fˆ) − Rℓ,n (f¯) + Rℓ,n (f¯) − Rℓ (f¯)
⩽ Rℓ (fˆ) − Rℓ,n (fˆ) + Rℓ,n (f¯) − Rℓ (f¯).
If we were to exactly follow the analysis in the VC theory problem, we would upper bound each of these terms with
sup_{f∈F} |Rℓ(f) − Rℓ,n(f)|,
and then use the bounded differences inequality to reduce to controlling the expectation
E[ sup_{f∈F} |Rℓ(f) − Rℓ,n(f)| ].
This time, we will be a bit more careful in order to avoid some additional technicalities
later. In particular, we will only use the supremum bound on the term involving fˆ, and
then apply the bounded differences inequality. We obtain the following result.
Proposition C.9. For all δ > 0, with probability at least 1 − δ,
Rℓ(fˆ) − Rℓ(f¯) ⩽ ED[ sup_{f∈F} Rℓ(f) − Rℓ,n(f) ] + 2 √( log(2/δ)/n ).
Proof. Using the above inequalities and applying the sup bound to just the term involving fˆ we have
Rℓ(fˆ) − Rℓ(f¯) ⩽ Rℓ(fˆ) − Rℓ,n(fˆ) + Rℓ,n(f¯) − Rℓ(f¯)
⩽ sup_{f∈F} ( Rℓ(f) − Rℓ,n(f) ) + Rℓ,n(f¯) − Rℓ(f¯).
Define g(D) := sup_{f∈F} ( Rℓ(f) − Rℓ,n(f) ) + Rℓ,n(f¯) − Rℓ(f¯). Using a similar argument as in Lecture 2, we can check that g satisfies the bounded differences condition with constants 2/n, so that for all t > 0,
P[ sup_{f∈F} ( Rℓ(f) − Rℓ,n(f) ) + Rℓ,n(f¯) − Rℓ(f¯) − ED[ sup_{f∈F} ( Rℓ(f) − Rℓ,n(f) ) + Rℓ,n(f¯) − Rℓ(f¯) ] > t ] ⩽ 2e^{−nt²/2}.
Notice that
ED[ sup_{f∈F} ( Rℓ(f) − Rℓ,n(f) ) + Rℓ,n(f¯) − Rℓ(f¯) ] = ED[ sup_{f∈F} ( Rℓ(f) − Rℓ,n(f) ) ].
Recall that in our development of VC theory, we controlled the expected empirical process
ED[ sup_{h∈H} |R(h) − Rn(h)| ]
using the symmetrization technique. We'll use the exact same technique now to control
ED[ sup_{f∈F} Rℓ(f) − Rℓ,n(f) ].
Introduce a "ghost sample" D′ := ((x′1, y′1), . . . , (x′n, y′n)) ∈ (X × [−1, 1])n drawn independently (in particular, independent of D := ((x1, y1), . . . , (xn, yn))) and identically distributed according to P. Let the corresponding empirical risk be R′ℓ,n, namely
R′ℓ,n(f) := (1/n) Σ_{i=1}^n ℓ(f(x′i), y′i).
Observe that
Rℓ(f) = ED′[ R′ℓ,n(f) ].
Using this fact implies
ED[ sup_{f∈F} Rℓ(f) − Rℓ,n(f) ] = ED[ sup_{f∈F} ( ED′[R′ℓ,n(f)] − Rℓ,n(f) ) ] ⩽ ED,D′[ sup_{f∈F} R′ℓ,n(f) − Rℓ,n(f) ].
We finally obtain
ED[ sup_{f∈F} Rℓ(f) − Rℓ,n(f) ] ⩽ ED,D′,ε[ sup_{f∈F} (1/n) Σ_{i=1}^n εi ( ℓ(f(xi), yi) − ℓ(f(x′i), y′i) ) ]
⩽ 2 ED,ε[ sup_{f∈F} (1/n) Σ_{i=1}^n εi ℓ(f(xi), yi) ].
We’ll now introduce a formal definition for the quantity that we have reduced the
problem to.
Definition C.10 (Rademacher complexity). Suppose S ⊂ Rn and ε = (εi)_{i=1}^n is such that each εi ∼ Rad(1/2) independently. Then the Rademacher complexity of S is defined as
R(S) := Eε[ sup_{s∈S} (1/n) ⟨s, ε⟩ ].
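As a concrete aside (not from the notes), the Rademacher complexity of a finite set S ⊂ Rn can be estimated by Monte Carlo, simply averaging sup_{s∈S}(1/n)⟨s, ε⟩ over many draws of the sign vector ε; the snippet below compares such an estimate with the finite-class bound recalled next.

import numpy as np

def rademacher_complexity(S, reps=20_000, rng=None):
    """Monte Carlo estimate of R(S) = E_eps[ sup_{s in S} (1/n) <s, eps> ], S given as rows of a matrix."""
    rng = rng or np.random.default_rng(0)
    S = np.asarray(S, dtype=float)
    n = S.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(reps, n))    # reps independent Rademacher vectors
    return (eps @ S.T).max(axis=1).mean() / n        # average over draws of the supremum

# Example: S consists of the n-dimensional sign patterns of 50 random halfspaces.
rng = np.random.default_rng(1)
n, k = 100, 50
S = np.sign(rng.standard_normal((k, n)))
est = rademacher_complexity(S, rng=rng)
bound = np.max(np.linalg.norm(S, axis=1)) * np.sqrt(2 * np.log(len(S))) / n
print(est, bound)    # the finite-class bound should dominate the Monte Carlo estimate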
At this point in the VC theory analysis, we used the bound
R(A) ⩽ (1/n) sup_{a∈A} ∥a∥ √(2 log |A|),
valid for finite A ⊂ Rn, applied to the set
S := { (1[h(xi) ≠ yi])_{i=1}^n : h ∈ H },
which let us reduce the problem to the shattering coefficients of the class H, which were then bounded by the Sauer–Shelah lemma.
We can't directly follow these steps since, for our current setting, the relevant vectors are (ℓ(f(xi), yi))_{i=1}^n for f ∈ F, which are not necessarily binary valued. In fact, the set
{ (ℓ(f(xi), yi))_{i=1}^n : f ∈ F }
is not necessarily finite! To handle this issue, we'll instead pursue a more general approach, that will turn out to be strictly stronger than the previous bound. The first step of this approach will be to get rid of the loss ℓ.
For a set S ⊂ [−1, 1]n, let's use the notation ℓ ◦ S := { (ℓ(si, yi))_{i=1}^n : s ∈ S }. We claim that if z ↦ ℓ(z, y) is L-Lipschitz for every y, then
R(ℓ ◦ S) ⩽ L R(S).
To see this, take (v1 , v2 ) ∈ V and (v1′ , v2′ ) ∈ V . Then using our assumption that ψ is
1-Lipschitz, we find
Taking the supremum over (v1 , v2 ), (v1′ , v2′ ) ∈ V , we obtain (C.2).
With (C.2) in hand, let us proceed with the proof of our main claim. Observe that
R(ℓ ◦ S) = Eε[ sup_{s∈S} ( (1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) + (1/n) εn ℓ(sn, yn) ) ]
= Eε1,...,εn−1[ Eεn[ sup_{s∈S} ( (1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) + (1/n) εn ℓ(sn, yn) ) ] ],
and apply (C.2) to the set
V := { ( Σ_{i=1}^{n−1} εi ℓ(si, yi), ℓ(sn, yn) ) : s ∈ S }.
Observe that we can repeat the same argument for each index i ∈ [n − 1], each time replacing the ℓ(si, yi) terms with si. The result follows. In particular, applying the claim to the restricted class S = F|D gives
R(ℓ ◦ F|D) ⩽ L R(F|D).
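As a quick numerical sanity check (an illustration, not from the notes), one can estimate both sides of the contraction inequality by Monte Carlo for the 1-Lipschitz hinge-type loss ℓ(z, y) = max(1/2 − zy, 0); with L = 1, the estimate of R(ℓ ◦ S) should not exceed that of R(S).

import numpy as np

def mc_rademacher(S, reps=50_000, rng=None):
    """Monte Carlo estimate of R(S) for the rows of S."""
    rng = rng or np.random.default_rng(0)
    S = np.asarray(S, dtype=float)
    eps = rng.choice([-1.0, 1.0], size=(reps, S.shape[1]))
    return (eps @ S.T).max(axis=1).mean() / S.shape[1]

rng = np.random.default_rng(2)
n, m = 50, 30
S = rng.uniform(-1, 1, size=(m, n))               # m candidate soft-prediction vectors in [-1,1]^n
y = rng.choice([-1.0, 1.0], size=n)               # fixed labels
loss = lambda s: np.maximum(0.5 - s * y, 0.0)     # 1-Lipschitz in the first argument
print(mc_rademacher(loss(S), rng=rng), mc_rademacher(S, rng=rng))   # R(ℓ∘S) vs R(S)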
C.5 Entropy integral bound on Rademacher processes
Let S ⊂ Rn, and consider the Rademacher process ((1/n)⟨ε, s⟩)s∈S for ε1, . . . , εn ∼ Rad(1/2) iid. As we verified in the previous lecture, the increments Zs − Zs′ are σ(s, s′)²-sub-Gaussian for
σ(s, s′) = (1/n) ∥s − s′∥.
Hence we can apply the developed theory with (T, d) = (S, (1/n)∥·∥) to bound the Rademacher complexity R(S) := E[ sup_{s∈S} (1/n)⟨ε, s⟩ ]. We obtain the following result.
Theorem C.13 (Bounding the Rademacher complexity with Dudley's entropy integral). For any S ⊂ Rn and ρ > 0,
R(S) ⩽ E[ sup_{s,s′∈S : ∥s−s′∥⩽ρ} (1/n)⟨s − s′, ε⟩ ] + 12 ∫_{ρ/4}^∞ √( (1/n) log N(S, ∥·∥, τ) ) dτ.
In particular,
R(S) ⩽ 12 ∫_0^∞ √( (1/n) log N(S, ∥·∥, τ) ) dτ.
Combining this result with our reduction (from last lecture) of the estimation error for ERM under general losses to the Rademacher complexity of the restricted classes, we obtain the following result for an L-Lipschitz loss ℓ.
Let the minimizer of R(f) over f ∈ F be written f¯. Suppose D := ((x1, y1), . . . , (xn, yn)) ∈ (X × [−1, 1])n is an iid sample of size n from P. Let the minimizer of the corresponding empirical risk be written fˆ. Then for all δ > 0, with probability at least 1 − δ,
R(fˆ) − R(f¯) ⩽ (12L/√n) ED[ ∫_0^∞ √( log N(F, ∥·∥n, τ) ) dτ ] + √( 2 log(2/δ)/n ),
where we define
∥f − f′∥n := ( (1/n) Σ_{i=1}^n (f(xi) − f′(xi))² )^{1/2}, ∀ f, f′ ∈ F.
C.6 Improved bounds for classes with finite VC dimension
You will prove the following Lemma in the second problems session.
Lemma C.15 (Covering number of classes with finite VC dimension). Suppose H is a set of hard classifiers such that VCdim(H) = d, and µ is any probability measure on X. Let
∥h − h′∥L²(µ) := ( ∫ (h(x) − h′(x))² dµ(x) )^{1/2}.
C.7 Statistical analysis of hard-SVM
Let’s consider the SVM approach to binary classification. We suppose again that P is
supported on X × {−1, 1}, and W is an RKHS over X with associated PSD kernel K. We
consider the following structural assumption on P.
Definition C.17 ((γ, D)-margin assumption). We say that P satisfies the (γ, D)-margin assumption if there exists w⋆ ∈ W such that ∥w⋆∥W ⩽ γ, and for P-almost every (x, y), we have ∥K(x, ·)∥W ⩽ D and 1 ⩽ y⟨w⋆, K(x, ·)⟩W.
We will show in this section that under the (γ, D) margin assumption, the max-
margin/hard-SVM solution solves the classification problem with dimension-free rates.
This is remarkable because the VC dimension of this set of classifiers may be large, if not
infinite.
Recall that for each w ∈ W we write the associated hard classifier hw(x) := sgn(⟨w, K(x, ·)⟩W). Given D := ((X1, Y1), . . . , (Xn, Yn)) ∈ (X × {−1, 1})n iid from P, let ŵ be a solution of the empirical hard-SVM problem, i.e., a minimum-norm w ∈ W satisfying the margin constraints Yi⟨w, K(Xi, ·)⟩W ⩾ 1 for all i ∈ [n]. Then we are interested in the risk of the hard classifier associated to ŵ, namely
R(hŵ) := P[ hŵ(X) ≠ Y ].
Theorem C.18 (Dimension-free rates for hard-SVM). Suppose P satisfies the (γ, D)-margin assumption. Then for all δ > 0, with probability at least 1 − δ,
R(hŵ) ⩽ Dγ/(2√n) + √( log(2/δ)/n ).
We can apply Zhang's Lemma with the hinge loss (the constants are written in lecture 3, and proven in the first problem session) to see that the classification risk R(hŵ) is controlled by the excess hinge risk of fŵ. We now observe that α ◦ fŵ is an empirical risk minimizer for Rℓ,n, so we can apply Theorem 7 from the previous lecture to find that for all δ > 0, with probability at least 1 − δ,
R(hŵ) ⩽ (1/2) ( ED[ R((α ◦ F)|D) ] + 2 √( log(2/δ)/n ) ). (C.3)
Observe that α is 1-Lipschitz, so by the Contraction Lemma (from last lecture) we obtain the stated bound.
References
[1] M. Belkin, “Fit without fear: remarkable mathematical phenomena of deep learning
through the prism of interpolation,” Acta Numerica, vol. 30, pp. 203–248, 2021.
[2] J. M. Borwein and A. S. Lewis, Convex analysis and nonlinear optimization, 2nd ed.,
ser. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer,
New York, 2006, vol. 3, theory and examples.
[4] S. Bubeck, “Convex optimization: algorithms and complexity,” Found. Trends Mach.
Learn., vol. 8, no. 3-4, pp. 231–357, 2015.
[7] R. Van Handel, “Probability in high dimension,” Lecture Notes (Princeton University),
2014.