
Advanced Machine Learning

Austin J. Stromme∗

November 6, 2024

Abstract

Notes for Advanced Machine Learning at ENSAE. We give a theoretical treatment


of basic statistical learning theory in binary classification, then consider perceptrons,
kernel methods, (briefly) ensemble methods, neural networks, and some topics in op-
timization.

Contents
1 Introduction and basics of binary classification 3
1.1 What is machine learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Binary classification: a minimal theoretical model . . . . . . . . . . . . . . . 4
1.3 Empirical risk minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Warm up: finite classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Vapnik-Chervonenkis (VC) theory 9


2.1 VC dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Main result of VC theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Symmetrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Bounding the shattering coefficients with the Sauer-Shelah lemma . . . . . 15
2.5 Application to linear classes . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Computing the empirical risk minimizer 17


3.1 The perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Computational hardness of ERM . . . . . . . . . . . . . . . . . . . . . . 19

4 The kernel trick 19


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Formalizing the kernel trick . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Three equivalent views on the kernel trick . . . . . . . . . . . . . . . . . . . 23
4.4 The take-home message . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5 The representer theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Department of Statistics at ENSAE/CREST, austin.stromme@ensae.fr.

5 Support vector machines 29
5.1 Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Hard-SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 Soft-SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.4 Statistical analysis of soft-SVM . . . . . . . . . . . . . . . . . . . . . . . . . 32

6 Ensemble methods 34
6.1 Majority vote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.3 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7 Neural networks 36
7.1 VC dimension of neural networks with sign activation . . . . . . . . . . . . 36
7.2 Training neural networks with backpropagation . . . . . . . . . . . . . . . . 39
7.3 Brief introduction to first-order methods . . . . . . . . . . . . . . . . . . . . 39
7.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

8 Basics of convex optimization 42


8.1 Convex functions and sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2 Convex optimization isn’t easy . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.3 Two standard analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.4 (S)GD in practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

A Background on concentration inequalities 49


A.1 sub-Gaussian random variables and Hoeffding’s inequality . . . . . . . . . . 49
A.2 Bernstein’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.3 The bounded differences inequality . . . . . . . . . . . . . . . . . . . . . . . 53

B Chaining 55
B.1 Covering and packing numbers . . . . . . . . . . . . . . . . . . . . . . . . . 57
B.2 Single-scale chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
B.3 Dudley’s entropy integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

C Analysis for general losses and applications 62


C.1 Convexified loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
C.2 Zhang’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
C.3 Setting of ERM analysis for general losses . . . . . . . . . . . . . . . . . . . 66
C.4 Reduction to Rademacher complexity . . . . . . . . . . . . . . . . . . . . . 67
C.5 Entropy integral bound on Rademacher processes . . . . . . . . . . . . . 72
C.6 Improved bounds for classes with finite VC dimension . . . . . . . . . . . . 73
C.7 Statistical analysis of hard-SVM . . . . . . . . . . . . . . . . . . . . . . . . 74

1 Introduction and basics of binary classification
1.1 What is machine learning?
Machine learning (ML) is the theory and practice of using data to make decisions. Compar-
ing with statistics, ML was historically less concerned with modeling, studying problems
where modeling was difficult or impossible, and less interested in quantifying uncertainty.
For example, suppose we want to classify images of handwritten digits {0, 1, . . . , 9}. Cre-
ating an explicit model for such data may be quite difficult; the approach of ML is to
instead focus on finding a good predictor, side-stepping the problem of recovering the
“true” data-generating process.
In this sense, ML favors a more blackbox approach than statistics, and is oriented more
towards prediction than inference. However, these differences are receding. Modern statis-
tics is increasingly concerned with computational considerations, finite sample analysis,
and model misspecification. And modern ML incorporates many aspects of probabilistic
modeling.

Examples of ML problems. Let us give some more discussion of the kinds of problems
that are studied in ML.

Example 1.1 (Supervised learning). Supervised learning refers to problems where the
training data includes a response variable that we are interested in estimating for the
test data. We observe (X1 , Y1 ), . . . , (Xn , Yn ) ∈ X × Y, assumed to be iid draws from a
distribution P, and we wish to use this information to estimate y ∼ P(· | x) for a new x.
Examples:

1. Spam detection: given a collection of emails labeled spam or not, and want to use
this data to predict whether new emails are spam

2. Handwritten digit recognition: given a collection of images of handwritten digits


labeled with their corresponding digit, want to label images of handwritten digits
with their correct digit

3. Sentiment analysis: detecting whether text is positive, neutral, or negative

4. Recommender systems: recommending unseen products/movies/songs to users based


on their other preferences

Example 1.2 (Un-supervised learning). In un-supervised learning, we do not have a


response variable in the training data that we want to estimate for the test data. Instead,
we only have data X1 , . . . , Xn ∈ X . There are many different problems that fall in this
domain, some include:

1. Dimensionality reduction: representing data with drastically fewer parameters while


still preserving properties relevant to the problem we are interested in. Find a map
F : X → X ′ such that dim(X ′ ) ≪ dim(X ) and F (X1 ), . . . , F (Xn ) has “similar”
properties (precise meaning here depends on application domain) to X1 , . . . , Xn .

2. Clustering: given K ∈ N, produce S1 , . . . , SK ⊂ {X1 , . . . , Xn } such that the sets are
pairwise disjoint and S1 ∪ · · · ∪ SK = {X1 , . . . , Xn }

3. Generative modeling: how can we generate more samples that are (approximately)
from the distribution of X1 , . . . , Xn ? Examples include DALL-E for images, or GPT
for text.

Example 1.3 (Reinforcement learning). In many settings, the ML model interacts with
its environment in some way. For example:

1. ML systems for games, such as chess, Go, Starcraft II. Here the system has to take
a series of actions which it receives rewards for.

2. Clinical trials, where one desires to simultaneously determine the effectiveness of


drugs while also minimizing negative outcomes for patients.

Note that the above categories are not necessarily disjoint, and often are mutually
supportive. For example, dimensionality reduction is a common preliminary step in many
machine learning pipelines, clustering can be used as a method for supervised learning,
reinforcement learning pipelines often involve training an evaluator which estimates the
reward of a given action (e.g. the value of a given chess board), and generative models
often include supervised components.

1.2 Binary classification: a minimal theoretical model


Since there is a great diversity of ML methods, and many of them fall outside of existing
theory, we will begin by studying a kind of minimal theoretical model of machine learning.
Specifically, we look at the simplest possible supervised learning problem, namely that of
binary classification.
In binary classification, we observe n pairs (X1 , Y1 ), . . . , (Xn , Yn ) sampled iid from a
probability distribution P on X × {0, 1}. The goal is to use this data to allow us to predict
the label y of a new feature x, when (x, y) ∼ P follows the same distribution. Specifically,
we want to use the data {(X1 , Y1 ), . . . , (Xn , Yn )} to output a function h : X → {0, 1} such
that the classification error

R(h) := P(X,Y )∼P [h(X) ̸= Y ] (1.1)

is as small as possible.
Let us make some remarks at this point. First, since y is a binary random variable,
it has a Bernoulli distribution, so the problem is entirely encoded within the regression
function, namely the conditional probability

η(x) := PP [Y = 1 | X = x].

Second, we emphasize that we cannot hope to predict y with perfect accuracy even given
knowledge of the regression function η(x), unless η(x) ∈ {0, 1}. In the extreme case
η(x) = 0.5, we cannot predict y at all. As we will discuss, the amount of noise has a
strong impact on the sample complexity of the problem.

Because of this issue of noise, we cannot, in general, drive the classification error R(h)
to 0. What is the smallest value that can be achieved? As a thought experiment, assume
that we have all the information that we can hope to get, so that we know the regression
function η(x) perfectly. Again, if η(x) = 1/2, then x contains no information about y and
we might as well guess randomly. Otherwise, we can do better than random guessing: if
η(x) > 1/2 then P[y = 1 | x] > P[y = 0 | x], so if we output 1 we’ll be correct more often
than if we output 0. And similarly if η(x) < 1/2. As we will rigorously show, the optimal
classifier is the Bayes classifier h⋆ (x) := 1[η(x) > 1/2], in the sense that

R(h⋆ ) = inf_{h : X → {0,1}} R(h).

To measure the performance of another classifier h, we will look at the excess risk E(h)
defined as
E(h) = R(h) − R(h⋆ ) ⩾ 0. (1.2)
Our hope is that, with sufficiently many observations (sufficiently large n), we can output
a classifier which achieves nearly 0 excess risk by mimicking h⋆ . We have the following
result.

Theorem 1.4 (Decomposition of excess risk). Suppose h : X → {0, 1} is given. Write the
marginal distribution of P on x as Px . Then
E(h) = ∫_{{h ̸= h⋆ }} |2η(x) − 1| dPx (x).

In particular, the Bayes classifier h⋆ is optimal.

Note also that when η(x) ≡ 1/2 identically this result recovers the intuitive fact that
no learning is possible – every classifier performs the same.

Proof. Observe that

E(h) = R(h) − R(h⋆ ) = P[h(x) ̸= y] − P[h⋆ (x) ̸= y]
     = ∫ (1[h(x) ̸= y] − 1[h⋆ (x) ̸= y]) dP(x, y)
     = ∫_{{h ̸= h⋆ }} (1[h(x) ̸= y] − 1[h⋆ (x) ̸= y]) dP(x, y).

Now, notice that if h(x) ̸= h⋆ (x) then h(x) ̸= y if and only if h⋆ (x) = y, since these are
all binary values. Hence,

E(h) = ∫_{{h ̸= h⋆ }} (1[h⋆ (x) = y] − 1[h⋆ (x) ̸= y]) dP(x, y)
     = ∫_{{h ̸= h⋆ }} (P[h⋆ (x) = y | x] − P[h⋆ (x) ̸= y | x]) dPx (x)
     = ∫_{{h ̸= h⋆ }} (2P[h⋆ (x) = y | x] − 1) dPx (x).

Now, we claim that
P[h⋆ (x) = y | x] = |η(x) − 1/2| + 1/2

Indeed, if η(x) > 1/2, then the LHS is η(x), and if η(x) ⩽ 1/2, then the LHS is 1 − η(x),
by the definition of the Bayes classifier h⋆ . The result follows by plugging this equation
into the above.
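To make Theorem 1.4 concrete, here is a small Monte Carlo sketch (not part of the notes; the regression function η, the uniform marginal, and the comparison classifier h are arbitrary illustrative choices) that estimates R(h⋆ ) and checks that R(h) − R(h⋆ ) matches the integral formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: 0.5 * (1 + np.sin(3 * x))      # illustrative regression function on X = [0, 1]
bayes = lambda x: (eta(x) > 0.5).astype(int)   # h*(x) = 1[eta(x) > 1/2]
h = lambda x: (x > 0.5).astype(int)            # some other classifier, for comparison

n = 200_000
X = rng.uniform(0.0, 1.0, size=n)              # P_x = Uniform[0, 1] (assumption for the example)
Y = (rng.uniform(size=n) < eta(X)).astype(int) # Y | X ~ Bernoulli(eta(X))

R_star = np.mean(bayes(X) != Y)                # Monte Carlo estimate of R(h*)
R_h = np.mean(h(X) != Y)                       # Monte Carlo estimate of R(h)

# Theorem 1.4: E(h) equals the integral of |2 eta - 1| over {h != h*}.
excess_formula = np.mean(np.abs(2 * eta(X) - 1) * (h(X) != bayes(X)))
print(R_star, R_h - R_star, excess_formula)    # R(h) - R(h*) should be close to the integral
```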

1.3 Empirical risk minimization


This leads us to the empirical risk. Since we do not have complete knowledge of the
distribution P, and instead only the iid sample (X1 , Y1 ), . . . , (Xn , Yn ), we do not know
η(x) := P[y = 1 |x]. We thus focus, instead, on minimizing the empirical risk

Rn (h) := (1/n) Σ_{i=1}^n 1[h(Xi ) ̸= Yi ].                    (1.3)

Observe that for any fixed classifier h, Rn (h) is an unbiased estimator of R(h), since E[Rn (h)] =
R(h). Moreover, Rn (h) → R(h) as n → ∞ almost surely, by the law of large numbers,
and so we expect Rn to be a good approximation of R for large n.
Since the Bayes classifier h⋆ minimizes the population risk (1.1), a natural approach
to estimating it is thus to minimize the empirical risk (1.3). But notice that here we run
into a problem: what classifiers do we consider when minimizing (1.3)? Because without
any restrictions, we can minimize the empirical risk by taking the classifier ĥ(Xi ) = Yi for
i = 1, . . . , n and h(x) = 0 if x ̸∈ {X1 , . . . , Xn }. This classifier achieves the optimal value
Rn (ĥ) = 0, but doesn’t generalize outside the data – a case of extreme overfitting.
This problem is inherent to any method if we aren’t willing to make assumptions on
the distribution of (x, y): without assumptions on its distribution, it can have arbitrary
behavior out of the sample, and thus we cannot learn. For example, we have the following
“no-free-lunch” result, which shows that there is no algorithm which does substantially
better than random chance uniformly over all distributions of (x, y).

Theorem 1.5 (No-free-lunch for binary classification). Let n be a positive integer and
assume X is a domain with at least 2n elements. Let A be any algorithm for the task
of binary classification over X . Then there exists a distribution P on X × {0, 1} such that the
associated Bayes estimator h⋆ achieves 0 risk, R(h⋆ ) = 0, yet

PD∼P n [R(A(D)) ⩾ 1/8] ⩾ 1/7.

A couple of remarks on this result. First, we are often interested in classification on


continuous spaces, where the hypothesis that X has at least 2n elements is guaranteed.
Second, this result is not an unavoidable result of noise, since the distribution P is actually
deterministic, conditionally on x (since R(h⋆ ) = 0). This result is proven in a natural way:
one considers all distributions on a fixed set of 2n elements, and argues that in any sample
of size n there must be at least n elements which are not observed and on which the
algorithm can do no better than guessing. See [6, Theorem 5.1] for a detailed proof.

Generative vs. discriminative approaches Because of this issue, it is clear that
we need to somehow restrict the distributions P of (x, y). A common approach to this
kind of issue in statistics is the generative approach, where we directly restrict the set of
candidate distributions P. For example, in discriminant analysis, we might assume that
(x, y) is generated by sampling x from a mixture of two Gaussians and letting y be the
mixture component. But in machine learning problems we generally don’t expect that
we can accurately model the true underlying distribution in this kind of way (consider
again the example of handwritten digit classification), so we instead take a discriminative
approach, where we restrict the set of classifiers.

Estimation error vs. approximation error Specifically, we’ll assume that we are
given a class H of classifiers such that R(h) is small for some h ∈ H. This class may be
constructed from domain knowledge, computational considerations, or some mixture of
the two. We will see several examples in this class. For any candidate classifier ĥ built
from the data, we can decompose its excess risk as

E(ĥ) = R(ĥ) − R(h⋆ ) = [R(ĥ) − inf_{h∈H} R(h)] + [inf_{h∈H} R(h) − R(h⋆ )],        (1.4)

where the first bracket is the estimation error and the second is the approximation error.

The estimation error accounts for the error due to only having a sample from P rather
than the full distribution; this should go to zero as n → ∞. For this to be possible we
want H to be relatively small – the previous results imply, for example, that this cannot
happen if H is the set of all classifiers. On the other hand, if we make H too small then
we may have little estimation error but large approximation error. For this reason, we will
often want to balance the two sources of error. Without knowing more about the underlying
data-generating process P, it is difficult to say anything about the approximation error.
In contrast, and perhaps surprisingly, we will be able to say a lot about the estimation
error, even under mild assumptions on P.
The central question we will focus on for this chapter and the next is thus the following:

Problem 1.6. How does the estimation error go to 0 as a function of n and the class H?

This question is typical of the area of theoretical machine learning called statistical
learning theory, and the methods we will develop can be extended to settings well beyond
binary classification. However, to focus on the main ideas and leave time for other topics,
we will focus on binary classification as our minimal theoretical model.

Reduction to empirical processes. A key mathematical technique shall be concen-


tration inequalities, such as Hoeffding’s and Bernstein’s, which allow us to control how
close the empirical risk is to the classification error by bounding the deviations
|(1/n) Σ_{i=1}^n 1[h(Xi ) ̸= Yi ] − P[h(x) ̸= y]|,

with high probability. Note that we give a basic treatment, with proofs, of the concentra-
tion inequalities we will use in this course in Appendix A.

To see why these bounds will be important, assume that

h̄ ∈ arg min_{h∈H} R(h),        ĥ ∈ arg min_{h∈H} Rn (h).

The former we shall call the population risk minimizer, and the latter we shall call the
empirical risk minimizer. (In general, our results will go through even if we can only
approximately solve these problems.) Then we can bound the estimation error

R(ĥ) − inf_{h∈H} R(h) = R(ĥ) − R(h̄)
                       = R(ĥ) − Rn (ĥ) + Rn (ĥ) − Rn (h̄) + Rn (h̄) − R(h̄)
                       ⩽ R(ĥ) − Rn (ĥ) + Rn (h̄) − R(h̄),

where we used the definition of ĥ to observe that Rn (ĥ) ⩽ Rn (h̄). Now observe that the
two remaining terms concern the difference between a population and empirical average.
When the inner function is deterministic as h̄, these differences can be controlled with
concentration inequalities. However, the first term R(ĥ) − Rn (ĥ) cannot be directly con-
trolled with concentration inequalities since ĥ is itself random – it depends in a potentially
complex way on the sample (X1 , Y1 ), . . . , (Xn , Yn ). To get around this issue, we will use a
coarse bound:
R(ĥ) − Rn (ĥ) ⩽ sup_{h∈H} (R(h) − Rn (h)).

It is remarkable that, as we will see, this bound is sufficient to get good results in many
cases; these objects are called suprema of empirical processes. Much of the theoretical
work of this course will be in the service of controlling such suprema.

1.4 Warm up: finite classes


To see how this will work, let us begin with the simplest possible case: a finite class H.
Obviously, this is a rather simplistic setting, but it’s important to understand this analysis
to gain context for the remainder of the class.
Theorem 1.7 (Basic ERM bound for finite classes). Suppose H = {h1 , . . . , hM }. Then
for all δ > 0, with probability at least 1 − δ,
E(ĥ) ⩽ 2 √(log(2M/δ)/(2n)) + R(h̄) − R(h⋆ ).
Proof. Proceeding as above, we find that

R(ĥ) − R(h̄) ⩽ R(ĥ) − Rn (ĥ) + Rn (h̄) − R(h̄).

Taking the supremum over h ∈ H, we obtain

R(ĥ) − R(h̄) ⩽ 2 sup_{h∈H} |Rn (h) − R(h)|.                    (1.5)

Note that we could be a bit more careful and handle the Rn (h̄) − R(h̄) separately; the
only reason we use the supremum step is to deal with the fact that ĥ depends on the data

in a complex way; on the contrary, h̄ doesn’t depend on the data at all. However, using
the bound for Rn (h̄) − R(h̄) somewhat simplifies the analysis, and anyways only worsens
the bound by a constant factor.
To control the supremum in (1.5), we now apply Hoeffding’s inequality (see Ap-
pendix A) to each of the classifiers in H. For each j ∈ [M ] and i ∈ [n], we consider the ran-
dom variables ζij := 1[Yi ̸= hj (Xi )]. Notice that Rn (hj ) = (1/n) Σ_{i=1}^n ζij and R(hj ) = E[ζij ]
for each i ∈ [n]. On the other hand, ζij ∈ [0, 1] almost surely. Hence applying Hoeffding’s
inequality for bounded random variables, Corollary A.9, yields that for all j ∈ [M ] and t > 0,

P[|Rn (hj ) − R(hj )| > t] ⩽ 2e^{−2nt²}.

Setting

tδ := √(log(2M/δ)/(2n)),
it follows that, for all j ∈ [M ],

P[|Rn (hj ) − R(hj )| > tδ ] ⩽ δ/M.
Hence, by the union bound,
P[∃j ∈ [M ] : |Rn (hj ) − R(hj )| > tδ ] ⩽ Σ_{j=1}^M P[|Rn (hj ) − R(hj )| > tδ ] ⩽ Σ_{j=1}^M δ/M = δ.

Using the above with (1.5) we find that, with probability at least 1 − δ,

R(ĥ) − R(h̄) ⩽ 2tδ .

Plugging in the definition of tδ and recalling the decomposition of the excess risk from (1.4)
yields the result.

This is the first of many bounds along these lines, but it illustrates the main ideas: we
bound the estimation error by the supremum of an empirical process, and then use
concentration inequalities to argue that the empirical risk concentrates around the true
risk uniformly over the relevant class.
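As a quick sanity check of Theorem 1.7, the following simulation sketch (illustrative only; the threshold class, the data-generating process, and the constants are assumptions made for this example, not part of the notes) estimates how often the uniform deviation sup_j |Rn (hj ) − R(hj )| exceeds tδ ; by the union-bound argument above this should happen with probability at most about δ.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = lambda x: 0.2 + 0.6 * x                       # illustrative regression function
thresholds = np.linspace(0.0, 1.0, 21)              # finite class: h_j(x) = 1[x > t_j], M = 21
M, n, delta = len(thresholds), 500, 0.05

def sample(size):
    X = rng.uniform(size=size)
    Y = (rng.uniform(size=size) < eta(X)).astype(int)
    return X, Y

# "Population" risks R(h_j), approximated on a very large independent sample.
Xb, Yb = sample(2_000_000)
R = np.array([np.mean((Xb > t).astype(int) != Yb) for t in thresholds])

# Check the uniform deviation bound sup_j |R_n(h_j) - R(h_j)| <= sqrt(log(2M/delta)/(2n)).
t_delta = np.sqrt(np.log(2 * M / delta) / (2 * n))
violations = 0
for _ in range(200):
    X, Y = sample(n)
    Rn = np.array([np.mean((X > t).astype(int) != Y) for t in thresholds])
    violations += np.max(np.abs(Rn - R)) > t_delta
print(violations / 200, "should be at most about", delta)
```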

2 Vapnik-Chervonenkis (VC) theory


In the previous section, we saw that when the set of classifiers H is all possible maps
X → {0, 1}, learning is impossible (for any algorithm). In particular, ensuring a small
estimation error for empirical risk minimization is impossible. Intuitively, this is because
when the set of classifiers is not restricted in any way, the behavior of the true classifier
outside the observed sample can be arbitrary. At the other extreme, when H = {h1 } has
just one element, the estimation error is zero since

R(ĥ) − R(h̄) = R(h1 ) − R(h1 ) = 0.

The bound we gave in the last section interpolated between these two extremes by
considering finite classes H and showed that the statistical difficulty of the problem scaled
as log(|H|). One might naturally think at this point that finite classes are all that can be
learned, but in fact much more is possible. There is more than one approach to this goal,
but in this lecture we will focus on an approach called Vapnik-Chervonenkis theory, which
was pioneered by Vapnik and Chervonenkis in the late 1960s.
Recall that we broke the excess risk up as

E(ĥ) = R(ĥ) − R(h⋆ ) = [R(ĥ) − R(h̄)] + [R(h̄) − R(h⋆ )],

with the first bracket the estimation error and the second the approximation error,

where
ĥ ∈ arg min_{h∈H} Rn (h),        h̄ ∈ arg min_{h∈H} R(h)

are the empirical risk minimizer and the population risk minimizer, respectively. We then
use the definition of ĥ to yield the bound

R(ĥ) − R(h̄) = R(ĥ) − Rn (ĥ) + Rn (ĥ) − Rn (h̄) + Rn (h̄) − R(h̄)


⩽ R(ĥ) − Rn (ĥ) + Rn (h̄) − R(h̄).

As in the last section, rather than try to understand the possibly complex dependence of
ĥ on the sample D, we instead take the supremum over all h ∈ H. This should still be a
surprising strategy, but its strength will be further established in this lecture. We obtain
the bound
R(ĥ) − R(h̄) ⩽ 2 sup_{h∈H} |R(h) − Rn (h)|.                    (2.1)

2.1 VC dimension
To bound the size of the empirical process (2.1), we will use the following notion of
complexity of function classes.

Definition 2.1 (Shattering). We say that a set of classifiers H is shattered by a set


{x1 , . . . , xn } ⊂ X if
|{(h(x1 ), . . . , h(xn )) : h ∈ H}| = 2^n .

In other words, H is shattered by a set of n points if functions from H attain all possible
labelings of those n points.

Definition 2.2 (Shattering coefficients). The shattering coefficients of H are defined, for
each n ∈ N, as
SH (n) := max_{(x1 ,...,xn )∈X^n} |{(h(x1 ), . . . , h(xn )) : h ∈ H}|.

The key intuition of VC theory is that the growth of these shattering coefficients SH (n)
should control the “complexity” of H: if they grow quickly then H must be very big since
it can fit arbitrary sign patterns.

Definition 2.3 (VC dimension). Given an integer d ∈ N, we say that a set of classifiers
H has VC dimension d, written VCdim(H) = d, if H is shattered by some set of d points
in X , but not by any set of d + 1 points in X . If H is shattered by a set of d points in X
for all d ∈ N, we write VCdim(H) = ∞.

Example 2.4 (Finite classes). Suppose H = {h1 , . . . , hM } is a finite class. Let {x1 , . . . , xn } ⊂
X be any subset of n points. Then

|{(h(x1 ), . . . , h(xn )) : h ∈ H}| ⩽ |H|.

Hence H cannot shatter any set of n points once 2^n > M , so

VCdim(H) ⩽ ⌊log2 (M )⌋.
In particular, if H has one element then VCdim(H) = 0.

The following example is our first instance of an infinite class with finite VC dimension.

Example 2.5 (Axis-aligned rectangles). Let H be the class of axis-aligned rectangles,


namely
H = {h(a1 ,a2 ,b1 ,b2 ) : a1 ⩽ a2 , b1 ⩽ b2 },
where for x ∈ R2 ,
h(a1 ,a2 ,b1 ,b2 ) (x) := 1[x ∈ [a1 , a2 ] × [b1 , b2 ]].
We claim that VCdim(H) = 4. We begin by showing that VCdim(H) ⩾ 4. Indeed, consider
{(0, 1), (1, 0), (−1, 0), (0, −1)}. Then it is not hard to see that all possible sign patterns on
this set can be achieved by axis-aligned rectangles. We next claim that VCdim(H) ⩽ 4.
To this end, consider any set of five points {x1 , x2 , . . . , x5 } ⊂ R2 . We may assume that
the points are distinct since otherwise all possible sign patterns cannot be achieved. Let
p1 be a left-most point, p2 a right-most point, p3 an upper-most point, and p4 a
lower-most point (it is possible that some of these points coincide). Then there is at
least one xi not in P := {p1 , p2 , p3 , p4 }, but this xi will be included in any axis-aligned
rectangle which includes all the elements of P . In particular, there is no classifier in H
which is 1 on the elements of P and 0 on xi . Therefore SH (5) < 2^5 , so VCdim(H) < 5.
The result follows.
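The shattering condition can also be checked by brute force on small point sets. The following sketch (not from the notes) enumerates all labelings of a given 2D point set and tests whether each is realized by an axis-aligned rectangle, using the observation that a labeling is realizable exactly when the bounding box of the points labeled 1 contains no point labeled 0.

```python
import itertools
import numpy as np

def rectangle_shatters(points):
    """Brute-force check of whether axis-aligned rectangles shatter the given 2D points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    for labels in itertools.product([0, 1], repeat=n):
        S = pts[np.array(labels) == 1]
        if len(S) == 0:
            continue  # the empty labeling is always realizable (take a rectangle far away)
        lo, hi = S.min(axis=0), S.max(axis=0)          # bounding box of the points labeled 1
        inside = np.all((pts >= lo) & (pts <= hi), axis=1)
        # realizable iff no point labeled 0 falls inside this bounding box
        if np.any(inside & (np.array(labels) == 0)):
            return False
    return True

print(rectangle_shatters([(0, 1), (1, 0), (-1, 0), (-0, -1)]))            # True: these 4 points are shattered
print(rectangle_shatters([(0, 1), (1, 0), (-1, 0), (0, -1), (0, 0)]))     # False: this 5-point set is not
```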

2.2 Main result of VC theory


The main result of VC theory says that VC dimension is strong enough to act as an
“effective class size” even if the class is not actually finite.

Theorem 2.6 (Main upper bound of VC theory). Suppose H has VC dimension d and
n > d + 1. Then for all δ > 0, with probability at least 1 − δ,
sup_{h∈H} |R(h) − Rn (h)| ⩽ 2 √(2d log(2en/d)/n) + √(log(2/δ)/(2n)).

In particular,

R(ĥ) − R(h̄) ⩽ 4 √(2d log(2en/d)/n) + √(2 log(2/δ)/n).

Example 2.7 (Finite classes). Let’s compare this result to what we found in the previous
section. Suppose H = {h1 , . . . , hM } is a finite class. Then Theorem 2.6 implies that for
all δ > 0 with probability at least 1 − δ,
sup_{h∈H} |R(h) − Rn (h)| ⩽ 2 √(2 log(M ) log(2en/ log(M ))/n) + √(log(2/δ)/(2n)).

In the previous lecture we showed that, for all δ > 0, with probability at least 1 − δ,
R(ĥ) − R(h̄) ⩽ √(2 log(2M/δ)/n).

Ignoring constants, we have thus nearly recovered our previous result, but with the addi-
tion of the log(n/ log M ) factor.

We remark that we generally expect terms of the form d/n where d = VCdim(H), since
when n < d, we cannot distinguish H from the set of all classifiers. This intuition can be
made rigorous with formal lower bounds to the effect that the above bound is essentially
optimal (in the sense that there is some distribution P on X × {0, 1} for which it is tight),
but such lower bounds will not be covered in this class. If you’re curious, you can find a
proof in [6, Chapter 28].

2.3 Symmetrization
The proof of Theorem 2.6 will use an important and ingenious technique known as
symmetrization. Symmetrization applies to the expectation of the empirical process
suph∈H |R(h) − Rn (h)|; the following consequence of the bounded differences inequality
allows us to reduce to the expected empirical process.

Lemma 2.8 (Tail bounds for the empirical process). For all δ > 0, with probability at
least 1 − δ,
sup_{h∈H} |R(h) − Rn (h)| ⩽ ED [sup_{h∈H} |R(h) − Rn (h)|] + √(log(1/δ)/(2n)).

It is conceptually interesting to note that, in fact, controlling the expectation of an


empirical process is significantly more difficult than controlling its tails.

Proof. This is an application of the bounded differences inequality, Theorem A.14. We


will apply this inequality to the function

f : ((x1 , y1 ), . . . , (xn , yn )) 7→ sup_{h∈H} |R(h) − Rn (h)|.

Then it is straightforward to verify that this function has the bounded differences property
with ci = 1/n: if we change one sample (xi , yi ) to (x′i , yi′ ) and write the corresponding
empirical risk as Rn′ , we find that

sup_{h∈H} |R(h) − Rn (h)| − sup_{h∈H} |R(h) − Rn′ (h)| ⩽ sup_{h∈H} (|R(h) − Rn (h)| − |R(h) − Rn′ (h)|)
                                                      ⩽ sup_{h∈H} |Rn (h) − Rn′ (h)|
                                                      = (1/n) sup_{h∈H} |1[h(xi ) ̸= yi ] − 1[h(x′i ) ̸= yi′ ]| ⩽ 1/n.

This verifies the bounded differences condition, so we may apply Theorem A.14 to obtain,
for all t > 0,

P[ sup_{h∈H} |R(h) − Rn (h)| − ED [sup_{h∈H} |R(h) − Rn (h)|] > t ] ⩽ e^{−2nt²}.

The result follows.

So let us focus on the expectation of the empirical process. We will introduce a new
sample, known as a ghost sample: suppose D′ = ((x′1 , y1′ ), . . . , (x′n , yn′ )) ∼ P n is an iid
draw from P n , distinct from D. For a classifier h : X → {0, 1}, write
Rn′ (h) := (1/n) Σ_{i=1}^n 1[h(x′i ) ̸= yi′ ].

Then note that


R(h) = ED′ [Rn′ (h)].
Thus,

ED [sup_{h∈H} |R(h) − Rn (h)|] = ED [sup_{h∈H} |ED′ [Rn′ (h) − Rn (h)]|]
                               ⩽ ED [sup_{h∈H} ED′ [|Rn′ (h) − Rn (h)|]]
                               ⩽ ED,D′ [sup_{h∈H} |Rn′ (h) − Rn (h)|],

where we first apply Jensen’s inequality and then exchange the supremum and the expectation. We now make the crucial observation of the
symmetrization technique: for any vector (εi )ni=1 ∈ {−1, 1}n the symmetry of D and D′
imply
ED,D′ [sup_{h∈H} |Rn′ (h) − Rn (h)|] = ED,D′ [sup_{h∈H} |(1/n) Σ_{i=1}^n (1[h(xi ) ̸= yi ] − 1[h(x′i ) ̸= yi′ ])|]
                                     = ED,D′ [sup_{h∈H} |(1/n) Σ_{i=1}^n εi (1[h(xi ) ̸= yi ] − 1[h(x′i ) ̸= yi′ ])|].

Since this is true for all ε ∈ {−1, 1}n , we can take each entry to be distributed according
to a Rademacher random variable, εi ∼ Rad(1/2) (recall this means P[εi = 1] = P[εi =
−1] = 1/2), and write
ED,D′ [sup_{h∈H} |Rn′ (h) − Rn (h)|] = ED,D′ ,ε [sup_{h∈H} |(1/n) Σ_{i=1}^n εi (1[h(xi ) ̸= yi ] − 1[h(x′i ) ̸= yi′ ])|].

Applying the triangle inequality, we finally obtain

ED [sup_{h∈H} |R(h) − Rn (h)|] ⩽ 2 ED,ε [sup_{h∈H} |(1/n) Σ_{i=1}^n εi 1[h(xi ) ̸= yi ]|].        (2.2)

We now work conditionally on the data D. In this case, the supremum is actually over
a finite set. Let
AH (D) := {(1[h(x1 ) ̸= y1 ], . . . , 1[h(xn ) ̸= yn ]) : h ∈ H} ⊂ {0, 1}^n .

Then

Eε [sup_{h∈H} |(1/n) Σ_{i=1}^n εi 1[h(xi ) ̸= yi ]|] = (1/n) Eε [max_{a∈AH (D)} |⟨a, ε⟩|].

So it suffices to control a maximum over a finite set.

Lemma 2.9. Suppose A ⊂ Rn is a finite set and (εi )ni=1 is a vector of iid Rad(1/2) random
variables. Then

Eε [sup_{a∈A} ⟨a, ε⟩] ⩽ max_{a∈A} ∥a∥ · √(2 log(|A|)).

Proof. The result follows from a more general observation, which is as follows.
Lemma 2.10 (Maximum of finitely many sub-Gaussian random variables). Suppose ζ1 , . . . , ζM
are mean zero and σ²-sub-Gaussian. Then

E[max_{k∈[M ]} ζk ] ⩽ √(2σ² log M ).

Proof of Lemma 2.10. Let λ > 0 and note that

E[max_{k∈[M ]} ζk ] = (1/λ) E[log exp(λ max_{k∈[M ]} ζk )]
                    ⩽ (1/λ) log E[exp(λ max_{k∈[M ]} ζk )]
                    = (1/λ) log E[max_{k∈[M ]} exp(λζk )]
                    ⩽ (1/λ) log Σ_{k=1}^M E[exp(λζk )]
                    ⩽ (1/λ) log(M e^{λ²σ²/2}) = (log M )/λ + λσ²/2.

The result follows by taking λ = √(2 log(M )/σ²).

With Lemma 2.10 in hand, we simply need to check that ⟨a, ε⟩ is mean 0 and sub-
Gaussian. For this, we use the fact that
⟨a, ε⟩ = Σ_{i=1}^n ai εi ,

so it is a sum of n independent, mean-zero, sub-Gaussian variables, where the ith term
is a_i²-sub-Gaussian by Example A.4 and Proposition A.7. Hence by Proposition A.7 once
again, we find that ⟨a, ε⟩ is ∥a∥²-sub-Gaussian. The result then follows immediately
from Lemma 2.10.

Applying Lemma 2.9 we see that

Eε [max_{a∈AH (D)} |⟨a, ε⟩|] = Eε [max_{a∈AH (D)∪−AH (D)} ⟨a, ε⟩]
                             ⩽ max_{a∈AH (D)∪−AH (D)} ∥a∥ · √(2 log(|AH (D) ∪ −AH (D)|))
                             ⩽ √n · √(2 log(2|AH (D)|)).

Note that
|AH (D)| = |{(h(x1 ), . . . , h(xn )) : h ∈ H}| ⩽ SH (n).

Thus

Eε [sup_{h∈H} |(1/n) Σ_{i=1}^n εi 1[h(xi ) ̸= yi ]|] ⩽ √(2 log(2SH (n))/n).

Plugging this into (2.2) we obtain


ED [sup_{h∈H} |R(h) − Rn (h)|] ⩽ 2 √(2 log(2SH (n))/n).                    (2.3)

We therefore see that the shattering coefficients SH (n) control the rate of convergence of
the empirical process.
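As a numerical illustration of the quantity appearing in (2.3) (a sketch under assumptions: the threshold class and the synthetic sample below are illustrative choices, not part of the notes), one can estimate the Rademacher average on a fixed sample and compare it with √(2 log(2|AH (D)|)/n) ⩽ √(2 log(2SH (n))/n), using that a threshold class realizes at most n + 1 loss vectors on n points.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.sort(rng.uniform(size=n))
Y = (rng.uniform(size=n) < 0.3 + 0.4 * X).astype(int)   # illustrative data-generating process

# Loss vectors a_h = (1[h(x_i) != y_i])_i for the threshold class h_t(x) = 1[x > t].
# On a fixed sample only n + 1 distinct vectors can arise, so |A_H(D)| <= n + 1.
thresholds = np.concatenate(([-np.inf], X))
A = np.unique(np.array([((X > t).astype(int) != Y).astype(float) for t in thresholds]), axis=0)

# Monte Carlo estimate of E_eps[ max_{a in A} |<a, eps>| ] / n over Rademacher signs.
eps = rng.choice([-1.0, 1.0], size=(5000, n))
lhs = np.mean(np.max(np.abs(eps @ A.T), axis=1)) / n

rhs = np.sqrt(2 * np.log(2 * len(A)) / n)   # sqrt(2 log(2 |A_H(D)|)/n), cf. the bound above
print(lhs, "<=", rhs)
```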

2.4 Bounding the shattering coefficients with the Sauer-Shelah lemma


As we mentioned before, the key intuition of VC theory is that the growth of these shat-
tering coefficients SH (n) should control the “complexity” of H: if they grow quickly then
H must be very big since it can fit arbitrary sign patterns.
The next result gives a precise quantitative characterization of this intuition. It says
that when the class H has finite VC dimension, the size of the shattering coefficients SH (n)
grows polynomially, rather than exponentially, once n > d + 1.

Lemma 2.11 (Sauer-Shelah). Suppose H has VC dimension d. Then for all n > d + 1,
we have
SH (n) ⩽ (en/d)^d .
Observe that this lemma does indeed imply a transition from exponential to polynomial
growth: the definition of VC dimension implies that SH (n) = 2^n for n ⩽ d and SH (d + 1) <
2^{d+1} . This lemma is proved with a non-trivial combinatorial argument that we will not
cover in this course. If you are curious about its proof, take a look at this nice writeup.

2.5 Application to linear classes
Let X = Rd and define the set of linear maps

Ld := {hw : w ∈ Rd },

for
hw (x) := ⟨w, x⟩ ∀x ∈ Rd .
We remark that in practice we will want to include a constant bias term, but this is
equivalent to our setting here since we may consider x′ = (x, 1) ∈ Rd+1 and w′ = (w, b) ∈
Rd+1 .
Using this set of linear maps, let us consider the following hypothesis class

HL,d := {1[hw ⩾ 0] : hw ∈ Ld }.

Functions in HL,d are none other than indicators of halfspaces, labeling one side of a hyperplane 1 and the other 0.


We can now consider our risk minimization problem over HL,d . Recall that we are
given a sample (X1 , Y1 ), . . . , (Xn , Yn ) ∈ Rd × {0, 1} from a distribution P and wish to use
this sample to find a ĥ ∈ HL,d which makes the estimation error

R(ĥ) − inf_{h∈HL,d} R(h)

small. Following the ERM paradigm, we choose ĥ to minimize the empirical risk
ĥ ∈ arg min_{h∈HL,d} Rn (h),     Rn (h) := (1/n) Σ_{i=1}^n 1[h(Xi ) ̸= Yi ].

Using the theory we’ve developed so far, we can get a handle on the resulting estimation
error by controlling the VC dimension VCdim(HL,d ). We have the following result.

Proposition 2.12 (VC dimension of linear classifiers). We have

VCdim(HL,d ) = d.

Proof. Let e1 , . . . , ed ∈ Rd be the standard basis vectors. Then, given any labeling
(y1 , . . . , yd ) ∈ {0, 1}^d , let w := Σ_{i=1}^d (2yi − 1) ei . Then ⟨w, ei ⟩ = 2yi − 1, which is ⩾ 0
if and only if yi = 1, so the classifier 1[hw ⩾ 0] labels ei with yi for all i ∈ [d]. Hence

SHL,d (d) = 2^d ,

and thus VCdim(HL,d ) ⩾ d.


We must now show VCdim(HL,d ) ⩽ d. So let x1 , . . . , xd+1 ∈ Rd be d + 1 vectors; we
will show that not all labelings are possible. Since we have d + 1 vectors in Rd , there must
be coefficients (α1 , . . . , αd+1 ) ∈ R^{d+1} , not all 0, such that

Σ_{i=1}^{d+1} αi xi = 0.

Without loss of generality, let us suppose there is some i0 ∈ [d + 1] such that αi0 > 0
(there must be one that is non-zero, and we can multiply by −1 to make it positive if
necessary). Let yi := 0 if αi ⩾ 0, and yi := 1 if αi < 0. We claim that there is no w ∈ Rd for
which hw (xi ) = yi for all i ∈ [d + 1]. Suppose, for the sake of contradiction, that there
does exist some such w, so that ⟨w, xi ⟩ ⩾ 0 if and only if yi = 1, for all i ∈ [d + 1]. Then
⟨w, xi0 ⟩ < 0, yet

0 > ⟨w, xi0 ⟩ = (−1/αi0 ) Σ_{i ̸= i0} αi ⟨w, xi ⟩ ⩾ 0,

where the last inequality follows because if αi ⩾ 0 then ⟨w, xi ⟩ < 0 so αi ⟨w, xi ⟩ ⩽ 0, and
if αi < 0, then ⟨w, xi ⟩ ⩾ 0, so αi ⟨w, xi ⟩ ⩽ 0. Hence we have a contradiction and can
conclude.

We can plug this result into Theorem 2.6 to find


Proposition 2.13. Let ĥ ∈ HL,d be the empirical risk minimizer over HL,d , and assume
n > d + 1. Then for all δ > 0, with probability at least 1 − δ,
R(ĥ) − inf_{h∈HL,d} R(h) ⩽ 4 √(2d log(2en/d)/n) + √(2 log(2/δ)/n).
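To get a feel for the rates in Proposition 2.13, here is a tiny helper (a sketch, not part of the notes; the choice of d and the sample sizes are arbitrary) that simply evaluates the right-hand side.

```python
import numpy as np

def vc_bound(n, d, delta=0.05):
    """Evaluate the estimation-error bound of Proposition 2.13 (valid for n > d + 1)."""
    return 4 * np.sqrt(2 * d * np.log(2 * np.e * n / d) / n) + np.sqrt(2 * np.log(2 / delta) / n)

# Example: how the bound for halfspaces in R^20 shrinks as n grows.
d = 20
for n in [10_000, 100_000, 1_000_000]:
    print(n, round(vc_bound(n, d), 3))
```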

3 Computing the empirical risk minimizer


We will now move on to a new part of the course. Rather than exclusively developing
statistical theory, we will study learning methods and algorithms in conjunction with
statistics. To motivate the main models we will discuss after this, let us consider a simple
model and algorithm for classification, the perceptron.
It will also be convenient at this point to perform a slight change of notation, and
suppose Y = {−1, 1} rather than Y = {0, 1} as before.

3.1 The perceptron


So using the VC theory, we can get a bound on the estimation error of the ERM over the
halfspace hypothesis class HL,d . Recall that this is defined as
ĥ ∈ arg min_{h∈HL,d} (1/n) Σ_{i=1}^n 1[h(Xi ) ̸= Yi ].                    (3.1)

How can we actually compute this ERM? One approach is the Perceptron algorithm,
a classic algorithm due to Rosenblatt. The Perceptron algorithm iteratively improves the
estimate of w⋆ using the update

w(t+1) = w(t) + Yi Xi

for any pair (Xi , Yi ) which is mis-classified by w(t) , namely for which ⟨w(t) , Xi ⟩Yi ⩽ 0.
Observe that this update improves the performance of the classifier, at least on Xi , since
it ensures

Yi hw(t+1) (Xi ) = Yi ⟨w(t+1) , Xi ⟩ = Yi ⟨w(t) , Xi ⟩ + ∥Xi ∥2 ⩾ Yi hw(t) (Xi ).

Algorithm 1 The Perceptron.
1: Input: Data points {(Xi , Yi )}_{i=1}^n ⊂ R^d × {−1, 1}.
2: procedure Perceptron({(Xi , Yi )}_{i=1}^n )
3:    Initialize w(0) = 0.
4:    for t = 1, 2, . . . do
5:        If there is an i such that Yi ⟨w(t) , Xi ⟩ ⩽ 0, choose one and put w(t+1) = w(t) + Yi Xi .
6:        Else, return w(t).
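For concreteness, here is a minimal NumPy sketch of Algorithm 1 (the synthetic data at the bottom is an illustrative assumption; it is constructed to be linearly separable with margin at least 1, so the algorithm is guaranteed to terminate by the result below).

```python
import numpy as np

def perceptron(X, Y, max_iters=10_000):
    """Minimal implementation of Algorithm 1; Y must take values in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        mistakes = np.nonzero(Y * (X @ w) <= 0)[0]   # indices with Y_i <w, X_i> <= 0
        if len(mistakes) == 0:
            return w                                 # all points correctly classified
        i = mistakes[0]
        w = w + Y[i] * X[i]                          # the Perceptron update
    raise RuntimeError("no separating hyperplane found within max_iters")

# Illustrative separable data: labels given by a ground-truth halfspace (well-specified case).
rng = np.random.default_rng(3)
w_star = rng.normal(size=5)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(500, 5))
X = X[np.abs(X @ w_star) >= 1.0]          # keep only points with margin >= 1 (so B <= 1 below)
Y = np.sign(X @ w_star)
w_hat = perceptron(X, Y)
print(np.all(Y * (X @ w_hat) > 0))        # True: every training point is classified correctly
```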

Let’s consider the well-specified case, where each Yi = sgn(⟨w⋆ , Xi ⟩) for some w⋆ ∈ Rd .

Proposition 3.1 (Convergence of Perceptron in well-specified case). Suppose D = {(Xi , Yi )}ni=1


is such that Yi = sgn(⟨w⋆ , Xi ⟩) for all i ∈ [n]. Let B := min{∥w∥ : ∀i ∈ [n], Yi ⟨w, Xi ⟩ ⩾ 1},
and R := max_{i∈[n]} ∥Xi ∥. Then the Perceptron algorithm stops after at most (RB)² iterations, and
when it stops it correctly classifies all the points, namely for all i ∈ [n], Yi ⟨w(t) , Xi ⟩ > 0.

Proof. Suppose the algorithm runs for at least T ∈ N iterations. Let us assume without
loss of generality that ∥w⋆ ∥ = B and Yi ⟨w⋆ , Xi ⟩ ⩾ 1 for all i ∈ [n].
We start by showing that
⟨w⋆ , w(T ) ⟩ ⩾ T. (3.2)
Observe that if the update at time t ∈ {0, 1, . . . , T − 1} uses the pair (Xi , Yi ), then

⟨w⋆ , w(t+1) ⟩ − ⟨w⋆ , w(t) ⟩ = ⟨w⋆ , Yi Xi ⟩ ⩾ 1.

Hence

⟨w⋆ , w(T ) ⟩ = ⟨w⋆ , w(T ) ⟩ − ⟨w⋆ , w(0) ⟩ = Σ_{t=1}^T (⟨w⋆ , w(t) ⟩ − ⟨w⋆ , w(t−1) ⟩) ⩾ T,

yielding (3.2).
Next, we claim that

∥w(t) ∥ ⩽ √t R,    ∀t ∈ {0, 1, . . . , T }.                    (3.3)

This follows since, whenever the update at time t uses the pair (Xi , Yi ),

∥w(t+1) ∥² = ∥w(t) + Yi Xi ∥² = ∥w(t) ∥² + 2Yi ⟨w(t) , Xi ⟩ + Yi² ∥Xi ∥² ⩽ ∥w(t) ∥² + R² ,

using that Yi ⟨w(t) , Xi ⟩ ⩽ 0 for a mis-classified pair; applying this inequality recursively from
w(0) = 0 gives ∥w(t) ∥² ⩽ tR².
Putting (3.2) and (3.3) together and applying Cauchy-Schwarz, we obtain

T ⩽ ⟨w⋆ , w(T ) ⟩ ⩽ ∥w⋆ ∥ ∥w(T ) ∥ ⩽ √T BR.

This implies the result.

3.2 Computational hardness of ERM
Unfortunately, the fact that the ERM problem over halfspaces admits a practical algorithm
(the perceptron algorithm) is a rare occurrence. For one, when we remove the well-specified
assumption, the ERM problem over halfspaces becomes computationally hard. But, in
general, even well-specified ERM problems can be computationally hard (unless P = N P ,
there is no algorithm for the ERM problem with running time polynomial in n and d;
those who are curious can find more precise details in [6, Chapter 8]).

4 The kernel trick


In this section, we explore a trick that allows us to perform binary classification by lifting
the problem to a higher dimensional space. As we will see, this will allow us to use linear
methods to learn non-linear decision boundaries.

4.1 Introduction
Notice that, by definition, the Perceptron algorithm doesn’t converge if the data points
aren’t linearly separable. That is, if there is no w ∈ Rd such that Yi ⟨w, Xi ⟩ > 0 for all
i ∈ [n], then the algorithm never stops.
As a simple example of such a case, consider the following data for X = R and n = 3:

(X1 , Y1 ) = (−1, 1), (x2 , y2 ) = (0, −1), (x3 , y3 ) = (1, 1).

This data is not linearly separable, and so the Perceptron algorithm won’t converge. Of
course, data which is not linearly separable, either due to an underlying non-linear data
generating process, or due to noise, is the rule rather than the exception.
In this section, we’ll discuss the kernel trick, a very clever and useful workaround for
this issue, which will have applications well beyond the Perceptron algorithm.
There are three key ideas involved in the kernel trick.

1. Embed the data into a higher dimensional feature space. To overcome this
issue of linear separability, we will embed our data into a higher dimensional vector
space R^{d′} for d′ ⩾ d, called the feature space, using an embedding ψ : R^d → R^{d′} , called
the feature map. If we choose this map properly then the data ((ψ(X1 ), Y1 ), . . . , (ψ(Xn ), Yn ))
will be linearly separable. We now want to run the Perceptron on this transformed
data, to obtain a vector w⋆ ∈ R^{d′} such that Yi ⟨w⋆ , ψ(Xi )⟩ > 0 for all i ∈ [n]. We can
use this w⋆ to classify new x via x 7→ sgn(⟨w⋆ , ψ(x)⟩).

2. We only need to compute inner products. In general, we may have made our
problem significantly more computationally difficult, since now we have to manipu-
late these higher dimensional data points ψ(Xi ). But the Perceptron algorithm has
a special feature: it only depends on inner products of the feature vectors. Because
of the update w(t+1) = w(t) + Yi ψ(Xi ), we know that the iterate w(t) will always be a
sum of (signed) vectors in {ψ(X1 ), . . . , ψ(Xn )}, and so checking the mis-classification
condition Yi ⟨w(t) , ψ(Xi )⟩ ⩽ 0 can be implemented by considering a (signed) sum of

products ⟨ψ(Xi ), ψ(Xj )⟩. Explicitly, if we used the pair indexed by it at step t of the
algorithm for our update, then

w(T ) = Σ_{t=1}^{T −1} Yit ψ(Xit ),

so that the misclassification condition at step T , for some j ∈ [n], is

Yj ⟨w(T ) , ψ(Xj )⟩ ⩽ 0 ⇐⇒ Yj Σ_{t=1}^{T −1} Yit ⟨ψ(Xit ), ψ(Xj )⟩ ⩽ 0.

And if the algorithm terminates after T steps, we can obtain a predicted label for a
new x via

sgn(⟨w(T ) , ψ(x)⟩) = sgn( Σ_{t=1}^{T −1} Yit ⟨ψ(Xit ), ψ(x)⟩ ).

Hence, by keeping track of the indices we have visited so far, we can implement
the Perceptron algorithm only using the inner products (⟨ψ(Xi ), ψ(Xj )⟩)ni,j=1 and
compute predictions for x ∈ Rd , only using the inner products (⟨ψ(Xi ), ψ(x)⟩)ni=1 .

3. Only use ψ such that the kernel K(x, x′ ) := ⟨ψ(x), ψ(x′ )⟩ is easy to compute.
The map K : Rd × Rd → R defined by K(x, x′ ) := ⟨ψ(x), ψ(x′ )⟩ is called the kernel.
So long as K(x, x′ ) is easy to compute (typically O(d) time), we will be able to take
advantage of the flexibility of this very high dimensional feature space, without paying
any computational penalty.

Let’s give some examples.

Example 4.1 (Classifying three points). Let’s take the example we gave above: X = R
and n = 3:
(X1 , Y1 ) = (−1, 1), (x2 , y2 ) = (0, −1), (x3 , y3 ) = (1, 1).

Then we can take the embedding ψ : R → R2 defined by ψ(x) := (x, x2 ). The transformed
data is then

(ψ(X1 ), Y1 ) = ((−1, 1), 1), (ψ(x2 ), y2 ) = ((0, 0), −1), (ψ(x3 ), y3 ) = ((1, 1), 1),

which is linearly separable.

Example 4.2 (Embedding into finite degree polynomials). We can generalize the above
example by considering X = R^d and the embedding ψ : R^d → R^{(d+1)^k} for some k ∈ N,
defined by taking

ψ(x)J := Π_{l=1}^k x_{Jl} ,    ∀J ∈ {0, 1, . . . , d}^k ,

where we use the convention x0 := 1.

Observe that for x, x′ ∈ R^d , we have

K(x, x′ ) = ⟨ψ(x), ψ(x′ )⟩ = Σ_{J∈{0,1,...,d}^k} Π_{l=1}^k x_{Jl} x′_{Jl} = (1 + ⟨x, x′ ⟩)^k .

Thus, for this choice of embedding, the kernel is easy to compute (time O(d)), despite the
fact that the feature space is exponentially larger than the original space!
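One can verify the identity K(x, x′ ) = (1 + ⟨x, x′ ⟩)^k numerically by building the explicit feature map for small d and k (a sketch, not from the notes; the inputs are arbitrary):

```python
import itertools
import numpy as np

def poly_features(x, k):
    """Explicit feature map of Example 4.2: one coordinate per multi-index J in {0,...,d}^k."""
    z = np.concatenate(([1.0], x))                       # convention x_0 := 1
    return np.array([np.prod(z[list(J)]) for J in itertools.product(range(len(z)), repeat=k)])

rng = np.random.default_rng(4)
x, xp = rng.normal(size=3), rng.normal(size=3)           # d = 3
for k in [1, 2, 3]:
    lhs = poly_features(x, k) @ poly_features(xp, k)     # <psi(x), psi(x')> computed explicitly
    rhs = (1 + x @ xp) ** k                              # the kernel shortcut
    print(k, np.isclose(lhs, rhs))                       # True for each k
```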

Example 4.3 (Embedding into ℓ2 with the Gaussian kernel). Let’s consider another
example, where the feature space is the space of square-summable infinite sequences,
defined as

ℓ2 := {(xj )_{j=0}^∞ ∈ R^N : Σ_{j=0}^∞ xj² < ∞},

with inner product

⟨x, x′ ⟩_{ℓ2} := Σ_{j=0}^∞ xj x′j .

Take X = R and ψ : R → ℓ2 defined as

ψ(x)j := e^{−x²/2} x^j /√(j!) ,    j = 0, 1, . . . .

Then for x, x′ ∈ R, we have

K(x, x′ ) = ⟨ψ(x), ψ(x′ )⟩_{ℓ2} = e^{−(x²+(x′)²)/2} Σ_{j=0}^∞ (xx′)^j /j!
         = e^{−(x²+(x′)²)/2} · e^{xx′}
         = e^{−(x−x′)²/2} .

So we have an infinite dimensional feature space, yet still can compute the kernel in O(1)
time.
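A quick numerical check of this computation (a sketch; the truncation level N is an arbitrary choice) compares the closed-form kernel with a truncated version of the feature-space inner product:

```python
import math

def gaussian_kernel(x, xp):
    return math.exp(-(x - xp) ** 2 / 2)

def truncated_feature_inner_product(x, xp, N=40):
    """Truncate the l2 feature map of Example 4.3 after N coordinates."""
    return sum(
        (math.exp(-x**2 / 2) * x**j / math.sqrt(math.factorial(j)))
        * (math.exp(-xp**2 / 2) * xp**j / math.sqrt(math.factorial(j)))
        for j in range(N)
    )

x, xp = 0.7, -1.3
print(gaussian_kernel(x, xp), truncated_feature_inner_product(x, xp))  # nearly identical
```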

4.2 Formalizing the kernel trick


As the above examples illustrate, working directly with the feature map ψ can be quite
complicated. The utility of the kernel trick is precisely that, although the feature map
might be quite complicated, when implementing the trick we only need to access the inner
products K(Xi , Xj ) = ⟨ψ(Xi ), ψ(Xj )⟩. Thus, we can think about the kernel trick in two
different ways: as a feature map, or as a kernel. It turns out that there is a third way
which is useful for mathematical analysis, that of reproducing kernel Hilbert spaces. Let
us now rigorously define these terms.

Another reference for this material, albeit not freely available, is [8, Chapter 12]. There
is also a discussion in [6, Chapter 16], but it’s pretty vague.
First, we’ll need to recall the definition of a Hilbert space.

Definition 4.4 (Hilbert space). A Hilbert space (W, ⟨·, ·⟩W ) is a real vector space W
associated with an inner product ⟨·, ·⟩W , such that the resulting normed space is complete.

We’ll now formally define these three equivalent ways of viewing the kernel trick.

Definition 4.5 (Feature map). A feature map is any map ψ : X → V from X into a
Hilbert space V .

Definition 4.6 (PSD Kernel). A map K : X × X → R is a positive semi-definite kernel if


it obeys

1. Symmetry: K(x, x′ ) = K(x′ , x) for all x, x′ ∈ X

2. Positive semi-definiteness: for all n ∈ N, all X1 , . . . , Xn ∈ X the matrix (K(Xi , Xj ))i,j∈[n]


is PSD, so that for all α ∈ Rn ,

Σ_{i,j=1}^n αi αj K(Xi , Xj ) ⩾ 0.
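As a small numerical illustration of condition 2 (a sketch, not part of the notes; the Gaussian kernel and the random inputs are illustrative assumptions), one can form a Gram matrix on a sample and check that its smallest eigenvalue is nonnegative up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 1))                              # 50 sample points in R
K = np.exp(-0.5 * (X - X.T) ** 2)                         # Gram matrix of the Gaussian kernel
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig >= -1e-10)                                  # True: the Gram matrix is PSD (numerically)
```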

Definition 4.7 (RKHS). Given a Hilbert space W of functions from X → R and a


symmetric map K : X × X → R, we say that K is a reproducing kernel for W if

1. Symmetric: K(x, x′ ) = K(x′ , x) for all x, x′ ∈ X .

2. For all x ∈ X , K(x, ·) ∈ W

3. Reproducing property: for all f ∈ W and x ∈ X , f (x) = ⟨K(x, ·), f ⟩W .

If there is some reproducing kernel for W , we say that W is a reproducing kernel Hilbert
space (RKHS).

As we will show, there is a one-to-one correspondence between RKHSs, PSD


kernels, and feature maps, for a given data space X . This means that these views
are equivalent, and that we can shift between them depending on what is convenient.
Generally speaking, the feature map definition is the most intuitive, since it clearly explains
how we are lifting the data to a higher dimensional space to make it linearly separable. On
the other hand, the RKHS definition will be the most helpful when doing mathematical
analysis. And finally, the PSD kernel is all we will need for the actual implementation
(and this is, indeed, the point of kernel methods). For example, consider the following
pseudocode for the Kernelized Perceptron.

Algorithm 2 The Kernelized Perceptron.
1: Input: Data points {(Xi , Yi )}_{i=1}^n ⊂ X × {−1, 1} and a PSD kernel K : X × X → R.
2: procedure KernelizedPerceptron({(Xi , Yi )}_{i=1}^n , K)
3:    Initialize a list I = [].
4:    For any list of indices I, define hI (x) := sgn( Σ_{j∈I} Yj K(x, Xj ) ), where the sum
      includes repeated elements of I.
5:    for t = 1, 2, . . . do
6:        If there is an i such that hI (Xi ) ̸= Yi , add it to I.
7:        Else, return hI .
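Here is a minimal NumPy sketch of Algorithm 2 (the polynomial kernel and the three-point data from Section 4.1 are illustrative choices; the counts vector plays the role of the multiset I, and the `kernelized_perceptron` name is our own):

```python
import numpy as np

def kernelized_perceptron(X, Y, kernel, max_iters=1000):
    """Minimal sketch of Algorithm 2: only kernel evaluations K(X_i, X_j) are used."""
    n = len(X)
    G = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    counts = np.zeros(n)                       # counts[j] = number of times index j was added to I
    for _ in range(max_iters):
        preds = np.sign(G @ (counts * Y))      # h_I(X_i) = sgn(sum_{j in I} Y_j K(X_i, X_j))
        mistakes = np.nonzero(preds != Y)[0]
        if len(mistakes) == 0:
            return lambda x: np.sign(sum(counts[j] * Y[j] * kernel(x, X[j]) for j in range(n)))
        counts[mistakes[0]] += 1
    raise RuntimeError("did not fit the data within max_iters")

# The three-point example from Section 4.1, with the degree-2 polynomial kernel (1 + <x, x'>)^2.
X = np.array([-1.0, 0.0, 1.0])
Y = np.array([1.0, -1.0, 1.0])
h = kernelized_perceptron(X, Y, kernel=lambda x, xp: (1 + x * xp) ** 2)
print([h(x) for x in X])                       # [1.0, -1.0, 1.0]: all three points are fit
```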

4.3 Three equivalent views on the kernel trick


Let us now describe the precise correspondences between these views on the kernel trick.
We’ll first show that each RKHS has a unique PSD kernel which is reproducing for it.

Proposition 4.8. Let W be an RKHS over X with reproducing kernel K : X × X → R.


Then K is a PSD kernel, and is the unique reproducing kernel for W .

Proof. Uniqueness is an exercise. To show that K must be a PSD kernel, suppose


X1 , . . . , Xn ∈ X and α ∈ Rn . Then, using that K(Xi , Xj ) = ⟨K(Xi , ·), K(Xj , ·)⟩_W (the
reproducing property applied to f = K(Xj , ·)),

Σ_{i,j=1}^n αi αj K(Xi , Xj ) = ∥ Σ_{i=1}^n αi K(Xi , ·) ∥²_W ⩾ 0,

so (K(Xi , Xj ))_{i,j=1}^n is PSD.

The next Proposition is the converse of Proposition 4.8.

Proposition 4.9. For each PSD kernel K : X × X → R there is a unique RKHS W of


functions over X for which K is the reproducing kernel. This W can be constructed in the
following manner: let
W0 := span({K(x, ·) : x ∈ X }).
Define an inner product on W0 as follows: for all α ∈ Rn , β ∈ Rm , and X1 , . . . , Xn , X1′ , . . . , Xm′ ∈ X ,
set

⟨ Σ_{i=1}^n αi K(Xi , ·), Σ_{j=1}^m βj K(Xj′ , ·) ⟩_{W0} := Σ_{i=1}^n Σ_{j=1}^m αi βj K(Xi , Xj′ ).        (4.1)

Then W is the completion of W0 with respect to the resulting norm on W0 .

Proof. We first verify that the inner product is well-defined. It suffices to show that for
any α ∈ Rn and X1 , . . . , Xn ∈ X , if
Σ_{i=1}^n αi K(Xi , ·) = 0,                    (4.2)

then

⟨ Σ_{i=1}^n αi K(Xi , ·), Σ_{j=1}^m βj K(Xj′ , ·) ⟩_{W0} = 0,    ∀β ∈ Rm , X1′ , . . . , Xm′ ∈ X .

Note that (4.2) means exactly that for all x ∈ X


Σ_{i=1}^n αi K(Xi , x) = 0.

So

⟨ Σ_{i=1}^n αi K(Xi , ·), Σ_{j=1}^m βj K(Xj′ , ·) ⟩_{W0} = Σ_{i=1}^n Σ_{j=1}^m αi βj K(Xi , Xj′ ) = Σ_{j=1}^m βj Σ_{i=1}^n αi K(Xi , Xj′ ) = 0.

Hence the inner product ⟨·, ·⟩W0 is well-defined.


It is clearly bi-linear, and symmetry and non-negativity follow immediately from the
symmetry and PSD-ness of K, respectively. The last item we need to check is that, for all
α ∈ Rn and X1 , . . . , Xn ∈ X ,

Σ_{i=1}^n αi K(Xi , ·) ̸= 0    =⇒    ∥ Σ_{i=1}^n αi K(Xi , ·) ∥_{W0} > 0.

So suppose α ∈ Rn and X1 , . . . , Xn ∈ X are such that


Σ_{i=1}^n αi K(Xi , ·) ̸= 0.

Then there exists some x ∈ X such that

Σ_{i=1}^n αi K(Xi , x) ̸= 0.

So for any λ ∈ R, by non-negativity we have


0 ⩽ ∥ λK(x, ·) + Σ_{i=1}^n αi K(Xi , ·) ∥²_{W0} = λ² K(x, x) + 2λ Σ_{i=1}^n αi K(Xi , x) + ∥ Σ_{i=1}^n αi K(Xi , ·) ∥²_{W0} .

But since λ² K(x, x) ⩾ 0 and Σ_{i=1}^n αi K(Xi , x) ̸= 0, there is some λ0 of sufficiently small
magnitude and appropriate sign such that

λ0² K(x, x) + 2λ0 Σ_{i=1}^n αi K(Xi , x) < 0.

This implies

∥ Σ_{i=1}^n αi K(Xi , ·) ∥²_{W0} > 0,

as desired.

Hence W0 is a real vector space of functions on X with valid inner product ⟨·, ·⟩W0 .
And it obeys the reproducing property, since for all f ∈ W0 , we have
n
X
f= αi K(Xi , ·)
i=1

for some α ∈ Rn and X1 , . . . , Xn ∈ X , so that for all x ∈ X


n
X n
X
⟨K(x, ·), f ⟩W0 = ⟨K(x, ·), αi K(Xi , ·)⟩W0 = αi K(Xi , x) = f (x). (4.3)
i=1 i=1

We have thus nearly constructed an RKHS which has K as its reproducing kernel, but
we are missing one crucial property: the space W0 may not be complete, and so isn’t a
Hilbert space. To handle this issue, we consider the completion W of W0 . Recall that we
can complete a metric space (Z, d) by letting Z⋆ be the set of all Cauchy sequences in Z with
the metric d⋆ ((zk )_{k=1}^∞ , (zk′ )_{k=1}^∞ ) := lim_{k→∞} d(zk , zk′ ), where two sequences (zk )_{k=1}^∞ , (zk′ )_{k=1}^∞
are identified whenever lim_{k→∞} d(zk , zk′ ) = 0.
In our setting, the completion of W0 is the set of all Cauchy sequences (fn )_{n=1}^∞ with
fn ∈ W0 . We’ll first check that we can regard these Cauchy sequences as functions on X .
To this end, for each Cauchy sequence (fk )_{k=1}^∞ in W0 , let f⋆ : X → R be defined as

f⋆ (x) := lim_{k→∞} fk (x).                    (4.4)

This limit exists because the sequence (fk (x))_{k=1}^∞ is Cauchy: by the reproducing prop-

k=1 is Cauchy: by the reproducing prop-
erty (4.3), we have

|fk (x) − fℓ (x)| = |⟨K(x, ·), fk − fℓ ⟩W0 | ⩽ ∥K(x, ·)∥W0 ∥fk − fℓ ∥W0 ,

which can be made arbitrarily small for k, ℓ sufficiently large. And (4.4) is well-defined, in
the sense that if (fk )_{k=1}^∞ , (fk′ )_{k=1}^∞ are two Cauchy sequences in W0 such that ∥fk − fk′ ∥_{W0} →
0 as k → ∞, then the limits agree, since by the reproducing property (4.3)

| lim_{k→∞} fk (x) − lim_{k→∞} fk′ (x)| = lim_{k→∞} |fk (x) − fk′ (x)| ⩽ lim_{k→∞} ∥K(x, ·)∥_{W0} ∥fk − fk′ ∥_{W0} = 0.

So we can regard W as a set of functions on X via (4.4). It’s clear that W is also a
real vector space under the definition, for α, β ∈ R and (fk )_{k=1}^∞ , (gk )_{k=1}^∞ ∈ W ,

α(fk )_{k=1}^∞ + β(gk )_{k=1}^∞ := (αfk + βgk )_{k=1}^∞ .

If we define the inner product

⟨(fk )_{k=1}^∞ , (gk )_{k=1}^∞ ⟩_W := lim_{k→∞} ⟨fk , gk ⟩_{W0} ,

then it is straightforward to check that this is well-defined, compatible with the above
vector space structure, and gives rise to the same metric as that of the completion on W .
So W is a Hilbert space with these definitions, and each of its elements can be regarded
as a function on X .

We can finally check that W is an RKHS with reproducing kernel K. First of all, it is
clear that for all x ∈ X , the constant sequence (K(x, ·))_{k=1}^∞ is Cauchy, so lies in W , and corresponds
to the function x′ 7→ K(x, x′ ). Hence K(x, ·) ∈ W for all x ∈ X . For the reproducing
property, let f ∈ W and x ∈ X . Then there is a Cauchy sequence (fk )_{k=1}^∞ in W0 such that

f (x) = lim_{k→∞} fk (x).

Notice that, by definition,

⟨K(x, ·), f ⟩_W = lim_{k→∞} ⟨K(x, ·), fk ⟩_{W0} .

So by the reproducing property on W0 , namely (4.3), we have

⟨K(x, ·), f ⟩_W = lim_{k→∞} fk (x) = f (x).

Hence W is an RKHS with reproducing kernel K, and we conclude the existence portion
of the proof.
For uniqueness, suppose W ′ is another RKHS over X with reproducing kernel K.
Then clearly, W ⊂ W ′ . We’ll show W ′ ⊂ W to conclude uniqueness. So take any f ∈ W ′ .
Since W is a closed linear subspace of W ′ , we can write f = f ∥ + f ⊥ where f ∥ ∈ W and f ⊥
is orthogonal to all elements of W . Using the reproducing property, we find that for all
x ∈ X,

f (x) = ⟨K(x, ·), f ⟩_{W ′} = ⟨K(x, ·), f ∥ + f ⊥ ⟩_{W ′} = ⟨K(x, ·), f ∥ ⟩_{W ′} = f ∥ (x).

But since this is true for all x ∈ X , it implies that f = f ∥ ∈ W , so that W ′ ⊂ W . Hence
W ′ = W and we have shown uniqueness.

Now that we have shown Propositions 4.8 and 4.9, we can safely think of RKHSs and
PSD kernels as coming in pairs, and we’ll do this from now on. We conclude our equiva-
lences by showing that each feature map has a unique kernel and RKHS pair associated
to it.
Our first claim in this direction is essentially trivial, and states that each RKHS and
PSD kernel pair has an associated unique feature map.
Proposition 4.10. Suppose we have two pairs of RKHSs and reproducing kernels, W0 , K0
and W1 , K1 . Define ψ0 : X → W0 and ψ1 : X → W1 via ψ0 (x) := K0 (x, ·) and ψ1 (x) :=
K1 (x, ·). Then ψ0 , ψ1 are feature maps, and we have uniqueness in the following sense:

K0 = K1 ⇐⇒ ψ0 (x) = ψ1 (x) ∀x ∈ X ,

where ψ0 (x) = ψ1 (x) is understood in the sense of pointwise equivalence of functions on


X.
Proof. Trivial.

To go in the other direction, we want to say that each feature map ψ : X → V corre-
sponds to a unique RKHS and PSD kernel pair W, K.
We then have the following statement.

Proposition 4.11. Suppose ψ0 : X → V0 and ψ1 : X → V1 are feature maps. Let K0 : X ×
X → R and K1 : X × X → R be defined by K0 (x, x′ ) := ⟨ψ0 (x), ψ0 (x′ )⟩V0 and K1 (x, x′ ) :=
⟨ψ1 (x), ψ1 (x′ )⟩V1 for all x, x′ ∈ X . Then K0 , K1 are PSD kernels. Let W0 , W1 be their
associated RKHSs. Then we have

span({ψ0 (x) : x ∈ X }) ≅ W0 ,    span({ψ1 (x) : x ∈ X }) ≅ W1 ,

where the spans are taken to be closed (i.e., we take their closures in V0 and V1 , respectively).
Finally, we have uniqueness in the sense that, if K0 = K1 , then

span({ψ0 (x) : x ∈ X }) ≅ span({ψ1 (x) : x ∈ X }).

We remark that the uniqueness statement above is the best we can hope for because
we can always compose a Hilbert space isomorphism L : V0 → V1 with ψ0 to obtain a new
feature map ψ1 – feature maps are only specified up to Hilbert space isomorphism.

Proof. Let’s first show that K0 is a PSD kernel. Observe that for all x, x′ ∈ X , using the
symmetry of inner products

K0 (x, x′ ) = ⟨ψ0 (x), ψ0 (x′ )⟩V0 = ⟨ψ0 (x′ ), ψ0 (x)⟩V0 = K0 (x′ , x),

so K0 is symmetric. For PSD-ness, take X1 , . . . , Xn ∈ X , and α ∈ Rn , and use the


bi-linearity and non-negativity of inner products to observe that
Σ_{i,j=1}^n αi αj K0 (Xi , Xj ) = Σ_{i,j=1}^n αi αj ⟨ψ0 (Xi ), ψ0 (Xj )⟩_{V0} = ∥ Σ_{i=1}^n αi ψ0 (Xi ) ∥²_{V0} ⩾ 0.

Hence K0 is a PSD kernel, and so is K1 .


Let W0 , W1 be their associated RKHSs. Define the linear map

T0 : span({ψ0 (x) : x ∈ X }) → span({K0 (x, ·) : x ∈ X })

by

T0(Σ_{i=1}^n αi ψ0(Xi)) := Σ_{i=1}^n αi K0(Xi, ·).
It isn’t hard to see that T0 is well-defined. We next check that T0 is a bijection. It is
clearly a surjection, and to check it is an injection it suffices to show it has trivial kernel.
So suppose X1 , . . . , Xn ∈ X and α ∈ Rn are such that
Σ_{i=1}^n αi K0(Xi, ·) = 0.

Then for all x ∈ X,

⟨Σ_{i=1}^n αi ψ0(Xi), ψ0(x)⟩_{V0} = 0.

But by taking x = Xi, multiplying by αi, and summing, this implies

0 = ∥Σ_{i=1}^n αi ψ0(Xi)∥²_{V0},

so that

Σ_{i=1}^n αi ψ0(Xi) = 0.

Hence T0 has trivial kernel.


Finally, T0 is an isometry:

∥T0(Σ_{i=1}^n αi ψ0(Xi))∥²_{W0} = Σ_{i,j=1}^n αi αj K0(Xi, Xj)
= Σ_{i,j=1}^n αi αj ⟨ψ0(Xi), ψ0(Xj)⟩_{V0}
= ∥Σ_{i=1}^n αi ψ0(Xi)∥²_{V0}.

Hence the closures of the two vector spaces are isomorphic,

span({ψ0(x) : x ∈ X}) ≅ span({K0(x, ·) : x ∈ X}) = W0,

where we used Proposition 4.8 for the equality. Obviously, the same proof works to show

span({ψ1(x) : x ∈ X}) ≅ W1.

The uniqueness statement then follows from Proposition 4.9.

4.4 The take-home message


What Propositions 4.8, 4.9, 4.10 and 4.11 ultimately mean is that we can adopt three
equivalent perspectives when working with kernels. That of PSD kernels, RKHSs, and
feature maps. To go from a PSD kernel to its associated RKHS, we use the construction
from Proposition 4.9. To go in the other direction, from a RKHS to a PSD kernel, we
just take its reproducing kernel as in Proposition 4.8. To go from a PSD kernel and
RKHS pair K, W to a feature map, we use ψ : X → W with ψ(x) := K(x, ·). And to go
from a feature map ψ : X → V to a PSD kernel and RKHS pair, we just take the kernel
K(x, x′ ) := ⟨ψ(x), ψ(x′ )⟩V , and its associated RKHS.
The best perspective depends on what is most useful in your context. In practical
settings, it is usually most helpful to use the PSD kernel perspective, since that is what
you’ll actually be coding. And when doing statistical analysis, the RKHS perspective is
very useful for proving very clean bounds. There are many, many PSD kernels to choose
from; the choice of one or another becomes a matter of domain knowledge, trial and error,
statistical considerations, and other factors.

4.5 The representer theorem


The following remarkable fact greatly strengthens our earlier discussion of the kernel trick
in the context of the Perceptron algorithm. It shows that the problems that arise in
ERM over a (potentially infinite dimensional) RKHS can be turned into completely finite

dimensional problems over Rn . As you can see, the crucial step in the proof comes from
the reproducing property f (x) = ⟨K(x, ·), f ⟩W , giving an example of a situation where
this formulation of the kernel concept is most useful.

Theorem 4.12 (Representer theorem). Suppose W is a RKHS over X with reproducing kernel K, F : Rn → R is any function, and λ > 0. Fix any X1, . . . , Xn ∈ X, and let Kn := (K(Xi, Xj))_{i,j=1}^n. Then

inf_{f∈W : ∥f∥_W ⩽ λ} F(f(X1), . . . , f(Xn)) = inf_{α∈Rn : α^T Kn α ⩽ λ²} F(Kn α).

Proof. For α ∈ Rn, write gα := Σ_{i=1}^n αi K(Xi, ·) ∈ W, and let

Wn := {gα : α ∈ Rn}.
Then Wn is a closed linear subspace of W , and so has an orthogonal complement Wn⊥ , so
that each f ∈ W can be written uniquely as f = f ∥ + f ⊥ , for f ∥ ∈ Wn and f ⊥ ∈ Wn⊥ . By
the reproducing property, for each i ∈ [n],

f (Xi ) = ⟨K(Xi , ·), f ⟩W = ⟨K(Xi , ·), f ∥ ⟩.

Therefore

inf_{f∈W : ∥f∥_W ⩽ λ} F(f(X1), . . . , f(Xn)) = inf_{f∈W : ∥f∥_W ⩽ λ} F(f∥(X1), . . . , f∥(Xn))
= inf_{f∈W : ∥f∥∥_W ⩽ λ} F(f∥(X1), . . . , f∥(Xn))
= inf_{f∈Wn : ∥f∥_W ⩽ λ} F(f(X1), . . . , f(Xn)),

where the second-to-last equality follows because ∥f ∥2W = ∥f ∥ ∥2W + ∥f ⊥ ∥2W , and since the
objective only depends on f ∥ , minimization over ∥f ∥2W ⩽ λ2 is therefore equivalent to
minimization over ∥f ∥ ∥2W ⩽ λ2 . Finally, observe that f ∈ Wn if and only if there exists
α ∈ Rn such that
f = Σ_{i=1}^n αi K(Xi, ·).

Thus f(Xj) = (Kn α)_j, the jth element of the matrix-vector product Kn α. The constraint ∥f∥_W ⩽ λ can then be written

∥f∥_W ⩽ λ  ⟺  ∥f∥²_W ⩽ λ²  ⟺  Σ_{i,j=1}^n αi αj K(Xi, Xj) ⩽ λ².

Hence we can simply minimize over α ∈ Rn, and write the constraint as α^T Kn α ⩽ λ².
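As a minimal sketch of how Theorem 4.12 is used in practice (Python/NumPy; the hinge loss and the penalized rather than norm-constrained form are illustrative choices, not prescribed by the theorem), any such ERM objective over W becomes an n-dimensional function of α through the Gram matrix alone:

```python
import numpy as np

def kernel_erm_objective(alpha, K, y, lam):
    """Penalized kernel ERM written purely in terms of alpha, via the representer
    theorem: f = sum_i alpha_i K(X_i, .), so (f(X_1), ..., f(X_n)) = K @ alpha and
    ||f||_W^2 = alpha^T K alpha.  Illustrative loss: the hinge loss of Section 5."""
    f_vals = K @ alpha                                  # values of f at the data
    hinge = np.maximum(0.0, 1.0 - y * f_vals).mean()    # (1/n) sum phi(-y_i f(X_i))
    return hinge + lam * alpha @ (K @ alpha)            # + lam * ||f||_W^2

# Minimizing this over alpha in R^n is a finite-dimensional problem,
# regardless of the (possibly infinite) dimension of the RKHS W.
```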

5 Support vector machines


If we apply the kernel trick and the data becomes separable, then the Perceptron algo-
rithm will return a solution to the associated ERM problem. And this ERM solution will
(assuming we chose a reasonable kernel) be feasible to compute, with complexity scaling

only with the number of sample points n and the data dimension d, rather than the feature
space dimension. But what about the statistical performance of such a procedure?
In general, the statistical performance could be quite bad. Observe that the Perceptron
using the PSD kernel K and associated RKHS W outputs an element of the set of half-
spaces of the RKHS W :

HW := {v ∈ W 7→ sgn(⟨w, v⟩W ) : w ∈ W }.

So we are effectively looking at classification over halfspaces in the RKHS W , and the
VC dimension of that class scales as dim(W ), which is often infinite. To handle this
problem, we need to use some structural condition on the resulting classification problem
over halfspaces in W .
The support vector machine paradigm (SVM) addresses this problem by considering a
natural condition known as the margin condition. As we will eventually see, this condition
will allow us to get totally dimension-free rates.

5.1 Margin
In this section, we’ll work for simplicity over X = Rd , but the discussion will apply as
well to RKHSs over X . We will thus be interested in results which don’t depend on the
dimension of the space.
Suppose we have data D = ((X1 , Y1 ), . . . , (Xn , Yn )) ∈ (Rd × {−1, 1})n that is linearly
separable. Often, there will be many linear halfspaces which perfectly classify the data.
The idea of support vector machines is that, in general, we should prefer halfspaces which
separate the points by a lot rather than halfspaces which separate the points by a little.
The notion of margin makes this idea precise.
Definition 5.1 (Margin). Given data D = ((X1, Y1), . . . , (Xn, Yn)) we define the margin margD as follows. If w ∈ Rd correctly classifies D, then

margD(w) := min_{i∈[n]} Yi ⟨w/∥w∥, Xi⟩.

If w ∈ Rd doesn't correctly classify D, then we put margD(w) = −∞.


The first thing to notice about this definition is that it makes sense: because we can always replace w by λw for λ > 0 and retain the same classifier, our definition must be (and is) invariant to such scalings.
We next verify that this notion of margin is equivalent to the distance to the closest
point on the hyperplane defined by w, which we will write Hw := {z ∈ Rd : ⟨z, w⟩ = 0}.
It is also convenient to define the notation

d(z, Hw) := min_{z′∈Hw} ∥z − z′∥.

Lemma 5.2 (Margin is distance from hyperplane to nearest point). Suppose w ∈ Rd correctly classifies D = ((X1, Y1), . . . , (Xn, Yn)). Then

margD(w) = min_{i∈[n]} d(Xi, Hw).

Proof. Without loss of generality, we may assume ∥w∥ = 1. We first show that
d2 (z, Hw ) = ⟨z, w⟩2 (5.1)
Indeed, note that z − ⟨z, w⟩w ∈ Hw , so
d2 (z, Hw ) ⩽ ∥z − (z − ⟨z, w⟩w)∥2 = ⟨z, w⟩2 .
On the other hand, for any z ′ ∈ Hw
⟨z, w⟩2 = ⟨z − z ′ , w⟩2 ⩽ ∥z − z ′ ∥2 .
Hence (5.1) holds.
Now, notice that, since w correctly classifies D, for all i ∈ [n], we have sgn(⟨Xi , w⟩) =
Yi . Hence for all i ∈ [n],
|⟨Xi , w⟩| = sgn(⟨Xi , w⟩)⟨Xi , w⟩ = Yi ⟨Xi , w⟩.
Taking the minimum over i and using (5.1) yields the result.
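As a small illustration (Python/NumPy; the function and variable names are ours, not from the notes), the margin of Definition 5.1 is computed directly from the normalized scores:

```python
import numpy as np

def margin(w, X, y):
    """marg_D(w) of Definition 5.1: min_i y_i <w/||w||, x_i>, and -infinity when w
    does not correctly classify every point of D."""
    scores = y * (X @ w) / np.linalg.norm(w)
    return scores.min() if np.all(scores > 0) else -np.inf
```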

5.2 Hard-SVM
Let some linearly separable data D = ((X1 , Y1 ), . . . , (Xn , Yn )) be fixed. In the SVM
paradigm, we prefer hyperplanes with large margin over hyperplanes with small margin.
This leads us to solve the following optimization problem, known as the Hard-SVM problem:

max_{w∈Rd} margD(w).    (5.2)

However, it is not clear that this problem can be solved efficiently in this form. In fact, we can re-write it in the following equivalent form, which is more suitable to optimization algorithms:

min {∥w∥² : w ∈ Rd, Yi⟨w, Xi⟩ ⩾ 1, ∀i ∈ [n]}.    (5.3)
Problem (5.3) is an example of a quadratic program, for which efficient solvers exist.
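For instance, here is a minimal sketch of (5.3) using a generic convex-optimization modeling package (cvxpy; the package and the variable names are an assumption of this illustration, not part of the notes):

```python
import numpy as np
import cvxpy as cp

def hard_svm(X, y):
    """Solve (5.3): minimize ||w||^2 subject to y_i <w, x_i> >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    constraints = [cp.multiply(y, X @ w) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value
```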
Proposition 5.3 (Equivalence of both formulations of Hard-SVM). Suppose there exists
w ∈ Rd which correctly classifies D. Then any solution of (5.3) is a solution of (5.2).
And any solution of (5.2) can be re-scaled to become a solution of (5.3).
Proof. Suppose w0 is a solution of (5.2) and w1 is a solution of (5.3). Then

margD(w0) ⩾ margD(w1) ⩾ 1/∥w1∥.

On the other hand,

∥w1∥² ⩽ ∥w0 / min_{i∈[n]} Yi⟨w0, Xi⟩∥² = 1/margD(w0)² ⩽ ∥w1∥².

Hence each inequality holds with equality, completing the proof.

One of the main reasons for our interest in the margin condition is it allows us to
prove statistical rates of convergence for the support vector machine problem which do
not depend on the dimension of the feature space (d in our notation), but instead on the
margin and the scale of the dataset.

5.3 Soft-SVM
The “hard” in “Hard-SVM” refers to the assumption that the data set is linearly separa-
ble. It is often more useful to consider a relaxation of Hard-SVM to allow for imperfect
classification – the problem may not be linearly separable, or even if it is, outputting a
perfect classifier makes the model much more sensitive to outliers. We’ll now define this
relaxed problem. Starting from (5.3), let us introduce slack variables that allow for violations of the constraints Yi⟨w, Xi⟩ ⩾ 1, to form

min {λ∥w∥² + (1/n) Σ_{i=1}^n ζi : w ∈ Rd, ζ ∈ R^n_{⩾0}, Yi⟨w, Xi⟩ ⩾ 1 − ζi, ∀i ∈ [n]},

where λ > 0 is a parameter that controls the amount of penalization associated to the average constraint violation. Observe that we can equivalently write this problem, using the hinge loss φ(z) := max(0, 1 + z), as

min_{w∈Rd} (1/n) Σ_{i=1}^n φ(−Yi⟨w, Xi⟩) + λ∥w∥².    (5.4)

This is the Soft-SVM problem; you may also recognize it as a one hidden-layer neural
network with ℓ2 regularization and ReLU activations.
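A minimal sketch of solving (5.4) by subgradient descent (Python/NumPy; the 1/(λt) step-size schedule and iteration count are illustrative choices):

```python
import numpy as np

def soft_svm(X, y, lam=0.1, n_iters=1000):
    """Subgradient descent on (5.4):
    (1/n) sum_i max(0, 1 - y_i <w, x_i>) + lam ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        active = y * (X @ w) < 1                    # margin violations
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
        w -= grad / (lam * t)                       # illustrative 1/(lam t) step size
    return w
```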

5.4 Statistical analysis of soft-SVM


Notice that both the soft-SVM and hard-SVM problems are methods for searching for a
linear classifier based on the data. If we compare to the empirical risk minimizer approach,
the hard-SVM is merely distinguishing a specific empirical risk minimizer, while the soft-
SVM problem may not return an empirical risk minimizer at all. However, whatever
approach we take, we know that the VC dimension of linear classifiers equals the dimension
of the space, and as we discussed in Section 2, the VC dimension controls the worst-case
complexity of statistical learning. In other words, when we do linear classification in high-
dimensional feature spaces, such as those arising from the kernel method, there is always
some probability distribution for which we pay the curse of dimensionality independent of
the learning algorithm.
Therefore, to develop any non-trivial statistical theory for linear classification in high
dimension, we’ll need to make some structural assumptions on the data distribution. The
notion of margin, defined earlier, turns out to be a very convenient assumption for this
purpose. In particular, we will consider data distributions which satisfy the following
assumption.

Definition 5.4 ((γ, D)-margin assumption). We say that P satisfies the (γ, D)-margin
assumption if there exists w⋆ ∈ W such that ∥w⋆ ∥W ⩽ γ, and for P-almost every (x, y),
we have ∥x∥ ⩽ D and 1 ⩽ y⟨w⋆ , K(x, ·)⟩W .

This condition is enough to permit a satisfying statistical analysis of both the Soft-
SVM and Hard-SVM estimators. For reasons of space, we defer the analysis of the latter
problem to Appendix C, and here just provide the analysis of the soft-SVM problem.

We recall our notation for binary classification: for each w ∈ W , we define the clas-
sifier hw (x) := sgn(⟨w, x⟩), and for every classifier h : W → {−1, 1} we define R(h) :=
P(X,Y )∼P [h(X) ̸= Y ] as well as the excess risk

E(h) := R(h) − R(h⋆ ),

where h⋆ is the Bayes classifier, h⋆(x) := sgn(2P[Y = 1 | X = x] − 1).

Theorem 5.5 (Generalization of soft-SVM). Suppose W is an RKHS and P is a probability distribution supported on W × {−1, 1} which satisfies the (γ, D)-margin assumption. Suppose D = ((X1, Y1), . . . , (Xn, Yn)) is an iid sample from P of size n, and let ŵ solve the associated soft-SVM problem (5.4) with parameter λ > 0. Then

E[E(hŵ)] ⩽ 2D²/(λn) + λγ².
We remark that the above can be extended to bounds in high probability via the
bounded differences inequality, but we omit this for simplicity. The proof is based on a
more general technique, known as a stability argument; see [6, Chapter 13].

Proof. We begin by observing that R(h⋆ ) = 0 by the margin assumption. Hence

E(hŵ ) = R(hŵ ).

Writing the hinge loss φ(z) := max(0, 1 + z), notice that for any (X, Y ) ∈ W × {−1, 1},
we have
1[hw (X) ̸= Y ] ⩽ φ(−Y ⟨w, X⟩).
Hence
R(hŵ ) ⩽ E(X,Y )∼P [φ(−Y ⟨ŵ, X⟩)].
Now, let (X1′ , Y1′ ) ∼ P be a draw from P, independent of D. Define w′ to be the solution
of the soft-SVM problem with just the first sample changed, namely
w′ := arg min_{w∈W} (1/n) φ(−Y1′⟨w, X1′⟩) + (1/n) Σ_{i=2}^n φ(−Yi⟨w, Xi⟩) + λ∥w∥².

Recall that the hinge loss φ(z) is 1-Lipschitz, so that

ED,(X1′ ,Y1′ ) [φ(−Y1 ⟨w′ , X1 ⟩) − φ(−Y1 ⟨ŵ, X1 ⟩)] ⩽ ED,(X1′ ,Y1′ ) [|⟨w′ − ŵ, X1′ ⟩|]
⩽ DED,(X1′ ,Y1′ ) [∥ŵ − w′ ∥]. (5.5)

Now, consider the function


L̂(w) := (1/n) Σ_{i=1}^n φ(−Yi⟨w, Xi⟩) + λ∥w∥².

Then one can show (exercise) that for all w ∈ W ,

λ∥w − ŵ∥2 ⩽ L̂(w) − L̂(ŵ).

But by definition of w′ we have

L̂(w′) − L̂(ŵ) ⩽ (1/n) φ(−Y1⟨w′, X1⟩) − (1/n) φ(−Y1′⟨w′, X1′⟩) + (1/n) φ(−Y1′⟨ŵ, X1′⟩) − (1/n) φ(−Y1⟨ŵ, X1⟩).
Applying the fact that φ is 1-Lipschitz again yields

λ∥w′ − ŵ∥² ⩽ (2D/n) ∥ŵ − w′∥   ⟹   ∥w′ − ŵ∥ ⩽ 2D/(λn).
Plugging into (5.5) yields

E_{D,(X1′,Y1′)}[φ(−Y1⟨w′, X1⟩) − φ(−Y1⟨ŵ, X1⟩)] ⩽ 2D²/(λn).
Finally, we observe that, by symmetry, for all i ∈ [n] we have

ED [φ(−Y1 ⟨ŵ, X1 ⟩)] = ED [φ(−Yi ⟨ŵ, Xi ⟩)].

Whence

E_D[φ(−Y1⟨ŵ, X1⟩)] = E_D[(1/n) Σ_{i=1}^n φ(−Yi⟨ŵ, Xi⟩)]
⩽ E_D[(1/n) Σ_{i=1}^n φ(−Yi⟨w⋆, Xi⟩) + λ(∥w⋆∥² − ∥ŵ∥²)]
= E_D[λ(∥w⋆∥² − ∥ŵ∥²)] ⩽ λ∥w⋆∥².

The result follows by observing that

ED,(X,Y )∼P [φ(−Y ⟨ŵ, X⟩)] = ED,(X1′ ,Y1′ ) [φ(−Y1 ⟨w′ , X1 ⟩)]
= ED,(X1′ ,Y1′ ) [φ(−Y1 ⟨w′ , X1 ⟩) − φ(−Y1 ⟨ŵ, X1 ⟩)]
+ ED [φ(−Y1 ⟨ŵ, X1 ⟩)],

and applying the previous two bounds.

6 Ensemble methods
Another important paradigm is that of ensemble methods. The idea here is that we can
sometimes combine multiple classifiers into one classifier which achieves better performance
than each individual classifier does on its own. Suppose that we have K fixed classifiers, h1, . . . , hK : X → {−1, 1}. An ensemble classifier, based on these K classifiers, is an element of

{hw(x) := sgn(Σ_{k=1}^K wk hk(x)) : w ∈ R^K}.
It is thus a certain halfspace in the coordinates of the classifiers h1 , . . . , hK . There are
many different methods in this family, we’ll just briefly mention a couple here.

6.1 Majority vote
A very simple way of combining multiple classifiers is through majority vote. Here, we train each hk on the data D = ((X1, Y1), . . . , (Xn, Yn)), and then output the classifier

x ↦ sgn(Σ_{k=1}^K hk(x)).

6.2 Random forests


In random forests, the classifiers h1 , . . . , hK are obtained in the following way: for each
k ∈ [K],

1. Select a random size m subset of the n data points with replacement.

2. Grow a decision tree hk on these m points, where at each node you choose a random
subset of ℓ feature dimensions to find the best split on.

The final classifier is then the majority vote of h1 , . . . hK .
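A minimal sketch of this procedure (Python; the individual trees and the per-split feature subsampling are delegated to scikit-learn's DecisionTreeClassifier and its max_features argument, which is an assumption of this illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumed base tree learner

def random_forest(X, y, K=50, m=None, n_feats="sqrt", seed=0):
    """Bootstrap each tree on m points drawn with replacement, restrict each split
    to a random feature subset (via max_features), then majority-vote."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = n if m is None else m
    trees = []
    for k in range(K):
        idx = rng.integers(0, n, size=m)                    # sample with replacement
        tree = DecisionTreeClassifier(max_features=n_feats, random_state=k)
        trees.append(tree.fit(X[idx], y[idx]))
    def predict(X_new):
        votes = np.sum([t.predict(X_new) for t in trees], axis=0)
        return np.where(votes >= 0, 1, -1)                  # labels in {-1, +1}
    return predict
```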

6.3 AdaBoost
The AdaBoost algorithm is a famous method for producing the K classifiers h1 , . . . , hK .
Here, it is assumed that we have access to an algorithm which produces weak learners.

Definition 6.1 (Weak learner). Given ε ∈ (0, 1/2], a dataset D = ((X1, Y1), . . . , (Xn, Yn)) ∈ (X × {−1, 1})^n and a weight vector µ := (µ1, . . . , µn) ∈ R^n_{⩾0}, we say that a classifier h : X → {−1, 1} is an ε-weak learner if

Σ_{i=1}^n µi 1[h(Xi) ≠ Yi] ⩽ 1/2 − ε.

For example, a common weak learning algorithm in practice is decision stumps, which
corresponds to a depth 1 decision tree (of course, this may not be a true weak learner in
the formal sense above, but it is nonetheless used in practice).
The AdaBoost algorithm assumes access to an algorithm which returns an ε-weak
learner for any weight vector µ, and follows the steps

1. Initialize µ^(1) = (1/n)1.

2. For t = 1, . . . , T:

   a. Let ht be an ε-weak learner for the dataset D and weight vector µ^(t).

   b. Put εt := Σ_{i=1}^n µi^(t) 1[ht(Xi) ≠ Yi].

   c. Put wt := (1/2) log(1/εt − 1).

   d. Set µi^(t+1/2) := µi^(t) · e^{−wt Yi ht(Xi)}.

   e. Let µ^(t+1) := µ^(t+1/2)/⟨1, µ^(t+1/2)⟩.

3. Return the weighted ensemble classifier

   hw : x ↦ sgn(Σ_{t=1}^T wt ht(x)).

So the AdaBoost classifier iteratively up-weights data points that are mis-classified
(where Yi ht (Xi ) < 0) and down-weights data points that are correctly classified (where
Yi ht (Xi ) ⩾ 0).
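A minimal sketch of the algorithm above (Python/NumPy; the weak_learner interface, returning an object with a predict method valued in {−1, +1}, is an assumption of this illustration):

```python
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """AdaBoost as described above; weak_learner(X, y, mu) is assumed to return a
    classifier with a .predict method taking values in {-1, +1}."""
    n = len(y)
    mu = np.full(n, 1.0 / n)                       # mu^(1)
    hs, ws = [], []
    for t in range(T):
        h = weak_learner(X, y, mu)
        pred = h.predict(X)
        eps = float(np.sum(mu * (pred != y)))      # weighted error eps_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)      # numerical guard
        w_t = 0.5 * np.log(1.0 / eps - 1.0)
        mu = mu * np.exp(-w_t * y * pred)          # up-weight misclassified points
        mu = mu / mu.sum()
        hs.append(h)
        ws.append(w_t)
    def predict(X_new):
        score = sum(w * h.predict(X_new) for w, h in zip(ws, hs))
        return np.where(score >= 0, 1, -1)
    return predict
```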
We have the following result.

Theorem 6.2. Suppose the AdaBoost algorithm is run for T steps with an algorithm that always returns an ε-weak learner for any weight vector µ. Let the output be hw. Then

(1/n) Σ_{i=1}^n 1[hw(Xi) ≠ Yi] ⩽ e^{−2ε²T}.

Proof. Assigned reading [6, Chapter 10].

7 Neural networks
We will use the following notation for a neural network: let (V, E, w) be a directed, weighted graph, and suppose the vertex set V can be split into disjoint sets of vertices, V = ∪_{ℓ=0}^L Vℓ, such that if e ∈ E then e = (v, v′) for v ∈ Vℓ and v′ ∈ Vℓ+1 for some ℓ ∈ {0, 1, . . . , L − 1}. We'll write the vertices in layer ℓ as Vℓ = {vℓ^1, . . . , vℓ^{|Vℓ|}}. The
number L is called the number of layers of the neural network. The integers din := |V0 |
and dout := |VL | are called the input and output dimensions of the network. We’ll think of
the directed graph (V, E) as fixed throughout training, and this is often called the network
architecture.
The parameters w are the weights, which we can think of either as a function w : E → R,
or as a collection of matrices (W1 , . . . , WL ) where each Wℓ ∈ R|Vℓ |×|Vℓ−1 | and is fixed to
have entry (v, v ′ ) to be 0 if (v, v ′ ) ̸∈ E, along with biases (b1 , . . . , bL ) such that bℓ ∈ R|Vℓ | .
It is these weights that are adjusted through training to eventually fit the model.
We assume as given an activation function σ : R → R, for example a ReLU unit
(σ(τ ) := max(0, τ )), a logistic unit (σ(τ ) = (1 + e−τ )−1 ) or something else.
The network then generates output via a forward pass. Given an input x ∈ R^din, we set o0(x) = (x1, . . . , x_din), and then for subsequent layers we put aℓ(x) = Wℓ oℓ−1(x) + bℓ and oℓ(x) := σ(aℓ(x)). The output of the network is, finally, oL(x).
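A minimal sketch of this forward pass (Python/NumPy; the ReLU activation is an illustrative choice), caching the aℓ and oℓ since they are reused by backpropagation below:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases, sigma=relu):
    """Forward pass in the notation above: a_l = W_l o_{l-1} + b_l, o_l = sigma(a_l).
    weights = (W_1, ..., W_L), biases = (b_1, ..., b_L); returns o_L(x) together
    with the cached a_l and o_l."""
    o = np.asarray(x, dtype=float)
    activations, outputs = [], [o]          # outputs[0] = o_0(x) = x
    for W, b in zip(weights, biases):
        a = W @ o + b
        o = sigma(a)
        activations.append(a)
        outputs.append(o)
    return o, activations, outputs
```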

7.1 VC dimension of neural networks with sign activation


Fix the network architecture (V, E), and let the neural network with weight function w
be hw . We’ll take the activation function σ(τ ) := sgn(τ ), which is 1 if τ ⩾ 0 and −1 if
τ < 0. This is a simple case, but allows us to get a good idea of the basic theory.

Following our standard approach in this class, we can imagine that we want to fit some
distribution P over features x ∈ Rdin and outputs y ∈ {−1, 1} with a loss ℓ, using our
neural network. The population problem is then

min R(w) := E(x,y)∼P [ℓ(hw (x), y)].


w

But we don’t have direct access to this population problem, and instead have some training
data D := ((x1 , y1 ), . . . , (xn , yn )), so we consider the empirical problem
n
1X
min Rn (w) := ℓ(hw (xi ), yi )
w n
i=1

It is then natural to ask: what is the VC dimension of H := {hw : w ∈ R|E|+|V | }?


The following fact has a natural extension to signed networks with dout > 1, but we
will avoid this since we would have to talk about a generalized notion of VC dimension.

Proposition 7.1 (VC dimension of neural nets with sign activation). Suppose the network architecture (V, E) is such that dout = 1 and the activation function σ = sgn, so that H is a family of binary classifiers from the feature space R^din to {−1, 1}. Then

VCdim(H) ⩽ O((|E| + |V|) log(|E| + |V|)).

Proof. Recall the definition of the shattering coefficients:

S_H(n) := max_{x1,...,xn ∈ R^din} |{(hw(x1), . . . , hw(xn)) : w ∈ R^{|E|+|V|}}|.

To upper bound these shattering coefficients, we will use the following fact.
Proposition 7.2. Suppose H0^1, H0^2, . . . , H0^K are each sets of functions from some domain X to {−1, 1} and H1 is a set of functions from {−1, 1}^K to {−1, 1}. Let

H1 ◦ H0 := {h1(h0^1, . . . , h0^K) : h1 ∈ H1, h0^1 ∈ H0^1, . . . , h0^K ∈ H0^K}.

Then

S_{H1◦H0}(n) ⩽ S_{H1}(n) ∏_{k=1}^K S_{H0^k}(n).

Now, let Hℓ^k for ℓ = 0, . . . , L and k = 1, . . . , |Vℓ| be the class of functions which map z ↦ σ(Wℓ,k z + bℓ,k), where Wℓ,k indicates the kth row of the matrix Wℓ and bℓ,k the kth entry of bℓ (this is just the output of the neuron vℓ^k ∈ Vℓ). Using the above notation, we have

H = HL ◦ · · · ◦ H0.

Applying Proposition 7.2 recursively with these Hℓ^k we bound the shattering coefficients as

S_H(n) ⩽ ∏_{ℓ=0}^L ∏_{k=1}^{|Vℓ|} S_{Hℓ^k}(n).

But notice that we can directly compute the VC dimension of the classes Hℓk since
they are a halfspace predictor with dimension determined by the network architecture. In
particular, let the number of edges that go into vℓk , be deg(vℓk ), so that

VCdim(Hℓk ) = deg(vℓk ) + 1.

Using the Sauer-Shelah Lemma with the above product inequality, we obtain that for n > deg(vℓ^k) + 1,

S_H(n) ⩽ ∏_{ℓ=0}^L ∏_{k=1}^{|Vℓ|} (en/(deg(vℓ^k)+1))^{deg(vℓ^k)+1} ⩽ ∏_{ℓ=0}^L ∏_{k=1}^{|Vℓ|} (en)^{deg(vℓ^k)+1} = (en)^{|E|+|V|}.

Observe that if VCdim(H) < |E| + |V | + 2, then we are done. Otherwise, we can apply
the above bound with n = VCdim(H) (valid because in this case n = VCdim(H) ⩾
|E| + |V | + 2 > VCdim(Hℓk )) to observe that

2^{VCdim(H)} ⩽ (e VCdim(H))^{|E|+|V|}.    (7.1)

We conclude with the following observation, whose proof is included at the end of these
notes.
Proposition 7.3. Suppose a > 1 and k is a positive integer. If x > 1 is such that

a^x ⩽ x^k,

then

x ⩽ 2 (k/log a) log(k/log a) + 5.

We can apply this fact to the inequality (7.1) with a = 2^{1/e}, x = e VCdim(H) and k = |E| + |V| to finally obtain the result.

We won’t prove it here, but similar results hold for networks with sigmoid activation
functions [6, Chapter 20]. For completeness, we include a proof of Proposition 7.3 here.

Proof of Proposition 7.3. Taking the logarithm and re-arranging yields

x/log x ⩽ k/log a.    (7.2)

If k/log a < 2, then (7.2) implies x < 2 log x, which has no solution for x > 1, so we may assume k/log a ⩾ 2. Suppose for the sake of contradiction that

x > 2 (k/log a) log(k/log a) + 5.

Then since x > 5, the function x/log x is increasing in x, so that

x/log x > [2 (k/log a) log(k/log a)] / log(2 (k/log a) log(k/log a)).    (7.3)

We claim that

log(2 log(k/log a)) ⩽ log(k/log a).

Indeed, if not, it reduces once more to the claim τ < 2 log τ with τ = k/log a ⩾ 2, which has no solution. So the previous inequality holds, and plugging it into (7.3) yields

x/log x > [2 (k/log a) log(k/log a)] / [log(k/log a) + log(2 log(k/log a))]
⩾ [2 (k/log a) log(k/log a)] / [log(k/log a) + log(k/log a)]
= k/log a,

contradicting (7.2) and yielding the result.

7.2 Training neural networks with backpropagation


The above VC dimension estimate is fairly encouraging. Due to the fundamental theorem
of statistical learning, it implies that the empirical risk minimizer over H will converge so long as we have n = Ω((|E|+|V|) log(|E|+|V|)) samples. Unfortunately, the computational
picture – at least theoretically speaking – is significantly worse: it can be shown that there
is a two-layer neural network with sign activations for which it is NP-hard to solve the
ERM problem over even a two-layer network [6, Chapter 20].

7.3 Brief introduction to first-order methods


In practice, the ERM problem for neural networks is solved through various forms of
stochastic gradient descent. It is interesting to note that in practice, the ERM problem
can often be solved with SGD-based methods (to 0 error!), suggesting that the NP-hardness
result is overly pessimistic. Here’s a brief reminder on gradient descent methods.
Suppose we want to solve the minimization problem

min f (θ).
θ∈Rd

The function f is called the objective. Then gradient descent uses the iteration, for some
initial point θ0 ,
θt+1 = θt − ηt ∇f (θt ),

where ηt ∈ R is a learning rate that is specified by the user (typically a constant in the
case of gradient descent proper).
An objective f that arises frequently in statistics and machine learning is of the form

f(θ) = (1/n) Σ_{i=1}^n ℓ(hθ(xi), yi),

where ℓ is a loss function and (hθ)θ∈Θ is a parametrized model class. Indeed, this objective corresponds to an empirical loss. When n is large, computing the full gradient ∇f(θ) requires a pass over all n data points at every iteration, which can be costly. A work-around for this issue, which has additional attractive properties that we won't discuss now, is called stochastic gradient descent (SGD).
Here, we take a uniformly random i ∈ [n], and use it to form a stochastic gradient

∇̂f(θ) := ∇θ ℓ(hθ(xi), yi).

Observe that, over the randomness of this uniform random sample, E[∇̂f(θ)] = ∇f(θ).
It is thus plausible that we can use this stochastic gradient as a replacement for the full
gradient ∇f (θ), and run a stochastic version of the gradient descent algorithm. This is
called stochastic gradient descent, and uses the updates

θ_{t+1} = θt − ηt ∇̂f(θt),

where, importantly, we use fresh randomness at each iteration to form the stochastic gradient ∇̂f(θt).
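A minimal sketch of this update (Python/NumPy; the constant learning rate and the grad_i interface returning the gradient of the i-th term are illustrative assumptions):

```python
import numpy as np

def sgd(grad_i, theta0, n, n_iters=10_000, lr=0.01, seed=0):
    """Plain SGD for an average of n terms: at each step draw i uniformly at random
    and move along -grad_i(theta, i), an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for t in range(n_iters):
        i = rng.integers(n)                # fresh randomness at every iteration
        theta -= lr * grad_i(theta, i)
    return theta
```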
In the next class we will develop significantly more material about first-order methods.

7.4 Backpropagation
The fundamental thing we need to implement first-order methods is the gradient of the
objective. Backpropagation is an efficient method to calculate the gradients of the weights
in a neural-network, famously invented by Rumelhart, Hinton, and Williams in 1986 [5].
Backpropagation can be thought of as the opposite of the forward pass that is used to
compute the output of the neural network.
To describe backpropagation, let us again take a given network architecture (V, E) as fixed, and let hw : R^din → R^dout denote the network with weights w. For simplicity, we'll work without any biases, but the extension to them is quite simple. We'll consider the squared norm loss, where we aim to minimize the empirical risk

R(w) := (1/n) Σ_{i=1}^n (1/2) ∥hw(xi) − yi∥².

Since we are working without biases, the network parameters are exactly the weight matrices (W1, . . . , WL), and we thus need to compute gradients of the form

∇_{Wℓ} (1/2) ∥hw(x) − y∥²

for a fixed pair (x, y).
The back-prop algorithm works in the following way: first, you do a forward pass, keeping track of the aℓ and oℓ (we'll suppress mention of x for notational ease) as you go. Then, the backward pass begins. For this pass, you initialize δL := (oL − y)^T and then recursively put δℓ = δℓ+1 diag(σ′(aℓ+1)) Wℓ+1, where σ′(aℓ+1) is the vector formed by applying the derivative of the activation element-wise to the vector aℓ+1, and diag is the diagonal matrix of the vector. You finally return (δℓ^T ⊙ σ′(aℓ)) oℓ−1^T as the gradient for the Wℓ parameters, where we define u ⊙ v to be the element-wise product of vectors u and v. (Note that in class I stated an incorrect formula for the output of the back-propagation algorithm; this version is correct.) One magical thing about back-propagation is that it allows us to obtain gradients with respect to matrix-valued parameters without doing any matrix-matrix multiplications, and indeed only using matrix-vector products. This is an extremely important point in practice.
To prove that this is correct, the following matrix calculus fact will be quite helpful.
Proposition 7.4. Fix some v ∈ Rm and let T : Rn×m → Rn be the map defined on
matrices as T (A) = Av. If g : Rn → R is a scalar valued function, then
∇(g ◦ T )[A] = ∇g[T (A)]T v T .
Proof. For a map h : Rk → Rl we use the notation dhb to denote the Jacobian of h at some
b ∈ Rk . The chain rule then implies
d(g ◦ T )A = dgT (A) dA T.

Put uT = dgT (A) , and note that

uT dA T = dA uT T = dA (uT Av) = uv T .
The result follows.

Let us move on to deriving backpropagation. To calculate the gradient with respect to Wℓ we need to keep all of the other weight matrices fixed and take the gradient of

Wℓ ↦ (1/2) ∥hw(x) − y∥².

Write this function as αℓ(Wℓ). Since every layer before and after the ℓth layer is fixed, we can write αℓ(Wℓ) = βℓ(σ(Wℓ oℓ−1)) for βℓ representing the rest of the network after the ℓth layer – in particular βL(z) = (1/2)∥z − y∥². Proposition 7.4 then implies

∇Wℓ αℓ (Wℓ ) = ∇(βℓ ◦ σ)[aℓ ]T oTℓ−1 ,


Applying the chain rule we arrive at
∇Wℓ αℓ (Wℓ ) = diag(σ ′ (aℓ ))∇βℓ (oℓ )T oTℓ−1 (7.4)
To finish, we just need to calculate the vectors ∇βℓ(oℓ). Rather than an explicit form, we will derive a recursive equation that lies at the heart of the backpropagation algorithm. For this, we can use the fact that

βℓ(z) = βℓ+1(σ(Wℓ+1 z)),

which implies that

∇βℓ(oℓ) = ∇βℓ+1(oℓ+1) diag(σ′(aℓ+1)) Wℓ+1.

This is the precise recursion we implement in backpropagation, beginning with ∇βL(oL) = (oL − y)^T. Putting things together, we find that (7.4) becomes

∇_{Wℓ} αℓ(Wℓ) = diag(σ′(aℓ)) δℓ^T oℓ−1^T.

Since diag(u)v = u ⊙ v, we conclude our derivation of backpropagation.
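A minimal sketch of the resulting algorithm (Python/NumPy; no biases, a single pair (x, y), and the activation σ and its derivative passed as functions, all as in the derivation above):

```python
import numpy as np

def backprop(x, y, weights, sigma, sigma_prime):
    """Gradients of (1/2)||h_w(x) - y||^2 with respect to each W_l (no biases),
    implementing the recursion derived above:
      delta_L = (o_L - y),
      delta_{l-1} = (delta_l * sigma'(a_l)) @ W_l,
      grad_{W_l} = outer(delta_l * sigma'(a_l), o_{l-1})."""
    # forward pass, caching a_l and o_l
    o, acts, outs = np.asarray(x, float), [], [np.asarray(x, float)]
    for W in weights:
        a = W @ o
        o = sigma(a)
        acts.append(a)
        outs.append(o)
    # backward pass
    L = len(weights)
    grads = [None] * L
    delta = o - np.asarray(y, float)                 # delta_L
    for l in range(L - 1, -1, -1):                   # python index l <-> layer l + 1
        s = delta * sigma_prime(acts[l])             # delta_l (.) sigma'(a_l)
        grads[l] = np.outer(s, outs[l])              # gradient for weights[l]
        delta = s @ weights[l]                       # next delta, one layer down
    return grads
```

Note that only matrix-vector products and one outer product per layer are used, matching the remark above.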

8 Basics of convex optimization
For our last lecture, we will consider gradient descent (GD) and stochastic gradient descent
(SGD) as paradigmatic examples of optimization algorithms in machine learning. These
algorithms fall within the field of convex optimization, and we will only scratch the surface
of this fascinating area. I recommend those who are interested in a rapid yet fairly thorough
introduction to the theory read [4]; the references [2, 3] are also worth a look.
The intuitive idea behind gradient descent and related methods is to solve a problem
of the form
min f (θ),
θ∈Rd

by using the Taylor approximation (let’s assume f is twice continuously differentiable for
this discussion)
f (θ) = f (θ0 ) + ⟨∇f (θ0 ), θ − θ0 ⟩ + O(∥θ − θ0 ∥2 ),
to argue that if θ is close to θ0 and θ − θ0 is in the direction of −∇f(θ0), then f(θ) < f(θ0).
Iterating the update θt+1 := θt −ηt ∇f (θt ) for appropriately chosen step-sizes, our objective
value f will decrease. We can then hope that we eventually reach a global optimizer. The
purpose of this lecture will be to describe some mathematical conditions under which this
method works.

8.1 Convex functions and sets


We recall the following definitions.
Definition 8.1 (Convex set). A set C ⊂ Rd is convex if for all x, y ∈ C and t ∈ [0, 1],

(1 − t)x + ty ∈ C.

Definition 8.2 (Convex function). Let C ⊂ Rd be a convex set. Then a function f : C →


R is convex if for all x, y ∈ C and t ∈ [0, 1],

f ((1 − t)x + ty) ⩽ (1 − t)f (x) + tf (y).

Throughout this section let us fix a convex set C ⊂ Rd and a convex function f : C → R.
The following fact is important, but its proof is out of scope of this lecture, see [4, Chapter
1] if you are curious.
Proposition 8.3 (Convex functions have sub-gradients). For any x ∈ int(C) there exists
some g ∈ Rd such that for all y ∈ C,

f (y) − f (x) ⩾ ⟨g, y − x⟩.

Using this fact, we can define sub-gradients.


Definition 8.4 (Sub-gradient of a convex function). For any x ∈ C, we say that g ∈ Rd
is a sub-gradient of f at x if

f (y) − f (x) ⩾ ⟨g, y − x⟩, ∀y ∈ C.

The set of all sub-gradients of f at x ∈ C is written ∂f (x).

As we show next, when the function f is differentiable, the sub-gradient coincides with
the usual gradient. While the sub-gradient may thus seem a bit abstract, it is actually
important in applications. A fundamental example is in ℓ1 regularized linear regression
(LASSO). There, the relevant convex function is essentially the absolute value f (x) = |x|,
and understanding its sub-gradient at 0 (where it is not differentiable) is important both
for statistics and computation. It is instructive to verify for yourself that for f (x) = |x|,
the sub-gradient at 0 is ∂f (0) = [−1, 1].
Proposition 8.5 (Gradients and sub-gradients). Suppose f is differentiable at x ∈ int(C).
Then ∂f (x) = {∇f (x)}.
Proof. Let U ⊂ C be an open set containing x. Then for all y ∈ U and t ∈ [0, 1],

f ((1 − t)x + ty) ⩽ (1 − t)f (x) + tf (y),

which implies

[f((1 − t)x + ty) − f(x)] / t ⩽ f(y) − f(x).
Taking the limit as t → 0+ , we obtain

⟨∇f (x), y − x⟩ ⩽ f (y) − f (x).

Hence ∇f (x) ∈ ∂f (x).


On the other hand, suppose g ∈ ∂f (x) is some sub-gradient of f at x. Then for any
u ∈ Rd such that ∥u∥ = 1, we have that
⟨∇f(x), u⟩ = lim_{t→0+} [f(x + tu) − f(x)]/t ⩾ lim_{t→0+} ⟨tu, g⟩/t = ⟨g, u⟩.
Applying the same inequality to −u we see that ⟨∇f (x), u⟩ = ⟨g, u⟩ for all unit norm
vectors u ∈ Rd , which implies g = ∇f (x).

Proposition 8.6 (Local minimizers are global). If x0 ∈ C is a local minimizer of f then


it is a global minimizer, so that

f (x0 ) = min f (x)


x∈C

Moreover, x0 ∈ C is a global minimizer of f if and only if 0 ∈ ∂f (x0 ).


Proof. For the first claim, suppose x0 is a local minimizer of f . Then observe that for any
y ∈ C, and t > 0 sufficiently small,

f(x0) ⩽ f((1 − t)x0 + ty) ⩽ (1 − t)f(x0) + tf(y),

which implies f(x0) ⩽ f(y). But since y ∈ C was arbitrary, it follows that x0 is a global minimizer of f on C.
For the second claim, note that 0 ∈ ∂f (x0 ) if and only if for all y ∈ C,

0 ⩽ f (y) − f (x0 ),

which is the statement of x0 being a global minimizer.

8.2 Convex optimization isn’t easy
It is commonly thought that “convex optimization problems are easy.” But this isn’t
necessarily true. Indeed, we can, in fact, transform any optimization problem into a
convex optimization problem with a linear objective. To see this, suppose we have some
function g : Rd → R (not necessarily convex) and we want to solve

min g(x),
x∈Rd

and that there is a solution for this problem. Then define the epigraph of g by

epi(g) := {(x, t) ∈ Rd × R : t ⩾ g(x)} ⊂ Rd+1,

and let
S := conv(epi(g)),

be the convex hull of the epigraph of g. Then it is clear that

min_{y∈S} y_{d+1} ⩽ min_{x∈Rd} g(x).

On the other hand, since every y ∈ S is a convex combination of points in epi(g), it must hold that y_{d+1} ⩾ min_{x∈Rd} g(x). We finally conclude

min_{x∈Rd} g(x) = min_{y∈S} y_{d+1}.

We have thus transformed an arbitrary optimization problem into minimization of a


linear objective over a convex set in only one dimension higher. This indicates that, in
general, convex optimization is NOT necessarily easy, and also highlights the fact that the
difficulty of the problem lies in both the objective as well as the constraints.

8.3 Two standard analyses


Recognizing thus that convex optimization can be quite difficult (impossible), we next
give two standard analyses in this area, one for GD for convex and Lipschitz functions,
and one for SGD for convex and Lipschitz functions. We will assume the minimization is
unconstrained; similar results hold if it is constrained but with an additional projection
step (and this projection step can indeed be a source of serious computational difficulty).

GD for Lipschitz and convex functions. We’ll consider the gradient descent (GD)
algorithm: given an initial point x1 ∈ Rd and a sequence of step-sizes {ηt }Tt=1 , it uses the
iterative update, for t = 1, . . . , T ,

xt+1 = xt − ηt gt .

for any gt ∈ ∂f (xt ).
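A minimal sketch of this iteration (Python/NumPy; the test function f(x) = ∥x∥₁ with subgradient sign(x) and the particular d, T are illustrative choices):

```python
import numpy as np

def subgradient_descent(subgrad, x1, eta, T):
    """x_{t+1} = x_t - eta * g_t with g_t in the subdifferential at x_t; returns the
    averaged iterate (1/T) sum_t x_t, the quantity controlled by Theorem 8.7 below."""
    x = np.array(x1, dtype=float)
    avg = np.zeros_like(x)
    for t in range(T):
        avg += x / T
        x = x - eta * subgrad(x)
    return avg

# Illustration on f(x) = ||x||_1 (convex, sqrt(d)-Lipschitz), subgradient sign(x):
d, T = 10, 10_000
x1 = np.ones(d)                                  # R = ||x1 - x_star|| = sqrt(d)
R = L = np.sqrt(d)
x_bar = subgradient_descent(np.sign, x1, eta=R / (L * np.sqrt(T)), T=T)
print(np.abs(x_bar).sum())                       # small, of order L*R/sqrt(T)
```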

Theorem 8.7 (GD for Lipschitz and convex functions). Suppose f : Rd → R is both L-Lipschitz and convex, and there exists x⋆ ∈ arg min_{x∈Rd} f(x). Assume that ∥x1 − x⋆∥ ⩽ R, and the step-sizes are chosen to be ηt = η := R/(L√T). Then

f((1/T) Σ_{t=1}^T xt) − f(x⋆) ⩽ LR/√T,

and

min_{t∈[T]} f(xt) − f(x⋆) ⩽ LR/√T.
Proof. Observe that by convexity

f((1/T) Σ_{t=1}^T xt) − f(x⋆) ⩽ (1/T) Σ_{t=1}^T f(xt) − f(x⋆).

It is also clear that

min_{t∈[T]} f(xt) − f(x⋆) ⩽ (1/T) Σ_{t=1}^T f(xt) − f(x⋆).

So it suffices to show the bound

(1/T) Σ_{t=1}^T f(xt) − f(x⋆) ⩽ LR/√T.
To this end, we use the definition of sub-gradients to observe that

f(x⋆) − f(xt) ⩾ ⟨gt, x⋆ − xt⟩.

Hence

f(xt) − f(x⋆) ⩽ ⟨gt, xt − x⋆⟩ = (1/η) ⟨xt − xt+1, xt − x⋆⟩.

We then observe that

2⟨xt − xt+1, xt − x⋆⟩ = ∥xt − xt+1∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥²,
so that, putting together the pieces, we find
(1/T) Σ_{t=1}^T f(xt) − f(x⋆) ⩽ (1/(2ηT)) Σ_{t=1}^T 2⟨xt − xt+1, xt − x⋆⟩
= (1/(2ηT)) Σ_{t=1}^T (∥xt − xt+1∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥²)
= (1/(2ηT)) (∥x1 − x⋆∥² − ∥x_{T+1} − x⋆∥²) + (1/(2ηT)) Σ_{t=1}^T ∥xt − xt+1∥²
⩽ R²/(2ηT) + (1/(2ηT)) Σ_{t=1}^T ∥xt − xt+1∥²
= R²/(2ηT) + (η/(2T)) Σ_{t=1}^T ∥gt∥².

We then observe that by the sub-gradient inequality and Lipschitzness,

∥gt∥² ⩽ f(xt + gt) − f(xt) ⩽ L∥gt∥   ⟹   ∥gt∥² ⩽ L².

We finally obtain

(1/T) Σ_{t=1}^T f(xt) − f(x⋆) ⩽ R²/(2ηT) + ηL²/2.

Plugging in η = R/(L√T), we obtain the result.

SGD for convex functions with Lipschitz gradients. Next, let us consider the
stochastic gradient descent (SGD) algorithm. We assume that for all x ∈ Rd we have
access to stochastic gradients g̃ ∈ Rd such that E[g̃] ∈ ∂f (x). Given an initial point
x1 ∈ Rd and step-sizes {ηt }Tt=1 we use the update for t = 1 . . . , T ,

xt+1 := xt − ηg̃t ,

where g̃t is an independent stochastic gradient such that E[g̃t ] ∈ ∂f (xt ).


This setting encompasses, in particular, the setting where

f(x) = (1/n) Σ_{i=1}^n fi(x),

and we use the stochastic gradients

g̃ := ∇fi (x)

for a uniform random i ∈ [n]. As we have discussed previously, this is a standard approach
to solving ERM problems in practice, in which case each fi represents the loss of the model
associated to a different data point.

Theorem 8.8 (SGD for convex and Lipschitz functions). Suppose f : Rd → R is a convex function, x⋆ ∈ arg min_{x∈Rd} f(x) exists with ∥x1 − x⋆∥ ⩽ R, and the stochastic gradients are bounded as E[∥g̃t∥²] ⩽ L² for all t. Set ηt = η := R/(L√T). Then

E[f((1/T) Σ_{t=1}^T xt)] − f(x⋆) ⩽ LR/√T.

Proof. We proceed as in the previous proof: apply convexity to observe

f((1/T) Σ_{t=1}^T xt) ⩽ (1/T) Σ_{t=1}^T f(xt).

Then note that, by the definition of sub-gradients,

f(xt) − f(x⋆) ⩽ ⟨E[g̃t | xt], xt − x⋆⟩ = E[⟨g̃t, xt − x⋆⟩ | xt].

We obtain

f(xt) − f(x⋆) ⩽ (1/η) E[⟨xt − xt+1, xt − x⋆⟩ | xt]
= (1/(2η)) E[∥xt − xt+1∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt]
= (1/(2η)) E[η²∥g̃t∥² + ∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt]
⩽ ηL²/2 + (1/(2η)) E[∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt].

Plugging this in, we get

(1/T) Σ_{t=1}^T f(xt) − f(x⋆) ⩽ ηL²/2 + (1/(2ηT)) Σ_{t=1}^T E[∥xt − x⋆∥² − ∥xt+1 − x⋆∥² | xt].

Taking the expectation, the sum telescopes and we obtain

E[f((1/T) Σ_{t=1}^T xt)] − f(x⋆) ⩽ ηL²/2 + (1/(2ηT)) E[∥x1 − x⋆∥² − ∥x_{T+1} − x⋆∥²] ⩽ ηL²/2 + R²/(2ηT).

Plugging in the choice of η yields the result.

8.4 (S)GD in practice


The results we have provided are far from practice in a number of ways:

1. One typically doesn’t know R or L, so the step-size that is recommended in this


result is not knowable in practice. Worse yet, the step-size depends on the number
of iterations but in practice people usually just run the algorithm until the objective
stops decreasing. In practice, the step-sizes are usually taken to be ηt := c/t, where
c is a hyperparameter this is optimized by brute-force.

2. The convergence rate 1/√T is extremely slow.

3. The results apply to the averaged iterate

   (1/T) Σ_{t=1}^T xt,

(as well as, in the case of GD, the best iterate), but practitioners invariably use the
last iterate xT .

4. (S)GD is often successfully applied to non-convex functions – the convexity assump-


tion is often unrealistic.

By making stronger assumptions on the function f (strong convexity and smoothness vs.
convexity and Lipschitzness), we can provide results which are more similar to practice,
although with the caveat that they are only valid under these stronger assumptions. In
general, we do not yet have a strong theoretical understanding of the success of (S)GD in
practice [1], and developing this understanding is a fascinating and extremely active area
of research. Unfortunately, this topic would require another semester to cover!

A Background on concentration inequalities
In this appendix we give some background on concentration inequalities, specifically Ho-
effding’s, Bernstein’s, and the bounded differences inequality. For a thorough treatment of
these topics we recommend the reader consult [8, Chapter 2]. The core underlying bound
is Chernoff ’s inequality, which we include here.

Proposition A.1 (Chernoff’s inequality). Suppose ζ is a real random variable with finite
mean. Then for all λ, t > 0,

P[ζ − E[ζ] > t] ⩽ e−λt E[exp(λ(ζ − E[ζ]))].

Proof. Notice that

P[ζ − E[ζ] > t] = P[λ(ζ − E[ζ]) > λt] = P[exp(λ(ζ − E[ζ])) > eλt ].

Applying Markov’s inequality, we obtain

E[exp(λ(ζ − E[ζ]))]
P[ζ − E[ζ] > t] ⩽ = e−λt E[exp(λ(ζ − E[ζ]))].
eλt

All three of the concentration inequalities that we develop will rely on Chernoff’s in-
equality; their main difference will be in how they control the moment generating function
E[exp(λ(ζ − E[ζ]))].

A.1 sub-Gaussian random variables and Hoeffding’s inequality


Definition A.2 (sub-Gaussian random variable). Suppose ζ is a real random variable and σ ⩾ 0. Then ζ is σ²-sub-Gaussian if

E[exp(λ(ζ − E[ζ]))] ⩽ e^{σ²λ²/2},   ∀λ ∈ R.

Here are a few simple examples.

Example A.3 (Gaussians are sub-Gaussian). Suppose ζ ∼ N(0, σ²). Then, for all λ ∈ R,

E[exp(λ(ζ − E[ζ]))] = (1/(√(2π)σ)) ∫ e^{λζ − ζ²/(2σ²)} dζ
= (1/(√(2π)σ)) ∫ e^{−(ζ−σ²λ)²/(2σ²) + σ²λ²/2} dζ
= e^{σ²λ²/2}.

The result follows for Gaussians with arbitrary mean by observing that ζ − E[ζ] is dis-
tributed according to a mean 0 Gaussian.

Example A.4 (Bounded random variables are sub-Gaussian). Suppose ζ is a random variable which takes values in [a, b] for some a < b. Let

ψ(λ) := log E[e^{λζ}].

Note that

ψ′(λ) = E[ζ e^{λζ}] / E[e^{λζ}],

and

ψ″(λ) = E[ζ² e^{λζ}] / E[e^{λζ}] − (E[ζ e^{λζ}] / E[e^{λζ}])².

In particular, ψ(0) = 0 and ψ′(0) = E[ζ]. Let us now estimate the second derivative. For this, it is convenient to set pλ(ζ) := e^{λζ}/E[e^{λζ}], and write Eλ[h(ζ)] := E[pλ(ζ)h(ζ)]. We then notice that this is, in fact, a probability distribution, and thus ψ″(λ) is the variance of ζ when it follows this modified distribution:

ψ″(λ) = E[pλ(ζ)ζ²] − E[pλ(ζ)ζ]² = Eλ[ζ²] − Eλ[ζ]².

Recall that a random variable in [a, b] can have variance at most (b − a)²/4, so that

ψ″(λ) ⩽ (b − a)²/4.

Putting this together, we obtain

ψ(λ) = ψ(0) + λψ′(0) + ∫_0^λ ψ″(γ)(λ − γ) dγ
= λE[ζ] + ∫_0^λ ψ″(γ)(λ − γ) dγ
⩽ λE[ζ] + ((b − a)²/4) ∫_0^λ (λ − γ) dγ
= λE[ζ] + ((b − a)²/4) · (λ²/2)
= λE[ζ] + (b − a)²λ²/8.

Recalling the definition of ψ, we conclude that ζ is (b − a)²/4-sub-Gaussian.
The main consequence of sub-Gaussianity that we’ll use is that the tails decay quickly.
Proposition A.5 (Tail decay of sub-Gaussian random variables). Suppose ζ is σ²-sub-Gaussian. Then for all t ⩾ 0,

P[ζ > E[ζ] + t] ⩽ e^{−t²/(2σ²)}.

Proof. We use the Chernoff bound: for all λ > 0 we have

P[ζ > E[ζ] + t] = P[e^{λ(ζ−E[ζ])} > e^{λt}] ⩽ E[e^{λ(ζ−E[ζ])}]/e^{λt} ⩽ e^{λ²σ²/2 − λt}.

Taking λ = t/σ² (suggested by minimizing the quadratic) we obtain

P[ζ > E[ζ] + t] ⩽ e^{−t²/(2σ²)}.

Remark A.6 (Additional tail bounds). Notice that by symmetry of the definition of sub-Gaussianity, ζ being σ²-sub-Gaussian implies −ζ is σ²-sub-Gaussian, so that the above implies in particular that

P[ζ < E[ζ] − t] ⩽ e^{−t²/(2σ²)}.

Putting the two together, we have the following useful two-sided tail control:

P[|ζ − E[ζ]| > t] ⩽ 2e^{−t²/(2σ²)}.

We will frequently use these consequences without comment.


It is not hard to see the following fact.

Proposition A.7. Suppose ζ1, ζ2 are independent and σ1²-sub-Gaussian, σ2²-sub-Gaussian, respectively. Then

1. Behavior under sums: the random variable ζ1 + ζ2 is (σ1² + σ2²)-sub-Gaussian.

2. Behavior under scaling: for any α ∈ R, αζ1 is α²σ1²-sub-Gaussian.

An important consequence of this simple observation is that if ζ1, . . . , ζn are independent and σ1², . . . , σn²-sub-Gaussian, then, for any α ∈ Rn,

Σ_{i=1}^n αi ζi   is   Σ_{i=1}^n αi² σi²-sub-Gaussian.

Of particular interest to us, the average

(1/n) Σ_{i=1}^n ζi   is   (1/n²) Σ_{i=1}^n σi²-sub-Gaussian.

Applying the previous tail bound we arrive at Hoeffding’s inequality.


Theorem A.8 (Hoeffding’s inequality). Suppose ζ1 , . . . , ζn are independent and each σ 2 -
sub Gaussian. Then for all t > 0,
n
1X 2 2
P[ ζi − E[ζi ] > t] ⩽ e−nt /2σ
n
i=1

It is convenient to state Hoeffding’s specifically for bounded random variables. By


combining Theorem A.8 and Example A.4 we obtain the following corollary.
Corollary A.9 (Hoeffding’s for bounded random variables). Suppose ζ1 , . . . ζm are inde-
pendent random variables in an interval [a, b]. Then for all t > 0,
n
1X 2 2
P[ ζi − E[ζi ] > t] ⩽ e−2nt /(b−a) .
n
i=1
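A minimal numerical sanity check of Corollary A.9 (Python/NumPy; the Bernoulli(1/2) variables and the particular n, t are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 0.1, 100_000
# Bernoulli(1/2) variables lie in [0, 1], so Corollary A.9 bounds
# P[ (1/n) sum (zeta_i - E zeta_i) > t ] by exp(-2 n t^2).
means = rng.integers(0, 2, size=(trials, n)).mean(axis=1)
print("empirical frequency:", float(np.mean(means - 0.5 > t)))
print("Hoeffding bound:    ", np.exp(-2 * n * t ** 2))
```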

A.2 Bernstein’s inequality
We can develop a similar theory for sub-exponential random variables, namely those that
have tail decay at an exponential rate. However, this would take us even further afield, so
here we will just focus on the case of bounded random variables and prove our concentra-
tion inequality directly.

Theorem A.10 (Bernstein’s inequality). Let ζ1 , . . . , ζn be independent random variables


such that each ζi ∈ [a, b] with probability 1. Then for all t > 0,
n
1X nt2 
P[ ζi − E[ζi ] > t] ⩽ exp − 2 Pn 2 .
n n i=1 Var(ζi ) + 3 t(b − a)
i=1

Proof. Expanding the power-series of the exponential we obtain that, for all λ > 0 and i ∈ [n],

E[e^{λ(ζi − E[ζi])}] = 1 + Σ_{k=2}^∞ (λ^k/k!) E[(ζi − E[ζi])^k]
⩽ 1 + Var(ζi) Σ_{k=2}^∞ (λ^k (b − a)^{k−2})/k!
= 1 + Var(ζi) (1/(b − a)²) (e^{λ(b−a)} − λ(b − a) − 1)
⩽ exp(Var(ζi) (1/(b − a)²) (e^{λ(b−a)} − λ(b − a) − 1)),

where the last inequality follows because 1 + u ⩽ e^u for all u ∈ R. For convenience of notation, let

g(λ) := (1/(b − a)²) (e^{λ(b−a)} − λ(b − a) − 1).
Using Chernoff's inequality, Prop. A.1, we obtain, for all λ > 0,

P[(1/n) Σ_{i=1}^n (ζi − E[ζi]) > t] ⩽ e^{−λt} E[exp((λ/n) Σ_{i=1}^n (ζi − E[ζi]))]
= e^{−λt} ∏_{i=1}^n E[exp((λ/n)(ζi − E[ζi]))]
⩽ exp(g(λ/n) Σ_{i=1}^n Var(ζi) − λt).

Hence

P[(1/n) Σ_{i=1}^n (ζi − E[ζi]) > t] ⩽ inf_{λ>0} exp(g(λ/n) Σ_{i=1}^n Var(ζi) − λt).    (A.1)

To conclude, we use the inequality e^z ⩽ 1 + z + (z²/2)/(1 − z/3) for |z| < 3 (which can be verified numerically). Hence for λ < 3n/(b − a), we have

g(λ/n) ⩽ (λ²/(2n²)) / (1 − λ(b − a)/(3n)).
For ease of notation, let σ² := Σ_{i=1}^n Var(ζi). Then take

λ = nt / ((1/n)σ² + t(b − a)/3).

Note that, so long as σ² > 0, we have λ < 3n/(b − a). If σ² = 0 the variables are constant, and the inequality is trivial. Thus assume σ² > 0. We may thus apply the above bound on g with this choice of λ to find

σ² g(λ/n) − λt ⩽ (σ²λ²/(2n²)) / (1 − λ(b − a)/(3n)) − λt
= nt² / (2((1/n)σ² + t(b − a)/3)) − nt² / ((1/n)σ² + t(b − a)/3)
= − nt² / (2((1/n)σ² + t(b − a)/3)).
Plugging this bound into (A.1) we obtain the claimed result.

A.3 The bounded differences inequality


The final concentration inequality that we shall have need of is the bounded differences
inequality. This is a very deep and powerful result, which often allows us to reduce
questions about the size of a random object in probability to questions about its size in
expectation. In other words, due to this inequality, it is often the case that showing a
quantity concentrates around its expectation is much easier than actually estimating its
expectation.
Let us first establish a bit of notation.
Definition A.11 (Martingale difference sequence). Suppose (Ω, F, P) is a probability
space with a filtration (non-decreasing sequence of sub-σ-algebras of F) {Fi }i∈[n] . A
sequence of variables ∆i , each Fi -measurable, is a martingale difference sequence if,
for all i ∈ [n],
E[∆i | Fi−1 ] = 0,
where we use the convention that F0 := {∅, Ω} is the trivial σ-algebra.
Theorem A.12 (Azuma-Hoeffding inequality). Suppose (∆i)_{i=1}^n is a martingale difference sequence, and for each i = 1, . . . , n there exist Fi−1-measurable random variables Ai, Bi such that Ai ⩽ ∆i ⩽ Bi and Bi − Ai ⩽ Ci almost surely, for constants Ci > 0. Then for all t > 0,

P[(1/n) Σ_{i=1}^n ∆i > t] ⩽ exp(−2n²t² / Σ_{i=1}^n Ci²).

Proof. We apply the Chernoff inequality, Prop. A.1, to yield, for all λ > 0,

P[(1/n) Σ_{i=1}^n ∆i > t] ⩽ e^{−λt} E[exp((λ/n) Σ_{i=1}^n ∆i)].

Observe that, by the definition of the martingale difference sequence, ∆1, . . . , ∆n−1 are each Fn−1-measurable. Hence

E[exp((λ/n) Σ_{i=1}^n ∆i)] = E[E[exp((λ/n) Σ_{i=1}^n ∆i) | Fn−1]]
= E[exp((λ/n) Σ_{i=1}^{n−1} ∆i) E[e^{(λ/n)∆n} | Fn−1]].

But since ∆n ∈ [An, Bn] for An, Bn Fn−1-measurable, we can use our previous bound from Example A.4 on the moment generating function of bounded random variables to yield

E[e^{(λ/n)∆n} | Fn−1] ⩽ exp((λ/n) E[∆n | Fn−1] + (Bn − An)²λ²/(8n²))
= exp((Bn − An)²λ²/(8n²)) ⩽ exp(Cn²λ²/(8n²)).

Hence

E[exp((λ/n) Σ_{i=1}^n ∆i)] ⩽ e^{Cn²λ²/(8n²)} E[exp((λ/n) Σ_{i=1}^{n−1} ∆i)].

Iterating this argument, we find

P[(1/n) Σ_{i=1}^n ∆i > t] ⩽ exp((λ²/(8n²)) Σ_{i=1}^n Ci² − λt).

Taking λ = 4n²t / Σ_{i=1}^n Ci² yields the result.

To state the bounded differences inequality, it is convenient to have the following


definition.

Definition A.13 (Bounded differences property). Suppose g : X^n → R is some measurable function. Then we say that g has the bounded differences property if there are constants C1, . . . , Cn such that for each i ∈ [n] and x1, . . . , xn we have

sup_{x′∈X} g(x1, . . . , xi−1, x′, xi+1, . . . , xn) − inf_{x′∈X} g(x1, . . . , xi−1, x′, xi+1, . . . , xn) ⩽ Ci.

Theorem A.14 (Bounded differences inequality). Suppose g : X^n → R has the bounded differences property with constants C1, . . . , Cn. Then, for any random variable (X1, . . . , Xn) on X^n with independent components, for all t > 0, we have

P[g(X1, . . . , Xn) − E[g(X1, . . . , Xn)] > t] ⩽ exp(−2t² / Σ_{i=1}^n Ci²).

Proof. For i ∈ [n], set

∆i := E[g(X1, . . . , Xn) | X1, . . . , Xi] − E[g(X1, . . . , Xn) | X1, . . . , Xi−1].

Notice that the ∆i form a martingale difference sequence with respect to the filtration Fi := σ(X1, . . . , Xi). Write the law of each variable Xi as µi. Then for all i ∈ [n] we have

∆i = ∫ g(X1, . . . , Xi, xi+1, . . . , xn) dµi+1 · · · dµn − ∫ g(X1, . . . , Xi−1, xi, . . . , xn) dµi · · · dµn
= ∫ (g(X1, . . . , Xi−1, Xi, xi+1, . . . , xn) − g(X1, . . . , Xi−1, xi, xi+1, . . . , xn)) dµi · · · dµn
⩽ Ci.

Hence we may apply the Azuma-Hoeffding inequality, Theorem A.12, to conclude.

B Chaining
In this section we will introduce a very powerful technique for controlling Rademacher
complexities and related stochastic processes. Given a stochastic process (Zt )t∈T we want
to understand the size of the expected supremum

E[sup Zt ]. (B.1)
t∈T

This setting includes the Rademacher complexity of a set S ⊂ Rn,

R(S) := Eε[sup_{s∈S} (1/n)⟨s, ε⟩].

It also includes empirical processes: if G is a set of real-valued functions and Q is a probability distribution, we can consider the empirical process

E_{ζ1,...,ζn∼Q}[sup_{g∈G} (∫ g(ζ) dQ(ζ) − (1/n) Σ_{i=1}^n g(ζi))].

But the application to Rademacher complexities will be more powerful (this is, indeed, one
of the reasons why we went through all the effort of symmetrization: while it is possible
to apply the techniques we develop in this section directly to empirical processes, it yields
much worse bounds).
Observe that, as soon as the index set T is infinite, the expected supremum (B.1) can
be infinite. Indeed, if T is infinite and each random variable Zt is iid and unbounded,
then (B.1) is infinite. On the other hand, if every Zt is exactly equal to every other Zt′
(total dependence) then (B.1) reduces to E[Zt ], which will typically be 0. Our bounds
in this and the next section will thus interpolate between these very simple cases, using
dependence between the Zt s to establish bounds on (B.1). It turns out there is an elegant
and beautiful way to formalize this intuition; to introduce it we’ll first need to recall the
definition of a sub-Gaussian random variable.

Definition B.1 (Sub-Gaussian random variable). We say that a real-valued random variable ζ is σ²-sub-Gaussian if for all λ ∈ R,

E[e^{λ(ζ−E[ζ])}] ⩽ e^{λ²σ²/2}.

Using this definition, we can define a broad class of stochastic processes that we’ll be
able to control using the techniques we introduce in this and the next section.

Definition B.2 (Sub-Gaussian stochastic process). Suppose (T, d) is a metric space. A real-valued stochastic process on T, (Zt)_{t∈T}, is sub-Gaussian if E[Zt] = 0 for all t ∈ T and for each t, t′ ∈ T and all λ ∈ R,

E[e^{λ(Zt − Zt′)}] ⩽ e^{λ²d(t,t′)²/2}.

In other words, Zt − Zt′ is d(t, t′)²-sub-Gaussian for all t, t′.

The intuition for this definition is that the dependency structure of the stochastic pro-
cess (Zt )t∈T is encoded by the metric space structure (T, d): when t and t′ are close, the
difference Zt − Zt′ is highly concentrated around 0, meaning the variables are highly corre-
lated. While if t and t′ are far, then the difference Zt − Zt′ can be essentially independent.
Intuitively, this allows us to reduce questions about the size of (B.1) to the size of the
metric space T .
Our main tool to verify that stochastic processes are indeed sub-Gaussian will be
Hoeffding’s bound.

Proposition B.3 (Hoeffding's bound). Suppose ζ1, . . . , ζn are independent random variables, such that ζi ∈ [ai, bi] for each i ∈ [n]. Then the variable Σ_{i=1}^n ζi is σ²-sub-Gaussian for

σ = (1/2) (Σ_{i=1}^n (bi − ai)²)^{1/2}.

Using Hoeffding’s bound, we can consider a couple of examples.

Example B.4 (Empirical process). For a set of real-valued functions G and a probability distribution Q, we can consider the stochastic process (Zg)_{g∈G} where Zg := ∫ g(ζ) dQ(ζ) − (1/n) Σ_{i=1}^n g(ζi), for ζ1, . . . , ζn ∼ Q iid. Then Hoeffding's bound implies that the stochastic process is sub-Gaussian with respect to the metric

d(g, g′) := (1/√n) ∥g − g′∥∞ := (1/√n) sup_ζ |g(ζ) − g′(ζ)|.

Example B.5 (Rademacher process). Suppose S ⊂ Rn and S is bounded. We want to consider the Rademacher process (Zs)_{s∈S} where Zs := (1/n)⟨s, ε⟩. Hoeffding's bound implies that this process is sub-Gaussian with respect to the metric

d(s, s′) := (1/n) ∥s − s′∥.

B.1 Covering and packing numbers
We will measure the size of the metric space (T, d) with covering numbers, which we define
now. We will write the closed ball in (T, d) at t0 of radius ρ > 0 as

B(t0 , ρ) := {t ∈ T : d(t, t0 ) ⩽ ρ}.

Definition B.6 (Covering). For a metric space (T, d) and ρ > 0, a covering at scale ρ of
T is any collection of points U such that
T ⊂ ∪_{u∈U} B(u, ρ).

Definition B.7 (Covering number). For a metric space (T, d) and ρ > 0, the covering number of (T, d) at scale ρ is defined to be the size of the smallest covering at scale ρ. In other words, it is defined as

N(T, d, ρ) := min{K : ∃ t1, . . . , tK s.t. T ⊂ ∪_{k=1}^K B(tk, ρ)}.

Observe that in the above definition, we don't require the centers t1, . . . , tK to lie within T – this is for convenience in later applications.

Definition B.8 (Packing). For a metric space (T, d) and ρ > 0, a packing at scale ρ is
defined to be a collection of points U ⊂ T such that for all u, u′ ∈ U such that u ̸= u′ ,
d(u, u′ ) > ρ.

Definition B.9 (Packing number). For a metric space (T, d) and ρ > 0, the packing number of (T, d) at scale ρ is defined to be the size of the largest packing at scale ρ. In other words, it is defined as

M(T, d, ρ) := max{K : ∃ t1, . . . , tK ∈ T s.t. d(tj, tk) > ρ, ∀ j ≠ k}.

We have the following result.

Proposition B.10 (Comparing covering and packing numbers). For any metric space
(T, d) and ρ > 0 we have

M(T, d, 2ρ) ⩽ N (T, d, ρ) ⩽ M(T, d, ρ).

Proof. For the upper bound, take t1 , . . . , tK such that d(tj , tk ) > ρ for all j ̸= k and
K = M(T, d, ρ). We claim that these t1 , . . . , tK form a covering at scale ρ of T . Indeed,
if there were some t ∈ T such that

t ∉ ∪_{k=1}^K B(tk, ρ),

then d(t, tk ) > ρ for all k ∈ [K]. But then we could form a larger packing at scale ρ with
the points {t1 , . . . , tK , t}, contradicting the maximality of the packing t1 , . . . , tK . Hence

they form a covering at scale ρ, and since N (T, d, ρ) is defined to be the size of the smallest
covering at scale ρ, we must have

N (T, d, ρ) ⩽ K = M(T, d, ρ).

For the lower bound, let t1 , . . . , tK be a minimal covering at scale ρ of T , so that


K = N (T, d, ρ). Let t′1 , . . . , t′K ′ ∈ T be a maximal packing at scale 2ρ of T , so that
K ′ = M(T, d, 2ρ). Suppose for the sake of contradiction that K ′ > K. Observe that for
each k ∈ [K ′ ], there exists some j ∈ [K] such that d(t′k , tj ) ⩽ ρ by the definition of a
covering at scale ρ. Since K ′ > K, there must exist at least two indices k, ℓ ∈ [K ′ ] such
that d(t′k , tj ), d(t′ℓ , tj ) ⩽ ρ for some j ∈ [K]. But then by the triangle inequality,

d(t′k , t′ℓ ) ⩽ d(t′k , tj ) + d(t′ℓ , tj ) ⩽ 2ρ,

which is a contradiction to our assumption that t′1 , . . . , t′K ′ is a packing of T at scale 2ρ.

B.2 Single-scale chaining


Our main tool will be the following fundamental result.

Proposition B.11 (Expected maximum of finitely many sub-Gaussian random variables). Suppose ζ1, . . . , ζK are each σ²-sub-Gaussian (not necessarily independent) and mean zero. Then

E[max_{k∈[K]} ζk] ⩽ √(2σ² log K).

Notice that this result is a generalization of our previous bound on the size of a
Rademacher process over a finite set, which was of crucial importance in the VC the-
ory analysis.
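Before giving the proof, here is a quick Monte Carlo sanity check of the bound in the simplest case of independent Gaussians (a Python sketch of our own; the values of σ, K, and the number of trials are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, K, trials = 1.0, 1000, 2000

# K i.i.d. N(0, sigma^2) variables are sigma^2 sub-Gaussian with mean zero,
# so Proposition B.11 applies to them.
samples = rng.normal(scale=sigma, size=(trials, K))
empirical = samples.max(axis=1).mean()        # Monte Carlo estimate of E[max_k zeta_k]
bound = np.sqrt(2 * sigma**2 * np.log(K))     # the bound of Proposition B.11
print(f"E[max] ~ {empirical:.3f}   vs   bound {bound:.3f}")
```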

Proof. Let λ > 0 be arbitrary. Then observe that

E[max_{k∈[K]} ζk] = (1/λ) E[log(exp(λ max_{k∈[K]} ζk))].

By Jensen's inequality, it follows that

E[max_{k∈[K]} ζk] ⩽ (1/λ) log(E[exp(λ max_{k∈[K]} ζk)]).

Since exp is monotonic and λ > 0,

exp(λ max_{k∈[K]} ζk) = max_{k∈[K]} exp(λζk).

So we obtain

E[max_{k∈[K]} ζk] ⩽ (1/λ) log(E[max_{k∈[K]} exp(λζk)]).

Use the bound

max_{k∈[K]} exp(λζk) ⩽ Σ_{k=1}^{K} exp(λζk),

and the definition of sub-Gaussianity to obtain

E[max_{k∈[K]} ζk] ⩽ (1/λ) log(Σ_{k=1}^{K} E[exp(λζk)])
⩽ (1/λ) log(K exp(λ²σ²/2))
= log K/λ + λσ²/2.

Since this is true for all λ > 0, we are free to choose a positive λ that minimizes this upper bound. Selecting λ = √(2 log K)/σ (suggested via calculus) we obtain the result.
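In more detail, the calculus is: d/dλ (log K/λ + λσ²/2) = −log K/λ² + σ²/2, which vanishes at λ = √(2 log K)/σ, and plugging this value back in gives log K/λ + λσ²/2 = √(σ² log K/2) + √(σ² log K/2) = √(2σ² log K), as claimed.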

We’ll now prove the following result which will serve as a preliminary form of chaining.

Proposition B.12 (Single-scale chaining). Let (Zt )t∈T be a sub-Gaussian process with
respect to a metric space (T, d). Then for any ρ > 0,

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + √(2 diam(T)² log N(T, d, ρ)),

where diam(T) := sup_{t,t′∈T} d(t, t′).

Proof. Let t1, . . . , tK be a minimal cover of T at scale ρ. For each t ∈ T, let π(t) ∈ {t1, . . . , tK} be such that d(t, π(t)) ⩽ ρ. Then

E[sup_{t∈T} Zt] = E[sup_{t∈T} (Zt − Zπ(t) + Zπ(t))]
⩽ E[sup_{t∈T} (Zt − Zπ(t))] + E[sup_{t∈T} Zπ(t)]
⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + E[sup_{t∈T} Zπ(t)]
= E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + E[max_{k∈[K]} Ztk].

Observe that

E[max_{k∈[K]} Ztk] = E[max_{k∈[K]} Ztk] − E[Zt1] = E[max_{k∈[K]} (Ztk − Zt1)].

We thus have a maximum of at most K random variables, each diam(T)²-sub-Gaussian, and so apply Proposition B.11 to obtain

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + √(2 diam(T)² log N(T, d, ρ)).

B.3 Dudley’s entropy integral
The next result refines this single-scale chaining bound.

Theorem B.13 (Dudley's entropy integral bound). Let (Zt)t∈T be a stochastic process, sub-Gaussian with respect to a metric space (T, d). Then for all ρ > 0,

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + 12 ∫_{ρ/4}^{∞} √(log N(T, d, τ)) dτ.

Let us make a few remarks on this result. First of all, in almost all applications it will be the case that

E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] → 0 (B.2)

as ρ → 0+, in which case the bound implies

E[sup_{t∈T} Zt] ⩽ 12 ∫_{0}^{∞} √(log N(T, d, τ)) dτ.

Sometimes the entropy integral bound is stated in this form (under mild additional assump-
tions which ensure that (B.2) holds), but we prefer our form since it is more convenient
in applications.
Secondly, notice that as soon as ρ ⩾ diam(T), log N(T, d, ρ) = 0, so that the bound can be equivalently stated

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + 12 ∫_{ρ/4}^{diam(T)} √(log N(T, d, τ)) dτ.

Third, the function log(N (T, d, ·)) is often called the metric entropy, since it is related
to the number of bits it takes to represent points in T up to error τ . Thus the name
“entropy integral.”
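As a concrete example, take T = [0, 1] with the usual distance: placing centers at τ, 3τ, 5τ, . . . shows N([0, 1], | · |, τ) ⩽ ⌈1/(2τ)⌉, so the metric entropy grows only like log(1/τ) as τ → 0+ and the entropy integral ∫_0^1 √(log N([0, 1], | · |, τ)) dτ is finite.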

Proof. Let us begin by assuming that diam(T ) < ∞, and then once we have proven the
result in this case we will explain how to handle the general case. Define

J0 := max{j ∈ Z : 2^j ⩽ ρ}, J1 := min{j ∈ Z : 2^j ⩾ diam(T)}.

Then for each integer j such that J0 ⩽ j ⩽ J1, let Uj = {t^j_1, . . . , t^j_{Kj}} be a cover of T at scale 2^j, and define a map πj : T → Uj which takes t ∈ T to some t^j_k ∈ Uj such that d(t, t^j_k) ⩽ 2^j. Then for any t ∈ T, we can write

Zt = (Zt − ZπJ0(t)) + Σ_{j=J0}^{J1−1} (Zπj(t) − Zπj+1(t)) + ZπJ1(t). (B.3)

This decomposition is the most important step in the chaining argument, and the reason
why it bears the name “chaining”: each term in the sum is a link in a “chain” connecting
the lowest-scale term ZπJ0 (t) to the highest-scale term ZπJ1 (t).

Once (B.3) is established, we can apply the same idea as in the single-scale proof: bound each link in the chain with our estimate on maxima of finitely many sub-Gaussian random variables. To set this step up, we use the following estimate:

E[sup_{t∈T} Zt] ⩽ E[sup_{t∈T} (Zt − ZπJ0(t))] + Σ_{j=J0}^{J1−1} E[sup_{t∈T} (Zπj(t) − Zπj+1(t))] + E[sup_{t∈T} ZπJ1(t)]
⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + Σ_{j=J0}^{J1−1} E[sup_{t∈T} (Zπj(t) − Zπj+1(t))] + E[sup_{t∈T} ZπJ1(t)]
= E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + Σ_{j=J0}^{J1−1} E[sup_{t∈T} (Zπj(t) − Zπj+1(t))],

where the last equality follows because there is only one element of UJ1 (since 2^{J1} ⩾ diam(T)).
We can now focus on the terms

E[sup_{t∈T} (Zπj(t) − Zπj+1(t))],

for j ∈ Z such that J0 ⩽ j ⩽ J1 − 1. We first observe that for any t ∈ T,

d(πj(t), πj+1(t)) ⩽ d(πj(t), t) + d(t, πj+1(t)) ⩽ 2^j + 2^{j+1} = 3 · 2^j.

Thus

sup_{t∈T} (Zπj(t) − Zπj+1(t)) ⩽ max_{tj∈Uj, tj+1∈Uj+1 : d(tj,tj+1)⩽3·2^j} (Ztj − Ztj+1).

This is a maximum over at most |Uj||Uj+1| = KjKj+1 ⩽ Kj² variables, each of which is at worst (3 · 2^j)²-sub-Gaussian, so by Proposition B.11,

E[sup_{t∈T} (Zπj(t) − Zπj+1(t))] ⩽ 3 · 2^j √(4 log Kj) = 3 · 2^{j+1} √(log Kj) = 6 · 2^j √(log N(T, d, 2^j)).

Hence

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + 6 Σ_{j=J0}^{J1−1} 2^j √(log N(T, d, 2^j)). (B.4)

Finally, we rewrite this sum in integral form by observing that

2^j √(log N(T, d, 2^j)) = 2 ∫_{2^{j−1}}^{2^j} √(log N(T, d, 2^j)) dτ ⩽ 2 ∫_{2^{j−1}}^{2^j} √(log N(T, d, τ)) dτ.

This yields

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + 12 ∫_{2^{J0−1}}^{2^{J1−1}} √(log N(T, d, τ)) dτ.

We conclude the desired inequality by observing that 2^{J1−1} ⩽ diam(T) and 2^{J0−1} ⩾ ρ/4.

The above argument proves the result in the case where diam(T) < ∞. To prove the full result, fix some t0 ∈ T, take any R > 0, and let TR := B(t0, R). Then the above argument implies

E[sup_{t∈TR} Zt] ⩽ E[sup_{t,t′∈TR : d(t,t′)⩽ρ} (Zt − Zt′)] + 12 ∫_{ρ/4}^{∞} √(log N(TR, d, τ)) dτ
⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + 12 ∫_{ρ/4}^{∞} √(log N(T, d, τ)) dτ.

So it suffices to show that

lim_{R→∞} E[sup_{t∈TR} Zt] = E[sup_{t∈T} Zt]. (B.5)

To this end, write

lim_{R→∞} E[sup_{t∈TR} Zt] = lim_{R→∞} E[sup_{t∈TR} (Zt − Zt0)].

Then we may apply the monotone convergence theorem to the functions ω ↦ sup_{t∈TR} (Zt(ω) − Zt0(ω)): these functions are non-negative and pointwise increasing in R, and so (B.5) holds and we conclude.

From the proof, particularly (B.4), it is clear that we can state the analogous bound
with discrete sums rather than integrals.

Corollary B.14. Let (Zt)t∈T be a stochastic process, sub-Gaussian with respect to a metric space (T, d). Then for all ρ > 0,

E[sup_{t∈T} Zt] ⩽ E[sup_{t,t′∈T : d(t,t′)⩽ρ} (Zt − Zt′)] + 6 Σ_{j : 2^j ⩾ ρ/2} 2^j √(log N(T, d, 2^j)).

There is no mathematically significant difference between Theorem B.13 and Corol-


lary B.14; Theorem B.13 is (sometimes) preferred only for its aesthetic merits.
Note that the chaining technique is extremely powerful, to the extent that – when
appropriately refined – it completely characterizes the suprema of Gaussian processes [7,
Chapter 6], as famously shown by the great French probabilist Michel Talagrand. It
has many applications in statistics, the theory of machine learning, and related areas of
theoretical computer science and probability. We’ll only briefly explore its use in this
course.

C Analysis for general losses and applications


We now refine our ERM analysis in two key directions: first, we will consider ERM for losses more general than the 0-1 loss we have already studied, and second, we will prove tighter bounds. Finally, we will apply this analysis to develop new statistical understanding of many of the models we have discussed so far.

C.1 Convexified loss functions
The computational hardness of ERM motivates us to consider alternative, more compu-
tationally practical problems that we can solve as proxies for the ERM problem. The key
idea here is to take the original 0 − 1 loss 1[h(Xi ) ̸= Yi ] and relax it to a convex loss
function, which is generally easier to optimize.
Let’s write the ERM objective in a more suggestive way. Since we are using the labels
{−1, 1}, we have h(x) ̸= y if and only if −h(x)y > 0, and so the ERM objective can be
written
(1/n) Σ_{i=1}^{n} 1[h(Xi) ̸= Yi] = (1/n) Σ_{i=1}^{n} φ1(−h(Xi)Yi),

where φ1 (z) := 1[z > 0].


Recall the following definitions.

Definition C.1 (Convex set). A set C ⊂ Rd is convex if for all x, y ∈ C and t ∈ [0, 1],

(1 − t)x + ty ∈ C.

Definition C.2 (Convex function). Let C ⊂ Rd be a convex set. Then a function f : C → R is convex if for all x, y ∈ C and t ∈ [0, 1],

f ((1 − t)x + ty) ⩽ (1 − t)f (x) + tf (y).

We will first make the hypothesis class H convex. Recall that each h ∈ H is a classifier
h : X → {−1, 1}; so H will not be convex as soon as it contains at least two elements. We
thus consider the following relaxed notion of a classifier.

Definition C.3 (Soft classifier). A soft classifier is any (measurable) f : X → [−1, 1]. The
hard classifier associated to f is hf := sgn(f).

We will be considering convex sets of soft classifiers F. Some examples include:

1. Linear classifiers: F = {⟨a, x⟩ : a ∈ A} for a convex set A ⊂ Rd. This example recovers the halfspace hypothesis class we studied above.

2. Majority vote: given classifiers h1 , . . . , hM , we look at the set of all convex combi-
nations of these classifiers
F := {Σ_{j=1}^{M} λj hj(x) : λj ⩾ 0, Σ_{j=1}^{M} λj = 1}.

We’ve now convexified the hypothesis class, but we still have non-convexity in the loss φ1 .
We thus consider minimizing a surrogate for φ1 .

Definition C.4 (Convex surrogate). A function φ : R → R+ is called a convex surrogate if it is a convex, non-decreasing function such that φ(0) = 1 and φ(z) ⩾ φ1(z) for all z ∈ R.

Here are some examples:

1. Hinge loss: φ(z) = max(1 + z, 0)

2. Exponential loss: φ(z) = e^z

3. Logistic loss: φ(z) = log2(1 + e^z).
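To make the comparison concrete, here is a small numerical check (a Python sketch of our own; the grid and the names are illustrative) that each of these surrogates is a valid convex surrogate in the sense of Definition C.4, i.e. equals 1 at 0 and dominates φ1 pointwise:

```python
import numpy as np

# The three convex surrogates above, as functions of z = -y*h(x).
surrogates = {
    "hinge":       lambda z: np.maximum(1.0 + z, 0.0),
    "exponential": lambda z: np.exp(z),
    "logistic":    lambda z: np.log2(1.0 + np.exp(z)),
}
phi1 = lambda z: (z > 0).astype(float)  # the 0-1 loss as a function of z

z = np.linspace(-3, 3, 601)
for name, phi in surrogates.items():
    assert np.all(phi(z) >= phi1(z)), name                   # dominates the 0-1 loss
    assert abs(phi(np.array([0.0]))[0] - 1.0) < 1e-12, name  # phi(0) = 1
print("surrogate properties verified on the grid")
```

(Convexity and monotonicity can be read off directly from the formulas.)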

To bypass the non-convexity of φ1, we'll use a convex surrogate φ in place of φ1. We thus consider minimizing the empirical φ-risk, Rn,φ, defined as

Rn,φ(f) := (1/n) Σ_{i=1}^{n} φ(−Yi f(Xi)).

The population counterpart to Rn,φ is the population φ-risk, written

Rφ (f ) := E(x,y)∼P [φ(−yf (x))].

C.2 Zhang’s lemma


Let us fix a convex surrogate φ. In this section, we will relate the φ-risk Rφ (f ) of a
soft classifier f and the true classification error R(hf ) of its associated hard classifier
hf := sgn(f ).
Let us introduce some notation. The φ-risk minimizer is defined as

fφ⋆ := arg min_{f : X →[−1,1]} Rφ(f).

And its value is written Rφ⋆ := Rφ (fφ⋆ ). Let η(x) be the regression function, suitably
modified to consider labels in {1, −1}, so that

η(x) := P[y = 1 | x].

The Bayes classifier is then


h⋆ (x) := sgn(η(x) − 1/2).
Finally, it will also be convenient here and later to consider the following function, defined
for any η ∈ [0, 1], as
Hη (α) := ηφ(−α) + (1 − η)φ(α).
We first make a non-rigorous observation about the φ-risk minimizer fφ⋆ . Since fφ⋆
minimizes
Rφ (f ) = E[E[φ(−yf (x)) | x]] = E[Hη(x) (f (x))],
it ought to be true that fφ⋆ (x) is a critical point of Hη(x) (·). So we expect

−η(x)φ′ (−fφ⋆ (x)) + (1 − η(x))φ′ (fφ⋆ (x)) = 0.

Ignoring issues of division by 0, this means

φ′(fφ⋆(x)) / φ′(−fφ⋆(x)) = η(x) / (1 − η(x)).

By monotonicity of φ′, and again ignoring issues of division by 0, this implies

η(x) ⩾ 1/2 ⇐⇒ η(x)/(1 − η(x)) ⩾ 1 ⇐⇒ φ′(fφ⋆(x)) ⩾ φ′(−fφ⋆(x)) ⇐⇒ fφ⋆(x) ⩾ 0.

We thus see – admittedly not rigorously – that fφ⋆ (x) should be outputting the same signs
as the Bayes classifier h⋆ (x) := sgn(2η(x) − 1), meaning that this surrogate problem is, at
least intuitively, equivalent to the original problem.
The following Lemma gives a rigorous and quantitative version of this claim. It shows that if we can approximately solve the φ-risk problem with some f, then the associated hard classifier hf approximately solves the original classification problem, at least under the hypotheses of the Lemma.

Lemma C.5 (Zhang's Lemma). For each η ∈ [0, 1], put

τ(η) := inf_{α∈R} Hη(α).

Suppose there are constants c > 0 and γ ∈ [0, 1] such that for all η ∈ [0, 1],

|η − 1/2| ⩽ c(1 − τ(η))^γ.

Then for any soft classifier f : X → [−1, 1], letting hf := sgn(f) be the associated hard classifier, we have the inequality

R(hf) − R(h⋆) ⩽ 2c(Rφ(f) − Rφ⋆)^γ.

Notice that τ (η) ⩽ Hη (0) = 1 for all η ∈ [0, 1], so the hypothesis makes sense.
We have the following values of τ, c, and γ for the convex surrogates we mentioned
earlier:

1. Hinge loss (φ(z) = max(1 + z, 0)): τ(η) = 1 − |1 − 2η|, c = 1/2 and γ = 1.

2. Exponential loss (φ(z) = e^z): τ(η) = 2√(η(1 − η)), c = 1/√2 and γ = 1/2.

3. Logistic loss (φ(z) = log2(1 + e^z)): τ(η) = −η log2 η − (1 − η) log2(1 − η), c = 1/√(2 log 2) (note that log without a subscript always indicates the natural logarithm), and γ = 1/2.
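As an illustration of where these values come from, consider the hinge loss: Hη(α) = η max(1 − α, 0) + (1 − η) max(1 + α, 0). For α ∈ [−1, 1] this equals 1 + α(1 − 2η), which is minimized at α = sgn(2η − 1), and values of α outside [−1, 1] only increase Hη; hence τ(η) = 2 min(η, 1 − η) = 1 − |1 − 2η|. Therefore 1 − τ(η) = |1 − 2η| = 2|η − 1/2|, i.e. |η − 1/2| = (1/2)(1 − τ(η)), which is the claim with c = 1/2 and γ = 1.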

Proof. We use the decomposition of binary classification risk we proved at the start of class (which clearly holds unchanged for y ∈ {−1, 1} rather than y ∈ {0, 1}) to observe that

R(hf) − R(h⋆) = 2E[|η(x) − 1/2| 1[hf(x) ̸= h⋆(x)]]
⩽ 2c E[(1 − τ(η(x)))^γ 1[hf(x) ̸= h⋆(x)]]
= 2c E[((1 − τ(η(x))) 1[hf(x) ̸= h⋆(x)])^γ],

where we used the fact that z^γ = z for any z ∈ {0, 1} and all γ ∈ [0, 1]. Apply Jensen's inequality (the function z^γ is concave for z ⩾ 0 and γ ∈ [0, 1]) to observe

R(hf) − R(h⋆) ⩽ 2c (E[(1 − τ(η(x))) 1[hf(x) ̸= h⋆(x)]])^γ.
We claim that,

(1 − τ (η(x)))1[hf (x) ̸= h⋆ (x)] ⩽ Hη(x) (f (x)) − τ (η(x)). (C.1)

If we can show this, then we are done since

E[Hη(x)(f(x)) − τ(η(x))] = Rφ(f) − E[inf_{α∈R} Hη(x)(α)] ⩽ Rφ(f) − Rφ⋆.

To show (C.1) we split into cases. The case where hf (x) = h⋆ (x) is clear because τ (η) ⩽
Hη (f (x)) by definition. So suppose hf (x) ̸= h⋆ (x). Then it must be that f (x)(η(x) −
1/2) ⩽ 0. But by convexity of φ,

Hη(x) (f (x)) ⩾ φ(2(1/2 − η(x))f (x)).

And since φ is non-decreasing and φ(0) = 1, it follows that φ(2(1/2 − η(x))f (x)) ⩾ 1,
implying
Hη(x) (f (x)) ⩾ 1 =⇒ Hη(x) (f (x)) − τ (η(x)) ⩾ 1 − τ (η(x)),
proving (C.1) in this case and finishing the proof of Zhang’s Lemma.

C.3 Setting of ERM analysis for general losses


We’ll work in the following setting: suppose P is a probability distribution on X × [−1, 1]
(observe that we allow for continuous valued variables y, rather than y ∈ {±1} as before),
and ℓ : [−1, 1] × [−1, 1] → R is a loss function. We’ll make the following assumption on ℓ:
1. L-Lipschitz in first argument: |ℓ(z, y)−ℓ(z ′ , y)| ⩽ L|z−z ′ | for all z, z ′ , y ∈ [−1, 1].

2. Range is in [0, 1]: ℓ(z, y) ∈ [0, 1] for all z, y ∈ [−1, 1]


As always, we’ll assume D := ((x1 , y1 ), . . . , (xn , yn )) ∈ (X × [−1, 1])n is an iid sample
from P of size n. For soft classifiers f : X → [−1, 1], we’ll be primarily interested in the
risk
Rℓ (f ) := E(x,y)∼P [ℓ(f (x), y)],
as well as its empirical counterpart
Rn,ℓ(f) := (1/n) Σ_{i=1}^{n} ℓ(f(xi), yi).

Suppose we are given a set of maps f : X → [−1, 1], written F. We’ll then consider the
empirical risk minimizer fˆ and the population risk minimizer f¯ defined as¹

fˆ ∈ arg min_{f∈F} Rn,ℓ(f), f¯ ∈ arg min_{f∈F} Rℓ(f).

And finally, we’ll define the excess risk as

Eℓ (f ) := Rℓ (f ) − Rℓ (f⋆ ),
¹ As in our development of VC theory, we will assume the existence of these optimal maps, but if they don't exist, the results can be easily made approximate.

where f⋆ is the minimizer of the risk Rℓ over all (measurable) maps. Our interest will be in controlling the excess risk

Eℓ (fˆ) = Rℓ (fˆ) − Rℓ (f⋆ ).

Let’s consider some examples. First, we observe that we can recover our old setting of
hard binary classification.

Example C.6 (Recovering binary classification). Suppose P is supported on X × {−1, 1} and F = H for a set of hard classifiers H. Intuitively, the natural loss function for our problem is the 0-1 loss ℓ(z, y) := 1[z ̸= y]. We can formally make this fit the above framework by using the scaled hinge loss ℓ(z, y) := (1/2) max(0, 1 − zy), which is 1/2-Lipschitz in its first argument and takes values in [0, 1]. Notice that for this choice of ℓ, when z, y ∈ {−1, 1},

ℓ(z, y) = 1[z ̸= y],

so this loss extends the binary classification loss, and thus the empirical risk minimizer fˆ and the population risk minimizer f¯ for this loss are exactly the same as the empirical risk minimizer ĥ and the population risk minimizer h̄ in the binary classification problem we studied previously.

We can also recover convexifications of the binary classification problem.

Example C.7 (Convexifications of the binary classification problem). Suppose P is supported on X × {−1, 1}. Recall that a soft classifier is any function f : X → [−1, 1], and suppose we have a set of soft classifiers F. Recall from that lecture that a convex surrogate is a convex, non-decreasing function φ : R → R+ with φ(0) = 1 (so that φ(z) ⩾ 1 for all z ⩾ 0). We then considered the problem

Rφ(f) := E(x,y)∼P [φ(−f(x)y)],

as well as its empirical analog. We can recover this setting by taking ℓφ(z, y) := φ(−zy). In particular, it can be checked that the following choices of φ (the same ones as in Section C.1) lead to a loss ℓφ that satisfies our assumptions when appropriately re-scaled:

1. Hinge loss φ(z) = max(0, 1 + z)

2. Logistic loss φ(z) = log2(1 + e^z)

3. Exponential loss φ(z) = e^z

Example C.8 (Regression problems with squared loss). If we take the squared loss ℓ(z, y) := (1/4)(z − y)², we can recover regression (with bounded target variables).
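As a quick sanity check, the losses in Examples C.6 and C.8 can be verified numerically to satisfy the two assumptions above (a Python sketch of our own; the grid and the names are illustrative):

```python
import numpy as np

# (loss, claimed Lipschitz constant in the first argument)
losses = {
    "scaled hinge (Example C.6)":   (lambda z, y: 0.5 * np.maximum(0.0, 1.0 - z * y), 0.5),
    "scaled squared (Example C.8)": (lambda z, y: 0.25 * (z - y) ** 2, 1.0),
}

grid = np.linspace(-1, 1, 201)
Z, Y = np.meshgrid(grid, grid)  # Z varies along axis=1, Y along axis=0
for name, (loss, lip) in losses.items():
    vals = loss(Z, Y)
    assert vals.min() >= 0.0 and vals.max() <= 1.0 + 1e-12, name  # range in [0, 1]
    slopes = np.abs(np.diff(vals, axis=1)) / np.diff(grid)        # finite differences in z
    assert slopes.max() <= lip + 1e-8, name                       # Lipschitz in z
print("range and Lipschitz checks passed")
```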

C.4 Reduction to Rademacher complexity


We are interested in controlling the estimation error

Rℓ (fˆ) − Rℓ (f¯)

Let’s follow a similar development as we did with VC theory until we can’t go any further.
We start with the inequality
Rℓ (fˆ) − Rℓ (f¯) = Rℓ (fˆ) − Rℓ,n (fˆ) + Rℓ,n (fˆ) − Rℓ,n (f¯) + Rℓ,n (f¯) − Rℓ (f¯)
⩽ Rℓ (fˆ) − Rℓ,n (fˆ) + Rℓ,n (f¯) − Rℓ (f¯).
If we were to exactly follow the analysis in the VC theory problem, we would upper bound each of these terms with

sup_{f∈F} |Rℓ(f) − Rℓ,n(f)|,

and then use the bounded differences inequality to reduce to controlling the expectation

E[sup_{f∈F} |Rℓ(f) − Rℓ,n(f)|].

This time, we will be a bit more careful in order to avoid some additional technicalities
later. In particular, we will only use the supremum bound on the term involving fˆ, and
then apply the bounded differences inequality. We obtain the following result.
Proposition C.9. For all δ > 0, with probability at least 1 − δ,

Rℓ(fˆ) − Rℓ(f¯) ⩽ ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f))] + 2√(log(2/δ)/n).
Proof. Using the above inequalities and applying the sup bound to just the term involving
fˆ we have
Rℓ(fˆ) − Rℓ(f¯) ⩽ Rℓ(fˆ) − Rℓ,n(fˆ) + Rℓ,n(f¯) − Rℓ(f¯)
⩽ sup_{f∈F} (Rℓ(f) − Rℓ,n(f)) + Rℓ,n(f¯) − Rℓ(f¯).

We'll thus apply the bounded differences inequality to the function

g : ((x1, y1), . . . , (xn, yn)) ↦ sup_{f∈F} (Rℓ(f) − Rℓ,n(f)) + Rℓ,n(f¯) − Rℓ(f¯).

Using a similar argument as in Lecture 2, we can check that g satisfies the bounded differences condition with constants 2/n, so that for all t > 0,

P[sup_{f∈F} (Rℓ(f) − Rℓ,n(f)) + Rℓ,n(f¯) − Rℓ(f¯) − ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f)) + Rℓ,n(f¯) − Rℓ(f¯)] > t] ⩽ 2e^{−nt²/2}.

Notice that

ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f)) + Rℓ,n(f¯) − Rℓ(f¯)] = ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f))],

since ED[Rℓ,n(f¯)] = Rℓ(f¯).

Hence for all δ > 0, with probability at least 1 − δ, we have

sup_{f∈F} (Rℓ(f) − Rℓ,n(f)) + Rℓ,n(f¯) − Rℓ(f¯) ⩽ ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f))] + √(2 log(2/δ)/n).
The result follows by combining this inequality with the first two inequalities in the proof.

Recall that in our development of VC theory, we controlled the expected empirical process

ED[sup_{h∈H} |R(h) − Rn(h)|]

using the symmetrization technique. We'll use the exact same technique now to control

ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f))].

Introduce a "ghost sample" D′ := ((x′1, y′1), . . . , (x′n, y′n)) ∈ (X × [−1, 1])n drawn independently (in particular, independent of D := ((x1, y1), . . . , (xn, yn))) and identically distributed according to P. Let the corresponding empirical risk be R′ℓ,n, namely

R′ℓ,n(f) := (1/n) Σ_{i=1}^{n} ℓ(f(x′i), y′i).

Observe that

Rℓ(f) = ED′[R′ℓ,n(f)].

Using this fact implies

ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f))] = ED[sup_{f∈F} (ED′[R′ℓ,n(f)] − Rℓ,n(f))] ⩽ ED,D′[sup_{f∈F} (R′ℓ,n(f) − Rℓ,n(f))].

We next observe that for any choice of signs ε ∈ {−1, 1}n,

ED,D′[sup_{f∈F} (R′ℓ,n(f) − Rℓ,n(f))] = ED,D′[sup_{f∈F} (1/n) Σ_{i=1}^{n} εi (ℓ(f(xi), yi) − ℓ(f(x′i), y′i))].

Hence we can take each εi ∼ Rad(1/2) independently, and have that

ED,D′[sup_{f∈F} (R′ℓ,n(f) − Rℓ,n(f))] = ED,D′,ε[sup_{f∈F} (1/n) Σ_{i=1}^{n} εi (ℓ(f(xi), yi) − ℓ(f(x′i), y′i))].

We finally obtain

ED[sup_{f∈F} (Rℓ(f) − Rℓ,n(f))] ⩽ ED,D′,ε[sup_{f∈F} (1/n) Σ_{i=1}^{n} εi (ℓ(f(xi), yi) − ℓ(f(x′i), y′i))]
⩽ 2 ED,ε[sup_{f∈F} (1/n) Σ_{i=1}^{n} εi ℓ(f(xi), yi)].

We’ll now introduce a formal definition for the quantity that we have reduced the
problem to.

Definition C.10 (Rademacher complexity). Suppose S ⊂ Rn and ε = (εi)_{i=1}^{n} is such that each εi ∼ Rad(1/2) independently. Then the Rademacher complexity of S is written

R(S) := Eε[sup_{s∈S} (1/n)⟨s, ε⟩].

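As a quick illustration (a Python sketch with synthetic data of our own; the sizes are arbitrary), the Rademacher complexity of a small finite set can be estimated by Monte Carlo and compared with the finite-class bound recalled just below:

```python
import numpy as np

rng = np.random.default_rng(2)
n, num_vectors, trials = 50, 20, 5000

# A small finite S subset of R^n: each row is one vector s.
S = rng.uniform(-1.0, 1.0, size=(num_vectors, n))

# Monte Carlo estimate of R(S) = E_eps[ sup_{s in S} <s, eps> / n ].
eps = rng.choice([-1.0, 1.0], size=(trials, n))
estimate = (eps @ S.T / n).max(axis=1).mean()

# The finite-class bound: (1/n) * max_s ||s|| * sqrt(2 log |S|).
bound = np.linalg.norm(S, axis=1).max() * np.sqrt(2 * np.log(num_vectors)) / n
print(f"Monte Carlo R(S) ~ {estimate:.3f}   vs   bound {bound:.3f}")
```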
At this point in the VC theory analysis, we used the bound

R(S) ⩽ (1/n) sup_{a∈A} ∥a∥ √(2 log |A|),

applied to the set

S := {(1[h(xi) ̸= yi])_{i=1}^{n} : h ∈ H},

which let us reduce the problem to the shattering coefficients of the class H, which were then bounded by the Sauer-Shelah lemma.
We can't directly follow these steps since in our current setting, the relevant vectors are (ℓ(f(xi), yi))_{i=1}^{n} for f ∈ F, which are not necessarily binary valued. In fact, the set

{(ℓ(f(xi), yi))_{i=1}^{n} : f ∈ F}

is not necessarily finite! To handle this issue, we'll instead pursue a more general approach, which will turn out to be strictly stronger than the previous bound. The first step of this approach will be to get rid of the loss ℓ.
For a set S ⊂ [−1, 1]n , let’s use the notation

ℓ ◦ S := {(ℓ(si , yi ))ni=1 : s ∈ S}.

Then we have the following bound.


Lemma C.11 (Contraction lemma). We have

R(ℓ ◦ S) ⩽ LR(S).

Proof. Without loss of generality, we may assume L = 1.


We first show that for any 1-Lipschitz function ψ : R → R and any bounded subset V ⊂ R²,

sup_{(v1,v2)∈V} (v1 + ψ(v2)) + sup_{(v1,v2)∈V} (v1 − ψ(v2)) ⩽ sup_{(v1,v2)∈V} (v1 + v2) + sup_{(v1,v2)∈V} (v1 − v2). (C.2)

To see this, take (v1 , v2 ) ∈ V and (v1′ , v2′ ) ∈ V . Then using our assumption that ψ is
1-Lipschitz, we find

v1 + ψ(v2 ) + v1′ − ψ(v2′ ) ⩽ v1 + v1′ + |v2 − v2′ |.

We proceed by casework. If v2 ⩾ v′2, then this becomes

v1 + v′1 + |v2 − v′2| = v1 + v′1 + v2 − v′2 ⩽ sup_{(v1,v2)∈V} (v1 + v2) + sup_{(v1,v2)∈V} (v1 − v2).

And if v2 < v′2, it becomes

v1 + v′1 + |v2 − v′2| = v1 + v′1 + v′2 − v2 ⩽ sup_{(v1,v2)∈V} (v1 + v2) + sup_{(v1,v2)∈V} (v1 − v2).

Hence in either case, we find

v1 + ψ(v2) + v′1 − ψ(v′2) ⩽ sup_{(v1,v2)∈V} (v1 + v2) + sup_{(v1,v2)∈V} (v1 − v2).

Taking the supremum over (v1 , v2 ), (v1′ , v2′ ) ∈ V , we obtain (C.2).
With (C.2) in hand, let us proceed with the proof of our main claim. Observe that

R(ℓ ◦ S) = Eε[sup_{s∈S} ((1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) + (1/n) εn ℓ(sn, yn))]
= Eε1,...,εn−1[Eεn[sup_{s∈S} ((1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) + (1/n) εn ℓ(sn, yn))]].

Now, observe that for any fixed ε1, . . . , εn−1,

Eεn[sup_{s∈S} ((1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) + (1/n) εn ℓ(sn, yn))]
= (1/2) sup_{s∈S} ((1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) + (1/n) ℓ(sn, yn)) + (1/2) sup_{s∈S} ((1/n) Σ_{i=1}^{n−1} εi ℓ(si, yi) − (1/n) ℓ(sn, yn)).

We now apply (C.2) to the set

V := {(Σ_{i=1}^{n−1} εi ℓ(si, yi), sn) : s ∈ S},

with the 1-Lipschitz function ψ(z) := ℓ(z, yn). This yields


n−1 n−1
h 1X 1 i 1 1X 1
Eεn sup εi ℓ(si , yi ) + εn ℓ(sn , yn ) ⩽ sup εi ℓ(si , yi ) + sn
s∈S n n 2 s∈S n n
i=1 i=1
n−1
1 1 X 1
+ sup εi ℓ(si , yi ) − sn
2 s∈S n n
i=1
n−1
h 1X 1 i
= Eεn sup εi ℓ(si , yi ) + εn sn .
s∈S n n
i=1

Observe that we can repeat the same argument for each index i ∈ [n − 1], each time
replacing the ℓ(si , yi ) terms with si . The result follows.

Using the notation

F|D := {(f(xi))_{i=1}^{n} : f ∈ F},

and applying Lemma C.11 we find

R(ℓ ◦ F|D) ⩽ L R(F|D).

The following result summarizes our work in this section.


Theorem C.12. Suppose ℓ : [−1, 1] × [−1, 1] → [0, 1] is a loss which is L-Lipschitz in its first argument. Then for all δ > 0, with probability at least 1 − δ,

Rℓ(fˆ) − Rℓ(f¯) ⩽ L ED[R(F|D)] + √(2 log(2/δ)/n).
n

C.5 Entropy integral bound on Rademacher processes
Let S ⊂ Rn, and consider the Rademacher process ((1/n)⟨ε, s⟩)s∈S for ε1, . . . , εn ∼ Rad(1/2) iid. As we verified in the previous lecture, the increments Zs − Zs′ are σ(s, s′)²-sub-Gaussian for

σ(s, s′) = (1/n) ∥s − s′∥.

Hence we can apply the developed theory with (T, d) = (S, (1/n)∥·∥) to bound the Rademacher complexity R(S) := E[sup_{s∈S} (1/n)⟨ε, s⟩]. We obtain the following result.

Theorem C.13 (Bounding the Rademacher complexity with Dudley's entropy integral). For any S ⊂ Rn and ρ > 0,

R(S) ⩽ E[sup_{s,s′∈S : ∥s−s′∥⩽ρ} (1/n)⟨s − s′, ε⟩] + 12 ∫_{ρ/4}^{∞} √((1/n) log N(S, ∥·∥, τ)) dτ.

In particular,

R(S) ⩽ 12 ∫_{0}^{∞} √((1/n) log N(S, ∥·∥, τ)) dτ.

Combining this result with our reduction (from last lecture) of the estimation error for
ERM under general losses to the Rademacher complexity of the restricted classes

F|D := {(f (xi ))ni=1 : f ∈ F}

and changing the variable of integration, we obtain the following result.

Theorem C.14 (Estimation error of ERM by entropy integral). Suppose P is a distribution supported on X × [−1, 1]. Suppose ℓ : [−1, 1] × [−1, 1] → [0, 1] is L-Lipschitz in its first variable, and F is a set of functions from X to [−1, 1]. Let the population risk be defined

R(f) := E(x,y)∼P [ℓ(f(x), y)].

Let the minimizer of R(f) over f ∈ F be written f¯. Suppose D := ((x1, y1), . . . , (xn, yn)) ∈ (X × [−1, 1])n is an iid sample of size n from P. Let the minimizer of the corresponding empirical risk be written fˆ. Then for all δ > 0, with probability at least 1 − δ,

R(fˆ) − R(f¯) ⩽ (12L/√n) ED[∫_{0}^{∞} √(log N(F, ∥·∥n, τ)) dτ] + √(2 log(2/δ)/n),

where we define

∥f − f′∥n := ((1/n) Σ_{i=1}^{n} (f(xi) − f′(xi))²)^{1/2}, ∀f, f′ ∈ F.

C.6 Improved bounds for classes with finite VC dimension
You will prove the following Lemma in the second problems session.
Lemma C.15 (Covering number of classes with finite VC dimension). Suppose H is a set of hard classifiers such that VCdim(H) = d, and µ is any probability measure on X. Let

∥h − h′∥L2(µ) := (∫ (h(x) − h′(x))² dµ(x))^{1/2}.
Then for some universal constant C,

N(H, ∥·∥L2(µ), τ) ⩽ (C/τ)^{Cd}, ∀τ ∈ (0, 1).
Applying this Lemma to the empirical measures

µn := (1/n) Σ_{i=1}^{n} δ_{xi}

that arise in Theorem C.14, we can obtain the following improvement of our main upper bound from the VC theory analysis.
Theorem C.16 (Improved upper bound for binary classification). Suppose P is a dis-
tribution supported on X × {−1, 1}, and H is a set of hard classifiers on X such that
VCdim(H) = d. Let ĥ and h̄ be the empirical and population risk minimizers associated
to the 0-1 binary classification loss. Then there is a universal constant C such that for all
δ > 0, with probability at least 1 − δ,
R(ĥ) − R(h̄) ⩽ C√(d/n) + 2√(log(2/δ)/n).
Comparing with our main result from the VC theory lecture, we see that this improves by removing a √(log(n/d)) factor. With this improvement, the result becomes tight up to multiplicative constants [6, Chapter 6].
Proof. Define the loss ℓ(z, y) := (1/2) max(0, 1 − yz). Then it is easy to see that ℓ is 1/2-Lipschitz, and that if R and Rn denote the population and empirical risk associated to the 0-1 classification loss, then for any hard classifier h ∈ H,

R(h) = Rℓ(h), Rn(h) = Rℓ,n(h).
Hence Theorem C.14 implies

R(ĥ) − R(h̄) ⩽ (1/(2√n)) ED[∫_{0}^{1} √(log N(H, ∥·∥n, τ)) dτ] + 2√(log(2/δ)/n),
where we may truncate the integral at 1 since the diameter of H in any ∥ · ∥n norm is at
most 1. Plugging in the covering number bound from Lemma C.15 we obtain
R(ĥ) − R(h̄) ⩽ (1/(2√n)) ∫_{0}^{1} √(Cd log(C/τ)) dτ + 2√(log(2/δ)/n).
Observe that

∫_{0}^{1} √(Cd log(C/τ)) dτ ⩽ C√d

for (varying) universal constants C. The result follows.
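For completeness, the last integral bound follows from the substitution τ = Ce^{−u} (assuming, as we may, that C ⩾ 1): ∫_0^1 √(Cd log(C/τ)) dτ = √(Cd) · C ∫_{log C}^{∞} √u e^{−u} du, and the last integral is a finite absolute constant, which can be absorbed into the (varying) universal constant C.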

C.7 Statistical analysis of hard-SVM
Let’s consider the SVM approach to binary classification. We suppose again that P is
supported on X × {−1, 1}, and W is an RKHS over X with associated PSD kernel K. We
consider the following structural assumption on P.

Definition C.17 ((γ, D)-margin assumption). We say that P satisfies the (γ, D)-margin assumption if there exists w⋆ ∈ W such that ∥w⋆∥W ⩽ γ, and for P-almost every (x, y), we have K(x, x) ⩽ D² and 1 ⩽ y⟨w⋆, K(x, ·)⟩W.

We will show in this section that under the (γ, D) margin assumption, the max-
margin/hard-SVM solution solves the classification problem with dimension-free rates.
This is remarkable because the VC dimension of this set of classifiers may be large, if not
infinite.
Recall that for each w ∈ W we write the associated hard classifier hw (x) := sgn(⟨w, K(x, ·)⟩W ).
Given D := ((X1, Y1), . . . , (Xn, Yn)) ∈ (X × {−1, 1})n iid from P, let ŵ be a solution of the empirical hard-SVM problem

min{∥w∥² : w ∈ W, Yi⟨w, K(Xi, ·)⟩W ⩾ 1, ∀i ∈ [n]}.

Then we are interested in the risk of the hard classifier associated to ŵ, namely

R(hŵ ) := E(x,y)∼P [1[y ̸= sgn(⟨ŵ, K(x, ·)⟩W )]].
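As a computational aside (not part of the statistical analysis), the empirical hard-SVM problem can be approximated in practice by a soft-margin kernel SVM with a very large regularization parameter. Here is a minimal Python sketch using scikit-learn, on synthetic separable data of our own making; the particular kernel, the value C = 1e6, and the data-generating recipe are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Toy data that is separable with a margin (so the hard-SVM constraints are feasible).
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X = X + 0.3 * y[:, None]  # push the two classes apart to create a margin

# A very large C makes the soft-margin SVM behave like the hard-margin problem above.
clf = SVC(kernel="rbf", C=1e6).fit(X, y)

# Empirical 0-1 error of the associated hard classifier h_w(x) = sgn(<w, K(x, .)>_W).
print("training error:", np.mean(clf.predict(X) != y))
```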

Theorem C.18 (Dimension-free rates for hard-SVM). Suppose P satisfies the (γ, D)-margin assumption. Then for all δ > 0, with probability at least 1 − δ,

R(hŵ) ⩽ Dγ/(2√n) + √(log(2/δ)/n).

Proof. Consider the hinge loss

ℓ(z, y) := (1/2) max(0, 1 − yz).
For each w ∈ W , put
fw (x) := ⟨w, K(x, ·)⟩W ,
and set
F := {fw : w ∈ W, ∥w∥ ⩽ γ}.
We want to apply our results to this function class, but notice that ℓ(fw (x), y) may not
be in [0, 1], so we can’t directly use the theory we have developed in the past two lectures.
We circumvent this issue by considering the following truncation function: α(z) := 1 for z > 1, α(z) := z for z ∈ [−1, 1], and α(z) := −1 for z < −1.

Let α ◦ F := {α(fw ) : fw ∈ F} denote the truncated versions of each fw ∈ F.

We can apply Zhang’s Lemma with the hinge loss (the constants are written in lecture
3, and proven in the first problem session) to see that

R(hŵ ) = R(hŵ ) − R(hw⋆ ) ⩽ Rℓ (α ◦ fŵ ) − Rℓ (α ◦ fw⋆ ).

We now observe that α ◦ fŵ is an empirical risk minimizer for Rℓ,n, so we can apply Theorem 7 from the previous lecture to find that for all δ > 0, with probability at least 1 − δ,

R(hŵ) ⩽ (1/2) ED[R((α ◦ F)|D)] + 2√(log(2/δ)/n). (C.3)
Observe that α is 1-Lipschitz, so by the Contraction Lemma (from last lecture) we obtain

ED [R((α ◦ F)D )] ⩽ ED [R(F|D )].

We can now perform the main calculation of the proof:

ED[R(F|D)] = ED,ε[sup_{fw∈F} (1/n) Σ_{i=1}^{n} εi ⟨w, K(Xi, ·)⟩W]
= ED,ε[sup_{fw∈F} ⟨w, (1/n) Σ_{i=1}^{n} εi K(Xi, ·)⟩W]
⩽ γ ED,ε[∥(1/n) Σ_{i=1}^{n} εi K(Xi, ·)∥W]
⩽ γ (ED,ε[∥(1/n) Σ_{i=1}^{n} εi K(Xi, ·)∥W²])^{1/2}
= γ (ED,ε[(1/n²) Σ_{i,j=1}^{n} εi εj K(Xi, Xj)])^{1/2}
= γ (ED[(1/n²) Σ_{i=1}^{n} K(Xi, Xi)])^{1/2}
⩽ Dγ/√n.
Combining this bound with (C.3) we conclude.

References
[1] M. Belkin, “Fit without fear: remarkable mathematical phenomena of deep learning
through the prism of interpolation,” Acta Numerica, vol. 30, pp. 203–248, 2021.

[2] J. M. Borwein and A. S. Lewis, Convex analysis and nonlinear optimization, 2nd ed.,
ser. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer,
New York, 2006, vol. 3, theory and examples.

[3] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.

[4] S. Bubeck, “Convex optimization: algorithms and complexity,” Found. Trends Mach.
Learn., vol. 8, no. 3-4, pp. 231–357, 2015.

[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[6] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[7] R. Van Handel, “Probability in high dimension,” Lecture Notes (Princeton University),
2014.

[8] M. J. Wainwright, High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019, vol. 48.

