
Distribution-Free Uncertainty Quantification

A short introduction to Conformal Prediction

Margaux Zaffran
Useful resources on Conformal Prediction (non-exhaustive)

• Book reference: Vovk et al. (2005) (new edition in 2022)


• A gentle tutorial:
◦ Angelopoulos and Bates (2023)
◦ Videos playlist
• Another tutorial: Fontana et al. (2023)
• Ryan Tibshirani's introductory lecture notes
• GitHub repository with plenty of links: Manokhin (2022)

All material is freely accessible on this webpage, including sources.


Feel free to reuse it for presentations or teaching! If you do, please credit this
version, and include a link to this webpage.

This tutorial by Margaux Zaffran (2023) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
https://conformalpredictionintro.github.io
1 / 55
On the importance of quantifying uncertainty

[Figure: three panels of toy data, Y against X, all with the same point predictions]

↪ Same predictions, yet 3 distinct underlying phenomena!

⇒ Quantifying uncertainty conveys this information.

2 / 55
Quantile Regression

Split Conformal Prediction (SCP)

Avoiding data splitting: full conformal and out-of-bag approaches

Beyond exchangeability
Reminder about quantiles

• Quantile level β ∈ [0, 1]


• QX(β) := inf{x ∈ R, P(X ≤ x) ≥ β} = inf{x ∈ R, FX(x) ≥ β}
• Empirical quantile qβ(X1, . . . , Xn) := the ⌈β × n⌉-th smallest value of (X1, . . . , Xn)

Example of quantile: the median


β = 0.5
↪ q0.5(X1, . . . , Xn) is the empirical median of (X1, . . . , Xn);
↪ QX(0.5) represents the median of the distribution of X.

Similarly, let qβ,inf(X1, . . . , Xn) := the ⌊β × n⌋-th smallest value of (X1, . . . , Xn)
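As a quick illustration (my own sketch, not part of the slides), these two empirical quantiles can be computed directly from their rank definitions:

```python
import numpy as np

def q_beta(values, beta):
    """Empirical quantile q_beta: the ceil(beta * n)-th smallest value."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.ceil(beta * len(v)))
    return v[min(max(k, 1), len(v)) - 1]

def q_beta_inf(values, beta):
    """q_{beta,inf}: the floor(beta * n)-th smallest value."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(beta * len(v)))
    return v[min(max(k, 1), len(v)) - 1]

print(q_beta([3, 1, 4, 1, 5], 0.5))  # 3.0: the empirical median (3rd smallest of 5 values)
```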


3 / 55
Median regression

• The Bayes predictor depends on the chosen loss function.


↪ Bayes predictor f⋆ ∈ argmin_f Riskℓ(f) := argmin_f E[ℓ(Y, f(X))]
• Mean Absolute Error (MAE): ℓ(Y, Y′) = |Y − Y′|. Associated risk:
Riskℓ(f) = E[|Y − f(X)|]
⇒ f⋆(X) = median[Y|X] = QY|X(0.5)
[Figure: left, the MAE loss MAE(Y, f(X)) = |Y − f(X)| as a function of Y − f(X); right, toy data (Y vs X)]
4 / 55
Generalization: Quantile regression

• Quantile level β ∈ [0, 1]


• Pinball loss
ℓβ(Y, Y′) = β|Y − Y′| 1{Y − Y′ ≥ 0} + (1 − β)|Y − Y′| 1{Y − Y′ ≤ 0}
Associated risk: Riskℓβ(f) = E[ℓβ(Y, f(X))]
Bayes predictor: f⋆ ∈ argmin_f Riskℓβ(f)
⇒ f⋆(X) = QY|X(β)
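A small sketch (mine, not from the slides) of the pinball loss; numerically minimizing its empirical average over a constant prediction q recovers, approximately, the empirical β-quantile.

```python
import numpy as np

def pinball_loss(y, y_pred, beta):
    """Pinball loss: beta * (y - y_pred) if y >= y_pred, (1 - beta) * (y_pred - y) otherwise."""
    diff = y - y_pred
    return np.where(diff >= 0, beta * diff, (beta - 1) * diff)

# Sanity check: the constant minimizing the average pinball loss is close to the beta-quantile.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
grid = np.linspace(-3, 3, 601)
avg_loss = [pinball_loss(y, q, beta=0.9).mean() for q in grid]
print(grid[int(np.argmin(avg_loss))], np.quantile(y, 0.9))  # both close to the 0.9 quantile
```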

[Figure: the pinball loss ℓβ(Y, f(X)) as a function of Y − f(X), for β ∈ {0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95}]
5 / 55
Quantile regression: foundations

• Link between the pinball loss and the quantiles?


Set q⋆ ∈ argmin_q E[ℓβ(Y − q)]. Then,
0 = ∫_{−∞}^{+∞} ℓ′β(y − q⋆) dFY(y)
  = (β − 1) ∫_{−∞}^{q⋆} dFY(y) + β ∫_{q⋆}^{+∞} dFY(y)
0 = (β − 1) FY(q⋆) + β (1 − FY(q⋆))
(1 − β) FY(q⋆) = β (1 − FY(q⋆))
β = FY(q⋆)
⇔ q⋆ = FY⁻¹(β)

6 / 55
Quantile regression: visualisation

[Figure: left, the pinball losses ℓβ(Y, f(X)) for β ∈ {0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95}; right, the corresponding estimated conditional quantiles on toy data (Y vs X)]
Warning
No theoretical guarantee with a finite sample!
P(Y ∈ [Q̂Y|X(β/2); Q̂Y|X(1 − β/2)]) ≠ 1 − β

7 / 55
Quantile Regression

Split Conformal Prediction (SCP)


Standard regression case
Conformalized Quantile Regression (CQR)
Generalization of SCP: going beyond regression

Avoiding data splitting: full conformal and out-of-bag approaches

Beyond exchangeability
Quantifying predictive uncertainty

• (X , Y ) ∈ Rd × R random variables
• n training samples (Xi , Yi )ni=1
• Goal: predict an unseen point Yn+1 at Xn+1 with confidence
• How? Given a miscoverage level α ∈ [0, 1], build a predictive set Cα such that:
P {Yn+1 ∈ Cα (Xn+1 )} ≥ 1 − α, (1)
and Cα should be as small as possible, in order to be informative
For example: with α = 0.1, obtain a 90% coverage interval
• Construction of the predictive intervals should be
◦ agnostic to the model
◦ agnostic to the data distribution
◦ valid in finite samples

8 / 55
Quantile Regression

Split Conformal Prediction (SCP)


Standard regression case
Conformalized Quantile Regression (CQR)
Generalization of SCP: going beyond regression

Avoiding data splitting: full conformal and out-of-bag approaches

Beyond exchangeability
Split Conformal Prediction (SCP)1,2,3 : toy example

[Figure: toy data (Y vs X) split into Train, Cal and Test points]
1
Vovk et al. (2005), Algorithmic Learning in a Random World
2
Papadopoulos et al. (2002), Inductive Confidence Machines for Regression, ECML
3
Lei et al. (2018), Distribution-Free Predictive Inference for Regression, JRSS B
9 / 55
Split Conformal Prediction (SCP)1,2,3 : training step

▶ Learn (or get) µ̂
[Figure: the model µ̂ fitted on the proper training data (Y vs X)]
1
Vovk et al. (2005), Algorithmic Learning in a Random World
2
Papadopoulos et al. (2002), Inductive Confidence Machines for Regression, ECML
3
Lei et al. (2018), Distribution-Free Predictive Inference for Regression, JRSS B
10 / 55
Split Conformal Prediction (SCP)1,2,3 : calibration step

▶ Predict with µ̂
▶ Get the |residuals|, a.k.a. conformity scores
▶ Compute the (1 − α) empirical quantile of S = {|residuals|}Cal ∪ {+∞}, noted q1−α(S)
[Figure: calibration points (Y vs X) with µ̂ and the absolute residuals]
1
Vovk et al. (2005), Algorithmic Learning in a Random World
2
Papadopoulos et al. (2002), Inductive Confidence Machines for Regression, ECML
3
Lei et al. (2018), Distribution-Free Predictive Inference for Regression, JRSS B
11 / 55
Split Conformal Prediction (SCP)1,2,3 : prediction step

▶ Predict with µ̂
▶ Build Ĉα(x): [µ̂(x) ± q1−α(S)]
[Figure: test points with the prediction µ̂(x) and the interval µ̂(x) ± q1−α(S) (Y vs X)]
1
Vovk et al. (2005), Algorithmic Learning in a Random World
2
Papadopoulos et al. (2002), Inductive Confidence Machines for Regression, ECML
3
Lei et al. (2018), Distribution-Free Predictive Inference for Regression, JRSS B
12 / 55
SCP: implementation details


1. Randomly split the training data into a proper training set (size #Tr) and a
calibration set (size #Cal)
2. Get µ̂ by training the algorithm A on the proper training set
3. On the calibration set, get prediction values with µ̂
4. Obtain a set of #Cal + 1 conformity scores:
S = {Si = |µ̂(Xi) − Yi|, i ∈ Cal} ∪ {+∞}
(the +∞ accounts for the worst-case scenario)
5. Compute the (1 − α) quantile of these scores, noted q1−α(S)
6. For a new point Xn+1, return
Ĉα(Xn+1) = [µ̂(Xn+1) − q1−α(S); µ̂(Xn+1) + q1−α(S)]

13 / 55
SCP: implementation details


1. Randomly split the training data into a proper training set (size #Tr) and a
calibration set (size #Cal)
2. Get µ̂ by training the algorithm A on the proper training set
3. On the calibration set, get prediction values with µ̂
4. Obtain a set of #Cal conformity scores:
S = {Si = |µ̂(Xi) − Yi|, i ∈ Cal}
5. Compute the (1 − α)(1 + 1/#Cal) quantile of these scores, noted q1−α(S)
6. For a new point Xn+1, return
Ĉα(Xn+1) = [µ̂(Xn+1) − q1−α(S); µ̂(Xn+1) + q1−α(S)]
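A minimal end-to-end sketch of this recipe (my own illustration, not code from the slides), with a scikit-learn random forest as the black-box µ̂; the model class and the 1/3 calibration split are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def split_conformal(X, y, X_new, alpha=0.1):
    """Split Conformal Prediction with the absolute-residual score (sketch)."""
    # 1. random split into a proper training set and a calibration set
    X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=1/3, random_state=0)
    # 2. fit mu_hat on the proper training set
    mu_hat = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    # 3.-4. conformity scores on the calibration set
    scores = np.abs(mu_hat.predict(X_cal) - y_cal)
    # 5. the (1 - alpha)(1 + 1/#Cal) empirical quantile, i.e. the
    #    ceil((1 - alpha)(#Cal + 1))-th smallest score (+inf if this rank exceeds #Cal)
    n_cal = len(scores)
    k = int(np.ceil((1 - alpha) * (n_cal + 1)))
    q = np.sort(scores)[k - 1] if k <= n_cal else np.inf
    # 6. constant-width interval around the predictions at the new points
    pred = mu_hat.predict(X_new)
    return pred - q, pred + q
```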

13 / 55
SCP: theoretical foundation

Definition (Exchangeability)
(X1, Y1), . . . , (Xn, Yn) are exchangeable if, for any permutation σ of ⟦1, n⟧:
L((X1, Y1), . . . , (Xn, Yn)) = L((Xσ(1), Yσ(1)), . . . , (Xσ(n), Yσ(n))),

where L designates the joint distribution.

Examples of exchangeable sequences


• i.i.d. samples
• The components of a Gaussian vector N((m, . . . , m)ᵀ, Σ), where Σ has σ² on its diagonal and γ² everywhere off the diagonal

14 / 55
SCP: theoretical guarantees

SCP enjoys finite sample guarantees proved in Vovk et al. (2005); Lei et al. (2018).
Theorem
Suppose (X1, Y1), . . . , (Xn+1, Yn+1) are exchangeable⁴. SCP applied on (X1, Y1), . . . , (Xn, Yn) outputs Ĉα(·) such that:
P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α.

Additionally, if the scores {Si}i∈Cal ∪ {Sn+1} are a.s. distinct:
P{Yn+1 ∈ Ĉα(Xn+1)} ≤ 1 − α + 1/(#Cal + 1).

4
Only the calibration and test data need to be exchangeable.
15 / 55
Proof architecture of SCP guarantees

Lemma (Quantile lemma)


If (U1, . . . , Un, Un+1) are exchangeable, then for any β ∈ ]0, 1[:
P(Un+1 ≤ qβ(U1, . . . , Un, +∞)) ≥ β.

Additionally, if U1, . . . , Un, Un+1 are almost surely distinct, then:
P(Un+1 ≤ qβ(U1, . . . , Un, +∞)) ≤ β + 1/(n + 1).

When (X1, Y1), . . . , (Xn+1, Yn+1) are exchangeable, the scores {Si}i∈Cal ∪ {Sn+1} are exchangeable.
↪ Applying the quantile lemma to the scores concludes the proof.

16 / 55
Proof of the quantile lemma

First note that Un+1 ≤ qβ(U1, . . . , Un, +∞) ⟺ Un+1 ≤ qβ(U1, . . . , Un, Un+1).

Then, by definition of qβ:
Un+1 ≤ qβ(U1, . . . , Un, Un+1) ⟺ rank(Un+1) ≤ ⌈β(n + 1)⌉.

By exchangeability, rank(Un+1) ∼ U{1, . . . , n + 1}. Thus:
P(rank(Un+1) ≤ ⌈β(n + 1)⌉) ≥ ⌈β(n + 1)⌉ / (n + 1) ≥ β.

If U1, . . . , Un, Un+1 are almost surely distinct (without ties):
P(rank(Un+1) ≤ ⌈β(n + 1)⌉) = ⌈β(n + 1)⌉ / (n + 1) ≤ (1 + β(n + 1)) / (n + 1) = β + 1/(n + 1).
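A quick Monte Carlo sanity check of the lemma (my own sketch): with i.i.d., hence exchangeable, scores and no ties, the estimated probability should land between β and β + 1/(n + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, n_trials = 20, 0.9, 20_000
hits = 0
for _ in range(n_trials):
    u = rng.exponential(size=n + 1)              # U_1, ..., U_n, U_{n+1}
    k = int(np.ceil(beta * (n + 1)))             # rank defining q_beta(U_1, ..., U_n, +inf)
    q = np.sort(np.append(u[:n], np.inf))[k - 1]
    hits += (u[n] <= q)
print(hits / n_trials)  # ≈ 0.905, indeed in [beta, beta + 1/(n + 1)] = [0.90, 0.948]
```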

17 / 55
SCP: theoretical guarantees

SCP enjoys finite sample guarantees proved in Vovk et al. (2005); Lei et al. (2018).
Theorem
Suppose (X1, Y1), . . . , (Xn+1, Yn+1) are exchangeable⁴. SCP applied on (X1, Y1), . . . , (Xn, Yn) outputs Ĉα(·) such that:
P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α.

Additionally, if the scores {Si}i∈Cal ∪ {Sn+1} are a.s. distinct:
P{Yn+1 ∈ Ĉα(Xn+1)} ≤ 1 − α + 1/(#Cal + 1).
✗ Marginal coverage: P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α, without conditioning on Xn+1 = x (no conditional guarantee).

4
Only the calibration and test data need to be exchangeable.
18 / 55
Conditional coverage implies adaptiveness

• Marginal coverage, P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α: the errors may differ across regions of the input space (i.e. non-adaptive)
• Conditional coverage, P{Yn+1 ∈ Ĉα(Xn+1) | Xn+1} ≥ 1 − α: errors are evenly distributed (i.e. fully adaptive)
• Conditional coverage is stronger than marginal coverage

[Figure: three panels (Y vs X) illustrating no coverage, marginal coverage, and conditional coverage]

19 / 55
Standard mean-regression SCP is not adaptive

▶ Predict with µ̂
▶ Build Ĉα(x): [µ̂(x) ± q1−α(S)]
[Figure: the SCP interval µ̂(x) ± q1−α(S), whose width is constant in x (Y vs X)]

20 / 55
Informative conditional coverage as such is impossible

• Impossibility results
↪ Vovk (2012); Lei and Wasserman (2014); Barber et al. (2021a)
Without distribution assumption, in finite sample, a perfectly conditionally valid Ĉα is such that P{mes(Ĉα(x)) = ∞} ≥ 1 − α for any non-atomic x.

• Approximate conditional coverage
↪ Romano et al. (2020a); Guan (2022); Jung et al. (2023); Gibbs et al. (2023)
Target P(Yn+1 ∈ Ĉα(Xn+1) | Xn+1 ∈ R(x)) ≥ 1 − α
• Asymptotic (with the sample size) conditional coverage
↪ Romano et al. (2019); Kivaranovic et al. (2020); Chernozhukov et al. (2021); Sesia and Romano (2021); Izbicki et al. (2022)
Non-exhaustive references.
21 / 55
Quantile Regression

Split Conformal Prediction (SCP)


Standard regression case
Conformalized Quantile Regression (CQR)
Generalization of SCP: going beyond regression

Avoiding data splitting: full conformal and out-of-bag approaches

Beyond exchangeability
Conformalized Quantile Regression (CQR)5

[Figure: toy data (Y vs X) split into Train, Cal and Test points]
5
Romano et al. (2019), Conformalized Quantile Regression, NeurIPS
22 / 55
Conformalized Quantile Regression (CQR)5 : training step

▶ Learn (or get) Q̂R_lower and Q̂R_upper
[Figure: the fitted lower and upper quantile regressions on the proper training data (Y vs X)]

5
Romano et al. (2019), Conformalized Quantile Regression, NeurIPS
23 / 55
Conformalized Quantile Regression (CQR)5 : calibration step

▶ Predict with Q̂R_lower and Q̂R_upper
▶ Get the scores S = {Si}Cal ∪ {+∞}
▶ Compute the (1 − α) empirical quantile of S, noted q1−α(S)
↪ Si := max(Q̂R_lower(Xi) − Yi, Yi − Q̂R_upper(Xi))
[Figure: calibration points with the signs (+/−) of their conformity scores relative to the quantile regression band]

5
Romano et al. (2019), Conformalized Quantile Regression, NeurIPS
24 / 55
Conformalized Quantile Regression (CQR)5 : prediction step

▶ Predict with Q̂R_lower and Q̂R_upper
▶ Build Ĉα(x) = [Q̂R_lower(x) − q1−α(S); Q̂R_upper(x) + q1−α(S)]
[Figure: the CQR interval on test data, i.e. the quantile regression band enlarged by q1−α(S) on each side (Y vs X)]

5
Romano et al. (2019), Conformalized Quantile Regression, NeurIPS
25 / 55
CQR: implementation details


1. Randomly split the training data into a proper training set (size #Tr) and a
calibration set (size #Cal)
2. Get Q̂R_lower and Q̂R_upper by training the algorithm A on the proper training set
3. Obtain a set of #Cal + 1 conformity scores S:
S = {Si = max(Q̂R_lower(Xi) − Yi, Yi − Q̂R_upper(Xi)), i ∈ Cal} ∪ {+∞}
4. Compute the (1 − α) quantile of these scores, noted q1−α(S)
5. For a new point Xn+1, return
Ĉα(Xn+1) = [Q̂R_lower(Xn+1) − q1−α(S); Q̂R_upper(Xn+1) + q1−α(S)]

26 / 55
CQR: implementation details

1. Randomly split the training data into a proper training set (size #Tr) and a
calibration set (size #Cal)
2. Get Q̂R_lower and Q̂R_upper by training the algorithm A on the proper training set
3. Obtain a set of #Cal conformity scores S:
S = {Si = max(Q̂R_lower(Xi) − Yi, Yi − Q̂R_upper(Xi)), i ∈ Cal}
4. Compute the (1 − α)(1 + 1/#Cal) quantile of these scores, noted q1−α(S)
5. For a new point Xn+1, return
Ĉα(Xn+1) = [Q̂R_lower(Xn+1) − q1−α(S); Q̂R_upper(Xn+1) + q1−α(S)]
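A minimal CQR sketch along these lines (my own illustration, not code from the slides); the gradient-boosting quantile learner and the nominal levels α/2 and 1 − α/2 are common choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def cqr(X, y, X_new, alpha=0.1):
    """Conformalized Quantile Regression (sketch)."""
    # 1. random split into a proper training set and a calibration set
    X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=1/3, random_state=0)
    # 2. fit lower and upper quantile regressions on the proper training set
    qr_lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
    qr_up = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)
    # 3. conformity scores on the calibration set
    scores = np.maximum(qr_lo.predict(X_cal) - y_cal, y_cal - qr_up.predict(X_cal))
    # 4. the (1 - alpha)(1 + 1/#Cal) empirical quantile of the scores
    n_cal = len(scores)
    k = int(np.ceil((1 - alpha) * (n_cal + 1)))
    q = np.sort(scores)[k - 1] if k <= n_cal else np.inf
    # 5. enlarge the quantile regression band by q on each side
    return qr_lo.predict(X_new) - q, qr_up.predict(X_new) + q
```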

26 / 55
CQR: theoretical guarantees

This procedure enjoys the finite sample guarantee proposed and proved in Romano
et al. (2019).
Theorem
Suppose (X1, Y1), . . . , (Xn+1, Yn+1) are exchangeable⁶. CQR applied on (X1, Y1), . . . , (Xn, Yn) outputs Ĉα(·) such that:
P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α.

If, in addition, the scores {Si}i∈Cal ∪ {Sn+1} are almost surely distinct, then
P{Yn+1 ∈ Ĉα(Xn+1)} ≤ 1 − α + 1/(#Cal + 1).
Proof: application of the quantile lemma.


✗ Marginal coverage: P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α, without conditioning on Xn+1 = x (no conditional guarantee).

6
Only the calibration and test data need to be exchangeable.
27 / 55
Quantile Regression

Split Conformal Prediction (SCP)


Standard regression case
Conformalized Quantile Regression (CQR)
Generalization of SCP: going beyond regression

Avoiding data splitting: full conformal and out-of-bag approaches

Beyond exchangeability
SCP is defined by the conformity score function


1. Randomly split the training data into a proper training set (size #Tr) and a
calibration set (size #Cal)
2. Get Â by training the algorithm A on the proper training set
3. On the calibration set, obtain #Cal + 1 conformity scores
S = {Si = s(Â(Xi), Yi), i ∈ Cal} ∪ {+∞}
Ex 1: s(Â(Xi), Yi) := |µ̂(Xi) − Yi| in regression with standard scores
Ex 2: s(Â(Xi), Yi) := max(Q̂R_lower(Xi) − Yi, Yi − Q̂R_upper(Xi)) in CQR
4. Compute the (1 − α) quantile of these scores, noted q1−α(S)
5. For a new point Xn+1, return
Ĉα(Xn+1) = {y such that s(Â(Xn+1), y) ≤ q1−α(S)}
↪ The definition of the conformity scores is crucial, as they incorporate almost all the information: data + underlying model
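To make the role of the score function explicit, here is a small generic sketch (mine, not from the slides): the quantile step is shared, and only the score s changes between methods (the two lambdas mirror Ex 1 and Ex 2 above).

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """q_{1-alpha} of S = {calibration scores} U {+inf}: the ceil((1-alpha)(#Cal+1))-th smallest."""
    s = np.sort(np.append(cal_scores, np.inf))
    k = int(np.ceil((1 - alpha) * len(s)))
    return s[k - 1]

def prediction_set(candidate_scores, q):
    """Generic SCP output: keep every candidate y with s(A_hat(X_{n+1}), y) <= q."""
    return np.asarray(candidate_scores) <= q      # boolean mask over the candidates

# Ex 1 (standard regression score): s(A_hat(X), Y) = |mu_hat(X) - Y|
score_standard = lambda mu_hat_x, y: np.abs(mu_hat_x - y)
# Ex 2 (CQR score): s(A_hat(X), Y) = max(QR_lower(X) - Y, Y - QR_upper(X))
score_cqr = lambda qr_lo_x, qr_up_x, y: np.maximum(qr_lo_x - y, y - qr_up_x)
```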
28 / 55
SCP: theoretical guarantees

This procedure enjoys the finite sample guarantee proposed and proved in Vovk
et al. (2005).
Theorem
Suppose (X1, Y1), . . . , (Xn+1, Yn+1) are exchangeable⁷. SCP applied on (X1, Y1), . . . , (Xn, Yn) outputs Ĉα(·) such that:
P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α.

If, in addition, the scores {Si}i∈Cal ∪ {Sn+1} are almost surely distinct, then
P{Yn+1 ∈ Ĉα(Xn+1)} ≤ 1 − α + 1/(#Cal + 1).

Proof: application of the quantile lemma.


✗ Marginal coverage: P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α, without conditioning on Xn+1 = x (no conditional guarantee).

7
Only the calibration and test data need to be exchangeable.
29 / 55
SCP: what choices for the regression scores?
Ĉα(Xn+1) = {y such that s(Â(Xn+1), y) ≤ q1−α(S)}

Standard SCP (Vovk et al., 2005):
  s(Â(X), Y) = |µ̂(X) − Y|
  Ĉα(x) = [µ̂(x) ± q1−α(S)]
  ✓ black-box around a “usable” prediction
  ✗ not adaptive

Locally weighted SCP (Lei et al., 2018):
  s(Â(X), Y) = |µ̂(X) − Y| / ρ̂(X)
  Ĉα(x) = [µ̂(x) ± q1−α(S) ρ̂(x)]
  ✓ black-box around a “usable” prediction
  ✗ limited adaptiveness

CQR (Romano et al., 2019):
  s(Â(X), Y) = max(Q̂R_lower(X) − Y, Y − Q̂R_upper(X))
  Ĉα(x) = [Q̂R_lower(x) − q1−α(S); Q̂R_upper(x) + q1−α(S)]
  ✓ adaptive
  ✗ no black-box around a “usable” prediction

[Figure: the three resulting prediction bands on toy data (Y vs X)]
30 / 55
SCP: standard classification

• Y ∈ {1, . . . , C } (C classes)
• Â(X ) = (p̂1 (X ), . . . , p̂C (X )) (estimated probabilities)
• s(Â(X), Y) := 1 − (Â(X))Y = 1 − p̂Y(X)
• For a new point Xn+1, return
Ĉα(Xn+1) = {y such that s(Â(Xn+1), y) ≤ q1−α(S)}

31 / 55
SCP: standard classification in practice

Ex: Yi ∈ {“dog”, “tiger”, “cat”}, with α = 0.1

• Scores on the calibration set


Cali “dog” “dog” “dog” “tiger” “tiger” “tiger” “tiger” “cat” “cat” “cat”
p̂dog (Xi ) 0.95 0.90 0.85 0.15 0.15 0.20 0.15 0.15 0.25 0.20
p̂tiger (Xi ) 0.02 0.05 0.10 0.60 0.55 0.50 0.45 0.40 0.35 0.45
p̂cat (Xi ) 0.03 0.05 0.05 0.25 0.30 0.30 0.40 0.45 0.40 0.35
Si 0.05 0.1 0.15 0.40 0.45 0.50 0.55 0.55 0.6 0.65

• q1−α (S) = 0.65


• Â(Xn+1 ) = (0.05, 0.60, 0.35)
↪ s(Â(Xn+1), “dog”) = 0.95 > q1−α(S), so “dog” ∉ Ĉα(Xn+1)
↪ s(Â(Xn+1), “tiger”) = 0.40 ≤ q1−α(S), so “tiger” ∈ Ĉα(Xn+1)
↪ s(Â(Xn+1), “cat”) = 0.65 ≤ q1−α(S), so “cat” ∈ Ĉα(Xn+1)
• Ĉα(Xn+1) = {“tiger”, “cat”}
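A few lines (my own sketch) reproducing this toy computation with the score s(Â(X), Y) = 1 − p̂Y(X):

```python
import numpy as np

# Estimated probability of the TRUE class for each of the 10 calibration points (table above)
p_true = np.array([0.95, 0.90, 0.85, 0.60, 0.55, 0.50, 0.45, 0.45, 0.40, 0.35])
cal_scores = 1 - p_true                         # S_i = 1 - p_hat_{Y_i}(X_i)

alpha = 0.1
s = np.sort(np.append(cal_scores, np.inf))      # augment with +inf (worst case)
q = s[int(np.ceil((1 - alpha) * len(s))) - 1]
print(q)                                        # 0.65

classes = ["dog", "tiger", "cat"]
new_scores = 1 - np.array([0.05, 0.60, 0.35])   # scores of each candidate label at X_{n+1}
print([c for c, sc in zip(classes, new_scores) if sc <= q])  # ['tiger', 'cat']
```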

32 / 55
SCP: standard classification in practice, cont’d

Ex: Y ∈ {“dog”, “tiger”, “cat”}, with α = 0.1

• Scores on the calibration set


Cali “dog” “dog” “dog” “tiger” “tiger” “tiger” “tiger” “cat” “cat” “cat”
p̂dog (Xi ) 0.95 0.90 0.85 0.05 0.05 0.05 0.05 0.10 0.10 0.15
p̂tiger (Xi ) 0.02 0.05 0.10 0.85 0.80 0.75 0.70 0.25 0.30 0.30
p̂cat (Xi ) 0.03 0.05 0.05 0.10 0.15 0.20 0.25 0.65 0.60 0.55
Si 0.05 0.1 0.15 0.15 0.20 0.25 0.30 0.35 0.40 0.45
• q1−α (S) = 0.45
• Â(Xn+1 ) = (0.05, 0.60, 0.35)
↪ s(Â(Xn+1), “dog”) = 0.95 > q1−α(S), so “dog” ∉ Ĉα(Xn+1)
↪ s(Â(Xn+1), “tiger”) = 0.40 ≤ q1−α(S), so “tiger” ∈ Ĉα(Xn+1)
↪ s(Â(Xn+1), “cat”) = 0.65 > q1−α(S), so “cat” ∉ Ĉα(Xn+1)
• Ĉα(Xn+1) = {“tiger”}

33 / 55
SCP: limits of the standard classification case

The standard classification conformity score function leads to:

✓ smallest prediction sets on average
✗ undercovering hard subgroups (and overcovering easy ones)
(similar to the standard mean regression case!)

⇒ Other score functions can be built to improve adaptiveness


(as in regression with localized scores)

34 / 55
SCP: classification with Adaptive Prediction Sets8

1. Sort the estimated probabilities in decreasing order: p̂σ(1)(X) ≥ . . . ≥ p̂σ(C)(X)
2. s(Â(X), Y) := Σ_{k=1}^{σ⁻¹(Y)} p̂σ(k)(X) (sum of the estimated probabilities associated to classes at least as large as that of the true class Y)
3. Return the set of classes {σn+1(1), . . . , σn+1(r⋆)}, where
r⋆ = argmax_{1 ≤ r ≤ C} { Σ_{k=1}^{r} p̂σn+1(k)(Xn+1) < q1−α(S) } + 1
[Figure: bar plots of the estimated probabilities and of the cumulative estimated probabilities for the classes Dog, Tiger, Cat]
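A sketch of this Adaptive Prediction Sets rule (my own illustration, with integer-indexed classes; the ties/randomization refinements of the original paper are omitted).

```python
import numpy as np

def aps_score(p_hat, y):
    """APS conformity score: cumulative estimated probability down to (and including) the true class y."""
    order = np.argsort(-p_hat)                       # classes sorted by decreasing probability
    rank_of_y = int(np.where(order == y)[0][0])      # position of the true class in that ranking
    return float(np.sum(p_hat[order[:rank_of_y + 1]]))

def aps_set(p_hat_new, q):
    """Return sigma_{n+1}(1), ..., sigma_{n+1}(r*): the smallest prefix of the ranking
    whose cumulative probability reaches q."""
    order = np.argsort(-p_hat_new)
    cumsum = np.cumsum(p_hat_new[order])
    r_star = int(np.searchsorted(cumsum, q, side="left")) + 1
    return order[:min(r_star, len(order))]

# Toy check against the next slide: q = 0.95, A_hat(X_{n+1}) = (0.03, 0.95, 0.02) -> r* = 1
print(aps_set(np.array([0.03, 0.95, 0.02]), 0.95))   # [1], i.e. {"tiger"}
```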
8
Romano et al. (2020b), Classification with Valid and Adaptive Coverage, NeurIPS
Figure highly inspired by Angelopoulos and Bates (2023). 35 / 55
SCP: classification with Adaptive Prediction Sets in practice

Ex: Y ∈ {“dog”, “tiger”, “cat”}, with α = 0.1

• Scores on the calibration set


Cali “dog” “dog” “dog” “tiger” “tiger” “tiger” “tiger” “cat” “cat” “cat”
p̂dog (Xi ) 0.95 0.90 0.85 0.05 0.05 0.05 0.10 0.25 0.10 0.15
p̂tiger (Xi ) 0.02 0.05 0.10 0.85 0.80 0.75 0.75 0.40 0.30 0.30
p̂cat (Xi ) 0.03 0.05 0.05 0.10 0.15 0.20 0.15 0.35 0.60 0.55
Si 0.95 0.90 0.85 0.85 0.80 0.75 0.75 0.75 0.60 0.55
• q1−α (S) = 0.95
↪ Ex 1: Â(Xn+1) = (0.05, 0.45, 0.5), r⋆ = 2
Ĉα(Xn+1) = {“tiger”, “cat”}
↪ Ex 2: Â(Xn+1) = (0.03, 0.95, 0.02), r⋆ = 1
Ĉα(Xn+1) = {“tiger”}

36 / 55
Split Conformal Prediction: summary

• Simple procedure which quantifies the uncertainty of any predictive model Â


by returning predictive regions
• Finite-sample guarantees
• Distribution-free as long as the data are exchangeable (and so are the scores)
• Marginal theoretical guarantee over the joint (X, Y) distribution, and not conditional, i.e., no guarantee that for any x:
P{Yn+1 ∈ Ĉα(Xn+1) | Xn+1 = x} ≥ 1 − α.
↪ Marginal also over the whole calibration set and the test point!

37 / 55
Challenges: open questions (non-exhaustive!)

• Conditional coverage (∼ Previous Section)


• Computational cost vs statistical power (Next Section)
• Exchangeability (Last Section)

38 / 55
Quantile Regression

Split Conformal Prediction (SCP)

Avoiding data splitting: full conformal and out-of-bag approaches


Full Conformal Prediction
Jackknife+

Beyond exchangeability
Quantile Regression

Split Conformal Prediction (SCP)

Avoiding data splitting: full conformal and out-of-bag approaches


Full Conformal Prediction
Jackknife+

Beyond exchangeability
Splitting the data might not be desired

SCP suffers from data splitting:

• lower statistical efficiency (lower model accuracy and higher predictive set size)
• higher statistical variability

Can we avoid splitting the data set?

39 / 55
The naive idea does not enjoy valid coverage (even empirically)

• A naive idea:
◦ Get Â by training the algorithm A on {(X1, Y1), . . . , (Xn, Yn)}.
◦ Compute the empirical quantile q1−α(S) of the set of scores S = {s(Â(Xi), Yi)}_{i=1}^{n} ∪ {+∞}.
◦ Output the set {y such that s(Â(Xn+1), y) ≤ q1−α(S)}.

✗ Â has been obtained using the training set {(X1, Y1), . . . , (Xn, Yn)} but did not use Xn+1.
⇒ s(Â(Xn+1), Yn+1) stochastically dominates any element of {s(Â(Xi), Yi)}_{i=1}^{n}.

40 / 55
Full Conformal Prediction9 does not discard training points!

• Full (or transductive) Conformal Prediction


◦ avoids data splitting
◦ at the cost of many more model fits
• Idea: the most probable labels Yn+1 live in Y, and have a low enough conformity
score. By looping over all possible y ∈ Y, the ones leading to the smallest
conformity scores will be found.

9
Vovk et al. (2005), Algorithmic Learning in a Random World
41 / 55
Full Conformal Prediction (CP): recovering exchangeability

For any candidate (Xn+1, y):

1. Get Ây by training A on {(X1, Y1), . . . , (Xn, Yn)} ∪ {(Xn+1, y)}
2. Obtain a set of training scores
Sy^(train) = {s(Ây(Xi), Yi)}_{i=1}^{n} ∪ {s(Ây(Xn+1), y)}
and compute their (1 − α) empirical quantile q1−α(Sy^(train))

Output the set {y such that s(Ây(Xn+1), y) ≤ q1−α(Sy^(train))}.

✓ Test point treated in the same way as the training points
✗ Computationally costly
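A sketch of full conformal prediction for regression (my own illustration): the candidate labels y are taken on a finite grid, which is an approximation, and the ridge learner is an arbitrary choice; exact and accelerated variants are referenced later.

```python
import numpy as np
from sklearn.linear_model import Ridge

def full_conformal(X, y, x_new, alpha=0.1, grid_size=200):
    """Full (transductive) CP with the absolute-residual score, over a grid of candidate labels."""
    y_grid = np.linspace(y.min() - 3 * y.std(), y.max() + 3 * y.std(), grid_size)
    X_aug = np.vstack([X, x_new.reshape(1, -1)])
    kept = []
    for y_cand in y_grid:
        y_aug = np.append(y, y_cand)
        model = Ridge().fit(X_aug, y_aug)              # refit on the augmented data set
        scores = np.abs(model.predict(X_aug) - y_aug)  # n + 1 scores, candidate point included
        k = int(np.ceil((1 - alpha) * len(scores)))    # rank of the (1 - alpha) empirical quantile
        if scores[-1] <= np.sort(scores)[k - 1]:       # keep y_cand if its score is small enough
            kept.append(y_cand)
    return np.array(kept)                              # retained candidates (often an interval)
```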

42 / 55
Full CP: theoretical foundation

Definition (Symmetrical algorithm)


A deterministic algorithm A : (U1, . . . , Un) ↦ Â is symmetric if for any permutation σ of ⟦1, n⟧:
A(U1, . . . , Un) = A(Uσ(1), . . . , Uσ(n)) almost surely.

43 / 55
Full CP: theoretical guarantees

Full CP enjoys finite sample guarantees proved in Vovk et al. (2005).


Theorem
Suppose that

(i) (X1, Y1), . . . , (Xn+1, Yn+1) are exchangeable,
(ii) the algorithm A is symmetric.

Full CP applied on {(X1, Y1), . . . , (Xn, Yn)} ∪ {Xn+1} outputs Ĉα(·) such that:
P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α.

Additionally, if the scores are a.s. distinct:
P{Yn+1 ∈ Ĉα(Xn+1)} ≤ 1 − α + 1/(n + 1).

✗ Marginal coverage: P{Yn+1 ∈ Ĉα(Xn+1)} ≥ 1 − α, without conditioning on Xn+1 = x (no conditional guarantee).

44 / 55
Interpolation regime

FCP sets with an interpolating algorithm


Assume A interpolates:

• Â = A((x1, y1), . . . , (xn+1, yn+1))
• Â(xk) − yk = 0 for any k ∈ ⟦1, n + 1⟧

⇒ Full Conformal Prediction (with standard score functions) outputs Y (the whole label space) for any new test point!

45 / 55
Quantile Regression

Split Conformal Prediction (SCP)

Avoiding data splitting: full conformal and out-of-bag approaches


Full Conformal Prediction
Jackknife+

Beyond exchangeability
Jackknife: the naive idea does not enjoy valid coverage

• Based on leave-one-out (LOO) residuals

• Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} training data


• Get Â−i by training A on Dn \ {(Xi, Yi)}
• LOO scores S = {|Â−i(Xi) − Yi|}i ∪ {+∞} (in standard mean regression)
• Get Â by training A on Dn
• Build the predictive interval: [Â(Xn+1) ± q1−α(S)]

Warning
No guarantee on the prediction of Â with scores based on (Â−i)i, without assuming a form of stability on A.
46 / 55
Jackknife+10

• Based on leave-one-out (LOO) residuals

• Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} training data


• Get Â−i by training A on Dn \ {(Xi, Yi)}
• LOO predictions / predictive intervals:
Sup = {Â−i(Xn+1) + |Â−i(Xi) − Yi|}i ∪ {+∞},  Sdown = {Â−i(Xn+1) − |Â−i(Xi) − Yi|}i ∪ {−∞}
(in standard mean regression)
• Build the predictive interval: [qα,inf(Sdown); q1−α(Sup)]
Theorem
If Dn ∪ {(Xn+1, Yn+1)} are exchangeable and A is symmetric: P(Yn+1 ∈ Ĉα(Xn+1)) ≥ 1 − 2α.
10
Barber et al. (2021b), Predictive Inference with the jackknife+, The Annals of Statistics
Recall qβ,inf(X1, . . . , Xn) := the ⌊β × n⌋-th smallest value of (X1, . . . , Xn)
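A sketch of jackknife+ for mean regression (mine, not from the slides); it requires n leave-one-out refits, so it is only practical for cheap learners (here a ridge regression, an arbitrary choice).

```python
import numpy as np
from sklearn.linear_model import Ridge

def jackknife_plus(X, y, X_new, alpha=0.1):
    """Jackknife+ intervals with the absolute-residual score (sketch)."""
    n, m = len(y), len(X_new)
    up, down = [], []
    for i in range(n):
        mask = np.arange(n) != i
        model = Ridge().fit(X[mask], y[mask])          # leave-one-out fit A_{-i}
        resid = abs(model.predict(X[i:i + 1])[0] - y[i])
        pred = model.predict(X_new)
        up.append(pred + resid)
        down.append(pred - resid)
    up = np.vstack(up + [np.full(m, np.inf)])          # S_up, augmented with +inf
    down = np.vstack(down + [np.full(m, -np.inf)])     # S_down, augmented with -inf
    k_up = int(np.ceil((1 - alpha) * (n + 1)))         # rank for q_{1-alpha}(S_up)
    k_down = max(int(np.floor(alpha * (n + 1))), 1)    # rank for q_{alpha,inf}(S_down)
    return np.sort(down, axis=0)[k_down - 1], np.sort(up, axis=0)[k_up - 1]
```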
47 / 55
CV+11 (see also cross-conformal predictors: Vovk, 2015)

• Based on cross-validation residuals


• Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} training data
• Split Dn into K folds F1 , . . . , FK
• Get Â−Fk by training A on Dn \ Fk
• Cross-val predictions / predictive intervals:
Sup = {Â−Fk(Xn+1) + |Â−Fk(Xi) − Yi|, i ∈ Fk, k ∈ ⟦1, K⟧} ∪ {+∞},
Sdown = {Â−Fk(Xn+1) − |Â−Fk(Xi) − Yi|, i ∈ Fk, k ∈ ⟦1, K⟧} ∪ {−∞}
(in standard mean regression)
• Build the predictive interval: [qα,inf(Sdown); q1−α(Sup)]
Theorem
If Dn ∪ {(Xn+1, Yn+1)} are exchangeable and A is symmetric:
P(Yn+1 ∈ Ĉα(Xn+1)) ≥ 1 − 2α − min( 2(1 − 1/K)/(n/K + 1), (1 − K/n)/(K + 1) ) ≥ 1 − 2α − √(2/n).
11
Barber et al. (2021b), Predictive Inference with the jackknife+, The Annals of Statistics
Recall qβ,inf(X1, . . . , Xn) := the ⌊β × n⌋-th smallest value of (X1, . . . , Xn)
48 / 55
General overview

[Diagram: SCP, CV+, Jackknife+ and FCP placed along a statistical efficiency vs computational efficiency trade-off]
Nested Conformal Prediction

• Generalized framework encapsulating out-of-sample methods: Nested CP


(Gupta et al., 2022)
• Accelerating FCP: Nouretdinov et al. (2001); Lei (2019); Ndiaye and Takeuchi
(2019); Cherubin et al. (2021); Ndiaye and Takeuchi (2022); Ndiaye (2022)
Non-exhaustive references.
49 / 55
Quantile Regression

Split Conformal Prediction (SCP)

Avoiding data splitting: full conformal and out-of-bag approaches

Beyond exchangeability
Exchangeability does not hold in many practical applications

• CP requires exchangeable data points to ensure validity


✗ Covariate shift, i.e. LX changes but LY|X stays constant
✗ Label shift, i.e. LY changes but LX|Y stays constant
✗ Arbitrary distribution shift
✗ Possibly many shifts, not only one

50 / 55
Covariate shift (Tibshirani et al., 2019)12

• Setting:
◦ (X1, Y1), . . . , (Xn, Yn) i.i.d. ∼ PX × PY|X
◦ (Xn+1, Yn+1) ∼ P̃X × PY|X
• Idea: give more importance to calibration points that are closer in distribution to the test point
• In practice:
1. estimate the likelihood ratio w(Xi) = dP̃X(Xi) / dPX(Xi)
2. normalize the weights, i.e. ωi = ω(Xi) = w(Xi) / Σ_{j=1}^{n+1} w(Xj)
3. output Ĉα(Xn+1) = {y : s(Â(Xn+1), y) ≤ q1−α(Σ_{i∈Cal} ωi δSi + ωn+1 δ+∞)},
i.e. the quantile of the weighted empirical distribution of the calibration scores
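A sketch of the resulting weighted quantile step (my own illustration); it assumes the likelihood-ratio values w_cal (calibration points) and w_new (test point) have already been estimated.

```python
import numpy as np

def weighted_conformal_quantile(cal_scores, w_cal, w_new, alpha=0.1):
    """(1 - alpha) quantile of the weighted distribution
    sum_i omega_i * delta_{S_i} + omega_{n+1} * delta_{+inf}."""
    scores = np.append(cal_scores, np.inf)
    weights = np.append(w_cal, w_new)
    weights = weights / weights.sum()                        # normalized omega_i
    order = np.argsort(scores)
    cum = np.cumsum(weights[order])
    idx = int(np.searchsorted(cum, 1 - alpha, side="left"))  # first score reaching mass 1 - alpha
    return scores[order][idx]

# The prediction set is then {y : s(A_hat(X_new), y) <= weighted_conformal_quantile(...)}.
```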

12
Tibshirani et al. (2019), Conformal Prediction Under Covariate Shift, NeurIPS
51 / 55
Label shift (Podkopaev and Ramdas, 2021)13
• Setting:
◦ (X1, Y1), . . . , (Xn, Yn) i.i.d. ∼ PX|Y × PY
◦ (Xn+1, Yn+1) ∼ PX|Y × P̃Y
◦ Classification
• Idea: give more importance to calibration points that are closer in distribution to the test point
• Trouble: the actual test labels are unknown
• In practice:
1. estimate the likelihood ratio w(Yi) = dP̃Y(Yi) / dPY(Yi) using algorithms from the existing label shift literature
2. normalize the weights, i.e. ωi^y = ω^y(Xi) = w(Yi) / (Σ_{j=1}^{n} w(Yj) + w(y))
3. output Ĉα(Xn+1) = {y : s(Â(Xn+1), y) ≤ q1−α(Σ_{i∈Cal} ωi^y δSi + ωn+1^y δ+∞)}
13
Podkopaev and Ramdas (2021), Distribution-free uncertainty quantification for classification under label
shift, UAI 52 / 55
Generalizations

• Arbitrary distribution shift: Cauchois et al. (2020) leverages ideas from the
distributionally robust optimization literature
• Two major general theoretical results beyond exchangeability:
◦ Chernozhukov et al. (2018)
↪ If the learnt model is accurate and the data noise is strongly mixing, then CP is valid asymptotically ✓
◦ Barber et al. (2022)
↪ Quantifies the coverage loss depending on the strength of the exchangeability violation:
P(Yn+1 ∈ Ĉα(Xn+1)) ≥ 1 − α − (average violation of exchangeability by each calibration point)
↪ Proposed algorithm: reweighting again!
e.g., in a temporal setting, give higher weights to more recent points.

53 / 55
Online setting

• Data: T0 random variables (X1 , Y1 ), . . . , (XT0 , YT0 ) in Rd × R


• Aim: predict the response values as well as predictive intervals for T1 subsequent observations XT0+1, . . . , XT0+T1 sequentially: at any prediction step t ∈ ⟦T0 + 1, T0 + T1⟧, Yt−T0, . . . , Yt−1 have been revealed
• Build the smallest interval Ĉα^t such that:
P{Yt ∈ Ĉα^t(Xt)} ≥ 1 − α, for t ∈ ⟦T0 + 1, T0 + T1⟧,

often simplified into:
(1/T1) Σ_{t=T0+1}^{T0+T1} 1{Yt ∈ Ĉα^t(Xt)} ≈ 1 − α.

54 / 55
Recent developments

• Consider splitting strategies that respect the temporal structure


• Gibbs and Candès (2021) propose a method which reacts faster to temporal evolution
◦ Idea: track the previous coverages of the predictive intervals (1{Yt ∈ Ĉα(Xt)})
◦ Tool: update the empirical quantile level with a learning rate γ
◦ Asymptotic guarantee (on average) for any distribution (even adversarial)
• Zaffran et al. (2022) studies the influence of this learning rate γ and proposes, along with Gibbs and Candès (2022), a method that does not require choosing γ
• Bhatnagar et al. (2023) enjoys an anytime regret bound, by leveraging tools from the strongly adaptive regret minimization literature
• Bastani et al. (2022) proposes an algorithm achieving stronger coverage guar-
antees (conditional on specified overlapping subsets, and threshold calibrated)
without hold-out set
• Angelopoulos et al. (2023) combines CP ideas with control theory ones, to
adaptively improve the predictive intervals depending on the errors structure
Non-exhaustive references. 55 / 55
References i

Angelopoulos, A. N. and Bates, S. (2023). Conformal prediction: A gentle


introduction. Foundations and Trends® in Machine Learning, 16(4).
Angelopoulos, A. N., Candès, E. J., and Tibshirani, R. J. (2023). Conformal pid
control for time series prediction. arXiv: 2307.16895.
Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2021a). The limits
of distribution-free conditional predictive inference. Information and Inference: A
Journal of the IMA, 10(2).
Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2021b). Predictive
inference with the jackknife+. The Annals of Statistics, 49(1).
Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2022). Conformal
prediction beyond exchangeability. To appear in Annals of Statistics (2023).
References ii

Bastani, O., Gupta, V., Jung, C., Noarov, G., Ramalingam, R., and Roth, A.
(2022). Practical adversarial multivalid conformal prediction. In Advances in
Neural Information Processing Systems. Curran Associates, Inc.
Bhatnagar, A., Wang, H., Xiong, C., and Bai, Y. (2023). Improved online
conformal prediction via strongly adaptive online learning. In Proceedings of the
40th International Conference on Machine Learning. PMLR.
Cauchois, M., Gupta, S., Ali, A., and Duchi, J. C. (2020). Robust Validation:
Confident Predictions Even When Distributions Shift. arXiv: 2008.04267.
Chernozhukov, V., Wüthrich, K., and Yinchu, Z. (2018). Exact and Robust
Conformal Inference Methods for Predictive Machine Learning with Dependent
Data. In Conference On Learning Theory. PMLR.
Chernozhukov, V., Wüthrich, K., and Zhu, Y. (2021). Distributional conformal
prediction. Proceedings of the National Academy of Sciences, 118(48).
References iii

Cherubin, G., Chatzikokolakis, K., and Jaggi, M. (2021). Exact optimization of


conformal predictors via incremental and decremental learning. In Proceedings
of the 38th International Conference on Machine Learning. PMLR.
Fontana, M., Zeni, G., and Vantini, S. (2023). Conformal prediction: A unified
review of theory and new challenges. Bernoulli, 29(1).
Gibbs, I. and Candès, E. (2021). Adaptive conformal inference under distribution
shift. In Advances in Neural Information Processing Systems. Curran Associates,
Inc.
Gibbs, I. and Candès, E. (2022). Conformal inference for online prediction with
arbitrary distribution shifts. arXiv: 2208.08401.
Gibbs, I., Cherian, J. J., and Candès, E. J. (2023). Conformal prediction with
conditional guarantees. arXiv: 2305.12616.
References iv

Guan, L. (2022). Localized conformal prediction: a generalized inference


framework for conformal prediction. Biometrika, 110(1).
Gupta, C., Kuchibhotla, A. K., and Ramdas, A. (2022). Nested conformal
prediction and quantile out-of-bag ensemble methods. Pattern Recognition, 127.
Izbicki, R., Shimizu, G., and Stern, R. B. (2022). CD-split and HPD-split: Efficient
conformal regions in high dimensions. Journal of Machine Learning Research,
23(87).
Jung, C., Noarov, G., Ramalingam, R., and Roth, A. (2023). Batch multivalid
conformal prediction. In International Conference on Learning Representations.
Kivaranovic, D., Johnson, K. D., and Leeb, H. (2020). Adaptive, Distribution-Free
Prediction Intervals for Deep Networks. In International Conference on Artificial
Intelligence and Statistics. PMLR.
References v

Lei, J. (2019). Fast exact conformalization of the lasso using piecewise linear
homotopy. Biometrika, 106(4).
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018).
Distribution-Free Predictive Inference for Regression. Journal of the American
Statistical Association.
Lei, J. and Wasserman, L. (2014). Distribution-free prediction bands for
non-parametric regression. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 76(1).
Manokhin, V. (2022). Awesome conformal prediction.
https://github.com/valeman/awesome-conformal-prediction.
Ndiaye, E. (2022). Stable conformal prediction sets. In Proceedings of the 39th
International Conference on Machine Learning. PMLR.
References vi

Ndiaye, E. and Takeuchi, I. (2019). Computing full conformal prediction set with
approximate homotopy. In Advances in Neural Information Processing Systems.
Curran Associates, Inc.
Ndiaye, E. and Takeuchi, I. (2022). Root-finding approaches for computing
conformal prediction set. Machine Learning, 112(1).
Nouretdinov, I., Melluish, T., and Vovk, V. (2001). Ridge regression confidence
machine. In Proceedings of the 18th International Conference on Machine
Learning.
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive
Confidence Machines for Regression. In Machine Learning: ECML. Springer.
Podkopaev, A. and Ramdas, A. (2021). Distribution-free uncertainty quantification
for classification under label shift. In Proceedings of the Thirty-Seventh
Conference on Uncertainty in Artificial Intelligence. PMLR.
References vii

Romano, Y., Barber, R. F., Sabatti, C., and Candès, E. (2020a). With Malice
Toward None: Assessing Uncertainty via Equalized Coverage. Harvard Data
Science Review, 2(2).
Romano, Y., Patterson, E., and Candès, E. (2019). Conformalized Quantile
Regression. In Advances in Neural Information Processing Systems. Curran
Associates, Inc.
Romano, Y., Sesia, M., and Candes, E. (2020b). Classification with valid and
adaptive coverage. In Advances in Neural Information Processing Systems.
Curran Associates, Inc.
Sesia, M. and Romano, Y. (2021). Conformal prediction using conditional
histograms. In Advances in Neural Information Processing Systems. Curran
Associates, Inc.
References viii

Tibshirani, R. J., Barber, R. F., Candes, E., and Ramdas, A. (2019). Conformal
Prediction Under Covariate Shift. In Advances in Neural Information Processing
Systems. Curran Associates, Inc.
Vovk, V. (2012). Conditional Validity of Inductive Conformal Predictors. In Asian
Conference on Machine Learning. PMLR.
Vovk, V. (2015). Cross-conformal predictors. Annals of Mathematics and Artificial
Intelligence, 74(1-2).
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic Learning in a
Random World. Springer US.
Zaffran, M., Féron, O., Goude, Y., Josse, J., and Dieuleveut, A. (2022). Adaptive
conformal predictions for time series. In Proceedings of the 39th International
Conference on Machine Learning. PMLR.
References ix

This tutorial by Margaux Zaffran (2023) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
https://conformalpredictionintro.github.io
