
Greedy Policy Search:
A Simple Baseline for Learnable Test-Time Augmentation

Dmitry Molchanov∗ 1,2   Alexander Lyzhov∗ 1,3,4   Yuliya Molchanova∗ 1   Arsenii Ashukha∗ 1,2   Dmitry Vetrov 2,1

1 Samsung AI Center Moscow
2 Samsung-HSE Laboratory, National Research University Higher School of Economics
3 National Research University Higher School of Economics
4 Skolkovo Institute of Science and Technology

Abstract

Test-time data augmentation—averaging the predictions of a machine learning model across multiple augmented samples of data—is a widely used technique that improves predictive performance. While many advanced learnable data augmentation techniques have emerged in recent years, they are focused on the training phase. Such techniques are not necessarily optimal for test-time augmentation and can be outperformed by a policy consisting of simple crops and flips. The primary goal of this paper is to demonstrate that test-time augmentation policies can be successfully learned too. We introduce greedy policy search (GPS), a simple but high-performing method for learning a policy of test-time augmentation. We demonstrate that augmentation policies learned with GPS achieve superior predictive performance on image classification problems, provide better in-domain uncertainty estimation, and improve robustness to domain shift.

Figure 1: A sample from the test-time data augmentation policy learned by greedy policy search for EfficientNet-B5 on ImageNet. Averaging the predictions across samples from the policy outperforms the conventional multi-crop evaluation by a wide margin.

∗ Equal contribution

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

1 INTRODUCTION

Convolutional neural networks (CNNs) have become a de facto standard for problems with complex data that contain a lot of label-preserving symmetries. Such architectures use spatially invariant operations that have been specifically designed to reflect the symmetries present in data. These architectural choices are not enough on their own, so data augmentation, which artificially expands a dataset with label-preserving transformations, is used during training to further promote invariance to such symmetries.

Training with data augmentation has long been used to improve the predictive performance of machine learning and pattern recognition algorithms (Yaeger et al., 1997; Simard et al., 2003; Krizhevsky et al., 2012). Earlier techniques enlarge datasets with a handcrafted set of transformations, such as scaling, translation and rotation, and require manual tuning of augmentation strategies. Recent works explore learnable and more diverse strategies of data augmentation (Cubuk et al., 2019a,b; Lim et al., 2019). These strategies have become a standard component of training powerful deep learning models (Tan & Le, 2019).
Even when learning with data augmentation, CNNs are still not perfectly invariant to all the symmetries present in the data distribution. Therefore, test-time augmentation—averaging the predictions of a model across multiple augmentations of an object—often increases predictive performance. A special case of test-time augmentation called multi-crop evaluation has even become a standard evaluation protocol for large scale image classification (Krizhevsky et al., 2009; Simonyan & Zisserman, 2014; He et al., 2016). Test-time augmentation is, however, limited to simple transformations and usually does not benefit from using a more diverse augmentation policy, e.g. the one used during training.

In this work, we aim to demonstrate that test-time augmentation of images can benefit more from a wide range of diverse data augmentations if their composition is learned. We introduce greedy policy search (GPS), a simple algorithm that learns a policy for test-time data augmentation based on the predictive performance on a validation set. In an ablation study, we show that optimizing the calibrated log-likelihood (Ashukha et al., 2020) is a crucial part of the policy search algorithm, while the default objectives—accuracy and log-likelihood—lead to a significant drop in the final performance.

Our evaluation is performed on the following problems: conventional image classification, in-domain uncertainty estimation, and classification under dataset shift. We demonstrate that test-time augmentation policies found by GPS (see an example in Figure 1) outperform other data augmentation baselines significantly on a wide range of deep learning architectures, from VGG-style networks (Simonyan & Zisserman, 2014) to the recently proposed EfficientNets (Tan & Le, 2019). GPS provides consistent improvements in the performance of ensembles, of models trained with powerful train-time data augmentation techniques such as AutoAugment (Cubuk et al., 2019a) and RandAugment (Cubuk et al., 2019b), as well as of models trained without advanced data augmentation. We also show that the obtained policies transfer well across different architectures.

2 RELATED WORK

Test-time augmentation  Test-time data augmentation (TTA) has been present in deep learning research for a long time. Krizhevsky et al. (2012) averaged the predictions of an image classification model over random crops and flips of test data. This became a standard evaluation protocol (Krizhevsky et al., 2009; Simonyan & Zisserman, 2014; He et al., 2016). Shorten & Khoshgoftaar (2019) provided an extensive survey of data augmentation for deep learning, including test-time augmentation, pointing out several successful applications of TTA in medical imaging. As one example, Wang et al. (2019) show that TTA improves uncertainty estimation for medical image segmentation. Pang et al. (2019) demonstrated that mixup data augmentation (Zhang et al., 2017) can be applied during testing, improving defense against adversarial attacks on image classifiers.

Learnable train-time augmentation  Data augmentation is more commonly applied during training rather than during inference. Seeking to improve train-time augmentation, a recent line of works starting from Cubuk et al. (2019a) explored the practice of adapting it to the peculiarities of a specific dataset. AutoAugment (Cubuk et al., 2019a) learns an augmentation policy with reinforcement learning and requires a repetition of an expensive model training for each iteration of the policy search algorithm. Subsequent works proposed more efficient methods of policy search for training set augmentation (Ho et al., 2019; Cubuk et al., 2019b; Lim et al., 2019; Zhang et al., 2019).

Ensembling  Neural network ensembling—computing predictions using a distribution over neural networks instead of a single model—improves performance on various machine learning problems. Often, ensembling involves obtaining a set of trained neural networks and averaging their predictions on each test object. There are many methods of ensembling (Srivastava et al., 2014; Blundell et al., 2015; Lakshminarayanan et al., 2017; Huang et al., 2017), differing in time and memory requirements, diversity of ensemble members and performance.

Sub-ensemble selection  Even though a single model is used for TTA, it makes sense to see TTA as an ensemble of different models, each with its own augmentation sub-policy. The specific members of this ensemble can be selected from a variety of discrete possibilities. Historically, ensemble pruning methods have been applied to such optimization problems. Partridge & Yates (1996) introduced a heuristic that can serve as a rule for selecting ensemble members. Fan et al. (2002) and Caruana et al. (2004) described and used another, simpler, greedy ensemble pruning method, which is the one we adopt in this work for test-time augmentation.

3 LEARNABLE TEST-TIME AUGMENTATION

In this section we discuss the training of a test-time augmentation policy for image classification problems.
Policy  We define a test-time augmentation (TTA) policy P as a set of sub-policies {s_i(·)}. A sub-policy s(·) consists of N_s consecutively applied image transformations t_j(·, M_j), j ∈ {1, . . . , N_s}, where t_j is one of the predefined image operations, with M_j ≥ 0 being its magnitude. The transformations that we use and their respective typical magnitudes are listed in Appendix A. A visualization of these transforms is presented in Figure 13.

Inference  During inference, the predictions are averaged across samples of different sub-policies:

    π_θ^P(x) = (1/|P|) ∑_{s∈P} p(y | s(x), θ).    (1)
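For concreteness, a minimal PyTorch-style sketch of the averaging in Eq. (1) is given below. It assumes a trained classifier `model` that returns logits and represents a policy as a list of callables, one per sub-policy; the names are illustrative and not part of the released code.

```python
import torch

def tta_predict(model, x, policy):
    """Average predictions over the sub-policies of a TTA policy, Eq. (1).

    model  -- trained classifier returning logits, i.e. p(y | x, theta) up to softmax
    x      -- batch of images, shape (B, C, H, W)
    policy -- list of sub-policies; each is a callable mapping images to images
    """
    model.eval()
    avg = None
    with torch.no_grad():
        for sub_policy in policy:
            probs = torch.softmax(model(sub_policy(x)), dim=-1)  # p(y | s(x), theta)
            avg = probs if avg is None else avg + probs
    return avg / len(policy)                                     # pi_theta^P(x)
```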
3.1 Naive approaches to test-time augmentation holds for large number of samples. This example demon-
strates that the policy learned for training is not necessar-
Common test-time augmentation policies consist of sub- ily optimal for test-time augmentation.
policies that are sampled independently from a fixed dis-
tribution. For example, a single sub-policy may con-
sist of randomly resized crops and horizontal flips. A 3.2 Greedy policy search
potential alternative is to use the same policy that has
been learned for training (e.g. a policy obtained with We introduce greedy policy search (GPS) as a means
RandAugment (Cubuk et al., 2019b) or AutoAugment of demonstrating that learnable policy for test-time aug-
(Cubuk et al., 2019a)) to perform test-time augmenta- mentation can boost the predictive performance, uncer-
tion. A possible motivation behind this choice is that tainty estimates and robustness of deep learning models.
such a policy might reflect the specifics of a particular
Greedy policy search GPS starts with an empty pol-
dataset or architecture better.
icy and builds it in an iterative fashion. It searches for
For simplicity, we use a slightly modified set of PIL the sub-policy that provides the largest performance gain
transforms that is commonly used for learning the train- when added to the current policy. This selection step is
ing time augmentation policies as test-time augmentation repeated until a policy of the desired length is built. To
transformation options. make the procedure computationally efficient, we first
draw a pool of candidate sub-policies from a prior dis-
Our experiments indicate that in some cases (Figure 2) a
tribution over sub-policies p(s). We precompute the pre-
TTA policy that was learned for training performs worse
dictions on all these sub-policies so that the sub-policy
than the default policy consisting of random scalings,
selection step could be performed in the space of predic-
crops and flips. This means that the process of learn-
tions without passes through the neural network. Both
ing a policy for training does not necessarily result in a
the pool generation and the selection procedure are em-
good TTA policy. A natural alternative is to learn the
barrassingly parallel, so the resulting algorithm is effi-
TTA policy for a trained neural network by directly opti-
cient and easily scalable. The whole procedure is sum-
mizing some TTA performance objective. For example,
marized in Algorithm 1.
we can parameterize a policy with a magnitude parame-
ter shared across all transformations, as in RandAugment Optimization criterion The criteria of predictive per-
(Cubuk et al., 2019b), and find the optimal magnitude formance that are often used as objectives for policy, ar-
using grid search. As we show in Figure 12, the optimal chitecture or hyperparameter search are classification ac-
magnitude for test-time augmentation is different from curacy and log-likelihood. We find, however, that these
the optimal magnitude for training. To push the idea of criteria are ill-suited for TTA policy search. As we dis-
direct optimization of TTA performance further, we em- cuss in Section 4.2, the log-likelihood is unable to fairly
ploy the greedy ensemble pruning for TTA. The resulting judge the performance of test-time augmentation, and
method, greedy policy search, can be considered a sim- the accuracy is typically too noisy to provide an ade-
ple yet strong baseline for more advanced discrete op- quate signal for learning a well-performing TTA policy.
timization method like reinforcement learning, used in We follow Ashukha et al. (2020) and use the calibrated
AutoAugment (Cubuk et al., 2019a), or Bayesian opti- log-likelihood instead. The calibrated log-likelihood is
mization, used in Fast AutoAugment (Lim et al., 2019). defined as the log-likelihood measured after the post-
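As an illustration (not the authors' exact implementation), the calibrated log-likelihood can be computed roughly as follows: fit a single temperature τ on held-out predictions by minimizing the negative log-likelihood, then report the log-likelihood of the temperature-scaled predictions. The sketch below operates on the log of policy-averaged probabilities and uses illustrative names.

```python
import torch
import torch.nn.functional as F

def calibrated_log_likelihood(log_probs, labels, n_steps=200, lr=0.01):
    """Temperature-scale predictions and return the resulting mean log-likelihood.

    log_probs -- tensor (N, K): log of the policy-averaged predictive probabilities
    labels    -- tensor (N,): ground-truth class indices
    """
    log_tau = torch.zeros(1, requires_grad=True)   # optimize log(tau) so tau stays positive
    opt = torch.optim.Adam([log_tau], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # softmax(log_probs / tau) renormalizes the temperature-scaled predictions
        loss = F.cross_entropy(log_probs / log_tau.exp(), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return -F.cross_entropy(log_probs / log_tau.exp(), labels).item()
```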
Figure 3: An illustration of one step of the greedy policy search algorithm. Each step selects the sub-policy that provides the largest improvement in the calibrated log-likelihood of the ensembled predictions and adds it to the current policy.

Algorithm 1 Greedy Policy Search (GPS)
Require: Trained neural network p(y | x, θ)
Require: Validation data X_val, y_val
Require: Pool size B, policy size T
Require: Prior over sub-policies p(s)
  S ← ∅                                            ▷ Pool of candidate sub-policies
  for i ← 1 to B do
      s_i ∼ p(s)
      S ← S ∪ {s_i}                                ▷ Add s_i to the pool
      π_val^{s_i} ← p(y | s_i(X_val), θ)            ▷ Predict with s_i
  end for
  P ← ∅                                            ▷ GPS policy
  π_val^P ← 0                                      ▷ Predictions made with the GPS policy
  for t ← 1 to T do
      ▷ Choose the best sub-policy s* based on the calibrated log-likelihood on validation:
      s* ← argmax_{s∈S} cLL((t−1)/t · π_val^P + 1/t · π_val^s ; y_val)
      π_val^P ← (t−1)/t · π_val^P + 1/t · π_val^{s*}    ▷ Update predictions
      P ← P ∪ {s*}                                  ▷ Update policy
  end for
  return policy P
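A compact Python rendering of Algorithm 1 is sketched below under the same assumptions: the per-sub-policy validation predictions are precomputed, and the selection criterion (e.g. the calibrated log-likelihood sketched above) is passed in as a function. This is an illustration of the greedy selection step, not a verbatim copy of the released implementation.

```python
import torch

def greedy_policy_search(cand_probs, y_val, policy_size, score_fn):
    """Greedy policy search over precomputed predictions.

    cand_probs  -- tensor (B, N, K): predictive probabilities of the trained network
                   on N validation objects under each of B candidate sub-policies
    y_val       -- tensor (N,): validation labels
    policy_size -- number of sub-policies T to select (repetitions are allowed)
    score_fn    -- objective such as the calibrated log-likelihood; takes
                   (log_probs, labels) and returns a scalar, higher is better
    """
    selected = []                                   # indices of chosen sub-policies
    running_sum = torch.zeros_like(cand_probs[0])   # sum of predictions of chosen sub-policies
    for t in range(1, policy_size + 1):
        # Ensemble obtained by adding each candidate to the current policy:
        # (t - 1)/t * current average + 1/t * candidate.
        candidate_avg = (running_sum.unsqueeze(0) + cand_probs) / t        # (B, N, K)
        scores = [score_fn(candidate_avg[b].clamp_min(1e-12).log(), y_val)
                  for b in range(cand_probs.shape[0])]
        best = int(torch.tensor(scores).argmax())
        selected.append(best)
        running_sum = running_sum + cand_probs[best]
    return selected
```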
we first train all our models with the same stratified train-
validation split (we use 45000 objects for training and
5000 objects for validation), and perform GPS or magni-
hoc temperature scaling (Guo et al., 2017). The tem-
tude grid search on the validation set. We then retrain all
perature scaling is typically performed by optimizing the
models on the full training set, and evaluate them with
validation log-likelihood w.r.t. the temperature τ of the
the obtained policies. Since we did not train the Im-
softmax(·/τ ) function used to obtain the predictions.
ageNet models, we split the validation set in half with
Our experiments show that the calibrated log-likelihood
a stratified split, use the first half for policy search and
is the key ingredient of GPS. This objective is suited for
report the results for the second half. We use approxi-
learning TTA policies better than both accuracy and con-
mately 1000 sub-policies in the candidate pools for GPS,
ventional uncalibrated log-likelihood.
and describe the construction of the pools in Appendix A.
Evaluation Following Ashukha et al. (2020), we use
4 EXPERIMENTS the calibrated log-likelihood as our main evaluation met-

We perform experiments with greedy policy search on a 2


https://github.com/tensorflow/tpu/
variety of architectures on CIFAR-10/100 and ImageNet tree/master/models/official/efficientnet
classification problems. On CIFAR-10/100 datasets 3
https://github.com/rwightman/
(Krizhevsky et al., 2009), we use VGG16 (Simonyan & pytorch-image-models
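The candidate pool itself is drawn from the prior p(s) over modified RandAugment sub-policies. A rough sketch of such sampling is shown below; the operation list is an illustrative placeholder, since the exact transformations and magnitude ranges are specified in Appendix A of the paper rather than here.

```python
import random

# Illustrative subset of PIL-style operations; the full set and the magnitude
# ranges used in the paper are listed in its Appendix A.
OPERATIONS = ["Identity", "ShearX", "ShearY", "TranslateX", "TranslateY",
              "Rotate", "Color", "Contrast", "Brightness", "Sharpness"]

def sample_sub_policy(n_transforms=3, max_magnitude=45.0):
    """Draw one sub-policy from p(s): a sequence of N transforms whose
    magnitudes are sampled uniformly, as in the modified RandAugment."""
    return [(random.choice(OPERATIONS), random.uniform(0.0, max_magnitude))
            for _ in range(n_transforms)]

# A candidate pool of roughly a thousand sub-policies for GPS.
pool = [sample_sub_policy() for _ in range(1000)]
```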
Figure 4: Performance of various test-time augmentation strategies on the clean test set of CIFAR-100 (top; 100 augmentation samples) and ImageNet (bottom; 1–20 augmentation samples). CC: central crop. CF: random crops and horizontal flips. Tr: augmentation used for training (modified RandAugment with M = 45). M*: modified RandAugment with M found by grid search. 5/10C: 5/10-crop evaluation (four corner crops and one center crop for 5C; the five crops with horizontal flips for 10C). Greedy policy search (GPS) consistently outperforms all other methods in both the calibrated log-likelihood and accuracy. The results for CIFAR-100 have been averaged over five runs of TTA.

Evaluation  Following Ashukha et al. (2020), we use the calibrated log-likelihood as our main evaluation metric for in-domain uncertainty estimation, and we reuse their "test-time cross-validation" procedure to perform calibration. The test set is divided in half, the optimal temperature is found on the first split, and the metrics are evaluated on the second split. We average the metrics across five random splits. While it is possible to optimize the temperature on a validation set, we stick with test-time cross-validation for convenience, since the optimal temperature is different for each TTA policy and for each number of TTA samples (see Figure 11 for details). The optimal temperature has a very low variance, and the values found on the validation set closely match the values found during test-time cross-validation.
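A rough sketch of this test-time cross-validation protocol is given below, again with illustrative names: the temperature is fitted on one half of the test predictions and the calibrated log-likelihood is measured on the other half, averaged over several random splits.

```python
import torch
import torch.nn.functional as F

def fit_temperature(log_probs, labels, n_steps=200, lr=0.01):
    """Fit a softmax temperature by minimizing the NLL on the given split."""
    log_tau = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_tau], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        F.cross_entropy(log_probs / log_tau.exp(), labels).backward()
        opt.step()
    return log_tau.exp().detach()

def test_time_cross_validation(log_probs, labels, n_splits=5, seed=0):
    """Average the calibrated log-likelihood over random half/half test splits."""
    gen = torch.Generator().manual_seed(seed)
    n = labels.shape[0]
    scores = []
    for _ in range(n_splits):
        perm = torch.randperm(n, generator=gen)
        calib, evalu = perm[: n // 2], perm[n // 2:]
        tau = fit_temperature(log_probs[calib], labels[calib])
        scores.append(-F.cross_entropy(log_probs[evalu] / tau, labels[evalu]).item())
    return sum(scores) / len(scores)
```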

4.1 In-domain predictive performance

Greedy policy search achieves better predictive performance compared to all of the following: conventional test-time augmentation techniques (e.g. random crops and flips), reuse of the policy learned during training, and a more advanced baseline (RandAugment with magnitude grid search). The results for CIFAR-100 and ImageNet are presented in Figure 4, the results for CIFAR-10 are presented in Figure 17, and numerical results can be found in Tables 2 and 3.

When using the same number of samples, GPS has the same test-time computational complexity as vanilla test-time augmentation or the standard multi-crop evaluation, yet achieves better predictive performance. Once the GPS policy is found or transferred from a different model or dataset, the gain in predictive performance can be obtained for free.

Aside from test-time data augmentation, there are other techniques that allow one to use ensembling during test time with almost no training overhead. Such methods as variational inference (Blundell et al., 2015), dropout (Srivastava et al., 2014) and the K-FAC Laplace approximation are praised as ways to hide an ensemble inside a single model using a stochastic computation graph. It was, however, recently shown that these techniques are typically significantly outperformed by test-time augmentation with random crops and flips (Ashukha et al., 2020) on conventional image classification benchmarks (CIFAR and ImageNet classification). Since GPS outperforms vanilla TTA, it outperforms these techniques as well. However, GPS can be combined with ensembling techniques to further improve their performance (see Section 4.5).

4.2 What metric to use for policy search?

Any policy search procedure that relies on optimizing the validation performance requires a metric to optimize. Common predictive performance metrics are classification accuracy and log-likelihood.

Both of these metrics have problems. The plain log-likelihood cannot be used for a fair comparison of different techniques, especially in the test-time augmentation setting (Ashukha et al., 2020).
Figure 5: Mean unnormalized corruption error (muCE) on corrupted versions of the CIFAR datasets for various test-time augmentation strategies: random crops and horizontal flips (CF), modified RandAugment with M found by grid search (M*) and the GPS policy (GPS). Learnable TTA methods are run on clean, uncorrupted data. In most cases, GPS policies are more robust to the domain shift compared to alternatives.

Table 1: Performance of greedy policy search using different metrics as a search objective, measured on the CIFAR-100 dataset. Calibrated log-likelihood results in superior performance across all tasks and metrics. The results have been averaged over five runs of TTA.

                GPS criterion   VGG               ResNet110         WideResNet
Accuracy (%)    Acc.            81.17 ± 0.15      83.01 ± 0.18      85.71 ± 0.10
                LL              81.89 ± 0.07      83.55 ± 0.09      86.22 ± 0.05
                cLL             82.21 ± 0.17      83.54 ± 0.06      86.44 ± 0.05
cLL             Acc.            −0.837 ± 0.003    −0.691 ± 0.001    −0.661 ± 0.003
                LL              −0.640 ± 0.001    −0.560 ± 0.001    −0.489 ± 0.001
                cLL             −0.623 ± 0.001    −0.552 ± 0.001    −0.479 ± 0.001

Figure 6: Mean corruption error (mCE) on ImageNet-C for various test-time augmentation strategies: central crop (CC), random scale-crop-flip transformation (CF), GPS policy trained on the clean data (GPS). GPS policy outperforms non-learnable test-time augmentation strategies under domain shift.

The authors suggest switching to the calibrated log-likelihood (cLL) instead. The problem with the log-likelihood is that it can dismiss a good model that happened to be miscalibrated but can be fixed by temperature scaling. With test-time augmentation it is often the case that the optimal temperature of the predictive distribution changes drastically with the number of samples (see Figure 11). The accuracy, in turn, appears to be too noisy to provide a robust learning signal for greedy optimization.

To evaluate the influence of the objective function, we run GPS for a VGG, a PreResNet110 and a WideResNet28x10 on the CIFAR-100 dataset. The pool of candidate sub-policies and the resulting policy length are kept the same for all methods, as described in Section 4. We evaluate three different objectives for GPS: classification accuracy, log-likelihood and calibrated log-likelihood. The results are presented in Table 1. We find that optimizing the calibrated log-likelihood consistently outperforms the other metrics in terms of both accuracy and calibrated log-likelihood.

To better see how the metrics fail, we evaluate test-time RandAugment policies with different magnitudes M. As one can see from Figure 12, the optimal value of M is different for different metrics. The accuracy is too noisy to reliably find the optimal M. The log-likelihood provides a very conservative value of M, since large magnitudes decalibrate the model. On the contrary, the calibrated log-likelihood does not suffer from this problem and results in a better value of M.

4.3 Robustness to domain shift

Despite the natural human ability to correctly recognize an object in an image with visual perturbations, neural networks are typically very sensitive to changes in the data distribution. As of now, models suffer a significant performance loss even under a slight domain shift (Ovadia et al., 2019). To explore how different test-time augmentation strategies influence the robustness to domain shift, we use the benchmark proposed by Hendrycks & Dietterich (2018).

We perform an evaluation of TTA methods on the CIFAR-10-C, CIFAR-100-C and ImageNet-C datasets with 15 corruptions C from the groups noise, blur, weather and digital. These datasets consist of the test sets of the corresponding original datasets with corruption transforms c ∈ C applied at five severity levels s, 1 ≤ s ≤ 5. For a given corruption c at severity level s we compute the error rate E_{c,s}. On the CIFAR datasets, for each corruption we compute the unnormalized corruption error uCE_c = (1/5) ∑_{s=1}^{5} E_{c,s}, as proposed by Hendrycks et al. (2020), whereas for ImageNet-C we normalize the corruption error by the central-crop performance of AlexNet: CE_c = ∑_{s=1}^{5} E_{c,s} / ∑_{s=1}^{5} E_{c,s}^{AlexNet}. We obtain the final metric, muCE or mCE, by averaging the corruption errors (uCE_c or CE_c) over the different corruptions c ∈ C. We report these metrics for the policies found using the clean validation data (the same policies as in the other experiments), and compare our method with several baselines. The results are presented in Figures 5 and 6 and in Tables 4 and 5.
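As a small worked illustration of these metrics (assuming the per-corruption, per-severity error rates have already been computed), muCE and mCE can be aggregated as in the sketch below; the AlexNet baseline errors for ImageNet-C are an input here, not values reported in this paper.

```python
def mean_unnormalized_ce(errors):
    """muCE for CIFAR-10/100-C.

    errors -- dict: corruption name -> list of error rates E_{c,s} for severities s = 1..5
    """
    per_corruption = {c: sum(e) / len(e) for c, e in errors.items()}   # uCE_c
    return sum(per_corruption.values()) / len(per_corruption)

def mean_ce(errors, alexnet_errors):
    """mCE for ImageNet-C, normalized by AlexNet's central-crop errors."""
    per_corruption = {c: sum(errors[c]) / sum(alexnet_errors[c]) for c in errors}  # CE_c
    return sum(per_corruption.values()) / len(per_corruption)
```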
We use the same stratified validation-test split as the one we used for policy search. It should be noted that ImageNet-C has a different data format compared with ImageNet: it consists of images with pre-applied central cropping, which shrinks the resolution down to 224×224. For this experiment, we use the same magnitudes for the scale and crop transforms as before for all the considered policies, even though these magnitudes were set on full-resolution images. Although such a choice may not be optimal, it is consistent, and it still leads to a substantial improvement over the central crop baseline. Ideally, the ImageNet-C dataset should be modified to contain corrupted full-resolution images in order to establish a unified benchmark for models designed for different resolutions and for non-standard inference techniques such as test-time data augmentation.

Even though the corruptions of ImageNet-C slightly intersect with the augmentation transformations used during training, this does not favor GPS over the other methods.

Surprisingly, policies trained on clean validation data work decently on corrupted data. In most cases, GPS outperforms both the conventional baselines and RandAugment with the optimal (for the clean validation set) magnitude M*. Somewhat counter-intuitively, we find that extreme augmentations (see Figure 14) of data that is already corrupted lead to a significant performance boost as compared to conservative crops and flips. Not only does this demonstrate the efficiency of learnable TTA, it also shows that the policy does not overfit to clean data and consists of augmentations that are useful in other settings.

Although ensembling is a popular way to mitigate dataset shift (Ovadia et al., 2019), we do not compare model ensembles with TTA in this setting. As noted by Ashukha et al. (2020), and as we show in Section 4.5, ensembling and test-time augmentation are complementary practices and can be combined to boost performance. We expect this combination to work well in the setting of domain shift.

Figure 7: The change in cLL when switching from a GPS policy learned for one dataset-architecture pair to a GPS policy learned for another dataset-architecture pair. Policy transfer outperforms random crops and flips in all considered cases. Negative numbers mean that TTA works best when the policy is evaluated on the same architecture and dataset as used for policy search. The results have been averaged over five runs of TTA.

Figure 8: Policies learned with GPS for ResNet-50 (GPS R50), EfficientNet B2 (GPS B2), and EfficientNet B5 (GPS B5) transfer well to the larger EfficientNet L2 architecture and outperform the conventional baselines: the random scale-crop-flip transformation (CF) and multi-crop evaluation with 5 crops and 2 horizontal flips per crop (10C).

4.4 Policy transfer

We evaluate the policies found by GPS on other architectures and datasets in order to test their generality. The change in calibrated log-likelihood when transferring the policies across CIFAR datasets and architectures is reported in Figure 7. The decrease in performance is not dramatic, and the transferred policies still outperform
Figure 9: Greedy policy search improves the predictive performance of ensembles (CIFAR-100; ensemble of five WideResNet28x10 networks; 100 augmentation samples). CC: central crop. CF: random crops and horizontal flips. Tr: augmentation used for training (modified RandAugment with M = 45). "M* 1": modified RandAugment with M* = 35 found by grid search for a single model. "GPS 1": GPS is applied to a single model, and the ensemble is evaluated using the resulting policy. "GPS ens": GPS is applied to the whole ensemble. The results have been averaged over five runs of TTA.

Figure 10: Greedy policy search (GPS) for models trained with vanilla augmentation (random crops and flips) still outperforms vanilla test-time augmentation (CIFAR-100; VGG, ResNet110 and WideResNet with 1, 5 and 100 augmentation samples). CC: central crop. CF: random crops and horizontal flips. GPS: greedy policy search. The results for CIFAR-100 have been averaged over five runs of TTA.

standard random crop and flip augmentations. We observe that keeping the same dataset during transfer is more important than keeping the same architecture.

We also transfer the GPS policies found on ImageNet for ResNet50, EfficientNet-B2 and EfficientNet-B5 to an even larger architecture, EfficientNet-L2, and show the results in Figure 8. We observe that all of these policies transfer well to a larger architecture and significantly outperform the vanilla test-time augmentation policy and multi-crop evaluation.

We do not transfer policies from CIFAR to ImageNet and vice versa, since the image preprocessing for these datasets is different.

4.5 Greedy policy search for ensembles

A deep ensemble (Lakshminarayanan et al., 2017) is a simple yet powerful technique that achieves state-of-the-art results in in-domain and out-of-domain uncertainty estimation (Ovadia et al., 2019; Ashukha et al., 2020). Ashukha et al. (2020) have shown that deep ensembles can be improved for free using test-time augmentation. We show that deep ensembles can be improved even further by using a learnable test-time augmentation policy.

We use an ensemble of five WideResNet28x10 models, trained independently using the same training procedure as we used for training individual models (modified RandAugment training with N = 3 and M = 45).

There are several ways to apply GPS to an ensemble. The simplest way is to perform GPS for a single model, and then evaluate the whole ensemble using that policy. Another way is to perform GPS for the ensemble directly, using the same sub-policy for every member of the ensemble. Other modifications can include searching for a separate policy for each member of the ensemble. We test the first two options (denoted "GPS single" and "GPS ensemble" respectively), and leave other possible directions for future research.
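As a sketch of the "GPS ensemble" option (illustrative, not the released code), the only change relative to the single-model search is that the precomputed per-sub-policy predictions are first averaged over the ensemble members, and the same greedy selection is then run on the result.

```python
import torch

def ensemble_candidate_probs(member_probs):
    """Average per-sub-policy validation predictions over ensemble members.

    member_probs -- tensor (M, B, N, K): probabilities of M independently trained
                    models under B candidate sub-policies on N validation objects
    Returns a (B, N, K) tensor that can be passed to the same greedy search
    used in the single-model case.
    """
    return member_probs.mean(dim=0)
```

The "GPS single" variant instead runs the search on one member's predictions and reuses the resulting policy for the whole ensemble.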
The results are presented in Figure 9. They are consistent with the findings of the previous sections. Even a grid search for the optimal magnitude in test-time RandAugment is enough to significantly outperform random crops and flips. GPS improves the performance even further. Transferring the policy from a single model to an ensemble ("GPS single") performs worse than applying GPS to the whole ensemble directly; however, both variants of GPS outperform the other baselines.

The combination of ensembling methods and test-time augmentation usually provides meaningful benefits to predictive performance (Ashukha et al., 2020). Because of this, we expect these results to also hold for other ensembling methods that are more efficient in terms of training time than deep ensembles.

4.6 Greedy policy search for models trained with vanilla augmentation

While we mainly tested GPS for models trained with advanced data augmentation methods like RandAugment, it can be applied to any image classification model. To further study the breadth of applicability of GPS, we apply it to models trained with standard (vanilla) data augmentation. While the learned augmentation policy is less diverse than the policy learned for models trained with RandAugment (see Figure 14), GPS still manages to find a policy that significantly outperforms standard crops and flips on CIFAR-100 (see Figures 10 and 17 for the comparison). Even though the models learned with standard data augmentation are less robust to RandAugment perturbations (see Figure 12), they can benefit from some of the transformations. The magnitude of the transformations is almost twice as low as in the policies for RandAugment models, and the identity transform is chosen much more often (see Figure 16).

5 CONCLUSION

We have designed a simple yet powerful greedy policy search method for test-time augmentation and tested it in a broad empirical evaluation. To highlight the general idea that switching to a learnable test-time augmentation strategy is beneficial, we aimed to keep the policy search simple rather than to tweak it for maximum performance. Our findings can be summarized as follows:

• We show that the learned test-time augmentation policies consistently provide superior predictive performance and uncertainty estimates compared to existing approaches to test-time augmentation. We report a significant improvement for both clean (in-domain) data and corrupted data (under domain shift).

• We find that the calibrated log-likelihood is a superior objective for learning test-time augmentation strategies as compared to LL or accuracy. This finding may have important implications in adjacent fields such as meta-learning and neural architecture search, where the target (meta-)objective is often chosen to be either accuracy or plain validation log-likelihood with no calibration.

• We show the policies obtained with our method to be transferable between different architectures. This means that transferring policies found for small architectures to large architectures is a viable strategy if computational resources are limited.

There are many promising directions for future research on trainable test-time data augmentation. One potential area of improvement is the design of dynamic object-dependent TTA policies, as opposed to the static object-independent policies used in this paper. Intuitively, this might be especially helpful under domain shift, as an object-dependent policy has the potential to alleviate it.

Acknowledgements

Dmitry Vetrov and Dmitry Molchanov were supported by the Russian Science Foundation grant no. 19-71-30020. This research was supported in part through computational resources of HPC facilities at NRU HSE.

References

Ashukha, Arsenii, Lyzhov, Alexander, Molchanov, Dmitry, and Vetrov, Dmitry. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxI5gHKDr.

Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Caruana, Rich, Niculescu-Mizil, Alexandru, Crew, Geoff, and Ksikes, Alex. Ensemble selection from libraries of models. In Proceedings of the twenty-first international conference on Machine learning, pp. 18, 2004.

Cubuk, Ekin D, Zoph, Barret, Mane, Dandelion, Vasudevan, Vijay, and Le, Quoc V. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123, 2019a.

Cubuk, Ekin D, Zoph, Barret, Shlens, Jonathon, and Le, Quoc V. RandAugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019b.

Fan, Wei, Chu, Fang, Wang, Haixun, and Yu, Philip S. Pruning and dynamic scheduling of cost-sensitive ensembles. In AAAI/IAAI, pp. 146–151, 2002.

Guo, Chuan, Pleiss, Geoff, Sun, Yu, and Weinberger, Kilian Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Hendrycks, Dan and Dietterich, Thomas G. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.

Hendrycks, Dan, Mu, Norman, Cubuk, Ekin Dogus, Zoph, Barret, Gilmer, Justin, and Lakshminarayanan, Balaji. AugMix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1gmrxHFvB.

Ho, Daniel, Liang, Eric, Chen, Xi, Stoica, Ion, and Abbeel, Pieter. Population based augmentation: Efficient learning of augmentation policy schedules. In International Conference on Machine Learning, pp. 2731–2741, 2019.

Huang, Gao, Li, Yixuan, Pleiss, Geoff, Liu, Zhuang, Hopcroft, John E, and Weinberger, Kilian Q. Snapshot ensembles: Train 1, get M for free. arXiv preprint arXiv:1704.00109, 2017.

Krizhevsky, Alex, Hinton, Geoffrey, et al. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lakshminarayanan, Balaji, Pritzel, Alexander, and Blundell, Charles. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

Lim, Sungbin, Kim, Ildoo, Kim, Taesup, Kim, Chiheon, and Kim, Sungwoong. Fast AutoAugment. In Advances in Neural Information Processing Systems, pp. 6662–6672, 2019.

Ovadia, Yaniv, Fertig, Emily, Ren, Jie, Nado, Zachary, Sculley, D, Nowozin, Sebastian, Dillon, Joshua V, Lakshminarayanan, Balaji, and Snoek, Jasper. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.

Pang, Tianyu, Xu, Kun, and Zhu, Jun. Mixup inference: Better exploiting mixup to defend adversarial attacks. arXiv preprint arXiv:1909.11515, 2019.

Partridge, Derek and Yates, William B. Engineering multiversion neural-net systems. Neural Computation, 8(4):869–893, 1996.

Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in PyTorch. 2017.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Shorten, Connor and Khoshgoftaar, Taghi M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

Simard, Patrice Y, Steinkraus, David, Platt, John C, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, 2003.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Tan, Mingxing and Le, Quoc. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.

Wang, Guotai, Li, Wenqi, Aertsen, Michael, Deprest, Jan, Ourselin, Sébastien, and Vercauteren, Tom. Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing, 338:34–45, 2019.

Xie, Cihang, Tan, Mingxing, Gong, Boqing, Wang, Jiang, Yuille, Alan, and Le, Quoc V. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665, 2019.

Xie, Qizhe, Luong, Minh-Thang, Hovy, Eduard, and Le, Quoc V. Self-training with Noisy Student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698, 2020.

Yaeger, Larry S, Lyon, Richard F, and Webb, Brandyn J. Effective training of a neural network character classifier for word recognition. In Advances in Neural Information Processing Systems, pp. 807–816, 1997.

Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N, and Lopez-Paz, David. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhang, Xinyu, Wang, Qiang, Zhang, Jian, and Zhong, Zhao. Adversarial AutoAugment. arXiv preprint arXiv:1912.11188, 2019.
