Test-Time Augmentation for CNNs
1 Samsung AI Center Moscow
2 Samsung-HSE Laboratory, National Research University Higher School of Economics
3 National Research University Higher School of Economics
4 Skolkovo Institute of Science and Technology
Abstract
Algorithm 1 Greedy Policy Search (GPS)
Require: Trained neural network $p(y \mid x, \theta)$
Require: Validation data $X_{val}, y_{val}$
Require: Pool size $B$, policy size $T$
Require: Prior over sub-policies $p(s)$
  $S \leftarrow \emptyset$    ▷ Pool of candidate sub-policies
  for $i \leftarrow 1$ to $B$ do
    $s_i \sim p(s)$
    $S \leftarrow S \cup \{s_i\}$    ▷ Add $s_i$ to pool
    $\pi_{val}^{s_i} \leftarrow p(y \mid s_i(X_{val}), \theta)$    ▷ Predict with $s_i$
  end for
  $P \leftarrow \emptyset$    ▷ GPS policy
  $\pi_{val}^{P} \leftarrow 0$    ▷ Predictions made with GPS policy
  for $t \leftarrow 1$ to $T$ do
    ▷ Choose the best sub-policy $s^*$ based on calibrated log-likelihood on validation:
    $s^* \leftarrow \arg\max_{s \in S} \mathrm{cLL}\left(\frac{t-1}{t} \pi_{val}^{P} + \frac{1}{t} \pi_{val}^{s};\ y_{val}\right)$
    $\pi_{val}^{P} \leftarrow \frac{t-1}{t} \pi_{val}^{P} + \frac{1}{t} \pi_{val}^{s^*}$    ▷ Update predictions
    $P \leftarrow P \cup \{s^*\}$    ▷ Update policy
  end for
  return policy $P$

[...] post-hoc temperature scaling (Guo et al., 2017). The temperature scaling is typically performed by optimizing the validation log-likelihood w.r.t. the temperature $\tau$ of the softmax$(\cdot/\tau)$ function used to obtain the predictions. Our experiments show that the calibrated log-likelihood is the key ingredient of GPS. This objective is suited for learning TTA policies better than both accuracy and conventional uncalibrated log-likelihood.

4 EXPERIMENTS

[...] Zisserman, 2014), PreResNet110 (He et al., 2016) and WideResNet28x10 (Zagoruyko & Komodakis, 2016). On ImageNet (Russakovsky et al., 2015), we use ResNet50 and EfficientNet B2/B5/L2 (Tan & Le, 2019). PyTorch (Paszke et al., 2017) is used for all experiments. The source code is available at https://github.com/bayesgroup/gps-augment.

Training. CIFAR models were trained for 2000 epochs using a modified version of RandAugment with N = 3 transformations for each image, where the magnitude of each transformation for each image has been drawn from the uniform distribution $M \sim U[0, 45]$. We provide the details of training these models in Appendix A.

We reused the publicly available snapshots of ImageNet models. EfficientNets B2/B5 were trained with vanilla RandAugment, EfficientNet L2 was trained with Noisy Student (Xie et al., 2020) and RandAugment, ResNet50 was trained with AugMix (Hendrycks et al., 2020) and RandAugment.

Policy search. To obtain the results on CIFAR datasets, we first train all our models with the same stratified train-validation split (we use 45000 objects for training and 5000 objects for validation), and perform GPS or magnitude grid search on the validation set. We then retrain all models on the full training set, and evaluate them with the obtained policies. Since we did not train the ImageNet models, we split the validation set in half with a stratified split, use the first half for policy search and report the results for the second half. We use approximately 1000 sub-policies in the candidate pools for GPS, and describe the construction of the pools in Appendix A.

Evaluation. Following Ashukha et al. (2020), we use the calibrated log-likelihood as our main evaluation metric for in-domain uncertainty estimation, and we reuse their “test-time cross-validation” procedure to perform calibration. The test set is divided in half, the optimal temperature is found on the first split, and the metrics are evaluated on the second split. We average the metrics across five random splits. While it is possible to optimize the temperature on a validation set, we stick with test-time cross-validation for convenience since the optimal temperature is different for each TTA policy and for each number of samples for TTA (see Figure 11 for details). The optimal temperature has a very low variance, and the values found on the validation set closely match the values found during test-time cross-validation.

4.1 In-domain predictive performance

Greedy policy search achieves better predictive performance compared to all of the following: conventional test-time augmentation techniques (e.g. random crops and flips), reuse of policy learned during training, and a more advanced baseline (RandAugment with magnitude grid search). The results for CIFAR-100 and ImageNet are presented in Figure 4, and the results for CIFAR-10 are presented in Figure 17; numerical results can be found in Tables 2 and 3.

Figure 4: Performance of various test-time augmentation strategies on clean test set of CIFAR-100 dataset (top) and ImageNet (bottom). CC: central crop. CF: random crops and horizontal flips. Tr: augmentation used for training (modified RandAugment with M = 45). M∗: modified RandAugment with M found by grid search. 5/10C: 5/10-crop evaluation (four corner crops, one center crop for 5C; five crops with horizontal flips for 10C). Greedy policy search (GPS) consistently outperforms all other methods in both the calibrated log-likelihood and accuracy. The results for CIFAR-100 have been averaged over five runs of TTA.

When using the same number of samples, GPS has the same test-time computational complexity as vanilla test-time augmentation or the standard multi-crop evaluation, yet achieves a better predictive performance. Once the GPS policy is found or transferred from a different model or dataset, the gain in the predictive performance can be obtained for free.

Aside from test-time data augmentation, there are other techniques that allow one to use ensembling during test time with almost no training overhead. Such methods as variational inference (Blundell et al., 2015), dropout (Srivastava et al., 2014), K-FAC Laplace approximation are praised as ways to hide an ensemble inside a single model using a stochastic computation graph. It was, however, recently shown that these techniques are typically significantly outperformed by test-time augmentation with random crops and flips (Ashukha et al., 2020) in conventional image classification benchmarks (CIFAR and ImageNet classification). Since GPS outperforms vanilla TTA, it outperforms these techniques as well. However, GPS can be combined with ensembling techniques to further improve their performance (see Section 4.5).

4.2 What metric to use for policy search?

Any policy search procedure that relies on optimizing the validation performance requires a metric to optimize. Common predictive performance metrics are classification accuracy and log-likelihood.

Both of these metrics have problems. The plain log-likelihood cannot be used for a fair comparison of different techniques, especially in the test-time augmentation setting (Ashukha et al., 2020). The authors suggest switching to calibrated log-likelihood (cLL) instead. The problem with the log-likelihood is that it can dismiss a good model that happened to be miscalibrated, but can be fixed by temperature scaling. With test-time augmentation it is often the case that the optimal temperature of the predictive distribution drastically changes with the number of samples (see Figure 11). The accuracy, in turn, appears to be too noisy to provide robust learning signal for greedy optimization.

To evaluate the influence of the objective function, we run GPS for a VGG, a PreResNet110 and a WideResNet28x10 on CIFAR-100 dataset. The pool of candidate sub-policies and the resulting length of sub-policy is kept the same for all methods, as described in Section 4. We evaluate three different objectives for GPS: classification accuracy, log-likelihood and calibrated log-likelihood. The results are presented in Table 1. We find that optimizing the calibrated log-likelihood consistently outperforms other metrics in terms of both accuracy and calibrated log-likelihood.

Table 1: Performance of greedy policy search using different metrics as a search objective, measured on CIFAR-100 dataset. Calibrated log-likelihood results in superior performance across all tasks and metrics. The results have been averaged over five runs of TTA.
LL      81.89 ± 0.07      83.55 ± 0.09      86.22 ± 0.05
cLL     82.21 ± 0.17      83.54 ± 0.06      86.44 ± 0.05
Acc.    −0.837 ± 0.003    −0.691 ± 0.001    −0.661 ± 0.003

To better see how the metrics fail, we evaluate test-time RandAugment policies with different magnitudes M. As one can see from Figure 12, the optimal value of M is different for different metrics. The accuracy is too noisy to reliably find the optimal M. The log-likelihood provides a very conservative value of M since large magnitudes decalibrate the model. On the contrary, the calibrated log-likelihood does not suffer from this problem and results in a better value of M.

4.3 Robustness to domain shift

Despite the natural human ability to correctly recognize an object given an image with visual perturbations, neural networks are typically very sensitive to changes in the data distribution. As of now, models suffer a significant performance loss even under a slight domain shift (Ovadia et al., 2019). To explore how different test-time augmentation strategies influence the robustness to domain shift, we use the benchmark proposed by Hendrycks & Dietterich (2018).

Figure 5: Mean unnormalized corruption error (muCE) on corrupted versions of CIFAR datasets for various test-time augmentation strategies: random crops and horizontal flips (CF), modified RandAugment with M found by grid search (M∗) and GPS policy (GPS). Learnable TTA methods are run on clean, uncorrupted data. In most cases, GPS policies are more robust to the domain shift compared to alternatives. (Panels: 5 aug. samples, 100 aug. samples.)

Figure 6: Mean corruption error (mCE) on ImageNet-C for various test-time augmentation strategies: central crop (CC), random scale-crop-flip transformation (CF), GPS policy trained on the clean data (GPS). GPS policy outperforms non-learnable test-time augmentation strategies under domain shift.

We perform an evaluation of TTA methods on CIFAR-10-C, CIFAR-100-C and ImageNet-C datasets with 15 corruptions C from groups noise, blur, weather and digital. These datasets consist of the test sets of the
corresponding original datasets with applied corruption transforms $c \in C$ with five different severity levels $s$, $1 \le s \le 5$. For a given corruption $c$ at severity level $s$ we compute the error rate $E_{c,s}$. On CIFAR datasets for each corruption we compute the unnormalized corruption error $\mathrm{uCE}_c = \frac{1}{5} \sum_{s=1}^{5} E_{c,s}$, as proposed by Hendrycks et al. (2020), whereas for ImageNet-C we normalize the errors by those of AlexNet: $\mathrm{CE}_c = \sum_{s=1}^{5} E_{c,s} \big/ \sum_{s=1}^{5} E^{\mathrm{AlexNet}}_{c,s}$. We obtain the final metric muCE or mCE by averaging the corruption errors ($\mathrm{uCE}_c$ or $\mathrm{CE}_c$) over different corruptions $c \in C$. We report these metrics for the policies

[Figure: policy-transfer matrix. Columns: search policy on CIFAR10 and CIFAR100 (VGG, ResNet, WRN), plus the crop/flip policy. Rows: evaluate policy on CIFAR10 and CIFAR100 (VGG, ResNet, WRN).]
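The corruption-error metrics defined above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names and the toy error rates are hypothetical (they are not taken from the paper or its code), and the baseline errors stand in for the AlexNet errors used for ImageNet-C.

```python
import numpy as np


def unnormalized_corruption_errors(errors):
    """uCE_c for each corruption: mean error rate over the severity levels.

    errors: array of shape (num_corruptions, num_severities) holding E_{c,s}.
    """
    return np.asarray(errors, dtype=float).mean(axis=1)


def normalized_corruption_errors(errors, baseline_errors):
    """CE_c for each corruption: summed error divided by the baseline's summed error."""
    errors = np.asarray(errors, dtype=float)
    baseline_errors = np.asarray(baseline_errors, dtype=float)
    return errors.sum(axis=1) / baseline_errors.sum(axis=1)


# Made-up error rates for 2 corruptions x 5 severity levels (illustration only).
toy_errors = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                       [0.2, 0.2, 0.2, 0.2, 0.2]])
# Made-up baseline errors playing the role of the AlexNet errors.
toy_baseline = np.array([[0.2, 0.4, 0.6, 0.8, 1.0],
                         [0.4, 0.4, 0.4, 0.4, 0.4]])

# The final metrics average the per-corruption values over all corruptions.
muce = unnormalized_corruption_errors(toy_errors).mean()
mce = normalized_corruption_errors(toy_errors, toy_baseline).mean()
print(f"muCE = {muce:.3f}, mCE = {mce:.3f}")
```

Keeping the per-corruption values as an intermediate result makes it easy to inspect which corruption groups (noise, blur, weather, digital) dominate the average.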
[...]lutions and for non-standard inference techniques such as test-time data augmentation.

Even though the corruptions of ImageNet-C do slightly intersect with the augmentation transformations used during training, this does not favor GPS over other methods.

Surprisingly, policies trained on clean validation data work decently for corrupted data. In most cases, GPS outperforms both the conventional baselines and RandAugment with the optimal (for the clean validation set) magnitude M∗. Somewhat counter-intuitively, we find that extreme augmentations (see Figure 14) of data that is already corrupted lead to a significant performance boost as compared to conservative crops and flips. Not only does this demonstrate the efficiency of learnable TTA, it also shows that the policy does not overfit to clean data and consists of augmentations that are useful in other settings.

Although ensembling is a popular way to mitigate dataset shift (Ovadia et al., 2019), we do not compare model ensembles with TTA in this setting. As noted by Ashukha et al. (2020) and as we show in Section 4.5, ensembling and test-time augmentation are complementary practices and can be combined to boost the performance. We expect this combination to work well in the setting of domain shift.

Figure 8: Policies learned with GPS for ResNet-50 (GPS R50), EfficientNet B2 (GPS B2), and EfficientNet B5 (GPS B5) models transfer well to the larger EfficientNet L2 architecture and outperform conventional baselines for multi-crop evaluation: random scale-crop-flip transformation (CF) and multi-crop evaluation with 5 crops and 2 horizontal flips for each crop (10C).

4.4 Policy transfer

We evaluate the policies found by GPS on other architectures and datasets in order to test their generality. The change in calibrated log-likelihood when transferring the policies across CIFAR datasets and architectures is reported in Figure 7. The decrease in performance is not dramatic, and the transferred policies still outperform
standard random crop and flip augmentations. We observe that keeping the same dataset during transfer is more important than keeping the same architecture.

We also transfer the GPS policies found on ImageNet for ResNet50, EfficientNet-B2 and EfficientNet-B5 to an even larger architecture, EfficientNet-L2, and show the results in Figure 8. We observe that all of these policies transfer to a larger architecture well, and outperform the vanilla test-time augmentation policy and multi-crop evaluation significantly.

We do not transfer policies from CIFAR to ImageNet and vice versa since the image preprocessing for these datasets is different.

4.5 Greedy policy search for ensembles

Deep ensemble (Lakshminarayanan et al., 2017) is a simple yet powerful technique that achieves state-of-the-art results in in-domain and out-of-domain uncertainty estimation (Ovadia et al., 2019; Ashukha et al., 2020). Ashukha et al. (2020) have shown that deep ensembles can be improved for free using test-time augmentation. We show that deep ensembles can be improved even further by using a learnable test-time augmentation policy.

We use an ensemble of five WideResNet28x10 models, trained independently using the same training procedure as we used for training individual models (modified RandAugment training with N = 3 and M = 45).

There are several ways to apply GPS to an ensemble. The simplest way is to perform GPS for a single model, and then evaluate the whole ensemble using that policy. Another way is to perform GPS for the ensemble directly, using the same sub-policy for every member of the ensemble. Other modifications can include searching for a separate policy for each member of the ensemble. We test the first two options (denoted “GPS single” and “GPS ensemble” respectively), and leave other possible directions for future research.

The results are presented in Figure 9. They are consistent with the findings in previous sections. Even a grid search for the optimal magnitude in test-time RandAugment is enough to significantly outperform random crops and flips. GPS improves the performance even further. Transferring the policy from a single model to an ensemble (“GPS single”) performs worse than applying GPS to the whole ensemble directly; however, both variants of GPS outperform other baselines.

Figure 9: Greedy policy search improves the predictive performance of ensembles. CC: central crop. CF: random crops and horizontal flips. Tr: augmentation used for training (modified RandAugment with M = 45). “M∗ 1”: modified RandAugment with M∗ = 35 found by grid search for a single model. “GPS 1”: GPS is applied to a single model, and the ensemble is evaluated using the resulting policy. “GPS ens”: GPS is applied to the whole ensemble. The results have been averaged over five runs of TTA. (Panels: 5x WideResNet ensemble on CIFAR-100; calibrated log-likelihood and accuracy.)

The combination of ensembling methods and test-time augmentation usually provides meaningful benefits to predictive performance (Ashukha et al., 2020). Because of this, we expect these results to also hold for other ensembling methods that are more efficient in terms of training time than deep ensembles.

Figure 10: Greedy policy search (GPS) for models trained with vanilla augmentation (random crops and flips) still outperforms vanilla test-time augmentation. CC: central crop. CF: random crops and horizontal flips. GPS: greedy policy search. The results for CIFAR-100 have been averaged over five runs of TTA. (Panels: VGG, ResNet110 and WideResNet on CIFAR-100; calibrated log-likelihood and accuracy.)

4.6 Greedy policy search for models trained with vanilla augmentation

While we mainly tested GPS for models trained with advanced data augmentation methods like RandAugment, it can be applied to any image classification model. To further study the breadth of applicability of GPS, we
apply it for models trained with standard (vanilla) data augmentation. While the learned augmentation policy is less diverse than the policy learned for models trained with RandAugment (see Figure 14), GPS still manages to find a policy that significantly outperforms standard crops and flips on CIFAR-100 (see Figures 10 and 17 for the comparison). Even though the models learned with standard data augmentation are less robust to RandAugment perturbations (see Figure 12), they can benefit from some of the transformations. The magnitude of the transformations is almost twice as low as compared to the policies for RandAugment models, and the identity transform is chosen much more often (see Figure 16).

5 CONCLUSION

We have designed a simple yet powerful greedy policy search method for test-time augmentation and tested it in a broad empirical evaluation. To highlight the general idea that switching to learnable test-time augmentation strategy is beneficial, we aimed to keep the policy search simple rather than to tweak it for maximum performance. Our findings can be summarized as follows:

• We show that the learned test-time augmentation policies consistently provide superior predictive performance and uncertainty estimates compared to existing approaches to test-time augmentation. We report a significant improvement for both clean (in-domain) data and corrupted data (under domain shift).

• We find that the calibrated log-likelihood is a superior objective for learning test-time augmentation strategies as compared to LL or accuracy. This finding may have important implications in adjacent fields such as meta-learning and neural architecture search, where the target (meta-)objective is often chosen to be either accuracy or plain validation log-likelihood with no calibration.

• We show the policies obtained with our method to be transferable between different architectures. This means that transferring policies found for small architectures to large architectures is a viable strategy if computational resources are limited.

There are many promising directions for future research on trainable test-time data augmentation. One potential area of improvement is in the design of dynamic object-dependent TTA policies as opposed to static object-independent policies, used in this paper. Intuitively, this might be especially helpful under domain shift, as an object-dependent policy has a potential to alleviate it.

Acknowledgements

Dmitry Vetrov and Dmitry Molchanov were supported by the Russian Science Foundation grant no. 19-71-30020. This research was supported in part through computational resources of HPC facilities at NRU HSE.

References

Ashukha, Arsenii, Lyzhov, Alexander, Molchanov, Dmitry, and Vetrov, Dmitry. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxI5gHKDr.

Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Caruana, Rich, Niculescu-Mizil, Alexandru, Crew, Geoff, and Ksikes, Alex. Ensemble selection from libraries of models. In Proceedings of the twenty-first international conference on Machine learning, pp. 18, 2004.

Cubuk, Ekin D, Zoph, Barret, Mane, Dandelion, Vasudevan, Vijay, and Le, Quoc V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123, 2019a.

Cubuk, Ekin D, Zoph, Barret, Shlens, Jonathon, and Le, Quoc V. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019b.

Fan, Wei, Chu, Fang, Wang, Haixun, and Yu, Philip S. Pruning and dynamic scheduling of cost-sensitive ensembles. In AAAI/IAAI, pp. 146–151, 2002.

Guo, Chuan, Pleiss, Geoff, Sun, Yu, and Weinberger, Kilian Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. JMLR.org, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Hendrycks, Dan and Dietterich, Thomas G. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.

Hendrycks, Dan, Mu, Norman, Cubuk, Ekin Dogus, Zoph, Barret, Gilmer, Justin, and Lakshminarayanan, Balaji. Augmix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1gmrxHFvB.

Ho, Daniel, Liang, Eric, Chen, Xi, Stoica, Ion, and Abbeel, Pieter. Population based augmentation: Efficient learning of augmentation policy schedules. In International Conference on Machine Learning, pp. 2731–2741, 2019.

Huang, Gao, Li, Yixuan, Pleiss, Geoff, Liu, Zhuang, Hopcroft, John E, and Weinberger, Kilian Q. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.

Krizhevsky, Alex, Hinton, Geoffrey, et al. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Lakshminarayanan, Balaji, Pritzel, Alexander, and Blundell, Charles. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

Lim, Sungbin, Kim, Ildoo, Kim, Taesup, Kim, Chiheon, and Kim, Sungwoong. Fast autoaugment. In Advances in Neural Information Processing Systems, pp. 6662–6672, 2019.

Ovadia, Yaniv, Fertig, Emily, Ren, Jie, Nado, Zachary, Sculley, D, Nowozin, Sebastian, Dillon, Joshua V, Lakshminarayanan, Balaji, and Snoek, Jasper. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.

Pang, Tianyu, Xu, Kun, and Zhu, Jun. Mixup inference: Better exploiting mixup to defend adversarial attacks. arXiv preprint arXiv:1909.11515, 2019.

Partridge, Derek and Yates, William B. Engineering multiversion neural-net systems. Neural computation, 8(4):869–893, 1996.

Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in pytorch. 2017.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

Shorten, Connor and Khoshgoftaar, Taghi M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

Simard, Patrice Y, Steinkraus, David, Platt, John C, et al. Best practices for convolutional neural networks applied to visual document analysis. In Icdar, volume 3, 2003.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.

Tan, Mingxing and Le, Quoc. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.

Wang, Guotai, Li, Wenqi, Aertsen, Michael, Deprest, Jan, Ourselin, Sébastien, and Vercauteren, Tom. Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing, 338:34–45, 2019.

Xie, Cihang, Tan, Mingxing, Gong, Boqing, Wang, Jiang, Yuille, Alan, and Le, Quoc V. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665, 2019.

Xie, Qizhe, Luong, Minh-Thang, Hovy, Eduard, and Le, Quoc V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698, 2020.

Yaeger, Larry S, Lyon, Richard F, and Webb, Brandyn J. Effective training of a neural network character classifier for word recognition. In Advances in neural information processing systems, pp. 807–816, 1997.

Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N, and Lopez-Paz, David. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhang, Xinyu, Wang, Qiang, Zhang, Jian, and Zhong, Zhao. Adversarial autoaugment. arXiv preprint arXiv:1912.11188, 2019.
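As a companion to Algorithm 1, the greedy selection loop can be sketched in NumPy. This is a minimal sketch under several assumptions and not the released implementation: the function names, the temperature grid and the toy data are hypothetical; the temperature is applied to the logarithm of the averaged probabilities; and, for brevity, it is tuned on the same data used for scoring rather than with test-time cross-validation. Predictions of every sub-policy in the pool are assumed to be precomputed.

```python
import numpy as np

# Hypothetical temperature grid used for calibration (an assumption of this sketch).
TEMPERATURES = np.linspace(0.25, 4.0, 76)


def calibrated_log_likelihood(probs, labels):
    """Best mean log-likelihood over the temperature grid.

    probs: (N, K) predicted class probabilities; labels: (N,) integer class labels.
    """
    log_p = np.log(np.clip(probs, 1e-12, None))
    best = -np.inf
    for tau in TEMPERATURES:
        scaled = log_p / tau
        scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        best = max(best, log_softmax[np.arange(len(labels)), labels].mean())
    return best


def greedy_policy_search(pool_probs, labels, policy_size):
    """Greedy selection of sub-policies by calibrated log-likelihood.

    pool_probs: (B, N, K) validation predictions, one slice per candidate sub-policy.
    Returns the selected sub-policy indices (repeats allowed) and the averaged predictions.
    """
    policy = []
    avg = np.zeros_like(pool_probs[0])
    for t in range(1, policy_size + 1):
        # Score the running average that would result from adding each candidate.
        scores = [
            calibrated_log_likelihood((t - 1) / t * avg + pool_probs[s] / t, labels)
            for s in range(len(pool_probs))
        ]
        best = int(np.argmax(scores))
        avg = (t - 1) / t * avg + pool_probs[best] / t  # update predictions
        policy.append(best)                              # update policy
    return policy, avg


# Toy pool (made-up numbers): sub-policy 0 is uninformative, sub-policy 1 is mostly correct.
toy_labels = np.array([0, 1, 0, 1])
uninformative = np.full((4, 2), 0.5)
informative = np.where(np.eye(2)[toy_labels] == 1, 0.9, 0.1)
toy_pool = np.stack([uninformative, informative])

policy, avg = greedy_policy_search(toy_pool, toy_labels, policy_size=2)
print(policy)
```

Because the pool is never shrunk, the same sub-policy can be selected more than once, which is how the sketch interprets the policy update step of Algorithm 1.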