Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

Stephan Rabanser∗, AWS AI Labs, rabans@amazon.com
Stephan Günnemann, Technical University of Munich, guennemann@in.tum.de
Zachary C. Lipton, Carnegie Mellon University, zlipton@cmu.edu
arXiv:1810.11953v4 [stat.ML] 28 Oct 2019

Abstract
We might hope that when faced with unexpected inputs, well-designed software
systems would fire off warnings. Machine learning (ML) systems, however, which
depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to
fail silently. This paper explores the problem of building ML systems that fail
loudly, investigating methods for detecting dataset shift, identifying exemplars
that most typify the shift, and quantifying shift malignancy. We focus on several
datasets and various perturbations to both covariates and label distributions with
varying magnitudes and fractions of data affected. Interestingly, we show that
across the dataset shifts that we explore, a two-sample-testing-based approach,
using pre-trained classifiers for dimensionality reduction, performs best. More-
over, we demonstrate that domain-discriminating approaches tend to be helpful
for characterizing shifts qualitatively and determining if they are harmful.

1 Introduction
Software systems employing deep neural networks are now applied widely in industry, powering the
vision systems in social networks [47] and self-driving cars [5], providing assistance to radiologists
[24], underpinning recommendation engines used by online platforms [9, 12], enabling the best-
performing commercial speech recognition software [14, 21], and automating translation between
languages [50]. In each of these systems, predictive models are integrated into conventional human-
interacting software systems, leveraging their predictions to drive consequential decisions.
The reliable functioning of software depends crucially on tests. Many classic software bugs can be
caught when software is compiled, e.g. that a function receives input of the wrong type, while other
problems are detected only at run-time, triggering warnings or exceptions. In the worst case, if the
errors are never caught, software may behave incorrectly without alerting anyone to the problem.
Unfortunately, software systems based on machine learning are notoriously hard to test and maintain
[42]. Despite their power, modern machine learning models are brittle. Seemingly subtle changes
in the data distribution can destroy the performance of otherwise state-of-the-art classifiers, a phe-
nomenon exemplified by adversarial examples [51, 57]. When decisions are made under uncertainty,
even shifts in the label distribution can significantly compromise accuracy [29, 56]. Unfortunately,
in practice, ML pipelines rarely inspect incoming data for signs of distribution shift. Moreover, best
practices for detecting shift in high-dimensional real-world data have not yet been established².
In this paper, we investigate methods for detecting and characterizing distribution shift, with the
hope of removing a critical stumbling block obstructing the safe and responsible deployment of
machine learning in high-stakes applications. Faced with distribution shift, our goals are three-fold:
∗ Work done while a Visiting Research Scholar at Carnegie Mellon University.
² TensorFlow’s data validation tools compare only summary statistics of source vs. target data:
https://tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
[Figure 1 diagram: x_source and x_target → Dimensionality Reduction → Two-Sample Test(s) → Combined Test Statistic & Shift Detection]

Figure 1: Our pipeline for detecting dataset shift. Source and target data is fed through a dimen-
sionality reduction process and subsequently analyzed via statistical hypothesis testing. We consider
various choices for how to represent the data and how to perform two-sample tests.

(i) detect when distribution shift occurs from as few examples as possible; (ii) characterize the shift,
e.g. by identifying those samples from the test set that appear over-represented in the target data;
and (iii) provide some guidance on whether the shift is harmful or not. As part of this paper we
principally focus on goal (i) and explore preliminary approaches to (ii) and (iii).
We investigate shift detection through the lens of statistical two-sample testing. We wish to test the
equivalence of the source distribution p (from which training data is sampled) and target distribu-
tion q (from which real-world data is sampled). For simple univariate distributions, such hypothesis
testing is a mature science. However, best practices for two sample tests with high-dimensional
(e.g. image) data remain an open question. While off-the-shelf methods for kernel-based multivari-
ate two-sample tests are appealing, they scale badly with dataset size and their statistical power is
known to decay badly with high ambient dimension [37].
Recently, Lipton et al. [29] presented results for a method called black box shift detection (BBSD),
showing that if one possesses an off-the-shelf label classifier f with an invertible confusion matrix,
then detecting that the source distribution p differs from the target distribution q requires only
detecting that $p(f(x)) \neq q(f(x))$. Building on their idea of combining black-box dimensionality
reduction with subsequent two-sample testing, we explore a range of dimensionality-reduction
techniques and compare them under a wide variety of shifts (Figure 1 illustrates our general framework).
We show (empirically) that BBSD works surprisingly well under a broad set of shifts, even when the
label shift assumption is not met. Furthermore, we provide an empirical analysis on the performance
of domain-discriminating classifier-based approaches (i.e. classifiers explicitly trained to discrimi-
nate between source and target samples), which has so far not been characterized for the complex
high-dimensional data distributions on which modern machine learning is routinely deployed.

2 Related work

Given just one example from the test data, our problem simplifies to anomaly detection, surveyed
thoroughly by Chandola et al. [8] and Markou and Singh [33]. Popular approaches to anomaly
detection include density estimation [6], margin-based approaches such as the one-class SVM [40],
and the tree-based isolation forest method [30]. More recently, GANs have also been explored for
this task [39]. Given simple streams of data arriving in a time-dependent fashion where the signal
is piece-wise stationary with abrupt changes, this is the classic time series problem of change point
detection, surveyed comprehensively by Truong et al. [52]. An extensive literature addresses dataset
shift in the context of domain adaptation. Owing to the impossibility of correcting for shift absent
assumptions [3], these papers often assume either covariate shift q(x, y) = q(x)p(y|x) [15, 45, 49]
or label shift q(x, y) = q(y)p(x|y) [7, 29, 38, 48, 56]. Schölkopf et al. [41] provide a unifying
view of these shifts, associating assumed invariances with the corresponding causal assumptions.
Several recent papers have proposed outlier detection mechanisms, dubbing the task
out-of-distribution (OOD) sample detection. Hendrycks and Gimpel [19] propose to threshold the
maximum softmax entry of a neural network classifier, which already contains a relevant signal. Liang
et al. [28] and Lee et al. [26] extend this idea by either adding temperature scaling and adversarial-
like perturbations on the input or by explicitly adapting the loss to aid OOD detection. Choi and
Jang [10] and Shalev et al. [44] employ model ensembling to further improve detection reliability.
Alemi et al. [2] motivate use of the variational information bottleneck. Hendrycks et al. [20] ex-
pose the model to OOD samples, exploring heuristics for discriminating between in-distribution and
out-of-distribution samples. Shafaei et al. [43] survey numerous OOD detection techniques.

3 Shift Detection Techniques
Given labeled data $\{(x_1, y_1), \dots, (x_n, y_n)\} \sim p$ and unlabeled data $\{x'_1, \dots, x'_m\} \sim q$, our task is
to determine whether $p(x)$ equals $q(x')$. Formally, $H_0 : p(x) = q(x')$ vs. $H_A : p(x) \neq q(x')$.
Chiefly, we explore the following design considerations: (i) what representation to run the test on;
(ii) which two-sample test to run; (iii) when the representation is multidimensional, whether to run
multivariate or multiple univariate two-sample tests; and (iv) how to combine their results.

3.1 Dimensionality Reduction

We now introduce the multiple dimensionality reduction (DR) techniques that we compare vis-à-vis
their effectiveness in shift detection (in concert with two-sample testing). Note that absent
assumptions on the data, these mappings, which reduce the data dimensionality from D to K (with
$K \ll D$), are in general not injective, with many inputs mapping to the same output. Thus, it is trivial
to construct pathological cases where the distribution of inputs shifts while the distribution of low-
dimensional latent representations remains fixed, yielding false negatives. However, we speculate
that in a non-adversarial setting, such shifts may be exceedingly unlikely. Thus our approach is (i)
empirically motivated; and (ii) not put forth as a defense against worst-case adversarial attacks.
No Reduction (NoRed): To justify the use of any DR technique, our default baseline is to run
tests on the original raw features.
Principal Components Analysis (PCA): Principal components analysis is a standard tool that
finds an optimal orthogonal transformation matrix R such that points are linearly uncorrelated after
transformation. This transformation is learned in such a way that the first principal component
accounts for as much of the variability in the dataset as possible, and that each succeeding principal
component captures as much of the remaining variance as possible subject to the constraint that
it be orthogonal to the preceding components. Formally, we wish to learn R given X under the
mentioned constraints such that X̂ = XR yields a more compact data representation.
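As a rough illustration of the PCA baseline, the sketch below fits sklearn's `PCA` on flattened training data and projects held-out data to K = 32 dimensions. The synthetic arrays and the helper name `fit_pca_reducer` are placeholders for illustration only; note also that sklearn centers the data before applying the projection.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_pca_reducer(x_train, n_components=32):
    """Fit PCA on flattened training data; return a function that reduces new data."""
    pca = PCA(n_components=n_components)
    pca.fit(x_train.reshape(len(x_train), -1))
    return lambda x: pca.transform(x.reshape(len(x), -1))


rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 8, 8))  # synthetic stand-in for training images
x_val = rng.normal(size=(50, 8, 8))     # synthetic stand-in for held-out images

reduce_fn = fit_pca_reducer(x_train, n_components=32)
z_val = reduce_fn(x_val)                # reduced representation, one row per sample
```

The same fitted reducer would then be applied to both source (validation) and target (test) samples before any two-sample test is run.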
Sparse Random Projection (SRP): Since computing the optimal transformation might be expensive
in high dimensions, random projections are a popular DR technique which trade a controlled
amount of accuracy for faster processing times. Specifically, we make use of sparse random pro-
jections, a more memory- and computationally-efficient modification of standard Gaussian random
projections. Formally, we generate a random projection matrix R and use it to reduce the dimen-
sionality of a given data matrix X, such that X̂ = XR. The elements of R are generated using the
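following rule set [1, 27]:

$$
R_{ij} =
\begin{cases}
+\sqrt{\frac{v}{K}} & \text{with probability } \frac{1}{2v} \\
0 & \text{with probability } 1 - \frac{1}{v} \\
-\sqrt{\frac{v}{K}} & \text{with probability } \frac{1}{2v}
\end{cases}
\tag{1}
$$

where $v = \sqrt{D}$, so that a fraction $1/v = 1/\sqrt{D}$ of the entries is nonzero in expectation.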
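A minimal numpy sketch of this sampling rule may help make it concrete. It assumes $v = \sqrt{D}$, so that all three probabilities are valid and the expected density of nonzero entries is $1/\sqrt{D}$; the function name `sparse_projection_matrix` and the synthetic data are illustrative only.

```python
import numpy as np


def sparse_projection_matrix(d, k, rng):
    """Sample a D x K sparse random projection matrix (assumes v = sqrt(D))."""
    v = np.sqrt(d)
    values = np.array([np.sqrt(v / k), 0.0, -np.sqrt(v / k)])
    probs = np.array([1 / (2 * v), 1 - 1 / v, 1 / (2 * v)])
    return rng.choice(values, size=(d, k), p=probs)


rng = np.random.default_rng(0)
D, K = 784, 32                       # e.g. flattened 28 x 28 images reduced to 32 dims
R = sparse_projection_matrix(D, K, rng)
X = rng.normal(size=(100, D))        # synthetic stand-in for a data matrix
X_hat = X @ R                        # reduced representation
```

In practice, a library implementation such as sklearn's `SparseRandomProjection` would likely be preferable, since it stores R in a sparse format.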

Autoencoders (TAE and UAE): We compare the above-mentioned linear models to non-linear
reduced-dimension representations using both trained (TAE) and untrained autoencoders (UAE).
Formally, an autoencoder consists of an encoder function φ : X → H and a decoder function
ψ : H → X where the latent space H has lower dimensionality than the input space X . As part of
the training process, both the encoding function φ and the decoding function ψ are learned jointly
to reduce the reconstruction loss: $\phi, \psi = \arg\min_{\phi, \psi} \lVert X - (\psi \circ \phi) X \rVert^2$.
Label Classifiers (BBSDs and BBSDh): Motivated by recent results achieved by black box
shift detection (BBSD) [29], we also propose to use the outputs of a (deep network) label classifier
trained on source data as our dimensionality-reduced representation. We explore variants using
either the softmax outputs (BBSDs) or the hard-thresholded predictions (BBSDh) for subsequent
two-sample testing. Since both variants provide differently sized output (with BBSDs providing an
entire softmax vector and BBSDh providing a one-dimensional class prediction), different statistical
tests are carried out on these representations.
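The two label-classifier representations can be sketched as follows. This example swaps in a small logistic regression on synthetic 2-D data in place of the deep network used in the experiments; the data, the classifier choice, and the variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # three synthetic classes

# Train a label classifier on (synthetic) source data.
y_train = rng.integers(0, 3, size=300)
x_train = centers[y_train] + rng.normal(size=(300, 2))
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# New samples to be represented for shift detection.
x_new = centers[rng.integers(0, 3, size=50)] + rng.normal(size=(50, 2))

soft = clf.predict_proba(x_new)  # BBSDs: one C-dimensional probability vector per sample
hard = clf.predict(x_new)        # BBSDh: one predicted class label per sample
```

The soft outputs are continuous and C-dimensional, while the hard outputs are categorical, which is why the two variants call for different tests.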
Domain Classifier (Classif): Here, we attempt to detect shift by explicitly training a domain
classifier to discriminate between data from source and target domains. To this end, we partition
both the source data and target data into two halves, using the first to train a domain classifier to
distinguish source (class 0) from target (class 1) data. We then apply this model to the second
half and subsequently conduct a significance test to determine if the classifier’s performance is
statistically different from random chance.

3.2 Statistical Hypothesis Testing

The DR techniques each yield a representation, either uni- or multi-dimensional, and either continu-
ous or discrete, depending on the method. The next step is to choose a suitable statistical hypothesis
test for each of these representations.
Multivariate Kernel Two-Sample Tests: Maximum Mean Discrepancy (MMD): For all multi-
dimensional representations, we evaluate the Maximum Mean Discrepancy [16], a popular kernel-
based technique for multivariate two-sample testing. MMD allows us to distinguish between two
probability distributions p and q based on the mean embeddings µp and µq of the distributions in a
reproducing kernel Hilbert space F, formally
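$$\text{MMD}(\mathcal{F}, p, q) = \lVert \mu_p - \mu_q \rVert_{\mathcal{F}}^2. \tag{2}$$
Given samples from both distributions, we can calculate an unbiased estimate of the squared MMD
statistic as follows
$$\text{MMD}^2 = \frac{1}{n^2 - n} \sum_{i=1}^{n} \sum_{j \neq i}^{n} \kappa(x_i, x_j) + \frac{1}{m^2 - m} \sum_{i=1}^{m} \sum_{j \neq i}^{m} \kappa(x'_i, x'_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \kappa(x_i, x'_j) \tag{3}$$
where we use a squared exponential kernel $\kappa(x, \tilde{x}) = e^{-\frac{1}{\sigma} \lVert x - \tilde{x} \rVert^2}$ and set $\sigma$ to the median distance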
between points in the aggregate sample over p and q [16]. A p-value can then be obtained by carrying
out a permutation test on the resulting kernel matrix.
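The estimator and permutation test described above could be sketched in numpy as follows. This is an illustrative implementation under stated assumptions rather than the reference code: the kernel follows the form given in the text (squared distance divided by σ, with σ the median pairwise distance in the pooled sample), and the function names, the number of permutations, and the synthetic data are all placeholders.

```python
import numpy as np


def squared_distances(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off


def mmd2_unbiased(k, n):
    """Unbiased MMD^2 from a pooled kernel matrix whose first n rows are source."""
    m = k.shape[0] - n
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    term_x = (k_xx.sum() - np.trace(k_xx)) / (n * n - n)  # excludes j == i
    term_y = (k_yy.sum() - np.trace(k_yy)) / (m * m - m)
    return term_x + term_y - 2 * k_xy.mean()


def mmd_permutation_test(x, y, n_perm=200, seed=0):
    """Return the observed MMD^2 and a permutation p-value."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    d2 = squared_distances(z, z)
    sigma = np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))  # median heuristic
    k = np.exp(-d2 / sigma)
    n = len(x)
    observed = mmd2_unbiased(k, n)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(z))  # reshuffle domain assignments
        if mmd2_unbiased(k[np.ix_(perm, perm)], n) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


rng = np.random.default_rng(1)
source = rng.normal(size=(100, 5))
target = rng.normal(loc=1.0, size=(100, 5))  # synthetic mean shift
stat, p_value = mmd_permutation_test(source, target)
```

Because the kernel matrix is computed once and only re-indexed per permutation, the cost is dominated by the quadratic matrix itself, which is consistent with the poor scaling in dataset size noted earlier.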
Multiple Univariate Testing: Kolmogorov-Smirnov (KS) Test + Bonferroni Correction: As a
simple baseline alternative to MMD, we consider the approach consisting of testing each of the
K dimensions separately (instead of testing over all dimensions jointly). Here, for continuous data,
we adopt the Kolmogorov-Smirnov (KS) test, a non-parametric test whose statistic is calculated by
computing the largest difference Z of the cumulative distribution functions (CDFs) over all values z as
follows
$$Z = \sup_z \lvert F_p(z) - F_q(z) \rvert \tag{4}$$
where $F_p$ and $F_q$ are the empirical CDFs of the source and target data, respectively. Under the null
hypothesis, Z follows the Kolmogorov distribution.
Since we carry out a KS test on each of the K components, we must subsequently combine the
p-values from each test, raising the issue of multiple hypothesis testing. As we cannot make strong
assumptions about the (in)dependence among the tests, we rely on a conservative aggregation method,
notably the Bonferroni correction [4], which rejects the null hypothesis if the minimum p-value
among all tests is less than α/K (where α is the significance level of the test). While several less
conservative aggregation methods have been proposed [18, 32, 46, 53, 55], they typically require
assumptions on the dependencies among the tests.
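The per-dimension KS tests with Bonferroni aggregation can be sketched with scipy's `ks_2samp`. The helper name `ks_bonferroni_shift_test` and the synthetic data (a mean shift injected into a single dimension) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_bonferroni_shift_test(source, target, alpha=0.05):
    """Run one KS test per dimension; reject H0 if min p-value < alpha / K."""
    k = source.shape[1]
    p_values = np.array(
        [ks_2samp(source[:, j], target[:, j]).pvalue for j in range(k)]
    )
    return p_values.min() < alpha / k, p_values


rng = np.random.default_rng(0)
source = rng.normal(size=(500, 10))
target = rng.normal(size=(500, 10))
target[:, 3] += 1.0  # shift only dimension 3

shift_detected, p_values = ks_bonferroni_shift_test(source, target)
```

The same function could be applied to any continuous K-dimensional representation, for example the softmax outputs used by BBSDs.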
Categorical Testing: Chi-Squared Test: For the hard-thresholded label classifier (BBSDh), we
employ Pearson’s chi-squared test, a parametric test designed to evaluate whether the frequency
distribution of certain events observed in a sample is consistent with a particular theoretical
distribution. Specifically, we use a test of homogeneity between the class distributions (expressed in a
contingency table) of source and target data. The testing problem can be formalized as follows:
Given a contingency table with 2 rows (one for absolute source and one for absolute target class
frequencies) and C columns (one for each of the C-many classes) containing observed counts $O_{ij}$,
the expected frequency under the independence hypothesis for a particular cell is $E_{ij} = N_{\text{sum}} \, p_{i\bullet} \, p_{\bullet j}$
with $N_{\text{sum}}$ being the sum of all cells in the table, $p_{i\bullet} = \frac{O_{i\bullet}}{N_{\text{sum}}} = \sum_{j=1}^{C} \frac{O_{ij}}{N_{\text{sum}}}$ being the fraction of row
totals, and $p_{\bullet j} = \frac{O_{\bullet j}}{N_{\text{sum}}} = \sum_{i=1}^{2} \frac{O_{ij}}{N_{\text{sum}}}$ being the fraction of column totals. The relevant test statistic
$X^2$ can be computed as
$$X^2 = \sum_{i=1}^{2} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{5}$$
which, under the null hypothesis, follows a chi-squared distribution with C − 1 degrees of freedom:
$X^2 \sim \chi^2_{C-1}$.
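As a sketch, the homogeneity test on hard predictions can be run with scipy's `chi2_contingency` on a 2 × C table of predicted-class counts. The helper name and the synthetic predictions (class 0 never predicted on the target) are illustrative assumptions; dropping classes that are absent from both samples is a practical choice made here to avoid zero expected counts.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_label_shift_test(source_preds, target_preds, n_classes):
    """Chi-squared test of homogeneity on predicted-class counts (2 x C table)."""
    table = np.stack([
        np.bincount(source_preds, minlength=n_classes),
        np.bincount(target_preds, minlength=n_classes),
    ])
    table = table[:, table.sum(axis=0) > 0]  # drop classes absent from both samples
    stat, p_value, dof, _expected = chi2_contingency(table, correction=False)
    return stat, p_value, dof


rng = np.random.default_rng(0)
source_preds = rng.integers(0, 10, size=1000)  # all 10 classes predicted
target_preds = rng.integers(1, 10, size=1000)  # class 0 never predicted

stat, p_value, dof = chi2_label_shift_test(source_preds, target_preds, n_classes=10)
```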

Binomial Testing: For the domain classifier, we simply compare its accuracy (acc) on held-out
data to random chance via a binomial test. Formally, we set up a testing problem $H_0 : \text{acc} = 0.5$
vs. $H_A : \text{acc} \neq 0.5$. Under the null hypothesis, the number of correctly classified held-out samples follows a binomial
distribution: $N_{\text{hold}} \cdot \text{acc} \sim \text{Bin}(N_{\text{hold}}, 0.5)$, where $N_{\text{hold}}$ corresponds to the number of held-out samples.
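The domain-classifier procedure (split each domain in half, train on the first halves, test held-out accuracy against chance) might look roughly like the sketch below. A logistic regression is swapped in for the ResNet-18 used in the experiments, and the function names and synthetic data are assumptions for illustration; `binomtest` requires a reasonably recent scipy.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.linear_model import LogisticRegression


def split_half(a, rng):
    """Randomly split an array into two halves."""
    idx = rng.permutation(len(a))
    half = len(a) // 2
    return a[idx[:half]], a[idx[half:]]


def domain_classifier_test(source, target, alpha=0.05, seed=0):
    """Train a source-vs-target classifier and test held-out accuracy against 0.5."""
    rng = np.random.default_rng(seed)
    s_train, s_hold = split_half(source, rng)
    t_train, t_hold = split_half(target, rng)
    x_train = np.vstack([s_train, t_train])
    y_train = np.r_[np.zeros(len(s_train), dtype=int), np.ones(len(t_train), dtype=int)]
    x_hold = np.vstack([s_hold, t_hold])
    y_hold = np.r_[np.zeros(len(s_hold), dtype=int), np.ones(len(t_hold), dtype=int)]

    clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    n_correct = int((clf.predict(x_hold) == y_hold).sum())
    p_value = binomtest(n_correct, len(y_hold), p=0.5).pvalue  # two-sided by default
    return p_value < alpha, p_value


rng = np.random.default_rng(1)
source = rng.normal(size=(400, 5))
target = rng.normal(loc=0.8, size=(400, 5))  # synthetic mean shift
detected, p_value = domain_classifier_test(source, target)
```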

3.3 Obtaining Most Anomalous Samples

As our detection framework does not detect outliers but rather aims at capturing top-level shift
dynamics, it is not possible for us to decide whether any given sample is in- or out-of-distribution.
However, we can still provide an indication of what typical samples from the shifted distribution look
like by harnessing domain assignments from the domain classifier. Specifically, we can identify
the exemplars which the classifier was most confident in assigning to the target domain. Since
the domain classifier assigns class-assignment confidence scores to each incoming sample via the
softmax-layer at its output, it is easy to create a ranking of samples that are most confidently believed
to come from the target domain (or, alternatively, from the source domain). Hence, whenever the
binomial test signals a statistically significant accuracy deviation from chance, we can use the
domain classifier to obtain the most anomalous samples and present them to the user.
In contrast to the domain classifier, the other shift detectors do not base their shift detection potential
on explicitly deciding which domain a single sample belongs to, instead comparing entire distribu-
tions against each other. While we did explore initial ideas on identifying samples which if removed
would lead to a large increase in the overall p-value, the results we obtained were unremarkable.
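The ranking step can be illustrated in a few lines of numpy. The scores below are made-up probabilities of the "target" class as a domain classifier might output them, and the helper name is hypothetical.

```python
import numpy as np


def rank_by_target_confidence(target_probs, top_k):
    """Return indices of the most target-like and the most source-like samples."""
    order = np.argsort(-target_probs)       # descending target-domain confidence
    return order[:top_k], order[::-1][:top_k]


# Made-up domain-classifier outputs: probability that each sample is from the target.
target_probs = np.array([0.20, 0.90, 0.55, 0.99, 0.40])
most_different, most_similar = rank_by_target_confidence(target_probs, top_k=2)
```

Here `most_different` indexes the samples that would be shown to the user as typical of the shifted distribution.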

3.4 Determining the Malignancy of a Shift

Theoretically, absent further assumptions, distribution shifts can cause arbitrarily severe degradation
in performance. However, in practice distributions shift constantly, and often these changes are
benign. Practitioners should therefore be interested in distinguishing malignant shifts that damage
predictive performance from benign shifts that negligibly impact performance. Although prediction
quality can be assessed easily on source data on which the black-box model f was trained, we are
not able to compute the target error directly without labels.
We therefore explore a heuristic method for approximating the target performance by making use
of the domain classifier’s class assignments as follows: Given access to a labeling function that can
correctly label samples, we can feed in those examples predicted by the domain classifier as likely
to come from the target domain. We can then compare these (true) labels to the labels returned by
the black box model f by feeding it the same anomalous samples. If our model is inaccurate on
these examples (where the exact threshold can be user-specified to account for varying sensitivities
to accuracy drops), then we ought to be concerned that the shift is malignant. Put simply, we sug-
gest evaluating the accuracy of our models on precisely those examples which are most confidently
assigned to the target domain.
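This heuristic can be sketched as follows, assuming that true labels are available for the few top-ranked samples. The function names, the tiny hand-made arrays, and the `max_drop` tolerance are illustrative assumptions; the text leaves the exact threshold to the user.

```python
import numpy as np


def accuracy_on_most_anomalous(domain_scores, model_preds, true_labels, top_k):
    """Accuracy of the black-box model on the top_k most target-like samples."""
    top = np.argsort(-domain_scores)[:top_k]
    return float(np.mean(model_preds[top] == true_labels[top]))


def is_malignant(domain_scores, model_preds, true_labels,
                 source_accuracy, top_k, max_drop=0.05):
    """Flag the shift if accuracy drops by more than max_drop (user-chosen)."""
    acc = accuracy_on_most_anomalous(domain_scores, model_preds, true_labels, top_k)
    return (source_accuracy - acc) > max_drop, acc


domain_scores = np.array([0.90, 0.10, 0.80, 0.30, 0.95, 0.20])  # P(target) per sample
model_preds = np.array([1, 0, 2, 1, 0, 2])                      # black-box predictions
true_labels = np.array([0, 0, 2, 1, 1, 2])                      # labels from an oracle

malignant, acc = is_malignant(
    domain_scores, model_preds, true_labels, source_accuracy=0.95, top_k=3
)
```

Only the `top_k` selected samples need labels, which is what keeps the labeling cost small relative to the full target sample.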

4 Experiments

Our main experiments were carried out on the MNIST (Ntr = 50000; Nval = 10000; Nte = 10000;
D = 28 × 28 × 1; C = 10 classes) [25] and CIFAR-10 (Ntr = 40000; Nval = 10000; Nte =
10000; D = 32 × 32 × 3; C = 10 classes) [23] image datasets. For the autoencoder (UAE &
TAE) experiments, we employ a convolutional architecture with 3 convolutional layers and 1 fully-
connected layer. For both the label and the domain classifier we use a ResNet-18 [17]. We train
all networks (TAE, BBSDs, BBSDh, Classif) using stochastic gradient descent with momentum in
batches of 128 examples over 200 epochs with early stopping.
For PCA, SRP, UAE, and TAE, we reduce dimensionality to K = 32 latent dimensions, which for
PCA explains roughly 80% of the variance in the CIFAR-10 dataset. The label classifier BBSDs
reduces dimensionality to the number of classes C. Both the hard label classifier BBSDh and the
domain classifier Classif reduce dimensionality to a one-dimensional class prediction, where BBSDh
predicts label assignments and Classif predicts domain assignments.
To challenge our detection methods, we simulate a variety of shifts, affecting both the covariates
and the label proportions. For all shifts, we evaluate the various methods’ abilities to detect shift at
a significance level of α = 0.05. We also include the no-shift case to check against false positives.
We randomly split all of the data into training, validation, and test sets according to the indicated
proportions Ntr , Nval , and Nte and then apply a particular shift to the test set only. To
assess the robustness of our findings, shift detection performance is averaged over a
total of 5 random splits, which ensures that we apply the same type of shift to different subsets of the
data. The selected training data used to fit the DR methods is kept constant across experiments with
only the splits between validation and test changing across the random runs. Note that DR methods
are learned using training data, while shift detection is being performed on dimensionality-reduced
representations of the validation and the test set. We evaluate the models with various amounts of
samples from the test set s ∈ {10, 20, 50, 100, 200, 500, 1000, 10000}. Because of the unfavorable
dependence of kernel methods on the dataset size, we run these methods only up until 1000 target
samples have been acquired.
For each shift type (as appropriate) we explored three levels of shift intensity (e.g. the magnitude of
added noise) and various percentages of affected data δ ∈ {0.1, 0.5, 1.0}. Specifically, we explore
the following types of shifts:

(a) Adversarial (adv): We turn a fraction δ of samples into adversarial samples via FGSM [13];
(b) Knock-out (ko): We remove a fraction δ of samples from class 0, creating class imbalance [29];
(c) Gaussian noise (gn): We corrupt covariates of a fraction δ of test set samples by Gaussian noise
with standard deviation σ ∈ {1, 10, 100} (denoted s gn, m gn, and l gn);
(d) Image (img): We also explore more natural shifts to images, modifying a fraction δ of
images with combinations of random rotations {10, 40, 90}, (x, y)-axis-translation percentages
{0.05, 0.2, 0.4}, as well as zoom-in percentages {0.1, 0.2, 0.4} (denoted s img, m img, and l img);
(e) Image + knock-out (m img+ko): We apply a fixed medium image shift with δ1 = 0.5 and a
variable knock-out shift δ;
(f) Only-zero + image (oz+m img): Here, we only include images from class 0 in combination
with a variable medium image shift affecting only a fraction δ of the data;
(g) Original splits: We evaluate our detectors on the original source/target splits provided by the
creators of MNIST, CIFAR-10, Fashion MNIST [54], and SVHN [35] datasets (assumed to be i.i.d.);
(h) Domain adaptation datasets: Data from the domain adaptation task transferring from MNIST
(source) to USPS (target) (Ntr = Nval = Nte = 1000; D = 16 × 16 × 1; C = 10 classes) [31] as
well as the COIL-100 dataset (Ntr = Nval = Nte = 2400; D = 32 × 32 × 3; C = 100 classes) [34]
where images between 0◦ and 175◦ are sampled by the source and images between 180◦ and 355◦
are sampled by the target distribution.
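Two of the simpler shifts in the list above, Gaussian noise (c) and knock-out (b), can be simulated along the following lines. This is a sketch under assumptions: the function names are hypothetical, no clipping to a valid pixel range is applied, and the arrays are synthetic stand-ins for image data.

```python
import numpy as np


def gaussian_noise_shift(x, sigma, delta, rng):
    """Add N(0, sigma^2) noise to a random fraction delta of the samples."""
    x = x.astype(float).copy()
    n_affected = int(round(delta * len(x)))
    idx = rng.choice(len(x), size=n_affected, replace=False)
    x[idx] += rng.normal(scale=sigma, size=x[idx].shape)
    return x, idx


def knockout_shift(x, y, delta, rng, cls=0):
    """Remove a random fraction delta of the samples belonging to class cls."""
    cls_idx = np.flatnonzero(y == cls)
    n_remove = int(round(delta * len(cls_idx)))
    remove = rng.choice(cls_idx, size=n_remove, replace=False)
    keep = np.setdiff1d(np.arange(len(x)), remove)
    return x[keep], y[keep]


rng = np.random.default_rng(0)
x = np.zeros((100, 4))               # synthetic stand-in for flattened test images
y = np.repeat(np.arange(10), 10)     # 10 samples per class

x_noisy, affected = gaussian_noise_shift(x, sigma=10, delta=0.5, rng=rng)
x_ko, y_ko = knockout_shift(x, y, delta=0.5, rng=rng)
```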

We provide a sample implementation of our experiment pipeline written in Python, making use of
sklearn [36] and Keras [11], located at: https://github.com/steverab/failing-loudly.

5 Discussion
Univariate vs. Multivariate Tests: We first evaluate whether we can detect shifts more easily
using multiple univariate tests and aggregating their results via the Bonferroni correction or by using
multivariate kernel tests. We were surprised to find that, despite the heavy correction, multiple
univariate testing seems to offer comparable performance to multivariate testing (see Table 1a).
Dimensionality Reduction Methods: For each testing method and experimental setting, we eval-
uate which DR technique is best suited to shift detection. Specifically in the multiple-univariate-
testing case (and overall), BBSDs was the best-performing DR method. In the multivariate-testing
case, UAE performed best. In both cases, these methods consistently outperformed others across
sample sizes. The domain classifier, a popular shift detection approach, performs badly in the
low-sample regime (≤ 100 samples), but catches up as more samples are obtained. Notably, the
multivariate test performs poorly in the no-reduction case, which is also regarded as a widely used shift
detection baseline. Table 1a summarizes these results.
We note that BBSDs being the best overall method for detecting shift is good news for ML practi-
tioners. When building black-box models with the main purpose of classification, said model can be

Table 1: Dimensionality reduction methods (a) and shift-type (b) comparison. Underlined entries
indicate accuracy values larger than 0.5.

(a) Detection accuracy of different dimensionality reduction techniques across all simulated shifts on
MNIST and CIFAR-10. Green bold entries indicate the best DR method at a given sample size,
red italic the worst. Results for χ2 and Bin tests are only reported once under the univariate
category. BBSDs performs best for univariate testing, while both UAE and TAE perform best for
multivariate testing.

                          Number of samples from test
Test           DR       10     20     50     100    200    500    1,000  10,000
Univ. tests    NoRed    0.03   0.15   0.26   0.36   0.41   0.47   0.54   0.72
Univ. tests    PCA      0.11   0.15   0.30   0.36   0.41   0.46   0.54   0.63
Univ. tests    SRP      0.15   0.15   0.23   0.27   0.34   0.42   0.55   0.68
Univ. tests    UAE      0.12   0.16   0.27   0.33   0.41   0.49   0.56   0.77
Univ. tests    TAE      0.18   0.23   0.31   0.38   0.43   0.47   0.55   0.69
Univ. tests    BBSDs    0.19   0.28   0.47   0.47   0.51   0.65   0.70   0.79
χ2             BBSDh    0.03   0.07   0.12   0.22   0.22   0.40   0.46   0.57
Bin            Classif  0.01   0.03   0.11   0.21   0.28   0.42   0.51   0.67
Multiv. tests  NoRed    0.14   0.15   0.22   0.28   0.32   0.44   0.55   –
Multiv. tests  PCA      0.15   0.18   0.33   0.38   0.40   0.46   0.55   –
Multiv. tests  SRP      0.12   0.18   0.23   0.31   0.31   0.44   0.54   –
Multiv. tests  UAE      0.20   0.27   0.40   0.43   0.45   0.53   0.61   –
Multiv. tests  TAE      0.18   0.26   0.37   0.38   0.45   0.52   0.59   –
Multiv. tests  BBSDs    0.16   0.20   0.25   0.35   0.35   0.47   0.50   –

(b) Detection accuracy of different shifts on MNIST and CIFAR-10 using the best-performing
DR technique (univariate: BBSDs, multivariate: UAE). Green bold shifts are identified as harmless,
red italic shifts as harmful.

                               Number of samples from test
Test              Shift      10     20     50     100    200    500    1,000  10,000
Univariate BBSDs  s gn       0.00   0.00   0.03   0.03   0.07   0.10   0.10   0.10
Univariate BBSDs  m gn       0.00   0.00   0.10   0.13   0.13   0.13   0.23   0.37
Univariate BBSDs  l gn       0.17   0.27   0.53   0.63   0.67   0.83   0.87   1.00
Univariate BBSDs  s img      0.00   0.00   0.23   0.30   0.40   0.63   0.70   0.93
Univariate BBSDs  m img      0.30   0.37   0.60   0.67   0.70   0.80   0.90   1.00
Univariate BBSDs  l img      0.30   0.50   0.70   0.70   0.77   0.87   0.97   1.00
Univariate BBSDs  adv        0.13   0.27   0.40   0.43   0.53   0.77   0.83   0.90
Univariate BBSDs  ko         0.00   0.00   0.07   0.07   0.07   0.33   0.40   0.70
Univariate BBSDs  m img+ko   0.13   0.40   0.87   0.93   0.90   1.00   1.00   1.00
Univariate BBSDs  oz+m img   0.67   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Multivariate UAE  s gn       0.03   0.03   0.03   0.03   0.03   0.07   0.07   –
Multivariate UAE  m gn       0.03   0.03   0.03   0.03   0.17   0.27   0.30   –
Multivariate UAE  l gn       0.50   0.57   0.67   0.70   0.80   0.90   1.00   –
Multivariate UAE  s img      0.17   0.20   0.27   0.30   0.40   0.47   0.63   –
Multivariate UAE  m img      0.23   0.33   0.37   0.40   0.47   0.60   0.70   –
Multivariate UAE  l img      0.30   0.30   0.37   0.47   0.60   0.77   0.87   –
Multivariate UAE  adv        0.03   0.20   0.27   0.27   0.33   0.40   0.40   –
Multivariate UAE  ko         0.10   0.13   0.13   0.13   0.17   0.17   0.30   –
Multivariate UAE  m img+ko   0.20   0.30   0.37   0.53   0.54   0.63   0.87   –
Multivariate UAE  oz+m img   0.27   0.63   0.77   1.00   1.00   1.00   1.00   –

Table 2: Shift detection performance based on shift intensity (a) and perturbed sample percentages
(b) using the best-performing DR technique (univariate: BBSDs, multivariate: UAE). Underlined
entries indicate accuracy values larger than 0.5.

(a) Detection accuracy of varying shift intensities.

                     Number of samples from test
Test     Intensity  10     20     50     100    200    500    1,000  10,000
Univ.    Small      0.00   0.00   0.14   0.14   0.18   0.36   0.40   0.54
Univ.    Medium     0.14   0.21   0.39   0.38   0.42   0.57   0.66   0.76
Univ.    Large      0.32   0.54   0.78   0.82   0.83   0.92   0.96   1.00
Multiv.  Small      0.11   0.11   0.12   0.14   0.20   0.23   0.33   –
Multiv.  Medium     0.11   0.19   0.23   0.27   0.32   0.42   0.44   –
Multiv.  Large      0.34   0.45   0.57   0.68   0.72   0.82   0.93   –

(b) Detection accuracy of varying shift percentages.

                      Number of samples from test
Test     Percentage  10     20     50     100    200    500    1,000  10,000
Univ.    10%         0.11   0.15   0.24   0.25   0.28   0.44   0.54   0.66
Univ.    50%         0.14   0.28   0.52   0.53   0.60   0.68   0.72   0.85
Univ.    100%        0.26   0.41   0.61   0.64   0.70   0.82   0.84   0.86
Multiv.  10%         0.12   0.13   0.21   0.26   0.27   0.31   0.44   –
Multiv.  50%         0.19   0.27   0.41   0.41   0.47   0.57   0.60   –
Multiv.  100%        0.29   0.41   0.44   0.53   0.60   0.70   0.78   –

easily extended to also double as a shift detector. Moreover, black-box models with soft predictions
that were built and trained in the past can be turned into shift detectors retrospectively.
Shift Types: Table 1b lists shift detection accuracy values for each distinct shift as an increasing
number of samples is obtained from the target domain. Specifically, we see that l gn, m img, l img,
m img+ko, oz+m img, and even adv are easily detectable, many of them even with few samples,
while s gn, m gn, and ko are hard to detect even with many samples. With a few exceptions, the
best DR technique (BBSDs for multiple univariate tests, UAE for multivariate tests) is significantly
faster and more accurate at detecting shift than the average of all dimensionality reduction methods.
Shift Strength: Based on the results in Table 2a, we can conclude that small shifts (s gn, s img,
and ko) are harder to detect than medium shifts (m gn, m img, and adv) which in turn are harder
to detect than large shifts (l gn, l img, m img+ko, and oz+m img). Specifically, we see that large
shifts can on average already be detected with better than chance accuracy at only 20 samples using
BBSDs, while medium and small shifts require orders of magnitude more samples in order to achieve
similar accuracy. Moreover, the results in Table 2b show that shifts affecting only 10% of the
target samples are hard to detect, suggesting that this setting might be better addressed via
outlier detection, whereas perturbation percentages of 50% and 100% can already be detected with
better than chance accuracy using 50 samples.

[Figure 2 plots. Panels (a)-(c): p-value vs. number of samples from test (10^1 to 10^4) for NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, and Classif. Panels (e)-(g): accuracy vs. number of samples from test (10^1 to 10^4) for p, q, and Classif.]
(a) Shift test (univ.) with 10% perturbed test data. (b) Shift test (univ.) with 50% perturbed test data. (c) Shift test (univ.) with 100% perturbed test data. (d) Top different.
(e) Classification accuracy on 10% perturbed data. (f) Classification accuracy on 50% perturbed data. (g) Classification accuracy on 100% perturbed data. (h) Top similar.
Figure 2: Shift detection results for medium image shift on MNIST. Subfigures (a)-(c) show the
p-value evolution of the different DR methods with varying percentages of perturbed data, while
subfigures (e)-(g) show the obtainable accuracies over the same perturbations. Subfigures (d) and
(h) show the most different and most similar exemplars returned by the domain classifier across
perturbation percentages. Plots show mean values obtained over 5 random runs with a 1-σ error-bar.

[Figure 3 panels. (a): shift test (univ.) with shuffled sets containing images from all angles.
(b): shift test (univ.) with angle-partitioned source and target sets. (c): top different.
(d): classification accuracy on randomly shuffled sets containing images from all angles.
(e): classification accuracy on angle-partitioned source and target sets. (f): top similar.
Axes: p-value or accuracy vs. number of samples from test. Methods: NoRed, PCA, SRP, UAE, TAE,
BBSDs, BBSDh, Classif; accuracy curves: p, q, Classif.]
Figure 3: Shift detection results on COIL-100 dataset. Subfigure organization is similar to Figure 2.

Most Anomalous Samples and Shift Malignancy: Across all experiments, we observe that the
most different and most similar examples returned by the domain classifier are useful in charac-
terizing the shift. Furthermore, we can successfully distinguish malignant from benign shifts (as
reported in Table 1b) by using the framework proposed in Section 3.4. While we recognize that
having access to an external labeling function is a strong assumption and that accessing all true la-
bels would be prohibitive at deployment, our experimental results also showed that, compared to the
total sample size, two to three orders of magnitude fewer labeled examples suffice to obtain a good
approximation of the (usually unknown) target accuracy.
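
The idea of estimating target accuracy from a small number of labeled anomalous samples can be sketched as follows. This is a minimal illustration, not the procedure from Section 3.4: the function names, the `label_oracle` callable, and the fixed `tolerance` decision rule are all hypothetical and introduced here only for the example.

```python
from typing import Callable, Sequence


def accuracy_on_top_anomalous(
    domain_scores: Sequence[float],
    predictions: Sequence[int],
    label_oracle: Callable[[int], int],
    k: int,
) -> float:
    """Label-classifier accuracy on the k samples ranked most target-like."""
    if k <= 0 or k > len(domain_scores):
        raise ValueError("k must be between 1 and the number of samples")
    # Highest domain-classifier score first, i.e. most confidently "target".
    ranked = sorted(
        range(len(domain_scores)), key=lambda i: domain_scores[i], reverse=True
    )
    top_k = ranked[:k]
    # Only these k samples are sent to the (expensive) labeling oracle.
    correct = sum(1 for i in top_k if predictions[i] == label_oracle(i))
    return correct / k


def looks_malignant(
    source_accuracy: float, top_accuracy: float, tolerance: float = 0.05
) -> bool:
    """Flag the shift as harmful when accuracy drops by more than `tolerance`.

    The threshold rule is an assumption made for this sketch.
    """
    return (source_accuracy - top_accuracy) > tolerance


if __name__ == "__main__":
    scores = [0.95, 0.10, 0.80, 0.30, 0.99]   # domain-classifier outputs
    preds = [1, 0, 0, 1, 2]                   # label-classifier predictions
    truth = [0, 0, 0, 1, 7]                   # what the oracle would return
    acc = accuracy_on_top_anomalous(scores, preds, truth.__getitem__, k=3)
    print(acc, looks_malignant(source_accuracy=0.98, top_accuracy=acc))
```

Only the `k` highest-ranked samples are ever labeled, which reflects the point that far fewer labels than the total sample size are needed to approximate the target accuracy.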

[Figure 4 panels (pixel heat maps): training set average for 6; test set average for 6;
training set 6s minus test set 6s.]

Figure 4: Difference plot for training and test set sixes.

Individual Examples: While full results with exact p-value evolution and anomalous samples are
documented in the supplementary material, we briefly present two illustrative results in detail:
(a) Synthetic medium image shift on MNIST (Figure 2): From subfigures (a)-(c), we see that most
methods are able to detect the simulated shift with BBSDs being the quickest method for all tested
perturbation percentages. We further observe in subfigures (e)-(g) that the (true) accuracy on sam-
ples from q increasingly deviates from the model’s performance on source data from p as more
samples are perturbed. Since true target accuracy is usually unknown, we use the accuracy obtained
on the top anomalous labeled instances returned by the domain classifier Classif. As we can see,
these values significantly deviate from accuracies obtained on p, which is why we consider this shift
harmful to the label classifier’s performance.
(b) Rotation angle partitioning on COIL-100 (Figure 3): Subfigures (a) and (b) show that our testing
framework correctly reports no shift for the randomly shuffled dataset containing images from all
angles, while it identifies the partitioned dataset as noticeably different. However, as we
can see from subfigure (e), this shift does not harm the classifier’s performance, meaning that the
classifier can safely be deployed even when encountering this specific dataset shift.
Original Splits: According to our tests, the original split from the MNIST dataset appears to exhibit
a dataset shift. After inspecting the most anomalous samples returned by the domain classifier, we
observed that many of these samples depicted the digit 6. A mean-difference plot (see Figure 4)
between sixes from the training set and sixes from the test set revealed that the training instances are
rotated slightly to the right, while the test samples are drawn more open and centered. To back up
this claim even further, we also carried out a two-sample KS test between the two sets of sixes in the
input space and found that the two sets can conclusively be regarded as different with a p-value of
2.7 · 10^−10, well below the respective Bonferroni threshold of 6.3 · 10^−5. While this
specific shift does not look particularly significant to the human eye (and is also declared harmless
by our malignancy detector), this result nevertheless shows that the original MNIST split is not i.i.d.
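
The per-dimension KS test with a Bonferroni-corrected threshold can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the implementation used for these experiments: it uses only the Python standard library, computes the p-value from the asymptotic Kolmogorov distribution (an approximation), and assumes a significance level of α = 0.05 with one test per pixel of a 28 × 28 image, which gives 0.05 / 784 ≈ 6.4 · 10^−5 and is consistent with the threshold reported above. All function names are hypothetical.

```python
import math
import random


def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both CDFs past every observation equal to v (handles ties).
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        return 1.0  # the tail probability is within about 1e-5 of 1 here
    tail = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * tail))


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / num_tests


def detect_shift(source, target, alpha=0.05):
    """Run one KS test per dimension; reject if any p-value is below alpha / K."""
    num_dims = len(source[0])
    threshold = bonferroni_threshold(alpha, num_dims)
    p_values = []
    for k in range(num_dims):
        xs = [row[k] for row in source]
        ys = [row[k] for row in target]
        p_values.append(ks_pvalue(ks_statistic(xs, ys), len(xs), len(ys)))
    rejected = [k for k, p in enumerate(p_values) if p < threshold]
    return bool(rejected), rejected, p_values


if __name__ == "__main__":
    rng = random.Random(0)
    src = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
    # Synthetic target: dimension 5 is shifted, the others follow the source law.
    tgt = [[rng.gauss(1.0 if k == 5 else 0.0, 1.0) for k in range(16)]
           for _ in range(200)]
    detected, rejected, _ = detect_shift(src, tgt)
    print(detected, rejected)
```

Bonferroni correction is conservative, so a shift is declared only when at least one dimension provides strong evidence on its own; the list of rejected dimensions also hints at where the two samples differ.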

6 Conclusions
In this paper, we put forth a comprehensive empirical investigation, examining the ways in which
dimensionality reduction and two-sample testing might be combined to produce a practical pipeline
for detecting distribution shift in real-life machine learning systems. Our results yielded the surprising
insights that (i) black-box shift detection with soft predictions works well across a wide variety
of shifts, even when some of its underlying assumptions do not hold; (ii) aggregated univariate
tests performed separately on each latent dimension offer comparable shift detection performance
to multivariate two-sample tests; and (iii) harnessing predictions from domain-discriminating
classifiers enables characterization of a shift’s type and its malignancy. Moreover, we produced
the surprising observation that the MNIST dataset, despite ostensibly representing a random split,
exhibits a significant (although not worrisome) distribution shift.
Our work suggests several open questions that might offer promising paths for future work, including
(i) shift detection for online data, which would require us to account for and exploit the high degree
of correlation between adjacent time steps [22]; and, since we have mostly explored a standard image
classification setting for our experiments, (ii) applying our framework to other machine learning
domains such as natural language processing or graphs.

Acknowledgements
We thank the Center for Machine Learning and Health, a joint venture of Carnegie Mellon Univer-
sity, UPMC, and the University of Pittsburgh for supporting our collaboration with Abridge AI to
develop robust models for machine learning in healthcare. We are also grateful to Salesforce Re-
search, Facebook AI Research, and Amazon AI for their support of our work on robust deep learning
under distribution shift.

References
[1] Dimitris Achlioptas. Database-Friendly Random Projections: Johnson-Lindenstrauss with Bi-
nary Coins. Journal of Computer and System Sciences, 66, 2003.
[2] Alexander A Alemi, Ian Fischer, and Joshua V Dillon. Uncertainty in the Variational Informa-
tion Bottleneck. arXiv Preprint arXiv:1807.00906, 2018.
[3] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility Theorems for Domain
Adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS),
2010.
[4] J Martin Bland and Douglas G Altman. Multiple Significance Tests: The Bonferroni Method.
BMJ, 1995.
[5] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Pra-
soon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to End
Learning for Self-Driving Cars. arXiv Preprint arXiv:1604.07316, 2016.
[6] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: Identifying
Density-Based Local Outliers. In ACM SIGMOD Record, 2000.
[7] Yee Seng Chan and Hwee Tou Ng. Word Sense Disambiguation with Distribution Estimation.
In International Joint Conference on Artificial intelligence (IJCAI), 2005.
[8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM
Computing Surveys (CSUR), 2009.
[9] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Arad-
hye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & Deep Learning
for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recom-
mender Systems. ACM, 2016.
[10] Hyunsun Choi and Eric Jang. Generative Ensembles for Robust Anomaly Detection. arXiv
Preprint arXiv:1810.01392, 2018.
[11] François Chollet et al. Keras. https://keras.io, 2015.
[12] Paul Covington, Jay Adams, and Emre Sargin. Deep Neural Networks for YouTube Recom-
mendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM,
2016.
[13] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adver-
sarial Examples. In International Conference on Learning Representations (ICLR), 2014.
[14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep
Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Sig-
nal Processing. IEEE, 2013.
[15] Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt,
and Bernhard Schölkopf. Covariate Shift by Kernel Mean Matching. Journal of Machine
Learning Research (JMLR), 2009.
[16] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander
Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research (JMLR), 2012.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image
Recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
[18] Nicholas A Heard and Patrick Rubin-Delanchy. Choosing Between Methods of Combining
p-Values. Biometrika, 2018.
[19] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-Of-
Distribution Examples in Neural Networks. In International Conference on Learning Rep-
resentations (ICLR), 2017.
[20] Dan Hendrycks, Mantas Mazeika, and Thomas G Dietterich. Deep Anomaly Detection with
Outlier Exposure. In International Conference on Learning Representations (ICLR), 2019.
[21] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly,
Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep Neural
Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine,
29, 2012.
[22] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, Nonpara-
metric, Non-Asymptotic Confidence Sequences. arXiv Preprint arXiv:1810.08240, 2018.
[23] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Im-
ages. Technical report, Citeseer, 2009.
[24] Paras Lakhani and Baskaran Sundaram. Deep Learning at Chest Radiography: Automated
Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiol-
ogy, 284, 2017.
[25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning
Applied to Document Recognition. Proceedings of the IEEE, 86, 1998.
[26] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training Confidence-Calibrated Clas-
sifiers for Detecting Out-Of-Distribution Samples. In International Conference on Learning
Representations (ICLR), 2018.
[27] Ping Li, Trevor J Hastie, and Kenneth W Church. Very Sparse Random Projections. In Pro-
ceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD). ACM, 2006.
[28] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the Reliability of Out-Of-Distribution Im-
age Detection in Neural Networks. In International Conference on Learning Representations
(ICLR), 2018.
[29] Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and Correcting for Label Shift
with Black Box Predictors. In International Conference on Machine Learning (ICML), 2018.
[30] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation Forest. In International Conference
on Data Mining (ICDM), 2008.
[31] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer
Feature Learning with Joint Distribution Adaptation. In International Conference on Computer
Vision (ICCV), 2013.
[32] Thomas M Loughin. A Systematic Comparison of Methods for Combining p-Values from
Independent Tests. Computational Statistics & Data Analysis, 2004.
[33] Markos Markou and Sameer Singh. Novelty Detection: A Review: Part 1: Statistical Ap-
proaches. Signal Processing, 2003.
[34] Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-
100). 1996.
[35] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng.
Reading Digits in Natural Images With Unsupervised Feature Learning. 2011.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret-
tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Per-
rot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12:2825–2830, 2011.
[37] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasser-
man. On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests
in High Dimensions. In Association for the Advancement of Artificial Intelligence (AAAI),
2015.
[38] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the Outputs of a Clas-
sifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 2002.
[39] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg
Langs. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide
Marker Discovery. In International Conference on Information Processing in Medical Imag-
ing, 2017.
[40] Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C
Platt. Support Vector Method for Novelty Detection. In Advances in Neural Information
Processing Systems (NIPS), 2000.
[41] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris
Mooij. On Causal and Anticausal Learning. In International Conference on Machine Learning
(ICML), 2012.
[42] D Sculley, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine
Learning: The High-Interest Credit Card of Technical Debt. In SE4ML: Software Engineering
for Machine Learning (NIPS 2014 Workshop), 2014.
[43] Alireza Shafaei, Mark Schmidt, and James J Little. Does Your Model Know the Digit 6 Is
Not a Cat? A Less Biased Evaluation of Outlier Detectors. arXiv Preprint arXiv:1809.04729,
2018.
[44] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-Of-Distribution Detection Using Multiple
Semantic Label Representations. In Advances in Neural Information Processing Systems
(NeurIPS), 2018.
[45] Hidetoshi Shimodaira. Improving Predictive Inference Under Covariate Shift by Weighting
the Log-Likelihood Function. Journal of Statistical Planning and Inference, 2000.
[46] R John Simes. An Improved Bonferroni Procedure for Multiple Tests of Significance.
Biometrika, 1986.
[47] Zak Stone, Todd Zickler, and Trevor Darrell. Autotagging Facebook: Social Network Context
Improves Photo Annotation. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition Workshops. IEEE, 2008.
[48] Amos Storkey. When Training and Test Sets Are Different: Characterizing Learning Transfer.
Dataset Shift in Machine Learning, 2009.
[49] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki
Kawanabe. Direct Importance Estimation with Model Selection and Its Application to Covari-
ate Shift Adaptation. In Advances in Neural Information Processing Systems (NIPS), 2008.
[50] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural
Networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[51] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Good-
fellow, and Rob Fergus. Intriguing Properties of Neural Networks. In International Conference
on Learning Representations (ICLR), 2014.
[52] Charles Truong, Laurent Oudre, and Nicolas Vayatis. A Review of Change Point Detection
Methods. arXiv Preprint arXiv:1801.00718, 2018.

[53] Vladimir Vovk and Ruodu Wang. Combining p-Values via Averaging. arXiv Preprint
arXiv:1212.4966, 2018.
[54] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for
Benchmarking Machine Learning Algorithms, 2017.
[55] Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated Prod-
uct Method for Combining p-Values. Genetic Epidemiology: The Official Publication of the
International Genetic Epidemiology Society, 2002.
[56] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain Adapta-
tion Under Target and Conditional Shift. In International Conference on Machine Learning
(ICML), 2013.
[57] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial Attacks on Neural
Networks for Graph Data. In International Conference on Knowledge Discovery & Data Min-
ing (KDD), 2018.

A Detailed Shift Detection Results

Our complete shift detection results in which we evaluate different kinds of target shifts on MNIST
and CIFAR-10 using the proposed methods are documented below. In addition to our artificially
generated shifts, we also evaluated our testing procedure on the original splits provided by MNIST,
Fashion MNIST, CIFAR-10, and SVHN.

A.1 Artificially Generated Shifts

A.1.1 MNIST

[Figure 5 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
adversarial samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% adversarial samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 5: MNIST adversarial shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 6 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
adversarial samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 6: MNIST adversarial shift, multivariate two-sample tests.

[Figure 7 panels. (a)-(c): p-value vs. number of samples from test when knocking out 10%, 50%,
and 100% of class 0 (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test when knocking out 10%, 50%, and 100% of class 0 (curves: p, q,
Classif). (g): top different samples. (h): top similar samples.]

Figure 7: MNIST knock-out shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 8 panels. (a)-(c): p-value vs. number of samples from test when knocking out 10%, 50%,
and 100% of class 0 (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 8: MNIST knock-out shift, multivariate two-sample tests.

[Figure 9 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 9: MNIST large Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 10 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 10: MNIST large Gaussian noise shift, multivariate two-sample tests.

[Figure 11 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 11: MNIST medium Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 12 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 12: MNIST medium Gaussian noise shift, multivariate two-sample tests.

[Figure 13 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q).
(g): top different samples. (h): top similar samples.]

Figure 13: MNIST small Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 14 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 14: MNIST small Gaussian noise shift, multivariate two-sample tests.

[Figure 15 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 15: MNIST large image shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 16 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 16: MNIST large image shift, multivariate two-sample tests.

[Figure 17 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 17: MNIST medium image shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 18 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 18: MNIST medium image shift, multivariate two-sample tests.

[Figure 19 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 19: MNIST small image shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 20 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 20: MNIST small image shift, multivariate two-sample tests.

[Figure 21 panels. (a)-(c): p-value vs. number of samples from test when knocking out 10%, 50%,
and 100% of class 0 (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test when knocking out 10%, 50%, and 100% of class 0 (curves: p, q,
Classif). (g): top different samples. (h): top similar samples.]

Figure 21: MNIST medium image shift (50%, fixed) plus knock-out shift (variable), univariate two-
sample tests + Bonferroni aggregation.

[Figure 22 panels. (a)-(c): p-value vs. number of samples from test when knocking out 10%, 50%,
and 100% of class 0 (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 22: MNIST medium image shift (50%, fixed) plus knock-out shift (variable), multivariate
two-sample tests.

[Figure 23 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% perturbed samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 23: MNIST only-zero shift (fixed) plus medium image shift (variable), univariate two-sample
tests + Bonferroni aggregation.

[Figure 24 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
perturbed samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 24: MNIST only-zero shift (fixed) plus medium image shift (variable), multivariate two-
sample tests.

[Figure 25 panels. (a): p-value vs. number of samples from test for a randomly shuffled dataset
with the same split proportions as the original dataset. (b): p-value vs. number of samples from
test for the original split (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif).
(c): accuracy vs. number of samples from test for the randomly shuffled dataset. (d): accuracy vs.
number of samples from test for the original split (curves: p, q, Classif). (e): top different
samples. (f): top similar samples.]

Figure 25: MNIST to USPS domain adaptation, univariate two-sample tests + Bonferroni aggregation.

[Figure 26 panels. (a): p-value vs. number of samples from test for a randomly shuffled dataset
with the same split proportions as the original dataset. (b): p-value vs. number of samples from
test for the original split (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 26: MNIST to USPS domain adaptation, multivariate two-sample tests.

A.1.2 CIFAR-10

[Figure 27 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
adversarial samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif). (d)-(f): accuracy
vs. number of samples from test for 10%, 50%, and 100% adversarial samples (curves: p, q, Classif).
(g): top different samples. (h): top similar samples.]

Figure 27: CIFAR-10 adversarial shift, univariate two-sample tests + Bonferroni aggregation.

[Figure 28 panels. (a)-(c): p-value vs. number of samples from test for 10%, 50%, and 100%
adversarial samples (methods: NoRed, PCA, SRP, UAE, TAE, BBSDs).]
Figure 28: CIFAR-10 adversarial shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) Knock out 10% of class 0. (b) Knock out 50% of class 0. (c) Knock out 100% of class 0.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q.]
(d) Knock out 10% of class 0. (e) Knock out 50% of class 0. (f) Knock out 100% of class 0.
(g) Top different samples. (h) Top similar samples. In both panels: no samples available as Classif did not detect a shift.
Figure 29: CIFAR-10 knock-out shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) Knock out 10% of class 0. (b) Knock out 50% of class 0. (c) Knock out 100% of class 0.
Figure 30: CIFAR-10 knock-out shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples.
Figure 31: CIFAR-10 large Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 32: CIFAR-10 large Gaussian noise shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples.
Figure 33: CIFAR-10 medium Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 34: CIFAR-10 medium Gaussian noise shift, multivariate two-sample tests.

[Plots omitted: p-value.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples. In both panels: no samples available as Classif did not detect a shift.
Figure 35: CIFAR-10 small Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 36: CIFAR-10 small Gaussian noise shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples.
Figure 37: CIFAR-10 large image shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 38: CIFAR-10 large image shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples.
Figure 39: CIFAR-10 medium image shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 40: CIFAR-10 medium image shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples.
Figure 41: CIFAR-10 small image shift, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 42: CIFAR-10 small image shift, multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) Knock out 10% of class 0. (b) Knock out 50% of class 0. (c) Knock out 100% of class 0.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(d) Knock out 10% of class 0. (e) Knock out 50% of class 0. (f) Knock out 100% of class 0.
(g) Top different samples. (h) Top similar samples.
Figure 43: CIFAR-10 medium image shift (50%, fixed) plus knock-out shift (variable), univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) Knock out 10% of class 0. (b) Knock out 50% of class 0. (c) Knock out 100% of class 0.
Figure 44: CIFAR-10 medium image shift (50%, fixed) plus knock-out shift (variable), multivariate two-sample tests.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^3); legend: p, q, Classif.]
(d) 10% perturbed samples. (e) 50% perturbed samples. (f) 100% perturbed samples.
(g) Top different samples. (h) Top similar samples.
Figure 45: CIFAR-10 only-zero shift (fixed) plus medium image shift (variable), univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) 10% perturbed samples. (b) 50% perturbed samples. (c) 100% perturbed samples.
Figure 46: CIFAR-10 only-zero shift (fixed) plus medium image shift (variable), multivariate two-sample tests.

A.2 Original Splits

A.2.1 MNIST

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(c) Randomly shuffled dataset with same split proportions as original dataset. (d) Original split.
(e) Top different samples. (f) Top similar samples.
Figure 47: MNIST randomized and original split, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
Figure 48: MNIST randomized and original split, multivariate two-sample tests.

A.2.2 Fashion MNIST

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q.]
(c) Randomly shuffled dataset with same split proportions as original dataset. (d) Original split.
(e) Top different samples. (f) Top similar samples. In both panels: no samples available as Classif did not detect a shift.
Figure 49: Fashion MNIST randomized and original split, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
Figure 50: Fashion MNIST randomized and original split, multivariate two-sample tests.

A.2.3 CIFAR-10

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q.]
(c) Randomly shuffled dataset with same split proportions as original dataset. (d) Original split.
(e) Top different samples. (f) Top similar samples. In both panels: no samples available as Classif did not detect a shift.
Figure 51: CIFAR-10 randomized and original split, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
Figure 52: CIFAR-10 randomized and original split, multivariate two-sample tests.

A.2.4 SVHN

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^4); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
[Plots omitted: accuracy vs. number of samples from test (10^1 to 10^4); legend: p, q, Classif.]
(c) Randomly shuffled dataset with same split proportions as original dataset. (d) Original split.
(e) Top different samples. (f) Top similar samples.
Figure 53: SVHN randomized and original split, univariate two-sample tests + Bonferroni aggregation.

[Plots omitted: p-value vs. number of samples from test (10^1 to 10^3); legend: NoRed, PCA, SRP, UAE, TAE, BBSDs.]
(a) Randomly shuffled dataset with same split proportions as original dataset. (b) Original split.
Figure 54: SVHN randomized and original split, multivariate two-sample tests.
