
Out-of-distribution Detection in Time-series Domain: A Novel Seasonal Ratio Scoring Approach

Published: 19 December 2023

Abstract

Safe deployment of time-series classifiers for real-world applications relies on the ability to detect data that is not generated from the same distribution as the training data. This task is referred to as out-of-distribution (OOD) detection. We consider the novel problem of OOD detection for the time-series domain. We discuss the unique challenges posed by time-series data and explain why prior methods from the image domain will perform poorly. Motivated by these challenges, this article proposes a novel Seasonal Ratio Scoring (SRS) approach. SRS consists of three key algorithmic steps. First, each input is decomposed into a class-wise semantic component and a remainder. Second, this decomposition is employed to estimate the class-wise conditional likelihoods of the input and remainder using deep generative models. The seasonal ratio score is computed from these estimates. Third, a threshold interval is identified from the in-distribution data to detect OOD examples. Experiments on diverse real-world benchmarks demonstrate that the SRS method is well-suited for time-series OOD detection when compared to baseline methods.

1 Introduction

Time-series data analytics using deep neural networks [22] enable many real-world applications in health-care [40], finance [46], smart grids [49], energy management [21, 44], and intrusion detection [24]. However, safe and reliable deployment of such machine learning (ML) systems requires robust models [1, 2, 3, 4, 20], uncertainty quantification [16, 17], and the ability to detect time-series data that do not follow the distribution of the training data, a.k.a. in-distribution (ID) data. For example, a circumstantial event for an epilepsy patient or a sudden surge in a smart grid branch results in sensor readings that deviate from the training distribution. This task is referred to as out-of-distribution (OOD) detection. If the ML model encounters OOD inputs, then it can output wrong predictions with high confidence. Another important application of OOD detection for the time-series domain is synthetic data generation. Many time-series applications suffer from limited or imbalanced data, which motivates methods to generate synthetic data [39]. A key challenge is to automatically assess the validity of synthetic data, which can be alleviated using accurate OOD detectors.
There is a growing body of work on OOD detection for the image domain [8, 12, 28, 30, 43, 47] and other types of data such as genomic sequences [36]. These methods can be categorized into
Supervised methods that fine-tune the ML system or perform specific training to distinguish examples from ID and OOD.
Unsupervised methods that employ Deep Generative Models (DGMs) on unlabeled data to perform OOD detection.
However, time-series data, with its unique characteristics (e.g., sparse peaks, fast oscillations), poses challenges that are not encountered in the image domain:
Spatial relations between pixels are not similar to the temporal relations across different timesteps of time-series signals.
Pixel variables follow a categorical distribution of values \(\lbrace 0,1,\ldots ,255\rbrace ,\) whereas time-series variables follow a continuous distribution.
The semantics of images (e.g., background, edges) do not apply to time-series data.
Humans can identify OOD images for fine-tuning purposes, but this task is challenging for time-series data. Hence, prior OOD methods are not suitable for the time-series domain.
This article proposes a novel OOD detection algorithm for the time-series domain referred to as Seasonal Ratio Scoring (SRS). To the best of our knowledge, this is the first work on OOD detection over time-series data. SRS employs the Seasonal and Trend decomposition using Loess (STL) [11] on time-series signals from the ID data to create a class-wise semantic pattern and a remainder component for each signal. For example, in a human activity recognition system, SRS would extract a pattern “running” that semantically describes all the recorded “running” windows. If the person trips and falls, then SRS would detect that this event does not belong to the pre-defined activity classes and flag it as OOD. For this purpose, we train two separate DGMs to estimate the class-wise conditional likelihood of a given time-series signal and its STL-based remainder component. The Seasonal Ratio (SR) score for each time-series signal from the ID data is computed from these two estimates. A threshold interval is estimated from the statistics of all these scores over the ID data. Given a new time-series input and a classifier at testing time, the SRS approach computes the SR score for the predicted output and flags the time-series signal as an OOD example if the score lies outside the threshold interval. Figure 1 illustrates the SRS algorithm. The effectiveness of SRS critically depends on the extraction of accurate class-wise semantic components. Since time-series data is prone to warping and time-shifts, we also propose a new alignment approach based on Dynamic Time Warping (DTW) [32] to improve the output accuracy of the STL decomposition. Our experiments on diverse real-world time-series datasets demonstrate that the SRS method is well-suited for time-series OOD detection when compared to prior methods.
Fig. 1.
Fig. 1. Overview of the seasonal ratio (SR) scoring algorithm. The semantic component \(S_{\hat{y}}\) for the predicted output \(\hat{y}\) is obtained from the training stage via Seasonal and Trend decomposition using Loess (STL). The semantic component \(S_{\hat{y}}\) is subtracted from the time-series x to obtain the remainder r. The trained CVAE models \(\mathcal {M}_x\) and \(\mathcal {M}_r\) are used to compute the SR score. If the SR score is within the threshold interval \([\tau _l, \tau _u]\) identified during training, then x is classified as ID. Otherwise, it is flagged as OOD.
Contributions. The main contribution of this article is the development and evaluation of the SRS algorithm for OOD detection in the time-series domain. Specific contributions include:
(1)
Principled algorithm based on STL decomposition and deep generative models to compute the SR score to detect OOD time-series examples.
(2)
Novel time-series alignment algorithm based on dynamic time warping to improve the effectiveness of the SR score-based OOD detection.
(3)
Formulation of the experimental setting for time-series OOD detection. Experimental evaluation of SRS algorithm on real-world datasets and comparison with state-of-the-art baselines.
(4)
Open-source code and data for the SRS method are available at https://github.com/tahabelkhouja/SRS

2 Problem Setup

Suppose \(\mathcal {D}_{in}\) is an in-distribution (ID) time-series dataset with d examples \(\lbrace (x_i, y_i)\rbrace\) sampled from the distribution \(P^*\) defined on the joint space of input-output pairs \((\mathcal {X}, \mathcal {Y})\). Each \(x_i \in \mathbb {R}^{n \times T}\) from \(\mathcal {X}\) is a multi-variate time-series input, where n is the number of channels and T is the window-size of the signal. \(y_i \in \mathcal {Y}=\lbrace 1,\ldots ,C\rbrace\) represents the class label for time-series signal \(x_i\). We consider a time-series classifier \(F: \mathbb {R}^{n \times T} \rightarrow \lbrace 1, \ldots , C\rbrace\) learned using \(\mathcal {D}_{in}\). For example, in a health monitoring application using physiological sensors for patients diagnosed with cardiac arrhythmia, we use the measurements from wearable devices to predict the likelihood of cardiac failure.
OOD samples \((x,y)\) are typically generated from a distribution other than \(P^*\). Specifically, we consider a sample \((x,y)\) to be OOD if the class label y is different from the set of in-distribution class labels, i.e., \(y \notin \mathcal {Y}\). The classifier \(F_{\theta }(x)\) learned using \(\mathcal {D}_{in}\) will assign one of the C class labels from \(\mathcal {Y}\) when encountering an OOD sample \((x,y)\). Our goal is to detect such OOD examples for safe and reliable real-world deployment of time-series classifiers. We provide a summary of the mathematical notations used in this article in Table 1.
Table 1.
Variable | Definition
\(\mathcal {D}_{in}\) | Dataset of in-distribution time-series signals
\(P^*\) | True distribution of the time-series dataset
\(x_i\) | Input time-series signal
\(\mathcal {Y}\) | Set of output class labels \(y \in \lbrace 1,\ldots ,C\rbrace\)
\(F_{\theta }\) | Classifier that maps an input \(x_i\in \mathbb {R}^{n \times T}\) to a class label \(y\in \mathcal {Y}\)
\(S_y\) | Semantic pattern of a class y according to STL decomposition
SR | Seasonal Ratio score
SRS | Seasonal Ratio Scoring framework
Table 1. Mathematical Notations Used in This Article
Challenges of time-series data. The unique characteristics of time-series data (e.g., temporal relations across timesteps, fast oscillations, continuous distribution of variables) pose challenges not seen in the image domain. Real-world time-series datasets are typically small (relative to image datasets) and exhibit high class-imbalance [14]. Therefore, estimating a good approximation of the in-distribution \(P^*\) is hard, which results in the failure of prior OOD methods. Indeed, our experiments demonstrate that prior OOD methods are not suited for the time-series domain. As a prototypical example, Figure 7 shows the limitation of the Likelihood Regret score [43] in identifying OOD examples: the ID and OOD scores of real-world time-series examples overlap.
Fig. 2.
Fig. 2. Illustration of STL method for two different classes from the ERing dataset. Dotted signals are natural time-series signals x and the red signal is the semantic pattern \(S_y\).
Fig. 3.
Fig. 3. Histogram showing the ID and OOD scores along the seasonal ratio score axis. The seasonal ratio scores for OOD examples can be either greater or less than the seasonal ratio scores for ID examples.
Fig. 4.
Fig. 4. Illustration of the challenges in time-series data for STL decomposition: semantic component and remainder. (Left column) Set of natural time-series signals with an ECG wave as semantic component \(S_y\) and the corresponding remainders w.r.t. \(S_y\). (Right column) Time-series signals and remainders from STL decomposition after applying the alignment procedure.
Fig. 5.
Fig. 5. Illustration of the use of appropriate transformation to adjust the alignment between two time-series signals \(X^1\) (blue signal) and \(X^2\) (green signal).
Fig. 6.
Fig. 6. Illustration of two transformation choices for a time-series x aligned with a pattern S. (Left) One-to-many as the longest match, calling for the Expand transformation. (Right) Sequential one-to-one as the longest match, calling for the Translate transformation.
Fig. 7.
Fig. 7. Histogram showing the non-separability of ID and OOD LR scores (Top row) and the separability using the seasonal ratio method on real-world time-series data (Bottom row).

3 Background and Preliminaries

In this section, we provide the necessary background on conditional VAE and STL decomposition to better understand the proposed SR score-based OOD detection approach.
Conditional VAE. Variational Auto-Encoders (VAEs) are a class of likelihood-based generative models with many real-world applications [15]. They rely on the encoding of a raw input x as a latent Gaussian variable z to estimate the likelihood of x. The latent variable z is used to compute the likelihood of the training data: \(p_{\theta }(x)=\int p_{\theta }(x|z)p(z)dz\). Since the direct computation of this likelihood is impractical, the principle of the evidence lower bound (ELBO) [43] is employed. In this work, we consider the ID data \(\mathcal {D}_{in}\) as d input-output samples of the form \((x_i, y_i)\). We want to estimate the ID using both \(x_i\) and \(y_i\). Therefore, we propose to use a conditional VAE (CVAE) for this purpose. CVAEs rely on the encoding of a raw input \((x,y)\) as a latent Gaussian variable z to estimate the conditional likelihood of x given the class label y. A CVAE is similar to a VAE, with the key difference being the use of conditional probability over both \(x_i\) and \(y_i\). The ELBO objective of the CVAE is:
\begin{equation*} \mathcal {L}_{\text{ELBO}}\overset{\Delta }{=} \mathbb {E}_{\phi }\big [\log p_{\theta }(x|z,y)\big ] - D_{\text{KL}}\big [q_{\phi }(z|x,y)||p(z|y)\big ], \end{equation*}
where \(q_{\phi }(z|x,y)\) is the variational approximation of the true posterior distribution \(p_{\theta }(z|x,y)\). Since the CVAE only provides a lower bound on the log-likelihood of a given input, the exact log-likelihood is estimated using Monte Carlo sampling, as shown below:
\begin{equation} \mathcal {L}_M=\mathbb {E}_{z^m \sim q_{\phi }(z|x,y)}\bigg [\log \dfrac{1}{M} \sum _{m=1}^M \dfrac{p_{\theta }(x|z^m,y)p(z^m)}{q_{\phi }(z^m|x,y)}\bigg ] . \end{equation}
(1)
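To make Equation (1) concrete, the following is a minimal PyTorch sketch of the importance-sampling estimate of the conditional log-likelihood. The `encoder` and `decoder` callables, their Gaussian outputs, and the tensor shapes are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.distributions as D

def iw_log_likelihood(x, y, encoder, decoder, M=64):
    """Importance-weighted estimate of log p(x | y) (cf. Equation (1)).

    Assumed (hypothetical) interfaces:
      encoder(x, y) -> (mu_z, log_var_z)   parameters of q(z | x, y)
      decoder(z, y) -> (mu_x, log_var_x)   parameters of p(x | z, y), same shape as x
    x: (batch, n, T), y: (batch,) integer class labels.
    """
    mu_z, log_var_z = encoder(x, y)
    q_z = D.Normal(mu_z, torch.exp(0.5 * log_var_z))
    p_z = D.Normal(torch.zeros_like(mu_z), torch.ones_like(mu_z))

    log_ws = []
    for _ in range(M):
        z = q_z.rsample()
        mu_x, log_var_x = decoder(z, y)
        p_x_given_z = D.Normal(mu_x, torch.exp(0.5 * log_var_x))
        log_w = (p_x_given_z.log_prob(x).sum(dim=(-2, -1))   # log p(x | z, y)
                 + p_z.log_prob(z).sum(dim=-1)               # log p(z)
                 - q_z.log_prob(z).sum(dim=-1))              # log q(z | x, y)
        log_ws.append(log_w)

    log_ws = torch.stack(log_ws, dim=0)                      # (M, batch)
    return torch.logsumexp(log_ws, dim=0) - torch.log(torch.tensor(float(M)))
```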
The intuitive expectation from a DGM learned using training data is that it assigns a high likelihood to ID samples and a low likelihood to OOD samples. However, recent research showed that DGMs tend to assign unreliable likelihoods to OOD samples regardless of the different semantics of ID and OOD data [43]. Indeed, our experimental results shown in Table 4 demonstrate that this observation also holds for the time-series domain.
Table 2.
Domain label | Domain name
D1 | Motion
D2 | ECG
D3 | HAR
D4 | EEG
D5 | Audio
D6 | Other
Table 2. List of Domain Labels Used in the Experimental Section and the Corresponding UCR Domain Name
Table 3.
In-distribution Dataset name | MAE | OOD Dataset labels DS1–DS7
ArticW. (Motion) | 0.025 | CharacterT. | EigenW. | PenD. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
EigenW. (Motion) | 0.000 | ArticW. | CharacterT. | PenD. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
PenD. (Motion) | 0.001 | ArticW. | CharacterT. | EigenW. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
AtrialF. (ECG) | 0.005 | StandW. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
StandW. (ECG) | 0.012 | AtrialF. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
BasicM. (HAR) | 0.024 | Cricket | Epilepsy | Handw. | Libras | NATOPS | RacketS. | UWaveG.
Cricket (HAR) | 0.010 | BasicM. | Epilepsy | Handw. | Libras | NATOPS | RacketS. | UWaveG.
Epilepsy (HAR) | 0.030 | BasicM. | Cricket | Handw. | Libras | NATOPS | RacketS. | UWaveG.
Handw. (HAR) | 0.006 | BasicM. | Cricket | Epilepsy | Libras | NATOPS | RacketS. | UWaveG.
Libras (HAR) | 0.003 | BasicM. | Cricket | Epilepsy | Handw. | NATOPS | RacketS. | UWaveG.
NATOPS (HAR) | 0.046 | BasicM. | Cricket | Epilepsy | Handw. | Libras | RacketS. | UWaveG.
RacketS. (HAR) | 0.026 | BasicM. | Cricket | Epilepsy | Handw. | Libras | NATOPS | UWaveG.
UWaveG. (HAR) | 0.015 | BasicM. | Cricket | Epilepsy | Handw. | Libras | NATOPS | RacketS.
EthanolC. (Other) | 0.001 | ER. | LSST | PEMS-SF | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
ER. (Other) | 0.044 | EthanolC. | LSST | PEMS-SF | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
LSST (Other) | 0.008 | EthanolC. | ER. | PEMS-SF | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
PEMS-SF (Other) | 0.525 | EthanolC. | ER. | LSST | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
FingerM. (EEG) | 0.048 | HandM. | MotorI. | SelfR1. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
HandM. (EEG) | 0.006 | FingerM. | MotorI. | SelfR1. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
MotorI. (EEG) | 0.543 | FingerM. | HandM. | SelfR1. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
SelfR1. (EEG) | 0.009 | FingerM. | HandM. | MotorI. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
SelfR2. (EEG) | 0.012 | FingerM. | HandM. | MotorI. | SelfR1. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
Heartbeat (Audio) | 0.011 | JapaneseV. | SpokenA. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
Table 3. Reference Table for the In-domain Dataset Labels Used in the Experimental Section and the Corresponding UCR Dataset Name
The second column shows the average CVAE normalized reconstruction Mean Absolute Error (MAE) with a negligible variance \(\le\) 0.001 on the in-distribution data.
Table 4.
ID Train | AWR | SWJ | Ckt | HMD | Hbt | ERg
ID Test | 0.025 | 0.012 | 0.010 | 0.118 | 0.011 | 0.045
OOD AWR | \(\emptyset\) | 0.018 | 0.035 | 0.039 | 0.002 | 0.137
OOD Fmv | 1.658 | 0.146 | 0.146 | 0.039 | 0.071 | 6.132
Table 4. Average Reconstruction Error of CVAE Is Small on Both ID and OOD Data
The variance is \(\le 0.001\).
STL decomposition. STL [23] is a statistical method for decomposing a given time-series signal x into three different components: (1) the seasonality \(x_s\), a fixed regular pattern that recurs in the data; (2) the trend \(x_t\), the increment or decrement of the seasonality over time; and (3) the residual \(x_r\), which represents random additive noise. STL employs Loess (LOcal regrESSion) smoothing in an iterative process to estimate the seasonality component \(x_s\) [31]. The remainder is the residual that is left from the input x after subtracting both \(x_s\) and \(x_t\). For the proposed SRS algorithm, we assume that there is a fixed semantic pattern \(S_{y}\) for every class label \(y \in \mathcal {Y}\), and that this pattern recurs in all examples \((x_i, y_i)\) from \(\mathcal {D}_{in}\) with the same class label, i.e., \(y_i\)=y. We elaborate on this assumption in the next section, along with a reformulation of the problem that can be used when the assumption is violated. Hence, every time-series example has the following two elements: \(x_i = S_{y_i} + r_i\), where \(S_{y_i}\) is the pattern for the class label y=\(y_i\) and \(r_i\) is the remainder noise w.r.t. \(S_{y_i}\). For time-series classification tasks, ID samples are assumed to be stationary. Therefore, we propose to average the trend \(x_t\) component observed during training and include it in the semantic pattern \(S_{y}\). Figure 2 illustrates the above-mentioned decomposition for two different classes from the ERing dataset.
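As an illustration of this decomposition, the following is a minimal sketch that extracts a class-wise pattern \(S_y\) from the concatenated examples of one class for a single channel. It uses the STL implementation from the statsmodels package (our experiments use the STLdecompose package, as described in Section 6), and the function name and shapes are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def class_pattern(X_class, T):
    """Estimate the class-wise semantic pattern S_y for one channel (a sketch).

    X_class: array of shape (k, T) -- k in-distribution examples of one class.
    Returns (S_y, remainders), mirroring the decomposition x_i = S_y + r_i above.
    """
    stream = X_class.reshape(-1)                      # concatenate k examples of length T
    res = STL(stream, period=T, robust=True).fit()    # seasonality recurs with period T
    seasonal = res.seasonal.reshape(-1, T)            # one seasonal cycle per example
    S_y = seasonal.mean(axis=0) + res.trend.mean()    # fold the averaged trend into the pattern
    remainders = X_class - S_y                        # r_i = x_i - S_y
    return S_y, remainders
```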

4 Seasonal Ratio Scoring Approach for OOD Detection

Overview of SRS algorithm. The training stage proceeds as follows: We employ STL decomposition to get the semantic component \(S_y\) for each class label \(y \in \mathcal {Y}\)=\(\lbrace 1,\ldots ,C\rbrace\) from the given ID data \(\mathcal {D}_{in}\). Details on the STL decomposition steps within the SRS framework are provided in Section 4.2. To improve the accuracy of STL decomposition, we apply a time-series alignment method based on dynamic time warping to address scaling, warping, and time-shifts, as detailed in Section 4.3. We train two CVAE models \(\mathcal {M}_x\) and \(\mathcal {M}_r\) to estimate the class-wise conditional likelihood of each time-series signal \(x_i\) and its remainder component \(r_i\) w.r.t. the semantic component \(S_{y_i}\). The seasonal ratio score for each ID example \((x_i, y_i)\) from \(\mathcal {D}_{in}\) is computed as the ratio of the class-wise conditional likelihood estimates for \(x_i\) and its remainder \(r_i\): \(SR_i(x_i, y_i) \overset{\Delta }{=} \frac{p(x_i|y=y_i)}{p(r_i|y=y_i)}\). We compute the SR scores for all in-distribution examples from \(\mathcal {D}_{in}\) to estimate the threshold interval \([\tau _l, \tau _u]\) for OOD detection. During the inference stage, given a time-series signal x and a trained classifier \(F(x)\), we compute the SR score of x with the predicted output \(\hat{y}\)=\(F(x) \in \mathcal {Y}\) and identify it as an OOD example if the SR score lies outside the threshold interval \([\tau _l, \tau _u]\). Figure 1 provides a high-level illustration of the SRS algorithm.
Below, we first provide an intuitive explanation to motivate the SR score. Next, we describe the complete details of the SRS algorithm including both training and inference stages. Finally, we motivate and describe a time-series alignment approach based on dynamic time warping to improve the effectiveness of SRS.

4.1 Intuition for Seasonal Ratio Score

We explain the intuition behind the proposed SRS algorithm using STL decomposition of time-series signals and CVAE models for likelihood estimation. Current research shows that DGMs alone can fail to identify OOD samples [43]. They not only assign high likelihood to OOD samples, but they also exhibit good reconstruction quality on them. In fact, we show in Table 4 that CVAEs trained on given ID data generally exhibit a low reconstruction error on most of the OOD samples. Furthermore, we show in Table 8 that using a trained CVAE likelihood output for OOD detection fails to perform well. These results motivate the need for a new OOD scoring method for the time-series domain.
Table 5.
Setting / Method | AWR | SWJ | Ckt | HMD | Hbt | ERg
In-domain ODIN | 0.65\(\pm 0.03\) | 0.54\(\pm 0.01\) | 0.65\(\pm 0.03\) | 0.55\(\pm 0.01\) | 0.71\(\pm 0.03\) | 0.70\(\pm 0.03\)
In-domain GM | 0.70\(\pm 0.03\) | 0.58\(\pm 0.01\) | 0.64\(\pm 0.03\) | 0.58\(\pm 0.02\) | 0.80\(\pm 0.01\) | 0.75\(\pm 0.03\)
Cross-domain ODIN | 0.55\(\pm 0.01\) | 0.54\(\pm 0.01\) | 0.55\(\pm 0.01\) | 0.50 | 0.70\(\pm 0.01\) | 0.70\(\pm 0.01\)
Cross-domain GM | 0.56\(\pm 0.01\) | 0.54\(\pm 0.01\) | 0.65\(\pm 0.03\) | 0.55\(\pm 0.01\) | 0.75\(\pm 0.03\) | 0.78\(\pm 0.02\)
Table 5. Average AUROC Results for ODIN and GM
Table 6.
Method | AWR | SWJ | Ckt | HMD | Hbt | ERg
LR | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 1.00
Table 6. Average Performance of LR on OOD Examples Sampled from Gaussian/Uniform Distribution
Table 7.
Measure | AWR | SWJ | Ckt | HMD | Hbt | ERg
MAE | 0.047 | 0.031 | 0.029 | 0.078 | 0.002 | 0.086
DTW | 0.039 | 0.018 | 0.022 | 0.068 | 0.002 | 0.069
Table 7. Results for the Validity of Assumption 1
Average distance (MAE and DTW measures) between the semantic pattern \(S_y\) from STL and time-series examples x with label y from the testing data (with a negligible variance \(\le 0.001\)).
Table 8.
Table 8. AUROC Results for the Baselines, SR, and SR with Time-series Alignment (SR\(_a\)) on Different Datasets for Both In-domain and Cross-domain OOD Settings
Class-wise seasonality via STL decomposition. The proposed SRS algorithm relies on the following assumption to analyze the time-series space for OOD detection:
Assumption 1.
Each time-series example \((x_i, y_i)\) from the in-distribution data \(\mathcal {D}_{in}\) consists of two components. (1) A class-wise semantic pattern \(S_y\) for each class label \(y \in \mathcal {Y}\) representing the meaningful semantics of the class label y. (2) A remainder noise \(r_i\) representing an additive perturbation to the semantic portion. Hence, \(\forall (x_i, y_i) \in \mathcal {D}_{in}:~ x_i = S_{y_i} + r_i\)
We propose to employ STL decomposition to estimate the semantic pattern \(S_y\) (as illustrated in Figure 2) and to deduce the remainder noise, which can be due to several factors including errors in sensor measurements and noise in communication channels. These two components are analogous to the foreground and the background of an image, where the foreground is the interesting segment of the input that describes it, and the background may not necessarily be related to the foreground. In spite of this analogy, prior methods for the image domain are not suitable for time-series data, as explained in the related work (Section 5). In this decomposition, we cannot assume that \(S_y\) and r are independent for a given time-series example \((x, y)\), as \(S_y\) is class-dependent and r is the remainder of the input x given \(S_y\). Hence, we present their conditional likelihoods in the following observation:
Observation 1.
Let \(x \in \mathbb {R}^{n \times T}\) be a time-series signal and \(y_i \in \mathcal {Y}\)=\(\lbrace 1,\ldots ,C\rbrace\) be the corresponding class label. As x = \(S_{y_i} + r\), we have:
\begin{equation} p(x|y_i) = p(r|y_i)p(S_{y_i}|y_i). \end{equation}
(2)
Proof of Observation 1. As \(X = S_{y_i} + r\), it is intuitive to think that \(p(X|y=y_i) = p(S_{y_i})\times p(r)\). However, we cannot assume that \(S_{y_i}\) and r are independent, as \(S_{y_i}\) is class-dependent and r is the remainder of the input X given \(S_{y_i}\).
Therefore, we make use of the conditional probabilities of the components. The likelihood \(p(X|y=y_i)\) can be decomposed as follows:
\begin{equation*} \begin{split}p(X|y=y_i) &= p(S_{y_i},r|y=y_i)\\ &= \frac{p(S_{y_i},r,y=y_i)}{p(y=y_i)}\\ &= \frac{p(r,y=y_i)p(S_{y_i}|r,y=y_i)}{p(y=y_i)} . \end{split} \end{equation*}
For the conditional probability \(p(S_{y_i}|r,y=y_i)\), since only the pattern \(S_{y_i}\) depends on the class label and r is defined as non-meaningful noise added to the input, we can assume that \(S_{y_i}\) and r are conditionally independent given the class \(y_i\). Therefore, we have the following:
\begin{equation*} \begin{split}p(X|y=y_i) &= \frac{p(r,y=y_i)p(S_{y_i}|r,y=y_i)}{p(y=y_i)}\\ &= \frac{p(r,y=y_i)p(S_{y_i}|y=y_i)}{p(y=y_i)}\\ &= \frac{p(r|y=y_i)p(y=y_i)p(S_{y_i}|y=y_i)}{p(y=y_i)}\\ &= p(r|y=y_i)p(S_{y_i}|y=y_i).\\ \end{split} \end{equation*}
Discussion on Observation 1. \(S_y\) is a fixed class-wise semantic pattern that characterizes a class \(y \in \mathcal {Y}\). By definition, \(S_y\) is a deterministic pattern extracted using STL decomposition during training and is not a random variable. At inference time, we do not estimate \(S_y\) for each test input x; instead, we use the \(S_y\) computed during the training stage to estimate the remainder component r. Hence, \(p(S_y|y)\) is defined as a deterministic quantity and not as a density that SRS aims to estimate.
OOD detection using CVAEs. Observation 1 shows the relationship between the conditional likelihood of the input x and that of its remainder r. Since both are conditional likelihoods, we propose to employ CVAEs to estimate them. Recall that OOD examples come from an unknown distribution that is different from the in-distribution \(P^*\) and do not belong to any pre-defined class label from \(\mathcal {Y}\). Therefore, we propose to use the following observation for OOD detection in the time-series domain:
Observation 2.
Let \(x \in \mathbb {R}^{n \times T}\) be a time-series signal and \(y \in \mathcal {Y}\)=\(\lbrace 1,\ldots ,C\rbrace\) be the corresponding class label. As x = \(S_{y} + r\), x is an OOD example if \(p(x|y) \ne p(r|y)\) and an in-distribution example if \(p(x|y)\) = \(p(r|y)\).
Observation 2 shows how we can exploit the relationship between the estimated conditional likelihood of the time-series signal x and its remainder r to predict whether x is an OOD example or not. This observation relies on the assumption that \(p(S_y|y)=1\) for in-distribution data. For ID data, the semantic pattern \(S_y\) is a class-dependent signal that defines the class label y. Since the semantic component is guaranteed to be \(S_y\) for any time-series example with class label y, we have \(p(S_y|y)\)=1. However, OOD examples do not belong to any class label from \(\mathcal {Y}\), i.e., \(p(S_y|y) \ne 1\) for any \(y \in \mathcal {Y}\). To estimate \(p(x|y \in \mathcal {Y})\) and \(p(r|y \in {\mathcal {Y}})\) in Observation 2, we train two separate CVAE models using the in-distribution data \(\mathcal {D}_{in}\). While estimating two separate distributions can cause instability, we note that
(1)
During hyper-parameter tuning and the definition of the ID score range \([\tau _l, \tau _u]\), any outlier that may cause estimation instability will be omitted.
(2)
In case of drastic estimation instability, both CVAEs can be tuned during training time to overcome the problem.
(3)
If this instability is seen during inference time, then the SRS algorithm automatically indicates that the test example is an OOD example.
Discussion on Assumption 1. We acknowledge that this assumption may fail to hold in some real-world scenarios. However, surprisingly, our experimental results shown in Table 7 strongly corroborate this key assumption: the distance between each time-series signal \(x_i\) and its semantic pattern \(S_{y_i}\) is very small. The strong OOD performance of the SRS algorithm in our diverse experiments demonstrates the effectiveness of a simple approach based on this assumption.
Suppose the assumption does not hold and some class label y possesses \(K \gt 1\) different semantics \(\lbrace S^k_y\rbrace _{k\le K}\). In a human activity recognition example, it is safe to think that a certain activity (e.g., running or walking) will have \(K\gt 1\) different patterns (e.g., athletic runners vs. young runners). Therefore, the decomposition in Assumption 1 for a given time-series example \((x_i, y_i)\) will result in a semantic pattern describing the patterns of the different sub-categories (e.g., a pattern that describes both athletic runners and young runners). By using Loess smoothing, the STL seasonal component extracted over a multiple-pattern class is a pattern \(S_y\) that is a linear combination of \(\lbrace S^k_y\rbrace _{k\le K}\) (for our example, it describes the combination of both athletic runs and young runs). While the condition \(p(S_y|y)\)=1 of Observation 2 will not hold for an in-distribution example, \(p(S_y|y)\) is likely to be well-defined in terms of \(p(S^k_y|y)\), as \(\lbrace S^k_y\rbrace _{k\le K}\) are fixed and natural for the class label y. Hence, we can still rely on the CVAEs to estimate this distribution and to perform successful OOD detection. Alternatively, we can use a simple reformulation of the problem by clustering the time-series signals of a class label y (for which the assumption is not satisfied) to identify sub-classes and apply the SRS algorithm on the transformed data. Since we found the assumption to be true in all our experimental scenarios (see Table 7), we did not find the need to apply this reformulation.
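A minimal sketch of this clustering-based reformulation is given below. KMeans on flattened signals is only an illustrative choice (we did not need this reformulation in our experiments), and any time-series clustering method, e.g., a DTW-based one, could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_multimodal_class(X_class, K=2):
    """Split one class with K > 1 semantic patterns into K pseudo sub-classes.

    X_class: array of shape (k, n, T). Returns sub-class labels so that SRS
    can extract one semantic pattern per sub-class. Flattening the signals
    before KMeans is an illustrative simplification.
    """
    flat = X_class.reshape(len(X_class), -1)
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(flat)
```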

4.2 OOD Detection Approach

One key advantage of the SRS method is that it can be executed directly at the inference stage and does not require additional per-input training, unlike prior VAE-based methods such as Likelihood Regret scoring.
Training stage. Our overall training procedure for time-series OOD detection is as follows:
(1)
Train a CVAE \(\mathcal {M}_x\) using in-distribution data \(\mathcal {D}_{in}\) to estimate the conditional likelihood \(p(x|y \in \mathcal {Y})\) of time-series signal x.
(2)
Execute STL decomposition as follows:
(a)
From the training data \(\lbrace (x_i, y_i)\rbrace\), we create a group \(\mathcal {D}_y=\lbrace x_i| y_i=y\rbrace\).
(b)
We concatenate all the examples \(x_i\in \mathcal {D}_y\) in a single stream of data according to the T dimension. If \(\mathcal {D}_y\) has k examples, then the output is a single stream \(X_{stream}\in \mathbb {R}^{n \times (k\cdot T)}\)
(c)
We apply STL decomposition on the stream \(X_{stream}\), setting the seasonal period to T so that the resulting pattern has dimensions \(S_y \in \mathbb {R}^{n \times T}\).
(d)
We store the semantic component \(S_y\) to be used later in estimating the remainder component for any given training example \((x,y)\): r = \(x - S_y\).
(3)
Create the remainder for each training example \((x_i, y_i) \in \mathcal {D}_{in}\) using the patterns \(S_y\) for each class label y \(:~ r_i = x_i - S_{y_i}\). We train another CVAE \(\mathcal {M}_r\) using all these remainders to estimate the conditional likelihood \(p(r|y \in {\mathcal {Y}})\).
(4)
Compute seasonal ratio score for each \((x_i, y_i) \in D_{in}\) using the trained CVAEs \(\mathcal {M}_x\) and \(\mathcal {M}_r\).
\begin{equation} SR_i(x_i, y_i) \overset{\Delta }{=} \dfrac{p(x_i|y=y_i)}{p(r_i|y=y_i)} \end{equation}
(3)
(5)
Compute the mean \(\mu _{SR}\) and variance \(\sigma _{SR}\) over SR scores of all in-distribution examples seen during training. Set the OOD detection threshold interval as \([\tau _l, \tau _u]\) such that \(\tau _l\) = \(\mu _{SR} - \lambda \times \sigma _{SR}\) and \(\tau _u\) = \(\mu _{SR} + \lambda \times \sigma _{SR}\), where \(\lambda\) is a hyper-parameter.
(6)
Tune the hyper-parameter \(\lambda\) on the validation data to maximize OOD detection accuracy.
The choice of \([\tau _l, \tau _u]\) for OOD detection is motivated by the fractional nature of the seasonal ratio score. The SRS algorithm assumes that in-distribution examples satisfy \(p(x|y)= p(r|y)\), so their SR score is a quotient ideally centered around the value 1. Hence, we identify in-distribution examples by SR scores that are close to the mean score recorded during training, whether from the left (\(\tau _l\)) or the right (\(\tau _u\)) side. Indeed, we observe in Figure 3 that the SR scores for OOD examples can fall on either side of the SR scores for in-distribution examples. Ideally, the SR score for in-distribution examples is closer to \(\mu _{SR}\) than the SR scores for OOD examples, as illustrated in Figure 1. The hyper-parameter \(\lambda\) is tuned to define the valid range of SR scores for in-distribution examples from \(\mathcal {D}_{in}\). We note that the score can easily be changed to consider the quantiles of the ratios estimated during the training stage and use them to separate the regions of OOD and ID scores. In this case, we redefine \(\tau _l\) as the \((0.5 - \lambda)\)-quantile for the lower limit of the ID score and \(\tau _u\) as the \((0.5 + \lambda)\)-quantile for the upper limit, and we tune the hyper-parameter \(0\lt \lambda \le 0.5\) on the validation data to maximize OOD detection accuracy. Furthermore, the ID score range is not required to be symmetric: in the general case, we can define \(\tau _l\) as the \((0.5 - \lambda _l)\)-quantile and \(\tau _u\) as the \((0.5 + \lambda _u)\)-quantile, where \(\lambda _l \ne \lambda _u\). We have observed in our experiments that both settings give similar performance. Therefore, we only consider \(\tau _{u,l}\) = \(\mu _{SR} \pm \lambda \times \sigma _{SR}\) for simplicity in our experimental evaluation.
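The following sketch shows how the threshold interval can be computed from the SR scores of the ID data, with both the mean/variance and the quantile variants. Handling the ratio in log space is an implementation choice for numerical stability rather than part of the SR definition.

```python
import numpy as np

def sr_threshold_interval(log_px, log_pr, lam=2.0, use_quantiles=False):
    """Compute the ID threshold interval [tau_l, tau_u] from SR scores (a sketch).

    log_px, log_pr: per-example log-likelihood estimates log p(x_i|y_i) and
    log p(r_i|y_i) from the two trained CVAEs (e.g., the importance-weighted
    estimator sketched in Section 3). The ratio is handled in log space:
    log SR_i = log p(x_i|y_i) - log p(r_i|y_i).
    """
    log_sr = np.asarray(log_px) - np.asarray(log_pr)
    if use_quantiles:
        # quantile variant: here lam must lie in (0, 0.5]
        tau_l, tau_u = np.quantile(log_sr, [0.5 - lam, 0.5 + lam])
    else:
        # mean/variance variant: tau = mu_SR +/- lam * sigma_SR
        mu, sigma = log_sr.mean(), log_sr.std()
        tau_l, tau_u = mu - lam * sigma, mu + lam * sigma
    return tau_l, tau_u
```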
Inference stage. Given a time-series signal x, our OOD detection approach works as follows:
(1)
Compute the predicted class label \(\hat{y}\) using the classifier \(F(x)\).
(2)
Create the remainder component of x with the predicted label \(\hat{y}\): r = \(x - S_{\hat{y}}\).
(3)
Compute conditional likelihoods \(p(x|\hat{y})\) and \(p(r|\hat{y})\) from trained CVAE models \(\mathcal {M}_x\) and \(\mathcal {M}_r\).
(4)
Compute the seasonal ratio score using conditional likelihoods.
\begin{equation*} SR(x, \hat{y})=\frac{p(x|y=\hat{y})}{p(r|y=\hat{y})} \end{equation*}
(5)
If the seasonal ratio score \(SR(x, \hat{y})\) does not lie within the threshold interval \([\tau _l, \tau _u]\), then classify x as OOD example. Otherwise, classify x as in-distribution example.
Algorithm 1 shows the complete pseudo-code, including the offline training stage and the online inference stage for new time-series signals. For a given time-series signal x at the inference stage, we employ the SRS algorithm to compute the seasonal ratio (SR) score. If the score is within \([\tau _l, \tau _u]\), then the time-series signal is classified as in-distribution. Otherwise, we flag it as an OOD time-series signal.
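A minimal sketch of the inference-stage check is shown below; `classifier`, `model_x`, and `model_r` are hypothetical wrappers around the trained classifier and the two CVAE log-likelihood estimators, consistent with the log-space SR score used in the training sketch above.

```python
def is_ood(x, classifier, S, model_x, model_r, tau_l, tau_u):
    """Inference-stage OOD check (cf. Algorithm 1), as a sketch.

    classifier(x) -> predicted label y_hat; S[y_hat] is the stored semantic
    pattern; model_x(x, y) and model_r(r, y) return log-likelihood estimates
    log p(.|y) from the trained CVAEs (hypothetical interfaces).
    """
    y_hat = classifier(x)
    r = x - S[y_hat]                                  # remainder w.r.t. the class pattern
    log_sr = model_x(x, y_hat) - model_r(r, y_hat)    # log of the SR score
    return not (tau_l <= log_sr <= tau_u)             # outside the ID interval => OOD
```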

4.3 Alignment Method for Improving the Accuracy of the SRS Algorithm

In this section, we first motivate the need for pre-processing raw time-series signals to improve the accuracy of SRS algorithm. Subsequently, we describe a novel time-series alignment method based on dynamic time warping to achieve this goal.
Motivation. The effectiveness of the SRS algorithm depends critically on the accuracy of the STL decomposition. The STL method employs a fixed-length window over the serialized data to estimate the recurring pattern. This is a challenge for real-world time-series signals, as they are prone to scaling, warping, and time-shifts. We illustrate in Figure 4 the challenge posed by scaling, warping, and time-shift occurrences in time-series data. The top-left figure depicts a set of time-series signals with a clear ECG pattern. Due to their misalignment, if we subtract one fixed ECG pattern from every time-series signal, then the remainder will be inaccurate. The figures in the left and right columns show the difference in the remainder components between the natural data (Left) and the aligned version of the time-series data (Right). We can clearly observe that the remainder components from the aligned data are more accurate. If input time-series data is not aligned, then it can significantly affect the estimation of \(p(r_i|y=y_i)\) and the effectiveness of SRS for OOD detection. Hence, we propose a novel alignment method using the class-wise semantic pattern of the in-distribution data \(\mathcal {D}_{in}\) during both training and inference stages.
Time-series alignment algorithm. The overall goal of our approach is to produce class-wise aligned time-series signals using the ID data \(\mathcal {D}_{in}\) so that the STL algorithm produces accurate semantic components \(S_{y}\) for each \(y \in \mathcal {Y}\). We propose to employ dynamic time warping (DTW) [32] based optimal alignment to achieve this goal. The optimal DTW alignment describes the warping between two time-series signals that makes them aligned in time. It overcomes warping and time-shift issues by allowing one-to-many matches over timesteps. There are two key steps in our alignment algorithm. First, we compute the semantic components \(S_{y}\) for each \(y \in \mathcal {Y}\) from \(\mathcal {D}_{in}\) using STL decomposition. For each in-distribution example \((x_i, y_i) \in \mathcal {D}_{in}\), we compute the optimal DTW alignment between \(S_{y_i}\) and \(x_i\). Second, we use an appropriate time-series transformation for each in-distribution example \((x_i, y_i)\) to improve the DTW alignment from the first step. Specifically, we use the timesteps of the longest one-to-many, many-to-one, or sequential one-to-one sequence match to select the Expand, Reduce, or Translate transformation, respectively, as illustrated in Figure 5. We define these three time-series transformations below.
Let \(X^1=(t^1_1, t^1_2, \ldots , t^1_T)\) and \(X^2=(t^2_1, t^2_2, \ldots , t^2_T)\) be two time-series signals of length T.
Expand\((X^1, X^2)\): We employ this transformation for a one-to-many timestep matching (\(t^1_i\) is matched with \([t^2_j, \ldots , t^2_{j+k}],\) as shown in Figure 5(a)). It duplicates the timestep \(t^1_i\) k times.
Reduce\((X^1, X^2)\): We employ this transformation in the case of a many-to-one timestep matching (\([t^1_i, \ldots , t^1_{i+k}]\) is matched with \(t^2_j\) as shown in Figure 5(b)). It replaces the timesteps \([t^1_i, \ldots , t^1_{i+k}]\) by a single averaged value.
Translate\((X^1, X^2)\): We employ this transformation in the case of a sequential one-to-one timestep matching (\([t^1_i, \ldots , t^1_{i+k}]\) is matched one-to-one with \([t^2_j, \ldots , t^2_{j+k}]\) as shown in Figure 5(c)). It translates \(X^1\) to ensure that \(t^1_i=t^2_j\).
We illustrate in Figure 6 two examples of transformation choices for a time-series signal x aligned with a pattern S. The alignment on the left shows that the longest consecutive matching sequence is one-to-many (\(x_4\) is matched with \([S_2, \ldots , S_7]\)), calling for the Expand transformation, while the alignment on the right shows that the longest consecutive matching sequence is a sequential one-to-one (\([x_4, \ldots ,x_8]\) is matched with \([S_3, \ldots , S_7]\)), calling for the Translate transformation.
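The following is a minimal sketch of the DTW alignment path computation and of how the longest consecutive match can be classified to select one of the three transformations; the squared-error local cost and the helper names are illustrative assumptions.

```python
import numpy as np

def dtw_path(a, b):
    """Optimal DTW alignment path between two 1-D signals a and b (a sketch)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path as a list of matched index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def longest_match_type(path):
    """Classify the longest consecutive match along the path as 'expand'
    (one-to-many), 'reduce' (many-to-one), or 'translate' (sequential
    one-to-one), mirroring the transformation choice described above."""
    runs = {"expand": 0, "reduce": 0, "translate": 0}
    prev_kind, run = None, 0
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        kind = ("translate" if (i1 - i0, j1 - j0) == (1, 1)
                else "expand" if i1 == i0 else "reduce")
        run = run + 1 if kind == prev_kind else 1
        runs[kind] = max(runs[kind], run)
        prev_kind = kind
    return max(runs, key=runs.get)
```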

5 Related Work

OOD detection via pre-trained models. Employing pre-trained deep neural networks (DNNs) to detect OOD examples was justified by the observation that DNNs with ReLU activation can produce arbitrarily high softmax confidence for OOD examples [12]. Maximum probability over class labels has been used [12] to improve the OOD detection accuracy. Building on the success of this method, temperature scaling and controlled perturbations were used [28] to further increase the performance. The Mahalanobis-based scoring method [27] identifies OOD examples using class-conditional Gaussian distributions. Gram matrices [38] were used to detect OOD examples based on the features learned from the training data. The effectiveness of these prior methods depends critically on the availability of a highly accurate DNN for the classification task. However, this requirement is challenging for the time-series domain, as real-world datasets are typically small and exhibit high class imbalance, resulting in inaccurate DNNs [19, 42].
OOD detection via synthetic data. During the training phase, it is impossible to anticipate the OOD examples that would be encountered during the deployment of DNNs [18]. Hence, unsupervised methods [48] are employed, or synthetic data based on generative models is created [26, 29] to explicitly regularize the DNN weights over potential OOD examples. It is much more challenging to create synthetic data for the time-series domain due to the limited data and the difficulty of validating such data by human experts.
OOD detection via deep generative models. The overall idea of using deep generative models (DGMs) for OOD detection is as follows: (1) DGMs are trained to directly estimate the in-distribution \(P^*\); and (2) The learned DGM identifies OOD samples when they are found lying in a low-density region. Prior work has used auto-regressive generative models [36] or GANs [41] and proposed scoring metrics such as likelihood estimates to obtain good OOD detectors. DGMs are shown to be effective in evaluating the likelihood of input data and estimating the data distribution, which makes them a good candidate to identify OOD examples with high accuracy. However, as shown by Reference [33], DGMs can assign a high likelihood to OOD examples. Likelihood ratio [36] and likelihood regret [43] are proposed to improve OOD detection. While likelihood regret method can generalize to different types of data, likelihood ratio is limited to categorical data distributions with the assumption that the data contains background units (background pixels for images and background sequences for genomes). Likelihood ratio cannot be applied to the time-series domain for two reasons: (1) We need to deal with continuous distributions; and (2) We cannot assume that timesteps (information unit) can be independently classified as background or semantic content.
OOD detection via time-series anomaly detection. Generic Anomaly Detection (AD) algorithms [13, 34, 37] can be employed to solve OOD problems for time-series data. Anomaly detection is the task of identifying observed points or examples that deviate significantly from the rest of the data. Anomaly detection relies on different approaches, such as distance-based metrics or density-based approaches, to quantify the dissimilarities between any example and the rest of the data. Current methods using DNNs (e.g., Generative Adversarial Networks, auto-encoders) show higher performance in anomaly detection, as they can capture more complex features in high-dimensional spaces [34, 35]. We note that there exist some AD methods that can cover the same setting as the OOD problem for the time-series domain. However, both settings are still considered two different frameworks with two different goals [45]. By definition, AD aims to detect and flag anomalous samples that deviate from a pre-defined normality [7, 25] estimated during training. Under the AD assumption of normality, such samples only originate from a covariate shift in the data distribution [37]. Semantically, such samples do not classify as OOD samples [45]. For example, consider an intelligent system trained to identify the movement of a person (e.g., run, stand, walk, swim), where stumbling may occur during running. Such an event would be classified as an anomaly, as the activity running is still taking place, but in an irregular manner. However, if the runner slips and falls, then such activity should be flagged as OOD due to the fact that it does not belong to any of the pre-defined activity classes. In other words, OOD samples must originate from a different class distribution (\(y_{\text{OOD}} \notin \mathcal {Y}\)) than in-distribution examples, while anomalies typically originate from the same underlying distribution but with anomalous behavior. Open-set recognition methods can be applicable to this setting [50, 51], as they have been shown to be effective in detecting unknown categories without prior knowledge. However, OOD detection encompasses a broader spectrum of the solution space and does not require the complexity of identifying the semantic class of the anomalies. Additionally, anomalies can manifest as a single timestep or a window of varying length, but generally not as a complete time-series example in itself. Such differences can be critical for users and practitioners, which necessitates the study of separate algorithms for AD and OOD. Unlike anomaly detection, OOD detection focuses on identifying test samples with non-overlapping labels with in-distribution data and can generalize to the multi-class setting [45]. The main limitations of time-series AD algorithms [9] for OOD detection tasks are
OOD samples cannot be used as labeled anomalous examples during training due to the general definition of the OOD space. For various AD methods, such as nearest-neighbor and distance-based methods, the fine-tuning of the cut-off threshold between “normal” and “anomalous” examples requires anomaly labels during training. In particular, window-based techniques [10] require both normal and anomalous sequences during training, and if there are none, then anomalous examples are randomly generated. Such a requirement is not practical for OOD problem settings, as the OOD distribution is ambiguous to define and sample from.
AD assumes that normal samples are homogeneous in their observations. This assumption helps the AD algorithm to detect anomalies. Such an assumption cannot hold for different classes of the in-distribution space for multi-class settings. Therefore, time-series AD algorithms are prone to fail at detecting OOD samples. Indeed, our experiments demonstrate the failure of state-of-the-art time-series AD methods.

6 Experiments and Results

In this section, we present experimental results comparing the proposed SRS algorithm and prior methods on diverse real-world time-series datasets.

6.1 Experimental Setup

Datasets. We employ the multivariate benchmarks from the UCR time-series repository [14]. Due to space constraints, we present the results on representative datasets from six different pre-defined domains: Motion, ECG, HAR, EEG, Audio, and Other. The list of datasets includes Articulary Word Recognition (AWR), Stand Walk Jump (SWJ), Cricket (Ckt), Hand Movement Direction (HMD), Heartbeat (Hbt), and ERing (ERg). We employ the standard training/validation/testing splits from these benchmarks.
OOD experimental setting. Prior work formalized the OOD experimental setting for different domains such as computer vision [12]. However, there is no OOD setting for the time-series domain. In what follows, we explain the challenges for the time-series domain and propose a concrete OOD experimental setting for it.
The first challenge with the time-series domain is the dimensionality of signals. Let the ID space be \(\mathbb {R}^{n_i\times T_i}\) and the OOD space be \(\mathbb {R}^{n_o\times T_o}\). Since we train CVAEs on the ID space, \({n_o\times T_o}\) needs to match \({n_i\times T_i}\). Hence, if \(n_o\gt n_i\) or \(T_o\gt T_i\), then we window-clip the respective OOD dimension to have \(n^{\prime }_o=n_i\) or \(T^{\prime }_o=T_i\). If \(n_o\lt n_i\) or \(T_o\lt T_i\), then we zero-pad the respective OOD dimension to have \(n^{\prime }_o=n_i\) or \(T^{\prime }_o=T_i\). Zero-padding is based on the assumption that the additional dimension exists but takes null values. The second challenge is in defining OOD examples. Since the number of datasets in the UCR repository is large, conducting experiments on all combinations of datasets as ID and OOD is impractical and repetitive (600 distinct configurations for the 25 different datasets considered in this article).
Hence, we propose two settings using the notion of domains.
In-domain OOD: Both ID and OOD datasets belong to the same domain. This setting helps in understanding the behavior of OOD detectors when real-world OOD examples come from the same application domain. For example, a detector of Epileptic time-series signals should consider signals resulting from sports activity (Cricket) as OOD.
Cross-domain OOD: Both ID and OOD datasets come from two different domains. This configuration is more intuitive for OOD detectors, where time-series signals from different application domains should not confuse the ML model (e.g., Motion and HAR data).
Our intuition is that the in-domain OOD setting is more likely to occur during real-world deployment. Hence, we propose to do separate experiments by treating every dataset from the same domain as OOD. For the cross-domain OOD setting, we believe that a single representative dataset from the domain can be used as OOD. In this work, we focus on real-world OOD detection for the time-series domain. Since random noise does not inherit the characteristics of time-series data, methods from the computer vision literature have a good potential in detecting random noise.
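The dimension-matching preprocessing described above can be sketched as follows; the function name and the zero-padding layout are illustrative assumptions.

```python
import numpy as np

def match_dims(x_ood, n_id, T_id):
    """Make an OOD example x_ood (shape (n_o, T_o)) match the ID shape (n_id, T_id)
    by window-clipping dimensions that are too large and zero-padding those
    that are too small, as described in the OOD experimental setting above."""
    n_o, T_o = x_ood.shape
    x = x_ood[:min(n_o, n_id), :min(T_o, T_id)]          # clip oversized dimensions
    pad_n, pad_T = n_id - x.shape[0], T_id - x.shape[1]
    return np.pad(x, ((0, pad_n), (0, pad_T)))           # zero-pad undersized dimensions
```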
For improved readability and ease of understanding, we provide Table 2 and Table 3 to explain the domain labels and dataset labels used in the experimental section of our article along with the corresponding UCR domain name and dataset name.
Table 2 shows the label used to represent a given domain for Cross-domain OOD setting.
Table 3 shows the label used to represent the dataset used as an OOD source against a given ID dataset for the In-domain OOD setting. For example, while reading Table 8, when AWR is the ID distribution, then according to Table 3, DS1 represents the CharacterT. dataset. However, if HMD is the ID distribution, then DS1 represents the FingerM. dataset.
Evaluation metrics. We employ the following two standard metrics in our experimental evaluation. (1) AUROC score: The area under the receiver operating characteristic curve is a threshold-independent metric. This metric (higher is better) is equal to 1.0 for a perfect detector and 0.5 for a random detector. (2) F1 score: It is the harmonic mean of precision and recall. Due to the threshold dependence of the F1 score, we report the highest F1 score obtained by varying the threshold. This score has a maximum of 1.0 in the case of perfect precision and recall.
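A minimal sketch of how these two metrics can be computed with scikit-learn is given below, assuming a scalar OOD score per example where a higher value indicates a more OOD-like input (e.g., the deviation of the SR score from \(\mu _{SR}\)); the helper name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

def ood_metrics(scores_id, scores_ood):
    """AUROC and best F1 over all thresholds for an OOD detector (a sketch)."""
    y_true = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
    y_score = np.concatenate([scores_id, scores_ood])
    auroc = roc_auc_score(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    best_f1 = np.max(2 * prec * rec / np.clip(prec + rec, 1e-12, None))
    return auroc, best_f1
```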
Configuration of algorithms. We employ a 1D-CNN architecture for the CVAE models required for the SR scoring method. We consider a naive baseline where the CVAE is trained on the ID data and the likelihood (LL) is used to detect OOD samples. We also consider a variant of SR scoring (SR\(_a\)) that works on the aligned time-series data using the method explained in Section 4.3. We evaluate both SR and SR\(_a\) against state-of-the-art baselines and employ their publicly available code: Out-of-Distribution Images in Neural networks (ODIN) [28] and Gram Matrices (GM) [38], which have been shown to outperform most of the existing baselines; the recently proposed Likelihood Regret (LR) score [43]; and an adaptation of a very recent time-series AD method, referred to as deep generative model with Hierarchical Latent (HL) space for time-series windows [9], that does not require labeled anomalies for training purposes. We chose HL as the main baseline to represent time-series AD under the OOD setting, as it is the state-of-the-art time-series AD algorithm. HL for time-series was shown [9] to outperform nearest-neighbor based methods, LSTM-based methods, and other methods [5, 6] in various AD settings.
Choice of architecture: We experimented with three different types of CVAE architecture to decide on the most suitable one for our OOD experiments. We evaluated (1) fully connected, (2) convolutional, and (3) LSTM-based architectures using the reconstruction error as the performance metric. We observed that fully connected networks generally suffer from poor reconstruction performance, especially on high-dimensional data. We also observed that the LSTM's runtime during training and inference is relatively longer than that of the other architectures. In contrast, CNN-based CVAEs delivered both good reconstruction performance and fast runtime.
1D-CNN CVAE details: To evaluate the effectiveness of the proposed SR score, we employed a CVAE that is based on 1D-CNN layers. The encoder of the CVAE is composed of (1) a min-max normalization layer, (2) a series of 1D-CNN layers, and (3) a fully connected layer. At the end of the encoder, the parameters \(\mu _{\text{CVAE}}\) and \(\sigma _{\text{CVAE}}\) are computed to estimate the posterior distribution. A random sample is then generated from this distribution and passed on to the CVAE decoder along with the class label. The decoder of the CVAE is composed of (1) a fully connected layer, (2) a series of transposed convolutional layers, and (3) a denormalization layer.
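A sketch of a 1D-CNN CVAE of this form is shown below (in PyTorch); the layer sizes, kernel widths, and latent dimension are illustrative, and the min-max normalization/denormalization layers are omitted for brevity.

```python
import torch
import torch.nn as nn

class CVAE1D(nn.Module):
    """Illustrative 1D-CNN CVAE: the class label is injected as extra one-hot
    input channels to the encoder and concatenated to the latent code for the
    decoder; not the exact configuration used in our experiments."""
    def __init__(self, n_channels, T, n_classes, latent_dim=16):
        super().__init__()
        self.n_classes, self.T = n_classes, T
        self.enc = nn.Sequential(
            nn.Conv1d(n_channels + n_classes, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        enc_out = 64 * ((T + 3) // 4)            # length after two stride-2 convs
        self.fc_mu = nn.Linear(enc_out, latent_dim)
        self.fc_logvar = nn.Linear(enc_out, latent_dim)
        self.fc_dec = nn.Linear(latent_dim + n_classes, enc_out)
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(64, 32, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(32, n_channels, kernel_size=5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x, y):
        # x: (batch, n_channels, T), y: (batch,) integer class labels
        y_map = nn.functional.one_hot(y, self.n_classes).float()
        y_ch = y_map.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        h = self.enc(torch.cat([x, y_ch], dim=1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        h_dec = self.fc_dec(torch.cat([z, y_map], dim=1)).view(x.shape[0], 64, -1)
        x_hat = self.dec(h_dec)[..., :self.T]                     # crop to original length
        return x_hat, mu, logvar
```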
CVAE Training: We use the standard training, validation, and testing split on the benchmark datasets to train both CVAEs \(\mathcal {M}_x\) and \(\mathcal {M}_r\). Both CVAEs are trained to maximize the ELBO on the conditional log-likelihood defined in Section 3 using the Adam optimizer with a learning rate of \(10^{-4}\). We employ a maximum number of training iterations equal to 500. To ensure the reliability of the performance of the proposed CVAEs, we report in Table 3 the test reconstruction error of the trained CVAE on ID data using the Mean Absolute Error (MAE). We clearly observe that the proposed CVAE learns the ID space well, as the reconstruction error is relatively low. To compute the semantic patterns and remainders of in-distribution examples for training \(\mathcal {M}_x\) and \(\mathcal {M}_r\), we use the STLdecompose Python package.
Implementation of the baselines: The baseline methods ODIN, GM, HL, and LR were implemented using their respective publicly available code with the recommended settings. To employ ODIN and GM, we trained two different DNN models, a 1D-CNN and an LSTM, for classification tasks with different settings. We report the average performance of the baseline OOD detectors in our experimental setting. To repurpose the HL method from the AD setting to the OOD setting, we serialized the training data and used it to train the generator. For OOD detection at inference time, we serialize both the test ID data and the OOD data and shuffle them. By setting the window size equal to the timestep dimension of the original in-distribution inputs, we execute the HL anomaly detection algorithm and report every detected anomaly as an OOD sample. We employed the default parameters of the generator. As recommended by the authors, we use a hierarchical level equal to 4 and 500 iterations for training and inference. We lower the learning rate to \(10^{-6}\) to prevent the exploding gradients that occurred with the default value of \(10^{-3}\). For a fair comparison, the VAE for Likelihood Regret (LR) has the same architecture as the CVAE used to estimate the SR and the naive LL score.

6.2 Results and Discussion

Reconstruction error of DGMs. Table 4 shows the test reconstruction error (MAE) of the trained CVAEs on ID data along with analogous results on OOD data. We clearly observe that the CVAE models are able to learn the ID space, as the reconstruction error on ID data is relatively low. We also observe that the DGMs perform well on OOD samples regardless of the different semantics of ID and OOD data: the pre-trained CVAEs performed well on the OOD AWR dataset with a reconstruction error \(\le 0.1\), and for the OOD FingerMovement (Fmv) dataset, only two out of the six CVAEs exhibited the intuitively expected high reconstruction error.
OOD detection via pre-trained classifier and DGMs. Our first hypothesis is that pre-trained DNN classifiers are not well-suited for OOD detection. To test this hypothesis, we train two DNN models: a 1D-CNN and an RNN classifier. We use these models for OOD detection using the ODIN and GM baselines. Table 5 shows that the AUROC is low on all datasets. For datasets such as HMD and SWJ, the AUROC score does not exceed 0.6 for any experimental setting. The accuracy of DNNs for time-series classification is not as high as that for the image domain for the reasons explained earlier. Hence, we believe that this uncertainty of DNNs causes the baselines ODIN and GM to fail in OOD detection. Our second hypothesis is that DGMs assign a high likelihood to OOD samples as well as ID samples for time-series data. While the results in Table 4 corroborate this hypothesis, we also report the performance of a pre-trained CVAE likelihood (LL) for OOD detection in Table 8. We observe that the AUROC score of LL does not outperform any of the other baselines. Hence, a new scoring method is necessary for CVAE-based OOD detection.
Random Noise as OOD. An existing experimental setting for OOD detection tasks is to detect random noise. For this setting, we generate random noise as an input sampled from a Gaussian distribution or a Uniform distribution. Table 6 shows that the LR baseline has excellent performance in detecting random noise as OOD examples. This is explainable, as random noise does not necessarily obey time-series characteristics. Hence, the existing baselines can perform strongly on such OOD examples. In contrast, we motivate our seasonal ratio scoring approach for OOD detection based on real-world examples. As shown above, existing baselines have poor performance in detecting real-world OOD examples, whereas SR has significantly better performance.
Results for the SR score. The effectiveness of the SR score depends on the validity of Assumption 1. Table 7 shows both the MAE and DTW measures between the semantic pattern \(S_y\) extracted by STL and different time-series examples of the same class \(y\). We observe that the average difference measure is low. These results demonstrate that Assumption 1 holds empirically. For qualitative results, Figure 7 contrasts the score separability of SR with that of LR, illustrating that SRS provides significantly better OOD separability.
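The check behind Table 7 can be sketched as follows, assuming the class-wise semantic pattern \(S_y\) has already been extracted by STL; the DTW implementation is a textbook dynamic-programming version, not the exact code used in our experiments.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance for
    univariate sequences with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def class_pattern_agreement(S_y, class_examples):
    """Average MAE and DTW distance between a class-wise semantic pattern
    S_y and the examples of that class (univariate, equal length for MAE)."""
    mae = np.mean([np.mean(np.abs(x - S_y)) for x in class_examples])
    dtw = np.mean([dtw_distance(x, S_y) for x in class_examples])
    return mae, dtw
```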
SR score vs. baselines. Table 8 shows the OOD detection results for SRS and the baseline methods. For a fair comparison, we use the same architecture for the VAEs computing the LL, LR, and SR scores. We make the following observations: (1) The naive LL method fails to outperform any other approach, which demonstrates that DGMs are not reliable on their own, as they produce high likelihood for OOD samples. (2) The time-series anomaly detection method HL fails drastically in various OOD settings, as reflected by the poor AUROC score of 0.5; this demonstrates that AD methods are not appropriate for OOD detection in the multi-class setting. (3) The SR score outperforms the LR score in identifying OOD examples in 80% of the total experiments, which shows that the improvement is due to a better scoring function. (4) For the in-domain OOD setting, the AUROC score of LR is always lower than that of SRS. For the cross-domain setting, SRS outperforms LR in all cases except one experiment on the SWJ dataset. Finally, LR and SRS have the same performance in 20% of the total experiments. Therefore, we conclude that SRS is better than LR in terms of both OOD performance and execution time (LR requires new training for every single test input, unlike SRS).
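For intuition, the sketch below shows one plausible realization of SR-based detection, assuming the remainder is obtained by subtracting the class-wise semantic pattern and that the score is the ratio of the two class-conditional likelihood estimates. Here `log_px_given_y` and `log_pr_given_y` are hypothetical callables backed by the two CVAEs; the exact score definition is given in the method section.

```python
import numpy as np

def seasonal_ratio_score(log_px_given_y, log_pr_given_y, x, S_y):
    """Minimal sketch of an SR-style score: ratio of the class-conditional
    likelihood of the input to that of its remainder, where the remainder
    is assumed to be the input minus the class-wise semantic pattern."""
    r = x - S_y                       # remainder component (assumption)
    log_ratio = log_px_given_y(x) - log_pr_given_y(r)
    return np.exp(log_ratio)          # seasonal ratio estimate

def detect_ood(score, tau_low, tau_high):
    """Flag an example as OOD when its SR score falls outside the
    threshold interval [tau_low, tau_high] estimated on ID data."""
    return not (tau_low <= score <= tau_high)
```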
Alignment improves the accuracy of the SR score. Our hypothesis is that extracting a more accurate semantic component using STL results in improved OOD detection accuracy. To test this hypothesis, we compare SR and SR\(_a\) (SR with aligned time-series data). Table 8 shows the AUROC scores of SR and SR\(_a\). SR\(_a\) improves the performance of SR in around 50% of the overall experiments. For example, on the HMD dataset, SR\(_a\) improves the performance of SR by an average of 15% under the in-domain OOD setting. These results corroborate our hypothesis that alignment improves OOD detection performance.
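One plausible way to realize the alignment used by SR\(_a\) is to warp each series onto a class reference along the DTW optimal path, as sketched below; this is an illustrative assumption rather than the exact alignment procedure used in our experiments.

```python
import numpy as np

def dtw_align_to_reference(x, ref):
    """Hypothetical alignment step: warp a univariate series `x` onto a
    reference `ref` along the DTW optimal path, averaging the values of
    `x` mapped to each reference timestep."""
    n, m = len(x), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Backtrack the optimal warping path from (n, m) to (1, 1).
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1

    # Average the x-values mapped to each reference index.
    aligned = np.zeros(m)
    counts = np.zeros(m)
    for xi, rj in path:
        aligned[rj] += x[xi]
        counts[rj] += 1
    counts[counts == 0] = 1
    return aligned / counts
```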
SR performance using the F1 score. In addition to the AUROC score, we use the F1 score to assess the effectiveness of the SR score in detecting OOD examples. Table 9 compares the SR score and the LR score. Consistent with the AUROC evaluation, we make similar observations for the F1 score: (1) The SR score outperforms the LR score in identifying OOD examples in 60% of the total experiments, which again indicates that the improvement is due to a better scoring function. (2) For the in-domain OOD setting, the F1 score of LR is mostly lower than that of SR. (3) For the cross-domain setting, SR outperforms LR in 66% of the cases. Hence, we conclude that SR is better than LR in terms of OOD performance measured by the F1 metric.
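For reference, the F1 values in Table 9 can be computed by thresholding the scores with the ID-estimated interval and comparing against ID/OOD labels; the sketch below uses scikit-learn with illustrative stand-in scores and thresholds.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative SR scores and a threshold interval estimated on ID data.
scores = np.array([0.4, 0.9, 1.1, 2.3, 0.2, 1.0])
labels = np.array([1, 0, 0, 1, 1, 0])      # 1 = OOD, 0 = ID
tau_low, tau_high = 0.7, 1.5

# An example is predicted OOD when its score falls outside the interval.
preds = ((scores < tau_low) | (scores > tau_high)).astype(int)
print("F1:", f1_score(labels, preds))
```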
Table 9. F1 Metric Results of LR and SR\(_a\) on Different Datasets for Both In-domain and Cross-domain Settings
AUROC performance of SR scoring on the full multivariate UCR archive. For completeness, Table 10 reports the performance of SR on all the UCR multivariate datasets in terms of the AUROC score. These results demonstrate that the proposed SR scoring approach is general and effective across all evaluated time-series datasets.
Table 10. AUROC Results for LR and SR\(_a\) on Different Datasets for Both In-domain and Cross-domain OOD Settings
Inference runtime comparison of the different OOD detection algorithms. Tables 11 and 12 compare the number of parameters and the runtime of the different OOD detection methods for time series. Intuitively, both the HL and SR methods have more parameters than LL and LR, as the latter two rely on a single VAE model to compute the OOD score. However, LR has the longest score-computation runtime because it performs new training iterations to compute the OOD score of each example. In contrast, the SR algorithm runs only a single inference pass per example and then computes the ratio between the two estimated likelihoods. This makes SR a fast and accurate OOD detector.
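The runtime comparison in Table 12 reduces to timing the per-example scoring procedure of each method; a minimal timing harness is sketched below, where `score_fn` stands in for any of the OOD scoring functions (e.g., SR's single inference pass or LR's per-example retraining).

```python
import time

def time_per_example(score_fn, examples):
    """Average wall-clock time (seconds) to score one example with a
    given OOD scoring function."""
    start = time.perf_counter()
    for x in examples:
        score_fn(x)
    return (time.perf_counter() - start) / max(len(examples), 1)
```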
Table 11. Number of Parameters of Each DNN Used by the Different OOD Methods
Table 12. OOD Inference Runtime Comparison on the Different Datasets Using Different OOD Methods

7 Summary and Future Work

We introduced a novel seasonal ratio (SR) score to detect out-of-distribution (OOD) examples in the time-series domain. SR scoring relies on Seasonal and Trend decomposition using Loess (STL) to extract class-wise semantic patterns and remainders from time-series signals, and on deep generative models to estimate class-wise conditional likelihoods for both the input time series and its remainder. The SR score of a given time-series signal, together with the threshold interval estimated from the in-distribution data, enables OOD detection. Our experimental results demonstrate the effectiveness of SR scoring and the alignment method in detecting time-series OOD examples compared with prior methods. Immediate future work includes applying SR score-based OOD detection to support the generation of synthetic time-series data in small-data settings.

8 Acknowledgments

The authors would like to thank Alan Fern for the useful discussions regarding the key assumption behind the seasonal ratio scoring approach.

