
Out-of-distribution Detection in Time-series Domain: A Novel Seasonal Ratio Scoring Approach

Published: 19 December 2023

Abstract

Safe deployment of time-series classifiers for real-world applications relies on the ability to detect data that is not generated from the same distribution as the training data. This task is referred to as out-of-distribution (OOD) detection. We consider the novel problem of OOD detection for the time-series domain. We discuss the unique challenges posed by time-series data and explain why prior methods from the image domain will perform poorly. Motivated by these challenges, this article proposes a novel Seasonal Ratio Scoring (SRS) approach. SRS consists of three key algorithmic steps. First, each input is decomposed into a class-wise semantic component and a remainder. Second, this decomposition is employed to estimate the class-wise conditional likelihoods of the input and remainder using deep generative models. The seasonal ratio score is computed from these estimates. Third, a threshold interval is identified from the in-distribution data to detect OOD examples. Experiments on diverse real-world benchmarks demonstrate that the SRS method is well-suited for time-series OOD detection when compared to baseline methods.

1 Introduction

Time-series data analytics using deep neural networks [22] enable many real-world applications in health-care [40], finance [46], smart grids [49], energy management [21, 44], and intrusion detection [24]. However, safe and reliable deployment of such machine learning (ML) systems requires robust models [1, 2, 3, 4, 20], uncertainty quantification [16, 17], and the ability to detect time-series data that do not follow the distribution of the training data, a.k.a. in-distribution (ID) data. For example, a circumstantial event for an epilepsy patient or a sudden surge in a smart grid branch results in sensor readings that deviate from the training distribution. This task is referred to as out-of-distribution (OOD) detection. If the ML model encounters OOD inputs, then it can output wrong predictions with high confidence. Another important application of OOD detection for the time-series domain is synthetic data generation. Many time-series applications suffer from limited or imbalanced data, which motivates methods to generate synthetic data [39]. A key challenge is to automatically assess the validity of synthetic data, which can be alleviated using accurate OOD detectors.
There is a growing body of work on OOD detection for the image domain [8, 12, 28, 30, 43, 47] and other types of data such as genomic sequences [36]. These methods can be categorized into
Supervised methods that fine-tune the ML system or perform specific training to distinguish examples from ID and OOD.
Unsupervised methods that employ Deep Generative Models (DGMs) on unlabeled data to perform OOD detection.
However, time-series data, with its unique characteristics (e.g., sparse peaks, fast oscillations), poses challenges that are not encountered in the image domain:
Spatial relations between pixels are not similar to the temporal relations across different timesteps of time-series signals.
Pixel variables follow a categorical distribution of values \(\lbrace 0,1,\ldots ,255\rbrace ,\) whereas time-series variables follow a continuous distribution.
The semantics of images (e.g., background, edges) do not apply to time-series data.
Humans can identify OOD images for fine-tuning purposes, but this task is challenging for time-series data. Hence, prior OOD methods are not suitable for the time-series domain.
This article proposes a novel OOD detection algorithm for the time-series domain referred to as Seasonal Ratio Scoring (SRS). To the best of our knowledge, this is the first work on OOD detection over time-series data. SRS employs the Seasonal and Trend decomposition using Loess (STL) [11] on time-series signals from the ID data to create a class-wise semantic pattern and a remainder component for each signal. For example, in a human activity recognition system, SRS would extract a pattern “running” that semantically describes all the recorded “running” windows. If the person trips and falls, then SRS would detect that this event does not belong to the pre-defined activity classes and flag it as OOD. For this purpose, we train two separate DGMs to estimate the class-wise conditional likelihood of a given time-series signal and its STL-based remainder component. The Seasonal Ratio (SR) score for each time-series signal from the ID data is computed from these two estimates. A threshold interval is estimated from the statistics of all these scores over the ID data. Given a new time-series input and a classifier at testing time, the SRS approach computes the SR score for the predicted output and flags the time-series signal as an OOD example if the score lies outside the threshold interval. Figure 1 illustrates the SRS algorithm. The effectiveness of SRS critically depends on the extraction of accurate class-wise semantic components. Since time-series data is prone to warping and time-shifts, we also propose a new alignment approach based on Dynamic Time Warping (DTW) [32] to improve the output accuracy of the STL decomposition. Our experiments on diverse real-world time-series datasets demonstrate that the SRS method is well-suited for time-series OOD detection when compared to prior methods.
Fig. 1.
Fig. 1. Overview of the seasonal ratio (SR) scoring algorithm. The semantic component \(S_{\hat{y}}\) for the predicted output \(\hat{y}\) is obtained from the training stage via Seasonal and Trend decomposition using Loess (STL). The semantic component \(S_{\hat{y}}\) is subtracted from the time-series x to obtain the remainder r. The trained CVAE models \(\mathcal {M}_x\) and \(\mathcal {M}_r\) are used to compute the SR score. If the SR score is within the threshold interval \([\tau _l, \tau _u]\) identified during training, then x is classified as ID. Otherwise, it is flagged as OOD.
Contributions. The main contribution of this article is the development and evaluation of the SRS algorithm for OOD detection in the time-series domain. Specific contributions include:
(1)
Principled algorithm based on STL decomposition and deep generative models to compute the SR score to detect OOD time-series examples.
(2)
Novel time-series alignment algorithm based on dynamic time warping to improve the effectiveness of the SR score-based OOD detection.
(3)
Formulation of the experimental setting for time-series OOD detection. Experimental evaluation of SRS algorithm on real-world datasets and comparison with state-of-the-art baselines.
(4)
Open-source code and data for the SRS method are available at https://github.com/tahabelkhouja/SRS

2 Problem Setup

Suppose \(\mathcal {D}_{in}\) is an in-distribution (ID) time-series dataset with d examples \(\lbrace (x_i, y_i)\rbrace\) sampled from the distribution \(P^*\) defined on the joint space of input-output pairs \((\mathcal {X}, \mathcal {Y})\). Each \(x_i \in \mathbb {R}^{n \times T}\) from \(\mathcal {X}\) is a multi-variate time-series input, where n is the number of channels and T is the window-size of the signal. \(y_i \in \mathcal {Y}=\lbrace 1,\ldots ,C\rbrace\) represents the class label for time-series signal \(x_i\). We consider a time-series classifier \(F: \mathbb {R}^{n \times T} \rightarrow \lbrace 1, \ldots , C\rbrace\) learned using \(\mathcal {D}_{in}\). For example, in a health monitoring application using physiological sensors for patients diagnosed with cardiac arrhythmia, we use the measurements from wearable devices to predict the likelihood of cardiac failure.
OOD samples \((x,y)\) are typically generated from a distribution other than \(P^*\). Specifically, we consider a sample \((x,y)\) to be OOD if the class label y is different from the set of in-distribution class labels, i.e., \(y \notin \mathcal {Y}\). The classifier \(F_{\theta }(x)\) learned using \(\mathcal {D}_{in}\) will assign one of the C class labels from \(\mathcal {Y}\) when encountering an OOD sample \((x,y)\). Our goal is to detect such OOD examples for safe and reliable real-world deployment of time-series classifiers. We provide a summary of the mathematical notations used in this article in Table 1.
Table 1.
Variable | Definition
\(\mathcal {D}_{in}\) | Dataset of in-distribution time-series signals
\(P^*\) | True distribution of the time-series dataset
\(x_i\) | Input time-series signal
\(\mathcal {Y}\) | Set of output class labels \(y \in \lbrace 1,\ldots ,C\rbrace\)
\(F_{\theta }\) | Classifier that maps an input \(x_i\in \mathbb {R}^{n \times T}\) to a class label \(y\in \mathcal {Y}\)
\(S_y\) | Semantic pattern of a class y according to STL decomposition
SR | Seasonal Ratio score
SRS | Seasonal Ratio Scoring framework
Table 1. Mathematical Notations Used in This Article
Challenges of time-series data. The unique characteristics of time-series data (e.g., temporal relations across timesteps, fast oscillations, continuous distribution of variables) pose challenges not seen in the image domain. Real-world time-series datasets are typically small (relative to image datasets) and exhibit high class-imbalance [14]. Therefore, estimating a good approximation of the in-distribution \(P^*\) is hard, which results in the failure of prior OOD methods. Indeed, our experiments demonstrate that prior OOD methods are not suited for the time-series domain. As a prototypical example, Figure 7 shows the limitation of the Likelihood Regret score [43] in identifying OOD examples: the ID and OOD scores of real-world time-series examples overlap.
Fig. 2.
Fig. 2. Illustration of STL method for two different classes from the ERing dataset. Dotted signals are natural time-series signals x and the red signal is the semantic pattern \(S_y\).
Fig. 3.
Fig. 3. Histogram showing the ID and OOD scores along the seasonal ratio score axis. The seasonal ratio scores for OOD examples can be either greater or less than the seasonal ratio scores for ID examples.
Fig. 4.
Fig. 4. Illustration of the challenges in time-series data for STL decomposition: semantic component and remainder. (Left column) Set of natural time-series signals with an ECG wave as semantic component \(S_y\) and the corresponding remainders w.r.t. \(S_y\). (Right column) Time-series signals and remainders from STL decomposition after applying the alignment procedure.
Fig. 5.
Fig. 5. Illustration of the use of appropriate transformation to adjust the alignment between two time-series signals \(X^1\) (blue signal) and \(X^2\) (green signal).
Fig. 6.
Fig. 6. Illustration of two transformation choices for a time-series x aligned with a pattern S. (Left) One-to-many as the longest match, calling for the Expand transformation. (Right) Sequential one-to-one as the longest match, calling for the Translate transformation.
Fig. 7.
Fig. 7. Histogram showing the non-separability of ID and OOD LR scores (Top row) and the separability using the seasonal ratio method on real-world time-series data (Bottom row).

3 Background and Preliminaries

In this section, we provide the necessary background on conditional VAE and STL decomposition to better understand the proposed SR score-based OOD detection approach.
Conditional VAE. Variational Auto-Encoders (VAEs) are a class of likelihood-based generative models with many real-world applications [15]. They rely on the encoding of a raw input x as a latent Gaussian variable z to estimate the likelihood of x. The latent variable z is used to compute the likelihood of the training data: \(p_{\theta }(x)=\int p_{\theta }(x|z)p(z)dz\). Since the direct computation of this likelihood is impractical, the principle of the evidence lower bound (ELBO) [43] is employed. In this work, we consider the ID data \(\mathcal {D}_{in}\) as d input-output samples of the form \((x_i, y_i)\). We want to estimate the ID using both \(x_i\) and \(y_i\). Therefore, we propose to use a conditional VAE (CVAE) for this purpose. CVAEs rely on the encoding of a raw input \((x,y)\) as a latent Gaussian variable z to estimate the conditional likelihood of x given the class label y. A CVAE is similar to a VAE, with the key difference being the use of conditional probability over both \(x_i\) and \(y_i\). The ELBO objective of the CVAE is:
\begin{equation*} \mathcal {L}_{\text{ELBO}}\overset{\Delta }{=} \mathbb {E}_{\phi }\big [\log p_{\theta }(x|z,y)\big ] - D_{\text{KL}}\big [q_{\phi }(z|x,y)||p(z|y)\big ], \end{equation*}
where \(q_{\phi }(z|x,y)\) is the variational approximation of the true posterior distribution \(p_{\theta }(z|x,y)\). Since the CVAE only provides a lower bound on the log-likelihood of a given input, the exact log-likelihood is estimated using Monte Carlo sampling, as shown below:
\begin{equation} \mathcal {L}_M=\mathbb {E}_{z^m \sim q_{\phi }(z|x,y)}\bigg [\log \dfrac{1}{M} \sum _{m=1}^M \dfrac{p_{\theta }(x|z^m,y)p(z^m)}{q_{\phi }(z^m|x,y)}\bigg ] . \end{equation}
(1)
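To make Equation (1) concrete, the following is a minimal PyTorch sketch of the importance-sampling estimate of the conditional log-likelihood. The `encoder` and `decoder` callables, their Gaussian outputs, and the tensor shapes are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.distributions as D

def iw_log_likelihood(x, y, encoder, decoder, M=64):
    """Importance-weighted estimate of log p(x | y) (cf. Equation (1)).

    Assumed (hypothetical) interfaces:
      encoder(x, y) -> (mu_z, log_var_z)   parameters of q(z | x, y)
      decoder(z, y) -> (mu_x, log_var_x)   parameters of p(x | z, y), same shape as x
    x: (batch, n, T), y: (batch,) integer class labels.
    """
    mu_z, log_var_z = encoder(x, y)
    q_z = D.Normal(mu_z, torch.exp(0.5 * log_var_z))
    p_z = D.Normal(torch.zeros_like(mu_z), torch.ones_like(mu_z))

    log_ws = []
    for _ in range(M):
        z = q_z.rsample()
        mu_x, log_var_x = decoder(z, y)
        p_x_given_z = D.Normal(mu_x, torch.exp(0.5 * log_var_x))
        log_w = (p_x_given_z.log_prob(x).sum(dim=(-2, -1))   # log p(x | z, y)
                 + p_z.log_prob(z).sum(dim=-1)               # log p(z)
                 - q_z.log_prob(z).sum(dim=-1))              # log q(z | x, y)
        log_ws.append(log_w)

    log_ws = torch.stack(log_ws, dim=0)                      # (M, batch)
    return torch.logsumexp(log_ws, dim=0) - torch.log(torch.tensor(float(M)))
```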
The intuitive expectation from a DGM learned using training data is that it assigns a high likelihood to ID samples and a low likelihood to OOD samples. However, recent research showed that DGMs tend to assign unreliable likelihoods to OOD samples regardless of the different semantics of ID and OOD data [43]. Indeed, our experimental results shown in Table 4 demonstrate that this observation also holds for the time-series domain.
Table 2.
Domain label | Domain name
D1 | Motion
D2 | ECG
D3 | HAR
D4 | EEG
D5 | Audio
D6 | Other
Table 2. List of Domain Labels Used in the Experimental Section and the Corresponding UCR Domain Name
Table 3.
In-distribution Dataset name | MAE | OOD Dataset labels DS1–DS7
ArticW. (Motion) | 0.025 | CharacterT. | EigenW. | PenD. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
EigenW. (Motion) | 0.000 | ArticW. | CharacterT. | PenD. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
PenD. (Motion) | 0.001 | ArticW. | CharacterT. | EigenW. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
AtrialF. (ECG) | 0.005 | StandW. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
StandW. (ECG) | 0.012 | AtrialF. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
BasicM. (HAR) | 0.024 | Cricket | Epilepsy | Handw. | Libras | NATOPS | RacketS. | UWaveG.
Cricket (HAR) | 0.010 | BasicM. | Epilepsy | Handw. | Libras | NATOPS | RacketS. | UWaveG.
Epilepsy (HAR) | 0.030 | BasicM. | Cricket | Handw. | Libras | NATOPS | RacketS. | UWaveG.
Handw. (HAR) | 0.006 | BasicM. | Cricket | Epilepsy | Libras | NATOPS | RacketS. | UWaveG.
Libras (HAR) | 0.003 | BasicM. | Cricket | Epilepsy | Handw. | NATOPS | RacketS. | UWaveG.
NATOPS (HAR) | 0.046 | BasicM. | Cricket | Epilepsy | Handw. | Libras | RacketS. | UWaveG.
RacketS. (HAR) | 0.026 | BasicM. | Cricket | Epilepsy | Handw. | Libras | NATOPS | UWaveG.
UWaveG. (HAR) | 0.015 | BasicM. | Cricket | Epilepsy | Handw. | Libras | NATOPS | RacketS.
EthanolC. (Other) | 0.001 | ER. | LSST | PEMS-SF | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
ER. (Other) | 0.044 | EthanolC. | LSST | PEMS-SF | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
LSST (Other) | 0.008 | EthanolC. | ER. | PEMS-SF | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
PEMS-SF (Other) | 0.525 | EthanolC. | ER. | LSST | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
FingerM. (EEG) | 0.048 | HandM. | MotorI. | SelfR1. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
HandM. (EEG) | 0.006 | FingerM. | MotorI. | SelfR1. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
MotorI. (EEG) | 0.543 | FingerM. | HandM. | SelfR1. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
SelfR1. (EEG) | 0.009 | FingerM. | HandM. | MotorI. | SelfR2. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
SelfR2. (EEG) | 0.012 | FingerM. | HandM. | MotorI. | SelfR1. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
Heartbeat (Audio) | 0.011 | JapaneseV. | SpokenA. | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\) | \(\emptyset\)
Table 3. Reference Table for the In-domain Dataset Labels Used in the Experimental Section and the Corresponding UCR Dataset Name
The second column shows the average CVAE normalized reconstruction Mean Absolute Error (MAE) with a negligible variance \(\le\) 0.001 on the in-distribution data.
Table 4.
ID Train | AWR | SWJ | Ckt | HMD | Hbt | ERg
ID Test | 0.025 | 0.012 | 0.010 | 0.118 | 0.011 | 0.045
OOD AWR | \(\emptyset\) | 0.018 | 0.035 | 0.039 | 0.002 | 0.137
OOD Fmv | 1.658 | 0.146 | 0.146 | 0.039 | 0.071 | 6.132
Table 4. Average Reconstruction Error of CVAE Is Small on Both ID and OOD Data
The variance is \(\le 0.001\).
STL decomposition. STL [23] is a statistical method for decomposing a given time-series signal x into three different components: (1) the seasonality \(x_s\), a fixed regular pattern that recurs in the data; (2) the trend \(x_t\), the increment or decrement of the seasonality over time; and (3) the residual \(x_r\), which represents random additive noise. STL employs Loess (LOcal regrESSion) smoothing in an iterative process to estimate the seasonality component \(x_s\) [31]. The remainder is the residual that is left from the input x after subtracting both \(x_s\) and \(x_t\). For the proposed SRS algorithm, we assume that there is a fixed semantic pattern \(S_{y}\) for every class label \(y \in \mathcal {Y}\), and that this pattern recurs in all examples \((x_i, y_i)\) from \(\mathcal {D}_{in}\) with the same class label, i.e., \(y_i\)=y. We elaborate on this assumption in the next section, along with a reformulation of the problem that can be used when the assumption is violated. Hence, every time-series example has the following two elements: \(x_i = S_{y_i} + r_i\), where \(S_{y_i}\) is the pattern for the class label y=\(y_i\) and \(r_i\) is the remainder noise w.r.t. \(S_{y_i}\). For time-series classification tasks, ID samples are assumed to be stationary. Therefore, we propose to average the trend \(x_t\) component observed during training and include it in the semantic pattern \(S_{y}\). Figure 2 illustrates the above-mentioned decomposition for two different classes from the ERing dataset.
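As an illustration of this decomposition, the following is a minimal sketch that extracts a class-wise pattern \(S_y\) from the concatenated examples of one class for a single channel. It uses the STL implementation from the statsmodels package (our experiments use the STLdecompose package, as described in Section 6), and the function name and shapes are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def class_pattern(X_class, T):
    """Estimate the class-wise semantic pattern S_y for one channel (a sketch).

    X_class: array of shape (k, T) -- k in-distribution examples of one class.
    Returns (S_y, remainders), mirroring the decomposition x_i = S_y + r_i above.
    """
    stream = X_class.reshape(-1)                      # concatenate k examples of length T
    res = STL(stream, period=T, robust=True).fit()    # seasonality recurs with period T
    seasonal = res.seasonal.reshape(-1, T)            # one seasonal cycle per example
    S_y = seasonal.mean(axis=0) + res.trend.mean()    # fold the averaged trend into the pattern
    remainders = X_class - S_y                        # r_i = x_i - S_y
    return S_y, remainders
```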

4 Seasonal Ratio Scoring Approach for OOD Detection

Overview of SRS algorithm. The training stage proceeds as follows: We employ STL decomposition to get the semantic component \(S_y\) for each class label \(y \in \mathcal {Y}\)=\(\lbrace 1,\ldots ,C\rbrace\) from the given ID data \(\mathcal {D}_{in}\). Details on the STL decomposition steps within the SRS framework are provided in Section 4.2. To improve the accuracy of STL decomposition, we apply a time-series alignment method based on dynamic time warping to address scaling, warping, and time-shifts, as detailed in Section 4.3. We train two CVAE models \(\mathcal {M}_x\) and \(\mathcal {M}_r\) to estimate the class-wise conditional likelihood of each time-series signal \(x_i\) and its remainder component \(r_i\) w.r.t. the semantic component \(S_{y_i}\). The seasonal ratio score for each ID example \((x_i, y_i)\) from \(\mathcal {D}_{in}\) is computed as the ratio of the class-wise conditional likelihood estimates for \(x_i\) and its remainder \(r_i\): \(SR_i(x_i, y_i) \overset{\Delta }{=} \frac{p(x_i|y=y_i)}{p(r_i|y=y_i)}\). We compute the SR scores for all in-distribution examples from \(\mathcal {D}_{in}\) to estimate the threshold interval \([\tau _l, \tau _u]\) for OOD detection. During the inference stage, given a time-series signal x and a trained classifier \(F(x)\), we compute the SR score of x with the predicted output \(\hat{y}\)=\(F(x) \in \mathcal {Y}\) and identify it as an OOD example if the SR score lies outside the threshold interval \([\tau _l, \tau _u]\). Figure 1 provides a high-level illustration of the SRS algorithm.
Below, we first provide an intuitive explanation to motivate the SR score. Next, we describe the complete details of the SRS algorithm including both training and inference stages. Finally, we motivate and describe a time-series alignment approach based on dynamic time warping to improve the effectiveness of SRS.

4.1 Intuition for Seasonal Ratio Score

We explain the intuition behind the proposed SRS algorithm using STL decomposition of time-series signals and CVAE models for likelihood estimation. Current research shows that DGMs alone can fail to identify OOD samples [43]. They not only assign high likelihood to OOD samples, but they also exhibit good reconstruction quality on them. In fact, we show in Table 4 that CVAEs trained on given ID data generally exhibit a low reconstruction error on most of the OOD samples. Furthermore, we show in Table 8 that using a trained CVAE likelihood output for OOD detection fails to perform well. These results motivate the need for a new OOD scoring method for the time-series domain.
Table 5.
Setting / Method | AWR | SWJ | Ckt | HMD | Hbt | ERg
In-domain ODIN | 0.65\(\pm 0.03\) | 0.54\(\pm 0.01\) | 0.65\(\pm 0.03\) | 0.55\(\pm 0.01\) | 0.71\(\pm 0.03\) | 0.70\(\pm 0.03\)
In-domain GM | 0.70\(\pm 0.03\) | 0.58\(\pm 0.01\) | 0.64\(\pm 0.03\) | 0.58\(\pm 0.02\) | 0.80\(\pm 0.01\) | 0.75\(\pm 0.03\)
Cross-domain ODIN | 0.55\(\pm 0.01\) | 0.54\(\pm 0.01\) | 0.55\(\pm 0.01\) | 0.50 | 0.70\(\pm 0.01\) | 0.70\(\pm 0.01\)
Cross-domain GM | 0.56\(\pm 0.01\) | 0.54\(\pm 0.01\) | 0.65\(\pm 0.03\) | 0.55\(\pm 0.01\) | 0.75\(\pm 0.03\) | 0.78\(\pm 0.02\)
Table 5. Average AUROC Results for ODIN and GM
Table 6.
Method | AWR | SWJ | Ckt | HMD | Hbt | ERg
LR | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 1.00
Table 6. Average Performance of LR on OOD Examples Sampled from Gaussian/Uniform Distribution
Table 7.
Measure | AWR | SWJ | Ckt | HMD | Hbt | ERg
MAE | 0.047 | 0.031 | 0.029 | 0.078 | 0.002 | 0.086
DTW | 0.039 | 0.018 | 0.022 | 0.068 | 0.002 | 0.069
Table 7. Results for the Validity of Assumption 1
Average distance (MAE and DTW measures) between the semantic pattern \(S_y\) from STL and time-series examples x with label y from the testing data (with a negligible variance \(\le 0.001\)).
Table 8.
Table 8. AUROC Results for the Baselines, SR, and SR with Time-series Alignment (SR\(_a\)) on Different Datasets for Both In-domain and Cross-domain OOD Settings
Class-wise seasonality via STL decomposition. The proposed SRS algorithm relies on the following assumption to analyze the time-series space for OOD detection:
Assumption 1.
Each time-series example \((x_i, y_i)\) from the in-distribution data \(\mathcal {D}_{in}\) consists of two components. (1) A class-wise semantic pattern \(S_y\) for each class label \(y \in \mathcal {Y}\) representing the meaningful semantics of the class label y. (2) A remainder noise \(r_i\) representing an additive perturbation to the semantic portion. Hence, \(\forall (x_i, y_i) \in \mathcal {D}_{in}:~ x_i = S_{y_i} + r_i\)
We propose to employ STL decomposition to estimate the semantic pattern \(S_y\) (as illustrated in Figure 2) and to deduce the remainder noise, which can be due to several factors including errors in sensor measurements and noise in communication channels. These two components are analogous to the foreground and the background of an image, where the foreground is the interesting segment of the input that describes it, and the background may not necessarily be related to the foreground. In spite of this analogy, prior methods for the image domain are not suitable for time-series data, as explained in the related work (Section 5). In this decomposition, we cannot assume that \(S_y\) and r are independent for a given time-series example \((x, y)\), as \(S_y\) is class-dependent and r is the remainder of the input x given \(S_y\). Hence, we present their conditional likelihoods in the following observation:
Observation 1.
Let \(x \in \mathbb {R}^{n \times T}\) be a time-series signal and \(y_i \in \mathcal {Y}\)=\(\lbrace 1,\ldots ,C\rbrace\) be the corresponding class label. As x = \(S_{y_i} + r\), we have:
\begin{equation} p(x|y_i) = p(r|y_i)p(S_{y_i}|y_i). \end{equation}
(2)
Proof of Observation 1. As \(X = S_{y_i} + r\), it is intuitive to think that \(p(X|y=y_i) = p(S_{y_i})\times p(r)\). However, we cannot assume that \(S_{y_i}\) and r are independent, as \(S_{y_i}\) is class-dependent and r is the remainder of the input X given \(S_{y_i}\).
Therefore, we make use of the conditional probabilities of the components. The likelihood \(p(X|y=y_i)\) can be decomposed as follows:
\begin{equation*} \begin{split}p(X|y=y_i) &= p(S_{y_i},r|y=y_i)\\ &= \frac{p(S_{y_i},r,y=y_i)}{p(y=y_i)}\\ &= \frac{p(r,y=y_i)p(S_{y_i}|r,y=y_i)}{p(y=y_i)} . \end{split} \end{equation*}
For the conditional probability \(p(S_{y_i}|r,y=y_i)\), since only the pattern \(S_{y_i}\) depends on the class label and r is defined as non-meaningful noise added to the input, we can assume that \(S_{y_i}\) and r are conditionally independent given the class \(y_i\). Therefore, we have the following:
\begin{equation*} \begin{split}p(X|y=y_i) &= \frac{p(r,y=y_i)p(S_{y_i}|r,y=y_i)}{p(y=y_i)}\\ &= \frac{p(r,y=y_i)p(S_{y_i}|y=y_i)}{p(y=y_i)}\\ &= \frac{p(r|y=y_i)p(y=y_i)p(S_{y_i}|y=y_i)}{p(y=y_i)}\\ &= p(r|y=y_i)p(S_{y_i}|y=y_i).\\ \end{split} \end{equation*}
Discussion on Observation 1. \(S_y\) is a fixed class-wise semantic pattern that characterizes a class \(y \in \mathcal {Y}\). By definition, \(S_y\) is a deterministic pattern extracted using STL decomposition during training and is not a random variable. At inference time, we do not estimate \(S_y\) for each test input x; instead, we use the \(S_y\) computed during the training stage to estimate the remainder component r. Hence, \(p(S_y|y)\) is defined as a deterministic quantity and not as a density that SRS aims to estimate.
OOD detection using CVAEs. Observation 1 shows the relationship between the conditional likelihood of the input x and that of its remainder r. Since both are conditional likelihoods, we propose to employ CVAEs to estimate them. Recall that OOD examples come from an unknown distribution that is different from the in-distribution \(P^*\) and do not belong to any pre-defined class label from \(\mathcal {Y}\). Therefore, we propose to use the following observation for OOD detection in the time-series domain:
Observation 2.
Let \(x \in \mathbb {R}^{n \times T}\) be a time-series signal and \(y \in \mathcal {Y}\)=\(\lbrace 1,\ldots ,C\rbrace\) be the corresponding class label. As x = \(S_{y} + r\), x is an OOD example if \(p(x|y) \ne p(r|y)\) and an in-distribution example if \(p(x|y)\) = \(p(r|y)\).
Observation 2 shows how we can exploit the relationship between the estimated conditional likelihood of the time-series signal x and its remainder r to predict whether x is an OOD example or not. This observation relies on the assumption that \(p(S_y|y)=1\) for in-distribution data. For ID data, the semantic pattern \(S_y\) is a class-dependent signal that defines the class label y. Since the semantic component is guaranteed to be \(S_y\) for any time-series example with class label y, we have \(p(S_y|y)\)=1. However, OOD examples do not belong to any class label from \(\mathcal {Y}\), i.e., \(p(S_y|y) \ne 1\) for any \(y \in \mathcal {Y}\). To estimate \(p(x|y \in \mathcal {Y})\) and \(p(r|y \in {\mathcal {Y}})\) in Observation 2, we train two separate CVAE models using the in-distribution data \(\mathcal {D}_{in}\). While estimating two separate distributions can cause instability, we note that
(1)
During hyper-parameter tuning and the definition of the ID score range \([\tau _l, \tau _u]\), any outlier that may cause estimation instability will be omitted.
(2)
In case of drastic estimation instability, both CVAEs can be tuned during training time to overcome the problem.
(3)
If this instability is seen during inference time, then the SRS algorithm automatically indicates that the test example is an OOD example.
Discussion on Assumption 1. We acknowledge that this assumption may fail to hold in some real-world scenarios. However, surprisingly, our experimental results shown in Table 7 strongly corroborate this key assumption: the distance between each time-series signal \(x_i\) and its semantic pattern \(S_{y_i}\) is very small. The strong OOD performance of the SRS algorithm in our diverse experiments demonstrates the effectiveness of a simple approach based on this assumption.
Suppose the assumption does not hold and some class label y possesses \(K \gt 1\) different semantics \(\lbrace S^k_y\rbrace _{k\le K}\). In a human activity recognition example, it is safe to think that a certain activity (e.g., running or walking) will have \(K\gt 1\) different patterns (e.g., athletic runners vs. young runners). Therefore, the decomposition in Assumption 1 for a given time-series example \((x_i, y_i)\) will result in a semantic pattern describing the patterns of the different sub-categories (e.g., a pattern that describes both athletic runners and young runners). By using Loess smoothing, the STL seasonal component extracted over a multiple-pattern class is a pattern \(S_y\) that is a linear combination of \(\lbrace S^k_y\rbrace _{k\le K}\) (for our example, it describes the combination of both athletic runs and young runs). While the condition \(p(S_y|y)\)=1 of Observation 2 will not hold for an in-distribution example, \(p(S_y|y)\) is likely to be well-defined in terms of \(p(S^k_y|y)\), as \(\lbrace S^k_y\rbrace _{k\le K}\) are fixed and natural for the class label y. Hence, we can still rely on the CVAEs to estimate this distribution and to perform successful OOD detection. Alternatively, we can use a simple reformulation of the problem by clustering the time-series signals of a class label y (for which the assumption is not satisfied) to identify sub-classes and apply the SRS algorithm on the transformed data. Since we found the assumption to be true in all our experimental scenarios (see Table 7), we did not find the need to apply this reformulation.
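A minimal sketch of this clustering-based reformulation is given below. KMeans on flattened signals is only an illustrative choice (we did not need this reformulation in our experiments), and any time-series clustering method, e.g., a DTW-based one, could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_multimodal_class(X_class, K=2):
    """Split one class with K > 1 semantic patterns into K pseudo sub-classes.

    X_class: array of shape (k, n, T). Returns sub-class labels so that SRS
    can extract one semantic pattern per sub-class. Flattening the signals
    before KMeans is an illustrative simplification.
    """
    flat = X_class.reshape(len(X_class), -1)
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(flat)
```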

4.2 OOD Detection Approach

One key advantage of the SRS method is that it can be executed directly at the inference stage and does not require additional per-input training, unlike prior VAE-based methods such as Likelihood Regret scoring.
Training stage. Our overall training procedure for time-series OOD detection is as follows:
(1)
Train a CVAE \(\mathcal {M}_x\) using in-distribution data \(\mathcal {D}_{in}\) to estimate the conditional likelihood \(p(x|y \in \mathcal {Y})\) of time-series signal x.
(2)
Execute STL decomposition as follows:
(a)
From the training data \(\lbrace (x_i, y_i)\rbrace\), we create a group \(\mathcal {D}_y=\lbrace x_i| y_i=y\rbrace\).
(b)
We concatenate all the examples \(x_i\in \mathcal {D}_y\) in a single stream of data according to the T dimension. If \(\mathcal {D}_y\) has k examples, then the output is a single stream \(X_{stream}\in \mathbb {R}^{n \times (k\cdot T)}\)
(c)
We apply STL decomposition on the stream \(X_{stream}\), setting the seasonal period to T so that the resulting pattern has dimensions \(S_y \in \mathbb {R}^{n \times T}\).
(d)
We store the semantic component \(S_y\) to be used later in estimating the remainder component for any given training example \((x,y)\): r = \(x - S_y\).
(3)
Create the remainder for each training example \((x_i, y_i) \in \mathcal {D}_{in}\) using the patterns \(S_y\) for each class label y \(:~ r_i = x_i - S_{y_i}\). We train another CVAE \(\mathcal {M}_r\) using all these remainders to estimate the conditional likelihood \(p(r|y \in {\mathcal {Y}})\).
(4)
Compute seasonal ratio score for each \((x_i, y_i) \in D_{in}\) using the trained CVAEs \(\mathcal {M}_x\) and \(\mathcal {M}_r\).
\begin{equation} SR_i(x_i, y_i) \overset{\Delta }{=} \dfrac{p(x_i|y=y_i)}{p(r_i|y=y_i)} \end{equation}
(3)
(5)
Compute the mean \(\mu _{SR}\) and variance \(\sigma _{SR}\) over SR scores of all in-distribution examples seen during training. Set the OOD detection threshold interval as \([\tau _l, \tau _u]\) such that \(\tau _l\) = \(\mu _{SR} - \lambda \times \sigma _{SR}\) and \(\tau _u\) = \(\mu _{SR} + \lambda \times \sigma _{SR}\), where \(\lambda\) is a hyper-parameter.
(6)
Tune the hyper-parameter \(\lambda\) on the validation data to maximize OOD detection accuracy.
The choice of \([\tau _l, \tau _u]\) for OOD detection is motivated by the fractional nature of the seasonal ratio score. The SRS algorithm assumes that in-distribution examples satisfy \(p(x|y)= p(r|y)\), so their SR score is a quotient ideally centered around the value 1. Hence, we identify in-distribution examples by SR scores that are close to the mean score recorded during training, whether from the left (\(\tau _l\)) or the right (\(\tau _u\)) side. Indeed, we observe in Figure 3 that the SR scores for OOD examples can fall on either side of the SR scores for in-distribution examples. Ideally, the SR score for in-distribution examples is closer to \(\mu _{SR}\) than the SR scores for OOD examples, as illustrated in Figure 1. The hyper-parameter \(\lambda\) is tuned to define the valid range of SR scores for in-distribution examples from \(\mathcal {D}_{in}\). We note that the score can easily be changed to consider the quantiles of the ratios estimated during the training stage and use them to separate the regions of OOD and ID scores. In this case, we redefine \(\tau _l\) as the \((0.5 - \lambda)\)-quantile for the lower limit of the ID score and \(\tau _u\) as the \((0.5 + \lambda)\)-quantile for the upper limit, and we tune the hyper-parameter \(0\lt \lambda \le 0.5\) on the validation data to maximize OOD detection accuracy. Furthermore, the ID score range is not required to be symmetric: in the general case, we can define \(\tau _l\) as the \((0.5 - \lambda _l)\)-quantile and \(\tau _u\) as the \((0.5 + \lambda _u)\)-quantile, where \(\lambda _l \ne \lambda _u\). We have observed in our experiments that both settings give similar performance. Therefore, we only consider \(\tau _{u,l}\) = \(\mu _{SR} \pm \lambda \times \sigma _{SR}\) for simplicity in our experimental evaluation.
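The following sketch shows how the threshold interval can be computed from the SR scores of the ID data, with both the mean/variance and the quantile variants. Handling the ratio in log space is an implementation choice for numerical stability rather than part of the SR definition.

```python
import numpy as np

def sr_threshold_interval(log_px, log_pr, lam=2.0, use_quantiles=False):
    """Compute the ID threshold interval [tau_l, tau_u] from SR scores (a sketch).

    log_px, log_pr: per-example log-likelihood estimates log p(x_i|y_i) and
    log p(r_i|y_i) from the two trained CVAEs (e.g., the importance-weighted
    estimator sketched in Section 3). The ratio is handled in log space:
    log SR_i = log p(x_i|y_i) - log p(r_i|y_i).
    """
    log_sr = np.asarray(log_px) - np.asarray(log_pr)
    if use_quantiles:
        # quantile variant: here lam must lie in (0, 0.5]
        tau_l, tau_u = np.quantile(log_sr, [0.5 - lam, 0.5 + lam])
    else:
        # mean/variance variant: tau = mu_SR +/- lam * sigma_SR
        mu, sigma = log_sr.mean(), log_sr.std()
        tau_l, tau_u = mu - lam * sigma, mu + lam * sigma
    return tau_l, tau_u
```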
Inference stage. Given a time-series signal x, our OOD detection approach works as follows:
(1)
Compute the predicted class label \(\hat{y}\) using the classifier \(F(x)\).
(2)
Create the remainder component of x with the predicted label \(\hat{y}\): r = \(x - S_{\hat{y}}\).
(3)
Compute conditional likelihoods \(p(x|\hat{y})\) and \(p(r|\hat{y})\) from trained CVAE models \(\mathcal {M}_x\) and \(\mathcal {M}_r\).
(4)
Compute the seasonal ratio score using conditional likelihoods.
\begin{equation*} SR(x, \hat{y})=\frac{p(x|y=\hat{y})}{p(r|y=\hat{y})} \end{equation*}
(5)
If the seasonal ratio score \(SR(x, \hat{y})\) does not lie within the threshold interval \([\tau _l, \tau _u]\), then classify x as OOD example. Otherwise, classify x as in-distribution example.
Algorithm 1 shows the complete pseudo-code, including the offline training stage and the online inference stage for new time-series signals. For a given time-series signal x at the inference stage, we employ the SRS algorithm to compute the seasonal ratio (SR) score. If the score is within \([\tau _l, \tau _u]\), then the time-series signal is classified as in-distribution. Otherwise, we flag it as an OOD time-series signal.
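A minimal sketch of the inference-stage check is shown below; `classifier`, `model_x`, and `model_r` are hypothetical wrappers around the trained classifier and the two CVAE log-likelihood estimators, consistent with the log-space SR score used in the training sketch above.

```python
def is_ood(x, classifier, S, model_x, model_r, tau_l, tau_u):
    """Inference-stage OOD check (cf. Algorithm 1), as a sketch.

    classifier(x) -> predicted label y_hat; S[y_hat] is the stored semantic
    pattern; model_x(x, y) and model_r(r, y) return log-likelihood estimates
    log p(.|y) from the trained CVAEs (hypothetical interfaces).
    """
    y_hat = classifier(x)
    r = x - S[y_hat]                                  # remainder w.r.t. the class pattern
    log_sr = model_x(x, y_hat) - model_r(r, y_hat)    # log of the SR score
    return not (tau_l <= log_sr <= tau_u)             # outside the ID interval => OOD
```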

4.3 Alignment Method for Improving the Accuracy of the SRS Algorithm

In this section, we first motivate the need for pre-processing raw time-series signals to improve the accuracy of SRS algorithm. Subsequently, we describe a novel time-series alignment method based on dynamic time warping to achieve this goal.
Motivation. The effectiveness of the SRS algorithm depends critically on the accuracy of the STL decomposition. The STL method employs a fixed-length window over the serialized data to estimate the recurring pattern. This is a challenge for real-world time-series signals, as they are prone to scaling, warping, and time-shifts. We illustrate in Figure 4 the challenge posed by scaling, warping, and time-shift occurrences in time-series data. The top-left figure depicts a set of time-series signals with a clear ECG pattern. Due to their misalignment, if we subtract one fixed ECG pattern from every time-series signal, then the remainder will be inaccurate. The figures in the left and right columns show the difference in the remainder components between the natural data (Left) and the aligned version of the time-series data (Right). We can clearly observe that the remainder components from the aligned data are more accurate. If input time-series data is not aligned, then it can significantly affect the estimation of \(p(r_i|y=y_i)\) and the effectiveness of SRS for OOD detection. Hence, we propose a novel alignment method using the class-wise semantic pattern of the in-distribution data \(\mathcal {D}_{in}\) during both training and inference stages.
Time-series alignment algorithm. The overall goal of our approach is to produce class-wise aligned time-series signals using the ID data \(\mathcal {D}_{in}\) so that the STL algorithm produces accurate semantic components \(S_{y}\) for each \(y \in \mathcal {Y}\). We propose to employ dynamic time warping (DTW) [32] based optimal alignment to achieve this goal. The optimal DTW alignment describes the warping between two time-series signals that makes them aligned in time. It overcomes warping and time-shift issues by allowing one-to-many matches over timesteps. There are two key steps in our alignment algorithm. First, we compute the semantic components \(S_{y}\) for each \(y \in \mathcal {Y}\) from \(\mathcal {D}_{in}\) using STL decomposition. For each in-distribution example \((x_i, y_i) \in \mathcal {D}_{in}\), we compute the optimal DTW alignment between \(S_{y_i}\) and \(x_i\). Second, we use an appropriate time-series transformation for each in-distribution example \((x_i, y_i)\) to improve the DTW alignment from the first step. Specifically, we use the timesteps of the longest one-to-many, many-to-one, or sequential one-to-one sequence match to select the Expand, Reduce, or Translate transformation, respectively, as illustrated in Figure 5. We define these three time-series transformations below.
Let \(X^1=(t^1_1, t^1_2, \ldots , t^1_T)\) and \(X^2=(t^2_1, t^2_2, \ldots , t^2_T)\) be two time-series signals of length T.
Expand\((X^1, X^2)\): We employ this transformation for a one-to-many timestep matching (\(t^1_i\) is matched with \([t^2_j, \ldots , t^2_{j+k}],\) as shown in Figure 5(a)). It duplicates the timestep \(t^1_i\) k times.
Reduce\((X^1, X^2)\): We employ this transformation in the case of a many-to-one timestep matching (\([t^1_i, \ldots , t^1_{i+k}]\) is matched with \(t^2_j\) as shown in Figure 5(b)). It replaces the timesteps \([t^1_i, \ldots , t^1_{i+k}]\) by a single averaged value.
Translate\((X^1, X^2)\): We employ this transformation in the case of a sequential one-to-one timestep matching (\([t^1_i, \ldots , t^1_{i+k}]\) is matched one-to-one with \([t^2_j, \ldots , t^2_{j+k}]\) as shown in Figure 5(c)). It translates \(X^1\) to ensure that \(t^1_i=t^2_j\).
We illustrate in Figure 6 two examples of transformation choices for a time-series signal x aligned with a pattern S. The alignment on the left shows that the longest consecutive matching sequence is one-to-many (\(x_4\) is matched with \([S_2, \ldots , S_7]\)), calling for the Expand transformation, while the alignment on the right shows that the longest consecutive matching sequence is a sequential one-to-one (\([x_4, \ldots ,x_8]\) is matched with \([S_3, \ldots , S_7]\)), calling for the Translate transformation.
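The following is a minimal sketch of the DTW alignment path computation and of how the longest consecutive match can be classified to select one of the three transformations; the squared-error local cost and the helper names are illustrative assumptions.

```python
import numpy as np

def dtw_path(a, b):
    """Optimal DTW alignment path between two 1-D signals a and b (a sketch)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path as a list of matched index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def longest_match_type(path):
    """Classify the longest consecutive match along the path as 'expand'
    (one-to-many), 'reduce' (many-to-one), or 'translate' (sequential
    one-to-one), mirroring the transformation choice described above."""
    runs = {"expand": 0, "reduce": 0, "translate": 0}
    prev_kind, run = None, 0
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        kind = ("translate" if (i1 - i0, j1 - j0) == (1, 1)
                else "expand" if i1 == i0 else "reduce")
        run = run + 1 if kind == prev_kind else 1
        runs[kind] = max(runs[kind], run)
        prev_kind = kind
    return max(runs, key=runs.get)
```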

5 Related Work

OOD detection via pre-trained models. Employing pre-trained deep neural networks (DNNs) to detect OOD examples was justified by the observation that DNNs with ReLU activation can produce arbitrarily high softmax confidence for OOD examples [12]. Maximum probability over class labels has been used [12] to improve the OOD detection accuracy. Building on the success of this method, temperature scaling and controlled perturbations were used [28] to further increase the performance. The Mahalanobis-based scoring method [27] identifies OOD examples using class-conditional Gaussian distributions. Gram matrices [38] were used to detect OOD examples based on the features learned from the training data. The effectiveness of these prior methods depends critically on the availability of a highly accurate DNN for the classification task. However, this requirement is challenging for the time-series domain, as real-world datasets are typically small and exhibit high class imbalance, resulting in inaccurate DNNs [19, 42].
OOD detection via synthetic data. During the training phase, it is impossible to anticipate the OOD examples that would be encountered during the deployment of DNNs [18]. Hence, unsupervised methods [48] are employed, or synthetic data based on generative models is created [26, 29] to explicitly regularize the DNN weights over potential OOD examples. It is much more challenging to create synthetic data for the time-series domain due to the limited data and the difficulty of validating such data by human experts.
OOD detection via deep generative models. The overall idea of using deep generative models (DGMs) for OOD detection is as follows: (1) DGMs are trained to directly estimate the in-distribution \(P^*\); and (2) The learned DGM identifies OOD samples when they are found lying in a low-density region. Prior work has used auto-regressive generative models [36] or GANs [41] and proposed scoring metrics such as likelihood estimates to obtain good OOD detectors. DGMs are shown to be effective in evaluating the likelihood of input data and estimating the data distribution, which makes them a good candidate to identify OOD examples with high accuracy. However, as shown by Reference [33], DGMs can assign a high likelihood to OOD examples. Likelihood ratio [36] and likelihood regret [43] are proposed to improve OOD detection. While likelihood regret method can generalize to different types of data, likelihood ratio is limited to categorical data distributions with the assumption that the data contains background units (background pixels for images and background sequences for genomes). Likelihood ratio cannot be applied to the time-series domain for two reasons: (1) We need to deal with continuous distributions; and (2) We cannot assume that timesteps (information unit) can be independently classified as background or semantic content.
OOD detection via time-series anomaly detection. Generic Anomaly Detection (AD) algorithms [13, 34, 37] can be employed to solve OOD problems for time-series data. Anomaly detection is the task of identifying observed points or examples that deviate significantly from the rest of the data. Anomaly detection relies on different approaches, such as distance-based metrics or density-based approaches, to quantify the dissimilarities between any example and the rest of the data. Current methods using DNNs (e.g., Generative Adversarial Networks, auto-encoders) show higher performance in anomaly detection, as they can capture more complex features in high-dimensional spaces [34, 35]. We note that there exist some AD methods that can cover the same setting as the OOD problem for the time-series domain. However, both settings are still considered two different frameworks with two different goals [45]. By definition, AD aims to detect and flag anomalous samples that deviate from a pre-defined normality [7, 25] estimated during training. Under the AD assumption of normality, such samples only originate from a covariate shift in the data distribution [37]. Semantically, such samples do not classify as OOD samples [45]. For example, consider an intelligent system trained to identify the movement of a person (e.g., run, stand, walk, swim), where stumbling may occur during running. Such an event would be classified as an anomaly, as the activity running is still taking place, but in an irregular manner. However, if the runner slips and falls, then such activity should be flagged as OOD due to the fact that it does not belong to any of the pre-defined activity classes. In other words, OOD samples must originate from a different class distribution (\(y_{\text{OOD}} \notin \mathcal {Y}\)) than in-distribution examples, while anomalies typically originate from the same underlying distribution but with anomalous behavior. Open-set recognition methods can be applicable to this setting [50, 51], as they have been shown to be effective in detecting unknown categories without prior knowledge. However, OOD detection encompasses a broader spectrum of the solution space and does not require the complexity of identifying the semantic class of the anomalies. Additionally, anomalies can manifest as a single timestep or a window of varying length, but generally not as a complete time-series example in itself. Such differences can be critical for users and practitioners, which necessitates the study of separate algorithms for AD and OOD. Unlike anomaly detection, OOD detection focuses on identifying test samples with non-overlapping labels with in-distribution data and can generalize to the multi-class setting [45]. The main limitations of time-series AD algorithms [9] for OOD detection tasks are
OOD samples cannot be used as labeled anomalous examples during training due to the general definition of the OOD space. For various AD methods, such as nearest-neighbor and distance-based methods, the fine-tuning of the cut-off threshold between “normal” and “anomalous” examples requires anomaly labels during training. In particular, window-based techniques [10] require both normal and anomalous sequences during training, and if there are none, then anomalous examples are randomly generated. Such a requirement is not practical for OOD problem settings, as the OOD distribution is ambiguous to define and sample from.
AD assumes that normal samples are homogeneous in their observations. This assumption helps the AD algorithm to detect anomalies. Such an assumption cannot hold for different classes of the in-distribution space for multi-class settings. Therefore, time-series AD algorithms are prone to fail at detecting OOD samples. Indeed, our experiments demonstrate the failure of state-of-the-art time-series AD methods.

6 Experiments and Results

In this section, we present experimental results comparing the proposed SRS algorithm and prior methods on diverse real-world time-series datasets.

6.1 Experimental Setup

Datasets. We employ the multivariate benchmarks from the UCR time-series repository [14]. Due to space constraints, we present the results on representative datasets from six different pre-defined domains: Motion, ECG, HAR, EEG, Audio, and Other. The list of datasets includes Articulary Word Recognition (AWR), Stand Walk Jump (SWJ), Cricket (Ckt), Hand Movement Direction (HMD), Heartbeat (Hbt), and ERing (ERg). We employ the standard training/validation/testing splits from these benchmarks.
OOD experimental setting. Prior work formalized the OOD experimental setting for different domains such as computer vision [12]. However, there is no OOD setting for the time-series domain. In what follows, we explain the challenges for the time-series domain and propose a concrete OOD experimental setting for it.
The first challenge with the time-series domain is the dimensionality of signals. Let the ID space be \(\mathbb {R}^{n_i\times T_i}\) and the OOD space be \(\mathbb {R}^{n_o\times T_o}\). Since we train CVAEs on the ID space, \({n_o\times T_o}\) needs to match \({n_i\times T_i}\). Hence, if \(n_o\gt n_i\) or \(T_o\gt T_i\), then we window-clip the respective OOD dimension to have \(n^{\prime }_o=n_i\) or \(T^{\prime }_o=T_i\). If \(n_o\lt n_i\) or \(T_o\lt T_i\), then we zero-pad the respective OOD dimension to have \(n^{\prime }_o=n_i\) or \(T^{\prime }_o=T_i\). Zero-padding is based on the assumption that the additional dimension exists but takes null values. The second challenge is in defining OOD examples. Since the number of datasets in the UCR repository is large, conducting experiments on all combinations of datasets as ID and OOD is impractical and repetitive (600 distinct configurations for the 25 different datasets considered in this article).
Hence, we propose two settings using the notion of domains.
In-domain OOD: Both ID and OOD datasets belong to the same domain. This setting helps in understanding the behavior of OOD detectors when real-world OOD examples come from the same application domain. For example, a detector of Epileptic time-series signals should consider signals resulting from sports activity (Cricket) as OOD.
Cross-domain OOD: Both ID and OOD datasets come from two different domains. This configuration is more intuitive for OOD detectors, where time-series signals from different application domains should not confuse the ML model (e.g., Motion and HAR data).
Our intuition is that the in-domain OOD setting is more likely to occur during real-world deployment. Hence, we propose to do separate experiments by treating every dataset from the same domain as OOD. For the cross-domain OOD setting, we believe that a single representative dataset from the domain can be used as OOD. In this work, we focus on real-world OOD detection for the time-series domain. Since random noise does not inherit the characteristics of time-series data, methods from the computer vision literature have a good potential in detecting random noise.
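The dimension-matching preprocessing described above can be sketched as follows; the function name and the zero-padding layout are illustrative assumptions.

```python
import numpy as np

def match_dims(x_ood, n_id, T_id):
    """Make an OOD example x_ood (shape (n_o, T_o)) match the ID shape (n_id, T_id)
    by window-clipping dimensions that are too large and zero-padding those
    that are too small, as described in the OOD experimental setting above."""
    n_o, T_o = x_ood.shape
    x = x_ood[:min(n_o, n_id), :min(T_o, T_id)]          # clip oversized dimensions
    pad_n, pad_T = n_id - x.shape[0], T_id - x.shape[1]
    return np.pad(x, ((0, pad_n), (0, pad_T)))           # zero-pad undersized dimensions
```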
For improved readability and ease of understanding, we provide Table 2 and Table 3 to explain the domain labels and dataset labels used in the experimental section of our article along with the corresponding UCR domain name and dataset name.
Table 2 shows the label used to represent a given domain for Cross-domain OOD setting.
Table 3 shows the label used to represent the dataset used as an OOD source against a given ID dataset for the In-domain OOD setting. For example, while reading Table 8, when AWR is the ID distribution, then according to Table 3, DS1 represents the CharacterT. dataset. However, if HMD is the ID distribution, then DS1 represents the FingerM. dataset.
Evaluation metrics. We employ the following two standard metrics in our experimental evaluation. (1) AUROC score: The area under the receiver operating characteristic curve is a threshold-independent metric. This metric (higher is better) is equal to 1.0 for a perfect detector and 0.5 for a random detector. (2) F1 score: It is the harmonic mean of precision and recall. Due to the threshold dependence of the F1 score, we report the highest F1 score obtained by varying the threshold. This score has a maximum of 1.0 in the case of perfect precision and recall.
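A minimal sketch of how these two metrics can be computed with scikit-learn is given below, assuming a scalar OOD score per example where a higher value indicates a more OOD-like input (e.g., the deviation of the SR score from \(\mu _{SR}\)); the helper name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

def ood_metrics(scores_id, scores_ood):
    """AUROC and best F1 over all thresholds for an OOD detector (a sketch)."""
    y_true = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
    y_score = np.concatenate([scores_id, scores_ood])
    auroc = roc_auc_score(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    best_f1 = np.max(2 * prec * rec / np.clip(prec + rec, 1e-12, None))
    return auroc, best_f1
```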
Configuration of algorithms. We employ a 1D-CNN architecture for the CVAE models required for the SR scoring method. We consider a naive baseline where the CVAE is trained on the ID data and the likelihood (LL) is used to detect OOD samples. We also consider a variant of SR scoring (SR\(_a\)) that works on the aligned time-series data using the method explained in Section 4.3. We evaluate both SR and SR\(_a\) against state-of-the-art baselines and employ their publicly available code: Out-of-Distribution Images in Neural networks (ODIN) [28] and Gram Matrices (GM) [38], which have been shown to outperform most of the existing baselines; the recently proposed Likelihood Regret (LR) score [43]; and an adaptation of a very recent time-series AD method, referred to as deep generative model with Hierarchical Latent (HL) space for time-series windows [9], that does not require labeled anomalies for training purposes. We chose HL as the main baseline to represent time-series AD under the OOD setting, as it is the state-of-the-art time-series AD algorithm. HL for time-series was shown [9] to outperform nearest-neighbor based methods, LSTM-based methods, and other methods [5, 6] in various AD settings.
Choice of architecture: We experimented with three different types of CVAE architecture to decide on the most suitable one for our OOD experiments. We evaluated (1) fully connected, (2) convolutional, and (3) LSTM-based architectures using the reconstruction error as the performance metric. We observed that fully connected networks generally suffer from poor reconstruction performance, especially on high-dimensional data. We also observed that the LSTM's runtime during training and inference is relatively longer than that of the other architectures. In contrast, CNN-based CVAEs delivered both good reconstruction performance and fast runtime.
1D-CNN CVAE details: To evaluate the effectiveness of the proposed SR score, we employed a CVAE that is based on 1D-CNN layers. The encoder of the CVAE is composed of (1) a min-max normalization layer, (2) a series of 1D-CNN layers, and (3) a fully connected layer. At the end of the encoder, the parameters \(\mu _{\text{CVAE}}\) and \(\sigma _{\text{CVAE}}\) are computed to estimate the posterior distribution. A random sample is then generated from this distribution and passed on to the CVAE decoder along with the class label. The decoder of the CVAE is composed of (1) a fully connected layer, (2) a series of transposed convolutional layers, and (3) a denormalization layer.
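A sketch of a 1D-CNN CVAE of this form is shown below (in PyTorch); the layer sizes, kernel widths, and latent dimension are illustrative, and the min-max normalization/denormalization layers are omitted for brevity.

```python
import torch
import torch.nn as nn

class CVAE1D(nn.Module):
    """Illustrative 1D-CNN CVAE: the class label is injected as extra one-hot
    input channels to the encoder and concatenated to the latent code for the
    decoder; not the exact configuration used in our experiments."""
    def __init__(self, n_channels, T, n_classes, latent_dim=16):
        super().__init__()
        self.n_classes, self.T = n_classes, T
        self.enc = nn.Sequential(
            nn.Conv1d(n_channels + n_classes, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        enc_out = 64 * ((T + 3) // 4)            # length after two stride-2 convs
        self.fc_mu = nn.Linear(enc_out, latent_dim)
        self.fc_logvar = nn.Linear(enc_out, latent_dim)
        self.fc_dec = nn.Linear(latent_dim + n_classes, enc_out)
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(64, 32, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(32, n_channels, kernel_size=5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x, y):
        # x: (batch, n_channels, T), y: (batch,) integer class labels
        y_map = nn.functional.one_hot(y, self.n_classes).float()
        y_ch = y_map.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        h = self.enc(torch.cat([x, y_ch], dim=1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        h_dec = self.fc_dec(torch.cat([z, y_map], dim=1)).view(x.shape[0], 64, -1)
        x_hat = self.dec(h_dec)[..., :self.T]                     # crop to original length
        return x_hat, mu, logvar
```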
CVAE Training: We use the standard training, validation, and testing split on the benchmark datasets to train both CVAEs \(\mathcal {M}_x\) and \(\mathcal {M}_r\). Both CVAEs are trained to maximize the ELBO on the conditional log-likelihood defined in Section 3 using the Adam optimizer with a learning rate of \(10^{-4}\). We employ a maximum number of training iterations equal to 500. To ensure the reliability of the performance of the proposed CVAEs, we report in Table 3 the test reconstruction error of the trained CVAE on ID data using the Mean Absolute Error (MAE). We clearly observe that the proposed CVAE learns the ID space well, as the reconstruction error is relatively low. To compute the semantic patterns and remainders of in-distribution examples for training \(\mathcal {M}_x\) and \(\mathcal {M}_r\), we use the STLdecompose Python package.
Implementation of the baselines: The baseline methods ODIN, GM, HL, and LR were implemented using their respective publicly available code with the recommended settings. To employ ODIN and GM, we trained two different DNN models, a 1D-CNN and an LSTM, for classification tasks with different settings. We report the average performance of the baseline OOD detectors in our experimental setting. To repurpose the HL method from the AD setting to the OOD setting, we serialized the training data and used it to train the generator. For OOD detection at inference time, we serialize both the test ID data and the OOD data and shuffle them. By setting the window size equal to the timestep dimension of the original in-distribution inputs, we execute the HL anomaly detection algorithm and report every detected anomaly as an OOD sample. We employed the default parameters of the generator. As recommended by the authors, we use a hierarchical level equal to 4 and 500 iterations for training and inference. We lower the learning rate to \(10^{-6}\) to prevent the exploding gradients that occurred with the default value of \(10^{-3}\). For a fair comparison, the VAE for Likelihood Regret (LR) has the same architecture as the CVAE used to estimate the SR and the naive LL score.

6.2 Results and Discussion

Reconstruction error of DGMs. Table 4 shows the test reconstruction error (MAE) of the trained CVAEs on ID data along with analogous results on OOD data. We clearly observe that the CVAE models are able to learn the ID space, as the reconstruction error on ID data is relatively low. We also observe that the DGMs perform well on OOD samples regardless of the different semantics of ID and OOD data: the pre-trained CVAEs performed well on the OOD AWR dataset with a reconstruction error \(\le 0.1\), and for the OOD FingerMovement (Fmv) dataset, only two out of the six CVAEs exhibited the intuitively expected high reconstruction error.
OOD detection via pre-trained classifier and DGMs. Our first hypothesis is that pre-trained DNN classifiers are not well-suited for OOD detection. To test this hypothesis, we train two DNN models: a 1D-CNN and an RNN classifier. We use these models for OOD detection using the ODIN and GM baselines. Table 5 shows that the AUROC is low on all datasets. For datasets such as HMD and SWJ, the AUROC score does not exceed 0.6 for any experimental setting. The accuracy of DNNs for time-series classification is not as high as that for the image domain for the reasons explained earlier. Hence, we believe that this uncertainty of DNNs causes the baselines ODIN and GM to fail in OOD detection. Our second hypothesis is that DGMs assign a high likelihood to OOD samples as well as ID samples for time-series data. While the results in Table 4 corroborate this hypothesis, we also report the performance of a pre-trained CVAE likelihood (LL) for OOD detection in Table 8. We observe that the AUROC score of LL does not outperform any of the other baselines. Hence, a new scoring method is necessary for CVAE-based OOD detection.
Random Noise as OOD. An existing experimental setting for OOD detection tasks is to detect random noise. For this setting, we generate random noise as an input sampled from a Gaussian distribution or a Uniform distribution. Table 6 shows that the LR baseline has excellent performance in detecting random noise as OOD examples. This is explainable, as random noise does not necessarily obey time-series characteristics. Hence, the existing baselines can perform strongly on such OOD examples. In contrast, we motivate our seasonal ratio scoring approach for OOD detection based on real-world examples. As shown above, existing baselines have poor performance in detecting real-world OOD examples, whereas SR has significantly better performance.
Results for the SR score. The effectiveness of the SR score depends on the validity of Assumption 1. Table 7 shows both the MAE and DTW measures between the semantic pattern \(S_y\) extracted by STL and different time-series examples of the same class \(y\). We observe that the average difference measure is low. These results demonstrate that Assumption 1 holds empirically. For qualitative results, Figure 7 contrasts the score separability of SR with that of LR, illustrating that SRS provides significantly better OOD separability.
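The check behind Table 7 can be sketched as follows, assuming the class-wise semantic pattern \(S_y\) has already been extracted by STL; the DTW implementation is a textbook dynamic-programming version, not the exact code used in our experiments.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance for
    univariate sequences with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def class_pattern_agreement(S_y, class_examples):
    """Average MAE and DTW distance between a class-wise semantic pattern
    S_y and the examples of that class (univariate, equal length for MAE)."""
    mae = np.mean([np.mean(np.abs(x - S_y)) for x in class_examples])
    dtw = np.mean([dtw_distance(x, S_y) for x in class_examples])
    return mae, dtw
```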
SR score vs. baselines. Table 8 shows the OOD detection results for SRS and the baseline methods. For a fair comparison, we use the same architecture for the VAEs computing the LL, LR, and SR scores. We make the following observations: (1) The naive LL method fails to outperform any other approach, which demonstrates that DGMs are not reliable on their own, as they produce high likelihood for OOD samples. (2) The time-series anomaly detection method HL fails drastically in various OOD settings, as reflected by the poor AUROC score of 0.5; this demonstrates that AD methods are not appropriate for OOD detection in the multi-class setting. (3) The SR score outperforms the LR score in identifying OOD examples in 80% of the total experiments, which shows that the improvement is due to a better scoring function. (4) For the in-domain OOD setting, the AUROC score of LR is always lower than that of SRS. For the cross-domain setting, SRS outperforms LR in all cases except one experiment on the SWJ dataset. Finally, LR and SRS have the same performance in 20% of the total experiments. Therefore, we conclude that SRS is better than LR in terms of both OOD performance and execution time (LR requires new training for every single test input, unlike SRS).
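For intuition, the sketch below shows one plausible realization of SR-based detection, assuming the remainder is obtained by subtracting the class-wise semantic pattern and that the score is the ratio of the two class-conditional likelihood estimates. Here `log_px_given_y` and `log_pr_given_y` are hypothetical callables backed by the two CVAEs; the exact score definition is given in the method section.

```python
import numpy as np

def seasonal_ratio_score(log_px_given_y, log_pr_given_y, x, S_y):
    """Minimal sketch of an SR-style score: ratio of the class-conditional
    likelihood of the input to that of its remainder, where the remainder
    is assumed to be the input minus the class-wise semantic pattern."""
    r = x - S_y                       # remainder component (assumption)
    log_ratio = log_px_given_y(x) - log_pr_given_y(r)
    return np.exp(log_ratio)          # seasonal ratio estimate

def detect_ood(score, tau_low, tau_high):
    """Flag an example as OOD when its SR score falls outside the
    threshold interval [tau_low, tau_high] estimated on ID data."""
    return not (tau_low <= score <= tau_high)
```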
Alignment improves the accuracy of the SR score. Our hypothesis is that extracting a more accurate semantic component using STL results in improved OOD detection accuracy. To test this hypothesis, we compare SR and SR\(_a\) (SR with aligned time-series data). Table 8 shows the AUROC scores of SR and SR\(_a\). SR\(_a\) improves the performance of SR in around 50% of the overall experiments. For example, on the HMD dataset, SR\(_a\) improves the performance of SR by an average of 15% under the in-domain OOD setting. These results corroborate our hypothesis that alignment improves OOD detection performance.
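One plausible way to realize the alignment used by SR\(_a\) is to warp each series onto a class reference along the DTW optimal path, as sketched below; this is an illustrative assumption rather than the exact alignment procedure used in our experiments.

```python
import numpy as np

def dtw_align_to_reference(x, ref):
    """Hypothetical alignment step: warp a univariate series `x` onto a
    reference `ref` along the DTW optimal path, averaging the values of
    `x` mapped to each reference timestep."""
    n, m = len(x), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Backtrack the optimal warping path from (n, m) to (1, 1).
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1

    # Average the x-values mapped to each reference index.
    aligned = np.zeros(m)
    counts = np.zeros(m)
    for xi, rj in path:
        aligned[rj] += x[xi]
        counts[rj] += 1
    counts[counts == 0] = 1
    return aligned / counts
```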
SR performance using the F1 score. In addition to the AUROC score, we use the F1 score to assess the effectiveness of the SR score in detecting OOD examples. Table 9 compares the SR score and the LR score. Consistent with the AUROC evaluation, we make similar observations for the F1 score: (1) The SR score outperforms the LR score in identifying OOD examples in 60% of the total experiments, which again indicates that the improvement is due to a better scoring function. (2) For the in-domain OOD setting, the F1 score of LR is mostly lower than that of SR. (3) For the cross-domain setting, SR outperforms LR in 66% of the cases. Hence, we conclude that SR is better than LR in terms of OOD performance measured by the F1 metric.
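For reference, the F1 values in Table 9 can be computed by thresholding the scores with the ID-estimated interval and comparing against ID/OOD labels; the sketch below uses scikit-learn with illustrative stand-in scores and thresholds.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative SR scores and a threshold interval estimated on ID data.
scores = np.array([0.4, 0.9, 1.1, 2.3, 0.2, 1.0])
labels = np.array([1, 0, 0, 1, 1, 0])      # 1 = OOD, 0 = ID
tau_low, tau_high = 0.7, 1.5

# An example is predicted OOD when its score falls outside the interval.
preds = ((scores < tau_low) | (scores > tau_high)).astype(int)
print("F1:", f1_score(labels, preds))
```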
Table 9. F1 Metric Results of LR and SR\(_a\) on Different Datasets for Both In-domain and Cross-domain Settings
AUROC performance of SR scoring on the full multivariate UCR archive. For completeness, Table 10 reports the performance of SR on all the UCR multivariate datasets in terms of the AUROC score. These results demonstrate that the proposed SR scoring approach is general and effective across all evaluated time-series datasets.
Table 10. AUROC Results for LR and SR\(_a\) on Different Datasets for Both In-domain and Cross-domain OOD Settings
Inference runtime comparison of the different OOD detection algorithms. Tables 11 and 12 compare the number of parameters and the runtime of the different OOD detection methods for time series. Intuitively, both the HL and SR methods have more parameters than LL and LR, as the latter two rely on a single VAE model to compute the OOD score. However, LR has the longest score-computation runtime because it performs new training iterations to compute the OOD score of each example. In contrast, the SR algorithm runs only a single inference pass per example and then computes the ratio between the two estimated likelihoods. This makes SR a fast and accurate OOD detector.
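The runtime comparison in Table 12 reduces to timing the per-example scoring procedure of each method; a minimal timing harness is sketched below, where `score_fn` stands in for any of the OOD scoring functions (e.g., SR's single inference pass or LR's per-example retraining).

```python
import time

def time_per_example(score_fn, examples):
    """Average wall-clock time (seconds) to score one example with a
    given OOD scoring function."""
    start = time.perf_counter()
    for x in examples:
        score_fn(x)
    return (time.perf_counter() - start) / max(len(examples), 1)
```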
Table 11. Number of Parameters of Each DNN Used by the Different OOD Methods
Table 12. OOD Inference Runtime Comparison on the Different Datasets Using Different OOD Methods

7 Summary and Future Work

We introduced a novel seasonal ratio (SR) score to detect out-of-distribution (OOD) examples in the time-series domain. SR scoring relies on Seasonal and Trend decomposition using Loess (STL) to extract class-wise semantic patterns and remainders from time-series signals, and on deep generative models to estimate class-wise conditional likelihoods for both the input time series and its remainder. The SR score of a given time-series signal, together with the threshold interval estimated from the in-distribution data, enables OOD detection. Our experimental results demonstrate the effectiveness of SR scoring and the alignment method in detecting time-series OOD examples compared with prior methods. Immediate future work includes applying SR score-based OOD detection to support the generation of synthetic time-series data in small-data settings.

8 Acknowledgments

The authors would like to thank Alan Fern for the useful discussions regarding the key assumption behind the seasonal ratio scoring approach.

