1 Introduction
Sleep accounts for roughly one-third of a human life and is directly related to one's physical and mental well-being. As a fundamental technique for disease monitoring [13], management, and intervention, sleep stage classification has remarkable practical significance in healthcare [6]. The two principal standards governing sleep stage classification are the Rechtschaffen & Kales (R&K) criteria [49] and the American Academy of Sleep Medicine (AASM) criteria [4]. Based on these widely accepted international standards, sleep monitoring is indispensable in many healthcare areas. Notably, brain disorders such as aphasia, epilepsy, and Parkinson's disease exhibit intricate and close associations with sleep disorders, prompting extensive research into the application of sleep monitoring in the intervention of brain disorders [43]. Christensen et al. [11] employed electroencephalography (EEG) monitoring equipment and data-driven analytical methods to reveal sleep characteristics in patients with insomnia. Coelli et al. [12] conducted benchmark research on sleep monitoring in epileptic patients, using a multiscale functional clustering approach to survey epileptic networks in various sleep stages. In Parkinson's disease, sleep disorders represent the most frequent non-motor symptoms, and monitoring sleep quality offers an effective way to anticipate Parkinson's disease onset and track disease progression [27].
The conventional method of sleep stage classification requires professional medical experts to manually analyze the Polysomnography (PSG) signals of subjects [51]. This approach is time-consuming, inefficient, and labour intensive. Moreover, its results are subjective and easily influenced by the expertise and experience of the analysts [26]. The development of artificial intelligence has led to the emergence of automatic sleep stage classification approaches that significantly improve accuracy and efficiency [26]. Typically, these methods extract time-frequency transformation features from the raw PSG signal and employ machine learning methods such as Random Forest [38], Support Vector Machine (SVM) [1], and K-Nearest Neighbor [52] to build the final classification model. However, these methods require significant prior knowledge for feature extraction and processing. The emergence of deep learning has brought many advancements in the accuracy and efficiency of sleep stage classification. Deep learning-based sleep stage classification methods employ end-to-end neural networks for feature extraction and model construction. Convolutional neural networks (CNN) have been employed to extract spatial sleep features from the PSG signal [40, 48]. Goshtasbi et al. proposed a fully convolutional neural network called SleepFCN [18], which utilizes residual dilated causal convolutions to capture temporal context information and thus enhances the accuracy and speed of recognition. Recurrent neural networks (RNN) have also been used to extract temporal features related to sleep from the PSG signal [8, 41, 54]. Furthermore, Long Short-Term Memory (LSTM) networks [15, 40] have been utilized to address the issue of forgetting over long time-series signals. Zhao et al. proposed SleepContextNet [57], which utilizes a CNN-LSTM model structure combined with data augmentation techniques, significantly improving classification accuracy. Wang et al. [45] proposed a novel multi-scale attention mechanism incorporating channel and spatial attention, resulting in exceptional classification accuracy. Phan et al. proposed SeqSleepNet [33] to address sleep stage classification as a sequence-to-sequence classification problem. To achieve interpretability at the epoch and sequence level and improve classification accuracy, they further developed SleepTransformer [34], the first transformer-based sleep stage classification model, which achieved state-of-the-art performance. To address the issue of heterogeneity among physiological signals, Zhu et al. proposed MaskSleepNet [59]. This model learns the joint distribution of masked and non-masked modalities by leveraging partially masked signals, and uses multi-scale convolution and multi-head attention to extract features and make predictions at sub-scales, respectively. In addition, researchers have utilized sparse autoencoders to categorize pre-extracted time-frequency features [44], and generative adversarial network models have been used for EEG and electrocardiography (ECG) signal generation to improve related classification tasks [17].
However, the abovementioned models are better suited to extracting features from grid or image data. They do not exploit the functional connectivity relationships of brain structures contained in the PSG signal. Furthermore, the brain's cerebral cortex forms a non-Euclidean space, making a graph structure well suited for representing the feature distribution of brain space. Correspondingly, graph convolutional networks (GCN) have been widely employed and work well on graph-structured data [58]. Although existing studies have achieved acceptable sleep stage classification accuracy [21, 23, 28], these approaches have not addressed a core challenge of PSG-based sleep stage classification: it depends on the combination of multiple physiological signals, including EEG, ECG, electrooculography (EOG), and electromyography (EMG) signals, which vary significantly across different subjects [10]. For instance, the EEG signal can be affected by electrode drift and subjects' hair, while the EMG signal can be affected by muscle fatigue, skin resistance, and muscle strength [56]. The challenge of subject dependence limits the adaptability of sleep stage classification models, as models trained on certain subjects cannot be applied to new subjects. However, most existing methods only modify the feature extractor based on graph models without focusing on improving subject independence. Furthermore, obtaining and labeling sleep stage classification data is complex and requires professional medical expertise [39], making it impractical to train a new model for each new subject with their own data.
Fortunately, the development of transfer learning has provided hope for achieving subject-independent sleep stage classification [30, 60]. Researchers have begun to focus on improving model generalization. Jia et al. proposed the MSTGCN model [22], which integrates domain generalization [5, 46] and a spatio-temporal GCN, using the domain adversarial (DA) method to improve the model's robustness across subjects. Tang et al. [42] employed the Maximum Mean Discrepancy (MMD) [19] method to reduce the distribution difference between the training and testing data of the ECG signal. Most other transfer learning-based sleep stage classification methods follow the pre-training and fine-tuning paradigm to enhance prediction accuracy [2]. However, this paradigm has many limitations because it requires target data. Moreover, these methods ignore the structural characteristics of the sleep stage classification problem, resulting in limited improvement. To tackle the aforementioned challenges, we fuse the sleep stage classification problem with domain generalization [31], culminating in the proposal of a Structure Incentive Domain Adversarial learning (SIDA) method to augment the subject generalization of sleep stage classification models. As shown in Figure 1, the inspiration for the SIDA method comes from the structure of the sleep cycle. During an entire sleep episode, there are typically five complete sleep cycles [16], each consisting of five stages from the Wakefulness (Wake) stage to the Rapid Eye Movement (REM) stage and back [7]. The sleep stage categories themselves are limited to five distinct stages, and each stage may exhibit unique subject dependencies. We generalize the problems caused by this structure as the Subject Dependency Differences of different sleep Categories (SDDC) concept. More specifically, in contrast to traditional domain generalization models, SIDA establishes a distinct domain (i.e., subject) discriminator for every sleep stage to dissociate the subject dependence differences amongst the various sleep stages. This strategy helps the model precisely learn subject- or domain-invariant features. Moreover, we bridge the sleep stage classifier and the domain discriminators in SIDA with direct connections, which positively influences the training process. To our knowledge, this study marks the first effort to define the SDDC notion precisely. Leveraging the category structure of PSG-based sleep stage classification, we introduce the SIDA method to attain optimal cross-subject sleep stage classification. Notably, we use the leave-one-subject-out cross-validation method to rigorously validate our approach: we train the classification model on data from existing seen subjects and test the trained model on data from an unseen subject, and we validate and choose the final model on separate validation data randomly selected from the training data. We evaluate the effectiveness of the proposed SIDA method on three benchmark sleep stage classification datasets (i.e., ISRUC-S1 [24], ISRUC-S3 [24], and the Sleep Heart Health Study Visit 1 (SHHS1) [36, 53]). The experimental results indicate that our proposed SIDA method outperforms the compared methods and delivers the best cross-subject sleep stage classification results. In conclusion, the primary contributions of this study can be summarized as follows:
— We clearly define the SDDC concept and open up the idea of handling the category-level subject dependence challenge from the perspective of transfer learning.
— We propose the SIDA method, a domain generalization method that realizes category-by-category subject dependency alignment and achieves direct soft weighting between the classifier and the discriminators.
— Our proposed SIDA method is plug-and-play and can easily be combined with existing methods. Extensive experiments on three public sleep stage classification datasets demonstrate that existing sleep stage classification methods are improved by combining them with our SIDA method.
3 Preliminaries and Motivation
3.1 Sleep Stage Classification Problem
PSG is often employed to record various human body electrical signals during sleep. It contains multi-channel EEG, ECG, EOG, and EMG signals. For sleep stage classification, the PSG signal can be segmented into multi-channel segments of 30-second epochs. According to the AASM standard, sleep is divided into five stages: Wake, REM, N1, N2, and N3, corresponding to the five categories in sleep stage classification.
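To make the epoch segmentation concrete, the following minimal Python sketch splits a continuous multi-channel recording into 30-second epochs; the array layout, sampling rate, and function name are illustrative assumptions rather than part of any dataset's actual preprocessing pipeline.

```python
# Minimal sketch of 30-second epoch segmentation (NumPy only); the array shape,
# sampling rate, and variable names are illustrative assumptions.
import numpy as np

def segment_into_epochs(psg: np.ndarray, fs: int, epoch_sec: int = 30) -> np.ndarray:
    """Split a continuous multi-channel recording (channels, samples)
    into consecutive epochs of `epoch_sec` seconds each."""
    samples_per_epoch = fs * epoch_sec
    n_epochs = psg.shape[1] // samples_per_epoch          # drop the trailing partial epoch
    trimmed = psg[:, : n_epochs * samples_per_epoch]
    # -> (n_epochs, channels, samples_per_epoch)
    return trimmed.reshape(psg.shape[0], n_epochs, samples_per_epoch).transpose(1, 0, 2)

# Example: 10 channels sampled at 200 Hz for one hour of sleep.
psg = np.random.randn(10, 200 * 3600)
epochs = segment_into_epochs(psg, fs=200)
print(epochs.shape)  # (120, 10, 6000): 120 thirty-second epochs
```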
Sleep stage classification aims to make the model learn the mapping relationship between the input signal and the sleep stage category. The sleep stage classification problem is defined as \(\hat{y}_i = G_y(G_f(x_i))\), i.e., building a sleep stage classification model for the input sample \(x_i\), where \(G_f\) is the feature extractor and \(G_y\) is the label classifier. Given the input signal sequence \(\mathcal{S} = (S_{i-d},\ \ldots,\ S_i,\ \ldots,\ S_{i+d}) \in \mathbb{R}^{N \times T_n \times T_s}\), where \(N\) denotes the number of channels, \(T_s\) denotes the time-series length of each epoch, and \(T_n = 2d+1\) denotes the number of neighbouring epochs, \(\mathcal{S}\) represents the temporal context of \(S_i\). The classification model jointly predicts the stage of the \(i\)th epoch according to the transition characteristics of sleep stage rules [9]. Features of each sleep epoch are pre-extracted by the dual-channel FeatureNet [22], and the \(N\)-channel feature matrix of the \(i\)th epoch is defined as \(X_i = (x^1_i,\ x^2_i,\ \ldots,\ x^N_i)^T \in \mathbb{R}^{N \times F}\), where \(x_i^n \in \mathbb{R}^F,\ n \in \lbrace 1, 2, \ldots, N\rbrace\), denotes the features pre-extracted from channel \(n\) at epoch \(i\). Features are sometimes preprocessed by bandpass filters according to the frequency distribution of the different signals; however, current sleep stage classification methods generally use the full unfiltered features.
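The following minimal PyTorch sketch illustrates the \(\hat{y}_i = G_y(G_f(\cdot))\) decomposition operating on a batch of pre-extracted \(N \times F\) feature matrices; the layer sizes and module definitions are illustrative assumptions and do not reproduce FeatureNet or the architecture used in this work.

```python
# A minimal PyTorch sketch of the y_hat = G_y(G_f(x)) formulation for one epoch,
# operating on the pre-extracted N x F feature matrix X_i; sizes are assumptions.
import torch
import torch.nn as nn

N_CHANNELS, FEAT_DIM, N_STAGES = 10, 256, 5     # N, F, and the five sleep stages

class FeatureExtractor(nn.Module):              # G_f
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, N, F) -> (B, N*F)
            nn.Linear(N_CHANNELS * FEAT_DIM, 128),
            nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class LabelClassifier(nn.Module):                # G_y
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, N_STAGES)
    def forward(self, h):
        return self.fc(h)                        # stage logits

G_f, G_y = FeatureExtractor(), LabelClassifier()
X_i = torch.randn(32, N_CHANNELS, FEAT_DIM)      # a batch of feature matrices X_i
logits = G_y(G_f(X_i))
print(logits.shape)                              # torch.Size([32, 5])
```

In the actual pipeline the temporal context \(\mathcal{S}\) of \(T_n\) neighbouring epochs is fed jointly, but the single-epoch case above suffices to fix the tensor shapes.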
3.2 Domain Generalization
Suppose we have \(M\) subjects. We randomly divide the \(M\) subjects into \(M^{\prime}\) groups, where \(M^{\prime} = \lfloor \frac{M}{num} \rfloor\) and \(num\) is the number of subjects in each group. Group \(m^{\prime} = \lbrace m_1^{\prime}, \ldots, m_{num}^{\prime}\rbrace\), where \(\lbrace m_1^{\prime}, \ldots, m_{num}^{\prime}\rbrace\) is sampled at random without replacement from the set \(\lbrace 1, \ldots, M\rbrace\). The data of the \(M^{\prime}\) groups constitute \(M^{\prime}\) domains (i.e., \(\mathcal{D}_{m^{\prime}} = \lbrace (x_{m^{\prime},k}, y_{m^{\prime},k}) \mid k \in \lbrace 1, \ldots, K\rbrace \rbrace\), where \(K\) denotes the number of samples of \(\mathcal{D}_{m^{\prime}}\)), and the joint distributions of each pair of domains are different (i.e., \(P^{j_1}_{XY} \ne P^{j_2}_{XY},\ 1 \le j_1 \ne j_2 \le M^{\prime}\)). Cross-subject classification is then the following process. Suppose the samples of the first \(M^{\prime}-1\) domains constitute \(\mathcal{D}_{train} = \lbrace \mathcal{D}_1, \ldots, \mathcal{D}_{M^{\prime}-1} \rbrace = \lbrace (x_j, y_j, d_j) \mid j \in \lbrace 1, \ldots, J\rbrace \rbrace\), where \(x_j\) denotes the training sample (i.e., the pre-extracted feature), \(y_j\) denotes the sleep stage label, \(d_j \in \lbrace 1, \ldots, M^{\prime}-1\rbrace\) denotes the subject domain label, and \(J\) is the total number of samples across the \(M^{\prime}-1\) domains. The samples of the \(M^{\prime}\)th domain constitute \(\mathcal{D}_{test} = \mathcal{D}_{M^{\prime}} = \lbrace (x_j^{te}, y_j^{te}, d_j^{te})\rbrace\), where \(x_j^{te}\) denotes the test sample (i.e., the pre-extracted feature), \(y_j^{te}\) denotes the sleep stage label, and \(d_j^{te}\) denotes the subject domain label.
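The following sketch illustrates, under simplified assumptions, how subjects can be grouped into domains and how the held-out domain forms the test set; the `records` structure, group size, and function names are hypothetical and only mirror the notation above.

```python
# A minimal sketch of the subject-to-domain grouping and the cross-subject split
# described above; data structures and names are illustrative assumptions.
import random

def make_domains(subject_ids, num, seed=0):
    """Randomly partition subjects into groups of size `num`; each group is one domain."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_groups = len(ids) // num                          # M' = floor(M / num)
    return [ids[g * num:(g + 1) * num] for g in range(n_groups)]

def cross_subject_split(records, domains, test_domain):
    """records: dict subject_id -> list of (x, y) samples.
    Returns train triples (x, y, d) from the seen domains and test triples
    from the single held-out domain."""
    train, test = [], []
    for d, group in enumerate(domains):
        for sid in group:
            for x, y in records[sid]:
                (test if d == test_domain else train).append((x, y, d))
    return train, test

domains = make_domains(range(10), num=2)                # M = 10 subjects, 5 domains
records = {sid: [(f"x_{sid}_{k}", k % 5) for k in range(3)] for sid in range(10)}
train_set, test_set = cross_subject_split(records, domains, test_domain=4)
print(len(train_set), len(test_set))                    # 24 6
```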
3.3 Motivation
We aim to enhance our model's cross-subject sleep stage classification robustness through domain generalization. Domain generalization eliminates differences between domains (i.e., subjects) through domain alignment. The alignment process aims to align all data of each domain without distinction. However, the biggest challenge in classification tasks is always the category difference, as different sleep stage categories have different subject dependencies. As illustrated in Figure 2, different shapes represent different sleep stage categories, and different colors represent different subject domains. When aligning subject data, if data of the same category are correctly aligned (i.e., the green box in the figure, a positive transfer), the alignment enhances the model's cross-subject generalization and improves its classification accuracy. However, if data from different categories are incorrectly aligned (i.e., the red box in the figure, a negative transfer), the alignment severely degrades the model's classification accuracy. Inspired by the subject dependency differences across sleep stage categories, we aim to align subjects in a fine-grained, category-by-category way. Fortunately, the sleep stage classification problem for subject generalization has a category-complete and recurrent structure with domain supervision information. This structure motivates us to propose the category-specific domain adversarial method SIDA.
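As a conceptual illustration of category-specific domain adversarial training, the sketch below attaches one subject-domain discriminator per sleep stage behind a gradient reversal layer. All module sizes, the routing of samples by ground-truth labels, and the loss weighting are simplifying assumptions for exposition; they should not be read as the actual SIDA implementation, which is detailed in the following sections.

```python
# Conceptual PyTorch sketch of category-specific domain adversarial training
# (one subject-domain discriminator per sleep stage); sizes and routing are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None        # reverse gradients toward the extractor

N_STAGES, N_DOMAINS, HID = 5, 9, 128

feature_extractor = nn.Sequential(nn.Linear(256, HID), nn.ReLU())   # G_f
stage_classifier = nn.Linear(HID, N_STAGES)                          # G_y
# One domain discriminator per sleep stage category.
domain_discriminators = nn.ModuleList(
    [nn.Linear(HID, N_DOMAINS) for _ in range(N_STAGES)]
)

def losses(x, y, d, lamb=0.1):
    h = feature_extractor(x)
    stage_logits = stage_classifier(h)
    cls_loss = nn.functional.cross_entropy(stage_logits, y)
    # Adversarial loss: each stage's samples go to that stage's own discriminator.
    adv_loss = h.new_zeros(())
    h_rev = GradReverse.apply(h, lamb)
    for c in range(N_STAGES):
        mask = y == c
        if mask.any():
            dom_logits = domain_discriminators[c](h_rev[mask])
            adv_loss = adv_loss + nn.functional.cross_entropy(dom_logits, d[mask])
    return cls_loss + adv_loss

x = torch.randn(16, 256)
y = torch.randint(0, N_STAGES, (16,))
d = torch.randint(0, N_DOMAINS, (16,))
print(losses(x, y, d).item())
```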