
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 69, NO. 2, FEBRUARY 2022

KDnet-RUL: A Knowledge Distillation Framework to Compress Deep Neural Networks for Machine Remaining Useful Life Prediction

Qing Xu, Zhenghua Chen, Keyu Wu, Chao Wang, Min Wu, and Xiaoli Li

Abstract—Machine remaining useful life (RUL) prediction is vital in improving the reliability of industrial systems and reducing maintenance cost. Recently, long short-term memory (LSTM) based algorithms have achieved state-of-the-art performance for RUL prediction due to their strong capability of modeling sequential sensory data. In many cases, the RUL prediction algorithms are required to be deployed on edge devices to support real-time decision making, reduce the data communication cost, and preserve the data privacy. However, the powerful LSTM-based methods which have high complexity cannot be deployed to edge devices with limited computational power and memory. To solve this problem, we propose a knowledge distillation framework, entitled KDnet-RUL, to compress a complex LSTM-based method for RUL prediction. Specifically, it includes a generative adversarial network based knowledge distillation (GAN-KD) for disparate architecture knowledge transfer, a learning-during-teaching based knowledge distillation (LDT-KD) for identical architecture knowledge transfer, and a sequential distillation upon LDT-KD for complicated datasets. We leverage simple and complicated datasets to verify the effectiveness of the proposed KDnet-RUL. The results demonstrate that the proposed method significantly outperforms state-of-the-art KD methods. The compressed model with 12.8 times less weights and 46.2 times less total float point operations even achieves a comparable performance with the complex LSTM model for RUL prediction.

Index Terms—Generative adversarial network (GAN), knowledge distillation (KD), model compression, remaining useful life (RUL) prediction.

Manuscript received August 27, 2020; revised November 7, 2020 and December 19, 2020; accepted January 17, 2021. Date of publication February 9, 2021; date of current version October 27, 2021. This work was supported in part by the A*STAR Industrial Internet of Things Research Program under the RIE2020 IAF-PP Grant A1788a0023, in part by the National Key Research and Development Program of China under Grant 2017YFA0700900 and Grant 2017YFA0700903, and in part by the National Science Foundation of China under Grant 61976200. (Corresponding author: Zhenghua Chen.)

Qing Xu, Zhenghua Chen, Keyu Wu, Min Wu, and Xiaoli Li are with the Institute for Infocomm Research, Singapore 138632, Singapore (e-mail: xu_qing@i2r.a-star.edu.sg; chen0832@e.ntu.edu.sg; wu_keyu@i2r.a-star.edu.sg; wumin@i2r.a-star.edu.sg; xlli@i2r.a-star.edu.sg).

Chao Wang is with the School of Computer Science, University of Science and Technology of China, Hefei 230052, China (e-mail: cswang@ustc.edu.cn).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TIE.2021.3057030.

Digital Object Identifier 10.1109/TIE.2021.3057030

I. INTRODUCTION

MACHINE remaining useful life (RUL) prediction is of great importance for real industry [1]–[5]. It is able to reduce the maintenance cost and improve the reliability of industrial systems. However, accurate prediction of machine RUL is still challenging due to the high complexity of modern industrial systems. To predict machine RUL, many advanced methods have been developed. Generally, they can be divided into two different categories, i.e., model-based and data-driven. Model-based solutions intend to explicitly model the relationship between sensory data and RUL [6], [7]. Since the industrial systems become more and more complex, the explicit modeling is extremely difficult. Alternatively, data-driven solutions aim to learn the relationship directly from data without knowing the physical model of a system [8], [9]. They become very promising techniques for RUL prediction, especially for complicated industrial systems.

Conventional machine learning algorithms are widely used data-driven methods to predict machine RUL [8], [10]. For conventional machine learning based RUL prediction, the first step is to perform feature engineering, which manually extracts representative features from the sensory data based on expert knowledge. Then, machine learning algorithms, such as support vector regression, decision tree, random forest, etc., can be adopted to predict RUL based on the extracted features. However, conventional machine learning based RUL prediction requires extracting features based on domain knowledge, which may not be available all the time. Besides, the feature extraction and RUL prediction cannot be jointly optimized in conventional machine learning methods, which also hinders their performance.

Recently, deep learning has achieved great successes in many challenging domains, including RUL prediction [11], [12]. The greatest merit of deep learning is that it is able to automatically learn representative features from data without human intervention and perform RUL prediction simultaneously, leading to a superior performance. One of the most popular deep learning algorithms is the convolutional neural network (CNN), which has achieved remarkable performance for image classification [13]. Due to the unique structure of CNN, it is very efficient for feature learning and can be trained in parallel. It has also been utilized for RUL prediction and outperformed conventional machine learning algorithms [14], [15].



Another popular deep learning algorithm for RUL prediction is long short-term memory (LSTM), which is specifically designed for analyzing sequential data with temporal information [16], [17]. Since the sensory data for machine RUL prediction are typical time series with temporal information, the LSTM network is naturally suitable for RUL prediction. Recent studies [16]–[18] have shown that the LSTM outperforms the CNN for RUL prediction. However, the LSTM generally has much higher computational complexity than CNN due to its unique structure of cascade connection. In many real-world scenarios, the RUL prediction algorithms need to be deployed on edge devices, which have limited computational resources and memory, for timely response and security concerns. Thus, the industry generally prefers a learning algorithm which can achieve accurate RUL prediction and is also very efficient (e.g., small size and fast inference). The current deep learning algorithms are either too complicated or with limited performance.

To deal with these issues, model compression techniques have been proposed to compress deep neural networks for edge deployment. For instance, parameter quantization methods [19], [20] compress the original network by using fewer bits to represent the weights. They can achieve significant speedup but also result in accuracy loss [21]. Another commonly used method for model compression is weight pruning [22], which aims to remove unnecessary parameters in a trained deep neural network. Although weight pruning is able to reduce model storage size, it cannot improve the efficiency in terms of training or inference time. Other methods like matrix decomposition [23], [24] have also shown the capability of reducing model size, but they only address the storage complexity issue of deep models and have similar drawbacks as the weight pruning method. Relatively, knowledge distillation (KD) has shown great promise in reducing not only model storage size but also model complexity [25], [26].

In this article, we propose a novel KD framework, entitled KDnet-RUL, to compress deep learning models for RUL prediction. Specifically, we first design a generative adversarial network based knowledge distillation (GAN-KD) for disparate architecture knowledge transfer, which distills the knowledge from a powerful and complicated LSTM model to a simple CNN model. Then, a learning-during-teaching based knowledge distillation (LDT-KD) for identical architecture knowledge transfer is proposed to enhance the performance of the CNN model learned by GAN-KD. For complicated RUL prediction scenarios, e.g., data with multiple operation conditions, we leverage a sequential distillation scheme upon the LDT-KD for accurate and robust RUL prediction. The performance of the proposed KDnet-RUL method is evaluated by using both simple and complex datasets.

The main contributions of the proposed method are summarized as follows.
1) We propose a KD framework, named KDnet-RUL, which distills knowledge from a complicated LSTM model to a simple CNN model for efficient RUL prediction. The efficient CNN model can thus be deployed on resource-constrained edge devices.
2) For KD between disparate architectures, i.e., from LSTM to CNN, a GAN-KD method is proposed. It attempts to minimize the discrepancy between the features learned from LSTM and CNN by using a GAN technique.
3) To enhance the performance of CNN, we propose an LDT-KD method for KD between identical architectures.
4) In complicated scenarios where multiple working conditions are involved for RUL prediction, we propose a sequential distillation scheme upon LDT-KD to further enhance the performance of the learned CNN model.

The rest of the article is organized as follows. Section II reviews some related works on RUL prediction and KD. Section III presents the deep neural networks for RUL prediction, followed by the disparate and identical architecture knowledge transfer. Section IV first describes the data for evaluation and the experimental setup. Then, the experimental results, ablation study, and sensitivity analysis are introduced. Section V concludes this article.

II. RELATED WORK

A. RUL Prediction

Deep learning for RUL prediction has gained increasing attention due to its ability of modeling the complex machinery degradation process [27]. Various deep learning methods, such as CNN and LSTM, have been shown to be effective for RUL prediction tasks. Babu et al. [14] proposed a novel CNN-based model to estimate the RUL of airplane engines by using sliding windows on the raw sensory data as input samples. Instead of directly feeding the raw sensory data into CNN models, Zhu et al. [15] transformed the sensory data to derive the time-frequency representation of each sample. Then, a multiscale convolutional neural network was developed with these samples for RUL prediction. Even though the CNN-based models have already outperformed traditional methods, such as multilayer perceptron (MLP) and support vector machines, they are not naturally designed for sensory data with temporal information.

To better capture the temporal information of sensory data, Zheng et al. [16] employed an LSTM network to model the long-term dependency characteristic of data for RUL prediction. Hence, such an LSTM method achieved a better performance than traditional machine learning and CNN approaches. Thereafter, several LSTM-based approaches, such as bidirectional LSTM [17] and attention-based LSTM [18], were proposed to further improve RUL prediction accuracy. However, LSTM-based models often have high computational complexity, and, thus, it is very difficult to deploy them on edge devices with limited computing resources. To address this problem, model compression methods can be adopted for LSTM models to reduce their complexity and preserve the performance as much as possible.


B. Knowledge Distillation

KD, also known as a teacher–student strategy, is widely applied for model compression. It was first introduced by [28], which refers to training a shallow network (i.e., Student) by mimicking the output of a larger and deeper network (i.e., Teacher). Hinton et al. further generalized it by introducing a temperature variable to soften the logits from the cumbersome model as a "soft target" [29].

Subsequently, various methods have been proposed for efficient knowledge transfer between the teacher and student for model compression. To improve the generalization capability of a thin but deep student, Romero et al. [30] introduced a hint-based pretraining strategy to guide the student to learn intermediate feature representations close to the teacher's. The authors in [31] proposed to transfer the attention maps with different levels from a teacher network and showed significant improvement. Tian et al. [32] proposed a contrastive learning approach to force the student to generate close representations as the teacher for the same inputs, while generating distant representations for different inputs. GAN-based architectures were also adopted to align the source of logits [33] or feature maps [26] for knowledge transfer.

Note that the softened logits output of the teacher in the aforementioned classification tasks can provide additional knowledge about the correlations of class labels [29]. Therefore, most of the previous KD studies focus on classification tasks. In fact, KD is also suitable for regression tasks. Chen et al. [34] introduced a bounded regression loss for KD on bounding-box regression problems. Combined with hint-based learning, the proposed distillation framework can significantly improve the accuracy compared to the baselines. The authors in [25] proposed an attention imitation loss which intended to use the teacher loss as a confidence score for the camera pose regression problem. It allows attentively learning from the predictions in which the teacher has more confidence.

However, most of the previous KD studies focus on transferring knowledge between networks with similar architectures, i.e., the student is a simplified version of the teacher with fewer layers or hidden units. It is not clear whether those KD methods are also suitable for disparate architectures, e.g., an LSTM-based teacher and a CNN-based student. Therefore, to fill this gap, we propose a method named GAN-KD for this scenario. Moreover, due to the inherent difference between LSTM and CNN, we propose a method named LDT-KD to further optimize the CNN-based student learned from GAN-KD.

III. METHODOLOGY

In this section, we present a framework called KDnet-RUL to transfer knowledge between disparate and identical network architectures for RUL prediction. The overall KDnet-RUL pipeline is depicted in Fig. 1. To be specific, a GAN-KD approach is proposed to transfer knowledge between different network structures, i.e., from LSTM to CNN. An LDT-KD approach is proposed to transfer knowledge between identical network structures, i.e., from CNN to CNN. Moreover, a sequential self-distillation scheme upon LDT-KD is designed to further improve the performance of RUL prediction on complex datasets with multiple operating conditions.

Fig. 1. Proposed KDnet-RUL framework. (a) GAN-KD for disparate network architectures. (b) LDT-KD for identical network architectures.

A. Deep Neural Networks for RUL Prediction

To precisely estimate the RUL for mechanical systems, it is desirable to design deep neural networks (e.g., LSTM or CNN) that are capable of modeling the temporal dependency of multivariate sensory data. Such networks normally consist of a feature extractor and a regression module. In particular, the feature extractor extracts the features from the input sensor data. The extracted feature maps are then fed into the regression module to predict the RUL. The regression module generally contains several fully connected (FC) layers.

To demonstrate the effectiveness of our proposed pipeline, we first design an LSTM-based network that serves as a powerful but luxurious teacher, considering that it achieves state-of-the-art performance for RUL prediction [16]–[18]. Subsequently, a dilated CNN-based network is adopted as the student, which ideally can maintain comparable performance as the teacher but with much less complexity. This dilated CNN-based structure has shown promising capability on handling sequential data [35], and, thus, we use it as the student network as shown in Fig. 2.

Fig. 2. Dilated CNN-based student architecture. Conv1D(3, 2, 1) refers to a 1-D convolution layer with kernel size being 3, stride being 2, and dilation being 1.


B. Disparate Architecture Knowledge Transfer

As mentioned earlier, LSTM networks are too complex to be deployed on resource-constrained edge devices. Simple CNN networks are suitable for edge deployment. However, they are usually not able to achieve a desirable performance as LSTM models. To address this dilemma, we first propose a GAN-based KD method called GAN-KD for knowledge transfer between disparate architectures, i.e., from LSTM to CNN. Particularly, as shown in Fig. 3, we distill the knowledge from a complicated LSTM structure to a simple CNN structure in GAN-KD to improve the performance of the CNN model. In our GAN-KD, both the teacher (LSTM) and the student (CNN) consist of a feature extractor and a regression module for RUL prediction. We thus adopt a two-stage training scheme for our GAN-KD. Specifically, we train the feature extractor by feature distillation and the regression module by KD separately. Next, we introduce the feature distillation and KD in detail.

Fig. 3. GAN-KD for disparate network architectures. All network blocks with dash lines are trainable and those with solid lines are locked during training.

1) Feature Distillation: We design a GAN, which contains a generator (G) and a discriminator (D), for feature distillation. In particular, the feature extractor of the CNN-based student is considered as the generator G in the GAN as shown in Fig. 3. Meanwhile, the discriminator D is designed to maximize the similarity between the CNN-based and the LSTM-based feature extractors and thus improves the CNN-based feature extractor.

Given that x is the input sensory data, and x ∈ R^{T×n}, where T is the window size and n is the number of sensors, φ(x) is the output of the feature extractor in the teacher network, while G(x) is the output of the feature extractor in the student network. The discriminator D, as a binary classification network, aims to identify whether a feature map comes from the teacher's or the student's feature extractor. Here, D and G play a two-player mini-max game in which D aims to maximize the probability of correctly classifying φ(x) and G(x) from the teacher and the student, respectively, and G aims to minimize the probability that D will recognize G(x) as coming from the student. The objective function can be expressed as follows:

\min_G \max_D V(D, G) = \mathbb{E}_x[\log(D(\phi(x))) + \log(1 - D(G(x)))].   (1)

At each iteration of the training stage, we first fix G and train D by maximizing the following loss function L_D:

L_D = \log(D(\phi(x))) + \log(1 - D(G(x))).   (2)

Then, we fix D and start to train G by minimizing the probability \log(1 - D(G(x))). We further mix this GAN objective with the L1 distance between the student's and the teacher's features, denoted as L_G:

L_G = \log(1 - D(G(x))) + \lambda \|\phi(x) - G(x)\|_1   (3)

where λ is a hyperparameter to control the contribution of the L1 distance in the final loss L_G. Minimizing L_G can thus help to easily achieve the equilibrium of G generating features as good as the teacher's and D guessing with 50% accuracy.

We alternately repeat the above generator and discriminator training process, i.e., iteratively maximizing L_D and minimizing L_G. Eventually, the student is able to generate feature maps similar to the teacher's.

2) Knowledge Distillation: KD by logits or soft labels has already been proved to be effective for training the student in classification tasks. In this article, we deal with the regression task for RUL prediction. Hence, we attempt to utilize predictions from the teacher for KD, similar to the logits or soft labels in classification tasks. In particular, we define the Soft Loss L_Soft as the difference between the student's prediction and the teacher's prediction in (4). We also have the Hard Loss L_Hard, which is the difference between the student's prediction and the ground truth (i.e., real labels) in (5). The loss function for KD, L_KD, is then defined as the weighted combination of L_Soft and L_Hard in (6):

L_{Soft} = \|\hat{y}_S - \hat{y}_T\|^2,   (4)

L_{Hard} = \|\hat{y}_S - y\|^2,   (5)

L_{KD} = \alpha L_{Soft} + (1 - \alpha) L_{Hard}.   (6)

Here, ŷ_S and ŷ_T represent the predictions of the student and teacher networks, respectively, and y is the ground truth. α is a hyperparameter to adjust the weight of the hard and soft losses. By minimizing the KD loss L_KD, we learn the regression module in the student for RUL prediction.
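To make the procedure above concrete, the following is a minimal sketch of one GAN-KD iteration under the losses in (1)–(6), assuming PyTorch and the hypothetical LSTMTeacher/DilatedCNNStudent sketches given earlier. The discriminator layout, the optimizer setup, and the compression of the two training stages into a single function are illustrative simplifications, not the paper's exact procedure.

```python
# Minimal sketch of GAN-KD training (assuming PyTorch and the sketches above).
# The paper trains the feature extractor (feature distillation) and the regression
# module (KD) in two separate stages; both updates appear in one function here only for brevity.
import torch
import torch.nn as nn

# Hypothetical discriminator: a small binary classifier over feature vectors.
disc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

def gan_kd_step(teacher, student, disc, x, y, opt_d, opt_g, opt_reg, lam=1.0, alpha=0.7):
    with torch.no_grad():
        y_t, feat_t = teacher(x)                   # teacher features phi(x) and predictions (frozen)

    # Discriminator update: maximize L_D in (2), i.e., minimize its negative.
    _, feat_s = student(x)
    d_real, d_fake = disc(feat_t), disc(feat_s.detach())
    loss_d = -(torch.log(d_real + 1e-8) + torch.log(1.0 - d_fake + 1e-8)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update (student feature extractor): minimize L_G in (3),
    # the GAN term plus lambda times the L1 distance to the teacher features.
    _, feat_s = student(x)
    loss_g = torch.log(1.0 - disc(feat_s) + 1e-8).mean() + lam * (feat_t - feat_s).abs().mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Student regression module update: minimize the KD loss L_KD in (4)-(6).
    y_s, _ = student(x)
    soft = (y_s - y_t).pow(2).mean()
    hard = (y_s - y.view(-1, 1)).pow(2).mean()
    loss_kd = alpha * soft + (1.0 - alpha) * hard
    opt_reg.zero_grad(); loss_kd.backward(); opt_reg.step()
    return loss_d.item(), loss_g.item(), loss_kd.item()
```

Here opt_g would be built over the student's feature-extractor parameters and opt_reg over its regression-module parameters, mirroring the two-stage scheme; λ = 1.0 and α around 0.7 follow the values reported in Section IV.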


C. Identical Architecture Knowledge Transfer

In the above section, GAN-KD can help to learn a simple CNN model by distilling the knowledge from LSTM for RUL prediction. However, the learned CNN may not be optimal in terms of prediction performance due to the inherent difference between CNN and LSTM. In this section, we aim to further improve the CNN learned by our GAN-KD via KD between identical network structures.

1) Learning-During-Teaching: A few previous studies [29], [36] have already proven the feasibility of transferring knowledge between models with identical architectures. Hinton et al. [29] demonstrated the effectiveness of distilling knowledge from an ensemble of models into a single model with the same architecture. However, pretraining a set of models for an ensemble is often time-consuming. On the other hand, Yim et al. [36] proposed a flow of solution procedure (FSP) matrix (the relationship of outputs from two layers) to transfer the knowledge flow between two identical deep neural networks (DNNs). However, it is not straightforward how to choose the proper layers to calculate FSP.

In this article, we propose a method called LDT-KD to update both the student and the teacher in a closed-loop process as shown in Fig. 4. Here, the teacher and student in Fig. 4 have the same network structure and the same set of model weights. The teacher in LDT-KD is directly copied from the student learned by GAN-KD. To accelerate the convergence of the teacher model, we pretrain the student in LDT-KD for several epochs with the conventional KD strategy before performing the closed-loop process in Fig. 4. At each training step, we first update the weights of the student with gradient descent under the supervision of the ground truth and the soft labels from the teacher, i.e., by minimizing the KD loss in (6). Second, we update the weights of the teacher using the exponential moving average of the student weights, inspired by the mean teacher model in [37]. It can be expressed as follows:

W_T^{i+1} = \beta W_T^i + (1 - \beta) W_S^i   (7)

where W_T^i and W_S^i represent the weights of the teacher and the student at training step i, respectively. β is a smoothing parameter determining how much historical information of the teacher model will be carried forward for the update. Once the teacher weights are updated, we repeat the above two steps until the stopping criterion is satisfied, e.g., the performance of the teacher on the validation data starts to drop.

Fig. 4. LDT-KD architecture. T is the teacher model, S represents the student model, and GT represents ground truth.

2) Sequential Distillation: Our empirical study shows that the performance of LDT-KD is superior and stable for simple datasets. However, its performance is not consistently good for complex datasets, such as datasets for RUL prediction with multiple operating conditions. To stabilize the model training for RUL prediction, we present a sequential distillation scheme upon the LDT-KD. The sequential distillation was first proposed by Furlanello et al. in [38], where they sequentially distilled the knowledge from a teacher with identical structure to a student. In each generation, a new student is initialized with a different random seed. At the end of the procedure, they employed an ensemble of student models from each generation and achieved a remarkable performance.

We adopt this sequential training idea upon the LDT-KD module as shown in Fig. 5. However, our method differs from [38] (denoted as BAN) in several aspects. First, the weights of the teacher model are simultaneously updated with those of the student model in our proposed LDT-KD, whereas the born-again network (BAN) fixes the weights of the teacher model. Second, the BAN applies an ensemble of multiple students from different generations for the final prediction. However, this ensemble version is too luxurious for edge devices due to the requirement of more storage memory and longer inference time. For our proposed approach, either the final single student or the teacher can be used for RUL prediction, and both the teacher and the final student can generalize well. Third, we empirically show that the implementation of sequential training depends on the datasets. It is only compulsory to perform sequential distillation for complicated datasets, e.g., datasets with multiple operation conditions.

Fig. 5. Sequential distillation upon LDT-KD.
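A minimal sketch of the LDT-KD closed loop just described, assuming PyTorch and the earlier hypothetical network sketches: the student is updated by the KD loss in (6), and the teacher then tracks the student through the exponential moving average in (7). The surrounding training-loop details are illustrative.

```python
# Minimal sketch (assuming PyTorch) of one LDT-KD step: student update by the KD
# loss in (6), then teacher update by the EMA rule in (7).
import copy
import torch

def ldt_kd_step(teacher, student, x, y, opt_s, alpha=0.7, beta=0.99):
    with torch.no_grad():
        y_t, _ = teacher(x)                         # soft labels from the EMA teacher
    y_s, _ = student(x)
    loss = alpha * (y_s - y_t).pow(2).mean() + (1 - alpha) * (y_s - y.view(-1, 1)).pow(2).mean()
    opt_s.zero_grad(); loss.backward(); opt_s.step()

    # Eq. (7): W_T <- beta * W_T + (1 - beta) * W_S
    with torch.no_grad():
        for w_t, w_s in zip(teacher.parameters(), student.parameters()):
            w_t.mul_(beta).add_(w_s, alpha=1.0 - beta)
    return loss.item()

# The teacher starts as a copy of the student learned by GAN-KD, e.g.:
#   teacher = copy.deepcopy(student)
# For sequential distillation, this loop is rerun for a few generations
# (three in the paper for FD002/FD004), carrying the selected teacher over each time.
```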


IV. EXPERIMENTS

In this section, we evaluate the performance of our proposed KDnet-RUL method to distill the knowledge for RUL prediction.

A. Experimental Data and Setup

1) C-MAPSS Dataset: In our experiments, we used the public C-MAPSS dataset for evaluation, which has been widely used in many previous studies for RUL prediction [14], [17], [18]. This dataset simulates the degradation process of turbofan engines. It consists of four subdatasets under varying operating conditions and fault modes. Each subdataset can be further divided into training and testing data, as shown in Table I.

TABLE I. DETAILS OF C-MAPSS DATASET

Each trajectory in the training and testing data corresponds to an engine and consists of 21 sensor measurements for this engine. The training trajectories include all run-to-failure measurements for the engine units, while the testing trajectories only contain the measurements of a certain period during degradation. The objective is to accurately predict the RUL for the testing engines, given their test trajectories.

2) Data Preprocessing: We randomly split each original training dataset into training and validation sets by the ratio of 9:1 in terms of engine units. For instance, we randomly select 90 trajectories from the total 100 trajectories in FD001 for model training and the remaining 10 trajectories for validation. Then, we applied the following data preprocessing methods to all the training, validation, and test data.

First, 7 out of 21 sensors with constant readings (sensor indices 1, 5, 6, 10, 16, 18, and 19) are removed and the remaining 14 sensor measurements are utilized to predict the RUL [17], [18]. A min–max normalization is applied to restrict the measurement values within [0, 1] to speed up the training process. Particularly, FD002 and FD004 have six working conditions and we normalize the data in each working condition for these two datasets. A sliding window with window size T and step size s is adopted to segment the data. For the training and validation data, a sliding window moves with a step size s from the starting cycle to the life-end cycle. For the test data, we extract the last segment with the same window size. As illustrated in Fig. 6, the RUL for the first sample is L − T, and the (i + 1)th sample has an RUL of L − T − s·i, where L is the total engine life cycle.

Fig. 6. Data preprocessing.

In practice, the degradation of system components at the initial stage is not significant and can be negligible. Meanwhile, the system's health degrades along with time when it is getting to the end-of-life. Therefore, we follow the previous studies [18], [39], [40] and apply the piecewise RUL. In particular, if the true RUL is larger than the maximal RUL, it is set to RUL_max instead, as shown in the following equation:

RUL = \begin{cases} L - T - s \cdot i, & \text{if } RUL < RUL_{max} \\ RUL_{max}, & \text{otherwise.} \end{cases}   (8)

Following the previous studies [18], [39], [40], T, s, and RUL_max are set to be 30, 1, and 130 in our experiments, respectively.

3) Experimental Setup: In our experiments, the teacher of our GAN-KD is a five-layer LSTM model with 32 hidden units in each layer as the feature extractor and two FC layers as the regression module. After proper training and hyperparameter tuning, we can obtain a decent performance for the teacher on the RUL prediction task. Subsequently, we develop a compact student, which consists of a feature extractor with a dilated CNN structure and a regression module with two FC layers. We denote the student model, which is first trained from scratch under the supervision of ground truth only, as "Student Only." The proposed GAN-KD is evaluated by training a new student under the supervision of both the pretrained LSTM-based teacher and the ground truth. We use the validation set to choose the student model and validate its performance on the test set. The selected student is further employed as the teacher at the open-loop pretrain stage of identical architecture knowledge transfer by using the LDT-KD. Similarly, for the sequential self-distillation, we use the teacher selected by the validation set in the previous generation as the teacher for the next generation.

The proposed KDnet-RUL consists of GAN-KD, LDT-KD, and sequential distillation, and some hyperparameters need to be determined. Specifically, we set a batch size of 256, a learning rate of 1e-3, the Adam optimizer, and 160 training epochs for the proposed method. For GAN-KD, we choose λ = 1.0 in (3). A grid search is adopted to identify the α in (6) from the range α ∈ [0.0, 1.0] with a step size of 0.1. For LDT-KD, we use β = 0.99 for the smoothing parameter. Considering the randomness of model initialization, all reported results are the average of five repeats.

FD001 and FD003 are relatively simple datasets with only one working condition. The empirical study shows that the sequential distillation scheme upon LDT-KD is not required on these simple datasets. Whereas, for the complicated FD002 and FD004 datasets with multiple working conditions, we adopt the sequential distillation upon LDT-KD to further improve the performance. Specifically, in experiments, we empirically find that three generations are adequate to achieve a satisfactory performance on FD002 and FD004.

4) Evaluation Metrics: Same as previous works, two commonly used metrics are adopted to validate the proposed method, i.e., the root mean square error (RMSE) and the Score function. The RMSE is a standard way to measure the error of model predictions, which is defined as follows:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}   (9)

where ŷ_i and y_i are the predicted RUL and the true RUL, respectively, and N is the total number of samples. The Score function, defined as (10), was designed to place more penalization on late predictions than on early predictions, as late predictions may lead to more serious catastrophic consequences. Same as the RMSE, the lower the Score is, the better performance the model can achieve:

Score = \begin{cases} \sum_{i=1}^{N} \left(e^{-\frac{\hat{y}_i - y_i}{13}} - 1\right), & \text{if } \hat{y}_i < y_i \\ \sum_{i=1}^{N} \left(e^{\frac{\hat{y}_i - y_i}{10}} - 1\right), & \text{otherwise.} \end{cases}   (10)
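For concreteness, the experimental protocol above can be summarized in a minimal NumPy sketch; the paper provides no code, so the array layout and function names are assumptions. It covers windowing a run-to-failure trajectory with piecewise RUL labels as in (8), and the RMSE and Score metrics in (9) and (10).

```python
# Minimal sketch of the data preparation and evaluation metrics described in this
# section (T = 30, s = 1, RUL_max = 130); assumes NumPy and that the trajectory is
# already min-max normalized with the constant sensors removed.
import numpy as np

def make_samples(trajectory, T=30, s=1, rul_max=130):
    """trajectory: (L, n_sensors) run-to-failure sequence of one engine."""
    L = trajectory.shape[0]
    windows, labels = [], []
    for i, start in enumerate(range(0, L - T + 1, s)):
        windows.append(trajectory[start:start + T])
        rul = L - T - s * i                      # RUL of the (i+1)th window
        labels.append(min(rul, rul_max))         # piecewise RUL, eq. (8)
    return np.stack(windows), np.asarray(labels, dtype=float)

def rmse(y_pred, y_true):                        # eq. (9)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def score(y_pred, y_true):                       # eq. (10): late predictions penalized more
    d = y_pred - y_true
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)))
```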


B. Comparison With Benchmark Approaches

To verify the effectiveness of the proposed method, we have compared it with some benchmark approaches, including the standard KD [29], L1-KD [34], and L2-KD [30]. To demonstrate the effectiveness of the proposed sequential distillation upon LDT-KD, we also compare it with the BAN in [38], which is also sequentially trained with self-distillation. Particularly, we use "Student Only" as the teacher in the first generation for the BAN. Note that the BAN does not have the ability of model compression; it can only improve model generalization performance at the expense of model complexity in terms of memory and inference time. In the experiments, we use the ensemble of multiple students with five generations for the BAN on RUL prediction. Moreover, considering the disparate network architectures between the teacher and the student, the feature distillation step can also be treated as a domain alignment task which intends to minimize the discrepancy of feature distributions between the teacher and the student. In order to further validate the effectiveness of the proposed method, we also explore several domain alignment techniques combined with our framework, such as maximum mean discrepancy (MMD) [41] and correlation alignment (CORAL) [42].

TABLE II. PERFORMANCE COMPARISON AMONG VARIOUS APPROACHES ON FOUR DATASETS (bold values indicate the best results)

Table II shows the evaluation results of the different methods on the four subdatasets. The "Student Only," which implements a dilated CNN, performs the worst due to its compact network structure. The teacher model, which is built upon the LSTM structure, performs much better than the "Student Only." All the KD methods improve the performance of the student. This indicates the effectiveness of the KD algorithms for improving the performance of the student network. The BAN can effectively improve the performance of the student model, especially in the complicated scenarios, i.e., FD002 and FD004. Among all the methods, the proposed KDnet-RUL performs the best in terms of both RMSE and Score. Besides, it achieves a comparable performance to the teacher model. In particular, the proposed method outperforms the teacher network on FD002 and FD003 in terms of both RMSE and Score. For FD004, the proposed KDnet-RUL has a superior performance over the teacher model in terms of RMSE.

TABLE III. PERFORMANCE COMPARISON BETWEEN CONVENTIONAL CNN AND DILATED CNN

To verify the effectiveness of the dilated CNN as the student, we further compare the dilated CNN with the conventional CNN [14] under two different scenarios as shown in Table III. In Case I, we train them from scratch for RUL prediction (i.e., Student Only). In Case II, we train them as the students in our KDnet-RUL framework under the guidance of the LSTM-based teacher. Compared with the conventional CNN [14], the dilated CNN is capable of modeling temporal information in time series sensory data, which is vital for RUL prediction. We can observe that the dilated CNN performs better than the conventional CNN under the two different scenarios as shown in Table III. Moreover, our KDnet-RUL can also improve the performance of the conventional CNN via KD from LSTM, further demonstrating the effectiveness of our proposed KD framework.

TABLE IV. COMPARISON OF MODEL COMPLEXITY

Table IV compares the complexities of the teacher and student models. Here, we consider the number of weights and the total floating-point operations (TFPO) when comparing model complexity. More weights and TFPO refer to a more complex model. Note that for the proposed KDnet-RUL, the final model is the student network after training. It can be found that the number of weights of the student model is 12.8 times less than that of the teacher model. During inference, the student model only requires 52 400 TFPO, which is 46.2 times more efficient than the teacher model.

In conclusion, the proposed KDnet-RUL can achieve a comparable performance to a very complex LSTM network but with a much more efficient structure, i.e., 12.8 times less weights and 46.2 times less TFPO.
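As a practical note on the complexity comparison in Table IV, the weight counts of the teacher and student can be read directly from the models; a minimal sketch, assuming PyTorch and the hypothetical model sketches above, is:

```python
# Minimal sketch (assuming PyTorch) for counting trainable weights when comparing
# teacher and student complexity as in Table IV. Operation counts (TFPO) depend
# additionally on the input window size and the layer shapes.
def count_weights(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage with the earlier sketches (illustrative):
#   ratio = count_weights(teacher) / count_weights(student)
```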

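The domain-alignment variants mentioned in this section (MMD, CORAL) replace the adversarial feature-distillation objective with a statistical distance between teacher and student feature batches. As an illustration only, and not the paper's implementation, a minimal CORAL-style alignment loss assuming PyTorch is:

```python
# Minimal sketch (assuming PyTorch) of a CORAL-style feature alignment loss,
# which matches the second-order statistics (covariances) of teacher and student
# feature batches instead of using a discriminator.
import torch

def coral_loss(feat_t, feat_s):
    """feat_t, feat_s: (batch, d) feature matrices from teacher and student."""
    d = feat_t.size(1)

    def cov(f):
        f = f - f.mean(dim=0, keepdim=True)
        return f.t() @ f / (f.size(0) - 1)

    return ((cov(feat_t) - cov(feat_s)) ** 2).sum() / (4.0 * d * d)
```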

C. Ablation Study

Recall that our KDnet-RUL consists of three components, namely, GAN-KD, LDT-KD, and sequential distillation. In this section, we conduct an ablation study to analyze how each component affects the model performance. In particular, we derive the following model variants for the ablation study.
1) LDT-KD: The teacher of this variant is a CNN trained from scratch, i.e., "Student Only."
2) GAN-KD: The teacher of this variant is a five-layer LSTM.
3) GAN-KD+LDT-KD: LDT-KD in this variant uses the student learned by GAN-KD as its teacher.

Note that our empirical study shows that the sequential distillation upon LDT-KD does not present further improvement on the relatively simple datasets, i.e., FD001 and FD003 with one working condition. One possible reason is that there is a performance threshold for the student due to the limited model capability, and such a threshold can be easily reached for those relatively simple datasets. Therefore, we do not perform the sequential distillation upon LDT-KD on FD001 and FD003, i.e., KDnet-RUL refers to GAN-KD+LDT-KD on FD001 and FD003. For the complicated datasets, i.e., FD002 and FD004 with six working conditions, the sequential distillation upon LDT-KD can further enhance the performance of the KDnet-RUL model. KDnet-RUL refers to the combination of the three components on FD002 and FD004.

Fig. 7. Results of the ablation study on FD001 and FD003.

Fig. 7 illustrates the experimental results of the ablation study on FD001 and FD003. It is clear that both the LDT-KD and GAN-KD can significantly improve the performance of the student network. The KDnet-RUL with both GAN-KD and LDT-KD is able to further enhance the performance, especially on the FD003 dataset, in terms of both RMSE and Score. It even outperforms the powerful and complex teacher network on FD003. This indicates that the proposed GAN-KD and LDT-KD are effective for RUL prediction on simple datasets.

Fig. 8. Results of the ablation study on FD002 and FD004.

The experimental results on the complicated FD002 and FD004 are shown in Fig. 8. In this scenario, the KDnet-RUL is the combination of all three components. Similarly, the performance of the student network can be enhanced by the proposed GAN-KD and LDT-KD, except for the RMSE on FD004. When combining the GAN-KD and LDT-KD (i.e., GAN-KD+LDT-KD in Fig. 8), the performance improvement is marginal. For the Score values on FD004, including LDT-KD even degrades the performance of the model. However, when combining the sequential distillation upon LDT-KD (i.e., with multiple generations of LDT-KD), the performance of the model is consistently improved. With several generations (i.e., three generations of LDT-KD in this article), the performance of the model becomes stable and even better than the teacher network in terms of the RMSE on FD004 and the Score on FD002.

In a word, the proposed KDnet-RUL does not require the sequential distillation upon LDT-KD (i.e., only one LDT-KD) to ensure a superior performance for simple datasets, while the sequential distillation upon LDT-KD (i.e., several generations of LDT-KD) is compulsory for complex datasets.

D. Parameter Sensitivity Analysis

In this section, we conduct the sensitivity analysis for the parameters α, λ, and β. In particular, we adopt the grid search method on the validation set for parameter tuning.

a) Parameter α: Fig. 9 shows the impact of a key hyperparameter α in (6), which controls the contribution of truth labels and soft labels when supervising the training of the student network. Two special cases are α = 0.0 and α = 1.0, which respectively mean only using the ground truth and only using the soft labels for training. However, α = 1.0 will gradually mislead the student in the LDT-KD and sequential distillation process, such that the performance of the teacher and student will degrade rapidly over generations due to error accumulation if there is no correction from the truth labels. Hence, we omit the result of α = 1.0 in Fig. 9. As we can see, the soft labels produced by the teacher are more contributive to the performance than the ground truth. In most cases, a higher α value tends to yield better performance. In our experiments, we set α as 0.7 on FD001, FD002, and FD003, and 0.8 on FD004. Empirically, we would recommend α = 0.7 for our proposed model.

Fig. 9. Sensitivity analysis of parameter α. (a) FD001. (b) FD002. (c) FD003. (d) FD004.

Fig. 10. Sensitivity analysis of parameter λ. (a) FD001. (b) FD002. (c) FD003. (d) FD004.

Fig. 11. Sensitivity analysis of parameter β. (a) FD001. (b) FD002. (c) FD003. (d) FD004.

b) Parameter λ: Fig. 10 illustrates the impact of the hyperparameter λ in (3) on model performance for the four subdatasets. As we can see, λ = 0.0 performs worst in terms of RMSE and Score. It demonstrates that integrating the L1 distance during the training of the generator G can help to improve model performance, as mentioned earlier. With λ increasing from 1 to 10, the model gradually performs worse since the training of the generator relies more on the L1 distance, which is harmful for disparate architecture knowledge transfer.

c) Parameter β: Fig. 11 shows the performance of the teacher learned by LDT-KD with different values of the hyperparameter β in (7). A small β will dramatically update the teacher's model parameters and will make the model hard to converge. Empirically, we would recommend β = 0.99 for our proposed model.

E. Results on PHM2008 Challenge Dataset

The PHM2008 Challenge dataset was used for the prognostics challenge competition at the International Conference on Prognostics and Health Management (PHM2008) [4]. It is also widely used to evaluate the performance of models for RUL prediction. Since the true RUL values of the challenge dataset are not released, the results need to be uploaded to the NASA Data Repository website, where the RUL Score will be generated for performance evaluation.

TABLE V. RESULTS ON PHM 2008 CHALLENGE DATASET (bold values indicate the best results)

Table V shows the results on the PHM2008 Challenge dataset. In addition to the various KD baselines, we also included the results of various RUL prediction methods (e.g., CNN [14], LSTM [16], and attention-based LSTM [18]) in Table V. It is clear that various KD methods can effectively help to improve the performance of a compact student ("Student Only"). The proposed KDnet-RUL has a superior performance over not only state-of-the-art RUL prediction approaches but also various benchmark KD methods.

V. CONCLUSION

In this article, we proposed a deep model compression framework based on KD, named KDnet-RUL, for machine RUL prediction. The KDnet-RUL consists of three components, i.e., GAN-KD, LDT-KD, and sequential distillation.

A complicated LSTM-based model was adopted as a powerful teacher and a dilated CNN was utilized as an efficient student. By using the proposed KDnet-RUL, the student network can achieve comparable performance with the teacher network but with 12.8 times less weights and 46.2 times less total float point operations.

In the future, we will consider a more realistic and challenging scenario where the data for training and testing may come from different distributions, caused by changing environments or varying machines. For instance, suppose we have built a model based on the collected dataset A from machine A. When directly using this RUL prediction model on a new machine B, which may have different data distributions due to different operation conditions, the performance of the model will significantly degrade. Whereas, collecting a new dataset from machine B to retrain the model is tedious and requires lots of effort. Instead of doing that, we intend to transfer the knowledge learned from dataset A to the new machine B without collecting labeled data from machine B, which is also known as domain adaptation [43], [44]. In this practical scenario, both model compression and domain adaptation need to be considered.

REFERENCES

[1] F. Yang, M. S. Habibullah, T. Zhang, Z. Xu, P. Lim, and S. Nadarajan, "Health index-based prognostics for remaining useful life predictions in electrical machines," IEEE Trans. Ind. Electron., vol. 63, no. 4, pp. 2633–2644, Apr. 2016.
[2] J. Sikorska, M. Hodkiewicz, and L. Ma, "Prognostic modelling options for remaining useful life estimation by industry," Mech. Syst. Signal Process., vol. 25, no. 5, pp. 1803–1836, 2011.
[3] X.-S. Si, W. Wang, C.-H. Hu, and D.-H. Zhou, "Remaining useful life estimation—A review on the statistical data driven approaches," Eur. J. Oper. Res., vol. 213, no. 1, pp. 1–14, 2011.
[4] A. Saxena, K. Goebel, D. Simon, and N. Eklund, "Damage propagation modeling for aircraft engine run-to-failure simulation," in Proc. Int. Conf. Prognostics Health Manage., 2008, pp. 1–9.
[5] F. O. Heimes, "Recurrent neural networks for remaining useful life estimation," in Proc. IEEE Int. Conf. Prognostics Health Manage., 2008, pp. 1–6.
[6] R. K. Singleton, E. G. Strangas, and S. Aviyente, "Extended Kalman filtering for remaining-useful-life estimation of bearings," IEEE Trans. Ind. Electron., vol. 62, no. 3, pp. 1781–1790, Mar. 2015.
[7] N. Li, Y. Lei, L. Guo, T. Yan, and J. Lin, "Remaining useful life prediction based on a general expression of stochastic process models," IEEE Trans. Ind. Electron., vol. 64, no. 7, pp. 5709–5718, Jul. 2017.
[8] R. Khelif, B. Chebel-Morello, S. Malinowski, E. Laajili, F. Fnaiech, and N. Zerhouni, "Direct remaining useful life estimation based on support vector regression," IEEE Trans. Ind. Electron., vol. 64, no. 3, pp. 2276–2285, Mar. 2017.
[9] C. Sun, M. Ma, Z. Zhao, S. Tian, R. Yan, and X. Chen, "Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing," IEEE Trans. Ind. Inform., vol. 15, no. 4, pp. 2416–2425, Apr. 2019.
[10] Z. Tian, "An artificial neural network method for remaining useful life prediction of equipment subject to condition monitoring," J. Intell. Manuf., vol. 23, no. 2, pp. 227–237, 2012.
[11] M. Xia, T. Li, T. Shu, J. Wan, C. W. De Silva, and Z. Wang, "A two-stage approach for the remaining useful life prediction of bearings using deep neural networks," IEEE Trans. Ind. Inform., vol. 15, no. 6, pp. 3703–3711, Jun. 2019.
[12] B. Yang, R. Liu, and E. Zio, "Remaining useful life prediction based on a double-convolutional neural network architecture," IEEE Trans. Ind. Electron., vol. 66, no. 12, pp. 9521–9530, Dec. 2019.
[13] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: A convolutional neural-network approach," IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 98–113, Jan. 1997.
[14] G. S. Babu, P. Zhao, and X.-L. Li, "Deep convolutional neural network based regression approach for estimation of remaining useful life," in Proc. Int. Conf. Database Syst. Adv. Appl., 2016, pp. 214–228.
[15] J. Zhu, N. Chen, and W. Peng, "Estimation of bearing remaining useful life based on multiscale convolutional neural network," IEEE Trans. Ind. Electron., vol. 66, no. 4, pp. 3208–3216, Apr. 2019.
[16] S. Zheng, K. Ristovski, A. Farahat, and C. Gupta, "Long short-term memory network for remaining useful life estimation," in Proc. IEEE Int. Conf. Prognostics Health Manage., 2017, pp. 88–95.
[17] C.-G. Huang, H.-Z. Huang, and Y.-F. Li, "A bidirectional LSTM prognostics method under multiple operational conditions," IEEE Trans. Ind. Electron., vol. 66, no. 11, pp. 8792–8802, Nov. 2019.
[18] Z. Chen, M. Wu, R. Zhao, F. Guretno, R. Yan, and X. Li, "Machine remaining useful life prediction via an attention based deep learning approach," IEEE Trans. Ind. Electron., vol. 68, no. 3, pp. 2521–2531, Mar. 2021.
[19] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," in Proc. Int. Conf. Learn. Representations, 2014. [Online]. Available: https://arxiv.org/abs/1412.6115
[20] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4820–4828.
[21] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," 2017, arXiv:1710.09282.
[22] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149.
[23] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, "Predicting parameters in deep learning," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2148–2156.
[24] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1269–1277.
[25] M. R. U. Saputra, P. P. de Gusmao, Y. Almalioglu, A. Markham, and N. Trigoni, "Distilling knowledge from a deep pose regressor network," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 263–272.
[26] W.-C. Chen, C.-C. Chang, and C.-R. Lee, "Knowledge distillation with feature maps for image classification," in Proc. Asian Conf. Comput. Vis., 2018, pp. 200–215.
[27] Y. Lei, N. Li, L. Guo, N. Li, T. Yan, and J. Lin, "Machinery health prognostics: A systematic review from data acquisition to RUL prediction," Mech. Syst. Signal Process., vol. 104, pp. 799–834, 2018.
[28] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2006, pp. 535–541.
[29] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in Proc. NIPS Deep Learn. Representation Learn. Workshop, vol. 1050, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," in Proc. Int. Conf. Learn. Representations, 2015. [Online]. Available: https://arxiv.org/abs/1412.6550
[31] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," in Proc. 5th Int. Conf. Learn. Representations, 2017. [Online]. Available: https://arxiv.org/abs/1612.03928
[32] Y. Tian, D. Krishnan, and P. Isola, "Contrastive representation distillation," in Proc. Int. Conf. Learn. Representations, 2020. [Online]. Available: https://arxiv.org/abs/1910.10699
[33] L. Gao, H. Mi, B. Zhu, D. Feng, Y. Li, and Y. Peng, "An adversarial feature distillation method for audio classification," IEEE Access, vol. 7, pp. 105319–105330, 2019.
[34] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, "Learning efficient object detection models with knowledge distillation," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 742–751.
[35] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, pp. 1746–1751.
[36] J. Yim, D. Joo, J. Bae, and J. Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4133–4141.
[37] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1195–1204.


[38] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, "Born again neural networks," in Proc. 35th Int. Conf. Mach. Learn., vol. 80, 2018, pp. 1602–1611.
[39] X. Li, Q. Ding, and J.-Q. Sun, "Remaining useful life estimation in prognostics using deep convolution neural networks," Rel. Eng. Syst. Saf., vol. 172, pp. 1–11, 2018.
[40] C. Zhang, P. Lim, A. K. Qin, and K. C. Tan, "Multiobjective deep belief networks ensemble for remaining useful life estimation in prognostics," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2306–2318, Oct. 2017.
[41] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Deep transfer learning with joint adaptation networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 2208–2217.
[42] B. Sun and K. Saenko, "Deep CORAL: Correlation alignment for deep domain adaptation," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 443–450.
[43] X. Li, W. Zhang, N.-X. Xu, and Q. Ding, "Deep learning-based machinery fault diagnostics with domain adaptation across sensors at different places," IEEE Trans. Ind. Electron., vol. 67, no. 8, pp. 6785–6794, Aug. 2020.
[44] M. Ragab et al., "Contrastive adversarial domain adaptation for machine remaining useful life prediction," IEEE Trans. Ind. Informat., to be published, doi: 10.1109/TII.2020.3032690.

Qing Xu received the B.Eng. degree in measuring control technology and instruments from Southeast University, Nanjing, China, in 2010 and the M.Eng. degree in instrument science and technology from Southeast University, Nanjing, in 2015.
He is currently a Research Engineer with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. His research interests include deep learning, transfer learning, and model compression and related applications.

Zhenghua Chen received the B.Eng. degree in mechatronics engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2011, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2017.
He has been a Research Fellow with NTU. Currently, he is a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. His research interests include smart sensing, data analytics, machine learning, and transfer learning and related applications.
Dr. Chen has won several competitive awards, such as the A*STAR Career Development Award, the First Runner-Up Award for the Grand Challenge at IEEE VCIP 2020, and the Finalist Academic Paper Award at IEEE ICPHM 2020. He serves as an Associate Editor for Elsevier Neurocomputing and as a Guest Editor for the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE and Elsevier Neurocomputing. He is currently the Vice Chair of the IEEE Sensors Council Singapore Chapter.

Keyu Wu received the B.Eng. degree in bioengineering from the National University of Singapore, Singapore, in 2013 and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2020.
She is currently a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. Her research interests include reinforcement learning, transfer learning, unsupervised learning, path planning, and autonomous navigation.

Chao Wang (Senior Member, IEEE) received the B.S. and Ph.D. degrees in computer science from the University of Science and Technology of China, Hefei, China, in 2006 and 2011, respectively.
He is currently an Associate Professor with the University of Science and Technology of China. He was a Visiting Scholar with the Electrical and Computer Engineering Department, University of California at Santa Barbara, Santa Barbara, CA, USA, from 2015 to 2016. His research interests focus on multicore and reconfigurable computing.
Dr. Wang serves as an Associate Editor for the ACM Transactions on Design Automation for Electronics Systems (ACM TODAES), IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Microprocessors & Microsystems, and IET Computers & Design Techniques. He is also the publicity chair of HiPEAC 2015 and ISPA 2014. He is a senior member of the ACM and CCF.

Min Wu received the B.S. degree in computer science from the University of Science and Technology of China (USTC), Hefei, China, in 2006 and the Ph.D. degree in computer science from Nanyang Technological University (NTU), Singapore, in 2011.
He is currently a Senior Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. His current research interests include machine learning, data mining, and bioinformatics.
Dr. Wu is the recipient of the best paper awards at InCoB 2016 and DASFAA 2015. He also won the IJCAI competition on repeated buyers prediction in 2015.

Xiaoli Li received the B.Eng. degree in computer science from Shanxi University, China, and the Ph.D. degree in computer software from the Institute of Computing Technology, Chinese Academy of Sciences, China. He is currently a Department Head and Principal Scientist with the Institute for Infocomm Research, A*STAR, Singapore. He also holds an adjunct Professor position with Nanyang Technological University, Singapore.
Prof. Li has won numerous best paper/benchmark competition awards. He has been serving as Area Chair/Senior PC Member/Workshop Chair in leading data mining and AI-related conferences, including Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (PKDD/ECML), the International World Wide Web Conference (WWW), the International Joint Conferences on Artificial Intelligence (IJCAI), the Association for the Advancement of Artificial Intelligence (AAAI), the Association for Computational Linguistics (ACL), and the Conference on Information and Knowledge Management (CIKM). His research interests include data mining, machine learning, AI, and bioinformatics. He has authored or coauthored more than 200 high-quality papers.

