data with temporal information [16], [17]. Since the sensory data for machine RUL prediction are typical time series with temporal information, the LSTM network is naturally suitable for RUL prediction. Recent studies [16]–[18] have shown that the LSTM outperforms the CNN for RUL prediction. However, the LSTM generally has much higher computational complexity than the CNN due to its cascaded structure. In many real-world scenarios, RUL prediction algorithms need to be deployed on edge devices, which have limited computational resources and memory, for timely response and security reasons. Thus, industry generally prefers a learning algorithm that achieves accurate RUL prediction and is also very efficient (e.g., small model size and fast inference). Current deep learning algorithms are either too complicated or offer limited performance.

To deal with these issues, model compression techniques have been proposed to compress deep neural networks for edge deployment. For instance, parameter quantization methods [19], [20] compress the original network by using fewer bits to represent the weights. They can achieve significant speedup but also cause accuracy loss [21]. Another commonly used method for model compression is weight pruning [22], which aims to remove unnecessary parameters from a trained deep neural network. Although weight pruning is able to reduce the model storage size, it cannot improve efficiency in terms of training or inference time. Other methods, such as matrix decomposition [23], [24], have also shown the capability of reducing model size, but they only address the storage complexity of deep models and share similar drawbacks with weight pruning. In contrast, knowledge distillation (KD) has shown great promise in reducing model storage size while also improving model efficiency [25], [26].

In this article, we propose a novel KD framework, entitled KDnet-RUL, to compress deep learning models for RUL prediction. Specifically, we first design a generative adversarial network based knowledge distillation (GAN-KD) method for disparate-architecture knowledge transfer, which distills the knowledge from a powerful but complicated LSTM model to a simple CNN model. Then, a learning-during-teaching based knowledge distillation (LDT-KD) method for identical-architecture knowledge transfer is proposed to enhance the performance of the CNN model learned by GAN-KD. For complicated RUL prediction scenarios, e.g., data with multiple operation conditions, we leverage a sequential distillation scheme upon the LDT-KD for accurate and robust RUL prediction. The performance of the proposed KDnet-RUL method is evaluated on both simple and complex datasets.

The main contributions of the proposed method are summarized as follows.
1) We propose a KD framework, named KDnet-RUL, which distills knowledge from a complicated LSTM model to a simple CNN model for efficient RUL prediction. The efficient CNN model can thus be deployed on resource-constrained edge devices.
2) For KD between disparate architectures, i.e., from LSTM to CNN, a GAN-KD method is proposed. It attempts to minimize the discrepancy between the features learned from the LSTM and the CNN by using a GAN technique.
3) To enhance the performance of the CNN, we propose an LDT-KD method for KD between identical architectures.
4) For complicated scenarios where multiple working conditions are involved in RUL prediction, we propose a sequential distillation scheme upon LDT-KD to further enhance the performance of the learned CNN model.

The rest of the article is organized as follows. Section II reviews related work on RUL prediction and KD. Section III presents the deep neural networks for RUL prediction, followed by the disparate- and identical-architecture knowledge transfer. Section IV first describes the data for evaluation and the experimental setup, and then presents the experimental results, ablation study, and sensitivity analysis. Section V concludes this article.

II. RELATED WORK

A. RUL Prediction

Deep learning for RUL prediction has gained increasing attention due to its ability to model complex machinery degradation processes [27]. Various deep learning methods, such as CNN and LSTM, have been shown to be effective for RUL prediction tasks. Babu et al. [14] proposed a novel CNN-based model to estimate the RUL of airplane engines, using sliding windows over the raw sensory data as input samples. Instead of directly feeding the raw sensory data into CNN models, Zhu et al. [15] transformed the sensory data to derive a time–frequency representation of each sample; a multiscale convolutional neural network was then developed with these samples for RUL prediction. Even though CNN-based models have already outperformed traditional methods, such as the multilayer perceptron (MLP) and support vector machines, they are not naturally designed for sensory data with temporal information.

To better capture the temporal information of sensory data, Zheng et al. [16] employed an LSTM network to model the long-term dependency characteristics of the data for RUL prediction; this LSTM method achieved better performance than traditional machine learning and CNN approaches. Thereafter, several LSTM-based approaches, such as bidirectional LSTM [17] and attention-based LSTM [18], were proposed to further improve RUL prediction accuracy. However, LSTM-based models often have high computational complexity, which makes them very difficult to deploy on edge devices with limited computing resources. To address this problem, model compression methods can be adopted to reduce the complexity of LSTM models while preserving their performance as much as possible.

B. Knowledge Distillation

KD, also known as a teacher–student strategy, is widely applied for model compression. It was first introduced by [28],
Fig. 2. Dilated CNN-based student architecture. Conv1D(3, 2, 1) refers to a 1-D convolution layer with kernel size 3, stride 2, and dilation 1.

Fig. 3. GAN-KD for disparate network architectures. All network blocks with dashed lines are trainable, and those with solid lines are locked during training.

Meanwhile, the discriminator D is designed to maximize the similarity between the CNN-based and the LSTM-based feature extractors and thus improve the CNN-based feature extractor. Given that x ∈ R^{T×n} is the input sensory data, where T is the window size and n is the number of sensors, φ(x) is the output of the feature extractor in the teacher network, while G(x) is the output of the feature extractor in the student network. The discriminator D, as a binary classification network, aims to identify whether a feature map comes from the teacher's or the student's feature extractor. Here, D and G play a two-player minimax game in which D aims to maximize the probability of correctly classifying φ(x) and G(x) as coming from the teacher and the student, respectively, and G aims to minimize the probability that D recognizes G(x) as coming from the student. The objective function can be expressed as

\min_G \max_D V(D, G) = \mathbb{E}_x \big[ \log D(\phi(x)) + \log\big(1 - D(G(x))\big) \big].    (1)

At each iteration of the training stage, we first fix G and train D by maximizing the following loss function L_D:

L_D = \log D(\phi(x)) + \log\big(1 - D(G(x))\big).    (2)

Then, we fix D and train G by minimizing the term \log(1 - D(G(x))). We further mix this GAN objective with the L1 distance between the student's and the teacher's features, denoted as L_G:

L_G = \log\big(1 - D(G(x))\big) + \lambda \cdot \|\phi(x) - G(x)\|_1    (3)

where λ is a hyperparameter that controls the contribution of the L1 distance to the final loss L_G. Minimizing L_G thus helps to reach the equilibrium in which G generates features as good as the teacher's and D guesses with 50% accuracy.

We alternately repeat the above discriminator and generator training process, i.e., iteratively maximizing L_D and minimizing L_G. Eventually, the student is able to generate feature maps similar to the teacher's.
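To make the alternating optimization concrete, the following PyTorch-style sketch illustrates one possible implementation of the GAN-KD feature-matching step. It is an assumption-laden illustration rather than the authors' code: the module names (StudentCNN, Discriminator), layer sizes, hyperparameters, and the use of the common non-saturating generator loss in place of the saturating term in (3) are all assumptions; the teacher's feature extractor is passed in as a callable and is assumed to output features of the same dimensionality as the student's.

# Illustrative GAN-KD sketch (not the authors' code): a frozen LSTM teacher's
# features phi(x) are matched by a dilated-CNN student's features G(x) using the
# adversarial objective of (1)-(3). All sizes and settings are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_WIN, N_SENSORS, FEAT_DIM = 30, 14, 64   # assumed window size, sensor count, feature size

class StudentCNN(nn.Module):
    """Simple dilated 1-D CNN feature extractor G, loosely following Fig. 2."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(N_SENSORS, 32, kernel_size=3, stride=2, dilation=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, dilation=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(64, FEAT_DIM)
    def forward(self, x):                      # x: (batch, T_WIN, N_SENSORS)
        h = self.body(x.transpose(1, 2)).squeeze(-1)
        return self.proj(h)                    # G(x): (batch, FEAT_DIM)

class Discriminator(nn.Module):
    """Binary classifier D that separates teacher features from student features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, f):
        return torch.sigmoid(self.net(f))      # probability that f comes from the teacher

def gan_kd_step(teacher_feat_fn, G, D, opt_G, opt_D, x, lam=0.5):
    """One alternating update: first train D (eq. 2), then train G (eq. 3)."""
    with torch.no_grad():
        phi = teacher_feat_fn(x)               # teacher features, kept frozen (locked blocks in Fig. 3)
    # Discriminator update: push D(phi(x)) toward 1 and D(G(x)) toward 0.
    opt_D.zero_grad()
    d_real, d_fake = D(phi), D(G(x).detach())
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()
    # Generator (student feature extractor) update: adversarial term + L1 feature distance.
    opt_G.zero_grad()
    g = G(x)
    d_g = D(g)
    adv = F.binary_cross_entropy(d_g, torch.ones_like(d_g))   # non-saturating stand-in for log(1 - D(G(x)))
    loss_G = adv + lam * F.l1_loss(g, phi)
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()

# Usage (illustrative): G, D = StudentCNN(), Discriminator()
# opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
# opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

In this sketch the teacher's feature extractor is wrapped in torch.no_grad(), mirroring the locked blocks in Fig. 3, while the student feature extractor and the discriminator are updated alternately.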
2) Knowledge Distillation: KD by logits or soft labels has already proved to be effective for training the student in classification tasks. In this article, we deal with a regression task for RUL prediction. Hence, we utilize the predictions of the teacher for KD, analogous to the logits or soft labels in classification tasks. In particular, we define the Soft Loss L_Soft as the difference between the student's prediction and the teacher's prediction in (4). We also define the Hard Loss L_Hard as the difference between the student's prediction and the ground truth (i.e., the real labels) in (5). The loss function for KD, L_KD, is then defined as the weighted combination of L_Soft and L_Hard in (6):

L_{\text{Soft}} = \|\hat{y}_S - \hat{y}_T\|^2,    (4)
L_{\text{Hard}} = \|\hat{y}_S - y\|^2,    (5)
L_{KD} = \alpha \cdot L_{\text{Soft}} + (1 - \alpha) \cdot L_{\text{Hard}}.    (6)

Here, \hat{y}_S and \hat{y}_T represent the predictions of the student and teacher networks, respectively, and y is the ground truth. α is a hyperparameter that adjusts the weight of the hard and soft losses. By minimizing the KD loss L_KD, we learn the regressor module in the student for RUL prediction.
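A minimal sketch of the regression KD loss in (4)–(6), assuming batched prediction tensors; the function name and the use of mean squared error as the squared-norm term are illustrative assumptions consistent with the equations above.

# Weighted soft/hard regression KD loss of (4)-(6); names are illustrative.
import torch.nn.functional as F

def kd_loss(y_student, y_teacher, y_true, alpha=0.5):
    """L_KD = alpha * ||y_S - y_T||^2 + (1 - alpha) * ||y_S - y||^2."""
    l_soft = F.mse_loss(y_student, y_teacher.detach())   # soft loss against the teacher's prediction
    l_hard = F.mse_loss(y_student, y_true)                # hard loss against the ground-truth RUL
    return alpha * l_soft + (1.0 - alpha) * l_hard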
C. Identical Architecture Knowledge Transfer

As described above, GAN-KD helps to learn a simple CNN model by distilling the knowledge from the LSTM for RUL prediction. However, the learned CNN may not be optimal in terms of prediction performance due to the inherent difference between the CNN and the LSTM. In this section, we aim to further improve the CNN learned by our GAN-KD via KD between identical network structures.

1) Learning-During-Teaching: A few previous studies [29], [36] have already proven the feasibility of transferring
knowledge between models with identical architectures. Hinton et al. [29] demonstrated the effectiveness of distilling knowledge from an ensemble of models into a single model with the same architecture. However, pretraining a set of models for an ensemble is often time-consuming. On the other hand, Yim et al. [36] proposed a flow of solution procedure (FSP) matrix (the relationship between the outputs of two layers) to transfer the knowledge flow between two identical deep neural networks (DNNs). However, it is not straightforward how to choose the proper layers for calculating the FSP.

Fig. 4. LDT-KD architecture. T is the teacher model, S represents the student model, and GT represents the ground truth.

In this article, we propose a method called LDT-KD to update both the student and the teacher in a closed-loop process, as shown in Fig. 4. Here, the teacher and student in Fig. 4 have the same network structure and the same set of model weights. The teacher in LDT-KD is directly copied from the student learned by GAN-KD. To accelerate the convergence of the teacher model, we pretrain the student in LDT-KD for several epochs with the conventional KD strategy before performing the closed-loop process in Fig. 4. At each training step, we first update the weights of the student with gradient descent under the supervision of the ground truth and the soft labels from the teacher, i.e., by minimizing the KD loss in (6). Second, we update the weights of the teacher using an exponential moving average of the student weights, inspired by the mean teacher model in [37]. This can be expressed as

W_T^{i+1} = \beta \cdot W_T^{i} + (1 - \beta) \cdot W_S^{i}    (7)

where W_T^{i} and W_S^{i} represent the weights of the teacher and the student at training step i, respectively, and β is a smoothing parameter that determines how much historical information of the teacher model is carried forward in the update. Once the teacher weights are updated, we repeat the above two steps until the stopping criterion is satisfied, e.g., until the performance of the teacher on the validation data starts to drop.
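The closed-loop update can be summarized in a short PyTorch-style sketch. It is a sketch under stated assumptions, not the authors' implementation: the optimizer, the values of α and β, and the helper names are illustrative.

# Illustrative LDT-KD step: a gradient step on the KD loss in (6) for the student,
# followed by the exponential-moving-average teacher update in (7).
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    """The teacher starts as an exact copy of the GAN-KD student (same weights)."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)        # the teacher receives no gradient updates
    return teacher

def ldt_kd_step(student, teacher, optimizer, x, y_true, alpha=0.5, beta=0.999):
    # 1) Student update, supervised by the ground truth and the teacher's soft labels.
    with torch.no_grad():
        y_teacher = teacher(x)
    optimizer.zero_grad()
    y_student = student(x)
    loss = alpha * F.mse_loss(y_student, y_teacher) \
           + (1.0 - alpha) * F.mse_loss(y_student, y_true)    # eq. (6)
    loss.backward()
    optimizer.step()
    # 2) Teacher update: exponential moving average of the student weights, eq. (7).
    with torch.no_grad():
        for w_t, w_s in zip(teacher.parameters(), student.parameters()):
            w_t.mul_(beta).add_(w_s, alpha=1.0 - beta)
    return loss.item()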
2) Sequential Distillation: Our empirical study shows that the performance of LDT-KD is superior and stable for simple datasets. However, its performance is not consistently good for complex datasets, such as datasets for RUL prediction with multiple operating conditions. To stabilize the model training for RUL prediction, we present a sequential distillation scheme upon the LDT-KD. Sequential distillation was first proposed by Furlanello et al. [38], who sequentially distilled the knowledge from a teacher to a student with an identical structure. In each generation, a new student is initialized with a different random seed. At the end of the procedure, they employed an ensemble of the student models from all generations and achieved remarkable performance.

We adopt this sequential training idea upon the LDT-KD module, as shown in Fig. 5. However, our method differs from [38] (denoted as BAN) in several aspects. First, the weights of the teacher model are updated simultaneously with those of the student model in our proposed LDT-KD, whereas the born-again network (BAN) fixes the weights of the teacher model. Second, the BAN applies an ensemble of multiple students from different generations for the final prediction; this ensemble is too expensive for edge devices because it requires more storage memory and a longer inference time. In our approach, either the final single student or the teacher can be used for RUL prediction, and both generalize well. Third, we empirically show that the need for sequential training depends on the dataset: sequential distillation is only necessary for complicated datasets, e.g., datasets with multiple operation conditions.
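As a rough sketch (one plausible reading of the scheme, not the authors' exact procedure), each generation runs LDT-KD with the current student promoted to the new teacher, and a single final student is kept for deployment. The generation count, epoch count, and data loop are assumptions, and the helpers make_teacher and ldt_kd_step from the LDT-KD sketch above are reused.

# Illustrative sequential distillation over LDT-KD generations; all settings are assumptions.
def sequential_ldt_kd(student, data_loader, make_optimizer, n_generations=3, n_epochs=20):
    for generation in range(n_generations):
        teacher = make_teacher(student)       # this generation's teacher starts from the current student
        optimizer = make_optimizer(student)
        for _ in range(n_epochs):
            for x, y in data_loader:
                ldt_kd_step(student, teacher, optimizer, x, y)
        # unlike BAN [38], no ensemble of generations is kept
    return student                            # deploy the single final student (or its teacher)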
IV. EXPERIMENTS

In this section, we evaluate the performance of the proposed KDnet-RUL method in distilling knowledge for RUL prediction.

A. Experimental Data and Setup

1) C-MAPSS Dataset: In our experiments, we used the public C-MAPSS dataset for evaluation, which has been widely used in many previous studies on RUL prediction [14], [17], [18]. This dataset simulates the degradation process of turbofan engines. It consists of four subdatasets under varying operating conditions and fault modes. Each subdataset is further divided into training and testing data, as shown in Table I.

Each trajectory in the training and testing data corresponds to an engine and consists of 21 sensor measurements for that engine. The training trajectories include all run-to-failure measurements for the engine units, while the testing trajectories only contain the measurements of a certain period during degradation. The objective is to accurately predict the RUL of the testing engines, given their test trajectories.
Following [39], [40], we apply a piecewise RUL. In particular, if the true RUL is larger than the maximal RUL, it is set to RUL_max instead, as shown in the following equation:

RUL = \begin{cases} L - T - s \cdot i, & \text{if } RUL < RUL_{\max} \\ RUL_{\max}, & \text{otherwise.} \end{cases}    (8)

The root mean square error (RMSE) is defined as

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}    (9)

where \hat{y}_i and y_i are the predicted RUL and the true RUL, respectively, and N is the total number of samples. The Score function, defined in (10), is designed to penalize late predictions more heavily than early predictions, as late predictions may lead to more serious catastrophic consequences.
As with the RMSE, the lower the Score, the better the performance of the model:

Score = \begin{cases} \sum_{i=1}^{N} \left( e^{-\frac{\hat{y}_i - y_i}{13}} - 1 \right), & \text{if } \hat{y}_i < y_i \\ \sum_{i=1}^{N} \left( e^{\frac{\hat{y}_i - y_i}{10}} - 1 \right), & \text{otherwise.} \end{cases}    (10)
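For reference, a small NumPy sketch of the two metrics under the definitions in (9) and (10); the function names and the array-based interface are assumptions.

# RMSE (eq. 9) and asymmetric Score (eq. 10) for RUL evaluation; names are illustrative.
import numpy as np

def rmse(y_pred, y_true):
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def score(y_pred, y_true):
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    # early predictions (d < 0) use denominator 13; late predictions use denominator 10
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)))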
B. Comparison With Benchmark Approaches

To verify the effectiveness of the proposed method, we compare it with several benchmark approaches, including the standard KD [29], L1-KD [34], and L2-KD [30]. To demonstrate the effectiveness of the proposed sequential distillation upon LDT-KD, we also compare it with the BAN in [38], which is likewise trained sequentially with self-distillation. In particular, we use “Student Only” as the teacher of the first generation for the BAN. Note that the BAN has no model compression capability: it can only improve model generalization at the expense of model complexity in terms of memory and inference time. In the experiments, we use an ensemble of students over five generations for the BAN on RUL prediction. Moreover, considering the disparate network architectures of the teacher and the student, the feature distillation step can also be treated as a domain alignment task that aims to minimize the discrepancy between the feature distributions of the teacher and the student. To further validate the effectiveness of the proposed method, we also explore several domain alignment techniques combined with our framework, such as maximum mean discrepancy (MMD) [41] and correlation alignment (CORAL) [42].

TABLE II: PERFORMANCE COMPARISON AMONG VARIOUS APPROACHES ON FOUR DATASETS

Table II shows the evaluation results of the different methods on the four subdatasets. The “Student Only” variant, which implements a dilated CNN, performs the worst due to its compact network structure. The teacher model, which is built upon the LSTM structure, performs much better than “Student Only.” All the KD methods improve the performance of the student, which indicates the effectiveness of the KD algorithms for improving the student network. The BAN can effectively improve the performance of the student model, especially in the complicated scenarios, i.e., FD002 and FD004. Among all the methods, the proposed KDnet-RUL performs the best in terms of both RMSE and Score. Moreover, it achieves a performance comparable to the teacher model. In particular, the proposed method outperforms the teacher network on FD002 and FD003 in terms of both RMSE and Score. For FD004, the proposed KDnet-RUL has a superior performance over the teacher model in terms of RMSE.

To verify the effectiveness of the dilated CNN as the student, we further compare the dilated CNN with a conventional CNN [14] under two different scenarios, as shown in Table III. In Case I, we train them from scratch for RUL prediction (i.e., Student Only). In Case II, we train them as the students in our KDnet-RUL framework under the guidance of the LSTM-based teacher. Compared with the conventional CNN [14], the dilated CNN is capable of modeling temporal information in time series sensory data, which is vital for RUL prediction. We can observe that the dilated CNN performs better than the conventional CNN under both scenarios, as shown in Table III. Moreover, our KDnet-RUL can also improve the performance of the conventional CNN via KD from the LSTM, further demonstrating the effectiveness of the proposed KD framework.

Table IV compares the complexities of the teacher and student models. Here, we consider the number of weights and the total floating-point operations (TFPO) when comparing model complexity; more weights and a higher TFPO indicate a more complex model. Note that for the proposed KDnet-RUL, the final model is the student network after training. It can be seen that the student model has 12.8 times fewer weights than the teacher model. During inference, the student model only requires 52 400 TFPO, which is 46.2 times more efficient than the teacher model.

In conclusion, the proposed KDnet-RUL achieves a performance comparable to a very complex LSTM network but with a much more efficient structure, i.e., 12.8 times fewer weights and 46.2 times lower TFPO.
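As an aside on how such complexity numbers can be obtained in practice, the snippet below counts the trainable weights of a PyTorch model; the operation-counting procedure behind the TFPO figures in Table IV is not specified here, so only the weight count is shown, and the variable names are assumptions.

# Counting trainable weights to compare teacher and student model sizes (illustrative).
def count_weights(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g., count_weights(teacher) / count_weights(student) gives the weight ratio
# reported in Table IV; the TFPO figures require a separate operation counter.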
C. Ablation Study

Recall that our KDnet-RUL consists of three components, namely, GAN-KD, LDT-KD, and sequential distillation. In this section, we conduct an ablation study to analyze how each component affects the model performance. In particular, we derive the following model variants for the ablation study.
1) LDT-KD: The teacher of this variant is a CNN trained from scratch, i.e., “Student Only.”
2) GAN-KD: The teacher of this variant is a five-layer LSTM.
3) GAN-KD+LDT-KD: LDT-KD in this variant uses the student learned by GAN-KD as its teacher.
TABLE III: PERFORMANCE COMPARISON BETWEEN CONVENTIONAL CNN AND DILATED CNN

TABLE IV: COMPARISON OF MODEL COMPLEXITY

TABLE V: RESULTS ON PHM 2008 CHALLENGE DATASET (Bold values indicate the best results.)

Fig. 9. Sensitivity analysis of parameter α. (a) FD001. (b) FD002. (c) FD003. (d) FD004.

Fig. 10. Sensitivity analysis of parameter λ. (a) FD001. (b) FD002. (c) FD003. (d) FD004.

Fig. 11. Sensitivity analysis of parameter β. (a) FD001. (b) FD002. (c) FD003. (d) FD004.
LSTM-based model was adopted as a powerful teacher and a dilated CNN was utilized as an efficient student. By using the proposed KDnet-RUL, the student network can achieve a performance comparable to the teacher network but with 12.8 times fewer weights and 46.2 times fewer total floating-point operations.

In the future, we will consider a more realistic and challenging scenario in which the training and testing data may come from different distributions, caused by changing environments or varying machines. For instance, suppose we have built a model based on a dataset A collected from machine A. When this RUL prediction model is directly used on a new machine B, which may have different data distributions due to different operation conditions, the performance of the model will degrade significantly. However, collecting a new dataset from machine B to retrain the model is tedious and requires substantial effort. Instead, we intend to transfer the knowledge learned from dataset A to the new machine B without collecting labeled data from machine B, which is known as domain adaptation [43], [44]. In this practical scenario, both model compression and domain adaptation need to be considered.

REFERENCES

[1] F. Yang, M. S. Habibullah, T. Zhang, Z. Xu, P. Lim, and S. Nadarajan, “Health index-based prognostics for remaining useful life predictions in electrical machines,” IEEE Trans. Ind. Electron., vol. 63, no. 4, pp. 2633–2644, Apr. 2016.
[2] J. Sikorska, M. Hodkiewicz, and L. Ma, “Prognostic modelling options for remaining useful life estimation by industry,” Mech. Syst. Signal Process., vol. 25, no. 5, pp. 1803–1836, 2011.
[3] X.-S. Si, W. Wang, C.-H. Hu, and D.-H. Zhou, “Remaining useful life estimation: A review on the statistical data driven approaches,” Eur. J. Oper. Res., vol. 213, no. 1, pp. 1–14, 2011.
[4] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage propagation modeling for aircraft engine run-to-failure simulation,” in Proc. Int. Conf. Prognostics Health Manage., 2008, pp. 1–9.
[5] F. O. Heimes, “Recurrent neural networks for remaining useful life estimation,” in Proc. IEEE Int. Conf. Prognostics Health Manage., 2008, pp. 1–6.
[6] R. K. Singleton, E. G. Strangas, and S. Aviyente, “Extended Kalman filtering for remaining-useful-life estimation of bearings,” IEEE Trans. Ind. Electron., vol. 62, no. 3, pp. 1781–1790, Mar. 2015.
[7] N. Li, Y. Lei, L. Guo, T. Yan, and J. Lin, “Remaining useful life prediction based on a general expression of stochastic process models,” IEEE Trans. Ind. Electron., vol. 64, no. 7, pp. 5709–5718, Jul. 2017.
[8] R. Khelif, B. Chebel-Morello, S. Malinowski, E. Laajili, F. Fnaiech, and N. Zerhouni, “Direct remaining useful life estimation based on support vector regression,” IEEE Trans. Ind. Electron., vol. 64, no. 3, pp. 2276–2285, Mar. 2017.
[9] C. Sun, M. Ma, Z. Zhao, S. Tian, R. Yan, and X. Chen, “Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing,” IEEE Trans. Ind. Inform., vol. 15, no. 4, pp. 2416–2425, Apr. 2019.
[10] Z. Tian, “An artificial neural network method for remaining useful life prediction of equipment subject to condition monitoring,” J. Intell. Manuf., vol. 23, no. 2, pp. 227–237, 2012.
[11] M. Xia, T. Li, T. Shu, J. Wan, C. W. De Silva, and Z. Wang, “A two-stage approach for the remaining useful life prediction of bearings using deep neural networks,” IEEE Trans. Ind. Inform., vol. 15, no. 6, pp. 3703–3711, Jun. 2019.
[12] B. Yang, R. Liu, and E. Zio, “Remaining useful life prediction based on a double-convolutional neural network architecture,” IEEE Trans. Ind. Electron., vol. 66, no. 12, pp. 9521–9530, Dec. 2019.
[13] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: A convolutional neural-network approach,” IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 98–113, Jan. 1997.
[14] G. S. Babu, P. Zhao, and X.-L. Li, “Deep convolutional neural network based regression approach for estimation of remaining useful life,” in Proc. Int. Conf. Database Syst. Adv. Appl., 2016, pp. 214–228.
[15] J. Zhu, N. Chen, and W. Peng, “Estimation of bearing remaining useful life based on multiscale convolutional neural network,” IEEE Trans. Ind. Electron., vol. 66, no. 4, pp. 3208–3216, Apr. 2019.
[16] S. Zheng, K. Ristovski, A. Farahat, and C. Gupta, “Long short-term memory network for remaining useful life estimation,” in Proc. IEEE Int. Conf. Prognostics Health Manage., 2017, pp. 88–95.
[17] C.-G. Huang, H.-Z. Huang, and Y.-F. Li, “A bidirectional LSTM prognostics method under multiple operational conditions,” IEEE Trans. Ind. Electron., vol. 66, no. 11, pp. 8792–8802, Nov. 2019.
[18] Z. Chen, M. Wu, R. Zhao, F. Guretno, R. Yan, and X. Li, “Machine remaining useful life prediction via an attention based deep learning approach,” IEEE Trans. Ind. Electron., vol. 68, no. 3, pp. 2521–2531, Mar. 2021.
[19] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” in Proc. Int. Conf. Learn. Representations, 2014. [Online]. Available: https://arxiv.org/abs/1412.6115
[20] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4820–4828.
[21] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” 2017, arXiv:1710.09282.
[22] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” 2015, arXiv:1510.00149.
[23] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, “Predicting parameters in deep learning,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2148–2156.
[24] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1269–1277.
[25] M. R. U. Saputra, P. P. de Gusmao, Y. Almalioglu, A. Markham, and N. Trigoni, “Distilling knowledge from a deep pose regressor network,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 263–272.
[26] W.-C. Chen, C.-C. Chang, and C.-R. Lee, “Knowledge distillation with feature maps for image classification,” in Proc. Asian Conf. Comput. Vis., 2018, pp. 200–215.
[27] Y. Lei, N. Li, L. Guo, N. Li, T. Yan, and J. Lin, “Machinery health prognostics: A systematic review from data acquisition to RUL prediction,” Mech. Syst. Signal Process., vol. 104, pp. 799–834, 2018.
[28] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2006, pp. 535–541.
[29] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. NIPS Deep Learn. Representation Learn. Workshop, vol. 1050, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” in Proc. Int. Conf. Learn. Representations, 2015. [Online]. Available: https://arxiv.org/abs/1412.6550
[31] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in Proc. 5th Int. Conf. Learn. Representations, 2017. [Online]. Available: https://arxiv.org/abs/1612.03928
[32] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” in Proc. Int. Conf. Learn. Representations, 2020. [Online]. Available: https://arxiv.org/abs/1910.10699
[33] L. Gao, H. Mi, B. Zhu, D. Feng, Y. Li, and Y. Peng, “An adversarial feature distillation method for audio classification,” IEEE Access, vol. 7, pp. 105319–105330, 2019.
[34] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 742–751.
[35] Y. Kim, “Convolutional neural networks for sentence classification,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, pp. 1746–1751.
[36] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4133–4141.
[37] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1195–1204.
[38] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born again neural networks,” in Proc. 35th Int. Conf. Mach. Learn., vol. 80, 2018, pp. 1602–1611.
[39] X. Li, Q. Ding, and J.-Q. Sun, “Remaining useful life estimation in prognostics using deep convolution neural networks,” Rel. Eng. Syst. Saf., vol. 172, pp. 1–11, 2018.
[40] C. Zhang, P. Lim, A. K. Qin, and K. C. Tan, “Multiobjective deep belief networks ensemble for remaining useful life estimation in prognostics,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2306–2318, Oct. 2017.
[41] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 2208–2217.
[42] B. Sun and K. Saenko, “Deep CORAL: Correlation alignment for deep domain adaptation,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 443–450.
[43] X. Li, W. Zhang, N.-X. Xu, and Q. Ding, “Deep learning-based machinery fault diagnostics with domain adaptation across sensors at different places,” IEEE Trans. Ind. Electron., vol. 67, no. 8, pp. 6785–6794, Aug. 2020.
[44] M. Ragab et al., “Contrastive adversarial domain adaptation for machine remaining useful life prediction,” IEEE Trans. Ind. Informat., to be published, doi: 10.1109/TII.2020.3032690.

Chao Wang (Senior Member, IEEE) received the B.S. and Ph.D. degrees in computer science from the University of Science and Technology of China, Hefei, China, in 2006 and 2011, respectively. He is currently an Associate Professor with the University of Science and Technology of China. He was a Visiting Scholar with the Electrical and Computer Engineering Department, University of California at Santa Barbara, Santa Barbara, CA, USA, from 2015 to 2016. His research interests focus on multicore and reconfigurable computing. Dr. Wang serves as an Associate Editor for the ACM Transactions on Design Automation for Electronics Systems (ACM TODAES), IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Microprocessors & Microsystems, and IET Computers & Design Techniques. He was the publicity chair of HiPEAC 2015 and ISPA 2014. He is a senior member of the ACM and CCF.