Sequential Multi-Objective Multi-Agent Reinforcement Learning Approach For Predictive Maintenance
* Corresponding author.
E-mail address: cliu647@cityu.edu.hk
1. Introduction
Predictive maintenance (PdM) for aircraft engines plays a pivotal role in averting UR, i.e., scenarios where
engines run until they have reached the end of their operational lifespan [1]. Estimating the
remaining useful life (RUL) and planning PdM for aircraft rely on the wealth of signal data derived from
the numerous sensors embedded throughout the aircraft [2].
State-of-the-art in PdM: From Traditional Reliability Models to Data-Driven Approaches
In recent years, despite extensive exploration of PdM across various intricate
facilities beyond turbofan engines alone, many maintenance planning models have relied on simplified
assumptions regarding system or structural degradation [3] [4] [5] [6]. These studies often do not
fully utilize the wealth of sensor data gathered from realistic systems. Processes such as the
Gamma process, Wiener process, non-homogeneous Poisson process, or Markov process are
commonly presumed to govern degradation in PdM research [7]. With the advancement of data-driven
technologies, there exists a significant opportunity to utilize real data for analyzing
degradation and extracting crucial information related to system RUL [8] [9], which can be
leveraged for PdM [10]. However, only a few studies have effectively integrated data-driven
approaches for RUL prognostics into predictive maintenance strategies [11]. In a recent study,
researchers built a probabilistic RUL prognostics model using CNN for turbofan engines [12],
optimizing maintenance planning and showcasing substantial cost reductions compared to
traditional methods, highlighting the benefits of data-driven PdM. Additionally, a method
for estimating RUL of Unmanned Aircraft Systems (UAS) using vibration data was developed,
which calculates RUL upon exceeding a specified threshold [13], demonstrating its efficacy in
predicting system degradation and PdM in UAS operations. Another study presents a method
for PdM of industrial robots in intelligent manufacturing [14], combining data and knowledge
to predict faults. By utilizing deep learning and knowledge graphs, the approach can
automatically formulate PdM strategies.
Uncertainty Quantification in RUL Prediction for Robust Maintenance Strategies
Nevertheless, existing investigations often simplify RUL predictions to point estimates.
In real-world scenarios, RUL predictions are influenced by various factors such as
sensor noise, operational variability, and the complex nature of equipment aging, all of which
introduce significant uncertainty. This uncertainty plays a critical role in maintenance decision-
making, as ignoring it can lead to suboptimal strategies. Therefore, quantifying the uncertainty
of RUL in models is essential, as it not only enhances the model's adaptability to real-world
data but also improves the robustness and reliability of decision-making processes. Bayesian
learning is a common method for estimating the uncertainty of neural network outputs. For
instance, an integrated prognostic-driven dynamic PdM framework for industrial systems has
connected degradation features with RULs using Bayesian deep learning [15]. By adjusting
maintenance decisions based on predictive RUL distributions, operational constraints are
efficiently managed and compared against benchmark policies using turbofan engine data.
However, the computational cost of Bayesian neural networks is significant. On the other hand,
Quantile Regression (QR) has emerged as a promising non-parametric probabilistic prediction
method that focuses on forecasting specific quantiles within the forecast distribution, ensuring
robustness by avoiding distributional assumptions and providing accurate probabilistic
forecasts with precise intervals. For example, researchers leveraged a QR-based failure prediction
method to develop a PdM strategy [16], demonstrating its effectiveness and flexibility.
Moreover, a paper proposed a QR-based model to enhance RUL predictions for rolling bearings
by addressing uncertainty [17], which shows high accuracy and reliability, indicating improved
performance in cross-domain prediction tasks. Therefore, in this study, we leverage QR to
estimate the probabilistic distribution of RUL, thereby quantifying the associated uncertainty,
and offering richer information to support the development of maintenance strategies.
Beyond Thresholds: Adaptive PdM with Deep Reinforcement Learning
Despite incorporating probabilistic RUL prognostics into maintenance planning, many
approaches still rely on fixed degradation thresholds for PdM decisions. In [18],
researchers introduced an approach that trains a new model to assess the
probability of the predicted RUL falling below a predefined threshold. However, the essence
still lies in using fixed thresholds [19], which limits applicability in real-world PdM tasks.
Moreover, this reliance on thresholds can make the PdM framework unstable as data complexity
increases. To handle this, recent research has introduced Deep Reinforcement Learning (DRL)
for adaptive PdM [20], eliminating the need for fixed thresholds in determining maintenance
schedules [21]. The authors of [22] discuss using RL for industrial PdM and propose an
approach that combines probabilistic modeling and DRL, demonstrating superior performance
in a turbofan engine case study with enhanced interpretability. Furthermore, recently,
researchers proposed integrating data-driven probabilistic RUL prognostics into PdM using
DRL with Monte Carlo dropout [23]. Also demonstrated on turbofan engines, the approach
reduces total maintenance costs and prevents 95.6% of unscheduled maintenance, offering a
comprehensive roadmap from sensor data to PdM.
Towards Balanced PdM: Multi-Objective Optimization with DRL
Another problem is that most PdM approaches, whether DRL-based or reliant on fixed thresholds,
focus on deciding whether to initiate a replacement action at each
inspection and pay little attention to the cost of the inspections themselves. However, inspecting after
every operational cycle is economically inefficient, especially during the initial stages of
operation when engine degradation is minimal [20]. Consequently, in an adaptive PdM
framework, it is vital to consider multiple objectives simultaneously, which requires
integrating multi-objective optimization into DRL for more balanced and efficient
decision-making [24]. Research on multi-agent reinforcement learning (MARL) has commenced in
various domains, showcasing the capability of DRL in addressing multi-objective PdM tasks.
To address the aforementioned limitations of existing methods and inspired by these
advancements, a new framework is proposed in this paper within the context of turbofan engine
PdM tasks, aiming to bridge the gap in applying MARL to data-driven PdM. We have two main
objectives: the first is to reduce the RUL at the time of replacement in order to maximize engine
utilization, and the second is to minimize the frequency of engine inspections by extending
the intervals between inspections as much as possible. We therefore introduce two RL agents to
achieve these objectives. However, due to the sequential nature of these two objectives, typical
MARL algorithms [25] cannot be applied directly. To address this, we designed the SMOMA-PPO algorithm,
inspired by PPO [26], providing a novel approach to solving such problems. In addition,
inspired by the good performance of the Gated Recurrent Unit (GRU) on RUL prediction tasks [27]
and in order to provide effective information for the RL algorithms to make decisions, QR was
used to construct the GRP model, which outputs a probability distribution of the estimated RUL.
Compared to previous methods, we achieve efficient and stable probabilistic regression and
higher accuracy in the later stages of engine operation. The focus on the later stages
is due to the need for the PdM task to take replacement actions when the RUL is small, which requires
higher precision in the last cycles. Finally, our experimental results demonstrate that our method
significantly reduces the average RUL at replacement time, maximizes inspection intervals, and
substantially decreases overall cost. Our contributions are summarized as follows:
1. Introduction of a new probabilistic RUL prediction method utilizing QR, demonstrating
significantly higher prediction accuracy in the later stages of a system's lifecycle
compared to existing approaches.
2. Innovative utilization of probability distribution functions to fit the results of QR from
the RUL prognostic model, leading to favorable outcomes through the calculation of
cumulative probability values of RUL for constructing RL observation environments.
3. Development of a novel multi-agent reinforcement learning algorithm addressing
multi-objective optimization problems with temporal dependencies, enabling
minimized RUL engine replacements and substantial inspection cost reductions,
offering a new data-driven solution for temporally constrained problems.
2. Problem Statement
PdM of turbofan engines is a critical yet complex task, demanding precise and adaptive
strategies to ensure operational reliability while minimizing maintenance costs. Despite notable
advancements in recent research, several critical challenges remain unresolved, limiting the
practical effectiveness of current approaches:
1. RUL prediction serves as the foundation for maintenance planning. However, existing
methods often simplify RUL predictions to deterministic point estimates, neglecting
the inherent uncertainties present in real-world scenarios. This lack of probabilistic
insight weakens the robustness and reliability of maintenance decisions.
2. Modern turbofan engines generate abundant sensor data. Yet, many existing approaches
fail to fully exploit this wealth of information, relying instead on oversimplified
degradation assumptions or traditional statistical models. This results in suboptimal
feature extraction from degradation signals, limiting the ability to capture subtle
indicators of impending failures.
3. Common PdM strategies rely on predefined RUL thresholds to trigger maintenance
actions. While straightforward, this approach lacks adaptability to dynamic operating
conditions and often becomes unstable as data complexity increases, reducing its real-
world applicability.
4. Most PdM frameworks overlook the economic inefficiencies associated with frequent
inspections. In practice, conducting inspections after every operational cycle is not only
costly but also unnecessary. Furthermore, achieving dual objectives—reducing the
RUL at replacement to maximize engine utilization while simultaneously minimizing
inspection frequency to reduce costs—has been rarely studied due to its inherent
complexity. These objectives are further complicated by their temporal dependencies,
as decisions made to achieve one goal often impact the timing and feasibility of the
other, requiring a sequential resolution strategy that adapts over time.
These challenges underscore the urgent need for a comprehensive, data-driven framework
capable of addressing the uncertainty in RUL predictions, fully utilizing sensor data, and
dynamically balancing interdependent objectives with temporal constraints. To tackle these
issues, this study proposes a novel approach that leverages QR for probabilistic RUL estimation
and integrates it with a MARL framework. By bridging these critical gaps, the proposed
framework aims to significantly improve the adaptability, robustness, and cost-efficiency of
PdM strategies for turbofan engines.
3. Methodology
To achieve efficient and reliable PdM for turbofan engines, this study introduces a novel
framework that integrates probabilistic RUL prediction with a sequential multi-objective multi-
agent RL approach. The framework addresses two critical objectives: (1) reducing the RUL at
replacement to maximize engine utilization, and (2) minimizing inspection frequency to reduce
operational costs. These objectives are achieved through the combination of advanced RUL
prediction techniques and a MARL framework specifically designed to handle sequential,
interdependent decision-making tasks.
As illustrated in Fig. 1, the framework begins with the extraction of essential information from
raw sensor data using a GRP model (top-right corner of Fig. 1). This model utilizes QR to
estimate the probability distribution of the RUL, allowing the system to quantify prediction
uncertainty and generate richer information for decision-making. The resulting RUL
distribution is then transformed into cumulative probability values for specific RUL intervals
(e.g., RUL = 0 to 10 cycles), as shown in the green box on the left of Fig. 1. These cumulative
probabilities compactly encode both the likelihood of failure and the uncertainty, serving as the
input state representation for the RL agents. The bottom section of Fig. 1 illustrates the
SMOMA-RL framework, where two RL agents collaborate to achieve the dual objectives:
1. Agent 1 determines whether to replace the engine during the current cycle, aiming to
minimize RUL at replacement while ensuring safety.
2. Agent 2 predicts the optimal time for the next inspection, maximizing the interval
between inspections while maintaining reliability.
A key innovation of this framework lies in the sequential dependency between the 2 agents: the
inspection intervals determined by Agent 2 affect the observations available to Agent 1, while
the replacement action of Agent 1 resets the inspection process for Agent 2. This
interdependence requires a careful design of the RL algorithm. To address this, the proposed
SMOMA-PPO algorithm adapts Proximal Policy Optimization (PPO) to handle multi-agent,
multi-objective tasks with temporal dependencies.
The following sections detail the technical components of the framework, starting with the GRP
model (Section 3.1) and progressing to the SMOMA-PPO algorithm (Section 3.2). Together,
these components enable a robust and adaptive solution for real-world PdM tasks.
Fig. 1. Overall process diagram of this study. Sensor signals were obtained from the Turbofan
Engine, then key features were extracted using a deep learning model based on GRU.
Subsequently, quantile regression and function fitting were employed to estimate the RUL
range, which is utilized as the observation state for MARL agents for executing PdM actions.
3.1. GRU-based Probabilistic RUL Prediction Model
3.1.1. Point Estimation & Probabilistic RUL prediction
At timestamp $t$, the real RUL $y_t$ is given by Eq. (1).
$$y_t = f(x_t; \theta) + \varepsilon_t \qquad (1)$$
where $f(x_t; \theta)$ is a deep network with input $x_t$ and learnable parameters $\theta$, and $\varepsilon_t$ is noise. $x_t$ can
contain multiple historical observations $z_t, \ldots, z_{t-l+1}$, where $l$ is the sliding window length.
The noise is usually assumed to follow a Normal distribution, $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$. Thus, the parameters
$\theta$ of the deep network $f(\cdot)$ can be estimated by solving Eq. (2).
$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i; \theta) \right)^2 \qquad (2)$$
where $\{x_i, y_i\}_{i=1}^{N}$ are training samples. After obtaining the estimated $\theta^*$, $\hat{y}_t = f(x_t; \theta^*)$ can be
calculated as the predicted RUL at $t$. $\hat{y}_t$ is said to be a point estimate of $y_t$. At time $t$, the
probabilistic RUL prediction is made as Eq. (3).
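$$\mathbb{P}(y_t \mid x_t) = \mathbb{P}(y_t \mid z_t, \ldots, z_{t-l+2}, z_{t-l+1}) \qquad (3)$$
Once the conditional distribution is available, the point estimate $\hat{y}_t$ can be obtained as its expectation, as in Eq. (4).
$$\hat{y}_t = \mathbb{E}[y_t \mid x_t] \qquad (4)$$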
Thus, the main task is to design a deep network $f(x_t; \theta)$ that incorporates historical observations
and RULs to model a conditional distribution. However, traditional point estimation methods
fail to capture the uncertainty inherent in the prediction process, as discussed in the
introduction. To quantify the uncertainty in predictions, our work proposes a probabilistic RUL
prediction model that generates a posterior estimated RUL distribution using QR.
QR for distribution estimation: Multiple predicted RULs at different quantile levels can be obtained by
the QR method. Given $\hat{y}_i^{q}$ output by $f(x_i; \theta)$ at a quantile level $q \in [0, 1]$ and the real RUL $y_i$,
the QR loss is given by Eq. (5).
$$L_q(y_i, \hat{y}_i^{q}) = q\,(y_i - \hat{y}_i^{q})^{+} + (1 - q)\,(\hat{y}_i^{q} - y_i)^{+} \qquad (5)$$
where $(y_i)^{+} = \max(0, y_i)$. Thus, the final optimization problem using the QR loss is given by Eq. (6).
$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} L_{q_m}(y_i, \hat{y}_i^{q_m}) \qquad (6)$$
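As a minimal illustration of the quantile loss in Eq. (5) and its aggregation in Eq. (6), the following pure-Python sketch evaluates the loss on toy numbers. The function names `pinball_loss` and `total_qr_loss` and the data layout are assumptions made for this example, not the authors' implementation.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss of Eq. (5) for one sample at quantile level q."""
    under = max(0.0, y_true - y_pred)  # (y - y_hat)^+ : prediction too low
    over = max(0.0, y_pred - y_true)   # (y_hat - y)^+ : prediction too high
    return q * under + (1.0 - q) * over


def total_qr_loss(y_true, y_pred_by_q):
    """Loss of Eq. (6): sum over quantile levels, averaged over the N samples.

    y_true: list of N true RULs.
    y_pred_by_q: dict mapping a quantile level q_m to a list of N predictions
                 (hypothetical layout chosen for this sketch).
    """
    n = len(y_true)
    total = 0.0
    for q, preds in y_pred_by_q.items():
        for y, y_hat in zip(y_true, preds):
            total += pinball_loss(y, y_hat, q)
    return total / n
```

For a high quantile level such as q = 0.9, under-prediction is penalized nine times more heavily than over-prediction, which pushes the corresponding output towards the upper part of the RUL distribution.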
where $W_r$ and $U_r$ are weight matrices and $\sigma$ is the logistic sigmoid. $\tilde{h}_t$ is then generated from $r_t$
with a tanh layer as Eq. (8).
$$\tilde{h}_t = \tanh(W (r_t * h_{t-1}) + U x_t) \qquad (8)$$
The GRU uses an update gate $z_t$ in place of the input gate and forget gate of the LSTM, as in Eq. (9).
$$z_t = \sigma(W_z h_{t-1} + U_z x_t) \qquad (9)$$
Finally, the hidden state is updated as Eq. (10).
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \qquad (10)$$
Compared to LSTM, the GRU offers a more compact and efficient architecture with fewer parameters,
which often yields comparable or improved performance across various tasks.
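The update in Eqs. (8) to (10) can be sketched as a single GRU step in pure Python. The reset gate is written here in the standard form $r_t = \sigma(W_r h_{t-1} + U_r x_t)$, which is an assumption mirroring Eq. (9); the helper names and the dictionary of weights are likewise hypothetical.

```python
import math


def sigmoid(v):
    """Element-wise logistic sigmoid of a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def gru_step(h_prev, x, p):
    """One GRU update; p maps 'Wr','Ur','Wz','Uz','W','U' to weight matrices."""
    # Reset gate (assumed standard form, same structure as Eq. (9)).
    r = sigmoid(add(matvec(p["Wr"], h_prev), matvec(p["Ur"], x)))
    # Update gate, Eq. (9).
    z = sigmoid(add(matvec(p["Wz"], h_prev), matvec(p["Uz"], x)))
    # Candidate state, Eq. (8): tanh(W (r * h_prev) + U x).
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in add(matvec(p["W"], gated), matvec(p["U"], x))]
    # New hidden state, Eq. (10).
    return [(1.0 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h_prev, h_tilde)]
```

With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous hidden state, which is a convenient sanity check.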
3.1.3. Self-Attention Mechanism
The attention mechanism [32, 33] mimics the human tendency to focus on specific
regions of an image during recognition, so that different regions of an image can be
assigned distinct weights. In RUL prediction, it can be leveraged to assign varying weights
to different features at different time steps. The features learned by the GRU network for a
sample are denoted as $H = \{h_1, h_2, \ldots, h_d\}^{T}$, where $h_i \in \mathbb{R}^{n}$ and $n$ denotes the number of
sequential steps. The importance of different sequential steps for the $i$th input $h_i$ is then given by Eq.
(11), where $W$ is the weight and $b$ is the bias.
$$s_i = \Phi(W^{\top} h_i + b) \qquad (11)$$
The score function $\Phi(\cdot)$ is an activation function within neural networks. Subsequently, $s_i$ is
normalized as Eq. (12).
$$a_i = \mathrm{softmax}(s_i) = \frac{\exp(s_i)}{\sum_{j} \exp(s_j)} \qquad (12)$$
The final output $O$ of the self-attention operation is given by Eq. (13), where $A = \{a_1, a_2, \ldots, a_d\}$ and $\otimes$ is
element-wise multiplication.
$$O = H \otimes A \qquad (13)$$
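Eqs. (11) to (13) can be illustrated with a small pure-Python sketch. It assumes that the score function Φ is tanh and that each scalar weight a_i rescales the whole vector h_i; both choices, as well as the function name, are assumptions made for this example.

```python
import math


def attention_weights_and_output(H, W, b):
    """Apply Eqs. (11)-(13) to a list H of feature vectors h_i.

    H: list of d vectors, W: weight vector of the same length as each h_i,
    b: scalar bias. Returns the attention weights A and the weighted output O.
    """
    # Eq. (11): scalar score per h_i, with tanh as the assumed score function.
    scores = [math.tanh(sum(w * h for w, h in zip(W, h_i)) + b) for h_i in H]
    # Eq. (12): softmax, shifted by the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    A = [e / total for e in exps]
    # Eq. (13): scale each h_i by its attention weight a_i.
    O = [[a * h for h in h_i] for a, h_i in zip(A, H)]
    return A, O
```

Identical inputs receive identical weights, and the weights always sum to one, so the operation redistributes emphasis across the feature vectors without changing their overall scale by more than a factor of one.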
3.1.4. GRU-based RUL Prediction Model
The GRP model integrates advanced deep learning techniques with handcrafted feature
extraction to enhance the accuracy and reliability of RUL prediction. As shown in Fig. 2, raw
sensor data is first fed into a GRU network to extract sequential features that capture temporal
degradation patterns. These features are further refined using a self-attention mechanism to
assign attention weights to different time steps, emphasizing the most relevant information for
prediction. Meanwhile, handcrafted features, i.e., the mean and trend coefficients of signals
from sliding windows, are extracted to provide additional degradation-related insights. These
handcrafted features have been shown to improve RUL prediction performance [34].
Fig. 2. GRU-based probabilistic RUL prediction model; the outputs are the quantiles of the
predicted RULs.
To maximize the utilization of all available information, a feature fusion framework is proposed,
combining the sequential features learned by the GRU network with the handcrafted features.
The fused feature set is then processed by fully connected layers (FC) and a regression layer to
predict RUL quantiles, specifically the 10th, 30th, 50th, 70th, and 90th percentiles. These
quantiles provide a probabilistic representation of the RUL, enabling the model to capture both
point estimates and uncertainty. To address potential overfitting and improve model
generalization, dropout [35] is applied after the FC layers.
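The handcrafted features mentioned above (the mean and the trend coefficient of a signal within a sliding window) can be sketched as follows. The trend coefficient is taken here to be the slope of a least-squares line fitted against the time index, which is an assumption, as the text does not define it; the function name is also hypothetical.

```python
def window_mean_and_trend(signal):
    """Return (mean, slope) of one signal inside one sliding window.

    The slope is the ordinary least-squares slope of the signal against the
    time index 0, 1, ..., n-1 (assumed definition of the trend coefficient).
    """
    n = len(signal)
    t_mean = (n - 1) / 2.0
    y_mean = sum(signal) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(signal))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var if var > 0 else 0.0  # a single-point window has no trend
    return y_mean, slope
```

Applied to each of the signals in a window, this yields two handcrafted values per signal that can be concatenated with the learned sequential features before the fully connected layers.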
Fig. 3. Model prediction of engine 53 in FD001, where the purple region represents the 10%-
90% confidence interval of RUL, and the blue area represents the 30%-70% interval.
Fig. 5. The RUL probability distribution of the engine at different cycles, obtained by
Normal distribution fitting.
Fig. 5 demonstrates the fitted RUL distributions for various cycles of engine 53 from the FD001
dataset. The peak of the distribution shifts leftward over time, accurately capturing the
progressive decline in the most probable RUL as the engine approaches failure. These
distributions are critical for transforming RUL predictions into actionable insights.
Fig. 6. The curve of RUL cumulative probability distribution of the engine at different cycles
by Normal distribution fitting.
For the downstream PdM task, the continuous RUL probability distributions are further
transformed into cumulative probability values. These cumulative probabilities represent the
likelihood of the engine's RUL falling within specific intervals (e.g., RUL = 1–10 cycles) and
provide a compact, uncertainty-aware state representation that is suitable for RL agents. Fig. 6
illustrates the cumulative probability curves for different cycles, showing the evolution of
failure likelihood as the engine degrades. This transformation not only captures prediction
uncertainty but also encodes critical information for downstream decision-making.
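A minimal sketch of this transformation, assuming a fitted Normal distribution with mean `mu` and standard deviation `sigma`, is given below. The grid of thresholds (RUL ≤ 1, ..., RUL ≤ 10) and the stacking of the 5 most recent inspections into one state vector are assumptions chosen for illustration; the function names are hypothetical.

```python
from statistics import NormalDist


def cumulative_rul_features(mu, sigma, thresholds=range(1, 11)):
    """Cumulative probabilities P(RUL <= k) for each threshold k."""
    dist = NormalDist(mu, sigma)
    return [dist.cdf(k) for k in thresholds]


def build_state(history, n_steps=5):
    """Concatenate the features of the n_steps most recent RUL distributions.

    history: list of (mu, sigma) pairs, oldest first (hypothetical layout).
    """
    state = []
    for mu, sigma in history[-n_steps:]:
        state.extend(cumulative_rul_features(mu, sigma))
    return state
```

Early in an engine's life the fitted mean is far above 10 cycles, so all entries are close to zero; they rise towards one as the predicted RUL approaches the end of life, which gives the agents a compact signal of imminent failure.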
In this study, once agent 2 chooses an action, the hidden state $\rho_{t+1}$ and the observed state $s_{t+1}$
are also updated accordingly. The transition probability $P(\cdot \mid s_t, a_t^{r}, a_t^{p})$ in this task decides the
next state $s_{t+1}$ and is deterministic, which means that the environment transitions from the current
state $s_t$ to the next state $s_{t+1}$ according to the order of the sampled states at each epoch. If the
engine is replaced at decision step $t$, the next step considers a new engine from the C-MAPSS
data set. Otherwise, we further obtain sensor measurements $x_{t+1}$ during the next cycle and
update the distribution of the RUL (the next state $s_{t+1}$) by generating new RUL prognostics
using the GRP model. When interacting with the environment, the trajectory $\tau$ is recorded,
which is a sequence of interactions from the state $s_0$ to the terminal state $s_m$:
$\tau = \{(s_t, a_t^{r}, a_t^{p}, r_t)\}_{0}^{m}$. It can be used for optimizing the two agents' policies $\pi_t^{r}$ and $\pi_t^{p}$. In summary, the
reward function of this PdM task can be formulated as the weighted sum of the aforementioned
two rewards as Eq. (16).
$$R_t = \beta_1 r_t^{r} + \beta_2 r_t^{p} \qquad (16)$$
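$$+ \frac{1}{2T} \left\{ \sum_{t=1}^{T} \sum_{i=r,p} \min\left[ \chi_i(\varphi_i, t)\,\hat{A}_t^{i},\ \mathrm{clip}_{1-\varepsilon}^{1+\varepsilon}\{\chi_i(\varphi_i, t)\}\,\hat{A}_t^{i} \right] \right\} \qquad (18)$$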
where $\min[x, y]$ represents the minimum value between $x$ and $y$,
$\chi_i(\varphi_i, t) = \pi_{\varphi_i}(a_t^{i} \mid s_t) / \pi_{\varphi_i^{\mathrm{old}}}(a_t^{i} \mid s_t)$ is the probability ratio between the current policy and the policy before the update, and $\varepsilon$ is the PPO clip ratio. Moreover, the entropy
$H[\pi_{\varphi_i}(a_t^{i} \mid s_t)]$ with coefficient $\epsilon$ can be formulated as Eq. (19).
and $P_{\pi_{\varphi_i}}$ is the probability distribution of the strategy. On the other hand, the objective of the
critic network is to minimize the mean squared error of the value function as Eq. (20).
$$\mathcal{L}(\phi) = \frac{\sum_{t=1}^{T} \sum_{i=r,p} \left( V_{\phi_i}(s_t) - \tilde{V}_{\phi_i}(s_t) \right)^2}{2T} \qquad (20)$$
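The clipped surrogate term of Eq. (18) and the critic loss of Eq. (20) can be sketched in pure Python as follows. The sketch covers only these two terms (not the entropy bonus), takes the probability ratios as given inputs, and uses a default clip ratio of 0.2; the function names, the dictionary layout keyed by agent, and the default value are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """min[chi * A, clip(chi, 1 - eps, 1 + eps) * A] for one agent and one step."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def actor_objective(ratios, advantages, eps=0.2):
    """Clipped term of Eq. (18), summed over agents and steps, divided by 2T.

    ratios, advantages: dicts mapping an agent key (e.g. 'r', 'p') to a list
    of length T (hypothetical layout for this sketch).
    """
    T = len(next(iter(ratios.values())))
    total = sum(
        clipped_surrogate(r, a, eps)
        for i in ratios
        for r, a in zip(ratios[i], advantages[i])
    )
    return total / (2 * T)


def critic_loss(values, targets):
    """Eq. (20): squared value errors over both agents and steps, divided by 2T."""
    T = len(next(iter(values.values())))
    total = sum(
        (v - g) ** 2 for i in values for v, g in zip(values[i], targets[i])
    )
    return total / (2 * T)
```

The clipping removes the incentive to move the ratio outside the interval [1 − ε, 1 + ε]: with a positive advantage a ratio of 1.5 is credited only as 1.2, and with a negative advantage a ratio of 0.5 is penalized as if it were 0.8.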
With the configurations detailed above, the SMOMA-PPO approach is summarized in
Algorithm 1. The algorithm begins by orthogonally initializing the network parameters of
both the actor and critic networks. Throughout the training cycle, the two agents gather trajectory
segments of length $T$ in each iteration. Each agent, based on its local observation $s_t$, selects an
action $a_t^{i}$ to interact with the environment via the actor network. Subsequently, the agents
acquire and store new observations, states, and rewards during this interaction, repeating this
loop until reaching time step $T$ to construct a full trajectory. Using the gathered trajectory
set, the algorithm employs the GAE method to approximate the advantage function $\hat{A}$.
Following this, both the actor and critic networks are optimized using
the Adam optimization method, with gradient clipping and a decaying learning rate.
This training procedure is repeated for a predetermined number of episodes denoted by E.
Algorithm 1: SMOMA-PPO
Initialization: Initialize parameters 𝜑 for 𝜋 and 𝜙 for 𝑉.
Set learning rate lr.
for episode = 1 → E do
    Set data buffer D = {}.
    for n = 1 → batch size do
        Set empty list τ = [ ].
        for t = 1 → T do
            for agent i in {r, p} do
                P_i(t) = 𝜋_{𝜑_i}(a^i_{n,t} ∣ s_{n,t}).
                a^i_{n,t} ∼ P_i(t).
                V_i(t) = V_{𝜙_i}(s_{n,t}).
            end for
            Execute a_{n,t}.
            Obtain R(t).
            Observe s_{t+1}.
            τ += [s_t, a_t, V(t), R(t), s_{t+1}].
        end for
        Compute advantage estimate Â by GAE on τ.
        Compute Ṽ_𝜙 with value normalization on τ.
        Split trajectory τ into chunks of length L.
        for l = 0, 1, …, T//L do
            D = D ∪ (τ[l : l+L], Â[l : l+L], R̂[l : l+L]).
        end for
    end for
    for mini-batch k = 1, …, K do
        b ← random mini-batch from D with all agent data
        for each data chunk c in the mini-batch b do
            Update V from first hidden state in data chunk
        end for
    end for
    Adam update 𝜙 on L(𝜙) with data b.
    Adam update 𝜑 on L(𝜑) with data b.
end for
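The GAE step used in Algorithm 1 to compute the advantage estimate can be sketched as below. This is the standard generalized advantage estimation recursion; the discount factor `gamma`, the smoothing factor `lam`, their default values, and the function name are assumptions, since the hyperparameters are not given in this section.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate of the state after the final step
    (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma = lam = 1` and zero value estimates, the advantage at each step reduces to the sum of the remaining rewards, which gives a quick way to check the recursion.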
The policy network in SMOMA-PPO, also known as the actor network, is structured as shown in
Table 1 and Table 2, while the critic network structure is given in Table 3. The input
dimension of both the actor and critic networks is set to 50 because the agent uses
information from the RUL distributions of the previous 5 timesteps for decision-making. For
each timestep, we calculate the cumulative probability of RUL ranging from 0 to 10 based on
the RUL distribution, which yields 10 values per timestep and therefore a total of 5×10=50 input features. The critic network
structures for Agent 1 and Agent 2 are the same. Agent 1 has only 2 actions (replace or do not replace),
so the output dimension of actor network 1 is 2. For Agent 2, the
output dimension of its actor network is set to 50 because we assume that its predicted inspection
intervals are integers from 1 to 50.
Table 1. Actor Network Structure for Agent 1
Layer  Type    Input Size  Output Size  Activation  Parameter Size
1      Linear  50          128          Tanh        6400
2      Linear  128         128          Tanh        16384
3      Linear  128         2            Tanh        256
Table 2. Actor Network Structure for Agent 2
Layer  Type    Input Size  Output Size  Activation  Parameter Size
1      Linear  50          128          Tanh        6400
2      Linear  128         128          Tanh        16384
3      Linear  128         50           Tanh        6400
Table 3. Critic Network Structure for Agent 1 & 2
Layer  Type    Input Size  Output Size  Activation  Parameter Size
1      Linear  50          256          Tanh        12800
2      Linear  256         64           Tanh        16384
3      Linear  64          1            Tanh        64
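The parameter sizes listed in Tables 1 to 3 can be cross-checked with a few lines of Python. The listed values coincide with the number of weights per linear layer (input size × output size), without bias terms; whether biases are used is not stated, so the sketch below counts weights only. The function and variable names are hypothetical.

```python
def weight_counts(layer_sizes):
    """Number of weights (input size x output size) of each linear layer."""
    return [n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


actor_1 = weight_counts([50, 128, 128, 2])   # Table 1
actor_2 = weight_counts([50, 128, 128, 50])  # Table 2
critic = weight_counts([50, 256, 64, 1])     # Table 3
```

The three lists reproduce the Parameter Size columns of the tables row by row.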
5. Experiment
5.1. Data Description
The proposed approach is evaluated on the C-MAPSS dataset [36], which is widely used in the
Prognostics and Health Management (PHM) and PdM communities and describes the
degradation of engines, as depicted in Fig. 1. To monitor the engine's condition, 21 sensors are
positioned to measure parameters like temperature, pressure, and speed. The dataset comprises
four subsets, each encompassing distinct operational conditions and fault types, as shown in
Table 4. In every subset, a training file records sensor data during the run-to-fail experiments
for a certain number of engines, while a testing file contains sensor measurements for specific
running cycles of another set of engines.
Table 4. C-MAPSS Dataset Description.
Sub-dataset              FD001  FD002  FD003  FD004
Training engines number  100    260    100    249
Testing engines number   100    259    100    248
Operational conditions   1      6      1      6
Fault modes              1      1      2      2
As explained in Section 3, given our objective of life-long PdM, we focus on utilizing
run-to-fail engines. Therefore, the "training file" from C-MAPSS was selected and partitioned into a
training dataset for model training and a testing dataset for testing. Since different operating
conditions can also impact the RUL, we consider the operating settings to be signals for RUL
prediction. Consequently, we utilize data from the 21 sensors and 3 columns of operation
settings, which were treated as 24 signals.
5.2. Data Preprocessing
The sliding window is widely adopted for data segmentation [37, 38]. For run-to-fail engines,
we assign 𝑇 to represent the total number of running cycles, 𝑠 to denote the window size, and
𝑝 to indicate the step size. Each sample will have a size of 𝑠 × 𝑛, where 𝑛 corresponds to signal
types. We have opted for a window size of 60 and a step size of 1 for all subsets, resulting in a
size of 𝑠 × 𝑛 = 60 × 24 for every sample. RUL of the (𝑖 + 1)th sample can be computed as
𝑇 − 𝑠 − 𝑖 × 𝑝. It is essential to note that a piece-wise linear RUL is employed, which ensures
that if the RUL exceeds the maximum RUL, it is capped at the maximum value of 125. To
compare with the performance of related research [23], we adopted the same data partitioning
approach, i.e., the training files are divided into two parts: the first 50% is allocated as a training
dataset, and the remainder as a testing dataset. The numbers of training samples produced by the
fixed-size window for the four subsets are 6,959, 19,188, 10,131, and 23,036. (For FD004, the training file contains
records for 249 engines, of which the first 124 are used for training.) Similarly, the numbers of
testing samples for the four subsets are 7,772, 19,231, 8,689, and 23,522.
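The segmentation and labelling described above can be sketched as follows, using a window size of 60, a step size of 1, and a maximum RUL of 125 as in the text. The function name and the representation of an engine's history as a plain list of per-cycle records are assumptions for illustration.

```python
def make_samples(cycles, window=60, step=1, max_rul=125):
    """Cut one run-to-failure history into (window, RUL label) samples.

    cycles: list with one record per running cycle (length T).
    The (i + 1)th sample covers cycles i*step ... i*step + window - 1 and is
    labelled with RUL = T - window - i*step, capped at max_rul.
    """
    T = len(cycles)
    samples = []
    i = 0
    while i * step + window <= T:
        start = i * step
        rul = min(T - window - i * step, max_rul)  # piece-wise linear RUL
        samples.append((cycles[start:start + window], rul))
        i += 1
    return samples
```

An engine that runs for T cycles yields T − 60 + 1 samples with a step size of 1; the last sample ends at the failure cycle and is labelled with an RUL of 0.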
5.3. Performance of GRP and Fitting Distribution Selection
Before evaluating the performance of SMOMA-PPO, we first validate the effectiveness of the
GRP model through a comparative study against several state-of-the-art RUL prediction
methods, including CNN-LSTM-SAM [39], GCU-Transformer [40], TCNN-Transformer [41],
and RNN-LSTM [42]. Since these methods did not originally employ QR, their regression
components were adapted to output quantile predictions for a fair comparison. Specifically, the
50th percentile (median) of each method's outputs was used as the predicted RUL, as it
corresponds to the most likely value when the RUL distribution is symmetric and unimodal.
The experiments were conducted using FD001. Given the critical importance of accurate RUL
predictions during the final stages of engine operation, the Root Mean Square Error (RMSE)
[33] was calculated over the last 30 cycles of the engines’ lifecycles. Each method was trained
and tested 50 times to ensure robustness, and the minimum, maximum, median, mean, and
standard deviation of RMSE values were recorded. Additionally, we measured the shortest
training time required for convergence to evaluate the computational efficiency of each method.
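The evaluation metric can be sketched as below. The sketch assumes that the true and predicted RULs of one engine are given in cycle order, so that the last 30 entries correspond to the last 30 cycles of its lifecycle; how the values of several engines are pooled is not specified here, and the function name is hypothetical.

```python
import math


def rmse_last_cycles(y_true, y_pred, last_n=30):
    """RMSE restricted to the final last_n cycles of one engine."""
    pairs = list(zip(y_true, y_pred))[-last_n:]
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))
```

Restricting the error to the final cycles ignores large but harmless deviations early in life and measures accuracy where replacement decisions are actually made.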
Table 5. Comparison Between Popular RUL Prediction Approaches
Approach          Minimum  Maximum  Median  Mean    Standard Deviation  Training Time (s)
GRP               3.1034   3.9173   3.5465  3.5255  0.2070              32.55
CNN-LSTM-SAM      3.1544   4.8418   3.7922  3.7963  0.3622              41.13
GCU-Transformer   3.4375   4.9699   4.2633  4.2670  0.3930              86.39
TCNN-Transformer  3.9385   4.6465   4.3766  4.3726  0.1691              107.87
RNN-LSTM          3.1066   4.9755   4.1581  4.1824  0.4456              113.01
Fig. 8. The comparative results of various RUL prediction methods, with the average RMSE
values obtained by each method connected by a light blue line. On the right side of each box, a
Normal distribution is employed to model the experimental results.
The results, presented in Table 5 and Fig. 8, show that the proposed GRP model
outperformed the other methods in terms of predictive accuracy and
computational efficiency. The GRP model achieved the lowest mean RMSE of 3.5255 with a
standard deviation of 0.2070 (the second lowest after TCNN-Transformer at 0.1691), reflecting both high accuracy and consistency across runs. This
is particularly critical during the final stages of engine operation, where precise RUL
predictions are essential for effective maintenance decisions. Fig. 8 visually compares the
RMSE distributions from 50 experiments for each method using box plots. The GRP model not
only shows a lower median RMSE but also exhibits a narrow interquartile range, indicating
greater reliability and small variability. In addition to accuracy, the GRP model is highly
efficient, requiring only 32.55 seconds for training, significantly less than GCU-Transformer
(86.39 seconds) and TCNN-Transformer (107.87 seconds). This balance between strong
predictive performance and computational efficiency highlights the strengths of the GRP
model’s compact architecture and its well-designed feature fusion framework.
Next, to determine the most suitable probability distribution for fitting the RUL quantiles
obtained by GRP, we evaluated three commonly used symmetric distributions: Normal,
Cauchy, and Laplace. The uniform distribution was excluded due to its inability to capture the
higher density near the median observed in the quantile predictions. Four engines (engine IDs:
10, 15, 20, and 25) were selected using segmented random sampling from the FD001 testing
dataset to ensure representative evaluation and demonstration. For each engine, quantile
predictions at 110, 120, 130, and 140 cycles were used to fit the distributions. The 50th
percentile was set as the central parameter (mean for Normal, location for Cauchy and Laplace),
while gradient descent was employed to optimize the shape parameters (standard deviation for
Normal, scale for Cauchy and Laplace). The absolute fitting error was computed for the 10th,
30th, 50th, 70th and 90th percentiles.
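The fitting procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, learning rate, step count, and the use of a squared-error objective for the gradient descent are all assumptions, since the text only states that gradient descent optimizes the scale parameter and that the absolute error is reported. The location is pinned to the 50th-percentile prediction, and each candidate distribution's quantile is written as location plus scale times the standard quantile.

```python
import math
from statistics import NormalDist

LEVELS = [0.1, 0.3, 0.5, 0.7, 0.9]  # percentiles used for the fit


def std_quantile(dist, p):
    """Quantile of the standard (location 0, scale 1) distribution."""
    if dist == "normal":
        return NormalDist().inv_cdf(p)
    if dist == "cauchy":
        return math.tan(math.pi * (p - 0.5))
    if dist == "laplace":
        return -math.copysign(1.0, p - 0.5) * math.log(1.0 - 2.0 * abs(p - 0.5))
    raise ValueError(f"unknown distribution: {dist}")


def fit_scale(dist, quantiles, levels=LEVELS, lr=0.1, steps=500):
    """Fix the location at the median prediction and fit the scale.

    Gradient descent minimizes the mean squared quantile error (an
    assumed objective); the mean absolute error is returned for reporting.
    """
    z = [std_quantile(dist, p) for p in levels]
    loc = quantiles[levels.index(0.5)]
    scale = 1.0
    for _ in range(steps):
        grad = sum(2.0 * (loc + scale * zi - qi) * zi
                   for zi, qi in zip(z, quantiles)) / len(z)
        scale -= lr * grad
    abs_err = sum(abs(loc + scale * zi - qi)
                  for zi, qi in zip(z, quantiles)) / len(z)
    return loc, scale, abs_err


# Synthetic quantile predictions (hypothetical values, not from the paper).
synthetic = [50.0 + 5.0 * NormalDist().inv_cdf(p) for p in LEVELS]
results = {d: fit_scale(d, synthetic) for d in ("normal", "laplace", "cauchy")}
```

On these synthetic Normal quantiles, the Normal fit recovers the generating scale and leaves a smaller absolute error than the Laplace and Cauchy fits; this only demonstrates the mechanics and says nothing about the reported errors in Fig. 9.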
Fig. 9. The performances of fitting quantile regression results using different probability
distributions.
As shown in Fig. 9, the Normal distribution exhibited the best performance, achieving the
lowest absolute error at convergence. The error stabilized at approximately 3×10⁻⁴, which was
an order of magnitude lower than the errors of the Laplace and Cauchy distributions. The
Normal distribution also demonstrated consistent behavior across all tested engines and cycles,
confirming its reliability for fitting RUL quantiles obtained by GRP. The fitted
Normal distribution was therefore used to calculate the cumulative probability values, providing a
compact and uncertainty-aware representation of RUL predictions that supports informed
decision-making in the following PdM task.
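As an illustration of how cumulative probability values can be read off a fitted Normal distribution, the sketch below evaluates P(RUL ≤ h) at a few horizons. The parameter values and the horizons are hypothetical; the text does not state at which points the cumulative probabilities are evaluated.

```python
from statistics import NormalDist


def cumulative_probabilities(mu, sigma, horizons):
    """Return P(RUL <= h) for each horizon h under Normal(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return [dist.cdf(h) for h in horizons]


# Hypothetical fitted parameters and horizons (in cycles).
probs = cumulative_probabilities(mu=20.0, sigma=4.0, horizons=[10, 20, 30])
```

The values increase with the horizon, and the probability at the fitted mean is exactly 0.5, so a small vector of such values summarizes both the point estimate and its uncertainty.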
We next compare the performance of all PdM methods, including both RL approaches and
RUL prediction methods, using Ideal Maintenance and Corrective Maintenance as
baselines. Ideal Maintenance (at true RUL) refers to a scenario in which the true RUL is known in
advance by an oracle, enabling engines to be replaced precisely at this predetermined point. This
strategy involves no unscheduled maintenance tasks and consistently wastes no engine life,
representing an optimal maintenance approach. Corrective Maintenance, on the other hand, replaces
engines only upon failure, so that every replacement is unscheduled, which is the least
favorable scenario.
In Table 8, SRE denotes successful replacement executions and UR denotes unscheduled
replacements; the intervals in the table header carry the same meaning as in Table 7. Following
conventional practice in PdM research, we also assess the cost associated with each method. The
cost is computed as follows: when a replacement is executed with the engine's true RUL falling
within (0,5], the cost is minimal, assigned as 1; within (5,10], the cost rises slightly to 2;
within (10,20], the cost increases further to 3 because the engine still possesses considerable
RUL; and within (20,125], the cost peaks at 10. Each UR incurs the maximum penalty of 20. We also
report the average RUL. Finally, because the inspection period of PdM methods is rarely reported
in existing research, our comparison on this metric focuses primarily on Prob-RL.
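The cost rule can be written out as a short sketch. The function and constant names are illustrative, and the interval boundaries follow the Table 8 header. The relative-cost helper assumes that the percentages printed beside the Cost values are a linear rescaling between Ideal Maintenance (130 engines at cost 1) and Corrective Maintenance (130 engines at cost 20); this assumption reproduces the tabulated values for the rows checked here.

```python
# Cost per scheduled replacement, by true-RUL interval at replacement time.
INTERVAL_COSTS = {"(0,5]": 1, "(5,10]": 2, "(10,20]": 3, "(20,125]": 10}
UR_COST = 20  # penalty for each unscheduled replacement


def total_cost(interval_counts, ur_count):
    """Sum replacement costs over all intervals plus the UR penalties."""
    scheduled = sum(INTERVAL_COSTS[k] * n for k, n in interval_counts.items())
    return scheduled + UR_COST * ur_count


def relative_cost(cost, n_engines=130):
    """Position of a cost between Ideal (0.0) and Corrective (1.0)."""
    ideal = n_engines * min(INTERVAL_COSTS.values())
    corrective = n_engines * UR_COST
    return (cost - ideal) / (corrective - ideal)


# Rows taken from Table 8.
smoma = total_cost({"(0,5]": 34, "(5,10]": 45, "(10,20]": 51, "(20,125]": 0}, 0)
prob_rl = total_cost({"(0,5]": 36, "(5,10]": 42, "(10,20]": 49, "(20,125]": 1}, 2)
```

For SMOMA-PPO this gives 34·1 + 45·2 + 51·3 = 277, and for Prob-RL 36 + 84 + 147 + 10 + 2·20 = 317, matching the Cost column.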
Table 8. Performance comparison of all PdM methods.
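| Type | Approach | SRE | UR | (0,5] | (5,10] | (10,20] | (20,125] | Cost | Average RUL | Inspection Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | Ideal Maintenance | 130 | 0(0%) | 130 | 0 | 0 | 0 | 130(0%) | / | / |
| Baseline | Corrective Maintenance | 0 | 130(100%) | 0 | 0 | 0 | 0 | 2600(100%) | / | / |
| RL | SMOMA-PPO | 130 | 0(0.00%) | 34 | 45 | 51 | 0 | 277(5.95%) | 8.4240 | 32.54 |
| RL | Prob-RL | 128 | 2(1.54%) | 36 | 42 | 49 | 1 | 317(7.57%) | 8.9692 | 30.00 |
| RL | HIRL | 127 | 3(2.31%) | 34 | 55 | 38 | 0 | 318(7.61%) | 9.3437 | 1.00 |
| RL | SAPPO | 125 | 5(3.85%) | 27 | 53 | 45 | 0 | 368(9.64%) | 10.1200 | 1.00 |
| RL | SADDPG | 126 | 4(3.08%) | 26 | 54 | 43 | 3 | 373(9.84%) | 9.3015 | 1.00 |
| DL | GRP-12 | 130 | 0(0.00%) | 1 | 20 | 97 | 12 | 452(13.04%) | 14.3384 | 1.00 |
| DL | GRP-6 | 126 | 4(3.08%) | 26 | 71 | 28 | 1 | 342(8.58%) | 8.0158 | 1.00 |
| DL | AttnPINN-12 | 129 | 1(0.77%) | 2 | 46 | 76 | 5 | 392(10.61%) | 11.7984 | 1.00 |
| DL | AttnPINN-6 | 110 | 20(15.38%) | 59 | 41 | 10 | 0 | 571(17.85%) | 5.3636 | 1.00 |
| DL | GCU-Trans-12 | 129 | 1(0.77%) | 5 | 38 | 80 | 6 | 401(10.97%) | 12.0465 | 1.00 |
| DL | GCU-Trans-6 | 113 | 17(13.08%) | 56 | 49 | 8 | 0 | 518(15.71%) | 5.6106 | 1.00 |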
Except for SMOMA-PPO, all existing RL-based methods use a single agent and therefore cannot
adjust the inspection interval, whereas our method adapts it dynamically based on the prediction of
agent 2, which brings a significant improvement. For instance, Prob-RL incurs a higher cost
(317) and 2 URs with a fixed inspection interval (30 cycles). HIRL increases UR to 3 at a
comparable cost (318). The other single-agent RL methods, SAPPO and SADDPG, exhibit
even higher UR occurrences (5 and 4), reflecting inefficiencies in both inspection and replacement
decisions. In contrast, SMOMA-PPO eliminates all URs and achieves the lowest cost (277). It also
achieves an average inspection period of 32.54 cycles, longer than Prob-RL’s fixed interval (30) and
far longer than that of the other approaches, which rely on per-cycle inspections. These results
demonstrate the robustness and efficiency of our multi-agent approach, which not only
reduces costs but also enhances system reliability by dynamically adapting inspection and
replacement schedules. This improvement underscores the advantage of employing multiple
agents to address real-world predictive maintenance tasks.
In addition to RL-based methods, the RUL prediction-based approaches exhibit noticeable
limitations in this PdM task, as reflected in Table 8. For instance, GRP-12 achieves no URs, but its
cost is significantly higher (452) due to excessive replacements within the range of (20,125] RUL,
highlighting its inefficiency in balancing replacement timing. GRP-6, with a lower threshold, incurs
4 URs and a reduced cost (342), but still underperforms compared to RL-based approaches like
Prob-RL, as it relies heavily on a rigid, pre-defined threshold. Other methods, such as AttnPINN and
GCU-Trans, demonstrate similar drawbacks. AttnPINN-12 and GCU-Trans-12 achieve 1 UR each,
but their costs remain high at 392 and 401, respectively, primarily due to frequent premature
replacements. When the threshold is reduced to 6, both AttnPINN-6 and GCU-Trans-6 exhibit a
dramatic increase in UR occurrences (20 and 17), leading to considerably high costs (571 and 518).
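The trade-off between the two thresholds can be illustrated with a minimal sketch of a threshold rule. It assumes that the suffixes 12 and 6 denote a threshold on the predicted RUL, that the engine is inspected every cycle (consistent with the inspection period of 1.00 in Table 8), and that replacement is triggered at the first cycle where the prediction falls to or below the threshold. The exact rule is not spelled out in this section, and the RUL trajectories below are hypothetical.

```python
def threshold_policy(predicted_rul, true_rul, threshold):
    """Replace at the first cycle whose predicted RUL is <= threshold.

    Returns ("SRE", true RUL at replacement) for a scheduled replacement,
    or ("UR", 0) if the engine fails before the rule triggers.
    """
    for pred, true in zip(predicted_rul, true_rul):
        if true <= 0:
            return ("UR", 0)
        if pred <= threshold:
            return ("SRE", true)
    return ("UR", 0)


true_rul = list(range(30, -1, -1))        # 30, 29, ..., 0
optimistic = [t + 8 for t in true_rul]    # predictor overestimates RUL by 8
pessimistic = [t - 8 for t in true_rul]   # predictor underestimates RUL by 8

late_12 = threshold_policy(optimistic, true_rul, threshold=12)
late_6 = threshold_policy(optimistic, true_rul, threshold=6)
early_12 = threshold_policy(pessimistic, true_rul, threshold=12)
```

With an overestimating predictor, the threshold of 12 still replaces the engine in time (true RUL 4), whereas the threshold of 6 triggers too late and results in a UR. With an underestimating predictor, the threshold of 12 replaces the engine at a true RUL of 20, wasting usable life. A fixed threshold cannot compensate for either kind of prediction error.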
Fig. 17. Frequency distribution showing the number of replacements each method performs
across different RUL intervals.
To further illustrate the comparison, we visualized the distribution of all replacement actions across
different RUL intervals for each approach as shown in Fig. 17, which highlights distinct patterns
among the methods. SMOMA-PPO demonstrates a balanced distribution, with most replacements
occurring in the (5,10] and (10,20] ranges, reflecting its ability to optimize replacement timing while
avoiding URs. Other RL-based methods, such as Prob-RL and HIRL, show a similar trend but with
higher occurrences of replacements in the near-failure interval (0,5] or the early interval (20,125], indicating less
efficient decision-making. In contrast, DL-based methods show more scattered distributions, as seen
with GRP-12 and GCU-Trans-12. When thresholds are reduced (e.g., GRP-6 and AttnPINN-6),
replacements shift towards (0,5], leading to increased occurrences of URs. These patterns
emphasize the less adaptive nature of DL-based approaches.
Fig. 18. Stacked bar chart separating the costs caused by SRE and UR for each method.
Fig. 18 illustrates the breakdown of costs for each method, separating the contributions of SRE
(green) and UR (red). As shown, our proposed method, SMOMA-PPO, achieves the lowest total
cost, entirely attributed to scheduled replacements, with no cost from URs. In contrast, other RL-
based methods incur higher costs due to their inability to fully eliminate unscheduled events. DL-
based methods, particularly GRP-12, exhibit high scheduled replacement costs due to premature
replacements, while methods like AttnPINN-6 and GCU-Trans-6 suffer from significant UR costs.
This visualization highlights the effectiveness of SMOMA-PPO in maintaining a balanced and cost-
efficient maintenance strategy.
The percentages in parentheses in Table 8 express the position of each UR and Cost value relative
to Ideal Maintenance (0%) and Corrective Maintenance (100%). Finally, for a more visual
representation of how the algorithm operates, we sampled eight engines from the FD002 testing dataset
to demonstrate the PdM processes of models trained on the training dataset, as shown in Fig. 19.
Taking engine 131 as an example, the blue rectangular boxes in the illustration represent the actions
of agent 1, while the purple dialog boxes represent the actions of agent 2. At the initial time point,
the cycle count has already reached 60, because the sliding window length in the GRP algorithm is set
to 60. At this starting point, the first agent determines that the engine should not be replaced yet,
so its action is "continue." The second agent then decides the next appropriate inspection time based
on the current environmental state, which is generated from the GRP algorithm and the cumulative
probability values obtained from function fitting over the last 10 cycles. Agent 2 outputs an action
value of 43, which shifts the next assessment point forward by 43 cycles. This iterative process
continues until the true RUL of the engine reaches 8. At this point, the first agent concludes that
the engine should be replaced and executes the replacement action, as indicated by the final yellow
box.
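The sequential interaction between the two agents can be sketched as a simple loop. This is only a schematic under strong assumptions: the two policies below are hand-written stand-ins that read the true RUL directly, whereas the trained agents act on states built from GRP predictions and cumulative probabilities. The function names, the stand-in rules, and the engine life of 200 cycles are all hypothetical.

```python
def run_episode(total_life, replace_policy, inspect_policy, window=60):
    """Alternate agent 1 (replace or continue) and agent 2 (next interval).

    The first decision happens once the sliding window of `window` cycles
    is full. Returns the decision trace and the true RUL at replacement
    (0 if the engine failed first, i.e. an unscheduled replacement).
    """
    cycle = window
    trace = []
    while cycle < total_life:
        true_rul = total_life - cycle
        if replace_policy(true_rul) == "replace":
            trace.append((cycle, "replace", None))
            return trace, true_rul
        step = inspect_policy(true_rul)   # cycles until the next inspection
        trace.append((cycle, "continue", step))
        cycle += step
    return trace, 0


# Stand-in policies (NOT the trained agents; they peek at the true RUL).
def agent1(rul):
    return "replace" if rul <= 10 else "continue"


def agent2(rul):
    return max(1, (rul - 8) // 2)


trace, final_rul = run_episode(total_life=200,
                               replace_policy=agent1,
                               inspect_policy=agent2)
```

In this toy run the inspection interval shrinks from 66 cycles at the first decision to 1 cycle just before replacement, mirroring the pattern in which long intervals are chosen early in life and short ones near the end.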