Our work lies at the intersection of imitation learning and interpretable/explainable AI. This section provides a comprehensive comparison with those studies; for brevity, we discuss interpretable and explainable methods together. We also provide a brief comparison with reward-shaping methods. In this article, explainability refers to a technical understanding of the connection between the inputs and outputs of a particular AI model: given a generated output, a human is able to understand how it originated from the input features. For instance, a previous study [61] defines an explanation as a collection of features of the interpretable domain that, for a given example, have contributed to producing a decision. Another way to provide such an explanation is to generate a heatmap that highlights the pixels of an input image that had the most influence on the decision [32]. Interpretability, in contrast, relates to explaining what is happening behind the curtain, that is, the ability to explain or present results in terms understandable to humans [16]. A similar definition is provided by Rudin et al. [71], who consider interpretability the property of intrinsically interpretable models and explainability the ability to explain models using post hoc interpretability techniques. Intrinsic interpretability refers to a machine learning model that is constructed to be inherently interpretable or self-explanatory at training time by restricting its complexity, for example, by building decision tree policies [3, 56]. These definitions were also employed in recent surveys [26, 97].
2.3 Interpretable Reinforcement Learning
Several approaches have been proposed to address the interpretability issues inherent in DRL. One line of work is post hoc methods, which are applied on top of the original model to help users understand the reasons for its decisions. For instance, the user can receive visual explanations that highlight the most relevant regions of the state space [39, 76, 87, 98]. Another solution is to employ an attention mechanism to identify task-relevant information [52]. In this framework, the output of the attention layer is leveraged to identify the most important features of the state space; however, interpreting strategic states in real-world applications may not be as simple as identifying objects in a grid game. Other methods generate textual explanations for choosing an action [31, 88]. Another study proposed to learn a vector Q-function, where each component explains preferences between actions based on reward decomposition [37]. Another popular post hoc explainer is Shapley Additive Explanations [96], which attributes feature importance to the inputs of a (deep) predictor for a single data sample by "removing" input features and measuring the changes in the output [51]. Hence, in post hoc methods, an explanation can be used to clarify, justify, or explain an action choice. However, such approaches do not offer full interpretability, as they focus on explaining the local reasons for a decision. In a different spirit, the idea of explaining the knowledge learned by imitation learning models (e.g., GAIL) has been explored by Pan et al. [64], who proposed to discover visual explanations via post hoc interpretation of a trained GAIL model. Our method, in contrast, directly learns interpretable symbolic rewards that can be understood and improved by humans. To the best of our knowledge, we propose the first approach that can discover knowledge learned by IRL approaches in a way that the model can be adjusted, modified, or improved by a human.
Since achieving full interpretability is very challenging, another line of work focuses on high-level interpretability: these methods make the agent's high-level decision process interpretable, while their lower-level components rarely claim to be interpretable. For instance, in Reference [94], the high-level agent forms a representation of the world and the task at hand that is interpretable to a human operator, while the low level employs a neural network [6]. Although such hierarchical approaches can rarely claim full interpretability, they benefit from the flexibility of neural approaches while providing explanations regarding the strategy of the agent [45, 78, 81, 94, 100].
Model approximation employs a self-interpretable model to mimic the target agent's policy and then derives explanations for the target DRL agent from that model. For instance, VIPER [3] leverages ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy. Specifically, the authors learn a decision tree policy that plays Atari Pong on a symbolic abstraction of the state space rather than from pixels. In a similar spirit, another study [50] introduced Linear Model U-trees to approximate a neural network's predictions, using a transparent tree structure that allows rule extraction and measurement of feature influence. A related approach [89] presented Neurally Directed Program Search (NDPS) for solving the challenging non-smooth optimization problem of finding a programmatic policy with maximal reward. NDPS first discovers a neural policy using DRL and then performs a local search over programmatic policies that aims to minimize the distance from this neural target policy.
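To make the model-approximation idea concrete, the following is a minimal sketch of distilling a neural policy into a shallow decision tree by behavior cloning on visited states. The names env and oracle_policy are hypothetical stand-ins for a Gymnasium-style environment and a trained DNN policy; the full VIPER algorithm additionally resamples states weighted by Q-value gaps, which is omitted here.

```python
# Minimal sketch, not the exact VIPER procedure: clone a neural policy into a
# small decision tree so its decision rules can be read by a human.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def collect_rollouts(env, oracle_policy, n_steps=5000):
    """Roll out the oracle and record (state, oracle action) pairs."""
    states, actions = [], []
    state, _ = env.reset()                      # Gymnasium-style API assumed
    for _ in range(n_steps):
        action = oracle_policy(state)           # hypothetical trained DNN policy
        states.append(state)
        actions.append(action)
        state, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            state, _ = env.reset()
    return np.array(states), np.array(actions)

X, y = collect_rollouts(env, oracle_policy)     # env, oracle_policy assumed given
tree_policy = DecisionTreeClassifier(max_depth=4).fit(X, y)
print(export_text(tree_policy))                 # human-readable if/else rules
```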
Since most previous work in explainable deep learning focuses on explaining only a single decision in terms of input features, a recent study [86] extended explainability to the trajectory level by introducing Abstracted Policy Graphs, which are Markov chains of abstract states. This representation summarizes a policy so that individual decisions can be explained in the context of expected future transitions. A recent follow-up [5] presented a policy representation based on a novel variant of the CART decision tree algorithm. Instead of mimicking the target agent's policy, self-interpretable modeling builds a self-explainable model to replace the policy network; since the new model is interpretable, one can easily derive an explanation for the target agent [99]. Top-down attention may be employed to directly observe the information the agent uses to select its actions, providing easier interpretation than traditional models [62]. Neuroevolution can also be used to train self-attention architectures for vision-based reinforcement learning tasks [84]. To avoid making assumptions about either the unbiasedness of beliefs or the optimality of policies, INTERPOLE [36] aims to discover the most plausible explanation in terms of decision dynamics and boundaries. In EDGE [30], the authors proposed a novel self-explainable model that augments a Gaussian process with a customized kernel function and an interpretable predictor, capturing both correlations between timesteps and the joint effect across episodes. Our work differs fundamentally from the DRL explanation research mentioned above in terms of the objective pursued. Whereas DRL explanation research typically focuses on developing methods to explain the behavior of a trained agent, our work focuses on discovering interpretable reward functions from expert demonstrations, which can align the agent's behavior with the desired behavior. Consequently, our main goals are to both understand/verify the learned behavior and accelerate the agent's exploration.
Another relevant idea is reward decomposition [37], which decomposes rewards into sums of semantically meaningful reward types, so that actions can be compared in terms of tradeoffs among the types. Decomposing the reward function and examining how each component influences the decision-making process, as well as how the components relate to one another, is a reasonable route to explainability. To explain skill learning, Shu et al. [78] utilized hierarchical policies that decide when to reuse a previously learned policy and when to learn a new skill. Action preferences can also be explained by contrasting the future properties predicted for each action [47]. For multi-agent tasks, a method called counterfactual multi-agent policy gradient utilizes a counterfactual advantage function for local agent training [21]; however, this method ignores the correlation and interaction between local agents, leading to poor performance on more complex tasks. Recently, Wang et al. [92] combined the Shapley value with the Q-value and performed reward decomposition at a higher level in multi-agent tasks to guide the policy gradient, making it possible to explain how the global reward is divided during training and how much each agent contributes. In contrast, the present study seeks to guide learning while providing reward interpretability in sparse-reward tasks. In such tasks, reward decomposition may be difficult to apply, as the extrinsic reward is zero for most timesteps. To alleviate this challenge, we seek to discover dense symbolic rewards from expert trajectories rather than decomposing extrinsic rewards.
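For reference, the reward-decomposition formulation discussed at the start of this paragraph (written in our own notation, following the idea of [37]) splits the reward into semantically meaningful types and learns one value component per type:
\[
R(s,a) = \sum_{c \in \mathcal{C}} R_c(s,a), \qquad Q^{\pi}(s,a) = \sum_{c \in \mathcal{C}} Q^{\pi}_c(s,a),
\]
so that a preference for action $a_1$ over $a_2$ in state $s$ can be explained through the per-type differences $\Delta_c(s) = Q^{\pi}_c(s,a_1) - Q^{\pi}_c(s,a_2)$.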
In a different spirit, when inputs are high-dimensional raw data, one solution is to extract symbolic representations on which a human can reason and make assumptions. Since such methods tend to abstract away irrelevant details, the reasons for a decision can be quickly and effectively understood by humans. For instance, a few methods have proposed to distill DRL policies into decision trees [18, 46]. An expert may also bootstrap the learning process, as shown by Silva et al. [79], where the policy tree is initialized from human knowledge. Some previous studies [23, 24] proposed to learn a symbolic policy by first learning a relevant symbolic representation and then applying Q-learning. The symbolic representation includes interactions between objects in the environment; however, it is not clear how to apply such methods to tasks without well-defined objects, such as in process industries or robot control. In Soft Decision Tree [12], the authors proposed to train a soft decision tree to mimic the action classification of a DRL policy, and the resulting soft decision trees provide a form of interpretability regarding how the policy operates. A well-known framework for learning symbolic policies is genetic programming [58]. In GPRL [33], a genetic programming model is trained to discover a symbolic policy on an environment approximation referred to as a world model. In this work, we rely on an autoregressive RNN to learn a symbolic reward function from expert data and show that the learned reward can be used to guide the agent's training. One common problem when learning symbolic policies is that performance is capped by the policy being imitated, because most approaches learn symbolic policies as a classification problem, without interacting with the environment. We instead focus on learning symbolic reward functions from expert data. Namely, the proposed approach differs from most previous work in that we learn interpretable symbolic reward functions from expert data rather than modeling the agent's policy. Since our agent still has access to the environmental reward, its performance is not capped by the quality of the demonstrations. Moreover, using expert data provides a significant speedup in the training process while highlighting the most important features of the environment.
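As a schematic illustration of autoregressive expression generation in general (in the spirit of deep symbolic regression, and not the exact architecture, vocabulary, or training loop used in this work), an RNN can emit a symbolic expression token by token in prefix notation:

```python
# Schematic sketch only: sample a prefix-notation expression with a GRU.
# The token vocabulary and network sizes below are illustrative assumptions.
import torch
import torch.nn as nn

TOKENS = ["+", "*", "-", "x0", "x1", "1.0"]           # operators and operands
ARITY = {"+": 2, "*": 2, "-": 2, "x0": 0, "x1": 0, "1.0": 0}

class ExpressionSampler(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(TOKENS), hidden)
        self.cell = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, len(TOKENS))

    def sample(self, max_len=16):
        h = torch.zeros(1, self.cell.hidden_size)
        prev = torch.zeros(1, dtype=torch.long)        # dummy start token
        tokens, open_slots = [], 1                     # one open slot for the root
        for _ in range(max_len):
            h = self.cell(self.embed(prev), h)
            prev = torch.distributions.Categorical(logits=self.head(h)).sample()
            tok = TOKENS[prev.item()]
            tokens.append(tok)
            open_slots += ARITY[tok] - 1               # prefix-tree bookkeeping
            if open_slots == 0:                        # expression is complete
                return tokens
        return tokens                                  # possibly truncated

print(ExpressionSampler().sample())                    # e.g., ['+', 'x0', '1.0']
```

In a full system, the sampled expression would be scored against expert trajectories and the sampler updated with a policy-gradient objective; that outer loop is omitted from this sketch.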
A study closely related to ours employs genetic programming to mimic the rewards provided by the environment [77]. Namely, they use a genetic programming model to clone the rewards received by the agent. However, we argue that expert data can provide more meaningful explanations, as they are likely to cover task-relevant regions of the environment. In addition, the discovered reward function can be used to accelerate the agent's training, and the expert can directly modify and improve the learned reward function. Furthermore, rather than learning reward functions with genetic programming, we propose to employ an autoregressive RNN, which tends to perform better on high-dimensional and complex data.
2.5 Reward Shaping
Reward shaping methods involve designing a reward function that encourages the agent to take actions that lead to a desired behavior. This involves modifying the original reward function with a shaping reward that incorporates domain knowledge. The additional reward terms can be either handcrafted or learned from data and are designed based on domain knowledge of the problem. Early work on reward shaping [15, 69] focused on crafting the shaping reward function but did not consider that shaping rewards may change the optimal policy. Beyond the approaches mentioned above, other notable reward-shaping methods include automatic shaping [29, 55], multi-agent reward shaping [13], and more recent ideas such as ethics shaping [95], belief reward shaping [54], and reward shaping via meta-learning [105]. Mirchandani et al. [57] proposed a reward-shaping method that leverages interactions between an agent and a human to shape sparse rewards associated with human instruction goals and the current state of the environment; specifically, their method utilizes termination and relevance classifiers to shape the reward signal. Similarly, Tabrez and Hayes [83] presented a framework called Reward Augmentation and Repair through Explanation, which employs partially observable Markov decision processes (POMDPs) to approximate the understanding of collaborators in joint tasks. However, this framework requires careful consideration of the tradeoff between reward modification and abandonment, and it may be limited by the complexity of the POMDP approximation. Reward shaping also suffers from several limitations. One of the main challenges is that the additional reward terms must be carefully designed to avoid overfitting to the specific problem domain; this can be a difficult and time-consuming task, and it can limit the generalization of the approach to other problem domains. Moreover, reward shaping can introduce human bias into the learning process, which can lead to suboptimal solutions [44]. Instead of investigating how to learn helpful shaping rewards, our work studies a different problem in which an exploration reward is discovered from expert trajectories. The proposed exploration reward can be viewed as a form of reward shaping that distills valuable priors about the domain from expert trajectories. One key advantage is that this framework does not require any handcrafted rewards or additional labeled data for discovering the exploration bonus. In addition, since the exploration reward is represented as a symbolic tree, it can be understood and verified by humans, providing a form of explainability.
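Written out, reward shaping as described at the beginning of this subsection has the agent optimize a modified reward (our notation):
\[
r'(s, a, s') = r(s, a, s') + F(s, a, s'),
\]
where $F$ is the shaping term. The well-known potential-based choice $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$, for a state potential $\Phi$, is one way to guarantee that shaping does not change the optimal policy, which is the concern noted above for early approaches. In our setting, the symbolic exploration reward discovered from expert trajectories plays the role of $F$.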
In recent years, curiosity [65] has also been proposed as a form of reward shaping. The intrinsic motivation measure may include mutual information between actions and end states [28], surprise [65], state prediction error [9], learned skills [8], state visit counts [82], empowerment [73], or progress [7]. Curiosity seeks to accelerate exploration by providing an additional intrinsic reward. The present work differs from curiosity-driven learning in that it attempts to learn complex reward functions from expert priors, rather than replacing the reward with a fixed intrinsic objective that encourages exploration of the entire state–action space. Additionally, unlike curiosity-driven approaches, which encode rewards primarily through deep neural networks, our goal is to discover interpretable reward functions. By doing so, we aim to achieve a higher level of transparency and understanding in the learning process, allowing for better analysis and interpretation of the training objective.
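For concreteness, a typical intrinsic bonus from this literature based on state-prediction error augments the extrinsic reward as (our notation):
\[
r_t = r^{e}_t + \beta \,\bigl\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \bigr\rVert^2,
\]
where $\phi$ is a learned state embedding, $\hat{\phi}(s_{t+1})$ is a forward-model prediction from $(s_t, a_t)$, and $\beta$ weights the bonus. Such bonuses are encoded by neural networks, whereas the reward we discover is a symbolic expression.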