General Information
- Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning
- Qisen Yang, Huanqian Wang, Mukun Tong, Wenjie Shi, Gao Huang, and Shiji Song
- Tsinghua University, China
- Q. Yang, H. Wang, M. Tong, W. Shi, G. Huang and S. Song, "Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 54, no. 2, pp. 1014-1025, Feb. 2024, doi: 10.1109/TSMC.2023.3312411.
Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning
![Interpretable feature discovery RL](https://duckietown.com/wp-content/uploads/2024/07/image-768x426.avif)
What is interpretable feature discovery in reinforcement learning?
To understand this, let’s introduce a few important topics:
Reinforcement Learning (RL): A machine learning approach in which an agent learns to make decisions by interacting with an environment to accomplish a specific objective.
Interpretable feature discovery in RL: An approach that aims to make the decision-making process of RL agents more understandable to humans.
The need for interpretability: In real-world applications, especially in safety-critical domains like self-driving cars, it is crucial to understand why an RL agent makes a certain decision. Interpretability helps:
- Build trust in the system
- Debug and improve the model
- Ensure compliance with regulations and ethical standards
- Attribute fault when accidents occur
Feature discovery: In this context, feature discovery refers to identifying the key elements (features) of the environment that the RL agent focuses on when making decisions. For example, in a self-driving car simulation, relevant features might include the positions of other cars, road signs, or lane markings.
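To give a concrete feel for what "feature attribution" means for a vision-based agent, here is a minimal sketch of a gradient-based saliency map, one of the generic baseline techniques the paper compares against (it is not the RL-in-RL method itself). The names `policy_net` and `obs` are hypothetical stand-ins for a PyTorch policy network and an image observation.

```python
import torch

def saliency_map(policy_net, obs):
    """Gradient-based saliency: how strongly each pixel of the observation
    influences the policy's preferred action. A generic attribution baseline,
    not the paper's RL-in-RL method."""
    obs = obs.clone().detach().requires_grad_(True)   # (C, H, W) image state
    logits = policy_net(obs.unsqueeze(0))             # (1, n_actions) action preferences
    logits.max().backward()                           # gradient of the top action score
    return obs.grad.abs().sum(dim=0)                  # (H, W) per-pixel importance
```

The resulting map can be rendered as a heatmap over the observation, which is essentially how the attention visualizations in the figures below are presented.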
Learn about RL, navigation, and other robot autonomy topics at the link below!
Abstract
The black-box nature of deep reinforcement learning (RL) hinders its adoption in real-world applications. Therefore, interpreting and explaining RL agents have been active research topics in recent years. Existing methods for post-hoc explanation usually adopt the action matching principle to enable an easy understanding of vision-based RL agents. In this article, it is argued that the commonly used action matching principle is more like an explanation of deep neural networks (DNNs) than an interpretation of RL agents.
Action matching may lead to irrelevant or misplaced feature attribution when different DNN outputs lead to the same reward, or when the same outputs result in different rewards. Therefore, we propose to treat rewards, the essential objective of RL agents, as the essential objective of interpreting RL agents as well. To ensure reward consistency during interpretable feature discovery, a novel framework (RL interpreting RL, denoted RL-in-RL) is proposed to solve the gradient disconnection from actions to rewards.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment. The results show that our method manages to keep reward (or return) consistency and achieves high-quality feature attribution. Further, a series of analytical experiments validates our assumption about the action matching principle's limitations.
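To make the abstract's central objection concrete, here is a toy numeric sketch of the driving scenario the authors use (see Fig. 1 below): two slightly different steering actions that both avoid a collision earn the same reward, so an explanation that reproduces either action is equally faithful to the agent's objective, yet strict action matching would penalize it. The reward function below is purely illustrative, not the simulator's.

```python
def collision_avoidance_reward(steering):
    """Toy stand-in for the task reward: any left turn sharp enough to
    avoid the obstacle earns the same reward (illustrative values)."""
    return 1.0 if steering <= -0.1 else -1.0

a_original = -0.12   # policy's action on the full observation (turn left by 0.12)
a_explained = -0.15  # action reproduced from the masked, attentive observation

# Action matching: the two actions differ, so this explanation would be rejected.
action_match = abs(a_original - a_explained) < 1e-6                      # False

# Reward matching: both actions avoid the collision and earn the same reward,
# so the explanation is consistent with what the agent actually optimizes.
reward_match = (collision_avoidance_reward(a_original)
                == collision_avoidance_reward(a_explained))              # True
```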
Highlights - Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning
Here is a visual tour of the work of the authors. For more details, check out the full paper.
![Figure showing the limitations of action matching: In the driving task illustrated in (a), two actions—turning left by 0.12 as shown in (a.1) and turning left by 0.15 as shown in (a.2)—represent the same behavior of "avoiding collision" and receive the same reward. In the scenario depicted in (b), the same actions—shown in (b.1) and (b.2)—result in different rewards depending on the task, but action matching methods would provide the same explanation for both.](https://duckietown.com/wp-content/uploads/2024/07/1-2-768x576.avif)
![Diagram of the reward-oriented interpretation method: a DNN with an encoder and decoder learns the mask and attentive state. The pretrained policy takes actions from both primitive and attentive states, with the environment providing corresponding rewards. Reward matching is hindered by disconnected gradient backward in supervised learning.](https://duckietown.com/wp-content/uploads/2024/07/3-1-768x576.avif)
![Diagram of interaction process: (a) shows how the pretrained policy (πpre) interacts with the environment during pretraining. (b) depicts the interpretation task where reward matching is treated as an RL problem, with the RL-in-RL policy (π̃) corresponding to the gray box in Fig. 2. The reward (r̃) is provided by both the environment (E) and the pretrained policy (πpre).](https://duckietown.com/wp-content/uploads/2024/07/4-1-768x576.avif)
![Algorithm 1 outlines the RL-in-RL model, where a pretrained policy π_pre interacts with the environment to be interpreted. The reward function R(s_t, a_t) is provided by the environment E. The algorithm involves initializing and updating specific functions and parameters, iterating over epochs until convergence. During each epoch, the state is initialized, actions and rewards are computed, and trajectories are saved. The update step employs the Proximal Policy Optimization (PPO) algorithm to maximize a specified objective.](https://duckietown.com/wp-content/uploads/2024/07/2-2-768x576.avif)
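The sketch below restates the loop described in the caption above at a high level. All callables are supplied by the caller and are hypothetical stand-ins, not the authors' implementation: `pretrained_policy` is the frozen policy to be interpreted, `mask_net` plays the role of the RL-in-RL interpretation policy, `interp_reward` stands in for the reward-consistency signal, and `ppo_update` for the PPO step. A classic Gym-style `env` API is assumed.

```python
def train_rl_in_rl(env, pretrained_policy, mask_net, ppo_update, interp_reward,
                   n_epochs=1000, horizon=128):
    """High-level sketch of the training loop described in Algorithm 1.
    pretrained_policy(state) -> action           (frozen policy to interpret)
    mask_net(state)          -> feature mask     (the interpretation policy)
    interp_reward(r_env, a_primitive, a_attentive) -> interpretation reward
    ppo_update(mask_net, trajectory)             (PPO step on the mask policy)
    """
    for _ in range(n_epochs):
        trajectory, state = [], env.reset()
        for _ in range(horizon):
            mask = mask_net(state)                 # which features matter
            attentive_state = mask * state         # masked view of the state
            a_primitive = pretrained_policy(state)
            a_attentive = pretrained_policy(attentive_state)
            next_state, r_env, done, _ = env.step(a_primitive)
            # The interpretation reward combines the environment reward with a
            # consistency signal from the pretrained policy (cf. the figure above).
            r_tilde = interp_reward(r_env, a_primitive, a_attentive)
            trajectory.append((state, mask, r_tilde))
            state = env.reset() if done else next_state
        ppo_update(mask_net, trajectory)           # maximize the interpretation objective
```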
![The attentive state is an overlaid combination of the state and the attentive heatmap of feature attribution. The feature importance ranges from 0 to 1 as the heatmap color changes from blue to red.](https://duckietown.com/wp-content/uploads/2024/07/5-1-768x576.avif)
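The overlay described in this caption can be produced with standard plotting tools. Below is a minimal, generic visualization sketch (not the authors' plotting code); `state_rgb` and `heatmap` are assumed to be an H×W×3 frame and an H×W attribution map with values in [0, 1].

```python
import matplotlib.pyplot as plt

def show_attentive_state(state_rgb, heatmap, alpha=0.5):
    """Overlay a feature-attribution heatmap (values in [0, 1]) on the state
    image, blue to red as in the figure above."""
    plt.imshow(state_rgb)                                              # original frame
    plt.imshow(heatmap, cmap="coolwarm", vmin=0.0, vmax=1.0, alpha=alpha)
    plt.axis("off")
    plt.show()
```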
![Attentive features of RL-in-RL^K with observation length K = 10](https://duckietown.com/wp-content/uploads/2024/07/6-1-768x576.avif)
![Comparison of interpretation methods on Atari 2600 games. The RL-in-RL model and the supervision-based method are shown with overlaid heatmaps. Perturbation-based and gradient-based methods highlight attention areas on the saliency-overlaid state. Subfigures include (a) Pong, (b) Ms. Pac-Man, and (c) Space Invaders.](https://duckietown.com/wp-content/uploads/2024/07/3-2-768x576.avif)
![Comparisons between the best-performing action matching method and our reward-oriented RL-in-RL on the Duckietown environment.](https://duckietown.com/wp-content/uploads/2024/07/7-1-768x576.avif)
![(a) shows the percent episode return of the pretrained policy under various lane patterns, illustrating the impact of different lines (Fleft_w, Fy, Fright_w) on policy performance. (b) depicts the percent action divergence of the interpretation model under these lane patterns, comparing action consistency with the pretrained policy's actions and showing the effect of the lines on action consistency.](https://duckietown.com/wp-content/uploads/2024/07/8-1-768x576.avif)
![Interpretation results from the action matching method, when the policy is pretrained without the middle yellow line.](https://duckietown.com/wp-content/uploads/2024/07/9-768x576.avif)
![Attention pattern of RL-in-RL^a.](https://duckietown.com/wp-content/uploads/2024/07/10-768x576.avif)
![Architecture of the encoder–decoder network in RL-in-RL](https://duckietown.com/wp-content/uploads/2024/07/11-768x576.avif)
![Additional results of the RL-in-RL model on Atari 2600 games. Attentive states are visualized as in Figure 4. The games displayed are: (a) Enduro, (b) Assault, and (c) Breakout.](https://duckietown.com/wp-content/uploads/2024/07/5-2-768x576.avif)
![Two ablation studies on the hyperparameters of the RL-in-RL model. (a) Shows the effect of setting α = 0.1, where α is the weight of the auxiliary task loss as defined in Equation (8). (b) Shows the effect of setting β = 0.1, where β is used to speed up the training process as described in Appendix A.](https://duckietown.com/wp-content/uploads/2024/07/6-2-768x576.avif)
![Comparative results of the RL-in-RL method and the action matching method’s ablation studies on the hyperparameter α.](https://duckietown.com/wp-content/uploads/2024/07/7-2-768x576.avif)
![Visualization of attention shift in the zigzag turn. The three rows correspond to original states, attentive states, and masked states, respectively.](https://duckietown.com/wp-content/uploads/2024/07/4-2-768x576.avif)
Conclusion
Here are the conclusions from the authors of this paper:
“In this article, we discussed the limitations of the commonly used assumption, the action matching principle, in RL interpretation methods. It is suggested that action matching cannot truly interpret the agent since it differs from the reward-oriented goal of RL. Hence, the proposed method first leverages reward consistency during feature attribution and models the interpretation problem as a new RL problem, denoted as RL-in-RL.
Moreover, it provides an adjustable observation length for one-step reward or multistep reward (or return) consistency, depending on the requirements of behavior analyses. Extensive experiments validate the proposed model and support our concerns that action matching would lead to redundant and noncausal attention during interpretation since it is dedicated to exactly identical actions and thus results in a sort of “overfitting.”
Nevertheless, although RL-in-RL shows superior interpretability and dispenses with redundant attention, further exploration of interpreting RL tasks with explicit causality is left for future work.”
Project Authors
Qisen Yang is a Ph.D. student in artificial intelligence at Tsinghua University, Beijing, China.
Huanqian Wang is currently pursuing the B.E. degree in control science and engineering with the Department of Automation, Tsinghua University, Beijing, China.
Mukun Tong is currently pursuing the B.E. degree in control science and engineering with the Department of Automation, Tsinghua University, Beijing, China.
Wenjie Shi received his Ph.D. degree in control science and engineering from the Department of Automation, Institute of Industrial Intelligence and System, Tsinghua University, Beijing, China, in 2022.
Gao Huang is currently with the Department of Automation, Tsinghua University, Beijing, China.
Shiji Song is currently a Professor with the Department of Automation, Tsinghua University, Beijing, China.
Learn more
Duckietown is a platform for creating and disseminating robotics and AI learning experiences.
It is modular, customizable and state-of-the-art, and designed to teach, learn, and do research. From exploring the fundamentals of computer science and automation to pushing the boundaries of knowledge, Duckietown evolves with the skills of the user.