Bayesian Strategy Networks Based Soft Actor-Critic Learning

Published: 29 March 2024

Abstract

A strategy refers to the rules by which an agent chooses among available actions to achieve its goals. Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments: it improves the system’s utility, decreases the overall cost, and increases the probability of mission success. This article proposes a novel hierarchical strategy decomposition approach based on Bayesian chaining that separates an intricate policy into several simple sub-policies and organizes their relationships as a Bayesian strategy network (BSN). We integrate this approach into the state-of-the-art deep reinforcement learning (DRL) method, soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing the sub-policies as a joint policy. Our method achieves state-of-the-art performance on standard continuous-control benchmarks in the OpenAI Gym environment, and the results demonstrate the promising potential of BSAC to significantly improve training efficiency. Furthermore, we extend the discussion to multi-agent systems (MAS), outlining potential research fields and directions.
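
To make the idea concrete, the sketch below illustrates one way a BSN-style joint policy could be assembled from sub-policies: each sub-policy covers one group of action dimensions and is conditioned on the state and on the actions of its parent nodes in the network, so the joint log-probability factorizes along the DAG by the chain rule and can be plugged into a standard SAC actor loss. This is a minimal PyTorch sketch under our own assumptions (Gaussian sub-policies, a fixed DAG over action groups, hypothetical names such as SubPolicy and BayesianJointPolicy), not the authors' implementation.

# Minimal sketch of composing sub-policies along a Bayesian strategy network (BSN).
# Assumptions (not from the article): PyTorch, Gaussian sub-policies, a fixed DAG;
# class and variable names are hypothetical.
import torch
import torch.nn as nn

class SubPolicy(nn.Module):
    """Gaussian sub-policy over one action group, conditioned on the state
    and on the actions of its parent nodes in the BSN."""
    def __init__(self, state_dim, parent_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + parent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state, parent_actions):
        h = self.net(torch.cat([state, parent_actions], dim=-1))
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        a = dist.rsample()                      # reparameterized sample
        log_prob = dist.log_prob(a).sum(-1)     # log pi_i(a_i | s, parents); tanh correction omitted
        return torch.tanh(a), log_prob

class BayesianJointPolicy(nn.Module):
    """Joint policy pi(a|s) = prod_i pi_i(a_i | s, parents(a_i)),
    with nodes evaluated in topological order of the BSN."""
    def __init__(self, state_dim, node_dims, parents):
        # parents: list of parent-index lists, one per node, topologically ordered
        super().__init__()
        self.parents = parents
        self.nodes = nn.ModuleList([
            SubPolicy(state_dim, sum(node_dims[p] for p in ps), d)
            for d, ps in zip(node_dims, parents)
        ])

    def forward(self, state):
        actions, total_log_prob = [], 0.0
        for node, ps in zip(self.nodes, self.parents):
            parent_a = (torch.cat([actions[p] for p in ps], dim=-1)
                        if ps else state.new_zeros(state.shape[0], 0))
            a_i, lp_i = node(state, parent_a)
            actions.append(a_i)
            total_log_prob = total_log_prob + lp_i   # chain rule over the DAG
        return torch.cat(actions, dim=-1), total_log_prob

With such a factorization, the summed log-probability plays the same role as the single-policy log-probability in standard SAC, so the entropy term and actor loss can be computed unchanged while the sub-policies are trained jointly.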


Published In

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3
June 2024
646 pages
EISSN: 2157-6912
DOI: 10.1145/3613609
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 March 2024
Online AM: 01 February 2024
Accepted: 24 January 2024
Revised: 21 November 2023
Received: 15 August 2023
Published in TIST Volume 15, Issue 3

Author Tags

  1. Strategy
  2. Bayesian networks
  3. deep reinforcement learning
  4. soft actor-critic
  5. utility
  6. expectation

Qualifiers

  • Research-article
