DOI: 10.5555/3635637.3662902
Research article

Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer

Published: 06 May 2024

Abstract

In this paper, we study the problem of transferring available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to this as the Model Transfer Reinforcement Learning (MTRL) problem. First, we formulate MTRL for discrete MDPs and for Linear Quadratic Regulators (LQRs) with continuous states and actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in both discrete and continuous settings. In the first stage, MLEMTRL uses a constrained Maximum Likelihood Estimation (MLE)-based approach to estimate the target MDP model from a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL in both the realisable and non-realisable settings. We empirically demonstrate that MLEMTRL enables faster learning in new MDPs than learning from scratch, and that it achieves near-optimal performance depending on the similarity of the available MDPs to the target MDP.
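To make the two-stage procedure concrete, here is a minimal sketch for the discrete-MDP case. It assumes the known models are given as transition tensors `models[k]` of shape (S, A, S), the reward is a matrix `R` of shape (S, A), and the target model is parametrised as a simplex-weighted mixture of the known models; the function names (`estimate_weights`, `value_iteration`) and this mixture parametrisation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize


def estimate_weights(models, transitions):
    """Stage 1 (sketch): constrained MLE of mixture weights over known models.

    models      -- list of K transition tensors, each of shape (S, A, S)
    transitions -- iterable of observed (s, a, s_next) triples from the target MDP

    The target model is taken to be P_hat = sum_k w_k * models[k], with w
    constrained to the probability simplex (an assumption for illustration).
    """
    K = len(models)

    def neg_log_likelihood(w):
        # Mixture of the known transition models under the current weights.
        P_hat = sum(w_k * P_k for w_k, P_k in zip(w, models))
        # Log-likelihood of the observed target transitions under P_hat.
        ll = sum(np.log(P_hat[s, a, s_next] + 1e-12) for s, a, s_next in transitions)
        return -ll

    w0 = np.full(K, 1.0 / K)  # start from the uniform mixture
    result = minimize(
        neg_log_likelihood,
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * K,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x


def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Stage 2 (sketch): plan in the estimated model with value iteration.

    P -- estimated transition tensor of shape (S, A, S)
    R -- reward matrix of shape (S, A)
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum("sat,t->sa", P, V)  # Q(s, a) under current V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new  # greedy policy and its value estimate
        V = V_new
```

With the weights estimated, the mixture model P_hat = sum_k w_k * models[k] is handed to the planner; in the continuous LQR setting the second stage would instead use a planner appropriate to that class, such as solving the associated Riccati equation.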


Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024
2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems

Richland, SC


Author Tags

  1. linear quadratic regulator
  2. maximum likelihood estimation
  3. reinforcement learning
  4. transfer learning

Qualifiers

  • Research-article

Funding Sources

  • Inria-Kyoto University Associate Team RELIANT
  • ANR JCJC grant for the REPUBLIC project
  • Wallenberg AI Autonomous Systems and Software Program (WASP)

Conference

AAMAS '24

Acceptance Rates

Overall Acceptance Rate 1,155 of 5,036 submissions, 23%
