
Finite-sample analysis of least-squares policy iteration

Published: 01 October 2012

Abstract

In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis of this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
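The abstract analyses LSPI and its LSTD policy-evaluation step but gives no code; the sketch below is a minimal illustration of the two alternating steps (LSTD-Q evaluation of the current greedy policy, then greedy improvement). The feature map phi(s, a), the sample set D of transitions (s, a, r, s'), the number of actions n_actions, and the small ridge term reg (added purely for numerical stability) are assumptions for illustration; the paper's analysis concerns the exact, unregularized LSTD fixed point.

    # Illustrative sketch of LSPI with LSTD-Q as the policy-evaluation step.
    # Assumed inputs: phi(s, a) returns a length-d NumPy feature vector,
    # D is a list of transitions (s, a, r, s_next).
    import numpy as np

    def lstdq(D, phi, policy, gamma, d, reg=1e-6):
        """LSTD-Q: least-squares fixed point of the projected Bellman equation
        for the action-value function of `policy`, estimated from the samples D."""
        A = reg * np.eye(d)          # small ridge term for numerical stability (assumption)
        b = np.zeros(d)
        for s, a, r, s_next in D:
            f = phi(s, a)
            f_next = phi(s_next, policy(s_next))
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return np.linalg.solve(A, b)  # weights w such that Q(s, a) ≈ phi(s, a) · w

    def lspi(D, phi, n_actions, gamma, d, n_iter=20, tol=1e-6):
        """LSPI: alternate LSTD-Q policy evaluation with greedy policy improvement."""
        w = np.zeros(d)
        for _ in range(n_iter):
            # greedy policy with respect to the current approximate Q-function
            policy = lambda s, w=w: max(range(n_actions), key=lambda a: phi(s, a) @ w)
            w_new = lstdq(D, phi, policy, gamma, d)
            if np.linalg.norm(w_new - w) < tol:  # stop when the weights have converged
                return w_new
            w = w_new
        return w

The error-propagation result described in the abstract bounds how the per-iteration evaluation error of lstdq accumulates across the iterations of the lspi loop above.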



Published In

The Journal of Machine Learning Research, Volume 13, Issue 1
January 2012
3712 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org

Publication History

Published: 01 October 2012
Published in JMLR Volume 13, Issue 1

Author Tags

  1. Markov decision processes
  2. finite-sample analysis
  3. generalization bounds
  4. least-squares policy iteration
  5. least-squares temporal-difference
  6. reinforcement learning

