
Finite-sample analysis of least-squares policy iteration

Published: 01 October 2012

Abstract

In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis of this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
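The abstract analyses LSPI and its LSTD policy-evaluation step but gives no code; the sketch below is a minimal illustration of the two alternating steps (LSTD-Q evaluation of the current greedy policy, then greedy improvement). The feature map phi(s, a), the sample set D of transitions (s, a, r, s'), the number of actions n_actions, and the small ridge term reg (added purely for numerical stability) are assumptions for illustration; the paper's analysis concerns the exact, unregularized LSTD fixed point.

    # Illustrative sketch of LSPI with LSTD-Q as the policy-evaluation step.
    # Assumed inputs: phi(s, a) returns a length-d NumPy feature vector,
    # D is a list of transitions (s, a, r, s_next).
    import numpy as np

    def lstdq(D, phi, policy, gamma, d, reg=1e-6):
        """LSTD-Q: least-squares fixed point of the projected Bellman equation
        for the action-value function of `policy`, estimated from the samples D."""
        A = reg * np.eye(d)          # small ridge term for numerical stability (assumption)
        b = np.zeros(d)
        for s, a, r, s_next in D:
            f = phi(s, a)
            f_next = phi(s_next, policy(s_next))
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return np.linalg.solve(A, b)  # weights w such that Q(s, a) ≈ phi(s, a) · w

    def lspi(D, phi, n_actions, gamma, d, n_iter=20, tol=1e-6):
        """LSPI: alternate LSTD-Q policy evaluation with greedy policy improvement."""
        w = np.zeros(d)
        for _ in range(n_iter):
            # greedy policy with respect to the current approximate Q-function
            policy = lambda s, w=w: max(range(n_actions), key=lambda a: phi(s, a) @ w)
            w_new = lstdq(D, phi, policy, gamma, d)
            if np.linalg.norm(w_new - w) < tol:  # stop when the weights have converged
                return w_new
            w = w_new
        return w

The error-propagation result described in the abstract bounds how the per-iteration evaluation error of lstdq accumulates across the iterations of the lspi loop above.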



Published In

The Journal of Machine Learning Research, Volume 13, Issue 1
January 2012
3712 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org

Publication History

Published: 01 October 2012
Published in JMLR Volume 13, Issue 1

Author Tags

  1. Markov decision processes
  2. finite-sample analysis
  3. generalization bounds
  4. least-squares policy iteration
  5. least-squares temporal-difference
  6. reinforcement learning

