Fixed-Point Iterations in Markov Decision Processes

Prof. Roberto Cominetti

Abstract

This short course is an introduction to some novel techniques for studying the convergence of stochastic fixed point iterations — such as the Krasnoselskii-Mann, Halpern, and similar iterative methods — with explicit approximation guarantees in the form of finite-time error bounds and convergence rates. Starting from the base case of deterministic iterations, in which the relevant maps can be computed exactly, we will move to settings where the iterations are affected by numerical inaccuracies or stochastic perturbations. The stochastic case is a natural setting for studying the convergence of reinforcement learning procedures in Markov decision processes (MDP’s). We will survey the recent results for MDP’s with discounted payoffs, and illustrate the full power of the new techniques, which becomes more evident when considering the case of long-term average rewards.

Content

The tentative program for the course is (click on each lecture for the slides):

Lecture 1: MDP’s as fixed point iterations. What are we looking for?
Lecture 2: Optimal transport error bounds for fixed-point iterations.
Lecture 3: Extension to stochastic fixed point iterations.
Lecture 4: Application to reinforcement learning for MDPs.

The error-bound analysis is strongly based on finite-dimensional optimal transport. While the course does not presume familiarity with this topic, some acquaintance with linear and convex programming is desirable. Also, the analysis of stochastic iterations would benefit from prior knowledge about finite-dimensional martingales, although the main results used in the analysis will be recalled as needed.

References

[1] J. Abounadi, D. Bertsekas, and V.S.. Borkar, Stochastic approximation for nonexpansive maps: application to Q-learning algorithms, SIAM J. Control Optim., 41 (2002), pp. 1-22.

[2] Z. Allen-Zhu, How to make the gradients small stochastically: Even faster convex and non-convex SGD, In: Advances in Neural Information Processing Systems 31, (2018).

[3] F. Bach and E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, In: Advances in Neural Information Processing Systems 24, (2011).

[4] M. Benam, Dynamics of stochastic approximation algorithms, in Seminaire de Probabilites, XXXIII, vol. 1709 of Lecture Notes in Math., Springer, Berlin, 1999, pp. 1-68.

[5] M. Bravo and R. Cominetti, Stochastic xed-point iterations for nonexpansive maps: Convergence and error bounds, SIAM J. Control and Optimization, To appear (2023).

[6] M. Bravo, R. Cominetti, and T. Champion, Universal bounds for xed-point iterations via optimal transport metrics, Applied Set-Valued Analysis and Optimization, 4 (2022), pp. 293-310.

[7] M. Bravo, R. Cominetti, and M. Pavez-Signe, Rates of convergence for inexact Krasnosel'skii-Mann iterations in Banach spaces, Math. Program., 175 (2019), pp. 241-262.

[8] P.L. Combettes and J.C. Pesquet, Stochastic quasi-Fejer block-coordinate xed point iterations with random sweeping, SIAM J. Optim., 25 (2015), pp. 1221-1248.

[9] P.L. Combettes and J.C. Pesquet, Stochastic approximations and perturbations in forward-backward splitting for monotone operators, Pure Appl. Funct. Anal., 1 (2016), pp. 13-37.

[10] R. Cominetti, J. Soto, and J. Vaisman, On the rate of convergence of Krasnosel'skii-Mann iterations and their connection with sums of Bernoullis, Israel Journal of Mathematics, 199 (2014), pp. 757-772.

[11] J. Contreras and R. Cominetti, Optimal error bounds for nonexpansive xed-point iterations in normed spaces, Mathematical Programming, 199 (2023), pp. 343-374.

[12] V. Dewanto, G. Dunn, A. Eshragh, M. R. Gallagher, and F. Roosta, Average-reward model-free reinforcement learning: a systematic review and literature mapping, ArXiv, abs/2010.08920 (2020), https://api.semanticscholar.org/CorpusID:224715012.

[13] E. Even-Dar and Y. Mansour, Learning rates for q-learning, J. Machine Learn. Res., 5 (2019), pp. 1-25.

[14] S. Ghadimi and G. Lan, Stochastic rst- and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization, 23 (2013), pp. 2341-2368.

[15] A. Gosavi, Reinforcement learning: A tutorial survey and recent advances, INFORMS Journal on Computing, 21 (2008), pp. 178-192.

[16] T. Jaakola, M.I.. Jordan, and S.P.. Singh, On the convergence of stochastic iterative dynamic programming algorithms, Neural Comput., 6 (1994), pp. 1185-1201.

[17] G. Li, C. Cai, Y. Chen, Y. Wei, and Y. Chi, Is q-learning minimax optimal? A tight sample complexity analysis, Operations Research, (2023).

[18] J. Liu and Y. Yuan, On almost sure convergence rates of stochastic gradient methods, in 35th Annual Conference on Learning Theory, 2022.

[19] A. Nemirovskii, A. Juditskii, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), pp. 1574-1609.

[20] A. Pananjady and M. J. Wainwright, Instance-dependent `1-bounds for policy evaluation in tabular reinforcement learning, IEEE Transactions on Information Theory, 67 (2021), pp. 566-585, https://doi.org/10.1109/TIT.2020.3027316.

[21] B.T. Polyak and A.B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim., 30 (1992), pp. 838-855.

[22] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley and Sons, Hoboken, New Jersey, 1994, 2005.

[23] L. Rosasco, S. Villa, and B.C. Vu, Stochastic forward-backward splitting for monotone inclusions, J. Optim. Theory Appl., 169 (2016), pp. 388-406.

[24] R. Sutton and A. Barto, Reinforcement Learning, The MIT Press, Cambridge, Massachusetts, 2nd ed., 2018.

[25] J.N. Tsitsiklis, Asynchronous stochastic approximation and Q-learning, Mach. Learn., 16 (1994), pp. 185-202.

[26] M. Wainwright, Stochastic approximation with cone-contractive operators: Sharp l-infinity bounds for q-learning, arxiv.org/abs/1905.06265, (2019).

[27] M. Wainwright, Variance-reduced q-learning is minimax optimal, arxiv.org/abs/1906.04697, (2019).

[28] Y. Wan, A. Naik, and R. Sutton, Learning and planning in average-reward Markov decision processes, in Proceedings 38th International Conference on Machine Learning, M. Meila and T. Zhang, eds., vol. 139 of Proc. Machine Learning Research, 2021, pp. 10653-10662.

[29] S. Wang, J. Blanchet, and P. Glynn, Optimal sample complexity of reinforcement learning for mixing discounted markov decision processes, arXiv:2302.07477, (2023), pp. 1-38.

[30] C. Watkins and P. Dayan, Q-learning, Machine Learn., 8 (1992), pp. 279-292.