From stochastic control to continuous-time reinforcement learning

This was part of Machine Learning and Mean-Field Games

Yufei Zhang, London School of Economics and Political Science

Monday, May 23, 2022

Abstract: Reinforcement learning (RL) has recently attracted substantial research interest. Much of the attention and success, however, has been in the discrete-time setting. Continuous-time RL, despite its natural analytical connection to stochastic control, remains largely unexplored, with limited progress to date. This talk analyses the convergence rate and sample efficiency of continuous-time RL algorithms through the lens of continuous-time stochastic control theory. The first part of the talk proposes a policy gradient method for finite-horizon stochastic control problems with controlled drift, possibly degenerate noise, and nonsmooth nonconvex objective functions. We prove that, under suitable conditions, the algorithm converges linearly to a stationary point of the control problem. This convergence result justifies the recent RL heuristic that adding entropy regularisation or a fictitious discount factor to the optimisation objective accelerates the convergence of policy gradient methods. The proof exploits careful regularity estimates for backward stochastic differential equations. The second part of the talk discusses recent advances in the regret analysis of the episodic linear-convex RL problem and reports a sublinear (or even logarithmic) regret bound for greedy least-squares algorithms. The approach is probabilistic: it studies the Lipschitz stability of feedback controls via the associated forward-backward stochastic differential equations and explores the concentration properties of sub-Weibull random variables.