Exploration-exploitation trade-off for continuous-time episodic reinforcement learning

This was part of Machine Learning and Mean-Field Games

Lukasz Szpruch, University of Edinburgh

Wednesday, May 25, 2022

Abstract:

We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to finite-time-horizon stochastic control problems with linear dynamics but unknown coefficients and convex, but possibly irregular, objective functions. Using probabilistic representations, we study the regularity of the associated cost functions and establish precise estimates for the performance gap between the feedback control derived from estimated model parameters and the optimal feedback control derived from the true parameters. We identify conditions under which this performance gap is quadratic in the parameter estimation error. Next, we propose a phase-based learning algorithm, for which we show how to optimise the exploration-exploitation trade-off and achieve sublinear regret, both with high probability and in expectation.

Finally, we study exploration-exploitation learning using noisy policies. Again, we achieve sublinear regret for a class of entropy-regularised stochastic control problems.
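
To make the idea of a phase-based scheme concrete, the following is a minimal sketch only, not the algorithm presented in the talk. It assumes a toy scalar linear-quadratic problem, dx = (a x + b u) dt + dW, with an Euler time discretisation, least-squares estimation of the unknown coefficients (a, b), and a certainty-equivalent feedback obtained from a Riccati equation. All function names, parameter values, the phase lengths (5 exploration episodes followed by 2^m exploitation episodes in phase m) and the clipping safeguards are assumptions made for illustration.

```python
import numpy as np

def riccati_gains(a, b, q=1.0, r=1.0, g=1.0, T=1.0, n=100):
    """Explicit Euler solve, backwards in time, of the scalar Riccati ODE
    -p' = 2 a p - (b^2 / r) p^2 + q with p(T) = g.
    Returns gains k[i] so that the feedback at step i is u = -k[i] * x."""
    dt = T / n
    p = g
    gains = np.empty(n)
    for i in range(n - 1, -1, -1):
        p = p + dt * (2 * a * p - (b ** 2 / r) * p ** 2 + q)
        gains[i] = (b / r) * p
    # Assumption: clip the gains as a crude numerical safeguard.
    return np.clip(gains, -10.0, 10.0)

def run_episode(a, b, policy, rng, x0=1.0, T=1.0, n=100, q=1.0, r=1.0, g=1.0):
    """Simulate one episode of the Euler-discretised dynamics.
    Returns the realised cost and the regression data (rows, targets)."""
    dt = T / n
    x, cost = x0, 0.0
    rows, targets = [], []
    for i in range(n):
        u = policy(i, x)
        dx = (a * x + b * u) * dt + np.sqrt(dt) * rng.standard_normal()
        rows.append([x * dt, u * dt])   # regressors for (a, b)
        targets.append(dx)              # observed state increment
        cost += (q * x ** 2 + r * u ** 2) * dt
        x += dx
    cost += g * x ** 2                  # terminal cost
    return cost, np.array(rows), np.array(targets)

def phase_based_learning(a_true=-0.5, b_true=1.0, n_phases=6, seed=0):
    """Alternate exploration and exploitation episodes, phase by phase.
    Returns the last coefficient estimate and the exploitation costs."""
    rng = np.random.default_rng(seed)
    X, y, costs = [], [], []
    theta = np.zeros(2)
    for m in range(n_phases):
        # Exploration: controls drawn independently of the state.
        for _ in range(5):
            _, rows, tg = run_episode(
                a_true, b_true, lambda i, x: rng.standard_normal(), rng)
            X.append(rows)
            y.append(tg)
        # Least-squares estimate of (a, b) from all data collected so far.
        theta, *_ = np.linalg.lstsq(np.vstack(X), np.concatenate(y), rcond=None)
        theta = np.clip(theta, -5.0, 5.0)  # assumption: keeps the Riccati solve tame
        k = riccati_gains(theta[0], theta[1])
        # Exploitation: certainty-equivalent feedback, for 2^m episodes.
        for _ in range(2 ** m):
            c, rows, tg = run_episode(
                a_true, b_true, lambda i, x: -k[i] * x, rng)
            costs.append(c)
            X.append(rows)
            y.append(tg)
    return theta, np.array(costs)
```

Calling `phase_based_learning()` returns the final estimate of (a, b) together with the costs of the exploitation episodes. Lengthening the exploitation phases as the estimate improves is one simple way to balance the two kinds of episode; the schedule used here is arbitrary and carries no regret guarantee.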