Policy Gradient is Essentially Policy Evaluation

This was part of Machine Learning and Mean-Field Games

Xunyu Zhou, Columbia University

Monday, May 23, 2022

Abstract: We study policy gradient (PG) for reinforcement learning (RL) in continuous time and space under the regularized exploratory formulation developed by Wang et al. (JMLR 2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (JMLR 2022) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately. Joint work with Yanwei Jia.