IEEE Transactions on Automatic Control, Vol.50, No.5, 696-699, 2005
A basic formula for online policy gradient algorithms
This note presents a (new) basic formula for sample-path-based estimates of performance gradients for Markov systems (called policy gradients in the reinforcement learning literature). With this basic formula, many policy-gradient algorithms, including those that have previously appeared in the literature, can be easily developed. The formula follows naturally from a sensitivity equation in perturbation analysis. A new research direction is discussed.
Keywords: Markov decision processes; online estimation; perturbation analysis (PA); perturbation realization; Poisson equations; potentials; reinforcement learning
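
The abstract refers to a sensitivity equation from perturbation analysis without stating it. As a rough illustration only, the sketch below assumes the standard form for an ergodic finite Markov chain with average reward eta = pi f: d(eta)/d(theta) = pi (dP/d(theta)) g, where the potential vector g solves the Poisson equation (I - P + e pi) g = f. This is an exact matrix computation, not the sample-path-based estimate developed in the note, and the matrices `P0`, `P1`, the reward vector `f`, and all function names are hypothetical values chosen for the example.

```python
import numpy as np

def stationary(P):
    """Stationary distribution pi of an ergodic chain: pi P = pi, sum(pi) = 1."""
    n = P.shape[0]
    # Stack the balance equations with the normalization constraint.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def potentials(P, f):
    """Solve the Poisson equation (I - P + e pi) g = f for the potentials g."""
    n = P.shape[0]
    pi = stationary(P)
    return np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)

def average_reward(P, f):
    """Long-run average reward eta = pi f."""
    return stationary(P) @ f

def gradient_formula(P, dP, f):
    """Assumed sensitivity equation: d(eta)/d(theta) = pi dP g."""
    return stationary(P) @ dP @ potentials(P, f)

# Hypothetical example: P(theta) = P0 + theta * (P1 - P0), reward f fixed.
P0 = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.6, 0.2],
               [0.3, 0.3, 0.4]])
P1 = np.array([[0.2, 0.5, 0.3],
               [0.4, 0.3, 0.3],
               [0.1, 0.6, 0.3]])
f = np.array([1.0, 0.0, 2.0])

theta = 0.3
dP = P1 - P0                      # dP/d(theta) for the linear parameterization
P = P0 + theta * dP

grad = gradient_formula(P, dP, f)

# Central finite difference of eta(theta) as an independent check.
h = 1e-6
fd = (average_reward(P0 + (theta + h) * dP, f)
      - average_reward(P0 + (theta - h) * dP, f)) / (2 * h)

print("formula:", grad, "finite difference:", fd)
```

The two printed values should agree closely. An online algorithm of the kind the note discusses would replace the exact pi and g with estimates from a single sample path.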