A basic formula for online policy gradient algorithms

Cao XR

IEEE Transactions on Automatic Control, Vol.50, No.5, 696-699, 2005

DOI: 10.1109/TAC.2005.847037

This note presents a new basic formula for sample-path-based estimates of performance gradients for Markov systems (called policy gradients in the reinforcement learning literature). With this basic formula, many policy-gradient algorithms, including those that have previously appeared in the literature, can be easily developed. The formula follows naturally from a sensitivity equation in perturbation analysis. A new research direction is discussed.

Keywords: Markov decision processes; online estimation; perturbation analysis (PA); perturbation realization; Poisson equations; potentials; reinforcement learning