Chemical Engineering and Materials Research Information Center
IEEE Transactions on Automatic Control, Vol.50, No.5, 696-699, 2005
A basic formula for online policy gradient algorithms
This note presents a (new) basic formula for sample-path-based estimates of performance gradients for Markov systems (called policy gradients in the reinforcement learning literature). With this basic formula, many policy-gradient algorithms, including those that have previously appeared in the literature, can be easily developed. The formula follows naturally from a sensitivity equation in perturbation analysis. A new research direction is discussed.
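
The abstract does not state the sensitivity equation itself. As a rough illustration, the sketch below assumes the standard perturbation-analysis form for an ergodic Markov chain with transition matrix P(θ), reward vector f (independent of θ), stationary distribution π, and average reward η = πf: dη/dθ = π (dP/dθ) g, where the potential vector g solves the Poisson equation (I − P + eπ) g = f. The two-state chain, the reward values, and all function names are hypothetical choices made for this example, and the code evaluates the formula exactly rather than estimating it from a sample path as the note's algorithms would.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): check the
# sensitivity formula d(eta)/d(theta) = pi * dP/dtheta * g on a toy chain.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                k = M[r][c] / M[c][c]
                M[r] = [x - k * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def stationary(P):
    """Stationary distribution: solve pi (I - P + E) = 1^T, E = all-ones matrix."""
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - P[j][i] + 1.0 for j in range(n)]
         for i in range(n)]
    return solve(A, [1.0] * n)

def potentials(P, pi, f):
    """Potential vector g from the Poisson equation (I - P + e pi) g = f."""
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - P[i][j] + pi[j] for j in range(n)]
         for i in range(n)]
    return solve(A, f)

# Hypothetical two-state chain: theta is the probability of leaving state 0.
B = 0.5            # assumed probability of leaving state 1
F = [1.0, 3.0]     # assumed per-state rewards

def transition(theta):
    return [[1.0 - theta, theta], [B, 1.0 - B]]

def average_reward(theta):
    pi = stationary(transition(theta))
    return sum(p * r for p, r in zip(pi, F))

theta = 0.3
P = transition(theta)
dP = [[-1.0, 1.0], [0.0, 0.0]]   # elementwise derivative of P w.r.t. theta
pi = stationary(P)
g = potentials(P, pi, F)

# Sensitivity formula: sum_i pi_i * sum_j dP_ij * g_j
grad = sum(pi[i] * sum(dP[i][j] * g[j] for j in range(2)) for i in range(2))

# Central finite difference of the average reward, for comparison.
h = 1e-6
fd = (average_reward(theta + h) - average_reward(theta - h)) / (2 * h)

print("sensitivity formula:", grad)
print("finite difference:  ", fd)
```

The two printed values should agree closely. A sample-path-based algorithm of the kind the abstract describes would replace the exact π and g with estimates accumulated along a single trajectory.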