SIAM Journal on Control and Optimization, Vol.38, No.1, 94-123, 1999
Actor-critic-type learning algorithms for Markov decision processes
Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm from the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis involves two-time-scale stochastic approximation.
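To illustrate the general flavor of such methods, the sketch below shows a tabular actor-critic loop driven by simulated transitions, with the critic and actor updated on two time scales. This is not the paper's specific algorithm or parameterization; the synthetic MDP, the softmax policy, the TD(0) critic, and the step-size schedules a_n (critic, faster) and b_n (actor, slower, with b_n / a_n -> 0) are all illustrative assumptions.

```python
# Minimal two-time-scale actor-critic sketch (illustrative assumptions,
# not the algorithm analyzed in the paper).

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
gamma = 0.95  # discount factor

# Synthetic MDP: random transition kernel P[s, a, s'] and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

V = np.zeros(n_states)                   # critic: state-value estimates
theta = np.zeros((n_states, n_actions))  # actor: action preferences


def policy(s):
    """Softmax policy induced by the actor parameters."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()


s = 0
for n in range(1, 100_001):
    # Two time scales: the critic step size a_n decays more slowly than the
    # actor step size b_n, so b_n / a_n -> 0 and the critic effectively sees
    # a quasi-static policy.
    a_n = 1.0 / (n ** 0.6)   # critic (faster time scale)
    b_n = 1.0 / n            # actor (slower time scale)

    # Simulate one transition under the current policy.
    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Critic: TD(0) update of the value estimate.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += a_n * delta

    # Actor: nudge the preference of the chosen action in the direction of
    # the TD error (a common actor-critic-style improvement step).
    theta[s, a] += b_n * delta

    s = s_next

print("Estimated values:", np.round(V, 3))
print("Greedy policy:   ", theta.argmax(axis=1))
```

The separation of step sizes is the point: because the critic converges on the faster time scale, its estimates track the value of the actor's current (slowly varying) policy, which is what the two-time-scale stochastic approximation analysis formalizes.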