A Two Time-Scale Update Rule Ensuring Convergence of Episodic Reinforcement Learning Algorithms at the Example of RUDDER
Neural Information Processing Systems Foundation (NeurIPS 2019)
We prove, under commonly used assumptions, the convergence of episodic, that is, sequence-based, actor-critic-like reinforcement learning algorithms for which the policy becomes greedier during learning. The most prominent example of such algorithms is the recently introduced RUDDER method, which speeds up learning on delayed-reward problems by reward redistribution. RUDDER is based on simultaneously learning a policy and a reward redistribution network, similar to actor-critic methods. We show the convergence of RUDDER; the proof can be generalized to similar actor-critic-like algorithms. In contrast to previous convergence proofs for actor-critic-like methods, we consider whole episodes as learning examples, undiscounted reward, and a policy that becomes greedier during learning. We employ recent techniques from two time-scale stochastic approximation theory, which are equipped with a controlled Markov process to account for the policy becoming greedier. We expect our framework to be useful for proving convergence of other algorithms based on reward shaping or on attention mechanisms.
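To illustrate the structure the abstract describes, here is a minimal sketch of a generic two time-scale stochastic approximation scheme on a toy two-armed bandit. This is not the authors' algorithm: the environment, step-size schedules, and the inverse-temperature schedule `beta` are illustrative assumptions. The fast time-scale tracks a critic (per-action reward estimates, loosely analogous to the reward redistribution network), the slow time-scale updates the actor (softmax policy logits), and the policy is made greedier over time by increasing `beta`, mirroring the controlled-Markov-process viewpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 1.0])  # hypothetical environment: arm 1 is better

q = np.zeros(2)      # fast variable: critic estimates of each arm's reward
theta = np.zeros(2)  # slow variable: actor logits

for t in range(1, 20001):
    a_t = 1.0 / t ** 0.6   # fast step size
    b_t = 1.0 / t ** 0.9   # slow step size; b_t / a_t -> 0 (time-scale separation)
    beta = 1.0 + 0.001 * t  # increasing greediness of the policy

    # Sample an action from the tempered softmax policy.
    logits = beta * theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    action = rng.choice(2, p=p)
    reward = true_rewards[action] + rng.normal(0.0, 0.1)  # noisy reward

    # Fast update: critic tracks observed rewards.
    q[action] += a_t * (reward - q[action])

    # Slow update: policy-gradient-style step using the critic,
    # with gradient direction (one_hot(action) - p) * q[action].
    grad = -p * q[action]
    grad[action] += q[action]
    theta += b_t * grad
```

Because the slow step sizes shrink faster than the fast ones, the critic sees a quasi-stationary policy, while the actor sees an almost-converged critic; the increasing `beta` concentrates the policy on the better arm.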