Incremental Policy Gradients for Online Reinforcement Learning Control


Policy gradient methods are built on the policy gradient theorem, which involves a term representing the complete sum of rewards into the future: the return. As a result, one usually either waits until the end of an episode before performing updates, or learns an estimate of the return, a so-called critic. In this work we focus on the first approach, detailing an incremental policy gradient update that neither waits until the end of the episode nor relies on a learned estimate of the return. We provide on-policy and off-policy variants of our algorithm, for both the discounted return and average reward settings. Theoretically, we draw a connection between the traces our methods use and the stationary distributions of the discounted and average reward settings. We conclude with an experimental evaluation of our methods on both simple-to-understand and complex domains.
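One way to see how an update can be both incremental and critic-free: the episodic gradient \(\sum_t G_t \nabla \log \pi(a_t \mid s_t)\) can be rearranged, by swapping the order of summation, into per-step terms \(r_t z_t\), where \(z_t = \gamma z_{t-1} + \nabla \log \pi(a_t \mid s_t)\) is an eligibility trace of log-policy gradients. The sketch below is not the paper's algorithm; it is a minimal illustration of that trace-based rearrangement on a hypothetical one-state problem with a tabular softmax policy. The environment, reward structure, step size, and horizon are all invented for the example.

```python
import numpy as np

def softmax(prefs):
    """Softmax policy over action preferences."""
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def run_incremental_pg(num_episodes=500, horizon=10, alpha=0.05, gamma=0.99, seed=0):
    """Trace-based incremental policy gradient sketch (illustrative only).

    Per step:  z <- gamma * z + grad log pi(a|s)
               theta <- theta + alpha * r * z
    which, summed over an episode, recovers the Monte Carlo
    return-weighted gradient without waiting for the episode to end
    and without a learned critic.
    """
    rng = np.random.default_rng(seed)
    num_states, num_actions = 1, 2              # hypothetical one-state task
    theta = np.zeros((num_states, num_actions)) # tabular policy preferences
    for _ in range(num_episodes):
        z = np.zeros_like(theta)                # eligibility trace of log-policy gradients
        s = 0
        for _ in range(horizon):
            probs = softmax(theta[s])
            a = rng.choice(num_actions, p=probs)
            r = 1.0 if a == 1 else 0.0          # invented reward: action 1 is better
            # Trace update: decay past gradients, add the current one.
            z *= gamma
            z[s] -= probs                       # grad log softmax: -pi(.|s) ...
            z[s, a] += 1.0                      # ... plus indicator of the taken action
            # Incremental update: reward times the trace, applied immediately.
            theta += alpha * r * z
        s = 0  # episode restarts in the single state
    return softmax(theta[0])
```

Running this drives the policy toward the rewarding action, e.g. `run_incremental_pg()` returns action probabilities heavily favouring action 1, matching what the full Monte Carlo gradient would do, but with an update applied at every time step.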
