Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that this is not the case for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
Feb 15th 2022
Read this research paper, co-authored by Amii Fellow and Canada CIFAR AI Chair Adam White: Learning Expected Emphatic Traces for Deep RL
Feb 15th 2022
Read this research paper, co-authored by Canada CIFAR AI Chair Kevin Leyton-Brown: The Perils of Learning Before Optimizing
Feb 14th 2022
Read this research paper, co-authored by Amii Fellows and Canada CIFAR AI Chairs Osmar Zaïane,and Lili Mou, Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision
Looking to build AI capacity? Need a speaker at your event?
Get involved in Alberta's growing AI ecosystem! Speaker, sponsorship, and letter of support requests welcome.
Curious about study options under one of our researchers? Want more information on training opportunities?