The Tea Time Talks 2021: Week Five

The Tea Time Talks are back! Throughout the summer, take in 20-minute talks on early-stage ideas, prospective research and technical topics delivered by students, faculty and guests. Presented by Amii and the RLAI Lab at the University of Alberta, the talks are a relaxed and informal way of hearing leaders in AI discuss future lines of research they may explore.

Watch select talks from the fifth week of the series now:

Yufeng Yuan: Asynchronous Reinforcement Learning for Real-Time Control of Physical Robots

Abstract: An oft-ignored challenge of real-world reinforcement learning is that, unlike standard simulated environments, the real world does not pause when agents make learning updates. In this TTT, we investigate how sequentially implemented and asynchronously implemented versions of the same algorithm (Soft Actor-Critic) differ in performance on real-world robotic control tasks.

Alex Trudeau: Go-Exploit

Abstract: AlphaZero achieved superhuman performance in the games of Chess, Shogi, and Go using a general self-play reinforcement learning algorithm. AlphaZero employs exploration in its self-play games so that it encounters states throughout the state space, enabling it to learn which states and actions lead to wins. While AlphaZero uses a robust mechanism for exploration within its search, it has more simplistic mechanisms for exploration during self-play training: randomly perturbing the learned policy during search and stochastically selecting actions near the start of the game. We introduce an alternative training strategy called Go-Exploit that more reliably visits and revisits states throughout the state space and reduces exploration’s biasing of learning targets. Go-Exploit, inspired by Go-Explore, maintains an archive of previously visited states of interest and samples from this archive to determine the start state of self-play trajectories. We show in the games of Connect Four and 9x9 Go that Go-Exploit successfully visits and revisits more states throughout the state space and learns more effectively than AlphaZero.
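
To make the archive idea in the abstract concrete, here is a minimal Python sketch of sampling self-play start states from an archive of previously visited states. It is an illustration under stated assumptions, not the speaker's implementation: the StateArchive class, the p_archive probability, the fixed capacity, the toy game (a state is just a tuple of moves) and the random policy standing in for search are all hypothetical, and every non-terminal visited state is treated as a "state of interest".

```python
import random
from collections import deque


class StateArchive:
    """Bounded store of previously visited states (hypothetical helper)."""

    def __init__(self, capacity, rng):
        self.states = deque(maxlen=capacity)  # oldest states are dropped first
        self.rng = rng

    def add(self, state):
        self.states.append(state)

    def sample_start(self, initial_state, p_archive):
        # Assumption: with probability p_archive start from an archived state,
        # otherwise from the game's usual initial state.
        if self.states and self.rng.random() < p_archive:
            return self.rng.choice(self.states)
        return initial_state


def play_trajectory(start_state, max_len, rng):
    """Toy stand-in for a self-play game: append random moves until terminal."""
    state = start_state
    visited = [state]
    while len(state) < max_len:
        state = state + (rng.randrange(3),)  # random policy replaces search
        visited.append(state)
    return visited


def run(num_games=50, max_len=5, seed=0):
    rng = random.Random(seed)
    archive = StateArchive(capacity=100, rng=rng)
    starts = []
    for _ in range(num_games):
        start = archive.sample_start(initial_state=(), p_archive=0.8)
        starts.append(start)
        for state in play_trajectory(start, max_len, rng):
            if len(state) < max_len:  # archive non-terminal states only
                archive.add(state)
    return starts, archive


if __name__ == "__main__":
    starts, archive = run()
    deep_starts = sum(1 for s in starts if len(s) > 0)
    print(f"{deep_starts} of {len(starts)} games started from an archived state")
```

In this sketch, trajectories that begin from archived states reach deeper positions without relying on perturbed play from the opening position to get there, which is the intuition the abstract describes.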

