Amii is proud to feature the work of our researchers at the 37th International Conference on Machine Learning (ICML). Amii supports cutting-edge AI research and translates scientific advancement into industry adoption.

Such research will be featured at ICML, running online this year from July 12 to 18. ICML is globally renowned for presenting and publishing leading-edge research across all aspects of machine learning, and is one of the fastest-growing AI conferences in the world. According to Synced, the prestigious conference has an acceptance rate of 21.8%, with a total of 1,088 papers out of 4,990 submissions making it in.

Accepted papers from Amii researchers cover a range of topics including how to rethink the optimization paradigm of RL applications, off-policy deep RL algorithms, and variance reduction within extensive-form games.

Amii Fellows and researchers – professors and graduate students at the University of Alberta and the University of BC – are presenting their papers to conference attendees through the week:

###### Tuesday, July 14

**8 – 8:45 a.m. & 7 – 7:45 p.m. MDT**: **Optimizing for the Future in Non-Stationary MDPs**

Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip ThomasMost reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process (MDP) is stationary. However, in many practical real-world applications, this assumption is clearly violated. We discuss how current methods can have inherent limitations for non-stationary MDPs, and therefore searching a policy that is good for the future, unknown MDP, requires rethinking the optimization paradigm. To address this problem, we develop a method that builds upon ideas from both counter-factual reasoning and curve-fitting to proactively search for a good future policy, without ever modeling the underlying non-stationarity. The effectiveness of the proposed method is demonstrated on problems motivated by real-world applications.

**10 – 10:45 a.m. & 9 – 9:45 p.m. MDT:** **Scalable Deep Generative Modeling for Sparse Graphs**

Hanjun Dai, Azade Nazi, Yujia Li, Bo Dai, Dale SchuurmansLearning graph generative models is a challenging task for deep learning and has wide applicability to a range of domains like chemistry, biology and social science. However current deep neural methods suffer from limited scalability: for a graph with n nodes and m edges, existing deep neural methods require \Omega(n^2) complexity by building up the adjacency matrix. On the other hand, many real world graphs are actually sparse in the sense that m\ll n^2. Based on this, we develop a novel autoregressive model, named BiGG, that utilizes this sparsity to avoid generating the full adjacency matrix, and importantly reduces the graph generation time complexity to O((n + m)\log n). Furthermore, during training this autoregressive model can be parallelized with O(\log n) synchronization stages, which makes it much more efficient than other autoregressive models that require \Omega(n). Experiments on several benchmarks show that the proposed approach not only scales to orders of magnitude larger graphs than previously possible with deep autoregressive graph generative models, but also yields better graph generation quality.

**10 – 10:45 a.m. & 10 – 10:45 p.m. MDT: Fiduciary Bandits**

Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, Moshe TennenholtzRecommendation systems often face exploration-exploitation tradeoffs: the system can only learn about the desirability of new options by recommending them to some user. Such systems can thus be modeled as multi-armed bandit settings; however, users are self-interested and cannot be made to follow recommendations. We ask whether exploration can nevertheless be performed in a way that scrupulously respects agents’ interests—i.e., by a system that acts as a \emph{fiduciary}. More formally, we introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone. Our main contribution is a positive result: an asymptotically optimal, incentive compatible, and \emph{ex-ante} individually rational recommendation algorithm.

**11 – 11:45 a.m. & (Wednesday, July 15) 12 – 12:45 a.m. MDT: ****Selective Dyna-style Planning Under Limited Model Capacity**

Zaheer SM, Samuel Sokota, Erin Talvitie, Martha WhiteIn model-based reinforcement learning, planning with an imperfect model of the environment has the potential to harm learning progress. But even when a model is imperfect, it may still contain information that is useful for planning. In this paper, we investigate the idea of using an imperfect model selectively. The agent should plan in parts of the state space where the model would be helpful but refrain from using the model where it would be harmful. An effective selective planning mechanism requires estimating predictive uncertainty, which arises out of aleatoric uncertainty and epistemic uncertainty. Prior work has focused on parameter uncertainty, a particular kind of epistemic uncertainty, for selective planning. In this work, we emphasize the importance of structural uncertainty, a distinct kind of epistemic uncertainty that signals the errors due to limited capacity or a misspecified model class. We show that heteroscedastic regression, under an isotropic Gaussian assumption, can signal structural uncertainty that is complementary to that which is detected by methods designed to detect parameter uncertainty, indicating that considering both parameter and structural uncertainty may be a more promising direction for effective selective planning than either in isolation.

**11 – 11:45 a.m. & (Wednesday, July 15) 12 – 12:45 a.m.**

**MDT:**

**A simpler approach to accelerated optimization: iterative averaging meets optimism**

Pooria Joulani, Anant Raj, András György, Csaba SzepesváriRecently there have been several attempts to extend Nesterov’s accelerated algorithm to smooth stochastic and variance-reduced optimization. In this paper, we show that there is a simpler approach to acceleration: applying optimistic online learning algorithms and querying the gradient oracle at the online average of the intermediate optimization iterates. In particular, we tighten a recent result of Cutkosky (2019) to demonstrate theoretically that online iterate averaging results in a reduced optimization gap, independently of the algorithm involved. We show that carefully combining this technique with existing generic optimistic online learning algorithms yields the optimal accelerated rates for optimizing strongly-convex and non-strongly-convex, possibly composite objectives, with deterministic as well as stochastic first-order oracles. We further extend this idea to variance-reduced optimization. Finally, we also provide universal” algorithms that achieve the optimal rate for smooth and non-smooth composite objectives simultaneously without further tuning, generalizing the results of Kavis et al. (2019) and solving a number of their open problems.

**2 – 2:45 p.m. & (Wednesday, July 15) 3 – 3:45 a.m. MDT:**

**Learning with Good Feature Representations in Bandits and in RL with a Generative Model**

Gellért Weisz, Tor Lattimore, Csaba SzepesváriThe construction in the recent paper by Du et al. [2019] implies that searching for a near-optimal action in a bandit sometimes requires examining essentially all the actions, even if the learner is given linear features in R^d that approximate the rewards with a small uniform error. We use the Kiefer-Wolfowitz theorem to prove a positive result that by checking only a few actions, a learner can always find an action that is suboptimal with an error of at most O(ε√d) where ε is the approximation error of the features. Thus, features are useful when the approximation error is small relative to the dimensionality of the features. The idea is applied to stochastic bandits and reinforcement learning with a generative model where the learner has access to d-dimensional linear features that approximate the action-value functions for all policies to an accuracy of ε. For linear bandits, we prove a bound on the regret of order d√(n log(k)) + εn√d log(n) with k the number of actions and n the horizon. For RL we show that approximate policy iteration can learn a policy that is optimal up to an additive error of order ε√d/(1 − γ)^2 and using about d/(ε^2(1 − γ)^4) samples from the generative model. These bounds are independent of the finer details of the features. We also investigate how the structure of the feature set impacts the tradeoff between sample complexity and estimation error.

###### Wednesday, July 15

**6 – 6:45 a.m. & 5 – 5:45 p.m. MDT: ****An Optimistic Perspective on Offline Deep Reinforcement Learning**

Rishabh Agarwal, Dale Schuurmans, Mohammad NorouziOff-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN replay dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN replay dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms trained on sufficiently large and diverse offline datasets can lead to high quality policies. The DQN replay dataset can serve as an offline RL benchmark and is open-sourced.

**11 – 11:45 a.m. & 10 – 10:45 p.m. MDT: ****Batch Stationary Distribution Estimation**

Junfeng Wen, Bo Dai, Lihong Li, Dale SchuurmansWe consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions. Classical simulation-based approaches assume access to the underlying process so that trajectories of sufficient length can be gathered to approximate stationary sampling. Instead, we consider an alternative setting where a \emph{fixed} set of transitions has been collected beforehand, by a separate, possibly unknown procedure. The goal is still to estimate properties of the stationary distribution, but without additional access to the underlying system. We propose a consistent estimator that is based on recovering a correction ratio function over the given data. In particular, we develop a variational power method (VPM) that provides provably consistent estimates under general conditions. In addition to unifying a number of existing approaches from different subfields, we also find that VPM yields significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation.

**11 – 11:45 a.m. & 10 – 10:45 p.m. MDT: Handling the Positive-Definite Constraint in the Bayesian Learning Rule**

Wu Lin, Mark Schmidt, Mohammad Emtiyaz KhanBayesian learning rule is a recently proposed variational inference method, which not only contains many existing learning algorithms as special cases but also enables the design of new algorithms. Unfortunately, when posterior parameters lie in an open constraint set, the rule may not satisfy the constraints and require line-searches which could slow down the algorithm. In this paper, we fix this issue for the positive-definite constraint by proposing an improved rule that naturally handles the constraint. Our modification is obtained using Riemannian gradient methods, and is valid when the approximation attains a block-coordinate natural parameterization (e.g., Gaussian distributions and their mixtures). Our method outperforms existing methods without any significant increase in computation. Our work makes it easier to apply the learning rule in presence of positive-definite constraints in parameter spaces.

**10 – 10:45 a.m. & 10 – 10:45 p.m. MDT:** **Low-Variance and Zero-Variance Baselines for Extensive-Form Games**

Trevor Davis, Martin Schmid, Michael BowlingExtensive-form games (EFGs) are a common model of multi-agent interactions with imperfect information. State-of-the-art algorithms for solving these games typically perform full walks of the game tree that can prove prohibitively slow in large games. Alternatively, sampling-based methods such as Monte Carlo Counterfactual Regret Minimization walk one or more trajectories through the tree, touching only a fraction of the nodes on each iteration, at the expense of requiring more iterations to converge due to the variance of sampled values. In this paper, we extend recent work that uses baseline estimates to reduce this variance. We introduce a framework of baseline-corrected values in EFGs that generalizes the previous work. Within our framework, we propose new baseline functions that result in significantly reduced variance compared to existing techniques. We show that one particular choice of such a function — predictive baseline — is provably optimal under certain sampling schemes. This allows for efficient computation of zero-variance value estimates even along sampled trajectories.

###### Thursday, July 16

**8 – 8:45 a.m. & 8 – 8:45 p.m. MDT: **

**Model-Based Reinforcement Learning with Value-Targeted Regression**

Zeyu Jia, Lin Yang, Csaba Szepesvári, Mengdi Wang, Alex AyoubReinforcement learning (RL) applies to control problems with large state and action spaces, hence it is natural to consider RL with a parametric model. In this paper we focus on finite-horizon episodic RL where the transition model admits a nonlinear parametrization P_{\theta}, a special case of which is the linear parameterization: P_{\theta} = \sum_{i=1}^{d} (\theta)_{i}P_{i}. We propose an upper confidence model-based RL algorithm with value-targeted model parameter estimation. The algorithm updates the estimate of \theta by solving a nonlinear regression problem using the latest value estimate as the target. We demonstrate the efficiency of our algorithm by proving its expected regret bound which, in the special case of linear parameterization takes the form \tilde{\mathcal{O}}(d\sqrt{H^{3}T}), where H, T, d are the horizon, total number of steps and dimension of \theta. This regret bound is independent of the total number of states or actions, and is close to a lower bound \Omega(\sqrt{HdT}). In the general nonlinear case, we handle the regret analysis by using the concept of Eluder dimension proposed by \citet{RuVR14}.

**8 – 8:45 a.m. & 7 – 7:45 p.m. MDT:** **ConQUR: Mitigating Delusional Bias in Deep Q-Learning**

DiJia Su, Jayden Ooi, Tyler Lu, Dale Schuurmans, Craig BoutilierDelusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are “consistent” with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.

**8 – 8:45 a.m. & 7 – 7:45 p.m.**

**MDT:**

**Go Wide, Then Narrow: Efficient Training of Deep Thin Networks**

Denny Zhou, Mao Ye, Chen Chen, Mingxing Tan, Tianjian Meng, Xiaodan Song, Quoc Le, Qiang Liu, Dale SchuurmansWe propose an efficient algorithm to train a very deep and thin network with theoretic guarantee. Our method is motivated by model compression, and consists of three stages. In the first stage, we widen the deep thin network and train it until convergence. In the second stage, we use this well trained deep wide network to warm up or initialize the original deep thin network. In the last stage, we train this well initialized deep thin network until convergence. The key ingredient of our method is its second stage, in which the thin network is gradually warmed up by imitating the intermediate outputs of the wide network from bottom to top. We establish theoretical guarantee using mean field analysis. We show that our method is provably more efficient than directly training a deep thin network from scratch. We also conduct empirical evaluations on image classification and language modeling. By training with our approach, ResNet50 can outperform ResNet101 which is normally trained as in the literature, and BERT

BASE can be comparable with BERTLARGE.

**9 – 9:45 a.m. & 8 – 8:45 p.m. MDT: ****Gradient Temporal-Difference Learning with Regularized Corrections**

Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gutpa, Adam White, Martha WhiteValue function learning remains a critical component of many reinforcement learning systems. Many algorithms are based on temporal difference (TD) updates, which have well-documented divergence issues, even though potentially sound alternatives exist like Gradient TD. Unsound approaches like Q-learning and TD remain popular because divergence seems rare in practice and these algorithms typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an easy to use and performant TD method, or a more complex algorithm that is more sound but harder to tune, less sample efficient, and underexplored with control. In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), that attempts to balance ease of use, soundness, and performance. It behaves as well as TD, when TD performs well, but is sound even in cases where TD diverges. We characterize the expected update for TDRC, and show that it inherits soundness guarantees from Gradient TD, and converges to the same solution as TD. Empirically, TDRC exhibits good performance and low parameter sensitivity across several problems.

**9 – 9:45 a.m. & 8 – 8:45 p.m. MDT: ****Domain Aggregation Networks for Multi-Source Domain Adaptation**

Junfeng Wen, Russell Greiner, Dale SchuurmansIn many real-world applications, we want to exploit multiple source datasets to build a model for a different but related target dataset. Despite the recent empirical success, most existing research has used ad-hoc methods to combine multiple sources, leading to a gap between theory and practice. In this paper, we develop a finite-sample generalization bound based on domain discrepancy and accordingly propose a theoretically justified optimization procedure. Our algorithm, Domain AggRegation Network (DARN), can automatically and dynamically balance between including more data to increase effective sample size and excluding irrelevant data to avoid negative effects during training. We find that DARN can significantly outperform the state-of-the-art alternatives on multiple real-world tasks, including digit/object recognition and sentiment analysis.

**10 – 10:45 a.m. & 9 – 9:45 p.m. MDT:** **On the Global Convergence Rates of Softmax Policy Gradient Methods**

Jincheng Mei, Chenjun Xiao, Csaba Szepesvári, Dale SchuurmansWe make three contributions toward better understanding policy gradient methods. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a O(1/t) rate, with constants depending on the problem and initialization. This result significantly improves recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that in the one state (bandit) case it enjoys a linear convergence rate O(e^{-t}), while for general MDPs we prove that it converges at a O(1/t) rate. This result resolves an open question in the recent literature. A key insight is that the entropy regularized gradient update behaves similarly to the contraction operator in value learning, with contraction factor depending on current policy. Finally, combining the above two results and additional lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.

**10 – 10:45 a.m. & (Friday, July 17) 12 – 12:45 a.m. MDT: ****Energy-Based Processes for Exchangeable Data**

Mengjiao Yang, Bo Dai, Hanjun Dai, Dale SchuurmansRecently, there has been growing interest in modeling sets with exchangeability such as point clouds. A shortcoming of current approaches is that they restrict the cardinality of the sets considered or can only express limited forms of distribution over unobserved data. To overcome these limitations, we introduce Energy-Based Processes (EBPs), which extend energy-based models to exchangeable data while allowing neural network parameterizations of the energy function. A key advantage of these models is the ability to express more flexible distributions over sets without restricting their cardinality. We develop an efficient training procedure for EBPs that demonstrates state-of-the-art performance on a variety of tasks such as point cloud generation, classification, denoising, and image completion.

###### Amii will also host socials during the conference to create opportunity for conference-goers to socialize with each other and high-profile researchers:

– Doing RL in the Real World Social – Friday, July 17 at 3 – 5 p.m. MDT

– Multiagent Interactions: a Game Theory social – Thursday, July 16 at 1 – 2 p.m. MDT

– Neural Networking: an AI in Health social – Friday, July 17 at 6 – 8 p.m. MDT