The fourteenth annual International Conference on Learning Representations (ICLR) is underway this week in Rio de Janeiro, Brazil. Amii is proud to showcase the diverse and high-impact research our Fellows, Canada CIFAR AI Chairs, and students are presenting this year.
ICLR is a premier global conference for the advancement of representation learning, exploring how models process data to solve complex problems in computer vision, robotics, and natural language processing.
This year, Amii’s contributions push the boundaries of what’s possible in automated intelligence. Our researchers are unveiling novel frameworks to make reinforcement learning (RL) more sample-efficient, developing memory architectures inspired by human cognition for Large Language Models (LLMs) that handle millions of tokens, and establishing new standards for fairness and privacy in data synthesis.
Want to stay up-to-date on the latest research from the Amii community? Sign up for our monthly newsletter!
* denotes Amii affiliation
Accepted Papers
Distributions as Actions: A Unified Framework for Diverse Action Spaces
Jiamin He*, A. Rupam Mahmood*, Martha White*
LINK TO PAPER
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
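The boundary shift at the heart of this framework can be illustrated with a toy Bernoulli bandit. In this sketch (all function names are hypothetical, not the paper's code), the agent's action is the distribution parameter itself, and sampling the original discrete action is folded into the environment, so the agent-facing action space becomes continuous:

```python
import random

def env_step_original(discrete_action):
    # Toy one-step environment (hypothetical): reward 1.0 for action 1.
    return 1.0 if discrete_action == 1 else 0.0

def env_step_distribution(p):
    # Wrapped environment: the agent's action is now the Bernoulli
    # parameter p in [0, 1]. Sampling the discrete action happens
    # inside the environment, so the agent-facing action space is
    # continuous even though the original space was discrete.
    discrete_action = 1 if random.random() < p else 0
    return env_step_original(discrete_action)

random.seed(0)
# A deterministic policy over distribution parameters: always emit p = 0.9.
returns = [env_step_distribution(0.9) for _ in range(10_000)]
avg = sum(returns) / len(returns)  # close to 0.9 on this toy task
```

Because the wrapped action `p` is continuous, a deterministic policy-gradient method can now be applied even though the underlying action was discrete; that is the reparameterization DA-PG builds on.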
Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models
Pranaya Jajoo*, Harshit Sikchi, Siddhant Agarwal, Amy Zhang, Scott Niekum, Martha White*
LINK TO PAPER
Behavioral Foundation Models (BFMs) have recently succeeded in producing agents that can adapt to any unknown reward or task. In practice, these methods can only produce near-optimal policies for reward functions that lie in the span of some pre-existing state features, so their efficiency relies heavily on the choice of those features. As a result, BFMs have used a wide variety of complex objectives, often sensitive to environment coverage, to train task-spanning features with different inductive properties. With this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that this objective alone is prone to increasing state-feature similarity, reducing the span of reward functions for which we can represent optimal policies. We propose an approach, RLDP, that adds a simple regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we demonstrate that prior approaches diverge in low-coverage scenarios where RLDP still succeeds.
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland*, Xiaoxiao Li, Christos Thrampoulidis
LINK TO PAPER
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
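The reweighting idea can be sketched in a few lines. This is an illustrative stand-in, not the paper's published formulation: the weights `alpha`/`beta` and the sign-based rule are assumptions, but they show how amplifying positive-THR tokens biases training toward exploitation while the reverse biases it toward exploration:

```python
def thr_reweight(token_losses, thr_values, alpha=1.5, beta=0.5, exploit=True):
    # Reweight per-token policy-gradient terms by the sign of their
    # Token Hidden Reward (THR). For an exploitation bias, amplify
    # positive-THR tokens (weight alpha) and damp negative-THR tokens
    # (weight beta); swap the weights for an exploration bias.
    weighted = []
    for loss, thr in zip(token_losses, thr_values):
        if exploit:
            w = alpha if thr > 0 else beta
        else:
            w = beta if thr > 0 else alpha
        weighted.append(w * loss)
    return weighted

losses = [1.0, 2.0, 1.0]
thrs = [0.8, -0.3, 0.1]                                     # hypothetical THR values
exploit_losses = thr_reweight(losses, thrs, exploit=True)   # [1.5, 1.0, 1.5]
explore_losses = thr_reweight(losses, thrs, exploit=False)  # [0.5, 3.0, 0.5]
```

Under the exploitation setting, high-THR tokens dominate the update; under the exploration setting, probability mass is preserved for alternative outputs, mirroring the Pass@K gains the abstract describes.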
Learn more about the researchers
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli*, Alireza Salemi, Carrie Ye*, Mohamed Abdalla*, Hamed Zamani, J Ross Mitchell*
LINK TO PAPER
Evaluating the abilities of large language models (LLMs) on tasks that require long-term memory, and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5% to 12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
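The three-part memory design can be sketched as a small class. This is a toy rendering in the spirit of LIGHT, not the paper's implementation: the class name, methods, and the naive keyword retrieval are all illustrative assumptions:

```python
from collections import deque

class LightMemory:
    # Toy three-part memory: an episodic store of all past turns,
    # a bounded working memory of recent turns, and a scratchpad
    # of accumulated salient facts.
    def __init__(self, working_size=4):
        self.episodic = []                         # long-term: every turn
        self.working = deque(maxlen=working_size)  # short-term: recent turns
        self.scratchpad = set()                    # salient facts

    def observe(self, turn, facts=()):
        self.episodic.append(turn)
        self.working.append(turn)
        self.scratchpad.update(facts)

    def context_for_llm(self, query):
        # Naive episodic retrieval: keyword overlap with the query
        # (a real system would use embeddings).
        qwords = set(query.lower().split())
        retrieved = [t for t in self.episodic
                     if qwords & set(t.lower().split())][:3]
        return {"retrieved": retrieved,
                "recent": list(self.working),
                "facts": sorted(self.scratchpad)}

mem = LightMemory(working_size=2)
mem.observe("Alice moved to Berlin in 2019", facts={"Alice lives in Berlin"})
mem.observe("We talked about hiking routes")
mem.observe("Alice adopted a dog named Rex", facts={"Alice has a dog"})
ctx = mem.context_for_llm("where does Alice live?")
```

Only the bounded working memory and the compact scratchpad travel with every prompt; the episodic store is consulted on demand, which is why the context size stays flat as the dialogue grows.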
Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data
Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler*, Arno Solin, Juho Kannala, Joni Pajarinen
LINK TO PAPER
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two essential techniques: (i) experience rehearsal and (ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies
Armin Kekić, Jan Schneider, Dieter Büchler, Bernhard Schölkopf, Michel Besserve
LINK TO PAPER
Exponential-Wrapped Mechanisms: Differential Privacy on Hadamard Manifolds Made Practical
Yangdi Jiang, Xiaotian Chang, Lei Ding, Linglong Kong*, Bei Jiang*
LINK TO PAPER
We extend the Differential Privacy (DP) framework to Hadamard manifolds, the class of complete and simply connected Riemannian manifolds with non-positive sectional curvature. Inspired by the Cartan–Hadamard theorem, we introduce Exponential-Wrapped Laplace and Gaussian mechanisms to achieve ε-DP, (ε, δ)-DP, Gaussian DP (GDP), and Rényi DP (RDP) on these manifolds. Our approach employs efficient, straightforward algorithms that circumvent computationally intensive Markov chain Monte Carlo (MCMC) methods. This work is the first to extend (ε, δ)-DP, GDP, and RDP to Hadamard manifolds. We further demonstrate the effectiveness of our methodology through simulations on the space of symmetric positive definite matrices, a frequently used Hadamard manifold in statistics. Our findings reveal that our Exponential-Wrapped mechanisms surpass traditional MCMC-based approaches, which require careful tuning and extensive diagnostics, in both performance and ease of use. Additionally, our methods achieve comparable utility to the Riemannian Laplace mechanism, with enhanced utility for smaller privacy budgets (ε), and operate orders of magnitude faster.
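The "wrap noise through the exponential map" idea can be shown on the simplest Hadamard manifold: the positive reals with geodesic distance d(x, y) = |log(y/x)|. This sketch is an illustration of the general recipe under that toy geometry, not the paper's mechanism for SPD matrices, and the sensitivity/ε handling is a simplifying assumption:

```python
import math
import random

def laplace_sample(scale, rng):
    # Centred Laplace variate via inverse-CDF sampling.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def exp_wrapped_laplace(x, sensitivity, epsilon, rng):
    # Exponential-wrapped Laplace mechanism on the positive reals:
    # draw Laplace noise in the (flat) tangent space at x, then push
    # it onto the manifold with the exponential map Exp_x(t) = x * e^t.
    # The output is always a valid manifold point, and no MCMC is needed.
    t = laplace_sample(sensitivity / epsilon, rng)
    return x * math.exp(t)

rng = random.Random(42)
private = [exp_wrapped_laplace(2.0, sensitivity=0.1, epsilon=1.0, rng=rng)
           for _ in range(5000)]
# The geodesic centre of the outputs stays near the true value 2.0.
geo_mean = math.exp(sum(math.log(p) for p in private) / len(private))
```

Because sampling happens in the flat tangent space and is then mapped back, every draw lands on the manifold by construction, which is what lets these mechanisms avoid the tuning and diagnostics that MCMC-based approaches require.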
A Bayesian Nonparametric Framework for Private, Fair, and Balanced Tabular Data Synthesis
Forough Fazeli-Asl, Michael Minyi Zhang, Linglong Kong*, Bei Jiang*
LINK TO PAPER
A fundamental challenge in data synthesis is protecting the fairness and privacy of individuals, particularly in data-scarce environments where underrepresented groups risk further marginalization when the biases inherent in the data are reproduced by the modeling process. We introduce a privacy- and fairness-aware framework for a class of generative models, which fuses a conditional generator with Bayesian nonparametric learning (BNPL). This conditional structure imposes fairness constraints in our generative model by minimizing the mutual information between generated outcomes and protected attributes. Unlike existing methods that primarily focus on sensitive binary-valued attributes, our framework extends seamlessly to non-binary attributes. Moreover, our method provides a systematic solution to class imbalance, ensuring adequate representation of underrepresented protected groups. Our proposed approach offers a scalable, privacy-preserving framework for ethical and equitable data generation, which we demonstrate through theoretical guarantees and extensive experiments on sensitive empirical examples.
Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning
Ke Sun, Hongming Zhang, Jun Jin*, Chao Gao, Xi Chen, Wulong Liu, Linglong Kong*
LINK TO PAPER
Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual reinforcement learning (RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi-task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm-up mechanism that selectively harnesses past knowledge. We conduct experiments in various pixel-based and continuous control benchmarks, revealing the superior performance of continual learning for our proposed dual-learner approach relative to baseline methods.
Learn more about the researchers
Beyond Visual Reconstruction Quality: Object Perception-aware 3D Gaussian Splatting for Autonomous Driving
Renzhi Wang, Yuxiang Fu, Wuqi Wang, Haigen Min, Wei Feng, Lei Ma, Qing Guo
LINK TO PAPER
Reconstruction techniques, such as 3D Gaussian Splatting (3DGS), are increasingly used to generate scenarios for autonomous driving system (ADS) research. Existing 3DGS-based approaches for autonomous-driving scenario generation have, through various optimizations, achieved high visual similarity in reconstructed scenes. However, this route is built on a strong assumption: that higher scene similarity directly translates into better preservation of ADS behaviour. Unfortunately, this assumption has not been effectively validated, and ADS behaviour is more closely related to objects within the field of view rather than the global image. Thus, we focus on the perception module—the entry point of ADS. Preliminary experiments reveal that although current methods can produce reconstructions with high overall similarity, they often fail to ensure that the perception module outputs remain consistent with those obtained from the original images. Such a limitation can significantly harm the applicability of reconstruction in the ADS domain. To address this gap, we propose two complementary solutions: a perception-aligned loss, which directly leverages output differences between reconstructed and ground-truth images during training; and an object zone quality loss, which specifically reinforces training on object locations identified by the perception model on ground-truth images. Experiments demonstrate that both of our methods improve the ability of reconstructed scenes to maintain consistency between the perception module outputs and the ground-truth inputs. We release code at: https://github.com/Shanicky-RenzhiWang/Perception-aware-3DGS
Nano3D: A Training-Free Approach for Efficient 3D Editing Without Masks
LINK TO PAPER
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Yongchang Hao*, Lili Mou*
LINK TO PAPER
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive, as slight variations of the verifier's distribution, such as those introduced by top-k or temperature sampling, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution and increased acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
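The baseline SpS acceptance rule, and the kind of relaxation at stake, can be sketched numerically. The standard rule keeps a draft token x with probability min(1, p(x)/q(x)); the `slack` factor below is an illustrative stand-in for a loosened test that raises acceptance at the cost of a controlled divergence (Cactus's actual constrained rule differs):

```python
import random

def accept_token(p_target, q_draft, token, rng, slack=1.0):
    # Standard speculative-sampling acceptance test: keep the draft
    # token with probability min(1, p(x)/q(x)). slack > 1 loosens
    # the test, accepting more draft tokens.
    ratio = p_target[token] / q_draft[token]
    return rng.random() < min(1.0, slack * ratio)

p = {"a": 0.6, "b": 0.4}   # verifier (target) distribution
q = {"a": 0.3, "b": 0.7}   # draft distribution
trials = 10_000

rng = random.Random(0)
strict = sum(accept_token(p, q, "b", rng) for _ in range(trials)) / trials

rng = random.Random(0)
relaxed = sum(accept_token(p, q, "b", rng, slack=1.5)
              for _ in range(trials)) / trials
# strict ~ 0.57 = min(1, 0.4/0.7); relaxed ~ 0.86 with slack = 1.5
```

The empirical acceptance rates track the analytic ones, showing how even a modest relaxation meaningfully raises throughput; the contribution of Cactus is to make that relaxation a solution to a constrained optimization problem rather than a heuristic.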
TokMem: One-Token Procedural Memory for Large Language Models
Zijun Wu*, Yongchang Hao*, Lili Mou*
LINK TO PAPER
Large language models are typically controlled via prompts, which must be repeatedly re-processed for every new query and are difficult to reuse modularly. We introduce TokMem, a procedural memory framework that compiles each reusable task procedure into a single trainable memory token. Each token serves as both a procedure index and a control signal that steers generation, enabling targeted behaviors with constant-size overhead. TokMem keeps the backbone LLM frozen and stores procedural knowledge entirely in these dedicated units, so new procedures can be added continually without interfering with existing ones. We evaluate TokMem on two settings: atomic recall over 1,000 Super-Natural Instructions tasks and compositional recall on multi-step function-calling. Our results show that TokMem consistently outperforms retrieval-augmented prompting while avoiding repeated context overhead. Moreover, it matches or exceeds parameter-efficient fine-tuning with substantially fewer trainable parameters.
Learning Admissible Heuristics for A*: Theory and Practice
Ehsan Futuhi*, Nathan R. Sturtevant*
LINK TO PAPER
Heuristic functions are central to the performance of search algorithms such as A*, where admissibility (the property of never overestimating the true shortest-path cost) guarantees solution optimality. Recent deep learning approaches often disregard full admissibility and provide limited guarantees on generalization beyond the training data. We address both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce Cross-Entropy Admissibility (CEA), a loss function that enforces admissibility during training. When evaluated on the Rubik's Cube domain, our method yields heuristics with near-perfect admissibility and significantly stronger guidance than compressed pattern database (PDB) heuristics. On the theoretical side, we derive a new upper bound on the expected suboptimality of A*. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik's Cube, we tighten the bound on the number of training samples needed for A* to generalize to unseen states. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network's width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for goal-dependent heuristics.
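The idea of building admissibility pressure into the training objective can be illustrated with a simple asymmetric surrogate. This is not the paper's CEA loss; it is a hypothetical stand-in showing the key asymmetry: overestimates, which would break A*'s optimality guarantee, are penalized far more heavily than underestimates:

```python
def admissibility_loss(h_pred, h_true, over_penalty=10.0):
    # Asymmetric squared-error loss over a batch of states:
    # an overestimate (h_pred > h_true) is inadmissible, so its
    # squared error is scaled by over_penalty; underestimates
    # keep weight 1.0.
    total = 0.0
    for hp, ht in zip(h_pred, h_true):
        err = hp - ht
        weight = over_penalty if err > 0 else 1.0
        total += weight * err * err
    return total / len(h_pred)

h_true = [4.0, 7.0, 2.0]                               # true costs-to-go
under = admissibility_loss([3.0, 6.0, 1.0], h_true)    # all admissible
over = admissibility_loss([5.0, 8.0, 3.0], h_true)     # all inadmissible
```

Both predictions are off by exactly 1.0 per state, yet the inadmissible set incurs ten times the loss, so a model trained under such an objective is pushed toward heuristics that stay at or below the true cost.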
Learn more about the researchers
Learning to Reason Efficiently with Discounted Reinforcement Learning
Alex Ayoub*, Kavosh Asadi, Dale Schuurmans*, Csaba Szepesvári*, Karim Bouyarmane
LINK TO PAPER
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
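The discounting-as-token-cost interpretation can be made concrete with a toy value calculation. This sketch assumes a single terminal correctness reward discounted once per generated token; the function and the specific gamma are illustrative, not the paper's experimental setup:

```python
def discounted_return(num_tokens, correct, gamma=0.999):
    # Terminal reward of 1.0 for a correct answer, discounted by
    # gamma for every reasoning token generated. Discounting by
    # gamma < 1 acts like a small per-token cost: of two equally
    # accurate chains of thought, the shorter has the higher value.
    reward = 1.0 if correct else 0.0
    return (gamma ** num_tokens) * reward

short_chain = discounted_return(200, correct=True)    # ~0.82
long_chain = discounted_return(2000, correct=True)    # ~0.14
```

A correct 200-token chain is worth far more than a correct 2,000-token chain, while an incorrect chain is worth nothing at any length, which is exactly the pressure toward concise-yet-accurate reasoning the abstract describes.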
Workshop Papers
Aligning Visual Structural Compositionality in Humans & Vision Language Models
Helena Balabin*, Lauren Nicole De Long, Rohan Saha*, Rik Vandenberghe, Marie-Francine Moens, Alona Fyshe*
LINK TO PAPER