The fourteenth annual International Conference on Learning Representations (ICLR) is underway this week in Rio de Janeiro, Brazil. Amii is proud to showcase the diverse and high-impact research our Fellows, Canada CIFAR AI Chairs, and students are presenting this year.
ICLR is a premier global conference for the advancement of representation learning, exploring how models process data to solve complex problems in computer vision, robotics, and natural language processing.
This year, Amii’s contributions push the boundaries of what’s possible in automated intelligence. Our researchers are unveiling novel frameworks to make reinforcement learning (RL) more sample-efficient, developing memory architectures inspired by human cognition for Large Language Models (LLMs) that handle millions of tokens, and establishing new standards for fairness and privacy in data synthesis.
Want to stay up-to-date on the latest research from the Amii community? Sign up for our monthly newsletter!
* denotes Amii affiliation
Accepted Papers
Distributions as Actions: A Unified Framework for Diverse Action Spaces
Jiamin He*, A. Rupam Mahmood*, Martha White*
LINK TO PAPER
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
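The boundary shift at the heart of this framework can be illustrated with a toy Bernoulli bandit. In this sketch (all function names are hypothetical, not the paper's code), the agent's action is the distribution parameter itself, and sampling the original discrete action is folded into the environment, so the agent-facing action space becomes continuous:

```python
import random

def env_step_original(discrete_action):
    # Toy one-step environment (hypothetical): reward 1.0 for action 1.
    return 1.0 if discrete_action == 1 else 0.0

def env_step_distribution(p):
    # Wrapped environment: the agent's action is now the Bernoulli
    # parameter p in [0, 1]. Sampling the discrete action happens
    # inside the environment, so the agent-facing action space is
    # continuous even though the original space was discrete.
    discrete_action = 1 if random.random() < p else 0
    return env_step_original(discrete_action)

random.seed(0)
# A deterministic policy over distribution parameters: always emit p = 0.9.
returns = [env_step_distribution(0.9) for _ in range(10_000)]
avg = sum(returns) / len(returns)  # close to 0.9 on this toy task
```

Because the wrapped action `p` is continuous, a deterministic policy-gradient method can now be applied even though the underlying action was discrete; that is the reparameterization DA-PG builds on.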
Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models
Pranaya Jajoo*, Harshit Sikchi, Siddhant Agarwal, Amy Zhang, Scott Niekum, Martha White*
LINK TO PAPER
Behavioral Foundation Models (BFMs) have recently succeeded in producing agents that can adapt to any unknown reward or task. In practice, these methods can only produce near-optimal policies for reward functions that lie in the span of some pre-existing state features, so their efficiency relies heavily on the choice of those features. As a result, BFMs have used a wide variety of complex objectives, often sensitive to environment coverage, to train task-spanning features with different inductive properties. With this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that this objective alone is prone to increasing state-feature similarity, reducing the span of reward functions for which we can represent optimal policies. We propose an approach, RLDP, that adds a simple regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we demonstrate that prior approaches diverge in low-coverage scenarios where RLDP still succeeds.
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland*, Xiaoxiao Li, Christos Thrampoulidis
LINK TO PAPER
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
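The reweighting idea can be sketched in a few lines. This is an illustrative stand-in, not the paper's published formulation: the weights `alpha`/`beta` and the sign-based rule are assumptions, but they show how amplifying positive-THR tokens biases training toward exploitation while the reverse biases it toward exploration:

```python
def thr_reweight(token_losses, thr_values, alpha=1.5, beta=0.5, exploit=True):
    # Reweight per-token policy-gradient terms by the sign of their
    # Token Hidden Reward (THR). For an exploitation bias, amplify
    # positive-THR tokens (weight alpha) and damp negative-THR tokens
    # (weight beta); swap the weights for an exploration bias.
    weighted = []
    for loss, thr in zip(token_losses, thr_values):
        if exploit:
            w = alpha if thr > 0 else beta
        else:
            w = beta if thr > 0 else alpha
        weighted.append(w * loss)
    return weighted

losses = [1.0, 2.0, 1.0]
thrs = [0.8, -0.3, 0.1]                                     # hypothetical THR values
exploit_losses = thr_reweight(losses, thrs, exploit=True)   # [1.5, 1.0, 1.5]
explore_losses = thr_reweight(losses, thrs, exploit=False)  # [0.5, 3.0, 0.5]
```

Under the exploitation setting, high-THR tokens dominate the update; under the exploration setting, probability mass is preserved for alternative outputs, mirroring the Pass@K gains the abstract describes.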
Learn more about the researchers
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli*, Alireza Salemi, Carrie Ye*, Mohamed Abdalla*, Hamed Zamani, J Ross Mitchell*
LINK TO PAPER
Evaluating the abilities of large language models (LLMs) on tasks that require long-term memory, and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5% to 12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
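The three-part memory design can be sketched as a small class. This is a toy rendering in the spirit of LIGHT, not the paper's implementation: the class name, methods, and the naive keyword retrieval are all illustrative assumptions:

```python
from collections import deque

class LightMemory:
    # Toy three-part memory: an episodic store of all past turns,
    # a bounded working memory of recent turns, and a scratchpad
    # of accumulated salient facts.
    def __init__(self, working_size=4):
        self.episodic = []                         # long-term: every turn
        self.working = deque(maxlen=working_size)  # short-term: recent turns
        self.scratchpad = set()                    # salient facts

    def observe(self, turn, facts=()):
        self.episodic.append(turn)
        self.working.append(turn)
        self.scratchpad.update(facts)

    def context_for_llm(self, query):
        # Naive episodic retrieval: keyword overlap with the query
        # (a real system would use embeddings).
        qwords = set(query.lower().split())
        retrieved = [t for t in self.episodic
                     if qwords & set(t.lower().split())][:3]
        return {"retrieved": retrieved,
                "recent": list(self.working),
                "facts": sorted(self.scratchpad)}

mem = LightMemory(working_size=2)
mem.observe("Alice moved to Berlin in 2019", facts={"Alice lives in Berlin"})
mem.observe("We talked about hiking routes")
mem.observe("Alice adopted a dog named Rex", facts={"Alice has a dog"})
ctx = mem.context_for_llm("where does Alice live?")
```

Only the bounded working memory and the compact scratchpad travel with every prompt; the episodic store is consulted on demand, which is why the context size stays flat as the dialogue grows.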
Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data
Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler*, Arno Solin, Juho Kannala, Joni Pajarinen
LINK TO PAPER
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two essential techniques: (i) experience rehearsal and (ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies
Armin Kekić, Jan Schneider, Dieter Büchler, Bernhard Schölkopf, Michel Besserve
LINK TO PAPER
Exponential-Wrapped Mechanisms: Differential Privacy on Hadamard Manifolds Made Practical
Yangdi Jiang, Xiaotian Chang, Lei Ding, Linglong Kong*, Bei Jiang*
LINK TO PAPER
We extend the Differential Privacy (DP) framework to Hadamard manifolds, the class of complete and simply connected Riemannian manifolds with non-positive sectional curvature. Inspired by the Cartan–Hadamard theorem, we introduce Exponential-Wrapped Laplace and Gaussian mechanisms to achieve ε-DP, (ε, δ)-DP, Gaussian DP (GDP), and Rényi DP (RDP) on these manifolds. Our approach employs efficient, straightforward algorithms that circumvent computationally intensive Markov chain Monte Carlo (MCMC) methods. This work is the first to extend (ε, δ)-DP, GDP, and RDP to Hadamard manifolds. We further demonstrate the effectiveness of our methodology through simulations on the space of symmetric positive definite matrices, a frequently used Hadamard manifold in statistics. Our findings reveal that our Exponential-Wrapped mechanisms surpass traditional MCMC-based approaches, which require careful tuning and extensive diagnostics, in both performance and ease of use. Additionally, our methods achieve comparable utility to the Riemannian Laplace mechanism, with enhanced utility for smaller privacy budgets (ε), and operate orders of magnitude faster.
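The "wrap noise through the exponential map" idea can be shown on the simplest Hadamard manifold: the positive reals with geodesic distance d(x, y) = |log(y/x)|. This sketch is an illustration of the general recipe under that toy geometry, not the paper's mechanism for SPD matrices, and the sensitivity/ε handling is a simplifying assumption:

```python
import math
import random

def laplace_sample(scale, rng):
    # Centred Laplace variate via inverse-CDF sampling.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def exp_wrapped_laplace(x, sensitivity, epsilon, rng):
    # Exponential-wrapped Laplace mechanism on the positive reals:
    # draw Laplace noise in the (flat) tangent space at x, then push
    # it onto the manifold with the exponential map Exp_x(t) = x * e^t.
    # The output is always a valid manifold point, and no MCMC is needed.
    t = laplace_sample(sensitivity / epsilon, rng)
    return x * math.exp(t)

rng = random.Random(42)
private = [exp_wrapped_laplace(2.0, sensitivity=0.1, epsilon=1.0, rng=rng)
           for _ in range(5000)]
# The geodesic centre of the outputs stays near the true value 2.0.
geo_mean = math.exp(sum(math.log(p) for p in private) / len(private))
```

Because sampling happens in the flat tangent space and is then mapped back, every draw lands on the manifold by construction, which is what lets these mechanisms avoid the tuning and diagnostics that MCMC-based approaches require.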
A Bayesian Nonparametric Framework for Private, Fair, and Balanced Tabular Data Synthesis
Forough Fazeli-Asl, Michael Minyi Zhang, Linglong Kong*, Bei Jiang*
LINK TO PAPER
A fundamental challenge in data synthesis is protecting the fairness and privacy of individuals, particularly in data-scarce environments where underrepresented groups risk further marginalization when the biases inherent in the data are reproduced by the modeling process. We introduce a privacy- and fairness-aware framework for a class of generative models, which fuses a conditional generator with Bayesian nonparametric learning (BNPL). This conditional structure imposes fairness constraints in our generative model by minimizing the mutual information between generated outcomes and protected attributes. Unlike existing methods that primarily focus on sensitive binary-valued attributes, our framework extends seamlessly to non-binary attributes. Moreover, our method provides a systematic solution to class imbalance, ensuring adequate representation of underrepresented protected groups. Our proposed approach offers a scalable, privacy-preserving framework for ethical and equitable data generation, which we demonstrate through theoretical guarantees and extensive experiments on sensitive empirical examples.
Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning
Ke Sun, Hongming Zhang, Jun Jin*, Chao Gao, Xi Chen, Wulong Liu, Linglong Kong*
LINK TO PAPER
Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual reinforcement learning (RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi-task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm-up mechanism that selectively harnesses past knowledge. We conduct experiments in various pixel-based and continuous control benchmarks, revealing the superior performance of continual learning for our proposed dual-learner approach relative to baseline methods.
Learn more about the researchers
Beyond Visual Reconstruction Quality: Object Perception-aware 3D Gaussian Splatting for Autonomous Driving
Renzhi Wang, Yuxiang Fu, Wuqi Wang, Haigen Min, Wei Feng, Lei Ma, Qing Guo
LINK TO PAPER
Reconstruction techniques, such as 3D Gaussian Splatting (3DGS), are increasingly used to generate scenarios for autonomous driving system (ADS) research. Existing 3DGS-based approaches for autonomous-driving scenario generation have, through various optimizations, achieved high visual similarity in reconstructed scenes. However, this route is built on a strong assumption: that higher scene similarity directly translates into better preservation of ADS behaviour. Unfortunately, this assumption has not been effectively validated, and ADS behaviour is more closely related to objects within the field of view rather than the global image. Thus, we focus on the perception module—the entry point of ADS. Preliminary experiments reveal that although current methods can produce reconstructions with high overall similarity, they often fail to ensure that the perception module outputs remain consistent with those obtained from the original images. Such a limitation can significantly harm the applicability of reconstruction in the ADS domain. To address this gap, we propose two complementary solutions: a perception-aligned loss, which directly leverages output differences between reconstructed and ground-truth images during training; and an object zone quality loss, which specifically reinforces training on object locations identified by the perception model on ground-truth images. Experiments demonstrate that both of our methods improve the ability of reconstructed scenes to maintain consistency between the perception module outputs and the ground-truth inputs. We release code at: https://github.com/Shanicky-RenzhiWang/Perception-aware-3DGS
Nano3D: A Training-Free Approach for Efficient 3D Editing Without Masks
LINK TO PAPER
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Yongchang Hao*, Lili Mou*
LINK TO PAPER
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive, as slight variations of the verifier's distribution, such as those introduced by top-k or temperature sampling, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution and increased acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
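The baseline SpS acceptance rule, and the kind of relaxation at stake, can be sketched numerically. The standard rule keeps a draft token x with probability min(1, p(x)/q(x)); the `slack` factor below is an illustrative stand-in for a loosened test that raises acceptance at the cost of a controlled divergence (Cactus's actual constrained rule differs):

```python
import random

def accept_token(p_target, q_draft, token, rng, slack=1.0):
    # Standard speculative-sampling acceptance test: keep the draft
    # token with probability min(1, p(x)/q(x)). slack > 1 loosens
    # the test, accepting more draft tokens.
    ratio = p_target[token] / q_draft[token]
    return rng.random() < min(1.0, slack * ratio)

p = {"a": 0.6, "b": 0.4}   # verifier (target) distribution
q = {"a": 0.3, "b": 0.7}   # draft distribution
trials = 10_000

rng = random.Random(0)
strict = sum(accept_token(p, q, "b", rng) for _ in range(trials)) / trials

rng = random.Random(0)
relaxed = sum(accept_token(p, q, "b", rng, slack=1.5)
              for _ in range(trials)) / trials
# strict ~ 0.57 = min(1, 0.4/0.7); relaxed ~ 0.86 with slack = 1.5
```

The empirical acceptance rates track the analytic ones, showing how even a modest relaxation meaningfully raises throughput; the contribution of Cactus is to make that relaxation a solution to a constrained optimization problem rather than a heuristic.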
TokMem: One-Token Procedural Memory for Large Language Models
Zijun Wu*, Yongchang Hao*, Lili Mou*
LINK TO PAPER
Large language models are typically controlled via prompts, which must be repeatedly re-processed for every new query and are difficult to reuse modularly. We introduce TokMem, a procedural memory framework that compiles each reusable task procedure into a single trainable memory token. Each token serves as both a procedure index and a control signal that steers generation, enabling targeted behaviors with constant-size overhead. TokMem keeps the backbone LLM frozen and stores procedural knowledge entirely in these dedicated units, so new procedures can be added continually without interfering with existing ones. We evaluate TokMem on two settings: atomic recall over 1,000 Super-Natural Instructions tasks and compositional recall on multi-step function-calling. Our results show that TokMem consistently outperforms retrieval-augmented prompting while avoiding repeated context overhead. Moreover, it matches or exceeds parameter-efficient fine-tuning with substantially fewer trainable parameters.
Learning Admissible Heuristics for A*: Theory and Practice
Ehsan Futuhi*, Nathan R. Sturtevant*
LINK TO PAPER
Heuristic functions are central to the performance of search algorithms such as A*, where admissibility (the property of never overestimating the true shortest-path cost) guarantees solution optimality. Recent deep learning approaches often disregard full admissibility and provide limited guarantees on generalization beyond the training data. We address both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce Cross-Entropy Admissibility (CEA), a loss function that enforces admissibility during training. When evaluated on the Rubik's Cube domain, our method yields heuristics with near-perfect admissibility and significantly stronger guidance than compressed pattern database (PDB) heuristics. On the theoretical side, we derive a new upper bound on the expected suboptimality of A*. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik's Cube, we tighten the bound on the number of training samples needed for A* to generalize to unseen states. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network's width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for goal-dependent heuristics.
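The idea of building admissibility pressure into the training objective can be illustrated with a simple asymmetric surrogate. This is not the paper's CEA loss; it is a hypothetical stand-in showing the key asymmetry: overestimates, which would break A*'s optimality guarantee, are penalized far more heavily than underestimates:

```python
def admissibility_loss(h_pred, h_true, over_penalty=10.0):
    # Asymmetric squared-error loss over a batch of states:
    # an overestimate (h_pred > h_true) is inadmissible, so its
    # squared error is scaled by over_penalty; underestimates
    # keep weight 1.0.
    total = 0.0
    for hp, ht in zip(h_pred, h_true):
        err = hp - ht
        weight = over_penalty if err > 0 else 1.0
        total += weight * err * err
    return total / len(h_pred)

h_true = [4.0, 7.0, 2.0]                               # true costs-to-go
under = admissibility_loss([3.0, 6.0, 1.0], h_true)    # all admissible
over = admissibility_loss([5.0, 8.0, 3.0], h_true)     # all inadmissible
```

Both predictions are off by exactly 1.0 per state, yet the inadmissible set incurs ten times the loss, so a model trained under such an objective is pushed toward heuristics that stay at or below the true cost.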
Learn more about the researchers
Learning to Reason Efficiently with Discounted Reinforcement Learning
Alex Ayoub*, Kavosh Asadi, Dale Schuurmans*, Csaba Szepesvári*, Karim Bouyarmane
LINK TO PAPER
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
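The discounting-as-token-cost interpretation can be made concrete with a toy value calculation. This sketch assumes a single terminal correctness reward discounted once per generated token; the function and the specific gamma are illustrative, not the paper's experimental setup:

```python
def discounted_return(num_tokens, correct, gamma=0.999):
    # Terminal reward of 1.0 for a correct answer, discounted by
    # gamma for every reasoning token generated. Discounting by
    # gamma < 1 acts like a small per-token cost: of two equally
    # accurate chains of thought, the shorter has the higher value.
    reward = 1.0 if correct else 0.0
    return (gamma ** num_tokens) * reward

short_chain = discounted_return(200, correct=True)    # ~0.82
long_chain = discounted_return(2000, correct=True)    # ~0.14
```

A correct 200-token chain is worth far more than a correct 2,000-token chain, while an incorrect chain is worth nothing at any length, which is exactly the pressure toward concise-yet-accurate reasoning the abstract describes.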
Workshop Papers
Aligning Visual Structural Compositionality in Humans & Vision Language Models
Helena Balabin*, Lauren Nicole De Long, Rohan Saha*, Rik Vandenberghe, Marie-Francine Moens, Alona Fyshe*
LINK TO PAPER