This workshop highlights recent advances in online reinforcement learning (RL), with a focus on its connections to emerging technologies like large language models (LLMs). As machine learning systems grow more capable, online RL can further enhance their task-specific capabilities.
Participants will explore the evolving RL landscape, discuss its integration with large-scale models, and examine challenges and opportunities at this intersection. Join us to engage with cutting-edge ideas shaping the future of online reinforcement learning.
Poster Session and Lightning Talks
This workshop will include a poster session and lightning talks for early career researchers (including graduate students). To propose a poster or a lightning talk, you must first register for the workshop and then submit a proposal using the form that will become available on this page after you register. You may propose one or both. The registration form should not be used to propose a poster or a lightning talk.
The deadline for proposals is Wednesday, March 4, 2026. If your proposal is accepted, you should plan to attend the event in person.
In-Person Registration
Seats at the venue are limited, so in-person registration may be capped before the workshop start date. If capacity is reached, a waitlist will be opened, which the registration form will reflect. Early registration is strongly encouraged.
All in-person registrants must wait to receive an invitation to attend in person from IMSI before traveling; invitations are generally sent out 4-6 weeks in advance.
All registrants (online and in-person) will receive Zoom links and are welcome to attend online.
Andrew Wagenmaker
University of California, Berkeley
Masatoshi Uehara
Evolutionary Scale
Zhuoran Yang
Yale University
Xuezhou Zhang
Boston University
Banghua Zhu
University of Washington and NVIDIA
Schedule
Monday, March 30, 2026
8:30-8:50 CDT
Breakfast/Check-in
8:50-9:00 CDT
Welcome
9:00-9:45 CDT
Open Discussion
9:45-10:00 CDT
Q&A
10:00-10:05 CDT
Tech break
10:05-10:50 CDT
Towards Practical Online Improvement of Pretrained Policies for Robotic Manipulation
Speaker: Andrew Wagenmaker (University of California, Berkeley)
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior—an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this talk we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. We consider two angles on this problem. First, focusing in particular on diffusion policies, we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Second, we consider the role of the pretrained policy itself in RL improvement, and ask how we might pretrain policies that are amenable to downstream improvement. We show that standard BC pretraining can produce policies which fail to meet minimal conditions necessary for effective finetuning—coverage over the demonstrator’s actions—but that, if we instead fit a policy to the posterior of the demonstrator’s behaviors, we can achieve action coverage while ensuring the performance of the pretrained policy is no worse than that of the BC policy. We show experimentally that such posterior BC-pretrained policies enable much more efficient online improvement than standard BC-pretrained policies.
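The latent-noise steering idea in the abstract can be illustrated with a toy sketch: treat a frozen policy as a black box mapping a latent noise vector to an action, and search over the noise space without ever touching the policy's weights. The policy, task, and (1+1) evolution-strategy search below are illustrative stand-ins chosen for brevity, not the actual DSRL algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen BC-trained diffusion policy: all we
# assume is black-box access mapping a latent noise vector z to an action.
W = rng.normal(size=(4, 4))
def frozen_bc_policy(z):
    return np.tanh(W @ z)

# Toy task: reward is higher the closer the action is to a fixed target.
target = np.array([0.5, -0.2, 0.1, 0.0])
def reward(action):
    return -float(np.sum((action - target) ** 2))

# "Steering" = search over the latent-noise space with a simple
# (1+1) evolution strategy; the BC policy's weights are never updated.
z = rng.normal(size=4)
r_init = reward(frozen_bc_policy(z))
best = r_init
for _ in range(500):
    cand = z + 0.1 * rng.normal(size=4)
    r = reward(frozen_bc_policy(cand))
    if r > best:
        z, best = cand, r
```

Because only evaluations of the policy are needed, this kind of search works even when the underlying generator is non-differentiable or proprietary, which is the black-box property the abstract emphasizes.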
10:50-11:05 CDT
Q&A
11:05-11:35 CDT
Coffee break
11:35-12:20 CDT
Self-Supervised Reinforcement Learning and Patterns in Time
Speaker: Benjamin Eysenbach (Princeton University)
In the same way that computer vision models find structures and patterns in images, how might reinforcement learning models find structures and patterns in solutions to control problems? This talk will focus on learning temporal representations, which map high-dimensional observations to compact representations where distances reflect shortest paths. Once learned, these temporal representations encode the value function for certain tasks – learning temporal representations is itself an RL algorithm. In both robotics and reasoning problems, such representations capture temporal patterns. Temporal representations also facilitate a form of (temporal) generalization: navigating between pairs of states that are more distant than those seen during training. I will show evidence that agents trained via temporal representations exhibit surprising exploration strategies, in both single-agent and multi-agent settings.
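To make the "distances reflect shortest paths" idea concrete, here is a small self-contained sketch (not the speaker's method): it embeds the states of a toy corridor MDP so that Euclidean distances match shortest-path distances, using classical MDS in place of a learned encoder, and then uses negative embedded distance-to-goal as a value function for a greedy policy.

```python
import numpy as np
from collections import deque

# Toy MDP: a corridor of 8 states; actions move one step left or right.
n = 8
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

# Ground-truth shortest-path distances via BFS.
def bfs(src):
    d = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return [d[i] for i in range(n)]

D = np.array([bfs(i) for i in range(n)], dtype=float)

# Classical MDS: embed states so Euclidean distances match shortest paths.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
phi = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0))  # top-2 components

# Negative embedded distance to a goal acts like a value function...
goal = n - 1
def value(s):
    return -np.linalg.norm(phi[s] - phi[goal])

# ...so a greedy policy on the representation walks straight to the goal.
s, steps = 0, 0
while s != goal and steps < 2 * n:
    s = max(adj[s], key=value)
    steps += 1
```

For this corridor the shortest-path metric is exactly line-embeddable, so the greedy policy reaches the goal in the minimal n-1 steps; the abstract's point is that such representations can be learned self-supervised in high-dimensional settings where no distance matrix is available.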
12:20-12:35 CDT
Q&A
12:35-13:35 CDT
Lunch Break
13:35-14:20 CDT
Multi-turn and Multi-agent Reinforcement Learning Fine-Tuning of LLMs
Speaker: Natasha Jaques (University of Washington)
Although Reinforcement Learning (RL) training has contributed to massive gains in Large Language Model (LLM) abilities, it is still largely limited to optimizing a single response to a user query, rather than learning how to plan the course of a conversation or interaction, or how to interact with other agents that may change their behavior during training and deployment. This talk will discuss recent work that enables both multi-agent and multi-turn RL post-training. On the multi-turn side, we address critical challenges in long-horizon interaction: for example, we introduce a curiosity-based intrinsic reward that enables LLMs to learn how to learn about the user, significantly improving both personalization and online generalization to new users. I will also discuss Generative Adversarial Post Training (GAPT), an adversarial RL framework which draws from GANs and is designed to mitigate reward hacking and output collapse in creative, adaptive tasks where preserving diversity and realism is paramount. Finally, I will discuss our group's work on multi-agent interactive training, which can provide both safety guarantees and the emergence of complex skills. Together, these methods demonstrate novel approaches to instilling complex, user-aware planning capabilities and safeguarding output quality over extended multi-agent interactions.
14:20-14:35 CDT
Q&A
14:35-15:35 CDT
Lightning Talks
15:35-16:30 CDT
Poster Session and Social Hour
Tuesday, March 31, 2026
8:30-9:00 CDT
Breakfast/Check-in
9:00-9:45 CDT
Building Deep Research Agents via Reinforcement Learning
Speaker: Wen Sun (Cornell University)
9:45-10:00 CDT
Q&A
10:00-10:05 CDT
Tech break
10:05-10:50 CDT
On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking
Speaker: Zhuoran Yang (Yale University)
We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when the network is overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logit for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the “winner” determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.
10:50-11:05 CDT
Q&A
11:05-11:35 CDT
Coffee break
11:35-12:20 CDT
Toward a Statistical Perspective on LLM Post-training: Preference Sampling and Gradient Reweighting
Speaker: Yaqi Duan (New York University)
Post-training—the process of fine-tuning large language models (LLMs) with human preference or verifiable feedback—has rapidly evolved into a central problem in LLM development, yet its principles remain largely heuristic. This talk explores how statistical thinking can provide structure and rigor to this stage of learning. I will present two studies as steps toward formulating a statistical perspective on post-training. The first, PILAF, views human preference collection as an optimal experimental design problem, deriving sampling strategies that maximize reward-model information under budget constraints. The second, LENS, reinterprets reinforcement learning with verifiable rewards (RLVR) as a likelihood-based estimation problem, showing how confidence-weighted corrections on negative responses recover gradients otherwise lost in standard policy optimization. Together, these results illustrate how classical statistical reasoning can strengthen the foundations of post-training data collection and policy-update procedures, advancing efficiency, stability, and theoretical clarity.
12:20-12:35 CDT
Q&A
12:35-13:35 CDT
Lunch Break
13:35-14:20 CDT
Regression as Policy Optimization: Advantages In, Policies Out
Speaker: Kianté Brantley (Harvard University)
Reinforcement learning (RL) is an essential method for training large language models (LLMs), enabling better alignment with human preferences and enhanced reasoning capabilities. Nevertheless, RL post-training remains computationally demanding due to repeated rollouts, high-variance credit assignment, and the complexities of distributed systems that can introduce policy lag. A promising direction is to view KL-regularized policy optimization as a KL-prox (mirror-descent–style) step and solve it with a simple least-squares regression loss. This regression-based perspective addresses key RL challenges and enables more efficient post-training procedures. In this talk, I will introduce a unified perspective encompassing three complementary approaches. A⋆-PO minimizes reliance on online sampling by utilizing offline value computation and optimal-advantage regression. OAPL supports scalable, fully off-policy regression, even when using stale rollouts and lagged policies in distributed settings. RDA2C offers a regularized dual-averaging approach that employs cumulative gradient information to stabilize updates and reduce variance, rather than relying on local, per-round mirror-descent steps. Although RDA2C has been primarily assessed on standard RL benchmarks rather than comprehensive LLM post-training, its focus on variance reduction and data reuse aligns closely with the stability challenges encountered in large-scale LLM alignment. Collectively, these methods provide an efficient toolkit for RL-based LLM post-training and present opportunities for further research and scalable deployment.
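The KL-prox step mentioned in the abstract has a closed form, and the regression view can be illustrated in a few lines: for a tabular policy, fitting logits by least squares to the targets log π_old + A/β recovers exactly the mirror-descent update, since the softmax absorbs the normalizing constant. This is a toy illustration of the general idea only, not the A⋆-PO, OAPL, or RDA2C algorithms themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, beta = 3, 4, 0.5                    # states, actions, KL strength

logp_old = np.log(rng.dirichlet(np.ones(K), size=S))  # current policy
adv = rng.normal(size=(S, K))                          # advantage estimates

# Closed-form KL-prox (mirror-descent) step:
#   pi_new(a|s) ∝ pi_old(a|s) * exp(A(s, a) / beta)
unnorm = np.exp(logp_old + adv / beta)
pi_closed = unnorm / unnorm.sum(axis=1, keepdims=True)

# Regression view: fit logits to the targets  log pi_old + A / beta
# with plain least squares; the softmax absorbs the normalizing constant.
targets = logp_old + adv / beta
theta = np.zeros((S, K))
for _ in range(200):                      # GD on 0.5 * ||theta - targets||^2
    theta -= 0.5 * (theta - targets)

pi_reg = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
```

In the tabular case the regression is trivial, but the same recipe carries over to function approximation: the squared-error loss over sampled (state, action, advantage) tuples replaces rollout-heavy policy-gradient updates, which is the efficiency argument the abstract makes.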
14:20-14:35 CDT
Q&A
14:35-15:00 CDT
Coffee break
15:00-15:45 CDT
TBA
Speaker: Ayush Sekhari (MIT)
15:45-16:00 CDT
Q&A
Wednesday, April 1, 2026
8:30-9:00 CDT
Breakfast/Check-in
9:00-9:45 CDT
AI that Learns How to Act: Toward Data-Driven Autonomous Scientific Discovery
Speaker: Aldo Pacchiano (Boston University)
Modern machine learning systems excel at pattern recognition but remain limited in their ability to autonomously discover strategies for planning, exploration, and adaptation: core components of sequential decision making. In this talk, I present recent advances that take a learning-to-learn perspective on this challenge, showing how decision-making algorithms themselves can emerge from data.
9:45-10:00 CDT
Q&A
10:00-10:05 CDT
Tech break
10:05-10:50 CDT
Reinforcement Learning beyond Reward Maximization
Speaker: Yuda Song (Carnegie Mellon University)
Reinforcement learning has become a core ingredient of LLM post-training, but much of today’s pipeline is built around an unusually narrow learning signal: the model is optimized to maximize scalar reward, often derived from little more than whether an output succeeds. In this talk, I will present two recent directions for going beyond this paradigm.
The first part introduces Maximum Likelihood Reinforcement Learning (MaxRL), which starts from the observation that in correctness-based domains the model implicitly defines a likelihood over successful rollouts, while standard RL optimizes only a lower-order approximation to that likelihood. Empirically, MaxRL consistently outperforms standard RL baselines across the models and tasks studied, delivers up to 20× gains in test-time scaling efficiency, and shows stronger scaling with additional data and compute.
The second part introduces Reinforcement Learning from Text Feedback (RLTF), a learning paradigm in which natural-language critiques are available during training but not at inference, requiring the model to internalize richer feedback rather than merely condition on it. Together, these works suggest that the next frontier of LLM reinforcement learning lies not only in scaling optimization, but in rethinking both the objectives we optimize and the forms of supervision we learn from.
10:50-11:05 CDT
Q&A
11:05-11:35 CDT
Coffee break
11:35-12:20 CDT
The Statistical Cost of Hyperparameter Tuning in Reinforcement Learning
Speaker: Xuezhou Zhang (Boston University)
The performance of reinforcement learning (RL) algorithms is often benchmarked without accounting for the cost of hyperparameter tuning, despite its significant practical impact. In this paper, we show that such practices distort the perceived efficiency of RL methods and impede meaningful algorithmic progress. We formalize this concern by proving a lower bound showing that tuning m hyperparameters in RL can induce an exponential exp(m) blow-up in the sample complexity or regret, in stark contrast to the linear O(m) overhead observed in supervised learning. This highlights a fundamental inefficiency unique to RL. In light of this, we propose evaluation protocols that account for the number and cost of tuned hyperparameters, enabling fairer comparisons across algorithms. Surprisingly, we find that once tuning cost is accounted for, elementary algorithms can outperform their successors with more sophisticated design. These findings call for a shift in how RL algorithms are benchmarked and compared, especially in settings where efficiency and scalability are critical.
12:20-12:35 CDT
Q&A
12:35-13:35 CDT
Lunch Break
13:35-14:20 CDT
Failure Patterns of LLM Agentic Reinforcement Learning
Speaker: Manling Li (Northwestern University)
Reinforcement learning has driven strong gains in LLM reasoning on static tasks. However, when applied to agents, these models consistently fail in unfamiliar environments, where effective exploration is required. In this talk, we identify a failure pattern: as agents move out of distribution, their reasoning trajectories become progressively shorter, and exploration collapses. We show that standard entropy-based metrics are insufficient, and introduce mutual information as a more reliable signal of input-dependent behavior. Through an SNR perspective, I explain why low reward variance causes input-agnostic regularization to dominate, driving this collapse. We then ask what enables robust exploration. Through VAGEN, we demonstrate that learning a structured world model, decomposed into state estimation and transition dynamics, provides the necessary inductive bias, enabling a 3B VLM to outperform GPT-5 on agent benchmarks. Finally, we show that when this structure is learned matters. With Self-Play, we find that acquiring these priors via self-play before RL is significantly more effective than learning them during RL. Overall, this suggests a simple paradigm: learn how the world works first, then learn what to do, with mutual information as a key diagnostic to prevent reasoning collapse.
14:20-14:35 CDT
Q&A
14:35-15:00 CDT
Coffee break
15:00-15:45 CDT
TBA
Speaker: Zhaoran Wang (Northwestern University)
15:45-16:00 CDT
Q&A
Thursday, April 2, 2026
8:30-9:00 CDT
Breakfast/Check-in
9:00-9:45 CDT
Reward-Guided Generation in Diffusion Models
Speaker: Masatoshi Uehara (Evolutionary Scale)
Diffusion models are celebrated for their strong generative capabilities. However, practical applications often demand sample generation that not only produces realistic outputs but also optimizes specific objectives (e.g., human preference scores in computer vision, binding affinity in proteins). To address this, diffusion models can be adapted to explicitly maximize desired reward metrics. While many methods have been developed for domains like computer vision, applying reward-guided generation to biological design poses unique challenges: (1) reward functions are often non-differentiable, and (2) biological data is frequently discrete. In this talk, I will present our recent advances in test-time controlled generation methods that address these challenges. I will also discuss how these techniques enable real-world applications across molecular design tasks, including protein, DNA, RNA, and small molecule generation.
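A minimal baseline for derivative-free, test-time reward guidance is best-of-N sampling, which needs only black-box access to both the generator and the (possibly non-differentiable) reward; the uniform toy generator and GC-content reward below are illustrative stand-ins for the pretrained samplers and objectives discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete "generator": samples length-8 DNA-like strings uniformly.
# Stands in for a pretrained diffusion sampler (illustrative only).
alphabet = np.array(list("ACGT"))
def sample_sequence():
    return "".join(rng.choice(alphabet, size=8))

# Non-differentiable reward: GC content (fraction of G/C bases), a simple
# proxy for the black-box objectives mentioned in the abstract.
def reward(seq):
    return sum(c in "GC" for c in seq) / len(seq)

# Best-of-N: draw N samples from the frozen generator and keep the
# highest-reward one; no gradients through reward or generator needed.
samples = [sample_sequence() for _ in range(64)]
best = max(samples, key=reward)
```

More sophisticated test-time schemes (e.g., resampling partial generations by reward during the denoising trajectory) refine this same principle; the appeal is that the pretrained generator stays fixed and the reward can be an arbitrary black box.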
9:45-10:00 CDT
Q&A
10:00-10:05 CDT
Tech break
10:05-10:50 CDT
TBA
Speaker: Gokul Swamy (Carnegie Mellon University)
10:50-11:05 CDT
Q&A
11:05-11:35 CDT
Coffee break
11:35-12:20 CDT
Proactive Agents: Task Performance Isn’t the Only Goal
Speaker: Laixi Shi (Johns Hopkins University)
Decision-making artificial intelligence (AI) has revolutionized human life, ranging from healthcare and daily life to scientific discovery. However, current AI systems often lack reliability and are highly vulnerable to small changes in complex, interactive, and dynamic environments. My research focuses on achieving both reliability and learning efficiency simultaneously when building AI solutions. These two goals seem conflicting, as enhancing robustness against variability often leads to more complex problems that require more data and computational resources, at the cost of learning efficiency. But does it have to?
In this talk, I will overview my work on building reliable decision-making AI without sacrificing learning efficiency, offering insights into effective optimization problem design for reliable AI. To begin, I will focus on reinforcement learning (RL), a key framework for sequential decision-making, and demonstrate how distributional robustness can be achieved provably without paying a statistical premium (additional training data cost) compared to non-robust counterparts. Next, shifting to decision-making in strategic multi-agent systems, I will demonstrate that incorporating realistic risk preferences, a key feature of human decision-making, enables computational tractability, a benefit not present in traditional models. Finally, I will present a vision for building reliable, learning-efficient AI solutions for human-centered applications, through agentic and multi-agentic AI systems.
12:20-12:35 CDT
Q&A
12:35-13:35 CDT
Lunch Break
13:35-14:20 CDT
Exploration from a Primal-Dual Optimization Lens in Reinforcement Learning
Speaker: Bo Dai (Georgia Institute of Technology)
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we still lack efficient and practical schemes that are backed by theoretical performance guarantees. We develop a new exploration mechanism via optimistic regularization, providing an interpretation of the principle of optimism through the lens of optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation -- it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions. We also test the proposed algorithms on both standard RL tasks and RLHF for LLMs, demonstrating significant improvements.
14:20-14:35 CDT
Q&A
14:35-15:00 CDT
Coffee break
15:00-16:00 CDT
Panel, Open Discussion, Working Groups, Hands-on Sessions, etc.
Friday, April 3, 2026
8:30-9:00 CDT
Breakfast/Check-in
9:00-9:45 CDT
Miles: Open Source RL for Large MoE Models
Speaker: Banghua Zhu (University of Washington and NVIDIA)
IMSI is committed to making all of our programs and events inclusive and accessible.
Contact [email protected] to request disability-related accommodations.
In order to register for this workshop, you must have an IMSI account and be logged in.