Reinforcement Learning (RL) has seen remarkable progress in recent years, yet many of its most impressive achievements rely on extensive online interaction, curated environments, or simulated data—conditions rarely available in real-world settings. In contrast, real-world decision-making often depends on learning from limited, imperfect, or passively collected data, alongside guidance from human preferences, demonstrations, or corrections.
This workshop brings together researchers and practitioners exploring the frontiers of Offline Reinforcement Learning (Offline RL) and Reinforcement Learning from Human Feedback (RLHF)—two rapidly growing areas that aim to make RL more robust, safe, and deployable in practice.
Poster Session
This workshop will include a poster session for early career researchers (including graduate students). In order to propose a poster, you must first register for the workshop, and then submit a proposal using the form that will become available on this page after you register. The registration form should not be used to propose a poster.
The deadline for proposing is Wednesday, March 18, 2026. If your proposal is accepted, you should plan to attend the event in-person.
In-Person Registration
Seats are limited at the venue, so in-person registration may be capped before the workshop start date. If capacity is reached, registration will move to a waitlist, which the registration form will reflect. Early registration is strongly encouraged.
All in-person registrants must wait to receive an invitation to attend in person from IMSI before traveling; invitations generally begin to go out 4-6 weeks in advance.
All registrants (online and in-person) will receive Zoom links and are welcome to attend online.
Organizers
Yuting Wei
University of Pennsylvania, The Wharton School
Renyuan Xu
Stanford University
Lingzhou Xue
Penn State University
Lei Ying
University of Michigan
Schedule
Monday, April 20, 2026
8:30-8:55 CDT
Check-in/Breakfast
8:55-9:00 CDT
Welcome Remarks
9:00-9:40 CDT
TBA
Speaker: Nathan Srebro (Toyota Technological Institute at Chicago)
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Deep Transfer Offline Q-Learning under Nonstationary Environments
Speaker: Jianqing Fan (Princeton University)
In dynamic decision-making scenarios across business and healthcare, leveraging sample trajectories from diverse populations can significantly enhance reinforcement learning (RL) performance for specific target populations. While existing transfer learning methods primarily focus on linear regression settings, they lack direct applicability to reinforcement learning algorithms. This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by nonstationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning. We demonstrate that naive sample pooling strategies, effective in regression settings, fail in Markov decision processes. To address this challenge, we introduce a novel “re-weighted targeting procedure” to construct “transferable RL samples” and propose “transfer deep Q-learning”, enabling neural network approximation with theoretical guarantees. We assume that the reward functions are transferable, and we handle both the case in which the transition densities are transferable and the case in which they are not. Our analytical techniques for transfer learning in neural network approximation and transition density transfers have broader implications, extending to supervised transfer learning with neural networks and domain shift scenarios. Empirical experiments on both synthetic and real datasets corroborate the advantages of our method. (Joint work with Jinhang Chai and Elynn Chen)
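As background for the backward inductive learning the abstract mentions, here is a minimal tabular sketch of backward-inductive Q-learning for a nonstationary finite-horizon MDP. This is the generic baseline only, not the paper's method: the re-weighted targeting procedure and neural-network approximation are what the paper adds on top, and the data below are illustrative.

```python
from collections import defaultdict

def backward_q(transitions, H, actions):
    """Backward induction: estimate Q_h by regressing r + max_a' Q_{h+1}(s', a')
    on (s, a), here via per-(s, a) sample means (the tabular special case).
    transitions[h] = [(s, a, r, s_next), ...]; one Q-function per step h,
    since the MDP is nonstationary."""
    Q = [defaultdict(float) for _ in range(H)]
    for h in reversed(range(H)):
        sums, counts = defaultdict(float), defaultdict(int)
        for (s, a, r, s_next) in transitions[h]:
            target = r
            if h + 1 < H:
                # bootstrap from the already-fitted next-step Q-function
                target += max(Q[h + 1][(s_next, ap)] for ap in actions)
            sums[(s, a)] += target
            counts[(s, a)] += 1
        for k in sums:
            Q[h][k] = sums[k] / counts[k]
    return Q
```

In the paper's setting, the sample-mean regression at each step is replaced by a neural-network fit, and source-population samples enter the regression only after re-weighting.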
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Automated hypothesis validation with agentic sequential falsifications
Speaker: Ying Jin (University of Pennsylvania)
Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper across six domains, including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing validation time tenfold, providing a scalable, rigorous solution for hypothesis validation.
12:05-12:20 CDT
Q&A
12:20-13:30 CDT
Lunch Break
13:30-14:10 CDT
Reinforcement Learning For Individual Optimal Policy From Heterogeneous Data
Speaker: Annie Qu (University of California, Santa Barbara)
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous, time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
14:10-14:25 CDT
Q&A
14:25-14:30 CDT
Tech Break
14:30-15:10 CDT
Sampler Stochasticity in Training Diffusion Models for RLHF
Speaker: Wenpin Tang (Columbia University)
In this talk, I will discuss the reward gap problem, which exhibits a tradeoff between RL training and diffusion inference. This provides some insight into choosing the level of stochasticity in diffusion generation.
15:10-15:25 CDT
Q&A
15:25-15:40 CDT
Coffee Break
15:40-16:20 CDT
Stochastic control for fine-tuning diffusion models: optimality, regularity, convergence, and tail-risk mitigation
Speaker: Renyuan Xu (Stanford University)
We develop a stochastic control framework for fine-tuning diffusion models, using denoising diffusion probabilistic models as the pre-trained reference dynamics and combining linear dynamics control with Kullback–Leibler regularization. We establish well-posedness and regularity of the resulting control problem and propose a policy iteration algorithm, PI-FT, for its numerical solution. We prove that PI-FT converges globally at a linear rate and, unlike existing analyses that impose regularity assumptions throughout training, show that the iterates themselves preserve the required regularity. If time allows, we will also discuss how incorporating a risk-sensitive control criterion can better align the fine-tuned model with rare but important tail events in generation.
16:20-16:35 CDT
Q&A
Tuesday, April 21, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
From Offline to Low-Adaptive Reinforcement Learning
Speaker: Yu-Xiang Wang (University of California, San Diego (UCSD))
Online Reinforcement Learning requires access to the environment for trial and error. Offline Reinforcement Learning learns from existing logged trajectories (i.e., observational studies) but must either weaken the learning goals or make unrealistic assumptions. Is there any meaningful setting in between? The talk starts by discussing the statistical complexity and limitations of offline RL, then reviews the burgeoning problem of low-adaptive exploration, which addresses these limitations by providing a sweet middle ground between offline and online RL. Somewhat surprisingly, we show that only O(log log T) batches of embarrassingly parallel access to the environment are needed to solve exploration with near-optimal sqrt(T) regret (up to log factors). We also discuss the influence of function approximation, two-player games, and other settings such as pure and reward-free exploration.
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Consequentialist Objectives and Catastrophe
Speaker: Benjamin van Roy (Stanford University)
Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue.
We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence.
With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Speaker: Chengchun Shi (London School of Economics and Political Science)
Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE) and derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm – one with access to a value function that quantifies the goodness of its learning policy at each training iteration – and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offers principled guidance for selecting the optimal group size. Empirical experiments further validate our theoretical findings, demonstrating that the optimal group size is universal, and verify the oracle property of GRPO.
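For reference, the object being analyzed is the group-relative advantage at the heart of GRPO. The sketch below shows the standard within-group standardization (as in DeepSeekMath), not the speaker's U-statistic analysis; the rewards are illustrative.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO advantages for one prompt's group of sampled responses:
    standardize each response's reward within the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    if sigma == 0:
        # all rewards equal: the group carries no preference signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# The policy gradient estimate then weights each response's log-prob gradient:
#   g_hat = (1/G) * sum_i A_i * grad log pi(response_i | prompt)
# Viewed across responses in a group, g_hat is the U-statistic the talk studies.
```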
12:05-12:20 CDT
Q&A
12:20-13:45 CDT
Lunch Break
13:45-14:25 CDT
New Results for Distributional Reinforcement Learning
Speaker: Lan Wang (University of Miami)
Distributional reinforcement learning (RL) models the entire distribution of returns and is particularly useful for risk-sensitive decision-making. Quantile temporal difference (QTD) learning is a widely used model-free distributional RL method with strong empirical performance, yet its theoretical guarantees remain less developed. Its analysis is complicated by bias from the quantile semi-gradient, discretization error in approximating return distributions, and nonsmooth update dynamics. We provide nonasymptotic performance guarantees for QTD. We obtain finite-time bounds on the expected supremum 1-Wasserstein distance between the learned and true return distributions. These results advance the theoretical foundation of distributional RL.
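To make the QTD update concrete, here is a minimal sketch of one quantile temporal-difference step with m quantile atoms. This is the standard tabular form of the algorithm whose theory the talk develops, not the speaker's analysis; the inputs in the usage note are illustrative.

```python
def qtd_update(theta, theta_next, r, gamma, alpha, taus):
    """One QTD step: nudge each estimate theta[i] toward the tau_i-quantile
    of the bootstrapped target distribution {r + gamma * t : t in theta_next}."""
    m = len(theta_next)
    new = []
    for th, tau in zip(theta, taus):
        # quantile semi-gradient of the pinball loss, averaged over the
        # m target atoms; this biased, nonsmooth update is what complicates
        # the analysis mentioned in the abstract
        g = sum(tau - (1.0 if r + gamma * t < th else 0.0) for t in theta_next) / m
        new.append(th + alpha * g)
    return new
```

For example, starting from `theta = [0.0, 0.0]` with next-state atoms `[1.0, 1.0]`, reward 0, and `taus = [0.25, 0.75]`, both estimates move up, the higher quantile faster.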
14:25-14:40 CDT
Q&A
14:40-15:10 CDT
Coffee Break
15:10-15:50 CDT
From Reward Learning to Leaderboards: Uncertainty Quantification for LLMs under Heterogeneous Human Feedback
Speaker: Will Wei Sun (Purdue University)
Pairwise human feedback is now widely used in both LLM alignment and LLM evaluation, from reward modeling in RLHF to public leaderboards based on head-to-head comparisons. However, these data are noisy, heterogeneous, and highly non-uniform, making uncertainty quantification a central statistical challenge. In this talk, I will present two recent works on this topic. The first studies reward learning under heterogeneous human feedback, jointly modeling latent rewards and annotator rationality, with asymptotic guarantees that enable valid reward comparison and uncertainty-aware best-of-N sampling. The second studies LLM evaluation as inference on a low-rank latent score tensor observed through pairwise comparisons, leading to efficient debiased inference and a score-whitening method for handling anisotropic information under non-uniform sampling. Together, these works illustrate how statistical inference can provide principled uncertainty quantification for both alignment and evaluation of large language models.
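The reward-learning setting builds on pairwise comparison models. As background, here is a minimal sketch of the plain Bradley-Terry log-likelihood gradient for item scores, without the latent annotator-rationality modeling or the debiased inference the talk adds.

```python
import math

def bt_grad(theta, pairs):
    """Gradient of the Bradley-Terry log-likelihood in the item scores theta;
    pairs = [(i, j), ...] records that item i was preferred over item j."""
    g = [0.0] * len(theta)
    for i, j in pairs:
        p = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))  # model P(i beats j)
        g[i] += 1.0 - p  # push the winner's score up by the "surprise" 1 - p
        g[j] -= 1.0 - p  # and the loser's score down symmetrically
    return g
```

A few steps of gradient ascent on these scores is the basic reward-model fit; heterogeneous annotator rationality replaces the fixed logistic link with annotator-specific noise levels.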
15:50-16:05 CDT
Q&A
Wednesday, April 22, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
On the Learning Dynamics of RLVR at the Edge of Competence
Speaker: Yuejie Chi (Yale University)
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Statistical Inference under Adaptive Sampling with LinUCB
Speaker: Yuting Wei (University of Pennsylvania)
Adaptively collected data has become ubiquitous in modern practice. Yet even seemingly benign adaptive sampling schemes can introduce severe biases, rendering traditional statistical inference tools inapplicable. Focusing on the linear bandit problem, a fundamental and influential framework in reinforcement learning and the bandit literature, we characterize the performance of LinUCB, a canonical upper-confidence-bound algorithm that balances exploration and exploitation, and derive inferential procedures that remain valid despite the challenges posed by adaptive data collection. A central difficulty is to understand the behavior of the eigenvalues and eigenvectors of the random feature covariance matrix generated by LinUCB without imposing the stability assumptions that prior work relied upon. Our analysis provides this characterization and, in turn, enables us to establish a central limit theorem for LinUCB: the estimation error converges in distribution at a $T^{-1/4}$ rate and is asymptotically normal. The resulting Wald-type confidence sets and hypothesis tests do not depend on the feature covariance matrix and are asymptotically tighter than existing nonasymptotic confidence sets. Numerical simulations corroborate our theoretical findings.
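For reference, the data-collection rule under study is the standard LinUCB arm selection. The sketch below shows that rule only (with illustrative statistics), not the talk's inferential procedures, which concern what can be validly estimated from the data LinUCB generates.

```python
import numpy as np

def linucb_choose(features, A, b, beta):
    """Pick the arm maximizing the upper confidence bound
        x^T theta_hat + beta * sqrt(x^T A^{-1} x),
    where A = lambda*I + sum_s x_s x_s^T and b = sum_s r_s x_s
    accumulate the past rounds' features and rewards."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge estimate of the unknown parameter
    scores = [x @ theta_hat + beta * np.sqrt(x @ A_inv @ x) for x in features]
    return int(np.argmax(scores))

def linucb_update(A, b, x, r):
    """Rank-one update of the sufficient statistics after observing reward r."""
    return A + np.outer(x, x), b + r * x
```

Because the chosen `x` depends on past rewards through `theta_hat`, the resulting samples are adaptively collected, which is exactly what breaks off-the-shelf inference.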
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Non-Asymptotic CLTs and Concentration Inequalities for Stochastic Approximation Algorithms, with Applications to Reinforcement Learning
Speaker: R. Srikant (University of Illinois at Urbana-Champaign)
We present non-asymptotic CLT error bounds for stochastic approximation algorithms in the Wasserstein-p distance. To obtain explicit finite-sample guarantees for the last iterate, we develop a coupling argument that compares the discrete-time process to a limiting Ornstein-Uhlenbeck process. Our analysis applies to algorithms driven by general noise conditions, including martingale differences and functions of ergodic Markov chains. Complementing this result, we handle the convergence rate of the Polyak-Ruppert average through a direct analysis that applies under the same general setting. We demonstrate the utility of this approach by considering an application to TD learning, where we explicitly quantify the transition from heavy-tailed to Gaussian behavior of the iterates, thereby bridging the gap between recent finite-sample analyses and asymptotic theory. Based on joint work with Seo Taek Kong.
12:05-12:20 CDT
Q&A
12:20-13:45 CDT
Lunch Break
13:45-14:25 CDT
Off-policy Evaluation via Particle Filtering and Moment Matching
Speaker: Nan Jiang (University of Illinois at Urbana-Champaign)
I will present a new algorithmic framework and analysis for off-policy evaluation (OPE) in finite-horizon MDPs. The algorithm learns a scalar weight for each data point via a moment matching objective against a discriminator class F that realizes Q^π. Notably, the theoretical guarantee of the algorithm is dimension-free: the finite-sample error does not depend on the statistical complexity of the function class F (e.g., no log|F| dependence), and it generalizes the standard error bound for linear regression with a fixed design. The algorithm is also closely connected to several existing methods, such as linear FQE, (sequential) importance sampling, and trajectory stitching, providing connections and novel perspectives on the foundational task of OPE.
14:25-14:40 CDT
Q&A
14:40-14:45 CDT
Tech Break
14:45-15:25 CDT
Model simulation using offline observations with low-rank factor model
Speaker: Devavrat Shah (Massachusetts Institute of Technology (MIT))
We will discuss the role of low-rank factor models in developing model simulation from offline observations that are likely biased and drawn from potentially heterogeneous settings. We do so by positing that the transition dynamics can be represented as a latent function of latent factors associated with agents, states, and actions. This naturally leads to an approximate low-rank decomposition into separable agent, state, and action latent functions, which enables effective learning of the transition dynamics per agent, even with limited, offline data. This extends the literature on causal inference rooted in the panel data setting in econometrics.
I will discuss the application of this approach in developing CausalSim, a simulation platform for communication network protocols. Time permitting, I will discuss some of the ongoing theoretical inquiries suggested by the empirical success of this approach.
15:25-15:40 CDT
Q&A
15:40-16:30 CDT
Poster Session/Social Hour
Thursday, April 23, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
What structures make model-free RL possible? An elliptic theory for controlled Markov diffusions
Speaker: Wenlong Mou (University of Toronto)
Can offline reinforcement learning with function approximation ever be as easy as supervised learning? In general, the answer is no — the Bellman operator contracts only in the sup-norm, not in the L^2-norm induced by the data distribution. This geometric mismatch makes model-free value learning with function approximation provably harder than regression. However, real-world problems often come with additional structures that may facilitate reinforcement learning. In this talk, I will discuss recent advances in understanding the structures that enable model-free offline RL. Focusing on controlled Markov diffusions—a widely used class of dynamical systems—I will provide an affirmative answer to the question above. Specifically, I will identify ellipticity as a key structure that makes model-free RL with function approximation tractable with offline data. Leveraging ellipticity, I will demonstrate desirable geometric properties of Bellman operators in an appropriate Sobolev space. Based on these insights, I will introduce a new class of algorithms for model-free RL with function approximation that achieve near-optimal oracle inequalities efficiently. Finally, I will discuss an application to fine-tuning diffusion-based generative models, where the ellipticity structure is exploited to design a PDE-based algorithm that attains fast convergence rates.
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
TBA
Speaker: Xin Guo (University of California, Berkeley (UC Berkeley))
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Optimal offline policy learning under unknown confounding factors
Speaker: Zhimei Ren (University of Pennsylvania)
We investigate the problem of offline policy learning in the presence of unobserved confounders, which may arise in both observational studies and adaptive experiments (e.g., self-selection and noncompliance in sequential medical settings). In particular, we study this problem under the f-sensitivity model, which characterizes the confounding effect by its “average” strength. Under the f-sensitivity model, we characterize the distribution shift from the observable to the counterfactual and design a distributionally robust policy learning algorithm, f-SR(ad)L, which maximizes the expected outcome within a given policy class Π. We show that the sub-optimality gap of f-SR(ad)L learned from a sequential (i.i.d. or adaptively collected) dataset is of the order $O(\kappa(\Pi)/\sqrt{n})$, where κ(Π) is the entropy integral of Π under the Hamming distance and n is the sample size. A matching lower bound is provided to show the optimality of the rate. Finally, we assess our method on synthetic data and a real-world dataset on lung cancer treatments to demonstrate its advantage over existing benchmarks.
12:05-12:20 CDT
Q&A
12:20-13:45 CDT
Lunch Break
13:45-14:25 CDT
Toward efficient exploration for language models
Speaker: Dylan Foster (Microsoft Research)
14:25-14:40 CDT
Q&A
14:40-15:10 CDT
Coffee Break
15:10-15:50 CDT
TBA
Speaker: Lingzhou Xue (The Pennsylvania State University)
15:50-16:05 CDT
Q&A
Friday, April 24, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
PPO Fine-Tuning of Diffusion Models: Provable Convergence across Interpolated Trajectories
Speaker: Yingbin Liang (The Ohio State University)
Fine-tuning diffusion models is commonly carried out in practice using reinforcement learning algorithms such as Proximal Policy Optimization (PPO). Despite the remarkable empirical success of these approaches, the theoretical understanding of their convergence behavior remains rather limited. In this paper, we provide the first convergence guarantee for PPO-style algorithms for fine-tuning diffusion models. Specifically, we characterize the convergence rate of PPO in terms of two diffusion-specific factors that fundamentally govern RL-based fine-tuning: (i) the sampler stochasticity parameter $\lambda$, which controls trajectory interpolation between deterministic and stochastic denoising dynamics, and (ii) the KL-regularization coefficient $\mu$, which keeps the fine-tuned policy close to the pretrained model. Our results imply that increased sampler stochasticity $\lambda$, which corresponds to trajectories closer to DDPM-style sampling, is more favorable for RL fine-tuning, and that stronger KL regularization (i.e., larger $\mu$) provably accelerates convergence. Our experiments further validate our theoretical results.
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Stochastic Zeroth-Order Policy Optimization for RLHF
Speaker: Lei Ying (University of Michigan)
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
TBA
Speaker: Junwei Lu (Harvard University)
12:05-12:20 CDT
Q&A
12:20-12:35 CDT
Workshop Survey and Closing Remarks
Registration
IMSI is committed to making all of our programs and events inclusive and accessible.
Contact [email protected] to request disability-related accommodations.
In order to register for this workshop, you must have an IMSI account and be logged in.