Reinforcement Learning (RL) has seen remarkable progress in recent years, yet many of its most impressive achievements rely on extensive online interaction, curated environments, or simulated data—conditions rarely available in real-world settings. In contrast, real-world decision-making often depends on learning from limited, imperfect, or passively collected data, alongside guidance from human preferences, demonstrations, or corrections.
This workshop brings together researchers and practitioners exploring the frontiers of Offline Reinforcement Learning (Offline RL) and Reinforcement Learning from Human Feedback (RLHF)—two rapidly growing areas that aim to make RL more robust, safe, and deployable in practice.
Poster Session
This workshop will include a poster session for early-career researchers (including graduate students). In order to propose a poster, you must first register for the workshop and then submit a proposal using the form that will become available on this page after you register. The registration form should not be used to propose a poster.
The deadline for proposing is Wednesday, March 18, 2026. If your proposal is accepted, you should plan to attend the event in-person.
In-Person Registration
Seats at the venue are limited, so in-person registration may be capped before the workshop start date. If capacity is reached, a waitlist will be opened and reflected on the registration form. Early registration is strongly encouraged.
All in-person registrants must wait to receive an invitation from IMSI before traveling; invitations are generally sent out 4-6 weeks in advance.
All registrants (online and in-person) will receive Zoom links and are welcome to attend online.
Organizers
Yuting Wei
University of Pennsylvania, The Wharton School
Renyuan Xu
Stanford University
Lingzhou Xue
Penn State University
Lei Ying
University of Michigan
Schedule
Monday, April 20, 2026
8:30-8:55 CDT
Check-in/Breakfast
8:55-9:00 CDT
Welcome Remarks
9:00-9:40 CDT
Learning to Answer from Correct Demonstrations
Speaker: Nathan Srebro (Toyota Technological Institute at Chicago)
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine-Tuning (SFT). Current standard practice focuses on maximum likelihood (i.e., log-loss minimization) approaches, but we argue that likelihood-maximization methods can fail even in simple settings. Instead, we view the problem as apprenticeship learning (i.e., imitation learning) in contextual bandits, with offline demonstrations from some expert (optimal, or very good) policy, and suggest alternative simple approaches with strong guarantees.
Joint work with Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Kasiviswanathan, and Cong Ma
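For context, a minimal statement of the log-loss objective the abstract refers to (notation is illustrative, not the speakers' formulation):

```latex
% Standard SFT practice critiqued in the abstract: maximum likelihood, i.e.
% log-loss minimization, over demonstrated answers y_i to prompts x_i
% (illustrative notation).
\[
  \hat{\theta}_{\mathrm{MLE}} \in \arg\min_{\theta}\;
  \frac{1}{n} \sum_{i=1}^{n} -\log p_\theta\bigl(y_i \mid x_i\bigr).
\]
```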
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Deep Transfer Offline Q-Learning under Nonstationary Environments
Speaker: Jianqing Fan (Princeton University)
In dynamic decision-making scenarios across business and healthcare, leveraging sample trajectories from diverse populations can significantly enhance reinforcement learning (RL) performance for specific target populations. While existing transfer learning methods primarily focus on linear regression settings, they lack direct applicability to reinforcement learning algorithms. This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by nonstationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning. We demonstrate that naive sample pooling strategies, effective in regression settings, fail in Markov decision processes. To address this challenge, we introduce a novel “re-weighted targeting procedure” to construct “transferable RL samples” and propose “transfer deep Q-learning”, enabling neural network approximation with theoretical guarantees. We assume that the reward functions are transferable and deal with both situations in which the transition densities are transferable or nontransferable. Our analytical techniques for transfer learning in neural network approximation and transition density transfers have broader implications, extending to supervised transfer learning with neural networks and domain shift scenarios. Empirical experiments on both synthetic and real datasets corroborate the advantages of our method. (Joint work with Jinhang Chai and Elynn Chen)
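For readers less familiar with the setup, here is a minimal sketch of backward inductive fitted Q-learning in a nonstationary finite-horizon MDP, the baseline that a transfer procedure of this kind builds on. It is a generic illustration with a plug-in regressor and a made-up data layout, not the proposed transfer deep Q-learning method.

```python
# Generic backward inductive fitted Q-learning sketch (illustrative only).
# Assumes states are tuples of floats, actions are numbers, and a finite action set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def backward_fitted_q(trajectories, horizon, actions):
    """trajectories: list of [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})];
    actions: the finite action set. Returns one fitted Q-model per time step."""
    q_models = [None] * horizon
    for h in reversed(range(horizon)):          # learn Q_h from Q_{h+1}
        X, y = [], []
        for traj in trajectories:
            s, a, r = traj[h]
            target = r
            if h + 1 < horizon:                 # bootstrap from the next step
                s_next = traj[h + 1][0]
                target += max(
                    q_models[h + 1].predict([[*s_next, b]])[0] for b in actions)
            X.append([*s, a])
            y.append(target)
        q_models[h] = RandomForestRegressor(n_estimators=50).fit(X, y)
    return q_models
```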
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Automated hypothesis validation with agentic sequential falsifications
Speaker: Ying Jin (University of Pennsylvania)
Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing the required time tenfold, providing a scalable, rigorous solution for hypothesis validation.
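As a point of reference for the sequential testing component, here is a generic sketch of anytime-valid Type-I error control via e-values and Ville's inequality; it illustrates the style of guarantee only and is not the Popper framework itself.

```python
# Generic anytime-valid sequential test built from e-values (illustrative only).
def sequential_test(e_value_stream, alpha=0.05):
    """e_value_stream yields nonnegative statistics with expectation <= 1 under
    the null hypothesis. The running product ("wealth") is a test supermartingale,
    so rejecting once it exceeds 1/alpha keeps the Type-I error at most alpha."""
    wealth, t = 1.0, 0
    for t, e in enumerate(e_value_stream, start=1):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return {"reject": True, "stopped_at": t, "wealth": wealth}
    return {"reject": False, "stopped_at": t, "wealth": wealth}
```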
12:05-12:20 CDT
Q&A
12:20-13:30 CDT
Lunch Break
13:30-14:10 CDT
Reinforcement Learning For Individual Optimal Policy From Heterogeneous Data
Speaker: Annie Qu (University of California, Santa Barbara)
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
14:10-14:25 CDT
Q&A
14:25-14:30 CDT
Tech Break
14:30-15:10 CDT
Sampler Stochasticity in Training Diffusion Models for RLHF
Speaker: Wenpin Tang (Columbia University)
In this talk, I will discuss the reward gap problem, which reflects a tradeoff between RL training and diffusion inference. This provides insight into choosing the level of stochasticity in diffusion generation.
15:10-15:25 CDT
Q&A
15:25-15:40 CDT
Coffee Break
15:40-16:20 CDT
Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach
Speaker: Renyuan Xu (Stanford University)
We study how to steer diffusion models under hard constraints, so that generated samples satisfy prescribed events almost surely. This problem arises naturally in safety-critical generation, constrained decision-making, and rare-event simulation, where one seeks to condition a pretrained model using only offline trajectories while guaranteeing exact constraint satisfaction.
Our approach builds on Doob’s h-transform and introduces a novel martingale-based loss to learn an additive guidance term, without retraining the full score network. We propose two off-policy objectives for estimating this guidance term from pretrained trajectories, establish non-asymptotic guarantees for the resulting sampler, and demonstrate strong performance on stress testing for financial assets and queueing networks.
This is based on joint work with Wenpin Tang and Zhengyi Guo (Columbia University).
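For orientation, a standard statement of the Doob h-transform on which such guidance terms are built (generic notation, assumed rather than taken from the talk):

```latex
% Conditioning a diffusion dX_t = b(t,X_t) dt + sigma(t,X_t) dW_t on an event A
% (the hard constraint) adds a guidance drift built from h(t,x) = P(A | X_t = x):
\[
  dX_t = \Bigl[\, b(t, X_t)
        + \sigma\sigma^{\top}(t, X_t)\,\nabla_x \log h(t, X_t) \,\Bigr]\,dt
        + \sigma(t, X_t)\, dW_t .
\]
```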
16:20-16:35 CDT
Q&A
Tuesday, April 21, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
From Offline to Low-Adaptive Reinforcement Learning
Speaker: Yu-Xiang Wang (University of California, San Diego (UCSD))
Online Reinforcement Learning requires access to the environment for trial and error. Offline Reinforcement Learning learns from existing logged trajectories (i.e., observational studies) but must either weaken the learning goals or make unrealistic assumptions. Is there any meaningful setting in between? The talk starts by discussing the statistical complexity and limitations of offline RL, then reviews the burgeoning problem of low-adaptive exploration, which addresses these limitations by providing a sweet middle ground between offline and online RL. Somewhat surprisingly, we show that only O(log log T) batches of embarrassingly parallel access to the environment are needed to solve exploration with near-optimal sqrt(T) regret (up to log factors). We also discuss the influence of function approximation, two-player games, and other settings such as pure and reward-free exploration.
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Consequentialist Objectives and Catastrophe
Speaker: Benjamin van Roy (Stanford University)
Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue.
We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence.
With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Speaker: Chengchun Shi (London School of Economics and Political Science)
Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE) and derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm, one with access to a value function that quantifies the goodness of its current policy at each training iteration, and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offers principled guidance for selecting the optimal group size. Empirical experiments further validate our theoretical findings, demonstrating that the optimal group size is universal, and verify the oracle property of GRPO.
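For readers unfamiliar with GRPO, a minimal sketch of the group-relative advantage and the resulting policy-gradient estimate; variable names and the normalization constant are illustrative assumptions, not the speaker's implementation. The group-mean baseline is what connects the gradient estimate to a U-statistic.

```python
# Illustrative GRPO-style gradient estimate for one prompt's sampled group.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: array of shape (G,), one reward per sampled response."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # normalize within the group

def grpo_gradient_estimate(logprob_grads, rewards):
    """logprob_grads: list of G gradients of log pi(response_i | prompt).
    Returns the group-averaged REINFORCE-style gradient with relative advantages."""
    adv = group_relative_advantages(rewards)
    return sum(a * g for a, g in zip(adv, logprob_grads)) / len(rewards)
```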
12:05-12:20 CDT
Q&A
12:20-13:45 CDT
Lunch Break
13:45-14:25 CDT
New Results for Distributional Reinforcement Learning
Speaker: Lan Wang (University of Miami)
Distributional reinforcement learning (RL) models the entire distribution of returns and is particularly useful for risk-sensitive decision-making. Quantile temporal difference (QTD) learning is a widely used model-free distributional RL method with strong empirical performance, yet its theoretical guarantees remain less developed. Its analysis is complicated by bias from the quantile semi-gradient, discretization error in approximating return distributions, and nonsmooth update dynamics. We provide nonasymptotic performance guarantees for QTD. We obtain finite-time bounds on the expected supremum 1-Wasserstein distance between the learned and true return distributions. These results advance the theoretical foundation of distributional RL.
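A minimal sketch of the tabular QTD update analyzed in work of this kind may help fix ideas; the quantile grid, step size, and variable names below are assumptions.

```python
# Illustrative tabular quantile TD (QTD) update (not the speaker's code).
import numpy as np

def qtd_update(theta, s, r, s_next, gamma=0.99, alpha=0.1):
    """theta: array (num_states, m) of quantile estimates of the return distribution.
    Applies one semi-gradient QTD update at state s after observing reward r, s_next."""
    m = theta.shape[1]
    taus = (2 * np.arange(m) + 1) / (2 * m)      # midpoint quantile levels
    targets = r + gamma * theta[s_next]          # m bootstrapped target samples
    # Semi-gradient: each quantile moves up with weight tau_i and down otherwise,
    # averaged over the m target samples.
    indicator = (targets[None, :] < theta[s][:, None]).mean(axis=1)
    theta[s] += alpha * (taus - indicator)
    return theta
```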
14:25-14:40 CDT
Q&A
14:40-15:10 CDT
Coffee Break
15:10-15:50 CDT
From Reward Learning to Leaderboards: Uncertainty Quantification for LLMs under Heterogeneous Human Feedback
Speaker: Will Wei Sun (Purdue University)
Pairwise human feedback is now widely used in both LLM alignment and LLM evaluation, from reward modeling in RLHF to public leaderboards based on head-to-head comparisons. However, these data are noisy, heterogeneous, and highly non-uniform, making uncertainty quantification a central statistical challenge. In this talk, I will present two recent works on this topic. The first studies reward learning under heterogeneous human feedback, jointly modeling latent rewards and annotator rationality, with asymptotic guarantees that enable valid reward comparison and uncertainty-aware best-of-N sampling. The second studies LLM evaluation as inference on a low-rank latent score tensor observed through pairwise comparisons, leading to efficient debiased inference and a score-whitening method for handling anisotropic information under non-uniform sampling. Together, these works illustrate how statistical inference can provide principled uncertainty quantification for both alignment and evaluation of large language models.
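As background for the first line of work, a common Bradley-Terry-style likelihood with annotator-specific rationality is sketched below; this is a generic model, not necessarily the exact specification in the talk.

```latex
% A generic Bradley--Terry model with annotator-specific rationality beta_k
% (illustrative; not necessarily the exact likelihood used in the talk):
\[
  \mathbb{P}\bigl(y \succ y' \mid x, \text{annotator } k\bigr)
    = \sigma\!\bigl(\beta_k \,[\, r_\theta(x, y) - r_\theta(x, y') \,]\bigr),
  \qquad \sigma(u) = \frac{1}{1 + e^{-u}},
\]
% with the reward r_theta fit by maximizing the log-likelihood of the observed
% pairwise comparisons.
```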
15:50-16:05 CDT
Q&A
Wednesday, April 22, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
On the Learning Dynamics of RLVR at the Edge of Competence
Speaker: Yuejie Chi (Yale University)
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Statistical Inference under Adaptive Sampling with LinUCB
Speaker: Yuting Wei (University of Pennsylvania)
Adaptively collected data has become ubiquitous in modern practice. Yet even seemingly benign adaptive sampling schemes can introduce severe biases, rendering traditional statistical inference tools inapplicable. Focusing on the linear bandit problem, a fundamental and influential framework in reinforcement learning and the bandit literature, we characterize the performance of LinUCB, a canonical upper-confidence-bound algorithm that balances exploration and exploitation, and derive inferential procedures that remain valid despite the challenges posed by adaptive data collection. A central difficulty is to understand the behavior of the eigenvalues and eigenvectors of the random feature covariance matrix generated by LinUCB without imposing the stability assumptions that prior work relied upon. Our analysis provides this characterization and, in turn, enables us to establish a central limit theorem for LinUCB: the estimation error converges in distribution at a $T^{-1/4}$ rate and is asymptotically normal. The resulting Wald-type confidence sets and hypothesis tests do not depend on the feature covariance matrix and are asymptotically tighter than existing nonasymptotic confidence sets. Numerical simulations corroborate our theoretical findings.
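For reference, a minimal sketch of the LinUCB algorithm whose adaptively collected data the talk analyzes; the regularization and bonus schedule shown are generic choices rather than the talk's.

```python
# Illustrative LinUCB: ridge estimate plus an upper-confidence bonus.
import numpy as np

class LinUCB:
    def __init__(self, dim, lam=1.0, beta=1.0):
        self.V = lam * np.eye(dim)     # regularized feature covariance
        self.b = np.zeros(dim)
        self.beta = beta               # width of the confidence bonus

    def choose(self, features):
        """features: (num_arms, dim). Pick the arm maximizing the UCB index."""
        theta_hat = np.linalg.solve(self.V, self.b)
        V_inv = np.linalg.inv(self.V)
        ucb = features @ theta_hat + self.beta * np.sqrt(
            np.einsum("ad,dc,ac->a", features, V_inv, features))
        return int(np.argmax(ucb))

    def update(self, x, reward):
        self.V += np.outer(x, x)
        self.b += reward * x
```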
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Non-Asymptotic CLTs and Concentration Inequalities for Stochastic Approximation Algorithms, with Applications to Reinforcement Learning
Speaker: R. Srikant (University of Illinois at Urbana-Champaign)
We present non-asymptotic CLT error bounds for stochastic approximation algorithms in the Wasserstein-p distance. To obtain explicit finite-sample guarantees for the last iterate, we develop a coupling argument that compares the discrete-time process to a limiting Ornstein-Uhlenbeck process. Our analysis applies to algorithms driven by general noise conditions, including martingale differences and functions of ergodic Markov chains. Complementing this result, we handle the convergence rate of the Polyak-Ruppert average through a direct analysis that applies under the same general setting. We demonstrate the utility of this approach by considering an application to TD learning, where we explicitly quantify the transition from heavy-tailed to Gaussian behavior of the iterates, thereby bridging the gap between recent finite-sample analyses and asymptotic theory. Based on joint work with Seo Taek Kong.
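For orientation, the generic recursion and Polyak-Ruppert average that such bounds concern (notation is illustrative):

```latex
% The generic stochastic approximation recursion and its Polyak--Ruppert average
% (illustrative notation); xi_{k+1} is the noise, e.g. a martingale difference
% or a function of an ergodic Markov chain:
\[
  \theta_{k+1} = \theta_k + \alpha_k \bigl( F(\theta_k) + \xi_{k+1} \bigr),
  \qquad
  \bar{\theta}_T = \frac{1}{T} \sum_{k=1}^{T} \theta_k .
\]
```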
12:05-12:20 CDT
Q&A
12:20-13:45 CDT
Lunch Break
13:45-14:25 CDT
Off-policy Evaluation via Particle Filtering and Moment Matching
Speaker: Nan Jiang (University of Illinois at Urbana-Champaign)
I will present a new algorithmic framework and analysis for off-policy evaluation (OPE) in finite-horizon MDPs. The algorithm learns a scalar weight for each data point via a moment-matching objective against a discriminator class F that realizes Q^π. Notably, the theoretical guarantee of the algorithm is dimension-free, in that the finite-sample error does not depend on the statistical complexity of the function class F (e.g., no log|F| dependence) and generalizes the standard error bound for linear regression with a fixed design. The algorithm is also closely connected to several existing methods, such as linear FQE, (sequential) importance sampling, and trajectory stitching, providing connections and novel perspectives to the foundational task of OPE.
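For contrast, one of the classical baselines the abstract connects to, the trajectory-wise importance sampling estimator, is written out below (generic notation, not the proposed estimator):

```latex
% Trajectory-wise importance sampling for OPE in a finite-horizon MDP
% (a classical baseline mentioned in the abstract; mu is the behavior policy,
% notation illustrative):
\[
  \hat{v}_{\mathrm{IS}}(\pi)
    = \frac{1}{n} \sum_{i=1}^{n}
      \Biggl( \prod_{t=0}^{H-1}
        \frac{\pi\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr)}{\mu\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr)}
      \Biggr)
      \sum_{t=0}^{H-1} r_t^{(i)} .
\]
```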
14:25-14:40 CDT
Q&A
14:40-14:45 CDT
Tech Break
14:45-15:25 CDT
Model simulation using offline observations with low-rank factor model
Speaker: Devavrat Shah (Massachusetts Institute of Technology (MIT))
We will discuss the role of low-rank factor models in developing model simulation using offline observations that are likely biased and come from potentially heterogeneous settings. We do so by positing that the transition dynamics can be represented as a latent function of latent factors associated with agents, states, and actions. This naturally leads to an approximate low-rank decomposition into separable agent, state, and action latent functions. This enables effective learning of the transition dynamics per agent, even with limited offline data. This naturally extends the literature on causal inference rooted in the panel data setting in econometrics.
I will discuss the application of this approach in developing CausalSim, a simulation platform for communication network protocols. Time permitting, I will discuss some of the ongoing theoretical inquiries suggested by the empirical success of such an approach.
15:25-15:40 CDT
Q&A
15:40-16:30 CDT
Poster Session/Social Hour
Thursday, April 23, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
What structures make model-free RL possible? An elliptic theory for controlled Markov diffusions
Speaker: Wenlong Mou (University of Toronto)
Can offline reinforcement learning with function approximation ever be as easy as supervised learning? In general, the answer is no — the Bellman operator contracts only in the sup-norm, not in the L^2-norm induced by the data distribution. This geometric mismatch makes model-free value learning with function approximation provably harder than regression. However, real-world problems often come with additional structures that may facilitate reinforcement learning. In this talk, I will discuss recent advances in understanding the structures that enable model-free offline RL. Focusing on controlled Markov diffusions—a widely used class of dynamical systems—I will provide an affirmative answer to the question above. Specifically, I will identify ellipticity as a key structure that makes model-free RL with function approximation tractable with offline data. Leveraging ellipticity, I will demonstrate desirable geometric properties of Bellman operators in an appropriate Sobolev space. Based on these insights, I will introduce a new class of algorithms for model-free RL with function approximation that achieve near-optimal oracle inequalities efficiently. Finally, I will discuss an application to fine-tuning diffusion-based generative models, where the ellipticity structure is exploited to design a PDE-based algorithm that attains fast convergence rates.
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Deterministic Policy Gradient for Reinforcement Learning with Continuous Time and Space
Speaker: Xin Guo (University of California, Berkeley (UC Berkeley))
The theory of continuous-time reinforcement learning (RL) has progressed rapidly in recent years. While the ultimate objective of RL is typically to learn deterministic control policies, most existing continuous-time RL methods rely on stochastic policies. Such approaches often require sampling actions at very high frequencies, and involve computationally expensive expectations over continuous action spaces, resulting in high-variance gradient estimates and slow convergence. In this talk, we will introduce deterministic policy gradient (DPG) methods for continuous-time RL. We will derive a continuous-time policy gradient formula expressed as the expected gradient of an advantage rate function and establish a martingale characterization for both the value function and the advantage rate. These theoretical results provide tractable estimators for deterministic policy gradients in continuous-time RL. Building on this foundation, we propose a model-free continuous-time Deep Deterministic Policy Gradient (CT-DDPG) algorithm that enables stable learning for general reinforcement learning problems with continuous time and state. Numerical experiments show that CT-DDPG achieves superior stability and faster convergence compared to existing stochastic-policy methods, across a wide range of learning tasks with varying time discretizations and noise levels.
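As background, the classical discrete-time deterministic policy gradient formula that the continuous-time result parallels (notation is illustrative):

```latex
% The classical (discrete-time) deterministic policy gradient of Silver et al.,
% which the talk's continuous-time formula parallels (illustrative notation):
\[
  \nabla_\theta J(\mu_\theta)
    = \mathbb{E}_{s \sim \rho^{\mu_\theta}}
      \Bigl[\, \nabla_\theta \mu_\theta(s)\,
               \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)} \,\Bigr].
\]
```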
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Optimal offline policy learning under unknown confounding factors
Speaker: Zhimei Ren (University of Pennsylvania)
We investigate the problem of offline policy learning in the presence of unobserved confounders, which may arise in both observational studies and adaptive experiments (e.g., self-selection and noncompliance in sequential medical settings). In particular, we study this problem under the f-sensitivity model, which characterizes the confounding effect by its “average” strength. Under the f-sensitivity model, we characterize the distribution shift from the observable to the counterfactual and design a distributionally robust policy learning algorithm, f-SR(ad)L, which maximizes the expected outcome within a given policy class Π. We show that the sub-optimality gap of f-SR(ad)L learned from a sequential (i.i.d. or adaptively collected) dataset is of the order $O(\kappa(\Pi)/\sqrt{n})$, where κ(Π) is the entropy integral of Π under the Hamming distance and n is the sample size. A matching lower bound is provided to show the optimality of the rate. Finally, we assess our method on synthetic data and a real-world dataset on lung cancer treatments to demonstrate its advantage over existing benchmarks.
12:05-12:20 CDT
Q&A
12:20-13:45 CDT
Lunch Break
13:45-14:25 CDT
Toward efficient exploration for language models
Speaker: Dylan Foster (Microsoft Research)
14:25-14:40 CDT
Q&A
14:40-15:10 CDT
Coffee Break
15:10-15:50 CDT
Sample-Efficient and Low-Cost Model-Free Reinforcement Learning
Speaker: Lingzhou Xue (The Pennsylvania State University)
Reinforcement learning (RL) provides a general framework for sequential decision-making under uncertainty, and in federated reinforcement learning (FRL), multiple agents collaboratively learn under the coordination of a central server without sharing raw data. Recently, we have developed new methodological and theoretical results for model-free RL in tabular episodic Markov Decision Processes across both single-agent and federated settings. In particular, we have established the first gap-dependent regret bounds for federated RL, obtained through a novel fine-grained analytical framework. Further, we have proposed new algorithms with provable guarantees for low-cost RL.
15:50-16:05 CDT
Q&A
Friday, April 24, 2026
8:30-9:00 CDT
Check-in/Breakfast
9:00-9:40 CDT
PPO Fine-Tuning of Diffusion Models: Provable Convergence across Interpolated Trajectories
Speaker: Yingbin Liang (The Ohio State University)
Fine-tuning diffusion models is commonly carried out in practice using reinforcement learning algorithms such as Proximal Policy Optimization (PPO). Despite the remarkable empirical success of these approaches, the theoretical understanding of their convergence behavior remains rather limited. In this paper, we provide the first convergence guarantee for PPO-style algorithms for fine-tuning diffusion models. Specifically, we characterize the convergence rate of PPO in terms of two diffusion-specific factors that fundamentally govern RL-based fine-tuning: (i) the sampler stochasticity parameter $\lambda$, which controls trajectory interpolation between deterministic and stochastic denoising dynamics, and (ii) the KL-regularization coefficient $\mu$, which keeps the fine-tuned policy close to the pretrained model. Our results imply that increased sampler stochasticity $\lambda$, which corresponds to trajectories closer to DDPM-style sampling, is more favorable for RL fine-tuning, and stronger KL regularization (i.e., larger $\mu$) provably accelerates convergence. Our experiments further validate our theoretical results.
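For reference, a generic PPO-style clipped surrogate with KL regularization toward the pretrained model, written with the abstract's coefficient $\mu$; the paper's exact objective may differ.

```latex
% A generic PPO clipped surrogate with KL regularization toward the pretrained
% model pi_pre, using the abstract's coefficient mu (illustrative; the paper's
% exact objective may differ):
\[
  L(\theta) = \mathbb{E}\Bigl[
      \min\bigl( \rho_t(\theta)\,\hat{A}_t,\;
                 \operatorname{clip}\bigl(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr)
    \Bigr]
    - \mu\, \mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{pre}}\bigr),
  \qquad
  \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} .
\]
```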
9:40-9:55 CDT
Q&A
9:55-10:00 CDT
Tech Break
10:00-10:40 CDT
Stochastic Zeroth-Order Policy Optimization for RLHF
Speaker: Lei Ying (University of Michigan)
10:40-10:55 CDT
Q&A
10:55-11:25 CDT
Coffee Break
11:25-12:05 CDT
Fisher Random Walk: Automatic Preference Inference for Language Models
Speaker: Junwei Lu (Harvard University)
Human preference alignment has been shown to be effective in training large language models (LLMs), allowing them to understand human feedback and preferences. Despite the extensive literature on algorithms for aligning with ranked human preferences, uncertainty quantification for ranking estimation remains largely unexplored and is of great practical significance. For example, it is important to overcome the problem of hallucination for LLMs in the medical domain, where an inferential method for ranking LLM answers becomes necessary. In this talk, we will present a novel framework called “Fisher random walk” to conduct semiparametric efficient preference inference for language models and illustrate its application to language models for medical knowledge.
IMSI is committed to making all of our programs and events inclusive and accessible.
Contact [email protected] to request disability-related accommodations.
In order to register for this workshop, you must have an IMSI account and be logged in.