Part of the Advances in Quantitative Medical Care seminar series
Online Learning with Survival Data
Arielle Anderer, Cornell University
Wednesday, February 4, 2026
Abstract: Decision makers often perform online learning with a time-to-event outcome, where the goal is to increase or reduce the time until some event occurs. Examples span disciplines from marketing (e.g., testing interventions to increase how long a customer stays subscribed) to healthcare (e.g., testing outreach modalities to reduce the time until a patient completes an overdue health screening). Time-to-event outcomes pose a challenge for multi-armed bandit algorithms, which typically assume that the delay between giving an intervention and observing the outcome is uninformative about the outcome itself. As a result, bandit algorithms for these outcomes often resort to dichotomization: selecting a fixed time threshold and defining outcomes by whether the event occurs before or after that threshold. We propose alternative bandit algorithms based on the Cox Proportional Hazards model. We show analytically that dichotomization can be very costly, increasing regret by 40% or more across a range of scenarios; the situation is even worse once we introduce uncertainty about the rate of outcomes. We show numerically, using a real-world dataset of time-to-event outcomes from healthcare screening, that our algorithms deliver robust benefits.
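The dichotomization the abstract criticizes can be sketched in a few lines. This is an illustrative toy example, not the speaker's code: the threshold, arm distributions, and sample sizes are hypothetical, chosen only to show how thresholding a continuous time-to-event outcome can hide a real difference between arms.

```python
import numpy as np

def dichotomize(event_times, threshold):
    """Binarize time-to-event outcomes: 1 if the event occurred
    on or before `threshold`, else 0."""
    return (np.asarray(event_times) <= threshold).astype(int)

# Two hypothetical arms with exponentially distributed event times
# whose mean times differ by 20%.
rng = np.random.default_rng(0)
arm_a = rng.exponential(scale=10.0, size=5000)  # mean time 10
arm_b = rng.exponential(scale=12.0, size=5000)  # mean time 12

# With a threshold far out in the tail, almost every event in both
# arms falls below it, so the binarized success rates are nearly
# identical and the difference between arms is largely discarded.
threshold = 30.0
print(dichotomize(arm_a, threshold).mean())
print(dichotomize(arm_b, threshold).mean())
```

A bandit fed only these binarized outcomes sees two nearly indistinguishable arms, which is one intuition for why working with the survival times directly (e.g., through a proportional hazards model, as the talk proposes) can reduce regret.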