This talk was part of: Data Assimilation and Inverse Problems for Digital Twins
Training distribution optimization in the space of probability measures
Nicholas Nelsen, Cornell University
Thursday, October 9, 2025
Abstract: What probability distribution should training data be sampled from to best approximate a target function or operator? This talk provides an answer in the setting where “best” refers to out-of-distribution (OOD) generalization accuracy with respect to a family of downstream tasks. The talk proposes to minimize the average-case OOD generalization error, or an upper bound on it, over the space of probability measures. The minimizing data distribution depends not only on the model class (e.g., kernel regressors, neural networks) and the accuracy metric, but also on the target map itself. This dependence leads to two implementable data selection algorithms, both adaptive and target-dependent, based on either bilevel or alternating optimization. The new approach produces trained surrogate models that display increased robustness when evaluated on test data from shifted and unseen distributions. These models also empirically outperform models trained on traditional nonadaptive or target-independent data distributions in several function approximation, operator learning, and inverse problem tasks.
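To make the bilevel reading of this objective concrete, the following is a minimal, self-contained sketch and not the speaker's code or algorithm: the inner problem fits a kernel ridge regression surrogate on inputs drawn from a parametric Gaussian training distribution, and the outer problem adjusts that distribution's mean and standard deviation by a derivative-free coordinate search to reduce a Monte Carlo estimate of squared error averaged over a family of shifted test distributions. The target function, the distribution families, and the search scheme are hypothetical choices made only for illustration.

```python
# Illustrative sketch only (not the speaker's code): a derivative-free,
# bilevel-style search over a parametric training distribution for a kernel
# ridge regression surrogate. The target function, the Gaussian training
# family, the shifted-uniform "downstream" test distributions, and the
# coordinate search are all hypothetical choices for illustration.
import numpy as np

rng = np.random.default_rng(0)

def f_target(x):
    """Target map to be approximated (toy example)."""
    return np.sin(3.0 * x)

def fit_krr(x_train, y_train, length_scale=0.3, reg=1e-6):
    """Inner problem: fit RBF kernel ridge regression; return a predictor."""
    def kern(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)
    K = kern(x_train, x_train)
    alpha = np.linalg.solve(K + reg * np.eye(len(x_train)), y_train)
    return lambda x: kern(x, x_train) @ alpha

def avg_ood_error(predictor, shifts, n_test=400):
    """Monte Carlo estimate of squared error averaged over a family of
    downstream test distributions Uniform(shift - 1, shift + 1)."""
    errs = []
    for s in shifts:
        x = rng.uniform(s - 1.0, s + 1.0, size=n_test)
        errs.append(np.mean((predictor(x) - f_target(x)) ** 2))
    return float(np.mean(errs))

def train_on(mean, std, n_train=60):
    """Draw training inputs from N(mean, std^2) and fit the surrogate."""
    x = rng.normal(mean, std, size=n_train)
    return fit_krr(x, f_target(x))

# Family of downstream tasks: test distributions centered at several shifts.
shifts = np.linspace(-2.0, 2.0, 9)

# Outer problem: local coordinate search over the training distribution's
# (mean, std), refitting the surrogate for each candidate and keeping the
# parameters with the smallest averaged OOD error estimate.
mean, std = 0.0, 0.5
for it in range(10):
    best = (avg_ood_error(train_on(mean, std), shifts), mean, std)
    for dm in (-0.2, 0.0, 0.2):
        for ds in (0.8, 1.0, 1.25):
            cand_mean, cand_std = mean + dm, std * ds
            err = avg_ood_error(train_on(cand_mean, cand_std), shifts)
            if err < best[0]:
                best = (err, cand_mean, cand_std)
    _, mean, std = best
    print(f"iter {it}: mean={mean:+.2f}, std={std:.2f}, avg OOD error={best[0]:.2e}")
```

The alternating variant mentioned in the abstract would instead update the surrogate and the training distribution in turn; this toy search is only meant to suggest the structure of the optimization over data distributions described above.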