**Verification, Validation, and Uncertainty Quantification Across Disciplines**

### May 10-14, 2021

**This workshop will take place online.**

**Organizers**

- Mihai Anitescu (Argonne National Laboratory and Statistics, Chicago)
- Fausto Cattaneo (Astrophysics, Chicago)
- Carlo Graziani (Argonne National Laboratory and Astrophysics, Chicago)
- Robert Rosner (Astrophysics, Chicago)

**Description**

With the advent of terascale, petascale, and beyond computational capabilities, the reach of the computational sciences – both modeling and simulation – is rapidly broadening beyond its traditional 'homes' of physics, chemistry, and the computational engineering sciences to the biological and social sciences. To the extent that such modeling and simulation are meant to be predictive – and that the systems being simulated are complex – obvious questions regarding the veracity of the computational results must inevitably be confronted. Historically, it is only in the engineering sciences that a formal, comprehensive, and rigorous process of verifying and validating (V&V) simulation codes – and defining the error bounds on obtained solutions (i.e., uncertainty quantification, or UQ) – has been developed. In other disciplines, past efforts along these lines have been less systematic and far-reaching, presumably because the consequences of significant modeling errors have traditionally been less severe than in the engineering disciplines. But it is not only the recent increases in computational capabilities that are changing the situation: science-based, predictive modeling and simulation are also playing an increasing role in supporting political policy decision-making, and in that context, transparency about the predictive capabilities of such modeling has become increasingly important. The case of global climate change, and the roles played by, for example, global climate models and integrated climate impact assessment models, is a key exemplar of this type of interaction between the computational science community and the world at large.

An important element in the development of discipline-appropriate V&V and UQ methods is the extent to which experimentation – or, equivalently, data generation – allows exploration of the state space of possible solutions. Confidence in the validity of simulations clearly depends on the extent to which simulation results accurately describe the modeled system's behavior throughout the solution state space; thus, a naïve expectation would be that experimental constraints on exploring the simulation state space form an obstacle to proper V&V and UQ analyses. Some disciplines are "data-rich": abundant data exist on the full range of possible experimental outcomes. For others, the data environment is relatively poor: the opportunities for directly validating simulations and for establishing uncertainty bounds are fundamentally limited, either in principle or by ethical, legal, or practical constraints. In these experimentally constrained instances, one faces fundamental conceptual barriers to applying the methodologies developed in data-rich environments. The obvious question is how the highly developed techniques for V&V and UQ in the data-rich (principally engineering) environments can nevertheless make contact with the far more constrained modeling environments of disciplines ranging from astrophysics to the social sciences.

The workshop aims to bring practitioners from across the natural and social sciences, from data-rich to data-poor environments, together with computer scientists and applied mathematicians involved in developing V&V and UQ methodologies, and to seed interactions between these disparate areas.

**Confirmed speakers**

- Liliana Borcea (University of Michigan)
- Donald Estep (Canadian Statistical Sciences Institute, Simon Fraser University)
- Stephen Eubank (University of Virginia)
- Roger Ghanem (University of Southern California)
- Dimitrios Giannakis (New York University)
- Earl Lawrence (Los Alamos National Laboratory)
- Ann Lee (Carnegie Mellon University)
- Andrea Malagoli (SwissRe Corporate Solutions)
- William Oberkampf (W.L. Oberkampf Consulting)
- Daniel Sanz-Alonso (University of Chicago)
- Maike Sonnewald (Princeton University and Geophysical Fluid Dynamics Laboratory)
- Daniel Tartakovsky (Stanford University)
- Jonathan Weare (Courant Institute, New York University)
- David Weisbach (University of Chicago)

**Monday, May 10**

All times CDT (UTC-5)

9:20-9:30 | Introductory remarks: IMSI Director and workshop organizers

Morning Session Chair: Carlo Graziani

9:30-10:15 | Stephen Eubank (University of Virginia): What modeling data-rich systems taught me about validating epidemiological models

Social systems, and particularly infectious disease epidemiology, present serious challenges for evidence-based policy-making. Chief among them is drawing inferences from natural, as opposed to controlled, experiments in enormous configuration spaces where data are unavailable, incommensurate, or collected for other purposes. This talk will explore whether and how ideas about measurement, representational complexity, sensitivity analysis, and even consistency translate from data- or model-rich environments like physics, finance, and linguistics to social systems.

10:15-11:00 | Discussion and Break

11:00-11:45 | David Weisbach (University of Chicago): How the Law Addresses Uncertainty

Legal systems address uncertainty in at least three distinct ways. First, by providing clear rules for interactions among private actors, legal systems can reduce uncertainty in those interactions. For example, property rules provide bright lines regarding the ownership of things, reducing uncertainty about their use. Second, legal systems must address uncertainty about the effects of changes to legal rules. For example, if we change the content of a legal rule from x (such as "separate but equal education is allowed") to y ("separate but equal is not allowed"), the likely effects may be uncertain or hard to predict. Legal systems can partially address this type of uncertainty by moving incrementally and by experimenting across jurisdictions. Finally, legal systems create incentives for policy makers making choices under uncertainty. For example, a policy maker choosing the time period between COVID vaccine doses must choose in the absence of good information about the effects of the choice. The legal system and the institutional structure in which the decision-maker operates will affect her choices. The legal system can be designed to help improve choices made in an uncertain environment. This talk will review issues in all three areas.

11:45-12:30 | Discussion and Break

12:30-1:30 | Lunch Break

Afternoon Session Chair: Carlo Graziani

1:30-2:15 | Andrea Malagoli (SwissRe Corporate Solutions): The Risk of Risk in Finance and Insurance

In this talk I will discuss the opportunity for innovative approaches to the use of models and data in the financial and insurance industries. These industries rely on stochastic models to manage the risk of uncertain future events in all their lines of business. While most of the effort focuses on developing the mathematical structure of the models, a less-studied problem arises from the availability of data and the estimation of the models' parameters. In fact, many financial theories assume that all the risk lies in the stochastic nature of the model while disregarding the uncertainty of the model parameters. It turns out that parameter uncertainty, or the 'risk of risk', poses serious challenges for both theoretical and practical industry applications. I will illustrate the problem with some simple examples and discuss speculative ideas for more data-aware modeling approaches.
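A minimal sketch of the 'risk of risk' idea the abstract describes, using hypothetical lognormal losses and a bootstrap (the model, sample size, and threshold are illustrative assumptions, not an industry calculation): a plug-in tail probability can look precise while parameter uncertainty makes it highly variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical claim-severity model: lognormal losses with unknown parameters;
# only a small historical sample is available to the modeler.
true_mu, true_sigma = 0.0, 1.0
data = rng.lognormal(true_mu, true_sigma, size=50)
threshold = 20.0

# Plug-in approach: fit by maximum likelihood and simulate the tail
# probability, implicitly treating the fitted parameters as exact.
log_data = np.log(data)
mu_hat, sigma_hat = log_data.mean(), log_data.std(ddof=1)
sims = rng.lognormal(mu_hat, sigma_hat, size=100_000)
p_plugin = (sims > threshold).mean()

# "Risk of risk": bootstrap the sample to propagate parameter uncertainty
# into the tail probability; the spread is often comparable to the estimate.
boot = []
for _ in range(200):
    resample = rng.choice(data, size=data.size, replace=True)
    m, s = np.log(resample).mean(), np.log(resample).std(ddof=1)
    boot.append((rng.lognormal(m, s, size=20_000) > threshold).mean())
boot = np.array(boot)

print(f"plug-in P(loss > {threshold}) = {p_plugin:.5f}")
print(f"bootstrap 90% interval = [{np.quantile(boot, 0.05):.5f}, "
      f"{np.quantile(boot, 0.95):.5f}]")
```

The width of the bootstrap interval relative to the point estimate is one simple way to see that the tail estimate itself carries substantial uncertainty.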

2:15-3:00 | Discussion and Break

**Tuesday, May 11**

All times CDT (UTC-5)

Morning Session Chair: Carlo Graziani

9:30-10:15 | Dimitrios Giannakis (New York University): Operator-theoretic approaches for coherent feature extraction in complex systems

The dynamics of physical, chemical, and biological systems in a broad range of applications exhibit coherent behavior embedded within chaotic dynamics. A classical example is the Earth's climate system, which exhibits coherent oscillations such as the El Niño Southern Oscillation or the Madden-Julian Oscillation despite an extremely large number of active degrees of freedom and chaotic behavior on short ("weather") timescales. Identifying these patterns from observational data or model output is useful from a UQ standpoint, for instance for assessing limits of predictability, or for providing predictor variables capturing the conditional statistics of quantities of interest. In this talk, we describe how operator-theoretic techniques from ergodic theory, combined with methods from data science, can identify observables of complex systems with two main features: slow correlation decay and cyclicity. These observables are approximate eigenfunctions of Koopman evolution operators, estimated from high-dimensional time series data using kernel methods. We discuss mathematical and computational aspects of these approaches, and illustrate them with applications to idealized systems and real-world examples from climate science.
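A toy illustration of the Koopman-eigenfunction idea in the abstract, using a plain EDMD least-squares fit with a small Fourier dictionary rather than the kernel methods of the talk (the noisy-rotation system and dictionary are illustrative assumptions): cyclic, slowly decorrelating observables show up as eigenvalues near the unit circle whose phase matches the rotation rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy cyclic system (an illustrative stand-in, not a climate model):
# theta_{t+1} = theta_t + alpha + small noise, a noisy rotation on the circle.
alpha, n = 0.3, 5000
theta = np.cumsum(alpha + 0.1 * rng.standard_normal(n))

# Dictionary of observables (Fourier features) and an EDMD least-squares
# approximation of the Koopman operator restricted to their span.
def psi(th):
    return np.column_stack([np.cos(th), np.sin(th), np.cos(2 * th), np.sin(2 * th)])

X, Y = psi(theta[:-1]), psi(theta[1:])
K, *_ = np.linalg.lstsq(X, Y, rcond=None)   # Y ≈ X @ K
evals = np.linalg.eigvals(K)

# The slowest-decorrelating cyclic observable: eigenvalue of largest modulus.
lead = evals[np.argmax(np.abs(evals))]
print(f"|lambda| = {abs(lead):.3f}, |phase| = {abs(np.angle(lead)):.3f} (alpha = {alpha})")
```

For this system the leading nontrivial eigenvalue sits just inside the unit circle (noise causes mild decorrelation) with phase close to the rotation rate `alpha`.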

10:15-11:00 | Discussion and Break

11:00-11:45 | William Oberkampf (W.L. Oberkampf Consulting): Simulation-Informed Decision Making

Computer simulation is becoming a critical tool in predicting the behavior of an exceedingly wide range of physical and social phenomena. Although the foundation of computational simulation was built in physics, chemistry, and engineering, applications are now common in areas such as environmental modeling, biology, economics, and societal planning. Results from computational simulations can be used for improving scientific or technological knowledge. Scientific knowledge is focused on improving understanding of the workings of the system of interest, whether it be an inanimate physical system or a living/social system. Technological knowledge is generally used for designing new systems, as well as optimizing or influencing existing systems. As a result, decision making is a crucial element in the application of technological knowledge. The credibility of simulation information can be assessed by techniques developed in the fields of verification of computational procedures and validation of simulation results. I contend that the impact of uncertainty quantification on the credibility of simulation results is different because it deals with likelihoods and possibilities of potential outcomes. Decision makers, whether in business or government, can have mixed reactions to comprehensive uncertainty quantification of simulation results. Many of these decision makers understand that some uncertainties are well characterized, whereas others are very poorly understood, and potentially not even included in the simulation. To capture a wide range of uncertainty sources and characterizations, the terms predictive capability and total predictive uncertainty have been used in certain communities. In contrast to traditional uncertainty estimation, which concentrates on random variables, predictive capability attempts to capture all potential sources of uncertainty. These include numerical solution error, model-form uncertainty, and uncertainty in the environments and scenarios to which the system could be exposed, either intentionally or unintentionally. This talk will discuss a wide range of uncertainties and factors, both technical and value-oriented, that influence decision makers when simulation is a critical ingredient.

11:45-12:30 | Discussion and Break

12:30-1:30 | Lunch Break

Afternoon Session Chair: Mihai Anitescu

1:30-2:15 | Jonathan Weare (New York University): Learning long-timescale behavior from short trajectory data

Events that occur on very long timescales are often the most interesting features of complicated dynamical systems. Even when we are able to reach the required timescales with a sufficiently accurate computer simulation, the resulting high-dimensional data is difficult to mine for useful information about the event of interest. Over the past two decades, substantial progress has been made in the development of tools aimed at turning trajectory data into useful understanding of long-timescale processes. I will begin by describing one of the most popular of these tools, the variational approach to conformational dynamics (VAC). VAC involves approximating the eigenfunctions corresponding to large eigenvalues of the transition operator of the Markov process under study. These eigenfunctions encode the most slowly decorrelating functions of the system. I will describe our efforts to close significant gaps in the mathematical understanding of VAC error. A second part of the talk will focus on a family of methods, closely related to VAC, that aim to compute predictions of specific long-timescale phenomena (i.e., rare events) using only relatively short trajectory data (e.g., much shorter than the return time of the event). I will close by presenting a few questions for future numerical analysis.
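A minimal sketch of the VAC idea on a discrete toy system (the two-well chain and indicator basis are illustrative assumptions, far simpler than molecular dynamics): estimate the transition operator from a single trajectory and read off its slowest nontrivial eigenfunction, which separates the metastable wells.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-well Markov chain: states {0,1} form one metastable well,
# states {2,3} the other, with rare hops between the wells.
P_true = np.array([
    [0.90, 0.09, 0.01, 0.00],
    [0.09, 0.90, 0.01, 0.00],
    [0.00, 0.01, 0.90, 0.09],
    [0.00, 0.01, 0.09, 0.90],
])
n, x = 20_000, 0
traj = np.empty(n, dtype=int)
for t in range(n):
    traj[t] = x
    x = rng.choice(4, p=P_true[x])

# VAC with indicator basis functions: estimate the transition operator from
# transition counts, then take its slowest nontrivial eigenfunction.
C = np.zeros((4, 4))
np.add.at(C, (traj[:-1], traj[1:]), 1)
P_hat = C / C.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P_hat)
order = np.argsort(-evals.real)
lam2 = evals[order[1]].real            # slow eigenvalue, just below 1
slow = evecs[:, order[1]].real         # slow eigenfunction over the 4 states

# The slow eigenfunction separates the metastable wells by sign.
print(f"slow eigenvalue ≈ {lam2:.3f}")
print("sign pattern over states:", np.sign(slow).astype(int))
```

The eigenvalue near (but below) 1 corresponds to the rare well-to-well hopping; its eigenfunction takes one sign on states {0,1} and the opposite sign on {2,3}.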

2:15-3:00 | Discussion and Break

**Wednesday, May 12**

All times CDT (UTC-5)

Morning Session Chair: Mihai Anitescu

9:30-10:15 | Ann Lee (Carnegie Mellon University): Calibration and Validation of Approximate Likelihood Models

Many areas of the physical, engineering, and biological sciences make extensive use of computer simulators to model complex systems. Whereas these simulators may be able to generate realistic synthetic data, they are often poorly suited for the inverse problem of inferring the underlying scientific mechanisms associated with observed real-world phenomena. Hence, a recent trend in the sciences has been to fit approximate models to high-fidelity simulators, and then use these approximate models for scientific inference. Inevitably, any downstream analysis will depend on the trustworthiness of the approximate model, the data collected, and the design of the simulations. In the first part of my talk, I will discuss the problem of validating a forward model. Most validation techniques compare histograms of a few summary statistics from a forward model with those of observed data or, equivalently, output from a high-resolution but costly model. Here we propose new methods that can provide insight into how two distributions of high-dimensional data (e.g., images or sequences of images) may differ, and whether such differences are statistically significant. In the second part of my talk, I will discuss the inverse problem of inferring parameters of interest when the likelihood (the function relating internal parameters to observed data) cannot be evaluated but is implicitly encoded by a forward model. I will describe new machinery that bridges classical statistics with modern machine learning to provide scalable tools and diagnostics for constructing frequentist confidence sets with finite-sample validity in such a setting. (Part of this work is joint with Niccolo Dalmasso, Rafael Izbicki, Ilmun Kim, and David Zhao.)

10:15-11:00 | Discussion and Break

11:00-11:45 | Daniel Sanz-Alonso (University of Chicago): Graph-based Bayesian Semi-supervised Learning: Prior Design and Posterior Contraction

11:45-12:30 | Discussion and Break

12:30-1:30 | Lunch Break

Afternoon Session Chair: Mihai Anitescu

1:30-2:15 | Maike Sonnewald (Princeton University and Geophysical Fluid Dynamics Laboratory): Elucidating ecological complexity: Unsupervised learning determines global marine eco-provinces

An unsupervised learning method is presented for determining global marine ecological provinces (eco-provinces) from plankton community structure and nutrient flux data. The systematic aggregated eco-province (SAGE) method identifies eco-provinces within a highly nonlinear ecosystem model. To accommodate the non-Gaussian covariance of the data, SAGE uses t-stochastic neighbor embedding (t-SNE) to reduce dimensionality. Over a hundred eco-provinces are identified with the density-based spatial clustering of applications with noise (DBSCAN) algorithm. Using a connectivity graph with ecological dissimilarity as the distance metric, robust aggregated eco-provinces (AEPs) are objectively defined by nesting the eco-provinces. Using the AEPs, the control of nutrient supply rates on community structure is explored. Eco-provinces and AEPs are unique and aid model interpretation. They could facilitate model intercomparison and potentially improve understanding and monitoring of marine ecosystems.
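To make the DBSCAN step of the pipeline concrete, here is a minimal from-scratch DBSCAN on synthetic 2-D points standing in for a t-SNE embedding (the blobs, noise, `eps`, and `min_pts` are illustrative assumptions, not SAGE itself): dense regions become clusters and sparse points are labeled noise.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 2-D "embedding": two dense blobs plus scattered noise points,
# a hypothetical stand-in for t-SNE-reduced ecosystem data.
blob1 = rng.normal([0, 0], 0.3, (100, 2))
blob2 = rng.normal([5, 5], 0.3, (100, 2))
noise = rng.uniform(-2, 7, (20, 2))
X = np.vstack([blob1, blob2, noise])

def dbscan(X, eps=0.5, min_pts=5):
    """Minimal DBSCAN: returns labels with -1 = noise, 0..k-1 = cluster ids."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in d]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cid
        stack = [i]
        while stack:                     # grow the cluster from core points
            j = stack.pop()
            if not core[j]:
                continue                 # border points join but don't expand
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cid
                    stack.append(k)
        cid += 1
    return labels

labels = dbscan(X)
print("clusters found:", labels.max() + 1)
print("points labeled noise:", (labels == -1).sum())
```

The full SAGE method adds the t-SNE reduction before this step and the connectivity-graph aggregation into AEPs after it.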

2:15-3:00 | Discussion and Break

**Thursday, May 13**

All times CDT (UTC-5)

Morning Session Chair: Fausto Cattaneo

9:30-10:15 | Liliana Borcea (University of Michigan): Data-driven Reduced Order Modeling for inverse scattering

I will describe how one can construct a reduced order model from scattering data collected by an array of sensors. The construction is based on interpreting the wave propagation as a dynamical system that is to be learned from the data. The states of the dynamical system are the snapshots of the wave at discrete time intervals. We know these only at the locations of the sensors in the array. The reduced order model is a Galerkin approximation of the dynamical system that can be calculated from such knowledge. I will describe some properties of the reduced order model and show how it can be used for solving inverse scattering problems.
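A generic projection-based reduced-order-model sketch of the snapshot idea (a POD basis plus a least-squares fit in reduced coordinates on a stand-in linear system, not Borcea's scattering construction): build a low-dimensional model from snapshots alone and check that it reproduces the dynamics.

```python
import numpy as np

# Hypothetical stand-in dynamics: explicit-Euler heat equation on a 1-D grid,
# x_{t+1} = A x_t (a linear system with a rapidly decaying POD spectrum,
# not scattering physics).
n = 50
A = np.eye(n) - 0.25 * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))

rng = np.random.default_rng(5)
x = rng.standard_normal(n)
snaps = [x]
for _ in range(60):
    x = A @ x
    snaps.append(x)
S = np.array(snaps).T                  # snapshot matrix, columns x_0..x_60

# POD basis from the snapshots, then a Galerkin-style reduced model fitted
# by least squares on snapshot pairs in the reduced coordinates.
r = 10
U, _, _ = np.linalg.svd(S, full_matrices=False)
Ur = U[:, :r]
Z = Ur.T @ S
Ar, *_ = np.linalg.lstsq(Z[:, :-1].T, Z[:, 1:].T, rcond=None)
Ar = Ar.T                              # reduced dynamics: z_{t+1} ≈ Ar z_t

# Sanity check: reconstruct the final snapshot through the reduced model.
pred = Ur @ (Ar @ (Ur.T @ S[:, -2]))
err = np.linalg.norm(pred - S[:, -1]) / np.linalg.norm(S[:, -1])
print(f"relative one-step error of the r={r} ROM: {err:.2e}")
```

Because the trajectory's energy concentrates in a few slow modes, ten POD modes reproduce the late-time evolution to small relative error; the talk's construction works from sensor-restricted snapshots instead of full states.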

10:15-11:00 | Discussion and Break

11:00-11:45 | Earl Lawrence (Los Alamos National Laboratory): In Situ Uncertainty Quantification for Exascale

The Department of Energy's investment in exascale computing will enable simulations with unprecedented resolution. This will let scientists investigate fine-scale behavior in areas of interest to DOE such as climate and space physics. However, the computational power of exascale machines has outstripped their I/O and storage capacity, which will make some forms of post hoc analysis impossible. To address this, we are working on methods for in situ uncertainty quantification, that is, analysis done inside the simulations as they are running. I will provide an overview of the problem and describe some of the work we are doing at LANL to fit Bayesian hierarchical models to data inside of simulations of climate and space weather.
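A toy sketch of the in situ constraint (the model and numbers are illustrative assumptions, far simpler than the hierarchical models of the talk): when simulation output cannot be stored, a conjugate Bayesian posterior can still be updated batch by batch from running sufficient statistics, discarding each batch immediately.

```python
import numpy as np

rng = np.random.default_rng(6)

# In situ flavor: batches arrive from a running simulation and are discarded
# after use; we maintain a conjugate Normal posterior for an unknown mean
# (noise variance assumed known), touching only running sufficient statistics.
true_mean, noise_sd = 3.0, 2.0
mu, tau2 = 0.0, 100.0                  # prior N(mu, tau2) on the mean

for _ in range(50):                    # 50 "simulation steps", one batch each
    batch = rng.normal(true_mean, noise_sd, size=20)
    n, xbar = batch.size, batch.mean()
    post_prec = 1.0 / tau2 + n / noise_sd**2
    mu = (mu / tau2 + n * xbar / noise_sd**2) / post_prec
    tau2 = 1.0 / post_prec             # the batch itself is never stored

print(f"posterior: mean = {mu:.3f}, sd = {np.sqrt(tau2):.4f}")
```

After 1000 total observations the posterior concentrates tightly around the true mean even though no raw data were retained; the same streaming-update pattern underlies more elaborate in situ analyses.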

11:45-12:30 | Discussion and Break

12:30-1:30 | Lunch Break

Afternoon Session Chair: Fausto Cattaneo

1:30-2:15 | Daniel Tartakovsky (Stanford University): Learning with Uncertainty on Dynamic Manifolds

Statistical (machine learning) tools for equation discovery require large amounts of data that are typically computer-generated rather than experimentally observed. Multiscale modeling and stochastic simulations are two areas where learning on simulated data can lead to such discovery. In both, the data are generated with a reliable but impractical model, e.g., molecular dynamics simulations, while a model on the scale of interest is uncertain, requiring phenomenological constitutive relations and ad hoc approximations. We replace the human discovery of such models, which typically involves spatial/stochastic averaging or coarse-graining, with a machine-learning strategy based on sparse regression that can be executed in two modes. The first, direct equation-learning, discovers a differential operator from the whole dictionary. The second, constrained equation-learning, discovers only those terms in the differential operator that need to be discovered, i.e., learns closure approximations. We illustrate our approach by learning a deterministic equation that governs the spatiotemporal evolution of the probability density function of a system state whose dynamics are described by a nonlinear partial differential equation with random inputs. A series of examples demonstrates the accuracy, robustness, and limitations of our approach to equation discovery.
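A minimal sketch of sparse-regression equation discovery in the direct mode (a generic SINDy-style sequential thresholded least squares on a hypothetical ODE, not the PDF-equation application of the talk): recover the terms of logistic dynamics xdot = x - x^2 from simulated data and a dictionary of candidate terms.

```python
import numpy as np

# Simulate logistic dynamics xdot = x - x^2 with simple Euler steps.
dt = 0.001
t = np.arange(0, 5, dt)
x = np.empty_like(t)
x[0] = 0.1
for i in range(len(t) - 1):
    x[i + 1] = x[i] + dt * (x[i] - x[i]**2)

# Numerical derivatives and a dictionary of candidate terms.
xdot = np.gradient(x, dt)
library = np.column_stack([np.ones_like(x), x, x**2, x**3])
names = ["1", "x", "x^2", "x^3"]

# Sequential thresholded least squares: fit, zero out small coefficients,
# refit on the surviving terms, and repeat to enforce sparsity.
coef, *_ = np.linalg.lstsq(library, xdot, rcond=None)
for _ in range(10):
    small = np.abs(coef) < 0.05
    coef[small] = 0.0
    big = ~small
    if big.any():
        coef[big], *_ = np.linalg.lstsq(library[:, big], xdot, rcond=None)

print("discovered:", " + ".join(f"{c:.3f}*{m}" for c, m in zip(coef, names) if c != 0.0))
```

The surviving coefficients are close to +1 on x and -1 on x^2, with the constant and cubic terms thresholded away; the constrained mode of the talk would instead fix the known terms and learn only the closure.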

2:15-3:00 | Discussion and Break

**Friday, May 14**

All times CDT (UTC-5)

Morning Session Chair: Fausto Cattaneo

9:30-10:15 | Roger Ghanem (University of Southern California): Hierarchy and intrinsic structure for a more credible validation

Validation is concerned with extrapolation away from historical records. A naive statistical perspective would view this problem as one of characterizing outliers, settling for the significant errors associated with approximating and sampling rare events. A hierarchical perspective, however, quickly regularizes the validation problem with constraints that, while not visible at the operational scale, can be credibly transferred from other contexts such as laboratory experiments or other operational settings. Scaling is a key challenge for this operation of knowledge transfer. It pertains to clarifying intrinsic structure that is invariant across data items acquired under different conditions. In this talk I will describe recent efforts at combining hierarchical models with intrinsic-structure paradigms for the purpose of out-of-set prediction.

10:15-11:00 | Discussion and Break

11:00-11:45 | Donald Estep (Simon Fraser University): Stochastic Inverse Problems for Uncertainty Quantification

11:45-12:30 | Discussion and Break |