Sampling Strategies for Training Machine Learning Emulators of Gravity Wave

This was part of Machine Learning for Climate and Weather Applications

Minah Yang, New York University

Wednesday, November 2, 2022

Abstract: With the goal of developing a data-driven parameterization of gravity waves (GWP) for use in general circulation models, we train various machine learning architectures to emulate an existing GWP scheme, Alexander-Dunkerton 1999. We diagnose the disparity between the online and offline performance of the trained emulators by identifying a subspace of the phase space that is sparsely sampled and prone to large errors, and we develop a sampling algorithm to treat the biases that stem from this underrepresentation. The strategy can be applied to regression tasks over long-tailed (and other imbalanced) distributions. We find that error-prone samples often have large shears in the wind profile. This agrees with physical intuition, as large shears indicate many breaking levels, which require a more complex, nonlocal computation. To remedy this, we develop a sampling strategy that performs a parameterized histogram equalization. The sampling algorithm uses a linear mapping from the original histogram to the uniform histogram, parameterized by $t \in [0,1]$. The parameters $t$ and "maximum repeat" assign each bin a new probability. The new probability is applied in two different implementations: 1) by sampling the bins to adjust the training set distribution; 2) by weighting the loss function to achieve the same effect in expectation. We find that this strategy reduces the errors at the tail of the distribution, except at the extreme end, with minimal loss of accuracy at the peak of the distribution.
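The parameterized histogram equalization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the speaker's implementation; it is one plausible reading under stated assumptions. The binning variable (called `shear` here), the number of bins, the interpolation formula $(1-t)\,p_{\text{orig}} + t\,p_{\text{unif}}$, and the interpretation of "maximum repeat" as a cap on how much any sample may be up-weighted are all assumptions. The function names `equalization_weights`, `resample_indices`, and `weighted_mse` are hypothetical.

```python
import numpy as np


def equalization_weights(shear, n_bins=20, t=0.5, max_repeat=10.0):
    """Per-sample weights that move the histogram of `shear` toward uniform.

    t = 0 keeps the original distribution; t = 1 targets a uniform
    distribution over the non-empty bins. `max_repeat` caps the factor by
    which a sample in a sparse bin is up-weighted (an assumed meaning of the
    abstract's "maximum repeat" parameter).
    """
    shear = np.asarray(shear, dtype=float)
    edges = np.histogram_bin_edges(shear, bins=n_bins)
    # Assign each sample to a bin using the interior edges only, so that the
    # minimum and maximum values fall in the first and last bins.
    bin_idx = np.digitize(shear, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    nonempty = counts > 0

    p_orig = counts / counts.sum()
    p_unif = np.where(nonempty, 1.0 / nonempty.sum(), 0.0)
    # Linear interpolation between the original and the uniform histogram.
    p_new = (1.0 - t) * p_orig + t * p_unif

    # Ratio of new to original bin probability = per-sample up/down-weighting.
    ratio = np.zeros(n_bins)
    ratio[nonempty] = p_new[nonempty] / p_orig[nonempty]
    ratio = np.minimum(ratio, max_repeat)

    weights = ratio[bin_idx]
    weights = weights / weights.mean()  # normalize so the mean weight is 1
    return weights, bin_idx


def resample_indices(weights, size, rng):
    """Implementation 1: draw a new training set with adjusted bin frequencies."""
    p = weights / weights.sum()
    return rng.choice(len(weights), size=size, replace=True, p=p)


def weighted_mse(pred, target, weights):
    """Implementation 2: weight the loss to get the same effect in expectation."""
    pred, target = np.asarray(pred), np.asarray(target)
    return np.mean(weights * (pred - target) ** 2)
```

In this sketch both implementations consume the same `weights` array: `resample_indices` changes which samples the emulator sees, while `weighted_mse` leaves the training set untouched and rescales each sample's contribution to the loss. With a finite `max_repeat`, the sparsest bins are not fully equalized even at $t = 1$.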