This talk was part of Data Assimilation and Inverse Problems for Digital Twins.
A Mathematical Perspective On Contrastive Learning
Ricardo Baptista, University of Toronto
Tuesday, October 7, 2025
Abstract: Multimodal contrastive learning is a methodology for linking data across different modalities—for instance, text and images in data science, or Lagrangian and Eulerian observations in data assimilation. The standard formulation identifies a set of modality-specific encoders that map inputs into a shared latent space, where representations are aligned. In this work, we focus on the bimodal setting and interpret contrastive learning as optimizing parameterized encoders that define conditional probability distributions, with each modality conditioned on the other in a way consistent with the data. This probabilistic view naturally unifies multimodal tasks such as cross-modal retrieval, classification, and generative modeling. It also suggests two principled extensions of classical contrastive learning: (i) novel probabilistic loss functions, and (ii) alternative alignment metrics in the latent space. We analyze these extensions in the multivariate Gaussian setting and validate the framework through numerical experiments on benchmark machine learning datasets as well as a data assimilation application in oceanography.
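To make the setup in the abstract concrete, below is a minimal NumPy sketch of the standard symmetric (CLIP-style) bimodal contrastive loss, written so that the probabilistic reading is explicit: the row- and column-softmaxes of the batch similarity matrix act as in-batch models of the conditionals p(y | x) and p(x | y), and the loss is the cross-entropy of both against the true pairing. All names, the temperature value, and the cosine-similarity latent space are illustrative assumptions; this is only the classical baseline, not the speaker's extended losses or alignment metrics.

```python
import numpy as np

def normalize(z):
    """Project embeddings onto the unit sphere (cosine-similarity latent space)."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    m = logits.max(axis=axis, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=axis, keepdims=True))

def bimodal_infonce(zx, zy, tau=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Rows of zx and zy are encoder outputs for paired samples from the two
    modalities. The row-softmax of the scaled similarity matrix is an
    in-batch model of p(y | x); the column-softmax models p(x | y). The
    true pairs sit on the diagonal, so the loss averages the negative
    log-likelihood of the diagonal under both conditionals.
    """
    zx, zy = normalize(zx), normalize(zy)
    logits = zx @ zy.T / tau                     # pairwise similarities
    log_p_y_given_x = log_softmax(logits, axis=1)
    log_p_x_given_y = log_softmax(logits, axis=0)
    return -(np.diag(log_p_y_given_x).mean()
             + np.diag(log_p_x_given_y).mean()) / 2

# Example with random stand-in embeddings (paired row-wise):
rng = np.random.default_rng(0)
zx = rng.normal(size=(8, 32))   # e.g. image (or Eulerian) embeddings
zy = rng.normal(size=(8, 32))   # e.g. text (or Lagrangian) embeddings
print(bimodal_infonce(zx, zy))
```

Under this reading, cross-modal retrieval is sampling or maximizing over the modeled conditional, which is what lets the talk's framework treat retrieval, classification, and generative modeling uniformly.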