Standardizing the spectra of count data matrices by diagonal scaling

This was part of Eliciting Structure in Genomics Data

Boris Landa, Yale University

Wednesday, September 1, 2021

Abstract: A longstanding question when using PCA is how to choose the number of components. Random matrix theory provides useful insights into this question by assuming a "signal+noise" model, where the goal is to estimate the rank of the underlying signal matrix. If the noise is homoskedastic, i.e., the variances are identical across all entries, the spectrum of the noise follows the celebrated Marchenko-Pastur (MP) law, which allows for a simple method of rank estimation. However, in many practical situations, such as single-cell RNA sequencing (scRNA-seq), the noise is far from homoskedastic. In this talk, focusing on a Poisson data model, I will present a simple procedure termed biwhitening, which makes the noise spectrum follow the MP law by appropriately scaling the rows and columns of the data matrix. Beyond the Poisson distribution, the procedure extends to families of distributions with a quadratic variance function. I will demonstrate our approach on both simulated and experimental data, showcasing accurate rank estimation in simulations and excellent fits to the MP law in real scRNA-seq datasets.
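
As a rough illustration of the idea in the abstract, the following is a minimal sketch for the Poisson case, not the speaker's implementation. It rests on several assumptions that the abstract does not state: because a Poisson variable has variance equal to its mean, the observed counts are used as entrywise variance estimates; the scaling factors are found with Sinkhorn-Knopp iterations so that the estimated variances sum to n in every row and to m in every column of an m-by-n matrix; and the rank is taken as the number of eigenvalues above the MP upper edge (1 + sqrt(m/n))^2. The function names sinkhorn_scale, biwhiten, and mp_rank_estimate are hypothetical.

```python
import numpy as np


def sinkhorn_scale(A, row_target, col_target, n_iter=500):
    """Find u, v so that diag(u) @ A @ diag(v) has the given row/column sums.

    Plain Sinkhorn-Knopp iterations; assumes A is nonnegative with no
    all-zero rows or columns.
    """
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    for _ in range(n_iter):
        u = row_target / (A @ v)
        v = col_target / (A.T @ u)
    return u, v


def biwhiten(Y, n_iter=500):
    """Scale rows and columns of a count matrix Y (assumed Poisson).

    The counts themselves serve as variance estimates (assumption:
    Poisson variance equals the mean). Scaling Y by sqrt(u) and sqrt(v)
    scales the variances by u and v.
    """
    Y = np.asarray(Y, dtype=float)
    m, n = Y.shape
    u, v = sinkhorn_scale(Y, row_target=n, col_target=m, n_iter=n_iter)
    Y_hat = np.sqrt(u)[:, None] * Y * np.sqrt(v)[None, :]
    return Y_hat, u, v


def mp_rank_estimate(Y_hat):
    """Count eigenvalues of Y_hat @ Y_hat.T / n above the MP upper edge."""
    m, n = Y_hat.shape
    eigvals = np.linalg.eigvalsh(Y_hat @ Y_hat.T / n)
    upper_edge = (1.0 + np.sqrt(m / n)) ** 2
    return int(np.sum(eigvals > upper_edge))


# Simulated example: Poisson counts with a rank-3 mean matrix.
rng = np.random.default_rng(0)
m, n, r = 60, 120, 3
X = rng.uniform(1.0, 2.0, size=(m, r)) @ rng.uniform(1.0, 2.0, size=(r, n))
Y = rng.poisson(X)

Y_hat, u, v = biwhiten(Y)
estimated_rank = mp_rank_estimate(Y_hat)
```

The simulated mean matrix is kept well above zero so that no row or column of counts is empty, which the plain Sinkhorn-Knopp iteration used here requires. Sparse real data such as scRNA-seq counts would need all-zero rows and columns filtered out first.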