Description
Methods for dimension reduction play a critical role in a wide variety of genomic applications. Indeed, as technology develops and datasets grow in both size and complexity, the need for effective dimension reduction methods that help visualize and distill the primary structures in data remains as essential as ever. Examples of the many practical applications in genomics include: (a) understanding (i) the structure of wild populations (particularly endangered species) from population genetic variation, (ii) human evolutionary history, also from population genetic variation, (iii) the 3-D structure of DNA from Hi-C data, and (iv) genetic factors that influence risk for different human diseases; (b) identifying (i) substructure among cell populations based on single-cell transcription patterns, and (ii) distinctive signatures of somatic mutations distinguishing different cancer subtypes; (c) estimating confounding factors and other sources of unwanted variation in gene expression studies; and (d) segmenting and annotating genomic regions based on chromatin marks and other molecular features.
The development and provision of effective methods for dimension reduction involves connecting a series of areas of expertise, from theory to algorithms, implementations and applications. Theory is required to help decide which methods and algorithms to focus on; algorithms are required to turn theoretical ideas into practical tools; and implementation of these algorithms is an often-overlooked step, where decisions are sometimes made that can greatly influence results. All of these steps must be carried out with an eye on the details of the practical applications and the data types to which they will be applied. Unfortunately, there are relatively few opportunities for experts in these different areas to come together and learn from one another. This workshop will address this problem by bringing together mathematicians and computer scientists who have a deep understanding of the theory and of algorithmic and implementation issues with applied statistical geneticists who have invaluable experience in implementing these methods, applying them to data, and interpreting the results. The goal will be to start new conversations across disciplinary barriers. The workshop will expose theoretical experts to the many ways that these methods are used in practice and to the ongoing challenges that arise, and it will expose those familiar with applications to recent developments on the theoretical side.
Organizers
Speakers
Schedule
Speaker: Kevin Corlette, Director, IMSI
Speaker: Soledad Villar (Johns Hopkins University)
Speaker: Zhou Fan (Yale University)
Speaker: Karl Rohe (University of Wisconsin-Madison)
Speaker: Sriram Sankararaman (University of California, Los Angeles (UCLA))
Speaker: Smita Krishnaswamy (Yale University)
Speaker: Tracy Ke (Harvard University)
Speaker: Jingshu Wang (University of Chicago)
Trajectory inference methods analyze thousands of cells from single-cell sequencing technologies and computationally infer their developmental trajectories. Though many tools have been developed for trajectory inference, most of them lack a coherent statistical model and reliable uncertainty quantification. In this talk, I will introduce our latest computational tool VITAE, which combines a latent hierarchical mixture model with variational autoencoders to infer trajectories from posterior approximations. VITAE is computationally scalable and can adjust for confounding covariates to integrate multiple datasets. We show that VITAE outperforms other state-of-the-art trajectory inference methods on both real and synthetic data under various trajectory topologies. We also apply VITAE to jointly analyze two single-cell RNA sequencing datasets on the mouse neocortex. Our results suggest that VITAE can successfully uncover a shared developmental trajectory of the projection neurons and reliably order cells from both datasets along the inferred trajectory.
Speaker: Boris Landa (Yale University)
Speaker: Gal Mishne (University of California, San Diego)
In this talk I will present Low Distortion Local Eigenmaps (LDLE), a “bottom-up” manifold learning technique which constructs a set of low distortion local views of a dataset in lower dimension and registers them to obtain a global embedding. The local views are constructed using subsets of the global eigenvectors of the graph Laplacian and are registered using Procrustes analysis. The choice of these eigenvectors may vary across the regions. In contrast to existing techniques, LDLE is more geometric and can embed manifolds without boundary as well as non-orientable manifolds into their intrinsic dimension.
Joint work with Dhruv Kohli and Alex Cloninger.
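The registration step mentioned in this abstract relies on Procrustes analysis. As a rough illustration only, the sketch below aligns one low-dimensional view onto another with orthogonal Procrustes, assuming both views contain the same points in the same row order. It is not the LDLE implementation; the function name `procrustes_align` and the toy data are hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Rigidly align `source` onto `target` (both n x d, rows correspond)."""
    mu_s = source.mean(axis=0)
    mu_t = target.mean(axis=0)
    # Orthogonal Procrustes: the best orthogonal map (rotation or reflection)
    # is U V^T, where U S V^T is the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd((source - mu_s).T @ (target - mu_t))
    orthogonal_map = u @ vt
    return (source - mu_s) @ orthogonal_map + mu_t


# Toy data (assumption): view_b is view_a rotated by 0.7 rad and shifted.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(50, 2))
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
view_b = view_a @ rotation + np.array([3.0, -1.0])

aligned = procrustes_align(view_a, view_b)
max_error = np.abs(aligned - view_b).max()
```

Because the second view here is an exact rigid transform of the first, `max_error` should be at the level of floating-point round-off. With real local views the fit would only be approximate, and only the points shared between overlapping views would be used to estimate the map.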
Speaker: Ben Raphael (Princeton University)
Speaker: Anna Gilbert (Yale University)
Speaker: Miaoyan Wang (University of Wisconsin-Madison)
Speaker: Kasper Hansen (Johns Hopkins University)
The cell cycle is a highly conserved, continuous process which controls faithful replication and division of cells. Single-cell technologies have enabled increasingly precise measurements of the cell cycle both as a biological process of interest and as a possible confounding factor. Despite its importance and conservation, there is no universally applicable approach to infer position in the cell cycle at high resolution from single-cell RNA-seq data. Here, we present tricycle, a method with associated software, to address this challenge by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis of periodic functions, and the applicability of transfer learning. Tricycle works by first constructing a reference latent space using a fixed reference dataset, together with a projection operator which allows us to project any new dataset into the fixed reference space. We show that tricycle can predict any cell’s position in the cell cycle regardless of the cell type, species of origin, and even sequencing assay. The accuracy of tricycle compares favorably to gold-standard experimental assays, which generally require specialized measurements in specifically designed in vitro systems. Unlike gold-standard assays, tricycle is easily applicable to any single-cell RNA-seq dataset. We will highlight features of the problem and of tricycle which we believe are specifically important for achieving high generalizability.
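To make the idea of a fixed reference space and a projection operator concrete, here is a toy sketch, not the tricycle software. It assumes noise-free expression in which each gene follows a cosine of the cell's position in the cycle, learns a two-dimensional PCA space from a simulated reference, projects new cells into it, and reads off a polar angle. All names (`simulate`, `cycle_angle`, `phases`) and the simulation itself are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 40
# Hypothetical peak phase of each gene within the cycle.
phases = rng.uniform(0.0, 2.0 * np.pi, n_genes)


def simulate(theta):
    """Noise-free toy expression: gene g in a cell at position theta
    has value cos(theta - phase_g)."""
    return np.cos(theta[:, None] - phases[None, :])


# Reference dataset: cells spread evenly around the cycle.
ref_theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
reference = simulate(ref_theta)

# PCA of the reference: the top two right singular vectors of the centered
# matrix serve as a fixed projection operator for any later dataset.
gene_means = reference.mean(axis=0)
_, _, vt = np.linalg.svd(reference - gene_means, full_matrices=False)
loadings = vt[:2].T  # shape (n_genes, 2)


def cycle_angle(expression):
    """Project cells into the reference space and return polar angles in [0, 2*pi)."""
    scores = (expression - gene_means) @ loadings
    return np.arctan2(scores[:, 1], scores[:, 0]) % (2.0 * np.pi)


# New cells at known positions, projected with the reference loadings.
new_theta = np.linspace(0.1, 6.0, 50)
estimated = cycle_angle(simulate(new_theta))
steps = np.diff(np.unwrap(estimated))
```

In this toy setting the estimated angle matches the true position only up to an offset, a direction, and a smooth monotone distortion, so the meaningful check is that `steps` all share one sign, meaning the ordering of cells around the cycle is preserved.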
Speaker: Tandy Warnow (University of Illinois at Urbana-Champaign)
The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of much biological research, and is approached through methods that are based on stochastic models of sequence and gene evolution. Statistical properties of methods, including statistical consistency and sample complexity, are important, and theoretical advances in these aspects have resulted in new methods with outstanding theoretical properties. Yet, empirical performance has been of mixed quality. In this talk, I will discuss algorithm design issues that arise, and present new results that shed light on those strategies that maintain theoretical advantages while providing excellent empirical performance on large simulated datasets. I will also present some open questions of interest to mathematicians that would be important to further method development.
Speaker: Alex Bloemendal (Broad Institute)
Speaker: Bianca Dumitrascu (University of Cambridge)
Extracting interpretable, lower-dimensional representations from data is not a new problem. Over the years, numerous algorithms and models have been developed across a range of disciplines such as statistics, computer science and operations research. In computational biology, under the auspices of single-cell technology development, one challenging task is to select a small set of informative markers to identify and differentiate specific cellular information (e.g., cell type, cell state or cell location) as precisely as possible. In this talk, I will discuss scGene-Fit, a method for selecting gene transcript markers that jointly optimize cell label recovery using a simple label-aware compressive classification approach. Beyond presenting its features and limitations, I will also discuss ongoing work aimed at improving them. Finally, I will review recent literature on the topic and related interpretable machine learning approaches that need further understanding and exploration, but which hold promise in the genomic context.
Speaker: Petros Drineas (Purdue University)
Videos
MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data
August 30, 2021
Two persistent puzzles in multivariate statistics; “rotations” and “picking k”
August 30, 2021
Geometric and Topological Approaches to Representation Learning in Biomedical Data
August 31, 2021
Bulk Eigenvalue Matching Analysis: A new method to estimating K in a spiked covariance matrix
August 31, 2021
Model-Based Trajectory Inference for Single-Cell RNA Sequencing Using Deep Learning with a Mixture Prior
August 31, 2021
Spatial transcriptomics: Alignment, integration, and inference of genomic aberrations
September 1, 2021
Beyond matrices: higher-order tensor methods meet computational biology
September 2, 2021
Machine learning for actionable, interpretable marker selection in -omics studies
September 3, 2021