Approximating Persistent Homology for Large Datasets

This was part of Randomness in Topology and its Applications

Anthea Monod, Imperial College

Wednesday, March 22, 2023

Abstract: Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications; it produces a summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its popularity, persistent homology is simply impossible to compute for very large datasets which prohibits its widespread use in many big data settings. What can we do if we would like a representative persistence diagram for a very large dataset whose persistent homology we cannot compute due to size restrictions? We adapt here the classical statistical method of bootstrapping, namely, drawing and studying smaller subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples is a valid approximation of the persistence diagram of the large dataset and derive its convergence rate to the true persistent homology of the large dataset. We demonstrate our approach on synthetic and real data; furthermore, we give an example of the utility of our approach in a shape clustering problem where we are able to obtain accurate results with only 2% subsampled from the original dataset.