This talk was part of the Statistics Meets Tensors series.

A Statistically Provable Approach to Integrating LLMs into Topic Modeling

Tracy Ke, Harvard University

Tuesday, May 6, 2025



Slides
Abstract: "The rise of large language models (LLMs) raises an important question: how can statisticians leverage their expertise in the AI era? Statisticians excel in developing resource-efficient, theoretically grounded models. In this talk, we use topic modeling as an example to illustrate how such expertise can enhance the processing of LLM-generated data. Traditional topic modeling is applied to word counts without considering contextual meaning. LLMs, however, produce contextualized word embeddings that capture deeper semantic relationships. We leverage these embeddings to refine topic modeling by representing each document as a sequence of word embeddings, modeled as a Poisson point process. Its intensity measure is expressed as a convex combination of K base measures, each representing a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods and nonparametric density estimation techniques. A key advantage of this approach is its compatibility with any existing bag-of-words topic modeling method as a plug-in module, requiring no modifications. Assuming each topic is a β-Hölder smooth intensity measure in the embedding space, we establish the convergence rate of our method. We also derive a minimax lower bound and show that our method attains this rate when β lies in a certain range. Finally, we validate our approach on multiple datasets, demonstrating its advantages over traditional topic modeling techniques in capturing word contexts."
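To make the pipeline concrete: the abstract's model says, roughly, that document d is a point process in embedding space with intensity λ_d(x) = n_d Σ_{k=1}^K w_{dk} f_k(x), where the f_k are the K topic intensities and the weights w_{dk} ≥ 0 sum to one. Below is a minimal Python sketch of the kind of plug-in construction the abstract describes: discretize the embedding space into bins, run an off-the-shelf bag-of-words topic model on the resulting counts, then smooth each discrete topic into a continuous intensity. The bin count, the choice of NMF as the plug-in module, and the kernel-density smoothing step with its bandwidth are illustrative assumptions of mine, not details taken from the paper.

```python
# A sketch of the embed -> bin -> plug-in topic model -> smooth pipeline.
# Assumptions (mine, not the paper's): k-means binning, NMF as the
# plug-in bag-of-words method, and weighted KDE for the smoothing step.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.neighbors import KernelDensity


def fit_embedded_topics(doc_embeddings, n_topics=5, n_bins=200, bandwidth=0.3):
    """doc_embeddings: list of (n_words_d, dim) arrays of contextualized
    word embeddings, one array per document."""
    # 1. Pool all word embeddings and discretize the embedding space into
    #    bins (a surrogate "vocabulary"), here via k-means centroids.
    pooled = np.vstack(doc_embeddings)
    binner = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit(pooled)

    # 2. Build a document-by-bin count matrix, the analogue of the usual
    #    bag-of-words document-term matrix.
    counts = np.zeros((len(doc_embeddings), n_bins))
    for d, emb in enumerate(doc_embeddings):
        bins, freq = np.unique(binner.predict(emb), return_counts=True)
        counts[d, bins] = freq

    # 3. Plug in any bag-of-words topic model on the counts; NMF is used
    #    here purely as a stand-in for the user's preferred method.
    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0)
    doc_topic = nmf.fit_transform(counts)   # per-document topic weights
    topic_bin = nmf.components_             # topic-by-bin scores

    # 4. Smooth each discrete topic into a continuous intensity on the
    #    embedding space with a weighted kernel density estimate over the
    #    bin centers.
    topic_kdes = []
    for k in range(n_topics):
        w = topic_bin[k] / topic_bin[k].sum()
        kde = KernelDensity(bandwidth=bandwidth).fit(
            binner.cluster_centers_, sample_weight=w)
        topic_kdes.append(kde)              # score_samples gives log-density
    return doc_topic, topic_kdes


# Usage on synthetic data: 20 "documents" of random 16-dim embeddings.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(rng.integers(40, 80), 16)) for _ in range(20)]
weights, kdes = fit_embedded_topics(docs, n_topics=3, n_bins=50)
print(weights.shape, kdes[0].score_samples(docs[0][:2]))
```

Because step 3 treats the topic model as a black box on a count matrix, any existing bag-of-words method (LDA, anchor-word estimators, etc.) can be swapped in without modification, which is the plug-in property the abstract highlights.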