Colloquium Series: Tracy Ke
Our upcoming event for the Statistics Colloquium Series is scheduled for Monday, April 27 from 12:00 – 1:00pm (ET) and will be an in-person presentation at Maxwell-Dworkin 134A/B. Lunch will be provided to guests following the talk. This week's speaker will be faculty member Tracy Ke from our Statistics department.
Title: Integrating Large Language Models with Statistical Methods for Text Analysis
Abstract: In this talk, I will present several recent projects at the intersection of large language models (LLMs) and statistical methodology for text analysis. A central theme of this work is to treat a pre-trained LLM as a feature generator, and to develop principled statistical models for the resulting representations.
I will begin with a recent paper introducing PPTM, a new topic model for LLM-generated word embeddings. In this framework, each document is modeled as a mixture of K latent nonparametric densities (“topics”). Empirical studies demonstrate that PPTM effectively captures context-dependent topic structure in real-world text corpora.
I will then discuss three follow-up projects ranging from theory to applications. The first addresses a fundamental question motivated by PPTM: how to optimally demix mixtures of nonparametric densities. We establish the minimax-optimal rate and propose a rate-optimal estimator. The second explores a downstream application of topic modeling, introducing a topic-aware Bradley–Terry–Luce model for ranking problems, such as journal evaluation and LLM leaderboards. The third engages with generative AI, proposing a framework for training LLMs to generate documents with pre-specified topic weights.
This work is based on collaborations with Morgane Austern, Jianqing Fan, Yuanchuan Guo, John Lafferty, Tianle Liu, Gabriel Moryoussef, Zhaoyang Shi, and Yuxin Tao.