Colloquium Series: Tracy Ke

Date and Time

April 27, 2026
12:00PM - 01:00PM EDT

Location

Maxwell-Dworkin 134A/B

The next event in the Statistics Colloquium Series takes place on Monday, April 27, from 12:00 to 1:00 PM (ET), in person at Maxwell-Dworkin 134A/B. Lunch will be provided to guests following the talk. This week's speaker is Tracy Ke, a faculty member in our Statistics department.

Title: Integrating Large Language Models with Statistical Methods for Text Analysis

Abstract: In this talk, I will present several recent projects at the intersection of large language models (LLMs) and statistical methodology for text analysis. A central theme of this work is to treat a pre-trained LLM as a feature generator, and to develop principled statistical models for the resulting representations. 

I will begin with a recent paper introducing PPTM, a new topic model for LLM-generated word embeddings. In this framework, each document is modeled as a mixture of K latent nonparametric densities (“topics”). Empirical studies demonstrate that PPTM effectively captures context-dependent topic structure in real-world text corpora.

I will then discuss three follow-up projects ranging from theory to applications. The first addresses a fundamental question motivated by PPTM: how to optimally demix mixtures of nonparametric densities. We establish the minimax optimal rate and propose a rate-optimal estimator. The second explores a downstream application of topic modeling, introducing a topic-aware Bradley–Terry–Luce model for ranking problems such as journal evaluation and LLM leaderboards. The third engages with generative AI, proposing a framework for training LLMs to generate documents with pre-specified topic weights.

This work is based on collaborations with Morgane Austern, Jianqing Fan, Yuanchuan Guo, John Lafferty, Tianle Liu, Gabriel Moryoussef, Zhaoyang Shi, and Yuxin Tao.