Statistics Colloquium Series

Date: Monday, January 23, 2023, 12:00pm to 1:00pm

Location: Science Center, Room 316

The next event in the Statistics Department Colloquium Series is scheduled for this Monday, January 23rd, from 12:00 to 1:00pm (ET), and will be an in-person presentation in Room 316 of the Science Center. The speaker is Diana Cai, a Ph.D. candidate in computer science at Princeton University.

Title: Probabilistic inference under misspecification

Abstract: Probabilistic inference is a pillar of modern data analysis, and the flexibility and interpretability of these methods have led to wide applicability in the computational sciences and engineering. A probabilistic generative model is, of necessity, a simplification of complex real-world phenomena, and in many cases facilitates tractable data analysis and discovery of meaningful and actionable patterns in data. But typically any model of a real-world data set is misspecified, and some types of misspecification can be dangerous, in that they may lead to fundamentally inaccurate or misleading inferences. In this talk, I analyze fundamental applications in which misspecification leads to misleading answers and propose solutions for mitigation. First, I study finite mixture models, which are applied widely in domains such as genomics and neuroscience. In particular, scientists and engineers are often interested in learning the number of subpopulations, or components, present in a data set. I show that under an arbitrary amount of component misspecification, the Bayesian estimate of the number of components diverges: that is, the posterior probability of any particular finite number of components converges to 0 in the limit of infinite data. Next, I study Markov chain Monte Carlo (MCMC) methods for problems with expensive models, such as a costly but accurate physical simulation. In practice, these computations are often approximated via a cheaper, low-fidelity computation, leading to bias in the resulting target density. I propose a framework for multi-fidelity MCMC that uses models of varying costs and accuracies to simulate samples from the expensive target density at lower computational cost.