Statistics Colloquium Series


Monday, October 30, 2023, 12:00pm to 1:30pm


Science Center 316

Our upcoming event for the Statistics Department Colloquium Series is scheduled for Monday, October 30 from 12:00 – 1:00pm (ET) and will be an in-person presentation in Science Center Rm. 316. Lunch will be provided to guests following the talk. This week's speaker will be David Alvarez-Melis of the Computer Science Department at Harvard University.


Title: Machine Learning in the Space of Datasets: an Optimal Transport Perspective


Abstract: Machine learning, as taught in classrooms and textbooks, typically involves a single, fixed, and homogeneous dataset on which models are trained and evaluated. But machine learning in practice is rarely so ‘pristine’. In most real-life applications, clean labeled data is typically scarce, so it is often necessary to leverage multiple heterogeneous data sources. In particular, there is an almost-universal discrepancy between training and testing data distributions. This phenomenon has been profoundly amplified by the recent advent of massive reusable ‘pre-trained’ deep learning models, which rely on vast amounts of highly heterogeneous datasets for training, and are then re-purposed for a variety of distinct (and often unrelated) tasks. This emerging paradigm of ‘Machine Learning on collections of datasets’ necessitates new theoretical and algorithmic tools. In this talk, I will argue that Optimal Transport provides an ideal framework on which to lay the foundations for this novel paradigm. It allows us to define semantically-meaningful distances between datasets, to elucidate correspondences between them, and to solve optimization objectives over them. Through applications in dataset selection, transfer learning, and dataset shaping, I will show that besides enjoying sound theoretical footing, these OT-based approaches yield powerful, highly-scalable, and at times surprisingly insightful methods.


Bio: David Alvarez-Melis is an assistant professor of computer science at Harvard SEAS, and is a faculty affiliate at the Harvard Data Science Initiative, the Kempner Institute, and the Center for Research on Computation and Society (CRCS). Before Harvard, he spent a few years at Microsoft Research New England, as part of the core Machine Learning and Statistics group. His research seeks to make machine learning more broadly applicable (especially to data-poor applications) and trustworthy (e.g., robust and interpretable). For this, he draws on ideas from various fields including statistics, optimization, and applied mathematics, and takes inspiration from problems arising in the application of machine learning to the natural sciences.