Data Science Achievements by our Faculty Members


Causal inference and design of experiments (Edo Airoldi, Tirthankar Dasgupta, Donald Rubin)
Developed and established the standard method for causal inference from observational & administrative data in modern science, medicine and social science.
Key contributions include the “Rubin causal model” and propensity scores. Current work includes:

    • Benefits and challenges of big data for causal inference
    • Methods for optimal design and analysis of randomized experiments, in classical and modern settings (e.g. in the presence of social interference).

Missing data (Xiao-Li Meng, Donald Rubin)
Formalized the types of missing data (e.g. missing at random & selection effects) and developed methods for handling missing data appropriately (e.g. the EM & PX-EM algorithms, multiple imputation). These methods are widely applied in modern science, medicine, administrative data & social science.

Statistics and computing (Edo Airoldi, Pierre Jacob, Samuel Kou, Jun Liu, Xiao-Li Meng, Natesh Pillai, Neil Shephard)

    • Parallel MCMC methods (e.g. parallel tempering, multiple try, equi-energy sampler, bridge sampler, ASIS). These methods allow modern computing architectures to be harnessed for statistical inference on both large and small datasets
    • Fundamental contributions to the development of sequential Monte Carlo, a method for updating inference sequentially as data arrive. It has become the generalization of the Kalman filter to non-linear, non-Gaussian systems and is used in many application areas in engineering, science and economics
    • Strategies to sample in high-dimensional and highly constrained spaces
    • Stochastic approximation methods for estimation and inference with big data
    • Conceptualization of statistical efficiency vs. computational complexity tradeoffs
    • Inference methods based on de-randomization.
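To illustrate how sequential Monte Carlo extends Kalman-style filtering to non-linear, non-Gaussian systems, here is a minimal sketch of a bootstrap particle filter, one of the simplest sequential Monte Carlo schemes. It is an illustration only and is not taken from the faculty's own work: the toy model (a sine-perturbed state transition with Laplace observation noise) and every name in the code, such as bootstrap_particle_filter, simulate and NOISE_SCALE, are assumptions made for this example.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles, init, transition,
                              log_likelihood, rng):
    """Return the filtered posterior mean of the state after each observation."""
    particles = [init(rng) for _ in range(n_particles)]
    filtered_means = []
    for y in observations:
        # 1. Propagate every particle through the (possibly non-linear) dynamics.
        particles = [transition(x, rng) for x in particles]
        # 2. Weight each particle by how well it explains the new observation.
        log_w = [log_likelihood(y, x) for x in particles]
        max_log_w = max(log_w)  # subtract the max for numerical stability
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        filtered_means.append(sum(w * x for w, x in zip(weights, particles)))
        # 3. Multinomial resampling: low-weight particles tend to be dropped.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return filtered_means


# Hypothetical toy model, chosen only for illustration.
NOISE_SCALE = 0.3  # Laplace scale of the observation noise


def init(rng):
    return rng.gauss(0.0, 1.0)


def transition(x, rng):
    # Non-linear state dynamics with Gaussian process noise.
    return 0.9 * x + math.sin(x) + rng.gauss(0.0, 0.5)


def log_likelihood(y, x):
    # Log density of a Laplace (non-Gaussian) observation centred on the state.
    return -abs(y - x) / NOISE_SCALE - math.log(2.0 * NOISE_SCALE)


def simulate(n_steps, rng):
    """Draw a state path and noisy observations from the toy model."""
    x = init(rng)
    states, observations = [], []
    for _ in range(n_steps):
        x = transition(x, rng)
        # The difference of two exponentials with mean NOISE_SCALE
        # is Laplace(0, NOISE_SCALE) noise.
        noise = (rng.expovariate(1.0 / NOISE_SCALE)
                 - rng.expovariate(1.0 / NOISE_SCALE))
        states.append(x)
        observations.append(x + noise)
    return states, observations


if __name__ == "__main__":
    rng = random.Random(0)
    states, observations = simulate(50, rng)
    means = bootstrap_particle_filter(observations, 500, init, transition,
                                      log_likelihood, rng)
    error = sum(abs(m - s) for m, s in zip(means, states)) / len(states)
    print(f"mean absolute filtering error: {error:.3f}")
```

Because the filter only needs to simulate from the transition and evaluate the observation density, the same code applies to any model that supplies those two functions, which is what makes the approach usable where the Kalman filter's linear-Gaussian assumptions fail.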

Application-focused methodological development

Astronomy and astrophysics (Xiao-Li Meng as part of the California-Harvard AstroStat Collaboration)

    • Pioneered the use of Bayesian modeling for X-ray spectral data, for time symmetry in X-ray light curves, for event location detection & for calibrating uncertainties in large-scale X-ray analysis
    • Developed wavelet-based feature detection methods for handling millions of highly irregular light curves with confounding features and complex noise structures
    • Pioneered an array of statistical methods for dealing with complex image data, for detecting narrow emission lines, for computing hardness ratios, etc.
    • Developed a multi-scale deconvolution algorithm for high-energy astrophysical images, used to reveal signatures of binary black holes, relativistic quasar jets, etc.
    • Developed a practical representation method for processing high-resolution and high-cadence images for automated classification of sunspots and coronal loop structures.

Bioinformatics, biophysics and computational biology (Edo Airoldi, Samuel Kou, Jun Liu)

    • Development of statistical algorithms and methods (e.g. Gibbs motif sampler, BioProspector, MDscan, MotifRegressor) for discovering novel repetitive sequence patterns
    • Pioneered the use of both genomic sequence information and mRNA expression information for gene transcription regulatory analysis
    • Pioneered Bayesian models for genome-wide association studies and the development of human genetic analysis algorithms (e.g. "HAPLOTYPER", and the Partition-Ligation method)
    • Data modelling that overturned the classical Michaelis-Menten model of enzymatic reactions, together with the development of a new standard model to replace it
    • Pioneered the introduction of stochastic modeling and Bayesian experimental data analysis in single-molecule biophysics, such as subdiffusion modeling of protein conformational dynamics
    • More generally, new methods and applications in genomics and proteomics.      

Social sciences and economics (Edo Airoldi, Donald Rubin, Neil Shephard)

    • Formalizing the use of high-frequency asset price data to measure time-varying volatility, correlation and jump risk
    • Currently estimating the value of income-contingent English student loans (linking the UK Government’s student loan book of 2.8M individual loans and the entire individual-level UK income tax record for the last 10 years, which is roughly 400M records)
    • Computational and statistical strategies for estimating customers’ lifetime value to support massive advertising campaigns on social media platforms
    • Methods for quantifying causal mechanisms through which social structure and interactions can affect workforce mobility, and labor market dynamics more generally
    • Estimating the effects of line and station closures on from-to traffic volumes in massive transportation systems, in collaboration with Transport for London.
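As a small illustration of measuring volatility from high-frequency price data, one widely used estimator is realized variance: the sum of squared intraday log returns over a trading period. The sketch below is a generic textbook version, not necessarily the estimators formalized by the faculty; the function names and the example prices are assumptions made for this illustration, and practical use would also need to address microstructure noise and the choice of sampling frequency.

```python
import math


def realized_variance(prices):
    """Sum of squared log returns between consecutive sampled prices."""
    if len(prices) < 2:
        raise ValueError("need at least two prices to form a return")
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return sum(r * r for r in log_returns)


def realized_volatility(prices):
    """Square root of realized variance, on the same scale as returns."""
    return math.sqrt(realized_variance(prices))


if __name__ == "__main__":
    # Hypothetical intraday prices sampled at a fixed interval.
    intraday_prices = [100.0, 100.2, 99.9, 100.1, 100.4, 100.3]
    print(realized_variance(intraday_prices))
    print(realized_volatility(intraday_prices))
```

Computing this quantity separately for each day yields a time series of daily volatility measures, which is one way time variation in volatility can be tracked directly from the data.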