My current research focuses on a range of related problems: Bayesian mixture model development for classification, design, variable selection and hypothesis testing; structured and hierarchical nonparametric Bayesian methods; rare event detection; multiple comparisons; and statistical computation involving simulation and optimization. Much of this work is linked to problems involving large data sets, and was originally motivated by the analysis of biomedical data from flow cytometry technologies.
Mixtures, classification and variable selection
Data sets of increasing scale and complexity pose challenges to standard statistical methods, particularly in areas where one goal is classification and discrimination of subpopulations. A key general question we have explored is how to identify subsets of variables that play roles in discriminating subpopulations in the context of multivariate mixture modeling. We introduced a new discriminative information measure based on the concordance between multivariate mixture component densities, and have developed and applied it to these general variable selection/design questions in flow cytometry and other applications. The method is both effective and computationally attractive for routine use in assessing and prioritizing subsets of variables according to their roles in discriminating subpopulation structure.
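As a rough illustration of how a concordance-based score can rank variable subsets, the sketch below compares two Gaussian components. It is not the measure described above. It assumes diagonal covariances, takes concordance to mean the integral of the product of two densities, and normalizes it to lie in (0, 1]. The function names and the normalization are hypothetical choices made for this example only.

```python
import math
from itertools import combinations


def gaussian_concordance(m1, v1, m2, v2):
    """Integral over x of N(x | m1, v1) * N(x | m2, v2), which equals N(m1 | m2, v1 + v2)."""
    s = v1 + v2
    return math.exp(-0.5 * (m1 - m2) ** 2 / s) / math.sqrt(2.0 * math.pi * s)


def normalized_overlap(comp_a, comp_b, subset):
    """Normalized concordance of two components restricted to the variables in `subset`.

    Each component is a list of (mean, variance) pairs, one per variable
    (diagonal covariance is an assumption of this sketch). The result lies
    in (0, 1]; values near 0 mean the subset separates the components well.
    """
    overlap = 1.0
    for d in subset:
        (ma, va), (mb, vb) = comp_a[d], comp_b[d]
        cross = gaussian_concordance(ma, va, mb, vb)
        self_a = gaussian_concordance(ma, va, ma, va)
        self_b = gaussian_concordance(mb, vb, mb, vb)
        overlap *= cross / math.sqrt(self_a * self_b)
    return overlap


def rank_subsets(comp_a, comp_b, n_vars, size):
    """Return (overlap, subset) pairs for all subsets of a given size, most discriminating first."""
    scored = [
        (normalized_overlap(comp_a, comp_b, s), s)
        for s in combinations(range(n_vars), size)
    ]
    return sorted(scored)


# Two hypothetical components over three variables: only variable 0 differs much in location.
component_a = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
component_b = [(3.0, 1.0), (0.1, 1.0), (0.0, 2.0)]
ranking = rank_subsets(component_a, component_b, n_vars=3, size=1)
```

Because each score is available in closed form, scanning many candidate subsets is cheap, which is the kind of property that makes a concordance-style measure practical for routine use.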
High dimensional data clustering and rare event detection
As the number of measured variables grows, there is an increasing need to consider structured, hierarchical models to enable sensitive inference on subpopulation structure. Moreover, as sample sizes increase we often face problems of masking of subtler substructure: model fitting can fail to identify “rare events” because the bulk of the data dominates the fit. Our work has introduced novel, hierarchical nonparametric Bayesian mixture models that address both problems. The key idea is to first partition the outcome variables into a set of subsets, typically guided by substantive contextual information. We then apply Bayesian nonparametric mixture models to the reduced-dimensional distribution of one selected subset of variables; this delivers classification/clustering in that marginal space. The marginal classification in turn induces a partition of the data, and a second level of mixture modeling is applied, in parallel, to a second subset of variables within each of the partitioned data sets. This can be repeated, hierarchically defining an overall product-mixture model within which each modeling exercise is developed in lower dimensions and with smaller data subsets. These latter features enable more sensitive isolation of fine substructure and, in particular, a focus on rare subpopulations.
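One way to write down the two-level case of this construction is sketched below. The notation is assumed for illustration and does not come from the underlying papers: x_A and x_B denote the first and second variable subsets, f_j and g_{jk} are component densities, and the sums are shown with finite truncation levels J and K_j even though the models are nonparametric.

```latex
p(x_A, x_B) \;=\; \sum_{j=1}^{J} \pi_j \, f_j(x_A)
  \left[ \sum_{k=1}^{K_j} \omega_{jk} \, g_{jk}(x_B) \right],
\qquad \sum_{j=1}^{J} \pi_j = 1,
\quad \sum_{k=1}^{K_j} \omega_{jk} = 1 \ \text{for each } j .
```

Under this form, a subpopulation indexed by (j, k) has overall weight \pi_j \omega_{jk}. That weight can be very small in the full data set while \omega_{jk} is still appreciable within cluster j, which is why fitting the second level within each partition can make rare subpopulations easier to isolate.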
High dimensional data clustering and graphical models
Data in high dimensions are often difficult to understand and visualize. Graphical models are frequently used to address these problems, taking advantage of the (conditional) independencies between subsets of variables that a graph representation encodes. In contrast to previous graphical model approaches, we propose a new Bayesian mixture model using binary trees, constructed with the goal of modeling the data structure in each individual dimension. Dependencies among the dimensions are captured by the tree structure, with Kingman's coalescent as the prior on that structure. We have developed an efficient MCMC algorithm for posterior inference on model parameters and tree structures.
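To give a sense of what a draw from a Kingman coalescent prior looks like, here is a minimal simulation sketch. It is not the model or the MCMC algorithm described above. It assumes the leaves of the tree are the data dimensions, and the function names are hypothetical. While k lineages remain, the waiting time to the next merge is exponential with rate k(k-1)/2, and the pair that merges is chosen uniformly at random.

```python
import random


def sample_kingman_tree(n_leaves, rng):
    """Draw one binary tree over `n_leaves` items from Kingman's coalescent.

    Returns (tree, merge_times): `tree` is a nested tuple of leaf indices and
    `merge_times` lists the time of each merge, measured back from the leaves.
    """
    lineages = list(range(n_leaves))
    merge_times = []
    time = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # With k lineages, each of the k*(k-1)/2 pairs coalesces at rate 1.
        time += rng.expovariate(k * (k - 1) / 2.0)
        i, j = sorted(rng.sample(range(k), 2))
        right = lineages.pop(j)  # pop the larger index first so i stays valid
        left = lineages.pop(i)
        lineages.append((left, right))
        merge_times.append(time)
    return lineages[0], merge_times


def leaves(tree):
    """Flatten a nested-tuple tree into a list of its leaf indices."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


# One prior draw for six hypothetical dimensions.
prior_tree, prior_times = sample_kingman_tree(6, random.Random(0))
```

Each draw is a binary tree with n - 1 internal nodes, and the merge order does not depend on how the leaves are labeled.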
Bayesian multiple hypothesis testing
A common problem in the analysis of single-cell assays is to identify subjects for whom the proportions of cells across marker combinations differ significantly between two experimental conditions. We are currently developing a Bayesian hierarchical framework for such multiple hypothesis testing based on a structured Dirichlet-Multinomial mixture model.
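The basic building block of such a test can be illustrated for a single subject. The sketch below is a simplified stand-in, not the structured hierarchical model under development. It assumes cell counts per marker combination under each condition and a common Dirichlet prior, and it compares "the two conditions share one set of proportions" against "each condition has its own proportions" through a Bayes factor. The function names and the prior are hypothetical.

```python
import math


def log_dirmult_marginal(counts, alpha):
    """Log marginal likelihood of category counts under a Dirichlet(alpha) prior.

    The multinomial coefficient is omitted because it cancels in the Bayes
    factor computed below.
    """
    total_alpha = sum(alpha)
    total_count = sum(counts)
    out = math.lgamma(total_alpha) - math.lgamma(total_alpha + total_count)
    for c, a in zip(counts, alpha):
        out += math.lgamma(a + c) - math.lgamma(a)
    return out


def log_bayes_factor(counts_1, counts_2, alpha):
    """Log Bayes factor for 'different proportions' versus 'same proportions'."""
    pooled = [c1 + c2 for c1, c2 in zip(counts_1, counts_2)]
    log_different = (
        log_dirmult_marginal(counts_1, alpha) + log_dirmult_marginal(counts_2, alpha)
    )
    log_same = log_dirmult_marginal(pooled, alpha)
    return log_different - log_same


# Hypothetical counts over two marker combinations, with a uniform Dirichlet prior.
prior = [1.0, 1.0]
lbf_similar = log_bayes_factor([50, 50], [50, 50], prior)
lbf_shifted = log_bayes_factor([90, 10], [10, 90], prior)
```

A positive log Bayes factor favors a difference between conditions and a negative one favors shared proportions. A hierarchical treatment would additionally share information across subjects, which this single-subject sketch does not attempt.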