The SSC is pleased to announce the line-up for the 2013 Spring SSC Seminar Series. In its 4th year, the lecture series provides participants with the opportunity to hear from leading scholars and experts who work in different applied areas, including business, biology, text mining, computer vision, economics, and public health.
The series is envisioned as a vital contribution to the intellectual, cultural, and scholarly environment at The University of Texas at Austin for students, faculty, and the wider community. Each talk is free of charge and open to the public. For more information, contact the SSC.
January 15, 2013—Jared Murray (Duke University, Department of Statistical Science)
"Semiparametric Bayesian Gaussian Copula Factor Models"
CBA 3.202, 1:00–2:00 PM
January 18, 2013—Sinead Williamson (Carnegie Mellon University, Machine Learning Department)
"Advances in Bayesian Nonparametrics"
CBA 3.202, 1:00–2:00 PM
January 22, 2013—Bailey Fosdick (University of Washington, Department of Statistics)
"Modeling Heterogeneity within Multiway Array Data"
CBA 3.202, 1:00–2:00 PM
January 25, 2013—Mingyuan Zhou (Duke University, Statistical Machine Learning)
"Nonparametric Bayesian Count and Mixture Modeling"
CBA 3.202, 1:00–2:00 PM
January 29, 2013—Corey Zigler (Harvard University, School of Public Health)
"Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model Averaged Causal Effects"
CBA 4.328, 1:00–2:00 PM
Jared Murray (Duke University, Department of Statistical Science)
Title: Semiparametric Bayesian Gaussian Copula Factor Models
Abstract: Gaussian factor models have proven widely useful for parsimoniously characterizing dependence in multivariate data. There is a rich literature on their extension to mixed categorical and continuous variables, using latent Gaussian variables, mixture distributions, or generalized latent trait models accommodating marginal distributions in the exponential family. However, existing methods have a number of drawbacks: choosing parametric models can be problematic, inference is often very difficult computationally, and the dependence parameters are typically confounded with the marginal distributions, which complicates interpretation.
To address these problems we propose a novel class of Bayesian Gaussian copula factor models which decouple the latent factors from the marginal distributions. A semiparametric specification for the marginal distributions based on the extended rank likelihood yields straightforward implementation and substantial computational gains. I will describe some new theoretical and empirical justifications for using this likelihood in Bayesian inference. I will also present suggestions for new default priors in copula and probit factor models, and demonstrate the efficacy of a parameter-expanded Gibbs sampler for the Gaussian copula factor model. Finally, I will illustrate the use of this model in two applications from political science.
Parts of this work are to appear in JASA Theory and Methods as Murray, J.S., Dunson, D.B., Carin, L. and Lucas, J.E. (2013) "Bayesian Gaussian Copula Factor Models for Mixed Data" (available from http://arxiv.org/abs/1111.0317).
Sinead Williamson (Carnegie Mellon University, Machine Learning Department)
Title: Advances in Bayesian Nonparametrics
Abstract: An important challenge in Bayesian machine learning is developing classes of models that are flexible enough to represent a wide range of possible data sets. It is often difficult to determine a priori the number of parameters needed to represent a data set, for example, the number of clusters in a mixture model. Nonparametric Bayesian methods provide an elegant and flexible framework for modeling data that neatly sidesteps questions of parameter cardinality. In this talk, I will give an overview of the challenges faced in developing and implementing nonparametric hierarchical models, giving examples from my own research. I will focus on three main aspects: the development of flexible and widely applicable nonparametric priors; the incorporation of such priors into application-specific hierarchical models; and the design of efficient inference algorithms.
Bailey Fosdick (University of Washington, Department of Statistics)
Title: Modeling Heterogeneity within Multiway Array Data
Abstract: Data that can be represented in the form of an array are present in many of the social and biological sciences. Examples include relational measurements over time or measurements of several variables over time and space. Regression models for such data often assume an independent error distribution, or an error model that allows for dependence along at most one or two dimensions of the data array. Failing to account for other dependencies can lead to inefficient estimates of regression parameters, inaccurate standard errors, and poor predictions. Previously developed models for estimating dependence within an array are non-stochastic and difficult to interpret, or require a large number of parameters, prohibiting likelihood-based inference for some arrays.
In this talk I will introduce a model called Separable Factor Analysis (SFA), which is parameterized by a factor-analytic structured covariance matrix for each dimension of the array. This model can be viewed as an extension of factor analysis to array-valued data, as it uses a factor model to estimate the covariance along each array dimension. I will discuss properties of this model as they relate to ordinary factor analysis, describe maximum likelihood and Bayesian estimation methods, and propose a likelihood ratio testing procedure for selecting the factor model ranks. This methodology will be illustrated in the analysis of data from the Human Mortality Database and will be shown to outperform simpler covariance models in a cross-validation experiment.
Mingyuan Zhou (Duke University, Statistical Machine Learning)
Title: Nonparametric Bayesian Count and Mixture Modeling
Abstract: The web and related sources manifest data of unprecedented scale, dimensionality, diversity, and complexity. This poses considerable challenges to conventional approaches of statistical modeling. Bayesian nonparametrics constitute a promising research direction, in that such techniques can fit the data with a model that can grow with complexity to match the data. Moving beyond Gaussian processes and Dirichlet processes, we consider nonparametric Bayesian modeling with completely random measures, a family of pure-jump stochastic processes with nonnegative increments. In this talk, I will show a wide variety of successful applications of our nonparametric Bayesian hierarchical models to real problems in science and engineering, including count modeling, text analysis, image processing, compressive sensing and computer vision. In particular, I will present the negative binomial process, a novel nonparametric Bayesian prior that unites the seemingly disjoint problems of count and mixture modeling. I will present augmentation and marginalization techniques unique to the negative binomial process, making it easy to construct and amenable to posterior computation. I will also present dictionary learning for sparse image representation using the beta process and the dependent hierarchical beta process. In addition, I will show our recent research on multivariate count and mixture modeling built on the gamma process, the negative binomial process, a latent factor model and the Polya-Gamma distribution, which is ideal for analyzing data with multiple modalities.
Corey Zigler (Harvard University, School of Public Health)
Title: Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model Averaged Causal Effects
Abstract: Causal inference with observational data frequently relies on the notion of the propensity score (PS) to adjust treatment comparisons for observed confounding factors. As comparative effectiveness research in the era of “big data” increasingly relies on large and complex collections of administrative resources, researchers are frequently confronted with decisions regarding which of a high-dimensional covariate set to include in the PS model in order to satisfy the assumptions necessary for estimating average causal effects. Typically, simple or ad-hoc methods are employed to arrive at a single PS model, without acknowledging the uncertainty associated with the model selection. We propose Bayesian methods for PS variable selection and model averaging that 1) select relevant variables from a set of candidate variables to include in the PS model and 2) estimate causal treatment effects as weighted averages of estimates under different PS models. The associated weight for each PS model reflects the data-driven support for that model’s ability to adjust for the necessary variables. We illustrate features of our proposed approaches with a simulation study, and ultimately use our methods to compare the effectiveness of treatments for brain tumors among Medicare beneficiaries.
Chong Wang (Carnegie Mellon University, Machine Learning Department)
Title: New Probabilistic Models for Document Recommendation and Exploration
Abstract: How can we help people quickly navigate the vast amount of data and acquire useful knowledge from it? Recommender systems provide a promising solution to this problem. They narrow down the search space by providing a few recommendations that are tailored to users' personal preferences. However, these systems usually work like a black box, limiting further opportunities to provide more exploratory experiences to their users.
In this talk, I will describe a new document recommender system that is both predictive and interpretable. It enjoys the advantages of probabilistic topic models and matrix factorization, but outperforms each of them in terms of predictive performance. Furthermore, with built-in interpretable dimensions from topic models, it is more transparent than traditional recommender systems and can create many opportunities for exploratory analysis. For example, a user can manually adjust her preferences and the system responds by changing its recommendations. It can also form recommendations about sparsely read or previously unread documents. Finally, I will describe how we tackle the large-scale computational challenges for probabilistic models like this.
Pierpaolo De Blasi (University of Torino and Collegio Carlo Alberto, Italy)
Title: Bayesian Nonparametric Estimation of the Discrepancy with Misspecified Parametric Models
Abstract: We consider a Bayesian semi-parametric model where we have made specific requests about the parameter values to be estimated. The aim is to find the parameter of a parametric family which minimizes a distance to the data generating density and then to estimate the discrepancy using nonparametric methods. We illustrate how coherent Bayesian updating can proceed given that the formal Bayesian posterior is not appropriate due to the non-identification of the model parameters. Bayesian updating is performed using MCMC methods, and in particular a novel method for dealing with intractable normalizing constants is required. Illustrations using synthetic data are provided.
Joint work with Stephen Walker.
Chris Hans (Ohio State University, Department of Statistics)
Title: Structuring Dependence in Regression: Radius Mixtures of Spherically Uniform Priors
Abstract: We investigate prior distributions that are designed to incorporate information about the strength of a regression relationship. The most commonly used prior distributions for regression models typically assume that coefficients are a priori independent or induce dependence via the empirical design matrix. While these standard priors (and recently refined versions of them) may exhibit desirable behavior with respect to targeted inferential goals, we should not expect them to distribute probability throughout the entire parameter space in a way that is consistent with all of our prior beliefs. Examination reveals that when we focus on the strength of the regression relationship, standard priors place nearly all of their mass in regions of the parameter space that are not only inconsistent with reasonable prior belief but are nearly certain to clash so greatly with the likelihood that we might question the validity of particular inferences.
We describe a new class of priors that allows one to directly incorporate information about the strength of the regression relationship. We compare the Bayesian model uncertainty properties of our priors with those of standard priors, highlighting the consequences of inappropriately ignoring prior information when it is indeed available, and highlighting the consequences of unintentionally incorporating strong prior information when it does not exist. We describe MCMC algorithms that scale well with model size and require minimal storage by using a fixed-dimensional parameterization across models of different sizes. We discuss several strategies for improving MCMC output-based estimation using the structure of the posterior.
Lurdes Inoue (University of Washington, School of Public Health)
Title: Modeling Disease Prognosis and Progression of Prostate Cancer
Abstract: In the first part of this talk we will discuss approaches for modeling the natural history of disease progression using longitudinal data. In particular, we present models for grade progression in active surveillance studies. The proposed models are assessed with a simulation study and applied to data from the Johns Hopkins active surveillance cohort. We show that the proposed models can substantially improve inferences about the timing of disease grade progression while accounting for the uncertainty in the biopsy sensitivity and specificity, the variability in the biomarker growth and the serial correlation in the observations.
In the second part of this talk, we investigate the survival benefit implied by prognostic models where the predictor(s) of disease-specific survival are age and/or biomarker level at disease detection. We show that the benefit depends on the rate of biomarker change, the lead time, and the biomarker level at the original date of diagnosis as well as on the parameters of the prognostic model. Even if the prognostic model indicates that lowering the threshold of the biomarker is associated with longer disease-specific survival, this does not necessarily imply that early detection will confer an extension of life expectancy.
Vanja Dukic (University of Colorado Boulder, Department of Applied Mathematics)
Title: Tracking Epidemics with Google Flu Trends Data and a State-Space SEIR Model
Abstract: In this talk we use Google Flu Trends data together with a sequential surveillance model based on the state-space methodology to track the evolution of an epidemic process over time. We embed a classical mathematical epidemiology model (a susceptible-exposed-infected-recovered (SEIR) model) within the state-space framework, thereby allowing the classic SEIR dynamics to change through time. The implementation of this model is based on a particle filtering algorithm, which learns about the epidemic process sequentially through time, and provides updated estimates of epidemic parameters and states with each new surveillance data point. We show how this approach, in combination with sequential Bayes factors, can serve as an on-line diagnostic tool for an influenza pandemic. We take a close look at the Google Flu Trends data describing the spread of flu in the US during 2003–2009.
Prakash Laud (Medical College of Wisconsin, Division of Biostatistics)
Title: Model Based Methods in Comparative Effectiveness Research Using Instrumental Variables
Abstract: Comparative Effectiveness Research aims to determine which of two medical treatments for a condition results in a better outcome. While randomized controlled trials address this question directly, these methods are not feasible in every situation. This leads to the question of how to use observational data in which patients are not randomized to treatment, but make a choice in consultation with medical professionals.
In this talk we will first discuss how instrumental variables provide one method of alleviating the statistical difficulties arising from nonrandomized treatment choice. We will then describe model- and likelihood-based Bayesian methods that lead to better estimation, even while relaxing distributional assumptions. Following this we will address the use of instrumental variables with binary outcome and treatment, and with time-to-event outcomes. We will also indicate how it is possible to compute so-called causal quantities of interest from these models.
Jim Hobert (University of Florida, Department of Statistics)
Title: Convergence Analysis of the Gibbs Sampler for Bayesian General Linear Mixed Models with Improper Priors
Abstract: A popular default prior for the general linear mixed model is an improper prior that takes a product form with a flat prior on the regression parameter and so-called power priors on each of the variance components. I will describe a convergence rate analysis of the Gibbs samplers associated with these Bayesian models. The main result is a simple, easily checked sufficient condition for geometric ergodicity of the Gibbs Markov chain. (This is joint work with Jorge Román and Brett Presnell.)
Richard Hahn (The University of Chicago, Booth School of Business)
Title: DSS: Decoupled Shrinkage and Selection in linear models
Abstract: We propose a new variable selection approach from a fully Bayesian decision theory viewpoint. The method draws an explicit distinction between actions and inferences, effectively dealing with the trade-off associated with the competing goals of predictive generalization and interpretability. By decoupling posterior learning from model reporting, our approach creates a flexible framework where continuous shrinkage priors can be used but "sparse solutions" can be obtained. The method generalizes straightforwardly to the GLM setting.
Dave Stephens (McGill University, Department of Mathematics and Statistics)
Title: Bayesian approaches to causal inference: a lack-of-success story
Abstract: Despite almost universal acceptance across most fields of statistics, Bayesian inferential methods have yet to break through to widespread use in causal inference, even though Bayesian arguments were a core component of early developments in the field. Some quasi-Bayesian procedures have been proposed, but often these approaches rely on heuristic, sometimes flawed, arguments. In this talk I will discuss some formulations of classical causal inference problems from the perspective of standard Bayesian representations, and propose some inferential solutions.
Sayan Mukherjee (Duke University, Department of Statistical Science)
Title: Modeling Quantitative Phenotypes
Abstract: In this talk we consider two problems in modeling quantitative phenotypes. The first problem is estimating the genetic covariance matrix (G-matrix) of high-dimensional traits. The second problem involves measuring distances between bones (two-dimensional surfaces embedded in three dimensions).
Problem 1: Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression, where the number of traits assayed per individual can reach many thousands. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed effects model.
Problem 2: I will discuss a method to measure distances between surfaces, such as bones, when the surfaces are qualitatively different, for example, when they are not isomorphic. The method uses ideas from computational topology and places them in a probabilistic framework.
Wesley Johnson (University of California, Irvine, Department of Statistics)
Title: Bayesian Nonparametric Longitudinal Data Analysis with Embedded Autoregressive Structure: Application to Hormone Data
Abstract: We develop a novel Dirichlet Process Mixture model for irregular longitudinal data. The model mixes on the two parameters of the traditional Ornstein-Uhlenbeck process with exponential covariance function and thus allows for the possibility of multiple groups with distinct autoregressive covariance structures. We illustrate the use of the model to track hormone curve data through the menopausal transition, and we also test the model on simulated data to check its performance in estimating both mean functions and a variety of covariance structures.
