The SSC is pleased to announce the line-up for the 2012 Spring SSC Seminar Series. Now in its third year, the lecture series gives participants the opportunity to hear from leading scholars and experts who work in different applied areas, including business, biology, text mining, computer vision, and economics.
The series is envisioned as a vital contribution to the intellectual, cultural, and scholarly environment at The University of Texas at Austin for students, faculty, and the wider community. Each talk is free of charge and open to the public.
January 11, 2012—Cosma Shalizi (Carnegie Mellon University, Department of Statistics)
"When Can We Learn Network Models from Samples?"
WCH 1.108/1.110, 2:00–3:00 PM
January 18, 2012—Paul Scheet (M.D. Anderson, Department of Epidemiology)
"Haplotype-based Discovery of Subtle Allelic Imbalance with SNP Arrays"
CBA 6.420, 1:00–2:00 PM
January 23, 2012—Jichun Xie (Temple University, Fox School of Business)
"Covariate Adjusted Precision Matrix Estimation"
CBA 6.420, 10:00–11:00 AM
February 6, 2012—Lauren Hannah (Duke University, Department of Statistical Science)
"Multivariate Convex Regression"
WCH 1.108, 1:00–2:00 PM
February 8, 2012—Jacob Bien (Stanford University, Department of Statistics)
"Sparse Hierarchical Interactions"
CBA 6.420, 1:00–2:00 PM
February 13, 2012—Anirban Bhattacharya (Duke University, Department of Statistical Science)
"Bayesian Latent Factor Modeling and Dimensionality Reduction"
WCH 1.108, 1:00–2:00 PM
March 21, 2012—Roman Jandarov (Pennsylvania State University, Department of Statistics)
"Inference with Implicit Likelihoods for Infectious Disease Models"
CBA 6.420, 1:00–2:00 PM
March 23, 2012—Larry Carin (Duke University, Department of Electrical and Computer Engineering)
"Inferring Latent Structure from Mixed Real and Categorical Relational Data"
CBA 4.304, 2:00–3:00 PM
March 30, 2012—David Blei (Princeton University, Department of Computer Science)
"Scalable Topic Models and Stochastic Variational Inference"
NHB 1.720, 1:00–2:00 PM
Cosma Shalizi (Carnegie Mellon University, Department of Statistics)
Title: When Can We Learn Network Models from Samples?
Abstract: Statistical models of network structure are models for the entire network, but the data is typically just a sampled sub-network. Parameters for the whole network, which are what we care about, are estimated by fitting the model on the sub-network. This assumes that the model is "consistent under sampling" (forms a projective family). For the widely-used exponential random graph models (ERGMs), this trivial-looking condition is violated by many popular and scientifically appealing models; satisfying it drastically limits ERGMs' expressive power. These results are special cases of more general ones about exponential families of dependent variables, which we also prove. As a consolation prize, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses.
Joint work with Alessandro Rinaldo.
Paper: http://arxiv.org/abs/1111.3054
Paul Scheet (M.D. Anderson, Department of Epidemiology)
Title: Haplotype-based Discovery of Subtle Allelic Imbalance with SNP Arrays
Abstract: The majority of a human’s genome is diploid and thus, at heterozygous loci, allelic variation is balanced. In studies of human cancer, allelic imbalance may occur in samples of non-homogeneous cells, e.g. due to mixtures of normal tissue and tumor cells exhibiting deletions or loss of heterozygosity. Methods have been proposed to detect the presence of aberrant cells using DNA microarrays. However, these methods fail to account for the fact that an entire chromosome (or segment) will be in relative abundance, which induces a dependence of the alleles in relative abundance among nearby loci. To address this, we test whether within-mixture allele frequencies are consistent with likely configurations of an individual’s germ line haplotypes, which we model or estimate statistically. Using a publicly available data set of tumor and normal samples from the same individual, we demonstrate that we can detect tumor tissue at levels much lower than existing methods can (e.g. 4%, cf. 10-20%). Finally, in cases where we do not have a priori knowledge of the tumor genome, we apply a hidden Markov model to detect low levels of chromosomal aberrations and the event boundaries.
Jichun Xie (Temple University, Fox School of Business)
Title: Covariate Adjusted Precision Matrix Estimation
Abstract: A key problem in biomedical research is to elucidate the gene regulatory networks underlying complex traits such as common human diseases. In genetical genomics (eQTL) experiments, gene expression levels are often treated as quantitative traits that are subject to genetic analysis. These data can also provide important information on gene regulation and genetic networks. In this talk, I introduce a sparse high-dimensional multivariate regression model for studying the conditional independence relationships among a set of genes after adjusting for possible genetic effects, as well as the genetic architecture that influences gene expression. I present a covariate adjusted precision matrix estimation method (CAPME), which can be easily implemented by linear programming. Asymptotic convergence rates and sign consistency are established for estimators of the regression coefficients and the precision matrix. Numerical performance of the estimator is investigated using both simulated and real data sets. Simulation results show that CAPME yields substantial improvements in both estimation and graph structure selection. We apply CAPME to the analysis of a yeast eQTL data set in order to identify the gene regulatory network among a set of genes in the MAPK pathway. In addition, I will discuss the analysis of multi-tissue eQTL data and the simultaneous estimation of multiple precision matrices with similar structures.
Lauren Hannah (Duke University, Department of Statistical Science)
Title: Multivariate Convex Regression
Abstract: Regression problems with a convexity constraint on the mean function are common in economics, financial engineering, operations research and electrical engineering. In a purely regression setting, convexity constraints can increase predictive accuracy compared to unconstrained regression. In a convex optimization setting, convex regression can be used to approximate objective functions and constraints. However, current convex regression methods are computationally infeasible for moderate to large problems in a multivariate setting. We introduce two new methods for multivariate convex regression, a Bayesian and a frequentist version. We give consistency results for both methods and show adaptive convergence rates for the Bayesian method. We apply the methods to value function approximation for sequential decision problems including response surface estimation and pricing American basket options.
Jacob Bien (Stanford University, Department of Statistics)
Title: Sparse Hierarchical Interactions
Abstract: Building predictive interaction models is a challenging problem, especially when the number of variables is large. Statisticians commonly demand that an interaction only be included in a model if both variables are marginally important. We study the problem of identifying hierarchical two-way interaction models from the viewpoint of the Lasso (i.e., L1-penalized regression). We show that by adding a set of convex constraints to the Lasso problem, we can produce sparse interaction models that honor the hierarchy restriction. In contrast to stepwise procedures that are most commonly used for building interaction models, our formulation is convex, and its solution is completely characterized by a set of optimality conditions. This makes it easier to study as a statistical estimator. We argue that restricting to hierarchical interactions can be advantageous both statistically and computationally. Our proposal extends more generally to any problem in which "hierarchical sparsity" is desired (i.e., one parameter is forced to be zero if another is zero). For example, in (univariate) polynomial regression, hierarchical sparsity yields low-order polynomial fits. Underlying our work is the observation that there is more to interpretability and simplicity in a model than sparsity alone.
Anirban Bhattacharya (Duke University, Department of Statistical Science)
Title: Bayesian Latent Factor Modeling and Dimensionality Reduction
Abstract: Factor models are widely used for lower-dimensional representation of high-dimensional observations. We shall first discuss a novel prior construction for Gaussian factor models, with applications in prediction and variable selection with high-dimensional correlated predictors. While Gaussian factor models can be easily generalized to accommodate binary, count and ordered categorical variables, they lead to challenging computation and complex modeling structure in the case of unordered categorical data. We propose a novel class of simplex factor models as an alternative. Akin to Gaussian factor models, the joint pmf of the high-dimensional categorical observations factorizes conditional on fewer latent factors, leading to a parsimonious decomposition of the joint probability. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Using a Bayesian approach for computation and inference, we propose an MCMC algorithm that scales well with increasing dimension, with the number of factors treated as unknown. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical predictors.
Matt Hoffman (Columbia University)
Title: Making Bayesian Inference Faster and Easier
Abstract: Analyzing data using hierarchical Bayesian models almost always requires approximate posterior inference techniques such as Markov chain Monte Carlo (MCMC) or variational Bayes (VB). These methods can be challenging to apply to complex models or large datasets. For example, popular MCMC methods such as Gibbs sampling can be very slow when applied to complex models with many parameters. VB is often much faster, but introduces bias and is less generally applicable. And any "batch" inference method will be unacceptably slow when applied to a sufficiently large dataset. In this talk, I will present two algorithms for efficient Bayesian inference: online VB and the no-U-turn sampler (NUTS). Online VB incrementally fits an approximation to the posterior, considering only a subset of the full dataset at each iteration. When applied to the latent Dirichlet allocation model, online VB is able to discover a set of topics from millions of Wikipedia documents in a fraction of the time needed by batch algorithms. NUTS is an MCMC algorithm that extends the Hamiltonian Monte Carlo (HMC) algorithm. HMC can be orders of magnitude faster than Gibbs sampling, but it requires careful problem-specific tuning. NUTS both eliminates the need to hand-tune HMC and improves upon HMC's efficiency, making it easier to fit complex models quickly.
Hedibert Lopes (University of Chicago, Booth School of Business)
Title: Examining the Effect of Early-life Conditions and Education on Health via Parsimonious Bayesian Factor Analysis when Number of Factors is Unknown
Abstract: We introduce a new and general set of identifiability conditions for factor models which handles the ordering problem associated with current common practice. In addition, the new class of parsimonious Bayesian factor analysis (PBFA) leads to a factor loading matrix representation that provides an intuitive and easy-to-implement factor selection scheme. We argue that structuring the factor loadings matrix is in concordance with recent trends in applied factor analysis. The methodology is successfully implemented when examining the effect of early-life conditions and education on health. More specifically, our PBFA is applied to the 1970 British Cohort Study within a life course framework to analyze the effect of child cognition, mental and physical health on education and adult economic and health outcomes in a model where individuals select into education on the basis of their gains. We provide evidence that a misspecification of the latent factor structure leads to an incorrect assessment of both the importance of early-life traits in influencing later-life outcomes and of the heterogeneity in the causal effects of education on health.
Veera Baladandayuthapani (M.D. Anderson, Department of Biostatistics)
Title: Bayesian Nonparametric Functional Models for High-dimensional Genomics Data
Abstract: Due to rapid technological advances, various types of genomic, epigenomic, transcriptomic and proteomic data with different sizes, formats, and structures have become available. These experiments typically yield data consisting of high-resolution genetic changes of hundreds or thousands of markers across the whole chromosomal map. Modeling and inference in such studies are challenging, not only due to high dimensionality, but also due to the presence of structured dependencies (e.g. serial and spatial correlations). Using genome continuum models as a general principle, we present a class of Bayesian methods to model these genomic profiles using functional data analysis approaches. Our methods allow for simultaneous characterization of these high-dimensional functions using non-parametric basis functions, joint modeling of spatially correlated functional data and detection of local features in spatially heterogeneous functional data – to answer several important biological questions. We illustrate our methodology using several real and simulated datasets and also propose methods to integrate various types of genomics data.
Luis Nieto (Instituto Tecnológico Autónomo de México, Departamento de Estadística)
Title: Bayesian Analysis of Functional Proteomics Profiles
Abstract: Using a new type of array technology, the reverse phase protein array (RPPA), we measure time-course protein expression for a set of selected markers that are known to co-regulate biological functions in a pathway structure. To accommodate the complex dependent nature of the data, including temporal correlation and pathway dependence for the protein markers, we propose a mixed effects model with temporal and protein-specific components. We develop a sequence of random probability measures (RPM) to account for the dependence in time of the protein expression measurements. We also acknowledge the pathway dependence among proteins via a conditionally autoregressive (CAR) model. Applying our model to the RPPA data, we reveal a pathway-dependent functional profile for the set of proteins as well as marginal expression profiles over time for individual markers.
Larry Carin (Duke University, Department of Electrical and Computer Engineering)
Title: Inferring Latent Structure from Mixed Real and Categorical Relational Data
Abstract: We consider analysis of relational data (a matrix), in which the rows correspond to subjects (e.g., people) and the columns correspond to attributes. The elements of the matrix may be a mix of real and categorical. Each subject and attribute is characterized by a latent binary feature vector, and an inferred matrix maps each row-column pair of binary feature vectors to an observed matrix element. The latent binary features of the rows are modeled via a multivariate Gaussian distribution with low-rank covariance matrix, and the Gaussian random variables are mapped to latent binary features via a probit link. The same type of construction is applied jointly to the columns. The model infers latent, low-dimensional binary features associated with each row and each column, as well as the correlation structure between all rows and between all columns. The Bayesian construction is successfully applied to real-world data, demonstrating an ability to infer meaningful low-dimensional structure from high-dimensional relational data.
Roman Jandarov (Pennsylvania State University, Department of Statistics)
Candidate for the Sheldon Ekland-Olson Postdoctoral Fellowship
Title: Inference with Implicit Likelihoods for Infectious Disease Models
Abstract: Probabilistic models for infectious disease dynamics are useful for understanding the mechanism underlying the spread of infection. When the likelihood function for these models is expensive to evaluate, traditional likelihood-based inference may be computationally intractable. Furthermore, traditional inference may lead to poor parameter estimates and the fitted model may not capture important biological characteristics of the observed data. In this talk, I describe a novel approach for resolving these issues that is inspired by recent work in emulation and calibration for complex computer models. Using our motivating example, the gravity time series susceptible-infected-recovered (TSIR) model for measles dynamics, I demonstrate that the new approach is computationally expedient, provides accurate parameter inference, and results in a good model fit. The approach focuses on the characteristics of the process that are of scientific interest. We find a Gaussian process approximation to the gravity model using key summary statistics obtained from model simulations. The method is widely applicable to problems where traditional likelihood-based inference is computationally intractable or produces a poor model fit. It is also an alternative to approximate Bayesian computation (ABC) when simulations from the model are expensive. I will also discuss how our methodology is useful for inference in mixed membership random graph models for affiliation networks.
At the end of the talk I will briefly describe two other projects I have worked on, one on modeling meningitis transmission and the other on estimating periodicities in gypsy moth outbreaks.
David Blei (Princeton University, Department of Computer Science)
Title: Scalable Topic Models and Stochastic Variational Inference
Abstract: Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents.
Most topic models are based on hierarchical mixed-membership models, where each document expresses a set of components (called topics) with individual per-document proportions. The computational problem is to condition on a collection of observed documents and estimate the posterior distribution of the topics and per-document proportions. In modern data sets, this amounts to posterior inference with billions of latent variables.
How can we cope with such data? In this talk, I will describe stochastic variational inference, an algorithm for computing with topic models that can handle very large document collections (and even endless streams of documents). I will demonstrate our algorithm with models fitted to millions of articles. I will show how stochastic variational inference can be generalized to many kinds of hierarchical models, including models of images and social networks, and Bayesian nonparametric models. I will highlight several open questions and outstanding issues.
Nikos Karampatziakis (Cornell University, Department of Computer Science)
Candidate for the Sheldon Ekland-Olson Postdoctoral Fellowship
Title: Large Scale Agnostic Active Learning
Abstract: I will present an active learning algorithm that is theoretically sound in an agnostic setting, empirically effective, and as efficient as standard online learning algorithms. One of the ingredients that enables this is a new type of online update that is based on some closed form solutions of ordinary differential equations. This allows us to soundly and effectively decide whether the benefit of exploration in active learning outweighs its cost at a scale of 10^6 examples/second. This is joint work with John Langford.
David Madigan (Columbia University, Department of Statistics)
Title: Statistical Methods for Drug Safety Surveillance
Abstract: Regulators such as the U.S. Food and Drug Administration have elaborate, multi-year processes for approving new drugs as safe and effective. Nonetheless, in recent years, several approved drugs have been withdrawn from the market because of serious and sometimes fatal side effects. We describe statistical methods for post-approval data analysis that attempt to detect drug safety problems as quickly as possible. Bayesian approaches are especially useful because of the high dimensionality of the data, and, in the future, for incorporating disparate sources of information.
Abel Rodríguez (University of California, Santa Cruz, Department of Applied Mathematics & Statistics)
Title: Modeling and Analysis of Trading Networks
Abstract: We explore the structure of trading networks arising in the NYMEX natural gas futures market. Our main aim is to understand how the structure of trading networks is affected by the introduction of an electronic trading platform. In addition to standard descriptors of network topology such as degree distributions and clustering coefficients, we study the effect of electronic trading platforms on the underlying community structure associated with these networks. This is accomplished through the use of stochastic blockmodels, which extend the ideas behind clustering algorithms to network data. Stochastic blockmodels allow us to identify traders that play similar roles in the market and that engage in similar trading strategies. Our results suggest that electronic platforms dramatically affect the topology of trading networks, possibly increasing their fragility.
Piyush Rai (University of Utah, Department of Computer Science)
Title: Nonparametric Bayesian Models - Learning Latent Features, Predictive Structures, and Efficient Inference
Abstract: Nonparametric Bayesian methods offer a flexible modeling paradigm for data without limiting the model complexity a priori. Nonparametric sparse latent feature models such as the Indian Buffet Process (IBP) are one such example: they allow data to be expressed in terms of a small set of latent features, without having to specify the number of latent features beforehand. In this talk, I will describe some of my work on nonparametric Bayesian learning of low-dimensional latent structures from high-dimensional data. In particular, I will talk about (1) a nonparametric Bayesian sparse latent factor model that allows the latent factors to be related to each other via an a priori unknown hierarchy (akin to sparse coding with hierarchically related dictionary atoms), (2) a nonparametric Bayesian multi-task learning model that learns shared, latent predictive structures from multiple learning tasks (by learning the group structure of the tasks, and simultaneously learning "task dictionaries" for tasks within each group), and (3) an efficient, search-based inference method for finding approximate maximum-a-posteriori (MAP) solutions in nonparametric Bayesian sparse latent feature models.
Raymond Carroll (Texas A&M, Institute for Applied Mathematics and Computational Science)
Title: What Percentage of Children in the U.S. are Eating a Healthy Diet? A Statistical Approach
Abstract: In the United States the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which is impossible to measure. Thus, usual dietary intake is assessed with considerable measurement error. Also, diet represents numerous foods, nutrients and other components, each of which has distinctive attributes. Sometimes, it is useful to examine intake of these components separately, but increasingly nutritionists are interested in exploring them collectively to capture overall dietary patterns and their effect on various diseases. Consumption of these components varies widely: some are consumed by almost everyone every day, while others are episodically consumed so that 24-hour recall data are zero-inflated. In addition, they are often correlated with each other. Finally, it is often preferable to analyze the amount of a dietary component relative to the amount of energy (calories) in a diet because dietary recommendations often vary with energy level.
We propose the first model appropriate for this type of data, and give the first workable solution to fit such a model. After describing the model, we use survey-weighted MCMC computations to fit the model, with uncertainty estimation coming from balanced repeated replication. The methodology is illustrated through an application to estimating the population distribution of the Healthy Eating Index-2005 (HEI-2005), a multi-component dietary quality index involving ratios of interrelated dietary components to energy, among children aged 2-8 in the United States. We pose a number of interesting questions about the HEI-2005, and show that it is a powerful predictor of the risk of developing colorectal cancer.
Jason Abrevaya (The University of Texas at Austin, Department of Economics)
Title: Missing Data in Panel (Longitudinal) Data Models
Abstract: Missing data is prevalent throughout empirical economics research, but remarkably little attention is paid to missing-data methodology within economics. This talk will consider some simple approaches to dealing with missing data in panel data models. Under missing-at-random (MAR) type assumptions, we will discuss methods that involve linear projections, method of moments estimation, and/or minimum distance estimation.
Mike Daniels (University of Florida, Department of Statistics)
Title: Bayesian Inference for Incomplete Data with Applications to Infectious Diseases
Abstract: We discuss general issues for Bayesian inference in the presence of incomplete data including model selection and lack of identifiability of parameters of interest. For the former, we propose several appropriate criteria for model selection and for the latter, we discuss construction of informative prior distributions. We illustrate these ideas in the context of infectious diseases with two short case studies.
Joint work with Arkendu Chatterjee (PhD student at UF), Chenguang Wang (JHU), Yang Yang (UF), Betz Halloran (UW), Ira Longini (UF)