# Thursday, 05/30/2019, Time: 2:00PM – 3:00PM Statistical Methods for Bulk and Single-cell RNA Sequencing Data

Gonda 1-357 (1st Floor Seminar Room)

Wei Vivian Li

UCLA Department of Statistics

Since their invention, next-generation RNA sequencing (RNA-seq) technologies have become a powerful tool for studying the presence and quantity of RNA molecules in biological samples, and they have revolutionized transcriptomic studies of both bulk tissues and single cells. The first part of the talk will focus on the statistical challenges involved in the transcript-level analysis of bulk RNA-seq data. I will introduce two statistical methods developed to improve full-length RNA transcript identification and quantification. The second part of the talk will address the statistical challenges of improving single-cell RNA-seq data analysis. I will discuss the imputation of single-cell RNA-seq data to assist downstream analyses and briefly introduce a statistical tool for rational single-cell RNA-seq experimental design.

# Tuesday, 05/21/2019, Time: 2:00PM Shared Subspace Models for Multi-Group Covariance Estimation

Royce 156

Alexander Franks, Assistant Professor

Department of Statistics and Applied Probability, UCSB

We develop a model-based method for evaluating heterogeneity among several p × p covariance matrices when the number of features is larger than the sample size. This is done by assuming a spiked covariance model for each group and sharing information about the space spanned by the group-level eigenvectors. We use an empirical Bayes method to identify a low-dimensional subspace that explains variation across all groups, and an MCMC algorithm to estimate the posterior uncertainty of eigenvectors and eigenvalues on this subspace. The implementation and utility of our model are illustrated with analyses of high-dimensional multivariate gene expression data.

Alex Franks is an Assistant Professor in the Department of Statistics and Applied Probability at the University of California, Santa Barbara. His research interests include covariance estimation, multivariate analysis and high dimensional data, causal inference, missing data, and errors-in-variables models. His applied research interests include computational and statistical modeling of “omics” data. He is also an active member of the XY Research group, which conducts research in sports statistics with a focus on player-tracking data.

# Tuesday, 05/14/2019, Time: 2:00PM – 3:15PM How Small Are Our Big Data: Turning the 2016 Surprise into a 2020 Vision

Kinsey Pavilion 1240B

Xiao-Li Meng, Whipple V. N. Jones Professor of Statistics

Harvard University

A UCLA Statistics and Biostatistics Joint Seminar

The term “Big Data” emphasizes data quantity, not quality. However, most statistical and data analytical methods, and their measures of uncertainty, are derived under the assumption that the data are of some designed (and desired) quality, e.g., that they can be viewed as probabilistic samples. We show that a seemingly negligible deviation from this assumption can make the effective sample size of a “Big Data” set vanishingly small. Without understanding this phenomenon, “Big Data” can do more harm than good, because drastically inflated precision assessments lead to gross overconfidence, setting us up to be caught by surprise when reality unfolds, as we all experienced during the 2016 US presidential election. Data from the Cooperative Congressional Election Study (CCES, conducted by Stephen Ansolabehere, Douglas Rivers and others, and analyzed by Shiro Kuriwaki) are used to assess the data quality in 2016 US election polls, with the aim of gaining a clearer vision for the 2020 election and beyond. A key ingredient of this assessment is the “data defect index” (ddi), a measure of individual response behaviors to survey questions. We discuss how behavioral research can help to reduce ddi when it is infeasible to control it through traditional probabilistic sampling strategies.
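To give a sense of scale for the effective-sample-size claim, here is a minimal Python sketch. It is not taken from the abstract: it assumes the approximation from Meng's published work on this topic, n_eff ≈ f / (1 − f) · 1 / ρ², where f = n / N is the fraction of the population in the data set and ρ is the correlation between being in the data set and the quantity being measured. The function name and all numbers below are hypothetical and chosen only for illustration.

```python
def effective_sample_size(n: int, population: int, rho: float) -> float:
    """Approximate size of a simple random sample with comparable error.

    Assumed approximation (not stated in the abstract):
        n_eff ~= f / (1 - f) * 1 / rho**2,  with f = n / population.
    """
    if not 0 < n < population:
        raise ValueError("need 0 < n < population")
    if rho == 0:
        raise ValueError("rho must be nonzero for this approximation")
    f = n / population
    return (f / (1.0 - f)) / (rho ** 2)


# Hypothetical inputs: 2.3 million responses out of 230 million people (f = 1%),
# with a seemingly negligible selection correlation of 0.005.
n_eff = effective_sample_size(n=2_300_000, population=230_000_000, rho=0.005)
print(round(n_eff))  # roughly 400
```

Under these assumed inputs, millions of responses carry about as much information as a simple random sample of a few hundred people, which is the sense in which a small deviation from probabilistic sampling can dominate sheer data quantity.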

# Tuesday, 05/07/2019, Time: 2:00PM Computer Vision for Communication Research: Predicting Political Ideology from Social Media Photographs

Royce 156

Jungseock Joo, Assistant Professor

UCLA Department of Communication

Advanced machine learning and scalable data analytic methods have been increasingly adopted by social science researchers in the recent literature. While the majority of existing approaches in computational social science rely on numeric, network or text data, a few recent studies have demonstrated that computer vision methods can be applied to automatically analyze visual content data in social and mass media and facilitate quantitative investigations in many areas. Such methods dramatically enlarge the scale and scope of data analysis and open up new opportunities to social science researchers. In this talk, I will begin by reviewing the existing literature on visual communication, where communication scholars used close reading or manual coding, and then discuss recent developments of computational approaches to these research questions. Specifically, I will use our recent study on political ideology prediction from Facebook photographs posted by politicians to demonstrate the utility of computer vision in these areas. I will also discuss current limitations of machine learning models and public datasets, in particular their lack of representativeness and transparency, and the implications of these limitations.

Jungseock Joo is an assistant professor in Communication at the University of California, Los Angeles. He is also an affiliated assistant professor in Statistics. His research primarily focuses on understanding multimodal human communication with computer vision and machine learning based methods. In particular, his research employs various types of large scale multimodal media data such as TV news or online social media and examines how multimodal cues in these domains relate to public opinions and real world events. He holds a Ph.D. in Computer Science from UCLA. He was a research scientist in Computer Vision at Facebook prior to joining UCLA in 2015.

# Tuesday, 04/30/2019, Time: 2:00PM Statistics Weekly Seminar: Injecting Expert Knowledge and Corpus-Level Constraints in Natural Language Processing Models

Royce 156

Kai-Wei Chang, Assistant Professor

UCLA Department of Computer Science

Recent advances in data-driven machine learning techniques (e.g. deep neural networks) have revolutionized many natural language processing applications. These approaches automatically learn how to make decisions based on the statistics and diagnostic information from large amounts of labeled data. Although these methods have been successful in various applications, they run the risk of making nonsensical mistakes, suffering from domain shift, and reinforcing the societal biases (e.g. gender bias) that are present in the underlying data. In this talk, I will describe a collection of results that leverage corpus-level constraints and domain knowledge to facilitate learning and inference in Natural Language Processing applications. These results lead to greater control over NLP systems so that they can be socially responsible and accountable.

Kai-Wei Chang is an assistant professor in the Department of Computer Science at the University of California Los Angeles. His research interests include designing robust machine learning methods for large and complex data and building language processing models for social good applications. Kai-Wei has published broadly in machine learning, natural language processing, and artificial intelligence. His awards include the EMNLP Best Long Paper Award (2017), the KDD Best Paper Award (2010), and the Okawa Research Grant Award (2018). Additional information is available at http://kwchang.net.

# Tuesday, 04/23/2019, Time: 2:00PM Statistics Weekly Seminar: Towards Understanding Overparameterized Deep Neural Networks: From Optimization To Generalization

Royce 156

Quanquan Gu, Assistant Professor

UCLA Department of Computer Science

Deep learning has achieved tremendous successes in many applications. However, why deep learning is so powerful is still not well understood. One of the mysteries is that deep neural networks used in practice are often heavily over-parameterized such that they can even fit random labels to the input data, while they can still achieve very small test error when trained with real labels. In order to understand this phenomenon, in this talk, I will first show that with over-parameterization and a proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs with the ReLU activation function. Then I will show that, under certain assumptions on the data distribution, gradient descent with a proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small test error. This leads to an algorithm-dependent generalization error bound for deep learning. I will conclude by discussing implications, challenges and future work along this line of research.

Quanquan Gu is an Assistant Professor of Computer Science at UCLA. His current research is in the area of artificial intelligence and machine learning, with a focus on developing and analyzing nonconvex optimization algorithms for machine learning to understand large-scale, dynamic, complex and heterogeneous data, and building the theoretical foundations of deep learning. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2014. He is a recipient of the Yahoo! Academic Career Enhancement Award in 2015, NSF CAREER Award in 2017, Adobe Data Science Research Award and Salesforce Deep Learning Research Award in 2018, and Simons Berkeley Research Fellowship in 2019.

# Tuesday, 04/23/2019, Time: 4:00PM – 5:30PM 2019 De Leeuw Seminar

California NanoSystems Institute (CNSI) Auditorium

Roger Peng

Department of Biostatistics, Johns Hopkins University

The data revolution has led to an increased interest in the practice of data analysis, which most would agree is a fundamental aspect of a broader definition of data science. But how well can we characterize data analysis and communicate its fundamental principles? Previous work has largely focused on the “forward mechanism” of data analysis by trying to model and understand the cognitive processes that govern data analyses. While developing such an understanding has value, it largely focuses on unobserved phenomena. An alternate approach characterizes data analyses based on their observed outputs and develops principles or criteria for comparing one to another. Furthermore, these principles can be used to formalize a definition of a successful analysis. In general, the theoretical basis for data analysis leaves much to be desired, and in this talk I will attempt to sketch a foundation upon which we can hopefully make progress.

Roger D. Peng is a Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health where his research focuses on the development of statistical methods for addressing environmental health problems. He is also a co-founder of the Johns Hopkins Data Science Specialization, the Simply Statistics blog, the Not So Standard Deviations podcast, and The Effort Report podcast. He is a Fellow of the American Statistical Association and is the recipient of the 2016 Mortimer Spiegelman Award from the American Public Health Association.

# Tuesday, 04/16/2019, Time: 2:00PM Statistics Weekly Seminar: Verifying and Enhancing the Robustness of Neural Networks

Royce 156

Cho-Jui Hsieh

UCLA Department of Computer Science

Robustness of neural networks has become an important issue for mission-critical applications, including self-driving cars and control systems. It is thus important to verify the safety of neural networks and give provable guarantees. In this talk, I will present simple and efficient neural network verification algorithms developed by our group. Furthermore, I will discuss some effective ways to improve the robustness of neural networks.

Cho-Jui Hsieh is an assistant professor in UCLA CS. His research focuses on the efficiency and robustness of machine learning systems. Cho-Jui obtained his master's degree in 2009 from National Taiwan University (advisor: Chih-Jen Lin) and his Ph.D. from the University of Texas at Austin in 2015 (advisor: Inderjit S. Dhillon). He is the recipient of IBM Ph.D. fellowships in 2013-2015, best paper awards at KDD 2010, ICDM 2012 and ICPP 2018, and was a best paper finalist at AISec 2017.

# Tuesday, 04/09/2019, Time: 2:00PM Statistics Weekly Seminar: Reverse Engineering Human Cooperation

Royce 156

Max Kleiman-Weiner

Harvard University

Human cooperation is distinctly powerful. We collaborate with others to accomplish together what none of us could do on our own; we share the benefits of collaboration fairly and trust others to do the same. Even young children cooperate with a scale and sophistication unparalleled in other animal species. I seek to understand these everyday feats of social intelligence in computational terms. What are the cognitive representations and processes that underlie these abilities and what are their origins? How can we apply these cognitive principles to build machines that have the capacity to understand, learn from, and cooperate with people? I will present a formal framework based on the integration of individually rational, hierarchical Bayesian models of learning, together with socially rational multi-agent and game-theoretic models of cooperation. First, I investigate the evolutionary origins of the cognitive structures that enable cooperation and support social learning. I then describe how these structures are used to learn social and moral knowledge rapidly during development. Finally I show how this knowledge is generalized in the moment, across an infinitude of possible situations: inferring the intentions and reputations of others, distinguishing who is friend or foe, and learning a new moral value all from just a few observations of behavior.

Dr. Max Kleiman-Weiner is a fellow of the Data Science Institute and Center for Research on Computation and Society (CRCS) within the computer science and psychology departments at Harvard. He did his PhD in computational cognitive science at MIT, advised by Josh Tenenbaum, where he was an NSF and Hertz Foundation Fellow. He won best paper at RLDM 2017 for models of human cooperation and the William James Award at SPP for computational work on moral learning. He also serves as Chief Scientist of Diffeo, a startup building collaborative machine intelligence. Previously, he was a Fulbright Fellow in Beijing, earned an MSc in Statistics as a Marshall Scholar at Oxford, and did his undergraduate work at Stanford as a Goldwater Scholar.

# Thursday, 04/04/2019, Time: 10:30AM Adventures in Dimensionland: Geometry, Dimensionality and Clustering in Revealed Preference Models

4242 Young Hall (JIFRESSE)

Abel Rodriguez

Professor of Statistics

Associate Dean for Graduate Affairs, Baskin School of Engineering

Associate Director, Center for Data, Discovery and Decisions

University of California-Santa Cruz

Measuring preferences from observed behavior is a critical task in a number of disciplines in the social sciences. This talk is motivated by applications in political science, where the spatial voting model has become the dominant methodological and theoretical tool for operationalizing the process of inferring legislators’ preferences from voting data. This model, which is closely related to the classical factor analysis model in the statistics literature and the item response theory (IRT) model in the psychometrics literature, embeds binary responses onto a (potentially multidimensional) continuous Euclidean policy space. The resulting latent traits, usually referred to as the legislators’ “ideal points”, are useful not only as low-dimensional descriptors of the legislature, but also for testing theories of legislative behavior.

Understanding the dimension and geometry of the latent policy space is a critical issue in the application of spatial voting models. The first part of this talk discusses extensions of the traditional spatial voting model that allow us to estimate legislators’ revealed preferences in different domains of voting on a common scale. Our approach assumes that, in principle, legislators might have different preferences in each voting domain, and uses ideas from model-based clustering to shrink the number of distinct positions. These priors are carefully constructed to ensure that the various latent spaces are comparable, and to address multiplicity issues that arise from simultaneously testing a large number of hypotheses. We argue that, under certain circumstances, these models can be interpreted as allowing the dimension of the policy space to be legislator-dependent. The second part of the talk introduces a novel class of spatial voting models in which legislators’ preferences are embedded on the surface of an n-dimensional sphere. The resulting model contains the standard binary Euclidean factor model as a limiting case, and provides a mechanism to operationalize (and extend) the so-called “horseshoe theory” in political science. This theory postulates that the far-left and far-right are more similar to each other in essentials than either is to the political center. The various models are illustrated using the voting record of recent US Congresses.

This work is the result of a collaboration with Scott Moser (University of Nottingham), Chelsea Lofland (UC Santa Cruz) and Xingchen Yu (UC Santa Cruz).

# Tuesday, 04/02/2019, Time: 2:00PM Statistics Weekly Seminar: Bayesian Inference in Nonparanormal Graphical Models

Royce 156

Subhashis Ghosal

North Carolina State University

Cohosted by UCLA Biostatistics

A graphical model can very effectively describe the intrinsic dependence among several potentially related variables, giving a hidden lower-dimensional structure in a high-dimensional joint distribution. A Gaussian graphical model is easy to interpret, and its structure is relatively easy to discover. However, the Gaussianity assumption may be restrictive in many applications. A nonparanormal graphical model is a semiparametric generalization of a Gaussian graphical model for continuous variables, where it is assumed that the variables follow a Gaussian graphical model only after some unknown smooth monotone transformations. We consider two distinct Bayesian solutions for the nonparanormal model, one by putting priors on the transformation functions, and the other by eliminating the effects of the transformations by considering a rank-based likelihood function. In the first approach, priors on the underlying transformations are specified using finite random series of B-splines with increasing coefficients. On the underlying precision matrix of the transformed variables, we consider either a spike-and-slab prior or a continuous shrinkage prior on its Cholesky decomposition. A Hamiltonian Monte Carlo method allows efficient sampling from the posterior distribution of the transformation. We show that the posterior distribution for the transformation function is consistent under general conditions. We study the numerical performance of the proposed method through a simulation study and apply it to a real dataset. For the rank-likelihood approach, we describe a Gibbs sampling technique for posterior computation and establish a posterior consistency theorem about finding the graphical structure.

The talk is based on a joint work with Dr. Jami Jackson Mulgrave, post-doc at Columbia University and former graduate student at NCSU.

# Tuesday, 03/12/2019, Time: 2:00PM Statistics Weekly Seminar: Watching your Weights: Generalizing from a Randomized Trial to a ‘Real World’ Target Population

Physics and Astronomy Building Room 1434A

Eloise Kaizar

Ohio State University

This seminar is co-sponsored by the UCLA Center for Social Statistics.

Randomized controlled trials are often thought to provide definitive evidence on the magnitude of treatment effects. But because treatment modifiers may have a different distribution in a real world population than among trial participants, trial results may not directly reflect the average treatment effect that would follow real world adoption of a new treatment. Recently, weight-based methods have been repurposed to provide more relevant average effect estimates for real populations. In this talk, I summarize important analytical choices involving what should and should not be borrowed from other applications of weight-based estimators, make evidence-based recommendations about confidence interval construction, and present conjectures about best choices for other aspects of statistical inference.
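The basic idea behind weighting a trial toward a target population can be illustrated with a small Python sketch. This is not the speaker's estimator: it assumes a single categorical effect modifier, known stratum-level effects, and known target-population shares, and it uses simple post-stratification-style weights (target share divided by trial share). All names and numbers are hypothetical.

```python
def generalization_weights(trial_counts, target_shares):
    """Weight for each stratum: target-population share / trial share."""
    total = sum(trial_counts.values())
    return {s: target_shares[s] / (trial_counts[s] / total) for s in trial_counts}


def weighted_average_effect(stratum_effects, trial_counts, target_shares):
    """Average the stratum-level effects after reweighting trial participants."""
    w = generalization_weights(trial_counts, target_shares)
    num = sum(w[s] * trial_counts[s] * stratum_effects[s] for s in trial_counts)
    den = sum(w[s] * trial_counts[s] for s in trial_counts)
    return num / den


# Hypothetical trial: younger participants are over-represented, and the
# treatment effect is larger among older participants.
trial_counts = {"younger": 800, "older": 200}
target_shares = {"younger": 0.5, "older": 0.5}
stratum_effects = {"younger": 2.0, "older": 6.0}

trial_average = sum(
    trial_counts[s] * stratum_effects[s] for s in trial_counts
) / sum(trial_counts.values())
target_average = weighted_average_effect(stratum_effects, trial_counts, target_shares)

print(trial_average, target_average)  # about 2.8 in the trial vs. about 4.0 in the target
```

With these made-up numbers the unweighted trial average understates the effect expected in the target population, because the stratum with the larger effect is under-represented among trial participants.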

Eloise Kaizar is Associate Professor of Statistics at The Ohio State University. Her primary research focus is on assessing the effects and safety of medical exposures and interventions, especially those whose effects are heterogeneous across populations or measured with rare event outcomes. As such, she has worked on methodology to combine multiple sources of information relevant to the same broad policy or patient-centered question. She is particularly interested in how data collected via different study designs can contribute complementary information.

# Tuesday, 03/05/2019, Time: 2:00PM Statistics Weekly Seminar: Parsimonious Regressions for Repeated Measure Analysis

Physics and Astronomy Building Room 1434A

Lan Liu

University of Minnesota – Twin Cities

This seminar is co-sponsored by the UCLA Center for Social Statistics.

Longitudinal data with repeated measures frequently arise in various disciplines. The standard methods typically impose a mean outcome model as a function of individual features, time and their interactions. However, the validity of the estimators relies on the correct specification of the time dependency. The envelope method was recently proposed as a sufficient dimension reduction (SDR) method in multivariate regressions. In this paper, we demonstrate the use of the envelope method as a new parsimonious regression method for repeated measures analysis, where the specification of the underlying pattern of time trend is not required by the model. We found that if there is enough prior information to support the specification of the functional dependency of the mean outcome on time and if the dimension of the prespecified functional form is low, then the standard method is advantageous as an efficient and unbiased estimator. Otherwise, the envelope method is appealing as a more robust and potentially efficient parsimonious regression method in repeated measure analysis. We compare the performance of the envelope estimators with the existing estimators in a simulation study and in an application to the China Health and Nutrition Survey.

Lan Liu is an assistant professor in the School of Statistics at the University of Minnesota, Twin Cities. She obtained her PhD in Biostatistics from UNC Chapel Hill and worked as a postdoc at Harvard before joining UMN. Her research interests include causal inference, missing data analysis, clinical trials, doubly robust inference, Bayesian analysis, surrogate outcomes, measurement error, mediation analysis, and social networks. She is also interested in collaborative work with physicians and epidemiologists.

# Tuesday, 02/26/2019, Time: 2:00PM Statistics Weekly Seminar: Covariate Screening in High Dimensional Data: Applications to Forecasting and Text Data

Physics and Astronomy Building Room 1434A

Adeline Lo

Princeton University

This seminar is co-sponsored by the UCLA Center for Social Statistics.

High dimensional (HD) data, where the number of covariates and/or meaningful covariate interactions might exceed the number of observations, is increasingly used in prediction in the social sciences. An important question for the researcher is how to select the most predictive covariates among all the available covariates. Common covariate selection approaches use ad hoc rules to remove noise covariates, or select covariates through the criterion of statistical significance or by using machine learning techniques. These can suffer from lack of objectivity, choosing some but not all predictive covariates, and failing reasonable standards of consistency that are expected to hold in most high-dimensional social science data. The literature is scarce in statistics that can be used to directly evaluate covariate predictivity. We address these issues by proposing a variable screening step prior to traditional statistical modeling, in which we screen covariates for their predictivity. We propose the influence (I) statistic to evaluate covariates in the screening stage, showing that the statistic is directly related to predictivity and can help screen out noisy covariates and discover meaningful covariate interactions. We illustrate how our screening approach can remove noisy phrases from U.S. Congressional speeches and rank important ones to measure partisanship. We also show improvements to out-of-sample forecasting in a state failure application. Our approach is applicable via an open-source software package.

Adeline Lo is a postdoctoral research associate at the Department of Politics at Princeton University. Her research lies in the design of statistical tools for prediction and measurement for applied social sciences, with a substantive interest in conflict and post-conflict processes. She has an ongoing research agenda on high dimensional forecasting, especially in application to violent events. Her work has been published in the Proceedings of the National Academy of Sciences, Comparative Political Studies and Nature. She will be joining the Department of Political Science at the University of Wisconsin-Madison as an Assistant Professor in Fall 2019.

# Tuesday, 02/19/2019, Time: 2:00PM Statistics Weekly Seminar: Bayesian Propagation of Record Linkage Uncertainty into Population Size Estimation with Application to Human Rights Violations

Physics and Astronomy Building Room 1434A

Mauricio Sadinle

University of Washington

Multiple-systems estimation, also known as capture–recapture estimation, is a common technique for population size estimation, particularly in the quantitative study of human rights violations. These methods rely on multiple samples from the population, along with the information of which individuals appear in which samples. The goal of record linkage techniques is to identify unique individuals across samples based on the information collected on them. Linkage decisions are subject to uncertainty when such information contains errors and missingness, and when different individuals have very similar characteristics. Uncertainty in the linkage should be propagated into the stage of population size estimation. We propose an approach called linkage-averaging to propagate linkage uncertainty, as quantified by some Bayesian record linkage methodologies, into a subsequent stage of population size estimation. Linkage-averaging is a two-stage approach in which the results from the record linkage stage are fed into the population size estimation stage. We show that under some conditions the results of this approach correspond to those of a proper Bayesian joint model for both record linkage and population size estimation. The two-stage nature of linkage-averaging allows us to combine different record linkage models with different capture–recapture models, which facilitates model exploration. We present a case study from the Salvadoran civil war, where we are interested in estimating the total number of civilian killings using lists of witnesses’ reports collected by different organizations. These lists contain duplicates, typographical and spelling errors, missingness, and other inaccuracies that lead to uncertainty in the linkage. We show how linkage-averaging can be used for transferring the uncertainty in the linkage of these lists into different models for population size estimation.
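To see why linkage errors matter for population size estimation, consider a minimal Python sketch. It uses the textbook two-list Lincoln–Petersen estimator, N ≈ n₁ · n₂ / m (list sizes n₁ and n₂, with m individuals found on both lists), as a stand-in. This is not the linkage-averaging method described in the talk, and the lists below are entirely made up.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list population size estimate: |A| * |B| / |A and B|."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no linked records; the estimate is undefined")
    return len(a) * len(b) / overlap


# Hypothetical lists of 60 and 80 individuals with 20 true matches.
list_a = [f"id{i}" for i in range(0, 60)]
list_b = [f"id{i}" for i in range(40, 120)]
perfect_linkage = lincoln_petersen(list_a, list_b)

# Simulate linkage errors: 5 of the 20 true matches are recorded with a typo
# in list B, so an exact-match linkage fails to connect them.
list_b_with_typos = [f"id{i}x" if 40 <= i < 45 else f"id{i}" for i in range(40, 120)]
missed_links = lincoln_petersen(list_a, list_b_with_typos)

print(perfect_linkage, missed_links)  # 240.0 320.0
```

Missing a quarter of the true matches inflates the estimate from 240 to 320 in this toy example, which shows how sensitive the population size estimate is to decisions made in the linkage stage.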

Mauricio is an Assistant Professor in the Department of Biostatistics at the University of Washington. Previously, he was a Postdoctoral Associate in the Department of Statistical Science at Duke University and the National Institute of Statistical Sciences, working under the mentorship of Jerry Reiter. Mauricio completed his PhD in the Department of Statistics at Carnegie Mellon University, where his advisor was Steve Fienberg. He completed his undergraduate studies at the National University of Colombia in Bogotá, where he majored in statistics.

# Tuesday, 02/12/2019, Time: 2:00PM Statistics Weekly Seminar: Data-adaptive Methods to Control for Missing Data, to Increase Efficiency, and for Risk Prediction in the SEARCH Study

Physics and Astronomy Building Room 1434A

Laura Balzer

University of Massachusetts – Amherst

This seminar is co-sponsored by the UCLA Center for Social Statistics.

In this talk, we highlight the use of data-adaptive methods in three priority areas of HIV-AIDS research. First, estimates of population-level coverage at each step of the HIV care cascade (diagnosis, treatment, and viral suppression) are needed to assess the effectiveness of programs. However, the data available are often susceptible to differential missingness. We discuss the assumptions needed to identify population-level estimates and use targeted maximum likelihood estimation (TMLE) with Super Learner for semiparametric estimation of the resulting statistical parameters. Second, adjustment for baseline covariates during the analysis of randomized trials can reduce variance and increase power. However, it is often unclear a priori which covariates, if any, should be included in the adjustment set. To address this challenge, we use cross-validation to data-adaptively select, from a pre-specified set, the TMLE that maximizes precision and thereby power. Finally, optimal strategies are needed to identify individuals at risk of HIV infection in generalized epidemic settings. We apply Super Learner to develop a machine learning-based HIV risk score and assess its ability to target testing in three regions with varying HIV prevalence. All methods are illustrated using data from the SEARCH Study (NCT01864603), a community randomized trial for HIV prevention and treatment in rural Kenya and Uganda, and demonstrate the need for data-adaptive methods for improved estimation and inference.

Laura B. Balzer, PhD MPhil, is an Assistant Professor of Biostatistics at the University of Massachusetts-Amherst. She earned her PhD from the University of California-Berkeley and completed her post-doctoral studies at the Harvard T.H. Chan School of Public Health. Her areas of expertise include causal inference, machine learning, and dependent data. Dr. Balzer is the Primary Statistician for two cluster randomized trials: the SEARCH study for HIV prevention and treatment in East Africa and the SPIRIT study for TB prevention in Uganda. Her work is supported by the National Institutes of Health (NIH). Dr. Balzer received the ASA’s Causality in Statistics Education Award and the Gertrude M. Cox Scholarship.

# Tuesday, 02/05/2019, Time: 2:00PM Statistics Weekly Seminar: Recruitment with a purpose: Designing RCTs with Moderation and Reporting in Mind

Physics and Astronomy Building Room 1434A

Elizabeth Tipton

Northwestern University

This seminar is co-sponsored by the UCLA Center for Social Statistics.

The randomized control trial (RCT) is now widely recognized as the gold standard for determining causality in medicine, education, and the social sciences. Inference from an RCT to a policy relevant population, however, is difficult when the effect of the intervention varies across people and institutions. As a result, statisticians are increasingly interested in the development of methods for testing moderators of treatment impacts and for providing policy makers with improved information on for whom and where an intervention might hold promise. Yet much of this methodological development has focused on analytic approaches, neglecting the role that the sample itself plays in these analyses. This is particularly important given that nearly all RCTs are conducted in samples of convenience. Prior work has provided methods for improved sample selection and recruitment with a single target population in mind. In this talk, this sample selection approach is extended to include optimal designs for estimation of treatment impacts for multiple target populations and estimation of moderators. Throughout the talk, examples from education and psychology RCTs will be included.

# Tuesday, 01/29/2019, Time: 2:00PM Statistics Weekly Seminar: Fuzzy Forests: Variable Selection Under Correlation

Physics and Astronomy Building Room 1434A

Christina Ramirez

UCLA

Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when p >> n. We introduce our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests works by taking advantage of the network structure between features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of Weighted Gene Coexpression Network Analysis (WGCNA) to form modules of features such that the modules are roughly uncorrelated. Then recursive feature elimination random forests (RFE-RFs) are used on each module separately. From the surviving features, a final group is selected and ranked using one last round of RFE-RFs. This procedure results in a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model. We apply fuzzy forests to two applications: flow cytometry data and the California Health Interview Survey. We show that fuzzy forests is able to extract the most important features from both of these datasets.

My research interests generally relate to uncovering the mechanisms behind HIV pathogenesis. To this end, I work closely with investigators in the clinical and basic sciences. I am particularly interested in HIV drug resistance mutation/recombination, viral fitness and coreceptor utilization. I work to develop methods to understand the evolutionary dynamics of gene regions under the selective pressure of the host immune system and antiretrovirals. I am also interested in complex, high-dimensional data analysis where we have large p and small n. These methods have been applied to infer relationships between virus genotype and phenotype.

# Tuesday, 01/22/2019, Time: 2:00PM Statistics Weekly Seminar: Causal Inference with Interference and Noncompliance in Two-Stage Randomized Experiments

Physics and Astronomy Building Room 1434A

Kosuke Imai

Harvard University

In many social science experiments, subjects often interact with each other and as a result one unit’s treatment influences the outcome of another unit. Over the last decade, significant progress has been made towards causal inference in the presence of such interference between units. Researchers have shown that the two-stage randomization of treatment assignment enables the identification of average direct and spillover effects. However, much of the literature has assumed perfect compliance with treatment assignment. In this paper, we establish the nonparametric identification of the complier average direct and spillover effects in two-stage randomized experiments with interference and noncompliance. In particular, we consider the spillover effect of the treatment assignment on the treatment receipt as well as the spillover effect of the treatment receipt on the outcome. We propose consistent estimators and derive their randomization-based variances under the stratified interference assumption. We also prove the exact relationship between the proposed randomization-based estimators and the popular two-stage least squares estimators. Our methodology is motivated by and applied to the randomized evaluation of India’s National Health Insurance Program (RSBY), where we find some evidence of spillover effects on both treatment receipt and outcome. The proposed methods are implemented via an open-source software package.

Kosuke Imai is a Professor in the Department of Government and the Department of Statistics at Harvard University. He is also an affiliate of the Institute for Quantitative Social Science where his primary office is located. Before moving to Harvard in 2018, Imai taught at Princeton University for 15 years where he was the founding director of the Program in Statistics and Machine Learning. He specializes in the development of statistical methods and their applications to social science research and is the author of Quantitative Social Science: An Introduction (Princeton University Press, 2017). Outside of Harvard, Imai is currently serving as the President of the Society for Political Methodology. He is also Professor of Visiting Status in the Graduate Schools of Law and Politics at The University of Tokyo.

# Tuesday, 01/15/2019, Time: 2:00PM Statistics Weekly Seminar: The Regression Discontinuity Design: Methods and Applications

Physics and Astronomy Building Room 1434A

Rocio Titiunik

University of Michigan

The Regression Discontinuity (RD) design is one of the most widely used non-experimental strategies for the study of treatment effects in the social, behavioral, biomedical, and statistical sciences. In this design, units are assigned a score and a treatment is offered if the value of that score exceeds a known threshold—and withheld otherwise. In this talk, I will discuss the assumptions under which the RD design can be used to learn about treatment effects, and how to make valid inferences about them based on modern theoretical results in nonparametrics that emphasize the importance of extrapolation of regression functions and misspecification biases near the RD cutoff. If time permits, I will also discuss the common approach of augmenting nonparametric regression models using predetermined covariates in RD setups, and how this affects nonparametric identification of as well as statistical inference about the RD parameter. The talk will also present a more general version of the RD design based on multiple cutoffs, which expands the generalizability of the standard RD design by allowing researchers to test richer hypotheses regarding the heterogeneity of the treatment effect and, under additional assumptions, to extrapolate the treatment effect to score values far from the cutoff.
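The basic sharp RD comparison described in the abstract can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the speaker's method: it fits a separate straight line on each side of the cutoff with a uniform window and takes the gap between the two fits at the cutoff. The function names, the bandwidth, and the noise-free toy data are all hypothetical, and the sketch omits the bias correction and robust inference the talk emphasizes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Gap at the cutoff between lines fitted to treated and untreated units.

    Units are treated when their score exceeds the cutoff. Scores are
    centered at the cutoff so each intercept is the fitted value there.
    """
    pairs = list(zip(scores, outcomes))
    left = [(s - cutoff, y) for s, y in pairs
            if cutoff - bandwidth <= s <= cutoff]
    right = [(s - cutoff, y) for s, y in pairs
             if cutoff < s <= cutoff + bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Hypothetical noise-free data: outcome rises with the score and jumps
# by 2 for units whose score exceeds the cutoff of 0.
scores = [i / 10 for i in range(-10, 11)]
outcomes = [1.0 + 0.5 * s + (2.0 if s > 0 else 0.0) for s in scores]
estimate = sharp_rd_estimate(scores, outcomes, cutoff=0.0, bandwidth=1.0)
print(estimate)
```

With real data the choice of bandwidth and the handling of misspecification near the cutoff drive the quality of the estimate, which is why a fixed window like the one above is only a starting point.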

Rocío Titiunik is the James Orin Murfin Professor of Political Science at the University of Michigan. She specializes in quantitative methodology for the social sciences, with emphasis on quasi-experimental methods for causal inference and political methodology. Her research interests lie at the intersection of political science, political economy, and applied statistics, particularly on the development and application of quantitative methods to the study of political institutions. Her recent methodological research includes the development of statistical methods for the analysis and interpretation of treatment effects and program evaluation, with emphasis on regression discontinuity (RD) designs. Her recent substantive research centers on democratic accountability and the role of party systems in developing democracies. Rocío’s work has appeared in various journals in the social sciences and statistics, including the American Political Science Review, the American Journal of Political Science, the Journal of Politics, Econometrica, the Journal of the American Statistical Association, and the Journal of the Royal Statistical Society. In 2016, she received the Emerging Scholar Award from the Society for Political Methodology, which honors a young researcher who is making notable contributions to the field of political methodology. She is a member of the leadership team of the Empirical Implications of Theoretical Models (EITM) Summer Institute and a member of Evidence in Governance and Politics (EGAP), and has served in various leadership roles for the American Political Science Association and for the Society for Political Methodology. She has also served as Associate Editor for Political Science Research and Methods and the American Journal of Political Science, on the Advisory Committee for the Social, Behavioral, and Economic Sciences Directorate, and on the advisory panel for the Methodology, Measurement, and Statistics program of the National Science Foundation.

# Tuesday, 01/08/2019, Time: 2:00PM – 3:15PM Job Talk: Bayes Shrinkage at GWAS Scale

JIFRESSE Seminar Room; 4242 Young Hall

James Johndrow

Stanford University

Bayesian analysis in high-dimensional sparse regression settings differs from typical frequentist methods in part by eschewing selection of a single model. While this simplifies some aspects of inference, computation for these models is considerably less scalable than popular procedures for selection, such as the lasso. Global-local shrinkage priors, such as the horseshoe, were proposed in part as a computationally scalable alternative to classical spike-and-slab mixture priors, but the promised computational scalability has failed to materialize. We propose an MCMC algorithm for computation with the horseshoe prior that permits analysis of genome-wide association study (GWAS) data with hundreds of thousands of covariates and thousands of subjects. The algorithm is shown empirically to outperform alternatives by orders of magnitude. Among other enhancements, our algorithm employs certain approximations to an expensive matrix operation. We give general results on the accuracy of time averages obtained from approximating Markov chains, with an application to our horseshoe MCMC sampler.

# Monday, 01/07/2019, Time: 2:00PM – 3:15PM Job Talk Topic: Support Points – A New Way to Reduce Big and High-dimensional Data

JIFRESSE Seminar Room; 4242 Young Hall

Simon Mak

Georgia Institute of Technology

This talk presents a new method for reducing big and high-dimensional data into a smaller dataset, called support points (SPs). In an era where data is plentiful but downstream analysis is oftentimes expensive, SPs can be used to tackle many big data challenges in statistics, engineering and machine learning. SPs have two key advantages over existing methods. First, SPs provide optimal and model-free reduction of big data for a broad range of downstream analyses. Second, SPs can be efficiently computed via parallelized difference-of-convex optimization; this allows us to reduce millions of data points to a representative dataset in mere seconds. SPs also enjoy appealing theoretical guarantees, including distributional convergence and improved reduction over random sampling and clustering-based methods. The effectiveness of SPs is then demonstrated in two real-world applications, the first for reducing long Markov Chain Monte Carlo (MCMC) chains for rocket engine design, and the second for data reduction in computationally intensive predictive modeling.

# Friday, 12/14/2018, Time: 2:00PM – 3:15PM Job Talk Topic: A Modern Maximum-likelihood Approach for High-dimensional Logistic Regression

JIFRESSE Seminar Room; 4242 Young Hall

Pragya Sur

Stanford University

Logistic regression is arguably the most widely used and studied non-linear model in statistics. Statistical inference based on classical maximum-likelihood theory is ubiquitous in this context. This theory hinges on three well-known fundamental results: (1) the maximum-likelihood estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio test (LRT) statistic is asymptotically chi-squared distributed. In this talk, I will show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results predict, and (3) the LRT statistic is not chi-squared distributed. Consequently, p-values obtained based on classical theory are completely invalid in high dimensions.

In turn, I will propose a new theory that characterizes the asymptotic behavior of both the MLE and the LRT under some assumptions on the covariate distribution, in a high-dimensional setting. Empirical evidence demonstrates that this asymptotic theory provides accurate inference in finite samples. Practical implementation of these results necessitates the estimation of a single scalar, the overall signal strength, and I will propose a procedure for estimating this parameter precisely. Finally, I will describe analogous characterizations for regularized estimators such as the logistic lasso or ridge in high dimensions.

This is based on joint work with Emmanuel Candes and Yuxin Chen.

# Tuesday, 12/4/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Generalizability of Study Results

Physics and Astronomy Building Room 1434A

Catherine Lesko

Johns Hopkins University

Causal effect estimates from well-controlled observational studies or randomized trials may not equal the effects observed when interventions are applied to the target populations for whom they were intended. We will discuss threats to external validity of study results and demonstrate the application of statistical methods to estimating population average treatment effects when the study sample is not a random sample of the target population.

Catherine Lesko is an Assistant Professor in the Department of Epidemiology at the Johns Hopkins Bloomberg School of Public Health. Her research interests include: describing and improving clinical outcomes for persons with HIV living in the United States, particularly as they relate to mental health, alcohol use and substance use; monitoring and improving progress through the HIV care continuum; and the application and development of epidemiologic methods and theory for estimating policy- and patient-relevant health effects from observational data.

# Monday, 12/3/2018, Time: 2:00PM – 3:15PM Job Talk Topic: Fused Lasso in Graph Estimation Problems

JIFRESSE Seminar Room; 4242 Young Hall

Oscar Madrid Padilla

Visiting Assistant Professor, UC Berkeley

In this talk I will describe theory and methods for the fused lasso on network problems. Two classes of problems will be discussed: denoising on graphs, and nonparametric regression on general metric spaces. For the first of these tasks, I will provide a general upper bound on the mean squared error of the fused lasso that depends on the sample size and the total variation of the underlying signal. I will show that this upper bound is minimax when the graph is a tree of bounded degree, and I will present a surrogate estimator that attains the same upper bound and can be found in linear time. The second part of the talk will focus on extending the fused lasso to general nonparametric regression. The resulting approach, which we call the K-nearest neighbors (K-NN) fused lasso, involves (i) computing the K-NN graph of the design points; and (ii) performing the fused lasso over this K-NN graph. I will discuss several theoretical advantages over competing approaches: specifically, the estimator inherits local adaptivity from its connection to the fused lasso, and it inherits manifold adaptivity from its connection to the K-NN approach. Finally, I will briefly mention some of my other research directions.
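For readers unfamiliar with the estimator, the fused lasso on a graph is commonly written as the penalized least-squares problem below. The notation is an assumption introduced here for illustration and does not come from the abstract: $y_i$ is the observation at node $i$, $E$ is the edge set of the graph (the K-NN graph in the second part of the talk), and $\lambda \ge 0$ is a tuning parameter.

$$
\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|
$$

The penalty term is $\lambda$ times the total variation of $\theta$ over the graph, which is the quantity the mean squared error bound in the abstract depends on.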

# Tuesday, 11/27/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Identifiability of Nonparametric Mixture Models, Clustering, and Semi-Supervised Learning

Physics and Astronomy Building Room 1434A

Bryon Aragam

Carnegie Mellon University

Motivated by problems in data clustering and semi-supervised learning, we establish general conditions under which families of nonparametric mixture models are identifiable by introducing a novel framework for clustering overfitted parametric (i.e. misspecified) mixture models. These conditions generalize existing conditions in the literature, allowing for general nonparametric mixture components. After a discussion of some statistical aspects of this problem (e.g. estimation), we will discuss two applications of this framework. First, we extend classical model-based clustering to nonparametric settings and develop a practical algorithm for learning nonparametric mixtures. Second, we analyze the sample complexity of semi-supervised learning (SSL) and introduce new assumptions based on the mismatch between a mixture model learned from unlabeled data and the true mixture model induced by the (unknown) class conditional distributions. Under these assumptions, we establish an $\Omega(K\log K)$ labeled sample complexity bound without imposing parametric assumptions, where $K$ is the number of classes. These results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples.

[1] Aragam, B., Dan, C., Ravikumar, P. and Xing, E. P. Identifiability of nonparametric mixture models and Bayes optimal clustering. Under review. https://arxiv.org/abs/1802.04397

[2] Dan, C., Leqi, L., Aragam, B., Ravikumar, P., and Xing, E. P. Sample complexity of nonparametric semi-supervised learning. NIPS 2018, to appear. https://arxiv.org/abs/1809.03073

Bryon Aragam is a project scientist in the Machine Learning Department at Carnegie Mellon University. He received his PhD in statistics from UCLA in 2015. His research interests are at the intersection of high-dimensional statistics and machine learning, with a focus on developing algorithms, theory, and software for applications in computational biology and precision medicine. Some of his recent projects include nonparametric mixture models, high-dimensional inference, and personalized models.

# Tuesday, 11/20/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Testing for Peer Effects in Randomized Group Formation Designs

Co-sponsored by the Center for Social Statistics

Physics and Astronomy Building Room 1434A

Guillaume Basse

UC Berkeley

Many important causal questions focus on the effect of forming groups of units, also known as peer effects, such as assigning students to classrooms or workers to teams. For example, what is the impact of assigning a student to a classroom with students with, on average, greater academic preparation? Even when these groups are assigned randomly, however, standard methods for analyzing peer effects are heavily model dependent and typically fail to exploit the randomized design. At the same time, naive permutation testing will generally be invalid, and the growing causal inference literature on testing under interference has instead focused on the setting where groups are fixed ex-ante. In this paper, we extend methods for randomization-based testing under interference to these “randomized group formation designs” in which the randomness comes from the groups themselves, such as assigning students to classrooms. Unlike existing methods, the proposed tests are justified by the randomization itself and require relatively few assumptions. While we give a general framework, we highlight designs in which this approach can be implemented via a simple permutation test. We apply this approach to several recent examples of group formation designs, including assigning college freshmen to dorms and judges to panels.

I am currently a postdoctoral fellow in the Statistics Department at UC Berkeley where I am advised by Peng Ding. My research focuses on Causal Inference and Design of Experiments in the presence of interference. I got my PhD in Statistics at Harvard in 2018, under the supervision of Edo Airoldi. Before coming to Harvard I attended the Ecole Centrale Paris, where I studied Applied Mathematics and Engineering. I have lived in France, Israel, the US and Senegal, where I was born. I will start as an assistant professor in the MS&E and Statistics departments at Stanford in July 2019.

# Thursday, 11/15/2018, Time: 3:00PM – 3:50PM Special Statistics Seminar: Metropolis-Hastings MCMC with Dual Mini-Batches

MS 5148

Rachel Wang

University of Sydney

For many decades Markov chain Monte Carlo (MCMC) methods have been the main workhorse of Bayesian inference. However, traditional MCMC algorithms are computationally intensive. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using two mini-batches (MHDB) of the whole dataset each time and show that this gives rise to approximately a tempered stationary distribution. We prove that MHDB preserves the modes of the original target distribution and derive an error bound on the approximation for a general class of models including mixtures of exponential family distributions, linear binary classification and regression. To further extend the utility of the algorithm to high dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in neural network applications and show that compared with popular optimization methods, our method is more robust to the choice of learning rate and improves testing accuracy. (Joint work with Tung-Yu Wu and Wing H. Wong)
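To make the cost mentioned in the abstract concrete, here is a minimal sketch of the standard full-data random-walk Metropolis-Hastings sampler, not the MHDB algorithm proposed in the talk. The model is an assumption chosen for simplicity: Gaussian observations with unit variance, an unknown mean, and a flat prior, so the posterior is proportional to the likelihood. All names and tuning values are hypothetical.

```python
import math
import random


def log_likelihood(theta, data):
    """Gaussian log-likelihood with unit variance (constants dropped).

    Every call touches every data point, which is the per-iteration
    cost that mini-batch variants aim to reduce.
    """
    return -0.5 * sum((x - theta) ** 2 for x in data)


def metropolis_hastings(data, n_iter, step, seed=0):
    """Random-walk MH for the mean under a flat prior (an assumption)."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_likelihood(theta, data)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_likelihood(proposal, data)  # full pass over the data
        # Symmetric proposal, so the acceptance probability is
        # min(1, likelihood ratio).
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


# Hypothetical data: 200 draws from a normal distribution with mean 2.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]

samples = metropolis_hastings(data, n_iter=5000, step=0.2)
kept = samples[1000:]  # discard early iterations as burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean, sum(data) / len(data))
```

Under this model the posterior mean equals the sample mean of the data, so the two printed numbers should be close. Each iteration calls `log_likelihood` on the whole dataset, which is the step that becomes expensive as the data grows.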

# Tuesday, 11/13/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Probabilistic Projection of Carbon Emissions

Physics and Astronomy Building Room 1434A

Adrian Raftery

University of Washington

The Intergovernmental Panel on Climate Change (IPCC) recently published climate change projections to 2100, giving likely ranges of global temperature increase for each of four possible scenarios for population, economic growth and carbon use. We develop a probabilistic forecast of carbon emissions to 2100, using a country-specific version of Kaya’s identity, which expresses carbon emissions as a product of population, GDP per capita and carbon intensity (carbon per unit of GDP). We use the UN’s probabilistic population projections for all countries, based on methods from our group, and develop a joint Bayesian hierarchical model for GDP per capita and carbon intensity in most countries. In contrast with opinion-based scenarios, our findings are statistically based using data for 1960–2010. We find that our likely range (90% interval) for cumulative carbon emissions to 2100 includes the IPCC’s two middle scenarios but not the lowest or highest ones. We combine our results with the ensemble of climate models used by the IPCC to obtain a predictive distribution of global temperature increase to 2100. This is joint work with Dargan Frierson (UW Atmospheric Science), Richard Startz (UCSB Economics), Alec Zimmer (Upstart), and Peiran Liu (UW Statistics).
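The Kaya identity mentioned above is a simple product, which a short sketch makes explicit. The function name and the numbers for the example country are hypothetical and serve only to show the arithmetic; they are not taken from the talk or from any dataset.

```python
def kaya_emissions(population, gdp_per_capita, carbon_intensity):
    """Carbon emissions as population x GDP per capita x carbon intensity.

    carbon_intensity is carbon emitted per unit of GDP, so the units of
    GDP cancel and the result is in the carbon unit of carbon_intensity.
    """
    return population * gdp_per_capita * carbon_intensity


# Hypothetical country: 10 million people, 20,000 dollars of GDP per
# person, and 0.0005 tonnes of carbon per dollar of GDP.
emissions = kaya_emissions(10_000_000, 20_000, 0.0005)
print(emissions)
```

Because emissions are a product of three factors, uncertainty in any one of them (population, GDP per capita, or carbon intensity) carries through to the forecast, which is why the abstract models them probabilistically rather than through fixed scenarios.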

# Tuesday, 11/6/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Evaluating Stochastic Seeding Strategies in Networks

Physics and Astronomy Building Room 1434A

Dean Eckles

MIT

When trying to maximize the adoption of a behavior in a population connected by a social network, it is common to strategize about where in the network to seed the behavior. Some seeding strategies require explicit knowledge of the network, which can be difficult to collect, while other strategies do not require such knowledge but instead rely on non-trivial stochastic ingredients. For example, one such stochastic seeding strategy is to select random network neighbors of random individuals, thus exploiting a version of the friendship paradox, whereby the friend of a random individual is expected to have more friends than a random individual. Empirical evaluations of these strategies have demanded large field experiments designed specifically for this purpose, but these experiments have yielded relatively imprecise estimates of the relative efficacy of these seeding strategies.
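The friendship paradox behind this seeding strategy can be checked exactly on a small network. The sketch below compares the average degree of a random individual with the expected degree of a random neighbor of a random individual. The star network and function names are hypothetical, and the code assumes every node has at least one neighbor.

```python
from fractions import Fraction


def mean_degree(adjacency):
    """Average number of friends of a uniformly random individual."""
    total = sum(len(neighbors) for neighbors in adjacency.values())
    return Fraction(total, len(adjacency))


def mean_degree_of_random_neighbor(adjacency):
    """Expected degree of a random neighbor of a random individual.

    Assumes no isolated nodes, since an isolated node has no neighbor
    to select.
    """
    total = Fraction(0)
    for neighbors in adjacency.values():
        neighbor_degrees = sum(len(adjacency[v]) for v in neighbors)
        total += Fraction(neighbor_degrees, len(neighbors))
    return total / len(adjacency)


# Hypothetical star network: node 0 is connected to nodes 1 through 4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

print(mean_degree(star))                     # 8/5
print(mean_degree_of_random_neighbor(star))  # 17/5
```

In the star network, four of the five individuals have the hub as their only neighbor, so picking a random neighbor lands on the well-connected hub most of the time without requiring any knowledge of the full network.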

Here we show both how stochastic seeding strategies can be evaluated using existing data arising from randomized experiments in networks designed for other purposes and how to design much more efficient experiments for this specific evaluation. In particular, we consider contrasts between two common stochastic seeding strategies and analyze nonparametric estimators adapted from policy evaluation or importance sampling. We relate this work to developments in the literatures on counterfactual policy evaluation, dynamic treatment regimes, and importance sampling.

Using simulations on real networks, we show that the proposed estimators and designs can dramatically increase precision while yielding valid inference. We apply our proposed estimators to a field experiment that randomly assigned households to an intensive marketing intervention and a field experiment that randomly assigned students to an anti-bullying intervention.

Joint work with Alex Chin & Johan Ugander.

Paper at: https://arxiv.org/abs/1809.09561

Dean Eckles is a social scientist and statistician. Dean is the KDD Career Development Professor in Communications and Technology at Massachusetts Institute of Technology (MIT), an assistant professor in the MIT Sloan School of Management, and affiliated faculty at the MIT Institute for Data, Systems & Society. He was previously a member of the Core Data Science team at Facebook. Much of his research examines how interactive technologies affect human behavior by mediating, amplifying, and directing social influence — and statistical methods to study these processes. Dean’s empirical work uses large field experiments and observational studies. His research appears in the Proceedings of the National Academy of Sciences and other peer-reviewed journals and proceedings in statistics, computer science, and marketing. Dean holds degrees from Stanford University in philosophy (BA), symbolic systems (BS, MS), statistics (MS), and communication (PhD).

# Tuesday, 10/16/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects

Physics and Astronomy Building Room 1434A

Jared Murray

University of Texas — Austin

We introduce a semi-parametric Bayesian regression model for estimating heterogeneous treatment effects from observational data. Standard nonlinear regression models, which may work quite well for prediction, can yield badly biased estimates of treatment effects when fit to data with strong confounding. Our Bayesian causal forests model avoids this problem by directly incorporating an estimate of the propensity function in the specification of the response model, implicitly inducing a covariate-dependent prior on the regression function. This new parametrization also allows treatment heterogeneity to be regularized separately from the prognostic effect of control variables, making it possible to informatively “shrink to homogeneity”, in contrast to existing Bayesian non- and semi-parametric approaches.

Jared is an assistant professor of statistics in the Departments of Statistics and Data Science and Information, Risk, and Operations Management at the McCombs School of Business. Until July of 2017 he was a visiting assistant professor in the Department of Statistics at Carnegie Mellon University. Prior to joining CMU he completed his Ph.D. in Statistical Science at Duke University, working with Jerry Reiter. His methodological work spans several areas, including Bayesian modeling for non- and semiparametric regression modeling, causal inference, missing data, and record linkage.

# Tuesday, 10/09/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Linking Survey and Data Science: Aspects of Privacy

Physics and Astronomy Building Room 1434A

Frauke Kreuter

University of Maryland

The recent reports of the Commission on Evidence-Based Policymaking and the National Academy of Science Panel on Improving Federal Statistics for Policy and Social Science Research Using Multiple Data Sources and State-of-the-Art Estimation Methods emphasize the need to make greater use of data from administrative and other processes. The promise of such data sources is great, and even more so if multiple data sources are linked in an effort to overcome the shortage of relevant information in each individual source. However, looking at countries in which administrative data have been accessible for longer, or the tech industry in which process data are used extensively for decision making, we see that process data are often insufficient to answer relevant questions or to ensure proper measurement. This creates a desire to augment process data and administrative data with surveys. This talk will focus on two practical aspects resulting from this situation: the enormous challenge in ensuring privacy, and the need to cross-train computer scientists, statisticians, and survey methodologists.

Professor Frauke Kreuter is Director of the Joint Program in Survey Methodology at the University of Maryland, USA; Professor of Statistics and Methodology at the University of Mannheim; and head of the Statistical Methods Research Department at the Institute for Employment Research in Nürnberg, Germany. She founded the International Program in Survey and Data Science, and is co-founder of the Coleridge Initiative. Frauke Kreuter is an elected fellow of the American Statistical Association and a recipient of the Gertrude Cox Award.

# Tuesday, 10/02/2018, Time: 2:00PM – 3:00PM Statistics Weekly Seminar: Graphical Models in Machine Learning, Networks and Uncertainty Quantification

Physics and Astronomy Building Room 1434A

Andrea Bertozzi

University of California, Los Angeles

I will present semi-supervised and unsupervised graph models for classification using similarity graphs and for community detection in networks. The methods stem from graph-based variational models built on graph cut metrics. The equivalence between the graph mincut problem and total variation minimization on the graph for an assignment function allows one to cast graph-cut variational problems in the language of total variation minimization, thus creating a parallel between low dimensional data science problems in Euclidean space (e.g. image segmentation) and high dimensional clustering. The connection paves the way for new algorithms for data science that have a similar structure to well-known computational methods for nonlinear partial differential equations. This paper focuses on a class of methods built around diffuse interface models (e.g. the Ginzburg–Landau functional and the Allen–Cahn equation) and threshold dynamics, developed by the speaker and collaborators. Semi-supervised learning with a small amount of training data can be carried out in this framework with diverse applications ranging from hyperspectral pixel classification to identifying activity in police body-worn video. It can also be extended to the context of uncertainty quantification with Gaussian noise models. The problem of community detection in networks also has a graph-cut structure and algorithms are presented for the use of threshold dynamics for modularity optimization. With efficient methods, this allows for the use of network modularity for unsupervised machine learning problems with an unknown number of classes.