Thursday, 02/09/2023, Time: 11:00am – 12:15pm PT
Combining biased and unbiased data for estimating stratified COVID-19 infection fatality rates
Gonzalo E. Mena, Postdoctoral Fellow
Department of Statistics, University of Oxford
Young Hall CS50
One major limitation of so-called ‘big data’ is that bigger sample sizes don’t lead to more reliable conclusions if the data are corrupted by bias. However, responding to complex scientific and societal questions requires us to think about how to draw inferences from such corrupted data efficiently. One emerging paradigm consists of suitably combining unbiased (but typically small and expensive) datasets with biased (but cheap and bigger) ones. Unfortunately, although Bayesian inference is a major workhorse of modern scientific research, methods for combining information within this paradigm are still lacking, and we often have to content ourselves with the suboptimal solution of throwing away all biased data.
In this talk, I will present a computationally efficient Bayesian method, with theoretical guarantees, for combining biased and unbiased data. This method is based on a predictive philosophy: given a family of Bayesian models indexed by an unknown parameter representing how data should be merged, we seek the value that best predicts unobserved units given the rest of the observed ones. I study in depth the performance of this method in the Gaussian case, showing that if D is greater than 8, then including biased data is always better than not doing so. Moreover, I show that it enjoys a certain robustness property, making it preferable to the best available baseline, the Green-Strawderman shrinkage estimator. This criterion can be seamlessly implemented through leave-one-out cross-validation in usual probabilistic programming pipelines, and I show through simulations that the benefits manifest in more complex scenarios as well, for example, in hierarchical models.
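As a rough illustration of the predictive idea described above, and not the speaker's actual method, the following sketch selects a merging weight by leave-one-out cross-validation in a toy Gaussian setting. Everything in it is an assumption made for illustration: the family of Bayesian models is replaced by a simple convex combination of the unbiased and biased sample means, the merging parameter is a hypothetical weight `w` in [0, 1], predictive quality is measured by squared error on held-out unbiased units, and the data are simulated.

```python
import numpy as np


def loo_score(unbiased, biased, w):
    """Mean squared leave-one-out prediction error on the unbiased units.

    Each unbiased unit is held out in turn and predicted by a convex
    combination of the remaining unbiased mean and the biased mean.
    (Illustrative stand-in for a Bayesian predictive criterion.)
    """
    n = len(unbiased)
    # Mean of the unbiased sample with unit i removed, for every i at once.
    loo_means = (unbiased.sum() - unbiased) / (n - 1)
    preds = (1.0 - w) * loo_means + w * biased.mean()
    return float(np.mean((unbiased - preds) ** 2))


def select_weight(unbiased, biased, grid=None):
    """Return the weight on the grid with the smallest leave-one-out error."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    scores = [loo_score(unbiased, biased, w) for w in grid]
    return float(grid[int(np.argmin(scores))])


# Simulated data (assumed for illustration): a small unbiased sample and
# two large biased samples, one mildly and one heavily shifted.
rng = np.random.default_rng(0)
theta = 1.0
unbiased = rng.normal(theta, 1.0, size=30)
biased_small_shift = rng.normal(theta + 0.1, 1.0, size=3000)
biased_large_shift = rng.normal(theta + 3.0, 1.0, size=3000)

w_small = select_weight(unbiased, biased_small_shift)
w_large = select_weight(unbiased, biased_large_shift)
print("selected weight, small bias:", w_small)
print("selected weight, large bias:", w_large)
```

In this toy setting the held-out criterion tends to push the weight toward zero when the biased sample is far from the unbiased one, which is the kind of data-driven discounting the predictive criterion is meant to provide. A full Bayesian treatment would replace the squared-error score with a leave-one-out predictive density computed in a probabilistic programming pipeline.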
I apply these methods to one important scientific and policy-sensitive question: determining how COVID-19 lethality depends on age and socioeconomic status. This problem is remarkably hard since lethality is defined in terms of the true number of infections, a quantity that we typically observe with bias. Using small-area data from Chile, I present three stratified examples based on biased (administrative surveillance data), unbiased (a serosurvey) and biased + unbiased data to confirm the result that there is a strong dependence of lethality on socioeconomic status among younger populations.
Gonzalo Mena is a Florence Nightingale Fellow in Computational Statistics and Machine Learning at the Department of Statistics, University of Oxford. Prior to that, he was a Data Science Initiative Postdoctoral Fellow at Harvard University. He earned his PhD in Statistics at Columbia University, advised by Liam Paninski. Before his PhD, he obtained a bachelor’s degree in Mathematical Engineering at Universidad de Chile, in his home country. His main research motivation is the development of statistical methods to address complex scientific and societal problems.