Monday, December 13, 2021
10:00 am - 11:00 am
Abstract: Precise statistical modeling and inference often rely on integrative analysis of datasets from multiple sites. However, such modern meta-analysis can be uniquely challenging for electronic health record (EHR) data because of noise, high dimensionality, heterogeneity, and privacy constraints. I will present a novel statistical framework and approaches to overcome these practical challenges. Specifically, we develop three methods for aggregating large-scale, heterogeneous, multi-institutional EHR datasets while protecting individual-level information, aimed at sparse regression, multiple testing, and surrogate-assisted semi-supervised learning, respectively. Through both asymptotic analysis and numerical experiments, we demonstrate that our proposed methods outperform existing options and perform close to the ideal analysis that pools individual patient data, which is not feasible under the privacy constraints. We illustrate the use of our methods in real EHR-based studies, including EHR phenotyping for cardiovascular disease and inferring genetic associations for type II diabetes using linked biobank data.