Biostatistics Seminar Series: Abhishek Chakrabortty, PhD

Tuesday, March 28, 2017
3:30 pm - 4:30 pm
701 Blockley Hall
Postdoctoral Researcher
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine
Department of Statistics, The Wharton School of the University of Pennsylvania

Abstract:

We consider a variety of M/Z-estimation problems, based on estimating equations (EEs), under a semi-supervised (SS) setting, where the available data typically consist of: (i) a small or moderate-sized 'labeled' dataset, and (ii) a much larger 'unlabeled' dataset. Such data arise naturally when the outcome, unlike the covariates, is difficult or expensive to obtain, a frequent scenario in modern studies involving large databases such as electronic medical records (EMR). In these settings it is often of interest to investigate whether and when the unlabeled data can be exploited to improve estimation of the parameter of interest, compared to supervised approaches based only on the labeled data. We provide a unified framework for analyzing such problems from a semi-parametric perspective and propose a family of 'Efficient and Adaptive Semi-Supervised Estimators' (EASE). These are two-step estimators, constructed via a (non-parametric) imputation approach, that are always at least as efficient as the supervised estimator and are more efficient (indeed, optimal in some cases) when the information from the unlabeled data actually relates to the parameter of interest. This adaptive property, often unaddressed, is crucial for advocating 'safe' use of unlabeled data. For a subclass of EEs, including those corresponding to standard generalized linear (working) models, we provide a more flexible 'semi-non-parametric' imputation strategy suited to covariates that are not low-dimensional. All imputations involve a 'refitting' or debiasing step, along with possible use of cross-validation (CV), both of which have practical as well as theoretical benefits.
We establish consistency, asymptotic normality, influence function expansions, and the adaptive properties of EASE, as well as CV-based methods for inference. The results are validated through simulations, followed by illustrations on real datasets from EMR studies on autoimmune diseases.
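
The abstract does not spell out the construction, so the following is only a minimal sketch of what a two-step "impute, refit, then solve the estimating equation on the unlabeled data" procedure might look like for a linear working model. Everything specific here is an assumption made for illustration, not the speaker's method: the function names (`ols`, `kernel_impute`, `ss_linear_fit`), the Gaussian-kernel smoother used for imputation, the bandwidth, the 5-fold cross-validation, and the simulated data.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients, with an intercept column prepended to X."""
    Z = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

def kernel_impute(X_train, y_train, X_new, bandwidth=0.5):
    """Nadaraya-Watson smoother with a Gaussian kernel (an illustrative choice)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    return (w @ y_train) / np.maximum(w.sum(axis=1), 1e-12)

def ss_linear_fit(X_lab, y_lab, X_unlab, n_folds=5, bandwidth=0.5, seed=0):
    """Hypothetical semi-supervised fit of a linear working model."""
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = rng.permutation(n) % n_folds

    # Step 1a: out-of-fold imputations for the labeled outcomes, so that each
    # labeled point is predicted by a smoother that never saw its own outcome.
    m_lab = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        m_lab[test] = kernel_impute(X_lab[train], y_lab[train],
                                    X_lab[test], bandwidth)

    # Step 1b: refitting (debiasing). Regress the labeled residuals on the
    # covariates to correct the smoother's bias in the linear directions.
    gamma = ols(X_lab, y_lab - m_lab)

    # Step 2: impute outcomes for the unlabeled covariates, add the refitting
    # correction, and solve the least-squares equations on the unlabeled data.
    m_unlab = kernel_impute(X_lab, y_lab, X_unlab, bandwidth)
    Z_unlab = np.column_stack([np.ones(len(X_unlab)), X_unlab])
    return ols(X_unlab, m_unlab + Z_unlab @ gamma)

# Simulated example (assumed data-generating process, for illustration only).
rng = np.random.default_rng(1)
n_lab, n_unlab = 300, 3000
X_lab = rng.normal(size=(n_lab, 2))
X_unlab = rng.normal(size=(n_unlab, 2))
y_lab = 1.0 + 2.0 * X_lab[:, 0] - 1.0 * X_lab[:, 1] + rng.normal(scale=0.5, size=n_lab)

theta_sup = ols(X_lab, y_lab)                    # supervised: labeled data only
theta_ss = ss_linear_fit(X_lab, y_lab, X_unlab)  # semi-supervised sketch
print("supervised:     ", theta_sup)
print("semi-supervised:", theta_ss)
```

In this simulation the linear working model is correctly specified, so the unlabeled covariates carry no extra information about the coefficients and the two estimates should be close. Under the adaptivity property described in the abstract, efficiency gains would be expected only when the working model is misspecified.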