Tuesday, March 28, 2017
3:30 pm - 4:30 pm
701 Blockley Hall
Postdoctoral Researcher
Department of Biostatistics, Epidemiology and Informatics
University of Pennsylvania Perelman School of Medicine
Department of Statistics
The Wharton School of the University of Pennsylvania
Abstract: We consider a variety of M/Z-estimation problems, based on estimating equations (EEs), under a semi-supervised (SS) setting, where the available data typically consist of: (i) a small or moderate-sized 'labeled' data set, and (ii) a much larger 'unlabeled' data set. Such data arise naturally when the outcome, unlike the covariates, is difficult or expensive to obtain, a frequent scenario in modern studies involving large databases like electronic medical records (EMR). It is often of interest in these settings to investigate whether and when the unlabeled data can be exploited to improve estimation of the parameter of interest, compared to supervised approaches based on only the labeled data.
We provide a unified framework for analyzing such problems from a semi-parametric perspective and propose a family of 'Efficient and Adaptive Semi-Supervised Estimators' (EASE). These are two-step estimators, constructed via a (non-parametric) imputation approach, that are always at least as efficient as the supervised estimator and more efficient (further, optimal in some cases) when the information from the unlabeled data actually relates to the parameter of interest. This adaptive property, often unaddressed, is crucial for advocating 'safe' use of unlabeled data. For a subclass of EEs, including those corresponding to standard generalized linear (working) models, we provide a more flexible 'semi-non-parametric' imputation strategy suited for covariates that are not low-dimensional. All imputations involve a 'refitting' or debiasing step along with possible use of cross-validation (CV), both of which have practical as well as theoretical benefits. We establish consistency, asymptotic normality, influence function expansions and the adaptive properties of EASE, as well as CV-based methods for inference. The results are validated through simulations, followed by illustrations on real datasets from EMR studies on autoimmune diseases.
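To make the two-step idea concrete, here is a minimal sketch for the simplest estimating equation, the mean of the outcome. It is not the EASE construction from the talk. It only illustrates the generic pattern of imputing outcomes from covariates and then correcting with labeled-data residuals. The function name `ss_mean`, the scalar covariate, and the polynomial least-squares fit (a stand-in for a non-parametric smoother) are all assumptions made for illustration.

```python
import numpy as np

def ss_mean(x_lab, y_lab, x_unlab, degree=3):
    """Imputation-based semi-supervised estimate of E[Y] (illustrative only)."""
    x_lab = np.asarray(x_lab, dtype=float)
    y_lab = np.asarray(y_lab, dtype=float)
    x_unlab = np.asarray(x_unlab, dtype=float)

    # Step 1: fit an imputation model m(x) on the labeled data.
    # Assumption: polynomial least squares stands in for a smoother.
    coefs = np.polyfit(x_lab, y_lab, degree)
    imputed_lab = np.polyval(coefs, x_lab)

    # Step 2: average the imputed outcomes over all covariates ...
    x_all = np.concatenate([x_lab, x_unlab])
    imputed_mean = np.polyval(coefs, x_all).mean()

    # ... and add the mean labeled residual as a debiasing correction.
    # For a least-squares fit with an intercept this term is numerically
    # zero; it matters for imputation models that lack that property.
    correction = (y_lab - imputed_lab).mean()
    return imputed_mean + correction
```

When the covariates carry no information about the outcome, the fitted imputation is close to constant and the estimate stays close to the labeled sample mean, which loosely mirrors the 'safe' behaviour described in the abstract.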