Biostatistics Seminar Series

Tuesday, November 27, 2018
3:30 pm - 4:30 pm
701, Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104
Title: A new discriminant analysis method for integrative analysis of multi-type data
Abstract: Linear classifiers such as linear discriminant analysis (LDA) and its variants, which maximize separation between classes while minimizing variation within classes through a few linear combinations of the available variables, are popular tools for classification. However, their use is limited to applications where only one data type is available. Owing to advances in data collection technologies, multiple data types (such as genomics and metabolomics) are now measured on the same subject, with each data type capturing a different set of characteristics but collectively helping to explain underlying complex relationships. When the goal is to use these data types simultaneously for classification, standard LDA and its current variants perform poorly. In addition, the discriminant vectors estimated by standard LDA usually involve all of the original variables, which makes them difficult to interpret and makes it hard to identify important variables. We present a novel LDA method for simultaneous integrative analysis of multiple data types and classification. The proposed method solves an optimization problem that considers both the overall association between data types and the maximum separation of classes within each data type when choosing discriminant vectors that optimally separate subjects into classes. To screen out irrelevant and redundant variables for easier interpretation, we use a coordinate-independent penalty function that is invariant to orthogonal transformation of the LDA basis vectors and targets the removal of entire row vectors from the basis matrix. Simulation studies and a real-data example, including an application to genomics and metabolomics data for classifying subjects into low- versus high-risk atherosclerotic cardiovascular disease groups, are used to demonstrate the effectiveness and efficiency of the proposed approach.
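
The abstract does not state the exact form of the penalty, so the following is only an illustrative sketch. It assumes one common choice with the described properties: the sum of the Euclidean norms of the rows of the basis matrix. The names `row_penalty`, `random_orthogonal`, and the matrix `B` are hypothetical and are not taken from the talk. The sketch checks numerically that such a penalty is unchanged when the basis is rotated by an orthogonal matrix, and that rows set to zero (variables screened out) stay zero after the rotation.

```python
import numpy as np

def row_penalty(B):
    """Sum of the Euclidean norms of the rows of B (assumed penalty form)."""
    return float(np.linalg.norm(B, axis=1).sum())

def random_orthogonal(k, rng):
    """Draw a k x k orthogonal matrix via the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

rng = np.random.default_rng(0)

# Hypothetical basis matrix: 6 variables (rows), 2 discriminant vectors (columns).
B = rng.standard_normal((6, 2))
B[[1, 4], :] = 0.0  # variables 1 and 4 are screened out (entire rows are zero)

Q = random_orthogonal(2, rng)

# Rotating the basis changes the individual coefficients but not the row norms.
penalty_before = row_penalty(B)
penalty_after = row_penalty(B @ Q)

# Rows that were zero before the rotation are still zero afterwards.
zero_rows_after = np.flatnonzero(np.linalg.norm(B @ Q, axis=1) == 0.0)

print(np.isclose(penalty_before, penalty_after))
print(zero_rows_after)
```

Because each row of `B @ Q` has the same length as the corresponding row of `B`, a penalty built from row norms does not depend on which orthogonal basis is used to represent the discriminant subspace, and shrinking a whole row to zero removes that variable from every discriminant vector at once.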